Målfrid 2023 – Freely Available Documents from Norwegian State Institutions
This corpus consists of documents from 525 domains of Norwegian state institutions and comprises approximately 3.5 billion tokens in total. In addition to Norwegian Bokmål and Nynorsk texts, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.
The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data to map the use of Norwegian Bokmål and Norwegian Nynorsk on the domains of Norwegian state institutions.
The corpus is the result of a focused crawl conducted between December 2022 and January 2023. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
For technical information, please consult the documentation files.
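The crawl strategy described above (recursive download within a fixed set of domains, a level limit of 12, and respect for robots.txt and politeness) can be illustrated with a minimal sketch. This is not the crawler used for the corpus. The function names, the `example-bot` user agent, the `get_links` callback, the `delay` parameter and the convention that the seed URL counts as level 0 are all assumptions made for illustration. Filtering by document type is left out.

```python
import time
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MAX_LEVEL = 12  # level limit stated in the corpus description


def make_robots(lines):
    """Build a robots.txt parser from the lines of an already fetched file."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def focused_crawl(seed, get_links, robots, allowed_domains,
                  max_level=MAX_LEVEL, user_agent="example-bot", delay=0.0):
    """Breadth-first crawl restricted to allowed_domains and max_level.

    get_links(url) is a hypothetical callback that fetches a page and
    returns the URLs it links to. The seed is treated as level 0, which
    is an assumption; the source does not define how levels are counted.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    collected = []
    while queue:
        url, level = queue.popleft()
        # Skip anything that robots.txt disallows for this user agent.
        if not robots.can_fetch(user_agent, url):
            continue
        time.sleep(delay)  # politeness: pause before each request
        collected.append(url)
        if level == max_level:
            continue  # do not follow links beyond the level limit
        for link in get_links(url):
            host = urlparse(link).hostname
            if host in allowed_domains and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return collected
```

In a real crawl, `get_links` would perform the HTTP request and link extraction, `delay` would be set to a value of a second or more, and one robots.txt parser would be kept per domain.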
Extended metadata
resource Common Info:
resource Type: corpus
identification Info:
resource Name: Målfrid 2023 – Fritt tilgjengelege tekster frå norske statlege nettsider
resource Name: Målfrid 2023 – Freely Available Documents from Norwegian State Institutions
description: This corpus contains documents from 525 internet domains belonging to Norwegian state institutions. In total, the material consists of approximately 3.5 billion tokens (words and punctuation). In addition to texts in Bokmål and Nynorsk, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.
The data were collected as part of the Målfrid project, in which the National Library of Norway (Nasjonalbiblioteket), on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway (Språkrådet), harvests and aggregates text data to document the use of Bokmål and Nynorsk in state institutions.
Språkbanken conducted a focused crawl of the websites of the relevant institutions between December 2022 and January 2023. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from the individual domains, 12 levels down on the websites. We observed general politeness considerations and respected robots.txt.
For technical information, see the documentation files.
description: This corpus consists of documents from 525 domains of Norwegian state institutions and comprises approximately 3.5 billion tokens in total. In addition to Norwegian Bokmål and Nynorsk texts, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.
The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data to map the use of Norwegian Bokmål and Norwegian Nynorsk on the domains of Norwegian state institutions.
The corpus is the result of a focused crawl conducted between December 2022 and January 2023. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
For technical information, please consult the documentation files.
Målfrid 2023 – Freely Available Documents from Norwegian State Institutions
dc:identifier
oai:nb.no:sbr-98
dc:description
This corpus consists of documents from 525 domains of Norwegian state institutions and comprises approximately 3.5 billion tokens in total. In addition to Norwegian Bokmål and Nynorsk texts, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.
The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data to map the use of Norwegian Bokmål and Norwegian Nynorsk on the domains of Norwegian state institutions.
The corpus is the result of a focused crawl conducted between December 2022 and January 2023. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
For technical information, please consult the documentation files.