Målfrid 2023 – Freely Available Documents from Norwegian State Institutions

This corpus consists of documents from 525 domains of Norwegian state institutions and comprises approximately 3.5 billion tokens in total. In addition to Norwegian Bokmål and Nynorsk texts, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data to map the use of Norwegian Bokmål and Norwegian Nynorsk on the domains of Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 2022 and January 2023, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
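
The crawl strategy described above can be illustrated with a minimal sketch. This is not the crawler used for the corpus. It assumes that "level" means link depth from the start page (with the start page at depth 0), and the names `crawl`, `fetch_links` and `example-bot` are hypothetical. Link extraction is passed in as a function, and politeness delays are omitted for brevity.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(start_url, fetch_links, robots_lines, max_depth=12, user_agent="example-bot"):
    """Breadth-first crawl that respects robots.txt and a maximum link depth.

    fetch_links is a caller-supplied (hypothetical) function returning the
    links found on a page; robots_lines are the lines of the site's robots.txt.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)

    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue:
        url, depth = queue.popleft()
        # Skip anything robots.txt disallows for this user agent.
        if not robots.can_fetch(user_agent, url):
            continue
        visited.append(url)
        # Pages at the maximum depth are kept, but their links are not followed.
        if depth == max_depth:
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return visited
```

A real crawler would also restrict links to the selected domains, filter by document type (HTML, DOC(X)/ODT, PDF) and wait between requests to the same host.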

For technical information, please consult the documentation files.
