Målfrid 2021 – Freely Available Documents from Norwegian State Institutions

This corpus consists of documents from 339 internet domains run by Norwegian state institutions, and comprises approximately 4.1 billion tokens (words and punctuation) in total, which makes it one of the largest freely available text resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.

The crawled documents were further processed according to their format: text was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from PDFs using Google Cloud Vision OCR.

The language of the extracted text was identified at document level using TextCat (cf. https://www.let.rug.nl/~vannoord/TextCat/); the detected language is provided as part of the metadata. The documents were deduplicated at domain level (exact duplicates only).
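The source does not describe how exact duplicates were detected. As an illustration only, the sketch below shows one common way to remove exact duplicates within a domain: hash the full text of each document and keep the first document seen for each (domain, hash) pair. The function name deduplicate_by_domain is hypothetical, and treating the host name of the url key as the "domain" is an assumption. The url and fulltext keys are those of the corpus format.

```python
import hashlib
from urllib.parse import urlsplit


def deduplicate_by_domain(documents):
    """Keep the first document for each (domain, text hash) pair.

    Assumption: the 'domain' is the host name of the document URL, and two
    documents are exact duplicates when their paragraphs are identical.
    """
    seen = set()
    unique = []
    for doc in documents:
        domain = urlsplit(doc["url"]).hostname
        text = "\n".join(doc["fulltext"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (domain, digest)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Because the key includes the domain, identical texts published on two different domains are both kept, which matches deduplication at domain level rather than across the whole corpus.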

The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences. Each document contains the following keys:

– lang: language of the document (detected using TextCat)
– url: the URL of the document at crawl time
– date: crawl date
– mimetype: media type of the document (simplified): HTML, DOC or PDF
– fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents
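The format described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the function names read_documents and split_pages are hypothetical, and dropping pages that contain no paragraphs is a choice made here, not something the corpus prescribes. Escape sequences such as \u00e5 are decoded by the JSON parser, so no extra handling is needed.

```python
import gzip
import json
from typing import Iterator


def read_documents(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_pages(fulltext: list) -> list:
    """Group paragraphs into pages, treating an empty string as a page break."""
    pages = [[]]
    for paragraph in fulltext:
        if paragraph == "":
            # An empty string marks the start of a new page (PDF documents).
            pages.append([])
        else:
            pages[-1].append(paragraph)
    # Drop pages without any paragraphs (a choice made for this sketch).
    return [page for page in pages if page]
```

A typical use would be to iterate over read_documents("some_file.jsonl.gz"), where the file name is a placeholder, and to pass the fulltext value of each document to split_pages when page boundaries matter. For HTML and DOC documents that contain no empty strings, split_pages returns a single page.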
