Legal Documents from Norwegian Nynorsk Municipalities

The texts in this corpus have been collected with the web crawler Veidemann in collaboration with the National Library’s Web Archive, based on a revised list of municipalities from the National Association of Nynorsk Municipalities (see lnk.no).

The web crawler was set to download documents in PDF format. The resulting collection of documents was then processed with Google’s OCR API. Although the OCR is generally of high quality, some errors remain in the material.

The resulting corpus consists of 50,000 documents (legal documents, minutes from meetings, etc.) and contains a total of about 127 million words. About 88.5 million of these are in Norwegian Nynorsk; the rest are mostly in Norwegian Bokmål. All the texts in the corpus are classified by language.

The corpus is currently published as a JSON object, where each key is an identifier (URN) for the Veidemann download, and each value is a list of lists of the pages in the document with associated page numbers and target form. A text file is also provided, containing a list of the URNs in the corpus. These URNs refer to the websites (URLs) from which the individual documents were downloaded.
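As a rough illustration of how such a JSON object might be traversed, the following Python sketch iterates over a small, made-up sample. The exact layout of the inner lists is not specified above, so the sketch assumes that each inner list starts with a page number followed by the page text. The sample URNs, the sample texts, and the helper names `iter_pages` and `word_count` are hypothetical, and the "target form" field is not modelled. The layout should be checked against the downloaded file before relying on this.

```python
import json

# Hypothetical miniature sample mimicking the ASSUMED layout:
# {URN: [[page_number, page_text], ...]}. The real file may differ.
sample = json.loads("""
{
  "URN:example:1": [[1, "Reglement for kommunestyret"], [2, "Sak 1/20"]],
  "URN:example:2": [[1, "Reglement"]]
}
""")


def iter_pages(corpus):
    """Yield (urn, page_number, text) for every page entry in the corpus."""
    for urn, pages in corpus.items():
        for entry in pages:
            # Assumption: first element is the page number, second the text.
            page_number, text = entry[0], entry[1]
            yield urn, page_number, text


def word_count(corpus):
    """Count whitespace-separated tokens across all pages."""
    return sum(len(text.split()) for _, _, text in iter_pages(corpus))


pages = list(iter_pages(sample))
total_words = word_count(sample)
print(len(pages), "pages,", total_words, "words")
```

For the real corpus, the sample would be replaced by `json.load` on the downloaded file, opened with `encoding="utf-8"`. Whitespace tokenisation is only a crude approximation of the word counts reported for the corpus.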

The original PDF files and the OCR output are available upon request from Språkbanken. Please contact us at our e-mail address, sprakbanken@nb.no.
