Norwegian Newspaper Corpus

The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.

This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk.

There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.

The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.

Extended metadata

resource Common Info:
resource Type: corpus
identification Info:
resource Name: Norsk aviskorpus
resource Name: Norwegian Newspaper Corpus
description: The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles. This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk. There is also a simplified version of the corpus available for the texts from 1998-2011, where duplicate sentences have been removed and the sentences are ordered alphabetically. The sentences are separated by s-tags. The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.
url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/
PID: hdl:21.11146/4
identifier: sbr-4
distribution Info:
licence Info:
user Category: Public
distribution Access Medium: downloadable
download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/
attribution Text: We hereby credit the individual publishers for making their texts available for language technology purposes. The copyright of the texts in this corpus remains with the individual publisher.
licence:
licence Family: Creative Commons (CC)
licence Name: Creative_Commons-BY-NC (CC-BY-NC)
licence Url: https://creativecommons.org/licenses/by-nc/4.0/
conditions Of Use: BY
conditions Of Use: NC
conditions Of Use: *
non Standard Conditions Of Use: * NORED * No redistribution. The licence is motivated by the need to block the possibility of third parties redistributing the original texts for commercial purposes. Note that machine learned models, extracted lexicons, embeddings, and similar resources that are created on the basis of The Norwegian Newspaper Corpus are not considered to contain the original data and so can be freely used also for commercial purposes despite the non-commercial condition.
licensor:
actor Info:
actor Type: organization
role: Licensor
organization Info:
organization Name: Nasjonalbiblioteket
organization Name: National Library of Norway
organization Short Name: NB
organization Short Name: NLN
department Name: Språkbanken
department Name: The Language Bank
communication Info:
email: sprakbanken@nb.no
url: https://www.nb.no/sprakbanken/
address: P.O. Box 2674 Solli
zip Code: 0203
city: Oslo
region: Oslo
country: Norway
distribution Rights Holder
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
contact
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
metadata Info:
metadata Creation Date: 04.02.2016
metadata Language Name: English
metadata Language Id: en
metadata Last Date Updated: 22.06.2023
metadata Creator
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
version Info:
version: 2020
revision: Texts from 2015-2019 added to the corpus
last Date Updated: 20.04.2020
validation Info:
validated: false
resource Documentation Info:
documentation Unstructured:
role: documentation
document Unstructured: Documentation files describing the content, structure and file formats of the resource.
resource Creation Info:
creation Start Date: 01.01.1998
creation End Date: 20.04.2020
resource Creator
- actor Info:
- actor Type: person
- role: Resource Creator
- person Info:
- surname: Hofland
- given Name: Knut
- affiliation:
- organization Info:
- organization Name: Universitetet i Bergen
- organization Name: University of Bergen
- organization Short Name: UiB
- organization Short Name: UiB

Download resources

Download metadata https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-4&metadataPrefix=cmdi
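The metadata link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from its parts with Python's standard library. The helper names build_record_url and fetch_record are illustrative only and are not part of any Språkbanken tooling; the layout of the returned XML is not described on this page, so the fetch helper simply returns the raw bytes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and identifier as they appear in the metadata link of this record.
OAI_ENDPOINT = "https://www.nb.no/sprakbanken/oai"
RECORD_ID = "oai:nb.no:sbr-4"


def build_record_url(identifier: str = RECORD_ID, metadata_prefix: str = "cmdi") -> str:
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        },
        safe=":",  # keep the colons in the identifier unescaped, as in the link above
    )
    return f"{OAI_ENDPOINT}?{query}"


def fetch_record(url: str, timeout: float = 30.0) -> bytes:
    """Download the record and return the raw XML bytes (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


print(build_record_url())
```

Calling fetch_record(build_record_url()) would download the record; it is left uncalled here so the sketch runs without network access.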

dc:type	corpus
dc:title	Norwegian Newspaper Corpus
dc:identifier	oai:nb.no:sbr-4
dc:description	The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles. This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk. There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically. The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.
dc:publisher
dc:format	downloadable
dc:date	1998-01-01
dc:date	2020-04-20
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY-NC (CC-BY-NC)
dc:rights	https://creativecommons.org/licenses/by-nc/4.0/
dc:creator	Knut Hofland
dc:lang	Norwegian Bokmål
dc:lang	Norwegian Nynorsk
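
Several Dublin Core fields in the listing above repeat (dc:date, dc:rights, dc:lang), and dc:publisher has no value. A rough sketch of reading such a tab-separated listing into a mapping from field name to a list of values follows; the function name parse_dc and the sample rows are illustrative, and it assumes one field per line with a tab between name and value.

```python
from collections import defaultdict

# A few rows copied from the Dublin Core listing above (tab-separated).
SAMPLE = "\n".join([
    "dc:type\tcorpus",
    "dc:title\tNorwegian Newspaper Corpus",
    "dc:publisher",
    "dc:date\t1998-01-01",
    "dc:date\t2020-04-20",
    "dc:lang\tNorwegian Bokmål",
    "dc:lang\tNorwegian Nynorsk",
])


def parse_dc(text: str) -> dict[str, list[str]]:
    """Group tab-separated 'name<TAB>value' lines by field name (hypothetical helper)."""
    fields: defaultdict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        key = key.strip()
        fields[key]  # register the field even when it has no value
        if value.strip():
            fields[key].append(value.strip())
    return dict(fields)


record = parse_dc(SAMPLE)
print(record["dc:date"])
print(record["dc:lang"])
```

Keeping every value in a list preserves repeated fields such as the two dc:date entries, which here mark the start and end of the collection period.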
