The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.
This version of The Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words for Norwegian Bokmål and about 68 million words for Norwegian Nynorsk.
There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.
The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.
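The simplification described above (removing duplicate sentences and ordering the rest alphabetically) can be illustrated with a minimal sketch. This is not the procedure used to build the corpus: the function name simplify_sentences is hypothetical, sentences are assumed to be compared as exact strings after trimming whitespace, and the ordering is plain Unicode code-point order rather than Norwegian collation.

```python
def simplify_sentences(sentences):
    """Return the unique sentences in alphabetical order.

    Assumptions (not taken from the corpus documentation):
    - two sentences are duplicates only if they are identical
      after stripping leading and trailing whitespace;
    - empty lines are dropped;
    - ordering is Python's default code-point ordering, which
      does not follow Norwegian collation for æ, ø and å.
    """
    unique = {s.strip() for s in sentences if s.strip()}
    return sorted(unique)


# Invented sample sentences, for illustration only.
sample = [
    "Det regner i Bergen.",
    "Avisen kom ut i dag.",
    "Det regner i Bergen.",
    "  Avisen kom ut i dag.  ",
    "",
]

for sentence in simplify_sentences(sample):
    print(sentence)
```

For real corpus files, the documentation files remain the authority on how sentences are delimited and ordered.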
Extended metadata
resource Common Info:
resource Type: corpus
identification Info:
resource Name: Norsk aviskorpus
resource Name: Norwegian Newspaper Corpus
description: The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.
This version of The Norwegian Newspaper Corpus consists of text from 1998 through 2019. The corpus contains approximately 1.68 billion words for Norwegian Bokmål and about 68 million words for Norwegian Nynorsk.
There is also a simplified version of the corpus for the texts from 1998-2011, in which all duplicate sentences have been removed and the sentences are sorted alphabetically. The sentences are separated by s tags.
The texts from 1998-2011 are collected in a single downloadable file; the remaining data are provided as one file per year. See the documentation files for a description of the content and file formats.
attribution Text: We hereby credit the individual publishers for making their texts available for language technology purposes. The copyright of the texts in this corpus remains with the individual publisher.
non Standard Conditions Of Use: * NORED * No redistribution. The licence is motivated by the need to block the possibility of third parties redistributing the original texts for commercial purposes. Note that machine-learned models, extracted lexicons, embeddings, and similar resources that are created on the basis of The Norwegian Newspaper Corpus are not considered to contain the original data and so can be freely used, also for commercial purposes, despite the non-commercial condition.
revision: Texts from 2015-2019 added to the corpus
last Date Updated: 20.04.2020
validation Info:
validated: false
resource Documentation Info:
documentation Unstructured:
role: documentation
document Unstructured: Documentation files describing the content, structure and file formats of the resource.
resource Creation Info:
creation Start Date: 01.01.1998
creation End Date: 20.04.2020
resource Creator
actor Info:
actor Type: person
role: Resource Creator
person Info:
surname: Hofland
given Name: Knut
affiliation:
organization Info:
organization Name: Universitetet i Bergen
organization Name: University of Bergen
organization Short Name: UiB
corpus Info:
corpus Type: Written Corpus
corpus Part Info:
media Type: text
corpus Part General Info:
linguality Info:
linguality Type: bilingual
multilinguality Type: multilingualSingleText
multilinguality Type Details: News text in Norwegian Bokmål and Norwegian Nynorsk
language Info:
language Id: nb
language Name: Norwegian Bokmål
size Per Language:
size Info:
size: 1680000000
size Unit: words
language Info:
language Id: nn
language Name: Norwegian Nynorsk
size Per Language:
size Info:
size: 68000000
size Unit: words
modality Info:
modality Type: writtenLanguage
time Coverage Info:
time Coverage: 1998-2019
creation Info:
creation Mode: mixed
creation Mode Details: Crawling news websites, with post-processing.
dc:type
corpus
dc:title
Norwegian Newspaper Corpus
dc:identifier
oai:nb.no:sbr-4
dc:description
The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.
This version of The Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words for Norwegian Bokmål and about 68 million words for Norwegian Nynorsk.
There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.
The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.