Norwegian Newspaper Corpus
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Norsk aviskorpus
- resource Name: Norwegian Newspaper Corpus
- description: Norsk aviskorpus var et prosjekt ved Universitetet i Bergen der man trålet nyhetsnettsteder etter nyhetsartikler. Denne versjonen av Norsk aviskorpus består av tekst fra perioden 1998 til og med 2019. Korpuset inneholder om lag 1,68 milliarder ord for bokmål og 68 millioner ord for nynorsk. Det finnes også en forenklet versjon av korpuset for tekstene fra perioden 1998-2011. Her er alle setningsdubletter fjernet, og setningene er sortert alfabetisk. Setningene er separert med s-tagger. Tekstene fra 1998-2011 er samlet i en felles nedlastbar fil, ellers foreligger dataene som en fil per år. Se dokumentasjonsfilene for en beskrivelse av innholdet og filformater.
- description: The Norwegian Newspaper Corpus was a project at the University of Bergen in which news websites were crawled for news articles. This version of the Norwegian Newspaper Corpus consists of text from 1998 through 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk. A simplified version of the corpus is also available for the texts from 1998-2011, in which duplicate sentences have been removed and the sentences are ordered alphabetically. The sentences are delimited by s-tags. The texts from 1998-2011 are collected in a single downloadable file; otherwise the data are structured as one file per year. See the documentation files for a description of the content and file formats.
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/
- PID: hdl:21.11146/4
- identifier: sbr-4
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/
- attribution Text: We hereby credit the individual publishers for making their texts available for language technology purposes. The copyright of the texts in this corpus remains with the individual publisher.
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-BY-NC (CC-BY-NC)
- licence Url: https://creativecommons.org/licenses/by-nc/4.0/
- conditions Of Use: BY
- conditions Of Use: NC
- conditions Of Use: *
- non Standard Conditions Of Use: * NORED * No redistribution. The licence is motivated by the need to block the possibility of third parties redistributing the original texts for commercial purposes. Note that machine learned models, extracted lexicons, embeddings, and similar resources that are created on the basis of The Norwegian Newspaper Corpus are not considered to contain the original data and so can be freely used also for commercial purposes despite the non-commercial condition.
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- distribution Rights Holder:
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: person
- role: Resource Creator
- person Info:
- surname: Hofland
- given Name: Knut
- affiliation:
- organization Info:
- organization Name: Universitetet i Bergen
- organization Name: University of Bergen
- organization Short Name: UiB
- organization Short Name: UiB
- corpus Info:
- corpus Type: Written Corpus
- corpus Part Info:
- media Type: text
- corpus Part General Info:
- linguality Info:
- linguality Type: bilingual
- multilinguality Type: multilingualSingleText
- multilinguality Type Details: News text in Norwegian Bokmål and Norwegian Nynorsk
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- size Per Language:
- size Info:
- size: 1680000000
- size Unit: words
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- size Per Language:
- size Info:
- size: 68000000
- size Unit: words
- modality Info:
- modality Type: writtenLanguage
- time Coverage Info:
- time Coverage: 1998-2019
- creation Info:
- creation Mode: mixed
- creation Mode Details: Crawling news web sites, with post processing.
dc:type | corpus |
dc:title | Norwegian Newspaper Corpus |
dc:identifier | oai:nb.no:sbr-4 |
dc:description | The Norwegian Newspaper Corpus was a project at the University of Bergen in which news websites were crawled for news articles. This version of the Norwegian Newspaper Corpus consists of text from 1998 through 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk. A simplified version of the corpus is also available for the texts from 1998-2011, in which duplicate sentences have been removed and the sentences are ordered alphabetically. The sentences are delimited by s-tags. The texts from 1998-2011 are collected in a single downloadable file; otherwise the data are structured as one file per year. See the documentation files for a description of the content and file formats. |
dc:publisher | |
dc:format | downloadable |
dc:date | 1998-01-01 |
dc:date | 2020-04-20 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-BY-NC (CC-BY-NC) |
dc:rights | https://creativecommons.org/licenses/by-nc/4.0/ |
dc:creator | Knut Hofland |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
Download resources
- norsk_aviskorpus.zip
- nak_2012.tar
- nak_2013.tar
- nak_2014.tar
- nak_2015.tar
- nak_2016.tar
- nak_2017.tar
- nak_2018.tar
- nak_2019.tar
- norsk_aviskorpus_nno_tok.zip
- norsk_aviskorpus_nob_tok.zip
- nak_1998_2011.pdf
- nak_2012_2019.pdf