N-grams from NBdigital 2022

This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to July 15, 2022. The n-grams were extracted from approximately 610,000 books and 4,000,000 newspapers, amounting to a total of 138.5 billion tokens (words and punctuation). The file format is UTF-8-encoded CSV.

Columns in the n-gram CSV files:
– first – the first word (in uni-, bi- and trigrams)
– second – the second word (in bi- and trigrams)
– third – the third word (in trigrams)
– lang – the language of the n-gram (books only; the newspapers have no language classification yet)
– freq – the total frequency of the n-gram in the collection of books and newspapers
– json – a dictionary with raw frequency for each year
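
As a minimal sketch of how one row with these columns could be read, the Python snippet below parses a small in-memory sample. Several things here are assumptions rather than facts from this description: that the files have a header row with the column names listed above, that the json column holds a JSON object mapping year strings to counts, and that the language code looks like "nob". The sample row is invented for illustration and is not real data.

```python
import csv
import io
import json

# Invented sample in the ASSUMED layout of a bigram file; not real data.
SAMPLE = (
    'first,second,lang,freq,json\n'
    'god,dag,nob,5,"{""1950"": 2, ""1951"": 3}"\n'
)


def read_ngrams(handle):
    """Yield (words, lang, freq, per_year) for each row of an n-gram CSV."""
    for row in csv.DictReader(handle):
        # Keep only the word columns that exist and are non-empty,
        # so the same function covers uni-, bi- and trigram files.
        words = tuple(row[c] for c in ("first", "second", "third") if row.get(c))
        # ASSUMPTION: the json column is a JSON object of year -> raw count.
        per_year = {int(y): int(n) for y, n in json.loads(row["json"]).items()}
        yield words, row.get("lang") or None, int(row["freq"]), per_year


rows = list(read_ngrams(io.StringIO(SAMPLE)))
```

For a real file, the `io.StringIO(SAMPLE)` handle would be replaced by `open(path, encoding="utf-8", newline="")`. The actual layout should be checked against the documentation files first.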

totals.json contains aggregated frequencies per year in the book and newspaper corpora. These totals can be used to calculate relative frequencies, so that frequencies can be compared over time, as in NB N-gram.
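
The relative frequency of an n-gram in a given year is its raw count for that year divided by the total token count for that year. The sketch below illustrates the calculation. It assumes that the totals can be reduced to a flat mapping from year to total count for one corpus; the actual structure of totals.json is not described here, so the function names and the input shape are hypothetical.

```python
import json


def totals_from_json(text):
    """Parse a JSON object of year -> total count (ASSUMED structure)."""
    return {int(year): int(total) for year, total in json.loads(text).items()}


def relative_frequencies(per_year, totals):
    """Return year -> raw count / yearly total, skipping years with no total."""
    return {
        year: count / totals[year]
        for year, count in sorted(per_year.items())
        if totals.get(year)  # ignore years that are missing or have a zero total
    }


# Invented numbers for illustration only.
totals = totals_from_json('{"1950": 1000, "1951": 2000}')
rel = relative_frequencies({1950: 2, 1951: 3, 1952: 7}, totals)
```

In this example the raw count rises from 2 to 3, but the relative frequency falls, because the yearly total doubles. This is why raw counts alone are not comparable over time.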

metadata-digibok.csv and metadata-digavis.csv contain simple metadata for the books and newspapers. More extensive metadata can be obtained through Oria or the APIs at https://api.nb.no/.

See the documentation files for further information.

Download resources

Extended metadata

Download metadata (CMDI XML): https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-76&metadataPrefix=cmdi
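
The link above is a GetRecord request with three query parameters: verb, identifier and metadataPrefix. As a small sketch, the same URL can be assembled in Python from its parts. The function name is hypothetical, and the code only builds the URL string; it does not fetch the record.

```python
from urllib.parse import urlencode

# Endpoint and parameter values are taken from the link above.
OAI_ENDPOINT = "https://www.nb.no/sprakbanken/oai"


def cmdi_record_url(identifier="oai:nb.no:sbr-76"):
    """Build the GetRecord URL that returns CMDI XML for one resource."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": "cmdi",
    }
    # safe=":" keeps the colons in the identifier unescaped, as in the link.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":")


url = cmdi_record_url()
```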

dc:type	corpus
dc:title	N-grams from NBdigital 2022
dc:identifier	oai:nb.no:sbr-76
dc:description	This resource contains n-grams – i.e. uni-, bi- and trigrams – from all books and newspapers that had been digitized at the National Library of Norway up to July 15 2022. The n-grams have been extracted from a material consisting of approximately 610,000 books and 4,000,000 newspapers, amounting to a total of 138.5 billion tokens (words and punctuation). The file format is UTF-8-encoded CSV. Columns in the n-gram CSV files: – first – the first word (in uni-, bi- and trigrams) – second – the second word (in bi- and trigrams) – third – the third word (in trigrams) – lang – the language of the n-gram (only for books, the newspapers have no language classification as yet) – freq – the total frequency of the n-gram in the collection of books and newspapers – json – a dictionary with raw frequency for each year totals.json contains aggregated frequencies per year in the book and newspaper corpora. Using them, relative frequencies can be calculated in order to compare frequencies over time as in NB N-gram. metadata-digibok.csv and metadata-digavis.csv contain simple metadata for the books and newspapers. More extensive metadata can be obtained through Oria or the APIs at https://api.nb.no/. See the documentation files for further information.
dc:publisher
dc:format	downloadable
dc:date	2022-07-15
dc:date	2022-12-21
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-ZERO (CC-ZERO)
dc:rights	https://creativecommons.org/publicdomain/zero/1.0/
dc:creator	Magnus Breder Birkenes
dc:creator	Lars Johnsen
dc:lang	Norwegian Bokmål
dc:lang	Norwegian Nynorsk
dc:lang	Northern Sami
dc:lang	Southern Sami
dc:lang	Lule Sami
dc:lang	Kven
