N-grams from NBdigital
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: N-grams from NBdigital
- resource Name: N-gram frå NBdigital
- description: This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to September 2013. The n-grams were extracted from approximately 220,000 books and 540,000 newspapers. They are available in two formats, CSV and SQLite. CSV is probably the more useful format for most developers, because the files are easy to import into standard applications. The SQLite files contain indexed databases, which are used in the NB N-gram service. Users who want to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page. A word count by source (books/newspapers) and language variety (Bokmål/Nynorsk) is given in the JSON file.
- description: This corpus contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at Nasjonalbiblioteket by September 2013. They were built from a material of about 220,000 books and 540,000 newspapers. The n-grams come in two formats, CSV and SQLite: CSV will be of most interest to most developers, since the files are easy to import into common software. The SQLite files contain ready-indexed SQL databases that are used in the NB N-gram service. Users who wish to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page. A word count broken down by source (newspaper/book) and language form (Bokmål/Nynorsk) is available in the JSON file.
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-35/
- PID: hdl:21.11146/35
- identifier: sbr-35
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-35/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- distribution Rights Holder:
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Ohren
- given Name: Oddrun Pauline
- affiliation:
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: Acquisition and Bibliographic Services
- department Name: Tilvekst og kunnskapsorganisering
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- corpus Info:
- corpus Type: Ngram Corpus
- corpus Part Info:
- media Type: textNgram
- corpus Text Ngram Info:
- ngram Info:
- base Item: word
- order: 3
- text Format Info:
- mime Type: text/csv
- size Per Text Format:
- size Info:
- size: 6
- size Unit: files
- size Info:
- size: 50.2
- size Unit: GB
- size Info:
- size: 35110402259
- size Unit: tokens
- text Format Info:
- mime Type: application/x-sqlite3
- size Per Text Format:
- size Info:
- size: 6
- size Unit: files
- size Info:
- size: 16.0
- size Unit: GB
- size Info:
- size: 35110402259
- size Unit: tokens
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: monolingual
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- modality Info:
- modality Type: writtenLanguage
- size Info:
- size: 12
- size Unit: files
- size Info:
- size: 66.2
- size Unit: GB
- size Info:
- size: 35110402259
- size Unit: tokens
- time Coverage Info:
- time Coverage: 1736-2013
dc:type | corpus |
dc:title | N-grams from NBdigital |
dc:identifier | oai:nb.no:sbr-35 |
dc:description | This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to September 2013. The n-grams were extracted from approximately 220,000 books and 540,000 newspapers. They are available in two formats, CSV and SQLite. CSV is probably the more useful format for most developers, because the files are easy to import into standard applications. The SQLite files contain indexed databases, which are used in the NB N-gram service. Users who want to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page. A word count by source (books/newspapers) and language variety (Bokmål/Nynorsk) is given in the JSON file. |
dc:format | downloadable |
dc:date | 2015-06-02 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | National Library of Norway |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
Download resources
- 20150604_uni-bok-csv.tar.gz
- 20150604_bi-bok-csv.tar.gz
- 20150604_tri-bok-csv.tar.gz
- 20150604_uni-avis-csv.tar.gz
- 20150604_bi-avis-csv.tar.gz
- 20150604_tri-avis-csv.tar.gz
- 20150604_uni-bok-sqlite.tar.gz
- 20150604_bi-bok-sqlite.tar.gz
- 20150604_tri-bok-sqlite.tar.gz
- 20150604_uni-avis-sqlite.tar.gz
- 20150604_bi-avis-sqlite.tar.gz
- 20150604_tri-avis-sqlite.tar.gz
- totals.json
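
The description says the CSV files are easy to import and that the SQLite files hold indexed databases, but this record does not document the column layout, the delimiter, or the table names. The following is a minimal Python sketch for inspecting a downloaded archive before relying on any particular schema. The function names `peek_csv_rows` and `list_sqlite_tables` are hypothetical helpers, not part of any NB tooling, and the comma delimiter is an assumption that should be checked against the actual files.

```python
import csv
import io
import sqlite3
import tarfile
from itertools import islice


def peek_csv_rows(archive_path, limit=5, delimiter=","):
    """Return up to `limit` rows from the first regular file in a .tar.gz archive.

    The member is streamed, so the archive is not unpacked to disk.
    The delimiter is an assumption; the record does not state it.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = archive.extractfile(member)
            # The record states UTF-8 as the character encoding.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter=delimiter)
            return list(islice(reader, limit))
    return []


def list_sqlite_tables(db_path):
    """Return {table name: [column names]} for an SQLite database file.

    Nothing about the schema is assumed; it is read from the database itself.
    Note that sqlite3.connect creates an empty database if the path is missing.
    """
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: [
                column[1]  # second field of each table_info row is the column name
                for column in connection.execute(f'PRAGMA table_info("{name}")')
            ]
            for name in names
        }
    finally:
        connection.close()
```

For example, `peek_csv_rows("20150604_uni-bok-csv.tar.gz")` would show the first few rows of the unigram book file, which is enough to confirm the delimiter and column order by eye. `list_sqlite_tables` expects the path to a database file extracted from one of the `-sqlite.tar.gz` archives; the file name inside the archive is not given here, so it has to be looked up after unpacking.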