Texts from Norwegian Wikipedia
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Texts from Norwegian Wikipedia
- resource Name: Tekster fra norsk Wikipedia
- description: This corpus is a dump, taken on approximately March 20, 2019, of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. It contains 492,864 articles in Norwegian Bokmål, 139,927 in Norwegian Nynorsk and 7,626 in Northern Sami. The files are structured as JSON arrays of the articles as they appear on the web. Each article is a structured element with one level of key:value pairs containing text and metadata. There are eight such pairs per article: bytelength (length of the text in bytes); pageid (text identifier); title (title as in Wikipedia); hiddencategories (metadata); text (text as in Wikipedia); revid (revision information); contentcategories (metadata); wikidata (other data). An example of the JSON format can be found in the documentation file.
- description: Dette korpuset inneholder en dump av samtlige Wikipediaartikler på bokmål, nynorsk og nordsamisk fra ca. 20. mars 2019. Korpuset inneholder 492.864 artikler for bokmål, 139.927 artikler for nynorsk og 7.626 artikler for nordsamisk. Korpuset er strukturert som et JSON-array over artiklene slik de foreligger på nettet. Hver artikkel er et strukturert element, med ett nivå av "nøkkel:verdi", som inneholder tekst og metadata. Det er åtte slike nøkkel:verdi-par i artiklene: – bytelength: lengde på teksten i bytes – pageid: identifikator for teksten – title: tittel som i Wikipedia – hiddencategories: metadata – text: teksten som i Wikipedia – revid: revisjonsinformasjon – contentcategories: metadata – wikidata: andre data Et eksempel på JSON-formatet finnes i dokumentasjonsfilen.
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-50/
- PID: hdl:21.11146/50
- identifier: sbr-50
- distribution Info:
- licence Info:
- user Category: Restricted
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-50/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-BY-SA (CC-BY-SA)
- licence Url: https://creativecommons.org/licenses/by-sa/3.0/
- conditions Of Use: BY
- conditions Of Use: SA
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Wikimedia Norge
- distribution Rights Holder:
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: IPR Holder
- organization Info:
- organization Name: Wikimedia Norge
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: Wikimedia Norge
- corpus Info:
- corpus Type: Written Corpus
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: application/json
- size Per Text Format:
- size Info:
- size: 640417
- size Unit: articles
- size Info:
- size: 3
- size Unit: files
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: monolingual
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- size Per Language:
- size Info:
- size: 492864
- size Unit: articles
- size Info:
- size: 1
- size Unit: files
- size Info:
- size: 1.3
- size Unit: GB
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- size Per Language:
- size Info:
- size: 139927
- size Unit: articles
- size Info:
- size: 1
- size Unit: files
- size Info:
- size: 300
- size Unit: MB
- language Info:
- language Id: se
- language Name: Northern Sami
- size Per Language:
- size Info:
- size: 7626
- size Unit: articles
- size Info:
- size: 1
- size Unit: files
- size Info:
- size: 10
- size Unit: MB
- modality Info:
- modality Type: writtenLanguage
- time Coverage Info:
- time Coverage: 2007-2019
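The description above says each file is a JSON array of article objects with eight keys. The following is a minimal Python sketch of how such a file might be read and checked, not the distributor's own tooling. The helper names `load_articles` and `missing_keys` are hypothetical, the sample record is invented, and the value types shown (integers, lists, an object) are assumptions; the documentation file holds the authoritative example. The key name `revid` follows the Norwegian description.

```python
import json

# The eight keys listed in the corpus description.
EXPECTED_KEYS = {
    "bytelength", "pageid", "title", "hiddencategories",
    "text", "revid", "contentcategories", "wikidata",
}


def load_articles(path):
    """Read one corpus file, assumed to hold a single JSON array of articles."""
    with open(path, encoding="utf-8") as handle:
        articles = json.load(handle)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON array at the top level")
    return articles


def missing_keys(article):
    """Return the expected keys that an article object lacks, sorted."""
    return sorted(EXPECTED_KEYS - set(article))


# Invented placeholder record; the value types are assumptions,
# not taken from the corpus itself.
sample = json.loads("""
[{"bytelength": 13, "pageid": 1, "title": "Eksempel",
  "hiddencategories": [], "text": "Eksempeltekst", "revid": 100,
  "contentcategories": [], "wikidata": {}}]
""")

for article in sample:
    # Print the title and any expected keys that are absent.
    print(article["title"], missing_keys(article))
```

Note that `json.load` reads the whole array into memory, which may be a concern for the Norwegian Bokmål file at the size listed above; a streaming JSON parser could be used instead.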
dc:type | corpus |
dc:title | Texts from Norwegian Wikipedia |
dc:identifier | oai:nb.no:sbr-50 |
dc:description | This corpus is a dump, taken on approximately March 20, 2019, of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. It contains 492,864 articles in Norwegian Bokmål, 139,927 in Norwegian Nynorsk and 7,626 in Northern Sami. The files are structured as JSON arrays of the articles as they appear on the web. Each article is a structured element with one level of key:value pairs containing text and metadata. There are eight such pairs per article: bytelength (length of the text in bytes); pageid (text identifier); title (title as in Wikipedia); hiddencategories (metadata); text (text as in Wikipedia); revid (revision information); contentcategories (metadata); wikidata (other data). An example of the JSON format can be found in the documentation file. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2007-06-23 |
dc:date | 2019-03-22 |
dc:rights | Restricted |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-BY-SA (CC-BY-SA) |
dc:rights | https://creativecommons.org/licenses/by-sa/3.0/ |
dc:creator | Wikimedia Norge |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
dc:lang | Northern Sami |