Målfrid 2024 – Freely Available Documents from Norwegian State Institutions
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Målfrid 2024 – Fritt tilgjengelege tekster frå norske statlege nettsider
- resource Name: Målfrid 2024 – Freely Available Documents from Norwegian State Institutions
- description: Dette korpuset inneheld dokument frå 497 internettdomene tilknytta norske statlege institusjonar. Totalt består materialet av omlag 2,6 milliardar "tokens" (ord og teiknsetting). I tillegg til tekster på bokmål og nynorsk inneheld korpuset tekster på nordsamisk, lulesamisk, sørsamisk og engelsk. Dataa vart samla inn som ein lekk i Målfrid-prosjektet, der Nasjonalbiblioteket på vegner av Kulturdepartementet og i samarbeid med Språkrådet haustar og aggregerer tekstdata for å dokumentere bruken av bokmål og nynorsk hjå statlege institusjonar. Språkbanken føretok ei fokusert hausting av nettsidene til dei aktuelle institusjonane mellom desember 2023 og januar 2024. Tekstdokument (HTML, DOC(X)/ODT og PDF) vart lasta ned rekursivt frå dei ulike domena, 12 nivå ned på nettsidene. Me tok ålmenne høflegheitsomsyn og respekterte robots.txt. For teknisk informasjon, sjå dokumentasjonsfilene.
- description: This corpus consists of documents from 497 internet domains belonging to Norwegian state institutions and comprises approximately 2.6 billion tokens (words and punctuation marks) in total. In addition to texts in Norwegian Bokmål and Norwegian Nynorsk, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English. The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates text data to document the use of Bokmål and Nynorsk by state institutions. The corpus is the result of a focused crawl of the institutions' websites, conducted by Språkbanken (the Language Bank) between December 2023 and January 2024. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from each domain, down to and including level 12, while obeying robots.txt and general politeness restrictions. For technical information, please consult the documentation files.
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-99/
- PID: hdl:21.11146/99
- identifier: sbr-99
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-99/
- licence:
- licence Family: DIFI
- licence Name: Norwegian Licence for Open Government Data (NLOD)
- licence Url: https://data.norge.no/nlod/en/2.0
- conditions Of Use: BY
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- contact:
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- actor Info:
- actor Type: organization
- role: Metadata Creator
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- corpus Info:
- corpus Type: Multilingual Corpus
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: application/jsonl
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: multilingual
- multilinguality Type: multilingualSingleText
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- size Per Language:
- size Info:
- size: 1749716066
- size Unit: tokens
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- size Per Language:
- size Info:
- size: 159909404
- size Unit: tokens
- language Info:
- language Id: en
- language Name: English
- size Per Language:
- size Info:
- size: 647802002
- size Unit: tokens
- language Info:
- language Id: sme
- language Name: Northern Sami
- size Per Language:
- size Info:
- size: 1764161
- size Unit: tokens
- language Info:
- language Id: sma
- language Name: Southern Sami
- size Per Language:
- size Info:
- size: 346893
- size Unit: tokens
- language Info:
- language Id: smj
- language Name: Lule Sami
- size Per Language:
- size Info:
- size: 252774
- size Unit: tokens
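
The per-language sizes listed above can be cross-checked against the "approximately 2.6 billion tokens" stated in the description. The following is a minimal sketch in Python, not part of the corpus distribution: the dictionary `TOKENS` and the helper `token_shares` are illustrative names, and the figures are copied from the size entries in this record. Under that assumption the six sizes sum to 2,559,791,300 tokens.

```python
# Token counts copied from the "size" entries of this record,
# keyed by the language Id given for each language.
TOKENS = {
    "nb": 1_749_716_066,   # Norwegian Bokmål
    "nn": 159_909_404,     # Norwegian Nynorsk
    "en": 647_802_002,     # English
    "sme": 1_764_161,      # Northern Sami
    "sma": 346_893,        # Southern Sami
    "smj": 252_774,        # Lule Sami
}


def token_shares(counts):
    """Return (total, {language: percentage of the total})."""
    total = sum(counts.values())
    shares = {lang: 100 * n / total for lang, n in counts.items()}
    return total, shares


total, shares = token_shares(TOKENS)
print(f"total: {total:,} tokens")

# List the languages from largest to smallest share.
for lang, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}: {pct:6.2f} %")
```

The shares are computed from the record's own figures only; they say nothing about how language identification was performed, which is covered by the documentation files.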
