Målfrid 2021 – Freely Available Documents from Norwegian State Institutions
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Målfrid 2021 – Fritt tilgjengelege tekster frå norske statlege nettsider
- resource Name: Målfrid 2021 – Freely Available Documents from Norwegian State Institutions
- description: This corpus contains documents from 339 internet domains associated with Norwegian state institutions. In total, the material consists of approximately 4.1 billion tokens (words and punctuation), which makes the corpus one of the largest freely available text corpora for Bokmål and Nynorsk. The corpus also contains texts in Northern Sami, Southern Sami, Lule Sami and English. The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, harvests and aggregates text data to document the use of Bokmål and Nynorsk in state institutions. Språkbanken carried out a focused crawl of the websites of the relevant institutions between 11 December 2020 and 18 January 2021. Text documents (HTML, DOC(X)/ODT and PDF) were downloaded recursively from the various domains, 12 levels down into the websites. We observed common politeness conventions and respected robots.txt. The downloaded documents were then processed further. Blocks of text were extracted from HTML with Justext, a boilerplate removal system (http://corpus.tools/wiki/Justext). Textract (https://textract.readthedocs.io/en/stable/) was used to extract text from Word/ODT documents, while Google Cloud Vision OCR was used to extract text from PDF files. The extracted texts were classified at document level using TextCat language identification (https://www.let.rug.nl/~vannoord/TextCat/). Exact duplicates of the same document (within the same domain) were removed. The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file for each combination of domain, language and content type. The files are in UTF-8 text format, with ASCII line breaks. 
Each document contains the following keys: – lang: language of the document (identified with TextCat) – url: the url of the document at the time it was crawled – date: date on which the document was crawled – mimetype: (simplified) media type of the document: HTML, DOC or PDF – fulltext: an array of strings, where each string represents one paragraph – an empty string denotes a new page in the PDF documents
- description: This corpus consists of documents from 339 internet domains run by Norwegian state institutions, and comprises approximately 4.1 billion tokens (words and punctuation) in total, which makes it one of the largest freely available text resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English. The data were collected as part of the so-called Målfrid project, where the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions. The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions. The crawled documents were further processed according to their format: text was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from PDFs using Google Cloud Vision OCR. The extracted text was classified using TextCat language identification (cf. https://www.let.rug.nl/~vannoord/TextCat/) at document level, and the detected language is provided as part of the metadata. Exact duplicates were removed at domain level. The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences. 
Each document contains the following keys: – lang: language of the document (detected using TextCat) – url: the url of the document at crawl time – date: crawl date – mimetype: media type of the document (simplified): HTML, DOC or PDF – fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents
- resource Short Name: Målfrid 2021
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-69/
- PID: hdl:21.11146/69
- identifier: sbr-69
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-69/
- licence:
- licence Family: DIFI
- licence Name: Norwegian Licence for Open Government Data (NLOD)
- licence Url: https://data.norge.no/nlod/en/2.0/
- conditions Of Use: BY
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- contact:
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: person
- role: Resource Creator
- person Info:
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- corpus Info:
- corpus Type: Written Corpus
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: application/jsonl
- size Per Text Format:
- size Info:
- size: 4140863529
- size Unit: tokens
- size Info:
- size: 1609959
- size Unit: entries
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: multilingual
- multilinguality Type: other
- language Info:
- language Id: nb
- language Name: Norwegian Bokmål
- size Per Language:
- size Info:
- size: 3109152950
- size Unit: tokens
- size Info:
- size: 1109335
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- language Info:
- language Id: nn
- language Name: Norwegian Nynorsk
- size Per Language:
- size Info:
- size: 269542462
- size Unit: tokens
- size Info:
- size: 153212
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- language Info:
- language Id: sme
- language Name: Northern Sami
- size Per Language:
- size Info:
- size: 5653533
- size Unit: tokens
- size Info:
- size: 5128
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- language Info:
- language Id: sma
- language Name: Southern Sami
- size Per Language:
- size Info:
- size: 390686
- size Unit: tokens
- size Info:
- size: 579
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- language Info:
- language Id: smj
- language Name: Lule Sami
- size Per Language:
- size Info:
- size: 207170
- size Unit: tokens
- size Info:
- size: 204
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- language Info:
- language Id: en
- language Name: English
- size Per Language:
- size Info:
- size: 755916728
- size Unit: tokens
- size Info:
- size: 341501
- size Unit: entries
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal written language
- modality Info:
- modality Type: writtenLanguage
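The per-language sizes listed above can be cross-checked against the corpus totals. The following Python sketch copies the figures from the size fields of this record; the names `SIZES`, `TOTAL_TOKENS`, `TOTAL_ENTRIES` and `token_shares` are illustrative and not part of the resource.

```python
from typing import Dict, Tuple

# (tokens, entries) per language id, copied from the "size Per Language" fields.
SIZES: Dict[str, Tuple[int, int]] = {
    "nb": (3109152950, 1109335),
    "nn": (269542462, 153212),
    "sme": (5653533, 5128),
    "sma": (390686, 579),
    "smj": (207170, 204),
    "en": (755916728, 341501),
}

# Corpus-wide totals, copied from the "size Per Text Format" fields.
TOTAL_TOKENS = 4140863529
TOTAL_ENTRIES = 1609959


def token_shares(sizes: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return each language's fraction of the summed token count."""
    total = sum(tokens for tokens, _ in sizes.values())
    return {lang: tokens / total for lang, (tokens, _) in sizes.items()}


if __name__ == "__main__":
    # The per-language figures should add up to the stated totals.
    print(sum(t for t, _ in SIZES.values()) == TOTAL_TOKENS)
    print(sum(e for _, e in SIZES.values()) == TOTAL_ENTRIES)
    for lang, share in token_shares(SIZES).items():
        print(f"{lang}: {share:.2%}")
```

With these figures, the per-language token and entry counts add up exactly to the stated totals, and Bokmål accounts for roughly three quarters of all tokens.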
dc:type | corpus |
dc:title | Målfrid 2021 – Freely Available Documents from Norwegian State Institutions |
dc:identifier | oai:nb.no:sbr-69 |
dc:description | This corpus consists of documents from 339 internet domains run by Norwegian state institutions, and comprises approximately 4.1 billion tokens (words and punctuation) in total, which makes it one of the largest freely available text resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English. The data were collected as part of the so-called Målfrid project, where the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions. The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions. The crawled documents were further processed according to their format: text was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from PDFs using Google Cloud Vision OCR. The extracted text was classified using TextCat language identification (cf. https://www.let.rug.nl/~vannoord/TextCat/) at document level, and the detected language is provided as part of the metadata. Exact duplicates were removed at domain level. The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences. 
Each document contains the following keys: – lang: language of the document (detected using TextCat) – url: the url of the document at crawl time – date: crawl date – mimetype: media type of the document (simplified): HTML, DOC or PDF – fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents |
dc:publisher | |
dc:format | downloadable |
dc:date | 2020-12-01 |
dc:date | 2021-04-30 |
dc:rights | Public |
dc:rights | DIFI |
dc:rights | Norwegian Licence for Open Government Data (NLOD) |
dc:rights | https://data.norge.no/nlod/en/2.0/ |
dc:creator | Magnus Breder Birkenes |
dc:creator | Andre Kåsen |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
dc:lang | Northern Sami |
dc:lang | Southern Sami |
dc:lang | Lule Sami |
dc:lang | English |
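
The file layout given in the description (gzipped JSON lines, one document per line, with the keys lang, url, date, mimetype and fulltext) can be read with the Python standard library alone. The sketch below is a minimal illustration, not tooling shipped with the corpus: the function names `iter_documents` and `split_pages` are hypothetical, and it assumes an empty string in fulltext may appear either before or between pages, so empty pages are dropped.

```python
import gzip
import json
from typing import Dict, Iterator, List


def iter_documents(path: str) -> Iterator[Dict]:
    """Yield one document (a dict) per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # json.loads decodes ASCII escape sequences such as \u00e5.
                yield json.loads(line)


def split_pages(fulltext: List[str]) -> List[List[str]]:
    """Group paragraphs into pages, treating an empty string as a page break."""
    pages: List[List[str]] = [[]]
    for paragraph in fulltext:
        if paragraph == "":
            pages.append([])  # start a new page
        else:
            pages[-1].append(paragraph)
    # Assumption: drop empty pages caused by leading or repeated breaks.
    return [page for page in pages if page]
```

A caller would loop over `iter_documents(path)` for one downloaded file and, for documents whose mimetype is PDF, pass `doc["fulltext"]` to `split_pages` to recover the page structure; for HTML and DOC documents the fulltext array can be used as a plain list of paragraphs.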