Web News Collection
In collaboration with DH-lab, the Norwegian Web Archive has created a collection of texts from web news publications from 2019-22. These texts are available for computational analysis through DH-lab’s API.
The objective is to allow scholars, students and others to make their own corpora of web news texts, facilitating digital text analysis of web news.
We are working to develop notebooks and user-friendly web apps to interact with the data. For now, you can find examples of use in nettavis-tekstanalyse.ipynb.
Below, you will find some basic information and metadata about the Web News Collection. Please contact us at nettarkivet@nb.no if you have any questions!
A corpus is, simply put, a collection of texts. In this case, it consists of texts from web news sources.
The first version of the web news corpus contains:
- 1,572,655 texts
- 784,171,966 words
- 268 publication titles
The corpus includes texts in various languages. The most frequent ones are:
- Norwegian Bokmål: 1,437,768 texts
- Norwegian Nynorsk: 111,892 texts
- Northern Sami: 11,416 texts
- Kven: 302 texts
- Southern Sami: 101 texts
- Lule Sami: 78 texts
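The figures above can be put in proportion with a little arithmetic. The following is a minimal sketch that only uses the counts listed on this page; the variable names are illustrative and not part of any API.

```python
# Counts copied from the corpus description above.
total_texts = 1_572_655
total_words = 784_171_966

texts_by_language = {
    "Norwegian Bokmål": 1_437_768,
    "Norwegian Nynorsk": 111_892,
    "Northern Sami": 11_416,
    "Kven": 302,
    "Southern Sami": 101,
    "Lule Sami": 78,
}

# Share of the whole corpus held by each listed language.
language_shares = {
    name: count / total_texts for name, count in texts_by_language.items()
}

# Texts in languages that are not among the six listed above.
other_texts = total_texts - sum(texts_by_language.values())

# Average length of a text, in words.
words_per_text = total_words / total_texts

for name, share in language_shares.items():
    print(f"{name}: {share:.2%}")
print(f"Other languages: {other_texts} texts")
print(f"Average words per text: {words_per_text:.1f}")
```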
In total, the corpus includes texts from 268 publications with a responsible editor. The most frequent titles are:
- NRK: 130,162 texts
- VG: 66,800 texts
- Forskning.no: 65,469 texts
- TV2: 55,367 texts
- Dagens næringsliv: 50,005 texts
- Dagbladet: 46,333 texts
- Finansavisen: 38,514 texts
- Adresseavisen: 33,640 texts
- Aftenposten: 31,075 texts
- Khrono: 29,794 texts
- Hamar Arbeiderblad: 29,775 texts
- Dagsavisen: 27,009 texts
- ABC Nyheter: 25,690 texts
- E24: 24,930 texts
- Nettavisen: 23,670 texts
To work with the corpus, you can use dhlab for Python.
Currently, the setup allows for building corpora, getting concordances, getting collocations and calculating the relative frequency of collocated words.
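To make these operations concrete, here is a minimal, self-contained sketch in plain Python. It does not call dhlab and does not reproduce its API: the toy texts and the function names `tokenize`, `concordance` and `collocations` are invented for illustration. Relative frequency is taken here to mean a collocate's share of all words found inside the window, which is an assumption; dhlab may define it differently.

```python
import re
from collections import Counter

# Toy texts standing in for corpus documents (invented for illustration).
texts = [
    "Regjeringen la fram statsbudsjettet i dag.",
    "Statsbudsjettet ble lagt fram av regjeringen.",
    "Kommunen venter på statsbudsjettet.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def concordance(texts, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits


def collocations(texts, keyword, window=3):
    """Map each collocate to (count, relative frequency within the window)."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                counts.update(tokens[max(0, i - window):i])
                counts.update(tokens[i + 1:i + 1 + window])
    total = sum(counts.values())
    return {word: (n, n / total) for word, n in counts.items()} if total else {}


conc = concordance(texts, "statsbudsjettet")
coll = collocations(texts, "statsbudsjettet")

for left, hit, right in conc:
    print(f"{left:>25} | {hit} | {right}")
for word, (n, rel) in sorted(coll.items(), key=lambda item: -item[1][0]):
    print(f"{word}: count={n}, relative frequency={rel:.3f}")
```

With the real corpus, the same steps would run against the texts selected through the API rather than against an in-memory list.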
Here is an overview of the schema attributes that can be used with the API, using a text from Aftenposten as an example:
schema:properties | dtype | description | example |
doctype | str | document type ("nettavis" means web newspaper) | nettavis |
dhlabid | int | unique id for text object | 600274473 |
title | str | publication title | Aftenposten |
publisher | str | domain name | aftenposten.no |
city | str | place of editor | Oslo |
lang | str | ISO 639-2 | nob |
oaiid | str | target-uri | https://www.aftenposten.no:443/norge/politikk/i/… |
timestamp | int | YYYYMMDD (date for crawling) | 20200526 |
ocr_timestamp | int | YYYYMMDD (date for text extraction) | 20220820 |
urn | str | WARC-Record-ID | <urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312> |
year | int | YYYY (year of crawl) | 2020 |
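As an illustration of how these attributes fit together, the sketch below models one metadata row as a Python dataclass and converts the two YYYYMMDD integers into dates. The class `WebNewsRecord` and its properties are hypothetical helpers, not part of dhlab. The values are the Aftenposten example from the table, `publisher` is typed as a string because the example value is a domain name, and the `oaiid` is left truncated exactly as it appears above.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class WebNewsRecord:
    """One metadata row; field names follow the schema table."""

    doctype: str
    dhlabid: int
    title: str
    publisher: str
    city: str
    lang: str
    oaiid: str
    timestamp: int
    ocr_timestamp: int
    urn: str
    year: int

    @property
    def crawl_date(self) -> date:
        """Date of crawling, parsed from the YYYYMMDD integer."""
        return datetime.strptime(str(self.timestamp), "%Y%m%d").date()

    @property
    def extraction_date(self) -> date:
        """Date of text extraction, parsed from the YYYYMMDD integer."""
        return datetime.strptime(str(self.ocr_timestamp), "%Y%m%d").date()


example = WebNewsRecord(
    doctype="nettavis",
    dhlabid=600274473,
    title="Aftenposten",
    publisher="aftenposten.no",
    city="Oslo",
    lang="nob",
    oaiid="https://www.aftenposten.no:443/norge/politikk/i/…",  # truncated in the table
    timestamp=20200526,
    ocr_timestamp=20220820,
    urn="<urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312>",
    year=2020,
)

print(example.crawl_date.isoformat())
print(example.extraction_date.isoformat())
# The year attribute should agree with the crawl timestamp.
print(example.crawl_date.year == example.year)
```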