Web News Collection
In collaboration with DH-lab, the Norwegian Web Archive has created a collection of texts from web news publications from 2019-22. These texts are available for computational analysis through DH-lab’s API.
The objective is to allow scholars, students and others to make their own corpora of web news texts, facilitating digital text analysis of web news.
We are working to develop notebooks and user-friendly web apps to interact with the data. For now, you can find examples of use in nettavis-tekstanalyse.ipynb.
Below, you will find some basic information and metadata about the Web News Collection. Please contact us at nettarkivet@nb.no if you have any questions!
“Collections as Data” means that we provide content from the web archive in a format that supports computational analysis. This allows researchers to explore and analyse trends and shifts in archived web data.
You can learn more from the initiative Always Already Computational: Collections as Data.
The first version of the collection contains texts from 2019-22:
- 1,572,655 texts
- 784,171,966 words
- 268 publication titles
The collection includes texts in various languages. The most frequent ones are:
- Norwegian Bokmål: 1,437,768 texts
- Norwegian Nynorsk: 111,892 texts
- Northern Sami: 11,416 texts
- Kven: 302 texts
- Southern Sami: 101 texts
- Lule Sami: 78 texts
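The figures above allow a couple of quick derived numbers, such as the average text length and each language's share of the collection. The following is a minimal sketch using only the counts quoted above; the function and variable names are illustrative and not part of any DH-lab API.

```python
# Figures copied from the collection summary above.
TOTAL_TEXTS = 1_572_655
TOTAL_WORDS = 784_171_966

TEXTS_PER_LANGUAGE = {
    "Norwegian Bokmål": 1_437_768,
    "Norwegian Nynorsk": 111_892,
    "Northern Sami": 11_416,
    "Kven": 302,
    "Southern Sami": 101,
    "Lule Sami": 78,
}


def average_words_per_text(words: int, texts: int) -> float:
    """Mean number of words per text."""
    return words / texts


def language_shares(counts: dict, total: int) -> dict:
    """Percentage of all texts in the collection, per language."""
    return {lang: 100 * n / total for lang, n in counts.items()}


# Texts in languages that are not among the six listed above.
other_languages = TOTAL_TEXTS - sum(TEXTS_PER_LANGUAGE.values())

print(f"Average words per text: {average_words_per_text(TOTAL_WORDS, TOTAL_TEXTS):.1f}")
for lang, share in language_shares(TEXTS_PER_LANGUAGE, TOTAL_TEXTS).items():
    print(f"{lang}: {share:.3f} %")
print(f"Texts in other languages: {other_languages}")
```

Note that the six listed languages do not add up to the full collection; the remainder consists of texts in less frequent languages.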
In total, the collection includes texts from 268 publications with a responsible editor. The most frequent titles are:
- NRK: 130,162
- VG: 66,800
- Forskning.no: 65,469
- TV2: 55,367
- Dagens næringsliv: 50,005
- Dagbladet: 46,333
- Finansavisen: 38,514
- Adresseavisen: 33,640
- Aftenposten: 31,075
- Khrono: 29,794
- Hamar Arbeiderblad: 29,775
- Dagsavisen: 27,009
- ABC Nyheter: 25,690
- E24: 24,930
- Nettavisen: 23,670
To work with the collection, you can choose between the dhlab package for Python and easy-to-use web apps from DH-lab.
The apps currently offer limited support for the Web News Corpus: building corpora, retrieving concordances, retrieving collocations, and calculating the relative frequency of collocated words.
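To make these operations concrete, here is a minimal pure-Python sketch of concordances, collocations, and relative frequency on a toy text. It does not use the dhlab package or the DH-lab apps, and it is not their implementation. The sample sentence, the function names, and the definition of relative frequency used here (a word's share among the collocates divided by its share in the whole text) are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens: list, keyword: str, window: int = 2) -> list:
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


def collocations(tokens: list, keyword: str, window: int = 2) -> Counter:
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for left, _, right in concordance(tokens, keyword, window):
        counts.update(left)
        counts.update(right)
    return counts


def relative_frequency(collocates: Counter, tokens: list) -> dict:
    """Share among collocates divided by share in the whole text.

    Values above 1 mean the word is over-represented near the keyword.
    """
    text_counts = Counter(tokens)
    collocate_total = sum(collocates.values())
    text_total = len(tokens)
    return {
        word: (count / collocate_total) / (text_counts[word] / text_total)
        for word, count in collocates.items()
    }


# Toy text, invented for this example.
sample = (
    "The government raised the budget. The government cut the budget again. "
    "Critics said the budget was too small."
)
tokens = tokenize(sample)
hits = concordance(tokens, "budget")
colls = collocations(tokens, "budget")
relfreq = relative_frequency(colls, tokens)

for left, keyword, right in hits:
    print(" ".join(left), f"[{keyword}]", " ".join(right))
for word, score in sorted(relfreq.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.2f}")
```

On a real corpus, the reference frequencies would normally come from a larger reference corpus rather than from the same text.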
Here is an overview of the schema attributes that can be used with the API, using a text from Aftenposten as an example:
| schema:properties | dtype | description | example |
| --- | --- | --- | --- |
| doctype | str | document type (nettavis = web newspaper) | nettavis |
| dhlabid | int | unique id for text object | 600274473 |
| title | str | publication title | Aftenposten |
| publisher | str | domain name | aftenposten.no |
| city | str | place of editor | Oslo |
| lang | str | language code (ISO 639-2) | nob |
| oaiid | str | target-uri | https://www.aftenposten.no:443/norge/politikk/i/… |
| timestamp | int | YYYYMMDD (date of crawl) | 20200526 |
| ocr_timestamp | int | YYYYMMDD (date of text extraction) | 20220820 |
| urn | str | WARC-Record-ID | <urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312> |
| year | int | YYYY (year of crawl) | 2020 |
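The schema can be mirrored in a small typed record, which makes it easier to convert the integer timestamps into proper dates. The sketch below is an illustration only: the `WebNewsRecord` class and its helper properties are hypothetical and not part of the dhlab package. The values are copied from the Aftenposten example, with the truncated URI left as it appears in the table, and `publisher` is treated as a string because the example value is a domain name.

```python
import uuid
from dataclasses import dataclass
from datetime import date, datetime


def parse_yyyymmdd(value: int) -> date:
    """Convert an integer such as 20200526 into a date object."""
    return datetime.strptime(str(value), "%Y%m%d").date()


@dataclass(frozen=True)
class WebNewsRecord:
    """Hypothetical container with one field per schema attribute."""

    doctype: str
    dhlabid: int
    title: str
    publisher: str
    city: str
    lang: str
    oaiid: str
    timestamp: int
    ocr_timestamp: int
    urn: str
    year: int

    @property
    def crawl_date(self) -> date:
        """Date the page was crawled."""
        return parse_yyyymmdd(self.timestamp)

    @property
    def extraction_date(self) -> date:
        """Date the text was extracted from the archived page."""
        return parse_yyyymmdd(self.ocr_timestamp)

    @property
    def record_uuid(self) -> uuid.UUID:
        """UUID part of the WARC-Record-ID, without the angle brackets."""
        return uuid.UUID(self.urn.strip("<>"))


example = WebNewsRecord(
    doctype="nettavis",
    dhlabid=600274473,
    title="Aftenposten",
    publisher="aftenposten.no",
    city="Oslo",
    lang="nob",
    oaiid="https://www.aftenposten.no:443/norge/politikk/i/…",  # truncated in the source
    timestamp=20200526,
    ocr_timestamp=20220820,
    urn="<urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312>",
    year=2020,
)

print(example.crawl_date.isoformat())
print(example.extraction_date.isoformat())
print((example.extraction_date - example.crawl_date).days, "days between crawl and extraction")
```

Keeping the raw integers on the record and deriving dates on demand means the stored values stay identical to what the schema describes.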