Web News Collection

Word galaxy from dhlab, illustrating corpus of text

In collaboration with DH-lab, the Norwegian Web Archive has created a collection of texts from web news publications from 2019 to 2022. These texts are available for computational analysis through DH-lab’s API.

The aim is to let scholars, students and others build their own corpora of web news texts for digital text analysis.

We are working to develop notebooks and user-friendly web apps to interact with the data. For now, you can find examples of use in nettavis-tekstanalyse.ipynb.

Below, you will find some basic information and metadata about the Web News Collection. Please contact us at nettarkivet@nb.no if you have any questions!

schema:properties	dtype	description	example
doctype	str	online newspaper (nettavis)	nettavis
dhlabid	int	unique id for text object	600274473
title	str	publication title	Aftenposten
publisher	str	domain name	aftenposten.no
city	str	place of editor	Oslo
lang	str	ISO 639-2	nob
oaiid	str	target-uri	https://www.aftenposten.no:443/norge/politikk/i/…
timestamp	int	YYYYMMDD (date for crawling)	20200526
ocr_timestamp	int	YYYYMMDD (date for text extraction)	20220820
urn	str	WARC-Record-ID	<urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312>
year	int	YYYY (year of crawl)	2020
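
As a minimal sketch of how one row of this table could be handled in Python, the block below models the schema as a dataclass and converts the two YYYYMMDD integers to dates. The class name `WebNewsRecord` and the helper properties are hypothetical and are not part of the dhlab package. The field names and example values come from the table, and `publisher` is typed as a string because its example value is a domain name.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class WebNewsRecord:
    """One metadata row; field names follow the schema table (hypothetical class)."""
    doctype: str
    dhlabid: int
    title: str
    publisher: str
    city: str
    lang: str
    oaiid: str
    timestamp: int
    ocr_timestamp: int
    urn: str
    year: int

    @staticmethod
    def _to_date(yyyymmdd: int) -> date:
        # Parse an integer such as 20200526 into a date object.
        return datetime.strptime(str(yyyymmdd), "%Y%m%d").date()

    @property
    def crawl_date(self) -> date:
        return self._to_date(self.timestamp)

    @property
    def extraction_date(self) -> date:
        return self._to_date(self.ocr_timestamp)


record = WebNewsRecord(
    doctype="nettavis",
    dhlabid=600274473,
    title="Aftenposten",
    publisher="aftenposten.no",
    city="Oslo",
    lang="nob",
    # The URI is truncated in the table and is kept truncated here.
    oaiid="https://www.aftenposten.no:443/norge/politikk/i/…",
    timestamp=20200526,
    ocr_timestamp=20220820,
    urn="<urn:uuid:b01b7ad0-c5c3-4b2e-ab30-8d9bddf8c312>",
    year=2020,
)

print(record.crawl_date)       # 2020-05-26
print(record.extraction_date)  # 2022-08-20
```

Note that `timestamp` is the crawl date and `ocr_timestamp` the text extraction date, so the two can lie years apart, as in this example.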

What is Collections as Data?

How big is the Web News Collection?

Which languages are in the collection?

Which publication titles are in the collection?

In total, the collection includes texts from 268 publications with a responsible editor. The most frequent titles are:

How can I work with the collection?

To work with the collection, you can choose between the dhlab package for Python and easy-to-use web apps from DH-lab.

The apps currently offer limited support for the Web News Corpus: building corpora, getting concordances, getting collocations, and calculating the relative frequency of collocated words.
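
To make these operations concrete, here is a toy sketch in plain Python. It does not use or reproduce DH-lab's apps or API. The function names, the made-up sample text, and the definition of relative frequency used here (a collocate's share of all context tokens around the keyword) are assumptions for illustration only.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens: list[str], keyword: str, window: int = 2) -> list[str]:
    """Return each hit of `keyword` with up to `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocations(tokens: list[str], keyword: str, window: int = 2) -> Counter:
    """Count the tokens that occur within `window` positions of `keyword`."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def relative_frequency(colloc: Counter) -> dict[str, float]:
    """Each collocate's share of all context tokens (assumed definition)."""
    total = sum(colloc.values())
    return {word: n / total for word, n in colloc.items()}


# Made-up sample text standing in for a small corpus.
text = (
    "The government presented the budget. "
    "The opposition criticised the budget. The budget passed."
)
tokens = tokenize(text)

kwic = concordance(tokens, "budget")
colloc = collocations(tokens, "budget")
rel = relative_frequency(colloc)

for line in kwic:
    print(line)
print(colloc.most_common(3))
```

A real analysis would typically compare collocate frequencies against a reference corpus rather than only within the keyword's context; the share used here is simply the easiest measure to verify by hand.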

Which schema attributes can be used with the API?

The metadata table above gives an overview of the schema attributes that can be used with the API, using a text from Aftenposten as an example.

How do I cite the Web News Collection?
