Discussions from Wikipedia

This corpus is a dump of discussion threads from the Norwegian Wikipedia, where authors discuss various issues regarding the publication of specific Wikipedia articles.

The material is split into two files, one for Norwegian Bokmål (nb.wikipedia.json) and one for Nynorsk (nn.wikipedia.json). Each file is a structured JSON array in which each element is one discussion. A discussion is a single-level (flat) object holding the text and its metadata as eight key/value pairs:

– title: title of article under discussion
– pageid: text identifier
– revid: audit information
– wikidata: other data
– contentcategories: metadata
– hiddencategories: metadata
– text: discussion text
– bytelength: length of text in number of bytes
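
As a minimal sketch of how one of these files might be read, the Python below loads a file as a JSON array and reports which of the eight documented keys are absent from a given discussion. The function names load_discussions and missing_keys are hypothetical, and UTF-8 encoding is an assumption; the value types of the individual fields are not specified here, so the sketch does not rely on them.

```python
import json
from pathlib import Path

# The eight keys listed in the corpus description.
EXPECTED_KEYS = {
    "title", "pageid", "revid", "wikidata",
    "contentcategories", "hiddencategories", "text", "bytelength",
}


def load_discussions(path):
    """Read one corpus file and return its discussions as a list.

    Assumes the file is UTF-8 encoded (an assumption, not stated in
    the description) and that the top level is a JSON array.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


def missing_keys(discussion):
    """Return the documented keys absent from one discussion, sorted."""
    return sorted(EXPECTED_KEYS - set(discussion))
```

For example, load_discussions("nb.wikipedia.json") would return the Bokmål discussions, and missing_keys could then be applied to each element to confirm that it carries all eight keys before further processing.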

An example discussion can be found in the PDF file (2019_wikidisc.pdf).
