Training Corpus jos1M

The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in its source TEI P5 XML encoding and in the simpler and smaller vertical format used by various concordancers.

Extended metadata

Download metadata (CMDI XML)

dc:type	corpus
dc:title	Training Corpus jos1M
dc:identifier	oai:clarino.uib.no:Jos1M
dc:description	The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in its source TEI P5 XML encoding and in the simpler and smaller vertical format used by various concordancers.
dc:publisher
dc:format
dc:date
dc:date
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY-NC (CC-BY-NC)
dc:rights	http://creativecommons.org/licenses/by-nc/4.0/
