The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers.
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers.
Extended metadata
dc:type
corpus
dc:title
Training Corpus jos1M
dc:identifier
oai:clarino.uib.no:Jos1M
dc:description
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers.