Training Corpus jos1M

The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, and about a quarter of the annotations, the most problematic ones, have been validated by hand. The corpus is available as TEI P5 XML source and in a simpler, smaller vertical format used by various concordancers.
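As a rough illustration of how the vertical format might be consumed when training a tagger, the sketch below reads one token per line and groups tokens into sentences. Several things here are assumptions rather than facts from this record: the column order (word form, MSD tag, lemma), the tab separator, the `<s>` ... `</s>` sentence markers, and the function name `read_vertical`. The sample rows are made up for illustration and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed column order: (word form, MSD tag, lemma).
Token = Tuple[str, str, str]


def read_vertical(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens) from vertical-format lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # A line that starts with "<" and has no tab is treated as markup.
        if line.startswith("<") and "\t" not in line:
            if line.startswith("</s") and sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split("\t")
        if len(parts) >= 3:
            sentence.append((parts[0], parts[1], parts[2]))
    # Flush a trailing sentence that was never closed by </s>.
    if sentence:
        yield sentence


# Invented sample; the tags are placeholders, not verified annotations.
sample = [
    "<s>",
    "To\tPd-nsn\tta",
    "je\tVa-r3s-n\tbiti",
    "korpus\tNcmsn\tkorpus",
    "</s>",
]
sentences = list(read_vertical(sample))
```

If the distributed files use a different column order or different structural tags, only the tuple construction and the `</s` check would need to change.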

Extended metadata

resource Common Info
- resource Type: corpus
- identification Info
  - resource Name: Training Corpus jos1M
  - description: The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, and about a quarter of the annotations, the most problematic ones, have been validated by hand. The corpus is available as TEI P5 XML source and in a simpler, smaller vertical format used by various concordancers.
  - resource Short Name: jos1M
  - url: http://clarino.uib.no/iness/landing-page?resource=jos1M&view=short
  - url: http://clarino.uib.no/iness/landing-page?resource=jos1M
  - PID: hdl:11495/DC84-BF60-3823-5
- distribution Info
  - licence Info
    - user Category: Public
    - licence
      - licence Family: Creative Commons (CC)
      - licence Name: Creative_Commons-BY-NC (CC-BY-NC)
      - licence Url: http://creativecommons.org/licenses/by-nc/4.0/
      - conditions Of Use: BY
      - conditions Of Use: NC
- contact
  - actor Info
    - actor Type: person
    - role: author
    - person Info
      - surname: Krek
      - given Name: Simon
      - affiliation:
      - organization Info
        organization Name: “Jožef Stefan” Institute
- metadata Info
  - metadata Creation Date: 28.03.2017
  - metadata Last Date Updated: 06.03.2018
  - metadata Creator
    - actor Info
      - actor Type: person
      - person Info
        surname: Dione
        given Name: Cheikh Bamba
        sex: male
        position: Researcher (Ph.D)
        affiliation:
        organization Info
        organization Name: University of Bergen
        organization Name: Universitetet i Bergen
        organization Short Name: UiB
        organization Short Name: UoB
        department Name: Department of Linguistic, Literary and Aesthetic Studies
      - communication Info
        email: clarin@uib.no
        email: iness@uib.no
- resource Creation Info
  - resource Creator
    - actor Info
      - actor Type: person
      - person Info
        surname: Erjavec
        given Name: Tomaž
        affiliation:
        organization Info
        organization Name: Jožef Stefan Institute
    - actor Info
      - actor Type: person
      - person Info
        surname: Krek
        given Name: Simon
        affiliation:
        organization Info
        organization Name: Jožef Stefan Institute

Download metadata

dc:type	corpus
dc:title	Training Corpus jos1M
dc:identifier	oai:clarino.uib.no:Jos1M
dc:description	The jos1M corpus contains 1 million words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (fine-grained PoS tags) and lemmas, and about a quarter of the annotations, the most problematic ones, have been validated by hand. The corpus is available as TEI P5 XML source and in a simpler, smaller vertical format used by various concordancers.
dc:publisher
dc:format
dc:date
dc:date
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY-NC (CC-BY-NC)
dc:rights	http://creativecommons.org/licenses/by-nc/4.0/
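
The Dublin Core listing above repeats some keys (for example `dc:rights`) and leaves others empty (for example `dc:publisher`). A minimal sketch of loading such a listing into a dictionary of lists follows. It assumes each line is a key and an optional value separated by a tab; the function name `parse_dc` and the shortened sample string are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def parse_dc(text: str) -> Dict[str, List[str]]:
    """Collect repeated keys into lists; empty fields map to empty lists."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first tab; a line without a tab has an empty value.
        key, _, value = line.partition("\t")
        values = fields[key.strip()]  # creates the key even if it is empty
        value = value.strip()
        if value:
            values.append(value)
    return dict(fields)


# Shortened sample based on the listing above.
record = (
    "dc:type\tcorpus\n"
    "dc:publisher\n"
    "dc:rights\tPublic\n"
    "dc:rights\tCreative Commons (CC)\n"
)
parsed = parse_dc(record)
```

Keeping empty keys in the result makes it easy to spot which fields the record leaves unfilled.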
