Training corpus ssj500kv1.2

The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build the training corpus, the text, which is a sequence of characters (letters, numbers, spaces, symbols, etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These steps are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is an abbreviation that encodes the word class and its morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
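The positional reading of a tag such as Somei can be illustrated with a short sketch. This is not the JOS tooling: the function name decode_tag, the feature labels (type, gender, number, case) and the lookup tables are assumptions made for illustration, and the tables hold only the five values named in the description, not the full tagset.

```python
# Partial, illustrative lookup tables. Only the values named in the
# description are included; the real JOS tagset is far larger.
# Feature labels ("type", "gender", ...) are assumed names.
NOUN_POSITIONS = [
    ("type", {"o": "občno ime (common noun)"}),
    ("gender", {"m": "moški spol (masculine gender)"}),
    ("number", {"e": "ednina (singular)"}),
    ("case", {"i": "imenovalnik (nominative)"}),
]

# First character of the tag selects the word class and its position table.
WORD_CLASSES = {"S": ("samostalnik (noun)", NOUN_POSITIONS)}


def decode_tag(tag):
    """Expand a positional tag into a dictionary of named features."""
    if not tag or tag[0] not in WORD_CLASSES:
        raise ValueError(f"unsupported word class in tag {tag!r}")
    class_name, positions = WORD_CLASSES[tag[0]]
    features = {"word class": class_name}
    # Each remaining character is looked up in the table for its position.
    for (feature, values), code in zip(positions, tag[1:]):
        features[feature] = values.get(code, f"unknown code {code!r}")
    return features


if __name__ == "__main__":
    for feature, value in decode_tag("Somei").items():
        print(f"{feature}: {value}")
```

Each character after the first is interpreted according to its position, which is why one short string can carry the word class together with all of its features.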

Extended metadata

resource Common Info
- resource Type: corpus
- identification Info
  - resource Name: Training corpus ssj500kv1.2
  - description: The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build the training corpus, the text, which is a sequence of characters (letters, numbers, spaces, symbols, etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These steps are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is an abbreviation that encodes the word class and its morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
  - url: http://clarino.uib.no/iness/landing-page?resource=slv-ssj500k-dep&view=short
  - url: http://eng.slovenscina.eu/tehnologije/ucni-korpus
  - PID: hdl:11495/DB26-0437-026E-4
  - identifier: slv-ssj500k-dep
- distribution Info
  - licence Info
    - user Category: Public
    - distribution Access Medium: accessibleThroughInterface
    - execution Location: http://hdl.handle.net/11495/DB26-0437-026E-4
    - attribution Text: Krek, Simon and Erjavec, Tomaž (2014). Training corpus ssj500kv1.2. Jožef Stefan Institute, Slovenia. http://hdl.handle.net/11495/DB26-0437-026E-4
    - licence
      - licence Family: Creative Commons (CC)
      - licence Name: Creative_Commons-BY-NC-SA (CC-BY-NC-SA)
      - licence Url: https://creativecommons.org/licenses/by-nc-sa/4.0/
      - conditions Of Use: BY
      - conditions Of Use: NC
      - conditions Of Use: SA
    - licensor:
    - actor Info
      - actor Type: organization
      - role: iprHolder
      - organization Info
        organization Name: Slovenian Ministry of Education, Science and Sport
- contact
  - actor Info
    - actor Type: person
    - role: author
    - person Info
      - surname: Krek
      - given Name: Simon
      - affiliation:
      - organization Info
        organization Name: “Jožef Stefan” Institute
- metadata Info
  - metadata Creation Date: 10.02.2015
  - metadata Last Date Updated: 14.10.2016
  - metadata Creator
    - actor Info
      - actor Type: person
      - role: metadataCreator
      - person Info
        - surname: Parra Escartín
        - given Name: Carla
        - affiliation:
        - organization Info
          organization Name: University of Bergen
          organization Short Name: UiB
          department Name: Department of Linguistic, Literary and Aesthetic Studies
- version Info
  - version: Version 1.2 of the ssj500k training corpus with the category "organisation" added to the Named Entity annotation level.

Download metadata: http://hdl.handle.net/11495/D8A2-CFB1-49F7-1

Go to resource page: http://hdl.handle.net/11495/DB26-0437-026E-4

dc:type	corpus
dc:title	Training corpus ssj500kv1.2
dc:identifier	oai:clarino.uib.no:slv-ssj500k-dep
dc:description	The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build the training corpus, the text, which is a sequence of characters (letters, numbers, spaces, symbols, etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These steps are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is an abbreviation that encodes the word class and its morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
dc:publisher
dc:format	accessibleThroughInterface
dc:date
dc:date
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY-NC-SA (CC-BY-NC-SA)
dc:rights	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc:lang	Slovenian
