Training corpus ssj500kv1.2
Extended metadata
- resource Common Info
- resource Type: corpus
- identification Info
- resource Name: Training corpus ssj500kv1.2
- description: The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. When building a training corpus, the text, which consists of a sequence of characters (letters, numbers, spaces, symbols etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These procedures are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a basic form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is formed as an acronym that encodes the word class and related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
- url: http://clarino.uib.no/iness/landing-page?resource=slv-ssj500k-dep&view=short
- url: http://eng.slovenscina.eu/tehnologije/ucni-korpus
- PID: hdl:11495/DB26-0437-026E-4
- identifier: slv-ssj500k-dep
- distribution Info
- licence Info
- user Category: Public
- distribution Access Medium: accessibleThroughInterface
- execution Location: http://hdl.handle.net/11495/DB26-0437-026E-4
- attribution Text: Krek, Simon and Erjavec, Tomaž (2014). Training corpus ssj500kv1.2. Jožef Stefan Institute, Slovenia. http://hdl.handle.net/11495/DB26-0437-026E-4
- licence
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-BY-NC-SA (CC-BY-NC-SA)
- licence Url: https://creativecommons.org/licenses/by-nc-sa/4.0/
- conditions Of Use: BY
- conditions Of Use: NC
- conditions Of Use: SA
- licensor:
- actor Info
- actor Type: organization
- role: iprHolder
- organization Info
- organization Name: Slovenian Ministry of Education, Science and Sport
- licence Info
- contact
- actor Info
- actor Type: person
- role: author
- person Info
- surname: Krek
- given Name: Simon
- affiliation:
- organization Info
- organization Name: “Jožef Stefan” Institute
- actor Info
- metadata Creation Date: 10.02.2015
- metadata Last Date Updated: 14.10.2016
- metadata Creator
- actor Info
- actor Type: person
- role: metadataCreator
- person Info
- surname: Parra Escartín
- given Name: Carla
- affiliation:
- organization Info
- organization Name: University of Bergen
- organization Short Name: UiB
- department Name: Department of Linguistic, Literary and Aesthetic Studies
- actor Info
- version: Version 1.2 of the ssj500k training corpus with the category "organisation" added to the Named Entity annotation level.
- corpus Info
- corpus Type: Treebank
- corpus Part Info
- media Type: text
- corpus Part General Info
- linguality Info
- linguality Type: monolingual
- language Info
- language Id: sl
- language Name: Slovenian
- linguality Info
dc:type | corpus |
dc:title | Training corpus ssj500kv1.2 |
dc:identifier | oai:clarino.uib.no:slv-ssj500k-dep |
dc:description | The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. When building a training corpus, the text, which consists of a sequence of characters (letters, numbers, spaces, symbols etc.), first has to be divided into meaningful units such as paragraphs, sentences, words and punctuation. These procedures are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are attributed to each word: a basic form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is formed as an acronym that encodes the word class and related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project. |
dc:publisher | |
dc:format | accessibleThroughInterface |
dc:date | |
dc:date | |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-BY-NC-SA (CC-BY-NC-SA) |
dc:rights | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc:lang | Slovenian |
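
The description states that a morphosyntactic tag such as Somei is an acronym whose letters encode the word class and its features. The sketch below illustrates that idea for the single example given in the description. It is a hedged illustration, not the JOS specification: the lookup table contains only the letter values spelled out in the description, and the function name `decode_tag`, the table names, and the attribute labels ("type", "gender", "number", "case") are hypothetical names chosen for this example.

```python
# Partial, illustrative lookup table (assumption): it covers only the values
# that the corpus description spells out for the example tag "Somei".
# The real JOS tagset has 1,902 tags and many more letters per position.
NOUN_POSITIONS = [
    ("type", {"o": "common noun (občno ime)"}),
    ("gender", {"m": "masculine (moški spol)"}),
    ("number", {"e": "singular (ednina)"}),
    ("case", {"i": "nominative (imenovalnik)"}),
]

# The first letter selects the word class and its list of feature positions.
WORD_CLASSES = {"S": ("noun (samostalnik)", NOUN_POSITIONS)}


def decode_tag(tag):
    """Expand a positional tag into a dict of readable feature values."""
    if not tag or tag[0] not in WORD_CLASSES:
        raise ValueError(f"word class not covered by this sketch: {tag!r}")
    word_class, positions = WORD_CLASSES[tag[0]]
    decoded = {"word class": word_class}
    # Each remaining letter is read against the feature at the same position.
    for (feature, values), letter in zip(positions, tag[1:]):
        decoded[feature] = values.get(letter, f"unknown ({letter})")
    return decoded


if __name__ == "__main__":
    for feature, value in decode_tag("Somei").items():
        print(f"{feature}: {value}")
```

Letters that are missing from the partial table are reported as unknown rather than guessed, so the sketch never invents a feature value that the description does not confirm.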