LIA sápmi – the LIA corpus of Sami dialects

The LIA Sápmi corpus is a speech corpus with recordings from 1960–1990 of Sami dialects from the northern parts of Norway, Finland and Sweden. Some recordings are from NRK Sami radio and some from UiT, mostly collected by Niels Jernsletten. The topics of the interviews and conversations are typically old trades and traditional life.
The corpus has about 190 000 tokens and 122 speakers from 19 places.
Automatic lemmatization, morphological tagging and translation to Norwegian were done by Giellatekno.

Extended metadata

resource Common Info
- resource Type: corpus
- identification Info
  - resource Name: LIA sápmi – Sámegiela hállangiellakorpus
  - resource Name: LIA sápmi – LIA-korpuset for samiske dialekter
  - resource Name: LIA sápmi – the LIA corpus of Sami dialects
  - description: The LIA Sápmi corpus is a speech corpus with recordings from 1960–1990 of Sami dialects from the northern parts of Norway, Finland and Sweden. Some recordings are from NRK Sami radio and some from UiT, mostly collected by Niels Jernsletten. The topics of the interviews and conversations are typically old trades and traditional life. The corpus has about 190 000 tokens and 122 speakers from 19 places. Automatic lemmatization, morphological tagging and translation to Norwegian were done by Giellatekno.
  - resource Short Name: LIA sápmi
  - url: http://tekstlab.uio.no/LIA/samisk/index.html
  - P I D: http://hdl.handle.net/11538/0000-000C-368C-A
- distribution Info
  - licence Info
    - user Category: Academic
    - distribution Access Medium: accessibleThroughInterface
    - execution Location: http://tekstlab.uio.no/LIA/samisk/index.html
    - licence
      - licence Family: CLARIN
      - licence Name: CLARIN_ACA-NC-LOC-PRIV-ND-*
      - licence Url: https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1
      - conditions Of Use: *
      - conditions Of Use: BY
      - conditions Of Use: ID
      - conditions Of Use: LOC
      - conditions Of Use: NC
      - conditions Of Use: ND
      - conditions Of Use: NORED
      - conditions Of Use: PRIV
    - non Standard Conditions Of Use: The corpus has audio and video recordings classified as personal data. In agreement with NSD, the Data Protection Official in Norway, the corpus is accessible only through Glossa, a search and post-processing tool developed by the Text Laboratory. The audio excerpts provided by the search interface cannot be played or shown in public unless you have an agreement with the Text Laboratory. Please note that every individual researcher is responsible for treating the participants in the corpus with respect and sincerity. Furthermore, the participants must be kept anonymous in every published paper or other output.
    - licensor:
    - actor Info
      - actor Type: organization
      - organization Info
        organization Name: University of Oslo
        organization Name: Universitetet i Oslo
        organization Short Name: UiO
        organization Short Name: UoO
        department Name: Department of Linguistics and Scandinavian Studies
        department Name: Institutt for lingvistiske og nordiske studier (ILN)
      - communication Info
        email: tekstlab-post@iln.uio.no
        url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
    - distribution Rights Holder
      - actor Info
        actor Type: organization
        organization Info
        organization Name: University of Oslo
        organization Name: Universitetet i Oslo
        organization Short Name: UiO
        organization Short Name: UoO
        department Name: Department of Linguistics and Scandinavian Studies
        department Name: Institutt for lingvistiske og nordiske studier (ILN)
        communication Info
        email: tekstlab-post@iln.uio.no
        url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
- contact
  - actor Info
    - actor Type: organization
    - organization Info
      - organization Name: The Text Laboratory
      - organization Short Name: Textlab
      - department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
    - communication Info
      - email: tekstlab-post@iln.uio.no
      - url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
      - address: Box 1102 Blindern
      - zip Code: 0317
      - city: OSLO
      - country: Norway
- metadata Info
  - metadata Creation Date: 19.11.2018
  - metadata Last Date Updated: 02.04.2020
  - metadata Creator
    - actor Info
      - actor Type: person
      - person Info
        surname: Hagen
        given Name: Kristin
      - organization Info
        organization Name: The Text Laboratory
        organization Short Name: Textlab
        department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
      - communication Info
        email: kristin.hagen@iln.uio.no
        url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
- version Info
  - version: Preliminary version (autumn 2018); first version November 2019
- validation Info
  - validated: true
  - validation Type: content
  - validation Mode: manual
  - validation Mode Details: The transcriptions are proofread against the audio files.
  - validation Extent: partial
  - validator:
  - actor Info
    - actor Type: organization
    - organization Info
      - organization Name: The LIA project
      - organization Short Name: LIA
      - department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
    - communication Info
      - email: tekstlab-post@iln.uio.no
      - url: http://tekstlab.uio.no/LIA/index.html
      - address: Box 1102 Blindern
      - zip Code: 0317
      - city: OSLO
      - country: Norway
- resource Documentation Info
  - documentation Unstructured
    - role: documentation
    - document Unstructured: http://tekstlab.uio.no/LIA/transkripsjon.html (In Norwegian and Sami)
- resource Creation Info
  - creation Start Date: 01.04.2014
  - creation End Date: 01.11.2019
  - resource Creator
    - actor Info
      - actor Type: organization
      - organization Info
        organization Name: The LIA project (Project participants and employees in the LIA project)
      - communication Info
        email: tekstlab-post@iln.uio.no
        url: http://tekstlab.uio.no/LIA/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
  - funding Project:
  - project Info
    - project Name: LIA (Language Infrastructure made Accessible)
    - project Short Name: LIA
    - project I D: 22 59 41
    - url: http://tekstlab.uio.no/LIA/
    - url: https://www.hf.uio.no/iln/english/research/projects/language-infrastructure-made-accessible/index.html
    - funding Type: nationalFunds
    - funder: The Research Council of Norway
    - funding Country: Norway
    - project Start Date: 04.01.2014
    - project End Date: 31.12.2019

corpus Info
- corpus Type: Multimodal Corpus
- corpus Part Info
  - media Type: text
  - corpus Text Info
    - text Format Info
      - mime Type: txt
      - size Per Text Format
        size Info
        size: 188 974
        size Unit: tokens
    - character Encoding Info
      - character Encoding: utf-8
- corpus Part Info
  - media Type: audio
  - corpus Audio Info
    - audio Size Info
      - size Info
        size: Approx 1.8 GB
        size Unit: gb
    - setting Info
      - naturality: spontaneous
      - conversational Type: dialogue
      - audience: few
      - interactivity: overlapping
      - interaction: Semiformal or informal interviews with one or more interviewers. Often the recordings are more like conversations.
    - audio Format Info
      - mime Type: wav and mp3
      - recording Quality: medium
      - compression Info
        compression: true
        compression Name: mp3
- corpus Part General Info
  - person Source Set Info
    - number Of Persons: 122
    - age Of Persons: adult
    - age Of Persons: elderly
    - age Range Start: 25
    - age Range End: 91
    - sex Of Persons: mixed
    - origin Of Persons: native
    - dialect Accent Of Persons: Dialects from 19 places in the north of Norway, Sweden and Finland
  - linguality Info
    - linguality Type: monolingual
  - language Info
    - language Id: se
    - language Name: Northern Sami
  - modality Info
    - modality Type: spokenLanguage
    - modality Type Details: Orthographic transcription
  - size Info
    - size: 188 974
    - size Unit: tokens
  - annotation Info
    - annotation Type: morphosyntacticAnnotation-posTagging
    - annotated Elements: other
    - segmentation Level: word
    - tagset: http://giellatekno.uit.no/doc/lang/sme/docu-sme-grammartags.html
    - tagset Language Id: se
    - tagset Language Name: Sami
    - theoretic Model: Constraint grammar, see http://giellatekno.uit.no/
    - annotation Mode: automatic
  - annotation Info
    - annotation Type: speechAnnotation-orthographicTranscription
    - annotation Manual Structured
      - role: annotationManual
      - document Info
        document Type: manual
        title: Davvisámegiela transkripšuvdna ortografiija mielde – LIA
        author: Biret Ánne Bals Baal and Arnstein Johnskareng, UiT Norgga árktalaš universitehta
        year: 2018
        url: http://tekstlab.uio.no/LIA/pdf/LIA-ortografiija_transkriberen.pdf
    - annotation Manual Structured
      - role: annotationManual
      - document Info
        document Type: manual
        title: Transkripsjonsrettleiing for LIA – samisk
        author: Kristin Hagen, Live Håberg, Arnstein Johnskareng, Eirik Olsen and Åshild Søfteland
        year: 2016
        url: http://tekstlab.uio.no/LIA/pdf/transkripsjonsrettleiing_lia_samisk.pdf
  - classification Info
    - genre Info
      - genre Type: speechGenre
      - genre: informal
      - unstandardised Genre: conversations and informal interviews
  - classification Info
    - genre Info
      - genre Type: speechGenre
      - genre: semi formal
      - unstandardised Genre: interviews
  - time Coverage Info
    - time Coverage: 1960 – 1987
  - geographic Coverage Info
    - geographic Coverage: Sami areas in northern Norway, Finland and Sweden
  - recording Info
    - recording Device Type: tapeVHS
    - recording Environment: other

dc:type	corpus
dc:title	LIA sápmi – the LIA corpus of Sami dialects
dc:identifier	oai:tekstlab.uio.no:lia-sapmi
dc:description	The LIA Sápmi corpus is a speech corpus with recordings from 1960–1990 of Sami dialects from the northern parts of Norway, Finland and Sweden. Some recordings are from NRK Sami radio and some from UiT, mostly collected by Niels Jernsletten. The topics of the interviews and conversations are typically old trades and traditional life. The corpus has about 190 000 tokens and 122 speakers from 19 places. Automatic lemmatization, morphological tagging and translation to Norwegian were done by Giellatekno.
dc:publisher
dc:format	accessibleThroughInterface
dc:date	2014-04-01
dc:date	2019-11-01
dc:rights	Academic
dc:rights	CLARIN
dc:rights	CLARIN_ACA-NC-LOC-PRIV-ND-*
dc:rights	https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1
dc:creator	The LIA project (Project participants and employees in the LIA project)
dc:lang	Northern Sami
