The NORINT Corpus

The NORINT Corpus consists of speech from 51 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (≈ CEFR level B2) at the University of Oslo during the summers of 2014 and 2015.

The NORINT Corpus is divided into three sub-parts:

– NORINT Speech: The speech part of the corpus consists of interviews and conversations, approximately 111,000 words in total. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations.
The recordings are transcribed orthographically with the transcription tool ELAN.
– NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus has been audio-recorded.
– NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants are partly the same as in NORINT Speech and NORINT Recited, but, to protect privacy, participants cannot be identified in the corpus.
The texts are available in three formats: the original handwritten version in PDF format, a typed digital copy of the original version, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together.

The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files.

Extended metadata

resource Common Info
- resource Type: corpus
- identification Info
  - resource Name: NORINT-korpuset
  - resource Name: The NORINT Corpus
  - description: The NORINT Corpus consists of speech from 51 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (≈ CEFR level B2) at the University of Oslo during the summers of 2014 and 2015. The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, approximately 111,000 words in total. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool ELAN. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus has been audio-recorded. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants are partly the same as in NORINT Speech and NORINT Recited, but, to protect privacy, participants cannot be identified in the corpus. The texts are available in three formats: the original handwritten version in PDF format, a typed digital copy of the original version, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together. The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files.
  - description: The NORINT Corpus contains spoken material from 51 and written material from 116 adult international students who attended higher-level Norwegian courses (≈ CEFR level B2) at the University of Oslo in the summers of 2014 and 2015. The NORINT Corpus consists of three parts: – NORINT tale (Speech): The speech part of the corpus consists of interviews and conversations, 111,000 words in all. The students were interviewed about their background, studies, work, and future plans. In addition, video and audio recordings were made in which the informants converse in pairs about topics such as culture, leisure, travel, or life in Norway. There are 30–40 minutes of recordings of each student. The recordings are transcribed orthographically with the transcription program ELAN. – NORINT opplest (Recited): 57 informants, 51 of whom also contributed to NORINT tale, read aloud 60 selected sentences and a short story. Only audio recordings exist of the readings. – NORINT tekst (Text): The text part of the corpus consists of 53,247 words from 116 exam papers. The informants are partly the same as in the spoken part of the material. For privacy reasons, however, there are no visible links between them in the corpus. The texts in NORINT tekst exist in three different formats: a handwritten original version in PDF format, a typed exact copy of the original version, and a version in which all orthographic errors are corrected. The text versions and the corrected versions are linked together. The corpus is searchable in the search tool Glossa, where the transcriptions are also linked to audio and video files.
  - resource Short Name: NORINT
  - url: https://www.hf.uio.no/iln/english/about/organization/text-laboratory/projects/norint/index.html
  - url: https://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/norint/index.html
  - P I D: http://hdl.handle.net/11538/0000-000B-C01E-B
- distribution Info
  - licence Info
    - user Category: Academic
    - distribution Access Medium: accessibleThroughInterface
    - execution Location: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/norint/
    - execution Location: https://www.hf.uio.no/iln/english/about/organization/text-laboratory/projects/norint/index.html
    - licence
      - licence Family: CLARIN
      - licence Name: CLARIN_ACA-NC-LOC-PRIV-ND-*
      - licence Url: https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1
      - conditions Of Use: BY
      - conditions Of Use: ID
      - conditions Of Use: LOC
      - conditions Of Use: NC
      - conditions Of Use: ND
      - conditions Of Use: NORED
      - conditions Of Use: PRIV
      - non Standard Conditions Of Use: The corpus has audio and video recordings classified as personal data. In agreement with NSD, the Data Protection Official in Norway, the corpus is accessible only through Glossa, a search and post-processing tool developed by the Text Laboratory. The video and audio excerpts returned by the search interface cannot be shown in public without an agreement with the Text Laboratory. Please note that each individual researcher is responsible for treating the participants in the corpus with respect and sincerity. Furthermore, the participants must be kept anonymous in every published paper or other output.
    - licensor
      - actor Info
        - actor Type: organization
        - organization Info
          - organization Name: University of Oslo
          - organization Name: Universitetet i Oslo
          - organization Short Name: UiO
          - organization Short Name: UoO
          - department Name: Department of Linguistics and Scandinavian Studies
          - department Name: Institutt for lingvistiske og nordiske studier (ILN)
        - communication Info
          - email: l.a.harnas@iln.uio.no
          - email: annely.tomson@iln.uio.no
          - url: http://www.hf.uio.no/iln/
          - address: Box 1102 Blindern
          - zip Code: 0317
          - city: OSLO
          - country: Norway
    - distribution Rights Holder
      - actor Info
        - actor Type: organization
        - organization Info
          - organization Name: Department of Linguistics and Scandinavian Studies, University of Oslo
          - organization Short Name: ILN
          - department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
        - communication Info
          - email: tekstlab-post@iln.uio.no
          - url: http://www.hf.uio.no/iln/english/
          - address: Box 1102 Blindern
          - zip Code: 0317
          - city: OSLO
          - country: Norway
- contact
  - actor Info
    - actor Type: organization
    - organization Info
      - organization Name: The Text Laboratory
      - organization Short Name: Textlab
      - department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
    - communication Info
      - email: tekstlab-post@iln.uio.no
      - url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
      - address: Box 1102 Blindern
      - zip Code: 0317
      - city: OSLO
      - country: Norway
  - actor Info
    - actor Type: person
    - person Info
      - surname: Harnæs
      - given Name: Liv Andlem
    - communication Info
      - email: l.a.harnas@iln.uio.no
  - actor Info
    - actor Type: person
    - person Info
      - surname: Tomson
      - given Name: Annely
    - communication Info
      - email: annely.tomson@iln.uio.no
- metadata Info
  - metadata Creation Date: 21.03.2017
  - metadata Last Date Updated: 05.06.2018
  - metadata Creator
    - actor Info
      - actor Type: person
      - person Info
        surname: Hagen
        given Name: Kristin
      - organization Info
        organization Name: The Text Laboratory
        organization Short Name: Textlab
        department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
      - communication Info
        email: kristin.hagen@iln.uio.no
        url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
- version Info
  - version: 1
  - last Date Updated: 01.09.2016
- resource Documentation Info
  - documentation Structured
    - role: documentation
    - document Info
      - document Type: manual
      - title: Brukerveiledning til Norint-korpuset
      - author: Kristin Hagen and Viktoria Holund in cooperation with Annely Tomson
      - year: 2017
      - url: http://tekstlab.uio.no/norint/index.html
      - document Language Name: Norwegian Bokmål
      - document Language Id: nb
- resource Creation Info
  - creation Start Date: 01.01.2014
  - creation End Date: 01.09.2016
  - resource Creator
    - actor Info
      - actor Type: person
      - person Info
        surname: Tomson
        given Name: Annely
      - communication Info
        email: annely.tomson@iln.uio.no
    - actor Info
      - actor Type: person
      - person Info
        surname: Harnæs
        given Name: Liv Andlem
      - communication Info
        email: l.a.harnas@iln.uio.no
    - actor Info
      - actor Type: organization
      - organization Info
        organization Name: The Text Laboratory
        organization Short Name: Textlab
        department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
      - communication Info
        email: tekstlab-post@iln.uio.no
        url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
        address: Box 1102 Blindern
        zip Code: 0317
        city: OSLO
        country: Norway
  - funding Project
    - project Info
      - project Name: The NORINT Corpus
      - funding Type: ownFunds
      - funder: Department of Linguistics and Scandinavian Studies, University of Oslo

corpus Info
- corpus Type: Written Corpus
- corpus Type: Multimodal Corpus
- corpus Part Info
  - media Type: text
  - corpus Text Info
    - text Format Info
      - mime Type: txt
    - character Encoding Info
      - character Encoding: utf-8
- corpus Part Info
  - media Type: audio
  - corpus Audio Info
    - audio Size Info
      - size Info
        size: 57 participants x 3 audio files each for NORINT opplest (Recited)
        size Unit: files
    - setting Info
      - naturality: readSpeech
      - conversational Type: monologue
      - scenario Type: other
      - audience: no
      - interactivity: nonInteractive
    - audio Format Info
      - mime Type: mp3 and wav
- corpus Part Info
  - media Type: video
  - corpus Video Info
    - video Content Info
      - type Of Video Content: Adult foreign students learning Norwegian as a second language
    - setting Info
      - naturality: spontaneous
      - conversational Type: dialogue
      - interactivity: overlapping
      - interaction: Each informant participates in one conversation with another informant and an interview with a teacher.
    - video Format Info
      - mime Type: mp4
- corpus Part General Info
  - source Work Info
    - work Description: The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, approximately 111,000 words in total. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool ELAN. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus has been audio-recorded. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants are partly the same as in NORINT Speech and NORINT Recited, but, to protect privacy, participants cannot be identified in the corpus. The texts are available in three formats: the original handwritten version in PDF format, a typed digital copy of the original version, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together.
  - person Source Set Info
    - number Of Persons: 57
    - age Of Persons: adult
    - sex Of Persons: mixed
    - origin Of Persons: nonNative
    - dialect Accent Of Persons: Foreign students learning Norwegian.
  - linguality Info
    - linguality Type: monolingual
  - language Info
    - language Id: nb
    - language Name: Norwegian Bokmål
  - modality Info
    - modality Type: writtenLanguage
    - size Per Modality
      - size Info
        size: 53 247 in NORINT tekst (Text)
        size Unit: words
  - modality Info
    - modality Type: spokenLanguage
    - size Per Modality
      - size Info
        size: 110 979 in NORINT tale (Speech)
        size Unit: words
  - modality Info
    - modality Type: spokenLanguage
    - modality Type Details: recited text
    - size Per Modality
      - size Info
        size: 36 895 in NORINT opplest (Recited)
        size Unit: words
  - annotation Info
    - annotation Type: lemmatization
    - annotation Type: morphosyntacticAnnotation-posTagging
    - segmentation Level: word
    - tagset: The Oslo-Bergen Tagger tagset: http://tekstlab.uio.no/obt-ny/english/index.html
    - tagset Language Id: nb
    - tagset Language Name: Norwegian Bokmål
    - theoretic Model: Constraint Grammar
    - annotation Mode: automatic
    - annotation Manual Unstructured
      - role: annotationManual
      - document Unstructured: http://www.tekstlab.uio.no/obt-ny/english/index.html
    - annotation Tool
      - target Resource Name U R I: The Oslo-Bergen Tagger: http://tekstlab.uio.no/obt-ny/english/index.html
  - annotation Info
    - annotation Type: morphosyntacticAnnotation-posTagging
    - annotated Elements: other
    - segmentation Level: word
    - tagset: POS tagset created for the statistical NoTa-tagger, based on the tagset of the Oslo-Bergen Tagger.
    - tagset Language Id: nb
    - tagset Language Name: Norwegian Bokmål
    - theoretic Model: TreeTagger
    - annotation Mode: automatic
    - annotation Manual Structured
      - role: annotationManual
      - document Info
        document Type: article
        title: Tagging a Norwegian Speech Corpus
        author: Anders Nøklestad and Åshild Søfteland
        editor: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek, Mare Koit
        year: 2007
        book Title: Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007
        pages: 245–248
        conference: Nodalida 2007
        document Language Name: English
        document Language Id: en
    - annotation Manual Structured
      - role: annotationManual
      - document Info
        document Type: article
        title: Manuell morfologisk tagging av NoTa-materialet med støtte fra en statistisk tagger.
        author: Åshild Søfteland and Anders Nøklestad
        editor: Janne Bondi Johannessen and Kristin Hagen
        year: 2008
        publisher: Novus forlag
        book Title: Språk i Oslo. Ny forskning omkring talespråk
        pages: 226–234
        I S B N: 978-82-7099-471-7
        document Language Name: Norwegian
        document Language Id: nb
    - annotation Manual Structured
      - role: annotationManual
      - document Info
        document Type: manual
        title: NoTa-taggeren: TAGGEVEILEDNING
        author: Åshild Søfteland
        year: 2007
        url: http://www.tekstlab.uio.no/nota/oslo/Taggeveiledning2.pdf
        document Language Name: Norwegian Bokmål
        document Language Id: nb
  - classification Info
    - genre Info
      - genre Type: textGenre
      - genre: unstandardised
      - unstandardised Genre: Exam papers written by students. The texts are available in three different versions: one scanned original in PDF format and two transcribed versions in txt format (one original transcription with errors and one version in which the errors are corrected). All versions are linked, and it is possible to search in both transcribed versions.
    - genre Info
      - genre Type: speechGenre
      - genre: informal
    - genre Info
      - genre Type: speechGenre
      - genre: recited
  - time Coverage Info
    - time Coverage: 2014

dc:type	corpus
dc:title	The NORINT Corpus
dc:identifier	oai:tekstlab.uio.no:norint
dc:description	The NORINT Corpus consists of speech from 51 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (≈ CEFR level B2) at the University of Oslo during the summers of 2014 and 2015. The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, approximately 111,000 words in total. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool ELAN. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus has been audio-recorded. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants are partly the same as in NORINT Speech and NORINT Recited, but, to protect privacy, participants cannot be identified in the corpus. The texts are available in three formats: the original handwritten version in PDF format, a typed digital copy of the original version, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together. The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files.
dc:publisher
dc:format	accessibleThroughInterface
dc:date	2014-01-01
dc:date	2016-09-01
dc:rights	Academic
dc:rights	CLARIN
dc:rights	CLARIN_ACA-NC-LOC-PRIV-ND-*
dc:rights	https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1
dc:creator	Annely Tomson
dc:creator	Liv Andlem Harnæs
dc:creator	The Text Laboratory
dc:lang	Norwegian Bokmål
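
The Dublin Core fields above are plain tab-separated key/value lines, with some keys (dc:date, dc:rights, dc:creator) repeated. As a minimal sketch of how such a block could be read programmatically, the following Python groups the values per key. The function name `parse_dc` and the shortened sample string are illustrative assumptions, not part of the record or of any repository tooling.

```python
from collections import defaultdict


def parse_dc(text):
    """Group tab-separated 'dc:key<TAB>value' lines into {key: [values]}.

    Hypothetical helper for illustration only; repeated keys keep all
    their values in document order.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("dc:"):
            continue  # skip anything that is not a Dublin Core line
        key, _, value = line.partition("\t")
        key, value = key.strip(), value.strip()
        if value:
            fields[key].append(value)
        else:
            # A key without a value (e.g. the empty dc:publisher) is
            # kept with an empty list so the gap stays visible.
            fields.setdefault(key, [])
    return dict(fields)


# Shortened sample copied from the record above.
sample = (
    "dc:type\tcorpus\n"
    "dc:publisher\n"
    "dc:date\t2014-01-01\n"
    "dc:date\t2016-09-01\n"
    "dc:creator\tAnnely Tomson\n"
)
record = parse_dc(sample)
print(record["dc:date"])
```

Keeping every value as a list entry avoids silently dropping the second dc:date or the additional dc:rights lines.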

Download resources

index.html

Go to resource page http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/norint/