The NORINT Corpus
Extended metadata
- resource Common Info
- resource Type: corpus
- identification Info
- resource Name: NORINT-korpuset
- resource Name: The NORINT Corpus
- description: The NORINT Corpus consists of speech from 51 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (≈ CEFR level B2) at the University of Oslo during the summers of 2014 and 2015. The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, 111,000 words altogether. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool Elan. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus is audio-recorded only. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants partly overlap with those in NORINT Speech and NORINT Recited, but participants cannot be identified in the corpus, for privacy reasons. The texts are available in three formats: the original handwritten version in PDF format, an exact typed digital copy of the original, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together. The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files.
- description: NORINT-korpuset inneholder muntlig materiale fra 51 og skriftlig materiale fra 116 voksne internasjonale studenter som gikk på norskkurs på høyere nivå (≈CEFR-nivå B2) ved Universitetet i Oslo sommeren 2014 og 2015. NORINT-korpuset består av tre deler: – NORINT tale: Taledelen av korpuset består av intervjuer og samtaler, i alt 111 000 ord. Studentene ble intervjuet om bakgrunn, studier, arbeid og fremtidsplaner. I tillegg er det gjort video- og lydopptak der informantene samtaler to og to om emner som kultur, fritid, reiser eller livet i Norge. Det er 30 – 40 minutters opptak av hver student. Opptakene er transkribert ortografisk med transkripsjonsprogrammet Elan. – NORINT opplest: 57 informanter, 51 av dem de samme som bidro til NORINT tale, leser opp 60 utvalgte setninger og en liten historie. Det finnes bare lydopptak av opplesningene. – NORINT tekst: Tekstdelen av korpuset består av 53 247 ord fra 116 eksamensoppgaver. Informantene er delvis de samme som i den muntlige delen av materialet. Av hensyn til personvern er det imidlertid ikke synlige koplinger i korpuset. Tekstene i NORINT tekst foreligger i tre ulike formater: en håndskrevet originalversjon i pdf-format, en innskrevet nøyaktig kopi av originalversjonen og en versjon der alle ortografiske feil er rettet. Tekstversjonene og de korrigerte versjonene er lenket sammen. Korpuset er søkbart i søkeverktøyet Glossa der transkripsjonene dessuten er koplet til lyd- og videofiler.
- resource Short Name: NORINT
- url: https://www.hf.uio.no/iln/english/about/organization/text-laboratory/projects/norint/index.html
- url: https://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/norint/index.html
- P I D: http://hdl.handle.net/11538/0000-000B-C01E-B
- distribution Info
- licence Info
- user Category: Academic
- distribution Access Medium: accessibleThroughInterface
- execution Location: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/norint/
- execution Location: https://www.hf.uio.no/iln/english/about/organization/text-laboratory/projects/norint/index.html
- licence
- licence Family: CLARIN
- licence Name: CLARIN_ACA-NC-LOC-PRIV-ND-*
- licence Url: https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1
- conditions Of Use: BY
- conditions Of Use: ID
- conditions Of Use: LOC
- conditions Of Use: NC
- conditions Of Use: ND
- conditions Of Use: NORED
- conditions Of Use: PRIV
- non Standard Conditions Of Use: The corpus contains audio and video recordings classified as personal data. In agreement with NSD, the Data Protection Official in Norway, the corpus is accessible only through Glossa, a search and post-processing tool developed by the Text Laboratory. The video and audio excerpts returned by the search interface may not be shown in public without an agreement with the Text Laboratory. Every individual researcher is responsible for treating the participants in the corpus with respect and sincerity. Furthermore, the participants must be kept anonymous in every published paper or other output.
- licensor:
- actor Info
- actor Type: organization
- organization Info
- organization Name: University of Oslo
- organization Name: Universitetet i Oslo
- organization Short Name: UiO
- organization Short Name: UoO
- department Name: Department of Linguistics and Scandinavian Studies
- department Name: Institutt for lingvistiske og nordiske studier (ILN)
- communication Info
- email: l.a.harnas@iln.uio.no
- email: annely.tomson@iln.uio.no
- url: http://www.hf.uio.no/iln/
- address: Box 1102 Blindern
- zip Code: 0317
- city: OSLO
- country: Norway
- distribution Rights Holder
- actor Info
- actor Type: organization
- organization Info
- organization Name: Department of Linguistics and Scandinavian Studies, University of Oslo
- organization Short Name: ILN
- department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
- communication Info
- email: tekstlab-post@iln.uio.no
- url: http://www.hf.uio.no/iln/english/
- address: Box 1102 Blindern
- zip Code: 0317
- city: OSLO
- country: Norway
- actor Info
- licence Info
- contact
- actor Info
- actor Type: organization
- organization Info
- organization Name: The Text Laboratory
- organization Short Name: Textlab
- department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
- communication Info
- email: tekstlab-post@iln.uio.no
- url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
- address: Box 1102 Blindern
- zip Code: 0317
- city: OSLO
- country: Norway
- actor Info
- actor Type: person
- person Info
- surname: Harnæs
- given Name: Liv Andlem
- communication Info
- email: l.a.harnas@iln.uio.no
- actor Info
- actor Type: person
- person Info
- surname: Tomson
- given Name: Annely
- communication Info
- email: annely.tomson@iln.uio.no
- actor Info
- metadata Info
- metadata Creation Date: 21.03.2017
- metadata Last Date Updated: 05.06.2018
- metadata Creator
- actor Info
- actor Type: person
- person Info
- surname: Hagen
- given Name: Kristin
- organization Info
- organization Name: The Text Laboratory
- organization Short Name: Textlab
- department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
- communication Info
- email: kristin.hagen@iln.uio.no
- url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
- address: Box 1102 Blindern
- zip Code: 0317
- city: OSLO
- country: Norway
- actor Info
- version Info
- version: 1
- last Date Updated: 01.09.2016
- resource Documentation Info
- documentation Structured
- role: documentation
- document Info
- document Type: manual
- title: Brukerveiledning til Norint-korpuset
- author: Kristin Hagen and Viktoria Holund in cooperation with Annely Tomson
- year: 2017
- url: http://tekstlab.uio.no/norint/index.html
- document Language Name: Norwegian Bokmål
- document Language Id: nb
- documentation Structured
- resource Creation Info
- creation Start Date: 01.01.2014
- creation End Date: 01.09.2016
- resource Creator
- actor Info
- actor Type: person
- person Info
- surname: Tomson
- given Name: Annely
- communication Info
- email: annely.tomson@iln.uio.no
- actor Info
- actor Type: person
- person Info
- surname: Harnæs
- given Name: Liv Andlem
- communication Info
- email: l.a.harnas@iln.uio.no
- actor Info
- actor Type: organization
- organization Info
- organization Name: The Text Laboratory
- organization Short Name: Textlab
- department Name: Department of Linguistics and Scandinavian Studies, University of Oslo
- communication Info
- email: tekstlab-post@iln.uio.no
- url: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/
- address: Box 1102 Blindern
- zip Code: 0317
- city: OSLO
- country: Norway
- actor Info
- funding Project:
- project Info
- project Name: The NORINT Corpus
- funding Type: ownFunds
- funder: Department of Linguistics and Scandinavian Studies, University of Oslo
- corpus Info
- corpus Type: Written Corpus
- corpus Type: Multimodal Corpus
- corpus Part Info
- media Type: text
- corpus Text Info
- text Format Info
- mime Type: text/plain
- character Encoding Info
- character Encoding: utf-8
- text Format Info
- corpus Part Info
- media Type: audio
- corpus Audio Info
- audio Size Info
- size Info
- size: 57 participants × 3 audio files each (171 files) for NORINT opplest (Recited)
- size Unit: files
- size Info
- setting Info
- naturality: readSpeech
- conversational Type: monologue
- scenario Type: other
- audience: no
- interactivity: nonInteractive
- audio Format Info
- mime Type: audio/mpeg (mp3) and audio/wav (wav)
- audio Size Info
- corpus Part Info
- media Type: video
- corpus Video Info
- video Content Info
- type Of Video Content: Adult international students learning Norwegian as a second language
- setting Info
- naturality: spontaneous
- conversational Type: dialogue
- interactivity: overlapping
- interaction: Each informant participates in one conversation with another informant and an interview with a teacher.
- video Format Info
- mime Type: video/mp4
- video Content Info
- corpus Part General Info
- source Work Info
- work Description: The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, 111,000 words altogether. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool Elan. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus is audio-recorded only. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants partly overlap with those in NORINT Speech and NORINT Recited, but participants cannot be identified in the corpus, for privacy reasons. The texts are available in three formats: the original handwritten version in PDF format, an exact typed digital copy of the original, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together.
- person Source Set Info
- number Of Persons: 57
- age Of Persons: adult
- sex Of Persons: mixed
- origin Of Persons: nonNative
- dialect Accent Of Persons: Foreign students learning Norwegian.
- linguality Info
- linguality Type: monolingual
- language Info
- language Id: nb
- language Name: Norwegian Bokmål
- modality Info
- modality Type: writtenLanguage
- size Per Modality
- size Info
- size: 53 247 in NORINT tekst (Text)
- size Unit: words
- size Info
- modality Info
- modality Type: spokenLanguage
- size Per Modality
- size Info
- size: 110 979 in NORINT tale (Speech)
- size Unit: words
- size Info
- modality Info
- modality Type: spokenLanguage
- modality Type Details: recited text
- size Per Modality
- size Info
- size: 36 895 in NORINT opplest (Recited)
- size Unit: words
- size Info
- annotation Info
- annotation Type: lemmatization
- annotation Type: morphosyntacticAnnotation-posTagging
- segmentation Level: word
- tagset: The Oslo-Bergen Tagger tagset: http://tekstlab.uio.no/obt-ny/english/index.html
- tagset Language Id: nb
- tagset Language Name: Norwegian Bokmål
- theoretic Model: Constraint Grammar
- annotation Mode: automatic
- annotation Manual Unstructured
- role: annotationManual
- document Unstructured: http://www.tekstlab.uio.no/obt-ny/english/index.html
- annotation Tool
- target Resource Name U R I: The Oslo-Bergen Tagger: http://tekstlab.uio.no/obt-ny/english/index.html
- annotation Info
- annotation Type: morphosyntacticAnnotation-posTagging
- annotated Elements: other
- segmentation Level: word
- tagset: POS tagset created for the statistical NoTa-tagger, based on the tagset of the Oslo-Bergen Tagger.
- tagset Language Id: nb
- tagset Language Name: Norwegian Bokmål
- theoretic Model: TreeTagger
- annotation Mode: automatic
- annotation Manual Structured
- role: annotationManual
- document Info
- document Type: article
- title: Tagging a Norwegian Speech Corpus
- author: Anders Nøklestad and Åshild Søfteland
- editor: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek, Mare Koit
- year: 2007
- book Title: Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007
- pages: 245–248
- conference: Nodalida 2007
- document Language Name: English
- document Language Id: en
- annotation Manual Structured
- role: annotationManual
- document Info
- document Type: article
- title: Manuell morfologisk tagging av NoTa-materialet med støtte fra en statistisk tagger.
- author: Åshild Søfteland and Anders Nøklestad
- editor: Janne Bondi Johannessen and Kristin Hagen
- year: 2008
- publisher: Novus forlag
- book Title: Språk i Oslo. Ny forskning omkring talespråk
- pages: 226–234
- I S B N: 978-82-7099-471-7
- document Language Name: Norwegian Bokmål
- document Language Id: nb
- annotation Manual Structured
- role: annotationManual
- document Info
- document Type: manual
- title: NoTa-taggeren: TAGGEVEILEDNING
- author: Åshild Søfteland
- year: 2007
- url: http://www.tekstlab.uio.no/nota/oslo/Taggeveiledning2.pdf
- document Language Name: Norwegian Bokmål
- document Language Id: nb
- classification Info
- genre Info
- genre Type: textGenre
- genre: unstandardised
- unstandardised Genre: Exam papers written by students. The texts are available in three versions: a scanned original in PDF format and two transcribed versions in txt format, one reproducing the original with its errors and one in which the errors are corrected. All versions are linked, and both transcribed versions are searchable.
- genre Info
- genre Type: speechGenre
- genre: informal
- genre Info
- genre Type: speechGenre
- genre: recited
- genre Info
- time Coverage Info
- time Coverage: 2014
- source Work Info
dc:type | corpus |
dc:title | The NORINT Corpus |
dc:identifier | oai:tekstlab.uio.no:norint |
dc:description | The NORINT Corpus consists of speech from 51 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (≈ CEFR level B2) at the University of Oslo during the summers of 2014 and 2015. The NORINT Corpus is divided into three sub-parts: – NORINT Speech: The speech part of the corpus consists of interviews and conversations, 111,000 words altogether. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about optional themes such as culture, leisure, travel, or life in Norway. There are both audio and video recordings of the interviews and conversations. The recordings are transcribed orthographically with the transcription tool Elan. – NORINT Recited: 57 L2 learners, 51 of whom contributed to the NORINT Speech sub-part, recite a short story as well as 60 non-contextualized sentences. This part of the corpus is audio-recorded only. – NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants partly overlap with those in NORINT Speech and NORINT Recited, but participants cannot be identified in the corpus, for privacy reasons. The texts are available in three formats: the original handwritten version in PDF format, an exact typed digital copy of the original, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together. The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files. |
dc:publisher | |
dc:format | accessibleThroughInterface |
dc:date | 2014-01-01 |
dc:date | 2016-09-01 |
dc:rights | Academic |
dc:rights | CLARIN |
dc:rights | CLARIN_ACA-NC-LOC-PRIV-ND-* |
dc:rights | https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&LOC=1&PRIV=1&NORED=1&ND=1 |
dc:creator | Annely Tomson |
dc:creator | Liv Andlem Harnæs |
dc:creator | The Text Laboratory |
dc:lang | Norwegian Bokmål |