Norwegian Parliamentary Speech Corpus 1.1
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Stortingskorpuset 1.1
- resource Name: Norwegian Parliamentary Speech Corpus 1.1
- description: This is version 1.1 of Stortingskorpuset (English abbreviation: NPSC). The following changes distinguish version 1.1 from version 1.0: – The data has been split into official training, evaluation and test sets. – Manual dialect annotation has been added for each individual speaker. – The end point of one sentence (sentence_id 45886) in 20171208 has been changed, because the sentence contained a 30-minute pause in version 1.0. The corresponding audio file (20171208-085509_6122400_6124160.wav) was shortened accordingly. – Some metadata for the transcriptions in 20171213 was missing from the JSON files. It has been added in version 1.1. – The documentation has been updated with the changes above. The corpus was developed by Språkbanken at the National Library of Norway. NPSC is composed of audio recordings of meetings in Stortinget, orthographically transcribed into Bokmål or Nynorsk respectively. There is also metadata about the individual speakers, and the official proceedings of the debates are included in the corpus as well. The recordings amount to 140 hours of speech from a total of 267 different speakers, and contain 65,000 sentences and 1.2 million words. Transcription was first done automatically; the output of the automatic transcription was then manually checked and corrected by qualified linguists and philologists. To ensure consistency and accuracy, all transcriptions have been proofread. The corpus is primarily intended as an open-source dataset for ASR development (Automatic Speech Recognition). The individual audio files in the corpus contain recordings of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasts more than six hours, the first six hours of that day). Since these audio files are quite large, there are also individual audio files for each sentence.
Beta versions of the corpus were published in 2020 and 2021. We have run postprocessing scripts since the last version (0.2). This has led to changes in all transcriptions, and the transcriptions are formatted differently than in earlier versions. The old transcription files should therefore be replaced with the files in this version. We greatly appreciate feedback and suggestions for improvement. Contact us at sprakbanken@nb.no.
- description: This is version 1.1 of the Norwegian Parliamentary Speech Corpus (NPSC). The following changes have been made in the update from version 1.0 to 1.1: – The data has been split into official training, evaluation and test sets. – Manual dialect annotations have been added for each speaker. – The end time of one sentence in 20171208 (sentence_id 45886) was changed, because a 30-minute break was included in the sentence time span in version 1.0. The corresponding audio file (20171208-085509_6122400_6124160.wav) was shortened accordingly. – Some of the metadata for the transcriptions of 20171213 was missing from the JSON transcription files. It has been added in version 1.1. – The documentation has been updated to reflect these changes. The corpus was developed by the Norwegian Language Bank at the National Library of Norway between 2019 and 2021. The NPSC consists of audio recordings of meetings in Stortinget (the Norwegian parliament), with corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, as well as various metadata about the speakers. The official proceedings from the meetings are also included in the corpus for reference. The recordings add up to 140 hours of running speech (including pauses) from 267 unique speakers, and contain about 65,000 sentences and 1.2 million words in total. Transcription was first done automatically; subsequently, the output of the automatic process was manually checked and corrected by trained linguists and philologists. Finally, all transcriptions were proofread to ensure consistency and accuracy. NPSC is primarily intended as an open-source dataset for ASR development. The individual audio files in the corpus contain the speech of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasted more than six hours, the first six hours of the meeting). Since these audio files are quite large, individual audio files for each sentence are also included.
Beta releases of the NPSC were published in 2020 and 2021. Note that we have run postprocessing scripts since the last beta release (0.2); they affect all transcriptions, and the formatting of the transcriptions differs from previous releases. Users should therefore replace old transcription files with the files in this release. We greatly appreciate any feedback and suggestions for improvement. Please contact us at sprakbanken@nb.no.
- resource Short Name: NPSC 1.1
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-58/
- PID: hdl:21.11146/58
- identifier: sbr-58
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-58/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- distribution Rights Holder:
- actor Info:
- actor Type: organization
- role: Distribution Rights Holder
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: person
- role: Metadata Creator
- person Info:
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- corpus Info:
- corpus Type: Multimodal Corpus
- corpus Part Info:
- media Type: audio
- corpus Audio Info:
- audio Size Info:
- size Info:
- size: 140
- size Unit: hours
- size Info:
- size: 64541
- size Unit: sentences
- size Info:
- size: 1198590
- size Unit: words
- size Info:
- size: 96.4
- size Unit: gb
- size Info:
- size: 5
- size Unit: files
- duration Of Effective Speech Info:
- size: 126
- duration Unit: hours
- duration Of Audio Info:
- size: 140
- duration Unit: hours
- setting Info:
- audio Format Info:
- mime Type: audio/wav
- signal Encoding: linearPCM
- sampling Rate: 48000
- quantization: 16
- byte Order: littleEndian
- sign Convention: signedInteger
- number Of Tracks: 2
- recording Quality: medium
- corpus Text Info:
- text Format Info:
- mime Type: application/json
- size Per Text Format:
- size Info:
- size: 64541
- size Unit: sentences
- size Info:
- size: 1198590
- size Unit: words
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: monolingual
- language Info:
- language Id: no
- language Name: Norwegian
- size Per Language:
- size Info:
- size: 140
- size Unit: hours
- size Info:
- size: 64541
- size Unit: sentences
- size Info:
- size: 1198590
- size Unit: words
- size Info:
- size: 96.4
- size Unit: gb
- size Info:
- size: 5
- size Unit: files
- language Variety Info:
- language Variety Type: dialect
- language Variety Name: Norwegian dialects
- modality Info:
- modality Type: spokenLanguage
- modality Type Details: Formal speech
- annotation Info:
- annotation Type: speechAnnotation-orthographicTranscription
- time Coverage Info:
- time Coverage: 2017-02-07 – 2018-02-01
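The audio Format Info fields above (48000 Hz sampling rate, 16-bit quantization, 2 tracks, linear PCM) can be checked against a downloaded file. The following is a minimal sketch using Python's standard `wave` module, not part of the corpus distribution: the helper names are hypothetical, and a silent in-memory WAV stands in for a real corpus file so that the example runs on its own.

```python
import io
import wave

# Values copied from the "audio Format Info" fields of this record.
EXPECTED = {"sample_rate": 48000, "sample_width_bytes": 2, "channels": 2}


def read_wav_format(source):
    """Return header parameters of a WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def format_mismatches(actual, expected=EXPECTED):
    """Map each deviating field to an (actual, expected) pair."""
    return {k: (actual[k], v) for k, v in expected.items() if actual[k] != v}


def make_silent_wav(seconds=1):
    """Build an in-memory stand-in WAV (hypothetical test data, not corpus audio)."""
    buffer = io.BytesIO()
    frame_bytes = EXPECTED["sample_width_bytes"] * EXPECTED["channels"]
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(EXPECTED["channels"])
        wav.setsampwidth(EXPECTED["sample_width_bytes"])
        wav.setframerate(EXPECTED["sample_rate"])
        wav.writeframes(b"\x00" * (EXPECTED["sample_rate"] * seconds * frame_bytes))
    buffer.seek(0)
    return buffer


def uncompressed_gigabytes(hours, fmt=EXPECTED):
    """Raw PCM payload size in decimal gigabytes for the given duration."""
    bytes_per_second = fmt["sample_rate"] * fmt["sample_width_bytes"] * fmt["channels"]
    return hours * 3600 * bytes_per_second / 1e9


if __name__ == "__main__":
    print(format_mismatches(read_wav_format(make_silent_wav())))
    print(round(uncompressed_gigabytes(140), 1))
```

Passing a real file path to `read_wav_format` instead of the stand-in buffer performs the same check on corpus audio. As a rough plausibility check, 140 hours of audio in this format amounts to just under 97 GB of raw PCM data, which is in line with the size given in this record, bearing in mind that the hour figure is rounded.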
dc:type | corpus |
dc:title | Norwegian Parliamentary Speech Corpus 1.1 |
dc:identifier | oai:nb.no:sbr-58 |
dc:description | This is version 1.1 of the Norwegian Parliamentary Speech Corpus (NPSC). The following changes have been made in the update from version 1.0 to 1.1: – The data has been split into official training, evaluation and test sets. – Manual dialect annotations have been added for each speaker. – The end time of one sentence in 20171208 (sentence_id 45886) was changed, because a 30-minute break was included in the sentence time span in version 1.0. The corresponding audio file (20171208-085509_6122400_6124160.wav) was shortened accordingly. – Some of the metadata for the transcriptions of 20171213 was missing from the JSON transcription files. It has been added in version 1.1. – The documentation has been updated to reflect these changes. The corpus was developed by the Norwegian Language Bank at the National Library of Norway between 2019 and 2021. The NPSC consists of audio recordings of meetings in Stortinget (the Norwegian parliament), with corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, as well as various metadata about the speakers. The official proceedings from the meetings are also included in the corpus for reference. The recordings add up to 140 hours of running speech (including pauses) from 267 unique speakers, and contain about 65,000 sentences and 1.2 million words in total. Transcription was first done automatically; subsequently, the output of the automatic process was manually checked and corrected by trained linguists and philologists. Finally, all transcriptions were proofread to ensure consistency and accuracy. NPSC is primarily intended as an open-source dataset for ASR development. The individual audio files in the corpus contain the speech of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasted more than six hours, the first six hours of the meeting). Since these audio files are quite large, individual audio files for each sentence are also included.
Beta releases of the NPSC were published in 2020 and 2021. Note that we have run postprocessing scripts since the last beta release (0.2); they affect all transcriptions, and the formatting of the transcriptions differs from previous releases. Users should therefore replace old transcription files with the files in this release. We greatly appreciate any feedback and suggestions for improvement. Please contact us at sprakbanken@nb.no. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2019-08-01 |
dc:date | 2021-11-30 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | National Library of Norway |
dc:lang | Norwegian |
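The description mentions one sentence-level audio file by name, 20171208-085509_6122400_6124160.wav. The sketch below splits such a name into its parts. It assumes, without confirmation from this record, that the name follows the pattern `<date>-<time>_<start>_<end>.wav` and that the two trailing numbers are start and end offsets in milliseconds within the day's recording; the `Segment` type and `parse_segment_name` function are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed naming pattern: <YYYYMMDD>-<HHMMSS>_<start>_<end>.wav
SEGMENT_NAME = re.compile(r"^(\d{8})-(\d{6})_(\d+)_(\d+)\.wav$")


class Segment(NamedTuple):
    recording_start: datetime  # date and time encoded in the file name
    start_ms: int              # assumed: offset of the sentence start
    end_ms: int                # assumed: offset of the sentence end

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def parse_segment_name(file_name: str) -> Segment:
    """Parse a sentence-level file name; raise ValueError if it does not match."""
    match = SEGMENT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    date, time, start, end = match.groups()
    recording_start = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return Segment(recording_start, int(start), int(end))


if __name__ == "__main__":
    print(parse_segment_name("20171208-085509_6122400_6124160.wav"))
```

If the millisecond assumption holds, the difference between the two offsets gives the length of the sentence, which offers a quick way to spot unusually long segments such as the one corrected in this release.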