Norwegian Voice Control Corpus
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Norwegian Voice Control Corpus
- resource Name: Norsk talestyringskorpus
- description: The Norwegian Voice Control Corpus (NVCC) is a text and speech corpus consisting of written queries in Norwegian Bokmål and Nynorsk within a number of intents, and voice recordings of these queries. The queries are the type of commands typically given to mobile phones to trigger certain functions, and the intents reflect the functions a mobile phone typically has. NVCC consists of 10,706 queries within 183 different intents. The intents are sorted into 24 intent groups, which are further organised into 9 domains. 9,834 of the queries were recorded, read by eleven different speakers from five dialect groups. Each query has been segmented into individual audio files. The transcriptions, written queries and information about the audio segments and speakers are organised in CSV files. See the documentation file for detailed information. NVCC is open-source and primarily intended as training data for the kind of voice-controlled assistants found in mobile phones. However, as the text and speech parts of the corpus can be used separately, the corpus might also be useful for developing text-based language technology, such as chatbots. NVCC is developed by the Language Bank at the National Library of Norway. We greatly appreciate any feedback and suggestions for improvement. Please contact us at sprakbanken@nb.no.
- description: The Norwegian Voice Control Corpus (Norsk talestyringskorpus, English abbreviation NVCC) is a text and speech corpus consisting of written and recorded sentences (queries). These are queries typically used to control, for example, mobile phones by voice, and they are adapted to typical mobile phone functions. NVCC contains 10,706 written queries in both Bokmål and Nynorsk. The queries are divided into 183 different intents, distributed across 24 intent groups within nine overarching domains. 9,834 of the queries were read aloud by 11 speakers from five different dialect areas in order to capture dialect variation. The recordings are transcribed in a mixture of Nynorsk and Bokmål to stay as close as possible to the speakers' dialects. The transcriptions and metadata about the speakers (dialect, age, gender) are included in the corpus. See the documentation file for more detailed information. NVCC is an open-source dataset for developing voice-controlled mobile assistants, but it may also be useful for developing text-based language technology such as chatbots. NVCC is developed by Språkbanken (the Language Bank) at the National Library of Norway. We greatly appreciate feedback and suggestions for improvement. Contact us at sprakbanken@nb.no.
- resource Short Name: NVCC
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-75/
- PID: hdl:21.11146/75
- identifier: sbr-75
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-75/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- contact
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Metadata Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- corpus Info:
- corpus Type: Multimodal Corpus
- corpus Part Info:
- media Type: audio
- corpus Audio Info:
- audio Size Info:
- size Info:
- size: 19668
- size Unit: files
- audio Content Info:
- speech Items: other
- non Speech Items: other
- textual Description: intents
- setting Info:
- naturality: prompted
- conversational Type: monologue
- scenario Type: other
- audience: no
- interactivity: other
- audio Format Info:
- mime Type: audio/x-wav
- sampling Rate: 48000
- quantization: 24
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: text/csv
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: monolingual
- language Info:
- language Id: no
- language Name: Norwegian
- modality Info:
- modality Type: spokenLanguage
- modality Info:
- modality Type: writtenLanguage
- annotation Info:
- annotation Type: speechAnnotation-orthographicTranscription
dc:type | corpus |
dc:title | Norwegian Voice Control Corpus |
dc:identifier | oai:nb.no:sbr-75 |
dc:description | The Norwegian Voice Control Corpus (NVCC) is a text and speech corpus consisting of written queries in Norwegian Bokmål and Nynorsk within a number of intents, and voice recordings of these queries. The queries are the type of commands typically given to mobile phones to trigger certain functions, and the intents reflect the functions a mobile phone typically has. NVCC consists of 10,706 queries within 183 different intents. The intents are sorted into 24 intent groups, which are further organised into 9 domains. 9,834 of the queries were recorded, read by eleven different speakers from five dialect groups. Each query has been segmented into individual audio files. The transcriptions, written queries and information about the audio segments and speakers are organised in CSV files. See the documentation file for detailed information. NVCC is open-source and primarily intended as training data for the kind of voice-controlled assistants found in mobile phones. However, as the text and speech parts of the corpus can be used separately, the corpus might also be useful for developing text-based language technology, such as chatbots. NVCC is developed by the Language Bank at the National Library of Norway. We greatly appreciate any feedback and suggestions for improvement. Please contact us at sprakbanken@nb.no. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2020-01-06 |
dc:date | 2022-12-15 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | National Library of Norway |
dc:lang | Norwegian |
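
The record above declares the audio part as WAV at a sampling rate of 48000 Hz with 24-bit quantization, and the text part as UTF-8 CSV. The following is a minimal sketch, using only the Python standard library, of how a downloaded file might be checked against those declared values. The function names are hypothetical and not part of the corpus. The record does not state the CSV delimiter or column names, so the comma delimiter used here is an assumption; consult the documentation file for the actual layout.

```python
import csv
import io
import wave

# Values taken from the metadata record above.
EXPECTED_SAMPLE_RATE = 48000  # "sampling Rate: 48000"
EXPECTED_SAMPLE_WIDTH = 3     # "quantization: 24" bits = 3 bytes per sample


def wav_matches_record(source):
    """Return True if a PCM WAV file has the declared rate and bit depth.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )


def read_csv_rows(path):
    """Read a UTF-8 CSV file into a list of dicts keyed by its header row.

    Assumption: comma-delimited with a header row (not stated in the record).
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def make_silent_wav(rate=EXPECTED_SAMPLE_RATE, width=EXPECTED_SAMPLE_WIDTH, frames=480):
    """Build a short mono silent WAV in memory, to try the check without the corpus."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (width * frames))
    buffer.seek(0)
    return buffer
```

For instance, `wav_matches_record(make_silent_wav())` evaluates to `True`, while a synthetic 16-bit file built with `make_silent_wav(width=2)` fails the check. The same function can be pointed at the path of a corpus audio file once it has been downloaded.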