Stortinget Speech Corpus version 1.0
Extended metadata
- resource Common Info:
- resource Type: corpus
- identification Info:
- resource Name: Stortinget Speech Corpus version 1.0
- resource Name: Stortinget Speech Corpus versjon 1.0
- description: The Stortinget Speech Corpus (SSC) is a speech dataset of more than 5000 hours for weakly supervised ASR, created from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) taken from the official proceedings. The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository and are referenced by relative file paths in the JSONL file. Note that only segmented audio files are part of the release. Dataset statistics: number of segments: 724 783; total duration in hours: 5 190; number of unique speakers: 729. For more detailed information, see the documentation files.
- resource Short Name: SSC
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-91/
- PID: hdl:21.11146/91
- identifier: sbr-91
- distribution Info:
- licence Info:
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-91/
- licence:
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- licensor:
- actor Info:
- actor Type: organization
- role: Licensor
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- communication Info:
- email: sprakbanken@nb.no
- email: ai-lab@nb.no
- url: https://www.nb.no/sprakbanken/
- url: https://ai.nb.no
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- contact:
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Metadata Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank / The AI-lab
- department Name: Språkbanken / AI-laben
- corpus Info:
- corpus Type: Multimodal Corpus
- corpus Part Info:
- media Type: audio
- corpus Audio Info:
- audio Size Info:
- size Info:
- size: 5190
- size Unit: hours
- size Info:
- size: 724783
- size Unit: units
- duration Of Audio Info:
- size: 5190
- duration Unit: hours
- audio Format Info:
- mime Type: audio/mpeg
- sampling Rate: 16000
- corpus Part Info:
- media Type: text
- corpus Text Info:
- text Format Info:
- mime Type: text/jsonl
- character Encoding Info:
- character Encoding: UTF-8
- corpus Part General Info:
- linguality Info:
- linguality Type: monolingual
- language Info:
- language Id: no
- language Name: Norwegian
- size Per Language:
- size Info:
- size: 724783
- size Unit: units
- size Info:
- size: 5190
- size Unit: hours
- size Info:
- size: 62
- size Unit: gb
- language Variety Info:
- language Variety Type: other
- language Variety Name: formal
- modality Info:
- modality Type: spokenLanguage
- modality Type Details: formal speech, parliamentary speech
- annotation Info:
- annotation Type: alignment
dc:type | corpus |
dc:title | Stortinget Speech Corpus version 1.0 |
dc:identifier | oai:nb.no:sbr-91 |
dc:description | Stortinget Speech Corpus (SSC) is a speech dataset of more than 5000 hours for weakly supervised speech recognition, created from audio recordings and text from the Storting proceedings. It contains speech segments of up to 30 seconds with transcriptions in Bokmål and Nynorsk from the official Storting proceedings. The dataset is distributed as a JSONL file. Audio files, text files and transcription files (with output from the speech recognition) are included in the dataset and are linked by relative file paths in the JSONL file. Note that only segmented audio files are part of the corpus. Statistics: number of segments: 724 783; total duration in hours: 5 190; number of unique speakers: 729. For more detailed information, see the documentation files. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2019-08-01 |
dc:date | 2023-11-15 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | National Library of Norway |
dc:creator | Norwegian University of Science and Technology |
dc:lang | Norwegian |
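
The description states that the dataset is distributed as a JSONL file containing relative file paths to the audio, proceedings and transcription files. A minimal sketch of reading such a file in Python follows. The field name `audio_path` and the function name `iter_segments` are hypothetical, and the sketch assumes that the relative paths are resolved against the directory holding the JSONL file; the documentation files shipped with the corpus give the actual field names and layout.

```python
import json
from pathlib import Path


def iter_segments(jsonl_path, audio_key="audio_path"):
    """Yield (record, audio_file) pairs from a JSONL file.

    `audio_key` is a hypothetical field name, not taken from the corpus
    documentation. Relative paths are assumed to be relative to the
    directory that contains the JSONL file.
    """
    jsonl_path = Path(jsonl_path)
    base_dir = jsonl_path.parent
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Skip blank lines between records.
                continue
            record = json.loads(line)
            relative = record.get(audio_key)
            # Records without the audio field yield None for the path.
            audio_file = base_dir / relative if relative is not None else None
            yield record, audio_file
```

Because the function is a generator, it reads one line at a time, so the full set of 724 783 segment records never has to be held in memory at once.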