Stortinget Speech Corpus version 1.0

The Stortinget Speech Corpus (SSC) is a speech dataset of more than 5000 hours, intended for weakly supervised ASR and built from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds, with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) taken from the official proceedings.

The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository and are referenced from the JSONL file by relative file paths. Note that only the segmented audio files are part of the release.
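
As a rough illustration of how such a release might be read, the sketch below iterates over a JSONL file and resolves a relative path against the dataset root. The key name `audio_path` and the file name `segments.jsonl` used in the comments are hypothetical placeholders, not taken from the corpus; the documentation files describe the actual names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_segments(jsonl_path: Path) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def resolve_path(dataset_root: Path, relative_path: str) -> Path:
    """Turn a relative path stored in a record into a path under the dataset root."""
    return dataset_root / relative_path


# Example use (hypothetical file and key names, adjust to the real release):
#   root = Path("ssc")
#   for record in iter_segments(root / "segments.jsonl"):
#       audio_file = resolve_path(root, record["audio_path"])
```

Reading line by line keeps memory use flat, which matters for a file with several hundred thousand records.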

Dataset statistics
– Number of segments: 724 783
– Total duration in hours: 5 190
– Number of unique speakers: 729

For more detailed information, see the documentation files.
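
As a quick sanity check on these figures, the averages they imply can be computed directly. This is only a sketch: the total duration is a rounded number, so the results are approximate.

```python
# Figures copied from the dataset statistics above.
NUM_SEGMENTS = 724_783
TOTAL_HOURS = 5_190  # rounded total, so the averages are approximate
NUM_SPEAKERS = 729

mean_segment_seconds = TOTAL_HOURS * 3600 / NUM_SEGMENTS
mean_hours_per_speaker = TOTAL_HOURS / NUM_SPEAKERS

print(f"Mean segment length: {mean_segment_seconds:.1f} s")
print(f"Mean speech per speaker: {mean_hours_per_speaker:.1f} h")
```

The mean segment length comes out at about 25.8 seconds, which is consistent with the 30-second upper limit on segments. The per-speaker mean (about 7.1 hours) says nothing about how speech is actually distributed across speakers.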

Extended metadata

resource Common Info:
resource Type: corpus
identification Info:
resource Name: Stortinget Speech Corpus version 1.0
resource Name: Stortinget Speech Corpus versjon 1.0
description: The Stortinget Speech Corpus (SSC) is a 5000+ hours speech dataset for weak supervision ASR created from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) from the official proceedings. The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository, and there are relative file paths in the JSONL file. Note that only segmented audio files are part of the release. Dataset statistics – Number of segments: 724 783 – Total duration in hours: 5 190 – Number of unique speakers: 729 For more detailed information, see the documentation files.
description: Stortinget Speech Corpus (SSC) er eit taledatasett på meir enn 5000 timar for svakt overvaka taleattkjenning laga av lydopptak og tekst frå Stortingsforhandlingane. Det inneheld taleeiningar på inntil 30 sekund med transkripsjonar på bokmål og nynorsk frå dei offisielle Stortingsforhandlingane. Datasettet vert distribuert som ei JSONL-fil. Lydfiler, tekstfiler og transkripsjonsfiler (med output frå taleattkjenninga) er inkluderte i datasettet, linka med relative filstiar i JSONL-fila. Merk at berre segmenterte lydfiler er del av korpuset. Statistikk – Antall segment: 724 783 – Total varigheit i timar: 5 190 – Antal unike talarar: 729 For meir detaljert informasjon, sjå dokumentasjonsfilene.
resource Short Name: SSC
resource Short Name: SSC
url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-91/
PID: hdl:21.11146/91
identifier: sbr-91
distribution Info:
licence Info:
user Category: Public
distribution Access Medium: downloadable
download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-91/
licence:
licence Family: Creative Commons (CC)
licence Name: Creative_Commons-ZERO (CC-ZERO)
licence Url: https://creativecommons.org/publicdomain/zero/1.0/
licensor:
actor Info:
actor Type: organization
role: Licensor
organization Info:
organization Name: National Library of Norway
organization Name: Nasjonalbiblioteket
organization Short Name: NLN
organization Short Name: NB
communication Info:
email: sprakbanken@nb.no
email: ai-lab@nb.no
url: https://www.nb.no/sprakbanken/
url: https://ai.nb.no
address: P.O. Box 2674 Solli
zip Code: 0203
city: Oslo
region: Oslo
country: Norway
contact
- actor Info:
- actor Type: organization
- role: Contact
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
metadata Info:
metadata Creation Date: 16.11.2023
metadata Language Name: English
metadata Language Name: Norwegian Nynorsk
metadata Language Id: en
metadata Language Id: nn
metadata Last Date Updated: 12.01.2024
metadata Creator
- actor Info:
- actor Type: organization
- role: Metadata Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info:
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
version Info:
version: 1.0
last Date Updated: 15.11.2023
validation Info:
validated: false
resource Documentation Info:
documentation Unstructured:
role: documentation
document Unstructured: https://www.nb.no/sbfil/talegjenkjenning/ssc/SSC_1.pdf
resource Creation Info:
creation Start Date: 01.08.2019
creation End Date: 15.11.2023
resource Creator
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank / The AI-lab
- department Name: Språkbanken / AI-laben
- communication Info:
- email: sprakbanken@nb.no
- email: ai-lab@nb.no
- url: https://www.nb.no/sprakbanken/
- url: https://ai-lab.nb.no/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info:
- actor Type: organization
- role: Resource Creator
- organization Info:
- organization Name: Norwegian University of Science and Technology
- organization Name: Noregs teknisk-naturvitskaplege universitet
- organization Short Name: NTNU
- organization Short Name: NTNU
- department Name: Department of Electronic Systems
- department Name: Institutt for elektroniske system

Download metadata https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-91&metadataPrefix=cmdi
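
The link above is an OAI-PMH `GetRecord` request. A minimal sketch of how the same URL can be assembled from its parts follows; the function name `get_record_url` is made up for illustration, and `cmdi` is the only metadata prefix confirmed by this page.

```python
from urllib.parse import urlencode

OAI_ENDPOINT = "https://www.nb.no/sprakbanken/oai"


def get_record_url(identifier: str, metadata_prefix: str = "cmdi") -> str:
    """Build an OAI-PMH GetRecord URL for one catalogue record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        },
        safe=":",  # keep the colons in the OAI identifier unescaped
    )
    return f"{OAI_ENDPOINT}?{query}"


url = get_record_url("oai:nb.no:sbr-91")
print(url)
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, which should return the record as XML.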

dc:type	corpus
dc:title	Stortinget Speech Corpus version 1.0
dc:identifier	oai:nb.no:sbr-91
dc:description	The Stortinget Speech Corpus (SSC) is a 5000+ hours speech dataset for weak supervision ASR created from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) from the official proceedings. The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository, and there are relative file paths in the JSONL file. Note that only segmented audio files are part of the release. Dataset statistics – Number of segments: 724 783 – Total duration in hours: 5 190 – Number of unique speakers: 729 For more detailed information, see the documentation files.
dc:publisher
dc:format	downloadable
dc:date	2019-08-01
dc:date	2023-11-15
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-ZERO (CC-ZERO)
dc:rights	https://creativecommons.org/publicdomain/zero/1.0/
dc:creator	National Library of Norway
dc:creator	Norwegian University of Science and Technology
dc:lang	Norwegian
