NST Pronunciation Lexicon for Norwegian Bokmål

This pronunciation lexicon for Norwegian Bokmål was originally produced by Nordic Language Technology (NST), and contains approximately 785,000 entries. The word list is based on the 100,000 most frequent word forms in NST’s Norwegian text corpus.

The lexicon is available as one large CSV file. Each entry (line) contains 51 fields, separated by semicolons. Not all fields are equally relevant for all purposes, but the format makes it easy to extract the relevant ones.
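As an illustration of how the fields could be extracted, here is a minimal Python sketch. It relies only on what is stated above (one entry per line, 51 semicolon-separated fields). The function names, the field positions passed to `select_fields`, and the choice to skip lines with a different field count are assumptions made for the example, not part of the lexicon's documentation. The file name and encoding are not given here, so the sketch works on any iterable of lines.

```python
import csv

EXPECTED_FIELDS = 51  # field count stated in the description


def iter_entries(lines, delimiter=";"):
    """Yield each well-formed lexicon entry as a list of 51 strings.

    Quoting is disabled because SAMPA uses the double-quote character
    as a stress mark, so it must be read as ordinary text.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != EXPECTED_FIELDS:
            # Assumption: lines with another field count are skipped.
            continue
        yield row


def select_fields(lines, indices):
    """Yield tuples holding only the requested field positions (0-based)."""
    for row in iter_entries(lines):
        yield tuple(row[i] for i in indices)


if __name__ == "__main__":
    # Synthetic line for demonstration only; not real lexicon data.
    demo = ";".join(["word", '"A:'] + [""] * 49) + "\n"
    for fields in select_fields([demo], [0, 1]):
        print(fields)
```

To run this on the real file, pass an open file object instead of the demo list, opened with `newline=""` and whichever encoding the downloaded file actually uses.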

The lexicon contains, among other things, information about the decomposition of compounds and one or more phonetic transcriptions. The phonetic transcription was done partly by hand, but to a large extent automatically, using an inflector. Parts of the output of this process were manually checked afterwards. The inflector and other lexical tools that can be used in processing the lexicon can be downloaded as a separate file.

The transcription format is SAMPA (Speech Assessment Methods Phonetic Alphabet). See http://www.phon.ucl.ac.uk/home/sampa/index.html.

A script for converting the SAMPA transcriptions to IPA can be found on GitHub (https://github.com/peresolb/sampa_to_ipa).
