ONOMASTICA
Telenor, 1999

This database contains original data from the Norwegian part of the
ONOMASTICA project [1][2]. The data consist of pronunciations of first
names, family names, company names, street names and place names. The data
also include a set of foreign names.


FILE FORMATS
=============

The text files are in UNIX format, and the character set is ISO 8859-1
(Latin 1).


CATALOG STRUCTURE
=================

catalog          Contains               File name(s)     #Entries
---------------------------------------------------------------------
+ root
|                Copyright statement    copyright.txt
|                This file              README.txt
|
+distrib-+
         |
         +fnvn   First names            fnvnXXY.ono      239 393
         +envn   Surnames               envnXXY.ono       82 416
         +fnvd   Double first names     fnvdXXY.ono       27 096
         +gate   Street names           gateXXY.ono       88 161
         +sted   Place names            stedXXY.ono        6 180
         +bedr   Company names          bedrXXY.ono      100 179
         +mix    Foreign names          xx_nw.ono         13 074
---------------------------------------------------------------------

XX in file names is a running serial number. The inventory of first names
and surnames is based on the Norwegian telephone directory. A higher serial
number is associated with a lower probability of occurrence in the
directory.

Y in file names indicates the name source. For place names it can take the
following values:

  x   Postal place names
  k   Place names from official cartography
  t   Names used in the Telenor directory

The "mix" catalog contains files with "Norwegian" pronunciations of foreign
names.
xx in the file names of the "mix" catalog is a nationality code for the
words in the file:

  uk = United Kingdom
  pt = Portugal
  se = Sweden
  fr = France
  de = Germany
  nl = Netherlands


DATA RECORDS
============

The data are structured in records as follows:

Data (example):     Explanation
----------------------------------------------------------------------------
SOO:              : Record start
ENT:NO0000001     : National prefix (NO) and serial number
LBO:Hansen        : Orthographic name
FQO:31391         : Number of occurrences in source material
NO0:"hAn$s@n      : Pronunciation in SAMPA phonetic alphabet
NO1:              : Alternative pronunciation (SAMPA)
NO2:              : Alternative pronunciation (SAMPA)
QUO:1             : Quality level, see note 1)
WH0:MS,AF         : Transcriber ID(s)
ET0:NO            : Etymology (NO = Norwegian)
CT0:Surname       : Name class
EOO:              : Record end
----------------------------------------------------------------------------

1) QU0:  1: Checked by a phonetician who knows the name
         2: Checked by a phonetician who does not know the name
         3: Automatic transcription only


PHONEME CODES
=============

Phoneme codes are in the Speech Assessment Methods Phonetic Alphabet
(SAMPA) [3]. In addition to the phoneme codes defined for Norwegian, the
following codes have been used to transcribe English names:

SAMPA   Example word
-----------------------------
T       Tin   (thin)
D       Dis   (this)
dZ      dZin  (gin)
aI      raIse (rice)
eI      reIn  (rain)
OU      pOUk  (poke)    Note: non-standard; the correct SAMPA symbol is @U
-----------------------------

Syllabic consonants are followed by an asterisk (*). (This is not standard
SAMPA format.)

  n*   example: Botn     ""bOt$n*
  l*   example: Hodsle   ""HOd$l*

Ambisyllabic consonants (consonants shared by neighbouring syllables) are
indicated by including the consonant on both sides of the syllable boundary.
Examples:

  Anne     ""An$n@
  Inger    ""iN$N@r
  Bjarne   ""bjA:rn$rn@


[1] The Onomastica Consortium. The ONOMASTICA Interlanguage Pronunciation
    Lexicon.
    URL: http://www.inesc-id.pt/pt/indicadores/Ficheiros/3225.pdf

[2] Schmidt, M., Fitt, S., Scott, C. and Jack, M. A., 1993. "Phonetic
    transcription standards for European names (ONOMASTICA)". In
    EUROSPEECH'93, 279-282.
    URL: http://www.cstr.ed.ac.uk/downloads/publications/1993/Fitt_1993_a.ps

[3] Wells, J.C., 1997. "SAMPA computer readable phonetic alphabet". In
    Gibbon, D., Moore, R. and Winski, R. (eds.), 1997. Handbook of Standards
    and Resources for Spoken Language Systems. Berlin and New York: Mouton
    de Gruyter. Part IV, section B.
    URL: http://www.phon.ucl.ac.uk/home/sampa/norweg.htm
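

EXAMPLE: READING RECORDS
========================

The record layout described under DATA RECORDS can be read with a few lines
of Python. The following is a minimal sketch, not part of the original
distribution. It assumes that every line has the form TAG:VALUE, that the
record start and end tags are spelled exactly as in the example record
(SOO and EOO), and that "$" marks a syllable boundary, as the transcription
examples suggest. The function names parse_records and split_syllables are
hypothetical. Only the first colon on a line is treated as the separator,
because the pronunciation itself may contain colons (for example A: in
Bjarne).

```python
from typing import Dict, Iterable, Iterator, List, Optional


def parse_records(
    lines: Iterable[str],
    start_tag: str = "SOO",  # assumption: spelled as in the example record
    end_tag: str = "EOO",    # assumption: spelled as in the example record
) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, mapping each field tag to its value."""
    record: Optional[Dict[str, str]] = None
    for raw in lines:
        line = raw.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only; SAMPA uses ":" as a length mark.
        tag, value = line.split(":", 1)
        tag = tag.strip()
        if tag == start_tag:
            record = {}
        elif tag == end_tag:
            if record is not None:
                yield record
            record = None
        elif record is not None:
            record[tag] = value.strip()


def split_syllables(pronunciation: str) -> List[str]:
    """Split a transcription at "$" (assumed syllable boundary marker)."""
    return pronunciation.split("$") if pronunciation else []


# The example record from DATA RECORDS, one field per line.
sample = [
    "SOO:",
    "ENT:NO0000001",
    "LBO:Hansen",
    "FQO:31391",
    'NO0:"hAn$s@n',
    "NO1:",
    "NO2:",
    "QUO:1",
    "WH0:MS,AF",
    "ET0:NO",
    "CT0:Surname",
    "EOO:",
]

records = list(parse_records(sample))
print(records[0]["LBO"])                   # Hansen
print(split_syllables(records[0]["NO0"]))  # ['"hAn', 's@n']
```

To read an actual data file, pass an open file object instead of the sample
list and open it with encoding="iso-8859-1", matching the character set
stated under FILE FORMATS.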