
US005930754A

United States Patent
Patent Number: 5,930,754
Karaali et al.
Date of Patent: Jul. 27, 1999

[54] METHOD, DEVICE AND ARTICLE OF MANUFACTURE FOR NEURAL-NETWORK BASED ORTHOGRAPHY-PHONETICS TRANSFORMATION

[75] Inventors: Orhan Karaali, Rolling Meadows; Corey Andrew Miller, Chicago, both of Ill.

[73] Assignee: Motorola, Inc., Schaumburg, Ill.

[21] Appl. No.: 08/874,900

[22] Filed: Jun. 13, 1997

[51] Int. Cl.: G10L 5/06
[52] U.S. Cl.: 704/259; 704/258; 704/232
[58] Field of Search: 704/259, 258, 232

[56] References Cited: U.S. PATENT DOCUMENTS

    4,829,580   5/1989   Church
    5,040,218   8/1991   Vitale et al.
    5,668,926   9/1997   Karaali et al.   704/259
    5,687,286   11/1997  Bar-Yam          704/259

OTHER PUBLICATIONS

    "The Structure and Format of the DARPA TIMIT CD-ROM Prototype", John S. Garofolo, National Institute of Standards and Technology.
    "Parallel Networks that Learn to Pronounce English Text", Terrence J. Sejnowski and Charles R. Rosenberg, Complex Systems 1, 1987, pp. 145-168.

Primary Examiner: David R. Hudspeth
Assistant Examiner: Daniel Abebe
Attorney, Agent, or Firm: Darleen J. Stockley

[57] ABSTRACT

A method (2000), device (2200) and article of manufacture (2300) provide, in response to orthographic information, efficient generation of a phonetic representation. The method provides for, in response to orthographic information, efficient generation of a phonetic representation, using the steps of inputting an orthography of a word and a predetermined set of input features, and utilizing a neural network that has been trained using automatic letter-phone alignment and predetermined letter features to provide a neural network hypothesis of a word pronunciation.

61 Claims, 14 Drawing Sheets

[Cover drawing (see FIG. 22, device 2200): orthography (2202) and input letter features (2204) feed a microprocessor/application-specific integrated circuit/combination thereof (2208) containing an encoder, automatic letter-phone alignment, and a pretrained orthography-pronunciation neural network with predetermined letter features (2214), producing a neural network hypothesis of a word pronunciation (2216).]

U.S. Patent Jul. 27, 1999 Sheet 1 of 14 5,930,754

[FIG. 1 (prior art, 100): text in orthographic form (102) → linguistic module (104) → phonetic representation (106) → acoustic module (108) → speech (110).]

[FIG. 2 (200): orthography-phonetics lexicon (202) → neural network input coding (204) → neural network training (206) → feature-based error backpropagation (208).]

[FIG. 3 (300): text (302) → linguistic preprocessing module (304) → pronunciation determination (318), comprising an orthography-pronunciation lexicon (306), a lexicon presence decision (308) and the neural network orthography-pronunciation converter (310) → postlexical module (312) → acoustic module (314) → speech (316); linguistic module 320.]

[FIG. 4 (400): alignment (406) of the orthography coat (402), encoded as Stream 2 (404), with the pronunciation /kowt/ (408), encoded with alignment separators as Stream 1 (410).]

[FIG. 5 (500): one letter-one phone alignment by location of the orthography school with the pronunciation /skuwl/.]

[FIG. 6 (600, prior art): dynamic-programming alignment by location of the orthographies industry and interest.]

[FIG. 7 (700): letter features for the orthography coat (702), loaded into Stream 3: features for c (710), e.g. +obstruent, +continuant, +alveolar, +velar, +aspirated; features for o (712), a (714) and t (716).]

[FIG. 8 (800, prior art): seven-letter window for neural network input.]

[FIG. 9 (900): whole-word storage buffer for neural network input.]

[FIG. 10 (1000): target pronunciation /raepihd/ and the neural network hypotheses /raepaxd/ and /raepbd/ compared under the Euclidean and feature-based error measures.]

[FIG. 11 (1100, prior art): Euclidean distance calculation; for target /ih/ (dk) and hypothesis /ax/ (ok) the local error is (dk - ok)^2.]

[FIG. 12 (1200): feature-based distance calculation; the local error is scaled by the multiplier M = .1 when /ax/ is hypothesized for /ih/.]

[FIG. 13 (1300): orthography-phonetics neural network architecture for training: Stream 2 (input letters, 1302) and Stream 3 (input letter features, 1316) → input blocks 1 (1304) and 2 (1318) → sigmoid neural network blocks 3 (1306) and 4 (1320) → softmax neural network blocks 5 (1308), 6 (1310), 7 (1312) and 8 (1314) → output block 9 (1324) → neural network hypothesis (1326), compared against the Stream 1 target phones (1322).]

[FIG. 14 (1400): neural network orthography-pronunciation converter: orthography (1402) → neural network input coding (1404) → trained neural network → decoding (1408) → pronunciation (1410).]

[FIG. 15 (1500): encoding of Stream 2 of FIG. 13 for testing: the letters of the orthography (1502) loaded into the storage buffer (1504).]

[FIG. 16 (1600): decoding of the neural network hypothesis (1602) into a phonetic representation: /k ow + t/ (1604) with the alignment separator removed to yield /k ow t/ (1606).]

[FIG. 17 (1700): orthography-phonetics neural network architecture for testing: Stream 2 (input letters, 1702) and Stream 3 (input letter features, 1716) → input blocks 1 (1704) and 2 (1718) → sigmoid neural network blocks 3 (1706) and 4 (1720) → softmax neural network blocks 5-8 → output block 9 (1722) → neural network hypothesis (1724).]

[FIG. 18: orthography-phonetics neural network for testing on an eleven-letter orthography.]

[FIG. 19 (1900): orthography-phonetics neural network with a double phone buffer: input letters (1902) and input letter features (1916) pass through input, sigmoid (1906, 1920) and softmax blocks 5-8; phones and phone features (1908) are reprocessed by a second bank of softmax blocks 9-12 (1922) to form the neural network hypothesis (1934).]

[FIG. 20 (method 2000): inputting an orthography of a word and a predetermined set of input letter features (2002); utilizing a neural network that has been trained using automatic letter-phone alignment and predetermined letter features to provide a neural network hypothesis of a word pronunciation (2004).]

[FIG. 21 (training method 2100): providing a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography (2102); aligning the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function (2104); providing acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter (2106); providing a predetermined amount of context information (2108); training the neural network to associate the input orthography with a phonetic representation (2110); providing a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers (2112); employing a feature-based error function to characterize a distance between target and hypothesized pronunciations during training.]

[FIG. 22 (device 2200): orthography (2202) and input letter features (2204) → microprocessor/application-specific integrated circuit/combination microprocessor and application-specific integrated circuit (2208), containing an encoder, automatic letter-phone alignment and a pretrained orthography-pronunciation neural network with predetermined letter features (2214) → neural network hypothesis of a word pronunciation (2216).]

[FIG. 23 (article of manufacture 2300): orthography of a word (2302) and predetermined set of input letter features (2304) → inputting unit (2308) → utilization unit (2312), containing automatic letter-phone alignment, a neural network and predetermined letter features (2314) → neural network hypothesis of a word pronunciation (2316).]

[FIG. 24 (2400): a pronunciation lexicon (2404) is used to train an untrained neural network (2402); the trained neural network (2408) makes storage of part of the pronunciation lexicon (2410) unnecessary.]

METHOD, DEVICE AND ARTICLE OF MANUFACTURE FOR NEURAL-NETWORK BASED ORTHOGRAPHY-PHONETICS TRANSFORMATION

FIELD OF THE INVENTION

The present invention relates to the generation of phonetic forms from orthography, with particular application in the field of speech synthesis.

BACKGROUND OF THE INVENTION

As shown in FIG. 1, numeral 100, text-to-speech synthesis is the conversion of written or printed text (102) into speech (110). Text-to-speech synthesis offers the possibility of providing voice output at a much lower cost than recording speech and playing that speech back. Speech synthesis is often employed in situations where the text is likely to vary a great deal and where it is simply not possible to record the text beforehand.

Speech synthesizers need to convert text (102) to a phonetic representation (106) that is then passed to an acoustic module (108) which converts the phonetic representation to a speech waveform (110). In a language like English, where the pronunciation of words is often not obvious from the orthography of words, it is important to convert orthographies (102) into unambiguous phonetic representations (106) by means of a linguistic module (104); these are then submitted to an acoustic module (108) for the generation of speech waveforms (110).

In order to produce the most accurate phonetic representations, a pronunciation lexicon is required. However, it is simply not possible to anticipate all possible words that a synthesizer may be required to pronounce. For example, many names of people and businesses, as well as neologisms and novel blends and compounds, are created every day. Even if it were possible to enumerate all such words, the storage requirements would exceed the feasibility of most applications.

In order to pronounce words that are not found in pronunciation dictionaries, prior researchers have employed letter-to-sound rules, more or less of the form: orthographic c becomes phonetic /s/ before orthographic e and i, and phonetic /k/ elsewhere. As is customary in the art, pronunciations will be enclosed in slashes: //. For a language like English, several hundred such rules associated with a strict ordering are required for reasonable accuracy. Such a rule set is extremely labor-intensive to create and difficult to debug and maintain, in addition to the fact that such a rule set cannot be used for a language other than the one for which the rule set was created.

Another solution that has been put forward has been a neural network that is trained on an existing pronunciation lexicon and that learns to generalize from the lexicon in order to pronounce novel words. Previous neural network approaches have suffered from the requirement that letter-phone correspondences in the training data be aligned by hand. In addition, such prior neural networks failed to associate letters with the phonetic features of which the letters might be composed. Finally, evaluation metrics were based solely on insertions, substitutions and deletions, without regard to the featural composition of the phones involved.

Therefore, there is a need for an automatic procedure for learning to generate phonetics from orthography that does not require rule sets or hand alignment, that takes advantage of the phonetic featural content of orthography, and that is evaluated, and whose error is backpropagated, on the basis of the featural content of the generated phones. A method, device and article of manufacture for neural-network based orthography-phonetics transformation is needed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of the transformation of text to speech as is known in the art.

FIG. 2 is a schematic representation of one embodiment of the neural network training process used in the training of the orthography-phonetics converter in accordance with the present invention.

FIG. 3 is a schematic representation of one embodiment of the transformation of text to speech employing the neural network orthography-phonetics converter in accordance with the present invention.

FIG. 4 is a schematic representation of the alignment and neural network encoding of the orthography coat with the phonetic representation /kowt/ in accordance with the present invention.

FIG. 5 is a schematic representation of the one letter-one phoneme alignment of the orthography school and the pronunciation /skuwl/ in accordance with the present invention.

FIG. 6 is a schematic representation of the alignment of the orthography industry with the orthography interest, as is known in the art.

FIG. 7 is a schematic representation of the neural network encoding of letter features for the orthography coat in accordance with the present invention.

FIG. 8 is a schematic representation of a seven-letter window for neural network input as is known in the art.

FIG. 9 is a schematic representation of a whole-word storage buffer for neural network input in accordance with the present invention.

FIG. 10 presents a comparison of the Euclidean error measure with one embodiment of the feature-based error measure in accordance with the present invention for calculating the error distance between the target pronunciation /raepihd/ and each of the two possible neural network hypotheses: /raepaxd/ and /raepbd/.

FIG. 11 illustrates the calculation of the Euclidean distance measure as is known in the art for calculating the error distance between the target pronunciation /raepihd/ and the neural network hypothesis pronunciation /raepaxd/.

FIG. 12 illustrates the calculation of the feature-based distance measure in accordance with the present invention for calculating the error distance between the target pronunciation /raepihd/ and the neural network hypothesis pronunciation /raepaxd/.

FIG. 13 is a schematic representation of the orthography-phonetics neural network architecture for training in accordance with the present invention.

FIG. 14 is a schematic representation of the neural network orthography-phonetics converter in accordance with the present invention.

FIG. 15 is a schematic representation of the encoding of Stream 2 of FIG. 13 of the orthography-phonetics neural network for testing in accordance with the present invention.

FIG. 16 is a schematic representation of the decoding of the neural network hypothesis into a phonetic representation in accordance with the present invention.

FIG. 17 is a schematic representation of the orthography-phonetics neural network architecture for testing in accordance with the present invention.

FIG. 18 is a schematic representation of the orthography-phonetics neural network for testing on an eleven-letter orthography in accordance with the present invention.

FIG. 19 is a schematic representation of the orthography-phonetics neural network with a double phone buffer in accordance with the present invention.

FIG. 20 is a flowchart of one embodiment of steps for inputting orthographies and letter features and utilizing a neural network to hypothesize a pronunciation in accordance with the present invention.

FIG. 21 is a flowchart of one embodiment of steps for training a neural network to transform orthographies into pronunciations in accordance with the present invention.

FIG. 22 is a schematic representation of a microprocessor/application-specific integrated circuit/combination microprocessor and application-specific integrated circuit for the transformation of orthography into pronunciation by neural network in accordance with the present invention.

FIG. 23 is a schematic representation of an article of manufacture for the transformation of orthography into pronunciation by neural network in accordance with the present invention.

FIG. 24 is a schematic representation of the training of a neural network to hypothesize pronunciations from a lexicon that will no longer need to be stored in the lexicon due to the neural network in accordance with the present invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

The present invention provides a method and device for automatically converting orthographies into phonetic representations by means of a neural network trained on a lexicon consisting of orthographies paired with corresponding phonetic representations. The training results in a neural network with weights that represent the transfer function required to produce phonetics from orthography. FIG. 2, numeral 200, provides a high-level view of the neural network training process, including the orthography-phonetics lexicon (202), the neural network input coding (204), the neural network training (206) and the feature-based error backpropagation (208). The method, device and article of manufacture for neural-network based orthography-phonetics transformation of the present invention offers a financial advantage over the prior art in that the system is automatically trainable and can be adapted to any language with ease.

In order to train a neural network to learn orthography-phonetics mapping, an orthography-phonetics lexicon (202) is obtained. Table 1 displays an excerpt from an orthography-phonetics lexicon.

TABLE 1

    Orthography    Pronunciation
    cat            kaet
    dog            daog
    school         skuwl
    coat           kowt

The lexicon stores pairs of orthographies with associated pronunciations. In this embodiment, orthographies are represented using the letters of the English alphabet, shown in Table 2.

TABLE 2

    The 26 letters a through z, numbered 1 through 26 in alphabetical
    order (a=1, b=2, c=3, ..., z=26).

In this embodiment, the pronunciations are described using a subset of the TIMIT phones from Garofolo, John S., "The Structure and Format of the DARPA TIMIT CD-ROM Prototype", National Institute of Standards and Technology, 1988. The phones are shown in Table 3, along with representative orthographic words illustrating the phones' sounds. The letters in the orthographies that account for the particular TIMIT phones are shown in bold.

TABLE 3

    Number  TIMIT phone  Sample word    Number  TIMIT phone  Sample word
    1       p            pop            21      aa           father
    2       t            tot            22      uw           loop
    3       k            kick           23      er           bird
    4       (illegible)                 24      ay           high
    5       (illegible)                 25      ey           bay
    6       ng           sing           26      aw           out
    7       s            set            27      ax           sofa
    8       z            zoo            28      b            barn
    9       ch           chop           29      d            dog
    10      th           thin           30      g            go
    11      f            ford           31      sh           shoe
    12      l            long           32      zh           garage
    13      r            red            33      dh           this
    14      y            young          34      v            vice
    15      hh           heavy          35      w            walk
    16      eh           bed            36      ih           gift
    17      ao           saw            37      ae           fast
    18      ah           rust           38      uh           book
    19      oy           boy            39      iy           bee
    20      ow           low

    (Rows 4 and 5 are not legible in this copy.)
FIG. 3, numeral 300, shows where the trained neural network orthography-phonetics converter (310) fits into the linguistic module of a speech synthesizer (320) in one preferred embodiment of the present invention, including text (302); preprocessing (304); a pronunciation determination module (318) consisting of an orthography-phonetics lexicon (306), a lexicon presence decision unit (308), and a neural network orthography-phonetics converter (310); a postlexical module (312); and an acoustic module (314) which generates speech (316).

In order for the neural network to be trained on the lexicon, the lexicon must be coded in a particular way that maximizes learnability; this is the neural network input coding in numeral (204).

The input coding for training consists of the following components: alignment of letters and phones, extraction of letter features, converting the input from letters and phones to numbers, loading the input into the storage buffer, and training using feature-driven error backpropagation. The input coding for training requires the generation of three streams of input to the neural network simulator. Stream 1 contains the phones of the pronunciation interspersed with any alignment separators, Stream 2 contains the letters of the orthography, and Stream 3 contains the features associated with each letter of the orthography.
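The three-stream coding just described can be sketched as follows. This is a minimal illustration, assuming the letter numbering of Table 2 and an excerpt of the phone numbering of Table 3; the numeric code chosen for the alignment separator "+" is an assumption for illustration, since the text does not fix it.

```python
# Sketch of building Streams 1-3 for one aligned orthography/pronunciation
# pair, e.g. "coat" aligned with /k ow + t/. Letter numbers follow Table 2
# (a=1 ... z=26); phone numbers follow Table 3. The separator code (0) and
# the per-letter feature numbers in the example are illustrative only.

LETTER_NUM = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
PHONE_NUM = {"p": 1, "t": 2, "k": 3, "ow": 20, "aa": 21}  # excerpt of Table 3
SEPARATOR = 0  # assumed code for the alignment separator "+"

def encode_streams(letters, aligned_phones, letter_features):
    """Build Streams 1-3 for one aligned orthography/pronunciation pair."""
    stream1 = [SEPARATOR if p == "+" else PHONE_NUM[p] for p in aligned_phones]
    stream2 = [LETTER_NUM[ch] for ch in letters]
    # Stream 3: each feature number is offset by 100 * its letter location,
    # as the patent describes for distinguishing features per letter.
    stream3 = [100 * loc + f for loc, feats in enumerate(letter_features)
               for f in feats]
    return stream1, stream2, stream3

# Letter features abbreviated to a few feature numbers per letter purely
# for illustration (see Table 9 for the real feature numbering).
s1, s2, s3 = encode_streams("coat", ["k", "ow", "+", "t"],
                            [[4, 6], [1, 2], [1, 2], [4]])
print(s1)  # [3, 20, 0, 2]
print(s2)  # [3, 15, 1, 20]
print(s3)  # [4, 6, 101, 102, 201, 202, 304]
```

The 100-per-location offset is the same scheme the patent applies when loading Stream 3 into the word-sized storage buffer.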
FIG. 4, numeral 400, illustrates the alignment (406) of an orthography (402) and a phonetic representation (408), the encoding of the orthography as Stream 2 (404) of the neural network input encoding for training, and the encoding of the phonetic representation as Stream 1 (410) of the neural network input encoding for training. An input orthography, coat (402), and an input pronunciation from a pronunciation lexicon, /kowt/ (408), are submitted to an alignment procedure (406).

Alignment of letters and phones is necessary to provide the neural network with a reasonable sense of which letters correspond to which phones. In fact, accuracy results more than doubled when aligned pairs of orthographies and pronunciations were used compared to unaligned pairs. Alignment of letters and phones means to explicitly associate particular letters with particular phones in a series of locations.

FIG. 5, numeral 500, illustrates an alignment of the orthography school with the pronunciation /skuwl/ with the constraint that only one phone and only one letter is permitted per location. The alignment in FIG. 5, which will be referred to as "one phone-one letter" alignment, is performed for neural network training. In one phone-one letter alignment, when multiple letters correspond to a single phone, as in orthographic ch corresponding to phonetic /k/, as in school, the single phone is associated with the first letter in the cluster, and alignment separators, here "+", are inserted in the subsequent locations associated with the subsequent letters in the cluster.

In contrast to some prior neural network approaches to orthography-phonetics conversion, which achieved orthography-phonetic alignments painstakingly by hand, a new variation to the dynamic programming algorithm that is known in the art was employed. The version of dynamic programming known in the art has been described with respect to aligning words that use the same alphabet, such as the English orthographies industry and interest, as shown in FIG. 6, numeral 600. Costs are applied for insertion, deletion and substitution of characters. Substitutions have no cost only when the same character is in the same location in each sequence, such as the i in location 1.

In order to align sequences from different alphabets, such as orthographies and pronunciations, where the alphabet for orthographies was shown in Table 2 and the alphabet for pronunciations was shown in Table 3, a new method was devised for calculating substitution costs. A customized table reflecting the particularities of the language for which an orthography-phonetics converter is being developed was designed. Table 4 below illustrates an excerpt of the letter-phone cost table for English; a cost of 0 marks a likely letter-phone correspondence.

TABLE 4 (excerpt)

    Letter  Phone  Cost        Letter  Phone  Cost
    a       aa     0           i       iy     0
    a       ae     0           j       jh     0
    a       ax     0           j       zh     1
    a       ey     0           k       k      0
    b       b      0           o       aa     0
    c       k      0           o       ao     0
    c       s      0           o       aw     0
    d       d      0           o       ow     0
    e       ax     0           o       oy     0
    e       eh     0           s       s      0
    e       ey     0           s       z      0
    e       ih     0           t       t      0
    e       iy     0           t       ch     1
    g       g      0           u       ah     0
    g       zh     1           u       uh     0
    h       hh     0           v       v      0
    i       ay     0           y       ih     0
    i       ih     0           y       iy     0
                               y       y      0

For substitutions other than those covered in Table 4, and for insertions and deletions, the costs used in the art of speech recognition scoring are employed: insertion costs 3, deletion costs 3 and substitution costs 4. With respect to Table 4, in some cases the cost for allowing a particular correspondence should be less than the fixed cost for insertion or deletion, in other cases greater. The more likely it is that a given phone and letter could correspond in a particular position, the lower the cost for substituting that phone and letter.

When the orthography coat (402) and the pronunciation /kowt/ (408) are aligned, the alignment procedure (406) inserts an alignment separator, "+", into the pronunciation, making /kow+t/. The pronunciation with alignment separators is converted to numbers by consulting Table 3 and loaded into a word-sized storage buffer for Stream 1 (410). The orthography is converted to numbers by consulting Table 2 and loaded into a word-sized storage buffer for Stream 2 (404).
FIG. 7, numeral 700, illustrates the coding of Stream 3 of the neural network input encoding for training. Each letter of the orthography is associated with its letter features.

In order to give the neural network further information upon which to generalize beyond the training set, a novel concept, that of letter features, was provided in the input coding. Acoustic and articulatory features for phonological segments are a common concept in the art. That is, each phone can be described by several phonetic features. Table 5 shows the features associated with each phone that appears in the pronunciation lexicon in this embodiment. For each phone, a feature can either be activated (+), not activated (-), or unspecified (0).

TABLE 5

    [Phone-feature matrix: for each of the 44 phones in the inventory,
    each of the 53 phonetic features listed in Table 9 (vocalic, vowel,
    sonorant, obstruent, flap, continuant, affricate, nasal, approximant,
    click, trill, silence, the front/mid/back and high/low tongue-position
    features, the place features from bilabial through glottal and
    epiglottal, and aspirated, hyper-aspirated, closure, ejective,
    implosive, labialized, lateral, nasalized, rhotacized, voiced, round
    and long) is marked activated (+), not activated (-), or unspecified
    (0). The body of the matrix is not legible in this copy and is not
    reproduced.]

Letters and phones that have a substitution cost of 0 in the letter-phone cost table in Table 4 are arranged in a letter-phone correspondence table, as in Table 6.

TABLE 6 (excerpt)

    Letter    Corresponding phones
    a         ae aa ax
    c         k s
    d         d
    h         hh
    i         ih iy
    j         jh
    k         k
    o         ao ow aa
    s         s
    t         t th dh
    u         uh ah
    w         w
    x         k
    y         y
    z         z

A letter's features were determined to be the set-theoretic union of the activated phonetic features of the phones that correspond to that letter in the letter-phone correspondence table of Table 6. For example, according to Table 6, the letter c corresponds with the phones /s/ and /k/. Table 7 shows the activated features for the phones /s/ and /k/.

TABLE 7

    Phone    Obstruent    Continuant    Alveolar    Velar    Aspirated
    s        +            +             +           -        -
    k        +            -             -           +        +

Table 8 shows the union of the activated features of /s/ and /k/, which are the letter features for the letter c.

TABLE 8

    Letter    Obstruent    Continuant    Alveolar    Velar    Aspirated
    c         +            +             +           +        +

In FIG. 7, each letter of coat, that is, c (702), o (704), a (706), and t (708), is looked up in the letter-phone correspondence table in Table 6. The activated features for each letter's corresponding phones are unioned and listed in (710), (712), (714) and (716). (710) represents the letter features for c, which are the union of the phone features for /k/ and /s/, which are the phones that correspond with that letter according to the table in Table 6. (712) represents the letter features for o, which are the union of the phone features for /ao/, /ow/ and /aa/, which are the phones that correspond with that letter according to the table in Table 6. (714) represents the letter features for a, which are the union of the phone features for /ae/, /aa/ and /ax/, which are the phones that correspond with that letter according to the table in Table 6.
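The set-theoretic union of activated phone features can be sketched directly. The feature sets below mirror the Table 7 excerpt for /s/ and /k/; the correspondence entry is the Table 6 row for c. Everything else about the representation (strings for feature names, sets for feature bundles) is an implementation choice, not something the patent prescribes.

```python
# A letter's features are the union of the activated ("+") features of
# every phone that letter corresponds to in the letter-phone table.

PHONE_FEATURES = {
    "s": {"obstruent", "continuant", "alveolar"},
    "k": {"obstruent", "velar", "aspirated"},
}
LETTER_PHONES = {"c": ["k", "s"]}  # excerpt of the Table 6 correspondence

def letter_features(letter):
    """Union of activated phone features over the letter's phones."""
    feats = set()
    for phone in LETTER_PHONES[letter]:
        feats |= PHONE_FEATURES[phone]
    return feats

print(sorted(letter_features("c")))
# ['alveolar', 'aspirated', 'continuant', 'obstruent', 'velar']
```

The result reproduces the Table 8 row for c: all five features are activated, even though no single phone activates all of them.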
(716) represents the letter features for t, which are the union of the phone features for /t/, /th/ and /dh/, which are the phones that correspond with that letter according to the table in Table 6.

The letter features for each letter are then converted to numbers by consulting the feature number table in Table 9.

TABLE 9

    Feature          Number    Feature            Number
    Vocalic          1         Low 2              28
    Vowel            2         Bilabial           29
    Sonorant         3         Labiodental        30
    Obstruent        4         Dental             31
    Flap             5         Alveolar           32
    Continuant       6         Post-alveolar      33
    Affricate        7         Retroflex          34
    Nasal            8         Palatal            35
    Approximant      9         Velar              36
    Click            10        Uvular             37
    Trill            11        Pharyngeal         38
    Silence          12        Glottal            39
    Front 1          13        Epiglottal         40
    Front 2          14        Aspirated          41
    Mid front 1      15        Hyper-aspirated    42
    Mid front 2      16        Closure            43
    Mid 1            17        Ejective           44
    Mid 2            18        Implosive          45
    Back 1           19        Labialized         46
    Back 2           20        Lateral            47
    High 1           21        Nasalized          48
    High 2           22        Rhotacized         49
    Mid high 1       23        Voiced             50
    Mid high 2       24        Round 1            51
    Mid low 1        25        Round 2            52
    Mid low 2        26        Long               53
    Low 1            27

The neural network produces an output hypothesis vector based on its input vectors, Stream 2 and Stream 3, and the internal transfer functions used by the processing elements (PEs). The coefficients used in the transfer functions are varied during the training process to vary the output vector. The transfer functions and coefficients are collectively referred to as the weights of the neural network, and the weights are varied in the training process to vary the output vector produced by given input vectors. The weights are set to small random values initially. The context description serves as an input vector and is applied to the inputs of the neural network. The context description is processed according to the neural network weight values to produce an output vector, i.e., the associated phonetic representation. At the beginning of the training session, the associated phonetic representation is not meaningful since the neural network weights are random values.

An error signal vector is generated in proportion to the distance between the associated phonetic representation and the assigned target phonetic representation, Stream 1.

In contrast to prior approaches, the error signal is not simply calculated to be the raw distance between the associated phonetic representation and the target phonetic representation, by for example using a Euclidean distance measure, shown in Equation 1:

    E = Σ_k (d_k - o_k)²        (Equation 1)

A constant that is 100 * the location number, where locations start at 0, is added to the feature number in order to distinguish the features associated with each letter. The modified feature numbers are loaded into a word-sized storage buffer for Stream 3 (718).

A disadvantage of prior approaches to the orthography-phonetics conversion problem by neural networks has been the choice of too small a window of letters for the neural network to examine in order to select an output phone for the middle letter. FIG. 8, numeral 800, and FIG. 9, numeral 900, illustrate two contrasting methods of presenting data to the neural network. FIG. 8 depicts a seven-letter window, proposed previously in the art, surrounding the first orthographic o (802) in photography. The window is shaded gray, while the target letter o (802) is shown in a black box. This window is not large enough to include the final orthographic y (804) in the word. The final y (804) is indeed the deciding factor for whether the word's first o (802) is converted to phonetic /ax/ as in photography or /ow/ as in photograph. An innovation introduced here is to allow a storage buffer to cover the entire length of the word, as depicted in FIG. 9, where the entire word is shaded gray and the target letter o (902) is once again shown in a black box. In this arrangement, all letters in photography are examined with knowledge of all the other letters present in the word. In the case of photography, the initial o (902) would know about the final y (904), allowing the proper pronunciation to be generated.

Another advantage of including the whole word in a storage buffer is that this permits the neural network to learn the differences in letter-phone conversion at the beginning, middle and end of words. For example, the letter e is often silent at the end of words, as in the boldface e in game, theme, rhyme, whereas the letter e is less often silent at other points in a word, as in the boldface e in Edward, metal, net. Examining the word as a whole in a storage buffer, as described here, allows the neural network to capture such important pronunciation distinctions that are a function of where in a word a letter appears.

The neural network produces an output hypothesis vector based on its input vectors, Stream 2 and Stream 3.

Rather than the raw distance of Equation 1, the distance is a function of how close the associated phonetic representation is to the target phonetic representation in featural space. Closeness in featural space is assumed to be related to closeness in perceptual space, if the phonetic representations were uttered.

FIG. 10, numeral 1000, contrasts the Euclidean distance error measure with the feature-based error measure. The target pronunciation is /raepihd/ (1002). Two potential associated pronunciations are shown: /raepaxd/ (1004) and /raepbd/ (1006). /raepaxd/ (1004) is perceptually very similar to the target pronunciation, whereas /raepbd/ (1006) is rather far, in addition to being virtually unpronounceable. By the Euclidean distance measure in Equation 1, both /raepaxd/ (1004) and /raepbd/ (1006) receive an error score of 2 with respect to the target pronunciation. The two identical scores obscure the perceptual difference between the two pronunciations.

In contrast, the feature-based error measure takes into consideration that /ih/ and /ax/ are perceptually very similar, and consequently weights the local error when /ax/ is hypothesized for /ih/. A scale of 0 for identity and 1 for maximum difference is established, and the various phone oppositions are given a score along this dimension. Table 10 provides a sample of the feature-based error multipliers, or weights, that are used for American English.

TABLE 10

                 neural network phone
target phone     hypothesis              error multiplier
ax               ih                      .1
ih               ax                      .1
aa               ao                      .3
ao               aa                      .3
ow               ao                      .5
ao               ow                      .5
ae               aa                      .5
aa               ae                      .5
uw               ow                      .7
ow               uw                      .7
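A sketch of the feature-based weighting of Table 10 follows; the dictionary below holds only the first two Table 10 pairs, and any unlisted pair defaults to a multiplier of 1, as the text goes on to explain:

```python
# Excerpt of Table 10; any (target, hypothesis) pair not listed gets multiplier 1.
ERROR_MULTIPLIERS = {("ax", "ih"): 0.1, ("ih", "ax"): 0.1}

def weighted_local_error(target_phone, hypothesized_phone, raw_error):
    """Scale the raw local error by the feature-based multiplier."""
    return ERROR_MULTIPLIERS.get((target_phone, hypothesized_phone), 1.0) * raw_error

# /ax/ for /ih/ is lightly penalized; /b/ for /ih/ keeps its full raw error of 2:
print(weighted_local_error("ih", "ax", 2.0), weighted_local_error("ih", "b", 2.0))
```

This captures the intuition of FIG. 12: substituting a perceptually close phone is penalized far less than substituting a distant one.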
Table 10 continues with the pairs iy/ey and ey/iy, each with a multiplier of .7. In Table 10, multipliers are the same whether the particular phones are part of the target or part of the hypothesis, but this does not have to be the case. Any combinations of target and hypothesis phones that are not in Table 10 are considered to have a multiplier of 1.

FIG. 11, numeral 1100, shows how the unweighted local error is computed for the /ih/ in /raepihd/. FIG. 12, numeral 1200, shows how the weighted error using the multipliers in Table 10 is computed. FIG. 12 shows how the error for /ax/ where /ih/ is expected is reduced by the multiplier, capturing the perceptual notion that this error is less egregious than hypothesizing /b/ for /ih/, whose error is unreduced.

After computation of the error signal, the weight values are then adjusted in a direction to reduce the error signal. This process is repeated a number of times for the associated pairs of context descriptions and assigned target phonetic representations. This process of adjusting the weights to bring the associated phonetic representation closer to the assigned target phonetic representation is the training of the neural network. This training uses the standard backpropagation of errors method. Once the neural network is trained, the weight values possess the information necessary to convert the context description to an output vector similar in value to the assigned target phonetic representation. The preferred neural network implementation requires up to ten million presentations of the context description to its inputs, and the following weight adjustments, before the neural network is considered fully trained.

The neural network contains blocks with two kinds of activation functions, sigmoid and softmax, as are known in the art. The softmax activation function is shown in Equation 2.

o_k = e^(x_k) / Σ_j e^(x_j)    Equation 2

FIG. 13, numeral 1300, illustrates the neural network architecture for training the orthography coat on the pronunciation /kowt/. Stream 2 (1302), the numeric encoding of the letters of the input orthography, encoded as shown in FIG. 4, is fed into input block 1 (1304). Input block 1 (1304) then passes this data onto sigmoid neural network block 3 (1306). Sigmoid neural network block 3 (1306) then passes the data for each letter into softmax neural network blocks 5 (1308), 6 (1310), 7 (1312) and 8 (1314). Stream 3 (1316), the numeric encoding of the letter features of the input orthography, encoded as shown in FIG. 7, is fed into input block 2 (1318). Input block 2 (1318) then passes this data onto sigmoid neural network block 4 (1320). Sigmoid neural network block 4 (1320) then passes the data for each letter's features into softmax neural network blocks 5 (1308), 6 (1310), 7 (1312) and 8 (1314). Stream 1 (1322), the numeric encoding of the target phones, encoded as shown in FIG. 4, is fed into output block 9 (1324). Each of the softmax neural network blocks 5 (1308), 6 (1310), 7 (1312), and 8 (1314) outputs the most likely phone given the input information to output block 9 (1324). Output block 9 (1324) then outputs the data as the neural network hypothesis (1326). The neural network hypothesis is compared to Stream 1 (1322), the target phones, by means of the feature-based error function described above. The error determined by the error function is then backpropagated to softmax neural network blocks 5 (1308), 6 (1310), 7 (1312) and 8 (1314), which in turn backpropagate the error to sigmoid neural network blocks 3 (1306) and 4 (1320). The double arrows between neural network blocks in FIG. 13 indicate both the forward and backward movement through the network.

FIG. 14, numeral 1400, shows the neural network orthography-pronunciation converter of FIG. 3, numeral 310, in detail. An orthography that is not found in the pronunciation lexicon (308) is coded into neural network input format (1404). The coded orthography is then submitted to the trained neural network (1406). This is called testing the neural network. The trained neural network outputs an encoded pronunciation, which must be decoded by the neural network output decoder (1408) into a pronunciation (1410).

When the network is tested, only Stream 2 and Stream 3 need be encoded. The encoding of Stream 2 for testing is shown in FIG. 15, numeral 1500. Each letter is converted to a numeric code by consulting the letter table in Table 2. (1502) shows the letters of the word coat. (1504) shows the numeric codes for the letters of the word coat. Each letter's numeric code is then loaded into a word-sized storage buffer for Stream 2. Stream 3 is encoded as shown in FIG. 7. A word is tested by encoding Stream 2 and Stream 3 for that word and testing the neural network. The neural network returns a neural network hypothesis. The neural network hypothesis is then decoded, as shown in FIG. 16, by converting numbers (1602) to phones (1604) by consulting the phone number table in Table 3, and removing any alignment separators, which is number 40. The resulting string of phones (1606) can then serve as a pronunciation for the input orthography.

FIG. 17 shows how the streams encoded for the orthography coat fit into the neural network architecture. Stream 2 (1702), the numeric encoding of the letters of the input orthography, encoded as shown in FIG. 15, is fed into input block 1 (1704). Input block 1 (1704) then passes this data onto sigmoid neural network block 3 (1706). Sigmoid neural network block 3 (1706) then passes the data for each letter into softmax neural network blocks 5 (1708), 6 (1710), 7 (1712) and 8 (1714). Stream 3 (1716), the numeric encoding of the letter features of the input orthography, encoded as shown in FIG. 7, is fed into input block 2 (1718). Input block 2 (1718) then passes this data onto sigmoid neural network block 4 (1720). Sigmoid neural network block 4 (1720) then passes the data for each letter's features into softmax neural network blocks 5 (1708), 6 (1710), 7 (1712) and 8 (1714). Each of the softmax neural network blocks 5 (1708), 6 (1710), 7 (1712), and 8 (1714) outputs the most likely phone given the input information to output block 9 (1722). Output block 9 (1722) then outputs the data as the neural network hypothesis (1724).

FIG. 18, numeral 1800, presents a picture of the neural network for testing, organized to handle an orthographic word of 11 characters. This is just an example; the network could be organized for an arbitrary number of letters per word. Input stream 2 (1802), containing a numeric encoding of letters, encoded as shown in FIG. 15, loads its data into input block 1 (1804). Input block 1 (1804) contains 495 PE's, which is the size required for an 11-letter word, where each letter could be one of 45 distinct characters. Input block 1 (1804) passes these 495 PE's to sigmoid neural network 3 (1806). Sigmoid neural network 3 (1806) distributes a total of 220 PE's equally, in chunks of 20 PE's, to softmax neural networks 4 (1808), 5 (1810), 6 (1812), 7 (1814), 8 (1816), 9 (1818), 10 (1820), 11 (1822), 12 (1824), 13 (1826) and 14 (1828). Input stream 3 (1830), containing a numeric encoding of letter features, encoded as shown in FIG. 7, loads its data into input block 2 (1832). Input block 2 (1832) contains 583 processing elements, which is the size required for an 11-letter word, where each letter is represented by up to 53 activated features. Input block 2 (1832) passes these 583 PE's to sigmoid neural network 4 (1834). Sigmoid neural network 4 (1834) distributes a total of 220 PE's equally, in chunks of 20 PE's, to softmax neural networks 4 (1808), 5 (1810), 6 (1812), 7 (1814), 8 (1816), 9 (1818), 10 (1820), 11 (1822), 12 (1824), 13 (1826) and 14 (1828). Softmax neural networks 4-14 each pass 60 PE's, for a total of 660 PE's, to output block 16 (1836). Output block 16 (1836) then outputs the neural network hypothesis (1838).

Another architecture described under the present invention involves two layers of softmax neural network blocks, as shown in FIG. 19, numeral 1900. The extra layer provides for more contextual information to be used by the neural network in order to determine phones from orthography. In addition, the extra layer takes additional input of phone features, which adds to the richness of the input representation, thus improving the network's performance.

FIG. 19 illustrates the neural network architecture for training the orthography coat on the pronunciation /kowt/. Stream 2 (1902), the numeric encoding of the letters of the input orthography, encoded as shown in FIG. 15, is fed into input block 1 (1904). Input block 1 (1904) then passes this data onto sigmoid neural network block 3 (1906). Sigmoid neural network block 3 (1906) then passes the data for each letter into softmax neural network blocks 5 (1908), 6 (1910), 7 (1912) and 8 (1914). Stream 3 (1916), the numeric encoding of the letter features of the input orthography, encoded as shown in FIG. 7, is fed into input block 2 (1918). Input block 2 (1918) then passes this data onto sigmoid neural network block 4 (1920). Sigmoid neural network block 4 (1920) then passes the data for each letter's features into softmax neural network blocks 5 (1908), 6 (1910), 7 (1912) and 8 (1914). Stream 1 (1922), the numeric encoding of the target phones, encoded as shown in FIG. 4, is fed into output block 13 (1924). Each of the softmax neural network blocks 5 (1908), 6 (1910), 7 (1912), and 8 (1914) outputs the most likely phone, given the input information, along with any possible left and right phones, to softmax neural network blocks 9 (1926), 10 (1928), 11 (1930) and 12 (1932). For example, blocks 5 (1908) and 6 (1910) pass the neural network's hypothesis for phone 1 to block 9 (1926); blocks 5 (1908), 6 (1910), and 7 (1912) pass the neural network's hypothesis for phone 2 to block 10 (1928); blocks 6 (1910), 7 (1912), and 8 (1914) pass the neural network's hypothesis for phone 3 to block 11 (1930); and blocks 7 (1912) and 8 (1914) pass the neural network's hypothesis for phone 4 to block 12 (1932). In addition, the features associated with each phone according to the table in Table 5 are passed to each of blocks 9 (1926), 10 (1928), 11 (1930), and 12 (1932) in the same way. For example, features for phones 1 and 2 are passed to block 9 (1926), features for phones 1, 2 and 3 are passed to block 10 (1928), features for phones 2, 3 and 4 are passed to block 11 (1930), and features for phones 3 and 4 are passed to block 12 (1932). Blocks 9 (1926), 10 (1928), 11 (1930) and 12 (1932) output the most likely phone given the input information to output block 13 (1924). Output block 13 (1924) then outputs the data as the neural network hypothesis (1934). The neural network hypothesis (1934) is compared to Stream 1 (1922), the target phones, by means of the feature-based error function described above. The error determined by the error function is then backpropagated to softmax neural network blocks 5 (1908), 6 (1910), 7 (1912) and 8 (1914), which in turn backpropagate the error to sigmoid neural network blocks 3 (1906) and 4 (1920). The double arrows between neural network blocks in FIG. 19 indicate both the forward and backward movement through the network.

One of the benefits of the neural network letter-to-sound conversion method described here is a method for compressing pronunciation dictionaries. When used in conjunction with a neural network letter-to-sound converter as described here, pronunciations do not need to be stored for any words in a pronunciation lexicon for which the neural network can correctly discover the pronunciation. Neural networks overcome the large storage requirements of phonetic representations in dictionaries, since the knowledge base is stored in weights rather than in memory.

Table 11 shows the pronunciation lexicon excerpt of Table 1.

TABLE 11

Orthography        Pronunciation
cat
dog
school
coat

This lexicon excerpt does not need to store any pronunciation information, since the neural network was able to correctly hypothesize pronunciations for the orthographies stored there. This results in a savings of 21 bytes out of 41 bytes, including ending 0 bytes, or a savings of 51% in storage space.

The approach to orthography-pronunciation conversion described here has an advantage over rule-based systems in that it is easily adaptable to any language. For each language, all that is required is an orthography-pronunciation lexicon in that language and a letter-phone cost table in that language. It may also be necessary to use characters from the International Phonetic Alphabet, so that the full range of phonetic variation in the world's languages is possible to model.

As shown in FIG. 20, numeral 2000, the present invention implements a method for providing, in response to orthographic information, efficient generation of a phonetic representation, including the steps of inputting (2002) an orthography of a word and a predetermined set of input letter features, and utilizing (2004) a neural network that has been trained using automatic letter phone alignment and predetermined letter features to provide a neural network hypothesis of a word pronunciation.

In the preferred embodiment, the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.
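At a high level, the two-step method of FIG. 20 can be outlined as below. The encoding details and the stand-in network are illustrative only; they do not reproduce the patent's actual Table 2 letter codes or Table 5 feature numbers:

```python
def generate_pronunciation(orthography, letter_features, network):
    """Schematic of FIG. 20: encode the word and its letter features,
    then apply a pretrained network to obtain a pronunciation hypothesis."""
    stream2 = [ord(letter) for letter in orthography]   # illustrative letter codes
    stream3 = [100 * location + feature                 # Stream 3 style offsets
               for location, features in enumerate(letter_features)
               for feature in features]
    return network(stream2, stream3)

# Toy stand-in for the pretrained network, mapping "coat" to /kowt/;
# the per-letter feature numbers here are hypothetical.
hypothesis = generate_pronunciation("coat", [[3], [14], [15], [5]],
                                    lambda s2, s3: ["k", "ow", "t"])
print(hypothesis)
```

In a real system, the stand-in lambda would be replaced by the trained orthography-pronunciation network, and the hypothesis would be decoded against the phone table before output.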
As shown in FIG. 21, numeral 2100, the pretrained neural network (2004) has been trained using the steps of: providing (2102) a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography; aligning (2104) the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function; providing (2106) acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter; providing (2108) a predetermined amount of context information; and training (2110) the neural network to associate the input orthography with a phonetic representation.

In a preferred embodiment, the predetermined number of letters (2102) is equivalent to the number of letters in the word.

As shown in FIG. 24, numeral 2400, an orthography-pronunciation lexicon (2404) is used to train an untrained neural network (2402), resulting in a trained neural network (2408). The trained neural network (2408) produces word pronunciation hypotheses (2004) which match part of an orthography-pronunciation lexicon (2410). In this way, the orthography-pronunciation lexicon (306) of a text-to-speech system (300) is reduced in size by using neural network word pronunciation hypotheses (2004) in place of the pronunciation transcriptions in the lexicon for that part of the orthography-pronunciation lexicon which is matched by the neural network word pronunciation hypotheses.

Training (2110) the neural network may further include providing (2112) a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers.

Training (2110) the neural network may further include employing (2114) a feature-based error function, for example as calculated in FIG. 12, to characterize the distance between target and hypothesized pronunciations during training.

The neural network (2004) may be a feed-forward neural network. The neural network (2004) may use backpropagation of errors. The neural network (2004) may have a recurrent input structure. The predetermined letter features (2002) may include articulatory or acoustic features. The predetermined letter features (2002) may include a geometry of acoustic or articulatory features, as is known in the art. The automatic letter phone alignment (2004) may be based on consonant and vowel locations in the orthography and associated phonetic representation. The predetermined number of letters of the orthography and the phones for the pronunciation of the orthography (2102) may be contained in a sliding window. The orthography and pronunciation (2102) may be described using feature vectors. The featurally-based substitution cost function (2104) uses predetermined substitution, insertion and deletion costs and a predetermined substitution table.

As shown in FIG. 22, numeral 2200, the present invention implements a device (2208), including at least one of a microprocessor, an application-specific integrated circuit, and a combination of a microprocessor and an application-specific integrated circuit, for providing, in response to orthographic information, efficient generation of a phonetic representation, including an encoder (2206), coupled to receive an orthography of a word (2202) and a predetermined set of input letter features (2204), for providing digital input to a pretrained orthography-pronunciation neural network (2210), wherein the pretrained orthography-pronunciation neural network (2210) has been trained using automatic letter phone alignment (2212) and predetermined letter features (2214). The pretrained orthography-pronunciation neural network (2210), coupled to the encoder (2206), provides a neural network hypothesis of a word pronunciation (2216).

In a preferred embodiment, the pretrained orthography-pronunciation neural network (2210) is trained using feature-based error backpropagation, for example as calculated in FIG. 12.

In a preferred embodiment, the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.

As shown in FIG. 21, numeral 2100, the pretrained orthography-pronunciation neural network (2210) of the microprocessor/ASIC/combination microprocessor and ASIC (2208) has been trained in accordance with the following scheme: providing (2102) a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography; aligning (2104) the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function; providing (2106) acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter; providing (2108) a predetermined amount of context information; and training (2110) the neural network to associate the input orthography with a phonetic representation.

In a preferred embodiment, the predetermined number of letters (2102) is equivalent to the number of letters in the word.

As shown in FIG. 24, numeral 2400, an orthography-pronunciation lexicon (2404) is used to train an untrained neural network (2402), resulting in a trained neural network (2408). The trained neural network (2408) produces word pronunciation hypotheses (2216) which match part of an orthography-pronunciation lexicon (2410). In this way, the orthography-pronunciation lexicon (306) of a text-to-speech system (300) is reduced in size by using neural network word pronunciation hypotheses (2216) in place of the pronunciation transcriptions in the lexicon for that part of the orthography-pronunciation lexicon which is matched by the neural network word pronunciation hypotheses.

Training the neural network (2110) may further include providing (2112) a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers.

Training the neural network (2110) may further include employing (2114) a feature-based error function, for example as calculated in FIG. 12, to characterize the distance between target and hypothesized pronunciations during training.

The pretrained orthography-pronunciation neural network (2210) may be a feed-forward neural network. The pretrained orthography-pronunciation neural network (2210) may use backpropagation of errors. The pretrained orthography-pronunciation neural network (2210) may have a recurrent input structure. The predetermined letter features (2214) may include acoustic or articulatory features. The predetermined letter features (2214) may include a geometry of acoustic or articulatory features, as is known in the art. The automatic letter phone alignment (2212) may be based on consonant and vowel locations in the orthography and associated phonetic representation. The predetermined number of letters of the orthography and the phones for the pronunciation of the orthography (2102) may be contained in a sliding window. The orthography and pronunciation (2102) may be described using feature vectors. The featurally-based substitution cost function (2104) uses predetermined substitution, insertion and deletion costs and a predetermined substitution table.

As shown in FIG. 23, numeral 2300, the present invention implements an article of manufacture (2308), e.g., software, that includes a computer-usable medium having computer-readable program code thereon. The computer-readable code includes an inputting unit (2306) for inputting an orthography of a word (2302) and a predetermined set of input letter features (2304), and code for a neural network utilization unit (2310) that has been trained using automatic letter phone alignment (2312) and predetermined letter features (2314) to provide a neural network hypothesis of a word pronunciation (2316).

In a preferred embodiment, the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.

As shown in FIG. 21, typically the pretrained neural network has been trained in accordance with the following scheme: providing (2102) a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography; aligning (2104) the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function; providing (2106) acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter; providing (2108) a predetermined amount of context information; and training (2110) the neural network to associate the input orthography with a phonetic representation.

In a preferred embodiment, the predetermined number of letters (2102) is equivalent to the number of letters in the word.

As shown in FIG. 24, numeral 2400, an orthography-pronunciation lexicon (2404) is used to train an untrained neural network (2402), resulting in a trained neural network (2408). The trained neural network (2408) produces word pronunciation hypotheses (2316) which match part of an orthography-pronunciation lexicon (2410). In this way, the orthography-pronunciation lexicon (306) of a text-to-speech system (300) is reduced in size by using neural network word pronunciation hypotheses (2316) in place of the pronunciation transcriptions in the lexicon for that part of the orthography-pronunciation lexicon which is matched by the neural network word pronunciation hypotheses.

The article of manufacture may be selected to further include providing (2112) a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers. Also, the invention may further include, during training, employing (2114) a feature-based error function, for example as calculated in FIG. 12, to characterize the distance between target and hypothesized pronunciations during training.

In a preferred embodiment, the neural network utilization unit (2310) may be a feed-forward neural network. In a preferred embodiment, the neural network utilization unit (2310) may use backpropagation of errors. In a preferred embodiment, the neural network utilization unit (2310) may have a recurrent input structure. The predetermined letter features (2314) may include acoustic or articulatory features. The predetermined letter features (2314) may include a geometry of acoustic or articulatory features, as is known in the art. The automatic letter phone alignment (2312) may be based on consonant and vowel locations in the orthography and associated phonetic representation. The predetermined number of letters of the orthography and the phones for the pronunciation of the orthography (2102) may be contained in a sliding window. The orthography and pronunciation (2102) may be described using feature vectors. The featurally-based substitution cost function (2104) uses predetermined substitution, insertion and deletion costs and a predetermined substitution table.

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

We claim:
1. A method for providing, in response to orthographic information, efficient generation of a phonetic representation, comprising the steps of:
a) inputting an orthography of a word and a predetermined set of input letter features;
b) utilizing a neural network that has been trained using automatic letter phone alignment and predetermined letter features to provide a neural network hypothesis of a word pronunciation.
2. The method of claim 1 wherein the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.
3. The method of claim 1 wherein the pretrained neural network has been trained using the steps of:
a) providing a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography;
b) aligning the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function;
c) providing acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter;
d) providing a predetermined amount of context information; and
e) training the neural network to associate the input orthography with a phonetic representation.
4. The method of claim 3, step (a), wherein the predetermined number of letters is equivalent to the number of letters in the word.
5. The method of claim 1 where a pronunciation lexicon is reduced in size by using neural network word pronunciation hypotheses which match target pronunciations.
6. The method of claim 3 further including providing a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers.
7. The method of claim 3 further including, during training, employing a feature-based error function to characterize a distance between target and hypothesized pronunciations during training.
8. The method of claim 1, step (b), wherein the neural network is a feed-forward neural network.
9. The method of claim 1, step (b), wherein the neural network uses backpropagation of errors.
10. The method of claim 1, step (b), wherein the neural network has a recurrent input structure.
11. The method of claim 1, wherein the predetermined letter features include articulatory features.
12. The method of claim 1, wherein the predetermined letter features include acoustic features.
13. The method of claim 1, wherein the predetermined letter features include a geometry of articulatory features.
14. The method of claim 1, wherein the predetermined letter features include a geometry of acoustic features.
15. The method of claim 1, step (b), wherein the automatic letter phone alignment is based on consonant and vowel locations in the orthography and associated phonetic representation.
16. The method of claim 3, step (a), wherein the letters and phones are contained in a sliding window.
17. The method of claim 1, wherein the orthography is described using a feature vector.
18. The method of claim 1, wherein the pronunciation is described using a feature vector.
19. The method of claim 6, wherein the number of layers of output reprocessing is 2.
20. The method of claim 3, step (b), where the featurally-based substitution cost function uses predetermined substitution, insertion and deletion costs and a predetermined substitution table.
21. A device for providing, in response to orthographic information, efficient generation of a phonetic representation, comprising:
a) an encoder, coupled to receive an orthography of a word and a predetermined set of input letter features, for providing digital input to a pretrained orthography-pronunciation neural network, wherein the pretrained neural network has been trained using automatic letter phone alignment and predetermined letter features;
b) the pretrained orthography-pronunciation neural network, coupled to the encoder, for providing a neural network hypothesis of a word pronunciation.
22. The device of claim 21 wherein the pretrained neural network is trained using feature-based error backpropagation.
23. The device of claim 21 wherein the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.
24. The device of claim 21 wherein the device includes at least one of:
a) a microprocessor;
b) an application specific integrated circuit; and
c) a combination of a) and b).
25. The device of claim 21 wherein the pretrained neural network has been trained in accordance with the following scheme:
a) providing a predetermined number of letters of an associated orthography consisting of letters for the word and a phonetic representation consisting of phones for a target pronunciation of the associated orthography;
b) aligning the associated orthography and phonetic representation using a dynamic programming alignment enhanced with a featurally-based substitution cost function;
c) providing acoustic and articulatory information corresponding to the letters, based on a union of features of predetermined phones representing each letter;
d) providing a predetermined amount of context information; and
e) training the neural network to associate the input orthography with a phonetic representation.
26. The device of claim 25, step (a), wherein the predetermined number of letters is equivalent to the number of letters in the word.
27. The device of claim 21, where a pronunciation lexicon is reduced in size by using neural network word pronunciation hypotheses which match target pronunciations.
28. The device of claim 21 further including providing a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone features and neighboring phone features are passed to succeeding layers.
29. The device of claim 21 further including, during training, employing a feature-based error function to characterize the distance between target and hypothesized pronunciations during training.
30. The device of claim 21, wherein the neural network is a feed-forward neural network.
31. The device of claim 21, wherein the neural network uses backpropagation of errors.
32. The device of claim 21, wherein the neural network has a recurrent input structure.
33. The device of claim 21, wherein the predetermined letter features include articulatory features.
34. The device of claim 21, wherein the predetermined letter features include acoustic features.
35. The device of claim 21, wherein the predetermined letter features include a geometry of articulatory features.
36. The device of claim 21, wherein the predetermined letter features include a geometry of acoustic features.
37. The device of claim 21, step (b), wherein the automatic letter phone alignment is based on consonant and vowel locations in the orthography and associated phonetic representation.
38. The device of claim 25, step (a), wherein the letters and phones are contained in a sliding window.
39. The device of claim 21, wherein the orthography is described using a feature vector.
40. The device of claim 21, wherein the pronunciation is described using a feature vector.
41. The device of claim 28, wherein the number of layers of output reprocessing is 2.
42. The device of claim 25, step (b), where the featurally-based substitution cost function uses predetermined substitution, insertion and deletion costs and a predetermined substitution table.
43. An article of manufacture for converting orthographies into phonetic representations, comprising a computer usable medium having computer readable program code means thereon comprising:
a) inputting means for inputting an orthography of a word and a predetermined set of input letter features;
b) neural network utilization means for utilizing a neural network that has been trained using automatic letter phone alignment and predetermined letter features to provide a neural network hypothesis of a word pronunciation.
44. The article of manufacture of claim 43 wherein the predetermined letter features for a letter represent a union of features of predetermined phones representing the letter.
45. The article of manufacture of claim 43 wherein the pretrained neural network has been trained in accordance with the following scheme:
features and neighboring phone features are passed to succeeding layers.
49. The article of manufacture of claim 43 further including, during training, employing a feature-based error function to characterize the distance between target and hypothesized pronunciations during training.
50. The article of manufacture of claim 43, wherein the neural network is a feed-forward neural network.
51. The article of manufacture of claim 43, wherein the neural network uses backpropagation of errors.
52.
The article of manufacture of claim 43, wherein the a) providing a predetermined number of letters of an neural network has a recurrent input Structure. asSociated orthography consisting of letters for the 53. The article of manufacture of claim 43, wherein the word and a phonetic representation consisting of predetermined letter features include articulatory features. phones for a target pronunciation of the associated 15 54. The article of manufacture of claim 43, wherein the orthography; predetermined letter features include acoustic features. b) aligning the associated orthography and phonetic rep 55. The article of manufacture of claim 43, wherein the resentation using a dynamic programming alignment predetermined letter features include a geometry of articu latory features. enhanced with a featurally-based Substitution cost 56. The article of manufacture of claim 43, step (b), function; wherein the automatic letter phone alignment is based on c) providing acoustic and articulatory information corre consonant and vowel locations in the orthography and sponding to the letters, based on a union of features of asSociated phonetic representation. predetermined phones representing each letter; 57. The article of manufacture of claim 45, step (a), d) providing a predetermined amount of context informa 25 wherein the letters and phones are contained in a sliding tion; and window. e) training the neural network to associate the input 58. The article of manufacture of claim 43, wherein the orthography with a phonetic representation. orthography is described using a feature vector. 46. The article of manufacture of claim 45, step (a), 59. The article of manufacture of claim 43, wherein the wherein the predetermined number of letters is equivalent to pronunciation is described using a feature vector. the number of letters in the word. 60. The article of manufacture of claim 47, wherein the 47. 
The article of manufacture of claim 43 where a number of layers of output reprocessing is 2. pronunciation lexicon is reduced in Size by using neural 61. The article of manufacture of claim 45, Step (b), where network word pronunciation hypotheses which match target the featurally-based Substitution cost function uses prede pronunciations. 35 termined Substitution, insertion and deletion costs and a 48. The article of manufacture of claim 43 further includ predetermined substitution table. ing providing a predetermined number of layers of output reprocessing in which phones, neighboring phones, phone k k k k k
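Claims 25 and 45, step (b), recite a dynamic programming alignment of orthography and phonetic representation enhanced with a featurally-based substitution cost function, using predetermined substitution, insertion and deletion costs (claims 20, 42, 61). The sketch below illustrates that general technique only: the feature sets, the Jaccard-style featural distance, and the unit insertion/deletion costs are invented for the example and are not the patent's substitution table.

```python
# Illustrative sketch of a featurally-based substitution cost inside a
# standard edit-distance DP lattice. All feature sets and cost values
# here are toy assumptions, not the patent's predetermined tables.

FEATURES = {
    "p": {"consonant", "labial", "stop"},
    "f": {"consonant", "labial", "fricative"},
    "o": {"vowel", "round"},
    "n": {"consonant", "alveolar", "nasal"},
    "e": {"vowel", "front"},
}

def sub_cost(letter, phone):
    """Featural substitution cost: 0 for identical feature sets,
    rising toward 1 as the sets share fewer features (Jaccard-style)."""
    a, b = FEATURES.get(letter, set()), FEATURES.get(phone, set())
    if not a and not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)

def align(letters, phones, ins_cost=1.0, del_cost=1.0):
    """Classic dynamic-programming alignment; returns total cost."""
    n, m = len(letters), len(phones)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(letters[i - 1], phones[j - 1]),
                d[i - 1][j] + del_cost,   # letter realized by no phone
                d[i][j - 1] + ins_cost,   # phone with no letter
            )
    return d[n][m]
```

The featural cost makes a substitution of /p/ for "f" (which share labiality) cheaper than /o/ for "p", so the alignment prefers linguistically plausible letter-phone pairings over arbitrary ones.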
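Claims 23 and 44 state that the predetermined letter features for a letter represent a union of features of the predetermined phones representing that letter. A toy sketch of that union, with an invented phone inventory and feature sets (the letter "c" realizing /k/ as in "cat" or /s/ as in "cent"):

```python
# Hypothetical sketch of the union-of-phone-features letter encoding of
# claims 23/44. The phone inventory and feature sets are illustrative.

PHONE_FEATURES = {
    "k": {"consonant", "velar", "stop", "voiceless"},
    "s": {"consonant", "alveolar", "fricative", "voiceless"},
}

# Letter "c" can surface as /k/ ("cat") or /s/ ("cent").
LETTER_PHONES = {"c": ["k", "s"]}

def letter_features(letter):
    """Union of the feature sets of all phones the letter may represent."""
    feats = set()
    for phone in LETTER_PHONES.get(letter, []):
        feats |= PHONE_FEATURES[phone]
    return feats
```

Encoding letters this way gives the network acoustic and articulatory evidence about every pronunciation a letter can have, before any alignment has disambiguated which phone it realizes in a particular word.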
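Claims 7, 29 and 49 recite a feature-based error function that characterizes the distance between target and hypothesized pronunciations during training. The sketch below conveys the idea only, under the assumption of binary phone feature vectors and a mean-squared distance; it does not reproduce the patent's FIG. 12 calculation, and the vectors are invented.

```python
# A hedged sketch of a feature-based error: a hypothesized /d/ for a
# target /t/ (differing in one feature) costs less than /a/ for /t/.
# Feature vectors below are toy binary examples, not the patent's.

PHONE_VECTORS = {
    "t": (1, 0, 0, 1),  # e.g. consonantal, voiced, vocalic, alveolar
    "d": (1, 1, 0, 1),
    "a": (0, 1, 1, 0),
}

def feature_error(target, hypothesis):
    """Mean squared distance between target and hypothesized vectors."""
    t, h = PHONE_VECTORS[target], PHONE_VECTORS[hypothesis]
    return sum((x - y) ** 2 for x, y in zip(t, h)) / len(t)
```

Backpropagating such a graded error rewards near-miss pronunciations instead of treating every wrong phone as equally wrong, which is the motivation the claims give for a featural rather than categorical error measure.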