United States Patent (19) 11 Patent Number: 5,805,832 Brown Et Al

USOO.5805832A United States Patent (19) 11 Patent Number: 5,805,832 Brown et al. (45) Date of Patent: Sep. 8, 1998

54 SYSTEM FOR PARAMETRICTEXT TO 357344 3/1990 European Pat. Off.. TEXT LANGUAGE TRANSLATION 399533 11/1990 European Pat. Off.. WO 90/10911 9/1990 WIPO. 75 Inventors: Peter Fitzhugh Brown, New York; John Cocke, Bedford; Stephen OTHER PUBLICATIONS Andrew Della Pietra, Pearl River; Information System & Electronic Development Laboratory Vincent Joseph Della Pietra, Blauvelt; Mitsubishi Electr. Corp. “Training of Lexical Models Based Frederick Jelinek, Briarcliff Manor; on DTW-Based Parameter Reestimation Algorithm' Y. Abe Jennifer Ceil Lai, Garrison; Robert et al., 1988, pp. 623-626. Leroy Mercer, Yorktown Heights, all of N.Y. (List continued on next page.) 73 Assignee: International Business Machines Primary Examiner-Gail O. Hayes Corporation, Armonk, N.Y. Assistant Examiner William Hughet Attorney, Agent, or Firm-Robert P. Tassinari, Esq.; Sterne, 21 Appl. No.: 459,454 Kessler, Goldstein & Fox PL.L.C. 22 Filed: Jun. 2, 1995 57 ABSTRACT Related U.S. Application Data R present invention is a System for translating text from a rst Source language into a Second target language. The 63 Continuation of Ser. No. 736,278, Jul. 25, 1991, Pat. No. E. assigns probabilities or scores to various target 5,477,451. anguage translations and then displays or makes otherwise available the highest Scoring translations. The Source text is (51) Int. Cl." ...... G06F 17/28 first transduced into one or more intermediate Structural 52 U.S. Cl...... 395/752; 395/751; 395/759 representations. From these intermediate source structures a 58 Field of Search ...... 36441902,419.1, set of intermediate large-structure hypotheses is generated 364/419.01, 419.08; 395/2.49, 2.86, 751, These hypotheses are scored by two different models: a 752, 759 language model which assigns a probability or score to an s intermediate target Structure, and a translation model which assigns a probability or Score to the event that an interme 56) References Cited diate target structure is translated into an intermediate Source Structure. Scores from the translation model and language U.S. PATENT DOCUMENTS model are combined into a combined Score for each inter 4,754,489 6/1988 Bukser. mediate target-Structure hypothesis. Finally, a set of target 4,852,173 7/1989 Bahl et al.. text hypotheses is produced by transducing the highest 4,879,580 11/1989 Church. Scoring target-Structure hypotheses into portions of text in 4,882,759 11/1989 Bahl et al.. the target language. The System can either run in batch 4,984,178 1/1991 Hemphill et al.. mode, in which case it translates Source-language text into 4,991,094 2/1991 Fagen et al.. a target language without human assistance, or it can func 5,033,087 7/1991 Bahl et al.. tion as an aid to a human translator. When functioning as an 5,068,789 11/1991 Van Vliembergen. aid to a human translator, the human may simply Select from 5,109,509 4/1992 Katayana et al.. the various translation hypotheses provided by the System, 5,146,405 9/1992 Church. or he may optionally provide hints or constraints on how to 5,200,893 4/1993 Ozawa et al.. perform one or more of the Stages of Source transduction, hypothesis generation and target transduction. FOREIGN PATENT DOCUMENTS 327266 8/1989 European Pat. Off.. 24 Claims, 54 Drawing Sheets

display targe ext

fore

5,805,832 Page 2

OTHER PUBLICATIONS F. Jelinek, R. Mercer, “Interpolated Estimated of Markov CIIAM 86, Proceedings of the 2nd International Conference Source Parameters From Sparse Data”, Workshop on Pattern on Artificial Intelligence, “Logic Programming for Speech Recognition in Practice, Amsterdam (Netherland), North Understanding, pp. 487-497, Abstract of Article, Snyers D. Holland, May 21–23, 1980. et al. M. Kay, “Making Connections”, ACHALLC '91, Tempe, J. Baker, “Trainable Grammars For Speech Recognition”, Arizona, 1991, p. 1. Speech Communications Papers Presented at the 97th Meet “Method For Inferring Lexical Associations From Textual ing of the Acoustic Society of America, 1979, pp. 547-550. Co-Occurrences’, IBM Technical Disclosure Bulletin, vol. M.E. Lesk, “Lex-A Lexical Analyzer Generator', Computer 33, Jun. 1990. Science Technical Report, No. 39, Bell Laboratories, Oct. L. Bahl et al., “A Tree-Based Statistical Language Model 1975. For Natural Language Speech Recognition”, IEEE Trans L. Baum, “An Inequality and ASSociated Maximation Tech actions of Acoustics, vol. 37, No. 7, Jul. 1989, pp. nique in Statistical Estimation for Probalistic Functions of 1001-1008. Markov Processes”, Inequalities, vol. 3, 1972, pp. 1-8. P. Brown, “Word-Sense Disambiguation Using Statistical F. Jelinek, “Self Organized Language Modeling For Speech Methods”, Proceedings of the 29th Annual Meeting of the Recognition', Language Processing For Speech Recogni ASSOciation for Computational Linguistics, Jun. 1991, pp. tion, pp. 450-506. 264-270. Catizone et al., “Deriving Translation Data From Bilingual P. Brown et al., “Aligning Sentences in Parallel Corpora’, Texts, Proceedings of the First International Acquisition Proceedings of the 29th Annual Meeting of the ASSociation Workshop, Detroit, Michigan, 1989. for Computational Linguistics, Jun. 1991, pp. 169-176. J. Spohrer et al., “Partial Traceback in Continuous Speech B. Merialdo, “Tagging Text With A Probalistic Model”, Recognition”, Proceedings of the IEEE International Con Proceedings of the International Conference on Acoustics, ference on Acoustics, Speech and Signal Processing, Paris, Speech and Signal Processing, Paris, France, May 14-17, France, 1982. 1991. U.S. Patent Sep. 8, 1998 Sheet 1 of 54 5,805,832

Portion of F. G. French Text F

English

Language Model

DeCoder Computes Pr(E)

Given F, finds E for which

Pr(E) Pr(FE) is large

English to French

Translation Model

Computes Pr(FE) N ".

104

Portions of

English Text E U.S. Patent Sep. 8, 1998 Sheet 2 of 54 5,805,832

Portion of

French Text F F. G. 2

French Transducer Transforms F into an intermed late structure F 7 20

Language Model eDecoder Computes FPr(E)M Given F, finds E' for which Pr(E') Pr(F ? E is large English to French 206 Translation Model Computes Pr(F E S

English Transducer Constructs E from a intermediate structure E w

203 wriupulousap

Portions of 207

English Text E U.S. Patent Sep. 8, 1998 Sheet 3 of 54 5,805,832

305

N interlingual F G 3 epresentations

1. 304 306 N

Statistical transfer from statistical transfer from French intermediate interlingua to English structures to interlinC intermediate Structures 307

303 N 1. English intermediate French intermediate StructureS StructureS

302 308

1. N

transducer from French transducer from English text to French intermediated Structures intermediate structures to Endish text

French text Sheet 4 of 54 5,805,832

input source document

capture next section

of Source text

translate source text

into target language

SOUCe

text?

output target

document U.S. Patent Sep. 8, 1998 Sheet 5 of 54 5,805,832

402 input source F. G. 5 document

capture next section

of Source text

translate Source text USe

into target language interface

503

------

output target document

U.S. Patent Sep. 8, 1998 Sheet 7 of 54 5,805,832

707

tranScuCe SOU Ce text target-Structure language model generate target

structure hypotheses and SCOreS target-Structure 706 tO

SOU Ce-Stru Cture select hypothesis with tranSlation model the greatest Score

tranScuCe target structure

N target text U.S. Patent Sep. 8, 1998 Sheet 8 of 54 5,805,832

eaStre

Constrants

generate putative translations of source

display putative

translations of source

one of the

target text F. G. 8 U.S. Patent Sep. 8, 1998 Sheet 9 of 54 5,805,832

F.G. 9 -80 90S 705

transduce N SOur Ce text target-Structure language model

generate target

Structure hypotheses

and Scores target-Structure to SOUrCe-Structure

Select set of hypotheses translation model

with greatest scores

transduce target

StructureS -- 904

alternative translations

in the target language U.S. Patent Sep. 8, 1998 Sheet 10 of 54 5,805,832

OO2 N data to storage

translation

System

operating system

input f output interface

OO4

OO O O 2

1008

OO AR N external A77777.77A/N O13 netWOr K Z//777/7

F. G. O U.S. Patent Sep. 8, 1998 Sheet 11 of 54 5,805,832

SOUrce text

110 tokenize raW text

1102 determine Words from tokens

1103 annotate Words with parts-of-speech

1 104 perform syntactic transformations

1105 perform morphological transformations

1106 annotate morphs

intermediate SOUrce text F. G. U.S. Patent Sep. 8, 1998 Sheet 12 of 54 5,805,832

F. G. 2

token sequence

marrusa-a-a-as 12O1 s labelsdetermine true-Case

1202 perform specialized u transformations

aru samurosa

cased Word sequence U.S. Patent Sep. 8, 1998 Sheet 13 of 54 5,805,832

word sequence

annotated with

part-of-Speech

tags

1301 O perform duestion N. inversion

1302

perform do-not

Coales Cence

1303

perform adverb

TOWement

transformed

word sequence F. G. 3

annotated with

part-of-speech

tags U.S. Patent Sep. 8, 1998 Sheet 14 of 54 5,805,832

word Sequence

annotated with

part-of-Speech tags

perform question

n inversion

perform negative

Coalescence

perform pronoun

OVerment

perform adverb and

adjective movement r |--

transformed

Word Sequence

F. G. 4 annotated with

part-of-Speech

tags U.S. Patent Sep. 8, 1998 Sheet 15 0f 54 5,805,832

-1 5O1

Words: why do not you ever succeed ?

tags: RRG VDO XX PPY RR VV0

1503 example syntactic -1 transducer

-1 1502

Words: why you succeed do not MO ever M1 QINV

tags: RRY PPY WVO XX RR ONV

F.G. 5 U.S. Patent Sep. 8, 1998 Sheet 16 of 54 5,805,832 aouanbas

saouanbas 909;(~

U.S. Patent Sep. 8, 1998 Sheet 17 of 54 5,805,832

F. G. 7

begin patterns and actions; WHINP?. 1). DOSET. not ?. 2). SUBJECT-NP. 3). ADWERB'k. 4). , BARE INF-TAG. (ii) k. 2 -> english-question-inversion; (ii) + -> default-action; end patterns and actions; begin auxiliary patterns; ADWERB = , ADWTAG most , RR: more, RR: ; BARENP = ((ADWERB) k . , ADJTAG) sk., COMMONNOUNTAG; SUBJECTNP = (, SUBJECTTAG) (, DETERMINERTAG)?. BARENP; WHNPend auxiliary = WH-WORD patterns; ( ( which what how’. many). BARE-NP); begin sets of words; DOSET = {do does did; WH WORD - (what who where why whom how when which}; end sets of Words; begin sets of tags; PROPERNOUNTAG = {NP NP1 NP2}; PRONOUNTAG (PN PN1 PNQS); SUBJECTTAG (PROPERNOUNTAG PRONOUNTAG); BARE INFTAG {WVO WHO WBO WDO); COMMONNOUNTAG = {ND1 MN NN1 NN2 NNJ NNJ1 NNJ2 NNL. NNL1 NNL2 NO NNO2 NNS1 NNS2 NNSA1. NNSB NNSB1 NNSB2 NNT1 NNT2 NNU NNU1 NNU2 NPD1 NPD2 NPM1); ADWTAG {RA REX RG RGA RGR RGT RP RPK RR RRR, RRT RT); ADJTAG {JA JB JBR JBT JJ JJ R J JT JK); DETERMINERTAG (AT AT1 DA DA1 DA2 DA2R DA2T DAR DAT DB DB2 DD DD1 DD2 DDQ DDQ$ DDQV); end sets of tags; U.S. Patent Sep. 8, 1998 Sheet 18 of 54 5,805,832

F. G. 8 english question-inversion: procedure; /* This procedure is invoked when the pattern: WHNP 1). DOSET. not ?. 2). SUBJECT-NP. (3). ADVERB k. (4) ... ii., BARE INFTAG. (ii) k . 2 is matched sk/ set the output sequence to null; append tuples the beginning of the input sequence to position (Oreg (1) -1 to output; append tuples from input positions (Oreg (2) to Oreg(3)-1 to output; if (Oreg (2) = (Oreg (1) + 2 then do; append tuples from input positions (Oreg (1) to (Oreg (1)+1 to output; append tuples from input in positions Oreg (3) to Qreg (4) to output; end; else do; append tuples from input in positions Oreg (3) to Qreg(4) -1 to output; if the word at postion (Oreg (1) is does then do; append the input word at position (Oreg (4) and the tag for does to output; end ; else if the word at position (Oreg (1) is do then do; append tuple at input position (Oreg (4) end; else if the word at postion (Oreg (1) is did then do; append the input word at position (Oreg (4) and the tag for did to output; end; else there is an error; end; append the tuples from input positions (Oreg (4) +1 to one less than the last position to output; append the word 'QINV and the tag 'QINV to output; end english-question inversion; U.S. Patent Sep. 8, 1998 Sheet 19 of 54 5,805,832

intern ediate target text

1901 generate nouns, adverbs and adjectives from morphological units 704 1902 1. provide do-support for negatives 1903 provide do-support for inverted questions 1904 separate combined verb tenses 1905

1906 reposition adjectives and adverbs

1907 restore question inversion 1908 resolve target-word ambiguities resulting 1909 is from ambiguities in 1910 s Geo)target text F. G. 9

U.S. Patent Sep. 8, 1998 Sheet 21 of 54 5,805,832 ueap

ued

ueap

e-1 22’9|-!

22’9|- U.S. Patent Sep. 8, 1998 Sheet 22 of 54 5,805,832

target structure SOUCe Structure

F. G. 24

generate allowed alignments

betWeen the Source Structure

and the target structure

Compute the score of detailed

each alignment translation model

Sum the SCOres Of

all alignments

SCOre Of Source Structure

given target structure

U.S. Patent Sep. 8, 1998 Sheet 23 of 54 5,805,832

10IZ

G2(9|-! 9.In?Onu?S30Inos U.S. Patent Sep. 8, 1998 Sheet 24 of 54 5,805,832

SOur Ce Stru Cture F. G. 26

target structure

260

fertility table of

Submodel fertility parameters

26O2

M exical table of Subnode lexical parameterS

2603 2501-1 distortion table of

Submodel distortion probabilities

probability of

SOUrCe Structure

and alignment

given target structure

U.S. Patent Sep. 8, 1998 Sheet 25 of 54 5,805,832

2701

Select parameter values

for model O

Using parameter values

of model i-1

select initial parameter values for model i

Starting from initial parameter values

for model i

find locally optimal parameter values

for model i

More models

models and

parameter valu U.S. Patent Sep. 8, 1998 Sheet 26 of 54 5,805,832

F. G. 28 initially optimal

parameter values 280 u 6 for model P

manama

Choose 0 tO maximize R(P. Pe)

converged ?

locally optimal 2804 -- parameter values 0 for model P U.S. Patent Sep. 8, 1998 Sheet 27 of 54 5,805,832

FIG. 29

2804

parameter values G for model f

1 2901 Choose 0 to

maximize R(PP 0

1 2902 initial parameter values 0 for

model P

U.S. Patent Sep. 8, 1998 Sheet 28 of 54 5,805,832

F. G. 3O

plan evaluation letter aSSeSSnent request analysis In end understanding CaSe Opinion Question Conversation charge discussion Statement draft day aCCOllntS year people week Custoners month individuals Quarter employees half students reps representatives representative rep

U.S. Patent Sep. 8, 1998 Sheet 30 of 54 5,805,832

---Assign each word to its own class

Select the pair of classes Model for for which a merge

results in the Smallest bigrams loss of mutual information 3103 32O1

Merge the selected

pair of classes

3202

the number of

classes equal N 3204 U.S. Patent Sep. 8, 1998 Sheet 31 of 54 5,805,832

And the program3 has beens implemented.6

Lel programme2 ag été4 miss ens application 7 F. G. 33

Thel - Lel balance) - restee Was: appartenaits territory5thea >-u autochtOneS5aux4 Of6 the aboriginals peopleg F. G. 34

The poor2 don't3 have any's moneys

Lest pauvreS2 SOnta demunis. F. G. 35 U.S. Patent Sep. 8, 1998 Sheet 32 of 54 5,805,832

cheap cheap FIG. 36 bonu-1N marché a marché-1N bon bgn mathé bgn mathé

What - En! is2 virtup thes - des anticipated4 leS4 COSts nouvelless Ofs () propositionss administering (X) - 7 Collecting.9ands XX(X) eStgquels feeslo X le10 under 11 X Cout 11 the12 () prevu 12 neW13 cle13 proposal14 administration 14 15 et15 ------de16 perception 17 Fl G 37 --- lessde 8 droitS20 N N- 21 U.S. Patent Sep. 8, 1998 Sheet 33 of 54 5,805,832

F. G 38

The Le1 Secretary 2 Secrétaire2 Ofs des Statea Etat 4 fors a5 externals less affairst Affairest COIneS3 extérieurg ing Seg aS10 présenteio the11 comine Olle12 le12 Supporter 13 Seul 13 Of4 défenseur14 the 15 embattled 16 minister 17 ministres Ofs yesterday 19 Clill 18

bouSculler22 hier23 U.S. Patent Sep. 8, 1998 Sheet 34 of 54 5,805,832 F. G. 39

Mr. Monsieur Speaker? l2 3 Orateurs ifa 4. Wes. Sis mightg nOuS6 return pOuVOnSI tO3 revenirs starredg ag questions 10 leS10 11 questions 11 112 marquées 12 lOW 13 de13 have14 lill 14 the15 astérisque15 anSWer 16 16 and 17 je17 Wills donnerails give19 la19 it20 réponse20 to21 &21 the2 la22 house23 chambre23 U.S. Patent Sep. 8, 1998 Sheet 35 0f 54 5,805,832

Je vais Gada propre decisionY

400 Will make my OW decision

Je vais God, propre u-1 40 O2 Will take my OW

vocabulary word: prendre

informant site: first noun to the right F. G. 4O informant word:

Y first Sequence: decision Second Seduence: voiture U.S. Patent Sep. 8, 1998 Sheet 36 of 54 5,805,832

41 Ol

input sequence -1

4102 capture next word

get best informant -1 403

Site for Word

table of informants

capture informant and duestions for

Word at site every vocabulary

Word

Obtain class of

informant

label word with

class of informant --- F. G. 4 -- -

input Sequence 4-06

annotated with -1

Sen Se-labels U.S. Patent Sep. 8, 1998 Sheet 37 of 54 5,805,832

vocabulary

A.T." 4203

table of possible find the best question informants

for each informant site table of probabilities derived from Viterbi alignments

Store the informant site u table of informants and the best question and questions for every vocabulary Word

Words?

O U.S. Patent Sep. 8, 1998 Sheet 38 of 54 5,805,832

vocabulary word

430 43O4

next informant Site table of possibleib informants

42O5 find a question about

the informant that provides table of probabilities

a lot of information about derived from Viterbi translations of the vocabulary alignments

word into the target language

42O6

table of informants Store vocabulary Word, and questions for

informant site and question every vocabulary Word

4303 Ore

informants?

- - - O - F. G. 43 U.S. Patent Sep. 8, 1998 Sheet 39 of 54 5,805,832

initial choice of

question cA.

A for fixed c, find the best q: q(ec) = 1/norm X p(exit)

x: 8 (x) = c

table of alignment probabilities

for fixed a find the best question € : ?(x)M = argmin D(p(Ex,f); q(EC)) 4403

better question e F. G. 44 U.S. Patent 5,805,832

snduoo \

ITOSOLI€) INAIIV/MIZX1 U.S. Patent Sep. 8, 1998 Sheet 41 of 54 5,805,832 F. G. 46

- wrunner towa sh -

N 1 4601 4,602

Determine 4603 4604 Determine Sentern CeS SentenCeS

Align sentences that are mutual 40s ------translations

Aligned sentence pairs U.S. Patent Sep. 8, 1998 Sheet 42 of 54 5,805,832

Corpus 2

F. G. 47

major and minor anchor points 4701

Align major anchor points 4702

DiScard some aligned sections 4703

Align Sentences within 4704 SubSections N mura wumasauru 4605

Aligned sentences

U.S. Patent Sep. 8, 1998 Sheet 44 of 54 5,805,832

9817 U.S. Patent Sep. 8, 1998 Sheet 45 of 54 5,805,832

25e e F. G. 49 2Of 8f SIf

Bead Teact. e One English sentence f One French sentence ef one English and one French sentence eef two English and one French sentence eff one English and two French sentences One English paragraph One French paragraph If one English and one French paragraph F. G. 5

U.S. Patent Sep. 8, 1998 Sheet 47 of 54 5,805,832

initialize set of partial hypotheses and scores

Select partial hypotheses to extend to new hypotheses target-Structure language model

test for

completion target-Structure

SOUrCe-Structure translation node

extend Selected partial hypotheses U.S. Patent Sep. 8, 1998 Sheet 48 of 54 5,805,832

W W W

La jeune fille a reveille sa mere

transduce Source text

a jeune fille V past 3s reveillerW sa mereN F.G. 55

the

la jeune fille V past 3s reveillerW sa mereN F.G. 56

mother

a jeune fillet V past 3s reveillerW sa mereW F.G. 57 U.S. Patent Sep. 8, 1998 Sheet 49 of 54 5,805,832

the girl +

la jeune fille V past 3s reveillerM sa mereN F. G. 58

the girl

a jeune fille V past 3s reveiller/ sa mereN F.G. 59

the girl +

la jeune fille V past 3s reveillerW sa mereN F.G. 6O U.S. Patent Sep. 8, 1998 Sheet 50 of 54 5,805,832 the girlN V past\ 3S to Wake\ . la jeune fille V past 3s reveiller sa mere F. G. 6

the girl V past 3s to Wake up her

la|NN jeune fille V past 3s \\reveiller sa mere F. G. 62

the girl V past 3s to wake up her +

la jeuneN fille \V past 3s \\reveiller sa mere F.G. 63 U.S. Patent Sep. 8, 1998 Sheet 51 of 54 5,805,832 N -(1,2,3}

F.G. 64

yeSN Oui Oui FIG 65 U.S. Patent Sep. 8, 1998 Sheet 52 of 54 5,805,832 54.02 F.G. 66

initialize threshold of every --i-active prioritya queue to infinity - 660

Set thresholds for every active --6602 priority queue on the frontier

set i = maximum cardinality of N- 6603 any priority queue on the frontier

done selecting list

of partial hypotheses to

be extended

set k = number of priority queues of cardinality i

process j-th priority queue of cardinality i

U.S. Patent Sep. 8, 1998 Sheet 53 of 54 5,805,832

670 N. priority queue Q F.G. 67 --- 6702 Na

6703 6608 N let i = partial hypothesis in Q with the greatest score

6704 N

O Gone)

6705 N yes

addi to list to be extended

use i to adjust thresholds in parent priority queues

i = partial hypothesis with next largest score in Q -

5,805,832 1 2 SYSTEM FOR PARAMETRICTEXT TO turns out, however, to be a very difficult job to determine an TEXT LANGUAGE TRANSLATION accurate phoneme Sequence from a speech Signal. Although human experts certainly do exist, it has proven extremely This application is a continuation of application Ser. No. difficult to formalize their knowledge. In the alternative 07/736,278, filed Jul 25, 1991, now U.S. Pat. No. 5,477, 5 Statistical approach, Statistical models, most predominantly 451. hidden Markov models, capable of learning acoustic phonetic knowledge from Samples of Speech are employed. TECHNICAL FIELD The present approaches to machine translation are similar The present invention relates to machine translation in in their reliance on hand-written rules to the approaches to general, and, in particular to methods of performing machine Speech recognition twenty years ago. Roughly Speaking, the translation which use Statistical models of the translation present approaches to machine translation can be classified proceSS. into one of three categories: direct, interlingual, and transfer. In the direct approach a Series of deterministic linguistic BACKGROUND ART transformations is performed. These transformations There has long been a desire to have machines capable of 15 directly convert a passage of Source text into a passage of translating text from one language into text in another target text. In the transfer approach, translation is performed language. Such machines would make it much easier for in three Stages: analysis, transfer, and Synthesis. The analysis humans who speak different languages to communicate with Stage produces a structural representation which captures one another. The present invention uses Statistical tech various relationships between Syntactic and Semantic niques to attack the problem of machine translation. Related aspects of the Source text. In the transfer Stage, this structural Statistical techniques have long been used in the field of representation of the Source text is then transferred by a automatic Speech recognition. The background for the Series of deterministic rules into a structural representation invention is thus in two different fields: automatic speech of the target text. Finally, in the Synthesis Stage, target text recognition and machine translation. is Synthesized from the Structural representation of the target The central problem in Speech recognition is to recover 25 text. The interlingual approach to translation is similar to the from an acoustic Signal the word Sequence which gave rise transfer approach except that in the interlingual approach an to it. Prior to 1970, most speech recognition systems were internal Structural representation that is language indepen built around a set of hand-written rules of Syntax, Semantics dent is used. Translation in the interlingual approach takes and acoustic-phonetics. To construct Such a System it is place in two stages, the first analyzes the Source text into this necessary to firstly discover a set of linguistic rules that can language-independent interlingual representation, the Sec account for the complexity of language, and, Secondly to ond Synthesizes the target text from the interlingual repre construct a coherent framework in which these rules can be Sentation. All these approaches use hand-written determin assembled to recognize Speech. Both of these problems inistic rules. proved insurmountable. It proved too difficult to write down Statistical techniques in Speech recognition provide two by hand a Set of rules that adequately covered the vast Scope 35 advantages over the rule-based approach. First, they provide of natural language and to construct by hand a set of weights, means for automatically extracting information from large priorities and if-then Statements that can regulate interac bodies of acoustic and textual data, and Second, they tions among its many facets. provide, via the formal rules of probability theory, a SyS tematic way of combining information acquired from dif This impasse was overcome in the early 1970's with the 40 introduction of Statistical techniques to speech recognition. ferent Sources. The problem of machine translation between In the Statistical approach, linguistic rules are extracted natural languages is an entirely different problem than that automatically using Statistical techniques from large data of Speech recognition. In particular, the main area of bases of Speech and text. Different types of linguistic infor research in Speech recognition, acoustic modeling, has no mation are combined via the formal laws of probability 45 place in machine translation. Machine translation does face theory. Today, almost all Speech recognition Systems are the difficult problem of coping with the complexities of based on Statistical techniques. natural language. It is natural to wonder whether this prob lem won't also yield to an attack by Statistical methods, Speech recognition has benefited by using Statistical lan much as the problem of coping with the complexities of guage models which exploit the fact that not all word natural Speech has been yielding to Such an attack. Although Sequences occur naturally with equal probability. One 50 the statistical models needed would be of a very different Simple model is the trigram model of English, in which it is nature, the principles of acquiring rules automatically and assumed that the probability that a word will be spoken combining them in a mathematically principled fashion depends only on the previous two words that have been might apply as well to machine translation as they have to spoken. Although trigram models are Simple-minded, they Speech recognition. have proven extremely powerful in their ability to predict 55 words as they occur in natural language, and in their ability to improve the performance of natural-language speech DISCLOSURE OF THE INVENTION recognition. In recent years more Sophisticated language The present invention is directed to a System and method models based on probabilistic decision-trees, Stochastic for translating Source text from a first language to target text context-free grammars and automatically discovered classes 60 in a Second language different from the first language. The of words have also been used. Source text is first received and Stored in a first memory In the early days of Speech recognition, acoustic models buffer. Zero or more user defined criteria pertaining to the were created by linguistic experts, who expressed their Source text are also received and Stored by the System. The knowledge of acoustic-phonetic rules in programs which user defined criteria are used by various Subsystems to analyzed an input Speech Signal and produced as output a 65 bound the target text. Sequence of phonemes. It was thought to be a simple matter The Source text is then transduced into one or more to decode a word Sequence from a Sequence of phonemes. It intermediate Source Structures which may be constrained by 5,805,832 3 4 any of the user defined criteria. One or more target hypoth FIG. 5 is a schematic flow diagram of a basic system eses are then generated. Each of the target hypotheses operating in user-aided mode, comprise a intermediate target Structure of text Selected from FIG. 6 is a Schematic representation of a work Station the Second language. The intermediate target Structures may environment through which a user interfaces with a user also be constrained by any of the user defined criteria. aided System; A target-Structure language model is used to estimate a FIG. 7 is a schematic flow diagram of a translation first score which is proportional to the probability of occur component of a batch System; rence of each intermediate target-Structure of text associated with the target hypotheses. A target Structure-to- Source FIG. 8 is a Schematic flow diagram illustrating a user's Structure translation model is used to estimate a Second Score interaction with a human-aided System; which is proportional to the probability that the intermediate FIG. 9 is a schematic flow diagram of a translation target-Structure of text associated with the target hypotheses component of a human-aided System; wVill translate into the intermediate Source-Structure of text. FIG. 10 is a schematic block diagram of a basic hardward For each target hypothesis, the first and Second Scores are components used in a preferred embodiment of the present combined to produce a target hypothesis match Score. 15 invention; The intermediate target-Structures of text are then trans FIG. 11 is a Schematic flow diagram of a Source trans duced into one or more transformed target hypothesis of text ducer, in the Second language. This transduction Step may also be FIG. 12 is a schematic flow diagram of a token-to-word constrained by any of the user defined criteria. transducer, Finally, one or more of the transformed target hypotheses FIG. 13 is a schematic flow diagram of a syntactic is displayed to the user according to its associated match transducer for English; Score and the user defined criteria. Alternatively, one or more FIG. 14 is a Schematic flow diagram of a Syntactic of the transformed target hypotheses may be Stored in transducer for French. memory or otherwise made available for future access. 25 FIG. 15 is an example of a Syntactic transduction; The intermediate Source and target Structures are trans duced by arranging and rearranging, respectively, elements FIG. 16 is a schematic flow diagram illustrating the of the Source and target text according to one or more of a operation of a finite-state transducer, lexical Substitution, part-of-Speech assignment, a morpho FIG. 17 is a Table of Patterns for a finite-state pattern logical analysis, and a Syntactic analysis. matcher employed in Some Source and target transducers, The intermediate target Structures may be expressed as an FIG. 18 is a Table of Actions for an action processing ordered Sequence of morphological units, and the first Score module employed in Some Source and target transducers. probability may be obtained by multiplying the conditional FIG. 19 is a Schematic flow diagram of a target transducer, probabilities of each morphological unit within an interme FIG. 20 is a Schematic block diagram of a target Structure diate target Structure given the occurrence of previous mor 35 language model; phological units within the intermediate target Structure. In another embodiment, the conditional probability of each unit FIG. 21 is a Schematic block diagram of a target Structure of each of the intermediate target Structure may depend only to Source Structure translation model; on a fixed number of preceding units within the intermediate FIG. 22 is a simple example of an alignment between a target Structure. 40 Source Structure and a target Structure; The intermediate Source Structure transducing Step may FIG. 23 is another example of an alignment; further comprise the Step of tokenizing the Source text by FIG. 24 is a Schematic flow diagram of showing an identifying individual words and word Separators and embodiment of a target Structure to Source Structure trans arranging the words and the word Separators in a Sequence. lation model; Moreover, the intermediate Source Structure transducing Step 45 FIG. 25 is a schematic block diagram of a detailed may further comprise the Step of determining the case of the translation model; words using a case transducer, wherein the case transducer FIG. 26 is a schematic flow diagram of an embodiment of assigns each word a tolken and a case pattern. The case a detailed translation model; pattern Specifies the case of each letter of the word. Each FIG. 27 is a schematic flow diagram of a method for case pattern is evaluated to determine the true case pattern 50 estimating parameters for a progression of gradually more of the word. Sophisticated translation models; BRIEF DESCRIPTION OF DRAWINGS FIG. 28 is a schematic flow diagram of a method for The above Stated and other aspects of the present inven iteratively improving parameter estimates for a model; tion will become more evident upon reading the following 55 FIG.29 is a schematic flow diagram of a method for using description of preferred embodiments in conjunction with parameters values for one model to obtain good initial the accompanying drawings, in which: estimates of parameters for another model; FIG. 1 is a schematic block diagram of a simplified FIG. 30 shows some sample subtrees from a tree con French to English translation System; Structed using a clustering method; FIG. 2 is a Schematic block diagram of a simplified 60 French to English translation System which incorporates FIG. 31 is a schematic block diagram of a method for analysis and Synthesis, partitioning a vocabulary into classes; FIG. 3 is a Schematic block diagram illustrating a manner FIG. 32 is a schematic flow diagram of a method for in which Statistical transfer can be incorporated into a partitioning a vocabulary into classes; translation System based on an interlingua; 65 FIG.33 is an alignment with independent English words; FIG. 4 is a Schematic flow diagram of a basic System FIG. 34 is an alignment with independent French words; operating in batch mode, FIG. 35 is a general alignment; 5,805,832 S 6 FIG. 36 shows two tableaux for one alignment; FIG. 68 contains pseudocode describing the method for FIG. 37 shows the best alignment out of 19x10- align extending hypotheses ments, BEST MODE FOR CARRYING OUT THE FIG.38 shows the best of 8.4x10' possible alignments; INVENTION FIG. 39 shows the best of 5.6x10' alignments; Contents FIG. 40 is an example of informants and informant sites; 1 Introduction FIG. 41 is a schematic flow diagram illustrating the 2 Translation Systems operation of a Sense-labeling transducer, 3 Source Transducers FIG. 42 is a schematic flow diagram of a module that 1O 3.1. Overview determines good questions about informants for each 3.2 Components vocabulary word; 3.2.1 Tokenizing Transducers FIG. 43 is a schematic flow diagram of a module that 3.2.2 Token-Word Transducers determines a good question about each informant of a 3.2.3 True-Case Transducers vocabulary word; 15 3.2.4 Specialized Transformation Transducers FIG. 44 is a schematic flow diagram of a method for 3.2.5 Part-of-Speech Labeling Transducers determining a good question about an informant; 3.2.6 Syntactic Transformation Transducers FIG. 45 is a schematic diagram of a process by which 3.2.7 Syntactic Transducers for English word-by-word correspondences are extracted from a bilin 3.2.8 Syntactic Transducers for French gual corpus; 3.2.9 Morphological Transducers 3.2.10 Sense-Labelling Transducers FIG. 46 is a schematic flow diagram of a method for 3.3 Source-Transducers with Constraints aligning Sentences in parallel corpora; 4 Finite-State Transducers FIG. 47 is a schematic flow diagram of the basic step of 5 Target Transducers a method for aligning Sentences, 25 5.1. Overview FIG. 48 depicts a sample of text before and after textual 5.2 Components cleanup and Sentence detection; 5.3 Target Transducers with Constraints FIG. 49 shows an example of a division of aligned copora 6 Target Language Model into beads, 6.1 Perplexity FIG. 50 shows a finite state model for generating beads; 6.2 n-gram Language models FIG. 51 shows the beads that are allowed by the model; 6.3 Simple n-gram models 6.4 Smoothing FIG. 52 is a histogram of French sentence lengths; 6.5 n-gram Class Models FIG. 53 is a histogram of English sentence lengths; 7 Classes FIG. 54 is a schematic flow diagram of the hypothesis 35 7.1 Maximum Mutual Information Clustering Search component of the System; 7.2 A Clustering Method FIG. 55 is an example of a Source Sentence being trans 7.3 An Alternate Clustering Method duced to a Sequence of morphemes, 7.4 Improving Classes FIG. 56 is an example of a partial hypothesis which 7.5 Examples results from an extension by the target morpheme the; 40 7.6 A Method for Constructing Similarity Trees FIG. 57 is an example of a partial hypothesis which 8 Overview of Translation Models and Parameter Esti results from an extension by the target morpheme mother, mation FIG. 58 is an example of a partial hypothesis which 8.1 Translation Models results from an extension with an open target morpheme; 8.1.1 Training FIG. 59 is an example of a partial hypothesis which 45 8.1.2 Hidden Alignment Models results from an extension in which an open morpheme is 8.2 An example of a detailed translation model closed; 8.3 Iterative Improvement FIG. 60 is an example of a partial hypothesis which 8.3.1 Relative Objective Function results from an extension in which an open morpheme is 8.3.2 Improving Parameter Values kept open; 50 8.3.3 Going From One Model to Another FIG. 61 is an example of a partial hypothesis which 8.3.4 Parameter Reestimation Formulae 8.3.5 Approximate and Viterbi Parameter Estimation results from an extension by the target morpheme to wake; 8.4 Five Models FIG. 62 is an example of a partial hypothesis which 9 Detailed Description of Translation Models and Param results from an extension by the pair of target morphemes up 55 eter Estimation and to wake; 9.1 Notation FIG. 63 is an example of a partial hypothesis which 9.2 Translations results from an extension by the pair of target morphemes up 9.3 Alignments and to wake in which to wake is open; 9.4 Model 1 FIG. 64 is an example of a Subset lattice; 60 9.5 Model 2 FIG. 65 is an example of a partial hypothesis that is stored 9.6 Intermodel interlude in the priority queue {2, 3}; 9.7 Model 3 FIG. 66 is a schematic flow diagram of the process by 9.8 Deficiency which partial hypotheses are Selected for extension; 9.9 Model 4 FIG. 67 is a schematic flow diagram of the method by 65 9.10 Model 5 which the partial hypotheses on a priority queue are pro 9.11 Results cessed in the Selection for extension Step; 9.12 Better Translation Models 5,805,832 7 8 9.12.1 Deficiency 1. A language model 101 which assigns a probability or 9.12.2 Multi-word Notions score P(E) to any portion of English text E; 10 Mathematical Summary of Translation Models 2. A translation model 102 which assigns a conditional 10.1 Summary of Notation probability or score P(FE) to any portion of French text 10.2 Model 1 F given any portion of English text E; and 10.3 Model 2 3. A drcoder 103 which given a portion of French text F 10.4 Model 3 finds a number of portions of English text E, each of 10.5 Model 4 which has a large combined probability or Score 10.6 Model 5 11 Sense Disambiguation 1O 11.1 Introduction 11.2 Design of a Sense-Labeling Transducer Language models are described in Section 6, and trans 11.3 Constructing a table of informants and questions lation models are described in Sections 8, 9, and 10. An 11.4 Mathematics of Constructing Questions embodiment of the decoder 103 is described in detail in 11.4.1 Statistical Translation with Transductions 15 Section 14. 11.4.2 Sense-Labeling in Statistical Translation A shortcoming of the Simplified architecture depicted in 11.4.3 The Viterbi Approximation FIG. 1 is that the language model 101 and the translation 11.4.4 Cross Entropy model 102 deal directly with unanalyzed text. The linguistic 11.5 Selecting Questions information in a portion of text and the relationships 11.5.1 Source Questions between translated portions of text is complex, involving 11.5.2 Target Questions linguistic phenomena of a global nature. However, the 11.6 Generalizations models 101 and 102 must be relatively simple so that their 12 Aligning Sentences parameters can be reliably estimated from a manageable 12.1. Overview amount of training data. In particular, they are restricted to 12.2 Tokenization and Sentence Detection 25 the description of local Structure. 12.3 Selecting Anchor Points This difficulty is addressed by integrating the architecture 12.4 Aligning Major Anchors of FIG. 1 with the traditional machine translation arclitec 12.5 Discarding Sections ture of analysis, transfer, and Synthesis. The result is the 12.6 Aligning Sentences simplified embodiment depicted in FIG. 2. It includes six 12.7 Ignoring Anchors components: 12.8 Results for the Hansard Example 1. a French transduction component 201 in which a 13 Aligning Bilingual Corpora portion of French text F is encoded into an intermediate 14 Hypothesis Search-Steps 702 and 902 structure F"; 14.1. Overview of Hypothesis Search 2. a statistical transfer component 202 in which F" is 14.2 Hypothesis Extension 5404 35 translated to a corresponding intermediate English 14.2.1 Types of Hypothesis Extension structure E'. 14.3 Selection of Hypotheses to Extend 5402 3. an Entglish transduction component 203 in which a portion of English text E is reconstructed from E. 1 Introduction This statistical transfer component 202 is similar in func The inventions described in this specification include 40 tion to the whole of the simplified embodiment 104, except apparatuses for translating text in a Source language to text that it translates between intermedliate Structures instead of in a target language. The inventions employ a number of translating directly between portions of text. The building different components. There are various embodiments of blocks of these intermediate Structures may be words, each component, and various embodiments of the appara morphemes, Syntactic markers, Semantic markers, or other tuses in which the components are connected together in 45 units of linguistic structure. In descriptions of aspects of the different ways. Statistical transfer component, for reasons of exposition, these building blocks will sometimes be referred to as words To introduce Some of the components, a number of or morphemes, morphs, or morphological units. It should be Simplified embodiments of a translation apparatus will be understood, however, that the unless Stated otherwise the described in this introductory Section. The components 50 themselves will be described in detail in Subsequent Sec Same descriptions of the transfer component apply also to tions. More Sophisticated embodiments of the apparatuses, other units of linguistic Structure. including the preferred embodiments, will also be described The Statistical transfer component incorporates in Subsequent Sections. 4. an English Structure language mnodel 204 which assigns a probability or score P(E) to any intermediate The simplified embodiments translate from French to 55 English. These languages are chosen only for the purpose of structure E'. example. The Simplified embodiments, as well as the pre 5. an English Structure to French Structure transslation ferred embodiments, can easily be adapted to other natural model 205 which assigns a conditional probability or languages, including but not limited to, Chinese, Danish, score P(FE) to any intermediate structure F given any Dutch, English, French, German, Greek, Italian, Japanese, 60 intermediate Structure E"; and Portuguese, and Spanish, as well as to artificial languages 6. a decoder 206 which given a French structure F" finds including but not limited to data-base query languages Such a number of English Structures E, each of which has a as SQL, and the programming languages Such as LISP. For large combined probability or Score example, in one embodiment queries in English might be translated into an artificial database query language. 65 A simplified embodiment 104 is depicted in FIG. 1. It The role of the French transduction component 201 and includes three components: the English transduction component 203 is to facilitate the 5,805,832 10 task of the statistical transfer component 202. The interme 502 the step 404 of the batch system 401 is replaced by step diate Structures E and F" encode global linguistic facts about 501 in which source text is translated into the target lan E and F in a local form. As a result, the language and guage with input from the user 503. In the embodiment translation models 204 and 205 for E' and F" will be more depicted in FIG. 5, the system 502 is accompanied by a accurate than the corresponding models for E and F. user-interface 504 which includes a text editor or the like, The French transduction component 201 and the English and which makes available to the user translation transduction component 203 can be composed of a Sequence dictionaries, thesauruses, and other information that may of Successive transformations in which F" is incrementally assist the user 503 in Selecting translations and Specifying constructed from F, and E is incrementally recovered from constraints. The user can Select and further edit the transla E'. For example, one transformation of the French transduc tions produced by the System. tion component 201 might label each word of F with the part In step 505 in FIG. 5, the user-aided system, portions of of Speech, e.g. noun, verb, or adjective, that it assumes in F. Source text may be selected by the user. The user might, for The part of Speech assumed by a word depends on other example, Select a whole document, a Sentence, or a single words of F, but in F" it is encoded locally in the label. word. The system might then show the users possible Another transformation might label each word of F with a 15 translations of the Selected portion of text. For example, if croSS-linqual Sense label. The label of a word is designed to the user Selected only a single word the System might Show elucidate the English translation of the word. a ranked list of possible translations of that word. The ranks Various embodiments of French transduction components being determined by statisical models that would be used to are described in Section 3, and various embodiments of estimate the probabilities that the Source word translates in English transduction components are described in Section 5. various manners in the Source context in which the Source Embodiments of a cross-lingtial Sense labeler and meth word appears. ods for Sense-labeling are described in Section 11. In step 506 in FIG. 5 the text may be displayed for the user FIG. 2 illustrates the manner by which statistical transfer in a variety of fashions. A list of possible translations may is incorporated into the traditional architecture of analysis, be displayed in a window on a Screen, for example. Another transfer, and Synthesis. However, in other embodiments 25 possibility is that possible target translations of Source words Statistical transfer can be incoporated into other architec may be positioned above or under the Source words as they tures. FIG. 3, for example, shows how two Separate Statis appear in the context of the Source text. In this way the tical transfer components can be incorporated into a French Source text could be annotated with possible target transla to-English translation System which uses an interlingna. The tions The user could then read the source text with the help French transducer 302 together with the statistical transfer of the possible translations as well as pick and choose module 304 translate French text 301 into an interlingial various translations. representations 305. The other statistical transfer module 306 together with the English transduction module 308 A Schematic representation of a work Station translate the interlingulal representations 305 into English environment, through which the a interfaces with a user text 309. 35 aided system is shown in FIG. 6. The language models and translation models are Statistical FIG. 7 depicts in more detail the step 404 of the batch models with a large number of parameters. The values for translation system 401 depicted in FIG. 4. The step 404 is these parameters are determined in a process called training expanded into four steps. In the first step 701 the input from a large bilingual corpus consisting of parallel English Source text 707 is transduced to one or more intermediate French text. For the training of the translation model, the 40 Source structures. In the second step 702 a set of one or more English and French text is first aligned on a Sentence by hypothesized target Structures are generated. This step 702 Sentence basis. The process of aligning Sentences is makes use of a language model 705 which assigns prob described in Sections 12 and 13. abilities or Scores to target Structures and a translation model 706 which assigns probabilities or scores to source struc 2 Translation Systems 45 tures given target structures. In the third step 703 the highest FIG. 4 depicts schematically a preferred embodiment of a scoring hypothesis is selected. In the fourth step 704 the batch translation system 401 which translates text without hypothesis selected in step 703 is transduced into text in the user assistance. The operation of the System involves three target language 708. basic steps. In the first step 403 a portion of source text is FIG.8 depicts in more detail the step 501 of the user-aided captured from the input Source document to be translated. In 50 translation system 502 depicted in FIG. 5 (the user interface the Second step 404, the portion of Source text captured in has been omitted for simplification). The step 501 is Step 403 is translated into the target language. In the third expanded into four steps. In the first step 801 the user 503 Step 405, the translated target text is displayed, made avail is asked to Supply Zero or more translation constraints or able to at least one of a listen and viewing operation, or hints. These constraints are used in Subsequent Steps of the otherwise made available. If at the end of these three steps 55 translation process. the Source document has not been fully translated, the In the second step 802 a set of putative translations of the System returns to Step 403 to capture the next untranslated Source text is produced. The word putative is used here, portion of Source text. because the System will produce a number of possible FIG. 5 depicts schematically a preferred embodiment of a translations of a given portion of text. Some of these will be user-aided translation system 502 which assists a user 503 in 60 legitimate translations only in particular circumstances or the translation of text. The system is similar to the batch domains, and others may simply be erroneous. In the third translation system 401 depicted FIG. 4. Step 505 measures, Step these translations are displayed or otherwise made receives, or otherwise captures a portion of Source text with available to the user. In the fourth step 804, the user 503 is possible input from the user. Step 506 displacs the text asked either to Select one of the putative translations pro directly to the user and may also output it to an electric 65 duced in step 803, or to specify further constraints. If the media, made available to at least one of a listening and user 503 chooses to specify further constraints the system Viewing operation or elsewhere. In the user-aided System returns to the first step 801. Otherwise the translation 5,805,832 11 12 selected by the user 503 is displayed or otherwise made shown in FIG. 10. A computer platform 1014 includes a available for viewing or the like. hardware unit 1004, which includes a central processing unit FIG. 9 depicts in more detail the step 802 of FIG. 8 in (CPU) 1006, a random access memory (RAM) 1005, and an which a set of putative translations is generated. The Step input/output interface 1007. The RAM 1005 is also called 802 is expanded into four steps. In the first step 901, the main memory. input Source text is transduced to one or more intermediate The computer platform 1014 typically includes an oper target structures. This step is similar to step 701 of FIG. 7 ating system 1003. A data storage device 1002 is also called except that here the transduction takes place in a manner a Secondary Storage and may include hard disks and/or tape which is consistent with the constraints 906 which have been drives and their equivalents. The data storage device 1002 provided by the user 503. In the second step 902, a set of one represents non-volatile Storage. The data Storage 1002 may or more hypotheses of target Structures is generated. This be used to Store data for the language and translation models step is similar to step 702 except that here only hypotheses components of the translation system 1001. consistent with the constraints 906 axe generated. In the Various peripheral components may be connected to the third step 903 a set of the highest scoring hypotheses is computer platform 1014, such as a terminal 1012, a micro selected. In the fourth step 904 each of the hypotheses 15 phone 1008, a keyboard 1013, a scanning device 1009, an selected in step 903 is transduced into text in the target external network 1010, and a printing device 1011. A user language. The Sections of text in the target language which 503 may interact with the translation system 1001 using the result from this Step are the output of this module. terminal 1012 and the keyboard 1013, or the microphone A wide variety of constraints may be specified in step 801 1008, for example. As another example, the user 503 might of FIG. 8 and used in the method depicted in FIG. 9. For receive a document from the external network 1010, trans example, the user 503 may provide a constraint which late it into another language using the translation System affects the generation of target hypotheses in step 902 by 1001, and then send the translation out on the external insisting that a certain Source word be translated in a network 1010. particular fashion. AS another example, he may provide a In a preferred embodiment of the present invention, the constraint which affects the transduction of Source text into 25 computer platform 1014 includes an IBM System RS/6000 an intermediate structure in step 901 by hinting that a Model 550 workstation with at least 128 Megabytes of particular ambiguous Source word should be assigned a RAM. The operating system 1003 which runs thereon is particular part of Speech. AS other examples, the user 503 IBM AIX (Advanced Interactive Executive) 3.1. Many may a) insist that the word transaction appear in the target equivalents to the this structure will readily become apparent text; b) insist that the word transaction not appear in the to those skilled in the art. target text; c) insist that the target text be expressed as a The translation System can receive Source text in a variety question, or d) insist that the Source word avoir be trans of known manners. The following are only a few examples lated differently from the way it is was translated in the fifth putative translation produced by the System. In other of how the translation system receives Source text (e.g. data), 35 to be translated. Source text to be translated may be directly embodiments which transduce Source text into parse trees, entered into the computer system via the keyboard 1013. the user 503 may insist that a particular Sequence of Source Alternatively, the Source text could be Scanned in using the words be Syntactically analyzed as a noun phrase. These are scanner 1009. The scanned data could be passed through a just a few of many similar types of examples. In embodi character recognizer in a known manner and then passed on ments in which case frames are used in the intermediate 40 to the translation System for translation. Alernatively, the Source representation, the user 503 may insist that a par user 503 could identify the location of the source text in ticular noun phrase fill the donor slot of the verb “to give’. main or Secondary Storage, or perhaps on, removable Sec These constraints can be obtained from a simple user ondary Storage (such as on a floppy disk), the computer interface, which Someone skilled in the art of developing System could retrieve, and then translate the text accord interfaces between users and computers can easily develop. ingly. As a final example, with the addition of a speech. Steps 702 of FIG. 7 and 902 of FIG. 9 both hypothesize 45 recognition component, it would also be possible to speak intermediate target Structures from intermediate Source into the microphone 1008, have the speech converted into Structures produced in previous Steps. Source text by the Speech recognition component. In the embodiments described here, the intermediate Translated target text produced by the translation appli Source and target Structures are Sequences of morphological 50 cation running on the computer System may be output by the units, Such as roots of verbs, tense markers, prepositions and System in a variety of known manners. For example, it may so forth. In other embodiments these structures may be more be displayed on the terminal 1012, stored in RAM 1005, complicated. For example, they may comprise parse trees or stored data storage 1002, printed on the printer 1011, or case-frame representations.Structures. In a default condition, perhaps transmitted out over the external network 1010. these Structures may be identical to the original texts. In Such 55 With the addition of a speech synthesizer it would also be a default condition, the transductions StepS are not necessary possible to convert translated target text into Speech in target and can be omitted. language. In the embodiments described here, text in the target Step 403 in FIG. 4 and step 505 in FIG.5 measure, receive language is obtained by transducing intermediate target or otherwise capture a portion of Source text to be translated. Structures that have been hypothesized in previous Steps. In 60 In the context of this invention, text is used to refer to other embodiments, text in the target language can be Sequences of characters, formatting codes, and typographi hypothesized directly from Source Structures. In these other cal marks. It can be provided to the System in a number of embodiments steps 704 and 904 are not necessary and can different fashions, Such as on a magnetic disk, Via a com be omitted. puter network, as the output of an optical Scanner, or of a An example of a WorkStation environment depicting a 65 Speech recognition System. In Some preferred embodiments, hardware implementation using the translation System the Source text is captured a Sentence ant a time. Source text (application) in connection wnith the present invention is is parsed into Sentences using a finite-state machine which 5,805,832 13 14 examines the text for Such things as upper and lower case Yes. characters and Sentence terminal punctuation. Such a And “Gauss-Bonnet', as in the “Gauss-Bonnnet Theo machine can easily be constructed by Someone skilled in the rem art. In other embodiments, text may be parsed into units Such Two names, two words. as phrases or paragraphs which are either Smaller or larger 5 So if you split words at hyphens, what do you do with than individual Sentences. “vis-a-vis"? One word, and don't ask me why. How about “stingray'? 3 Source Transducers One word, of course. And “manta ray'? In this Section, Some embodiments of the the Source One word: it's just like Stingray. transducer 701 will be explained. The role of this transducer But there's a Space. is to produce one or more intermediate Source-structure Too bad. representations from a portion of text in the Source language. How about “inasmuch as”? 3.1. Overview TWO. An embodiment of the Source-transducer 701 is shown in Are you Sure? FIG. 11. In this embodiment, the transducer takes as input a 15 No. Sentence in the Source language and produces a single This dialogue illustrates that there is no canonical way of intermediate Source-Structure consisting of a Sequence of breaking a Sequence of characters into tokens. The purpose linguistic morphs. This embodiment of the Source of the transducer 1101, that tokenizes raw text, is to make transducer 701 comprises transducers that: Some choice. tokenize raw text 1101; In an embodiment in which the Source-language is English, this tokenizing transducer 1101 uses a table of a few determine words from tokens 1102; thousand Special character Sequences which are parsed as annotate words with parts of speech 1103; Single tokens and otherwise treat Spaces and hyphens as perform syntactic transformations 1104; word Separators. Punctuation marks and digits are Separated perform morphological transformations 1105; 25 off and treated as individual words. For example, 87, is annotate morphs with sense labels 1106. tokenized as the three words 87 and, It should be understood that FIG. 11 represents only one In another embodiment in which the Source-language is embodiment of the source-transducer 701. Many variations French, the tokenizing transducer 1101 operates in a similar are possible. For example, the transducers 1101, 1102,1103, way. In addition to Space and hyphen, the Symbol -t- is 1104, 1105, 1106 may be augmented and/or replaced by treated as a separator when tokenizing French text. other transducers. Other embodiments of the Source 3.2.2 Token-Word Transducers transducer 701 may include a transducer that groups words The next transducer 1102 determines a sequence of words into compound words or identifies idioms. In other from a Sequence of token Spellings. In Some embodiments, embodiments, rather than a Single intermediate Source this transducer comprises two transducers as depicted in Structure being produced for each Source Sentence, a Set of 35 FIG. 12. The first of these transducers 1201 determines an Several intermediate Source-Structures together with prob underlying case pattern for each token in the Sequence. The abilities or Scores may be produced. In Such embodiments second of these transducers 1202 performs a number of the transducers depicted in FIG. 11 can be replaced by Specialized transformations. These embodiments can easily transducers which produce at each Stage intermediate Struc be modified or enhanced by a person skilled in the art. tures with probabilities or Scores. In addition, the interme 40 3.2.3 True-Case Transducers diate Source-Structures produced may be different. For The purpose of a true-case transducer 1201 is made example, the intermediate Structures ma; be entire parse apparent by another Socratic dialogue: trees, or case frames for the Sentence, rather than a Sequence When do two Sequences of characters represent the same of morphological units. In these cases, there may be more word? than one intermediate Source-Structure for each Sentence 45 When they are the same Sequences. with Scores, or there may be only a single intermediate So, “the' and “The' are different words? SOurce-Structure. Don't be ridiculous. You have to ignore differences in 3.2 Components CSC. Referring still to FIG. 11, the transducers comprising the So “Bill and “bill are the same word? Source-transducer 701 will be explained. For concreteness, 50 No, “Bill' is a name and “bill' is something you pay. With these transducers will be discussed in cases in which the proper names the case matters. Source language is either English or French. What about the two “May's in “May I pay in May?” 3.2.1 Tokenizing Transducers The first one is not a proper name. It is capitalized because The purpose of the first transducer 1101, which tokenizes it is the first word in the sentence. raw text, is well illustrated by the following Socratic dia 55 Then, how do you know when to ignore case and when logue: not to? How do you find words in text? If you are human, you just know. Words occur between spaces. Computers don’t know when case matters and when it What about “however,”? Is that one word or two"? doesn’t. Instead, this determination can be performed by a Oh well, you have to Separate out the commas. 60 true-case transducer 1201. The input to this transducer is a Periods too? Sequence of tokens labeled by a case pattern that specifies Of course. the case of each letter of the token as it appears in printed What about “Mr.' text. These case patterns are corrupted versions of true-case Certain abbreviations have to be handled separately. patterns that Specify what the casing would be in the absence How about “shouldn’t? One word or two? 65 of typographical errors and arbitrary conventions (e.g., capi One. talization at the beginning of sentences). The task of the So “shouldn’t is different from “should not? true-case transducer is to uncover the true-case patterns. 5,805,832 15 16 In Some embodiments of the true-case transducer the case embodiments of transducer 1103, part-of-speech labels are and true-case patterns are restricted to eight possibilities assigned to a word Sequence using a technique based on hidden Markov models. A word Sequence is assigned the most probable part-of-Speech Sequence according to a Sta Lt U+ UL ULUL ULLUL UUL UUUL LUL tistical model, the parameters of which are estimated from large annotated texts and other even larger un-annotated Here U denotes an upper case letter, La lower case letter, U" texts. The technique is fully explained in article by Bernard a sequence of one or more upper case letters, and L a Merialdo entitled Tagging text with a Probabilistic Model Sequence of one or more lower case letters. In these in the Proceedings of the International Conference On embodiments, the true-case transducer can determine a Acoustics, Speech, and Signal Processing, May 14-17, true-case of an occurrence of a token in text using a simple 1991. This article is incorporated by reference herein. method comprising the following Steps: In an embodiment of the transducer 1103 for tagging of English, a tag Set consisting of 163 parts of Speech is used. 1. Decide whether the token is part of a name. If So, Set A rough categorization of these parts of Speech is given in the true-case equal to the most probable true-case Table 1. beginning with a U for that token. 15 In an embodiment of this transducer 1103 for the tagging 2. If the token is not part of a name, then check if the token of French, a tag Set consisting of 157 parts of Speech is used. is a member of a set of tokens which have only one A rough categorization of these parts of Speech is given in true-case. If So, Set the trule-case appropriately. Table 2. 3. If the true-case has not been determined by steps 1 or 2, and the token is the first token in a Sentence then Set TABLE 1. the true-case equal to the most probable true-case for that token. Parts of Speech for English 4. Otherwise, Set the true-case equal to the case for that 29 Nouns token. 27 Verbs In an embodiment of the true-case transducer used for 25 2O Pronouns 17 Determiners both English and French, names are recognized with a 16 Adverbs Simple finite-state machine. This machine employs a list of 12 Punctuation 12,937 distinct common last names and 3,717 distinct com 10 Conjunctions mon first names constructed from a list of 125 million full 8 Adjectives names obtained from the IBM online phone directory and a 4 Prepositions list of names purchased from a marketing corporation. It also 20 Other uses lists of common precVrsors to names (such as Mr., Mrs., Dr., Mlle., etc.) and common followers to names (such as Jr., Sr., III, etc). TABLE 2 The Set of tokens with only one true-case consists of all 35 tokens with a case-pattern entropy of less than 1.25 bits, Parts-of-Speech for French together with 9,506 Number of records in (o+p)8260.lcwtab 105 Pronouns English and 3,794 records in r8260.coerce(a+b+c) French 26 Verbs 18 Auxiliaries tokens Selected by hand from the tokens of a large bilingual 12 Determiners corpus. The most probable case pattern for each English 40 7 Nouns token was determined by counting token-case cooccurences Adjectives in a 67 million word English corpus, and for each French Adverbs Conjunctions token by counting cooccurences in a 72 million word French Prepositions corpus. (In this coimting occurrences at the beginnings of Punctuation Sentences are ignored). 45 Other 3.2.4 Specialized Transformation Transducers Referring still to FIG. 12, the transducer 1202 performs a 3.2.6 Syntactic Transformation Transducers few specialized transformations designed to Systematize the Referring still to FIG. 11, the transducer 1104, which tokenizing process. These transformations can include, but performs Syntactic transformations, will be described. are not limited to, the correction of typographical errors, the 50 One function of this transducer is to simplify the Subse expansion of contractions, the Systematic treatment of quent morphological transformations 1105. For example, in possessive, etc. In one embodiment of this transducer for morphological transformations, Verbs may be analyzed into English, contractions Such as don’t are expanded to do not, a morphological unit designating accidence followed by and possessives Such as John's and nurses are expanded to another unit designating the root of the verb. Thus in French, John S and nurses . In one embodiment of this transducer 55 the verb ira might be replaced by 3s future indicative for French, Sequences Such as Sil, quavez, and jadore are alter, indicating that ira is the third perSon Singular of the converted to pairs of tokenS Such as Siil, que aveZ, and je future tense of alter. The same kind of transformation can be adore. In addition, a few thousand Sequences Such as afin de performed in English by replacing the Sequence will go by are contracted to Strings Such as afin de. These Sequences future indicative to go. Unfortunately, often the two are obtained from a list compiled by a native French Speaker 60 words in Such a Sequence are Separated by intervening words who felt that Such Sequences should be treated as individual as in the Sentences: words. Also the four Strings au, aux, du, and des are will he go play in the traffic? expanded to a le, a les, dele, and de les respectively. he will not go to church. 3.2.5 Part-of-Speech Labeling Transducers Similarly, in French the third person of the English verb Referring again to FIG. 11, the transducer 1103 annotates 65 went is expressed by the two words est allé, and these two words with part-of-speech labels. These labels are used by words can be separated by intervening words as in the the Subsequent transducers depicted the figure. In Some SentenceS: 5,805,832 17 18 est-t-il allé? To understand the function of the transducer 1302 that Il nest pas allé. performs do-not coalescence, note that in English, negation It is possible to analyze Such verbs morphologically with requires the presence of an auxiliary verb. When one doesn’t Simple String replacement rules if various Syntactic trans exist an inflection of to do is used. The transducer 1302 formations that move away intervening words are performed coalesces the form of to do with not into the String do not. first. For example, A second function of the syntactic transducer 1104 is to make the task presented to the Statistical models which generate intermediate target-Structures for an intermediate John does not like turnips. Sourcestructure as easy as possible. This is done by per John do not like turnips. forming transformations that make the forms of these struc tures more Similar. For example, Suppose the Source lan The part of Speech assigned to the main verb, like above, is guage is English and the target language is French. English modified by this transducer to record the tense and perSon of adjectives typically precede the nouns they modify whereas to do in the input. When adverbs intervene in emphatic French adjectives typically follow them. To remove this 15 Sentences the do not is positioned after the intervening difference, the syntactic transducer 1104 includes a trans adverbs: ducer which moves French words labeled as adjectives to positions proceeding the nouns which they modify. These transducers only deal with the most rudimentary John does really not like turnips. linguistic phenomena. Inadequacies and Systematic prob John really do not like turnips. lems with the transformations are overcome by the Statistical models used later in the target hypothesis-generating module To understand the function of the final transducer 1303 702 of the invention. It should be understood that in other that performs adverb movement, note that in English, embodiments of the invention more Sophisticated Schemes adverbs often intervene between a verb and its auxiliaries. for Syntactic transformations with different functions can be 25 The transducer 1303 moves adverbs which intervene used. between a verb and its auxiliaries to positions following the The Syntactic transformations performed by the trans verb. The transducer appends a number onto the adverbs it ducer 1104 are performed in a Series of Steps. A sequence of moves to record the positions from which they are moved. words which has been annotated with parts of Speech is fed For example, into the first transducer. This transducer outputs a Sequence of words and a sequence of parts of Speech which together Serve as input to the Second transducer in the Series, and So Iraq will probably not be completely balkanized. forth. The word and part-of-speech sequences produced by Iraq will be en to balkanize probably M2 the final transducer are the input to the morphological not M2 completely M3. analysis transducer. 35 3.2.7 Syntactic Transducers for English An M2 is appended to both probably and to not to indicate FIG. 13, depicts an embodiment of a syntactic transducer that they originally preceded the Second word in the Verbal 1104 for English. Although in much of this document, Sequence will be balkanzized. Similarly, an M3 is appended examples are described in which the Source language is to completely to indicate that it preceded the third word in French and the target language is English, here for reasons 40 the Verbal Sequence. of exposition, an example of a Source transducer for a Source The transducer 1303 also moves adverbs that precede language of English is provided. In the next SubSection, Verbal Sequences to positions following those Sequences: another example in which the Source language is French is provided. Those with a basic knowledge of linguistics for other languages will be able to construct similar Syntactic 45 John really eats like a hog. transducer for those languages. John eats really M1 like a hog. The syntactic transducer in FIG. 13 is comprised of three transducer that This is done in order to place verbs close to their Subjects. perform question inversion 1301; 50 3.2.8 Syntactic Transducers for French perform do-not coalescence 1302; Referring now to FIG. 14, an embodiment of the syntactic perform adverb movement 1303. transducer 1104 for French is described. This embodiment To understand the function of the transducer 1301 that comprises four transducers that performs question inversion, note that in English questions perform question inversion 1401; the first auxiliary verb is often Separated by a noun phrase 55 from the root of the verb as in the sentences: perform discontinuous negative coalescence 1402; does that elephant eat? perform pronoun movement 1403; which car is he driving? perform adverb and adjective movement 1404. The transducer 1301 inverts the auxiliary verb with the The question inversion transducer 1401 undoes French Subject of the Sentence, and then converts the question mark 60 question inversion much the same way that the transducer to a special QINV marker to signal that this inversion has 1301 undoes English question inversion: occurred. For example, the two Sentences above are con verted by this transducer to: mangez-vous des legumes? that elephant eats QINV vous mangez des légumes QINV which car he is driving QINV 65 of habite-t-il? This transducer also removes Supporting do's as illustrated of ill habite OINV in the first Sentence. 5,805,832 19 20 -continued intermediate Source-Structure representation the fraternity of different forms of a word. This is useful because it allows for leluiavez-vous donné? more accurate Statistical models of the translation process. vous le lui avez donné QINV For example, a System that translates from French to English but does not use a morphological transducer can not benefit This transducer 1401 also modifies French est-ce que ques from the fact that Sentences in which parle is translated as tions: Speaks provide evidence that parlé should be translated as spoken. As a result, parameter estimates for rare words are inaccurate even when estimated from a very large training est-ce qu'il mange comme un cochon? Sample. For example, even in a Sample from the Canadian il mange comme un cochon EST-CE QUE Parlement of nearly 30 million words of French text, only 24 of the 35 different spellings of single-word inflections of the To understand the function of the transducer 1402, which verb parler actually occurred. performs negative coalescence, note that propositions in A morphological transducer 1104 is designed to amelio French are typically negated or restricted by a pair of words, rate Such problems. The output of this transducer is a ne and Some other word, that Surround the first word in a 15 Sequence of lexical morphemes. These lexical morphemes verbal sequence. The transducer 1402 moves the ne next to will Sometimes be referred to in this application as morpho its mate, and then coalesces the two into a new word: logical units or Simply morphs. In an embodiment of trans ducer 1104 used for English, inflection morphological trans formations are performed. that make evident common je ne sais pas. origins of different conjugations of the same verb; the je sais ne pas. Jean n'a jamais mangé comme un cochon. Singular and plural forms of the same noun; and the com - Jean a ne jamais mangé comme un cochon. parative and Superlative forms of adjectives and adverbs. In ill ny en a plus. an embodiment of transducer 1104 used for French, mor il y en a ne plus. phological inflectional transformations are performed that 25 make manifest the relationship between conjugations of the To understand the function of the transducer 1403, which Same verb; and forms of the same noun or adjective differing performs pronoun movement, note that in French, direct in gender and number are performed. These morphological object, indirect-object and reflexive pronouns usually pre transformations are reflected in the Sequence of lexical cede verbal sequences. The transducer 1403 moves these morphemes produced. The examples below illustrate the pronouns to positions following these Sequences. It also level of detail in these embodiments of a morphological maps these pronouns to new words that reflect their roles as transducer 1104: direct-object or indirect-object, or reflexive pronouns. So, for example, in the following Sentence le is converted to he was eating the peas more quickly than I. le DPRO because it functions as a direct object and vous to 35 he V past progressive to eat the pea N PLURAL quick vous IPRO because it functions as an indirect object: er ADV than I. nous en mangeons rarement. nous V present indicative 1p manager rare ment ADV je vous le donnerai. de en PRO je donnerai le DPRO vous IPRO. ils se sont lavés les mains sales. 40 ils v past 3p laverse RPRO les sale main N PLURAL. In the next sentence, Vous is tagged as a reflexive pronoun 3.2.10 Sense-Labelling Transducers and therefore converted to Referring again to FIG. 11, the transducer 1106, which annotates a lexical morph Sequence produced by the trans vous RPRO. vous vous lavez les mains. 45 ducer 1105 with part-of-speech labels, will be explained. vous lavez vous RPRO les mains. Much of the allure of the statistical approach to transfer in machine translation is the ability of that approach to for mally cope with the problem of lexical ambiguity. The allative pronominal clitic y and the ablative pronominal Unfortunately, Statistical methods are only able to mount a clitic en are mapped to the two-word tuples a y PRO and 50 Successful attack on this problem when the key to disam de en PRO: biguating the translation of a word falls within the local purview of the models used in transfer. je y penserai. Consider, for example, the French word prendre. je penseraia y PRO. Although prendre is most commonly translated as to take, it j'en ai plus. 55 has a number of other leSS common translations. A trigram je ai plus de en PRO. model oil English can be used to translate Je vais prendre la décision as I will make the decision because the trigram The final transducer 1404 moves adverbs to positions make the decision is much more common than the trigram following the Verbal Sequences in which they occur. It also take the decision. However, a trigram model will not be of moves adjectives to positions preceding the nouns they 60 much use in translating Je vais prendre ma propre décision modify. This is a useful step in embodiments of the present as I will make my own decision because in this case take and invention that translate from French to English, Since adjec decision no longer fall within a Single trigram. tives typically precede the noun in English. In the paper, “Word Sense Disambiguation using Statis 3.2.9 Morphological Transducers tical Methods” in the proceedings of the 29th Annual Referring again to FIG. 11, a transducer 1105, which 65 Meeting of the ASSociation for Computational Linguistics, performs morphological transformations, will be described. published in June of 1991 by the Association of Computa One purpose of this transducer is to make manifest in the tional Linguistics and incorporated by reference herein, a 5,805,832 21 description is provided of a method of asking a question about the context in which a word appears to assign that word a Sense. The question is constructed to have high why do not you ewer succeed mutual information with the translation of that word in its RRO VDO XX PPY RR VVO context. By modifying the lexical entries that appear in a Sequence of morphemes to reflect the Senses assigned to Here, RRO and RR are adverb tags, VD0 and VV0 are verb these entries, informative global information can be encoded tags, XX is a special tag for the word not, PPY is a pronoun locally and thereby made available to the statistical models tag, and 2 is a special tag for a question mark. used in transfer. The example transducer converts these two parallel Although the method described in the aforementioned 1O Sequences to the parallel word and part-of-Speech Sequences paper assigns Senses to words, the same method applies 1502: equally well to the problem of assigning Senses to morphemes, and is, used here in that fashion. This trans why you Succeed do not MO ever M1 OINV ducer 1106, for example maps prendre to prend 1 in the RRO PPY VVO XX RR OINV Sentence 15 Je vais prendre ma propre Voiture. Here, OINV is a marker which records the fact that the but to prendre 2 in the Sentence original input Sentence was question inverted. Je vais prendre ma propre décision. A mechanism by which a transducer achieves this trans It should be understood that other embodiments of the formation is depicted in FIG. 16, and is comprised of four Sense-labelling transducer are possible. For example, the components: Sense-labelling can be performed by asking not just a single a finite-state pattern matcher 1601; question, about the context but a Sequence of questions an action processing module 1603; arranged in a decision tree. a table of patterns 1602; 3.3 Source-Transducers with Constraints 25 a table of action specifications 1503. In some embodiments, such as that depicted in FIG. 9, a The transducer operates in the following way: Source-structure transducer, Such as that labelled 901, 1. One or more parallel input sequences 1605 are captured accepts a set of constraints that restricts its transformations by the finite-state pattern-matcher 1601; Source text to an intermediate target Structure in Source text. 2. The finite-state pattern-matcher compares the input Such constraints include, but are not limited to, Sequences against a table of patterns 1602 of input requiring that a particular phrase be translated as a certain Sequences Stored in memory; linguistic component of a Sentence, Such a noun 3. A particular pattern is identified, and an associated phrase; action-code 1606 is transmitted to the action requiring that a Source word be labelled as a certain 35 processing module 1603; part-of-Speech Such as a verb or determiner; 4. The action-processing module obtains a Specification of requiring that a Source word be morphologically analyzed the transformation associated to this action code from a certain way; a table of actions 1503 stored in memory; requiring that a Source word be annotated with a particular 5. The action-processing module applies the transforma Sense label; 40 tion to the parallel inputStreams to produce one or more in embodiments in which the intermediate Structure parallel output Sequences 1604, encodes parse-tree or case-frame information, requir The parallel input Streams captured by the finite-state ing a certain parse or case-frame Structure for a Sen pattern matcher 1601 are arranged in a Sequence of attribute tence; morphcologically analyzed in a particular way, tuples. An example of Such a Sequence is the input Sequence or be annotated with a particular Sense-label, 45 1501 depicted in FIG. 15. This sequence consists of a A Source-transducer accepting Such constraints in Similar Sequence of positions together with a Set of one or more to Source transducers as described in this Section. Based on attributes which take values at the positions. A few examples the descriptions already given, Such transducers can be of Such attributes are the token attribute, the word attribute, constructed by a person with a computer Science background the case-label attribute, the part-of-Speech attribute, the and skilled in the art. 50 sense-label attribute. The array of attributes for a given position will be called an attribute tuple. For example, the input attribute tuple sequence 1501 in 4 Finite-State Transducers FIG. 15 is seven positions long and is made up of two This Section provides a description of an embodiment of dimensional attribute tuples. The first component of an a mechanism by which the Syntactic transductions in Step 55 attribute tuple at a given position refers to the word attribute. 1104 and the morphological transductions in step 1105 are This attribute specifies the Spellings of the words at given performed. The mechanism is described in the context of a positions. For example, the first word in the sequence 1501 particular example depicted in 15. One with a background is why. The Second component of an attribute tuple at a in computer Science and skilled in the art of producing given position for this input Sequence refers to a part-of 60 Speech tag for that position. For example, the part of Speech finite-State transducers can understand from this example at the first position is RRO. The attribute tuple at position 1 how to construct Syntactic and morphological transducers of is thus the ordered pair why, RRO. the type described above. The parallel output Streams produced by the action pro The example transducer inverts questions involving do, cessing module 1603 are also arranged as a Sequence of does, and did. After steps 1101, 1102, and 1103, the source 65 attribute tuples. The number of positions in an output text Why don't You ever Succeed? is transduced into parallel Sequence may be different from the number of positions in word and part-of-speech sequences 1501: an input Sequence. 5,805,832 23 24 For example, the output sequence 1502 in FIG. 15, Examples of these logical operations are: consists of six positions. ASSociated with each position is a two-dimensional attribute tuple, the first coordinate of which is a word attribute and the Second coordinate of which is a Expression Matches part-of-Speech attribute. A.B.C 0 or 1 A's followed by B then by C An example of a table of patterns 1602 is shown in FIG. (A)(B+) 0 or more A's or 1 or more B's 17. This table is logically divided into a number of parts or (AB).C A or B, followed by C blocks. Pattern-Action Blocks. The basic definitions of the Attribute Tuples: The most common type of element in a matches to be made and the actions to be taken are contained regular expression is an attribute tuple. An attribute tuple is in pattern-action blocks. A pattern-action block comprises of a vector whose components are either: a list of patterns together with the name of actions to be 1. an attribute (as identified by its spelling); invoked when patterns in an input attribute-tuple Sequence 2. a name of a Set of attributes, 1605 are matched. 15 3. the mild card attribute. Auxiliary Pattern Blocks. Patterns that can be used as These elements are combined using the following opera Sub-patterns in the patterns of pattern-action blocks are torS: defined in Auxtiliary Pattern blocks. Such blocks contain lists of labelled patterns of of attributes tuples. These labelled patterns do not have associated actions, but can be Operator Meaning Usage referenced by their name in the definitions of other patterns. s Delimiter between coordinates a,b In FIG. 17 there is one Auxiliary Pattern block. This block M of an attribute tuple M defines four auxiliary patterns. The first of these has a name Negation al ADVERB and matches single tuple adverb-type construc # Wild Card # tions. The second has a name of BARE NP and matches 25 certain noun-phrase-type constructions. Notice that this aux (Here a and b denote attribute spellings or attribute set names). iliary pattern makes use of the ADVERB pattern in its The meanings of these operators are best illustrated by definition. The third and fourth auxiliary patterns match example. Let a, b, and c denote either attribute Spellings or other types of noun phrases. Set names. ASSume the dimension of the attribute tuples is 3. Set Blocks. Primary and auxiliary patterns allow for sets Then: of attributes. In FIG. 17, for example, there is a set called DO SET, of various forms of to do, and another set PROPER NOUN TAG of proper-noun tags. Attribute Tuple Matches Patterns are defined in terms of regular expressions of 35 a, b, c First attribute matches a, second match b, attribute tuples. Any pattern of attribute tuples that can be third matches c recognized by a deterministic finite-State automata can be , b, c First attribute elided (matches anything), Specified by a regular expression. The language of regular Second attribute matches b, third matches c , b, First and third attribute elided (match anything) expressions and methods for constructing finite-state Second attribute matches b automata are well known to those skilled in computer 40 al Second and third attributes elided (Match anything) Science. A method for constructing a finite-state pattern First matches a #, b, First attribute wild-card (i.e. matches anything) matcher from a Set of regular expressions is described in the M. M. Second attribute matches b. Third attribute elided article “LEX-A Lexical Analyzer Generator,” written by a, b, c Second attribute matches anything EXCEPT b. Third Michael E. Lesk, and appearing in the Bell Systems Tech matches anything EXCEPT c. nical Journal, Computer Science Technical Report, Number. 45 39, published in Oct. of 1975. Auxiliary Regular Expressions: A Second type of element Regular expressions accepted by the pattern matcher 1601 in a regular expression is an auxiliary regular expression. An are described below. auxiliary regular expression is a labelled regular expression Regular Expressions of Attribute Tuples: A regular which is used as a component of a larger regular expression. expression of attribute tuples is a Sequence whose elements 50 Logically, a regular expression involving auxiliary regular are either expressions is equivalent to the regular expression obtained 1. an attribute tuple; by resolving the reference to the auxiliary pattern. For 2. the name of an auxiliary regular expression; or example, Suppose an auxiliary regular expression named D has been defined by: 3. a register name. 55 These elements can be combined using one of the logical D=A.B.A* operations: where A, B denote attribute tuples (or other auxiliary patterns). Then: Operator Meaning Usage Matches 60 concatenation A.B A followed by B Expression is equivalent to union (i.e. or) AB A or B : 0 or more A* 0 or more As C.D C.A.B+.A O or 1 A. 0 or 1. As D-CD (A.B+.A*)+.A.B+.A.C.A.B+.A* -- 1 or more A+ 1 or more As 65 Registers: Just knowing that a regular expression matches Here A and B denote other regular expressions. an input attribute tuple Sequence usually does not provide 5,805,832 25 26 enough information for the construction of an appropriate This is made so in line 11-12 that appends all tuples output attribute tuple Sequence. Data is usually also required matching the regular expression SUBJECT NP to the out about the attribute tuples matched by different elements of put Sequence. the regular expression. In ordinary LEX, to extract this type For negative questions involving forms of do, the part of information often requires the matched input Sequence to of-Speech tag of the output verb and of the output adverbs be parsed again. To avoid this cumberSome approach, the are the same as those of the input verb and adverbs. Thus the pattern-matcher 1601 marke's details about the positions in entire input tuple Sequences corresponding to these words the input Stream of the matched elements of the regular can be appended to the output. This is done in lines 15-18. expression more directly available. From these positions, the For positive questions the tag attribute of the output verb identities of the attribute tuples can then be determined. Positional information is made available through the use may be different than that of the input verb. This is handled of registers. A register in a regular expression does not match in lines 25-37. The input word attribute for the verb is any input. Rather, appended to the output word attribute in lines 26 and 31 and 1. After a match, a register is Set equal to the position in 35. The output tag attribute is selected based on the form of the input Sequence of the next tuple in the input do in the input Sequence. Explicit tag values are appended to Sequence that is matched by an element of the regular 15 the output sequence in lines 32 and 37. expression to the right of the register. The remaining input words and tags other than the ques 2. If no further part of the regular expression to the right tion mark are written to the output Sequence in lines 43–44. of the register matches, then the register is Set equal to The input Sequence is completed in line 46 by the marker ZCO. QINV Signalling question inversion, together with the The operation of registerS is best illustrated by Some appropriate tag. examples. These examples use registers 1 and 2: 5 Target Transducers In this Section, Some embodiments of the the target Expression Contents of Registers after match structure transducer 704 will be explained. The role of this A.1.B.C Reg 1: First position of B match 25 transducer is to produce portions of text in the target A2 (CD) Reg 2: First position of either C or D match language from one or more intermediate target-Structures. A.1.B 2.C Reg 1: If B matches: First position of B match 5.1. Overview Otherwise: First position of C match Reg 2: First position of C match An embodiment of a target-structure transducer 704 for A1B.C* Reg 1: If B matches: First position of B match English is shown in FIG. 19. In this embodiment, the If C matches: First position of C match transducer 704 converts an intermediate target-structure for Otherwise: O Reg 2: If C matches: First position of C match English. consisting of a Sequence of linguistic morphs into Otherwise: O a single English Sentence. This embodiment performs the inverse transformations of those performed by the Source transducer 702 depicted in FIG. 13. That is, if the transducer A pattern-action block defines a pattern matcher. When an 35 702 is applied to an English Sentence, and then the trans input attribute-tuple Sequence is presented to the finite-state ducer 704 is applied to the Sequence of linguistic morphs pattern matcher a current input position counter is initialized produced by the transducer 702 the original English sen to 1 denoting that the current input position is the first tence is recovered. position of the Sequence. A match at the current input This embodiment of the target-transducer 704 comprises position is attempted for each pattern. If no pattern matches, 40 a Sequence of transducers which: an error occurs. If more than one pattern matches, the match of the longest length is Selected. If Several patterns match of generate nouns, adverbs and adjectives from morphologi the same longest length, the one appearing first in the cal units 1901; definition of the pattern-action block is Selected. The action provide do-support for negatives 1902; code associated with that pattern is then transmitted to the 45 separate compound verb tenses 1903; action processing module 1603. provide do-support for inverted questions 1904; Transformations by the action processing module are conjugate verbs 1905; defined in a table of actions 1503 which is stored in memory. The actions can be specified in Specified in any one of a reposition adverbs and adjectives 1906; number of programming languages Such as C, PASCAL, 50 restore question inversion 1907; FORTRAN, or the like. resolve ambiguities in the choice of target words resulting In the question-inversion example, the action specified in from ambiguities in the morphology 1908; and the pseudo-code in FIG. 18 is invoked when the pattern case the target sentence 1909. defined by the regular expression in lines 3-4 is matched. It should be understood that FIG. 19 represents only one This action inverts the order of the words in certain ques 55 possible embodiment of the target-transducer 704. Many tions involving forms of do. An instance of this action is variations are possible. For example, in other embodiments, shown in FIG. 15. In the pseudo-code of FIG. 18 for this rather than a single target Sentence being produced for each action, the Symbol (Greg(i) denotes the contents of register intermediate target-Structure, a set of Several target Sen i. In line 6 of this pseudo-code, the output attribute tuple tences together with probabilities or Scores may be pro Sequence is set to null. 60 duced. In Such embodiments, the transducers depicted in A question matched by the regular expression in lines 3-4 FIG. 19 can be replaced by transducers which produce at may or may not begin with a (so-called) wh- word in the set each Stage Several target Sentences with probabilities or WH NP. If it does match, the appropriate action is to Scores. Moreover, in embodiments of the present invention append the input tuple in the first position to the output in which the intermediate target-Structures are more Sophis sequence. This is (lone in lines 8-9). 65 ticated than lexical morph Sequences, the target-Structure After the wh-word, the next words of the output Sequence transducer is also more involved. For example, if the inter should be the Subject noun phrase of the input Sequence. mediate target-Structure consists of a parse tree of a Sentence 5,805,832 27 28 or case frames for a Sentence, then the target-Structure transducer converts these to the target language. 5.2 Components John V has been ing to eat a lot recently The transducers depicted in FIG. 19 will now be Y John V header has been V ing to eat a lot recently. explained. The first seven of these transducers 1901 1902 1903 1904. 1905 1906 1907 and the last transducer 1909 are This transducer also inserts a verb phrase header marker implemented using a mechanism Similar to that depicted in V header for use by later transducers. In particular the verb FIG. 12 and described in Subsection Finite-State Transduc phrase header marker is used by the transducer 1907 that ers. This is the same mechanism that is used by the Syntactic repositions adjective and adverb. transducers 1104 and the morphological transducers 1105 The next transducer 1905 conjugates verbs by combining that are part of the source-transducer 701. Such transducer the root of the verb and the tense of the verb into a single can be constructed by a person with a background in word. This process is facilitated by the fact that complicated computer Science and skilled in the art. The construction of verb tenses have already been broken up into Simpler pieces. the transducer 1908 will be explained below. The first transducer 1901 begins the process of combining For example, the previous Sentence is transduced to: lexical morphs into words. It generates nouns, adverbs and 15 adjectives from their morphological constituents. For John V header has been eating a lot recently. example, this transducer performs the conversions: The next transducer 1906 repositions adverbs and adjec the ADJ er big boy N plural V bare to eat rapid ADV ly. tives to their appropriate places in the target Sentence. This the bigger boys V bare to eat rapidly. is done using the move-markers M1, M2, etc., that axe he V present 3s to eat the pea N PLURAL quick ADV er than I. attached to adjectives. Thus for example, this transducer he V present 3s to eat the peas more quickly than I. COnVertS.

25 (The final Sentences here are The bigger boys eat rapidly and Iraq V header will be Balkanized probably M2 not M2. He eats the peas more quickly than I.) > Iraq V header will probably not be Balkanized There are Sometimes ambiguities in generating adverbs and adjectives from their morphological units. For example, the lexical morph Sequence quick ADV ly can be converted The verb phrase header marker V header is kept for use by to either quick or quickly, as for example in either how quick the next transducer 1907. can you run? or how quickly can you run. This ambiguity The transducer 1907 restores the original order of the is encoded by the transducer 1901 by combining the differ noun and verb in inverted questions. ent possibilities into a single unit. Thus, for example, the These questions are identified by the question inversion transducer 1901 performs the conversion: marker QINV. In most cases, to restore the inverted question 35 order, the modal or verb at the head of the verb phrase is moved to the beginning of the Sentence: how can quick ADV ly M1 you V bare to run? how can quick quickly M1 you V bare to run? Joh V header was here yesterday QINV was John here yesterday? The second transducer 1902 provides do-support for 40 negatives in English by undoing the do-not coalescences in If the sentence begins with one of the words what, who, the lexical morph Sequence. Thus, for example it converts: where, why, whom, how, when, which the modal or verb is moved to the Second position in the Sentence: John V present 3s to eat do not M1 real M1 a lot. 45 John V present 3s to do eat not M2 really M2 a lot. where John V header was yesterday QINV where was John yesterday? by Splitting do from not and inserting to do at the head of the verb phrase. (The final sentence here is John does not A special provision is made for moving the word not in really eat a lot.) The move-markers M1, M2, etc, mea 50 sure the position of the adverb (or adjective) from the head inverted questions. The need to do this is signaled by a of the verb phrase. These markers are correctly updated by Special marker MO in the lexical morph Sequence. The not the transformation. preceding the M0 marker is moved to the position following The next transducer 1903 provides do-support for the modal or verb: inverted questions by inserting the appropriate missing form 55 of the to do into lexical morph Sequences corresponding to he V header can swim not MOOINV inverted questions. This is typically inserted between the can not he swim? verb tense marker and the verb. Thus, for example, it Joh V header does eat not MO potatoes? COnVertS. Does not John eat potatoes? 60 After the inverted question order has been restored, the John V present 3s to like to eat QINV question inversion marker QINV is removed and replaced s John V present 3s to do like to eat QINV by a question mark at the end of the Sentence. The next transducer 1909 resolves ambiguities in the The next transducer 1904 prepares the stage for verb 65 choice of words for the target Sentence arising from ambi conjugation by breaking up complicated verb tenses into guities in the process 1901 of generating adverbs and Simpler pieces. adjectives from their morphological units. 5,805,832 29 30 For example, the transducer 1909 converts the sentence aforementioned article by F. Jelinek, can be used. In other how qtick-quickly can you, run? embodiments, the target Structure comprises parse trees of which contains the ambiguity in resolving quick ADV ly the target language. In these embodiments, language models into a Single adverb, into the Sentence based on Stochastic context-free grammars, as described in how quickly can you run? the aforementioned articles by F. Jelinek and the aforemen To perform such conversions, the transducer 1909 uses a tioned paper by J. Baker, can be used. target-language model to assign a probability or Score to In addition, decision tree language models, as described in each of the different possible Sentences corresponding to an the aforementioned paper by L. Bahl, et al. can be adapted input Sentence with a morphological ambiguity. The Sen by one skilled in the art to model a wide variety of target tence with the highest probability or score is selected. In the StructureS. above example, the Sentence how quickly can you run? is 6.1 Perplexity Selected because it has a higher target-language model The performance of a language model in a complete probability or Score than how quick can you run? In Some System depends on a delicate interplay between the language embodiments of the transducer 1909 the target-language model and other components of the System. One language model is a trigram model Similar to the target-Structure 15 model may Surpass another as part of a speech recognition language model 706. Such a transducer can be constructed System but perform less well in a translation System. by a person skilled in the art. The last transducer 1910 Since it is expensive to evaluate a language model in the assigns a case to the words of a target Sentence based on the context of a complete System, it is useful to have an intrinsic casing rules for English. Principally this involves capitaliz measure of the quality of a language model. One Such ing the words at the beginning of Sentences. Such a trans measure is the probability that the model assigns to the large ducer can easily be constructed by a perSon Skilled in the art. Sample of target Structures. One judges as better the lan 5.3 Target Transducers with Constraints guage model which yields the greater probability. When the In some embodiments, such as that depicted in FIG. 9, a target Structure is a Sequence of words or morphs, this target-Structure transducer, Such as that labelled 904, accepts measure can be adjusted So that it takes account of the length a Set of constraints that restricts its transformations of an 25 of the structures. This leads to the notion of the perplexity of intermediate target Structure to target text. a language model with respect to a Sample of text S: Such constraints include, but are not limited to, requiring -- (3) that a particular target word appear in the final target text. perplexity = Pr(S) is For example, because of ambiguities in a morphology, a target trannducer may not be able to distinguish between where S is the number of morphs of S. Roughly speaking, electric and electrical. Thus the output of a target transducer the perplexity is the average number of morphs which the might include the phrase electrical blanket in a situation model cannot distinguish between, in predicting a morph of where electric blanket is preferred. A constraint such as The S. The language model with the smaller perplexity will be word “electric' must appear in the target Sentcnce would the one which assigns the larger probability to S. correct this shortcoming. 35 Because perplexity depends not only on the language A target-transducer accepting Such constraints in Similar model but also on the Sample of text, it is important that the to target trainsdulcers as described in this Section. Based on text be representative of that for which the language model the descriptions already given, Such transducers can be is intended. Because perplexity is Subject to Sampling error, constructed by a person with a computer Science background making fine distinctions between language models may and skilled in the art. 40 require that the perplexity be measured with respect to a large Sample. 6 Target Language Model 6.2 n-gram Language models The inventions described in this specification employ n-gram language models will now be described. For these models, the target Structure consists of a Sequence of mor probabilistic models of the target language in a number of 45 places. These include the target Structure language model phs. 705, and the class language model used by the decoder 404. Suppose mamma . . . m be a sequence of k morphs m AS depicted in FIG. 20, the role of a language model is to For 1sissk, let m denote the Subsequence m=mm compute an a priori probability or Score of a target Structure. . . . m. For any sequence, the probability of a m=is ink Language models are well known in the Speech recogni 50 is equal to the product of the conditional probabilities of tion art. They are described in the article “Self-Organized each morph m, given the previous morphs m': Language Modeling for Speech Recognition', by F. Jelinek, appearing in the book Readings in Speech Recognition Pr(m) = (4) edited by A. Waibel and K. F. Lee and published by Morgan Pr(m)Pr(mm)Pr(m-mm) . . . Pr(m,m) . . . Pr(mm). Kaufmann Publishers, Inc., San Mateo, Calif. in 1990. They 55 are also described in the article "A Tree-Based Statistical The sequence m' is called the history of the morph m, Model for Natural Language Speech Recognition”, by L. in the Sequence. For an n-gram model, the conditional probability of a Bahl, et al., appearing in the July 1989 Volume 37 of the morph in a Sequence is assumed to depend on its history only IEEE Transactions on Acoustics, Speech and Signal Pro through the previous n-1 morphs: cessing. These articles are included by reference herein. 60 They are further described in the paper “Trainable Gram marS for Speech Recognition', by J. Baker, appearing in the (5) 1979 Proceedings of the Spring Conference of the Acous For a vocabulary of Size V, a 1-gram model is determined tical Society of America. by V-1 independent numbers, one probability Pr(m) for In Some embodiments of the present inventions, the target 65 each morph m in the Vocabulary, minus one for the con Structure consists of a Sequence of morphs. In these Straint that all of the probabilities add up to 1. A 2-gram embodiments, n-gram language models, as described in the model is determined by V-1 independent numbers, V(V-1) 5,805,832 31 32 conditional probabilities of the form Pr(mm) amd V-1 of polated estimation of Markov Source parameters from Sparse the form Pr(m). In general, an n-gram model is determined data”, by F. Jelinek and R. Mercer and appearing in Pro by V"-1 independent numbers, V"' (V-1) conditional prob ceeding of the Workshop on Pattern Recognition in Practice, abilities of the form Pr(mlin"), called the order-n condi published by North-Holland, Amsterdam, The Netherlands, tional probabilities, plus V''-1 numbers which determine in May 1980. Interpolated estimation combines several an (n-1)-gram model. models into a smoothed model which uses the probabilities The order-n conditional probabilities of an n-gram model of the more accurate models where they are reliable and, form the transition matrix of an associated Markov model. where they are unreliable, falls back on the more reliable The states of this Markov model are sequences of n-1 probabilities of less accurate models. If Pr' (m,m) is the morphs, and the probability of a transition from the State jth language model, the Smoothed model, Pr(m,m,'"), is mim2 . . . m, 1 to the state mama . . . m, is Pr(m,mm2 . . given by ... m.). An n-gram language model is called consistent if, for each string m", the probability that the model assigns to m"' is the steady state probability for the state m” of the associated Markov model. 6.3 Simple n-gram models 15 The Simplest form of an n-gram model is obtained by The values of the (m) are determined using the EM assuming that all the independent conditional probabilities method, So as to maximize the probability of Some addi are independent parameters. For Such a model, values for the tional Sample of training text called held-out data. When parameters can be determined from a large Sample of interpolated estimation is used to combine Simple 1-, 2-, and training text by Sequential maximum likelihood training. 3-gram models, the 2's can be chosen to depend on m' The order n-probabilities are given by only through the count of mem. Where this count is high, the simple 3-gram model will be reliable, and, where this count is low, the Simple 3-gram model will be unreliable. f(m'm) (6) The inventors constructed an interpolated 3-gram model 25 in which the 2's were divided into 1782 different sets where f(m) is the number of times the string of morphs m according to the 2-gram counts, and determined from a appears in the training text. The remaining parameters are held-out sample of 4,630,934 million words. The power of determined inductively by an analogous formula applied to the model was tested using the 1,014,312 word Brown the corresponding n-1-gram model. Sequential maximum corpus. This well known corpus, which contains a wide likelihood training does not produce a consistent model, variety of English text, is described in the book Computa although for a large amount of training text, it produces a tional Analysis of Present-Day American English, by H. model that is very nearly consistent. Kucera and W. Francis, published by Brown University Unfortunately, many of the parameters of a simple n-gram Press, Providence, R.I., 1967. The Brown corpus was not model will not be reliably estimated by this method. The included in either the training or held-out data used to problem is illustrated in Table 3, which shows the number of construct the model. The perplexity of the interpolated 1-, 2-, and 3-grams appearing with various frequencies in a 35 model with respect to the Brown corpus was 244. sample of 365,893,263 words oil English text from a variety 6.5 n-gram Class Models of sources. The vocabulary consists of the 260,740 different Clearly, some words are similar to other words in their words, plus a Special unknown word into which all other meaning and Syntactic function. For example, the probabil words are mapped. Of the 6.799x10" 2-grams that might ity distribution of words in the vicinity of Thursday is very have occurred in the data, only 14,494,217 actually did 40 much like that for words in the vicinity of Friday. Of course, occur and of these, 8,045,024 occurred only once each. they will not be identical: people rarely say Thank God its Similarly, of the 1.773x10' 3-grams that might have Thursday! or worry about Thursday the 13". occurred, only 75,349,888 actually did occur and of these, In class language models, morphs are grouped into 53,737,350 occurred only once each. These data and Tur classes, and morphs in the same class are viewed as Similar. ing's formula imply that 14.7 percent of the 3-grams and for 45 Suppose that Ö is a map that partitions the Vocabulary of V 2.2 percent of the 2-grams in a new Sample of English text morphs into C classes by assigning each morph m to a class will not appear in the original ?(m). An n-gram class model based on € is an n-gram language model for which Count 1-grams 2-grams 3-grams 50 (8) 1. 36,789 8,045,024 53,737,350 where c=c(m). An n-gram class model is determined by 2 20,269 2,065,469 9,229,958 C'-1+V-C independent numbers, V-C of the form 3 13,123 970,434 3,653,791 Pr(mc), plus C"-1 independent numbers which determine >3 135,335 3,413,290 8,728,789 an n-gram language model for a vocabulary of size C. If C >O 205,516 14,494,217 75,349,888 55 2O 260,741 6.799 x 1.773 x is much Smaller than V, these are many fewer numbers than 1010 1016 are required to Specify a general n-gram language model. In a simple n-gram class model, the C-1+V-Cindepen dent probabilities are treated as independent parameters. For Sample. Thus although any 3-gram that does not appear in Such a model, values for the parameters can be determined the original Sample is rare, there are So many of them that 60 by Sequential maximum likelihood training. The order n their aggregate probability is Substantial. probabilities are given by Thus, as n increases, the accuracy of a simple n-gram model increases, but the reliability of the estimates for its f(cc) (9) parameters decreases. 6.4 Smoothing 65 X f(cc) A solution to this difficulty is provided by interpolated where f(c) is the number of times that the sequence of estimation, which is described in detail in the paper “Inter classes cappears in the training text. (More precisely, f(c) 5,805,832 33 34 is the number of distinct occurrences in the training text of (V-i)/2 pairs. The score of the partition obtained by merg a consecutive sequence of morphs m for which C=c(m) ing any particular pair is the Sum of (V-i) terms, each of for 1sksi.) which involves a logarithm. Since altogether there are V-C merges, this Straight-forward approach to the computation is 7 Classes of order V. This is infeasible, except for very small values The inventions described in this specification employ of V. A more frugal organization of the computation must classes of morphs or words in a number of places. These take advantage of the redundancy in this Straight-forward include the class language model used by the decoder 702 calculation. An implementation will now be described in which the and described in Section 14, and some embodiments of the method 3204 executes in time of order V. In this target Structure language model 705. implementation, the change in Score due to a merge is The inventors have devised a minimber of methods for computed in constant time, independent of V. M automatically partitioning a vocabulary into classes based Let C denote the partition after V-k merges. Let C(1), upon frequency or coocurrence Statistics or other informa C.(2), ... C.(k) denote the k classes of C. Let p(1,n)=P tion extracted from textual corpora or other Sources. In this (C(T),C(ml)) and let Section, Some of these methods will be explained. An 15 application to construction of Syntactic classes of words will (14) be described. A person skilled in the art can easily adapt the plk (I) = P(i. m) methods to other situations. For example, the methods can be used to construct classes of morphs instead of classes of (15) words. Similarly, they can be used to construct classes based pr(n) = } P(i. m) upon cooccurrence Statistics or Statistics of word alignments in bilingual corpora. p:(l, n) (16) 7.1 Maximum Mutual Information Clustering q:(l, n) = pk(l, m)log pit pro A general Scheme for clustering a Vocabulary into classes Let I =p(C) be the score of C. So that is depicted schematically in FIG. 31. It takes as input a 25 desired number of classes C 3101, a vocabulary 3102 of size (17) V, and a model 3103 for a probability distribution P(ww.) Ii = XE q.(l, m) over bigrams from the Vocabulary. It produces as output a l, in partition 3104 of the vocabulary into C classes. In one Let I,(i,j) be the score of the partition obtained from C by application, the model 3103 can be a 2-gram language model merging classes C(i) and C(i), and let L(i,j)=I-Ii,j) be as described in Section 6, in which case P(w, w) would be the change in Score as a result of this merge. Then proportional to the number of times that the bigram ww. appears in a large corpus of training text. L(i, j) = S(i) + S(i) - q(i, j) - q(i,i) - (18) Let the score p(C) of a partition C be the average mutual information between the classes of C with respect to the 35 q.(iui iuj) - Sq:(l, illuj) - X. qi (illuj, m), probability distribution P(ww.): zi, inzi, where P(c1, c2) (10) p(C) = X P(c1, c)log PCPC), (19) 40 In this Sum, c and c. each run over the classes of the partition C, and In these and Subsequent formulae, iU denotes the result of the merge, So that, for example (11) P(c1, c2) : wiccic P(w1, w2) 45 (12) P(c) : XE P(c1, c2) C2 (13) The key to the implementation is to Store and inductively P(c2) : XE P(c1, c2) 50 update the quantities The scheme of FIG.31 chooses a partition C for which the p(t, n) pl(l) pr(n) q(l, n) (22) score average mutual information p(C) is large. I S(i) L(i, j) 7.2 A Clustering Method One method 3204 for carrying out this scheme is depicted 55 Note that if I, S(i), and S(), are known, then the majority in FIG. 32. The method proceeds iteratively. It begins (step of the time involved in computing I(i,j) is devoted to 3203) with a partition of size V in wvhiclh each word is computing the Sums on the Second line of equation 18. Each assigned to a distinct class. At each Stage of the iteration of these Sums has approximately V-k terms and So this (Steps 3201 and 3202), the current partition, is replaced by reduces the problem of evaluating I(i,j) from one of order a new partition which is obtained by merging a pair of 60 V° to one of order V. classes into a Single class. The pair of classes to be merged Suppose that the quantities shown in Equation 22 are is chosen So that the Score of the new partition is as large as known at the beginning of an iteration. Then the new possible. The method terminates after V-C iterations, at partition C is obtained by merging the pair of classes C(i) which point the current partition contains C classes. and C.G), i-j, for which L(i,j) is Smallest. The k-1 classes In order that it be practical, the method 3204 must be 65 of the new partition are C. (1).C. (2), ....C.,(k-1) with implemented carefully. At the i' iteration, a pair of classes to be merged must be Selected from amongst approximately 5,805,832 35 36 CG)=C(k) if j

Sk-1 (l) : (23)

Finally, S(i) and L (li) are determined from equations 18 discovered that in the training data each of them is most and 19. often a mistyped that. This update process requires order Vf computations. 7.6 A Method for Constructing Similarity Trees Thus, by this implementation, each iteration of the method The clustering method 3204 can also be used to construct requires order Vf time, and the complete method requires a similarity tree over the Vocabulary. Suppose the merging order V time. steps 3201 and 3202 of method 3204 are iterated V-1 times, The implementation can improved further by keeping resulting in a Single class consisting of the entire Vocabulary. track of those pairs lm for which p(1,n) is different from The order in which the classes are merged determines a Zero. For example, Suppose that P is given by a simple 25 binary tree, the root of which corresponds to this single class bigram model trained on the data described in Table 3 of and the leaves of which correspond to the words in the Section 6. In this case, of the 6.799x10' possible word Vocabulary. Intermediate nodes of the tree correspond to 2-grams WW2, only 14,494,217 have non-Zero probability. groupings of words intermediate between Single words and Thus, in this case, the Sums required in equation 18 have, on the entire vocabulary. Words that are statistically similar average, only about 56 non-Zero terms instead of 260,741 as with respect to the model P(w, w) will be close together in might be expected from the size of the Vocabulary. the tree. 7.3 An Alternate Clustering Method FIG. 30 shows Some of the Substructures in a tree con For very large vocabularies, the method 3204 may be too Structed in this manner using a simple 2-gram model for the computationally costly. The following alternate method can 1000 most frequent words in a collection of office corre be used. First, the hie words of the Vocabulary are arranged 35 Spondence. in order of frequency with the most frequent words first. Each of the first C words is assigned. to its own distinct class. The method then proceeds iteratively through V-C 8 Overview of Translation Models and Parameter steps. At the k" step the (C+k)' most probable word is Estimation assigned to a new class. Then, two of the resulting C+1 40 This Section is an introduction to translation models and classes are merged into a single class. The pair of classes that methods for estimating the parameters of Such models. A is merged is the one for which the loSS in average mutual more detailed discussion of these topics will be given in information is least. After V-C steps, each of the words in Section 9. the Vocabulary will have been assigned to one of C classes. 7.4 Improving Classes 45 TABLE 4 The classes constructed by the clustering method 3204 or the alternate clustering method described above can often be Classes from a 260,741-word vocabulary improved. One method of doing this is to repeatedly cycle Friday Monday Thursday Wednesday Tuesday Saturday Sunday through the Vocabulary, moving each word to the class for weekends Sundays Saturdays which the resulting partition has the highest average mutual 50 June March July April January December October November September August information score. Eventually, no word will move and the people guys folks fellows CEOs chaps doubters commies method finishes. It may be possible to further improve the unfortunates blokes classes by Simultaneously moving two or more words, but down backwards ashore sideways southward northward overboard for large Vocabularies, Such a Search is too costly to be aloft downwards adrift feasible. 55 water gas coal liquid acid sand carbon steam shale iron. great big vast sudden mere sheer gigantic lifelong scant colossal 7.5 Examples man woman boy girl lawyer doctor guy farmer teacher citizen The methods described above were used divide the 260, American Indian European Japanese German African Catholic 741-word vocabulary of Table 3, Section 6, into 1000 Israeli Italian Arab pressure temperature permeability density porosity stress classes. Table 4 shows Some of the classes that are particu velocity viscosity gravity tension larly interesting, and Table 5 shows classes that were 60 mother wife father son husband brother daughter sister boss uncle Selected at random. Each of the lines in the tables contains machine device controller processor CPU printer spindle subsystem members of a different class. The average class has 260 compiler plotter John George James Bob Robert Paul William Jim David Mike words. The table shows only those words that occur at least anyone someone anybody somebody ten times, and only the ten most frequent words of army feet miles pounds degrees inches barrels tons acres meters bytes class. (The other two months would appear with the class of 65 director chief professor commissioner commander treasurer months if this limit had bcen extended to twelve). The founder superintendent dean custodian degree to which the classes capture both Syntactic and 5,805,832 37 38 maximize the probability that the model assigns to a training TABLE 4-continued sample consisting of a large number S of translations (f, e'), S=1,2,....S. This is equivalent to maximizing the log Classes from a 260,741-word vocabulary likelihood objective function liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ 25 had hadn't hath would’ve could’ve should've must've might've (P)-S-S logPole)- C. elogPole). (25) asking telling wondering instructing informing kidding reminding S= ,e bothering thanking deposing Here C(fe) is the empirical distribution of the sample, so that tha theat that C(fe) is 1/S times the number of times (usually 0 or 1) head body hands eyes voice arm seat eye hair mouth that the pair (fe) occurs in the sample. 8.1.2 Hidden Alignment Models In Some embodiments, translation models are based on TABLE 5 the notion of an alignment between a target Structure and a Source Structure. An alignment is a set of connections Randomly selected word classes 15 between the entries of the two structures. little prima moment's trifle tad Litle minute's tinker's FIG. 22 shows an alignment between the target Structure hornets teammates John V past 3S to kiss Mary and the Source Structure ask remind instruct urge interrupt invite congratulate Jean V past 3s embrasser Marie. (These structures are commend warn applaud transduced versions of the Sentences John kissed Mary and object apologize apologise avow whish Jean a embrassé Marie respectively.) In this alignment, the cost expense risk profitability deferral earmarks capstone cardinality mintage reseller entry John is connected with the entry Jean, the entry B dept. A A Whitey CL pi Namerow PA Mgr. LaRose V past 3S is connected wVith the entry V past 3s, the # Rel rel. #S Shree entry to kiss is connected with the entry embrasser, and the S Gens nai Matsuzawa ow Kageyama Nishida Sumit Zollner Mallik entry Mary is connected with the entry Marie. FIG.23 shows research training education science advertising arts medicine an alignment between the target Structure John V past 3S machinery Art AIDS 25 to kiss the girl and the Source Structure Le jeune fille rise focus depend rely concentrate dwell capitalize embark intrude typewriting V passive 3S embrasser par Jean. (These structures are Minister mover Sydneys Minster Miniter transduced versions of the Sentences John kissed the girl and running moving playing setting holding carrying passing Le jeune fille a été embrassée par Jean respectively.) cutting driving fighting In some embodiments, illustrated in FIG. 24, a translation court judge jury slam Edelstein magistrate marshal Abella Scalia larceny model 706 computes the probability of a source structure annual regular monthly daily weekly quarterly periodic Good given a target Structure as the Sum of the probabilities of all yearly convertible alignments between these structures: aware unaware unsure cognizant apprised mindful partakers force ethic stoppage force's conditioner stoppages conditioners waybill forwarder Atonabee 35 Pa(fle) = Pe(f, ale). (26) systems magnetics loggers products coupler Econ databanks Centre inscriber correctors industry producers makers fishery Arabia growers addiction In other embodiments, a translation model 706 can compute medalist inhalation addict the probability of a Source Structure given a target Structure brought moved opened picked caught tied gathered cleared hung lifted as the maximum of the probabilities of all alignments 40 between these structures: 8.1 Tanslation Models (27) AS illustrated in FIG. 21, a target Structure to Source Pa(fle) = max Pa(f, ale). structure translation model P706 with, parameters 0 is a method for calculating a conditional probability, or likelihood, P(fe), for any Source structure f given any target 45 As depicted in FIG. 25, the probability P(fe) of a single Structure e. Examples of Such Structures include, but are not alignment is computed by a detailed translation model 2101. limited to, Sequences of words, Sequences of linguistic The detailed translation model 2101 employs a table 2501 of morphs, parse trees, and case frames. The probabilities values for the parameters 0. Satisfy: 8.2 An example of a detailed translation model 50 A simple embodiment of a detailed translation model is P(fle) 2, 0, P(failuree) 2 0, (24) depicted in FIG. 26. This embodiment will now be described. Other, more sophisticated, embodiments will be described in in Section 9. Pa(failuree) + ; Ple = 1, For the simple embodiment of a detailed translation 55 model depicted in FIG. 26, the Source and target Structures where the Sum ranges over all Structures f, and failure is a are Sequences of linguistic morphs. This embodiment com special Symbol. Pe(fle) can be interpretted as the probability prises three components: that a translator will produce f when given e, and 1. a fertility Sub-model 2601; P(failuree) can be interpreted as the probability that he will 2. a lexical Sub-model 2602; produce no translation when given e. A model is called 60 3. a distortion Sub-model 2603. deficient if P(failuree) is greater than Zero for Some e. The probability of an alignment and Source Structure 8.1.1 Training given a target Structure is obtained by combining the prob Before a translation model can be used, values must be abilities computed by each of these sub-models. Corre determined for the parameters 0. This process is called sponding to these Sub-models, the table of parameter values parameter estimation or training. 65 2501 comprises: One training methodology is maximum likelihood 1a. fertility probabilities n(ple), where p is any non traininig, in which the parameter values are chosen So as to negative integer and e is any target morph; 5,805,832 39 40 b. null fertility probabilities no (plm"), where p is any Source morph (V passive 3s), So (p=1. The complete set non-negative integer and m' is any positive integer, of fertilities are 2a. lexical probabilities to fle), where f is any Source morph, and e is any target morph; (p = 0 (p = 1 p = 1 p = 1 p = 1 p = 2 b. lexical probabilities t(f*null), where f is any source 5 morph, and null is a Special Symbol; The components of the probability P(f, ae) are 3. distortion probabilities a(ii,m), where m is any positive combinatorial factor = O11112 note: 0 = 1 integer, i is any positive integer, and j is any positive integer between 1 and m. fertility prob = no(k = 15) This embodiment of the detailed translation model 2101 n(1 John)n(1 V past 3s) computes the probability Pe(fae) of an alignment a and a n(1 to kiss)n (1 the)n(1 girl) Source Structure f given a target Structure e as follows. If any lexical prob t(Lethe)tGuene girl)t(fille girl) Source entry is connected to more than one target entry, then the probability is zero. Otherwise, the probability is com t(V passive 3s V past 3s) puted by the formula 15 t(embrasser to kiss) Pa(fae)=combinatorial factorxfertility probxlexical probxdis t(par *null)t(Jean John) tortion prob (28) distortion prob = 1.1 a(1 4, 7)a(25, 7)a(35, 7) where a(42,7)a(53, 7)a(71, 7) (29) combinatorial factor = II Many other embodiments of the detailed translation 2101 are possible. Five different embodiments will be described in (30) fertility prob = 25 Section 9 and Section 10. 8.3 Iterative Improvement lexical prob = (31) It is not possible to create of Sophisticated model or find good parameter values at a stroke. One aspect of the present (32) invention is a method of model creation and parameter distortion prob = estimation that involves gradual iterative improvement. For a given model, current parameter values are used to find Here better ones. In this way, Starting from initial parameter 1 is the number of morphs in the target structure; values, locally optimal values are obtained. Then, good e, for i=1,2,..., is the i' morph of the target structure; 35 parameter values for one model are used to find initial eo is the special symbol null; parameter values for another model. By alternating between m is the number of morphs in the Source Structure; these two steps, the method proceeds through a Sequence of f, for j=1,2,..., m is the j" morph of the source structure; gradually more Sophisticated models. AS illustrated in FIG. (po is the number of morphs of the target Structure that are 27, the method comprises the following Steps: not connected with any morphs of the Source Structure; 40 2701. Start with a list of models Model 0, Model 1, Model (p, for i=1,2,..., l is the number of morphs of the target 2 etc. structure that are connected with the i' morph of the Source Structure, 2703. Choose parameter values for Model 0. a for j=1,2,..., m is 0 if the j" morph of the source 2702. Let i=0. Structure is not connected to any morph of the target 45 2707. Repeat steps 2704, 2705 and 2706 until all models Structure, are processed. a for j=1,2,..., m is i if the j" morph of the Source 2704. Increment i by 1. structure is connected to the i' morph of the target 2705. Use Models 0 to i-1 together with their parameter Structure. 50 values to find initial parameter values for Model i. These formulae are best illustrated by an example. Con sider the Source Structure, target Structure, and alignment 2706. Use the initial parameter values for Model i, to find depicted in FIG. 23. There are 7 source morphs, so m=7. The locally optimal parameter values for Model i. first Source morph is Le, So fs=Le; the Second Source morph 8.3.1 Relative Objective Function is jeune, So f=jeune, etc. There are 5 target morphs so l=5. 55 Two hidden alignment models P and P of the form The first target morph is John, So e =John; the Second target depicted in FIG. 24 can be compared using the relative morph is V past 3s, so e=V past 3s., etc. The first objective function' Source morph (Le) is connected to the fourth target morph (the), So a=4. The Second Source morph (jeune) is con The relative objective function R(PP) is similar in form to the relative nected to the fifth target morph (girl) So a=5. The complete entropy alignment is 60

a = 4 a 2 = 5 a.3 = 5 a , = 2 as = 3 a = 0 a 7 = 1 All target morphs are connected to at least one Source between probability distributions p and q. However, whereas morph, So po=0. The first target morph (John) is connected 65 the relative entropy is never negative, R can take any value. to exactly one Source morph (Jean), So (p=1. The Second The inequality (35) for R is the analog of the inequality D20 target morph (V past 3s) is connected to exactly one for D. 5,805,832 41 42 “More generally, we can allow c(ow;afe) to be a non-negative real number. Pa(f, ae) (34) c(o); a, f, e) = 0(co) co- logPo(f, ale). (40) where P(afe)=P(a,fle)/P(fe). Note that R(P, P)=0. R is 5 The values for 0 that maximize the relative objective func related to up by Jensen's inequality tion R(PP) subject to the contraints (39) are determined by the Kuhn–Tucker conditions p(Po)ep(P)+R(Pa.P.), (35) which follows because the logarithm is concave. In fact, for 6 R(P, Po) - A = 0, oeS2, (41) any e and f, 30(o) where 2 is a Lagrange multiplier, the value of which is X Pa(af, ologi?tié s logy, P(al?, e) Pachalealiéle) (36)36 determined by the equality constraint in Equation (39). Pa(f, ae) Pa(f, ae) These conditions are both necessary and Sufficient for a maximum since R(PP) is a concave function of the 0(co). = log P(feal ) = logPo(?e) - logPo(?e). (37)37 15 Multiplying Equation (41) by 0(co) and using Pa(fe) Equation (40) and Definition (34) of R, yields the parameter Summing over e and f and using the Definitions (25) and reestimation formulae (34) yields Equation (35). (42) 8.3.2 Improving Parameter Values 0(co) = Wico(co), W = X., c6(c)), From Jensen's inequality (35), it follows that p(P) is oeS2 greater than p(P) if R(PP) is positive. With P=P, this (43) Suggests the following iterative procedure, depicted in FIG. ca(0) = i. C(f, e)ca(c); f, e), 28 and known as the EM Method, for finding locally optimal parameter values 0 for a model P. 25 (44) 2801. Choose initial parameter values 0. c6(a); f, e) = } P(al? e)c(o); a, f, e). 2803. Repeat Steps 2802-2805 until convergence. c(();fe) can be interpretted as the expected number of 2802. With 0 fixed, find the values 0 which maximize R(P times, as computed by the model P, that the event () occurs 8,Po). in the translation of e to f. Thus 0(c)) is the (normalized) 2805. Replace 0 by 0. expected number of times, as computed by model P, that () 2804. After convergence, 0 will be locally optimal param occurs in the translations of the training Sample. eter values for model P. These formulae can easily be generalized to models of the Note that for any 0, R(PP) is non-negative at its maximum form (38) for which the single equality constraint in Equa in 0, since it is zero for 0=0. Thus p(P) will not decrease tion (39) is replaced by multiple constraints from one iteration to the next. 35 8.3.3 Going From One Model to Another (45) Jensen's inequality also Suggests a method for using (DeS2.X 0(co)(O) = 1, u =: 1,2,..., parameter values 0 for one model P to find initial parameter values 0 for another model P: where the Subsets S2, u=1,2,... form a partition of S2. Only 40 2804. Start with parameter values 0 for P Equation (42) needs to be modified to include a different), 2901. With P and 0 fixed, find the values 0 which for each u: if coeS2, then maximize R(PP). (46) 2902. 0 will be good initial values for model P. (DeS2. 45 In contrast to the case where P=P, there may not be any 8.3.5 Approximate and Viterbi Parameter Estimation 0 for which R(PP) is non-negative. Thus, it could be that, The computation required for the evaluation of the counts even for the best 0, p(P). c in Equation 44 increases with the complexity of the model 8.3.4 Parameter Reestimation Formulae whose parameters are being determined. To apply these procedures, it is necessary to Solve the 50 For simple models, such as Model 1 and Model 2 maximization problems of Steps 2802 and 2901. For the described below, it is possible to calculate these counts models described below, this can be done explicitly. To see exactly by including the contribution of each possible align the basic form of the Solution, Suppose P is a, simple model ments. For more Sophisticated models, Such as Model 3, given by Model 4, and Model 5 described below, the Sum over 55 alignments in Equation 44 is too costly to compute cxactly. Rather, only the contributions from a Subset of alignments can be practically included. If the alignments in this Subset where the 0(co), (DeS2, are real-valued parameters Satisfying account for most of the probability of a translation, then this the constraints truncated Sum can Still be a good approximation. 60 This suggests the following modification to Step 2802 of (39) the iterative procedure of FIG. 28: 0(co) > 0, X 0(co) = 1, In calculating the counts using the update formulae 44, oeS2 approximate the Sum by including only the contribu and for each () and (a,fe), c(cow;a,fe) is a non-negative tions from Some Subset of alignments of high probabil integer. It is natural to interpret 0(co) as the probability of 65 ity. the event () and c(co;a,fe) as the number of times that this If only the contribution from the single most probable event occurs in (a,fe). Note that alignment is included, the resulting procedure is called 5,805,832 43 44 Viterbi Parameter Estination. The most probable alignment face letters, and their entries will be denoted by the corre between a target Structure and a Source Structure is called the sponding non-bold letters. Viterbi alignmtent. The convergence of Viterbi Estimation is In particular, e will denote an English word, e will denote easily demonstrated. At each iteration, the parameter values a String of English words, and E will denote a random are re-estimated So as to make the current Set of Viterbi variable that takes as values Strings of English words. The alignments as probable as possible; when these parameters length of an English String will be denoted l, and the are used to compute a new set of Viterbi alignments, either corresponding random variable will be denoted L. Similarly, the old set is recovered or a set which is yet more probable f will denote a French word, f will denote a string of English is found. Since the probability can never be greater than one, words, and F will denote a random variable that takes as this process Surely converge. In fact, it converge in a finite, values strings of French words. The length of a French string though very large, number of Steps because there are only a will be denoted m, and the corresponding random variable finite number of possible alignments for any particular will be denoted At. translation. A table notation appears in Section 10. In practice, it is often not possible to find Viterbi align 9.2 Translations ments exactly. Instead, an approximate Viterbi alignment 15 A pair of Strings that are translations of one another will can be found using a practical Search procedure. be called a translation. A translation will be depicted by 8.4 Five Models enclosing the Strings that make it up in parentheses and Five detailed translation models of increasing Sophistica Separating them by a vertical bar. Thus, (Quaurions-nous pu tion wVill be described in Section 9 and Section 10. These faire? What could we have done?) depicts the translation of models will be referred to as Model 1, Model 2, Model 3, Quaurions-nous pu faire? as What could we have done?. Model 4, and Model 5. When the strings are sentences, the final stop will be omitted Model 1 is very simple but it is useful because its unless it is a question mark or an exclamation point. likelihood function is concave and consequently has a global 9.3 Alignments maximum which can be found by the EM procedure. Model Some embodiments of the translation model involve the 2 is a slight generalization of Model 1. For both Model 1 and 25 idea, of an alignment between a pair of Strings. An alignment Model 2, the sum over alignments for the objective function consists of a set of connections between words in the French and the relative objective function can be computed very String and words in the English String. An alignment is a efficiently. This significantly reduces the computational random variable, A, a generic value of this variable will be complexity of training. Model 3 is more complicated and is denoted by a. Alignments are shown graphically, as in FIG. designed to more accurately model the relation between a 33, by writing the English string above the French string and morph of e and the set of morphs in f to which it is drawing lines from Some of the words in the English String connected. Model 4 is a more Sophisticated Step in this to some of the words in the French string. These lines are direction. Both Model 3 and Model 4 are deficient. Model 5 called connections. The alignment in FIG. 33 has seven is a generalization of Model 4 in this deficiency is removed connections, (the, Le), (program, programme), and So on. In at the expense of more increased complexity. For Models 35 the description that follows, Such an alignment will be 3,4, and 5 the exact Sum over alignments can not be denoted as (Le programme a été mis en applicationAnd computed efficiently. Instead, this Sum can be approximated the(1) program(2) has(3) been(4) implemented(5,6,7)). The by restricting it to alignments of high probability. list of numbers following an English word shows the posi Model 5 is a preferred embodiment of a detailed transla tions in the French string of the words with which it is tion model. It is a powerful but unwieldy ally in the battle 40 aligned. Because And is not aligned with any French words to describe translations. It must be led to the battlefield by here, there is no list of numbers after it. Every alignment is its weaker but more agile brethren Models 2, 3, and 4. In considered to be correct with some probability. Thus (Le fact, this is the raison d’étre of these models. perogramme a clé misen applicationAnd(1,2,3,4,5,6,7) the program has been implemented) is perfectly acceptable. Of 9 Detailed Description of Translation Models and 45 course, this is much less probable than the alignment shown Parameter Estimation in FIG. 33. In this Section embodiments of the Statistical translation AS another example, the alignment (Le reste appartenait model that assigns a conditional probability to the event that aux autochlonesThe(1) balance(2) was(3) the(3) territory(3) a Sequence of lexical units in the Source language is a of(4) the(4) aboriginal(5) people(5)) is shown in FIG. 34. translation of a sequence of lexical units in the target 50 Sometimes, two or more English words may be connected to language will be described. Methods for estimating the two or more French words as shown in FIG. 35. This will be parameters of these embodiments will be explained. denoted as (Les pauvres Soni démunisThe(1) poor(2) don't For concreteness the discussion will be phrased in terms (3,4) have(3,4) any(3,4) money(3,4)). Here, the four English of a Source language of French and a target language of words don’t have any money conspire to generate the two English. The Sequences of lexical units will be restricted to 55 French words Sont démunis. Sequences of words. The set of English words connected to a French word will It should be understood that the translation models, and be called the notion that generates it. An alignment resolves the methods for their training generalize easily to other the English String into a Set of possibly overlapping notions Source and target languages, and to Sequences of lexical that is called a notional Scheme. The previous example units comprising lexical morphemes, Syntactic markers, 60 contains the three notions. The, poor, and don’t have any Sense labels, case frame labels, etc. money. Formally, a notion is a Subset of the positions in the 9.1 Notation English String together with the words occupying those Random variables will be denoted by upper case letters, positions. To avoid confusion in describing Such cases, a and the values of such variables will be denoted by the Subscript will be affixed to each word showing its position. corresponding lower case letters. For random variables X 65 The alignment in FIG. 34 includes the notions the and and Y, the probability Pr(Y=yx=x) will be denoted by the of the 7, but not the notions of the or the 7. In (Japplaudis shorthand p(yx). Strings or vectors will be denoted by bold a la décision|I(1) applaud(2) the(4) decision(5)), a is gener 5,805,832 45 46 ated by the empty notion. Although the empty notion has no multipliers, ), and Seeking an unconstrained maximum of position and So never requires a clarifying Subscript, it will the auxiliary function be placed at the beginning of the English String, in position Zero, and denoted by eo. At times, therefore, the previous (51) alignment will be denoted as (J'applaudis a la décisionleo(3) an=0 t(flea) x( ; (e) - 1 ) I(1) applaud(2) the(4) decision(5)). The set of all alignments of (fe) will be written A(e, f). At a local maximum, all of the partial derivatives of h with Since e has length land f has length m, there are lm different respect to the components oft and ) are Zero. That the partial connections that can be drawn between them. Since an derivatives with respect to the components of be Zero is alignment is determined by the connections that it contains, Simply a restatement of the constraints on the translation and Since a Subset of the possible connections can be chosen 1O probabilities. The partial derivative of h with respect to t(fe) in 2" ways, there are 2" alignments in A(e, f). S For the Models 1-5 described below, only alignments without multi-word notions are allowed. ASSuming this dh - (52) restriction, a notion can be identified with a position in the ar(e) - English String, with the empty notion identified with posi 15 tion Zero. Thus, each position in the French String is con e (t+1)"- 1)? S. $, $8 foe,f)oe, ea)ife)eite-fi ificea)tile)- - Ae, nected to one of the positions 0 through 1 in the English a? Oa-0 j1' g Vklea. string. The notation a-i will be used to indicate that a word in position of the French String is connected to the word in where 6 is the Kronecker delta function, equal to one when position i of the English String. both of its arguments are the same and equal to Zero The probability of a French String f and an alignment a otherwise. This will be zero provided that given an English String e can be written 53 éa t(flea). (53) 25 P(, ale)-Pole?j= Palai , f', m, ePilaf', me). " Superficially Equation (53) looks like a solution to the Here, m is the length of f and a " is determined by a. maximization problem, but it is not because the translation 9.4 Model 1 probabilities appear on both Sides of the equal sign. The conditional probabilities on the right-hand side of Nonetheless, it Suggests an iterative procedure for finding a Equation (47) cannot all be taken as independent parameters Solution: because there are too many of them. In Model 1, the probabilities P(me) are taken to be independent of e and m; 1. Begin with initial guess for the translation probabilities, that P(a.a.?.l.m.,e), depends only on l, the length of the 2. Evaluate the right-hand side of Equation (53); English string, and therefore must be (1+1)"; and that 3. Use the result as a new estimate for t(fe). P(fla', f', m,e) depends only on f, and e. The 35 4. Iterate Steps 2 through 4 until converged. parameters, then, axe e=P(mle), and t(flea)=P(fla'f'', (Here and elsewhere, the Lagrange multipliers simply me), which will be called the tranzslatzion probability of f, Serve as a reminder that the translation probabilities must be given e. The parameter e is fixed at Some Small number. normalized so that they satisfy Equation (50).) This process, The distribution of M is unnormalized but this is a minor when applied repeatedly is called the EM process. That it technical issue of no significance to the computations. In 40 converges to a Stationary point of h in Situations like this, as particular, M can be restricted to Some finite range. AS long demonstrated in the previous Section, was first shown by L. as this range encompasses everything that actually occurs in E. Baum in an article entitled, An Inequality and ASSociated training data, no problems arise. Maximization Technique in Statistical Estimation of Proba A method for estimating the translation probabilities for bilistic Functions of a Markov Process, appearing in the Model 1 will now be described. The joint likelihood of a 45 journal Inequalities, Vol.3, in 1972. French Sentence and an alignment is With the aid of Equation (48), Equation (53) can be re-expressed as P(, ale) = c(+1)" j=1?ittle). (48) 50 The alignment is determined by Specifying the values of a N-1 for j from 1 to m, each of which can take any value from 0 number of times e connects to fin a to 1. Therefore, The expected number of times that e connects to fin the l in (49) translation (fle) will be called the count of f givene for (fe) P(fe)(re)-(+1)", = e (i+1) X ... X Little)II tie). 55 and will be denoted by c(fe;fe). By definition, The first goal of the training process is to find the values of the translation probabilities that maximize P(fe) subject to the constraints that for each e, 60 where P(ae,f)=P(fale)/P(fe). If ) is replaced by P(fe), Xt(fle) = 1. (50) f then Equation (54) can be written very compactly as An iterative method for doing so will be described. The method is motivated by the following consideration. Following Standard practice for constrained maximization, a 65 In practice, the training data consists of a set of necessary relationship for the parameter values at a local translations, (fe'), (ffe'), . . . , (fe'), so this maximum can be found by introducing Lagrange equation becomes 5,805,832 47 48

S (57) (64) h(t, a, W, u) = e X. . . . ife)Saii, In, i) - tre)-.'s clef, et). a 1=0 is offittestal, 'v'." m, Here, 2 Serves only as a reminder that the translation probabilities must be normalized. e f Usually, it is not feasible to evaluate the expectation in ; (; (e-)-; (; clim D-1). Equation (55) exactly. Even if multiword notions are excluded, there are still (1+1)" alignments possible for (fe). It is easily verified that Equations (54), (56), and (57) Model 1, however, is special because by recasting Equation carry over from Model 1 to Model 2 unchanged. In addition, (49), it is possible to obtain an expression that can be an iterative update formulas for a(ii, m, 1), can be obtained evaluated efficiently. The right-hand side of Equation (49) is by introducing a new count, c(ii, m, l; f, e), the expected a Sum of terms each of which is a monomial in the number of times that the word in position of f is connected translation probabilities. Each monomial contains m trans to the word in position i of e. Clearly, lation probabilities, one for each of the words in f. Different monomials correspond to different ways of connecting 15 c(ii, m, life) = S. P(ale, f)ö(i, aj). (65) words in f to notions in e with every way appearing exactly In analogy with Equations (56) and (57), for a single once. By direct evaluation, then translation, in i 58 X. . . . X. II X t(file). ( ) a 1=0 aGli, m, 1)=un'c(il, m, l; fe), (66) Therefore, the sums in Equation (49) can be interchanted and, for a set of translations, with the product to obtain S (67) 25 a(ii, n, l) -ph, c(ii, n, l, f(s), e(s)). P(fle) = e(t+1)- II,in X,i tCfile). (59) j=1 i=0 Notice that if f does not have length m or ife' does not Using this expression, it follows that have length l, then the corresponding count is Zero. AS with the 2's in earlier equations, the us here serve to normalize count of eine (60) the alignment probabilities. -- Model 2 shares with Model 1 the important property that M t(fle) $ 86, his the sums in Equations (55) and (65) can be obtained effi chef e-Goti. On 2,80, D2, ice, e). ciently. Equation (63) can be rewritten V-1 count offin f 35 P(fle) = e inII, X,i tCfile)a(ii, m, l). (68) Thus, the number of operations necessary to calculate a j=1 i=0 count is proportional to lm rather than to (1+1)" as Equation (55) might Suggest. Using this form for P(fe), it follows that The details of the initial guesses for t(fle) are unimportant in i t(fe)a(ii, m, l)ö(f, f)ö(e, e.) (69) because P(fe) has a unique local maximum for Model 1, as 40 c(fle; f, e) - 2, io t(fleo)a(Oi, m, l) +...+ t f) i, m, l) is shown in Section 10. In practice, the initial probabilities t(fe) are chosen to be equal, but any other choice that avoids and ZeroS would lead to the same final Solution. 9.5 Model 2 t(file)a(ii, n, l) (70) Model 1, takes no cognizance of where words appear in 45 clim, life) -- Goa(O. m. iiii () (, , , either String. The first word in the French String is just as Equation (69) has a double sum rather than the product of likely to be connected to a word at the end of the English two single Sums, as in Equation (60), because, in Equation String as to one at the beginning. In contrast, Model 2 makes (69), i and j are tied together through the alignment prob the same assumptions as in Model 1 except that it assumes abilities. that P(C.C., f', me) depends onj, C, and m, as well 50 Model 1 is a special case of Model 2 in which a(ii, m, 1) as on 1. This is done using a set of alignment probabilities, is held fixed at (1+1). Therefore, any set of parameters for Model 1 can be reinterpreted as a set of parameters for Model 2. Talking as initial estimates of the parameters for C.(oil.m.)=P(c,c', fif'm, 1) (61) Model 2 the parameter values that result from training which Satisfy the constraints 55 Model 1 is equivalent to computing the probabilities of all alignments using Model 1, but then collecting the counts X a(ii, m, l) = 1 (62) appropriate to Model 2. The idea of computing the prob i=0 abilities of the alignments using one model, but collecting for each triple jml. Equation (49) is replaced by the counts in a way appropriate to a Second model is very 60 general and can always be used to transfer a Set of param eters from one model to another. Pole)-(fle) = e a $1-0 ... a-0$ j-1?ittle.atali t(flea)a(alim, t). (63) 9.6 Intermodel interlude Models 1 and 2 make various approximations to the A relationship among the parameter values for Model 2 that conditional probabilities that appear in Equation (47). maximize this likelihood is obtained, in analogy with the 65 Although Equation (47) is an exact statement, it is only one above discussion by Seeking an unconstrained maximum of of many ways in which the joint likelihood off and a can be the auxiliary function written as a product of conditional probabilities. Each Such 5,805,832 49 SO product corresponds in a natural way to a generative proceSS called the Viterbi alignment for (fe) and will be denoted by for developing f and a from e. In the process corresponding V(fe). For Model 2 (and, thus, also for Model 1), finding to Equation (47), a length for f is chosen first. Next, a V(fle) is straightforward. For each j, a, is chosen so as to position in e is Selected and connected to the first position in make the product to fle)C (C. m, 1) as large as possible. f. Next, the identity of f is selected. Next, another position The Viterbi alignment depends on the model with respect to in e is Selected and this is connected to the Second word in which it is computed. In what follows, the Viterbi align f, and So on. ments for the different models will be distinguished by Casual inspection of Some translations quickly establishes that the is usually translated into a single word (le, la, or 1"), writing V(fe; 1), V(fe; 2), and so on. but is Sometimes omitted; or that only is often translated into The set of all alignment for which C-iwill be denoted by one word (for example, Seulement), but Sometimes into two A. (e., f). The pair ij will be said to be pegged for the (for example, ne . . . que), and Sometimes into none. The elements of A- (e, f). The element of A- (e., f) for which number of French words to which e is connected in an P(ale, f) is greatest will be called the pegged Viterbi align alignment is a random variable, d, that will be called its ment for ij, and will be denoted by V. (fle), Obviously, fertility. Each choice of the parameters in Model 1 or Model V. (fle; 1) and V. (fle; 2) can be found quickly with a 2 determines a distribution, Pr(d=(p)), for this random 15 straightforward modification of the method described above variable. But the relationship is remote: just what change for finding V(fe; 1) and V(fe; 2). will be wrought in the distribution of d, if, say, C.(12, 8, 9.7 Model 3 9) is adjusted, is not immediately clear. In Models 3, 4, and Model 3 is based on Equation (71). It assumes that, for i 5, fertilities are parameterized directly. between 1 and l, P(p,q),', e) depends only on p, and e, that, AS a prolegomenon to a detailed discussion of Models 3, for all i, P(th,', to", (po", e) depends only on t, and e,; 4, and 5, the generative proceSS upon which they are based and that, for i between 1 and l, P(tilt?', 'L', to", (po', e) will be described. Given an English string, e, a fertility for depends only on t, i, m, and 1. The parameters of Model 3 each word is Selected, and a list of French words to connect are thus a set of fertility probabilities, n(ple.)=P(plp', e); to it is generated. This list, which may be empty, will be a set of translation probabilities, tCfle)=Pr(T=fit?', to", called a tablet. The collection of tablets is a random variable, 25 (po', e); and a set of distortion probabilities, dGli, m, 1)=Pr T, that will be called the tableau of e; the tablet for the i' (II=jtif', L', to", (po', e). English word is a random variable, T, and the k" French The distortion and fertility probabilities for eo are handled word in the i' tablet is a random variable, T. After differently. The empty notion conventionally occupies posi choosing the tableau, a permutation of its words is tion 0, but actually has no position. Its purpose is to account generated, and f is produced. This permutation is a random for those words in the French string that cannot readily be variable, II. The position in f of the k' word in the i' tablet accounted for by other notions in the English String. Because is yet another a random variable, II. these words are usually spread uniformly throughout the The joint likelihood for a tableau, t, and a permutation, IL, French String, and because they are placed only after all of S the other words in the Sentence have been placed, the 35 probability Pr(IIo-to', 'L', to", po", e) is set to 0 unless i-1 (71) position j is vacant in which case it is set (po-k). P(, ale)- I Polo.’, ePolo', ex Therefore, the contribution of the distortion probabilities for i (p; all of the words in to is 1/(pol. III, IIIIP P(Tit, t", t',T', (po, e) x The value of cpo depends on the length of the French 40 Sentence Since longer Sentences have more of these extra i (p; neous words. In fact Model 3 assumes that i=1II k=1II P(Tilti,(Tikri', Ji', to, (po,po, e) x (p1 + ... + p. (73) (po P(pokpit, e) = ( (po )e." ... +1-hoppo P(Tolt', Ji', tol, pot, e). 45 for Some pair of auxiliary parameters po and p. The expression on the left-hand Side of this equation depends on In this equation, t, represents the series of valuest, (p,' only through the Sum (p+...+(p, and defines a probability ..., T-1, T, represents the series of values T, ... tik-1 distribution over (po whenever po and pare nonnegative and and p is a shorthand for (p. Sum to 1. The probability P(polp', e) can be interpreted as Knowing T and It determines a French String and an 50 follows. Each of the words from t' is imagined to require alignment, but in general Several different pairs T, It may an extraneous word with probability p, this word is lead to the same pair f, a. The Set of Such pairs will be required to be connected to the empty notion. The probabil denoted by (f, a). Clearly, then ity that exactly (po of the words from Ti' will require an extraneous word is just the expression given in Equation 55 P(f, ale) : (raica) P(t, Te). (72)72 (73). As in Models 1 and 2, an alignment of (fle) in Model 3 is The number of elements in (f, a) is determined by specifying C, for each position in the French string. The fertilities, po through (p, are functions of the C.'s. Thus, p, is equal to the number of j's for which C, equals i. S.$o: pil 60 Therefore, (74) because for eacht, there are p, arrangements that lead to the P(fle) = x . . . X. P(f, ale) = pair f, a. FIG. 36 shows the two tableaux for (bon march a 1=0 an=0 echeap(1,2)). 65 n - po Except for degenerate cases, there is one alignment in a 1=0X. . . . an=0X. ( (po )e. –200pl", S. pln(ple)in(p;le;) X A(e, f) for which P(ae, f) is greatest. This alignment will be 5,805,832 S1 52 -continued Model 2 does not work for Model 3. Because of the fertility parameters, it is not possible to exchange the Sums over C. through C with the product over j in Equation (74) as was done for Equations (49) and (63). A method for overcoming 5 with X(fle)=1, X,d(ii, m, 1)=1, Xn(pe)=1, and pot-p=1. this difficulty now will be described. This method relies on According to the assumptions of Model 3, each of the pairs the fact that Some alignments are much more probable than (t, t) in (f, a) makes an identical contribution to the Sum in others. The method is to carry out the sums in Equations (74) Equation (72). The factorials in Equation (74) come from and (76) through (80) only over some of the more probable carrying out this Sum explicitly. There is no factorial for the empty notion because it is exactly cancelled by the contri alignments, ignoring the vast Sea of much leSS probable bution from the distortion probabilities. OCS. By now, it should be clear how to provide an appropriate auxiliary function for Seeking a constrained maximum of the likelihood of a translation with Model 3. A suitable auxiliary To define the Subset, S, of the elements of A(fe) over function is 15 which the Sums are evaluated a little more notation is required. Two alignments, a and a' will be said to differ by h(t, d n, p, W, u, v, S) = (75) a move if there is exactly one value of j for which CzC. Alignments will be said to differ by a Swap if C-O, except

e at two values, j, and j, for which C=C- and C=C.'. The Pre- (; (e-)-4-(; di m D-1)- two alignments will be said to be neighbors if they are identical or differ by a move or by a Swap. The set of all sv (;icle-1)-spot p-1). neighbors of a will be denoted by N(a). As with Models 1 and 2, counts are defined by 25 Let b(a) be that neighbor of a for which the likelihood is greatest. Suppose that i is pegged for a. Among the neigh bors of a for which ij is also pegged, let b (a) be that for which the likelihood is greatest. The Sequence of alignments c(ii, m, life) = P(ale, f)ö(i, ai), (77) a, b(a), bf(a)=b(b(a)), . . . , converges in a finite number of Steps to an alignment that will be denoted as b (a). (78) Similarly, if i is pegged for a, the Sequence of alignments a, b- (a), b, f(a), ..., converges in a finite number of steps to an alignment that will be denoted as b, ... (a). The simple 35 form of the distortion probabilities in Model 3 makes it easy and to find b(a) and b- (a). If a' is a neighbor of a obtained from it by the move of j from i to i", and if neither inor i is 0, c(; fe)- Pale, pp. (80) then In terms of these counts, the reestimation formulae for Model 3 are

P(ae, f) = P(ale, f) --1) (85) 45 Notice that p, is the fertility of the word in positioni' for S (81) alignmenta. The fertility of this word in alignment a' is (p+1. t(?le) = W.' S= cle; f(s), e(s)), Similar equations can be easily derived when either i or i' is Zero, or when a and a' differ by a Swap. d(ii, m, t) = Hit S c(ii, m, l; f(s), e(s)), (82) 50 Given these preliminaries, S is defined by S= (86) n(ple) = v.' s c(ple, f(s), e(s)), (83) S = N(b(VCle; 2)))UU N(t), (V., (?e; 2))). S= ly

and 55 In this equation, b(V(fle; 2)) and b. (V. (fle; 2)) are approximations to V(fle; 3) and V. (fle; 3) neither of which (84) can be computed efficiently. pk = 3 s c(k; f(s), e(s)). In one iteration of the EM process for Model 3, the counts S=1 in Equations (76) through (80), are computed by Summing Equations (76) and (81) are identical to Equations (55) only over elements of S. These counts are then used in and (57) and are repeated here only for convenience. Equa 60 Equations (81) through (84) to obtain a new set of param tions (77) and (82) are similar to Equations (65) and (67), but eters. If the error made by including only Some of the C(ii, m, 1) differs from d(i, m, l) in that the former Sums to elements of A(e, f) is not too great, this iteration will lead to unity over all i for fixed while the latter sums to unity over values of the parameters for which the likelihood of the all j for fixed i. Equations (78), (79), (80), (83), and (84), for training data is at least as large as for the first Set of the fertility parameters, are new. 65 parameterS. The trick that permits the a rapid evaluation of the The initial estimates of the parameters of Model 2 are right-hand sides of Equations (55) and (65) efficiently for adapted from the final iteration of the EM process for Model 5,805,832 S3 S4 2. That is, the counts in Equations (76) through (80) are ceiling of the average value of the positions in the French computed using Model 2 to evaluate p(ale, f). The simple string of the words from its tablet. The head of the notion is form of Model 2 again makes exact calculation feasible. The defined to be that word in its tablet for which the position in Equations (69) and (70) are readily adapted to compute the French String is Smallest. counts for the translation and distortion probabilities; effi In Model 4, the probabilities d(i, m, 1) are replaced by cient calculation of the fertility counts is more involved. A two Sets of parameters: one for placing the head of each discussion of how this is done is given in Section 10. notion, and one for placing any remaining words. For i>0, 9.8 Deficiency Model 4 requires that the head for notion i be t. It A problem with the parameterization of the distortion assumes that probabilities in Model 3 is this: whereas the sum over all pairs T, It of the expression on the right-hand Side of Equation (71) is unity, if Pr(II=jt', 'L', to", (po", e) Pr(II=ila", to-po". e)=d(j-G), 1A(et 1), B(f)). (87) depends only onj, i, m, and 1 for i>0. Because the distortion Here, A and B are functions of the English and French word probabilities for assigning positions to later words do not that take on a small number of different values as their depend on the positions assigned to earlier words, Model 3 15 arguments range over their respective vocabularies. In the wastes some of its probability on what will be called Section entitled Classes, a process is described for dividing generalized Strings, i.e., Strings that have Some positions a vocabulary into classes So as to preserve mutual informa with several words and others with none. When a model has tion between adjacent classes in running text. The classes. A this property of not concentrating all of its probability on and B are constructed as functions with fifty distinct values events of interest, it will be said to be deficient. Deficiency by dividing the English and French vocabularies each into is the price for the simplicity that permits Equation (85). fifty classes according to this method. The probability is Deficiency poses no Serious problem here. Although assumed to depend on the previous notion and on the Models 1 and 2 are not technically deficient, they are Surely identity of the French word being placed So as to account for Spiritually deficient. Each assigns the same probability to the Such facts as the appearance of adjectives before nouns in alignments (Je n'ai pas de Stylo I(1) do not(2,4) have(3)a(5) 25 English but after them in French. The displacement for the pen(6)) and (Je pas ai ne de Stylol (1) do not(2,4) have(3) head of notion i is denoted by j-G) . It may be either a(5) pen(6)), and, therefore, essentially the same probability positive or negative. The probability d(-1A(e), B(f)) is to the translations (Je n'ai pas de Stylo I do not have a pen) expected to be larger than d(+1A(e), B(f)) when e is an and (Je pasai ne de Stylo I do not have a pen). In each case, adjective and f is a noun. Indeed, this is borne out in the not produces two words, ne and pas, and in each case, one trained distortion probabilities for Model 4, where d(-1A of these words ends up in the second position of the French (governments), B (développement)) is 0.9269, while d(+ string and the other in the fourth position. The first transla 1A (government's), B (développement)) is 0.0022. tion should be much more probable than the Second, but this The placement of the k" word of notion i for i>0, k>1 defect is of little concern because while the System may be is done as follows. It is assumed that required to translate the first String Someday, it will almost 35 Surely not be required to translate the Second. The translation models are not used to predict French given English but Pr(III-ilatt'', JL', to, (bo', e)=d-1(j-Ji B(f)). (88) rather as a component of a System designed to predict English given French. They need only be accurate to within The position T is required to be greater than T-1. a constant factor over well-formed Strings of French words. 40 Some English words tend to produce a series of French 9.9 Model 4 words that belong together, while others tend to produce a Often the words in an English String constitute phrases Series of words that should be separate. For example, that are translated as units into French. Sometimes, a trans implemented can produce can produce misen application, lated phrase may appear at a Spot in the French String which usually occurs as a unit, but not can produce ne pas, different from that at which the corresponding English 45 which often occurs with an intervening verb. Thus d(2B phrase appears in the English String. The distortion prob (pas)) is expected to be relatively large compared with abilities of Model 3 do not account well for this tendency of d(2B(en)). After training, it is in fact true that d(2B phrases to move around as units. Movement of a long phrase (pas)) is 0.6936 and d(2B(en)) is 0.0522. will be much less likely than movement of a short phrase It is allowed to place to either before or after any because each word must be moved independently. In Model 50 previously positioned words, but Subsequent words from t 4, the treatment of Pr(II=jt, ', 'L', to", (bo', e) is are required to be placed in order. This does not mean that modified so as to alleviate this problem. Words that are they must occupy consecutive positions but only that the connected to the empty notion do not usually form phrases second word from t must lie to the right of the first, the and So Model 4 continues to assume that these words are third to the right of the Second, and So on. Because of this, Spread uniformly throughout the French String. 55 at most one of the (p arrangements of T is possible. AS has been described, an alignment resolves an English The count and reestimation formulae for Model 4 are String into a notional Scheme consisting of a set of possibly similar to those for the previous Models and will not be overlapping notions. Each of these notions then accounts for given explicitly here. The general formulae in Section 10 are one or more French words. In Model 3 the notional Scheme helpful in deriving these formulae. Once again, the Several for all alignment is determined by the fertilities of the words: 60 counts for a translation are expectations of various quantities a word is a notion if its fertility is greater than Zero. The over the possible alignments with the probability of each empty notion is a part of the notional Scheme if po is greater alignment computed from an earlier estimate of the param than Zero. Multi-word notions are excluded. Among the eters. AS with Model 3, these expectations are computed by one-word notions, there is a natural order corresponding to Sampling Some Small Set, S, of alignments. AS described the order in which they appear in the English String. Leti 65 above, the simple form that for the distortion probabilities in denote the position in the English string of the i' one-word Model 3, makes it possible to find b (a) rapidly for any a notion. The center of this notion, O, is defined to be the The analogue of Equation (85) for Model 4 is complicated 5,805,832 SS by the fact that when a French word is moved from notion of Models 2, 3 and 4. In fact, this is the raison d'etre of these to notion, the centers of two notions change, and the models. To keep them aware of the lay of the land, their contribution of several words is affected. It is nonetheless parameters are adjusted as the iterations of the EM proceSS possible to evaluate the adjusted likelihood incrementally, for Model 5 are performed. That is, counts for Models 2, 3, although it is Substantially more time consuming. and 4 are computed by Summing over alignments as deter Faced with this unpleasant situation, the following mined by the abbreviated S described above, using Model 5 method is employed. Let the neighbors of a be ranked So that to compute P(ale, f). Although this appears to increase the the first is the neighbor for which P(ae, f; 3) is greatest, the Storage necessary for maintaining counts as the training is Second the one for which P(ale, f; 3) is next greatest, and So on. Then b(a) is the highest ranking neighbor of a for which processed, the extra burden is Small because the overwhelm P(b)(a)le, f; 4) is at least as large as P(af, e; 4). Define ing majority of the storage is devoted to counts for t(fle), and b- (a) analogously. Here, P(ale, f; 3) means P(ale, f) as these are the same for Models 2, 3, 4, and 5. computed with Model 3, and P(ae, f; 4) means P(ae, f) as 9.11 Results computed with Model 4. Define S for Model 4 by A large collection of training data is used to estimate the 15 parameters of the five models described above. In one embodiment of these models, training data is obtained using the method described in detail in the paper, Aligning Sen tences in Parallel Corpora, by P. F. Brown, J. C. Lai, and R. This equation is identical to Equation (89) except that b has L. Mercer, appearing in the Proceedings of the 29th Annual been replaced by b. Meeting of the ASSociation for Computational Linguistics, 9.10 Model 5 June 1991. This paper is incorporated by reference herein. Like Model 3 before it, Model 4 is deficient. Not only can This method is applied to a large number of translations Several words lie on top of one another, but words can be from Several years of the proceedings of the Canadian placed before the first position or beyond the last position in parliament. From these translations, a training data Set is the French string. Model 5 removes this deficiency. 25 chosen comprising those pairs for which both the English After the words for t and t are placed, there Sentence and the French Sentence are thirty words or leSS in will remain Some vacant positions in the French String. length. This is a collection of 1,778,620 translations. In an Obviously, T should be placed in one of these vacancies. effort to eliminate Some of the typographical errors that Models 3 and 4 are deficient precisely because they fail to abound in the text, a English Vocabulary is chosen consisting enforce this constraint for the one-word notions. Let V(j, of all of those words that appear at least twice in English tril-1, t") be the number of vacancies up to and Sentences in the data, and as a French Vocabulary is chosen including position jjust before t is placed. This position consisting of all those words that appear at least twice in will be denoted simply as v. Two sets of distortion param French Sentences in the data. All other words are replaced eters are retained as in Model 4; they will again be referred with a special unknown English word or unknown French to as d and d. For i>0, is is required that 35 word according as they appear in an English Sentence or a French Sentence. In this way an English Vocabulary of 42,005 words and a French vocabulary of 58,016 words is obtained. Some typographical errors are quite frequent, for The number of vacancies up to is the same as the number 40 example, momento for memento, and So the Vocabularies are of vacancies up to j-1 only when j is not, itself, vacant. The not completely free of them. At the same time, Some words last factor, therefore, is 1 when j is vacant and 0 otherwise. are truly rare, and in Some cases, legitimate words are As with Model 4, d is allowed to depend on the center of omitted. Adding eo to the English Vocabulary brings it to the previous notion and on f, but is not allowed to depend 42,006 words. one since doing so would result in too many parameters. 45 Eleven iterations of the EM process are performed for this For i>0 and k>1, it is assumed that data. The process is initialized by Setting each of the 2,437,020,096 translation probabilities, t(fe), to 1/58016. That is, each of the 58,016 words in the French vocabulary Pr(II-ilt ik-1 s ali is assumed to be equally likely as a translation for each of 50 the 42,006 words in the English vocabulary. For t(fe) to be Again, the final factor enforces the constraint that T. greater than Zero at the maximum likelihood Solution for one land in a vacant position, and, again, it is assumed that the of the models, f and e must occur together in at least one of probability depends on f, only through its class. the translations in the training data,. This is the case for only The details of the count and reestimation formulae are 25,427,016 pairs, or about one percent of all translation similar to those for Model 4 and will not be explained 55 probabilities. On the average, then, each English word explicitly here. Section 10 contains useful equations for appears with about 605 French words. deriving these formulae. No incremental evaluation of the Table 6 Summarizes the training computation. At each likelihood of neighbors is possible with Model 5 because a iteration, the probabilities of the various alignments of each move or Swap may require wholesale recomputation of the translation using one model are computed, and the counts likelihood of an alignment. Therefore, when the expecta 60 using a Second, possibly different model are accumulated. tions for Model 5 are evaluated, only the alignments in Sas These are referred to in the table as the In model and the Out defined in Equation (89) are included. The set of alignments model, respectively. After each iteration, individual values included in the Sum is further trimmed by removing any are retrained only for those translation probabilities that alignment, a, for which P(ae, f; 4) is too much Smaller than Surpass a threshold; the remainder are set to the Small value P(b (V(fe; 2)e, f; 4). 65 (10'). This value is so small that it does not affect the Model 5 provides a powerful tool for aligning transla normalization conditions, but is large enough that translation tions. It's parameters are reliably estimated by making use probabilities can be resurrected during later 5,805,832 57 58 Models 2, 3, and 5 discount a connection of he to Il because TABLE 6 it is quite far away. None of the five models is Sophisticated enough to make this connection properly. On the other hand, A summary of the training iterations. while nodding, is connected only to signe by Models 1, 2, Iteration. In -> Out Survivors Alignments Perplexity 5 and 3 and oui, it is correctly connected to the entire phrase faire signe que Oui by Model 5. In the Second example, 2 - 1 12,017,609 71550.56 (Voyez les profits que ils ontréalisés Look at the profits they 2 2 - 2 12,160,475 2O2.99 3 2 - 2 9,403.220 89.41 have made), Models 1, 2, and 3 incorrectly connect profits 4 2 - 2 6,837,172 61.59 to both profits and réalsés, but with Model 5, profits is 5 2 - 2 5,303,312 49.77 correctly connected only to profits and made, is connected 6 2 - 2 4,397,172 46.36 to realisés 7. Finally, in (De les promesses, de les 7 2 - 3 3,841,470 45.15 8 3 - 5 2,057,033 291 124.28 promesses! Promises, promises.), Promises is connected to 9 5 - 5 1850,665 95 39.17 both instances of promeSSes with Model 1; promises is 1O 5 - 5 1,763,665 48 32.91 connected to most of the French sentence with Model 2; the 11 5 - 5 1,703,393 39 31.29 15 final punctuation of the English Sentence is connected to 12 5 - 5 1,658,364 33 30.65 both the exclamation point and, curiously, to des with Model 3. iterations. AS is apparent from columns 4 and 5, even though Only Model 5 produces a satisfying alignment of the two the threshold is lowered as iterations progreSS, fewer and Sentences. The orthography for the French Sentence in the fewer probabilities survive. By the final iteration, only Second example is VoyeZ leS profits qu'ils ontréalisés and in 1,620,287 probabilities survive, an average of about thirty the third example is DeS promeSSes, des promesses! Notice nine French words for each English word. that the e has been restored to the end of qu' and that des has AS has been described, when the In model is neither twice been analyzed into its constituents, de and les. These Model 1 nor Model 2, the counts are computed by Summing and other petty pseudographic improprieties are commited over only Some of the possible alignments. Many of these 25 in the interest of regularizing the French text. In all cases, alignments have a probability much Smaller than that of the orthographic French can be recovered by rule from the Viterbi alignment. The column headed Alignments in Table corrupted versions. 6 Shows the average number of alignments for which the Tables 8 through 17 show the translation probabilities and probability is within a factor of 25 of the probability of the fertilities after the final iteration of training for a number of Viterbi alignment in each iteration. AS this number drops, English words. All and only those probabilities that are the model concentrates more and more probability onto greater than 0.01 are shown. Some words, like nodding, in fewer and fewer alignments So that the Viterbi alignment Table 8, do not slip gracefully into French. Thus, there are becomes ever more dominant. translations like (Il fait signe que ouiHe is nodding), (Il fait The last column in the table shows the perplexity of the un signe de la teteHe is nodding), (Il fait un signe de téte French text given the English text for the In model of the 35 affirmatif.He is nodding), or (Il hoche la tete iteration. The likelihood of the training data is expected to affirmativement.He is nodding). As a result, nodding fre increase with each iteration. This likelihood cain be thought quently has a large fertility and Spreads its translation of as arising from a product of factors, one for each French probability over a variety of words. In French, what is worth word in the training data. There are 28,850,104 French Saying is worth Saying in many different ways. This is also words in the training data so the 28,850,104" root of the 40 seen with words like should, in Table 9, which rarely has a likelihood is the average factor by which the likelihood is fertility greater than one but Still produces many different reduced for each additional French word. The reciprocal of words, among them deVrait, deVraient, deVrions, doit, this root is the perplexity shown in the table. As the doivent, devons, and devrais. These are (just a fraction of the likelihood increases, the perplexity decreases. A Steady many) forms of the French verb devoir. Adjectives fare a decrease in perplexity is observed as the iterations progreSS 45 little better: national, in Table 10, almost never produces except when a Switch from Model 2 as the In model to more than one word and confines itself to one of nationale, Model 3 is made. This sudden jump is not because Model 3 national, nationaux, and nationales, respectively the is a poorer model than Model 2, but because Model 3 is feminine, the masculine, the masculine plural, and the deficient: the great majority of its probability is Squandered feminine plural of the corresponding French adjective. It is on objects that are not Strings of French words. AS has been 50 clear that the models would benefit from Some kind of explained, deficiency is not a problem. In the description of morphological processing to rein in the lexical exuberance Model 1, the P(me) was left unspecified. In quoting per of French. plexities for Models 1 and 2, it is assumed that the length of AS Seen from Table 11, the produces le, la, les, and 1" as the French String is Poisson with a mean that is a linear is expected. Its fertility is usually 1, but in Some situations function of the length of the English String. Specifically, it is 55 English prefers an article where French does not and So assumed that Pr(M=me)=(0,1)"e"/ml, with 2 equal to 1.09. about 14% of the time its fertility is 0. Sometimes, as with It is interesting to See how the Viterbi alignments change farmers, in Table 12, it is French that prefers the article. as the iterations progreSS. In Table 7, the Viterbi alignment When this happens, the English noun trains to produce its for Several Sentences after iterations 1, 6, 7, and 11 are translation displayed. These are the final iterations for Models 1, 2, 3, 60 and 5, respectively. In each example, a Subscripted in affixed TABLE 7 to each word to help in interpreting the list of numbers after each English word. In the first example, (Il me semble faire The progress of alignments with iteration. signe que ouilt seems to me that he is nodding), two Il me sembles faire signes ques Oui, interesting changes evolve over the course of the iterations. 65 In the alignment for Model 1, Il is correctly connected to he, It seems(3) to(4) me(2) that(6) he(1) is nodding(5,7) but in all later alignments Il is incorrectly connected to It. 5,805,832

TABLE 7-continued TABLE 11 The progress of alignments with iteration. Translation and fertility probabilities for the. the It(1) seems(3) to me(2) that he is nodding(5,7) 5 It(1) seems(3) to(4) me(2) that(6) he is nodding(5,7) f t(fe) op n(pe) It(1) seems(3) to me(2) that he is nodding(4,5,6,7) le O.497 1. O.746 VoyeZ1 les, profits que ilss onto realises, la 0.2O7 O O.254 les 0.155 Look(1) at the(2) profits(3,7) they(5) have(6) made 1O O.086 Look(1) at the(2,4) profits(3,7) they(5) have(6) made CC O.O18 Look(1) at the profits(3,7) they(5) have(6) made cette O.O11 Look(1) at the(2) profits(3) they(5) have(6) made(7) De les2 promessess a des less promesses 7 's Promises(3,7).(4) promises. (8) 15 TABLE 12 Promises,(4) promises(2,3,6,7).(8) Promises(3), (4) promises(7). (5.8) Translation and fertility probabilities for farmers. farmers Promises(2,3), (4) promises(6.7).(8) f t(fe) op n(pe) 2O agriculteurs O.442 2 O.731 TABLE 8 les O418 1. O.228 cultivateurs O.O46 O O.O39 Translation and fertility probabilities for nodding. producteurs O.O21 nodding 25 f t(fe) op n(pe) TABLE 13 signe O.164 4 O342 la O.123 3 O.293 Translation and fertility probabilities for external. tete 0.097 2 O.167 external Oui O.086 1. O.163 3O fait O.O73 O O.O23 f t(fe) op n(pe) que O.O73 hoche O.O54 extérieures O.944 1. O.967 hocher O.048 extérieur O.O15 O O.O28 faire O.O3O externe O.O11 le O.O24 extérieurs O.O1O approuve O.O19 35 qui O.O19 O.O12 faites O.O11 TABLE 1.4 Translation and fertility probabilities for answer. 40 aSWC TABLE 9 f t(fe) op n(pe) Translation and fertility probabilities for should. should réponse O.442 1. O.809 répondre O.233 2 O. 115 f t(fe) op n(pe) 45 répondu O.O41 O O.O74 a O.O38 devrait O.33O 1. O649 solution 0.027 devraient O.123 O O.336 répondez O.O21 devions O.109 2 O.O14 répondrai O.O16 faudrait O.O73 réponde O.O14 faut O.O58 50 y O.O13 doit O.O58 la O.O1O aurait O.O41 doivent O.O24 devons O.O17 devrais O.O13 TABLE 1.5 55 Translation and fertility probabilities for oil. oil TABLE 10 f t(fe) op n(pe) Translation and fertility probabilities for national. national 60 pétrole 0.558 1. O.760 pétrolières O.138 O O181 f t(fe) op n(pe) pétrolliere O.109 2 0.057 le O.O54 nationale O.469 1. O.905 pétrolier O.O3O national O.418 O O.O94 pétroliers O.O24 nationaux O.O54 huile O.O2O nationales O.O29 65 Oil O.O13 5,805,832 61 62 Singular future of donner, which means to give. It might be TABLE 16 better if both will and give were connected to donnerai, but by limiting notions to no more than one word, Model 5 Translation and fertility probabilities for former. precludes this possibility. Finally, the last twelve words of former the English Sentence, I now have the answer and will give f t(fe) op n(pe) it to the House, clearly correspond to the last Seven words of ancien 0.592 1. O866 the French sentence, je donnerai la réponse ala Chambre, anciens O.O92 O O.O74 but, literally, the French is I will give the answer to the eX O.O92 2 O.O60 House. There is nothing about now, have, and, or it, and each précèdent O.O54 of these words has fertility 0. Translations that are as far as O.O43 this from the literal are rather more the rule than the ancienne O.O18 éte O.O13 exception in the training data. 9.12 Better Translation Models Models 1 through 5 provide an effective means for 15 obtaining word-by-word alignments of translations, but as a TABLE 1.7 means to achieve the real goal of translation, there is room for improvement. Translation and fertility probabilities for not. not Two shortcomings are immediately evident. Firstly, because they ignore multi-word notions, these models are f t(fe) op n(ple) forced to give a false, or at least an unsatisfactory, account e O497 2 O.735 of Some features in many translations. Also, these models pas O442 O O.154 are deficient, either in fact, as with Models 3 and 4, or in O O.O29 1. O.107 spirit, as with Models 1, 2, and 5. rien O.O11 9.12.1 Deficiency 25 It has been argued above that neither Spiritual nor actual together with an article. Thus, farmers typically has a deficiency poses a Serious problem, but this is not entirely fertility 2 and usually produces either agriculteurs or les. true. Let w(e) be the sum of P(fe) over well-formed French Additional examples ate in included in Tables 13 through 17, Strings and let i(e) be the Sum over ill-formed French Strings. which show the translation and fertility probabilities for In a deficient model, w(e)+i(e)<1. In this case, the remainder external, answer, oil, former, and not. of the probability is concentrated on the event failure and so FIGS. 37, 38, and 39 show automatically derived align w(e)+i(e)+P(failuree)=1. Clearly, a model is deficient pre ments for three translations. In the terminology used above, cisely when P(failuree)>0. If P(failuree)=0, but i(e)>0, then each alignment is b(V(fle; 2)). It should be understood that the model is spiritually deficient. If W(e) were independent these alignments have been found by a process that involves of e, neither form of deficiency would pose a problem, but no explicit knowledge of either French or English. Every 35 because Models 1-5 have no long-term constraints, w(e) fact adduced to Support them has been discovered automati decreases exponentially with 1. When computing cally from the 1,778,620 translations that constitute the alignments, even this creates no problem because e and fare training data. This data, in turn, is the product of a proceSS known. However, for a given f, if the goal is to discover the the Sole linguistic input of which is a set of rules explaining most probable e, then the product P(e) P(fe) is too small for how to find Sentence boundaries in the two languages. It may 40 long English Strings as compared with short ones. AS a justifiably be argued, therefore, that these alignments are result, short English Strings are improperly favored over inherent in the Canadian Hansard data, itself. longer English Strings. This tendency is counteracted in part In the alignment shown in FIG. 37, all but one of the by the following modification: English words has fertility 1. The final prepositional phrase Replace P(fle) with c'P(fle) for some empirically chosen has been moved to the front of the French sentence, but 45 COnStant C. otherwise the translation is almost verbatim. Notice, This modification is treatment of the symptom rather than however, that the new proposal has been translated into les treatment of the disease itself, but it offerS Some temporary nouvelles propositions, demonstrating that number is not an relief. The cure lies in better modelling. invariant under translation. The empty notion has fertility 5 9.12.2 Multi-word Notions here. It generates ent, des, the comma, des, and des. 50 Models 1 through 5 all assign non-zero probability only to In FIG. 38, two of the English words have fertility 0, one alignments with notions containing no more than one word has fertility 2, and one, embattled, has fertility 5. Embattled each. Except in Models 4 and 5, the concept of a notion playS is another word, like nodding, that eludes the French grasp little rôle in the development. Even in these models, notions and comes with a panoply of multi-word translations. are determined implicitly by the fertilities of the words in the The final example, in FIG. 39, has several features that 55 alignment: words for which the fertility is greater than Zero bear comment. The Second Word, Speaker, is connected to make up one-word notions, those for which it is Zero do not. the Sequence l’Orateur. Like farmers above, it has trained to It is not hard to give a method for extending the generative produce both the word that one naturally thinks of as its process upon which Models 3, 4, and 5 are based to translation and the article in front of it. In the Hansard data, encompass multi-word notions. This method comprises the Speaker always has fertility 2 and produces equally often 60 following enhancements: 1 Orateur and le président. Later in the Sentence, Starred is Including a step for Selecting the notional Scheme; connected to the phrase marquées de un astérisque. From an AScribing fertilities to notions rather than to words, initial Situation in which each French word is equally Requiring that the fertility of each notion be greater than probable as a translation of Starred, the System has arrived, ZCO. through training, at a situation where it is able to connect to 65 In Equation (71), replacing the products over words in an just the right string of four words. Near the end of the English String with products over notions in the notional Sentence, give is connected to donnerai, the first perSon Scheme. 5,805,832 63 64 Extensions beyond one-word notions, however, requires -continued Some care. In an embodiment as described in the SubSection Results, an English String can contain any of 42,005 one 10.1 Summary of Notation word notions, but there are more than 1.7 billion possible C; average position in f of the words connected to two-word notions, more than 74 trillion three-word notions, 5 position i of e and So on. Clearly, it is necessary to be discriminating in i position in e of the i' one word notion C; Ci choosing potential multi-word notions. The caution dis P translation model P with parameter values 0 played by Models 1-5 in limiting consideration to notions C(fe) empirical distribution of a sample with fewer than two words is motivated primarily by respect p(Po) log-likelihood objective function for the featureless desert that multi-word notions offer a R(Ps, Po) relative objective function t(fe) translation probabilities (All Models) priori. The Viterbi alignments computed with Model 5 give e(ml) sentence length probabilities (Models 1 and 2) a frame of reference from which to develop methods that n(ple) fertility probabilities (Models 3, 4, and 5) expands horizons to multi-word notions. These methods PoP1 fertility probabilities for ea (Models 3, 4, and 5) involve: a(ii, l, m) aligminent probabilities (Model 2) d(ii, l, m) distortion probabilities (Model 3) Promoting a multi-word Sequence to notionhood when its 15 d, (AjA,B) distortion probabilities for the first word of translations as a unit, as observed in the Viterbi a tablet (Model 4) alignments, differ Substantially from what is expected on di (Aj|B) distortion probabilities for the other words of a tablet (Model 4) the basis of its component parts. d(l or rB.u.u') distortion probabilities for choosing left or right In English, for example, either a boat or a perSon can be movement (Model 5) left high and dry, but in French, unbateau is not left haut et di (AjB.u.) distortion probabilities for leftward movement Sec, nor une perSonne haute et Seche. Rather, a boat is left of the first word of a tablet (Model 5) échoué and a perSon en plan. High and dry, therefore, is a d(AjB.u.) distortion probabilities for rightward movement of the first word of a tablet (Model 5) promising three-word notion because its translation is not d(AjB.u.) distortion probabilities for movement of the compositional. other words of a tablet (Model 5) 25 10 Mathematical Summary of Translation Models 10.2 Model 1 Parameters e(ml) string length probabilities 10.1 Summary of Notation t(fe) translation probabilities e English vocabulary Here feF; eeB or e=eo, l=1, 2, . . . ; and m=1,2,. . . . e English word General Formula e English sentence l length of e i position in e, i = 0, 1,..., l Pa(f, ale)=P(me)P(alm, e) P(fa, m, e) (92) e; word i of e 35 éo the empty notion ASSumptions F French vocabulary f French word Pe(ne) = e(ml) (93) f French sentence length off P(am, e) = (l + 1)" (94) i position in f, i = 1,2,... in 40 f word i off Pa(?a, m, e) = II tfea) (95) f fif. . . . f. j=1 . " 3. alignment Cl position in econnected to position i off for This model is not deficient. alignment a Generation C. C1 C-2 . . . Cli 45 (p; number of positions off connected to position i of e Equations (92)–(95) describe the following process for (p." (p19) . . . pi producing f from e: t tableau - a sequence of tablets, where a tablet is a 1. Choose a length m for f according to the probability sequence of French words T tablet i oft distribution e(ml). to Tot1 . . . ti 50 2. For each j=1,2,..., m, choose C, from 0, 1, 2, ... 1 (pi length of T according to the uniform distribution. k position within a tablet, k = 1,2,.... (p. 3. For each j=1,2,..., m, choose a French word?, according tik word k of T 3. a permutation of the positions of a tableau Usefulto the Formulae distribution tCEle). Jik position in f for word k of T for permutation J. JT, Jillia . . . Jik 55 Because of the independence assumptions (93)–(95), the V(fe) Viterbi alignment for (fe) Sum over alignments (26) can be calculated in closed form: V. (fle) Viterbi alignment for (fe) with i pegged N(a) neighboring alignments of a N(a) neighboring alignments of a with i pegged Pa(fle) = Pact ale) (96) b(a) alignment in N(a) with greatest probability b°(a) alignment obtained by applying b repeatedly to a b. (a) alignment in N(a) with greatest probability 60 -= e(ml)(ml) (l(i+1) + 1) a 1-0$ ... a-0 j-1fitti, t(flea) (97) b. *(a) alignment obtained by applying b, repeatedly to a A(e) class of English word e B(f) class of French wordf in i (98) Aj displacement of a word in f e(ml) (l + 1) m j=1II, i=0X, tCfile). u,' vacancies in f 9i first position in e to the left of i that has 65 Equation (98) is useful in computations since it involves non-zero fertility only O(lm) arithmetic operations, whereas the original Sum over alignments (97) involves O(1") operations. 5,805,832 65 66 Concavity The objective function (25) for this model is a strictly Y(i,(i, j, f, e continued (108)108 concave function of the parameters. In fact, from Equations pa(ii, fe) --S, ?o (25) and (98), with 5 p(Pe) = S C(fe) S. log S. tc?le) + S. C(fe)loge(ml) + constant (99) Y(i, j, f, e) = t fe)a(ii, l, m). fe j=1 i=0 ,e Viterbi Alignment which is clearly concave in e(ml) and t(fle) since the logarithm of a Sum is concave, and the Sum of concave For this model, and thus also for Model 1, the Viterbi functions is concave. alignment V(fle) between a pair of Strings (f, e) can be Because p is concave, it has a unique local maximum. expressed in closed form: Moreover, this maximum can be found using the method of FIG. 28, provided that none of the initial parameter values (109) is Zero. 15 10.3 Model 2 Parameters Parameter Reestimation Formulae e(m1) string length probabilities The parameter values 0 that maximize the relative objec t(fe) translation probabilities C(ii, l, m) alignment probabilities tive function R(P. Pe) can be found by applying the Here i=0,..., l; and j=1, . . . m. considerations of Subsection 8.3.4. The counts c(co, a, f, e) General Formula of Equation (38) are

Pa(f, ale)=P(me)Pa(alm, e)Pa(fa, m, e) (100) 25 ASSumptions c(ii, l, m, a, fe) = 8(i, a). (111) The parameter reestimation formulae for t(fle) and C(ii, l, P(me) = e(ml) (101) m) are obtained by using these counts in Equations (42)- (46). Pe(alm, e) = II a(ali, l, m) (102) 30 j=1 Equation (44) requires a Sum over alignments. If P (103) Satisfies

This model is not deficient. Model 1 is the special case of 35 Pa?e-final fe). (112) this model in which the alignment probabilities are uniform: C(ii, l, m)=(1+1) for all i. as is the case for Models 1 and 2 (see Equation (107)), then Generation this Sum can be calculated explicitly: Equations (100)-(103) describe the following process for producing f from e: 40 c6(fle;- f, e) = X,l Y,in pa(ii,- fe)ö(e, e.)ö(ff), (113) 1. Choose a length m for f according to the distribution i=0 j=1 e(ml). 2. For each j=1,2,..., m, choose C, from 0, 1, 2, ... 1 ca(ii; fe) = pa(ii, fe). (114) according to the distribution C(Cli, l, m). 3. For each j, choose a French word f, according to the 45 Equations (110)-(114) involve only O(lm) arithmetic distribution t(f.e...). operations, whereas the Sum over alignments involves O(l") Useful Formulae operations. Just as for Model 1, the independence assumptions allow 10.4 Model 3 us to calculate the Sum over alignments (26) in closed form: Parameters 50 t(fe) translation probabilities Pa(fle) = Pact ale) (104) n(pe) fertility probabilities po, p fertility probabilities for eo l in (105) XE II t(flea)a(ali l, m) d(i, l, m) distortion probabilities a 1=0 55 Here (p=0, 1, 2, . . . . in i (106) General Formulae = e(ml) j=1II, i=0X, tCfile)a(ii, l, m). By assumption (102) the connections of a are independent P(t, Jile) = P(ple) P(top, e) P(th, p, e) (115) given the length m of f. From Equation (106) it follows that 60 (116) they are also independent given f: Pa(f, ale) = Pa(t, Je) Here (f, a) is the set of all (t, t) consistent with (f, a):

where 65 (T, J)e(f, a) if for all i=0, 1,..., l and k=1, 2, ... P. f. -tik and C.i. (117) 5,805,832 67 68 ASSumptions considerations of Section 8.3.4. The counts c(co, a, f, e) of Equation (38) are (118) Pa(ple) = n() i=1X. o) i=1II n(p;le;) i (p; (119) Pe(rkp, e) = . t(title) 1 P. (120) Pa(ht, (p, e) : cpol I d(Lili, l, m) The parameter reestimation formulae for t(fe), C(ii, l, m), where and t(ple) are obtained by using these counts in Equations 121 (42)-(46). no(pom") = ( )er pit. (121) Equation (44) requires a Sum over alignments. If Po In Equation (120) the factor of 1/(po! accounts for the choices 15 Satisfies of JLo, k=1, 2, . . . , (po. This model is deficient, Since (122) Pa?e-final fe). (130) Pe(failuret, p, e) = 1 - S. Po(th, p, e) > 0. 3. as is the case for Models 1 and 2 (see Equation (107)), then Generation this sum can be calculated explicitly for c(fle; f, e) and Equations (115)-(120) describe the following process for cGili, f, e): producing for failure from e: - l in - (131) 1. For each i=1,2,..., l, choose a length (p for t, according c6(fle; f, e) = i=0X, j=1Y, pa(ii, fe)ö(e, e.)ö(ff), to the distribution n(ple). 25 2. Choose a length (po for to according to the distribution ?s(ii; f, e) = pa(ii, fe). (132) no(poly-'p). . Let m=(pot).-'p, Unfortunately, there is no analogous formula for c(ple; f, . For each i-0, 1, . . . , 1 and each k=1, 2, . . . , p, choose e). Instead, the following formulae hold: a French word T. according to the distribution totle). 5. For each i=1, 2, . . . 1 and each k=1, 2, . . . , p, choose a ?ala. . . . (P. Cli (f, e)Yk. (133) position to from 1, ..., m according to the distribution cape fe-2, see II (1-poi?e),..., -i-. d(ti, l, m). 6. If any position has been chosen more than once then aste) -- j=1SB (, ey, (134) return failure. 35 7. For each k=1,2,..., (po, choose a position to from the P., e--All-- . . . (135)135 (po-k+1 remaining Vacant positions in 1, 2, . . . , m 1- pa(ii, fe) according to the uniform distribution. 8. Let f be the string with f=t. In Equation (133), T denotes the set of all partitions of 40 (p. Useful Formulae Recall that a partition of p is a decomposition of p as a From Equations (118)-(120) it follows that if (t, t) are Sum of positive integers. For example, (p=5 has 7 partitions consistent with (f, a) then since 1+1+1+1+1=1+1+1+2=1+1+3=1+2+2=1+4=2+3=5. For a partition Y, let Y be the number of times that kappears Pe(rkp, e) = II trea), (123) 45 in the Sum, so that (p=X-'kyk. If Y is the partition corre j=1 g sponding to 1+1+3, then Y=2, Y=1, and Y=0 for k other 1. (124) than 1 or 3. Here the convention that Yo consists of the single Pe(th, p, e) = cpol E, dile, i. m) element Y with Y=0 for all k has been adopted. Equation (133) allows us to compute the counts c(ple; f, In Equation (124), the product runs over all j=1,2,..., 50 e) in O(lm+(pg) operations, where g is the number of m except those for which C-0. By Summing over all pairs partitions of (p. Although g grows with p like (4V3p))' exp (t, t) consistent with (f, a) it follows that TV2p/3, it is manageably Small for Small (p. For example, (p=10 has 42 partitions. (125) Pa(f, ale) = (raio, a) Pa(t, Je) Proof of Formula (133) 55 Introduce the generating functional = 0 ( (bols,$ p. ) 2,$n?ole: n(elepp's, $ ifle)it? (126) Gole fe-, eacole ?er, (136) where X is an indeterminant. Then jiaiz0X dilai, l, m) 60 oo il l in - (137) The factors of (p, in Equation (126) arise because there are G(xle,(xle, fe)f, e) = (p=0X. a 1=0X...... X II Peipa(ali, fe),f, e) X o?e,8(e, epoch,ei)ö(p, p)pi)x II'p, equally probable terms in the Sum (125). l in - (138) Parameter Reestimation Formulae 65 = S, 6(e, e.) S. ... S. II pa(ali, fe)xi The parameter values 0 that maximize the relative objec i=1 a 1=0 an=Oji=1 tive function R(P. Pe) can be found by applying the 5,805,832 69 70 -continued General Formulae : i=1$ 8(,e, ei) isto ... go.ifical "ali, f,feelop e)xo-?i " Pa(T, Jule) = P(ple) P(top, e) P(th, p, e) (148) In l - (140) (149) = X 8(e, e.) II X pa(ali, f, e)xi?) 5 i=1 j=1 a-0 Pa(f, ale) = (taiga)" Je) : , Ö(e, e.) I (1 - pa(ii,- . . . fe) + xp6(ii,- . . . fe)) (141) ASSumptions = j=

142 = X. 1. Ö(e, e), (1 - pa(ii, fe)) (1 + fi;(fe)x). (142) Pa(ple) = n() i=1 o) i=1f n(ple) (150) i (p; (151) To obtain Equation (138), rearrange the order of Summa Pe(rkp, e) = . t(title) tion and Sum over (p to eliminate the 6-function of (p. To obtain Equation (139), note that p,-X,"8(i, C.) and So x-II"x8 D. To obtain Equation (140), interchange the 15 Pa(alt, p, e) = - f inct (152) order of the Sums on C, and the product on j. To obtain (Po! is k-1 Equation (141), note that in the Sum on C, the only term for where which the power of X is nonzero is the one for which CL=i. Now note that for any indeterminants x, y, y, . . . y, no(pom") = () print, (153) d(j- cpA(epi), B(i)) if k = 1 (154) fava-S1 + vix) = s h is s (143) j=1 (1 + yix) do yeTk 1 Y! pik(j) = {C. if k > 1 where z = (l- S. (y)*. (144) 25 In Equation (154), p. is the first position to the left of i for which p>0, and c is the ceiling of the average position of This follows from the calculation the words of t. (155) X (1 + yix) = exps, log(1+y;x) = (145) j=1 j=1 oi = {i":i': {p}p. > 0} s Cp : (pp. -1 k=1Y Jpk expS. S. -1POP k+1(y,xk j=1 k=1 k This model is deficient, Since

ce ce 1. ce = exp X zix = x - X zixk 146 (156) P- -- in (i. R ) (146) 35 Pe(failuret, p, e) = 1 - S. Po(th, p, e) > 0. 3. Note that Equations (150), (151), and (153) are identical to nO n! i: (... ) (147) the corresponding formulae (118), (119), and (121) for Model 3. 40 Generation ce o zik Equations (148)-(152) describe the following process for (p=0Six yeTS k=1S Yk producing for failure from e: 1-4. Choose a tableau t by following Steps 1-4 for Model 3. The reader will notice that the left-hand side of Equation 45 5. For each i-1, 2, . . . , 1 and each k=1, 2, . . . , p, choose (145) involves only powers of X up to m, while the Subse a position to as follows. quent Equations (146)–(147) involve all powers of X. This is because the Z are not algebraically independent. In fact, for If k=1 then choose to according to the distribution (p>m, the coefficient of x' on the right-hand side of Equation d (1,1-coA(ee), B(t)). (147) must be Zero. It follows that Z, can be expressed as a 50 If k>1 then choose to greater than at according to the polynomial in Z, k=1, 2, . . . , m. distribution d(t-T-B(t)). Equation (143) can be used to identify the coefficient of 6.-8. Finish generating f by following Steps 6-8 for Model x" in Equation (142). Equation (133) follows by combining 3. Equations (142), (143), and the Definitions (134)-(136) and 10.6 Model 5 55 Parameters (144). t(fe) translation probabilities 10.5 Model 4 n(pe) fertility probabilities Parameters po, p fertility probabilities for eo t(fe) translation probabilities d, (A, B, u) distortion probabilities for leftward movement of n(pe) fertility probabilities 60 the first word of a tablet po, p fertility probabilities for eo d(AIB, u) distortion probabilities for rightward movement d,(AIA, B) distortion probabilities for movement of the first of the first word of a tablet word of a tablet d (1 or rB, u, u') distortion probabilities for choosing d(A|B) distortion probabilities for movement of other left or right movement words of a tablet 65 d(A,B, u) distortion probabilities for movement of other Here A, is an integer; A is an English class; and B is a French words of a tablet class. Here u, u'=1, 2, . . . , m; and 1 or r=left or right. 5,805,832 71 72 General Formulae d-1(U(t)6.-8. Finish -U(ti-)|B(t), generating f by following l(m)-U(1,-1)). Steps 6-8 for Model Pa(T, Jule) = P(ple) P(top, e) P(th, p, e) (157) 3. (158) 11 Sense Disambiguation Pa(f, ale) = (raica) Pa(t, Je) 11.1 Introduction An alluring aspect of the Statistical approach to machine ASSumptions translation is the Systematic frame-work it provides for attacking the problem of lexical disambiguation. For (159) example, an embodiment of the machine translation System Pa(ple) = n() i=1X. o) i=1II n(p;le;) depicted in FIG. 4 translates the French sentence Je vais prendre la décision as I will make the decision, correctly i (p; (160) interpreting prendre as make. Its Statistical translation Pe(rkp, e) = . t(title) model, which Supplies English translations of French words, prefers the more common translation take, but its trigram 1 P. (161) 15 language model recognizes that the three-word Sequence Pe(th, p, e) = cpol I pik (lik) make the decision is much more probable than take the decision. where This System is not always So Successful. It incorrectly renders Je vais prendre ma propre décision as I will take my 162 own decision. Its language model does not realize that take no(pom") = ( o )er pit, (162) my own decision is improbable because take and decision no longer fall within a single trigram. (163) ErrorS Such as this are common because the Statistical dLR(right|B(T,1), Vi1(cpi), vi1(n)) (164) models of this System only capture local phenomena; if the 25 context necessary to determine a translation falls outside the dr(vi.1(j) - V1(cp)B(T,1), vii (m) - V1(cp) Scope of these models, a word is likely to be translated incorrectly. However, if the relevant context is encoded if k = 1 and je cpi locally, a word can be translated correctly. AS has been noted in Section 3, Such encoding can be performed in the Source-transduction phase 701 by a Sense labeling transducer. In this Section, the operation and construction of Such a if k = 1 and j < cpi Sense-labeling transducer is described. The goal of this transducer is to perform croSS-lingual word-sense labeling. d-1(vii.(i) - Vik (Tik-1)|B(ti), vi (m) - Vik(Tik-1)) 35 That is, this transducer labels words of a Sentence in a Source if k > 1 language So as to elucidate their translations into a target In Equatbn (164), p. is the first position to the left of i language. Such a transducer can also be used to label the which has a non-zero fertility; and c is the ceiling of the words of an target Sentence So as to elucidate their transla average position of the words of tablet p (see Equation tions into a Source language. (155)). Also, e() is 1 if position j is vacant after all the 40 The design of this transducer is motivated by Some words of tablets i'1 then choose a vacant position JL greater than AS another example, in Il doute que les notres gagnent It according to the distribution (He doubts that we will win), the word il should be translated 5,805,832 73 74 as he. On the other hand, if doute is replaced by faut then il 4201. For the word being considered, finding a good ques should be translated as it: Il faut que les notres gagnent (It tion for each of a plurality of informant sites. These is necessary that we win). Here, a Sense label can be informant sites are obtained from a table 4207 Stored in assigned to il by asking about the identity of the first verb to memory of possible informant sites. Possible sites include its right. but are not limited to, the nouns to the right and left, the These examples motivate a Sense-labeling Scheme in verbs to the right and left, the words to the right and left, which the label of a word is determined by a question about the words two positions to the right or left, etc. A method an informant word in its context. In the first example, the of finding a good question is described below. This informant of prendre is the first noun to the right; in the method makes use of a table 4205 stored in memory second example, the informant of it is the first verb to the probabilities derived from Viterbi alignments. These right. The first example is depicted in FIG. 40. The two probabilities are also discussed below. sequence of this example are labeled 4001 and 4002 in the 4202. Storing in a table 4206 of informants and questions, figure. the informant Site and the good question. If more than two Senses are desired for a word, then A method for carrying out the Step 4201 for finding a questions with more than two answers can be considered. good question for each informant site of a Vocabulary word 11.2 Design of a Sense-Labeling Transducer 15 is depicted in FIG. 43. This method comprises the steps of: FIG. 41 depicts an embodiment of a Sense-labeling trans 4301. Performing the Steps 4302 and 4207 for each possible ducer based on this Scheme. For expositional purposes, this informant site. Possible informant sites are obtained from embodiment will be discussed in the context of labeling a table 4304 of Such sites. words in a Source language word Sequence. It should be 4302. For the informant site being considered, finding a understood that in other embodiments, a Sense-labeling question about informant words at this Site that provides transducer can accept as input more involved Source a lot of information about translations of the vocabulary Structure representations, including but not limited to, lexi word into the target language. cal morpheme Sequences. In these embodiments, the con 4207. Storing the vocabulary word, the informant site, and Stituents of a representation of a Source word Sequence are the good question in a table 4103. annotated by the Sense-labeling transducer with labels that 25 11.4 Mathematics of Constructing Questions elucidate their translation into analogous constituents into a A method for carrying out the Step 4302 of finding a Similar representation of a target word Sequence. It should question about an informant is depicted in FIG. 44. In this also be understood that in Still other embodiments, a Sense SubSection, the Some preliminaries for describing this labelling transducer annotates, target-Structure method are given. The notation used here is the same as that representations, (not Source-structure representations) with used in Sections 8-10. Sense labels. 11.4.1 Statistical Translation with Transductions The operation of the embodiment of the sense-labeling Recall the setup depicted in FIG. 2. A system shown there transducer depicted in FIG. 41 comprises the steps of: employs 4101. Capturing an input Sequence consisting of a Sequence 1. A Source transducer 201 which encodes a Source Sentence of words in a Source language; 35 f into an intermediate Structure f. 4102. For each word in the input sequence performing the 2. A statistical translation component 202 which translates f Steps 4107, 4108, 4109, 4104 until no more words are into a corresponding intermediate target Structure e'. This available in Step 4105; component incorporates a language model, a translation 4107. For the input word being considered, finding a best model, and a decoder. informant Site Such as the noun to the right, the verb to the 40 3. A target transducer 203 which reconstructs a target left, etc. A best informant for a word is obtained using a Sentence e from e". table 4103 stored in memory of informants and questions For Statistical modeling, the target-transduction transfor about the informants for every word in the source lan mation 203 e'->e is sometimes required to be invertible. guage Vocabulary; Then e' can be constructed from e and no information is lost 4108. Finding the word at the best informant site in the input 45 in the transformation. word Sequence; The purpose of Source and target transduction is to 4109. Obtaining the class of the informant word as given by facilitate the task of the statistical translation. This will be the answer to a question about the informant word; accomplished if the probability distribution Pr(f, e") is easier 4104. Labeling the original input word of the input Sequence to model then the original distribution Pr(f, e). In practice with the class of the informant word. 50 this means that e' and f should encode global linguistic facts For the example depicted in FIG. 40, the informant site about e and f in a local form. determined by Step 4107 is the noun to the right. For the first A useful gauge of the Success that a Statistical model word sequence 4001 of this example, the informant word enjoys in modeling translation from Sequences of Source determined by Step 4108 is décision; for the second word words represented by a random variable F, to Sequences of sequence 4109, the informant word is voiture. In this 55 target words represented by a random variable E, is the croSS example, the class of désion determined in Step 4109 is entropy different than the class of voiture. Thus the label attached to In this equation and in the reminder of this section, the convention of using prendre by Step 4104 is different for these two contexts of uppercase letters (e.g. E) for random variables and lower case letters (e.g. e) prendre. for the values of random variables continues to be used. 11.3 Constructing a table of informants and questions 60 An important component of the Sense-labeler depicted in H(EIF)= -- slog P(ef). (165) FIG. 4101 is a table 4103 of informants and questions for every word in a Source language Vocabulary. The croSS entropy measures the average uncertainty that FIG. 42 depicts a method of constructing such a table. the model has about the target language translation e of a This method comprises the Steps of: 65 Source language sequence f. Here P(ef) is the probability 4203. Performing the Steps 42.01 and 4202 for each word in according to the model that e is a translation off and the Sum a Source language Vocabulary. runs over a collection of all Spairs of Sentences in a large 5,805,832 75 76 corpus comprised of pairs of Sentences with each pair consisting of a Source and target Sentence which are trans 1. p(cp, e) = 1. c(cp, e)." (168) lations of one another (See Sections 8-10). o o Abetter model has leSS uncertainty and thus a lower croSS entropy. The utility of Source and target transductions can be In these equations and in the remainder of the paper, the generic symbol measured in terms of this croSS entropy. Thus transforma 1/norm is used to denote a normalizing factor that converts counts to tions f->f and e'–se are useful if models P(fe") and P(e') probabilities. The actual value of 1/norm will be implicit from the context. can be constructed such that H(EF")e maps a labeled Sentence In this subsection the cross entropies H(EF) and H(EF") to a sentence in which the labels have been removed. are expressed in terms of the information between Source The probability models. A translation model such as one and target words. of the models in Sections 8-10 is used for both P'(FE) and In the Viterbi approximation, the cross entropy H(FE) is for P(FE). A trigram language model Such as that discussed given by, in Section 6 is used for both P(E) and P(E). 11.4.3 The Viterbi Approximation The probability P(fle) computed by the translation model requires a Sum over alignments as discussed in detail in 35 where m is the average length of the Source Sentences in the Sections 8-10. This sum is often too expensive to compute training data, and H(FE) and H(dE) are the conditional directly since the number of alignments increases exponen entropies for the probability distributions p(f, e) and p(cp, e): tially with Sentence length. In the mathematical consider ations of this Section, this sum will be approximated by the (171) Single term corresponding to the alignment, V(fe), with 40 greatest probability. This is the Viterbi approximation already discussed in Sections 8-10 and V(fe) is the Viterbi H(dE) = - S. p(cp, e)logp(ple). alignment. e, Let c(fe) be the expected number of times that e is aligned with fin the Viterbi alignment of a pair of sentences drawn 45 A similar expression for the cross entropy H(EF) will at random from a large corpus of training data. Let c((ple) be now be given. Since the expected number of times that e is aligned with p words. Then cle) = c(re; V(fole)) (166) 50 this croSS entropy depends on both the translation model, P(fe), and the language model, P(e). With a suitable addi co, e) -- coe, Vcoe). tional approximation, 55 (172) where c(fe; V) is the number of times that e is aligned with where H(E) is the cross entropy of P(E) and I(F, E) is the fin the alignment A, and c((ple; V) is the number of times that mutual information between f and e for the probability e generates (p target words in A. The counts above are also distribution p(f, e). expressible as averages with respect to the model: The additional approximation required is, 60 cle)-; Piece vie) (167) (173) H(F) as mH(F) = -m pologp(0. c(ple) -i. P(f, e)c(ple, V(?e)). where p(f) is the marginal of p(f, e). This amounts to 65 approximating P(f) by the unigram distribution that is clos Probability distributions p(e, f) and p(cp, e) are obtained by est to it in cross entropy. Granting this, formula (172) is a normalizing the counts c(fle) and c((ple): consequence of (170) and of the identities

5,805,832 79 80 Sentence in one corpora is translated as two or more Sen constructed by one skilled in the art. Generally, the Sentences tences in the other corpora. At other times a Sentence, or produced by Such a Sentence detector conform to the grade even a whole passage, may be missing from one or the other School notion of Sentence: they begin with a capital letter, of the corpora. contain a verb, and end with Some type of Sentence-final A number of researchers have developed methods that punctuation. Occasionally, they fall short of this ideal and align Sentences according to the words that they contain. consist merely of fragments and other groupings of words. (See for example, Deriving translation data from bilingual In the Hansard example, the English corpus contains text, by R. CatizOne, G. Russel, and S. Warwick, appearing 85,016,286 tokens in 3,510,744 sentences, and the French in Proceedings of the First International Acquisition corpus contains 97.857,452 tokens in 3,690,425 sentences. Workshop, Detroit, Mi., 1989; and “Making Connections”, The average English Sentence has 24.2 tokens, while the by M. Kay, appearing in ACH/ALLC 91, Tempe, Ariz., average French sentence is about 9.5% longer with 26.5 1991.) Unfortunately, these methods are necessarily slow tokens. The left-hand side of FIG. 48 shows the raw data for and, despite the potential for high accuracy, may be unsuit a portion of the English corpus, and the right-hand Side able for very large collections of text. shows the same portion after it was cleaned, tokenized, and In contrast, the method described here makes no use of 15 divided into Sentences. The Sentence numbers do not lexical details of the corpora. Rather, the only information advance regularly because the Sample has been edited in that it uses, besides optional information concerning anchor order to display a variety of phenomena. points, is the lengths of the Sentences of the corpora. As a 12.3 Selecting Anchor Points result, the method is very fast and therefore practical for The selection of suitable anchor points (Step 4701) is a application to very large collections of text. corpus-specific task. Some corpora may not contain any The method was used to align Several million Sentences reasonable anchors. from parallel French and English corpora derived from the In the Hansard example, Suitable anchors are Supplied by proceedings of the Canadian Parliament. The accuracy of various reference markers that appear in the transcripts. these alignments was in excess of 99% for a randomly These include Session numbers, names of Speakers, time selected set of 1000 alignments that were checked by hand. 25 Stamps, question numbers, and indications of the original The correlation between the lengths of aligned Sentences language in which each Speech was delivered. This auxiliary indicates that the method would achieve an accuracy of information is retained in the tokenized corpus in the form between 96% and 97% even without the benefit of anchor of comments Sprinkled throughout the text. Each comment points. This Suggests that the method is applicable to a very has the form \SCM{ }. . . \ECM{ } as shown on the wide variety of parallel corpora for which anchor points are right-hand side of FIG. 48. not available. To Supplement the comments which appear explicitly in 12.1. Overview the transcripts, a number of additional comments were One embodiment of the method is illustrated Schemati added. Paragraph comments were inserted as Suggested by cally in FIG. 46. It comprises the steps of: the Space command of the original markup language. An 4601 and 4602. Tokenizing the text of each corpus. 35 example of this command appears in the eighth line on the 4603 and 4604. Determining sentence boundaries in each left-hand side of FIG. 48. The beginning of a parliamentary corpuS. Session was marked by a Document comment, as illustrated 4605. Determing alignments between the sentences of the in Sentence 1 on the right-hand side of FIG. 48. Usually, two corpora. when a member addresses the parliament, his name is The basic step 4605 of determining sentence alignments 40 recorded, This was encoded as an Author comment, an is elaborated further in FIG. 47. It comprises the steps of: example of which appears in Sentence 4. If the president 4701. Finding major and minor anchor points in each Speaks, he is referred to in the English corpus as Mr. Speaker corpus. This divides each corpus into Sections between and in the French corpus as major anchor points, and SubSections between minor anchor points. 45 TABLE 1.8 4702. Determining alignments between major anchor points. 4703. Retaining only those aligned sections for which the Examples of comments the number of SubSections is the Same in both corpora. English French 4704. Determining alignments between sentences within Source = Traduction 50 Source = English each of the remaining aligned SubSections. Source = Translation Source = Francais One embodiment of the method will now be explained. Source = Text Source = Texte The various Steps will be illustrated using as an example the Source = List Item Source = List Item aforementioned parallel French and English corpora derived Source = Question Source = Question from the Canadian Parliamentary proceedings. These Source = Answer Source = Response proceedings, called the Hansards, are transcribed in both 55 English and French. The corpora of the example consists of M. le Président. If several members speak at once, a the English and French Hansard transcripts for the years Shockingly regular occurrence, they are referred to as Some 1973 through 1986. Hon. Members in the English and as Des Voix in the French. It is understood that the method and techniques illustrated Times are recorded either as exact times on a 24-hour basis in this embodiment and the Hansard example can easily be 60 as in Sentence 81, or as inexact times of which there are two extended and adapted to other corpora and other languages. forms: Time=Later, and Time=Recess. These were encoded 12.2 Tokenization and Sentence Detection in the French as Time=Plus Tard and Time=Recess. Other First, the corpora are tokenized (steps 4601 and 4602) types of comments are shown in Table 18. using a finite-state tokenizer of the Sort described in Sub The resulting comments laced throughout the text are section 3.2.1. Next, (steps 4602 and 4603), the corpora are 65 used as anchor points for the alignment process. The com partitioned into Sentences using a finite State Sentence ments Author=Mr. Speaker, Author=M. le Président, boundary detector. Such a Sentence detector can easily be Author=Some Hon. Members, and Author=Des Voix are 5,805,832 81 82 deemed minor anchors. All other comments are deemed between the corresponding Sentences. Each grouping is major anchors with the exception of the Paragraph comment called a bead. The example consists of an ef-bead followed which was not treated as an anchor at all. The minor anchors by an eff-bead followed by an e-bead followed by a T are much more common than any particular major anchor, bead. From this perspective, an alignment is simply a making an alignment based on minor anchors much leSS Sequence of beads that accounts for the observed Sequences robust against deletions than one based on the major of Sentence lengths and paragraph markers. The model anchors. assumes that the lengths of Sentences have been generated 12.4 Aligning Major Anchors by a pair of random processes, the first producing a Sequence Major anchors, if they are to be useful, will usually appear of beads and the Second choosing the lengths of the Sen in parallel in the two corpora. Sometimes, however, through tences in each bead. inattention on the part of translators or other misadventure, The length of a Sentence can be expressed in terms of the an anchor in one corpus may be garbled or omitted in number of tokens in the Sentence, the number of characters another. In the Hansard example, for instance, this is prob in the Sentence, or any other reasonable measure. In the lem is not uncommon for anchors based upon names of Hansard example, lengths were measured as numbers of Speakers. 15 tokens. The major anchors of two corpora are aligned (Step 4702) The generation of beads is modelled by the two-state by the following method. First, each connection of an Markov model shown in FIG. 50. The allowed beads are alignment is assigned a numerical cost that favors exact shown in FIG. 51. A Single Sentence in one corpus is matches and penalizes omissions or garbled matches. In the assumed to line up with Zero, one, or two Sentences in the Hansard example, these costs were chosen to be integers other corpus. The probabilities of the different cases are between 0 and 10. Connections between corresponding pairs assumed to satisfy Pr(e)=Pr(f), Pr(eff)=Pr(eef), and Pr(I)= Such as Time=Later and Time=Plus Tard, were assigned a Pr(T). cost of 0, while connections between different pairS Such as The generation of Sentence lengths given beads is mod Time=Later and Author=Mr. Bateman were assigned a cost eled as follows. The probability of an English sentence of of 10. A deletion is assigned a cost of 5. A connection 25 length 1 given an e-bead is assumed to be the same as the between two names was assigned a cost proportional to the probability of an English Sentence of length 1 in the text as minimal number of insertions, deletions, and Substitutions a whole. This probability is denoted by Pr(l). Similarly, the necessary to transform one name, letter by letter, into the probability of a French sentence of length 1, given an f-bead other. is assumed to equal Pr(I). For an ef-bead, the probability of Given these costs, the Standard technique of dynamic an English sentence of length l is assumed to equal Pr(1) programming is used to find the alignment between the and the log of the ratio of length of the French Sentence to major anchors with the least total cost. Dynamic program the length of the English Sentence is assumed to be normally ming is described by R. Bellman in the book titled Dynamic distributed with mean u and variance of. Thus, if r=log(1/1), Programming, published by Princeton University Press, then Princeton, N.J. in 1957. In theory, the time and space 35 required to find this alignment grow as the product of the lengths of the two Sequences to be aligned. In practice, however, by using thresholds and the partial traceback technique described by Brown, Spohrer, Hochschild, and with C. chosen so that the sum of Pr(1/1) over positive values Baker in their paper, Partial Traceback and Dynamic 40 of 1 is equal to unity. For an eef-bead, the English sentence Programming, published in the Proceedings of the IEEE lengths are assumed to be independent with equal marginals International Conference on Acoustics, Speech and Signal Pr(l), and the log of the ratio of the length of the French Processing, in Paris, France in 1982, the time required can Sentence to the Sum of the lengths of the English Sentences be made linear in the length of the Sequences, and the Space is assumed to be normally distributed with the same mean can be made constant. Even So, the computational demand 45 and variance as for an ef-bead. Finally, for an eff-bead, the is Severe. In the Hansard example, the two corpora were out probability of an English length l is assumed to equal Pr(1) of alignment in places by as many as 90,000 Sentences and the the log of the ratio of the sum of the lengths of the owing to mislabelled or missing files. French Sentences to the length of the English Sentence is 12.5 Discarding Sections assumed to be normally distributed as before. Then, given The alignment of major anchors partitions the corpora 50 the Sum of the lengths of the French Sentences, the prob into a sequence of aligned sections. Next, (Step 4703), each ability of a particular pair of lengths, land l, is assumed Section is accepted or rejected according to the population of to be proportional to Pr(I)Pr(I). minor anchors that it contains. Specifically, a Section is Together, the model for Sequences of beads and the model accepted provide that, within the Section, both corpora for Sentence lengths given beads define a hidden Markov contain the same number of minor anchors in the same order. 55 model for the generation of aligned pairs of Sentence Otherwise, the Section is rejected. Altogether, using this lengths. Markov Models are described by L. Baum in the criteria, about 10% of each corpus was rejected. The minor article “An Inequality and associated maximization tech anchorS Serve to divide the remaining Sections into SubSec nique in Statistical estimation of probabilistic functions of a tions that range in Size from one Sentence to Several thou Markov process', appearing in Inequalities in 1972. Sand Sentences and average about ten Sentences. 60 The distributions Pr(1) and Pr(I) are determined from the 12.6 Aligning Sentences relative frequencies of various Sentence lengths in the data. The sentences within a subsection are aligned (Step 4704) For reasonably Small lengths, the relative frequency is a using a simple Statistical model for Sentence lengths and reliable estimate of the corresponding probability. For paragraph markers. Each corpus is viewed as a Sequence of longer lengths, probabilities are determined by fitting the Sentence lengths punctuated by occasional paragraph 65 observed frequencies of longer Sentences to the tail of a markers, as illustrated in FIG. 49. In this figure, the circles Poisson distribution. The values of the other parameters of around groups of Sentence lengths indicate an alignment the Markov model can be determined by from a large 5,805,832 83 84 method is applicable to corpora for which frequent anchor TABLE 1.9 points are not available.

Parameter estimates TABLE 2.0 Parameter Estimate Unusual but correct alignments Pr(e),Pr(f) OO7 And love and kisses to you, too. Pareillement. Pr(ef) 690 ... mugwumps who sit on the ... en voulant Pr(eef), Pr(eff) O2O fence with their mugs on menagerlachevreetlechous Pr(2), Pr() OOS one side and their wumps ils n’arrivent pas a prendre parti. Pr(2.2) .245 on the other side and do it. O72 p? O43 not know which side to come down on. At first reading, she may have. Elle semble en effet avoir un grief tout a fait valable, du moins au sample of text using the EM method. This method is premier abord. described in the above referenced article by E. Baum. 15 For the Hansard example, histograms of the Sentence length distributions Pr(l) and Pr(flen) for lengths up to 81 12.8 Results for the Hansard Example are shown in FIGS. 52 and 53 respectively. Except for For the Hansard example, the alignment method lengths 2 and 4, which include a large number of formulaic described above ran for 10 days on an IBM Model 3090 sentences in both the French and the English, the distribu mainframe under an operating System that permitted access tions are very Smooth. to 16 megabytes of virtual memory. The most probable The parameter values for the Hansard example are shown alignment contained 2,869,041 ef-beads. In a random in Table 19. From these values it follows that 91% of the Sample of 1000 the aligned Sentence pairs, 6 errors were English Sentences and 98% of the English paragraph mark found. This is consistent with the expected error rate of 0.9% ers line up one-to-one with their French counterparts. If X is 25 mentioned above. In Some cases, the method correctly a random variable whose log is normally distributed with aligned Sentences with very different lengths. Examples are mean u and variance of, then the mean of X is exp(u+Of/2). shown in Table 20. Thus from the values in the table, it also follows that the total 13 Aligning Bilingual Corpora length of the French text in an ef-, eef-, or eff-bead is about 9.8% greater on average than the total length of the corre With the growing availability of machine-readable bilin sponding English text. Since most Sentences belong to gual texts has come a burgeoning interest in methods for ef-beads, this is close to the value of 9.5% given above for extracting linguistically valuable information from Such the amount by which the length of the average French texts. One way of obtaining Such information is to construct Sentences exceeds that of the average English Sentence. Sentence and word correspondences between the texts in the 12.7 Ignoring Anchors 35 two languages of Such corpora. For the Hansard example, the distribution of English A method for doing this is depicted schematically in FIG. sentence lengths shown in FIG. 52 can be combined with the 45. This method comprises the steps of conditional distribution of French Sentence lengths given 4501. Beginning with a large bilingual corpus, English Sentence lengths from Equation (181) to obtain the 4504. Extracting pairs of sentences 4502 from this corpus joint distribution of French and English Sentences lengths in 40 Such that each pair consists of a Source and target Sentence ef-, eef-, and eff-beads. For this joint distribution, the mutual which are translations of each other; information between French and English Sentence lengths is 4505. Within each sentence pair, aligning the words of the 1.85 bits per sentence. It follows that even in the absence of target Sentence with the words in a Source Sentence, to anchor points, the correlation in Sentence lengths is Strong obtain a bilingual corpus labelled with word-by-word enough to allow alignment with an error rate that is asymp 45 correspondences 4503. totically less than 100%. In one embodiment of Step 4504, pairs of aligned sen Numerical estimates for the error rate as a function of the tences are extracted using the method explained in detain in frequency of anchor points can be obtained by Monte Carlo Section 12. The method proceeds without inspecting the simulation. The empirical distributions Pr(1) and Pr(I) identities of the words within Sentences, but rather uses only shown in FIGS. 52 and 53, and the parameter values from 50 the number of words or number of characters that each Table 19 can be used to generated an artificial pair of aligned Sentence contains. corpora, and then, the most probable alignment for these In one embodiment of Step 4505, word-by-word corre corpora can be found. The error rate can be estimated as the spondences within a Sentence pair are determined by finding fraction of ef-beads in the most probable alignment that did the Viterbi alignment or approximate Viterbi alignment for not correspond to ef-beads in the true alignment. 55 the pair of Sentences using a translation model of the Sort By repeating this proceSS many thousands of times, an discussed in Sections 8-10 above. These models constitute expected error rate of about 0.9% was estimated for the a mathematical embodiment of the powerfully compelling actual frequency of anchor points in the Hansard data. By intuitive feeling that a word in one language can be trans varying the parameters of the hidden Markov model, the lated into a word or phrase in another language. effect of anchor points and paragraph markers on error rate 60 Word-by-word alignments obtained in this way offer a can be explored. With paragraph markers but no anchor variable resource for work in bilingual lexicography and points, the expected error rate is 2.0%, with anchor points machine translation. For example, a method of cross-lingual but no paragraph markers, the expected error rate is 2.3%, Sense labeling, described in Section 11, and also in the and with neither anchor points nor paragraph markers, the aforementioned paper, “Word Sense Disambiguation using expected error rate is 3.2%. Thus, while anchor points and 65 Statistical Methods”, uses alignments obtained in this way paragraph markers are important, alignment is still feasible as data for construction of a Statistical Sense-labelling mod without them. This is promising Since it Suggests that the ule. 5,805,832 85 86 14 Hypothesis Search-Steps 702 and 902 with a morpheme in the hypothesized target Structure. The 14.1. Overview of Hypothesis Search iterative process of extending partial hypotheses terminates Referring now to FIG. 7, the second step 702 produces a when step 5402 produces an empty list of hypotheses to be Set of hypothesized target Structures which correspond to extended. A test for this situation is made on step 5403. putative translations of the input intermediate Source Struc This method for generating target Structure hypotheses ture produced by step 701. The process by which these target can be extended to an embodiment of step 902 of FIG. 9, by Structures are produced is referred to as hypothesis Search. modifying the hypothesis extension step 5404 in FIG. 54, In a preferred embodiment target Structures correspond to with a very similar Step that only considers extensions which Sequences of morphemes. In other embodiments more are consistent with the set of constraints 906. Such a Sophisticated linguistic Structures Such as parse trees or case modification is a simple matter for one skilled in the art. frames may be hypothesized. 14.2 Hypothesis Extension 5404 An hypothesis in this Step 702 is comprised of a target This section provides a description of the method by Structure and an alignment of a target Structure with the input which hypotheses are extended in step 5404 of FIG. 54. Source Structure. ASSociated with each hypothesis is a Score. Examples will be taken from an embodiment in which the In other embodiments a hypothesis may have multiple 15 Source language is French and the target language is English. alignments. In embodiments in which step 701 produces It should be understood however that the method described multiple Source Structures an hypothesis may contain mul applies to other language pairs. tiple alignments for each Source Structure. It will be assumed 14.2.1 Types of Hypothesis Extension here that the target Structure comprised by a hypothesis There are a number of different ways a partial hypothesis contains a single instance of the null target morpheme. The may be extended. Each type of extension is described by null morphemes will not be shown in the Figures pertaining working through an appropriate example. For reasons of to hypothesis search, but should be understood to be part of exposition, the method described in this Section assigns the target Structures nonetheless. Throughout this Section on scores to partial hypotheses based on Model 3 from the hypothesis Search, partial hypothesis will be used inter Section entitled Translation Models and Parameter Estima changeably with hypothesis, partial alignment with 25 tion. One skilled in the art will be able to adopt the method alignment, and partial target Structure with target Structure. described here to other models, Such as Model 5, which is The target Structures generated in this Step 702 are pro used in the best mode of operation. duced incrementally. The process by which this is done is A partial hypothesis is extended by accounting for one depicted in FIG. 54. This process is comprised of five steps. additional previously unaccounted for element of the Source A set of partial hypotheses is initialized in step 5401. A Structure. When a partial hypothesis His extended to Some partial hypothesis is comprised of a target Structure and an other hypothesis H, the Score assigned to H is a product of alignment with Some Subset of the morphemes in the Source the Score associated with H and a quantity denoted as the Structure to be translated. The initial set generated by Step extension Score. The value of the extension Score is deter 5401 consists of a single partial hypothesis. The partial mined by the language model, the translation model, the target Structure for this partial hypothesis is just an empty 35 hypothesis being extended and the particular extension that Sequence of morphemes. The alignment is the empty align is made. A number of different types of extensions are ment in which no morphemes in the Source Structure to be possible and are Scored differently. The possible extension translated are accounted for. types and the manner in which they are Scored is illustrated The system then enters a loop through steps 5402, 5403, in the examples below. and 5404, in which partial hypotheses are iteratively 40 As depicted in FIG. 55, in a preferred embodiment, the extended until a test for completion is satisfied in step 5403. French Sentence La jeune fille a reveillé Sa mere is trans At the beginning of this loop, in Step 5402, the existing Set duced in either step 701 of FIG. 7 or step 901 of FIG. 9 into of partial hypotheses is examined and a Subset of these the morphological Sequence la jeune fille V past 3S Sa hypotheses is Selected to be extended in the Steps which mere. comprise the remainder of the loop. In step 5402 the score 45 The initial hypothesis accounts for no French morphemes for each partial hypothesis is compared to a threshold (the and the score of this hypothesis is set to 1. This hypothesis method used to compute these thresholds is described can be extended in a number of ways. Two Sample exten below). Those partial hypotheses with Scores greater than sions are shown in FIGS. 56 and 57. In the first example in threshold are then placed on a list of partial hypotheses to be FIG. 56, the English morpheme the is hypothesized as extended in step 5404. Each partial hypothesis that is 50 accounting for the French morpheme la. The component of extended in Step 5404 contains an alignment which accounts the Score associated with this extension is the equal to for a Subset of the morphemes in the Source Sentence. The remainder of the morphemes must still be accounted for. Each extension of an hypothesis in step 5404 accounts for l(the, *)n (1the)t(lathe)d(11). (182) one additional morpheme. Typically, there are many tens or 55 Here, * is a Symbol which denotes a Sequence boundary, and hundreds of extensions considered for each partial hypoth the factor ICthe, ) is the trigram language model parameter esis to be extended. For each extension a new Score is that serves as an estimate of the probability that the English computed. This Score contains a contribution from the morpheme the occurs at the beginning of a Sentence. The language model as well as a contribution from the transla factor n(1the) is the translation model parameter that is an tion model. The language model Score is a measure of the 60 estimate of the probability that the English morpheme the plausibility a priori of the target Structure associated with the has fertility 1, in other words, that the English morpheme the extension. The translation model Score is a measure of the is aligned with only a Single French morpheme. The factor plausibility of the partial alignment associated with the t(lathe) is the translation model parameter that serves as an extension. A partial hypothesis is considered to be a full estimate of the lexial probability that the English morpheme hypothesis when it accounts for the entire Source Structure to 65 the translates to the French morpheme la. Finally, the factor be translated. A full hypothesis contains an alignment in d(11) is the translation model parameter that serves as an which every morpheme in the Source Structure is aligned estimate of the distortion probability that a French mor 5,805,832 87 88 pheme will be placed in position 1 of the French structure the factor d(32) is the translation model parameter that given that it is aligned with an English morpheme that is in Serves as an estimate of the distortion probability that a position 1 of the English Structure. In the Second example in French morpheme will be placed in position 3 of the French FIG. 57, the English morpheme mother is hypothesized as Structure given that it is aligned with an English morpheme accounting for the French morpheme mere. The Score for which is in position 2 of the English structure. This exten this partial hypothesis is Sion Score in Equation 184 is multiplied by the Score in Equation 182 for the partial hypothesis which is being extended to yield a new Score for the partial hypothesis in l(mother, *)n (1mother)t(meremother)d(71). (183) FIG. 56 of Here, the final factor d(71) serves as an estimate of the i(the , )n(1 the)t(lathe)d(11) x (185) distortion probability that a French morpheme, Such as mere, l(girl, the)2 ( 25 n(ilgirl) ) t(fillegirl).d(32). will be place in the 7th position in a Source Sequence given = that it is aligned with an English morpheme Such as mother which is in the 1st position in an hypothesized target 15 Consider now an extension to the partial hypothesis in FIG. Sequence. 58. If a partial hypothesis that is to be extended contains an Now, Suppose the partial hypothesis in FIG. 56 is to be open morpheme, then, in a preferred embodiment, that extended on some other invocation of step 5404. A common hypothesis can only be extended by aligning another mor translation of the pair of French morphemes jeune fille is the pheme from the Source Structure with that open morpheme. English morpheme girl. However, Since in a preferred When Such an extension is made, there are two possibilities: embodiment a partial hypothesis is extended to account for 1) the open morpheme is kept open in the extended partial only a single French morpheme at a time, it is not possible hypothesis, indicating that more Source morphemes are to be to account for both jeune and fille with a Single extension. aligned with that open target morpheme, or 2) the open Rather the System first accounts for one of the morphemes, morpheme is closed indicating that no more Source mor and then on another round of extensions, accounts for the 25 phemes are to be aligned with that target morpheme. These other. This can be done in two ways, either by accounting two cases are illustrated in FIGS. 59 and 60. first for jeune or by accounting first for fille. FIG. 58 depicts In FIG. 59, an extension is made of the partial alignment the extension that accounts first for fille. The +symbol in in FIG. 58 by aligning the additional French morpheme FIG. 58 after the the English morpheme girl denotes the fact jeune with the English morpheme girl. In this example the that in these extensions girl is to be aligned with more English morpheme girl is then closed in the resultant partial French morphemes than it is currently aligned with, in this hypothesis. The extension Score for this example is case, at least two. A morpheme So marked is referred to as open. A morpheme that is not open is said to be closed. A partial hypothesis which contains an open target morpheme n(2girl) (186) is referred to as open, or as an open partial hypothesis. A (S, n(ilgirl)) ) t(jeunelgirl).d(22). partial hypothesis which is not open is referred to as closed, 35 or as a closed partial hypothesis. An extension is referred to Here, the first quotient adjusts the fertility score for the as either open or closed according to whether or not the partial hypothesis by dividing out the estimate of the prob resultant partial hypothesis is open or closed. In a preferred ability that girl is aligned with at least two French mor embodiment, only the last morpheme in a partial hypothesis phemes and by multiplying in an estimate of the probability 40 that girl is aligned with exactly two French morphemes. AS can be designated open. The Score for the extension in FIG. in the other examples, the Second and third factors are 58 is estimates of the lexical and distortion probabilities associ ated with this extension. 25S. n(ilgirl) ) t(fillegirl).d(32). (184) In FIG. 60, the same extension is made as in FIG. 59, 45 except here the English morpheme girl is kept open after the Here, the factor lgirl,the) is the language model parameter extension, hence the +Sign. The extension Score for this that serves as an estimate of the probability with which the example is English morpheme girl is the Second morpheme in a Source structure in which the first morpheme is the. The next factor (S, n(ilgirl (187) of 2 is the combinatorial factor that is discussed in the 50 Section entitled Translation Models and Parameter Estima 3 (co-(S, n(ilgirl)) ) t(jeunelgirl).d(22). tion. It is factored in, in this case, because the open English Here, the factor of 3 is the adjustment to the combinatorial morpheme girl is to be aligned with at least two French factor for the partial hypothesis. Since the score for the morphemes. The factor n(ilgirl) is the translation model partial hypothesis in FIG. 59 already has a combinatorial parameter that Serves as an estimate of the probability that 55 factor of 2, the Score for the resultant partial hypothesis in the English morpheme girl will be aligned with exactly i FIG. 60, will have a combinatorial factor of 2x3=3. The French morphemes, and the Sum of these parameters for i quotient adjusts the fertility Score for the partial hypothesis between 2 and 25 is an estimate of the probability that girl to reflect the fact that in further extensions of this hypothesis will be aligned with at least 2 morphemes. It is assumed that girl will be aligned with at least three French morphemes. the probability that an English morpheme will be aligned 60 Another type of extension performed in the hypothesis with more than 25 French morphemes is 0. Note that in a Search is one in which two additional target morphemes are preferred embodiment of the present invention, this Sum can appended to the partial target Structure of the partial hypoth be precomputed and Stored in memory as a separate param esis being extended. In this type of extension, the first of eter. The factor t (filiegirl) is the translation model param these two morphemes is assigned a fertility of Zero and the eter that Serves as an estimate of the lexical probability that 65 Second is aligned with a, Single morpheme from the Source one of the French morphemes aligned with the English Structure. This Second target morpheme may be either open morpheme girl will be the French morpheme fille. Finally, or closed.