Homograph Disambiguation in Text-to-Speech Synthesis

David Yarowsky

ABSTRACT This chapter presents a statistical decision procedure for lexical ambiguity resolution in text-to-speech synthesis. Based on decision lists, the algorithm incorporates both local syntactic patterns and more distant collocational evidence, combining the strengths of decision trees, N-gram taggers, and Bayesian classifiers. The algorithm is applied to seven major types of ambiguity where context can be used to choose among several possible pronunciations of the same written form.

1. Problem Description

In speech synthesis, one frequently encounters words and numbers whose pronunciation cannot be determined without context. Seven major types of these homographs will be addressed here.

Different Part of Speech: The largest class of pronunciation ambiguity consists of homographs with different parts of speech (Three lives were lost vs. One lives to eat). These cases can typically be resolved through local syntactic patterns.

Same Part of Speech: A word such as bass or bow exhibits different pronunciations with the same part of speech, and thus requires additional semantic evidence for disambiguation.

Proper Names: Names such as Nice and Begin are ambiguous in capitalized contexts, including sentence-initial position, titles, and single-case text.

Roman Numerals are pronounced differently in contexts such as Chapter III and Henry III.

Fractions/Dates: A form such as 5/16 may be pronounced as five-sixteenths or May sixteenth.

Years/Quantifiers: Numbers such as 1750 tend to be pronounced as seventeen-fifty when used as dates and one thousand seven hundred and fifty when used before measure words, as in 1750 miles. Related cases include the distinction between the 747 pilot and 747 people.

Abbreviations may exhibit multiple pronunciations, such as St (Saint or Street) and Dr (Doctor or Drive).

Some homographs exhibit multiple categories of ambiguity. For example, lead has different pronunciations when used as a verb and noun, and between two noun senses (He followed her lead versus He covered the hull with lead). In general, we will resolve the part-of-speech ambiguity first with lead, and then resolve the additional semantic ambiguity, if present.

2. Previous Approaches

N-gram taggers [Jel85, Chu88, Mer90] may be used to tag each word in a sentence with its part of speech, thereby resolving those pronunciation ambiguities that correlate with part-of-speech ambiguities. The AT&T TTS system [SHY92] uses Church's PARTS tagger [Chu88] for this purpose. A weakness of these taggers is that they are typically not sensitive to specific word associations. The standard algorithm relies on models of part-of-speech sequence, captured by probabilities of part-of-speech bigrams or trigrams, to the exclusion of lexical collocations. This causes difficulty with cases such as a ribbon wound around the pole and a bullet wound in the leg, which have identical surrounding part-of-speech sequences and require lexical information for resolution. A more fundamental limitation, however, is the inherent myopia of these models. They cannot generally capture longer-distance word associations, such as between wound and hospital, and hence are not appropriate for resolving many semantic ambiguities.

Bayesian classifiers [MW64] have been used for a number of sense disambiguation tasks [GCY92], typically involving semantic ambiguity. In an effort to generalize from longer-distance word associations regardless of position, an implementation proposed in [GCY92] characterizes each token of a homograph by the words nearest to it, treated as an unordered bag.[1] Although such models can successfully capture topic-level differences, they lose the ability to make distinctions based on local sequence or sentence structure. In addition, these models have been greatly simplified by assuming that occurrence probabilities of content words are independent of each other, a false and potentially problematic assumption that tends to yield inflated probability values. One can attempt to model these dependencies, as in [BW94], but data sparsity problems and computational constraints can make this difficult and costly to do.

[1] Leacock et al. have pursued a similar bag-of-words strategy using an IR-style vector space model and neural network [LTV93].


Decision trees [BFOS84, BDDM91] can be effective at handling complex conditional dependencies and non-independence, but often encounter severe difficulties with very large parameter spaces, such as the highly lexicalized feature sets frequently used in homograph resolution.

Current Algorithm: The algorithm described in this paper is a hybrid approach, combining the strengths of each of these general paradigms. It was proposed in [SHY92] and refined in [Yar94]. It is based on the formal model of decision lists in [Riv87], although feature conjuncts have been restricted to a much narrower complexity, namely word and class trigrams. The algorithm models both local sequence and wide context well, and successfully exploits even highly non-independent feature sets.

3. Algorithm

Homograph disambiguation is ultimately a classification problem, where the output is a pronunciation label for an ambiguous target word and the feature space consists of the other words in the target word's vicinity. The goal of the algorithm is to identify patterns in this feature space that can be used to correctly classify instances of the ambiguous word in new texts. For example, given instances of the homograph lead below, the algorithm should assign the appropriate label, /liːd/ or /lɛd/.

    Pronunciation   Context
    /lɛd/           ... it monitors the lead levels in drinking ...
    /lɛd/           ... median blood lead concentration was ...
    /lɛd/           ... found layers of lead telluride inside ...
    /lɛd/           ... conference on lead poisoning in ...
    /lɛd/           ... strontium and lead isotope zonation ...
    /liːd/          ... maintained their lead Thursday over ...
    /liːd/          ... to Boston and lead singer for Purple ...
    /liːd/          ... Bush a point lead in Texas, only ...
    /liːd/          ... his double-digit lead nationwide. The ...
    /liːd/          ... the fairly short lead time allowed on ...

The following sections will outline the steps in this process, using the individual homographs lead (/liːd/ vs. /lɛd/) and bass (/beɪs/ vs. /bæs/) as examples. The application of this algorithm to large classes of homographs, such as fractions vs. dates, is described in Section 4.

Step 1: Collect and Label Training Contexts

For each homograph, begin by collecting all instances observed in a large corpus. Then label each example of the target homograph with its correct pronunciation in that context. Here this process was partially automated using tools that included a sense disambiguation system based on semantic classes [Yar92].

For this study, the training and test data were extracted from a multimillion-word corpus collection containing news articles (AP newswire and Wall Street Journal), scientific abstracts (NSF and DOE), novels, two encyclopedias, and several years of Canadian parliamentary debates, augmented with email correspondence and miscellaneous Internet dialogues.
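As a rough illustration, the collection of training contexts can be sketched in a few lines of Python. The function name, tokenized-corpus format, and window size here are illustrative assumptions, not details from the chapter:

    def collect_contexts(tokens, homograph, k=10):
        """Return the position and +/-k token window of each homograph instance."""
        contexts = []
        for i, tok in enumerate(tokens):
            if tok.lower() == homograph:
                window = tokens[max(0, i - k):i + k + 1]
                contexts.append((i, window))
        return contexts

    # Each collected window must then be labeled with its correct
    # pronunciation, by hand or with semi-automatic tools, e.g.
    # (window, "/lEd/") or (window, "/li:d/") for the homograph "lead".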

Step 2: Measure Collocational Distributions

The driving force behind this disambiguation algorithm is the uneven distribution of collocations (word associations) with respect to the ambiguous token being classified. For example, the following table indicates that certain word associations in various positions relative to the ambiguous token bass, including co-occurrence within a ±k word window,[2] exhibit considerable discriminating power:[3]

    Position             Collocation     Indicates
    Word to the right    bass player     /beɪs/
                         bass fishing    /bæs/
                         bass are        /bæs/
    Word to the left     striped bass    /bæs/
                         on bass         /beɪs/
                         sea bass        /bæs/
                         white bass      /bæs/
                         plays bass      /beɪs/
    Within ±k words      fish            /bæs/
                         guitar          /beɪs/
                         violin          /beɪs/
                         river           /bæs/
                         percussion      /beɪs/
                         salmon          /bæs/

The goal of the initial stage of the algorithm is to measure a large and varied set of collocational distributions and select those which are most useful in identifying the pronunciation of the ambiguous word.

In addition to raw word associations, the present study also collected collocations of lemmas (morphological roots), which usually provide more succinct and generalizable evidence than their inflected forms, and part-of-speech sequences, which capture syntactic rather than semantic distinctions in usage.[4] A richer set of positional relationships beyond adjacency and co-occurrence in a window is also considered, including trigrams and (optionally) verb-object pairs. The following table indicates the pronunciation distributions observed for the noun lead for these various types of evidence:[5]

[2] Several different context widths are used. The window width employed here is a practical choice for many applications. The issues involved in choosing appropriate context widths are discussed in [GCY92].

[3] Such skewed distributions are in fact quite typical. A study in [Yar93] showed that P(pronunciation|collocation) is a very low entropy distribution. Certain types of content-word collocations seen only once in training data predicted the correct pronunciation in held-out test data with very high accuracy.

    Position     Collocation           Indicates
    +1 l         lead level/N          /lɛd/
    -1 w         narrow lead           /liːd/
    +1 w         lead in               /liːd/
    -1 w, +1 w   of lead in            /lɛd/
    -1 w, +1 w   the lead in           /liːd/
    +1 p         lead NOUN             /lɛd/
    ±k w         zinc in ±k words      /lɛd/
    ±k w         copper in ±k words    /lɛd/
    V l          follow/V lead         /liːd/
    V l          take/V lead           /liːd/

[4] The richness of this feature set is one of the key reasons for the success of this algorithm. Others who have very productively exploited a diverse feature set include [Hea91], [Bri93], and [DI94].

[5] Position markers include +1 (token to the right), -1 (token to the left), ±k (co-occurrence in a ±k token window), and V (head verb). Possible types of objects at these positions include w (raw words), p (parts of speech), and l (lemmas, a class of words consisting of different inflections of the same root, such as take/V = {takes, took, taken, take, taking}).
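The feature space above can be sketched in code. This is a minimal, illustrative subset of the feature inventory; `lemmatize` and `pos_tag` stand in for whatever morphological analyzer and part-of-speech tagger are assumed to be available:

    def extract_features(tokens, i, lemmatize, pos_tag, k=10):
        """Collect collocational evidence for the ambiguous token at index i."""
        left = tokens[i - 1] if i > 0 else '<s>'
        right = tokens[i + 1] if i + 1 < len(tokens) else '</s>'
        feats = {
            ('-1 w', left),                # word to the left
            ('+1 w', right),               # word to the right
            ('-1w +1w', (left, right)),    # surrounding trigram
            ('+1 p', pos_tag(right)),      # adjacent part of speech
            ('+1 l', lemmatize(right)),    # adjacent lemma
        }
        for w in tokens[max(0, i - k):i] + tokens[i + 1:i + k + 1]:
            feats.add(('±k w', w))         # co-occurrence in ±k word window
        return feats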

Step 3: Compute Likelihood Ratios

The discriminating strength of each piece of evidence is measured by the magnitude of the log-likelihood ratio:

    Abs( log( P(Pronunciation_1 | Collocation_i) / P(Pronunciation_2 | Collocation_i) ) )

The collocation patterns most strongly indicative of a particular pronunciation will have the most extreme log-likelihood ratios. Sorting evidence by this value will list the strongest and most reliable evidence first.

Note that the estimation of P(Pronunciation_j | Collocation_i) merits considerable care. Problems arise when an observed count of 0 in the collocation distribution is a common occurrence. Clearly the probability of seeing zinc in the context of the /liːd/ pronunciation of lead is not 0, even though no such collocation was observed in the training data. Finding a more accurate probability estimate depends on several factors, including the size of the training sample, the nature of the collocation (adjacent bigrams, verb-object pairs, or wider context), our prior expectation about the similarity of contexts, and the amount of noise in the training data.

Several smoothing methods have been explored in this work, including those discussed in [GCY94]. The preferred technique is to take all instances of the same raw frequency distribution and collectively compute a smoothed ratio that better reflects the true probability distribution. This is done by holding out a portion of the training data and computing the mean observed distribution there for all of the collocations that have the same raw frequency distribution in the first portion. This mean distribution in the held-out data is a more realistic estimate of the distribution expected in independent test data, and hence gives better predictive power and better probability estimates than using the unsmoothed values.

Smoothed ratios are very sensitive to the type of collocation being observed. An observed x/0 ratio for adjacent content words retains a large smoothed ratio, while the same observed ratio for function-word collocations many words away has a smoothed ratio close to even odds, indicating that such a training distribution is essentially noise here, with little predictive value.

The process of computing distributions of the form x/0 (and also x/1, etc.) for all values of x can be simplified by the observation that the mapping from observed ratios to smoothed ratios tends to exhibit a log-linear relationship when other factors, such as distance, are held constant. This is shown in the figure below for observed x/0 distributions for adjacent content-word collocations. The points and jagged line A are the empirically observed values in held-out data; B constitutes the least-squares fit of this data, which is a reasonable fit, especially given that the empirical values are poor estimates for large x due to limited sample points in this range.

[Figure: Smoothing Likelihood Ratios. Log(smoothed ratio) plotted against the observed ratio (x/0): points and jagged line A show the empirical values in held-out data, line B the least-squares fit, and lines C, D, and E simpler additive-constant approximations.]


Satisfactory results may be obtained, however, by a much simpler smoothing procedure. Adding a small constant β to the numerator and denominator (x/z becomes (x + β)/(z + β)) roughly captures the desired smoothing behavior, as shown in lines C, D, and E in the figure above. The constant is determined empirically for the different types of collocation and distance from the target word. However, the value does not vary greatly for different homographs, so adequate performance can be achieved by reusing previous values rather than estimating them afresh from held-out training data.
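A minimal sketch of this smoothed log-likelihood ratio, using the add-a-small-constant approximation, follows. The value of beta is illustrative; as described above, it would be fit empirically per collocation type and distance:

    import math

    def smoothed_loglike(count1, count2, beta=0.1):
        """Abs(log(P(pron1|colloc)/P(pron2|colloc))), smoothed as (x+beta)/(z+beta)."""
        return abs(math.log((count1 + beta) / (count2 + beta)))

    # A 5/0 observed distribution gets a large but finite ratio, while a
    # 1/1 distribution maps to 0 (no discriminating power):
    # smoothed_loglike(5, 0) ~= 3.93, smoothed_loglike(1, 1) == 0.0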

Step 4: Sort by Likelihood Ratio into Decision Lists

Preliminary decision lists are created by sorting all collocation patterns by the absolute value of the smoothed log-likelihood ratio, computed as described above. The following are highly abbreviated examples:

Decision List for lead (noun), highly abbreviated:

    Evidence               Pronunciation
    follow/V lead          /liːd/
    zinc in ±k words       /lɛd/
    lead level/N           /lɛd/
    of lead in             /lɛd/
    the lead in            /liːd/
    lead role              /liːd/
    copper in ±k words     /lɛd/
    lead poisoning         /lɛd/
    big lead               /liːd/
    narrow lead            /liːd/
    take/V lead            /liːd/
    lead NOUN              /lɛd/
    lead in                /liːd/

Decision List for bass, highly abbreviated:

    Evidence               Pronunciation
    fish in ±k words       /bæs/
    striped bass           /bæs/
    guitar in ±k words     /beɪs/
    bass player            /beɪs/
    piano in ±k words      /beɪs/
    tenor in ±k words      /beɪs/
    sea bass               /bæs/
    play/V bass            /beɪs/
    river in ±k words      /bæs/
    violin in ±k words     /beɪs/
    salmon in ±k words     /bæs/
    on bass                /beɪs/
    bass are               /bæs/

The resulting decision lists are used to classify new examples by identifying the highest line in the list that matches the given context and returning the indicated classification. This process is described in Step 6.
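A rough sketch of decision-list construction for a two-way homograph, reusing extract_features and smoothed_loglike from the sketches above; the (feature set, pronunciation) data layout is an illustrative assumption:

    from collections import Counter

    def build_decision_list(labeled_contexts, prons):
        """labeled_contexts: iterable of (feature_set, pronunciation) pairs.
        Returns (strength, feature, pronunciation) entries, strongest first."""
        counts = Counter()
        features = set()
        for feats, pron in labeled_contexts:
            for f in feats:
                counts[(f, pron)] += 1
                features.add(f)
        entries = []
        for f in features:
            c1, c2 = counts[(f, prons[0])], counts[(f, prons[1])]
            best = prons[0] if c1 >= c2 else prons[1]
            entries.append((smoothed_loglike(c1, c2), f, best))
        entries.sort(key=lambda e: e[0], reverse=True)  # strongest evidence first
        return entries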

Step 5: Optional Pruning and Interpolation

The decision lists created above may be used as is if we assume that the likelihood ratio for the jth entry in the list is roughly the same when computed on the entire training set and when computed on the residual portion of the training set where the first j-1 entries have failed to match. In other words, does the probability that piano indicates the /beɪs/ pronunciation of bass change significantly, conditional on not having seen fish, striped, guitar, and player in the target context?

In most cases, the global probabilities computed from the full training set are acceptable approximations of these residual probabilities. However, in many cases we can achieve improved results by interpolating between the two values. The residual probabilities are more relevant, but since the size of the residual training data shrinks at each level in the list, they are often much more poorly estimated, and in many cases there may be no relevant data left in the residual on which to compute the distribution of pronunciations for a given collocation. In contrast, the global probabilities are better estimated but less relevant. A reasonable compromise is to interpolate between the two, where the interpolated estimate is

    λ_i * global + (1 - λ_i) * residual

When the residual probabilities are based on a large training set and are well estimated, λ_i is small and the residual will dominate. In cases where the relevant residual is small or nonexistent, λ_i will be large and the smoothed probabilities will rely primarily on the better-estimated global values. If λ_i = 0 for all i (exclusive use of the residual), the result is a degenerate (strictly right-branching) decision tree with severe sparse data problems. Alternately, if one assumes that likelihood ratios for a given collocation are functionally equivalent at each line of a decision list, then one could exclusively use the global probabilities (λ_i = 1 for all i). This is clearly the easiest and fastest approach, as probability distributions do not need to be recomputed as the list is constructed.

Which approach is best? Using only the global probabilities does surprisingly well, and the results cited here are based on this readily replicable procedure. The reason is grounded in the strong tendency of a word to exhibit only one sense or pronunciation per collocation, discussed in Step 2 and [Yar93]. Most classifications are based on an x vs. 0 distribution, and while the magnitude of the log-likelihood ratios may decrease in the residual, they rarely change sign. There are cases where this does happen, and it appears that some interpolation helps, but for this problem the relatively small difference in performance does not necessarily justify the greatly increased computational cost.

Two kinds of optional pruning can increase the efficiency of the decision lists. The first handles the problem of redundancy by subsumption, which occurs when more general patterns higher in the list subsume more specific patterns lower down. The more specific patterns will never be used in Step 6 and may be omitted. Examples of this include lemmas (e.g., follow/V) subsuming inflected forms (follow, followed, follows, etc.) and bigrams subsuming trigrams. If a bigram unambiguously signals the pronunciation, probability distributions for dependent trigrams need not even be generated, since they will provide no additional useful information.

The second pruning, in a cross-validation phase, compensates for over-modeling of the training data, which appears to be minimal. Once a decision list is built, it is applied to its own training set plus some held-out cross-validation data (not the test data). Lines in the list which contribute to more incorrect classifications than correct ones are removed. This also indirectly handles problems that may result from the omission of the interpolation step. If space is at a premium, lines which are never used in the cross-validation step may also be pruned. However, useful information is lost here: particularly for a small cross-validation corpus, these lines may have proved useful during later classification of the test data. Overall, a small drop in performance is observed, but a substantial reduction in space is realized. The optimum pruning strategy is subject to cost-benefit analysis. In the results reported below, all pruning except this final space-saving step was utilized.

Step 6: Using the Decision Lists

Once the decision lists have been created, they may be used in real time to determine the pronunciations of ambiguous words in new contexts.

From a statistical perspective, the evidence at the top of this list will most reliably disambiguate the target word. Given a word in a new context to be assigned a pronunciation, if we may only base the classification on a single line in the decision list, it should be the highest-ranking pattern that is present in the target context. This is uncontroversial and is solidly based in Bayesian decision theory.

The question, however, is what to do with the less-reliable evidence that may also be present in the target context. The common tradition is to combine the available evidence in a weighted sum or product. This is done by Bayesian classifiers, neural nets, IR-based classifiers, and N-gram part-of-speech taggers. The system reported here is unusual in that it does no such combination. Only the single most reliable piece of evidence matched in the target context is used.
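In code, this classification step reduces to a single scan of the sorted list. A minimal sketch, assuming the (strength, feature, pronunciation) list format produced by the build_decision_list sketch above:

    def classify(decision_list, context_features, default):
        for strength, feature, pron in decision_list:
            if feature in context_features:
                return pron        # first (strongest) matching pattern wins
        return default             # fall back to the majority pronunciation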

There are several motivations for this approach. The first is that combining all available evidence rarely produces a different classification than just using the single most reliable piece of evidence, and when these differ it is as likely to hurt as to help. A study in [Yar94] based on a set of homographs showed that the two methods agreed in the large majority of the test cases. Indeed, in the cases of disagreement, using only the single best piece of evidence worked slightly better than combining evidence. Of course, this behavior does not hold for all classification tasks, but it does seem to be characteristic of lexically-based semantic classifications. This may be explained by the previously noted observation that, in most cases and with high probability, words exhibit only one sense in a given collocation [Yar93].

Thus, for this type of ambiguity resolution, there is no apparent detriment, and some apparent performance gain, from using only the single most reliable evidence in a classification. There are other advantages as well, including run-time efficiency and ease of parallelization. However, the greatest gain comes from the ability to incorporate non-independent information types in the decision procedure. A given word in context may match several times in the decision list, once each for its part of speech, lemma, inflected form, bigrams, and possible word classes as well. By only using one of these matches, the gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided. While these dependencies may be modeled and corrected for in Bayesian formalisms, it is difficult and costly to do so. Using only one log-likelihood ratio, without combination, frees the algorithm to include a wide spectrum of highly non-independent information without additional algorithmic complexity or performance loss.

4. Decision Lists for Ambiguity Classes

This algorithm may also be directly applied to large classes of ambiguity, such as distinguishing between fractions and dates. Rather than train individual pronunciation discriminators for each member of such a class, training contexts are pooled for all individual instances of the class. Since the disambiguating characteristics are quite similar for each class member, enhanced performance due to larger training sets tends to compensate for the loss of specialization.

4.1 Class Models: Creation

Decision lists for ambiguity classes may be created by replacing all members of the class found in the training data (e.g., 5/16) with a common class label (e.g., X/Y). The algorithm described in Section 3 may then be applied to this data.[6]

An abbreviated decision list for the fraction/date class is shown below:

Decision List for Fraction/Date Class

    Evidence              Pronunciation
    NUMBER X/Y            fraction
    X/Y of                fraction
    Monday in ±k words    date
    Mon. in ±k words      date
    X/Y mile              fraction
    X/Y inch              fraction
    on X/Y                date
    from X/Y to           date

[6] There are advantages to filtering or weighting the training data such that each member of the class has roughly balanced representation. This causes the trained decision list to model the dominant common features of the class rather than the idiosyncrasies of its most frequent members.
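The pooling itself is a simple token substitution before training. A minimal sketch for the fraction/date class; the regular expression is an illustrative assumption about how class members appear in raw text:

    import re

    def pool_fraction_date(tokens):
        """Replace digits/digits tokens with the class label X/Y."""
        return ['X/Y' if re.fullmatch(r'\d+/\d+', t) else t for t in tokens]

    # pool_fraction_date(['opened', 'on', '5/16', 'in', 'Boston'])
    # -> ['opened', 'on', 'X/Y', 'in', 'Boston']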


4.2 Class Models: Use

The use of these class decision lists requires one additional step: translating from the raw text (e.g., 5/16) to its full pronunciation, using the paradigm (e.g., fraction) selected by the decision list. In conjunction with the AT&T TTS speech synthesizer, the decision lists specify the chosen paradigm by using escape sequences surrounding the ambiguous form as output from the list.

Although the rules for this translation are typically straightforward and standard, a complication arises in the case of dates. American and British conventions differ regarding the order of day and month, pronouncing 3/7 in a context such as Monday, 3/7 as March seventh and July third, respectively. It would seem reasonable to make this choice conditional on a global british or american parameter, set for the region of use. However, even if one decided to treat ambiguous dates conservatively (e.g., three slash seven), there is still considerable merit in pronouncing known fractions properly, e.g., 3/7 of the as three-sevenths rather than three slash seven.

4.3 Class Models: Incorporating Prior Probabilities

Clearly, not every member of the class has the same inherent probability, independent of context. We can gain leverage by modeling these differences in prior probability. For example, the class ambiguity year/quantifier exhibits the following distribution of the prior probability of being a year[7] for numbers up to 2000:

[Figure: Prior Probabilities (Year vs. Quantifier). P(year) plotted for numbers from 0 to 2000.]

[7] The spurt at 1000 is due to the possibility of numbers greater than 999 being written with a comma. This tendency is greatest for literary and news text, inconsistent in informal correspondence, and relatively rare in scientific text. The second spurt is due to a strong American bias in the training data's historical references.


These prior probabilities may be used dynamically, as follows. Train a decision list for the class assuming an uninformative prior. When applying the list, if the highest matching pattern (based on context) indicates the same pronunciation as the majority pronunciation (based on the appropriate prior), return this result. If it indicates the minority pronunciation, find the highest matching pattern that indicates the majority pronunciation. If the difference in log-likelihoods exceeds the log of the prior ratio, use the minority pronunciation.
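A minimal sketch of this prior-adjusted procedure, assuming the list format of the earlier sketches; the majority/minority labels and the prior p_majority (e.g., P(year) for 1750, with 0 < p_majority < 1) come from the class member itself, and all names here are illustrative:

    import math

    def classify_with_prior(decision_list, feats, majority, minority, p_majority):
        log_prior_ratio = math.log(p_majority / (1.0 - p_majority))
        # Highest-ranking pattern present in the context, if any.
        top = next(((s, f, p) for s, f, p in decision_list if f in feats), None)
        if top is None or top[2] == majority:
            return majority
        # Top pattern indicates the minority reading: compare it against the
        # strongest matching pattern that supports the majority reading.
        maj_strength = next((s for s, f, p in decision_list
                             if f in feats and p == majority), 0.0)
        if top[0] - maj_strength > log_prior_ratio:
            return minority
        return majority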

4.4 Roman Numerals

Roman numerals are an example of where two-tiered class models may be productively used. The majority of Roman numerals, including II, III, VI, VII, VIII, IX, XII, and XIII, exhibit the basic distinction between the uses Chapter VII and Henry VII. These are modeled in the abbreviated decision list below.

Decision List for Roman Numerals (e.g., VII)

    Evidence                 Pronunciation
    (new sentence) VII       seven
    king within ±k words     the seventh
    Chapter VII              seven
    Henry VII                the seventh
    Edward VII               the seventh
    Title VII                seven
    Volume VII               seven
    pope within ±k words     the seventh
    Pius VII                 the seventh
    Mark VII                 seven
    Gemini VII               seven
    Part VII                 seven
    PROPER-NOUN VII          the seventh

However, four Roman numerals exhibit an additional possible pronunciation. They include IV (as /aɪ viː/, for intravenous) and I, V, and X (as letters). For these cases, an initial decision list makes the primary distinction between these additional interpretations and the numeric options, based on such collocations as IV drug, fluid, dose, injection, oral, and intramuscular. If the numeric option is identified, the general Roman numeral list is consulted to determine if the final pronunciation should be as in Article IV or George IV. This two-tiered list maximizes use of existing class models.
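The two-tiered scheme chains two decision lists. A minimal sketch, reusing the classify function from the sketch in Step 6; the label names and default readings are illustrative assumptions:

    def pronounce_roman(feats, special_list, roman_class_list):
        # Tier 1: separate the extra readings (letter, 'intravenous', ...)
        # from the numeric options.
        reading = classify(special_list, feats, default='numeric')
        if reading != 'numeric':
            return reading                      # e.g. /aɪ viː/ in "IV drug"
        # Tier 2: numeric reading; the shared Roman-numeral class list
        # decides cardinal ("Article IV") vs. ordinal ("George IV").
        return classify(roman_class_list, feats, default='cardinal')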


5. Evaluation

The following table provides a summary of the algorithm's performance on the classes of ambiguity studied.

System Performance

    Type of Ambiguity       Examples       Prior Prob.   % Correct
    Diff. Part of Speech    lives               -             -
    Same Part of Speech     bass                -             -
    Proper Names            Nice, Begin         -             -
    Roman Numerals          III                 -             -
    Fractions/Dates         5/16                -             -
    Years/Quantifiers       1750                -             -
    Abbreviations           St, Dr              -             -
    AVERAGE                                     -             -

A breakdown of performance on a sample of individual homographs follows.

    Word        Pron. 1      Pron. 2        Sample Size   Prior Prob.   % Correct
    lives[8]    /laɪvz/      /lɪvz/              -             -             -
    wound       /waʊnd/      /wuːnd/             -             -             -
    lead (N)    /liːd/       /lɛd/               -             -             -
    tear (N)    /tɛər/       /tɪər/              -             -             -
    axes (N)    /ˈæksɪz/     /ˈæksiːz/           -             -             -
    Jan         /dʒæn/       /jɑːn/              -             -             -
    routed      /ˈruːtɪd/    /ˈraʊtɪd/           -             -             -
    bass        /beɪs/       /bæs/               -             -             -
    Nice        /naɪs/       /niːs/              -             -             -
    Begin       /bɪˈɡɪn/     /ˈbeɪɡɪn/           -             -             -
    Chi         /tʃiː/       /kaɪ/               -             -             -
    Colon       /koʊˈloʊn/   /ˈkoʊlən/           -             -             -
    St          /seɪnt/      /striːt/            -             -             -
    in          /ɪntʃ/       /ˈɪntʃɪz/           -             -             -
    III         three        the third           -             -             -
    IV          /aɪ viː/     numeric             -             -             -
    IV          numeric      the fourth          -             -             -
    VII         seven        the seventh         -             -             -
    AVERAGE                                                    -             -

[8] As a standard for comparison, the PARTS tagger achieves lower accuracy on this test data for lives and wound. A primary reason for the difference in performance is the lexicalization issue discussed in Section 2.


Evaluation in each case is based on cross-validation, using held-out test data for a more accurate estimate of system performance. Unless otherwise specified in the text, these results are based on the simplest and most readily replicable options in the algorithm above, and are hence representative of the performance that can be expected from the most straightforward implementation. Using more sophisticated interpolation techniques yields performance above this baseline. The sources of the test and training data are described in Section 3, Step 1.

6. Discussion and Conclusions

The algorithm presented here has several advantages which make it suitable for general lexical disambiguation tasks that require attention to both semantic and syntactic context. The incorporation of word (and optionally part-of-speech) trigrams allows the modeling of many local syntactic and semantic constraints, while collocational evidence in a wider context allows for topic-based semantic distinctions. A key advantage of this approach is that it allows the use of multiple, highly non-independent evidence types (such as root form, inflected form, part of speech, thesaurus category, or application-specific clusters) and does so in a way that avoids the complex modeling of statistical dependencies. This allows the decision lists to find the level of representation that best matches the observed probability distributions. It is a kitchen-sink approach of the best kind: throw in many types of potentially relevant features and watch what floats to the top.

While there are certainly other ways to combine such evidence, this approach has many advantages. In particular, precision seems to be at least as good as that achieved with Bayesian methods applied to the same evidence. This is not surprising, given the observation in [LTV93] that widely divergent sense-disambiguation algorithms tend to perform roughly the same given the same evidence. The distinguishing criteria therefore become:

- How readily can new and multiple types of evidence be incorporated into the algorithm?
- Are probability estimates provided with a classification?
- How easy is it to understand the resulting decision procedure and the reasons for any given classification?
- Can the resulting decision procedure be easily edited by hand?
- Is the algorithm simple to implement, and can it be applied quickly to new domains?

The current algorithm rates very highly on all these standards of evaluation, especially relative to some of the impenetrable black boxes produced by many algorithms. Its output is highly perspicuous: the resulting decision list is organized like a recipe, with the most useful evidence first and in highly readable form. The generated decision procedure is also easy to augment by hand, changing or adding patterns to the list. The algorithm is also extremely flexible: it is quite straightforward to use any new feature for which a probability distribution can be calculated. This is a considerable strength relative to other algorithms, which are more constrained in their ability to handle diverse types of evidence. In a comparative study [Yar94b], the decision list algorithm outperformed both an N-gram tagger and a Bayesian classifier, primarily because it could effectively integrate a wider range of available evidence types.

Overall, the decision list algorithm demonstrates considerable hybrid vigor, combining the strengths of N-gram taggers, Bayesian classifiers, and decision trees in a highly effective, general-purpose decision procedure for lexical ambiguity resolution.

Acknowledgments: This research was conducted in affiliation with the Linguistics Research Department of AT&T Bell Laboratories. It was also supported by an NDSEG Graduate Fellowship and by ARPA and ARO grants. The author would like to thank Jason Eisner and Mitch Marcus for their very helpful comments.

References

[BFOS84] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Monterey, CA, 1984.

[Bri93] E. Brill. A Corpus-Based Approach to Language Learning. PhD thesis, University of Pennsylvania, 1993.

[BDDM91] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, 1991.

[BW94] R. Bruce and J. Wiebe. Word-sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.

[Chu88] K. W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, 1988.

[DI94] I. Dagan and A. Itai. Word sense disambiguation using a second language monolingual corpus. Computational Linguistics, 1994.

[GCY92] W. Gale, K. Church, and D. Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 1992.

[GCY94] W. Gale, K. Church, and D. Yarowsky. Discrimination decisions for 100,000-dimensional spaces. In A. Zampolli, N. Calzolari, and M. Palmer, editors, Current Issues in Computational Linguistics: In Honour of Don Walker. Kluwer Academic Publishers, 1994.

[Hea91] M. Hearst. Noun homograph disambiguation using local context in large text corpora. In Using Corpora, University of Waterloo, Waterloo, Ontario, 1991.

[Jel85] F. Jelinek. Markov source modeling of text generation. In J. Skwirzinski, editor, Impact of Processing Techniques on Communication. Dordrecht, 1985.

[LTV93] C. Leacock, G. Towell, and E. Voorhees. Corpus-based statistical sense resolution. In Proceedings, ARPA Human Language Technology Workshop, Princeton, 1993.

[Mer90] B. Merialdo. Tagging text with a probabilistic model. In Proceedings of the IBM Natural Language ITL, Paris, France, 1990.

[MW64] F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, Massachusetts, 1964.

[Riv87] R. L. Rivest. Learning decision lists. Machine Learning, 2, 1987.

[SHY92] R. Sproat, J. Hirschberg, and D. Yarowsky. A corpus-based synthesizer. In Proceedings, International Conference on Spoken Language Processing, Banff, 1992.

[Yar92] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings, COLING-92, Nantes, 1992.

[Yar93] D. Yarowsky. One sense per collocation. In Proceedings, ARPA Human Language Technology Workshop, Princeton, NJ, 1993.

[Yar94] D. Yarowsky. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.

[Yar94b] D. Yarowsky. A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Proceedings, 2nd Annual Workshop on Very Large Corpora, Kyoto, 1994.