Similarity-Based Approaches to Natural Language Processing
The Harvard community has made this article openly available.

Citation: Lee, Lillian Jane. 1997. Similarity-Based Approaches to Natural Language Processing. Harvard Computer Science Group Technical Report TR-11-97.

Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:25104914

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Similarity-Based Approaches to Natural Language Processing
Lillian Jane Lee
TR-11-97
Center for Research in Computing Technology
Harvard University
Cambridge, Massachusetts

A thesis presented by Lillian Jane Lee to The Division of Engineering and Applied Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science. Harvard University, Cambridge, Massachusetts. May 1997.

Copyright 1997 by Lillian Jane Lee. All rights reserved.

Abstract

Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and in fact perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about
similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity.

Our first approach is to build soft, hierarchical clusters: soft because each event belongs to each cluster with some probability; hierarchical because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents, to our knowledge, the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from millions of words of Associated Press Newswire text and from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends, with no modification, to other domains such as document clustering.

Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we, in essence, build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that, as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve substantial perplexity reductions over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction.

Acknowledgements

In my four years as a graduate student at Harvard, I have noticed a rather ominous trend. Aiken Computation Lab, where I spent so many days and nights, is currently slated for demolition. I worked at AT&T Bell Labs for several summers, and now this institution no longer exists. I finally began to catch on when, after a rash of repetitive strain injury cases broke out at Harvard, one of my fellow graduate students called the Center for Disease
Control to suggest that I might be the cause.

All this aside, I feel that I have been incredibly fortunate. First of all, I have had the Dream Team of NLP for my committee. Stuart Shieber, my advisor, has been absolutely terrific. The best way to sum up my interactions with him is: he never let me get away with anything. The stuff I've produced has been the clearer and the better for it. Barbara Grosz has been wonderfully supportive; she also throws a mean brunch. And there is just no way I can thank Fernando Pereira enough. He is the one who started me off on this research enterprise in the first place, and truly deserves his title of mentor.

I'd also like to thank Harry Lewis for the whole CS experience; Les Valiant and Alan Yuille for being on my oral exam committee; and Margo Seltzer for advice and encouragement.

There have been a number of people who made the grad school process much, much easier to deal with: Mike "at least you're not in medieval history" Bailey, Ellie Baker, Michael Bender, Alan Capil, Stan Chen, the one and only Ric Crabbe, Adam Deaton, Joshua Goodman (for the colon), Carol Harlow, Fatima Holowinsky, Bree Horwitz, Andy Kehler, Anne Kist, Bobby Kleinberg, David Mazieres, Jeff Miller, Christine Nakatani, Wheeler Ruml, Kathy Ryall (my job buddy), Ben Scarlet, Rocco Servedio (for the citroids), Jen Smith, Nadia Shalaby, Carol Sandstrom and Chris Small (and Emily and Sophie, the best listeners in the entire universe), Peg Schafer, Keith Smith, Chris Thorpe, and Tony Yan.

AT&T was a wonderful place to work, and I am quite grateful for having had the opportunity to talk with all of the following people: Hiyan Alshawi, Ido Dagan, Don Hindle, Julia Hirschberg, Yoram Singer, Tali Tishby, and David Yarowsky. I'd also like to mention the summer students at AT&T, who made it so difficult to get anything done: David Ahn, Tom Chou, Tabora Constantennia, Raj Iyer, Charles Isbell, Kim Knowles, Nadya Mason, Leah McKissic, Diana Meadows, Andrew Ng, Marianne Shaw, and Ben Slusky. Rebecca Hwa deserves special
mention. She was always there to say, "Now just calm down, Lillian," whenever I got stressed out. I thank her for all those trips to Toscanini's, and forgive her for trying to make me break my Aiken rule. Maybe one day I'll even get around to giving her her LaTeX book back.

And finally, deepest thanks to my mom and dad, my sister Charlotte, and Jon Kleinberg: they always believed, even when I didn't.

The work described in this thesis was supported in part by the National Science Foundation under Grant No. IRI. I also gratefully acknowledge support from an NSF Graduate Fellowship and a grant from the AT&T Labs Fellowship Program (formerly the AT&T Bell Labs Graduate Research Program for Women).

Bibliographic Notes

Portions of this thesis are joint work and have appeared elsewhere. The chapter on distributional clustering is based on the paper "Distributional Clustering of English Words", with Fernando Pereira and Naftali Tishby (Pereira, Tishby, and Lee), which appeared in the proceedings of the 31st meeting of the ACL. We thank Don Hindle for making available the Associated Press verb-object data set, the Fidditch parser, and a verb-object structure filter; Mats Rooth for selecting the data set consisting of objects of "fire" and for many discussions; David Yarowsky for help with his stemming and concordancing tools; and Ido Dagan for suggesting ways to test cluster models.

Other portions are adapted from "Similarity-Based Methods for Word Sense Disambiguation". This paper, co-written with Ido Dagan and Fernando Pereira, will appear in the proceedings of the 35th meeting of the ACL (Dagan, Lee, and Pereira). We thank Hiyan Alshawi, Joshua Goodman, Rebecca Hwa, Stuart Shieber, and Yoram Singer for many helpful comments and discussions.

Further material is based on work with Ido Dagan and Fernando Pereira that is described in "Similarity-Based Estimation of Word Cooccurrence Probabilities", which appeared in the proceedings of the 32nd Annual meeting of the ACL (Dagan, Pereira, and Lee). We thank Slava Katz for discussions on the
topic of this paper; Doug McIlroy for detailed comments; Doug Paul for help with his baseline backoff model; and Andre Ljolje and Michael Riley for providing the word lattices for our experiments.

Contents

Introduction

Distributional Similarity
  Objects as Distributions
  Initial Estimates for Distributions
  Measures of Distributional Similarity
    KL Divergence
    Total Divergence to the Mean
    Geometric Distances
  Similarity Statistics
  An Example
  Summary and Preview

Distributional Clustering
  Introduction
  Word Clustering
  Theoretical Basis
    Maximum-Entropy Cluster Membership
    Minimum-Distortion Cluster Centroids
    Hierarchical Clustering
  Clustering Examples
  Model Evaluation
    KL Divergence
    Decision Task
  Related Work
    Clustering in Natural Language Processing
    Probabilistic Clustering Methods
  Conclusions

Similarity-Based Estimation
  Introduction
  Chapter Overview
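As a concrete illustration of the similarity measures named in the abstract and contents (KL divergence and total divergence to the mean), here is a minimal sketch. It is not the thesis's implementation; the toy distributions are made up for the example, and real use would start from smoothed co-occurrence estimates so that the divergences stay finite.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    Asymmetric, and finite only when q(x) > 0 wherever p(x) > 0,
    which is why smoothed probability estimates matter in practice.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def total_divergence_to_mean(p, q):
    """A(p, q) = D(p || m) + D(q || m), with m the average of p and q.

    Symmetric and always finite, since the mean m is nonzero wherever
    either p or q is.
    """
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)

# Toy conditional distributions over three contexts (invented values):
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))             # note: differs from kl_divergence(q, p)
print(total_divergence_to_mean(p, q))  # same value as with p and q swapped
```

The asymmetry of the KL divergence versus the symmetry of the total divergence to the mean is the practical distinction between the two measures for comparing word co-occurrence distributions.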