Corpus-based Semantic Lexicon Induction with Web-based Corroboration
Sean P. Igo
Center for High Performance Computing
University of Utah
Salt Lake City, UT 84112 USA
[email protected]

Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
[email protected]
Proceedings of the NAACL HLT Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, pages 18–26, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Abstract

Various techniques have been developed to automatically induce semantic dictionaries from text corpora and from the Web. Our research combines corpus-based semantic lexicon induction with statistics acquired from the Web to improve the accuracy of automatically acquired domain-specific dictionaries. We use a weakly supervised bootstrapping algorithm to induce a semantic lexicon from a text corpus, and then issue Web queries to generate co-occurrence statistics between each lexicon entry and semantically related terms. The Web statistics provide a source of independent evidence to confirm, or disconfirm, that a word belongs to the intended semantic category. We evaluate this approach on 7 semantic categories representing two domains. Our results show that the Web statistics dramatically improve the ranking of lexicon entries, and can also be used to filter incorrect entries.

1 Introduction

Semantic resources are extremely valuable for many natural language processing (NLP) tasks, as evidenced by the wide popularity of WordNet (Miller, 1990) and a multitude of efforts to create similar "WordNets" for additional languages (e.g., (Atserias et al., 1997; Vossen, 1998; Stamou et al., 2002)). Semantic resources can take many forms, but one of the most basic types is a dictionary that associates a word (or word sense) with one or more semantic categories (hypernyms). For example, truck might be identified as a VEHICLE, and dog might be identified as an ANIMAL. Automated methods for generating such dictionaries have been developed under the rubrics of lexical acquisition, hyponym learning, semantic class induction, and Web-based information extraction. These techniques can be used to rapidly create semantic lexicons for new domains and languages, and to automatically increase the coverage of existing resources.

Techniques for semantic lexicon induction can be subdivided into two groups: corpus-based methods and Web-based methods. Although the Web can be viewed as a (gigantic) corpus, these two approaches tend to have different goals. Corpus-based methods are typically designed to induce domain-specific semantic lexicons from a collection of domain-specific texts. In contrast, Web-based methods are typically designed to induce broad-coverage resources, similar to WordNet. Ideally, one would hope that broad-coverage resources would be sufficient for any domain, but this is often not the case. Many domains use specialized vocabularies and jargon that are not adequately represented in broad-coverage resources (e.g., medicine, genomics, etc.). Furthermore, even relatively general text genres, such as news, contain subdomains that require extensive knowledge of specific semantic categories. For example, our work uses a corpus of news articles about terrorism that includes many arcane weapon terms (e.g., M-79, AR-15, an-fo, and gelignite). Similarly, our disease-related documents mention obscure diseases (e.g., psittacosis) and contain many informal terms, abbreviations, and spelling variants that do not even occur in most medical dictionaries. For example, yf refers to yellow fever, tularaemia is an alternative spelling for tularemia, and nv-cjd is frequently used to refer to new variant Creutzfeldt-Jakob Disease.
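The kind of resource being induced can be pictured as a simple mapping from words to hypothesized semantic categories. The sketch below is purely illustrative: the entries are examples mentioned in this paper, not output of the system.

```python
# A semantic lexicon: each word maps to one or more semantic
# categories (hypernyms). Entries are illustrative examples from the text.
semantic_lexicon = {
    "truck": {"VEHICLE"},
    "dog": {"ANIMAL"},
    "gelignite": {"WEAPON"},     # arcane weapon term from the terrorism corpus
    "yf": {"DISEASE"},           # informal abbreviation for yellow fever
    "tularaemia": {"DISEASE"},   # spelling variant of tularemia
}

def categories(word):
    """Return the semantic categories hypothesized for a word (empty set if unknown)."""
    return semantic_lexicon.get(word, set())
```

A broad-coverage resource would answer the truck and dog lookups, but the domain-specific entries below them are exactly the ones such resources tend to miss.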
The Web is such a vast repository of knowledge that specialized terminology for nearly any domain probably exists in some niche or cranny, but finding the appropriate corner of the Web to tap into is a challenge. You have to know where to look to find specialized knowledge. In contrast, corpus-based methods have utilized noun co-occurrence statistics (Riloff and Shepherd, 1997; Roark and Charniak, 1998), syntactic information (Widdows and Dorow, 2002; Phillips and Riloff, 2002; Pantel and Ravichandran, 2004; Tanev and Magnini, 2006), and lexico-syntactic contextual patterns (e.g., "resides in
          ANIMAL                  DISEASE                 SYMPTOM
N     Ba   Hy   Av   Mx      Ba   Hy   Av   Mx      Ba   Hy   Av   Mx
25   .48  .88  .92  .92     .64  .84  .80  .84     .64  .84  .92  .80
50   .58  .82  .84  .80     .72  .84  .60  .82     .62  .76  .90  .74
75   .55  .68  .67  .69     .69  .83  .59  .81     .61  .68  .79  .71
100  .45  .55  .54  .57     .69  .78  .58  .80     .59  .71  .77  .64
300  .20  .62  .38
Table 2: Ranking results for 7 semantic categories, showing accuracies for the top-ranked N words. (Ba=Basilisk, Hy=Hypernym Re-ranking, Av=Average of Seeds Re-ranking, Mx=Max of Seeds Re-ranking)

labeled each word as either correct or incorrect for the hypothesized semantic class. A word is considered to be correct if any sense of the word is semantically correct.

4.2 Ranking Results

We ran Basilisk for 60 iterations, learning 5 new words in each bootstrapping cycle, which produced a lexicon of 300 words for each semantic category. The columns labeled Ba in Table 2 show the accuracy results for Basilisk.[6] As we explained in Section 3.1, accuracy tends to decrease as bootstrapping progresses, so we computed accuracy scores for the top-ranked 100 words, in increments of 25, and also for the entire lexicon of 300 words.

Overall, we see that Basilisk learns many correct words for each semantic category, and the top-ranked terms are generally more accurate than the lower-ranked terms. For the top 100 words, accuracies are generally in the 50-70% range, except for LOCATION, which achieves about 80% accuracy. For the HUMAN category, Basilisk obtained 82% accuracy over all 300 words, but the top-ranked words actually produced lower accuracy.

Basilisk's ranking is clearly not as good as it could be because there are correct terms co-mingled with incorrect terms throughout the ranked lists. This has two ramifications. First, if we want a human to manually review each lexicon before adding the words to an external resource, then the rankings may not be very helpful (i.e., the human will need to review all of the words). Second, incorrect terms generated during the early stages of bootstrapping may be hindering the learning process because they introduce noise during bootstrapping. The HUMAN category seems to have recovered from early mistakes, but the lower accuracies for some other categories may be the result of this problem. The purpose of our Web-based corroboration process is to automatically re-evaluate the lexicons produced by Basilisk, using Web-based statistics to create more separation between the good entries and the bad ones.

Our first set of experiments uses the Web-based co-occurrence statistics to re-rank the lexicon entries. The Hy, Av, and Mx columns in Table 2 show the re-ranking results using each of the Hypernym, Average of Seeds, and Maximum of Seeds scoring functions. In all cases, Web-based re-ranking outperforms Basilisk's original rankings. Every semantic category except for BUILDING yielded accuracies of 80-100% among the top candidates. For each row, the highest accuracy for each semantic category is shown in boldface (as are any tied for highest). Overall, the Max of Seeds scores were best, performing better than or as well as the other scoring functions on 5 of the 7 categories.

[6] These results are not comparable to the Basilisk results reported by Thelen and Riloff (2002) because our implementation only does single-category learning while the results in that paper are based on simultaneously learning multiple categories.
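The re-ranking and filtering experiments can be sketched as follows. This is a simplified reconstruction, not the authors' implementation: hitcount() stands in for the search-engine hit counts described in Section 3.2, every term and number is invented for illustration, and the exact log-based formula is an assumption (chosen so that scores come out as negative log values on the same scale as the θ thresholds in Table 4); only the -99999 special case for zero hit counts is taken directly from the text.

```python
import math

# Illustrative hit counts standing in for live Web queries (Section 3.2);
# all terms and counts here are invented for this sketch.
HITS = {
    "anthrax": 2_500_000,
    "cattle": 40_000_000,
    "virus": 150_000_000,
    ("anthrax", "cattle"): 900_000,
    ("anthrax", "virus"): 1_200_000,
}

def hitcount(*terms):
    """Hit count for one term, or co-occurrence count for a pair of terms."""
    key = terms[0] if len(terms) == 1 else terms
    return HITS.get(key, 0)

def cooccurrence_score(word, seed):
    """Assumed log-scale association between a candidate word and a seed.
    Zero hit counts receive the special-case score of -99999 (Section 3.2)."""
    joint = hitcount(word, seed)
    denom = hitcount(word) * hitcount(seed)
    if joint == 0 or denom == 0:
        return -99999.0
    return math.log2(joint / denom)  # negative, like the thresholds in Table 4

def max_of_seeds(word, seeds):
    """Max of Seeds: credit a word with its single best-matching seed."""
    return max(cooccurrence_score(word, s) for s in seeds)

def avg_of_seeds(word, seeds):
    """Average of Seeds: one zero-hit seed (-99999) drags the mean down,
    which may explain this function's weaker performance."""
    return sum(cooccurrence_score(word, s) for s in seeds) / len(seeds)

def rerank(lexicon, seeds):
    """Re-rank lexicon entries by their Max of Seeds score (best first)."""
    return sorted(lexicon, key=lambda w: max_of_seeds(w, seeds), reverse=True)

def filter_lexicon(lexicon, seeds, theta):
    """Discard every word whose Max of Seeds score falls below theta."""
    return [w for w in lexicon if max_of_seeds(w, seeds) >= theta]
```

Under these assumptions, the Hypernym scoring function of Table 2 corresponds to cooccurrence_score(word, hypernym) for the category's hypernym term, and the filtering experiments of Section 4.3 correspond to filter_lexicon() with θ set to -22, -23, or -24.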
It was only outperformed once by the Hypernym scores (BUILDING) and once by the Average of Seeds scores (SYMPTOM).

BUILDING        HUMAN         LOCATION       WEAPON      ANIMAL        DISEASE               SYMPTOM
consulate       guerrilla     San Salvador   shotguns    bird-to-bird  meningo-encephalitis  nausea
pharmacies      extremists    Las Hojas      carbines    cervids       bse).austria          diarrhoea
aiport          sympathizers  Tejutepeque    armaments   goats         inhalational          myalgias
zacamil         assassins     Ayutuxtepeque  revolvers   ewes          anthrax disease       chlorosis
airports        patrols       Copinol        detonators  ruminants     otitis media          myalgia
parishes        militiamen    Cuscatancingo  pistols     swine         airport malaria       salivation
Masariegos      battalion     Jiboa          car bombs   calf          taeniorhynchus        dysentery
chancery        Ellacuria     Chinameca      calibers    lambs         hyopneumonia          cramping
residences      rebel         Zacamil        M-16        wolsington    monkeypox             dizziness
police station  policemen     Chalantenango  grenades    piglets       kala-azar             inappetance

Table 3: Top 10 words ranked by Max of Seeds scores.

The strong performance of the Max of Seeds scores suggests that one seed is often an especially good collocation indicator for category membership, though it may not be the same seed word for all of the lexicon words. The relatively poor performance of the Average of Seeds scores may be attributable to the same principle; perhaps even if one seed is especially strong, averaging over the less-effective seeds' scores dilutes the results. Averaging is also susceptible to damage from words that receive the special-case score of -99999 when a hit count is zero (see Section 3.2).

Table 3 shows the 10 top-ranked candidates for each semantic category based on the Max of Seeds scores. The table illustrates that this scoring function does a good job of identifying semantically correct words, although of course there are some mistakes. Mistakes can happen due to parsing errors (e.g., bird-to-bird is an adjective and not a noun, as in bird-to-bird transmission), and some are due to issues associated with Web querying. For example, the nonsense term "bse).austria" was ranked highly because AltaVista split this term into 2 separate words because of the punctuation, and bse by itself is indeed a disease term (bovine spongiform encephalopathy).

4.3 Filtering Results

Table 2 revealed that the 300-word lexicons produced by Basilisk vary widely in the number of true category words that they contain. The least dense category is ANIMAL, with only 61 correct words, and the most dense is HUMAN, with 247 correct words. Interestingly, the densest categories are not always the easiest to rank. For example, the HUMAN category is the densest category, but Basilisk's ranking of the human terms was poor.

 θ    Category   Acc   Cor/Tot
-22   WEAPON     .88    46/52
      LOCATION   .98    59/60
      HUMAN      .80     8/10
      BUILDING   .83      5/6
      ANIMAL     .91    30/33
      DISEASE    .82    64/78
      SYMPTOM    .65    64/99
-23   WEAPON     .79    59/75
      LOCATION   .96    82/85
      HUMAN      .85    23/27
      BUILDING   .71    12/17
      ANIMAL     .87    40/46
      DISEASE    .78   82/105
      SYMPTOM    .62   86/139
-24   WEAPON     .63   63/100
      LOCATION   .93  111/120
      HUMAN      .87    54/62
      BUILDING   .45    17/38
      ANIMAL     .75    47/63
      DISEASE    .74   94/127
      SYMPTOM    .60  100/166

Table 4: Filtering results using the Max of Seeds scores.

The ultimate goal behind a better ranking mechanism is to completely automate the process of semantic lexicon induction. If we can produce high-quality rankings, then we can discard the lower-ranked words and keep only the highest-ranked words for our semantic dictionary. However, this presupposes that we know where to draw the line between the good and bad entries, and Table 2 shows that this boundary varies across categories. For HUMANS, the top 100 words are 87% accurate, and in fact we get 82% accuracy over all 300 words. But for ANIMALS we achieve 80% accuracy only for the top 50 words. It is paramount for semantic dictionaries to have high integrity, so accuracy must be high if we want to use the resulting lexicons without manual review.

As an alternative to ranking, another way that we could use the Web-based corroboration statistics is to automatically filter words that do not receive a high score. The key question is whether the values of the scores are consistent enough across categories to set a single threshold that will work well across the different categories.

Table 4 shows the results of using the Max of Seeds scores as a filtering mechanism: given a threshold θ, all words that have a score < θ are discarded. For each threshold value θ and semantic category, we computed the accuracy (Acc) of the lexicon after all words with a score < θ have been removed. The Cor/Tot column shows the number of correct category members and the total number of words that passed the threshold.

We experimented with a variety of threshold values and found that θ=-22 performed best. Table 4 shows that this threshold produces a relatively high-precision filtering mechanism, with 6 of the 7 categories achieving lexicon accuracies ≥ 80%. As expected, the Cor/Tot column shows that the number of words varies widely across categories. Automatic filtering represents a trade-off: a relatively high-precision lexicon can be created, but some correct words will be lost. The threshold can be adjusted to increase the number of learned words, but with a corresponding drop in precision. Depending upon a user's needs, a high threshold may be desirable to identify only the most confident lexicon entries, or a lower threshold may be desirable to retain most of the correct entries while reliably removing some of the incorrect ones.

5 Conclusions

We have demonstrated that co-occurrence statistics gathered from the Web can dramatically improve the ranking of lexicon entries produced by a weakly-supervised corpus-based bootstrapping algorithm, without requiring any additional supervision. We found that computing Web-based co-occurrence statistics across a set of seed words and then using the highest score was the most successful approach. Co-occurrence with a hypernym term also performed well for some categories, and could be easily combined with the Max of Seeds approach by choosing the highest value among the seeds as well as the hypernym.

In future work, we would like to incorporate this Web-based re-ranking procedure into the bootstrapping algorithm itself to dynamically "clean up" the learned words before they are cycled back into the bootstrapping process. Basilisk could consult the Web-based statistics to select the best 5 words to generate before the next bootstrapping cycle begins. This integrated approach has the potential to substantially improve Basilisk's performance because it would improve the precision of the induced lexicon entries during the earliest stages of bootstrapping, when the learning process is most fragile.

Acknowledgments

Many thanks to Julia James for annotating the gold standards for the disease domain. This research was supported in part by Department of Homeland Security Grant N0014-07-1-0152.

References

J. Atserias, S. Climent, X. Farreres, G. Rigau, and H. Rodriguez. 1997. Combining Multiple Methods for the Automatic Construction of Multilingual WordNets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing.

S. Caraballo. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120–126.

P. Cimiano and J. Volker. 2005. Towards large-scale, open-domain and ontology-based named entity classification. In Proc. of Recent Advances in Natural Language Processing, pages 166–172.

D. Davidov and A. Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proc. of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL.

D. Davidov, A. Rappoport, and M. Koppel. 2007. Fully unsupervised discovery of concept-specific relationships by web mining. In Proc. of the 45th Annual Meeting of the Association of Computational Linguistics, pages 232–239, June.

O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91–134, June.

Z. Harris. 1954. Distributional Structure. In J. A. Fodor and J. J. Katz, editors, The Structure of Language, pages 33–49. Prentice-Hall.

M. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th International Conference on Computational Linguistics (COLING-92).

Z. Kozareva, E. Riloff, and E. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08).

D. Lin and P. Pantel. 2002. Concept discovery from text. In Proc. of the 19th International Conference on Computational Linguistics, pages 1–7.

D. Lin. 1998. Dependency-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada, Spain.

G. Mann. 2002. Fine-grained proper noun ontologies for question answering. In Proc. of the 19th International Conference on Computational Linguistics, pages 1–7.

G. Miller. 1990. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4).

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann.

M. Paşca. 2004. Acquisition of categorized named entities for web search. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 137–145.

M. Paşca. 2007. Weakly-supervised Discovery of Named Entities using Web Search Queries. In CIKM, pages 683–690.

P. Pantel and D. Ravichandran. 2004. Automatically labeling semantic classes. In Proc. of Conference of HLT / North American Chapter of the Association for Computational Linguistics, pages 321–328.

P. Pantel, D. Ravichandran, and E. Hovy. 2004. Towards terascale knowledge acquisition. In Proc. of the 20th International Conference on Computational Linguistics, page 771.

W. Phillips and E. Riloff. 2002. Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 125–132.

ProMed-mail. 2006. http://www.promedmail.org/.

PubMed. 2009. http://www.ncbi.nlm.nih.gov/sites/entrez.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124.

E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049. The AAAI Press/MIT Press.

B. Roark and E. Charniak. 1998. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110–1116.

Sofia Stamou, Kemal Oflazer, Karel Pala, Dimitris Christoudoulakis, Dan Cristea, Dan Tufis, Svetla Koeva, George Totkov, Dominique Dutoit, and Maria Grigoriadou. 2002. Balkanet: A multilingual semantic network for the Balkan languages. In Proceedings of the 1st Global WordNet Association Conference.

H. Tanev and B. Magnini. 2006. Weakly supervised approaches for ontology population. In Proc. of the 11th Conference of the European Chapter of the Association for Computational Linguistics.

M. Thelen and E. Riloff. 2002. A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 214–221.

Peter D. Turney. 2001. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502, London, UK. Springer-Verlag.

P. D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pages 417–424.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA.

D. Widdows and B. Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proc. of the 19th International Conference on Computational Linguistics, pages 1–7.