Word Sense Disambiguation Using Sense Signatures Automatically Acquired from a Second Language

Word Sense Disambiguation Using Sense Signatures Automatically Acquired from a Second Language Document Number Working Paper 6.12h(b) (3rd Round) Project ref. IST-2001-34460 Project Acronym MEANING Project full title Developing Multilingual Web-scale Language Technologies Project URL http://www.lsi.upc.es/~nlp/meaning/meaning.html Availability Public Authors: Xinglong Wang (UoS) and John Carroll (UoS) INFORMATION SOCIETY TECHNOLOGIES WP6.12-Working Paper 6.12h(b) (3rd Round) Version: Final WSD Using Sense Signatures Page : 1 Project ref. IST-2001-34460 Project Acronym MEANING Project full title Developing Multilingual Web-scale Language Technologies Security (Distribution level) Public Contractual date of delivery February 2004 Actual date of delivery February 2004 Document Number Working Paper 6.12h(b) (3rd Round) Type Report Status & version Final Number of pages 14 WP contributing to the deliverable WP6.12 WPTask responsible Eneko Agirre Authors Xinglong Wang (UoS) and John Carroll (UoS) Other contributors Reviewer EC Project O±cer Evangelia Markidou Authors: Xinglong Wang (UoS) and John Carroll (UoS) Keywords: Word Sense Disambiguation, Sense Signature Acquisition Abstract: We present a novel near-unsupervised approach to the task of Word Sense Disambiguation (WSD). For each sense of a word with multiple meanings we produce a \sense signature": a set of words or snippets of text that tend to co-occur with that sense. We build sense signatures automatically, using large quantities of Chinese text, and English-Chinese and Chinese-English bilingual dictionaries, taking advantage of the observation that mappings between words and meanings are often di®erent in typologically distant languages. We train a classi¯er on the sense signatures and test it on two gold standard English WSD datasets, one for binary and the other for multi-way sense identi¯cation. Both evaluations give results that exceed previous state-of-the-art results for comparable systems. We also demonstrate that a little manual e®ort can improve sense signature quality, as measured by WSD accuracy. IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies WP6.12-Working Paper 6.12h(b) (3rd Round) Version: Final WSD Using Sense Signatures Page : 2 Contents 1 Introduction 3 2 Acquisition of Sense Signatures 4 3 Experiments and Results 5 3.1 Building Sense Signatures . 5 3.2 WSD Experiments on TWA dataset . 7 3.3 WSD Experiments on Senseval-2 Lexical Sample dataset . 10 4 Conclusions and Future Work 12 IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies WP6.12-Working Paper 6.12h(b) (3rd Round) Version: Final WSD Using Sense Signatures Page : 3 1 Introduction We call a set of words or text snippets that tend to co-occur with a certain sense of a word a sense signature. More formally, a sense signature SS of sense s with a length n is de¯ned as: SSs = fw1; w2; :::; wng, where wi denotes a word form. Note that word forms in a sense signature can be ordered or unordered: when ordered, they are a piece of text snippet, and when unordered, they are just a bag of words. In this paper, word forms in sense signatures are unordered. For example, a sense signature for the serving sense of waiter may be frestaurant, menu, waitress, dinnerg, while the sense signature fhospital, station, airport, boyfriendg may be one for the waiting sense of the word. The term \sense signature" is derived from the term \topic signature"1, which refers to a set of words that tend to be associated with a certain topic. Topic signatures were originally proposed by Lin and Hovy [2000] in the context of Text Summarisation. They developed a text summarisation system containing a topic identi¯cation module that was trained on topic signatures. Their topic signatures were extracted from manually topic- tagged text. Agirre et al. [2001] proposed an unsupervised method for building topic signatures. Their system queries an Internet search engine with monosemous synomyms of words that have multiple senses in WordNet [Miller et al., 1990], and then extracts topic signatures by processing text snippets returned by the search engine. This way, they can attach a set of topic signatures to each synset in WordNet. They evaluated their topic signatures on a WSD task using 7 words drawn from SemCor [Miller et al., 1993], but found their topic signatures were not very useful for WSD. Despite Agirre et al.'s disappointing result, a means of automatically and accurately producing sense (topic) signatures would be one way of addressing the important problem of the lack of training data for machine learning based approaches to WSD. Another approach to this problem is to automatically produce sense-tagged data using parallel bilingual corpora [Gale et al., 1992; Diab and Resnik, 2002], exploiting the assumption that mappings between words and meanings are di®erent in di®erent languages. One problem with relying on bilingual corpora for data collection is that bilingual corpora are rare, and aligned bilingual corpora are even rarer. Mining the Web for bilingual text [Resnik, 1999] is a solution, but it is still di±cult to obtain su±cient quantities of high quality data. Another problem is that if two languages are closely related, useful data for some words cannot be collected because di®erent senses of polysemous words in one language often translate to the same word in the other. In the second round of the MEANING project, we [2004] used this cross-language paradigm to acquire sense signatures for English senses from large amounts of Chinese text and English-Chinese and Chinese-English bilingual dictionaries. We showed the signatures to be useful as training data for a binary disambiguation task. The WSD system described in this paper is based on our previous work. We automatically build sense signatures in a similar way and then train a NaÄ³ve Baysian classi¯er on them. We have evaluated the 1We use the term \sense signature" because it more precisely reﬂects our motivation of creating this resource { for the purpose of Word Sense Disambiguation (WSD), though \topic signature" has been more often used in the literature. IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies WP6.12-Working Paper 6.12h(b) (3rd Round) Version: Final WSD Using Sense Signatures Page : 4 system on the TWA Sense Tagged Dataset [Mihalcea, 2003], as well as the English Lexical Sample Dataset from Senseval-2. Our results show conclusively that such sense signatures can be used successfully in a full-scale WSD task. The reminder of the paper is organised as follows. Section 2 brieﬂy reviews the acquisition algorithm for sense signatures. Section 3 describes details of building this resource and demonstrates our applications of sense signatures to WSD. We also present results and analysis in this section. Finally, we conclude in Section 4 and talk about future work. 2 Acquisition of Sense Signatures Following the assumption that word senses are lexicalised di®erently between languages, we [Wang, 2004] proposed an unsupervised method to automatically acquire English sense signatures using large quantities of Chinese text and English-Chinese and Chinese-English dictionaries. The Chinese language was chosen since it is a distant language from English and the more distant two languages are, the more likely that senses are lexicalised di®erently [Resnik and Yarowsky, 1999]. We also showed that the topic signatures are helpful for a small-scale WSD task. The underlying assumption of this approach is that each sense of an ambiguous English word corresponds to a distinct translation in Chinese. As English ambiguous word w Sense 1 of w Sense 2 of w English-Chinese Lexicon Chinese translation of Chinese translation of sense 1 Chinese sense 2 Search Engine Chinese text snippet 1 Chinese text snippet 1 Chinese Chinese text snippet 2 Chinese text snippet 2 Segementation ... ... ... ... Chinese- English {English sense signature 1 {English sense signature 1 Lexicon for sense 1 of w} for sense 2 of w} {English sense signature 2 {English sense signature 2 for sense 1 of w} for sense 2 of w} ... ... ... ... Figure 1: Process of automatic acquisition of sense signatures proposed by us. For simplicity, assume w has two senses. shown in Figure 1, ¯rstly, our system translates senses of an English word into Chinese words, using an English-Chinese dictionary, and then retrieves text snippets from a large amount of Chinese text, with the Chinese translations as queries. Secondly, the Chinese text snippets are segmented and then translated back to English word by word, using a Chinese-English dictionary. For each sense, a set of sense signatures is produced. As an example, suppose one wants to retrieve sense signatures for the ¯nancial sense of interest. One would ¯rst look up the Chinese translations of this sense in an English-Chinese dictionary, and ¯nd that |E is the right Chinese translation corresponding to this particular IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies WP6.12-Working Paper 6.12h(b) (3rd Round) Version: Final WSD Using Sense Signatures Page : 5 sense. Then, the next stage is to automatically build a collection of Chinese text snippets by either searching in a large Chinese corpus or on the Web, using |E as query. As Chi- nese is a language written without spaces between words, one needs to use a segmentor to mark word boundaries before translating the snippets word by word back to English. The result is a collection of sense signatures for the ¯nancial sense of interest. For example, finterest rate, bank, annual, economy, ...g might be one of them. We trained a second- order co-occurrence vector model on the sense signatures and tested on the TWA Binary sense-tagged dataset, with reasonable results.

Word Sense Disambiguation Using Sense Signatures Automatically Acquired from a Second Language

POS Tagging for Improving Code-Switching Identification In

SELECTING and CREATING a WORD LIST for ENGLISH LANGUAGE TEACHING by Deny A

Modeling and Encoding Traditional Wordlists for Machine Applications

THE IMPACT of USING WORDLISTS in the LANGUAGE CLASSROOM on STUDENTS’ VOCABULARY ACQUISITION Gülçin Coşgun Ozyegin University

All-Words Word Sense Disambiguation Using Concept Embeddings

A Web-Based Vocabulary Profiler for Chinese Language Teaching and Research

Simple Features for Statistical Word Sense Disambiguation

HSK Word List - Level 3 HSK Word List - Level 3

Cross-Linguistic Word Frequency Visualization for PT and EN

Using Word Lists to Learn Second Language Vocabulary Is Unproductive

A Referred Word List Analyzing Tool with Keyword, Concordancing and N-Gram Functions∗∗∗

A New General Service List: the Better Mousetrap We’Ve Been Looking For? Charles Browne Meiji Gakuin University Doi