
A New Approach for Idiom Identification Using Meanings and the Web∗

Rakesh Verma
Computer Science Dept.
University of Houston
Houston, TX, 77204, USA
[email protected]

Vasanthi Vuppuluri
Computer Science Dept.
University of Houston
Houston, TX, 77204, USA
[email protected]

∗ Research supported in part by NSF grants CNS 1319212, DUE 1241772 and DGE1433817.

Proceedings of Recent Advances in Natural Language Processing, pages 681–687, Hissar, Bulgaria, Sep 7–9 2015.

Abstract

There is a great deal of knowledge available on the Web, which represents a great opportunity for automatic, intelligent text processing and understanding, but the major problems are finding the legitimate sources of information and the fact that search engines provide page statistics, not occurrences. This paper presents a new, domain independent, general-purpose idiom identification approach. Our approach combines the knowledge of the Web with the knowledge extracted from dictionaries. This method can overcome the limitations of current techniques that rely on linguistic knowledge or statistics. It can recognize idioms even when the complete sentence is not present, and without the need for domain knowledge. It is currently designed to work with text in English but can be extended to other languages.

1 Introduction

Automatically extracting phrases from documents, be they structured, unstructured or semi-structured, has always been an important yet challenging task. The overall goal is to create an easily machine-readable text to process the sentences. In this paper we focus on identifying idioms from text. An idiom is a phrase made up of a sequence of two or more words that has properties that are not predictable from the properties of the individual words or their normal mode of combination. Recognition of idioms is a challenging problem with wide applications. Some examples of idioms are ‘yellow journalism,’ ‘kick the bucket,’ and ‘quick fix’. For example, the meaning of ‘yellow journalism’ cannot be derived from the meanings of ‘yellow’ and ‘journalism.’

Idioms play an important role in Natural Language Processing (NLP). They exist in almost all languages and are hard to extract as there is no algorithm that can precisely outline the structure of an idiom. Idioms are important for natural language generation and parsing, and significantly influence machine translation and semantic tagging. Idioms could also be useful in document indexing, information retrieval, and in text summarization or question-answering approaches that rely on extracting key words or phrases from the document to be summarized, e.g., (Barrera and Verma, 2011; Barrera and Verma, 2012; Barrera et al., 2011). Efficiently extracting idioms significantly improves many areas of NLP. But most of the idiom extraction techniques are biased in that they focus on a specific domain or make use of statistical techniques alone, which results in poor performance.

The technique in this paper makes use of knowledge from the Web combined with knowledge from dictionaries in deciding if a phrase is an idiom, rather than solely depending on frequency measures or following rules of a specific domain. The Web has been attractive to NLP researchers because it can solve the sparsity issue and its update latency is lower than for dictionaries, but its disadvantages are noise, lack of a good method for finding reliable sources and the coarseness of page statistics. Dictionaries are more reliable but they have higher update latency. Our work tries to minimize the disadvantages and maximize the advantages when combining these resources.

1.1 Contribution

This paper proposes a new idiom identification technique, which is general, domain independent and unsupervised in the sense that it requires no labeled datasets of idioms. The major problem with existing approaches is that most of them are supervised, requiring manually annotated data, and many of them impose syntactic restrictions, e.g., verb-particle, noun-verb, etc. Our technique makes use of carefully extracted reliable knowledge from the Web and dictionaries. Moreover, our technique can be extended to languages other than English, provided similar resources are available. Although our approach uses meanings, with the advancement of the Web, more and more phrase definitions are becoming available on the Web and thus the reliance on dictionaries can be reduced or even eliminated. However, in many cases, even though the definition of a phrase may be available, the phrase itself is not necessarily labeled as an idiom, so we cannot just do a simple lookup of a phrase and mark it as an idiom.

The rest of the paper is organized as follows. Section 2 presents previous work on idiom extraction and classification. In Section 3 we present our approach in detail. Section 4 presents the datasets and in Section 5 we present the experiments and comparisons. We conclude in Section 6.

2 Related Work

There is considerable work on extracting multiword expressions (MWEs), a superclass of idioms, e.g., (Zhang et al., 2006); (Villavicencio et al., 2007); (Li et al., 2008); (Spence et al., 2013); (Ramisch, 2014); (Marie and Constant, 2014); (Schneider et al., 2014); (Kordoni and Simova, 2014); (Yulia and Wintner, 2014). We do not cover this work here since our focus is on idioms.

Because of its importance, several researchers have investigated idiom identification. As mentioned in (Muzny and Zettlemoyer, 2013), prior work on this topic can be categorized into two streams: phrase classification, in which a phrase is always idiomatic or literal, e.g., (Gedigian et al., 2006); (Shutova et al., 2010), or token classification, in which each occurrence of a phrase is classified as either idiomatic or literal, e.g., (Birke et al., 2006); (Katz and Eugenie, 2006); (Li and Sporleder, 2009); (Fabienne et al., 2010); (Caroline et al., 2010); (Peng et al., 2014). Most work on the phrase classification stream imposes syntactic restrictions. Verb/Noun restriction is imposed in (Fazly et al., 2009) and (Diab and Pravin, 2009); subject/verb and verb/direct-object restrictions are imposed in (Shutova et al., 2010) and verb-particle restriction is imposed in (Ramisch et al., 2008). Portions of the American National Corpus were tagged for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses in (Laura et al., 2010).

To our knowledge, there are only a few general approaches for idiom identification in the phrase classification stream (Muzny and Zettlemoyer, 2013); (Feldman and Peng, 2013) and most of the techniques are supervised. A supervised technique for automatically identifying idiomatic dictionary entries with the help of online resources like Wiktionary is discussed in (Muzny and Zettlemoyer, 2013). There are three lexical features and five graph-based features in this technique, which model whether phrase meanings are constructed compositionally. The dataset consists of phrases, definitions, and example sentences from the English-language Wiktionary dump from November 13th, 2012. The lexical and graph-based features when used together yield F-scores of 40.1% and 62.0% when tested on the same dataset, once without annotating the idiom labels and once after providing the annotated labels. This approach, when combined with the Lesk word sense disambiguation algorithm and a Wiktionary label default rule, yields an F-score of 83.8%.

An unsupervised idiom extraction technique using Principal Component Analysis (PCA), treating idioms as semantic outliers, and a supervised technique based on Linear Discriminant Analysis (LDA) were described by (Feldman and Peng, 2013). The idea of treating idioms as outliers was tested on 99 sentences extracted from the British National Corpus (BNC) social science (non-fiction) section, containing 12 idioms, 22 dead metaphors and 2 living metaphors. The idea of idiom detection based on LDA was tested on 2,984 Verb-Noun Combination (VNC) tokens extracted from BNC described in (Fazly et al., 2009). These 2,984 tokens are translated into 2,550 sentences, of which 2,013 are idiomatic sentences and 537 are literal sentences. A variety of results were presented for PCA for different false positive rates ranging from 1 to 10% (one table with rates of 16-20%). For idioms only, the detection rates range from 44% at 1% false positive rate to 89% at 10% false positive rate.

Some of the work in the token classification stream, e.g., (Peng et al., 2014), relies on a list of potentially idiomatic expressions. Such a list can be generated using our technique.

3 Idiom Extraction Model

We now present the details of our approach for extracting idioms, which is implemented in Python and called IdiomExtractor. We focus on the meaning of the word idiom, i.e., “properties of individual words in a phrase differ from the properties of the phrase in itself.” Hence, we look at what individual words in a phrase mean and what the phrase means as a whole. If the meaning of the phrase is different from what the individual words in the phrase try to convey then, by definition of the word idiom, that phrase is an idiom.

Steps involved in the process of idiom extraction

$RDW_2 = \{RD_{21}, RD_{22}, RD_{23}, \ldots, RD_{2m}\}$, and so on.

Now, each of the words in the original phrase is replaced with its definitions, which results in a set of new phrases $P$ as follows:

$P = \{RD_{11}RD_{21} \ldots RD_{j1},\; RD_{12}RD_{21} \ldots RD_{j1},\; \ldots,\; RD_{1n}RD_{2m} \ldots RD_{jl}\}$

To avoid any confusion regarding how the procedure is implemented an example is provided below.

3.3 Subtraction

Each of the phrases present in $P$ is subtracted from each of the recreated definitions in $RD_p$ and the result is stored in set $S$.
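The substitution and subtraction steps can be illustrated with a minimal Python sketch. It is not the IdiomExtractor code. The function names `substitute`, `subtract` and `subtraction_set` are hypothetical, the toy definitions are invented for illustration, and "subtraction" is assumed here to mean removing from a phrase definition every word that also occurs in a substituted phrase. The text above does not spell that operation out.

```python
from itertools import product


def substitute(recreated_word_defs):
    """Build P: one new phrase for every way of replacing each word of the
    original phrase with one of its recreated definitions (RDW_1, RDW_2, ...)."""
    return [" ".join(choice) for choice in product(*recreated_word_defs)]


def subtract(definition, phrase):
    """ASSUMPTION: 'subtraction' is taken to be word-level removal, i.e. keep
    the words of `definition` that do not occur in `phrase`."""
    phrase_words = set(phrase.split())
    return [word for word in definition.split() if word not in phrase_words]


def subtraction_set(recreated_phrase_defs, substituted_phrases):
    """Build S: subtract every phrase in P from every definition in RD_p."""
    return [
        subtract(definition, phrase)
        for definition in recreated_phrase_defs
        for phrase in substituted_phrases
    ]


# Toy, invented data for the phrase 'yellow journalism' (not from any dictionary).
rdw = [["colour", "cowardly"], ["news reporting"]]  # RDW_1, RDW_2
rd_p = ["sensational news reporting"]               # RD_p

P = substitute(rdw)
S = subtraction_set(rd_p, P)
print(P)
print(S)
```

With this toy input, each entry of S keeps the word of the phrase definition that none of the word-level definitions account for.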
