
A New Approach for Idiom Identification Using Meanings and the Web∗

Rakesh Verma
Computer Science Dept.
University of Houston
Houston, TX, 77204, USA
[email protected]

Vasanthi Vuppuluri
Computer Science Dept.
University of Houston
Houston, TX, 77204, USA
[email protected]

Abstract

There is a great deal of knowledge available on the Web, which represents a great opportunity for automatic, intelligent text processing and understanding, but the major problems are finding legitimate sources of information and the fact that search engines provide page statistics, not occurrences. This paper presents a new, domain-independent, general-purpose idiom identification approach. Our approach combines the knowledge of the Web with the knowledge extracted from dictionaries. This method can overcome the limitations of current techniques that rely on linguistic knowledge or statistics. It can recognize idioms even when the complete sentence is not present, and without the need for domain knowledge. It is currently designed to work with text in English but can be extended to other languages.

1 Introduction

Automatically extracting phrases from documents, be they structured, unstructured or semi-structured, has always been an important yet challenging task. The overall goal is to create easily machine-readable text to process the sentences. In this paper we focus on identifying idioms in text. An idiom is a phrase made up of a sequence of two or more words that has properties that are not predictable from the properties of the individual words or their normal mode of combination. Recognition of idioms is a challenging problem with wide applications. Some examples of idioms are ‘yellow journalism,’ ‘kick the bucket,’ and ‘quick fix.’ For example, the meaning of ‘yellow journalism’ cannot be derived from the meanings of ‘yellow’ and ‘journalism.’

Idioms play an important role in Natural Language Processing (NLP). They exist in almost all languages and are hard to extract, as there is no algorithm that can precisely outline the structure of an idiom. Idioms are important for natural language generation and significantly influence machine translation and semantic tagging. Idioms can also be useful in document indexing, information retrieval, and in text summarization or question-answering approaches that rely on extracting key words or phrases from the document to be summarized, e.g., (Barrera and Verma, 2011; Barrera and Verma, 2012; Barrera et al., 2011). Efficient idiom extraction thus benefits many areas of NLP. However, most idiom extraction techniques are biased in the sense that they focus on a specific domain or rely on statistical techniques alone, which results in poor performance. The technique in this paper makes use of knowledge from the Web combined with knowledge from dictionaries in deciding whether a phrase is an idiom, rather than depending solely on frequency measures or on the rules of a specific domain. The Web has been attractive to NLP researchers because it can solve the sparsity issue and its update latency is lower than that of dictionaries, but its disadvantages are noise, the lack of a good method for finding reliable sources, and the coarseness of page statistics. Dictionaries are more reliable but have higher update latency. Our work tries to minimize the disadvantages and maximize the advantages when combining these resources.

1.1 Contribution

This paper proposes a new idiom identification technique, which is general, domain independent and unsupervised in the sense that it requires no labeled datasets of idioms.
∗ Research supported in part by NSF grants CNS 1319212, DUE 1241772 and DGE 1433817.

The major problem with existing approaches is that most of them are supervised, requiring manually annotated data, and many of them impose syntactic restrictions, e.g., verb-particle, noun-verb, etc. Our technique makes use of carefully extracted, reliable knowledge from the Web and from dictionaries. Moreover, our technique can be extended to languages other than English, provided similar resources are available. Although our approach uses meanings, with the advancement of the Web more and more phrase definitions are becoming available online, and thus the reliance on dictionaries can be reduced or even eliminated. However, in many cases, even though the definition of a phrase may be available, the phrase itself is not necessarily labeled as an idiom, so we cannot just do a simple lookup of a phrase and mark it as an idiom.

The rest of the paper is organized as follows. Section 2 presents previous work on idiom extraction and classification. In Section 3 we present our approach in detail. Section 4 presents the datasets, and in Section 5 we present the experiments and comparisons. We conclude in Section 6.

2 Related Work

There is considerable work on extracting multiword expressions (MWEs), a superclass of idioms, e.g., (Zhang et al., 2006); (Villavicencio et al., 2007); (Li et al., 2008); (Spence et al., 2013); (Ramisch, 2014); (Marie and Constant, 2014); (Schneider et al., 2014); (Kordoni and Simova, 2014); (Yulia and Wintner, 2014). We do not cover this work here since our focus is on idioms.

Because of its importance, several researchers have investigated idiom identification. As mentioned in (Muzny and Zettlemoyer, 2013), prior work on this topic can be categorized into two streams: phrase classification, in which a phrase is always idiomatic or literal, e.g., (Gedigian et al., 2006); (Shutova et al., 2010), and token classification, in which each occurrence of a phrase is classified as either idiomatic or literal, e.g., (Birke et al., 2006); (Katz and Eugenie, 2006); (Li and Sporleder, 2009); (Fabienne et al., 2010); (Caroline et al., 2010); (Peng et al., 2014). Most work in the phrase classification stream imposes syntactic restrictions. A verb/noun restriction is imposed in (Fazly et al., 2009) and (Diab and Pravin, 2009); subject/verb and verb/direct-object restrictions are imposed in (Shutova et al., 2010); and a verb-particle restriction is imposed in (Ramisch et al., 2008). Portions of the American National Corpus were tagged for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses in (Laura et al., 2010).

To our knowledge, there are only a few general approaches for idiom identification in the phrase classification stream (Muzny and Zettlemoyer, 2013); (Feldman and Peng, 2013), and most of the techniques are supervised. A supervised technique for automatically identifying idiomatic entries with the help of online resources like Wiktionary is discussed in (Muzny and Zettlemoyer, 2013). There are three lexical features and five graph-based features in this technique, which model whether phrase meanings are constructed compositionally. The dataset consists of phrases, definitions, and example sentences from the English Wiktionary dump from November 13th, 2012. The lexical and graph-based features, when used together, yield F-scores of 40.1% and 62.0% when tested on the same dataset, once without the annotated idiom labels and once with them. This approach, when combined with the Lesk sense disambiguation algorithm and a Wiktionary label default rule, yields an F-score of 83.8%.
An unsupervised idiom extraction technique using Principal Component Analysis (PCA), which treats idioms as semantic outliers, and a supervised technique based on Linear Discriminant Analysis (LDA) were described by (Feldman and Peng, 2013). The idea of treating idioms as outliers was tested on 99 sentences extracted from the British National Corpus (BNC) social science (non-fiction) section, containing 12 idioms, 22 dead metaphors and 2 living metaphors. The idea of idiom detection based on LDA was tested on 2,984 Verb-Noun Combination (VNC) tokens extracted from the BNC, described in (Fazly et al., 2009). These 2,984 tokens translate into 2,550 sentences, of which 2,013 are idiomatic sentences and 537 are literal sentences. A variety of results were presented for PCA at different false positive rates ranging from 1 to 10% (one table with rates of 16-20%). For idioms only, the detection rates range from 44% at a 1% false positive rate to 89% at a 10% false positive rate.

Some of the work in the token classification stream, e.g., (Peng et al., 2014), relies on a list of potentially idiomatic expressions. Such a list can be generated using our technique.

3 Idiom Extraction Model

We now present the details of our approach for extracting idioms, which is implemented in Python and called IdiomExtractor. We focus on the meaning of the word idiom, i.e., “properties of individual words in a phrase differ from the properties of the phrase in itself.” Hence, we look at what the individual words in a phrase mean and what the phrase means as a whole. If the meaning of the phrase is different from what the individual words in the phrase convey, then by the definition of the word idiom, that phrase is an idiom.

The steps involved in the process of idiom extraction are as follows.

3.1 Definition Extraction

This step is the most important step in determining if a phrase is an idiom. The definitions of the phrase (D_p) and of the individual words in the phrase, as per their Part-of-Speech (POS) whenever possible, are obtained: {D_W1, D_W2, ..., D_Wj}. In some cases a dictionary may not have definitions for a word for the given POS, in which case the definitions of the word are obtained without taking POS into consideration. For obtaining definitions, we use WordNet, a dictionary API and the Bing search API. Here,

D_p = {D_1, D_2, D_3, ..., D_k}
D_W1 = {D_11, D_12, D_13, ..., D_1n}
D_W2 = {D_21, D_22, D_23, ..., D_2m}, and so on.

3.2 Recreating Definitions

Once we have the definitions of each word and those of the phrase, each definition is POS tagged using the NLTK POS tagger, only the words whose POS tag is in {noun, verb} are kept, and the definitions are recreated with only those words present, after stemming them using the Snowball Stemmer¹, as RD_p and {RD_W1, RD_W2, ..., RD_Wn}. This constraint stems from our observation of several idioms, which showed that idioms in general have at least one of the mentioned POS tags in order for the phrase to have a meaning. Here,

RD_p = {RD_1, RD_2, RD_3, ..., RD_k}
RD_W1 = {RD_11, RD_12, RD_13, ..., RD_1n}
RD_W2 = {RD_21, RD_22, RD_23, ..., RD_2m}, and so on.

Now, each word in the original phrase is replaced with its recreated definitions, which results in a set of new phrases P as follows:

P = {RD_11 RD_12 ... RD_j1, RD_12 RD_21 ... RD_j1, ..., RD_1n RD_2m ... RD_jl}

To avoid any confusion regarding how the procedure is implemented, an example is provided below.

¹ http://snowball.tartarus.org/download.php

3.3 Subtraction

Each of the phrases present in P is subtracted from each of the recreated definitions in RD_p, and the result is stored in the set S.

3.4 Idiom Result

There are two options the user can choose from in deciding if a phrase is an idiom:

– By Union
– By Intersection

By Union: This is a lenient way of deciding if a phrase is an idiom. Here, if at least one word survives the subtraction step above, then the phrase is declared to be an idiom.

By Intersection: This is a stricter way of deciding if a phrase is an idiom. Here, a phrase is an idiom if and only if at least one word survives all of the subtraction operations.
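To make Sections 3.1 and 3.2 concrete, the following is a minimal Python sketch of the definition-extraction and recreation steps. It is our illustration, not IdiomExtractor's code: it consults only WordNet through NLTK (the dictionary API and the Bing search API mentioned above are omitted), and the function names definitions and recreate are our own.

import nltk  # requires the punkt, averaged_perceptron_tagger and wordnet data packages
from nltk.corpus import wordnet as wn
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def definitions(word, pos=None):
    # Definitions of a word, restricted to the given WordNet POS when possible
    # (Section 3.1); fall back to all parts of speech if that yields nothing.
    synsets = wn.synsets(word, pos=pos) or wn.synsets(word)
    return [s.definition() for s in synsets]

def recreate(definition):
    # Recreate one definition (Section 3.2): POS tag it, keep only nouns and
    # verbs, and stem the survivors with the Snowball stemmer.
    tagged = nltk.pos_tag(nltk.word_tokenize(definition))
    return [stemmer.stem(w) for w, tag in tagged if tag.startswith(("NN", "VB"))]

Assuming the NLTK tokenizer, tagger and WordNet data are installed, recreate applied to the gloss of ‘forty winks’ below should yield the stemmed noun/verb list [sleep, period, time, bed], i.e., RD_p in the running example.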
Example - Definition extraction
D_p = Definition of ‘forty winks’ = {sleeping for a short period of time (usually not in bed)}
D_W1 = Definitions of ‘forty’ as a ‘Noun’ = {the cardinal number that is the product of ten and four}
D_W2 = Definitions of ‘winks’ as a ‘Noun’ = {a very short time (as the time it takes the eye to blink or the heart to beat), closing one eye quickly as a signal, a reflex that closes and opens the eyes rapidly}

Example - Recreating definitions
RD_p = {sleep period time bed}
RD_W1 = {number product ten}
RD_W2 = {time time eye blink heart beat, eye signal, reflex}
P = {number product ten time time eye blink heart beat, number product ten eye signal, number product ten reflex}

Note that we do not eliminate duplicate words such as the word “time” in RD_W2, since they do not affect the idiom extraction, but future versions of the software will optimize this aspect.

Example - Subtraction
S = {sleep period time bed} − {number product ten time time eye blink heart beat, number product ten eye signal, number product ten reflex}
  = {sleep period bed, sleep period time bed, sleep period time bed}
Count of each word that survives the subtraction = {sleep: 3, period: 3, time: 2, bed: 3}

The final decision step can likewise be understood with this example:

Example - Idiom Result
By Union: Since S is a non-empty set, the phrase ‘forty winks’ is an idiom.
By Intersection: At least one word in S is present as many times as there are recreated definitions, i.e., it survives all subtractions. Hence ‘forty winks’ is an idiom.

The overall procedure is summarized in Figure 1.

procedure IdiomExtraction
    for each phrase p in the phrases extracted do
        D_p = definitions of phrase p
        RD_p = recreated definitions of phrase p
        for each word w_i in p do
            D_wi = definitions of w_i
            RD_wi = recreated definitions of w_i
        recreate the definition phrases P = {RD_11 RD_12 ... RD_j1, RD_12 RD_21 ... RD_j1, ..., RD_1n RD_2m ... RD_jl}
        subtraction: S = RD_p − P
        idiom result, by Union: if S is non-empty then phrase p is an idiom
        idiom result, by Intersection: if at least one word survives all subtractions then phrase p is an idiom

Figure 1: Idiom Extraction Algorithm
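The subtraction and the Union/Intersection decisions (Sections 3.3-3.4) then reduce to simple list and set operations. The snippet below is a self-contained sketch that hard-codes the recreated sets from the ‘forty winks’ example; it encodes our reading of the two decision rules and is not the authors' implementation.

def subtract(rd_phrase, p_element):
    # Remove from one recreated phrase definition every word that also
    # occurs in one element of P (Section 3.3).
    return [w for w in rd_phrase if w not in p_element]

def decide(rd_p, P):
    # S = RD_p - P, plus the two verdicts of Section 3.4.
    S = [subtract(rd, set(p)) for rd in rd_p for p in P]
    by_union = any(S)  # at least one word survives at least one subtraction
    by_intersection = bool(S) and bool(set.intersection(*map(set, S)))
    return by_union, by_intersection, S

# Recreated definitions of 'forty winks' and the phrases P from Section 3.2.
rd_p = [["sleep", "period", "time", "bed"]]
P = [["number", "product", "ten", "time", "time", "eye", "blink", "heart", "beat"],
     ["number", "product", "ten", "eye", "signal"],
     ["number", "product", "ten", "reflex"]]

by_union, by_intersection, S = decide(rd_p, P)
# S == [['sleep', 'period', 'bed'], ['sleep', 'period', 'time', 'bed'],
#       ['sleep', 'period', 'time', 'bed']]; both verdicts are True.

The three words that survive every subtraction (sleep, period and bed) are exactly the ones counted three times in the example above, so both rules flag ‘forty winks’ as an idiom.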

4 Datasets

For the experiments in this paper, we used different datasets extracted from englishclub.com, the Oxford Dictionary of Idioms and the VNC corpus. The datasets and their extraction process are explained here.

4.1 Idiom Example Sentences Dataset

Dataset-1: An idiom dataset is obtained from englishclub.com². From the website, 198 idioms are randomly chosen and 198 example sentences that exemplify those 198 idioms are used. These 198 manually extracted example sentences serve as our dataset. This dataset facilitates the evaluation of the false positive rate of our technique.

4.2 Oxford Dictionary of Idioms Dataset

Dataset-2: This dataset is a collection of idioms obtained from the Oxford Dictionary of Idioms. A text file consisting of 176 idioms is the input for IdiomExtractor. This dataset facilitates the evaluation of the recall and false negative rate of our approach.

Preprocessing and Sanitization: PDFMiner³ was used to extract text as XML from the PDF version of the Oxford Dictionary of Idioms, and then a Python script was used to extract the idioms from the .xml file into a text file. Also, any non-ASCII characters are ignored while writing the idioms to the text file.

4.3 VNC Dataset

Dataset-3: VNC-tokens are obtained from (Fazly et al., 2009). This dataset consists of 53 unique tokens which were tagged as idiomatic or literal. Irrespective of what the tag was, we considered all the tokens as input for our software. We evaluate the recall and false negative rate of our software with the help of this dataset.

² https://www.englishclub.com/ref/Idioms/ (02/23/2015)
³ http://www.unixuser.org/~euske/python/pdfminer/ (11/28/2014)

(%)         IdiomExtractor (Union)   IdiomExtractor (Intersection)   AMALGr   Expected maximum
Recall      82.30                    67.17                           31.50    100.00
Precision   65.90                    95.50                           14.82    100.00
F-score     73.25                    78.69                           20.16    100.00

Table 1: Idiom extraction: IdiomExtractor Vs. AMALGr on Dataset-1

(%)         IdiomExtractor (Union)   IdiomExtractor (Intersection)   AMALGr   Expected maximum
Recall      100.00                   90.90                           67.61    100.00
Precision   100.00                   100.00                          67.23    100.00
F-score     100.00                   95.23                           67.42    100.00

Table 2: Idiom extraction: IdiomExtractor Vs. AMALGr on Dataset-2

5 Performance Evaluation

5.1 IdiomExtractor's Performance

Depending on the number of idioms whose definitions could be obtained, the maximum possible recall, precision and F-score are calculated for each of the three datasets; these values are tabulated under the ‘Expected maximum’ column.

On Dataset-1: IdiomExtractor has an F-score of 73.25% with the Union approach and 78.69% with the Intersection approach. Recall and precision are documented in Table 1. Definitions of all 198 idioms in this dataset are obtained from englishclub.com.

On Dataset-2: IdiomExtractor has an F-score of 100.00% with the Union approach and 95.23% with the Intersection approach. Recall and precision are documented in Table 2. For this experiment, we used the Oxford Dictionary of Idioms to obtain definitions of the 176 idioms.

On Dataset-3: IdiomExtractor has an F-score of 95.04% with the Union approach and 90.72% with the Intersection approach. In this experiment, we used idiom definitions obtained from two Internet sources⁴,⁵ and individual word definitions obtained from the WordNet dictionary.

5.2 IdiomExtractor Vs. AMALGr

We compare our idiom extraction module with AMALGr from (Schneider et al., 2014), since their definition of MWE, "lexicalized combinations of two or more words that are exceptional enough to be considered as single units in the lexicon," aligns with our definition of an idiom, and since the authors kindly made their software available.⁶ AMALGr requires the SAID⁷ corpus, which must be purchased from the Linguistic Data Consortium (LDC) and which we purchased, to train the software, along with other training data sets. AMALGr requires the input text to be represented as two tab-separated tokens per line, with the first token being a word from the input and the second token being the POS of the word, followed by an empty line when the sentence ends.

When tested on Dataset-1, the F-score of IdiomExtractor is more than 50 percentage points higher than the F-score of AMALGr. We believe that IdiomExtractor's performance can be improved further if efficient phrasal dictionaries were available for research purposes. Results are documented in Table 1.

Reason for the low precision of AMALGr: AMALGr joins the individual words of MWEs either with an underscore (strong MWE) or a tilde (weak MWE). In certain cases, not all words of the idioms are joined together with either of these special characters, and only parts of idioms were tagged as MWEs. For example, ‘ugly duckling’ and ‘settle a score’ weren't tagged as MWEs. An example where only part of an idiom is tagged as an MWE is “punch someone's lights out.” These are declared as false positives since we were looking for an exact match of the idiom, which caused a drop in the precision.

⁴ http://idioms.thefreedictionary.com/
⁵ http://dictionary.reference.com/
⁶ Not everyone we contacted was willing to share idiom extraction software.
⁷ https://catalog.ldc.upenn.edu/LDC2003T10 (02/03/2015)

(%)         IdiomExtractor (Union)   IdiomExtractor (Intersection)   AMALGr   Expected maximum
Recall      90.56                    83.01                           54.71    90.56
Precision   100.00                   100.00                          100.00   100.00
F-score     95.04                    90.72                           70.73    95.04

Table 3: Idiom extraction: IdiomExtractor Vs. AMALGr on Dataset-3
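For concreteness, the AMALGr input format described in Section 5.2 (one word and its POS per line, tab separated, with an empty line closing each sentence) could be produced as in the sketch below. The example sentence, its Penn Treebank tags and the file name are our own illustration, not taken from the paper's data.

sentence = [("He", "PRP"), ("kicked", "VBD"), ("the", "DT"), ("bucket", "NN"), (".", ".")]

with open("amalgr_input.txt", "w") as out:
    for word, pos in sentence:
        out.write(word + "\t" + pos + "\n")  # word TAB POS, one token per line
    out.write("\n")                          # empty line marks the end of the sentence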

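As a sanity check that is not part of the original evaluation, the reported F-scores follow from the usual harmonic mean of precision and recall, and the Dataset-3 ‘Expected maximum’ recall is consistent with 5 of the 53 tokens lacking definitions:

\[
F_1 = \frac{2PR}{P+R}, \qquad
\frac{2 \times 65.90 \times 82.30}{65.90 + 82.30} \approx 73.2, \qquad
\frac{2 \times 100.00 \times 90.90}{100.00 + 90.90} \approx 95.23, \qquad
\frac{48}{53} \approx 90.57\%.
\]

The first value matches Table 1's Union F-score of 73.25% up to rounding of the tabulated precision and recall; the second matches Table 2's Intersection F-score exactly.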
When tested on Dataset-2, out of 176 idioms, 119 are tagged as idioms by AMALGr (including both strong and weak idioms as described in (Schneider et al., 2014)), giving Recall = 67.61%, Precision = 67.23% and F-score = 67.42%; its recall is 32.39 percentage points lower than that of IdiomExtractor's Union approach. Results are documented in Table 2.

When tested on Dataset-3, out of the 53 VNC-tokens, 29 are tagged as MWEs (strong and weak MWEs combined). In comparison, the recall of AMALGr is 28.30 percentage points lower than IdiomExtractor's Intersection recall of 83.01%. IdiomExtractor failed to catch 5 VNC-tokens whose definitions were not provided.

6 Conclusion

In this paper we have presented a new approach for idiom extraction that is both domain and language independent, and does not require labeling of idioms. Our approach is effective, as demonstrated on two datasets and in a direct comparison with the supervised approach AMALGr.

One problem with our approach is that the resources currently available to us do not contain meanings of all of the idiom phrases. However, we believe that with advances in technology we will be able to do a much better job of obtaining phrase definitions in the near future. One direction for future work is to compare with the set {noun, verb, adjective, adverb} when recreating definitions.

References

Araly Barrera and Rakesh Verma. Combining Syntax and Semantics for Automatic Extractive Single-document Summarization. ACM SAC, Document Engineering Track, 2011, Taiwan.

Araly Barrera and Rakesh Verma. Combining Syntax and Semantics for Automatic Extractive Single-document Summarization. CICLING, LNCS 7182, 366-377, 2012, New Delhi, India.

Araly Barrera, Rakesh Verma and Ryan Vincent. SemQuest: University of Houston's Semantics-based Question Answering System. Text Analysis Conference, 2011.

Julia Birke and Anoop Sarkar. A Clustering Approach for Nearly Unsupervised Recognition of Nonliteral Language. EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy.

Marie Candito and Matthieu Constant. Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers: 743-753.

Mona T. Diab and Pravin Bhutada. Verb noun construction MWE token supervised classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, pp. 17-22. Association for Computational Linguistics, 2009.

Afsaneh Fazly, Paul Cook and Suzanne Stevenson. Unsupervised Type and Token Identification of Idiomatic Expressions. Computational Linguistics 35, no. 1 (2009): 61-103.

Anna Feldman and Jing Peng. Automatic detection of idiomatic clauses. Computational Linguistics and Intelligent Text Processing - 14th International Conference, CICLing 2013, Samos, Greece, March 24-30, 2013, Proceedings, Part I.

Fabienne Fritzinger, Marion Weller and Ulrich Heid. A Survey of Idiomatic Preposition-Noun-Verb Triples on Token Level. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta.

Matt Gedigian, John Bryant, Srini Narayanan and Branimir Ciric. Catching metaphors. In Proceedings of the Third Workshop on Scalable Natural Language Understanding, pp. 41-48. Association for Computational Linguistics, 2006.

Spence Green, Marie-Catherine de Marneffe and Christopher D. Manning. Parsing models for identifying multiword expressions. Computational Linguistics 39.1 (2013): 195-227.

Graham Katz and Eugenie Giesbrecht. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pp. 12-19. Association for Computational Linguistics, 2006.

Valia Kordoni and Iliana Simova. Multiword Expressions in Machine Translation. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, May 26-31, 2014.

Street Laura, Nathan Michalov, Rachel Silverstein, Michael Reynolds, Lurdes Ruela, Felicia Flowers, Angela Talucci, Priscilla Pereira, Gabriella Morgon, Samantha Siegel, Marci Barousse, Antequa Anderson, Tashom Carroll and Anna Feldman. Like Finding a Needle in a Haystack: Annotating the American National Corpus for Idiomatic Expressions. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta.

Linlin Li and Caroline Sporleder. Classifier combination for contextual idiom detection without labeled data. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL.

Linlin Li and Caroline Sporleder. Linguistic Cues for Distinguishing Literal and Non-Literal Usages. COLING 2010, 23rd International Conference on Computational Linguistics, Posters Volume, 23-27 August 2010, Beijing, China.

Ru Li, Lijun Zhong and Jianyong Duan. Multiword Expression Recognition Using Multiple Sequence Alignment. ALPIT 2008, Proceedings of The Seventh International Conference on Advanced Language Processing and Web Information Technology, Dalian University of Technology, Liaoning, China, 23-25 July 2008.

Grace Muzny and Luke S. Zettlemoyer. Automatic Idiom Identification in Wiktionary. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL.

Jing Peng, Anna Feldman and Ekaterina Vylomova. Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura and Marco Idiart. Picking them up and figuring them out: Verb-particle constructions, noise and idiomaticity. Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL 2008, Manchester, UK, August 16-17, 2008: 49-56.

Carlos Ramisch. Multiword Expressions Acquisition: A Generic and Open Framework. Theory and Applications of Natural Language Processing. Springer, 2015.

Nathan Schneider, Emily Danchik, Chris Dyer and Noah A. Smith. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics 2 (2014): 193-206.

Ekaterina Shutova, Sun Lin and Anna Korhonen. Metaphor Identification Using Verb and Noun Clustering. COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China: 1002-1010.

Caroline Sporleder, Linlin Li, Philip Gorinski and Xaver Koch. Idioms in Context: The IDIX Corpus. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta.

Yulia Tsvetkov and Shuly Wintner. Identification of multiword expressions by combining multiple linguistic information sources. Computational Linguistics 40, no. 2 (2014): 449-468.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart and Carlos Ramisch. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic.

Yi Zhang, Valia Kordoni, Aline Villavicencio and Marco Idiart. Automated multiword expression prediction for grammar engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pp. 36-44. Association for Computational Linguistics, 2006.