LT3: Applying Hybrid Terminology Extraction to Aspect-Based Sentiment Analysis

Orphée De Clercq, Marjan Van de Kauter, Els Lefever and Véronique Hoste
LT3, Language and Technology Team
Department of Translation, Interpreting and Communication – Ghent University
Groot-Brittanniëlaan 45, 9000 Ghent, Belgium
[email protected]

Abstract

The LT3 system perceives ABSA as a task to be tackled incrementally, namely aspect term extraction, classification and polarity classification. For the first two steps, we see that employing a hybrid terminology extraction system leads to promising results, especially when it comes to recall. For the polarity classification, we show that it is possible to gain satisfying accuracies, even on out-of-domain data, with a basic model employing only lexical information.

1 Introduction

There exists a large interest in sentiment analysis of user-generated content. Until recently, the main research focus has been on discovering the overall polarity of a certain text or phrase. A noticeable shift has occurred towards a more fine-grained approach, known as aspect-based sentiment analysis (ABSA). For this task the goal is to automatically identify the aspects of given target entities and the sentiment expressed towards each of them. In this paper, we present the LT3 system that participated in this year's SemEval 2015 ABSA task. Though the focus was on the same domains (restaurants and laptops) as last year's task (Pontiki et al., 2014), it differed in two ways. This time, entire reviews were to be annotated and for one subtask the systems were confronted with an out-of-domain test set, unknown to the participants.

The task ran in two phases. In the first phase (Phase A), the participants were given two test sets (one for the laptops and one for the restaurants domain). The restaurant sentences were to be annotated with automatically identified (target, aspect category) tuples, the laptop sentences only with the identified aspect categories. In the second phase (Phase B), the gold annotations for the above two datasets, as well as for a hidden domain, were given and the participants had to return the corresponding polarities (positive, negative, neutral). For more information we refer to Pontiki et al. (2015).

We tackled the problem by dividing the ABSA task into three incremental subtasks: (i) aspect term extraction, (ii) aspect term classification and (iii) aspect term polarity estimation (Pavlopoulos and Androutsopoulos, 2014). The first two are at the basis of Phase A, whereas the final one constitutes Phase B. For the first step, viz. extracting terms (or targets), we wanted to test our in-house hybrid terminology extraction system (Section 2). Next, we performed a multiclass classification task relying on a feature space containing both lexical and semantic information to aggregate the previously identified terms into the domain-specific and predefined aspects (or aspect categories) (Section 3). Finally, we performed polarity classification by deriving both general and domain-specific lexical features from the reviews (Section 4). We finish with conclusions and prospects for future work (Section 5).

2 Aspect Term Extraction

Before starting with any sort of classification, it is essential to know which entities or concepts are present in the reviews. According to Wright (1997), these "words that are assigned to concepts used in the special languages that occur in subject-field or domain-related texts" are called terms. Translated to the current challenge, we are thus looking for words or terms specific to a particular domain of interest, such as the restaurant domain.

In order to detect these terms, we tested our in-house terminology extraction system TExSIS (Macken et al., 2013), which is a hybrid system combining linguistic and statistical information. For the linguistic analysis, TExSIS relies on tokenized, Part-of-Speech tagged, lemmatized and chunked data using the LeTs Preprocess toolkit (Van de Kauter et al., 2013), which is incorporated in the architecture. Subsequently, all words and chunks matching certain Part-of-Speech patterns (i.e. nouns and noun phrases) were considered as candidate terms. In order to determine the specificity of and cohesion between these candidate terms, we combine several statistical filters to represent the termhood and unithood of the candidate terms (Kageura and Umino, 1996). To this purpose, we employed Log-likelihood (Rayson and Garside, 2000), C-value (Frantzi et al., 2000) and termhood (Vintar, 2010). All these statistical filters were calculated using the Web 1T 5-gram corpus (Brants and Franz, 2006) as a reference corpus.

After a manual inspection of the first output for the training data, we formulated some filtering heuristics. We filter out terms consisting of more than six words, terms that refer to location names and terms that contain sentiment words. Locations are found using the Stanford CoreNLP toolkit (Manning et al., 2014) and for the sentiment words, we filter those terms occurring in one of the following sentiment lexicons: AFINN (Nielsen, 2011), General Inquirer (Stone et al., 1966), NRC Emotion (Mohammad and Turney, 2010; Mohammad and Yang, 2011), MPQA (Wilson et al., 2005) and Bing Liu (Hu and Liu, 2004).

The terms that resulted from this filtered TExSIS output, supplemented with those terms that were annotated in the training data but not recognized by our terminology extraction system, were all considered as candidate terms. Finally, this list of candidate targets was further extended by also including coreferential links as null terms. Coreference resolution of each individual review was performed with the Stanford multi-pass sieve coreference resolution system (Lee et al., 2011). We should also point out that we only allowed terms to be identified in the test data when a sentence contains a subjective opinion. This was done by running each sentence through the above-mentioned sentiment lexicons.
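The paper does not describe how these heuristics were implemented; the snippet below is only a minimal Python sketch of such a rule-based filter, assuming the sentiment words were collected from the lexicons listed above and the location names from a named entity recognition pass. All function and variable names are illustrative, not the authors' code.

```python
# Illustrative sketch of the candidate-term filtering heuristics described above
# (a reconstruction, not the authors' code). `sentiment_words` is assumed to be
# compiled from the sentiment lexicons mentioned in the text, `location_names`
# from a NER pass over the reviews; both are sets of lowercased strings.
def filter_candidate_terms(candidates, sentiment_words, location_names, max_len=6):
    kept = []
    for term in candidates:
        tokens = term.lower().split()
        if len(tokens) > max_len:                           # more than six words
            continue
        if term.lower() in location_names:                  # location names
            continue
        if any(tok in sentiment_words for tok in tokens):   # contains a sentiment word
            continue
        kept.append(term)
    return kept

# Example:
# filter_candidate_terms(["waiting staff", "New York", "great pizza"],
#                        {"great"}, {"new york"})
# -> ["waiting staff"]
```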
3 Phase A

Given a list of possible candidate terms, the next step consists in aggregating these terms to broader aspect categories. As our main focus was on combining aspect term extraction with classification and since no targets were annotated for the laptops, we decided to focus on the restaurants domain. The organizers provided the participants with training data consisting of 254 annotated restaurant reviews. The task was then to assign each identified term to a correct aspect category.

For the classification task, we relied on a rich feature space for each of the candidate targets and performed classification into the domain-specific categories. Whereas the annotations allow for a two-step classification procedure by first classifying the main categories and afterwards the subcategories, we chose to perform the joint classification as this yielded better results in our exploratory experiments.

3.1 Feature Extraction

For all candidate terms present in our data sets we derived a number of lexical and semantic features. For those candidate targets that have been recognized as anaphors (see Section 2), these features were derived based on the corresponding antecedent.

First of all, we derived bag-of-words token unigram features of the sentence in which a term occurs in order to represent some of the lexical information present in each of the categories.
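As an illustration of this feature type, the sketch below builds sentence-level unigram count vectors; scikit-learn is our own choice here, since the paper does not name the toolkit that was used.

```python
# Minimal sketch of sentence-level bag-of-words token unigram features
# (scikit-learn is an assumption; each candidate term would then receive
# the vector of the sentence it occurs in).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The sushi was amazing but the service was slow.",
    "Great wine list and friendly staff.",
]
vectorizer = CountVectorizer(ngram_range=(1, 1), lowercase=True)
bow_features = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence
```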

The main part of our feature vectors, however, was made up of semantic features, which should enable us to classify our aspect terms into the predefined categories. These semantic features consist of:

1. WordNet features: for each main category, a value is derived indicating the number of (unique) terms annotated as aspect terms from that category in the training data that (1) co-occur in the synset of the candidate term or (2) are a hyponym/hypernym of a term in the synset. In case the candidate term is a multi-word term whose full term is not found, this value is calculated for all nouns in the multi-word term and the resulting sum is divided by the number of nouns. (A simplified sketch of this feature follows the list below.)

2. Cluster features: using the implementation of the Brown hierarchical word clustering algorithm (Brown et al., 1992) by Liang (2005), we derived clusters from the Yelp dataset (https://www.yelp.com/academic dataset). Then, we derived for each main category a value indicating the number of (unique) terms annotated as aspect terms from that category in the training data that co-occur with the candidate term in the same cluster. Since clusters can only contain single words, we calculate this value for all the nouns in a multi-word term and take the mean of the resulting sum.

3. Linked Open Data (LOD) features: using DBpedia (Lehmann et al., 2013), we included binary values indicating whether a candidate term occurs in one of the following DBpedia categories: Foods, Cuisine, Alcoholic beverages, Non-alcoholic beverages, Atmosphere, People in food and agriculture occupations or Food services occupations. These features were automatically derived using the RapidMiner Linked Open Data Extension (Paulheim et al., 2014).

4. Training data features: the number of annotations in the training data for each of the main categories. We filtered out candidate terms for which all of these feature values are "0", but decided to keep proper nouns and proper noun phrases.
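The following Python sketch illustrates the WordNet feature described in item 1, using NLTK's WordNet interface as an assumed implementation (the paper does not name a library). The averaging over nouns for multi-word terms is omitted for brevity, and `category_terms` is a hypothetical mapping from each main category to its annotated training terms.

```python
# Illustrative sketch of the WordNet feature (item 1 above); requires
# nltk.download('wordnet'). Not the authors' implementation.
from nltk.corpus import wordnet as wn

def related_lemmas(word):
    """Lemmas of the word's synsets plus their direct hypernyms and hyponyms."""
    lemmas = set()
    for synset in wn.synsets(word.replace(" ", "_")):
        for s in [synset] + synset.hypernyms() + synset.hyponyms():
            lemmas.update(l.name().replace("_", " ").lower() for l in s.lemmas())
    return lemmas

def wordnet_feature(candidate, category_terms):
    """One value per main category: how many of its training aspect terms
    are WordNet-related to the candidate term."""
    lemmas = related_lemmas(candidate)
    return {category: len({t for t in terms if t.lower() in lemmas})
            for category, terms in category_terms.items()}
```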
3.2 Classification and Results

For all our experiments, we used LIBSVM (Chang and Lin, 2001). In order to tune our system, we split the training data into a train (90%) and test fold (10%) and ran various rounds of experiments, after which we manually analyzed the output. Based on this analysis, we were able to derive some post-processing heuristics to rule out some of the low-hanging fruit (i.e. misclassifications which could be ruled out univocally). To do so, we built a dictionary containing all targets annotated in the training data, together with their associated category label(s). In case our classifier assigns a main category to a target term that is never associated with the respective target in the training dictionary, we overrule the classification output and replace it by the (most frequent) category-subcategory label that is associated with this target in the training dictionary.

The results of our system on the final test set and rank are presented in Table 1, where Slot 1 refers to the aspect category classification and Slot 2 to the task of finding the correct opinion target expressions (or terms).

Slot      Precision  Recall  F-score  Rank
Slot 1    51.54      56.00   53.68    8/15
Slot 2    36.47      79.34   49.97    13/21
Slot 1,2  29.44      44.73   35.51    6/13

Table 1: Results of the LT3 system on Phase A

For the design of our system we wanted to focus most on the combination of Slot 1 and 2, i.e. finding the target terms and being able to classify them in the correct category. This is the most difficult task of all three, hence the lower F-scores in general (Pontiki et al., 2015). Though there is much room for improvement for our system, we do observe that our rank increases for this more difficult task. Our precision scores are rather low, but we obtain the best recall scores for Slot 2 and Slot 1,2. For Slot 1,2 we are able to find 378 of the 845 possible targets, resulting in the best recall score of all participating systems (44.73, compared to a recall score of 41.73 obtained by the winning team).

This leads us to conclude that there is quite some room for improvement for the aggregation phase. Normally, the similarity between terms is first computed, after which some sort of clustering is performed.
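The dictionary-based override described earlier in this subsection could be reconstructed roughly as follows; the label format MAIN#SUBCATEGORY and all names in the sketch are our assumptions, not the authors' code.

```python
# Schematic reconstruction of the training-dictionary post-processing heuristic.
from collections import Counter, defaultdict

def build_target_dictionary(training_annotations):
    """Map each annotated target to a counter of its gold category labels,
    given pairs such as ("pizza", "FOOD#QUALITY")."""
    target_labels = defaultdict(Counter)
    for target, label in training_annotations:
        target_labels[target.lower()][label] += 1
    return target_labels

def override_prediction(target, predicted_label, target_labels):
    """Overrule a prediction whose main category never co-occurred with the target."""
    seen = target_labels.get(target.lower())
    if not seen:
        return predicted_label                     # unseen target: keep the prediction
    predicted_main = predicted_label.split("#")[0]
    seen_mains = {label.split("#")[0] for label in seen}
    if predicted_main in seen_mains:
        return predicted_label
    return seen.most_common(1)[0][0]               # most frequent label for this target
```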

4 Phase B

In recent years, sentiment analysis has been a popular research strand. An example is last year's SemEval task 9, Sentiment Analysis in Twitter, which drew over 45 participants. The competition revealed that the best systems use supervised machine learning techniques and rely heavily on lexical features in the form of n-grams and sentiment lexicons (Rosenthal et al., 2014). For Phase B, in which we had all gold standard terms and aspect categories available, we decided to extend our LT3 system with another classification round where we classify every aspect as positive, negative or neutral. All features are derived from the sentence in which the terms were found and we participated in all three domains.

4.1 Feature Extraction

We implemented a number of lexical features. First of all, we derived bag-of-words token unigram features. Then, we also generated features using two of the more well-known sentiment lexicons, General Inquirer (Stone et al., 1966) and Bing Liu (Hu and Liu, 2004), and a manually constructed list of negation cues based on the training data of SemEval-2014 task 9 (Van Hee et al., 2014). Moreover, for both the restaurants and laptops domain we created a list of all the domain-specific positive, negative and neutral words based on the training data. For the hotels we were not able to compile such a list.

Finally, we also included PMI features based on three domain-specific datasets. PMI (pointwise mutual information) values indicate the association of a word with positive and negative sentiment: the higher the PMI score, the stronger the word-sentiment association. We calculated this for each unigram based on the word-sentiment associations found in the respective training dataset. PMI values were calculated as follows:

PMI(w) = PMI(w, positive) − PMI(w, negative)    (1)

As the equation shows, the association score of a word with negative sentiment is subtracted from the word's association score with positive sentiment. For the restaurants domain we relied on the Yelp dataset (cf. Section 3.1), for the laptops domain on a subset of the Amazon electronics dataset (McAuley and Leskovec, 2013), and for the hidden – hotel – domain we worked with reviews collected from TripAdvisor (Wang et al., 2011). All datasets were filtered by only including reviews with strong subjective ratings (e.g. we preferred a 5-star rating for positive reviews over one of 3 stars).
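A minimal sketch of how Equation (1) can be computed from a domain dataset is given below, assuming the reviews have already been split into positive and negative buckets; the add-one smoothing and all names are our own choices, as the paper gives no implementation details.

```python
# Illustrative computation of PMI(w) = PMI(w, positive) - PMI(w, negative),
# per unigram, from tokenized positive and negative reviews (both non-empty).
import math
from collections import Counter

def pmi_scores(pos_tokens, neg_tokens):
    pos_counts, neg_counts = Counter(pos_tokens), Counter(neg_tokens)
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())
    n_total = n_pos + n_neg
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        p_word = (pos_counts[word] + neg_counts[word]) / n_total
        # PMI(w, class) = log2( P(w, class) / (P(w) * P(class)) ), add-one smoothed
        pmi_pos = math.log2(((pos_counts[word] + 1) / n_total) / (p_word * (n_pos / n_total)))
        pmi_neg = math.log2(((neg_counts[word] + 1) / n_total) / (p_word * (n_neg / n_total)))
        scores[word] = pmi_pos - pmi_neg
    return scores
```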
4.2 Classification and Results

We again used LIBSVM as our learner. For the restaurants and laptops domains, we used the respective training data sets. For the hidden (hotel) domain, we only used the restaurants training data, since we assumed hotels to be more similar to restaurants than they are to laptops. The results of our system are presented in Table 2.

Domain       Accuracy  Rank
Restaurants  75.03     4/15
Laptops      73.76     5/13
Hotels       80.53     2/11

Table 2: Results of the LT3 system on Phase B

Our results show that using only lexical features already results in quite satisfying accuracy scores for all three domains. Considering the hotels dataset, we can conclude that having training data available from a very similar domain does already result in a satisfying accuracy (our system has the second best score on the hidden domain). In the future, we will investigate the performance gain when also including domain-specific training data.

5 Conclusions and Future Work

We presented the LT3 system, which is able to tackle the aspect-based sentiment analysis task incrementally by first deriving candidate terms, after which these are classified into various categories and polarities. Applying a hybrid terminology extraction system to the first phase seems to be a promising approach. Our experiments revealed that we are able to achieve high recall for the task of deriving targets and aspect categories using a variety of lexical and semantic features. When it comes to polarity estimation, we see that a classifier mostly relying on lexical information achieves a satisfying performance, even on out-of-domain data.

Based on our results, we see different directions for follow-up research. For the term extraction, we will focus on more powerful filtering techniques. With respect to term aggregation, we will explore new techniques for clustering our list of candidate terms in different manners. Furthermore, we will explore in future experiments to which extent deeper syntactic, semantic and discourse modelling leads to better polarity classification. Since the TExSIS system was developed as a multilingual framework (Macken et al., 2013), we are currently translating the LT3 system so that it can handle Dutch reviews.

References

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1 LDC2006T13. Web Download.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines.

Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, pages 168–177, New York, NY. ACM.

Kyo Kageura and Bin Umino. 1996. Methods of automatic term recognition. A review. Terminology, 3(2):259–289.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the CoNLL-2011 Shared Task.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2013. DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal.

Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, MIT.

Lieve Macken, Els Lefever, and Véronique Hoste. 2013. TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-based Alignment. Terminology, 19(1):1–30.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 165–172.

Saif Mohammad and Peter Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, LA, California.

Saif Mohammad and Tony Yang. 2011. Tracking Sentiment in Mail: How Genders Differ on Emotional Axes. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2011), pages 70–79, Portland, Oregon. ACL.

Finn Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on Making Sense of Microposts: Big things come in small packages.

Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, and Christian Bizer. 2014. Data mining with background knowledge from the web. In Proceedings of the 5th RapidMiner World.

John Pavlopoulos and Ion Androutsopoulos. 2014. Aspect term extraction for sentiment analysis: New datasets, new evaluation measures and an improved unsupervised method. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Maria Pontiki, Dimitris Galanis, Harris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado.

Paul Rayson and Roger Garside. 2000. Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, 38th annual meeting of the Association for Computational Linguistics, pages 1–6, Hong Kong, China.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment analysis in Twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Marjan Van de Kauter, Geert Coorman, Els Lefever, Bart Desmet, Lieve Macken, and Véronique Hoste. 2013. LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit. Computational Linguistics in the Netherlands Journal, 3:103–120.

Cynthia Van Hee, Marjan Van de Kauter, Orphée De Clercq, Els Lefever, and Véronique Hoste. 2014. LT3: Sentiment classification in user-generated content using a rich feature set. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 406–410, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Špela Vintar. 2010. Bilingual term recognition revisited: The bag-of-equivalents term alignment approach and its evaluation. Terminology, 16:141–158.

Hongning Wang, Yue Lu, and ChengXiang Zhai. 2011. Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 618–626.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT05, pages 347–354, Stroudsburg, PA. ACL.

Sue Ellen Wright. 1997. Term selection: the initial phase of terminology management. In Sue Ellen Wright and Gerhard Budin, editors, Handbook of terminology management, pages 13–23. John Benjamins, Amsterdam.
