Integrating Extraction and Embedding for Unsupervised Aspect Based Sentiment Analysis

Luca Dini, Paolo Curtoni, Elena Melnikova
Innoradiant, Grenoble, France
[email protected] [email protected] [email protected]

Abstract

English. In this paper we explore the advantages that unsupervised terminology extraction can bring to unsupervised Aspect Based Sentiment Analysis methods based on expansion techniques. We prove that the gain in terms of F-measure is in the order of 3%.

Italiano. In this paper we analyse the interaction between "classical" terminology extraction systems and systems based on word embedding techniques in the context of opinion analysis. We show that integrating terminologies yields a gain in F-measure of 3% on the French SemEval 2016 dataset.

1 Introduction

The goal of this paper is to bring a contribution on the advantage of exploiting terminology extraction systems coupled with word embedding techniques. The experimentation is based on the corpus of SemEval 2016. In a previous work, summarized in section 4, we reported the results of a system for Aspect Based Sentiment Analysis (ABSA) based on the assumption that in real applications a domain dependent gold standard is systematically absent. We showed that, by adopting domain dependent word embedding techniques, a reasonable level of quality (i.e. acceptable for a proof of concept) in terms of entity detection could be achieved by providing two seed words for each targeted entity. In this paper we explore the hypothesis that unsupervised terminology extraction approaches could further improve the quality of the results in entity extraction.

The paper is organized as follows: in section 2 we present the goal of the research and the industrial background justifying it. In section 3 we provide a state of the art of ABSA, particularly focused on unsupervised ABSA and its relationship to terminology extraction. In section 4 we summarize our previous approach in order to provide a context for our experimentation. In section 5 we prove the benefit of the integration of unsupervised terminology extraction with ABSA, whereas in section 6 we provide hints for further investigation.

2 Background

ABSA is a task which is central to a number of industrial applications, ranging from e-reputation and crisis management to customer satisfaction assessment. Here we focus on a specific and novel application, i.e. capturing the voice of the customer in new product development (NPD). It is a well-known fact that the high rate of failure (76%, according to Nielsen France, 2014) in launching new products on the market is due to a low consideration of prospective users' needs and desires. In order to account for this deficiency, a number of methods have been proposed, ranging from traditional methods such as KANO (Witell et al., 2013) to recent lean-based NPD strategies (Olsen, 2015). All are invariantly based on the idea of collecting user needs with tools such as questionnaires, interviews and focus groups.

However, with the development of social networks, review sites, forums, blogs etc., there is another important source for capturing user insights for NPD: users of products (in a wide sense) are indeed talking about them, about the way they use them, about the emotions they raise. Here is where ABSA becomes central: whereas for applications such as e-reputation or brand monitoring capturing just the sentiment is largely enough for the specific purpose, for NPD it is crucial to capture the entity an opinion is referring to and the specific feature under judgment.

ABSA for NPD is a novel technique and as such it might trigger doubts on its adoption: given the investments in NPD (198,000 M€ in the cosmetics sector alone), it is normal to find a certain reluctance in abandoning traditional methodologies for voice-of-the-customer collection in favor of social-network-based ABSA. In order to counter this reluctance, two conditions need to be satisfied. On the one hand, one must prove that ABSA is feasible and effective in a specific domain (Proof of Concept, POC); on the other hand, the costs of a high-quality in-production system must be affordable and comparable with traditional methodologies (according to Eurostat, the spending of European SMEs in the manufacturing sector for NPD will be about 350,005.00 M€ in 2020, and SMEs usually have a limited budget in terms of "voice of the customer" spending).

If we consider the fact that the range of products/services which are possible objects of ABSA studies is immense¹, it is clear that we must rely on almost completely unsupervised technologies for ABSA, which translates into the capability of performing the task without a learning corpus.

¹ The site of UNSPSC reports more than 40,000 categories of products (https://www.unspsc.org).

3 State of the Art

3.1 SemEval 2016 overview

SemEval is "an ongoing series of evaluations of computational semantic analysis systems"², organized since 1998. Its purpose is to evaluate semantic analysis systems. ABSA (Aspect Based Sentiment Analysis) was one of the tasks of this event, introduced in 2014. This type of analysis provides information about consumer opinions on products and services, which can help companies to evaluate satisfaction and improve their business strategies. A generic ABSA task consists in analyzing a corpus of unstructured texts and extracting fine-grained information from the user reviews. The goal of the ABSA task within SemEval is to directly compare different datasets, approaches and methods to extract such information (Pontiki et al., 2016).

² https://aclweb.org/aclwiki/SemEval_Portal, seen on 05/24/2018.

In 2016, ABSA provided 39 training and testing datasets for 8 languages and 7 domains. Most datasets come from customer reviews (especially for the domains of restaurants, laptops, mobile phones, digital cameras, hotels and museums); only one dataset (telecommunication domain) comes from tweets. The subtasks of sentence-level ABSA were intended to identify all the opinion tuples encoding three types of information: Aspect category, Opinion Target Expression (OTE) and Sentiment polarity. The Aspect is in turn a pair (E#A) composed of an Entity and an Attribute. Entities and attributes, chosen from a special inventory of entity types (e.g. "restaurant", "food", etc.) and attribute labels (e.g. "general", "prices", etc.), are the pairs towards which an opinion is expressed in a given sentence. Each E#A can be referred to a linguistic expression (OTE) and be assigned one polarity label. The evaluation assesses whether a system correctly identifies the aspect categories towards which an opinion is expressed. The categories returned by a system are compared to the corresponding gold annotations and evaluated according to different measures (precision (P), recall (R) and F-1 scores). System performance for all slots is compared to a baseline score. The baseline system selects categories and polarity values using a Support Vector Machine (SVM) based on bag-of-words features (Apidianaki et al., 2016).

3.2 Related works on unsupervised ABSA

Unsupervised ABSA. Traditionally, in the ABSA context, one problematic aspect is represented by the fact that, given the non-negligible effort of annotation, learning corpora are not as large as needed, especially for languages other than English. This fact, as well as the extension to "unseen" domains, pushed some researchers to explore unsupervised methods. Giannakopoulos et al. (2017) explore new architectures that can be used as feature extractors and classifiers for unsupervised aspect term detection. Such unsupervised systems can be based on syntactic rules for automatic aspect term detection (Hercig et al., 2016) or on graph representations of the interactions between aspect terms and opinions (García-Pablos et al., 2017), but the vast majority exploits resources derived from distributional semantic principles (concretely, word embeddings).

The benefits of word embeddings used for ABSA were successfully shown in (Xenos et al., 2016). This approach, which is nevertheless supervised, characterizes an unconstrained system (in the SemEval jargon, a system accessing information not included in the training set) for detecting Aspect Category, Opinion Target Expression and Polarity. The vectors used were produced with the skip-gram model with 200 dimensions and were based on multiple ensembles, one for each E#A combination. Each ensemble returns the combinations of the scores of the constrained and unconstrained systems. For Opinion Target Expression, word-embedding-based features extend the constrained system. The resulting scores reveal, in general, a rather high ranking for the unconstrained system based on word embeddings. The advantages derived from the use of pre-trained in-domain vectors are also described in (Kim, 2014), who makes use of convolutional neural networks trained on top of pre-trained word vectors and shows good performance on sentence-level tasks, especially sentiment analysis.

Some other systems represent a compromise between supervised and unsupervised ABSA, i.e. semi-supervised ABSA systems, such as an almost unsupervised system based on topic modelling and W2V (Hercig et al., 2016), and W2VLDA (García-Pablos et al., 2017). The former uses human-annotated datasets for training, but enriches the feature space by exploiting large unlabeled corpora. The latter combines different unsupervised approaches, like word embeddings and Latent Dirichlet Allocation (LDA, Blei et al., 2003), to classify the aspect terms into three SemEval categories. The only supervision required from the user is a single seed word per desired aspect and polarity. Because of that, the system can be applied to datasets of different languages and domains with almost no adaptation.

Relationship with Term Extraction. Automatic Terminology Extraction (ATE) is an important task in NLP, because it provides a clear footprint of domain-related information. All ATE methods can be classified into linguistic, statistical and hybrid (Cabré Castellví et al., 2001).

The relationship between word embeddings and ATE methods was successfully explored for the task of term disambiguation in technical specification documents (Merdy et al., 2016). The distributional neighbors of 16 seed words were evaluated on the basis of three corpora of different sizes: small (200,000 words), medium (2 M words) and large (more than 200 M words). The results of this study show that the identification of generic terms is more reliable in the large-sized corpus, since the phenomenon is very widespread over the contexts. For specialized terms, the medium and large-sized corpora are complementary. The specialized medium corpus brings a gain by guaranteeing the most relevant terms. As for the small corpus, it does not seem to give usable results, whatever the term. Thus, the authors conclude that word2vec is an ideal technique to constitute a term lexicon semi-automatically from very large corpora, without being limited to a domain.

Word2vec's methods (such as skip-gram and CBOW) are also used to improve the extraction and identification of terms. This is done by the composed filtering of local-global vectors (Amjadian et al., 2016). The global vectors were trained on a general corpus with GloVe (Pennington et al., 2014), and the local vectors on the specific corpus with CBOW and Skip-gram. This filter was designed to preserve both the specific-domain and the general-domain information that the words may contain, and it greatly improves the output of ATE tools for unigram term extraction.

The W2V method also seems useful for the task of categorizing terms using the concepts of an ontology (Ferré, 2017). The terms (from medical texts) were first annotated, and an initial vector was generated for each term. These term vectors, embedded into the ontology vector space, were compared with the ontology concept vectors. The closest computed distance determines the ontological labeling of the terms.

The word2vec method is also used to emulate a simple system performing term and taxonomy extraction from text (Wohlgenannt and Minic, 2016). The researchers apply the built-in word2vec similarity function to get terms related to the seed terms. On the minus side, the results show that the candidates suggested by word2vec are too similar terms, such as plural forms or near synonyms. On the other hand, the evaluation of word2vec for taxonomy building gave an accuracy of around 50% on taxonomic relation suggestion. Being not very impressive, the system is to be improved through parameter settings and bigger corpora.

In the experiments described in this paper we exploit only the Skip-gram approach based on the word2vec implementation.
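All the approaches reviewed above rely on the same primitive operation: ranking words by cosine similarity in the embedding space. The following sketch illustrates such a nearest-neighbor query; the vectors are invented toy values, not the output of any model discussed in this paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Toy lemma -> vector table standing in for a trained skip-gram model.
vectors = {
    "pizza":   [0.9, 0.1, 0.0],
    "cuisine": [0.8, 0.3, 0.1],
    "waiter":  [0.1, 0.9, 0.2],
    "bill":    [0.0, 0.8, 0.6],
}

def closest(word, n=2):
    """Return the n lemmas most similar to `word` (excluding itself)."""
    ranked = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
        reverse=True,
    )
    return ranked[:n]

print(closest("pizza"))  # ['cuisine', 'waiter']
```

With a real model the table would contain hundreds of thousands of lemmas, but the query is the same one used below for seed expansion.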
It is important to notice that this choice is not due to a principled decision but to non-functional constraints: that algorithm has a Java implementation, is reasonably fast and is already integrated with the Innoradiant NLP pipeline.

4 Previous Investigations

The experiments described in Dini et al. (under review) have been performed by using Innoradiant's Architecture for Language Analytics (henceforth IALA). The platform implements a standard pipelined architecture composed of classical NLP modules: Sentence Splitting → Tokenization → POS tagging → lexicon access → Dependency parsing → Feature identification → Attitude analysis.

Inspired by Dini et al. (2017) and Dini and Bittar (2016), sentiment/attitude analysis in IALA is mainly symbolic. The basic idea is that dependency representations are an optimal input for rules computing sentiments. The rule language formalism is inspired by Valenzuela-Escárcega et al. (2015) and, thanks to its template filling capability, in several cases the grammar is able to identify the perceiver of a sentiment and, most importantly, the cause of the sentiment, represented by a word in an appropriate syntactic dependency with the sentiment-bearing lexical item. For instance, in the representation of the opinion in "I hate black coffee.", integers represent the positions of the words in a CONLL-like structure.

By default, entities (which are normally the products and services under analysis) are identified from the early processing phases by means of regular expressions. This choice is rooted in the fact that, by acting at this level, multiword entities (such as hydrating cream) are captured as single words since the early stages.

The goal of the Dini et al. (2018) work was to minimize the domain configuration overhead by i) expanding automatically the polarity lexicon to increase polarity recall and ii) performing entity recognition by providing only two words (seeds) for each target entity. Both goals were achieved by exploiting a much larger corpus than SemEval, obtained by automatically scraping restaurant reviews from TripAdvisor. The final corpus was composed of 3,834,240 sentences and 65,088,072 lemmas. From this corpus we obtained a word2vec resource by using the DL4j library (skip-gram). The resource (henceforth W2VR) was obtained by using lemmas rather than surface forms. The relevant training parameters for reproducing the model are described in that paper.

We skip here the description of i) (polarity expansion), as in the context of the present work we kept polarity exactly as it was in Dini et al. (2018)³. We just mention the results achieved on polarity-only detection, which were a precision of 0.78185594 and a recall of 0.54541063 (F-measure: 0.6425726). These numbers are important because in our approach a positive match is always given by a positive match of polarity and a correct entity identification (in other words, a perfect entity detection system could achieve a maximum of 0.64 precision).

³ Some previous works on unsupervised polarity lexicon acquisition for sentiment analysis were done in (Castellucci et al., 2016; Basili et al., 2017).

4.1 Entity Matching

Entity matching was achieved by manually associating two seed words to each SemEval entity (RESTAURANT, FOOD, DRINK, etc.) and then applying the following algorithm:

• Associate each entity to the average vector of its seed words (e-vect), e.g. evect(FOOD) = avg(vect(cuisine), vect(pizza)).
• If a syntactic cause is found by the grammar (as in "I liked the meal"), assign it the entity associated to the closest e-vect.
• Otherwise, compute the average vector of the n words surrounding the opinion trigger and assign the entity associated to the closest e-vect.

With n=35 we obtain precision = 0.47914252, recall = 0.4888 and F-measure = 0.3998.

5 Integrating terminology

A possible path to improve results in entity assignment can be found in the usage of "synonyms" in the computation of the set of e-vects. These can again be obtained from W2VR, by selecting the n words closest to the average of the seeds and using them in the computation of the e-vect. Expectedly, the value of n can influence the result, as shown in Figure 1.

Figure 1. Changes of measures according to different top N closest words (without terminology).

We notice that the best results are achieved by using a set of around 10 closest words: after that threshold, the noise caused by "false synonyms" or associated common words causes a decay in the results. We also notice that overall the results are better than with the original seed-only method, as now we obtain precision: 0.51, recall: 0.35, F-measure: 0.42. Here the positive fact is not only a global raise of the F-measure, but the fact that this is mainly caused by an increased precision, which according to Dini et al. (2018) is the crucial point in POC-level applications.

As a way to remedy the noise caused by an unselective use of the n closest words coming from W2VR, we decided to explore an approach that filters them according to the words appearing as terms in a terminology obtained from an unsupervised terminology extraction system. To this purpose we adopted the software TermSuite (Cram & Daille, 2016), which implements a classic two-step model of identification of term candidates and their ranking. In particular, TermSuite is based on two main components: a UIMA Tokens Regex for defining term and variant patterns over word annotations, and a grouping component for clustering terms and variants that works both at the morphological and the syntactic level (for more details cf. Cram & Daille, 2016). The interest of using this resource for filtering results from W2VR is that "quality word" lists are obtained with methods fundamentally different from the W2V approach, heavily based on language-dependent syntactic patterns.

We performed the same experiments as for the W2VR expansion in the computation of the e-vects, with the only difference that the top n words must now appear both among the closest words in W2VR and as terms in the terminology (the W2VR parameters, including the corpus, are described in section 4; the terminology was obtained from the same restaurant corpus). The results are detailed in Figure 2.
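The terminology-filtered computation of e-vects and the subsequent entity assignment can be sketched as follows. Everything here is illustrative: the two-dimensional vectors are invented toy values, and the `W2VR` and `TERMS` tables are hypothetical stand-ins for the TripAdvisor-trained skip-gram model and the TermSuite term list.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def avg(vecs):
    return [sum(c) / len(vecs) for c in zip(*vecs)]

# Toy stand-ins: W2VR would be the restaurant-review skip-gram model,
# TERMS the list extracted by TermSuite. Values are invented.
W2VR = {
    "cuisine": [0.9, 0.1], "pizza": [0.8, 0.2], "pasta": [0.85, 0.1],
    "menu": [0.75, 0.25], "tasty": [0.7, 0.2],
    "waiter": [0.1, 0.9], "staff": [0.2, 0.85], "reception": [0.1, 0.8],
    "welcome": [0.25, 0.9], "friendly": [0.2, 0.7],
}
TERMS = {"cuisine", "pizza", "pasta", "menu",
         "waiter", "staff", "reception", "welcome"}

SEEDS = {"FOOD": ["cuisine", "pizza"], "SERVICE": ["waiter", "staff"]}

def expand(entity, n=2):
    """Top-n W2VR neighbors of the seed average that also appear in TERMS."""
    seed_avg = avg([W2VR[s] for s in SEEDS[entity]])
    candidates = (w for w in W2VR if w not in SEEDS[entity] and w in TERMS)
    return sorted(candidates, key=lambda w: cosine(seed_avg, W2VR[w]),
                  reverse=True)[:n]

def e_vect(entity, n=2):
    """Entity vector: average of the seeds and the filtered expansion words."""
    words = SEEDS[entity] + expand(entity, n)
    return avg([W2VR[w] for w in words])

def assign_entity(window):
    """Pick the entity whose e-vect is closest to the opinion window's average."""
    w = avg([W2VR[x] for x in window if x in W2VR])
    return max(SEEDS, key=lambda e: cosine(w, e_vect(e)))

print(expand("FOOD"))                        # ['pasta', 'menu']: "tasty" is a
                                             # close neighbor but not a term
print(assign_entity(["tasty", "pasta"]))     # FOOD
print(assign_entity(["friendly", "waiter"])) # SERVICE
```

Note how the terminology acts purely as a filter on the expansion step: a non-term such as "tasty" may be distributionally close to the FOOD seeds, yet it never enters the e-vect, which mirrors the precision gain discussed above.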

Figure 2. Different measures with top N words filtered with terminology.

We notice that all scores increase significantly. In particular, at top n=10 we obtain P=0.550233483, R=0.381750288 and F=0.450762752, which represents a 5% increase (in F-measure) w.r.t. the results presented in Dini et al. (2018).

6 Conclusions

Many improvements can be conceived for the method presented here, especially concerning the computation of the vector associated to the opinionated windows, in terms of size, directionality and consideration of finer-grained features (e.g. indicators of a switch of topic). However, our future investigation will rather be oriented towards full-fledged ABSA, i.e. taking into account not only Entities, but also Attributes. Indeed, if we consider that the 45% F-measure is obtained on a corpus where only 66% of the sentences were correctly classified according to the sentiment, and if we put ourselves in a SemEval perspective where entity evaluation is provided with respect to a "gold sentiment standard", we achieve an F-score of 68%, which is fully acceptable for an almost unsupervised system.

References

Ehsan Amjadian, Diana Inkpen, T. Sima Paribakht and Farahnaz Faez. 2016. Local-Global Vectors to Improve Unigram Terminology Extraction. Proceedings of the 5th International Workshop on Computational Terminology, Osaka, Japan, 2–11.

Marianna Apidianaki, Xavier Tannier and Cécile Richart. 2016. Datasets for Aspect-Based Sentiment Analysis in French. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Roberto Basili, Danilo Croce and Giuseppe Castellucci. 2017. Dynamic polarity lexicon acquisition for advanced Social Media analytics. International Journal of Engineering Business Management, Volume 9, 1–18.

David M. Blei, Andrew Y. Ng and Michael I. Jordan. 2003. Latent Dirichlet Allocation. The Journal of Machine Learning Research, Volume 3, 993–1022.

M. Teresa Cabré Castellví, Rosa Estopà Bagot and Jordi Vivaldi Palatresi. 2001. Automatic term detection: a review of current systems. In Bourigault, D., Jacquemin, C. and L'Homme, M-C. (eds.), Recent Advances in Computational Terminology, 53–88.

Giuseppe Castellucci, Danilo Croce and Roberto Basili. 2016. A Language Independent Method for Generating Large Scale Polarity Lexicons. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 38–45.

Damien Cram and Béatrice Daille. 2016. TermSuite: Terminology Extraction with Term Variant Detection. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Berlin, Germany, 13–18.

Luca Dini, Paolo Curtoni and Elena Melnikova. 2018. Portability of Aspect Based Sentiment Analysis: Thirty Minutes for a Proof of Concept. Submitted to The 5th IEEE International Conference on Data Science and Advanced Analytics (DSAA 2018), Turin.

Luca Dini, André Bittar, Cécile Robin, Frédérique Segond and M. Montaner. 2017. SOMA: The Smart Social Customer Relationship Management. Sentiment Analysis in Social Networks, Chapter 13, 197–209. DOI: 10.1016/B978-0-12-804412-4.00013-9.

Luca Dini and André Bittar. 2016. Emotion Analysis on Twitter: The Hidden Challenge. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Arnaud Ferré. 2017. Représentation de termes complexes dans un espace vectoriel relié à une ontologie pour une tâche de catégorisation. Rencontres des Jeunes Chercheurs en Intelligence Artificielle (RJCIA 2017), Caen, France.

Aitor García-Pablos, Montse Cuadros and German Rigau. 2017. W2VLDA: Almost unsupervised system for Aspect Based Sentiment Analysis. Expert Systems with Applications, (91): 127–137. arXiv:1705.07687.

Athanasios Giannakopoulos, Diego Antognini, Claudiu Musat, Andreea Hossmann and Michael Baeriswyl. 2017. Dataset Construction via Attention for Aspect Term Extraction with Distant Supervision. 2017 IEEE International Conference on Data Mining Workshops (ICDMW).

Tomáš Hercig, Tomáš Brychcín, Lukáš Svoboda, Michal Konkol and Josef Steinberger. 2016. Unsupervised Methods to Improve Aspect-Based Sentiment Analysis in Czech. Computación y Sistemas, vol. 20, No. 3, 365–375.

David Jurgens and Keith Stevens. 2010. The S-Space package: An open source package for word space models. Proceedings of the ACL 2010 System Demonstrations, 30–35.

Noriaki Kano, Nobuhiku Seraku, Fumio Takahashi and Shinichi Tsuji. 1984. Attractive Quality and Must-Be Quality. Hinshitsu: The Journal of the Japanese Society for Quality Control, 14(2): 39–48.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751. arXiv:1408.5882.

Emilie Merdy, Juyeon Kang and Ludovic Tanguy. 2016. Identification de termes flous et génériques dans la documentation technique : expérimentation avec l'analyse distributionnelle automatique. Actes de l'atelier "Risque et TAL", conférence TALN.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. Proceedings of NAACL-HLT 2013, 746–751.

Dan R. Olsen. 2015. The Lean Product Playbook: How to Innovate with Minimum Viable Products and Rapid Customer Feedback. John Wiley & Sons Inc: New York, United States.

Jeffrey Pennington, Richard Socher and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud M. Jiménez-Zafra and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect Based Sentiment Analysis. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, USA.

Marco A. Valenzuela-Escárcega, Gustave V. Hahn-Powell and Mihai Surdeanu. 2015. Description of the Odin Event Extraction Framework and Rule Language. arXiv:1509.07513.

Lars Witell, Martin Löfgren and Jens J. Dahlgaard. 2013. Theory of attractive quality and the Kano methodology: the past, the present, and the future. Total Quality Management & Business Excellence, (24), 11-12: 1241–1252.

Gerhard Wohlgenannt and Filip Minic. 2016. Using word2vec to Build a Simple Ontology Learning System. Proceedings of the ISWC 2016 Posters and Demonstrations Track, co-located with the 15th International Semantic Web Conference (ISWC 2016), Vol-1690, Kobe, Japan, October 19, 2016.

Dionysios Xenos, Panagiotis Theodorakakos, John Pavlopoulos, Prodromos Malakasiotis and Ion Androutsopoulos. 2016. AUEB-ABSA at SemEval-2016 Task 5: Ensembles of Classifiers and Embeddings for Aspect Based Sentiment Analysis. Proceedings of SemEval-2016, San Diego, California, 312–317.