Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis

Master in Cognitive Science and Language
Master’s Thesis
September 2018

Alejandro Ramírez Atrio
Supervisor: Toni Badia

Abstract

In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles that of our source language (English). Our original expectation was that a Long Short-Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with a different word order. We hypothesized that the more closely the word order of a target language resembles that of our source language, the better our sentiment classifier would perform on that target language. We tested five sets of transformation rules for our Part-of-Speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba.

The results suggest that, with the bilingual word embeddings we train our Long Short-Term Memory model on, the model does not benefit from any learned English word order when it is used cross-lingually. There is no improvement when the Spanish and Catalan texts are reordered so that their word order more closely resembles English, and no significant drop in score even when a random reordering makes them almost unintelligible. This holds both for binary classification (positive, negative) and for four-class classification (strongly positive, positive, negative, strongly negative). We also replicated this with two different classifiers: a Convolutional Neural Network and a Support Vector Machine.
The Convolutional Neural Network should primarily learn only short-range word order, while the Long Short-Term Memory network should be expected to learn longer-range orderings as well. The Support Vector Machine does not take word order into account. Subsequently, we analyzed the prediction biases of these models to see how they affect the reordering results. Based on this analysis, we conclude that the lack of improvement of the Long Short-Term Memory classifier when fed a reordered text is not due to prediction bias.

In the process of training our models, we use two bilingual lexicons (English-Spanish and English-Catalan; Hu and Liu 2004) containing words that are typically key to the sentiment of a sentence, and we use these lexicons to project our bilingual word embeddings between each language pair. Given the results of the reordering experiments, we conjectured that what determines how our models classify the sentiment of the target languages is whether or not these lexicon words appear in the input sentence. We therefore test different alterations of the target-language corpora to see whether this conjecture holds up; the results seem to support it.

Our main conclusion, therefore, is that Part-of-Speech-based word reordering of a target language to make its word order more similar to a source language does not improve the sentiment classification results of our Long Short-Term Memory classifier trained on source-language data with our bilingual word embeddings, regardless of the granularity of the sentiment.

Contents

1 Introduction
  1.1 Motivation
  1.2 Aims
  1.3 Approach
2 State of the art
  2.1 Cross-lingual Sentiment Analysis
  2.2 Word Reordering in Machine Translation (as Preprocessing)
3 Models, Data, and Tools
  3.1 Models
  3.2 Corpora and Lexicons
  3.3 Tools
4 Impact of word reorderings on an LSTM, CNN, and SVM
  4.1 Motivation and Goals
  4.2 Determining the Transformation Rules
  4.3 Sets of transformation rules
  4.4 Reordering
  4.5 Results and Discussion
5 Model Analysis
  5.1 Error Analysis
  5.2 Impact of Lexicon Words
6 Conclusions
  6.1 Summary of Results
References
Appendix
  POS Tagsets
  CREGO Frequencies on CoStEP and Tatoeba
  Extract of POS frequencies CoStEP

List of Tables

1 Corpora statistics
2 Train, testing, and development sets
3 Frequencies of consecutive ‘VERB’ tag in CoStEP
4 ‘VERB’ tag frequencies after fusing the combinations
5 Simplified comparison of frequencies for Criterion 1
6 Application of the rules criteria in CoStEP
7 Application of the rules criteria in Tatoeba
8 Application of reorderings to an example sentence
9 Final sets of rules
10 LSTM results with different reorderings
11 CNN results with different reorderings
12 SVM results
13 Comparison of prediction biases of classifiers
14 Classifiers results on a lexically-modified corpus
15 Application of lexical modifications to an example sentence
16 Tagsets used for annotating English (38) and Spanish (16), and tagset used by Crego and Mariño (9)
17 CREGO Spanish POS combinations frequencies in CoStEP
18 CREGO Spanish POS combinations frequencies in Tatoeba
19 Ten most frequent 2-gram POS combinations in Spanish and English (CoStEP)

1 Introduction

1.1 Motivation

When facing a relatively simple text, such as a restaurant or hotel review, a tweet, or any product review, a general sentiment can typically be extracted. This sentiment can be considered fundamentally positive or negative, and different levels of granularity can be defined (that is, how many divisions we place on the gradient between ‘positive’ and ‘negative’). The automatic process by which a computer recognizes the overall sentiment of such texts is called Sentiment Analysis.
These automatic processes allow us to analyze large quantities of text extracted, for example, from the Internet. Depending on the topics these texts cover, we can get a view of how a section of the population feels about a particular product, a tourist destination, or a public figure, which can be of interest to companies marketing their products, public institutions, or social studies, among many other areas or organizations.

Nowadays, sentiment analysis of languages with more computational resources (such as English) tends to produce better results than that of languages with fewer resources (such as Spanish or Catalan). Confronted with this problem, we have two main choices. The first is to create resources for low-resource languages until the amount of tools and data is comparable to that of high-resource languages (mainly English), and so obtain a similar level of results in Sentiment Analysis tasks. The second is to use the data and tools already available for high-resource languages to perform Sentiment Analysis tasks on low-resource languages. This, of course, requires some intermediate work so that resources in a language such as English can be used for a language such as Spanish, and there are a variety of ways to do this. Cross-Lingual Sentiment Analysis (CLSA) is the study of the different ways in which we can use the data and tools of high-resource languages to improve the performance of Sentiment Analysis tasks in low-resource languages.

One of these possibilities is the following: when we want to analyze the sentiment of a hotel review in Catalan (we will call this the target language), for instance, we can simply translate it into the language whose resources we are going to use.
For example, we may have trained a model on English data (we will call English the source language) that performs quite well when classifying the sentiment of English hotel reviews. We would translate the Catalan review into English and simply use the English-trained sentiment classifier to analyze the translation of the original review. We can think of this approach as using Machine Translation (MT) as preprocessing for CLSA.

In this thesis we use a different approach: instead of MT, we use bilingual word embeddings between the source and target languages. We are interested in how much the difference in word order between the target and source languages affects sentiment classification. Therefore, we want to study how reordering the words of our target languages, Spanish and Catalan, so that they more closely resemble the word order of our source language, English, affects the sentiment classification of an English-trained model. While MT approaches typically require large amounts of parallel data, which may be difficult to acquire for low-resource languages, bilingual embeddings have proved competitive in Sentiment Analysis while requiring less data (Abdalla and Hirst 2017, Barnes et al 2018b, Chen et al 2016).

A bilingual embedding consists of two monolingual embeddings that represent translation pairs: one in the source language and one in the target language. For instance, ideally, if we were to compare the monolingual distribution of the triad ‘food’-‘dinner’-‘orange’ with that of ‘comida’-‘almuerzo’-‘naranja’, we would find strong similarities.
What we do, then, is jointly optimize the two representations (the six embeddings) so that the two embeddings of each translation pair end up very similar to each other. This leaves three pairs (one per translation pair) that represent the respective distribution in both languages. For this we require a bilingual dictionary of key words between the source and target languages.
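The idea of aligning two monolingual spaces through a small bilingual dictionary can be illustrated with a minimal sketch. It is not necessarily the projection used in this thesis: orthogonal Procrustes is swapped in here as one common way to learn such a mapping. The vectors are made-up toy data, and the names `learn_projection` and `cosine` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def learn_projection(tgt_vecs, src_vecs):
    """Return the orthogonal W minimising ||tgt_vecs @ W - src_vecs||_F."""
    # Orthogonal Procrustes solution: W = U @ Vt, with U, S, Vt = SVD(tgt^T @ src).
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return u @ vt


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim = 3

# Toy bilingual dictionary (assumption: the three pairs from the example above).
lexicon = [("food", "comida"), ("dinner", "almuerzo"), ("orange", "naranja")]

# Made-up monolingual spaces: the Spanish space is a rotated copy of the
# English one, i.e. the ideal case where both distributions match exactly.
en_vecs = rng.normal(size=(len(lexicon), dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
es_vecs = en_vecs @ rotation

# Learn the map from the dictionary pairs and project Spanish into English space.
w = learn_projection(es_vecs, en_vecs)
projected = es_vecs @ w

for (en_word, es_word), en_vec, es_vec in zip(lexicon, en_vecs, projected):
    print(f"{en_word:>6} ~ {es_word:<8} cosine = {cosine(en_vec, es_vec):.3f}")
```

In this idealized setting the projected Spanish vectors coincide with their English counterparts, so each translation pair has a cosine similarity close to 1. With real embeddings the match is only approximate, which is why the choice of dictionary words matters.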
