Sentiment-Based Candidate Selection for NMT
Alex Jones, Dartmouth College, [email protected]
Derry Wijaya, Boston University, [email protected]

arXiv:2104.04840v1 [cs.CL] 10 Apr 2021

Abstract

The explosion of user-generated content (UGC)—e.g. social media posts, comments, and reviews—has motivated the development of NLP applications tailored to these types of informal texts. Prevalent among these applications have been sentiment analysis and machine translation (MT). Grounded in the observation that UGC features highly idiomatic, sentiment-charged language, we propose a decoder-side approach that incorporates automatic sentiment scoring into the MT candidate selection process. We train separate English and Spanish sentiment classifiers, then, using n-best candidates generated by a baseline MT model with beam search, select the candidate that minimizes the absolute difference between the sentiment score of the source sentence and that of the translation, and perform a human evaluation to assess the produced translations. Unlike previous work, we select this minimally divergent translation by considering the sentiment scores of the source sentence and translation on a continuous interval, rather than using e.g. binary classification, allowing for more fine-grained selection of translation candidates. The results of human evaluations show that, in comparison to the open-source MT baseline model on top of which our sentiment-based pipeline is built, our pipeline produces more accurate translations of colloquial, sentiment-heavy source texts.[1]

1 Introduction

The Web, widespread internet access, and social media have transformed the way people create, consume, and share content, resulting in the proliferation of user-generated content (UGC). UGC—such as social media posts, comments, and reviews—has proven to be of paramount importance both for users and organizations/institutions (Pozzi et al., 2016). As users enjoy the freedom of sharing their opinions in this relatively unconstrained environment, corporations can analyze user sentiments and extract insights for their decision-making processes (Timoshenko and Hauser, 2019) or translate UGC into other languages to widen the company's scope and impact. For example, Hale (2016) shows that translating UGC between certain language pairs has beneficial effects on the overall ratings customers gave to attractions and shows on TripAdvisor, while the absence of translation hurts ratings. However, translating UGC comes with its own challenges, which differ from those of translating well-formed documents like news articles. UGC is shorter and noisier, characterized by idiomatic and colloquial expressions (Pozzi et al., 2016). Translating idiomatic expressions is hard, as they often convey figurative meaning that cannot be reconstructed from the meaning of their parts (Wasow et al., 1983), and it remains one of the open challenges in machine translation (MT) (Fadaee et al., 2018). Idiomatic expressions, however, typically carry an additional property: they imply an affective stance rather than a neutral one (Wasow et al., 1983). The sentiment of an idiomatic expression, therefore, can be a useful signal for translation. In this paper, we hypothesize that a good translation of an idiomatic text, such as those prevalent in UGC, should be one that retains its underlying sentiment, and we explore the use of textual sentiment analysis to improve translations.

Our motivations for adding sentiment analysis models to the NMT pipeline are several. First, with the sorts of texts prevalent in UGC (namely, idiomatic, sentiment-charged ones), the sentiment of a translated text is often arguably as important as the quality of the translation in other respects, such as adequacy, fluency, and grammatical correctness. Second, while a sentiment classifier can be trained particularly well to analyze the sentiment of various texts—including idiomatic expressions (Williams et al., 2015)—these idiomatic texts may be difficult for even state-of-the-art (SOTA) MT systems to handle consistently. This can be due to problems such as literal translation of figurative speech, but also to less obvious errors such as truncation (i.e. failing to translate crucial parts of the source sentence). Our assumption, however, is that with open-source translation systems such as OPUS MT[2], the correct translation of a sentiment-laden, idiomatic text often lies somewhere lower among the predictions of the MT system, and that the sentiment analysis model can help signal the right translation by re-ranking candidates based on sentiment. Our contributions are as follows:

• We explore the idea of choosing translations that minimize source-target sentiment differences on a continuous scale (0-1). Previous works that addressed the integration of sentiment into the MT process have treated this difference as a simple polarity difference (i.e., positive, negative, or neutral) that does not account for the degree of difference between the source text and translation.
• We focus in particular on idiomatic, sentiment-charged texts sampled from real-world UGC, and show, both through human evaluation and qualitative examples, that our method improves a baseline MT model's ability to select sentiment-preserving and accurate translations in notable cases.
• We extend our method of using monolingual English and Spanish sentiment classifiers to aid in MT by replacing the two classifiers with a single, multilingual sentiment classifier, and analyze the results of this second MT pipeline on the lower-resource English-Indonesian translation task, illustrating the generalizability of our approach.

2 Related Work

Several papers in recent years have addressed the incorporation of sentiment into the MT process. Perhaps the earliest of these is Sennrich et al. (2016), which examined the effects of using honorific marking in training data to help MT systems pick up on the T-V distinction (e.g. informal tu vs. formal vous in French) that serves to convey formality or familiarity. Si et al. (2019) used sentiment-labeled sentences containing one of a fixed set of sentiment-ambiguous words, as well as valence-sensitive word embeddings for these words, to train models such that users could input the desired sentiment at translation time and receive the translation with the appropriate valence. Lastly, Lohar et al. (2017, 2018) experimented with training sentiment-isolated MT models—that is, MT models trained on only those texts that had been pre-categorized into a set number of sentiment classes, i.e., positive-only texts or negative-only texts. Our approach is novel in using sentiment to re-rank candidate translations of UGC in an MT pipeline and in using precise sentiment scores rather than simple polarity matching to aid the translation process.

In terms of sentiment analysis models for non-English languages, Can et al. (2018) experimented with using an RNN-based English sentiment model to analyze the sentiment of texts translated into English from other languages, while Balahur and Turchi (2012) used SMT to generate sentiment training corpora in non-English languages. Dashtipour et al. (2016) provides an overview and comparison of various techniques used to tackle multilingual sentiment analysis.

As for MT candidate re-ranking, Hadj Ameur et al. (2019) provides an extensive overview of the various features and tools that have been used to aid in the candidate selection process, and also proposes a feature ensemble approach that doesn't rely on external NLP tools. Others who have used candidate selection or re-ranking to improve MT performance include Shen et al. (2004) and Yuan et al. (2016). To the best of our knowledge, however, no previous re-ranking methods have used sentiment, despite findings that MT often alters sentiment, especially when ambiguous words or figurative language such as metaphors or idioms are present, or when the translation exhibits incorrect word order (Mohammad et al., 2016).

3 Models and Data

3.1 Sentiment Classifiers

For the first portion of our experiments, we train monolingual sentiment classifiers, one for English and another for Spanish. For the English classifier, we fine-tune the BERT Base uncased model (Devlin et al., 2019), as it achieves SOTA or nearly SOTA results on various text classification tasks. We construct our BERT-based sentiment classifier using BertForSequenceClassification, following McCormick and Ryan (2019). For our English training and development data, we sample 50K positive and 50K negative tweets from the automatically annotated sentiment corpus described in Go et al. (2009), using 90K tweets for training and the rest for development. For the English test set, we use the human-annotated sentiment corpus also described in Go et al. (2009), which consists of 359 total tweets after neutral-labeled tweets are removed. We use BertTokenizer with 'bert-base-uncased' as our vocabulary file and fine-tune a BERT model on one NVIDIA V100 GPU to classify the tweets into positive or negative labels for one epoch, using the Adam optimizer with weight decay (AdamW) and a linear learning rate schedule with warmup.

The fine-tuned Spanish model sometimes produced unreliable results, so we employ multiple random restarts and select the best model, a technique used in the original BERT paper (Devlin et al., 2019). The test accuracy of the Spanish model was 77.8%.

3.2 Baseline MT Models

The baseline MT models we use for both English-Spanish and Spanish-English translation are the publicly available Helsinki-NLP/OPUS MT models released by Hugging Face and based on Marian NMT (Tiedemann and Thottingal, 2020; Junczys-Dowmunt et al., 2018; Wolf et al., 2019). Namely, we use both the en-ROMANCE and ROMANCE-en Transformer-based models, which were both trained using the OPUS dataset (Tiedemann and Thottingal, 2020).

[1] Code and reference materials are available at https://github.com/AlexJonesNLP/SentimentMT
[2] https://github.com/Helsinki-NLP/Opus-MT
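The selection rule at the heart of the pipeline (choose, among the n-best beam candidates, the translation whose sentiment score lies closest to that of the source sentence) reduces to a single argmin over absolute score differences. The following is a minimal sketch in plain Python; the scorer arguments are hypothetical stand-ins for the paper's English and Spanish BERT classifiers, injected as callables so the logic is independent of any particular model:

```python
def select_candidate(source, candidates, src_sentiment, tgt_sentiment):
    """Pick the candidate translation minimizing the absolute difference
    between its sentiment score and the source sentence's score.

    src_sentiment / tgt_sentiment: callables mapping text -> score in [0, 1],
    standing in for the source- and target-language sentiment classifiers.
    """
    s_src = src_sentiment(source)
    # argmin over |score(source) - score(candidate)|
    return min(candidates, key=lambda c: abs(s_src - tgt_sentiment(c)))
```

Because the scores live on a continuous [0, 1] interval rather than a discrete polarity set, two candidates with the same polarity can still be distinguished by how strongly they express it, which is precisely the fine-grained selection the abstract describes.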
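Putting Sections 3.1 and 3.2 together, the end-to-end flow might look as follows. This is a sketch under stated assumptions, not the authors' code: it assumes the Hugging Face transformers Marian API (n-best output via `num_beams` and `num_return_sequences`), a hypothetical local path "sentiment-es" for a fine-tuned Spanish BertForSequenceClassification checkpoint, and the `>>es<<` target-language prefix convention of the en-ROMANCE model. The pure helper `positive_prob` converts the classifier's two class logits into a score on [0, 1]; the heavy imports are deferred into the function so the helper stands alone.

```python
import math

def positive_prob(logit_neg, logit_pos):
    """Softmax over the two class logits; returns P(positive) in [0, 1]."""
    m = max(logit_neg, logit_pos)  # subtract max for numerical stability
    e_neg, e_pos = math.exp(logit_neg - m), math.exp(logit_pos - m)
    return e_pos / (e_neg + e_pos)

def translate_with_sentiment(source, s_src, n_best=10):
    """Generate n-best candidates with OPUS MT beam search, then return the
    candidate whose sentiment score is closest to the source score s_src.

    Requires transformers + torch. "sentiment-es" is a hypothetical path to
    a fine-tuned Spanish BertForSequenceClassification checkpoint.
    """
    from transformers import (MarianMTModel, MarianTokenizer,
                              BertForSequenceClassification, BertTokenizer)

    mt_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
    mt_tok = MarianTokenizer.from_pretrained(mt_name)
    mt = MarianMTModel.from_pretrained(mt_name)

    # n-best list via beam search; ">>es<<" selects Spanish as the target.
    batch = mt_tok([">>es<< " + source], return_tensors="pt")
    out = mt.generate(**batch, num_beams=n_best, num_return_sequences=n_best)
    candidates = mt_tok.batch_decode(out, skip_special_tokens=True)

    clf_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    clf = BertForSequenceClassification.from_pretrained("sentiment-es")
    best, best_diff = None, float("inf")
    for cand in candidates:
        logits = clf(**clf_tok(cand, return_tensors="pt")).logits[0]
        score = positive_prob(logits[0].item(), logits[1].item())
        if abs(score - s_src) < best_diff:
            best, best_diff = cand, abs(score - s_src)
    return best
```

The source score `s_src` would come from the English classifier in the same way; only the re-ranking loop differs from an ordinary beam-search decode, so the approach leaves the baseline MT model untouched.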