Francesco Barbieri

“output” — 2018/3/29 — 14:50 — page i — #1

Machine Learning Methods for Understanding Social Media Communication: Modeling Irony and Emojis

Francesco Barbieri

Doctoral Thesis UPF / Year 2017

Thesis Supervisor: Horacio Saggion
DTIC Department

To Andrea, Semprepresente

Acknowledgements

The last four years have definitely been the best of my life. I had the opportunity to work in a wonderful university, with incredible colleagues, and on unique (and fun) research topics. It was a blessing, and I want to thank many people.

First, thank you Horacio! You made all this possible. You helped me when I needed it and left me free to explore different research areas when I felt like it, leading me back to the right track when I was going too far. You have been an extraordinary mentor, thanks.

An important part of this journey was my visit to the University of Trento. I thank Marco Baroni for hosting me, and all members of the Clic group, especially German, The Nghia, Adam, and Angeliki. A very special thanks goes to Angeliki: you taught me so much, and working with you was really important to me.

I want to thank all the Snap researchers I have worked with in the last four months. Working in such a fresh, dynamic and creative environment is incredible. Plus, seeing that research can actually work in real-life applications is immensely rewarding. Special thanks to Luis Marujo, who gave me this opportunity (and also to Aletta, who proofread part of this thesis).
I also want to thank all the UPF/TALN researchers, in particular my co-authors: Francesco Ronzan (I consider you my second supervisor, thank you for being excited every time I came up with a new idea), Mich Ballesteros (doing machine learning with you was great, I learned a lot from you), and Sergio Oramas (working on the music domain was great fun, even if we still need to find someone to take care of the writing to be the perfect team).

I also want to thank the people who were with me most of the time during the last four years. Thank you Ruppero Carlini for your non-technical and technical support (which was about asking more questions than actually solving problems). Thank you Dr Sun. Soler for all the kind and cordial words you always had for me, I know you believe in me. And thank you Luis Pinox, I would not be as strong and successful as I am if you had not taught me the six rules of success.

Thank you to my Italian friends. I travel and move often, but you are always there for me. I thank all the nice people I met here in Barcelona, from the sadakos and helmanas to the Casamor-Vidal family. Thank you Max, Mercedes, Marga and Carlos, you made me feel at home during these four years.

I also thank my family, who always believed in me. Thank you dad, mum, Samo and Geky: you are the best, and I am so grateful to have you in my life.

Finally, I want to thank Maria. There is not much to say: you are my life.

Abstract

In this dissertation we propose algorithms for the analysis of social media texts, focusing on two particular aspects: irony and emojis. We propose novel automatic systems, based on machine learning methods, able to recognize and interpret these two phenomena. We also explore the problem of topic bias in sentiment analysis and irony detection, showing that traditional word-based systems are not robust when they have to recognize irony in a new domain.
We argue that our proposal is better suited to topic changes. We then use our approach to recognize another phenomenon related to irony: satirical news on Twitter. By relying on distributional semantic models, we also introduce a novel method for studying the meaning and use of emojis in social media texts. We also propose an emoji prediction task that consists of predicting the emoji present in a text message using only the text. We show that this emoji prediction task can be performed by deep-learning systems with good accuracy, and that this accuracy can be improved by using the images included in the post.

Summary

In this thesis we propose algorithms for the analysis of social media texts, focusing on two particular aspects: the automatic recognition of irony and the analysis and prediction of emojis. We propose automatic systems, based on machine learning methods, able to recognize and interpret these two phenomena. We also explore the problem of bias in sentiment analysis and irony detection, showing that traditional word-based systems are not robust when the training and test data belong to different domains. The model proposed in this thesis for irony recognition is more stable under domain changes than word-based systems. In a series of experiments we show that our model is also able to distinguish between satirical and non-satirical news. In addition, we use distributional semantic models to explore how the meaning and use of emojis vary across languages and across the seasons of the year. We also ask whether it is possible to predict the emoji that a message contains using only the text of the message. We have shown that our deep-learning system can perform this task with good accuracy, and that the results can be improved if, in addition to the text, information about the images accompanying the text is used.

Contents

Figure Index
Table Index
1 INTRODUCTION
  1.1 Organization of this Dissertation
  1.2 Contributions
2 RELATED WORK
  2.1 Irony in Social Media
    2.1.1 Definition of Irony
      2.1.1.1 Verbal Irony
      2.1.1.2 Sarcasm
      2.1.1.3 Satire
    2.1.2 Irony Detection in Social Media
      2.1.2.1 Definition of the Task
      2.1.2.2 Short Text
      2.1.2.3 Long Text
    2.1.3 Computational Models
      2.1.3.1 Rule Based
      2.1.3.2 Feature Engineering
      2.1.3.3 Deep Learning
  2.2 Visual aspects of Social Media Content
    2.2.1 Vision and Language
    2.2.2 Emojis
      2.2.2.1 Emoticons
      2.2.2.2 Emoji Sentiments
      2.2.2.3 Emoji Semantics
3 #IRONY AND #SARCASM DETECTION IN SOCIAL MEDIA
  3.1 Introduction
  3.2 Dataset and Text Processing
  3.3 Model
    3.3.1 Frequency
    3.3.2 Written-Spoken
    3.3.3 Intensity
    3.3.4 Synonyms
    3.3.5 Ambiguity
    3.3.6 Sentiments
    3.3.7 Structure
    3.3.8 Bag of Words Baseline
  3.4 Experiments and Results
    3.4.1 First Experiment
    3.4.2 Second Experiment
  3.5 Discussion
  3.6 Conclusions
4 TOPIC BIAS IN IRONY DETECTION AND SENTIMENT ANALYSIS
  4.1 Introduction
  4.2 Dataset and Text Processing
  4.3 Model
    4.3.1 Word-Based
    4.3.2 Synsets
  4.4 Experiments and Results
    4.4.1 Task 1: Subjectivity Classification
    4.4.2 Task 2: Polarity Classification
    4.4.3 Task 3: Irony Detection
    4.4.4 Cross-Domain Experiments
  4.5 Discussion
  4.6 Conclusions
5 SATIRE DETECTION
  5.1 Introduction
  5.2 Dataset and Text Processing
  5.3 Model
    5.3.1 English Resources
    5.3.2 Spanish Resources
    5.3.3 Italian Resources
    5.3.4 Model Summary
  5.4 Experiments
    5.4.1 Monolingual Experiments
      5.4.1.1 English Experiments
      5.4.1.2 Spanish Experiments
      5.4.1.3 Italian Experiments
    5.4.2 Cross-Lingual Experiments
  5.5 Discussion
  5.6 Conclusions
6 EMOJI SEMANTICS
  6.1 Introduction
  6.2 The Twitter Dataset
    6.2.1 Size of the dataset and usage of emojis
    6.2.2 Repeated and combined use of emojis in a tweet
  6.3 Comparing the semantics of words and emojis: the skip-gram vector model
  6.4 Experimental Framework
    6.4.1 Nearest-Neighbour experiment
    6.4.2 Similarity-matrix experiment
  6.5 Results and discussion
    6.5.1 Language-and-location analysis of emojis
      6.5.1.1 Nearest-Neighbour experiment for language-and-location analysis
      6.5.1.2 Similarity-matrix experiment for language-and-location analysis
    6.5.2 Season-based analysis of emojis
      6.5.2.1 Nearest-Neighbours experiment for season-based analysis
      6.5.2.2 Similarity-matrix experiment for season-based analysis
