A comparative study of methods for early risk prediction on the Internet

Elena Fano

Uppsala University, Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
June 10, 2019

Supervisors: Joakim Nivre, Uppsala University; Jussi Karlgren, Gavagai AB

Abstract

We built a system to participate in the eRisk 2019 T1 Shared Task. The aim of the task was to evaluate systems for early risk prediction on the internet, in particular to identify users suffering from eating disorders as accurately and quickly as possible given their history of Reddit posts in chronological order. In the controlled settings of this task, we also evaluated the performance of three different word representation methods: random indexing, GloVe, and ELMo. We discuss our system's performance, also in the light of the scores obtained by other teams in the shared task. Our results show that our two-step learning approach was quite successful, and we obtained good scores on the early risk prediction metric ERDE across the board. Contrary to our expectations, we did not observe a clear-cut advantage of contextualized ELMo vectors over the commonly used and much more light-weight GloVe vectors. Our best model in terms of F1 score turned out to be a model with GloVe vectors as input to the text classifier and a multi-layer perceptron as user classifier. The best ERDE scores were obtained by the model with ELMo vectors and a multi-layer perceptron. The model with random indexing vectors hit a good balance between precision and recall in the early processing stages but was eventually surpassed by the models with GloVe and ELMo vectors. We put forward some possible explanations for the observed results, as well as proposing some improvements to our system.

Contents

Acknowledgments
1. Introduction
   1.1. Purpose
   1.2. Outline
2. Background
   2.1. Language of mental health patients
   2.2. Early Risk Prediction on the Internet
   2.3. Word embeddings
        2.3.1. Overview
        2.3.2. Random indexing
        2.3.3. GloVe
        2.3.4. ELMo
3. The shared task
   3.1. eRisk 2019
   3.2. Data set
   3.3. Evaluation metrics
        3.3.1. ERDE
        3.3.2. Precision, recall and F measure
        3.3.3. Latency, speed and F-latency
        3.3.4. Ranking-based metrics
4. Methodology
   4.1. System design
   4.2. Experimental settings
        4.2.1. Word embeddings as input
   4.3. Runs
   4.4. Other models
5. Results and discussion
   5.1. Development experiments
        5.1.1. Error analysis
   5.2. Results on the test set
        5.2.1. Error analysis
   5.3. Shared task
6. Conclusion
References
A. Appendix

Acknowledgments

I would like to thank my supervisors, Joakim Nivre and Jussi Karlgren, for sup- porting me with their academic knowledge and constant availability throughout this project. Thank you as well to the Ph.D. students Miryam de Lhoneux and Artur Kulmizev for helping me out with technical problems along the way. Finally, I am really grateful to all the people at Gavagai for their input and their interest in my work during the past months.

1. Introduction

Mental health problems are one of the great challenges of our time. It is estimated that more than a billion people worldwide suffer from some kind of mental health issue.1 According to the World Health Organization, more than 300 million people in the world suffer from depression,2 and 70 million suffer from some kind of eating disorder. As with other major global phenomena, mental health issues are widely discussed on the internet in different online communities and social media. This generates an enormous amount of text, which contains really valuable information. Natural language processing is the field of data science that deals with language data, and with the help of advanced tools and algorithms it is possible to extract patterns and gain insights that can help save lives.

One possible application of such tools is to monitor the texts that users publish online, in forums such as Reddit or social media such as Twitter, and automatically detect whether a person is at risk of developing a dangerous mental health issue. If the technology becomes reliable enough, an alert system could be developed to inform the person and put them in touch with health care resources before the problem becomes life-threatening. Another realistic scenario would be to provide support for help lines and hospitals. It would be possible to develop chat bots and other dialogue systems that can determine the severity of a person's mental health risk based on just a few lines of text. This would help caregivers to prioritize and make sure that everyone receives the treatment they need, when they need it.

The strength of machine learning tools is in the amount of data that they can process automatically in a short period of time. It would be impossible for human psychologists to keep up with the incredible amount of text data that is produced every day. Moreover, new machine learning techniques do not require the programmer to know much beforehand about the problem that she is trying to solve. These algorithms can extract cues from large data sets automatically, which eliminates the need for hand-crafted features. Of course this development has to be carefully monitored in order to take ethical issues into account, and the last word should always go to a medical professional.

The important question now is whether there is any evidence that people who suffer from mental illnesses actually express themselves in a different way compared to healthy individuals. After all, machine learning is not a magic wand, and the texts have to contain some amount of signal to be picked up with such methods. As it turns out, there is plenty of evidence from psychology and cognitive science that the language of mental health patients has specific characteristics that are not as prominent in control groups of healthy people (see Section 2.1).

1https://ourworldindata.org/mental-health
2https://www.who.int/news-room/fact-sheets/detail/depression

The Early Risk Prediction on the Internet laboratory at CLEF 2019 (eRisk 2019 for short) focused on the detection of anorexia and self-harm tendencies in social media text, with particular emphasis on the temporal dimension. The participating teams were not only asked to detect signs of the aforementioned mental health issues, but to do so as quickly as possible. The lab also required the teams to provide an estimate of the risk level for each user, and one of the tasks focused specifically on automatically filling out questionnaires on risk factors.

This thesis set out to participate in Task 1 (T1) of the eRisk 2019 lab, i.e. early detection of signs of anorexia. We developed an end-to-end system that can perform the required task and compared our results with the other teams. In the framework of this downstream task, we evaluated different types of word representations, also known as word embeddings. Evaluating the performance of the different versions of our system gave us some insights into the strengths and weaknesses of each word representation method.

1.1. Purpose

This master's thesis project has two main purposes:

• Contributing to the development of techniques for early risk detection on the internet. With this goal in mind, we build an end-to-end system that, given the texts posted by a user on the internet in chronological order, predicts the risk of anorexia as quickly as possible. There are a number of methodological considerations in this task that make it challenging and academically relevant.

• Evaluating different types of word representations in the controlled envi- ronment of a specific downstream task. In particular, we compare the per- formance of word embeddings belonging to different families of methods: random-indexing embeddings, GloVe embeddings and ELMo embeddings. As a baseline we use randomly initialized embeddings from one of the popular machine learning libraries.

1.2. Outline

We begin by introducing the related work and fundamental concepts that constitute the basis for this thesis. The background section is divided into three subsections: the first concerns studies in cognitive science and psychology that have investigated the language of mental health patients; the second subsection illustrates previous approaches and algorithms used to solve the early risk prediction task; the third subsection deals with word representation methods. We then move on to illustrate the setup of the 2019 shared task, the evaluation metrics and the data set that we worked with. The following section covers the methodology of the present work: we discuss in depth our system design choices and the final experimental settings. Then we present the five models entered in the shared task, as well as other models that were included in our experiments but not submitted to the shared task.

The results section presents the outcome of our experiments, as well as the scores obtained by other teams in order to provide a comparison. We then discuss the performance of the various models and draw some conclusions regarding the strengths and weaknesses of different architectures.

2. Background

2.1. Language of mental health patients

Many studies have focused on the language of depressed patients. Rude et al. (2004) analyzed the language of essays written by American college students who had been depressed, were currently depressed or had never been depressed. They found that, in accordance with prevalent psychological theories of depression, depressed subjects tended to show more negative focus and self-preoccupation than healthy control participants. Even people who had previously been depressed but had recovered showed a higher ratio of "I" pronouns compared to non-depressed students.

Another study by Smirnova et al. (2013) found many markers in mildly depressed participants' speech which told them apart from manifestations of sadness in people who did not suffer from depression. For example, they mention that patient speech presented an "increased number of phraseologisms, tautologies, lexical and semantic repetitions, metaphors, comparisons, inversions, ellipsis" compared to the control group.

Some studies have also focused on the language of eating disorder patients. Wolf et al. (2007) looked at essays written by patients undergoing treatment for anorexia nervosa and compared them to recovered ex-patients and a control group of college students. Their results show, similarly to the depression studies, that inpatients had the highest rate of negative emotion words and self-related words compared to the other two groups. The patients also used more words related to anxiety and fewer words related to social processes. Surprisingly, they also made the fewest references to eating habits, but this could be due to the fact that they were already in therapy and were trying to distance themselves from their disease.

Wolf et al. (2013) looked instead at the language of pro-anorexia blogs online, where people who do not acknowledge their mental health issues and refuse to go into therapy interact with each other. According to the authors, these writings showed "lower cognitive processing, a more closed-minded writing style, were less emotionally expressive, contained fewer social references, and focused more on eating-related contents than recovery blogs". They were able to select a subset of 12 language features that allowed them to correctly identify the source of the text in 84% of the cases.

2.2. Early Risk Prediction on the Internet

As mentioned in the introduction, there is a growing interest in using NLP techniques to detect mental health issues through user writings on the internet. In 2017, CLEF (Conference and Labs of the Evaluation Forum) introduced a

new laboratory, with the purpose of setting up a shared task for Early Risk Prediction on the Internet (eRisk). The first year there were two ways of contributing: either submitting a research paper about one's own research on the subject, or participating in a pilot task on depression detection that was meant to give insights about "proper size of the data, adequate early risk evaluation metrics, alternative ways to formulate early detection tasks, other possible application domains, etc.".1

In 2018, the first full-fledged shared task was set up. There were two subtasks where the teams could submit their contributions: Task 1, with the title "Early Detection of Signs of Depression", and Task 2, called "Early Detection of Signs of Anorexia". The participating teams received a training set specific to each task and their systems were then evaluated on a test set that was made available only during the evaluation phase. In order to take the temporal dimension into account, the test data was released in 10 chunks, each containing 10% of each user's writings. The teams had to send a response for each user in the chunk before they could process the next one. Losada et al. (2018) give an overview of the results of the shared task and the approaches that the different teams used to build their systems.

As mentioned above, eRisk 2018 entailed two different tasks, one about detection of depression and one about detection of anorexia. The organizers report that for Task 1 they received 45 contributions from 11 different teams, whereas for Task 2 they received 35 contributions from 9 different teams. Most of the teams used the same system for both tasks, since they were indeed quite similar, the only difference being domain-specific lexical resources related to one of the two illnesses. In what follows, we will go over some of the strategies used by the teams that submitted a system for Task 2, detection of anorexia. We first give an overview of the main categories of approaches and then focus more closely on the best performing teams.

The teams used a variety of approaches to solve the tasks. Roughly, the solutions can be divided into traditional machine learning approaches and other approaches based on different types of document and feature representations, but many teams used a combination of both. Some researchers also came up with innovative solutions to deal with the temporal aspect of the task. A common theme was to focus on the difference in performance between manually engineered (meta-)linguistic features and automatic text vectorization methods. For example, the contributions of Trotzek et al. (2018) and Ramiandrisoa et al. (2018) both dealt with this research question. Since they are one of the top performing teams, we will go into more detail on Trotzek et al. (2018) below. The other team used a combination of over 50 linguistic features for two of their models, and doc2vec (Le and Mikolov, 2014), which is a neural text vectorization method, for the other three. When they submitted their 5 runs, they used the feature-based models alone or in combination with the text vectorization models, but they report that they did not submit any doc2vec model alone because of the poor performance shown in their development experiments.

Probably the most specific challenge of this task was building a model which could take the temporal progression into account. One of the teams that obtained the best scores, Funez et al. (2018), built a time-aware system which is illustrated

1http://early.irlab.org/2017/index.html

further below. Among the other teams, Ragheb et al. (2018) used an approach that bears some resemblance to our system (see Chapter 4). They stacked two classifiers, the first of which predicted what they call the "mood" of the texts (positive or negative), while the second was in charge of making a decision given this prediction. The main difference is that they were operating with a chunk-based system, so they had to build models of different sizes to be able to make a prediction without having seen all the chunks, whereas our second classifier operates on a text-by-text basis. Furthermore, their first model uses Bayesian inversion on the text vectorization models, whereas we used a recurrent neural network with LSTMs.

Other notable approaches were to look specifically at sentences which referred to the user in the first person (Ortega-Mendoza et al., 2018), or to build different classifiers that specialized in accurately predicting positive cases and negative cases (Cacheda et al., 2018). If one of the two models' outputs rose above a predetermined confidence threshold, that decision was emitted; if neither or both of the models were above the threshold, the decision was delayed. Another team used latent topics to help in classification and focused on topic extraction algorithms (Maupomé and Meurs, 2018).

Now we turn to a more in-depth description of the most effective approaches. Trotzek et al. (2018) submitted five models to Task 2, and they obtained the best score in three out of five evaluation measures. Models three and four were regular machine learning models, whereas models one, two and five were ensemble models that combined different types of classifiers to make predictions. This team used some hand-crafted metadata features for their first model, for example the number of personal pronouns, the occurrence of phrases like "my therapist", and the presence of words that mark cognitive processes. Their first and second models consisted of an ensemble of logistic regression classifiers, three of them based on bags of words with different term weightings and the fourth, present only in their first model, based on the metadata features. The predictions of the classifiers were averaged and if the result was higher than 0.4 the user was classified as at risk. These models did not obtain any high scores, contrary to other models submitted by this team.

Their third and fourth models were convolutional neural networks (CNNs) with two different types of word embeddings: GloVe and FastText. The GloVe embeddings were 50-dimensional, pre-trained on Wikipedia and news texts, whereas the FastText embeddings were 300-dimensional and trained on social media texts expressly for this task. The architecture of the CNN was the same for both models, with one convolutional layer and 100 filters. The threshold to emit a decision of risk was set to 0.4 for model 3 and 0.7 for model 4. Unsurprisingly, the model with the larger embedding size and specifically trained vectors performed best, reporting the highest recall (0.88) and the lowest ERDE50 (5.96%) in the 2018 edition of eRisk.

ERDE stands for Early Risk Detection Error and is an evaluation metric created to track the performance of early risk detection systems. It takes into account how many texts a system processes before emitting a decision, and penalizes false negatives more than false positives. Since ERDE can be seen as a global penalty over all users in the test set, better systems obtain lower scores (for a detailed definition of the evaluation metric ERDE, see Section 3.3).

The fifth model presented by Trotzek et al. (2018) was an ensemble of the two CNN models and their first model, the bag-of-words model with metadata features. This model obtained the highest F1 in the shared task, namely 0.85, and came close to the best scores even for the two ERDE measures.

Another team that submitted high-scoring systems to the shared task in 2018 was Funez et al. (2018). They did not approach the tasks as a pure machine learning problem; instead, they proposed two techniques based on document representation and sequential acquisition of data. They maintain that regular machine learning approaches to classification tasks are not suited to handle the temporal dimension of early risk detection. They propose two methods to deal with what they call the "incremental classification of sequential data": Flexible Temporal Variation of Terms (FTVT) and Sequential Incremental Classification (SIC).

Flexible Temporal Variation of Terms expands on a previous technique, called simply Temporal Variation of Terms, proposed by the same authors (Errecalde et al., 2017). This is in turn based on Concise Semantic Analysis (Li et al., 2011), which is a semantic representation method that maps documents to a concept space. The authors incorporate the temporal aspect by enriching the minority-class documents at each time step with partial documents seen at previous time steps, and adding the chunk information to the concept space.

Sequential Incremental Classification is a learning technique where a dictionary of words and their frequencies is built for each category at training time. At the classification stage, each word is assigned a number between 0 and 1 which represents the confidence that the word exclusively belongs to one or the other category. All the confidence vectors are summed over the history of a user's writings, and when the evidence for positive risk surpasses the evidence for negative risk, an at-risk decision is emitted.

Both techniques described above led to good results in the shared task. One of the models using FTVT and logistic regression obtained the lowest ERDE5 score of 11.40%, whereas one of the models based on SIC got the best precision score (0.91).

2.3. Word embeddings

2.3.1. Overview

Word embeddings, also called word vectors or word spaces, are a family of methods to represent written text in numeric form. Word vectors make texts more accessible to computers, and this transformation is the first step for many natural language processing procedures. As a more formal definition, we can adopt the formulation by Almeida and Xexéo (2019), which is based on the major points emerging from the literature: Word embeddings are dense, distributed, fixed-length word vectors, built using word co-occurrence statistics as per the distributional hypothesis.

The simplest method that one can think of to convert words into numbers is called one-hot encoding. Each word is assigned an index and the corresponding

vector will have a 1 at the position of the word index and 0 everywhere else. This very basic vectorization method has two major shortcomings: first, each vector will necessarily be as long as the number of words in the corpus (our vocabulary), and will be very sparse, with thousands of zeros and only one position that carries actual information. Second, since indexes are assigned to words arbitrarily – for example, in order of occurrence – the resulting embeddings do not carry any information about the relationships between words in sentences and larger texts.

This is where the distributional hypothesis comes in. Basically, this hypothesis maintains that words that appear in similar contexts have similar meanings. If we track all the possible ways in which a word can be used in our training corpus, we will have a blueprint that allows us to recognize that word and understand it when we see it in new data.

The earliest proposals about how to obtain word vectors under the distributional hypothesis come from the field of Information Retrieval. Salton et al. (1975) proposed a vector space model that represents documents as vectors, where each element of the vector represents in turn a word occurring in that document. In this way, we obtain a term-document matrix, where usually the rows denote words and the columns represent documents. An advance in this direction came with latent semantic analysis (LSA), a technique introduced by Deerwester et al. (1990) where a dimensionality reduction technique is applied to a term-document matrix to obtain vectors of a more manageable length.

The methods presented so far all belong to the family of count-based methods, i.e. models where global information about the distribution of words is leveraged to obtain word vectors. Although this class of methods includes notable members such as GloVe (see Section 2.3.3 below), recently the research community has focused more on another family of algorithms, called prediction-based methods. In these methods, word embeddings emerge as a by-product of training a language model, or some other type of natural language understanding system. A language model is basically a probabilistic model of the distribution of words in a language. The best known prediction-based method is probably word2vec, proposed by Mikolov et al. (2013). It comes in two different variants, according to whether it is trained using the continuous bag of words (CBOW) or the skip-gram algorithm. In CBOW, the objective of the training is to teach the model to predict a word given its context, whereas in skip-gram it is the opposite, namely to predict the context given a word. word2vec has proven to be quite effective in a variety of NLP tasks and is often used to initialize machine learning models.

Training a language model is an unsupervised task which even shallow neural networks can perform really well, although until a decade ago it was a rather inefficient and lengthy procedure, which limited the amount of data that could be used. Advances in computer hardware and in the mathematical tools have recently made it possible to train language models on huge amounts of data, so as to leverage the real power of unsupervised learning and transfer it to downstream NLP tasks. The newest trend is to pre-train really large models on enormous amounts of data using representations from deep neural networks. ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) are examples of this new direction in research.
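Before turning to the individual methods, here is a minimal sketch of the one-hot encoding described at the beginning of this overview (toy vocabulary; purely illustrative, not code used in this work):

```python
def one_hot(word, vocab):
    """Return a vector as long as the vocabulary with a single 1 at the word's index."""
    index = {w: i for i, w in enumerate(vocab)}   # indexes assigned arbitrarily, e.g. in order of occurrence
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("light", ["the", "speed", "of", "light"]))   # [0, 0, 0, 1]
```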
In what follows, we look into the details of the word embedding methods used in this thesis. They were devised as incremental improvements on some previously

existing methods, to address different problems that presented themselves along the way. Some methods are optimized for semantic analysis and word space transformations, while others are more geared towards deep learning with neural networks, following the development of the field.

2.3.2. Random indexing

Random indexing was proposed as an alternative to the popular dimensionality reduction technique called latent semantic analysis (LSA) (Deerwester et al., 1990). In LSA, the first step is to construct the large co-occurrence matrix for the words and documents in the corpus, and then a matrix factorization technique called singular value decomposition is applied to obtain lower-dimensional vectors. This means that the memory and computation bottleneck of calculating the co-occurrence matrix cannot be avoided; moreover, it is impossible to add new data without starting afresh. Random indexing was proposed as a technique to build the word representations incrementally and avoid the above-mentioned limitations of LSA.

The main methodology behind random indexing is neatly explained in Sahlgren (2005). This technique builds on the work by Kanerva et al. (2000). Random indexing can be considered a two-step operation:

• Each word is assigned an index vector. This vector's dimensionality can vary, but it is usually in the thousands (1000 or 2000 for most applications). It is randomly generated with a few sparse 1's and −1's, and all other elements are 0.

• A context window is defined, and for all the documents in the corpus, the index vectors of the words that appear in the window around a target word are added to the context vector of that word.

The result of these steps is a set of relatively low-dimensional vectors that can be incrementally updated as new data becomes available. It is also possible to construct other types of vectors through random indexing, for example taking into account not only the words in the context window, but all the words occurring in the same document. These vectors are called association vectors, because they provide a sense of which words tend to be associated, i.e. occur together in a document.

One important feature of the random indexing vectors used in this work is the distinction between left and right context. Before being added to the target word's context vector, the index vector of a co-occurring word is shifted in two different ways according to whether it occurs in the right or left context. This means that the resulting context vector preserves some information about the relative position of words, not just their co-occurrence. More formally, Equation 2.1 illustrates a vector update using the random indexing technique, as described in Sahlgren et al. (2016):

$$\vec{v}(a) = \vec{v}(a_i) + \sum_{j=-c,\ j \neq 0}^{c} w(x^{i+j})\, \pi^{j}\, \vec{r}(x^{i+j}) \qquad (2.1)$$

Every time a word $a$ is encountered, its context vector $\vec{v}(a)$ is updated. $\vec{v}(a_i)$ is the context vector of the word $a$ before the $i$th occurrence of that word. $x^{i+j}$ represents each word that appears in the context window around $a$ at its $i$th occurrence and $\vec{r}(x^{i+j})$ represents the random indexing vectors of each of the words in the context window. $c$ is the size of the context window to the left and to the right of the target word. $w$ is a weighting function that determines how much weight is assigned to the different words (the default is 1). This weighting can be useful to treat stop words or punctuation differently from lexical words. $\pi^{j}$ is a permutation that rotates the random index vectors in order to preserve information about word order, i.e. it makes a difference in the context vector whether a word is encountered before or after the target word.
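As a concrete illustration of the two-step procedure and the update in Equation 2.1, consider the following minimal sketch; the dimensionality, the number of non-zero elements and the use of np.roll as a stand-in for the permutation π are illustrative choices, not the exact configuration used in this work:

```python
import numpy as np
from collections import defaultdict

def index_vector(dim=2000, nonzero=10, rng=np.random.default_rng()):
    """Step 1: a sparse random index vector with a few +1/-1 entries, all others 0."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzero, replace=False)
    v[pos] = rng.choice([1.0, -1.0], size=nonzero)
    return v

def train_context_vectors(tokens, window=2, dim=2000):
    """Step 2: add the (permuted) index vectors of neighbours to each target's context vector."""
    index_vecs = defaultdict(lambda: index_vector(dim))
    context_vecs = defaultdict(lambda: np.zeros(dim))
    for i, target in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= i + j < len(tokens):
                continue
            # np.roll plays the role of the permutation pi^j, distinguishing left from right context
            context_vecs[target] += np.roll(index_vecs[tokens[i + j]], j)
    return context_vecs

vectors = train_context_vectors("the cat sat on the mat".split())
```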

2.3.3. GloVe

Global vectors (or GloVe for short) are a very popular type of word vectors used in a variety of NLP tasks. They were proposed by Pennington et al. (2014) as an improvement over the previously published word2vec system (Mikolov et al., 2013), although for many applications the two types of vectors are often considered equally beneficial. The starting point to obtain GloVe vectors is building a co-occurrence matrix. From the co-occurrence matrix it is then possible to calculate the ratio between the co-occurrence probabilities of two words with a given context word. The following example, presented in the original paper, explains this concept quite clearly: if we look at the words "steam" and "ice", they have in common that they are both states of water, but they have different characteristics. So we expect them to occur more or less the same number of times around the word "water", whereas "steam" will occur more often together with "gas", and "ice" will occur together with "solid".

Figure 2.1.: Pennington et al. (2014) use this explanatory table to show the behavior of co-occurrence ratio.

Figure 2.1 shows that the co-occurrence ratio of "steam" and "ice" with "water" is close to 1, because both words are equally correlated with "water". The same goes for their co-occurrence with a completely unrelated word like "fashion": there is no reason why "steam" should occur together with "fashion" more often than "ice", or vice versa. However, if we look at the co-occurrence ratio of "ice" with "solid" and "steam" with "gas", we find that there is a difference of several orders of magnitude. As the authors put it, "Only in the ratio does noise from non-discriminative words like "water" and "fashion" cancel out" (p. 1534). This unique property is the reason why the authors propose to train the vectors so that, for two words, the dot product of their vectors equals the logarithm of their probability of co-occurrence (corrected by some bias factor). Equation 2.2 shows the fundamental principle behind GloVe. $w_i$ is an embedding of the context word $i$, $\tilde{w}_k$ is an embedding of the target word $k$, $b_i$ and $\tilde{b}_k$ are bias terms, and $X_{ik}$ is the number of times the two words co-occurred (the corresponding cell in the co-occurrence matrix).

$$w_i^{\top} \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}) \qquad (2.2)$$

The authors further propose a weighting scheme to give frequent co-occurrences higher informative value than rare co-occurrences, which could be just noise. At the same time, really common co-occurrences are probably the result of grammatically constrained constructions like "there was", and it is not desirable that they should be given too much importance. The parameters in the weighting function, shown in Equation 2.3, were found through experimentation, so the authors offer no theoretical explanation for the choice of those specific values.

$$\mathrm{weight}(x) = \min\!\left(1,\ (x/100)^{3/4}\right) \qquad (2.3)$$
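As a sketch of how Equations 2.2 and 2.3 fit together, the following shows the weighting function and the weighted squared error that is minimized for a single word pair; the variable names are illustrative and this is not the official GloVe implementation:

```python
import math

def glove_weight(x, x_max=100, alpha=0.75):
    """Equation 2.3: down-weight rare co-occurrences, cap very frequent ones at 1."""
    return min(1.0, (x / x_max) ** alpha)

def glove_pair_loss(w_i, w_k_tilde, b_i, b_k_tilde, x_ik):
    """Weighted squared difference between the two sides of Equation 2.2 for one word pair."""
    dot = sum(a * b for a, b in zip(w_i, w_k_tilde))
    return glove_weight(x_ik) * (dot + b_i + b_k_tilde - math.log(x_ik)) ** 2
```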

2.3.4. ELMo

All word embedding methods presented so far have in common that they produce static word vectors that are completely independent of word senses. They are trained on large amounts of unlabeled linguistic data, and the result is a collection of vectors that represent each word type, comparable to a sort of look-up dictionary, where other systems downstream can find the representations for the data that they need to process. Context-independent word embeddings already work quite well, but part of the information is lost, because words can have different senses, i.e. they can mean different things in different contexts. For example, the word "light" can be a noun referring to a physical phenomenon, as in the phrase "the speed of light", or it can be an adjective meaning the opposite of "heavy". Many NLP applications could benefit from being able to preserve this information in word representations.

There have been various attempts to generate context-dependent word representations (for example, Melamud et al. (2016) and McCann et al. (2017)) and to capture different word senses in other ways (Neelakantan et al., 2014). ELMo (Embeddings from Language Models), proposed by Peters et al. (2018), is one such attempt. It is different from traditional word embeddings in several ways and shows promising results in a variety of downstream applications. The following section explains how ELMo embeddings work and why the authors chose to call them deep contextualized word representations.

First off, ELMo representations are contextualized. This means that the vectors are not precomputed, but are generated ad hoc for the data that is going to be processed in a specific task. The language model that is used to get the word embeddings is trained on a corpus with 30 million sentences, and the system uses two layers of biLSTMs. This pre-trained model is then run on the data for the task at hand to get specific, contextualized word representations. Contrary to traditional word embeddings, which generate vectors for each word type, ELMo generates a vector for each word token. This means that all the occurrences of the word "light" that are nouns and refer to the natural phenomenon will be closer together in the vector space than the occurrences of "light" that mean "not heavy".

Secondly, ELMo representations are deep. Instead of only taking the representations in the top layer of the biLSTMs in the bidirectional language model (biLM), ELMo vectors are a linear combination of both biLSTM layers and the input layer of the biLM. Peters et al. (2018) report that the top layer of the biLSTMs has been shown to encode more word sense information (Melamud et al., 2016), whereas the deeper layers contain more syntactic information (Hashimoto et al., 2017). Keeping all three layers in the final representation allows more information to be preserved. These layers can either be averaged, or task-specific weights for a weighted average can be computed during training of the downstream system.

Finally, the biLM that ELMo is based on is fully character-based, and the authors use character convolutions to incorporate subword information. This means that there is no risk of out-of-vocabulary words for which ELMo representations cannot be computed. Equation 2.4 shows how Peters et al. (2018) calculate the ELMo representation for each word.

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \qquad (2.4)$$

Here $h_{k,j}^{LM}$ is the $j$th layer of the biLM for the $k$th word of the sentence, $s_j^{task}$ are the task-specific weights used to combine the three layers, and $\gamma^{task}$ is an additional task-specific parameter that scales the entire vector and improves the optimization process. Contrary to the original ELMo system, we did not add $\gamma^{task}$ and $s_j^{task}$ to our model, because we were more interested in the effect of contextualized word embeddings than in the benefits of the fine-tuning process. We averaged the three layers that ELMo returns for each word and used that as the word vector.
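A sketch of this layer averaging, using the ElmoEmbedder interface exposed by older AllenNLP releases; in this work we actually generated the vectors with the command-line elmo tool (see Section 4.2.1), so the snippet only illustrates the same operation:

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()                     # loads the default pre-trained biLM weights

tokens = ["the", "speed", "of", "light"]
layers = elmo.embed_sentence(tokens)      # shape: (3 layers, n_tokens, 1024)
word_vectors = layers.mean(axis=0)        # uniform average of the three layers, i.e. Eq. 2.4 with equal weights
```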

3. The shared task

3.1. eRisk 2019

In order to participate in the eRisk 2019 laboratory, the teams had to build a system for early risk prediction. Each team had at their disposal up to 5 runs, which they could use to test different variations of their system or different techniques altogether.

This year there was a fundamental difference in the presentation of the data compared to the previous two editions. In 2017 and 2018 the test data was presented in 10 chunks with 10% of the writings for each user, and the teams received one chunk per week for 10 weeks. In 2019, on the other hand, the test data was presented in rounds with one document per user. The teams had to send back a decision for each user and each run before they could retrieve the following round of texts. This means that some of the approaches used in previous years that relied on the presence of chunks were no longer applicable.

Switching from chunks to single texts also had repercussions on the scores in the ERDE metrics. As defined in Section 3.3, ERDE assigns a penalty to the true positives according to the number of texts that the system needs to process before emitting the at-risk decision. When the texts are presented in chunks, even if the system emits the correct decision after the first chunk, the lowest score that it can obtain is limited by the number of texts that were in that chunk for a user. For example, if the person has written 250 posts in total, each of the 10 chunks will contain 25 texts, and the lower limit for ERDE would be k = 25. With the new round-based system, the lower limit for k is 0. It is therefore possible to obtain lower ERDE scores if the system is really quick at emitting its decisions.

There were two other important differences in this year's edition of the shared task compared to the previous years. The first concerns the decision labels. In eRisk 2018, the systems needed to choose among three labels for each user after each chunk: 0 meant that they were not ready to make a decision and wanted to see more data; 1 meant that the user was at risk of eating disorders; 2 meant that the user was considered not at risk. Whenever the system emitted a decision (labels 1 or 2), that was regarded as final, and it was not possible to change it even if the system kept processing more texts. This year, on the contrary, there were only two labels, namely 0 for no risk and 1 for an at-risk user. Only label 1 was considered final, whereas label 0 could be changed to 1 at any time. This means that the teams did not need to implement a conscious strategy for withholding the decision; they could just emit label 0 until there was enough evidence for label 1.

The other difference was the introduction of a new requirement. The organizers mention that they want to "explore ranking-based measures to evaluate the performance of the systems", and therefore they ask participants to "provide an

estimated score of the level of anorexia/self-harm"1. This estimation had to be given in addition to the binary label for each round, so the main purpose remained to perform a binary classification task. Since there were no further specifications on how to obtain these estimated scores in the description of the task, and the training data was not labeled for degree of illness, the teams were free to come up with their own estimation methods. The intent of the organizers was to explore the possibility of introducing ranking-based evaluation metrics.

3.2. Data set

Table 3.1 shows some relevant statistics on the training and test sets for the shared task T1 at eRisk 2019.

                                Training set     Test set
Total users                     473              815
At risk users                   61 (12.9%)       73 (9.0%)
Total documents                 253,220          570,143
Documents (label 1)             24,829 (9.8%)    17,610 (3.1%)
Avg. document length            30.7             29.7
Avg. doc. length (label 0)      27.8             28.7
Avg. doc. length (label 1)      57.1             60.3
Avg. docs. per user             536.5            699.6
Avg. docs. per user (label 0)   555.7            744.7
Avg. docs. per user (label 1)   407              241.2
Range of docs. per user         9-1999           10-2000

Table 3.1.: Some statistics for the training and test set that the organizers of the shared task provided for participating teams. The numbers in brackets are percentages of the total, where relevant.

The data set used in this thesis was collected and annotated by the organizers of the shared task. They crawled the social media platform Reddit, where discussions take place in sub-forums divided by topic, and it is therefore relatively straightforward to collect texts on a specific theme. The texts were written over a period of several years. The organizers labeled as positive all those users that explicitly mention having a diagnosis of anorexia. The data was provided in XML files, where for each message there were user, date, title and text fields.
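As an illustration of how such files can be read, the following sketch assumes plausible tag names based on the fields listed above; the actual eRisk schema may differ:

```python
# Illustrative only: tag names (WRITING, DATE, TITLE, TEXT) are assumptions.
import xml.etree.ElementTree as ET

def load_user_writings(path):
    """Return the list of (date, title, text) writings for one user file, in file order."""
    root = ET.parse(path).getroot()
    writings = []
    for w in root.iter("WRITING"):
        date = w.findtext("DATE", default="").strip()
        title = w.findtext("TITLE", default="").strip()
        text = w.findtext("TEXT", default="").strip()
        writings.append((date, title, text))
    return writings
```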

3.3. Evaluation metrics

The most challenging part of early risk detection tasks is probably the temporal aspect. Usually, the results of a classification task are expressed in terms of accuracy, or maybe F measure, which is a synthesis of precision and recall. In this kind of task, though, these evaluation metrics are not enough: the systems must also be scored according to how quickly they can arrive at the correct decision.

1http://early.irlab.org/server.html

3.3.1. ERDE

The organizers of eRisk have proposed a novel evaluation metric called Early Risk Detection Error (ERDE), which takes into account both the correctness of the decision and the number of texts needed to emit that decision. Moreover, ERDE treats different kinds of errors in different ways: it is considered worse to miss a true positive than to mistakenly classify a true negative as positive. This follows the rationale that in a real-life application of these systems it would be much more dangerous to miss a positive at-risk case than to offer help to someone who is feeling fine.

In Losada et al. (2018), ERDE for a decision $d$ after having processed $k$ texts is defined as in Figure 3.1. The parameter $o$ controls the point where the penalty quickly increases to 1. By changing the value of $o$, it is possible to control how quickly a system has to identify the at-risk cases. $c_{fp}$ is set to the relative frequency of the positive class. There needs to be some penalty for false positives, otherwise systems could get a perfect score by classifying everything as at risk. $c_{fn}$ is set to 1, which is the highest possible penalty for this measure.

Figure 3.1.: The definition of the ERDE measure.

The penalty for a delayed decision comes into play when considering true positives. $c_{tp}$ is set to 1, which means that late decisions are considered as harmful as wrong decisions. The factor $lc_o(k)$ is a monotonically increasing function of $k$, as shown in Equation 3.1.

$$lc_o(k) = 1 - \frac{1}{1 + e^{k-o}} \qquad (3.1)$$
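As an illustration, the per-user ERDE contribution described above can be sketched as follows; this follows the case analysis given in the prose, not the official evaluation script, and the overall ERDE is the mean of these contributions over all users:

```python
import math

def lc(k, o):
    """Latency cost factor of Equation 3.1: grows from near 0 to 1 around k = o."""
    return 1.0 - 1.0 / (1.0 + math.exp(k - o))

def erde(decision, truth, k, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Per-user ERDE contribution for a decision emitted after k texts."""
    if decision == 1 and truth == 0:
        return c_fp                # false positive: fixed cost
    if decision == 0 and truth == 1:
        return c_fn                # false negative: maximal cost
    if decision == 1 and truth == 1:
        return lc(k, o) * c_tp     # true positive: cost grows with the delay k
    return 0.0                     # true negative: no cost
```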

3.3.2. Precision, recall and F measure

Especially when the value of $o$ is low, e.g. 5, the ERDE measure is quite unstable. The system can only look at 5 texts before the penalty score shoots up, and if those 5 texts happen not to be very informative, it is extremely difficult to make a risk decision. Because of this fundamental instability, the organizers of eRisk have also used F measure, precision and recall to rank the performance of participating teams. This is measured not on the prediction of category labels for texts, as in regular classification tasks, but on the prediction of labels for users.

Specifically, precision and recall (and consequently F measure) are measured on the positive cases only. A high precision means that the system is usually right when it identifies a user as at risk, but it might be missing some positive cases; a high recall means that the system is good at identifying most of the positive cases, but it might be labelling some of the negative cases as positive as well. The F measure gives a synthesis of the two, calculated as the harmonic mean of precision

and recall. If the system has a high F score, it usually means that it has both good precision and good recall, so this measure is often used to evaluate the overall performance of a classifier.

For the kind of task at hand, it is reasonable to assume that recall would be more important than precision in a real-life application. Namely, it would be crucial to be able to identify as many at-risk cases as possible, whereas false alarms would not put anyone's life on the line. Of course, it is still desirable to obtain high precision, in order to be able to focus on the people that actually need help and avoid wasting resources on healthy individuals.
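For reference, the standard definitions, computed here over the positive (at-risk) class only, are:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}$$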

3.3.3. Latency, speed and F-latency

For eRisk 2019, the organizers introduced three new metrics to evaluate the performance of the systems. They are meant to complement ERDE, precision and recall, and to compensate for their shortcomings. The first new measure that they propose is the latency for true positives. This is simply the median number of texts that the system processed before identifying the true positives that it did find.

The speed and F-latency measures were originally proposed by Sadeque et al. (2018) and adopted for the first time in this edition of the shared task. The idea is to find an evaluation metric that combines the informativeness of the F measure with a task-specific quantification of delay. This is achieved by multiplying the F measure by a penalty score calculated from the median delay. Equation 3.2 shows how the penalty for each true positive decision is obtained. $k$ is the number of writings seen so far, and $p$ is a parameter that controls how quickly the penalty increases.

$$\mathrm{penalty}(k) = -1 + \frac{2}{1 + e^{-p(k-1)}} \qquad (3.2)$$

The speed metric is calculated as 1 minus the median penalty over all the true positives identified by the system. A fast system that always detects its true positives after only one writing will get a speed score of 1. Finally, F-latency is calculated as the F measure multiplied by the speed. If a system has a speed of 1, its F score and F-latency will be the same.
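A minimal sketch of these two measures; the value of $p$ is fixed by the task organizers, so it is left as a parameter here:

```python
import math
from statistics import median

def latency_penalty(k, p):
    """Equation 3.2: k is the number of writings seen before the decision."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def speed_and_f_latency(true_positive_delays, f1, p):
    """speed = 1 - median penalty over the true positives; F-latency = F1 * speed."""
    speed = 1.0 - median(latency_penalty(k, p) for k in true_positive_delays)
    return speed, f1 * speed
```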

3.3.4. Ranking-based metrics

As mentioned above, this year the organizers asked the participants to provide a score quantifying the risk level of the user together with the system's decision. Using these scores, they constructed a ranking of the users by decreasing risk level at each round in the test phase, for each of the participating systems. They used two metrics from the Information Retrieval tradition to evaluate these rankings, namely P@k and nDCG@k.

P@k stands for precision at rank k, where k here was set to 10. This metric measures how many of the items ranked from 1 to k are relevant for a given query. In this case, it measures how many of the users in the first 10 positions of the ranking are actually high-risk users.

nDCG@k stands for normalized Discounted Cumulative Gain at rank k. In the shared task this score was measured at ranks 10 and 100. The gist of this metric is that a highly relevant result that ends up at the bottom of the ranking should be penalized even if the system actually retrieved it. Equation 3.3 shows the formula to obtain DCG at rank k.

$$DCG_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)} \qquad (3.3)$$

In this case the relevance is binary, i.e. $rel_i \in \{0, 1\}$. In order to normalize the DCG score, it is sufficient to divide it by the ideal DCG, which is the DCG of a system that performs a perfect ranking.
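A minimal sketch of these ranking metrics with binary relevance labels (1 = at-risk user, 0 = otherwise):

```python
import math

def dcg_at_k(relevances, k):
    """Equation 3.3 over the first k positions of the ranking."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """Normalize by the DCG of the ideal ranking (all relevant users first)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def precision_at_k(relevances, k=10):
    """P@k: share of relevant users among the first k ranked positions."""
    return sum(relevances[:k]) / k
```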

4. Methodology

In this chapter we present the system that we built to participate in eRisk 2019. We submitted 5 runs to the official test phase, which are illustrated in more detail in Section 4.3. We also carried out some additional experiments which we did not submit to the shared task (Section 4.4).

In order to perform the early risk prediction task, there were several specific challenges to be overcome. The most obvious was the temporal element, since text classification is usually completely unaware of time. To deal with this challenge, we constructed a system with two learning steps, one to classify the texts and one to learn how to emit time-aware decisions.

Then there is the fact that the positive class is by far the minority class, but it cannot be given too much weight because of the many mislabeled texts in this class, a consequence of transferring labels from users to texts (see below). We tried to deal with this problem by training the text classifier without class weights, so that it could ignore the mislabeled texts more easily, and adding class weights only in the loss of the user classifier, where the system could have a more comprehensive view of a user's history.

Another challenge consisted in the fact that some people might be talking about illness without being sick themselves. We did not implement a specific solution for this kind of pitfall, but we hypothesized that the neural model could learn to identify non-lexical clues in the writing style of healthy and at-risk users. Still, it is more likely that these kinds of clues would allow the model to make early positive decisions in the absence of lexical clues, rather than allowing it to make negative decisions (at any time) in the presence of lexical clues.

This leads us to the last big challenge of this early prediction task. The set-up of the task implies that false negatives are penalized more than false positives, and in order to obtain better ERDE scores the system has to decide on the positive cases as quickly as possible. These positive decisions have to be quick, accurate, and cannot be changed. This yields systems that by design have very good recall and poor precision. We tried to deal with this problem by increasing the threshold for positive decisions, thus making the system more conservative, but this of course leads to somewhat worse ERDE scores. We made the practical assumption that a good balance between precision and recall would be more useful in a real-life setting than near-perfect scores on the early prediction metrics.

4.1. System design

We approached this problem as a text classification problem with an added temporal dimension. For this reason, we built a system that consists of two main components, each designed to handle one part of the task. This is an operational simplification of the actual research question, which instead deals

with the objective of classifying users, not texts, according to the criterion of mental health risk. Such an objective is reflected in the structure of the data set provided by the organizers, where the labels 0 and 1 refer to the writer's state of health, not to the contents of single texts. We transferred the labels from writers to the texts that they have posted online, which means that we introduce a degree of noise in the data. People diagnosed with an eating disorder do not always talk about their mental health issues, and perfectly healthy people might talk about some friend or relative that they worry about. Nonetheless, we hypothesize that an altered mental health state is reflected in a person's writing style to a sufficient extent for a neural network to pick up subtle cues even when the topic is not explicitly about the illness itself (see Section 2.1).

The first component of our system consists of a Recurrent Neural Network (RNN) with Long Short-Term Memory cells (LSTMs). RNNs are neural architectures where the output of the hidden layer at each time step is also used as input for the hidden layer at the next time step. This type of neural network is particularly suitable for tasks that involve processing of sequences, for example sentences in natural language. LSTM cells (Hochreiter and Schmidhuber, 1997) were introduced because regular RNN cells tended to forget important information the further away you got from the source. LSTMs have a more sophisticated mechanism to determine which bits of information can be forgotten and which are important for the task at hand and should be retained even far away from the source. This recurrent neural network takes care of the text classification task: it outputs the probability that each text belongs to the 1 (risk) class. The output of this classifier is passed on as input to the second classifier.

The second component of our system is constructed to handle the temporal dimension of the problem. The texts for each user in both the training and the development set are ordered by date, so as to replicate the actual testing conditions and also the real-life situation underlying this task. The first classifier (which at this point has already been trained) is used to assign a probability value of belonging to class 1 to each text. Using this array of predictions in chronological order, we create feature vectors to train the second classifier, which predicts the class of the users given the scores of their texts. Each array contains the following features (a minimal extraction sketch follows the list):

• The number of texts seen up to that point, min-max scaled to match the order of magnitude of the other features

• The average score of the texts seen up to that point

• The standard deviation of the scores seen up to that point

• The average score of the top 20% texts with the highest scores

• The difference between the average of the top 20% and the bottom 20% of texts.
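A minimal sketch of this feature extraction, given the chronological list of per-text probabilities produced by the first classifier; the scaling constant for the text count is an assumed value based on the document ranges in Table 3.1, not the exact min-max bounds used in our experiments:

```python
import numpy as np

def user_features(text_scores, max_texts=2000):
    """Build the user-level feature vector from the chronologically ordered text scores."""
    scores = np.asarray(text_scores)
    n = len(scores)
    k = max(1, int(round(0.2 * n)))          # size of the top/bottom 20% slices
    ordered = np.sort(scores)
    top, bottom = ordered[-k:], ordered[:k]
    return np.array([
        n / max_texts,                       # number of texts seen so far, scaled down
        scores.mean(),                       # average score so far
        scores.std(),                        # standard deviation of the scores
        top.mean(),                          # average of the top 20% scores
        top.mean() - bottom.mean(),          # gap between top 20% and bottom 20% averages
    ])
```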

We experimented with two architectures for the second classifier: logistic regression and multi-layer perceptron. Logistic regression is a linear classifier that uses a logistic function to model the probability that an instance belongs to the default class in a binary classification problem. A multi-layer perceptron, on the other hand, is a deep feed-forward neural network, and therefore a non-linear classifier. We tested their performance by combining both types of classifier with identical models for the text classification step.

4.2. Experimental settings

The following experimental settings were obtained by running tuning experiments on the development set. We varied different hyperparameters, such as embedding size, hidden layer size, number of layers and vocabulary size, to find the best combination, also taking practical issues such as training time into account. One important factor to keep in mind is that we wanted to compare word embedding methods, so it was desirable to have the same (or very similar) settings for all models. During our development phase we found that often a hyperparameter setting that worked well for one model was not ideal for another model, and compromises had to be made.

As development set we used a subset of the training data, which was set aside before training. The results reported in Section 5.1 were obtained on a development set containing a randomly selected 20% of the total users. Out of 94 users in the development set, 12 were positive at-risk cases. It must be pointed out that this is a very small number compared to the actual test set, which contains over 800 users, and this contributes to the differences between the development results and the results on the official test set reported in Section 5.2. We also experimented with a number of different data splits to determine an average performance of the models, since the number of positive users in the development set can affect the F1 measure rather heavily.

For the implementation we used scikit-learn (Pedregosa et al., 2011) and Keras (Chollet et al., 2015), two popular Python packages that support traditional machine learning algorithms as well as deep learning architectures. We only took into consideration those messages where at least one of the text and title fields was not blank. Similarly, at test time we did not process blank documents; instead, we repeated the prediction from the previous round if we encountered an empty document. If any empty documents appeared in the first round, we emitted a decision of 0, following the rationale that in the absence of evidence we should assume that the user belongs to the majority class.

We pre-processed the documents in the same way for all our runs. We used the stop-word list provided with the NLTK package (Bird et al., 2009), but we did not remove any pronouns, as they have been found to be more prominent in the writing style of mental health patients (see Section 2.1). We replaced URLs and long numbers with ad hoc tokens, and the Keras tokenizer filters out punctuation, symbols and all types of blank space characters.

Our neural architecture for the LSTM models consists of an embedding layer, two hidden layers of size 100 and a fully connected layer with one neuron and sigmoid activation (as illustrated in Figure 4.1). The embedding layer differs according to which type of representations we use for each model, whereas the rest of the architecture is identical for all of our neural models. The output layer with a sigmoid activation function makes sure that the network assigns a probability

25 to each text instead of a class label. We limit the sequence length to 100 words and the vocabulary to 10,000 words in order to make the training process more efficient. The first classifier is trained on the training set with a validation split of 0.2 using model checkpoints to save the models at each epoch, and early stopping based on validation loss (see for example Caruana et al. (2001)). Two dropout layers are added after the hidden LSTM layers with a probability of 0.5. Both early stopping and dropout are intended to avoid overfitting, given that the noise in the data makes the model more prone to this type of error.

Figure 4.1.: The diagram illustrates the structure of the neural network used as text classifier.
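
For concreteness, the following minimal Keras sketch reproduces the architecture described above. The layer sizes, dropout probability, sequence length, vocabulary size and validation split follow the text; the optimizer, the loss function, the early-stopping patience and the function name are illustrative assumptions rather than the exact configuration used in this work.

# Minimal sketch of the LSTM text classifier (illustrative, not the exact thesis code).
# The embedding dimension varies per run (100 for the baseline, 200 for GloVe,
# 2000 for random indexing); all other sizes follow the description in the text.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint, EarlyStopping

MAX_LEN = 100        # sequence length limit
VOCAB_SIZE = 10000   # most frequent words kept by the tokenizer

def build_text_classifier(embedding_dim):
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, embedding_dim, input_length=MAX_LEN))
    model.add(LSTM(100, return_sequences=True))   # first hidden layer, size 100
    model.add(Dropout(0.5))                       # dropout after the first LSTM layer
    model.add(LSTM(100))                          # second hidden layer, size 100
    model.add(Dropout(0.5))                       # dropout after the second LSTM layer
    model.add(Dense(1, activation="sigmoid"))     # outputs a probability per text
    # Optimizer and loss are assumptions; any standard binary setup would do here.
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Training with a 20% validation split, checkpoints at every epoch and early stopping
# on validation loss (X_train and y_train would be the padded sequences and labels):
# model = build_text_classifier(embedding_dim=100)
# callbacks = [ModelCheckpoint("model_{epoch:02d}.h5"),
#              EarlyStopping(monitor="val_loss", patience=2)]
# model.fit(X_train, y_train, validation_split=0.2, epochs=20, callbacks=callbacks)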

Regarding the second classifier, we experimented with different settings for logistic regression and the multi-layer perceptron. For LR, we determined that the best optimizer was the SAGA optimizer (Defazio et al., 2014). We used balanced class weights to give the minority class (the positive cases) more importance during training. Concerning the MLP, we determined empirically that an architecture with two hidden layers of size 10 and 2 yielded good results for our problem. The classifier was trained on feature vectors obtained from the whole training set and tested on the development set. The output of this classifier was the probability of belonging to the 0 or 1 class.

Since we needed to focus on early prediction of positive cases and on recall, precision in our system tended to suffer. To counteract this, we experimented with different cut-off points for the probability scores in order to reduce the number of false positives. We ended up using a high cut-off probability of 0.9 for a positive decision, because we found that this did not affect our recall score too badly and it did help improve precision. We collected all the verdicts at different time steps in a dictionary to calculate ERDE5 and ERDE50, whereas precision, recall and F1 were calculated on the final prediction for each user.

Once the development phase was concluded, we retrained the chosen models using all the data at our disposal, i.e. without setting aside a number of users for development. We still maintained the validation split to train the neural network, as different amounts of data lead to a different overfitting pattern. All the models and the tokenizer were saved so as to be used for prediction during the official test phase.
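
The user classification step can be sketched with Scikit-learn as follows. The solver, the balanced class weights, the hidden layer sizes and the 0.9 cut-off follow the text; the variable names, the feature extraction (omitted) and max_iter are illustrative assumptions.

# Sketch of the two user classifiers with the settings reported above.
# X_users, y_users stand for the manually engineered user-level feature vectors
# and labels; X_dev stands for the corresponding development-set features.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Logistic regression with the SAGA solver and balanced class weights.
lr_user_clf = LogisticRegression(solver="saga", class_weight="balanced", max_iter=1000)

# Multi-layer perceptron with two hidden layers of size 10 and 2.
mlp_user_clf = MLPClassifier(hidden_layer_sizes=(10, 2), max_iter=1000)

# lr_user_clf.fit(X_users, y_users)
# A positive decision is only emitted when the predicted probability of the
# positive class exceeds the 0.9 cut-off discussed above:
# proba = lr_user_clf.predict_proba(X_dev)[:, 1]
# decisions = (proba >= 0.9).astype(int)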

4.2.1. Word embeddings as input

For the baseline models that do not use any pre-trained embeddings, we had to create an embedding index using the Keras embedding layer. First a tokenizer is used to obtain a list of all the words present in the training set. Only the 10,000 most common words are considered for the classification task. Then we use the numerical index of the tokenizer object to convert each document into a sequence of numbers that uniquely identify the words in the document. These sequences are passed to the embedding layer, which creates a look-up table from numerical word indexes to word embeddings. The embeddings are randomly initialized and trained together with the rest of the model parameters.

For GloVe, we chose publicly available vectors trained on Twitter data, since we are also dealing with social media texts. The pre-trained embeddings file contains 1.2 million words (uncased), trained on 2 billion tweets. We used the numerical index in the Keras tokenizer again as an index for the embedding matrix, but in this case we initialized the vectors with the values read from the GloVe file. For words that could not be found in the GloVe file, we sampled from a normal distribution whose mean and standard deviation were obtained from all available vectors.

The random indexing vectors had size 2000 and were used in the same way as the GloVe embeddings, except that the large embedding size did not allow us to calculate the mean and standard deviation on the machine we used for development, so we initialized unknown words with vectors of 0. Since in random indexing the random index vectors are summed every time a word appears in a context, the values can vary greatly, on the order of tens of thousands of units. Therefore we normalized the embeddings to make them all sum to 1.

For ELMo vectors we followed a different procedure than for the other two types of embeddings. We used the AllenNLP Python package (Gardner et al., 2017) to generate the embeddings from the command line and stored them in a file. We ran the elmo command with the appropriate flag to obtain the average of the three layers of the biLSTM, so as to save one 1024-dimensional vector for each word. Given that ELMo works on the sentence level, we used NLTK to perform sentence tokenization before running the ELMo model on our documents. Then we concatenated the vectors for each sentence up to a maximum of 100 words. Since these embeddings are contextualized, we did not remove stop words or lowercase the data, on the rationale that it is better to leave as much information as possible to be used as context for each word. After obtaining the embeddings, we fed them to the same sequential LSTM model that we used for the other runs.
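
As an illustration of the GloVe initialization described above, the following sketch builds the embedding matrix and samples out-of-vocabulary vectors from a normal distribution. It assumes the standard glove.twitter.27B.200d.txt file and a Keras tokenizer (`tokenizer`) already fitted on the training texts; variable names are illustrative.

# Sketch of building a pre-trained GloVe matrix for the Keras embedding layer.
import numpy as np

EMBEDDING_DIM = 200
VOCAB_SIZE = 10000

# Read the GloVe file: one word followed by its vector components per line.
glove = {}
with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Out-of-vocabulary words are sampled from a normal distribution whose mean and
# standard deviation are estimated from all available GloVe vectors (done once).
all_vecs = np.stack(list(glove.values()))
oov_mean, oov_std = all_vecs.mean(), all_vecs.std()

embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if idx >= VOCAB_SIZE:
        continue
    vec = glove.get(word)
    embedding_matrix[idx] = vec if vec is not None else np.random.normal(oov_mean, oov_std, EMBEDDING_DIM)

# The matrix is then passed to a frozen embedding layer, which we found to work
# better for pre-trained vectors (see Section 4.3):
# Embedding(VOCAB_SIZE, EMBEDDING_DIM, weights=[embedding_matrix],
#           input_length=100, trainable=False)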

4.3. Runs

As mentioned above, we submitted 5 runs to the shared task, which was the maximum number allowed for each team. Table 4.1 shows the combinations of models for the different runs. The baseline model consists of randomly initialized word embeddings obtained from the Keras embedding layer. We chose an embedding size of 100 because it resulted in faster training and there did not seem to be any advantage of a larger embedding size during our development experiments. We tested this model with both a logistic regression classifier and a multi-layer perceptron.

Run ID   First classifier word embeddings   Second classifier
0        Random initialization              Logistic Regression
1        Random initialization              Multi-layer Perceptron
2        GloVe                              Logistic Regression
3        GloVe                              Multi-layer Perceptron
4        Random indexing                    Multi-layer Perceptron

Table 4.1.: Summary of the models used in the 5 runs submitted to the eRisk 2019 shared task.

Runs 2 and 3 were initialized with pre-trained GloVe embeddings provided by Stanford NLP.1 In this case we noticed a benefit of a larger embedding size, so we used 200-dimensional vectors. Through development experiments we determined that freezing the weights of the embedding layer resulted in better scores, so we used this setting for both our GloVe models. As for the baseline model, we combined our GloVe model with both logistic regression and the multi-layer perceptron.

Finally, our last model was initialized with random indexing context vectors downloaded from a database at the company Gavagai in Stockholm.2 In this case as well, we found that freezing the weights of the embedding layer was more beneficial. Since we only had one run left at this point, we decided to test this model only with a multi-layer perceptron, because the neural model is generally considered to be more powerful than a logistic regression classifier and we had obtained slightly better results with it in the development experiments.

1 https://nlp.stanford.edu/projects/glove/
2 https://www.gavagai.se/

4.4. Other models

We set out with the main objective of participating in the eRisk 2019 shared task, but since the organizers published the test data after the official test phase, we were able to conduct experiments with other models that were not submitted to the shared task. We focused specifically on using ELMo embeddings to train the first classifier, to determine what impact contextualized embeddings would have on the performance of the system.

Since contextualized embeddings have proven able to improve performance in many NLP tasks, including sentiment classification (see for example Krishna et al. (2018)), we hypothesized that they could be beneficial also for the task at hand. We did not submit this model to the shared task because obtaining ELMo embeddings is a computationally expensive process even with a pre-trained model. The shared task setup required all the models to submit their decisions in parallel, so including one slow model would inevitably have slowed down all runs. We estimate that the ELMo model is 4-5 times slower than the other models in processing the texts.
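
For reference, contextualized ELMo vectors of the kind described in Section 4.2.1 can also be obtained directly from Python. The sketch below uses the ElmoEmbedder class from AllenNLP 0.x and averages the three layers per word; it is an illustrative equivalent of the command-line procedure we actually used, not our exact set-up.

# Sketch of generating averaged, concatenated ELMo vectors for one document.
# Requires the NLTK "punkt" tokenizer models; the first ElmoEmbedder() call
# downloads the default pre-trained ELMo weights.
import numpy as np
from nltk import sent_tokenize, word_tokenize
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()

def document_vectors(text, max_words=100):
    """Concatenate per-word 1024-dimensional ELMo vectors, sentence by sentence."""
    vectors = []
    for sentence in sent_tokenize(text):
        tokens = word_tokenize(sentence)
        # embed_sentence returns an array of shape (3, n_tokens, 1024): one row per layer.
        layers = elmo.embed_sentence(tokens)
        averaged = layers.mean(axis=0)          # average the three layers for each word
        vectors.extend(averaged)
        if len(vectors) >= max_words:
            break
    return np.asarray(vectors[:max_words])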

5. Results and discussion

In this chapter we present our results and error analysis on both the development set and the test set, as well as our results in the official eRisk 2019 T1 shared task.

As regards our hypotheses, we expect the baseline models to perform worst, because they do not have any prior knowledge in the form of pre-trained embeddings at their disposal. We also expect the random indexing embeddings to perform worse than GloVe and ELMo, because they were proposed before the current deep learning boom and their size is much larger than that of the other embeddings commonly used with neural networks. Given this technique's success in many domains of NLP research, we hypothesize that the ELMo models should have the best performance, immediately followed by the GloVe models. In terms of the user classifier, we have experimented with logistic regression and a multi-layer perceptron. We hypothesize that the MLP, being a more complex model, will perform better than LR, which is a linear classifier.

5.1. Development experiments

In this section, we report the results of our development experiments. We show how the final models that we settled upon performed on the development set, and we report the results of our error analysis. The new metrics introduced for eRisk 2019 are not included in our evaluation, since we did not have access to them until quite late in the timeline of the task. In order to make the testing conditions as similar as possible to the real shared task set-up, we ordered each user's texts chronologically and saved the predictions of the models to a dictionary, also in chronological order. This simulates the round system in the shared task, where the texts are made available one at a time.

Table 5.1 shows the results obtained on the development set by our models. Each type of word representation was tested with logistic regression and a multi-layer perceptron as user classifier, for a total of eight different combinations. The first row of the table shows that the accuracy of the LSTM text classifier was not dramatically different across conditions. Surprisingly, the baseline model with a randomly initialized embedding layer presents the highest accuracy score, although marginally. For all models, the multi-layer perceptron classifier presents a higher accuracy than the logistic regression classifier, and this is also reflected in the F1 scores, although not always in the ERDE scores. This is not surprising, since ERDE is a rather unstable measure which is very dependent on a handful of texts, especially in the case of ERDE5. The random indexing model obtained better ERDE scores with logistic regression, as did the ELMo model concerning ERDE5.

                               Baseline   Rand. Ind.   GloVe   ELMo
LSTM classifier    Accuracy    91.75      91.64        91.72   91.36
Logistic           Accuracy    91.99      88.78        87.59   83.55
regression         Precision    0.36       0.33         0.45    0.35
classifier         Recall       0.75       0.75         0.75    0.75
                   F1           0.49       0.46         0.56    0.47
                   ERDE-5       8.50       6.81         8.64    8.69
                   ERDE-50      5.36       5.64         4.69    5.50
Multi-layer        Accuracy    93.91      93.64        92.99   92.48
perceptron         Precision    0.56       0.40         0.82    0.75
classifier         Recall       0.75       0.67         0.75    0.75
                   F1           0.64       0.50         0.78    0.75
                   ERDE-5       7.39       6.99         7.73    8.98
                   ERDE-50      5.21       5.89         3.46    4.66

Table 5.1.: The results of our experiments on the development set. The scores are reported in percentages, except precision, recall and F1, which are reported as decimals.

The model with GloVe embeddings and the multi-layer perceptron classifier scored highest on all metrics except ERDE5, where the model with random indexing vectors and logistic regression performed best. We had hypothesized that the model with ELMo embeddings and a multi-layer perceptron would be the best, since it had the most sophisticated embeddings and the most powerful classifier, but this was not the case. We offer a possible explanation below in the error analysis section.

Some interesting observations can be made regarding the precision and recall scores on the development set. None of the models obtained a recall score over 0.75, which suggests that about a quarter of the positive users have scores that are too similar to those of the users in the negative class for the models to tell the difference. We used a cut-off threshold of 0.9, so this means that for some of the positive users the system was never sufficiently confident to emit a positive decision. Furthermore, the results show that using the logistic regression classifier leads to very low precision scores compared to the multi-layer perceptron, especially for the GloVe and ELMo models. This means that it is much more prone to false positives, which in this setup cannot be corrected later on.

5.1.1. Error analysis

We performed both automatic and manual error analysis on the development set in order to understand where the different models went wrong and what kind of documents proved to be the most difficult to classify correctly.

For the automatic error analysis we looked at the number of incorrectly classified texts in our development set, divided them into false positives and false negatives, and compared across models. We also determined how many texts were classified correctly only by one model and mislabeled by all others. Additionally, we looked at the length of the documents, quantified in number of words excluding punctuation, where a word in this context was simply considered as a string of characters between two spaces (or at the beginning or end of a line).

Table 5.2 shows the results of the automatic error analysis. "Length correct" is the average number of words in the correctly classified texts, whereas "Length wrong" is the average number of words in the incorrectly classified texts. "False pos." is the number of false positives in the output of that model, and "False neg." is the number of false negatives. "Total errors" is the number of classification errors committed by the model, and "Only correct" is the number of texts that only that model could classify correctly.

             Length correct   Length wrong   False neg.   False pos.   Total errors   Only correct
Baseline     27.68            39.61          3462         611          4073           100
Rand. Ind.   17.93            25.9           3608         520          4128           93
GloVe        17.97            25.55          3351         737          4088           103
ELMo         22.93            31.83          3467         797          4264           127

Table 5.2.: The results of the automatic error analysis of the LSTM text classifier output for our four models on the development set.
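
To make the procedure behind Table 5.2 (and the analogous Table 5.4 for the test set) concrete, the following sketch computes the error counts and average document lengths for a single model. The (text, gold label, predicted label) triples are an assumed input format, and the cross-model "Only correct" comparison is omitted.

# Sketch of the automatic error analysis for one model's predictions.
def error_analysis(docs):
    """docs: list of (text, gold_label, predicted_label) triples with labels 0/1."""
    false_pos = false_neg = 0
    len_correct, len_wrong = [], []
    for text, gold, pred in docs:
        n_words = len(text.split())          # a "word" is any string between spaces
        if pred == gold:
            len_correct.append(n_words)
        else:
            len_wrong.append(n_words)
            if pred == 1 and gold == 0:
                false_pos += 1
            else:
                false_neg += 1
    return {
        "length_correct": sum(len_correct) / len(len_correct),
        "length_wrong": sum(len_wrong) / len(len_wrong),
        "false_neg": false_neg,
        "false_pos": false_pos,
        "total_errors": false_pos + false_neg,
    }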

The development set contained 49,374 documents, of which only 4,371 were true positives. Again, given the transfer of labels from users to documents, many of those texts did not contain any reference to anorexia or psychological issues. Thus it is not surprising that all our models mislabeled at least 75% of the true positives. In accordance with this proportion, the manual error analysis (see below) showed that among the false negatives, 4 out of 5 texts did not present any clues identifiable by a human reader. Rather, the high percentage of errors in this category shows that the models actually learned the most salient clues of the positive class and labeled the texts as 0 when they could not find any of those clues. The false positives were much fewer, around 1.5% of the true negatives. This is expected, since 0 (no risk) was the majority class. As we show below in the manual error analysis, in many cases it is possible to identify clues as to why the model made an error on that document.

The model with the fewest errors is the randomly initialized baseline model, whereas the model that mislabeled the most texts in the development set was the model with ELMo embeddings. This is an unexpected result, because our hypothesis was that augmenting the baseline with deep contextualized embeddings would result in improved performance. We can only conjecture on the cause of this phenomenon. It is possible that ELMo embeddings, being able to capture more accurately the contextual meaning of words, make the model more sensitive to inaccurate labels. In a certain sense, if the model understands more of natural language, it would also be more thrown off by two documents that have very similar embeddings but two different labels. A more coarse-grained model will have many other inconsistencies, and the unjustified labels might be less disruptive. This hypothesis could be tested by using a mix of automatic and manual methods to annotate the documents in a more consistent way, and then running the experiments on the new annotation. However, such an undertaking is beyond the scope of the present work.

For our manual error analysis we looked at 50 false positives and 50 false negatives sampled at random among the predictions of our baseline model on the development set. We wanted to get a more precise idea of how well this type of classifier "understands" the task at hand, i.e. whether it actually uses the type of clues that one would expect to perform classification. We analyzed the documents manually, trying to find pointers as to what went wrong and why. Of course this kind of evaluation is qualitative, and the fact that a human reader cannot pick up on the cause of the error does not mean that the neural network was behaving randomly. It may also be that the kind of features it based its decision on are not apparent to human readers. Unfortunately, given that the documents in the data set can contain sensitive and personal information, we are not allowed to disclose any part of the texts, not even to show examples to support our claims. This means that the results of the manual error analysis can only be reported in general and aggregate form.

Through the manual error analysis, it is clear that the system has learned to recognize many of the clues that could be considered reasonable red flags for eating disorders. Out of 50 false positive texts, 26 contained explicit references to food, physical appearance, weight loss, medical personnel, rehab, anxiety or problematic relationships. Of the remaining 24 texts, 6 contained references to partners (words like "boyfriend", "girlfriend" and "fiancé"), which Trotzek et al. (2017) found to have a high information gain in the classification of depressed users. Some texts presented expressions like "miserable", "I feel bad about it" and "the darker side of life". All in all, there were 12 texts out of 50 where it was not possible to find any clues that could have led to the classification error.

Concerning false negatives, the situation is quite different, in the sense that here it is the gold standard that is lacking. The texts were not annotated for the presence of discourse related to eating disorders, so we had to transfer the labels from users to texts in order to train the classifier. This means that many texts labeled as 1 are actually about completely different topics. Most people talk about normal things online even if they have a mental health issue, at least some of the time. Among our 50 randomly sampled false negative texts, 42 did not contain any mention of topics related to eating disorders, or any indication that the user was feeling psychologically unwell.

5.2. Results on the test set

After the official testing period was over, the organizers made the test set available to the participating teams. This allowed us to carry out our own evaluation and to spot a problem with the testing script (see Section 5.3). Here we report our results on the official test set and the corresponding error analysis. Table 5.3 shows the performance of our models on the official test set. We used a script provided by the organizers to evaluate precision, recall, F1 and ERDE, so the results should be comparable to the official ones. As for the development set, we do not report the results for the new metrics introduced in 2019, as they were made public after testing was concluded.

                               Baseline   Rand. Ind.   GloVe   ELMo
LSTM classifier    Accuracy    96.17      96.57        96.79   96.67
Logistic           Accuracy    96.33      95.8         93.76   95.64
regression         Precision    0.35       0.34         0.4     0.41
classifier         Recall       0.89       0.89         0.9     0.9
                   F1           0.5        0.49         0.55    0.56
                   ERDE-5       4.01       4.22         4.49    3.64
                   ERDE-50      2.68       2.58         2.45    2.27
Multi-layer        Accuracy    97.76      97           97.48   98.13
perceptron         Precision    0.49       0.64         0.68    0.65
classifier         Recall       0.77       0.63         0.67    0.7
                   F1           0.6        0.63         0.68    0.67
                   ERDE-5       5.08       6.51         6.98    6.81
                   ERDE-50      3.46       4.46         4.3     3.85

Table 5.3.: The results of our experiments on the test set. The scores are reported in percentages, except precision, recall and F1, which are reported as decimals.

On the test set we observed many of the same patterns that we had seen on the development set, although F1 tended to be worse and ERDE scores were instead better across the board. This is probably due to the fact that the official test set contained a lower percentage of positive users (9% vs. 12.8%). The model with GloVe embeddings and the multi-layer perceptron classifier, which had the best F1 on the development set, also had the best F1 on the test set. Again, ELMo vectors did not make much of a difference in the performance of the model, but there is a slight advantage over GloVe vectors in combination with the logistic regression classifier, and also in the ERDE scores for the multi-layer perceptron classifier.

Regarding the difference between the logistic regression and multi-layer perceptron classifiers, we could detect a clearer trend on the test set than we had on the development set. We had already observed that logistic regression seemed to lead to worse precision scores, but on the test set we could also determine that it gave rise to better ERDE scores. This result can be explained as follows: if the system incurs many false positives, it will likely also be able to correctly identify the true positives, and zooming in on many true positives early on also leads to good ERDE scores.

In this testing condition we could also identify a clearer difference between the baseline model and the models which used some form of knowledge transfer from pre-trained word embeddings, at least in combination with the multi-layer perceptron. Compared to the baseline, the other three models show a better balance between precision and recall, and they also show worse ERDE scores, which are a symptom of a more conservative behavior, especially in the early phases.

5.2.1. Error analysis

Table 5.4 shows the results of the automatic error analysis on the test set. We decided to look at the test set to find possible explanations for the performance of the models, but we did this only after the system was developed and all evaluations had been carried out. The model with the fewest total errors is the model with GloVe vectors, which also has the lowest number of false positives. On the development set we had observed that the model with random indexing vectors had the most conservative behavior, i.e. it had the fewest false positives, whereas on the bigger test set the model with GloVe vectors proved to be more cautious.

On the test set there is a perceptible difference between the models with pre-trained vectors and the baseline in terms of total number of errors, which we did not observe on the development set. The total number of errors is below 20,000 for all three models with external vectors, whereas it is almost 22,000 for the baseline. The proportion of false positives to false negatives is also different for the baseline compared to the other three models. As on the development set, the ELMo model could correctly classify the largest number of texts where all three other models failed, although it did not have the lowest number of errors in absolute terms. This might mean that this model had picked up some clues that the other models could not identify, but such clues do not always lead to a better performance.

The table also shows that all models made fewer mistakes on short documents, and this is largely due to the fact that the majority class tends to have shorter documents. It is interesting to notice, however, that the model with ELMo vectors seems to have less difficulty than the other models in correctly predicting longer documents. This could be due to the contextualized nature of these embeddings: in this model the data passes through an LSTM network twice, so the final representation might better preserve information and relationships even in longer sequences.

             Length correct   Length wrong   False neg.   False pos.   Total errors   Only correct
Baseline     18.59            40.22          13,871       7,957        21,828         887
Rand. Ind.   18.63            41.49          15,117       4,451        19,568         561
GloVe        18.76            39.02          14,721       3,565        18,286         301
ELMo         22.74            37.88          14,660       4,305        18,965         939

Table 5.4.: The results of the automatic error analysis of the LSTM text classifier output for our four models on the official test set.

On the test set we also carried out another type of error analysis. We wanted to see what kind of features had misled the models into committing false positive errors, and our hypothesis was that in the presence of a strong lexical cue (like "weight"), the models would ignore the rest of the context of the document and emit a positive decision. To test this hypothesis, we calculated the document frequency of the 150 most frequent words in the set of documents that our different models classified as belonging to the positive class, and compared the ranks of some relevant words with their ranks in the same kind of list for the real positive class. The rationale was that if the words had a higher rank in our models' positive predictions than in the real positive class, those lexical clues had been given too much importance in the classification. Table 5.5 shows the results of this analysis on the test set. We chose a set of lexical features that we thought could be meaningful, but of course the same analysis could be applied to many other words.

            Positive class   Baseline   Rand. In.   GloVe   ELMo
eat         14               12         12          11      13
feel        21               18         16          15      15
look        42               38         48          46      43
help        49               41         41          35      39
body        106              86         80          82      98
weight      111              67         84          47      50
food        133              102        110         66      81
fat         -                126        -           101     119
calories    -                143        145         126     129
diet        -                -          -           122     134
hospital    -                146        146         -       -

Table 5.5.: The rank of some lexical features in the document frequency list for the real positive class and for the positive predictions of our four models. The hyphen (-) indicates that the word is not present in the top 150 most frequent words for that data partition.

The results show that for the more frequent words, namely "eat", "feel", "look" and "help", the ranking in our models' positive predictions is not dramatically different from the ranking in the real positive class. However, from the word "body" downwards, the ranking in our models' predictions diverges more and more from the ranking in the real positive class. In a way, this shows that our models have "understood" the task and have identified the salient clues that many of the texts in the positive class have in common. If we take into account the coarse labelling of the texts, we can also add that the word frequency in the real positive class is probably skewed by the many non-problematic texts that are included in that class.

Still, we can find some support for our hypothesis that the models in some cases gave too much weight to lexical features which could just as well be used in non-alarming contexts. This is a useful insight, because one simple way to help the models improve their accuracy in this respect would be to give them more negative examples where those same lexical features are used in contexts other than eating disorders, so that they could learn to disambiguate from the context. For example, if a document uses the word "weight" in the context of the phrase "the weight of responsibility", and the model has been exposed to a sufficient number of such instances, it will know that "weight" does not necessarily correlate with eating disorders in all contexts, and it might give more importance to other cues, lexical and non-lexical.
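
The document-frequency comparison behind Table 5.5 can be sketched as follows. The input is assumed to be lists of tokenized documents, and words outside the top 150 are reported as missing, corresponding to the hyphens in the table.

# Sketch of the document-frequency ranking comparison described above.
from collections import Counter

def rank_table(docs, top_n=150):
    """Return {word: rank} for the top_n words by document frequency in docs (lists of tokens)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))            # count each word at most once per document
    ranked = [w for w, _ in doc_freq.most_common(top_n)]
    return {w: i + 1 for i, w in enumerate(ranked)}

# Example comparison for a handful of cues (None plays the role of the hyphen):
# true_ranks = rank_table(positive_docs)        # documents of truly positive users
# model_ranks = rank_table(predicted_docs)      # documents a model labeled as positive
# for word in ["eat", "feel", "weight", "food"]:
#     print(word, true_ranks.get(word), model_ranks.get(word))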

5.3. Shared task

A total of 13 teams successfully completed at least one run in eRisk 2019. Most of them submitted five runs, but a few decided to use three runs or fewer. As mentioned in Section 4.3, we submitted five models to the shared task. Five teams decided to stop processing the writings before the stream of 2000 texts was exhausted. Some teams stopped as early as the 9th or 11th round of the competition.

Unfortunately, because of a problem with the script that we used to communicate with the server in the shared task, our systems did not behave as we intended. We processed only the first round of texts and emitted our decisions based on that. The scores reported below should be read with this in mind, but since some teams chose to process only 9 or 10 texts, this is not a direct contravention of the shared task rules, and the results are still valid. We chose to include these results in the present report because they provide some interesting insights into how our system behaves in the earliest phase of testing. Given that the whole point of eRisk is to evaluate early decision making, these results can still add some value to our understanding of the task.

The organizers kept track of the time that the teams took to complete the shared task. Only four teams finished the task in less than a day, and out of those, two teams did not process all texts. We took two days and seven hours to process all rounds of texts, which was still on the lower end of the spectrum. Eight teams took longer than we did to process the data, ranging from five to over twenty-one days. The results that we obtained in the official decision-based evaluation are reported in Table 5.6, which is an extract of the results table provided by the organizers (Losada et al., 2019).

Table 5.6.: An extract of the results table provided by the organizers for the decision-based evaluation, concerning our team UppsalaNLP.

Table 5.7 shows the results that our team obtained in the ranking-based evaluation, and is also an extract from the table provided by the organizers after the official evaluation. The ranking-based metrics were also computed after having processed 500 and 1000 writings, but obviously our results stayed the same at all checkpoints, so we do not report those columns here.

From the tables above it emerges that out of our five runs, the model that performed best in the official evaluation was run 4, which employed random indexing vectors and a multi-layer perceptron. This holds for both the decision-based evaluation and the ranking-based evaluation. The only exception is the best recall score, which was obtained by the baseline model with a logistic regression classifier.

Table 5.7.: An extract of the results table provided by the organizers for the ranking-based evaluation, concerning our team UppsalaNLP.

From the results of the error analysis on the development and test sets, we can hypothesize why our random indexing model with a multi-layer perceptron ended up performing best in the shared task, where we only processed one round of texts. It is worth remembering that in the shared task a label of 0 could be changed at any time to 1, but a label of 1 was final. False negatives could be recovered later on (at some cost), but early false positives were lost for good. Table 5.2 shows that the random indexing model had the fewest false positives on the development set, whereas Table 5.4 shows that the GloVe model had the fewest false positives on the test set. If we exclude the ELMo model, which was not submitted to the shared task, the random indexing model had the second most conservative behavior on the test set. In addition, we know that using a multi-layer perceptron classifier leads to a better balance between precision and recall. If we look at the results of the GloVe models, runs 2 and 3, the model with the logistic regression classifier follows the pattern of better recall and low precision, whereas the model with the multi-layer perceptron shows the opposite pattern. This might mean that the GloVe model, in combination with the multi-layer perceptron, is too conservative to give a good performance after only one round of texts, whereas the random indexing model strikes a better balance.

Table A.1 in the Appendix reports the results of the best teams in the decision-based evaluation. Of course our results presented in Section 5.2 are not official, but to the best of our knowledge they should be comparable to the results reported in the table, because they are obtained under the same testing conditions. Our best F1 score on the test set was 0.68, whereas the best F1 in the shared task was 0.71. We obtained a recall of 0.9, which is only slightly worse than 1.0, reached in the shared task only by systems that severely sacrificed precision. We can therefore affirm that the quality of our system is comparable to that of the best participating teams. In the case of ERDE5 and ERDE50, more than one team shared the first place with the same non-perfect scores of 0.06 and 0.03, respectively. These values are likely rounded up to the nearest percentage point, and if we do the same with our unofficial results, we actually obtain an ERDE5 of 0.04 for all our logistic regression models, and an ERDE50 of 0.02 for our GloVe/ELMo and logistic regression models. Concerning speed and latency, more than one team obtained a perfect score of 1, which we also did, but this is because we did not change our predictions after the first round.

For the sake of completeness, we also report the scores of the best teams in the ranking-based evaluation. Not all teams submitted risk level scores for all 2000 rounds. In fact, two teams did not submit any scores at all, and only six teams kept submitting scores until the end of the stream. Table A.2 in the Appendix shows the results of the best teams in the ranking-based evaluation. In light of the results reported above, we can assert that we built a system strongly focused on early risk prediction, without necessarily sacrificing precision and recall.

6. Conclusion

For this thesis project, we built a system to participate in the shared task T1 at eRisk 2019. The goal of the task was to evaluate systems for early risk prediction on the Internet, and in addition to that, we wanted to evaluate different word representation methods in the framework of a controlled task. The system that we built was based on a two-step learning process that allowed us to deal with the temporal dimension of the task. The first step consisted of text classification with an LSTM classifier, whereas the second step consisted of user classification using manually engineered features. The chosen classification methods were logistic regression and multi-layer perceptron.

During the official testing phase of the shared task we experienced a glitch in the communication with the server emitting the texts and saving the responses, so we only processed the first round of documents. This in itself gave us useful insights into the earliest processing stage of our system, but it also meant that we had to rely more heavily on our own evaluation to draw meaningful conclusions. In the very first processing round, the model with random indexing vectors and a multi-layer perceptron classifier performed best compared to the other models. We hypothesize that this is due to the rather conservative behavior of this model, which allowed it to strike a better balance between precision and recall compared to the other models. The model with GloVe vectors and a multi-layer perceptron, on the other hand, was too conservative, which led to low recall and worse ERDE scores.

As our own evaluation shows, the models with GloVe and ELMo vectors eventually caught up with and surpassed the models with random indexing vectors, obtaining higher precision and recall and in some cases also better ERDE scores. We hypothesize that random indexing vectors contain much of the same information as GloVe vectors, but given that they were proposed before the current popularity of deep learning, their size does not allow the hidden layer of a neural network to make use of all the available information. We also observed that using logistic regression to classify users led to high recall and rather low precision, whereas using a multi-layer perceptron classifier led to a better balance between the two. This is a useful insight that could guide the choice of classifier in a real-life application, depending on what the end user thinks should be prioritized.

Contrary to our expectations, contextualized ELMo vectors did not result in notable performance gains compared to regular GloVe vectors. This could be partly due to the fact that our data set was coarsely annotated, but given the circumstances, the big overhead in training and testing time for the ELMo model could not be justified by improved results. In any case, our manual error analysis showed that even the baseline LSTM model had picked up well on salient clues for this text classification task.

Our deep learning approach has proven to be quite successful and is a promising way forward in this line of research. Regarding the architecture of our user classifier, we can suggest some improvements based on the knowledge that we have gathered up to this point. We did not experiment with many different combinations of features for the user classifier due to time constraints, so one possible future direction would be to include other features and try to improve performance. For example, we could have used a running average instead of (or in addition to) a global average to give single documents more influence over the decision even in later stages of processing (see the sketch at the end of this chapter). With the kind of set-up that we used, it makes little sense to keep processing all 2000 rounds, as it would be almost impossible for individual texts to sway the global average later on. We did not take into consideration the possibility of suspending processing after a certain number of texts because we were not sure whether this was allowed by the rules of the task, but as other teams did it, we could also have taken this approach to reduce processing time.

In conclusion, we did build a successful early risk prediction system, which could potentially be employed to help health care professionals in identifying at-risk users online. There is definitely room for improvement, especially in the direction of obtaining better precision without affecting recall and ERDE scores, so we leave this as a question for future research.
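
As an illustration of the running-average idea mentioned above, the following sketch contrasts a global average of per-document scores with an exponential moving average; the decay factor is an arbitrary assumption and not a value used in this work.

# Global average vs. running (exponential moving) average of per-document scores.
def global_average(scores):
    return sum(scores) / len(scores)

def running_average(scores, alpha=0.1):
    avg = scores[0]
    for s in scores[1:]:
        avg = alpha * s + (1 - alpha) * avg   # recent documents weigh more
    return avg

# Example: a user whose late documents score high is flagged much more clearly
# by the running average than by the global one.
scores = [0.1] * 50 + [0.9] * 5
print(global_average(scores))    # ~0.17
print(running_average(scores))   # ~0.43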

References

Almeida, Felipe and Geraldo Xexéo (2019). "Word Embeddings: A Survey". CoRR. arXiv: 1901.09069.
Bird, Steven, Ewan Klein, and Edward Loper (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
Cacheda, Fidel, Diego Fernández Iglesias, Francisco J. Nóvoa, and Victor Carneiro (2018). "Analysis and Experiments on Early Detection of Depression". In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum.
Caruana, Rich, Steve Lawrence, and C Lee Giles (2001). "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping". In: Advances in neural information processing systems, pp. 402–408.
Chollet, François et al. (2015). Keras. https://keras.io.
Deerwester, Scott, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman (1990). "Indexing by latent semantic analysis". Journal of the American society for information science 41.6, pp. 391–407.
Defazio, Aaron, Francis R. Bach, and Simon Lacoste-Julien (2014). "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". CoRR. arXiv: 1407.0202.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Errecalde, Marcelo Luis, Maria Paula Villegas, Darío Gustavo Funez, Maria José Garciarena Ucelay, and Leticia Cecilia Cagnina (2017). "Temporal Variation of Terms as Concept Space for Early Risk Prediction." In: CLEF (Working Notes).
Funez, Dario G, Maria José Garciarena Ucelay, Maria Paula Villegas, Sergio G Burdisso, Leticia C Cagnina, Manuel Montes-y-Gómez, and Marcelo L Errecalde (2018). "UNSL's participation at eRisk 2018 Lab". In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. Vol. 2125.
Gardner, Matt, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer (2017). "AllenNLP: A Deep Semantic Natural Language Processing Platform". In: eprint: arXiv:1803.07640.
Hashimoto, Kazuma, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher (2017). "A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks". In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1923–1933.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). "Long short-term memory". Neural computation 9.8, pp. 1735–1780.

Kanerva, Pentti, Jan Kristoferson, and Anders Holst (2000). "Random indexing of text samples for latent semantic analysis". In: Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 22. 22.
Krishna, Kalpesh, Preethi Jyothi, and Mohit Iyyer (2018). "Revisiting the Importance of Encoding Logic Rules in Sentiment Classification". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4743–4751.
Le, Quoc and Tomas Mikolov (2014). "Distributed representations of sentences and documents". In: International conference on machine learning, pp. 1188–1196.
Li, Zhixing, Zhongyang Xiong, Yufang Zhang, Chunyong Liu, and Kuan Li (2011). "Fast text categorization using concise semantic analysis". Pattern Recognition Letters 32.3, pp. 441–448.
Losada, David E., Fabio Crestani, and Javier Parapar (2019). "Overview of eRisk 2019: Early Risk Prediction on the Internet". In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019. Lugano, Switzerland: Springer International Publishing.
Losada, David E., Fabio Crestani, and Javier Parapar (2018). "Overview of eRisk: Early Risk Prediction on the Internet". In: International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, pp. 343–361.
Maupomé, Diego and Marie-Jean Meurs (2018). "Using Topic Extraction on Social Media Content for the Early Detection of Depression". In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum.
McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher (2017). "Learned in translation: Contextualized word vectors". In: Advances in Neural Information Processing Systems, pp. 6294–6305.
Melamud, Oren, Jacob Goldberger, and Ido Dagan (2016). "context2vec: Learning generic context embedding with bidirectional lstm". In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013). "Efficient estimation of word representations in vector space". arXiv preprint arXiv:1301.3781.
Neelakantan, Arvind, Jeevan Shankar, Alexandre Passos, and Andrew McCallum (2014). "Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 1059–1069.
Ortega-Mendoza, Rosa María, Adrián Pastor López-Monroy, Anilu Franco-Arcega, and Manuel Montes-y-Gómez (2018). "PEIMEX at eRisk2018: Emphasizing Personal Information for Depression and Anorexia Detection". In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research 12, pp. 2825–2830.

Pennington, Jeffrey, Richard Socher, and Christopher Manning (2014). "Glove: Global vectors for word representation". In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (2018). "Deep Contextualized Word Representations". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 2227–2237.
Ragheb, Waleed, Bilel Moulahi, Jérôme Azé, Sandra Bringay, and Maximilien Servajean (2018). "Temporal Mood Variation: at the CLEF eRisk-2018 Tasks for Early Risk Detection on The Internet". In: CLEF: Conference and Labs of the Evaluation. 2125, p. 78.
Ramiandrisoa, Faneva, Josiane Mothe, Farah Benamara, and Véronique Moriceau (2018). "IRIT at e-Risk 2018". In: E-Risk workshop, pp. 367–377.
Rude, Stephanie, Eva-Maria Gortner, and James Pennebaker (2004). "Language use of depressed and depression-vulnerable college students". Cognition & Emotion 18.8, pp. 1121–1133.
Sadeque, Farig, Dongfang Xu, and Steven Bethard (2018). "Measuring the latency of depression detection in social media". In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, pp. 495–503.
Sahlgren, Magnus (2005). "An introduction to random indexing". URL: eprints.sics.se.
Sahlgren, Magnus, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan, and Anders Holst (2016). "The Gavagai living lexicon". In: Language Resources and Evaluation Conference. ELRA.
Salton, Gerard, Anita Wong, and Chung-Shu Yang (1975). "A vector space model for automatic indexing". Communications of the ACM 18.11, pp. 613–620.
Smirnova, D, E Sloeva, N Kuvshinova, A Krasnov, D Romanov, and G Nosachev (2013). "1419–Language changes as an important psychopathological phenomenon of mild depression". European Psychiatry 28, p. 1.
Trotzek, Marcel, Sven Koitka, and Christoph M Friedrich (2017). "Linguistic Metadata Augmented Classifiers at the CLEF 2017 Task for Early Detection of Depression." In: CLEF (Working Notes).
Trotzek, Marcel, Sven Koitka, and Christoph M Friedrich (2018). "Word Embeddings and Linguistic Metadata at the CLEF 2018 Tasks for Early Detection of Depression and Anorexia". In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. Vol. 2125.
Wolf, Markus, Jan Sedway, Cynthia M Bulik, and Hans Kordy (2007). "Linguistic analyses of natural written language: Unobtrusive assessment of cognitive style in eating disorders". International journal of eating disorders 40.8, pp. 711–717.
Wolf, Markus, Florian Theis, and Hans Kordy (2013). "Language use in eating disorder blogs: Psychological implications of social online activity". Journal of Language and Social Psychology 32.2, pp. 212–226.

A. Appendix

Metric      Team(s)        Run      Score
Precision   lirmm          1        0.77
Recall      Fazl           1-3      1.00
F1          CLaC           4        0.71
ERDE-5      UppsalaNLP     0-4      0.06
            UNSL           1-4
            CLaC           0-2, 4
            BioInfo@UAVR   0
            BiTeM          1
ERDE-50     BiTeM          1        0.03
            CLaC           1-2, 4
            UNSL           2-4
Latency     UppsalaNLP     0-4      1.00
            BioInfo@UAVR   0
            BiTeM          0, 3
            SSN-NLP        1, 4
Speed       UppsalaNLP     0-4      1.00
            BioInfo@UAVR   0
            BiTeM          0, 3
            SSN-NLP        0-4
            UNSL           0-3
F-latency   CLaC           4        0.69

Table A.1.: The scores of the best teams (name and run number) for the decision-based evaluation.

                Metric      Team(s)      Run n.   Score
1 writing       P@10        UppsalaNLP   4        0.8
                            BiTeM        1-4
                            UNSL         0-4
                            LTL-INAOE    0-1
                nDCG@10     UNSL         0-4      0.82
                nDCG@100    UNSL         2        0.55
100 writings    P@10        UNSL         0-3      1
                            LTL-INAOE    0-1
                nDCG@10     UNSL         0-3      1
                            LTL-INAOE    0-1
                nDCG@100    UNSL         4        0.85
500 writings    P@10        UDE          1        1
                            UNSL         0-4
                nDCG@10     UDE          1        1
                            UNSL         0-4
                nDCG@100    UDE          1        0.87
1000 writings   P@10        UDE          1        1
                            UNSL         0-4
                nDCG@10     UDE          1        1
                            UNSL         0-4
                nDCG@100    UDE          1        0.88

Table A.2.: The scores of the best teams (name and run number) for the ranking-based evaluation.
