Predicting Depression and Suicide Ideation in the Canadian Population Using Social Media Data
Total Page:16
File Type:pdf, Size:1020Kb
Predicting Depression and Suicide Ideation in the Canadian Population Using Social Media Data by Ruba Skaik Thesis submitted to the University of Ottawa in partial Fulfillment of the requirements for the Ph.D. degree in Computer Science School of Electrical Engineering and Computer Science Faculty of Engineering University of Ottawa © Ruba Skaik, Ottawa, Canada, 2021 Abstract The economic burden of mental illness costs Canada billions of dollars every year. Mil- lions of people suffer from mental illness, and only a fraction receives adequate treatment. Identifying people with mental illness requires initiation from those in need, available medical services, and professional experts’ time allocation. These resources might not be available all the time. The common practice is to rely on clinical data, which is generally collected after the illness is developed and reported. Moreover, such clinical data is incom- plete and hard to obtain. An alternative data source is conducting surveys through phone calls, interviews, or mail, but this is costly and time-consuming. Social media analysis has brought advances in leveraging population data to understand mental health prob- lems. Thus, analyzing social media posts can be an essential alternative for identifying mental disorders throughout the Canadian population. Big data research of social media may also endorse standard surveillance approaches and provide decision-makers with us- able information. More precisely, social media analysis has shown promising results for public health assessment and monitoring. In this research, we explore the task of automat- ically analysing social media textual data using Natural Language Processing (NLP) and Machine Learning (ML) techniques to detect signs of mental health disorders that need attention, such as depression and suicide ideation. Considering the lack of comprehensive annotated data in this field, we propose a methodology for transfer learning to utilize the information hidden in a training sample and leverage it on a different dataset to choose the best-generalized model to be applied at the population level. We also present evidence that ML models designed to predict suicide ideation using Reddit data can utilize the knowledge they encoded to make predictions on Twitter data, even though the two platforms differ in the purpose, structure, and limitations. In our proposed models, we use feature engi- neering with supervised machine learning algorithms (such as SVM, LR, RF, XGBoost, and GBDT), and we compare their results with those of deep learning algorithms (such as LSTM, Bi-LSTM, and CNNs). We adopt the CNN model for depression classification that obtained the highest F1-score on the test dataset (0.898) and 0.941 recall. This model is later used to estimate the depression level of the population. For suicide ideation detection, we used the CNN model with pre-trained fastText word embeddings and linguistic features (LIWC). The model achieved an F1-score of 0.936 and a recall of 0.88 to predict suicide ideation at the user-level on the test set. ii To compare our models’ predictions with official statics, we used 2015-2016 population- based Canadian Community Health Survey (CCHS) on Mental Health and Well-being conducted by Statistics Canada. The data is used to estimate depression and suicidality in Canadian provinces and territories. For depression, (n=53,050) respondents filled in the Patient Health Questionnaire-9 (PHQ-9) from 8 provinces/territories. Each survey respondent with a score ≥ 10 on the PHQ-9 was interpreted as having moderate to severe depression because this score is frequently used as a screening cut-point. The weighted percentage of depression preva- lence during 2015 for females and males of the age between 15 to 75 was 11.5% and 8.1%, respectively (with 54.2% females and 45.8% males). Our model was applied on a population-representative dataset that contains 24,251 Twitter users who posted 1,735,200 tweets during 2015 with a Pearson correlation of 0.88 for both sex and age within the seven provinces and NT territory included in the CCHS. An age correlation of 0.95 was calculated for age and sex (separately) and our model estimated that 10% of the sample dataset has evidence of depression (58.3% females and 41.7% males). For the second task, suicide ideation, Statistics Canada (2015) estimated the total number of people who reported serious suicidal thoughts as 3,396,700 persons, i.e., 9.514% of the total population, whereas our models estimated 10.6% of the population sample were at risk of suicide ideation (59% females and 41% males). The Pearson correlation coefficients between the actual suicide ideation within the last 12 months and the predicted model for each province per age, sex, and both more than 0.62, which indicates a reasonable correlation. iii Acknowledgements First of all, I am blessed to have Professor Diana Inkpen as my supervisor. I want to thank her for being an amazing advisor. I benefited not only from her profound insights and knowledge but also from her patience, kindness, and persistence. I am grateful for all the support provided by her, including Tamale seminars and group meetings. It is an honor working with and learning from her. I want to express my gratitude to the Ph.D. committee members, Professor Marina Sokolova, Professor Olga Baysal, and Professor Hussein Al-Osman, for their insightful comments and feedback on my thesis. Special thanks to Dr. Kenton White, Zunaira Jamil, Shainen Davidson, and Peter Glossop for providing me with the Twitter data for population analysis through Advanced Symboic Inc. Thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding my research. Thanks to Prasadith Buddhitha, Haifa Al-Harithi, and all TAMALE group members. It was a long journey, and I am indebted to many people throughout these years. I want to thank my family, especially my mother, for believing in me and fighting for my dream. Thanks to my beloved husband, Adel; without you this journey would neither have started nor have ended. Thanks to my kids Ahmad and Noor for their understanding. Thanks to my sisters and brothers, for their encouragement. Thanks to my uncle for his love. Special thanks to Lama Skaik, Mohammed Sadeq, and Huda Al-Atari for their constant support and care. Finally, thanks to all my friends, especially Reem Al-Halimi, Mariam Al-Otaibi, Maha Abu-Sharkh, Anas Nayfeh, Asmaa Abu-Bakr, and Manal Al-Cheikh. iv Table of Contents List of Tables ix List of Figures xii 1 Introduction1 1.1 Overview.....................................1 1.2 Motivation....................................3 1.3 Problem Statement...............................4 1.3.1 Generalizability.............................4 1.3.2 Transfer Learning............................5 1.3.3 Population Perspective.........................5 1.4 Hypothesis....................................5 1.5 Contributions..................................6 1.6 Thesis Organization...............................6 1.7 Published Papers................................7 2 Related Work8 2.1 Overview.....................................8 2.2 Features for Social Media Text.........................8 2.3 Post-level Analysis...............................9 v 2.4 User-level Analysis............................... 11 2.5 Population-level Analysis............................ 12 2.5.1 Depression Detection.......................... 15 2.5.2 Predicting Suicide Ideation and Self-harm.............. 17 2.6 Summary.................................... 18 3 Datasets 19 3.1 Overview..................................... 19 3.2 Data Collection Methods............................ 20 3.2.1 Screening Surveys............................ 20 3.2.2 Forums Membership.......................... 22 3.2.3 Social Media Posts........................... 22 3.3 Shared Datasets................................. 30 3.3.1 CLPsych Shared Tasks......................... 30 3.3.2 eRisk Shared Tasks........................... 32 3.3.3 Crisis Text Line............................. 33 3.4 Datasets Used in This Research........................ 34 3.4.1 Advanced Symbolics Inc. Dataset................... 34 3.4.2 CLPsych Shared Task 2015 Dataset.................. 39 3.4.3 Depression Self Reported Dataset................... 41 3.4.4 CLPsych Shared Task 2019 Dataset................. 43 3.4.5 Suicide ideation ASI Dataset...................... 44 3.4.6 How Are These Datasets Being Used?................. 47 3.5 Summary.................................... 47 vi 4 Depression Prediction from User to Population 48 4.1 Overview..................................... 48 4.2 Depression Classification Methods....................... 49 4.2.1 Feature Selection............................ 49 4.2.2 Traditional Classification Models................... 53 4.2.3 Deep Learning Classification Models................. 54 4.3 Results and Discussion............................. 57 4.3.1 The Results of the Traditional Classifiers............... 57 4.3.2 Deep Learning Models......................... 67 4.4 Depression Detection at Population Level................... 71 4.5 Summary.................................... 74 5 Suicide Ideation from User to Population 75 5.1 Overview..................................... 75 5.2 Suicide Ideation Classification Models..................... 76 5.2.1 Feature Selection............................ 76 5.3 Results and Discussion............................. 80 5.3.1 The Results of the Traditional Classifiers............... 80 5.3.2 Deep Learning Models........................