
Explainable Depression Detection with Multi-Modalities Using a Hybrid Deep Learning Model on Social Media

Hamad Zogan^a, Imran Razzak^b, Xianzhi Wang^a, Shoaib Jameel^c and Guandong Xu^a

^a University of Technology Sydney (UTS), Australia
^b Deakin University, Australia
^c University of Essex, United Kingdom

ARTICLE INFO

Keywords: depression detection, social network, deep learning, machine learning, explainable

ABSTRACT

Model interpretability has become important to engender appropriate user trust by providing insight into the model prediction. However, most existing machine learning methods provide no interpretability for depression prediction, so their predictions are obscure to humans. In this work, we propose an interpretable Multi-Modal Depression Detection with Hierarchical Attention Network (MDHAN) to detect depressed users on social media and explain the model prediction. We consider user posts together with Twitter-based multi-modal features. Specifically, we encode user posts using two levels of attention applied at the tweet level and the word level, calculate the importance of each tweet and word, and capture semantic sequence features from the user timelines (posts). Our experiments show that MDHAN outperforms several popular and robust baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. We also show that our model improves predictive performance when detecting depression in users who post messages publicly on social media. MDHAN achieves excellent performance and provides adequate evidence to explain its predictions.

1. Introduction

Mental illness is a serious issue faced by a large population around the world. In the United States (US) alone, a significant percentage of the adult population is affected by mental disorders every year, including depression (6.7%), anorexia and bulimia nervosa (1.6%), and bipolar disorder (2.6%) [1]. Mental illness has sometimes been attributed to mass shootings in the US [2], which have taken numerous innocent lives. One of the most common mental health problems is depression, which is more prevalent than other mental illness conditions worldwide [3]. The fatality risk of suicide in depressed people is 20 times higher than in the general population [4]. Diagnosing depression is usually difficult because depression detection needs thorough and detailed psychological testing by experienced psychiatrists at an early stage [5], and it requires interviews, questionnaires, self-reports, or testimony from friends and relatives. Moreover, it is very common for people who suffer from depression not to visit clinics and seek help from doctors in the early stages of the problem [6].

Individuals and health organizations have thus shifted away from their traditional interactions and now meet online, building online communities for sharing information and seeking and giving advice, which helps scale their reach so that they can cover more of the affected population in less time. Besides sharing their moods and actions, recent studies indicate that many people on social media tend to share or give advice on health-related information [7, 8, 9, 10]. These sources provide a potential pathway to discover mental health knowledge for tasks such as diagnosis and claims. It is common for people who suffer from mental health problems to "implicitly" (and sometimes even "explicitly") disclose their feelings and their daily struggles with mental health issues on social media as a way of relief [11, 12]. Therefore, social media is an excellent resource for automatically discovering people who are under depression. While it would take a considerable amount of time to manually sift through individual social media posts and profiles to locate people going through depression, automatic, scalable computational methods could provide timely mass detection of depressed people, which could help prevent many major fatalities in the future and help people who genuinely need it at the right moment.

Figure 1: Interpretable depression detection.

Usually, depressed users act differently when they are on social media, producing rich behavioural data, which is often used to extract various features. However, not all of these features are related to depression characteristics.

Recently, deep learning has been applied for depression detection on social media and has shown significantly better performance than traditional machine learning methods. Zogan et al. presented a computational framework for automatic detection of depressed users that first selects relevant content through a hybrid extractive and abstractive summarization strategy over the sequence of all user tweets, leading to more fine-grained and relevant content; this content is then forwarded to a deep learning framework comprising a unified learning machinery of a convolutional neural network coupled with attention-enhanced gated recurrent units, yielding better empirical performance than existing strong baselines [13]. Even though recent work has shown the effectiveness of deep learning methods for depression detection, most existing machine learning methods provide no interpretability for depression prediction; hence their predictions are obscure to humans, which reduces trust in the deep learning models. An interpretable model provides deep insight into how a deep learning model can be improved and supports understanding. Thus, in order to engender appropriate user trust and provide the reason behind a decision, we aim to develop an interpretable deep learning-based solution for depression detection by utilizing multi-modal features drawn from the diverse behaviour of depressed users on social media. Apart from the latent features derived from lexical attributes, we notice that the dynamics of tweets, i.e., the tweet timeline, provide a crucial hint reflecting how a depressed user's emotions change over time. To this end, we propose a hybrid model, the Multi-Modal Depression Detection Hierarchical Attention Network (MDHAN), to boost the classification of depressed users using multi-modal features and word embedding features. Figure 1 illustrates the effectiveness of interpretability in improving user trust. The model can derive new deterministic feature representations from training data, produce superior results for detecting the depression level of Twitter users, and derive explanations from a user's post content. Besides, we also studied the performance of our model when the two components, user posts and multi-modal features, are used separately. We found that model performance deteriorated when we used only multi-modal features, and we further show that combining the two attributes led to better performance. To summarize, our study makes the following key contributions:

1. We present a novel interpretable depression detection framework by deep learning the textual, behavioural, temporal, and semantic modalities from social media. To the best of our knowledge, this is the first work to use multi-modalities of topical, temporal and semantic features jointly with word embeddings in deep learning for depression detection.
2. We introduce the perspective of viewing the explainability of models for depression detection and build a pipeline, aided with interpretability and based on hierarchical attention networks, to explain the prediction of depression detection.
3. We perform extensive experiments on a benchmark depression Twitter dataset, which show the superiority of our proposed method when compared to baseline methods.

The rest of our paper is organized as follows. Section 2 reviews the related work. Section 3 formulates the problem. Section 4 presents the dataset used in this work and the pre-processing we applied to the data. Section 5 presents our model for depression detection, including the attributes we extract. Section 6 reports experiments and results. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we discuss closely related literature and how it differs from our proposed method. In general, just like our work, most existing studies focus on user behaviour to detect whether a user suffers from depression or any other mental illness. We also discuss relevant literature covering word embeddings and hybrid deep learning methods proposed for detecting mental health conditions from online social networks and other resources, including public discussion forums. Since we also introduce the notion of latent topics in our work, we cover related literature on topic modelling for depression detection, which has been widely studied.

Data present in social media is usually in the form of information that users share for public consumption, which also includes related metadata such as user location, language, and age, among others [14]. In the existing literature, there are generally two steps to analyzing social data. The first step is collecting the data generated by users on networking sites, and the second step is to analyze the collected data using, for instance, a computational model or manually. In any data analysis, feature extraction is an important task because, using only a small set of relevant features, one can learn a high-quality model.

Understanding depression on online social networks can be carried out using two complementary approaches, which are widely discussed in the literature:

• Post-level behavioural analysis
• User-level behavioural analysis

2.1. Post-level behavioural analysis

Methods that use this kind of analysis mainly target the textual features of the user post, extracted in the form of statistical knowledge, such as count-based methods [15]. These features describe the linguistic content of the post, as discussed in [16, 17]. For instance, in [16] the authors propose a classifier to understand the risk of depression. Concretely, the goal of the paper is to estimate the risk of user depression from their social media posts. To this end, the authors collect data from social media for a year preceding the onset of depression from user profiles and distil behavioural attributes to be measured relating to social engagement, emotion, language and linguistic styles, ego network, and mentions of antidepressant medications. The authors collect their data using a crowd-sourcing task on Amazon Mechanical Turk, which is not a scalable strategy. In their study, crowd workers were asked to undertake a standardized clinical depression survey, followed by various questions on their depression history and demographics. While the authors conducted thorough quantitative and qualitative studies, their approach does not scale to a large set of users and does not consider text-level semantics such as latent topics or semantic analysis using word embeddings. Our work is both scalable and considers various features which are jointly trained using a novel hybrid deep learning model with a multi-modal learning approach. It harnesses high-performance Graphics Processing Units (GPUs) and, as a result, has the potential to scale to large sets of instances. In Hu et al. [17], the authors also consider various linguistic and behavioural features on data obtained from social media. Their underlying model relies on both classification and regression techniques for predicting depression, while our method performs classification, but on a large scale and using a varied set of crucial features relevant to this task.

To analyze whether a post contains positive or negative words and/or emotions, or the degree of adverbs, [18] used cues from the text, for example, "I feel a little depressed" and "I feel so depressed", capturing the usage of the word "depressed" in sentences that express two different feelings. The authors also analyzed the posts' interactions on Twitter (retweets, likes, comments). Some researchers studied post-level behaviours to predict mental problems by analysing tweets on Twitter to find depression-related language. In [19], the authors developed a model to uncover meaningful and useful latent structure in a tweet. Similarly, in [20], the authors monitored different depression symptoms that are mentioned in a user's tweet. In [21], the authors study users' behaviour on both Twitter and Weibo. To analyze users' posts, they used linguistic features, and they used a Chinese-language psychological analysis system called TextMind for sentiment analysis. One interesting post-level behavioural study was done by [20] on Twitter by finding depression-relevant words and depression symptoms. In [22], the authors used post-level behaviour for detecting anorexia; they analyze domain-related vocabulary such as anorexia, eating disorder, food, meals and exercises.

2.2. User-level behavioural analysis

There are various features to model users in social media, as they reflect overall behaviour over several posts. Different from post-level features extracted from a single post, user-level features are extracted from several tweets posted at different times [18]. They also capture the user's social engagement on Twitter across many tweets, retweets and/or interactions with other users. Generally, the linguistic style of posts can be used to extract features [17, 23]. The authors in [20] extracted six depression-oriented feature groups for a comprehensive description of each user from the collected data set. They used the number of tweets and social interactions as social network features and, as user profile features, the personal information users share on a social network. Analysing user behaviour is also useful for detecting eating disorders. In Wang et al. [24], the authors extracted user engagement and activity features on social media. They extracted linguistic features of the users for psychometric properties, which resembles the settings described in [22, 14, 21], where the authors extracted 70 features from two different social networks (Twitter and Weibo); they extracted features from user profiles, posting time, and user interaction features such as the number of followers and followees. Similarly, Wongkoblap et al. combined user-level and post-level semantics and cast their problem as a multiple instance learning setup; the advantage of this method is that it can learn from user-level labels to identify post-level labels [25].

Recently, Lin et al. applied a CNN-based deep learning model to classify Twitter users based on depression using multi-modal features. The framework proposed by the authors has two parts. In the first part, the authors train their model in an offline mode, exploiting features from Bidirectional Encoder Representations from Transformers (BERT) [26] and visual features from images using a CNN model. The two feature sets are then combined, just as in our model, for joint feature learning. There is then an online depression detection phase that considers user tweets and images jointly, with feature fusion at a later stage. In another recently proposed work [27], the authors use visual and textual features to detect depressed users on Instagram posts rather than Twitter; their model also uses multi-modalities in data, but confines itself to Instagram only. While the model in [28] showed promising results, it still has certain disadvantages. For instance, BERT vectors for tokens are computationally demanding to obtain even during the fine-tuning stage, unlike our model, which does not have to train the word embeddings from scratch. Another limitation of their work is that they obtain sentence representations from BERT, which imposes a 512-token length limit; longer sequences are simply truncated, resulting in some information loss, whereas our model supports a much longer sequence length that we can tune easily because our model is computationally cheaper to train. We propose a hybrid model that considers a variety of features, unlike these works. While we have not specifically used visual features in our work, using a diverse set of crucial, relevant textual features is arguably more reasonable than using visual features alone. Of course, our model has the flexibility to incorporate a variety of other features, including visual features.

Multi-modal features from text, audio and images have also been used in [29], where a new graph attention-based model embedded with multi-modal knowledge was proposed for depression detection. While they used a temporal CNN model, their overall architecture was evaluated on small-scale questionnaire data; for instance, their dataset contains 189 sessions of interactions ranging between 7 and 33 minutes (with an average of 16 minutes). As they have not experimented with short and noisy data from social media, it remains to be seen how their method scales to such large collections. Xezonaki et al. [30] propose an attention-based model for detecting depression from transcribed clinical interviews rather than from online social networks; their main conclusion was that individuals diagnosed with depression use affective language to a greater extent than those who are not going through depression. In another recent work [31], the authors discuss depression among users during the COVID-19 pandemic using LSTM and fastText [32] embeddings. In [33], the authors also propose a multi-modal RNN-based model for depression prediction, but apply it to online user forum datasets. Trotzek et al. [34] study the problem of early detection of depression from social media using deep learning, where they leverage different word embeddings in an ensemble-based learning setup; the authors even train a new word embedding on their dataset to obtain task-specific embeddings. While the authors used a CNN model to learn high-quality features, their method does not consider temporal dynamics coupled with latent topics, which we show to play a crucial role in overall quantitative performance. Farruque et al. [35] study the problem of creating word embeddings when data is scarce, for instance, depressive language detection from user tweets; the underlying motivation of their work is to simulate a retrofitting-based word embedding approach [36], where they begin with a pre-trained model and fine-tune it on domain-specific data.

Topic modelling has also been used for the detection of depression from social media. Gong et al. [37] proposed a topic modelling approach to depression detection using multi-modal analysis, introducing a novel topic model that is context-aware with temporal features. While the model produced satisfactory results on the 2017 Audio/Visual Emotion Challenge (AVEC), the method does not use a variety of rich features and could face scalability issues, because simple posterior inference algorithms such as those based on Gibbs or collapsed Gibbs sampling do not parallelize, unlike deep learning methods, or require sophisticated engineering to parallelize.

Recent studies have started to target depressed users online, extracting features representing user behaviours and classifying these features into different groups, such as the number of posts, posting time distribution, and number of followers and followees. Peng et al. extracted different features, classified them into three groups (user profile, user behaviour and user text), and used a multi-kernel SVM for classification [38]. The above-mentioned works have some limitations. They mainly focus on studying user behaviour rather than taking cues from user-generated content such as the text users share, which makes it difficult to achieve high classification performance. These models also cannot work well to detect depression at the user level and, as a result, are prone to incorrect predictions. Our novel approach combines user behaviour with the user's history of posts. Besides, our strategy of selecting salient content using automatic summarization helps our model focus only on the most important information. Although recent deep learning methods have shown significant performance for depression detection, most existing machine learning methods provide no interpretability for their predictions. An interpretable model can provide deep insight into how a deep learning model can be improved and supports understanding. Therefore, to provide some detail and to explain user tweets, or the reasons that make the decision clear and easy to understand, we aim to develop an interpretable deep learning-based approach for depression detection. Our proposed model utilizes multi-modal features from the diverse behaviour of depressed users and their posts on social media.

3. Problem Statement

To formulate our problem, suppose that we have a set $U$ of labelled users, each a depression or non-depression sample. Let $A = [t_1, t_2, \dots, t_L]$ be a user's posts consisting of $L$ tweets, where $L$ is the total number of tweets per user, and each tweet $t_i = [w_{i1}, w_{i2}, \dots, w_{iN}]$ contains $N$ words, where $N$ is the total number of words per tweet. Let $M$ denote the $m$ features of a user in total, $\{M_i\}_{i=1}^{m}$, and let $\{1, 2, \dots, S\}$ be a finite set of available modalities, so that $M_s$ is the dimension of the $s$-th modality. Given a user's tweets $A$ and the set of related user behaviour features $M$, we learn a depression detection function

$$ f(A, M) \rightarrow \hat{y} \tag{1} $$

such that it maximizes prediction accuracy. We treat depression detection as a binary classification problem, i.e., a user can be depressed ($\hat{y} = 1$) or not depressed ($\hat{y} = 0$).

4. Dataset

Developing and validating the terms used in the vocabulary of users with mental illness is time-consuming, but it helps obtain a reliable list of words by which reliable tweets can be crawled, reducing the number of false positives. Recent research conducted by the authors of [20] is one such work that collected large-scale data with reliable ground truth, which we aim to reuse. We present the statistics of the data in Table 1. The authors collected three complementary data sets:

• Depression data set: Each user is labelled as depressed based on their tweet content between 2009 and 2016. This includes 1,402 depressed users and 292,564 tweets.

• Non-depression data set: Each user is labelled as non-depressed, and the tweets were collected in December 2016. This includes over 300 million active users and 10 billion tweets.

• Depression-candidate data set: Users are labelled as depression candidates, where a tweet was collected if it contained the word "depress". This includes 36,993 depression-candidate users and over 35 million tweets.
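Before turning to data cleaning, the task formulation of Eq. (1) can be made concrete as an interface. The sketch below is illustrative only (not the authors' code): the shapes follow the paper's notation (L tweets per user, N words per tweet, a 76-dimensional behaviour vector), while the padding length N_WORDS, the helper name, and the input ordering passed to `model.predict` are our assumptions.

```python
import numpy as np

# Assumed constants: 200 tweets per user (Section 6.4), 76 behaviour features
# (Section 5.1); N_WORDS = 30 is an assumed padding length, not from the paper.
L_TWEETS, N_WORDS, N_FEATURES = 200, 30, 76

def predict_depression(model, tweet_ids: np.ndarray, behaviour: np.ndarray) -> int:
    """tweet_ids: (L_TWEETS, N_WORDS) padded word-id matrix A for one user.
    behaviour: (N_FEATURES,) multi-modal feature vector M for the same user.
    Returns y_hat in {0, 1}: 1 if the user is predicted as depressed (Eq. 1)."""
    prob = model.predict([behaviour[None, :], tweet_ids[None, :, :]])[0, 0]
    return int(prob >= 0.5)
```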

Table 1
Statistics of the large dataset collected by the authors in [20], which is used in this study.

Dataset         Depressed    Non-Depressed
No. of Users    1,402        300 million
No. of Tweets   292,564      10 billion

Data collection mechanisms are often loosely controlled, producing impossible data combinations (for instance, users labelled as depressed who have provided no posts), missing values, and the like. After the data has been crawled, it is still not ready to be used directly by a machine learning model because of the noise still present; this is what we call the "raw data". The problem is even more pronounced when the data has been downloaded from online social media such as Twitter, because tweets may contain spelling and grammar mistakes, smileys, and other undesirable characters. Therefore, a pre-processing strategy is needed to ensure satisfactory data quality so that a computational model can achieve reliable predictive analysis.

To further clean the data we used the Natural Language ToolKit (NLTK). This package has been widely used for text pre-processing [39] and in various other works, and it has also been widely used for removing common words such as stop words from text [40, 14, 19]. We removed common words (such as "the", "an", etc.) from user tweets, as these are not discriminative or useful enough for our model. Such common words also increase the dimensionality of the problem, which can lead to the "curse of dimensionality" and may affect overall model efficiency. To further improve text quality, we also removed non-ASCII characters, as is widely done in the literature [23].

Pre-processing and removal of noisy content got rid of plenty of noise in the dataset and gave us high-quality, reliable data for this study. Besides, this distillation helped reduce the computational complexity of the model because we only deal with informative data. We present the statistics of this distilled data below:

• Number of users labelled positive: 5,899.
• Number of tweets from positive users: 508,786.
• Number of users labelled negative: 5,160.
• Number of tweets from negative users: 2,299,106.

To further mitigate the issue of sparsity in the data, we excluded users who have posted fewer than ten posts and users who have more than 5,000 followers; we therefore ended up with 2,500 positive users and 2,300 negative users.

5. Explainable Deep Depression Detection

Due to the complexity of user posts and the diversity of user behaviour on social media, we propose a hybrid model based on Hierarchical Attention Networks combined with a Multilayer Perceptron (MLP) to detect depression through social media, as depicted in Figure 2. For each user, the model takes two inputs, one for each attribute. The first input, the four-modality feature vector that represents user behaviour, runs through the MLP, capturing distinct and latent features and correlations across the feature matrix. The second input represents each user's tweets, which are replaced with their embeddings and fed to a Hierarchical Attention Network to learn representation features through hierarchical word-level and tweet-level encoding. The outputs of the two attributes are concatenated into a single feature vector that is fed into a sigmoid activation layer for prediction. In the following sections, we discuss these two architectures.

5.1. Feature Selection

We introduce this attribute type, where the goal is to calculate the attribute value corresponding to each modality for each user. The total dimensionality of all modalities of interest is 76; we mainly consider the four major modalities listed below and ignore two modalities due to missing values. These features are extracted for each user as follows.

5.1.1. Social Information and Interaction
From this attribute, we extract several features embedded in each user profile. These are features related to each user account, as specified by each feature name. Most of the features are directly available in the user data, such as the number of accounts the user follows, friends, favourites, etc. In addition, we extract features describing user behaviour on their profile: for each user, we calculate the total number of tweets, the total length of all tweets, and the number of retweets. We further calculate the posting time distribution for each user by counting how many tweets the user published during each of the 24 hours of the day; hence it is a 24-dimensional integer array. To obtain the posting time distribution, we extract two digits as hour information from each tweet's timestamp, then go through all tweets of each user and track the count of tweets posted in each hour of the day.
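The two steps above (stop-word / non-ASCII cleaning and the 24-dimensional posting-hour histogram) are simple to reproduce. The sketch below is a minimal illustration under our own naming, not the authors' released code; it assumes the NLTK English stop-word corpus has been downloaded.

```python
from collections import Counter
from datetime import datetime

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Drop non-ASCII characters and common English stop words (Section 4)."""
    ascii_text = text.encode("ascii", errors="ignore").decode()
    tokens = [t for t in ascii_text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

def posting_time_distribution(timestamps) -> list:
    """24-dimensional integer array: tweets posted in each hour of the day (Section 5.1.1)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

# Usage with two hypothetical tweets
tweets = [("Feeling tired again today", datetime(2016, 12, 1, 23, 10)),
          ("Could not sleep at all", datetime(2016, 12, 2, 2, 45))]
cleaned = [clean_tweet(text) for text, _ in tweets]
hour_histogram = posting_time_distribution(ts for _, ts in tweets)
```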

Figure 2: Overview of our proposed model MDHAN. We predict a depressed user by fusing two kinds of information: (1) user tweets and (2) user behaviours.

5.1.2. Emoji Sentiment
Emojis allow users to express their emotions through simple icons and non-verbal elements, and they are useful for getting the reader's attention. Emojis give a glimpse of the sentiment of a text or tweet, and it is essential to differentiate between positive and negative sentiment text [41]. User tweets contain a large number of emojis, which can be classified into positive, negative and neutral. For each positive, neutral, and negative type, we count its frequency in each tweet and then sum the counts over each user's tweets, so the final output is three values per user, corresponding to positive, neutral and negative emojis. We also consider Valence-Arousal-Dominance (VAD) features, which contain valence, arousal and dominance scores; for this, we also count first-person singular and first-person plural pronouns. Using affective norms for English words, VAD scores for 1,030 words are obtained. We create a dictionary with each word as a key and a tuple of its (valence, arousal, dominance) scores as the value. Next, we parse each tweet and calculate a VAD score for it using this dictionary. Finally, for each user, we add up the VAD scores of that user's tweets to obtain the user's VAD score.

5.1.3. Topic Distribution
Topic modelling belongs to the class of statistical modelling frameworks that help discover the abstract topics in a collection of text documents. It gives us a way of organizing, understanding and summarizing collections of textual information and helps find hidden topical patterns, where the number of topics is specified by the user a priori. It can be defined as a method of finding groups of words (i.e., topics) from a collection of documents that best represent the latent topical information in the collection. In our work, we applied unsupervised Latent Dirichlet Allocation (LDA) [42] to extract the latent topic distribution from user tweets. To calculate topic-level features, we first consider the corpus of all tweets of all depressed users. Next, we split each tweet into a list of words, assemble all words in decreasing order of their frequency of occurrence, and remove common English words (stop words) from the list. Finally, we apply LDA to extract the latent topic distribution over K = 25 topics, where K is the number of topics; we found experimentally that K = 25 is a suitable value. While there are tuning strategies, including strategies based on Bayesian non-parametrics [43], we have opted for a simple, popular, and computationally efficient approach that gives us the desired results.

5.1.4. Symptom Features
1. Depression symptom counts: This is the count of depression symptoms occurring in tweets, as specified in the nine groups of the DSM-IV criteria for a depression diagnosis; the symptoms are listed in Appendix A. We count how many times the nine depression symptoms are mentioned by the user in their tweets. The symptoms are specified as a list of nine categories, each containing various synonyms for the particular symptom. We created a set of seed keywords for all nine categories and, with the help of the pre-trained word embedding, extracted similar terms to extend the list of keywords for each depression symptom. We then scan through all tweets, counting how many times a particular symptom is mentioned in each tweet.
2. Antidepressant counts: We also focus on antidepressants. We created a lexicon of antidepressants from the "Antidepressant" Wikipedia page, which contains an exhaustive, regularly updated list of items, and we counted the number of antidepressant names mentioned. The medicine names are listed in Appendix B.
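As an illustration of the topic feature in Section 5.1.3, the sketch below fits LDA with K = 25 and returns a per-user topic mixture. The paper only specifies LDA [42] with K = 25; the use of gensim, the tokenization, and the per-user aggregation by concatenating tweets are our assumptions.

```python
from gensim import corpora, models

def fit_lda(all_tweets, num_topics=25):
    """Fit LDA on the tweet corpus (the paper fits it on depressed users' tweets)."""
    tokenized = [t.lower().split() for t in all_tweets]
    dictionary = corpora.Dictionary(tokenized)
    bow = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def user_topic_features(lda, dictionary, user_tweets, num_topics=25):
    """Return a 25-dimensional topic mixture for one user, aggregated over their tweets."""
    words = [w for tweet in user_tweets for w in tweet.lower().split()]
    dense = [0.0] * num_topics
    for topic_id, weight in lda.get_document_topics(dictionary.doc2bow(words),
                                                    minimum_probability=0.0):
        dense[topic_id] = float(weight)
    return dense
```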

Figure 3: An illustration of the hierarchical attention network used to encode user tweets.

5.2. User Tweets Encoder Using RNN

Recently, researchers have found that the Hierarchical Attention Network (HAN) [44] is able to identify the most important words and sentences in a document while taking its context into account. A depressed user often has posts with a distinctive linguistic style, including depressive language use and mentions of antidepressants and symptoms, which can help detect depression. Additionally, a social media post contains linguistic cues at two levels, the word level and the tweet level, and not every word in a tweet or every tweet of a user is equally important for understanding whether the user is depressed. For example, in "My dad doesn't even seem to believe I'm really hurt!", the word "hurt" contributes more signal for deciding whether the tweet indicates depression than the other words in the tweet. In this way, HAN performs better in predicting the class of a given user's tweets. Inspired by [44], we propose a hierarchical attention network to learn user tweet representations, as depicted in Figure 3. We consider a user $U$ who has made $M$ tweets $T = [t_1, t_2, \dots, t_M]$, where each tweet $t_i = [w_{i1}, w_{i2}, \dots, w_{iN}]$ contains $N_i$ words. Each tweet is represented by the sequence of $d$-dimensional embeddings of its words, and each word in the input layer is represented by a fixed-size vector from pre-trained word embeddings.

5.2.1. Word Encoder
A bidirectional Gated Recurrent Unit (BiGRU) is first used as the word-level encoder to capture contextual information. The GRU is a recurrent neural network (RNN) that can capture sequential information and long-term dependencies in sentences. Only two gate functions are used: the reset gate and the update gate. The update gate controls the degree to which the previous state's information is carried into the current state: the higher the update gate value, the more of the previous state is carried forward. The reset gate controls the degree to which the previous state's information is ignored: the smaller the reset gate value, the more the context is neglected. Both the preceding and the following words influence the current word in sequential textual data, so we use a BiGRU to extract contextual features. The BiGRU consists of a forward GRU and a backward GRU that process the sequence in the forward and backward directions, respectively. The annotation $w_{ij}$ represents word $j$ in tweet $i$, which contains $N$ words. Each word of a user post (tweet) is converted to a word embedding $x_{ij}$ using GloVe [45]:

$$ \overrightarrow{h^{w}_{ij}} = \overrightarrow{\mathrm{GRU}}\left(x_{ij}, \overrightarrow{h^{w}_{i(j-1)}}\right), \quad j \in \{1, \dots, N\} \tag{2} $$

$$ \overleftarrow{h^{w}_{ij}} = \overleftarrow{\mathrm{GRU}}\left(x_{ij}, \overleftarrow{h^{w}_{i(j+1)}}\right), \quad j \in \{N, \dots, 1\} \tag{3} $$

The combination of the hidden states obtained from the forward GRU and the backward GRU, $\overrightarrow{h^{w}_{ij}}$ and $\overleftarrow{h^{w}_{ij}}$, is represented as $h^{w}_{ij} = \overrightarrow{h^{w}_{ij}} \oplus \overleftarrow{h^{w}_{ij}}$, which carries the complete tweet information centred around $x_{ij}$.

Next, we describe the attention mechanism. It is crucial to introduce a context vector $u_w$ shared by all words, which is trainable and expected to capture global word importance. The annotations $h^{w}_{ij}$ form the basis of attention, which starts with another hidden layer whose weights and biases are randomly initialized and learned during training. The annotation $u_{ij}$ is computed as:

$$ u_{ij} = \tanh(W_w h^{w}_{ij} + b_w) \tag{4} $$

The product $u_{ij}^{\top} u_w$ (with $u_w$ randomly initialized) is expected to signal the importance of the $j$-th word and is normalized to an importance weight per word, $\alpha_{ij}$, by a softmax function:

$$ \alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_w)}{\sum_{j} \exp(u_{ij}^{\top} u_w)} \tag{5} $$

Finally, a weighted sum of the word annotations gives the tweet vector $v_i$, where $\alpha_{ij}$ is the importance weight per word:

$$ v_i = \sum_{j} \alpha_{ij} h^{w}_{ij} \tag{6} $$

5.2.2. Tweet Encoder
In order to learn the tweet representations $h^{t}_{i}$ from the learned tweet vectors $v_i$, we capture context at the tweet level. Similar to the word encoder component, the tweet encoder employs the same BiGRU architecture; the combination of the hidden states obtained from the forward GRU and the backward GRU, $\overrightarrow{h^{t}_{i}}$ and $\overleftarrow{h^{t}_{i}}$, is represented as $h^{t}_{i} = \overrightarrow{h^{t}_{i}} \oplus \overleftarrow{h^{t}_{i}}$, which captures the coherence of a tweet with respect to its neighbouring tweets in both directions. Next, we capture the relevant tweets using a tweet-level attention layer. The product $u_{i}^{\top} u_s$ is expected to signal the importance of the $i$-th tweet and is normalized to an importance weight per tweet, $\alpha_i$. Finally, $s$ is a vector that summarizes all the tweet information in a user's posts:

$$ s = \sum_{i} \alpha_{i} h^{t}_{i} \tag{7} $$
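A minimal TensorFlow 2 / Keras sketch of this hierarchical encoder is shown below. It is illustrative rather than the authors' released implementation: the embedding and hidden dimensions (100) and the 200-tweet limit follow Sections 6.2 and 6.4, while the vocabulary size, the 30-word padding length, and the use of a single-unit dense layer to play the role of the context vectors $u_w$ / $u_s$ are our assumptions. Pre-trained GloVe weights would be loaded into the frozen embedding layer, which is omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_TWEETS, MAX_WORDS, VOCAB, EMB_DIM, HID = 200, 30, 20000, 100, 100  # VOCAB/MAX_WORDS assumed

class AttentionPool(layers.Layer):
    """Additive attention pooling: u = tanh(Wh + b), weights = softmax(u . u_ctx); Eqs. (4)-(6)."""
    def __init__(self, units):
        super().__init__()
        self.proj = layers.Dense(units, activation="tanh")
        self.context = layers.Dense(1, use_bias=False)   # plays the role of u_w / u_s

    def call(self, h):                                    # h: (batch, steps, 2*HID)
        scores = self.context(self.proj(h))               # (batch, steps, 1)
        weights = tf.nn.softmax(scores, axis=1)
        return tf.reduce_sum(weights * h, axis=1)         # weighted sum over steps

# Word-level encoder applied to a single tweet.
word_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB_DIM, trainable=False)(word_in)   # frozen GloVe (Section 6.2)
word_h = layers.Bidirectional(layers.GRU(HID, return_sequences=True))(emb)
tweet_vec = AttentionPool(HID)(word_h)                    # v_i, Eq. (6)
word_encoder = tf.keras.Model(word_in, tweet_vec)

# Tweet-level encoder applied to the user's timeline of tweets.
tweets_in = layers.Input(shape=(MAX_TWEETS, MAX_WORDS), dtype="int32")
tweet_vecs = layers.TimeDistributed(word_encoder)(tweets_in)
tweet_h = layers.Bidirectional(layers.GRU(HID, return_sequences=True))(tweet_vecs)
user_vec = AttentionPool(HID)(tweet_h)                    # s, Eq. (7)
han_encoder = tf.keras.Model(tweets_in, user_vec)
```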

5.3. Multi-modal Encoder

Suppose the input representing a user's behaviour is $[m_1, m_2, \dots, m_M]$, where $M$ is the total number of features and $M_s$ is the dimension of the $s$-th modality. To obtain fine-grained information from the user behaviour features, the multi-modal features are fed through a one-layer MLP to obtain a hidden representation $p_i$:

$$ p_i = f\left(b_i + \sum_{i=1}^{M} W_i m_i\right) \tag{8} $$

where $f$ is a nonlinear function; the outcome of behaviour modelling, $p_i$, is a high-level representation that captures behavioural semantic information and plays a critical role in depression diagnosis.

5.4. Classification Layer

At the classification layer, we need to predict whether the user is depressed or not. So far, we have introduced how we encode a user's multi-modal behaviours ($p$) and how we encode user tweets by modelling the hierarchical structure at the word level and tweet level ($s$). From the two components we construct the feature matrices of user behaviour features and user tweets:

$$ p = \{p_1, p_2, \dots, p_M\} \in \mathbb{R}^{d \times M} \tag{9} $$

$$ s = \{s_1, s_2, \dots, s_n\} \in \mathbb{R}^{2d \times n} \tag{10} $$

We then integrate these components together as $[p, s]$. The output of such a network is fed to a sigmoid layer for classification:

$$ \hat{y} = \mathrm{Sigmoid}(b_f + [p, s] W_f) \tag{11} $$

where $\hat{y}$ is the predicted probability vector, with $\hat{y}_0$ and $\hat{y}_1$ indicating the predicted probability of the label being 0 (not depressed) and 1 (depressed), respectively. We then minimize the cross-entropy error for each user with ground-truth label $y$:

$$ \mathrm{Loss} = -\sum_i y_i \cdot \log \hat{y}_i $$

where $\hat{y}_i$ is the predicted probability and $y_i$ is the ground-truth label (depression or non-depression) of the user.

6. Experiments and Results

In this section, we present the experimental evaluation to validate the performance of MDHAN. We first introduce the datasets, evaluation metrics and experimental setting, followed by the experimental results. We compare our model with the following classification methods:

• MDL: The Multimodal Dictionary Learning model detects depressed users on Twitter [20]; it uses dictionary learning to extract latent data features and a sparse representation of each user.
• SVM: Support Vector Machine is a popular and strong classifier that has been applied to a wide range of classification tasks [46], and it remains a strong baseline.
• NB: Naive Bayes is a family of probabilistic algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between instances [47].
• BiGRU: We applied a Bidirectional Gated Recurrent Unit [48] with an attention mechanism to obtain user tweet representations, which we then used for user tweet classification.
• MBiGRU: A hybrid model based on an MLP for the multi-modal user-behaviour features and a BiGRU for the user's online timeline (posts).
• CNN: We utilized Convolutional Neural Networks [49] with an attention mechanism to model user tweets, which can capture the semantics of different convolutional window sizes for depression detection.
• MCNN: A hybrid model based on an MLP for the multi-modal user-behaviour features and a CNN for the user's online timeline (posts).
• HAN: A hierarchical attention network framework [44] applied to user posts for depression detection. The network encodes user posts with word-level attention on each tweet and tweet-level attention on the user's posts.
• MDHAN: The model proposed in this paper.

6.1. Datasets and Evaluation Metrics

For our experiments, we used the dataset described in Section 4. It provides large-scale data, especially for the labelled negative and candidate positive sets; in our experiments, we used the labelled data. We pre-process the dataset by excluding users whose posting history comprises fewer than ten posts, users with more than 5,000 followers, and users who tweeted in languages other than English, so that we have sufficient statistical information associated with every user. We thus considered 4,208 users (51.30% depressed and 48.69% non-depressed), as shown in Table 2. For evaluation purposes, we split the dataset into training (80%) and test (20%) sets.

Table 2
Summary of the labelled data used to train the MDHAN model.

Description        Depressed    Non-Depressed
Number of users    2,159        2,049
Number of tweets   447,856      1,349,447

Additionally, we employ traditional information retrieval metrics, namely precision, recall, F1, and accuracy, based on the confusion matrix to evaluate our model. A confusion matrix is a matrix used for evaluating classification performance, also called an error matrix, because it shows the number of wrong predictions versus the number of right predictions in a tabulated manner.
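Following the two-branch design of Figure 2, the behaviour vector and the HAN user vector can be fused and classified as in Eqs. (8)-(11). The sketch below is an assumption-laden illustration, not the authors' code: it reuses the `han_encoder` from the earlier sketch, and the ReLU activation in the one-layer MLP and the exact input ordering are our choices; the optimizer, learning rate, batch size and epoch count follow Section 6.2.

```python
# Fusion and classification head (Eqs. 8-11); assumes `han_encoder` from the previous sketch.
from tensorflow.keras import layers, Model, optimizers

N_MODAL_FEATURES = 76                                     # Section 5.1: 76-dimensional behaviour vector

modal_in = layers.Input(shape=(N_MODAL_FEATURES,), name="user_behaviour")
p = layers.Dense(100, activation="relu")(modal_in)        # one-layer MLP, Eq. (8); ReLU is assumed

tweets_in = layers.Input(shape=(200, 30), dtype="int32", name="user_tweets")
s = han_encoder(tweets_in)                                # user vector s, Eq. (7)

fused = layers.Concatenate()([p, s])                      # [p, s]
y_hat = layers.Dense(1, activation="sigmoid")(fused)      # Eq. (11)

mdhan = Model(inputs=[modal_in, tweets_in], outputs=y_hat)
mdhan.compile(optimizer=optimizers.Adam(learning_rate=1e-3),   # Section 6.2 settings
              loss="binary_crossentropy",
              metrics=["accuracy"])
# mdhan.fit([behaviour_matrix, tweet_ids], labels, batch_size=16, epochs=10)
```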

Table 3
Performance comparison of MDHAN against the baselines for depression detection on the dataset of [20].

Metric      SVM    NB     MDL    BiGRU  MBiGRU  CNN    MCNN   HAN    MDHAN
Accuracy    0.644  0.636  0.787  0.764  0.786   0.806  0.871  0.844  0.895
Precision   0.724  0.724  0.790  0.766  0.789   0.817  0.874  0.870  0.902
Recall      0.632  0.623  0.786  0.762  0.787   0.804  0.870  0.840  0.892
F1-score    0.602  0.588  0.786  0.763  0.786   0.803  0.870  0.839  0.893

6.2. Experimental Setting

For the parameter configuration, the word embeddings of all models, including the baselines, are initialized with GloVe [45] embeddings of dimension 100 on the training set of each dataset. The hidden dimension is set to 100 in our model and in the other neural models, and the dropout rate is set to 0.5. All models are trained with the Adam optimization algorithm [50] with a batch size of 16 and an initial learning rate of 0.001. We trained our model for 10 iterations with a batch size of 16; this number of iterations was sufficient for the model to converge, and our experimental results, in which we outperform existing strong baseline methods, further cement this claim. We used Python 3.6.3 and TensorFlow 2.1.0 for our implementation. We set the embedding layer to be non-trainable so that the feature representations, e.g., word vectors and topic vectors, are kept in their original form. We used one hidden layer and a max-pooling layer of size 4, which gave better performance in our setting.

6.3. Experimental Results

In our experiments, we study our model's attributes, including the quantitative performance of our hybrid model. For the multi-modal features and the user's timeline semantic features, we use both attributes jointly. After grouping user behaviour on social media into the multi-modal attribute, we evaluate the performance of the model. First, we examine the effectiveness of using the multi-modal features only for depression detection with different classifiers. Second, we show how the model's performance increases when we combine the multi-modal features with the hierarchical attention network in MDHAN. We summarise the results in Table 3 as follows:

• Naive Bayes obtains the lowest F1 score, which demonstrates that this model is less capable of classifying tweets for depression detection than the other models. The reason for its poor performance could be that the model is not robust enough to sparse and noisy data.
• MDL outperforms SVM, NB and BiGRU and obtains better accuracy than these three methods. Since this is a recent model specifically designed to discover depressed users, it has captured the intricacies of the dataset well and learned its parameters faithfully, leading to better results.
• We observe an improvement when the multi-modal features are integrated with user posts, which helps better characterize a user who seems to be depressed, as shown by the performance of MBiGRU, MCNN and MDHAN.
• Our proposed model MDHAN improves depression detection by up to 10% in F1-score compared to the MDL model and 5% compared to the HAN model. This suggests that our model outperforms strong models. The reason our model performs well is primarily that it leverages a rich set of features which are jointly learned in a consolidated parameter estimation, resulting in a robust model.
• Furthermore, MDHAN achieves the best performance, with 89% F1, indicating that combining HAN with the multi-modal strategy for user timeline semantic features is sufficient to detect depression on Twitter. We can also deduce from the table that our model consistently outperforms all existing and strong baselines.

6.4. Comparison and Discussion

To get a better look at our model's performance, we compared the effectiveness of each of the two attributes of our model. We test the performance of the model with different attributes: we build the model to feed it each attribute separately and compare how the model performs. First, we test the model using only the multi-modality attribute; we can observe in Figure 4 that the model performs less optimally when we use only the MLP on multi-modal features. In contrast, the model performs better when we use only HAN with the word embedding attribute. This signifies that extracting semantic information features from user tweets is crucial for depression detection. Thus, MDHAN's performance increases when both MM and HAN are combined, outperforming each attribute used independently. One of the key parameters in MDHAN is the number of tweets per user; we observed that MDHAN reached optimal performance when using 200 tweets as the maximum number of tweets. Figure 5 illustrates the performance of our model with respect to the number of tweets.
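For reference, the metrics reported in Table 3 can be derived from predictions on the held-out 20% test split as sketched below. The use of scikit-learn and of macro averaging are our assumptions; the paper does not state which library or averaging it used.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1 and the confusion matrix (Section 6.1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),  # averaging assumed
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "confusion": confusion_matrix(y_true, y_pred),
    }
```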

Figure 4: Effectiveness comparison of MDHAN with different attributes.
Figure 5: Model performance vs. number of tweets.

Figure 6: Modality effectiveness (removing one modality at a time).
Figure 7: Modality effectiveness (each modality combined with HAN).

Figure 8: Explainability via visualization of attention scores in MDHAN.

To further analyze the role played by each modality and the contribution of the user behaviour attribute and the user posts attribute, we removed the four modalities separately: the domain-specific feature (denoted MDHAN - D), the emotion feature (MDHAN - E), the social network feature (MDHAN - S), and the topic feature (MDHAN - T). We can see in Figure 6 that our model's performance deteriorates when we remove the topic feature from MDHAN, and degrades even more without the social network features. To dive deeper and understand the effectiveness of each modality, we also combine each modality separately with HAN, denoted respectively as D+HAN, E+HAN, S+HAN and T+HAN. As shown in Figure 7, MDHAN with all four modalities outperforms the others, which means that every modality contributes to depression detection.

6.5. Case Study

In order to illustrate the importance of MDHAN for explaining depression detection results, we use an example of a depressed user to show the tweets captured by MDHAN in Figure 8. The figure demonstrates that the attention map gives higher weights to explainable depression tweets; for instance, the tweet "One in four experience mental illness ..." gained the highest attention score among all the user's tweets. Moreover, MDHAN gives higher weights to explainable tweets than to interfering and unrelated tweets, which helps select more related tweets as more important features for detecting a depressed user.

To further investigate the five most influential symptoms among depressed users, we collected all the tweets associated with these symptoms. We then created a tag cloud [51] for each of these five symptoms to determine the frequent words and their importance for each symptom, as shown in Figure 9, where larger-font words are relatively more important than the rest within the same cloud. This cloud gives us an overview of the words that occur most frequently within each of these five symptoms.
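The kind of evidence visualized in Figure 8 can be obtained by reading out the tweet-level attention weights. The sketch below assumes the `AttentionPool` layer from the earlier encoder sketch (our construction, not the authors' code) and simply re-runs its internals on the tweet-level hidden states to recover one weight per tweet.

```python
import tensorflow as tf

def tweet_attention_weights(attention_layer, tweet_hidden_states):
    """tweet_hidden_states: (1, MAX_TWEETS, 2*HID) output of the tweet-level BiGRU."""
    scores = attention_layer.context(attention_layer.proj(tweet_hidden_states))
    weights = tf.nn.softmax(scores, axis=1)            # (1, MAX_TWEETS, 1)
    return tf.squeeze(weights, axis=[0, 2]).numpy()    # one attention weight per tweet

# Ranking a user's tweets by attention weight yields the explanation shown in Figure 8:
# ranked = sorted(zip(weights, raw_tweets), reverse=True)[:5]
```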

(a) A word cloud for Suicidal ideation. (b) A word cloud for Worthlessness. (c) A word cloud for Loss of Energy. (d) A word cloud for Insomnia. (e) A word cloud for Depressed mood.
Figure 9: Word clouds depicting the most influential symptoms.

7. Conclusion

In this paper, we propose a new model for detecting depressed users through social media analysis by extracting features from user behaviour and the user's online timeline (posts). We used a real-world data set of depressed and non-depressed users and applied it in our model. We proposed a hybrid model, the Multi-Modal Depression Detection Hierarchical Attention Network (MDHAN), characterised by the interplay between its two components: the multi-modality attribute, which represents user behaviour, is fed into the MLP, and the user's timeline posts are fed into the HAN to calculate the importance of each tweet and word and to capture semantic sequence features from the user timelines (posts). Our results show that training this hybrid network improves classification performance, identifies depressed users better than other strong methods, and ensures adequate evidence to explain the prediction.

References

[1] American Psychiatric Association et al. Diagnostic and statistical manual of mental disorders (DSM-5®). American Psychiatric Pub, 2013.
[2] Jonathan M Metzl and Kenneth T MacLeish. Mental illness, mass shootings, and the politics of American firearms. American Journal of Public Health, 105(2):240–249, 2015.
[3] Aqsa Zafar and Sanjay Chitnis. Survey of depression detection using social networking sites via data mining. In 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), pages 88–93. IEEE, 2020.
[4] World Health Organization (WHO). Depression. http://www.who.int/mediacentre/factsheets/fs369/en/, 2015.
[5] Esteban Andrés Ríssola, Mohammad Aliannejadi, and Fabio Crestani. Beyond modelling: Understanding mental disorders in online social media. In European Conference on Information Retrieval, pages 296–310. Springer, 2020.
[6] Maria Li Zou, Mandy Xiaoyang Li, and Vincent Cho. Depression and disclosure behavior via social media: A study of university students in China. Heliyon, 6(2):e03368, 2020.
[7] Carleen Hawn. Take two aspirin and tweet me in the morning: how Twitter, Facebook, and other social media are reshaping health care. Health Affairs, 28(2):361–368, 2009.
[8] Linda Neuhauser and Gary L Kreps. Rethinking communication in the e-health era. Journal of Health Psychology, 8(1):7–23, 2003.
[9] Daniel Scanfeld, Vanessa Scanfeld, and Elaine L Larson. Dissemination of health information through social networks: Twitter and antibiotics. American Journal of Infection Control, 38(3):182–188, 2010.
[10] Kyle W Prier, Matthew S Smith, Christophe Giraud-Carrier, and Carl L Hanson. Identifying health-related topics on Twitter. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 18–25. Springer, 2011.
[11] Minsu Park, Chiyoung Cha, and Meeyoung Cha. Depressive moods of users portrayed in Twitter. In Proceedings of the ACM SIGKDD Workshop on Healthcare Informatics (HI-KDD), volume 2012, pages 1–8, 2012.
[12] Krishna C Bathina, Marijn ten Thij, Lorenzo Lorenzo-Luaces, Lauren A Rutter, and Johan Bollen. Depressed individuals express more distorted thinking on social media. arXiv preprint arXiv:2002.02800, 2020.
[13] Hamad Zogan, Imran Razzak, Shoaib Jameel, and Guandong Xu. DepressionNet: Learning multi-modalities with user post summarization for depression detection on social media. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.
[14] Akshi Kumar, Aditi Sharma, and Anshika Arora. Anxious depression prediction in real-time social data. Available at SSRN 3383359, 2019.
[15] Rémi Lebret and Ronan Collobert. Rehabilitation of count-based models for word vector representations. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 417–429. Springer, 2015.
[16] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. Predicting depression via social media. In Seventh International AAAI Conference on Weblogs and Social Media, 2013.
[17] Quan Hu, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. Predicting depression of social media user on different observation windows. In 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), volume 1, pages 361–364. IEEE, 2015.
[18] Sho Tsugawa, Yusuke Kikuchi, Fumio Kishino, Kosuke Nakajima, Yuichi Itoh, and Hiroyuki Ohsaki. Recognizing depression from Twitter activity. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3187–3196. ACM, 2015.
[19] Philip Resnik, William Armstrong, Leonardo Claudino, Thang Nguyen, Viet-An Nguyen, and Jordan Boyd-Graber. Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 99–107, Denver, Colorado, June 2015. Association for Computational Linguistics.
[20] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui Hu, Tat-Seng Chua, and Wenwu Zhu. Depression detection via harvesting social media: A multimodal dictionary learning solution. In IJCAI, pages 3838–3844, 2017.
[21] Tiancheng Shen, Jia Jia, Guangyao Shen, Fuli Feng, Xiangnan He, Huanbo Luan, Jie Tang, Thanassis Tiropanis, Tat Seng Chua, and Wendy Hall. Cross-domain depression detection via harvesting social media. In Proceedings of the 27th International Joint Conference on
Artificial Intelligence, IJCAI 2018, volume 2018-July, pages 1611– [39] Krystian Horecki and Jacek Mazurkiewicz. Natural language pro- 1617. International Joint Conferences on Artificial Intelligence, July cessing methods used for automatic prediction mechanism of related 2018. phenomenon. Lecture Notes in Artificial Intelligence (Subseries of [22] Diana Ramírez-Cifuentes, Marc Mayans, and Ana Freire. Early risk Lecture Notes in Computer Science), 9120:13–24, 06 2015. detection of anorexia on social media. In International Conference [40] M. Deshpande and V. Rao. Depression detection using emotion ar- on Internet Science, pages 3–14. Springer, 2018. tificial intelligence. In 2017 International Conference on Intelligent [23] Amir Hossein Yazdavar, Hussein S Al-Olimat, Monireh Ebrahimi, Sustainable Systems (ICISS), pages 858–862, Dec 2017. Goonmeet Bajaj, Tanvi Banerjee, Krishnaprasad Thirunarayan, Jy- [41] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor otishman Pathak, and Amit Sheth. Semi-supervised approach to mon- Mozetič. Sentiment of emojis. PloS one, 10(12):e0144296, 2015. itoring clinical depressive symptoms in social media. In Proceedings [42] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet of the 2017 IEEE/ACM International Conference on Advances in So- allocation. Journal of machine Learning research, 3(Jan):993–1022, cial Networks Analysis and Mining 2017, pages 1191–1198. ACM, 2003. 2017. [43] Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. [24] Tao Wang, Markus Brede, Antonella Ianni, and Emmanouil Mentza- Sharing clusters among related groups: Hierarchical dirichlet pro- kis. Detecting and characterizing eating-disorder communities on so- cesses. In Advances in neural information processing systems, pages cial media. In Proceedings of the Tenth ACM International Confer- 1385–1392, 2005. ence on Web Search and Data Mining, pages 91–100. ACM, 2017. [44] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and [25] Akkapon Wongkoblap, Miguel A Vadillo, and Vasa Curcin. Model- Eduard Hovy. Hierarchical attention networks for document classifi- ing depression symptoms from social network data through multiple cation. In Proceedings of the 2016 conference of the North American instance learning. AMIA Summits on Translational Science Proceed- chapter of the association for computational linguistics: human lan- ings, 2019:44, 2019. guage technologies, pages 1480–1489, 2016. [26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [45] Jeffrey Pennington, Richard Socher, and Christopher D Manning. BERT: Pre-training of deep bidirectional transformers for language Glove: Global vectors for word representation. In Proceedings of understanding. arXiv preprint arXiv:1810.04805, 2018. the 2014 conference on empirical methods in natural language pro- [27] Chun Yueh Chiu, Hsien Yuan Lane, Jia Ling Koh, and Arbee LP cessing (EMNLP), pages 1532–1543, 2014. Chen. Multimodal depression detection on instagram considering [46] Christian Karmen, Robert C Hsiung, and Thomas Wetter. Screening time interval of posts. Journal of Intelligent Information Systems, internet forum participants for depression symptoms by assembling pages 1–23, 2020. and enhancing multiple nlp methods. Computer methods and pro- [28] Chenhao Lin, Pengwei Hu, Hui Su, Shaochun Li, Jing Mei, Jie Zhou, grams in biomedicine, 120(1):27–36, 2015. and Henry Leung. Sensemood: Depression detection on social media. [47] Andrew Y Ng and Michael I Jordan. 
On discriminative vs. generative In Proceedings of the 2020 International Conference on Multimedia classifiers: A comparison of logistic regression and naive bayes. In Retrieval, pages 407–411, 2020. Advances in neural information processing systems, pages 841–848, [29] Wenbo Zheng, Lan Yan, Chao Gou, and Fei-Yue Wang. Graph at- 2002. tention model embedded with multi-modal knowledge for depression [48] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry detection. In 2020 IEEE International Conference on Multimedia and Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Expo (ICME), pages 1–6. IEEE, 2020. Learning phrase representations using RNN encoder–decoder for sta- [30] Danai Xezonaki, Georgios Paraskevopoulos, Alexandros Potamianos, tistical machine translation. In Proceedings of the 2014 Conference and Shrikanth Narayanan. Affective conditioning on hierarchical net- on Empirical Methods in Natural Language Processing (EMNLP), works applied to depression detection from transcribed clinical inter- pages 1724–1734, Doha, Qatar, October 2014. Association for Com- views. arXiv preprint arXiv:2006.08336, 2020. putational Linguistics. [31] JT Wolohan. Estimating the effect of covid-19 on mental health: Lin- [49] Yoon Kim. Convolutional neural networks for sentence classification. guistic indicators of depression during a global pandemic. In Pro- In Proceedings of the 2014 Conference on Empirical Methods in Nat- ceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, ural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, 2020. October 2014. Association for Computational Linguistics. [32] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian [50] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic Puhrsch, and Armand Joulin. Advances in pre-training distributed optimization. CoRR, abs/1412.6980, 2015. word representations. arXiv preprint arXiv:1712.09405, 2017. [51] Fernanda B Viégas and Martin Wattenberg. Timelines tag clouds and [33] Anu Shrestha, Edoardo Serra, and Francesca Spezzano. Multi-modal the case for vernacular visualization. interactions, 15(4):49–52, 2008. social and psycho-linguistic embedding via recurrent neural networks to identify depressed users in online forums. NetMAHIB, 9(1):22, 2020. 8. Appendix A [34] Marcel Trotzek, Sven Koitka, and Christoph M Friedrich. Utilizing neural networks and linguistic metadata for early detection of depres- List of depression symptoms as per DSM-IV: sion indications in text sequences. IEEE Transactions on Knowledge and Data Engineering, 2018. 1. Depressed mood. [35] Nawshad Farruque, Osmar Zaiane, and Randy Goebel. Augmenting 2. iminished interest. semantic representation of depressive language: From forums to mi- 3. Weight or appetite change croblogs. In Joint European Conference on Machine Learning and 4. Insomnia, hypersomnia. Knowledge Discovery in Databases, pages 359–375. Springer, 2019. [36] Manaal Faruqui, Jesse Dodge, Sujay K Jauhar, Chris Dyer, Eduard 5. Psychomotor retardation, psychomotor impairment. Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexi- 6. Fatigue or loss of energy cons. arXiv preprint arXiv:1411.4166, 2014. 7. Feelings worthlessness, guilt. [37] Yuan Gong and Christian Poellabauer. Topic modeling based multi- 8. Diminished ability to think, indecisiveness. Proceedings of the 7th Annual Work- modal depression detection. In 9. Suicidal tendency. shop on Audio/Visual Emotion Challenge, pages 69–76, 2017. 
[38] Zhichao Peng, Qinghua Hu, and Jianwu Dang. Multi-kernel svm based depression recognition using social media data. International 9. Appendix B Journal of Machine Learning and Cybernetics, 10(1):43–57, 2019. List of antidepressant medicine names xxxxxxxx et al.: Preprint submitted to Elsevier Page 12 of 12 Explainable Depression Detection (MDHAN)

Citalopram Celexa Cipramil Lexapro Cipralex Prozac Sarafem Luvox Faverin Paxil Seroxat Zoloft Lustral Pristiq Cymbalta Levomilnac. Fetzima Ixel Savella Effexor Viibryd Trintellix Dutonin Nefadar Serzone Desyrel Strattera Edronax Lucelan Metatone Vivalan Wellbutrin Elavil Endep Trifluoperazine Amioxid Ambivalon Equilibrin Anafranil Norpramin Pertofrane Noveril Victoril Istonil Prothiaden Adapin Sinequan Tofranil Lomont Gamanil Dixeran Melixeran Trausabun Sintamil Pamelor Aventyl Agedal Elronon Nogedal Insidon Azafen Azaphen Vivactil Surmontil Asendin Ludiomil Tolvon Remeron Tecipul Mianserin mirtazapine setiptiline Marplan Nardil Tranylcyp. Parnate Eldepryl Zelapar Emsam Surodil Timostenil Inkazan Aurorix Manerix Pirazidol Humoryl Befol Brantur Cantor Alnert Celeport Valdoxan Spravato Ketalar Sediel Stablon Coaxil Elen Noin Clédial Gerdaxyl Oxaflozane Conflictan Tonerg Ademetionine Aurorix SAMe Heptral Transmetil Samyl Hypericum per. St. John’s Wort SJW Jarsin Kira Movina Oxitriptan Kira 5-HTP Cincofarm Levothym Triptum Rubidium chl. Rubinorm Tryptan Optimax Aminomine Noveril Solian Abilify Rexulti Latuda Zyprexa Seroquel Risperdal Buspar Eskalith Lithobid Modafinil Thyroxine Triiodoth. Amitriptyline chlordiaz. Limbitrol Parmodalin Aurorix Etafron melitracen Deanxit Surodil
