NAACL HLT 2016
Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality
Proceedings of the Third Workshop
June 16, 2016
San Diego, California, USA
© 2016 The Association for Computational Linguistics
Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality ISBN 978-1-941643-54-9
Introduction
In the United States, mental and neurological health problems are among the costliest challenges we face. Depression, Alzheimer’s disease, bipolar disorder, and attention deficit hyperactivity disorder (ADHD) are only a handful of the many illnesses that contribute to this cost. The global cost of mental health conditions alone was estimated at $2.5 trillion in 2010, with a projected increase to over $6 trillion in 2030. Neurological illnesses and mental disorders cost the U.S. more than $760 billion a year.

The World Health Organization (WHO) estimates one out of four people worldwide will suffer from a mental illness at some point in their lives, while one in five Americans experience a mental health problem in any given year. Mental, neurological, and substance use disorders are the leading cause of disability worldwide, yet most public service announcements and government education programs remain focused on physical health issues such as cancer and obesity. Despite the substantial and rising burden of such disorders, there is a significant shortage of resources available to prevent, diagnose, and treat them; thus technology must be brought to bear.

For clinical psychologists, language plays a central role in diagnosis, and many clinical instruments fundamentally rely on manual coding of patient language. Applying language technology in the domain of mental and neurological health could lead to inexpensive screening measures that may be administered by a wider array of healthcare professionals. Researchers had begun targeting such issues prior to this workshop series, using language technology to identify emotion in suicide notes, analyze the language of those with autistic spectrum disorders, and aid the diagnosis of dementia. The series of Computational Linguistics and Clinical Psychology (CLPsych) workshops began at ACL 2014, while NAACL 2015 hosted the second such workshop with a near-doubling in attendance. 
The 2015 workshop additionally hosted a Shared Task for detecting depression and post-traumatic stress disorder (PTSD) based on social media posts. The CLPsych workshops diverge from the conventional “mini-conference” workshop format by inviting clinical psychologists and researchers to join us at the workshop as discussants, to provide real-world points of view on the potential applications of NLP technologies presented during the workshop. We hope to build the momentum towards releasing tools and data that can be used by clinical psychologists.

NAACL 2016 hosts the third CLPsych workshop, with another shared task. Papers published in these proceedings propose methods for aiding the diagnosis of dementia, analyzing sentiment as related to psychotherapy, assessing suicide risk, and quantifying the language of mental health. The 2016 CLPsych Shared Task centered on the classification of posts from a mental health forum to assist forum moderators in triaging and escalating posts requiring immediate attention. We accepted 11 submissions for the main workshop and 16 for the shared task. Each oral presentation will be followed by discussions led by one of our discussants, subject matter experts working in the fields of behavioral and mental health and with clinical data, including Dr. Loring Ingraham and Dr. Bart Andrews.

We wish to thank everyone who showed interest and submitted a paper, all of the authors for their contributions, the members of the Program Committee for their thoughtful reviews, our clinical discussants for their helpful insights, and all the attendees of the workshop. We also wish to extend thanks to the Association for Computational Linguistics for making this workshop possible, and to Microsoft Research for its very generous sponsorship.
– Kristy and Lyle
Organizers:
Kristy Hollingshead, IHMC
Lyle Ungar, University of Pennsylvania
Clinical Discussants:
Loring J. Ingraham, George Washington University
Bart Andrews, Behavioral Health Response
Program Committee:
Steven Bedrick, Oregon Health & Science University
Archna Bhatia, IHMC
Wilma Bucci, Adelphi University
Wei Chen, Nationwide Children’s Hospital
Leonardo Claudino, University of Maryland, College Park
Mike Conway, University of Utah
Glen Coppersmith, Qntfy
Brita Elvevåg, Department of Clinical Medicine, University of Tromsø
Peter Foltz, Pearson
Dan Goldwasser, Purdue University
Ben Hachey, University of Sydney
Graeme Hirst, University of Toronto
Christopher Homan, Rochester Institute of Technology
Jena Hwang, IHMC
Zac Imel, University of Utah
Loring Ingraham, George Washington University
William Jarrold, Nuance Communications
Yangfeng Ji, School of Interactive Computing, Georgia Institute of Technology
Dimitrios Kokkinakis, University of Gothenburg
Tong Liu, Rochester Institute of Technology
Shervin Malmasi, Harvard Medical School
Bernard Maskit, Stony Brook University
Margaret Mitchell, Microsoft Research
Eric Morley, Oregon Health & Science University
Danielle Mowery, University of Utah
Sean Murphy, John Jay College of Criminal Justice; City University of New York
Cecilia Ovesdotter Alm, Rochester Institute of Technology
Ted Pedersen, University of Minnesota
Craig Pfeifer, MITRE
Glen Pink, University of Sydney
Daniel Preotiuc, University of Pennsylvania
Emily Prud’hommeaux, Rochester Institute of Technology
Matthew Purver, Queen Mary University of London
Philip Resnik, University of Maryland
Rebecca Resnik, Mindwell Psychology
Brian Roark, Google
Mark Rosenstein, Pearson
Masoud Rouhizadeh, Stony Brook University & University of Pennsylvania
J. David Schaffer, Binghamton University
Ronald Schouten, Harvard Medical School
H. Andrew Schwartz, Stony Brook University
J. Ignacio Serrano, Spanish National Research Council
Richard Sproat, Google
Hiroki Tanaka, Nara Institute of Science and Technology
Michael Tanana, University of Utah
Paul Thompson, Dartmouth College
Jan van Santen, Oregon Health & Science University
Eleanor Yelland, University College London
Dan Yoo, Cambia Health
Table of Contents
Detecting late-life depression in Alzheimer’s disease through analysis of speech and language Kathleen C. Fraser, Frank Rudzicz and Graeme Hirst ...... 1
Towards Early Dementia Detection: Fusing Linguistic and Non-Linguistic Clinical Data Joseph Bullard, Cecilia Ovesdotter Alm, Xumin Liu, Qi Yu and Rubén Proaño...... 12
Self-Reflective Sentiment Analysis Benjamin Shickel, Martin Heesacker, Sherry Benton, Ashkan Ebadi, Paul Nickerson and Parisa Rashidi...... 23
Is Sentiment in Movies the Same as Sentiment in Psychotherapy? Comparisons Using a New Psychotherapy Sentiment Database Michael Tanana, Aaron Dembe, Christina S. Soma, Zac Imel, David Atkins and Vivek Srikumar 33
Building a Motivational Interviewing Dataset Verónica Pérez-Rosas, Rada Mihalcea, Kenneth Resnicow, Satinder Singh and Lawrence An . . 42
Crazy Mad Nutters: The Language of Mental Health Jena D. Hwang and Kristy Hollingshead ...... 52
The language of mental health problems in social media George Gkotsis, Anika Oellrich, Tim Hubbard, Richard Dobson, Maria Liakata, Sumithra Velupillai and Rina Dutta ...... 63
Exploring Autism Spectrum Disorders Using HLT Julia Parish-Morris, Mark Liberman, Neville Ryant, Christopher Cieri, Leila Bateman, Emily Ferguson and Robert Schultz ...... 74
Generating Clinically Relevant Texts: A Case Study on Life-Changing Events Mayuresh Oak, Anil Behera, Titus Thomas, Cecilia Ovesdotter Alm, Emily Prud’hommeaux, Christopher Homan and Raymond Ptucha...... 85
Don’t Let Notes Be Misunderstood: A Negation Detection Method for Assessing Risk of Suicide in Mental Health Records George Gkotsis, Sumithra Velupillai, Anika Oellrich, Harry Dean, Maria Liakata and Rina Dutta 95
Exploratory Analysis of Social Media Prior to a Suicide Attempt Glen Coppersmith, Kim Ngo, Ryan Leary and Anthony Wood ...... 106
CLPsych 2016 Shared Task: Triaging content in online peer-support forums David N. Milne, Glen Pink, Ben Hachey and Rafael A. Calvo ...... 118
Data61-CSIRO systems at the CLPsych 2016 Shared Task Sunghwan Mac Kim, Yufei Wang, Stephen Wan and Cecile Paris...... 128
Predicting Post Severity in Mental Health Forums Shervin Malmasi, Marcos Zampieri and Mark Dras ...... 133
Classifying ReachOut posts with a radial basis function SVM Chris Brew ...... 138
Triaging Mental Health Forum Posts Arman Cohan, Sydney Young and Nazli Goharian ...... 143
Mental Distress Detection and Triage in Forum Posts: The LT3 CLPsych 2016 Shared Task System Bart Desmet, Gilles Jacobs and Véronique Hoste ...... 148
Text Analysis and Automatic Triage of Posts in a Mental Health Forum Ehsaneddin Asgari, Soroush Nasiriany and Mohammad R.K. Mofrad ...... 153
The UMD CLPsych 2016 Shared Task System: Text Representation for Predicting Triage of Forum Posts about Mental Health Meir Friedenberg, Hadi Amiri, Hal Daumé III and Philip Resnik ...... 158
Using Linear Classifiers for the Automatic Triage of Posts in the 2016 CLPsych Shared Task Juri Opitz ...... 162
The GW/UMD CLPsych 2016 Shared Task System Ayah Zirikly, Varun Kumar and Philip Resnik ...... 166
Semi-supervised CLPsych 2016 Shared Task System Submission Nicolas Rey-Villamizar, Prasha Shrestha, Thamar Solorio, Farig Sadeque, Steven Bethard and Ted Pedersen...... 171
Combining Multiple Classifiers Using Global Ranking for ReachOut.com Post Triage Chen-Kai Wang, Hong-Jie Dai, Chih-Wei Chen, Jitendra Jonnagaddala and Nai-Wen Chang . 176
Classification of mental health forum posts Glen Pink, Will Radford and Ben Hachey ...... 180
Automatic Triage of Mental Health Online Forum Posts: CLPsych 2016 System Description Hayda Almeida, Marc Queudot and Marie-Jean Meurs ...... 183
Automatic Triage of Mental Health Forum Posts Benjamin Shickel and Parisa Rashidi ...... 188
Text-based experiments for Predicting mental health emergencies in online web forum posts Hector-Hugo Franco-Penya and Liliana Mamani Sanchez ...... 193
Conference Program
2016/06/16
09:00–09:20 Opening Remarks Kristy Hollingshead and Lyle Ungar
09:20–10:30 Oral Presentations, Session 1
Detecting late-life depression in Alzheimer’s disease through analysis of speech and language Kathleen C. Fraser, Frank Rudzicz and Graeme Hirst
Towards Early Dementia Detection: Fusing Linguistic and Non-Linguistic Clinical Data Joseph Bullard, Cecilia Ovesdotter Alm, Xumin Liu, Qi Yu and Rubén Proaño
10:30–11:00 Break
11:00–11:45 Poster Presentations
Self-Reflective Sentiment Analysis Benjamin Shickel, Martin Heesacker, Sherry Benton, Ashkan Ebadi, Paul Nickerson and Parisa Rashidi
Is Sentiment in Movies the Same as Sentiment in Psychotherapy? Comparisons Using a New Psychotherapy Sentiment Database Michael Tanana, Aaron Dembe, Christina S. Soma, Zac Imel, David Atkins and Vivek Srikumar
Building a Motivational Interviewing Dataset Verónica Pérez-Rosas, Rada Mihalcea, Kenneth Resnicow, Satinder Singh and Lawrence An
Crazy Mad Nutters: The Language of Mental Health Jena D. Hwang and Kristy Hollingshead
The language of mental health problems in social media George Gkotsis, Anika Oellrich, Tim Hubbard, Richard Dobson, Maria Liakata, Sumithra Velupillai and Rina Dutta
Exploring Autism Spectrum Disorders Using HLT Julia Parish-Morris, Mark Liberman, Neville Ryant, Christopher Cieri, Leila Bateman, Emily Ferguson and Robert Schultz
11:45–1:00 Lunch
1:00–2:45 Oral Presentations, Session 2
Generating Clinically Relevant Texts: A Case Study on Life-Changing Events Mayuresh Oak, Anil Behera, Titus Thomas, Cecilia Ovesdotter Alm, Emily Prud’hommeaux, Christopher Homan and Raymond Ptucha
Don’t Let Notes Be Misunderstood: A Negation Detection Method for Assessing Risk of Suicide in Mental Health Records George Gkotsis, Sumithra Velupillai, Anika Oellrich, Harry Dean, Maria Liakata and Rina Dutta
Exploratory Analysis of Social Media Prior to a Suicide Attempt Glen Coppersmith, Kim Ngo, Ryan Leary and Anthony Wood
2:45–3:00 Break
3:00–3:15 Shared Task Introduction
CLPsych 2016 Shared Task: Triaging content in online peer-support forums David N. Milne, Glen Pink, Ben Hachey and Rafael A. Calvo
3:15–3:40 Shared Task Poster Presentations, Session 1
Data61-CSIRO systems at the CLPsych 2016 Shared Task Sunghwan Mac Kim, Yufei Wang, Stephen Wan and Cecile Paris
Predicting Post Severity in Mental Health Forums Shervin Malmasi, Marcos Zampieri and Mark Dras
Classifying ReachOut posts with a radial basis function SVM Chris Brew
Triaging Mental Health Forum Posts Arman Cohan, Sydney Young and Nazli Goharian
Mental Distress Detection and Triage in Forum Posts: The LT3 CLPsych 2016 Shared Task System Bart Desmet, Gilles Jacobs and Véronique Hoste
Text Analysis and Automatic Triage of Posts in a Mental Health Forum Ehsaneddin Asgari, Soroush Nasiriany and Mohammad R.K. Mofrad
The UMD CLPsych 2016 Shared Task System: Text Representation for Predicting Triage of Forum Posts about Mental Health Meir Friedenberg, Hadi Amiri, Hal Daumé III and Philip Resnik
3:40–4:00 Break
4:00–4:25 Shared Task Poster Presentations, Session 2
Using Linear Classifiers for the Automatic Triage of Posts in the 2016 CLPsych Shared Task Juri Opitz
The GW/UMD CLPsych 2016 Shared Task System Ayah Zirikly, Varun Kumar and Philip Resnik
Semi-supervised CLPsych 2016 Shared Task System Submission Nicolas Rey-Villamizar, Prasha Shrestha, Thamar Solorio, Farig Sadeque, Steven Bethard and Ted Pedersen
Combining Multiple Classifiers Using Global Ranking for ReachOut.com Post Triage Chen-Kai Wang, Hong-Jie Dai, Chih-Wei Chen, Jitendra Jonnagaddala and Nai-Wen Chang
Classification of mental health forum posts Glen Pink, Will Radford and Ben Hachey
Automatic Triage of Mental Health Online Forum Posts: CLPsych 2016 System Description Hayda Almeida, Marc Queudot and Marie-Jean Meurs
Automatic Triage of Mental Health Forum Posts Benjamin Shickel and Parisa Rashidi
Text-based experiments for Predicting mental health emergencies in online web forum posts Hector-Hugo Franco-Penya and Liliana Mamani Sanchez
4:25–4:45 Closing Remarks
Detecting late-life depression in Alzheimer’s disease through analysis of speech and language
Kathleen C. Fraser (1), Frank Rudzicz (2,1) and Graeme Hirst (1)
(1) Department of Computer Science, University of Toronto, Toronto, Canada
(2) Toronto Rehabilitation Institute-UHN, Toronto, Canada
{kfraser,frank,gh}@cs.toronto.edu
Abstract It is also important to detect when someone has both AD and depression, as this serious situation can Alzheimer’s disease (AD) and depression lead to more rapid cognitive decline, earlier place- share a number of symptoms, and commonly ment in a nursing home, increased risk of depression occur together. Being able to differentiate be- in the patient’s caregivers, and increased mortality tween these two conditions is critical, as de- (Thorpe, 2009; Lee and Lyketsos, 2003). pression is generally treatable. We use linguis- Separate bodies of work have reported the util- tic analysis and machine learning to determine whether automated screening algorithms for ity of spontaneous speech analysis in distinguish- AD are affected by depression, and to detect ing participants with depression from healthy con- when individuals diagnosed with AD are also trols, and in distinguishing participants with demen- showing signs of depression. In thefirst case, tia from healthy controls. Here we consider whether wefind that our automated AD screening pro- such analyses can be applied to the problem of de- cedure does not show false positives for indi- tecting depression in Alzheimer’s disease (AD). In viduals who have depression but are otherwise particular, we explore two questions: (1) In previous healthy. In the second case, we have moderate success in detecting signs of depression in AD work on detecting AD from speech (elicited through (accuracy = 0.658), but we are not able to draw a picture description task), are cognitively healthy a strong conclusion about the features that are people with depression being misclassified as hav- most informative to the classification. ing AD? (2) If we consider only participants with AD, can we distinguish between those with depres- sion and those without, using the same picture de- 1 Introduction scription task and analysis?
Depression and dementia are both medical condi- 2 Background tions that can have a strong negative impact on the quality of life of the elderly, and they are often co- There has been considerable work on detecting de- morbid. However, depression is often treatable with pression from speech and on detecting dementia medication and therapy, whereas dementia usually from speech, but very little which combines the two. occurs as the result of an irreversible process of neu- We willfirst review the two tasks separately, and rodegeneration. It is therefore critical to be able to then discuss some of the complexity that arises when distinguish between these two conditions. depression and AD co-occur. However, distinguishing between depression and dementia can be extremely difficult because of over- 2.1 Detecting depression from speech lapping symptoms, including apathy, crying spells, Depression affects a number of cognitive and phys- changes in weight and sleeping patterns, and prob- ical systems related to the production of speech, in- lems with concentration and attention. cluding working memory, the phonological loop, ar-
1
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 1–11, San Diego, California, June 16, 2016. c 2016 Association for Computational Linguistics ticulatory planning, and muscle tension and control 30 participants with depression and 30 healthy con- (Cummins et al., 2015). These changes can result trols. The speech was elicited through interview in word-finding difficulties, articulatory errors, de- questions about situations that had aroused signif- creased prosody, and lower verbal productivity. icant emotions. Higher accuracy was achieved on Over the past decade or so, there has been grow- detecting depression in women than in men. Energy, ing interest in measuring properties of the speech intensity, shimmer, and MFCC features were all in- signal that correlate with the changes observed in de- formative, and positive emotional speech was more pression, and using these measured variables to train discriminatory than negative emotional speech. machine learning classifiers to automatically detect Scherer et al. (2013) differentiated 18 depressed depression from speech. participants from 18 controls with 75% accuracy, Ozdas et al. (2004) found that mean jitter and the using interviews captured with a simulated virtual slope of the glottalflow spectrum could distinguish human. They found that glottal features such as between 10 non-depressed controls, 10 participants the normalized amplitude quotient (NAQ) and quasi- with clinical depression, and 10 high-risk suicidal open quotient (QOQ) differed significantly between participants. the groups. Moore et al. (2008) considered prosodic features Alghowinem et al. (2013) compared four clas- as well as vocal tract and glottal features. 
They per- sifiers and a number of different feature sets on formed sex-dependent classification and found that the task of detecting depression from spontaneous glottal features were more discriminative than vocal speech. They found loudness and intensity features tract features, but that the best results were achieved to be the most discriminatory, and suggested pitch using all three types of features. and formant features may be more useful for longi- Cohn et al. (2009) examined the utility of facial tudinal comparisons within individuals. movements and vocal prosody in discriminating par- While most of the literature concerning the detec- ticipants with moderate or severe depression from tion of depression from speech has focused solely those with no depression. They achieved 79% accu- on the speech signal, there is an associated body of racy using only two prosodic features: variation in work on detecting depression from writing that fo- fundamental frequency, and latency of response to cuses on linguistic cues. Rude et al. (2004) found interviewer questions. They used a within-subjects that college students with depression were signifi- design, in which they predicted which participants cantly more likely to use thefirst-person pronoun had responded to treatment in a clinical trial. I in personal essays than college students without Low et al. (2011) analyzed speech from ado- depression, and also used more words with neg- lescents engaged in normal conversation with their ative emotional valence. Other work has found parents (68 diagnosed with depression, 71 con- differences in the frequency of different parts-of- trols). They grouped their acoustic features into speech (POS) (De Choudhury et al., 2013) and in 5 groups: spectral, cepstral, prosodic, glottal, and the general topics chosen for discussion (Resnik et those based on the Teager energy operator (TEO, a al., 2015). Other work has accurately identified de- nonlinear energy operator). 
They achieved higher pression (and differentiated PTSD and depression) accuracies using sex-dependent models than sex- in Twitter social media texts with high accuracies independent models, and found that the best results usingn-gram language models (Coppersmith et al., were achieved using the TEO-based features (up to 2015). Similarly, Nguyen et al. (2014) showed 87% for males and 79% for females). that specialized lexical norms and Linguistic Inquiry Cummins et al. (2011) distinguished 23 depressed and Word Count1 features significantly differentiate participants from 24 controls with a best accuracy clinical and control groups in blog post texts. Howes of 80% in a speaker-dependent configuration and et al. (2014) showed that lexical features (in style 79% in a speaker-independent configuration. Spec- and dialogue) could also be used to predict the sever- tral features, particularly mel-frequency cepstral co- ity of depression and anxiety during Cognitive Be- efficients (MFCCs), were found to be useful. Alghowinem et al. (2012) analyzed speech from 1http://liwc.wpengine.com.
2 havioural Therapy treatment. It is not obvious that pothesis that AD patients would use more pronouns, these results generalize to the case where the topic verbs, and adjectives and fewer nouns than controls. and structure of the narrative is constrained to a pic- Rentoumi et al. (2014) considered a slightly dif- ture description. ferent problem: they used computational techniques to differentiate between picture descriptions from 2.2 Detecting Alzheimer’s disease from speech AD participants with and without additional vas- A growing number of researchers have tackled the cular pathology (n= 18 for each group). They problem of detecting dementia from speech and achieved an accuracy of 75% when they included language. Most of this work has focused on frequency unigrams and excluded binary unigrams, Alzheimer’s disease (AD), which is the most com- syntactic complexity features, measures of vocabu- mon cause of dementia. Although the primary diag- lary richness, and information theoretic features. nostic symptom of AD is memory impairment, this Orimaye et al. (2014) obtainedF-measure scores and other cognitive deficits often manifest in spon- up to 0.74 on transcripts from DementiaBank, com- taneous language through word-finding difficulties, bining participants with different etiologies rather a decrease in information content, and changes in than focusing on AD. In previous work, we also fluency, syntactic complexity, and prosody. Other studied data from DementiaBank (Fraser et al., work, including that of Roark et al. (2007), focuses 2015). We computed acoustic and linguistic fea- on mild cognitive impairment, which is also broadly tures from the “Cookie Theft” picture descriptions applicable. and distinguished 240 AD narratives from 233 con- Thomas et al. (2005) classified spontaneous trol narratives with 81% accuracy using logistic re- speech samples from 95 AD patients and an unspeci- gression. 
fied number of controls by treating the problem as an authorship attribution task, and employing a “com- 2.3 Relationship between dementia and mon N-grams” approach. They were able to distin- depression guish between patients with severe AD and controls The relationship between dementia and depression with a best accuracy of 94.5%, and between patients is complicated, as the two conditions are not in- with mild AD and controls with an 75.3% accuracy. dependent of each other and in fact frequently co- Habash and Guinn (2012) built classifiers to dis- occur. When someone is diagnosed with demen- tinguish between AD and non-AD language samples tia, feelings of depression are common. At the using 80 conversations between 31 AD patients and same time, depression is a risk factor for devel- 57 cognitively normal conversation partners. They oping Alzheimer’s disease (Korczyn and Halperin, found that features such as POS tags and measures 2009). The diagnosis of a third medical condition of lexical diversity were less useful than measuring (e.g., heart disease) can trigger depression and also filled pauses, repetitions, and incomplete words, and independently increase the risk of dementia. Sim- achieved a best accuracy of 79.5%. ilarly, some risk factors for depression and demen- Meilan´ et al. (2012) distinguished between 30 AD tia are the same, such as alcohol use and cigarette patients and 36 healthy controls with temporal and smoking (Thorpe, 2009). Furthermore, changes in acoustic features alone, obtaining an accuracy of white matter connectivity have been linked to both 84.8%. For each participant, their speech sample depression (Alexopoulos et al., 2008) and dementia consisted of two sentences read from a screen. The (Prins et al., 2004). 
discriminating features were percentage of voice The prevalence of depression in AD has been es- breaks, number of voice breaks, number of periods timated to be 30–50% (Lee and Lyketsos, 2003), al- of voice, shimmer, and noise-to-harmonics ratio. though thesefigures have been shown to vary widely Jarrold et al. (2014) used acoustic features, POS depending on the diagnostic method used (Muller-¨ features, and psychologically-motivated word lists Thomsen et al., 2005). In contrast, the prevalence to distinguish between semi-structured interview re- of depression in the general population older than sponses from 9 AD participants and 9 controls with 75 is estimated to be 7.2% (major depression) and an accuracy of 88%. They also confirmed their hy- 17.1% (depressive disorders) (Luppa et al., 2012).
3 The prevalence of Alzheimer’s disease is 11% for description task from the Boston Diagnostic Apha- people aged 65 and older, increasing to 33% for sia Examination (BDAE) (Goodglass and Kaplan, people ages 85 and older (Alzheimer’s Association, 1983), in which participants are asked to describe 2015). everything they see going on in a picture. We ex- Symptoms which are common in both depression tract features from both the acousticfiles (converted and dementia include: poor concentration, impaired from MP3 to 16-bit mono WAV format with a sam- attention (Korczyn and Halperin, 2009), apathy (Lee pling rate of 16 kHz) and the associated transcripts. and Lyketsos, 2003), changes to eating and sleeping All examiner speech is excluded from the sample. patterns, and reactive mood symptoms, e.g., tearful- A subset of the participants also have Hamilton ness (Thorpe, 2009). However, both dementia and Depression Rating Scale (HAM-D) scores (Hamil- depression are heterogeneous in presentation, which ton, 1960). The HAM-D is still one of the gold stan- can lead to many possible combinations of symp- dards for depression rating (although it has also re- toms when they co-occur. ceived criticism; see Bagby et al. (2014) for an ex- Studies examining spontaneous speech tasks to ample). It consists of 17 questions, for which the discriminate between dementia and depression are patient’s responses are rated from 0–4 or 0–2 by the rare. Murray (2010) investigated whether clini- examiner. A total score between 0–7 is considered cal depression could be distinguished from AD by normal, 8–16 indicates mild depression, 17–23 indi- analyzing narrative speech. She found that there cates moderate depression, and greater than 24 indi- were significant differences in the amount of infor- cates severe depression (Zimmerman et al., 2013). 
mation that was conveyed in a picture description task, with depressed participants communicating the 3.2 Features same amount of information as healthy controls, and We extract a large number of textual features AD patients showing a reduction in information con- (including part-of-speech tags, parse constituents, tent. Other discourse measures relating to the quan- psycholinguistic measures, and measures of com- tity of speech produced and the syntactic complexity plexity, vocabulary richness, and informativeness), of the narrative did not differ between the groups. In and acoustic features (includingfluency measures, contrast to the current work, the study described in MFCCs, voice quality features, and measures of pe- Murray (2010) did not include participants with both riodicity and symmetry). A complete list of features dementia and depression, involved a much smaller is given in the Supplementary Material, and addi- data set (49 participants across 3 groups), and did tional details are reported by Fraser et al. (2015). not seek to make predictions from the data. 3.3 Classification 3 Methods We select a subset of the extracted features using a 3.1 Data correlation-basedfilter. Features are ranked by their correlation with diagnosis and only the topN fea- We use narrative speech data from the Pitt corpus tures are selected, where we varyN from 5 to 400. in the DementiaBank database2. These data were The selected features are fed to a machine learn- collected between 1983 and 1988 as part of the ing classifier; in this study we compare logistic re- Alzheimer Research Program at the University of gression (LR) with support vector machines (SVM) Pittsburgh. Detailed information about the study (Hall et al., 2009). We use a cross-validation frame- cohort is available from Becker et al. (1994), and work and report the average accuracy across folds. 
demographic information is presented for each experiment below in Tables 1 and 3. Diagnoses were made on the basis of a personal history and a neuropsychological battery; a subset of these diagnoses were confirmed post-mortem. The language samples were elicited using the "Cookie Theft" picture.2

2 https://talkbank.org/DementiaBank/

The data are partitioned across folds such that samples from a single speaker occur in either the training set or the test set, but never both. Error bars are computed using the standard deviation of the accuracy across folds. In some cases we also report sensitivity and specificity, where sensitivity indicates the proportion of people with AD (or depression) who were correctly identified as such, and specificity indicates the proportion of controls who were correctly identified as such.

Table 1: Mean and standard deviation of demographic information for participants in Experiment I. ** indicates p < 0.01.

              AD           Controls     Sig.
              n = 196      n = 128
   Age        71.7 (8.7)   63.7 (7.6)   **
   Education  12.4 (2.9)   13.9 (2.4)   **
   Sex (M/F)  66/130       49/79

4 Experiment I: Does depression affect classification accuracy?

Our first experiment examines whether depression is a confounding factor in our current diagnostic pipeline. To answer this question, we consider the subset of narratives for which associated HAM-D scores are available. This leaves a set of 196 AD narratives and 128 control narratives, from 150 AD participants and 80 control participants. Since participants may have different scores on different visits, we consider data per narrative rather than per speaker. Demographic information is given in Table 1. The groups are not matched for age or education, which is a limitation of the complete data set as well; the AD participants tend to be both older and less educated.

We then perform the classification procedure using the analysis pipeline described above, with 10-fold cross-validation. The results for a logistic regression and an SVM classifier are shown in Figure 1. This is a necessary first step to examine whether depression is a confounding factor.

Figure 1: Classification accuracy on the task of distinguishing AD from control narratives for varying feature set sizes.

Choosing the best result of 0.799 (SVM classifier, 70 features), we then perform a more detailed analysis. Accuracy, sensitivity, and specificity for the full data set are reported in the first row of Table 2. We first break the data down into two separate groups: those with a Hamilton score greater than 7 (i.e., "depressed") and those with a Hamilton score less than or equal to 7 ("non-depressed") (Zimmerman et al., 2013). The accuracy, sensitivity, and specificity for these sub-groups are also reported in Table 2. Because there are far more AD participants with depression (n = 65) than controls with depression (n = 9), we also report the accuracy of a majority-class classifier as a baseline with which to compare the reported accuracies. Alternatives to this approach, such as synthetically balancing the classes with synthetic minority oversampling (Chawla et al., 2002), are left to future work.

Table 2: Accuracy, sensitivity, and specificity for all participants, depressed participants, and non-depressed participants.

   Data set       Baseline  Accuracy  Sens.   Spec.
   All            0.605     0.799     0.826   0.758
   Depressed      0.743     0.864     0.846   1.000
   Non-depressed  0.552     0.780     0.816   0.739

A key result from this experiment is that although there are only a few control participants who are depressed, none of those are misclassified as AD (specificity = 1.0 in this case). Furthermore, if we partition the participants by accuracy (those who were classified correctly versus incorrectly), we find no significant difference in HAM-D scores (p > 0.05). This suggests that the accuracy of the classifier is not affected by the presence or absence of depression.

5 Experiment II: Can we detect depression in Alzheimer's disease?

In our second experiment, we tackle the problem of detecting depression when it is comorbid with Alzheimer's disease.
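For concreteness, the evaluation quantities reported in Tables 2 and 5 (accuracy, sensitivity, specificity, and the majority-class baseline) can be computed as in the following sketch. This is an illustrative reconstruction, not the authors' implementation, and the function names are ours.

```python
def evaluate(y_true, y_pred, positive="AD"):
    """Accuracy, sensitivity, and specificity for a binary task.

    Sensitivity: proportion of positive cases (AD, or depressed)
    correctly identified as such; specificity: proportion of
    controls correctly identified as such.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent class."""
    counts = {}
    for t in y_true:
        counts[t] = counts.get(t, 0) + 1
    return max(counts.values()) / len(y_true)
```

For the full Experiment I set of 196 AD and 128 control narratives, the majority-class baseline is 196/324 ≈ 0.605, matching the first row of Table 2.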
Table 3: Mean and standard deviation of demographic information for AD participants in Experiment II. * indicates p < 0.05.

              Depressed    Non-dep.     Sig.
              n = 65       n = 65
   Age        71.4 (8.6)   71.6 (8.6)
   Education  11.7 (2.6)   12.9 (3.0)   *
   Sex (M/F)  21/44        19/46
   MMSE       18.1 (5.5)   17.9 (5.4)

Table 4: Highly ranked features for distinguishing people with AD and depression from people with only AD. The third column shows the correlation with diagnosis, and the fourth column shows the direction of the trend (increasing or decreasing) with depression.

   Rank  Feature               r        Trend
   1     Skewness MFCC 1       0.270    ↑
   2     Info unit: boy       -0.265    ↓
   3     Mean ΔΔMFCC 8         0.229    ↑
   4     VP → VB NP           -0.223    ↓
   5     Kurtosis MFCC 4       0.223    ↑
   6     Kurtosis MFCC 3       0.217    ↑
   7     Kurtosis ΔMFCC 2      0.213    ↑
   8     Skewness ΔΔMFCC 2     0.211    ↑
   9     Kurtosis MFCC 10      0.209    ↑
   10    Determiners          -0.206    ↓
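The paper does not state which correlation coefficient underlies the r values used to rank features. With a binary diagnosis coded as 0/1, Pearson's r reduces to the point-biserial coefficient, which such a ranking could compute as in this sketch (an assumed formulation, not the authors' code):

```python
from math import sqrt

def point_biserial(labels, values):
    """Pearson correlation between a binary label (0/1) and a
    continuous feature value -- the point-biserial coefficient.
    Assumed here as the measure behind feature rankings like Table 4."""
    n = len(values)
    mean_l = sum(labels) / n
    mean_v = sum(values) / n
    cov = sum((l - mean_l) * (v - mean_v) for l, v in zip(labels, values)) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    var_v = sum((v - mean_v) ** 2 for v in values) / n
    return cov / sqrt(var_l * var_v)
```

A positive coefficient corresponds to the ↑ trend in the tables (the feature increases with depression), a negative one to ↓.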
From the previous section, we have 65 narratives from participants with both AD and depression (HAM-D > 7). We select an additional 65 narratives from participants with AD but no depression. These additional data are selected randomly, but such that participants are matched for dementia severity, age, and sex. Demographic information is given in Table 3.

5.1 Standard processing pipeline

We begin by using our standard processing pipeline to assess whether it is capable of detecting depression. The classification accuracies are given in Figure 2. In this case, since the groups are the same size, the baseline accuracy is 0.5. The best accuracy of 0.658 is achieved with the LR classifier using 60 features (sensitivity: 0.707, specificity: 0.610). This represents a significant increase (paired t-test, p < 0.05) of 15 percentage points over the random baseline, but there is clearly room for improvement.

Figure 2: Classification accuracy on the task of distinguishing depressed from non-depressed AD narratives for varying feature set sizes.

Table 4 shows the features which are most highly correlated with diagnosis (over all data). Even for the top-ranked features, the correlation is weak, and the difference between groups is not significant after correcting for multiple comparisons. We therefore cannot draw firm conclusions about the selected features, although we do note the apparent importance of the MFCC features here.

5.2 Sex-dependent classification

Given that acoustic features naturally vary across the sexes, and that previous work achieved better results using sex-dependent classifiers, we also consider a sex-dependent configuration. The drawback to this approach is the reduction in data, particularly for males. In these experiments we attempt to classify 21 males with depression+AD versus 19 males with AD only, and 44 females with depression+AD versus 46 females with AD only. The results for these experiments are shown in Figure 3, and the best accuracies are given in Table 5.

Table 5: Accuracy, sensitivity, and specificity for all participants, just females, and just males.

   Data set  Baseline  Accuracy  Sens.   Spec.
   All       0.500     0.658     0.707   0.610
   Females   0.511     0.588     0.519   0.653
   Males     0.525     0.650     0.580   0.717
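The significance claim for the best result (paired t-test, p < 0.05) pairs per-fold accuracies with per-fold baselines. A sketch of the paired t statistic over the 10 folds follows; this is our reconstruction of a standard paired test, not necessarily the authors' exact configuration. For 10 folds, df = 9 and the two-sided 5% critical value is approximately 2.262.

```python
from math import sqrt

def paired_t(accuracies, baselines):
    """Paired t statistic over per-fold (accuracy, baseline) pairs,
    as could be used to test an improvement over the majority-class
    baseline across the 10 cross-validation folds."""
    diffs = [a - b for a, b in zip(accuracies, baselines)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)
```

The resulting statistic is then compared against the critical value for n − 1 degrees of freedom to decide significance at the chosen level.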
The features which are most correlated with diagnosis for females are listed in Table 6a, and those which are most correlated with diagnosis for males are listed in Table 6b. Again, the selected features tend to be either informational, grammatical, or cepstral in nature, although none of the differences are significant after correcting for multiple comparisons.

Figure 3: Sex-dependent classification accuracy on the task of distinguishing depressed from non-depressed AD narratives for varying feature set sizes. (a) Females. (b) Males.

Table 6: Highly ranked features for distinguishing individuals with AD and depression from individuals with only AD, in the sex-dependent case. (No differences are significant after correcting for multiple comparisons.)

   (a) Females
   Rank  Feature               r        Trend
   1     Info unit: boy       -0.323    ↓
   2     Mean ΔMFCC 9          0.284    ↑
   3     VP → VB NP           -0.274    ↓
   4     Kurtosis MFCC 3       0.266    ↑
   5     Kurtosis Δ energy     0.261    ↑
   6     Skewness MFCC 1       0.260    ↑
   7     Kurtosis MFCC 4       0.256    ↑
   8     NP → PRP$ NNS         0.251    ↑
   9     Skewness ΔΔMFCC 2     0.249    ↑
   10    NP → NP NP .          0.243    ↑

   (b) Males
   Rank  Feature               r        Trend
   1     Mean ΔΔMFCC 9         0.447    ↑
   2     Skewness ΔΔMFCC 12   -0.406    ↓
   3     VP → VB S             0.405    ↑
   4     Mean ΔΔMFCC 2        -0.381    ↓
   5     Info unit: stool      0.352    ↑
   6     VP → VBG NP          -0.351    ↓
   7     Key word: chair       0.346    ↑
   8     Mean ΔMFCC 11        -0.325    ↓
   9     Key word: girl       -0.318    ↓
   10    Mean ΔΔMFCC 8         0.316    ↑

5.3 Additional features

To help our classifiers better distinguish between people with and without depression, we implement a number of additional features which have been reported to be valuable in detecting depression. Many of the acoustic features from the literature were already present in our feature set, but we now consider a number of glottal features, including the means and standard deviations of the maximum voiced frequency, glottal closure instants, linear prediction residuals, peak slope, glottal flow (and its derivative), normalized amplitude quotient (NAQ), quasi-open quotient (QOQ), harmonic richness factor, parabolic spectral parameter, and cepstral peak prominence. These features are implemented in the COVAREP toolkit (version 1.4.1) (Degottex et al., 2014).

We also include three additional psycholinguistic variables relating to the affective qualities of words: valence, arousal, and dominance. Valence describes the degree of positive or negative emotion associated with a word, arousal describes the intensity of the emotion associated with a word, and dominance describes the degree of control associated with a word. We use the crowd-sourced norms presented by Warriner et al. (2013) for their broad coverage, and measure the mean and maximum value of each variable. Finally, we count the frequency of occurrence of first-person words (I, me, my, mine). In general, the picture description task is completed in the third person, but first-person words do occur.

However, including these new features actually had a slightly negative effect on the sex-independent classification, reducing the maximum accuracy from 0.658 to 0.650, as well as on the males-only case, reducing the maximum accuracy from 0.650 to 0.585. This suggests that some of the new features are being selected in individual training folds, but not generalizing to the test folds. In contrast, the new features did make a small, incremental improvement in the females-only case, from 0.588 to 0.609. The new features that were most highly ranked for females were the standard deviation of the peak slope (rank 12, r = -0.237) and the standard deviation of NAQ (rank 35, r = -0.186), both showing a weak negative correlation with diagnosis. The most useful new feature for males was the mean QOQ (rank 16, r = 0.273), with a weak positive correlation with diagnosis.

6 Conclusion

In this paper, we considered two questions. The first relates to previous work in the field showing that speech analysis and machine learning can lead to good, but not perfect, differentiation between participants with AD and healthy controls. We wondered whether some control participants were being misclassified as having AD when in fact they were depressed. However, in our experiment we found that none of the 9 depressed controls were misclassified as having AD. This is a small sample, but it is consistent with the findings of Murray (2010), who found that although AD participants and controls could be distinguished through analysis of their picture descriptions, there were no differences between depressed participants and controls.

We then considered only participants with AD, and tried to distinguish between those with comorbid depression and those without. Our best accuracy for this task was 0.658, which is considerably lower than reported accuracies for detecting depression in the absence of AD, but reflects the difficulty of the task given the wide overlap of symptoms between the two conditions. In fact, previous work on detecting depression from speech has focused overwhelmingly on young and otherwise healthy participants, and much work is needed on detecting depression in other populations and with other comorbidities.

One limitation of this work is the type of speech data available; previous work suggests that emotional speech is more informative for detecting depression. Another limitation is that we assign our participants to the depressed and non-depressed groups on the basis of a single test score, rather than a confirmed clinical diagnosis. A related factor to consider is the relatively mild depression observed in this data set, which was developed for the study of AD rather than depression: only 8 participants met the criteria for "moderate" depression, and none met the criteria for severe depression. Furthermore, while the controls in Experiment II all had scores below the threshold for mild depression, in most cases the scores were still non-zero, and so the classification task is not as clearly binary as we have framed it here. Finally, limitations of the data set introduced issues of confounding variables (namely age and education), and prohibited us from contrasting speech from participants with only depression against speech from participants with only AD. We are currently undertaking our own data collection to overcome the various challenges of this data set.

Depression and Alzheimer's disease both present as different syndromes, and so it is probably unrealistic to clearly delineate between the many potential combinations of depression, AD, and other possible medical conditions through the analysis of a single language task. On the other hand, previous work suggests that this type of analysis can be very fine-grained and sensitive to subtle cognitive impairments. Ideally, future work will focus directly on the task of distinguishing AD from depression, using clinically validated data with a stronger emotional component.

Acknowledgments

The authors would like to thank Dr. John Strauss for his comments on an early draft of this paper. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), grant number RGPIN-2014-06020.
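As an illustration of the psycholinguistic features described in Section 5.3, the following sketch computes the mean and maximum of each affective dimension plus the first-person frequency for one transcript. The lexicon entries are invented placeholders standing in for the Warriner et al. (2013) norms, and the function is our illustration rather than the authors' code.

```python
# Hypothetical excerpt of a valence/arousal/dominance lexicon;
# the numbers are illustrative placeholders, not Warriner et al.'s norms.
NORMS = {
    "boy":    {"valence": 6.0, "arousal": 4.0, "dominance": 5.0},
    "steal":  {"valence": 2.5, "arousal": 5.5, "dominance": 4.5},
    "cookie": {"valence": 7.0, "arousal": 4.5, "dominance": 5.5},
}

FIRST_PERSON = {"i", "me", "my", "mine"}

def affective_features(tokens):
    """Mean and maximum valence/arousal/dominance over lexicon-covered
    tokens, plus the relative frequency of first-person words."""
    feats = {}
    for dim in ("valence", "arousal", "dominance"):
        scores = [NORMS[t][dim] for t in tokens if t in NORMS]
        feats["mean_" + dim] = sum(scores) / len(scores) if scores else 0.0
        feats["max_" + dim] = max(scores) if scores else 0.0
    feats["first_person_freq"] = sum(t in FIRST_PERSON for t in tokens) / len(tokens)
    return feats
```

Each transcript would contribute these values, alongside the acoustic and glottal features, to the candidate feature set for classification.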
References

George S. Alexopoulos, Christopher F. Murphy, Faith M. Gunning-Dixon, Vassilios Latoussakis, Dora Kanellopoulos, Sibel Klimstra, Kelvin O. Lim, and Matthew J. Hoptman. 2008. Microstructural white matter abnormalities and remission of geriatric depression. The American Journal of Psychiatry, 165(2):238–244.

Sharifa Alghowinem, Roland Goecke, Michael Wagner, Julien Epps, Michael Breakspear, Gordon Parker, et al. 2012. From joyous to clinically depressed: Mood detection using spontaneous speech. In Proceedings of the Florida Artificial Intelligence Research Society (FLAIRS) Conference, pages 141–146.

Sharifa Alghowinem, Roland Goecke, Michael Wagner, Julien Epps, Tom Gedeon, Michael Breakspear, and Gordon Parker. 2013. A comparative study of different classifiers for detecting depression from spontaneous speech. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8022–8026.

Paavo Alku, Helmer Strik, and Erkki Vilkman. 1997. Parabolic spectral parameter – a new method for quantification of the glottal flow. Speech Communication, 22(1):67–79.

Alzheimer's Association. 2015. 2015 Alzheimer's disease facts and figures. Alzheimer's & Dementia, 11(3):332.

R. Michael Bagby, Andrew G. Ryder, Deborah R. Schuller, and Margarita B. Marshall. 2004. The Hamilton Depression Rating Scale: has the gold standard become a lead weight? American Journal of Psychiatry, 161(12):2163–2177.

James T. Becker, François Boller, Oscar L. Lopez, Judith Saxton, and Karen L. McGonigle. 1994. The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594.

Sarah D. Breedin, Eleanor M. Saffran, and Myrna F. Schwartz. 1998. Semantic factors in verb retrieval: An effect of complexity. Brain and Language, 63:1–31.

Étienne Brunet. 1978. Le vocabulaire de Jean Giraudoux: structure et évolution. Éditions Slatkine.

Marc Brysbaert and Boris New. 2009. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4):977–990.

Romola S. Bucks, Sameer Singh, Joanne M. Cuerden, and Gordon K. Wilcock. 2000. Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology, 14(1):71–91.

Jieun Chae and Ani Nenkova. 2009. Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 139–147.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357.

Jeffrey F. Cohn, Tomas Simon Kruez, Iain Matthews, Ying Yang, Minh Hoai Nguyen, Margara Tejera Padilla, Feng Zhou, and Fernando De La Torre. 2009. Detecting depression from facial actions and vocal prosody. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, pages 1–7.

Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell. 2015. CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the Shared Task for the NAACL Workshop on Computational Linguistics and Clinical Psychology.

Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.

Bernard Croisile, Bernadette Ska, Marie-Josee Brabant, Annick Duchene, Yves Lepage, Gilbert Aimard, and Marc Trillet. 1996. Comparative study of oral and written picture description in patients with Alzheimer's disease. Brain and Language, 53(1):1–19.

Nicholas Cummins, Julien Epps, Michael Breakspear, and Roland Goecke. 2011. An investigation of depressed speech detection: Features and normalization. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pages 2997–3000.

Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F. Quatieri. 2015. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71:10–49.

Munmun De Choudhury, Scott Counts, and Eric Horvitz. 2013. Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference, pages 47–56.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP – a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 960–964.

Thomas Drugman and Yannis Stylianou. 2014. Maximum voiced frequency estimation: Exploiting amplitude and phase spectra. Signal Processing Letters, IEEE, 21(10):1230–1234.

Thomas Drugman, Baris Bozkurt, and Thierry Dutoit. 2012. A comparative study of glottal source estimation techniques. Computer Speech & Language, 26(1):20–34.

Thomas Drugman. 2014. Maximum phase modeling for sparse linear prediction of speech. Signal Processing Letters, IEEE, 21(2):185–189.

Rubén Fraile and Juan Ignacio Godino-Llorente. 2014. Cepstral peak prominence: A comprehensive analysis. Biomedical Signal Processing and Control, 14:42–54.

Kathleen C. Fraser, Jed A. Meltzer, and Frank Rudzicz. 2015. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422.

Ken J. Gilhooly and Robert H. Logie. 1980. Age-of-acquisition, imagery, concreteness, familiarity, and ambiguity measures for 1,944 words. Behavior Research Methods, 12:395–427.

Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination booklet. Lea & Febiger, Philadelphia, PA.

Anthony Habash and Curry Guinn. 2012. Language analysis of speakers with dementia of the Alzheimer's type. In Association for the Advancement of Artificial Intelligence (AAAI) Fall Symposium, pages 8–13.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Max Hamilton. 1960. A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry, 23(1):56–62.

Antony Honoré. 1979. Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

Christine Howes, Matthew Purver, and Rose McCabe. 2014. Linguistic indicators of severity and progress in online text-based therapy for depression. In Proceedings of the ACL Workshop on Computational Linguistics and Clinical Psychology.

William Jarrold, Bart Peintner, David Wilkins, Dimitra Vergryi, Colleen Richey, Maria Luisa Gorno-Tempini, and Jennifer Ogar. 2014. Aided diagnosis of dementia type through computer-based analysis of spontaneous speech. In Proceedings of the ACL Workshop on Computational Linguistics and Clinical Psychology, pages 27–36.

John Kane and Christer Gobl. 2011. Identifying regions of non-modal phonation using features of the wavelet transform. In 12th Annual Conference of the International Speech Communication Association 2011 (INTERSPEECH 2011), pages 177–180.

Amos D. Korczyn and Ilan Halperin. 2009. Depression and dementia. Journal of the Neurological Sciences, 283(1):139–142.

Hochang B. Lee and Constantine G. Lyketsos. 2003. Depression in Alzheimer's disease: heterogeneity and related issues. Biological Psychiatry, 54(3):353–362.

Lu-Shih Alex Low, Namunu C. Maddage, Margaret Lech, Lisa B. Sheeber, and Nicholas B. Allen. 2011. Detection of clinical depression in adolescents' speech during family interactions. Biomedical Engineering, IEEE Transactions on, 58(3):574–586.

Xiaofei Lu. 2010. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474–496.

Melanie Luppa, Claudia Sikorski, Tobias Luck, L. Ehreke, Alexander Konnopka, Birgitt Wiese, Siegfried Weyerer, H.-H. König, and Steffi G. Riedel-Heller. 2012. Age- and gender-specific prevalence of depression in latest-life – systematic review and meta-analysis. Journal of Affective Disorders, 136(3):212–221.

Juan J.G. Meilán, Francisco Martínez-Sánchez, Juan Carro, José A. Sánchez, and Enrique Pérez. 2012. Acoustic markers associated with impairment in language processing in Alzheimer's disease. The Spanish Journal of Psychology, 15(02):487–494.

Elliot Moore, Mark Clements, John W. Peifer, and Lydia Weisser. 2008. Critical analysis of the impact of glottal features in the classification of clinical depression in speech. Biomedical Engineering, IEEE Transactions on, 55(1):96–107.

Tomas Müller-Thomsen, Sönke Arlt, Ulrike Mann, Reinhard Maß, and Stefanie Ganzer. 2005. Detecting depression in Alzheimer's disease: evaluation of four different scales. Archives of Clinical Neuropsychology, 20(2):271–276.

Laura L. Murray. 2010. Distinguishing clinical depression from early Alzheimer's disease in elderly people: Can narrative analysis help? Aphasiology, 24(6-8):928–939.

Thin Nguyen, Dinh Phung, Bo Dao, Svetha Venkatesh, and Michael Berk. 2014. Affective and content analysis of online depression communities. IEEE Transactions on Affective Computing, 5(3):217–226.

Sylvester Olubolu Orimaye, Jojo Sze-Meng Wong, and Karen Jennifer Golden. 2014. Learning predictive linguistic features for Alzheimer's disease and related dementias using verbal utterances. In Proceedings of the 1st Workshop on Computational Linguistics and Clinical Psychology (CLPsych), pages 78–87.

Asli Ozdas, Richard G. Shiavi, Stephen E. Silverman, Marilyn K. Silverman, and D. Mitchell Wilkes. 2004. Investigation of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal risk. Biomedical Engineering, IEEE Transactions on, 51(9):1530–1540.

Niels D. Prins, Ewoud J. van Dijk, Tom den Heijer, Sarah E. Vermeer, Peter J. Koudstaal, Matthijs Oudkerk, Albert Hofman, and Monique M.B. Breteler. 2004. Cerebral white matter lesions and the risk of dementia. Archives of Neurology, 61(10):1531–1534.

Vassiliki Rentoumi, Ladan Raoufian, Samrah Ahmed, Celeste A. de Jager, and Peter Garrard. 2014. Features and machine learning classification of connected speech samples from patients with autopsy proven Alzheimer's disease with and without additional vascular pathology. Journal of Alzheimer's Disease, 42:S3–S17.

Philip Resnik, William Armstrong, Leonardo Claudino, and Thang Nguyen. 2015. The University of Maryland CLPsych 2015 shared task system. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 54–60, Denver, Colorado, June.

Brian Roark, John-Paul Hosom, Margaret Mitchell, and Jeffrey A. Kaye. 2007. Automatically derived spoken language markers for detecting mild cognitive impairment. In Proceedings of the 2nd International Conference on Technology and Aging (ICTA), Toronto, ON, June.

Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. 2004. Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18(8):1121–1133.

Stefan Scherer, Giota Stratou, Jonathan Gratch, and Louis-Philippe Morency. 2013. Investigating voice quality as a speaker-independent indicator of depression and PTSD. In Proceedings of Interspeech, pages 847–851.

Hans Stadthagen-Gonzalez and Colin J. Davis. 2006. The Bristol norms for age of acquisition, imageability, and familiarity. Behavior Research Methods, 38(4):598–605.

Calvin Thomas, Vlado Keselj, Nick Cercone, Kenneth Rockwood, and Elissa Asp. 2005. Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech. In Proceedings of the IEEE International Conference on Mechatronics and Automation, pages 1569–1574.

Lilian Thorpe. 2009. Depression vs. dementia: How do we assess? The Canadian Review of Alzheimer's Disease and Other Dementias, pages 17–21.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Victor Yngve. 1960. A model and hypothesis for language structure. Proceedings of the American Philosophical Society, 104:444–466.

Mark Zimmerman, Jennifer H. Martinez, Diane Young, Iwona Chelminski, and Kristy Dalrymple. 2013. Severity classification on the Hamilton depression rating scale. Journal of Affective Disorders, 150(2):384–388.
Towards Early Dementia Detection: Fusing Linguistic and Non-Linguistic Clinical Data
Joseph Bullard1, Cecilia Ovesdotter Alm2, Xumin Liu1, Rubén A. Proaño3, Qi Yu1
1College of Computing & Information Sciences, 2College of Liberal Arts, 3College of Engineering
Rochester Institute of Technology, Rochester, NY
{jtb4478,coagla,xmlics,qyuvks,rpmeie}@rit.edu
Abstract

Dementia is an increasing problem for an aging population, with a lack of available treatment options as well as expensive patient care. Early detection is critical to eventually postpone symptoms and to prepare health care providers and families for managing a patient's needs. Identification of diagnostic markers may be possible with patients' clinical records. Text portions of clinical records are integrated into predictive models of dementia development in order to gain insights towards automated identification of patients who may benefit from providers' early assessment. Results support the potential power of linguistic records for predicting dementia status, both in the absence of, and in complement to, corresponding structured non-linguistic data.

1 Introduction

Dementia is a problem for the aging population, and it is the 6th leading cause of death in the US (Alzheimer's Association, 2014). Around 35 million people worldwide suffer from some form of dementia, and this number is expected to double by 2030 (Prince et al., 2013). The most common form of dementia is Alzheimer's Disease, which has no known cure and limited treatment options. The clinical care for dementia focuses on prolonged symptom management, resulting in high personal and financial costs for patients and their families, straining the healthcare system in the process. Early detection is critical for potential postponement of symptoms, and for allowing families to adjust and adequately plan for the future. Despite this importance, current detection methods are costly, invasive, or unreliable, with most patients not being diagnosed until their symptoms have already progressed. A dementia diagnosis is a life-changing event, not only for the patient but also for the caretakers who have to adjust to the ensuing life changes. Improved understanding and recognition of early warning signs of dementia would greatly benefit the management of the disease, and enable long-term planning and logistics for healthcare providers, health systems, and caregivers.

With the advent of electronic clinical records comes the potential for large-scale analysis of patients' clinical data to understand or discover warning signs of dementia progression. The ability to follow the evolution of the disease based on patients' records would be key to developing intelligent support systems that assist medical decision-making and the provision of care. Current research using records mainly focuses on structured data, i.e., numerical or categorical data, such as test results or patient demographics (Himes et al., 2009). However, unstructured data, such as text notes taken during interactions between patients and doctors, present a potentially rich source of information that may be both more straightforwardly interpretable for humans and helpful for early dementia detection. While structured data from innovative diagnostic tests are often absent due to their cost and accessibility, text notes are generated for nearly every visit of a patient. Moreover, text notes in medical records are a source of natural language, and potentially more flexibly encode the diagnostic expertise and reasoning of the clinical professionals who write them.
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 12–22, San Diego, California, June 16, 2016. © 2016 Association for Computational Linguistics
Processing and computationally analyzing natural language remains a formidable task, but insights gleaned from it may translate particularly well into actual clinical practice, given its interpretable and accessible nature. Therefore, the ability to predict dementia development based on both structured and unstructured data would be useful for intelligent support systems that could automatically flag individuals who would benefit from further evaluation, reducing the impact of late diagnosis.

1.1 Related Work

Structured clinical data has been useful for identifying known disease markers (Himes et al., 2009). Procedural and diagnostic codes (e.g., ICD-9) can provide high specificity for identifying a disease, but may not provide sufficient sensitivity (Birman-Deych et al., 2005; Kern et al., 2006). A patient's history, however, is typically summarized by a clinician in text form, and can provide informative expressiveness and granularity not adequately captured by ICD-9 codes (Li et al., 2008). Interestingly, prior work has shown that natural language data can also help synthesize details and discover trends in medical records. Natural language processing and text mining have been applied to the identification of various known medical conditions. One method maps specific conditions to relevant terms from ontologies (curated knowledge bases). For example, SNOMED-CT predicted post-operative patient complications (Murff et al., 2011), and MedLEE (Friedman et al., 1995) identified colorectal cancer cases (Xu et al., 2011), suspicious mammogram findings (Jain and Friedman, 1997), and adverse events related to central venous catheters (Penz et al., 2007). Similarly, the language analysis-based resource SymText (Haug et al., 1995) has been used for detecting bacterial pneumonia cases from descriptions of chest X-rays (Fiszman et al., 2000).

While such studies with medical knowledge bases are useful for disease identification, they mostly involve conditions with well-known markers and known relationships between words and clinical concepts, typically available once the patient is symptomatic. However, many cognitive conditions, such as dementia, as well as other illnesses of interest, are not well understood, and their onsets gradually evolve over long periods of time. Furthermore, diagnosing such conditions is often primarily a function of experts' analysis, transcribed into notes. Thus, discovering lexical associations with the progression of these conditions could be tremendously beneficial, and could also help to validate and enhance the use of resources such as the Alzheimer's Disease Ontology (Malhotra et al., 2013).

Topic models have produced interesting results across domains (Chan et al., 2013; Resnik et al., 2013; McCallum et al., 2007; Paul and Dredze, 2011). Latent Semantic Indexing (LSI) has been used in medicine to discover statistical relationships between lexical items in a corpus. LSI has been used to supplement the development of a clinical vocabulary associated with post-traumatic stress disorder (Luther et al., 2011), and for forecasting ambulatory falls in elderly patients (McCart et al., 2013). However, LSI often requires around 300–500 concepts or dimensions to produce stable results (Bradford, 2008).
This limitation can be overcome by using LDA, whose identified groups of related terms are also more intuitive for human interpretation than LSI results. Additionally, representing documents by their LDA topic distributions reduces the dimensionality of the feature space. Furthermore, a study with microtext data demonstrated that document length influences topic models, and that aggregating short documents by author can be beneficial (Hong and Davison, 2010). This finding is relevant for this study due to the short nature of clinical texts.

This study is concerned with the fusion of linguistic data with structured non-linguistic data, as well as the integration of distinct models suitable for each. Approaches to the former case have been studied (Ruta and Gabrys, 2000). For the latter case, integration of classifiers typically involves multiple models of the same data, e.g., ensemble methods such as random forests, and often utilizes voting algorithms to produce the final combined output. However, here we focus on the combination of two distinct models: one based on linguistic data and one on structured non-linguistic data. This setup complicates the use of typical voting methods, and thus we explore a less frequently studied solution that leverages Bayesian probability to produce posterior distributions (Bailer-Jones and Smith, 2011).
1.2 Our Contributions

(1) We compare the performance of predictive modeling with linguistic vs. non-linguistic features, studying whether linguistic features used alone as predictors yield performance comparable to that of non-linguistic record data, especially when the latter exclude cognitive assessment scores from expert-administered tests. Our results show the utility of linguistic data for dementia prediction, e.g., when relevant structured data are unavailable in the records, as is often the case. (2) We explore the use of Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as textually interpretable dimensionality reduction of the lexical feature space into a topic space. We examine whether LDA can transform the sparse term space into a reduced topic space that meaningfully characterizes the texts, and we discuss its practical value for classification. (3) We study the challenge of fusing linguistic and non-linguistic data from records in additional classification experiments. If fusion improves performance, this would strengthen the utility of records-based linguistic features for disease prediction. We explore two integration methods: combining feature vectors computed independently from structured and text data, or leveraging probabilistic outputs of their respective trained classifiers.

This paper is organized as follows. Section 2 describes the data for the dementia detection problem. Section 3 presents our framework and integration. Section 4 outlines experiments and results. We conclude with future directions in Section 5.

2 Dementia Detection Problem: Data

This study makes a secondary use of a data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu). The ADNI study contains mostly structured data, such as measurements from brain imaging scans, blood, and cerebrospinal fluid biomarkers. The dataset also contains optional text fields in which examiners include notes or descriptions at their discretion.

[Figure 1: Distribution of subject diagnostic labels (n = 679). MCI = Mild Cognitive Impairment. Pie chart over the four classes NL (Normal), EMCI (Early MCI), LMCI (Late MCI), and AD (Alzheimer's Disease), with segment shares of 28%, 26%, 24%, and 22%.]

Each ADNI subject¹ is labeled upon entering the study. ADNI's original labeling scheme was modified in later phases of the study, resulting in some subjects having updated labels, while others remain unchanged. Therefore, only subjects who joined the study under the most recent phase, ADNI-2, are included in this work. Subjects with a label of SMC (Significant Memory Complaint; reflecting a self-reported memory issue) are excluded, as it is not a real diagnostic category outside of ADNI. A subject's record must have both unstructured text and structured data to be included, resulting in 679 usable subjects; from here on we refer to their data.

The ADNI-2 phase of the ADNI collection uses several labels to indicate the progression to Alzheimer's Disease: NL (Normal), EMCI (Early Mild Cognitive Impairment), LMCI (Late MCI), and AD (Alzheimer's Disease). The label (class) distribution of the remaining 679 subjects is relatively balanced (see Figure 1). Moderately sized data sets are common in clinical NLP contexts, where data is understandably more challenging to collate and access. For the text data, we considered text source files with considerable quantities of information.² All 679 subjects possess text notes in at least one of these four files. Entries from these files are aggregated by subject and concatenated to yield one text document per subject.

There are 22 structured data fields in this ADNI subset. The problem of missing values in the structured data was handled through multiple imputation (using the Amelia II package in R). This process uses log-likelihoods to generate probable complete datasets. Most structured data come from either cerebrospinal fluid samples or brain imaging scans, while three fields correspond to scores on cognitive exam evaluations: the Clinical Dementia Rating (CDR), the Mini Mental State Examination (MMSE), and the Alzheimer's Disease Assessment Scale (ADAS13). Importantly, a meaningful distinction can be made between structured data from cognitive assessments versus those from biophysical tests/markers.

¹The ADNI study refers to its participants as subjects.
²See Table in the supplementary documentation.
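The per-subject aggregation described in this section (entries from multiple source files concatenated into one document per subject) can be sketched as below; the subject IDs and note texts are invented for illustration and are not actual ADNI records:

```python
from collections import defaultdict

def aggregate_notes(entries):
    """Concatenate all free-text entries for a subject into one document.

    `entries` is an iterable of (subject_id, text) pairs drawn from the
    source files; empty or whitespace-only entries are skipped.
    """
    per_subject = defaultdict(list)
    for subject_id, text in entries:
        if text and text.strip():
            per_subject[subject_id].append(text.strip())
    # One concatenated document per subject.
    return {sid: " ".join(parts) for sid, parts in per_subject.items()}

# Hypothetical entries pooled from two source files.
entries = [
    ("s001", "Subject reports occasional memory problems."),
    ("s002", "Unremarkable exam."),
    ("s001", "Follow-up visit; MMSE administered."),
    ("s002", ""),
]
docs = aggregate_notes(entries)
```

Each resulting document is then the unit of text modeling for that subject in Section 3.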
A cognitive assessment is administered by a clinical professional, and thus is a reflection of that person's opinion and expertise. Essentially, cognitive assessment scores are outputs of professional interpretation, whereas other structured data are inputs for future interpretation. Cognitive assessments are also usually administered when providers already suspect dementia, and thus can be regarded as post-symptomatic. Patients, providers, and families will benefit from early detection, and such automated detection can also help prioritize the scheduling of expert-based cognitive assessments in resource-strained healthcare environments.

3 Modeling of Linguistic Data

There are three main feature representations for the linguistic data: bag-of-words (BOW), term-frequency inverse-document-frequency (tf-idf) on top of BOW, and topics from LDA.

Preprocessing and text normalization were performed in Python with NLTK, involving lowercasing, punctuation removal, stop-listing, and number removal (with the exception of age mentions). Besides regular stop-listing, words or phrases revealing a subject's diagnostic state (for example, MCI) were removed. Words in a document were lemmatized to merge inflections (removing distinctions between, for instance, cataracts and cataract). Abbreviation expansion used lexical lists. The 200 most frequent lexical content bigrams and trigrams were extracted and concatenated (breast cancer → breast_cancer). Lastly, while dates were removed, age expressions were kept after conversion and binning (AGE >=70 <80), as they may be important for this problem. Ages below 40 were represented as AGE <40 and ages at or above 90 as AGE >=90.

BOW and tf-idf were implemented using gensim (Řehůřek and Sojka, 2010). The standard BOW representation is very sparse, since any document only contains a small subset of the vocabulary. An extension weights the terms based on their distribution in the corpus using tf-idf: higher weights are assigned to terms which appear more times in fewer documents, and lower weights to terms which appear fewer times and/or in more documents. The feature space of tf-idf corresponds to standard BOW, but the values are the weights.

LDA is a generative model for identifying latent topics of related terms in a text corpus D, which consists of M documents and is assumed to contain K topics. Each topic k follows a multinomial distribution over the corpus vocabulary, parameterized by φ_k, which is drawn from a Dirichlet distribution, i.e., φ_k ∼ Dir(β). Similarly, each document follows a multinomial distribution over the set of topics in the corpus, also assumed to have a Dirichlet prior, denoted θ_i ∼ Dir(α). Working backwards, the probability of each term in a document is determined by the term distribution of its topic, which is in turn determined by the topic distribution of the document (Blei et al., 2003).

Under LDA, a document is modeled as a probabilistic distribution over topics, learned from the occurrence of terms through Collapsed Variational Bayesian (CVB) inference methods using the Stanford Topic Modeling Toolbox (Teh et al., 2007).³ Since topics are determined based on statistical relationships of terms, the effectiveness of the model can be hampered by extremely frequent or infrequent terms. For these reasons, we filter the vocabulary (Boyd-Graber et al., 2014, p. 9), removing terms appearing fewer than 3 times as well as the 30 most common terms.⁴

3.1 Integration with Structured Data Models

Integration is performed on the results of each unstructured modeling experiment (BOW, tf-idf, and LDA) and those of each structured one, with vs. without cognitive assessment features. For LDA, only the parameters with the highest performance are used in integration. The most intuitive form of integration is concatenation of the feature vectors for structured and unstructured data. Hence, concatenation refers to joining two vectors of length n and m into a single new vector of length n + m. This concatenated feature vector is used in classification.

The second approach to integration leverages posterior probabilities from the individual (linguistic vs. non-linguistic) classification models. For each input, a classifier produces a posterior probability of each class label and selects the most probable as its output. One classifier is trained on structured data features X_s, and another on unstructured data features X_u, resulting in two posterior distributions.

³Compared to Gibbs sampling (also explored initially), CVB converged on more sensible topics and performed better in model development.
⁴Other cutoff values were explored initially.
The probability of a class C_k is then denoted p(C_k | X_s, X_u). If these distributions are assumed to be conditionally independent with respect to their class labels, then by Bayes' theorem:

    p(C_k | X_s, X_u) ∝ p(C_k | X_s) p(C_k | X_u) / p(C_k)        (1)

From here, the class label with the highest probability is selected as the output; for details, see Bailer-Jones and Smith (2011).

For integration purposes, we use logistic regression for all classification experiments, implemented in scikit-learn (Pedregosa et al., 2011), to compute the posterior probabilities of all classes. We adopt a regularized logistic regression model to further improve predictive accuracy. By incorporating a regularization term into the basic logistic regression model, regularized logistic regression is able to reach a good bias-variance trade-off and hence achieve better generalization. The regularization term is governed by two parameters: C, the inverse of the regularization strength,⁵ and the penalty function (either the L1 or L2 vector norm). A smaller C corresponds to harsher penalties on large coefficients. The values of these parameters are selected through a grid search of possible values, evaluated by accuracy in cross-validation. The process is repeated for each labeling scheme.

4 Experimental Study

Each subject is annotated with a dementia status class label. Each subject's linguistic and structured non-linguistic data are used separately or integrated, as instances for classification. Two different classification problems are reported on. One involves all four classes (NL, EMCI, LMCI, AD). This 4-class problem is henceforth referred to as Standard. As discussed, early detection of dementia is critical. Accordingly, EMCI subjects are of particular interest, as they represent the beginning of the disease's progression. In the second experiment, we use 367 subjects having one of these two class labels (187 NL, 180 EMCI). While this does not perfectly match the reality of diagnosis, as it excludes the later dementia stages, it could be argued that those later stages are in less need of automatic analysis, since they are more readily observable. The resulting binary problem is referred to here as Early Risk.

The results and discussions presented later in this paper include a comparison to a majority class baseline; however, this is included merely as a standard comparison, while the actual comparison of interest is between integration of non-linguistic (with vs. without cognitive assessment scores) and linguistic features compared to those groups in isolation.

Held-out Data  The data set is randomly split into 80% (n = 544 subjects) for model development (dev set) and 20% (n = 135 subjects) for final evaluation (held-out set). Models are only exposed to the held-out set after satisfactory performance is achieved using the dev set. Class distributions are preserved in the dev and held-out sets.

LOO Cross-Validation  Although the dev and held-out sets have similar class distributions, overfitting is still a potential issue. For this reason, after the held-out evaluation is complete, a leave-one-out cross-validation (LOO or LOOCV) procedure is run on the entire merged dataset to serve as an additional evaluation, to either confirm or call into question the trends from held-out testing, which may be evident through differences in performance of the same features and models. LOOCV is a case of k-fold cross-validation where k is equal to the number of training instances, resulting in one fold for every data point, in which all other data points are used for training.

4.1 Topic Exploration and Evaluation

Tuning of the topic number parameter is essential to finding an appropriate LDA model. This process is performed by iteratively measuring classification accuracy at values of K ranging from 5 to 100, in multiples of 5, using the training data from the held-out evaluation split. LDA is used here with two goals in mind: to improve classification performance as a form of dimensionality reduction, and to provide human-interpretable topics. The former is more convenient and appropriate in the context of this work, but does not necessarily imply good results for the latter. A clinical expert viewing the output of such a model would likely prefer fewer topics, each with higher interpretability.

⁵It is common in other sources to use λ for the regularization strength, but the employed scikit-learn library instead uses C = 1/λ, i.e., the inverse of the regularization strength.
Accordingly, LDA models in classification are examined with various per-topic metrics known to correlate well with human evaluation. Thus, the best-performing reduced topic-feature space is selected for classification results and then additionally analyzed using the topic coherence metric (Mimno et al., 2011), which measures how often the most probable words of a topic appear together in documents, and has been shown to match well with human evaluation of topic quality (Boyd-Graber et al., 2014).

[Figure 2: Classification accuracy of LDA features on the held-out set, with increasing number of topics K (increments of 5). Line plot of accuracy (25%–60%) against K (10–100) for LDA, with bag-of-words and tf-idf accuracies shown as reference lines; the LDA curve peaks at (60, 50.4%) and (85, 49.6%).]

4.2 Classification of Standard Labels

The upper part of Table 1 shows the results of structured vs. linguistic features in isolation for the Standard problem, while the rest of the table shows results of integration techniques. Overall, performance improved in LOOCV, with a few exceptions (e.g., tf-idf), which is likely due to the greater number of available training instances in this evaluation. The performance of structured data alone is substantially higher than the majority class baseline, and more so when cognitive assessment features were included (+cognitive), as expected. Importantly, the BOW representation for text data achieved performance similar to that of the structured data without cognitive assessment scores, showing that simple text modeling can be useful in the common event that structured data are missing. The benefit of tf-idf appears inconsistent between held-out and LOOCV evaluations, possibly attributable to differences in document frequency of important terms in the different training data (dev vs. dev+held-out, respectively).

For LDA, performance was dependent on the number of topics, as seen in Figure 2, with two performance peaks (at K = 60 and K = 85) surpassing BOW. This supports the idea that dimensionality reduction by LDA can improve performance, but data size may influence results. This is a limitation of using an unsupervised algorithm for a supervised task. Performance differences between held-out and LOOCV indicate overfitting to the dev set in particular.

Table 2a shows 5 of the top 10 topics from the 60-topic model, based on topic coherence. This metric appears to aid in identifying interpretable topics. For example, Topic 2 is about cognitive assessment, referencing people in their 60s. Topic 45 pertains to regular medical visits (PCP is primary care physician), with some common concerns of elderly patients (back, heart). Topic 25 captures heart disease (cardiac, stent, chest pain) and related visits (hospitalization, admitted, discharged).

Linguistic and non-linguistic models are integrated to improve classification performance. Table 1 shows results for 16 integrated models (2 non-linguistic models × 4 linguistic models × 2 integration methods). Similar trends were observed for BOW and tf-idf in most cases. Interestingly, integrating with BOW is better than including cognitive assessment scores for held-out. The LDA-reduced features are again less consistent than other text features, but still comparatively improved performance in many cases. LDA integration experiments appear more robust between held-out and LOOCV than when LDA features were used alone, likely due to structured features taking the brunt of the decision.

It was predicted that the posterior probability composition method would yield better results than vector concatenation. Interestingly, this is not apparent, with many cases revealing the opposite. Yet overall, the best-performing cases include results where integration is done by this method. One potential limitation of the posterior probability composition is that a stronger decision is made when each of the underlying classifiers produces an asymmetric posterior class distribution. A limitation of this method is its dependence on strong or accurate decisions from the underlying models.
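The posterior probability composition of Eq. (1) can be sketched in a few lines; the class posteriors and prior below are invented toy numbers, not outputs of the actual trained models:

```python
def compose_posteriors(p_s, p_u, prior):
    """Combine per-class posteriors from the structured-data classifier (p_s)
    and the text classifier (p_u) via Eq. (1):
        p(C_k | X_s, X_u) ∝ p(C_k | X_s) * p(C_k | X_u) / p(C_k),
    then renormalize so the combined posterior sums to one."""
    unnorm = [ps * pu / pc for ps, pu, pc in zip(p_s, p_u, prior)]
    z = sum(unnorm)
    return [v / z for v in unnorm]

# Toy 4-class example over (NL, EMCI, LMCI, AD).
p_structured = [0.5, 0.2, 0.2, 0.1]      # p(C_k | X_s)
p_text       = [0.4, 0.4, 0.1, 0.1]      # p(C_k | X_u)
prior        = [0.28, 0.26, 0.24, 0.22]  # p(C_k)

combined = compose_posteriors(p_structured, p_text, prior)
predicted = max(range(len(combined)), key=combined.__getitem__)
```

In practice, p_s and p_u would be the `predict_proba` outputs of the two logistic regression models, and the class with the highest combined posterior is taken as the prediction.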
                                   Held-out Evaluation                          Leave-one-out Cross-validation
                                         NL      EMCI    LMCI    AD                   NL      EMCI    LMCI    AD
Features                         Acc.    P/R     P/R     P/R     P/R          Acc.    P/R     P/R     P/R     P/R
Baseline (majority class)        32.6%   33/100  −/0     −/0     −/0          27.5%   28/100  −/0     −/0     −/0
Structured (−cognitive)          51.9%   68/73   27/28   47/23   55/88        53.9%   57/77   43/34   40/26   66/81
Structured (+cognitive)          55.6%   80/84   35/38   33/20   56/79        62.7%   70/86   52/48   50/33   71/85
Bag-of-words                     48.1%   67/55   32/38   52/46   43/54        50.2%   59/67   40/39   43/42   60/51
Tf-idf                           55.6%   61/61   37/63   78/40   74/58        48.9%   49/75   39/43   49/32   73/42
LDA (K=85)                       49.6%   57/48   39/72   65/37   53/42        39.3%   39/62   34/32   39/29   52/32
LDA (K=60)                       50.4%   64/61   37/66   53/23   57/50        37.4%   39/54   32/33   35/28   48/33
S−cog ∪ Bag-of-words             61.5%   77/68   41/50   70/46   62/88        59.8%   69/79   48/44   45/41   73/76
S−cog ⊕ Bag-of-words             57.0%   90/59   41/53   47/40   59/83        58.3%   69/73   46/46   45/44   75/71
S+cog ∪ Bag-of-words             58.5%   78/71   35/41   59/46   61/79        61.3%   72/79   48/43   47/44   74/80
S+cog ⊕ Bag-of-words             59.3%   88/64   39/53   56/43   63/83        61.9%   74/80   48/48   48/45   77/75
S−cog ∪ Tf-idf                   53.3%   74/71   31/34   45/26   55/88        58.0%   62/83   49/38   45/31   68/81
S−cog ⊕ Tf-idf                   51.1%   83/55   37/59   39/14   51/88        59.6%   63/83   52/43   46/30   70/82
S+cog ∪ Tf-idf                   59.3%   79/86   41/44   45/26   58/79        64.7%   73/88   53/53   52/34   72/84
S+cog ⊕ Tf-idf                   61.5%   95/80   45/72   42/14   57/83        65.4%   73/89   55/53   54/35   73/85
S−cog ∪ LDA (K=85)               54.8%   73/73   31/34   56/29   55/88        56.4%   60/82   46/33   42/31   70/80
S−cog ⊕ LDA (K=85)               44.4%   80/46   28/50   27/09   50/88        56.3%   60/79   45/36   42/30   71/81
S+cog ∪ LDA (K=85)               58.5%   84/86   39/44   38/23   58/79        62.0%   70/86   49/45   46/34   74/84
S+cog ⊕ LDA (K=85)               58.5%   90/77   44/69   36/11   53/79        63.6%   71/87   52/48   49/35   75/85
S−cog ∪ LDA (K=60)               51.1%   69/61   30/34   48/29   55/88        55.7%   59/78   47/37   40/28   69/82
S−cog ⊕ LDA (K=60)               45.9%   78/48   30/53   33/09   49/88        56.4%   60/78   47/37   40/30   71/82
S+cog ∪ LDA (K=60)               60.0%   88/86   44/53   37/20   56/79        62.4%   72/86   50/47   45/33   74/85
S+cog ⊕ LDA (K=60)               59.3%   92/77   45/72   33/11   54/79        62.9%   74/86   51/48   44/34   75/84

Table 1: Results on the Standard problem (4 classes). Integration by vector concatenation is indicated by ∪, and posterior probability composition by ⊕. Structured (−cognitive) = S−cog; Structured (+cognitive) = S+cog.

ID  Top 10 Words
3   corroborated, subjective, continues meet, score, factor, other, SP, AGE >=60 <70, controlled medication, unremarkable
2   impression, CDR, MMSE, ADLS, AGE >=60 <70, cog, amnestic, global, function, score
17  medical, consistent, status, function, continues, health, occasional, active, daily, functional
45  blood, pressure, month, visit, PCP, diagnosed, dizziness, back, doctor, heart
25  hospital, admitted, discharged, stent, cardiac, went, chest pain, AE, anxiety, total

Table 2a: Five high-ranked topics from the Standard problem with K = 60 (ranked by topic coherence).

ID  Top 10 Words
38  completed, visit, reported, mg, performed, protocol, testing, study partner, blood, year
25  criterion, subjective, corroborated, factor, other, AGE >=60 <70, continues meet, score, memory problems, confounding
55  hip, left, right, removed, normal, arthritis, cataract, eye, allergy, hand
36  year, smoked, ago, pack, o, quit, per day, c, urinary frequency, memory problems
56  work, up, valve, cardiac, aortic, ER, heart, x, cardiologist, visit

Table 2b: Five high-ranked topics from the Early Risk problem with K = 100 (ranked by topic coherence).
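The coherence ranking behind Tables 2a and 2b can be sketched as follows. For one topic's top words, the metric of Mimno et al. (2011) sums log((D(v_m, v_l) + 1) / D(v_l)) over word pairs, where D counts (co-)document frequencies; the toy corpus below is invented for illustration:

```python
import math

def topic_coherence(top_words, docs):
    """Topic coherence (Mimno et al., 2011) for one topic's top words.

    `docs` is a list of token lists; higher (less negative) scores mean the
    topic's most probable words tend to appear in the same documents."""
    doc_sets = [set(d) for d in docs]

    def df(w):            # document frequency of one word
        return sum(w in s for s in doc_sets)

    def co_df(w1, w2):    # co-document frequency of a word pair
        return sum(w1 in s and w2 in s for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((co_df(top_words[m], top_words[l]) + 1)
                              / df(top_words[l]))
    return score

# Invented toy corpus of tokenized notes.
docs = [
    ["hospital", "admitted", "cardiac", "stent"],
    ["cardiac", "stent", "chest", "pain"],
    ["memory", "score", "mmse"],
]
coherent = topic_coherence(["cardiac", "stent"], docs)     # co-occurring pair
incoherent = topic_coherence(["cardiac", "memory"], docs)  # never co-occur
```

Topics whose top words co-occur in documents score higher, which is how the tables' five topics were ranked out of each model's top ten.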
Vector concatenation is not subject to this limitation, but has the drawback of potentially overwhelming a smaller feature set with a larger sparse one. As for class-specific differences, the NL (normal) and AD (Alzheimer's disease) subjects were classified with higher precision and recall scores than were the MCI classes in nearly all integration experiments, pointing to the challenge of subtler disease stages.

4.3 Classification of Early Risk

In addition to the experiments above, the more specific problem of distinguishing normal (NL) subjects from those with early mild cognitive impairment (EMCI) was also explored. Only LOOCV is performed, because the subsampling of NL and EMCI subjects slightly distorts the class distributions in the original held-out set. Results are given in Table 3. As in the Standard problem, all non-linguistic and linguistic feature types perform well above the majority class baseline. One major difference here is that all linguistic data types outperform the structured features when cognitive assessments are excluded. This may suggest a potential linguistic difference in clinical notes at the onset of MCI.

The number of LDA topics is selected as before (but using the whole Early Risk subsample, as opposed to the Standard dev set). Two peaks found at K = 65 and K = 100 achieve the same classification accuracy, but do not outperform BOW and tf-idf. The difficulties LDA faced in the Standard problem are also faced here, and thus similar performance shortcomings are observed. The ability to approximately match tf-idf performance is still noteworthy, since the LDA features are a smaller and denser representation than tf-idf, which may be more easily interpretable by clinical professionals.

Table 2b shows 5 of the top 10 topics from the 100-topic model trained on the Early Risk subset, based on the topic coherence metric. A consequence of a smaller sample of subjects is a smaller vocabulary and thus weaker statistical judgments. Topics 38, 25, 36, and 56 appear to be about routine visits/tests, cognitive evaluations, smoking habits, and cardiac issues, respectively. Topic 55 is an example of a chained topic (Boyd-Graber et al., 2014, p. 17), where unrelated words are linked together through shared co-occurring words, in this case with left and right seeming to link eye and hand, along with their associated terms cataract and arthritis.

                                   LOOCV
                                         NL      EMCI
Features                         Acc.    P/R     P/R
Baseline                         51.0%   51/100  −/0
Structured (−cognitive)          67.6%   67/73   69/62
Structured (+cognitive)          79.8%   78/84   82/76
Bag-of-words                     70.8%   71/73   71/69
Tf-idf                           69.2%   68/75   71/63
LDA (K=65)                       68.9%   67/76   71/62
LDA (K=100)                      68.9%   68/74   70/63
S−cog ∪ Bag-of-words             76.8%   76/79   78/74
S−cog ⊕ Bag-of-words             76.0%   77/77   76/76
S+cog ∪ Bag-of-words             77.1%   76/80   78/74
S+cog ⊕ Bag-of-words             80.7%   80/82   81/79
S−cog ∪ Tf-idf                   72.2%   71/78   74/66
S−cog ⊕ Tf-idf                   72.8%   71/79   75/66
S+cog ∪ Tf-idf                   80.7%   79/85   83/77
S+cog ⊕ Tf-idf                   83.1%   82/86   84/81
S−cog ∪ LDA (K=65)               72.2%   71/78   74/67
S−cog ⊕ LDA (K=65)               72.5%   71/78   74/67
S+cog ∪ LDA (K=65)               79.0%   78/82   81/76
S+cog ⊕ LDA (K=65)               79.3%   78/83   81/76
S−cog ∪ LDA (K=100)              71.4%   70/77   73/66
S−cog ⊕ LDA (K=100)              71.9%   70/76   74/66
S+cog ∪ LDA (K=100)              80.4%   80/82   81/78
S+cog ⊕ LDA (K=100)              80.9%   80/83   82/79

Table 3: Classification performance on Early Risk (2 classes). Vector concatenation is indicated by ∪, and posterior probability composition by ⊕. Structured with (S+cog) and without (S−cog) cognitive assessment scores.

The performance trends for the integrated models are slightly more consistent for the Early Risk problem than they were for the Standard problem. When excluding cognitive assessment scores, all integration experiments result in a modest improvement, although there is little to no difference between the two integration methods employed. This may suggest that results can be achieved without the extra sophistication provided by posterior probability composition, or that further sophistication is needed beyond either of these techniques. In general, our results further justify the integration of linguistic and non-linguistic features and/or models.
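For reference, the ∪ integration used throughout Tables 1 and 3 is plain feature-vector concatenation, as defined in Section 3.1; a minimal sketch with invented toy vectors:

```python
def concatenate(x_structured, x_text):
    """∪ integration: join a structured feature vector of length n and a
    text feature vector of length m into a single vector of length n + m."""
    return list(x_structured) + list(x_text)

# Invented toy vectors: 3 structured features, 4 text features.
fused = concatenate([0.7, 1.2, 0.0], [0.0, 0.3, 0.0, 0.9])
```

The fused vector is then passed to a single classifier, in contrast to the ⊕ method, which combines the outputs of two separately trained classifiers.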
5 Conclusion and Future Work

We explored classification of the dementia progression status of subjects from a study on Alzheimer's disease, and the integration of text data models with those of structured data, with vs. without cognitive assessment scores. Experiments support texts' viability as a useful source for dementia classification, as an important complement to structured data, or alone when structured data are missing. LDA was also studied as interpretable dimensionality reduction. With a larger sample size, the LDA model may converge to a more stable set of topics, but other appropriate public datasets (with both linguistic and non-linguistic data) are presently not available. An alternative is to apply supervised versions of LDA (Blei and McAuliffe, 2007; Ramage et al., 2009). Furthermore, with access to a pool of clinical specialists, it would be useful to integrate experts in evaluating the latent topics. Chang et al. (2009) proposed various such human evaluation techniques, such as the word intrusion task, in which human evaluators are presented with a list of n high-probability terms of a randomly chosen topic, plus one additional low-probability term from that topic, and asked to identify the latter. A drawback is that it would require access to a large enough pool of dementia specialists.

Other avenues of future work would include the incorporation of lexical similarity measures from sources like WordNet.

Acknowledgement

This study was supported by a Kodak Endowed Chair award from the Golisano College of Computing and Information Sciences at the Rochester Institute of Technology.

We are thankful for being able to use the ADNI project's dataset in this work. As per the data-use agreement, information about the ADNI data is included verbatim. Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer's Association; Alzheimer's Drug Discovery Foundation; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

References

Alzheimer's Association. 2014. 2014 Alzheimer's Disease Facts and Figures. Alzheimer's & Dementia, 10.

C.A.L. Bailer-Jones and K. Smith. 2011. Combining probabilities. GAIA-C8-TN-MPIA-CBJ-053, July.

Elena Birman-Deych, Amy D. Waterman, Yan Yan, David S. Nilasema, Martha J. Radford, and Brian F. Gage. 2005. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. Medical Care, 43:480–485.

David M. Blei and Jon D. McAuliffe. 2007. Supervised topic models. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, pages 121–128, Vancouver, B.C., Canada.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.
Roger B. Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 153–162.

Katherine Redfield Chan, Xinghua Lou, Theofanis Karaletsos, Christopher Crosbie, Stuart Gardos, David Artz, and Gunnar Rätsch. 2013. An empirical analysis of topic modeling for mining cancer clinical notes. In 13th IEEE International Conference on Data Mining Workshops, pages 56–63, Dallas, Texas, December 7–10.

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems, Vancouver, British Columbia.

Marcelo Fiszman, Wendy Webber Chapman, Dominik Aronsky, R. Scott Evans, and Peter J. Haug. 2000. Automatic detection of acute bacterial pneumonia from chest x-ray reports. Journal of the American Medical Informatics Association, 7(6):593–604.

Carol Friedman, Stephen B. Johnson, Bruce Forman, and Justin Starren. 1995. Architectural requirements for a multipurpose natural language processor in the clinical environment. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pages 347–351.

Peter J. Haug, Spence Koehler, Lee Min Lau, Ping Wang, Roberto Rocha, and Stanley M. Huff. 1995. Experience with a mixed semantic/syntactic parser. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pages 284–288.

Blanca E. Himes, Yi Dai, Isaac S. Kohane, Scott T. Weiss, and Marco F. Ramoni. 2009. Prediction of chronic obstructive pulmonary disease (COPD) in asthma patients using electronic medical records. Journal of the American Medical Informatics Association, 16:371–379.

Liangjie Hong and Brian D. Davison. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 80–88, Washington DC, USA.

Nilesh L. Jain and Carol Friedman. 1997. Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. In Proceedings of the American Medical Informatics Association Annual Fall Symposium, pages 829–833.

Elizabeth F. O. Kern, Miriam Maney, Donald R. Miller, Chin-Lin Tseng, Anjali Tiwari, Mangala Rajan, David Aron, and Leonard Pogach. 2006. Failure of ICD-9-CM codes to identify patients with comorbid chronic kidney disease in diabetes. Health Services Research, 41(2):564–580.

Li Li, Herbert S. Chase, Chintan O. Patel, Carol Friedman, and Chunhua Weng. 2008. Comparing ICD9-encoded diagnoses and NLP-processed discharge summaries for clinical trials pre-screening: A case study. In American Medical Informatics Association Annual Symposium Proceedings 2008, pages 404–408.

Stephen Luther, Donald Berndt, Dezon Finch, Michael Richardson, Edward Hickling, and David Hickam. 2011. Using statistical text mining to supplement the development of an ontology. Journal of Biomedical Informatics, 44:S86–S93.

Ashutosh Malhotra, Erfan Younesi, Michaela Gündel, Müller, Michael T. Heneka, and Martin Hofmann-Apitius. 2013. ADO: A disease ontology representing the domain knowledge specific to Alzheimer's disease. Alzheimer's and Dementia, 10:238–246.

Andrew McCallum, Xuerui Wang, and Andrés Corrada-Emmanuel. 2007. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30(1):249–272, Oct.

James A. McCart, Donald J. Berndt, Jay Jarman, Dezon K. Finch, and Stephen Luther. 2013. Finding falls in ambulatory care clinical documents using statistical text mining. Journal of the American Medical Informatics Association, 20(5):906–914.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 262–272, Edinburgh, United Kingdom.

Harvey J. Murff, Fern FitzHenry, Michael E. Matheny, Nancy Gentry, Kristen L. Kotter, Kimberly Crimin, Robert S. Dittus, Amy K. Rosen, Peter L. Elkin, Steven H. Brown, and Theodore Speroff. 2011. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA, 306(8):848–855, August.

Michael J. Paul and Mark Dredze. 2011. You are what you tweet: Analyzing Twitter for public health. In Proceedings of the Fifth International Conference on Weblogs and Social Media, pages 265–272, Barcelona, Catalonia, Spain, July 17–21.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
21 chine learning in Python. Journal of Machine Learn- ing Research, 12:2825–2830. Janet F. E. Penz, Adam B. Wilcox, and John F. Hurdle. 2007. Automated identification of adverse events re- lated to central venous catheters. Journal of Biomedi- cal Informatics, 40(2):174–182, April. Martin Prince, Matthew Prina, and Maelenn¨ Guerchet. 2013. World Alzheimer Report 2013. Alzheimer’s Disease International (ADI), London, September. Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi- labeled corpora. In Proceedings of the 2009 Con- ference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP ’09, pages 248–256, Stroudsburg, PA, USA. Association for Computational Linguistics. Philip Resnik, Anderson Garron, and Rebecca Resnik. 2013. Using topic modeling to improve prediction of neuroticism and depression in college students. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1348–1353, Seattle, Washington, USA, 18–21 October. Dymitr Ruta and Bogdan Gabrys. 2000. An overview of classifier fusion methods. Computing and Information Systems, 7:1–10. Yee Whye Teh, David Newman, and Max Welling. 2007. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19. Radim Rehˇ u˚rekˇ and Petr Sojka. 2010. Software frame- work for topic modelling with large corpora. In Pro- ceedings of the LREC 2010 Workshop on New Chal- lenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. Hua Xu, Zhenming Fu, Anushi Shah, Yukun Chen, Neeraja B. Peterson, Qingxia Chen, Subramani Mani, Mia A. Levy, Qi Dai, and Josh C. Denny. 2011. Extracting and integrating data from entire electronic health records for detecting colorectal cancer cases. 
In American Medical Informatics Association An- nual Symposium Proceedings 2011, pages 1564–1572, Washington, DC, USA.
Self-Reflective Sentiment Analysis
Benjamin Shickel, University of Florida, Gainesville, FL 32611, shickelb@ufl.edu
Martin Heesacker, University of Florida, Gainesville, FL 32611, heesack@ufl.edu
Sherry Benton, TAO Connect, Inc., Gainesville, FL 32601, [email protected]
Ashkan Ebadi, University of Florida, Gainesville, FL 32611, ashkan.ebadi@ufl.edu
Paul Nickerson, University of Florida, Gainesville, FL 32611, pvnick@ufl.edu
Parisa Rashidi, University of Florida, Gainesville, FL 32611, parisa.rashidi@ufl.edu
Abstract

As self-directed online anxiety treatment and e-mental health programs become more prevalent and begin to rapidly scale to a large number of users, the need to develop automated techniques for monitoring patient progress and detecting early warning signs is at an all-time high. While current online therapy systems work based on explicit quantitative feedback from various survey measures, little attention has been paid thus far to the large amount of unstructured free text present in the monitoring logs and journals submitted by patients as part of the treatment process. In this paper, we automatically categorize patients' internal sentiment and emotions using machine learning classifiers based on n-grams, syntactic patterns, sentiment lexicon features, and distributed word embeddings. We report classification metrics on a novel mental health dataset.

1 Introduction

As mental health awareness becomes more widespread, especially among at-risk populations such as young adults and college-aged students, many institutions and universities are beginning to offer online anxiety and depression treatment programs to supplement traditional therapy services. A key component of these largely self-directed programs is the regular completion of journals, in which patients describe how they are feeling. These journals contain a wide variety of information, including a patient's specific fears, worries, triggers, reactions, or simply status updates on their emotional state. At the current time, these journals are either reviewed by therapists (who are vastly outnumbered by the users) or left unused, with the assumption that simply talking about negative emotions is therapy in and of itself. We see a large and novel opportunity for applying natural language techniques to these unstructured mental health records. In this paper, we focus on analyzing the sentiment of patient text.

The largest motivator of existing sentiment analysis research has arguably been the detection of user sentiment towards entities, such as products, companies, or people. We define this type of problem as external sentiment analysis. In contrast, when working in the mental health domain (particularly with self-reflective textual journals), we are trying to gauge a patient's internal sentiment towards their own thoughts, feelings, and emotions. The differences in goals, types of sentiment, and distribution of polarity present unique challenges for applying sentiment analysis to this new domain.

One key aspect that sets our task apart from traditional sentiment analysis is our treatment of polarity classes. Traditionally, sentiment is categorized as either positive, negative, or neutral. In contrast, we subdivide the neutral polarity class into two distinct classes: both positive and negative and neither positive nor negative. We justify this decision based on several studies showing the independent dimensions of positive and negative affect in human emotion (Warr et al., 1983; Watson et al., 1988; Diener et al., 1985; Bradburn, 1969), and feel that this is a more appropriate framework for our domain. This choice represents a novel characterization of sentiment analysis in mental health, and is one we hope
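The four-way labeling scheme described above can be viewed as a decision rule over independent positive and negative affect indicators. The helper below is a hypothetical illustration of that scheme only; the paper's actual labels were assigned by human annotators, not by a rule like this.

```python
def polarity_label(has_positive: bool, has_negative: bool) -> str:
    """Map independent positive/negative affect indicators to the four
    polarity classes used in this paper (an illustrative sketch; actual
    labels in the dataset were produced by human annotators)."""
    if has_positive and has_negative:
        return "both positive and negative"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neither positive nor negative"

# The traditional "neutral" class splits into two distinct classes:
print(polarity_label(True, True))    # -> both positive and negative
print(polarity_label(False, False))  # -> neither positive nor negative
```

Note that under this framing, a traditional three-class scheme would collapse the first and last outputs into a single neutral label, discarding exactly the distinction the paper argues is clinically meaningful.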
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 23–32, San Diego, California, June 16, 2016. © 2016 Association for Computational Linguistics

to see made in future studies in this domain.

Our primary focus in this paper is on the automatic and reliable categorization of patient responses as positive, negative, both positive and negative, or neither positive nor negative. Such a system has far-reaching implications for the online therapy setting, in which automatic language analysis can be incorporated into existing patient evaluation and progress monitoring, or serve as an early warning indicator for patients with severe cases of depression and/or risk of suicide. Additionally, tools based on this type of internal sentiment analysis can provide immediate feedback on mental health and thought processes, which can become distorted and unclear in patients stuck in anxiety or depression. In the future, sentiment-based mental health models can be incorporated into the characterization and treatment of patients with autism, dementia, or other broadly-defined language disorders.

In short, our main contributions are summarized by the following:

• We present a novel sentiment analysis dataset, annotated by psychology experts, specifically targeted towards the mental health domain.

• We introduce the notion of subdividing the traditional neutral polarity class into both a dual polarity sentiment (both positive and negative) and a neither positive nor negative sentiment.

• We identify the unique challenges faced when applying existing sentiment analysis techniques to mental health.

• We present an automatic model for classifying the polarity of patient text, and compare our work to models trained on existing sentiment corpora.

2 Related Work

From a technical point of view, our methods fall squarely in the realm of sentiment analysis, a field of computer science and computational linguistics primarily concerned with analyzing people's opinions, sentiments, attitudes, and emotions from written language (Liu, 2010). In our paper, we apply sentiment analysis and polarity detection techniques to the largely untapped mental health domain.

In the past decade, sentiment analysis techniques have been applied to a wide variety of areas. Although the majority of work has dealt in areas outside of mental health, we must discuss the bulk of previous sentiment analysis research, from which our techniques are derived.

Given the explosive rise in popularity of social media platforms, a large number of studies have focused on user sentiment in microblogs such as Twitter (Barbosa and Feng, 2010; Pak and Paroubek, 2010; Agarwal et al., 2011; Kouloumpis et al., 2011; Nielsen, 2011; Wang et al., 2011; Zhang et al., 2011; Montejo-Ráez et al., 2012; Spencer and Uchyigit, 2012; Montejo-Ráez et al., 2014; Tang et al., 2014). Other studies have explored user sentiment in web forum opinions (Abbasi et al., 2008), movie reviews (Agrawal and Siddiqui, 2009), blogs (Melville et al., 2009), and Yahoo! Answers (Kucuktunc et al., 2012). As we will show, the models proposed in all of these works cannot be directly transferred to polarity detection in mental health (as sentiment analysis remains a largely domain-specific task), but our initial techniques are largely based on these previous works.

Although the majority of sentiment analysis has focused on user opinions towards entities, there are studies in domains more directly related to our area. One such study analyzed the sentiment of suicide notes (Pestian et al., 2012). Another mined user sentiment in MOOC discussion forums (Wen et al., 2014).

Sentiment analysis and polarity detection techniques are widely varied (Mejova and Srinivasan, 2011; Feldman, 2013), and as this research area is still garnering a great deal of interest, many studies have proposed novel methods. These include topic-level sentiment analysis (Nasukawa and Yi, 2003; Kim and Hovy, 2004), phrase-level sentiment analysis (Wilson et al., 2009), linguistic approaches (Wiegand et al., 2010; Benamara et al., 2007; Tan et al., 2011), semantic word vectorization (Maas et al., 2011; Tang et al., 2014), various lexicon-based approaches (Taboada et al., 2011; Baccianella et al., 2010), information-theoretic techniques (Lin et al., 2012), and graph-based methods (Montejo-Ráez et al., 2014; Pang and Lee, 2004; Wang et al., 2011). In recent years, approaches based on deep learning architectures have also shown promising results (Glorot et al., 2011; Socher et al., 2013). As a starting point for our new internal sentiment analysis framework, in this paper we apply more straightforward approaches based on linear classifiers.

3 Dataset

In this section, we detail the construction of our mental health sentiment dataset. While not yet publicly available, we plan to release our data in the near future.

In order to build a dataset of real patient responses, we partnered with TAO Connect, Inc.¹, an online therapy program designed to treat anxiety, depression, and stress. This program is being implemented in several universities around the country, and as such, the primary demographic is college-aged students.

As part of the TAO program, patients complete several self-contained content modules designed to teach awareness and coping strategies for anxiety, depression, and stress. Additionally, patients regularly submit several types of journals and logs pertaining to monitoring, anxiety, depression, worries, and relaxation. The free text contained in these logs is the source of our dataset. In total, we collected 4021 textual responses from 342 unique patients, with submission dates ranging from April 2014 to November 2015. Patients were de-identified and the collection process was part of an IRB-approved study. Responses typically range from single sentences to a single paragraph, with an average of 39 words per response. We show a complete word count distribution in Figure 1.

Figure 1: Distribution of word counts per response for our collected dataset. On average, each response contains 39 words, with a minimum of two words and a maximum of 762 words. 30 responses had more than 200 words, which we do not show.

To help transform our collection of free text responses into a classification dataset suitable for polarity prediction, we solicited the expertise of three psychology undergraduates (all female) under the supervision of one psychology professor (male) to provide polarity labels for our response documents. The annotators were tasked with reading each individual response, and assigning it a label of positive, negative, both positive and negative, or neither positive nor negative. The inter-rater agreement reliability (Cohen's kappa) between annotators 1 and 2 was 0.5, between annotators 2 and 3 was 0.67, and between annotators 1 and 3 was 0.48. The overall agreement reliability between all annotators (Fleiss' kappa) was 0.55. We used a majority-vote scheme to assign a single label to each piece of text, where 62% of the documents had full annotator agreement, 35% had a clear label majority, and only 3% had no majority, in which case we picked the label from the annotator with the best aggregate reliability. Table 1 shows label counts for each annotator, as well as the final count after applying the majority-vote process.

Annotator     POS   NEG    BOTH   NEITHER
Annotator 1   494   2569   556    402
Annotator 2   321   2509   552    638
Annotator 3   531   2152   383    954
Final         414   2545   510    548

Table 1: Label counts per annotator, as well as the final dataset label counts obtained via a majority-voting scheme. For brevity, we denote the positive label as POS, negative as NEG, both positive and negative as BOTH, and neither positive nor negative as NEITHER.

To provide a clearer picture of the types of responses in our dataset, we present one short concrete example of each polarity class below.

• Positive - I tried to say good things for them since I know there was a lot of arguments happening.

• Negative - I don't do well at parties, I'm not interesting.

• Both Positive and Negative - I shouldn't have taken things so seriously.

• Neither Positive nor Negative - I wrote in my journal, and read till I was tired enough to fall asleep.

¹http://www.taoconnect.org/
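The majority-vote aggregation described above can be sketched as follows. The reliability scores and label strings here are hypothetical stand-ins (the paper reports pairwise Cohen's kappa and an overall Fleiss' kappa but not the per-annotator aggregate scores used for tie-breaking), so this is an illustration of the scheme, not the authors' code.

```python
from collections import Counter

def aggregate_label(labels, annotator_reliability):
    """Majority vote over one response's annotations; when no label has a
    majority, fall back to the annotator with the best aggregate
    reliability (a sketch of the Section 3 labeling scheme)."""
    votes = Counter(labels)
    top_label, top_count = votes.most_common(1)[0]
    if sum(1 for c in votes.values() if c == top_count) == 1:
        return top_label  # full agreement or a clear majority
    # No majority: defer to the most reliable annotator
    # (hypothetical per-annotator agreement scores).
    best = max(range(len(labels)), key=lambda i: annotator_reliability[i])
    return labels[best]

reliability = [0.49, 0.585, 0.575]  # hypothetical aggregate scores
print(aggregate_label(["POS", "NEG", "NEG"], reliability))   # -> NEG (majority)
print(aggregate_label(["POS", "NEG", "BOTH"], reliability))  # -> NEG (tie-break)
```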
In the above examples, the challenges of applying sentiment analysis and traditional text classification techniques to self-reflective text become more apparent. For instance, the positive example mentions arguments, typically associated with negative sentiment, while the negative example mentions parties, a word usually associated with a positive connotation. Additionally, the both positive and negative example exhibits subtle cues that differentiate it from the other three polarity classes.

4 Method

To predict polarity from patient text, we employ several established machine learning and text classification techniques. We begin by preprocessing the annotated patient responses, which we refer to interchangeably as documents. We then extract several types of attributes from each response, referred to as features. The extracted features and polarity annotations are used to build a logistic regression classifier, which is a linear machine learning model we use to predict the final polarity label. In this section, we describe each step in detail.

4.1 Preprocessing

Starting with the raw documents obtained from our data collection process, we apply several traditional preprocessing steps to the text. First, based on experimental feedback, we convert all the text to lowercase and strip all documents of punctuation following a standard tokenization phase. While these are relatively standard steps, it should be explicitly noted that we did not remove stop words from our corpus, which is a common preprocessing technique in other domains, due to lowered classification performance. This can be partially explained by the nature of our domain; for example, the phrase "what if" tended to be associated with worrying about the future - traditionally, both of these words are considered stop words and filtered out, losing valuable information for our task.

4.2 Feature Extraction

Next, we extract several types of features from the preprocessed documents. In our experiments, we evaluate classification performance with various feature subsets.

4.2.1 N-Gram Features and POS Tags

As a starting point for our experiments with this new domain, the most numerous of our extracted features are derived from a traditional "bag of n-grams" approach, in which we create document vectors comprised of word unigram, bigram, and/or trigram counts. As previous works have shown, this allows the capture of important syntactical information like negation, which would otherwise be missed in a standard "bag of words" (i.e., unigrams only) model.

In order to constrain the scope of later feature subset experiments, we first obtain the n-gram combination resulting in the best performance for our newly created dataset. We denote this optimal n-gram setting as the "n-grams only" model in later experiments. In this experiment, we perform a 10-fold cross-validated randomized parameter search using six possible word n-gram combinations: unigrams, bigrams, trigrams, unigrams + bigrams, bigrams + trigrams, and unigrams + bigrams + trigrams. We split cross-validation folds on responses, as we expect patient responses to be independent over time. All extracted n-gram counts are normalized by tf-idf (term frequency-inverse document frequency), a common technique used for describing how important particular n-grams are to their respective documents. The results of this n-gram comparison experiment are shown in Figure 2, where it is clear that using a combination of unigrams and bigrams resulted in the best performance.

In an effort to capture more subtle patterns of grammatical structure, we also experiment with augmenting each document with each word's Penn Treebank part-of-speech (POS) tag. In these experiments, we augment our documents by appending these tags, in order, to the end of every sentence, allowing for our n-gram extraction methods to capture syntactic language patterns. During the tokenization process, we ignore any n-grams consisting of both words and part-of-speech tags.

Figure 2: Classification results using only word n-gram features for our 4-class polarity dataset. Results were obtained following a 10-fold cross-validated randomized hyperparameter search. A combination of unigrams and bigrams resulted in the best metrics. As seen by the final cluster, adding trigrams to this subset resulted in a performance decrease. Thus, when we use n-gram features in later experiments, we only consider the combination of unigrams and bigrams.

4.2.2 Sentiment Lexicon Word Counts

One of the more rudimentary sentiment analysis techniques stems from the use of a sentiment dictionary, or lexicon, which is a pre-existing collection of subjective words that are labeled as either positive or negative. Using the sentiment lexicon from (Liu, 2012)², we count the number of positive and negative words occurring in each document and incorporate the counts as two additional features.

4.2.3 Document Word Count

In our initial analysis, we discovered that oftentimes the most negative text responses were associated with a larger word count. Although the correlation is relatively weak across the entire corpus, we nonetheless include the word count of each document as a feature.

4.2.4 Word Embeddings

Based on the recent successes of distributed word representations like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), we sought to harness these learned language models for predicting sentiment polarity. Although primarily used in deep learning architectures, we show that these representations can also be useful with linear models. Unlike our other features, the individual features contained in word embeddings are indecipherable; however, as we show in the results section, they contribute to the overall success of our classification. In our experiments, we utilize a publicly available Word2Vec model pre-trained on Google News³, containing 100 billion words. Each unique word in the model is associated with a 300-dimensional vector. For each of our documents, we include the mean word vector derived from each individual word's embedding as 300 additional features.

²https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
³https://code.google.com/p/word2vec/

5 Four-Class Polarity Prediction

Because our new dataset introduces a clear distinction between text labeled as both positive and negative and neither positive nor negative (traditionally, both of these classes are grouped together as neutral), there are no baselines against which to compare our experiments. We offer our results for this scenario as a launching point for future studies on polarity detection in mental health. For this scenario, we show the results of each feature extraction method individually, as well as the results for the combination of all features. All results are evaluated via 10-fold cross-validation, with folds split on responses. Results are shown in Figure 3, where it is clear that optimal performance is achieved using the model trained on all features. Our methods gave rise to an overall classification accuracy of 78%.

Figure 3: Classification performance for the 4-class polarity prediction task. We show results for each feature set individually, as well as the combination of all features. Using all extracted features results in the highest accuracy, F1, precision, and recall.

From Figure 3, it is apparent that of all individual features, n-grams perform the best. The relatively strong performance of n-gram features aligns with our expectations, given the widespread use of n-gram features across all types of text classification problems. However, what is more surprising is the relatively weak result for the sentiment lexicon features, given their popularity in modern sentiment analysis. Additionally, the word embedding features also gave rise to better performance than expected, especially considering that we used the Word2Vec embeddings with linear models as opposed to the more traditional deep learning architectures. Finally, we see optimal performance across all metrics when using the combination of all features.

Using the optimal model from Figure 3, we show the individual class metrics for precision, recall, F1, and overall accuracy in Table 2. It is apparent that the both positive and negative class proves especially difficult to classify. This is explained in part by the previously mentioned class imbalance issue - when the majority of the corpus is negative, it becomes difficult for the classifier to differentiate between sentiment comprising mostly positive polarity and sentiment comprising some positive polarity. The low recall of the both positive and negative class clearly points towards the need for more research in this area.

Class      Precision   Recall   F1
Positive   0.63        0.32     0.42
Negative   0.74        0.96     0.84
Both       0.58        0.16     0.26
Neither    0.77        0.47     0.59
Overall Accuracy: 0.78

Table 2: Polarity prediction results for the full 4-class version of our dataset. For brevity, the polarity class both positive and negative is denoted as Both, and the class neither positive nor negative is denoted as Neither.

6 Binary Polarity Prediction

In this section, we experiment with using existing sentiment analysis corpora to perform traditional two-class polarity prediction on our dataset, and compare the results to a cross-validation approach, split on responses, trained on our dataset alone. The primary purpose is to gauge the effectiveness of classifiers trained on existing sentiment corpora as applied to the mental health domain. State of the art sentence-level binary polarity detection accuracy is reported as 85.4% (Socher et al., 2013) using deep learning models and a specialized movie review dataset, and while our models are computationally simpler and use different features, we incorporate such existing corpora in our experiments. Since our full dataset consists of four polarity labels, whereas traditional sentiment analysis only uses two, for these experiments we only consider the responses from our dataset belonging to the positive and negative classes.

We begin by training our model on existing sentiment datasets only. The first is a large-scale Twitter sentiment analysis dataset⁴ which automatically assigns polarity labels based on emoticons present in user tweets (we denote this dataset as "Twitter"). The next is a collection of IMDB movie reviews published by (Maas et al., 2011) at Stanford University⁵ (we denote this dataset as "Stanford"). We also use two movie reviews datasets from (Pang et al.,

⁴http://help.sentiment140.com/for-students/
⁵http://ai.stanford.edu/~amaas/data/sentiment/
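As a rough illustration of the n-gram features in Section 4.2.1, the standard-library sketch below builds unigram + bigram count vectors reweighted by tf-idf. The function names, the toy documents, and the particular tf-idf variant (raw term frequency times log(N/df), with no smoothing) are our own assumptions; the paper does not specify its exact tokenizer or weighting formula.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and strip punctuation, but keep stop words (Section 4.1
    # notes that removing them hurt performance in this domain).
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n_values=(1, 2)):
    # Word unigrams + bigrams: the best-performing combination (Figure 2).
    out = []
    for n in n_values:
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def tfidf_vectors(documents):
    # Per-document n-gram counts, reweighted by tf * log(N / df).
    counts = [Counter(ngrams(tokenize(d))) for d in documents]
    n_docs = len(documents)
    df = Counter(g for c in counts for g in set(c))
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in c.items()}
            for c in counts]

docs = ["I don't do well at parties", "what if I fail", "what if they laugh"]
vecs = tfidf_vectors(docs)
# The bigram "what if", which plain stop-word removal would destroy,
# survives as a feature in the second and third documents.
```

A real system would feed these sparse vectors (together with the POS-tag, lexicon-count, word-count, and embedding features) into a logistic regression classifier, as the paper describes.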
2002) at Cornell University⁶, where one is geared towards document-level sentiment classification (denoted as "Cornell"), and the other towards sentence-level classification (denoted as "Cornell Sentence"). Our final dataset is a collection of web forum opinions collected by the University of Michigan as part of a Kaggle competition⁷ (which we denote as "UMich"). The number of documents of each sentiment class, per dataset, is given in Table 3.

Dataset            # Positive   # Negative
Twitter            797792       798076
Stanford           25000        25000
Cornell            1000         1000
Cornell Sentence   5221         5212
UMich              3995         3091

Table 3: Existing sentiment corpora summary.

Using all features from the previously outlined extraction process, we train a separate model on each of the five existing sentiment analysis corpora. Optimal hyperparameters for each experiment were selected via a randomized parameter search in conjunction with three-fold cross validation. In each case, the trained models were tested against the binary version of our dataset. Additionally, we perform the same extraction and fine-tuning process to construct a model trained on our new dataset alone. For this experiment, we report the results after a 10-fold cross-validation process split on responses. A summary of accuracy, precision, recall, and F1 score for the binary prediction setting is shown in Figure 4, where it is apparent that the best performance occurs when using our dataset, pointing towards the need for collecting custom mental health datasets for this new type of internal sentiment analysis. Our binary polarity model resulted in 90% classification accuracy.

Figure 4: Classification results for the positive vs. negative prediction setting using 5 external sentiment corpora and cross-validated results on our own binary dataset (TAO 2-Class). The precision, recall, and F1 scores are reported using a weighted average incorporating the support of each class label. For all metrics, training on our dataset (TAO 2-Class) yields better results than using models trained on existing sentiment corpora.

7 Important Features

In this section, we wish to understand which features are most discriminative in predicting whether a piece of text is positive, negative, both positive and negative, or neither positive nor negative. These features (all of which are naturally interpretable aside from the word embeddings) can serve as useful indicators for therapists and future mental health polarity studies.

To evaluate our features, we examine the weight matrix of a randomized logistic regression classifier trained on our full four-class polarity dataset. The feature weights corresponding to each of the four classes give an idea of the relative importance of each feature, and how discriminative they are as compared to the remaining three classes. We summarize the 10 most important features per class in Table 4.

Much can be gleaned from an informal inspection of these top features. For example, while the

⁶https://www.cs.cornell.edu/people/pabo/movie-review-data/
⁷https://inclass.kaggle.com/c/si650winter11/data
29 Positive Negative Both Positive and Negative Neither Positive nor Negative was able worried but work no anxiety $RB $VBG okay nothing calm
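The hyperparameter selection procedure described above, a randomized parameter search combined with three-fold cross-validation, can be sketched with scikit-learn. The model family, the parameter range, and the toy data below are our own assumptions, not the paper's actual configuration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-ins for the real feature matrix and binary polarity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 2, size=300)

# Randomized search over the regularization strength, with each
# candidate scored by three-fold cross-validation.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to any estimator: only the `param_distributions` dictionary changes per model family.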
X. Glorot, A. Bordes, and Y. Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML).

Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics, pages 1367–1373, Morristown, NJ, USA.

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 11), pages 538–541.

Onur Kucuktunc, Ingmar Weber, B. Barla Cambazoglu, and Hakan Ferhatosmanoglu. 2012. A large-scale sentiment analysis for Yahoo! Answers. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, pages 633–642.

Yuming Lin, Jingwei Zhang, Xiaoling Wang, and Aoying Zhou. 2012. An information theoretic approach to sentiment polarity classification. In Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, pages 35–40.

Bing Liu. 2010. Sentiment analysis and subjectivity. In Handbook of Natural Language Processing, 2nd edition.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Yelena Mejova and Padmini Srinivasan. 2011. Exploring feature definition and selection for sentiment classifiers. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pages 546–549.

Prem Melville, Wojciech Gryc, and Richard D. Lawrence. 2009. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284, New York, New York, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), pages 1–12.

Arturo Montejo-Ráez, Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López. 2012. Random walk weighting over SentiWordNet for sentiment polarity detection on Twitter. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, pages 3–10.

Arturo Montejo-Ráez, Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López. 2014. Ranked WordNet graph for sentiment polarity classification in Twitter. Computer Speech & Language, 28(1):93–107.

Tetsuya Nasukawa and Jeonghee Yi. 2003. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture, pages 70–77, New York, New York, USA.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In ESWC2011 Workshop on Making Sense of Microposts, pages 93–98.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), pages 19–21.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 271–278.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79–86.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, pages 1532–1543.

John Pestian, Pawel Matykiewicz, Brett South, Ozlem Uzuner, and John Hurdle. 2012. Sentiment analysis of suicide notes: A shared task. Biomedical Informatics Insights, 5(1):3–16.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

James Spencer and Gulden Uchyigit. 2012. Sentimentor: Sentiment analysis of Twitter data. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 56–66.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Luke Kien Weng Tan, Jin Cheon Na, Yin Leng Theng, and Kuiyu Chang. 2011. Sentence-level sentiment polarity classification using a linguistic approach. In Proceedings of the 13th International Conference on Asia-Pacific Digital Libraries, pages 77–87.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1555–1565.

Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. 2011. Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1031–1040, New York, New York, USA.

Peter Warr, Joanna Barter, and Garry Brownbridge. 1983. On the independence of positive and negative affect. Journal of Personality and Social Psychology, 44(3):644–651.

D. Watson, L. A. Clark, and A. Tellegen. 1988. Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54(6):1063–1070.

Miaomiao Wen, Diyi Yang, and Carolyn P. Rosé. 2014. Sentiment analysis in MOOC discussion forums: What does it tell us? In Proceedings of Educational Data Mining, pages 1–8.

Michael Wiegand, Alexandra Balahur, Benjamin Roth, Dietrich Klakow, and Andrés Montoyo. 2010. A survey on the role of negation in sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pages 60–68.

Theresa A. Wilson, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.

Lei Zhang, Riddhiman Ghosh, Mohamed Dekhil, Meichun Hsu, and Bing Liu. 2011. Combining lexicon-based and learning-based methods for Twitter sentiment analysis. Technical report, HP Laboratories.
Is Sentiment in Movies the Same as Sentiment in Psychotherapy? Comparisons Using a New Psychotherapy Sentiment Database
Michael Tanana1, Aaron Dembe1, Christina S. Soma1, David Atkins2, Zac Imel1 and Vivek Srikumar3 1 Department of Educational Psychology, University of Utah, Salt Lake City, UT 2 Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA 3 School of Computing, University of Utah, Salt Lake City, UT [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 33–41, San Diego, California, June 16, 2016. © 2016 Association for Computational Linguistics

Abstract

The sharing of emotional material is central to the process of psychotherapy, and emotional problems are a primary reason for seeking treatment. Surprisingly, very little systematic research has been done on patterns of emotional exchange during psychotherapy. It is likely that a major reason for this void in the research is the enormous cost of annotating sessions for affective content. In the field of NLP, there have been major strides in the creation of algorithms for sentiment analysis, but most of this work has focused on written reviews of movies and Twitter feeds, with little work on spoken dialogue. We have created a new database of 97,497 utterances from psychotherapy transcripts labeled by humans for sentiment. We describe this dataset and present initial results for models identifying sentiment. We also show that one of the best models from the literature, trained on movie reviews, performed below many of our baseline models that trained on the psychotherapy corpus.

1 Introduction

People often seek psychotherapy because they feel emotionally distressed (e.g. anxious, unable to sleep). For well over a century, researchers and practitioners have consistently acknowledged the central role emotions play in psychotherapy (Freud and Breuer, 1895; Lane et al., 2015). Emotion, or affect, is directly involved in key concepts of psychotherapeutic process and outcome, including the formation of the therapeutic alliance (Safran and Muran, 2000), an individual's process of decision making (Bar-On et al., 2004; Isen, 2008), behavior change (Lang and Bradley, 2010), personality style (Mischel, 2013), and happiness (Gross and Levenson, 1997). Affect is implicated in human memory (Schacter, 1999), and is an essential building block of empathy (Elliott et al., 2011; Imel et al., 2014a). The particular role of affect in different psychotherapy theories varies from encouraging patients to access and release suppressed emotions (as in psychoanalysis; e.g. Kohut (2013)) to identifying the impact of cognition on emotion (as in rational-emotive behavior therapy; Ellis (1962)). Carl Rogers, a progenitor of humanistic / person-centered therapy, theorized that empathy involved a therapist experiencing a client's affect as if it were his or her own, and that empathy constituted a necessary ingredient for human growth and change (Rogers, 1975). Empathy and emotion continue to be a primary area of research in psychological science (Decety and Ickes, 2009).

In psychotherapy, there are many ways that clients and therapists communicate how they are feeling (e.g., facial expression (Haggard and Isaacs, 1966), body positioning (Beier and Young, 1998), and vocal tone (Imel et al., 2014a)), but clearly one is the words they use. For example, there is evidence that greater use of affect words predicts positive treatment outcome (Anderson et al., 1999; Stalikas, 1995). Similarly, Mergenthaler (1996) developed a theory on how the pattern of emotional expression should proceed between a client and therapist. However, this research has been limited to dictionary-based methods (see also Mergenthaler (2008)). Until very recently, the exploration of emotion in psychotherapy has been limited by the lack of methodology for looking at sentiment in a more nuanced way.

2 Sentiment Analysis

There is a long tradition in the field of Natural Language Processing (NLP) of trying to correctly identify the sentiment of passages of text, and as a result there are a large number of techniques that have been tested (for a review on the subject, see Pang and Lee (2008)). Some common methods involve using n-grams combined with classifier models (SVM, CRF, Naive Bayes) to identify the sentiment of sentences or passages (Pak and Paroubek, 2015). Another method involves using pre-compiled dictionaries of common terms with their polarity (positive or negative) (Baccianella et al., 2010). As with many NLP methods, researchers have attempted to go beyond the mere presentation of words and use sentence structure and contextual information to improve accuracy. Along these lines, more recently researchers have used deep learning techniques to improve accuracy on sentiment datasets, with some success (Maas et al., 2011; Socher et al., 2013).

2.1 Domain Adaptation: Why Create a New Sentiment Dataset

The purpose of this project is to create and evaluate a dataset for training machine learning sentiment analysis models that could then be applied to the domain of psychotherapy and mental health. Pang and Lee (2008) have pointed out that sentiment analysis is domain specific. Thus, creating a sentiment dataset specific to psychotherapy addresses the possibility that the words and ratings used to train models in other contexts may have very different connotations than those in spoken psychotherapy. For example, if one were reviewing a movie and wrote that 'the movie was very effective emotionally, deeply sad', this might be rated as a very positive statement. But in a therapy session, the word 'sad' would be more likely to be used in the context 'I am feeling very sad'. Moreover, there are many words that might be extremely rare in other datasets, but are very common in psychotherapy. For example, the word 'Zoloft' (an anti-depressant medication) may never occur in a movie review dataset, but it occurs 381 times in our collection of therapy transcripts. Moreover, psychotherapy text typically comes from transcribed dialogue, not written communication. Modeling strategies that work well on written text may perform poorly on spoken language. For example, methods that require parse trees (recursive neural nets) may have difficulty with the disfluencies, fillers, and fragments that come from dialogue.

Databases used for sentiment analysis have come from a variety of written prose, ranging from classic literature (Yussupova et al., 2012; Qiu et al., 2011; Liu and Zhang, 2012) and news articles (see Pang and Lee (2008) for a list of databases) to social media text (for examples see Bohlouli et al. (2015), Gokulakrishnan et al. (2012) and Pak and Paroubek (2015)). Databases have been created from archived text via the Internet. Additionally, researchers have used a variety of techniques to harvest live feeds of tweets and posts from social media outlets such as Twitter and Facebook, so as to access fresh data (Bohlouli et al., 2015). Virtually all of the databases for sentiment analysis are written, and none (that we are aware of) come from a mental health domain.

3 Data Collection

Data were obtained from a large corpus of psychotherapy transcripts that are published by Alexander Street Press (http://alexanderstreet.com/). These transcripts come from a variety of different theoretical perspectives (Psychodynamic, Experiential/Humanistic, Cognitive Behavioral, and Drug Therapy/Medication Management) (Imel et al., 2014b). Importantly, these transcripts are available through library subscription and can be downloaded from the web. As a result, they can be shared more easily than a typical psychotherapy dataset. At the time of writing, there were 2,354 sessions, with 514,118 talk turns.

Figure 1: Example Mechanical Turk Rating [screenshot removed in extraction]

Before sampling from the dataset, we segmented talk turns on sentence boundaries (based on periods, exclamation marks, and question marks). We refer to these discrete units as 'utterances'. We also excluded any talk turns that were shorter than 15 characters (a large part of the dataset consists of short filler text like 'mm-hmm', 'yeah', and 'ok' that is neutral in nature). We left in non-verbal indicators that were transcribed, like '(laugh)' or '(sigh)'. We randomly sampled from the entire dataset of utterances that met the criteria for length, without any stratification by session.

We used Amazon Mechanical Turk (MTurk) to code the dataset for sentiment. We limited the workers to individuals in the United States to reduce the variability in the ratings to only US English speakers. In addition, we required that workers were all 'master' certified by the system (which means that they had a track record of successfully performing other tasks). We packaged each utterance with a set of 7 others that were all completed at the same time (though all were selected randomly and were not in order). Workers were told that the utterances came from transcripts of spoken dialogue, and as a result are sometimes messy, but to try their best to rate each one. For each rating, workers were given the following five options: Negative, Somewhat Negative, Neutral, Somewhat Positive, Positive (see figure 1 for the exact presentation). Each utterance in the main dataset was rated by one person.

3.1 Interrater Dataset

In addition to the main dataset, where one worker rated each utterance, we created another dataset where a random selection of 100 utterances were rated by 75 workers each. The purpose of this dataset was 1) to estimate the numeric interrater reliability of human coding of sentiment and 2) to be able to see the distribution of sentiment ratings for different utterances.

4 Data Description

4.1 Sentiment Dataset Description

The sentiment ratings were completed by 221 different workers on MTurk. The workers completed 97,497 ratings. The mean length of the utterances was 13.6 words (SD = 11.1) and the median length was 10 words. The most frequent rating was neutral (59.2%), and the ratings generally skewed more negative than positive (see figure 2).

There was a similar trend to the one observed by Socher et al. (2013): shorter sentences tended to be more neutral than longer ones. In contrast, though, even in longer phrases our dataset skewed more negative and had a larger neutral percentage. This makes sense, given that the dataset comes from a collection of psychotherapy transcripts where participants are likely to be discussing the problems that brought the client to psychotherapy.
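The utterance segmentation described in Section 3 (split talk turns on sentence-final punctuation, drop anything under 15 characters, keep transcribed non-verbal indicators) can be sketched as follows. The function name and regex are our own illustration, not the authors' code.

```python
import re

def segment_talk_turn(turn: str, min_chars: int = 15) -> list[str]:
    """Split a talk turn into utterances on ., !, ? boundaries and
    drop very short fillers like 'mm-hmm' or 'ok'."""
    parts = re.split(r"(?<=[.!?])\s+", turn.strip())
    return [p for p in parts if len(p) >= min_chars]

turn = ("Mm-hmm. I have been feeling very anxious lately (sigh). "
        "It started last winter.")
print(segment_talk_turn(turn))
```

Note that parenthesized non-verbal indicators such as "(sigh)" survive segmentation untouched, as in the paper; only the length filter removes material.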
4.2 Data Splits

From the overall collection of 97,497 ratings we randomly split the data into training, development, and test sets. We allocated 60% to the training set (58,496), 20% to the development set (19,503), and 20% to the test set (19,498).
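A 60/20/20 split of this kind can be reproduced with two calls to scikit-learn's `train_test_split`; the list of indices below is a stand-in for the real rated utterances, and the exact per-split counts depend on rounding, so they may differ by a few items from the paper's figures.

```python
from sklearn.model_selection import train_test_split

ratings = list(range(97_497))  # stand-in for the 97,497 rated utterances

# First carve off 60% for training, then split the remainder evenly
# into development and test (20% each of the full collection).
train, rest = train_test_split(ratings, train_size=0.60, random_state=0)
dev, test = train_test_split(rest, test_size=0.50, random_state=0)
print(len(train), len(dev), len(test))
```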
4.3 Interrater Dataset Description

The interrater dataset was used to determine the level of interrater agreement when rating sentiment in this dataset. We used the Intraclass Correlation Coefficient (ICC) to assess this agreement (Shrout and Fleiss, 1979). Using a two-way random effects model for absolute agreement, treating the data as ordinal, the ICC was .54[1] (95% CI [.47, .62]).[2]

Figure 2: Distribution of Sentiment Ratings [bar chart removed in extraction; Neutral is the modal rating and the distribution skews negative]

[1] This is rated "fair" by the criterion of Cicchetti (1994).
[2] CI = Confidence Interval.
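A two-way random-effects, absolute-agreement, single-rater ICC — ICC(2,1) in Shrout and Fleiss's notation — can be computed directly from the target-by-rater matrix. The numpy sketch below is our own implementation of the standard formula, not the authors' code.

```python
import numpy as np

def icc_2_1(Y: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    Y has shape (n_targets, k_raters)."""
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    col_means = Y.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = np.sum((Y - grand) ** 2) - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement across raters yields an ICC of 1.
Y = np.tile(np.array([[1.0], [2.0], [3.0], [4.0]]), (1, 5))
print(icc_2_1(Y))
```

With the paper's 100 utterances rated by 75 workers, `Y` would be a 100 x 75 matrix of ordinal ratings.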
Figure 3: Distribution of Sentiment Ratings by Length of Utterance [stacked percentage chart removed in extraction; x-axis: utterance length (0–60), y-axis: percent; series: Positive, Somewhat Positive, Neutral, Somewhat Negative, Negative]
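An aggregation like Figure 3 — the rating distribution within utterance-length bins — can be tabulated with pandas. The column names, toy rows, and bin edges below are illustrative, not the released dataset's schema.

```python
import pandas as pd

# Toy rows standing in for (word-length, rating) records.
df = pd.DataFrame({
    "length": [2, 3, 8, 12, 25, 30, 41, 55],
    "rating": ["Neutral", "Neutral", "Neutral", "Somewhat Negative",
               "Negative", "Somewhat Positive", "Negative", "Neutral"],
})

# Percent of each rating within coarse length bins, as in the figure.
bins = pd.cut(df["length"], bins=[0, 20, 40, 60])
table = df.groupby(bins)["rating"].value_counts(normalize=True).mul(100)
print(table)
```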
The interrater set provides an illustration of how different types of utterances produce different responses in people. We present several examples in figure 4 that were chosen to illustrate different patterns in ratings. As can be seen in the examples, there are certain ratings where the vast majority of raters agree on the sentiment. For example, 'and then left the house' was rated neutral by most of the raters. Another example, 'no, I don't even like him', was agreed to be some degree of negative by most raters. Another utterance that was generally rated positive was '(chuckling) but I know I didn't feel as good as I do now'. But even in examples where the vast majority of raters agreed on the direction of the rating (positive or negative), there was almost never complete agreement on the degree of sentiment. This finding lends support to the method of training models that predict the polarity of the sentiment, but not the degree (Socher et al., 2013). In many utterances a large proportion of raters agreed a phrase was not neutral, but there was low agreement on what the direction of the sentiment was. The example 'see I don't need any therapy' illustrates this point. The modal rating was neutral, but the example had a wide distribution of ratings. Different raters had very different views on the sentiment of the sentence. It is possible that these different assessments could map onto ways in which therapists might view such a statement - some taking it at face value as an indicator that a client was doing well, while others might view it as a failure to acknowledge the problems that brought them to psychotherapy.

5 Models

We tested several common NLP models to predict the labels on the dataset from the text. The purpose of the modeling was to build baseline measures that could serve as comparisons for future studies.

5.1 Features

We tested the models with several n-gram combinations. Grams were created by parsing on word boundaries without separating out contractions. For example, the word "don't" would be left as a single gram. Each model was tested with 1) unigram features, 2) unigram + bigram features, and 3) unigram, bigram, and trigram features.

5.2 Evaluation

All of the models were evaluated on how well they predicted the coarse sentiment labels, which were 'positive', 'negative', and 'neutral'. We used several metrics: 1) overall accuracy predicting labels, 2) F1 score for each of the labels, and 3) Cohen's Kappa (weighted). Because the base rate for neutral was high in our dataset, the Kappa metric probably gives the best overall measure of the performance of these models, correcting for chance agreement on neutral ratings. Although accuracy is reported in the table, we feel that Kappa is a better metric because in our dataset, an accuracy of .59 could be achieved by guessing the majority class.

All models were tuned against the development set. Once the final hyperparameters were selected for each model, they were trained on both the training and development sets and run once against the test set.
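The three metrics named in Section 5.2 are all available in scikit-learn; a minimal sketch on toy labels (not the paper's outputs), where the ordinal encoding used for the weighted kappa is our own choice:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

labels = ["negative", "neutral", "positive"]
y_true = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
y_pred = ["neutral", "positive", "positive", "negative", "neutral", "neutral"]

print("accuracy:", accuracy_score(y_true, y_pred))
# One F1 score per label, in the order given by `labels`.
print("per-label F1:", f1_score(y_true, y_pred, labels=labels, average=None))

# Linearly weighted kappa penalizes near-misses less than polarity flips;
# weighting assumes an ordinal scale, so map labels to integers first.
order = {"negative": 0, "neutral": 1, "positive": 2}
t = [order[label] for label in y_true]
p = [order[label] for label in y_pred]
print("weighted kappa:", cohen_kappa_score(t, p, weights="linear"))
```

The kappa statistic corrects for the chance agreement that a high neutral base rate would otherwise inflate, which is why the paper prefers it to raw accuracy.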
Figure 4: Example Rating Distributions from the Interrater Set [four bar-chart panels removed in extraction; panel titles: "see I don't need any therapy", "no, I don't even like him", "and then left the house", "(chuckling) but I know I didn't feel as good as I do now"; each panel shows the percent of raters choosing each rating from Positive through Negative]
6 Classifier Model

We tested each of the feature sets with a Maximum Entropy model. The L1 regularization on each of the models was tuned against the development set. The models functioned as local classifiers (that is, they could only see each utterance in isolation, without any surrounding information from the session from which it was drawn).

In addition, we tested a pre-trained version of the Stanford sentiment model based on a Recursive Neural Network, as a comparison for how a model that was trained on movie reviews would perform on psychotherapy dialogue (Socher et al., 2013). Because of the way that this model is set up to learn, it was not possible to train it on our data.[1] The RNN model from the previously mentioned paper requires parse trees of the training set, labeled at each node. Our training dataset only has the top level of the sentence labeled. In our tables, the specific model is identified as a Recursive Neural Tensor Network (RNTN).

[1] As a side note: this is not a limitation of Recursive Neural Networks (RNNs) in general, but rather of the way that Socher's implementation was designed to learn. The movie dataset that their group created labeled all of the sections of a parse tree and gave these labeled tree structures to the RNN. One could have designed an RNN to learn from just the top label of a tree, but then one would have to use a different implementation of an RNN. It may also surprise readers to learn that Socher's model in this paper relied on the Stanford parser to pre-parse the sentence trees, instead of letting the RNN parse the sentence.

7 Results

In general, we found that the n-gram models trained on this dataset had similar accuracy on the categorization of the coarse sentiment rating, but varied to a large degree in their F1 scores and Kappa statistics (see table 1). The maxent trigram model had the best overall accuracy, but by a relatively small margin. The best F1 for positive statements was from the maxent unigram and bigram models, and the best F1 for negative statements was from the maxent model as well. The maxent model had the highest Kappa score. Surprisingly, there was not a wide divergence in scores by the length of the grams in the models. The Kappa score for the maxent model did not change by more than .01 between a unigram and a trigram model.

Model                                Accuracy  F1-Neutral  F1-Positive  F1-Negative  Kappa
Unigram Features Maxent              .601      .706        .339         .451         .308
Bigram Features Maxent               .603      .709        .339         .446         .306
Trigram Features Maxent              .606      .714        .337         .434         .300
RNN RNTN Trained on Movie Reviews    .484      .559        .319         .450         .227

Table 1: Results on the test set. Best scores for each category are bolded.

The RNTN from Socher et al. (2013) that was only pretrained on the movie review data had much lower accuracy than the other models (.484) and a lower Kappa score than the maxent models. The F1 scores for positive and negative statements were comparable to the best models, but the F1 score for neutral was lower than for any of the other models tested (.559).

In table 2 we present the best predictors of the positive and negative classes from the unigram maxent model. It is interesting to note that these words give some insight into why it is important to have a sentiment dataset that is specific to psychotherapy. We can see that 'scary' is one of the top negative words in the dataset. We should note that in a movie review, the word 'scary' might be a positive indicator. Additionally, psychologically relevant words like 'depressed' and 'relaxed' are frequent on the list of good predictors.

Most Positive Words: nice, thank, amazing, glad, good, proud, great, relaxed, helpful, fine, interesting, forward, helped, special, helps, cool, better, enjoyed, excited

Most Negative Words: sad, crap, hated, screwed, afraid, terrible, fear, can't, bothers, rejection, worst, death, hard, scary, horrible, worse, stupid, ugly, pissed, depressed

Table 2: Most Positive and Negative Words from the Maxent Unigram Model.

The confusion matrix for the maxent unigram model (see figure 5) shows that the basic model is generally accurate in the polarity of the statement (that is, there are very few errors of positive sentences coded as negative, or negative sentences coded as positive). The errors are generally classifying a positive utterance as neutral or a neutral utterance as positive.

Figure 5: Confusion Matrix for the Maxent Unigram Model (Test Set) [heatmap removed in extraction; axes: predicted label vs. true label, cells shaded by frequency]
8 Discussion

Psychotherapy is an often emotional process, and most theories of psychotherapy involve hypotheses about emotional expression, but very few researchers have systematically explored how affect works empirically in these situations. There are several databases of sentiment ratings in text, but few of them involve dialogue and none are from a mental health setting. This dataset represents an initial step towards the study of sentiment in psychotherapy.

One of the novel contributions of this dataset to the area of sentiment analysis in general is the interrater reliability subset. In our literature search we were unable to find examples where researchers had estimated what human agreement was on different datasets used for sentiment analysis. This will allow us to compare our models to human-human agreement and also provide a qualitative sense of what kinds of utterances humans agree on and on which ones they disagree.

We hope that the creation of this dataset will improve researchers' ability to predict sentiment from dialogue and in psychotherapy settings. It is clear from the interrater reliability dataset that we should not expect models to perfectly rate sentiment, because even humans do not completely agree on many types of utterances. However, it may be reasonable for machine-rated reliability to approach the human range of reliability.

The models suggested by this paper are not intended to be a comprehensive list of models that may work well on the dataset, but are intended to be a baseline for other work to compare to. There is a long list of possible models that should be tried in the future, including LIWC counts, LDA models, word vectors, and more comprehensive tests with Recursive Neural Networks. Testing all of these models and their variations is beyond the scope of this paper, but we hope that this dataset will give a baseline for different groups to test what works well on this type of data.

One of the interesting findings of this paper was the comparison with the RNTN model from Socher et al. (2013) that was only trained on the original movie review data. That dataset, it should be noted, is much larger than our own, but it is from a very different context. This is consistent with the conclusions of Pang and Lee (2008) that context is extremely important in identifying sentiment. Our work provides a test of the viability of domain adaptation of models trained on very different datasets. It would appear accuracy will suffer if we use models that were trained on datasets like movie reviews and apply them directly to mental health contexts.

Finally, there were not substantial differences in accuracy between the unigram models and the bigram and trigram models, suggesting that more complex word patterns do not necessarily improve accuracy. This may be a side effect of the characteristics of dialogue, which is not always as grammatically clear as written text.

8.1 Psychotherapy Compared to Other Sentiment Domains

It may be surprising that the accuracy of some of our initial models is lower than that of similar models used on other sentiment datasets. Part of this may be a result of our decision not to use extremely short phrases (our dataset has a large number of neutral listening utterances like 'mm-hmm' and 'yeah' that we wanted to exclude). It should be noted that even in Socher et al. (2013), all of the models tested had an accuracy below .6 on anything that had 5 words or more (see figure 6 in their paper).

However, there may be a larger issue in the psychotherapy domain that makes labeling these utterances more difficult in general. For example, when rating movies, the typical subject of the sentence is going to be the movie and whether or not the reviewer enjoyed it. While you may have sentences that express both positive and negative attitudes, there is some sense that the purpose is always going to be to evaluate the movie. In psychotherapy, an utterance like "(chuckling) but I know I didn't feel as good as now" has a complicated temporal aspect to it. The rater may be confused about whether this should be positive because the person feels good now, or negative because they were not feeling good prior to now. An utterance like "see I don't need any therapy" is complicated because some raters may see this as a person in recovery and others may see a person in denial.

Consequently, our models may not necessarily be evaluating how a person is feeling about another person or themselves in a given moment. Instead, raters evaluated the emotional valence of a statement, which could target the speaker, another person, or something unspecified. The psychotherapy domain clearly presents a more complicated task than answering the question "is this movie review a positive one or a negative one?", which is a better defined task. Future work may attempt more challenging classification tasks, like asking a rater to guess how a client or therapist may be feeling from text - similar to how a human interacting with a client or therapist might attempt to understand their partner's internal state. However, even if models could be trained to accurately capture this particular aspect of sentiment, we could not be sure that models were capturing an actual internal state. Instead they would be learning human perception of this state, which in and of itself can be error prone.

8.2 Future Directions

Beyond the practical question of whether we can accurately rate sentiment in psychotherapy, we hope that models trained on this dataset will eventually be able to code entire psychotherapy sessions so that we can ask larger questions about how sentiment expressed by clients and therapists influences outcomes. For example, would we expect to see the largest improvement in symptoms from positive client expression or negative client expression? Or should there be a pattern from negative expressed sentiment to positive? Another important question is whether we would see the most improvement from therapists who focus on the positive aspects of a client's experience or the more negative ones. To answer these questions, we need to be able to label more data than is practical to do with just human raters.

Acknowledgments

We would like to thank Padhraic Smyth for his helpful thoughts on early versions of this manuscript. Funding for this project was provided by NIH/NIDA R34 DA034860.

References

Timothy Anderson, Edward Bein, Brian Pinnell, and Hans Strupp. 1999. Linguistic analysis of affective speech in psychotherapy: A case grammar approach. Psychotherapy Research, 9(1):88–99.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Reuven Bar-On, Daniel Tranel, Natalie L. Denburg, and Antoine Bechara. 2004. Emotional and social intelligence.

… (SoMABiT). Journal of Information Science, 41(6):779–798.

Domenic V. Cicchetti. 1994. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4):284–290.

J. Decety and W. Ickes. 2009. The Social Neuroscience of Empathy. MIT Press, Cambridge, MA.

R. Elliott, A. C. Bohart, J. C. Watson, and L. S. Greenberg. 2011. Empathy. Psychotherapy, 48(1):43.

Albert Ellis. 1962. Reason and Emotion in Psychotherapy. Lyle Stuart.

Sigmund Freud and Josef Breuer. 1895. Studies on hysteria. SE, 2. London: Hogarth.

B. Gokulakrishnan, P. Priyanthan, T. Ragavan, N. Prasath, and A. Perera. 2012. Opinion mining and sentiment analysis on a Twitter data stream. In Advances in ICT for Emerging Regions (ICTer), 2012 International Conference, IEEE, pages 182–188.

James J. Gross and Robert W. Levenson. 1997. Hiding feelings: The acute effects of inhibiting negative and positive emotion. Journal of Abnormal Psychology, 106(1):95.

Ernest A. Haggard and Kenneth S. Isaacs. 1966. Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy. In Methods of Research in Psychotherapy, pages 154–165. Springer.

Z. E. Imel, J. S. Barco, H. J. Brown, B. R. Baucom, J. S. Baer, J. C. Kircher, and D. C. Atkins. 2014a. The association of therapist empathy and synchrony in vocally encoded arousal. Journal of Counseling Psychology, 61(1):146.

Zac E. Imel, Mark Steyvers, and David C. Atkins. 2014b. Computational psychotherapy research: Scaling up the evaluation of patient-provider interactions. Psychotherapy.

Alice M. Isen. 2008. Some ways in which positive affect influences decision making and problem solving. Handbook of Emotions, 3:548–573.

H. Kohut. 2013. The Analysis of the Self: A Systematic Approach to the Psychoanalytic Treatment of Narcissistic Personality Disorders. University of Chicago Press.

Richard D. Lane, Lee Ryan, Lynn Nadel, and Leslie Greenberg. 2015. Memory reconsolidation, emotional arousal, and the process of change in psychotherapy: New insights from brain science. Behavioral and Brain Sciences, 38:e1.
Social neuroscience: key readings, 223. Peter J Lang and Margaret M Bradley. 2010. Emo- Ernst G Beier and David M Young. 1998. The silent tion and the motivational brain. Biological psychol- language of psychotherapy. Aldine. ogy, 84(3):437–450. M. Bohlouli, J. Dalter, M. Dronhofer, J. Zenkert, and B. Liu and L. Zhang. 2012. A survey of opinion mining M. Fathi. 2015. Knowledge discovery from so- and sentiment analysis. Mining Text Data, pages 415– cial media using big data-provided sentiment analysis 463.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142-150. Association for Computational Linguistics.

Erhard Mergenthaler. 1996. Emotion-abstraction patterns in verbatim protocols: A new way of describing psychotherapeutic processes. Journal of Consulting and Clinical Psychology, 64(6):1306.

Erhard Mergenthaler. 2008. Resonating minds: A school-independent theoretical conception and its empirical application to psychotherapeutic processes. Psychotherapy Research, 18(2):109-126.

Walter Mischel. 2013. Personality and Assessment. Psychology Press.

A. Pak and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, 10:1320-1326.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 1(2):91-231.

G. Qiu, B. Liu, J. Bu, and C. Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9-27.

C. R. Rogers. 1975. Empathic: An unappreciated way of being. The Counseling Psychologist, 5(2):2-10.

Jeremy D. Safran and J. Christopher Muran. 2000. Negotiating the Therapeutic Alliance: A Relational Treatment Guide. Guilford Press.

Daniel L. Schacter. 1999. The seven sins of memory: Insights from psychology and cognitive neuroscience. American Psychologist, 54(3):182.

Patrick E. Shrout and Joseph L. Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Anastassios Stalikas and Marilyn Fitzpatrick. 1995. Client good moments: An intensive analysis of a single session. Canadian Journal of Counselling, 29(2).

N. Yussupova, Diana Bogdanova, and M. Boyko. 2012. Applying of sentiment analysis for texts in Russian based on machine learning approach. In Proceedings of the Second International Conference on Advances in Information Mining and Management, pages 8-14.
Building a Motivational Interviewing Dataset
Verónica Pérez-Rosas 1, Rada Mihalcea 1, Kenneth Resnicow 2, Satinder Singh 1, Lawrence An 3
1 Computer Science and Engineering, University of Michigan
2 School of Public Health, University of Michigan
3 Center for Health Communications Research, University of Michigan
{vrncapr,mihalcea,kresnic,baveja,lcan}@umich.edu
Abstract

This paper contributes a novel psychological dataset consisting of counselors' behaviors during Motivational Interviewing encounters. Annotations were conducted using the Motivational Interviewing Treatment Integrity (MITI) coding system. We describe relevant aspects associated with the construction of a dataset that relies on behavioral coding, such as data acquisition, transcription, expert data annotations, and reliability assessments. The dataset contains a total of 22,719 counselor utterances extracted from 277 motivational interviewing sessions that are annotated with 10 counselor behavioral codes. The reliability analysis showed that annotators achieved excellent agreement at the session level, with Intraclass Correlation Coefficient (ICC) scores in the range of 0.75 to 1, and fair to good agreement at the utterance level, with Cohen's Kappa scores ranging from 0.31 to 0.64.

Behavioral interventions are a promising approach to address public health issues such as smoking cessation, increasing physical activity, and reducing substance abuse, among others (Resnicow et al., 2002). In particular, Motivational Interviewing (MI), a client-centered psychotherapy style, has been receiving increasing attention from the clinical psychology community due to its established efficacy for treating addiction and other behaviors (Moyers et al., 2009; Apodaca et al., 2014; Barnett et al., 2014; Catley et al., 2012).

Despite its potential benefits in combating addiction and in providing broader disease prevention and management, implementing MI counseling at larger scale or in other domains is limited by the need for human-based evaluations. Currently, this requires a human either watching or listening to video-tapes and then providing evaluative feedback.

Recently, computational approaches have been proposed to aid the MI evaluation process (Atkins et al., 2014; Xiao et al., 2014; Klonek et al., 2015). However, learning resources for this task are not readily available. Having such resources will enable the application of data-driven strategies for the automatic coding of counseling behaviors, thus providing researchers with automatic means for the evaluation of MI. Moreover, this can also be useful to explore how MI works by relating MI behaviors to health outcomes, and to provide counselors with evaluative feedback that helps them improve their MI skills.

In this paper, we present the construction and validation of a dataset annotated with counselor verbal behaviors using the Motivational Interviewing Treatment Integrity (MITI) 4.0, which is the current gold standard for MI-based psychology interventions. The dataset is derived from 277 MI sessions containing a total of 22,719 coded utterances.

1 Motivational Interviewing

Miller and Rollnick define MI as a collaborative, goal-oriented style of psychotherapy with particular attention to the language of change (Miller and Rollnick, 2013). MI has been widely used as a treatment method in clinical trials on psychotherapy research to address addictive behaviors such as alcohol, tobacco and drug use; promote healthier habits such as nutrition and fitness; and help clients with
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 42-51, San Diego, California, June 16, 2016. ©2016 Association for Computational Linguistics

psychological problems such as depression and anxiety disorders (Rollnick et al., 2008; Lundahl et al., 2010). In addition, MI has been successfully applied in different practice settings including social work in behavioral health centers, education, and criminal justice (Wahab, 2005; McMurran, 2009).

The competence of the counselor in MI delivery is measured using systematic observational methods that assess verbal behavior in MI by focusing on therapist behaviors, client behaviors, or both (Jelsma et al., 2015). Current coding instruments for MI include the Behavior Change Counselor Index (BECCI) (Lane et al., 2005), the Client Evaluation of Motivational Interview (CEMI) (Madson et al., 2009), the Independent Tape Rating Scale (ITRS) (Martino et al., 2009), the MI Skills Code (MISC) (Moyers et al., 2003), the Stimulated Client Interview Rating Scale (SCIRS) (Arthur, 1999), the One Pass (McMaster and Resnicow, 2015), and the Motivational Interviewing Treatment Integrity (MITI) (Moyers et al., 2005).

1.1 Motivational Interviewing Treatment Integrity

The MITI coding system is currently the most frequently used instrument for assessing MI fidelity (Moyers et al., 2003). The MITI is derived from the MISC coding system and focuses exclusively on the verbal behavior of the counselor; it measures how well or poorly the clinician is using MI. The coding system evaluates MI processes related to change talk such as engagement, focus, evocation, and planning. MITI has two components: global scores and behavior counts. The global scores aim to characterize the overall quality of the interaction and include four dimensions, namely Cultivating Change Talk, Softening Sustain Talk, Partnership, and Empathy. Behavior counts are evaluated by tallying instances of particular interviewing behaviors, which can be grouped into five broad categories: questions, reflections, MI adherent behavior (MIA), MI non-adherent behavior (MINA), and neutral behaviors. Reflections capture reflective listening statements made by the clinician in response to client statements and can be categorized as simple or complex. MIA behaviors summarize counselor adherence to core aspects of the MI strategy such as seeking collaboration, affirming, and emphasizing autonomy. MINA includes aspects that indicate counselor deficiencies while delivering MI, such as confronting and persuading without permission. The neutral behaviors include counselor actions such as providing information and persuading with permission.

MITI evaluation is conducted by trained coders who assess the overall session scores and the occurrence of behaviors using pen and paper. During the coding process, coders rely on audio recordings and their corresponding transcriptions. The evaluation is usually performed as a two-step process, first evaluating overall scores and next focusing on behavior counts.

MITI coding is a very time-consuming and expensive process, as it requires accurate transcriptions and human expertise. The quality of the transcriptions is affected by the recording quality, and their preparation is time consuming, as it might take about three times the duration of the recording (Klonek et al., 2015). Thus, a 30-minute session might add up to 2.5 hours of transcriber time and about one hour of coder time.

1.2 MI reliability assessment

Reliability assessment for MI helps to validate treatment fidelity in clinical studies, as it provides evidence that the MI intervention has been effective and allows comparisons across studies (Jelsma et al., 2015). The MI literature suggests assessing reliability by double coding a fraction of the study sessions. The most common method to quantify inter-annotator agreement on MI coding is the Intraclass Correlation Coefficient (ICC). This statistic describes how much of the total variation in MITI scores is due to differences among annotators (Dunn et al., 2015). ICC scores range in the 0 to 1 interval; relatively high ICC scores indicate that annotators scored the MITI in a similar way, while scores closer to 0 suggest a considerable amount of variation in the way annotators evaluated counselor MI skill. Low scores further suggest that either the measure is defective or the annotators should be retrained. Another method to measure inter-annotator reliability in MI is Cohen's Kappa (Lord et al., 2015a), which calculates pair-wise agreement among annotations while accounting for the probability of annotators agreeing by chance.
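The chance-corrected agreement just described is straightforward to compute. The sketch below is our own illustration (not code from the paper), using made-up labels for two annotators who assigned MITI codes to six utterances:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Pairwise agreement between two annotators, corrected for the
    probability of agreeing by chance (estimated from each annotator's
    marginal label distribution)."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    # Observed agreement: fraction of items given the same code.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes assigned by two annotators to six utterances.
ann1 = ["QUEST", "SR", "CR", "QUEST", "NL", "SR"]
ann2 = ["QUEST", "SR", "SR", "QUEST", "NL", "CR"]
print(round(cohens_kappa(ann1, ann2), 3))
```

Here four of six items agree (observed agreement 0.67), but the chance correction pulls the Kappa down to about 0.54, which is why Kappa is preferred over raw percent agreement for categorical codes.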
2 Related work

Current approaches for MI coding and evaluation entail extensive human involvement. Recently, there have been a number of efforts toward building computational tools that assist researchers during the coding process. Can et al. (2012) proposed a linguistics-based approach to automatically detect and code counselor reflections, based on analyzing n-gram patterns, similarity features between counselor and client speech, and contextual meta-features that represent the dialog sequence between the client and counselor. A method based on topic models is presented in (Atkins et al., 2012; Atkins et al., 2014), where the authors focus on automatically identifying topics related to MI behaviors such as reflections, questions, support, and empathy, among others. Text- and speech-based methods have also been proposed to evaluate overall MI quality. Lord et al. (2015b) analyzed the language style synchrony between therapist and client during MI encounters; the authors relied on the psycholinguistic categories from the Linguistic Inquiry and Word Count lexicon to measure the degree to which the counselor matches the client's language. Xiao et al. (2014) present a study on the automatic evaluation of counselor empathy, analyzing correlations between prosody patterns and the empathy shown by the therapist during the counseling interaction.

Although most of the work on coding within-session MI language has focused on modeling the counselor language, there is also work that addresses the client language. Tanana et al. (2015) used recursive neural networks (RNN) to identify client change and sustain talk in MI transcripts, i.e., language that indicates commitment towards or away from behavioral change. In this work, the authors combined both therapist and client utterances in a single sequence model using Maximum Entropy Markov Models, RNNs, and n-gram features. Gupta et al. (2014) analyzed the valence of the client's attitude towards the target behavior by using n-grams and conditional maximum entropy models. The authors also explore the role of laughter from both counselor and client during the MI encounter and attempt to incorporate its occurrence as an additional source of information in the prediction model.

Research findings have shown that natural language processing approaches can be successfully applied to behavioral data for the automatic annotation of therapists' and clients' behaviors. This motivates our interest in building resources for this task as an initial step toward the construction of improved coding tools. There has been work on creating annotated resources that facilitate advances in natural language processing of clinical text, including semantic and syntactic annotation of pathology reports, oncology reports, and biomedical journals (Roberts et al., 2007; Albright et al., 2013; Verspoor et al., 2012). However, to our knowledge, only a few psychotherapy corpora are available. One of them is the "Alexander Street Press",(1) a large collection of transcripts and video recordings of therapy sessions on different subjects such as anxiety, depression, and family conflicts. There are also some other psychology datasets available under limited access from the National Institute of Mental Health (NIMH).(2) These datasets provide recorded interactions between clinicians and patients for a number of psychotherapy styles; however, they are not annotated and validated for use in the computational psychology domain.

In this paper, we present the development of a clinical narratives dataset that can be used to implement data-driven methods for the automatic evaluation of MI sessions.

(1) http://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series
(2) http://psychiatry.yale.edu/pdc/resources/datasets.aspx

3 Motivational Interviewing Dataset

3.1 Data collection

The dataset is derived from a collection of 284 video recordings of counseling encounters using MI. The recordings were collected from various sources, including two clinical trials, students' counseling sessions from a graduate-level MI course, wellness coaching phone calls, and demonstrations of MI strategies in brief medical encounters.

The clinical trial sessions consist of interventions for smoking cessation and antiretroviral therapy adherence with electronic drug monitoring. The psychology students' sessions are conducted with standardized patients and aim at weight loss and smoking cessation. The wellness coaching phone calls inquired about patient health and medication adherence after surgery. The demonstration sessions were collected from online sources, i.e., YouTube and Vimeo, and present brief MI encounters in several scenarios such as dental practice, emergency room counseling, and student counseling. Table 1 presents a summary of the data sources used in the dataset collection.

Table 1: Data sources for the MI sessions

Source                 No. sessions   Avg. length
Clinical trial         121            27 min
Standardized patients  138            15 min
Brief MI encounters    18             4 min
Coaching phone calls   7              15 min
Total                  284

All the sessions were manually anonymized to remove identifiable information such as counselor and patient names and references to counseling site locations. Each recording is assigned a new identifier that does not include any information related to the original recording. The resulting recordings are then processed to remove the visual data stream to further prevent counselor/patient identification. After this process, we obtained a set of 277 sessions, due to the exclusion of some sessions with recording errors. The final dataset comprises a total of 97.8 hours of audio, with an average session duration of 20.8 minutes and a standard deviation of 11.4 minutes.

3.2 Transcriptions

The sessions from the clinical trials include full transcripts. This was not the case for the remaining sessions, for which we obtained manual transcriptions via crowdsourcing using Amazon Mechanical Turk. This resource has proved to be a fast and reliable method for obtaining speech transcriptions (Marge et al., 2010). Mechanical Turk workers were provided with transcription guidelines that include clearly identifying the speaker (client or counselor) and transcribing speech disfluencies such as false starts, repetitions of whole words or parts of words, prolongations of sounds, fillers, and long pauses. The resulting transcriptions were manually verified to avoid spam and to ensure their quality. The transcriptions consist of approximately 22,719 utterances, with an average of 83 utterances per session.

3.3 MITI Annotations

Three counselors with previous MI experience were trained in the use of the MITI 4.1 by expert trainers from the Motivational Interviewing Network of Trainers (MINT)(3) to conduct the annotation task. Prior to the annotation phase, annotators participated in a coding calibration phase where they discussed the criteria for sentence parsing and the correct assignment of behavior codes, and conducted team coding of sample sessions. Annotators used both audio recordings and verbatim transcriptions to conduct the annotations.

Annotators were instructed to parse the interviewer speech following the guidelines defined by MITI 4.1. The annotation was conducted at the utterance level, by selecting and labeling the utterances in each counselor turn that contain a specific MI behavior. Following this strategy allowed us to obtain more accurate examples of each behavior code for cases where a turn contains multiple utterances and thus more than one behavior code. In addition, given possible inaccuracies and interruptions in the turn-by-turn segmentation, annotators were allowed to select the text that they considered to belong to a coded utterance, even if it spanned more than one counselor-client turn, to avoid utterance breaking.

To facilitate this process, annotators used a software-based coding system instead of the traditional paper-and-pen system. Annotators were trained to code using the NVivo software,(4) a qualitative analysis suite that allows users to select and assign text segments to a given codebook. The codebook contains the following behavior codes:

Question (QUEST): All questioning statements spoken by clinicians.

Simple reflection (SR): Clinician statements that convey understanding or facilitate client-clinician exchanges.

Complex reflection (CR): Reflective statements that add substantial meaning or emphasis to what the client has said.

(3) http://www.motivationalinterviewing.org/
(4) http://www.qsrinternational.com/what-is-nvivo
Table 2: Frequency counts and verbal examples of MI behaviors in the dataset

QUEST (5262): "Could you talk a little bit more about those behaviors you say that automatically makes you smoke?"

SR (2690): "It sounds like something that you know and feel like you can improve on in the next week."

CR (2876): "So you want something that's gonna allow you to eat the foods that you enjoy but that maybe more moderation."

SEEK (614): "And, then, when we meet again, you can bring some of that information. Maybe we can discuss which of those feels right for you, and start to put together a plan for what could be your next steps."

AUTO (141): "This is something that it's up to you whether you want to use it or not."

AF (499): "Okay, great. So, I'm excited about this because you're obviously very motivated. And the barriers that you've presented are definitely overcomable."

CON (141): "Okay, well that's a good start, but cutting back isn't gonna do it. If you actually quit the smoking, you can reverse all the damage you've done in your mouth, and you can stop yourself from ... from being at risk for these other diseases. But, but as long as you're continuing to use these cigars, you're really putting yourself in a lot of danger."

PWOP (598): "Okay so with all of the risks of smoking and the benefits of quitting, what is keeping you from making a plan?"

N-GI (1017): "There are two other over the counter options. There's a patch and that would deal with the taste you don't like. With the patch you just put it on and it slowly releases nicotine throughout the day so you don't even have to think about it. There are also lozenges, which are kind of like throat lozenges, or a hard candy and you just suck on it. And as it dissolves it releases nicotine."

N-PWP (2100): "Well, if it's alright with you, umm, you know, I could toss out some ideas of things that have worked for other people and umm things that umm, could be helpful as far as reducing stress and, and really filling in other activities so you're not umm, as tempted to ... smoke"
Seeking collaboration (SEEK): The clinician attempts to share power or acknowledges the expertise of the client.

Emphasizing autonomy (AUTO): The clinician focuses the responsibility on the client for the decisions and actions pertaining to change.

Affirm (AF): Clinician utterances that accentuate something positive about the client.

Persuading without permission (PWOP): The clinician attempts to change the client's opinions, attitudes, or behaviors using tools such as logic, compelling arguments, self-disclosure, or facts.

Confront (CON): Statements where the clinician confronts the client by directly disagreeing, arguing, correcting, shaming, criticizing, moralizing, or questioning the client's honesty.

Persuading with permission (N-PWP): Clinician statements that emphasize collaboration or autonomy support while persuading.

Giving information (N-GI): The clinician gives information, educates, or expresses a professional opinion without persuading, advising, or warning.

The 277 sessions were randomly distributed among the three annotators. The team annotated approximately 20 sessions per week, and the entire annotation process took about three months.
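For readers who want to work with annotations of this kind programmatically, one possible in-memory representation is sketched below. The field names and the "NL" label for non-coded utterances are our own illustration, not the dataset's actual release format:

```python
from dataclasses import dataclass

# The ten MITI behavior codes listed above.
MITI_CODES = {
    "QUEST": "Question",
    "SR": "Simple reflection",
    "CR": "Complex reflection",
    "SEEK": "Seeking collaboration",
    "AUTO": "Emphasizing autonomy",
    "AF": "Affirm",
    "PWOP": "Persuading without permission",
    "CON": "Confront",
    "N-PWP": "Persuading with permission",
    "N-GI": "Giving information",
}

@dataclass
class CodedUtterance:
    session_id: str  # anonymized recording identifier
    text: str        # counselor utterance (may span more than one turn)
    code: str        # one of MITI_CODES, or "NL" for non-coded utterances

    def __post_init__(self):
        # Reject anything outside the codebook early.
        if self.code != "NL" and self.code not in MITI_CODES:
            raise ValueError(f"unknown MITI code: {self.code!r}")

u = CodedUtterance("session-001", "Could you tell me more about that?", "QUEST")
```

Validating codes at construction time keeps later aggregation steps (frequency counts, agreement statistics) free of typo-induced phantom categories.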
Table 3: Sample utterance alignment for coding comparisons

Exact match
  Annotator 1: "So you're getting back to your old self." (SR)
  Annotator 2: "So you're getting back to your old self." (SR)

Split utterances
  Annotator 1: "So it sounds like you kinda struggle with that a little bit in that sometimes it's hard I imagine, it is sometimes hard to be financially independent I mean I, But it something it sounds like you respect in yourself that you are able to do it." (SR)
  Annotator 2: the same text split into separate annotations (coded NL and SR)

Partial match
  Annotator 1: "OK. But even though it's something that you really don't like, it's something that's not terribly bothersome." (SR)
  Annotator 2: "So you mentioned that one side effect of the Sustiva was that it makes you dizzy. OK. But even though it is something that you really don't like, it something that, it's not terribly bothersome." (SR)
After the annotation phase, transcripts were pro- Code ICC Kappa cessed to extract the verbal content of each MITI an- QUEST 0.97 0.64 notation; non-coded utterances were also extracted CR 0.97 0.49 and labeled as neutral. Sample annotations are pre- SR 0.89 0.34 sented in Table 2. The final set contains 15,886 an- SEEK 0.03 0.42 notations distributed among the ten codes and 6,833 N-GI 0 0.28 neutral utterances. Table 2 shows the frequency dis- AF 0 0.47 tribution for each behavior count and neutral utter- AUTO 0 0.31 ances. N-PWP 0 NA CON NA NA 4 Dataset Validation PWOP NA NA In order to validate the annotator reliability, a sample Table 4: ICC at session level and Kappa scores at utterance of 10 sessions was randomly selected to be double level for 10 double coded sessions. NA indicates that the MI coded by two members of the coding team. behavior was not present in any session The total amount of recoding material for this sample is about 4.5 hours. Each session has an av- erage duration of 26 minutes, with an average of ance boundaries such as overlaps and split utter- 115 counselor-client conversation turns per session, ances. In order to allow for coding comparisons, we comprising a total of 1,160 utterances. opted for aligning annotations by utterance match- ing using similar methods to (Lord et al., 2015a). We 4.1 Inter-rater Reliability Analysis considered three cases: exact match, partial match Because we conducted the MITI annotation at ut- and split utterances. In the first case, we simply terance level without any pre-parsing, annotations compare two coded utterances and define a match across coders showed noticeable parsing variations. if both utterances contained the same words. The These variations consisted of differences in utter- partial match addresses cases where two coders dis-
agree in utterance boundaries, thus resulting in annotations from one annotator partially matching the other's, i.e., showing some degree of overlap. The third case also deals with differences due to utterance boundaries but focuses on split utterances, i.e., an annotated utterance from one coder was split into two different annotations by the other, as well as cases where utterances with different annotations show some degree of overlap. Table 3 presents sample utterances.
Using the transcript from each session, we first identified those utterances that were assigned a behavior code by either of the two annotators. Then, we compared their verbal content by applying the utterance matching methods described above. We assigned a match when both annotators agreed in their evaluations. We counted both split utterances and partial matches as a single match. Those utterances for which we were unable to find a matching pair, or that differed on the assigned codes, were regarded as disagreements.
Table 4 presents the Intraclass Correlation Coefficient (ICC) measured at session level. Reported scores were obtained using a two-way mixed model with absolute agreement (Jelsma et al., 2015). Overall, we observe excellent ICC scores for Complex Reflections (CR), Simple Reflections (SR), and Questions, based on ICC reference values, where values ranging from 0.75 to 1 are considered excellent agreement (Jelsma et al., 2015).
The ICC scores suggest that annotators did not show significant variation on most of the coded behaviors, except for Seeking Collaboration (SEEK), which showed considerable disagreement. We believe this is caused by the higher variability in the frequency counts for this code across the 10 sessions.
To evaluate how well the annotators agreed when coding the same annotation, we calculated the pairwise agreement among coders using Cohen's Kappa. Results are also reported in Table 4. The Kappa values suggest fair to good levels of agreement among the different behavior codes.
In addition, we evaluated the ability of coders to distinguish the occurrence of a particular behavior code versus any other code. This allows us to answer questions such as: how well did the annotators agree on what is considered a reflection as compared to what is not a reflection? This analysis provides further insights about the validity of the coding. In these comparisons, utterances coded with a different behavior than the target behavior were considered the negative case. For instance, if the target behavior was Simple Reflection (SR), then we evaluated the identification of Simple Reflection vs. non-Simple Reflection. In order to more accurately represent the human coding process, we also included non-coded utterances (NL) as negative cases. Figure 1 shows the annotation agreement between the two annotators for the 10 sessions coded at utterance level in a heatmap representation, where the color intensity represents the agreement distribution. In the shown matrix, the x axis indicates the MI code assigned by the first annotator and the y axis the MI code assigned by the second annotator. Each cell indicates the observed frequency of a coding pair.
From this matrix, we observe that questions attained the highest agreement levels among all behaviors, followed by simple reflections (SR), complex reflections (CR), seeking collaboration (SEEK), giving information (GI), and emphasizing autonomy (AUTO). Among the observed disagreements, a small fraction of the questions annotated by one coder were regarded as Simple Reflections or were left uncoded by the second coder. This might be related to ambiguous cases, where the counselor formulated a simple reflection but added a question tone at the end of the sentence, thus making the reflection sound like a question. In addition, annotators showed noticeable disagreement when distinguishing between complex and simple reflections. This was somewhat expected, as the MI literature has reported similar findings given the highly subjective criteria applied while evaluating these codes (Lundahl et al., 2010). Annotators found no agreement for the Confronting (CON) and Persuading Without Permission (PWOP) codes. This has to do with zero or low frequency counts: e.g., the first annotator found only one confrontation utterance while the second annotator found zero. Finally, annotators showed high agreement on utterances that did not contain MI behaviors, suggesting that 1) annotators have good agreement regarding what should be coded; and 2) differences in parsing did not affect the annotation process.
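To make the session-level reliability computation concrete, the single-measures ICC under a two-way model with absolute agreement can be sketched as below. The function and the per-session counts are illustrative assumptions, not the paper's data or implementation:

```python
import numpy as np

def icc_absolute_agreement(scores):
    """Single-measures ICC under a two-way model with absolute
    agreement; rows are targets (sessions), columns are raters."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sum of squares for targets (rows), raters (columns), and error.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical per-session counts of one behavior from two annotators.
counts = [[12, 11], [7, 8], [15, 15], [3, 2], [9, 10]]
print(round(icc_absolute_agreement(counts), 3))  # → 0.982
```

Identical ratings from both annotators yield an ICC of exactly 1; scores above 0.75 fall in the "excellent" band referenced above.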
        NL   SR   CR  NGI   AF SEEK AUTO  CON PWOP QUEST
NL     245   34   19    3   16    8    1    0    0     9
SR      37   59   62    1    7    3    0    0    0     5
CR      16   11   76    0    6    0    0    1    1     2
NGI     36    1    4    9    2    1    0    0    1     1
AF       9    1    3    0   24    0    0    0    0     0
SEEK     0    0    0    0    0   15    0    0    0     3
AUTO     0    0    1    0    1    1    7    0    0     0
CON      0    0    0    0    0    0    0    0    0     1
PWOP     0    1    0    0    0    0    0    0    0     0
QUEST   12    4    0    0    1    2    0    0    0   228

Figure 1: Annotator agreement on non-coded utterances (NL) and MI behaviors. The x axis (columns) indicates the MI code assigned by the first annotator and the y axis (rows) the MI code assigned by the second annotator.
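As an aside, an overall Cohen's Kappa can be recovered directly from a co-occurrence matrix such as the one in Figure 1. The sketch below pools all utterances into a single aggregate kappa, which is illustrative and distinct from the per-code values the paper reports in Table 4:

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square co-occurrence matrix whose rows and
    columns are the codes assigned by the two annotators."""
    n = sum(sum(row) for row in matrix)
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Cell counts transcribed from Figure 1 (rows: second annotator,
# columns: first annotator), codes NL..QUEST in the order shown.
fig1 = [
    [245, 34, 19, 3, 16, 8, 1, 0, 0, 9],
    [37, 59, 62, 1, 7, 3, 0, 0, 0, 5],
    [16, 11, 76, 0, 6, 0, 0, 1, 1, 2],
    [36, 1, 4, 9, 2, 1, 0, 0, 1, 1],
    [9, 1, 3, 0, 24, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 15, 0, 0, 0, 3],
    [0, 0, 1, 0, 1, 1, 7, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [12, 4, 0, 0, 1, 2, 0, 0, 0, 228],
]
print(round(cohens_kappa(fig1), 2))  # → 0.57
```

The pooled value of about 0.57 sits inside the fair-to-good 0.31-0.64 per-code range reported above (kappa is unchanged if the matrix is transposed, so the row/column convention does not matter here).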
5 Conclusion

In this paper, we introduced a new clinical narratives dataset derived from MI interventions. The dataset consists of annotations for ten verbal behaviors displayed by the counselor while conducting MI counseling. We presented a detailed description of the dataset collection and annotation process. We conducted a reliability analysis in which we showed that annotators achieved excellent agreement at session level, with ICC scores in the range of 0.75 to 1, and fair to good agreement at utterance level, with Cohen's Kappa scores ranging from 0.31 to 0.64. The paper reports our initial efforts towards building accurate tools for the automatic coding of MI encounters. Our future work includes developing data-driven methods for the prediction of MI behaviors.

Acknowledgments

This material is based in part upon work supported by the University of Michigan under the MCubed program, by the National Science Foundation award #1344257, and by grant #48503 from the John Templeton Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the University of Michigan, the National Science Foundation, or the John Templeton Foundation.

References

Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F Styler, Colin Warner, Jena D Hwang, Jinho D Choi, Dmitriy Dligach, Rodney D Nielsen, James Martin, et al. 2013. Towards comprehensive syntactic and semantic annotations of the clinical narrative.

Timothy R Apodaca, Brian Borsari, Kristina M Jackson, Molly Magill, Richard Longabaugh, Nadine R Mastroleo, and Nancy P Barnett. 2014. Sustain talk predicts poorer outcomes among mandated college student drinkers receiving a brief motivational intervention. Psychology of Addictive Behaviors, 28(3):631.

David Arthur. 1999. Assessing nursing students' basic communication and interviewing skills: the development and testing of a rating scale. Journal of Advanced Nursing, 29(3):658–665.

David C Atkins, Timothy N Rubin, Mark Steyvers, Michelle A Doeden, Brian R Baucom, and Andrew Christensen. 2012. Topic models: A novel method for modeling couple and family text data. Journal of Family Psychology, 26(5):816.

David C Atkins, Mark Steyvers, Zac E Imel, and Padhraic Smyth. 2014. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 9(1):49.

Elizabeth Barnett, Theresa B Moyers, Steve Sussman, Caitlin Smith, Louise A Rohrbach, Ping Sun, and Donna Spruijt-Metz. 2014. From counselor skill to decreased marijuana use: Does change talk matter? Journal of Substance Abuse Treatment, 46(4):498–505.

Dogan Can, Panayiotis G. Georgiou, David C. Atkins, and Shrikanth S. Narayanan. 2012. A case study: Detecting counselor reflections in psychotherapy for addictions using linguistic features. In INTERSPEECH, pages 2254–2257. ISCA.

Delwyn Catley, Kari J Harris, Kathy Goggin, Kimber Richter, Karen Williams, Christi Patten, Ken Resnicow, Edward Ellerbeck, Andrea Bradley-Ewing, Domonique Malomo, et al. 2012. Motivational interviewing for encouraging quit attempts among unmotivated smokers: study protocol of a randomized, controlled, efficacy trial. BMC Public Health, 12(1):456.

Chris Dunn, Doyanne Darnell, Sheng Kung Michael Yi, Mark Steyvers, Kristin Bumgardner, Sarah Peregrine Lord, Zac Imel, and David C Atkins. 2015. Should we trust our judgments about the proficiency of motivational interviewing counselors? A glimpse at the impact of low inter-rater reliability. Motivational Interviewing: Training, Research, Implementation, Practice, 1(3):38–41.

R Gupta, P G Georgiou, D C Atkins, and S Narayanan. 2014. Predicting client's inclination towards target behavior change in motivational interviewing and investigating the role of laughter. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 208–212.

Judith GM Jelsma, Vera-Christina Mertens, Lisa Forsberg, and Lars Forsberg. 2015. How to measure motivational interviewing fidelity in randomized controlled trials: Practical recommendations. Contemporary Clinical Trials, 43:93–99.

Florian E Klonek, Vicenç Quera, and Simone Kauffeld. 2015. Coding interactions in motivational interviewing with computer-software: What are the advantages for process researchers? Computers in Human Behavior, 44:284–292.

Claire Lane, Michelle Huws-Thomas, Kerenza Hood, Stephen Rollnick, Karen Edwards, and Michael Robling. 2005. Measuring adaptations of motivational interviewing: the development and validation of the behavior change counseling index (BECCI). Patient Education and Counseling, 56(2):166–173.

Sarah Peregrine Lord, Doğan Can, Michael Yi, Rebeca Marin, Christopher W Dunn, Zac E Imel, Panayiotis Georgiou, Shrikanth Narayanan, Mark Steyvers, and David C Atkins. 2015a. Advancing methods for reliably assessing motivational interviewing fidelity using the motivational interviewing skills code. Journal of Substance Abuse Treatment, 49:50–57.

Sarah Peregrine Lord, Elisa Sheng, Zac E Imel, John Baer, and David C Atkins. 2015b. More than reflections: empathy in motivational interviewing includes language style synchrony between therapist and client. Behavior Therapy, 46(3):296–303.

Brad W Lundahl, Chelsea Kunz, Cynthia Brownell, Derrik Tollefson, and Brian L Burke. 2010. A meta-analysis of motivational interviewing: Twenty-five years of empirical studies. Research on Social Work Practice.

Michael B Madson, E Bullock, A Speed, and S Hodges. 2009. Development of the client evaluation of motivational interviewing. Motivational Interviewing Network of Trainers Bulletin, 15:6–8.

Matthew Marge, Satanjeev Banerjee, Alexander Rudnicky, et al. 2010. Using the Amazon Mechanical Turk for transcription of spoken language. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5270–5273. IEEE.

Steve Martino, Samuel A Ball, Charla Nich, Tami L Frankforter, and Kathleen M Carroll. 2009. Informal discussions in substance abuse treatment sessions. Journal of Substance Abuse Treatment, 36(4):366–375.

Fiona McMaster and Ken Resnicow. 2015. Validation of the one pass measure for motivational interviewing competence. Patient Education and Counseling, 98(4):499–505.

Mary McMurran. 2009. Motivational interviewing with offenders: A systematic review. Legal and Criminological Psychology, 14(1):83–100.

William R Miller and Stephen Rollnick. 2013. Motivational Interviewing: Helping People Change, Third edition. The Guilford Press.

Theresa Moyers, Tim Martin, Delwyn Catley, Kari Jo Harris, and Jasjit S Ahluwalia. 2003. Assessing the integrity of motivational interviewing interventions: Reliability of the motivational interviewing skills code. Behavioural and Cognitive Psychotherapy, 31(02):177–184.

Theresa B Moyers, Tim Martin, Jennifer K Manuel, Stacey ML Hendrickson, and William R Miller. 2005. Assessing competence in the use of motivational interviewing. Journal of Substance Abuse Treatment, 28(1):19–26.

Theresa B Moyers, Tim Martin, Jon M Houck, Paulette J Christopher, and J Scott Tonigan. 2009. From in-session behaviors to drinking outcomes: a causal chain for motivational interviewing. Journal of Consulting and Clinical Psychology, 77(6):1113.

Ken Resnicow, Colleen DiIorio, Johana E Soet, Belinda Borrelli, Denise Ernst, Jacki Hecht, and Angelica Thevos. 2002. Motivational interviewing in medical and public health settings. Motivational Interviewing: Preparing People for Change, 2:251–269.

Angus Roberts, Robert Gaizauskas, Mark Hepple, Neil Davis, George Demetriou, Yikun Guo, Jay Kola, Ian Roberts, Andrea Setzer, Archana Tapuria, and Bill Wheeldin. 2007. The CLEF corpus: semantic annotation of clinical text. AMIA Annual Symposium Proceedings, pages 625–629.

Stephen Rollnick, William R Miller, Christopher C Butler, and Mark S Aloia. 2008. Motivational interviewing in health care: helping patients change behavior. COPD: Journal of Chronic Obstructive Pulmonary Disease, 5(3):203–203.

Michael Tanana, Kevin Hallgren, Zac Imel, David Atkins, Padhraic Smyth, and Vivek Srikumar. 2015. Recursive neural networks for coding therapist and patient behavior in motivational interviewing. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 71–79.

Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, et al. 2012. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics, 13(1):1.

Stéphanie Wahab. 2005. Motivational interviewing and social work practice. Journal of Social Work, 5(1):45–60.

Bo Xiao, Daniel Bone, Maarten Van Segbroeck, Zac E Imel, David C Atkins, Panayiotis G Georgiou, and Shrikanth S Narayanan. 2014. Modeling therapist empathy through prosody in drug addiction counseling. In Fifteenth Annual Conference of the International Speech Communication Association.
Crazy Mad Nutters: The Language of Mental Health

Jena D. Hwang and Kristy Hollingshead
Institute for Human and Machine Cognition (IHMC)
Ocala, FL 34470, USA
{jhwang,kseitz}@ihmc.us
Abstract

Many people with mental illnesses face challenges posed by stigma perpetuated by fear and misconception in society at large. This societal stigma against mental health conditions is present in everyday language. In this study we take a set of 14 words with the potential to stigmatize mental health and sample Twitter as an approximation of contemporary discourse. Annotation reveals that these words are used with different senses, from expressive to stigmatizing to clinical. We use these word-sense annotations to extract a set of mental health–aware Twitter users, and compare their language use to that of an age- and gender-matched comparison set of users, discovering a difference in the frequency of stigmatizing senses as well as a change in the target of pejorative senses. Such analysis may provide a first step towards a tool with the potential to help everyday people increase awareness of their own stigmatizing language, and to measure the effectiveness of anti-stigma campaigns in changing our discourse.

Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 52–62, San Diego, California, June 16, 2016. c 2016 Association for Computational Linguistics

1 Introduction

The World Health Organization (WHO) estimates that one in four people worldwide will suffer from a mental illness at some point in their lives (World Health Organization, 2011). One in five Americans experience a mental health problem in any given year (Kessler et al., 2007; Substance Abuse and Mental Health Services Administration, 2014). Many people with a mental illness experience social and economic hardship as a direct result of their illness. They must cope not only with their symptoms, but also with the stigma and discrimination that result from misconceptions about such illnesses (McNair et al., 2002; Corrigan et al., 2003). In fact, the stigma and discrimination related to mental illnesses have been described as having worse consequences than the mental illnesses themselves, contributing to people's hesitation to seek treatment for a mental health condition (Corrigan et al., 2014).

Many studies in Linguistics and Cognitive Science have shown that word choice and language use directly influence the speaker's thought and actions (cf. linguistic relativism (Boroditsky, 2011; Berlin and Kay, 1991; Lakoff, 1990)). Word choice and the context to which the words are attributed serve to foster stigma and prejudice toward people with mental health conditions, trivializing serious mental health conditions and their accompanying experiences. Anti-stigma campaigns, designed to raise public awareness of mental health stigma, have in recent years focused on bringing public attention to the negative impact of word choice. For example, reporters are advised that, as with any disparaging words related to race and ethnicity, some words should never be used in reporting, including 'crazy', 'nuts', 'lunatic', 'deranged', 'psycho', and 'wacko'. The premise behind anti-stigma campaigns is that increased awareness of the detrimental effects of stigmatizing language associated with mental health will help the public become more judicious in their word choice and, consequently, change their attitudes and behaviors toward mental illnesses and those suffering from them.

One might ask, then, whether these anti-stigma campaigns are effective; are they changing the discourse, particularly around mental health? Evidence seems to indicate that interventions to reduce stigma are occasionally effective in the short term (Thornicroft et al., 2015). As a first step in addressing this question empirically, we explore 14 of the common terms that have been the focus of a number of anti-stigma campaigns, such as 'crazy', 'mental', or 'psycho', that can be used in a derogatory or pejorative manner. We evaluate whether awareness of mental illnesses indeed encourages a more restrained use of these words, either avoiding the words entirely or reducing the use of a word in its stigmatizing sense.

For data, we turn to social media, a platform used by nearly four billion people worldwide.1 Social media platforms offer an uncensored, unscripted view of the ongoing discourse of the online generation, thus providing a source for analyzing contemporary language use. In particular for this study, we focus on public posts to Twitter.

The paper is structured as follows: we begin with a brief discussion of related work, and present our methods and motivation for gathering social media data from Twitter in a two-stage process. We then inventory the various word senses discovered in the data for each of the 14 stigmatizing words related to mental health. We discuss our annotation process, beginning with finer-grained senses and moving to a coarser-grained sense inventory for comparison across the set of stigmatizing words. We show the results of word sense analysis across two different sets of social media users, demonstrating that a user's mental health awareness may be reflected in the use – or lack thereof – of stigmatizing language. Finally, we conclude the paper with a few potential applications of this technology.

2 Related Work

There has recently been an explosion in work using technology to detect and characterize various aspects of individuals with mental health disorders, particularly online (Ayers et al., 2013; Yang et al., 2010; Hausner et al., 2008) and on social media (Coppersmith et al., 2014; De Choudhury, 2013; Nguyen et al., 2014; Schwartz et al., 2013). Many social media users post about their own health conditions, including physical conditions such as cancer or the flu (Paul and Dredze, 2011; Dredze, 2012; Hawn, 2009), but also mental health conditions such as depression (Ramirez-Esparza et al., 2008; De Choudhury et al., 2013; Park et al., 2012), bipolar disorder (Kramer et al., 2004), schizophrenia (Mitchell et al., 2015), and a wide range of other mental health conditions (Coppersmith et al., 2015).

In contrast to this previous work, which analyzed the language use of social media users with mental health conditions, we focus on the use of language related to mental health, regardless of the social media users' mental health status. Reavley and Pilkington (2014) also examined Twitter data related specifically to stigma associated with depression and schizophrenia. Similarly, Joseph et al. (2015) analyzed the sentiment and tone in tweets containing the hashtags #schizophrenia and #schizophrenic, and compared these to tweets containing the hashtags #diabetes and #diabetic, in order to determine the difference in attitude toward an often-stigmatized illness like schizophrenia versus an un-stigmatized physical illness like diabetes. The study discovered that tweets referencing schizophrenia were twice as likely to be negative as tweets referencing diabetes, and that Twitter users were more likely to use humorous or sarcastic language in association with the adjective schizophrenic than with diabetic.

Our work has a broader scope than this previous work, in that we examine stigmatizing language related to a wide range of mental health conditions. The closest work to ours is that of Collins (2015), wherein she conducted a historical topic- and co-occurrence analysis of the hashtags #insane, #psycho, #schizo, and #nutter on Twitter. In this work, we add to this list, as described in the next section (see Table 1), and extend the analysis to include a broader examination of language use.

This work also borrows from annotation methods widely used in natural language processing. The process of annotation includes a linguistic analysis of data, the development of annotation standards or guidelines, and the manual tagging of the data with the set standards. Such annotation methods have been used in annotating data for lexical (Duffield et al., 2010), semantic (Schneider et al., 2015; Hwang et al., 2014; Hwang et al., 2010; Palmer et al., 2005), and syntactic (Marcus et al., 1994) tasks.

1 http://www.statisticbrain.com/social-networking-statistics/
3 Stigmatizing Words

In this study, we focus on 14 words with the potential to stigmatize mental health, which we will refer to as stigmatizing words. Table 1 contains a list of the stigmatizing words used in this study.

bonkers   insane   mad      nuts    schizo
crazy     loony    mental   nutter  wacko
deranged  lunatic  nutcase  psycho

Table 1: Stigmatizing words used for keyword search.

Starting from the 4 words 'insane', 'psycho', 'schizo', and 'nutter' as studied by Collins (2015), we extended the list by including words that are often cited as problematic terminology by various anti-stigma campaigns, to arrive at our list of 14. In particular, we focused on the terminology discussed in various blog entries, articles, or publications by the National Alliance on Mental Illness (NAMI), Time to Change, and HealthyPlace.2

4 Twitter Data

All of the data in this study comes from publicly available Twitter data collected using the Twitter API. The data was collected in two stages.

4.1 Keyword-Based Data

In the first stage, we obtained two months' (July and August 2015) worth of tweets based on a keyword search of the 14 stigmatizing words. We will refer to the set of tweets collected using this keyword-based search as the seed set. This collection contains over 27 million tweets. Once extracted, the seed set was filtered to remove tweets containing the label RT (retweets) or URLs, on the assumption that such tweets often contain text not authored by the user. Tweets were also filtered to select only those marked by Twitter as English (i.e., "en" or "en-gb"). In order to exclude instances where stigmatizing words show up in the user handle (e.g., @crazygirl), user mentions (@s) were removed for the purposes of filtering. Finally, any exact-match duplicates among the set of tweets were removed for purposes of annotation. The final, filtered set consists of just over 840k tweets. Of these, 'nuts' and 'crazy' make up over 20% of the dataset each; 'mad' is the next most frequent, at 13%, with 'psycho', 'insane', and 'mental' following behind with 7-11% each. 'Bonkers' and 'lunatic' each comprise 3% of the data; 'nutter' and 'deranged' are less than 2% of the data, while the remaining words – 'loony', 'nutcase', 'schizo', and 'wacko' – each comprise less than 1% of the filtered seed set.

For each stigmatizing word, 100 random tweets containing the word were selected for annotation. Each selected tweet was manually analyzed to establish an inventory of the types and varieties of meanings or senses of the words as used in the tweets (see Section 5 for further discussion). Based on the established senses, whenever a tweet used a clinical sense of the word, the tweet was considered to originate from a mental health aware (MHA) user. Users that did not have any clinical usages were considered to be mental health unaware (MHU) users. Table 2 provides a few example tweets for several of the stigmatizing words.

"I'm fuming. How dare a TV show portray folks suffering from mental health issues so unfairly? As if there isn't already enough stigma around."
"this is exactly why I think JH is borderline personality disorder. seems to fit. no one wanted 2 look at lesser mental issue."
"I don't even score high on schizo symptoms and that's what bothers me most besides mood issues"
"Ha... apparently according to my Schizo voices (audio hallucinations) some of them went out on the piss while I was asleep"
"A lot of Americans are injured for them to portray one person as 'insane or mentally ill' "
"If you think negatively about yourself, this should help. Im certifiably nuts. I know these things."

Table 2: Examples of tweets using stigmatizing words in their clinical sense.

4.2 User-Based Data

The second-stage dataset, which we refer to as the user-based set, is constructed from tweets posted publicly by a set of users. For this dataset, we began by using the tweets annotated as MHA to extract a list of Twitter users that had used mental health aware language in at least one tweet. As the tweets typically contained specific references to mental health or mental issues, we make the assumption that these users may be more sensitive to the existing stigma or prejudices towards mental health disorders. We term this set of users the MHA users.

We then extract a set of users for comparison. Generally, a comparison set would be generated based on a random selection of Twitter users. In our case, we limit that selection to users who tweeted any of the stigmatizing words from Table 1 in a non-clinical sense during the collection time period (July-August 2015) and did not use the stigmatizing words in a clinical sense during that time period, and thus do not belong to the MHA user group. We term these users the MHU users. The set of MHU users is much larger than the set of MHA users: 686 versus 60, respectively. We additionally down-selected within the MHU set to form an age- and gender-matched comparison set, based on evidence that failing to account for age and gender can yield biased comparison groups that may skew results (Dos Reis and Culotta, 2015). To create an approximately matched comparison set, we take each user in our full MHA and MHU sets, and obtain age and gender estimates for each from the tools provided by the World Well-Being Project (Sap et al., 2014). These tools use lexica derived from Facebook data to identify demographics, and have been shown to be successful on Twitter data (Coppersmith et al., 2015). Then, in order to select the best comparison set, we selected (without replacement) the MHU user with the same gender label and closest in age to each of the MHA users.

We use a balanced dataset here for our analysis, by selecting an equal number of MHA users and comparison MHU users. In practice, and as stated above, we found approximately an order of magnitude difference between MHA and MHU users.

2 https://www.nami.org/, http://www.time-to-change.org.uk/, and http://www.healthyplace.com/, respectively
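The seed-set filtering described in Section 4.1 could be sketched roughly as below. The dictionary fields mirror the Twitter API's 'lang' and 'text' attributes, but the helper itself and its heuristics (e.g., detecting URLs via a substring test) are illustrative assumptions rather than the authors' code:

```python
import re

MENTION = re.compile(r"@\w+")

def filter_seed_set(tweets):
    """Keep English tweets that are not retweets and carry no URLs;
    strip @-mentions and drop exact-match duplicates."""
    seen, kept = set(), []
    for tw in tweets:
        if tw.get("lang") not in ("en", "en-gb"):
            continue
        text = tw["text"]
        if text.startswith("RT ") or "http" in text:
            continue
        cleaned = MENTION.sub("", text).strip()
        if cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept

# Hypothetical inputs illustrating each filtering rule.
sample = [
    {"lang": "en", "text": "@crazygirl that exam was crazy hard"},
    {"lang": "en", "text": "RT @bot totally nuts http://x.co"},
    {"lang": "fr", "text": "c'est fou"},
    {"lang": "en", "text": "that exam was crazy hard"},  # duplicate once mentions are stripped
]
print(filter_seed_set(sample))  # → ['that exam was crazy hard']
```

Stripping mentions before de-duplication matters: it both removes stigmatizing words hiding in handles and makes near-identical tweets collapse to one entry.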
Our selection of a balanced set enables senses for each of the stigmatizing words. Instances simpler machine learning classification efforts, and annotated with a clinical sense were marked as MHA helps to demonstrate the language differences be- tweets and used to develop a user-based dataset. tween the two groups more clearly than if we had Subsequently, in order to facilitate a comparison examined a dataset more representative of the pop- of senses as used across stigmatizing words, the ulation ( 1/10). Our results should be taken as val- fine-grained senses were binned into coarse-grained ∼ idation that the differences in language we observe senses. This process is detailed below. are relevant to mental health awareness, but only one 5.1 Fine-Grained Sense Inventory step towards applying something derived from this research as a tool in a real world scenario. We began with the definitions provided by Merriam- Webster4 as a basis for our initial sense inven- For each of the MHA users and the age- and 33200 is the maximum number of historic tweets permitted gender-matched MHU users, we retrieved all of their by the API. most recent public tweets via the Twitter API, up 4http://www.merriam-webster.com/
55 Sense Definition Example Crazy 1. irrational, crazy “Maybe I shouldn’t be revealing this crazy part of me...” 2. excitement “Got the club going crazy!” 3. odd, unusual “These cigar wraps are crazy” 4. extreme “I miss my best friend like crazy.” 5. intensifier “Smh my luck has been crazy bad lately” 6. exclamation “Crazy!” 7. name or label “Codeine crazy goes down in some of the greatest songs ever wrote.” Mad 1. angry, upset “@user I’m mad at you for being so cute” 2. irrational, crazy “Keep coughing like a mad woman” 3. extreme “That’s a mad show” 4. intensifier “Why is everyone in Chile mad good at singing?” 5. exclamation “Mad!” (akin to “Crazy!”) 6. names or labels “I just started thinking about a scene from Mad Men” Mental 1. clinical usage “Mental health awareness is something near and dear to my heart.” 2. of the mind “Emancipate yourselves from mental slavery. ” 3. irrational, crazy “People are going mental about this lion being killed. ” Nuts 1. irrational, crazy “When my mom went nuts on my sister for playing hooky...” 2. odd, unusual “Back to back is nuts but meek is about to MURDER it.” 3. testicles “Cassandra showed me her dog’s nuts” 4. exclamation “Aw nuts - -” 5. fruit “i don’t understand why all my office’s snacks have nuts in them.” 6. ‘deez nuts’ “if i had a dollar for every time i heard a kid yell ”deez nuts” at camp i would be rich” 7. clinical use “Why do you have you Dr’s personal cell in your phone?” Uh because I’m nuts.” 8. building parts “The nuts and bolts for this will be proper implementation and effective evaluation” Schizo 1. irrational “@user Total schizo ... I can’t imagine using anything other than TweetBot.” 2. clinical usage “I was diagnosed w/addiction once, but turned out I was schizo.” 3. names or labels “I added a video to a playlist Schizo BP2635 Brothers Pyrotechnics NEW FOR 2016”
Table 3: Examples of fine-grained sense inventory for the 5 most polysemous words of the 14 stigmatizing words analyzed. tory. We then made two annotation passes to in- mon meaning includes the state of unusual excite- clude missing senses, further refine existing senses, ment as attributed to situations (e.g., ‘crazy’ sense 2 or remove senses as dictated by our annotation of or ‘nuts’ sense 2) or objects (e.g., ‘crazy’ sense 3). the Twitter data. On average, 4 different senses Note that senses that indicate that someone or some- were identified for the stigmatizing words, ranging thing is irrational, extreme, or unusual are consid- from highly polysemous words with 6-8 senses like ered stigmatizing usages that anti-stigma campaigns ‘crazy’ and ‘mad’ to words with 1-2 senses like ‘de- highlight. ranged’ and ‘nutcase’. Table 3 shows the sense in- Additionally, we observe two common functional ventory for the more polysemous words in this study. usages. First is the adverbial usages that function as As expected, the most common sense for all of the intensifiers to the adjective the word precedes. For stigmatizing words is the meaning of irrationality or example “crazy bad” in ‘crazy’ sense 5 and “mad a state of being “not ‘right’ in one’s mind”, in refer- good” in ‘mad’ sense 4 serve to highlight or rein- ence to a human (e.g., ‘crazy’ sense 1, ‘mad’ sense 2, force the intensity of the adjectives ‘bad’ and ‘good’, or ‘mental’ sense 3 in Table 3). Another fairly com- respectively. Second common functional use is the
expressive usage as seen for 'crazy' sense 6, 'mad' sense 5, or 'nuts' sense 4. Rather than offering descriptive content, these expressives serve to convey a certain emotional perspective of the speaker.5

Out of the 14 words, only five showed instances that were used in the clinical sense of the word: 'mental' (sense 1), 'nuts' (sense 7), 'psycho' (sense 3), 'insane' (sense 4), and 'schizo' (sense 2). These clinical senses mark the MHA tweets, from which the user-based set was generated as detailed in Section 4.2.

5.2 Coarse-Grained Sense Inventory

If we are to analyze the stigmatizing words and their senses to compare their usage in MHA tweets and MHU tweets, we will find that fine-grained senses pose a difficulty. Consider the senses in Table 3. The words 'schizo' and 'mental', for example, have three senses each, but a sense for one word does not always have a counterpart for the other. Additionally, comparison of usage between the highly-polysemous 'nuts' and the three-sense 'mental' would require an additional layer of analysis to bridge the differences.

In an effort to make an apples-to-apples comparison of these words, we developed a set of coarse-grained senses that can be applied to all of the stigmatizing words. Fine-grained senses were then mapped to one of five coarse-grained (CGd) senses. Table 4 shows the list of coarse-grained senses.

A. term applied to sentient beings (often in a derogatory manner)
B. term applied to an object, situation, or world in general
C. clinical usage
D. homonymous usage
E. other senses

Table 4: Coarse-grained (CGd) senses for stigmatizing words.

Senses A, B and C are the relevant senses for our study. Senses A and B capture the stigmatizing sense of the word applied to sentient beings (e.g., humans, pets, etc.) and to objects or situations. Clinical or medical usages of the words are assigned to sense C. To take 'nuts' as an example, sense 1 was mapped to CGd sense A, senses 2 and 4 were mapped to CGd sense B, and sense 7 was mapped to coarse sense C.

Sense D is for homonymous usage like 'mad' sense 1 (i.e. anger in "I'm mad at you!") or 'nuts' sense 5 (i.e. fruit in "These snacks have nuts in them"). Sense E is a miscellaneous category that includes senses unrelated to the central senses of the word (e.g., a "names and labels" sense) or instances where the sense of the word was not identifiable or unclear (e.g., "tin nuts"). In the case of 'nuts', senses 3 and 5 were mapped to CGd sense D, and senses 6 and 8 were mapped to CGd sense E.

6 MHA vs. MHU

6.1 Stigmatizing Word and Sense Use

In evaluating occurrences of stigmatizing words in the MHA and MHU datasets, we find that, on the whole, the MHA users do use these words less frequently when compared to the MHU users. Table 5 shows the total count of tweets in the MHA and MHU datasets for each of the stigmatizing words.6 In fact, the number of MHU tweets appears to be nearly double that of MHA tweets. The sheer lack of the use of stigmatizing words in MHA suggests that the user's mental health awareness is likely to cause them to be more sensitive towards the stigmatization of those suffering from mental illness. Consequently, they are more likely to shy away from impulsive use of the stigmatizing words.

There are two exceptions to the observation that MHA users use stigmatizing words with less frequency. They are boldfaced in Table 5. The use of the words 'mental' and 'schizo' shows higher usage by the MHA set. However, if we focus on the numbers relevant to the stigmatizing senses (i.e., CGd senses A and B), the story becomes clearer. The leftmost columns of the table show that the majority of uses of the word 'mental' are in fact the clinical sense (CGd sense C). Additionally, note that the use of these words in a clinical sense by MHU users indicates that our original classification of MHU users is imperfect; by our definition, each of the MHU users producing these clinical-sense words should be classified as an MHA user. Since our seed set dataset and our user-based

5 For example, "Aw nuts!" expresses a certain set of emotions (e.g., regret, displeasure) as is relevant to the speaker for the immediate situation, rather than ascribing a descriptive meaning to something or someone.
6 These counts do not include the homonymous sense D.
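The fine-to-coarse mapping described above is mechanical once the per-word sense tables exist. A minimal sketch: only the 'nuts' mapping is spelled out in the text, so only it appears below, and the dictionary structure and function names are our own illustration, not the authors' code.

```python
# Illustrative sketch of the fine-to-coarse sense mapping described in the
# text. Structure and names are hypothetical, not the authors' actual code.

# Coarse-grained (CGd) senses from Table 4
COARSE_SENSES = {
    "A": "term applied to sentient beings (often in a derogatory manner)",
    "B": "term applied to an object, situation, or world in general",
    "C": "clinical usage",
    "D": "homonymous usage",
    "E": "other senses",
}

# Per-word mapping: fine-grained sense number -> CGd label.
# For 'nuts' (as stated in the text): sense 1 -> A; senses 2, 4 -> B;
# sense 7 -> C (clinical); senses 3, 5 -> D (homonyms); senses 6, 8 -> E.
FINE_TO_COARSE = {
    "nuts": {1: "A", 2: "B", 4: "B", 7: "C", 3: "D", 5: "D", 6: "E", 8: "E"},
}

def coarse_sense(word: str, fine_sense: int) -> str:
    """Map one annotated fine-grained sense to its coarse-grained label."""
    return FINE_TO_COARSE[word][fine_sense]
```

With such a table per word, every fine-grained annotation in the Twitter data collapses deterministically onto the five comparable CGd categories.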
                      Occurrences        Clinical Use
word                  MHA     MHU        MHA     MHU
bonkers                 2      19
crazy                 152     252
deranged                0      12
insane                 24      54          0       1
loony                   1       7
lunatic                 0      17
mad*                   22      91
mental                267      86        233      45
nutcase                 2      17
nuts*                   7      46          3       2
nutter                  1       7
psycho                 17      34          3       0
schizo                 12       9          9       2
total                 489     631        248      50
total (all senses)    610     776
total (A & B only)    204     503

Table 5: Word counts of potentially-stigmatizing words in MHA and MHU tweets. An asterisk (*) indicates that the coarse-grained sense D (homonyms) for the word has been removed for this count.
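The per-sense percentages discussed for Figure 1 (e.g., the share of a word's MHA tweets that carry stigmatizing senses A and B) reduce to simple proportions over counts like those in Table 5. A sketch, with hypothetical function names and input structure:

```python
# Sketch of the per-word sense-proportion computation behind the MHA/MHU
# comparison. `sense_counts` maps a CGd sense label to a tweet count for one
# word in one user group (names and structure are illustrative).

def sense_proportions(sense_counts: dict) -> dict:
    """Fraction of a word's tweets carrying each coarse-grained sense."""
    total = sum(sense_counts.values())
    return {sense: count / total for sense, count in sense_counts.items()}

def stigmatizing_share(sense_counts: dict) -> float:
    """Combined share of the stigmatizing senses A and B."""
    props = sense_proportions(sense_counts)
    return props.get("A", 0.0) + props.get("B", 0.0)
```

For example, a word with counts {"A": 1, "B": 1, "C": 8} in one group would have a stigmatizing share of 0.2.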
dataset were pulled from Twitter at different time points, approximately six months apart, our classification of MHA and MHU users could be and indeed was occasionally incorrect: MHA users might not have produced another clinical-sense stigmatizing word in the user-based dataset, and MHU users might have used such a sense, perhaps due to becoming more mental health–aware over time.

Figure 1: Visualizing coarse-grained senses for stigmatizing words as found in MHA and MHU datasets.

Figure 1 better visualizes the various senses of each of the stigmatizing words from each of the user groups. Consider the uses of 'mental' and 'schizo' as visualized in Figure 1 for MHA and MHU tweets. For MHA tweets, the most prominent sense of 'mental' is the clinical sense C, while the stigmatizing sense A is used very infrequently. For MHU, however, while there is a large portion of clinical sense C associated with the use of 'mental', it is not as large as that in the MHA set. Sense A also shows up with higher frequency in the MHU set. The same trend can be seen for the word 'schizo'. MHA users do not use stigmatizing sense A as often as the clinical sense C, and the reverse is true for MHU users. As it turns out, although the MHA tweets show a high use of the words 'mental' and 'schizo', most of the usage is attributed to the medical sense. The stigmatizing senses only make up 1% and 25% of the total MHA tweets for 'mental' and 'schizo', respectively, which is considerably lower than the MHU set's stigmatizing sense use, at 20% and 67% for 'mental' and 'schizo', respectively.

Beyond 'mental' and 'schizo', what Figure 1 visually captures is the prominence of the stigmatizing sense A in the MHU group. While sense A does also occur in MHA tweets, the sense is not as prevalent as in the MHU set. The only apparent counter-example in Figure 1 is that of 'nutcase': its only usage is that of sense A in the MHA data. However, note that in Table 5 there were only 2 instances of this word in use, too little data to draw a conclusion.

6.2 Visual Language Analysis

From the previous section, we learned that stigmatizing words do seem to be used differently by MHA and MHU users. In this section, we look to characterize language differences more broadly, by analyzing the set of MHA and MHU tweets as a whole. To do so, we took all of the MHA and MHU tweets gathered for the user-based dataset, extracted the tweets containing any of the 14 stigmatizing words – all of which had been annotated for their coarse-grained (CGd) sense – and generated dynamic Vennclouds (Coppersmith and Kelly, 2014) to compare clinical-sense tweets to stigmatizing-sense tweets. Figures 2 and 3 display the resulting clouds.

Figure 2: Vennclouds for all tweets containing the 14 stigmatizing words, as tweeted by MHA users, with words from tweets containing clinical senses of the words appearing on the left (blue text), stigmatizing senses of the words appearing on the right (red text), and the language shared by tweets containing either sense in the middle (black text).

Figure 3: Vennclouds for all tweets containing the 14 stigmatizing words, as tweeted by MHU users, with words from tweets containing clinical senses of the words appearing on the left (blue text), stigmatizing senses of the words appearing on the right (red text), and the language shared by tweets containing either sense in the middle (black text).

From these clouds, one might note immediately that in Figure 2, the blue (leftmost) cloud is much larger than the red (rightmost), while the reverse is true in Figure 3. This simply visualizes what was previously quantified in Table 5: MHA users tend to produce more tweets containing clinical senses of the stigmatizing words, whereas MHU users tend to produce more tweets containing stigmatizing senses.

The mere presence of a blue cloud in Figure 3 demonstrates that there was some use of clinical-sense stigmatizing words from the MHU users, who therefore ought to have been categorized as MHA, as previously discussed in Section 6.1. However, on the whole, the simplistic classification of users based on clinical-sense usage held up relatively well.

Both Vennclouds show that 'crazy' was the most frequently-occurring word in stigmatized-sense tweets, shown as the largest, first word in the red (rightmost) clouds of both Figures 2 and 3, with 'mad' as the third- and second-most frequent word, respectively. In fact, these words were never used in a clinical-sense tweet by either of the user groups. This analysis shows quantitatively that, to be more aware of our own uses of stigmatizing senses, we ought to pay particular attention to these words, as the worst offenders from both groups.

The word 'health' is clearly visible as the most frequently-occurring word in the clinical-sense tweets produced by the MHA users, shown as the largest, first word in the blue (leftmost) cloud of Figure 2, with 'help', 'services', 'disorder', and 'stigma' close behind. Perhaps more interesting is the long list of hashtags that appear with high frequency in the clinical-sense tweets from the MHA users, including #b4stage4, #mhaconf14, #mmhmchat, #mhmwellness, #mhawell – all clearly related to mental health and mental health awareness. This analysis again supports our simplistic classification of users, but additionally gives us a source we could use for another data pull. By simply looking for clinical senses of these stigmatizing words, we discovered clear communities of mental health–aware users.

7 Conclusion & Future Work

In this study we have investigated 14 common terms, used in everyday language, with the potential for stigmatizing mental illnesses in society. We were specifically interested in evaluating whether awareness of mental illnesses can help discourage impulsive uses of the pejorative senses of the words. Our findings show that MHA users use stigmatizing words less frequently than MHU users, and when they do use the stigmatizing words, they use the stigmatizing senses less often than their MHU counterparts. Additionally, MHA users tend to structure their language so as to avoid applying the derogatory sense to a sentient being, and use the clinical sense of the stigmatizing word more often than MHU users. The absence of stigmatizing words or, more specifically,
stigmatizing senses by MHA users suggests that the user's mental health awareness contributes to how they employ language in social media, demonstrating a degree of sensitivity towards stigmatization of those with mental illnesses.

Directions for our future work concern the improvement of methods for determining the MHA user group. Our current approach identifies users as MHA if they show one MHA tweet in the seed data. Unfortunately, basing whether or not a user is MHA on a single tweet does not leave room for false positive cases, where an otherwise unaware user might have tweeted an aware-sounding tweet. Conversely, because the pulls for the seed set and user-based set were several months apart, there also may have been false negatives – MHU users that should have been classified in the user-based set as MHA users. The most immediate way to address this issue is to experiment by setting a threshold greater than one before a user is considered MHA. We will also look into taking advantage of hashtags related to mental health campaigns to identify users who are intentionally identifying themselves as part of the mental health community. Finally, we intend to also experiment with identifying users that have self-identified as having a mental condition or being a part of the mental health community (Coppersmith et al., 2015) as a means of identifying the MHA group.

Another future direction of this work is in further analysis and revision of coarse-grained senses. As discussed in Section 5.1, one of the more frequent fine-grained senses we found are expressives (e.g. "Aww nuts!") and intensifiers (e.g. "That's crazy good"). These are currently grouped in with the coarse-grained sense B – one of the two stigmatizing senses – but unlike the rest of the fine-grained senses also grouped in B, these usages serve the function of expressing the speaker's emotion or emphasizing a descriptive adverbial, rather than carrying descriptive or content information. In future work, we will look into distinguishing these types of senses from others to determine if indeed these could be considered stigmatizing senses and whether or not these usages are indicative of mental health awareness.

One might imagine several uses for detecting potentially-stigmatizing language. We could use it to warn social media users that their language might be seen as stigmatizing, and offer an opportunity to re-word, similar to the "self-flagging app" mentioned in (Quinn, 2014) to detect potentially offensive or bullying language (Dinakar et al., 2012). This option provides a somewhat heavy-handed method to increase mental health awareness. The ability to automatically detect stigmatizing senses of these (and other) words might also be useful as a filter, to downweight or hide posts containing stigmatizing words, or to add a warning or reporting function for stigmatizing language, like Facebook's "offensive content" reporting utility. In this way, users can choose to avoid such language, in case it might trigger negative reactions.

Finally, an automated analysis of the senses in use for potentially-stigmatizing words in everyday language might provide a method to assess whether anti-stigma campaigns are effective. Have we changed the discourse, and if so, in what ways? And, importantly, has a change in discourse resulted in easier access to care, better management of crises, and an improved quality of life for those with mental health conditions? Addressing these questions might shed light on the strengths and weaknesses of the current anti-stigma efforts, and help in guiding future work to end the stigma of mental illness.

Acknowledgments

The authors thank the anonymous reviewers for their thoughtful critiques. Kristy was partially supported in her work by the VA Office of Rural Health Program "Increasing access to pulmonary function testing for rural veterans with ALS through at home testing" (N08-FY15Q1-S1-P01346). The contents represent solely the views of the authors and do not represent the views of the Department of Veterans Affairs or the United States Government.

References

John W. Ayers, Benjamin M. Althouse, Jon-Patrick Allem, J. Niels Rosenquist, and Daniel E. Ford. 2013.
ers to determine if indeed these could be considered Seasonality in seeking mental health information on stigmatizing senses and whether or not these usages Google. American Journal of Preventive Medicine, 44(5):520–525. are indicative of mental health awareness. Brent Berlin and Paul Kay. 1991. Basic color terms: One might imagine several uses for detecting Their universality and evolution. Univ of California potentially-stigmatizing language. We could use Press. it to warn social media users that their language Lera Boroditsky. 2011. How language shapes thought. might been seen as stigmatizing, and offer an op- Scientific American, 304(2):62–65.
Phoebe Collins. 2015. Words will never hurt me; a study of stigmatizing language in Twitter. Via personal communication, 2015-07-12.

Glen Coppersmith and Erin Kelly. 2014. Dynamic wordclouds and Vennclouds for exploratory data analysis. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 22–29, Baltimore, Maryland, USA, June. Association for Computational Linguistics.

Glen Coppersmith, Mark Dredze, and Craig Harman. 2014. Quantifying mental health signals in Twitter. In Proceedings of the ACL Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych).

Glen Coppersmith, Mark Dredze, Craig Harman, and Kristy Hollingshead. 2015. From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych), pages 1–10, Denver, Colorado, June 5. Association for Computational Linguistics.

Patrick Corrigan, Vetta Thompson, David Lambert, Yvette Sangster, Jeffrey G. Noel, and Jean Campbell. 2003. Perceptions of discrimination among persons with serious mental illness. Psychiatric Services.

Patrick W. Corrigan, Benjamin G. Druss, and Deborah A. Perlick. 2014. The impact of mental illness stigma on seeking and participating in mental health care. Psychological Science in the Public Interest, 15(2):37–70.

Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting depression via social media. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media (ICWSM).

Munmun De Choudhury. 2013. Role of social media in tackling challenges in mental health. In Proceedings of the 2nd International Workshop on Socially-Aware Multimedia.

Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):18.

Virgile Landeiro Dos Reis and Aron Culotta. 2015. Using matched samples to estimate the effects of exercise on mental health from Twitter. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Mark Dredze. 2012. How social media will change public health. IEEE Intelligent Systems, 27(4):81–84.

Cecily Jill Duffield, Jena D. Hwang, and Laura A. Michaelis. 2010. Identifying assertions in text and discourse: The presentational relative clause construction. In Proceedings of the Extracting and Using Constructions in Computational Linguistics Workshop held in conjunction with NAACL HLT 2010, Los Angeles, California, June.

Helmut Hausner, Göran Hajak, and Hermann Spießl. 2008. Gender differences in help-seeking behavior on two internet forums for individuals with self-reported depression. Gender Medicine, 5(2):181–185.

Carleen Hawn. 2009. Take two aspirin and tweet me in the morning: How Twitter, Facebook, and other social media are reshaping health care. Health Affairs, 28(2):361–368.

Jena D. Hwang, Archna Bhatia, Claire Bonial, Aous Mansouri, Ashwini Vaidya, Nianwen Xue, and Martha Palmer. 2010. PropBank annotation of multilingual light verb constructions. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 82–90, Uppsala, Sweden, July. Association for Computational Linguistics.

Jena D. Hwang, Annie Zaenen, and Martha Palmer. 2014. Criteria for identifying and annotating caused motion constructions in corpus data. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.

Adam J. Joseph, Neeraj Tandon, Lawrence H. Yang, Ken Duckworth, John Torous, Larry J. Seidman, and Matcheri S. Keshavan. 2015. #Schizophrenia: Use and misuse on Twitter. Schizophrenia Research, 165(2):111–115.

Ronald C. Kessler, Matthias Angermeyer, James C. Anthony, Ron De Graaf, Koen Demyttenaere, Isabelle Gasquet, Giovanni De Girolamo, Semyon Gluzman, Oye Gureje, Josep Maria Haro, Norito Kawakami, Aimee Karam, Daphna Levinson, Maria Elena Medina Mora, Mark A. Oakley Browne, José Posada-Villa, Dan J. Stein, Cheuk Him Adley Tsang, Sergio Aguilar-Gaxiola, Jordi Alonso, Sing Lee, Steven Heeringa, Beth-Ellen Pennell, Patricia Berglund, Michael J. Gruber, Maria Petukhova, and Somnath Chatterji. 2007. Lifetime prevalence and age-of-onset distributions of mental disorders in the World Health Organization's world mental health survey initiative. World Psychiatry, 6(3):168.

Adam D. I. Kramer, Susan R. Fussell, and Leslie D. Setlock. 2004. Text analysis as a tool for analyzing conversation in online support groups. In Proceedings of the ACM Annual Conference on Human Factors in Computing Systems (CHI).

George Lakoff. 1990. Women, fire, and dangerous things: What categories reveal about the mind. Cambridge Univ Press.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In ARPA Human Language Technology Workshop, pages 114–119.

Bernard G. McNair, Nicole J. Highet, Ian B. Hickie, and Tracey A. Davenport. 2002. Exploring the perspectives of people whose lives have been affected by depression. Medical Journal of Australia, 176(Suppl)(10):S69–S76.

Margaret Mitchell, Kristy Hollingshead, and Glen Coppersmith. 2015. Quantifying the language of schizophrenia in social media. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych), pages 11–20, Denver, Colorado, June. Association for Computational Linguistics.

Thin Nguyen, Dinh Phung, Bo Dao, Svetha Venkatesh, and Michael Berk. 2014. Affective and content analysis of online depression communities. IEEE Transactions on Affective Computing, 5(3):217–226.

Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–105, March.

Minsu Park, Chiyoung Cha, and Meeyoung Cha. 2012. Depressive moods of users portrayed in Twitter. In Proceedings of the ACM SIGKDD Workshop on Healthcare Informatics (HI-KDD).

Michael J. Paul and Mark Dredze. 2011. You are what you tweet: Analyzing Twitter for public health. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM).

Cristina Quinn. 2014. MIT algorithm takes aim at social media cyberbullying. http://news.wgbh.org/post/mit-algorithm-takes-aim-social-media-cyberbullying. Accessed 2016-03-01.

Nairan Ramirez-Esparza, Cindy K. Chung, Ewa Kacewicz, and James W. Pennebaker. 2008. The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In Proceedings of the 2nd International AAAI Conference on Weblogs and Social Media (ICWSM).

Nicola J. Reavley and Pamela D. Pilkington. 2014. Use of Twitter to monitor attitudes toward depression and schizophrenia: an exploratory study. PeerJ, 2:e647.

Maarten Sap, Greg Park, Johannes C. Eichstaedt, Margaret L. Kern, David J. Stillwell, Michal Kosinski, Lyle H. Ungar, and H. Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1146–1151.

Nathan Schneider, Vivek Srikumar, Jena D. Hwang, and Martha Palmer. 2015. A hierarchy with, of, and for preposition supersenses. In Proceedings of the 9th Linguistic Annotation Workshop, pages 112–123, Denver, Colorado, USA, June. Association for Computational Linguistics.

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Richard E. Lucas, Megha Agrawal, Gregory J. Park, Shrinidhi K. Lakshmikanth, Sneha Jha, Martin E. P. Seligman, and Lyle H. Ungar. 2013. Characterizing geographic variation in well-being using tweets. In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM).

Substance Abuse and Mental Health Services Administration. 2014. Results from the 2013 National Survey on Drug Use and Health: Mental Health Findings, NSDUH Series H-49, HHS Publication No. (SMA) 14-4887. Substance Abuse and Mental Health Services Administration, Rockville, MD.

Graham Thornicroft, Nisha Mehta, Sarah Clement, Sara Evans-Lacko, Mary Doherty, Diana Rose, Mirja Koschorke, Rahul Shidhaye, Claire O'Reilly, and Claire Henderson. 2015. Evidence for effective interventions to reduce mental-health-related stigma and discrimination. The Lancet.

World Health Organization. 2011. The World Health Report 2001 – Mental Health: New Understanding, New Hope. Geneva: World Health Organization.

Albert C. Yang, Norden E. Huang, Chung-Kang Peng, and Shih-Jen Tsai. 2010. Do seasons have an influence on the incidence of depression? The use of internet search engine query data as a proxy of human affect. PLOS ONE, 5(10):e13728.
The language of mental health problems in social media
George Gkotsis1,†,∗, Anika Oellrich1,†, Tim JP Hubbard1,2, Richard JB Dobson1,3, Maria Liakata4, Sumithra Velupillai1,5, Rina Dutta1
1King's College London, IoPPN, London, SE5 8AF, UK
2King's College London, Department of Medical & Molecular Genetics, London, SE1 9RT
3Farr Institute of Health Informatics Research, London, WC1E 6BT, UK
4University of Warwick, Department of Computer Science, Warwick, CV4 7AL, UK
5School of Computer Science and Communication, KTH, Stockholm
† Authors contributed equally to this work.
∗ Corresponding author. E-mail: [email protected].
Abstract

Online social media, such as Reddit, has become an important resource to share personal experiences and communicate with others. Among other personal information, some social media users communicate about mental health problems they are experiencing, with the intention of getting advice, support or empathy from other users. Here, we investigate the language of Reddit posts specific to mental health, to define linguistic characteristics that could be helpful for further applications. The latter include attempting to identify posts that need urgent attention due to their nature, e.g. when someone announces their intentions of ending their life by suicide or harming others. Our results show that there are a variety of linguistic features that are discriminative across mental health user communities and that can be further exploited in subsequent classification tasks. Furthermore, while negative sentiment is almost uniformly expressed across the entire data set, we demonstrate that there are also condition-specific vocabularies used in social media to communicate about particular disorders. Source code and related materials are available from: https://github.com/gkotsis/reddit-mental-health.

1 Introduction

Mental illnesses are estimated to account for 11% to 27% of the disability burden in Europe (Wykes et al., 2015) and mental and substance use disorders are the leading cause of years lived with disability worldwide (Whiteford et al., 2013). Our knowledge about these mental health problems is still more limited than for many physical conditions, as sufferers may relapse even after successful treatment or exhibit resistance to different treatments. Most mental health conditions begin early, disrupt education (Kessler et al., 1995) and may persist over a lifetime, causing disability when those affected would normally be at their most productive (Kessler and Frank, 1997). For example, Patel and Knapp (1997) estimated the aggregate costs of all mental disorders in the United Kingdom at £32 billion (1996/97 prices), 45% of which was due to lost productivity (Patel and Knapp, 1997). The global burden of mental and substance use disorders increased by 37.6% between 1990 and 2010 (Whiteford et al., 2013), which means it is an international public health priority to effectively prevent and treat mental health issues.

In the UK, 17% of adults experience a sub-threshold common mental disorder (McManus et al., 2009) and up to 30% of individuals with non-psychotic common mental disorders have subthreshold psychotic symptoms (Kelleher et al., 2012), showing that a large proportion of mental illness is unrecognised, but nevertheless has a significant impact upon people's lives. Those people with conditions that meet criteria for diagnosis are treated in primary care or by mental health professionals.
Proceedings of the 3rd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 63–73, San Diego, California, June 16, 2016. © 2016 Association for Computational Linguistics
Studies consistently show that between 50-60% of all individuals with a serious mental illness receive treatment for their mental health problem at any given time (Kessler et al., 2001).

However, most of the pathology tracking and improvement assessment is done through questionnaires, e.g. the Personal Health Questionnaire 9 (PHQ9) for depression (Kroenke and Spitzer, 2002), and these require a subjective comment by the patient, e.g. "How many days have you been bothered with little interest or pleasure in doing things in the past two weeks?". As with every personal judgement, the responses are influenced by the environment in which the person has been asked, the relationship to the clinician and even the stigma attached to depression (Malpass et al., 2010). While there are aims to integrate real-time reporting into a patient's life (Ibrahim et al., 2015), these are still based on set questionnaires and may not fit with the main concerns of a patient.

Social media, such as Twitter1, Facebook2 and Reddit3, have become an accepted platform to communicate about life circumstances and experiences. A specific example of social media in the context of illness is PatientsLikeMe (Wicks et al., 2010). PatientsLikeMe has been developed to enable people suffering from an illness to exchange information with others with the same condition, e.g. to find alternative treatment opportunities. It has been shown that the support received in such online communities can be empowering by engendering self-respect and a feeling of being in control of the situation (Barak et al., 2008). Hence, social media constitutes a tremendous resource for better understanding diseases from a patient perspective.

Social media data has recently been recognised as one of the resources to gather knowledge about mental illnesses (Coppersmith et al., 2015a; De Choudhury et al., 2013; Kumar et al., 2015). For example, Twitter data has been used to develop classifiers to recognise depression in users (De Choudhury et al., 2013) and to classify Twitter users who have attempted suicide from those who have not and from those who are clinically depressed (Coppersmith et al., 2015b). Furthermore, data collected from Reddit pertaining to suicidal ideation could demonstrate the existence of the Werther effect (suicide attempts and completions after media depiction of an individual's suicide) (Kumar et al., 2015). Coppersmith and colleagues used Twitter data to determine language features that could be used to classify Twitter users into those suffering from mental health problems and unaffected individuals (Coppersmith et al., 2015a). However, while the authors could identify features that allow the classification between healthy and unhealthy Twitter users, they also note that language differences in communicating about the different mental health problems remain an open question. Similarly, Mitchell et al. (2015) used Twitter data to separate users affected by schizophrenia from healthy individuals by automatically identifying characteristic language features for schizophrenia (Mitchell et al., 2015). Both the latter approaches rely on the Linguistic Inquiry and Word Count (LIWC) (Tausczik and Pennebaker, 2010), but Mitchell et al. also cover features such as Latent Dirichlet Allocation and Brown Clustering. Very recently and concurrently with our own work, De Choudhury and colleagues have shown that linguistic features can be used to predict the likelihood of individuals transitioning from posting about depression and other mental health issues on Reddit to suicidal ideation (De Choudhury et al., 2016). This work showed the ability to make causal inferences on the basis of language usage and employed a small subset of the mental health groups on Reddit.

Following on from the promise that such work holds, our goal was to study language features that are characteristic of individual mental health conditions using large scale Reddit data. We anticipate that our findings can be used to assist in separating posts pertaining to different mental health problems and for various language-based applications involving the better understanding of mental health conditions. Reddit is particularly suitable for such research as it has an enormous user base4, posts and comments are topic-specific and the data is publicly available. We focussed on subreddits (communities where users can post/comment in relation to a specific topic, e.g. suicide ideations)

1 https://twitter.com/?lang=en
2 https://en-gb.facebook.com/
3 https://www.reddit.com/
4 According to https://en.wikipedia.org/wiki/Reddit, Reddit had 234M unique users with 542M monthly visitors as of 2015
of the Reddit data dump [5] that address the following mental health problems: Addiction, Anxiety, Asperger's, Autism, Bipolar Disorder, Dementia, Depression, Schizophrenia, self-harm and suicide ideation. These conditions are commonly encountered by mental health practitioners and contribute significantly to treatment costs. We aimed to identify linguistic characteristics that are specific to any of the mental illnesses covered and can be used for text classification tasks. The investigated characteristics include lexical as well as syntactic features, the uniqueness of vocabularies, and the expression of sentiment and happiness. Our results suggest that there are linguistic features that are discriminative of the user communities examined in this study. Furthermore, by applying a clustering method to subreddits, we could show that subreddits mostly contain a topic-specific vocabulary. Moreover, we could also highlight that there are differences in the way that sentiment is expressed in each of the subreddits. Source code and related materials are available from: https://github.com/gkotsis/reddit-mental-health.

2 Methods and Materials

As our aim was to define linguistic characteristics specific to mental health problems, we downloaded the Reddit data and extracted relevant posts and comments. These were then further investigated with respect to specific linguistic features, e.g. sentence structure or unique vocabularies, to determine characteristics for subsequent classification tasks. The data set as well as the methods employed are described in the following subsections.

2.1 Social media data from Reddit

Reddit is a social media network where registered users can post requests to a broader community. Posts are hosted in topic-specific fora, so-called subreddits. Subreddits can be created by users based on the subject they wish to communicate about. All users can freely join any number of subreddits and participate in discussions. This means that posts are sent to a community that is potentially knowledgeable about, or at least interested in, the topic. We used this Reddit feature to determine subreddits targeting specific mental health problems.

For this purpose, we filtered the entire downloaded data set for subreddits targeting any of the 10 conditions identified as relevant. The data set was separated into posts and comments, and we preserved this separation so that analysis could be executed on posts, comments, or both combined. Posts are initial textual statements that initiate a communication with other users. Comments are replies to posts and are organised in a tree-like structure. Both posts and comments can be written by anyone, and even the Reddit user who wrote the initial post can comment on it. We note here that the number of users, posts and comments varied substantially between subreddits (see Table 1). To refer to sets of both posts and comments (total in Table 1), we use the term "communication" in the following sections.

Table 1: Numbers of posts, comments, ratio of comments over posts, and the total of posts and comments (called "communications") for each mental health-related subreddit included in this study. Numbers are totalled across all subreddits in the last row.

subreddit | #posts | #comments | #comments/#posts | #total
Anxiety | 57,523 | 289,441 | 5.03 | 346,964
BPD | 11,880 | 77,091 | 6.49 | 88,971
BipolarReddit | 14,954 | 151,588 | 10.14 | 166,542
BipolarSOs | 814 | 4,623 | 5.68 | 5,437
OpiatesRecovery | 8,651 | 87,038 | 10.06 | 95,689
StopSelfHarm | 4,626 | 24,224 | 5.24 | 28,850
addiction | 4,360 | 6,319 | 1.45 | 10,679
aspergers | 15,053 | 202,998 | 13.49 | 218,051
autism | 9,470 | 52,090 | 5.50 | 61,560
bipolar | 25,868 | 198,408 | 7.67 | 224,276
cripplingalcoholism | 38,241 | 503,552 | 13.17 | 541,793
depression | 197,436 | 902,039 | 4.57 | 1,099,475
opiates | 56,492 | 906,780 | 16.05 | 963,272
schizophrenia | 4,963 | 31,864 | 6.42 | 36,827
selfharm | 12,476 | 68,520 | 5.49 | 80,996
SuicideWatch | 90,518 | 619,813 | 6.85 | 710,331
Total | 462,807 | 3,506,575 | 7.58 | 3,969,382

As shown in Table 1, the depression subreddit contains the largest number of communications (1.1M), while the smallest number is found in the BipolarSOs subreddit (5K). The number of posts is always smaller than the number of comments, though the average number of comments per post varies. The highest average rate of comments per post is seen on the opiates subreddit, while the smallest number of replies is observed on the addiction subreddit.

[5] The data was released by a Reddit user on https://redd.it/3bxlg7 (comments) and https://redd.it/3mg812 (posts).
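The preprocessing described in Section 2.1 (splitting the dump into posts and comments per subreddit, then tallying the columns of Table 1) can be sketched as below. This is a minimal illustration, not the authors' code: the JSON-lines input and the field names (`subreddit`, `title`, `body`) are assumptions based on common Reddit data-dump formats.

```python
import json
from collections import defaultdict

def split_posts_comments(lines, subreddits):
    """Separate a JSON-lines Reddit dump into posts and comments,
    keeping only records from the given subreddits."""
    posts, comments = defaultdict(list), defaultdict(list)
    for line in lines:
        record = json.loads(line)
        sub = record.get("subreddit")
        if sub not in subreddits:
            continue
        if "title" in record:        # submissions carry a title ...
            posts[sub].append(record)
        elif "body" in record:       # ... comments carry only a body
            comments[sub].append(record)
    return posts, comments

def communication_stats(posts, comments):
    """Per-subreddit #posts, #comments, comments/posts ratio and total,
    mirroring the columns of Table 1."""
    stats = {}
    for sub in set(posts) | set(comments):
        p, c = len(posts[sub]), len(comments[sub])
        stats[sub] = {"posts": p, "comments": c,
                      "ratio": c / p if p else float("inf"),
                      "total": p + c}
    return stats
```

Keeping posts and comments in separate containers preserves the separation described above, so later analyses can be run on posts, comments, or both combined.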
2.2 Determining linguistic features

There are many ways to model communication. Communication in the form of language use can be characterised through a variety of feature types. Our aim is to better understand the nature and depth of the communication that takes place, and one way to do this is by the analysis of linguistic features. These features are particularly relevant in the context of mental health problems, as the ability of the sufferer to communicate effectively can be affected by such problems (Cohen and Elvevåg, 2014). For example, someone suffering from Bipolar Disorder may suddenly write a lot, but not necessarily in a cohesive manner. In the Iowa Writers' Workshop study (Andreasen, 1987), bipolar sufferers reported that they were unable to work creatively during periods of depression or mania. During depressive episodes, cognitive fluency and energy were decreased, and during manic periods they were too distractible and disorganized to work effectively, so it would be reasonable to expect this to be reflected in their prose. Understanding these features, and consequently the nature and content of the posts, will allow us to better design useful classification systems and predictive models.

Through discussion, we determined an initial feature set of linguistic characteristics that draws on previously established measures of psychological relevance, such as LIWC and Coh-Metrix (Graesser et al., 2004). However, we note here that in order not to overload our initial feature set, we selected a subset of all the available possibilities. In our feature set, we included linguistic features introduced by Pitler and Nenkova (2008) that partially overlap with those used in Coh-Metrix for predicting text quality. More specifically, we adopt features that aim at assessing the readability of textual content. Readability is a measurement that aims to assess the education level required for a reader to fully appreciate a certain piece of content. The task of understanding textual content and assessing its quality encompasses various factors that are captured through the features that we also propose here (see supplemental material for more information about the implementation). A subset of these features has been used successfully to predict the answers to be marked as accepted on online community-based question answering websites (Gkotsis et al., 2014).

Our first set of features pertains to the usage of specific words in documents. For instance, we look at the usage of definite articles, since we believe that definite articles are used for specific and personal communications. Similarly, we keep track of pronouns, first-person pronouns, and the ratio between them, as indicators of the degree of first-person content.

Additional features in this initial set aimed at examining text complexity. In our approach, text complexity can scale both horizontally (length, topic cohesion) and vertically (clauses, composite meanings). For the horizontal assessment, we count the number of sentences. Another set of features, which targets the understanding of topic continuity and cohesion across sentences, is word overlap between adjacent sentences, taking into account either all words or just nouns and pronouns. For the vertical assessment, we employ the following features: a) we count the noun chunks and verb phrases (sequences of nouns and verbs, respectively) and the number of words contained within them, b) we construct the parse tree of each sentence and measure its height, and c) we count the number of subordinate conjunctions (e.g. "although", "because" etc.). A parse tree represents the syntactic structure of a sentence, and tools such as dependency or constituency parsers are readily available, e.g. as implemented in the Python module spaCy [6].

Finally, we found that a few posts do not contain any text in their body apart from their title. This was typically the case for posts that contained a Uniform Resource Locator (URL) to a web page of interest to the community. We believe that the ratio of the number of these posts over the total number of posts is associated with the degree of information dissemination [7], as opposed to the personal story-telling that might occur in other cases, and thus included this as an additional feature.

[6] https://spacy.io/
[7] For instance, we found that most URLs posted in addiction link to YouTube.
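Several of the surface-level features above (definite-article counts, pronoun and first-person pronoun counts, sentence counts, and adjacent-sentence word overlap) can be sketched without a full parser. The snippet below is a simplified stand-in, assuming naive regex-based tokenisation rather than the spaCy pipeline used in the paper; the parse-based features (tree height, noun chunks, subordinate conjunctions) would still require a real parser.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
PRONOUNS = FIRST_PERSON | {"you", "your", "he", "him", "his",
                           "she", "her", "it", "they", "them", "their"}

def sentences(text):
    # naive sentence splitter; the paper's pipeline uses an NLP library instead
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def surface_features(text):
    """Compute a handful of surface features for one communication."""
    sents = sentences(text)
    toks = [t for s in sents for t in tokens(s)]
    pron = sum(t in PRONOUNS for t in toks)
    first = sum(t in FIRST_PERSON for t in toks)
    # lexical cohesion: word overlap between adjacent sentences
    overlaps = []
    for a, b in zip(sents, sents[1:]):
        ta, tb = set(tokens(a)), set(tokens(b))
        union = ta | tb
        overlaps.append(len(ta & tb) / len(union) if union else 0.0)
    return {
        "n_sentences": len(sents),
        "definite_articles": toks.count("the"),
        "pronouns": pron,
        "first_person_ratio": first / pron if pron else 0.0,
        "adjacent_overlap": sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }
```

Per-communication values like these can then be averaged per subreddit, in the spirit of the columns of Table 2.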
2.3 Word-based classification to assess subreddit uniqueness

For this classification-based approach, we employed a representation based on individual words, as well as information on words that frequently co-occurred together. The aim of this task was to examine how closely aligned the vocabularies of each subreddit were, assessed via a pairwise comparison. As highlighted in Table 1, the data volume (in terms of posts and comments) differed significantly between subreddits. In order to compensate for the difference in size, we utilised a randomisation process, repeating the same experiment 10 times with a set of 5,000 posts drawn randomly for each repeat and individually for each of the two subreddits being compared.

In order to compare the vocabularies of two subreddits, we built dictionaries for each pair of subreddits by retrieving all words and sequences of words (of length 2) occurring in one or both subreddits. We then used this list of words and frequently co-occurring words to classify posts as belonging to one of the two subreddits being compared, and recorded the performance for each of the 10 cycles for each subreddit pair. The classification performance was then averaged across all 10 cycles to obtain a representative score for each pair of subreddits. Using this classification approach, high performance scores indicate a distinctive vocabulary, while low performance scores suggest a vocabulary shared across both subreddits. The results of this pairwise comparison are illustrated in Figure 1. More details, covering the algorithm and randomisation steps, are provided in the supplementary materials.

2.4 Detecting sentiment and happiness in posts

One additional aspect that can be assessed when looking at the linguistic side of communications on social media is the expression of sentiment. Sentiment has been noted as a crucial indicator of how involved someone is in a specific event (Tausczik and Pennebaker, 2010; Murphy et al., 2015), and therefore can also play a role in the expression of mental illness. Some of the conditions investigated here may have characteristic mood patterns, e.g. it is likely that someone suffering from depression will use negative sentiment and express unhappiness, while someone suffering from Bipolar Disorder may change between positive and negative mood expressions over time. However, by assessing sentiment and happiness for a large population of individuals, novel patterns for individual mental health problems may emerge.

As part of our investigation, we used two different methods, one to detect sentiment (Nielsen, 2011) and another to detect happiness (Dodds et al., 2011). Both methods, which were developed for social media studies, rely on a topic-specific dictionary. For each post in our subreddits, we determined the sentiment and happiness score by matching words against the dictionaries. We accumulated these scores on a per-post basis and normalised them by the square root of the number of words in the post that were identified in the respective dictionary. Scores for happiness were further normalised to map them onto the same range as the values for sentiment: negative values are expressions of negative sentiment/unhappiness, positive values are expressions of positive sentiment and happiness, and a value of 0 can be seen as neutral.

We note here that while our aim is to classify both posts and comments, we limited ourselves in this task to posts only. Comments could be considered a source of noise, which may mask the sentiment and happiness expressed in posts, given that our data set contains many more comments than posts. In future work we would like to experiment with more sophisticated linguistic methods for identifying sentiment and emotion.

3 Results

After identifying the subreddits relevant to the mental health problems we were interested in, we determined linguistic features (related to the content of communications) for each of the subreddits. The results of our investigations are presented in the following subsections.

3.1 Subreddits exhibit differences in linguistic features

Table 2 provides a summary of all the linguistic features for each of the 16 subreddits that were assessed as part of this study. From this table, we
see that two subreddits stand out in a number of the assessed criteria: BipolarSOs and cripplingalcoholism. BipolarSOs is a subreddit that provides support and advice to people in a relationship where either one or both partners are affected by Bipolar Disorder. Note that this means that users on this subreddit may not be affected by the disorder themselves, which may result in different communications from those on a subreddit where only people with Bipolar Disorder are communicating. In our data set it was the smallest subreddit in terms of total number of communications (see Table 1). The subreddit cripplingalcoholism aims to facilitate communication between people addicted to alcohol. In the description of the subreddit, there is no emphasis on supporting each other, and people can also share what they may consider positive experiences regarding their condition (e.g. "On day 8 of a bender that was supposed to end today because my boss was supposed to send me a bunch of work on Monday. She just emailed me and said she won't be sending it until Wednesday! Sweet chocolate Jesus on a bicycle, I did a jig in my jammies, cracked open a new handle of rye, and am about to take the dog on a nice drunken walk. Sobriety, I'll see you Wednesday. Maybe").

From Table 2, we see that the BipolarSOs subreddit not only has a higher number of first-person pronouns and a larger number of definite articles, but also that the average sentence seems to be more complex, due to a high average height of the sentence parse trees, long verb clauses and a high number of subordinating conjunctions, while the average number of sentences per communication is comparable to those of the other subreddits. This suggests that people posting on this subreddit explain their experience or advice in detail. On the contrary, the cripplingalcoholism subreddit possesses shorter communications, characterised by the lowest number of sentences per communication, the smallest maximum height of sentence trees, a low number of subordinate conjunctions and short verb clauses. Using word frequency occurrences, we also observed that here the language seems stronger than on other subreddits, with the most frequently occurring word being "fuck" (details of results not provided here [8]).

[8] Available as a wordcloud visualisation of all subreddits at https://github.com/gkotsis/reddit-mental-health/tree/master/wordclouds

The two features relating to lexical cohesion (by means of adjacent sentences using similar words, LF10 and LF11 in Table 2) show little variation across all 16 subreddits. Though cohesion improves when only nouns and pronouns are taken into consideration, the best value obtained is 0.22, indicating a mostly low lexical cohesion across communications on each of the subreddits. One of our longer-term goals is to be able to classify posts to individual subreddits, and these scores would not be sufficiently informative for this goal due to their low variation.

3.2 Subreddit vocabulary uniqueness through classification

The word occurrence-based comparison of the subreddits was performed to better determine whether subreddits can be distinguished based on their lexical content (see supplementary material for more information). The results obtained are shown in Figure 1.

Figure 1: Heatmap of pairwise classification of posts (only) between subreddits. High values denote high accuracy in classification and therefore represent high discriminability in language. Low values represent low classification scores and therefore high language proximity between subreddits. [Heatmap omitted: both axes list the 16 subreddits; the colour scale ranges from 0.30 to 0.90.]

Figure 1 shows that, apart from a small number of exceptions, the language of individual subreddits is discriminable, which can be further exploited for classification purposes in later stages. For example, the subreddit OpiatesRecovery shows mostly high values, which means that the language used (based on frequency of words and word pairs) on this
subreddit is mostly unique. OpiatesRecovery shows some vocabulary overlap with the opiates and addiction subreddits, which suggests that there are some shared topics on these subreddits. One of the exceptions is the subreddit addiction. As illustrated in the heatmap, the addiction subreddit shows particularly low values with other subreddits such as depression and suicidewatch. This finding is not surprising, as substance addiction can lead to depression and suicidal thoughts, which is expected to also be expressed in the nature of the communication. Note that the diagonal of the matrix is suppressed to reduce the matrix dimension.

Among our 16 subreddits, there are some subreddits that allude to the same mental health condition, e.g. BipolarReddit and BipolarSOs both aim to foster a community to facilitate exchange about Bipolar Disorder. While the subreddit BipolarSOs invites participation from users who are affected themselves or are in a relationship with someone affected by Bipolar Disorder, BipolarReddit is solely focussed on people suffering from this disorder. In Figure 1, we can also see that vocabularies seem to be partially shared (indicated by a lighter colour) across those subreddits addressing the same mental health problem. For example, all three subreddits relating to Bipolar Disorder (bipolar, BipolarReddit and BipolarSOs) show a pairwise score of ~0.6, as opposed to ~0.9 with other subreddits. Similarly, both self-harm subreddits also share a pairwise vocabulary score of ~0.6. Interestingly, the subreddits autism and schizophrenia also indicate a proximity of vocabularies, and further investigations are required to assess the shared vocabularies.

Table 2: Obtained results for each of the language features per subreddit. Language features investigated were: LF1 – average number of the definite article "the" per communication; LF2 – average number of first-person pronouns; LF3 – average number of pronouns per communication; LF4 – average number of noun chunks; LF5 – average length of the maximum verb phrase per communication; LF6 – average number of subordinate conjunctions; LF7 – average maximum height of sentences' parse trees; LF8 – average number of sentences per communication; LF9 – average ratio of first-person pronouns over total pronouns; LF10 – similarity between adjacent sentences over nouns or pronouns only (lexical cohesion); LF11 – similarity between adjacent sentences over all words (lexical cohesion); LF12 – ratio of posts without any body text (containing only a title and a URL) over total posts. LF1–LF8 are not normalised; LF9–LF12 are normalised.

subreddit | LF1 | LF2 | LF3 | LF4 | LF5 | LF6 | LF7 | LF8 | LF9 | LF10 | LF11 | LF12
Anxiety | 2.24 | 7.67 | 13.13 | 1.85 | 25.68 | 1.43 | 5.95 | 6.51 | 0.54 | 0.19 | 0.18 | 0.10
BPD | 2.23 | 7.98 | 14.28 | 1.83 | 28.32 | 1.48 | 6.14 | 6.63 | 0.54 | 0.20 | 0.18 | 0.09
BipolarReddit | 2.14 | 6.49 | 11.54 | 1.84 | 23.91 | 1.44 | 5.98 | 6.15 | 0.54 | 0.18 | 0.18 | 0.01
BipolarSOs | 3.53 | 9.68 | 23.06 | 1.99 | 40.52 | 1.49 | 6.67 | 10.12 | 0.40 | 0.17 | 0.17 | 0.04
OpiatesRecovery | 2.24 | 6.48 | 11.93 | 1.80 | 23.49 | 1.28 | 5.66 | 6.69 | 0.49 | 0.17 | 0.17 | 0.01
StopSelfHarm | 1.60 | 5.90 | 11.71 | 1.77 | 20.60 | 1.25 | 5.43 | 5.52 | 0.46 | 0.22 | 0.19 | 0.12
addiction | 1.95 | 5.54 | 10.66 | 1.35 | 21.50 | 0.98 | 4.30 | 5.93 | 0.44 | 0.17 | 0.18 | 0.63
aspergers | 1.94 | 5.12 | 9.69 | 1.72 | 20.76 | 1.53 | 5.88 | 5.06 | 0.53 | 0.17 | 0.18 | 0.12
autism | 2.07 | 3.62 | 8.65 | 1.61 | 19.89 | 1.37 | 5.50 | 5.01 | 0.41 | 0.13 | 0.17 | 0.61
bipolar | 1.86 | 6.16 | 10.62 | 1.73 | 20.82 | 1.31 | 5.53 | 5.67 | 0.56 | 0.18 | 0.17 | 0.15
cripplingalcoholism | 0.92 | 2.28 | 4.07 | 1.36 | 8.76 | 0.95 | 4.10 | 3.02 | 0.54 | 0.12 | 0.16 | 0.16
depression | 2.25 | 8.71 | 14.75 | 1.84 | 29.04 | 1.37 | 5.89 | 7.17 | 0.52 | 0.21 | 0.19 | 0.02
opiates | 1.14 | 2.60 | 5.11 | 1.48 | 10.96 | 1.13 | 4.52 | 3.25 | 0.49 | 0.14 | 0.16 | 0.21
schizophrenia | 2.13 | 5.96 | 11.15 | 1.80 | 23.74 | 1.45 | 5.82 | 5.85 | 0.50 | 0.18 | 0.18 | 0.13
selfharm | 1.39 | 5.41 | 9.73 | 1.70 | 17.57 | 1.19 | 5.11 | 4.94 | 0.52 | 0.21 | 0.18 | 0.01
suicidewatch | 1.96 | 7.10 | 13.44 | 1.85 | 27.85 | 1.29 | 5.74 | 6.73 | 0.46 | 0.22 | 0.19 | 0.03

3.3 Sentiment/happiness expressions on subreddits

In order to assess the emotions that Reddit users express on subreddits related to mental health problems, we used two different methods: (i) one to assess sentiment and (ii) one to specifically assess happiness. The results obtained by both methods are shown in Figure 2. This figure illustrates that, on average, a lot of negative sentiment is expressed across the different subreddits relating to mental health problems. We can see that posts from the subreddit SuicideWatch express the highest rate of negative sentiment, followed by posts from the Anxiety and the self-harm-related subreddits.

While in the majority of cases both sentiment and
Figure 2: A small number of subreddits show a majority of positive sentiments, while a large number of subreddits show predominantly negative sentiments. Positive values in this bar plot correspond to positive sentiment and happiness, while negative values indicate negative sentiments or unhappiness.

Furthermore, Figure 2 shows a small number of subreddits where posts seem to express positive sentiment. For example, the posts extracted from the subreddit OpiatesRecovery seem to express not only positive sentiment but also happiness. This