INSIGHTS FROM PATIENT AUTHORED TEXT: FROM CLOSE READING TO AUTOMATED EXTRACTION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DIANA LYNN MACLEAN
MARCH 2015

© 2015 by Diana Lynn MacLean. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/nh030tg4542

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jeffrey Heer, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Michael Bernstein

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christopher Manning

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Stuart Card

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Millions of people collaborate online with others who share their health concerns. In the process, these users perform complex health-related tasks, such as differential diagnosis and treatment comparison.
The result is a massive, growing and readily accessible corpus of patient authored text (PAT) that documents patients’ behavior outside of the clinical environment. As a result, PAT can provide insights into otherwise obscure topics, such as why patients follow only certain parts of a treatment protocol, or how people self-treat stigmatized conditions such as prescription drug addiction. Despite the potential value of PAT, attempts to extract medically relevant insights from it have been limited. PAT is notoriously noisy and challenging to work with, and there is a dearth of methods and tools for processing and analyzing it. Moreover, the specific research questions that PAT can support are not obvious: determining what data PAT encodes, and how, is a challenge in and of itself. In this thesis, I develop methods for automatically extracting medically relevant data from PAT. I focus specifically on the topic of addiction: a stigmatized and prevalent medical condition. Building on close readings of source text to inform schema induction, data annotation, and feature engineering, I train classifiers that accurately identify (1) medically relevant terms in PAT; (2) users’ motivations for participating in an addiction-related online health community; (3) users’ drugs of choice; and (4) users’ transitions through relapse and recovery. Using these classifiers to scale analyses to large PAT corpora, I derive novel insights into the process of addiction, as well as the role that online health communities play in giving users informational and emotional support and, ultimately, in enabling recovery. In concert, these contributions both underscore PAT’s latent value for illuminating poorly understood or clandestine medical topics, and offer viable methods that dramatically improve our ability to realize this value.

For Angus and June

Acknowledgements

My first and foremost thanks go to my advisor, Jeffrey Heer.
Jeff has been a wonderful source of support, knowledge and inspiration during my time at Stanford, and I am deeply indebted to him for not only supporting my curiosity as my research ventured into uncharted territory, but for doing so with enthusiasm and confidence. Most importantly, however, Jeff has been an exemplary role model. I am lucky, grateful, and unquestionably better for having had the opportunity to learn from him, and am proud to be taking that with me as I start my next great adventure.

There are several people without whom this dissertation would not have been possible: Anna Lembke, who brought with her invaluable medical perspective, and whose enthusiasm, thoughtful insight and patience were instrumental in making this cross-disciplinary work a reality; Stuart Card, whose ingeniousness I aspire to, and whose advice I have had the fortune to benefit from on several occasions; Sonal Gupta, a close friend and collaborator from whom I have learned a great many things, and hope to learn many more; and Michael Bernstein and Christopher Manning, who have given generously of their time and advice, helping to steer this work from its inception through its completion.

I am fortunate to have had many wonderful co-conspirators while at Stanford. Sudheendra Hangal, whose patient support and advice were instrumental in my early graduate school years, has been a fantastic collaborator and a dear friend. Monica Lam, with whom I worked closely during my first year, remains an uplifting source of inspiration. The UW IDL group, the Stanford HCI group, and the fantastic people in the 3B wing have been a fun, dynamic and reliable source of new ideas, feedback and camaraderie, and will be greatly missed. Finally, Jillian Lentz and Monica Niemiec deserve special thanks for not only providing efficient administrative support, but also for answering even the most frantic of questions with a smile.

Finally, there are some people without whom I would not be where I am today. The inimitable Margo Seltzer who, suffice it to say, started this whole business in the first place; David Holland, whose patient and thorough technical tutelage stands me in good stead to this day; Will Phan, who helped me to see the real joy in coding; my mother, Heather, who is the embodiment of never giving up; and, of course, my husband, Isa, who inspires and challenges me to be a little better every day. It makes all the difference.

Table of Contents

1 Introduction
    1.1 Overview & Focus
    1.2 Contributions
    1.3 Outline of Thesis
2 The Internet and Health
    2.1 Online Health Information Seeking
        2.1.1 Historical Overview & Current Landscape
        2.1.2 What Health Information Do Users Seek Online?
        2.1.3 Who Seeks Health Information Online?
            Gender
            Age
            Health
            Race
            Socio-Economic Status & Education
            Role (Patient vs. Caregiver)
        2.1.4 Where Do People Find Health Information Online?
    2.2 Online Health Community Participation
        2.2.1 Modes of Participation
        2.2.2 Who Participates in OHCs?
        2.2.3 Reasons for Participation
            Medium-Based Affordances
            Informational Support
            Emotional Support
        2.2.4 Efficacy of Online Health Forums
    2.3 Summary
3 Prior Work on Patient Authored Text
    3.1 Patient Authored Text (PAT): Introduction & Overview
        3.1.1 Value of PAT
        3.1.2 Challenges of Working with PAT
            Noisiness
            Lack of Analysis Tools
            Applicability to Research Questions
    3.2 Syndromic Surveillance
        3.2.1 Condition
        3.2.2 Data Source
        3.2.3 Filtering
        3.2.4 Modeling and Prediction
        3.2.5 Real-World Evaluation Dataset
    3.3 Pharmacovigilance
        3.3.1 Data Source
        3.3.2 Identifying Drugs in PAT
        3.3.3 Identifying Adverse Events in PAT
        3.3.4 Evaluation
    3.4 Named Entity Recognition
        3.4.1 Ontology-Based Tools
        3.4.2 Statistical Classifiers
    3.5 Thematic Analysis
        3.5.1 Condition
        3.5.2 Data Source
        3.5.3 Analysis Question
        3.5.4 Scaling Thematic Analyses
    3.6 Summary
4 Data
    4.1 MedHelp Corpus
        4.1.1 Terminology
        4.1.2 Forum77
    4.2 CureTogether Corpus
5 Identifying Medically Relevant Terms in PAT
    5.1 Introduction
    5.2 Related Work
        5.2.1 Medical Term Identification
        5.2.2 Consumer Health Vocabularies
    5.3 Data
        5.3.1 Preparation
        5.3.2 Samples
    5.4 Labeling Medically Relevant Terms with the Crowd
        5.4.1 Task Design and Pilot Study
        5.4.2 Experiment
            Determining a Gold Standard
            Comparing Turkers Against a Gold Standard
        5.4.3 Results
        5.4.4 Limitations of the Crowd
    5.5 Training a Classifier on Crowd-Labeled Data
        5.5.1 Models