Data Mining Twitter for Cancer, Diabetes, and Asthma Insights
Kimberly Chulis
Purdue University
Purdue University
Purdue e-Pubs
Open Access Dissertations
Theses and Dissertations
8-2016

Data mining Twitter for cancer, diabetes, and asthma insights
Kimberly Chulis
Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations
Part of the Communication Commons, and the Health and Medical Administration Commons

Recommended Citation
Chulis, Kimberly, "Data mining Twitter for cancer, diabetes, and asthma insights" (2016). Open Access Dissertations. 745.
https://docs.lib.purdue.edu/open_access_dissertations/745

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Graduate School Form 30 Updated
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared
By KIMBERLY CHULIS
Entitled DATA MINING TWITTER FOR CANCER, DIABETES, AND ASTHMA INSIGHTS
For the degree of Doctor of Philosophy

Is approved by the final examining committee:
RICHARD A. FEINBERG, Chair
CHRISTOPHER J. KOWAL
CORINNE A. NOVELL
TONGXIAO (CATHERINE) ZHANG

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.

Approved by Major Professor(s): RICHARD FEINBERG
Approved by: RICHARD FEINBERG, Head of the Departmental Graduate Program
Date: 7/22/2016

DATA MINING TWITTER FOR CANCER, DIABETES, AND ASTHMA INSIGHTS
A Dissertation
Submitted to the Faculty of Purdue University
by Kimberly Chulis
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
August 2016
Purdue University
West Lafayette, Indiana

For my Mom and Dad.
ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help of so many individuals. I thank my Mom and Dad, first and foremost, for their continued support and encouragement throughout the entire process. It was your example and faith that provided me with the energy and persistence to complete this goal. I am deeply grateful to my chair, Dr. Richard Feinberg, for all of his input and support throughout my time at Purdue and specifically for his participation and patience supporting this research. I thank my committee for their participation and the valuable insights and suggestions towards the refinement and improvement of this work. I owe great gratitude to our graduate coordinator, Katharine Treece, and Lana Burnau, as well as our former graduate coordinator, Jeannie Navarre, for all they have done to support me in this effort. A heartfelt thank you also goes to Ashlee Messersmith and Dr. Kathleen Cannon for all of their formatting work, and to Aldo Deijkers for his ongoing encouragement and support.

TABLE OF CONTENTS
Page
LIST OF TABLES .... vii
LIST OF FIGURES .... ix
ABSTRACT .... x
CHAPTER 1 INTRODUCTION .... 1
  Social Media Data and Healthcare .... 3
  Disease Prevalence Trends .... 5
CHAPTER 2 LITERATURE REVIEW .... 10
  Public Health .... 12
  Social Media Health .... 19
  Health Behaviors .... 23
  Twitter Disease Category .... 26
  Cancer .... 29
  Twitter Data Management Methods .... 32
  Summary .... 36
CHAPTER 3 METHODOLOGY .... 38
  Method .... 38
  Analysis .... 40
    Cancer .... 40
    Diabetes and Asthma Files .... 43
    Cross Industry Standard Process for Data Mining (CRISP-DM) .... 43
  Business Understanding Phase – CRISP-DM .... 46
  Data Understanding Phase – CRISP-DM .... 47
  Data Preparation Phase – CRISP-DM .... 55
  Modeling Phase – CRISP-DM .... 56
  Evaluation Phase – CRISP-DM .... 57
  Deployment Phase – CRISP-DM .... 59
  Data Management .... 60
CHAPTER 5 CONTRASTING DATA CLEANSING APPROACHES .... 61
  Data Cleansing Approach 1: Only Contains Disease Keyword .... 61
  Data Cleansing Approach 2: Manually Identified Data Filters and Stopwords List .... 62
  Data Cleansing Approach 3: Singular Value Decomposition and Concept Clusters .... 65
  Diabetes and Asthma Data File Audits .... 71
  Data Mining Twitter for Cancer Insights .... 72
  Total Unique Users .... 72
  Deriving Variables for Cancer Tweets for Further Investigation .... 73
  High Volume Tweet Analysis .... 75
  High-level Tweet Patterns in the Refined Cancer File .... 76
  Aggregating Tweets to Analyze User Segments .... 76
  Different Varieties of Twitterbots .... 78
  Who’s Tweeting About Cancer? .... 79
  Organizations .... 80
  Providers .... 83
  Individuals .... 84
  Contrasting User Behaviors by Cancer Type in the File .... 85
  Breast Cancer Flag .... 86
  Segmentation of Cancer File by Cancer Type .... 101
  Amplification Trends .... 105
  Information Sharing Trends .... 106
  Retweeting Trends .... 107
  Reply Trends .... 108
  Monthly Trends by Cancer Type .... 109
  Secondary CHAID Analysis Using SVD-generated Concept Clusters .... 111
  Data Mining Twitter for Diabetes Insights .... 125
    Total Unique Users .... 125
    Identifying Twitter Behavioral Patterns in the Diabetes File .... 128
    Monthly Volume of Diabetes Tweets .... 137
    Diabetes Word Importance List and Tweet Categories .... 137
Data