UC Riverside
UC Riverside Electronic Theses and Dissertations

Title: Automated Analysis of User-Generated Content on the Web
Permalink: https://escholarship.org/uc/item/7tt7t9t5
Author: Rivas, Ryan
Publication Date: 2021
License: https://creativecommons.org/licenses/by/4.0/ (4.0)
Peer reviewed | Thesis/dissertation

eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA
RIVERSIDE

Automated Analysis of User-Generated Content on the Web

A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by Ryan Rivas

March 2021

Dissertation Committee:
Dr. Vagelis Hristidis, Chairperson
Dr. Eamonn Keogh
Dr. Amr Magdy
Dr. Vagelis Papalexakis

Copyright by Ryan Rivas 2021

The Dissertation of Ryan Rivas is approved:

Committee Chairperson

University of California, Riverside

The text of this dissertation, in part, is a reprint of the material as it appears in the following publications:

Shetty P, Rivas R, Hristidis V. Correlating ratings of health insurance plans to their providers' attributes. Journal of Medical Internet Research. 2016;18(10):e279.

Rivas R, Montazeri N, Le NX, Hristidis V. Automatic classification of online doctor reviews: evaluation of text classifier algorithms. Journal of Medical Internet Research. 2018;20(11):e11141.

Rivas R, Patil D, Hristidis V, Barr JR, Srinivasan N. The impact of colleges and hospitals to local real estate markets. Journal of Big Data. 2019 Dec 1;6(1):7.

Rivas R, Sadah SA, Guo Y, Hristidis V. Classification of health-related social media posts: evaluation of post content classifier models and analysis of user demographics. JMIR Public Health and Surveillance. 2020;6(2):e14952.

The coauthor Vagelis Hristidis listed in these publications directed and supervised the research which forms the basis for this dissertation.
ABSTRACT OF THE DISSERTATION

Automated Analysis of User-Generated Content on the Web

by Ryan Rivas

Doctor of Philosophy, Graduate Program in Computer Science
University of California, Riverside, March 2021
Dr. Vagelis Hristidis, Chairperson

Social media users generate large volumes of data every day. Analysis of this data is an important tool in several areas. For example, the study of users' opinions, behaviors, and topics of discussion can be of use in the field of health care. The first part of my research involves using existing tools to perform analysis of Web content. Specifically, it first compares health care provider attributes with quality measures of the insurance plans they accept. This is followed by analyses of how real estate prices and related metrics are affected by proximity to a university or hospital. Further, this research studies user behaviors and discussion topics to find differences in how various demographic groups generate content on health-related Web forums and in health-related discussions on general social media.

The remainder of this dissertation shifts its focus from analysis of Web content to proposing new tools to perform similar analyses. It first proposes and evaluates natural language processing-based methods to automatically classify patient opinions in doctor reviews. This work also introduces a variant of the review classification problem where class labels can represent two opposing opinions that are not necessarily positive or negative. This is followed by an exploration of methods to effectively filter social media posts according to a user's interests. The key challenge behind this work is to determine how to use this information to maximize a trained text classification model's performance in classifying new posts. Finally, this dissertation proposes a multimodal Twitter embedding model that can leverage information from several parts of a tweet, such as text, image, and location.
Such a model can have several applications for both researchers and Twitter users without the need to train a separate model for each application.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Research Problems
2 Correlating Ratings of Health Insurance Plans to Their Providers' Attributes
  2.1 Introduction
    2.1.1 Online Health Insurance Marketplaces and Search Sites
    2.1.2 Attributes Associated with Insurance Quality
  2.2 Methods
    2.2.1 Summary
    2.2.2 Data Collection
    2.2.3 Entity Mappings
  2.3 Results
    2.3.1 Summary
    2.3.2 General Statistics of Insurance Plans
    2.3.3 Attribute Correlations
  2.4 Discussion
    2.4.1 Principal Findings
    2.4.2 Limitations
    2.4.3 Conclusions
3 The Impact of Colleges and Hospitals to Local Real Estate Markets
  3.1 Introduction
  3.2 Related Work
  3.3 Methods
    3.3.1 Data Collection
    3.3.2 ZIP Code-Level Analysis
    3.3.3 Home-Level Analysis
  3.4 Results
    3.4.1 ZIP Code-Level Analysis
    3.4.2 Home-Level Analysis
  3.5 Discussion
    3.5.1 Limitations
  3.6 Conclusions
4 Classification of Health-Related Social Media Posts: Evaluation of Post Content Classifier Models and Analysis of User Demographics
  4.1 Introduction
    4.1.1 Background
    4.1.2 Related Work
  4.2 Methods
    4.2.1 Datasets
    4.2.2 Identifying Post Contents
    4.2.3 Bot Filtering
    4.2.4 Building Post Content Classifiers
    4.2.5 Demographic Analysis
    4.2.6 Top Distinctive Message Boards
  4.3 Results
    4.3.1 Demographics
    4.3.2 WebMD
    4.3.3 DailyStrength
    4.3.4 Twitter