Heterogeneous Information Networks for Text-Based Link Mining: A Position Paper on Visualization and Structure Learning Methods

William H. Hsu
Dept. of Computing and Information Sciences, Kansas State University
+1 785 236 8247

Praveen Koduru
iQGateway, LLC

ChengXiang Zhai
Dept. of Computer Science, University of Illinois

Abstract

In this position paper, we discuss the representation of user and community domains in blogs, forums, and social media as heterogeneous information networks, and describe a broad framework for analyzing interrelationships within such networks. This approach builds upon two main methodologies: one that is focused on community detection and graph analysis, and another that infers topic models in relation to communities and group-level topic mixtures. We first discuss previous work on link existence prediction in social networks, particularly user-to-user links (e.g., friend recommendation) and user-to-community links (e.g., community detection and recommendation), and point out some critical limitations of using this approach to detect communities. Next, we present a very general and flexible probabilistic model for topic modeling across heterogeneous information networks, and survey techniques for learning it. We then discuss why this approach has become necessary given changes in the privacy policies of social networks and other information service providers. Finally, we discuss the task of synthesizing structure learning and topic modeling, and relate this to extant approaches, applications, and future work.

1 Introduction

In this paper, we address the problem of information retrieval and information extraction in subjective domains, with applications to visualization of opinions, specifically thematic mapping of opinions. At present, there is a dearth of methods for integrating user profile data for social networks with blog posts, tweets, and other content from the associated social media. These limitations present an integrative challenge for human-computer interaction (HCI) and information retrieval (IR). Towards this end, the specific aims of the research proposed in this position paper are as follows:

1. Aim 1. Extend known algorithms for named entity recognition and relationship extraction to produce basic summaries of diseases and treatments mentioned in texts. The technical objective is to tag where basic entities and opinions are mentioned in freely available text (including both user posts and profiles), then map these tagged elements in space, in time, and by topic, to acceptable levels of precision and recall.

2. Aim 2. Adapt basic known techniques to the domain of type 2 diabetes: specifically, extracting data from text discussions of diabetes that are archived from health blogs and forums using web crawlers. [1] This entails developing a means of handling entities and quantitative data that have not previously been extracted from text, such as information concerning insulin and oral antidiabetic drug dosage, HbA1c levels, etc. Another functional requirement is some mechanism for entity reference resolution, e.g., abbreviations and synonyms, for known terms. Finally, a domain-specific ontology of relevant symptoms, disease attributes, complications, and treatments is proposed. For type 2 diabetes, this includes topics frequently discussed in health blogs and forums: food groups, meal plans, nutritional constraints, and conditions such as obesity that are linked to diabetes. This shall facilitate information retrieval applications such as question answering about meal plans recommended by primary care physicians and specialists.

3. Aim 3. Develop methods for sentiment analysis and improve existing ones, to summarize opinions and discover patterns. The technical objective is to relate demographic data extracted from text and profiles to qualitative data: namely, the polarity of text at the document, sentence, or aspect level, aggregated across demographic categories such as geographic region of residence. Objects of interest for sentiment analysis include prescribed therapies and, specifically, side effects, but can extend to disease aspects and complications.

The overall goal of this approach is to develop an integrative technology for summarizing online text about chronic diseases, capturing opinions from users' posts and demographic data from a combination of their posts and profiles, and finally using these to discover global patterns indicated by the set of all text documents. The central hypothesis of this work is that a combination of entity and relationship extraction, driven by a domain-specific ontology of terms, will result in more precise and accurate summarization of opinions. This will increase the usefulness of free-form text, written by users of social media, in understanding patterns that are reflected in the opinions and demographics of chronic disease patients.

2 Background

2.1 Information Extraction from Health Blogs

The chief potential impact of the research framework and test bed proposed in this paper is to provide assistive technologies to public health analysts and health services analysts who are using blogs, microblogs (e.g., Twitter), and other social media to explore user opinions about chronic disease issues. As an example, in the application domain of type 2 diabetes, these include dietary treatments such as carbohydrate control, complications such as gastroparesis induced by diabetes that may pose digestive constraints, and recommendations of primary care physicians, therapists, endocrinologists, nutritionists, etc.

The availability of mailing lists, blogs, wikis, and other electronic media for content management and dissemination has resulted in rapid growth in the volume of online text data containing voluntarily expressed public opinions about health issues. While general-purpose metadata tools exist for annotating this text, the opinions themselves remain a largely unexplored source of information about how chronic diseases affect populations. Meanwhile, the task of relating content from these various self-publishing media to semi-structured profile data from their users has not yet been effectively automated. We advocate development of application test beds and experimental systems aimed at improving techniques for information extraction, ontology development and mapping, and text mining to identify opinion patterns. The potential progress in these areas is due in part to the approach of combining information extraction to discover disease mentions with sentiment analysis to establish opinions, and in part to the application of this approach to a new source of data: free-form text describing user demographics, attributes of the chronic disease of interest and its related entities, and opinions, along with semi-structured profile data.

To help public health researchers tap into these freely available but unexplored sources of opinions, we propose to develop information extraction (IE) and summarization methods geared toward health blog postings and similar text. Such postings contain not only opinions and attribution information that can be used to link them to the users who expressed them, but also factual data about the posters and their opinions. This data can help place opinions in a comparative context [2] with population statistics, such as the reporting frequency of symptoms, side effects, and complications.

2.2 Ontology Development

The research approach centers around using information extraction to obtain structured data in the form of records about chronic disease references in text, which are then linked to users via relational data extracted from their profiles. [3] However, the body of relevant concepts in the healthcare domain and in the clinical domain theory of each chronic disease is much broader. Currently there exist preclinical (genomic and proteomic) and clinical translational ontologies [4] that contain information relevant to diabetes, but they do not provide the requisite concepts for mining free-form text written by lay users who are discussing diabetes online. We propose to develop an ontology for text mining in diabetes, and the mappings from extracted entities and relationships into this ontology.

2.3 Opinion Mining (Sentiment Analysis)

This aspect of the proposed work focuses on a basic research problem: sentiment analysis from text, also known as opinion mining, whose objective is to determine from analysis of a written document what the author's attitude towards an identifiable topic is. This attitude can be subjective or objective; it can be identified as an evaluation (positive or negative), a declaration of the author's emotional attitude, or an expression intended to evoke an emotional response in the reader. Subjects of interest include chronic diseases, their features or aspects including symptoms, complications, and treatments, and related health services.

2.4 Current State of the Field

Figure 1. Prototype event search based on a previous IR system for veterinary epidemiology.

Figure 1 depicts a simple search interface for an existing IR system developed by the principal investigator's research group. This system was designed for event extraction in the domain of viral zoonoses, but uses general-purpose software for web crawling and ranking (the latter is developed using Lucene Java).
One marker is displayed on both the thematic map and the timeline for each returned

related information that can be related to these data. The
