Multi-Dimensional Mining of Unstructured Data with Limited Supervision

c 2018 Chao Zhang MULTI-DIMENSIONAL MINING OF UNSTRUCTURED DATA WITH LIMITED SUPERVISION BY CHAO ZHANG DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2018 Urbana, Illinois Doctoral Committee: Professor Jiawei Han, Chair Professor ChengXiang Zhai Professor Tarek Abdelzaher Professor Qiaozhu Mei ABSTRACT As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to healthcare and scientific research. In many emerging applications, people's information needs from text data are becoming multi-dimensional|they demand useful insights for multiple aspects from the given text corpus. However, turning massive text data into multi-dimensional knowledge remains a challenge that cannot be readily addressed by existing data mining techniques. In this thesis, we propose algorithms that turn unstructured text data into multi-dimensional knowledge with limited supervision. We investigate two core questions: 1. How to identify task-relevant data with declarative queries in multiple dimensions? 2. How to distill knowledge from data in a multi-dimensional space? To address the above questions, we propose an integrated cube construction and exploitation framework. First, we develop a cube construction module that organizes unstructured data into a cube structure, by discovering latent multi-dimensional and multi-granular structure from the unstructured text corpus and allocating documents into the structure. Sec- ond, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multi-dimensional knowledge from data to provide insights along multiple dimensions. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multi-dimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can make accurate cross-dimension predictions or extract multi-dimensional patterns for decision making. The proposed framework has two distinctive advantages when turning text data into multi-dimensional knowledge: flexibility and label-efficiency. First, it enables acquir- ing multi-dimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and further distill multi-dimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain. ii To my family and Shili, for their love and support. iii ACKNOWLEDGMENTS First and foremost, I want to thank Professor Jiawei Han|the most ideal advisor one can hope for. As an advisor, Jiawei has provided tremendous support and guidance to train me into a researcher throughout my Ph.D. study. He has taught me essential skills to become a mature researcher, from identifying and formulating important problems, to developing solutions critically and rigorously, to collaborating with other researchers, and to communicating ideas clearly. The influence of Jiawei is beyond the role of advior|he is also a life-long role model to me. I have learned from him the courage to take challenges, the positive attitude in front of setbacks, the diligence and self-discipline, and many other characteristics. Such influences will be a forever treasure in my life. I would like to thank my thesis committee members, Professor ChengXiang Zhai, Professor Tarek F. Abdelzaher, and Professor Qiaozhu Mei. They have provided not only careful commentary on this document, but also tremendous help in my job search process and invaluable advice for crafting a successful academic career. I have been fortunate to interact with and learn from many wonderful researchers in academia: Dr. Jessie Zhenhui Li, Dr. Yizhou Sun, Dr. Xifeng Yan, Dr. Feida Zhu, Dr. Jian Pei, Dr. Quanquan Gu, Dr. Lu Su, Dr. Jing Gao, Dr. Yu Zheng, Dr. Shaowen Wang, Dr. Phillip Yu, Dr. Hong Cheng, Dr. Jimeng Sun, Dr. Anthony K. H. Tung. As experienced researchers, they have given me a lot of help and advice both on my research and on career development. Special thanks to Jessie, it is because of our conversation in KDD 2016 that I finally decided to join academia after graduation. Many thanks to my dear friends in the DMG and DAIS family! Alphabetically, they are: Xiusi Chen, Joe Chen, Yucheng Chen, Ahmed El-Kishky, Xiaotao Gu, Huan Gui, Meng Jiang, Shan Jiang, Dongming Lei, Min Li, Ji Li, Qi Li, Zoey Li, De Liao, Liyuan Liu, Mengxiong Liu, Jialu Liu, Yuning Mao, Yu Meng, Meng Qu, Xiang Ren, Jingbo Shang, Jiaming Shen, Yu Shi, Fangbo Tao, Wenzhu Tong, Frank Xu, Chenguang Wang, Chi Wang, Qi Wang, Jingjing Wang, Xuan Wang, Ellen Wu, Carl Yang, Hongkun Yu, Xiao Yu, Quan Yuan, Keyang Zhang, Yu Zhang, Shi Zhi, Qi Zhu, Wanzheng Zhu, and Honglei Zhuang. Because of them, Ph.D. life at UIUC has been so much fun. I am indebted to many of them who took great care of me during my achilles tendon injury. It is their care and encouragement that helped me through the recovery process. I wish all of them a successful career and a happy life. I am very grateful to all my collaborators outside the UIUC data mining group: Tim iv Hanratty, Lance Kaplan, Shuochao Yao, Huajie Shao, Wei Zhang, Wei Feng, Yong Li, Di Yao, Jie Feng, Julie Yixuan Zhu, and Luming Zhang. The collaborations with them are always enjoyable|I love working together in idea brainstorming, coding, paper writing, and catching up with conference deadlines. Our collaborations have been so fruitful. I look foward to future opportunities in collaborating with them and meeting them at conferences. I also want express my deep gratitude to my M.S. advisors at Zhejiang University: Profes- sor Gang Chen, Professor Lidan Shou, and Professor Ke Chen. They guided me to step into Computer Science research and realize its beauty in an early stage. Their hands-on mentor- ing in research methodology, intellectual audacity, writing, and presentation prepared me so well for starting my Ph.D. at UIUC. And over time, I have realized how important these skills are to my academic career. Finally, my deepest gratitude to my family! I thank my parents and sister for their endless love and constant support in every aspect of my life. Special thanks to Shili|my dear sunshine lighting up my world. Without you I would not have made it this far and been so happy every day. I am so fortunate to have you around, and want to thank you for your love, encouragement, and support in this wonderful journey. v TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . 1 1.1 Overview . .1 1.2 Main Modules . .3 1.3 Technical Contributions . .7 1.4 Organization . 11 CHAPTER 2 TAXONOMY GENERATION FOR CUBE SCHEMA DISCOVERY . 12 2.1 Overview . 12 2.2 Related Work . 16 2.3 Preliminaries . 18 2.4 Adaptive Term Clustering . 19 2.5 Adaptive Term Embedding . 22 2.6 Experimental Evaluation . 24 2.7 Summary . 31 CHAPTER 3 DOCUMENT ALLOCATION WITHOUT LABELED DATA . 32 3.1 Overview . 32 3.2 Related Work . 35 3.3 Preliminaries . 36 3.4 Graph-Based Joint Embedding . 38 3.5 Dimension-Aware Embedding Learning . 40 3.6 The Overall Doc2Cube Algorithm . 43 3.7 Experimental Evaluation . 45 3.8 Summary . 53 CHAPTER 4 CROSS-DIMENSION PREDICTION IN CUBE SPACE . 54 4.1 Overview . 54 4.2 Related Work . 57 4.3 Preliminaries . 57 4.4 Semi-Supervised Multimodal Embedding . 59 4.5 Online Updating of Multimodal Embedding . 63 4.6 Experiments . 68 4.7 Summary . 80 CHAPTER 5 EVENT DETECTION IN CUBE SPACE . 82 5.1 Overview . 82 5.2 Related Work . 84 5.3 Preliminaries . 86 5.4 Candidate Generation . 90 5.5 Candidate Classification . 93 vi 5.6 Supporting Continuous Event Detection . 94 5.7 Complexity Analysis . 94 5.8 Experiments . 95 5.9 Summary . 103 CHAPTER 6 CONCLUSION . 104 6.1 Summary . 104 6.2 Future Work . 105 REFERENCES . 107 vii CHAPTER 1: INTRODUCTION 1.1 OVERVIEW Text is one of the most important data forms for the human to record and communicate information. In a wide spectrum of domains, hundreds of millions of textual contents are being created, shared, and analyzed every single day|examples including tweets, news articles, Google searches, and medical notes. It is estimated that more than 80% of human knowledge is encoded in the form of unstructured text [29]. Due to such ubiquity, extracting useful insights from text data is crucial to decision making in various applications, ranging from automated medical diagnosis [12, 90] and disaster management [52, 75] to fraud detection [80, 50] and personalized recommendation [53, 51]. In many newly emerged applications, people's information needs from text data are becoming multi-dimensional. That is, people can demand useful insights for multiple aspects from the given text corpus. Consider an analyst who wants to use news corpora for disaster analytics. Upon identifying all disaster-related news articles, she needs to understand what disaster each article is about, where and when is the disaster, and who are involved. She may even need to explore the multi-dimensional what-where-when-who space at varied granularities, so as to answer questions like `what are all the hurricanes happened in the US in 2018' or `what are all the disasters happened in California in June'. Disaster analytics is only one of the many applications where multi-dimensional knowledge is required. To name a few other examples, (1) analyzing a biomedical research corpus often requires digging out what genes, proteins, and diseases each paper is about and uncover their inter-correlations; (2) leveraging the medical notes of diabetes patients for automatic diagnosis requires correlat- ing symptoms with genders, ages, and even geographical regions; (3) analyzing Twitter data for an presidential election event requires understanding people's sentiments about various political aspects in different geographical regions and time periods.

Multi-Dimensional Mining of Unstructured Data with Limited Supervision

Big-Data Science in Porous Materials: Materials Genomics and Machine Learning

Unstructured Data Is a Risky Business

1 Application of Text Mining to Biomedical Knowledge Extraction: Analyzing Clinical Narratives and Medical Literature

Big Data Mining Tools for Unstructured Data: a Review YOGESH S

Extracting Unstructured Data from Template Generated Web Documents

Top Natural Language Processing Applications in Business UNLOCKING VALUE from UNSTRUCTURED DATA for Years, Enterprises Have Been Making Good Use of Their 1

Combining Unstructured, Fully Structured and Semi-Structured Information in Semantic Wikis

Solving the Unstructured Data Puzzle with Analytics

Cheminformatics for Genome-Scale Metabolic Reconstructions

Geospatial Semantics Yingjie Hu GSDA Lab, Department of Geography, University of Tennessee, Knoxville, TN 37996, USA

Unstructured Data Analysis in Arcgis

The Role of Text Analytics in Healthcare: a Review of Recent Developments and Applications