Multi-Dimensional Mining of Unstructured Data with Limited Supervision
Total Page:16
File Type:pdf, Size:1020Kb
c 2018 Chao Zhang MULTI-DIMENSIONAL MINING OF UNSTRUCTURED DATA WITH LIMITED SUPERVISION BY CHAO ZHANG DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2018 Urbana, Illinois Doctoral Committee: Professor Jiawei Han, Chair Professor ChengXiang Zhai Professor Tarek Abdelzaher Professor Qiaozhu Mei ABSTRACT As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and informa- tion retrieval to healthcare and scientific research. In many emerging applications, people's information needs from text data are becoming multi-dimensional|they demand useful in- sights for multiple aspects from the given text corpus. However, turning massive text data into multi-dimensional knowledge remains a challenge that cannot be readily addressed by existing data mining techniques. In this thesis, we propose algorithms that turn unstructured text data into multi-dimensional knowledge with limited supervision. We investigate two core questions: 1. How to identify task-relevant data with declarative queries in multiple dimensions? 2. How to distill knowledge from data in a multi-dimensional space? To address the above questions, we propose an integrated cube construction and exploita- tion framework. First, we develop a cube construction module that organizes unstructured data into a cube structure, by discovering latent multi-dimensional and multi-granular struc- ture from the unstructured text corpus and allocating documents into the structure. Sec- ond, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multi-dimensional knowledge from data to provide insights along multiple dimensions. Together, these two modules constitute an integrated pipeline: lever- aging the cube structure, users can perform multi-dimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can make accurate cross-dimension predictions or extract multi-dimensional patterns for decision making. The proposed framework has two distinctive advantages when turning text data into multi-dimensional knowledge: flexibility and label-efficiency. First, it enables acquir- ing multi-dimensional knowledge flexibly, as the cube structure allows users to easily iden- tify task-relevant data along multiple dimensions at varied granularities and further distill multi-dimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain. ii To my family and Shili, for their love and support. iii ACKNOWLEDGMENTS First and foremost, I want to thank Professor Jiawei Han|the most ideal advisor one can hope for. As an advisor, Jiawei has provided tremendous support and guidance to train me into a researcher throughout my Ph.D. study. He has taught me essential skills to become a mature researcher, from identifying and formulating important problems, to developing solutions critically and rigorously, to collaborating with other researchers, and to communicating ideas clearly. The influence of Jiawei is beyond the role of advior|he is also a life-long role model to me. I have learned from him the courage to take challenges, the positive attitude in front of setbacks, the diligence and self-discipline, and many other characteristics. Such influences will be a forever treasure in my life. I would like to thank my thesis committee members, Professor ChengXiang Zhai, Professor Tarek F. Abdelzaher, and Professor Qiaozhu Mei. They have provided not only careful commentary on this document, but also tremendous help in my job search process and invaluable advice for crafting a successful academic career. I have been fortunate to interact with and learn from many wonderful researchers in academia: Dr. Jessie Zhenhui Li, Dr. Yizhou Sun, Dr. Xifeng Yan, Dr. Feida Zhu, Dr. Jian Pei, Dr. Quanquan Gu, Dr. Lu Su, Dr. Jing Gao, Dr. Yu Zheng, Dr. Shaowen Wang, Dr. Phillip Yu, Dr. Hong Cheng, Dr. Jimeng Sun, Dr. Anthony K. H. Tung. As experienced researchers, they have given me a lot of help and advice both on my research and on career development. Special thanks to Jessie, it is because of our conversation in KDD 2016 that I finally decided to join academia after graduation. Many thanks to my dear friends in the DMG and DAIS family! Alphabetically, they are: Xiusi Chen, Joe Chen, Yucheng Chen, Ahmed El-Kishky, Xiaotao Gu, Huan Gui, Meng Jiang, Shan Jiang, Dongming Lei, Min Li, Ji Li, Qi Li, Zoey Li, De Liao, Liyuan Liu, Mengxiong Liu, Jialu Liu, Yuning Mao, Yu Meng, Meng Qu, Xiang Ren, Jingbo Shang, Jiaming Shen, Yu Shi, Fangbo Tao, Wenzhu Tong, Frank Xu, Chenguang Wang, Chi Wang, Qi Wang, Jingjing Wang, Xuan Wang, Ellen Wu, Carl Yang, Hongkun Yu, Xiao Yu, Quan Yuan, Keyang Zhang, Yu Zhang, Shi Zhi, Qi Zhu, Wanzheng Zhu, and Honglei Zhuang. Because of them, Ph.D. life at UIUC has been so much fun. I am indebted to many of them who took great care of me during my achilles tendon injury. It is their care and encouragement that helped me through the recovery process. I wish all of them a successful career and a happy life. I am very grateful to all my collaborators outside the UIUC data mining group: Tim iv Hanratty, Lance Kaplan, Shuochao Yao, Huajie Shao, Wei Zhang, Wei Feng, Yong Li, Di Yao, Jie Feng, Julie Yixuan Zhu, and Luming Zhang. The collaborations with them are always enjoyable|I love working together in idea brainstorming, coding, paper writing, and catching up with conference deadlines. Our collaborations have been so fruitful. I look foward to future opportunities in collaborating with them and meeting them at conferences. I also want express my deep gratitude to my M.S. advisors at Zhejiang University: Profes- sor Gang Chen, Professor Lidan Shou, and Professor Ke Chen. They guided me to step into Computer Science research and realize its beauty in an early stage. Their hands-on mentor- ing in research methodology, intellectual audacity, writing, and presentation prepared me so well for starting my Ph.D. at UIUC. And over time, I have realized how important these skills are to my academic career. Finally, my deepest gratitude to my family! I thank my parents and sister for their endless love and constant support in every aspect of my life. Special thanks to Shili|my dear sunshine lighting up my world. Without you I would not have made it this far and been so happy every day. I am so fortunate to have you around, and want to thank you for your love, encouragement, and support in this wonderful journey. v TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . 1 1.1 Overview . .1 1.2 Main Modules . .3 1.3 Technical Contributions . .7 1.4 Organization . 11 CHAPTER 2 TAXONOMY GENERATION FOR CUBE SCHEMA DISCOVERY . 12 2.1 Overview . 12 2.2 Related Work . 16 2.3 Preliminaries . 18 2.4 Adaptive Term Clustering . 19 2.5 Adaptive Term Embedding . 22 2.6 Experimental Evaluation . 24 2.7 Summary . 31 CHAPTER 3 DOCUMENT ALLOCATION WITHOUT LABELED DATA . 32 3.1 Overview . 32 3.2 Related Work . 35 3.3 Preliminaries . 36 3.4 Graph-Based Joint Embedding . 38 3.5 Dimension-Aware Embedding Learning . 40 3.6 The Overall Doc2Cube Algorithm . 43 3.7 Experimental Evaluation . 45 3.8 Summary . 53 CHAPTER 4 CROSS-DIMENSION PREDICTION IN CUBE SPACE . 54 4.1 Overview . 54 4.2 Related Work . 57 4.3 Preliminaries . 57 4.4 Semi-Supervised Multimodal Embedding . 59 4.5 Online Updating of Multimodal Embedding . 63 4.6 Experiments . 68 4.7 Summary . 80 CHAPTER 5 EVENT DETECTION IN CUBE SPACE . 82 5.1 Overview . 82 5.2 Related Work . 84 5.3 Preliminaries . 86 5.4 Candidate Generation . 90 5.5 Candidate Classification . 93 vi 5.6 Supporting Continuous Event Detection . 94 5.7 Complexity Analysis . 94 5.8 Experiments . 95 5.9 Summary . 103 CHAPTER 6 CONCLUSION . 104 6.1 Summary . 104 6.2 Future Work . 105 REFERENCES . 107 vii CHAPTER 1: INTRODUCTION 1.1 OVERVIEW Text is one of the most important data forms for the human to record and communicate information. In a wide spectrum of domains, hundreds of millions of textual contents are being created, shared, and analyzed every single day|examples including tweets, news ar- ticles, Google searches, and medical notes. It is estimated that more than 80% of human knowledge is encoded in the form of unstructured text [29]. Due to such ubiquity, extracting useful insights from text data is crucial to decision making in various applications, rang- ing from automated medical diagnosis [12, 90] and disaster management [52, 75] to fraud detection [80, 50] and personalized recommendation [53, 51]. In many newly emerged applications, people's information needs from text data are becom- ing multi-dimensional. That is, people can demand useful insights for multiple aspects from the given text corpus. Consider an analyst who wants to use news corpora for disaster analytics. Upon identifying all disaster-related news articles, she needs to understand what disaster each article is about, where and when is the disaster, and who are involved. She may even need to explore the multi-dimensional what-where-when-who space at varied granulari- ties, so as to answer questions like `what are all the hurricanes happened in the US in 2018' or `what are all the disasters happened in California in June'. Disaster analytics is only one of the many applications where multi-dimensional knowledge is required. To name a few other examples, (1) analyzing a biomedical research corpus often requires digging out what genes, proteins, and diseases each paper is about and uncover their inter-correlations; (2) leveraging the medical notes of diabetes patients for automatic diagnosis requires correlat- ing symptoms with genders, ages, and even geographical regions; (3) analyzing Twitter data for an presidential election event requires understanding people's sentiments about various political aspects in different geographical regions and time periods.