
Natural Language Processing and Text Mining with Graph-Structured Representations

by

Bang Liu

A thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Department of Electrical and Computer Engineering

University of Alberta

© Bang Liu, 2020

Abstract

Natural Language Processing (NLP) and understanding aims to read unformatted text to accomplish different tasks. As a first step, it is necessary to represent text with a simplified model. Traditionally, the Vector Space Model (VSM) is most commonly used, in which text is represented as a bag of words. In recent years, vectors learned by deep neural networks have also been widely used. However, the underlying linguistic and semantic structures of text pieces cannot be expressed and exploited in these representations. Graphs are a natural way to capture the connections between different text pieces, such as entities, sentences, and documents. To overcome the limits of vector space models, we combine deep learning models with graph-structured representations for various tasks in NLP and text mining. Such combinations help to make full use of both the structural information in text and the representation learning ability of deep neural networks. Specifically, we make contributions to the following NLP tasks.

First, we introduce tree-based/graph-based sentence/document decomposition techniques to align sentence/document pairs, and combine them with Siamese neural networks and graph convolutional networks (GCNs) to perform fine-grained semantic relevance estimation. Based on them, we propose the Story Forest system to automatically cluster streaming documents into fine-grained events, while connecting related events in growing trees to tell evolving stories. Story Forest has been deployed in Tencent QQ Browser for hot event discovery.

Second, we propose the ConcepT and GIANT systems to construct a user-centered, web-scale ontology, containing a large number of heterogeneous phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs. We introduce a novel graphical representation and combine it with Relational-GCN to perform heterogeneous phrase mining and relation identification. The GIANT system has been deployed in Tencent QQ Browser for news feed recommendation and search, serving more than 110 million daily active users. It also offers a document tagging service to WeChat.

Third, we propose Answer-Clue-Style-aware Question Generation to automatically generate diverse and high-quality question-answer pairs from unlabeled text at scale by mimicking the way a human asks questions. Our algorithms combine sentence structures with GCNs and Seq2Seq-based generative models to bring the "one-to-many" question generation problem closer to a "one-to-one" mapping problem.

A major part of our work has been deployed in real-world applications at Tencent and serves billions of users.

To my parents, Jinzhi Cheng and Shangping Liu.
To my grandparents, Chunhua Hu and Jiafa Cheng.


Where there’s a will, there is a way. 事在人为

Acknowledgements

This work is not mine alone. I learned a lot and received support from many people over the past six years. First, I am immensely grateful to my advisor, Professor Di Niu, for teaching me how to become a professional researcher. I joined Di's group in September 2013. During the last six years, Di not only created a great research environment for me and all his other students, but also helped me by sharing valuable experience and suggestions on how to develop a professional career. More importantly, Di is not only a kind and supportive advisor, but also an older friend of mine. He always believes in me even though I am not always that confident about myself. I learned a lot from Di and I am very grateful to him.

I learned a great deal from my talented collaborators and mentors: Professor Linglong Kong and Professor Zongpeng Li. Professor Linglong Kong is my co-supervisor. He is a very nice supervisor as well as a friend. He is quite smart and can see the nature of research problems. I learned a lot from him in multiple research projects. Professor Zongpeng Li is one of the coauthors of my first paper. He is an amiable professor with great enthusiasm for research. I thank him for his support of my early research work. I would like to thank Professor H. Vicky Zhao, who was my co-supervisor when I was pursuing my Master's degree. I am also very grateful to Professor Jian Pei, Davood Rafiei and Cheng Li for serving on my PhD supervisory committee. Moreover, I would like to thank Professor Denilson Barbosa, James Miller and Scott Dick for serving on the committee of my PhD candidacy exam.

My friends have made my time over the last six years much more enjoyable. Thanks to my friends, including but not limited to Yan Liu, Yaochen Hu, Rui Zhu, Wuhua Zhang, Wanru Liu, Xu Zhang, Haolan Chen, Dashun Wang, Zhuangzhi Li, Lingju Meng, Qiuyang Xiong, Ting Zhao, Ting Zhang, Fred X. Han, Chenglin Li, Mingjun Zhao, Chi Feng, Lu Qian, Yuanyuan, Ruitong Huang, Jing Cao, Eren, Shuai Zhou, Zhaoyang Shao, Kai Zhou, Yushi Wang, and many others. You are my family in Canada. Thank you for everything we have experienced together.

I am very grateful to Tencent for their support. I met a lot of friends and brilliant colleagues there. Thanks to my friends Jinghong Lin, Xiao Bao, Yuhao Zhang, Litao Hong, Weifeng Yang, Shishi Duan, Guangyi Chen, Chun Wu, Chaoyue Wang, Jinwen Luo, Nan Wang, Dong Liu, Chenglin Wu, Mengyi Li, Lin Ma, Xiaohui Han, Haojie Wei, Binfeng Luo, Di Chen, Zutong Li, Jiaosheng Zhao, Shengli Yan, Shunnan Xu, Ruicong Xu and many others. Life is a lot more fun with all of you. Thanks to Weidong Guo, Kunfeng Lai, Yu Xu, Yancheng He, and Bowei Long, for their full support of my research work at Tencent.

Thanks to my friend Qun Li and my sister Xiaoqin Zhao for all the time we spent together. Thanks to my parents, Jinzhi Cheng and Shangping Liu, and my little sister Jia Liu. Your love is what makes me strong. Thanks to all my family members; I love all of you. Lastly, thanks to my grandparents, Chunhua Hu and Jiafa Cheng, who raised me. I will always miss you, grandma.

Contents

1 Introduction
  1.1 Motivation
  1.2 User and Text Understanding: a Graph Approach
  1.3 Contributions
  1.4 Thesis Outline

2 Related Work
  2.1 Information Organization
    2.1.1 Text Clustering
    2.1.2 Story Structure Generation
    2.1.3 Text Matching
    2.1.4 Graphical Document Representation
  2.2 Information Recommendation
    2.2.1 Concept Mining
    2.2.2 Event Extraction
    2.2.3 Relation Extraction
    2.2.4 Taxonomy and Knowledge Base Construction
    2.2.5 Text Conceptualization
  2.3 Reading Comprehension

I Text Clustering and Matching: Growing Story Trees to Solve Information Explosion

3 Story Forest: Extracting Events and Telling Stories from Breaking News
  3.1 Introduction
  3.2 Problem Definition and Notations
    3.2.1 Problem Definition
    3.2.2 Notations
    3.2.3 Case Study
  3.3 The Story Forest System
    3.3.1 Preprocessing
    3.3.2 Event Extraction by EventX
    3.3.3 Growing Story Trees Online
  3.4 Performance Evaluation
    3.4.1 News Datasets
    3.4.2 Evaluation of EventX
    3.4.3 Evaluation of Story Forest
    3.4.4 Algorithm Complexity and System Overhead
  3.5 Concluding Remarks and Future Works

4 Matching Article Pairs with Graphical Decomposition and Convolutions
  4.1 Introduction
  4.2 Concept Interaction Graph
  4.3 Article Pair Matching through Graph Convolutions
  4.4 Evaluation
    4.4.1 Results and Analysis
  4.5 Conclusion

5 Matching Natural Language Sentences with Hierarchical Sentence Factorization
  5.1 Introduction
  5.2 Hierarchical Sentence Factorization and Reordering
    5.2.1 Hierarchical Sentence Factorization
  5.3 Ordered Word Mover's Distance
  5.4 Multi-scale Sentence Matching
  5.5 Evaluation
    5.5.1 Experimental Setup
    5.5.2 Unsupervised Matching with OWMD
    5.5.3 Supervised Multi-scale Semantic Matching
  5.6 Conclusion

II Text Mining: Recognizing User Attentions for Searching and Recommendation

6 A User-Centered Concept Mining System for Query and Document Understanding at Tencent
  6.1 Introduction
  6.2 User-Centered Concept Mining
  6.3 Document Tagging and Taxonomy Construction
    6.3.1 Concept Tagging for Documents
    6.3.2 Taxonomy Construction
  6.4 Evaluation
    6.4.1 Evaluation of Concept Mining
    6.4.2 Evaluation of Document Tagging and Taxonomy Construction
    6.4.3 Online A/B Testing for Recommendation
    6.4.4 Offline User Study of Query Rewriting for Searching
  6.5 Information for Reproducibility
    6.5.1 System Implementation and Deployment
    6.5.2 Parameter Settings and Training Process
    6.5.3 Publish Our Datasets
    6.5.4 Details about Document Topic Classification
    6.5.5 Examples of Queries and Extracted Concepts
  6.6 Conclusion

7 Scalable Creation of a Web-scale Ontology
  7.1 Introduction
  7.2 The Attention Ontology
  7.3 Ontology Construction
    7.3.1 Mining User Attentions
    7.3.2 Linking User Attentions
  7.4 Applications
  7.5 Evaluation
    7.5.1 Evaluation of the Attention Ontology
    7.5.2 Evaluation of the GCTSP-Net
    7.5.3 Applications: Document Tagging and Story Tree Formation
    7.5.4 Online Recommendation Performance
  7.6 Conclusion

III Text Generation: Asking Questions for Machine Reading Comprehension

8 Learning to Generate Questions by Learning What not to Generate
  8.1 Introduction
  8.2 Problem Definition and Motivation
    8.2.1 Answer-aware Question Generation
    8.2.2 What to Ask: Clue Word Prediction
    8.2.3 How to Ask: Copy or Generate
  8.3 Model Description
    8.3.1 The Passage Encoder with Masks
    8.3.2 The Question Decoder with Aggressive Copying
    8.3.3 A GCN-Based Clue Word Predictor
  8.4 Evaluation
    8.4.1 Datasets, Metrics and Baselines
    8.4.2 Experiment Settings
    8.4.3 Main Results
    8.4.4 Analysis
  8.5 Conclusion

9 Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus
  9.1 Introduction
  9.2 Problem Formulation
  9.3 Model Description
    9.3.1 Obtaining Training Data for Question Generation
    9.3.2 ACS-Aware Question Generation
    9.3.3 Sampling Inputs for Question Generation
    9.3.4 Data Filtering for Quality Control
  9.4 Evaluation
    9.4.1 Evaluate ACS-aware Question Generation
    9.4.2 Qualitative Analysis
    9.4.3 Apply to Question Answering
  9.5 Conclusion

10 Conclusions
  10.1 Summary
  10.2 Directions for Future Work
    10.2.1 Extending Current Research Work
    10.2.2 Long-Term Research Goals

Bibliography

List of Figures

1.1 The framework of the components in this thesis.
1.2 The principled methodology used through the different tasks in the thesis.
1.3 An overview of our works based on graph-structured representations.

3.1 The story tree of "2016 U.S. presidential election."
3.2 Different structures to characterize a story.
3.3 An overview of the system architecture of Story Forest.
3.4 The structure of the keyword classifier.
3.5 Three types of operations to place a new event into its related story tree.
3.6 The characteristics of the introduced Chinese News Events and Story dataset.
3.7 The influence of parameter δ on the clustering performance and number of clusters on the Chinese News Events dataset.
3.8 The number of documents on different days in the Story Forest evaluation dataset.
3.9 Comparing the performance of different story structure generation algorithms.
3.10 The characteristics of the story structures generated by the Story Forest system.
3.11 The running time of our system on the 3-month news dataset.

4.1 An example to show a piece of text and its Concept Interaction Graph representation.
4.2 An overview of our approach for constructing the Concept Interaction Graph (CIG) from a pair of documents and classifying it by Graph Convolutional Networks.
4.3 The events contained in the story "2016 U.S. presidential election."

xiii 5.1 An example of the sentence factorization process. Here we show: A. The original sentence pair; B. The procedures of creating sentence fac- torization trees; C. The predicate-argument form of original sentence pair; D. The alignment of semantic units with the reordered form. . . 76 5.2 An example of a sentence and its Abstract Meaning Representation (AMR), as well as the alignment between the words in the sentence and the nodes in AMR...... 77 5.3 An example to show the operation of AMR purification...... 79 5.4 Compare the sentence matching results given by Word Mover’s Dis- tance and Ordered Word Mover’s Distance...... 81 5.5 Extend the Siamese network architecture for sentence matching by feeding into the multi-scale representations of sentence pairs...... 84

6.1 The overall process of concept mining from user queries and query logs.
6.2 Example of concept tagging for documents in the feeds stream of Tencent QQ Browser.
6.3 The overall procedures of concept tagging for documents. We combine a matching-based approach with a scoring-based approach to handle different situations.
6.4 An example to show the extracted topic-concept-instance hierarchy.
6.5 The framework of feeds recommendation in Tencent QQ Browser.
6.6 Document topic classification.

7.1 An example to illustrate our Attention Ontology (AO) for user-centered text understanding.
7.2 Overview of our framework for constructing the Attention Ontology and performing different tasks.
7.3 An example to show the construction of a query-title interaction graph for attention mining.
7.4 Automatic construction of the training datasets for classifying the isA relationship between concepts and entities.
7.5 An example to show the constructed story tree given by our approach.
7.6 The click-through rates with/without extracted tags.
7.7 The click-through rates of different tags.

8.1 Questions are often asked by repeating some text chunks in the input passage, while there is great flexibility as to which chunks are repeated.
8.2 An example to show the syntactic structure of an input sentence. Clue words "White House" and "today" are close to the answer chunk "Barack Obama" with respect to the graph distance, though they are not close to each other in terms of the word order distance.
8.3 An example from the SQuAD dataset. Our task is to generate questions given an input passage and an answer. In the SQuAD dataset, answers are sub-spans of the passages.
8.4 Illustration of the overall architecture of our proposed model. It contains a GCN-based clue word predictor, a masked feature-rich encoder for input passages, and an attention-and-copy-based decoder for generating questions.
8.5 Comparing the rank distributions of all question words, words from generation, and words from copy.
8.6 Comparing the distributions of syntactic dependency distances and sequential word distances between copied words and the answer in each training sample.

9.1 Given the same input sentence, we can ask diverse questions based on our different choices about i) what is the target answer; ii) which answer-related chunk is utilized as clue; and iii) what type of question is asked.
9.2 An overview of the system architecture. It contains a dataset constructor, information sampler, ACS-aware question generator and a data filter.
9.3 The input representations we utilized for fine-tuning the GPT-2 Transformer-based language model.
9.4 The input joint distributions we get using the SQuAD1.1 training dataset as reference data.
9.5 Showcases of generated question-answer pairs by our system.

List of Tables

3.1 Features for the keyword classifier.
3.2 Features for document pair relationship classification.
3.3 Comparing different algorithms on the Chinese News Events (CNE) dataset.
3.4 Comparing different algorithms on Chinese News Events Subset 1.
3.5 Comparing different algorithms on Chinese News Events Subset 2.
3.6 Comparing different algorithms on the 20 Newsgroups dataset.
3.7 Comparing different story structure generation algorithms.

4.1 Description of evaluation datasets.
4.2 Accuracy and F1-score results of different algorithms on CNSE and CNSS datasets.

5.1 Description of evaluation datasets.
5.2 Pearson correlation results on different distance metrics.
5.3 Spearman's rank correlation results on different distance metrics.
5.4 A comparison among different supervised learning models in terms of accuracy, F1 score, Pearson's r and Spearman's ρ on various test sets.

6.1 Compare different algorithms for concept mining.
6.2 Evaluation results of the constructed taxonomy.
6.3 Part of the topic-concept-instance samples created by the ConcepT system.
6.4 Online A/B testing results.
6.5 The features we use for different tasks in ConcepT.
6.6 Examples of queries and the extracted concepts given by ConcepT.

7.1 Nodes in the attention ontology.
7.2 Edges in the attention ontology.
7.3 Showcases of concepts and the related categories and entities.
7.4 Showcases of events and the related categories, topics and involved entities.
7.5 Compare concept mining approaches.
7.6 Compare event mining approaches.
7.7 Compare event key elements recognition approaches.

8.1 Description of evaluation datasets.
8.2 Evaluation results of different models on the SQuAD dataset.
8.3 Evaluation results of different models on the NewsQA dataset.

9.1 Evaluation results of different models on the SQuAD dataset.
9.2 Human evaluation results about the quality of generated QA pairs.
9.3 Evaluate the performance of question answering with different training datasets.

Chapter 1

Introduction

1.1 Motivation

Building a machine that can understand human language and communicate with people is a long-standing dream of researchers. To realize this dream, there have been many studies on natural language processing and understanding, computational linguistics, and, more generally, artificial intelligence. In the early stage of natural language processing, researchers developed symbolic approaches and expert-designed rule-based systems to capture the meaning of text, but such approaches are unable to deal with unexpected inputs and too restrictive to capture the intricacy of natural language [Winograd, 1972; Ruder, 2019]. As experts cannot write down every possible rule for different NLP tasks, how to learn rules automatically becomes a key problem.

Statistical approaches were proposed in the last 20 years [Manning et al., 1999]. They learn rules automatically by combining statistical models with engineered features of text. However, feature engineering is a time-consuming job, as features are generally task-specific and require domain expertise. Therefore, the new challenge is how to learn features automatically from raw input text.

As a category of representation learning approaches, deep learning has achieved great success in the past seven years [Krizhevsky et al., 2012; Goodfellow et al., 2016; LeCun et al., 2015]. Deep neural network-based models automatically learn a multi-layered hierarchy of features from large amounts of data, which greatly reduces the need for feature engineering. However, current deep neural models require large amounts of data and computational resources. Besides, it is still difficult for deep learning to achieve satisfactory performance in NLP tasks that require reasoning. Reasoning is about making connections between things and forming inferences about the world. To understand and reason over text, we need to represent unstructured text with a simplified model and capture the connections between different text pieces.

Most existing approaches utilize Vector Space Models (VSMs) and represent words or text pieces as sparse one-hot vectors or dense encoding vectors, where the vector representations are either learned by statistical approaches or trained with deep neural networks. However, natural language text has rich linguistic and semantic structures that are hard for VSMs to capture. To exploit the underlying structure of text, graph representations of text, together with graph neural networks that learn over graphs, are promising directions for overcoming the limits of current natural language processing and understanding models.

Different graph approaches and representations have been proposed to connect text pieces and improve various NLP tasks. For example, word graphs use words as vertices and construct different types of edges based on syntactic analysis [Leskovec et al., 2004], co-occurrences [Zhang et al., 2018b; Rousseau and Vazirgiannis, 2013; Nikolentzos et al., 2017], and so on. There are also text graphs that use sentences, paragraphs or documents as vertices. They establish edges by word co-occurrence, location [Mihalcea and Tarau, 2004], text similarities [Putra and Tokunaga, 2017], or hyperlinks between documents [Page et al., 1999]. Besides, hybrid graphs [Rink et al., 2010; Baker and Ellsworth, 2017] consist of different types of vertices and edges.

Aside from text graphs that model the relationships between text pieces in a sentence, document or corpus, researchers also pay attention to constructing knowledge graphs that model relations in the world. Large-scale graph-structured knowledge bases (KBs) store factual information about entities and the relationships between them. Until now, a large number of concept and knowledge graphs have been created, including YAGO [Suchanek et al., 2007], DBpedia [Lehmann et al., 2015], Probase [Wu et al., 2012], Freebase [Bollacker et al., 2008], NELL [Carlson et al., 2010], Google Knowledge Graph [Singhal, 2012] and so on. They contain millions of nodes and billions of edges. With these graphs, we can better understand both short queries and long documents, and link text with real-world entities and concepts.

In this dissertation, we argue that graph-based approaches can greatly benefit different natural language tasks by explicitly introducing relations between text pieces and reforming unstructured text into structured representations. Our work focuses on solving NLP tasks with structured representations and approaches. To this end, we develop novel models for a variety of tasks, demonstrate that our models outperform existing methods, and deploy our models in real-world applications.

Figure 1.1: The framework of the components in this thesis. (The figure maps the components, including sentence and document matching, event discovery, story organization, user interest mining, query understanding, document tagging, question generation, and question answering, to Chapters 3 through 9.)

1.2 User and Text Understanding: a Graph Approach

In this thesis, we focus on various natural language processing and text mining tasks that aim to understand users and natural language. Figure 1.1 illustrates the main topics covered in our work, as well as the relationships between them. To improve user and text understanding, our work investigates a variety of research problems on three topics: information organization, information recommendation, and reading comprehension.

Information organization aims to cluster and organize information, such as news articles, to help users retrieve and track useful, non-redundant information easily in the era of information explosion. Our work mainly focuses on how to cluster documents into fine-grained events, as well as how to organize related events into structured stories that show their connections. Furthermore, we investigate the problems of document matching and sentence matching, which are core problems in fine-grained information organization and many other NLP applications.

Information recommendation aims to infer a user's interests based on his/her historical behaviors and recommend information that the user may be interested in.
Figure 1.2: The principled methodology used through the different tasks in the thesis.

In our work, we focus on mining user interests/attentions from user search click graphs and creating a user interest ontology to reveal their relationships. We further perform query understanding and document tagging based on the constructed user interest ontology.

Machine reading comprehension is a core research problem for text understanding. A core challenge in machine reading comprehension is that the creation of high-quality training datasets is quite expensive and time-consuming, and requires a lot of human effort. To solve this challenge, we investigate the problem of question generation to automatically generate a large amount of high-quality question-answer pairs from unlabeled corpora. We also analyze the performance of question answering models trained with our generated datasets.

Although this thesis discusses different NLP and text mining tasks, we extensively exploit a unified methodology to analyze and solve them. Figure 1.2 shows the principled methodology used through the different tasks in the thesis. Given a problem, our first step is to understand the problem and be clear about the inputs and outputs. Second, we consider what is the most appropriate semantic granularity, i.e., word, sentence, document, or corpus, to characterize the specific problem. Third, based on this consideration, we design or select a suitable graph representation to represent our input data. The key is designing appropriate nodes and edges that are useful in solving the problem. The extraction of node features, edge features, as well as graph attributes, is also critical. Fourth, we reformulate our problem based on the graph representation. For example, we can reformulate the problem of sentence matching as tree matching, or the problem of phrase mining as node selection and ordering. After designing the graph representation and reformulating the problem, we come up with strategies to process the raw input data and construct the graphs. Finally, we design models and algorithms for the given problem based on our graph representation.

Take the task of document matching as an example. First, the input is a pair of documents, and the output is a score or a label indicating the semantic relevance or relationship between the two documents. Second, as a trade-off between performance and computational speed, it is most suitable to factorize a document into a set of sentences and compare the document pair at the granularity of sentences. Third, we can group sentences by their sub-topics, so that a node is a set of correlated sentences discussing the same sub-topic. To show how closely these different sub-topics are related, we can measure their relevance by the text similarities between the sentence sets, and use these similarities as edge features or weights. Fourth, after constructing such a document graph, the problem of text matching turns into a local matching problem on different nodes, and a graph scoring/classification problem based on the local matching results. We then implement specific strategies to turn raw article pairs into a document graph, and design models to estimate the relevance between two articles with our graph representation. A minimal code sketch of this graph construction follows.

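The sketch below illustrates the document-pair graph construction just described. It is a simplified illustration under assumptions, not the exact Concept Interaction Graph implementation of Chapter 4: keyword extraction, community detection, and the downstream GCN-based scoring are omitted, and the `concepts` keyword list is a hypothetical input.

```python
# Simplified sketch: build a "document graph" for a pair of documents.
# Assumes the concept keywords are given; the real system extracts them.
import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_doc_pair_graph(sentences_a, sentences_b, concepts, min_sim=0.1):
    g = nx.Graph()
    # Each node groups the sentences (from both documents) about one concept.
    for c in concepts:
        text_a = " ".join(s for s in sentences_a if c in s)
        text_b = " ".join(s for s in sentences_b if c in s)
        if text_a or text_b:
            g.add_node(c, text_a=text_a, text_b=text_b)
    # Edge weights are TF-IDF cosine similarities between node texts.
    nodes = list(g.nodes)
    merged = [g.nodes[n]["text_a"] + " " + g.nodes[n]["text_b"] for n in nodes]
    sims = cosine_similarity(TfidfVectorizer().fit_transform(merged))
    for i, j in itertools.combinations(range(len(nodes)), 2):
        if sims[i, j] >= min_sim:  # prune weak edges; threshold is arbitrary
            g.add_edge(nodes[i], nodes[j], weight=float(sims[i, j]))
    return g
```

Local matching then compares `text_a` against `text_b` on each node, and the per-node results are aggregated over the weighted graph to produce the final relevance score.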
1.3 Contributions

We make extensive contributions to a variety of NLP applications throughout this thesis. Figure 1.3 shows an overview of the works described in this thesis. Our work combines graphical representations of data with machine learning to fully utilize the structural and relational information in different datasets and improve the performance of a variety of tasks. In these tasks, the problems are modeled as tree or graph matching (sentence or document matching), community detection (event/story discovery), node classification and ordering (phrase mining), relationship identification (ontology creation), or node selection (question generation). Our work extensively focuses on natural language processing and text mining based on graph-structured representations, and demonstrates the effectiveness of exploiting
the structural information in diverse tasks. Furthermore, our work shows a unified framework to model text data as various graphs and solve the problems in terms of graph computation.

Figure 1.3: An overview of our works based on graph-structured representations. For each task, the figure summarizes the input, graph representation, nodes, edges, graph operations, and output:

• Sentence matching: two sentences; semantic compositional matching trees (transformed AMR) obtained by semantic parsing; nodes are semantic units (words or phrases) at different granularities; edges are compositional relationships; the operation is tree matching; the output is a relevance score (weighted or binary).

• Document matching: two documents; concept interaction graphs; nodes are concepts (a keyword or keyword set) with related sentences; edges are similarities; the operation is graph matching; the output is a relevance score (weighted or binary).

• Event/story discovery: documents; a keyword graph (keywords linked by co-occurrence), a document graph (documents linked by similarities), and a story tree (events linked by similarities); the operations are community detection and growing trees; the output is tree-structured stories.

• User attention mining: queries and document titles; a query-title interaction graph; nodes are words or phrases; edges are sequential relationships, syntactical dependencies, and correlations; the operations are node classification and node ordering; the output is phrases.

• User attention ontology creation: user search click graphs; the user attention ontology; nodes are categories, topic phrases, event phrases, concept phrases, and entities; edges are isA, involve, and correlate relationships; the operation is relationship extraction or identification; the output is the user attention ontology.

• Question generation: sentences or question-answer pairs; a dependency tree; nodes are words; edges are syntactical dependencies; the operations are node selection and node classification; the output is questions.

More specifically, for information organization, our contributions include the following:

• We propose the Story Forest system for news article clustering and organization. In this system, we explore a tree-of-events representation that organizes news events into logically well-organized trees for a better user experience. Besides, we propose the EventX algorithm to extract fine-grained events and stories from massive news articles.

• The task of document matching is critical to the Story Forest system. Therefore, we formulate the task of long document matching and apply the divide-and-conquer philosophy to matching a pair of long documents. For long document matching, we propose the so-called Concept Interaction Graph (CIG) to represent one or a pair of documents as a weighted graph of concepts, and combine this representation with graph convolutional networks to classify the relationships between two documents.

• Our Concept Interaction Graph turns the problem of long document matching into short text matching over vertices. We present a technique named Hierarchical Sentence Factorization (or Sentence Factorization in short), which is able to represent a sentence as a hierarchical semantic tree. Based on this representation, we further propose the Ordered Word Mover's Distance for unsupervised sentence matching, and extend existing Siamese network architectures to multi-scale models.

For information recommendation, to help understand user interests and document contents, we make the following contributions:

• We design and implement a concept and event mining system which extracts large-scale user-centered concepts and hot topics/events from vast query logs to model user interests and improve query and document understanding. This system constructs and maintains a graph-structured user interest ontology to depict user interests and text topics at different granularities. The nodes in the taxonomy can be tagged to short queries or long documents to improve understanding. The edges in the taxonomy also help with reasoning and inference over different user interests.

• We implemented and deployed multiple systems (Story Forest, ConcepT, and GIANT) in Tencent QQ Browser. The systems serve billions of users of Tencent QQ Browser and other applications such as WeChat.

We shall note that the techniques proposed in this thesis are general and can be easily adapted to other languages and other products. They do not rely on any specific features that only Tencent can provide. To adapt our approaches to another language or application, the changes we need to make mostly concern data sources, computational and service platforms, off-the-shelf tools for data preprocessing, and hyper-parameters for algorithms.

For machine reading comprehension, we propose efficient systems to generate high-quality training datasets from unlabeled corpora. Specifically:

• We propose a novel Answer-Clue-Style-aware question generation system which generates questions based on both given answers and predicted/sampled clue words. This helps to alleviate the one-to-many mapping problem in text generation. To predict the potential clue words in the input, we design a novel clue prediction model that combines the syntactic dependency tree of an input with Graph Convolutional Networks; a minimal sketch of such a graph convolution is given below. To generate large-scale and high-quality questions, we also propose efficient sampling strategies to sample answers, clues and question types from unlabeled text pieces.
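As referenced in the bullet above, the following sketch shows a generic graph-convolution layer of the kind used to propagate word states along dependency edges. It follows the standard Kipf-and-Welling propagation rule and is an illustrative assumption rather than the exact architecture of Chapter 8; the dimensions and the final clue classifier are hypothetical.

```python
# A generic GCN layer over a sentence's dependency tree (sketch, not the
# exact Chapter 8 model). h: (n, in_dim) word states; adj: (n, n) 0/1
# adjacency matrix of the dependency tree.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a = adj + torch.eye(adj.size(0))   # add self-loops
        d = a.sum(dim=1).pow(-0.5)         # D^{-1/2}
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        return torch.relu(self.linear(a_norm @ h))

# After one or more such layers, a per-word head (e.g., nn.Linear(out_dim, 2))
# would score each token as clue / not-clue.
```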
Finally, we open-source our code and the newly created datasets of our works for research purposes.1

1.4 Thesis Outline

Based on the applications of our works, we can divide this thesis into three parts: text clustering and matching for information organization, text mining for information recommendation, and text generation for reading comprehension.

In Chapter 2, we review prior research related to this thesis. Specifically, we introduce the prior works related to the problems we discuss in information organization, information recommendation, as well as reading comprehension.

In Chapter 3, we focus on the problem of fine-grained event clustering and organization for massive breaking news. In this work, we first propose the design of the Story Forest system for online news article organization and visualization, as well as the key EventX algorithm for fine-grained event clustering. We then describe our created Chinese News Events dataset for evaluating our event extraction algorithms. Based on this dataset, we compare our algorithms with existing approaches and discuss the experimental results.

In Chapters 4 and 5, we present our works on long document matching and short sentence matching, respectively. For document matching, we propose the Concept Interaction Graph (CIG) to represent an article as a graph of concepts. We then present a model which combines CIG with graph convolutional networks for semantic matching between a pair of documents. We have created two datasets, each consisting of about 30K pairs of breaking news articles covering diverse topics in the open domain, to evaluate our algorithms. For sentence matching, we propose Hierarchical Sentence Factorization, a technique that is able to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a "predicate-argument" form. Based on this technique, we further propose an unsupervised semantic distance metric, as well as multi-scale deep learning models for semantic matching of natural language sentences. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, and show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models.

In Chapters 6 and 7, we describe our experience of implementing and deploying ConcepT and GIANT in Tencent QQ Browser. They are concept/event mining systems which discover user-centered concepts and hot topics/events, at the right granularity conforming to user interests, from vast amounts of queries and search logs. We present our techniques for concept mining, document tagging, taxonomy construction, and various applications. Besides, we introduce our experience in deploying ConcepT and GIANT in real-world applications, and show their superior ability in user interest modeling as well as query and document understanding.

In Chapter 8, we propose the Clue Guided Copy Network for Question Generation (CGC-QG), which is a sequence-to-sequence generative model with a copying mechanism, yet employing a variety of novel components and techniques to boost the performance of question generation. We first introduce the problem of one-to-many mapping in text generation. After that, we introduce the concept of clue words for question generation, and propose to predict the clue words in context to alleviate the problem of one-to-many mapping. We design a clue predictor by combining the syntactic structure of sentences with graph convolutional networks. Our model jointly trains the clue prediction as well as question generation with multi-task learning and a number of practical strategies to reduce the complexity. The proposed new modules and strategies significantly improve the performance of question generation. We further propose Answer-Clue-Style-aware Question Generation (ACS-QG) in Chapter 9, a novel system aimed at automatically generating diverse and high-quality question-answer pairs from unlabeled text corpora at scale by mimicking the way a human asks questions. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences in Wikipedia.

We conclude and provide potential future directions in Chapter 10.

1 https://github.com/BangLiu/

Chapter 2

Related Work

Before we describe our approaches for different user and text understanding tasks, let us first set the context by describing prior work in this space. Our work is related to several lines of research within the NLP and text mining community: text clustering; story structure generation; text matching; ontology creation; phrase mining; question generation; and so on.

2.1 Information Organization

There are mainly four research lines that are highly related to our work on information organization: text clustering, story structure generation, text matching, and graphical document representation.

2.1.1 Text Clustering

The problem of text clustering has been well studied by researchers [Aggarwal and Zhai, 2012]. Distance-based clustering algorithms measure the closeness between text pieces with similarity functions such as cosine similarity. Various representations, such as TF-IDF and BM25 term weighting [Büttcher et al., 2006], can be utilized to represent a text object. After transforming text into features, different strategies can be applied for clustering. Partition-based algorithms such as K-means [Jain, 2010] or K-medoids [Park and Jun, 2009] divide the corpus into a pre-defined number of clusters. The Spherical K-means algorithm [Buchta et al., 2012] is especially suitable for text clustering due to its low memory and computational requirements. However, such algorithms are sensitive to variations in parameter values and require the number of clusters to be specified. The selection of features also plays a key role in the final performance of clustering [Liu et al., 2005]. Hierarchical algorithms [Fung et al., 2003] recursively find nested clusters and create a tree-like structure, but they still need to assume the number of clusters or a similarity threshold. Density-based algorithms [Ester et al., 1996] do not need to specify the number of clusters in advance, but they do not scale well to high-dimensional sparse data like text [Jain, 2010]. A minimal sketch of the distance-based pipeline appears at the end of this subsection.

Word- and phrase-based algorithms find important clusters of words or phrases. [Beil et al., 2002] clusters documents based on frequent pattern mining. [Slonim and Tishby, 2000] proposes a two-phase clustering procedure that finds word-clusters such that most of the mutual information between words and documents is preserved, and leverages the word-clusters to perform document clustering. Co-clustering algorithms [Dhillon et al., 2003] simultaneously cluster words and documents, as the problems of clustering words and clustering documents are closely related. There are also approaches which utilize document keyword co-occurrence information to construct a keyword graph, and cluster documents by applying community detection techniques on the keyword graph [Sayyadi and Raschid, 2013].

Non-negative Matrix Factorization is particularly suitable for clustering as a latent space method [Xu et al., 2003]. It factorizes a term-document matrix, where the vectors in the basis system directly correspond to different clusters. It has been shown that matrix factorization is equivalent to spectral clustering [Ding et al., 2005]. Probabilistic model-based algorithms aim to create a probabilistic generative model for text documents. Topic models such as Latent Dirichlet Allocation (LDA) [Blei et al., 2003] and Probabilistic Latent Semantic Indexing (PLSA) [Hofmann, 1999] assume documents are generated by multiple topics. The Gaussian Mixture Model (GMM) [He et al., 2011] assumes that data points are generated by a mixture of Gaussian distributions. However, such model-based algorithms are computationally intensive and do not produce satisfying results when clustering at a finer granularity.

There are also some works concerning the events described in text objects. [Tanev et al., 2008] presents a news event extraction system to extract violent and disaster events from online news. [Ritter et al., 2012] proposes a system to extract an open-domain calendar of significant events from Twitter. In contrast, our EventX algorithm is specially tailored for event extraction from news documents in the open domain: news articles are relatively long compared to tweets, and the types of events are not restricted to violent and disaster events.

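As a concrete illustration of the distance-based pipeline described above, the sketch below clusters a toy corpus with TF-IDF features and K-means; L2-normalizing the rows makes Euclidean K-means approximate spherical K-means, since squared Euclidean distance between unit vectors is monotone in cosine similarity. The toy documents and cluster count are assumptions made only for this example.

```python
# TF-IDF + K-means sketch of the distance-based clustering described above.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["stock markets fell sharply today",
        "the team won the cup final",
        "shares dropped amid trade fears",
        "the striker scored twice in the final"]
x = normalize(TfidfVectorizer().fit_transform(docs))  # unit-length rows
labels = KMeans(n_clusters=2, n_init=10).fit_predict(x)
print(labels)  # e.g., [0 1 0 1]: finance vs. sports
```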
2.1.2 Story Structure Generation

Topic Detection and Tracking (TDT) research spots news events, groups them by topics, and tracks previously spotted news events by attaching related new events to the same cluster [Allan et al., 1998; Allan, 2012; Yang et al., 2009; Sayyadi and Raschid, 2013]. However, the associations between related events are not defined or interpreted by TDT techniques. To help users capture the developing structure of events, different approaches have been proposed. [Nallapati et al., 2004] proposed the concept of Event Threading, and tried a series of strategies based on similarity measures to capture the dependencies among events. [Yang et al., 2009] combines the similarity measure between events, temporal sequence and distance between events, and document distribution along the timeline to score the relationship between events, and models the event evolution structure by a directed acyclic graph (DAG). [Mei and Zhai, 2005] discover and summarize the evolutionary patterns of themes in a text stream by first generating word clusters for each time period and then using the Kullback-Leibler divergence measure to discover coherent themes over time.

The above research works measure and model the relationships between events in a pairwise manner; however, the overall story consistency is not considered. [Wang et al., 2012] generates story summarization from text and image data by constructing a multi-view graph and solving a dominating set problem, but it omits the consistency of each storyline. The Metro Map model proposed in [Shahaf et al., 2013] defines metrics such as coherence and diversity for story quality evaluation, and identifies lines of documents by solving an optimization problem that maximizes the topic diversity of storylines while guaranteeing the coherence of each storyline. [Xu et al., 2013] further summarize documents with key images and sentences, and then extract story lines with different definitions of coherence and diversity. These works treat the problem of discovering story development structure as an optimization problem over a given news corpus. However, new documents are generated all the time, and systems that are able to catch related news and update story structures in an online manner are desired.

As studies based on unsupervised clustering techniques [Yan et al., 2011] perform poorly in distinguishing storylines with overlapping events [Hua et al., 2016], more recent works introduce different Bayesian models to generate storylines. However, they often ignore the intrinsic structure of a story [Huang and Huang, 2013] or fail to properly model the hidden relations [Zhou et al., 2015]. [Hua et al., 2016] proposes a hierarchical Bayesian model for storyline generation, and utilizes Twitter hashtags to "supervise" the generation process. However, the Gibbs sampling inference of the model is time-consuming, and such Twitter data is not always available for every news story.

2.1.3 Text Matching

The task of text matching has been extensively studied for a long time. In recent years, different neural network architectures have been proposed for text pair matching tasks. Representation-focused models usually transform text pairs into context representation vectors through a Siamese neural network, followed by a fully connected network or score function which gives the matching result based on the context vectors [Qiu and Huang, 2015; Wan et al., 2016; Liu et al., 2018a; Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015]. Interaction-focused models extract the features of all pair-wise interactions between words in text pairs, and aggregate the interaction features by deep networks to give a matching result [Hu et al., 2014; Pang et al., 2016]. However, the intrinsic structural properties of long text documents are not fully utilized by these neural models; therefore, they cannot achieve good performance on long text pair matching. Here we review related unsupervised and supervised models for text matching.

Traditional unsupervised metrics for document representation include bag of words (BOW), term frequency-inverse document frequency (TF-IDF) [Wu et al., 2008], and the Okapi BM25 score [Robertson and Walker, 1994]. However, these representations cannot capture the semantic distance between individual words. Topic modeling approaches such as Latent Semantic Indexing (LSI) [Deerwester et al., 1990] and Latent Dirichlet Allocation (LDA) [Blei et al., 2003] attempt to circumvent the problem by learning a latent representation of documents. But when applied to semantic-distance-based tasks such as text-pair similarity estimation, these algorithms usually cannot achieve good performance.

Learning distributional representations for words, sentences or documents with deep learning models has been popular recently. Word2Vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014] are two high-quality word embeddings that have been extensively used in many NLP tasks. Based on word vector representations, the Word Mover's Distance (WMD) [Kusner et al., 2015] algorithm measures the dissimilarity between two sentences (or documents) as the minimum distance that the embedded words of one sentence need to "travel" to reach the embedded words of another sentence (a small usage sketch is given at the end of this subsection). However, when applying these approaches to sentence pair matching tasks, the interactions between sentence pairs are omitted, and the ordered, hierarchical structure of natural language is not considered.

Different neural network architectures have been proposed for sentence pair matching tasks. Models based on Siamese architectures [Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Neculoiu et al., 2016; Baudiš et al., 2016] usually transform the sequences of text pairs into context representation vectors through a multi-layer Long Short-Term Memory (LSTM) [Sundermeyer et al., 2012] network or Convolutional Neural Networks (CNN) [Krizhevsky et al., 2012], followed by a fully connected network or score function which gives the similarity score or classification label based on the context representation vectors. However, Siamese models defer the interaction between two sentences until the hidden representation layer, and therefore may lose details of sentence pairs that matter for matching tasks [Hu et al., 2014]. Aside from Siamese architectures, [Wang et al., 2017b] introduced a matching layer into the Siamese network to compare the contextual embedding of one sentence with another. [Hu et al., 2014; Pang et al., 2016] proposed convolutional matching models that consider all pair-wise interactions between words in sentence pairs. [He and Lin, 2016] propose to explicitly model pairwise word interactions with a pairwise word interaction similarity cube and a similarity focus layer to identify important word interactions.

There are also research works which utilize knowledge [Wu et al., 2018], hierarchical properties [Jiang et al., 2019] or graph structures [Nikolentzos et al., 2017; Paul et al., 2016] for long text matching. In contrast, our method represents documents by a novel graph representation and combines the representation with GCNs. Finally, pre-trained models such as BERT [Devlin et al., 2018] can also be utilized for text matching. However, such models are of high complexity and can hardly satisfy the speed requirements of real-world applications.

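The sketch below shows how the Word Mover's Distance described above can be computed with gensim's built-in wmdistance. The embedding file path is a placeholder assumption; any pretrained word2vec-format vectors (plus gensim's optimal-transport dependency) would serve.

```python
# Word Mover's Distance between two short texts (sketch).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path
s1 = "obama speaks to the media in illinois".split()
s2 = "the president greets the press in chicago".split()
# Minimum cumulative distance the embedded words of s1 must "travel"
# to reach the embedded words of s2 (an optimal-transport problem).
print(kv.wmdistance(s1, s2))
```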
2.1.4 Graphical Document Representation

A variety of graph representations have been proposed for document modeling. Based on the different types of graph nodes, a majority of existing works can be generalized into four categories: word graphs, text graphs, concept graphs, and hybrid graphs.

For word graphs, the graph nodes represent different non-stop words in a document. [Leskovec et al., 2004] extracts subject-predicate-object triples from text based on syntactic analysis, and merges them to form a directed graph. The graph is further normalized by utilizing WordNet [Miller, 1995] to merge triples belonging to the same semantic pattern. [Rousseau and Vazirgiannis, 2013; Rousseau et al., 2015] represent a document as a graph-of-word, where nodes represent unique terms and directed edges represent co-occurrences between the terms within a fixed-size sliding window (see the sketch at the end of this subsection). [Wang et al., 2011] connect terms with syntactic dependencies. [Schenker et al., 2003] connects two words by a directed edge if one word immediately precedes another in the document title, body, or link; the edges are categorized by these three types of linking.

Text graphs use sentences, paragraphs or documents as vertices, and establish edges by word co-occurrence, location, or text similarities. [Balinsky et al., 2011; Mihalcea and Tarau, 2004; Erkan and Radev, 2004] connect sentences if they are near each other, share at least one common keyword, or their similarity is above a threshold. [Page et al., 1999] connects web documents by hyperlinks. [Putra and Tokunaga, 2017] constructs directed weighted graphs of sentences for evaluating text coherence, using sentence similarities as weights and connecting sentences under various constraints on similarity or location.

Concept graphs link terms in a document to real-world entities or concepts based on resources such as DBpedia [Auer et al., 2007], WordNet [Miller, 1995], VerbNet [Schuler, 2005] and so forth. [Schuhmacher and Ponzetto, 2014] identifies the set of concepts contained in a document using DBpedia. Using these concepts as initial seeds, it performs a depth-first search along the DBpedia graph with a maximum depth of two, and adds all outgoing relational edges and concepts along the paths to form a semantic graph. [Hensman, 2004] identifies the semantic roles in a sentence using WordNet and VerbNet, and combines these semantic roles with a set of syntactic/semantic rules to construct a concept graph.

Hybrid graphs consist of different types of vertices and edges. [Rink et al., 2010] builds a graph representation of sentences that encodes lexical, syntactic, and semantic relations. [Jiang et al., 2010] extract tokens, syntactic structure nodes, part-of-speech nodes, and semantic nodes from each sentence, and link them by different types of edges that represent different relationships. [Baker and Ellsworth, 2017] combines Frame and Construction Grammar to construct a Frame Semantic Graph of a sentence.

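Below is a minimal sketch of the graph-of-word construction mentioned above, with unique terms as nodes and edges between terms co-occurring in a fixed-size sliding window; it uses undirected edges for simplicity, whereas the cited works use directed edges.

```python
# Graph-of-word sketch: nodes are unique terms; edges connect terms that
# co-occur within a sliding window of the given size.
import networkx as nx

def graph_of_word(tokens, window=3):
    g = nx.Graph()
    g.add_nodes_from(set(tokens))
    for i, w in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != w:
                g.add_edge(w, other)
    return g

g = graph_of_word("graphs capture structure that bags of words ignore".split())
print(g.number_of_nodes(), g.number_of_edges())
```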
2.2 Information Recommendation

Our work is mainly related to the following research lines.

2.2.1 Concept Mining

Existing approaches to concept mining are closely related to research on named entity recognition [Nadeau and Sekine, 2007; Ritter et al., 2011; Lample et al., 2016], term recognition [Frantzi et al., 2000; Park et al., 2002; Zhang et al., 2008], keyphrase extraction [Witten et al., 2005; El-Kishky et al., 2014] and quality phrase mining [Liu et al., 2015; Shang et al., 2018; Liu et al., 2019c]. Traditional algorithms utilize pre-defined part-of-speech (POS) templates and dependency parsing to identify noun phrases as term candidates [Koo et al., 2008; Shang et al., 2018]. Supervised noun phrase chunking techniques [Chen and Chen, 1994; Punyakanok and Roth, 2001] automatically learn rules for identifying noun phrase boundaries. There are also methods that utilize resources such as knowledge graphs to further enhance the precision [Witten and Medelyan, 2006; Ren et al., 2017]. Data-driven approaches do not rely on complex linguistic features or rules. Instead, they make use of frequency statistics in the corpus to generate candidate terms and evaluate their quality [Parameswaran et al., 2010; El-Kishky et al., 2014; Liu et al., 2015]. Phrase quality-based approaches exploit statistical features to measure phrase quality, and learn a quality scoring function by using knowledge base entity names as training labels [Liu et al., 2015; Shang et al., 2018]. Neural network-based approaches treat the problem as sequence tagging, and utilize large-scale labeled training data to train complex deep neural models based on CNNs or LSTM-CRFs [Huang et al., 2015].

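As a small illustration of the template-style candidate generation described above, the sketch below uses spaCy's built-in noun-chunk detector in place of hand-written POS templates. It only produces candidate phrases; the quality scoring or sequence tagging discussed above would be separate steps.

```python
# Noun-phrase candidate extraction sketch (stand-in for POS templates).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Deep neural networks learn hierarchical features from large amounts of data.")
candidates = [chunk.text.lower() for chunk in doc.noun_chunks]
print(candidates)  # e.g., ['deep neural networks', 'hierarchical features', ...]
```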
2.2.2 Event Extraction

Existing research on event extraction aims to identify different types of event triggers and their arguments from unstructured text data. These works combine supervised or semi-supervised learning with features derived from training data to classify event types, triggers and arguments [Ji and Grishman, 2008; Chen et al., 2017; Liu et al., 2016b; Nguyen et al., 2016; Huang and Riloff, 2012]. However, these approaches cannot be applied to new types of events without additional annotation effort. The ACE2005 corpus [Grishman et al., 2005] includes event annotations for 33 types of events. However, such a small hand-labeled dataset is hardly sufficient to train a model to extract the potentially thousands of event types arising in real-world scenarios. There are also works using neural networks such as RNNs [Nguyen et al., 2016; Sha et al., 2018], CNNs [Chen et al., 2015; Nguyen and Grishman, 2016] or GCNs [Liu et al., 2018b] to extract events from text. Open domain event extraction [Valenzuela-Escárcega et al., 2015; Ritter et al., 2012] extracts news-worthy clusters of words, segments and frames from data such as Twitter [Atefeh and Khreich, 2015], usually under unsupervised or semi-supervised settings, and exploits information redundancy.

2.2.3 Relation Extraction

Relation Extraction (RE) identifies the relationships between different elements such as concepts and entities. A comprehensive introduction can be found in [Pawar et al., 2017]. Most existing techniques for relation extraction fall into the following classes. First, supervised learning techniques, such as feature-based [GuoDong et al., 2005] and kernel-based [Culotta and Sorensen, 2004] approaches, require entity pairs labeled with one of the pre-defined relation types as the training dataset. Second, semi-supervised approaches, including bootstrapping [Brin, 1998], active learning [Liu et al., 2016a; Settles, 2009] and label propagation [Chen et al., 2006], exploit unlabeled data to reduce the manual effort of creating large-scale labeled datasets. Third, unsupervised methods [Yan et al., 2009] utilize techniques such as clustering and named entity recognition to discover relationships between entities. Fourth, Open Information Extraction [Fader et al., 2011] constructs comprehensive systems to automatically discover possible relations of interest from text corpora. Last, distant supervision based techniques leverage pre-existing structured or semi-structured data or knowledge to guide the extraction process [Zeng et al., 2015; Smirnova and Cudré-Mauroux, 2018].

2.2.4 Taxonomy and Knowledge Base Construction

Most existing taxonomies or knowledge bases, such as Probase [Wu et al., 2012], DBpedia [Lehmann et al., 2015] and YAGO [Suchanek et al., 2007], extract concepts and construct graphs or taxonomies based on Wikipedia or formal documents. To construct domain-specific taxonomies or knowledge bases, existing approaches usually select a text corpus as input, and then extract ontological relationships from that corpus [Poon and Domingos, 2010; Navigli et al., 2011; Zhang et al., 2018a; De Sa et al., 2016]. There are also works that construct a taxonomy from keywords [Liu et al., 2012]. [Liu et al., 2019c] constructs a three-layered taxonomy from search logs.

2.2.5 Text Conceptualization

Conceptualization seeks to map a word or a phrase to a set of concepts as a mechanism for understanding short text such as search queries. Since short texts usually lack context, conceptualization helps make better sense of text data by extending the text with categorical or topical information, and therefore facilitates many applications. [Li et al., 2007] performs query expansion by utilizing Wikipedia as an external corpus to better understand queries and improve ad-hoc retrieval performance. [Song et al., 2011] groups instances by their conceptual similarity, and develops a Bayesian inference mechanism to conceptualize each group. To make further use of context information, [Wang et al., 2015b] utilizes a knowledge base that maps instances to their concepts, and builds a knowledge base that maps non-instance words, including verbs and adjectives, to concepts.

2.3 Reading Comprehension

Our work on reading comprehension mainly focuses on generating question-answer pairs for machine reading comprehension. In this section, we review related works on question generation and the techniques we utilize.

Rule-Based Question Generation. Rule-based approaches rely on well-designed rules manually created by humans to transform a given text into questions [Heilman and Smith, 2010; Heilman, 2011; Chali and Hasan, 2015]. The major steps are preprocessing the given text to choose targets to ask about, and generating questions based on rules or templates [Sun et al., 2018]. However, these approaches require rules and templates created by experts, which is extremely expensive. Moreover, rules and templates lack diversity and are hard to generalize to different domains.

Answer-Aware Question Generation. Neural question generation models are trained end-to-end and do not rely on hand-crafted rules or templates. The problem is usually formulated as answer-aware question generation, where the position of the answer is provided as input. Most models take advantage of the encoder-decoder framework with attention mechanisms [Serban et al., 2016; Du et al., 2017; Liu et al., 2019b; Zhou et al., 2017; Song et al., 2018a; Hu et al., 2018; Du and Cardie, 2018]. Different approaches incorporate the answer information into the generation model through different strategies, such as an answer position indicator [Zhou et al., 2017; Liu et al., 2019b], separate answer encoding [Kim et al., 2019], or embedding the relative distance between the context words and the answer [Sun et al., 2018]. However, even with context and answer information as input, question generation remains a one-to-many mapping problem, as we can ask different questions given the same input.

Auxiliary-Information-Enhanced Question Generation. To improve the quality of generated questions, researchers have tried to feed the encoder with extra information. [Gao et al., 2018] aims to generate questions at different difficulty levels. It learns a difficulty estimator to obtain training data, and feeds difficulty as input into the generation model. [Krishna and Iyyer, 2019] learns to generate “general” or “specific” questions about a document, utilizing templates and training a classifier to obtain question type labels for existing datasets. [Hu et al., 2018] identifies the content shared by a given question and answer pair as an aspect, and learns an aspect-based question generation model. [Gupta et al., 2019] incorporates knowledge base information to ask questions. Compared with these works, our work does not require extra labeling or training overhead to obtain the training dataset. Besides, our settings for question generation dramatically reduce the difficulty of the task and achieve much better performance.

Multi-task Question Generation. Another strategy is enhancing question generation models with correlated tasks. Joint training of question generation and answering models has improved the performance of both individual tasks [Tang et al., 2017; Tang et al., 2018; Wang et al., 2017a; Sachan and Xing, 2018]. [Liu et al., 2019b] jointly predicts the words in the input that are related to the aspect of the target output question and will be copied into the question. [Zhou et al., 2019b] predicts the question type based on the input answer and context. [Zhou et al., 2019a] incorporates a language modeling task to help question generation. [Zhang and Bansal, 2019] utilizes question paraphrasing and question answering tasks to regularize the QG model to generate semantically valid questions.

Graph Convolutional Networks. Graph Convolutional Networks generalize Convolutional Neural Networks to graph-structured data, and have grown rapidly in scope and popularity in recent years [Kipf and Welling, 2016; Defferrard et al., 2016; Liu et al., 2018a; Marcheggiani and Titov, 2017; Battaglia et al., 2018]. Here we focus on the applications of GCNs to natural language. [Marcheggiani and Titov, 2017] applies GCNs over syntactic dependency trees as sentence encoders, and produces latent feature representations of words in a sentence for semantic role labeling. [Liu et al., 2018a] matches long document pairs using graph structures, and classifies the relationship of two documents with a GCN. [Zhang et al., 2018c] proposes an extension of graph convolutional networks that is tailored for relation extraction; it pools information over dependency trees efficiently in parallel.

Sequence-to-Sequence Models. The sequence-to-sequence model has been widely used in natural language generation. [Sutskever et al., 2014] proposes a sequence-to-sequence model for machine translation. [Bahdanau et al., 2014] further improves machine translation performance by introducing an attention mechanism into the sequence-to-sequence model. To deal with the out-of-vocabulary issue, the copy mechanism has been incorporated into sequence-to-sequence models to copy words from the source text [Cao et al., 2017; Gu et al., 2016]. In our work, we apply the copy mechanism to learn to copy potential clue chunks from the input text, instead of restricting it to out-of-vocabulary words.

Pretrained Language Models. Pre-trained large-scale language models, such as BERT [Devlin et al., 2018] and GPT2 [Radford et al., 2019], have dramatically improved performance over a series of NLP tasks [Sun et al., 2019; Yang et al., 2019; Lample and Conneau, 2019]. These pre-trained language models have been shown to capture many facets of language relevant to downstream tasks [Clark et al., 2019]. As sequence-to-sequence models often output sentences that contain repeated words, we also fine-tuned a GPT2-based question generation model to avoid this problem.

Part I

Text Clustering and Matching: Growing Story Trees to Solve Information Explosion

(Overview of Part I: event discovery and story organization, document matching, and sentence matching.)

In the era of information explosion, it is not easy for users to retrieve and track the high-quality, well-organized, and non-redundant information that they are interested in from a huge number of sources. In Chapter 3, we introduce our Story Forest system for intelligent news article organization. The system contains a set of online schemes that automatically cluster streaming documents into events, while connecting related events in growing trees to tell evolving stories. A core novelty of our Story Forest system is EventX, a semi-supervised scheme to extract events from massive Internet news corpora.

EventX relies on a two-layered, graph-based clustering procedure to group documents into fine-grained events. A key step in the second clustering layer is deciding whether two documents are talking about the same event, which is a document matching problem. In Chapter 4, we propose the Concept Interaction Graph to represent an article as a graph of concepts. We then match a pair of articles by comparing the sentences that enclose the same concept vertex through a series of encoding techniques, and aggregate the matching signals over all vertices to obtain a final matching result. The Concept Interaction Graph turns the problem of long document matching into short sentence matching over different vertices. In Chapter 5, we propose Hierarchical Sentence Factorization, a technique to factorize a sentence into a hierarchical representation, with the components at each scale reordered into a “predicate-argument” form. We then apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks.

Chapter 3

Story Forest: Extracting Events and Telling Stories from Breaking News

Extracting events accurately from vast news corpora and organizing them logically is critical for news apps and search engines, which aim to organize news information collected from the Internet and present it to users in the most sensible forms. Intuitively speaking, an event is a group of news documents that report the same news incident, possibly in different ways. In this chapter, we describe our experience of implementing a news content organization system at Tencent to discover events from vast streams of breaking news and to evolve news story structures in an online fashion. Our real-world system faces unique challenges in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents, and 2) must develop the structures of event stories in an online manner, in order to guarantee a consistent user viewing experience. To solve these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. A core novelty of our Story Forest system is EventX, a semi-supervised scheme to extract events from massive Internet news corpora. EventX relies on a two-layered, graph-based clustering procedure to group documents into fine-grained events. We conducted extensive evaluation based on 1) 60 GB of real-world Chinese news data, 2) a large Chinese Internet news dataset that contains 11,748 news articles with ground-truth event labels, and 3) the 20 Newsgroups English dataset, through detailed pilot user experience studies. The results demonstrate the superior capabilities of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers.

3.1 Introduction

With information explosion in a fast-paced modern society, tremendous volumes of news articles are constantly being generated on the Internet by different media providers, e.g., Yahoo! News, Tencent News, CNN, BBC, etc. In the meantime, it becomes increasingly difficult for average readers to digest the huge volumes of daily news articles, which may cover diverse topics and contain redundant or overlapping information. Many news app users have the common experience of being overwhelmed by highly redundant information about a number of ongoing hot events, while still being unable to get information about the events they are truly interested in. Furthermore, search engines perform document retrieval from large corpora based on user-entered queries. However, they do not provide a natural way for users to view trending topics or breaking news.

An emerging alternative way to visualize news corpora without pre-specified queries is to organize and present news articles through event timelines [Yan et al., 2011; Wang et al., 2016], event threads [Nallapati et al., 2004], event evolution graphs [Yang et al., 2009], or information maps [Shahaf et al., 2012; Shahaf et al., 2013; Xu et al., 2013]. All of these approaches require the extraction of conceptually clean events from a large number of messy news documents, which involves automated event extraction and visualization as a crucial step toward intelligent news systems. However, few existing news information organization techniques successfully achieve this goal, for several reasons.

First of all, prior research on Topic Detection and Tracking (TDT) [Allan, 2012] as well as text clustering [Aggarwal and Zhai, 2012; Jain, 2010] mainly focused on grouping related documents into topics; it is much harder to cluster articles by events, where articles depicting the same event should be grouped together, since the number of events that occur daily in the real world is unpredictable. As a result, we cannot use some of the popular clustering algorithms, e.g., K-means, that require predefining the number of clusters, to extract events. In addition, the sizes of event clusters are highly skewed, because hot events may be extensively discussed by tens or even hundreds of news articles on the Internet. In contrast, regular events will be reported by only a few articles or even a single article. These single-document events, however, constitute the majority of daily news collections, and should also be accurately discovered to appeal to the diverse interests of readers.

Second, many recently proposed event graphs or information maps try to link events in an evolution graph [Yang et al., 2009] or permit intertwining branches in the information map [Shahaf et al., 2013]. However, we would like to argue that such

overly complex graph structures do not make it easy for users to quickly visualize and understand news data. In fact, most breaking news follows a much simpler storyline. Using complex graphs to represent breaking news may complicate and even blur the story structure. Third, most existing event timeline or event graph generation schemes are based on offline optimization over the entire news corpus. However, for an automated event extraction system that aids the visualization of breaking news, it is desirable to “grow” the stories in an online fashion as news articles are published, without disrupting or restructuring the previously generated storylines. On one hand, given the vast amount of daily news data, incremental and online computation incurs less computation overhead by avoiding repeated processing of older documents. On the other hand, an online scheme can deliver a consistent story development structure to users, so that users can quickly follow newly trending events.

In this chapter, we propose Story Forest, a novel news organization system that addresses the aforementioned challenges. To extract conceptually clean events, each of which is essentially a cluster of news documents describing the same physical breaking news event, Story Forest incorporates a novel, semi-supervised, two-layered document clustering procedure that leverages a wide range of feature engineering and machine learning techniques, including keyword extraction, community detection, and graph-based clustering. We call this clustering procedure EventX. To the best of our knowledge, it is the first document clustering scheme specially tailored for event extraction among breaking news documents in the open domain.

We start with the observation that documents focusing on the same topic usually contain overlapping keywords. Therefore, in the first layer of the clustering procedure in EventX, we utilize a classifier trained on over 10,000 news articles to distinguish keywords from non-keywords for each document. We then apply an existing community detection algorithm to a keyword co-occurrence graph constructed from news corpora and extract subgraphs [Sayyadi and Raschid, 2013] of keywords to represent topics. Each document is assigned a topic by finding its most similar keyword subgraph. However, a keyword community or topic is still coarse-grained and may cover many events. In the second layer of EventX, documents within each topic are further clustered into fine-grained events. We construct a document relationship graph within each topic, where the relationship between each pair of documents, i.e., whether they describe the same event, is predicted by a supervised document pair relationship classifier trained on carefully handcrafted features. Finally, we apply the graph-based community detection algorithm again to decompose the document relationship graph of each topic into conceptually separate events.

To enhance event visualization, our Story Forest system further groups the discovered events into stories, where each story is represented by a tree of interconnected events. A link between two events indicates the temporal evolution of or a causal relationship between the two events. In contrast with existing story generation systems such as StoryGraph [Yang et al., 2009] and MetroMap [Shahaf et al., 2012], we propose an online algorithm to evolve story trees incrementally as breaking news articles arrive. Consequently, each story (called a story tree) is presented in one of several easy-to-view structures, i.e., either a linear timeline, a flat structure, or a tree with possibly multiple branches, which we believe are succinct and sufficient to represent the story structures of most breaking news.

Currently, access to public data for event extraction and organization is extremely limited. Therefore, to facilitate evaluation and further research on the problem of event clustering and story formation for breaking news, we have created multiple datasets with the effort of dedicated editors. First, we created the Chinese News Corpus dataset, which contains 60 GB of Chinese news documents collected from all major Internet news providers in China (including Tencent, Sina, WeChat, Sohu, etc.) over a 3-month period from October 1, 2016 to December 31, 2016, covering highly diversified topics in the open domain. Second, we further created the Chinese News Events dataset, where each article is manually labeled with the true event label and story label by editors and product managers at Tencent. It is also, to the best of our knowledge, the first Chinese dataset for event extraction evaluation. The new datasets have been made publicly available for research purposes.1

1 Our Chinese News Events dataset is currently available at https://pan.baidu.com/s/12vWHHTD8gQLPvVftm6LQdg. For the Chinese News Corpus dataset, we are currently in the process of publishing it to the public for research purposes.

We evaluated the performance of Story Forest based on the Chinese News Corpus dataset, and compared our EventX news document clustering algorithm with other approaches on the Chinese News Events dataset. We also conducted a detailed and extensive pilot user experience study for (long) news document clustering and news story generation to evaluate how our system, as well as several baseline schemes, appeals to the habits of human readers. According to the pilot user experience study, our system outperforms multiple state-of-the-art news clustering and story generation systems, such as KeyGraph [Sayyadi and Raschid, 2013] and StoryGraph [Yang et al., 2009], in terms of the logical validity of the generated story structures, as well as the conceptual cleanness of each identified event/story. Experiments show that the average time for our Java-based system to finish event clustering and story structure generation on the daily news data is less than 30 seconds on a MacBook Pro

with a 2 GHz Intel Core i7 processor and 8 GB of memory. Therefore, our system proves to be highly efficient and practical. To summarize, we make the following contributions in this chapter:

• We formally define the problem of event extraction for breaking news articles in the open domain, where the granularity of an event must conform to the physical events described by the articles and can be implicitly guided by the labeled dataset in our semi-supervised algorithms. We describe this in more detail in Sec. 3.2.

• We propose the EventX algorithm, a two-layered, graph-based document clustering algorithm that can perform fast event extraction from a large volume of news documents in a semi-supervised manner. The main novelty of EventX is a layered clustering scheme that separates the problem of topic discovery from that of finer-grained event extraction. Such a two-layered graph-based clustering scheme significantly improves the overall time efficiency and scalability of the algorithm, making it applicable in industry practice.

• We explore a tree-of-events representation for visualizing news documents. We also introduce an online algorithm to dynamically incorporate new events into the existing trees. Combining this approach with the EventX algorithm, we create the Story Forest system for intelligent and efficient news story structure formation.

• We have collected and labeled a large amount of data for the study and evaluation of event extraction and story structure organization, since, to the best of our knowledge, there is no publicly available dataset specifically dedicated to news event clustering or extraction and story formation.

Our algorithm has been successfully integrated into the hot event discovery feature of Tencent QQ Browser, one of the most popular mobile browsers, which serves over 100 million daily active users. The remainder of this chapter is organized as follows. Sec. 3.2 formally describes the problem of event extraction and organization from massive news data. In Sec. 3.3, we present the main design of the Story Forest system and the EventX algorithm. In Sec. 3.4, we describe the Chinese News Events dataset collected and created specifically for evaluating event extraction algorithms, and then compare and discuss the experimental results of EventX and Story Forest against other baselines. The chapter is concluded in Sec. 3.5.

3.2 Problem Definition and Notations

In this section, we will first describe key concepts and notations used in this chapter, and formally define our problem. Then, we conduct a case study to clearly illustrate the idea of story trees.

3.2.1 Problem Definition

We first present the definitions of some key concepts used in this chapter, in a bottom-up hierarchy: event → story tree → topic.

Definition 1. Event: an event $\mathcal{E}$ is a set of news documents reporting the same piece of real-world breaking news.

Definition 2. Story tree: a story $\mathcal{S}$ is a tree of related events that report a series of evolving real-world breaking news. A story usually revolves around a group of specific persons and happens at certain places during specific times. Each node in the tree is an event, and each unidirectional edge in the tree indicates the time sequence of events. Each branch in the tree represents a subset of events that are about the same sub-story. The granularity of a story is subjective and can be implicitly determined by a training dataset.

Definition 3. Topic: a topic consists of a set of stories that are highly correlated or similar to each other.
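Read bottom-up, these three definitions induce a simple containment hierarchy. The following data structures are purely illustrative (the field names are hypothetical, not part of our system), but they make the hierarchy explicit:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Event:                  # Definition 1: a set of documents on one piece of news
    doc_ids: List[str]

@dataclass
class StoryTree:              # Definition 2: a tree of related, evolving events
    events: List[Event]
    edges: List[Tuple[int, int]] = field(default_factory=list)  # parent -> child index

@dataclass
class Topic:                  # Definition 3: a set of highly correlated stories
    stories: List[StoryTree]
```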

In this work, we assume that two news articles describe the same event as long as the incident they cover has the same time of occurrence and involves the same participating entities. This assumption is highly justifiable for breaking news articles on the Internet, which, unlike fiction, are usually short, succinct, and focused on reporting the incident of a group of specific persons, organizations or other types of entities taking actions at one or several locations. We do not consider the stance, perspective or publishing date of an article. Additionally, we do not model relationships between events, other than the weightless temporal evolution required for building story trees. This allows each event to be conceptually cleaner and more specific. In contrast, a topic is usually determined in a subjective manner, and is usually vaguer and broader than an event. For example, both American presidential election and 2016 U.S. presidential election can be considered topics, with the second being a subset of the first. We also make the assumption that each event has a single most appropriate corresponding topic. This ensures that an

event cannot appear under multiple topic clusters, which significantly simplifies the clustering procedure. This assumption is also realistic for most breaking news articles, where an article is usually written for timeliness to report only one real-world incident. Each topic may contain multiple story trees, and each story tree consists of multiple logically connected events. In our work, events (instead of news documents) are the smallest atomic units. Each event is also assumed to belong to a single story and contributes partial information to the overall story. For instance, considering the topic American presidential election, 2016 U.S. presidential election is a story within this topic, and Trump and Hillary’s first television debate is an event within this story. The boundaries between events, stories and topics do not have to be explicitly defined. In real-world applications, the labeled training dataset often implicitly captures and reflects their differences. In this way, even documents that are not related to evolving physical events can be clustered according to a specific granularity, as long as the labeled training dataset contains such samples to help define the boundary. Therefore, our EventX event extraction algorithm defines a general framework to cluster documents into events that are implicitly defined by any training dataset. Note that prior studies on text clustering usually focus on clustering at the granularity of topics [Allan, 2012; Aggarwal and Zhai, 2012], which are clusters of related articles. In contrast, the event extraction problem is much more challenging, because on top of clustering documents that belong to the same topic, we also need to utilize key information within each document to ensure that all documents within an event cluster report the same physical event.

3.2.2 Notations

We now introduce some notations and describe our problem formally. Given a news document stream $\mathbf{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_t, \ldots\}$, where $\mathcal{D}_t$ is the set of news documents collected in time period $t$, our objective is to: a) cluster all news documents $\mathbf{D}$ into a set of events $\mathbf{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_{|\mathbf{E}|}\}$, and b) connect the extracted events to form a set of stories $\mathbf{S} = \{\mathcal{S}_1, \ldots, \mathcal{S}_{|\mathbf{S}|}\}$. Each story $\mathcal{S} = (\mathbf{E}, \mathbf{L})$ contains a set of events $\mathbf{E}$ and a set of links $\mathbf{L}$, where $L_{i,j} := \langle \mathcal{E}_i, \mathcal{E}_j \rangle$ denotes a directed link from event $\mathcal{E}_i$ to $\mathcal{E}_j$, indicating a temporal evolution or logical connection relationship. Furthermore, we require the events and story trees to be extracted in an online and incremental manner. By online, we mean that we extract events from each $\mathcal{D}_t$ as soon as the news corpus $\mathcal{D}_t$ arrives in time period $t$. By incremental,

we mean that we merge or insert the discovered events into the existing story trees found at time $t-1$ without changing the existing tree structures. This is a unique strength of our scheme compared to prior work, since we do not need to repeatedly process older documents and can deliver a complete set of evolving yet logically consistent story trees to users on demand.

Figure 3.1: The story tree of “2016 U.S. presidential election.”

3.2.3 Case Study

To give readers more intuition about what stories and events look like, we use an illustrative example to further clarify the concepts of stories vs. events. Fig. 3.1 showcases the story tree of “2016 U.S. presidential election”. The story contains 20 nodes, where each node indicates an event in the 2016 U.S. election, and each link indicates a temporal evolution or a logical connection between two events. For example, event 19 says America votes to elect new president, and event 20 says Donald Trump is elected president. The index number on each node represents the event sequence over the timeline. There are 6 paths within this story tree: the main path 1 → 20 captures the whole presidential election process, branch 3 → 6 is about Hillary’s health conditions, branch 7 → 13 is about the television debates, branch 14 → 18 is related to the “mail door” investigation, etc. As we can see, by modeling the evolutionary and logical structure of a story as a story tree, users can easily grasp the logic of news stories and learn the main information quickly. Let us consider the following 4 events under the topic 2016 U.S. presidential election: 1) first election television debate; 2) second election television debate; 3) FBI restarts the “mail door” investigation; 4) America votes to elect the new president. Intuitively, these 4 events should have no overlapping information between them. A news article about Trump and Hillary’s first election television debate is conceptually separate from another article reporting Trump and Hillary’s second election television debate.

Figure 3.2: Different structures to characterize a story.

For news articles, different events under the same topic should be clearly distinguishable, because they usually follow the progressing timeline of real-world affairs. Let us represent each story by an empty root node s from which the story originates, and denote each event by an event node e. The events in a story can be organized in one of the four structures shown in Fig. 3.2: a) a flat structure that does not include dependencies between events; b) a timeline structure that organizes events by their timestamps; c) a graph structure that checks the connections between all pairs of events and maintains a subset of the strongest connections; d) a tree structure that reflects the structure of evolving events within a story. Compared with a tree structure, sorting events by timestamps omits the logical connections between events, while using directed acyclic graphs to model event dependencies without considering the evolving consistency of the whole story can lead to unnecessary connections between events. Through extensive user experience studies in Sec. 3.4, we show that tree structures are the most effective way to represent breaking news stories.

3.3 The Story Forest System

In this section, we start with an overview of the proposed Story Forest system. Then, we separately introduce the detailed procedure of event extraction from news documents, and how we model a story’s evolving structure with story trees. An overview of our Story Forest system is shown in Fig. 3.3; it mainly consists of four components: preprocessing, keyword graph construction, clustering documents into events, and growing story trees with events. The overall process is divided into 8 steps. First, the input news document stream is processed by a variety of NLP and machine learning tools, including document filtering and word segmentation. Then we perform keyword extraction, construct or update the keyword co-occurrence graph, and split the graph into sub-graphs. After that, we utilize our proposed EventX algorithm to cluster documents into fine-grained events.

Figure 3.3: An overview of the system architecture of Story Forest.

Finally, we update the story trees (formed previously) by either inserting each discovered event into an existing story tree at the right place, or creating a new story tree if the event does not belong to any existing story. Note that each topic may contain multiple story trees, and each story tree consists of logically connected events. We explain the design choices of each component in detail in the following subsections.

3.3.1 Preprocessing

When new documents arrive, the first task Story Forest performs is document preprocessing, which includes the following sequential steps:

Document filtering: documents whose content length is smaller than a threshold (20 characters) are discarded.

Word segmentation: we segment the title and body of each document using the Stanford Chinese Word Segmenter version 3.6.0 [Chang et al., 2008], which has proved to yield excellent performance on Chinese word segmentation tasks. Note that for data in a different language, the corresponding word segmentation tool for that language can be used.
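A minimal sketch of these first two preprocessing steps, assuming documents arrive as dictionaries with "title" and "content" fields and substituting the open-source jieba segmenter for the Stanford Chinese Word Segmenter:

```python
import jieba  # stand-in for the Stanford Chinese Word Segmenter

MIN_CONTENT_LENGTH = 20  # characters; shorter documents are discarded

def preprocess(docs):
    """Document filtering followed by word segmentation of title and body."""
    kept = []
    for doc in docs:
        if len(doc["content"]) < MIN_CONTENT_LENGTH:
            continue  # document filtering
        kept.append({
            "title_words": jieba.lcut(doc["title"]),
            "content_words": jieba.lcut(doc["content"]),
        })
    return kept
```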

Figure 3.4: The structure of the keyword classifier.

Table 3.1: Features for the keyword classifier.

• Word feature: named entity or not; location name or not; contains angle brackets or not.
• Structural feature: TF-IDF; whether the word appears in the title; first occurrence position in the document; average occurrence position in the document; distance between the first and last occurrence positions; average distance between adjacent occurrences of the word; percentage of sentences that contain the word; TextRank score.
• Semantic feature: LDA.

Keyword extraction: extracting appropriate keywords to represent the main ideas of a document is critical to the performance of the system. We have found that traditional keyword extraction approaches, such as TF-IDF based keyword extraction and TextRank [Mihalcea and Tarau, 2004], cannot produce satisfying results on real-world news data. For example, the TF-IDF based method measures a word’s importance by its frequency in the document, and is therefore unable to extract keywords that have a relatively low frequency. The TextRank algorithm utilizes word co-occurrence information and can handle such cases, but its computation time increases significantly as document length grows. Another idea is to fine-tune a rule-based system that combines multiple keyword extraction strategies; still, this type of system relies heavily on the quality of the rules and often generalizes poorly. To efficiently and accurately extract keywords, we trained a binary classifier to determine whether a word is a keyword for a document. In particular, we manually labeled the keywords of 10,000+ documents, yielding 20,000+ positive keyword samples and 350,000+ negative samples. Each word is transformed into a multi-view feature vector. Table 3.1 lists the main features that we found to be critical to the binary classifier. For the LDA feature vector, we trained a 1000-dimensional LDA model based on news data collected from January 1, 2016 to May 31, 2016 that

contains 300,000+ documents. The training process took 30 hours. After acquiring the features for each word, a straightforward idea is to input them directly into a Logistic Regression (LR) classifier. However, as a linear classifier, Logistic Regression relies on careful feature engineering. To reduce the impact of human bias in handcrafted features, we combine a Gradient Boosting Decision Tree (GBDT) with an LR classifier to obtain the keyword/non-keyword classification result, as shown in Fig. 3.4. The GBDT component is trained to automatically discover useful feature crossings and combinations, and to discretize continuous features. The output of the GBDT serves as the input to the LR classifier, which determines whether a word is a keyword for the current document. We also tried an SVM as the second-stage classifier instead of LR and observed similar performance. It is also worth mentioning that on another Chinese keyword extraction dataset, our classifier achieved a precision of 0.83 and a recall of 0.76, compared to 0.72 and 0.76 respectively when the GBDT component is excluded. Note that one can still employ off-the-shelf tools, such as RAKE [Rose et al., 2010], to perform keyword extraction. When abundant training data is available, our proposed keyword extraction approach can enhance the performance of event extraction in later stages, but it is not required by the Story Forest system.
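The GBDT-to-LR stacking described above can be sketched with scikit-learn as follows. This is an illustrative reimplementation, not our production pipeline: the leaf index each boosted tree assigns to a sample acts as an automatically discovered cross feature, while X and y stand in for the word feature vectors of Table 3.1 and their keyword/non-keyword labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.rand(5000, 12)        # placeholder word feature vectors (Table 3.1)
y = rng.randint(0, 2, 5000)   # placeholder labels: 1 = keyword, 0 = non-keyword

# Stage 1: the GBDT discovers feature crossings and discretizes continuous features.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]   # leaf index per tree: shape (n_samples, n_trees)

# Stage 2: one-hot encoded leaf indices feed the logistic regression classifier.
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)

def keyword_probability(word_features):
    """Probability that a word (given its feature vector) is a keyword."""
    l = gbdt.apply(word_features.reshape(1, -1))[:, :, 0]
    return lr.predict_proba(enc.transform(l))[0, 1]
```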

3.3.2 Event Extraction by EventX

After document preprocessing, we need to extract events. Event extraction here is essentially a fine-tuned document clustering procedure that groups conceptually similar documents into events. Although clustering studies are often subjective in nature, we show, based on a manually labeled news dataset, that our carefully designed procedure significantly improves the accuracy of event clustering and conforms to human understanding. To meet the high accuracy requirement of long news text clustering, we propose EventX, a two-layer clustering approach based on both keyword graphs and document graphs. The major steps in EventX are keyword co-occurrence graph construction, first-layer topic clustering based on the keyword co-occurrence graph, and second-layer event extraction based on the document relationship graph. Note that the concept of “event” in this chapter and in our proposed dataset is fundamentally different from the “event mentions” in cross-document event coreference [Lee et al., 2012]. In our case, no matter how many event mentions can be found in an article, there is always a single major event that the news article is intended to report, with its narratives focusing on this major event. The task of EventX is to extract events by identifying whether a group of documents are

reporting the same breaking news with different narratives or paraphrasing, instead of identifying what event mentions a document contains. In the following, we provide a detailed description of the inner workings of EventX.

ALGORITHM 1: EventX Event Extraction Algorithm

Input: A set of news documents $\mathcal{D} = \{d_1, d_2, \ldots, d_{|\mathcal{D}|}\}$, with extracted features described in Sec. 3.3.1; a pre-trained document pair relationship classifier.
Output: A set of events $\mathbf{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_{|\mathbf{E}|}\}$.
1: Construct a keyword co-occurrence graph $\mathcal{G}$ over all documents' keywords. Connect $w_i$ and $w_j$ by an undirected edge $e_{i,j}$ if the number of times the keywords $w_i$ and $w_j$ co-occur exceeds a threshold $\delta_e$, and $\Pr\{w_j \mid w_i\}$ and $\Pr\{w_i \mid w_j\}$ are bigger than another threshold $\delta_p$.
2: Split $\mathcal{G}$ into a set of small and strongly connected keyword communities $\mathbf{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_{|\mathbf{C}|}\}$ using the community detection algorithm [Ohsawa et al., 1998]. The algorithm keeps splitting a graph by iteratively deleting edges with high betweenness centrality scores, until a stop condition is satisfied.
3: for each keyword community $\mathcal{C}_i$, $i = 1, \ldots, |\mathbf{C}|$ do
4:   Retrieve the subset of documents $\mathcal{D}_i$ that is highly related to this keyword community, by calculating the cosine similarity between the TF-IDF vector of each document and that of the keyword community, and comparing it to a threshold $\delta$.
5:   Connect document pairs in $\mathcal{D}_i$ to form a document relationship graph $\mathcal{G}_i^d$ using the document pair relationship classifier.
6:   Split $\mathcal{G}_i^d$ into a set of document communities $\mathbf{C}_i^d = \{\mathcal{C}_{i,1}^d, \mathcal{C}_{i,2}^d, \ldots, \mathcal{C}_{i,|\mathbf{C}_i^d|}^d\}$ using the community detection algorithm. Each community represents an event.
7: end for

Construct Keyword Co-occurrence Graph

Algorithm 1 shows the detailed steps of EventX. We show that our two-layered approach significantly improves the accuracy of event clustering on long news articles. Given a news corpus $\mathcal{D}$, we construct a keyword co-occurrence graph $\mathcal{G}$ [Sayyadi and Raschid, 2013]. Each node in $\mathcal{G}$ is a keyword $w$ extracted by the scheme described in Sec. 3.3.1, and each undirected edge $e_{i,j}$ indicates that $w_i$ and $w_j$ have co-occurred in the same document. Edges that satisfy two conditions will remain and all other edges will be pruned: the number of co-occurrences shall be above a minimum threshold $\delta_e$ (we set $\delta_e = 2$ in our experiments), and the conditional probabilities of occurrence $\Pr\{w_j \mid w_i\}$ and $\Pr\{w_i \mid w_j\}$ also need to be greater than a predefined threshold $\delta_p$ (we use 0.15). The conditional probability $\Pr\{w_i \mid w_j\}$ is calculated as

$$\Pr\{w_i \mid w_j\} = \frac{DF_{i,j}}{DF_j}, \qquad (3.1)$$

where $DF_{i,j}$ represents the number of documents that contain both keywords $w_i$ and $w_j$, and $DF_j$ is the document frequency of keyword $w_j$. It represents the probability that $w_i$ occurs in a document given that the document contains $w_j$.
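A minimal sketch of this edge-pruning rule, assuming each document has already been reduced to its set of extracted keywords (the helper name and input format are illustrative):

```python
from collections import Counter
from itertools import combinations

def build_keyword_graph(doc_keywords, delta_e=2, delta_p=0.15):
    """Return the pruned co-occurrence edges; doc_keywords is an iterable of
    keyword sets, one per document."""
    df = Counter()       # DF_j: document frequency of each keyword
    df_pair = Counter()  # DF_{i,j}: number of documents containing both keywords
    for kws in doc_keywords:
        df.update(kws)
        df_pair.update(combinations(sorted(kws), 2))
    edges = []
    for (wi, wj), co in df_pair.items():
        # Keep the edge only if the co-occurrence count and both conditional
        # probabilities of Eq. (3.1) clear their thresholds.
        if co >= delta_e and co / df[wi] >= delta_p and co / df[wj] >= delta_p:
            edges.append((wi, wj))
    return edges
```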

Topic Clustering on Keyword Co-occurrence Graph

Next, we perform community detection on the constructed keyword co-occurrence graph. The goal is to split the whole keyword graph $\mathcal{G}$ into communities $\mathbf{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_{|\mathbf{C}|}\}$, where each community (or subgraph) $\mathcal{C}_i$ contains the keywords for a certain topic (to which multiple events may be associated). The intuition is that keywords related to a common topic usually appear frequently in documents belonging to that topic. For example, documents belonging to the topic “2016 U.S. presidential election” will frequently mention keywords such as “Donald Trump”, “Hillary Clinton” and “election”. Such highly related keywords are connected to each other in the keyword graph and form dense subgraphs, while keywords that are not highly related have sparse connections or no connections. Our objective here is to extract the dense keyword subgraphs associated with different topics. The benefit of using community detection on the keyword graph is that it allows each keyword to appear in multiple communities; in reality, it is not unusual for one keyword to appear under multiple topics. We also tried clustering keywords by Word2Vec, but the performance was worse than community detection based on co-occurrence graphs. The main reason is that when clustering with pre-trained word vectors, words with similar semantic meanings are more likely to be grouped together; however, unlike articles in a specialized domain, news topics from the open domain often contain keywords with diverse semantic meanings. To detect keyword communities, we utilize the betweenness centrality score [Sayyadi and Raschid, 2013] of edges to measure the strength of each edge in the keyword graph. The betweenness score of an edge is defined as the number of shortest paths between all pairs of nodes that pass through it. An edge between two communities is expected to have a high betweenness score. Edges with high betweenness scores are removed iteratively to divide the graph into communities. If two edges have the same betweenness score, the one with the lower conditional probability is removed, where the conditional probability of edge $e_{i,j}$ is calculated as $(\Pr\{w_i \mid w_j\} + \Pr\{w_j \mid w_i\})/2$. We calculate the betweenness scores of edges using breadth-first search (BFS) and iteratively remove the edges with the highest betweenness score. Once a keyword graph is no longer fully connected, we continue the same process recursively on each connected component. The splitting process ends when the number of nodes in each subgraph is smaller than a predefined threshold $\delta_g$ (we use 3), or the maximum betweenness score of all edges in the subgraph is smaller than a threshold based on the subgraph's size.

Table 3.2: Features for document pair relationship classification.

• Keyword: number of common keywords in titles or contents; percentage of common keywords in titles or contents.
• Textual similarity: TF-IDF (and TF) similarities of titles or contents; TF-IDF (and TF) similarities of the first N (N = 1, 2, 3) sentences.
• Semantic: LDA cosine similarity of the two documents; the absolute value of the difference between the LDA vectors of the two documents.

We refer interested readers to [Sayyadi and Raschid, 2013] for more details about community detection. We note that the betweenness-based community detection algorithm is used as an off-the-shelf tool in EventX. There exist other community detection algorithms that are more efficient in terms of time complexity [Radicchi et al., 2004], and we can easily switch to these algorithms if time efficiency is a concern. After obtaining the keyword communities, we calculate the cosine similarity between each document and each keyword community, where documents are represented as TF-IDF vectors. Since a keyword community is essentially a bag of words, it can also be treated as a document. We assign each document to the keyword community with the highest similarity, as long as that similarity is above a predefined threshold $\delta$. At this point, we have finished the first layer of clustering, i.e., the documents are now grouped by topics.
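The recursive betweenness-based splitting can be sketched with networkx as below, reusing the edge list from the previous sketch. For brevity this version stops on community size only (the threshold δg), omitting the betweenness-score stop condition and the conditional-probability tie-breaking described above:

```python
import networkx as nx

def split_into_communities(edges, delta_g=3):
    """Iteratively remove the highest-betweenness edge until subgraphs are small."""
    g = nx.Graph(edges)
    stack = [g.subgraph(c).copy() for c in nx.connected_components(g)]
    communities = []
    while stack:
        sub = stack.pop()
        if sub.number_of_nodes() < delta_g or sub.number_of_edges() == 0:
            communities.append(set(sub.nodes))    # stop condition reached
            continue
        betweenness = nx.edge_betweenness_centrality(sub)
        sub.remove_edge(*max(betweenness, key=betweenness.get))
        parts = list(nx.connected_components(sub))
        if len(parts) == 1:
            stack.append(sub)                     # still connected: keep splitting
        else:
            stack.extend(sub.subgraph(p).copy() for p in parts)
    return communities
```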

Event Extraction based on Document Relationship Graph

After partitioning documents into topics, we perform the second-layer document clustering within each topic to obtain fine-grained events; we also call this process event clustering. An event cluster only contains documents that talk about the same event. As mentioned before, since event clusters can vary dramatically in size, traditional unsupervised learning algorithms, such as K-means, non-negative matrix factorization or hierarchical clustering, are not appropriate. Instead, we adopt a supervised-learning-guided clustering procedure in the second layer. Specifically, in contrast with the keyword graph in the first layer, we now consider each document as a graph node, and try to connect document pairs that discuss the

same event [Liu et al., 2019a]. In the keyword co-occurrence graph, judging whether two keywords are related is achieved through the co-occurrence of keywords in documents. This is feasible because the granularity of a topic is relatively coarse and subjective compared with an event. However, simply combining unsupervised strategies to judge whether two documents are talking about the same event cannot produce satisfying results, because events are highly diverse and the clustering granularity should be adjusted accordingly. To address these challenges, we propose a semi-supervised clustering scheme which incorporates a document pair relationship classifier to construct a document relationship graph, and performs event extraction on that graph. Given a set of

documents $\mathcal{D} = \{d_1, d_2, \ldots, d_{|\mathcal{D}|}\}$ within a topic cluster, for each pair of documents $\langle d_i, d_j \rangle$ in $\mathcal{D}$, we add an edge $e_{i,j}^d$ if they are talking about the same event. We trained an SVM classifier to determine whether a pair of documents talk about the same event, using multiple document pair features, such as the cosine similarity of TF-IDF vectors, the number of common keywords, and the LDA cosine similarity, as the input vector. Table 3.2 lists the main features we utilized to train the document pair relationship classifier. For each pair of documents within the same topic, we decide whether to draw an edge between them based on the prediction made by the classifier. Hence, the documents $\mathcal{D}$ in each topic form a document relationship graph $\mathcal{G}^d$. We then apply the same community detection algorithm mentioned above to these graphs. Naturally, each community within a topic cluster now represents an event. Note that the graph-based clustering on the second layer is highly efficient, since the number of documents contained in each topic is significantly smaller after the first-layer clustering. In a nutshell, our two-layered scheme first groups documents into topics based on keyword community detection, and then further groups the documents within each topic into fine-grained events. It has multiple advantages compared with a number of existing clustering algorithms:
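A condensed sketch of this second layer is given below. It computes only a few of the Table 3.2 features, assumes each document is a dictionary with "text" and "keywords" fields together with a fitted TfidfVectorizer and a pre-trained SVM classifier clf, and substitutes connected components for the community detection step:

```python
from itertools import combinations

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(a, b, tfidf):
    """A small subset of the Table 3.2 document pair features."""
    sim = cosine_similarity(tfidf.transform([a["text"]]),
                            tfidf.transform([b["text"]]))[0, 0]
    common = a["keywords"] & b["keywords"]
    union = a["keywords"] | b["keywords"]
    return [sim, len(common), len(common) / max(1, len(union))]

def extract_events(topic_docs, tfidf, clf):
    """Connect pairs the classifier judges to be the same event, then cluster."""
    g = nx.Graph()
    g.add_nodes_from(range(len(topic_docs)))
    for i, j in combinations(range(len(topic_docs)), 2):
        x = np.array([pair_features(topic_docs[i], topic_docs[j], tfidf)])
        if clf.predict(x)[0] == 1:
            g.add_edge(i, j)
    return [sorted(c) for c in nx.connected_components(g)]
```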

1. It does not require a pre-defined number of event clusters. This is critical in real-world applications: for real-world data such as daily news, the exact number of events is almost always unknown and varies dramatically. Most existing clustering algorithms require the number of clusters as a parameter, or need carefully chosen parameters that control the clustering granularity. For EventX, even though we still need to specify a stopping criterion in the community detection algorithm, the threshold hyper-parameters in our algorithm are much more stable and insensitive to different sizes of news corpora. Therefore, these thresholds can be selected using grid search, and do not need to be updated frequently for news articles from a different date.

2. Our algorithm can adaptively determine the granularity of different events based on the pre-trained document pair relationship classifier, as well as community detection on the document relationship graph.

3. The performance of our algorithm is less sensitive to the parameter settings in the first layer, because the granularity of a topic is usually subjective. We can set relatively low thresholds in the first layer to ensure that documents belonging to the same event are grouped together, then extract the fine-grained events in the second layer.

4. The performance of event clustering can be improved by adding more features and training data to the document pair relationship classifier, without affecting keyword extraction and topic clustering.

5. The two-layered scheme of our algorithm makes event clustering highly efficient. Directly performing relationship classification over a whole document corpus is not feasible, as the time complexity is at least $\mathcal{O}(n_d^2)$, where $n_d$ is the number of documents. With the two-layered scheme, the documents are first split into multiple topics (each topic usually contains 1 to 100 documents), so the time complexity of the second-layer event extraction is approximately $\mathcal{O}(n_d)$. The total time complexity, taking both layers of clustering into account, is still much smaller than $\mathcal{O}(n_d^2)$, as we will discuss in Sec. 3.4.4.

3.3.3 Growing Story Trees Online

Given the set of extracted events for a particular topic, we further organize these events into multiple stories under this topic in an online manner. Each story is represented by a story tree that characterizes the evolving structure of the story. Upon the arrival of a new event, given the existing story forest, our online algorithm for growing the story forest involves two main steps: a) identifying the story tree to which the event belongs; and b) updating the found story tree by inserting the new event at the right place. If the event does not belong to any existing story, we create a new story tree.

a) Identifying the related story tree. Given a set of new events $\mathbf{E}_t = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_{|\mathbf{E}_t|}\}$ at time period $t$ and an existing story forest $\mathbf{F}_{t-1} = \{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_{|\mathbf{F}_{t-1}|}\}$ formed during the previous $t-1$ time periods, our objective is to assign each new event $\mathcal{E} \in \mathbf{E}_t$ to an existing story tree $\mathcal{S} \in \mathbf{F}_{t-1}$.

Figure 3.5: Three types of operations to place a new event into its related story tree.

If no story in the current story forest matches the event, a new story tree will be created and added to the story forest. We apply a two-step strategy to decide whether a new event $\mathcal{E}$ belongs to an existing story tree $\mathcal{S}$ formed previously. First, as described at the end of Sec. 3.3.2, event $\mathcal{E}$ has its own keyword set $\mathcal{C}_{\mathcal{E}}$. Similarly, the existing story tree $\mathcal{S}$ has an associated keyword set $\mathcal{C}_{\mathcal{S}}$, which is the union of the keyword sets of all events in that tree. We then calculate the compatibility between event $\mathcal{E}$ and story tree $\mathcal{S}$ as the Jaccard similarity coefficient between $\mathcal{C}_{\mathcal{S}}$ and $\mathcal{C}_{\mathcal{E}}$: $\text{compatibility}(\mathcal{C}_{\mathcal{S}}, \mathcal{C}_{\mathcal{E}}) = \frac{|\mathcal{C}_{\mathcal{S}} \cap \mathcal{C}_{\mathcal{E}}|}{|\mathcal{C}_{\mathcal{S}} \cup \mathcal{C}_{\mathcal{E}}|}$. If the compatibility is bigger than a threshold, we further check whether at least one document in event $\mathcal{E}$ and at least one document in story tree $\mathcal{S}$ share $n$ or more common words in their titles (with stop words removed). If yes, we assign event $\mathcal{E}$ to story tree $\mathcal{S}$; otherwise, they are not related. In our experiments, we set $n = 1$. If the event $\mathcal{E}$ is not related to any existing story tree, a new story tree will be created.

b) Updating the related story tree. After a related story tree $\mathcal{S}$ has been identified for the incoming event $\mathcal{E}$, we perform one of 3 types of operations to place event $\mathcal{E}$ in the tree: merge, extend or insert, as shown in Fig. 3.5. The merge operation merges the new event $\mathcal{E}$ into an existing event node of the tree. The extend operation appends event $\mathcal{E}$ as a child node to an existing event node in the tree. Finally, the insert operation directly appends event $\mathcal{E}$ to the root node of story tree $\mathcal{S}$. Our system chooses the most appropriate operation for the incoming event based on the following procedures.

Merge: we merge $\mathcal{E}$ with an existing event in the tree if they essentially talk about the same event. This is achieved by checking whether the centroid documents of the two events talk about the same thing, using the document-pair relationship classifier described in Sec. 3.3.2. The centroid document of an event is simply the concatenation of all the documents in the event.

Extend and Insert: if event $\mathcal{E}$ does not overlap with any existing event, we will find the parent event node in $\mathcal{S}$ to which it should be appended. We calculate the connection strength between the new event $\mathcal{E}$ and each existing event $\mathcal{E}_j \in \mathcal{S}$ based on three factors: 1) the time distance between $\mathcal{E}$ and $\mathcal{E}_j$, 2) the compatibility of the two events, and 3) the coherence of the storyline if $\mathcal{E}$ is appended to $\mathcal{E}_j$ in the tree, i.e.,

$$\text{ConnectionStrength}(\mathcal{E}_j, \mathcal{E}) := \text{compatibility}(\mathcal{E}_j, \mathcal{E}) \times \text{coherence}(\mathcal{L}_{\mathcal{S} \to \mathcal{E}_j \to \mathcal{E}}) \times \text{timePenalty}(\mathcal{E}_j, \mathcal{E}). \qquad (3.2)$$

We now explain the three components of the above equation one by one. First,

the compatibility between two events $\mathcal{E}_i$ and $\mathcal{E}_j$ is given by

$$\text{compatibility}(\mathcal{E}_i, \mathcal{E}_j) = \frac{TF(d_{c_i}) \cdot TF(d_{c_j})}{\|TF(d_{c_i})\| \cdot \|TF(d_{c_j})\|}, \qquad (3.3)$$

where $d_{c_i}$ is the centroid document of event $\mathcal{E}_i$. Furthermore, the storyline of $\mathcal{E}_j$ is defined as the path in $\mathcal{S}$ starting from the root node of $\mathcal{S}$ and ending at $\mathcal{E}_j$ itself, denoted by $\mathcal{L}_{\mathcal{S} \to \mathcal{E}_j}$. Similarly, the storyline of $\mathcal{E}$ appended to $\mathcal{E}_j$ is denoted by $\mathcal{L}_{\mathcal{S} \to \mathcal{E}_j \to \mathcal{E}}$. For a storyline $\mathcal{L}$ represented by a path $\mathcal{E}_0 \to \cdots \to \mathcal{E}_{|\mathcal{L}|}$, where $\mathcal{E}_0 := \mathcal{S}$, its coherence [Xu et al., 2013] measures the theme consistency along the storyline, and is defined as

$$\text{coherence}(\mathcal{L}) = \frac{1}{|\mathcal{L}|} \sum_{i=0}^{|\mathcal{L}|-1} \text{compatibility}(\mathcal{E}_i, \mathcal{E}_{i+1}). \qquad (3.4)$$

Finally, the bigger the time gap between two events, the less possible it is that the two events are connected. We thus calculate the time penalty as

$$\text{timePenalty}(\mathcal{E}_j, \mathcal{E}) = \begin{cases} e^{\delta \cdot (t_{\mathcal{E}_j} - t_{\mathcal{E}})} & \text{if } t_{\mathcal{E}_j} - t_{\mathcal{E}} < 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (3.5)$$

We calculate the connection strength between the new event $\mathcal{E}$ and every event node $\mathcal{E}_j \in \mathcal{S}$ using (3.2), and append event $\mathcal{E}$ to the existing $\mathcal{E}_j$ that leads to the maximum connection strength. If the maximum connection strength is lower than a threshold value, we insert $\mathcal{E}$ into story tree $\mathcal{S}$ by directly appending it to the root node of $\mathcal{S}$. In other words, insert is a special case of extend.

3.4 Performance Evaluation

We conduct two groups of experiments. The first group independently evaluates EventX, since it is a core novelty within our Story Forest system. We report the performance of EventX on a manually labeled Chinese News Events dataset and the openly available 20 Newsgroups dataset against several clustering baselines. In the second group, we examine the overall performance of Story Forest, and add story structure visualizations, on a much larger Chinese news document dataset, i.e., the Chinese News Corpus dataset. We also investigate the effects of various hyper-parameters in the Story Forest system, as well as the algorithm complexity of the entire system.

3.4.1 News Datasets

To help evaluate Story Forest, we create the Chinese News Corpus dataset by collecting news documents from major Internet news providers in China, such as Tencent and Sina, in a three-month period from October 1, 2016 to December 31, 2016, covering different topics in the open domain. The entire Story Forest system is tested on the full 60 GB of Chinese news documents (approximately 1.5 million news articles), and we employ human evaluators to analyze its overall performance. We then sample a smaller subset from these documents to create the Chinese News Events dataset, which is manually labeled with the ground truth events and stories by editors and product managers at Tencent. It is also, to the best of our knowledge, the first Chinese dataset for event extraction evaluation.

Figure 3.6: The characteristics of the introduced Chinese News Events and Story dataset. Panels: (a) topic distribution of documents; (b) topic distribution of events; (c) topic distribution of stories; (d) histogram of document lengths; (e) histogram of event sizes; (f) histogram of story sizes; (g) histogram of document keyword numbers; (h) histogram of event time spans by day; (i) histogram of story time spans by day.

In the creation of the Chinese News Events dataset, we assume that: 1) each article has only one event ID, i.e., it is associated with the breaking news that leads to this article; 2) different events from the same story are clearly distinguishable, partly because they usually follow the progressing timeline of real-world affairs; 3) each article belongs to only one news story, indicated by a unique story ID (we ignore complex storylines and entangled plots such as those in fiction). We train and evaluate the EventX algorithm on this labeled dataset of 11,748 Chinese news articles. When evaluating the whole Story Forest system on the Chinese News Corpus dataset, documents belonging to the Chinese News Events dataset are removed.

Fig. 3.6 shows some statistics of the new dataset. Fig. 3.6(a), Fig. 3.6(b) and Fig. 3.6(c) demonstrate how many documents, events and stories each topic contains. We can see that the distributions are highly skewed. The top 3 largest topics are “entertainment”, “sports” and “current events”. Fig. 3.6(d) shows the histogram of word counts in documents. The average number of words in a document is 733.67 and the longest document has 21,791 words. From Fig. 3.6(g), we can see that the average number of keywords extracted from each document is 19.8. In terms of the sizes of events and stories, the histograms of the number of documents in each event and in each story are shown in Fig. 3.6(e) and Fig. 3.6(f), respectively. We can easily observe that for real-world news data, a majority of events contain only a single document: the average size of an event is 1.26 (documents), while the largest event contains 277 documents. The average size of a story is 3.15 (documents) and the largest story contains 322 documents. Finally, Fig. 3.6(h) and Fig. 3.6(i) plot the distributions of the time spans of each event and story. The average time span of an event is 0.26 days, and that of a story is 2.15 days. These statistics help to justify the assumptions we made above when labeling the dataset: most breaking news events and stories have relatively short lifespans with simple narratives (in contrast to fiction or scientific articles).

3.4.2 Evaluation of EventX

Evaluation Datasets

We utilize the following datasets for evaluating EventX:

• Chinese News Events (CNE). This is the full dataset that contains 11,748 news articles belonging to 9,327 events. The average number of words in documents is 733.67.

• Chinese News Events Subset 1. It is created by removing events containing fewer than 2 documents from the Chinese News Events dataset. It contains 3,557 news articles belonging to 1,136 events. The average number of words in documents is 768.07.

• Chinese News Events Subset 2. We further filter out events that contain fewer than 4 documents in the Chinese News Events dataset. This dataset contains 1,456 news articles belonging to 176 events. The average length of documents is 760.04 words.

• 20 Newsgroups (available at http://qwone.com/~jason/20Newsgroups/). This is a classic dataset for the evaluation of text clustering algorithms. It consists of 18,821 documents that belong to 20 different newsgroups. The average length of documents is 137.85 words.

For the Chinese datasets, after word segmentation, we extract keywords from each document using the supervised approach mentioned in Sec. 3.3.1. The binary classifier that decides whether a word is a keyword of a document is trained on a dataset that we manually labeled. It contains the keywords of 10,000+ documents, including 20,000+ positive keyword samples and 350,000+ negative samples. On another Chinese keyword extraction evaluation dataset, our classifier achieved a precision of 0.83 and a recall of 0.76, while these values are 0.72 and 0.76, respectively, if we exclude the GBDT component. The average number of keywords for each document in our proposed Chinese News Events dataset is 19.8. For English documents, we extract keywords using RAKE [Rose et al., 2010].

Evaluation Metrics

The evaluation metrics we use include: Homogeneity (H), Completeness (C), V-measure score (V) [Rosenberg and Hirschberg, 2007] and Normalized Mutual Information (NMI). Homogeneity and completeness are intuitive evaluation metrics based on conditional entropy analysis. Homogeneity is larger if each cluster contains only members from a single class. Completeness is maximized if all members of a ground truth class are assigned to the same cluster. The homogeneity and completeness scores are defined as:
$$H = 1 - \frac{\sum_{c,k} n_{c,k} \log \frac{n_{c,k}}{n_k}}{\sum_{c} n_c \log \frac{n_c}{N}}, \quad (3.6)$$
$$C = 1 - \frac{\sum_{c,k} n_{c,k} \log \frac{n_{c,k}}{n_c}}{\sum_{k} n_k \log \frac{n_k}{N}}, \quad (3.7)$$
where $n_c$ is the number of documents in class $c$, $n_k$ is the number of documents in cluster $k$, $n_{c,k}$ is the number of documents in class $c$ as well as in cluster $k$, and $N$ is the total number of documents in the dataset. The V-measure is the harmonic mean of homogeneity and completeness:
$$V = \frac{2 \times H \times C}{H + C}. \quad (3.8)$$
Normalized mutual information [Estévez et al., 2009] is also widely used for text clustering evaluation. It measures the amount of statistical information shared by the ground truth cluster labels and the cluster assignment results. The NMI is 1 if the clustering results perfectly match the ground truth labels, and it is close to 0 when the cluster labels are randomly generated. Normalized mutual information is formally defined as:
$$\mathrm{NMI} = \frac{\sum_{c,k} n_{c,k} \log \frac{n_{c,k} \cdot N}{n_c \cdot n_k}}{\sqrt{\left(\sum_{c} n_c \log \frac{n_c}{N}\right)\left(\sum_{k} n_k \log \frac{n_k}{N}\right)}}. \quad (3.9)$$
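For reference, a direct implementation of Eqs. (3.6) to (3.9) from two label lists might look as follows. This is our own illustrative sketch; it ignores degenerate cases such as a single class or a single cluster, and scikit-learn's `homogeneity_score`, `completeness_score`, `v_measure_score` and `normalized_mutual_info_score` offer equivalent, production-ready versions.

```python
import math
from collections import Counter

def clustering_metrics(classes, clusters):
    """Compute H, C, V and NMI of Eqs. (3.6)-(3.9) from two label lists."""
    N = len(classes)
    n_c = Counter(classes)                    # class sizes
    n_k = Counter(clusters)                   # cluster sizes
    n_ck = Counter(zip(classes, clusters))    # joint counts
    # Denominator sums of Eqs. (3.6)/(3.7); both are negative for non-trivial labelings.
    dc = sum(n * math.log(n / N) for n in n_c.values())
    dk = sum(n * math.log(n / N) for n in n_k.values())
    H = 1 - sum(n * math.log(n / n_k[k]) for (c, k), n in n_ck.items()) / dc
    C = 1 - sum(n * math.log(n / n_c[c]) for (c, k), n in n_ck.items()) / dk
    V = 2 * H * C / (H + C)
    mi = sum(n * math.log(n * N / (n_c[c] * n_k[k])) for (c, k), n in n_ck.items())
    NMI = mi / math.sqrt(dc * dk)
    return H, C, V, NMI

# Example: three documents of class "a" and one of class "b"
print(clustering_metrics(["a", "a", "a", "b"], [0, 0, 1, 1]))
```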

Methods for Comparison

We compare EventX to other document clustering methods, as well as methods that utilize both word and document clusters:

• KeyGraph: the KeyGraph algorithm [Sayyadi and Raschid, 2013] clusters documents based on a keyword co-occurrence graph, without the second-layer clustering based on document relationship graphs and the document-pair relationship classifier.

• Co-clustering: this algorithm takes in a term-document matrix and simultaneously maps rows to row-clusters and columns to column-clusters. Then, it calculates the mutual information difference between the current row and column clusters and iteratively updates both clusterings until this difference is below a threshold [Dhillon et al., 2003].

• Word-to-Doc Clustering: this algorithm takes in a joint probability distribution, typically a term-document matrix. It first computes word-clusters to

Table 3.3: Comparing different algorithms on the Chinese News Events (CNE) dataset.

Algorithm | CNE (K = 2331): H C V NMI | CNE (K = 4663): H C V NMI | CNE (K = 9327): H C V NMI

EventX 0.990 0.954 0.972 0.972 | 0.990 0.954 0.972 0.972 | 0.990 0.954 0.972 0.972
KeyGraph 0.811 0.959 0.879 0.882 | 0.811 0.959 0.879 0.882 | 0.811 0.959 0.879 0.882
Co-Clustering 0.962 0.773 0.857 0.863 | 0.960 0.839 0.896 0.898 | 0.958 0.901 0.929 0.929
Word-Doc 0.965 0.820 0.887 0.890 | 0.964 0.891 0.926 0.927 | 0.959 0.974 0.966 0.966
NMF 0.514 0.954 0.668 0.700 | 0.403 0.940 0.564 0.615 | 0.231 0.910 0.368 0.458
K-means+LDA 0.266 0.105 0.151 0.167 | 0.355 0.126 0.186 0.211 | 0.466 0.147 0.224 0.262
K-means+TF-IDF 0.445 0.173 0.250 0.278 | 0.470 0.166 0.245 0.279 | 0.485 0.154 0.234 0.273

Table 3.4: Comparing different algorithms on Chinese News Events Subset 1.

Algorithm | CNE Subset 1 (K = 568): H C V NMI | CNE Subset 1 (K = 1136): H C V NMI | CNE Subset 1 (K = 2272): H C V NMI

EventX 0.973 0.841 0.902 0.905 | 0.973 0.841 0.902 0.905 | 0.973 0.841 0.902 0.905
KeyGraph 0.837 0.846 0.842 0.842 | 0.837 0.846 0.842 0.842 | 0.837 0.846 0.842 0.842
Co-Clustering 0.868 0.777 0.820 0.821 | 0.860 0.850 0.855 0.855 | 0.847 0.911 0.878 0.879
Word-to-Doc 0.897 0.822 0.858 0.859 | 0.876 0.914 0.895 0.895 | 0.850 0.979 0.910 0.912
NMF 0.491 0.913 0.638 0.669 | 0.351 0.931 0.510 0.572 | 0.156 0.714 0.256 0.334
K-means+LDA 0.886 0.351 0.503 0.558 | 0.956 0.340 0.502 0.570 | 0.989 0.314 0.477 0.558
K-means+TF-IDF 0.928 0.363 0.522 0.580 | 0.975 0.338 0.502 0.574 | 0.991 0.313 0.476 0.557

capture the most information about the documents. Relying on this new cluster-document matrix, the algorithm recursively merges the documents into bigger clusters, such that the mutual information loss between document clusters and word clusters is minimized [Slonim and Tishby, 2000].

• Non-negative Matrix Factorization (NMF): this algorithm converts a corpus into a term-document matrix where each entry is the TF-IDF value of a word in a document. The factorized basis vectors correspond to different clusters. By examining the largest component of a document along any of the vectors, we can decide the cluster membership of that document [Xu et al., 2003].

• K-means with LDA: extract the 1000-dimensional LDA vector of each document, and cluster them by the K-means algorithm [Jain, 2010], which is probably the most widely used method for clustering.

• K-means with TF-IDF: represent each document by its TF-IDF vector, and cluster by K-means; a minimal version of this baseline is sketched below.
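The sketch below shows how such a K-means + TF-IDF baseline can be assembled from scikit-learn; it assumes `docs` is a list of already word-segmented document strings, and the 20-iteration cap mirrors the convergence setting stated in the next paragraph.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def kmeans_tfidf_baseline(docs, k):
    """Cluster documents into k groups from their TF-IDF vectors."""
    tfidf = TfidfVectorizer().fit_transform(docs)   # term-document TF-IDF matrix
    return KMeans(n_clusters=k, max_iter=20).fit_predict(tfidf)
```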

For the compared baseline methods, we vary the number of clusters for each dataset. For algorithms that require iterative updates, we set a maximum iteration

Table 3.5: Comparing different algorithms on Chinese News Events Subset 2.

Algorithm | CNE Subset 2 (K = 88): H C V NMI | CNE Subset 2 (K = 176): H C V NMI | CNE Subset 2 (K = 352): H C V NMI

EventX 0.964 0.680 0.798 0.810 | 0.964 0.680 0.798 0.810 | 0.964 0.680 0.798 0.810
KeyGraph 0.691 0.767 0.717 0.728 | 0.691 0.767 0.717 0.728 | 0.691 0.767 0.717 0.728
Co-Clustering 0.783 0.728 0.755 0.755 | 0.750 0.801 0.775 0.775 | 0.723 0.867 0.788 0.792
Word-to-Doc 0.886 0.765 0.821 0.823 | 0.805 0.865 0.834 0.835 | 0.731 0.927 0.817 0.823
NMF 0.753 0.853 0.800 0.802 | 0.801 0.809 0.805 0.805 | 0.745 0.802 0.773 0.773
K-means+LDA 0.783 0.430 0.555 0.581 | 0.890 0.408 0.559 0.602 | 0.940 0.380 0.541 0.598
K-means+TF-IDF 0.857 0.469 0.606 0.634 | 0.928 0.424 0.582 0.627 | 0.959 0.379 0.543 0.603

Table 3.6: Comparing different algorithms on 20 Newsgroups dataset.

Algorithm | 20 Newsgroups (K = 10): H C V NMI | 20 Newsgroups (K = 20): H C V NMI | 20 Newsgroups (K = 40): H C V NMI

EventX 0.996 0.340 0.507 0.582 | 0.996 0.340 0.507 0.582 | 0.996 0.340 0.507 0.582
KeyGraph 0.995 0.340 0.507 0.581 | 0.995 0.340 0.507 0.581 | 0.995 0.340 0.507 0.581
Co-Clustering 0.421 0.318 0.362 0.366 | 0.374 0.368 0.371 0.371 | 0.351 0.423 0.384 0.386
Word-to-Doc 0.578 0.306 0.400 0.421 | 0.540 0.390 0.453 0.459 | 0.515 0.464 0.488 0.489
NMF 0.315 0.529 0.395 0.408 | 0.396 0.468 0.429 0.431 | 0.456 0.413 0.433 0.434
K-means+LDA 0.245 0.352 0.368 0.293 | 0.333 0.352 0.342 0.342 | 0.321 0.279 0.298 0.299
K-means+TF-IDF 0.350 0.282 0.312 0.194 | 0.368 0.264 0.308 0.159 | 0.332 0.301 0.316 0.236

of 20 (which proved to be enough for convergence). Notice that our EventX algorithm and the KeyGraph algorithm do not require a pre-defined number of clusters. The EventX algorithm needs to pre-train a document relationship classifier. For the Chinese News Events dataset and its subsets, we trained a document-pair relationship classifier using 5,000 labeled news pairs collected from the web from January 01, 2017 to April 04, 2017. For the 20 Newsgroups dataset, we used the standard training set (the train/test split is the same as in http://csmining.org/index.php/id-20-newsgroups.html) to construct a document-pair relationship dataset and trained a classifier, then evaluated the algorithms on the test set.

Comparison with Existing Methods

Tables 3.3 to 3.6 show the performance of different algorithms on the four datasets with different numbers of clusters $K$. As we can see, our approach achieves the best V-measure and NMI scores in most cases. Our method also achieves consistently high homogeneity scores across all datasets. This implies that most event clusters we obtain are pure: each cluster only contains documents that talk about the same event. In comparison, the homogeneity of other methods varies widely across datasets. The reason is that we adopt two layers of graph-based clustering algorithms to extract events at a

more appropriate granularity. For completeness, our approach scores a little lower than KeyGraph, which is more or less expected: as we further split the topic clusters in the second-layer document relationship graph, it is more likely for documents labeled as one class to appear in different clusters, simply because with a two-layered clustering scheme we have more clusters. Considering the significant improvements in homogeneity, we believe the slightly lower completeness score does not strongly affect the real-world performance of our model.

The performance of the Word-to-Doc Clustering algorithm is slightly better than our algorithm on Chinese News Events Subset 2. However, our algorithm does not need the number of clusters ahead of time, while the performance of the Word-to-Doc Clustering algorithm is highly dependent on this parameter. For real-world news data, the event size distribution is highly skewed, like the Chinese News Events full dataset rather than the subsets.

Considering the results on the 20 Newsgroups dataset, our algorithm achieves better performance than the other algorithms in terms of Homogeneity, V-measure score and NMI. We can also observe that the completeness of our algorithm is relatively low on the 20 Newsgroups dataset, while the homogeneity is still high. The reason is that the documents in the 20 Newsgroups dataset are grouped according to topics rather than events. Therefore, two documents belonging to the same cluster may have entirely different keywords, and our algorithm will assign them two different cluster labels. For the same reason, we can see that the performance of KeyGraph is similar to EventX on the 20 Newsgroups dataset: after the first-layer keyword co-occurrence graph based clustering, the clusters are already pure, and the second-layer document relationship graph based clustering would not further split them.

Influence of Parameters

The main parameter that influences the final clustering result of EventX is the threshold δ used to assign each document to a keyword subgraph in the first-layer clustering step. Fig. 3.7 shows its influence on the clustering performance on the Chinese News Events dataset. As we can see, the performance is quite robust when δ ≤ 0.3. If we keep increasing δ, the number of clusters increases, leading to an increase in homogeneity and a decrease in completeness. The reason is that, when δ is within a reasonable range (usually between 0.1 and 0.3), the performance of our algorithm is insensitive to it, as the first-layer keyword co-occurrence graph based clustering will split documents into topics, and the value of δ will only influence the granularity of the clustered topics. The events within each topic will not be divided, and they will be extracted by the second-layer clustering procedure. If δ keeps increasing, at some point events will split in the first layer, resulting in an increase in homogeneity and a decrease in completeness.


(a) Clustering performance (b) Number of clusters

Figure 3.7: The influence of parameter δ on the clustering performance and the number of clusters on the Chinese News Events dataset.

Figure 3.8: The number of documents on different days in the Story Forest evaluation dataset.

For other parameters, such as $\delta_e$, $\delta_p$ and $\delta_g$ defined in Sec. 3.3.2, which influence the construction of the keyword co-occurrence graph and its subgraphs, our event extraction algorithm is likewise insensitive as long as they lie within a reasonable range, for the same reason as with δ.

3.4.3 Evaluation of Story Forest

Now we evaluate the complete Story Forest system. Fig. 3.8 shows the number of documents on different days in the larger 60 GB dataset. The average number of documents per day during that period is 164,922. For the following experiments, we use the data in the first 7 days for parameter tuning; the remaining data serves as the test set. We test different event timeline and story generation algorithms on the large 3-month news dataset through a pilot user evaluation. To make fair comparisons, the

Method: Tree Flat Thread Timeline Graph
Correct edges: 82.8% 73.7% 66.8% 58.3% 32.9%
Consistent paths: 77.4% 50.1% 29.9% − −
Best structure: 187 88 84 52 19

Table 3.7: Comparing different story structure generation algorithms.

same preprocessing and event extraction procedures before developing story structures are adopted for all methods, with 261 stories detected from the dataset. The only difference is how to construct the story structure given a set of event nodes. We compare our online Story Forest system with the following existing algorithms:

• Flat Cluster (Flat): this method clusters related events into a story without revealing the relationships between events, which approximates some previous works in TDT [Yang et al., 2002; Allan et al., 1998].

• Story Timeline (Timeline): this method organizes events linearly according to the timestamps of events [Sayyadi et al., 2009; Sayyadi and Raschid, 2013].

• Story Graph (Graph): this method calculates a connection strength for every pair of events and connects the pair if the score exceeds a threshold [Yang et al., 2009].

• Event Threading (Thread): this algorithm appends each new event to its most similar earlier event [Nallapati et al., 2004]. The similarity between two events is measured by the TF-IDF cosine similarity of the event centroids.

We enlisted 10 human reviewers, including product managers, software engineers and senior undergraduate students, to blindly evaluate the results given by the different approaches. Each individual story was reviewed by 3 different reviewers. When the reviewers’ opinions differed, they discussed to reach a final verdict. For each story, the reviewers answered the following questions for each of the 5 different structures generated by the different schemes:

1. Do all the documents in each story cluster truly talk about the same story (yes or no)? Continue if yes.

2. Do all the documents in each event node truly talk about the same event (yes or no)? Continue if yes.

(a) Percentage of incorrect edges

(b) Percentage of inconsistent paths

(c) Number of times rated as the most readable structure

Figure 3.9: Comparing the performance of different story structure generation algorithms.

(a) Histogram of the number of events in each story

(b) Histogram of the number of paths in each story

(c) Numbers of different story structures

Figure 3.10: The characteristics of the story structures generated by the Story Forest system.

3. For each story structure given by different algorithms, how many edges correctly represent the event connections?

4. For each story structure given by Story Forest, event threading and story timeline, how many paths from ROOT to any leaf node exist in the graph? And how many such paths are logically coherent?

5. Which algorithm generates the structure that is the best in terms of revealing the story’s underlying logical structure?

Note that for question (3), the total number of edges in each tree equals the number of events in that tree. Therefore, to make a fair comparison, for the story graph algorithm, we only retain the $n$ edges with the top scores, where $n$ is the number of events in that story graph.

We first report the clustering effectiveness of our system in the pilot user evaluation on the 3-month dataset. Among the 261 stories, 234 of them are pure story clusters (yes to question 1), and furthermore there are 221 stories that contain only pure event nodes (yes to question 2). Therefore, the final accuracy of event extraction (yes to both questions 1 and 2) is 84.7%.

Next, we compare the output story structures given by different algorithms from three aspects: the correctness of edges between events, the logical coherence of paths, and the overall readability of the different story structures. Fig. 3.9(a) compares the CDFs of incorrect edge percentages under different algorithms. As we can see, Story Forest significantly outperforms the other 4 baseline approaches. As shown in Table 3.7, for 58% of the story trees, all the edges in each tree are judged correct, and the average percentage of correct edges over all the story trees is 82.8%. In contrast, the average correct edge percentage given by the story graph algorithm is 32.9%. An interesting observation is that the average percentage of correct edges given by the simple flat structure, which is a special case of our tree structures, is 73.7%. This can be explained by the fact that most real-world breaking news, lasting for a constrained time period, is not as complicated as a novel with a rich logical structure, and a flat structure is often enough to depict its underlying logic. However, for stories with richer structures and a relatively longer timeline, Story Forest gives better results than other algorithms by comprehensively considering event similarity, path coherence and time gaps, while the other algorithms consider only some of these factors.

For path coherence, Fig. 3.9(b) shows the CDFs of the percentages of inconsistent paths under different algorithms. Story Forest gives significantly more coherent paths: the average percentage of coherent paths is 77.4% for our algorithm, and is 50.1% and

Figure 3.11: The running time of our system on the 3-month news dataset.

29.9%, respectively, for event threading and story timeline. Note that path coherence is meaningless for the flat and graph structures.

Fig. 3.9(c) plots the overall readability of different story structures. Among the 221 stories, the tree-based Story Forest system gives the best readability on 187 stories, which is much better than all other approaches. Different algorithms can generate the same structure; for example, the Story Forest system can also generate a flat structure, a timeline, or the same structure as the event threading algorithm does. Therefore, the sum of the numbers of best results given by the different approaches is bigger than 221. It is worth noting that the flat and timeline algorithms also give 88 and 52 most readable results, respectively, which again indicates that the logical structures of a large portion of real-world news stories can be characterized by simple flat or timeline structures, which are special cases of story trees, and that complex graphs are often an overkill.

We further inspect the story structures generated by Story Forest. Fig. 3.10(a) and Fig. 3.10(b) plot the distributions of the number of events and the number of paths in each story tree, respectively. The average numbers of events and paths are 4.07 and 2.71, respectively. Although the tree structure includes the flat and timeline structures as special cases, among the 221 stories, Story Forest generates 77 flat structures and 54 timelines, while the remaining 90 structures generated are still story trees. This implies that Story Forest is versatile and can generate diverse structures for real-world news stories, depending on the logical complexity of each story.

3.4.4 Algorithm Complexity and System Overhead

In this section, we discuss the complexity of each step in Story Forest. For a time slot (e.g., one day in our case), let $N_d$ be the number of documents, $N_w$ the number of unique words in the corpora (note $N_w \ll N_d$), $N_e$ the number of different events, $N_s$ the number of different stories, and $N_k$ the maximum number of unique keywords in a document.

As discussed in [Sayyadi and Raschid, 2013], building the keyword graph requires $O(N_d N_k + N_w^2)$ complexity, and community detection based on betweenness centrality requires $O(N_w^3)$. The complexity of assigning documents to keyword communities is $O(N_d N_k N_e)$. So by far, the total complexity is $O(N_d N_k N_e + N_w^3)$. There exist other community detection algorithms requiring only $O(N_w^2)$, such as the algorithm in [Radicchi et al., 2004]; thus we can further improve efficiency by using faster community detection algorithms.

After clustering documents by keyword communities, the average number of documents per cluster is $N_d/N_e$. The pair-wise document relationship classification is implemented in $O((N_d/N_e)^2)$, and the complexity of the subsequent document graph splitting operation is $O((N_w/N_e)^3)$. Therefore, the total complexity is $O(N_e((N_d/N_e)^2 + (N_w/N_e)^3))$. Our experiments show that usually $1 \le N_d/N_e \le 100$; combining this with $N_w \ll N_d$, the complexity is now approximately $O(N_e)$.

To grow story trees with new events, the complexity of finding the related story tree for each event is $O(N_s T)$, where $T$ is the history length for keeping existing stories and deleting older ones. If there is no existing related story, creating a new story requires $O(1)$ operations; otherwise, the complexity of updating a story tree is $O(T N_e/N_s)$. In summary, the complexity of growing story trees is $O(N_e T (N_s + N_e/N_s)) \approx O(T N_e N_s)$, as our experience on the Tencent news dataset shows that $1 \le N_e/N_s \le 200$. Our online algorithm to update a story structure requires $O(N_e/N_s)$ complexity and delivers a consistent story development structure, while most existing offline optimization based story structure algorithms require at least $O((N_e/N_s)^2)$ complexity and disrupt the previously generated story structures.

Fig. 3.11 shows the running time of our Story Forest system on the 3-month news dataset. The average time for processing one day's news is around 26 seconds, and the cumulative time increases linearly with the number of days. For the offline keyword extraction module, the processing efficiency is approximately 50 documents per second; its performance is consistent over time and does not require frequent retraining. The LDA model is incrementally retrained every day to handle new words. Beyond keyword extraction, the efficiency of event clustering and story structure generation can be further improved by a parallel implementation.

3.5 Concluding Remarks and Future Works

In this chapter, we describe our experience of implementing Story Forest, a news content organization system at Tencent, which is designed to discover events from vast streams of trending and breaking news and organize events into sensible story trees in an online manner. Our system is specifically tailored for fast processing of massive amounts of breaking news data, whose story structures can most likely be captured by either a tree, a timeline or a flat structure. We proposed EventX, a semi-supervised text clustering algorithm designed to efficiently extract events from vast news corpora. EventX is a two-layered, graph-based document clustering scheme. In the first layer, we leverage keyword co-occurrence graphs to cluster documents into topics, while in the second layer, we utilize a novel document relationship graph constructed through supervised learning to further split each topic into events, leading to an overall semi-supervised learning procedure. Our system further organizes the events into story trees with efficient online algorithms upon the arrival of daily news data.

We conducted extensive performance evaluation, through detailed pilot user experience studies, based on 60 GB of real-world (Chinese) news data, although our ideas are not language-dependent and can easily be extended to other languages. To stimulate future research in event extraction, we also created a large Chinese News Events dataset that contains 11,748 long news articles collected from all major Internet news portals in China from October 01, 2016 to November 30, 2016, with ground truth event labels. We evaluated EventX on the Chinese News Events dataset, as well as the 20 Newsgroups English dataset. Extensive results suggest that our clustering procedure is significantly more effective at accurate event extraction than existing algorithms. For event extraction, the V-measure score and NMI of the clustering results on the Chinese News Events dataset are both 0.972, demonstrating the effectiveness of EventX at extracting conceptually clean event clusters from vast and unstructured Internet news data. For story generation, 83% of the event links generated by Story Forest are logically correct, as compared to an accuracy of 33% for the more complex story graphs, demonstrating the ability of our system to organize trending news events into a logical structure that appeals to human readers.

In our future work, we will extend our proposed Story Forest and EventX framework to make it applicable to a continuous stream of news documents. Currently, we can use EventX in a batch-based manner: for example, we can extract the events for each batch of documents collected in different hours, and merge a new event with an existing event if they are talking about the same one. We can further improve our

framework by maintaining a dynamic keyword graph, performing dynamic keyword community detection, and running incremental event extraction based on such a dynamic keyword graph. Whenever a new document stream comes in, we update the keyword graph to find new keyword communities, and cluster new documents into new events or assign them to existing events. In this way, our framework can be directly applied to a continuous stream of news documents, rather than processing them in batches.

Chapter 4

Matching Article Pairs with Graphical Decomposition and Convolutions

The EventX algorithm we proposed in the Story Forest system requires us to classify the relationship between a pair of articles and construct a document graph for the second-layer clustering. Identifying the relationship between two articles, e.g., whether two articles published by different sources describe the same breaking news, is critical to many document understanding tasks. However, it is a challenging task. Existing approaches for modeling and matching sentence pairs do not perform well in matching longer documents, which embody more complex interactions between the enclosed entities than a sentence does. To model article pairs, we propose the Concept Interaction Graph to represent an article as a graph of concepts. We then match a pair of articles by comparing the sentences that enclose the same concept vertex through a series of encoding techniques, and aggregate the matching signals through a graph convolutional network. To facilitate the evaluation of long article matching, we have created two datasets, each consisting of about 30K pairs of breaking news articles covering diverse topics in the open domain. Extensive evaluations of the proposed methods on the two datasets demonstrate significant improvements over a wide range of state-of-the-art methods for natural language matching.

4.1 Introduction

Identifying the relationship between a pair of articles is an essential natural language understanding task, which is critical to news systems and search engines. For example, a news system needs to cluster various articles on the Internet reporting

the same breaking news (probably with different wording and narratives), remove redundancy and form storylines [Shahaf et al., 2013; Liu et al., 2017; Zhou et al., 2015; Vossen et al., 2015; Bruggermann et al., 2016]. The rich semantic and logic structures in longer documents have made matching a pair of articles a different and more challenging task than matching a pair of sentences or a query-document pair in information retrieval.

Traditional term-based matching approaches estimate the semantic distance between a pair of text objects via unsupervised metrics, e.g., via TF-IDF vectors, BM25 [Robertson et al., 2009], LDA [Blei et al., 2003] and so forth. These methods have achieved success in query-document matching, information retrieval and search. In recent years, a wide variety of deep neural network models have also been proposed for text matching [Hu et al., 2014; Qiu and Huang, 2015; Wan et al., 2016; Pang et al., 2016], which can capture the semantic dependencies (especially sequential dependencies) in natural language through layers of recurrent or convolutional neural networks. However, existing deep models are mainly designed for matching sentence pairs, e.g., for paraphrase identification or answer selection in question-answering, omitting the complex interactions among keywords, entities or sentences that are present in a longer article. Therefore, article pair matching remains under-explored in spite of its importance.

In this chapter, we apply the divide-and-conquer philosophy to matching a pair of articles and bring deep text understanding from the currently dominating sequential modeling of language elements to a new level of graphical document representation, which is more suitable for longer articles. Specifically, we have made the following contributions:

First, we propose the so-called Concept Interaction Graph (CIG) to represent a document as a weighted graph of concepts, where each concept vertex is either a keyword or a set of tightly connected keywords. The sentences in the article associated with each concept serve as the features for local comparison to the same concept appearing in another article. Furthermore, two concept vertices in an article are also connected by a weighted edge which indicates their interaction strength. The CIG not only captures the essential semantic units in a document but also offers a way to perform anchored comparison between two articles along the common concepts found.

Second, we propose a divide-and-conquer framework to match a pair of articles based on the constructed CIGs and graph convolutional networks (GCNs). The idea is that for each concept vertex that appears in both articles, we first obtain the local matching vectors through a range of text pair encoding schemes, including both

neural encoding and term-based encoding. We then aggregate the local matching vectors into the final matching result through graph convolutional layers [Kipf and Welling, 2016; Defferrard et al., 2016]. In contrast to RNN-based sequential modeling, our model factorizes the matching process into local matching sub-problems on a graph, each focusing on a different concept, and, by using GCN layers, generates matching results based on a holistic view of the entire graph.

Although there exist many datasets for sentence matching, semantic matching between longer articles is a largely unexplored area. To the best of our knowledge, to date, there does not exist a labeled public dataset for long document matching. To facilitate evaluation and further research on document and especially news article matching, we have created two labeled datasets (our code and datasets are available at https://github.com/BangLiu/ArticlePairMatching), one annotating whether two news articles found on the Internet (from different media sources) report the same breaking news event, while the other annotating whether they belong to the same news story (yet not necessarily reporting the same breaking news event). These articles were collected from major Internet news providers in China, including Tencent, Sina, WeChat, Sohu, etc., covering diverse topics, and were labeled by professional editors. Note that, similar to most other natural language matching models, all the approaches proposed in this chapter can easily work on other languages as well.

Through extensive experiments, we show that our proposed algorithms achieve significant improvements on matching news article pairs, as compared to a wide range of state-of-the-art methods, including both term-based and deep text matching algorithms. With the same encoding or term-based feature representation of a pair of articles, our approach based on graphical decomposition and convolutions can improve the classification accuracy by 17.31% and 23.09% on the two datasets, respectively.

4.2 Concept Interaction Graph

In this section, we present our Concept Interaction Graph (CIG) to represent a document as an undirected weighted graph, which decomposes a document into subsets of sentences, each subset focusing on a different concept. Given a document $\mathcal{D}$, a CIG is a graph $G_{\mathcal{D}}$, where each vertex in $G_{\mathcal{D}}$ is called a concept, which is a keyword or a set of highly correlated keywords in document $\mathcal{D}$. Each sentence in $\mathcal{D}$ will be attached to the single concept vertex that it is most related to, which most frequently is the concept the sentence mentions. Hence, vertices will have their own sentence sets, which are disjoint. The weight of the edge between a pair of concepts denotes how much the two concepts are related to each other and can be determined in various

Text: [1] Rick asks Morty to travel with him in the universe. [2] Morty doesn't want to go as Rick always brings him dangerous experiences. [3] However, the destination of this journey is the Candy Planet, which is a fascinating place that attracts Morty. [4] The planet is full of delicious candies. [5] Summer wishes to travel with Rick. [6] However, Rick doesn't like to travel with Summer.

Concept Interaction Graph: vertex (Rick, Morty) with sentences [1, 2]; vertex (Rick, Summer) with sentences [5, 6]; vertex (Morty, Candy Planet) with sentences [3, 4].

Figure 4.1: An example to show a piece of text and its Concept Interaction Graph representation.


(a) Representation (b) Encoding (c) Transformation (d) Aggregation

Figure 4.2: An overview of our approach for constructing the Concept Interaction Graph (CIG) from a pair of documents and classifying it by Graph Convolutional Networks.

ways. As an example, Fig. 4.1 illustrates how we convert a document into a Concept Interaction Graph. We can extract the keywords Rick, Morty, Summer, and Candy Planet from the document using standard keyword extraction algorithms, e.g., TextRank [Mihalcea and Tarau, 2004]. These keywords are further clustered into three concepts, where each concept is a subset of highly correlated keywords. After grouping keywords into concepts, we attach each sentence in the document to its most related concept vertex. For example, in Fig. 4.1, sentences 1 and 2 are mainly talking about the relationship between Rick and Morty, and are thus attached to the concept (Rick, Morty). Other sentences are attached to vertices in a similar way. The attachment of sentences to concepts naturally dissects the original document into multiple disjoint sentence subsets. As a result, we have represented the original document with a graph of key concepts, each with a sentence subset, as well as the interaction topology among them.

Fig. 4.2(a) illustrates the construction of CIGs for a pair of documents aligned by the discovered concepts. Here we first describe the detailed steps to construct a CIG for a single document:

KeyGraph Construction. Given a document $\mathcal{D}$, we first extract the named entities and keywords by TextRank [Mihalcea and Tarau, 2004]. After that, we construct a keyword co-occurrence graph, called KeyGraph, based on the set of found keywords. Each keyword is a vertex in the KeyGraph. We connect two keywords by an edge if they co-occur in the same sentence. We could further improve our model by performing co-reference resolution and synonym analysis to merge keywords with the same meaning; however, we do not apply these operations due to their time complexity.

Concept Detection (Optional). The structure of the KeyGraph reveals the connections between keywords. If a subset of keywords is highly correlated, it will form a densely connected sub-graph in the KeyGraph, which we call a concept. Concepts can be extracted by applying community detection algorithms on the constructed KeyGraph. Community detection is able to split a KeyGraph $G_{key}$ into a set of communities $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_{|\mathcal{C}|}\}$, where each community $\mathcal{C}_i$ contains the keywords for a certain concept. By using overlapping community detection, each keyword may appear in multiple concepts. As the number of concepts varies a lot across documents, we utilize the betweenness centrality score based algorithm [Sayyadi and Raschid, 2013] to detect keyword communities in the KeyGraph. Note that this step is optional, i.e., we can also use each keyword directly as a concept. The benefit brought by concept detection is that it reduces the number of vertices in a graph and speeds up matching, as will be shown in Sec. 4.4.

Sentence Attachment. After the concepts are discovered, the next step is to group sentences by concepts. We calculate the cosine similarity between each sentence and each concept, where sentences and concepts are represented by TF-IDF vectors. We assign each sentence to the concept that is most similar to it. Sentences that do not match any concepts in the document are attached to a dummy vertex that does not contain any keywords.

Edge Construction. To construct edges that reveal the correlations between different concepts, for each vertex we represent its sentence set as the concatenation of the sentences attached to it, and calculate the edge weight between any two vertices as the TF-IDF similarity between their sentence sets. Although edge weights may be decided in other ways, our experience shows that constructing edges by TF-IDF similarity generates a CIG that is more densely connected.

When performing article pair matching, the above steps are applied to a pair of documents $\mathcal{D}_A$ and $\mathcal{D}_B$, as shown in Fig. 4.2(a). The only additional step is that we align the CIGs of the two articles by the concept vertices and, for each common concept vertex, merge the sentence sets from $\mathcal{D}_A$ and $\mathcal{D}_B$ for local comparison.
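To make the pipeline concrete, here is a highly simplified sketch of CIG construction for one document, using each keyword directly as a concept (i.e., skipping the optional community detection step) and omitting the dummy vertex for unmatched sentences. The helper names, the substring test for co-occurrence and the use of scikit-learn's TF-IDF utilities are our own illustrative choices.

```python
import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_cig(sentences, keywords):
    """sentences: list of sentence strings; keywords: iterable of keyword strings."""
    # 1) KeyGraph: connect keywords co-occurring in the same sentence.
    g = nx.Graph()
    g.add_nodes_from(keywords)
    for s in sentences:
        present = [w for w in keywords if w in s]  # crude co-occurrence test
        g.add_edges_from(itertools.combinations(present, 2))
    concepts = list(g.nodes)  # 2) each keyword used directly as a concept
    # 3) Sentence attachment: assign each sentence to its most similar concept.
    vec = TfidfVectorizer().fit(sentences + concepts)
    sims = cosine_similarity(vec.transform(sentences), vec.transform(concepts))
    attached = {c: [] for c in concepts}
    for i, s in enumerate(sentences):
        attached[concepts[sims[i].argmax()]].append(s)
    # 4) Edge weights: TF-IDF similarity between concatenated sentence sets.
    cig = nx.Graph()
    texts = {c: " ".join(attached[c]) for c in concepts}
    for a, b in itertools.combinations(concepts, 2):
        if texts[a] and texts[b]:
            w = cosine_similarity(vec.transform([texts[a]]),
                                  vec.transform([texts[b]]))[0, 0]
            if w > 0:
                cig.add_edge(a, b, weight=w)
    return cig, attached
```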

4.3 Article Pair Matching through Graph Convolutions

Given the merged CIG $G_{AB}$ of two documents $\mathcal{D}_A$ and $\mathcal{D}_B$ described in Sec. 4.2, we match a pair of articles in a “divide-and-conquer” manner by matching the sentence sets from $\mathcal{D}_A$ and $\mathcal{D}_B$ associated with each concept and aggregating the local matching results into a final result through multiple graph convolutional layers. Our approach overcomes the limitation of previous text matching algorithms by extending text representation from a sequential (or grid) point of view to a graphical view, and can therefore better capture the rich semantic interactions in longer text.

Fig. 4.2 illustrates the overall architecture of our proposed method, which consists of four steps: (a) representing a pair of documents by a single merged CIG, (b) learning multi-viewed matching features for each concept vertex, (c) structurally transforming local matching features by graph convolutional layers, and (d) aggregating local matching features to get the final result. Steps (b) to (d) can be trained end-to-end.

Encoding Local Matching Vectors. Given the merged CIG $G_{AB}$, our first step is to learn an appropriate matching vector of a fixed length for each individual concept $v \in G_{AB}$ to express the semantic similarity between $\mathcal{S}_A(v)$ and $\mathcal{S}_B(v)$, the sentence sets of concept $v$ from documents $\mathcal{D}_A$ and $\mathcal{D}_B$, respectively. This way, the matching of two documents is converted to matching the pair of sentence sets on each vertex of $G_{AB}$. Specifically, we generate local matching vectors based on both neural networks and term-based techniques.

Siamese Encoder: we apply a Siamese neural network encoder [Neculoiu et al., 2016] onto each vertex $v \in G_{AB}$ to convert the word embeddings [Mikolov et al., 2013] of $\{\mathcal{S}_A(v), \mathcal{S}_B(v)\}$ into a fixed-sized hidden feature vector $m_{AB}(v)$, which we call the match vector.

We use a Siamese structure to take $\mathcal{S}_A(v)$ and $\mathcal{S}_B(v)$ (which are two sequences of word embeddings) as inputs, and encode them into two context vectors through context layers that share the same weights, as shown in Fig. 4.2(b). The context layer usually contains one or multiple bi-directional LSTM (BiLSTM) or CNN layers with max pooling layers, aiming to capture the contextual information in $\mathcal{S}_A(v)$ and $\mathcal{S}_B(v)$.

Let $c_A(v)$ and $c_B(v)$ denote the context vectors obtained for $\mathcal{S}_A(v)$ and $\mathcal{S}_B(v)$, respectively. Then, the matching vector $m_{AB}(v)$ for vertex $v$ is given by the subsequent aggregation layer, which concatenates the element-wise absolute difference and the element-wise multiplication of the two context vectors, i.e.,

$$m_{AB}(v) = \big(|c_A(v) - c_B(v)|,\; c_A(v) \circ c_B(v)\big), \quad (4.1)$$
where $\circ$ denotes the Hadamard product.

Term-based Similarities: we also generate another matching vector for each $v$ by directly calculating term-based similarities between $\mathcal{S}_A(v)$ and $\mathcal{S}_B(v)$, based on 5 metrics: the TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, Jaccard similarity of 1-grams, and the Ochiai similarity measure. These similarity scores are concatenated into another matching vector $m'_{AB}(v)$ for $v$, as shown in Fig. 4.2(b).

Matching Aggregation via GCN. The local matching vectors must be aggregated into a final matching score for the pair of articles. We propose to utilize the ability of Graph Convolutional Network (GCN) filters [Kipf and Welling, 2016] to capture the patterns exhibited in the CIG $G_{AB}$ at multiple scales. In general, the input to the GCN is a graph $G = (\mathcal{V}, E)$ with $N$ vertices $v_i \in \mathcal{V}$, and edges $e_{ij} = (v_i, v_j) \in E$ with weights $w_{ij}$. The input also contains a vertex feature matrix denoted by $X = \{x_i\}_{i=1}^{N}$, where $x_i$ is the feature vector of vertex $v_i$. For a pair of documents $\mathcal{D}_A$ and $\mathcal{D}_B$, we input their CIG $G_{AB}$ (with $N$ vertices) with a (concatenated) matching vector on each vertex into the GCN, such that the feature vector of vertex $v_i$ is given by

$$x_i = \big(m_{AB}(v_i),\; m'_{AB}(v_i)\big).$$

Now let us briefly describe the GCN layers [Kipf and Welling, 2016] used in Fig. 4.2(c). Denote the weighted adjacency matrix of the graph as $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = w_{ij}$ (in a CIG, it is the TF-IDF similarity between vertices $i$ and $j$). Let $D$ be a diagonal matrix such that $D_{ii} = \sum_j A_{ij}$. The input layer to the GCN is $H^{(0)} = X$, which contains the original vertex features. Let $H^{(l)} \in \mathbb{R}^{N \times M_l}$ denote the matrix of hidden representations of the vertices in the $l$-th layer. Then each GCN layer applies the following graph convolutional filter onto the previous hidden representations:

$$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big), \quad (4.2)$$
where $\tilde{A} = A + I_N$, $I_N$ is the identity matrix, and $\tilde{D}$ is a diagonal matrix such that $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; they are the adjacency matrix and the degree matrix of graph $G$ with self-loops added, respectively.

$W^{(l)}$ is the trainable weight matrix of the $l$-th layer, and $\sigma(\cdot)$ denotes an activation function such as the sigmoid or ReLU function. Such a graph convolution rule is motivated by the first-order approximation of localized spectral filters on graphs [Kipf and Welling, 2016] and, when applied recursively, can extract interaction patterns among vertices. Finally, the hidden representations in the final GCN layer are merged into a single vector of a fixed length (called the graphically merged matching vector), denoted by $m_{AB}$, by taking the mean of the hidden vectors of all vertices in the last layer.

The final matching score is computed based on $m_{AB}$ through a classification network, e.g., a multi-layered perceptron (MLP).
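A minimal PyTorch implementation of the layer rule in Eq. (4.2), together with the mean pooling and the MLP classification described above, might look as follows. This is our own illustrative sketch, not the released implementation; the two-layer depth and the 128/16-dimensional hidden sizes are assumptions borrowed from the CNSS configuration in Sec. 4.4.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution implementing Eq. (4.2)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)   # W^(l)

    def forward(self, a, h):
        a_tilde = a + torch.eye(a.size(0))                # A~ = A + I_N
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)         # diagonal of D~^{-1/2}
        norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.w(norm @ h))               # sigma(D~^-1/2 A~ D~^-1/2 H W)

class CIGMatcher(nn.Module):
    """Two GCN layers, mean pooling to m_AB, then an MLP classifier."""
    def __init__(self, feat_dim, hidden=128, out=16):
        super().__init__()
        self.gcn1, self.gcn2 = GCNLayer(feat_dim, hidden), GCNLayer(hidden, out)
        self.clf = nn.Sequential(nn.Linear(out, 16), nn.ReLU(),
                                 nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, a, x):        # a: N x N adjacency, x: N x feat_dim features
        h = self.gcn2(a, self.gcn1(a, x))
        m_ab = h.mean(dim=0)        # graphically merged matching vector
        return self.clf(m_ab)       # matching probability
```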

In addition to the graphically merged matching vector $m_{AB}$ described above, we

may also append other global matching features to $m_{AB}$ to expand the feature set. These additional global features can be calculated, e.g., by encoding the two documents directly with state-of-the-art language models like BERT [Devlin et al., 2018] or by directly computing their term-based similarities. However, we show in Sec. 4.4 that such global features hardly bring any more benefit to our scheme, as the graphically merged matching vectors are already sufficiently expressive in our problem.

4.4 Evaluation

Tasks. We evaluate the proposed approach on the task of identifying whether a pair of news articles report the same breaking news (or event) and whether they belong to the same series of news story, which is motivated by a real-world news app. In fact, the proposed article pair matching schemes have been deployed in the anonymous news app for news clustering, serving more than 110 million daily active users.

Note that traditional methods for document clustering include unsupervised text clustering and text classification into predefined topics. However, a large number of breaking news articles emerge on the Internet every day with their topics/themes unknown, so it is not possible to predefine their topics. Thus, supervised text classification cannot be used here. It is even impossible to determine how many news clusters there exist. Therefore, the task of classifying whether two news articles are reporting the same breaking news event or belong to the same story is critical to news apps and search engines for clustering, redundancy removal and topic summarization.

In our task, an “event” refers to a piece of breaking news on which multiple media sources may publish articles with different narratives and wording. Furthermore, a “story” consists of a series of logically related breaking news events. It is worth noting

Figure 4.3: The events contained in the story “2016 U.S. presidential election”.

that our objective is fundamentally different from the traditional event coreference literature, e.g., [Bejan and Harabagiu, 2010; Lee et al., 2013; Lee et al., 2012] or SemEval-2018 Task 5 (Counting Events) [Postma et al., 2018], where the task is to detect all the events (or, in fact, “actions” like shootings or car crashes) a document mentions. In contrast, although a news article may mention multiple entities and even previous physical events, the “event” in our dataset always refers to the breaking news that the article intends to report or the incident that triggers the media’s coverage. And our task is to identify whether two articles intend to report the same breaking news. For example, two articles “University of California system libraries break off negotiations with Elsevier, will no longer order their journals” and “University of California Boycotts Publishing Giant Elsevier” from two different sources are apparently intended to report the same breaking news event of UC dropping its subscription to Elsevier, although other actions may be peripherally mentioned in these articles, e.g., “eight months of unsuccessful negotiations.” In addition, we do not attempt to perform reading comprehension question answering tasks either, e.g., finding out how many killing incidents or car crashes there are in a year (SemEval-2018 Task 5 [Postma et al., 2018]).

Dataset | Pos Samples | Neg Samples | Train | Dev | Test
CNSE | 12865 | 16198 | 17438 | 5813 | 5812
CNSS | 16887 | 16616 | 20102 | 6701 | 6700

Table 4.1: Description of evaluation datasets.

As a typical example, Fig. 4.3 shows the events contained in the story 2016 U.S. presidential election, where each tag shows a breaking news event possibly reported by multiple articles with different narratives (articles not shown here). We group highly coherent events together. For example, there are multiple events about Election television debates. One of our objectives is to identify whether two news articles report the same event, e.g., a yes when they are both reporting Trump and Hillary’s first television debate, though with different wording, or a no, when one article is reporting Trump and Hillary’s second television debate while the other is talking about Donald Trump being elected president.

Datasets. To the best of our knowledge, there is no publicly available dataset for long document matching tasks. We created two datasets: the Chinese News Same Event dataset (CNSE) and the Chinese News Same Story dataset (CNSS), which are labeled by professional editors. They contain long Chinese news articles collected from major Internet news providers in China, covering diverse topics in the open domain. The CNSE dataset contains 29,063 pairs of news articles with labels representing whether a pair of news articles are reporting the same breaking news event. Similarly, the CNSS dataset contains 33,503 pairs of articles with labels representing whether two documents fall into the same news story. The average number of words over all documents in the datasets is 734 and the maximum value is 21,791.

In our datasets, we only labeled the major event (or story) that a news article is reporting, since in the real world, each breaking news article on the Internet must be intended to report some specific breaking news that has just happened to attract clicks and views. Our objective is to determine whether two news articles intend to report the same breaking news. Note that the negative samples in the two datasets are not randomly generated: we select document pairs that contain similar keywords, and exclude samples with TF-IDF similarity below a certain threshold. The datasets have been made publicly available for research purposes.

Table 4.1 shows a detailed breakdown of the two datasets. For both datasets, we use 60% of all the samples as the training set, 20% as the development (validation)

Baselines                          CNSE Acc   CNSE F1   CNSS Acc   CNSS F1
I. ARC-I                           53.84      48.68     50.10      66.58
II. ARC-II                         54.37      36.77     52.00      53.83
III. DUET                          55.63      51.94     52.33      60.67
IV. DSSM                           58.08      64.68     61.09      70.58
V. C-DSSM                          60.17      48.57     52.96      56.75
VI. MatchPyramid                   66.36      54.01     62.52      64.56
VII. BM25                          69.63      66.60     67.77      70.40
VIII. LDA                          63.81      62.44     62.98      69.11
IX. SimNet                         71.05      69.26     70.78      74.50
X. BERT fine-tuning                81.30      79.20     86.64      87.08

Our models                         CNSE Acc   CNSE F1   CNSS Acc   CNSS F1
XI. CIG-Siam                       74.47      73.03     75.32      78.58
XII. CIG-Siam-GCN                  74.58      73.69     78.91      80.72
XIII. CIGcd-Siam-GCN               73.25      73.10     76.23      76.94
XIV. CIG-Sim                       72.58      71.91     75.16      77.27
XV. CIG-Sim-GCN                    83.35      80.96     87.12      87.57
XVI. CIGcd-Sim-GCN                 81.33      78.88     86.67      87.00
XVII. CIG-Sim&Siam-GCN             84.64      82.75     89.77      90.07
XVIII. CIG-Sim&Siam-GCN-Simg       84.21      82.46     90.03      90.29
XIX. CIG-Sim&Siam-GCN-BERTg        84.68      82.60     89.56      89.97
XX. CIG-Sim&Siam-GCN-Simg&BERTg    84.61      82.59     89.47      89.71

Table 4.2: Accuracy and F1-score results of different algorithms on the CNSE and CNSS datasets.

The metrics used for performance evaluation are the accuracy and F1 scores of the binary classification results. For each evaluated method, we perform training for 10 epochs and then choose the epoch with the best validation performance to be evaluated on the test set.

Baselines. We test the following baselines:

• Matching by representation-focused or interaction-focused deep neural network models: DSSM [Huang et al., 2013], C-DSSM [Shen et al., 2014], DUET [Mitra et al., 2017], MatchPyramid [Pang et al., 2016], ARC-I [Hu et al., 2014], ARC-II [Hu et al., 2014]. We use the implementations from MatchZoo [Fan et al., 2017] for the evaluation of these models.

• Matching by term-based similarities: BM25 [Robertson et al., 2009], LDA [Blei et al., 2003] and SimNet (which extracts the five text-pair similarities mentioned in Sec. 4.3 and classifies them with a multi-layer feedforward neural network).

• Matching by a large-scale pre-trained language model: BERT [Devlin et al., 2018].

Note that we focus on the capability of long text matching. Therefore, we do not use any short text information, such as titles, in our approach or in any baselines. In fact, the "relationship" between two documents is not limited to whether they report the same event: our algorithm is able to identify a general relationship between documents, e.g., whether two episodes are from the same season of a TV series. The definition of the relationship (e.g., same event/story, same chapter of a book) is solely

defined and supervised by the labeled training data. For these tasks, the availability of other information such as titles cannot be assumed.

As shown in Table 4.2, we evaluate different variants of our own model to show the effect of different sub-modules. In the model names, "CIG" means that we directly use keywords as concept vertices without community detection, whereas "CIGcd" means that each concept vertex in the CIG contains a set of keywords grouped via community detection. To generate the matching vector on each vertex, "Siam" indicates the use of the Siamese encoder, while "Sim" indicates the use of the term-based similarity encoder, as shown in Fig. 4.2. "GCN" means that we convolve the local matching vectors on vertices through GCN layers. Finally, "BERTg" or "Simg" indicates the use of additional global features given by BERT or the five term-based similarity metrics mentioned in Sec. 4.3, appended to the graphically merged matching vector mAB for final classification.

Implementation Details. We use Stanford CoreNLP [Manning et al., 2014] for word segmentation (on Chinese text) and named entity recognition. For Concept Interaction Graph construction with community detection, we set the minimum community size (the number of keywords contained in a concept vertex) to 2 and the maximum size to 6. Our neural network model consists of a word embedding layer, a Siamese encoding layer, graph transformation layers, and a classification layer. For the embedding layer, we load pre-trained word vectors and fix them during training; the embeddings of out-of-vocabulary words are set to zero vectors. For the Siamese encoding network, we use a 1-D convolution with 32 filters, followed by a ReLU layer and a max pooling layer. For graph transformation, we utilize 2 GCN layers [Kipf and Welling, 2016] for experiments on the CNSS dataset, and 3 GCN layers for experiments on the CNSE dataset. When the vertex encoder is the five-dimensional similarity features, we set the output size of the GCN layers to 16. When the vertex encoder is the Siamese network encoder, we set the output size of the GCN layers to 128, except for the last layer, whose output size is always 16. The classification module consists of a linear layer with output size 16, a ReLU layer, a second linear layer, and finally a Sigmoid layer. Note that this classification module is also used for the baseline method SimNet. As we mentioned in Sec. 4.1, our code and datasets have been open sourced.

We implement our model using PyTorch 1.0 [Paszke et al., 2017]. The experiments without BERT are carried out on a MacBook Pro with a 2 GHz Intel Core i7 processor and 8 GB of memory. We use L2 weight decay on all the trainable variables, with parameter λ = 3 × 10^-7. The dropout rate between every two layers is 0.1. We apply gradient clipping with a maximum gradient norm of 5.0. We use the ADAM optimizer [Kingma and Ba, 2014] with β1 = 0.8, β2 = 0.999, and ε = 10^-8. We use a learning rate warm-up scheme with an inverse exponential increase from 0.0 to 0.001 in the first 1000 steps, and then maintain a constant learning rate for the remainder of training. For all the experiments, we set the maximum number of training epochs to 10.
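To make the pipeline concrete, the following PyTorch sketch wires together a per-vertex Siamese encoder, GCN aggregation over the Concept Interaction Graph, and the classification module with the layer sizes given above. The class names, the symmetric combination of the two context vectors, and the adjacency normalization are our own illustrative choices, not the released implementation:

    import torch
    import torch.nn as nn

    class VertexSiameseEncoder(nn.Module):
        """Shared 1-D conv encoder; yields one local matching vector per vertex."""
        def __init__(self, embed_dim=300, n_filters=32):
            super().__init__()
            self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)

        def context(self, x):               # x: (n_vertices, seq_len, embed_dim)
            h = torch.relu(self.conv(x.transpose(1, 2)))
            return h.max(dim=2).values      # max pooling over token positions

        def forward(self, xa, xb):          # sentences of documents A and B per vertex
            ca, cb = self.context(xa), self.context(xb)
            return torch.cat([(ca - cb).abs(), ca * cb], dim=-1)  # (n_vertices, 64)

    class CIGMatcher(nn.Module):
        def __init__(self, in_dim=64, gcn_dims=(128, 128, 16)):
            super().__init__()
            self.encoder = VertexSiameseEncoder()
            dims = (in_dim,) + gcn_dims
            self.gcn = nn.ModuleList(nn.Linear(dims[i], dims[i + 1])
                                     for i in range(len(gcn_dims)))
            self.classify = nn.Sequential(nn.Linear(gcn_dims[-1], 16), nn.ReLU(),
                                          nn.Linear(16, 1), nn.Sigmoid())

        def forward(self, A_hat, xa, xb):   # A_hat: normalized (n, n) CIG adjacency
            h = self.encoder(xa, xb)        # local matching vectors on vertices
            for layer in self.gcn:          # graph convolution: H' = ReLU(A_hat H W)
                h = torch.relu(layer(A_hat @ h))
            return self.classify(h.mean(dim=0))  # merge vertices, then classify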

4.4.1 Results and Analysis

Table 4.2 summarizes the performance of all the compared methods on both datasets. Our model achieves the best performance on both datasets and significantly outperforms all other methods. This can be attributed to two reasons. First, as input article pairs are re-organized into Concept Interaction Graphs, the two documents are aligned along corresponding semantic units for easier concept-wise comparison. Second, our model encodes local comparisons around different semantic units into local matching vectors and aggregates them via graph convolutions, taking semantic topologies into consideration. It therefore matches documents via divide-and-conquer, which is suitable for handling long text.

Impact of Graphical Decomposition. Comparing method XI with methods I-VI in Table 4.2, all of these use the same word vectors and neural networks for text encoding. The key difference is that our method XI compares a pair of articles over a CIG in a per-vertex decomposed fashion. We can see that the performance of method XI is significantly better than that of methods I-VI. Similarly, comparing our method XIV with methods VII-IX, all of these use the same term-based similarities; however, our method achieves significantly better performance by using graphical decomposition. We therefore conclude that graphical decomposition can greatly improve long text matching performance. Note that the deep text matching models I-VI perform poorly, because they were designed mainly for sequence matching and can hardly capture meaningful semantic interactions in article pairs. When the text is long, it is hard to obtain an appropriate context vector representation for matching, and for interaction-focused neural network models, most of the interactions between words in two long articles will be meaningless.

Impact of Graph Convolutions. Comparing methods XII and XI, and methods XV and XIV, we can see that incorporating GCN layers significantly improves the performance on both datasets. Each GCN layer updates the hidden vector of each vertex by integrating the vectors from its neighboring vertices. Thus,

the GCN layers learn to graphically aggregate local matching features into a final result.

Impact of Community Detection. By comparing methods XIII and XII, and methods XVI and XV, we observe that using community detection, such that each concept is a set of correlated keywords instead of a single keyword, leads to slightly worse performance. This is reasonable, as using each keyword directly as a concept vertex provides more anchor points for article comparison. However, community detection groups highly coherent keywords together and reduces the average size of CIGs from 30 to 13 vertices, which reduces the total training and testing time of our models by as much as 55%. Therefore, one may choose whether to apply community detection to trade accuracy off for speedups.

Impact of Multi-viewed Matching. Comparing methods XVII and XV, we can see that the concatenation of different graphical matching vectors (both term-based and Siamese encoded features) can further improve performance. This demonstrates the advantage of combining multi-viewed matching vectors.

Impact of Added Global Features. Comparing methods XVIII, XIX, and XX with method XVII, we can see that adding more global features, such as global similarities (Simg) and/or global BERT encodings (BERTg) of the article pair, can hardly improve performance any further. This shows that graphical decomposition and convolutions are the main factors that contribute to the performance improvement. Since they already learn to aggregate local comparisons into a global semantic relationship, additionally engineered global features cannot help.

Model Size and Parameter Sensitivity. Our biggest model without BERT is XVIII, which contains only 34K parameters. In comparison, BERT contains ∼110M-340M parameters; nevertheless, our model significantly outperforms BERT. We tested the sensitivity of different parameters in our model. We found that 2 to 3 GCN layers give the best performance; introducing further GCN layers does not improve the performance, while the performance is much worse with zero or only one GCN layer. Furthermore, GCN hidden representations with a size between 16 and 128 yield good performance, and further increasing this size shows no obvious improvement. For the optional community detection step in CIG construction, we need to choose the minimum and maximum sizes of communities. We found that the final performance remains similar if we vary the minimum size from 2 to 3 and the maximum size from 6 to 10. This indicates that our model is robust and insensitive to these parameters.

Time Complexity. In real-world industry applications, the keywords of news articles are usually extracted in advance by highly efficient off-the-shelf tools and a pre-defined vocabulary. For CIG construction, let Ns be the number of sentences in the two documents, Nw the number of unique words in the documents, and Nk the number of unique keywords in a document. Building the keyword graph requires O(NsNk + Nw^2) complexity [Sayyadi and Raschid, 2013], and betweenness-based community detection requires O(Nk^3). The complexity of sentence assignment and weight calculation is O(NsNk + Nk^2). For graph classification, our model is small and can process document pairs efficiently.

4.5 Conclusion

We propose the Concept Interaction Graph to organize documents into a graph of concepts, and introduce a divide-and-conquer approach to matching a pair of articles based on graphical decomposition and convolutional aggregation. We created two new datasets for long document matching with the help of professional editors, consisting of about 60K pairs of news articles, on which we have performed extensive evaluations. In the experiments, our proposed approaches significantly outperformed an extensive range of state-of-the-art schemes, including both term-based and deep-model-based text matching algorithms. Results suggest that the proposed graphical decomposition and the structural transformation by GCN layers are critical to the performance improvement in matching article pairs.

Chapter 5

Matching Natural Language Sentences with Hierarchical Sentence Factorization

In the previous chapter, our Concept Interaction Graph turned the problem of long document matching into short text matching over different vertices. Such a decomposition helps us overcome the difficulties posed by document length. However, how to estimate the semantic relatedness between short sentence pairs remains a challenging problem. Semantic matching of natural language sentences, or identifying the relationship between two sentences, is also a core research problem underlying many natural language tasks. Depending on whether training data is available, prior research has proposed both unsupervised distance-based schemes and supervised deep learning schemes for sentence matching. However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interactions between them.

In this chapter, we propose Hierarchical Sentence Factorization: a technique to factorize a sentence into a hierarchical representation, with the components at each scale reordered into a "predicate-argument" form. The proposed sentence factorization technique leads to: 1) a new unsupervised distance metric, which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic matching, based on factorized sentence hierarchies. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve

the performance of existing unsupervised distance-based metrics, as well as multiple supervised deep learning models based on convolutional neural networks (CNN) and long short-term memory (LSTM) networks.

5.1 Introduction

Semantic matching, which aims to model the underlying semantic similarity or dissimilarity among different textual elements such as sentences and documents, plays a central role in many Natural Language Processing (NLP) applications, including information extraction [Grishman, 1997], top-k re-ranking in machine translation [Brown et al., 1993], question answering [Yu et al., 2014], and automatic text summarization [Ponzanelli et al., 2015]. However, semantic matching based on either supervised or unsupervised learning remains a hard problem. Natural language exhibits complicated hierarchical structures, where different words can be organized in different orders to express the same idea. As a result, an appropriate semantic representation of text plays a critical role in matching natural language sentences.

Traditional approaches represent text objects as bag-of-words (BoW) or term frequency-inverse document frequency (TF-IDF) [Wu et al., 2008] vectors, or their enhanced variants [Paltoglou and Thelwall, 2010; Robertson and Walker, 1994]. However, such representations cannot accurately capture the similarity between individual words, and they do not take the semantic structure of language into consideration. Alternatively, word embedding models, such as word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014], learn a distributional semantic representation of each word and have been widely used.

Based on the word-vector representation, a number of unsupervised and supervised matching schemes have recently been proposed. As an unsupervised learning approach, the Word Mover's Distance (WMD) metric [Kusner et al., 2015] measures the dissimilarity between two sentences (or documents) as the minimum cost of transporting the embedded words of one sentence to those of another sentence. However, the sequential and structural nature of sentences is omitted in WMD. For example, two sentences containing exactly the same words in different orders can express totally different meanings. On the other hand, many supervised learning schemes based on deep neural networks have also been proposed for sentence matching [Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Wang et al., 2017b; Pang et al., 2016]. A common characteristic of many of these neural network models is that they adopt a Siamese architecture, taking the word embedding sequences of a pair of sentences (or documents) as input, transforming them into intermediate

contextual representations via either convolutional or recurrent neural networks, and performing scoring over the contextual representations to yield the final matching results. However, these methods rely purely on neural networks to learn the complicated relationships among sentences, and many obvious compositional and hierarchical features are often overlooked or not explicitly utilized.

In this chapter, however, we argue that a successful semantic matching algorithm needs to best characterize the sequential, hierarchical, and flexible structure of natural language sentences, as well as the rich interaction patterns among semantic units. We present a technique named Hierarchical Sentence Factorization (or Sentence Factorization for short), which is able to represent a sentence as a hierarchical semantic tree, with each node (semantic unit) at different depths of the tree reorganized into a normalized "predicate-argument" form. Such a normalized sentence representation enables us to propose new methods that both improve unsupervised semantic matching, by taking the structural and sequential differences between two text entities into account, and enhance a range of supervised semantic matching schemes, by overcoming the limited representation capability of convolutional or recurrent neural networks, especially when labelled training data is limited. Specifically, we make the following contributions:

First, the proposed Sentence Factorization scheme factorizes a sentence recursively into a hierarchical tree of semantic units, where each unit is a subset of words from the original sentence. Words are then reordered into a "predicate-argument" structure. This form of sentence representation offers two benefits: i) the flexible syntactic structures of the same sentence, for example, active and passive sentences, can be normalized into a unified representation; ii) the semantic units in a pair of sentences can be aligned according to their depth and order in the factorization tree.

Second, for unsupervised text matching, we combine the factorized and reordered representation of sentences with the Order-preserving Wasserstein Distance [Su and Hua, 2017] (which was originally proposed to match hand-written characters in computer vision) to propose a new semantic distance metric between text objects, which we call the Ordered Word Mover's Distance. Compared with the recently proposed Word Mover's Distance [Kusner et al., 2015], our new metric achieves significant improvement by taking the sequential structures of sentences into account. For example, without considering the order of words, the Word Mover's Distance between the sentences "Tom is chasing Jerry" and "Jerry is chasing Tom" is zero. In contrast, our new metric is able to penalize such order mismatches between words and identify the difference between the two sentences.

Third, for supervised semantic matching, we extend the existing Siamese network

architectures (both CNN- and LSTM-based) into multi-scale models, where each scale adopts an individual Siamese network, taking as input the vector representations of the two sentences at the corresponding depth in the factorization trees, ranging from coarse-grained to fine-grained scales. When increasing the number of layers in the corresponding neural network can hardly improve performance, hierarchical sentence factorization provides a novel means to extend the original deep networks to a "richer" model that matches a pair of sentences through a multi-scale semantic unit matching process. Our proposed multi-scale deep neural networks can effectively improve existing deep models by measuring the similarity between a pair of sentences at different semantic granularities. For instance, they extend the Siamese networks based on CNN and BiLSTM [Mueller and Thyagarajan, 2016; Shao, 2017] that originally only take word sequences as inputs.

We extensively evaluate the performance of our proposed approaches on the tasks of semantic textual similarity estimation and paraphrase identification, based on multiple datasets, including the STSbenchmark dataset, the Microsoft Research Paraphrase identification (MSRP) dataset, the SICK dataset, and the MSRvid dataset. Experimental results show that our proposed algorithms and models achieve significant improvement compared with multiple existing unsupervised text distance metrics, such as the Word Mover's Distance [Kusner et al., 2015], as well as supervised deep neural network models, including Siamese Neural Network models based on CNN and BiLSTM [Mueller and Thyagarajan, 2016; Shao, 2017].

The remainder of this chapter is organized as follows. Sec. 5.2 presents our hierarchical sentence factorization algorithm. Sec. 5.3 presents our Ordered Word Mover's Distance metric based on sentence structural reordering. In Sec. 5.4, we propose our multi-scale deep neural network architectures based on hierarchical sentence representation. In Sec. 5.5, we conduct extensive evaluations of the proposed methods on multiple datasets and tasks. The chapter is concluded in Sec. 5.6.

5.2 Hierarchical Sentence Factorization and Reordering

In this section, we present our Hierarchical Sentence Factorization technique to transform a sentence into a hierarchical tree structure, which also naturally produces a reordering of the sentence at the root node. This multi-scale representation proves to be effective at improving both unsupervised and supervised semantic matching, as will be discussed in Sec. 5.3 and Sec. 5.4, respectively. We first describe our desired factorization tree structure before presenting the steps to obtain it.

[Figure 5.1 here. Panel A shows the original sentence pair to match (Sentence A: "The little Jerry is being chased by Tom in the big yard."; Sentence B: "The blue cat is catching the brown mouse in the forecourt."). Panel B shows the steps to transform the sentences into sentence factorization trees: AMR purification, index mapping, node completion, and node traversal. Panel C shows the normalized sentences reordered into predicate-argument form: "(chase Tom (Jerry little) (yard big))" and "(catch (cat blue) (mouse brown) forecourt)". Panel D shows the alignment of semantic units at different semantic granularities.]

Figure 5.1: An example of the sentence factorization process. Here we show: A. The original sentence pair; B. The procedures of creating sentence factorization trees; C. The predicate-argument form of original sentence pair; D. The alignment of semantic units with the reordered form.

Given a natural language sentence S, our objective is to transform it into a semantic factorization tree denoted by T_S^f. Each node in T_S^f is called a semantic unit, which contains one or a few tokens (tokenized words) from the original sentence S, as illustrated in Fig. 5.1 (a4), (b4). The tokens in every semantic unit in T_S^f are re-organized into a "predicate-argument" form. For example, a semantic unit for "Tom catches Jerry" in the "predicate-argument" form will be "catch Tom Jerry".

Our proposed factorization tree recursively factorizes a sentence into a hierarchy of semantic units at different granularities to represent the semantic structure of that sentence. The root node of a factorization tree contains the entire sentence reordered into the predicate-argument form, thus providing a "normalized" representation for sentences expressed in different ways (e.g., passive vs. active voice). Moreover, each semantic unit at depth d is further split into several child nodes at depth d + 1, which are smaller semantic sub-units; each sub-unit also follows the predicate-argument form. For example, in Fig. 5.1, we convert sentence A into a hierarchical factorization tree (a4) using a series of operations.

[Figure 5.2 here. The sentence "I observed that the army moved quickly." and its AMR:

    (o / observe-01
       :ARG0 (i / i)
       :ARG1 (m / move-01
          :ARG0 (a / army)
          :manner (q / quick)))

The alignment links "observed" to index 0, "I" to 0.0, "moved" to 0.1, "army" to 0.1.0, and "quickly" to 0.1.1.]

Figure 5.2: An example of a sentence and its Abstract Meaning Representation (AMR), as well as the alignment between the words in the sentence and the nodes in AMR.

The root node of the tree contains the semantic unit "chase Tom Jerry little yard big", which is the reordered representation of the original sentence "The little Jerry is being chased by Tom in the big yard" in a semantically normalized form. Moreover, the semantic unit at depth 0 is factorized into four sub-units at depth 1: "chase", "Tom", "Jerry little", and "yard big", each in the "predicate-argument" form. At depth 2, the semantic sub-unit "Jerry little" is further factorized into the two sub-units "Jerry" and "little". Finally, a semantic unit that contains only one token (e.g., "chase" and "Tom" at depth 1) cannot be further decomposed, so it only has one child node at the next depth, obtained through self-duplication. We can observe that each depth of the tree contains all the tokens (except meaningless ones) in the original sentence, but re-organizes these tokens into semantic units of different granularities.
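As a concrete (and purely illustrative) picture of the target structure, a semantic unit node could be represented as follows; the thesis itself does not prescribe any particular implementation, so the class and field names here are hypothetical:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SemanticUnit:
        """A node of the factorization tree T_S^f."""
        index: str                              # e.g., "0.2.1"
        tokens: List[str]                       # predicate-argument ordered tokens
        children: List["SemanticUnit"] = field(default_factory=list)

        def depth(self) -> int:
            return self.index.count(".")        # "0" -> depth 0, "0.2.1" -> depth 2

    # Depth-1 units of sentence A in Fig. 5.1.
    root = SemanticUnit("0", "chase Tom Jerry little yard big".split(), [
        SemanticUnit("0.0", ["chase"]),
        SemanticUnit("0.1", ["Tom"]),
        SemanticUnit("0.2", ["Jerry", "little"]),
        SemanticUnit("0.3", ["yard", "big"]),
    ])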

5.2.1 Hierarchical Sentence Factorization

We now describe our detailed procedure for transforming a natural language sentence into the desired factorization tree described above. Our Hierarchical Sentence Factorization algorithm consists of five steps: 1) AMR parsing and alignment, 2) AMR purification, 3) index mapping, 4) node completion, and 5) node traversal. The latter four steps are illustrated in the example in Fig. 5.1 from left to right.

AMR parsing and alignment. Given an input sentence, the first step of our hierarchical sentence factorization algorithm is to acquire its Abstract Meaning Representation (AMR), as well as to perform AMR-sentence alignment to align the concepts in the AMR with the tokens in the original sentence. Semantic parsing [Baker et al., 1998; Kingsbury and Palmer, 2002; Berant and Liang, 2014; Banarescu et al., 2013; Damonte et al., 2016] can be performed to

generate the formal semantic representation of a sentence. Abstract Meaning Representation (AMR) [Banarescu et al., 2013] is a semantic parsing language that represents a sentence by a directed acyclic graph (DAG). Each AMR graph can be converted into an AMR tree by duplicating the nodes that have more than one parent. Fig. 5.2 shows the AMR of the sentence "I observed that the army moved quickly." In an AMR graph, leaves are labeled with concepts, which represent either English words (e.g., "army"), PropBank framesets (e.g., "observe-01") [Kingsbury and Palmer, 2002], or special keywords (e.g., dates, quantities, world regions, etc.). For example, "(a / army)" refers to an instance of the concept army, where "a" is the variable name of army (each entity in AMR has a variable name). ":ARG0", ":ARG1", and ":manner" are different kinds of relations defined in AMR. Relations are used to link entities. For example, ":manner" links "m / move-01" and "q / quick", which means "move in a quick manner". Similarly, ":ARG0" links "m / move-01" and "a / army", which means that "army" is the first argument of "move".

Each leaf in an AMR is a concept rather than the original token in a sentence, and the alignment between a sentence and its AMR graph is not given in the AMR annotation. Therefore, AMR alignment [Pourdamghani et al., 2014] needs to be performed to link the leaf nodes in the AMR to the tokens in the original sentence. Fig. 5.2 shows the alignment between sentence tokens and AMR concepts by the alignment indexes. The alignment index 0 is for the root node, 0.0 for the first child of the root node, 0.1 for the second child of the root node, and so forth. For example, in Fig. 5.2, the word "army" in the sentence is linked with index "0.1.0", which represents the concept node "a / army" in its AMR. We refer interested readers to [Banarescu et al., 2013; Banarescu et al., 2012] for a more detailed description of AMR. Various parsers have been proposed for AMR parsing and alignment [Flanigan et al., 2014; Wang et al., 2015a]; we choose the JAMR parser [Flanigan et al., 2014] in our implementation.

AMR purification. Unfortunately, an AMR itself cannot be used directly to form the desired factorization tree. First, multiple concepts in an AMR may link to the same token in the sentence. For example, Fig. 5.3 shows the AMR and its alignment for the sentence "Three Asian kids are dancing." The token "Asian" is linked to four concepts in the AMR graph: "continent (0.0.0)", "name (0.0.0.0)", "Asia (0.0.0.0.0)" and "wiki Asia (0.0.0.1)". This is because AMR matches a named entity with the predefined concepts it belongs to, such as "c / continent" for "Asia", and forms a representation of the entity. In Fig. 5.3, the token "Asian" is represented as a continent whose name is Asia, and whose Wikipedia entity name is also Asia.

[Figure 5.3 here. The sentence "Three Asian kids are dancing.", where "Asian" aligns to the indexes 0.0.0, 0.0.0.0, 0.0.0.0.0, and 0.0.0.1; its original AMR

    (d / dance-01
       :ARG0 (k / kid
          :mod (c / continent
             :name (n / name :op1 "Asia")
             :wiki "Asia")
          :quant 3))

and the purified tree, whose root "dance" has the child "kid", which in turn has the children "Asia" and "3".]

Figure 5.3: An example to show the operation of AMR purification.

In this case, we select the alignment index with the smallest tree depth as the token's position in the tree. Suppose P_w = {p1, p2, ..., p|P_w|} denotes the set of alignment indexes of token w. We obtain the desired alignment index of w by calculating the longest common prefix of all the index strings in P_w. After getting the alignment index for each token, we replace the concepts in the AMR with the tokens in the sentence according to the alignment indexes, and remove the relation names (such as ":ARG0") in the AMR, resulting in a compact tree representation of the original sentence, as shown in the right part of Fig. 5.3.

Index mapping. A purified AMR tree obtained in the previous step is still not in our desired form. To transform it into a hierarchical sentence factorization tree, we perform index mapping and calculate a new position (or index) for each token in the desired factorization tree given its position (or index) in the purified AMR tree. Fig. 5.1 illustrates the process of index mapping; after this step, for example, the purified AMR trees in Fig. 5.1 (a1) and (b1) will be transformed into (a2) and (b2). Specifically, let T_S^p denote a purified AMR tree of sentence S, and T_S^f our desired sentence factorization tree of S. Let I_N^p = i0.i1.i2.···.id denote the index of node N in T_S^p, where d is the depth of N in T_S^p (depth 0 represents the root of a tree). Then, the index I_N^f of node N in our desired factorization tree T_S^f is calculated as follows:

    I_N^f := 0.0                                     if d = 0,
    I_N^f := i0.(i1 + 1).(i2 + 1).···.(id + 1)       otherwise.        (5.1)

After index mapping, we add an empty root node with index 0 to the new factorization

tree, and link all nodes at depth 1 to it as its child nodes. Note that the i0 in every node index will always be 0.

Node completion. We then perform node completion to make sure each branch of the factorization tree has the same maximum depth, and to fill in the missing nodes caused by index mapping, as illustrated by Fig. 5.1 (a3) and (b3). First, given a pre-defined maximum depth D, for each leaf node N^l with depth d < D in the current T_S^f after index mapping, we duplicate it D - d times and append the duplicates sequentially to N^l, as shown in Fig. 5.1 (a3), (b3), so that the depths of the ending nodes are always D. For example, in Fig. 5.1 with D = 2, the nodes "chase (0.0)" and "Tom (0.1)" are extended to reach depth 2 via self-duplication. Second, after index mapping, the children of all the non-leaf nodes, except the root node, will be indexed starting from 1 rather than 0. For example, in Fig. 5.1 (a2), the first child node of "Jerry (0.2)" is "little (0.2.1)". In this case, we duplicate "Jerry (0.2)" itself to "Jerry (0.2.0)" to fill in the missing first child of "Jerry (0.2)". Similar filling operations are applied to the other non-leaf nodes after index mapping as well.

Node traversal to complete semantic units. Finally, we complete each semantic unit in the formed factorization tree via node traversal, as shown in Fig. 5.1 (a4), (b4). For each non-leaf node N, we traverse its sub-tree by Depth First Search (DFS). The original semantic unit in N is then replaced by the concatenation of the semantic units of all the nodes in the sub-tree rooted at N, following the order of traversal. For example, for sentence A in Fig. 5.1, after node traversal, the root node of the factorization tree becomes "chase Tom Jerry little yard big" with index "0". We can see that the original sentence has been reordered into a predicate-argument structure, and a similar structure is generated for the other nodes at different depths. At this point, each depth of the factorization tree T_S^f can express the full sentence S in terms of semantic units at different granularities.
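The purification and index-mapping steps reduce to simple string manipulations on alignment indexes. The following sketch (the helper names are ours) computes the longest-common-prefix selection described above and the mapping of Eq. (5.1):

    import os

    def purified_index(alignment_indexes):
        """AMR purification: when a token aligns to several AMR concepts, keep
        the longest common prefix of its index strings, i.e., the shallowest
        shared node."""
        parts = [idx.split(".") for idx in alignment_indexes]
        prefix = os.path.commonprefix(parts)    # element-wise on the lists
        return ".".join(prefix)

    def mapped_index(purified):
        """Index mapping, Eq. (5.1): shift every component after i0 up by one."""
        i = purified.split(".")
        if len(i) == 1:                         # d = 0: the root concept
            return "0.0"
        return ".".join([i[0]] + [str(int(x) + 1) for x in i[1:]])

    # "Asian" in Fig. 5.3 aligns to four concepts; purification keeps "0.0.0".
    assert purified_index(["0.0.0", "0.0.0.0", "0.0.0.0.0", "0.0.0.1"]) == "0.0.0"
    assert mapped_index("0.2") == "0.3" and mapped_index("0") == "0.0"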

5.3 Ordered Word Mover’s Distance

The proposed hierarchical sentence factorization technique naturally reorders an input sentence into a unified format at the root node. In this section, we introduce the Ordered Word Mover's Distance metric, which measures the semantic distance between two input sentences based on the unified representation of the reordered sentences.

Assume X ∈ R^{d×n} is a word2vec embedding matrix for a vocabulary of n words, where the i-th column x_i ∈ R^d represents the d-dimensional embedding vector of the i-th word in the vocabulary. Denote a sentence S = a1 a2 ··· aK, where a_i represents the i-th word (or its word embedding vector).

[Figure 5.4 here. WMD matching between "Morty is laughing at Rick" and "Rick is laughing at Morty", versus OWMD matching for the same sentence pair.]

Figure 5.4: Compare the sentence matching results given by Word Mover’s Distance and Ordered Word Mover’s Distance.

The Word Mover's Distance considers a sentence S as its normalized bag-of-words (nBOW) vector, where the weights of the words in S are α = {α1, α2, ..., αK}. Specifically, if word a_i appears c_i times in S, then

    α_i = c_i / Σ_{j=1}^{K} c_j.

The Word Mover's Distance metric combines the normalized bag-of-words representation of sentences with the Wasserstein distance (also known as the Earth Mover's Distance [Rubner et al., 2000]) to measure the semantic distance between two sentences. Consider a pair of sentences S1 = a1 a2 ··· aM and S2 = b1 b2 ··· bN, where b_j ∈ R^d is the embedding vector of the j-th word in S2. Let α = {α1, ..., αM} and β = {β1, ..., βN} represent the normalized bag-of-words vectors of S1 and S2. We can calculate a distance matrix D ∈ R^{M×N}, where each element D_ij = ||a_i − b_j||_2 measures the distance between words a_i and b_j (we use the same notation to denote a word and its word vector representation). Let T ∈ R^{M×N} be a non-negative sparse transport matrix, where T_ij denotes the portion of word a_i ∈ S1 that is transported to word b_j ∈ S2. The Word Mover's Distance between sentences S1 and S2 is then given by Σ_{i,j} T_ij D_ij, where the transport matrix T is computed by solving the following constrained optimization problem:

    minimize_{T ∈ R_+^{M×N}}   Σ_{i,j} T_ij D_ij
    subject to   Σ_{i=1}^{M} T_ij = β_j,   1 ≤ j ≤ N,                 (5.2)
                 Σ_{j=1}^{N} T_ij = α_i,   1 ≤ i ≤ M.

That is, the minimum "word travel cost" between the two bags of words is calculated to measure the semantic distance between the pair of sentences.
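As a concrete illustration, problem (5.2) can be solved with an off-the-shelf optimal transport solver. The following minimal Python sketch uses the third-party POT (Python Optimal Transport) package and assumes, for simplicity, uniform nBOW weights (i.e., no repeated words); it is an illustration, not the implementation evaluated later:

    import numpy as np
    import ot  # POT: Python Optimal Transport (pip install POT)

    def word_movers_distance(emb_a, emb_b):
        """WMD between two sentences given their word embeddings;
        emb_a has shape (M, d), emb_b has shape (N, d)."""
        M, N = len(emb_a), len(emb_b)
        alpha = np.full(M, 1.0 / M)   # uniform nBOW weights (assumes no repeats)
        beta = np.full(N, 1.0 / N)
        D = ot.dist(emb_a, emb_b, metric='euclidean')  # cost matrix D_ij
        T = ot.emd(alpha, beta, D)    # optimal transport plan for problem (5.2)
        return float((T * D).sum())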

However, the Word Mover's Distance fails to consider a few aspects of natural language. First, it omits the sequential structure. For example, in Fig. 5.4, the pair of sentences "Morty is laughing at Rick" and "Rick is laughing at Morty" differ only in the order of words. The Word Mover's Distance metric will find an exact match between the two sentences and estimate the semantic distance as zero, which is obviously false. Second, the normalized bag-of-words representation of a sentence cannot distinguish duplicated words that appear in multiple positions of a sentence.

To overcome the above challenges, we propose a new semantic distance metric named the Ordered Word Mover's Distance (OWMD). The Ordered Word Mover's Distance combines our sentence factorization technique with the Order-preserving Wasserstein Distance proposed in [Su and Hua, 2017]. It casts the calculation of the semantic distance between texts as an optimal transport problem while preserving the sequential structure of words in sentences. The Ordered Word Mover's Distance differs from the Word Mover's Distance in multiple aspects.

First, rather than using a normalized bag-of-words vector to represent a sentence, we decompose and re-organize the sentence using the sentence factorization algorithm described in Sec. 5.2. Given a sentence S, we represent it by the reordered word sequence S' in the root node of its sentence factorization tree. Such a representation normalizes a sentence into the "predicate-argument" structure to better handle syntactic variations. For example, after performing sentence factorization, the sentences "Tom is chasing Jerry" and "Jerry is being chased by Tom" will both be normalized as "chase Tom Jerry".

Second, we calculate a new transport matrix T by solving the following optimization problem:

    minimize_{T ∈ R_+^{M'×N'}}   Σ_{i,j} T_ij D_ij − λ1 I(T) + λ2 KL(T || P)
    subject to   Σ_{i=1}^{M'} T_ij = β'_j,   1 ≤ j ≤ N',              (5.3)
                 Σ_{j=1}^{N'} T_ij = α'_i,   1 ≤ i ≤ M',

where λ1 > 0 and λ2 > 0 are two hyper-parameters, and M' and N' denote the numbers of words in S'1 and S'2, respectively. α'_i denotes the weight of the i-th word in the normalized sentence S'1, and β'_j denotes the weight of the j-th word in the normalized sentence S'2. Usually we can set α' = (1/M', ..., 1/M') and β' = (1/N', ..., 1/N') without any prior knowledge of word differences. The first penalty term, I(T), is the inverse difference moment [Albregtsen et al., 2008] of the transport matrix T, which measures the local homogeneity of T. It is defined as:

    I(T) = Σ_{i=1}^{M'} Σ_{j=1}^{N'} T_ij / ((i/M' − j/N')^2 + 1).    (5.4)

I(T) will have a relatively large value if the large values of T mainly appear near its diagonal. Another penalty term, KL(T || P), denotes the KL-divergence between T and P, where P is a two-dimensional distribution used as the prior distribution for the values in T. It is defined as

    P_ij = (1 / (σ√(2π))) · e^{−l^2(i,j) / (2σ^2)},    (5.5)

where l(i, j) is the distance from position (i, j) to the diagonal line, which is calculated as

    l(i, j) = |i/M' − j/N'| / √(1/M'^2 + 1/N'^2).    (5.6)

As we can see, the farther apart the positions of two words are in their respective sentences, the less likely one will be transported to the other. Therefore, by introducing the two penalty terms I(T) and KL(T || P) into problem (5.3), we encourage words at similar positions in the two sentences to be matched, while words at distant positions are less likely to be matched by T. Problem (5.3) has a unique optimal solution T^{λ1,λ2}, since both the objective and the feasible set are convex. It has been proved in [Su and Hua, 2017] that the optimal T^{λ1,λ2} has the form diag(k1) · K · diag(k2), where diag(k1) ∈ R^{M'×M'} and diag(k2) ∈ R^{N'×N'} are two diagonal matrices with strictly positive diagonal elements, and K ∈ R^{M'×N'} is a matrix defined as

    K_ij = P_ij · e^{(S_ij − D_ij) / λ2},    (5.7)

where

    S_ij = λ1 / ((i/M' − j/N')^2 + 1).    (5.8)

The two scaling vectors k1 and k2 can be efficiently obtained by the Sinkhorn-Knopp iterative matrix scaling algorithm [Knight, 2008]:

    k1 ← α' ./ (K k2),
    k2 ← β' ./ (Kᵀ k1),    (5.9)

[Figure 5.5 here. Panel (a): the Siamese architecture for sentence matching; the two sentences A and B pass through a shared context layer, their context vectors are combined in an aggregation layer, and a prediction layer produces the output. Panel (b): the Siamese architecture with factorized multi-scale sentence representation, with one Siamese sub-module per factorization depth (depth 0 to depth 2) feeding a shared prediction layer.]

Figure 5.5: Extending the Siamese network architecture for sentence matching by feeding in the multi-scale representations of sentence pairs.

where ./ is the element-wise division operation. Compared with the Word Mover's Distance, the Ordered Word Mover's Distance considers the positions of words in a sentence and is able to distinguish duplicated words at different locations. For example, in Fig. 5.4, while WMD finds an exact match and gets a semantic distance of zero for the sentence pair "Morty is laughing at Rick" and "Rick is laughing at Morty", the OWMD metric is able to find a better match relying on the penalty terms, and gives a semantic distance greater than zero. The computational complexity of OWMD is also effectively reduced compared to WMD: with the additional constraints, the time complexity is O(dM'N'), where d is the dimension of the word vectors [Su and Hua, 2017], while it is O(dp^3 log p) for WMD, where p denotes the number of unique words in the sentences or documents [Kusner et al., 2015].
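To make the computation concrete, the following numpy sketch implements Eqs. (5.4)-(5.9) under the simplifying assumptions of uniform word weights and a fixed number of Sinkhorn iterations; the variable names and default hyper-parameters are illustrative (they echo the settings reported later in Sec. 5.5.2), and a log-domain implementation would be preferable for numerical stability with very small λ2:

    import numpy as np

    def owmd(emb_a, emb_b, lam1=10.0, lam2=0.03, sigma=10.0, n_iter=20):
        """Sketch of the Ordered Word Mover's Distance via Sinkhorn-Knopp.
        emb_a: (M', d) embeddings of the reordered sentence S'1; emb_b: (N', d)."""
        M, N = len(emb_a), len(emb_b)
        alpha = np.full(M, 1.0 / M)                 # uniform word weights
        beta = np.full(N, 1.0 / N)
        # Pairwise Euclidean cost matrix D.
        D = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)
        i = (np.arange(1, M + 1) / M)[:, None]      # i/M'
        j = (np.arange(1, N + 1) / N)[None, :]      # j/N'
        # Prior P concentrating transport near the diagonal, Eqs. (5.5)-(5.6).
        l = np.abs(i - j) / np.sqrt(1.0 / M**2 + 1.0 / N**2)
        P = np.exp(-l**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
        # Local-homogeneity term S and kernel K, Eqs. (5.7)-(5.8).
        S = lam1 / ((i - j)**2 + 1.0)
        K = P * np.exp((S - D) / lam2)
        # Sinkhorn-Knopp iterations, Eq. (5.9).
        k2 = np.ones(N)
        for _ in range(n_iter):
            k1 = alpha / (K @ k2)
            k2 = beta / (K.T @ k1)
        T = k1[:, None] * K * k2[None, :]           # diag(k1) K diag(k2)
        return float((T * D).sum())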

5.4 Multi-scale Sentence Matching

Our sentence factorization algorithm parses a sentence S into a hierarchical factorization tree T_S^f, where each depth of T_S^f contains the semantic units of the sentence at a different granularity. In this section, we exploit this multi-scale representation of S present in T_S^f to propose a multi-scale Siamese network architecture that can extend any existing CNN or RNN-based Siamese architecture to leverage the hierarchical representation of sentence semantics.

Fig. 5.5 (a) shows the network architecture of the popular Siamese "matching-aggregation" framework [Wang and Jiang, 2016; Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Neculoiu et al., 2016; Baudiš et al., 2016] for sentence matching tasks. The matching process is usually performed as follows. First, the

sequences of word embeddings of the two sentences are encoded by a context representation layer, which usually contains one or multiple layers of LSTM, bi-directional LSTM (BiLSTM), or CNN with max pooling, with the goal of capturing the contextual information of each sentence in a context vector. In a Siamese network, both sentences are encoded by the same context representation layer. Second, the context vectors of the two sentences are concatenated in the aggregation layer, and may be further transformed by more layers of neural network to obtain a fixed-length matching vector. Finally, a prediction layer takes in the matching vector and outputs a similarity score for the two sentences, or a probability distribution over the different sentence-pair relationships.

Compared with the typical Siamese network shown in Fig. 5.5 (a), our proposed architecture, shown in Fig. 5.5 (b), differs in two aspects. First, our network contains three Siamese sub-modules similar to (a), corresponding to the factorized representations from depth 0 (the root layer) to depth 2. We only select the semantic units from the top 3 depths of the factorization tree as our input, because most semantic units at depth 2 are already single words and cannot be further factorized. Second, for each Siamese sub-module in our network architecture, the input is not the sequence of embedding vectors of the words in the original sentences. Instead, we use the semantic units at the corresponding depth of the sentence factorization tree for matching, and we sum up the embedding vectors of the words contained in a semantic unit to represent that unit. We assume each semantic unit at depth d can be factorized into at most k semantic sub-units at depth d + 1. If a semantic unit has fewer than k sub-units, we add empty units as its children so that every non-leaf node in the factorization tree has exactly k child nodes, where the empty units are embedded as zero vectors. After this procedure, the number of semantic units at depth d of a sentence factorization tree is k^d.

Take Fig. 5.1 as an example, where we set k = 4. For sentence A, "The little Jerry is being chased by Tom in the big yard", the input at depth 0 is the sum of the word embeddings of {chase, Tom, Jerry, little, yard, big}. The input at depth 1 consists of the embedding vectors of the four semantic units {chase, Tom, Jerry little, yard big}. Finally, at depth 2, the semantic units are {chase, -, -, -, Tom, -, -, -, Jerry, little, -, -, yard, big, -, -}, where "-" denotes an empty unit.

As we can see, based on this factorized sentence representation, our network architecture explicitly matches a pair of sentences at several semantic granularities. In addition, we align the semantic units in the two sentences by mapping their positions in the tree to the corresponding indices in the input layer of the neural network. For example, as shown in Fig. 5.1, the semantic units at depth 2 are aligned according to their unit indices: "chase" matches with "catch", "Tom" matches with "cat blue", "Jerry little" matches with "mouse brown", and "yard big" matches with "forecourt".
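The following PyTorch sketch illustrates one possible instantiation of the multi-scale architecture in Fig. 5.5 (b), with a shared BiLSTM context layer per depth. The layer sizes and the way depth-wise features are aggregated are illustrative choices of ours, not the exact configurations of the Multi-scale MaLSTM or Multi-scale HCTI models evaluated in Sec. 5.5:

    import torch
    import torch.nn as nn

    class MultiScaleSiamese(nn.Module):
        """One Siamese sub-module per factorization depth (0..2)."""
        def __init__(self, embed_dim=300, hidden=128, depths=3):
            super().__init__()
            self.depths = depths
            self.encoders = nn.ModuleList(
                nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
                for _ in range(depths))
            self.prediction = nn.Sequential(
                nn.Linear(depths * 4 * hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def encode(self, d, units):        # units: (batch, k**d, embed_dim)
            out, _ = self.encoders[d](units)
            return out.mean(dim=1)         # (batch, 2*hidden) context vector

        def forward(self, units_a, units_b):
            # units_a[d], units_b[d]: summed word embeddings of the k**d
            # semantic units at depth d, padded with zero vectors for empties.
            feats = []
            for d in range(self.depths):
                ca, cb = self.encode(d, units_a[d]), self.encode(d, units_b[d])
                feats += [(ca - cb).abs(), ca * cb]   # per-depth aggregation
            return self.prediction(torch.cat(feats, dim=-1)).squeeze(-1)

    # Usage with k = 4 and a batch of 8 sentence pairs:
    model = MultiScaleSiamese()
    ua = [torch.randn(8, 4 ** d, 300) for d in range(3)]
    ub = [torch.randn(8, 4 ** d, 300) for d in range(3)]
    scores = model(ua, ub)                 # (8,) similarity scores in (0, 1)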

Dataset        Task                        Train   Dev    Test
STSbenchmark   Similarity scoring          5748    1500   1378
SICK           Similarity scoring          4500    500    4927
MSRP           Paraphrase identification   4076    -      1725
MSRvid         Similarity scoring          750     -      750

Table 5.1: Description of evaluation datasets.


5.5 Evaluation

In this section, we evaluate the performance of our unsupervised Ordered Word Mover's Distance metric and our supervised multi-scale sentence matching models with factorized sentences as input. We apply our algorithms to semantic textual similarity estimation tasks and sentence-pair paraphrase identification tasks, based on four datasets: STSbenchmark, SICK, MSRP, and MSRvid.

5.5.1 Experimental Setup

We start with a brief description of each dataset:

• STSbenchmark [Cer et al., 2017]: a dataset for semantic textual similarity (STS) estimation. The task is to assign a similarity score to each sentence pair on a scale of 0.0 to 5.0, with 5.0 being the most similar.

• SICK [Marelli et al., 2014]: another STS dataset, from SemEval 2014 Task 1. It uses the same scoring mechanism as STSbenchmark, where 0.0 represents the least relatedness and 5.0 the most.

• MSRvid: the Microsoft Research Video Description Corpus, which contains 1500 sentences that are concise summaries of the content of short videos. Each pair of sentences is also assigned a semantic similarity score between 0.0 and 5.0.

• MSRP [Quirk et al., 2004]: the Microsoft Research Paraphrase Corpus, a set of 5800 sentence pairs collected from news articles on the Internet. Each sentence pair is labeled 0 or 1, with 1 indicating that the two sentences are paraphrases of each other.

Table 5.1 shows a detailed breakdown of the datasets used in the evaluation. For the STSbenchmark dataset, we use the provided train/dev/test split. The SICK dataset does not provide a development set out of the box, so we extracted 500 instances from the training set as the development set. For MSRP and MSRvid, since their sizes are relatively small to begin with, we did not create development sets for them.

One metric we use to evaluate the performance of our proposed models on the task of semantic textual similarity estimation is the Pearson Correlation coefficient, commonly denoted by r and defined as:

    r = cov(X, Y) / (σ_X σ_Y),    (5.10)

where cov(X, Y) is the covariance between the distributions X and Y, and σ_X, σ_Y are the standard deviations of X and Y. The Pearson Correlation coefficient can be thought of as a measure of how well two distributions fit on a straight line. Its value lies in the range [-1, 1], where a value of 1 indicates that the data points from the two distributions lie on the same line with a positive slope.

Another metric we utilize is the Spearman's Rank Correlation coefficient. Commonly denoted by r_s, the Spearman's Rank Correlation coefficient shares a similar mathematical expression with the Pearson Correlation coefficient, but it is applied to ranked variables. Formally, it is defined as [Wikipedia, 2017]:

    ρ = cov(rg_X, rg_Y) / (σ_{rg_X} σ_{rg_Y}),    (5.11)

where rg_X, rg_Y denote the ranked variables derived from X and Y, and cov(rg_X, rg_Y), σ_{rg_X}, σ_{rg_Y} correspond to the covariance and standard deviations of the ranked variables. The term ranked simply means that each instance in X is ranked higher or lower against every other instance in X, and the same holds for Y; we then compare the rank values of X and Y with Eq. (5.11). Like the Pearson Correlation coefficient, the Spearman's Rank Correlation coefficient has an output range of [-1, 1], and it measures the monotonic relationship between X and Y: a Spearman's Rank Correlation value of 1 implies that as X increases, Y is guaranteed to increase as well. The Spearman's Rank Correlation is also less sensitive to noise from outliers than the Pearson Correlation. For the task of paraphrase identification, the classification accuracy of label 1 and the F1 score are used as metrics.

In the supervised learning portion, we conduct the experiments on the four aforementioned datasets. We use the training sets to train the models and the development sets to tune the hyper-parameters, and each test set is used only once in the final evaluation. For datasets without a development set, we use cross-validation during training to prevent overfitting; that is, 10% of the training data is used for validation and the rest for training. For each model, we carry out training for 10 epochs and then choose the model with the best validation performance to be evaluated on the test set.
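Both correlation metrics are available off the shelf. For instance, with SciPy (the numbers below are illustrative only, not drawn from our experiments):

    from scipy.stats import pearsonr, spearmanr

    predicted = [0.8, 2.1, 3.3, 4.7, 1.5]    # model similarity scores
    gold      = [1.0, 2.0, 3.5, 5.0, 1.0]    # labeled scores
    r, _   = pearsonr(predicted, gold)       # Pearson's r, Eq. (5.10)
    rho, _ = spearmanr(predicted, gold)      # Spearman's rho, Eq. (5.11)
    print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")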

5.5.2 Unsupervised Matching with OWMD

To evaluate the effectiveness of our Ordered Word Mover's Distance metric, we first take an unsupervised approach to the similarity estimation task on the STSbenchmark, SICK, and MSRvid datasets. Using the distance metrics listed in Tables 5.2 and 5.3, we first compute the distance between two sentences, and then calculate the Pearson Correlation coefficients and the Spearman's Rank Correlation coefficients between all pairs' distances and their labeled scores. We do not use the MSRP dataset here, since it poses a binary classification problem. In our proposed Ordered Word Mover's Distance metric, the distance between two sentences is calculated by the order-preserving Word Mover's Distance algorithm. For all three datasets, we performed hyper-parameter tuning using the training set and calculated the Pearson Correlation coefficients on the test and development sets.

We found that for the STSbenchmark dataset, setting λ1 = 10, λ2 = 0.03 produces the best result. For the SICK dataset, the combination λ1 = 3.5, λ2 = 0.015 works best, and for the MSRvid dataset, the highest Pearson Correlation is attained with λ1 = 0.01, λ2 = 0.02. We use a maximum of 20 iterations, since in our experiments this was sufficient for the correlation results to converge. During hyper-parameter tuning we also discovered that using the Euclidean metric along with σ = 10 produces better results, so all OWMD results summarized in Tables 5.2 and 5.3 are acquired under these parameter settings. Finally, it is worth mentioning that our OWMD metric calculates distances using the factorized versions of sentences, while all other metrics use the original sentences; sentence factorization is a necessary preprocessing step for the OWMD metric. We compare the performance of the Ordered Word Mover's Distance metric with the following methods:

• Bag-of-Words (BoW): in the Bag-of-Words metric, the distance between two sentences is computed as the cosine similarity between the word-count vectors of the sentences.

• LexVec [Salle et al., 2016]: calculate the cosine similarity between the averaged 300-dimensional LexVec word embedding of the two sentences.

             STSbenchmark        SICK                MSRvid
Algorithm    Test      Dev       Test      Dev       Test
BoW          0.5705    0.6561    0.6114    0.6087    0.5044
LexVec       0.5759    0.6852    0.6948    0.6811    0.7318
GloVe        0.4064    0.5207    0.6297    0.5892    0.5481
Fastext      0.5079    0.6247    0.6517    0.6421    0.5517
Word2vec     0.5550    0.6911    0.7021    0.6730    0.7209
WMD          0.4241    0.5679    0.5962    0.5953    0.3430
OWMD         0.6144    0.7240    0.6797    0.6772    0.7519

Table 5.2: Pearson Correlation results on different distance metrics.

             STSbenchmark        SICK                MSRvid
Algorithm    Test      Dev       Test      Dev       Test
BoW          0.5592    0.6572    0.5727    0.5894    0.5233
LexVec       0.5472    0.7032    0.5872    0.5879    0.7311
GloVe        0.4268    0.5862    0.5505    0.5490    0.5828
Fastext      0.4874    0.6424    0.5739    0.5941    0.5634
Word2vec     0.5184    0.7021    0.6082    0.6056    0.7175
WMD          0.4270    0.5781    0.5488    0.5612    0.3699
OWMD         0.5855    0.7253    0.6133    0.6188    0.7543

Table 5.3: Spearman’s Rank Correlation results on different distance metrics.

• GloVe [Pennington et al., 2014]: calculate the cosine similarity between the averaged 300-dimensional GloVe 6B word embedding of the two sentences.

• Fastext [Joulin et al., 2016]: calculate the cosine similarity between the aver- aged 300-dimensional Fastext word embedding of the two sentences.

• Word2vec [Mikolov et al., 2013]: calculate the cosine similarity between the averaged 300-dimensional Word2vec word embedding of the two sentences.

• Word Mover's Distance (WMD) [Kusner et al., 2015]: estimate the semantic distance between two sentences using the WMD metric introduced in Sec. 5.3.

Table 5.2 and Table 5.3 compare the performance of the different metrics in terms of the Pearson Correlation coefficients and the Spearman's Rank Correlation coefficients. We can see that our OWMD metric achieves the best performance on all the datasets in terms of the Spearman's Rank Correlation coefficient. It also produces the best Pearson Correlation results on the STSbenchmark and MSRvid datasets, while its performance on the SICK dataset is close to the best. This can be attributed to two characteristics of OWMD. First, the input sentence is re-organized into a predicate-argument structure using the sentence factorization tree, so corresponding semantic units in the two sentences are aligned roughly in order. Second, our OWMD metric takes word positions into consideration and penalizes disordered matches, so it produces fewer mismatches than the WMD metric.

                     MSRP              SICK               MSRvid             STSbenchmark
Model                Acc.(%)  F1(%)    r        ρ         r        ρ         r        ρ
MaLSTM               66.95    73.95    0.7824   0.71843   0.7325   0.7193    0.5739   0.5558
Multi-scale MaLSTM   74.09    82.18    0.8168   0.74226   0.8236   0.8188    0.6839   0.6575
HCTI                 73.80    80.85    0.8408   0.7698    0.8848   0.8763    0.7697   0.7549
Multi-scale HCTI     74.03    81.76    0.8437   0.7729    0.8763   0.8686    0.7269   0.7033

Table 5.4: A comparison among different supervised learning models in terms of accuracy, F1 score, Pearson's r and Spearman's ρ on various test sets.

5.5.3 Supervised Multi-scale Semantic Matching

The use of sentence factorization can improve both existing unsupervised metrics and existing supervised models. To evaluate how the performance of existing Siamese neural networks can be improved by our sentence factorization technique and the multi-scale Siamese architecture, we implemented two types of Siamese sentence matching models: HCTI [Shao, 2017] and MaLSTM [Mueller and Thyagarajan, 2016]. HCTI is a Convolutional Neural Network (CNN) based Siamese model, which achieves the best Pearson Correlation coefficient on the STSbenchmark dataset in the SemEval 2017 competition (compared with all the other neural network models). MaLSTM is a Siamese adaptation of the Long Short-Term Memory (LSTM) network for learning sentence similarity. As the source code of HCTI is not publicly released, we implemented it in Keras [Chollet et al., 2015] according to [Shao, 2017]. With the same parameter settings listed in [Shao, 2017], and after our best efforts to optimize the model, we obtained a Pearson correlation of 0.7697 (0.7833 in [Shao, 2017]) on the STSbenchmark test set. We extended HCTI and MaLSTM to our proposed Siamese architecture in Fig. 5.5, namely the Multi-scale HCTI and the Multi-scale MaLSTM.

To evaluate the performance of our models, we conduct experiments on two tasks: 1) semantic textual similarity estimation, based on the STSbenchmark, MSRvid, and SICK datasets; and 2) paraphrase identification, based on the MSRP dataset. Table 5.4 shows the results of HCTI, MaLSTM, and our multi-scale models on the different datasets. Compared with the original models, our models, which take multi-scale semantic units of the input sentences as network inputs, significantly improve the performance on most datasets. Furthermore, the improvements on different tasks and datasets also demonstrate the general applicability of our proposed architecture.

Compared with MaLSTM, our multi-scale Siamese models with factorized sentences as input perform much better on every dataset. For the MSRvid and STSbenchmark datasets, both Pearson's r and Spearman's ρ increase by about 10% with the Multi-scale MaLSTM. Moreover, the Multi-scale MaLSTM achieves the highest accuracy and F1 score on the MSRP dataset among the models listed in Table 5.4. There are two reasons why our Multi-scale MaLSTM significantly outperforms the MaLSTM model. First, for an input sentence pair, we explicitly model their semantic units with the factorization algorithm. Second, our multi-scale network architecture is specifically designed for multi-scale sentence representations and is therefore able to explicitly match a pair of sentences at different granularities.

We also report the results of HCTI and Multi-scale HCTI in Table 5.4. For the paraphrase identification task, our model shows better accuracy and F1 score on the MSRP dataset. For the semantic textual similarity estimation task, the performance varies across datasets. On the SICK dataset, the performance of Multi-scale HCTI is close to HCTI, with slightly better Pearson's r and Spearman's ρ. However, Multi-scale HCTI is not able to outperform HCTI on MSRvid and STSbenchmark: HCTI is still the best neural network model on the STSbenchmark dataset, and the MSRvid dataset is a subset of STSbenchmark. Although HCTI has strong performance on these two datasets, it performs worse than our model on the other datasets. Overall, the experimental results demonstrate the general applicability of our proposed model architecture, which performs well on various semantic matching tasks.

5.6 Conclusion

In this chapter, we propose a technique named Hierarchical Sentence Factorization that is able to transform a sentence into a hierarchical factorization tree. Each node in the tree is a semantic unit consisting of one or several words of the sentence, reorganized into a "predicate-argument" structure. Each depth of the tree factorizes the sentence into semantic units of a different scale. Based on the hierarchical tree-structured representation of sentences, we propose both an unsupervised metric and two supervised deep models for sentence matching tasks. On one hand, we design a new unsupervised distance metric, named Ordered Word Mover's Distance (OWMD), to measure the semantic difference between a pair of text snippets. OWMD takes the sequential structure of sentences into account and is able to handle the flexible syntactic structures of natural language sentences. On the other hand, we propose a multi-scale Siamese neural network architecture that takes the multi-scale representations of a pair of sentences as network input and matches the two sentences at different granularities.

We apply our techniques to the tasks of text-pair similarity estimation and text-pair paraphrase identification, based on multiple datasets. Our extensive experiments show that both the unsupervised distance metric and the supervised multi-scale Siamese network architecture achieve significant improvements on multiple datasets by using the technique of sentence factorization.

Part II

Text Mining: Recognizing User Attentions for Searching and Recommendation

[Part II opening diagram: User Attention Mining and Attention Ontology Creation support Query Understanding and Document Tagging, which together enable recommendation based on user attentions.]

How to infer users' interests or attentions based on their historical behaviors, and how to identify the relationships between different user interests, are critical research problems for recommender systems.

In Chapter 6, we describe ConcepT in Tencent QQ Browser. It discovers user-centered concepts at the right granularity conforming to user interests, by mining a large amount of user queries and interactive search click logs. The extracted concepts have the proper granularity, are consistent with user language styles, and are dynamically updated.

In Chapter 7, we present GIANT, a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs. Various types of edges are also constructed to maintain a hierarchy in the ontology. Compared with ConcepT, GIANT contains more types of user attentions, such as topics and events. Besides, it identifies various types of relationships between user attentions instead of only the isA relationship. Furthermore, GIANT contains a unified algorithm for heterogeneous phrase mining based on a novel Query-Title Interaction Graph representation.

Both ConcepT and GIANT have been deployed in real-world applications, such as Tencent QQ Browser. They can be applied to different tasks, including query understanding, document tagging, story organization and so on.

Chapter 6

A User-Centered Concept Mining System for Query and Document Understanding at Tencent

Concepts embody the knowledge of the world and facilitate the cognitive processes of human beings. Mining concepts from web documents and constructing the corresponding taxonomy are core research problems in text understanding and support many downstream tasks such as query analysis, knowledge base construction, recommendation, and search. However, we argue that most prior studies extract formal and overly general concepts from Wikipedia or static web pages, which do not represent the user perspective. In this chapter, we describe our experience of implementing and deploying ConcepT in Tencent QQ Browser. It discovers user-centered concepts at the right granularity conforming to user interests, by mining a large amount of user queries and interactive search click logs. The extracted concepts have the proper granularity, are consistent with user language styles, and are dynamically updated. We further present our techniques to tag documents with user-centered concepts and to construct a topic-concept-instance taxonomy, which has helped to improve search as well as news feeds recommendation in Tencent QQ Browser. We performed extensive offline evaluation to demonstrate that our approach can extract concepts of higher quality compared with several other existing methods. Our system has been deployed in Tencent QQ Browser. Results from online A/B testing involving a large number of real users suggest that the Impression Efficiency of feeds users increased by 6.01% after incorporating the user-centered concepts into the recommendation framework of Tencent QQ Browser.

6.1 Introduction

The capability of conceptualization is a critical ability in natural language understanding and is an important distinguishing factor that separates a human being from the current dominating machine intelligence based on vectorization. For example, by observing the words "Honda Civic" and "Hyundai Elantra", a human can immediately link them with "fuel-efficient cars" or "economy cars", and quickly come up with similar items like "Nissan Versa" and probably "Ford Focus". When one observes the seemingly uncorrelated words "beer", "diaper" and "Walmart", one can extrapolate that the article is most likely discussing topics like marketing, business intelligence or even data science, instead of talking about the actual department store "Walmart". The importance of concepts is best emphasized by the statement in Gregory Murphy's famous book The Big Book of Concepts that "Concepts embody our knowledge of the kinds of things there are in the world. Tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events."

In order to enable machines to extract concepts from text, a large amount of effort has been devoted to knowledge base or taxonomy construction, typically represented by DBPedia [Lehmann et al., 2015] and YAGO [Suchanek et al., 2007], which construct taxonomies from Wikipedia categories, and Probase [Wu et al., 2012], which extracts concepts from free text in web documents. However, we argue that these methods for concept extraction and taxonomy construction are still limited compared to how a human interacts with the world and learns to conceptualize, and may not possess the proper granularity that represents human interests. For example, "Toyota 4Runner" is a "Toyota SUV" and "F150" is a "truck". However, it would be more helpful if we could infer that a user searching for these items may be more interested in "cars with high chassis" or "off-road ability" rather than another Toyota SUV like "RAV4"; such concepts are rare in existing knowledge bases or taxonomies. Similarly, if an article talks about the movies "the Great Gatsby", "Wuthering Heights" and "Jane Eyre", it is also hard to infer that the article is actually about "book-to-film adaptations".

The fundamental reason is that taxonomies such as DBPedia [Lehmann et al., 2015] and Probase [Wu et al., 2012], although maintaining structured knowledge about the world, are not designed to conceptualize from the user's perspective or to infer user intention. Neither can they exhaust all the complex connections between the different instances, concepts and topics that are discussed in different documents. Undoubtedly, the ability for machines to conceptualize just as a user would do, i.e., to extract trending and user-centered concept terms that are constantly evolving and are expressed in user language, is critical to boosting the intelligence of recommender systems and search engines.

In this chapter, we propose ConcepT, a concept mining system at Tencent that aims to discover concepts at the right granularity conforming to user interests. Different from prior work, ConcepT is not based on mining web pages only, but on mining huge amounts of query logs and search click graphs, and is thus able to understand user intention by capturing users' interactions with the content. We present our design of ConcepT and our experience of deploying it in Tencent QQ Browser, which has the largest market share in the Chinese market with more than 110 million daily active users. ConcepT serves as the core taxonomy system in Tencent QQ Browser to discover both time-invariant and trending concepts.

ConcepT can significantly boost the performance of both searching and content recommendation, through the taxonomy constructed from the discovered user-centered concepts as well as a concept tagging mechanism for both short queries and long documents that accurately depicts user intention and document coverage. Up to today, ConcepT has extracted more than 200,000 high-quality user-centered concepts from daily query logs and user click graphs in QQ Browser, while still growing at a rate of 11,000 new concepts found per day. Although our system is implemented and deployed for processing Chinese queries and documents, the proposed techniques in ConcepT can easily be adapted to other languages.

Mining user-centered concepts from query logs and search click graphs brings about a number of new challenges. First, most existing taxonomy construction approaches such as Probase [Wu et al., 2012] extract concepts based on Hearst patterns, like "such as", "especially", etc. However, Hearst patterns have limited extraction power, since high-quality patterns are often missing in short text like queries and informal user language. Moreover, existing methods extract concepts from web pages and documents that are usually written by experts from the writer's perspective. However, search queries are often informal and may not observe the syntax of a written language [Hua et al., 2015]. Hence, it is hard, if not impossible, to mine "user perspective" concepts based on predefined syntax patterns.

There are also many studies on keyphrase extraction [Shang et al., 2018; Liu et al., 2015; Mihalcea and Tarau, 2004]. They measure the importance or quality of all the N-grams in a document or text corpus, and choose keyphrases from them according to the calculated scores. As a result, such methods can only extract continuous text chunks, whereas a concept may be discontinuous or may not even be explicitly mentioned in a query or a document. Another concern is that most such N-gram keyphrase extraction algorithms yield poor performance on short text snippets such as queries. In addition, deep learning models, such as sequence-to-sequence models, can also be used to generate or extract concepts. However, deep learning models usually rely on large amounts of high-quality training data, and for user-centered concept mining, manually labeling such a dataset from scratch is extremely costly and time-consuming. Furthermore, many concepts in user queries are related to recent trending events, whereas the concepts in existing taxonomies are mostly stable and time-invariant. A user may search for "Films for New Year (贺岁大片)" or "New Japanese Animation in April (四月新番)" in Tencent QQ Browser. The semantics of such concepts evolve over time, since apparently we have different new animations or films in different years. Therefore, in contrast to existing taxonomies, which mostly maintain long-term stable knowledge, it will be challenging yet beneficial if we can also extract time-varying concepts and dynamically update the constructed taxonomies.

We make the following novel contributions in the design of ConcepT:

First, we extract candidate user-centered concepts from vast query logs by two unsupervised strategies: 1) bootstrapping based on pattern-concept duality: a small number of predefined string patterns can be used to find new concepts, while the found concepts can in turn be used to expand the pool of such patterns; 2) query-title alignment: an important concept in a query will repeat itself in the document title clicked by the user who issued the query.

Second, we further train a supervised sequence labeling Conditional Random Field (CRF) and a discriminator based on the initial seed concept set obtained, to generalize concept extraction and control the concept quality. These methods are complementary to each other and are best suited for different cases. Evaluation based on a labeled test dataset has shown that our proposed concept discovery procedure significantly outperforms a number of existing schemes.

Third, we propose effective strategies to tag documents with potentially complex concepts to depict document coverage, mainly by combining two methods: 1) matching key instances in a document with their concepts if their isA relationships exist in the corresponding constructed taxonomy; 2) using a probabilistic inference framework to estimate the probability of a concept provided that an instance is observed in its context. Note that the second method can handle the case where the concept words do not even appear in the document. For example, we may associate an article containing "celery", "whole wheat bread" and "tomato" with the concept "diet for weight loss" that a lot of users are interested in, even if the document does not have the exact wording for "weight loss" but has context words such as "fibre", "healthy", and "hydrated".

Last but not least, we have constructed and maintained a three-layered topic-concept-instance taxonomy, by identifying the isA relationships among instances, concepts and topics based on machine learning methods, including deep neural networks and probabilistic models. Such a user-centered taxonomy significantly helps with query and document understanding at varying granularities.

We have evaluated the performance of ConcepT and observed that it can improve both searching and recommendation results, through both offline experiments and a large-scale online A/B test on more than 800,000 real users conducted in the QQ Browser mobile app. The experimental results reveal that our proposed methods can extract concepts from Internet user queries more accurately than a variety of existing approaches. Moreover, by performing query conceptualization based on the extracted concepts and the correspondingly constructed taxonomy, we can improve the results of searching, according to a pilot user experience study in our experiments. Finally, ConcepT also leads to a higher Impression Efficiency as well as a longer user duration in the real world, according to the large-scale online A/B test on the recommender system in the feeds stream (text digest content recommended to users in a stream as they scroll down in the mobile app). The results suggest that the Impression Efficiency of the users increases by 6.01% when the ConcepT system is incorporated for feeds stream recommendation.

Figure 6.1: The overall process of concept mining from user queries and query logs. (The diagram shows pattern-concept bootstrapping, query-title alignment and CRF-based sequence labeling producing candidate concepts, which a discriminator filters into the final concepts.)

6.2 User-Centered Concept Mining

Our objective in user-centered concept mining is to derive a word/phrase from a given user query that best characterizes the query and its related click logs at the proper granularity.

Denote a user query by $q = w^q_1 w^q_2 \cdots w^q_{|q|}$, which is a sequence of words. Let $Q$ be the set of all queries. Denote a document title by $t = w^t_1 w^t_2 \cdots w^t_{|t|}$, another sequence of words. Given a user query $q$ and its corresponding top-ranked clicked titles $T^q = \{t^q_1, t^q_2, \cdots, t^q_{|T^q|}\}$ from the query logs, we aim to extract a concept phrase $c = w^c_1 w^c_2 \cdots w^c_{|c|}$ that represents the main semantics or the intention of the query. Each word $w^c_i \in c$ belongs to either the query $q$ or one of the corresponding clicked titles $t^q_j \in T^q$.

An overview of the detailed steps of user-centered concept mining from queries and query logs in ConcepT is shown in Fig. 6.1, which mainly consists of three approaches: pattern-concept bootstrapping, query-title alignment, and supervised sequence labeling. All the extracted concepts are further filtered by a discriminator. We utilize bootstrapping and query-title alignment to automatically accumulate an initial seed set of query-concept pairs, which helps to train the sequence labeling model and the discriminator to extract a larger number of concepts more accurately.

Bootstrapping by Pattern-Concept Duality. We first extract an initial set of seed concepts by applying the bootstrapping idea [Brin, 1998] only to the set of user queries $Q$ (without the clicked titles). Bootstrapping exploits Pattern-Concept Duality, sketched in code after the two properties below:

• Given a set of patterns, we can extract a set of concepts from queries following these patterns.

• Given a set of queries with extracted concepts, we can learn a set of patterns.
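In code, one bootstrapping round can be sketched as follows, assuming each pattern is a regular expression whose single capture group marks the concept slot; the function and parameter names are ours and this is an illustration of the idea, not ConcepT's implementation:

```python
import re

def bootstrap(queries, seed_patterns, alpha=0.6, beta=0.8, delta=2, rounds=3):
    # Patterns are regexes with one capture group for the concept slot,
    # e.g. r"^Top 10 (.+)$" for the pattern "Top 10 XXX".
    patterns, concepts = set(seed_patterns), set()
    for _ in range(rounds):
        # Duality, direction 1: patterns extract concepts from queries.
        for p in patterns:
            for q in queries:
                m = re.match(p, q)
                if m:
                    concepts.add(m.group(1))
        # Duality, direction 2: queries containing known concepts yield
        # new candidate patterns, with the concept replaced by a slot.
        candidates = set()
        for q in queries:
            for c in concepts:
                if c in q:
                    candidates.add("^" + re.escape(q).replace(re.escape(c), "(.+)") + "$")
        # Quality control: keep p only if alpha < n_s/n_e < beta and n_s > delta.
        for p in candidates - patterns:
            hits = set()
            for q in queries:
                m = re.match(p, q)
                if m:
                    hits.add(m.group(1))
            n_s, n_e = len(hits & concepts), len(hits - concepts)
            if n_e > 0 and alpha < n_s / n_e < beta and n_s > delta:
                patterns.add(p)
    return patterns, concepts
```

For instance, seeding with r"^Top 10 (.+)$" extracts "fuel-efficient cars" from "Top 10 fuel-efficient cars", and queries containing that concept then yield new patterns such as "Which XXX have the best performance?", exactly the loop described in the text below.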

Fig. 6.1 (a) illustrates how bootstrapping is performed on the queries Q. First, we manually define a small set of patterns which can be used to accurately extract concept phrases from queries with high confidence. For example, "Top 10 XXX (十大XXX)" is a pattern (with the original Chinese expression in parentheses) that can be used to extract seed concepts. Based on this pattern, we can extract the concepts "fuel-efficient cars (省油的汽车)" and "gaming phones (游戏手机)" from the queries "Top 10 fuel-efficient cars (十大省油的汽车)" and "Top 10 gaming phones (十大游戏手机)", respectively.

We can in turn retrieve more queries that contain these extracted concepts and derive new patterns from these queries. For example, a query "Which gaming phones have the best performance? (哪款游戏手机性能好?)" also contains the concept "gaming phones (游戏手机)". Based on this query, a new pattern "Which XXX have the best performance? (哪款XXX性能好?)" is found.

We also need to shortlist and control the quality of the patterns found in each round. Intuitively speaking, a pattern is valuable if it can be used to accurately extract a portion of the existing concepts as well as to discover new concepts from queries. However, if the pattern is too general and appears in a lot of queries, it may introduce noise. For example, the pattern "Is XXX good? (XXX好不好?)" underlies a lot of queries, including "Is the fuel-efficient car good? (省油的车好不好?)" and "Is running everyday good? (每天跑步好不好?)", whereas "running everyday (每天跑步)" does not serve as a sufficiently important concept in our system. Therefore, given a new pattern $p$ found in a certain round, let $n_s$ be the number of concepts in the existing seed concept set that can be extracted from the query set $Q$ by $p$. Let $n_e$ be the number of new concepts that can be extracted by $p$ from $Q$. We keep the pattern $p$ if it satisfies: 1) $\alpha < n_s / n_e < \beta$, and 2) $n_s > \delta$, where $\alpha$, $\beta$ and $\delta$ are predefined thresholds. (We set $\alpha = 0.6$, $\beta = 0.8$, and $\delta = 2$ in our system.)

Concept mining by query-title alignment. Although bootstrapping helps to discover new patterns and concepts from the query set $Q$ in an iterative manner, such a pattern-based method has limited extraction power. Since there are only a limited number of high-quality syntax patterns in queries, the recall rate of concept extraction has been sacrificed for precision. Therefore, we further propose to extract concepts from both a query and its top clicked link titles in the query log. The intuition is that a concept in a query will also be mentioned in the clicked titles associated with the query, yet possibly in a more detailed manner. For example, "The last Hong Kong zombie movie (香港|最后|一|部|僵尸|电影)" or "Hong Kong zombie comedy movie (香港|搞笑|僵尸|电影)" convey more specific concepts of the query "Hong Kong zombie movie (香港|僵尸|电影)" that led to the clicks of these titles. Therefore, we propose to find such concepts based on the alignment of queries with their corresponding clicked titles. The steps are listed in the following:

1. Given a query $q$, we retrieve the top clicked titles $T^q = \{t^q_1, t^q_2, \cdots, t^q_{|T^q|}\}$ from the query logs of $q$, i.e., $T^q$ consists of document titles that were clicked by users more than $N$ times during the past $D$ days ($N = 5$ and $D = 30$ in our system).

2. For query $q$ and each title $t \in T^q$, we enumerate all the N-grams in them.

3. Let the N-gram $g^q_{in} = w^q_i w^q_{i+1} \cdots w^q_{i+n-1}$ denote a text chunk of length $n$ starting from position $i$ of query $q$, and $g^t_{jm} = w^t_j w^t_{j+1} \cdots w^t_{j+m-1}$ denote a text chunk of length $m$ starting from position $j$ of title $t$. For each pair of such N-grams $\langle g^q_{in}, g^t_{jm} \rangle$, we identify $g^t_{jm}$ as a candidate concept if: i) $g^t_{jm}$ contains all the words of $g^q_{in}$ in the same order; and ii) $w^q_i = w^t_j$ and $w^q_{i+n-1} = w^t_{j+m-1}$.

Query-title alignment extends concept extraction from the query set alone to concept discovery based on the query logs, thus incorporating information about users' interactions into the system.
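A minimal sketch of this alignment check in Python follows; the names are ours, and the query and title are assumed to be word-segmented upstream:

```python
def align_candidates(query_words, title_words):
    # Enumerate N-gram pairs between a query and one clicked title, and
    # return the title chunks that satisfy conditions i) and ii) above.
    def is_subsequence(short, long):
        it = iter(long)
        return all(w in it for w in short)  # in-order containment

    candidates = set()
    for i in range(len(query_words)):
        for n in range(1, len(query_words) - i + 1):
            g_q = query_words[i:i + n]
            for j in range(len(title_words)):
                for m in range(n, len(title_words) - j + 1):
                    g_t = title_words[j:j + m]
                    # ii) the two chunks share their first and last words
                    if g_t[0] != g_q[0] or g_t[-1] != g_q[-1]:
                        continue
                    # i) g_t contains all words of g_q in the same order
                    if is_subsequence(g_q, g_t):
                        candidates.add(" ".join(g_t))
    return candidates
```

For the segmented query 香港|僵尸|电影 and a clicked title 香港|最后|一|部|僵尸|电影, the whole title span qualifies, recovering the more specific concept; trivial one-word candidates would be pruned downstream, e.g., by the discriminator described below.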

Supervised sequence labeling. The above unsupervised methods are still limited in their generalizability. We further perform supervised learning and train a Conditional Random Field (CRF) to label the sequence of concept words in a query or a title, where the training dataset stems from the results of the bootstrapping and query-title alignment processes mentioned above, combined with human review as detailed in the Appendix. Specifically, each word is represented by its tag features, e.g., Part-of-Speech and Named Entity Recognition tags, and by its contextual features, e.g., the tag features of its previous and succeeding words and the combination patterns of the tags of the contextual words and the word itself. These features are fed into a CRF to yield a sequence of labels identifying the concept chunk, as shown in Fig. 6.1.

The above approaches to concept mining are complementary to each other. Our experience shows that the CRF can better extract concepts from short text when they have a clear boundary with the surrounding non-concept words, e.g., "What cars are fuel-efficient (省油的汽车有哪些)". However, when the concept is split into multiple pieces, e.g., "What gifts should we prepare for birthdays of parents? (父母过生日准备什么礼物?)", the query-title alignment approach can better capture the concept that is scattered across the query.

A discriminator for quality control. Given the concepts extracted by the above strategies, we need to evaluate their value. For example, in Fig. 6.1, the concept "The last Hong Kong zombie movie (香港|最后|一|部|僵尸|电影)" is too fine-grained, and perhaps only a small number of users are interested in searching for it. Therefore, we further train a classifier to determine whether each discovered concept is worth keeping. We represent each candidate concept by a variety of features, such as whether the concept has ever appeared as a query and how many times it has been searched (more details in the Appendix). With these features as input, we train a classifier combining Gradient Boosting Decision Tree (GBDT) and Logistic Regression to decide whether to accept the candidate concept into the final list. The training dataset for the discriminator is manually created: we manually check each found concept to see whether it is good (positive) or not sufficiently good (negative). Our experience reveals that we only need 300 samples to train such a discriminator, so the creation of this dataset incurs minimal overhead.

Figure 6.2: Example of concept tagging for documents in the feeds stream of Tencent QQ Browser. (A document titled "See these cars with less than 2L/100km fuel consumption and up to 1000km recharge mileage" is tagged with the concept "Low fuel consumption cars".)

Figure 6.3: The overall procedures of concept tagging for documents. We combine a matching-based approach with a scoring-based approach to handle different situations.

6.3 Document Tagging and Taxonomy Construction

In this section, we describe our strategies for tagging each document with pertinent user-centered concepts to depict its coverage. Based on document tagging, we further construct a 3-layered topic-concept-instance taxonomy, which helps with feeds recommendation in Tencent QQ Browser.

6.3.1 Concept Tagging for Documents

While the extracted concepts can characterize the implicit intention of user queries, they can also be used to describe the main topics of a document. Fig. 6.2 shows an example of concept tagging in Tencent QQ Browser based on the ConcepT system. A document titled "See these cars with less than 2L/100km fuel consumption and up to 1000km recharge mileage" can be tagged with the concept "low fuel-consumption cars", even though the title never explicitly mentions these concept words. Such concept tags for documents, if available, can help improve search and recommendation performance. Therefore, we propose to perform concept tagging for documents.

Given a document $d$ and a set of concepts $C = \{c_1, c_2, \cdots, c_{|C|}\}$, our problem is to select a subset of concepts $C^d = \{c^d_1, c^d_2, \cdots, c^d_{|C^d|}\}$ from $C$ that are most related to the content of $d$. Fig. 6.3 presents the procedures of concept tagging for documents. To link appropriate concepts with a document, we propose a probabilistic inference-based approach together with a matching-based method to handle different situations.

Specifically, our approach estimates the correlation between a concept and a document through the key instances contained in the document. When no direct isA relationship can be found between the key instances in the document and the concepts, we use probabilistic inference as a general approach to identify relevant concepts by utilizing the context information of the instances in the document. Otherwise, the matching-based approach retrieves candidate concepts which have isA relationships with the key instances in a taxonomy we have constructed (explained at the end of this section). After that, it scores the coherence between a concept and a document based on the title-enriched representation of the concept.

Key instance extraction. Fig. 6.3 shows our approach for key instance extraction. Firstly, we rank document words using the GBRank [Zheng et al., 2007] algorithm, based on word frequency, POS tags, NER tags, etc. Secondly, we represent each word by the word vectors proposed in [Song et al., 2018b], and construct a weighted undirected graph over the top $K$ words (we set $K = 10$), where the edge weight is the cosine similarity of two word vectors. We then re-rank the keywords by the TextRank [Mihalcea and Tarau, 2004] algorithm. Finally, we only keep keywords with ranking scores larger than $\delta_w$ (we use 0.5). From our experience, combining GBRank and word-vector-based TextRank helps to extract keywords that are more coherent with the topic of the document.

Concept tagging by probabilistic inference. Denote the probability that concept $c$ is related to document $d$ as $p(c|d)$. We propose to estimate it by:

$$p(c|d) = \sum_{i=1}^{|E^d|} p(c|e^d_i)\, p(e^d_i|d), \qquad (6.1)$$

where $E^d$ is the key instance set of $d$, and $p(e^d_i|d)$ is the document frequency of instance $e^d_i \in E^d$. $p(c|e^d_i)$ estimates the probability of concept $c$ given $e^d_i$. However, as the isA relationship between $e^d_i$ and $c$ may be missing, we further infer the conditional probability by taking the contextual words of $e^d_i$ into account:

$$p(c|e^d_i) = \sum_{j=1}^{|X_{e^d_i}|} p(c|x_j)\, p(x_j|e^d_i), \qquad (6.2)$$

where $p(x_j|e^d_i)$ is the co-occurrence probability of context word $x_j$ with $e^d_i$. We consider two words as co-occurring if they are contained in the same sentence. $X_{e^d_i}$ is the set of contextual words of $e^d_i$ in $d$. Denote by $C^{x_j}$ the set of concepts containing $x_j$ as a substring. $p(c|x_j)$ is defined as:

$$p(c|x_j) = \begin{cases} \frac{1}{|C^{x_j}|} & \text{if } x_j \text{ is a substring of } c, \\ 0 & \text{otherwise.} \end{cases} \qquad (6.3)$$
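Putting Eqns. (6.1)-(6.3) together, the inference reduces to a weighted walk from key instances through context words to concepts. A minimal sketch; the container names are our assumptions, and the probability tables would be precomputed from the corpus and the concept vocabulary:

```python
from collections import Counter

def concept_scores(doc_instances, p_e_given_d, p_x_given_e, concepts_with):
    # doc_instances: key instances e of the document
    # p_e_given_d:   p(e|d), document frequency of each key instance
    # p_x_given_e:   p(x|e), co-occurrence prob. of context word x with e
    # concepts_with: maps a word x to the concepts containing x (C^x)
    scores = Counter()
    for e in doc_instances:
        for x, p_xe in p_x_given_e.get(e, {}).items():
            cands = concepts_with.get(x, [])
            if not cands:
                continue
            p_cx = 1.0 / len(cands)          # Eqn. (6.3)
            for c in cands:
                # Eqn. (6.2) folded into Eqn. (6.1): sum over e and x
                scores[c] += p_cx * p_xe * p_e_given_d[e]
    return scores
```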

For example, in Fig. 6.3, suppose the $e^d_i$ extracted from $d$ is "Toyota RAV4 (丰田RAV4)". We may not have established any relationship between this instance and any concept. However, we can extract the contextual words "fuel-efficient (省油)" and "durable (耐用)" from $d$. Based on these contextual words, we can retrieve candidate concepts that contain these words, such as "fuel-efficient cars (省油的汽车)" and "durable cellphones (耐用的手机)". We then estimate the probability of each candidate concept by the above equations.

Concept tagging by matching. The probabilistic inference-based approach decomposes the correlation between a concept and a document through the key instances and their contextual words in the document. However, whenever the isA relationships between the key instances of $d$ and $C$ are available, we can utilize them to get candidate concepts directly, and calculate the matching score between each candidate concept and $d$ to decide which concepts are coherent with the document.

First, we introduce how the isA relationship between concept-instance pairs can be identified. On one hand, given a concept, we retrieve queries/titles containing the same modifier in the context and extract the instances contained in those queries/titles. For example, given the concept "fuel-efficient cars (省油的汽车)", we may retrieve a query/title "fuel-efficient Toyota RAV4 (省油的丰田RAV4)" and extract the instance "Toyota RAV4 (丰田RAV4)" from it, as it shares the same modifier "fuel-efficient (省油的)" with the given concept. After we get a candidate instance $e$, we estimate $p(c|e)$ by Eqn. (6.2). On the other hand, we can also extract concept-instance pairs from various semi-structured websites, where a lot of concept-instance pairs are stored in web tables.

Second, we describe our matching-based approach for concept tagging. Let $E^d = \{e^d_1, e^d_2, \cdots, e^d_{|E^d|}\}$ denote the set of key instances extracted from $d$, and $C^d = \{c^d_1, c^d_2, \cdots, c^d_{|C^d|}\}$ denote the candidate concepts retrieved through the isA relationships of the instances in $E^d$.

Figure 6.4: An example to show the extracted topic-concept-instance hierarchy.

For each candidate concept $c^d_i$, we enrich its representation by concatenating the concept itself with the top $N$ (we use 5) titles of user-clicked links. We then represent the enriched concept and the document title by TF-IDF vectors, and calculate the cosine similarity between them. If $sim(c, d) > \delta_u$ (we set it to 0.58), we tag $c$ to $d$; otherwise we reject it. Note that the isA relationships between concept-instance pairs and the enriched representations of concepts are all created in advance and stored in a database.

Fig. 6.3 shows an example of matching-based concept tagging. Suppose we extract the key instance "Snow White (白雪公主)" from a document; we can retrieve the related concepts "bed time stories (睡前故事)" and "fairy tales (童话故事)" based on the isA relationship. The two concepts are further enriched by the concatenation of top clicked titles. Finally, we match the candidate concepts with the original document and keep the highly related ones.
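The matching step is a standard TF-IDF cosine comparison; a sketch with scikit-learn rather than the deployed C++ service, where `top_titles` maps a concept to its top clicked titles and all text is assumed to be pre-segmented into space-separated words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tag_by_matching(doc_title, candidates, top_titles, delta_u=0.58):
    # Enrich each candidate concept with its top 5 clicked titles,
    # vectorize everything with TF-IDF, and keep concepts whose cosine
    # similarity to the document title exceeds delta_u.
    enriched = [c + " " + " ".join(top_titles.get(c, [])[:5]) for c in candidates]
    mat = TfidfVectorizer().fit_transform([doc_title] + enriched)
    sims = cosine_similarity(mat[0], mat[1:]).ravel()
    return [c for c, s in zip(candidates, sims) if s > delta_u]
```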

6.3.2 Taxonomy Construction

We have also constructed a topic-concept-instance taxonomy based on the concepts extracted from queries and query logs. It reveals the hierarchical relationships between different topics, concepts and instances. Currently, our constructed taxonomy consists of 31 pre-defined topics, more than 200,000 user-centered concepts, and more than 600,000 instances. Among them, 40,000 concepts contain at least one instance, and 200,000 instances have been identified with an isA relationship to at least one concept. Based on the taxonomy, we can improve the user experience in search engines by understanding users' implicit intentions via query conceptualization, as well as enhance recommendation performance by matching users and documents at different semantic granularities. We will demonstrate such improvements in detail in Sec. 6.4.

Fig. 6.4 shows the three-layered taxonomy, which consists of topics, concepts and instances. The taxonomy is a Directed Acyclic Graph (DAG). Each node is either a topic, a concept or an instance. We predefined a list of $N_t = 31$ different topics, including entertainment, technology, society and so on. The directed edges indicate isA relationships between nodes.

We have already introduced our approach for identifying the isA relationship between concept-instance pairs. We further need to identify the relationships between topic-concept pairs. First, we represent each document as a vector through word embedding and pooling, and perform topic classification for documents through a carefully designed deep neural network (see the Appendix for details). After that, given a concept $c$ and a topic $p$, suppose there are $n^c$ documents that are correlated with concept $c$, and among them $n^c_p$ documents belong to topic $p$. We then estimate $p(p|c)$ by $p(p|c) = n^c_p / n^c$. We identify an isA relationship between $c$ and $p$ if $p(p|c) > \delta_t$ (we set $\delta_t = 0.3$). Our experience shows that most concepts belong to one or two topics.
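The estimate of $p(p|c)$ amounts to counting topic labels over the documents correlated with each concept; a minimal sketch, with the data layout being our assumption:

```python
from collections import Counter, defaultdict

def topic_isa_edges(doc_labels, delta_t=0.3):
    # doc_labels: iterable of (topic, concepts) pairs, one per document,
    # where `topic` comes from the document topic classifier.
    n_c, n_cp = Counter(), defaultdict(Counter)
    for topic, concepts in doc_labels:
        for c in concepts:
            n_c[c] += 1          # n^c
            n_cp[c][topic] += 1  # n^c_p
    # Keep (concept, topic) edges with p(p|c) = n^c_p / n^c above delta_t.
    return [(c, p) for c in n_c for p, n in n_cp[c].items() if n / n_c[c] > delta_t]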

6.4 Evaluation

In this section, we first introduce a new dataset for the problem of concept mining from user queries and query logs, and compare our proposed approach with a variety of baseline methods. We then evaluate the accuracy of the taxonomy constructed from the extracted user-centered concepts, and show that it can improve search engine results through query rewriting. Finally, we run large-scale online A/B testing to show that concept tagging on documents significantly improves recommendation performance in the real world. We deploy the ConcepT system, which includes the capabilities of concept mining, tagging, and taxonomy construction, in Tencent QQ Browser. For offline concept mining, our current system is able to extract around 27,000 concepts on a daily basis, of which about 11,000 are new. For online concept tagging, our system processes 40 documents per second. More details about implementation and deployment can be found in the appendix.

6.4.1 Evaluation of Concept Mining

The User-Centered Concept Mining Dataset (UCCM). As user-centered concept mining from queries is a relatively new research problem and there is no public dataset available for evaluation, we created a large-scale dataset containing 10,000 samples. The UCCM dataset is sampled from the queries and query logs of Tencent QQ Browser, from November 11, 2017 to July 1, 2018. For each query, we keep the document titles clicked by more than 2 users in the previous day. Each sample consists of a query, the top clicked titles from the real-world query log, and a concept phrase labeled by 3 professional product managers at Tencent and 1 PhD student. We have published the UCCM dataset for research purposes (https://github.com/BangLiu/ConcepT).

Methodology and Compared Models. We evaluate our comprehensive concept mining approach against the following baseline methods and variants of our method:

• TextRank [Mihalcea and Tarau, 2004]. The classical graph-based ranking model for keyword extraction (implementation: https://github.com/letiantian/TextRank4ZH).

• THUCKE [Liu et al., 2011]. It regards keyphrase extraction as a problem of translation, and learns the translation probabilities between the words in the input text and the words in keyphrases (implementation: https://github.com/thunlp/THUCKE).

• AutoPhrase [Shang et al., 2018]. A state-of-the-art quality phrase mining algorithm that extracts quality phrases based on a knowledge base and POS-guided segmentation (implementation: https://github.com/shangjingbo1226/AutoPhrase).

• Pattern-based matching with query (Q-Pattern). Extract concepts from queries based on patterns from bootstrapping.

• Pattern-based matching with title (T-Pattern). Extract concepts from titles based on patterns from bootstrapping.

• CRF-based sequence labeling with query (Q-CRF). Extract concepts from queries by CRF.

• CRF-based sequence labeling with titles (T-CRF). Extract concepts from click titles by CRF.

• Query-Title alignment (QT-Align). Extract concepts by the query-title alignment strategy.

For the T-Pattern and T-CRF approaches, since each clicked title gives one result, we select the most common one as the final result for a given query. For the TextRank, THUCKE, and AutoPhrase algorithms, we take the concatenation of the user query and clicked titles as input, and extract the top 5 keywords or phrases. We then keep the keywords/phrases contained in the query, concatenate them in the same order as in the query, and use the result as the final output.

We use Exact Match (EM) and F1 to evaluate the performance. The exact match score is 1 if the prediction is exactly the same as the ground truth and 0 otherwise. F1 measures the portion of overlapping tokens between the predicted phrase and the ground-truth concept.

Method         Exact Match   F1 Score
TextRank       0.1941        0.7356
THUCKE         0.1909        0.7107
AutoPhrase     0.0725        0.4839
Q-Pattern      0.1537        0.3133
T-Pattern      0.2583        0.5046
Q-CRF          0.2631        0.7322
T-CRF          0.3937        0.7892
QT-Align       0.1684        0.3162
Our approach   0.8121        0.9541

Table 6.1: Comparison of different algorithms for concept mining.

Evaluation results and analysis. Table 6.1 compares our model with the different baselines on the UCCM dataset in terms of Exact Match and F1 score. The results demonstrate that our method achieves the best EM and F1 scores. This is because: first, pattern-based concept mining with bootstrapping helps us construct a collection of high-quality patterns which can accurately extract concepts from queries in an unsupervised manner; second, the combination of sequence labeling by CRF and query-title alignment can recognize concepts from both queries and clicked titles in different situations, i.e., whether or not the concept boundary in the query is clear.

We can see that the methods based on TextRank [Mihalcea and Tarau, 2004], THUCKE [Liu et al., 2011] and AutoPhrase [Shang et al., 2018] do not give satisfactory performance. That is because existing keyphrase extraction approaches are better suited to extracting keywords or phrases from a long document or a corpus. In contrast, our approach is specifically designed for the problem of concept mining from user queries and clicked titles. Comparing our approach with its variants, including Q-Pattern, Q-CRF, T-CRF and QT-Align, we can see that no single component achieves comparable performance on its own. This demonstrates the effectiveness of combining different strategies in our system.
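For reference, the EM and F1 scores reported in Table 6.1 follow the standard token-overlap formulation; the implementation sketch below is ours:

```python
from collections import Counter

def exact_match(pred, gold):
    return 1.0 if pred == gold else 0.0

def token_f1(pred_tokens, gold_tokens):
    # Token-overlap F1 between predicted and ground-truth concept phrases.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)
```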

Metrics / Statistics            Value
Mean #Instances per Concept     3.44
Max #Instances per Concept      59
isA Relationship Accuracy       96.59%

Table 6.2: Evaluation results of the constructed taxonomy.

6.4.2 Evaluation of Document Tagging and Taxonomy Construction

Evaluation of document tagging. For concept tagging on documents, our system currently processes around 96,700 documents per day, of which about 35% can be tagged with at least one concept. We created a dataset containing 11,547 documents with concept tags for parameter tuning, and we also open-source it for research purposes (see the appendix for more details). We evaluate the performance of concept tagging based on this dataset. The results show that the precision of concept tagging for documents is 96%. As a correlated concept phrase may not even appear in the text, we do not evaluate the recall rate.

Evaluation of taxonomy construction. We randomly sample 1000 concepts from our taxonomy. As the relationships between concept-instance pairs are critical to query and document understanding, our experiment mainly focuses on evaluating them. For each concept, we check whether the isA relationships between it and its instances are correct. We ask three human judges to evaluate them, and for each concept we record the number of correct instances and the number of incorrect ones. Table 6.2 shows the evaluation results. The average number of instances per concept is 3.44, and the largest concept contains 59 instances. Note that the scale of our taxonomy keeps growing with more daily user queries and query logs. For the isA relationships between concept-instance pairs, the accuracy is 96.59%.

Table 6.3 shows a part of the topic-concept-instance tuples from our taxonomy. We can see that the extracted concepts are expressed from the "user perspective", such as "Female stars with a beautiful smile (笑容最美的女明星)" or "Mobile game for office workers (适合上班族玩的手游)". At the same time, the relationships between concepts and instances are also established based on user activities. For example, when a certain number of users click documents related to "Sasaki Nozomi (佐佐木希)" while searching for "Female stars with a beautiful smile (笑容最美的女明星)", our system will be able to recognize the isA relationship between the concept and the instance.

Topic: Entertainment (娱乐)
Concept: Movies adapted from a novel (小说改编成的电影)
Instances: The Great Gatsby (了不起的盖茨比), Anna Karenina (安娜·卡列尼娜), Jane Eyre (简爱)

Topic: Entertainment (娱乐)
Concept: Female stars with a beautiful smile (笑容最美的女明星)
Instances: Ayase Haruka (绫濑遥), Sasaki Nozomi (佐佐木希), Dilraba (迪丽热巴)

Topic: Society (社会)
Concept: Belt and Road countries along the route (一带一路沿线国家)
Instances: Palestine (巴勒斯坦), Syria (叙利亚), Mongolia (蒙古), Oman (阿曼)

Topic: Games (游戏)
Concept: Mobile game for office workers (适合上班族玩的手游)
Instances: Pokemon (口袋妖怪), Invincible Asia (东方不败)

Table 6.3: Part of the topic-concept-instance samples created by the ConcepT system.

Figure 6.5: The framework of feeds recommendation in Tencent QQ Browser. (Candidate documents are retrieved by matching user profiles with document labels such as topics, concepts and entities, then ranked by a CTR prediction model and recommended.)

6.4.3 Online A/B Testing for Recommendation

We perform large-scale online A/B testing to show how concept tagging on documents helps to improve recommendation performance in real-world applications. Fig. 6.5 illustrates the recommendation architecture based on our ConcepT system. In our system, both users and documents are tagged with their interested or related topics, concepts and instances. We first retrieve candidate documents by matching users with documents, and then rank the candidate documents by a Click-Through Rate (CTR) prediction model. The ranked documents are pushed to users in the feeds stream of Tencent QQ Browser. For online A/B testing, we split users into buckets, where each bucket contains 800,000 users. We first observe and record the activities of each bucket for 7 days based on the following metrics:

• Impression Page View (IPV): the number of pages matched with users.

• Impression User View (IUV): the number of users who have matched pages.

• Click Page View (CPV): the number of pages that users clicked.

• Click User View (CUV): the number of users who clicked pages.

• User Conversion Rate (UCR): CUV / IUV.

• Average User Consumption (AUC): CPV / CUV.

• User Duration (UD): the average time users spend on a page.

• Impression Efficiency (IE): CPV / IUV.
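The four ratio and duration metrics derive directly from the raw counts above; a trivial sketch (names ours):

```python
def derived_metrics(ipv, iuv, cpv, cuv):
    # Raw counts come from the 7-day observation logs of one bucket.
    return {
        "UCR": cuv / iuv,  # User Conversion Rate
        "AUC": cpv / cuv,  # Average User Consumption
        "IE":  cpv / iuv,  # Impression Efficiency
    }
```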

We then select two buckets with highly similar activities. For one bucket, we perform recommendation without the concept tags of documents; for the other, the concept tags of documents are utilized for recommendation. We run our A/B test for 3 days and compare the results using the above metrics. Impression Efficiency (IE) and User Duration (UD) are the two most critical metrics in real-world applications, because they show how much content users read and how much time they spend in an application.

Metrics   Percentage Lift      Metrics   Percentage Lift
IPV       +0.69%               UCR       +0.04%
IUV       +0.06%               AUC       +0.21%
CPV       +0.38%               UD        +0.83%
CUV       +0.16%               IE        +6.01%

Table 6.4: Online A/B testing results.

Table 6.4 shows the results of our online A/B testing. In the online experiment, we observe a statistically significant gain in IE (6.01%) and in user duration (0.83%). The page views and user views for clicks and impressions, as well as the user conversion rate and the average user consumption, are all improved. These observations show that concept tagging for documents greatly benefits document understanding and helps to better match users with the documents they are potentially interested in. With the help of user-centered concepts, we can better capture the topics contained in a document even if it does not mention them explicitly. Given more matched documents, users spend more time and read more articles in our feeds.

6.4.4 Offline User Study of Query Rewriting for Searching

Here we evaluate how user-centered concept mining can help improve search engine results through query rewriting based on conceptualization. We create an evaluation dataset which contains 108 queries from Tencent QQ Browser. For each query $q$, we analyze the concept $c$ conveyed in the query, and rewrite the query by concatenating each of the instances $\{e_1, e_2, \cdots, e_n\} \in c$ with $q$. The rewritten queries are in the format "$q\ e_i$". For the original query, we collect the top 10 search results returned by Baidu, the largest search engine in China. Assume we replace a query by $K$ different instances. We collect the top $\lceil \frac{10}{K} \rceil$ search results from Baidu for each of the rewritten queries, combining and keeping 10 of them as the search results after query rewriting. We ask three human judges to evaluate the relevancy of the results. For each search result, we record the majority vote, i.e., "relevant" or "not relevant", of the human judges, and calculate the percentage of relevant results for the original and the rewritten queries.

Our evaluation results show that the percentage of relevant top-10 results increases from 73.1% to 85.1% after rewriting the queries with our strategy. The reason is that concept mining for user queries helps to understand the intention behind them, and concatenating the instances belonging to the concept with the original query provides the search engine with more relevant and explicit keywords. Therefore, the search results better match the user's intention.
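In code, the rewriting scheme is simply the following; the names are ours, and `search_top10` stands in for fetching the top 10 Baidu results for a query:

```python
import math

def rewrite_and_search(query, instances, search_top10):
    # Issue "q e_i" for each instance of the query's concept and keep
    # ceil(10/K) results per rewritten query, merging 10 in total.
    k = math.ceil(10 / len(instances))
    merged = []
    for e in instances:
        merged.extend(search_top10(query + " " + e)[:k])
    return merged[:10]
```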

6.5 Information for Reproducibility

6.5.1 System Implementation and Deployment

We implement and deploy our ConcepT system in Tencent QQ Browser. The concept mining module and the taxonomy construction module are implemented in Python 2.7 and run as offline components. The document tagging module is implemented in C++ and runs as an online service. We utilize MySQL for data storage. In our system, each component works as a service and is deployed on the Tencent Total Application Framework (Tencent TAF, https://github.com/TarsCloud/Tars). Tencent TAF is a high-performance remote procedure call (RPC) framework based on name service and the Tars protocol; it also integrates an administration platform and implements service hosting via flexible scheduling. It has been used in Tencent since 2008 and supports different programming languages. Online document concept tagging runs on 50 Docker containers, each configured with six 2.5 GHz Intel Xeon Gold 6133 CPU cores and 6 GB of memory. Offline concept mining and taxonomy construction run on 2 containers with the same configuration.

ALGORITHM 2: Offline concept mining process.
Data: Queries and query logs in a day
Result: Concepts
  Check whether the queries and logs were successfully obtained;
  if succeeded then
    Perform concept mining by our proposed approach;
  else
    Break;

ALGORITHM 3: Offline isA relationship discovery between concepts and instances.
Data: News documents, the vocabulary of instances, concepts, and the index between key terms and concepts
Result: isA relationships between concepts and instances
  for each document do
    Get the instances in the document based on the vocabulary;
    for each instance do
      Get the intersection of the concept key terms and the terms co-occurring in the same sentence with the document instances;
      Get related concepts that contain at least one key term in the intersection;
    Get tuples based on the results of the above steps;
  Get the co-occurrence features listed in Table 6.5, and classify whether an isA relationship exists between the instances and candidate concepts.

Algorithms 2-5 show the running processes of each component in ConcepT. The offline concept mining component, which mines queries and search logs, runs on a daily basis. It extracts around 27,000 concepts from 25 million query logs every day, and about 11,000 of the extracted concepts are new. The offline relationship discovery component for taxonomy construction runs every two weeks. For online concept tagging of documents, the processing speed is 40 documents per second; the component tags about 96,700 documents per day, of which about 35% can be tagged with at least one concept.

6.5.2 Parameter Settings and Training Process

We have described the threshold parameters in our work. Here we introduce the features we use for different components in our system, and describe how we train each component. Table 6.5 lists the input features we use for different sub-modules in our ConcepT system.

ALGORITHM 4: Online probabilistic inference-based concept tagging for documents.
Data: News documents, isA relationships between instances and concepts
Result: Documents with concept tags
  for each document do
    Perform word segmentation;
    Extract key instances by the approach described in Fig. 6.3;
    Get candidate concepts through the isA relationships between concepts and key instances;
    for each concept do
      Calculate the coherence between the candidate concept and the document by the probabilistic inference-based approach;
      Tag the concept to the document if the coherence is above a threshold;

ALGORITHM 5: Online matching-based concept tagging for documents.
Data: News documents
Result: Documents with concept tags
  for each document do
    Perform word segmentation;
    Extract key terms by TF-IDF;
    Get candidate concepts containing the above key terms;
    Get the title-enriched representation of the candidate concepts;
    Represent the document and each candidate concept by TF-IDF vectors;
    for each concept do
      Calculate the cosine similarity between the candidate concept and the document;
      Tag the concept to the document if the similarity is above a threshold;

Training process. For concept mining, we randomly sample 15,000 query search logs from Tencent QQ Browser within one month. We extract concepts for these query logs using the approaches introduced in Sec. 6.2, and the results are manually checked by Tencent product managers. The resulting dataset is used to train the classifier in query-title alignment-based concept mining, and the Conditional Random Field in our model. We utilize CRF++ v0.58 to train our model. 80% of the dataset is used as the training set, 10% as the development set and the remaining 10% as the test set. For concept tagging, we randomly sample 10,000 news articles from the feeds stream of Tencent QQ Browser during a three-month period, where each topic contains about 800 to 1000 articles. We iteratively perform concept tagging for documents based on the approaches described in Sec. 6.3. After each iteration, we manually check whether the tagged concepts are correlated or not, then update our dataset and retrain the concept tagging models. The iteration process is stopped when no more new concepts can be tagged to documents. The resulting dataset is used to train the classifiers and set the hyper-parameters in concept tagging. We use 80% of the dataset as the training set, 10% as the development set and the remaining 10% as the test set.

Task: Document key instance extraction
Features: Whether the topic of the instance is the same as the topic of the document; whether it is the same as the instance in the title; whether the title contains the instance topic; the frequency of the instance among all instances in the document; the percentage of sentences containing the instance.

Task: Classify whether a short text is a concept
Features: Whether the short text has ever appeared as a user query; how many times it has been searched; Bag-of-Words representation of the text; the topic distribution of user-clicked documents given that short text as a query.

Task: Train CRF for concept extraction from query
Features: word, NER, POS, ...

Table 6.5: The features we use for different tasks in ConcepT.

6.5.3 Publishing Our Datasets

We have published our datasets for research purposes; they can be accessed at https://github.com/BangLiu/ConcepT. Specifically, we open-source the following datasets:

• The UCCM dataset. It is used to evaluate the performance of our approach for concept mining and it contains 10,000 samples.

• The document tagging dataset. It is used to evaluate the document tagging accuracy of ConcepT, and it contains 11,547 documents with concept tags.

• Topic-concept-instance taxonomy. It contains 1000 topic-concept-instance samples from our constructed taxonomy.

• The seed concept patterns for bootstrapping-based concept mining. It contains the seed string patterns we utilized for bootstrapping-based concept mining from queries.

• Pre-defined topic list. It contains our 31 pre-defined topics for taxonomy construction.

Figure 6.6: Document topic classification. (The title, author and content of a document are embedded, pooled and concatenated, then classified into a topic.)

Query: What are the Qianjiang specialties (黔江的特产有哪些)
Concept: Qianjiang specialties (黔江特产)

Query: Collection of noodle snacks cooking methods (面条小吃的做法大全)
Concept: noodle snacks cooking methods (面条小吃的做法)

Query: Which cars are cheap and fuel-efficient? (有什么便宜省油的车)
Concept: cheap and fuel-efficient cars (便宜省油的车)

Query: Jiyuan famous snacks (济源有名的小吃)
Concept: Jiyuan snacks (济源小吃)

Query: What are the symptoms of depression? (抑郁症有什么症状)
Concept: symptoms of depression (抑郁症症状)

Query: Large-scale games of military theme (军事题材的大型游戏)
Concept: Military games (军事游戏)

Table 6.6: Examples of queries and the extracted concepts given by ConcepT.

6.5.4 Details about Document Topic Classification

Topic classification aims to classify a document d into one of our N_t predefined topic categories (N_t = 31 in our system), including entertainment, events, technology and so forth. Fig. 6.6 illustrates our model for document topic classification. We represent the title, author, and content of document d by word vectors. We then apply max pooling to the title and author embeddings, and mean pooling to the content embeddings. The results of the pooling operations are concatenated into a fixed-length vector representation of d, which we classify with a feed-forward neural network. The accuracy of our model is 95% on a labeled dataset containing 35,000 news articles.
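As a concrete illustration, the following is a minimal sketch of this architecture in PyTorch. The structure (max/mean pooling plus a feed-forward classifier over 31 topics) follows the description above, while the embedding dimension, hidden size, and all names are our own assumptions rather than the deployed implementation.

import torch
import torch.nn as nn

class TopicClassifier(nn.Module):
    # Pools title/author/content word vectors and classifies the
    # concatenated fixed-length representation into N_t topics.
    def __init__(self, vocab_size, emb_dim=200, num_topics=31):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ffn = nn.Sequential(
            nn.Linear(3 * emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_topics),
        )

    def forward(self, title_ids, author_ids, content_ids):
        title = self.emb(title_ids).max(dim=1).values     # max pooling
        author = self.emb(author_ids).max(dim=1).values   # max pooling
        content = self.emb(content_ids).mean(dim=1)       # mean pooling
        return self.ffn(torch.cat([title, author, content], dim=-1))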

6.5.5 Examples of Queries and Extracted Concepts

Table 6.6 lists some examples of user queries together with the concepts extracted by ConcepT. We can see that the extracted concepts appropriately summarize the core user intention behind the queries.

6.6 Conclusion

In this chapter, we describe our experience of implementing ConcepT, a user-centered concept mining and tagging system at Tencent that is designed to improve the understanding of both queries and long documents. Our system extracts user-centered concepts from a large number of user queries and query logs, and performs concept tagging on documents to characterize their coverage from the user's perspective. In addition, ConcepT identifies the isA relationships among concepts, instances and topics to construct a 3-layered topic-concept-instance taxonomy. We conduct extensive performance evaluation through both offline experiments and online large-scale A/B testing in the QQ Browser mobile application on more than 800,000 real users. The results show that our system accurately extracts featured, user-centered concepts from user queries and query logs, and that it is quite helpful for both search engines and recommender systems. For search engines, the pilot user study in our experiments shows that query conceptualization improves search results. For recommendation, according to the real-world large-scale online A/B testing, Impression Efficiency improves by 6.01% when the ConcepT system is incorporated into feeds recommendation in Tencent QQ Browser.

Chapter 7

Scalable Creation of a Web-scale Ontology

Understanding what online users may pay attention to on the web is key to content recommendation and search services. These services will benefit from a highly structured and web-scale ontology of entities, concepts, events, topics and categories. While existing knowledge bases and taxonomies embody a large volume of entities and categories, we argue that they fail to discover properly grained concepts, events and topics in the language style of online users. Neither is a logically structured ontology maintained among these notions. In this chapter, we present GIANT, a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities, mined from the vast volume of web documents and search click logs. Various types of edges are also constructed to maintain a hierarchy in the ontology. We present the detailed techniques used in GIANT, evaluate the proposed models and methods against a variety of baselines, and deploy the resulting Attention Ontology in real-world applications, involving over a billion users, to observe its effect on content recommendation. The online performance of the ontology built by GIANT shows that it can significantly improve the click-through rate of news feeds recommendation.

7.1 Introduction

In a fast-paced modern society, people have to carefully choose what they pay attention to in their overstimulated daily lives. The vast and diverse information on the Internet makes it an ever-increasing challenge for services to capture the attention of online users. A variety of recommendation services [Konstan, 2008; Adomavicius and Tuzhilin, 2005; Koutrika, 2018; Bobadilla et al., 2013; Adomavicius and Tuzhilin, 2011] have been built to find or recommend relevant information to users based on their search queries or article viewing histories. These services could potentially benefit from a large, web-based, structured, and reusable ontology created from unstructured Internet documents, with a particular focus on the points that users pay attention to.

Content recommendation suffers from two long-standing problems: inaccurate recommendation and monotonous recommendation. Inaccurate recommendation is largely due to the lack of an ontology containing terms that describe user interests at a proper granularity. Most current news recommendation services rely on keyword-based content matching. For example, if a user reads articles about “Theresa May's resignation speech”, recommender systems may use the entities “Theresa May”, “Speech” and “Resignation Speech” to retrieve further articles related to these keywords. However, the user is probably truly concerned with the topic “Brexit Negotiation”. Entities are often too fine-grained, and categories too coarse-grained, to provide a suitable resolution for representing underlying user intents. On the other hand, monotonous recommendation, which means that users may repeatedly receive pushes about the same entities or events, is rooted in the incapability of existing systems to extrapolate beyond the visited keywords. For example, when an article contains the words “brown rice”, “treadmill” and “celery juice”, it is most likely discussing “weight loss recipes” or “healthy recipes”. Categorizing “brown rice” under food and “treadmill” under another coarse category does not help recommendation much. We argue that what is required is a web-scale ontology that can abstract entities (keywords) into concepts, events, and topics in the open domain, while maintaining a structured hierarchy among them to facilitate extrapolation and inference of user interests.

However, mining a large volume of user-centered concepts, events or topics from the web is a new and challenging task that has not been addressed by prior literature. For concept mining, most existing taxonomies or knowledge bases, such as Probase [Wu et al., 2012], DBPedia [Lehmann et al., 2015], CN-DBPedia [Xu et al., 2017], and YAGO [Suchanek et al., 2007], extract concepts/entities from Wikipedia and web pages based on Hearst patterns, via “such as”, “especially”, “including”, etc. However, the concepts that can be extracted by Hearst patterns are clearly limited. Moreover, web pages written from the author's perspective, such as Wikipedia, are not best suited to tracking phrases from the user's perspective, such as “top-5 restaurants for families”. For event mining, most existing approaches [Aone and Ramos-Santacruz, 2000; Miwa et al., 2010; McClosky et al., 2011; Yang and Mitchell, 2016] rely on the ACE definition of events [Doddington et al., 2004; Grishman et al., 2005] and perform closed-domain

event extraction via supervised learning, which does not scale to the diverse types of events in the open domain. There are also works attempting to extract events from social networks such as Twitter [Watanabe et al., 2011; Ritter et al., 2012; Atefeh and Khreich, 2015; Cordeiro and Gama, 2016], but they represent events by clusters of keywords or phrases, without identifying a clear hierarchy or structured ontology.

In this chapter, we propose GIANT, a web-based structured ontology construction mechanism that aims to automatically discover high-quality natural language phrases, which we call user attentions, at the right granularity conforming to user interests, and to maintain a hierarchy over them, by mining vast amounts of unstructured web documents and search query logs. GIANT produces and maintains a web-scale ontology, named the Attention Ontology (AO), which consists of around 2 million nodes of different kinds (i.e., categories, concepts, entities, topics, and events), and is still growing. In addition to the categories (e.g., cars, technology) and fine-grained entities (e.g., iPhone XS, Honda Civic) that appear in existing taxonomies, the Attention Ontology also contains abstractive nodes at a medium granularity, including newly discovered concepts, events and topics, expressed in the language of web users instead of authors. By modeling documents with abstractive terms that users would actually pay attention to, the Attention Ontology proves effective in greatly improving content recommendation in the open domain. For example, with the ontology, the system can infer a user's interest in the concept “economy cars” if he/she browsed articles about “Honda Civic”, even if the exact wording of economy cars does not appear in the article. The system can also extrapolate that a user cares about all events related to the topic “Brexit Negotiation” if he/she read an article about “Theresa May's resignation”.

We present our design of GIANT and our experience of deploying it in several anonymous real-world applications, involving more than one billion active users around the globe. GIANT currently serves as the core ontology and taxonomy construction system in these commercial applications for discovering long-term and trending user attentions. In designing GIANT, we have made the following novel contributions:

First, instead of constructing an ontology based on Wikipedia or other author-centered documents, we propose a framework to construct it from the click graph, a large bipartite graph of search queries and their corresponding clicked documents, which contains a plethora of phrases in the language style of web users as well as information about user behavior. In addition, compared to Wikipedia and other existing knowledge bases, which contain general knowledge about the world, queries from users can reflect hot events or topics at the present time. GIANT relies on the linkage from queries to the clicked documents to discover key phrases that represent

user attentions, as well as to construct a hierarchy out of them. As an intuitive toy example, if a query “top-5 family restaurants” frequently leads to the click-through of a certain restaurant, we can recognize “top-5 family restaurants” as a concept and the restaurant as an entity of the concept.

To extract a large number of key phrases that represent user attentions from the click graph, we propose GCTSP-Net, a general multi-task neural network model that can extract different types of phrases (including concepts, events, and topics) at scale. We first introduce a graphical representation, named the Query-Title Interaction Graph (QTIG), to represent the information in a cluster of queries and their clicked document titles as a graph, with nodes being keywords and edges recording how they are traversed in these queries and titles. We then employ a graph-to-sequence neural network, built on the Relational Graph Convolutional Network (R-GCN) [Schlichtkrull et al., 2018], to encode the QTIG of each query cluster and generate a natural language phrase as an abstractive description of the cluster, by node classification and asymmetric traveling salesman decoding (ATSP-decoding). The extracted phrases become the nodes in the Attention Ontology.

To maintain a hierarchy and structure for the ontology, we propose various approaches to identify the edges between nodes in the Attention Ontology and tag each edge with one of three types of relationships, i.e., isA, involve, and correlate. Multiple ad-hoc and probabilistic methods are proposed to identify the different types of edges. The GCTSP-Net is also used to identify the involve relationship, that is, to recognize the key arguments of an event (i.e., triggers, entities and locations). The constructed edges can benefit a variety of applications. For instance, we can tag users and documents with abstractive nodes in the ontology to depict a document's coverage by extrapolating beyond its verbatim language. We can also perform query conceptualization based on the ontology to increase the diversity of retrieved results. Besides, the correlations between events can be leveraged to discover an evolving story, which enables the tracking of an event series and keeps interested users updated.

Last but not least, we introduce efficient strategies to quickly build the training data necessary for GIANT to perform phrase mining and relation identification, with minimal human effort. For phrase mining, we combine a bootstrapping strategy based on pattern-phrase duality [Brin, 1998; Liu et al., 2019d] and text alignment based on query-title conformity. For relationship identification, we utilize the co-occurrence of different phrases in queries, documents, and consecutive queries to extract samples of phrase pairs with the target relationships. These labeled examples, automatically mined from the click graph, are then used to train the proposed machine learning models to generalize and scale up in different sub-tasks.

Figure 7.1: An example to illustrate our Attention Ontology (AO) for user-centered text understanding.

We have performed an extensive evaluation of GIANT. The assessment of the constructed ontology shows that our system can extract a plethora of high-quality phrases from search logs, with a large number of accurately identified relationships between these phrases, leading to a highly structured web-scale ontology. We compare our proposed GCTSP-Net model with a variety of baseline approaches to evaluate its performance on multiple tasks. The experimental results show that our approach extracts heterogeneous phrases from the click graph more accurately than existing methods. Finally, we have deployed the Attention Ontology constructed by GIANT in multiple real-world anonymous applications that serve more than one billion users in total. We evaluate the effect of introducing the Attention Ontology in an application with a news feed stream component that pushes news articles to mobile users as they scroll down the screens of their mobile devices. The results suggest that GIANT can significantly improve the click-through rate of news feeds recommendation.

The rest of the chapter is organized as follows: Section 7.2 provides an overview of the framework of the GIANT system and illustrates the Attention Ontology it constructs. Section 7.3 introduces our proposed techniques for extracting the nodes and constructing the edges of the Attention Ontology, as well as our strategies for fast dataset creation. In Section 7.4, we show how to apply the constructed Attention Ontology to user interest modeling and text understanding. Section 7.5 presents our evaluation, and Section 7.6 concludes the chapter.

7.2 The Attention Ontology

We provide an overview of the Attention Ontology (AO) to be constructed for web-scale text understanding, with a focus on mining the terms that grab online users'

attention and the connections among them. Online users often pay attention to specific entities or events of interest to them. Existing knowledge bases and taxonomies, e.g., Probase [Wu et al., 2012] and DBPedia [Lehmann et al., 2015], contain general knowledge about the world yet often do not incorporate the terms, e.g., weight loss recipes or best family SUVs, that online users actually spend their time on. In the proposed Attention Ontology, each node is represented by a phrase, or a collection of phrases with the same meaning, mined from the Internet. We use the term “attention” as a general name for entities, concepts, events, topics, and categories, which represent five types of information that can capture an online user's attention at different semantic granularities. Such attention phrases can be utilized to conceptualize user interests and depict document coverage. For instance, if a user wants to buy a family vehicle for road trips, he/she may issue a query such as “vehicle choices for family road trip”. From this query, we could extract the concept “family road trip vehicles” and tag it to matching entities such as “Honda Odyssey Minivan” or “Ford Edge SUV”. We could then recommend articles related to these Honda and Ford vehicles, even if they do not contain the exact wording of “family road trip”. In essence, the Attention Ontology enables us to achieve a user-centered understanding of web documents and queries, which improves the performance of search engines and recommender systems. Figure 7.1 shows an illustration of the Attention Ontology, which takes the form of a Directed Acyclic Graph (DAG). Specifically, there are five types of nodes:

• Category: a category node defines a broad field that encapsulates many related topics or concepts, for example, technology, current events, entertainment, sports, and finance. In our system, we pre-define a 3-level category hierarchy, which consists of 1,206 different categories.

• Concept: a concept is a group of things that share some common attributes [Liu et al., 2019d; Wu et al., 2012], such as superheroes, MOBA games, fuel-efficient cars and so on. In contrast with coarse-grained categories and fine-grained entities, concepts can help better characterize users' interests at a suitable semantic granularity.

• Entity: an entity is a specific instance belonging to one or multiple concepts. For example, Iron Man is an entity belonging to the concepts “superheroes” and “Marvel superheroes”.

• Event: an event is a real-world incident that involves a group of specific persons, organizations, or entities, and is tagged with a certain time and location of occurrence [Liu et al., 2017]. A hot event is usually reported by a set of news articles with different wordings, and potentially from various perspectives. In our work, we represent an event with four types of attributes: entities (indicating who or what is involved in the event), triggers (indicating what kind/type of event it is), time (indicating when the event takes place), and location (indicating where the event takes place).

• Topic: a topic is broader than an event; it represents a collection of events sharing some common attributes. For example, both the “Samsung Galaxy Note 7 Explosion in China” and “iPhone 6 Explosion in California” events belong to the topic “Cellphone Explosion”. Similarly, events such as “Theresa May is in Office” and “Theresa May's Resignation Speech” can be covered by the topic “Brexit Negotiation”.

Furthermore, we define three types of edges, i.e., relationships, between nodes:

• isA relationship, indicating that the destination node is an instance of the source node. For example, the entity “Huawei Mate20 Pro” is an instance of the concept “Huawei Cellphones”.

• involve relationship, indicating that the destination node is involved in an event/topic described by the source node.

• correlate relationship, indicating two nodes are highly correlated with each other.

The edges in the Attention Ontology reveal the types of relationships and the degrees of relatedness between nodes. This wealth of edges enables the inference of a user's hidden interests beyond the content he/she has browsed: by moving along the edges of the Attention Ontology, we can recommend other related nodes, at a coarser or finer granularity, based on the nodes the user has visited. Furthermore, by analyzing the edges between event nodes, we can keep track of a developing story, which usually consists of a series of events, and keep interested users updated.

7.3 Ontology Construction

GIANT is a mechanism to discover the phrases that users pay attention to from the web and to build a structured hierarchy out of them. In this section, we present our detailed techniques for constructing the Attention Ontology based on neural networks and other ad-hoc methods. The entire process consists of two phases: i) user attention phrase mining, and ii) attention phrase linking.


Figure 7.2: Overview of our framework for constructing the Attention Ontology and performing different tasks.

In the first phase, we define and extract user attention phrases at different semantic granularities from a large-scale search click graph. In the second phase, we link the extracted nodes and identify their relationships to construct a structured ontology.

Figure 7.2 shows the overall framework of GIANT, which constructs the Attention Ontology based on user search and click logs. The framework consists of three major components: action, attention, and application. In the action component, as users perform different actions (e.g., search, click, etc.), the user queries and clicked documents form a bipartite graph [Wikipedia contributors, 2019], commonly known as a search click graph. Based on it, we collect highly correlated queries and documents into query-doc clusters by aggregating the documents that correspond to a query, or vice versa. In the attention component, we extract different types of nodes (e.g., concepts, events, topics, and entities) from the query-doc clusters, and learn the relationships between different nodes to form the Attention Ontology. In the application component, we apply the Attention Ontology to different applications such as query conceptualization and document tagging. We can also attach different nodes to user profiles to characterize the interests of individual users based on their historical viewing behavior. In this manner, we can characterize both users and web documents based on the constructed Attention

Ontology, which enables us to better understand and recommend documents from the users' perspective.

7.3.1 Mining User Attentions

We propose a novel algorithm to extract various types of attention phrases (e.g., concepts, topics or events), which represent user attentions or interests, from a collection of queries and document titles.

Problem Definition. Suppose a bipartite search click graph $G_{sc} = (Q, D, E)$ records the click-through information over a set of queries $Q = \{q_1, q_2, \cdots, q_{|Q|}\}$ and documents $D = \{d_1, d_2, \cdots, d_{|D|}\}$, where $|\cdot|$ denotes the cardinality of a set and $E$ is the set of edges linking queries and documents. Our objective is to extract a set of phrases $P = \{p_1, p_2, \cdots, p_{|P|}\}$ from $Q$ and $D$ to represent user interests. Specifically, suppose $p$ consists of a sequence of words $\{w_{p1}, w_{p2}, \cdots, w_{p|p|}\}$. In our work, each phrase $p$ is extracted from a subset of queries and the titles of correlated documents, and each word $w_p \in p$ is contained in at least one query or title.

ALGORITHM 6: Mining Attention Nodes
Input: a sequence of queries $Q = \{q_1, q_2, \cdots, q_{|Q|}\}$, search click graph $G_{sc}$.
Output: attention phrases $P = \{p_1, p_2, \cdots, p_{|P|}\}$.
1: calculate the transport probabilities between connected query-doc pairs according to Equations (7.1) and (7.2);
2: for each $q \in Q$ do
3:   run random walk to get ordered related queries $Q_q$ and documents $D_q$;
4: end for
5: $P = \{\}$;
6: for each input cluster $(Q_q, D_q)$ do
7:   get document titles $T_q$ from $D_q$;
8:   create the Query-Title Interaction Graph $G_{qt}(Q_q, T_q)$;
9:   classify the nodes in $G_{qt}(Q_q, T_q)$ by R-GCN to learn which nodes belong to the output phrase;
10:  sort the extracted nodes by ATSP-decoding and concatenate them into an attention phrase $p_q^a$;
11:  perform attention normalization to merge $p_q^a$ with its similar phrases in $P$ into a sublist;
12:  if $p_q^a$ is not similar to any existing phrase, append $p_q^a$ to $P$;
13: end for
14: derive higher-level attention phrases by performing common suffix discovery over $P$, and append the new phrases to $P$;
15: create a node in the Attention Ontology for each phrase or sublist of similar phrases.

Algorithm 6 presents the pipeline of our system to extract attention phrases based on a bipartite search click graph. In what follows, we introduce each step in detail.

[Figure content: an example query “What are the Hayao Miyazaki animated films?”, clicked titles such as “Review of Hayao Miyazaki's animated films”, “The famous animated films of Hayao Miyazaki”, and “What are the classic Miyazaki movies?”, the extracted concept “Hayao Miyazaki animated film”, and the resulting interaction graph of tokens connected by “seq” and syntactic dependency edges.]
Figure 7.3: An example showing the construction of a query-title interaction graph for attention mining.

Query-Doc Clustering. Suppose $c(q_i, d_j)$ represents how many times query $q_i$ is linked to document $d_j$ in a search click graph $G_{sc}$ constructed from user search click logs within a period. For each query-doc pair $\langle q_i, d_j \rangle$, suppose $N(q_i)$ denotes the set of documents connected with $q_i$, and $N(d_j)$ denotes the set of queries connected with $d_j$. Then we define the transport probabilities between $q_i$ and $d_j$ as:

$$P(d_j \mid q_i) = \frac{c(q_i, d_j)}{\sum_{d_k \in N(q_i)} c(q_i, d_k)}, \quad (7.1)$$

$$P(q_i \mid d_j) = \frac{c(q_i, d_j)}{\sum_{q_k \in N(d_j)} c(q_k, d_j)}. \quad (7.2)$$

From query $q$, we perform a random walk [Spitzer, 2013] according to the transport probabilities calculated above and compute the weights of the visited queries and documents. We keep each visited query or document if its visiting probability is above a threshold $\delta_v$ and it covers more than half of the non-stop words in $q$. In this way, we derive a cluster of correlated queries $Q_q$ and documents $D_q$.
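The clustering step can be summarized by the following sketch (plain Python). The function names, the two-round walk depth, and the example threshold are our assumptions, not the production code, and the non-stop-word coverage filter is omitted for brevity.

from collections import defaultdict

def transport_probs(clicks):
    # clicks: dict mapping (query, doc) -> click count c(q_i, d_j)
    q_tot, d_tot = defaultdict(float), defaultdict(float)
    for (q, d), c in clicks.items():
        q_tot[q] += c  # denominator of Eq. (7.1)
        d_tot[d] += c  # denominator of Eq. (7.2)
    p_dq = {(q, d): c / q_tot[q] for (q, d), c in clicks.items()}  # P(d|q)
    p_qd = {(q, d): c / d_tot[d] for (q, d), c in clicks.items()}  # P(q|d)
    return p_dq, p_qd

def random_walk_cluster(seed, p_dq, p_qd, rounds=2, delta_v=0.01):
    # Propagate visiting probability from the seed query through
    # query -> doc -> query rounds, then keep nodes above delta_v.
    q_prob, d_prob = {seed: 1.0}, defaultdict(float)
    for _ in range(rounds):
        d_prob = defaultdict(float)
        for (q, d), p in p_dq.items():
            if q in q_prob:
                d_prob[d] += q_prob[q] * p
        q_prob = defaultdict(float)
        for (q, d), p in p_qd.items():
            if d in d_prob:
                q_prob[q] += d_prob[d] * p
    Q_q = sorted((q for q, p in q_prob.items() if p > delta_v),
                 key=q_prob.get, reverse=True)
    D_q = sorted((d for d, p in d_prob.items() if p > delta_v),
                 key=d_prob.get, reverse=True)
    return Q_q, D_q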

Query-Title Interaction Graph Construction. Given a set of queries $Q_q$ and documents $D_q$, the next step is to extract a representative phrase that captures the underlying user attention or interest. Figure 7.3 shows a query-doc cluster and the concept phrase extracted from it: we can extract “Hayao Miyazaki animated film (宫崎骏|动画|电影)” from the input query-title cluster. An attention phrase features multiple characteristics. First, its tokens may show up multiple times in the queries and document titles. Second, the phrase tokens are not necessarily consecutively or fully contained in a single query or title. For example, in Figure 7.3, additional tokens such as “famous (著名的)” are inserted between the phrase tokens, so that they do not form a consecutive span. Third, the phrase tokens are syntactically dependent even when they are not adjacent in a query or title. For example, “Hayao Miyazaki (宫崎骏)” and “film (电影)” form a compound noun. Finally, the order of the phrase tokens may change across different texts. In Figure 7.3, the tokens in the concept phrase “Hayao Miyazaki animated film (宫崎骏|动画|电影)” are fully contained in the query and two titles, while their order varies across queries and titles. Other types of attention phrases, such as events and topics, feature similar characteristics.

To fully exploit these characteristics, we propose the Query-Title Interaction Graph (QTIG), a novel graph representation of queries and titles that reveals the correlations between their tokens. Based on it, we further propose GCTSP-Net, a model that takes a query-title interaction graph as input, performs node classification with graph convolution, and finally generates a phrase by Asymmetric Traveling Salesman Decoding (ATSP-decoding).

Figure 7.3 shows an example of a query-title interaction graph constructed from a set of queries and titles. Denote a QTIG constructed from queries $Q_q$ and the titles

$T_q$ of documents $D_q$ as $G_{qt}(Q_q, T_q)$. The queries and documents are sorted by the weights calculated during the random walk. In $G_{qt}(Q_q, T_q)$, each node is a unique token belonging to $Q_q$ or $T_q$; the same token appearing in different input texts is merged into one node. For each pair of nodes, if they are adjacent tokens in any query or title, they are connected by a bi-directional “seq” edge indicating their order in the input. In Figure 7.3, the inverse direction of a “seq” edge points to the preceding word, which is indicated by a hollow triangle pointer. If a pair of nodes are not adjacent to each other, but there exists a syntactic dependency between them, they are connected by a bi-directional dashed edge which indicates the type and direction of the dependency relationship, with the inverse direction again indicated by a hollow triangle pointer.

Algorithm 7 shows the process of constructing a query-title interaction graph. We construct the nodes and edges by reading the inputs in $Q_q$ and $T_q$ in order. Since two nodes may have multiple adjacency or syntactic relationships across different inputs, we only keep the first edge constructed between them. In this way, each pair of related nodes is connected by either a bi-directional sequential edge or a syntactic edge. The idea is that we prefer the “seq” relationship, as it shows a stronger connection than any syntactic dependency,

ALGORITHM 7: Constructing the Query-Title Interaction Graph

Input: a sequence of queries $Q = \{q_1, q_2, \cdots, q_{|Q|}\}$, document titles $T = \{t_1, t_2, \cdots, t_{|T|}\}$.
Output: a Query-Title Interaction Graph $G_{qt}(Q, T)$.
1: create node set $V = \{sos, eos\}$, edge set $E = \{\}$;
2: for each input text passage $x \in Q$ or $x \in T$ do
3:   append “sos” and “eos” as the first and the last token of $x$;
4:   construct a new node for each token in $x$;
5:   construct a bi-directional “seq” edge for each pair of adjacent tokens;
6:   append each constructed node and edge to $V$ or $E$ only if the node is not contained in $V$, or no edge with the same source and target tokens exists in $E$;
7: end for
8: for each input text passage $x \in Q$ or $x \in T$ do
9:   perform syntactic parsing over $x$;
10:  construct a bi-directional edge for each dependency relationship;
11:  append each constructed edge to $E$ if no edge with the same source and target tokens exists in $E$;
12: end for
13: construct graph $G_{qt}(Q, T)$ from node set $V$ and edge set $E$.

and we prefer the syntactic relationships contained in higher-weighted input texts to those in lower-weighted inputs. Compared with including all possible edges in a query-title interaction graph, empirical evidence suggests that our graph construction approach gives better performance for phrase mining.
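Algorithm 7 can be condensed into the following sketch. The tokenize and parse_dependencies hooks (e.g., wrappers around an off-the-shelf tokenizer and dependency parser) are hypothetical placeholders we introduce for illustration.

def build_qtig(queries, titles, tokenize, parse_dependencies):
    # Nodes are unique tokens; for each node pair we keep only the first
    # edge constructed, so "seq" edges and higher-weighted inputs win.
    nodes, edges = {"sos", "eos"}, {}
    passages = queries + titles  # already sorted by random-walk weight
    for text in passages:
        tokens = ["sos"] + tokenize(text) + ["eos"]
        nodes.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            edges.setdefault((a, b), "seq")  # bi-directional by convention
    for text in passages:
        # parse_dependencies yields (head_token, relation, dependent_token)
        for head, rel, dep in parse_dependencies(text):
            edges.setdefault((head, dep), rel)  # skipped if any edge exists
    return nodes, edges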

After constructing a graph $G_{qt}(Q_q, T_q)$ from a query-title cluster, we extract a phrase $p$ with our GCTSP-Net, which contains a classifier to predict whether each node belongs to $p$, and an asymmetric traveling salesman decoder to order the predicted positive nodes.

Node Classification with R-GCN. In the GCTSP-Net, we apply Relational Graph Convolutional Networks (R-GCN) [Kipf and Welling, 2016; Gilmer et al., 2017; Schlichtkrull et al., 2017] to the constructed QTIG to perform node classification.

Denote a directed and labeled multi-graph as $G = (V, E, R)$ with nodes $v_i, w_i \in V$ and labeled edges $e_{vw} = (v, r, w) \in E$, where $r \in R$ is a relation type. A class of graph convolutional networks can be understood as a message-passing framework [Gilmer et al., 2017], where the hidden state $h_v^l$ of each node $v \in G$ at layer $l$ is updated based on $m_v^{l+1}$ according to:

$$m_v^{l+1} = \sum_{w \in N(v)} M_l(h_v^l, h_w^l, e_{vw}), \quad (7.3)$$

$$h_v^{l+1} = U_l(h_v^l, m_v^{l+1}), \quad (7.4)$$

where $N(v)$ denotes the neighbors of $v$ in graph $G$, $M_l$ is the message function at layer $l$, and $U_l$ is the vertex update function at layer $l$.

Specifically, the message passing function of Relational Graph Convolutional Networks is defined as:

$$h_v^{l+1} = \sigma\left( \sum_{r \in R} \sum_{w \in N^r(v)} \frac{1}{c_{vw}} W_r^l h_w^l + W_0^l h_v^l \right), \quad (7.5)$$

where $\sigma(\cdot)$ is an element-wise activation function such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $N^r(v)$ is the set of neighbors under relation $r \in R$, and $c_{vw}$ is a problem-specific normalization constant that can be learned or pre-defined (e.g., $c_{vw} = |N^r(v)|$). $W_r^l$ and $W_0^l$ are learned weight matrices. We can see that an R-GCN accumulates the transformed feature vectors of neighboring nodes through a normalized sum. Besides, it learns relation-specific transformation matrices to take the type and direction of each edge into account. In addition, it adds a single self-connection of a special relation type to each node to ensure that the representation of a node at layer $l+1$ can also be informed by its representation at layer $l$.

To avoid the rapid growth of the number of parameters when increasing the number of relations $|R|$, R-GCN exploits basis decomposition and block-diagonal decomposition to regularize the weights of each layer. For basis decomposition, each weight matrix $W_r^l \in \mathbb{R}^{d_{l+1} \times d_l}$ is decomposed as:

$$W_r^l = \sum_{b=1}^{B} a_{rb}^l V_b^l, \quad (7.6)$$

where the $V_b^l \in \mathbb{R}^{d_{l+1} \times d_l}$ are base weight matrices. In this way, only the coefficients $a_{rb}^l$ depend on $r$. For block-diagonal decomposition, $W_r^l$ is defined through a direct sum over a set of low-dimensional matrices:

$$W_r^l = \bigoplus_{b=1}^{B} Q_{br}^l, \quad (7.7)$$

where $W_r^l = \mathrm{diag}(Q_{1r}^l, Q_{2r}^l, \cdots, Q_{Br}^l)$ is a block-diagonal matrix with $Q_{br}^l \in \mathbb{R}^{(d_{l+1}/B) \times (d_l/B)}$. The basis decomposition introduces weight sharing between different relation types, while the block decomposition applies a sparsity constraint on the weight matrices.

In our model, we apply R-GCN with basis decomposition to query-title interaction graphs to perform node classification. We represent each node in the graph by a feature vector consisting of the embeddings of the token's named entity recognition (NER) tag and part-of-speech (POS) tag, whether it is a stop word, the number of characters in the token, as well as the sequential id indicating the order in which the node was added

to the graph. Using the concatenation of these embeddings as the initial node vectors, we pass the graph to a multi-layer R-GCN with a softmax($\cdot$) activation (per node) on the output of the last layer. We label the nodes belonging to the target phrase $p$ as 1 and the other nodes as 0, and train the model by minimizing the binary cross-entropy loss over all nodes.
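A minimal single-layer sketch of this R-GCN with basis decomposition is shown below (PyTorch). The per-edge normalization here simply uses the in-degree of the target node, a simplification of the per-relation constant $c_{vw}$ in Eq. (7.5); everything else follows Eqs. (7.5)-(7.6), and all hyper-parameters are placeholders.

import torch
import torch.nn as nn

class RGCNBasisLayer(nn.Module):
    # One R-GCN layer with basis decomposition, cf. Eqs. (7.5)-(7.6).
    def __init__(self, in_dim, out_dim, num_rels, num_bases):
        super().__init__()
        self.V = nn.Parameter(torch.randn(num_bases, in_dim, out_dim) * 0.01)  # V_b^l
        self.a = nn.Parameter(torch.randn(num_rels, num_bases) * 0.01)         # a_rb^l
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)                       # self-loop W_0^l

    def forward(self, h, edge_index, edge_type):
        # h: (N, in_dim); edge_index: (2, E) source/target ids; edge_type: (E,)
        W = torch.einsum('rb,bio->rio', self.a, self.V)  # W_r^l = sum_b a_rb^l V_b^l
        src, dst = edge_index
        msgs = torch.bmm(h[src].unsqueeze(1), W[edge_type]).squeeze(1)
        deg = torch.zeros(h.size(0)).index_add_(0, dst, torch.ones(dst.size(0)))
        out = self.W0(h).index_add(0, dst, msgs / deg.clamp(min=1.0)[dst].unsqueeze(1))
        return torch.relu(out)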

Node Ordering with ATSP Decoding. After classifying a set of tokens $V_p$ as target phrase nodes, the next step is to sort the nodes to get the final output. In our GCTSP-Net, we model the problem as an asymmetric traveling salesman problem (ATSP), where the objective is to find the shortest route that starts from the “sos” node, visits every predicted positive node, and returns to the “eos” node. We name this approach ATSP-decoding.

We perform ATSP-decoding on a variant of the constructed query-title interaction graph. First, we remove all the syntactic dependency edges. Second, instead of connecting adjacent tokens by a bi-directional “seq” edge, we make these edges unidirectional to indicate the order in the input sequences. Third, we connect “sos” with the first predicted positive token in each input text, and connect the last predicted positive token in each input with the “eos” node. In this way, we remove the influence of prefixing and suffixing tokens in the inputs when finding the shortest path. Finally, the distance between a pair of predicted nodes is defined as the length of the shortest path between them in the modified query-title interaction graph. We can then solve the problem with the Lin-Kernighan traveling salesman heuristic [Helsgaun, 2000] to obtain a route over the predicted nodes and output $p$.

We shall note that ATSP-decoding produces a phrase that contains only unique tokens. In our work, we observe that less than 1% of attention phrases contain duplicated tokens, and most of the duplicated tokens are punctuation. Should duplicate tokens be required when applying our model to other tasks, we only need to design task-specific heuristics to recognize the potentially duplicated tokens (e.g., by counting their frequency in each query and title) and construct multiple nodes for them in the query-title interaction graph.
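The decoding step admits a compact sketch. Since a phrase rarely has more than a handful of tokens, exhaustive search over orderings stands in here for the Lin-Kernighan heuristic used in the actual system; the graph is assumed to be a networkx DiGraph built as described above.

import itertools
import networkx as nx

def atsp_decode(graph, positive_nodes):
    # graph: modified QTIG with unidirectional "seq" edges plus the
    # "sos"/"eos" links described above; node-to-node distances are
    # shortest-path lengths in this graph.
    dist = dict(nx.all_pairs_shortest_path_length(graph))
    best_order, best_cost = None, float('inf')
    for perm in itertools.permutations(positive_nodes):
        route = ['sos'] + list(perm) + ['eos']
        try:
            cost = sum(dist[u][v] for u, v in zip(route, route[1:]))
        except KeyError:  # some leg of the route is unreachable
            continue
        if cost < best_cost:
            best_order, best_cost = list(perm), cost
    return best_order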

Attention Phrase Normalization. The same user attention may be expressed by slightly different phrases. After extracting a phrase using GCTSP-Net, we therefore merge highly similar phrases into one node in our Attention Ontology. Specifically, we examine whether a new phrase $p_n$ is similar to an existing phrase $p_e$ by two criteria: i) the non-stop words in $p_n$ shall be similar (the same or synonyms) to those in $p_e$, and ii) the TF-IDF similarity between their context-enriched representations shall be above a threshold $\delta_m$. The context-enriched representation of a phrase is obtained by using the phrase itself as a query and concatenating its top 5 clicked titles.
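Criterion ii) can be sketched with scikit-learn as follows; the synonym check of criterion i) is omitted, and the threshold value is an assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_enough(repr_new, repr_existing, delta_m=0.6):
    # Each argument is a context-enriched representation: the phrase
    # concatenated with its top-5 clicked titles.
    tfidf = TfidfVectorizer().fit_transform([repr_new, repr_existing])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0] > delta_m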

Training Dataset Construction. To reduce human effort and accelerate the labeling process of training dataset creation, we design effective unsupervised strategies to extract candidate phrases from queries and titles, and provide the extracted candidates together with the query-title clusters to the annotators as assistance. For concepts, we combine bootstrapping with query-title alignment [Liu et al., 2019d]. The bootstrapping strategy exploits pattern-concept duality: we can extract a set of concepts from queries following a set of patterns, and we can learn a set of new patterns from a set of queries with extracted concepts. Thus, we can start from a set of seed patterns and iteratively accumulate more and more patterns and concepts. The query-title alignment strategy is inspired by the observation that a concept in a query is usually mentioned in the clicked titles associated with the query, yet possibly in a more detailed manner. Based on this observation, we align a query with its top clicked titles to find a title chunk that fully contains the query tokens in the same order and potentially contains extra tokens within its span. Such a title chunk is selected as a candidate concept.

For events, we split the original unsegmented document titles into subtitles by punctuation and spaces. After that, we only keep the subtitles with lengths between $L_l$ (we use 6) and $L_h$ (we use 20). We score each remaining subtitle by counting how many unique non-stop query tokens it contains. Subtitles with the same score are further sorted by their click-through rates. Finally, we select the top-ranked subtitle as a candidate event phrase.

Attention Derivation. After extracting a collection of attention nodes (or phrases), we can further derive higher-level concepts or topics from them, which automatically become their parent nodes in our Attention Ontology. On one hand, we derive higher-level concepts by applying common suffix discovery (CSD) to the extracted concepts (a short sketch follows below). We perform word segmentation over all concept phrases and find the high-frequency suffix words or phrases. If a suffix forms a noun phrase, we add it as a new concept node. For example, the concept “animated film (动画|电影)” can be derived from “famous animated film (著名的|动画|电影)”, “award-winning animated film (获奖的|动画|电影)” and “Hayao Miyazaki animated film (宫崎骏|动画|电影)”, as they share the common suffix “animated film (动画|电影)”.

On the other hand, we derive high-level topics by applying common pattern discovery (CPD) to the extracted events. We perform word segmentation, named entity recognition and part-of-speech tagging over the event phrases. After that, we find high-frequency event patterns and recognize the differing elements in the events. If the elements (such as the entities or locations of events) have isA relationships with one or multiple common concepts, we replace the differing elements by their most fine-grained common concept ancestor in the ontology. For example, we can derive the topic “Singer will have a concert (歌手|开|演唱会)” from “Jay Chou will have a concert (周杰伦|开|演唱会)” and “Taylor Swift will have a concert (泰勒斯威夫特|开|演唱会)”, as the two phrases share the same pattern “XXX will have a concert (XXX|开|演唱会)”, and both “Jay Chou (周杰伦)” and “Taylor Swift (泰勒斯威夫特)” belong to the concept “Singer (歌手)”.
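Common suffix discovery itself reduces to counting word-level suffixes over the segmented concepts, as in this sketch; the counts and lengths are assumed values, and the noun-phrase check on candidate suffixes is assumed to happen elsewhere.

from collections import Counter

def common_suffixes(concepts, min_count=3, max_len=3):
    # concepts: segmented concept phrases, each a list of words.
    counts = Counter()
    for words in concepts:
        for k in range(1, min(max_len, len(words) - 1) + 1):
            counts[tuple(words[-k:])] += 1  # proper suffixes only
    return [list(s) for s, c in counts.items() if c >= min_count]

# e.g., the three "animated film" concepts above all share the suffixes
# ("film",) and ("animated", "film"), so both reach min_count = 3.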

7.3.2 Linking User Attentions

The previous step produces a large set of nodes representing user attentions (or interests) at different granularities. Our goal is to construct a taxonomy over these individual nodes to show their correlations. With edges between different user attentions, we can reason over the ontology to infer a user's real interests. In this section, we describe our methods for linking attention nodes and constructing the complete ontology: we construct isA, involve, and correlate relationships between categories, extracted attention nodes, and entities. We exploit the following action-driven strategies to link the different types of nodes.

Edges between Attentions and Categories. To identify the isA relationship between attention-category pairs, we utilize their co-occurrence in user search click logs. Given an attention phrase $p$ as the search query, suppose there are $n^p$ clicked documents for search query $p$ in the search click logs, and among them $n_g^p$ documents belong to category $g$. We then estimate $P(g \mid p)$ by $P(g \mid p) = n_g^p / n^p$, and identify an isA relationship between $p$ and $g$ if $P(g \mid p) > \delta_g$ (we set $\delta_g = 0.3$); a short sketch of this estimate is given below.

Edges between Attentions. To discover isA relationships, we use the same criteria as in attention derivation: we link two concepts by an isA relationship if one concept is the suffix of the other, and we link two topic/event attentions by an isA relationship if they share the same pattern and there exist isA relationships between their non-overlapping tokens. Note that if a topic/event phrase omits an element of another topic/event phrase, this also indicates an isA relationship. For example, “Jay Chou will have a concert” has an isA relationship with both “Singer will have a concert” and “have a concert”. For the involve relationship, we connect a concept to a topic if the concept is contained in the topic phrase.

Edges between Attentions and Entities. We extract: i) isA relationships between concepts and entities; and ii) involve relationships between topics/events and entities, locations or triggers.
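The category-edge estimate above is simple enough to sketch directly (plain Python; the input format is our own assumption):

def categories_of(p, click_log, delta_g=0.3):
    # click_log: list of (query, clicked_doc_category) pairs.
    cats = [g for q, g in click_log if q == p]  # the n^p clicked docs for query p
    n = float(len(cats))
    # isA(p, g) holds when P(g|p) = n_g^p / n^p exceeds delta_g
    return [g for g in set(cats) if cats.count(g) / n > delta_g] if cats else []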

Figure 7.4: Automatic construction of the training datasets for classifying the isA relationship between concepts and entities.

For concepts and entities, relying on strategies such as co-occurrence would introduce a lot of noise, as co-occurrence does not always indicate that an entity belongs to a concept. To solve this issue, we train a concept-entity relationship classifier based on the concept and the entity's context information in the clicked documents. Labeling a training dataset for this task would require substantial human effort. Instead of manual labeling, we propose a method to automatically construct a training dataset from user search click graphs. Figure 7.4 shows how we construct such a training dataset. We select concept-entity pairs from the search logs as positive examples if: i) the concept and the entity are two consecutive queries from one user, and ii) the entity is mentioned in a document which a user clicked after issuing the concept as a search query. Besides, we select entities belonging to the same higher-level concept or category, and insert them into random positions of the document to create negative examples for the dataset. For the classifier, we can train a model such as GBDT on manual features, or fine-tune a pre-trained language model to incorporate semantic features and infer whether the context indicates an isA relationship for the concept-entity pair.

For events/topics and entities, we only recognize the important entities, triggers and locations in the event/topic, and connect them by involve relationship edges. We first create an initial dataset by extracting all the entities, locations, and trigger words in the events/topics based on a set of predefined trigger words and entities. The dataset is then manually revised by annotators to remove the unimportant elements. Based on this dataset, we reuse our GCTSP-Net and train it without ATSP-decoding to perform 4-class (entity, location, trigger, other) node classification over the query-title interaction graphs of the events/topics. In this way, we can recognize the different elements of an event or topic and construct involve edges between them.

Edges between Entities. Finally, we construct the correlate relationship between entity pairs using co-occurrence information in user queries and documents.

We utilize high-frequency co-occurring entity pairs in queries and documents as positive entity pairs, and perform negative sampling to create negative entity pairs. After automatically creating a dataset from search click logs and web documents, we learn the embedding vectors of entities with a hinge loss, so that the Euclidean distance between two correlated entities will be small. After learning the embedding vectors of the different entities, we classify a pair of entities as correlated if their Euclidean distance is smaller than a threshold.

Note that the same approach to correlate relationship discovery can be applied to other types of nodes, such as concepts. Currently, we have only constructed such relationships between entities.
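A minimal sketch of this embedding objective in PyTorch follows; the margin value and the batch interface are assumptions, not the deployed training code.

import torch
import torch.nn as nn

def correlate_loss(emb, pos_pairs, neg_pairs, margin=1.0):
    # emb: nn.Embedding over entity ids. Pull co-occurring (positive)
    # pairs close in Euclidean distance and push sampled negative pairs
    # at least `margin` apart.
    def dists(pairs):
        a = emb(torch.tensor([i for i, _ in pairs]))
        b = emb(torch.tensor([j for _, j in pairs]))
        return (a - b).norm(dim=1)
    return dists(pos_pairs).mean() + torch.relu(margin - dists(neg_pairs)).mean()

# Usage: emb = nn.Embedding(num_entities, 64); minimize correlate_loss
# with SGD/Adam; afterwards, classify a pair as correlated if the
# distance between its embeddings falls below a chosen threshold.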

7.4 Applications

In this section, we show how our Attention Ontology can be applied to a series of NLP tasks to achieve user-centered text understanding.

Story Tree Formation. The relationships between events and the involved entities, triggers and locations can be utilized to cluster correlated events and form a story tree [Liu et al., 2017]. A story tree organizes a collection of related events in a tree structure to highlight the evolving nature of a real-world story. Given an event $p^e$, we retrieve a collection of related events $P^e$ and use them to form a story tree, which allows us to better serve users by recommending follow-up events from $P^e$ when they have read news articles about $p^e$.

Constructing a story tree from the Attention Ontology involves four steps: retrieving correlated events, calculating the similarity matrix, hierarchical clustering, and tree formation. First, with the help of the Attention Ontology, we can retrieve a related event set $P^e$ given an event $p^e$. The criteria to retrieve “correlated” events can be flexible. For example, we can require that each event $p_i^e \in P^e$ shares at least one common child entity with $p^e$, or we can force their triggers to be the same. Second, we can estimate the similarity between each pair of events based on the text similarity of the event phrases and the common entities, triggers or locations shared by them. Specifically, we calculate the similarity between two events by:

$$s(p_1^e, p_2^e) = f_m(p_1^e, p_2^e) + f_g(p_1^e, p_2^e) + f_e(p_1^e, p_2^e), \quad (7.8)$$

$$f_m(p_1^e, p_2^e) = \mathrm{CosineSimilarity}(v^{p_1^e}, v^{p_2^e}), \quad (7.9)$$

$$f_g(p_1^e, p_2^e) = \mathrm{CosineSimilarity}(v^{g_1^e}, v^{g_2^e}), \quad (7.10)$$

$$f_e(p_1^e, p_2^e) = \text{TF-IDFSimilarity}(E^{p_1^e}, E^{p_2^e}), \quad (7.11)$$

where $s(p_1^e, p_2^e)$ is the measured similarity between the two events. It is given by the sum of three scores: i) $f_m(p_1^e, p_2^e)$, which represents the semantic similarity of the two event phrases; we use the cosine similarity of the BERT-based phrase encoding vectors $v^{p_1^e}$ and $v^{p_2^e}$ of the two events [Devlin et al., 2018]; ii) $f_g(p_1^e, p_2^e)$, which represents the similarity of the triggers in the two events; we calculate the similarity between trigger $g_1^e$ in event $p_1^e$ and $g_2^e$ in $p_2^e$ by the cosine similarity of the word vectors $v^{g_1^e}$ and $v^{g_2^e}$ from [Song et al., 2018c]; and iii) $f_e(p_1^e, p_2^e)$, the TF-IDF similarity between the set of entities $E^{p_1^e}$ of event $p_1^e$ and $E^{p_2^e}$ of $p_2^e$. Having measured the similarities between events, we perform hierarchical clustering to group them into hierarchical clusters. Finally, we order the events by time and put the events in the same cluster into the same branch. In this way, the cluster hierarchy is transformed into the branches of a tree.

Document Tagging. We can also utilize the attention phrases to describe the main topics of a document by tagging the document with correlated attentions, even if a phrase is not explicitly mentioned in the document. For example, a document discussing the films “Captain America: The First Avenger”, “Avengers: Endgame” and “Iron Man” can be tagged with the concept “Marvel Super Hero Movies” even though the concept may not be contained in it. Similarly, a document about “Theresa May's Resignation Speech” can be tagged with the topic “Brexit Negotiation”, while traditional keyword-based methods are not able to reveal such correlations.

To tag concepts to documents, we combine a matching-based approach with a probabilistic inference-based approach based on the key entities in a document. Suppose $d$ contains a set of key entities $E^d$. For each entity $e^d \in E^d$, we obtain its parent concepts $P^c$ in the Attention Ontology as candidate tags. For each candidate concept $p^c \in P^c$, we score the coherence between $d$ and $p^c$ by calculating the TF-IDF similarity between the title of $d$ and the context-enriched representation of $p^c$ (i.e., the top clicked titles of $p^c$).

When no parent concept can be found in the Attention Ontology, we identify relevant concepts by utilizing the context information of the entities in $d$. Denote the

probability that concept $p^c$ is related to document $d$ as $P(p^c \mid d)$. We estimate it by:

$$P(p^c \mid d) = \sum_{i=1}^{|E^d|} P(p^c \mid e_i^d)\, P(e_i^d \mid d), \quad (7.12)$$

$$P(p^c \mid e_i^d) = \sum_{j=1}^{|X^{e_i^d}|} P(p^c \mid x_j)\, P(x_j \mid e_i^d), \quad (7.13)$$

$$P(p^c \mid x_j) = \begin{cases} \frac{1}{|P_{x_j}|} & \text{if } x_j \text{ is a substring of } p^c, \\ 0 & \text{otherwise,} \end{cases} \quad (7.14)$$

where $E^d$ is the set of key entities of $d$, and $P(e_i^d \mid d)$ is the document frequency of entity $e_i^d \in E^d$. $P(p^c \mid e_i^d)$ estimates the probability of concept $p^c$ given $e_i^d$, which is inferred from the context words of $e_i^d$. $P(x_j \mid e_i^d)$ is the co-occurrence probability of context word $x_j$ with $e_i^d$; we consider two words as co-occurring if they are contained in the same sentence. $X^{e_i^d}$ is the set of contextual words of $e_i^d$ in $d$, and $P_{x_j}$ is the set of concepts containing $x_j$ as a substring.

To tag events or topics to a document, we combine longest-common-subsequence-based (LCS-based) textual matching with Duet-based semantic matching [Mitra et al., 2017]. For LCS-based matching, we concatenate a document title with the first sentence of the content, and calculate the length of the longest common subsequence between a topic/event phrase and the concatenated string. For Duet-based matching, we utilize the Duet neural network [Mitra et al., 2017] to classify whether the phrase matches the concatenated string. If the length of the longest common subsequence is above a threshold and the classification result is positive, we tag the phrase to the document.

Query Understanding. A user who used to search for an entity may be interested in a broader class of similar entities, yet may not even be aware of them. With the help of our ontology, we can better understand users' implicit intentions and perform query conceptualization or recommendation to improve the user experience in search engines. Specifically, we analyze whether a query $q$ contains a concept $p^c$ or an entity $e$. If a query conveys a concept $p^c$, we can rewrite it by concatenating $q$ with each of the entities $e_i$ that have an isA relationship with $p^c$; that is, we rewrite the query into the format “$q$ $e_i$”. If a query conveys an entity $e$, we can perform query recommendation by recommending to users the entities

$e_i$ that have a correlate relationship with $e$ in the ontology.
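As a sketch of the probabilistic concept-tagging step in Eqs. (7.12)-(7.14), assuming the entity statistics have already been collected (the input dictionaries are our own illustrative interface):

def concept_scores(doc_entities, contexts, concept_index):
    # doc_entities:  entity -> P(e_i^d | d), the entity's document frequency
    # contexts:      entity -> {context word x_j: P(x_j | e_i^d)}
    # concept_index: context word x_j -> set of concepts containing x_j
    scores = {}
    for e, p_e in doc_entities.items():
        for x, p_x in contexts.get(e, {}).items():
            candidates = concept_index.get(x, ())
            for concept in candidates:          # x_j is a substring of concept
                p_c_x = 1.0 / len(candidates)   # Eq. (7.14)
                scores[concept] = scores.get(concept, 0.0) + p_c_x * p_x * p_e
    return scores  # P(p^c | d) per candidate concept, Eqs. (7.12)-(7.13)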

Node type | Category | Concept | Topic | Event | Entity
Quantity | 1,206 | 460,652 | 12,679 | 86,253 | 1,980,841
Growth / day | - | 11,000 | - | 120 | -

Table 7.1: Nodes in the attention ontology.

Relationship | isA | correlate | involve
Quantity | 490,741 | 1,080,344 | 160,485
Accuracy | 95%+ | 95%+ | 99%+

Table 7.2: Edges in the attention ontology.

7.5 Evaluation

Our proposed GIANT system is the core ontology system in multiple applications at an anonymous company and is serving more than a billion daily active users around the world. It is implemented in Python 2.7 and C++. Each component of our system works as a service and is deployed with Tars1, a high-performance remote procedure call (RPC) framework based on a name service and the Tars protocol. The attention mining and linking services are deployed on 10 dockers, each configured with four Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz processors and 6 GB of memory. Applications such as document tagging run on 48 dockers with the same configuration. A MySQL database is used for data storage. We construct the attention ontology from large-scale, real-world daily user search click logs. While our system is deployed on Chinese datasets, the techniques proposed in our work can easily be adapted to other languages.

7.5.1 Evaluation of the Attention Ontology

Table 7.1 shows the statistics of the different nodes in the attention ontology. We extract attention phrases from daily user search click logs; therefore, the scale of our ontology keeps growing. Currently, our ontology contains 1,206 predefined categories, 460,652 concepts, 12,679 topics, 86,253 events and 1,980,841 entities. We are able to extract around 27,000 concepts and 400 events every day, of which around 11,000 concepts and 120 events are new. For online concept and event tagging, our system processes 350 documents per second. Table 7.2 shows the statistics and accuracies of the different types of edges (relationships) in the attention ontology. Currently, we have constructed more than 490K isA

1https://github.com/TarsCloud/Tars

Categories | Concepts | Instances
Sports | Famous long-distance runner | Dennis Kipruto Kimetto, Kenenisa Bekele
Stars | Actors who committed suicide | Robin Williams, Zhang Guorong, David Strickland
Drama series | American crime drama series | American Crime Story, Breaking Bad, Criminal Minds
Fiction | Detective fiction | Adventure of Sherlock Holmes, The Maltese Falcon

Table 7.3: Showcases of concepts and the related categories and entities.

relationships, 1,080K correlate relationships, and 160K involve relationships between different nodes. A human evaluation performed by three managers at an anonymous company shows that the accuracies of the three relationship types are above 95%, 95% and 99%, respectively.

Categories | Topics | Events | Entities
Music | Taylor Swift and Katy Perry | Taylor Swift and Katy Perry's reconciliation | Katy Perry, Taylor Swift
Cellphone | cellphone launch events | Apple news conferences 2018; Samsung Galaxy S9 officially released | Apple, iPhone, Samsung, mid-season, Samsung Galaxy S9
Esports | League of Legends season 8 finals | LOL S8 finals announcement; IG wins the S8 final; IG's reward for winning the S8 final revealed | League of Legends, IG team, finals

Table 7.4: Showcases of events and the related categories, topics and involved entities.

To give intuition into what kinds of attention phrases can be derived from search click graphs, Table 7.3 and Table 7.4 show a few typical examples of concepts and events (translated from Chinese), together with some topics, categories, and entities that share isA relationships with them. By fully exploiting the information about user actions contained in search click graphs, we transform user actions into user attentions, and extract concepts such as “Actors who committed suicide (自杀的演员)”. Besides, we can also extract events or higher-level topics of user interest, such as “Taylor Swift and

Katy Perry”, if a user often inputs related queries. Based on the connections between entities, concepts, events and topics, we can also infer what a user really cares about and improve the performance of recommendation.

7.5.2 Evaluation of the GCTSP-Net

We evaluate our GCTSP-Net on multiple tasks by comparing it to a variety of baseline approaches.

Datasets. To the best of our knowledge, there is no publicly available dataset suitable for the task of heterogeneous phrase mining from user queries and search click graphs. To construct the user attention ontology, we create two large-scale datasets for concept mining and event mining using the approaches described in Sec. 7.3: the Concept Mining Dataset (CMD) and the Event Mining Dataset (EMD). These two datasets contain 10,000 examples and 10,668 examples, respectively. Each example is composed of a set of correlated queries and top clicked document titles from real-world query logs, together with a manually labeled gold phrase (concept or event). The event mining dataset further contains the triggers, key entities and locations of each event. We use the earliest article publication time as the time of each event example. The datasets are labeled by 3 professional product managers at an anonymous company and 3 graduate students. For each dataset, we utilize 80% as the training set, 10% as the development set, and 10% as the test set. The datasets will be published for research purposes.2

Methodology and Models for Comparison. We compare our GCTSP-Net with the following baseline methods on the concept mining task:

• TextRank. A classical graph-based keyword extraction model [Mihalcea and Tarau, 2004].3

• AutoPhrase. A state-of-the-art phrase mining algorithm that extracts quality phrases based on POS-guided segmentation and knowledge base [Shang et al., 2018].4

• Match. Extract concepts from queries and titles by matching patterns obtained from bootstrapping [Liu et al., 2019c].

• Align. Extract concepts by the query-title alignment strategy described in Sec. 7.3.1.

2 https://github.com/anonymous/GIANT
3 https://github.com/letiantian/TextRank4ZH
4 https://github.com/shangjingbo1226/AutoPhrase

• MatchAlign. Extract concepts by both pattern matching and the query-title alignment strategy.

• LSTM-CRF-Q. Apply LSTM-CRF [Huang et al., 2015] to the input query.

• LSTM-CRF-T. Apply LSTM-CRF [Huang et al., 2015] to the document titles.

For the TextRank and AutoPhrase algorithms, we extract the top 5 keywords or phrases from queries and titles, and concatenate them in the same order as they appear in the query/title to get the extracted phrase. For MatchAlign, we select the most frequent result if multiple phrases are extracted. Both LSTM-CRF-Q and LSTM-CRF-T consist of a 200-dimensional word embedding layer initialized with the word vectors proposed in [Song et al., 2018c], a BiLSTM layer with hidden size 25 for each direction, and a Conditional Random Field (CRF) layer which predicts whether each word belongs to the output phrase by Beginning-Inside-Outside (BIO) tags. For the event mining task, we compare with TextRank and LSTM-CRF. In addition, we also compare with:

• TextSummary. An encoder-decoder model with attention mechanism for text summarization.5

• CoverRank. Rank queries and subtitles by counting the covered non-stop query words, as described in Sec. 7.3.1.

For TextRank, we use the top 2 queries and top 2 selected subtitles given by CoverRank, and perform re-ranking. For TextSummary, we use the 200-dimensional word embeddings in [Song et al., 2018c], a two-layer BiLSTM (256 hidden size for each direction) as the encoder, and a one-layer LSTM with 512 hidden size and attention mechanism as the decoder (beam size for decoding is 10). We feed the concatenation of queries and titles into TextSummary to generate the output. For LSTM-CRF, its LSTM layer is configured similarly to the encoder of TextSummary. We feed each title individually into it to get different outputs, filter the outputs by length (number of characters between 6 and 20), and finally select the phrase which belongs to the top clicked title. The task of event key element recognition (key entities, trigger, location) is a 4-class classification task over each word. We compare our model with LSTM and LSTM-CRF, where LSTM replaces the CRF layer in LSTM-CRF with a softmax layer. For each baseline, we individually tune the hyper-parameters to achieve its best performance. As to our approach (GCTSP-Net), we stack a 5-layer R-GCN with hidden size 32 and number of bases B = 5 in the basis decomposition for graph encoding and node classification. We will open-source our code together with the datasets for research purposes.

5 https://github.com/dongjun-Lee/text-summarization-tensorflow
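As an illustration, the following is a minimal sketch of the LSTM-CRF-Q baseline as configured above, using the third-party pytorch-crf package (torchcrf.CRF); the embedding dimension, hidden size, and B/I/O tag inventory follow the description, while the class name and other wiring details are our assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package

class LstmCrfTagger(nn.Module):
    """BiLSTM-CRF tagger: 200-d embeddings, hidden 25 per direction, BIO tags."""
    def __init__(self, vocab_size, emb_dim=200, hidden=25, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # init from pre-trained vectors in practice
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)     # emission scores over B/I/O
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags, mask):
        # mask is a bool tensor marking real (non-padding) positions
        emissions = self.proj(self.bilstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)    # negative log-likelihood

    def predict(self, tokens, mask):
        emissions = self.proj(self.bilstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)    # best BIO tag sequence per input
```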

Method | EM | F1 | COV
TextRank | 0.1941 | 0.7356 | 1
AutoPhrase | 0.0725 | 0.4839 | 0.9353
Match | 0.1494 | 0.3054 | 0.3639
Align | 0.7016 | 0.8895 | 0.9611
MatchAlign | 0.6462 | 0.8814 | 0.9700
LSTM-CRF-Q | 0.7171 | 0.8828 | 0.9731
LSTM-CRF-T | 0.3106 | 0.6333 | 0.9062
GCTSP-Net | 0.783 | 0.9576 | 1

Table 7.5: Comparison of concept mining approaches.

Method | EM | F1 | COV
TextRank | 0.3968 | 0.8102 | 1
CoverRank | 0.4663 | 0.8169 | 1
TextSummary | 0.0047 | 0.1064 | 1
LSTM-CRF | 0.4597 | 0.8469 | 1
GCTSP-Net | 0.5164 | 0.8562 | 0.9972

Table 7.6: Comparison of event mining approaches.

Metrics. We use Exact Match (EM), F1 and coverage rate (COV) to evaluate the performance of the concept and event mining tasks. The exact match score is 1 if the prediction is exactly the same as the ground truth and 0 otherwise. F1 measures the portion of overlapping tokens between the predicted phrase and the ground-truth phrase [Rajpurkar et al., 2016]. The coverage rate measures the percentage of non-empty predictions of each approach. For the event key element recognition task, we evaluate by the F1-macro, F1-micro, and F1-weighted metrics.
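For clarity, the following is a minimal sketch of these phrase-level metrics; whitespace tokenization is an assumption (Chinese text would first require word segmentation), and the function names are ours.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # 1 if the prediction equals the ground truth exactly, else 0
    return float(pred.strip() == gold.strip())

def f1_score(pred: str, gold: str) -> float:
    # token-overlap F1 in the style of [Rajpurkar et al., 2016]
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def coverage(preds: list) -> float:
    # fraction of non-empty predictions
    return sum(1 for p in preds if p.strip()) / max(len(preds), 1)
```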

Method | F1-macro | F1-micro | F1-weighted
LSTM | 0.2108 | 0.5532 | 0.6563
LSTM-CRF | 0.2610 | 0.6468 | 0.7238
GCTSP-Net | 0.6291 | 0.9438 | 0.9331

Table 7.7: Comparison of event key element recognition approaches.

Evaluation results and analysis. Table 7.5, Table 7.6, and Table 7.7 compare our model with different baselines on the CMD and EMD datasets for concept mining, event mining, and event key element recognition. We can see that our unified model significantly outperforms all baseline approaches on the different tasks. The outstanding performance can be attributed to two factors. First, our query-title interaction graph efficiently encodes the information of word adjacency, word features, dependencies, query-title overlapping and so on in a structured manner, which is critical for both the attention phrase mining tasks and the event key element recognition task. Second, the multi-layer R-GCN encoder in our model can learn from both the node features and the multi-resolution structural patterns of the query-title interaction graph. Therefore, combining the query-title interaction graph with the R-GCN encoder, we achieve strong performance on different node classification tasks. Furthermore, the unsupervised ATSP-decoding sorts the extracted tokens to form an ordered phrase efficiently. In contrast, heuristic-based and LSTM-based approaches are not robust to the noise in the dataset, and cannot capture the structural information in queries and titles. In addition, existing keyphrase extraction methods such as TextRank [Mihalcea and Tarau, 2004] and AutoPhrase [Shang et al., 2018] are better suited to extracting keywords or phrases from long documents, and fail to give satisfactory performance on our tasks.

7.5.3 Applications: Document Tagging and Story Tree Formation

Document Tagging. For document tagging, our system currently processes around 1,525,682 documents per day, where about 35% of them can be tagged with at least one concept, and 4% can be tagged with an event. We perform human evaluation by sampling 500 documents for each major category ("game", "technology", "car", and "entertainment"). The results show that the precision of concept tagging for documents is 88% for the "game" category, 91% for "technology", 90% for "car", and 87% for "entertainment". The overall precision for documents of all categories is 88%. As to event tagging on documents, the overall precision is 96%.

Story Tree Formation. We apply the story tree formation algorithm described in Sec. 7.4 to real-world events to test its performance. Figure 7.5 gives an example to illustrate what we can obtain through our approach. In the example, each node is an event, together with the documents that can be tagged by this event. We can see that our method successfully clusters events related to "China-US Trade", orders the events by the published time of the articles, and shows the evolving structure of coherent events. For example, the branch consisting of events 3–6 is mainly about "Sino-US trade war is emerging", the branch of events 8–10 revolves around "US agrees limited trade deal with China", and events 11–14 are about "Impact of Sino-US trade war felt in both countries".

144 2018-03-24 2018-08-02 2018-04-02 2018-08-02 2018-03-23 2018-03-25 2018-06-19 2018-01-11 2018-05-04 2018-08-29

2018-03-23 2018-04-02 2018-08-01 2018-05-14 2018-07-08 2018-07-04 2018-05-20 2018-07-5

Figure 7.5: An example of a story tree constructed by our approach.


Figure 7.6: The click-through rates with/without extracted tags.

By clustering and organizing events and related documents in such a tree structure, we can track the development of different stories (clusters of coherent events), reduce information redundancy, and improve document recommendation by recommending to users the follow-up events they are interested in.

7.5.4 Online Recommendation Performance

We evaluate the effect of our attention ontology on recommendation by analyzing its performance in the news feeds stream of an anonymous application which has more than 110 million daily active users. In the application, both users and articles are tagged with categories, topics, concepts, events or entities from the attention ontology, as shown in Figure 7.2. The application recommends news articles to users based on a variety of strategies, such as content-based recommendation, collaborative filtering and so on. Content-based recommendation matches users with articles through the common tags they share. We analyze the Click-Through Rate (CTR) given by different tags from July 16, 2019 to August 15, 2019 to evaluate their effects. The click-through rate is the ratio of the number of users who click on a recommended link to the total number of users who received the recommendation. Figure 7.6 compares the CTR when we recommend documents to users with different strategies. Traditional recommender systems only utilize the category tags and entity tags. We can see that including more types of attention tags (topics, events, or concepts) in recommendation consistently helps improve the CTR on different days: the average CTR improved from 12.47% to 13.02%. The reason is that the extracted concepts, events and topics can depict user interests at suitable granularities. They are quite helpful in alleviating the problems of inaccurate and monotonous recommendation.


Figure 7.7: The click-through rates of different tags.

We further analyze the effect of each type of tag on recommendation. Figure 7.7 shows the CTR of the recommendations given by different types of tags. The average CTRs for topic, event, entity, concept and category tags are 16.18%, 14.78%, 12.93%, 11.82%, and 9.04%, respectively. The CTRs of both topic tags and event tags are much higher than those of category and entity tags, which shows the effectiveness of our constructed ontology. For events, the events happening on different days differ dramatically from each other, and they are not always attractive to users. Therefore, the CTR of event tags is less stable than that of topic tags. Concept tags are generalizations of fine-grained entities and have isA relationships with them. As there may be noise when we infer user interests using the relationships between entities and concepts, the CTR of concept tags is slightly lower than that of entity tags. However, compared to entities, our experience shows that concepts can significantly increase the diversity of recommendation and are helpful in solving the problem of monotonous recommendation.

7.6 Conclusion

In this chapter, we describe the design and implementation of GIANT, a system proposed to construct a web-scale user attention ontology from a large amount of query logs and search click graphs for various applications. The ontology consists of around two million heterogeneous nodes with three types of relationships between them, and keeps growing with newly extracted nodes and identified relationships every day. To construct the ontology, we propose the query-title interaction graph to represent the correlations (such as adjacency or syntactical dependency) between the tokens in correlated queries and document titles. Furthermore, we propose the GCTSP-Net to

extract multi-type phrases from the query-title interaction graph, as well as to recognize the key entities, triggers or locations in events. After constructing the attention ontology with our models, we apply it to different applications, including document tagging, story tree formation, and recommendation. We run extensive experiments and analyses to evaluate the quality of the constructed ontology and the performance of our new algorithms. The results show that our approach outperforms a variety of baseline methods on three tasks. In addition, our attention ontology significantly improves the CTR of news feeds recommendation in a real-world application.

Part III

Text Generation: Asking Questions for Machine Reading Comprehension


Automatic question generation is an important technique that can improve the training of question answering systems, help to start or continue a conversation with humans, and provide assessment materials for educational purposes. It is critical to both human and machine intelligence.

In Chapter 8, we propose our Clue Guided Copy Network for Question Generation (CGC-QG), which is a sequence-to-sequence generative model with copying mechanism, yet employing a variety of novel components and techniques to boost the performance of question generation. Our model explicitly predicts the potential clue words in an input sentence through a clue prediction module, and learns to copy potential clue words and generate questions through a sequence-to-sequence model with copy mechanism. CGC-QG is a type of answer-aware question generation model.

Answer-aware question generation models are ineffective at generating a large number of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping problem. Therefore, in Chapter 9, we further propose Answer-Clue-Style-aware Question Generation (ACS-QG), a novel system aimed at automatically generating diverse and high-quality question-answer pairs from unlabeled text corpora at scale by mimicking the way a human asks questions. Instead of predicting the clue words for question generation as we do in Chapter 8, we select appropriate answers and correlated clues from unlabeled text, and utilize them as inputs to generate questions.

Chapter 8

Learning to Generate Questions by Learning What not to Generate

Automatic question generation is an important technique that can improve the training of question answering, help chatbots to start or continue a conversation with humans, and provide assessment materials for educational purposes. Existing neural question generation models are not sufficient mainly due to their inability to properly model the process of how each word in the question is selected, i.e., whether repeating the given passage or being generated from a vocabulary. In this chapter, we propose our Clue Guided Copy Network for Question Generation (CGC-QG), which is a sequence-to-sequence generative model with copying mechanism, yet employing a variety of novel components and techniques to boost the performance of question generation. In CGC-QG, we design a multi-task labeling strategy to identify whether a question word should be copied from the input passage or be generated instead, guiding the model to learn the accurate boundaries between copying and generation. Furthermore, our input passage encoder takes as input, among a diverse range of other features, the prediction made by a clue word predictor, which helps identify whether each word in the input passage is a potential clue to be copied into the target question. The clue word predictor is designed based on a novel application of Graph Convolutional Networks onto a syntactic dependency tree representation of each passage, thus being able to predict clue words only based on their context in the passage and their relative positions to the answer in the tree. We jointly train the clue prediction as well as question generation with multi-task learning and a number of practical strategies to reduce the complexity. Extensive evaluations show that our model significantly improves the performance of question generation and outperforms all previous state-of-the-art neural question generation models by a substantial margin.


Passage: Today, Barack Obama gives a speech on democracy in the White House

Question 1: Who gives a speech today?

Question 2: The speech in the White House is given by whom?

Figure 8.1: Questions are often asked by repeating some text chunks in the input passage, while there is great flexibility as to which chunks are repeated.

8.1 Introduction

Asking questions plays a vital role in both the growth of human beings and the improvement of artificial intelligence systems. As a dual task of question answering, question generation based on a text passage and a given answer has attracted much attention in recent years. One of the key applications of question generation is to automatically produce question-answer pairs to enhance machine reading comprehension systems [Du et al., 2017; Yuan et al., 2017; Tang et al., 2017; Tang et al., 2018]. Another application is generating practice exercises and assessments for educational purposes [Heilman and Smith, 2010; Heilman, 2011; Danon and Last, 2017]. Besides, question generation is also important in conversational systems and chatbots such as Siri, Cortana, Alexa and Google Assistant, helping them to kick-start and continue a conversation with human users [Mostafazadeh et al., 2016].

Conventional methods for question generation rely on heuristic rules to perform syntactic transformations of a sentence into factual questions [Heilman, 2011; Chali and Hasan, 2015], following grammatical and lexical analysis. However, such methods require specifically crafted transformation and generation rules, with low generalizability. Recently, various neural network models have been proposed for question generation [Du et al., 2017; Zhou et al., 2017; Hu et al., 2018; Kim et al., 2018; Gao et al., 2018]. These models formulate the question generation task as a sequence-to-sequence (Seq2Seq) neural learning problem, where different types of encoders and decoders have been designed. Like many other text generation tasks, the copying or pointer mechanism [Gulcehre et al., 2016] is also widely adopted in question generation to handle the copy phenomenon between input passages and output questions, e.g., [Zhou et al., 2017; Yuan et al., 2017; Subramanian et al., 2017; Kim et al., 2018; Song et al., 2018a]. However, we point out a common limitation that hinders these question generation models: they mainly use copying mechanisms to handle out-of-vocabulary (OOV) issues, and fail to mark a clear boundary between the set of question words that should be directly copied from the input text and those that should be generated instead.

In this chapter, we generate questions by learning to identify where each word in the question should come from, i.e., whether it is generated from a vocabulary or copied from the input text, and, given the answer and its context, what words can potentially be copied from the input passage. People usually repeat text chunks in an input passage when asking questions about it, and generate the remaining words from their own language to form a complete question. For example, in Fig. 8.1, given an input passage "Today, Barack Obama gives a speech on democracy in the White House" and an answer "Barack Obama", we can ask a question "The speech in the White House is given by whom?" Here the text chunks "speech" and "in the White House" are copied from the input passage.

However, in the situation where a vocabulary word overlaps between a passage and a question, existing models do not clearly identify the underlying reason for the overlap. In other words, whether this word is copied from the input or generated from the vocabulary is not properly labeled. For example, some approaches [Zhou et al., 2017] only take out-of-vocabulary (OOV) words that are shared by both the input passage and target question as copied words, and adopt a copying mechanism to learn to copy these OOV words into questions. In our work, given a passage and a question in a training dataset, we label a word as a copied word if it is a non-stop word shared by both the passage and the question, and its word frequency rank in the vocabulary is lower than a threshold. We further aggressively shortlist the output vocabulary based on frequency analysis of the non-copied question words (according to our labeling criteria). Combining this labeling strategy with target vocabulary reduction, our model can make better decisions about when to copy or generate during question generation, and predict what to generate more accurately.

After applying the above strategies, a remaining problem is which words we should choose to copy. For example, in Fig. 8.1, we can either ask the question "Who gives a speech today?" or the question "The speech in the White House is given by whom?", where the two questions are related to two different copied text chunks "today" and "White House". We can see that asking a question about a passage and a given answer is actually a "one-to-many" mapping problem: we can ask questions in different ways based on which subset of words we choose to copy from the input. Therefore, how to enable a neural model to learn what to copy is a critical problem. To solve this issue, we propose to predict the potential clue words in input passages. A word is considered a "clue word" if it is helpful to reduce the uncertainty of the model about how to ask a question or what to copy, such as "White House" in question 2 of Fig. 8.1. In our model, we directly use our previously mentioned copy word labeling strategy to assign a binary label to each input word to indicate whether it is a clue word or not.

Figure 8.2: An example to show the syntactic structure of an input sentence. Clue words "White House" and "today" are close to the answer chunk "Barack Obama" with respect to the graph distance, though they are not close to it in terms of the word order distance.

To predict the potential clue words in the input, we have designed a novel clue prediction model that combines the syntactic dependency tree of an input with Graph Convolutional Networks. The intuition underlying our model design is that the words copied from an input sentence to an output question are usually closely related to the answer chunk, and the patterns of the dependency paths between the copied words and answer chunks can be captured via Graph Convolutional Networks. Combining the graphical representation of input sentences with GCN enables us to incorporate both word features, such as word vector representations, Named Entity (NE) tags, Part-of-Speech (POS) tags and so on, and the syntactic structure of sentences for clue word prediction.

To generate questions given an answer chunk and the predicted clue word distribution in an input sentence, we apply the Seq2Seq framework, which contains our feature-rich sentence encoder and a decoder with attention and copy mechanism. The clue word prediction results are incorporated into the encoder by a binary feature embedding. Based on our multi-task labeling strategy, i.e., labeling for both clue prediction and question generation, we jointly learn the different components of our model in a supervised manner.

We performed extensive evaluation on two large question answering datasets: the SQuAD dataset v1.1 [Rajpurkar et al., 2016] and the NewsQA dataset [Trischler et al., 2016]. For each dataset, we use the answer chunk and the sentence that contains the answer chunk as input, and try to predict the question as output. We compared our model with a variety of existing rule-based and neural network based models. Our approach achieves significant improvement by jointly learning the potential clue word distribution for copying and the encoder-decoder framework for generation, and outperforms state-of-the-art approaches.

154 Passage: The Soviet Union and the People’s Republic of China supported post–World War II communist movements in foreign nations and colonies to advance their own interests, but were not always successful.

Question: Who along with Russia supported post WW-II communist movements?

Answer: the People’s Republic of China

Figure 8.3: An example from the SQuAD dataset. Our task is to generate questions given an input passage and an answer. In the SQuAD dataset, answers are sub-spans of the passages.

8.2 Problem Definition and Motivation

In this section, we formally introduce the problem of question generation, and illustrate the motivation behind our work.

8.2.1 Answer-aware Question Generation

Let us denote a passage by $P$, a question related to this passage by $Q$, and the answer of that question by $A$. A passage can be either an input sentence or a paragraph, based on different datasets. A passage consists of a sequence of words $P = \{p_t\}_{t=1}^{|P|}$, where $|P|$ denotes the length of $P$. A question $Q = \{q_t\}_{t=1}^{|Q|}$ contains words from either a predefined vocabulary $V$ or from the input text $P$. The task is finding the most probable question $\hat{Q}$ given an input passage and an answer:

$$\hat{Q} = \arg\max_{Q} \operatorname{prob}(Q \mid P, A). \tag{8.1}$$

Fig. 8.3 shows an example of the dataset used in our work. Note that the answer $A$ of the question $Q$ is limited to sub-spans of the input passage. However, our work can be easily adapted to the cases where the answer is not a sub-span of the input passage by adding an extra answer encoder.

8.2.2 What to Ask: Clue Word Prediction

Even given the answer of a desired question, the task of question generation is still not a one-to-one mapping, as there are potentially multiple aspects related to the

given answer. For example, in Fig. 8.2, given the answer chunk "Barack Obama", the questions "The speech in the White House is given by whom?", "Who gives a speech on democracy?", and "Today who gives a speech?" are all valid questions. As we can see, although these questions share the same answer, they can be asked in different ways based on which word or phrase we choose to copy (e.g., "White House", "democracy", or "today").

To resolve this issue, we propose to predict the potential "clue words" in the input passage. A "clue word" is defined as a word or phrase that is helpful to guide the way we ask a question, such as "White House", "democracy", and "today" in Fig. 8.2. We design a clue prediction module based on the syntactic dependency tree representation of passages and Graph Convolutional Networks. Graph Convolutional Networks generalize Convolutional Neural Networks to graph-structured data, and have been successfully applied to natural language processing tasks, including semantic role labeling [Marcheggiani and Titov, 2017], document matching [Liu et al., 2018a; Zhang et al., 2018b], relation extraction [Zhang et al., 2018c], and so on. The intuition underlying our model design is that clue words will be more closely connected to the answer in dependency trees, and the syntactic connection patterns can be learned via graph convolutional networks. For example, in Fig. 8.2, the graph path distances between the answer "Barack Obama" and "democracy", "White House", and "today" are 3, 3, and 2, while the corresponding word order distances between them are 5, 8, and 10, respectively.

8.2.3 How to Ask: Copy or Generate

Another important problem of question generation is when to choose a word from the target vocabulary and when to copy a word from the input passage during the generation process. People tend to repeat or paraphrase some text pieces in a given passage when they ask a question about it. For instance, in Fig. 8.3, in the question "Who along with Russia supported post WW-II communist movements", the text pieces "supported", "post", and "communist movements" repeat the input passage, while "Russia" and "WW-II" are synonymous replacements of the input phrases "The Soviet Union" and "World War II". Therefore, the copied words in questions should not be restricted to out-of-vocabulary words. Although this phenomenon is well known, existing approaches do not properly and explicitly distinguish whether a word comes from copying or from generation.

In our model, we consider the non-stop words that are shared by both an input passage and the output question as clue words, and encourage the model to copy clue words from the input.


Figure 8.4: Illustration of the overall architecture of our proposed model. It contains a GCN-based clue word predictor, a masked feature-rich encoder for input passages, and an attention-and-copy-based decoder for generating questions.

After labeling clue words, we train a GCN-based clue predictor to learn whether each word in the source text can be copied to the target question. The predicted clue words are further fed into a Seq2Seq model with attention and copy mechanism to help with question generation. Besides, different from existing approaches that share a vocabulary between source passages and target questions, we reduce the target vocabulary of questions to be smaller than that of source passages, only including words whose frequency is higher than a threshold. The idea is that the low-frequency words in questions are usually copied from the source text rather than generated from a vocabulary. By combining the above strategies, our model is able to learn when to generate or copy a word, as well as which words to copy or generate during the progress of question generation. We will describe our operations in more detail in the next section.

8.3 Model Description

In this section, we introduce our proposed framework in detail. Similar to [Zhou et al., 2017; Serban et al., 2016; Du et al., 2017], our question generator is based on an encoder-decoder framework, with attention and copying mechanisms incorporated. Fig. 8.4 illustrates the overall architecture and the detailed components of our model.

Our model consists of three components: the clue word predictor, the passage encoder and the question decoder.

The clue word predictor predicts potential clue words that may appear in a target question, based on the specific context of the input passage (without knowing the target question). It utilizes a syntactic dependency tree to reveal the relationship of answer tokens relative to other tokens in a sentence. Based on the tree representation of the input passage, we predict the distribution of clue words by a GCN-based encoder applied on the tree and a Straight-Through (ST) Gumbel-Softmax estimator [Jang et al., 2016] for clue word sampling. The outputs of the clue word predictor are fed to the passage encoder to advise the encoder about what words may potentially be copied to the target question.

The passage encoder incorporates both the predicted clue word distribution and a variety of other feature embeddings of the input words, such as lexical features, answer position indicators, etc. Combined with a proposed low-frequency masking strategy (a shortlist strategy to reduce the complexity of tuning input word embeddings), our encoder learns to better capture the useful input information with fewer trainable parameters.

Finally, the decoder jointly learns the probabilities of generating a word from the vocabulary and copying a word from the input passage. At the decoder side, we introduce a multitask learning strategy to intentionally encourage the copying behavior in question generation. This is achieved by explicitly labeling the copy gate with a binary variable when a (non-stop) word in the question also appears in the input passage in the training set. Other multi-task labels are also incorporated to accurately learn when and what to copy. We further shortlist the target vocabulary based on the frequency distribution of non-overlap words. These strategies, assisted by the encoded features, help our model to clearly learn which word in the passage is indeed a clue word to be copied into the target question, while generating other non-clue words accurately.

The entire model is trained end-to-end via multitask learning, i.e., by minimizing a weighted sum of losses associated with different labels. In the following, we first introduce the encoder and decoder, followed by a description of the clue word predictor.

8.3.1 The Passage Encoder with Masks

Our encoder is based on a bidirectional Gated Recurrent Unit (BiGRU) [Chung et al., 2014], taking word embeddings, answer position indicators, lexical and frequency

features of words, as well as the output of the clue word predictor, as the input.

Specifically, for each word $p_i$ in the input passage $P$, we concatenate the following features to form a representation $w_i$ to be input into the encoder:

• Word Vector. We initialize each word vector by GloVe embedding [Pennington et al., 2014]. If a word is not covered by GloVe, we initialize its word vector randomly.

• Lexical Features. We perform Named Entity Recognition (NER), Part-of- Speech (POS) tagging and Dependency Parsing (DEP) on input passages using spaCy [Matthew Honnibal, 2015], and concatenate the lexical feature embedding vectors.

• Binary Features. We check whether each word is lowercase or not, and whether it is a digit or number-like (such as the word "three"), and embed these features as vectors.

• Answer Position. Similar to [Zhou et al., 2017], we utilize the B/I/O tagging scheme to label the position of a given answer, where a word at the beginning of an answer is marked with B, I denotes the continuation of the answer, while words not contained in an answer are marked with O.

• Word Frequency Feature. We derive the word vocabulary from the passages and questions in the training dataset. We then rank all the words in descending order of word frequency, such that the first word is the most frequent word. The top $r_h$ words are labeled as frequent words, words ranked between $r_h$ and $r_l$ are labeled as intermediate words, and the remaining words with rank lower than $r_l$ are labeled as rare words, where $r_h$ and $r_l$ are two predefined thresholds. In this way, each word is assigned a frequency tag: H (highly frequent), M (medium frequency) or L (low frequency); a small sketch after this feature list illustrates the tagging rule.

• Clue Indicator Feature. In our model, the clue predictor (which we will introduce in more detail in Sec. 8.3.3) assigns a binary value to each word to indicate whether it is a potential clue word or not.
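To make the frequency tagging rule concrete, the following is a minimal sketch; the threshold values follow the settings in Sec. 8.4.2 ($r_h = 100$, $r_l = 2000$), and the function name is ours.

```python
def frequency_tag(rank: int, r_h: int = 100, r_l: int = 2000) -> str:
    """Map a word's frequency rank (1 = most frequent) to its frequency tag."""
    if rank <= r_h:
        return "H"   # highly frequent word
    elif rank <= r_l:
        return "M"   # intermediate (medium-frequency) word
    else:
        return "L"   # rare (low-frequency) word
```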

Denote an input passage by $P = (w_1, w_2, \cdots, w_{|P|})$. The BiGRU encoder reads the input sequence $w_1, w_2, \cdots, w_{|P|}$ and produces a sequence of hidden states $h_1, h_2, \cdots, h_{|P|}$ to represent the passage $P$. Each hidden state is a concatenation of a forward representation and a backward representation:

$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \tag{8.2}$$
$$\overrightarrow{h}_i = \operatorname{BiGRU}(w_i, \overrightarrow{h}_{i-1}), \tag{8.3}$$
$$\overleftarrow{h}_i = \operatorname{BiGRU}(w_i, \overleftarrow{h}_{i+1}), \tag{8.4}$$

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden states of the $i$-th token in $P$, respectively.
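For concreteness, the following is a minimal PyTorch sketch of the feature-rich BiGRU encoder of Equations (8.2)-(8.4). Only the clue and answer indicator features are shown for brevity; the full model concatenates the other feature embeddings listed above in the same way, and the dimensions follow Sec. 8.4.2.

```python
import torch
import torch.nn as nn

class PassageEncoder(nn.Module):
    """BiGRU over concatenated word, clue-indicator, and answer-position embeddings."""
    def __init__(self, vocab_size, word_dim=300, feat_dim=16, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # init with GloVe in practice
        self.clue_emb = nn.Embedding(2, feat_dim)           # binary clue indicator
        self.ans_emb = nn.Embedding(3, feat_dim)            # B/I/O answer position tag
        self.bigru = nn.GRU(word_dim + 2 * feat_dim, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, tokens, clue_ind, ans_pos):
        w = torch.cat([self.word_emb(tokens),
                       self.clue_emb(clue_ind),
                       self.ans_emb(ans_pos)], dim=-1)
        h, _ = self.bigru(w)  # h_i = [forward h_i; backward h_i], Eq. (8.2)
        return h
```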

Furthermore, rather than learning the full representation $w_i$ for every word in the vocabulary, we use a masking strategy to replace the word embeddings of low-frequency words with a special token, such that the information of low-frequency words is only represented by their answer/clue indicators and other augmented feature embeddings, excluding the word vectors. This strategy can improve performance for two reasons. First, the augmented tagging features of a low-frequency word tend to be more influential than the word meaning in question generation. For example, given a sentence "<UNK> likes playing football.", a question that can be generated is "What does <UNK> like to play?"—what the token "<UNK>" exactly is does not matter. In this way, the masking strategy helps the model to omit fine details that are not necessary for question generation. Second, the number of parameters that need to be learned, especially the number of word embeddings that need to be tuned, is largely reduced. Therefore, masking out unnecessary word embeddings while only keeping the corresponding augmented features and indicators does not hurt the performance of question generation; it actually improves training by reducing the model complexity.
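A minimal sketch of this masking step is shown below; the `UNK_ID` token id and the tensor layout of the frequency ranks are illustrative assumptions.

```python
import torch

def mask_low_frequency(token_ids: torch.Tensor,
                       freq_ranks: torch.Tensor,
                       r_l: int = 2000, UNK_ID: int = 1) -> torch.Tensor:
    """Replace ids of rare words (frequency rank worse than r_l) with a shared
    special token, so they are represented only by their augmented features."""
    masked = token_ids.clone()
    masked[freq_ranks > r_l] = UNK_ID
    return masked
```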

8.3.2 The Question Decoder with Aggressive Copying

In the decoding stage, we utilize another GRU with copying mechanism to generate question words sequentially based on the encoded input passage representation and previously decoded words. We first initialize the hidden state of the decoder GRU by passing the last backward encoder hidden state $\overleftarrow{h}_1$ to a linear layer:

$$s_0 = \tanh(W_0 \overleftarrow{h}_1 + b). \tag{8.5}$$

For each decoding time step $t$, the GRU reads the embedding of the previous word $w_{t-1}$, the previous attentional context vector $c_{t-1}$, and its previous hidden state $s_{t-1}$ to calculate its current hidden state:

$$s_t = \operatorname{GRU}([w_{t-1}; c_{t-1}], s_{t-1}). \tag{8.6}$$

(a) Rank distribution of all question words (mean rank = 2719.87, median rank = 1170). (b) Rank distribution of generated question words (mean rank = 2388.93, median rank = 1032). (c) Rank distribution of copied question words (mean rank = 3119.29, median rank = 1442).

Figure 8.5: Comparing the rank distributions of all question words, words from generation, and words from copy.

The context vector $c_t$ for time step $t$ is generated through the concatenated attention mechanism [Luong et al., 2015]. The attention mechanism calculates a semantic match between the encoder hidden states and the decoder hidden state. The attention weights indicate how the model spreads out the amount it cares about different encoder hidden states during decoding. At time step $t$, the attention weights and the context vector are calculated as:

$$e_{t,i} = v^{\top} \tanh(W_s s_t + W_h h_i), \tag{8.7}$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{|P|} \exp(e_{t,j})}, \tag{8.8}$$
$$c_t = \sum_{i=1}^{|P|} \alpha_{t,i} h_i. \tag{8.9}$$
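The following is a minimal PyTorch sketch of this concatenated attention, with weight names mirroring Equations (8.7)-(8.9); the exact module decomposition is our assumption rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class ConcatAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_t, H):
        # s_t: (batch, dec_dim) decoder state; H: (batch, src_len, enc_dim)
        # e_{t,i} = v^T tanh(W_s s_t + W_h h_i)              Eq. (8.7)
        e = self.v(torch.tanh(self.W_s(s_t).unsqueeze(1) + self.W_h(H))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)                     # Eq. (8.8)
        c_t = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)    # Eq. (8.9)
        return c_t, alpha
```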

Combining the previous word embedding $w_{t-1}$, the current decoder state $s_t$ and the current context vector $c_t$, we can calculate a readout state $r_t$ by an MLP maxout layer with dropout [Goodfellow et al., 2013]. Then the readout state is passed to a linear layer and a softmax layer to predict the probabilities of the next word over the decoder vocabulary:

$$r_t = W_{rw} w_{t-1} + W_{rc} c_t + W_{rs} s_t, \tag{8.10}$$
$$m_t = [\max\{r_{t,2j-1}, r_{t,2j}\}]_{j=1,\ldots,d}^{\top}, \tag{8.11}$$
$$p(y_t \mid y_1, \cdots, y_{t-1}) = \operatorname{softmax}(W_o m_t), \tag{8.12}$$

where $r_t$ is a $2d$-dimensional vector.

The above module generates question words from a given vocabulary. Another important method to generate words is copying from the source text. The copy or pointer mechanism [Gulcehre et al., 2016] was introduced into sequence-to-sequence models to allow the copying of unknown words from the input text, which solves the out-of-vocabulary (OOV) problem. When decoding at time step $t$, the probability of copying is given by:

$$g_c = \sigma(W_{cs} s_t + W_{cc} c_t + b), \tag{8.13}$$

where $\sigma$ is the sigmoid function, and $g_c$ is the probability of copying. For the copy probability of each input word, we reuse the attention weights given by Equation (8.8).

The copying mechanism has also been used in question generation [Zhou et al., 2017; Song et al., 2018a; Hu et al., 2018], however, mainly to solve the OOV issue. Here we leverage the copying mechanism to enable the copying of potential clue words from the input, instead of being limited to OOV words. Different from existing methods, we take a more aggressive approach to train the copying mechanism via multitask learning based on different labels. That is, when preparing the training dataset, we explicitly label a word in a target question as a word copied from the source passage text if it satisfies all of the following criteria: i) it appears in both the source text and the target question; ii) it is not a stop word; iii) its frequency rank in the vocabulary is lower than the threshold $r_h$.

The remaining words in the question are considered as being generated from the vocabulary. Such a binary label (copy or not copy), together with which input word the question word is copied from, as well as the target question, are fed into different parts of the decoder as labels for multi-task model training. This is to intentionally encourage the copying of potential clue words into the target question rather than generating them from a vocabulary.

The intuition behind such an aggressive copying mechanism can be understood by checking the frequency distributions of both generated words and copied words

(as defined above) in the training dataset of SQuAD. Fig. 8.5 shows the frequency distributions of all question words, of generated question words, and of copied question words. The mean and median ranks of generated words are 2389 and 1032, respectively, while for copied words, they are 3119 and 1442. Comparing Fig. 8.5(b) with Fig. 8.5(c), we can see that question words generated from the vocabulary tend to be clustered toward high ranked (or frequent) words. On the other hand, the fraction of low ranked (or infrequent) words among copied words is much greater than that among generated words. This means the generated words are mostly frequent words, while the majority of low-frequency words in the long tail are copied from the input rather than generated. This phenomenon matches our intuition: when people ask a question about a given passage, the generated words tend to be commonly used words, while the repeated text chunks from the source passage may contain relatively rare words such as names and dates. Based on this observation, we further propose to reduce the vocabulary at the decoder for target question generation to the top $N$ most frequently generated words, where $N$ is a predefined threshold that varies according to datasets.
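For illustration, the sketch below mixes the generation and copy distributions using the copy gate $g_c$ of Equation (8.13), following the final-distribution formula $p_{\text{final}} = g_c\, p_{\text{copy}} + (1 - g_c)\, p_{\text{generate}}$ shown in Fig. 8.4. Scattering the attention weights into vocabulary space, and omitting the extended slots for OOV source words, are our simplifications.

```python
import torch

def final_distribution(g_c, p_generate, alpha, src_token_ids, vocab_size):
    """g_c: (batch, 1) copy probability; p_generate: (batch, vocab) softmax output;
    alpha: (batch, src_len) attention weights reused as copy scores (Eq. 8.8);
    src_token_ids: (batch, src_len) vocabulary ids of the source tokens."""
    p_copy = torch.zeros(alpha.size(0), vocab_size, device=alpha.device)
    p_copy.scatter_add_(1, src_token_ids, alpha)  # accumulate copy mass per word
    return g_c * p_copy + (1.0 - g_c) * p_generate
```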

8.3.3 A GCN-Based Clue Word Predictor

Given a passage and some answer positions, our clue word predictor aims to identify the clue words in the passage that can help to ask a question and are also potential candidates for copying, by understanding the semantic context of the input passage. Again, in the training dataset, the non-stop words that are shared by both an input passage and an output question are aggressively labeled as clue words.

We note that clue words are, in fact, more closely connected to the answer chunk in the form of syntactic dependency trees than in word sequences. Fig. 8.6 shows our observations on the SQuAD dataset. For each training example, we get the non-stop words that appear in both the input passage and the output question. For each clue word in the training set, we find its shortest undirected path to the answer chunk based on the dependency parsing tree of the passage. For each jump on the shortest path, we record the dependency type. We also calculate the distance between each clue word and the answer in terms of the number of words between them. As shown in Fig. 8.6(a), prep, pobj, and nsubj appear frequently on these shortest paths. Comparing Fig. 8.6(b) with Fig. 8.6(c), we can see that the distances in terms of dependency trees are much smaller than those in terms of sequential word order. The average and median distances on the dependency trees are 4.41 and 4, while those values are 10.23 and 7 for sequential word distances.

In order to predict the positions of clue words based on their dependencies on the answer chunk (without knowing the question, which is yet to be generated), we use a Graph Convolutional Network (GCN) to convolve over the word features on the dependency tree of each passage, as shown in Fig. 8.4. The predictor consists of four layers:

Embedding layer. This layer shares the same features with the passage encoder, except that it does not include the clue indicators. Therefore, each word is represented by its word embedding, lexical features, binary features, the word frequency feature, and an answer position indicator.

Syntactic dependency parsing layer. We obtain the syntactic dependency parsing tree of each passage by spaCy [Matthew Honnibal, 2015], where the dependency edges between words are directed. In our model, we use this syntactic structure to represent the structure of passage words.

Encoding layer. The objective of the encoding layer is to encode the context information into each word based on the dependency tree. We utilize a multi-layered GCN to incorporate the information of neighboring word features into each vertex, which is a word. After L GCN layers, the hidden vector representation of each word incorporates the information of its neighboring words that are no more than L hops away in the dependency tree.

Output layer. After obtaining the context-aware representation of each word in the passage, we calculate the probability of each word being a clue word. A linear layer is utilized to calculate the unnormalized probabilities. We subsequently sample the binary clue indicator for each word through a Gumbel-Softmax layer, given the unnormalized probabilities. A sampled value of 1 indicates that the word is predicted as a clue word.

GCN Operations on Dependency Trees

We now introduce the operations performed in each GCN layer [Kipf and Welling, 2016; Zhang et al., 2018c]. GCNs generalize the CNN from low-dimensional regular grids to high-dimensional irregular graph domains. In general, the input to a GCN is

a graph $G = (\mathcal{V}, E)$ with $N$ vertices $v_i \in \mathcal{V}$, and edges $e_{ij} = (v_i, v_j) \in E$. The edges can be weighted with weights $w_{ij}$, or unweighted. The input also contains a vertex feature matrix denoted by $X = \{x_i\}_{i=1}^{N}$, where $x_i$ is the feature vector of vertex $v_i$.

164 oe(rwr norcase) our in word (or node A where fec odbigacu od fe ht h nomlzdpoaiiisaefed are probabilities unnormalized the that, After probability word. unnormalized the clue get a to layer being linear word a each into passage. of them the feed in then words We of word. vectors each feature the are GCN the of After layer first the in information nodes type the edge the incorporate to need matrix. not adjacency do the we in word, each of vectors ding A olwn ae-iepoaainrl [Zhang rule propagation layer-wise following node of vector an output In 0. or 1 as se- sample. denoted and training each distances in answer dependency the and syntactic words of copied between distributions distances word the quential Comparing 8.6: Figure ˜ ji = eie,a earayicue h eednytp nomto nteembed- the in information type dependency the included already we as Besides, . noreprmns etettedpnec re sudrce,i.e., undirected, as trees dependency the treat we experiments, our In tcigti prto by operation this Stacking A L σ + aeso rnfrain,w a banacnetaaerpeetto of representation context-aware a obtain can we transformations, of layers sannierfnto eg,RL) and ReLU), (e.g., function nonlinear a is ec ye ewe oidWords Answer Copied the and between Depen- Types Syntactic dency of Distribution (a)

I Amount A N 0e+00 1e+05 2e+05 3e+05 ∈ where L R lyrGN let GCN, -layer acl

N acomp advcl advmod

× agent amod appos N

I attr aux

N auxpass case where i cc

Dependency Type ccomp compound san is tthe at conj h csubj csubjpass

i ( dative h Answer and Wordthe Words Copied Sequential between Distances of Distribution (c)

l dep

) det

i dobj Amount expl = 0e+00 5e+03 1e+04 2e+04 2e+04 2e+04 intj A

ntegah(rdpnec tree). dependency (or graph the in mark meta n neg

ij nmod σ l L npadvmod

t ae.W iluiieamlilyrGNwt the with GCN multi-layer a utilize will We layer. -th nsubj × 0

( nsubjpass h

= nummod X

j oprd aesgvsu epGNntok h nu to input The network. GCN deep a us gives layers parataxis i ( N =1

n pcomp l pobj − w poss 10 preconj 1)

dniymatrix. identity predet A ij prep ˜ prt punct o negtdgah,tewihsaeeither are weights the graphs, unweighted For . quantmod ij eoe h nu etrand vector input the denotes relcl

Sequence Distance xcomp 20 W 165 aiu itne=347 = distance maximum 1 = distance minimum 7 = distance median 10.23 = distance mean b itiuino ytci De- Copied theAnswer and Syntactic between Words Distances of pendency Distribution (b) ( l 30 ) Amount h 0e+00 1e+04 2e+04 3e+04 4e+04 5e+04 tal. et j ( l − 40 1) 0 /d 2018 , 50 i W + d i ( l 5 Dependency Distance b ) 60 c = ( ]: l salna transformation. linear a is ) ) P aiu itne=36 = distance maximum 1 = distance minimum 4 = distance median 4.41 = distance mean , j N =1 10 A ˜ ij h stedge of degree the is 15 ( i l ) eoe the denotes ∀ ,j, i, (8.14) A ij = to a Straight-Through (ST) Gumbel-Softmax estimator to sample an N-dimensional binary vector indicating whether each of the N words is a clue word or not.

Gumbel-Softmax

Gumbel-Softmax [Jang et al., 2016] is a method of sampling discrete random vari- ables in neural networks. It approximates one-hot vectors sampled from a categorical distribution by making them continuous, therefore the gradients of model parameters can be calculated using the reparameterization trick and the standard backpropaga- tion. Gumbel-Softmax distribution is motivated by Gumbel-Max trick [Maddison et al., 2014], an algorithm for sampling from a categorical distribution. Let (p1, ..., pk) denotes a k-dimensional categorical distribution where the probability pi of class i is defined as: exp(log(π )) p = i , i Pk (8.15) j=1 exp(log(πj)) where πi is the unnormalized log probability of class i. We can easily draw a one-hot k sample z = (z1, , zk) R from the distribution by the following equations: ··· ∈ ( 1, i = arg maxj(log(πj) + gj) zi = (8.16) 0, otherwise

gi = log( log(ui)), (8.17) − − ui Uniform(0, 1) (8.18) ∼ where gi is Gumbel noise used to perturb each log(πi). In this way, taking arg max is equivalent to drawing a sample using probabilities (p1, , pk). ··· Gumbel-Softmax distribution replaces the arg max function by differentiable soft- max function. Therefore, a sample y = (y1, , yk) drawn from Gumbel-Softmax ··· distribution is given by: exp((log(π ) + g )/τ) y = i i , i Pk (8.19) j=1 exp((log(πj) + gj)/τ) where τ is a temperature parameter. The Gumbel-Softmax distribution resembles the one-hot sample when τ diminishes to zero. Straight-Through (ST) Gumbel-Softmax estimator [Jang et al., 2016] is a discrete version of the continuous Gumbel-Softmax estimator. It takes different paths in the forward and backward propagation. In the forward pass, it discretizes a continuous probability vector y sampled from the Gumbel-Softmax distribution into a one-hot

vector $y^{ST} = (y_1^{ST}, \cdots, y_k^{ST})$ by:

$$y_i^{ST} = \begin{cases} 1, & i = \arg\max_j y_j \\ 0, & \text{otherwise.} \end{cases} \tag{8.20}$$

In the backward pass, it uses the continuous $y$, so that the error can still backpropagate. Using the ST Gumbel-Softmax estimator, our model is able to sample a binary clue indicator vector for an input passage. Then the clue indicator vector is fed into the passage encoder for question generation, as shown in Fig. 8.4.
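The sampling procedure can be sketched as follows; note that PyTorch also ships an equivalent helper, `torch.nn.functional.gumbel_softmax(logits, tau, hard=True)`, and this manual version only mirrors Equations (8.15)-(8.20).

```python
import torch

def st_gumbel_softmax(log_pi: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """log_pi: unnormalized log probabilities, shape (..., k)."""
    u = torch.rand_like(log_pi)                    # u_i ~ Uniform(0, 1), Eq. (8.18)
    g = -torch.log(-torch.log(u))                  # Gumbel noise, Eq. (8.17)
    y = torch.softmax((log_pi + g) / tau, dim=-1)  # continuous sample, Eq. (8.19)
    # discretize to one-hot in the forward pass, Eq. (8.20)
    y_hard = torch.zeros_like(y).scatter_(-1, y.argmax(dim=-1, keepdim=True), 1.0)
    # straight-through trick: forward uses y_hard, gradients flow through y
    return (y_hard - y).detach() + y
```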

8.4 Evaluation

In this section, we evaluate the performance of our proposed models on the SQuAD dataset and the NewsQA dataset, and compare them with state-of-the-art question generation models.

8.4.1 Datasets, Metrics and Baselines

The SQuAD dataset is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuAD 1.1, used in our experiments, contains 536 Wikipedia articles and more than 100K question-answer pairs. When processing a sample from the dataset, instead of using the entire document, we take the sentence that contains the answer as the input. Since the test set is not publicly available, we use the data split proposed by [Zhou et al., 2017], where the original dev set is randomly split into a dev set and a test set of equal size.

In the NewsQA dataset, there are 120K questions and their corresponding answers, as well as the documents, which are CNN news articles. Questions are written by questioners in natural language with only the headlines and highlights of the articles available to them. With the information of the questions and the full articles, answerers select related sub-spans from the passages of the source text and mark them as answers. Multiple answers may be provided to the same question by different answerers,

and they are ranked by validators based on the quality of the answers. In our experiment, we picked a subset of NewsQA where answers are top-ranked and are composed of a contiguous sequence of words within the input sentence of the document. Table 8.1 shows the number of samples in each set and the average numbers of tokens of the input sentences, questions, and answers, listed in columns P-Length, Q-Length, and A-Length, respectively.

Table 8.1: Description of evaluation datasets.

Dataset | Train | Dev | Test | P-Length* | Q-Length* | A-Length*
SQuAD | 86,635 | 8,965 | 8,964 | 32.72 | 11.31 | 3.19
NewsQA | 77,538 | 4,341 | 4,383 | 28.14 | 7.82 | 4.60

* P-Length: average number of tokens of passages.
* Q-Length: average number of tokens of questions.
* A-Length: average number of tokens of answers.

We report the evaluation results with the following metrics.

• BLEU [Papineni et al., 2002]. BLEU measures precision by how much the words in prediction sentences appear in reference sentences at the corpus level. BLEU-1, BLEU-2, BLEU-3, and BLEU-4 use 1-gram to 4-gram for calculation, respectively.

• ROUGE-L [Lin, 2004]. ROUGE-L measures recall by how much the words in reference sentences appear in prediction sentences using Longest Common Subsequence (LCS) based statistics.

• METEOR [Denkowski and Lavie, 2014]. METEOR is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.

In the experiments, we have eight baseline models for comparison. Results reported for PCFG-Trans, MPQG, and NQG++ are from experiments we conducted using the published code on GitHub. For the other baseline models, we directly copy the reported performance from their papers. We report all results on the SQuAD dataset; for the NewsQA dataset, we can only report the baselines with open source code available.

• PCFG-Trans [Heilman, 2011] is a rule-based system that generates a question based on a given answer word span.

• MPQG [Song et al., 2018a] proposed a Seq2Seq model that matches the answer with the passage before generating the question.

• SeqCopyNet [Zhou et al., 2018] proposed a method to improve the copying mechanism in Seq2Seq models by copying not only a single word but a sequence of words from the input sentence.

• seq2seq+z+c+GAN [Yao et al., 2018] proposed a model employed in a GAN framework, using a latent variable to capture diversity and learning disentangled representations using observed variables.

• NQG++ [Zhou et al., 2017] proposed a Seq2Seq model with a feature-rich encoder to encode answer position, POS and NER tag information.

• Answer-focused Position-aware model [Sun et al., 2018] incorporates the answer embedding to help generate an interrogative word matching the answer type. It also models the relative distance between the context words and the answer so that the model is aware of the position of the context words when generating a question.

• s2sa-at-mp-gsa [Zhao et al., 2018] proposed a model which contains a gated attention encoder and a maxout pointer decoder to address the challenges of processing long text inputs. This model has a paragraph-level version and a sentence-level version. For a fair comparison, we report the results of the sentence-level model to match our settings.

• ASs2s [Kim et al., 2018] proposed an answer-separated Seq2Seq to identify which interrogative word should be used by replacing the target answer in the original passage with a special token.

For our models, we evaluate the following versions:

• CGC-QG (no feature-rich embedding). We name our model Clue Guided Copy for Question Generation (CGC-QG). In this variant, we only keep the embeddings of words, answer position indicators, and clue indicators for each token, and remove the embedding vectors of all other features.

• CGC-QG (no target reduction). This model variant does not contain the target vocabulary reduction operation.

• CGC-QG (no clue prediction). The clue predictor and clue embedding are removed in this model variant.

• CGC-QG. This is the complete version of our proposed model.

8.4.2 Experiment Settings

We implement our models in PyTorch 0.4.1 [Paszke et al., 2017] and train the model on a single Tesla P40 GPU. We utilize spaCy [Matthew Honnibal, 2015] to perform dependency parsing and to extract lexical features for tokens. As for the vocabulary, we collect it from the training dataset and keep the top 20K most frequent words for both datasets.
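As an illustration of this preprocessing step, the sketch below extracts token-level lexical features with spaCy and builds the top-20K vocabulary. The pipeline name and the exact feature set are assumptions for the sketch.

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline name

def token_features(sentence):
    doc = nlp(sentence)
    return [{
        "word": tok.text,
        "pos": tok.pos_,              # Part-of-Speech tag
        "ner": tok.ent_type_ or "O",  # Named Entity type, "O" if none
        "dep": tok.dep_,              # dependency relation to the head
        "is_lower": tok.is_lower,     # binary lowercase feature
        "is_digit": tok.is_digit,     # binary digit feature
    } for tok in doc]

def build_vocab(sentences, size=20000):
    # Keep the top-20K most frequent words from the training set.
    counts = Counter(w for s in sentences for w in s.split())
    return [w for w, _ in counts.most_common(size)]
```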

We set the thresholds $r_h = 100$ and $r_l = 2000$. For target vocabulary reduction, we set N = 2000. The word embedding dimension is set to 300 and initialized with GloVe; the vectors of words not contained in GloVe are initialized randomly. The word frequency features are embedded into 32-dimensional vectors, and the other features and indicators are embedded into 16-dimensional vectors. All embedding vectors are trainable in the model. We use a single-layer BiGRU with hidden size 512 for the encoder, and a single-layer unidirectional GRU with hidden size 512 for the decoder. A dropout rate of p = 0.1 is applied to the encoder, decoder, and attention module. During training, we optimize the cross-entropy loss for clue prediction, question generation, and question copying, and perform gradient descent with the Adam optimizer [Kingma and Ba, 2014], with initial learning rate lr = 0.001, momentum parameters β1 = 0.8 and β2 = 0.999, and ε = 10⁻⁸. The mini-batch size for each update is 32, and the model is trained for up to 10 epochs (we found that the models usually reach their best performance after 6 or 7 epochs). We apply gradient clipping with range [−5, 5] for Adam. Besides, an exponential moving average with decay rate 0.9999 is applied to all trainable variables. When testing, beam search is conducted with beam width 20. The decoding process stops when a token that represents the end of sentence is generated.
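The training configuration above corresponds roughly to the following PyTorch setup; this is a schematic sketch only (the placeholder model and the training-step wiring are not the actual implementation).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the full CGC-QG model

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.8, 0.999), eps=1e-8)

# Shadow weights for the exponential moving average (decay 0.9999).
ema = {n: p.detach().clone() for n, p in model.named_parameters()}

def train_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping with range [-5, 5], as described in the text.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)
    optimizer.step()
    with torch.no_grad():
        for n, p in model.named_parameters():
            ema[n].mul_(0.9999).add_(p, alpha=1 - 0.9999)
```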

8.4.3 Main Results

Table 8.2 and Table 8.3 compare the performance of our model with existing question generation models on SQuAD and NewsQA in terms of different evaluation metrics. For the SQuAD dataset, we compare our model with all the baseline methods listed above. As for the NewsQA dataset, since only some of the baseline methods made their code public, we compare our model with the approaches that have open source code. Our model achieves the best performance on both datasets and significantly outperforms state-of-the-art algorithms. On the SQuAD dataset, given an input sentence and an answer, the BLEU-4, ROUGE-L, and METEOR of our result are 17.55, 44.53, and 21.24 respectively, while the corresponding previous state-of-the-art results are 16.17, 44.24, and 19.67, obtained by different approaches. Similarly, our method also gives significantly better performance on NewsQA compared with the baselines in Table 8.3.

The reason is that we combine different strategies in our model so that it learns when to generate or copy a word, and what to generate or copy. First, our model learns to predict clue words through a GCN-based clue predictor. Second, our encoder incorporates a variety of embeddings of different features and clue indicators. Combined with the masking strategy, our model can better discover the relationship between input patterns and output patterns. Third, the reduced target vocabulary also helps our model to better capture when to copy or generate, and the generator is easier to train with a reduced vocabulary size. Most importantly, our new criterion for marking a question word as a copied word (described in Sec. 8.3.3) helps the model make better decisions on which path to take, i.e., to copy or to generate, during question generation. Incorporating only part of these new strategies and modules already yields performance better than state-of-the-art models on SQuAD and NewsQA; with all these designs implemented, our model gives the best performance on both datasets.

There is a significant gap between the performance on SQuAD and on NewsQA due to the different characteristics of the datasets. The average answer length of NewsQA is 44.2% larger than that of SQuAD according to the statistics shown in Table 8.1. Longer answers usually hold more information, which makes it more difficult to generate their questions. Furthermore, reference questions in NewsQA tend to have less strict grammar and more diverse phrasings. To give a typical example, "Iran criticizes who?" is a reference question in NewsQA that does not start with an interrogative word but ends with one. These characteristics make the performance on NewsQA not as good as on SQuAD. However, our approach is still significantly better than the compared approaches on the NewsQA dataset. This demonstrates that copying from the input is a general phenomenon across different datasets, and that our model better captures which question words are copied and which are generated, based on our new criterion for labeling a question word as copied or not.

Table 8.2: Evaluation results of different models on the SQuAD dataset.

Model                                  BLEU-1  BLEU-2  BLEU-3  BLEU-4  ROUGE-L  METEOR
PCFG-Trans∗                            28.77   17.81   12.64    9.47   31.68    18.97
SeqCopyNet                                −       −       −    13.02   44.00      −
seq2seq+z+c+GAN                        44.42   26.03   17.60   13.36   40.42    17.70
NQG++∗                                 42.36   26.33   18.46   13.51   41.60    18.18
MPQG                                      −       −       −    13.91      −       −
Answer-focused Position-aware model    43.02   28.14   20.51   15.64      −       −
s2sa-at-mp-gsa                         44.51   29.07   21.06   15.82   44.24    19.67
ASs2s                                     −       −       −    16.17      −       −
CGC-QG (no feature-rich embedding)     45.50   29.63   21.58   16.38   43.11    20.52
CGC-QG (no target reduction)           45.80   30.27   22.29   17.05   44.09    20.79
CGC-QG (no clue prediction)            45.58   30.07   22.08   16.80   44.51    20.80
CGC-QG                                 46.58   30.90   22.82   17.55   44.53    21.24

Note: experiments are conducted on baselines marked with "∗" using released source code. Results of the other baselines are copied from their papers; unreported metrics are marked "−".

Table 8.3: Evaluation results of different models on the NewsQA dataset.

Model                                  BLEU-1  BLEU-2  BLEU-3  BLEU-4  ROUGE-L  METEOR
PCFG-Trans∗                            16.90    7.94    4.72    3.08   23.78    13.74
MPQG∗                                  35.70   17.16    9.64    5.65   39.85    14.13
NQG++∗                                 40.33   22.47   14.83    9.94   42.25    16.72
CGC-QG (no feature-rich embedding)     39.35   22.10   14.36    9.99   42.00    16.60
CGC-QG (no target reduction)           40.00   22.84   15.01   10.52   42.33    16.89
CGC-QG (no clue prediction)            39.85   22.82   14.96   10.45   43.16    17.11
CGC-QG                                 40.45   23.52   15.68   11.06   43.16    17.43

Note: experiments are conducted on baselines marked with "∗" using released source code.

8.4.4 Analysis

We evaluate the impact of the different modules in our model by ablation tests. Table 8.2 and Table 8.3 list the performance of our model variants with different sub-components removed.

When we remove the extra feature embeddings from our model, i.e., the embeddings of POS, NER, and dependency types, word frequency levels (low-frequency, medium-frequency, high-frequency), and binary features (whether a token is lowercase or a digit), the performance drops significantly. This is because the tags and feature embeddings characterize each token from different aspects. The number of distinct tags is much smaller than the number of distinct words, so the patterns that can be learned from these tags and features are more apparent than what can be learned from word embeddings alone. Even though a well-trained word vector may implicitly contain information such as POS or NER, explicitly concatenating these feature embedding vectors helps the model capture the patterns needed to ask a question more easily.

Removing the target vocabulary reduction operation also hurts the performance of our model. As discussed earlier, the non-overlapping question words (i.e., generated words) are mostly covered by high-frequency words. Reducing the target vocabulary size helps our model to better learn the probabilities of generating these words; it also encourages the model to better capture what it can copy from the input text.

Finally, without the clue prediction module, the performance also drops on both datasets. This is because, given an answer span in a passage, asking a question about it is still a one-to-many mapping problem. Our clue prediction module learns, from a large training dataset, how people select related clue words to further reduce the uncertainty of how to ask a question. With the predicted clue indicators incorporated into the encoder of the generator, our model can fit the way people ask questions in the dataset.

8.5 Conclusion

In this chapter, we demonstrate the effectiveness of teaching the model to make decisions, during the question generation process, on which words to generate and which to copy. We label the non-stop overlapping words between input passages and questions as copy targets and use such labels to train our model. Besides, we observe that the generated question words are mostly common words with relatively high frequency; based on this observation, we reduce the vocabulary size for generating question words. To help the model better capture how to ask a question and to alleviate the one-to-many mapping issue, we propose a GCN-based clue prediction module that predicts which words can serve as clue words for asking a question given an answer. It utilizes the syntactic dependency tree representation of a passage to encode the information of each token in the passage, and samples a clue indicator for each token using a Straight-Through (ST) Gumbel-Softmax estimator. Our experimental results on the SQuAD and NewsQA datasets show that our model significantly outperforms a range of existing state-of-the-art approaches.

Chapter 9

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

In the previous chapter, we proposed the Clue Guided Copy Network for Question Generation, a type of answer-aware question generation model. Answer-aware question generation models are ineffective at generating a large number of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping problem. A natural idea is to feed the generative model with more input information, both to improve the quality of generated questions and to make the generation process more controllable. For example, instead of predicting the clue words for question generation as we did in the previous chapter, why not directly select appropriate answers and correlated clues from unlabeled text, and utilize them as inputs to generate questions?

In this chapter, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), a novel system aimed at automatically generating diverse and high-quality question-answer pairs from unlabeled text corpora at scale by mimicking the way a human asks questions. Our system consists of: i) an information extractor, which samples multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions about a passage, utilizing the extracted assistive information as input; and iii) a neural quality controller, which filters out low-quality generated data based on text entailment. We compare our question generation models with existing approaches and perform pilot user studies to evaluate the quality of the generated question-answer pairs. The evaluation results show that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while being scalable at the same time. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences in Wikipedia.

The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.

Answer: MTV Movie Award for Best Fight
Clue: from the movie Obsessed
Style: Which
Q: A fight scene from the movie, Obsessed, won which award?

Answer: MTV Movie Award for Best Fight
Clue: The fight scene finale between Sharon and the character played by Ali Larter
Style: Which
Q: Which award did the fight scene between Sharon and the role of Ali Larter win?

Answer: Obsessed
Clue: won the 2010 MTV Movie Award for Best Fight
Style: What
Q: What is the name of the movie that won the 2010 MTV Movie Award for Best Fight?

Figure 9.1: Given the same input sentence, we can ask diverse questions based on our different choices about i) what the target answer is; ii) which answer-related chunk is utilized as the clue; and iii) what type of question is asked.

9.1 Introduction

Automatically generating question-answer pairs from unlabeled text passages is of great value to many applications, such as assisting the training of machine reading comprehension systems [Tang et al., 2017; Tang et al., 2018; Du and Cardie, 2018], generating queries/questions from documents to improve search engines [Han et al., 2019], training chatbots to start or continue a conversation [Shum et al., 2018], generating exercises for educational purposes [Heilman and Smith, 2010; Heilman, 2011; Danon and Last, 2017], and generating FAQs for web documents [Krishna and Iyyer, 2019]. In particular, many question-answering tasks such as machine reading comprehension and chatbots require a large amount of labeled data for supervised training, which is time-consuming and expensive to acquire. Automatic question-answer generation makes it possible to provide these systems with scalable data, and to transfer a model well-trained on one domain with carefully labelled data to new domains. Despite a large number of studies on neural question generation, it remains a significant challenge to generate high-quality QA pairs from unstructured text in large quantities. Most existing neural network models approach the question generation problem as answer-aware question generation, where the answer to the target question is provided as an input together with the context. Based on these inputs, they

formulate the task as a Sequence-to-Sequence (Seq2Seq) problem, and design various encoders, decoders, and input features to improve the quality of generated questions [Serban et al., 2016; Du et al., 2017; Liu et al., 2019b; Zhou et al., 2017; Song et al., 2018a; Hu et al., 2018; Du and Cardie, 2018]. However, the performance of answer-aware question generation models is far from satisfactory, due to the one-to-many mapping nature of the task. Figure 9.1 shows an example of this phenomenon. Given the same input text "The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.", we can ask a variety of questions based on it. Even if we fix the text chunk "MTV Movie Award for Best Fight" as the answer, we can still ask different questions from different aspects, such as "A fight scene from the movie, Obsessed, won which award?" or "Which award did the fight scene between Sharon and the role of Ali Larter win?".

We argue that when a human asks a question based on a passage, she considers various factors. First, she selects an answer as the major target her question will point to. Second, she decides which piece of information will be presented or rephrased in her question to set constraints or context for the question. We call this piece of information the clue, as the target answer may be related to different clues in the passage. Third, even the same question may be expressed in different styles (e.g., "what", "who", "why", etc.): one can ask "which award" or "what is the name of the award" to express the same meaning. Once we narrow down the answer, clue, and question style, the process of question generation becomes much closer to a one-to-one mapping problem, essentially mimicking the human way of asking questions. Hence, introducing such information into question-answer generation helps to reduce the difficulty of the task.

In this chapter, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), designed for scalable generation of high-quality question-answer pairs from unlabeled text corpora. Just as a human asks a question with a clue and style in mind, our system first automatically extracts multiple types of information from an input passage to assist question generation. Based on the extracted multi-aspect information, we design neural network models to generate diverse questions in a controllable way. Compared with existing answer-aware question generation, our approach essentially converts the one-to-many mapping problem into a one-to-one mapping problem. It is thus scalable, by controlling the assistive information fed to the neural network, while in the meantime ensuring the generation quality. Specifically, we have solved multiple challenges in the ACS-aware question generation system:

What to ask given an unlabeled passage? Given an input passage such as a

sentence, randomly selecting combinations of information will cause type mismatches and input volume explosion. On one hand, answer, clue, and style are dependent on each other. Without taking their correlations into account, we may select "how" or "when" as the target question style (or type) while a person's name is selected as the answer; such errors are called type mismatches. On the other hand, input volume explosion means that we may randomly sample thousands of different combinations without limitation, while most of these inputs lead to meaningless questions. To overcome these challenges, we design and implement an information extractor to efficiently sample meaningful inputs from a given passage. We learn the joint distribution of the three types of information from existing abundant reading comprehension datasets, such as SQuAD [Rajpurkar et al., 2016]. In the meantime, we decompose the joint probability distribution of an <answer, clue, style> tuple into three components, and apply a three-step sampling mechanism to the input passage to select reasonable combinations of input information to feed into ACS-aware question generation. Based on this strategy, we can alleviate the type mismatch problem and narrow down the combinations of assistive information for meaningful question generation.

How to learn a model to ask ACS-aware questions? Existing neural approaches are mostly designed for answer-aware question generation, and no dataset is available for the task of ACS-aware question generation. In our system, we propose effective strategies to automatically construct datasets from existing reading comprehension datasets without any human labeling effort. We formally define a "clue" as a semantic chunk in an input passage which will be included or rephrased in the target question. Based on this definition, we perform syntactic parsing and chunking on the input text, and select the chunk most relevant to the target question as the clue information. In this manner, we leverage the abundance of reading comprehension datasets to automatically construct training data for our new problem. On the other hand, we categorize questions into 9 styles, including "what", "how", "yes-no", and so forth, and extract such information from existing datasets to train ACS-aware question generation models.

We propose two deep neural models for ACS-aware question generation, and show their superior performance in generating diverse and high-quality questions. The first model employs a sequence-to-sequence framework with copy and attention mechanisms [Sutskever et al., 2014; Bahdanau et al., 2014; Cao et al., 2017], incorporating the information of answer, clue, and style into the encoder and decoder. Furthermore, it differentiates content words and function words in the input, and utilizes vocabulary

reduction, which downsizes the vocabularies of both the encoder and the decoder to encourage aggressive copying. In the second model, we fine-tune a GPT2-small model [Radford et al., 2019], training our ACS-aware QG models using the input passage, answer, clue, and question style as the language modeling context. As a result, we reduce the repeated-word phenomenon that often occurs in the output of sequence-to-sequence models, and can generate questions with better readability. With multi-aspect assistive input information, our models are able to ask a variety of high-quality questions based on an input passage, while making the generation process controllable.

How to ensure the quality of generated QA pairs? We construct a data filter, which consists of an entailment model and a question answering model. In our filtering process, we feed questions generated in the aforementioned manner into a BERT-based [Devlin et al., 2018] question answering model to get its predicted answer span, and measure the overlap between our input answer span and the predicted answer span. In addition, we classify the entailment relationship between the original sentence and the concatenation of the question and answer. These components allow us to remove low-quality QA pairs. By combining the input selector, the ACS-aware question generator, and the data filter, we construct a pipeline that is able to generate a large number of QA pairs from unlabeled text without extra human labeling effort.

We perform extensive experiments based on the SQuAD dataset [Rajpurkar et al., 2016] and Wikipedia, and compare our ACS-aware question generation model with existing approaches. Results show that both the content-separated Seq2Seq model with aggressive copying and the extra input information bring substantial benefits to question generation. Our method significantly outperforms state-of-the-art models in terms of various metrics such as BLEU-4, ROUGE-L, and METEOR. With models trained on 86,635 SQuAD samples, we automatically generate two large datasets containing 1.33 million and 1.45 million QA pairs, respectively, from a corpus of top-ranked Wikipedia articles. We perform quality evaluation on the generated datasets and identify their strengths and weaknesses. Finally, we also observe how our generated QA data performs in training question-answering models for machine reading comprehension, as an alternative means to assess the quality of the generated question-answer pairs.
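As a rough illustration of the answer-overlap check in the data filter, the sketch below uses a generic extractive QA pipeline from HuggingFace Transformers as a stand-in for the BERT-based model described here; the token-level F1 threshold is a hypothetical parameter for the sketch.

```python
from transformers import pipeline

qa = pipeline("question-answering")  # stand-in extractive QA model

def answer_overlap_ok(passage, question, gold_answer, threshold=0.5):
    # Keep the generated pair only if the QA model's predicted span
    # overlaps sufficiently with the answer the question was built from.
    pred = qa(question=question, context=passage)["answer"]
    gold, hyp = set(gold_answer.lower().split()), set(pred.lower().split())
    if not gold or not hyp:
        return False
    f1 = 2 * len(gold & hyp) / (len(gold) + len(hyp))
    return f1 >= threshold
```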

9.2 Problem Formulation

In this section, we formally introduce the problem of ACS-aware question generation.

[Figure 9.2 depicts the pipeline: QA datasets feed a dataset constructor that produces ACS-aware QG training data; inputs sampled from the text corpus feed Seq2Seq- and GPT2-based ACS-aware QG models, whose outputs pass through a filter to yield the generated tuples.]

Figure 9.2: An overview of the system architecture. It contains a dataset constructor, an information sampler, ACS-aware question generators, and a data filter.

Denote a passage by $p$, which can be either a sentence or a paragraph (in our work, it is a sentence). Let $q$ denote a question related to this passage, and $a$ denote the answer to that question. A passage consists of a sequence of words $p = \{p_t\}_{t=1}^{|p|}$, where $|p|$ denotes the length of $p$. A question $q = \{q_t\}_{t=1}^{|q|}$ contains words from either a predefined vocabulary $V$ or from the input text $p$. Our objective is to generate different question-answer pairs from the passage; therefore, we aim to model the probability $P(q, a \mid p)$.

In our work, we factorize the generation process into multiple steps that select different inputs for question generation. Specifically, given a passage $p$, we select three types of information as input to a generative model, defined as follows:

• Answer: here we define an answer a as a span in the input passage p. Specifically, we select a semantic chunk of p as the target answer from the set of chunks given by parsing and chunking.

• Clue: denote a clue by c. As mentioned in Sec. 9.1, a clue is a semantic chunk in the input p which will be copied or rephrased in the target question. It is related to the answer a, and providing it as input helps reduce the uncertainty in generating questions. This alleviates the one-to-many mapping problem of question generation, makes the generation process more controllable, and improves the quality of generated questions.

• Style: denote a question style as s. We classify each question into nine styles: “who”, “where”, “when”, “why”, “which”, “what”, “how”, “yes-no”, and “other”. By providing the target style to the question generation model, we further reduce the uncertainty of generation and increase the controllability.

We shall note that our definition of clue differs from that of [Liu et al., 2019b]. In our work, given a passage p and a question q, we identify a clue as a consistent chunk in p, rather than as the overlapping non-stop words between p and q. On one hand, this allows the clue to be expressed in different ways in a question. On the other hand, given an unlabeled text corpus, we can sample clue chunks for generating questions according to the same distribution as in the training datasets, avoiding a discrepancy between training and generation. Given the above definitions, our generation process is decomposed into input sampling and ACS-aware question generation:

$$ P(q, a \mid p) = \sum_{c,s} P(a, c, s \mid p)\, P(q \mid a, c, s, p) \qquad (9.1) $$
$$ \qquad\qquad = \sum_{c,s} P(a \mid p)\, P(s \mid a, p)\, P(c \mid s, a, p)\, P(q \mid c, s, a, p), \qquad (9.2) $$

where $P(a \mid p)$, $P(s \mid a, p)$, and $P(c \mid s, a, p)$ model the input sampling process that produces the answer, question style, and clue for a target question, and $P(q \mid c, s, a, p)$ models the generation of the target question.
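The factorization in Equation (9.2) maps directly onto a three-step sampling loop. The sketch below uses trivial stand-in samplers for each conditional distribution; in the actual system, each distribution is estimated from training data such as SQuAD.

```python
import random

def sample_acs_inputs(chunks, n_samples=5):
    """Three-step sampling following Eq. (9.2).  `chunks` are the
    candidate semantic chunks of the passage; the random choices below
    are placeholders for the learned conditional distributions."""
    tuples = []
    for _ in range(n_samples):
        a = random.choice(chunks)                    # P(a | p)
        s = random.choice(["what", "which", "who"])  # P(s | a, p), stand-in
        others = [ch for ch in chunks if ch != a]
        c = random.choice(others or chunks)          # P(c | s, a, p), stand-in
        tuples.append((a, c, s))
    return tuples
```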

9.3 Model Description

In this section, we present our overall system architecture for generating questions from unlabeled text corpora, and then introduce the details of each component. Figure 9.2 shows the pipelined system we build to train ACS-aware question generation models and to generate large-scale datasets for different applications. Our system consists of four major components: i) a dataset constructor, which takes existing QA datasets as input and constructs training datasets for ACS-aware question generation; ii) an information sampler (extractor), which samples answer, clue, and style information from input text and feeds them into the ACS-aware QG models; iii) ACS-aware question generation models, which are trained on the constructed datasets to generate questions; and iv) a data filter, which controls the quality of the generated questions.

9.3.1 Obtaining Training Data for Question Generation

Our first step is to acquire a training dataset for ACS-aware question generation models. Existing answer-aware question generation methods [Serban et al., 2016; Du et al., 2017; Liu et al., 2019b; Zhou et al., 2017] utilize reading comprehension datasets such as SQuAD [Rajpurkar et al., 2016], as these datasets contain <p, q, a> tuples. However, for our problem, the input and output consist of <p, q, a, c, s>, where the clue c and style s information are not directly provided in existing datasets.

ALGORITHM 8: Clue Extraction
Input: passage p, answer a, question q, related words R.
Output: clue c.
1: get candidate chunks C = {c_1, c_2, ..., c_{|C|}} of passage p by parsing and chunking;
2: remove function words; tokenize p and q to get p_{t,c} and q_{t,c}, and stem p and q to get p_{m,c} and q_{m,c};
3: for c ∈ C do
4:     get tokenized clue c_{t,c} and stemmed clue c_{m,c} with only content words;
5:     n^o_{t,c} ← number of overlapping tokens between c_{t,c} and q_{t,c};
6:     n^o_{m,c} ← number of overlapping stems between c_{m,c} and q_{m,c};
7:     n^{soft-o}_{t,c} ← number of soft-copied tokens between c_{t,c} and q_{t,c};
8:     binary x ← whether q contains the chunk text c;
9:     score(c) = n^o_{t,c} + n^o_{m,c} + n^{soft-o}_{t,c} + x;
10: end for
11: select the chunk c with maximum score(c) as the clue chunk;

To address this issue, we design effective strategies to automatically extract clue and style information without involving human labeling.

Rules for Clue Identification. As mentioned in Sec. 9.2, given <p, q, a>, we define a semantic chunk c in the input p which is copied or rephrased in the output question q as the clue. We identify c by the method shown in Algorithm 8. First, we parse and chunk the input passage to get all candidate chunks. Second, we tokenize and stem the passage and the question, keeping only the content words in the results. Third, we calculate the similarities between each candidate chunk and the target question according to different criteria; the final score of each chunk is the sum of the different similarities. Finally, we select the chunk with the maximum score as the identified clue chunk c. To estimate the similarity between a candidate chunk and the question, we calculate the number of overlapping tokens and stems between the chunk and the question, and also check whether the chunk is fully contained in the question.

In addition, we define a "soft copy" relationship between two words to take rephrasing into consideration. Specifically, a word w_q ∈ q is considered soft-copied from the input passage p if there exists a word w_p ∈ p that is semantically coherent with w_q. For instance, given the passage "Selina left her hometown at the age of 18" and the question "How old was Selina when she left?", the word "old" is soft-copied from "age" in the input passage. To identify the soft-copy relationship between any pair of words, we utilize synonyms and word vectors, such as GloVe [Pennington et al., 2014], to construct a related words dictionary R, where R(w) = {w_1, w_2, ..., w_{|R(w)|}} returns the set of words closely related to w. For each word w, R(w) is composed of the synonyms of w, as well as the top N most similar words estimated by word vector representations (we set N = 5). In our work, we use GloVe word vectors and construct R based on Gensim [Řehůřek and Sojka, 2010] and WordNet [Miller, 1995].
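A simplified Python rendering of the chunk-scoring loop in Algorithm 8 might look as follows, assuming each candidate chunk carries its content-word tokens, stems, and raw text, and that a callable related(w) returning R(w) is available.

```python
def extract_clue(candidate_chunks, question_tokens, question_stems,
                 question_text, related):
    """Score every candidate chunk as in Algorithm 8 and return the
    highest-scoring one.  Each chunk is a dict with content-word
    'tokens', 'stems', and raw 'text'; related(w) returns R(w)."""
    def score(chunk):
        n_tok = len(set(chunk["tokens"]) & set(question_tokens))    # token overlap
        n_stem = len(set(chunk["stems"]) & set(question_stems))     # stem overlap
        n_soft = sum(1 for w in chunk["tokens"]                     # soft-copied tokens
                     if any(r in question_tokens for r in related(w)))
        contained = 1 if chunk["text"] in question_text else 0      # full containment
        return n_tok + n_stem + n_soft + contained
    return max(candidate_chunks, key=score)
```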

ALGORITHM 9: Style Classification
Input: question q, style set S = {who, where, when, why, which, what, how, yes-no, other}, yes-no feature word set Y = {am, is, was, were, are, does, do, did, have, had, has, could, can, shall, should, will, would, may, might}.
Output: style s ∈ S.
1: for s ∈ S \ {yes-no, other} do
2:     if word s is contained in q then
3:         return s
4:     end if
5: end for
6: for y ∈ Y do
7:     if word y is the first word of q then
8:         return yes-no
9:     end if
10: end for
11: return other

Rules for Style Classification. Algorithm 9 presents our method for question style classification. We classify a given question into 9 classes based on a few heuristic rules. If q contains "who", "where", "when", "why", "which", "what", or "how", we classify it as the corresponding type. For yes-no questions, we define a set of feature words; if q starts with any word belonging to this set, we classify it as type yes-no. In all other cases, we label it as other.
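Algorithm 9 is simple enough to transcribe almost line by line:

```python
STYLES = ["who", "where", "when", "why", "which", "what", "how"]
YES_NO = {"am", "is", "was", "were", "are", "does", "do", "did", "have",
          "had", "has", "could", "can", "shall", "should", "will",
          "would", "may", "might"}

def classify_style(question):
    words = question.lower().split()
    for s in STYLES:                  # interrogative-word styles first
        if s in words:
            return s
    if words and words[0] in YES_NO:  # yes-no questions start with a feature word
        return "yes-no"
    return "other"
```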

9.3.2 ACS-Aware Question Generation

Having obtained the training datasets, we design two models for ACS-aware question generation. The first model is based on the Seq2Seq framework with attention and copy mechanisms [Sutskever et al., 2014; Bahdanau et al., 2014; Gu et al., 2016]; in addition, it exploits clue embedding, content embedding, style encoding, and aggressive copying to improve the performance of question generation. The second model is based on pre-trained language models: we fine-tune a GPT2-small model [Radford et al., 2019] on the constructed training datasets.

182 Seq2Seq-based ACS-aware question generation

Given a passage, an answer span, a clue span, and a desired question style, we train a neural encoder-decoder model to generate appropriate questions.

Encoder. We utilize a bidirectional Gated Recurrent Unit (BiGRU) [Chung et al., 2014] as our encoder. For each word $p_i$ in the input passage $p$, we concatenate different features to form a combined embedding vector $w_i$ as the input to the encoder. Specifically, each word is represented by the concatenation of its word vector and the embeddings of its Named Entity Recognition (NER) tag, Part-of-Speech (POS) tag, and whether it is a content word. In addition, since we know whether each word is within the span of the answer $a$ or the clue $c$, we utilize binary features to indicate the positions of the answer and the clue in the input passage. All tag features and binary features are cast into 16-dimensional vectors by different trainable embedding matrices.

Suppose the embedding of passage $p$ is $(w_1, w_2, \cdots, w_{|p|})$. Our encoder reads the input sequence and produces a sequence of hidden states $h_1, h_2, \cdots, h_{|p|}$, where each hidden state is a concatenation of a forward representation and a backward representation:

$$ h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \qquad (9.3) $$
$$ \overrightarrow{h}_i = \mathrm{BiGRU}(w_i, \overrightarrow{h}_{i-1}), \qquad (9.4) $$
$$ \overleftarrow{h}_i = \mathrm{BiGRU}(w_i, \overleftarrow{h}_{i+1}), \qquad (9.5) $$

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden states of the $i$-th token in $p$, respectively.
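A schematic PyTorch version of this encoder is sketched below; the shared tag table and the exact dimensions are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """BiGRU encoder over concatenated word and feature embeddings."""
    def __init__(self, vocab_size, n_tags, word_dim=300, tag_dim=16, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.tag_emb = nn.Embedding(n_tags, tag_dim)  # POS/NER tags share one table here
        self.answer_emb = nn.Embedding(2, tag_dim)    # binary answer-position indicator
        self.clue_emb = nn.Embedding(2, tag_dim)      # binary clue-position indicator
        self.gru = nn.GRU(word_dim + 3 * tag_dim, hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, words, tags, ans_pos, clue_pos):
        x = torch.cat([self.word_emb(words), self.tag_emb(tags),
                       self.answer_emb(ans_pos), self.clue_emb(clue_pos)], dim=-1)
        h, _ = self.gru(x)  # h[:, i] = [forward h_i ; backward h_i], as in Eq. (9.3)
        return h
```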

Decoder. Our decoder is another GRU with attention and copy mechanisms. Denote the embedding vector of the desired question style $s$ as $h_s$. We initialize the hidden state of the decoder GRU by passing the last backward encoder hidden state $\overleftarrow{h}_1$ through a linear layer and concatenating the result with $h_s$:

$$ s_l = \tanh(W_0 \overleftarrow{h}_1 + b), \qquad (9.6) $$
$$ s_0 = [h_s; s_l]. \qquad (9.7) $$

At each decoding time step $t$, the decoder calculates its current hidden state based on the word vector of the previously predicted word $w_{t-1}$, the previous attentional context vector $c_{t-1}$, and its previous hidden state $s_{t-1}$:

$$ s_t = \mathrm{GRU}([w_{t-1}; c_{t-1}], s_{t-1}), \qquad (9.8) $$

where the context vector $c_t$ at time step $t$ is a weighted sum of the input hidden states, with weights calculated by the concatenated attention mechanism [Luong et al., 2015]:

$$ e_{t,i} = v^\top \tanh(W_s s_t + W_h h_i), \qquad (9.9) $$
$$ \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{|p|} \exp(e_{t,j})}, \qquad (9.10) $$
$$ c_t = \sum_{i=1}^{|p|} \alpha_{t,i} h_i. \qquad (9.11) $$
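Equations (9.9)-(9.11) correspond to the following concatenated (additive) attention module, sketched in PyTorch under assumed tensor shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatAttention(nn.Module):
    def __init__(self, dec_hidden, enc_hidden):
        super().__init__()
        self.W_s = nn.Linear(dec_hidden, dec_hidden, bias=False)
        self.W_h = nn.Linear(enc_hidden, dec_hidden, bias=False)
        self.v = nn.Linear(dec_hidden, 1, bias=False)

    def forward(self, s_t, enc_states):
        # s_t: (batch, dec_hidden); enc_states: (batch, src_len, enc_hidden)
        e = self.v(torch.tanh(self.W_s(s_t).unsqueeze(1)
                              + self.W_h(enc_states))).squeeze(-1)  # Eq. (9.9)
        alpha = F.softmax(e, dim=-1)                                # Eq. (9.10)
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # Eq. (9.11)
        return c_t, alpha
```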

To generate an output word, we combine $w_{t-1}$, $s_t$, and $c_t$ to calculate a readout state $r_t$ with an MLP maxout layer with dropout [Goodfellow et al., 2013], and pass it to a linear layer and a softmax layer to predict the probabilities of the next word over the vocabulary:

$$ r_t = W_{rw} w_{t-1} + W_{rc} c_t + W_{rs} s_t, \qquad (9.12) $$
$$ m_t = [\max\{r_{t,2j-1}, r_{t,2j}\}]_{j=1,\ldots,d}^\top, \qquad (9.13) $$
$$ p(y_t \mid y_1, \cdots, y_{t-1}) = \mathrm{softmax}(W_o m_t), \qquad (9.14) $$

where $r_t$ is a $2d$-dimensional vector. For the copy (or pointer) mechanism [Gulcehre et al., 2016], the probability of copying a word from the input $p$ at time step $t$ is given by:

$$ g_c = \sigma(W_{cs} s_t + W_{cc} c_t + b), \qquad (9.15) $$

where $\sigma$ is the sigmoid function and $g_c$ is the probability of performing copying. The copy probability of each input word is given by the attention weights in Equation (9.10). It has been reported that the generated words in a target question are usually frequent words, while the majority of low-frequency words in the long tail are copied from the input rather than generated [Liu et al., 2019b]. Therefore, we reduce the vocabulary size to the top $N_V$ high-frequency words at both the encoder and the decoder, where $N_V$ is a predefined threshold that varies among datasets. This encourages the model to learn aggressive copying and improves the performance of question generation.
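One common way to wire the copy gate of Equation (9.15) together with the attention weights is the pointer-generator mixture sketched below. The mixing of the two distributions is an assumption of the sketch; the text specifies only the gate and the reuse of attention weights as copy probabilities.

```python
import torch
import torch.nn as nn

class CopyGate(nn.Module):
    """Copy/generate switch of Eq. (9.15); the per-word copy
    distribution reuses the attention weights alpha of Eq. (9.10)."""
    def __init__(self, dec_hidden, enc_hidden):
        super().__init__()
        self.W_cs = nn.Linear(dec_hidden, 1, bias=False)
        self.W_cc = nn.Linear(enc_hidden, 1)  # bias term plays the role of b

    def forward(self, s_t, c_t, p_gen, alpha, src_token_ids):
        # p_gen: (batch, vocab) generation distribution from Eq. (9.14);
        # alpha: (batch, src_len); src_token_ids: (batch, src_len).
        g_c = torch.sigmoid(self.W_cs(s_t) + self.W_cc(c_t))  # Eq. (9.15)
        p = (1 - g_c) * p_gen                                 # generate branch
        p = p.scatter_add(1, src_token_ids, g_c * alpha)      # copy branch
        return p
```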

GPT2-based ACS-aware question generation

Pre-trained large-scale language models, such as BERT [Devlin et al., 2018], GPT2 [Radford et al., 2019], and XLNet [Yang et al., 2019], have significantly boosted the performance of a series of NLP tasks, including text generation. They are mostly

[Figure: input representation for GPT2-based ACS-aware question generation. The GPT-2 language model input sums word embeddings, positional embeddings, and segment embeddings over the passage, clue, and answer segments.]