UCLA Electronic Theses and Dissertations
Title: Probabilistic Topic Models for Graph Mining
Permalink: https://escholarship.org/uc/item/7ss082g1
Author: Cha, Young Chul
Publication Date: 2014
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA
Los Angeles

Probabilistic Topic Models for Graph Mining

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Young Chul Cha

2014

© Copyright by Young Chul Cha 2014

ABSTRACT OF THE DISSERTATION

Probabilistic Topic Models for Graph Mining

by Young Chul Cha
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2014
Professor Junghoo Cho, Chair

In this research, we extend probabilistic topic models, originally developed for textual corpus analysis, to analyze more general graphs. In particular, we extend them to effectively handle: (1) a bias caused by a limited number of frequent nodes (“popularity bias”), and (2) complex graphs having more than two entity types.

For the popularity bias problem, we propose LDA extensions and new topic models that explicitly model the popularity of a node with a “popularity component”. In extensive experiments with a real-world Twitter dataset, our approaches achieve significantly lower perplexity (i.e., better prediction power) and improved human-perceived clustering quality compared to LDA.

To analyze more complex graphs, we propose a novel universal topic framework that takes an “incremental” approach of breaking a complex graph into smaller units, learning the topic group of each entity from the smaller units, and then “propagating” the learned topics to others. In a DBLP prediction problem, our approach achieves the best performance among many state-of-the-art methods. We also demonstrate the great potential of our approach using search logs from a commercial search engine.
The dissertation of Young Chul Cha is approved.

Carlo Zaniolo
D. Stott Parker
Gregory H. Leazer
Junghoo Cho, Committee Chair

University of California, Los Angeles
2014

To my family.

TABLE OF CONTENTS

1 Introduction
  1.1 Challenges in Graph Mining Using Probabilistic Topic Models
  1.2 Organization of Dissertation
2 Preliminaries
  2.1 Probabilistic Topic Models
  2.2 Heterogeneous Information Networks
3 Handling a Popularity Bias in Topic Models
  3.1 Introduction
  3.2 Applying Topic Models to Social Graphs
    3.2.1 Follow-Edge Generative Model
    3.2.2 Popularity Bias
  3.3 Handling a Popularity Bias
    3.3.1 Existing LDA Extensions
    3.3.2 Procedural Variations of LDA
    3.3.3 Popularity-Aware Topic Models
  3.4 Experiments
    3.4.1 Dataset and Experimental Settings
    3.4.2 Prediction Performance Analysis
    3.4.3 Example Topic Groups
    3.4.4 Grouping Quality Analysis
  3.5 Conclusion
4 Complex-Graph Analysis Using Topic Models
  4.1 Introduction
  4.2 Universal Topic Framework
    4.2.1 Edge Generative Model
    4.2.2 Incorporating Learned Topics
    4.2.3 Issues in Topic Incorporation
  4.3 Experiments and Analyses
    4.3.1 DBLP Experiment
    4.3.2 Online Search Experiments
  4.4 Conclusion
5 Related Work
6 Conclusions and Future Work
References

LIST OF FIGURES

3.1 Bipartite graph representation of models
3.2 Topic group on cycling showing the popularity bias
3.3 Topic hierarchy and documents generated in the hierarchy
3.4 Two-step labeling approach
3.5 Example of the threshold noise filtering process
3.6 LDA and proposed topic models
3.7 Distributions of incoming and outgoing edges
3.8 Perplexity comparison
3.9 Sample topic groups I
3.10 Sample topic groups II
3.11 Quality comparison
4.1 Examples of HINs
4.2 Example of edge labeling
4.3 Proposed topic models
4.4 Structures of two types of search logs
4.5 Examples of topic incorporation orders
4.6 Performance Analysis
4.7 Topic Granularity Analysis
4.8 Incorporation of C-Log (the last bars show results using C-Log)

LIST OF TABLES

3.1 Symbols used throughout this chapter and their meanings
3.2 Statistics of our Twitter dataset
3.3 Experimental cases and descriptions
4.1 Statistics of the datasets
4.2 Prediction accuracy on the DBLP dataset. Except for ours, all results are from [16, 12].
4.3 Venue clusters
4.4 Example topic cluster without UniZ (topic number 43)
4.5 Example topic cluster from UniZ

ACKNOWLEDGMENTS

First, I would like to thank my wonderful advisor, Junghoo “John” Cho. If it had not been for his help and advice, I would not have finished my Ph.D. study. When I came back to school in 2009 after working for 10 years in the telecom industry, I had quite a hard time catching up with all the coursework required for my new major, data mining. He was very patient with my slow progress and never rushed me. Whenever I was struggling with a hard problem, he provided me with insightful advice rather than a solution, and motivated me to learn while seeking the solution. I remember his I-know-it-all smile when I made a great fuss about my small finding after following his advice. Although I was very busy with my school work during my first two years, I was happy because I could feel I was getting better and better as a researcher. I really appreciate his help and great advice. I only hope that I will be able to practice half as much as what I learned from him when I begin to work with mentees of my own.

Second, I thank my lovely wife, Gook Hee Lee, for her great support.
I know that it was not an easy decision for her to quit her promising job in Korea and live as the wife of a graduate student in a foreign country. However, she went through all the hard years without many complaints. She was very supportive and willingly adjusted her schedule to mine. I especially thank her for her effort to make me healthier when I had a health issue. She searched for good food ingredients for me and spent time learning how to make delicious food with them. I also thank her for her great culinary skills. Without them, I would have had a hard time eating the large amount of food she cooks for me.

Special thanks to Carlo Zaniolo, Stott Parker and Gregory Leazer for serving on my committee. Their insightful comments made my thesis more solid, and I learned a lot from them. In particular, Greg gave me very detailed comments on various aspects of my thesis. I also owe Carlo, Stott and John for their great classes. I was able to build solid data mining knowledge through their wonderful classes and projects.

I would also like to thank our group members, including Michael Welch, Uri Schonfeld, Chu-Cheng Hsieh, Bin Bi, Yuchen Liu, Dong Wang, Zijun Xue, Christopher Moghbel, and Giljoo Na. Michael and Uri gave me good advice when I first joined the group, and Chu-Cheng always magically solved my computer and network issues. Bin was so smart and helped me a lot with developing topic models, and I had many interesting academic talks with Yuchen. Dong and Zijun were always nice to me whenever I needed their help. Chris proofread most of my papers and provided sharp comments. Giljoo gave me good advice on life, and his wife frequently invited me to wonderful dinners when I lived alone during my first year.

I was very lucky to have the chance to work with Keng-hao Chang, Hari Bommaganti, Ye Chen, Jian Yuan and Tak Yan during my internships at Microsoft. They provided great ideas and a stimulating environment during the summers.
I was also able to apply what I learned in school to large real datasets.

Finally, I would like to thank my parents and brothers for their unwavering love and support. They always firmly believe in me and respect my decision. Whenever I am weary and in doubt, their love and belief cheer me up and give me strength. I am grateful for their selfless support and hope they are always healthy and happy. I deeply thank God, who takes care of my family and leads me the right way.

VITA

1998 B.S. (Computer Science), Yonsei University, Korea.
2000 M.S. (Computer Science), Yonsei University, Korea.
2000–2014 Researcher & Manager, KT, Korea.

PUBLICATIONS

Y. Cha, B. Bi, and J. Cho, Handling a Popularity Bias in Probabilistic Topic Models, under review.

Y. Cha, B. Bi, J. Cho, K. Chang, H. Bommaganti, Y. Chen, and T. Yan, A Universal Topic Framework (UniZ) and Its Application in Online Search, under review.

Y. Cha, J. Cho, J. Yuan, and T. Yan, Exploration of the Effects of Category Match Score in Search Advertising, ICDE, Apr. 2014.

Y. Cha, B. Bi, C. Hsieh, and J. Cho, Incorporating Popularity in Topic Models for Social Network Analysis, SIGIR, best paper award nominee, Jul.-Aug. 2013.

Y. Cha and J. Cho, Social Network Analysis Using Topic Models, SIGIR, Aug. 2012.

CHAPTER 1

Introduction

1.1 Challenges in Graph Mining Using Probabilistic Topic Models

Recently, topic models have been widely used for textual corpus analysis because of their high-quality analysis results.