User Behavior Modeling with Large-Scale Graph Analysis Alex Beutel May 2016 CMU-CS-16-105
Total Page:16
File Type:pdf, Size:1020Kb
User Behavior Modeling with Large-Scale Graph Analysis Alex Beutel May 2016 CMU-CS-16-105 Computer Science Department School of Computer Science Carnegie Mellon University Pittsburgh, PA Thesis Committee: Christos Faloutsos, Co-Chair Alexander J. Smola, Co-Chair Geoffrey J. Gordon Philip S. Yu, University of Illinois at Chicago Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright © 2016 Alex Beutel. This research was sponsored under the National Science Foundation Graduate Research Fellowship Program (Grant no. DGE- 1252522) and a Facebook Fellowship. This research was also supported by the National Science Foundation under grants num- bered IIS-1017415, CNS-1314632 and IIS-1408924, as well as the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. I also would like to thank the CMU Parallel Data Laboratory OpenCloud for providing infrastructure for ex- periments; this research was funded, in part, by the NSF under award CCF-1019104, and the Gordon and Betty Moore Foundation, in the eScience project. Also, I would like to thank the Open Cloud Consortium (OCC) and the Open Science Data Cloud (OSDC) for the use of resources on the OCC-Y Hadoop cluster, which was donated to the OCC by Yahoo! The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. Keywords: behavior modeling, graph mining, scalable machine learning, fraud detection, recommendation systems, clustering, co-clustering, factorization, machine learning systems, distributed systems Dedicated to my family. Jess, Mom, Dad, Stephen, Grandma, Papa, Nanny and Pa—I love you. iv Abstract Can we model how fraudsters work to distinguish them from normal users? Can we predict not just which movie a person will like, but also why? How can we find when a student will become confused or where patients in a hospital system are getting infected? How can we effectively model large attributed graphs of complex interactions? In this dissertation we understand user behavior through modeling graphs. Online, users interact not just with each other in social networks, but also with the world around them—supporting politicians, watching movies, buying clothing, searching for restaurants and finding doctors. These interactions often include insightful contextual information as attributes, such as the time of the interaction and ratings or reviews about the interaction. The breadth of interactions and contextual information being stored presents a new frontier for graph modeling. To improve our modeling of user behavior, we focus on three broad chal- lenges: (1) modeling abnormal behavior, (2) modeling normal behavior and (3) scaling machine learning. To more effectively model and detect abnormal behavior, we model how fraudsters work, catching previously undetected fraud on Facebook, Twitter, and Tencent Weibo and improving classification accuracy by up to 68%. By designing flexible and interpretable models of normal behavior, we can predict why you will like a particular movie. Last, we scale modeling of large hypergraphs by designing machine learning systems that scale to hundreds of gigabytes of data, billions of parameters, and are 26 times faster than previous methods. This dissertation provides a foundation for making graph modeling useful for many other applications as well as offers new directions for designing more powerful and flexible models. vi Acknowledgements First and foremost, I want to thank my entire family for fostering my curiosity to get me to graduate school and then supporting me through it. Thank you Jess, for giving me such deep love and happiness for the past five years. Thank you Mom and Dad, for always encouraging me to do my best, not just in my work but as a person, for being my compass through graduate school, and for always being there, even in my difficult times. Thank you Stephen, for always being a much-needed grounding force in my life, consistently providing clarity when I was lost. Thank you Papa and Grandma—it is your unreasonable confidence in me and combination of intellectual idealism and real-world perceptiveness that has brought me here and helped me succeed. Nanny and Pa—I think about you often and deeply wish you were still here. Thank you to my many aunts, uncles, cousins, and Jess’s entire family, who have all been there for me at key points in graduate school. It cannot be overstated the role that my advisors had in my thesis, my time in graduate school, my career, and my life going forward. I want to thank Christos Faloutsos, for his patience and persistence with me as a young student, and his support throughout graduate school. He has taught me how to be an academic, and more broadly, how to simultaneously be a mentor, colleague and friend. Additionally, I want to thank Alex Smola, who has, since even before he came to CMU, pushed me and my research forward, making me learn new concepts and broaden my perspective. From technical insights to career opportunities, he repeatedly guided me down paths that I didn’t know were possible. I am forever grateful to you both. Additionally, I want to thank the rest of my committee: Geoff Gordon and Philip Yu. Thank you both for your comments, feedback and guidance that significantly shaped and focused this thesis. Graduate school has been an exciting experience primarily because of my friends. They have helped me through courses, been my collaborators, outlet for relaxation, and general source of support. Everyone in Christos’s database group has become like family. Vagelis Papalexakis—despite entering graduate school at the same time, you have been an academic big brother to me, both literally and figuratively. I want to thank Aditya Prakash, Danai Koutra, Leman Akoglu, Polo Chau, and U Kang for creating a welcoming environment, mentoring me as a young graduate student and continuing to be be there for me long after leaving CMU. I also want to thank Neil Shah, Jay-Yoon Lee, Rishi Chandy, Bryan Hooi, Hemank Lamba, Hyun Ah Song, Dhivya Eswaran and Kijung Shin for making the database group awesome, being fun to work with, hang out with, and everything in between. Beyond those in the database group, I have had many amazing collaborators without whom this dissertation would not have been possible—Abhimanu Kumar, Meng Jiang, Amr Ahmed, Chao-Yuan Wu, Kenton Murray, Partha Talukdar, and Stephan Günnemann. All of the work in this dissertation came from truly collaborative endeavors, and I feel lucky to have had such great friends to work with in these efforts. Additionally, I would like to thank Venkat Guruswami, Peng Cui, Eric Xing and Qirong Ho who also offered their valuable insights at key points in my research. vii Last, but certainly not least, I want to thank my friends from outside of my research life. Thank you Matteus Tanha, Anthony Brooks, David Naylor, Nico Feltman, David Wit- mer, Richard Wang, John Dickerson, Dougal Sutherland, Elie Krevat, Andrew O’Rourke and the many, many others that have made me love Pittsburgh and my time here. I also want to thank Marilyn Walgora and Deborah Cavlovich. You have both been essential in making my time at Carnegie Mellon go smoothly and pleasantly; I greatly appreciate all of your help and extra effort, far beyond what is reasonable. Marilyn— enjoy retirement! You deserve the break one thousand times over. One of the most exciting parts of my time in graduate school has been my summer internships. I have been extremely lucky to have had great mentors every summer. Thank you Wanhong Xu, Chris Palow, Lars Bakstrom, Markus Weimer, Vijay Narayanan, Ed Chi, and Zhiyuan Cheng for providing such wonderful environments for research, and for grounding my work back at Carnegie Mellon. In particular, I want to thank Wanhong Xu, who in my first summer as an intern at Facebook, got me started working on fraud detection, and for which his intuition on the problem provided the foundation for much of the rest of my fraud detection research in graduate school. Additionally, I want to thank Ed Chi, both for making my time at Google interesting, and also for his selfless support during my job search. Additionally, I want to thank Tina Eliassi-Rad, who has been indispensable in her guidance and support throughout my job search, and also always makes conferences far more fun. I also do not want to forget how I got to graduate school. Thank you to my many friends and colleagues at Duke University who helped me get into graduate school and continued to be sources of support throughout graduate school. In particular, I want thank Pankaj Agarwal and Thomas Mølhave for making me fall in love with research and continuing to be mentors long after I left Duke. Additionally, I want to thank all of my childhood, high school and college friends who have kept me tied to the real world and worked to keep me grounded despite my esoteric explorations. You all know who you are. This is a bittersweet end to a consequential period of my life. Thank you all for making it so. viii Contents I Introduction and Background1 1 Introduction3 1.1 Overview and Contributions..........................4 1.1.1 Modeling Abnormal Behavior.....................4 1.1.2 Modeling Normal Behavior.......................6 1.1.3 Scalable Machine Learning.......................8 1.2 Overarching Thesis Statements......................... 10 2 Preliminaries and Background 11 2.1 Graph Structure.................................. 11 2.2 Mathematical data structures.......................... 13 2.3 Model Structure.................................. 14 2.3.1 Factorization............................... 14 2.3.2 Clustering................................. 15 2.4 Learning...................................... 17 II Modeling Abnormal Behavior 19 II.1 Related Work..................................