
ABSTRACT

Title of dissertation: PROBABILISTIC MODELS FOR SCALABLE KNOWLEDGE GRAPH CONSTRUCTION

Jay Pujara, Doctor of Philosophy, 2016

Dissertation directed by: Professor Lise Getoor, Department of Computer Science

In the past decade, systems that extract information from millions of Internet documents have become commonplace. Knowledge graphs – structured knowledge bases that describe entities, their attributes and the relationships between them – are a powerful tool for understanding and organizing this vast amount of information. However, a significant obstacle to knowledge graph construction is the unreliability of the extracted information, which stems from noise and ambiguity in the underlying data and from errors made by the extraction system, together with the complexity of reasoning about the dependencies between these noisy extractions.

My dissertation addresses these challenges by exploiting the interdependencies between facts to improve the quality of the knowledge graph in a scalable framework. I introduce a new approach called knowledge graph identification (KGI), which resolves the entities, attributes and relationships in the knowledge graph by incorporating uncertain extractions from multiple sources, entity co-references, and ontological constraints. I define a probability distribution over possible knowledge graphs and infer the most probable knowledge graph using a combination of probabilistic and logical reasoning. Such probabilistic models are frequently dismissed due to scalability concerns, but my implementation of KGI maintains tractable performance on large problems through the use of hinge-loss Markov random fields, which have a convex inference objective. This allows the inference of large knowledge graphs with 4M facts and 20M ground constraints in 2 hours. To further scale the solution, I develop a distributed approach to the KGI problem which runs in parallel across multiple machines, reducing inference time by 90%. Finally, I extend my model to the streaming setting, where a knowledge graph is continuously updated by incorporating newly extracted facts. I devise a general approach for approximately updating inference in convex probabilistic models, and quantify the approximation error by defining and bounding inference regret for online models. Together, my work retains the attractive features of probabilistic models while providing the scalability necessary for large-scale knowledge graph construction. These models have been applied to a number of real-world knowledge graph projects, including the NELL project at Carnegie Mellon and the Google Knowledge Graph.

PROBABILISTIC MODELS FOR SCALABLE KNOWLEDGE GRAPH CONSTRUCTION

by

Jay Pujara

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2016

Advisory Committee:

Professor Lise Getoor, University of Maryland, Chair
Professor William W. Cohen, Carnegie Mellon University
Professor Hector Corrada-Bravo, University of Maryland
Professor Hal Daumé III, University of Maryland
Professor Philip Resnik, University of Maryland

© Copyright by Jay Pujara 2016

Dedication

That’s it. The lover writes, the believer hears,
The poet mumbles and the painter sees,
Each one, his fated eccentricity,
As a part, but part, but tenacious particle,
Of the skeleton of the ether, the total
Of letters, prophecies, perceptions, clods
Of color, the giant of nothingness, each one
And the giant ever changing, living in change.

–Wallace Stevens, from A Primitive Like an Orb

Acknowledgments

As a lover of wordplay, I would sometimes imagine that scholarship was, instead of a task undertaken in the windowless offices of the AV Williams building, an actual ship – a tall schooner, plowing through the dark, mysterious seas of Research with its crisp, white sails billowing in the wind. However, if the metaphor has any truth to it, this ship would have one (or possibly many) leaks of unknown origin, the lines would be tangled in an irredeemable mess, and it would be well off course and weeks behind schedule. Amidst those sinking realizations, I would often temper my despair by reflecting how thankful I was that I was not on this ship alone.

My advisor Lise Getoor deserves most of the credit for keeping me from running aground. I can’t say why she thought I was a good bet when I arrived at her office in March 2010 after crossing a continent and navigating the Metrobus system: sweaty, unkempt, carrying a 90 liter hiking pack and babbling disconnectedly about half-baked research ideas. Whatever her reasons, I’m extremely thankful to have the opportunity to work with her. Lise has been patient, allowing me the time to explore many avenues that turned out to be dead ends and forgiving many late weekly reports; encouraging, battling my skepticism and pessimism when confronted with frustrating datasets or bleak experimental results; and dedicated, investing her nights and weekends to sending meticulously (and often distressingly) annotated paper drafts in hopes my writing would improve. Above all, I have been astounded by how much Lise cares about her students and how hard she works to help them achieve their goals, whatever they may be.

Lise has also helped me forge connections with many of the mentors who have helped me through the course of my studies. One of these mentors is William Cohen, who was critical in shaping much of my work on knowledge graphs. I benefited from William’s deep experience in NLP and machine learning, his pragmatic approach to slashing through the most troubling Gordian knots, and his ability to connect me with just the right person to solve my data woes. As a rule, no matter what problem I brought into a meeting with William, by the end he would have pointed me in the right direction to make progress. I’m grateful to have had the opportunity of this collaboration, as well as the opportunity to work with William briefly at Carnegie Mellon.

I also owe a debt of gratitude to Hal Daumé III. Hal joined the University of Maryland at the same time as me (albeit as a professor), and his office was directly across from the LINQS lab so I’ve been bugging him throughout my entire graduate career. I’m not sure how to describe Hal’s magic, but every time I walked out of his office I would feel better, whether it was because I finally understood the optimization problem I’d been working on for a week, or had a new perspective on the related work for an idea I’d encountered, or simply because he helped defuse my anxiety and gave me fresh hope.

Many of my most formative experiences as a researcher have been due to all I have learned as a member of the LINQS lab.

I can’t enumerate all of the ways being a member of LINQS has contributed to my maturation as a researcher, but it spans everything from the weekly reading group to nightly Skeeball sessions in South Lake Tahoe during the NIPS conferences. While what I’ve learned is hard to list, the wonderful people I’ve had the opportunity to learn from come readily to mind. I’d like to thank the various brilliant postdocs who have been part of the group and whose particular expertise I’ve benefited from: Jimmy Foulds, Bert Huang, Angelika Kimmig, Stanley Kok, and Lily Mihalkova. When I started in LINQS, I was extremely appreciative of the sage advice and guidance of the senior students: Mustafa Bilgic, Matthias Broecheler, Walaa Eldin Moustafa, Galileo Namata, Hossam Sharara, and Elena Zheleva. Three other LINQS members joined the lab at the same time I did: Steve Bach, Ben London, and Theo Rekatsinas. I cannot express how much their friendship, guidance, and help has meant to me. LINQS has also had a number of visiting students in various forms, with whom I’ve had fruitful collaborations: Golnoosh Farnadi, Adam Grycner, Eric Norris, and Natalia Diaz Rodriguez. Finally, I want to thank the next generation of LINQS, who have taught me so much and been incredibly kind and welcoming: Shobeir Fakhrei, Matthew Howard, Pigi Kouki, Alex Memory, Hui Miao, Arti Ramesh, Dhanya Sridhar, and Sabina Tomkins – I’m glad to see them continuing the vibrancy of the LINQS community.

The work in this dissertation would have been impossible without help from a number of people whose contributions at critical points greatly influenced the direction of my research. I’d like to thank the NELL group at Carnegie Mellon, particularly Bryan Kisiel, Jayant Krishnamurthy, Bhavana Dalvi Mishra, and Anthony Platanios. Serendipitously, working on NELL allowed me to reconnect with Tom Mitchell, who was my advisor as an undergraduate and Masters student at Carnegie Mellon, and gave me the taste for research which brought me back to my PhD. I’m grateful to Hector Corrada Bravo and Philip Resnik for their diligent service on my committee. I’d also like to give special thanks to Hui Miao, who provided crucial help during my initial explorations of KGI, and Ben London, who helped me distill my amorphous ideas of how streaming inference should work into workable theory.

I have had some wonderful mentors at two internships during my graduate studies. While at Google, I had the great fortune to work with Luna Dong, Evgeniy Gabrilovich, Curtis Janssen, Kevin Murphy, Wei Zhang, and other members of the Knowledge Vault team. Their support helped me work with the vast real data available at Google and master the necessary tools. Similarly, I had an excellent time working at LinkedIn, enabled with the support of Mathieu Bastian, Christopher Lloyd, Pete Skomoroch, Sal Uryasev, William Vaughan and the many other generous scientists in Decision Sciences. I would also like to thank my fellow interns who helped me along the way, in particular: Tim Althoff, Praveen Bommannavar, Arun Chaganty, Ari Kobren, Matthieu Monsch, and Karthik Raman.

The ties between the many people I’ve thanked above and my research are fairly direct; I can recall anecdotes about how their help has advanced my ability as a researcher. However, the key ingredient that allowed me to complete a PhD is the support of my friends and family. I can’t do justice to all of the people who have been there for me over the last six years, but I would like to acknowledge at least some of these special people.
First, I want to thank Patrick Barry, Matt Brown, Anna Geisler, Brian Goodman, Neil Halelamien, Jonathan Hsu, Emily Huang, Karen Knee, Derek Leung, Jennifer Lin, Danielle Little, Steve and Jane Onorato, Devesh and Masumi Parekh, Kris Popendorf, Deanna Rubin, Mark Schmit, Vaughn Tan, and Philip Yam – I’ve known most of this crowd over a decade and despite the separation of time and distance, I’m grateful we’ve remained connected.

One of the intimidating aspects of starting a PhD is often moving somewhere new and establishing a new social circle. I’ve been lucky enough to find many great friends over my PhD, but the core circle of friends most responsible for my sanity by way of spontaneous parties, board games, and trivia night deserves special mention: Stephen Bach, Cody Buntain, Leigh Cook, Philip and Robin Dasler, Alexis Evangelos, Julian Grizzard, Renee LaGue, Steven Lee, Piotr Madrizel, Sana Malik, Matthew Mauriello, Alex Malozemoff, Brenna McNally, Kris Micinski, Issac Roberts, Karla Saur, and Amanda Strickler – I couldn’t have done it without you, especially Renee. I’d also like to thank Brad Dettmer, Nicole Mendoza, Subhodeep Mishra, Soja-Marie Morgens, Dhanya Sridhar, and Rob Sumner who introduced me to many wonderful people during my travels to Santa Cruz and Pittsburgh.

Finally, the deepest support I’ve received comes from my family. My siblings, Shreya and Hirsh, have been incomparable companions, whether tempting the jaws of a petrified alligator on a hike or playing board games in front of a roaring fire or attempting a baking adventure. I can’t imagine where I would be without the love and understanding of my parents Kirit and Smita; they’ve always been there for me, supported me in whatever I’ve chosen to undertake, and while they were frequently mystified by what I was doing and why it was taking so long, they never doubted I would succeed (although often doubted that I was eating a healthy dinner).

Portions of this work were supported by the National Science Foundation (NSF), under NSF CAREER grant 0746930 and NSF grant numbers CCF0937094 and IIS1218488; by the Intelligence Advanced Research Projects Activity (IARPA), via Department of Interior National Business Center (DoI/NBC) contract number D12PC00337; by the Air Force Research Lab (AFRL) contract #FA8750-10-C-0191; and by research grants from Google and Yahoo! The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, IARPA, DoI/NBC, AFRL, Google, Yahoo! or the U.S. Government.

Table of Contents

1 Introduction
  1.1 Opportunities
  1.2 Challenges
  1.3 Approach and Contributions

2 Related Work
  2.1 The Quest for Knowledge in AI
  2.2 Knowledge Representation and Reasoning
  2.3 Semantic Web: Ontologies and Tools
  2.4 Information and Knowledge Extraction
  2.5 Probabilistic Graphical Models and Structured Prediction
  2.6 Contemporary Approaches to Knowledge Base Construction
  2.7 Streaming and Online Inference

3 Problem Formulation for Knowledge Graph Identification
  3.1 Knowledge Graphs
  3.2 Common Errors in Information Extraction
    3.2.1 Entity Ambiguity
    3.2.2 Attribute Errors
    3.2.3 Relation Extraction Errors
  3.3 Graph Identification
  3.4 Adapting Graph Identification to Knowledge Graphs

4 Modeling Knowledge Graph Identification
  4.1 Background: PSL for Knowledge Graphs
  4.2 Model for Knowledge Graph Identification
    4.2.1 Representing Uncertain Extractions
    4.2.2 Entity Resolution
    4.2.3 Enforcing Ontological Constraints
  4.3 Implementing Knowledge Graph Identification with PSL
  4.4 Experimental Evaluation
    4.4.1 Datasets and Experimental Setup
    4.4.2 Learning Model Weights from Training Data
    4.4.3 Open-World vs Closed-World Evaluation Setting
    4.4.4 Results for Closed-World Settings
    4.4.5 Results for Model Ablation Study
    4.4.6 Results for Open-World Settings
  4.5 Discussion

5 Entity Resolution for Knowledge Graphs
  5.1 Problem Definition
    5.1.1 Ambiguity In Candidate Extractions
    5.1.2 Incorporating New Extractions Into a Knowledge Graph
    5.1.3 Combining Information From Multiple Knowledge Graphs
  5.2 Approach
    5.2.1 Local and Collective Knowledge Graph Features
    5.2.2 Knowledge Graph Models at Different Granularity
  5.3 Modeling Knowledge Graph Entity Resolution
    5.3.1 Basic Features
    5.3.2 New Entity Features
    5.3.3 Abstract Knowledge Graph Features
    5.3.4 Domain-Specific Knowledge Graph Features
    5.3.5 Synthesis
  5.4 Evaluation
  5.5 Discussion

6 Scaling Knowledge Graph Identification
  6.1 Scalability Analysis of Knowledge Graph Identification
  6.2 Scaling Knowledge Graph Identification with HL-MRFs
  6.3 Scalability Challenges for Knowledge Graph Identification
    6.3.1 Partitioning Knowledge Graphs for Distributed Processing
    6.3.2 Scalability via Ontological Partitioning
      6.3.2.1 Handling Unevenly Distributed Extractions
      6.3.2.2 Dealing with Unbalanced Ontologies
  6.4 Evaluation
    6.4.1 Comparison of Partitioning Techniques
    6.4.2 Assessing the Impact of Partition Size
  6.5 Discussion

7 Online Collective Inference
  7.1 Preliminaries
  7.2 Inference Regret
    7.2.1 Regret Bounds for Strongly Convex Inference
    7.2.2 The Lipschitz Constant of the Features
  7.3 Algorithms for Online Inference Activation
    7.3.1 Background: ADMM Optimization
    7.3.2 ADMM Features
    7.3.3 Activation Algorithms
  7.4 Evaluation
    7.4.1 Online Collective Classification
    7.4.2 Collaborative Filtering
  7.5 Discussion

8 Conclusion and Future Work
  8.1 Future Work

A Sample PSL Program for Knowledge Graph Identification

B Additional Results for Knowledge Graph Identification
  B.1 Baseline Results
  B.2 Results Excluding Extractor Source Information
  B.3 Results Excluding Entity Resolution Information
  B.4 Results Excluding Ontological Information
  B.5 Results for the Knowledge Graph Identification Model
  B.6 Results for the Open-World Knowledge Graph Identification Model

Bibliography

Chapter 1: Introduction

Knowledge has always been an essential ingredient in the quest to build intelligent agents and systems. Representing information about the world, reasoning about new or unobserved facts, and learning from the environment are key facets of intelligent behavior. As a result, the problems of representation, reasoning, and learning are among the defining challenges of artificial intelligence. Decades of research have led to significant advances in each of these areas, and increasingly sophisticated approaches to collecting, organizing, and employing knowledge.

Concurrently, there has been an explosion of easily accessible machine-readable data and tools to process this data. For example, trillions of web pages and billions of videos are now available on the World Wide Web (WWW). Diverse, open-source toolkits are available to parse and understand text, perform voice recognition on audio, and recognize people and objects in video. The proliferation of publicly available information and powerful tools to extract this information have created myriad opportunities, particularly for creating systems that construct knowledge bases that span diverse domains.

However, the problem remains far from solved; many obstacles hinder the usefulness of knowledge base construction systems, and the resulting knowledge bases are often hampered by quality and coverage issues. In this dissertation, I explore the capabilities and limitations of current knowledge base construction systems, enumerate some of the salient challenges that confront such systems, and develop models to address these challenges.

1.1 Opportunities

The past two decades have seen enormous changes in the landscape of how information is organized, accessed and processed. In contrast to twenty years ago, much of the information in the world is available in digital form. Content spanning newspapers, periodicals, books, scholarly writings and presentations, musical works, product and business information, instructional videos, television programs and movies is available in digital, machine-readable formats. Moreover, many personal experiences are also recorded digitally, with photos, videos, and anecdotes shared using Internet technologies. The advent and popularity of the World Wide Web (WWW) and the ubiquity of personal computing devices have allowed billions of users access to this vast repository of publicly available information.

Simultaneously, academic research has made significant strides in two critical areas: extracting structured and semantically meaningful information from data and relating this information to rigorously-defined ontological formalisms. The former has been the work of the natural language and information extraction communities, and the latter the work of the Semantic Web movement. Together, these advances provide the ingredients for practical knowledge base construction.

The first of these ingredients is a set of increasingly powerful tools to understand text and extract information using linguistic features. The natural language processing community has produced better mechanisms for the key tasks in understanding text: parsing, part-of-speech tagging, named entity recognition, and semantic role labeling. These tasks yield the features that enable extracting entities and meaningful relationships between these entities. Information extraction research has devised a number of ways to use these features, along with modest amounts of training data, to extract a staggering number of entities and relationships from text.

Separately, the Semantic Web movement has worked to standardize knowledge representation, and provide tools to specify semantically meaningful and interoperable interpretations for information. A key effort of this community has been the definition of ontologies that serve as the schemas for knowledge. This technology allows relations and attributes in knowledge bases to be well-defined, and provides a set of constraints that ensure the coherency of the knowledge base. These tools have enabled data to be annotated with general and domain-specific ontological information.

The combination of data, algorithms, standards and systems provides a tremendous opportunity for AI research to convert information into knowledge. Building on work on knowledge representation, the output of these systems holds the potential to overcome the knowledge acquisition bottleneck. By realizing the pursuit of knowledge, information retrieval systems will be able to go beyond simply indexing the documents on the WWW, and instead build web-scale systems capable of understanding the vast troves of information available on the Internet.

These developments also reflect the growing needs of the billions of people seeking knowledge. Currently, every major search engine surfaces structured knowledge for select queries, and every major mobile computing platform includes a digital assistant feature that attempts to interactively answer user queries with information from a knowledge base. Although these systems currently rely heavily on curated knowledge, the pace at which data is increasing makes the automatic construction of knowledge bases an inevitable necessity.

1.2 Challenges

Although automated knowledge base construction is a growing necessity, addressing this need poses many challenges. While there is an incredible amount of data available, the quality of this data varies widely. The vast majority of available data is not annotated with ontological information, and in some cases the annotations are erroneous. Data that lacks annotation may also contain errors: the information may be outdated, incorrect, or malicious. Even when reliable data is available, an information extraction system may make errors. These many sources of erroneous information can compromise a knowledge base.

Erroneous information extracted from data is a serious problem, but the preponderance of data also provides a source of robustness. One hope is that by using ontological knowledge and combining information from many sources, noise in the knowledge base can be resolved, allowing us to produce a coherent knowledge base. However, this solution comes with its own challenge: scalability. Extraction systems can produce trillions of candidate facts, and ontological constraints and knowledge can introduce dependencies between any pair of facts. Resolving errors in a knowledge base requires considering the myriad dependencies between facts.

Finally, even if a system is capable of processing the deluge of facts and dependencies between them, new information is constantly being extracted. A practical knowledge base system will have to grow and adapt as new information is introduced. However, if knowledge base construction is prohibitively costly, incorporating new information may be impossible. Thus, knowledge base construction requires a system that can efficiently update a knowledge base while still remaining robust to errors.

The challenge facing knowledge base construction efforts can be distilled into a simple requirement: a successful system must be capable of applying the depth and complexity of ontological knowledge to the wealth of data generated by modern information extraction approaches. Meeting this requirement requires striking a careful balance – the system must handle the statistical features from information extraction as well as the semantic constraints from ontologies while retaining scalable performance to deal with the large amount of data. This is the challenge I address in this dissertation.

1.3 Approach and Contributions

My work addresses the challenges confronting knowledge base construction. In this dissertation, I develop a system that is capable of surmounting noisy source information by incorporating dependencies between facts while remaining scalable, even in a streaming setting. The structure of this document mirrors the major contributions of my work, portions of which have been published elsewhere (Pujara et al., 2013a,b,c, 2014, 2015a,b).

In Chapter 3, I provide examples of errors found in knowledge bases. I identify the common trends in these failures. I demonstrate that by representing the candidate facts of a knowledge base as a graph, correcting these errors corresponds to the principal tasks in graph identification. I introduce a model, knowledge graph identification (KGI), that is capable of resolving errors in extracted knowledge bases. At the end of the chapter, I pose a hypothesis that KGI will improve the quality of extracted knowledge bases.

I test the hypothesis that knowledge graph identification improves knowledge base quality in Chapter 4. I first develop a model for knowledge graph identification that incorporates uncertain extractions and ontological knowledge. I implement this model as a probabilistic graphical model, and discuss the important design constraints for this model. I perform extensive experiments to assess the impact of my model design and validate the KGI hypothesis.

In Chapter 5, I extend knowledge graph identification to address one of the foremost challenges in knowledge graph construction: entity resolution. I perform an analysis of the entity resolution problem settings found in knowledge graphs. I identify and formalize the important features for entity resolution in knowledge graphs. Using these features, I develop a general, probabilistic approach to entity resolution that addresses the differing requirements of each problem setting. I implement my entity resolution model for two different problem settings to demonstrate its generality. In empirical evaluation, I show the power of my general entity resolution framework across problem settings.

A looming challenge in any discussion of knowledge base construction is scalability. In Chapter 6, I explicate the scalability challenges of knowledge graph identification by performing a complexity analysis for the core problems of KGI. I support the scalability of my implementation of KGI with theoretical analysis and empirical investigation across different problem sizes. I propose a mechanism to improve scalability by distributing KGI across multiple machines. I implement a parallel KGI system, and demonstrate that this approach can yield ten-fold improvements in running time without significant loss of quality.

Another significant obstacle for practical knowledge base construction is the necessity of constantly updating the knowledge base as new information is extracted. In Chapter 7, I address the problem of growing and extending an existing knowledge graph in response to a stream of new extractions. I first formulate the general problem of collective online inference for probabilistic graphical models. To measure the performance of online inference, I introduce a new error measure, inference regret. I develop a bound for inference regret in a regime where inference output is partially updated with new evidence. I devise a set of algorithms for updating inference that use features from the inference optimization. In empirical evaluation, I show that these algorithms reduce inference regret while maintaining model performance.

Chapter 2: Related Work

A number of research areas are related to the work I have undertaken. First, there is a significant body of work on knowledge representation and reasoning which has enabled powerful mechanisms for manipulating knowledge. The application of that work is most clearly seen in the work of the Semantic Web community, which has defined formalisms and built tools that have many practical applications in knowledge base construction. In parallel to work on knowledge representation is work on knowledge extraction – processing the raw data that can eventually be transformed into knowledge. A number of other knowledge base construction projects also build on knowledge extraction systems, although with very different approaches. One particular subproblem in knowledge base construction, entity resolution, is the subject of decades of research. Orthogonal to the many methods relevant to knowledge base construction is the question of scalability. A number of advances in scalable optimization and distributed systems have made large-scale knowledge graph construction feasible. Finally, one of the key properties of real-world knowledge graph settings is that the construction task is online and must be updated to incorporate a stream of new evidence. Several research projects have considered updating inference in probabilistic models, and there are a diverse set of projects that seek to model dynamic data that I cover briefly.

2.1 The Quest for Knowledge in AI

The goal of knowledge base construction has been a key aspect of AI research since its foundation. Among the earliest work in AI was the introduction of reasoning systems that used knowledge bases to prove or disprove query assertions (Russell and Norvig, 1995).

One notable project in this pattern was the General Problem Solver (GPS) by Newell et al. (1959), which took as input an abstract set of knowledge that consisted of logical formulas and axioms, then used a search-based reasoning approach to decide whether a given logical statement was true or false. While the generality of GPS was among its attractive characteristics, this generality also constituted a significant weakness: search-based reasoning encountered the combinatorial explosion of exponentially many possible statements (Boden, 2008).

An ongoing theme of the subsequent work in knowledge base construction and reasoning has been addressing the problems of the combinatorial explosion while still retaining the power of knowledge-based systems. One development that garnered particular attention in the domain was the SHRDLU system (Winograd, 1972) which emphasized procedural knowledge over logical representation and restricted the domain to a micro-world, which offered a simplified environment with a limited set of objects, properties, and transformations. By enforcing these limitations, the SHRDLU system was able to complete a broad and versatile set of tasks in a block world environment while sidestepping the combinatorial explosion by restricting the size of the problem space.

SHRDLU and similar projects ushered in an era of knowledge-based approaches that took the form of expert systems (Buchanan and Shortliffe, 1984). Expert systems focused on narrow domains and used knowledge bases and rules that were hand-crafted by experts (Russell and Norvig, 1995). Early successes such as DENDRAL (Buchanan and Sutherland, 1968; Feigenbaum et al., 1970; Buchanan and Feigenbaum, 1978), an expert system for chemistry, and MYCIN (Shortliffe, 1974, 1976), an expert system in the medical domain, channeled the efforts of AI researchers towards expert systems for decades. While expert systems were often applied to limited domains, the approach also emboldened new work into commonsense reasoning and general-purpose knowledge bases, most vividly in the Cyc project.

The Cyc project (Lenat et al., 1985; Lenat and Guha, 1990; Lenat, 1995) took an expert-system approach by meticulously curating knowledge and crafting a deep and complex ontology for all knowledge. Operating over two decades, the project has amassed millions of assertions for hundreds of thousands of entities. One drawback and frequent criticism of this approach, and expert systems in general, is the knowledge acquisition bottleneck, describing the limitations of any computational approach that requires human intervention to operate (Lenat et al., 1985; Wagner, 2006).

2.2 Knowledge Representation and Reasoning

The diversity of approaches toward knowledge and reasoning in the AI community was accompanied by a diversity in the choices of knowledge representation (Barr and Davidson, 1981). The characteristics of these differing approaches to knowledge representation are shaped by a number of central choices. A principal question is whether knowledge should be represented explicitly, as semantically meaningful sentences, or implicitly, as part of the reasoning procedure or predictive mechanism. A second concern is the scope of the representation: should the representation of knowledge be universal or framed by the context? A third essential question is how a representation can handle uncertainty in cases where background knowledge or predicted outcomes cannot be clearly judged to be true or false. A full exploration of these choices is beyond the scope of this chapter, but I present some of the major relevant arguments in this longstanding discussion.

A forceful set of arguments has supported the role of knowledge representations based on declarative logical forms. Hayes (1977) and Nilsson (1982) emphasize the representational power of logic, argue that many other approaches (such as Semantic Networks and frame-based reasoning) are equivalent or inferior in their representation of knowledge, and dispel confusions about logic-based approaches. Nilsson (1991) also argues in support of logic-based artificial intelligence, with the proviso that intelligent machines will represent knowledge declaratively, using first-order predicate calculus. A key development which shaped logical approaches to defining and expanding knowledge was the introduction and refinement of description logics.

Description logics (Krötzsch et al., 2014; Rudolph, 2011; van Harmelen et al., 2007) incorporated first-order logical syntax with restrictions to ensure decidability and control tractability. A key feature of description logics is the separation of assertions from ontological knowledge. In description logics, the facts in a knowledge base, consisting of the attributes of entities and the relationships between them, are captured in the assertional box (ABox). Ontological knowledge about attributes and relationships is captured in the terminological box (TBox) and role box (RBox), respectively. The formal specification of valid ontological constraints in the TBox and RBox defined different classes of description logics. These specifications increased the clarity of the ramifications of supporting different types of ontological knowledge and enabled the construction of reasoning systems that demonstrated tractable performance on practically useful problems (Horrocks et al., 1999). Description logics also played a key role in the development of the Semantic Web (Baader et al., 2005).

2.3 Semantic Web: Ontologies and Tools

The Semantic Web movement was motivated by the prospect of exploiting the vast amount of information available on the Internet to build capable, knowledge-based systems. In order to address the knowledge acquisition bottleneck, the Semantic Web movement proposed democratizing the task of specification and annotation (Berners-Lee et al., 2001; Antoniou and Van Harmelen, 2004). The key to this democratization was a set of standards for specifying knowledge using common representations and formats, and a common set of ontological conventions and primitives that could be easily adapted to many domains (Hitzler et al., 2009). The promise of these developments was a so-called “Semantic Web” of unambiguous, machine-readable information alongside the human-readable World Wide Web.

One of the key steps in realizing this promise was the definition of standard formats to express semantic knowledge (Decker et al., 2000). These formats include general purpose specifications, such as the Resource Description Framework (RDF), RDF Schema, and the Web Ontology Language (Horrocks et al., 2003). These languages built on simple primitives such as subject-predicate-object triples. Many domain-specific specifications have been built on top of these languages (Rector, 2003), such as the Friend-of-a-Friend (FOAF) (Golbeck and Rothstein, 2008) schema for social networks, the Functional Requirements of Bibliographic Records (FRBR) (Boeuf, 2001) for cataloging publications, creators, and subjects, the Music Ontology (Raimond et al., 2007) for musical works, and GALEN and SNOMED (Smith et al., 2005) for medical information.

The development of these standards seemed a precondition for addressing the problem of building systems to acquire knowledge. By allowing many different knowledge creators to share the same approach to representation, many hoped that semantically annotated data would become omnipresent among content on the World Wide Web. However, as the rate of content creation on the World Wide Web continued to accelerate, Semantic Web annotations of this data did not keep pace (Shadbolt et al., 2006).

2.4 Information and Knowledge Extraction

Concurrently with the development of tools and ontologies by the Semantic Web community, a number of advances were made in information extraction (Sarawagi, 2008). One of the keys to the rapid development in information extraction was improved techniques for natural language processing. Improvements in parsing (Klein and Manning, 2003; Collins and Koo, 2005; De Marneffe et al., 2006), part-of-speech tagging (Toutanova et al., 2003), semantic role labeling (Màrquez et al., 2008), and named entity recognition (Nadeau and Sekine, 2007; Collins and Singer, 1999) were the raw ingredients that enabled the information extraction advances of the last decade.

Building from these advances, many recent projects focus on extracting information from text, including a number of efforts to extract structured knowledge. These projects usually adopted one of three models: (1) extracting information from structured textual inputs with a well-defined set of output targets; (2) extracting from unstructured text with no fixed set of output targets; or (3) extracting information from unstructured text while still maintaining a well-defined set of output targets. The first and third approaches are generally referred to as ontology-based information extraction (OBIE) (Wimalasuriya and Dou, 2010), while the second model is called open information extraction (OpenIE).

Projects that adopted the first model of information extraction often worked with information sources with regular and well-defined structure that were highly curated to maintain quality and minimize noise, with the online, user-constructed encyclopedia Wikipedia as a favorite example. Intelligence in Wikipedia (Weld et al., 2008), the DBpedia resource (Auer et al., 2007; Bizer et al., 2009) and YAGO (Suchanek et al., 2007, 2008; Kasneci et al., 2009) all use Wikipedia as input and produce knowledge bases that relate Wikipedia entities using structured labels, such as those found in informational tables on Wikipedia pages. These projects had a significant initial impact, since they quickly created a high-quality source of structured knowledge. However, a key limitation of these approaches was their need for high-quality, structured input, another instance of the knowledge acquisition bottleneck.

The second model of information extraction, OpenIE, took a drastically different approach to gathering knowledge (Etzioni et al., 2008; Christensen et al., 2011; Etzioni et al., 2011; Fader et al., 2011; Gamallo et al., 2012; Mausam et al., 2012). These systems relied heavily on natural language processing to analyze the structural features of naturally occurring text. Based on these features, OpenIE systems attempt to extract subjects and objects and determine the relationships between them in text. A key strength of this approach is generality: OpenIE can be applied to any text without defining target attributes or relationships in advance. However, a major challenge for OpenIE is the quality and interpretability of the results. Errors in the underlying NLP systems can compromise the output of the information extraction. More problematically, the systems extract attributes and relationships, but determining whether these extractions have any semantically useful content is still an open challenge (Wang et al., 2011).

The third model of information extraction attempts to find a middle ground between these approaches by extracting knowledge that fits a known schema using naturally occurring text (as well as structured information, when available). Projects such as Elementary (Niu et al., 2012a), the Knowledge Vault (Dong et al., 2014a), NELL (Carlson et al., 2010a), PROSPERA (Nakashole et al., 2011), and StatSnowball (Zhu et al., 2009) all use, to varying extents, a well-defined set of entity attributes and relationships to extract. An attractive characteristic of these systems is that they are adaptable to a variety of textual corpora, but still produce readily interpretable results. One challenge facing these approaches is the curse of dimensionality: the textual input has a vast number of features, and in supervised settings training data is required to determine which features are reliable for extracting a given attribute or relationship.

One strategy for addressing this difficulty widely adopted by the information extraction community is the use of semi-supervised learning to apply a modest amount of training data to expand the scope of extractions. In particular, many systems rely on distant or weak supervision, where instead of training instances, training data takes the form of a high-precision rule or a pattern that can be applied to instances (Surdeanu et al., 2010; Nguyen and Moschitti, 2011; Thomas et al., 2011; Krause et al., 2012; Takamatsu et al., 2012; Roth et al., 2013; Angeli et al., 2014; Fan et al., 2014). Usually a bootstrapping procedure is used to iteratively improve the model in these scenarios. While these techniques have fueled many different knowledge extraction efforts, many researchers have begun to use the outputs of these systems as the first stage in more sophisticated knowledge base construction systems using probabilistic modeling techniques.

2.5 Probabilistic Graphical Models and Structured Prediction

One of the key limitations of early approaches to reasoning in knowledge bases is the difficulty of capturing uncertainty, such as the uncertainty that arises from myriad extractions from text. Probabilistic models (Pearl, 1988) provide the mechanisms to capture this uncertainty in the data. The advent of probabilistic graphical models (PGMs) has provided the machine learning community with a powerful way of expressing dependencies between random variables and building models to make predictions or infer missing information (Koller and Friedman, 2009). PGMs have proved particularly useful in cases when output is structured and a number of related judgments must be made, such as modeling entities, labels, and links in a network (Namata et al., 2011) or learning a mixture of sparse distributions describing discrete data (Blei et al., 2003).

A number of approaches have extended PGMs to allow models to be specified easily and include a richer set of features. Markov logic networks (Richardson and Domingos, 2006) and probabilistic soft logic (Broecheler et al., 2010) provide a first-order logic syntax that allows discrete and continuous Markov random fields, respectively, to be easily specified, and lifted inference (Braz et al., 2005) takes advantage of templated models during inference. Approaches such as conditional random fields (Lafferty et al., 2001), max-margin Markov networks (Taskar et al., 2003), SEARN (Daumé III et al., 2009) and SVMstruct (Tsochantaridis et al., 2004) combine the advantages of feature-rich classification approaches and the structural dependencies between outputs captured by PGMs.

2.6 Contemporary Approaches to Knowledge Base Construction

Many recent research projects seek to build knowledge bases from the noisy outputs of knowledge extraction systems (Nickel et al., 2015). These projects adopt diverse approaches to this task: some focus on fusing knowledge from different techniques to obtain a more robust knowledge base, others try to map the extractions into a lower-dimensional space to remove variability, and a third set of approaches uses the ontological constraints between extractions to develop measures of coherency.

Information extraction systems generate many extractions using differing techniques from diverse inputs. A number of approaches seek to exploit the agreements and disagreements in extractions to improve knowledge base construction. Dong et al. (2014b) describe knowledge fusion, a method for combining the outputs of multiple extraction techniques. Other approaches (Platanios et al., 2014; Balcan et al., 2013; Carlson et al., 2010b) use these patterns to estimate the error rate of extraction techniques, and, by extension, better understand the errors in the knowledge base.

A second set of techniques seeks to improve knowledge base construction by mapping the noisy knowledge in extractions to a lower-dimensional space to reduce variability and capture the essential statistical correlations between relationships and attributes in the knowledge base (Nickel et al., 2011, 2014; Yao et al., 2012, 2013; Socher et al., 2013). These approaches can often exploit popular models for dimensionality reduction through matrix factorization or representation learning with neural networks. This strength comes at a cost: the generality of these approaches makes incorporating the semantic relationships in knowledge bases difficult.

A key differentiator between “knowledge” and arbitrary data is the semantically meaningful relationships that exist between the extractions. A number of approaches attempt to capture these relationships by introducing constraints into the knowledge base construction process. PROSPERA (Nakashole et al., 2011) uses hard constraints derived from the YAGO ontology with a weighted MaxSAT reasoner. WebChild (Tandon et al., 2014a,b) encodes measures of coherency using an integer linear program. Elementary (Niu et al., 2012b) and StatSnowball (Zhu et al., 2009) both include a collection of statistical models, including constraints specified with Markov logic, to improve the consistency of the extracted knowledge base. Jiang et al. (2012) introduce a method based on Markov logic networks that enforces ontological constraints jointly over all extractions, discussed in more detail in subsequent chapters. ProPPR takes a different approach, using random walks in a knowledge graph to determine the probability of extractions (Wang et al., 2014, 2015).

My work presents an alternate model for knowledge base construction that differs from the previous and contemporary work in a number of important ways. Systems such as PROSPERA and WebChild use specific, hand-coded constraints for their associated tasks and are difficult to optimize, whereas my work specifies a general model for the common failure cases in knowledge graphs, allows a straightforward specification of the associated constraints, and implements scalable optimization for these constraints. StatSnowball and Elementary rely on simpler models such as logistic regression and conditional random fields, but these models do not incorporate the rich ontological domain knowledge and their results must be heuristically reconciled with the output of a Markov logic network. The work of Jiang et al. uses a Markov logic network and jointly optimizes over all extractions, but the choice of Markov logic severely limits the scalability of the approach. Finally, the idea of locally grounded models using random walks, such as ProPPR, can be seen as complementary to my work, as I use a partially grounded model for streaming inference. However, the model of knowledge graphs I propose captures the general ontological relationships in knowledge graphs and provides a cleaner way of integrating ontological information.

2.7 Streaming and Online Inference

Updating inference is a longstanding problem in artificial intelligence. The classic problem of belief revision (Gärdenfors, 1992) considers revising and updating a set of propositional beliefs using a set of axiomatic guarantees of consistency. Diverse research has considered updating the parameters or structure of Bayesian networks in response to evolving evidence (e.g. Buntine, 1991; Friedman and Goldszmidt, 1997; Li et al., 2006). Finally, many models address dynamic or sequential data, such as Dynamic Bayesian Networks (Murphy, 2002) and hierarchical hidden Markov models (Fine et al., 1998). My work addresses the specific problem of approximating full MAP inference in the online setting when a model is given and provides formal guarantees for the approximation quality.

Making efficient updates to the full inference result is the goal of a related area of research, adaptive inference. Adaptive marginal inference (Acar et al., 2008; Sümer et al., 2011) can update the marginal probability of a query in $O(2^{tw(G)} \log n)$ time, where $tw(G)$ is the tree-width of the graph and $n$ is the number of variables. Adaptive MAP inference (Acar et al., 2009) can update the MAP state in $O(m + m \log(n/m))$ time, where $m$ is the number of variables that change their state. Though the algorithm does not need to know $m$ beforehand, a model change could result in changes to all $n$ variables' states, with cost equivalent to exact inference. These adaptive inference techniques do not currently support partial updates to the MAP state or accommodate budgeted updates.

Approximate adaptive inference was considered by Nath and Domingos (2010), who proposed expanding frontier belief propagation (EFBP), a belief propagation algorithm that only updates messages in the vicinity of the updated potentials. They showed that the beliefs generated by EFBP lower- and upper-bound the beliefs of full BP, thereby providing guarantees on the quality of the approximation. This analysis differs from mine in that it bounds the individual marginal probabilities, whereas I bound the L1 distance between MAP states. Unlike my approximation algorithm, EFBP does not explicitly limit computation and, in the worst case, may need to update all variables to achieve convergence conditions.

Another related line of work focuses on anytime inference. These methods perform belief propagation, either directly as in (Chechetka and Guestrin, 2010) or using lifted inference as in (de Salvo Braz et al., 2009), with the goal of providing a bounded inference of a query set given a time budget. This contrasts with my approach, which is focused on improving the overall MAP estimate in bounded time, rather than the estimate of a specific query set.

The quantity I call inference regret is conceptually similar to collective stability (London et al., 2013). Collective stability measures the amount of change in the output of a structured predictor induced by local perturbations of the evidence. London et al. (2013, 2014) analyzed the collective stability of marginal inference in discrete graphical models, concluding that (approximate) inference with a strongly convex entropy function enhances stability. My technical approach is similar, in that it also leverages strong convexity. However, the types of perturbations I consider—fixing target variables—are not covered by their analysis. Stability analysis is closely related to sensitivity analysis. Since the terms are used interchangeably in the literature, I distinguish them as follows: sensitivity analysis examines if and when the solution changes; stability analysis examines how much it changes by. Laskey analyzed the sensitivity of queries (which can be used for marginal inference) in Bayesian networks. Chan and Darwiche studied the sensitivity of queries (2005) and MAP inference (2006) in Markov networks. Their 2005 paper also analyzes the stability of queries.

Chapter 3: Problem Formulation for Knowledge Graph Identification

Given the many different approaches and projects focused on knowledge base construction, choosing the appropriate problem setting and attendant formalisms is important. To provide a clearer perspective on the obstacles for large-scale knowledge base construction, we inventory common errors produced by information extraction systems, illustrated with anecdotes from a real-world system. By representing a knowledge base as a graph, we demonstrate how the prominent errors in knowledge graphs correspond to the fundamental tasks in an approach known as graph identification. One important ingredient for graph identification for knowledge graphs is ontological information. We devise a formal definition of knowledge graph identification, which combines candidates from information extraction and ontological information, and show how this problem formulation addresses errors in knowledge graphs.

3.1 Knowledge Graphs

A knowledge graph is a structured knowledge base consisting of nodes and labeled, directed edges, $K = (V_k, E_k)$. Nodes in the knowledge graph correspond to entities. Each node is associated with a set of $s$ indicator variables, one for each possible label:

\[ v \in V_k,\; v \triangleq (l_1, l_2, \cdots, l_s); \quad \forall i \in (1 \cdots s),\; l_i \in \{0, 1\} \]

These labels capture categorical class or type information and attributes of the entity. The edges of the knowledge graph are relationships between two nodes. Since there are many possible relationships in a knowledge graph, each edge is defined by the source and target nodes and a set of indicator variables representing possible relationships:

\[ e \in E_k,\; e \triangleq (v_{src}, v_{tgt}, r_1, r_2, \cdots, r_t), \quad r_i \in \{0, 1\} \]
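
To make these definitions concrete, the sketch below shows one possible in-memory representation of a knowledge graph, with each node carrying indicator variables for its candidate labels and each edge carrying indicator variables for its candidate relationships. The label and relation vocabularies are hypothetical, and this structure is only an illustration of the definitions above, not the representation used by the KGI implementation described in later chapters.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical label and relation vocabularies; in practice these come from the ontology.
LABELS = ("country", "bird", "city")
RELATIONS = ("capitalOf", "locatedIn")

@dataclass
class Node:
    """An entity with one indicator variable l_i per possible label."""
    name: str
    labels: Dict[str, int] = field(default_factory=lambda: {l: 0 for l in LABELS})

@dataclass
class Edge:
    """A directed edge with one indicator variable r_i per possible relationship."""
    src: str
    tgt: str
    relations: Dict[str, int] = field(default_factory=lambda: {r: 0 for r in RELATIONS})

class KnowledgeGraph:
    """K = (V_k, E_k): nodes keyed by entity name, edges keyed by (src, tgt)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[Tuple[str, str], Edge] = {}

    def set_label(self, entity: str, label: str, value: int = 1) -> None:
        self.nodes.setdefault(entity, Node(entity)).labels[label] = value

    def set_relation(self, src: str, tgt: str, relation: str, value: int = 1) -> None:
        self.edges.setdefault((src, tgt), Edge(src, tgt)).relations[relation] = value

# Example usage with hypothetical facts:
kg = KnowledgeGraph()
kg.set_label("kyrgyzstan", "country")
kg.set_relation("bishkek", "kyrgyzstan", "capitalOf")
```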

The possible labels $(l_1, l_2, \cdots, l_s)$ and relationships $(r_1, r_2, \cdots, r_t)$ in the knowledge graph are defined by an ontology, also known as a concept graph. The dichotomy between the knowledge graph and the concept graph can be likened to the ABox and TBox in description logics (van Harmelen et al., 2007) or the data tables and schema in databases or RDF (Hitzler et al., 2009). However, the concept graph has a more structured form as a graph, where nodes are the possible labels and relationships in the knowledge graph, and edges between labels and relationships express ontological constraints. The many ontological relationships found in Semantic Web formats such as RDF and OWL are captured in the concept graph. Constraints such as domain (rdfs:domain), range (rdfs:range), label subsumption (rdfs:subClassOf), relation subsumption (rdfs:subPropertyOf), mutual exclusion of labels (owl:disjointWith), mutual exclusion of relations (owl:propertyDisjointWith), inverse labels (owl:complementOf), and inverse relations (owl:inverseOf) can all be expressed as edges in the concept graph.

For example, in a concept graph, mutually exclusive (disjointWith) labels are connected by an edge expressing the mutual exclusion relationship, while the label (class) that comprises the domain of a relation (property) can be specified by a domain edge between the class and the relation. Other ontological properties such as symmetry (owl:SymmetricProperty), functionality (owl:FunctionalProperty), transitivity (owl:TransitiveProperty), and cardinality (owl:cardinality) can be specified as attributes of nodes in the concept graph. Ontological relationships with higher arities, e.g. intersection (owl:intersectionOf) and union (owl:unionOf), can be more difficult to express, requiring the inclusion of hyperedges in the graph. Figure 3.1 shows an example of a knowledge graph and its accompanying concept graph.

[Figure 3.1: A simple example showing a knowledge graph, which captures entity labels and relationships between entities, and a concept graph, which captures ontological relationships between labels and relationships.]

3.2 Common Errors in Information Extraction

Information extraction systems operate over many diverse datatypes, such as web pages, images and videos, and use a collection of strategies to generate candidate facts from this data.

Many of the popular and practically successful information extraction systems have focused on extracting knowledge from text. These systems use a host of techniques, spanning syntactic, lexical and structural features of text. Ultimately, these extraction systems produce candidate facts that include a set of entities, attributes of these entities, and the relations between these entities, which can be viewed as a structure we call the extraction graph. However, errors in the source data and the extraction process introduce inconsistencies in the extraction graph. Such noise obscures the true knowledge graph, which captures a consistent set of entities, attributes and relations. In this section, we inventory the different kinds of errors made by information extraction systems and discuss their impact on the knowledge graph. We illustrate these errors and the accompanying challenges with examples taken from a real-world information extraction system, the Never-Ending Language Learner (NELL).

3.2.1 Entity Ambiguity

Entity recognition and mapping is one of the fundamental problems in information extraction. The task of recognizing entities in text is challenging for a number of reasons. Many textual analysis systems rely on parsing or part-of-speech tagging to identify entities, and both of these tasks suffer from errors. Often such systems fail to extract an entity, extract a spurious entity, or misidentify the boundary of the entity, producing an entity that includes extraneous content or is not specific enough (Nadeau and Sekine, 2007). For example, NELL includes entities for “Obama” and “Barack Obama Obama”; the former entity is not specific enough to disambiguate from other members of the Obama family, while the latter may be the result of mistakenly concatenating an entity spanning two sentences.

Even when entity recognition systems operate correctly, the diversity of text on the WWW can introduce noise. Many textual references that initially look different may refer to the same real-world entity. For example, NELL’s knowledge base contains candidate facts involving the entities “kyrghyzstan”, “kyrgzstan”, “kyrgystan”, “kyrgyz republic”, “kyrgyzstan”, and “kyrgistan”, which are all variants or misspellings of the country Kyrgyzstan found on the World Wide Web.

In the extraction graph, each of these incorrectly identified entities corresponds to a different node. In the correct knowledge graph, spurious nodes are removed and co-referent entities are collapsed into a single node. The task of entity resolution is to determine co-referent entities and produce a consistent set of labels and relations for each resolved node.

3.2.2 Attribute Errors

Another challenge in knowledge graph construction is inferring labels consistently. Information extraction systems use a variety of signals to predict labels, which range from extracting information using tables and links, to learning textual patterns indicative of a particular label. Since training data is scarce, the features associated with a particular label can be unreliable.

For example, NELL’s extractions assign Kyrgyzstan the labels “country” as well as “bird.” This error results from the interaction between an inconsistency in the source data and an inadequately robust feature in the extraction system. NELL extracted features from webpages on a birdwatching club website. These webpages present catalogs of birds seen on birdwatching trips by members of the club. The vast majority of these webpages contain hyperlinks whose displayed text is the name of a bird, leading to a webpage with pictures of the named bird. However, in one instance a member posted a link to all birds sighted on a trip to the country Kyrgyzstan, and specified the link text as Kyrgyzstan, leading to an erroneous extraction.

Such errors are commonplace due to a mismatch between the numerous features available in large text corpora and the tiny amount of training data available to assess the relevancy of these features. While the extraction graph may contain many incorrect and inconsistent labels, the correct knowledge graph maintains a consistent set of labels.

Ontological information suggests that an entity is very unlikely to be both a country and a bird at the same time, since these classes are mutually exclusive. Furthermore, Kyrgyzstan has a number of relationships with other entities, such as a hasCapital relationship with the entity Bishkek, which are inconsistent with the assertion that Kyrgyzstan is a bird. By using information from other labels and related facts, such inconsistencies can be removed from the knowledge graph. The task of determining the labels of a node by using features from neighboring nodes is called collective classification.

3.2.3 Relation Extraction Errors

A third problem commonly encountered in knowledge graphs is determining the relationships between entities. Similar to attribute prediction, information extraction systems use a host of features to determine whether a relationship exists between two extracted entities. Again, the staggering number of potential features is mismatched with the small amount of training data available for these applications, and errors are common.

For example, NELL also has many facts relating the location of Kyrgyzstan to other entities. These candidate relations include statements that Kyrgyzstan is located in Kazakhstan, Kyrgyzstan is located in Russia, Kyrgyzstan is located in the former Soviet Union, Kyrgyzstan is located in Asia, and that Kyrgyzstan is located in the US. Some of these possible relations are true, while others are clearly false and contradictory.

While the extraction graph may contain many spurious relationships, the correct knowledge graph includes a consistent set of links. Using the labels and other relationships of the nodes can aid the process of removing inconsistent links. The process of predicting edges between nodes is known as link prediction.

3.3 Graph Identification

In the previous section, we identified three core problems, entity ambiguity, attribute errors, and relation extraction errors, that appear in the extraction graph produced by an information extraction system. These three problems can be addressed by the tasks of entity resolution, collective classification, and link prediction, respectively. Namata et al. (2011) refer to the process of inferring the structure of a graph through the combination of entity resolution, collective classification, and link prediction as graph identification.

Graph identification relies on relational features, such as the labels and relations of neighboring nodes. One key observation about the graph identification problem setting is that the component tasks are intradependent as well as interdependent. For example, link prediction may be a function of the labels of neighboring nodes, but these labels are inferred by collective classification, which, in turn, depends on the correct resolution of entities and the links between them. As a result, graph identification benefits from a joint optimization, where entity resolution, collective classification, and link prediction are performed concurrently.

Graph identification can pose two substantial challenges: scalability and modeling.

Jointly performing entity resolution, collective classification and link prediction is far more complex than completing each task separately. In particular, graph identification models produce output in a state space that is the product of the state space of each task: the results include all possible labels, all possible entity co-references, and all possible relationships between entities. Namata et al. sidestep these problems with C3, or coupled collective classifiers, where each of the three tasks is performed separately, using the outputs of the other two tasks as features in the prediction, with the process repeated until a convergence criterion is met. The second challenge in graph identification is model specification. While the tasks of entity resolution, collective classification, and link prediction are interrelated, determining the features that appropriately capture these interactions can be difficult.

Figure 3.2: An illustration of the example showing how knowledge graph identification can resolve conflicting information in an extraction graph. Entities are shown in rectangles, dotted lines represent uncertain information, solid lines show ontological constraints, and double lines represent co-referent entities found with entity resolution.

3.4 Adapting Graph Identification to Knowledge Graphs

Refining an extraction graph to produce a knowledge graph is an instance of graph identification. Features from an information extraction system, such as confidence scores assigned to candidate extractions, can provide the basic features for graph identification. However, knowledge graphs have a key ingredient that differentiates them from the general graph identification setting: the ontological knowledge found in the concept graph. The concept graph provides a rich source of relational information that mediates dependencies between the facts in the knowledge graph. These dependencies allow a sophisticated approach to resolving the knowledge graph, while providing a well-founded specification for modeling interdependencies between the tasks in graph identification.

Figure 3.2 illustrates a complex example of how ontological information can be used to resolve a knowledge graph. As mentioned earlier, NELL’s ontology includes the constraint that the labels “bird” and “country” are mutually exclusive. Reasoning collectively allows us to resolve which of these two labels is more likely to apply to Kyrgyzstan. For example, NELL is highly confident that the Kyrgyz Republic has a capital city, Bishkek. The NELL ontology specifies that the domain of the relation “hasCapital” has label “country.” Entity resolution allows us to infer that “Kyrgyz Republic” refers to the same entity as “Kyrgyzstan.” Deciding whether Kyrgyzstan is a bird or a country now involves a prediction which includes the confidence values of the corresponding “bird” and “country” facts from co-referent entities, as well as collective features from ontological relationships of these co-referent entities, such as the confidence values of the “hasCapital” relations.

We refer to the process of inferring a knowledge graph from a noisy extraction graph using statistical features and ontological constraints as knowledge graph identification. The central hypothesis of this dissertation is that the collective dependencies found in a knowledge graph are paramount to the goal of knowledge graph construction. The problem definition of knowledge graph identification admits many possible models capable of meeting these requirements. In the next chapter, we define a probabilistic graphical model for knowledge graph identification that meets these requirements and tests the hypothesis that knowledge graph identification will improve the quality of the knowledge graph.

Chapter 4: Modeling Knowledge Graph Identification

The probabilistic model we define for knowledge graph identification uses a modeling framework called probabilistic soft logic (PSL). We provide background on PSL in the next section, and then use PSL to define a model for knowledge graph identification. This model captures both the uncertainty from information extraction systems and the key constraints between facts found in the ontology. Next, we convert this model of knowledge graph identification into a probability distribution over the space of possible knowledge graphs, and use this distribution to determine the most probable knowledge graph. Finally, we show the efficacy of this approach in a number of empirical experiments.

4.1 Background: PSL for Knowledge Graphs

Probabilistic soft logic (PSL) (Bach et al., 2015; Broecheler et al., 2010; Kimmig et al., 2012) is a recently-introduced framework which allows users to specify rich probabilistic models over continuous-valued random variables. Like other statistical relational learning languages such as Markov Logic Networks (MLNs), it uses first-order logic to describe features that define a Markov network. In contrast to other approaches, PSL employs continuous-valued random variables rather than binary variables and casts most probable explanation (MPE) inference as a convex optimization problem that is significantly more efficient to solve than its combinatorial counterpart (polynomial vs. exponential).

A PSL model is composed of a set of weighted, disjunctive first-order logic rules, where each rule defines a set of features of a Markov network sharing the same weight.

Consider the formula

P(A, B) ∧ Q(B,C) ⇒w R(A, B, C)

which is an example of a PSL rule. Here w is the weight of the rule, A, B, and C are universally-quantified variables, and P, Q and R are predicates. A grounding of a rule comes from substituting constants for universally-quantified variables in the rule’s atoms.

In this example, assigning constant values A = a, B = b, and C = c to the variables in the rule above would produce the ground atoms P(a, b),Q(b, c),R(a, b, c). Each ground atom takes a soft-truth value in the range [0, 1].

PSL associates a numeric distance to satisfaction with each ground rule that determines the value of the corresponding feature in the Markov network. The distance to satisfaction is defined by treating the ground rule as a formula over the ground atoms in the rule. In particular, PSL uses the Lukasiewicz t-norm and co-norm to provide a relaxation of the logical connectives, AND (∧), OR (∨), and NOT (¬), as follows:

p ∧ q = max(0, p + q − 1) (4.1)

p ∨ q = min(1, p + q) (4.2)

¬p = 1 − p (4.3)

This relaxation coincides with Boolean logic when p and q are in {0, 1}, and provides a consistent interpretation of soft-truth values when p and q are in the numeric range [0, 1].

A PSL program, Π, consisting of a model as defined above, along with a knowledge base of atoms, F, produces a set of ground rules, R, by substituting the universally quantified atoms in a rule with ground atoms from F. If I is an interpretation (an assignment of soft-truth values to ground atoms) and r is a ground instance of a rule, then the distance to satisfaction φr(I) of r is 1 − Tr(I), where Tr(I) is the soft-truth value from the Lukasiewicz t-norm. This distance to satisfaction takes the form of a hinge function, a commonly-used, convex relaxation of 0-1 loss that has many tractability benefits. For this reason the class of probabilistic models implemented by PSL is known as hinge-loss Markov random fields (HL-MRFs).
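
As a concrete illustration of these definitions, the following minimal Python sketch (not part of the PSL implementation; atom values are hypothetical) computes the Lukasiewicz relaxation and distance to satisfaction for a ground rule of the form p ∧ q ⇒ r:

    def luk_and(p, q):            # Lukasiewicz t-norm (Eq. 4.1)
        return max(0.0, p + q - 1.0)

    def luk_or(p, q):             # Lukasiewicz t-co-norm (Eq. 4.2)
        return min(1.0, p + q)

    def luk_not(p):               # Lukasiewicz negation (Eq. 4.3)
        return 1.0 - p

    def implication_truth(body, head):
        # body => head is rewritten as NOT(body) OR head
        return luk_or(luk_not(body), head)

    def distance_to_satisfaction(body, head):
        # phi_r(I) = 1 - T_r(I); for an implication this is a hinge: max(0, body - head)
        return 1.0 - implication_truth(body, head)

    # Hypothetical grounding: a candidate atom with truth 0.9 implies a target atom at 0.4
    body_truth = luk_and(0.9, 1.0)      # conjunction of the body atoms
    head_truth = 0.4                    # current value of the inferred atom
    print(distance_to_satisfaction(body_truth, head_truth))   # 0.5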

We can define a probability distribution over interpretations by combining the weighted distances to satisfaction over all ground rules, R, and normalizing, as follows:

f(I) = \frac{1}{Z} \exp\left[-\sum_{r \in R} w_r \, \phi_r(I)^p\right] \qquad (4.4)

Here Z is a normalization constant, w_r is the weight of rule r, and p ∈ {1, 2} allows a linear or quadratic combination of rules. Thus, a PSL program (set of weighted rules and facts) defines a probability distribution from a logical formulation that expresses the relationships between random variables.

MPE inference in PSL determines the most likely soft-truth values of unknown ground atoms using the values of known ground atoms and the dependencies between atoms encoded by the rules, corresponding to inference of random variables in the underlying Markov network. PSL atoms take soft-truth values in the interval [0, 1], in contrast to MLNs, where atoms take Boolean values. MPE inference in MLNs requires optimizing over combinatorial assignments of Boolean truth values. In contrast, the relaxation to the continuous domain greatly changes the tractability of computations in PSL: finding the most probable interpretation given a set of weighted rules is equivalent to solving a convex optimization problem. Recent work from Bach et al. (2012) introduces a consensus optimization method applicable to PSL models; their results suggest consensus optimization scales linearly with the number of ground rules in the model.
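
To connect this to Equation 4.4, the following sketch performs MPE inference for a toy model with a single unknown atom and two hinge-shaped ground-rule potentials. It uses a generic convex solver (scipy) rather than the consensus optimization used in practice, and all weights and truth values are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    def energy(x_vec):
        x = x_vec[0]
        # a supporting candidate atom with truth 0.9 pushes x upward
        phi_support = max(0.0, 0.9 - x)
        # a conflicting, mutually exclusive label with truth 0.6 pushes x downward
        phi_conflict = max(0.0, 0.6 + x - 1.0)
        # negative log of the unnormalized density in Eq. 4.4 (p = 2, unit weights)
        return 1.0 * phi_support**2 + 1.0 * phi_conflict**2

    # MPE inference: minimize the energy over the [0, 1] interval (a convex problem)
    result = minimize(energy, x0=np.array([0.5]), bounds=[(0.0, 1.0)])
    print(result.x)   # most probable soft-truth value of the unknown atom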

4.2 Model for Knowledge Graph Identification

Knowledge graphs contain three types of facts: facts about entities, facts about entity labels and facts about relations. We represent entities with the logical predicate ENT(E) and labels with the logical predicate LBL(E,L), where entity E has label L. Relations are represented with the logical predicate REL(E1,E2,R), where the relation R holds between the entities E1 and E2, i.e. R(E1,E2).

In knowledge graph identification, our goal is to identify a true set of atoms from a set of noisy extractions. Our method for knowledge graph identification incorporates three components: capturing uncertain extractions, performing entity resolution, and enforcing ontological constraints. We show how to create a PSL program that encompasses these three components, and then relate this PSL program to a distribution over possible knowledge graphs.

4.2.1 Representing Uncertain Extractions

We relate the noisy extractions from an information extraction system to the above logical predicates by introducing candidate predicates, using a formulation similar to Jiang et al. (2012). For each candidate entity, we introduce a corresponding predicate, CANDENT(E). Labels or relations generated by the information extraction system correspond to predicates CANDLBL(E,L) or CANDREL(E1,E2,R) in the system. Uncertainty in these extractions is captured by assigning these predicates a soft-truth value equal to the confidence value from the extractor. For example, the extraction system might generate a relation, hasCapital(kyrgyzstan, Bishkek) with a confidence of .9, which we would represent as CANDREL(kyrgyzstan, Bishkek, hasCapital) and assign it a truth value of .9.

Information extraction systems commonly use many different extraction techniques to generate candidates. For example, NELL produces separate extractions from lexical, structural, and morphological patterns, among others. We represent metadata about the technique used to extract a candidate by using separate predicates for each technique T, of the form CANDRELT and CANDLBLT . These predicates are related to the true values of attributes and relations we seek to infer using weighted rules.

CANDRELT (E1,E2,R) ⇒ REL(E1,E2,R) (4.5)

CANDLBLT (E,L) ⇒ LBL(E,L) (4.6)

Together, we denote the set of candidates, generated from grounding the rules above using the output from the extraction system, as the set C.
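
The following sketch (hypothetical data and names) illustrates how per-extractor candidate atoms could be generated from raw extractions and paired with the target atoms of rule 4.5 before grounding:

    # Per-extractor candidate atoms produced from raw extractions (illustrative only)
    extractions = [
        # (source technique, subject, object, relation, extractor confidence)
        ("CPL",  "kyrgyzstan", "bishkek", "hasCapital", 0.9),
        ("SEAL", "kyrgyzstan", "asia",    "locatedIn",  0.6),
    ]

    candidate_atoms = {}          # observed atoms: CANDREL_T(E1, E2, R) -> soft-truth value
    for source, e1, e2, rel, conf in extractions:
        candidate_atoms[(f"CANDREL_{source}", e1, e2, rel)] = conf

    # Each observed candidate atom grounds rule 4.5, linking it to an unknown target atom
    ground_rules = [
        {"body": atom, "head": ("REL",) + atom[1:]} for atom in candidate_atoms
    ]
    print(ground_rules[0])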

4.2.2 Entity Resolution

While the previous PSL rules provide the building blocks of predicting links and labels using uncertain information, knowledge graph identification employs entity resolution to pool information across co-referent entities. A key component of this process is identifying possibly co-referent entities and determining the similarity of these entities, which we discuss in detail in Section 4.4. We use the SAMEAS predicate to capture the similarity of two entities, for example SAMEAS(kyrgyzstan, kyrgz republic).

To perform entity resolution using the SAMEAS predicate we introduce three rules, whose groundings we refer to as S, to our PSL program:

SAMEAS(E1,E2) ∧ LBL(E1,L) ⇒ LBL(E2,L) (4.7)

SAMEAS(E1,E2) ∧ REL(E1,E,R) ⇒ REL(E2,E,R) (4.8)

SAMEAS(E1,E2) ∧ REL(E,E1,R) ⇒ REL(E,E2,R) (4.9)

These rules define an equivalence class of entities, such that all entities related by the SAMEAS predicate must have the same labels and relations. The soft-truth value of the SAMEAS atom, derived from our similarity function, mediates the strength of these rules. When two entities are very similar, they will have a high truth value for SAMEAS, so any label assigned to the first entity will also be assigned to the second entity. On the other hand, if the similarity score for two entities is low, the truth values of their respective labels and relations will not be strongly constrained. We introduce these rules as weighted rules in the PSL model, where the weights can capture the reliability of the similarity function.
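
The mediating effect of the SAMEAS truth value can be seen in the following minimal sketch (hypothetical truth values), which evaluates the hinge distance of a grounding of rule 4.7 under high and low similarity:

    # Rule 4.7: SAMEAS(E1,E2) AND LBL(E1,L) => LBL(E2,L)
    def rule_distance(sameas, lbl_e1, lbl_e2):
        body = max(0.0, sameas + lbl_e1 - 1.0)     # Lukasiewicz conjunction of the body
        return max(0.0, body - lbl_e2)             # hinge distance of the implication

    # High similarity: failing to propagate the label is strongly penalized
    print(rule_distance(sameas=0.95, lbl_e1=0.9, lbl_e2=0.1))   # 0.75
    # Low similarity: the same label disagreement incurs almost no penalty
    print(rule_distance(sameas=0.30, lbl_e1=0.9, lbl_e2=0.1))   # 0.1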

4.2.3 Enforcing Ontological Constraints

In our PSL program we also leverage rules corresponding to an ontology, the groundings of which are denoted as O. These ontological rules are based on the logical formulation proposed in Jiang et al. (2012). Each type of ontological relation is represented as a predicate, and these predicates represent ontological knowledge of the relationships between labels and relations. For example, the ontological predicates DOM(hasCapital, country) and RNG(hasCapital, city) specify that the relation hasCapital is a mapping from entities with label country to entities with label city. The predicate MUT(country, city) specifies that the labels country and city are mutually exclusive, so that an entity cannot have both the labels country and city. We similarly use predicates for subsumption of labels (SUB) and relations (RSUB), and inversely-related functions (INV). To use this ontological knowledge, we introduce rules relating each ontological predicate to the predicates representing our knowledge graph. We specify seven types of ontological constraints in our experiments using weighted rules:

DOM(R,L) ∧ REL(E1,E2,R) ⇒ LBL(E1,L) (4.10)

RNG(R,L) ∧ REL(E1,E2,R) ⇒ LBL(E2,L) (4.11)

INV(R,S) ∧ REL(E1,E2,R) ⇒ REL(E2,E1,S) (4.12)

SUB(L, P ) ∧ LBL(E,L) ⇒ LBL(E,P ) (4.13)

RSUB(R,S) ∧ REL(E1,E2,R) ⇒ REL(E1,E2,S) (4.14)

MUT(L1,L2) ∧ LBL(E,L1) ⇒ ¬LBL(E,L2) (4.15)

RMUT(R,S) ∧ REL(E1,E2,R) ⇒ ¬REL(E1,E2,S) (4.16)

4.3 Implementing Knowledge Graph Identification with PSL

Combining the logical rules introduced in this chapter with atoms, such as candidates from the information extraction system (e.g. CANDREL(kyrgyzstan, Bishkek, hasCapital)), co-reference information from an entity resolution system (e.g. SAMEAS(kyrgyzstan, kyrgz republic)), and ontological information (e.g. DOM(hasCapital, country)), we can define a PSL program, Π. The inputs to this program instantiate a set of ground rules, R, that consists of the union of groundings from uncertain candidates, C, co-referent entities, S, and ontological relationships, O.

Figure 4.1: Subset of Music Ontology mapped using LinkedBrainz for MusicBrainz data in our synthetic dataset

The distribution over interpretations, I, generated by PSL corresponds to a probability distribution over knowledge graphs, G:

P_\Pi(G) = f(I) = \frac{1}{Z} \exp\left[-\sum_{r \in R} w_r \, \phi_r(I)^p\right] \qquad (4.17)

The results of inference provide us with the most likely interpretation, or soft-truth assignments to entities, labels and relations that comprise the knowledge graph. By choosing a threshold on the soft-truth values in the interpretation, we can select a high-precision set of facts to construct a knowledge graph. Appendix A provides a sample PSL program that demonstrates how all of these components are implemented in code.
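
As a simple illustration of this final selection step, the following sketch (hypothetical atoms and threshold) filters an inferred interpretation by a soft-truth threshold to obtain a high-precision fact set:

    # Hypothetical inferred interpretation: ground atoms mapped to soft-truth values
    interpretation = {
        ("LBL", "kyrgyzstan", "country"): 0.92,
        ("LBL", "kyrgyzstan", "bird"):    0.08,
        ("REL", "kyrgyzstan", "bishkek", "hasCapital"): 0.88,
    }

    THRESHOLD = 0.5   # raising this value trades recall for precision
    knowledge_graph = {atom for atom, value in interpretation.items() if value >= THRESHOLD}
    print(knowledge_graph)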

4.4 Experimental Evaluation

4.4.1 Datasets and Experimental Setup

We evaluate our method on two different datasets: a synthetic knowledge base derived from the LinkedBrainz project (Dixon and Jacobson), which maps data from the MusicBrainz community using ontological information from the Music Ontology (Raimond et al., 2007), as well as web-extraction data from the Never-Ending Language Learning (NELL) project (Carlson et al., 2010a). Our goal is to assess the utility of knowledge graph identification, formulated as a PSL model, at inferring a knowledge graph from noisy data. Additionally, we contrast two very different evaluation settings. In the first, as used in previous work (Jiang et al., 2012), inference is limited to a subset of the knowledge graph generated from the test or query set. In the second evaluation setting, inference produces a complete knowledge graph, which is not restricted by the test set but employs a soft-truth threshold for atoms. We provide documentation, code and datasets to replicate our results on GitHub.1

1 https://github.com/linqs/KnowledgeGraphIdentification

MusicBrainz

MusicBrainz is a community-driven, open-source, structured database for music metadata, including information about artists, releases, and tracks. The Music Ontology is built on top of many well known ontologies, such as FRBR (Davis et al., 2005) and FOAF (Brickley and Miller, 2010), and has been used widely, for instance in BBC Music Linked Data sites (Kobilarov et al., 2009). However, the relational data available from MusicBrainz are expressed in a proprietary schema that does not map directly to the Music Ontology. To bridge this gap, the LinkedBrainz project publishes an RDF mapping between the freely available MusicBrainz data and the Music Ontology using D2RQ (Bizer and Seaborne, 2004). A summary of the labels and relations we use in our data is shown in Figure 4.1. We use an intuitive mapping of ontological relationships to the PSL predicates, using ontological information from FRBR and FOAF classes used by the Music Ontology. Specifically, we convert rdfs:domain to DOM, rdfs:range to RNG, rdfs:subClassOf to SUB, rdfs:subPropertyOf to RSUB, owl:inverseOf to INV, and owl:disjointWith to MUT.

Our synthetic knowledge graph uses a sample of data from the LinkedBrainz mapping of the MusicBrainz project2 and adds noise to generate a realistic data set. To generate a subset of the LinkedBrainz data, we use snowball sampling from a set of tracks in the MusicBrainz dataset to produce a set of recordings, releases, artists and labels.

2 http://linkedbrainz.c4dmpresents.org/content/rdf-dump

Next, we introduce noise into this graph by randomly removing known facts and adding inconsistent facts, as well as generating random confidence values for these facts. This noise can be interpreted as errors introduced by a MusicBrainz user misspelling artist names, accidentally switching input fields, or omitting information when contributing to the knowledge base.

We model these errors by distorting a percentage of the true input data. For labels, we omit known labels and introduce spurious labels for 25% of the facts in the input data. When dealing with relations, we focus on the foaf:maker and foaf:made relations between artists and creative works. We randomly remove one of this pair of relations 25% of the time. Finally, 25% of the time we remove the relationship between a work and its artist, and insert a new relationship between the work and a generated artist, adding a SAMEAS for these two artists. The confidence values for facts found in the input are generated from a Normal(.7,.2) distribution, while inconsistent facts have lower confidence values generated from a Normal(.3,.2) distribution. The high variance in these distributions ensures a significant overlap. For the SAMEAS atoms, the similarity values are generated from a Normal(.9,.1) distribution. In all cases, the distribution is thresholded to the [0, 1] range. We summarize important data statistics in Table 4.1.
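
The following minimal sketch (using the parameters described above; the function name is hypothetical) shows how such confidence values could be drawn and thresholded to the [0, 1] range:

    import random

    def sample_confidence(mean, std):
        # Draw from a Normal distribution and threshold to the [0, 1] soft-truth range
        return min(1.0, max(0.0, random.gauss(mean, std)))

    true_fact_conf    = sample_confidence(0.7, 0.2)   # confidence for a fact kept from the input
    noisy_fact_conf   = sample_confidence(0.3, 0.2)   # confidence for an injected inconsistent fact
    sameas_similarity = sample_confidence(0.9, 0.1)   # similarity for a generated SAMEAS pair
    print(true_fact_conf, noisy_fact_conf, sameas_similarity)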

NELL

The goal of NELL is to iteratively generate a knowledge base. In each iteration, NELL uses facts learned from the previous iteration and a corpus of web pages to generate a new set of candidate facts. NELL selectively promotes those candidates that have a high confidence from the extractors and obey ontological constraints with the existing knowledge base to build a high-precision knowledge base. We present experimental results on the 165th iteration of NELL, using the candidate facts, promoted facts and ontological relationships that NELL used during that iteration. We summarize the important statistics of this dataset in Table 4.1. Due to the diversity of the web, the data from NELL is larger, includes more types of relations and categories, and has more ontological relationships than our synthetic data.

NELL uses diverse extraction sources, and in our experiments we use distinct predicates CANDLBLT and CANDRELT for the NELL sources CBL (Constraint Based Learner), CMC (Coupled Morphological Classifier), CPL (Coupled Pattern Learner), Morph (Morphological Patterns), and SEAL (Structured Extraction), while the remaining sources, which do not contribute a significant number of facts, are represented with CANDLBL and CANDREL predicates. In addition to candidate facts, NELL uses a heuristic formula to “promote” candidates in each iteration of the system into a knowledge base; however, these promotions are often noisy, so the system assigns each promotion a confidence value. We represent these promoted candidates from previous iterations as an additional source with corresponding candidate predicates.

In addition to data from NELL, we use data from the YAGO database (Suchanek et al., 2007) as part of our entity resolution approach. Our model uses a SAMEAS predicate to capture the similarity of two entities. To correct for the multitude of variant spellings found in the data, we use a mapping technique from NELL’s entities to Wikipedia articles, described in more detail below. We then define a similarity function on the article identifiers, which in this case are URLs, using the similarity as the soft-truth value of the SAMEAS predicate.

The YAGO database contains entities which correspond to Wikipedia articles, variant spellings and abbreviations of these entities, and associated WordNet categories. Our approach to entity resolution matches entity names in NELL with YAGO entities. We perform selective stemming on the NELL entities, employ blocking on candidate labels, and use a case-insensitive string match to find corresponding YAGO entities. Once we find a matching set of YAGO entities, we can generate a set of Wikipedia URLs that map to the corresponding NELL entities. We can judge the similarity of two entities by computing a set-similarity measure on the Wikipedia URLs associated with the entities. For our similarity score we use the Jaccard index, the ratio of the size of the set intersection to the size of the set union.
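
A minimal sketch of this similarity computation, assuming each entity has already been mapped to a set of Wikipedia URLs (the URLs shown are placeholders):

    def jaccard(urls_a, urls_b):
        # Jaccard index: |intersection| / |union| of the two URL sets
        if not urls_a and not urls_b:
            return 0.0
        return len(urls_a & urls_b) / len(urls_a | urls_b)

    # Hypothetical URL sets obtained by matching NELL entities to YAGO/Wikipedia
    kyrgyzstan      = {"https://en.wikipedia.org/wiki/Kyrgyzstan"}
    kyrgyz_republic = {"https://en.wikipedia.org/wiki/Kyrgyzstan",
                       "https://en.wikipedia.org/wiki/Kyrgyz_Republic"}

    # This score becomes the soft-truth value of the corresponding SAMEAS atom
    print(jaccard(kyrgyzstan, kyrgyz_republic))   # 0.5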

Table 4.1: Summary of dataset statistics for NELL and MusicBrainz, including (a) the number of candidate facts in input data, the distinct relations and labels present, and (b) the number of ontological relationships defined between these relations and labels

(a)              NELL    MusicBrainz
Cand. Label      1.2M    320K
Cand. Rel        100K    490K
Promotions       440K    0
Unique Labels    235     19
Unique Rels      221     8

(b)      NELL     MusicBrainz
DOM      418      8
RNG      418      8
INV      418      2
MUT      17.4K    8
RMUT     48.5K    0
SUB      288      21
RSUB     461      2

4.4.2 Learning Model Weights from Training Data

Our PSL model for knowledge graph identification defines a number of rules with weights that capture the importance of each rule in the probability distribution over knowledge graphs. While these weights can be manually specified, a more powerful approach is to learn the value of these weights from training data. The general methods for weight learning in PSL were introduced in Bach et al. (2013); however, the knowledge graph setting introduces some new considerations.

A complication of learning from training data is that, given the vast number of potential facts in a knowledge graph, obtaining labeled data with sufficient coverage over the facts in the knowledge graph can be difficult. Since most knowledge graph construction tasks are focused on a single, large dataset, weight learning takes place in a transductive setting and it can be difficult to separate a training network from the knowledge graph.

Yet another problem is sparsity in knowledge graphs, which creates skew in the learning problem due to the large number of negative labels for false relationships and labels, which overwhelm the number of true labels and relationships.

We deal with the challenges of weight learning by carefully defining a training setup that efficiently uses training data by adopting a semi-supervised approach to learning. We define a training network by creating a ground graphical model using the facts in the training set. In addition to these facts, we approximate the contextual knowledge of the training data with the 2-hop neighborhood of the training variables, which we refer to as the training network.
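
As an illustration of how such a training network could be assembled, the following sketch (with a hypothetical adjacency structure over ground atoms) collects the 2-hop neighborhood of a set of training variables:

    from collections import deque

    def k_hop_neighborhood(adjacency, seeds, k=2):
        # Breadth-first expansion from the training variables out to k hops
        frontier = deque((node, 0) for node in seeds)
        visited = set(seeds)
        while frontier:
            node, depth = frontier.popleft()
            if depth == k:
                continue
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return visited

    # Hypothetical adjacency between ground atoms that co-occur in some ground rule
    adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
    print(k_hop_neighborhood(adjacency, seeds={"A"}, k=2))   # {'A', 'B', 'C'}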

Since the collective KGI model will include facts that are not part of the training network, we modify the model to scope the rules so that only those rules pertaining to the facts in the training network are grounded, and additional variables are not inferred.

Next, we extend the training data to achieve greater coverage by performing inference for the training network using initial model weights while conditioning on the training set.

This inference assigns values to the unknown variables in the 2-hop neighborhood of the training data. This inferred knowledge graph constitutes the labels for weight learning.

We learn weights using the maximum likelihood estimation approach implemented in PSL as reported in Bach et al. (2013). The MLE weight learning attempts to optimize weights to recover the full training network (both the training set and the Markov blanket) using the candidate facts and a small set of seed observations. While this approach could be extended by using an iterative expectation-maximization-style algorithm that repeatedly infers the training network and then optimizes the model weights, we found the straightforward approach reduces training time while maintaining the weight quality.

4.4.3 Open-World vs Closed-World Evaluation Setting

In our experiments using NELL, we consider two scenarios. The first setting uses a network similar to the training network for weight learning discussed in the previous subsection. This setting is based on previous work by Jiang et al. (2012), and was defined to improve scalability by generating a grounding of the graphical model based on the test set, determining a 2-hop neighborhood of the test set, and then only including atoms that are not trivially satisfied in this grounding. In practice, this produces a neighborhood that is distorted by omitting atoms that may contradict those in the test set.

For example, if ontological relationships such as RNG(hasCapital, country) and MUT(country, city) are present, the test set atom LBL(kyrgyzstan, country) would not introduce LBL(kyrgyzstan, city) or REL(kyrgyzstan, Bishkek, hasCapital) into the neighborhood, even if supportive or contradictory data were present in the input candidates. By removing the ability to reason about supportive and contradictory information, we believe this evaluation setting diminishes the true difficulty of the problem. We validate our approach on this setting, but also present results from a more realistic setting.

In the second scenario we perform inference independently of the test set, lazily generating truth values for atoms supported by evidence. This lazy inference approach performs repeated inferences. Each iteration uses the available evidence and previous inferences and grounds out the model, treating any inference below a soft-truth threshold of .01 as false (PSL’s default behavior). In the context of the previous example, initially the test fact LBL(kyrgyzstan, country) would only be supported by candidate extractions for that particular fact. However, in subsequent iterations the inferred value of LBL(kyrgyzstan, country) would be influenced by other candidates not found in the test set. For example, a contradictory candidate CANDLBL(kyrgyzstan, city) could diminish this truth value, while a candidate REL(kyrgyzstan, Bishkek, hasCapital) could increase the truth value. This iterative process of inference and grounding is repeated until no new groundings are added to the model. This second setting allows us to infer a complete knowledge graph similar to the MusicBrainz setting.
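
The following sketch outlines the lazy inference loop described above; the functions passed in for grounding and inference are hypothetical placeholders standing in for the underlying engine, not PSL API calls:

    TRUTH_THRESHOLD = 0.01   # atoms inferred below this value are treated as false

    def lazy_inference(evidence, ground_model, infer_truth_values):
        # ground_model(evidence, active_atoms) and infer_truth_values(ground_rules)
        # are assumed to be provided by the inference engine (placeholders here).
        active_atoms = {}                      # previously inferred atoms kept above threshold
        while True:
            ground_rules = ground_model(evidence, active_atoms)
            inferred = infer_truth_values(ground_rules)
            new_atoms = {a: v for a, v in inferred.items() if v >= TRUTH_THRESHOLD}
            if set(new_atoms) == set(active_atoms):
                return new_atoms               # no new groundings were added; stop
            active_atoms = new_atoms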

4.4.4 Results for Closed-World Settings

MusicBrainz

Our experiments on MusicBrainz data attempt to recover the knowledge graph despite the addition of noise which introduces uncertainty for facts, removes true information and adds spurious labels and relations. The synthetic modifications allow us to retain information about the true, complete knowledge graph and thus enumerate the full set of target facts. We evaluate the results by comparing to the true knowledge graph used to generate the data, and include false labels corresponding to spurious data introduced.

Table 4.2: A comparison of knowledge graph identification methods on MusicOntology data shows knowledge graph identification effectively combines the strengths of graph identification and reasoning with ontological information and produces superior results.

Method              AUC     Prec    Recall  F1 (0.5)  F1 (0.0)
Baseline            0.672   0.946   0.477   0.634     0.788
PSL-EROnly          0.797   0.953   0.558   0.703     0.831
PSL-OntOnly         0.753   0.964   0.605   0.743     0.832
PSL-KGI-Complete    0.901   0.970   0.714   0.823     0.919

In our experiments, we represent the noisy relations and labels of the knowledge graph as candidate facts in PSL with the predicates CANDLBL and CANDREL. Our experiments use quadratic potentials and static weights for all rules, where w_CL = w_CR = 1, w_EL = w_ER = 25, and w_O = 100. We evaluate a number of variants of the KGI model on their ability to recover this knowledge graph. We measure performance using a number of metrics: the area under the precision-recall curve (AUC), as well as the precision, recall and F1 score at a soft-truth threshold of .5 (the intuitive boundary between true and false in soft logic). Due to the high variance of confidence values and the large number of true facts in the ground truth, the maximum F1 value occurs at a soft-truth threshold of 0, where recall is maximized, for all variants, and we report these results for completeness. These results are summarized in Table 4.2.
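
For reference, the evaluation metrics described above could be computed as in the following sketch, assuming arrays of ground-truth labels and inferred soft-truth values (the values shown are placeholders, not our experimental data):

    import numpy as np
    from sklearn.metrics import (precision_recall_curve, auc,
                                 precision_score, recall_score, f1_score)

    y_true  = np.array([1, 0, 1, 1, 0, 1])              # placeholder ground-truth facts
    y_score = np.array([0.9, 0.4, 0.7, 0.2, 0.1, 0.6])  # placeholder inferred soft-truth values

    # Area under the precision-recall curve
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    pr_auc = auc(recall, precision)

    # Precision, recall, and F1 at a soft-truth threshold of 0.5
    y_pred = (y_score >= 0.5).astype(int)
    print(pr_auc, precision_score(y_true, y_pred),
          recall_score(y_true, y_pred), f1_score(y_true, y_pred))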

The first variant we consider uses only the input data, setting the soft-truth value equal to the generated confidence value as an indicator of the underlying noise in the data. The baseline results use only the candidate rules we introduced in subsection 4.2.1.

We improve upon this data by adding either the entity resolution rules introduced in subsection 4.2.2, which we report as PSL-EROnly, or the weighted rules capturing ontological constraints introduced in subsection 4.2.3, which we report as PSL-OntOnly. Finally, we combine all the elements of knowledge graph identification introduced in Section 4.2 and report these results as PSL-KGI-Complete.

49 Table 4.3: Comparing against previous work on the NELL dataset, knowledge graph identification using PSL demonstrates a substantive improvement.

Method      AUC     Prec    Recall  F1
Baseline    0.873   0.781   0.881   0.828
NELL        0.765   0.801   0.580   0.673
MLN         0.899   0.837   0.837   0.836
PSL-KGI     0.904   0.767   0.956   0.851

The results on the baseline demonstrate the magnitude of noise in the input data; less than half the facts in the knowledge graph can be correctly inferred. Reasoning jointly about co-referent entities, as in graph identification, improves results. Using ontological constraints, as previous work on improving extraction in this domain has done, also improves results. Comparing these two improvements, adding entity resolution has a higher AUC, while ontological constraints show a greater improvement in F1 score. However, when these two approaches are combined, as they are in knowledge graph identification, results improve dramatically. Knowledge graph identification increases AUC, precision, recall and F1 substantially over the other variants, improving AUC and F1 by over 10% compared to the more competitive baseline methods. Overall, we are able to infer 71.4% of true relations while maintaining a precision of .97. Moreover, a high AUC of .901 suggests that knowledge graph identification balances precision and recall for a wide range of parameter values.

NELL

While results on data with synthetic noise confirm our hypothesis, we are particularly interested in the results on a large, noisy real-world dataset. We apply knowledge graph identification to data from iteration 165 of NELL, a dataset that has been previously studied and for which a large, manually-labeled evaluation set has been collected (Jiang et al., 2012). We compare KGI to previous work, and a summary of results is shown in Table 4.3. Additional results on this dataset can be found in Appendix B.

The first method we compare to is a baseline similar to the one used in the MusicBrainz results, where candidates are given a soft-truth value equal to the extractor confidence (averaged across extractors when appropriate). Since this model is untrained, we report results at a soft-truth threshold of .45, which maximizes F1 and provides the most competitive baseline performance.

We also compare against the default strategy used by the NELL project to choose candidate facts to include in the knowledge base. Their method uses the ontology to check the consistency of each proposed candidate with previously promoted facts already in the knowledge base. Candidates that do not contradict previous knowledge are ranked using a heuristic rule based on the confidence scores of the extractors that proposed the fact, and the top candidates are chosen for promotion subject to score and rank thresholds. Note that the NELL method includes judgments for all input facts, not just those in the test set.

The third method we compare against is the best-performing MLN model from Jiang et al. (2012), which expresses ontological constraints, and candidate and promoted facts, through logical rules similar to those in our model. The MLN uses additional predicates that have confidence values taken from a logistic regression classifier trained using manually labeled data. The MLN uses hard ontological constraints, learns rule weights considering rules independently and using logistic regression, scales weights by the extractor confidences, and uses MC-Sat with a restricted set of atoms to perform approximate inference, reporting output at a 0.5 marginal probability cutoff, which maximizes the F1 score. The MLN method only generates predictions for a 2-hop neighborhood generated by conditioning on the values of the query set, as described earlier.

Our method, PSL-KGI, uses PSL with quadratic, weighted rules for ontological constraints, entity resolution, and candidate and promoted facts, as well as incorporating a prior. We also incorporate the predicates generated for the MLN method for a more equal comparison. We learn weights for all rules, including the prior, using a voted perceptron learning method. The weight learning method generates a set of target values by running inference and conditioning on the training data, and then chooses weights that maximize the agreement with these targets in the absence of training data. Since we represent extractor confidence values as soft-truth values, we do not scale the weights of these rules. Using the learned weights, we perform inference on the same neighborhood defined by the query set that is used by the MLN method. We report these results, using the standard soft-truth threshold of 0.5, as PSL-KGI. As Table 4.3 shows, knowledge graph identification produces modest improvements in both F1 and AUC.

4.4.5 Results for Model Ablation Study

To better understand the contributions of various components of our model, we explore variants that omit one aspect of the knowledge graph identification model. In each case, we learn new weights for the remaining rules, perform inference in the closed-world setting, and then measure performance on the test set. The results of the ablation study are shown in Table 4.4, while the precision-recall tradeoff for each method is shown in Fig. 4.2, and detailed results for each ablated method can be found in Appendix B.

Table 4.4: Comparing variants of PSL graph identification shows the importance of ontological information, but the best performance is achieved when all of the components of knowledge graph identification are combined.

Method        AUC     Prec    Recall  F1
Baseline      0.873   0.781   0.881   0.828
PSL-NoSrcs    0.900   0.770   0.955   0.852
PSL-NoER      0.899   0.778   0.944   0.853
PSL-NoOnto    0.887   0.813   0.839   0.826
PSL-KGI       0.904   0.777   0.944   0.853

The Baseline method uses the same methodology as in the previous section, and does not include any collective knowledge graph identification rules. PSL-NoSrcs removes the predicates CANDLBLT and CANDRELT for different candidate sources, replacing them with a single CANDLBL or CANDREL with the average confidence value across sources.

PSL-NoER removes rules from subsection 4.2.2 used to reason about co-referent entities.

PSL-NoOnto removes rules from subsection 4.2.3 that use ontological relationships to constrain the knowledge graph. We compare these methods to the full knowledge graph identification model, PSL-KGI.

The Baseline method had the worst performance, since it relies exclusively on extractor confidences to construct the knowledge graph. Examining the results, we find that the facts where the Baseline method has the greatest disagreement with our KGI model are those requiring ontological knowledge. For example, the Baseline method infers the teamWonTrophy relation with the ”trophy” playoffs for many teams (e.g. cubs, packers, colts). While it may be true these teams won the playoffs, playoffs do not have the label trophy, and so these facts are not germane for the relation.

Removing source information from the model had a minor impact on model performance, resulting in lower precision, higher recall, and lower F1 and AUC. Examining the inferred knowledge bases, the most salient difference is that the full KGI model infers much higher truth values for a number of labels. For example, the full KGI model infers that twilight, sideways and doctor zhivago have labels movie and creativework, while the NoSrcs model has much lower truth values for these facts. This may be the result of information loss when a specialized, high-precision extractor for a particular domain is merged with lower precision extractors.

When the rules for entity resolution are removed from the model, the results mirror those of the full KGI model with a minor increase in precision and a minor decrease in AUC. Sampling prominent differences from the results, we find that the KGI model that includes entity resolution is able to extend inferences to aliases and variant names for the same entity. For example, the full KGI model determines that ampalaya, more frequently referred to as ”bitter melon”, is a food and vegetable. Another example is that the full KGI model infers that the entity bell centre, a sports complex in Montreal more commonly referred to, following the French convention, as ”Centre Bell”, is a building and attraction. While these inferences show the importance of entity resolution, the small difference in results suggests that the low coverage of entity resolution information in the dataset prevents a more significant impact on the results.

Finally, when rules for ontological relationships are removed from the model, both the AUC and recall drop substantially, but precision increases. Analyzing the results of the two methods, we notice that the NoOnto method infers facts that violate ontological constraints (as expected). An example of such an inference is that the entity community college plays the sport baseball. However, the overgeneralization of community colleges prevents this fact from being a useful addition to the knowledge base. In contrast, the full KGI model is able to learn from existing relations and infer labels for the entities. For example, the KGI model infers that comiskey park and cardiff international arena are both locations, presumably using domain and range relationships from other facts.

Figure 4.2: This figure shows the precision-recall curve for the different knowledge graph construction models. The baseline model, which does not use any collective reasoning, severely underperforms all other approaches. Models that omit the confidence information of uncertain sources (orange squares), entity resolution (yellow exes), or ontological information (purple diamonds) do not perform as well as knowledge graph identification (green hexagons). Complete knowledge graph identification (blue circles) in the open-world setting infers a complete knowledge graph and shows superior performance, except at high precision cutoffs.

Table 4.5: Producing a complete knowledge graph reduces performance on the test set, suggesting that the true complexity of the problem is masked when generating a limited set of inferences.

Method              AUC     Prec    Recall  F1
NELL                0.765   0.801   0.580   0.673
PSL-KGI-Complete    0.891   0.825   0.872   0.848
PSL-KGI             0.904   0.767   0.956   0.851

4.4.6 Results for Open-World Settings

One drawback of our comparisons to previous work is the restriction of the model to a small set of inference targets. The construction of this set obscures some of the challenges presented in real-world data, such as conflicting evidence, as well as strengths from collective modeling, such as supportive inferences resulting from parsimony in a large network of facts. For a fuller exploration of the knowledge graph construction problem, we assess the performance of our method in a setting without predefined inference targets. In this setting, the inference problem does not restrict potentially contradictory inferences or supportive inferences. We ran knowledge graph identification using the same learned weights as the closed-world setting, and allowed lazy inference to produce a complete knowledge graph, as detailed in subsection 4.4.3. Results for this setting are shown in Table 4.5, with more detailed results in Appendix B.

The resulting inference produces a total of 4.9M facts, which subsumes the test set. We report results on the test set as PSL-KGI-Complete. Allowing the model to optimize on the full knowledge graph instead of just the test set reduced the performance as measured by the particular test set relative to the KGI model in the closed-world setting, suggesting that the noise introduced by conflicting evidence does have an impact on results. However, when comparing to the only other method that operates in the open-world setting, the NELL fact promotion strategy, our model improves all performance metrics and has substantially higher recall. One reason the NELL promotion strategy may perform poorly is that it attempts to restrict newly inferred facts to those consistent with its existing knowledge base, and does not allow revising existing beliefs. As a result, erroneous facts introduced early in the knowledge base construction process can hamper future efforts to expand the knowledge base. In contrast, our approach puts new extractions in the same context with facts learned earlier, and if a preponderance of evidence supports the new extractions over the old facts, the inferred truth value can remove such erroneous facts.

We also compare two knowledge bases learned by the open-world version of KGI, KGI-Complete, and the closed-world version, KGI. A number of facts that have high truth values in the KGI-Complete model are missing from the KGI model because they are out of the scope of the testing set. Prominent examples include REL(lemur, jamie callan, mlsoftwareauthor) and REL(bruins, banknorth garden, teamhomestadium). In contrast, a handful of facts that have high scores in the KGI model are not inferred or have low truth values in the KGI-Complete model. Some of these facts seem to be useful additions to the knowledge graph, for example REL(rfk memorial stadium, washington, stadiumincity) and REL(lusaka, zambia, cityincountry). However, a number of erroneous facts also appear in this list, such as REL(boston celtics, los angeles, teamplaysincity) and REL(pittsburgh steelers, baltimore, teamplaysincity), as well as overgeneralizations such as REL(georgia tech, basketball, teamplayssport). In summary, the differences between the two approaches are a mixed bag; KGI-Complete infers facts beyond the scope of the narrowly defined test set, and achieves the greater coverage desirable in knowledge graph construction, while also weeding out some clearly erroneous facts, but along the way it also misses a number of valid facts.

Fig. 4.2 highlights one of the shortcomings of the complete knowledge graph inference setting: the model has difficulty producing facts when a high-precision cutoff is necessary. At the highest truth value threshold, KGI-Complete achieves a precision of 0.93 (and recall of 0.37), while the KGI model from the closed-world setting has a precision of 0.99 (and a recall of 0.28). One reason for the lower precision may be that the KGI model is trained specifically to maximize the performance on a small network of facts, while no similarly straightforward training objective can be constructed for the open-world setting. A second issue is that many facts restricted from closed-world inference can appear with small truth values in open-world inference, lowering the truth values of inferred facts and preventing the same stratification of extremely high truth values that occurs in the closed-world setting.

4.5 Discussion

In this chapter, we show how to address the requirements of the knowledge graph identification problem setting. Our contributions are (1) implementing a straightforward and interpretable model that performs the essential tasks in graph identification (entity resolution, collective classification, and link prediction); (2) incorporating the important properties of the input data used to construct a knowledge graph: uncertainty and statistical features from the information extraction system and logical constraints between facts found in the ontology; (3) demonstrating how the model we have developed can be transformed into a probability distribution over possible knowledge graphs; (4) tackling the practical challenges in inferring knowledge graphs, such as weight learning and open-world problem settings; and (5) performing an extensive evaluation of our knowledge graph identification model on multiple datasets that demonstrates the value of the various components of the model, compares to existing state-of-the-art methods, and validates the hypothesis of knowledge graph identification. We also identify a number of promising extensions to this work, such as modeling the richer set of ontological constraints mentioned in Section 3.1, different and possibly more sophisticated approaches to weight learning, and deeper models for entity resolution, which we tackle in the next chapter.

Chapter 5: Entity Resolution for Knowledge Graphs

In Chapter 3 we provided examples of entity ambiguity in knowledge extraction systems, which motivated the problem of entity resolution. Next, in Chapter 4 we demonstrated how ontological knowledge of co-referent entities or measures of entity similarity can be exploited to improve the results of knowledge graph construction. The problem of entity resolution has been a longstanding challenge that has led to significant research in many communities, including databases, information retrieval, and natural language processing. However, entity resolution in knowledge graphs presents additional opportunities, complexities and challenges. We analyze two key facets of entity resolution problems arising from the structure of knowledge graphs: using knowledge graph features and supporting collective dependencies in co-reference judgments. In this section, we discuss the problems confronting entity resolution in knowledge graphs and develop a general model adaptable to many entity resolution tasks and scenarios.

5.1 Problem Definition

The general problem of entity resolution is to take a set of references, such as diverse string representations of names, and produce a mapping from these references to entities, each of which represents a single concept. Entity resolution can be formulated as a clustering problem, where each cluster of references represents an entity, or as a pairwise matching problem, where two references are assessed for equality and a connected component of equal references represents an entity. In both formulations, a key problem is measuring the similarity of references, either to determine cluster coherence or to produce pairwise co-reference predictions.

The earliest entity resolution research focused on developing specialized similarity measures for strings and attributes. More recent work in entity resolution has focused on using relationships between references to generate relational features. For example, Bhattacharya and Getoor (2007) introduce relational features and similarities and, using a collective relational clustering approach, demonstrate superior results to non-collective approaches. One key requirement for knowledge graph entity resolution is the ability to translate knowledge graph features, such as attributes, types, and the many different relationships between entities, into features that can be used to determine the similarity of references.

A second key requirement for entity resolution in knowledge graphs is correctly handling collective dependencies in entity resolution decisions. Entity resolution problems are inherently collective due to transitivity or functionality constraints of equality. More formally, when resolving a set of references, a transitivity constraint requires that if A and B are co-referent, and B and C are co-referent, then A and C must also be co-referent. A functionality constraint can exist in a setting where a bijective mapping between references in two sets, S and T, is desired: if A ∈ S and B ∈ T are co-referent, then, for all other C ∈ T, A and C cannot be co-referent. While transitivity and functionality are standard examples of collective entity resolution challenges, the knowledge graph setting often includes more sophisticated examples of collective reasoning. For example, if we have two knowledge graphs that include references with relations pertaining to genealogical information, we might have references such as REL(E1, O1, parent) and REL(E2, O2, parent); then determining that E1 and E2 are co-referent can provide useful information that O1 and O2 are potentially co-referent as well.

This discussion hints at the diversity of entity resolution problems in knowledge graphs. Different phases of knowledge graph construction may face unique entity resolution challenges. We enumerate three general cases where entity resolution is necessary in knowledge graphs. Entity resolution may be required to:

1. resolve ambiguity in a set of candidate extractions

2. incorporate new extractions into an existing knowledge graph

3. combine information from two or more knowledge graphs

We discuss each of these scenarios in detail in the following paragraphs.

5.1.1 Ambiguity In Candidate Extractions

In previous chapters, the emphasis has been on creating a knowledge graph using a set of candidate facts generated by information extraction techniques. These information extraction techniques are subject to many sources of ambiguity. Each technique may process the same information differently, yielding many references from the same source material. Furthermore, the extraction source material may be inherently ambiguous, using different references for the same entity within a document, such as partial names or titles. Another common problem is anaphora, such as when a pronoun is used with an ambiguous referent. Finally, the extractions are drawn from a corpus of documents, and each document may have variations in the representation of references, such as alternate spellings, prefixes, suffixes, and abbreviations. As we have seen, in addition to the noise in entity references, noise also exists in attributes and relations ascribed to each reference.

In this scenario, the goal is to cluster a set of noisy references with noisy attributes and relations into a coherent set of entities.

5.1.2 Incorporating New Extractions Into a Knowledge Graph

A somewhat simpler problem is extending an existing knowledge graph using new extractions. In this setting, the goal is to map each reference to an existing entity in the knowledge graph, or introduce a new entity into the knowledge graph. One strategy for dealing with new entities that do not exist in the knowledge graph is skolemization, where each potential new entity is given a unique identifier. References can now be matched with existing entities or the new, skolemized entities in the knowledge graph, casting the problem into the well-studied task of surjective bipartite matching from references to entities.

Through this formulation, the added constraint that each reference must match a single entity can often simplify the entity resolution process. While the attributes and relationships of the extracted reference may be noisy, as motivated in the previous scenario, the attributes and relationships of entities in the knowledge graph are expected to be highly reliable. As a result, relational features and attribute similarity play a more significant role in determining whether a reference can be resolved to an existing entity in the knowledge graph or, due to conflicting information, the reference should be added as a new entity with different attributes and relations.
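A minimal sketch of the skolemization strategy appears below; the identifiers and helper names are hypothetical, and a real system would combine the placeholder with blocking against the existing knowledge graph (Section 5.2.2).

# Sketch (hypothetical names): skolemization gives every incoming reference a
# fresh placeholder entity, so matching can always fall back to "new entity".
import uuid

def skolemize(references):
    """Map each reference to a unique placeholder ('skolem') entity id."""
    return {ref: "new:" + uuid.uuid4().hex for ref in references}

def candidate_entities(ref, blocked_kg_entities, new_entities):
    """Candidates for ref: blocked existing entities plus its own skolem."""
    return list(blocked_kg_entities) + [new_entities[ref]]

new_entities = skolemize(["Konica Minolta", "Quito"])
print(candidate_entities("Konica Minolta", ["Konica", "Minolta Co."], new_entities))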

5.1.3 Combining Information From Multiple Knowledge Graphs

The final knowledge graph entity resolution scenario adheres most closely to the traditional approaches to entity resolution, where the goal is to combine information from two or more databases. In this setting, the goal is to find a mapping between entities in knowledge graphs, and then combine the attributes and relations of these entities. This problem can be formulated as mapping each knowledge graph to a “canonical” knowledge graph, or instead be cast as a pairwise matching task between each pair of knowledge graphs. The latter formulation can introduce additional complexity in the form of transitivity constraints for equality across all knowledge graphs. These constraints can add new features for entity resolution, but may also make the desired mapping more computationally demanding. A further complication in this setting is that the knowledge graphs may use different schemas and ontologies. This problem is not covered in this chapter, but the development of standard ontologies and the problems of ontology matching or schema mapping have been extensively researched.

While these three entity resolution settings each present unique challenges, our goal is to provide a unified model for entity resolution that can adapt to the diverse circumstances present in knowledge graph construction tasks. In the next section, we outline the structural elements of this model, and then introduce a probabilistic model for entity resolution that incorporates these elements into an entity resolution system.

5.2 Approach

The crucial aspect that distinguishes knowledge graphs from standard entity resolution problems is the rich and regular structure of the knowledge graph. Our goal is to leverage this structure to build an entity resolution model that is easy to understand and customize, while still capturing the rich information present in the knowledge graph. We consider two dimensions of the entity resolution model: feature granularity and collective inference. First, we organize the features in knowledge graphs based on the granularity of knowledge required. While the most basic features rely on string similarity or generic rules of functionality and transitivity, more complicated features involve new entities, attribute similarity, equivalence classes of relations, and domain-specific patterns. Each of these features can be classified as local (involving a single co-reference decision) or collective (imposing a dependency between two or more co-reference decisions). Table 5.1 summarizes the knowledge graph features used by the entity resolution methods, and the following subsections delve more deeply into each of these feature sets. For each type of feature, we provide examples of corresponding logical rules. These rules can be combined in a probabilistic modeling framework, such as PSL, to produce a collective probabilistic graphical model for entity resolution.

5.2.1 Local and Collective Knowledge Graph Features

As motivated earlier, there are two broad classes of features in knowledge graphs: local and collective. Local features are those that can be computed for a pair of entities (or references) independently of the co-reference decisions of other entities in the knowledge graph. Examples of local features include string similarity of names, image similarity of photographs, type agreement, and attribute agreement. One key characteristic of a local feature is that its value does not depend on the entity resolution decisions for other pairs of entities. This characteristic allows local features to be computed once for a pair of entities and reused. Consequently, relying on local features for entity resolution can decrease computational overhead and improve entity resolution performance.

Table 5.1: Knowledge graph features categorized based on collective dependencies and level of granularity

                   local                        collective
  basic            similarity scores            transitive, functional, sparsity
  new entity       new entity prior             new entity penalty (sparsity)
  abstract KG      type matching, type penalty  relation matching/equivalence
  domain-specific  restricted type matching     restricted relation matching

In contrast to local features, collective features involve dependencies between co-reference decisions, and due to these dependencies are more difficult to compute. The transitivity and functionality constraints in Section 5.1 are examples of common collective features that have been used in entity resolution. However, the structure of knowledge graphs allows many more collective features to be generated using relationships between entities. Knowledge graph features can be abstract, such as the overlap of object-arguments for a reference’s relations, or very concrete, such as the link between parents and children in the earlier example.

5.2.2 Knowledge Graph Models at Different Granularity

In this section, we develop components for a knowledge graph entity resolution model.

The components have been classified into four categories:

1. basic features that are common to all entity resolution scenarios

2. new entity features that are helpful when adding new entities into a knowledge graph

3. abstract KG features that are universal across many knowledge graph structures

4. domain-specific features that are designed to resolve a particular class of entities

In the subsequent sections, we will introduce logical rules for each type of feature, distinguishing between local and collective rules. The goal of these rules is to determine a pairwise resolution between two entities, denoted by SAME(E1, E2) for entities E1 and E2. Note that the SAME predicate is distinct from the SAMEAS predicate, which is used to capture ontological information, such as owl:sameAs.

Since knowledge graphs routinely contain millions of entities, assessing pairwise equality between all entities is infeasible. A common technique to avoid the polynomial explosion of entity matching is blocking, which uses a simple heuristic to produce potential entity matches. Using this smaller set of possible resolutions can substantially improve scalability. In the following rules, we will represent a blocked pair of entities with the predicate CANDSAME. Blocking can also be used to restrict matches based on the entity resolution scenario. For example, in Section 5.1.2, where the goal is to map references in a set of extractions to an existing knowledge graph, blocking can be used to scope entity resolution to only allow matches between extractions and the knowledge graph, disallowing matches within the extractions or within the knowledge graph.
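The sketch below illustrates one simple blocking heuristic, shared lowercase tokens, used to generate CANDSAME pairs; production systems typically use more refined blocking keys, and the data shown is purely illustrative.

# Sketch (hypothetical heuristic): blocking by shared lowercase token, so that
# CANDSAME is only instantiated for pairs that share at least one token.
from collections import defaultdict
from itertools import combinations

def block_by_token(references):
    """Return candidate pairs (CANDSAME groundings) from a token index."""
    index = defaultdict(set)
    for ref in references:
        for token in ref.lower().split():
            index[token].add(ref)
    cand_same = set()
    for refs_with_token in index.values():
        for a, b in combinations(sorted(refs_with_token), 2):
            cand_same.add((a, b))
    return cand_same

refs = ["San Jose Giants", "Giants", "Quito", "Quito and Cuenca"]
print(block_by_token(refs))
# e.g. {('Giants', 'San Jose Giants'), ('Quito', 'Quito and Cuenca')}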

5.3 Modeling Knowledge Graph Entity Resolution

5.3.1 Basic Features

Rules for Local Features

Basic features are those common to all entity resolution scenarios, such as similarity functions and prior probabilities. We introduce three rules corresponding to basic local features. Rule 5.1 and Rule 5.2 are priors for SAME. Often, a negative prior (5.1) is useful to implement a default policy that entities are not co-referent unless supported by evidence. A positive prior can also be useful in some models to establish a baseline match confidence for two entities that have been blocked.

The final basic local rule uses a similarity function, SIM, to assess whether two entities are co-referent. In general, the similarity function can depend on the representation of the entities (e.g., images, sound files, or textual representations). A great deal of research in entity resolution has been devoted to designing effective similarity functions. Examples of popular similarity functions are Levenshtein (Navarro, 2001; Wagner and Fischer, 1974), Jaro (Jaro, 1995), Jaro-Winkler (Winkler, 1999), Monge-Elkan (Elkan and Monge, 1996), Fellegi-Sunter (Fellegi and Sunter, 1969), Needleman-Wunsch, and Smith-Waterman (Durbin et al., 1998). In Rule 5.3 the similarity function is not explicitly specified, but a popular similarity function or a combination of functions (Bilenko and Mooney, 2003) can be used to determine similarity.

¬SAME(E1, E2).   (5.1)

CANDSAME(E1, E2) ⇒ SAME(E1, E2).   (5.2)

CANDSAME(E1, E2) ∧ SIM(E1, E2) ⇒ SAME(E1, E2).   (5.3)
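As an illustration of Rule 5.3, a similarity score in [0, 1] can serve directly as the soft truth value of SIM. The sketch below uses Python's difflib purely as a stand-in for the dedicated measures cited above; this choice is for brevity, not a recommendation.

# Sketch: a simple stand-in for the SIM predicate using difflib; dedicated
# measures (Levenshtein, Jaro-Winkler, Monge-Elkan, ...) would normally be
# preferred, and several measures can be combined (e.g., by averaging).
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; usable as the soft truth value of SIM."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(sim("Konica", "Konica Minolta"), 2))
print(round(sim("Giants", "San Jose Giants"), 2))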

Rules for Collective Features

The collective basic features incorporate the fundamental properties of equality: symmetry (Rule 5.4) and transitivity (Rule 5.5). Symmetry enforces the constraint that the order of the arguments to SAME does not matter. Transitivity, discussed in Section 5.1, ensures that the co-reference process generates tight clusters of entities by encouraging co-reference cliques. Finally, Rule 5.6 has an opposite effect, encouraging sparsity by promoting functionality for the SAME predicate. Not all entity resolution scenarios require functionality for co-references, but those discussed in Section 5.1.2 and Section 5.1.3 can benefit from such constraints.

SAME(E1, E2) ⇒ SAME(E2, E1).   (5.4)

CANDSAME(A, B) ∧ CANDSAME(B, C) ∧ CANDSAME(A, C)
  ∧ SAME(A, B) ∧ SAME(B, C) ⇒ SAME(A, C).   (5.5)

CANDSAME(A, B) ∧ CANDSAME(A, C) ∧ SAME(A, B) ⇒ ¬SAME(A, C).   (5.6)
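For concreteness, the basic rules can be approximated in PSL's weighted rule syntax, as in the sketch below. The weights are purely illustrative and the exact surface syntax varies across PSL versions, so this should be read as an approximation rather than a definitive PSL program.

# Sketch: Rules 5.1-5.6 written as PSL-style weighted rules (weights are
# illustrative; the exact surface syntax differs across PSL versions).
basic_rules = [
    "1.0:  !Same(E1, E2) ^2",                                    # 5.1 negative prior
    "0.1:  CandSame(E1, E2) -> Same(E1, E2) ^2",                 # 5.2 positive prior
    "5.0:  CandSame(E1, E2) & Sim(E1, E2) -> Same(E1, E2) ^2",   # 5.3 similarity
    "Same(E1, E2) -> Same(E2, E1) .",                            # 5.4 symmetry (hard)
    ("10.0: CandSame(A, B) & CandSame(B, C) & CandSame(A, C)"
     " & Same(A, B) & Same(B, C) -> Same(A, C) ^2"),             # 5.5 transitivity
    ("2.0:  CandSame(A, B) & CandSame(A, C)"
     " & Same(A, B) -> !Same(A, C) ^2"),                         # 5.6 sparsity
]
for rule in basic_rules:
    print(rule)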

5.3.2 New Entity Features

Rules for Local Features

In problem settings where entity resolution is matching with respect to an existing knowledge graph, such as Section 5.1.2 and Section 5.1.3, the appropriate entity may not exist in the target knowledge graph. In these settings, we generate a new entity placeholder for each source reference. This placeholder will have no inherent relations, types, or attributes and will have a default similarity value. We designate these entities using the NEWENTITY predicate. Rule 5.7 establishes a prior that any reference matches a new entity. In subsequent rules, the NEWENTITY predicate will be used to scope the rule to existing entities, which avoids penalizing new entity matches based on relations, types, and attributes which are missing.

CANDSAME(E1, E2) ∧ NEWENTITY(E1) ⇒ SAME(E1, E2).   (5.7)

Rules for Collective Features

While a prior can be helpful, the desired behavior in entity resolution systems is to add a new entity only when no other entity in the knowledge graph appears to match. Rule 5.8 prevents a new entity from matching when a previously existing entity is a strong match for a reference.

SAME(E1, E2) ∧ CANDSAME(E1, E3) ∧ NEWENTITY(E3) ⇒ ¬SAME(E1, E3).   (5.8)

5.3.3 Abstract Knowledge Graph Features

Abstract knowledge graph features use the relational structure and attributes shared by all knowledge graphs. Their key strength is that these features are broadly applicable to any knowledge graph entity resolution problem. In scenarios such as Section 5.1.1, abstract knowledge graph rules can be used to collectively infer relations and labels in the knowledge graph while simultaneously determining entity co-references. However, one drawback of abstract knowledge graph rules is that their broad applicability may limit their usefulness. Rules that are agnostic to the particular labels and relations in a knowledge graph may have difficulty prioritizing which labels and relations are useful for entity resolution. One potential solution to this issue, when ample training data is available, is to introduce rules and then learn rule weights for each label and relation separately.

Rules for Local Features

Knowledge graph entities have associated properties such as attributes, labels, and type information that provide the basis for local features. Rule 5.9 supports co-reference when these properties agree for two entities. Since many potential candidate matches may share the same properties, the rule is mediated by the candidate similarity, supporting similar matches more strongly than dissimilar matches. While entities with agreeing properties are a signal of co-reference, properties that are missing or explicitly disagree can be strong signals against co-reference. Rule 5.10 requires that co-referent entities share properties, but provides an exception for new entities which lack properties. Note that a symmetric rule for the second entity is not shown. These rules are most useful in entity resolution settings where knowledge graph information is relatively complete, whereas noisy or incomplete extractions may hamper entity resolution. Rule 5.11 provides a stronger signal by incorporating the knowledge graph ontology, disallowing entities with mutually-exclusive properties from matching. Even when extractions are noisy and properties incomplete, this signal can provide strong evidence against a potential co-reference match.

CANDSAME(E1, E2) ∧ SIM(E1, E2) ∧ LBL(E1, L) ∧ LBL(E2, L)
  ⇒ SAME(E1, E2).   (5.9)

CANDSAME(E1, E2) ∧ LBL(E1, L) ∧ ¬LBL(E2, L) ∧ ¬NEWENTITY(E2)
  ⇒ ¬SAME(E1, E2).   (5.10)

CANDSAME(E1, E2) ∧ LBL(E1, L1) ∧ LBL(E2, L2) ∧ MUT(L1, L2)
  ⇒ ¬SAME(E1, E2).   (5.11)

Rules for Collective Features

The collective abstract knowledge graph rules parallel the local rules, but operate over relations and involve pairs of co-referent entities. Rule 5.12 requires that two co-referent entities have the same relation with co-referent objects. The collective nature of the rule introduces a dependence between entities that participate in the same relation across knowledge graphs, supporting co-references between the subjects and objects of the relation. Rule 5.13 has the opposite effect, penalizing co-references for matches between existing entities that do not share the same relations. Echoing the previous remarks on knowledge graph rules, these rules have limited usefulness in noisy or incomplete knowledge graphs where many relations may be missing. However, Rule 5.14 uses the ontology to find a stronger signal in mutually-exclusive relations.

CANDSAME(E1, E2) ∧ CANDSAME(O1, O2) ∧ SIM(E1, E2) ∧ SAME(O1, O2)
  ∧ REL(E1, O1, R) ∧ REL(E2, O2, R) ⇒ SAME(E1, E2).   (5.12)

CANDSAME(E1, E2) ∧ CANDSAME(O1, O2) ∧ SAME(O1, O2)
  ∧ ¬REL(E1, O1, R) ∧ ¬NEWENTITY(E1) ∧ ¬NEWENTITY(O1)
  ∧ REL(E2, O2, R) ⇒ ¬SAME(E1, E2).   (5.13)

CANDSAME(E1, E2) ∧ CANDSAME(O1, O2) ∧ SAME(O1, O2)
  ∧ REL(E1, O1, R) ∧ REL(E2, O2, S) ∧ RMUT(R, S) ⇒ ¬SAME(E1, E2).   (5.14)

5.3.4 Domain-Specific Knowledge Graph Features

While abstract knowledge graph features provide a generally-applicable tool for knowledge graph entity resolution, in many cases domain experts can rely on experience to assess the most important features for matching knowledge graphs. Since our model uses interpretable rules that are easy to generate, domain experts can easily add and remove rules to the model to capture the most relevant relationships. In this section, we provide some example rules for the task of matching knowledge graphs in the music domain. These rules are derived from rules used in an industry knowledge graph matching system, supporting the assertion that rules are a natural and common form of supplying domain expertise for knowledge graphs.

Rules for Local Features

One example of a domain rule that strongly supports co-reference is a relation with a categorical domain. The release type relation in musical domains differentiates between singles, EPs, and albums. Since the domain of the relation is a small, enumerated set, matching release types across co-references is important. Rule 5.15 incorporates this domain knowledge in a rule for release type matching. Just as some relations are more important than others, so are types, attributes, and labels. While general purpose ontologies have a person type, a more specific type can be far more useful in matching. Rule 5.16 provides a special case for artist, a subtype of person. One way of interpreting this rule is as a type-specific prior for entity matches. By choosing appropriate weights, these rules can also moderate the importance of a similarity metric in a particular domain. For example, a high similarity value may not be meaningful for a broad domain (e.g., person) but can provide a stronger disambiguating signal for a more selective domain (e.g., artist).

CANDSAME(E1, E2) ∧ SIM(E1, E2) ∧ REL(E1, L, release type) ∧ REL(E2, L, release type)
  ⇒ SAME(E1, E2).   (5.15)

CANDSAME(E1, E2) ∧ SIM(E1, E2) ∧ LBL(E1, artist) ∧ LBL(E2, artist)
  ⇒ SAME(E1, E2).   (5.16)

Rules for Collective Features

Similarly, domain experts can select the most important relations for resolution in a domain. Rule 5.17, which focuses on co-referent releases of the same album, can be more useful than a rule which focuses on release label, since a label typically has many releases. Domain rules can also incorporate more complex criteria. Rule 5.18 has a similar form to normal collective relational rules, but includes an additional constraint that the albums and artists must all come from the same genre.

CANDSAME(E1, E2) ∧ SIM(E1, E2) ∧ CANDSAME(O1, O2)
  ∧ REL(E1, O1, release album) ∧ REL(E2, O2, release album)
  ∧ SAME(E2, E1) ⇒ SAME(O1, O2).   (5.17)

CANDSAME(E1, E2) ∧ CANDSAME(O1, O2) ∧ SIM(E1, E2) ∧ SIM(O1, O2)
  ∧ REL(E1, O1, album artist) ∧ REL(E2, O2, album artist)
  ∧ REL(E1, G, album genre) ∧ REL(E2, G, album genre)
  ∧ REL(O1, G, artist genre) ∧ REL(O2, G, artist genre)
  ∧ SAME(O1, O2) ⇒ SAME(E1, E2).   (5.18)

Table 5.2: Summarizing entity resolution rules and matching them to application scenarios

  Rule                                          Local/collective   New extractions   Extend KG   Multiple KGs
                                                (5.2.1)            (5.1.1)           (5.1.2)     (5.1.3)
  Negative prior (5.1)                          L                  Y                 Y           Y
  Positive prior (5.2)                          L                  Y                 Y           Y
  Similarities (5.3)                            L                  Y                 Y           Y
  Symmetry (5.4)                                C                  Y                 Y           Y
  Transitivity (5.5)                            C                  Y                 N           Y
  Sparsity (5.6)                                C                  N                 N           Y
  New entity prior (5.7)                        L                  N                 Y           Y
  New entity penalty (5.8)                      C                  N                 Y           Y
  Label agreement (5.9)                         L                  N?                Y           Y
  Label disagreement (5.10)                     L                  N?                Y?          Y?
  Label exclusion (5.11)                        L                  Y                 Y           Y
  Relational agreement (5.12)                   C                  N?                Y           Y
  Relational disagreement (5.13)                C                  N?                Y?          Y?
  Relational exclusion (5.14)                   C                  Y                 Y           Y
  Domain-specific categorical relations (5.15)  L                  Y                 Y           Y
  Domain-specific prior (5.16)                  L                  Y                 Y           Y
  Domain-specific relations (5.17)              C                  Y?                Y           Y
  Domain-specific relational criteria (5.18)    C                  Y?                Y           Y

5.3.5 Synthesis

The previous section introduced a number of rules for entity resolution, categorized by whether the rule used local or collective information and by the granularity of the knowledge graph features used. In the discussion of each rule, we referenced the three knowledge graph entity resolution scenarios (introduced in Sections 5.1.1, 5.1.2, and 5.1.3) and the conditions under which the rule was applicable to the scenario. The rules and this discussion are summarized in Table 5.2. Note that some of the entries have question marks, which reinforce the guidance that the corresponding rules may be appropriate based on dataset characteristics such as noise and sparsity.

The knowledge graph entity resolution model presented in this section is a general and versatile approach to entity resolution in richly structured domains. Since the requirements of different entity resolution scenarios vary, care must be taken to select the appropriate rules and design meaningful domain-specific rules. However, the proliferation of domain-specific entity resolution methods (Durbin et al., 1998; Winkler, 1999) and anecdotal evidence from many projects in industry suggest that many bespoke entity resolution systems are already in use. Despite the widespread use of such systems and substantial research in entity resolution, no general-purpose, collective framework has been adopted across domains.

This chapter provides a general guide to designing entity resolution systems for knowledge graphs. The rules presented can be used as templates for many approaches that jointly model entity resolution decisions, such as linear programs and probabilistic models. We use the rules as the basis for a probabilistic soft logic (PSL) program for performing entity resolution. PSL is a natural choice for entity resolution models, since entity resolution models have many collective dependencies, use real-valued similarity measures, and often include a vast number of entities.

5.4 Evaluation

We evaluate our knowledge graph entity resolution approach on two very different datasets from different entity resolution scenarios. The first dataset, corresponding to the scenario in Section 5.1.1, involves clustering unresolved references with associated relations and attributes from different web sources. The second dataset, corresponding to the scenario in Section 5.1.3, involves resolving entities between the MusicBrainz music knowledge graph and the Freebase broad-coverage knowledge graph.

NELL

NELL extracts a series of facts from text, and uses a set of heuristics to map textual references to entities. This entity mapping process includes two phases: first, textual references are clustered to identify senses, and then these textual references are mapped to the appropriate senses. The entity mapping process does not use the context of the knowledge graph, which could improve entity mapping performance. Furthermore, the entity mapping process does not attempt to perform entity resolution between different textual references that refer to the same underlying entity.

In order to investigate the effectiveness of entity resolution applied to ambiguous candidate extractions, we worked with the NELL team to collect data from a new NELL instance that performed only the first phase of entity mapping, clustering textual references to generate senses. The second phase of entity mapping was not performed, so this NELL instance produced raw candidate extractions in terms of the original textual references. The goal in this setting is to collectively determine the facts in the knowledge graph along with the entity co-references.

NELL’s Entity Resolver produces match scores for pairs of textual references. We extend these match scores by computing a number of string similarity metrics for each pair of textual references, using the SecondString library (Cohen et al., 2003). The set of string similarities includes the Jaccard overlap (of characters), Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, and Smith-Waterman similarity functions. These string similarities constituted local features for entity resolution.

Using data from the first iteration of NELL yielded 145K candidate relations, 200K candidate labels, and 170K unique textual references that mapped to 190K potential entities. The NELL EntityResolver candidate generation produced 4K potential entity co-references. Since the dataset was collected from a new NELL instance, no existing entity match information was used or available. Furthermore, since there were no pre-existing entities, each textual reference was considered unknown and no special handling of new entities was required.

NELL does not have a reliable source of entity resolution data, so we manually labeled entity co-references. We chose the top 50 as well as 50 randomly selected entity co-references from each method. This selection process yielded 421 co-references after duplicates were removed. We then removed the truth values and randomized the order of the co-references for judging.

Table 5.3: Comparing the performance of knowledge graph entity resolution rules for the NELL dataset. Performance improves by adding knowledge graph features and collective features, with the best performance achieved using both.

  Method             AUPRC   F1      Prec.   Recall
  Basic, Local       0.267   0.333   0.214   0.759
  Basic & KG, Local  0.247   0.426   0.298   0.747
  Basic, All         0.307   0.446   0.333   0.675
  Basic & KG, All    0.351   0.453   0.383   0.554

Entities were judged to be co-referent when there was an unambiguous interpretation of the textual references that corresponded to one entity. This, for example, excludes “Giants” matching “San Jose Giants”, since many other sports teams share the same name. Similarly, when a textual reference was the amalgamation of two entities, matches with either entity were disallowed. For example, this invalidates “Quito” from matching with “Quito and Cuenca”. However, merged entities were judged co-referent, allowing “Konica” and “Konica Minolta” to be co-referent since the company Konica merged with Minolta to become the combined company Konica Minolta.

Results for the NELL entity resolution are reported in Table 5.3. The baseline, Basic, Local entity resolution uses only priors and the various similarity metrics.

MusicBrainz and the Google Knowledge Graph

The second dataset for entity resolution involved mapping entities between two knowledge graphs. The first knowledge graph was from the MusicBrainz music knowledge base, introduced in the previous chapter. The second knowledge graph was the Google Knowledge Graph. We restricted the entities from the Google Knowledge Graph in our dataset to select only those entities that were in the publicly available Freebase knowledge base.

A proprietary pipeline exists to map entities between these two knowledge graphs. The pipeline uses Boolean rules restricted to discrete features. The system is designed to consider entity resolutions sequentially, and as a result cannot use all collective information between resolution decisions. When a match decision for an entity cannot be made by the pipeline, the entity is manually resolved by a human annotator. Evaluation of the existing pipeline showed a high error rate, while manually annotated entities contained no errors. Our experiments focus on the entities that were not successfully matched using the existing pipeline, which constitute the most difficult entity resolution decisions.

We begin with a dataset of 11K entities added to the MusicBrainz knowledge graph between 5/5/2014 and 6/29/2014 that were manually annotated and have reliable ground truth. We identify 332K Freebase entities that are potential candidate matches for the MusicBrainz entities using a string similarity measure that is normalized for entity frequency. Since these newly added entities are often not found in Freebase, we generate new entity candidates for each MusicBrainz entity. The entity resolution dataset also includes 1M known entity mappings between the two knowledge graphs and 15.7M relations between entities.

Table 5.4 summarizes the results of these experiments. The baseline method uses only local rules, and achieves an area under the precision-recall curve (AUPRC) of 0.416 and an F1 measure of 0.734. Adding collective rules and domain-specific features that use the knowledge graph improves performance, with an AUPRC of 0.569 and an F1 of 0.805. Incorporating rules to handle new entities further improves performance, with an AUPRC of 0.724 and an F1 measure of 0.840.

Table 5.4: Comparing the performance of knowledge graph entity resolution rules when merging MusicBrainz entities into the Knowledge Graph. Due to a skew toward new entities, the collective new entity rules dramatically improve overall performance, but with a substantial drop in the F1 measure for existing entities.

  Method                                   AUPRC   F1      F1 (Exist)   F1 (New)
  Basic & NewEntity, Local                 0.416   0.734   0.169        0.744
  Basic & Domain, All; NewEntity, Local    0.569   0.805   0.305        0.831
  Basic & Domain & NewEntity, All          0.724   0.840   0.070        0.895

To better understand the performance, we separate the F1 measure for existing entities and new entities. In the dataset we collected, the entity mappings are skewed toward new entities, so that approximately 75% of entities in the MusicBrainz knowledge graph are not found in the Freebase entities. Thus the New Entity rules can have a dramatic influence on the performance by improving the performance on new entities while causing a marked decrease in performance on existing entities.

5.5 Discussion

The growing importance of knowledge graphs has resulted in an increased emphasis on entity resolution for such structured domains. The collective relationships in a knowledge graph provide the key to improving the performance of entity resolution. In this chapter, we provided an inventory of important features necessary for entity resolution in the context of knowledge graphs and identified the corresponding knowledge graph settings where these features are important. Building entity resolution models, particularly collective models, has required a great deal of time and effort. The general nature of this model makes it applicable to many different problem settings, and a PSL implementation of our entity resolution model makes it accessible for rapid prototyping and experimentation for a variety of entity resolution problems.

Chapter 6: Scaling Knowledge Graph Identification

One of the longstanding scalability challenges in artificial intelligence is tackling the combinatorial explosion, requiring optimization over an exponentially large space. Knowledge graph identification falls squarely into this space: a Boolean assignment of truth values to facts that satisfies a collection of constraints posed as logical formulas is an instance of the NP-complete maximum satisfiability (MAX-SAT) problem (Tsang, 1995).

In this chapter, we analyze the theoretical complexity and empirical scaling performance of knowledge graph identification, and then develop a method to partition and distribute knowledge graph identification across machines.

6.1 Scalability Analysis of Knowledge Graph Identification

Two major forces contribute to the scalability challenges of knowledge graph identification: the state space of the optimization variables and the number of optimization terms. The state space of knowledge graph identification is the set of joint assignments to all facts in the knowledge graph. For a discrete model of a knowledge graph with F facts, the state space is 2^F, corresponding to a true or false assignment for each fact. In a continuous model, the state space is O(R^F), corresponding to an assignment in the [0, 1] interval for each fact.

The second factor affecting scalability in knowledge graph construction is the number of optimization terms. Collective knowledge graph models incorporate constraints or rules relating the facts in the knowledge graph, and each of these ground constraints or rules is converted into an optimization term. The number of groundings of a rule grows with the number of atoms in the rule. Given a rule with p atoms, the number of ground rules is O(F^p), yielding polynomial growth as the number of facts increases. Even using simple, pairwise constraints (p = 2) such as domain, range, mutual exclusion, and subsumption can lead to a large blowup in the number of constraints. For more sophisticated knowledge graph models with rules relating many facts, the value of p can be much higher and produce a greater scalability challenge. While the worst case scaling is polynomial, in practice the growth in the number of groundings depends on the distribution of facts. A key consideration in determining how many optimization terms are required is the sparsity of the rules and constraints. For example, mutual exclusion constraints create a dense subgraph between all of the labels for a given entity.
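A toy sketch of this growth is shown below: counting the ground terms produced by pairwise mutual exclusion (p = 2) over hypothetical candidate labels illustrates the quadratic dependence on the number of candidates per entity.

# Sketch (toy data): counting groundings of a pairwise mutual-exclusion
# constraint of the form MUT(L1, L2) & Lbl(E, L1) -> !Lbl(E, L2).
from itertools import combinations

candidate_labels = {              # candidate label facts per entity
    "kyoto":   ["city", "country", "person"],
    "japan":   ["country", "city"],
    "stevens": ["person", "writer"],
}
mutex = {("city", "country"), ("city", "person"), ("country", "person")}

ground_terms = 0
for entity, labels in candidate_labels.items():
    for l1, l2 in combinations(labels, 2):
        if (l1, l2) in mutex or (l2, l1) in mutex:
            ground_terms += 1

# An entity with k mutually exclusive candidate labels contributes up to
# k*(k-1)/2 ground terms, so the total grows quadratically in candidates.
print(ground_terms)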

These scalability considerations of the knowledge graph identification problem pose difficulties for many approaches to inferring a knowledge graph. Search-based approaches have to contend with a vast state space. Developing usable search heuristics to explore this state space using the collection of rules and constraints in a knowledge graph may be difficult, due to the large number of ground terms. Optimization techniques that rely on sampling, such as Markov chain Monte Carlo, can be hampered by the large number of terms as well, since updates to the variable assignments depend on the ground rules and require recomputing the objective repeatedly. Additionally, the large state space can result in many local optima and make convergence difficult.

6.2 Scaling Knowledge Graph Identification with HL-MRFs

Our implementation of knowledge graph identification circumvents these issues through its choice of model and optimization. PSL defines a class of models known as hinge-loss Markov random fields (HL-MRFs). A key strength of HL-MRFs is that the inference objective is convex, allowing exact optimization, albeit in the approximate, soft-truth domain. The convexity of inference allows KGI to employ the many different approaches and algorithms for convex optimization developed over the last decades (Boyd and Vandenberghe, 2004).

The experiments in Section 4.4 show how the PSL implementation of KGI can handle problems from real-world datasets like NELL, which include millions of candidate facts. Inference when an explicit query set of 70K facts is given (PSL-KGI) requires a mere 10 seconds. The MLN method we compare against takes a few minutes to an hour to run for the same setting. When inferring a complete knowledge graph without known query targets, as in the LinkedBrainz and complete NELL experiments, inference with MLNs is infeasible. In contrast, knowledge graph identification on the NELL dataset can produce the complete knowledge graph containing 4.9M facts in only 130 minutes. The ability to produce complete knowledge graphs in these realistic settings is an important feature of our implementation of knowledge graph identification.

To better understand the scalability of KGI, we performed an extensive set of experiments by partitioning the NELL extractions while preserving groundings. We generated over 1000 knowledge graph identification problems and recorded the inference time for each of these problems (which includes costs related to grounding out the model). Fig. 6.1 shows how this inference time scales with the size of the KGI problem, which is expressed in terms of the number of ground rules and constraints. Every data point on the scatterplot represents a single run of KGI inference. Although there are outliers, the vast majority of these KGI instances appear to scale linearly with the number of optimization terms.

Figure 6.1: A plot capturing the running time and problem size (in terms of optimization terms) of over 1000 KGI executions. Each execution is represented as a point in the scatter plot. Trendlines for a linear fit and a quadratic fit of the relationship between problem size and running time are shown. Scalability of KGI appears to grow linearly with optimization terms.
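The trendlines in Fig. 6.1 can be reproduced with an ordinary least-squares polynomial fit; the sketch below uses synthetic (problem size, running time) pairs purely for illustration.

# Sketch (synthetic data): fitting linear and quadratic trendlines, as in
# Fig. 6.1, to (problem size, running time) pairs from many KGI runs.
import numpy as np

ground_terms = np.array([1e5, 5e5, 1e6, 2e6, 4e6])       # optimization terms
runtimes = np.array([12.0, 58.0, 121.0, 230.0, 470.0])   # seconds (illustrative)

linear_fit = np.polyfit(ground_terms, runtimes, deg=1)
quadratic_fit = np.polyfit(ground_terms, runtimes, deg=2)

# Compare residuals of the two fits to judge which trend describes the data.
for name, coeffs in [("linear", linear_fit), ("quadratic", quadratic_fit)]:
    residual = runtimes - np.polyval(coeffs, ground_terms)
    print(name, coeffs, float(np.sum(residual ** 2)))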

6.3 Scalability Challenges for Knowledge Graph Identification

Our current approach for knowledge graph identification easily scales to millions of extractions. Unfortunately, even though knowledge graph identification models implemented as hinge-loss Markov random fields show impressive scaling performance as the number of optimization terms grows, the analysis in Section 6.1 suggests that the number of terms may grow polynomially with the number of extractions. The combination of more powerful extraction techniques and the vast information available on the Internet means that knowledge graph construction systems are encountering an ever-increasing amount of data. These trends suggest that the true scale of KGI is billions of extractions or more, and even quadratic scaling of knowledge graph identification can prove to be too restrictive for practical applications. We address this challenge by developing a parallel approach to knowledge graph identification which distributes the probabilistic model across many different machines.

6.3.1 Partitioning Knowledge Graphs for Distributed Processing

The key to distributing knowledge graph identification is partitioning the knowledge graph across multiple machines. Horizontal partitioning, or splitting data into multiple sets which are processed independently, is non-trivial for joint inference problems such as KGI. Many appealing strategies for partitioning extractions involve partitioning the extractions directly. A simple approach could partition the extractions randomly. Unfortunately, such a simple approach will lose relationships between facts, creating subproblems that may turn out to be equivalent to the independent models presented in Chapter 4.

More sophisticated approaches could operate on the entities in the knowledge graph, for example by generating the extraction graph from a set of noisy extractions. Using this extraction graph, one could cluster the graph into separate components using a graph clustering technique. One popular technique for clustering a graph into components is the minimum graph cut (Ford and Fulkerson, 1956). The problem of generating a minimum cut corresponds to removing a small number of edges from the original graph in order to generate one or more disconnected components of the graph. Since these components do not share any vertices, each can be processed independently.

Clustering the extraction graph offers some benefits. First, the approach is relatively straightforward to implement since many methods exist for graph clustering and finding minimum cuts in graphs (Karger and Stein, 1996). Second, the clustering captures the relational information that relates entities, which would result in a partitioning that groups semantically similar clusters of entities. However, a number of issues make such approaches intractable. First, such a partitioning requires partitioning a graph with many edges (possibly billions or trillions), which presents a substantial scalability challenge. Second, partitioning extractions directly does not preserve the ontological relationships that form a key ingredient for generating a consistent knowledge base. We present an alternative method, based on partitioning the concept graph, that addresses both of these drawbacks.

6.3.2 Scalability via Ontological Partitioning

Figure 6.2: A small section of the NELL concept graph showing the relationship between the labels city and country and relations such as countryCapital, cityCapitalOfCountry, countryCities, and cityLocatedInCountry.

We confront the problem of partitioning knowledge graph identification by identifying a key insight: since the concept graph contains the ontological relationships relating extractions in our model, and these relationships are the basis for knowledge graph identification, we can partition the concept graph and, by so doing, partition the probabilistic model generated by KGI. The concept graph, which was introduced in Section 3.1, has labels and relations from the knowledge graph as vertices, and the edges connecting these vertices are the ontological constraints between these labels and relations. These ontological constraints were incorporated into the knowledge graph identification model presented in

Section 4.2.

Our approach partitions the concept graph, which entails creating a partition of the labels and relations in the ontology. For example, the ontological relation DOM(cityCapitalOfCountry, country) would be converted to an edge of type DOM between the relation vertex cityCapitalOfCountry and the label vertex country. We show a small subset of the ontological graph for NELL in Figure 6.2.

A partitioning of the concept graph might separate the city and country labels and related relations into distinct partitions, while assigning relations that include both city and country (such as cityCapitalOfCountry) to only one of these partitions. We refer to this partitioning of the concept graph as an ontological partitioning.

Given an ontological partitioning of the labels and relations, generating a partitioning of the extractions is straightforward. Each ontological partition specifies a cluster of related labels and relations. Using these clusters of relations and labels, we can create a corresponding partition of the extractions of specific instances of these relations and labels, where all extractions pertaining to the relations and labels in the cluster will be assigned to the same partition. Since each label and relation is assigned to a unique cluster and each extraction pertains to a single label or relation, each extraction will be assigned to exactly one partition.
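The induced partitioning can be computed in a single pass over the extractions, as in the sketch below; the data structures are hypothetical stand-ins for the concept-graph clusters and the candidate facts.

# Sketch (hypothetical data structures): inducing a partition of extractions
# from clusters of labels/relations produced by partitioning the concept graph.
def induce_partitions(extractions, concept_clusters):
    """concept_clusters: {label_or_relation: cluster_id};
    extractions: iterable of (subject, predicate, object_or_None) triples."""
    partitions = {}
    for triple in extractions:
        predicate = triple[1]                   # the label or relation used
        cluster_id = concept_clusters[predicate]
        partitions.setdefault(cluster_id, []).append(triple)
    return partitions

clusters = {"city": 0, "citylocatedincountry": 0, "bird": 1}
extractions = [("kyoto", "city", None),
               ("kyoto", "citylocatedincountry", "japan"),
               ("sparrow", "bird", None)]
print(induce_partitions(extractions, clusters))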

The advantages of ontological partitioning directly address the weaknesses of partitioning the extraction graph. Partitions are based on ontological relationships between relations and labels instead of relationships between entities. Ontological partitioning tries to maximize the number of ontological relationships preserved in the data, rather than the number of entity-based relationships. Beyond capturing important model properties in the partitioning, ontological partitioning also has a scalability advantage. The ontology is many orders of magnitude smaller than the extractions, and the size of the ontology can remain relatively constant while the number of extractions grows by many orders of magnitude. Despite these advantages, there are some critical shortcomings of this approach, which we discuss and address in the next sections.

6.3.2.1 Handling Unevenly Distributed Extractions

One issue that may arise from partitioning the concept graph is that the induced partitioning of extractions may be imbalanced. Imbalanced partitions may occur because some relations and labels occur more frequently in the data than others. For example, the extractor may have far more extractions about the label city than the label bird, resulting in more extractions in the partition induced by the cluster containing the label city.

Unbalanced partitions pose a problem for inference when using horizontal partitioning. Since inference is run in parallel across machines, the inference time is equal to the time taken by the slowest partition. Since inference time depends on the size of the partition, having a single large partition can result in a long inference time, even if most partitions finish quickly. By balancing partitions, we can produce the quickest overall inference.

Balancing partitions of the extraction graph directly can be accomplished by adding a constraint that all partitions contain an equal number of entities. However, applying this technique to the concept graph is problematic since, as discussed, each relation or label corresponds to a varying number of extractions. To address the problem of imbalanced partitions, we incorporate information about the data distribution into the concept graph.

Instead of treating labels and relations as atomic elements of a partitioning, we assign an importance to each label and relation based on the data distribution. The simplest way to do this is to assign each of the vertices corresponding to a label or relation a weight equal to the frequency of the label or relation in the data. When clustering the graph, we add an additional constraint that each cluster contains vertices with equal aggregate weight. By adding such a constraint, we ensure that each cluster of the concept graph maps to an equal number of extractions. Thus, when inducing a partitioning over extractions, each partition will contain a roughly equal number of extractions.

6.3.2.2 Dealing with Unbalanced Ontologies

By creating balanced partitions, our approach ensures fast inference. However, an equally important goal is ensuring that inference quality remains high. One possible concern that may impact inference quality is the distribution of ontological information. Simply partitioning the ontology graph when the ontological information is imbalanced may lead to poor inference quality.

For example, in NELL’s ontology there are nearly 50K RMUT constraints and only 418 DOM constraints. Many of the mutually-exclusive relations are not present in the extractions for the same pair of entities, while domain constraints are relevant for every extracted relation. For example, the ontological relationship RMUT(cityCapitalOfCountry, currencyCountry), which states that the relation specifying that a city is a capital of a country and the relation specifying the currency a country uses are mutually exclusive, may not be applied frequently if the extractors do not commonly confuse cities and currencies. However, the ontological relationship DOM(cityCapitalOfCountry, country) may be useful whenever the cityCapitalOfCountry relationship is observed.

To capture the variable utility of different ontological information, we assign a weight to each type of ontological relationship. While many different approaches can be used to select these weights, we introduce a fairly straightforward heuristic based on rarity. Specifically, we consider ontological relationships that occur rarely as more important, and assign a higher penalty to clusterings that exclude such relationships. From the perspective of a minimum cut, we can interpret this heuristic as assigning a weight to each edge of a given type of ontological relationship.

Our approach determines weights by choosing a weight that is inversely proportional to the number of ontological constraints of that type. Using the statistics from the NELL ontology from the previous paragraph, this means that the RMUT ontological relationship would have a weight of 1/50K while the DOM ontological relationship would have a weight of 1/418. Note that the aggregate weight of each type of ontological relationship is equal to one across the concept graph (e.g., there are 418 DOM edges, each with a weight of 1/418). This, in some sense, can be seen as equalizing the importance across all types of ontological relationships in the concept graph.
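The following sketch (with hypothetical inputs) computes the vertex and edge weights described above, which would then be handed to a balanced graph partitioner such as METIS.

# Sketch (hypothetical inputs): weighting the concept graph before a balanced
# p-way cut. Vertex weight = frequency of the label/relation in the data;
# edge weight = 1 / (number of ontological constraints of that type).
from collections import Counter

def weight_concept_graph(extractions, ontology_edges):
    """extractions: (subject, label_or_relation, object) triples;
    ontology_edges: (vertex_a, vertex_b, constraint_type) triples."""
    vertex_weights = Counter(pred for _, pred, _ in extractions)
    type_counts = Counter(ctype for _, _, ctype in ontology_edges)
    edge_weights = {(a, b, t): 1.0 / type_counts[t] for a, b, t in ontology_edges}
    return vertex_weights, edge_weights

ontology = [("citycapitalofcountry", "country", "DOM"),
            ("citycapitalofcountry", "currencycountry", "RMUT"),
            ("citylocatedincountry", "currencycountry", "RMUT")]
extractions = [("kyoto", "city", None), ("kyoto", "citylocatedincountry", "japan")]
print(weight_concept_graph(extractions, ontology))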

6.4 Evaluation

While the choice of partitioning technique can influence running time, the number of partitions used in inference can also impact the computational performance of KGI. Joint inference without partitioning preserves all dependencies between extractions, but has a correspondingly complex model and cannot benefit from parallelism. Using a large number of partitions increases parallelism and improves the speed of inference, but necessarily involves losing dependencies, which may reduce the quality of the inference results. We explore this speed-quality tradeoff between the number of partitions and the quality of the inference results.

94 6.4.1 Comparison of Partitioning Techniques

We evaluate different partitioning strategies for our KGI model with data from iteration 165 of the NELL system, which contains nearly 80K ontological relationships, 1.7M candidate facts, and 440K previously promoted facts which we represent as a separate, noisy source. While PSL supports weight learning, our experiments use a simpler setting where all source weights are set to be equal (∀T : wCL-T = wCR-T = 1), while entity resolution rules and ontology rules are given higher weights (wER = wEL = 10; wO = 100). We assess the quality of our inference results using a manually-labeled evaluation set with 4.5K extractions (Jiang et al., 2012) and measure the running time on a 16-core Xeon X5550 CPU at 2.67GHz with 78GB of RAM. We provide documentation, code, and datasets to replicate our results on GitHub (https://github.com/linqs/KnowledgeGraphIdentification). To partition the ontology, we use the METIS graph partitioning package (Karypis and Kumar, 1998). In all cases, the time for partitioning the ontology graph was less than a minute. Our experiments consider two aspects of partitioning extractions for KGI: the partitioning technique and the number of partitions.

We compare four techniques for partitioning knowledge graphs with inference on the full set of extractions. The first, a baseline, randomly assigns each extraction to a partition. While such an approach balances partitions, it does not actively try to maintain the dependencies between extractions that KGI uses. The second approach, Onto-EqEdg-NoVtx, formulates an ontology graph where each edge (corresponding to an ontological constraint) has equal weight. The ontology graph is partitioned using a p-way balanced min-cut, where the objective function minimizes the communication cost defined by the sum of adjacent edge weights. The third approach, Onto-EqEdg-WtVtx, equally weights each edge but assigns weights to each vertex (relation or label) based on the frequency of that relation or label in the extraction data. The ontology graph is partitioned using a minimum edge cut with a constraint that each cluster has the same aggregate vertex weight. The fourth approach, Onto-WtEdg-WtVtx, weights vertices by frequency, and also assigns a weight to each edge. The edge weights are set to be inversely proportional to the frequency of the respective type of ontological information. This formulation was chosen to give each type of ontological information an equal representation in the ontology graph.

Table 6.1: Comparing different partitioning techniques, we find that partitioning extractions with an ontology-based approach that weights vertices with the frequency of respective labels and relations in the data preserves model quality and reduces inference time

  Technique          AUC     Opt. Terms   Time (min.)
  NELL (no KGI)      0.765   -            -
  baseline           0.780   3.0M         31
  Onto-EqEdg-NoVtx   0.788   4.2M         42
  Onto-EqEdg-WtVtx   0.791   3.7M         31
  Onto-WtEdg-WtVtx   0.790   3.7M         31
  No-Partitioning    0.794   10.9M        97

For each partitioning algorithm, we generate 6 disjoint clusters of labels and relations. We use these 6 clusters of labels and relations to produce 6 corresponding partitions from the extraction data, where each partition is limited to the relations and labels in the corresponding cluster. We perform inference on each of the 6 partitions independently and combine the results of this distributed inference, averaging truth values when the same fact appears in the output of multiple partitions. We compute the area under the precision-recall curve (AUC) for each technique, and report the running time of the slowest partition.
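A sketch of this merging step is shown below; the per-partition output format is a hypothetical simplification of the actual inference output.

# Sketch (hypothetical output format): combining distributed inference results
# by averaging soft truth values when a fact appears in multiple partitions.
from collections import defaultdict

def merge_partition_outputs(partition_outputs):
    """partition_outputs: list of {fact: truth_value} dicts, one per partition."""
    sums, counts = defaultdict(float), defaultdict(int)
    for output in partition_outputs:
        for fact, value in output.items():
            sums[fact] += value
            counts[fact] += 1
    return {fact: sums[fact] / counts[fact] for fact in sums}

outputs = [{("kyoto", "city"): 0.9, ("kyoto", "country"): 0.2},
           {("kyoto", "city"): 0.7}]
print(merge_partition_outputs(outputs))
# {('kyoto', 'city'): 0.8, ('kyoto', 'country'): 0.2}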

As shown in Table 6.1, inference over the full knowledge graph, No-Partitioning, takes 97 minutes and provides an improvement over NELL’s default strategy for promoting candidates. The baseline strategy of randomly partitioning extractions dramatically reduces inference time, but produces a considerable drop in the AUC. The partitioning also fails to preserve many of the ontological relationships, with the largest partition containing only 3M ground terms. By using an ontology-aware partitioning method, Onto-EqEdg-NoVtx, we improve the AUC over the baseline, approaching the quality of the full joint inference problem, but the running time increases significantly relative to the baseline. This increase in running time can be partially explained by the increased number of optimization terms, with 4.2M terms in the largest partition. Using vertex weights that reflect the data distribution, Onto-EqEdg-WtVtx reduces the inference time by improving the partition balance (3.7M terms in the largest partition) while also improving AUC. However, including weights based on the ontological frequency, Onto-WtEdg-WtVtx, does not improve our results.

6.4.2 Assessing the Impact of Partition Size

Next, we examine how the number of partitions impacts the speed and quality of knowledge graph identification. We used the best-performing partitioning technique from the earlier experiments, Onto-EqEdg-WtVtx, to create a varying number of partitions. We generated 1, 2, 3, 6, 12, 24, and 48 partitions of the extractions using this technique. We report the results for each of these partition numbers in Table 6.2. Our results show that partitioning can dramatically reduce inference time, from 97 minutes with a single partition to 12 minutes with 48 partitions. Surprisingly, there is little degradation in inference quality as measured by AUC, which ranges from 0.794 with a single partition to 0.788 with 48 partitions. The quality for 48 partitions even remains higher than the baseline strategy from the previous section, which had an AUC of 0.780 when randomly partitioning the data into 6 partitions. Fig. 6.3 clearly shows this trade-off between inference speed and quality for the NELL dataset. The sublinear speedup, which diverges from the earlier results showing linear scaling with extraction size, is potentially related to caching effects or computational overhead in the implementation of our model.

Table 6.2: Increasing the number of partitions used in inference can dramatically reduce the time for inference with relatively modest loss in quality, as measured by AUC

  Partitions   AUC     Time (min.)
  48           0.788   12
  24           0.790   20
  12           0.791   26
  6            0.791   31
  3            0.794   44
  2            0.794   57
  1            0.794   97


Figure 6.3: A plot showing the scaling performance of distributed KGI as a function of the number of partitions. The left axis shows the running time of the slowest partition while the right axis shows the AUPRC of the inference output. As the number of partitions increases, the inference running time decreases but the AUC remains fairly stable. These results indicate that partitioning KGI can improve scalability without sacrificing quality.

6.5 Discussion

In this chapter, we present a deeper examination into one of the central problems in knowledge graph construction: scalability. Our contributions are: (1) a formal analysis of the complexity of knowledge graph identification, signaling the difficulty of alternative implementations of KGI; (2) a demonstration of the scalability of our KGI implementation, through experimental evaluation on several real-world knowledge graph construction tasks and through an empirical analysis of over 1000 KGI problems; (3) a further improvement to the scalability of KGI using a distributed approach that partitions extractions while taking into account the data distribution and preserving key ontological relationships, including weighting these relationships by their relative frequency; and (4) an empirical analysis comparing different distributed approaches to KGI and exploring the trade-off between speed and quality based on the number and size of partitions. As the results of this investigation demonstrate, despite tackling a theoretically intractable problem, KGI can be scaled to yield an order-of-magnitude decrease in inference time with little loss of quality. This scalability is a necessity for commercial settings where large problems can easily be distributed over hundreds or thousands of machines.

Chapter 7: Online Collective Inference

A key challenge of many artificial intelligence problems is that the evidence grows and changes over time, requiring updates to inferences. Every time a user rates a new movie on Netflix, posts a status update on Twitter, or adds a connection on LinkedIn, inferences about preferences, events, or relationships must be updated. When constructing a knowledge base, each newly acquired document prompts the system to update inferences over related facts and resolve mentions to their canonical entities. Problems such as these benefit from collective (i.e., joint) reasoning, but incorporating new evidence into a collective model is particularly challenging. New evidence can affect multiple predictions, so updating inference typically involves recomputing all predictions in an expensive global optimization. Even when a full inference update is tractable—which, using the best known methods, can be linear in the number of factors—it may still be impractical. For example, updating a knowledge graph with millions of facts can take hours, thereby requiring some compromise, either in the form of a deferment strategy or approximate update. In this chapter, we consider the task of efficiently updating the maximum-a-posteriori (MAP) state of a probabilistic graphical model, conditioned on evolving evidence. We refer to this problem as online collective inference.

In online collective inference, we are given a single graphical model describing the conditional distribution of a set of random variables with fixed dependency structure. Over a series of epochs, the true assignments (i.e., labels) of certain variables are revealed, introducing new evidence which can be used to update the assignments of the remaining unknowns. We constrain the problem by adding a budget, such that only a fixed percentage of variables can be updated in each epoch. The introduction of a budget necessitates some approximation to full inference. This constraint distinguishes our work from the vast body of literature on belief revision (e.g., Gardenfors, 1992), Bayesian network updates (e.g., Buntine, 1991; Friedman and Goldszmidt, 1997; Li et al., 2006), models for dynamic (Murphy, 2002) or sequential (Fine et al., 1998) data, and adaptive inference (e.g., Acar et al., 2008), which deal with exact updates to inference. We analyze budgeted online collective inference from both the theoretical and algorithmic perspectives, addressing two fundamental questions: How do we choose which variables to update? How "close" is the approximate inference update to the full inference update?

To formalize the latter question, we introduce the concept of inference regret. Informally, inference regret measures the amount of change induced by fixing (i.e., conditioning on) certain variables in the inference optimization. We specifically analyze the inference regret of continuous graphical models whose inference objective is strongly convex.

One instantiation of this class of models is hinge-loss Markov random fields (Bach et al., 2015), which have been used throughout this dissertation, have been broadly applied, and demonstrate state-of-the-art performance in many applications. Using the duality between strong convexity and stability, we upper-bound the inference regret. Our bound is proportional to the distance from the fixed variables to the optimal values of the full inference problem, scaled by a function of several model-specific properties. We use the inference regret bound to quantify the effect of approximate inference updates in response to new evidence (in this case, revealed labels). The bound highlights two terms affecting the regret: the accuracy of the original predictions and the amount that the original predictions change. This latter insight informs the design of approximate update methods with a simple intuition: fix the predictions that are unlikely to change in a full inference update.

To efficiently determine which variables are least likely to change, we turn to the optimization algorithm used during inference. The alternating direction method of multipliers (ADMM) (Boyd et al., 2011) is a popular convex optimization technique that offers convergence guarantees while remaining highly scalable. We analyze the optimization process and catalog the features that allow us to determine which variables will change the most. Using these features to generate a score for each variable, we establish a ranking capturing the priority of including the variables in subsequent inference. Since the variable scores are produced using the state of the optimization algorithm, our method does not incur computational overhead. By ranking variables, we approximate full inference with an arbitrary budget and support an anytime online inference algorithm.

7.1 Preliminaries

The theory and methods introduced in this chapter apply to any continuous-valued MRF with a strongly convex MAP inference objective function. One case of particular interest is hinge-loss Markov random fields (HL-MRFs). An HL-MRF is a continuous-valued Markov network in which the potentials are hinge functions of the variables. Our choice of HL-MRFs comes from technical considerations: we reason about the strength of convexity of the inference objective, and maximum a posteriori (MAP) inference in HL-MRFs can be strongly convex.

To better understand HL-MRFs and PSL, consider a model for collective classification of network data, in which the goal is to assign labels to nodes, conditioned on some local evidence and network structure. Let $G \triangleq (V, E)$ denote an undirected graph on $n \triangleq |V|$ nodes. Each node $i \in V$ is associated with a set of local observations, $X_i$, and an unknown label, $L_i$. (In some settings, a subset of the labels are revealed.) In general, the observations and labels can be real-valued; but for simplicity of exposition, let us assume that each observation is binary-valued, and each label is categorical. The following logical rules define a PSL program for a typical collective classification model:

$w_{x,\ell}$ : FEATURE(N, x) ⇒ LABEL(N, $\ell$)

$w_{e,\ell}$ : EDGE(N1, N2) ∧ LABEL(N1, $\ell$) ⇒ LABEL(N2, $\ell$)

Variables N, N1 and N2 denote nodes; x indexes a local feature; and $\ell$ denotes a label.

The rules are weighted by $w_{x,\ell}$ and $w_{e,\ell}$, respectively. Given $G$ and $X \triangleq (X_1, \ldots, X_n)$ (and possibly some subset of the labels), the rules are grounded out for all possible instantiations of the predicates. The groundings involving unknown variables—in this case, groundings of the LABEL predicate—are represented by $[0,1]$-valued variables, $Y \triangleq (Y_{i,\ell})_{i,\ell}$. Using a relaxation of the MAX-SAT problem to continuous domains (Globerson and Jaakkola, 2007), each grounding is converted to a convex hinge function of the form

$$f(X, Y) = \left(\max\{0,\, \varphi(X, Y)\}\right)^q,$$

where $\varphi$ is a linear function of $(X, Y)$, and $q \in \{1, 2\}$ is an exponent that is set a priori for the given rule. Each hinge function becomes a potential in an HL-MRF.
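For concreteness, under the Łukasiewicz-style relaxation that PSL uses for logical rules, a single grounding of each rule above (for nodes $n_1, n_2$, feature $x$, and label $\ell$) yields hinge functions of roughly the following form; this is a worked illustration rather than a formula from the original text:

$$\max\{0,\; X_{n_1, x} - Y_{n_1, \ell}\}, \qquad \max\{0,\; E_{n_1, n_2} + Y_{n_1, \ell} - 1 - Y_{n_2, \ell}\},$$

where $E_{n_1, n_2}$ denotes the (observed) soft truth value of EDGE($n_1$, $n_2$). Each such hinge is zero when the grounded rule is satisfied and grows linearly with its distance to satisfaction.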

The resulting HL-MRF enables probabilistic inference over the set of PSL rules.

Fix a set of $r$ PSL rules, with corresponding weights $w \triangleq (w_1, \ldots, w_r)$. For the $i$th rule, let $G(i)$ denote its set of groundings in $G$, and let $f_j^i$ denote the $j$th grounding of its associated hinge function. To compactly express the weighted sum of grounded rules, we let

$$f(X, Y) \triangleq \left( \sum_{j=1}^{|G(1)|} f_j^1(X, Y),\; \ldots,\; \sum_{j=1}^{|G(r)|} f_j^r(X, Y) \right)^{\!\top}$$

denote the aggregate of the grounded hinge functions. We can thus write the weighted sum of groundings as $w \cdot f(X, Y)$. This inner product defines a distribution over $(Y \mid X)$ with probability density function $p(Y = y \mid X = x; w) \propto \exp(-w \cdot f(X, Y))$. The maximizer of the density function (alternately, the minimizer of $w \cdot f(X, Y)$) is the MAP state. The values of $Y$ in the MAP state can be interpreted as confidences. Additionally, we can define a prior distribution over each $Y$. In this case, we will use an $L_2$, or Gaussian, prior. This can be accomplished using the rule $w_{p,\ell} : \lnot\,\text{LABEL}(N, \ell)$, with a squared hinge (i.e., $q = 2$). Let us assume, without loss of generality, that each prior rule has weight $w_{p,\ell} = w_p / 2$, for some $w_p > 0$. Thus, the corresponding hinge function for grounding LABEL($i$, $\ell$) is simply $(Y_{i,\ell})^2$; the aggregate features for the prior are $\|Y\|_2^2$. So as to simplify notation, let $\dot{w} \triangleq (w, w_p)$ and define an energy function,

$$E(y \mid x; \dot{w}) \triangleq w \cdot f(x, y) + \frac{w_p}{2}\,\|y\|_2^2. \qquad (7.1)$$

The resulting probability density function is

$$p(Y = y \mid X = x; \dot{w}) \propto \exp\left(-E(y \mid x; \dot{w})\right).$$

MAP inference, henceforth denoted $h(x; \dot{w})$, is given by

$$h(x; \dot{w}) = \arg\min_{y} E(y \mid x; \dot{w}).$$
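To ground this definition, the following toy sketch minimizes an energy of the form in Equation 7.1 numerically. The potentials, weights, and the use of a generic off-the-shelf solver are illustrative assumptions; this is not the PSL implementation.

import numpy as np
from scipy.optimize import minimize

n, wp = 3, 1.0
# Each ground rule: (weight, linear function phi(y), exponent q); made up for
# illustration, mimicking one evidence rule and two relational rules.
potentials = [
    (2.0, lambda y: 0.9 - y[0], 1),      # strong evidence pushes y[0] toward 1
    (1.0, lambda y: y[0] - y[1], 1),     # y[0] => y[1]
    (1.0, lambda y: y[1] - y[2], 1),     # y[1] => y[2]
]

def energy(y):
    hinge_terms = sum(w * max(0.0, phi(y)) ** q for w, phi, q in potentials)
    return hinge_terms + 0.5 * wp * np.sum(y ** 2)    # plus the L2 prior

map_state = minimize(energy, np.full(n, 0.5), bounds=[(0.0, 1.0)] * n).x
print(map_state)   # h(x; w): soft truth values, interpretable as confidences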

7.2 Inference Regret

The notion of regret has often been used to measure the loss incurred by an online learning algorithm relative to the optimal hypothesis. We extend this concept to online inference.

Fix a model. Suppose we are given evidence, X = x, from which we make a prediction,

Y = y, using MAP inference. Then, some subset of the unknowns are revealed. Conditioning on the new evidence, we have two choices: we can recompute the MAP state of the remaining variables, using full inference; or, we can fix some of the previous predictions, and only update a certain subset of the variables. To understand the consequences of fixing our previous predictions we must answer a basic question: how much have the old predictions changed?

We formalize the above question in the following concept.

Definition 1. Fix a budget $m \ge 1$. For some subset $S \subset \{1, \ldots, n\}$, such that its complement, $\bar{S} \triangleq \{1, \ldots, n\} \setminus S$, has size $|\bar{S}| = m$, let $Y_S$ denote the corresponding subset of the variables, and let $Y_{\bar{S}}$ denote its complement. Assume there is an operator $\Gamma$ that concatenates $Y_S$ and $Y_{\bar{S}}$ in the correct order. Fix a model, $\dot{w}$, and an observation, $X = x$. Further, fix an assignment, $Y_S = y_S$, and let

$$h(x, y_S; \dot{w}) \triangleq \Gamma\!\left(y_S,\; \arg\min_{y_{\bar{S}}} E\left(\Gamma(y_S, y_{\bar{S}}) \mid x; \dot{w}\right)\right)$$

denote the new MAP configuration for $Y_{\bar{S}}$ after fixing $Y_S$ to $y_S$. We define the inference regret for $(x, y_S; \dot{w})$ as

$$R_n(x, y_S; \dot{w}) \triangleq \frac{1}{n}\,\bigl\| h(x; \dot{w}) - h(x, y_S; \dot{w}) \bigr\|_1. \qquad (7.2)$$
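Given two such MAP solves, the regret in Equation 7.2 is just a scaled $L_1$ difference. A minimal self-contained sketch, with a made-up strongly convex energy, fixed set, and assignment (all of which are assumptions of this illustration), might look as follows:

import numpy as np
from scipy.optimize import minimize

n, wp = 4, 1.0
def energy(y):
    # toy HL-MRF-style energy: two hinge potentials plus an L2 prior
    return 2.0 * max(0.0, y[0] - y[1]) + 1.0 * max(0.0, y[1] - y[3]) \
        + 0.5 * wp * np.sum(y ** 2)

bounds = [(0.0, 1.0)] * n
y_full = minimize(energy, np.full(n, 0.5), bounds=bounds).x       # h(x; w)

S = [0, 2]                                  # fixed (conditioned) coordinates
y_S = np.array([1.0, 0.3])                  # assignment Y_S = y_S
free = [i for i in range(n) if i not in S]  # the budgeted set S-bar

def conditioned_energy(y_free):
    y = np.empty(n); y[S] = y_S; y[free] = y_free
    return energy(y)

y_free = minimize(conditioned_energy, y_full[free],
                  bounds=[(0.0, 1.0)] * len(free)).x
y_cond = np.empty(n); y_cond[S] = y_S; y_cond[free] = y_free       # h(x, y_S; w)

regret = np.abs(y_full - y_cond).sum() / n                         # Equation 7.2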

In general, the inference regret can be as high as 1 for variables in [0, 1]. For example, consider a network classification model in which probability mass is only assigned to configurations where all nodes have the same label. Fixing a variable corresponding to a single node label in this setting is tantamount to fixing the label for all nodes. In the presence of strong evidence for a different label, incorrectly fixing a single variable results in incorrectly inferring all variables.

In online inference, regret can come from two sources. First, there is the regret of not updating the MAP state given new evidence (in this case, revealed labels). If this regret is low, it may not be worthwhile to update inference, which can be useful in situations where updating inference is expensive (such as updating predicted attributes for all users in a social network). The second type of regret is from using an approximate inference update in which only certain variables are updated, while the rest are kept fixed to their previous values. We describe several such approximations in Section 7.3. In practice, one may have both types of regret, caused by approximate updates in response to new evidence. Note that the inference regret obeys the triangle inequality, so one can upper-bound the compound regret of multiple updates using the regret of each update.

7.2.1 Regret Bounds for Strongly Convex Inference

A convenient property of the L2 prior is that it is strongly convex, by which we mean the following.

Definition 2. Let $\Omega \subseteq \mathbb{R}^n$ denote a convex set. A differentiable function, $f : \Omega \to \mathbb{R}$, is $\kappa$-strongly convex (w.r.t. the 2-norm) if, for all $\omega, \omega' \in \Omega$,

$$\frac{\kappa}{2}\,\|\omega - \omega'\|_2^2 + \langle \nabla f(\omega),\, \omega' - \omega \rangle \le f(\omega') - f(\omega). \qquad (7.3)$$

Strong convexity has a well-known duality with stability (Wainwright, 2006), which we will use in our theoretical analysis.

The function $f(\omega) \triangleq \frac{1}{2}\|\omega\|_2^2$ is 1-strongly convex. Therefore, the prior, $\frac{w_p}{2}\|y\|_2^2$, is at least $w_p$-strongly convex. We also have that the aggregated hinge functions, $f(x, y)$, are convex functions of $Y$. Thus, it is easily verified that the energy, $E(y \mid x; \dot{w})$, is at least a $w_p$-strongly convex function of $y$. This yields the following upper bound on the inference regret.

Proposition 1. Fix a model with weights $\dot{w}$. Assume there exists a constant $B \in [0, \infty)$ such that, for any $x$, and any $y, y'$ that differ at coordinate $i$,

$$\|f(x, y) - f(x, y')\|_2 \le B\,|y_i - y_i'|. \qquad (7.4)$$

Then, for any observations $x$, any budget $m \ge 1$, any subset $S \subset \{1, \ldots, n\}$ with $|\bar{S}| = m$, and any assignments $y_S$, with $\hat{y} \triangleq h(x; \dot{w})$, we have that

$$R_n(x, y_S; \dot{w}) \le \sqrt{\frac{1}{n}\left(\frac{3}{2} + \frac{B\,\|w\|_2}{w_p}\right)\|y_S - \hat{y}_S\|_1}.$$

Proof. Let $\hat{y} \triangleq h(x; \dot{w})$ denote the original MAP configuration, i.e., the minimizer of $E(\cdot \mid x; \dot{w})$. Let $\hat{y}' \triangleq h(x, y_S; \dot{w})$ denote the updated MAP state after conditioning, and note that $\hat{y}'_{\bar{S}}$ is the minimizer of $E(\Gamma(y_S, \cdot) \mid x; \dot{w})$.

Since $\hat{y}_S$ may be different from $y_S$, we have that $\hat{y}$ may not be in the domain of $E(\Gamma(y_S, \cdot) \mid x; \dot{w})$. We therefore define a vector $\tilde{y} \in [0, 1]^n$ that is in the domain and has minimal Hamming distance to $\hat{y}$. Let $\tilde{y}_i \triangleq y_i$ for all $i \in S$, and $\tilde{y}_j \triangleq \hat{y}_j$ for all $j \notin S$.

We begin by restating the two-norm in terms of $\tilde{y}$:

$$\|\hat{y}' - \hat{y}\|_2^2 = \|\hat{y}' - \tilde{y} + \tilde{y} - \hat{y}\|_2^2 = \|\hat{y}' - \tilde{y}\|_2^2 + \|\tilde{y} - \hat{y}\|_2^2 + 2\,(\hat{y}' - \tilde{y}) \cdot (\tilde{y} - \hat{y}).$$

However, by the construction of $\tilde{y}$, each of its elements equals the corresponding element of either $\hat{y}'$ or $\hat{y}$, so the final dot product yields 0. This gives us

$$\|\hat{y}' - \hat{y}\|_2^2 = \|\hat{y}' - \tilde{y}\|_2^2 + \|\tilde{y} - \hat{y}\|_2^2. \qquad (7.5)$$

Further, since the domain of each $Y_i$ is $[0, 1]$, the $L_1$ distance dominates the $L_2$ distance:

$$\|\tilde{y} - \hat{y}\|_2^2 = \|y_S - \hat{y}_S\|_2^2 \le \|y_S - \hat{y}_S\|_1. \qquad (7.6)$$

Therefore, combining Equations 7.5 and 7.6,

$$\|\hat{y} - \hat{y}'\|_2^2 = \frac{1}{2}\left(\|\hat{y} - \hat{y}'\|_2^2 + \|\hat{y}' - \hat{y}\|_2^2\right) \le \frac{1}{2}\left(\|\hat{y} - \hat{y}'\|_2^2 + \|\hat{y}' - \tilde{y}\|_2^2 + \|y_S - \hat{y}_S\|_1\right). \qquad (7.7)$$

For any $\kappa$-strongly convex function, $\varphi : \Omega \to \mathbb{R}$, where $\hat{\omega} = \arg\min_{\omega \in \Omega} \varphi(\omega)$ is the minimizer, we have, for all $\omega' \in \Omega$,

$$\frac{1}{2}\,\|\hat{\omega} - \omega'\|_2^2 \le \frac{1}{\kappa}\left(\varphi(\omega') - \varphi(\hat{\omega})\right). \qquad (7.8)$$

Applying this identity to the first two terms in Equation 7.7, since $E(\cdot \mid x; \dot{w})$ is $w_p$-strongly convex, we have that

$$\frac{1}{2}\,\|\hat{y} - \hat{y}'\|_2^2 + \frac{1}{2}\,\|\hat{y}' - \tilde{y}\|_2^2 \le \frac{1}{w_p}\left(E(\hat{y}' \mid x; \dot{w}) - E(\hat{y} \mid x; \dot{w})\right) + \frac{1}{w_p}\left(E(\tilde{y} \mid x; \dot{w}) - E(\hat{y}' \mid x; \dot{w})\right) \le \frac{1}{w_p}\left(E(\tilde{y} \mid x; \dot{w}) - E(\hat{y} \mid x; \dot{w})\right). \qquad (7.9)$$

The $E(\hat{y}' \mid x; \dot{w})$ terms cancel out. Expanding $E(\cdot \mid x; \dot{w})$,

$$E(\tilde{y} \mid x; \dot{w}) - E(\hat{y} \mid x; \dot{w}) = w \cdot \left(f(x, \tilde{y}) - f(x, \hat{y})\right) + \frac{w_p}{2}\left(\|\tilde{y}\|_2^2 - \|\hat{y}\|_2^2\right) \le \|w\|_2\,\|f(x, \tilde{y}) - f(x, \hat{y})\|_2 + \frac{w_p}{2}\left(\|\tilde{y}\|_2^2 - \|\hat{y}\|_2^2\right) \le \|w\|_2\,\|f(x, \tilde{y}) - f(x, \hat{y})\|_2 + w_p\,\|y_S - \hat{y}_S\|_1. \qquad (7.10)$$

The first inequality uses Cauchy-Schwarz and the final step uses $\|\tilde{y}\|_2^2 - \|\hat{y}\|_2^2 \le 2\,\|y_S - \hat{y}_S\|_1$.

Finally, we construct a series of vectors, indexed by each $i \in S$, that transform $\hat{y}$ into $\tilde{y}$, one coordinate at a time. For the following, let $S(j)$ denote the $j$th element of $S$. First, let $\tilde{y}^{(0)} \triangleq \hat{y}$; then, for $j = 1, \ldots, |S|$, let $\tilde{y}^{(j)}$ be equal to $\tilde{y}^{(j-1)}$ with index $S(j)$ replaced with value $\tilde{y}_{S(j)}$. Note that $\tilde{y}^{(|S|)} = \tilde{y}$. Using the triangle inequality, one can show that

$$\|f(x, \tilde{y}) - f(x, \hat{y})\|_2 = \left\|f(x, \tilde{y}^{(|S|)}) - f(x, \tilde{y}^{(0)})\right\|_2 = \left\|\sum_{j=1}^{|S|} \left(f(x, \tilde{y}^{(j)}) - f(x, \tilde{y}^{(j-1)})\right)\right\|_2 \le \sum_{j=1}^{|S|}\left\|f(x, \tilde{y}^{(j)}) - f(x, \tilde{y}^{(j-1)})\right\|_2 \le B\,\|y_S - \hat{y}_S\|_1. \qquad (7.11)$$

The last inequality uses Equation 7.4, since $\tilde{y}^{(j)}$ and $\tilde{y}^{(j-1)}$ differ at a single coordinate, $S(j)$.

Combining Equations 7.7, 7.9 and 7.10, we have

$$\|\hat{y} - \hat{y}'\|_2^2 \le \frac{1}{2}\left(\|\hat{y} - \hat{y}'\|_2^2 + \|\hat{y}' - \tilde{y}\|_2^2\right) + \frac{1}{2}\,\|y_S - \hat{y}_S\|_1 \le \frac{1}{w_p}\left(\|w\|_2\,\|f(x, \tilde{y}) - f(x, \hat{y})\|_2 + w_p\,\|y_S - \hat{y}_S\|_1\right) + \frac{1}{2}\,\|y_S - \hat{y}_S\|_1 \le \frac{\|w\|_2}{w_p}\,\|f(x, \tilde{y}) - f(x, \hat{y})\|_2 + \frac{3}{2}\,\|y_S - \hat{y}_S\|_1.$$

Using the result from Equation 7.11, we get

$$\|\hat{y} - \hat{y}'\|_2^2 \le \frac{B\,\|w\|_2}{w_p}\,\|y_S - \hat{y}_S\|_1 + \frac{3}{2}\,\|y_S - \hat{y}_S\|_1 \le \left(\frac{3}{2} + \frac{B\,\|w\|_2}{w_p}\right)\|y_S - \hat{y}_S\|_1. \qquad (7.12)$$

We then multiply both sides of the inequality by $1/n$ and take the square root. Using $\frac{1}{n}\,\|\hat{y} - \hat{y}'\|_1 \le \frac{1}{\sqrt{n}}\,\|\hat{y} - \hat{y}'\|_2$ finishes the proof:

$$\frac{1}{\sqrt{n}}\,\|\hat{y} - \hat{y}'\|_2 \le \sqrt{\frac{1}{n}\left(\frac{3}{2} + \frac{B\,\|w\|_2}{w_p}\right)\|y_S - \hat{y}_S\|_1}.$$

Proposition 1 states that the inference regret is proportional to the $L_1$ distance from $y_S$ to $\hat{y}_S$, multiplied by a model-dependent quantity, $O\!\left(\frac{B\,\|w\|_2}{n\,w_p}\right)$. Later in this section, we discuss how to bound the features' Lipschitz constant, $B$, demonstrating that it is typically a small constant (e.g., 1). Thus, assuming $\|w\|_2$ is bounded from above, and the weight on the prior, $w_p$, is bounded from below, the model-dependent term should decrease with the number of variables, $n$. For variables bounded in $[0, 1]$, the Hamming distance upper-bounds the $L_1$ distance. Using this identity, a pessimistic upper bound for the distance term is $\|y_S - \hat{y}_S\|_1 \le |S|$. In this case, the regret is proportional to $O\!\left(\sqrt{|S| / n}\right)$; i.e., the square root of the fraction of the variables that are fixed. While this yields a uniform, analytic upper bound, we gain more insight by considering specific contexts.

For instance, suppose $y_S$ is a set of labels that has been revealed. Then $R_n(x, y_S; \dot{w})$ is the regret of not updating inference conditioned on new evidence, and $\|y_S - \hat{y}_S\|_1$ is the $L_1$ error of the original predictions w.r.t. the true labels. Now, suppose $y_S$ is a set of labels that are fixed from a previous round of inference. Then $R_n(x, y_S; \dot{w})$ is the regret of an approximate inference update, and $\|y_S - \hat{y}_S\|_1$ is the $L_1$ distance between the old predictions and the new predictions in the full inference update. Thus, to minimize this regret, we must fix values that are already close to what we think they will be in the updated MAP state. This criterion motivates our approximate update methods in Section 7.3.

7.2.2 The Lipschitz Constant of the Features

In this section, we give some intuition on how to bound the Lipschitz constant of the features, $B$, by considering a specific example. Suppose the model has a single rule: $X \Rightarrow Y$. The corresponding hinge is $f(X, Y) \triangleq \max\{0, X - Y\}$. Using the fact that $|\max\{0, a\} - \max\{0, b\}| \le |a - b|$, one can show that $\|f(x, y) - f(x, y')\|_2 \le |y_i - y_i'| \le 1$, so $B = 1$.

PSL models typically use rules of this nature, with varying arity (i.e., dyadic, triadic, etc.). In general, $B$ should grow linearly with the number of groundings involving any single variable (i.e., the maximum degree of the factor graph). The number of groundings generated by each rule depends on its arity and the data. For instance, the relational rule in Section 7.1 will ground out once for each edge and each label; if there are 2 labels, and the maximum degree is bounded by a constant, $\Delta$, then the number of groundings generated by this rule for any single variable is at most $2\Delta$. Thus, in many practical models, $B$ will be a small constant.

7.3 Algorithms for Online Inference Activation

The bounds presented in Section 7.2.1 suggest that online collective inference under budget constraints is close to the full inference update when one is able to successfully choose and fix variables whose inferred values will have little or no change. We refer to the complementary process of selecting which variables to infer as activation. In practice, designing an activation algorithm is difficult. The optimization problem required to choose a set of variables, each with heterogeneous regret and optimization cost, that do not exceed an optimization budget is an instance of the NP-hard knapsack problem. Given the intrinsic intractability of selecting an optimal set of variables, we present two algorithms that employ theoretical insights from the previous section and show promise in empirical experiments.

7.3.1 Background: ADMM Optimization

To develop activation algorithms, we turn to the optimization technique used to determine the MAP state in HL-MRFs. Bach et al. (2012) have shown that applying consensus optimization using the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011) provides scalable inference for HL-MRFs. For clearer exposition, we express the inference in terms of the set of ground rules, $G$, and rewrite the energy function from Section 7.1 as:

$$E(y \mid x; \dot{w}) \triangleq \sum_{g \in G} w_g\, f_g(x, y) + \frac{w_p}{2}\,\|y\|_2^2.$$

Here, $w_g f_g(x, y)$ is a weighted potential corresponding to a single ground rule. ADMM substitutes the global optimization problem with local optimizations for each potential using independent copies of the variables. For each grounding $g \in G$, let $y_g$ denote the variables involved in $g$ and $\tilde{y}_g$ indicate the local copy of those variables. To reconcile the local optimizations, ADMM introduces a constraint that local variable copies agree with the global "consensus" for each variable $i$ involved in the grounding; that is, $y_g[i] = \tilde{y}_g[i]$.

This constraint is transformed into an augmented Lagrangian with penalty parameter $\rho > 0$ and Lagrange multipliers $\alpha_g$:

$$\min_{\tilde{y}_g}\; w_g\, f_g(x, \tilde{y}_g) + \frac{\rho}{2}\left\|\tilde{y}_g - y_g + \frac{1}{\rho}\,\alpha_g\right\|_2^2 \qquad (7.13)$$

ADMM iteratively alternates optimizing the local potentials, then updating the consensus estimates and associated Lagrange multipliers for each variable, as follows:

$$\tilde{y}_g \leftarrow \arg\min_{\tilde{y}_g}\; w_g\, f_g(x, \tilde{y}_g) + \frac{\rho}{2}\left\|\tilde{y}_g - y_g + \frac{1}{\rho}\,\alpha_g\right\|_2^2;$$

$$y[i] \leftarrow \operatorname{mean}_g\!\left(\tilde{y}_g[i]\right); \qquad \alpha_g[i] \leftarrow \alpha_g[i] + \rho\left(\tilde{y}_g[i] - y_g[i]\right).$$

A key element of this optimization is the interplay of two components: the weighted potential corresponding to a grounding and the Lagrangian penalty for deviating from the consensus estimate. As optimization proceeds, the Lagrange multipliers are updated to increase the penalty for deviating from the global consensus. At convergence, a balance exists between the two components, reconciling the local minimizer and the aggregate of global potentials.
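To make this interplay concrete, here is a compact, assumption-laden sketch of consensus ADMM for an HL-MRF-style objective. It is not the PSL implementation: the ground potentials are generic Python callables solved numerically, the toy problem is invented, and the L2 prior is folded into the consensus step (with a clipping heuristic) rather than treated as a separate potential.

import numpy as np
from scipy.optimize import minimize

def admm_map(potentials, n, wp=1.0, rho=1.0, iters=100):
    """Consensus ADMM for sum_g w_g*max(0, a_g.z + b_g)**q_g + (wp/2)*||y||^2.
    potentials: list of (w_g, a_g, b_g, q_g, idx_g), a_g an array over idx_g."""
    y = np.zeros(n)                                        # global consensus y[i]
    local = [np.zeros(len(p[4])) for p in potentials]      # local copies
    alpha = [np.zeros(len(p[4])) for p in potentials]      # Lagrange multipliers
    for _ in range(iters):
        # Local step (Eq. 7.13): each ground potential optimizes its own copy.
        for g, (w, a, b, q, idx) in enumerate(potentials):
            target = y[idx] - alpha[g] / rho
            obj = lambda z: (w * max(0.0, float(a @ z) + b) ** q
                             + 0.5 * rho * np.sum((z - target) ** 2))
            local[g] = minimize(obj, local[g], bounds=[(0.0, 1.0)] * len(idx)).x
        # Consensus step: average the copies; the L2 prior shrinks y toward 0.
        num, cnt = np.zeros(n), np.zeros(n)
        for g, (_, _, _, _, idx) in enumerate(potentials):
            num[idx] += local[g] + alpha[g] / rho
            cnt[idx] += 1
        y = np.clip(rho * num / (wp + rho * cnt), 0.0, 1.0)
        # Multiplier step: accumulate each copy's deviation from consensus.
        for g, (_, _, _, _, idx) in enumerate(potentials):
            alpha[g] += rho * (local[g] - y[idx])
    return y, local, alpha

# Toy usage: "pull y0 toward 1" plus the implication y0 => y1.
pots = [(2.0, np.array([-1.0]), 0.9, 1, [0]),           # max(0, 0.9 - y0)
        (1.0, np.array([1.0, -1.0]), 0.0, 1, [0, 1])]   # max(0, y0 - y1)
consensus, copies, multipliers = admm_map(pots, n=2)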

7.3.2 ADMM Features

The goal of activation is to determine which variables are most likely to change in a future inference. From the analysis in the previous section, we can identify several basic elements for each variable in the model that serve as features for an activation algorithm.

For each variable, we have its value at convergence ($y[i]$), and for each grounding $g$, the weight ($w_g$), the value of the potential ($f_g(x, \tilde{y}_g)$), and the Lagrange multipliers ($\alpha_g[i]$) measuring the aggregate deviation from consensus. We discuss each of these features to motivate their importance in an activation algorithm.

The value of a variable at convergence can provide a useful signal in certain situations, where a model has clear semantics. For example, the formulation of HL-MRFs often lends itself to a logical interpretation with binary outcomes, as in the cases of collective classification of attributes that are either present or absent. In this setting, assignments in the vicinity of 0.5 represent uncertainty, and therefore provide good candidates for activation. Unfortunately, this feature is not universal. Many successful HL-MRF models adopt semantics that use continuous values to model continuous variables, such as pixel intensity in image completion tasks or Likert-scale ratings in recommender systems. In this case, the semantics of the variable's consensus value may provide an ambiguous signal for activation.

The weighted potentials of each variable contribute directly to the probability of the MAP configuration. Since the log-probability is proportional to the negated energy, $-E$, high weights and high potential values decrease the probability of the assignment. Intuitively, activating those variables that contribute high weighted potentials provides the best mechanism for approaching the full inference MAP state. A complication to this approach is that each weighted potential can depend on many variables. However, the potential value is a scalar quantity and there is no general mechanism to apportion the loss to the contributing variables.

In contrast, the Lagrange multipliers provide a granular perspective on each variable's effect on Equation 7.13. For each variable copy ($\tilde{y}_g$), the Lagrange multiplier aggregates the difference between the copy and the global consensus across iterations. High Lagrange multipliers signal discord between the local minimizer and the global minimizer, indicating volatility. Activating variables with high Lagrange multipliers can resolve this discord in future inference using updated evidence. However, updated evidence may also resolve the disagreement between the local and global minimum, obviating an update to the variable.

7.3.3 Activation Algorithms

Building on our analysis of ADMM optimization, we introduce two activation algorithms for online collective inference, “agnostic activation” and “relational activation”. Both algorithms produce a ranking that prioritizes each variable for inclusion in inference. The key difference between these algorithms is whether new or updated evidence is an input to the algorithm. Agnostic activation scores variables concurrently with inference, based on their susceptibility to change in future inferences. In contrast, relational activation runs prior to inference, with scores based primarily on relationships between variables and updated evidence in the factor graph.

Each approach has different advantages. Agnostic activation scores variables during inference, providing a performance advantage since the scoring algorithm does not delay a future run of inference. However, this technique has a slower response to new evidence since scoring occurs before such evidence is available. Relational activation can respond to newly-arrived evidence and choose variables related to new evidence, but this requires delaying scoring, which can add a computational overhead to inference.

Both activation algorithms output a ranking of the variables, which requires a scoring function. We introduce two scoring functions that use the ADMM features described in Section 7.3.2. Our first scoring function, VALUE, captures the intuition that uncertain variables are valuable activation candidates using the function $1 - |0.5 - y[i]|$, where $y[i]$ is the consensus value for variable $i$. The second scoring function, WLM, uses both the weight and Lagrange multipliers of each potential. For each variable, we define a set of weighted Lagrange multiplier magnitudes, $A_w[i] \triangleq \{|w_g \alpha_g[i]|\}$. To obtain a single scalar score, we take the maximum value of $A_w[i]$.
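A minimal sketch of these two scoring functions, assuming the per-variable ADMM features are available as plain Python values (the data layout here is an assumption for illustration):

def value_score(consensus_value):
    """VALUE: uncertainty of the consensus value, 1 - |0.5 - y[i]|."""
    return 1.0 - abs(0.5 - consensus_value)

def wlm_score(groundings):
    """WLM: max weighted Lagrange-multiplier magnitude, max |w_g * alpha_g[i]|.
    groundings: iterable of (w_g, alpha_g_i) pairs touching this variable."""
    return max(abs(w_g * alpha_gi) for w_g, alpha_gi in groundings)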

The agnostic activation algorithm simply ranks each variable by its score from a scoring function, irrespective of the new evidence. The RELATIONAL algorithm combines the score with information about the new evidence. Using the existing ground model, RELATIONAL first identifies all ground potentials dependent on the new evidence. Then, using these ground potentials as a seed set, the algorithm performs a breadth-first search of the factor graph, adding the variables involved in each factor it encounters to the frontier. Traversing the factor graph can quickly identify many candidate variables, so we prioritize variables in the frontier by $S / 2^d$, where $S$ is the score assigned by a scoring function and $d$ is the minimum distance between the variable and an element of the seed set in the factor graph.
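The following sketch illustrates this traversal; the bipartite factor-graph encoding, the score callback, and the choice of measuring distance in variable hops from the seed factors are assumptions of this illustration, not the implementation used in the experiments.

from collections import deque

def relational_activation(var_to_factors, factor_to_vars, seed_factors, score):
    """Returns {variable: priority}; higher priority means activate sooner."""
    priority, depth = {}, {}
    frontier = deque()
    for fac in seed_factors:                 # factors touched by new evidence
        for var in factor_to_vars[fac]:
            if var not in depth:
                depth[var] = 0               # distance d from the seed set
                frontier.append(var)
    while frontier:
        var = frontier.popleft()
        priority[var] = score(var) / (2 ** depth[var])    # S / 2^d
        for fac in var_to_factors[var]:      # expand through shared factors
            for nbr in factor_to_vars[fac]:
                if nbr not in depth:
                    depth[nbr] = depth[var] + 1
                    frontier.append(nbr)
    return priority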

The ranking output by either agnostic or relational activation lets us prioritize which variables to activate. Given a budget for the number or percentage of variables to infer, we activate a corresponding number of variables from the ranking. The remaining variables are constrained to their previously inferred values. We selectively ground the model, including only those rules that involve an activated variable. Following inference on the ground model, we use the updated optimization state to produce new scores.

When an inactive variable is treated as a constant, it does not have any associated Lagrange multipliers, and lacks features for the WLM scoring function. Therefore, instead of treating fixed variables as constants, we introduce them as constrained variables in the optimization. This allows us to generate features by capturing the discrepancy between a variable's constrained value and the value of its local copies in groundings involving activated variables.

Our implementation of the agnostic activation algorithm is extremely efficient; all necessary features are byproducts of the inference optimization. Once scores are computed and the activated atoms are selected, the optimization state can be discarded to avoid additional resource commitments. In relational activation, scoring is similarly efficient, but there is an additional overhead of preserving the ground model to allow fast traversal of the factor graph. By selectively grounding the model, we replace queries that scan the entire database, potentially many times, with precise queries that exploit indices for faster performance. Finally, selectively activating atoms produces an optimization objective with fewer terms, allowing quicker optimization.

7.4 Evaluation

To better understand the regret bounds and approximation algorithms for online inference, we perform an empirical evaluation on two online collective inference settings. The first setting is a synthetic online collective classification task where the data generator allows us to modulate the importance of collective dependencies and control the amount of noise. The second evaluation setting is a real-world collaborative filtering task, where user preferences are incrementally revealed and the outputs of a recommender system are correspondingly updated. In order to support repeated full inference of all variables, both of these datasets are necessarily small.

In both evaluation settings, we measure regret relative to full inference and inference error relative to ground truth. The results demonstrate that empirical regret follows the form of our regret bounds. We also evaluate the approximation algorithms presented in Section 7.3.3, to investigate whether features from the optimization algorithm can reliably determine which variables to activate. The results show that our approximation algorithms are able to reduce running time by upwards of 65%, with low inference regret relative to full inference.

All experiments are implemented using the open-source PSL framework and our code is available on GitHub.1


Figure 7.1: Inference regret, w.r.t. full inference, of fixing the original MAP state (i.e., no updates) in the HIGHLOCAL, HIGHCOLLECTIVE and BALANCED data models.


Figure 7.2: Inference regret (w.r.t. full inference) and MAE (w.r.t. ground truth) using various approximation algorithms, with 50% activation, in the COMPLEX data model.

7.4.1 Online Collective Classification

Our evaluation data simulates a collective classification problem of inferring labels for users in a social network as new evidence is incrementally revealed. Each user is assigned one of two mutually exclusive labels. Some portion of the users have observed labels, while the labels of the remaining users are inferred. At each epoch, the label of one more user is revealed, so the model must update the inferred labels for the remaining users with unknown labels.

1 https://github.com/puuj/uai15-boci-code

For each user, we generate local and collective features correlated with the user’s label. Local features are generated for each user and label by drawing from a Gaussian distribution conditioned on the label, such that the mean is t for the true label and 1−t for the incorrect label. The collective features are links between users, generated randomly using the following process: for each pair of users with the same label, a link is generated with probability p; for each pair of users with different labels, a link is created with probability 1 − p. We refer to p as the affinity of the network.
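A rough sketch of this generator follows, under assumed values for the feature noise and network size; it is meant only to make the generative process concrete, not to reproduce the exact experimental setup.

import numpy as np

def generate_network(n_users=100, t=0.7, p=0.75, sigma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_users)          # one of two labels

    # Local features: mean t for the true label, 1 - t for the incorrect label.
    features = np.empty((n_users, 2))
    for u, lbl in enumerate(labels):
        features[u, lbl] = rng.normal(t, sigma)
        features[u, 1 - lbl] = rng.normal(1 - t, sigma)

    # Collective features: same-label pairs linked with probability p,
    # different-label pairs with probability 1 - p (the network's "affinity").
    edges = []
    for u in range(n_users):
        for v in range(u + 1, n_users):
            prob = p if labels[u] == labels[v] else 1 - p
            if rng.random() < prob:
                edges.append((u, v))
    return labels, features, edges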

We model the data using the PSL rules described in Section 7.1 and learn weights for the model. Varying the parameters of the data generator impacts inference in the learned model, since the learned weights are proportional to the discriminative power of their associated rules. For example, varying the distance between the conditional means of the local features controls the importance of the local evidence rule: when the means are far apart, local evidence has high discriminative power; however, when the means are close, local evidence does not provide much signal.

We introduce three data models: HIGHLOCAL (t = .8, p = .75), HIGHCOLLECTIVE (t = .55, p = .9), and BALANCED (t = .7, p = .75). We combine these three conditions in a fourth data model, COMPLEX, which samples uniformly from the three settings on a per-user basis, resulting in heterogeneous evidence. For each condition, we generate 10 trials, each with a training social network used to learn the model parameters and a separate test social network to evaluate inference quality. Both the training and test graphs have 100 users, with 60 observed user labels in the training graph and 10 observed user labels in the test graph. To infer user attributes, we use the simple collective classification model introduced in Section 7.1. We simulate the process of online inference by creating a sequence of observations consisting of 50 epochs. In each epoch, the true label of a previously unknown user is revealed, resulting in 60 observed user labels at the end of the sequence. For each trial, we generate 10 such sequences from a breadth-first traversal of the network from a randomly chosen user, resulting in a total of 5000 inferences.

In the first experiment, shown in Figure 7.1, we measure the inference regret of fixing variables to the initial MAP state (i.e., not updating inference) over 50 epochs, comparing the HIGHLOCAL, HIGHCOLLECTIVE and BALANCED conditions. Our theoretical analysis predicts that the worst-case regret grows at rate $O\!\left(\sqrt{\mathrm{epoch}/n}\right)$. The experimental results exhibit the same growth rate, which is very pronounced for the HIGHCOLLECTIVE data model, where variables are strongly interdependent, and less so for HIGHLOCAL, where variables are largely independent. The key insight is that the collective nature of the inference task determines the regret of online updates.

In the second experiment (Figure 7.2), we compare the approximate scoring algorithms with a budget of 50% of unknowns to running full inference on the COMPLEX network. We measure significance across 100 total sequences using a paired t-test with rejection threshold .05. For inference regret, we compare against the static algorithm, DONOTHING, which does not update the MAP state, and a random baseline, RANDOM, that fixes an arbitrary subset of 50% of the variables. We compare these to three approximation algorithms described in Section 7.3: VALUE, which uses the value assigned to the variable; WLM, which uses the maximum of the weighted Lagrange multipliers; and RELATIONAL, which uses WLM to prioritize exploration.

All methods exhibit low regret relative to full inference, contrasting the high regret of the static algorithm, although VALUE exhibits somewhat higher regret. The WLM and RELATIONAL methods have significantly lower regret relative to RANDOM, in 98% and 100% of epochs, respectively. We also compare the mean absolute error (MAE), with respect to ground truth, of using full inference vs. the approximations. This illustrates that the approximation algorithms remain competitive with full inference, although VALUE again lags in accuracy. Here, the WLM and RELATIONAL methods have significantly lower error than RANDOM in 80% and 100% of epochs, respectively. Comparing the running times highlights the computational benefit of using the approximation algorithms. The average running time for a single trial (which includes training and 10 random sequences of revealed variables) using full inference is 3076 seconds, while approximate inference requires only 955 seconds, a reduction of 69%, with inference time varying less than 3% across methods.

7.4.2 Collaborative Filtering

Our second evaluation task is a collaborative filtering task that employs a collective model to infer the preferences of users. We use the Jester dataset (Goldberg et al., 2001) which includes ratings from 24,983 users on a set of 100 jokes. The task in this setting is to infer the user’s rating of each joke. We use the model from Bach et al. (2013) which assigns ratings to jokes based on the joke’s similarity to other jokes rated highly by the user. Joke similarity is measured using the mean-adjusted cosine similarity of the observed ratings

of two jokes. (Refer to Bach et al. (2013) for further model details.) We sample 200 users who have rated all 100 jokes and split them into 100 training users and 100 testing users.

Figure 7.3: Inference regret (w.r.t. full inference) and RMSE (w.r.t. ground truth) for the Jester dataset: (a) inference regret at 25% activation, (b) RMSE at 25% activation, (c) inference regret at 50% activation, (d) RMSE at 50% activation.
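As an aside, the mean-adjusted cosine similarity used above to relate jokes can be computed roughly as follows; the ratings-matrix layout and the missing-value handling are assumptions of this sketch, not details taken from Bach et al. (2013).

import numpy as np

def mean_adjusted_cosine(ratings, j, k):
    """ratings: (users x jokes) array with np.nan for unobserved entries."""
    user_means = np.nanmean(ratings, axis=1)       # each user's mean rating
    adjusted = ratings - user_means[:, None]       # subtract per-user means
    observed = ~np.isnan(ratings[:, j]) & ~np.isnan(ratings[:, k])
    a, b = adjusted[observed, j], adjusted[observed, k]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))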

We generate 10 sequences, each of which consists of a training and testing phase. Model weights are learned with 75% of the training users' ratings observed. During testing, we incrementally reveal [25%, 30%, 40%, ..., 75%] of the testing users' ratings, performing online collective inference at each epoch.

We compare inference regret, relative to full inference, for the RANDOM, VALUE, WLM and RELATIONAL approximate methods. We also plot the RMSE, relative to ground truth, for full inference and all approximate methods. Figures 7.3a-b show results for 25% activation, and Figures 7.3c-d show 50% activation. Inference regret follows a similar pattern for both budgets, with VALUE showing increasing regret over epochs, and the remaining methods exhibiting level or diminishing regret after the first few epochs. The high regret for VALUE can be explained by considering the RMSE—VALUE actually improves upon the results of full inference, incurring high regret but low RMSE. Our intuition for this improvement is that VALUE fixes polarized user ratings and allows these ratings to have greater influence on other unknown ratings, while full inference produces more moderate ratings for the entire set. The other approximation algorithms remain close to the full inference RMSE (at 50% activation) or perform slightly worse (at 25% activation). Comparing the running times, we find a similar improvement in speed. The average time for a sequence using full inference is 137 seconds, while the approximate methods require only 46 seconds, yielding a speedup of 66%. Approximation methods had consistent timing, varying less than 6%.

7.5 Discussion

In this chapter, we introduce a new problem, budgeted online collective inference, which addresses a common problem setting where online inference is necessary but full inference is infeasible, thereby requiring approximate inference updates. Our contributions are: (1) a formal analysis of online collective inference, introducing the concept of inference regret to measure the quality of the approximation; (2) analytic upper bounds on the inference regret incurred by strongly convex inference; and (3) several algorithms to address the practical problem of activation (i.e., choosing which variables to infer at each epoch), through a close analysis of the MAP inference optimization. The empirical results demonstrate that our activation algorithms exhibit low inference regret and error that is competitive with full inference, while reducing the time required for inference by 65% or more.

This work inspires many exciting areas of future research. One open question is whether one can derive a tighter regret bound using the mechanics of the activation strategy, thus characterizing how performance degrades as a function of the budget. We are also interested in learning an "optimal" activation policy, trained using the variables whose values change the most during full inference. Finally, a crucial assumption in our analysis is that the model structure is fixed, but it is useful to consider the setting in which the set of variables changes over time, allowing us to address situations such as new users joining a social network.

Chapter 8: Conclusion and Future Work

In this dissertation, I have developed a framework for knowledge graph construction that addresses many of the practical challenges confronting knowledge base construction systems. Knowledge graph construction requires overcoming many pathological errors in information extraction, a problem I address through the formulation of knowledge graph identification. Implementations of knowledge graph identification must incorporate statistical features from information extraction systems and ontological constraints, and I develop a knowledge graph identification model that is capable of using both these sources of information. Entity resolution is a significant hurdle for knowledge graph construction, and I develop a general model for entity resolution that exploits the relational features in a knowledge graph and is adaptable to many different scenarios.

While the models I develop integrate many powerful features, a key concern is that such models will not scale to realistic problem settings. My choice of hinge-loss Markov random fields alleviates scalability concerns by framing the knowledge graph identification problem as convex optimization. I further improve performance by developing a distributed version of knowledge graph identification.

Finally, I explore what I believe is a new frontier for probabilistic models: streaming (or online) inference. Many real-world problem settings, such as knowledge graph construction, require updating the inferences of a probabilistic model with new evidence. However, scant research has been devoted to practical approaches for adapting inference results in response to evidence. My work on online collective inference addresses this important area. I introduce inference regret to quantify the consequences of making an approximate update to inference results. By bounding inference regret, I show that such updates are feasible and can preserve inference quality when the updated variables are carefully chosen. I devise several algorithms for updating inference that show attractive empirical performance.

8.1 Future Work

The pursuit of knowledge has tantalized humanity for ages, and remains among the central challenges in artificial intelligence research. Many open questions and practical challenges still confront knowledge graph construction. Three areas where I see great rewards from further research are unifying diverse approaches in the knowledge base construction community, extending my work on online inference to address a broader set of problem settings, and exploring potential applications of the algorithms I devised for online inference to active approaches for probabilistic inference.

While I have been working on knowledge graph identification, a number of different and very promising approaches to information extraction, natural language understanding, and knowledge base construction have been developed. One such development has been the rise of vector space models of language. These models, often trained with recurrent neural networks, capture a latent representation of entities, and use vector operations to

determine the relationships between entities. These features could easily be incorporated into knowledge graph identification, and might provide a unique signal to complement the inputs from an information extraction system. Other popular models in knowledge base construction include matrix factorization approaches that are capable of populating many missing relationships between entities, and locally-grounded, random-walk-based inference approaches. Finding the right way to integrate these approaches with knowledge graph identification while preserving the unique strengths of each remains a compelling open problem.

The second extension to my dissertation research that seems extremely promising is expanding online collective inference to accommodate a more diverse set of problem settings. In my work on online inference, a key limitation is that the structure of the graphical model and the inferred variables are both fixed. This constraint poses difficulties for many real-world problems which require open-world reasoning, such as when a new user or a new item is added to a recommender system, or a new entity is added to a knowledge graph. I believe that extending online collective inference to include open-world reasoning under certain circumstances may be straightforward. The main obstacle is that the variables in the probability distribution change over time, and I suspect the key to this obstacle may lie in work on default reasoning and lazy inference. By identifying the conditions under which inference can be updated with new variables without requiring a recomputation of the entire MAP state, I anticipate the bound on inference regret can be applied to a broader set of problem settings, and this may in turn yield new insights into algorithms for online inference.

A third area I hope to explore in future work is the potential of active approaches to

probabilistic inference. The problem setting I would like to address is one where evidence can be actively acquired by consulting an oracle, which may take the form of a label provided by a human or a value generated through an expensive computation. The expense of acquiring labels constrains the system to only request the most useful labels. Additionally, once labels are acquired, inference must be repeated using the new evidence.

This setting bears many similarities to the online setting. Approximate, partial inference updates may be the key to tractability in this setting. However, the same methods used for online inference have another application: choosing which labels to acquire. One possibility is using the algorithms designed to activate variables for online inference to choose which labels to actively query. In exploring this setting, it may be the case that the algorithms that are most effective for choosing informative labels to acquire differ from the algorithms that select the best variables to infer. Understanding the differing performance of these algorithms may lend new insight into my existing approach to online inference, as well as inspire new algorithms for approximate online inference.

Appendix A: Sample PSL Program for Knowledge Graph Identification

The following code demonstrates a very simple, but complete, PSL program that implements some of the features of knowledge graph identification. The code is written in Groovy, an easily-specified, interpreted scripting language that compiles to Java code. The use of Groovy as an interface to PSL allows users to easily specify and manipulate models using an intuitive syntax rather than dealing with the complex object model defined in the full Java software package. Detailed comments that explain the program follow the code.

1  //Instantiate datastore and model
2  ConfigBundle emptyConfig = new EmptyBundle();
3  DataStore datastore = new RDBMSDataStore(
4      new H2DatabaseDriver(Type.Disk,
5          '/tmp/psl', true), emptyConfig);
6  PSLModel model = new PSLModel(this, datastore);
7
8  //define model predicates
9  ArgumentType uid = ArgumentType.UniqueID;
10 model.add predicate: "Lbl", types: [uid, uid];
11 model.add predicate: "CandLbl", types: [uid, uid];
12 model.add predicate: "Rel", types: [uid, uid, uid];
13 model.add predicate: "CandRel", types: [uid, uid, uid];
14 model.add predicate: "Dom", types: [uid, uid];
15 model.add predicate: "Mut", types: [uid, uid];
16 model.add predicate: "Sub", types: [uid, uid];
17
18 //define model rules
19 model.add rule: ~Lbl(E,L), weight : 0.5;
20 model.add rule: ~Rel(E1,E2,R), weight : 0.5;
21 model.add rule: ( CandLbl(E,L) ) >> Lbl(E,L),
22     weight : 1;
23 model.add rule: ( CandRel(E1,E2,R) ) >> Rel(E1,E2,R),
24     weight : 1;
25 model.add rule: ( Dom(R,L) & Rel(E1,E2,R) ) >> Lbl(E1,L),
26     constraint: true;
27 model.add rule: ( Mut(L1,L2) & Lbl(E,L1) ) >> ~Lbl(E,L2),
28     constraint: true;
29 model.add rule: ( Sub(L1,L2) & Lbl(E,L1) ) >> Lbl(E,L2),
30     constraint: true;
31
32 //load data and set up database
33 Partition inferences = datastore.getPartition("output");
34 Partition evidence = datastore.getPartition("input");
35
36 def pNames = ["CandLbl", "CandRel", "Dom", "Mut", "Sub"];
37 PredicateFactory pFactory = PredicateFactory.getFactory();
38 for( String pName : pNames ){
39     Predicate p = pFactory.getPredicate(pName);
40     Inserter inserter = datastore.getInserter(p, evidence);
41     InserterUtils.loadDelimitedData(inserter, pName+".tsv");
42 }
43
44 Database inferenceDB = datastore.getDatabase(inferences,
45     [CandLbl, CandRel, Dom, Mut, Sub], evidence);
46
47 //Run inference
48 MPEInference mpe = new LazyMPEInference(model,
49     inferenceDB,
50     emptyConfig);
51 InferenceResult result = mpe.optimize();
52
53 mpe.close(); inferenceDB.close(); datastore.close();

Detailed Comments

Line 3 of the program instantiates a datastore, PSL's mechanism for interacting with databases. The database implementation in this case is a relational database management system. The arguments to the constructor specify that an H2 database should be used, and a new database file created at the path /tmp/psl, with no additional configuration modification.

Line 6 instantiates a PSL model, and lines 10-16 define the logical predicates used by the model. While the arguments to a predicate can take many possible types (such as String, Double, Integer, Date), in this model all arguments are unique identifiers. For readability, this argument type is defined on line 9.

The next component of a PSL model is the specification of logical rules that determine relationships between variables. Lines 19 and 20 contain rules specifying negative priors for the Lbl and Rel predicates, respectively. Note that the ~ symbol specifies negation in PSL. The negative prior enforces that any fact unsupported by evidence will have a false value. To avoid overwhelming evidence, the negative prior is given a low weight (0.5).

Lines 21-24 link the facts in the knowledge graph to candidate extractions in the evidence. In practice, a variety of different techniques would be used to generate these extractions, and each technique would correspond to a separate pair of rules. In this simple example, only one pair of rules is used. These rules are given a low weight of 1.0, although this weight is higher than the prior. In a more sophisticated program, the weight of each rule would be learned.

Lines 25-30 introduce ontological constraints in the model. Lines 25-26 contain a rule expressing the Domain constraint: if the domain of a relation R is L, and the relation R holds between entities E1 and E2, then the entity E1 has label L. Line 26 specifies that this rule is to be treated as a constraint: any feasible solution to the inference problem must obey this constraint. A second way of thinking of constraints is that they are rules with infinite weight; the probability of any interpretation with an unsatisfied constraint thus approaches 0. Lines 27-30 similarly define the mutual exclusion and subsumption of labels as constraints.

Lines 33-34 define partitions of the database for inferred variables and evidence atoms. A partition is a logical division of atoms in the database. Partitions help specify the role of an atom during learning or inference by allowing the user to easily differentiate the atoms that will be used as observed evidence, left unobserved and excluded from the model, or selected as targets for the inference process. Here, I define two partitions: one for the output of the model, the inferences of the knowledge graph, and the second as the input to the model of extracted candidates and ontological relationships. These partitions are referenced through the names "output" and "input", respectively.

In lines 36-42, I load data into the database. Line 36 specifies the names of the predicates for which data is loaded. Starting on line 38, a for loop iterates over each of these predicate names. On line 39, a Predicate object is retrieved for the predicate that corresponds to the given predicate name. For this predicate, an Inserter object is generated by the datastore on line 40. The inserter can be used to insert data for a specified predicate into a specific partition, in this case the evidence partition. This inserter is used to load data on line 41 with the help of a convenience method from a utility class, InserterUtils. The data is loaded from a file specified as the second argument. In this program, the files are assumed to be in the working directory and named based on the predicate for which they contain data. For example, CandRel.tsv would contain data for the CandRel predicate, consisting of a series of lines, each specifying an atom with its arguments delimited by tabs.

Lines 44-45 specify a Database object for use during inference. This nomenclature may be somewhat confusing, since it conflicts with that of databases in the datastore. In PSL, a database is a specification of evidence and inference partitions, accompanied by a set of atoms that are fully observed. In this case, the first argument to the getDatabase function is the “write” partition where new inferences will be written. The second argument is the set of predicates that are fully observed. The consequence of specifying these predicates as fully observed is that PSL will make a closed-world assumption for each predicate, assigning any atom that is absent from the inference and evidence partitions a value of 0. The final argument to the function is the evidence partition. For convenience, PSL allows an arbitrary number of partitions to be specified as evidence; however, only one is necessary in this simple program.
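Under the same assumptions as the earlier sketches, the database construction might look like the following: the write partition comes first, the set of fully observed (closed-world) predicates second, and the evidence partition last.

    // Sketch: inferences are written to the output partition; the evidence predicates
    // are closed under a closed-world assumption; evidence is read from the input partition.
    Database db = data.getDatabase(output, [CandRel, CandLbl, Domain, Mut, Sub] as Set, input)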

Lines 48-51 perform inference in the PSL model. Lines 48-50 define the inference object. The arguments to the inference object are the model (which specifies the rules the model will use during inference), the inference database (specifying which atoms to use as evidence and the location for new inferences), and a configuration package, which needs no special settings for this example. The inference class used in this example is LazyMPEInference. Lazy inference is an iterative procedure in which inference targets are computed using the available evidence. After inference is completed, the grounding process is repeated using the new inferences as well as the initial evidence. The benefit of lazy inference is that the inference targets do not need to be enumerated and specified in advance, a process that may be cumbersome for some users. Additionally, in models where outputs are sparse, lazy inference can improve scalability by reducing the memory footprint of the model. The drawback of lazy inference is that inference must be run repeatedly until no new inferences are possible.

Finally, on line 51, the inference optimization is called. By default, PSL will compile the model and evidence into a set of ground rules, convert each rule into an optimization potential, and then use the ADMM algorithm to perform a joint optimization across these potentials to determine the configuration of variables that minimizes the energy function, thus maximizing the probability of the output.
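A sketch of this inference step, assuming the LazyMPEInference class named above and a previously constructed configuration bundle, is shown below.

    // Sketch: build the lazy MPE inference application (lines 48-50) and run the
    // ADMM-based optimization (line 51); inferred atoms are written to the
    // database's write partition.
    LazyMPEInference inference = new LazyMPEInference(m, db, config)
    inference.mpeInference()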

In practice, it is often convenient to export the inference output from the database to an easily manipulable format, such as a text file. However, for the sake of brevity, this is not included in the example. Line 52 simply closes the objects constructed during the program, committing any output and releasing any resources held by the program back to the system.

Appendix B: Additional Results for Knowledge Graph Identification

B.1 Baseline Results

Table B.1: Results for the baseline model on the closed-world knowledge graph identification problem for NELL for all facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.804  0.677      0.989
0.200      0.804  0.677      0.989
0.300      0.804  0.677      0.989
0.400      0.804  0.677      0.989
0.500      0.742  0.695      0.797
0.600      0.695  0.702      0.689
0.700      0.144  0.987      0.078
0.800      0.144  0.987      0.078
0.900      0.144  0.987      0.078
1.000      0.144  0.987      0.078

Table B.2: Results for the baseline model on the closed-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.839  0.724      0.998
0.200      0.839  0.724      0.998
0.300      0.839  0.724      0.998
0.400      0.839  0.724      0.998
0.500      0.735  0.788      0.688
0.600      0.641  0.847      0.515
0.700      0.102  0.990      0.054
0.800      0.102  0.990      0.054
0.900      0.102  0.990      0.054
1.000      0.102  0.990      0.054

Table B.3: Results for the baseline model on the closed-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.713  0.553      1.000
0.100      0.751  0.611      0.975
0.200      0.751  0.611      0.975
0.300      0.751  0.611      0.975
0.400      0.751  0.611      0.975
0.500      0.751  0.611      0.975
0.600      0.751  0.611      0.975
0.700      0.208  0.985      0.116
0.800      0.208  0.985      0.116
0.900      0.208  0.985      0.116
1.000      0.208  0.985      0.116

Table B.4: Comparison of the baseline model and PSL-KGI on the closed-world knowledge graph identification problem for NELL for all facts. The results show a sample of facts with the maximally-differing truth values between the two methods.

Fact                                        Baseline  PSL-KGI
REL(cubs,playoffs,teamWonTrophy)            0.67      0.00
REL(packers,playoffs,teamWonTrophy)         0.67      0.00
REL(chargers,playoffs,teamWonTrophy)        0.67      0.00
REL(colts,playoffs,teamWonTrophy)           0.67      0.00
REL(aol,bebo,acquired)                      0.67      0.00
REL(lakers,playoffs,teamWonTrophy)          0.67      0.00
REL(adobe,adobe acrobat,producesProduct)    0.67      0.00

B.2 Results Excluding Extractor Source Information

Table B.5: Results for the NoSrcs model on the closed-world knowledge graph identification problem for NELL for all facts. The model does not use different predicates for the different NELL extractors. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.810  0.687      0.986
0.200      0.810  0.687      0.985
0.300      0.810  0.687      0.985
0.400      0.810  0.687      0.985
0.500      0.810  0.687      0.985
0.600      0.823  0.737      0.932
0.700      0.801  0.848      0.759
0.800      0.536  0.975      0.370
0.900      0.450  0.993      0.291
1.000      0.377  0.994      0.232

Table B.6: Results for the NoSrcs model on the closed-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.849  0.742      0.993
0.200      0.849  0.742      0.992
0.300      0.849  0.742      0.992
0.400      0.849  0.742      0.992
0.500      0.849  0.742      0.992
0.600      0.850  0.744      0.992
0.700      0.806  0.792      0.821
0.800      0.349  0.940      0.214
0.900      0.162  0.982      0.088
1.000      0.114  0.982      0.060

Table B.7: Results for the NoSrcs model on the closed-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.713  0.553      1.000
0.100      0.751  0.612      0.974
0.200      0.751  0.612      0.974
0.300      0.751  0.612      0.974
0.400      0.752  0.612      0.974
0.500      0.752  0.612      0.974
0.600      0.775  0.724      0.834
0.700      0.791  0.992      0.658
0.800      0.768  0.996      0.625
0.900      0.768  0.996      0.625
1.000      0.679  0.997      0.514

Table B.8: Comparison of the NoSrcs model and PSL-KGI on the closed-world knowledge graph identification problem for NELL for all facts. The results show a sample of facts with the maximally-differing truth values between the two methods.

Fact                                   NoSrcs  PSL-KGI
LBL(hash brown potatoes,food)          0.54    0.95
LBL(doctor zhivago,creativework)       0.57    0.97
LBL(doctor zhivago,movie)              0.57    0.97
LBL(twilight,creativework)             0.61    0.97
LBL(twilight,movie)                    0.61    0.97
LBL(sideways,creativework)             0.61    0.97
LBL(sideways,movie)                    0.61    0.97
LBL(boogie nights,creativework)        0.64    0.97

B.3 Results Excluding Entity Resolution Information

Table B.9: Results for the NoER model on the closed-world knowledge graph identification problem for NELL for all facts. The model does not use entity resolution rules or information. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.810  0.687      0.985
0.200      0.810  0.687      0.985
0.300      0.811  0.690      0.984
0.400      0.827  0.724      0.964
0.500      0.851  0.768      0.955
0.600      0.848  0.787      0.920
0.700      0.821  0.820      0.821
0.800      0.697  0.882      0.576
0.900      0.543  0.896      0.389
1.000      0.414  0.996      0.261

Table B.10: Results for the NoER model on the closed-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.849  0.742      0.992
0.200      0.849  0.742      0.992
0.300      0.849  0.742      0.992
0.400      0.849  0.742      0.992
0.500      0.848  0.742      0.989
0.600      0.843  0.757      0.952
0.700      0.807  0.785      0.830
0.800      0.619  0.845      0.488
0.900      0.380  0.779      0.251
1.000      0.090  0.989      0.047

Table B.11: Results for the NoER model on the closed-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.713  0.553      1.000
0.100      0.751  0.612      0.974
0.200      0.751  0.612      0.974
0.300      0.755  0.618      0.969
0.400      0.790  0.694      0.917
0.500      0.857  0.819      0.899
0.600      0.857  0.846      0.867
0.700      0.845  0.889      0.806
0.800      0.811  0.928      0.720
0.900      0.761  0.996      0.616
1.000      0.759  0.997      0.613

Table B.12: Comparison of the NoER model and PSL-KGI on the closed-world knowledge graph identification problem for NELL for all facts. The results show a sample of facts with the maximally-differing truth values between the two methods.

Fact                                   NoER  PSL-KGI
LBL(ussr,organization)                 0.24  1.00
LBL(ampalaya,food)                     0.35  0.99
LBL(ampalaya,vegetable)                0.35  0.99
LBL(acc conference,organization)       0.37  1.00
LBL(acc conference,sportsleague)       0.37  1.00
LBL(bell centre,building)              0.40  1.00
LBL(bell centre,location)              0.42  1.00
LBL(bell centre,attraction)            0.43  1.00

B.4 Results Excluding Ontological Information

Table B.13: Results for the NoOnto model on the closed-world knowledge graph identification problem for NELL for all facts. The model does not use ontological rules or information. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.804  0.677      0.989
0.200      0.804  0.677      0.989
0.300      0.804  0.677      0.989
0.400      0.800  0.682      0.969
0.500      0.830  0.815      0.845
0.600      0.816  0.832      0.800
0.700      0.757  0.853      0.681
0.800      0.689  0.880      0.566
0.900      0.217  0.983      0.122
1.000      0.165  0.985      0.090

Table B.14: Results for the NoOnto model on the closed-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.839  0.724      0.998
0.200      0.839  0.724      0.998
0.300      0.839  0.724      0.998
0.400      0.834  0.734      0.966
0.500      0.817  0.824      0.809
0.600      0.794  0.852      0.744
0.700      0.699  0.864      0.587
0.800      0.649  0.860      0.521
0.900      0.169  0.982      0.092
1.000      0.086  0.988      0.045

Table B.15: Results for the NoOnto model on the closed-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.810  0.687      0.985
0.200      0.810  0.687      0.985
0.300      0.811  0.690      0.984
0.400      0.827  0.724      0.965
0.500      0.851  0.767      0.956
0.600      0.848  0.786      0.921
0.700      0.824  0.821      0.827
0.800      0.710  0.884      0.594
0.900      0.570  0.899      0.418
1.000      0.439  0.993      0.281

Table B.16: Comparison of the NoOnto model and PSL-KGI on the closed-world knowledge graph identification problem for NELL for all facts. The results show a sample of facts with the maximally-differing truth values between the two methods.

Fact                                              NoOnto  PSL-KGI
REL(community college,baseball,teamPlaysSport)    0.81    0.00
REL(convention center,anaheim,stadiumInCity)      0.81    0.00
LBL(ncaa,football)                                0.80    0.01
LBL(comiskey park,location)                       0.24    1.00
LBL(cardiff intl arena,location)                  0.24    1.00
LBL(buck shaw stadium,building)                   0.24    1.00
LBL(buck shaw stadium,attraction)                 0.24    1.00
LBL(brian rogers,person)                          0.24    1.00
LBL(david dejesus,person)                         0.24    1.00
LBL(moises alou,person)                           0.24    1.00

B.5 Results for the Knowledge Graph Identification Model

Table B.17: Results for the KGI model on the closed-world knowledge graph identification problem for NELL for all facts. The model uses ontological rules, entity resolution rules, and extractor confidence rules. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.810  0.687      0.985
0.200      0.810  0.687      0.985
0.300      0.811  0.690      0.984
0.400      0.827  0.724      0.965
0.500      0.851  0.767      0.956
0.600      0.848  0.786      0.921
0.700      0.824  0.821      0.827
0.800      0.710  0.884      0.594
0.900      0.570  0.899      0.418
1.000      0.439  0.993      0.281

Table B.18: Results for the KGI model on the closed-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.849  0.742      0.992
0.200      0.849  0.742      0.992
0.300      0.849  0.742      0.992
0.400      0.849  0.742      0.992
0.500      0.848  0.742      0.988
0.600      0.843  0.756      0.952
0.700      0.809  0.786      0.834
0.800      0.641  0.849      0.514
0.900      0.425  0.798      0.290
1.000      0.144  0.979      0.077

Table B.19: Results for the KGI model on the closed-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.713  0.553      1.000
0.100      0.751  0.612      0.974
0.200      0.751  0.612      0.974
0.300      0.755  0.618      0.969
0.400      0.792  0.695      0.920
0.500      0.858  0.818      0.903
0.600      0.857  0.845      0.870
0.700      0.849  0.886      0.814
0.800      0.814  0.927      0.725
0.900      0.770  0.994      0.628
1.000      0.761  0.996      0.616

B.6 Results for the Open-World Knowledge Graph Identification Model

Table B.20: Results for the KGI model on the open-world knowledge graph identification problem for NELL for all facts. The model uses ontological rules, entity resolution rules, and extractor confidence rules and does not restrict inferences to the test set. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.783  0.644      1.000
0.100      0.816  0.707      0.964
0.200      0.854  0.773      0.955
0.300      0.851  0.778      0.939
0.400      0.855  0.801      0.916
0.500      0.848  0.826      0.871
0.600      0.818  0.867      0.775
0.700      0.784  0.887      0.703
0.800      0.750  0.915      0.636
0.900      0.721  0.925      0.591
1.000      0.526  0.928      0.367

Table B.21: Results for the KGI model on the open-world knowledge graph identification problem for NELL for relation facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.834  0.715      1.000
0.100      0.846  0.756      0.961
0.200      0.846  0.759      0.957
0.300      0.845  0.761      0.949
0.400      0.844  0.773      0.929
0.500      0.834  0.793      0.880
0.600      0.808  0.837      0.781
0.700      0.766  0.849      0.698
0.800      0.737  0.875      0.637
0.900      0.696  0.883      0.574
1.000      0.411  0.857      0.270

Table B.22: Results for the KGI model on the open-world knowledge graph identification problem for NELL for label facts. The results show the performance at different soft-truth thresholds.

threshold  F1     Precision  Recall
0.000      0.713  0.553      1.000
0.100      0.770  0.639      0.968
0.200      0.868  0.797      0.952
0.300      0.862  0.810      0.922
0.400      0.874  0.854      0.894
0.500      0.871  0.887      0.856
0.600      0.836  0.923      0.764
0.700      0.815  0.955      0.710
0.800      0.773  0.992      0.634
0.900      0.764  0.997      0.619
1.000      0.689  0.997      0.526

Table B.23: Comparison of the open-world model to the closed-world model for all facts inferred by the open-world model. The results show a sample of facts where the open-world truth value is much higher than the closed-world truth value.

Fact                                                  KGI-Complete  KGI
REL(bruins,banknorth garden,teamhomestadium)          1.00          0.00
REL(bruins,fleet center,atlocation)                   1.00          0.00
REL(bruins,nhl,subpartof)                             1.00          0.00
REL(rangers,hockey,teamplayssport)                    1.00          0.00
REL(lemur,jamie callan,mlsoftwareauthor)              1.00          0.00
LBL(lanni,agent)                                      1.00          0.00
LBL(eibe frank,person)                                1.00          0.00
REL(eibe frank,weka,involvedwith)                     1.00          0.00
REL(washington d c,anacostia museum,citymuseums)      1.00          0.00
LBL(anacostia museum,building)                        1.00          0.00

Table B.24: Comparison of the open-world model to the closed-world model for all facts inferred by the open-world model. The results show a sample of facts where the open-world truth value is much lower than the closed-world truth value.

Fact                                                            KGI-Complete  KGI
REL(rfk memorial stadium,washington,stadiumincity)              0.00          1.00
REL(charlotte bobcats,time warner cable arena,teamhomestadium)  0.01          0.96
REL(boston celtics,los angeles,teamplaysincity)                 0.00          0.94
REL(denver nuggets,los angeles,teamplaysincity)                 0.00          0.94
REL(cleveland cavaliers,los angeles,teamplaysincity)            0.00          0.94
REL(detroit pistons,los angeles,teamplaysincity)                0.00          0.94
REL(georgia tech,basketball,teamplayssport)                     0.01          0.95
REL(pittsburgh steelers,baltimore,teamplaysincity)              0.00          0.94
REL(san diego chargers,super bowl,teamwontrophy)                0.01          0.92
REL(seahawks,super bowl,teamwontrophy)                          0.01          0.92
LBL(congo,country)                                              0.01          0.86
REL(qwest field,seattle,stadiumincity)                          0.00          0.85
REL(lusaka,zambia,cityincountry)                                0.03          0.87

Figure B.1: This figure shows the precision-recall curve for the different knowledge graph construction models. The baseline model, which does not use any collective reasoning, severely underperforms all other approaches. Models that omit the confidence information of uncertain sources (orange squares), entity resolution (yellow x's), or ontological information (purple diamonds) do not perform as well as knowledge graph identification (green hexagons). Complete knowledge graph identification (blue circles) in the open-world setting infers a complete knowledge graph and demonstrates good performance, but suffers from low recall at the highest precision.

Bibliography

U. Acar, A. Ihler, R. Mettu, and O. Sümer. Adaptive Inference on General Graphical Models. In UAI, 2008.

U. Acar, A. Ihler, R. Mettu, and O. Sümer. Adaptive Updates for MAP Configurations with Applications to Bioinformatics. IEEE Statistical Signal Processing (SSP), pages 413–416, 2009.

G. Angeli, J. Tibshirani, J. Y. Wu, and C. D. Manning. Combining Distant and Partial Supervision for Relation Extraction. In Proceedings of Conference on Empirical Methods in Natural Language Processing, 2014.

G. Antoniou and F. Van Harmelen. A Semantic Web Primer. MIT press, 2004.

S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A Nucleus for a Web of Open Data. The Semantic Web, pages 722–735, 2007.

F. Baader, I. Horrocks, and U. Sattler. Description Logics as Ontology Languages for the Semantic Web. Mechanizing Mathematical Reasoning, pages 228–248, 2005.

S. H. Bach, M. Broecheler, L. Getoor, and D. P. O'Leary. Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization. In NIPS, 2012.

S. H. Bach, B. Huang, B. London, and L. Getoor. Hinge-Loss Markov Random Fields: Convex Inference for Structured Prediction. In UAI, 2013.

S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. arXiv preprint, arXiv:1505.04406 [cs.LG], 2015.

N. Balcan, A. Blum, and Y. Mansour. Exploiting Ontology Structures and Unlabeled Data for Learning. In ICML, 2013.

A. Barr and J. Davidson. Knowledge Representation, volume 1, pages 141–222. William Kaufmann, Los Altos, CA, 1981.

T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific american, 284(5):28–37, 2001.

I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. ACM Transactions on Knowledge Discovery and Datamining, 1(1), 2007.

M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48. ACM, 2003.

C. Bizer and A. Seaborne. D2RQ–Treating Non-RDF Databases as Virtual RDF Graphs. In ISWC, 2004.

C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a Crystallization Point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009. ISSN 1570-8268.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

M. Boden. Mind As Machine: A History of Cognitive Science. Oxford University Press, Inc., New York, NY, USA, 2008. ISBN 019954316X, 9780199543168.

P. L. Boeuf. FRBR and Further. Cataloging & classification quarterly, 32(4):15–52, 2001.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers. Foundations and Trends Machine Learning, 3(1):1–122, 2011.

R. d. S. Braz, E. Amir, and D. Roth. Lifted First-Order Probabilistic Inference. In IJCAI, 2005.

D. Brickley and L. Miller. FOAF Vocabulary Specification 0.98, 2010. See http://xmlns.com/foaf/spec/20100809.html.

M. Broecheler, L. Mihalkova, and L. Getoor. Probabilistic Similarity Logic. In UAI, 2010.

B. Buchanan and G. Sutherland. Heuristic Dendral: A Program for Generating Explanatory Hypotheses in Organic Chemistry. Technical report, Dept of Computer Science, Stanford University, California, 1968.

B. G. Buchanan and E. A. Feigenbaum. Dendral and Meta-Dendral: Their Applications Dimension. Artificial intelligence, 11(1):5–24, 1978.

B. G. Buchanan and E. H. Shortliffe. Rule-Based Expert Systems. Addison-Wesley, 1984.

W. Buntine. Theory Refinement on Bayesian Networks. In UAI, 1991.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell. Toward an Architecture for Never-Ending Language Learning. In AAAI, 2010a.

A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka Jr, and T. M. Mitchell. Coupled Semi-Supervised Learning for Information Extraction. In Proceedings of the third ACM international conference on Web search and data mining, pages 101–110. ACM, 2010b.

H. Chan and A. Darwiche. Sensitivity Analysis in Markov Networks. In IJCAI, 2005.

H. Chan and A. Darwiche. On the Robustness of Most Probable Explanations. In UAI, 2006.

A. Chechetka and C. Guestrin. Focused Belief Propagation for Query-Specific Inference. In AISTATS, pages 89–96, 2010.

J. Christensen, S. Soderland, and O. Etzioni. An Analysis of Open Information Extraction Based on Semantic Role Labeling. In Proceedings of the sixth international conference on Knowledge capture, pages 113–120. ACM, 2011.

W. Cohen, P. Ravikumar, and S. Fienberg. A Comparison of String Matching Tasks for Names and Addresses. In IJCAI Workshop on Information Integration on the Web, 2003.

M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70, 2005.

M. Collins and Y. Singer. Unsupervised Models for Named Entity Classification. In EMNLP, 1999.

H. Daumé III, J. Langford, and D. Marcu. Search-Based Structured Prediction. Machine Learning Journal, 2009.

I. Davis, R. Newman, and B. D’Arcus. Expression of Core FRBR Concepts in RDF, 2005. see http://vocab.org/frbr/core.html.

M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449– 454, 2006.

R. de Salvo Braz, S. Natarajan, H. Bui, J. Shavlik, and S. Russell. Anytime Lifted Belief Propagation. In SRL, volume 9, 2009.

S. Decker, S. Melnik, F. Van Harmelen, D. Fensel, M. Klein, J. Broekstra, M. Erdmann, and I. Horrocks. The Semantic Web: The Roles of XML and RDF. Internet Computing, 4(5):63–73, 2000.

S. Dixon and K. Jacobson. LinkedBrainz - A project to provide MusicBrainz NGS as Linked Data. see http://linkedbrainz.c4dmpresents.org/.

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–610. ACM, 2014a.

X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From Data Fusion to Knowledge Fusion. Proceedings of the VLDB Endowment, 7(10):881– 892, 2014b.

R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.

C. Elkan and A. Monge. The field matching problem: Algorithms and applications. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996.

O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open Information Extraction from the Web. Communications of the ACM, 51(12):68–74, 2008.

O. Etzioni, A. Fader, J. Christensen, S. Soderland, and Mausam. Open Information Extraction: the Second Generation. In IJCAI, 2011.

A. Fader, S. Soderland, and O. Etzioni. Identifying Relations for Open Information Extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics, 2011.

M. Fan, D. Zhao, Q. Zhou, Z. Liu, T. F. Zheng, and E. Y. Chang. Distant Supervision for Relation Extraction with Matrix Completion. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 839–849, 2014.

E. A. Feigenbaum, B. G. Buchanan, and J. Lederberg. On Generality and Problem Solving: A Case Study Using the Dendral Program. Technical report, Department of Computer Science, Stanford University, California, 1970.

I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

S. Fine, Y. Singer, and N. Tishby. The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning, 32(1):41–62, 1998.

L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian journal of Mathematics, 8(3):399–404, 1956.

N. Friedman and M. Goldszmidt. Sequential Update of Bayesian Network Structure. In UAI, 1997.

P. Gamallo, M. Garcia, and S. Fernández-Lanza. Dependency-Based Open Information Extraction. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, pages 10–18. Association for Computational Linguistics, 2012.

P. Gardenfors, editor. Belief Revision. Cambridge University Press, New York, NY, USA, 1992.

A. Globerson and T. Jaakkola. Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations. In NIPS, 2007.

J. Golbeck and M. Rothstein. Linking Social Networks on the Web with FOAF: A Semantic Web Case Study. In AAAI, volume 8, pages 1138–1143, 2008.

K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval, 4(2):133–151, 2001.

P. J. Hayes. In Defense of Logic. In Proceedings of the 5th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'77, pages 559–565, San Francisco, CA, USA, 1977. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1624435.1624559.

P. Hitzler, M. Krötzsch, and S. Rudolph. Knowledge Representation for the Semantic Web. KI, 2009.

I. Horrocks, U. Sattler, and S. Tobies. Practical Reasoning for Expressive Description Logics. In Logic for Programming and Automated Reasoning, pages 161–180. Springer Berlin Heidelberg, 1999.

I. Horrocks, P. F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: The Making of a Web Ontology Language. Web Semantics: Science, Services and Agents on the World Wide Web, 1(1):7–26, 2003.

M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in medicine, 14(5-7):491–498, 1995.

S. Jiang, D. Lowd, and D. Dou. Learning to Refine an Automatically Extracted Knowledge Base Using Markov Logic. In ICDM, 2012.

D. R. Karger and C. Stein. A new approach to the minimum cut problem. Journal of the ACM (JACM), 43(4):601–640, 1996.

G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1), 1998.

G. Kasneci, M. Ramanath, F. Suchanek, and G. Weikum. The Yago-Naga Approach to Knowledge Discovery. ACM SIGMOD Record, 37(4):41–47, 2009.

A. Kimmig, S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. A Short Introduction to Probabilistic Soft Logic. In NIPS Workshop on Probabilistic Programming, 2012.

D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for Computational Linguistics, 2003.

G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media Meets Semantic Web–How The BBC uses DBpedia and Linked Data to Make Connections. In ESWC, 2009.

D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.

S. Krause, H. Li, H. Uszkoreit, and F. Xu. Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web. International Semantic Web Conference, pages 263–278, 2012.

M. Krötzsch, F. Simancik, and I. Horrocks. Description Logics. Intelligent Systems, 29(1):12–19, 2014.

J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001.

K. Laskey. Sensitivity Analysis for Probability Assessments in Bayesian Networks. In UAI, 1993.

D. Lenat and R. V. Guha. Cyc: A Midterm Report. AI magazine, 11(3):32, 1990.

D. B. Lenat. Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11):33–38, 1995.

D. B. Lenat, M. Prakash, and M. Shepherd. Cyc: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks. AI magazine, 6(4):65, 1985.

W. Li, P. van Beek, and P. Poupart. Performing Incremental Bayesian Inference by Dynamic Model Counting. In AAAI, 2006.

B. London, B. Huang, B. Taskar, and L. Getoor. Collective Stability in Structured Prediction: Generalization from One Example. In ICML, 2013.

B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayesian Collective Stability. In AIStats, 2014.

L. Màrquez, X. Carreras, K. C. Litkowski, and S. Stevenson. Semantic role labeling: an introduction to the special issue. Computational linguistics, 34(2):145–159, 2008.

Mausam, M. D. Schmitz, R. E. Bart, S. Soderland, and O. Etzioni. Open Language Learning for Information Extraction. In EMNLP, 2012.

K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

D. Nadeau and S. Sekine. A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

N. Nakashole, M. Theobald, and G. Weikum. Scalable Knowledge Harvesting with High Precision and High Recall. In Proceedings of the fourth ACM International Conference on Web Search and Data Mining, pages 227–236. ACM, 2011.

G. M. Namata, S. Kok, and L. Getoor. Collective Graph Identification. In KDD, 2011.

A. Nath and P. Domingos. Efficient Belief Propagation for Utility Maximization and Repeated Inference. In AAAI, 2010.

G. Navarro. A Guided Tour to Approximate String Matching. ACM Comput. Surv., 33(1):31–88, Mar. 2001. ISSN 0360-0300. doi: 10.1145/375360.375365. URL http://doi.acm.org/10.1145/375360.375365.

A. Newell, J. C. Shaw, and H. A. Simon. Report on a General Problem-Solving Program. In IFIP Congress, pages 256–264, 1959.

T.-V. T. Nguyen and A. Moschitti. End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 277–282. Association for Computational Linguistics, 2011.

M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 809–816, 2011.

M. Nickel, X. Jiang, and V. Tresp. Reducing the Rank in Relational Factorization Models by Including Observable Patterns. Advances in Neural Information Processing Systems, pages 1179–1187, 2014.

M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A Review of Relational Machine Learning for Knowledge Graphs: From Multi-Relational Link Prediction to Automated Knowledge Graph Construction. arXiv preprint arXiv:1503.00759, 2015.

N. J. Nilsson. Principles of Artificial Intelligence. Springer Science & Business Media, 1982.

N. J. Nilsson. Logic and Artificial Intelligence. Artificial Intelligence, 47(1):31–56, 1991.

F. Niu, C. Zhang, C. Ré, and J. Shavlik. Elementary: Large-Scale Knowledge-Base Construction Via Machine Learning and Statistical Inference. International Journal on Semantic Web and Information Systems, 8(3):42–73, 2012a.

F. Niu, C. Zhang, C. Ré, and J. W. Shavlik. DeepDive: Web-Scale Knowledge-Base Construction Using Statistical Learning and Inference. VLDS, 12:25–28, 2012b.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 1988.

A. Platanios, A. Blum, and T. M. Mitchell. Estimating accuracy from unlabeled data. In Proceedings of UAI, 2014.

J. Pujara, H. Miao, and L. Getoor. Joint Judgments with a Budget: Strategies for Reducing the Cost of Inference. In ICML Workshop on Machine Learning with Test-Time Budgets, 2013a.

J. Pujara, H. Miao, L. Getoor, and W. Cohen. Ontology-Aware Partitioning for Knowledge Graph Identification. In CIKM Workshop on Automatic Knowledge Base Construction, 2013b.

J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge Graph Identification. In ISWC, 2013c.

J. Pujara, K. Murphy, X. L. Dong, and C. Janssen. Probabilistic Models for Collective Entity Resolution Between Knowledge Graphs. In Bay Area Machine Learning Symposium, 2014.

J. Pujara, B. London, and L. Getoor. Budgeted Online Collective Inference. In Uncertainty in Artificial Intelligence, 2015a.

J. Pujara, H. Miao, L. Getoor, and W. Cohen. Using Semantics & Statistics to Turn Data into Knowledge. AI Magazine, 36(1):65–74, 2015b.

Y. Raimond, S. Abdallah, and M. Sandler. The Music Ontology. In International Conference on Music Information Retrieval, 2007.

A. L. Rector. Modularisation of domain ontologies implemented in description logics and related formalisms including OWL. In Proceedings of the 2nd International Conference on Knowledge Capture, pages 121–128. ACM, 2003.

M. Richardson and P. Domingos. Markov Logic Networks. Machine Learning, 62(1-2), 2006.

B. Roth, T. Barth, M. Wiegand, and D. Klakow. A Survey of Noise Reduction Methods for Distant Supervision. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 73–78. ACM, 2013.

S. Rudolph. Foundations of Description Logics. In Reasoning Web: Semantic Technologies for the Web of Data, pages 76–136. Springer Berlin Heidelberg, 2011.

S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.

S. Sarawagi. Information Extraction. Foundations and Trends in Databases, 1(3):261–377, Mar. 2008. ISSN 1931-7883. doi: 10.1561/1900000003. URL http://dx.doi.org/10.1561/1900000003.

N. Shadbolt, W. Hall, and T. Berners-Lee. The Semantic Web Revisited. Intelligent Systems, 21(3):96–101, 2006.

E. H. Shortliffe. A Rule-Based Computer Program for Advising Physicians regarding Antimicrobial Therapy Selection. In Proceedings of the 1974 annual ACM conference, volume 2, pages 739–739. ACM, 1974.

E. H. Shortliffe. Mycin: Computer-Based Medical Consultations. Elsevier, New York, 1976.

B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, and C. Rosse. Relations in biomedical ontologies. Genome biology, 6(5):R46, 2005.

R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with Neural Tensor Networks for Knowledge Base Completion. In Advances in Neural Information Processing Systems, 2013.

F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.

F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Large Ontology from Wikipedia and Wordnet. Web Semantics: Science, Services and Agents on the World Wide Web, 6 (3):203–217, 2008.

O. Sümer, U. Acar, A. Ihler, and R. Mettu. Adaptive Exact Inference in Graphical Models. JMLR, 12:3147–3186, 2011.

M. Surdeanu, D. McClosky, J. Tibshirani, J. Bauer, A. X. Chang, V. I. Spitkovsky, and C. D. Manning. A Simple Distant Supervision Approach for the Tac-Kbp Slot Filling Task. In Proceedings of the Text Analysis Conference Workshop, 2010.

S. Takamatsu, I. Sato, and H. Nakagawa. Reducing Wrong Labels in Distant Supervision for Relation Extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 721–729. Association for Computational Linguistics, 2012.

N. Tandon, G. de Melo, F. Suchanek, and G. Weikum. Webchild: Harvesting and Organizing Commonsense Knowledge from the Web. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 523–532. ACM, 2014a.

N. Tandon, G. de Melo, and G. Weikum. Acquiring Comparative Commonsense Knowledge from the Web. In Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 166–172. AAAI Press, 2014b.

B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In NIPS, 2003.

P. Thomas, I. Solt, R. Klinger, and U. Leser. Learning Protein-Protein Interaction Extraction Using Distant Supervision. In Workshop on Robust Unsupervised and Semi-Supervised Methods in Natural Language Processing, pages 34–41, 2011.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

E. Tsang. Foundations of Constraint Satisfaction. Academic Press, 1995.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In ICML, 2004.

F. van Harmelen, V. Lifschitz, and B. Porter. Handbook of Knowledge Representation. Elsevier Science, San Diego, USA, 2007. ISBN 0444522115, 9780444522115.

C. Wagner. Breaking the Knowledge Acquisition Bottleneck Through Conversational Knowledge Management. Information Resources Management Journal, 19(1):70–83, Jan. 2006. ISSN 1040-1628. doi: 10.4018/irmj.2006010104. URL http://dx.doi.org/10.4018/irmj.2006010104.

R. A. Wagner and M. J. Fischer. The String-to-String Correction Problem. J. ACM, 21(1):168–173, Jan. 1974. ISSN 0004-5411. doi: 10.1145/321796.321811. URL http://doi.acm.org/10.1145/321796.321811.

M. Wainwright. Estimating the “Wrong” Graphical Model: Benefits in the Computation-Limited Setting. JMLR, 7:1829–1859, 2006.

W. Wang, R. Besançon, O. Ferret, and B. Grau. Filtering and Clustering Relations for Unsupervised Information Extraction in Open Domain. In Proceedings of the 20th ACM international conference on Information and Knowledge Management, pages 1405–1414. ACM, 2011.

W. Y. Wang, K. Mazaitis, and W. W. Cohen. ProPPR: Efficient First-Order Probabilistic Logic Programming for Structure Discovery, Parameter Learning, and Scalable Inference. In Proceedings of the AAAI 2014 Workshop on Statistical Relational AI, 2014.

W. Y. Wang, K. Mazaitis, N. Lao, and W. W. Cohen. Efficient Inference and Learning in a Large Knowledge Base. Machine Learning, pages 1–26, 2015.

D. S. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty, R. Hoffmann, K. Patel, and M. Skinner. Intelligence in Wikipedia. In AAAI, volume 8, pages 1609–1614, 2008.

D. C. Wimalasuriya and D. Dou. Ontology-based Information Extraction: An Introduction and a Survey of Current Approaches. J. Information Science, 36(3), 2010.

W. E. Winkler. The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau. Citeseer, 1999.

T. Winograd. Understanding Natural Language. Cognitive psychology, 3(1):1–191, 1972.

L. Yao, S. Riedel, and A. McCallum. Probabilistic Databases of Universal Schema. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 116–121. Association for Computational Linguistics, 2012.

L. Yao, S. Riedel, and A. McCallum. Universal Schema for Entity Type Prediction. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 79–84. ACM, 2013.

J. Zhu, Z. Nie, X. Liu, B. Zhang, and J.-R. Wen. Statsnowball: A Statistical Approach to Extracting Entity Relationships. In Proceedings of the 18th International Conference on World Wide Web, pages 101–110. ACM, 2009.
