Automated Knowledge Base Extension Using Open Information
AUTOMATED KNOWLEDGE BASE EXTENSION USING OPEN INFORMATION

by Arnab Kumar Dutta, Durgapur, India

Inaugural dissertation for the attainment of the academic degree of Doctor of Natural Sciences, Faculty of Business Informatics and Business Mathematics, Universität Mannheim

Mannheim, Germany, 2015

Dean: Prof. Dr. Heinz Jürgen Müller, Universität Mannheim, Germany
Referee: Prof. Dr. Heiner Stuckenschmidt, Universität Mannheim, Germany
Co-referee: Ao. Prof. Dr. Fabian Suchanek, Télécom ParisTech University, France

Date of the oral examination: 4 February 2016

Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don’t settle.
- Steve Jobs (1955 - 2011)

ABSTRACT

Open Information Extraction (OIE) frameworks (like Nell, Reverb) provide us with domain independent facts in natural language form, containing knowledge from varied sources. Extraction mechanisms for structured knowledge bases (KBs) (like DBpedia, Yago) often fail to retrieve such facts due to their resource specific extraction schemes. Hence, structured KBs can extend themselves by augmenting their coverage with the facts discovered by OIE systems. This possibility motivates us to integrate these two genres of extraction into one interactive framework. In this work, we present a complete, ontology independent, generalized architecture for achieving this integration. Our proposed solution is modularized, with each module solving a specific task: (1) mapping subject and object terms from OIE facts to KB instances, and (2) mapping the OIE relational phrases to object properties defined in the KB.
Furthermore, in an open extraction setting, identical semantic relationships can be represented by different surface forms, making it necessary to group them together. To solve this problem, (3) we propose the use of Markov clustering to cluster OIE relations. The key to our approach lies in exploiting the inherent dependencies between relations and their arguments. This makes our approach completely context agnostic and generally applicable. We evaluated our method on two state of the art extraction systems, achieving over 85% precision on instance mappings and over 90% on relation mappings. We also created a distant supervision based gold standard for this purpose, and the data has been released as part of this work. Furthermore, we analyze the effect of clustering and empirically show its effectiveness as a relation mapping technique over other techniques. Overall, our work positions itself at the intersection of information extraction, ontology mapping and reasoning.

ZUSAMMENFASSUNG (SUMMARY)

Open Information Extraction (OIE) systems, such as Nell or Reverb, are able to extract domain independent facts from various textual sources and to output them as fragments of natural language in the form of subject-predicate-object triples, i.e. ultimately in semi-structured form. However, many of these facts are not contained in widely used structured knowledge bases (KBs), such as DBpedia or Yago, because these KBs are generated by specific extraction schemes that are therefore limited in their breadth. This contrast opens up the possibility of complementing structured KBs with facts from OIE systems. In this work, methods are investigated and a system is developed to enable this integration. The main contribution is the design and implementation of a comprehensive, ontology independent and general architecture for information integration. The proposed solution consists of the following modules: (1) mapping subject and object terms from OIE facts to KB instances; (2) mapping relational terms to the relations defined in the KB. Since different OIE relations, with correspondingly different surface forms, may be mapped to the same normalized KB relation, (3) the OIE terms must first be grouped into clusters, which is achieved with the help of Markov clustering. At its core, this work is based on exploiting the relationships between relations and their arguments, i.e. subject and object terms, in a context independent manner that ensures the general applicability of the methods to arbitrary KBs. The performance of this approach is examined in an experimental evaluation with two OIE systems on a self-created gold standard, which shows a precision of 85% for instances and 90% for relations, as well as a positive effect of Markov clustering compared to other approaches. Overall, this work makes a relevant methodological contribution to the integration of semi-structured OIE facts from text and structured facts from KBs.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor Prof. Dr. Heiner Stuckenschmidt for the wonderful learning opportunity he provided me. I feel deeply indebted to him for the way he patiently let me learn from my own mistakes. With his excellent guidance, I have learnt a lot about the process of doing scientific research and improved considerably. I consider myself very fortunate to have worked with him. I am highly grateful to Dr. Christian Meilicke for his selfless support and close interest in my work.
His constant motivation helped me to overcome some of the critical phases during the PhD. He taught me some of the good practices of scientific writing. His way of looking into problems is highly commendable, and I kept learning something new from him throughout these three years of my life. I also take this opportunity to thank Prof. Dr. Simone Paolo Ponzetto for being a constant motivation with his excitement about research and his cordial personality. I would also like to thank Dr. Mathias Niepert for some of the ideas he shared during the inception phase of the PhD and for winning the Google Faculty Research Award (funding for this work).

The whole work would not have been possible without the lovely group of people I was working with. It includes each and every one with whom I was fortunate to have been in close contact. Thank you Stephanie for helping me through all the hassles of official documents, Sebastian for your immediate resolution of infrastructure related issues, Cilli for being such a lovely friend (I feel lucky to have known her), Jörg for always being so helpful in and out of the office, Micha for always being my "German" friend (I am his "Indian" friend and we won the "couple of the month" title) and everyone else in the DWS lab.

Finally, I would like to thank Silpi, my loving wife, for convincing me to pursue a scientific career. If not for you, I would still have been an undergraduate engineer working as a software developer. Special thanks go to my family back home in India (my parents, my father-in-law) for their continued love, support and encouragement. Thanks for learning to use a touch screen device (and Skype, WhatsApp, Viber) and for always making me believe "..whatever happens in life, always happens for good".

PUBLICATIONS

Some of the ideas, figures, tables and numbers have appeared in the following publications:

• Arnab Dutta, Christian Meilicke and Heiner Stuckenschmidt. Enriching Structured Knowledge with Open Information. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015; 267-277. Geneva, 2015.
• Arnab Dutta, Christian Meilicke and Simone Paolo Ponzetto. A Probabilistic Approach for Integrating Heterogeneous Knowledge Sources. In: Lecture Notes in Computer Science, The Semantic Web: Semantics and Big Data: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014, Proceedings; 286-301. Springer International Publ., Cham, 2014.
• Arnab Dutta, Christian Meilicke and Heiner Stuckenschmidt. Semantifying Triples from Open Information Extraction Systems. In: Frontiers in Artificial Intelligence and Applications, STAIRS 2014: Proceedings of the 7th European Starting AI Researcher Symposium; 111-120. IOS Press, Clifton, Va. [et al.], 2014.
• Arnab Dutta, Mathias Niepert, Christian Meilicke and Simone Paolo Ponzetto. Integrating Open and Closed Information Extraction: Challenges and First Steps. In: CEUR Workshop Proceedings, NLP-DBPEDIA 2013: Proceedings of the NLP & DBpedia workshop co-located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 22, 2013. RWTH, Aachen, 2013.
• Arnab Dutta and Michael Schuhmacher. Entity Linking for Open Information Extraction. In: Lecture Notes in Computer Science, Natural Language Processing and Information Systems: 19th International Conference on Applications of Natural Language to Information Systems, NLDB 2014, Montpellier, France, June 18-20, 2014, Proceedings; 75-80. Springer Internat. Publ., Cham, 2014.

CONTENTS

Abstract
Acknowledgements
Publications
List of Figures
List of Tables
Acronyms
I Introduction
1 Motivation
2 Information Extraction Systems
2.1 DBpedia
2.2