Knowledge Extraction for Hybrid Question Answering
Total Page:16
File Type:pdf, Size:1020Kb
KNOWLEDGEEXTRACTIONFORHYBRID QUESTIONANSWERING Von der Fakultät für Mathematik und Informatik der Universität Leipzig angenommene DISSERTATION zur Erlangung des akademischen Grades Doctor rerum naturalium (Dr. rer. nat.) im Fachgebiet Informatik vorgelegt von Ricardo Usbeck, M.Sc. geboren am 01.04.1988 in Halle (Saale), Deutschland Die Annahme der Dissertation wurde empfohlen von: 1. Professor Dr. Klaus-Peter Fähnrich (Leipzig) 2. Professor Dr. Philipp Cimiano (Bielefeld) Die Verleihung des akademischen Grades erfolgt mit Bestehen der Verteidigung am 17. Mai 2017 mit dem Gesamtprädikat magna cum laude. Leipzig, den 17. Mai 2017 bibliographic data title: Knowledge Extraction for Hybrid Question Answering author: Ricardo Usbeck statistical information: 10 chapters, 169 pages, 28 figures, 32 tables, 8 listings, 5 algorithms, 178 literature references, 1 appendix part supervisors: Prof. Dr.-Ing. habil. Klaus-Peter Fähnrich Dr. Axel-Cyrille Ngonga Ngomo institution: Leipzig University, Faculty for Mathematics and Computer Science time frame: January 2013 - March 2016 ABSTRACT Over the last decades, several billion Web pages have been made available on the Web. The growing amount of Web data provides the world’s largest collection of knowledge.1 Most of this full-text data like blogs, news or encyclopaedic informa- tion is textual in nature. However, the increasing amount of structured respectively semantic data2 available on the Web fosters new search paradigms. These novel paradigms ease the development of natural language interfaces which enable end- users to easily access and benefit from large amounts of data without the need to understand the underlying structures or algorithms. Building a natural language Question Answering (QA) system over heteroge- neous, Web-based knowledge sources requires various building blocks. These build- ing components include (a) knowledge extraction approaches over textual data such as blogs, news, templated websites or product descriptions to capture the semantics of the content of a document, (b) linguistic parsing modules for natural language and (c) efficient query execution and ranking functions. However, ex- isting knowledge extraction building blocks lack comparable evaluation settings, performance or quality. In this thesis, three main research challenges are identified and addressed: 1. The ongoing transition from the current Document Web of textual data to the Web of Data requires scalable and accurate approaches for the extraction of structured data in Resource Description Framework (RDF) from Web-based documents. We address this key step for bridging the Semantic Gap, i.e., ex- tracting RDF from text, with three approaches. First, this thesis presents CE- TUS, an approach for recognizing entity types in textual documents to popu- late RDF knowledge bases. Second, we describe the knowledge base-agnostic, multi-lingual, high-quality and efficient named entity disambiguation frame- work AGDISTIS. Third, we tackle the scalable extraction of consistent RDF data from highly templated websites via our framework REX. 2. The need to bridge between the textual Document Web and the structured data on the Web of Data has led to the development of a considerable number of annotation tools and frameworks. However, these approaches are currently hard to compare since the published evaluation results are calculated on diverse datasets and evaluated based on different measures The resulting Evaluation Gap is tackled by a novel set of benchmarking corpora as well as with GERBIL, our evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation approaches on multiple datasets. 3. The decentral architecture behind both Webs generated a distributed infor- mation landscape across data sources with varying structure. Furthermore, current keyword-based search paradigms contradict the human demand for expressing information needs in natural language. Especially, complex infor- mation relationships are often expressed as complex questions by humans. 1 http://www.worldwidewebsize.com/ 2 http://stats.lod2.eu/ The author will make use of the we-form as narrative. v These observations describe an Information Gap, i.e., the user generated de- mand for natural language interfaces based on the structured Web of Data and the Document Web. Thus, we introduce HAWK, a novel natural language search framework for hybrid QA. Our framework is able to combine struc- tured and textual data sources to answer complex natural language ques- tions. vi PUBLICATIONS This thesis is based on the following publications and proceedings. References to the appropriate publications are included at the respective chapters and sections. awards and notable mentions • Best Research Paper Award for Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Michael Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer, and Andreas Both. AGDISTIS - Graph-Based Disambiguation of Named Entities Using Linked Data. In International Semantic Web Confer- ence (ISWC), pages 457–471, 2014 • 2nd prize at Question Answering over Linked Data (QALD) 5 for Ricardo Usbeck and Axel-Cyrille Ngonga Ngomo. HAWK@QALD5 – Trying to An- swer Hybrid Questions with Various Simple Ranking Techniques. In CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR- WS.org/Vol-1391), 2015 • Best Demo Award for Ricardo Usbeck, Michael Röder, and Axel-Cyrille Ngonga Ngomo. Evaluating entity annotators using GERBIL. In The Seman- tic Web: ESWC 2015 Satellite Events, Revised Selected Papers, Demos & Posters, pages 159–164, 2015 • 1st prize at Task 2 of Open Knowledge Extraction (OKE) Challenge for Michael Röder, Ricardo Usbeck, René Speck, and Axel-Cyrille Ngonga Ngomo. CETUS – A Baseline Approach to Type Extraction. In 1st Open Knowledge Ex- traction Challenge at International Semantic Web Conference, 2015 • Best of Workshop (WASABI) for Didier Cherix, Ricardo Usbeck, Andreas Both, and Jens Lehmann. CROCUS: Cluster-based Ontology Data Cleansing. In Joint Second International Workshop on Semantic Web Enterprise Adoption and Best Practice and Second International Workshop on Finance and Economics on the Semantic Web at ESWC, 2014 • Best of Workshop (LD4IE) for Axel-Cyrille Ngonga Ngomo, Michael Röder, and Ricardo Usbeck. Cross-document coreference resolution using latent features. In 2nd International Workshop on Linked Data for Information Extraction at the International Semantic Web Conference, pages 33–44, 2014 conferences, peer-reviewed 1. Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Michael Röder, Daniel Gerber, Sandro Athaide Coelho, Sören Auer, and Andreas Both. AGDISTIS - Graph-Based Disambiguation of Named Entities Using Linked Data. In International Semantic Web Conference (ISWC), pages 457–471, 2014 vii 2. Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann. GERBIL – Gen- eral Entity Annotation Benchmark Framework. In 24th International Confer- ence on World Wide Web, 2015 3. Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, and Chris- tina Unger. HAWK – Hybrid Question Answering Using Linked Data. In Extended Semantic Web Conference, pages 353–368, 2015 4. Michael Röder, Ricardo Usbeck, Daniel Gerber, Sebastian Hellmann, and An- dreas Both. N3 - A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format. In Conference on language resources and evaluation (LREC), 2014 5. Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Muham- mad Saleem, Andreas Both, Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. Web-Scale Extension of RDF Knowledge Bases from Templated Web- sites. In International Semantic Web Conference (ISWC), pages 66–81, 2014 book chapters, peer-reviewed 6. Anja Jentzsch, Ricardo Usbeck, and Denny Vrandecic. An Incomplete and Simplifying Introduction to Linked Data. In Jens Lehmann and Johanna Voelker, editors, Perspectives on Ontology Learning, pages 21–33. AKA / IOS Press, 2014 workshops, peer-reviewed 7. Ricardo Usbeck, Erik Körner, and Axel-Cyrille Ngonga Ngomo. Answering Boolean Hybrid Questions with HAWK. In NLIWOD workshop at International Semantic Web Conference (ISWC), 2015 8. Michael Röder, Ricardo Usbeck, René Speck, and Axel-Cyrille Ngonga Ngomo. CETUS – A Baseline Approach to Type Extraction. In 1st Open Knowledge Ex- traction Challenge at International Semantic Web Conference, 2015 9. Ricardo Usbeck. Combining Linked Data and Statistical Information Re- trieval - Next Generation Information Systems. In The Semantic Web: Trends and Challenges - 11th International Conference ESWC 2014, pages 845–854, 2014 10. Michael Röder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. Develop- ing a Sustainable Platform for Entity Annotation Benchmarks. In Developers Workshop 2015 at ESWC, 2015 11. Ricardo Usbeck and Axel-Cyrille Ngonga Ngomo. HAWK@QALD5 – Trying to Answer Hybrid Questions with Various Simple Ranking Techniques. In CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR-WS.org/Vol-1391), 2015 viii demos and posters, peer-reviewed 12. Ricardo Usbeck, Axel-Cyrille