Semantic Web Search Using Natural Language
University of West Bohemia
Faculty of Applied Sciences

Semantic Web Search Using Natural Language

Ivan Habernal

Doctoral Thesis
Submitted in partial fulfillment of the requirements for a degree of Doctor of Philosophy in Computer Science and Engineering

Supervisor: Professor Václav Matoušek
Department of Computer Science and Engineering
Plzeň, 2012

University of West Bohemia in Pilsen (Západočeská univerzita v Plzni)
Faculty of Applied Sciences

Semantic Web Search Using Natural Language (Czech title: Vyhledávání v Sémantickém webu použitím přirozeného jazyka)

Ing. Ivan Habernal

Doctoral thesis submitted for the academic degree of Doctor in the field of Computer Science and Engineering

Supervisor: Prof. Ing. Václav Matoušek, CSc.
Department of Computer Science and Engineering
Plzeň 2012

Declaration

I hereby submit for review and defense this doctoral thesis, prepared at the conclusion of my doctoral studies at the Faculty of Applied Sciences of the University of West Bohemia in Pilsen.

I hereby declare that I wrote this thesis independently, using the scholarly literature and available sources listed in the bibliography that is part of this thesis.

Pilsen, 31 March 2012
Ing. Ivan Habernal

Abstract

This thesis presents a complete end-to-end system for Semantic Web search using natural language. The system is placed into the context of recent research in Information Retrieval, Semantic Web, Natural Language Understanding, and Natural Language Interfaces. The key feature of Natural Language Interfaces is that users can search for the required information by posing their questions in natural language instead of, for example, filling in web forms.

The developed system uses Semantic Web technologies in both traditional and new forms. The idea of the Semantic Web has brought many interesting concepts into domain modeling and data sharing. Furthermore, the development of Natural Language Interfaces to the Semantic Web has shown that bridging the gap between the Semantic Web and Natural Language Interfaces can uncover new research challenges.

The main contributions of this thesis are as follows. First, a unique formalism for capturing the semantics of natural language questions, based upon Semantic Web standards, was proposed. Second, a statistical model for semantic analysis based upon supervised training was developed. Third, an evaluation of the fully functional end-to-end system with real data and real queries was conducted.

The system was tested in the accommodation domain using real data acquired from the Web as well as a corpus of real queries in natural language. The thesis deals with both theoretical and practical issues that must be solved in a fully functional system. A complete workflow is described, including preparation of data, natural language corpus, ontology design, annotation, semantic model, and search.

Finally, a very detailed evaluation with promising results is presented and discussed. Special attention is also paid to open issues, such as performance, usability, portability, and sources of real Web data.

Abstract (Czech)

This doctoral thesis describes a complete system for Semantic Web search using natural language. The system is presented in the context of research in the fields of Information Retrieval, the Semantic Web, natural language understanding, and natural language interfaces. The main advantage of natural language interfaces is the possibility of posing a question as a whole sentence instead of filling in web forms or using only keywords.

The developed system uses Semantic Web technologies in both traditional and new ways. The idea of the Semantic Web has brought many interesting concepts into domain modeling and data sharing across domains. Moreover, the combination of the Semantic Web and natural language interfaces offers new possibilities for improving user comfort during search.

The thesis makes the following main contributions. First, a new formalism for capturing the semantics of a natural language question was proposed. This formalism uses Semantic Web technologies. Second, a statistical model for semantic analysis based on machine learning was developed. Third, the system was tested on real data and real questions.

The system was tested in the domain of accommodation search. The data were obtained from real web portals, as were the test questions in natural language. The thesis deals with both theoretical and practical problems that must be solved in a functional system. The whole procedure is described: data acquisition, the question corpus, ontology design, annotation, semantic analysis, and search.

Finally, a very thorough evaluation of the system's functionality is carried out. Attention is also paid to open problems, e.g. performance, usability, portability to another domain and language, and sources of web data.

Contents

I Introduction

1 Motivation and Introduction
  1.1 Thesis Outline
  1.2 Discussion of the Terminology Used

II State of the Art

2 Natural Language Interfaces to Structured Data
  2.1 Natural Language Interfaces to Databases
  2.2 Semantic Web and Ontologies
  2.3 Architecture of NLISW
    2.3.1 Knowledge Base and Ontologies
    2.3.2 Question Understanding
    2.3.3 Knowledge Base Querying and Answer Representation
  2.4 Evaluation Metrics and Testing Datasets
  2.5 Overview of Existing Systems
    2.5.1 PowerAqua and AquaLog
    2.5.2 ORAKEL
    2.5.3 FREyA
    2.5.4 PANTO
    2.5.5 QACID
    2.5.6 QUETAL QA
    2.5.7 Other related work

3 Statistical Methods for Natural Language Understanding
  3.1 Introduction to NLU
    3.1.1 Basic Approaches
    3.1.2 Semantic Representation
  3.2 Sequential models
    3.2.1 Hidden understanding model
    3.2.2 Flat-concept parsing
    3.2.3 Conditional Random Fields
  3.3 Stochastic Semantic Parsing
    3.3.1 Preprocessing
    3.3.2 Probabilistic semantic grammars
    3.3.3 Vector-state Markov model
    3.3.4 Hidden Vector-state Markov model
    3.3.5 Context-based Stochastic Parser
  3.4 Other approaches to semantic parsing
  3.5 Evaluation of NLU Systems
    3.5.1 Exact match
    3.5.2 PARSEVAL
    3.5.3 Tree edit distance
  3.6 Existing corpora
    3.6.1 ATIS
    3.6.2 DARPA
    3.6.3 Other corpora
  3.7 Existing systems
    3.7.1 HVS Parser
    3.7.2 Scissor
    3.7.3 Wasp
    3.7.4 Krisp
    3.7.5 Other systems

III Natural Language-Based Semantic Web Search System

4 Target Domain
  4.1 Domain Requirements
  4.2 Natural Language Question Corpus
    4.2.1 NL Question Corpus Statistics
    4.2.2 Question Examples
  4.3 Knowledge Base and Domain Ontology
    4.3.1 Pattern-based Web Information Extraction
    4.3.2 Domain Ontology
    4.3.3 Qualitative Analysis of the Knowledge Base
    4.3.4 Semantic Inconsistencies Causing Practical Issues
  4.4 Test Data Preparation
    4.4.1 Search Results
    4.4.2 Semantic Interpretation
    4.4.3 Assigning the Correct Results

5 Semantic Annotation of NL Questions
  5.1 Describing NL Question Semantics
    5.1.1 Domain Ontology versus NL Question Ontology
    5.1.2 Two-Layer Annotation Approach
    5.1.3 Ontology of NL Question
  5.2 NL Question Corpus Annotation

6 Semantic Analysis of Natural Language Questions
  6.1 Semantic Analysis Model Overview
  6.2 Formal Definition of Semantic Annotation
  6.3 Statistical model
  6.4 Named Entity Recognition
    6.4.1 Maximum Entropy NER
    6.4.2 LINGVOParser
    6.4.3 String Similarity Matching
    6.4.4 OntologyNER

7 Semantic Interpretation
  7.1 Transformation of Semantic Annotation into Query Language
    7.1.1 Semantic Interpretation of Named Entities
  7.2 Practical Issues of Semantic Interpretation
  7.3 Semantic Reasoning and Result Representation

8 Evaluation
  8.1 Semantic Analysis Evaluation
    8.1.1 Evaluation of NER
    8.1.2 Evaluation of the Semantic Analysis Model
  8.2 Matching Named Entities to KB Instances
  8.3 End-to-end Performance Evaluation
    8.3.1 Fulltext Search
    8.3.2 Simulation of End-to-end Performance With Correct Semantic Annotation
    8.3.3 The end-to-end results of SWSNL system
  8.4 Evaluation on Other Domains and Languages
    8.4.1 ConnectionsCZ Corpus
    8.4.2 ATIS Corpus Subset

9 Conclusion
  9.1 Open Issues
    9.1.1 Performance Issues
    9.1.2 Deployment and Development
    9.1.3 Problems Caused by Real Web Data
    9.1.4 Research-related Issues
    9.1.5 Use in the Business Sector
  9.2 Future Work
  9.3 Final Conclusion
    9.3.1 Major Contributions
    9.3.2 Review of Aims of Ph.D. thesis

A Hybrid Semantic Analysis System - ATIS Data Evaluation
  A.1 Introduction
  A.2 Related Work
  A.3 Semantic Representation
  A.4 Data
    A.4.1 LINGVOSemantics corpus
    A.4.2 ATIS corpus
  A.5 System Description
    A.5.1 Lexical Class Identification