University of West Bohemia
Faculty of Applied Sciences

Semantic Web Search Using Natural Language

Ivan Habernal

Doctoral Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

Supervisor: Professor Václav Matoušek, Department of Computer Science and Engineering

Plzeň, 2012

Západočeská univerzita v Plzni
Fakulta aplikovaných věd

Vyhledávání v Sémantickém webu použitím přirozeného jazyka

Ing. Ivan Habernal

disertační práce k získání akademického titulu doktor v oboru Informatika a výpočetní technika

Školitel: Prof. Ing. Václav Matoušek, CSc., Katedra informatiky a výpočetní techniky

Plzeň, 2012

Declaration

I hereby submit this doctoral thesis, written at the conclusion of my doctoral studies at the Faculty of Applied Sciences of the University of West Bohemia in Pilsen, for review and defense. I hereby declare that I wrote this thesis independently, using the professional literature and available sources listed in the bibliography, which forms a part of this thesis.

Pilsen, March 31, 2012

Ing. Ivan Habernal

Abstract

This thesis presents a complete end-to-end system for Semantic Web search using natural language. The system is placed into the context of recent research in Information Retrieval, the Semantic Web, Natural Language Understanding, and Natural Language Interfaces. The key feature of Natural Language Interfaces is that users can search for the required information by posing their questions in natural language instead of e.g. filling in web forms.

The developed system uses Semantic Web technologies in both traditional and new forms. The idea of the Semantic Web has brought many interesting concepts into domain modeling and data sharing. Furthermore, the development in Natural Language Interfaces to the Semantic Web has shown that bridging the gap between the Semantic Web and Natural Language Interfaces can uncover new research challenges.

The main contributions of this thesis are as follows. First, a unique formalism for capturing natural language question semantics, based upon Semantic Web standards, was proposed. Second, a statistical model for semantic analysis based upon supervised training was developed. Third, an evaluation of the fully functional end-to-end system with real data and real queries was conducted.

The system was tested in the accommodation domain using real data acquired from the Web as well as a corpus of real queries in natural language. The thesis deals with both theoretical and practical issues that must be solved in a fully functional system. A complete work-flow is described, including preparation of data, the natural language corpus, ontology design, annotation, the semantic model, and search.

Finally, a very detailed evaluation with promising results is presented and discussed. Special attention is also paid to open issues, such as performance, usability, portability, and sources of real Web data.

Abstrakt

This doctoral thesis describes a complete system for searching the Semantic Web using natural language. The system is presented in the context of research in Information Retrieval, the Semantic Web, natural language understanding, and natural language interfaces. The main advantage of natural language interfaces is the possibility of posing a question as a whole sentence instead of filling in web forms or using keywords only.

The developed system uses Semantic Web technologies in both traditional and novel ways. The idea of the Semantic Web has brought many interesting concepts into domain modeling and cross-domain data sharing. Moreover, the combination of the Semantic Web and natural language interfaces offers new possibilities for improving the user experience in search.

The thesis makes the following main contributions. First, a new formalism for capturing the semantics of a natural language question was proposed; this formalism builds on Semantic Web technologies. Second, a statistical model for semantic analysis based upon machine learning was developed. Third, the system was tested on real data and real questions.

The system was evaluated in the accommodation search domain. The data were obtained from real web portals, as were the test questions in natural language. The thesis deals with theoretical as well as practical problems that must be solved in a working system. The complete workflow is described: data acquisition, the question corpus, ontology design, annotation, semantic analysis, and search.

Finally, a very thorough evaluation of the system's performance is carried out. Attention is also paid to open issues, e.g. performance, usability, portability to another domain and language, and sources of web data.

Contents

I Introduction

1 Motivation and Introduction
  1.1 Thesis Outline
  1.2 Discussion of the Terminology Used

II State of the Art

2 Natural Language Interfaces to Structured Data
  2.1 Natural Language Interfaces to Databases
  2.2 Semantic Web and Ontologies
  2.3 Architecture of NLISW
    2.3.1 Knowledge Base and Ontologies
    2.3.2 Question Understanding
    2.3.3 Knowledge Base Querying and Answer Representation
  2.4 Evaluation Metrics and Testing Datasets
  2.5 Overview of Existing Systems
    2.5.1 PowerAqua and AquaLog
    2.5.2 ORAKEL
    2.5.3 FREyA
    2.5.4 PANTO
    2.5.5 QACID
    2.5.6 QUETAL QA
    2.5.7 Other related work

3 Statistical Methods for Natural Language Understanding
  3.1 Introduction to NLU
    3.1.1 Basic Approaches
    3.1.2 Semantic Representation
  3.2 Sequential models
    3.2.1 Hidden understanding model
    3.2.2 Flat-concept parsing
    3.2.3 Conditional Random Fields
  3.3 Stochastic Semantic Parsing
    3.3.1 Preprocessing
    3.3.2 Probabilistic semantic grammars
    3.3.3 Vector-state Markov model
    3.3.4 Hidden Vector-state Markov model
    3.3.5 Context-based Stochastic Parser
  3.4 Other approaches to semantic parsing
  3.5 Evaluation of NLU Systems
    3.5.1 Exact match
    3.5.2 PARSEVAL
    3.5.3 Tree edit distance
  3.6 Existing corpora
    3.6.1 ATIS
    3.6.2 DARPA
    3.6.3 Other corpora
  3.7 Existing systems
    3.7.1 HVS Parser
    3.7.2 Scissor
    3.7.3 Wasp
    3.7.4 Krisp
    3.7.5 Other systems

III Natural Language-Based Semantic Web Search System

4 Target Domain
  4.1 Domain Requirements
  4.2 Natural Language Question Corpus
    4.2.1 NL Question Corpus Statistics
    4.2.2 Question Examples
  4.3 Knowledge Base and Domain Ontology
    4.3.1 Pattern-based Web
    4.3.2 Domain Ontology
    4.3.3 Qualitative Analysis of the Knowledge Base
    4.3.4 Semantic Inconsistencies Causing Practical Issues
  4.4 Test Data Preparation
    4.4.1 Search Results
    4.4.2 Semantic Interpretation
    4.4.3 Assigning the Correct Results

5 Semantic Annotation of NL Questions
  5.1 Describing NL Question Semantics
    5.1.1 Domain Ontology versus NL Question Ontology
    5.1.2 Two-Layer Annotation Approach
    5.1.3 Ontology of NL Question
  5.2 NL Question Corpus Annotation

6 Semantic Analysis of Natural Language Questions
  6.1 Semantic Analysis Model Overview
  6.2 Formal Definition of Semantic Annotation
  6.3 Statistical model
  6.4 Named Entity Recognition
    6.4.1 Maximum Entropy NER
    6.4.2 LINGVOParser
    6.4.3 String Similarity Matching
    6.4.4 OntologyNER

7 Semantic Interpretation
  7.1 Transformation of Semantic Annotation into Query Language
    7.1.1 Semantic Interpretation of Named Entities
  7.2 Practical Issues of Semantic Interpretation
  7.3 Semantic Reasoning and Result Representation

8 Evaluation
  8.1 Semantic Analysis Evaluation
    8.1.1 Evaluation of NER
    8.1.2 Evaluation of the Semantic Analysis Model
  8.2 Matching Named Entities to KB Instances
  8.3 End-to-end Performance Evaluation
    8.3.1 Fulltext Search
    8.3.2 Simulation of End-to-end Performance With Correct Semantic Annotation
    8.3.3 The end-to-end results of SWSNL system
  8.4 Evaluation on Other Domains and Languages
    8.4.1 ConnectionsCZ Corpus
    8.4.2 ATIS Corpus Subset

9 Conclusion
  9.1 Open Issues
    9.1.1 Performance Issues
    9.1.2 Deployment and Development
    9.1.3 Problems Caused by Real Web Data
    9.1.4 Research-related Issues
    9.1.5 Use in the Business Sector
  9.2 Future Work
  9.3 Final Conclusion
    9.3.1 Major Contributions
    9.3.2 Review of Aims of Ph.D. thesis

A Hybrid Semantic Analysis System – ATIS Data Evaluation
  A.1 Introduction
  A.2 Related Work
  A.3 Semantic Representation
  A.4 Data
    A.4.1 LINGVOSemantics corpus
    A.4.2 ATIS corpus
  A.5 System Description
    A.5.1 Lexical Class Identification
    A.5.2 Semantic Parsing
  A.6 Performance Tests
    A.6.1 Results on the LINGVOSemantics corpus
    A.6.2 Results on the ATIS corpus
  A.7 Conclusions

Bibliography
List of Published Articles

Acknowledgement

I would like to thank everyone who helped me.

Part I

Introduction

Chapter 1

Motivation and Introduction

Nowadays it is almost impossible to deal with the current extent of online data without a search engine. The ability to find a particular piece of information has played a crucial role in the usability of the Web over the past decade. Major search engines can perform a very fast and accurate search on almost the whole Web, providing simple user interfaces based upon keywords. Keyword-based search has proven to be very efficient on collections of unstructured textual content, e.g. web pages. However, if users want to find information in structured content, e.g. in a database, basic keyword search fails.

A simple, yet sufficient solution on the Web can be provided by a form-based user interface. With this type of interface the user can typically combine keywords with some other restrictions, according to the specific domain structure. Form-based interfaces are user friendly in the sense that they do not require prior knowledge of the underlying data structures from the user. The structure is typically shown as multiple forms or menus that allow further specification of the user request. Nevertheless, form-based user interfaces are more complex than the straightforward keyword search.

One step beyond the above-mentioned traditional approaches are Natural Language Interfaces (NLI). The key feature of such an interface is that users can search for the required information by posing their questions in natural language, which allows them to formulate their information needs precisely. Since NLI can operate on both structured and unstructured content, it helps to unify the access to data and allows a new kind of user experience.

The need for computers to understand natural language has played a crucial role in many research fields over the past few decades. Some of the results have already been successfully transferred from purely scientific proof-of-concept research into the commercial sphere and industry. However, the combination of Natural Language Understanding (NLU) and Information Retrieval (IR) brings many new challenging tasks.


Another promising direction in the evolution of the Web, the Semantic Web, has brought many interesting concepts into domain modeling and data sharing. Since its fundamentals are based upon a very precise semantic description, many existing Web applications can benefit from incorporating Semantic Web technologies. Furthermore, the development in Natural Language Interfaces to the Semantic Web (NLISW) has shown that bridging the gap between the Semantic Web and Natural Language Interfaces can uncover new research challenges.

Inspired by the recent research, this thesis presents a complete Semantic Web Search using Natural Language (SWSNL) system. It uses Semantic Web technologies in both traditional and new forms. It was tested in the accommodation domain using real data acquired from the Web as well as a corpus of real queries in natural language. It deals with both theoretical and practical issues that must be solved in a fully functional system. A complete work-flow is described, including preparation of data, the natural language corpus, ontology design, annotation, the semantic model, and search. A very detailed evaluation with promising results is also presented.

1.1 Thesis Outline

The thesis is divided into two main parts. The first part describes the current state of the art. Since the thesis topic covers a wide range of research directions, this survey is an attempt to present the most important up-to-date ideas from each research field. To the author's best knowledge, all fundamental subjects related to the thesis topic are covered. Still, the development in this area is very rapid and there is a chance that some new studies have appeared very recently without the author's attention. The survey is divided into two chapters.

Chapter 2 is focused on the recent attempts to develop Natural Language Interfaces to structured data such as the Semantic Web. It describes the basic ideas of the Semantic Web and ontologies. The main content of the chapter is an analysis of existing Natural Language Interface systems in the Semantic Web.

Chapter 3 deals with understanding of questions in natural language in general. The problems are presented from the perspective of recent development in spoken language understanding, which can bridge the gap between traditional approaches used in Natural Language Interfaces and recent advanced techniques for natural language understanding.

The second part continues with a thorough description of the development of a complete end-to-end Semantic Web Search system using natural language.

Chapter 4 explores the possibilities of modeling domain data. It describes the target domain together with techniques required to deal with real Web data sources. A natural language corpus is also presented.

Chapter 5 describes a formalism for capturing natural language question semantics, based upon Semantic Web standards used in a unique way.

Chapter 6 proposes a statistical semantic model for analysing natural language questions.

Chapter 7 deals with semantic interpretation of a semantic annotation in order to perform a search on a particular knowledge base.

Chapter 8 thoroughly describes the evaluation of the SWSNL system. Open issues and future work are then discussed in Chapter 9.

1.2 Discussion of the Terminology Used

The field covered by this thesis is fairly broad since it combines various research directions. Thus, it is important to discuss the terminology in advance in order to avoid any misunderstanding.

Natural Language Processing (NLP). NLP is a general term, mostly used to describe a family of tools and techniques dealing with the processing of natural language.

Named Entity Recognition (NER). Named entities are defined as proper names and quantities of interest. In general, named entities relate to persons, organizations, location names as well as dates, times, percentages, and monetary amounts. The goal of NER is to identify named entities in a (plain) text (G. Zhou & Su, 2005).

Question Answering (QA). QA research is basically devoted to computer systems that understand a written natural language question and are able to present an answer, either in natural language or in another form (Habernal, Konopík, & Rohlík, 2012).

Information Retrieval (IR). "Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." (Manning, Raghavan, & Schütze, 2008, p. 1) Given this description, it is obvious that the term IR is very broad in its nature.

Spoken Human-Computer Dialogue. In general, spoken dialogue systems communicate with users in natural language. Such a system accepts spoken input and returns a spoken answer, forming a dialogue. A dialogue system should typically deal not only with questions but also with commands or answers (Jurafsky & Martin, 2008).

Spoken Language Understanding (SLU). SLU is a part of a Spoken Human-Computer Dialogue system, and its role is to robustly interpret the meanings of users' utterances. An SLU implementation normally comprises three main components: a speech recognizer, a semantic parser to extract the semantic information from the recognized utterance, and a dialogue act decoder to determine the overall goal expressed by the utterance (He & Young, 2006b).

Natural Language Understanding (NLU). The meaning of the term NLU overlaps with SLU. Again, the role of NLU is to capture the semantics of an input. It is assumed that the input is in textual form; therefore a speech recognizer is not a part of an NLU system. However, some sources freely mix NLU and SLU, e.g. (He & Young, 2005).

Semantic Analysis. The same as NLU—the role of Semantic Analysis system is to represent the meaning of the given text input.

Natural Language Interface (NLI). Typically, the term NLI is used in the literature when a system can be accessed using (written) natural language. The system contains (mostly structured) information and, given the natural language question, it can find an appropriate answer.

Natural Language Interfaces to Databases (NLIDB). The same as NLI. The system holds information in a relational database.

Natural Language Interfaces to Semantic Web (NLISW). In this case of NLI, the information is stored in the form of an ontology, which plays an important role in the Semantic Web research field.

Semantic Web Search Using Natural Language (SWSNL). This is purely the author's choice of the thesis title. It is based upon the previous literature survey and it also takes into consideration the most appropriate meaning of the term. Although NLISW covers a very similar task, most of the existing NLISW systems are designed to query the knowledge base in order to find also non-trivial information, such as counts of certain instances, the structure of the knowledge base, and others (see Chapter 2). By contrast, SWSNL can be viewed as a special case of a search engine that returns a set of results according to a natural language question.

Part II

State of the Art

Chapter 2

Natural Language Interfaces to Structured Data

Natural Language Interface (NLI) is a very general description of a user interface which allows users to access data using natural language. This term has been used mostly in the field of NLI to structured data, e.g. for accessing databases using NLI or for NLI on the Semantic Web platform. Nevertheless, such a general characterisation of NLI can also cover Question Answering (QA) or a part of Spoken Human-Computer Dialogue Systems (see the next chapter). The terminology does not limit the scope of NLI explicitly, therefore there are some overlaps among the research fields (see the discussion in Section 1.2).

As pointed out by (Kaufmann & Bernstein, 2007), querying structured data (or a knowledge base, KB) using a domain-specific query language can be complicated for casual users1. Most existing query languages, such as SQL or SPARQL, use a precise syntax and expect the user to be aware of the back-end data structure. Evidently, this allows domain experts to formulate exact queries, but on the other hand it is also a big obstacle for users. An evaluation conducted by (Kaufmann & Bernstein, 2007) compared four different interfaces to a KB (namely: constructing SPARQL queries using a graphical interface, a controlled language interface, a natural language interface, and keyword search) and it showed that NLI is the most appropriate interface for satisfying the needs of users with just a little knowledge about the KB.

A short comment on the controlled language interfaces mentioned above: since the understanding of natural language questions is still far from being solved, there have been attempts to put some constraints on the nature of the questions themselves by introducing Controlled Language (CL).

1 We will simply use the term users.

CL is a subset of a natural language that uses e.g. simplified syntax or a limited vocabulary. An example of a recent CL system can be found in (Schwitter, 2010). However, we feel that CL is a dead branch in NLI due to the effort that must be invested in designing the CL and training the users (which is not feasible in the case of e.g. freely accessible web-based QA systems).

This chapter presents the state of the art of NLI to structured data along with its specifics and typical domains and applications. In the first section, the traditional NLI to databases and its history are briefly introduced. The second section then follows with the application of NLI in the Semantic Web environment. It also includes a brief introduction to some foundations of the Semantic Web.

2.1 Natural Language Interfaces to Databases

Natural Language Interfaces to Databases (NLIDB) has been a well-studied research discipline since the 1970s, when e.g. the Rendezvous system (Codd, 1974) or the Ladder system (Hendrix, Sacerdoti, Sagalowicz, & Slocum, 1978) were developed. Both systems were tailored to a specific domain and were based upon hand-written semantic grammars. After gaining popularity in the 1980s (e.g. Chat-80 (Warren & Pereira, 1982) or Janus (Weischedel, 1989), both based upon predicate logic and lambda calculus), NLIDBs were slowly losing attention in the 1990s (Androutsopoulos, Ritchie, & Thanisch, 1995). However, some recent systems such as Precise (Popescu, Etzioni, & Kautz, 2003) or LingoLogic (Thompson, Pazandak, & Tennant, 2005) continued the NLIDB development by incorporating recent methods and also by redefining the expectations from such a system.

A brief description of NLIDB systems follows: the user writes a NL question, the system translates it into an SQL query, executes the query via the database engine, and returns the result (a minimal sketch is given at the end of this section).

Since the research in the NLIDB area has moved towards the Semantic Web during the past decade (Damljanović & Bontcheva, 2009), we will focus on that area in the next section.
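To make the pipeline described above concrete, the following is a minimal, hypothetical sketch of a template-based NLIDB in Python. The question patterns, the table schema, and the data are invented for illustration and do not come from any of the cited systems.

import re
import sqlite3

# A hypothetical template lexicon: each pattern maps a question shape
# directly to a parameterized SQL query (all names are illustrative).
TEMPLATES = [
    (re.compile(r"which hotels are in (\w+)", re.I),
     "SELECT name FROM hotels WHERE city = ?"),
    (re.compile(r"how many hotels are in (\w+)", re.I),
     "SELECT COUNT(*) FROM hotels WHERE city = ?"),
]

def answer(question: str, conn: sqlite3.Connection):
    """Translate a NL question into SQL via the first matching template."""
    for pattern, sql in TEMPLATES:
        match = pattern.search(question)
        if match:
            return conn.execute(sql, match.groups()).fetchall()
    return None  # out-of-coverage question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, city TEXT)")
conn.executemany("INSERT INTO hotels VALUES (?, ?)",
                 [("Hotel Slavia", "Prague"), ("Hotel West", "Pilsen")])
print(answer("Which hotels are in Prague?", conn))  # [('Hotel Slavia',)]

The sketch also illustrates the classic weakness of template-based NLIDBs noted above: any question outside the hand-written patterns simply fails.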

2.2 Semantic Web and Ontologies

One of the key ideas of the Semantic Web is that semantic data can be shared among computers in the form of ontologies, which would enable the creation of a kind of global database (Berners-Lee, Hendler, & Lassila, 2001). Since its introduction a decade ago, the Semantic Web has emerged into a set of standards and technologies (e.g. RDF, OWL, and SPARQL, which will be explained later in more detail), software tools (e.g. Semantic Web search engines, semantic repositories, and knowledge bases), and techniques and design concepts (e.g. domain modeling using ontologies).

The term ontology has many definitions in computer science, varying from a quite abstract definition, "Ontology is an explicit specification of a conceptualization" (Gruber, 1993), to concrete ones, e.g. "Ontologies are (meta)data schemas providing a controlled vocabulary of concepts, each with an explicitly defined and machine processable semantics." (Maedche & Staab, 2001). From the SW perspective, the term ontology has been established as something with the following features (Spanos, Stavrou, & Mitrou, 2012):

• a set of strings that describe lexical entries for concepts and relations,

• a taxonomy of concepts with multiple inheritance,

• a set of non-taxonomic relations—described by their domain and range restrictions,

• a hierarchy of relations, i.e. a set of taxonomic relations,

• a set of axioms that describe additional constraints on the ontology and allow implicit facts to be made explicit.

From a practical perspective, ontology refers to various levels of formal semantic description, from e.g. a glossary or a class taxonomy to a domain class model, or even to a model with formal logic constraints populated by instances and their relations. See e.g. (Lassila & McGuinness, 2001) for a good introduction to ontologies.

As the Semantic Web area has developed from the abstract idea, a few standards and technologies for dealing with ontologies have been created. First, the RDF (Resource Description Framework2) is the basic building block of the SW technology stack. The RDF model is a directed labelled graph consisting of triplets (or statements). Each triplet contains a predicate (which describes a relation and acts as an edge in the RDF graph), a subject, and an object (which are nodes of the graph). Both the subject and the object can be seen as real-world entities, and it is good practice to denote them using a URI3. The RDF has various exchange formats, e.g. the XML-based RDF/XML. However, the RDF itself has no other special functionality—"RDF provides a way to express simple statements about resources, using named properties and values" (Manola & Miller, 2004).

2 http://www.w3.org/RDF/
3 Uniform Resource Identifier
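As an illustration of the triplet model, the following minimal sketch builds a small RDF graph with the rdflib Python library. The accommodation vocabulary (the ex: namespace with Hotel, locatedIn, and starRating) is invented for this example and is not part of any standard vocabulary.

from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/accommodation#")  # hypothetical vocabulary
g = Graph()

# Each statement is a (subject, predicate, object) triplet; subjects and
# predicates are URIs, objects are URIs or literal values.
g.add((EX.Hotel, RDF.type, RDFS.Class))
g.add((EX.GoldenWell, RDF.type, EX.Hotel))
g.add((EX.GoldenWell, EX.locatedIn, EX.Prague))
g.add((EX.GoldenWell, EX.starRating, Literal(5)))

# The graph can be serialized into one of the RDF exchange formats.
print(g.serialize(format="turtle"))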

In order to provide a type system on top of the RDF, the RDF Schema (RDFS) was proposed. It brings facilities for describing classes and properties. Both classes and properties are inheritable, thus they can form a taxonomy. Furthermore, RDFS allows property definitions to be enriched with constraints, cardinality, and transitivity. Moreover, different classes can represent the same class, and two classes can be disjoint or combined (using intersection or union). For details see http://www.w3.org/TR/rdf-schema/.

Adopting the advanced features of RDFS, the Web Ontology Language (OWL) is built upon it. It further extends the abilities for capturing semantics, e.g. by introducing individuals (instances of classes), distinguishing between datatype and object properties, etc. There exist three OWL dialects distinguished by their expressivity4:

• OWL Lite supports hierarchical classification and constraint features.

• OWL DL supports all OWL language constructs, e.g. type separation.

• OWL Full is the most expressive dialect.

The key feature of OWL Lite and OWL DL is that they are based upon Description Logic (DL) and thus allow semantic reasoning (Horrocks, Patel-Schneider, & Harmelen, 2003). From a practical point of view, OWL is a de facto standard for ontologies in the Semantic Web environment.

4 Consult http://www.w3.org/TR/owl-features/ for details.

2.3 Architecture of NLISW

This section deals with the general architecture of NLI systems on the Semantic Web (NLISW). Although some patterns in the architecture of existing systems can be found, many systems are quite unique and follow such an architecture loosely or even not at all. Thus the classification of NLISW systems can be seen from different perspectives. A schema in (Gao, Liu, Zhong, Chen, & Liu, 2011) demonstrates various kinds of NLISW systems according to their approaches to semantic processing, instead of trying to put all systems into one architecture and discussing each component separately. However, we will keep a structure according to the particular processing modules. We will first discuss the ontology as a knowledge base for NLISW systems.

2.3.1 Knowledge Base and Ontologies

Basically, the main difference between general Question Answering (QA) and NLISW is the presence of a structured knowledge base with precisely described semantics in the latter.

Whereas QA must find an answer in a collection of unstructured data (typically text documents), NLISW has not only the ability to find facts in the knowledge base but also the ability to infer new facts using semantic reasoning. As the main vehicle for the knowledge base, ontologies are widely used (Damljanović & Bontcheva, 2009).

Consider the example question "How many five-star hotels are there in Prague downtown?" Even if the KB does not contain this particular piece of information explicitly, the question can be answered using an inference mechanism. On the other hand, most open-domain QA systems rely on a large set of documents that (probably) contain the desired answer in some textual form, and thus it would be rather impossible to answer this example question.

Given the above-mentioned distinction, the main field of use of NLISW is question answering on closed domains. Thus the essential requirement of any NLISW application is not only the presence of an ontology but also its quality. In this context, quality can mean, for example, completeness (i.e. how much knowledge of the particular domain is covered by the ontology), validity (i.e. whether the ontology instances fulfill the semantic constraints of the ontology), or uniqueness of entities (i.e. whether the same real-world object shares the same instance in the ontology), among others. These quality flaws can be avoided when the ontology is created from scratch, but many real-world applications either reuse existing data sources or merge various ontologies, and thus the ontology quality must be taken into account. See e.g. (Frank et al., 2007).
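The following sketch shows how the example question could be answered over such a KB as a structured query rather than a text search. The same invented vocabulary as above is reused, and a SPARQL 1.1 property path stands in, in a simplified way, for the semantic reasoning discussed here (a full system would typically rely on an OWL reasoner instead).

from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/accommodation#")  # hypothetical vocabulary
g = Graph()
g.add((EX.Downtown, EX.partOf, EX.Prague))           # district -> city
g.add((EX.GoldenWell, RDF.type, EX.Hotel))
g.add((EX.GoldenWell, EX.locatedIn, EX.Downtown))
g.add((EX.GoldenWell, EX.starRating, Literal(5)))

# The property path locatedIn/partOf* follows containment chains of any
# length, so a hotel located in a district of Prague is still counted.
query = """
SELECT (COUNT(?hotel) AS ?n) WHERE {
    ?hotel a ex:Hotel ;
           ex:starRating 5 ;
           ex:locatedIn/ex:partOf* ex:Prague .
}"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.n)  # 1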

2.3.2 Question Understanding

Generally, almost every existing NLISW system uses a unique combination of NLP tools and techniques, which will be discussed later. However, tokenization, NER, and syntactic parsing belong to the set of widely used methods. Many of the depicted systems follow the traditional processing chain consisting of preprocessing, syntactic parsing, and semantic processing, as shown in e.g. (Allen, 1995).

In the preprocessing step, the popular NLP toolkit GATE5 is used by FreyA (Damljanovic, Agatonovic, & Cunningham, 2010), AquaLog (Lopez, Uren, Motta, & Pasin, 2007), and PowerAqua (Lopez, Fernández, Motta, & Stieler, 2011). GATE provides tools for tokenization and entity matching (the so-called OntoRoot Gazetteer used in e.g. FreyA). It also comes with a framework for context-free grammars called JAPE grammars, which are used in AquaLog and PowerAqua. Moreover, similar preprocessing components can be found in Precise (Popescu et al., 2003) or NLP-reduce (Kaufmann, Bernstein, & Fischer, 2007).

5 http://gate.ac.uk

Tagging: I/PRP need/VBP a/DT cheap/JJ accommodation/NN in/IN Boston/NNP downtown/NN ./.

Parse:
(ROOT (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (JJ cheap) (NN accommodation)) (PP (IN in) (NP (NNP Boston) (NN downtown))))) (. .)))

Typed dependencies:
nsubj(need-2, I-1)
root(ROOT-0, need-2)
det(accommodation-5, a-3)
amod(accommodation-5, cheap-4)
dobj(need-2, accommodation-5)
prep(accommodation-5, in-6)
nn(downtown-8, Boston-7)
pobj(in-6, downtown-8)

Typed dependencies, collapsed:
nsubj(need-2, I-1)
root(ROOT-0, need-2)
det(accommodation-5, a-3)
amod(accommodation-5, cheap-4)
dobj(need-2, accommodation-5)
nn(downtown-8, Boston-7)
prep_in(accommodation-5, downtown-8)

Figure 2.1: An example output produced by the Stanford parser for the input sentence "I need a cheap accommodation in Boston downtown."

Many systems depend on a syntactic parser, namely FreyA, Orakel (Cimiano, Haase, Heizmann, Mantel, & Studer, 2008), Panto (C. Wang, Xiong, Zhou, & Yu, 2007), Precise, or the system introduced in (Frank et al., 2007). Two commonly used parsers are the Charniak parser (Charniak, 2000) and the Stanford parser (Klein & Manning, 2003). Figure 2.1 shows an example output from the Stanford parser for the testing sentence "I need a cheap accommodation in Boston downtown."

The need for syntactic preprocessing makes the adaptation of the systems harder, especially to languages that lack state-of-the-art tools for syntactic analysis, or whose syntax is too complex to be processed automatically with reasonable results.
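The typed dependencies in Figure 2.1 are plain text that downstream components must consume. The following small sketch turns such lines into (relation, head, dependent) tuples; the line format is assumed from the figure.

import re

DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dependency(line: str):
    """Parse one typed-dependency line, e.g.
    'amod(accommodation-5, cheap-4)' -> ('amod', 'accommodation', 'cheap')."""
    m = DEP.match(line.strip())
    if not m:
        return None
    relation, head, _, dependent, _ = m.groups()
    return relation, head, dependent

deps = ["nsubj(need-2, I-1)", "amod(accommodation-5, cheap-4)",
        "prep_in(accommodation-5, downtown-8)"]
print([parse_dependency(d) for d in deps])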

Models for Capturing Question Semantics

After preprocessing and/or parsing, the step of creating a semantic representation usually follows. The formalism for describing the semantics of the question is rather unique to each system, as are the algorithms creating such a representation. However, there are some common patterns in the used approaches.

Heuristic rules are a common approach for transforming the output of the previous step (e.g. a parse tree) into the semantic representation, as in e.g. PANTO, AquaLog, PowerAqua, NLP-reduce, and FreyA.

Ontology concepts or triplets serve as a framework for capturing the question semantics, as in e.g. FreyA, PANTO, AquaLog, and PowerAqua.

2.3.3 Knowledge Base Querying and Answer Representation

Once the semantic formalisation of the input question is obtained, it can be transformed into a particular query language and then executed against the KB. Two popular query languages in the Semantic Web are SPARQL (Prud'hommeaux & Seaborne, 2008) and SeRQL (Broekstra & Kampman, 2003). The transformation into the query language is quite straightforward in many NLISW systems and it is mostly rule-based; a sketch of such a transformation is given below.

The result of querying the KB is a sub-graph of the ontology RDF graph (Prud'hommeaux & Seaborne, 2008). This means that the required information is stored in the triplet form and it should be transformed into a more human-readable output. However, this step is system-specific and depends on the desired usability of a particular system—the answer can vary from a simple triplet representation to a full-sentence answer.
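The following sketch illustrates the kind of rule-based transformation described above. The intermediate triplet format and the rendering rules are assumptions made for illustration, not the formalism of any particular cited system.

# A hypothetical intermediate representation: ontology-compliant triplets,
# with '?'-prefixed items standing for the unknowns the user asks about.
def triplets_to_sparql(triplets):
    """Rule-based rendering of triplets into a SPARQL SELECT query."""
    variables = sorted({item for t in triplets for item in t
                        if item.startswith("?")})
    patterns = " .\n    ".join(f"{s} {p} {o}" for s, p, o in triplets)
    return f"SELECT {' '.join(variables)} WHERE {{\n    {patterns} .\n}}"

question_semantics = [
    ("?hotel", "rdf:type", "ex:Hotel"),
    ("?hotel", "ex:locatedIn", "ex:Prague"),
]
print(triplets_to_sparql(question_semantics))
# SELECT ?hotel WHERE {
#     ?hotel rdf:type ex:Hotel .
#     ?hotel ex:locatedIn ex:Prague .
# }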

2.4 Evaluation Metrics and Testing Datasets

Whereas in established NLP branches (e.g. ASR or NER, among others) standard testing datasets together with evaluation criteria are available, there is a lack of such datasets in the NLISW field. This is a fundamental issue, since it makes it impossible to compare the performance of various systems. A deep exploration of the existing systems and their evaluation uncovers large inconsistencies in the performance evaluation. As pointed out by (Damljanović & Bontcheva, 2009), more attention should be paid to creating a standard evaluation set. Two typical problems related to the evaluation are:

• There is no widely accepted agreement on what constitutes a good result. This is a very vague notion, as it stands for e.g. the recall of correct questions (in Orakel), the number of answered questions (in AquaLog), or even the number of queries generated as output (in Panto).

• Each system operates on a different ontology, and the ontologies vary in their size and complexity.

One of the widely used ontologies for NLISW evaluation is the Mooney Geography dataset. A deep quantitative analysis of this dataset was performed by (Cimiano & Minock, 2010). The dataset contains information about U.S. geography, such as cities, rivers, mountains, countries, etc. Originally, the dataset was in Prolog (consisting of about 1000 Prolog facts); later it was converted into OWL by (Cimiano & Minock, 2010). The question corpus consists of 880 user questions related to U.S. geography and it was collected from undergraduate students by the authors of (Tang & Mooney, 2001). Although many NLISW systems used this corpus for evaluation (e.g. Panto, Querix, NLP-reduce), the authors of (Cimiano et al., 2008) point out some interesting real-world issues. First, many systems 'cheated' in the evaluation on this dataset by hard-coding the meaning of adjectives and superlatives. Second, many systems claim to be portable, but they require at least a hand-crafted lexicon to map the queries to the ontology.

2.5 Overview of Existing Systems

This section sketches an overview of some existing NLISW systems. The list is not intended to be exhaustive; basically, it focuses on systems that have had a significant impact in the field, given their relevance as found in the literature.

2.5.1 PowerAqua and AquaLog

A very recent system called PowerAqua (Lopez et al., 2011) is a state-of-the-art ontology-based QA system which goes beyond traditional NLISW systems by managing multiple ontology sources and by its scalability. Since its NL processing module remains the same as in the previous system, AquaLog (Lopez et al., 2007), we will further discuss the AquaLog system.

AquaLog claims to be a portable NLISW system which handles user queries in a natural language (English) and returns answers inferred from a knowledge base. It consists of various components, namely: a linguistic component for transforming the natural language input into query triplets, a relation similarity service for generating ontology-compliant triplets, and a learning component to improve the system capabilities over time and incorporate a particular user-specific jargon. The architecture can be characterised as a cascade model where each component is fed with the output of the previous one. Basically, AquaLog provides portability to different domains, limited only by the given ontology. Its authors suggest that only a little tuning must be performed before adapting the system to another domain.

However, an examination of the system uncovers many practical limitations. Firstly, the linguistic component is heavily dependent on syntax. The input query is processed by the GATE libraries, namely the tokenizer, the sentence splitter, the POS tagger, and the VP chunker. Furthermore, the question types are recognized by a set of hand-written JAPE grammars, which are based upon regular expressions. The principal disadvantage of using the JAPE grammars is that they can deal only with a subset of a natural language and, thus, complicate porting of the system.

Secondly, the relation between terms and their corresponding ontology concepts must be defined manually in advance. Practically, this means that the terms "who", "where" and "when" correspond to the ontology terms "person/organisation", "location" and "timeposition", respectively. Furthermore, the so-called pretty names, which are alternative names of an instance, must also be listed. A textual (or WordNet) similarity between the query terms and the ontology is also assumed.

Thirdly, the transformation of query triplets into ontology triplets is based upon hard-coded heuristic rules. Furthermore, AquaLog can only deal with questions that generate a maximum of two triplets.

2.5.2 ORAKEL

ORAKEL (Cimiano et al., 2008) is an ontology-based question answering system. It accepts English factoid questions and translates them into first-order logic forms. This conversion uses full syntactic parsing and a compositional semantics approach. ORAKEL can be ported to another domain, but such porting requires a domain expert to create a domain-dependent lexicon. The lexicon is used for an exact mapping from natural language constructs onto ontology entities.

A big drawback of ORAKEL's approach is the necessity of creating the lexicon manually. Another one is that the system can neither handle ungrammatical questions nor deal with unknown words. This makes the system unusable in any real-world environment.

2.5.3 FREyA

The FREyA system (Damljanovic et al., 2010) is a NLISW system that combines syntactic parsing with ontology reasoning. It derives parse trees of English input questions and uses heuristic rules to find a set of potential ontology concepts (for mapping from question terms onto ontology concepts) using GATE and the OntoRoot Gazetteer (Cunningham et al., 2011). The primary source for the question understanding is the ontology itself. If the system encounters ambiguities, a clarification dialogue is offered to the user. The potential ontology concepts retrieved from the question analysis are then transformed into SPARQL (presumably using some rules; this step is not described clearly). The system has been tested on the Mooney Geography dataset with 250 questions, with results comparable to other systems evaluated on this domain.

2.5.4 PANTO

A portable QA system called PANTO (C. Wang et al., 2007) is based upon the off-the-shelf statistical Stanford parser and integrates tools like WordNet and various metric algorithms to map the NL question terms to an intermediate representation called QueryTriplets. This semantic description is then mapped onto so-called OntoTriplets, which are connected to entities from the underlying ontology. This step involves a set of 11 heuristic mapping rules. Finally, OntoTriplets are represented as SPARQL queries. The PANTO architecture also contains a lexicon which helps with matching NL words to ontology classes, relations, and instances.

The main idea of transforming a NL question into triplets in PANTO is based upon the authors' observation that two nominal phrases from a parse tree are expected to be mapped onto a triplet in the ontology. The system was evaluated on the Mooney dataset; the output of PANTO was compared to manually generated SPARQL queries. One of the drawbacks of the system is its dependence on full syntactic parsing.

2.5.5 QACID

The ontology-based QA system called QACID, introduced in (Ferrández, Izquierdo, Ferrández, & Vicedo, 2009), covers the cinema/movie domain and its target language is Spanish. It consists of two main components: the user query formulation database and the textual-entailment engine. Whereas the former component serves mainly for development and system training purposes, the latter one is intended for unknown query processing.

As a KB, an OWL cinema ontology was used. This ontology consists of a few classes and relations and was later populated with instances. The populated ontology served as a source for creating a lexicon (which is, in fact, a simple mapping between terms and their ontology concepts, e.g. "Real movie name" and "[MOVIE CLASS]").

The core of the QACID system is the database of query formulations. The database contains a set of 54 clusters; each cluster represents one type of question and has a representative query pattern which was derived from a set of training data. Each cluster is also associated with one SPARQL query.

When processing an unknown question, the system replaces all words that appear both in the sentence and in the lexicon with their ontology concepts. Then the module tries to match the input question to a particular cluster from the query formulation database (this is actually a classification of the question). In this task, the question is treated as a bag-of-words and various lexical metrics are used for the classification. The entailment module also takes into account ontological attributes and relations of concepts identified in the question.

As pointed out in the QACID evaluation, the system is not able to handle unknown ontology concepts. In other words, if the user poses a question using terms that are not present in the lexicon, the system fails. Other errors are encountered when the query formulation database does not contain an appropriate cluster for the given question (a problem similar to out-of-coverage questions). Moreover, all the given question examples contain only one semantic concept, and thus it is not clear whether the system is able to cope with more difficult questions involving more ontology relations (see e.g. the Mooney Geography dataset in Section 2.4 for examples of such questions). We suspect that the approach based upon bag-of-words textual entailment would fail in this task. Finally, the system is not capable of handling temporal questions, which is, from our point of view, somewhat startling in the case of a QA system for the cinema domain (this would definitely be a handicap when deploying the system for real-world usage).

2.5.6 QUETAL QA

A domain-restricted QA system called QUETAL (Frank et al., 2007) exploits robust semantic analysis on a hybrid NLP architecture. The system was developed to answer NL questions in two domains: Nobel prizes and information about a language technology institute. For these two domains, the ontologies were converted from existing ontologies and merged with other, more general ontologies (e.g. SUMO for the Nobel prizes). Unfortunately, the article shows only some concepts from the ontologies and presents them in a textual way, so it is not possible to assess the complexity of the ontologies. The authors also admit that the source LT ontology did not respect domain and range properties.

Question analysis is performed by an HPSG (Head-driven Phrase Structure Grammar; see e.g. (Pollard & Sag, 1988) or (Pollard & Sag, 1994)) syntactic and semantic parser with the support of the Heart of Gold architecture6, which provides shallow NER. The HPSG parser uses grammars for English and German, and the output of the parser is a formalism called Minimal Recursion Semantics.

6 Heart of Gold (HoG) is an open source middleware for combining shallow and deep NLP components (last release from 2008), http://heartofgold.dfki.de/

Minimal Recursion Semantics is a flat, non-recursive semantic representation format (Copestake, Flickinger, Pollard, & Sag, 2005). This question representation is then transformed into so-called proto queries, which are suitable for querying a particular KB. The system was evaluated on a set of 100 English questions in the Nobel prize domain only.

There is one notable limitation of the system: to adapt it to another language, there must exist a wide-coverage HPSG grammar for that language.

2.5.7 Other related work

QUICK: Instead of supporting full NL questions, the QUery Intent Constructor for Keywords (QUICK) system guides the users in constructing a semantic query from keywords (Zenz, Zhou, Minack, Siberski, & Nejdl, 2009). After typing the keywords, the system analyses them according to the underlying ontology and allows the user to choose the required interpretation in a clarification dialogue. The performance was tested using two ontologies (a movie database and a song database) and 175 queries consisting of 2-5 keywords. Although QUICK is not a fully-fledged NLI system, it can be viewed as a trade-off between the simple keyword search and the NL question systems. A good overview of other state-of-the-art keyword-based semantic search systems is given in (Fazzinga & Lukasiewicz, 2010).

CINDI: The system in (Stratica, Kosseim, & Desai, 2005) translates English questions into SQL queries and is based upon syntactic analysis and simple hand-written templates. It was tested on a single domain, but the authors provide only a very limited evaluation.

An example of a controlled language NLIDB is introduced in (Hallett, 2006). This system does not require (which means, actually, does not allow) the user to input NL questions; instead, it assists in composing the query, starting with editing a basic query frame; the frames are automatically inferred from the database structure. The system was tested using GEOBASE, with average results.

Chapter 3

Statistical Methods for Natural Language Understanding∗

One of the key components of any system dealing with natural language input is a Natural Language Understanding (NLU) component. This chapter explores the state of the art of NLU, viewed primarily from the Spoken Language Understanding (SLU) perspective.

Chapter Outline

• Section 3.2 focuses on sequential models for semantic analysis. These models use a non-hierarchical representation of semantics.

• Models capturing hierarchical semantic dependencies are discussed in section 3.3, followed by some other approaches to semantic parsing in section 3.4.

• Section 3.5 describes the evaluation of semantic analysis systems.

• An overview of existing corpora and systems is provided in sections 3.6 and 3.7.

NLU on the Semantic Web. It is worth mentioning that NLU and the Semantic Web have emerged from different historical and technical backgrounds. While NLU has played a crucial role since the first human-computer dialogue applications, the main idea of the Semantic Web was to enable a formalised exchange of web data and to make it accessible both for humans and computers.

∗ This chapter is an edited and updated version of the author's previous technical report (Habernal, 2009).


Figure 3.1: A schema of an SLU system: automatic speech recognition converts the acoustic signal into text or a lattice, and semantic analysis turns the text into a semantic frame (with slots) or a semantic tree.

Nevertheless, many recent Semantic Web applications require some sort of semantic understanding of the user question in order to return an answer from a knowledge base.

Before we present a thorough overview of the existing NLU techniques, we will draw one important conclusion. Although NLISW requires almost the same functionality in terms of understanding the question, many of the state-of-the-art NLISW systems are based upon approaches from the 1990s—namely syntactic parsing or hand-written heuristic rules (as shown in Section 2.3.2). However, the research in the NLU field has moved dramatically towards statistical or hybrid methods and machine learning approaches in the past decade.

3.1 Introduction to NLU

The goal of a Natural Language Understanding (NLU) system is to extract the meaning from natural speech. Although there is a difference between SLU and NLU in the sense of the input (an audio signal for SLU systems, a text for NLU systems), we will not distinguish between these terms too strictly. The main reason is that many of the discussed approaches and systems are originally focused on the whole SLU system, even though they deal with semantic analysis and assume a textual representation as the input. See also the terminology discussion in Section 1.2. A schematic example of an SLU system is shown in Fig. 3.1.

3.1.1 Basic Approaches

Early NLU systems were mostly based upon an expert approach, which means that the system was written entirely by a system designer (an expert). The syntax-driven semantic analysis (Allen, 1995) uses a hand-written context-free grammar (CFG) for syntactic analysis. The first-order predicate calculus (FOPC) is used for meaning representation. Later, semantic grammars (Jurafsky & Martin, 2008) were based upon CFGs and described the semantic hierarchy rather than the language syntax.

However, expert-based systems have many limitations. The first disadvantage is the very high cost of creating such a system, because the grammars must be written by an expert. Such a system can also cover only a limited domain and it lacks portability.1 Although expert-based systems are still being used, the research has recently moved towards systems based upon machine learning and statistical models. The main advantage of such systems is the ability to learn from data.

3.1.2 Semantic Representation

The output of a NLU system is a context-independent2 semantic representation. A commonly used representation of the semantics is the frame-based representation.

In a frame-based NLU system, the semantic representation of an application domain can be defined in terms of semantic frames. Each frame contains several components called slots. The NLU system fills the slots with appropriate content, and the meaning of an input sentence is an instantiation of the semantic frame. Some NLU systems do not allow a hierarchy of slots; in such a case, the semantic representation is a flat concept representation (or attribute-value pairs). While the flat concept representation is simpler and may result in a simpler statistical model, the hierarchical (tree-based) representation is more expressive and can deal with long dependencies.
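For illustration, a frame for the weather question used later in Section 3.2.2 might look as follows. The slot names are invented, and Python dictionaries merely stand in for the frame formalism.

# A flat frame is essentially a set of attribute-value pairs; a hierarchical
# frame lets a slot hold a nested sub-frame (slot names are illustrative).
flat_frame = {
    "goal": "WEATHER_REQUEST",
    "place": "Pilsen",
    "date": "tomorrow",
    "time": "morning",
}

hierarchical_frame = {
    "goal": "WEATHER_REQUEST",
    "place": "Pilsen",
    "when": {               # nested sub-frame groups related slots
        "date": "tomorrow",
        "time": "morning",
    },
}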

3.2 Sequential models

3.2.1 Hidden understanding model

The Hidden Understanding Model (HUM) (Schwartz, Miller, Stallard, & Makhoul, 1997) was motivated by Hidden Markov Models (HMMs), which have been successful in speech recognition. Because of the differences between speech recognition and language understanding, significant changes to the HMM methodology are required. (Miller, Bobrow, Ingria, & Schwartz, 1994) proposes the following requirements for HUM-based systems:

1 Porting means adapting the system to a different domain.
2 It does not depend either on the history or on the context.

• A system for expressing the meaning.

• A statistical system capable of capturing associations between words and their meaning.

• A training algorithm for estimating the parameters of the model from annotated data.

• An algorithm for performing the search for the most likely meaning given a word sequence.

The key requirement for a hidden understanding model concerns the properties of the meaning representation: it should be both precise and appropriate for automatic learning techniques. (Miller et al., 1994) requires a meaning representation which is:

Expressive. For all sentences that appear in the application, the formalism must be able to express the meaning.

Annotable. It must be possible to create annotations of a large corpus with low human effort.

Trainable. The system must be able to train the parameters from annotated training examples.

Tractable. There must be an efficient algorithm for performing the search over the meaning space.

Statistical HUM

Let $\mathcal{M}$ be the meaning space, $V$ the vocabulary, and $W = (w_1, \ldots, w_T)$ the word sequence, where $w_t \in V$. Then the problem of understanding can be viewed as recovering the most likely meaning structure $M \in \mathcal{M}$ given the word sequence $W$:

$$\hat{M} = \operatorname*{argmax}_{M} P(M|W) \qquad (3.1)$$

Using Bayes' rule, $P(M|W)$ can be rewritten as

$$P(M|W) = \frac{P(W|M)\,P(M)}{P(W)} \qquad (3.2)$$

where $P(W|M)$ is the lexical realization model and $P(M)$ is the semantic language model. Since $P(W)$ does not depend on $M$, it can be ignored when maximizing $P(M|W)$. There is an analogy with HMMs, because only the words can be observed; the internal states of each of the two models are unseen and must be inferred from the words.

3.2.2 Flat-concept parsing

A semantic decoding approach inspired by the HUM (Section 3.2.1) is the Finite State Tagger (FST) (He & Young, 2005) (or the flat-concept model in (Young, 2002)). It assumes that each word w from the sentence is labelled with a semantic concept c. For example, the sentence "What will be the weather in Pilsen tomorrow morning?" might be decoded in bracket notation as:

WEATHERREQ(weather) PLACE(Pilsen) DATE(tomorrow) TIME(morning)

Irrelevant words are labelled with a dummy concept and later discarded from the semantic annotation. The formal definition of the model follows.

Given the semantic concept space $\mathcal{T}$, we can assume that each word $w_t$ is tagged with one semantic concept $c_t$, so that the whole sequence is $C = (c_1, \ldots, c_T)$, where $c_t \in \mathcal{T}$. The FST model can then be described as follows:

$$P(W|C)\,P(C) = \prod_{t=1}^{T} P(w_t \mid w_{t-1} \ldots w_1, c_t) \prod_{t=1}^{T} P(c_t \mid c_{t-1} \ldots c_1) \qquad (3.3)$$

where $P(W|C)$ is the probability of generating word $w_t$ given the word history $w_{t-1} \ldots w_1$ and the corresponding current concept $c_t$. It is called a lexical model (or a lexical realization model in the HUM). The semantic model $P(C)$ is the probability of generating the current concept $c_t$ given the concept history $c_{t-1} \ldots c_1$.

This model uses an unlimited history of words and concepts. In practical applications, the history is truncated to limited sizes $n$ and $m$. The model is then approximated by the following formula:

$$P(W|C)\,P(C) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1} \ldots w_{t-n+1}, c_t) \prod_{t=1}^{T} P(c_t \mid c_{t-1} \ldots c_{t-m+1}) \qquad (3.4)$$

For the special case $n = 1$ and $m = 2$, we can rewrite the formula as:

$$P(W|C)\,P(C) = \prod_{t=1}^{T} P(w_t|c_t) \prod_{t=1}^{T} P(c_t|c_{t-1})$$

which becomes a conventional first-order Markov model, where the state transitions are modeled as concept bigram probabilities and the words are modeled as unigrams conditioned on the concept $c_t$. An example of the model is shown in Fig. 3.2; a decoding sketch for this special case follows the figure.

Figure 3.2: Discrete Markov model representation of semantics. The question "At what time does the last train go from Pilsen to Prague?" is decoded into the concept sequence TIMEREQ _ TYPE _ ORIGIN DEST, where _ denotes the dummy concept.
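The following is a minimal Viterbi decoder for the first-order special case above, written in Python. The toy concepts and probability tables are invented for illustration; a real model would estimate them from an annotated corpus.

import math

def viterbi(words, concepts, lex_prob, trans_prob, start="<s>"):
    """Find argmax_C of prod_t P(w_t|c_t) P(c_t|c_{t-1}) in log space."""
    # best[c] = (log probability of the best path ending in c, that path)
    best = {c: (math.log(trans_prob.get((start, c), 1e-12))
                + math.log(lex_prob.get((c, words[0]), 1e-12)), [c])
            for c in concepts}
    for w in words[1:]:
        best = {c: max(((lp + math.log(trans_prob.get((prev, c), 1e-12))
                         + math.log(lex_prob.get((c, w), 1e-12)), path + [c])
                        for prev, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for c in concepts}
    return max(best.values(), key=lambda x: x[0])[1]

concepts = ["TIMEREQ", "_", "ORIGIN", "DEST"]
lex_prob = {("TIMEREQ", "when"): 0.9, ("_", "goes"): 0.8,
            ("ORIGIN", "pilsen"): 0.7, ("DEST", "prague"): 0.7}
trans_prob = {("<s>", "TIMEREQ"): 0.5, ("TIMEREQ", "_"): 0.5,
              ("_", "ORIGIN"): 0.4, ("ORIGIN", "DEST"): 0.6}
print(viterbi("when goes pilsen prague".split(), concepts,
              lex_prob, trans_prob))  # ['TIMEREQ', '_', 'ORIGIN', 'DEST']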

3.2.3 Conditional Random Fields

The NLU problem can also be stated as a sequential supervised learning problem (Nguyen, Shimazu, & Phan, 2006). The result of the classifier is a semantic label sequence which can be transformed into slot/value pairs. In such sequential processing, the hierarchy of the semantic representation can be encoded in the slot description; for example, a two-level hierarchical slot can be treated as one flattened slot simply by concatenating the two concepts.

Let $D = \{(X^i, Y^i)\}_{i=1,\ldots,N}$ be a set of $N$ training examples where each example is a pair of sequences $(X^i, Y^i)$. Here $X^i = \langle x^i_1, \ldots, x^i_{T_i} \rangle$ is a feature vector sequence and $Y^i = \langle y^i_1, \ldots, y^i_{T_i} \rangle$ is a label sequence. For example, $X$ can be a sequence of words and $Y$ the sequence of corresponding semantic labels. We also assume that $X$ and $Y$ are of the same length. The goal of a classifier $h$ is to find the most probable semantic label sequence given the input:

\hat{Y} = \operatorname*{argmax}_{Y} h(Y, X, \Lambda) \qquad (3.5)

where Λ = ⟨λ_1, ..., λ_K⟩ denotes the parameter vector of size K. Linear-chain Conditional Random Fields (CRFs) are conditional probability distributions over label sequences that are conditioned on an input sequence (Y.-Y. Wang, Deng, & Acero, 2005), (C. Raymond & Riccardi, 2007), (Nguyen et al., 2006). Formally, linear-chain CRFs are defined as follows:

p_{\Lambda}(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, X) \qquad (3.6)

where X is the random vector that is observed, Y is the vector we want to predict, and Ψ is a local potential ((Jeong & Lee, 2008), also called a feature function in (Sha & Pereira, 2003)). Z(X) is a normalization function which ensures that the probabilities of all state sequences sum to one. Typically, Ψ consists of a set of feature functions f_k.

In general, the feature function f_k(y_t, y_{t-1}, X) is an arbitrary linguistic function. In most cases a feature depends on the inputs around the given position t, although it may also depend on global properties of the input (Jeong & Lee, 2008). Formally, a feature can encode any aspect of a state transition, f_k(y_{t-1}, y_t), and of the observation, f_k(y_t, x_t), centered at position t.

For example, the feature function f_k(y_t, x_t) can be a binary function which yields 1 if y_t = ’TOLOC.CITY_NAME-B’ and the current word is ’chicago’. The higher the value of the associated weight λ_k, the more likely this event is.
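As a sketch, such a binary feature might look as follows in Python (the label string and tokenization are illustrative; real CRF toolkits generate thousands of such features from templates):

def f_k(y_t: str, y_prev: str, x: list, t: int) -> int:
    """Fires iff the current label is TOLOC.CITY_NAME-B and the word is 'chicago'.
    A transition feature would additionally inspect y_prev."""
    return 1 if y_t == "TOLOC.CITY_NAME-B" and x[t] == "chicago" else 0

# In Eq. 3.6 each local potential combines such features with their weights:
#   Psi_t(y_t, y_{t-1}, X) = exp( sum_k lambda_k * f_k(y_t, y_{t-1}, X, t) )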

3.3 Stochastic Semantic Parsing

3.3.1 Preprocessing

Most of the stochastic semantic parsers introduced in this chapter need a preprocessing step (He & Young, 2006b), (Pla, Molina, Sanchis, Segarra, & García, 2001), (Wong & Mooney, 2006). Since the terminology for this task is not settled, the problem is variously called shallow semantic parsing (Pradhan, Ward, Hacioglu, Martin, & Jurafsky, 2004), lexical class analysis (Konopík & Habernal, 2009), named entity recognition (Tjong Kim Sang & De Meulder, 2003) or NP-chunking (Nguyen et al., 2006). In essence, these methods attempt to identify semantically meaningful units (words or word groups) in an input sentence and assign semantic labels to them. This definition of semantic classes of words is necessary in order to obtain high-coverage models from the given data (Pla et al., 2001). There is a strong parallel with the stochastic approach to text tagging. Depending on the approach, the result of this step may vary from a set of lexical classes to a lattice in which the semantic labels are assigned to words with some probability. Some algorithms dealing with this problem were described earlier in this chapter, namely the finite state tagger (FST) in section 3.2.2 and conditional random fields (CRF) in section 3.2.3.

3.3.2 Probabilistic semantic grammars

The flat-concept parser described in section 3.2.2 has limited expressive ability. Its most important drawback is that the model cannot capture any hierarchical structure that would group related concepts under a covering semantic concept. Long-distance dependency problems (see Fig. 3.3) also cannot be described well by the FST model. One possible solution is to use a more complex model, e.g. the probabilistic context-free grammar (PCFG) model.

... fly from Madrid to Prague on monday ... depart.day

... return from Madrid to Prague on monday ... return.day

Figure 3.3: An illustration of the long-distance dependency problem of flat models using limited history.

The lexical model P(W|C) can be obtained using the above-mentioned FST model (or the other models from section 3.2). However, the probabilities of the semantic model P(C) are computed recursively and are more complex. Let C be the decoded semantic tree and P(c, i, j) the probability that the concept c covers the sub-concepts (preterminal nodes) from index i to j. For N preterminal concepts c_1 ... c_N, the probability is P(C) = P(s, 1, N), where s is the root concept of the semantic tree. Let P(c → c_1 ... c_Q) be the probability that the concept c directly generates the sequence of concepts c_1 ... c_Q. Then the inside probability is recursively formulated as:

P(c, i, j) = \sum_{Q \le j-i+1} \; \sum_{C_1^Q \in \{c^*\}} \; \sum_{I_0^Q \in \{(i \ldots j)^*\}} P(c \to c_1 \ldots c_Q) \prod_{q=1}^{Q} P(c_q, I(q-1), I(q)) \qquad (3.7)

where I_0^Q represents a set of Q integers that split the sequence c_i ... c_j into Q sub-sequences such that I(0) = i and I(Q) = j. The previous formula applies to unrestricted branching; in the case of binary branching, however, it simplifies considerably to:

P(c, i, j) = \sum_{c_l, c_r} \sum_{t=i}^{j-1} P(c \to c_l c_r)\, P(c_l, i, t)\, P(c_r, t+1, j) \qquad (3.8)

Formula 3.8 is the inside probability of the Inside-Outside algorithm, a modification of the EM algorithm for parameter estimation. However, PCFG models suffer from a variety of theoretical and practical problems, such as normalization problems. Also, their recursive nature makes them computationally intractable (Young, 2002).
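The binary-branching recursion of Eq. 3.8 can be computed bottom-up over increasing span lengths, which avoids naive recursion. The following sketch uses a tiny hand-written grammar as an illustrative assumption; in practice the rule probabilities P(c → c_l c_r) and the preterminal probabilities P(c, i, i) come from training.

from collections import defaultdict

RULES = {  # P(c -> (cl, cr)); toy grammar, not from the thesis
    ("TRAVELREQ", ("ORIGIN", "DEST")): 1.0,
    ("ORIGIN", ("FROM", "CITY")): 1.0,
    ("DEST", ("TO", "CITY")): 1.0,
}

def inside(preterm_probs, n):
    """preterm_probs[c][i] = P(c, i, i); returns the full table P[(c, i, j)]."""
    P = defaultdict(float)
    for c, probs in preterm_probs.items():
        for i, p in probs.items():
            P[(c, i, i)] = p
    for span in range(2, n + 1):                      # widen spans bottom-up
        for i in range(1, n - span + 2):
            j = i + span - 1
            for (c, (cl, cr)), p_rule in RULES.items():
                P[(c, i, j)] += sum(                  # Eq. 3.8
                    p_rule * P[(cl, i, t)] * P[(cr, t + 1, j)]
                    for t in range(i, j))
    return P

# "from Pilsen to Prague" -> word positions 1..4
preterms = {"FROM": {1: 1.0}, "CITY": {2: 0.9, 4: 0.9}, "TO": {3: 1.0}}
P = inside(preterms, 4)
print(P[("TRAVELREQ", 1, 4)])   # probability that the root covers positions 1..4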

[Figure 3.4: A vector-state model and its equivalent tree representation for the sentence “At what time does the last train go from Pilsen to Prague?”. The tree has the root TRAVELREQ with sub-concepts TIMEREQ, TYPE (TRAIN), ORIGIN (CITY) and DEST (CITY); the equivalent vector states are stacks such as (TIMEREQ, TRAVELREQ) or (CITY, ORIGIN, TRAVELREQ), one per word.]

3.3.3 Vector-state Markov model

The first-order Markov model used in the FST was described in section 3.2.2. Although this model is mathematically simple and easy to train, its major disadvantage is its inability to capture the hierarchy of semantics, which leads, among other issues, to the long-distance dependency problems. On the other hand, the PCFG introduced in the previous section is overly complex, especially for training and probability estimation. To improve the simple first-order model, the vector-state Markov model was proposed in (He & Young, 2005). It is still a discrete HMM, but each state is actually the stack of a push-down automaton with a limited size, consisting of semantic concepts. Thus, the vector-state Markov model with an unlimited stack depth is equivalent to a PCFG, as shown in Fig. 3.4. The model is right-branching, meaning that the branches of the semantic tree grow in the left-to-right direction. The probability of a semantic parse tree C (which is actually a sequence of stacks of concepts) given the input sentence W can be formalised as follows:

P(N, C, W) = \prod_{t=1}^{T} P(n_t \mid W_{1 \ldots t-1}, C_{1 \ldots t-1}) \cdot P(c_t[1] \mid W_{1 \ldots t-1}, C_{1 \ldots t-1}, n_t) \cdot P(w_t \mid W_{1 \ldots t-1}, C_{1 \ldots t}) \qquad (3.9)

where

• W_{1...t-1} is the word history up to t − 1,

• C_{1...t} denotes the history of concepts (c_1 ... c_t). Each vector state c_t at position t is a vector (or a stack) of D_t semantic concept labels, where D_t is the stack depth. In more detail, c_t = (c_t[1], c_t[2], ..., c_t[D_t]), where c_t[D_t] is the root concept and c_t[1] is the preterminal concept (the concept immediately above the word w_t),

• n_t is the number of stack pop operations, taking values in the range ⟨0, ..., D_{t-1}⟩,

• W_{1...t-1}, C_{1...t-1} denote the previous parse up to position t − 1,

• c_t[1] is the new preterminal semantic concept at position t assigned to the word w_t.

Each transition of the model is restricted to the following operations, corresponding to the probabilities in Eq. 3.9: (i) pop n_t concept labels from the stack, (ii) generate a new preterminal concept, and (iii) generate a word. Given n_t, the number of semantic concepts popped off the stack, the transition from position t − 1 to t under the preterminal concept c_{w_t} for the word w_t can be described as:

• pushing the concept onto the stack

c_t[1] = c_{w_t} \qquad (3.10)

• copying the previous stack to the current stack, skipping the n_t concepts that have been popped at position t

c_t[2 \ldots D_t] = c_{t-1}[(n_t + 1) \ldots D_{t-1}] \qquad (3.11)

• adjusting the stack depth at position t

D_t = D_{t-1} + 1 - n_t \qquad (3.12)

Depending on the value of n_t, the stack changes as follows. For n_t = 0, the stack grows by one semantic concept (no concept has been popped). The case n_t = 1 corresponds to replacing the preterminal concept with a new concept. For n_t > 1, the stack shrinks by popping more concepts. The general model described in Eq. 3.9 depends on an unlimited history of concepts and words. In (He & Young, 2005) the history is truncated: only the previous semantic concept stack is used and the word history is ignored. The equations are then approximated by

[Figure 3.5: A vector-state Markov model for the sentence “At what time does the last train go from Pilsen to Prague?”: one concept stack per word, each with the root TRAVELREQ at the bottom.]

P(n_t \mid W_{1 \ldots t-1}, C_{1 \ldots t-1}) \approx P(n_t \mid c_{t-1}) \qquad (3.13)
P(c_t[1] \mid W_{1 \ldots t-1}, C_{1 \ldots t-1}, n_t) \approx P(c_t[1] \mid c_t[2 \ldots D_t]) \qquad (3.14)
P(w_t \mid W_{1 \ldots t-1}, C_{1 \ldots t}) \approx P(w_t \mid c_t) \qquad (3.15)

The vector-state Markov model is shown in Fig. 3.5.
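The stack operations of Eqs. 3.10-3.12 are simple enough to sketch directly. The concept names follow Fig. 3.4; the function below is a toy illustration, not the parser's actual implementation.

def transition(prev_stack: tuple, n_t: int, c_wt: str) -> tuple:
    """Build the vector state c_t from c_{t-1}: pop n_t concepts, push c_wt."""
    assert 0 <= n_t <= len(prev_stack)          # n_t lies in <0, D_{t-1}>
    kept = prev_stack[n_t:]                     # Eq. 3.11: copy, skipping popped concepts
    return (c_wt,) + kept                       # Eq. 3.10; the new depth obeys Eq. 3.12

# Fragment of Fig. 3.4 ("... from Pilsen to Prague"); index 0 is the preterminal.
s = ("ORIGIN", "TRAVELREQ")                     # at the word "from"
s = transition(s, 0, "CITY")                    # "Pilsen": n_t = 0, the stack grows
s = transition(s, 2, "DEST")                    # "to": pop CITY and ORIGIN, push DEST
s = transition(s, 0, "CITY")                    # "Prague"
print(s)                                        # ('CITY', 'DEST', 'TRAVELREQ')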

3.3.4 Hidden Vector-state Markov model

In the previous section, the basic vector-state Markov model was introduced. To train such a model, a fully annotated corpus with aligned preterminal concepts is required; training is then simply a matter of counting events and smoothing the model. The hidden vector-state (HVS) Markov model, in contrast, is based only upon unaligned abstract annotation: each training sentence is annotated with a semantic concept hierarchy, but the association between the preterminal concept layer and the words is not captured (this is, in fact, the analogue of an HMM in ASR, where the observations are seen but the states of the model are hidden). Two approaches to training such a model are introduced below. The first is based upon MLE³ and the model parameters are estimated using the Expectation-Maximization (EM) algorithm. The second uses discriminative training. There are, however, some prerequisites for both methods. First, there must be some a priori knowledge of the domain, namely the lexical classes (see section 3.3.1). Second, an abstract semantic annotation must be provided for each sentence. These annotations are made by human annotators.

MLE training of the HVS model. The purpose of training the HVS-based parser is to find the model parameter set λ = {C, N} which maximizes the likelihood of the training data. This is done by maximizing some objective function R(λ). The most commonly used parameter estimation method is maximum likelihood estimation (MLE). MLE makes a number of assumptions that cannot be met in practice: that the global likelihood maximum can be found, that the observations come from a known family of distributions, and that the training data is unlimited. Thus, it is not guaranteed that the MLE-trained model will yield optimal results.

³Maximum Likelihood Estimation.

Discriminative training of the HVS model. As described earlier, MLE is used for generative statistical training, where only the correct models are taken into account during parameter estimation. (D. Zhou & He, 2009) proposes a method for training the generative model using a discriminative optimization criterion: not only is the likelihood of the correct model increased, but the likelihood of incorrect models is decreased as well. In discriminative HVS model training, the model is trained to separate the correct parse from the incorrect parses. The trained model is then used to parse the training sentences again, and the training procedure repeats. This approach is based upon the generalized probabilistic descent algorithm (D. Zhou & He, 2009).

Extended HVS parser

(Jurčíček, 2007) introduced some extensions to the basic HVS semantic parser, namely left-right-branching parsing and input parametrization.

Left-right-branching parsing. The original semantic model of the HVS parser (see Eq. 3.9) allows pushing only one concept c_t[1] onto the stack. To enable pushing either one concept or no concept, a new hidden variable push is inserted into the model. Another modification of the basic HVS model described in (Jurčíček, 2007) is the possibility of pushing two concepts at the same time. It has been experimentally verified that pushing more than two concepts does not affect the results significantly; moreover, this limitation keeps the model simple.

Input parametrization. In the basic HVS model, the input W is a word sequence. However, (Švec, Jurčíček, & Müller, 2007) proposed extending the input with additional information such as lemmas or morphological tags.

Using lemmatized input means that each input word w_i is replaced by its lemma. The resulting reduction of the vocabulary reduces the number of parameters to be estimated and improves the model's robustness. Nevertheless, the discrimination ability of such a model decreases.

3.3.5 Context-based Stochastic Parser

The semantic parser proposed in (Konopík, 2009) is a hybrid stochastic semantic parser. The training data consists of annotated sentences in which the preterminal concepts (or lexical classes) are aligned to the input words (this annotation methodology is similar to that of the vector-state parser introduced in section 3.3.3). The model parameters are then estimated using MLE:

P(N \to \alpha \mid N) = \frac{\mathrm{Count}(N \to \alpha)}{\sum_{\gamma} \mathrm{Count}(N \to \gamma)} \qquad (3.16)

where N → α means that in the data, the non-terminal N is rewritten to α. Moreover, the word context of each tree node is taken into account. The context is defined as the words before and the words after the span of a sub-tree. The probability of a context word given a non-terminal is then estimated by smoothed MLE as:

P(w \mid N) = \frac{\mathrm{Count}(w, N) + \lambda}{\sum_{i} \mathrm{Count}(w_i, N) + \lambda V} \qquad (3.17)

where w is the current context word of non-terminal N, the w_i are all context words, λ is a smoothing constant and V is an estimate of the vocabulary size. For the estimate of the theme probability (in this model the theme is the root concept of the semantic tree), there is an additional formula:

P(w \mid S) = \frac{\mathrm{Count}(w, S) + \kappa}{\sum_{i} \mathrm{Count}(w_i, S) + \kappa V} \qquad (3.18)

where S is the root concept (the theme), the w_i are the words of the sentence and κ is a smoothing constant. Once the model is trained, the parser performs two steps. First, shallow-parsing algorithms (see section 3.3.1) are used to identify lexical classes; in this system, the shallow parser is based upon CFGs for generic lexical classes such as dates, times, numbers, etc., and upon vocabulary methods for proper names, etc. Second, a stochastic bottom-up chart parser is used to create parse trees and to compute their probabilities as follows:

P(T) = \sum_{i} P(w_i \mid N)\, P(N \to A_1 \ldots A_k \mid N) \prod_{j} P(T_j) \qquad (3.19)

where N is the top non-terminal of the sub-tree T, A_1 ... A_k are the terminals or non-terminals expanded from N, and T_j is the sub-tree having the non-terminal A_j on top. The best parse is then the one with the highest probability:

P(\hat{T}) = \operatorname*{argmax}_{i} P(S_i) \prod_{j} P(w_j \mid S_i)\, P(T_j) \qquad (3.20)

where S_i is the starting symbol of the parse tree T_i and the w_j are the words of the analysed sentence.
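As an illustration of the smoothed estimate of Eq. 3.17, consider the following sketch; the counts, smoothing constant and vocabulary size are made-up values, not data from this parser.

from collections import Counter

def p_word_given_nonterminal(w, N, counts, lam=0.5, vocab_size=10000):
    """Eq. 3.17: (Count(w, N) + lambda) / (sum_i Count(w_i, N) + lambda * V)."""
    total = sum(c for (wi, n), c in counts.items() if n == N)
    return (counts[(w, N)] + lam) / (total + lam * vocab_size)

counts = Counter({("night", "PRICE"): 12, ("per", "PRICE"): 9, ("crowns", "PRICE"): 5})
print(p_word_given_nonterminal("night", "PRICE", counts))   # seen context word
print(p_word_given_nonterminal("castle", "PRICE", counts))  # unseen, still non-zero

The add-λ smoothing guarantees that a context word never seen with a non-terminal still receives a small non-zero probability, which keeps the chart parser from discarding otherwise plausible sub-trees.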

3.4 Other approaches to semantic parsing

Clustering approach

A novel algorithm for semantic decoding in SLU systems was proposed in (He & Young, 2006a). This approach differs from both the rule-based and the purely statistical systems mentioned previously, which treat semantic decoding as a classical parsing problem. The alternative is to treat the decoding as a straightforward pattern recognition problem. This approach requires relatively little training data (less than the HVS model, sec. 3.3.3), and the data is annotated only at the sentence level. The key idea is to cluster sentences into classes and then assign a single semantic annotation to each class. In the decoding process, the input sentence is assigned to the class it most closely matches. A detailed model description can be found in (He & Young, 2006a).

Kernel-based statistical methods

In traditional machine learning methods, the input is represented as a set of features (a feature vector). But more complicated input structures, such as trees, cannot be easily expressed by such feature vectors because the structural relations are lost when the data is reduced to a set of features. To avoid this, all possible sequences of features would have to be taken into account, which makes the algorithms computationally expensive. An alternative to the feature-based methods are the kernel-based methods (Kate & Mooney, 2006). A kernel is a similarity function K over a domain X which maps a pair of objects x, y ∈ X to their similarity score K(x, y), ranging from 0 to ∞. (Kate, 2009) defines a kernel between two strings as the number of their common sub-sequences.

Semantic parsing is then performed using the notion of a semantic derivation of an NL sentence. In (Kate, 2009), the task is described as finding the most probable semantic derivation of an NL sentence, which is determined using an extended version of Earley's algorithm for context-free grammar parsing. Kernel-based SVMs (support vector machines) are used as classifiers for computing the probability of the parse tree derivation.
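As a rough sketch of the idea, the following function scores two word sequences by their common contiguous word n-grams. The kernel of (Kate, 2009) is more general, since it also counts non-contiguous subsequences with gap penalties, but the principle of comparing strings directly instead of through a fixed feature vector is the same.

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def string_kernel(x: str, y: str, max_n: int = 4) -> int:
    """Count common contiguous word n-grams up to length max_n (simplified)."""
    xs, ys = x.lower().split(), y.lower().split()
    score = 0
    for n in range(1, max_n + 1):
        xg, yg = ngrams(xs, n), ngrams(ys, n)
        score += sum(min(xg.count(g), yg.count(g)) for g in set(xg))
    return score

print(string_kernel("flight from Pilsen to Prague",
                    "train from Pilsen to Prague"))   # shared n-grams -> 10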

3.5 Evaluation of NLU Systems

Although SLU/NLU systems mostly do not appear as standalone applications (they are typically incorporated into e.g. dialogue systems), there is a reasonable need for an independent evaluation of the system itself. Having a semantically annotated corpus made by human annotators, we can compare it to the results obtained from a semantic analysis system using various criteria. The corresponding types of measures are introduced in this section.

3.5.1 Exact match

The simplest criterion is the exact match criterion. Let C_R be the reference parse tree (made by a human annotator) and let C_P be the tree produced by the parser. The exact match criterion is then a binary function returning 1 if C_P matches C_R exactly and 0 otherwise. An exact match means that both trees have the same structure and concept labels and that the preterminal concepts cover the same word positions in the input sentence. Since a parser output that differs in e.g. a single concept, or merely in the preterminal concept positions, is scored 0, a finer measure is needed that penalizes the output according to its distance from the reference parse tree.

3.5.2 PARSEVAL

The standard measures used in PARSEVAL are precision (P), recall (R) and the F-measure (F). Let a correct concept be a concept which spans the same words in both the reference and the hypothesis tree. The values are then computed as follows:

P = \frac{\#\text{ of correct concepts in } C_P}{\#\text{ of concepts in } C_P} \qquad (3.21)

R = \frac{\#\text{ of correct concepts in } C_P}{\#\text{ of concepts in } C_R} \qquad (3.22)

F = \left( \frac{\alpha}{P} + \frac{1-\alpha}{R} \right)^{-1} \qquad (3.23)

The parameter α ∈ ⟨0, 1⟩ is used to distribute the preference between the recall and the precision.

3.5.3 Tree edit distance

The tree edit distance algorithm computes the minimum number of substitutions, insertions and deletions required to transform one tree into another. For semantic analysis system outputs it is possible to measure only the distance between the abstract semantic trees, without relating them to the words of the input sentence (it is therefore called concept accuracy in (Jurčíček, 2007)). The accuracy is then defined as:

CAcc = \frac{N - S - D - I}{N} \qquad (3.24)

where N is the number of concepts in the reference semantic tree, S is the number of substitutions, D the number of deletions, and I the number of insertions.
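The measures of sections 3.5.2 and 3.5.3 are straightforward to compute once a tree is flattened into a set of (label, start, end) spans. A minimal sketch with illustrative inputs:

def parseval(reference, hypothesis, alpha=0.5):
    """Eqs. 3.21-3.23 over trees flattened into sets of (label, start, end)."""
    correct = len(reference & hypothesis)
    p = correct / len(hypothesis)                                      # Eq. 3.21
    r = correct / len(reference)                                       # Eq. 3.22
    f = 0.0 if correct == 0 else 1.0 / (alpha / p + (1 - alpha) / r)   # Eq. 3.23
    return p, r, f

def concept_accuracy(n, subs, dels, ins):
    """Eq. 3.24: CAcc = (N - S - D - I) / N, given the tree edit operations."""
    return (n - subs - dels - ins) / n

ref = {("TIMEREQ", 1, 3), ("ORIGIN", 8, 9), ("DEST", 10, 11)}
hyp = {("TIMEREQ", 1, 3), ("ORIGIN", 8, 9), ("DEST", 10, 12)}   # one wrong span
print(parseval(ref, hyp))              # (0.666..., 0.666..., 0.666...)
print(concept_accuracy(3, 1, 0, 0))    # 0.666...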

3.6 Existing corpora

In this section, an overview of existing NLU systems and corpora is presented. Instead of a comprehensive survey, some well-documented and commonly referenced data sets and systems have been chosen.

3.6.1 ATIS

The ATIS (Air Travel Information Service) corpus contains air travel information data. There are several versions of the ATIS corpus, including ATIS-2 and ATIS-3. Originally, these corpora contained only transcriptions of spoken language. The abstract semantic annotations were derived semi-automatically using the SQL queries provided in ATIS-3. A set of 30 domain-specific lexical classes was extracted from the ATIS database. For creating the test set, post-processing was required to extract the relevant slots/values and transform them into a compatible format (He & Young, 2006b).

Although ATIS has been used for development and evaluation in many SLU systems, it is now over twenty years old (Price, 1990). Several important issues raise the question of whether ATIS is a suitable corpus for evaluating a modern SLU application. For example, as pointed out by (Tur, Hakkani-Tur, & Heck, 2010), the “ATIS corpus is highly artificial and utterances are mostly grammatical and without disfluencies”. Furthermore, many inconsistencies in the ATIS testing data set were discovered by (Habernal & Konopík, 2010).

3.6.2 DARPA

The DARPA Communicator data contains utterance transcriptions and the semantic parse results from the rule-based Phoenix parser (Ward & Issar, 1994). The abstract semantic annotations produced by the parser were hand-corrected. The data consists of 38,408 utterances in total. After removing context-dependent utterances such as “Yes”, “No, thank you”, etc., the final set consists of 12,702 utterances. The corpus is publicly available without the semantic annotations.

3.6.3 Other corpora

The LUNA project is an attempt to create a multilingual spoken language understanding module. The corpus contains semantic annotation for three languages (French, Italian, and Polish). The Polish corpus was collected at the Warsaw City Transportation information center, where people can get various information related to routes and timetables of public transport, etc. An automatic annotation process for this corpus is described in (Mykowiecka, Marciniak, & Głowińska, 2008). The Italian part of the project consists of spontaneous dialogues recorded in the call center of the help desk facility of the Consortium for Information Systems of Piemont (CSI Piemonte); its domain is thus focused on computer-related dialogues. The so-called active annotation of this corpus is described in (C. Raymond, Rodriguez, & Riccardi, 2008).

The MEDIA project aims to evaluate SLU approaches for the French language. The domain is hotel room booking and related tourist information. The corpus was recorded using a WOZ⁴ system simulation. The semantic annotation procedure is described in (Bonneau-Maynard, Rosset, Ayache, Kuhn, & Mostefa, 2005).

⁴WOZ = Wizard-of-Oz: the user/speaker believes he or she is talking to a machine, whereas in fact he or she is talking to a human who simulates the behaviour of the dialogue system.

3.7 Existing systems

3.7.1 HVS Parser

The hidden vector-state model (see section 3.3.3) was presented in (He & Young, 2005). The system was tested on the ATIS and DARPA corpora; recently it was also used for semantic extraction from the bioinformatics corpus Genia (D. Zhou & He, 2009). The model was originally trained using MLE (see section 3.3.4); however, discriminative training has also been proposed. An extension of the basic HVS parser was developed in the work of (Jurčíček, 2007). The improvement is achieved by extending the lexical model and by allowing left-branching. The system was tested on a Czech human-human train timetable corpus.

3.7.2 Scissor

Scissor (Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations) is a system which uses a syntactic parser enriched with semantic tags, generating a semantically augmented parse tree. It uses a state-of-the-art syntactic parser for English, the Collins parser (Collins, 1997). The parse tree consists of syntactic nodes augmented with semantic concepts. Some concepts (referred to as predicates) take an ordered list of arguments from sub-concepts; null concepts, which do not correspond to any semantic information, are introduced as well. The training of the system is performed in the following steps. First, the corpus is parsed by the Collins parser. Then the semantic labels for individual words are added to the POS nodes of the tree. Finally, the remaining semantic concepts are added in a bottom-up manner. This semantic annotation must be done manually, heavily depends on domain knowledge, and requires a robust syntactic parser. The system is described in detail in e.g. (Ge & Mooney, 2005), (R. G. Raymond & Mooney, 2006) or (Mooney, 2007).

3.7.3 Wasp

Wasp (Word Alignment-based Semantic Parsing) is based upon statistical machine translation (SMT). The method is trained on corpora consisting of natural language data and an appropriate meaning representation language (MRL). Wasp requires no prior knowledge of the natural language syntax, although it assumes that an unambiguous context-free grammar (CFG) of the target MRL is available (Wong & Mooney, 2006). First, the words are aligned to the corresponding semantic concepts using the GIZA++ system (Och

& Ney, 2003). Then a probabilistic parser and a synchronous CFG are used to create the parse tree.

3.7.4 Krisp

Krisp (Kernel-based Robust Interpretation for Semantic Parsing) uses support vector machines with string kernels to build semantic parsers that are more robust in the presence of noisy training data. The string kernels were introduced in section 3.4. In particular, Krisp uses the string kernel to classify sub-strings of a natural language sentence. Like Wasp, it uses the production rules of the MRL grammar to represent semantic concepts. Learning the parameters of the system is an iterative process; in each step, positive and negative examples are collected to train the SVMs. During semantic parsing, Krisp uses these classifiers to find the most probable semantic derivation of a sentence (Kate & Mooney, 2006).

3.7.5 Other systems

Another system for stochastic semantic analysis, introduced in (Beuschel, Minker, & Bühler, 2005), uses a scheduling data corpus, the English Spontaneous Speech Task (ESST). This is an appointment scheduling task in which two people speaking different languages try to schedule a meeting. The approach is very similar to the flat-concept parser (see section 3.2.2) because it uses a first-order HMM, but the training stage again depends on a rule-based parser. However, a more general method for context definition was proposed in this approach: instead of introducing data-dependent context classes, contextual observations are introduced.

In (Y.-Y. Wang et al., 2005), a generative HMM/CFG composite model is used to reduce the SLU slot error rate on the ATIS data, and a simple approach to encoding long-distance dependence is proposed. The core of the system is based upon conditional random fields (CRF), and the previous slot context is used to capture non-local dependence. This is an effective and simple heuristic, but the system requires a set of rules to determine whether the previous slot word is a filler or a preamble. It is thus difficult to port the system to another domain.

Part III

Natural Language-Based Semantic Web Search System

Chapter 4

Target Domain

This chapter describes the domain on which the SWSNL (Semantic Web Search using Natural Language) system was developed and tested. It starts with a discussion of the domain selection, continues with a description of the process of obtaining data from users, and concludes with the test data preparation.

4.1 Domain Requirements

As mentioned in section 3.1, portability plays a crucial role in the development of a SWSNL system. A high degree of portability lowers the costs of adapting the system to another domain; on the other hand, a system tailored to a particular domain tends to perform better. Our SWSNL system is domain independent.¹ Nevertheless, in order to develop and evaluate a full end-to-end system, an appropriate domain had to be chosen. The target domain should meet the following requirements:

• There must be a knowledge base (in any form) that contains the desired domain information.

• A testing corpus of natural language questions must exist for the particular domain.²

• For each question from the corpus, the correct “answers” from the KB must be known.

¹To a certain degree; see Chapter 7 dealing with semantic interpretation.
²In our SWSNL system the corpus is used for training as well; see chapter 6.


Obviously, these requirements are not exclusive to our SWSNL system; they apply to the evaluation of any NLISW system, as discussed in sections 2.4 and 2.5. Furthermore, in order to prove the usability of the SWSNL system in a real-world scenario, we added a few more requirements:

• There must be promising commercial potential. It makes no sense to deal with simple or artificial domains, or with domains that have already been successfully deployed in the commercial sector.

• The domain is currently accessible mainly via a form-based search. Users would benefit from an NL search.

• The NL question corpus must be as close to the real-world scenario as possible.

We also decided to focus on the Czech language in the first place. Its high degree of inflection and relatively free word order make the task more challenging; however, it should be easy to adapt the system to English. In sections 2.4 and 2.5, various corpora of NL questions and ontologies were introduced. Unfortunately, none of the existing corpora and KBs meets the requirements stated above:

• Lack of commercial potential, or even solving non-existing problems (e.g. systems tested on a wine ontology, a Nobel prize ontology, or the digital library domain).

• Unrealistic NL question corpora (e.g. the Mooney Geoquery³).

• Some domains are already parts of commercial solutions (e.g. public transportation domains).

• The vast majority of NL question corpora is in English.

The impossibility of re-using an existing KB/corpus has two effects. The disadvantage is that the performance of our SWSNL system cannot be directly compared with other existing systems. On the other hand, a brand new corpus/KB allows us to explore new possibilities in SWSNL and thus go beyond the state of the art. Finally, the accommodation domain was chosen. First, searching for suitable accommodation is still far from being solved in any NLI system. Thus,

it has good commercial potential (compared to the other domains discussed earlier; this is the author's opinion). Second, the domain is not trivial, and a fully functional end-to-end system must deal with many practical issues that are sometimes ignored in existing systems but are essential in any real-world application. In the rest of this chapter, the process of collecting the NL question corpus and creating the KB is described.

³In the author's opinion, users of a real-world search system would not search for information like “What is the largest city in states that border California?” or “What is the highest point in the state with the capital Des Moines?”.

4.2 Natural Language Question Corpus

This section describes the process of collecting a corpus of NL questions in the Czech language on the accommodation domain. Whereas all existing NLI/QA systems expect the question to be a single sentence, we decided to go beyond the state of the art by allowing the users to ask their questions using one or more sentences, so they could formulate their needs more precisely.

We gathered the user questions using a Facebook page⁴. The motivation was based upon our previous experience with obtaining data for NLI systems (Habernal & Konopík, 2009), which showed us that asking only, e.g., students of a certain course can negatively affect the data. We observed that the first few questions from a student were “real” questions: on-topic questions asking for some real data (e.g. train connections or train types in the train domain, or cheap accommodation in a certain city in the accommodation domain). Starting with the 4th or 5th question from the same student, the quality of the questions decreased and became too poor for further usage in the corpus; the student mostly posed off-topic or simply silly questions (e.g. whether some hotel serves a certain type of beer). We think that even though the assignment for devising the questions was clear, at some point the students decided to test the abilities of the NLI system by asking “hard” questions instead of trying to find some real and useful information.

Thus, we claim that it is very important to choose a proper way of obtaining the NL corpus for an evaluation. Table 4.1 lists a few corpora from the related work. Given these statistics, we suspect that many systems may suffer from a problem similar to ours and that their evaluation datasets are likely to be artificial.

We started the Facebook campaign with a clear description of the “virtual” SWSNL system. In other words, it said: “There is a search engine that operates on the accommodation domain in the Czech Republic. Try to find an accommodation that you really want by describing it using natural language instead of keywords. You are not limited to a single sentence. Just keep in mind that a human must be able to find the accommodation for you.”

⁴http://www.facebook.com/pages/Vyhledávání-v-přirozeném-jazyce/112739178813727

Corpus                         Source and quantity
(Ruiz-Martínez et al., 2009)   4 Ph.D. students, 20 questions each
(Gao et al., 2011)             students (93 questions); manually created questions (77 questions)
(Ferrández et al., 2009)       10 unspecified users, total 540 questions; 10 users (high school students, administrative assistants), 10 questions each
(Frank et al., 2007)           100 questions, system developers
(Cimiano et al., 2008)         24 users (computer scientists from academia and industry), 10 questions each
(Tang & Mooney, 2001)          1000 questions from undergraduate students from the authors' department

Table 4.1: Sources of the evaluation questions in the related work.

A few diverse examples followed. At that time we had no back-end database or knowledge base, so we did not limit the users to questions answerable by any existing accommodation search portal. To avoid the poor corpus quality described previously, we asked each user to pose one to three questions. After two weeks of quite intensive Facebook sharing and recommending to other people via “friends”, we had gathered 72 questions. This was less than we had expected, but the whole campaign was based upon spontaneity and we did not want to push anyone into participating. After discarding three really off-topic questions, we had a corpus consisting of 69 questions. Note that the participants were not asked to find any answers to their questions; the main goal at this step was to collect just the questions.

4.2.1 NL Question Corpus Statistics

Table 4.2 shows the distribution of question lengths in terms of the number of sentences used. Although there was no limit on question length, most users asked using a single sentence only. Table 4.3 presents some further interesting statistics. The average number of questions per user might be an important variable affecting corpus quality (see the beginning of this section).

                       1 sentence   2 sentences   3 sentences
number of questions        38            23             8

Table 4.2: Number of sentences per question.

Questions with missing diacritics         3
Average question length (words)          24.9
Number of unique participating users     42
Average number of questions per user      1.54

Table 4.3: Various NL question corpus statistics.

The average question length shows that the questions are relatively long. The length of the questions is also important from another point of view: the questions contain a lot of information (see the examples in the next section). This makes the corpus very atypical compared to the existing corpora (section 3.6).

4.2.2 Question Examples

Here are a few examples of the NL questions in Czech with their English translations⁵.

• Ráda bych se ubytovala v Karlštejně s výhledem na hrad a královské vinice, nejlépe s polopenzí, parkováním a úschovnou kol, pro 2 lidi od pátku do neděle. — I'd like to have an accommodation for two people in Karlštejn with a good view of the castle and the royal vineyards, including half board, a parking lot, and a bike storage room, from Friday till Sunday.

• Jsme dva a chceme strávit jarní prázdniny v malebném hotýlku nebo apartmánu v Beskydech, blízko modré sjezdovky, předpokládáme lyžárnu, vyžadujeme polopenzi. Bazén, pingpong, wellness a podobné legrácky potěší, ale nejsou nezbytné. — Two persons want to spend the spring holidays in a cute hotel or apartment in the Beskydy mountains, close to an easy ski slope. We expect a ski storage room and require half board. A swimming pool, a ping-pong table and some wellness will be a plus, but are not necessary.

• Hledám levné víkendové ubytování pro 3 osoby v Jindřichově Hradci nebo blízkém okolí, v blízkosti půjčovny jízdních kol nebo s možností uskladnění vlastních, pokoj nejlépe s krbem a možností připojení na internet, vlastní sociální zařízení podmínkou. — I'm looking for a

⁵The translations are not 100% precise.

cheap weekend accommodation for 3 persons in Jindřichův Hradec or its close vicinity, near a bike rental store or with a bike storage room, a room preferably with a fireplace and an internet connection, a private bathroom being a requirement.

• Hledám hotel na pořádání mezinárodní konference dne 25. 10. Hotel musí mít sál pro minimálně 100 lidí, wifi pokrytí, možnost uspořádání večeře a minimálně 10 dvoulůžkových pokojů. Upřednostňuji hotel s vlastní prezentační technikou. — I'm looking for a hotel suitable for hosting an international conference on October 25. We require a large hall (for 100 people at least), wifi coverage, the possibility of a banquet, and at least 10 double-bed rooms. I prefer a hotel with its own presentation equipment.

4.3 Knowledge Base and Domain Ontology

In order to develop a fully functional SWSNL prototype application, an underlying KB is essential. As already pointed out in section 2.3.1, the KB is the module that stores the complete knowledge about the covered domain. In the Semantic Web, ontologies are the main vehicle for domain modeling. In our experiment on the accommodation domain, we had no such ontology while collecting the testing NL question corpus (see the previous section). On the one hand, we did not restrict the user questions to comply with the structure or content of a particular domain model; on the other hand, there was a risk that the KB would not contain the answers to the collected NL questions. After a few unsuccessful attempts to cooperate with companies running commercial accommodation web search portals, it became obvious that there is no easy way of obtaining an accommodation database (at least in the Czech Republic); this problem is also discussed in section 9.1.5. The only way to get a real database of accommodation options was to use web data mining techniques.

4.3.1 Pattern-based Web Information Extraction

As the source website for data mining, the portal hotelypenziony.cz was chosen. The site provides a large database of various types of accommodation (from cottages to luxury hotels). All entries are categorised in one or more sections (by accommodation type or by tourist region) denoted by a URL prefix; unfortunately, due to an inappropriate web structure, the same accommodation can appear at various URLs. Almost all entries have some location information, such as GPS coordinates or an exact address. Furthermore, each accommodation page contains some information about facilities, prices, etc., but these fields are not mandatory. Some entries are enhanced with textual descriptions, e.g. a site or a restaurant description.

1. Ontology can be searched using a reasoner which can infer some facts that are not expressed explicitly.

2. Ontology can capture both the domain model and data instances.

3. Ontologies are an essential feature of the Semantic Web, and we thus wanted to test them in a real-world scenario.

The first point cannot be easily achieved using a standard relational database. An ontology, consisting of triplets, can contain e.g. symmetric or transitive properties forming an unconstrained graph structure; moreover, transitive closures, for example, cannot be defined in SQL⁶ (Libkin & Wong, 1997). The second reason follows the basic idea of the Semantic Web that data can be shared using ontologies with well-defined semantics and a domain model that is easy to follow for both computers and humans.

Crawling. The entire site was downloaded using an in-house crawler written in Groovy. Traditional crawlers failed due to a bad site structure design which caused infinite link generation (so-called crawler traps).

Information extraction. A pattern-based information extraction tool was developed in order to extract the structured content from the crawled web pages. The extracted data were then transformed into the corresponding ontology instances and properties.
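The following sketch shows the flavor of such pattern-based extraction. The HTML snippet and the patterns are illustrative assumptions about the source markup, not the actual patterns used for hotelypenziony.cz:

import re

PATTERNS = {  # hypothetical patterns for the fields of one accommodation page
    "title":   re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "address": re.compile(r'<span class="address">(.*?)</span>', re.S),
    "gps":     re.compile(r'data-gps="([\d.]+),\s*([\d.]+)"'),
}

def extract(html: str) -> dict:
    """Apply every pattern to the page and collect the matched fields."""
    record = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(html)
        if m:
            record[field] = m.group(1) if pattern.groups == 1 else m.groups()
    return record

page = '<h1>Hotel U Nádraží</h1><span class="address">Plzeň</span>' \
       ' <div data-gps="49.74, 13.37"></div>'
print(extract(page))
# {'title': 'Hotel U Nádraží', 'address': 'Plzeň', 'gps': ('49.74', '13.37')}

Each extracted record is then mapped onto an instance of the Accommodation class and its properties, as described in the next section.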

4.3.2 Domain Ontology

The domain ontology structure was designed according to the extracted information; see Figure 4.1. The ontology structure is more or less “flat”: it has one central class Accommodation with many object and data properties. The reason for this design comes from the structure of the website (the information source), where one accommodation page is equivalent to one instance of the ontology class Accommodation.

⁶Although the Oracle RDBMS supports the CONNECT BY clause, which can also be ported to the IBM DB2 database.

Figure 4.1: Structure of the accommodation ontology.

Some domain structure details are explained below:

• The Accommodation ontology class has several sub-types (e.g. Hotel, Private, Apartment, etc.) according to the type determined from the website section. The sub-classes are disjoint, which means that a single accommodation instance cannot be e.g. both Hotel and Spa.

• The hasRoomType predicate has unlimited cardinality. Each accommodation instance can contain multiple room types with the appropriate prices and availability. Needless to say, RoomType instances are unique for each accommodation instance (they are not shared among accommodations). This is very important for modeling e.g. the number of rooms or the room prices of a particular accommodation instance.⁷

• Similarly to the previous point, the Facility instances have a certain type (e.g. SportActivity or HotelFacility). The Facility instances are shared among all accommodations, i.e. there is only one instance of HotelFacility called facInternetConnection, which is linked to all accommodation instances with an internet connection. In other words, there is only one type of internet connection, the same for all accommodation instances.

• There is an important functional⁸ object property hasGPSCoordinates. It links an accommodation instance to its unique GPSCoordinates instance. This will be discussed in the next paragraph.

Location hierarchy

While creating the KB from the website data, an essential requirement was stated: each accommodation must have its exact location. Without such information, it would be impossible to find any result when asking for accommodation in a certain area or city. Figure 4.1 depicts the transitive relation isSubpartOf. This relation allows capturing a hierarchy of Location instances, e.g. a Street is a sub-part of a City, etc. However, the property is not functional, so one location can be a sub-part of more than one other location. Typically, this is the case of a city c that is a sub-part of both a district a and a tourist region t, where c is an instance of City, a is an instance of District and t is an instance of

⁷For example, an accommodation instance someHotelInst is connected to an instance someHotelPrivateDoubleInst via the hasRoomType predicate, and someHotelPrivateDoubleInst contains a property price. This means that the room price is valid only for that particular accommodation instance.

⁸A functional property is a property in OWL which can have only one (unique) value y for each instance x (Dean & Schreiber, 2004).

TouristRegion. An interesting task was to create such a hierarchy using only the given address and the GPS coordinates.

Exact location extraction. Each accommodation page on the source website contained both an address and a GPS location. We discovered that most of the GPS locations were invalid (they did not correspond to the written addresses). We considered this a serious obstacle for a real-world application, so we decided to discard the original GPS locations and to recreate them from the given textual addresses. We used the mobile user interface of Mapy.cz at http://m.mapy.cz and parsed the GPS coordinates from the retrieved text content.

4.3.3 Qualitative Analysis of the Knowledge Base

This section is dedicated to a deeper analysis of the quality of the created KB. This analysis is necessary for the later identification of possible system bottlenecks. The total number of instances (individuals) in the KB is 32,990 and the total number of triplets is 281,686. The total number of instances of the Accommodation class⁹ is 8,641.

Statistics of Properties of Accommodation Instances

An accommodation basically contains two types of properties. We will distinguish them as text properties and structured properties¹⁰.

Structured properties. These properties are both datatype and object properties (in OWL terminology) with exactly specified semantics. This kind of property contains structured information, such as number values (e.g. various capacities or prices) or facilities (e.g. room facilities, sport activities).

Text properties. Text properties are four datatype properties (in OWL terminology), namely accommDesc, siteDesc, conferenceDesc and restaurant. These properties are mostly long text fields extracted from the accommodation website. They contain an unstructured description of the accommodation, e.g. restaurant information or a description of the site. Furthermore, in contrast to the structured properties, these properties do not have an exactly specified semantic meaning.

⁹For the rest of this chapter, we will simply say accommodation.
¹⁰These are not the same as the datatype and object property types in OWL; see section 2.2.

Figure 4.2: Distribution of text and structured properties.

Figure 4.3: Statistics of the four text property types. None represents accommodations without text properties.

The structured properties are crucial for searching using a query language. On the contrary, if a piece of information is available only in the text properties, it can hardly be searched using a query language. Figure 4.2 shows the distribution of the text properties and the structured properties over the accommodations in the KB. This figure captures one of the most important statistics: about 60% of all accommodations have no additional property except for the obligatory address. This means that we cannot say anything specific about these accommodations; the system only knows that a certain accommodation exists at a certain place, but no additional information is offered. However, almost all the user questions presented in section 4.2.2 relate to a more specific type of accommodation, with certain properties, prices, room types, etc. As a result, almost 40% of the accommodations cannot be found when users ask for additional accommodation requirements. The text properties, though, can contain a lot of unstructured information, and the system can benefit from a combined search over the text and structured properties. Figure 4.3 depicts statistics of the four text property types. Another important statistic outlines how many accommodation instances have a certain type of property; again, this can later explain some search results. The statistics are shown in Table 4.4.

Property name             i     (in %)
distanceFromSubway       182      2.1
distanceFromAirport      369      4.3
conferenceDesc           398      4.6
hallCapacity             430      5.0
conferenceRoomCapacity   430      5.0
parkingLotCapacity       765      8.9
restaurantCapacity       768      8.9
distanceFromDowntown     790      9.1
siteDesc                 938     10.9
distanceFromTrain        941     10.9
distanceFromBus         1124     13.0
stars                   1284     14.9
restaurant              1632     18.9
totalBedCount           1645     19.0
hasRoomType             2169     25.1
accommDesc              2295     26.6
hasFacility             2614     30.3
fax*                    3490     40.4
www*                    6401     74.1
tel*                    6798     78.7
email*                  7071     81.8
title                   8641    100.0
address                 8641    100.0
shortName               8641    100.0

Table 4.4: Number of accommodations (i) that contain a particular property (data property or object property). The cardinality of the properties is not taken into account; e.g. an accommodation instance can have multiple hasFacility properties. Thus, this table shows how many accommodation instances have at least one property of each type. Note: a star (*) denotes obligatory properties: each accommodation must have at least one of those properties.

Facilities. Important structured properties are the instances of the Facility class, connected to accommodations via the hasFacility predicate. This kind of property is one of the most important for satisfying user needs, since it represents the services, facilities and activities offered by a particular accommodation. Figure 4.4 lists how many accommodation instances have a certain number of facilities. These statistics do not say anything about the types of the facilities; however, they show that most accommodations have about three facilities.

Figure 4.4: Number of accommodations with a certain number of facilities.

4.3.4 Semantic Inconsistencies Causing Practical Issues

The main advantage of a well-designed ontology is its precise semantics. It allows using complex semantic constructs in order to infer new facts and to query the KB using a query language. In reality, however, systems must often deal with existing data, since the amount of work needed to create and populate an ontology from scratch would be tremendous. Our KB, too, is almost completely created from existing data. As the original data was not designed to respect precise domain semantics, a degree of semantic inconsistency was transferred into the KB, which can later result in semantic ambiguity. Our KB contains several examples of semantic inconsistencies; we list some of them below. Note that we had to leave all the data unchanged because it was not feasible to check and correct all the inconsistencies manually.

• The accommodation types are quite ambiguous. There is no clear difference between e.g. Hut and Cottage, among others.

• Many facilities are duplicated in some sense. For example, the KB contains four similar instances of a parking lot (namely parkingLot, parkingLotWithLights, parkingGarage, and securedParkingGarage). Apparently, the meanings of these different instances overlap considerably.

• Some accommodations state the total number of beds but the information about room types is missing (and vice versa).

• Moreover, the prices of some accommodations are set to 1 CZK, which is clearly a fake price.

There are two possible ways to deal with these inconsistencies. First, after the KB is created from the crawled data, it can be manually checked and altered in order to remove the semantic inconsistencies; we can call this KB post-processing. Second, the KB is left as-is and attention must be paid when querying it. We decided to use the latter approach. For example, if the system constructs a SPARQL query to retrieve all accommodations with a parking lot, all four instances of parking lot must be taken into account when creating the query.
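The following sketch shows such a query, enumerating all four parking-lot instances; the namespace, file name and the use of the rdflib library are illustrative assumptions, not the system's actual implementation:

from rdflib import Graph

g = Graph()
g.parse("accommodation-kb.rdf", format="xml")   # assumed local dump of the KB

QUERY = """
PREFIX acc: <http://example.org/accommodation#>
SELECT DISTINCT ?accomm WHERE {
    ?accomm acc:hasFacility ?f .
    # all overlapping parking-lot instances must be listed explicitly
    VALUES ?f { acc:parkingLot acc:parkingLotWithLights
                acc:parkingGarage acc:securedParkingGarage }
}
"""
for row in g.query(QUERY):
    print(row.accomm)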

Transitive Object Properties

As mentioned in section 4.3.2, the ontology contains one important transitive object property, isSubpartOf, and the location hierarchy is constructed automatically.

     Instance of TouristRegion   No. of relations
 1.  hlavni-mesto-praha                841
 2.  krkonose                          678
 3.  zapadoceske-lazne                 466
 4.  oblast-sumava                     339
 5.  jizerske-hory                     317
     ...
42.  kralicky-sneznik                   10
43.  kokorinsko                          7
44.  hana                                3
45.  orlik                               2
46.  blansky-les                         2
47.  ceska-kanada                        1

Table 4.5: Number of direct relations.

In most cases, an accommodation is connected to its GPSCoordinates instance and, via the isSubpartOf chain, further up the location hierarchy (Street, City, District, Region). Nevertheless, the source website is also categorised by tourist regions. These regions are not explicitly defined, and some of them overlap with existing cities (this is the case of the Czech capital Prague: on the website, Prague is both a tourist region and a city). This brings a kind of inconsistency into the data and makes searching by place or region much harder. Table 4.5 shows the accommodations whose address (Street) is associated directly with a tourist region and does not belong to the above-mentioned hierarchy (City-District-Region).

4.4 Test Data Preparation

For the evaluation of a search system, a test set of questions with answers is required. In other words, each question from the testing corpus must be associated with the correct answers from the KB. In our case, the correct answers are the accommodation instances that satisfy the user requirements expressed by the natural language question. This is the test data for our evaluation¹¹.

¹¹We will also use the terms gold results or gold data.

Expression                         Interpretation
in the vicinity of a city          all places in the district in which the city is located
close to (subway/downtown/train)   the distance is smaller than 1 km
cheap                              any room of the accommodation has a price lower than 1,000 CZK
cheapest                           returns only the one cheapest accommodation

Table 4.6: Semantic interpretation of some NL expressions

4.4.1 Search Results

The most important problem is to decide what constitutes a correct answer to a particular NL question. As mentioned above, users look for accommodation, so a set of accommodation instances is considered as the search result. Each item of the search result must be a valid answer; an answer is valid when a human can say of the returned accommodation: “This is exactly what the user wants to find.”

4.4.2 Semantic Interpretation

Many NL questions contain vague expressions, e.g. “cheap”, “nice”, “close to something”, etc. It is necessary to decide in advance how to interpret these expressions consistently; the same interpretation must later be used by the SWSNL system as well. Table 4.6 presents some examples of the semantic interpretation. The values (prices and distances) were set by consensus in our research team; this interpretation, of course, depends on each user's point of view. We deal only with expressions that can be quantified or expressed using the KB semantics.

4.4.3 Assigning the Correct Results

The only proper way to create a perfect test set would be to take each NL question, go through all the accommodations, and decide whether each particular accommodation is a correct answer to the question. With a corpus of 69 questions and 8,641 accommodations, it is impossible to do this manually. We therefore combined a structured search and a fulltext search.

Structured search. For each NL question, a corresponding SPARQL query was manually constructed. We put as many constraints from the question as possible into the query.

Figure 4.5: Distribution of properties in the gold data set.

For example, if the user asks for particular facilities (e.g. parking, internet), we add all of them to the SPARQL query. This ensures that only the accommodations that satisfy the constraints are selected from the KB (see the first paragraph of section 4.4.1). However, adding such strict criteria has two effects. First, results that only partially satisfy the requirements are simply ignored; at the moment, no ranking of results is involved. Second, if an accommodation has no structured properties (see section 4.3.3), it is in principle also ignored by the structured search.

Fulltext search. As presented in section 4.3.3, each accommodation can have a few text descriptions (labels). Such a description usually contains information which might be available in the structured properties as well (e.g. a description of hotel facilities in the text and then again as structured properties). These text fields can be indexed and searched using a fulltext search.

Finally, the whole process of assigning the correct results to each NL question was performed as follows:

• If the question can be completely expressed using SPARQL, create such a query and add its results to the correct results.

• If the question contains other requirements that cannot be expressed using SPARQL, filter the structured search results with a fulltext search and check the returned results manually.

• Create the least restrictive SPARQL query (only with the location and accommodation type) and filter the results using a fulltext search; the fulltext keywords are manually selected according to the question requirements. The results must be checked manually.

A sketch of this combined filtering follows the list.
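The combined filtering can be summarized as follows; the two search functions are hypothetical placeholders for the SPARQL and fulltext machinery described above:

def candidate_answers(sparql_query, keywords, structured_search, fulltext_search):
    """Intersect structured and fulltext results for one NL question."""
    structured = set(structured_search(sparql_query))
    if not keywords:                        # question fully expressible in SPARQL
        return structured
    fulltext = set(fulltext_search(keywords))
    return structured & fulltext            # still has to be checked manually

# Example wiring with stub search functions:
hits = candidate_answers(
    "SELECT ?a WHERE { ?a a acc:Hotel . }", ["fireplace", "internet"],
    structured_search=lambda q: ["hotelA", "hotelB", "hotelC"],
    fulltext_search=lambda kws: ["hotelB", "hotelC", "hotelD"])
print(hits)   # {'hotelB', 'hotelC'}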

In the author's opinion, this whole process gives the best approximation of the correct test data. Needless to say, it took a few weeks of tremendous manual work to create such a data set.

Figure 4.5 shows statistics of the properties of the gold data. Given these statistics, we can draw the following conclusion. The users want to find accommodations with particular properties, such as facilities, prices, etc. These properties are stored both in the structured properties and in the text properties. Furthermore, the statistics show that the gold data contain only a few results that are accommodations without any properties (the first column in Figure 4.5). However, as outlined in Figure 4.2, almost 60% of the accommodations in the KB have no additional properties. This illustrates an imbalance between what the KB contains and what the users ask for.

Chapter 5

Semantic Annotation of NL Questions

5.1 Describing NL Question Semantics

Our formalism for describing the semantics of natural language questions exploits the ideas of the Semantic Web and ontologies. Since an ontology allows semantics to be defined very precisely (see section 2.2), it also seems to be a suitable vehicle for capturing the semantics of NL questions. The first part of this thesis explored various approaches to expressing NL semantics, e.g. frames, FOPL (first-order predicate logic) or semantic trees, among others. Our previous research (Habernal & Konopík, 2009; Habernal & Konopík, 2010), based upon semantic trees, showed that describing the semantics of NL questions should follow a particular structure, e.g. using a schema for adding constraints to semantic trees. Ontologies offer such a mechanism by allowing precise semantic relations between entities to be defined. Therefore, ontologies were chosen as the main vehicle for capturing NL question semantics. We will now discuss the term semantic annotation. In the Semantic Web, this term has multiple meanings: e.g. webpages are annotated according to a particular ontology (taxonomy), or named entities can be annotated with related instances from an ontology. Our usage comes from NLU research, where sentences (or questions) can be annotated with some additional semantic information. Henceforth, the term semantic annotation will be used in the sense of a semantic description of a NL question1.

1Recall that a question can be expressed using a single sentence or a whole paragraph in our corpus, see section 4.2.


Figure 5.1: A schema of the two layer annotation of NL questions.

5.1.1 Domain Ontology versus NL Question Ontology

Ontologies for storing domain information (in a Knowledge Base – KB) were already thoroughly discussed in sections 2.2 and 4.3. However, in our proposal the ontology-based semantic annotation of NL questions has no direct relation to the ontologies for storing domain information (the KB). This means that our semantic annotation is completely independent of a particular KB; it uses ontologies exclusively for describing question semantics. This kind of independence allows us to separate the KB and the semantic description of NL questions in our SWSNL system. Thus, it is possible to e.g. switch the KB from an ontology to a relational database without affecting the semantic annotation of NL questions.

5.1.2 Two-Layer Annotation Approach

Figure 5.1 summarises our annotation approach in a schematic overview. The semantic annotation consists of two layers. The bottom layer is a domain-independent layer with a domain-independent sentence ontology. The upper layer is then constructed according to a domain-dependent ontology of NL questions. Both layers will be explained in more detail in the following sections.

5.1.3 Ontology of NL Question

The domain-independent question ontology is shown in Figure 5.2. Each word is represented as an instance of the Word class2 with some additional morphological properties and its position within the sentence. An instance of the Sentence class covers a particular question. The ontology also has built-in support for marking named entities: an instance of NamedEntity is directly connected to the words. Furthermore, named entities form a taxonomy of classes. The basic sentence ontology contains a few general types of named entities, such as numbers, places, etc. This taxonomy can be extended in the ontology for a particular domain (e.g. more precise place types in our accommodation domain). In most cases, named entities are used to label expressions with a special meaning (e.g. cities, persons, countries, etc.). We extended the range of named entities to all words/word groups that carry important semantic information. For example, we added the following domain named entities, among others: relative location (covering e.g. "near", "in the center", etc.), relative price (e.g. "cheap"), and additional requirements (e.g. "fitness", "internet", etc.).

The semantic annotation of a question is an instance of the SentenceAnnotation class and it depends on the domain in which the questions are annotated. An example of a question described by the question ontology is depicted in Figure 5.3. This simplified example shows only the triplet structure of the sentence annotation. The datatype properties are shown in the boxes connected to the ontology instances. All ontologies are stored in the OWL format and were developed using Protégé3.

2Note that classes and instances are seen from the Semantic Web/OWL perspective in this context. There is no direct relation to any particular implementation using an object-oriented programming language.

3Protégé is a free, open source ontology editor and knowledge-base framework. http://protege.stanford.edu/

Figure 5.2: Sentence ontology.
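To make the triplet structure concrete, the following sketch expresses the first two words of a question and one named entity, wrapped in a SPARQL INSERT DATA update so that it matches the SPARQL listings used elsewhere in this thesis. The Word, Sentence and named-entity classes and the hasNextWord relation come from the sentence ontology described above; the s: namespace and the hasLemma, hasPosition and coversWord property names are illustrative assumptions.

PREFIX s: <http://example.org/sentence#>   # illustrative namespace
INSERT DATA {
  s:sentence1 a s:Sentence .
  s:word1 a s:Word ;
      s:hasLemma "cheap" ;          # assumed datatype property
      s:hasPosition 1 ;             # assumed datatype property
      s:hasNextWord s:word2 .
  s:word2 a s:Word ;
      s:hasLemma "hotel" ;
      s:hasPosition 2 .
  s:ne1 a s:RelativePrice ;         # a sub-class of NamedEntity
      s:coversWord s:word1 .        # assumed link from the NE to its words
}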

Figure 5.3: The triplet representation of a NL question. Only the first few words are shown.

Ontology for Semantic Annotation of Accommodation Questions

For each question, instances of the domain-dependent question ontology classes (triplets) are created. Those triplets are then associated with the sentence. Due to its complexity, the diagram of the complete triplet semantic annotation is not shown. Figure 5.5 depicts the semantic annotation of the previous example from Figure 5.3. Some words are omitted, as well as the hasNextWord relation between the words and the datatype properties of the words.

Advantages of our Semantic Annotation Approach

We will conclude this section with a brief list of the main advantages we see in the ontology-based semantic annotation.

• Our semantic formalism does not depend on the structure of the KB.

• It uses Semantic Web standards: the sentence and its annotation are completely expressed in triplets and can be processed using any OWL tool (e.g. Protégé).

• The semantic annotation has a strong semantic formalism. For example, inheritance, semantic relations between slots, or long dependencies can be captured. This is possible in frames too, but no standardised formalism exists.

Figure 5.4: Annotation ontology for questions in the accommodation domain. The dashed lines represent the is-a relation.

Figure 5.5: An example of semantic annotation. Simplified by omitting unrelated word instances and relations.

5.2 NL Question Corpus Annotation

To provide a complete training/test set for our SWSNL system evaluation, the NL question corpus from section 4.2 was annotated. As explained previously in this chapter, a semantic annotation based upon the accommodation domain ontology was created for each question in the corpus. Since the complete corpus and its annotations are stored in the OWL format, the annotation can be created using any tool for ontology authoring. However, none of the existing tools (e.g. Protégé) was suitable for such a task. The semantic annotation formalism is fairly complicated (as shown in Figures 5.3 and 5.5) and the existing tools are not intended for creating such complicated graph structures with many instances. Thus, we had to develop a new annotation editor. Our Ontology-based Semantic Annotation Editor provides a balanced trade-off between usability and robustness. The editor hides the complexity of the underlying triplets and the sentence ontology by adding another layer between the user interface and the inner semantic annotation representation. Some important features are:

• Creating new instances and relations in a semantic annotation is constrained by the domain question ontology. This means that e.g. an instance of Place can have only one outgoing edge hasRelativeLocationRequirement, according to the accommodation ontology (see Figure 5.4).

• The content words of named entities can be easily selected from a pop-up list.

The whole corpus was annotated by one human annotator and then manually checked by another one. The inter-annotator agreement was not measured because of the small corpus size. An example screenshot of the editor is shown in Figure 5.6.

Figure 5.6: A screenshot of the ontology-based semantic annotation editor.

Chapter 6

Semantic Analysis of Natural Language Questions

This chapter describes our statistical model for the semantic analysis of NL questions. The main task of the semantic analysis component is to create a semantic description of previously unseen NL questions. More specifically, given a NL question from the covered domain (i.e. accommodation, see section 4.2), the goal is to create an ontology-based semantic annotation (see section 5.1). Various state-of-the-art NLU approaches were presented in section 2.3.2 and chapter 3. Although these systems provide acceptable results in many NLU applications, their algorithms cannot be adapted to our system because a different semantic formalism is used.

6.1 Semantic Analysis Model Overview

The semantic analysis component is based upon a statistical model and supervised learning. An annotated corpus is required for the model training. Furthermore, the component uses both off-the-shelf and newly developed preprocessing tools, such as a tokenizer, a morphological tagger and named entity recognition (NER).

Training. The model parameters are statistically estimated from a fully annotated corpus (see section 5.2).

Testing. Given a previously unseen NL question, the following steps are performed:


1. Preprocessing—The question is tokenized and tagged with lemmas and morphological categories. Currently, we use the Prague Dependency Treebank tokenizer and morphological tagger (Hajič et al., 2006).

2. NER—Named entities are recognized (this will be discussed in more detail later in section 6.4).

3. Semantic analysis—Using the trained model, the most likely semantic annotation of the sentence is created.

The formal properties of the semantic model are proposed in the following section.

6.2 Formal Definition of Semantic Annotation

Let $S = w_1 \dots w_W$ be a sentence (a question) consisting of $W$ words $w_{1 \dots W}$ and let $T(Subj, Pred, Obj)$ be a triplet of a subject, a predicate and an object. Then the semantic annotation $Sem$ of a sentence $S$ is an unordered set of $M$ triplets

$$Sem(S) = \{T_1, T_2, \dots, T_M\}. \qquad (6.1)$$

Within $Sem(S)$, the triplets can be connected. Formally, two triplets

$$T_x(Subj_1, Pred_1, Obj_1), \quad T_y(Subj_2, Pred_2, Obj_2) \qquad (6.2)$$

can share their subjects or objects, so that $Obj_1 = Subj_2$ or $Subj_1 = Subj_2$, forming de facto a directed graph.

Named Entities. Let $N^\tau$ be a Named Entity (NE) instance of type $\tau$. Then

$$N^\tau_{j,k} = N^\tau_{span(w_j \dots w_k)} \qquad (6.3)$$

is a NE of type $\tau$ which spans the words $w_j$ to $w_k$, where $1 \le j < k \le W$. This means that a NE is associated with words from the sentence. Let $C$ be a semantic concept; $C$ is not associated with any words. Both $N^\tau$ and $C$ can be part of a triplet, so that each triplet has one of the following forms:

$$T(Subj, Pred, Obj) = \begin{cases}
T(C, Pred, N^\tau_{j,k}) & \text{if $Subj$ is $C$ and $Obj$ is a NE} \\
T(N^{\tau_1}_{j,k}, Pred, N^{\tau_2}_{o,p}) & \text{if $Subj$ and $Obj$ are NEs} \\
T(N^\tau_{j,k}, Pred, C) & \text{if $Subj$ is a NE and $Obj$ is $C$} \\
T(C_a, Pred, C_b) & \text{if $Subj$ and $Obj$ are $C$}
\end{cases} \qquad (6.4)$$

where $j, k, o, p$ are the NE spans, $a, b$ distinguish possibly different concepts $C$, and $\tau_1, \tau_2$ can be different NE types.
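As a worked illustration, consider the question "very cheap hotel near city centre". With hypothetical predicate names (hasPriceRequirement and hasPlaceRequirement are illustrative, not the exact properties of the accommodation annotation ontology in Figure 5.4), its annotation could be

$$Sem(S) = \{\, T(Accommodation, hasPriceRequirement, N^{RelativePrice}_{1,2}),\; T(Accommodation, hasPlaceRequirement, N^{RelativeLocation}_{4,6}) \,\}$$

where $N^{RelativePrice}_{1,2}$ spans "very cheap", $N^{RelativeLocation}_{4,6}$ spans "near city centre", and both triplets share the subject concept $Accommodation$, forming the directed graph mentioned above.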

6.3 Statistical Model

Using the above-mentioned formalism, the task of semantic analysis is to find an appropriate semantic annotation $Sem(S)$ given a sentence $S$. Using probability, we can formulate the problem as follows:

$$P(Sem(S) \,|\, S) = P(T_1, T_2, \dots, T_M \,|\, S). \qquad (6.5)$$

Let’s assume that the triplets are independent, thus

$$P(T_1, T_2, \dots, T_M \,|\, S) \approx \prod_{m=1}^{M} P(T_m \,|\, S) \qquad (6.6)$$

where

$$P(T_m \,|\, S) = P(T_m(Subj, Pred, Obj) \,|\, w_1 \dots w_W). \qquad (6.7)$$

Furthermore, we limit our model to the first two triplet types from Equation 6.4. This means that all triplet objects are NEs and some triplet subjects are NEs, formally $T = T(C, Pred, N^\tau)$ and $T = T(N^{\tau_1}, Pred, N^{\tau_2})$. This limitation is based upon our observation that the other two types of triplets are not currently present in the corpus. In our model the triplet probabilities can be approximated as

$$P(T(C, Pred, N^\tau) \,|\, S) \approx P(T(C, Pred, N^\tau) \,|\, N^\tau_{j,k}) \cdot P(N^\tau_{j,k} \,|\, S) \qquad (6.8)$$

and

$$P(T(N^{\tau_1}, Pred, N^{\tau_2}) \,|\, S) \approx P(T(N^{\tau_1}, Pred, N^{\tau_2}) \,|\, N^{\tau_1}_{j,k}, N^{\tau_2}_{o,p}) \cdot P(N^{\tau_1}_{j,k} \,|\, S) \cdot P(N^{\tau_2}_{o,p} \,|\, S) \qquad (6.9)$$

Model parameter estimation. Given a training corpus $\mathcal{S}$ of sentences and their annotations, $\mathcal{S} = \{S_n, Sem(S_n)\}$, the probabilities are estimated using MLE1. The triplet probability from Equation 6.8 can be estimated from the corpus as

$$P(T(C, Pred, N^\tau) \,|\, N^\tau) = \frac{Cnt(T(C, Pred, N^\tau))}{Cnt(N^\tau)} \qquad (6.10)$$

and

$$P(T(N^{\tau_1}_{j,k}, Pred, N^{\tau_2}_{o,p}) \,|\, N^{\tau_1}_{j,k}, N^{\tau_2}_{o,p}) = \frac{Cnt(T(N^{\tau_1}_{j,k}, Pred, N^{\tau_2}_{o,p}), N^{\tau_1}_{j,k}, N^{\tau_2}_{o,p})}{Cnt(N^{\tau_1}_{j,k}, N^{\tau_2}_{o,p})} \qquad (6.11)$$

The NE probabilities $P(N^\tau_{j,k} \,|\, S)$ are calculated by a Named Entity Recognizer (NER).

Improving the model using a context. The presented model does not take into account any context of the recognized NEs. To improve the estimated triplet probabilities, we incorporated the context of the NEs. Equations 6.8 and 6.9 then change into the following:

$$P(T(C, Pred, N^\tau) \,|\, S) \approx P(T(C, Pred, N^\tau) \,|\, N^\tau_{j,k}, w_{j-1}) \cdot P(N^\tau_{j,k} \,|\, S) \qquad (6.12)$$

and

$$P(T(N^{\tau_1}, Pred, N^{\tau_2}) \,|\, S) \approx P(T(N^{\tau_1}_{j,k}, Pred, N^{\tau_2}_{o,p}) \,|\, N^{\tau_1}_{j,k}, N^{\tau_2}_{o,p}, w_{j-1}, w_{o-1}) \cdot P(N^{\tau_1}_{j,k} \,|\, S) \cdot P(N^{\tau_2}_{o,p} \,|\, S) \qquad (6.13)$$

The model training is then analogous to Equations 6.10 and 6.11, respectively. As the context, the model uses the lemma of the previous word $w_{j-1}$.

6.4 Named Entity Recognition

The named entity recognition is a crucial part of the question preprocessing. To achieve a reasonable performance, we incorporate three different NER approaches.

1Maximum Likelihood Estimation

6.4.1 Maximum Entropy NER

The maximum entropy-based NER (MaxEntNER) introduced in (Konkol & Konopík, 2011) was later extended and trained on a corpus from the Czech News Agency. The characteristics of that corpus are very different from our NL question corpus, since the Czech News Agency corpus consists of newspaper articles. Nevertheless, the MaxEntNER is used to identify a few general named entities, such as Place and City.

6.4.2 LINGVOParser

LINGVOParser (Habernal & Konopík, 2008), a shallow semantic parser based upon handwritten grammars, is used to identify named entities with a complex structure, such as Date, Currency or Number. The tool has proven to be efficient in such a task, as demonstrated e.g. in (Konopík & Habernal, 2009).

6.4.3 String Similarity Matching

This NER approach (called WordSimilarityTrainableNER) was used as an ad-hoc tool to deal with the remaining named entities that are not recognized by the two previous NERs. Recall that in our semantic annotation the set of named entity types is much broader than in general NER, because we use named entities to cover all semantically important word groups (e.g. named entities like AdditionalRequirements or AccommodationType).

6.4.4 OntologyNER

For evaluation purposes we developed another NER called OntologyNER. Whereas none of the previous NER types depends on the KB, this NER uses the KB instances for string similarity matching. Our assumption was that this type of NER would be suitable for recognizing e.g. places or various facilities. However, by incorporating this NER, the semantic analysis system is no longer independent of the particular back-end KB.

Chapter 7

Semantic Interpretation

In the previous chapters, the domain KB, the NL question corpus, and the semantic analysis model were described. To make the SWSNL system work end-to-end, a module responsible for the semantic interpretation and the semantic search is required. This chapter focuses on various approaches to this task. First, it presents the transformation from the domain-independent semantic annotation of a NL question to the target KB query language. This also involves e.g. mapping named entities to ontology instances and the semantic interpretation of named entities covering numbers or dates. Second, it deals with the semantic reasoning over the KB.

7.1 Transformation of Semantic Annotation into Query Language

After the NL question is automatically annotated with its semantic description, the annotation must be interpreted in order to find the desired answer. Our semantic representation of a NL question, as introduced in section 5.1, is independent of a particular KB. Thus, in order to perform a search in the KB represented by an OWL ontology, the semantic representation must be transformed into an ontology query language; in our system, SPARQL is used. Therefore, the transformation algorithm depends on the particular "back-end" KB. In our SWSNL system the transformation consists of a set of heuristic rules. As input, it takes the semantic annotation of a NL question; as output, it produces a SPARQL query for the back-end KB. In our accommodation scenario (with the KB from section 4.3), the output SPARQL skeleton is shown in Listing 7.1. Each triplet from the semantic annotation is transformed into one or more SPARQL query statements that are inserted into the WHERE clause.


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <...>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?Accommodation WHERE {
  ...
}

Listing 7.1: SPARQL skeleton produced by the transformer.

Example. If the semantic annotation contains a triplet whose object is an AdditionalRequirements named entity (a sub-type of NamedEntity) covering the word 'internet', it is interpreted as '?Accommodation p:hasFacility p:facInternet-connection .'. Apparently, this is an easy and straightforward example. Nevertheless, it points out the practical issues regarding a proper semantic interpretation of the named entities.
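Plugged into the SPARQL skeleton from Listing 7.1, the whole query for this example would look like the following sketch (prefixes elided, as in the listings):

[...]
SELECT DISTINCT ?Accommodation WHERE {
  ?Accommodation p:hasFacility p:facInternet-connection .
}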

7.1.1 Semantic Interpretation of Named Entities

The named entities, as a part of the semantic annotation of NL questions, only span over word(s). A named entity carries only its type; no semantic meaning is specified. In order to incorporate the named entities into the search process properly, it is necessary to interpret their content.

Matching Proper Names to Ontology Instances

As shown in the previous example, most of the named entities should have some relation to the underlying KB. In our case, entities such as City or Place relate to the instances of the ontology classes expressing locations, etc. The mapping between the KB instances and the named entities is based upon string similarity. Only a selected subset of the named entity types is matched using this technique. In order to select the best string similarity (or string distance) measure, we created a small set of ontology instances paired with named entities and measured the accuracy of various types of string metrics. The best accuracy was achieved with the Jaro-Winkler distance (Winkler, 1990). Detailed results will be shown later in the evaluation part, where we present the accuracy of matching the named entities to the ontology instances.
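For reference, the standard textbook definition of the metric (not a detail specific to this thesis) is as follows. The Jaro similarity of two strings $s_1, s_2$ is

$$sim_J(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),$$

where $m$ is the number of matching characters (characters that are equal and no further apart than $\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$ positions) and $t$ is half the number of transpositions; $sim_J = 0$ if $m = 0$. The Winkler modification boosts the score of strings sharing a common prefix:

$$sim_{JW}(s_1, s_2) = sim_J + \ell \, p \, (1 - sim_J),$$

where $\ell$ is the length of the common prefix (at most 4) and $p$ is a scaling factor, commonly $p = 0.1$.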

Interpretation of Named Entities with Numbers and Dates

Other named entities that must be interpreted are Number and Date. These entities are not related to any ontological instances (they do not directly represent any instance from the KB); rather, they mostly appear as constraints in the NL questions. The LINGVOParser tool used to recognize this type of named entity (see section 6.4.2) also has built-in support for the semantic interpretation of the content of named entities. Given numbers, dates and currencies expressed in NL, the tool translates them into a computer representation (e.g. integers or dates). This makes it easy to integrate interpretations of these entities into the resulting SPARQL query.

7.2 Practical Issues of Semantic Interpretation

The issue of vague expressions used to express the user needs in NL questions was already pointed out in section 4.4.2, where the correct results from the KB were assigned to the NL questions. In the phase of deciding whether a particular result really "answers" a particular question, the human "annotator" can treat the task from his or her subjective perspective. However, this decision must be precisely formalised for a computer system. Thus we applied the same decision rules as in the process of creating the testing data. These rules are automatically expressed as SPARQL statements.

Transformation example. Let us have a question asking for the "cheapest" accommodation. The semantic annotation of that NL question will then contain a triplet whose subject is a RelativePrice named entity with the semantic content "cheapest". Using the heuristic transformation rules, the output SPARQL query will be extended with the snippet from Listing 7.2. We can see the transformation as a mapping from the semantic annotation to the SPARQL query. This example also demonstrates that the NL question ontology is different from the KB ontology.

Non-trivial transformation example. A much more complicated situation arises when the semantic annotation contains relative location requirements. This typically means that the NL question contains a phrase like "places near City" etc.1 Listing 7.3 depicts the construct used for transforming a semantic annotation whose triplets involve a RelativeLocationRequirements named entity containing "near". This (rather complicated) example demonstrates that a lot of attention must be paid to many details if the SWSNL system is intended to be deployed into real-world use.

1As mentioned earlier in Table 4.6, we decided to interpret these constraints using the location hierarchy. Another option would be to use the GPS coordinates and compute distances between the particular city and the possible places around it, but this brings even more complicated tasks, such as how to get the GPS coordinates of a city, what the distance for "near" is, etc.

[...]
SELECT DISTINCT ?Accommodation WHERE {
  ...
  ?Accommodation p:hasRoomType ?roomType .
  ?roomType p:pricePerNight ?price
} ORDER BY ASC(?price) LIMIT 1

Listing 7.2: A SPARQL snippet with the interpretation of the ’cheapest’ named entity from the semantic annotation.

[...]
SELECT DISTINCT ?Accommodation WHERE {
  ?Accommodation p:hasGPSCoordinates ?gps .
  p:cityInstance p:isSubPartOf ?district .
  ?district rdf:type p:District
  {
    ?loc p:isSubPartOf ?district .
    ?gps p:isLocatedIn ?loc
  } UNION {
    ?gps p:isLocatedIn p:cityInstance
  }
  [...]
}

Listing 7.3: An example of dealing with relative location requirements.

7.3 Semantic Reasoning and Result Representation

Given all the mechanisms introduced in the previous sections, it is possible to formulate the SPARQL query very precisely. Once the query is created, it can be passed to an ontology reasoner. compute distances between the particular city and the possible places around but this brings even more complicated tasks, such as how to get a GPS coordinates of a city, what is the distance for “near”, etc. 73

Reasoner class        Found instances    Total time
PelletReasoner        206                15 min, 14 sec
OWL DL MEM            12                 11 sec
RDFSRuleReasoner      12                 16 sec
OWL LITE MEM          12                 12 sec
OWLFBRuleReasoner     206                5 h, 15 min, 26 sec
TransitiveReasoner    12                 11 sec
OWLMicroReasoner      206                55 min, 5 sec

Table 7.1: Comparison of abilities and performance of various OWL reasoners.

Reasoner Performance Comparison. Currently, various OWL reasoner implementations are available. They also vary in their OWL expressivity, i.e. in which subset of the OWL semantics they can handle (see section 2.2). To select a suitable reasoner for our system, a simple test was carried out. The most important OWL feature we use is the transitivity among locations. Thus, a simple SPARQL query was executed to obtain all instances in a particular location and its sub-locations. Table 7.1 outlines the results of this test. We tested the reasoners from the Jena package as well as the Pellet reasoner. Some reasoners do not support transitive properties (they returned fewer than 206 instances in our test), so they were disqualified. Given the measured performance among the remaining reasoners, the Pellet reasoner was selected for our system. The performance of the reasoners will be discussed later in chapter 9.
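A minimal sketch of the kind of test query, assuming the transitive p:isSubPartOf property from Listing 7.3 and a hypothetical p:regionInstance location: with reasoning enabled, the single triple pattern below should return sub-locations at any depth of the location hierarchy, which is exactly what the disqualified reasoners failed to deliver.

[...]
SELECT ?loc WHERE {
  ?loc p:isSubPartOf p:regionInstance
}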

Presenting Search Results. As already mentioned, in our domain the result of the semantic search is a set of accommodation instances (in fact, a set of instance URLs). To present the complete information to the user, a listing of all properties of the returned accommodation instances is shown in our system prototype.

Chapter 8

Evaluation

In this chapter, an evaluation of the complete SWSNL system is presented. We focus on the evaluation of the individual parts of the system as well as on the complete end-to-end performance. Since there exist no widely accepted standards for the evaluation of SWSNL/NLISW systems (see section 2.4), we use the standard IR evaluation metrics, namely the accuracy

$$Acc = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}} \qquad (8.1)$$

and the precision ($p$), the recall ($r$) and the F-measure ($Fm$):

$$p = \frac{tp}{tp + fp}, \quad r = \frac{tp}{tp + fn}, \quad Fm = \frac{2pr}{p + r}, \qquad (8.2)$$

where $tp$ are true positives, $fp$ are false positives, and $fn$ are false negatives (Manning et al., 2008). Unless otherwise stated, for the corpus-based tasks we use leave-one-out cross-validation (Liu, 2011) due to the small size of our corpus.

8.1 Semantic Analysis Evaluation

The ontology-based statistical model for the semantic analysis was outlined in chapter 6. Since it consists of two steps, the Named Entity Recognition and the Semantic Analysis, its evaluation is split into two sections as well.

8.1.1 Evaluation of NER

The system uses four types of NER, namely the MaxEntNER, the WordSimilarityTrainableNER, the LINGVOParser, and the OntologyNER (see section 6.4).


NER type                      p        r        Fm
WordSimilarityTrainableNER    0.7056   0.4279   0.5327
MaxEntCZNER                   0.2614   0.1143   0.1590
OntologyNER                   0.8137   0.0244   0.0475
LingvoParserNER               0.4118   0.0125   0.0243

Table 8.1: Overall results of various NER on the accommodation corpus.

When measuring the NER performance, a true positive is counted when the named entity from the annotated corpus exactly matches the entity returned by the NER, taking into account both the NE type and the word span.

Explanation of the results. As can be seen in Table 8.1, the best $Fm$ is obtained from the WordSimilarityTrainableNER. The results of the MaxEntNER are poor in this overall evaluation. The very high precision of the OntologyNER probably has the following reason: if the KB contains a place or a city, it is very likely to be a place or a city in the testing sentence, too. On the other hand, only a small number of the named entities from the question corpus are stored in the KB, which causes the very low recall.

NER Performance by Entity Type

In order to explain the performance of the NERs in more detail, we measured the performance by named entity type. Figure 8.1 outlines the $Fm$ of the above-mentioned NERs on different named entity types. Given these results, it is apparent that the MaxEntNER outperforms the other NERs on the City NE by almost 100% relative. This validates our assumption that the MaxEntNER, even though it is trained on a completely different corpus, yields acceptable results for general named entities, such as cities or places. Furthermore, the LINGVOParser performs well for numbers.

CombinedNER

Inspired by these results, we developed the CombinedNER, which is a combination of the existing NERs: for each entity type, the NER giving the best result is chosen. The overall results of the CombinedNER are $p = 0.5218$, $r = 0.4891$, $Fm = 0.5049$. This NER gives a slightly smaller $Fm$ than the WordSimilarityTrainableNER ($Fm = 0.5327$, see Table 8.1). However, we expected that the CombinedNER would perform reasonably on different corpora. We assume that the WordSimilarityTrainableNER works well for identifying domain-specific named entities that are lexically close, whereas the MaxEntNER is suitable for recognising more general, domain-independent named entities. This will be shown later in this chapter.

Figure 8.1: Performance of the NERs on different NE types.

8.1.2 Evaluation of the Semantic Analysis Model

Evaluation metrics. Although the result of the semantic analysis is a rather complex semantic annotation, there are a few possible ways to measure the performance. Measuring the accuracy based upon an exact match of the returned result and the gold annotation1 is, in our opinion, unreasonably strict. The tree edit distance (see section 3.5.3) is also not applicable because the semantic annotation can form a directed graph instead of a bare tree. Thus, a good approximation of the quality of the analysed annotation is the number of matching triplets. Recall that our semantic annotation can also be seen as a set of connected triplets. This allows us to measure the precision and the recall. A true positive result triplet is a triplet that matches the gold triplet as follows:

• The predicates are the same.

• The objects/subjects have the same ontology class, i.e. two City instances match, while a City instance and a Place instance do not.

• Furthermore, if the objects/subjects are named entities, their positions in the sentence and their contents must be the same; i.e. two named entities of the same type with different spans or content words do not match.
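For instance (with a hypothetical predicate hasPlace), a returned triplet $T(Accommodation, hasPlace, N^{City}_{2,3})$ is a true positive only against an identical gold triplet; if the gold object were $N^{Place}_{2,3}$ (a different type) or $N^{City}_{3,4}$ (a different span), the returned triplet would count as a false positive.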

1We use the term gold annotation/triplet for the correct manual annotations from the corpus.

Figure 8.2: The semantic model performance depending on the training set size.


Simulation of a perfect NER – MockNER.

As mentioned before, the statistical semantic analysis model heavily depends on two inputs. First, preprocessing and NER are crucial for its functionality. Second, it requires an annotated corpus to train the model. To evaluate the model without its being influenced by errors from the NER, we performed tests with the so-called MockNER. This NER recognizes all entities in the sentence correctly because it uses the annotated data. The first experiment, shown in Figure 8.2, examines how much the model performance depends on the training corpus size. For this experiment we used the MockNER and our statistical context-dependent semantic model (see section 6.3). Recall that leave-one-out cross-validation was used.

Discussion of the results achieved with MockNER. Given these results, we can draw the following conclusion. Even with small training data, all returned triplets are correct ($p \to 1$). However, to find all triplets, a larger training set is probably required ($r$ grows more slowly). To validate this conclusion, results on other corpora will be shown later in this chapter.

Results with CombinedNER

The real results (achieved by incorporating the CombinedNER into the semantic analyser) are shown in Table 8.2. Since the results might seem poor at first glance, we will try to uncover the reasons thoroughly. First, the criteria for comparing the triplets, as shown earlier in section 8.1.2, are very strict from our point of view. This mostly concerns incorrectly recognized named entities with mismatching types (e.g. City vs. Place) or NE content words. Second, the corpus of questions is too small to train the model properly. A very important part of the CombinedNER for recognizing non-general named entities (the WordSimilarityTrainableNER) is completely trained from the data. This negatively affects the semantic model performance. However, to prove the qualities of our semantic analysis model, we will show results from a different corpus later in this chapter.

                         p        r        Fm
CombinedNER+ContextSM    0.6627   0.3398   0.4492

Table 8.2: The results of the semantic analysis with the CombinedNER

8.2 Matching Named Entities to KB Instances

As described in section 7.1.1, some NEs are mapped onto their corresponding instances from the KB. This step is essential for creating the SPARQL query. A string similarity metric is used for the mapping. We tested various types of metrics on a set of 50 named entities and their manually assigned KB instances. Figure 8.3 shows the accuracy of the mapping for various types of metrics. The best results were obtained using the Jaro-Winkler metric. We also experimentally found the best similarity threshold for deciding whether a NE matches an instance or not. These results also give some insight into the accuracy of the NE matching, which later affects the overall performance: even when the semantic analysis produces a correct annotation with correct named entities, the chance that they will be mapped onto the right KB instances is not 100%. In the overall evaluation, we examined several SPARQL queries produced by the system. Basically, there are three types of errors. First, the named entity is expressed in such a way that the string similarity between the words and the instance is very low. Second, the desired named entity has no corresponding instance in the KB. These two types of errors are quite obvious. The third type of error is caused by an inconsistency of the KB, where two very similar instances have very similar descriptions. For example, the KB contains more than 30 instances with Praha in their labels (referring to e.g. a District, a City or a city part). The reason for this inconsistency is that the KB was constructed automatically from existing data (see sections 4.3.2 and 4.3.3).

Figure 8.3: Various string metrics testing.

8.3 End-to-end Performance Evaluation

Each question in our NL question corpus has its corresponding set of correct search results (gold results), as described in section 4.4.3. This allows us to evaluate the performance of the complete SWSNL system.

Evaluation metrics. For each question the system returns a set of accommodations. There is no ranking of the results because the KB querying is purely boolean. Thus, we compute the precision and the recall of the returned set compared to the gold set.

8.3.1 Fulltext Search

As a baseline for the end-to-end evaluation we used a fulltext search. First, the content of the KB was transformed into "text documents". More precisely, each accommodation instance with all its properties was treated as a single document. The documents were indexed using Apache Lucene with the best available Czech stemmer (Dolamic & Savoy, 2009). In this experiment, the question was treated as a bag of words with stop-words removed. The overall results are shown in Table 8.3.

           p        r        Fm
Fulltext   0.0159   0.4229   0.0306

Table 8.3: The end-to-end results of the fulltext search.

The results show very clearly that the keyword-based search performs poorly on such a complicated task. However, this might serve as an approximation of the results that would be returned by current state-of-the-art search engines.

8.3.2 Simulation of End-to-end Performance With Correct Semantic Annotation

Another interesting and important experiment measured the results returned when the correct (gold) semantic annotation was used for each NL question. These results also set the upper limit for our semantic model approach, because the gold semantic annotation is the best semantic annotation possible. In other words, it simulates a scenario in which the semantic model reaches 100% accuracy.

               p        r        Fm
GoldSemAnnot   0.7310   0.8432   0.7831

Table 8.4: The end-to-end results of the search using the gold semantic annotations.

The results (Table 8.4) not only illustrate the influence of the semantic interpretation and the named-entity-to-ontology matching (see chapter 7) but also confirm that some answers cannot be found by a purely structured search. This topic was already discussed in section 4.4, where the process of creating the gold results was presented.

8.3.3 The End-to-end Results of the SWSNL System

Table 8.5 shows the overall results of the SWSNL system. The table also contains the best baseline keyword search as well as the simulation with the gold semantic annotation.

Discussion of the results. The most important conclusion, given these results, is that our Semantic Web search significantly outperforms the fulltext search. Whereas the fulltext-based search simply fails in this complex search task, our SWSNL system is able to yield acceptable results.

               p        r        Fm
SWSNL          0.2493   0.6955   0.3671
Fulltext       0.0159   0.4229   0.0306
GoldSemAnnot   0.7310   0.8432   0.7831

Table 8.5: The end-to-end results of our SWSNL system.

The results are very promising. In order to prove the qualities of our ap- proach using the semantic analysis model, we conducted more tests. They are presented in the following section.

8.4 Evaluation on Other Domains and Languages

In the following section, only the semantic analysis module is evaluated as there exist no databases/KBs for these datasets. Thus, it is impossible to evaluate the end-to-end performance.

8.4.1 ConnectionsCZ Corpus

The ConnectionsCZ corpus is another NL question corpus in the Czech language. It contains 238 questions in the public transportation domain. Most of the utterances ask for a certain connection from one place to another, for train types, travel times, etc. The corpus is a subset of a larger corpus used in (Habernal & Konopík, 2009). This type of corpus follows the fashion of corpora for SLU or human-computer spoken dialogue, such as ATIS, etc. Since our current work deals with search and Semantic Web technologies, we are not convinced that our SWSNL system would be the best approach for this domain and task. Nevertheless, this evaluation serves as a proof that our semantic model is domain independent and that it could perform well if a larger training corpus were available. Originally, the questions were annotated with semantic trees. In order to port the corpus into our system, we created an ontology for NL questions in this particular domain (see chapter 5). The domain-dependent question ontology is shown in Figure 8.4. The questions were then annotated according to this ontology.

NER performance. We used the same set of NERs as in the accommodation domain. The results for each entity type are depicted in Figure 8.5. They show that the LINGVOParser performs well on this task (compared to its results on the accommodation domain). Finally, we incorporated the CombinedNER into the semantic analysis module.

Figure 8.4: The ontology of the questions in the connection domain.


Semantic Analysis performance. In our two experiments, we again used the MockNER to simulate 100% accuracy of the NER. Figure 8.6 (left-hand side) illustrates the improvement of the semantic analysis as the training set size grows. The same figure (right-hand side) also contains the results when the CombinedNER was used. The improvement in performance is caused by the larger training set size for both the NER and the semantic model. Note that in the first scenario (with the MockNER) we ran the test for every training data size (1, 2, ..., N), whereas in the second scenario (with the CombinedNER) the training data size was increased by adding ten sentences each run (1, 11, 21, ..., N).

8.4.2 ATIS Corpus Subset

The English ATIS corpus (already introduced in section 3.6) has been widely used in NLU research. We extracted only a small subset of the corpus (348 sentences) and annotated it using the same question ontology as for the Czech ConnectionsCZ corpus (see Figure 8.4). Since our NER components are focused on the Czech language, we performed only one test on the ATIS corpus, to evaluate the performance of our semantic model on a different language. The results with the MockNER are shown in Figure 8.7.

Discussion of results. We decided to test our system on the ATIS corpus just to get an insight into the possibilities of porting the semantic model to another language. This should not serve as a comparison with other state-of-the-art semantic parsers, basically for two reasons.

Figure 8.5: The NER results by entity type on the ConnectionsCZ corpus.

Figure 8.6: The evaluation of the semantic analysis on the ConnectionsCZ corpus using different training set sizes.

Figure 8.7: The evaluation of the semantic analysis on the ATIS corpus using different training set sizes.

First, our semantic annotation is different from the original frame-based ATIS annotation. Second, we use a small subset of the whole corpus, which also negatively affects the performance. The complete corpus was not used because manual annotation according to our semantic representation was required and annotating the complete corpus was not feasible. The obtained results illustrate that our semantic analysis model can be used across different languages. No tweaking is required to adapt the system to English. The achieved performance (about 0.73 F-measure) is satisfactory, given the fact that the system was primarily developed on a different domain and language.

Chapter 9

Conclusion

This final chapter outlines the major contributions of this thesis as well as open issues and future work. The Semantic Web search using natural language is, beyond all doubt, a challenging task, and many theoretical and practical problems must be solved in order to develop a real-world system. We will examine these issues from many perspectives and also propose solutions to some of them.

9.1 Open Issues

9.1.1 Performance Issues

One of the most critical issues, which actually prevents our system from being tested in the real Web environment, is the performance of the ontology reasoning. As already shown in Table 7.1, we tested several OWL reasoners and their ability to deal with transitive properties. The Pellet reasoner requires approximately 15 minutes on average to answer a single SPARQL query with transitive relations. We suspect that the reason is the size of our KB (281,686 triplets, see section 4.3.3). However, to the best of our knowledge, there is no room for improvement without changing the back-end Semantic Web tools completely.

Possible solutions. First, another ontology storage engine can be used, e.g. Sesame1. Second, the back-end can be completely switched to a relational database, which would, however, disable the reasoning capabilities of the system.

1http://www.openrdf.org/


9.1.2 Deployment and Development

Whereas relational databases provide a stable, scalable, robust, and flexible solution for storing large-scale data, many Semantic Web technologies are still far from being suitable for deployment in the commercial sector. Although the Semantic Web vision is now more than ten years old, many of the software tools are intended only for basic research and are impractical for any real work. This is the case of e.g. Protégé, which is one of the most popular ontology designers, yet it still has very poor usability. Moreover, there is a lack of good guidelines for developers who are new to the Semantic Web field. After reading e.g. (Hebeler, Fisher, Blace, Perez-Lopez, & Dean, 2009) or (Segaran, Taylor, & Evans, 2009), developers get an idea of how the Semantic Web should work in an ideal case, but the programming experience is then very different. Developers must often rely on a small user base, obsolete or non-existent documentation, or reading the source code of libraries.

9.1.3 Problems Caused by Real Web Data

Many practical issues related to the structure and the content of the KB were pointed out in section 4.3.3. The most important problem is the missing structured data. In other words, the data from the source Web site are incomplete. However, this will be the case for most systems that use public Web sources for populating their KBs. It is not feasible either to check or to correct the data manually. Another problem is caused by inconsistent source data. In our scenario, various KB instances have the same label, which makes the mapping from named entities to ontology instances much harder (see section 7.1.1).

Possible solutions. Better data post-processing after the information extraction step; nevertheless, a huge manual effort would be required.

9.1.4 Research-related Issues

Since the developed SWSNL system is unique in many ways, it is not possible to compare its results with any other system. However, this is not the case for our system only. As pointed out in section 2.4, the whole research field of NLISW and NLIDB suffers from a lack of standard test sets and performance measures.

Possible solutions. A sort of consensus in the Semantic Web community is necessary to formulate the goals of the Semantic Web search more pre- cisely. Furthermore, a dataset should be provided, together with clear and reasonable evaluation criteria.

9.1.5 Use in the Business Sector

This thesis presents a proof-of-concept of an interesting and promising technology with a good commercial potential. However, before the system can be used in the business sector, a lot of questions must be answered. The main problems are the quality and the source of the domain data. While the data can be carefully prepared for a proof-of-concept development, this is not feasible for a system dealing with various sources and operating on different domains. In other words, the ontology, the KB or the database (wherever the domain data is stored) is the most valuable commodity for the Web portals. If a company runs a business by e.g. providing a Web portal in the accommodation domain, it profits from presenting the data on its website, including advertisements, hot deals, etc. By providing data access to third parties, this business model fails. In our opinion, no company will provide a highly valuable and high-quality domain KB for free. The idea of interchanging information using ontologies in the Semantic Web misses its business model aim.

9.2 Future Work

Basically, there are four directions that are worth further exploring:

• Larger NL question corpus. The main limitation of our current semantic model performance is the small size of the corpus. Thus, it would be beneficial to obtain more data, e.g. in a bootstrapping manner, once the prototype is deployed on a testing server and opened to public access.

• Better Named Entity Recognition. The results definitely show that the NER component is crucial in the semantic search. The better the NER is, the better the semantic model performs and the more precise the search results are.

• Performance Improvement. Replacing the back-end with more scalable Semantic Web tools or with a relational database.

• Integrating the fulltext search. The combination of the structured search and the fulltext search is a promising future work, based upon our preliminary research.

9.3 Final Conclusion

This thesis describes a complete end-to-end system for Semantic Web search using natural language. The work is put into the context of state-of-the-art systems for Natural Language Understanding and Natural Language Interfaces to the Semantic Web. It presents a complete work-flow including data preparation, a natural language corpus, ontology design, annotation, a semantic model and search. It also contains a very detailed evaluation with promising results.

9.3.1 Major Contributions

• The ontology-based question semantics independent of the ontology for storing the domain knowledge.

• The statistical model for the semantic analysis based upon supervised training.

• The evaluation of the fully functional end-to-end system with real Web data and real user queries.

9.3.2 Review of Aims of the Ph.D. Thesis

The following tasks are taken from the author's Ph.D. thesis exposé (Habernal, 2009).

• Continue the work of (Konopík, 2009): study the system, obtain additional semantically annotated data and measure the performance of the system.

• Compare the system to other existing systems. Explore the possibilities of obtaining a standard semantic corpus (e.g. ATIS) and evaluate the system using this data.

These two goals were accomplished successfully, see Appendix A. The semantic analysis system from (Konopík, 2009) was evaluated on the ATIS corpus with encouraging results that were published in the proceedings of the international conference Advanced Data Mining and Applications 2010.

• Propose and evaluate novel methods in order to improve the semantic analysis system. Focus on robust processing of faulty data. Consider the specific properties of the Czech language.

• Experiment with the use of the developed system for the Semantic Web search using natural language.

These two goals are covered in the main content of this thesis and they were fulfilled completely. First, a new formalism for describing the semantics of natural language questions was introduced. Second, a statistical semantic analysis model based upon supervised learning was proposed. Third, a complete end-to-end Semantic Web search system using natural language was developed. Finally, the system was thoroughly evaluated on real Web data with very promising results.

Appendix A

Hybrid Semantic Analysis System – ATIS Data Evaluation∗

Abstract In this article we show a novel method of semantic parsing. The method deals with two main issues. First, it is developed to be reliable and easy to use. It uses a simple tree-based semantic annotation and it learns from data. Second, it is designed to be used in practical applications by incorporating a method for data formalization into the system. The system uses a novel parser that extends a general probabilistic context-free parser by using context for better probability estimation. The semantic parser was originally developed for Czech data and for written questions. In this article we show an evaluation of the method on a very different domain – ATIS corpus. The achieved results are very encouraging considering the difficulties connected with the ATIS corpus.

∗This appendix is the author's conference paper (Habernal & Konopík, 2010). As outlined in section 9.3.2, one of the thesis goals was to "[...] Explore the possibilities to obtain a standard semantic corpus (e.g. ATIS) and evaluate the system using this data.". This task was successfully fulfilled, resulting in the following conference paper. However, there are two reasons why this achievement is not presented in the main part of the thesis. First, the semantic analysis system used in this paper is based on the work of (Konopík & Habernal, 2009) and is not related to the semantic analysis system developed in this thesis. Second, the semantic annotation of the ATIS corpus uses a tree representation, whereas the main semantic framework in this thesis is based on ontologies and Semantic Web technology. The conference paper is left without editing to demonstrate that one of the thesis goals was accomplished successfully.


A.1 Introduction

Recent achievements in the area of automatic speech recognition started the development of speech-enabled applications. Currently it starts to be insufficient to merely recognize an utterance. The applications demand to understand the meaning. Semantic analysis is a process whereby the computer representation of the sentence meaning is automatically assigned to an analyzed sentence. Our approach to semantic analysis is based upon a combination of expert methods and stochastic methods (that is why we call our approach a hybrid semantic analysis). We show that a robust system for semantic analysis can be created in this way. During the development of the system an original algorithm for semantic parsing was created. The developed algorithm extends the chart parsing method and context-free grammars. Our approach is based upon the ideas from the Chronus system (Pieraccini et al., 1992) and the HVS model (He & Young, 2005). At first the hybrid semantic analysis method is described in this article. Then we test how the method can be adapted to a domain (ATIS corpus, English data, spoken transcriptions) which is very different from the original data (LINGVOSemantics corpus, Czech data, written questions). The last part of the article shows the results achieved on both domains and it compares our results with a state-of-the-art semantic analysis system (D. Zhou & He, 2009).

A.2 Related Work

A significant system for stochastic semantic analysis is based on the HVS model (hidden vector-state model) (He & Young, 2005). The system was tested on the ATIS and DARPA corpora; recently the system was also used for semantic extraction from the bioinformatics corpus Genia. The first model training was based on MLE (maximum likelihood estimation); however, discriminative training has also been proposed. To our knowledge, the system presented in (D. Zhou & He, 2009) achieved the state-of-the-art performance on the ATIS corpus. An extension of the basic HVS Parser has been developed in the work of (Jurčíček, 2007). The improvement is achieved by extending the lexical model and by allowing left-branching. The system was tested on a Czech human-human train timetable corpus and it is publicly available. Scissor (Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations) is another system, which uses a syntactic parser enriched with semantic tags, generating a semantically augmented parse tree.

Figure A.1: An example of a semantic annotation tree.

Since it uses the state-of-the-art syntactic parser for English, the Collins parser, we suppose that it cannot be easily adapted to other languages. In (Y.-Y. Wang et al., 2005), the generative HMM/CFG composite model is used to reduce the SLU slot error rate on ATIS data. Also a simple approach to encoding the long-distance dependency is proposed. The core of the system is based on conditional random fields (CRF) and the previous slot context is used to capture non-local dependency. This is an effective and simple heuristic, but the system requires a set of rules to determine whether the previous slot word is a filler or a preamble. Thus, it is difficult to port the system to other domains.

A.3 Semantic Representation

There are several ways to represent the semantic information contained in a sentence. In our work we use tree structures (see Figure A.1) with the so-called concepts and lexical classes. The theme of the sentence is placed on the top of the tree. The inner nodes are called concepts. The concepts describe some portion of the semantic information contained in the sentence. They can contain other sub-concepts that specify the semantic information more precisely or they can contain the so-called lexical classes. Lexical classes are the leaves of the tree. A lexical class covers certain phrases that contain the same type of information. For example, a lexical class "date" covers phrases "tomorrow", "Monday", "next week" or "25th December" etc. The described semantic representation formalism uses the same principle as originally described in (He & Young, 2005). The formalism is very advantageous since it does not require annotation of all words of a sentence. This makes it suitable for practical applications, where the provision of large-scale annotated training data is always complicated.

A.4 Data

This section describes two corpora used for training and testing. The Czech corpus (LINGVOSemantics corpus) was used at the beginning for the de- velopment of the method. The English corpus (ATIS) was used to find out whether the designed method is universal and can be successfully used for a different corpus. The second reason for using the ATIS corpus is to compare our method with the state-of-the-art system that has been also tested on the ATIS corpus.

A.4.1 LINGVOSemantics corpus

The data used during the development are questions to an intelligent Internet search engine. The questions are in the form of whole sentences because the engine can operate on whole sentences rather than just on keywords as usual. The questions were obtained during a system simulation. We asked users to put some questions into a system that looked like a real system. In this way we obtained 20 292 unique sentences. The sentences were annotated with the aforementioned semantic representation (Section A.3). More information about the data can be found in (Konopík, 2009). An example of the data follows (How warm will it be the day after tomorrow?):

WEATHER(Jaká EVENT(teplota) vzduchu bude DATETIME(DATE(pozítří))?)

A.4.2 ATIS corpus

One of the corpora commonly used for testing semantic analysis systems in English is the ATIS corpus. It was used for evaluation in e.g. (He & Young, 2005), (Iosif & Potamianos, 2007), (Jeong & Lee, 2008) and (C. Raymond & Riccardi, 2007). The original ATIS corpus is divided into several parts: ATIS2 train, ATIS3 train, two test sets, etc. (Dahl et al., 1995). The two test sets, NOV93 (448 sentences) and DEC94 (445 sentences), contain the annotation in the semantic frame format. This representation has the same semantic expressive ability as the aforementioned semantic tree representation (Section A.3). Each sentence is labeled with a goal name and slot names with associated content.

The corpus does not contain any fixed training set. Originally, in (He & Young, 2005), 4978 utterances were selected from the context-independent training data in the ATIS2 and ATIS3 corpora, and abstract semantics for each training utterance were derived semi-automatically from the SQL queries provided in ATIS3. At this point we have to thank Y. He for sharing their data. It allowed us to test our system on the same testing data as their state-of-the-art system for semantic analysis. However, deeper exploration revealed that the training data are specially tailored to the HVS model. The data were in the form of HVS model stacks, and the conversion from stacks to proper trees was ambiguous and difficult. Nevertheless, we were still able to use the test data (stored in the semantic frame format) and the plain sentences from the training data.

Instead of trying to convert the training data from HVS stacks, or obtaining the original SQL queries and converting them, we decided to annotate a part of the ATIS corpus using the abstract semantic annotation (see Section A.3). Using the methodology described in (Habernal & Konopík, 2009), we initially created an annotation scheme¹ from the test data. In the first step, 100 sentences from the ATIS2 training set were manually annotated. Thereafter the system was trained using these data. Another set of sentences was automatically annotated and then hand-corrected (this incremental annotation methodology is called bootstrapping). In total, we annotated 1400 random sentences from the ATIS2 and ATIS3 training sets.
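The bootstrapping procedure can be summarised by the following loop. This is only a schematic sketch: train, auto_annotate and hand_correct are stand-ins for the actual training procedure, the trained parser, and the human correction step, respectively.

    def bootstrap_annotation(sentences, train, auto_annotate, hand_correct,
                             seed_size=100, batch_size=200):
        """Incremental (bootstrapping) annotation: hand-annotate a small seed,
        then repeatedly retrain, pre-annotate the next batch automatically
        and only hand-correct the proposals."""
        annotated = [hand_correct(s, proposal=None) for s in sentences[:seed_size]]
        rest = sentences[seed_size:]
        while rest:
            model = train(annotated)              # retrain on everything so far
            batch, rest = rest[:batch_size], rest[batch_size:]
            proposals = [auto_annotate(model, s) for s in batch]
            annotated += [hand_correct(s, p) for s, p in zip(batch, proposals)]
        return annotated

Hand-correcting machine proposals is typically much faster than annotating from scratch, which is what makes the scheme practical for the 1400 sentences mentioned above.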

A.5 System Description

The system consists of three main blocks (see Figure A.2). The preprocessing phase prepares the input for semantic analysis; it involves sentence normalization, tokenization and morphological processing. The lexical class analysis is explained in Section A.5.1 and the probabilistic parsing is explained in Section A.5.2.

A.5.1 Lexical Class Identification

The lexical class identification is the first phase of the semantic analysis. During this phase the lexical classes (see Section A.3) are found in the input sentence.

The lexical class identification consists of two stages. First, several dedicated parsers are run in parallel; during this stage a set of lexical classes is found. In the second stage the lexical classes are stored in a lattice, and the lattice is then converted into possible sequences of lexical classes. Only the sequences that contain no overlapping classes are created.

¹An annotation scheme is a hierarchical structure (a tree) that defines a dominance relationship among concepts, themes and lexical classes. It says which concepts can be associated with which super-concepts, which lexical classes belong to which concepts, and so on. More in (Habernal & Konopík, 2009).


Figure A.2: The structure of our semantic analysis system.

During the first stage the lexical classes are found as individual units. In our data we found two groups of lexical classes:

1. Proper names, multi-word expressions, enumerations.

2. Structures (date, time, postal addresses, ...).

To analyze the first group, we created a database of proper names and enumerations (cities, names of stations, types of means of transport, etc.). Since a lexical class can consist of more than one word, it is necessary to look for multi-word expressions (MWEs). To solve the search problem efficiently, a specialized searching algorithm was developed. Its main ideas are organizing the lexical classes in a trie structure (Knuth, 1997) and using parallel searching. The algorithm ensures that all lexical classes (possibly consisting of several words) are found during one pass with a linear complexity O(n), where n is the number of letters of the sentence. The algorithm is explained in detail in (Konopík & Habernal, 2009).

For the analysis of complicated phrases, such as dates, numbers or times, the LINGVOParser (Habernal & Konopík, 2008) was developed. It is an implementation of a context-free grammar parser with some features suitable for semantic analysis. It uses so-called active tags, which contain processing instructions for extracting semantic information (for more information see (Habernal & Konopík, 2008)).
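The idea of the trie-based parallel search can be sketched as follows. This is a simplified illustration (it ignores, e.g., word boundaries and lemmatization); the published algorithm (Konopík & Habernal, 2009) differs in details.

    def build_trie(phrases):
        """Build a character trie; the key "$" marks the end of a stored phrase."""
        root = {}
        for phrase, lexical_class in phrases:
            node = root
            for ch in phrase:
                node = node.setdefault(ch, {})
            node["$"] = lexical_class
        return root

    def find_lexical_classes(sentence, trie):
        """One left-to-right pass; a set of parallel cursors walks the trie,
        so all (possibly overlapping, multi-word) matches are reported."""
        matches, cursors = [], []        # cursors: (start_index, trie_node)
        for i, ch in enumerate(sentence):
            cursors.append((i, trie))    # a new match may start at every position
            advanced = []
            for start, node in cursors:
                child = node.get(ch)
                if child is None:
                    continue             # this cursor dies
                if "$" in child:         # a complete phrase ends at position i
                    matches.append((start, i + 1, child["$"]))
                advanced.append((start, child))
            cursors = advanced
        return matches

    trie = build_trie([("new york", "CITY"), ("york", "CITY")])
    print(find_lexical_classes("flights to new york", trie))
    # -> [(11, 19, 'CITY'), (15, 19, 'CITY')]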

Another feature of the LINGVOParser is what we call partial parsing. Turning partial parsing on causes the parser to scan the input sentence and build partial parse trees wherever possible (a standard parser usually requires parsing the whole input from start to end). Partial trees do not need to cover whole sentences; they are used to localize the lexical classes described by a grammar.

During the second stage the lexical classes found in the first stage are put into a lattice. The lattice is then walked through and the sequences of lexical classes that do not overlap are created. The result of the algorithm is the set of such sequences. The algorithm uses dynamic programming to build the sequences efficiently.

The active tags used in the LINGVOParser allow the system to formalize the content of lexical classes. By formalizing we mean, for example, the transformation of date and time information into one unified format. We can also transform spoken numbers into their written forms, etc. This step is crucial for practical applications, where the same type of information is required to be expressed in the same way (Konopík & Habernal, 2009).
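A minimal sketch of the second stage follows. It enumerates all non-overlapping sequences of found lexical classes (including the empty one), memoizing suffixes with dynamic programming; in practice the sequences would additionally be scored and pruned.

    from functools import lru_cache

    def lattice_sequences(classes):
        """Enumerate all sequences of non-overlapping lexical classes.
        `classes` is a list of (start, end, label) spans found in the sentence."""
        spans = sorted(classes)                     # sort by start position

        @lru_cache(maxsize=None)
        def from_index(i):
            if i == len(spans):
                return [()]
            start, end, label = spans[i]
            result = list(from_index(i + 1))        # sequences skipping span i
            j = i + 1                               # first span not overlapping span i
            while j < len(spans) and spans[j][0] < end:
                j += 1
            result += [((start, end, label),) + rest for rest in from_index(j)]
            return result

        return from_index(0)

    print(lattice_sequences([(0, 4, "DATE"), (2, 6, "TIME"), (7, 9, "CITY")]))
    # a sequence containing both DATE and TIME is never produced (their spans overlap)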

A.5.2 Semantic Parsing

In the previous section the process of finding lexical classes was described. In this section we presume that the lexical classes are known and describe how the semantic tree is built. The structure of the tree is shown in Figure A.1. The task of the parsers described here is to create the same trees as those in the training data.

Stochastic Parser

The parser works in two modes: training and analysis. The training phase requires annotated training data (see Section A.4). During the training, the annotation trees are transformed into context-free grammar rules in the following way. Every node is transformed into one rule: the node name makes up the left side of the rule and the children of the node make up the right side (see, for example, node “Place” in Figure A.1; this node is transformed into the rule Place -> City). In this way all the nodes of the annotation tree are processed and transformed into grammar rules. Naturally, identical rules are created during this process. The rules are counted and conditional probabilities of rule transcriptions are estimated:

P(N → α | N) = Count(N → α) / Σ_γ Count(N → γ) ,   (A.1)

where N is a nonterminal and α and γ are strings of terminal and nonterminal symbols.

The analysis phase is in no way different from standard stochastic context-free parsing. The sentence is passed to the parsing algorithm; the stochastic variant of the active chart parsing algorithm (see e.g. (Allen, 1995)) is used. The lexical classes identified in the sentence are treated as terminal symbols. The words that are not members of any lexical class are ignored. The result of parsing – the parse tree – is directly in the form of the result tree we need. During parsing, the probability P(T) of the tree built so far is computed by:

P(T) = P(N → A_1 A_2 … A_k | N) · ∏_i P(T_i) ,   (A.2)

where N is the top nonterminal of the subtree T, A_i are the terminals or nonterminals to which N is transcribed, and T_i is the subtree having the nonterminal A_i on top.

When the parsing is finished, a probability is assigned to all resulting parse trees. The probability is then weighted by the prior probability of the theme and the maximum probability is chosen as the result:

T̂ = argmax_i P(S_i) · P(T_i) ,   (A.3)

where T̂ is the most likely parse tree and P(S_i) is the probability of the starting symbol of the parse tree T_i.
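The training procedure of the stochastic parser can be sketched in a few lines. The tree encoding (name, children) is our illustrative choice; the probabilities follow formula A.1.

    from collections import Counter, defaultdict

    def extract_rules(node, rules):
        """Turn every inner node of an annotation tree into one CFG rule:
        node name -> names of its children (cf. Place -> City in Figure A.1)."""
        name, children = node
        if not children:                  # a leaf produces no rule
            return
        rules[(name, tuple(child[0] for child in children))] += 1
        for child in children:
            extract_rules(child, rules)

    def estimate_probabilities(trees):
        """MLE of formula A.1: P(N -> alpha | N) = Count(N -> alpha) / sum over gamma of Count(N -> gamma)."""
        rules = Counter()
        for tree in trees:
            extract_rules(tree, rules)
        totals = defaultdict(int)
        for (lhs, _), count in rules.items():
            totals[lhs] += count
        return {rule: count / totals[rule[0]] for rule, count in rules.items()}

    trees = [("WEATHER", [("DATETIME", [("DATE", [])]), ("PLACE", [("CITY", [])])]),
             ("WEATHER", [("DATETIME", [("TIME", [])])])]
    print(estimate_probabilities(trees))
    # e.g. P(DATETIME -> DATE | DATETIME) = 0.5 and P(PLACE -> CITY | PLACE) = 1.0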

Context Parser

The context parser looks at the other words of the sentence rather than looking at lexical classes only. For this purpose it was necessary to extend both the training algorithm and the analysis algorithm.

The training phase shares the same steps with the training of the previous parser (Section A.5.2): a node is transformed into a grammar rule and the frequency of the rule occurrence is counted. However, instead of going to the next node, the context of the node is examined. Every node that is not a leaf has a subtree beneath it, and the subtree spans across some terminals. The context of the node is defined as the words before and after the span of the subtree. During training, the frequency Count(word, nonterminal) of a context word occurring with a nonterminal is counted. The probability of a context word given a nonterminal is computed via MLE as follows:

P(w | N) = (Count(w, N) + λ) / (Σ_i Count(w_i, N) + λV) ,   (A.4)

where λ is the smoothing constant, V is the estimate of the vocabulary size, w is the actual context word and w_i are all the context words of nonterminal N.

Additionally, to improve the estimate of the prior probability of the theme (the root node of the annotation), we add words to the estimate as well:

P(w | S) = (Count(w, S) + κ) / (Σ_i Count(w_i, S) + κV) ,   (A.5)

where κ is the smoothing constant, w_i are the words of the sentence and S is the theme of the sentence (the theme constitutes the starting symbol after the annotation tree transformation).

The analysis algorithm is the same as in the previous parser, but the probability from formula A.2 is reformulated to consider the context:

P(T) = Σ_i P(w_i | N) · P(N → A_1 A_2 … A_k | N) · ∏_j P(T_j) .   (A.6)

Then the best parse is selected using a context-sensitive prior probability:

T̂ = argmax_i P(S_i) · (∏_j P(w_j | S)) · P(T_i) ,   (A.7)

where S_i is the starting symbol of the parse tree T_i and w_j are the words of the analyzed sentence.
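For illustration, the smoothed context estimate of formula A.4 is straightforward to implement; the same function with κ in place of λ realizes formula A.5. The counts table and its values below are invented for the example.

    from collections import Counter

    def context_probability(counts, nonterminal, word, vocab_size, lam=0.1):
        """Formula A.4: P(w|N) = (Count(w,N) + lambda) / (sum_i Count(w_i,N) + lambda*V).
        `counts` maps a nonterminal to a Counter of its observed context words."""
        ctx = counts[nonterminal]
        return (ctx[word] + lam) / (sum(ctx.values()) + lam * vocab_size)

    counts = {"DATETIME": Counter({"arrive": 3, "depart": 2})}   # toy training counts
    print(context_probability(counts, "DATETIME", "arrive", vocab_size=1000))
    print(context_probability(counts, "DATETIME", "unseen", vocab_size=1000))  # > 0 thanks to smoothing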

Modifications for Morphologically Rich Languages

We tried to further improve the performance of the parsing algorithms by incorporating features that consider the specific properties of the Czech language. Czech is a morphologically rich language (Grepl & Karlík, 1998) and it also has a more flexible word order than, for example, English or German. To deal with these properties, lemmatization and word-order-ignoring features were incorporated into the Context Parser.

A.6 Performance Tests

A.6.1 Results on the LINGVOSemantics corpus

Figure A.3 shows the results for the LINGVOSemantics corpus (Section A.4.1). The figure shows the performance of the baseline parser (Stochastic Parser, Section A.5.2), the novel parser (Context Parser, Section A.5.2) and the modifications of the Context Parser (Section A.5.2).


Figure A.3: Results of the semantic parsers. SP = Stochastic Parser, CP = Context Parser, C+L = Context + Lemma, C+O = Context + Word Order. Confidence intervals are computed at confidence level α = 95%.

The results are measured with the accuracy and F-measure metrics. Both metrics are computed over slot-value pairs: the slot is the name of a slot together with its path in the slot hierarchy, and the value is the content of the slot. Accuracy and F-measure are standard metrics for multiple outputs (one semantic tree or semantic frame consists of several slot-value pairs); the formulas are defined in (He & Young, 2005). We use the same metrics as (He & Young, 2005) for the sake of mutual comparability of the results.
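As an illustration, the F-measure over slot-value pairs can be computed from the multiset overlap of the reference and the hypothesis. This sketch follows the usual precision/recall definitions and is not a verbatim transcription of the formulas in (He & Young, 2005).

    from collections import Counter

    def slot_value_f_measure(reference, hypothesis):
        """F-measure over (slot path, value) pairs of one or more sentences."""
        ref, hyp = Counter(reference), Counter(hypothesis)
        matched = sum((ref & hyp).values())          # pairs present in both
        precision = matched / max(sum(hyp.values()), 1)
        recall = matched / max(sum(ref.values()), 1)
        return 0.0 if precision + recall == 0 else \
               2 * precision * recall / (precision + recall)

    ref = [("FLIGHT.FROM.CITY", "boston"), ("FLIGHT.TO.CITY", "denver")]
    hyp = [("FLIGHT.FROM.CITY", "boston"), ("FLIGHT.TO.CITY", "dallas")]
    print(slot_value_f_measure(ref, hyp))            # 0.5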

A.6.2 Results on the ATIS corpus

The adaptation of the system to the ATIS corpus consisted of two steps. First, an appropriate English context-free grammar covering dates, times and numbers was created for the LINGVOParser (see Section A.5.1). Additionally, the multi-word expression identifier was re-trained using the data from the ATIS *.tab files that contain cities, airports, etc. Second, the semantic parser was trained using the training set described in Section A.4.2.

Figure A.4 shows the results on the ATIS corpus depending on the training data size. The training sentences were chosen randomly, ten times for each set size. The best result achieved on 1400 training sentences was 85.76%. When compared to (He & Young, 2005) (89.28%) or (D. Zhou & He, 2009) (91.11%), we must consider that those systems were trained on a larger training set (4978 utterances). The reasons why we used a smaller set are explained in Section A.4.2.

However, we have discovered a significant amount of inconsistencies in the ATIS test set. It contains ambiguities in the semantic representation (e.g. two identical slots for one sentence), multiple goals, and interpreted data in slots (e.g. airport names which do not appear in the sentence at all).² Thus, we think that the performance testing on this corpus is affected by the test set inconsistency and the objectivity of the evaluation is compromised.

²The examples are shown at http://liks.fav.zcu.cz/mediawiki/index.php/Interspeech_2010_Paper_Attachment


Figure A.4: System performance on the ATIS corpus with various amounts of training data. Confidence intervals are computed at confidence level α = 95%.

A.7 Conclusions

This article described the hybrid semantic parsing approach. The tests performed on the ATIS data show that we almost reached the performance of the state-of-the-art system on a reduced training data set. It is probable that by fine-tuning the system and by annotating the full training set, the system could reach the state-of-the-art performance. We, however, consider the results sufficient to prove that the hybrid approach with the context parser is worth further development.

Comparing our system with a very similar system (described in (D. Zhou & He, 2009)), it can be concluded that significant progress was made in two areas. First, the annotation methodology was improved; it is now faster and more fault-proof. Second, our system is prepared for use in practical applications thanks to data formalization. In (D. Zhou & He, 2009) the lexical classes are learned automatically, without the possibility of data formalization; we, in contrast, use a hybrid approach where data formalization is employed. In the near future we are going to publish papers on the results achieved under real conditions.

Bibliography

Allen, J. (1995). Natural language understanding (2nd ed.). Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc.

Androutsopoulos, I., Ritchie, G. D., & Thanisch, P. (1995). Natural language interfaces to databases – an introduction. Natural Language Engineering, 1, 29–81.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34–43.

Beuschel, C., Minker, W., & Bühler, D. (2005). Hidden Markov modeling for semantic analysis – on the combination of different decoding strategies. International Journal of Speech Technology, 8(3), 295–305.

Bonneau-Maynard, H., Rosset, S., Ayache, C., Kuhn, A., & Mostefa, D. (2005). Semantic annotation of the French MEDIA dialog corpus. In Interspeech 2005 – Eurospeech, 9th European conference on speech communication and technology (pp. 3457–3460). ISCA.

Broekstra, J., & Kampman, A. (2003). SeRQL: A second generation RDF query language. In SWAD-Europe workshop on semantic web storage and retrieval.

Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference (pp. 132–139). Stroudsburg, PA, USA: Association for Computational Linguistics.

Cimiano, P., Haase, P., Heizmann, J., Mantel, M., & Studer, R. (2008). Towards portable natural language interfaces to knowledge bases – the case of the ORAKEL system. Data and Knowledge Engineering, 65(2), 325–354.

Cimiano, P., & Minock, M. (2010). Natural language interfaces: What is the problem? – A data-driven quantitative analysis. In H. Horacek, E. Métais, R. Muñoz, & M. Wolska (Eds.), Natural language processing and information systems (pp. 192–206). Springer Berlin / Heidelberg.

Codd, E. F. (1974). Seven steps to rendezvous with the casual user. In IFIP working conference on data base management (pp. 179–200).

Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th annual meeting of the Association for Computational Linguistics (pp. 16–23).


Copestake, A., Flickinger, D., Pollard, C., & Sag, I. (2005). Minimal recursion semantics: An introduction. Research on Language and Computation, 3, 281–332.

Cunningham, H., et al. (2011, April). Text processing with GATE (version 6) [Computer software manual]. University of Sheffield, Department of Computer Science.

Dahl, D., Bates, M., Brown, M., Fisher, W., Hunicke-Smith, K., Pallett, D., et al. (1995). ATIS3 test data. Linguistic Data Consortium, Philadelphia.

Damljanovic, D., Agatonovic, M., & Cunningham, H. (2010). Natural language interfaces to ontologies: Combining syntactic analysis and ontology-based lookup through the user interaction. In L. Aroyo et al. (Eds.), The semantic web: Research and applications (pp. 106–120). Springer Berlin / Heidelberg.

Damljanović, D., & Bontcheva, K. (2009). Towards enhanced usability of natural language interfaces to knowledge bases. In V. Devedžić & D. Gašević (Eds.), Web 2.0 & semantic web (pp. 105–133). Springer US.

Dean, M., & Schreiber, G. (2004). OWL web ontology language reference (W3C Recommendation). W3C.

Dolamic, L., & Savoy, J. (2009). Indexing and stemming approaches for the Czech language. Information Processing & Management, 45(6), 714–720.

Fazzinga, B., & Lukasiewicz, T. (2010). Semantic search on the web. Semantic Web, 1(1–2), 89–96.

Ferrández, Ó., Izquierdo, R., Ferrández, S., & Vicedo, J. L. (2009). Addressing ontology-based question answering with collections of user queries. Information Processing & Management, 45(2), 175–188.

Frank, A., Krieger, H.-U., Xu, F., Uszkoreit, H., Crysmann, B., Jörg, B., et al. (2007). Question answering from structured knowledge sources. Journal of Applied Logic, 5(1), 20–48.

Gao, M., Liu, J., Zhong, N., Chen, F., & Liu, C. (2011). Semantic mapping from natural language questions to OWL queries. Computational Intelligence, 27(2), 280–314.

Ge, R., & Mooney, R. J. (2005, June). A statistical semantic parser that integrates syntax and semantics. In Proceedings of the ninth conference on computational natural language learning (pp. 9–16).

Grepl, M., & Karlík, P. (1998). Skladba češtiny (1st ed.). Olomouc, Czech Republic: Votobia.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199–220.

Habernal, I. (2009, May). Stochastic semantic analysis (Technical Report No. DCSE/TR-2009-04). Department of Computer Science and Engineering, University of West Bohemia. (PhD study report)

Habernal, I., & Konopík, M. (2008). Active tags for semantic analysis. In Proceedings of the 11th international conference on text, speech and dialogue (pp. 69–76). Berlin, Heidelberg: Springer-Verlag.

Habernal, I., & Konopík, M. (2009). Semantic annotation for the LingvoSemantics project. In Proceedings of the 12th international conference on text, speech and dialogue (pp. 299–306). Berlin, Heidelberg: Springer-Verlag.

Habernal, I., & Konopík, M. (2010). Hybrid semantic analysis system – ATIS data evaluation. In Proceedings of the 6th international conference on advanced data mining and applications – volume part II (pp. 376–386). Berlin, Heidelberg: Springer-Verlag.

Habernal, I., Konopík, M., & Rohlík, O. (2012, May). Question answering. In C. Jouis, I. Biskri, J.-G. Ganascia, & M. Roux (Eds.), Next generation search engines: Advanced models for information retrieval. IGI Global. (in press)

Habernal, I., & Konopík, M. (2010). On the way towards standardized semantic corpora for development of semantic analysis systems. In SEMAPRO 2010: The fourth international conference on advances in semantic processing (pp. 96–99). IARIA.

Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., et al. (2006). Prague Dependency Treebank 2.0. Linguistic Data Consortium, Philadelphia.

Hallett, C. (2006). Generic querying of relational databases using natural language generation techniques. In Proceedings of the fourth international natural language generation conference (pp. 95–102). Stroudsburg, PA, USA: Association for Computational Linguistics.

He, Y., & Young, S. (2005). Semantic processing using the hidden vector state model. Computer Speech & Language, 19(1), 85–106.

He, Y., & Young, S. (2006a, September). A clustering approach to semantic decoding. In 9th international conference on spoken language processing (Interspeech 2006 – ICSLP) (pp. 17–21). Pittsburgh, USA.

He, Y., & Young, S. (2006b). Spoken language understanding using the hidden vector state model. Speech Communication, 48(3–4), 262–275.

Hebeler, J., Fisher, M., Blace, R., Perez-Lopez, A., & Dean, M. (2009). Semantic web programming. Indianapolis, IN: Wiley.

Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., & Slocum, J. (1978, June). Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3, 105–147.

Horrocks, I., Patel-Schneider, P. F., & Harmelen, F. van. (2003). From SHIQ and RDF to OWL: the making of a web ontology language. Web Semantics, 1(1), 7–26.

Iosif, E., & Potamianos, A. (2007, August). A soft-clustering algorithm for automatic induction of semantic classes. In Interspeech-07 (pp. 1609–1612). Antwerp, Belgium.

Jeong, M., & Lee, G. G. (2008, April). Practical use of non-local features for statistical spoken language understanding. Computer Speech and Language, 22(2), 148–170.

Jurafsky, D., & Martin, J. H. (2008). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Upper Saddle River, NJ, USA: Prentice Hall.

Jurčíček, F. (2007). Statistical approach to the semantic analysis of spoken dialogues. Unpublished doctoral dissertation, University of West Bohemia, Faculty of Applied Sciences.

Kate, R. J. (2009). Learning for semantic parsing with kernels under various forms of supervision. Unpublished doctoral dissertation, Department of Computer Sciences, University of Texas at Austin.

Kate, R. J., & Mooney, R. J. (2006). Using string-kernels for learning semantic parsers. In ACL-44: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the Association for Computational Linguistics (pp. 913–920). Morristown, NJ, USA: Association for Computational Linguistics.

Kaufmann, E., & Bernstein, A. (2007). How useful are natural language interfaces to the semantic web for casual end-users? In Proceedings of the 6th international semantic web conference and 2nd Asian semantic web conference (pp. 281–294). Berlin, Heidelberg: Springer-Verlag.

Kaufmann, E., Bernstein, A., & Fischer, L. (2007, November). NLP-Reduce: A “naïve” but domain-independent natural language interface for querying ontologies. In 4th European semantic web conference (ESWC 2007) (pp. 1–2).

Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting on Association for Computational Linguistics – volume 1 (pp. 423–430). Stroudsburg, PA, USA: Association for Computational Linguistics.

Knuth, D. (1997). The art of computer programming, volume 3: Sorting and searching (2nd ed.). Addison-Wesley.

Konkol, M., & Konopík, M. (2011). Maximum entropy named entity recognition for Czech language. In Proceedings of the 14th international conference on text, speech and dialogue (pp. 203–210). Berlin, Heidelberg: Springer-Verlag.

Konopík, M. (2009). Hybrid semantic analysis. Unpublished doctoral dissertation, University of West Bohemia, Faculty of Applied Sciences.

Konopík, M., & Habernal, I. (2009). Hybrid semantic analysis. In Proceedings of the 12th international conference on text, speech and dialogue (pp. 307–314). Berlin, Heidelberg: Springer-Verlag.

Lassila, O., & McGuinness, D. (2001). The role of frame-based representation on the semantic web. In Linköping electronic articles in computer and information science (Vol. 6).

Libkin, L., & Wong, L. (1997). Query languages for bags and aggregate functions. Journal of Computer and System Sciences, 55(2), 241–272.

Liu, B. (2011). Web data mining: Exploring hyperlinks, contents, and usage data (2nd ed.). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Lopez, V., Fernández, M., Motta, E., & Stieler, N. (2011). PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 1–17.

Lopez, V., Uren, V., Motta, E., & Pasin, M. (2007). AquaLog: An ontology-driven question answering system for organizational semantic intranets. Web Semantics, 5(2), 72–105.

Maedche, A., & Staab, S. (2001, March). Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 72–79.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY, USA: Cambridge University Press.

Manola, F., & Miller, E. (Eds.). (2004). RDF primer. W3C. (http://www.w3.org/TR/rdf-primer/)

Miller, S., Bobrow, R., Ingria, R., & Schwartz, R. (1994). Hidden understanding models of natural language. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (pp. 25–32). Morristown, NJ, USA: Association for Computational Linguistics.

Mooney, R. J. (2007, February). Learning for semantic parsing. In Computational linguistics and intelligent text processing: Proceedings of the 8th international conference, CICLing 2007 (pp. 311–324). Berlin, Germany: Springer.

Mykowiecka, A., Marciniak, M., & Głowińska, K. (2008). Automatic semantic annotation of Polish dialogue corpus. In TSD ’08: Proceedings of the 11th international conference on text, speech and dialogue (pp. 625–632). Berlin, Heidelberg: Springer-Verlag.

Nguyen, L.-M., Shimazu, A., & Phan, X.-H. (2006). Semantic parsing with structured SVM ensemble classification models. In Proceedings of the COLING/ACL on main conference poster sessions (pp. 619–626). Morristown, NJ, USA: Association for Computational Linguistics.

Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

Pieraccini, R., Tzoukermann, E., Gorelov, Z., Levin, E., Lee, C.-H., & Gauvain, J.-L. (1992). Progress report on the CHRONUS system: ATIS benchmark results. In Proceedings of the workshop on speech and natural language (pp. 67–71). Stroudsburg, PA, USA: Association for Computational Linguistics.

Pla, F., Molina, A., Sanchis, E., Segarra, E., & García, F. (2001). Language understanding using two-level stochastic models with POS and semantic units. In TSD ’01: Proceedings of the 4th international conference on text, speech and dialogue (pp. 403–409). London, UK: Springer-Verlag.

Pollard, C., & Sag, I. A. (1988). Information-based syntax and semantics: Vol. 1: Fundamentals. Stanford, CA, USA: Center for the Study of Language and Information.

Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Studies in contemporary linguistics. Chicago, IL/London: The University of Chicago Press.

Popescu, A.-M., Etzioni, O., & Kautz, H. (2003). Towards a theory of natural language interfaces to databases. In Proceedings of the 8th international conference on intelligent user interfaces (pp. 149–157). New York, NY, USA: ACM.

Pradhan, S., Ward, W., Hacioglu, K., Martin, J., & Jurafsky, D. (2004). Shallow semantic parsing using support vector machines. In Proceedings of the human language technology conference / North American chapter of the Association for Computational Linguistics (HLT/NAACL). Boston, MA.

Price, P. J. (1990). Evaluation of spoken language systems: the ATIS domain. In Proceedings of the workshop on speech and natural language (pp. 91–95). Stroudsburg, PA, USA: Association for Computational Linguistics.

Prud’hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C Recommendation.

Raymond, C., & Riccardi, G. (2007, August). Generative and discriminative algorithms for spoken language understanding. In Interspeech-07 (pp. 1605–1608). Antwerp, Belgium.

Raymond, C., Rodriguez, K. J., & Riccardi, G. (2008, May). Active annotation in the LUNA Italian corpus of spontaneous dialogues. In Proceedings of the sixth international language resources and evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).

Raymond, R. G., & Mooney, J. (2006). Discriminative reranking for semantic parsing. In Proceedings of the COLING/ACL (pp. 263–270). Morristown, NJ, USA: Association for Computational Linguistics.

Ruiz-Martínez, J., Castellanos-Nieves, D., Valencia-García, R., Fernández-Breis, J., García-Sánchez, F., Vivancos-Vicente, P., et al. (2009). Accessing touristic knowledge bases through a natural language interface. In D. Richards & B.-H. Kang (Eds.), Knowledge acquisition: Approaches, algorithms and applications (Vol. 5465, pp. 147–160). Springer Berlin / Heidelberg.

Schwartz, R., Miller, S., Stallard, D., & Makhoul, J. (1997). Hidden understanding models for statistical sentence understanding. In ICASSP ’97: Proceedings of the 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP ’97) – volume 2 (p. 1479). Washington, DC, USA: IEEE Computer Society.

Schwitter, R. (2010). Creating and querying formal ontologies via controlled natural language. Applied Artificial Intelligence, 24(1–2), 149–174.

Segaran, T., Taylor, J., & Evans, C. (2009). Programming the semantic web. Cambridge, MA: O’Reilly.

Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. In NAACL ’03: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology (pp. 134–141). Morristown, NJ, USA: Association for Computational Linguistics.

Spanos, D.-E., Stavrou, P., & Mitrou, N. (2012). Bringing relational databases into the semantic web: A survey. Semantic Web Journal. (to appear)

Stratica, N., Kosseim, L., & Desai, B. C. (2005). Using semantic templates for a natural language interface to the CINDI virtual library. Data & Knowledge Engineering, 55(1), 4–19.

Švec, J., Jurčíček, F., & Müller, L. (2007). Input parameterization of the HVS semantic parser. Lecture Notes in Artificial Intelligence, 415–422.

Tang, L. R., & Mooney, R. J. (2001). Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of the 12th European conference on machine learning (pp. 466–477). London, UK: Springer-Verlag.

Thompson, C. W., Pazandak, P., & Tennant, H. R. (2005, November). Talk to your semantic web. IEEE Internet Computing, 9(6), 75–78.

Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In W. Daelemans & M. Osborne (Eds.), Proceedings of CoNLL-2003 (pp. 142–147). Edmonton, Canada.

Tur, G., Hakkani-Tur, D., & Heck, L. (2010). What is left to be understood in ATIS? In Spoken language technology workshop (SLT), 2010 IEEE (pp. 19–24).

Wang, C., Xiong, M., Zhou, Q., & Yu, Y. (2007). PANTO: A portable natural language interface to ontologies. In Proceedings of the 4th European conference on the semantic web: Research and applications (pp. 473–487). Berlin, Heidelberg: Springer-Verlag.

Wang, Y.-Y., Deng, L., & Acero, A. (2005). An introduction to statistical spoken language understanding. IEEE Signal Processing Magazine, 22(5), 16–31.

Ward, W., & Issar, S. (1994). Recent improvements in the CMU spoken language understanding system. In HLT ’94: Proceedings of the workshop on human language technology (pp. 213–216). Morristown, NJ, USA: Association for Computational Linguistics.

Warren, D. H. D., & Pereira, F. C. N. (1982, July). An efficient easily adaptable system for interpreting natural language queries. Computational Linguistics, 8(3–4), 110–122.

Weischedel, R. M. (1989). A hybrid approach to representation in the Janus natural language processor. In Proceedings of the 27th annual meeting on Association for Computational Linguistics (pp. 193–202). Stroudsburg, PA, USA: Association for Computational Linguistics.

Winkler, W. E. (1990). String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the section on survey research (American Statistical Association) (pp. 354–359).

Wong, Y. W., & Mooney, R. J. (2006). Learning for semantic parsing with statistical machine translation. In Proceedings of the main conference on human language technology conference of the North American chapter of the Association of Computational Linguistics (pp. 439–446). Morristown, NJ, USA: Association for Computational Linguistics.

Young, S. (2002). The statistical approach to the design of spoken dialogue systems (Tech. Rep.). Cambridge, UK: University of Cambridge, Department of Engineering.

Zenz, G., Zhou, X., Minack, E., Siberski, W., & Nejdl, W. (2009). From keywords to semantic queries – incremental query construction on the semantic web. Web Semantics, 7(3), 166–176.

Zhou, D., & He, Y. (2009). Discriminative training of the hidden vector state model for semantic parsing. IEEE Transactions on Knowledge and Data Engineering, 21(1), 66–77.

Zhou, G., & Su, J. (2005). Machine learning-based named entity recognition via effective integration of various evidences. Natural Language Engineering, 11(2), 189–206.

List of Published Articles

Book Chapters

• Habernal, I., Konopík, M., Rohlík, O. Question Answering. In C. Jouis, I. Biskri, J.-G. Ganascia, & M. Roux (Eds.), Next generation search engines: Advanced models for information retrieval. March, 2012. IGI Global

Journal Articles

• Habernal, I., Konopík, M. SWSNL: Semantic Search Using Natural Language. Manuscript submitted to the Journal of Web Semantics in May 2012.

Selected Conference Papers

• Habernal, I., Konopík, M. Hybrid Semantic Analysis System – ATIS Data Evaluation. In Advanced Data Mining and Applications – Part II, ADMA 2010, Chongqing, China. Berlin: Springer, 2010. pp. 376–386. ISBN: 978-3-642-17312-7

• Habernal, I., Konopík, M. On the Way towards Standardized Semantic Corpora for Development of Semantic Analysis Systems. In SEMAPRO 2010. Florence, Italy: IARIA, 2010. pp. 96–99. ISBN: 978-1-61208-000-0

• Konopík, M., Habernal, I. Hybrid Semantic Analysis. In Text, Speech and Dialogue. Berlin: Springer, 2009. pp. 307–314. ISBN: 978-3-642-04207-2

• Habernal, I., Konopík, M. Semantic Annotation for the LingvoSemantics Project. In Text, Speech and Dialogue. Berlin: Springer, 2009. pp. 299–306. ISBN: 978-3-642-04207-2

• Habernal, I., Konopík, M. Active Tags for Semantic Analysis. Lecture Notes in Artificial Intelligence, 2008, 5246, pp. 69–76. ISSN: 0302-9743

• Habernal, I., Konopík, M. JAAE: The Java Abstract Annotation Editor. In Proceedings of Interspeech 2007. Bonn: ISCA, 2007. pp. 1973–1976. ISBN: 978-1-60560-316-2