Jeffrey Pound
Total Page:16
File Type:pdf, Size:1020Kb
Interpreting and Answering Keyword Queries using Web Knowledge Bases by Jeffrey Pound A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science Waterloo, Ontario, Canada, 2013 c Jeffrey Pound 2013 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. iii Abstract Many keyword queries issued to Web search engines target information about real world entities, and interpreting these queries over Web knowledge bases can allow a search system to provide exact answers to keyword queries. Such an ability provides a useful service to end users, as their information need can be directly addressed and they need not scour textual results for the desired information. However, not all keyword queries can be addressed by even the most comprehensive knowledge base, and therefore equally important is the problem of recognizing when a reference knowledge base is not capable of modelling the keyword query's intention. This may be due to lack of coverage of the knowledge base or lack of expressiveness in the underlying query representation formalism. This thesis presents an approach to computing structured representations of keyword queries over a reference knowledge base. Keyword queries are annotated with occurrences of semantic constructs by learning a sequential labelling model from an annotated Web query log. Frequent query structures are then mined from the query log and are used along with the annotations to map keyword queries into a structured representation over the vocabulary of a reference knowledge base. The proposed approach exploits coarse linguistic structure in keyword queries, and combines it with rich structured query representations of information needs. As an intermediate representation formalism, a novel query language is proposed that blends keyword search with structured query processing over large Web knowledge bases. The formalism for structured keyword queries combines the flexibility of keyword search with the expressiveness of structures queries. A solution to the resulting disambiguation problem caused by introducing keywords as primitives in a structured query language is presented. Expressions in our proposed language are rewritten using the vocabulary of the knowledge base, and different possible rewritings are ranked based on their syntactic rela- tionship to the keywords in the query as well as their semantic coherence in the underlying knowledge base. The problem of ranking knowledge base entities returned as a query result is also explored from the perspective of personalized result ranking. User interest models based on entity types are learned from a Web search session by cross referencing clicks on URLs with known entity homepages. The user interest model is then used to effectively rerank answer lists for a given user. A methodology for evaluating entity-based search engines is also proposed and empirically evaluated. v Acknowledgements I would like to thank my supervisors Grant Weddell and Ihab Ilyas for their support and guidance, and for allowing me the independence to explore my research interests. I would like to thank the many collaborators I have worked with over the years: David Toman, Jiewen Wu, Alexander Hudek, Peter Mika, Roi Blanco, Hugo Zaragoza, Stelios Paparizos, Panayiotis Tsaparas, and others. I would also like to thank my committee members: Tamer Ozsu,¨ Ming Li, Lukasz Golab, and Nick Koudas for taking the time to read and critique my thesis. I would also like to thank the many friends I've met during my studies. It's the (more than occasional) beer and coffee with friends that keep a person sane throughout a PhD. Most of all I want to thank my family from the bottom of my heart. My parents Linda and Steve, my brother Brad; mi suegra Monica, suegro Horacia, hermano Alejandro y hermanita Macarena; my wife Sol and beautiful baby daughter Sofia. You have all always stood by me and believed in me, I would not be where I am today if it wasn't for your support. Muchas gracias. vii Dedication To my beloved wife Sol and my beautiful daughter Sofia. ix Table of Contents List of Figures xv 1 Introduction1 1.1 Search in Information Systems.........................1 1.2 Representing Knowledge............................4 1.3 Representing Query Intentions......................... 15 1.4 Thesis Overview................................. 17 2 The Query Understanding Problem 19 2.1 Formalizing The Query Understanding Problem............... 22 2.2 Solution Overview............................... 24 2.3 Baseline Approaches.............................. 28 3 Semantic and Structural Annotation of Keyword Queries 33 3.1 Query Segmentation & Semantic Annotation................. 33 3.2 Structuring Annotated Queries........................ 37 3.3 Analysis of a Web Query Log......................... 39 3.4 Evaluation of Keyword Query Understanding................. 41 3.5 Related Work.................................. 51 xi 4 Graph Queries with Keyword Predicates 57 4.1 Motivation.................................... 58 4.2 Structured Keyword Queries.......................... 60 4.3 Matching Subgraphs with Keyword Predicates................ 61 4.4 Concept Search Queries as Keyword Query Interpretations......... 72 4.5 Computing All-Pairs Similarity over Knowledge Base Items......... 75 4.6 Knowledge Graph Management........................ 76 4.7 Evaluation of Graph Queries with Keyword Predicates........... 81 4.8 Related Work.................................. 88 5 Personalized User Interest Models for Entity Ranking 93 5.1 Data & Problem Definitions.......................... 94 5.2 Personalized Ranking of Entity Queries.................... 100 5.3 Evaluation of Personalized Entity Ranking.................. 105 5.4 Related Work.................................. 112 6 A Methodology for Evaluating Entity Retrieval 115 6.1 The Ad-hoc Entity Retrieval Task....................... 115 6.2 An Empirical Study of Entity-based Queries................. 121 6.3 Related Work.................................. 129 7 Conclusions and Future Work 133 References 135 Appendices 149 A Evaluation Workloads 149 A.1 Query Understanding Workload........................ 149 A.2 TREC-QA Structured Keyword Query Encoded Workload......... 156 A.3 Synthetic Efficiency Workload......................... 161 xii B Proofs 165 B.1 Proof of Lemma1................................ 165 B.2 Proof of Theorem 1............................... 166 xiii List of Figures 1.1 An example knowledge graph..........................9 1.2 Part of an example knowledge graph extended with linguistic knowledge.. 14 2.1 Overview of the query understanding process................. 24 2.2 Detailed overview of the query understanding process............. 26 2.3 Examples of the query understanding process................. 27 2.4 Example query result subgraph for the query \john lennon birth city".... 29 2.5 Example query result subgraph for the query \jimi hendrix birthplace"... 31 3.1 Example segmentation and annotation process for the CRF-based semantic annotator..................................... 37 3.2 Distribution of most frequent part-of-speech tags among entity-based queries. 40 3.3 Most frequently occurring templates among entity-based queries....... 42 3.4 Effectiveness of keyword query annotation................... 46 3.5 Effectiveness of query interpretation...................... 47 3.6 Effectiveness of query interpretation...................... 48 3.7 Average Precision, MRR, and Recall for the query answering task...... 49 3.8 Average precision and recall vs. average run time for varying number of candidate query structures............................ 50 3.9 Average run-times for each phase of the keyword query answering process. 51 4.1 An ambiguous concept search query and the possible mappings of each key- word to a knowledge base item......................... 60 xv 4.2 An example structured keyword query disambiguation graph......... 63 4.3 A complex structured keyword query disambiguation graph......... 66 4.4 The example query disambiguation graph with zero-weight edges pruned and the correct disambiguation highlighted.................. 71 4.5 Encoding structured keyword query disambiguation as a rank-join...... 73 4.6 A knowledge graph index for the example knowledge graph......... 78 4.7 The knowledge graph concept search query evaluation procedure...... 79 4.8 An example encoding of a TREC QA task item to a structured keyword query. 83 4.9 Effectiveness of mapping structured keyword queries to concept search queries. 85 4.10 Entity retrieval effectiveness vs. number of knowledge base mappings... 87 4.11 Entity retrieval task effectiveness for all systems................ 89 4.12 Performance of structured keyword query disambiguation and evaluation.. 90 5.1 Example result ranking with and without a user's search context...... 94 5.2 An example entity database and query click log................ 97 5.3 Distribution of click frequency of entity types clicked in the query log.... 99 5.4 Distribution of entity count for entity types occurring in the entity database. 100 5.5 MRR and p@1 for the baseline ranking functions............... 106 5.6 MRR values of each ranking method vs. generality filtering threshold value. 107 5.7 Precision at rank 1 for each reranking method vs. generality filtering thresh-