
SYNTAX-BASED CONCEPT EXTRACTION FOR QUESTION ANSWERING

by

DEMETRIOS GEORGE GLINOS
B.S. Trinity College, Connecticut, 1973
M.S. University of Central Florida, 1999
J.D. Georgetown University, 1976

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida Orlando, Florida

Spring Term 2006

Major Professor: Fernando Gomez

© 2006 Demetrios George Glinos

ABSTRACT

Question answering (QA) stands squarely along the path from document retrieval to text

understanding. As an area of research interest, it serves as a proving ground where strategies for

document processing, knowledge representation, question analysis, and answer extraction may

be evaluated in real world information extraction contexts. The task is to go beyond the

representation of text documents as “bags of words” or data blobs that can be scanned for

keyword combinations and word collocations in the manner of internet search engines. Instead,

the goal is to recognize and extract the semantic content of the text, and to organize it in a

manner that supports reasoning about the concepts represented. The issue presented is how to

obtain and query such a structure without either a predefined set of concepts or a predefined set

of relationships among concepts.

This research investigates a means for acquiring from text documents both the underlying

concepts and their interrelationships. Specifically, a syntax-based formalism for representing

atomic propositions that are extracted from text documents is presented, together with a method

for constructing a network of concept nodes for indexing such logical forms based on the

discourse entities they contain. It is shown that meaningful questions can be decomposed into

Boolean combinations of question patterns using the same formalism, with free variables

representing the desired answers. It is further shown that this formalism can be used for robust

question answering using the concept network and WordNet synonym, hypernym, hyponym, and antonym relationships.

This formalism was implemented in the Semantic Extractor (SEMEX) research tool and was tested against the factoid questions from the 2005 Text Retrieval Conference (TREC), which

operated upon the AQUAINT corpus of newswire documents. After adjusting for the limitations of the tool and the document set, correct answers were found for approximately fifty percent of the questions analyzed, which compares favorably with other question answering systems.


This work is dedicated to the memory of my father, George Demetrios Glinos, from whom I

learned to love learning and to persevere in its pursuit, and to my wife Kathleen and our son

George whose love and support, and particularly their good humor, sustained me throughout this experience.

ACKNOWLEDGMENTS

I wish to express my profound thanks and appreciation to my major advisor and friend,

Dr. Fernando Gomez, for his patience and firm guidance in the direction of this research, and to

Drs. Kien A. Hua, Carlos Segami, and Annie S. Wu, for their encouragement in this endeavor and for serving on the research committee.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS/ABBREVIATIONS
CHAPTER ONE: INTRODUCTION
Central Claims
Motivation
Current Research Realities
Question Answering Systems
Information Extraction
Knowledge Representation
CHAPTER TWO: CONCEPT EXTRACTION
Why Syntax-Based?
Proposition Tuples
Syntax-Based Concept Nodes
The Concept Network
CHAPTER THREE: QUESTION ANSWERING
Question Analysis
Tuples of Interest
Question Pattern Matching
CHAPTER FOUR: TEST AND EVALUATION
The SEMEX Test Environment
Evaluation Against TREC Factoid Questions
CHAPTER FIVE: CONCLUSIONS
Concept Extraction and Question Answering
Limitations of Approach
Future Research
APPENDIX A: SEMEX SOURCE CODE
APPENDIX B: TREC 2005 QA QUESTIONS
APPENDIX C: TREC 2005 QA FACTOID ANSWERS
REFERENCES

LIST OF FIGURES

Figure 1. SEMEX Graphical User Interface
Figure 2. SEMEX Top-level Architecture

LIST OF TABLES

Table 1. Parent-Child Concept Derivations
Table 2. Question Patterns for Sample TREC 2005 Factoid Questions
Table 3. SEMEX Source Files and Functions
Table 4. Question Types for First 200 Factoid Questions from 2005 TREC QA Test Set
Table 5. SEMEX Results for First 200 Factoid Questions from 2005 TREC QA Test Set

LIST OF ACRONYMS/ABBREVIATIONS

ARDA Advanced Research and Development Activity

IR Information Retrieval

MUC Message Understanding Conference

NIST National Institute of Standards and Technology

NLP Natural Language Processing

POS Part of Speech

PT Proposition Tuple

QA Question Answering

SBCN Syntax-based Concept Node

SEMEX Semantic Extractor

TREC Text Retrieval Conference

CHAPTER ONE: INTRODUCTION

This line of research involves the unsupervised extraction of knowledge from unrestricted

text in a form that is suitable for question answering. More specifically, it involves the

automated construction of knowledge structures, which we call “proposition tuples” and “syntax-

based concept nodes”, and the specification of the means for employing them together with the

WordNet lexical database to support question answering from large text collections. This

chapter presents the background and motivation for developing such knowledge structures. It

also summarizes the principal claims set forth in this dissertation and gives an overview of its

structure.

Central Claims

Open domain question answering (QA) from text involves the retrieval of answers rather

than documents in response to questions posed of potentially large document collections. This

sub-specialty of artificial intelligence research represents an important step towards natural

language understanding and is quite different from information retrieval (IR), which is primarily concerned with retrieving entire documents from the collection in response to questions.

The principal goal of QA is to understand the information content of sentences such as this excerpt from a newswire article taken from the AQUAINT corpus:

Russia is testing a new nuclear submarine "Gepard" (Cheetah) with most advanced technologies on board for the Northern Fleet, which lost a nuclear-powered sub Kursk sinking in the Barents Sea during a naval exercise in August, Russian media reported Friday.

sufficiently to be able to extract “in August” as the answer to the question, “When did the

submarine sink?” Typically, this 41-word sentence appears in the context of a great number of

other sentences that, taken together, constitute the textual information repository against which the question is posed.

To answer this question, one could take an approach based on keyword combinations and collocations, in a manner similar to that of a typical internet search engine. In this example, one could search for the occurrence of “submarine” and “sinking”, or their morphological roots or

synonyms, within some number of words from each other, and if found, then examine a

neighborhood of words around them for words or phrases that relate to time in order to find candidate answers for “when” questions. The advantage of such an approach is that the entire document, or each of its constituent paragraphs, sentences, or clauses, could be represented and stored as a simple “bag of words” or data blob that can be tokenized into words for input to the question answering algorithm.
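A minimal sketch of such a keyword-proximity baseline is shown below; the window size and the list of time expressions are arbitrary assumptions introduced only to illustrate the idea, not part of any system described in this dissertation.

import re

# Hypothetical time expressions that a "when" question might accept as answers.
TIME_WORDS = {"august", "friday", "monday", "january"}

def proximity_answer(text, kw1, kw2, window=20):
    """Naive baseline: wherever kw1 and kw2 co-occur within `window` tokens,
    return any time-like tokens found around that co-occurrence."""
    tokens = re.findall(r"\w+", text.lower())
    pos1 = [i for i, t in enumerate(tokens) if kw1 in t]
    pos2 = [i for i, t in enumerate(tokens) if kw2 in t]
    answers = []
    for i in pos1:
        for j in pos2:
            if abs(i - j) <= window:
                lo, hi = max(0, min(i, j) - window), max(i, j) + window
                answers.extend(t for t in tokens[lo:hi] if t in TIME_WORDS)
    return answers

sentence = ("Russia is testing a new nuclear submarine \"Gepard\" (Cheetah) with most "
            "advanced technologies on board for the Northern Fleet, which lost a "
            "nuclear-powered sub Kursk sinking in the Barents Sea during a naval "
            "exercise in August, Russian media reported Friday.")
print(proximity_answer(sentence, "submarine", "sink"))   # ['august', 'friday']

Note that the baseline returns both “august” and “friday” and has no principled way to decide which one answers the question, which is precisely the weakness discussed next.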

If the above sentence comprised the entire text database, then one can see easily how the correct answer could be found using such an algorithm. However, it is not difficult to see that if the sentence structure were different, or if the sentence above were embedded in a document that contained several instances of “submarine” and “sinking” in close proximity to each other, along with a number of time words and phrases, then such an algorithm could incorrectly return “Friday” or some other phrase as the answer, or no answer at all.

The task, therefore, is to develop a method for question answering that is independent of sentence structure and word order. For this, one must go beyond the bag of words representation

of text and instead employ knowledge representation structures that reflect the invariant

semantic content of the many ways to express the same proposition. Thus, for example, “Peter

saw a movie on Friday night,” and “On Friday night, Peter saw a movie,” represent the same underlying concepts and their relationships, so the knowledge structures representing these sentences should be the same. Similarly, one should be able to extract the same underlying structure from a sentence such as “Peter, who flew to New York from London, saw a movie on

Friday night,” although additional structures would be needed to express the non-defining relative clause that Peter flew from London to New York.

Thus, the task for QA becomes one of recognizing and extracting the semantic content of the input text, and organizing it in a manner that supports reasoning about the concepts represented. The critical issue, however, is what concepts and what relationships to encode in knowledge structures.

This dissertation reports on research into the construction and use of knowledge structures for question answering. We present two types of data structures for encoding conceptual information extracted from text, which we call “proposition tuples” (PTs) and

“syntax-based concept nodes” (SBCNs), respectively, and argue that one can perform effective question answering using a network of such structures.

In particular, we present four main claims that are supported by this research.

The first claim is that we can let the input text documents themselves define both the concepts of interest and their interrelationships. We support this claim by presenting the syntax-based formalism for constructing PTs from the atomic propositions that are extracted from text, and by describing how PTs encode the n-ary relationships among the discourse entities that appear within the propositions. We further support this claim by describing the results of a test implementation in which such tuples were extracted and used successfully to answer questions.

The second claim is that the discourse entities within the PTs can be organized into a set of concept nodes that form a network of concepts based on “is-a” relationships among the discourse entities and concepts derived from them. We support this claim by describing how PTs may be organized into SBCNs, based on the discourse entities they cover, how additional concepts may be derived from concepts based on the discourse entities, how the SBCNs serve to index the PTs, and how the SBCNs are organized as a network of potentially disjoint sub-networks based on the many-to-many parent-child relationships among the concepts. This claim is further supported by the results of a test implementation, in which such networks were successfully constructed and used to answer questions.

The third claim is that questions can be expressed as Boolean combinations of PTs, with free variables representing the syntactical components for the desired answers. We support this claim by describing how questions may be decomposed into such combinations of tuples for various question types. This claim is further supported by the results of our test implementation, in which questions were successfully decomposed in the manner described, and then used successfully to answer questions.

The fourth claim is that the network of concept nodes can be used effectively to answer questions. We support this claim by describing how the network structure can be used to select the PTs of interest based on the discourse entities contained in the question, how the answer can be obtained by a pattern matching procedure operating on the question tuples and these PTs, how alternative question patterns can be derived, and how the synset, hypernym, hyponym, and antonym features of the WordNet lexical database (Fellbaum 1998) can be used to increase robustness in question answering. This claim is further supported by the results of our test

implementation, in which concept networks were constructed from text documents and questions were successfully answered.
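To make the third and fourth claims concrete, the following is a deliberately simplified sketch of the pattern-matching idea. The actual proposition tuple format, question decomposition, and WordNet-assisted matching are defined in Chapters 2 and 3; the four-field layout and the “?variable” notation used here are illustrative assumptions only.

# Illustrative only: proposition tuples are simplified here to
# (subject, verb, object, modifier), and a question pattern has the same shape
# with free variables written as "?name". The real structures are richer.

def match(pattern, tuple_):
    """Return variable bindings if the tuple satisfies the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, tuple_):
        if isinstance(p, str) and p.startswith("?"):
            bindings[p] = t            # free variable captures the answer slot
        elif p is not None and p != t:
            return None                # fixed element must match exactly
    return bindings

# "When did the submarine sink?" as a pattern with a free time variable:
question = ("submarine", "sink", None, "?when")
fact = ("submarine", "sink", None, "in August")   # an extracted proposition tuple

print(match(question, fact))   # {'?when': 'in August'}

A real question may decompose into a Boolean combination of several such patterns sharing variables, and Chapter 3 describes how WordNet synonyms, hypernyms, hyponyms, and antonyms relax the exact-match test used above.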

Motivation

In this section, we present our motivation for this line of research. First, however, we present our understanding of the current research context. This serves as a backdrop for a discussion of recent work in developing QA systems and component technologies that have motivated our claims.

Current Research Realities

Consider the sentence, “HTIUE JVER SWQL TGS CFLP.” Assuming that it is not encrypted, it is difficult to imagine either a human or a computer extracting any useful information from this sentence. However, to a computer, it is essentially no different from the sentence “Peter gave Mary the book,” which has tokens containing the same numbers of characters as the original sentence, and which is readily understood by English language speakers.

What is the difference?

The human in this example has three key aids to understanding, which the otherwise unprepared computer does not. First, the human understands the tokens “Peter”, “Mary”,

“gave”, and “book”. That is, the human has a “lexicon” containing the meaningful concepts of the language, in this case, English.

Second, the human in this example understands the grammar of the English language, so

that he or she understands the sequence of tokens to mean that the entity represented by the

concept “Peter” has transferred to the entity represented by the concept “Mary” a particular

object of type “book”. This is essentially a logical proposition relating the concepts that are

being described, that is, the discourse entities.

And finally, the human is able to retain the proposition that Peter has given Mary the

book in long-term memory, so that one can later answer the question, “What, if anything, did

Mary receive from Peter?” For humans, this long-term memory is built up incrementally, so that

if the hearer later learns that Peter has also given Mary an apple, then that information will be

added to the concept for Mary and both the book and the apple will be “known” to have been

received from Peter.

Now, if we could provide to a computer a complete lexicon of, say, English, containing all of the concepts that may be discussed in text and describing all of their interrelationships, and if we could further provide a complete grammar specification for the English language so that every well-formed sentence could be unambiguously interpreted, then question answering could be performed in a straightforward manner by parsing the question to determine the conceptual entities and relationships to search for, and then searching the lexicon of concepts for those entities and relationships.

The reality is, however, much different. While machine-readable dictionaries are now widely available, they do not contain the rich relationships needed for robust question answering.

For example, such dictionaries do not ordinarily describe that a book is something that can be given posthumously as a devise by will, or that it may have a red leather-tooled cover, but that it cannot be breathed or poured. More importantly, as a practical matter a dictionary or lexicon

cannot possibly be complete, in the sense of containing every possible concept, as it will not contain entries for each person who is born, or each thought or concept that arises from a human’s fertile imagination, any of which can nonetheless be the subject of discourse.

Although a complete lexicon is not available, useful lexical information is nevertheless available to researchers. One source for such information is the WordNet lexical database

(Fellbaum 1998), which has become a primary research tool for English language computational linguistics. WordNet does not “define” words in the traditional manner of hardcopy dictionaries.

Rather, it groups related words into “synsets” (synonym sets), where each synset represents a particular meaning or “sense” for the words contained therein. Each synset may have a “gloss” representing some useful information about the concept, although this is not generally adequate for a definition. Thus, for example, the word “give” has in WordNet one sense as a noun representing, according to its gloss, “the elasticity of something that can be stretched and returns to its original length.” The word “give” also has 44 senses in WordNet as a verb, including its primary sense of “cause to have, in the abstract sense or physical sense”, as well as senses for

“sacrificing,” for “performing,” and many other distinct meaningful concepts. Thus, “give” and

“sacrifice” are both contained in the same synset for the “sacrifice” concept, but “sacrifice” does not appear in the “performance” synset for “give”. Overall, WordNet contains over 118,000 word forms for nouns, verbs, adjectives, and adverbs, and more than 90,000 different word senses, comprising over 166,000 form-sense pairs.

WordNet contains a number of pre-defined semantic relations that apply generally to the

English language. These include: (1) synonymy, in which word senses are grouped into synsets;

(2) antonymy, in which word senses are related by the symmetric relation of being opposites; (3) hyponymy and hypernymy, representing the “is-a” relationship both downwards and upwards;

(4) meronymy and holonymy, representing the “part-of” relationship both upwards and

downwards; (5) troponymy, which is for verbs what hyponymy is for nouns, so that “madrigal”

is a particular way to “sing”; and (6) entailment, describing necessary conditions between verb

concepts, for example, “divorce” entails “marry”, since one cannot divorce if one has not

previously married. These relationships provide the basis for developing taxonomies of

concepts, as some researchers have done (see Harabagiu et al. 2004), or for adding robustness to the search for answers to questions, as we discuss in Chapter 3.
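These relations can also be inspected programmatically. The short sketch below uses NLTK's WordNet interface, which is assumed here purely for illustration (it is not the interface used by SEMEX):

from nltk.corpus import wordnet as wn   # assumes the WordNet data has been downloaded

# Senses ("synsets") of the verb "give", each with its gloss.
for sense in wn.synsets("give", pos=wn.VERB)[:3]:
    print(sense.name(), "-", sense.definition())

# Hypernyms ("is-a" upwards) for the first noun sense of "book".
print(wn.synsets("book", pos=wn.NOUN)[0].hypernyms())

# Verb entailment, e.g. "snore" entails "sleep" in WordNet.
print(wn.synset("snore.v.01").entailments())

# Antonymy is recorded on lemmas rather than on whole synsets.
print(wn.lemma("good.a.01.good").antonyms())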

The reality with respect to the language grammar is similarly less than ideal. The fact is that there is presently no complete specification of the English language that is sufficient for computational purposes. Many difficult problems remain, such as handling complex sentence structures, slang, metaphors, and prepositional attachment. Consider, for example, the sentence,

“Peter saw the man on the hill with the telescope.” Knowing nothing more than this sentence, one cannot determine who was on the hill and who had the telescope. In current research, such issues are addressed through statistical methods and discourse analysis. For example, if in the context of a paragraph describing Peter using his telescope, we encounter the sentence above,

then it would be reasonable to attach the prepositional phrase “with the telescope” to the

sentence subject noun phrase, namely, “Peter”, but the opposite would hold if the topic discussed

in the paragraph was that the man was an amateur astronomer setting up his equipment for a

session. Semantic ambiguities also arise in other situations, for example, “Time flies like an

arrow.” The sentence is obviously well-formed, but without knowing more we cannot say

whether the speaker is timing small insects or whether the speaker is describing the quick

passage of time.

Although a complete grammar is unavailable, research in the field has coalesced around

the use of part of speech taggers and partial parsers. Part of speech (POS) taggers assign to

words of text a part of speech based on corpus-based statistical training and machine learning

methods. Thus, for example, for the sentence “Peter gave Mary the book,” the Brill tagger (Brill

1994) assigns the following tags “Peter/NNP gave/VBD Mary/NNP the/DT book/NN ./.”

indicating that “Peter” and “Mary” are proper nouns, “gave” is a past tense verb, “the” is a

determiner, “book” is a singular noun, and the period is recognized as such. Current POS taggers, such as the Brill tagger, achieve approximately 95% accuracy (see Appelt and Israel 1999).
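For readers who wish to reproduce this kind of output, the snippet below uses NLTK's default tagger, which emits the same Penn Treebank tag set as the Brill tagger; the use of NLTK here is our assumption for illustration and is not part of SEMEX.

import nltk   # assumes the required NLTK tokenizer and tagger models are downloaded

tokens = nltk.word_tokenize("Peter gave Mary the book.")
print(nltk.pos_tag(tokens))
# Typically: [('Peter', 'NNP'), ('gave', 'VBD'), ('Mary', 'NNP'),
#             ('the', 'DT'), ('book', 'NN'), ('.', '.')]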

Partial parsers seek to achieve useful results by performing such parsing as can be reliably performed and at the same time by avoiding the difficult issues alluded to above. Thus,

if a clause or prepositional phrase cannot be attached, then it is left “hanging”, hence the partial

nature of the parse. For the simple sentence in our example, the Cass partial parser (Abney

1996) takes POS-tagged input and produces the following simple parse tree:

[c [c0 [nx [person [fn Peter]]] [vx [vbd gave]]] [nx [person [fn Mary]]]] [nx [dt the] [nn book]] [per .]

The parse tree above shows how the POS-tagged tokens are grouped into noun and verb phrases

(there are no prepositional phrases, relative clauses, or other syntactic constituents in this example), and how, even in this simple example, the noun phrase “the book” is not attached to the main clause, indicated by the “[c” marker and its corresponding “]” marker after “Mary”.
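A rough approximation of this behavior can be obtained with a simple chunk grammar. The sketch below is not the Cass parser; its two chunk rules are toy simplifications introduced only to show how chunks are formed without committing to attachment decisions.

import nltk

# Two toy chunk rules standing in for a partial parser: group determiners,
# adjectives, and nouns into NX chunks, verbs into VX chunks, and leave
# everything else unattached at the top level.
grammar = r"""
  NX: {<DT>?<JJ>*<NN.*>+}
  VX: {<VB.*>+}
"""
chunker = nltk.RegexpParser(grammar)
tagged = [("Peter", "NNP"), ("gave", "VBD"), ("Mary", "NNP"),
          ("the", "DT"), ("book", "NN"), (".", ".")]
print(chunker.parse(tagged))
# Roughly: (S (NX Peter/NNP) (VX gave/VBD) (NX Mary/NNP) (NX the/DT book/NN) ./.)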

POS taggers and partial parsers do make mistakes. Nevertheless, they are sufficiently reliable to serve as the basis for NLP research. For the research described in this dissertation, we employ both the Brill tagger and Abney’s Cass partial parser as “black box” transformations that form early stages in the process of transforming text into logical forms.

Other researchers employ WordNet, POS taggers, partial parsers, and other tools in ways that differ from our own. Systems representative of the state of the art are described in the subsection that follows. Further subsections describe additional enabling technologies that have served as motivations for this line of research.

Question Answering Systems

Current QA systems primarily use keywords or other attributes extracted from the question both to restrict the subset of the document collection to consider and also to extract an answer from the documents and/or passages retrieved (Voorhees 2005b). However, systems vary widely in their approaches for achieving these ends.

Thus, some systems, such as that described in (Harabagiu et al. 2004), are built around a named entity recognizer, which is used to classify the entities in a document, typically the noun phrases, according to a taxonomy of semantic categories, such as “person” and “location”, that are built from WordNet noun and verb hierarchies. In this system, the semantic category of the expected

answer type is determined from the question, typically the head of one of the question phrases,

and documents or passages are retrieved that contain the desired type.

Once the answer type is determined, keywords are extracted and passed to a document

processing module that retrieves relevant passages. Candidate passages are ranked based on the

number of occurrences of the expected answer type. The ranked passages are then passed to the

named entity recognizer, CICERO LITE, to attempt to extract the answer based on the expected

answer type. Answer extraction involves recognizing the answer phrase based on an expected

answer pattern obtained from the question. For example, the question “What is Iqra?” is matched with a question pattern for definition questions, which corresponds to an answer pattern from which the appropriate answer is obtained from the passage, “Iqra , which means ' read ' in Arabic , was ....”
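The specific patterns used by that system are not reproduced here (the bracketed pattern text did not survive in this excerpt); the sketch below shows only one plausible surface pattern of the general form suggested by the example passage, and is not the cited system's actual pattern set.

import re

def define(target, passage):
    """Illustrative definition pattern of the form "<target>, which means '<answer>'"."""
    pattern = re.escape(target) + r"\s*,\s*which means\s*'\s*([^']+?)\s*'"
    m = re.search(pattern, passage)
    return m.group(1) if m else None

print(define("Iqra", "Iqra , which means ' read ' in Arabic , was first broadcast ..."))
# -> read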

If the answer is not found by CICERO LITE, the answer is sought in the answer taxonomy by abduction using a “theorem prover.” Abduction is the ordered process of selecting the best explanation or hypothesis to explain the given facts or observed phenomena (see

Abductive Inference Page 1997). As described, the theorem prover examines WordNet glosses for the concepts being considered and provides a mechanism for substituting one concept (term) for another. The concepts in the answer taxonomy were seeded using 100 concepts and populated using the multi-level bootstrapping technique reported in (Riloff and Jones 1999). The resultant taxonomy was manually verified. The authors report that the principal function of the theorem prover is to filter out incorrect answers. They also report that of the correct answers produced by their system, 71% were produced by named entity extraction, compared with 29% for the answer taxonomy.

The algorithms that we describe differ substantially from those described in (Harabagiu et

al. 2004) in that they do not involve pre-defined taxonomies or a priori classifications. Both the

named entity recognizer and the answer taxonomy described above involved such pre-defined

classes, while the concept network that we describe is defined “on the fly” as the text is

processed for the first and only time. Moreover, the authors also report that CICERO LITE was

trained, indicating that its named entity recognition is statistically developed and corpus-based.

By contrast, the algorithms that we describe are neither statistically based nor use any training corpus, and indeed, our SEMEX implementation of these algorithms does not incorporate such features.

The system described by (Yang et al. 2004), named QUALIFIER (Question Answering by LexIcal FabrIc and External Resources), contains modules for question analysis, named entity recognition, and anaphora and co-reference resolution. The question analysis module processes

questions to determine the appropriate question class in a manually constructed ontology of 51

classes of 5 basic types: human, time, location, number, code, and object. Thereafter, the

processing is different for definition questions than for factoid and list questions. Both lines of

processing, however, are characterized by statistical methods and the use of external lexical

resources.

For definition questions, the document retrieval subsystem returns documents containing

the search target determined from question analysis. The documents are separated into sentences

and anaphora resolution is applied to resolve appropriate pronoun references with the search

target. Sentences containing the search target are retained. Word frequency statistics are

obtained from the retained sentences as well as from snippets obtained from the Web using

queries constructed from the retained sentences. The retained sentences are thereafter ranked

according to a statistical measure combining weights for the sentence derived from the input

corpus and from the Web. To eliminate redundancy in definitions, a summarization technique

employing MMR (Maximal Margin Relevance) is used to select non-redundant sentences from

the ranked list. The authors report that they also employ sentence pattern heuristics to enhance

the process by adjusting the weights for sentences that are heuristically likely to be “good” definitional sentences. For example, sentences containing “, a …..” or “, which is the ….” are considered likely to be good definitional sentences.
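The MMR criterion referred to above can be sketched generically as follows; the word-overlap similarity and the lambda weight are placeholders we assume for illustration, not the weighting actually used by QUALIFIER.

def mmr_select(sentences, query, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance: repeatedly pick the sentence that is
    relevant to the query but not redundant with sentences already chosen."""
    def sim(a, b):   # crude word-overlap (Jaccard) similarity as a placeholder
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / (len(sa | sb) or 1)

    selected, pool = [], list(sentences)
    while pool and len(selected) < k:
        best = max(pool, key=lambda s: lam * sim(s, query)
                   - (1 - lam) * max((sim(s, t) for t in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected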

For factoid questions, QUALIFIER models semantic content as QA “events” composed of different elements for time, location, object, action, etc. The QA event structure is constructed using the original query terms obtained from question analysis, obtaining a number of documents using a Web search engine, and then extracting the terms that are highly correlated to the original query terms. After WordNet is used to adjust weights and to introduce additional terms, the system computes lexical, co-occurrence, and distance correlations between terms to induce the event structure. The system uses “event mining” to generate useful association rules among the event elements. These rules are used to rank the relevant documents from the QA corpus. The answer is then extracted from the top-ranked passages using named entity recognition. Lexical features such as capitalization, punctuation, context, part-of-speech tagging and phrase chunking are used to perform named entity recognition. Anaphora resolution is performed using

Charniak’s parser to produce lists of all noun phrases in the text and all anaphors. Candidate associations are ranked and selected according to a binding algorithm that examines lexical features and assigns weights. The algorithm only considers candidate noun phrases occurring within three sentences preceding the sentence containing the anaphoric reference. List questions

are processed similarly to factoid questions, except that multiple answers are allowed and duplicate answers are detected and removed.

The algorithms that we describe differ from those described by (Yang et al. 2004) in that we do not employ statistical methods at all, nor do we utilize external lexical resources other than WordNet. Moreover, we do not restrict question analysis to a pre-defined ontology or classification hierarchy. Nevertheless, our systems are similar in that we propose to recognize discourse entities using many of the same lexical and semantic features that QUALIFIER employs for named entity recognition. However, even for discourse entity recognition, we do not propose to employ a predefined classification hierarchy.

(Litkowski 2003) describes an approach for producing a knowledge base that is queried to produce answers. The approach employs a system, DIMAP-QA, that consists of four major components: sentence splitting, parsing, discourse and sentence analysis, and question answering. Sentence splitting in this system is simply the detection of sentence boundaries. For parsing, the system employs a proprietary parser, which produces bracketed parse tree output with leaf nodes describing the part of speech and lexical entry for each sentence word.

The central component of the system is the discourse and sentence analysis module.

Each sentence is treated as an event, and as each sentence is examined, data structures are built up to support discourse analysis: discourse markers (subordinating conjunctions, relative clause boundaries, discourse punctuation), verb lists to serve as the “bearers of the event for each discourse unit”, data structures for anaphora resolution, and the event list. Noun phrases are recognized at this stage, and “semantic relation triples” are constructed. Such a triple consists of:

(1) the discourse entity itself (the noun phrase), (2) a syntactic or semantic role relation, and (3) a

“governing word” with respect to which the entity stands in the semantic relation, which is usually the main verb of the sentence or the noun or verb to which a prepositional phrase attaches. Allowable roles include “SUBJ”, “OBJ”, “TIME”, “NUM”, “ADJMOD”, and prepositions that head prepositional phrases. Thus, for the sentence, “Peter gave Mary the apple,” the semantic relation triple for the noun phrase “Peter” would be (“Peter”, SUBJ, “gave”).
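In data-structure terms, such a triple can be pictured as below; the field names are ours, chosen for illustration, and the roles of the remaining constituents of the example sentence would be assigned analogously.

from collections import namedtuple

# Sketch of the (discourse entity, role, governing word) triple described above.
Triple = namedtuple("Triple", ["entity", "role", "governor"])

triples = [
    Triple("Peter", "SUBJ", "gave"),
    Triple("the apple", "OBJ", "gave"),
]
print(triples[0])   # Triple(entity='Peter', role='SUBJ', governor='gave')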

In this processing phase, anaphors are resolved as they are encountered and an attempt is made to assign a semantic type to the head noun, as distinguished from the semantic role that is used for constructing the semantic relation triple.

The output of the discourse and sentence analysis module consists of the semantic relation triples and four lists: (1) events (discourse segments), (2) discourse entities, (3) verbs, and (4) prepositions (representing semantic relations). Each discourse entity, verb, and preposition is then tagged with attribute information. For a discourse entity, its attributes include its segment; sentence position; syntactic role (subject, object, prepositional object); syntactic characteristics (number, gender, and person); type (anaphor, definite, or indefinite); semantic type (e.g., person, location, organization); coreferent, if any; whether it includes a number; antecedent; and a tag for the type of question it may answer (e.g., who, when, where, how many, and how much). For verbs, attributes include: segment, sentence position, sub-categorization type (from a set of 30 types); arguments; base form; grammatical role when used as an adjective.

And for prepositions, attributes include: segment; semantic relation type; prepositional object; and attachment point. The author reports that such a database has proved cumbersome for tracing relations for certain classes of questions due to the flat structure of the database.

Accordingly, the author presents a second approach in which document processing results in

marking up the input text document to produce an XML-tagged document whose nested tags

encode relationships among the discourse entities.

Question answering using the document databases consists of detailed question analysis

depending on the question type; coarse filtering of database records to select relevant sentences;

extracting possible short answers from the sentences based on matching questions against

database records; assigning ranking scores based on the “key elements” of the questions; and

selecting the final answer based on the final rankings. For the XML-tagged version of the

database, a question is analyzed and converted into an XPath expression; the XPath expression is used to query the XML file; and, if necessary, the nodes returned are scored and returned to the user. For example, a question such as “What percent of Egypt's population lives in Cairo?” becomes the XPath query, “//segment[contains(.,'Cairo')]//discent[contains(.,'percent') and @tag='howmany']”, which should return all nodes containing the word “Cairo” that also contain descendant nodes that contain the word “percent”, where the descendants have an attribute “tag” whose value is “howmany.” The question may be answered from either database source; it is not necessary to answer it from both.
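The query above can be exercised against a toy XML fragment as follows. The document content is invented for illustration, the element and attribute names simply follow the query, and lxml is assumed because its XPath engine, unlike the standard library's, supports contains().

from lxml import etree   # assumed available; xml.etree's XPath subset lacks contains()

doc = etree.fromstring(
    "<doc>"
    "  <segment>About one quarter of Egypt's population lives in Cairo."
    "    <discent tag='howmany'>25 percent</discent>"
    "  </segment>"
    "</doc>"
)

query = "//segment[contains(.,'Cairo')]//discent[contains(.,'percent') and @tag='howmany']"
for node in doc.xpath(query):
    print(node.text)   # -> 25 percent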

Although the assignment of attributes to discourse entities in the DIMAP-QA system

appears similar to what we describe in this dissertation, a closer examination reveals fundamental

differences in the approaches adopted by the systems. The principal difference is that while the

DIMAP-QA system creates its semantic relation triples as representations of logical forms, it

allows only a limited number of predicates to be represented, namely, “SUBJ”, “OBJ”, “TIME”,

“NUM”, “ADJMOD”, and prepositions. Moreover, the semantic relation triples themselves have

limited expressiveness, as they relate discourse entities to only a single word by such limited

predicates. Further, the DIMAP-QA system contains data structures for verbs and prepositions

augmenting the data structures for the discourse entities and semantic relation triples, in order to capture the full range of semantic concepts and relationships that cannot be captured by discourse entity attributes alone. Finally, although details are not given in the article, it appears that question answering involves keyword and/or pattern matching between question and candidate answer without augmentation using other lexical resources.

By contrast, the logical forms that we describe are not limited in the types of predicates that are permitted. Basically, any verb can serve as a predicate. As a result, we claim to be able to construct a meaningful concept network composed solely of discourse entity nodes containing, among other features, representations of the logical forms within them. Further, the logical forms that we propose (“proposition tuples” or “PTs”) explicitly relate the various sentence constituents (subjects, verbs, objects, and modifiers) to one another. And finally, the question answering mechanism that we propose employs WordNet to expand the query space beyond the words in the question to reach antonyms, hypernyms, and other related concepts on the path towards answer discovery. Overall, we believe that the DIMAP-QA system reflects a design approach that was centered on verbs as the focus of discourse analysis, whereas the system that we propose is designed around the concepts that are to be related in the concept network.

The spectrum of system approaches may be observed from the systems described below, which, while distinguishable from the approach of our own research, serve to illustrate the considerations that researchers in this field incorporate into system and algorithm design.

In (Greenwood et al. 2003), the authors describe a system in which the input text is processed to produce a set of “quasi-logical forms” (QLFs), from which a semantic net or discourse model is created incrementally for all the entities and relationships that are represented in the QLFs. With the addition of a special semantic predicate to represent the question variable,

questions are also processed into the same QLF form. For example, the question “Who wrote

Hamlet?” produces the QLF: “qvar(e1), qattr(e1,name), person(e1), lsubj(e2,e1), write(e2), time(e2, past), aspect(e2,simple), voice(e2,active), lobj(e2,e3), name(e3,'Hamlet')”. For this system, the QLF for the question is added to the discourse model for the text and the answer is obtained by resolving the reference for the question variable.

The authors of (Gaizauskas et al. 2005) present what they term “a shallow multi-strategy approach” for question answering, consisting of a cascade of three strategies in which, if an answer is found at any stage, then succeeding stages are not executed. The first stage consists of matching surface text patterns learned from the Web using the methods described in (Greenwood and Saggion 2004). The patterns are learned in a two-step process: first, performing a web search based on sets of known answers for question forms such as “When was X born?” and extracting from the documents retrieved the declarative patterns in which the answers occur; and second, filtering out patterns of little value by performing a subsequent web search using the retrieved patterns and a second, independent set of answers for the same question type. If the first stage does not produce an answer, then semantic type extraction is attempted. In this second stage, the expected answer type is determined from a predefined taxonomy, the information retrieval engine retrieves a set of passages for this answer type, and then all entities of the expected answer type are extracted from the documents. The answer is selected to be the candidate answer with the greatest overlap with the question. The final stage, which is executed only if neither preceding stage produces an answer, represents a catch-all for question types not covered by the preceding stages. At this stage, the document set is searched for occurrences of hyponyms of a key word extracted from the question. For the question “What grapes are used in making

wine?” the hyponyms of “grape”, such as “Concord grape”, are obtained from WordNet and are used to obtain matches against the document set.

In (Ahn et al. 2005), a semantic formalism is presented in which Discourse

Representation Structures (DRSs), containing ordered pairs of discourse entities and conditions, are augmented with answer literals and are applied to both the question and the input text.

Answer extraction is obtained through Prolog unification, where discourse entities in questions are represented by Prolog variables and in the text passages by Prolog atoms. The authors also present an alternative approach in which a web query is constructed from the question phrases and additional keywords based on the question type. Sentences in the returned text snippets are processed in pairs through a Longest Common Substring (LCS) dynamic programming matrix to produce a set of relevant nodes. All possible pairs of nodes are examined and labeled with the

“EQUIVALENT”, “OCCURS_IN”, or “HAS_OCCURRENCE” relation that applies. For example, the node {Spongiform, Encephalopathy} OCCURS_IN {Transmissible, Spongiform,

Encephalopathies}, which encodes the entailment relationship of the latter concept by the former.
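The longest-common-substring matrix mentioned above is the classic dynamic-programming construction; the sketch below operates over token sequences and is a generic illustration, not the authors' implementation.

def longest_common_substring(a, b):
    """Fill the classic DP matrix and return the longest contiguous run of
    tokens shared by the two sequences."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                m[i][j] = m[i - 1][j - 1] + 1
                if m[i][j] > best_len:
                    best_len, best_end = m[i][j], i
    return a[best_end - best_len:best_end]

s1 = "transmissible spongiform encephalopathies are prion diseases".split()
s2 = "bovine spongiform encephalopathy is such a disease".split()
print(longest_common_substring(s1, s2))   # -> ['spongiform']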

A system in which different strategies were applied to different types of questions is described in (Schone et al. 2005). For definition and “how many” types of questions, a “cascade of filters approach” (CFA) was employed. These filters included: (i) measures of distance between question keywords and target numeric quantities, if any; (ii) the application of simple template match filters derived from the question; (iii) semantic rules that exclude candidate answers based on sentence component mismatches; and (iv) sentence and question trigram matching. The answers were the values that survived the filtration process. For other kinds of questions, a “knowledge-graph induction” strategy was employed. This strategy involved a deep

syntactic parse, followed by entity identification and classification. A graph builder then constructed an indexed, directed, attributed entity-relationship graph using noun phrases as entities, verb phrases as relationships, and converting quantifiers, prepositional phrases, and adjectives into attributes. Thus, for example, for the sentence “Johan Vaaler invented the paper clip 90 years ago,” there would be a node “johan_vaaler” with directed arcs to entities “johan” and “vaaler”, and a relationship “invented” connecting it to the node “paper_clip”. Questions were processed similarly to produce a set of objects of interest, which were used to identify candidate answers from the nodes in the graph. The graphs rooted at the candidate entities were then tested for the reachability of the remaining objects of interest through appropriate directed arcs. An answer was deemed found if all needed components could be reached through this graph search; otherwise, the algorithm returned the candidate whose missing components were closest by a distance measure. The authors also presented a web-based validation strategy seeking to confirm that question words occur in close proximity to proposed answers.

The systems described above demonstrate widely varying approaches to the question answering problem. The various systems represent different combinations of relative complexity in document processing, question analysis, and answer extraction. Each has its own strengths and weaknesses, as more fully described by its authors. In the chapters that follow, we present a formalism for question answering that we feel represents a good balance of complexity in these areas, and which we believe supports a simple, but elegant, method for question answering and a growth path for reasoning beyond simple question answering.

Information Extraction

Techniques and methods employed in question answering have been based in part on previous and ongoing work in information extraction in general and pattern matching in particular. One important influence on QA has been the work of Ellen Riloff, which grew out of earlier template-matching efforts in support of the Message Understanding Conferences (MUCs, the predecessors to TRECs) and has evolved over a number of iterations. This body of work is included here because the pattern matching mechanism described has become a staple in the IE and QA fields, and indeed, pattern matching is employed in the question answering algorithms that we present in this dissertation.

In (Riloff 1993), the author presents AutoSlog, a system for automatically generating a

“concept node” dictionary for use by the UMass system that performed among the best in the

MUC-3 competition. For MUC-3, the UMass system used a dictionary that was entirely crafted by hand, requiring approximately 1,500 person-hours to create. For MUC-4, the performance of this handcrafted dictionary (augmented by 76 nodes generated by AutoSlog) was compared against that of a dictionary that was generated entirely by AutoSlog, which required only 5 person-hours to create.

The knowledge extracted by the system consisted of instantiated concept nodes, which are essentially case frame templates that are triggered by specific keywords (called “conceptual anchor points” or “trigger words”), and which are activated by the specific linguistic context of the trigger words and which must satisfy specified constraints. Concept nodes have slots for the information that is extracted when all constraints are satisfied and the node is instantiated. Thus, for the terrorism domain used for MUC-4, a “kidnap” concept node would have slots for the

victim, the perpetrator, the location, and so on. While concept nodes in general may have any number of slots, the nodes generated by AutoSlog had only one slot, so that a number of different concept nodes would be instantiated. These separate concept nodes were presumably subsequently pieced together by the system to produce the instantiated answer template for the target text, although this aspect of the system was not discussed in the article.

AutoSlog employed a set of thirteen heuristics for generating proposed concept nodes.

These heuristics were in the form of linguistic patterns or templates that were instantiated when triggered by training example text-answer pairs. Thus, for example, for the text “the diplomat was kidnapped” and the answer “the diplomat”, AutoSlog would instantiate the “<subject> passive-verb” template to return a concept node in which “kidnapped” in the passive context is the conceptual anchor point. The heuristic set included four patterns in which the extracted information was the subject, six in which it was the direct object, and three in which it was the object of a preposition.
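As a rough illustration of what such a triggered pattern extracts, the regular expression below mimics the passive-verb heuristic on raw text; AutoSlog itself operates on syntactically analyzed sentences, so this is only an approximation introduced here.

import re

# Crude stand-in for the "<subject> passive-verb" heuristic: capture a simple
# noun phrase followed by a passive construction and treat the verb as the
# conceptual anchor point.
PASSIVE = re.compile(r"(?P<subject>the \w+(?: \w+)?)\s+(?:was|were)\s+(?P<trigger>\w+ed)\b")

m = PASSIVE.search("the diplomat was kidnapped by armed men")
if m:
    print(m.group("trigger"), "<-", m.group("subject"))   # kidnapped <- the diplomat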

In (Riloff and Jones 1999), the authors present a multi-level bootstrapping method for constructing both the set of domain-specific extraction patterns and a semantic lexicon, using only a set of seed words for the domain and an unannotated training corpus. Significantly, the corpus is not even pre-classified with respect to domain relevancy, as was required for previous versions. The inner bootstrapping loop proceeds as follows. First, AutoSlog is used exhaustively to generate the complete set of extraction patterns that will extract every noun phrase in the training corpus. Since this generally results in a very large set of extraction patterns, all of the extraction patterns are scored using an “RlogF” metric, which rewards both patterns that extract terms in the lexicon with high frequency and patterns that correlate highly with the terms in the lexicon. Next, all of the noun phrases that are extracted by the highest

scoring pattern are added to the lexicon, and then the re-scoring, selection, and lexicon

augmentation are repeated until some threshold conditions or loop count are met. As

bootstrapping proceeds, the set of selected extraction patterns and semantic lexicon both grow.
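The RlogF metric can be sketched as follows; this reflects our reading of the metric described above, and the exact treatment of duplicate extractions is an assumption rather than a transcription of the authors' code.

import math

def rlogf(extractions, lexicon):
    """Score an extraction pattern: F is the number of distinct extractions
    already in the semantic lexicon, N the number of distinct extractions,
    and the score is (F / N) * log2(F)."""
    unique = set(extractions)
    f = len(unique & set(lexicon))
    n = len(unique)
    return 0.0 if f == 0 else (f / n) * math.log2(f)

lexicon = {"bolivia", "guatemala", "honduras"}
print(rlogf(["bolivia", "guatemala", "the region"], lexicon))   # (2/3) * log2(2) ~ 0.67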

This method has the side effect that a few bad entries quickly infect the dictionary and

skew the extraction pattern scoring by rewarding extraction patterns that extract erroneous terms

in the lexicon. To prevent this phenomenon, an outer or meta-bootstrapping mechanism is

included. At this level, once the inner bootstrap process terminates for one cycle, only the top

five lexical entry additions are retained and added to the permanent semantic lexicon, and the

inner bootstrapping process is started afresh with this augmented lexicon. The lexical entries are

scored according to a function that rewards those entries that are extracted by the greatest

number of extraction patterns, modified slightly by the scores for such patterns. After the final

meta-bootstrapping run, the extraction patterns produced by the last iteration and the permanent

semantic lexicon at that stage are selected as the dictionaries for information extraction in the

domain. The resulting lexicon does need to be manually reviewed by a human to weed out bad

entries, but it is argued that this can be done quickly.

Tests were conducted using this multi-level bootstrapping algorithm where the meta-

bootstrapping loop was run for 50 iterations and the inner loop was run until 10 new patterns

were selected, more or less, depending on some threshold criteria that were designed to generate

a variable number of extraction patterns based on their reliability. Results against the MUC-4

terrorism database and a set of corporate web pages showed good recall and precision, and

indicated, according to the authors, the viability of this approach.

The system described in (Riloff and Jones 1999) was employed in (Harabagiu et al. 2004) to generate the answer taxonomy used in that system.

Extraction patterns of the type used in AutoSlog were also used in (Hildebrandt et al.

2004), which reported on a system that participated in TREC 2003. Although the system

operated on all the different question types (factoid, list, and definition), only the processing of

definition questions was discussed in the article. The authors presented a three-part strategy for

answering definition questions. For the first mechanism, the authors developed for the limited

definitions domain a set of eleven extraction patterns (“surface patterns”) that, when instantiated

with the target term “t”, retrieved noun phrases that corresponded to the free variable “n”. The

set of patterns employed by the authors is shown below:

NP1 be NP2   where NP1 = t, NP2 = n
NP1 become NP2   where NP1 = t, NP2 = n
NP1 v NP2   where NP1 = t, NP2 = n, v ∈ biography-verb
NP1 , NP2   where NP1 = t ∨ n, NP2 = t ∨ n
NP1 NP2   where head(NP1) ∈ occupation, NP1 = n, NP2 = t
NP1 (NP2)   where NP1 = t, NP2 = n
NP1 , (also) known as NP2   where NP1 = t ∨ n, NP2 = t ∨ n
NP1 , (also) called NP2   where NP1 = n, NP2 = t
NP1 , or NP2   where NP1 = t, NP2 = n
NP1 (such as | like) NP2   where NP1 = n, NP2 = t
NP (which | that) VP   where NP = t, VP = n

The authors “precompiled” a list of all the informational nuggets (the noun and verb phrases corresponding to “n” in the patterns above) corresponding to every entity mentioned in the

AQUAINT data set. In this manner, the authors “automatically constructed an immense relational database containing nuggets distilled from every article in the corpus.” This permitted them to answer definition questions by database lookup of the target term extracted by a simple pattern-based parser from the question.

For their second line of attack, the authors extracted noun phrases (NPs) from the

Merriam-Webster online dictionary definition for the target term that was extracted from the question. These NPs were then used with an IR engine to retrieve the top 100 documents in the

data set containing those NPs. The documents were split into separate sentences and sentences that did not contain the target term were discarded. The remaining sentences were scored and ranked, and the answers were taken to be 100-character windows around the target term. Where answers were returned by both the first and second methods, an answer merging module was employed to detect and remove redundancies in the returned responses.

The third line of attack, which was used only if no answers were found by the preceding two methods, was to do a traditional document retrieval based on the target term extracted from the question. All document sentences containing the target term were shortened if necessary and returned as the answer.

This article is a good example of the strong influence traditional document retrieval and information extraction technologies continue to exert over current question answering systems.

We argue for taking the next step, from indexing likely answer components to actually constructing a network of concept nodes containing the answer information. The key to such a network is in the knowledge representation scheme for the network. This is discussed in detail in the next section and in Chapter 2.

Knowledge Representation

In order to reason about concepts, and in particular to answer questions about the objects of discourse, one must have some data structure representing the concepts being examined and some means for representing the semantic relationships among concepts.

In (Gomez 2000), the author offers candidate constructs for representing concepts and the relations among them. The constructs are built from logical forms consisting of predicates and

their associated thematic roles. Thematic roles in this context represent not only standard semantic roles, such as ACTOR for the subject that takes the action described by the predicate, or THEME for the thing affected by the event, but also temporal and locative adjuncts. Thus, for the sentence, “Robins eat berries,” the ACTOR is “Robins”, the THEME is “berries”, and the action is, of course, “eat.”

For each such logical form, an “action structure” (or “a-structure”) is constructed for the predicate, containing a slot for each argument and the specification of the thematic role played by each such argument. Every argument is indexed as a node containing a reference to the a-structure that spawned it, and which is linked through “is-a” relationships to the other objects in the hierarchy of concepts. In the example above, concept nodes would be created for both

“Robins” and “berries.” This knowledge representation system is sufficiently rich to be able to represent “phrasal concepts”, which the author argues is necessary for representing the objects of everyday discourse. For example, consider the sentence, “Spiders that live under water breathe from air bubbles,” for which the author argues that “spiders that live under water” needs to be represented as an atomic object in memory so that subsequent information about it can be integrated on that node. In other words, the phrasal concept must be preserved as a single entity, so that its relations to other objects may be represented.
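As a minimal sketch of this style of representation (the field names and layout are ours, not those of (Gomez 2000)), the a-structure and argument nodes for “Robins eat berries” might be pictured as:

# Minimal, assumed layout: one a-structure per predicate, plus concept nodes
# for its arguments that point back to the a-structure that spawned them.
a_structure = {
    "predicate": "eat",
    "roles": {"ACTOR": "robin", "THEME": "berry"},
}

concept_nodes = {
    "robin": {"isa": ["bird"], "spawned_by": [a_structure]},
    "berry": {"isa": ["fruit"], "spawned_by": [a_structure]},
}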

The author further argues that the knowledge contained in such a representational scheme can be acquired automatically and discusses how inferencing may be performed using such structures. A method in which a corpus is used to provide feedback for the interactive development of predicates is described in (Gomez 2004a). And in (Gomez 2004b), the author describes how the WordNet noun ontology can be reorganized and modified to support better its use in determining whether a particular thematic role is instantiated for a given verb predicate.

The key to the overall scheme, however, is the requirement for a semantic interpreter that assigns

thematic roles to parsed sentences.

For our own research, we have adopted the notion of maintaining knowledge representation structures for the logical form, as indeed did (Litkowski 2002). In our case, however, while the predicates correspond to the verbs encountered in the text being examined (as they did for (Gomez 2000)), the arguments of the predicate do not correspond to the thematic roles required by the verbs (predicates). Instead, they consist of the syntactic roles played by the discourse entities entailed in the proposition. In this manner, we relax the requirements on semantic interpretation so that less than the full rigor needed to assign thematic roles is required. As a consequence, our system can operate at the syntactic level, using part of speech tagging, partial parsing, and our own post-processing without requiring any pre-defined lexicon or hand-crafting of action structures for particular predicates. Our approach also differs in that the knowledge structures representing the discourse entities are not only related to one another through “is-a” relationships, but also through the predicates encapsulated in the proposition tuples in which they appear. They are also related to each other and additional concepts through links to WordNet, whose structure encodes additional relations, such as synonymy and antonymy, described previously.

Other structures have also been proposed for encapsulating textual knowledge. In (Montes-y-Gomez et al. 2000), for example, the authors describe the use of conceptual graphs to represent document content for information retrieval and text mining applications. In the study reported, the authors constructed conceptual graphs for the titles of a database of 512 scientific articles and compared the graphs for similarity. As described, a conceptual graph consists of a network of two types of nodes: concept nodes and relation nodes, where concept nodes represent entities, attributes, or events (actions), and relation nodes represent the (binary) relationships between pairs of concept nodes. A semantic analyzer is used to construct a conceptual graph from the tagged and syntactically parsed input text. Thus, the sentence “John loves Mary,” is represented as the conceptual graph “[John] ← (subj) ← [love] → (obj) → [Mary]”, where the concept nodes are in brackets and the relation nodes are parenthesized. In this formulation, the relations must be predefined, and indeed the authors report that only a limited set was employed, although there were plans to include domain-specific semantic relations.

This contrasts with the approach taken in our own line of research, in which we avoid the requirement to predefine the available relations by letting the input text itself define the relations from the predicates of the atomic propositions that are derived from them. Moreover, the concept network that we describe consists of only one type of node, based on a discourse entity. And further, the proposition tuples that we propose encode the n-ary relations among the syntactical components of such propositions, not just binary relations.

CHAPTER TWO: CONCEPT EXTRACTION

Our approach for concept extraction from text was guided by a number of motivating constraints. Chief among them was that the formalism should be based on features of the document set that may be computed easily on a single pass. A closely aligned motivation was that we should let the document set itself establish the relations among the entities of interest. We also desired an approach that would be robust with respect to word choice.

In this chapter, we present our proposed knowledge representation structures, the “proposition tuple” (PT) and the “syntax-based concept node” (SBCN). In the sections that follow, we present first our motivation for using syntax rather than semantics as the basis for our concept representations. Then, in the succeeding sections, we define the PT to represent the logical forms obtained from input text and the SBCN as a structure to organize and index a set of PTs to represent knowledge about a single concept or phrasal concept. Finally, we discuss how the set of SBCNs operates as a network of concepts and how we extend its power by linking SBCNs to WordNet noun hierarchy entries.

Why Syntax-Based?

Ideally, we would like to base our knowledge structures on the direct representation of the semantic content of text. An example from (Allen 1995) is this collection of sentences:

John broke the window with the hammer.
The hammer broke the window.
The window broke.

In each of these cases, John, the hammer, and the window play the same semantic roles, namely: John is the actor, the window is the object of the action (breaking), and the hammer is the instrument used in the action.

As more fully described in (Allen 1995), semantic roles may include: CAUSAL-AGENT (the object that caused the event), INSTRUMENT (the force or tool used in causing the event), THEME (the thing affected by the event), EXPERIENCER (the person involved in perception or a physical/psychological state), BENEFICIARY (the person for whom an act is done), AT-LOC (the current location), AT-POSS (the current possessor), TO-LOC (the final location), PATH (the path over which something travels), CO-AGENT, and CO-THEME, among others. Thus, for the sentence, “The ball rolled down the hill to the water,” the roles are: THEME (the ball), PATH (down the hill), and TO-LOC (to the water), where the action is one of “rolling” (Allen 1995). Similarly, for the sentence, “Jack is tall”, the action is “is tall” and Jack is the THEME, but there is neither PATH nor TO-LOC nor any other role present in this sentence (Allen 1995).

From these simple examples, it is evident that the set of permissible roles and their syntactic realizations is different for each class of verbs. Moreover, one must have a lexicon or a comprehensive ontology to instantiate a structure based on such semantic roles. Thus, in (Gomez 2001), where it is argued that the over 11,500 WordNet verb classes (synsets) (see Fellbaum 1998) can be coalesced into a much smaller number of abstract semantic predicates, the differentiators between an abstract predicate and its included subpredicates depend on the existence of such lexical knowledge. For example, for the abstract semantic predicate change-of-location-by-animate, the instrument of the subpredicate drive-a-vehicle is always a vehicle, but for the generic abstract predicate it can be an animate, an animate body part, or something similar (Gomez 2001). Thus, to know whether or not the candidate instrument “Chevy” is a “vehicle”, a lexicon or ontology is needed. Indeed, (Gomez 2001) describes how it is necessary to reorganize WordNet verb classes and extend WordNet’s noun ontology to support abstract predicates, as well as to define individual predicates for each abstract predicate.

It is argued here that these actions present a daunting task made necessary by the deep knowledge required for semantic-based knowledge representation. The fact that such actions are required indicates strongly that the state of the art in tools and lexical resources is not presently sufficiently mature to support the deep semantic-based interpretation of text.

We avoid these difficulties by taking a different approach.

The approach we have adopted is to infer semantics from syntax. The relationship between semantics and syntax is well known. Indeed, (Gomez 2001) cites several authorities to support the view that the syntax of many verbs is determined by their meaning. Although we recognize that the relationship between semantics and syntax is not simple, we approach the problem as one of determining the structure of what is inside a “black box” by studying its outputs. This is an accepted practice in engineering, one that we argue may be used to advantage in the present context.

Consider the sentence, “Peter threw the ball,” and suppose that we are given that “Peter” is the subject, “threw” is the verb, and “the ball” is the direct object. What, then, do we know? Without any lexicon or ontology to define any of the terms for us, we “know” the following things: (1) “threw” is some kind of action; (2) “Peter” is someone or something capable of throwing; and (3) “the ball” is something that can be thrown. Moreover, if we can store this knowledge in some appropriate structure, we can answer questions such as “Who or what threw?”, “Who or what threw the ball?”, “What was thrown?”, and “What did Peter throw?”. We believe that this is quite a store of knowledge garnered from just four words, none of which is defined.

What made this possible, in our view, is that: (1) we were provided with the syntactic roles (subject, verb, and direct object in this example) for the lexical objects, and (2) the relationships among these objects made sense to us because we assumed that the input text was a well-formed sentence. Given these two things, we take the relationships among the lexical objects as they are presented in the sentence, and infer from them corresponding relationships among the concepts in the “black box” that are represented by those lexical objects. Further, we believe that the well-formed nature of input sentences is a safe assumption for most text of interest, particularly in this day and age of grammar and spell-checkers, although we recognize that there is much written and spoken text that is “ungrammatical” and that grammar and spell-checkers have their own faults and limitations.

In our view, two things make this approach interesting. First, we believe that the current state of the art in tools and resources supports the reliable determination of syntactic roles for much unrestricted text. And second, we believe that the task of organizing syntactic role assignments into knowledge structures is a task that can be fully automated without requiring either a predefined lexicon or a training corpus. In the remaining sections of this chapter, we describe our proposed knowledge structures in detail and how they may be organized into a network. In Chapter 3, we describe how the nodes and the network may be used for question answering. And in Chapter 4, we report test results from an implementation of these concepts in a tool that we have developed to test our theories.

Proposition Tuples

With the preceding as a guide, we propose the notion of a proposition tuple (PT) as the basic unit of knowledge that is mined for question answering.

The PT is designed to accommodate the necessary syntactical components of the basic clause (sentence) patterns for the English language. As described by (Kies 2004), there are seven basic clause patterns, which are defined by their constituents among (coincidentally) seven components: (1) subject (“S”); (2) verb (“V”); (3) indirect object (“IO”); (4) direct object (“DO”); (5) subject complement (“SC”); (6) object complement (“OC”); and (7) adverbial complement (“AC”). This classification scheme ignores adverbials (“A”), which are considered “optional” in that they could easily be eliminated from a sentence and the resultant sentence would still make sense. Taking an example from (Allen 1995), we consider the sentence, “Jack ate the pizza by the door.” If we eliminate the prepositional phrase “by the door”, which serves as an adverbial, we are left with “Jack ate the pizza,” which is just a more general assertion about the same situation. The same prepositional phrase, however, serves as an adverbial complement in the sentence, “Jack put the box by the door,” since the sentence does not make semantic sense without it. For our purposes, we treat all adverbials as if they are adverbial complements, since we operate without a lexicon and therefore do not have the basis for making the semantic judgment necessary to distinguish the two situations.

Based on this list of components, then, we define the PT as a logical form consisting of an atomic predicate and its arguments. For a simple sentence, the main verb may be taken as the predicate, with the remaining syntactical components as its arguments. Thus, for “Peter threw the ball,” we have the predicate threw(Peter,ball). This is not generally the case for more complex sentences. Consider, for example, the following sentence from the AQUAINT corpus used in TREC: “Tourism is one of the major industries in Port Arthur, a town at the southern tip of the island.” The implementation described in Chapter 4 successfully splits off the apposition, producing two simpler sentences: “Port Arthur is a town at the southern tip of the island,” and “Tourism is one of the major industries in Port Arthur.” In this form, each of these simpler sentences can now be expressed as a logical proposition from which the predicate and its arguments may be easily extracted. Similar splitting may be performed for sentences containing subordinate clauses, non-defining relative clauses, coordinations, and similar syntactical constructs.
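As an illustration of this splitting step, the following Java fragment shows how a sentence-final apposition of the kind in the Port Arthur example might be split off into a separate copular proposition. It is a rough sketch only, under the assumption of a simple surface pattern; the class, method, and regular expression are ours and hypothetical, and the SEMEX implementation in Appendix A operates on chunked parse trees rather than on raw strings.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppositionSplitDemo {
    // Very rough surface pattern: a capitalized noun phrase followed by ", a/an ...",
    // ending at a comma or period, is treated as "NP, apposition". Real splitting also
    // produces the remaining simplified sentence, which is omitted here.
    private static final Pattern APPOSITION =
            Pattern.compile("([A-Z]\\w+(?: [A-Z]\\w+)*), (an? [^,.]+)[,.]");

    // Returns the extra copular proposition implied by the apposition, or null if none is found.
    static String splitApposition(String sentence) {
        Matcher m = APPOSITION.matcher(sentence);
        return m.find() ? m.group(1) + " is " + m.group(2) + "." : null;
    }

    public static void main(String[] args) {
        String s = "Tourism is one of the major industries in Port Arthur, "
                + "a town at the southern tip of the island.";
        // Prints: Port Arthur is a town at the southern tip of the island.
        System.out.println(splitApposition(s));
    }
}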

Given that the input text can be decomposed into distinct propositions through sentence splitting as defined above, we define the proposition tuple for a given proposition as the 6-tuple

<subject, verb, gerinf, modifiers, indirect, direct>

where “subject” refers to the noun phrase representing the subject of the sentence, “verb” refers to the main verb phrase, “gerinf” represents any gerund or infinitive form, “modifiers” refers to adverbials and adverbial complements, typically prepositional phrases, “indirect” refers to the indirect object, if any, and “direct” refers to the direct object, if any.
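By way of illustration, the PT can be realized directly as a simple container class. The sketch below is ours and is intended only to make the tuple structure concrete; it does not reproduce the KstTuple class in the SEMEX source of Appendix A.

import java.util.List;

// Minimal sketch of a proposition tuple; null fields play the role of the "__" placeholders.
public final class PropositionTuple {
    final String subject;          // noun phrase acting as subject
    final String verb;             // main verb phrase, taken as the predicate
    final String gerinf;           // gerund or infinitive form, if any
    final List<String> modifiers;  // adverbials and adverbial complements
    final String indirect;         // indirect object, if any
    final String direct;           // direct object, if any

    PropositionTuple(String subject, String verb, String gerinf,
                     List<String> modifiers, String indirect, String direct) {
        this.subject = subject;
        this.verb = verb;
        this.gerinf = gerinf;
        this.modifiers = modifiers == null ? List.of() : modifiers;
        this.indirect = indirect;
        this.direct = direct;
    }

    public static void main(String[] args) {
        // "Peter threw the ball" gives the predicate threw(Peter, the ball).
        PropositionTuple pt =
                new PropositionTuple("Peter", "threw", null, null, null, "the ball");
        System.out.println(pt.verb + "(" + pt.subject + ", " + pt.direct + ")");
    }
}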

To facilitate question answering, our PT definition varies in minor respects from Kies’s list of syntactical roles. In particular, we separate gerunds and infinitives from the main verb component so that we can differentiate the propositions extracted from, for example, “Peter likes Mary,” and “Peter likes to eat fish.” While both “Mary” and “to eat fish” are appropriate answers to the question, “Who or what does Peter like?”, only the latter is an answer to the question, “What does Peter like to eat?” We also choose to encode subject complements (e.g., “The mind is complex.”) as modifiers, since their processing for pattern matching is the same, but we encode object complements (e.g., “We consider her the best.”) as separable components of the direct object.

We observe also that proposition tuples capture the n-ary relationships among all of the non-null syntactical components of the input proposition. By capturing all of the syntactical elements in the same logical form, the PT thereby encodes the binary relationships between subject and verb, object and verb, subject and object, and between any of the modifiers and other sentence components or even other modifiers. It also encodes higher order relationships, such as a ternary relationship between a given subject, object, and verb, and so on. As an encoding mechanism, the PT thus represents an efficient means of representing the set of all such relationships.

As an example, consider the following sentence, which is also taken from the AQUAINT corpus: “As a professor at the University of Chicago in the early 1940s, Fermi designed and built the first nuclear reactor that later was put into use for research into nuclear weapons.”

There are two propositions contained in this sentence, so that there are, correspondingly, two proposition tuples, the first of which is:

Proposition tuple #1:
  subject:   Fermi
  verb:      designed
  gerinf:    null
  modifiers: as a professor; at the University of Chicago; in the early 1940s
  indirect:  null
  direct:    the first nuclear reactor that later was put into use for research into nuclear weapons

where the second proposition tuple is identical to the first, except that the verb is “built”. The three verb modifiers in this example, all of which are prepositional phrases, are separated by semicolons in the modifier field, indicating that they are separate and distinct objects. This supports question answering using any combination of these modifiers, from none to all, in any order. Also noteworthy is the capture of an entire phrasal concept as the direct object. Phrasal concepts form the basis of the concept network discussed in the next subsection.

We also note that the logical form represented by a PT differs from the logical form that others may extract from the same sentence. Consider again the sentence “Jack is tall”, from which (Allen 1995) extracts the logical form is-tall(Jack). The corresponding PT for this sentence is <'Jack','is',__,'tall',__,__>, which corresponds to the logical form is(Jack,__,tall,__,__).

The important difference here is that for the PT the subject complement (“tall”) is retained as a separate lexical object and does not become part of the logical predicate. We do this for two reasons. First, where the subject complement is a noun phrase, as in “Elizabeth is Queen of England”, we can take advantage of the symmetric qualities of the definitional “is” predicate to construct is(Queen-of-England,__,Elizabeth,__,__) from is(Elizabeth,__,Queen-of-England,__,__), and vice versa, where needed during question answering. Second, the verb “is” (a form of “to be”) is a copular verb, which indicates that this is a definitional proposition. The copular verbs include “seems”, “appears”, “looks”, “sounds” and other verbs of perception, one of which must be present to have a subject complement (see Kies 2004).

Syntax-Based Concept Nodes

When we consider what is involved in answering a question, the first thing that comes to mind is “What is the question about?” In other words, we wish to know the question’s subject or target. The target is most often a noun phrase describing a person, place, or thing, as in “Who is Alberto Tomba?” But it can also be an abstract object, as in “What does ‘consanguinity’ mean?” However, whether tangible or intangible, abstract or concrete, the target is always a concept. Thus, “Peter”, “birds that live in the Antarctic”, and “trigonometry” are all concepts that can be the subject of discourse, and therefore, the targets of questions.

Examining questions such as these, we observe that they generally ask us to describe the characteristics of concepts and the relationships of concepts to one another. Accordingly, it would be useful to have a knowledge structure that contains all of the information that we possess concerning a single concept. If we had such a structure, we would be able to examine that structure to answer any question about the concept it represents.

The question arises, therefore: what should such a structure contain?

Given that a proposition tuple represents the interrelationships among the non-null syntactical components of the associated proposition, the tuple is relevant for answering questions that pertain to any of the discourse entities contained in any of these components. Accordingly, as an indexing mechanism, we wish to create “concept nodes” for all such entities and associate the tuple with each such node. For these purposes, a discourse entity is taken to be each noun phrase contained in a subject, indirect object, and direct object, and the noun phrase objects of prepositional modifiers. Direct quotes, subordinate clauses, and other phrasal concepts are not further decomposed at this stage. For example, given the sentence “Peter enjoyed reading the book that Mary recommended,” we generate the simple proposition tuple <'Peter','enjoyed','reading',__,__,'the book that Mary recommended'>, from which we create concept nodes for 'Peter', 'Mary', and 'the book that Mary recommended'.

We also wish to construct the concept set to support complex question answering.

Consider, for example, a document containing the sentences:

The Russian submarine Kursk sank in the Barents Sea.
The Russian submarine Kursk sank in deep water.

If the question were “Where did the submarine sink?” both “in the Barents Sea” and “in deep water” could arguably be acceptable answers. However, if the question were “In what sea did the submarine sink?” then only the first would be acceptable. Since the propositions represented by these sentences are syntactically identical, we distinguish them by postulating a mechanism by which we recognize “the Barents Sea” as an instance of the “sea” category contained in the question.

The method for distinguishing these cases is to encode all “is-a” relationships as explicit parent-child relationships among the corresponding concept nodes. Thus, if the document also contained a sentence stating, essentially, that the Barents Sea is a sea, this would be sufficient.

However, we observe anecdotally that neither ordinary conversation nor newswire text typically contains such explicit categorizations. Therefore, as a corollary to the rule above, we postulate the need for deriving parent-child node relationships from individual concept nodes so that, in the example above, “the Barents Sea” is a “sea” because the lowercase head noun “sea” can be found to be a common noun, of which “the Barents Sea” is therefore an instance.

This feature also permits us to accommodate “phrasal concepts”, which occur commonly as objects of discourse whose relationships to other concepts are essential to understanding (see Gomez 2000). For example, “birds that live in the Antarctic” is a phrasal concept for the reason that it stands for a single, atomic discourse object in the sentence, “Birds that live in the Antarctic are endangered by factory fishing.” We can discuss, for example, how such birds survive in Antarctic conditions, as well as their prospects for survival. Moreover, we can establish the “is-a” relationship upwards to the concept for “birds” by analyzing the syntactic form of the phrase to find the head noun (“birds”) and the defining relative clause. But the propositions that involve this phrasal concept also say something about birds in general, for example, that one kind of bird is a bird that lives in the Antarctic. Thus, a structure that captures the “is-a” relationship between these two concepts permits one to reason about the relationship between them.

Table 1 lists a number of parent-child derivations that we have identified and which the system described in Chapter 4 has implemented. The arrows in the “Example” column of the table run from the child concept to the parent concept. These derivations are repeatedly applied to a concept until it cannot be decomposed further. Thus, for the concept “space shuttle Discovery”, application of the last derivation followed by the one above it produces the links that establish that “space shuttle Discovery” is-a “space shuttle” is-a “shuttle”, which has the beneficial effect of rooting a set of concepts in a common noun form which, in this case, can be found in WordNet. This permits use of word synonyms when answering queries concerning the nodes so rooted, as more fully described in Chapter 3. No doubt additional derivations may be found; however, the set presented serves to illustrate the mechanism.

Table 1. Parent-Child Concept Derivations

Derivation                              Example

Common noun-proper noun                 “space shuttle Atlantis” --> “space shuttle”
Common noun-common noun                 “oil tanker” --> “tanker”
Proper noun with preposition            “King of England” --> “king”
Common noun parent of proper            “Nobel Prize” --> “prize”
Proper noun-common noun                 “Baldwin piano” --> “Baldwin”, “piano”
Adjective-common noun                   “Russian submarine” --> “submarine”
Multiple proper names                   “Peter and Paul” --> “Peter”, “Paul”
NP coordination                         “fish and chips” --> “fish”, “chips”
NP following preposition                “jar of beans” --> “jar”
NP following adjective/adverb           “fast cars” --> “cars”
NP following prep in proper noun        “Nobel Prize for Physics” --> “Nobel Prize”
Possessive form                         “Mary's car” --> “car”
Business entity suffix                  “IBM Corp.” --> “IBM”
Comma-separated location                “Normandy, France” --> “France”
Concept prefixed by ordinal             “49th pageant” --> “pageant”
Title before proper name                “Miss Lara Dutta” --> “Lara Dutta”
Cardinal number prefix                  “five coins” --> “coins”
Concept begins with dollar sign         “$200 Million” --> “dollar”
Subordinate clause without “that”       “the book Mary read” --> “book”
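To make the repeated application of such derivations concrete, the following sketch derives the parent chain for “space shuttle Discovery”. It is our own simplification and implements only the two common-noun rows of Table 1; the class and method names are hypothetical and do not correspond to the SEMEX source.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParentDerivationDemo {
    // Repeatedly derive a parent concept until the concept cannot be decomposed further.
    // Two derivations are illustrated: drop a trailing proper noun
    // ("space shuttle Discovery" --> "space shuttle") and drop a leading modifier noun
    // ("space shuttle" --> "shuttle").
    static List<String> deriveParents(String concept) {
        List<String> parents = new ArrayList<>();
        String current = concept;
        while (true) {
            String[] words = current.split(" ");
            if (words.length < 2) break;                 // a single head noun: stop
            String last = words[words.length - 1];
            String parent = Character.isUpperCase(last.charAt(0))
                    ? String.join(" ", Arrays.copyOfRange(words, 0, words.length - 1))
                    : String.join(" ", Arrays.copyOfRange(words, 1, words.length));
            parents.add(parent);
            current = parent;
        }
        return parents;
    }

    public static void main(String[] args) {
        // Prints: [space shuttle, shuttle]
        System.out.println(deriveParents("space shuttle Discovery"));
    }
}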

Thus, we define a “syntax-based concept node” (SBCN) to be the 4-tuple:

<name, {parents}, {children}, {tuples}>

where “name” refers to the noun phrase for the discourse entity for which the node is constructed, “{parents}” and “{children}” refer to the (possibly empty) sets of parent and children nodes for the concept, and “{tuples}” refers to the tuples in which the concept (discourse entity) appears.
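A minimal sketch of the SBCN as a data structure follows. The class below is ours, not the KstConcept class of Appendix A, and the tuples are shown as plain strings for brevity.

import java.util.ArrayList;
import java.util.List;

public class SyntaxBasedConceptNode {
    final String name;                                      // noun phrase for the discourse entity
    final List<SyntaxBasedConceptNode> parents = new ArrayList<>();
    final List<SyntaxBasedConceptNode> children = new ArrayList<>();
    final List<String> tuples = new ArrayList<>();          // proposition tuples mentioning this concept

    SyntaxBasedConceptNode(String name) { this.name = name; }

    // Record an "is-a" link in both directions.
    void addParent(SyntaxBasedConceptNode parent) {
        parents.add(parent);
        parent.children.add(this);
    }

    public static void main(String[] args) {
        SyntaxBasedConceptNode sea = new SyntaxBasedConceptNode("sea");
        SyntaxBasedConceptNode barents = new SyntaxBasedConceptNode("the Barents Sea");
        barents.addParent(sea);
        barents.tuples.add("sank(the Russian submarine Kursk, __, in the Barents Sea, __, __)");
        System.out.println(sea.children.get(0).name);       // the Barents Sea
    }
}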

We characterize this knowledge structure as a concept node because it encapsulates the knowledge for a single object of discourse, that is, a “concept.” And we also describe this structure as “syntax-based” because the heart of the structure is the embedded set of proposition tuples, which are themselves syntax-based. We believe that this structure fully takes into account phrasal concepts and possesses an elegant instantiation mechanism that supports discourse analysis and provides the entry point into the relationship between the knowledge base and WordNet, which we employ as an essential part of the reasoning mechanism.

The Concept Network

From the definition of SBCNs in the previous section, we observe that a node may have more than one parent. For example, a node for “John Hancock” may have parent links to both “insurance company” and “revolutionary war hero.” Similarly, a node may have more than one child. For example, the concept “submarine” may have children “Russian submarine” and “attack submarine”. Thus, in general, the set of concept nodes is organized into a network.

However, we observe also that some nodes or sets of nodes may be isolated. For example, if the text database contained only one sentence involving a “Russian submarine” and the sentence does not involve any discourse entities that appear in any other sentences (for example, “The Russian submarine sank.”), then “Russian submarine” and its derived parent “submarine” would be connected to each other, but they would be isolated from the rest of the network. Therefore, SBCNs constructed in the manner described are organized into a network of one or more disjoint subnetworks, each containing one or more related concept nodes.

To facilitate question answering, when concepts are added to the network, “nicknames” are generated for proper nouns so that only one node is created for each proper noun concept, and each of its nicknames is mapped as an alias to the concept. If a node for any of the other nicknames already exists in the network, a new node is not created, and instead, the affected tuples are added to the existing node. Thus, for example, an attempt to create a new node for the concept “John Kennedy” may encounter the previously existing concept node for “John F. Kennedy”, in which case the new proposition tuples would be added to the existing node. The concept addition logic is so constructed that for such proper noun associations, the longer concept is maintained as the concept node. If a shorter concept is encountered first, then when the related longer concept phrase is added to the network, it subsumes the shorter one by adjusting all necessary parent and child links and adding an alias for the shorter one linking it to the master concept node. In this manner, the concept network is constructed to contain only one instance of each proper noun concept.
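The lookup-or-create step can be sketched as follows. This is a deliberately simplified illustration under the assumption of a flat alias map: the names are ours, substring containment stands in for SEMEX's nickname generation, and the real logic also adjusts parent and child links.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class ConceptAliasDemo {
    // Maps a concept name or nickname to the canonical node name it belongs to.
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    // Adds a proper-noun concept, keeping the longer phrase as the master node and
    // mapping the shorter phrase to it as an alias.
    String addProperNoun(String name) {
        String existing = aliasToCanonical.get(name);
        if (existing != null) return existing;                  // already known, possibly via an alias
        for (String canonical : new ArrayList<>(aliasToCanonical.values())) {
            if (canonical.contains(name)) {                      // e.g. "Kennedy" added after "John F. Kennedy"
                aliasToCanonical.put(name, canonical);           // shorter phrase becomes an alias
                return canonical;
            }
            if (name.contains(canonical)) {                      // a longer phrase subsumes the existing node
                aliasToCanonical.put(canonical, name);
                aliasToCanonical.put(name, name);
                return name;
            }
        }
        aliasToCanonical.put(name, name);                        // brand-new concept node
        return name;
    }

    public static void main(String[] args) {
        ConceptAliasDemo net = new ConceptAliasDemo();
        System.out.println(net.addProperNoun("John F. Kennedy"));  // John F. Kennedy
        System.out.println(net.addProperNoun("Kennedy"));           // aliased to John F. Kennedy
    }
}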

The concept network is also linked to WordNet by a simple mechanism. If the concept name for a node appears in WordNet, then the analysis of “is-a” relationships may proceed from within the network and into WordNet by examination of the hypernym tree for the concept name. Thus, for example, the network may contain a concept node for “Russian submarine” and its derived concept “submarine”, and since the latter appears in WordNet, where “vehicle” is one of its hypernyms, one would be able to answer a question that relies on knowing that a Russian submarine is a vehicle. This linkage would exist for most concepts that are constructed for common nouns, for many concepts whose derivations produce common nouns, and for such proper noun concepts as already appear in WordNet, such as “Nobel Prize”.

It is significant that the concept network contains nodes only for concepts based on noun phrases and phrasal concepts, such as “berries” and “mammals that live in the seas”. This approach is quite different from structures centered on verbs as the lexical objects around which thematic roles or slots are instantiated. In the network that we propose, the verbs appear as predicates in the proposition tuples that are indexed by the concept nodes.


CHAPTER THREE: QUESTION ANSWERING

Question answering is a three-step process in which: (a) the question is analyzed to produce question pattern tuples; (b) the proposition tuples of interest are retrieved from the concept network; and (c) the tuples retrieved are then examined to search for candidate answers. These steps are discussed separately in the sections below.

Question Analysis

We decompose questions into proposition tuples in the same manner as for the document set, with the addition of free variables for the desired answer. These variables may take the form of a general directive, such as “*who”, “*what”, “*when”, “*where”, and “*why”, or a target preposition type, such as “*in”, when the answer is expected to be modified by the preposition.

We also define the answer variable “*ans” to serve as a referent to a candidate answer obtained in response to a previous tuple, so that we can construct question patterns as Boolean combinations of separate tuple patterns.

For simple questions, a single tuple pattern may suffice. For example, for the question “What did Peter eat?” the corresponding question tuple is <'Peter','did eat',_,_,_,'*what'>, where we have indicated the free variable with a leading asterisk and null values with underscores for clarity of display.

More complex questions require the Boolean conjunction of question pattern tuples. The question “What kind of car does Peter drive?” decomposes into the conjunction:

<'Peter','does drive',_,_,_,'*what'>
<'*ans','is',_,_,_,'car'>

Peter may drive his mother crazy or a nail with a hammer, but neither is a “car”, so neither will be returned as an answer, because the second pattern would not match in such circumstances.

In a similar fashion, some questions require disjunctive combinations of question pattern tuples to accommodate alternative syntactic realizations of the expected answer. The question “What is the title of the book?” decomposes into:

<'*what','is',_,_,_,'the title of the book'>
<'the title of the book','is',_,_,_,'*what'>

Still other questions may require both disjunctive and conjunctive patterns. For example, the question “What kind of book did Peter give to Mary?” decomposes into:

( <'Peter','did give',_,'to Mary',_,'*what'>
  OR <'Peter','did give',_,_,'Mary','*what'> )
AND <'*ans','is',_,_,_,'book'>

where the disjunction is needed to find answers where the document base contains sentences such as “Peter gave Mary the novel.” and “Peter gave the novel to Mary.”

The final step in question analysis involves creating passive construction question patterns for active construction patterns, and vice versa, as well as constructing possessive form question patterns from question patterns in which subjects or direct objects include the preposition “of.” Any patterns produced at this step are added disjunctively to the existing Boolean combination of question tuples.
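The Boolean structure of a decomposed question can be represented quite directly. The sketch below is our own simplification: each pattern is reduced to a predicate over a six-element tuple, and the is-a test for “*ans” is stubbed out rather than checked against the concept network; the names are hypothetical.

import java.util.List;
import java.util.function.Predicate;

public class QuestionPatternDemo {
    // A question pattern is reduced here to a predicate over the six tuple
    // components [S, V, GI, M, IO, DO]; real patterns also carry the free variables.
    interface QPattern extends Predicate<String[]> {}

    static QPattern and(List<QPattern> ps) { return t -> ps.stream().allMatch(p -> p.test(t)); }
    static QPattern or(List<QPattern> ps)  { return t -> ps.stream().anyMatch(p -> p.test(t)); }

    public static void main(String[] args) {
        // "What kind of car does Peter drive?" ~ drive pattern AND an is-a-car check (stubbed).
        QPattern drives = t -> "Peter".equals(t[0]) && t[1] != null && t[1].contains("drive");
        QPattern isCar  = t -> t[5] != null && t[5].endsWith("car");
        QPattern question = and(List.of(drives, isCar));

        String[] tuple = {"Peter", "drives", null, null, null, "a sports car"};
        System.out.println(question.test(tuple));   // true
    }
}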

Table 2 lists sample factoid questions from the TREC 2005 QA test set, together with the question patterns derived from them by the SEMEX tool, which is discussed in Chapter 4. These serve to illustrate the types of patterns that may be generated for typical factoid question forms and the ability of the proposition tuple format to express propositions. The reader should note that the question pattern shown in the table is taken directly from the SEMEX graphical user interface and includes all components of each pattern, with null values shown as empty strings.

Table 2. Question Patterns for Sample TREC 2005 Factoid Questions.

No. Question / Question Pattern

66.1 When did the submarine sink? [S:the submarine][V:did sink][GI:][M:*when][IO:][DO:]

66.2 Who was the on-board commander of the submarine? [S:*who][V:was][GI:][M:][IO:] [DO:submarine on-board commander] [S:submarine on-board commander][V:was][GI:][M:][IO:][DO:*who] [S:the on-board commander of the submarine][V:was][GI:][M:][IO:][DO:*who] [S:*who][V:was][GI:][M:][IO:][DO:the on-board commander of the submarine]

66.4 How many crewmen were lost in the disaster? [S:*crewmen][V:were lost][GI:][M:in the disaster][IO:][DO:] [S:][V:lost][GI:][M:in the disaster][IO:][DO:*crewmen]

66.6 In what sea did the submarine sink? [S:the submarine][V:did sink][GI:][M:*in][IO:][DO:] [S:*ans][V:is][GI:][M:][IO:][DO:sea]

67.1 Who won the crown? [S:*who][V:won][GI:][M:][IO:][DO:the crown] [S:*who][V:was crown][GI:][M:][IO:][DO:]

67.2 What country did the winner represent? [S:the winner][V:did represent][GI:][M:][IO:][DO:*what] [S:*ans][V:is][GI:][M:][IO:][DO:country]

67.4 Where was the contest held? [S:the contest][V:was held][GI:][M:*where][IO:][DO:] [S:the contest][V:was][GI:][M:*where][IO:][DO:]

68.1 Where is Port Arthur? [S:Port Arthur][V:is][GI:][M:*where][IO:][DO:]

69.5 At what stadium was the game played? [S:the game][V:was played][GI:][M:*at][IO:][DO:] [S:*ans][V:is][GI:][M:][IO:][DO:stadium]

71.1 What type of plane is an F16? [S:an F16][V:is][GI:][M:][IO:][DO:*what] [S:*what][V:is][GI:][M:][IO:][DO:an F16] [S:*ans][V:is][GI:][M:][IO:][DO:plane]

87.2 When did Enrico Fermi die? [S:Enrico Fermi][V:did die][GI:][M:*when][IO:][DO:]

87.3 What Nobel Prize was Fermi awarded in 1938? [S:Fermi][V:was awarded][GI:][M:in 1938][IO:][DO:*what] [S:*ans][V:is][GI:][M:][IO:][DO:Nobel Prize]

87.5 What is Enrico Fermi most known for? [S:Enrico Fermi][V:is known][GI:][M:most;*for][IO:][DO:]

85.5 Why did the Grand Cayman turn away a NCL ship? [S:the Grand Cayman][V:did turn away][GI:][M:*why][IO:][DO:a NCL ship]

90.3 Which Virginia vineyard produces the most wine? [S:*which][V:produces][GI:][M:][IO:][DO:the most wine] [S:the most wine][V:produces][GI:][M:*by][IO:][DO:] [S:*ans][V:is][GI:][M:][IO:][DO:Virginia vineyard]

We observe that typical factoid and list type questions may be represented by a simpler grammar than input text in general, so that their decomposition may be achieved through relatively simple finite state automata that parse part-of-speech tagged questions into various phrase types (e.g., noun, verb, prepositional, infinitive). However, this may change as the boundaries of research in this area are extended to more difficult question formats. Nevertheless, independent of the means of decomposition, the question patterns so decomposed are processed against the concept database as described in the next section.

Tuples of Interest

The concept network is, by construction, indexed by the concept names it includes, so it is not necessary to conduct a blind search for proposition tuples that involve any of the discourse entities that are of interest in answering a particular question. In this manner, the network supports incremental growth as the body of knowledge is acquired, and there will not be a significant performance penalty for such growth, although storage requirements will, of course, increase.

The tuples of interest are retrieved as follows. First, noun phrases are extracted from the various components of the question tuple or tuples. This set is augmented with the WordNet hyponyms for each head noun in the set, and by the inclusion of the question target and its synonyms. Next, for each such noun phrase, the concept network is checked for the presence of a concept node by that phrase, or the presence of a node to which such phrase is mapped as an alias. If a concept node is found, then all tuples associated with that node are included in the return set. The child nodes of the concept node are also examined recursively, and any new tuples found are also added to the return set. Once all noun phrases are checked against the hierarchy, the resultant tuples are returned as the set of tuples of interest. By construction, every tuple in this set contains one or more of the phrases of interest, and there should not be any tuples in the concept network that contain one of the desired phrases which are not included in this set.
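The retrieval step can be sketched as a simple recursive walk over the concept index. The code below uses our own names and omits the WordNet hyponym and synonym expansion described above; it is illustrative only.

import java.util.*;

public class TupleRetrievalDemo {
    // A node holds the tuples (shown as strings) in which its concept appears,
    // plus links to its child nodes.
    static class Node {
        final List<String> tuples = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
    }

    // Collect the tuples of interest for the question's noun phrases by looking each
    // phrase up in the concept index and recursing over the child nodes.
    static Set<String> tuplesOfInterest(Map<String, Node> index, Collection<String> phrases) {
        Set<String> result = new LinkedHashSet<>();
        for (String phrase : phrases) {
            Node node = index.get(phrase);
            if (node != null) collect(node, result);
        }
        return result;
    }

    private static void collect(Node node, Set<String> result) {
        result.addAll(node.tuples);
        for (Node child : node.children) collect(child, result);
    }

    public static void main(String[] args) {
        Node sea = new Node();
        Node barents = new Node();
        barents.tuples.add("sank(the Russian submarine Kursk, __, in the Barents Sea, __, __)");
        sea.children.add(barents);
        Map<String, Node> index = Map.of("sea", sea, "the Barents Sea", barents);
        System.out.println(tuplesOfInterest(index, List.of("sea")));
    }
}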

Question Pattern Matching

Question pattern matching proceeds by a straightforward unification algorithm in which the entire Boolean combination of question tuples is applied to each tuple until an answer is found, for a factoid question, or until all tuples are examined, for a list question.

The examination of a single proposition tuple proceeds by checking its tuple components against the non-null components of the question pattern. For each such component that does not involve a free (answer) variable, a matching algorithm is executed. If any such component fails to match, the proposition tuple is rejected and examination proceeds to the next tuple in line. For subjects and direct objects, an examination set is created consisting of all system aliases for a proper noun phrase, if any, otherwise the common noun phrase itself, augmented with the WordNet hyponyms for the head noun and by all aliases for the target if the phrase contains a member of the target synset. For example, if the target is “Russian submarine Kursk” and the proposition tuple contains the word “submarine”, then all target aliases, including the word “Kursk”, are included, so that a match will be found against a question pattern in which “Kursk” is specified. And where the question inquires about a “ceremony”, the proposition tuple will be checked for matches against the hyponyms of that term, including, for example, “pageant.”

For verbs, the main verb from the proposition tuple is extracted and WordNet methods are used to obtain all verb roots. Each root is then compared against the WordNet root of the main verb in the question tuple and a match is found if any root of one is found in any synset of the WordNet hypernym expansion of the other. This algorithm will find a match between the verbs “give” and “transfer”, for example, since “transfer” appears in the hypernym tree for “give.”
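A sketch of this verb-matching rule follows. SEMEX obtains roots and hypernyms through the Java WordNet Library; since we do not reproduce that API here, the WordNet access is abstracted behind a hypothetical interface and stubbed out in the example.

import java.util.Set;

public class VerbMatchDemo {
    // Hypothetical facade over a WordNet API; the two methods below are assumptions
    // for illustration, not real library calls.
    interface WordNetVerbs {
        Set<String> roots(String verbPhrase);            // morphological roots of the main verb
        Set<String> hypernymClosure(String verbRoot);    // all words in the hypernym expansion
    }

    // Two verbs match if any root of one appears in the hypernym expansion of the other.
    static boolean verbsMatch(WordNetVerbs wn, String tupleVerb, String questionVerb) {
        for (String r1 : wn.roots(tupleVerb)) {
            for (String r2 : wn.roots(questionVerb)) {
                if (wn.hypernymClosure(r1).contains(r2) || wn.hypernymClosure(r2).contains(r1)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy stub in which "transfer" appears in the hypernym expansion of "give".
        WordNetVerbs stub = new WordNetVerbs() {
            public Set<String> roots(String v) { return Set.of(v.replaceAll("s$", "")); }
            public Set<String> hypernymClosure(String r) {
                return r.equals("give") ? Set.of("give", "transfer") : Set.of(r);
            }
        };
        System.out.println(verbsMatch(stub, "gives", "transfer"));   // true
    }
}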

Gerunds and infinitives are matched in the same manner as main verbs, except that the infinitive “to do” in the question pattern is considered to match any infinitive that is present in the proposition tuple.

Indirect objects receive special treatment, since a nominal indirect is semantically equivalent to a similarly structured sentence in which the same noun phrase occurs as the object of a “to” or “for” preposition. Accordingly, the proposition tuple's nominal indirect is checked, if any, and if none, then its modifiers commencing with “to” or “for”, if any, are examined.

Modifiers also receive special treatment since there can be more than one in either the question pattern or the proposition tuple. Where the question pattern contains more than one modifier, a match is recorded only if all modifiers are matched independently. However, so long as all question modifiers are matched, it does not matter that a proposition tuple may have additional, unmatched modifiers, since these are, by construction, irrelevant to the question. We seek to achieve robustness in matching through examinations of WordNet synonyms for a noun phrase head, whenever possible, which obtains for most common nouns.
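A sketch of this all-question-modifiers-must-match rule follows (our own names; the per-modifier test is passed in as a parameter so that a synonym-aware comparison could be plugged in; exact equality is used in the example).

import java.util.List;
import java.util.function.BiPredicate;

public class ModifierMatchDemo {
    // Every question modifier must match some tuple modifier; extra, unmatched
    // tuple modifiers are ignored.
    static boolean modifiersMatch(List<String> questionMods, List<String> tupleMods,
                                  BiPredicate<String, String> matches) {
        for (String qm : questionMods) {
            if (tupleMods.stream().noneMatch(tm -> matches.test(qm, tm))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> question = List.of("in the early 1940s");
        List<String> tuple =
                List.of("as a professor", "at the University of Chicago", "in the early 1940s");
        System.out.println(modifiersMatch(question, tuple, String::equals));   // true
    }
}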

Where the question pattern component contains a target answer variable, an answer retrieval algorithm is executed according to the component type. For subjects and direct objects, the algorithm returns the corresponding component of the proposition tuple if the expected answer type is confirmed through WordNet. For example, a location type is confirmed if “structure” or “location” appears in the head noun's hypernym tree. Similarly, verbs and other tuple components are retrieved after appropriate type checks by algorithms that parallel their corresponding match methods.

If a match is not obtained for a given proposition tuple, then the reciprocal predicate is checked if the subject and direct object are both nonempty, there is no gerund/infinitive, and the main verb is not copular, not the verb “do”, and possesses an antonym in WordNet. This construction increases recall by supporting a match for the question “What did Mary receive?” where the database contains, for example, “Peter gave Mary the book.” Once the reciprocal tuple is formed, it is matched in the same manner described above for the direct pattern.

Different processing is necessary for question patterns that seek to confirm “is-a” relationships for previously found candidate answers. Where, for example, we seek to confirm whether a candidate answer for the location of the sinking of the submarine Kursk is in fact an instance of the “sea” class, the parents of the candidate answer are searched recursively to the head of their equivalence hierarchy for the desired class. If none is found, then the WordNet hypernym tree for each root of the hierarchy is checked. We note that there can be more than one such root, since a concept node can have more than one parent. If the desired class name is found in any location check, then the candidate concept is deemed a member of the class and a match is recorded.
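This check can be sketched as a search of the parent links followed by a WordNet fallback on the hierarchy roots. The code below is illustrative only; the names are ours and the hypernym lookup is abstracted as a function rather than a real WordNet call.

import java.util.*;
import java.util.function.Function;

public class IsaCheckDemo {
    // Returns true if the concept is an instance of the target class, either through
    // the network's parent links or through the WordNet hypernyms of a hierarchy root.
    static boolean isInstanceOf(String concept, String targetClass,
                                Map<String, List<String>> parents,
                                Function<String, Set<String>> hypernyms) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> roots = new ArrayList<>();
        toVisit.push(concept);
        while (!toVisit.isEmpty()) {
            String c = toVisit.pop();
            if (!seen.add(c)) continue;
            if (c.equalsIgnoreCase(targetClass)) return true;   // found within the network
            List<String> ps = parents.getOrDefault(c, List.of());
            if (ps.isEmpty()) roots.add(c);                      // a root of the equivalence hierarchy
            ps.forEach(toVisit::push);
        }
        // Fall back to WordNet: check the hypernym tree of each root.
        return roots.stream().anyMatch(r -> hypernyms.apply(r).contains(targetClass));
    }

    public static void main(String[] args) {
        Map<String, List<String>> parents = Map.of("the Barents Sea", List.of("sea"));
        Function<String, Set<String>> hypernyms =
                w -> w.equals("sea") ? Set.of("body of water", "object") : Set.of();
        System.out.println(isInstanceOf("the Barents Sea", "sea", parents, hypernyms));   // true
    }
}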

CHAPTER FOUR: TEST AND EVALUATION

The SEMEX Test Environment

Our SEMEX (SEMantic EXtractor) tool is a test bed environment for evaluating and refining semantic extraction and question answering algorithms. SEMEX provides the graphical user interface shown in Figure 1 for viewing the intermediate results at key stages of the knowledge extraction process.


Figure 1. SEMEX Graphical User Interface

Figure 2 shows the top-level architecture of the SEMEX question answering tool. The tool creates a concept network by processing the input text through a cascade of modules for: (1) part of speech tagging; (2) partial parsing; (3) chunking; (4) sentence decomposition; (5) resolution; and (6) concept extraction.

Figure 2. SEMEX Top-level Architecture

Prior to tagging, SEMEX removes spurious HTML escape codes, newswire datelines, paragraph tags, and similar cleanup items. Punctuation is also separated; dates, numbers and proper nouns are aggregated; and initials are flagged so the tagger does not interpret them as sentence boundaries. Tagging is performed by the Brill tagger (Brill 1994), whose output is corrected as needed.

Parsing is performed using the Cass partial parser (Abney 1996). SEMEX then applies its own comprehensive set of empirically derived heuristics to build up phrases at the chunking stage. The resultant parse trees are then simplified and reorganized in the sentence decomposition stage to create separate propositions for the clauses, appositions and coordinations of complex sentences, a process based on extending the sentence splitting heuristic method described in (Glinos 1999). Syntactic roles are then assigned to the proposition components and pronoun references are resolved. And finally, proposition tuples are created from the resolved propositions, and the network of concepts is built from the discourse entities contained in the resolved propositions, as described in Chapter 2.

SEMEX performs question analysis as described above, utilizing separate finite state automata for each question type implemented to take advantage of the simpler grammar of typical factoid questions, thereby bypassing the need for parsing the tagged text. Instead, the tagged text is chunked manually into noun, verb, prepositional, adverbial, adjectival, infinitive, and gerund phrases, and the question patterns are built up incrementally from a single pass through the tagged question text. Currently, SEMEX implements the following question types: (i) who; (ii) what; (iii) what kind/type; (iv) when; (v) where; (vi) how; (vii) why; and (viii) which.

To extract the answer, SEMEX performs a unification of the question patterns with the relevant logical forms retrieved from the concept network, and WordNet (Fellbaum 1998) is used in the unification process to improve recall, as outlined in Chapter 3.

The complete source code for SEMEX, consisting of over 48,000 lines of Java code, plus some C code, is contained in Appendix A to this dissertation. The files comprising this software package, and their basic functions, are shown in Table 3. To obtain a running configuration, one must also obtain separately and configure the Brill tagger, the Cass (Abney’s) partial parser, WordNet 2.0, and the Java WordNet Library. The tagger code must also be modified very slightly to write its output to a file, in addition to the screen, for use by downstream SEMEX modules.

Table 3. SEMEX Source Files and Functions

Source File              Principal Functions
semex.java               SEMEX graphical user interface
semexJNI.java, .c, .h    Java Native Interface code for integrating with the Brill tagger and the Cass partial parser
KstAnswer.java           question analysis and answering
KstChunk.java            classes to represent the chunks obtained from tagged text
KstConcept.java          concept node class methods
KstDictionary.java       interface to WordNet API
KstDiscourse.java        classes to resolve anaphora and assign syntactic roles
KstExtract.java          classes to extract proposition tuples from resolved forms
KstFile.java             TREC data file access methods
KstGroup.java            classes for chunking and grouping tagged text
KstSplit.java            classes to split chunked sentences into simpler forms
KstTuple.java            proposition tuple class methods
KstUtil.java             tagger correction and utility methods

Evaluation Against TREC Factoid Questions

The annual Text Retrieval Conferences (TRECs), co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense Advanced Research and Development Activity (ARDA), have been a focal point for leading edge question answering systems since 1999, when TREC first introduced its question answering track (Voorhees 2005b). Many of the key researchers in the field have participated in TREC QA track evaluations.

The 2005 TREC QA Track continued the practice of recent years in using the AQUAINT Corpus of English News Text as the document set for evaluation. The AQUAINT corpus was produced by the Linguistics Data Consortium for the Advanced Question Answering for Intelligence (AQUAINT) program. This data set consists of 1,033,461 newswire articles from the English portion of the Xinhua News Service (People’s Republic of China) (covering 1996-2000), the New York Times News Service (covering 1998-2000), and the Associated Press Worldstream News Service (covering 1998-2000). The articles comprise roughly 375 million words that correspond to approximately 3GB of data. The articles, which are in different formats for each data source, were selected to be representative of real world data, and so contain their full SGML markup and have not been processed to correct for typographical errors, misspellings, grammar mistakes, and extraneous characters. The newswire articles for a single day are contained in three files, one for each service, with multiple articles in each file.

The TREC 2005 QA question set consisted of 530 questions, of which 362 were “factoid” questions calling for a single item as a response; 93 questions were “list” questions calling for answers containing multiple items; and 75 were “other” questions, calling for information “nuggets” not contained in previous answers concerning the same target. The questions were organized into 75 groups, each related to a target concept, which was provided in the question set. The question set was provided by NIST as an XML-tagged file. Appendix B contains a version of that file with markup removed and formatted for clarity of display.

SEMEX was configured to exercise the TREC 2005 QA questions against the top-50 documents for each target returned by NIST's generic IR engine from the AQUAINT newswire collection. Since SEMEX does not possess an IR component, this was a necessity, although it was not guaranteed that the answers, if any, would be contained within the top-50 document lists.

Moreover, SEMEX did not implement all of the question types covered by the test set, for example, question 66.3, which asked, “The submarine was part of which Russian fleet?” Nor did SEMEX make use of the dates, titles, and datelines that were separately marked up in the documents, as these were considered TREC-specific features that are irrelevant to an evaluation of the formalism described in this thesis.

For these reasons, SEMEX output for the first 200 factoid questions was analyzed in detail and scored manually using the TREC-furnished factoid answer patterns. Appendix C contains the factoid answer patterns for the 2005 question set. The answer set reflects that factoid questions 77.5, 78.3, 81.6, 83.3, 90.3, 91.6, 110.3, 111.5, 125.7, and 126.5 did not have any known answers in the AQUAINT corpus, so that “NIL” was the correct response for those questions. Since the first two hundred factoid questions in the test set were used for analysis, regardless of type or content, the results reported reflect an objective test of the algorithms.

The mix of question types for the first 200 factoid questions in the test set is shown in Table 4. The type “Prep” refers to questions that begin with a preposition, for example, “In what sea did the submarine sink?” The type “Other” refers to questions that do not begin with keywords for any of the other categories, for example, “The submarine was part of which Russian fleet?” SEMEX implemented many variations for all question types except for the “Other” category, but not all variations were covered, as discovered during analysis of the test results. SEMEX did not implement any forms for the “Other” category. In both cases, this was due to limitations in development time for the tool, not limitations of the formalism.

Table 4. Question Types for First 200 Factoid Questions from 2005 TREC QA Test Set

Question Type Occurrences

Who               33
What              66
What kind/type     2
Where             22
When              18
Why                1
How               35
Which              4
Prep              11
Other              8

To eliminate the limitations of the tool and the document set, and so to obtain a more accurate appraisal of algorithm performance, the raw results were analyzed as follows.

First, a question was disregarded if: (a) its answer was not in the top-50 documents; (b) the document containing the answer was not readable by SEMEX; (c) the answer was in a direct quote; (d) the answer was in the dateline or headline; (e) the question form was not one of the types implemented in SEMEX; or (f) the answer key provided a wrong answer, or no answer, to the question, where NIL was not the correct answer. All of these were considered “non-starter” situations where the algorithms did not have the opportunity to perform.

Categories “a”, “e”, and “f” are self-explanatory. With respect to “b”, there was generally poor parsing performance against the newswire articles that comprised the document set, as they were characterized by numerous spurious HTML codes and meta-data within the text, such as identification of sources, sports scores, bulletized lists, instructions in the form of incomplete sentences, and by the flowery and run-on sentences that characterize much news reporting. Moreover, the current SEMEX implementation could not read some documents at all, particularly those with multiple embedded datelines and sports scores, and some of these documents were the ones that contained the answers.

With respect to “c”, there were a few instances where the answer could be found only in a direct quote. However, the current SEMEX implementation treats such content as hearsay in the legal sense and does not attempt to construct proposition tuples from the contents of directly quoted material, although the fact that the statement is made is extracted. For example, if the text reported “Peter said, ‘Paul is a thief,’” SEMEX would extract the proposition that Peter said those words about Paul, but not that Paul is a thief.

The most interesting exclusion is “d”. In TREC, the participants are encouraged to use the SGML markup in their information retrieval and question answering. Thus, if a newswire article reported that something occurred “today”, then in the TREC context, it is appropriate to ask when the event occurred and to expect the date of the article to be returned as the answer. But the algorithms described in this dissertation and implemented in SEMEX are intended to be fully general and not tied to date-tagged documents. For our application, using headlines and datelines would be a “crutch” that would artificially enhance the algorithms’ ability to extract concepts and relationships from the text, which is their basis.

Conversely, we considered an answer correct if the answer made sense in the context of the document in which it was found, even though the answer was not what was specified in the TREC-furnished answer key. For example, for Question 66.1, the answer in the key is “(Aug(.|ust)? 12,? )?2000”, while the answer found by SEMEX was “Aug. 12”. We nevertheless accepted the answer as correct because the year did not appear in the document, except in the date of the newswire article.

We also accepted a correct answer if manual corrections to the chunked output in the SEMEX GUI resulted in the correct answer being found by SEMEX. The rationale is that this is clearly a shortcoming of the parsing front end of SEMEX, not of the concept extraction. These situations occurred primarily in the recognition of nested subordinate and relative clauses, and infinitive phrases. An answer was also considered to have been found successfully if only minor manual corrections to the parsed output of the document containing the answer or a substitution of a term in the question was sufficient to allow SEMEX to find the correct answer, as these situations were considered shortcomings of document and question parsing, and not indicative of algorithm functional performance. Similarly, an answer was accepted if it could be found by SEMEX by substituting a word or two in the question, usually referring to the target, as this was considered a deficiency in anaphora resolution of the question or a poorly worded question. For example, for the target “Miss Universe 2000 crowned”, question 67.4 asked, “Where was the contest held?” SEMEX did not find the answer for the question in this form, but when “contest” was replaced with “ceremony”, the answer was found because in WordNet a “pageant”, such as the 49th Miss Universe Pageant, is not a “contest”, but the hypernym tree for “pageant” contains the word “ceremony.”

Importantly, however, questions whose answers were not found by SEMEX and which required reasoning over the concept set or more than minor syntactical parsing adjustments or question substitutions were retained as failures, as these indicated areas for further inquiry.

Table 5 presents SEMEX performance against the first 200 factoid questions analyzed in line with the considerations above. Giving SEMEX credit for the answers found with reparses and substitutions, and discarding the fifth through eleventh categories in the table for the reasons cited above, SEMEX achieved a raw score of 20.59% for 28 correct answers and an adjusted score of 50.74% for 69 correct answers out of the 136 non-discarded questions. Both of these scores compare favorably with the performance of other systems. For TREC 2004, (Voorhees 2005b) reports a correct answer accuracy range of 21.3% to 77.0% against the factoid questions for the top ten performing systems, with only the top 3 systems achieving factoid accuracy scores greater than 35%. With sixty-three runs submitted to the QA track, it is clear that most of the systems achieved factoid accuracies below the lower bound of this range.

Table 5. SEMEX Results for First 200 Factoid Questions from 2005 TREC QA Test Set

Response Category Responses

Answer found on initial pass                      28
Answer found on reparse                           23
Answer found with substitution                    10
Answer found with both parse and substitution      8
Answer not in Top Docs for question               29
Unable to read document containing answer          8
Answer in direct quote                             2
Answer in headline or dateline                    12
Question type not covered                          6
Question parsed incorrectly                        5
Wrong or no answer furnished in answer key         2
Answer not found                                  67
False positives                                   45
Adjusted false positives                          29

For the same 200 questions, SEMEX also produced incorrect answers for 45 questions, for a raw false positive or precision value of 22.5%. However, this value reflects the same difficulties in parsing the document set as above. An examination of the individual questions where false positives were produced reveals that sixteen such responses were for questions in which the correct answer could be found by a reparse or substitution. Discounting the false positives for these questions, because a correct answer and a false positive are mutually exclusive, results in an adjusted false positive or precision value of 14.5%.

Beyond these encouraging results, detailed investigation into the operation of the proposed concept extraction and question answering algorithms reveals that the data structures and algorithms do indeed perform as designed.

Algorithm operation is illustrated by SEMEX performance against TREC question 66.7, which asked the list question, “Which U.S. submarines were reportedly in the area?” for the target “Russian submarine Kursk sinks”. For this question, SEMEX generated the question patterns:

[S:*which][V:were][GI:][M:reportedly;in the area][IO:][DO:]
[S:*ans][V:is][GI:][M:][IO:][DO:U.S. submarine]

and returned the answer “the Toledo”. What is significant about this result is that the text from which the answer was obtained consisted of a document whose body contained only these sentences:

The second U.S. submarine in the Barents Sea when the Kursk sank was the Toledo, a Russian news agency reported Thursday.

The agency, Interfax, said the Toledo was in the area along with another U.S. submarine, the Memphis, during the Russian naval exercises in mid-August, when the Kursk sank, with the loss of 118 lives.

In this example, the fact that the Toledo was a U.S. submarine was extracted from the first sentence, while the fact that it was reportedly in the area when the Kursk sank was extracted from the second sentence. The connection between these two facts was provided by the concept network, in which the “Toledo” concept had the parent “second U.S. submarine in the Barents

Sea when the Kursk sank” from a copular proposition, from which SEMEX derived the parent

“second U.S. submarine”, which by a further derivation had “U.S. submarine” as parent, which was the desired class. We note that SEMEX was unable to find the second valid answer, “the Memphis”, since SEMEX did not understand that vessel to have been in the area, although its ancestor chain revealed that SEMEX did recognize it as a U.S. submarine.
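
The role the concept network plays in this match can be pictured with the minimal Java sketch below. It assumes, purely for illustration, that parent links are kept as a map from concept strings to parent concept strings; the class and method names are hypothetical, and the actual SEMEX matching logic is considerably richer.

    import java.util.*;

    // Illustrative sketch only: parent links of the concept network held as a
    // string-keyed map, e.g.
    //   "the Toledo"            -> "second U.S. submarine in the Barents Sea ..."
    //   "second U.S. submarine" -> "U.S. submarine"
    class IsASketch {
        static final Map<String, Set<String>> parents = new HashMap<>();

        // True if 'ancestor' appears anywhere in the ancestor chain of 'entity'.
        static boolean isA(String entity, String ancestor) {
            Deque<String> work = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            work.push(entity);
            while (!work.isEmpty()) {
                String current = work.pop();
                if (current.equals(ancestor)) return true;
                if (!seen.add(current)) continue;
                for (String p : parents.getOrDefault(current, Collections.emptySet()))
                    work.push(p);
            }
            return false;
        }

        // For question 66.7, a candidate binding for *ans taken from a tuple that
        // matches the first pattern is kept only if the second, class-constraint
        // pattern [S:*ans][V:is][DO:U.S. submarine] also holds, i.e. only if the
        // concept network says the candidate is-a "U.S. submarine".
        static boolean satisfiesClassConstraint(String candidate) {
            return isA(candidate, "U.S. submarine");
        }
    }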

Against the same target, SEMEX was able to find the correct answer, “Capt. Gennady

Lyachin,” in response to question 66.2, a result which only 2 out of 71 total TREC QA runs were able to achieve. The factoid question involved was “Who was the on-board commander of the submarine?” The relevant input sentence was “The Hero of Russia order, one of the country's highest honors, was awarded to the submarine's commander, Capt. Gennady Lyachin.” SEMEX was able to find the answer in this case because (a) the SEMEX sentence decomposition logic identified the apposition and created the separate copular proposition “Capt. Gennady Lyachin” is-a “commander”, and (b) SEMEX's question answering logic, discussed above, was able to associate “the submarine's commander” with “the on-board commander of the submarine.”

In another example, for the target “Rose Crumb”, question 120.3 asked “What organization did she found?” For this question, SEMEX returned the correct answer: “the volunteer Hospice of Clallam County,” which only 7 of 71 runs were able to answer correctly.

Examining SEMEX operation in this case revealed that the relevant input sentence was:

Crumb, who founded the volunteer Hospice of Clallam County, Wash., in 1978, said she has learned “courage, patience and acceptance” by working with families to provide their dying relative with a dignified end.

SEMEX was able to split off the relative clause and create a separate proposition for it. That proposition, as shown in SEMEX's resolution window, was:

Proposition # 5 >
    S  : Crumb
    V  : founded
    DO : the volunteer Hospice of Clallam County

which was easily matched against the question because “Hospice of Clallam County” was

recognized as a business organization.
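
Both of the preceding examples turn on the same decomposition step: an apposition or a relative clause attached to a noun phrase yields an additional proposition. The sketch below is a deliberately simplified illustration that assumes chunking has already isolated the head noun phrase and the attached material; the helper names are hypothetical and do not reproduce the logic in KstSplit.java.

    // Illustrative sketch of clause splitting over already-chunked pieces.
    class SplitSketch {
        // A minimal proposition: subject, verb, direct object.
        record Prop(String s, String v, String dObj) { }

        // "the submarine's commander, Capt. Gennady Lyachin"
        //   -> Capt. Gennady Lyachin | is | the submarine's commander
        static Prop fromApposition(String headNp, String appositive) {
            return new Prop(appositive, "is", headNp);
        }

        // "Crumb, who founded the volunteer Hospice of Clallam County"
        //   -> Crumb | founded | the volunteer Hospice of Clallam County
        static Prop fromRelativeClause(String headNp, String verb, String object) {
            return new Prop(headNp, verb, object);
        }

        public static void main(String[] args) {
            System.out.println(fromApposition(
                    "the submarine's commander", "Capt. Gennady Lyachin"));
            System.out.println(fromRelativeClause(
                    "Crumb", "founded", "the volunteer Hospice of Clallam County"));
        }
    }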

Question 87.3 asked the factoid question, “What Nobel Prize was Fermi awarded in

1938?” for the target “Enrico Fermi.” For this question, SEMEX correctly returned “the Nobel

Prize for Physics”, which only 10 of 71 runs achieved. This was made possible by the parent-

child relationship present in the concept network between “Nobel Prize” and the answer

returned. Indeed, a separate test showed that SEMEX would still have returned the correct answer if the question had asked what “prize” Fermi had been awarded, again as a consequence of the concept network. Moreover, the link from the concept network to WordNet is illustrated by a further test that asked what “award” Fermi had been awarded, which succeeded even though “award” was not an ancestor of the answer in the concept network, because it was nevertheless a WordNet hypernym of the ancestor “Nobel Prize.”
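
The two-stage lookup at work here, first the concept network's own ancestor chain and then WordNet hypernyms of those ancestors, can be sketched as follows. The WordNet interface shown is only a placeholder for illustration; it is not the JWNL API that the SEMEX implementation actually uses.

    import java.util.*;

    // Illustrative sketch of answer-class checking. The expected class taken from
    // the question ("Nobel Prize", "prize", "award") is accepted if it is either an
    // ancestor of the candidate in the concept network or a WordNet hypernym of one
    // of those ancestors.
    class ClassCheckSketch {
        interface WordNet {                        // placeholder, not the JWNL API
            Set<String> hypernyms(String noun);    // transitive hypernym closure
        }

        private final Map<String, Set<String>> parents;   // concept network links
        private final WordNet wordnet;

        ClassCheckSketch(Map<String, Set<String>> parents, WordNet wordnet) {
            this.parents = parents;
            this.wordnet = wordnet;
        }

        Set<String> ancestors(String concept) {
            Set<String> out = new LinkedHashSet<>();
            Deque<String> work = new ArrayDeque<>(List.of(concept));
            while (!work.isEmpty()) {
                for (String p : parents.getOrDefault(work.pop(), Set.of()))
                    if (out.add(p)) work.push(p);
            }
            return out;
        }

        boolean matchesClass(String candidate, String expectedClass) {
            Set<String> anc = ancestors(candidate);
            if (anc.contains(expectedClass)) return true;          // e.g. "Nobel Prize"
            for (String a : anc)                                   // e.g. "award" as a
                if (wordnet.hypernyms(a).contains(expectedClass))  // hypernym of "Nobel Prize"
                    return true;
            return false;
        }
    }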

SEMEX performance in these investigations confirmed that there were questions that

WordNet would consider inaccurately stated. For example, for the target “Miss Universe 2000

crowned,” factoid question 67.4 asked, “Where was the contest held?” For this question,

SEMEX was unable to find an answer, even though it had correctly split off the following

proposition:

Proposition # 2 >
    S  : Miss Lara Dutta
    V  : was crowned
    *  during the 49th Miss Universe pageant
    *  in Nicosia, Cyprus
    *  early Saturday
    DO : Miss Universe 2000

Examining SEMEX's internal operation for this question revealed that the reason the location adverbial “in Nicosia, Cyprus” was not matched is that a “pageant” does not have “contest” in any of its WordNet synsets. However, as noted above, when “contest” was replaced by “ceremony” in the question, the correct answer was found.


CHAPTER FIVE: CONCLUSIONS

Concept Extraction and Question Answering

Considering as a whole the experimental results reviewed above, we believe that they confirm the value of question answering through pattern matching at an intermediate level of representation, one that is abstracted from the surface syntactical patterns where much of the

previous work on pattern matching has focused. It is at this abstract level that we may take advantage of the fact that while there are many ways to say the same thing, it is still one thing that is being said. By answering questions at this level, it is possible to obtain a degree of invariance in performance when the questions or the documents vary syntactically while remaining semantically equivalent. And this, we feel, is as it should be, since we do not wish our

automated question answering systems to require that the question or the documents be

expressed in only limited ways.

Moreover, we believe that our results confirm the value of decomposing or splitting

sentences into the simplest possible clauses without changing meaning. The value of doing so is

that it permits us to base our logical forms, the proposition tuples, on the syntactic form of a

normalized atomic proposition, by which we mean a proposition involving a single predicate

whose arguments are the non-null syntactic components of such a clause.
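
To make the formalism concrete, a proposition tuple of this kind can be pictured as a fixed set of syntactic slots, using the labels S, V, GI, M, IO, and DO that appear in the question patterns shown earlier. The sketch below is an illustration only; the slot comparison is deliberately naive (the actual matching also consults WordNet relations), and the class is not the KstTuple implementation.

    import java.util.*;

    // Illustrative sketch of a proposition tuple: one predicate (V) plus the
    // non-null syntactic components of a normalized atomic clause.
    class TupleSketch {
        record Tuple(Map<String, String> slots) { }    // keys: S, V, GI, M, IO, DO

        // A question pattern uses the same slots, with "*ans" marking the free
        // variable. A tuple matches if every non-empty, non-variable pattern slot
        // agrees (naive string comparison here), and the free variable is bound to
        // the tuple's value for its slot.
        static Optional<String> match(Map<String, String> pattern, Tuple tuple) {
            String binding = null;
            for (Map.Entry<String, String> slot : pattern.entrySet()) {
                String wanted = slot.getValue();
                String found = tuple.slots().getOrDefault(slot.getKey(), "");
                if (wanted == null || wanted.isEmpty()) continue;         // unconstrained
                if (wanted.equals("*ans")) { binding = found; continue; } // free variable
                if (!wanted.equalsIgnoreCase(found)) return Optional.empty();
            }
            return Optional.ofNullable(binding);
        }
    }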

The value of using WordNet is also confirmed by our results. By using this lexical

resource’s synonym, hypernym, and antonym relations, the algorithms we propose are able to

find answers without requiring exact matches in wording. In our implementation, WordNet was

also used to assist tagger correction algorithms and to classify objects by type, such as location,

quantity, and similar categories. In this manner, we obtained robustness in question answering by

using WordNet.

Our results also confirm the value of deriving parent concepts from child concepts, as this provides the concept network with many necessary concepts that one cannot reasonably expect to be set forth explicitly in source document text.
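
As an illustration of the kind of derivation meant here, the sketch below produces candidate parents for a concept phrase by first cutting the phrase at a trailing modifier and then peeling leading modifiers, so that a phrase like the one in the Toledo example yields "second U.S. submarine" and then "U.S. submarine". The marker words and string operations are hypothetical simplifications, not the SEMEX derivation code.

    import java.util.*;

    // Illustrative sketch of parent-concept derivation, e.g.
    //   "second U.S. submarine in the Barents Sea when the Kursk sank"
    //     -> "second U.S. submarine" -> "U.S. submarine"
    class ParentDerivationSketch {
        static List<String> deriveParents(String concept) {
            List<String> derived = new ArrayList<>();
            // Cut the phrase at the first trailing-modifier marker, if any.
            String head = concept
                    .split("\\s+(in|on|at|of|for|when|who|that|which)\\s+")[0]
                    .trim();
            if (!head.isEmpty() && !head.equals(concept)) derived.add(head);
            // Then drop leading modifiers one word at a time, keeping the head noun.
            String[] words = head.split("\\s+");
            for (int i = 1; i < words.length - 1; i++)
                derived.add(String.join(" ", Arrays.copyOfRange(words, i, words.length)));
            return derived;
        }
    }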

Finally, our results confirm the value of a concept network of syntax-based concept nodes, constructed as described, as an indexing mechanism for the logical forms that constitute the knowledge base. We were thus able to find the proposition tuples of interest without the need to search the entire knowledge base. We hypothesize that by using the concept nodes as an indexing mechanism in this manner, this approach can support the incremental acquisition of a knowledge base without incurring a prohibitive performance penalty for question answering when the knowledge base becomes very large.
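
The indexing role described in this paragraph can be illustrated with the sketch below, in which each discourse entity maps to a node holding its parent concepts and the identifiers of the proposition tuples that mention it, so that answering a question touches only the tuples reachable from the entities the question names. The class and field names are hypothetical, not those of KstConcept or KstDiscourse.

    import java.util.*;

    // Illustrative sketch of a concept-node index. Each discourse entity maps to a
    // node holding (a) the parent concepts derived or extracted for it and (b) the
    // ids of the proposition tuples that mention it.
    class ConceptIndexSketch {
        static class ConceptNode {
            final String entity;
            final Set<String> parents = new LinkedHashSet<>();
            final List<Integer> tupleIds = new ArrayList<>();
            ConceptNode(String entity) { this.entity = entity; }
        }

        private final Map<String, ConceptNode> index = new HashMap<>();

        ConceptNode node(String entity) {
            return index.computeIfAbsent(entity, ConceptNode::new);
        }

        // Record that tuple 'tupleId' mentions 'entity'.
        void addMention(String entity, int tupleId) {
            node(entity).tupleIds.add(tupleId);
        }

        // Candidate tuples for a question are those indexed under the entities the
        // question mentions; the rest of the knowledge base is never scanned.
        List<Integer> candidates(Collection<String> questionEntities) {
            List<Integer> out = new ArrayList<>();
            for (String entity : questionEntities) {
                ConceptNode n = index.get(entity);
                if (n != null) out.addAll(n.tupleIds);
            }
            return out;
        }
    }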

From an implementation standpoint, we observe that parsing performance, for both source text and question, is crucial for algorithm performance. This is not surprising, for our approach requires understanding the syntax so that the semantics may be inferred.

Limitations of Approach

Although the implementation of our approach was limited to factoid question answering, we argue here that proposition tuples, extracted as we have described and embedded within a network of syntax-based concept nodes, support more advanced reasoning. We can observe the general contours of how to approach deeper reasoning from the manner in which complex

factoid questions are answered. As we have seen, some questions decompose into Boolean combinations of two or more question patterns, each of which could be satisfied by a different proposition tuple or by an inference from the parent-child relationships encoded into the structure of the concept network. By straightforward extension, we can envision how a reasoning component, perhaps a theorem prover or a resolution module, can generate a set of candidate hypotheses, each of which is a Boolean combination of question patterns. If any of the pattern sets is satisfied by the tuples of interest in the network, the hypothesis would be considered satisfied and appropriate action taken. Thus, for example, if a question were “Is

Socrates mortal?” a reasoning component might generate alternative hypothesis patterns for (a)

Socrates is mortal; and (b) Socrates is a man AND all men are mortal. In this example, hypothesis (a) could be tested as a single question pattern, while hypothesis (b) could be tested against the concept base in the same manner as a multi-part factoid question, as discussed previously in this dissertation.
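
The Socrates example can be made concrete with the sketch below, in which a hypothesis is a disjunction of alternatives and each alternative is a conjunction of question patterns; the hypothesis is considered satisfied when every pattern in some alternative is satisfied by the knowledge base. The names are illustrative only; no such reasoning component exists in the current SEMEX implementation.

    import java.util.*;
    import java.util.function.Predicate;

    // Illustrative sketch: a hypothesis is a disjunction of alternatives, each a
    // conjunction of question patterns. A pattern is abstracted here as a predicate
    // over the knowledge base (the tuple store plus the concept network).
    class HypothesisSketch {
        interface KnowledgeBase { }                       // placeholder

        record Alternative(List<Predicate<KnowledgeBase>> patterns) {
            boolean satisfiedBy(KnowledgeBase kb) {
                return patterns.stream().allMatch(p -> p.test(kb));
            }
        }

        // "Is Socrates mortal?" might expand to two alternatives:
        //   (a) [S:Socrates][V:is][DO:mortal]
        //   (b) [S:Socrates][V:is][DO:man] AND [S:every man][V:is][DO:mortal]
        static boolean entailed(KnowledgeBase kb, List<Alternative> alternatives) {
            return alternatives.stream().anyMatch(a -> a.satisfiedBy(kb));
        }
    }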

One may inquire whether the language of Boolean combinations of proposition or question patterns is sufficiently expressive to cover the range of input propositions or inquiry hypotheses that one may encounter. We argue that it is, because our proposition tuples span the possible components of grammatically well-formed statements or questions, and because the statements and questions of interest to this line of research must be understandable, and therefore well-formed.

The above notwithstanding, there are distinct constraints that restrict the application of our structures and methods. As our experience with newswire documents for

TREC has demonstrated, the ability to acquire useful knowledge from text by extracting propositions and concepts as we have described depends strongly on the quality of the part of

speech tagging, parsing, chunking, and decomposition of the input text documents. This depends

partly on the technologies employed, but also on the grammatical complexity and “noise” of the

document domain. We observe in this regard, for example, that the domains of scholarly tracts,

encyclopedic texts, medical journals, and newswire articles each presents different challenges,

for which different techniques may need to be explored as these algorithms are applied in the

future to different domains. Despite these variations, we anticipate that the concept extraction

approach will remain the same, namely, obtaining high quality POS tagging and parsing to

support downstream sentence splitting into atomic propositions, from which proposition tuples

will be extracted as logical forms and embedded into a concept network.

We observe also that our approach does not expressly take into account word sense

disambiguation. As a result, we anticipate that proposition tuples involving different senses of

the same word, for example, the “bank” of a river and the “bank” that is a financial institution,

will be indexed within the same concept node. The question for further investigation is whether

the constellations of other syntactical components within the proposition tuples for each of the senses will serve as selectional restrictions, so that question patterns pertaining to one sense will

not match against tuples for another sense. No doubt degenerate examples can be constructed

where such a question will obtain a mismatch, but what interests us is what will occur in

practical domains. Nevertheless, the analysis and processing steps that we propose do leave room

for incorporating a disambiguation module in a practical architecture, such as SEMEX,

implementing our approach.

Finally, we observe that the robustness in question answering that is achieved by using

WordNet must be tempered by the limitations of its linguistic formalism. For example, WordNet

does not recognize a beauty pageant as a “contest”, although that association occurs in common parlance. Instead, WordNet classifies a pageant more properly as a “ceremony”, which

is, strictly speaking, correct. Nevertheless, much communication employs the “looser”

conventions of popular discourse, and it remains unclear how such looser associations can be captured and used effectively, given that common usage is often inconsistent.

Future Research

Two distinct lines of further research suggest themselves based on the research reported above. The first involves continuing to develop a better TREC competitor. For this line, we consider essential the addition of an information retrieval (IR) component so that it will not be necessary to rely on the generic NIST search engine to retrieve documents for analysis. A number of document search engines are available in the public domain, such as the Lucene search engine. We believe that an IR component in SEMEX would be able to find more relevant documents and also to filter out some of the more difficult-to-parse documents, such as bulleted lists, multiple by-line documents, and other documents containing non-sentential content.
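
As a simple illustration of the kind of document filtering envisioned here, the sketch below keeps a document only if most of its non-empty lines look like ordinary sentences. The heuristics and the threshold are hypothetical and are not part of the current SEMEX implementation.

    import java.util.*;

    // Illustrative sketch of input filtering: keep a document only if most of its
    // non-empty lines look like ordinary sentences.
    class NoiseFilterSketch {
        static boolean looksSentential(String line) {
            String t = line.trim();
            if (t.isEmpty()) return false;
            if (t.startsWith("-") || t.startsWith("*")) return false;   // bulleted item
            if (!t.matches(".*[a-z].*")) return false;                  // all-caps headline or table row
            return t.endsWith(".") || t.endsWith("?") || t.endsWith("\"");
        }

        // Accept the document if at least 'threshold' of its non-empty lines are sentential.
        static boolean accept(List<String> lines, double threshold) {
            long nonEmpty = lines.stream().filter(l -> !l.trim().isEmpty()).count();
            if (nonEmpty == 0) return false;
            long sentential = lines.stream().filter(NoiseFilterSketch::looksSentential).count();
            return (double) sentential / nonEmpty >= threshold;
        }
    }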

Whether or not an IR component is incorporated, a better TREC competitor would include improved input text file noise filtering so that SEMEX can read what is readable in every document that is processed. Also, we would expect improvements throughout the front end processing stages: tagging, parsing, chunking, and sentence decomposition. For TREC specifically, making use of headlines and datelines is essential, as shown by our experience. We would also expect to see more parent-child concept derivations and better anaphora resolution, including the resolution of discourse definite references. More question types must be

implemented, so that finding an answer does not fail because the question is not well understood.

Finally, we would also add a component to filter nonsensical candidate answers, since processing stages through decomposition will not always succeed in generating well-formed propositions.

The second line of investigation involves moving towards a true “intelligent digital assistant” by using the algorithms described here to support the incremental acquisition of a knowledge base and methods for querying it effectively. In particular, we are intrigued with the prospect of developing algorithms for using the concept network as a knowledge base for answering questions that require reasoning about the concepts in the network, from quantitative and temporal aspects to hypothesis testing. Extending the present approach to reason about existentially quantified concepts, for example, would be straightforward, as the SEMEX implementation already recognizes such quantifiers and creates separate concepts for such recognized instances of universally quantified discourse entities. Additional work is no doubt needed to implement a reasoning component, but we argue that a syntax-based concept network will support such reasoning.

However, in this regime, it will be necessary to investigate additional aspects not addressed in the current research, such as maintaining the consistency of the knowledge base and how to recognize inconsistent input data. Nevertheless, with the addition of multimodal front and back end user interfaces, such as speech recognition and voice synthesis, one can imagine an intelligent digital assistant serving as one’s personal secretary, reading and analyzing documents that one does not have time for, answering questions or summarizing what was read, and thus serving as the cumulative repository or knowledge base of one’s personal, lifetime experiences.

APPENDIX A: SEMEX SOURCE CODE

The SEMEX tool that was used for this research was hosted on a Dell Inspiron 6000 laptop computer with a 1.6 GHz processor, 1 GB main memory, and a Linux file system partition of 40 GB. The program was developed and run in a Fedora Core 3 Linux environment, using the software described below.

The source code for SEMEX comprises over 48,000 lines of Java code, plus a few lines of C code to link the tool to the Brill tagger and the Cass (Abney’s) partial parser. The source code is contained in the following fifteen files, which can be viewed by selecting the file names, which link to the separate documents that contain them:

KstAnswer.java
KstChunk.java
KstConcept.java
KstDictionary.java
KstDiscourse.java
KstExtract.java
KstFile.java
KstGroup.java
KstSplit.java
KstTuple.java
KstUtil.java
semex.java
semexJNI.java
semexJNI.c
semexJNI.h

To obtain a running configuration, it is also necessary to obtain and install the following additional software: (a) Brill’s tagger; (b) the Cass (Abney’s) partial parser; (c) WordNet 2.0;

(d) the Java WordNet Library (JWNL), version 1.3, and (e) Java SDK version 1.5. It is also necessary to make a minor modification to Brill’s tagger so that it prints output to a file in addition to the screen.
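
The C glue in semexJNI.c exposes the tagger and parser to the Java side through the Java Native Interface. The declaration below is a minimal, hypothetical illustration of such a bridge; it does not reproduce the actual semexJNI interface.

    // Illustrative sketch of a JNI bridge class; the real semexJNI.java differs.
    public class TaggerBridge {
        static {
            // Loads the native library built from the C glue code (e.g. libsemexJNI.so).
            System.loadLibrary("semexJNI");
        }

        // Implemented in C; passes the sentence to the Brill tagger and returns
        // the tagged text.
        public native String tagSentence(String sentence);

        // Implemented in C; passes tagged text to the Cass partial parser and
        // returns the chunked parse.
        public native String parseChunks(String taggedSentence);
    }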

APPENDIX B: TREC 2005 QA QUESTIONS

73 Target 66 : Russian submarine Kursk sinks 66.1 FACTOID When did the submarine sink? 66.2 FACTOID Who was the on-board commander of the submarine? 66.3 FACTOID The submarine was part of which Russian fleet? 66.4 FACTOID How many crewmen were lost in the disaster? 66.5 LIST Which countries expressed regret about the loss? 66.6 FACTOID In what sea did the submarine sink? 66.7 LIST Which U.S. submarines were reportedly in the area? 66.8 OTHER Other

Target 67 : Miss Universe 2000 crowned 67.1 FACTOID Who won the crown? 67.2 FACTOID What country did the winner represent? 67.3 FACTOID How many competitors did the winner have? 67.4 FACTOID Where was the contest held? 67.5 FACTOID What was the scheduled date of the contest? 67.6 LIST Name other contestants. 67.7 OTHER Other

Target 68 : Port Arthur Massacre 68.1 FACTOID Where is Port Arthur? 68.2 FACTOID When did the massacre occur? 68.3 FACTOID What was the final death toll of the massacre? 68.4 FACTOID Who was the killer? 68.5 FACTOID What was the killer's nationality? 68.6 LIST What were the names of the victims? 68.7 LIST What were the nationalities of the victims? 68.8 OTHER Other

Target 69 : France wins World Cup in soccer 69.1 FACTOID When did France win the World Cup? 69.2 FACTOID Who did France beat for the World Cup? 69.3 FACTOID What was the final score? 69.4 FACTOID What was the nickname for the French team? 69.5 FACTOID At what stadium was the game played? 69.6 FACTOID Who was the coach of the French team? 69.7 LIST Name players on the French team. 69.8 OTHER Other

Target 70 : Plane clips cable wires in Italian resort 70.1 FACTOID When did the accident occur? 70.2 FACTOID Where in Italy did the accident occur? 70.3 FACTOID How many people were killed? 70.4 FACTOID What was the affiliation of the plane? 70.5 FACTOID What was the name of the pilot?

74 70.6 FACTOID What was the outcome of the U.S. trial against the pilot? 70.7 LIST Who were on-ground witnesses to the accident? 70.8 OTHER Other

Target 71 : F16 71.1 FACTOID What type of plane is an F16? 71.2 FACTOID How fast can it fly? 71.3 FACTOID Who manufactures the F16? 71.4 FACTOID Where is this company based? 71.5 FACTOID Who manufactures engines for the F16? 71.6 LIST What countries besides U.S. fly F16s? 71.7 OTHER Other

Target 72 : Bollywood 72.1 FACTOID Where is Bollywood located? 72.2 FACTOID From what foreign city did Bollywood derive its name? 72.3 FACTOID What is the Bollywood equivalent of Beverly Hills? 72.4 FACTOID What is Bollywood's equivalent of the Oscars? 72.5 FACTOID Where does Bollywood rank in the world's film industries? 72.6 LIST Who are some of the Bollywood stars? 72.7 OTHER Other

Target 73 : Viagra 73.1 FACTOID Viagra is prescribed for what problem? 73.2 FACTOID Who manufactures Viagra? 73.3 FACTOID Who approved its use in China? 73.4 FACTOID What is the scientific name for Viagra? 73.5 FACTOID When did Viagra go on the market? 73.6 LIST In what countries could Viagra be obtained on the black market? 73.7 OTHER Other

Target 74 : DePauw University 74.1 FACTOID What type of school is DePauw? 74.2 FACTOID Where is DePauw located? 74.3 FACTOID When was DePauw founded? 74.4 FACTOID Who was president of DePauw in 1999? 74.5 FACTOID What was the approximate number of students attending in 1999? 74.6 LIST Name graduates of the university. 74.7 OTHER Other

Target 75 : Merck & Co. 75.1 FACTOID Where is the company headquartered? 75.2 FACTOID What does the company make? 75.3 FACTOID What is their symbol on the New York Stock Exchange? 75.4 FACTOID What is the company's web address?

75 75.5 LIST Name companies that are business competitors. 75.6 FACTOID Who was a chairman of the company in 1996? 75.7 LIST Name products manufactured by Merck. 75.8 OTHER Other

Target 76 : Bing Crosby 76.1 FACTOID What was his profession? 76.2 FACTOID For which movie did he win an Academy Award? 76.3 FACTOID What was his nickname? 76.4 FACTOID What is the title of his all-time best-selling record? 76.5 FACTOID He is an alumnus of which university? 76.6 FACTOID How old was Crosby when he died? 76.7 LIST What movies was he in? 76.8 OTHER Other

Target 77 : George Foreman 77.1 FACTOID When was George Foreman born? 77.2 FACTOID Where was George Foreman born? 77.3 FACTOID When did George Foreman first become world heavyweight boxing champion? 77.4 FACTOID Who did Foreman defeat for his first heavyweight championship? 77.5 FACTOID How old was Foreman when he first won the heavyweight championship? 77.6 LIST Name opponents who Foreman defeated. 77.7 LIST Name opponents who defeated Foreman. 77.8 OTHER Other

Target 78 : Akira Kurosawa 78.1 FACTOID When did Kurosawa die? 78.2 FACTOID When was he born? 78.3 FACTOID Which university did he graduate from? 78.4 FACTOID What was his profession? 78.5 FACTOID What was his English nickname? 78.6 FACTOID What was his wife's profession? 78.7 LIST What were some of his Japanese film titles? 78.8 OTHER Other

Target 79 : Kip Kinkel school shooting 79.1 FACTOID When did the school shooting occur? 79.2 FACTOID How many students were wounded? 79.3 LIST List students who were shot by Kip Kinkel. 79.4 FACTOID How many students did he kill? 79.5 FACTOID How old was Kip Kinkel when the shooting took place? 79.6 FACTOID How many bombs did investigators find in Kip's home? 79.7 OTHER Other

76 Target 80 : Crash of EgyptAir Flight 990 80.1 FACTOID Where in the Atlantic Ocean did Flight 990 crash? 80.2 FACTOID Who was the pilot of Flight 990? 80.3 FACTOID Who was the co-pilot of Flight 990? 80.4 FACTOID How many crew members were aboard? 80.5 FACTOID How many passengers were aboard Flight 990? 80.6 LIST Identify the nationalities of passengers on Flight 990. 80.7 OTHER Other

Target 81 : Preakness 1998 81.1 FACTOID Name the horse that won the Preakness in 1998? 81.2 LIST List other horses who won the Kentucky Derby and Preakness but not the Belmont. 81.3 FACTOID Who is the trainer of the Preakness 1998 winner? 81.4 FACTOID Who finished second to the Preakness winner in 1998? 81.5 FACTOID What was the track attendance for the 1998 Preakness? 81.6 FACTOID What time did the race begin? 81.7 OTHER Other

Target 82 : Howdy Doody Show 82.1 FACTOID What year did the "Howdy Doody Show" first run on television? 82.2 FACTOID On what date did the show go off the air? 82.3 LIST Name the various puppets used in the "Howdy Doody Show". 82.4 LIST Name the characters in the show. 82.5 FACTOID The main puppet character was based on what person? 82.6 OTHER Other

Target 83 : Louvre Museum 83.1 FACTOID What was the Louvre Museum before it was a museum? 83.2 FACTOID When was the Louvre transformed into a museum? 83.3 FACTOID How many paintings are on permanent exhibit at the Louvre? 83.4 LIST Name the works of art that have been stolen from the Louvre. 83.5 FACTOID How many people visit the Louvre each year? 83.6 FACTOID Who is president/director of the Louvre Museum? 83.7 OTHER Other

Target 84 : meteorites 84.1 FACTOID What is the largest meteorite found in the U.S.? 84.2 FACTOID How heavy is it? 84.3 FACTOID What is it called by the Indians? 84.4 FACTOID Where is the world's largest meteorite? 84.5 FACTOID How heavy is the world's largest meteorite? 84.6 FACTOID How many metric tons of meteorites fall to the earth each year? 84.7 LIST Provide a list of names or identifications given to meteorites. 84.8 OTHER Other


Target 85 : Norwegian Cruise Lines (NCL) 85.1 LIST Name the ships of the NCL. 85.2 FACTOID What cruise line attempted to take over NCL in December 1999? 85.3 FACTOID What is the name of the NCL's own private island? 85.4 FACTOID How does NCL rank in size with other cruise lines? 85.5 FACTOID Why did the Grand Cayman turn away a NCL ship? 85.6 LIST Name so-called theme cruises promoted by NCL. 85.7 OTHER Other

Target 86 : Sani Abacha 86.1 FACTOID Give the month and year that General Abacha had a successful coup in . 86.2 FACTOID What reportedly caused the death of Sani Abacha? 86.3 FACTOID How old was Sani Abacha when he died? 86.4 FACTOID Who was sworn in to replace Sani Abacha? 86.5 LIST Name the children of Sani Abacha. 86.6 OTHER Other

Target 87 : Enrico Fermi 87.1 FACTOID When was Enrico Fermi born? 87.2 FACTOID When did Enrico Fermi die? 87.3 FACTOID What Nobel Prize was Fermi awarded in 1938? 87.4 LIST List things named in honor of Enrico Fermi. 87.5 FACTOID What is Enrico Fermi most known for? 87.6 FACTOID Give the name and symbol for the chemical element named after Enrico Fermi. 87.7 FACTOID What country did Enrico Fermi come from originally? 87.8 OTHER Other

Target 88 : United Parcel Service (UPS) 88.1 FACTOID Where is UPS headquarters located? 88.2 FACTOID Who is the CEO of UPS? 88.3 FACTOID When was UPS's first public stock offering? 88.4 LIST In what foreign countries does the UPS operate? 88.5 FACTOID What color are UPS trucks? 88.6 FACTOID How much money did UPS pay out in insurance claims in 1984? 88.7 OTHER Other

Target 89 : Little League Baseball 89.1 FACTOID Where is the Little League World Championship played? 89.2 FACTOID On what street are the fields where the Little League World Series is played? 89.3 LIST What Little League teams have won the World Series? 89.4 FACTOID How many girls have played in the Little League World Series?

78 89.5 FACTOID What year was the first Little League World Series played? 89.6 FACTOID What is Little League Baseball's URL on the Internet? 89.7 OTHER Other

Target 90 : Virginia wine 90.1 LIST What grape varieties are Virginia wines made from? 90.2 FACTOID Approximately how many acres of grapes are grown in Virginia? 90.3 FACTOID Which Virginia vineyard produces the most wine? 90.4 FACTOID Who was Virginia's first and most famous wine maker? 90.5 LIST Name the Virginia wine festivals. 90.6 FACTOID Who was the former CEO who became a Virginia wine maker? 90.7 OTHER Other

Target 91 : Cliffs Notes 91.1 FACTOID Who originated Cliffs Notes? 91.2 FACTOID Whose works were the subject of the first Cliffs Notes? 91.3 LIST Give the titles of Cliffs Notes Condensed Classics. 91.4 FACTOID What company now owns Cliffs Notes? 91.5 FACTOID How many copies of Cliffs Notes are sold annually? 91.6 FACTOID What percentage of Americans have used Cliffs Notes? 91.7 OTHER Other

Target 92 : Arnold Palmer 92.1 FACTOID How many times did Arnold win the Masters? 92.2 FACTOID How many times did Arnold win the British Open? 92.3 LIST What players has Arnold competed against in the Skins Games? 92.4 LIST Which golf courses were designed by Arnold? 92.5 FACTOID What major championship did Arnold never win? 92.6 FACTOID What was Arnold's wife's first name? 92.7 OTHER Other

Target 93 : first 2000 Bush-Gore presidential debate 93.1 FACTOID Who moderated the first 2000 presidential debate? 93.2 FACTOID How long was the debate scheduled to be? 93.3 FACTOID On what university campus was the first debate held? 93.4 FACTOID Which major network decided not to televise the debate? 93.5 FACTOID In what state did Al Gore prepare for the first debate? 93.6 FACTOID On what date was the first debate? 93.7 LIST Who helped the candidates prepare? 93.8 OTHER Other

Target 94 : 1998 indictment and trial of Susan McDougal 94.1 FACTOID Who was Mrs. McDougal's lawyer? 94.2 FACTOID Who was the prosecutor? 94.3 FACTOID How did Mrs. McDougal plead?

79 94.4 LIST Who testified for Mrs. McDougal's defense? 94.5 FACTOID What was the jury's ruling on the obstruction of justice charge? 94.6 FACTOID What was the result of the contempt charges? 94.7 OTHER Other

Target 95 : return of Hong Kong to Chinese sovereignty 95.1 FACTOID What is Hong Kong's population? 95.2 FACTOID When was Hong Kong returned to Chinese sovereignty? 95.3 FACTOID Who was the Chinese President at the time of the return? 95.4 FACTOID Who was the British Foreign Secretary at the time? 95.5 LIST What other countries formally congratulated China on the return? 95.6 OTHER Other

Target 96 : 1998 Nagano Olympic Games 96.1 FACTOID What materials was the 1998 Olympic torch made of? 96.2 FACTOID How long was the men's downhill ski run in Nagano? 96.3 LIST Who won gold medals in Nagano? 96.4 FACTOID Which country took the first gold medal at Nagano? 96.5 FACTOID Who won the women's giant slalom? 96.6 FACTOID How many countries were represented at Nagano? 96.7 OTHER Other

Target 97 : Counting Crows 97.1 FACTOID Who is the lead singer of the Counting Crows? 97.2 FACTOID What year did the group form? 97.3 FACTOID What is the title of their signature hit? 97.4 FACTOID What is the title of the Crows' first record? 97.5 LIST List the Crows' record titles. 97.6 LIST List the Crows' band members. 97.7 OTHER Other

Target 98 : American Legion 98.1 FACTOID When was the American Legion founded? 98.2 FACTOID Where was the American Legion founded? 98.3 FACTOID How many members does the American Legion have? 98.4 FACTOID What organization has helped to revitalize Legion membership? 98.5 LIST List Legionnaires. 98.6 OTHER Other

Target 99 : Woody Guthrie 99.1 LIST List Woody Guthrie's songs. 99.2 FACTOID When was Guthrie born? 99.3 FACTOID Where was Guthrie born? 99.4 FACTOID What year did Woody Guthrie die? 99.5 FACTOID Where did he die?

80 99.6 FACTOID What caused Guthrie's death? 99.7 OTHER Other

Target 100 : Sammy Sosa 100.1 FACTOID Where was Sammy Sosa born? 100.2 FACTOID What was Sosa's team? 100.3 FACTOID How many home runs were hit by Sosa in 1998? 100.4 FACTOID Who was Sosa's competitor for the home run title in 1998? 100.5 FACTOID What was the record number of home runs in 1998? 100.6 FACTOID What award was won by Sammy Sosa in 1998? 100.7 LIST Name the pitchers off of which Sosa homered. 100.8 OTHER Other

Target 101 : Michael Weiss 101.1 FACTOID When was Michael Weiss born? 101.2 FACTOID Who is Weiss's coach? 101.3 FACTOID When did Weiss win his first U.S. Skating title? 101.4 FACTOID When did Weiss win his second U.S. Skating title? 101.5 FACTOID Who is Michael Weiss's choreographer? 101.6 FACTOID What is Weiss's home town? 101.7 LIST List Michael Weiss's competitors. 101.8 OTHER Other

Target 102 : Boston Big Dig 102.1 FACTOID What was the official name of the Big Dig? 102.2 FACTOID When did the Big Dig begin? 102.3 FACTOID What was the original estimated cost of the Big Dig? 102.4 FACTOID What was the expected completion date? 102.5 FACTOID What is the length of the Big Dig? 102.6 LIST List individuals associated with the Big Dig. 102.7 OTHER Other

Target 103 : Super Bowl XXXIV 103.1 FACTOID Where was Super Bowl XXXIV held? 103.2 FACTOID What team won the game? 103.3 FACTOID What was the final score? 103.4 FACTOID What was the attendance at the game? 103.5 FACTOID How many plays were there in Super Bowl XXXIV? 103.6 LIST List players who scored touchdowns in the game. 103.7 OTHER Other

Target 104 : 1999 North American International Auto Show 104.1 FACTOID In what city was the 1999 North American International Auto Show held? 104.2 FACTOID What type of vehicle dominated the show? 104.3 FACTOID What auto won the North American Car of the Year award at the show?

81 104.4 LIST List auto manufacturers in the show. 104.5 FACTOID How many automakers and suppliers had displays at the show? 104.6 FACTOID What was the expected attendance at the show? 104.7 FACTOID In what year was the first Auto Show held? 104.8 OTHER Other

Target 105 : 1980 Mount St. Helens eruption 105.1 FACTOID In what mountain range is Mt. St. Helens located? 105.2 FACTOID Who named Mount St. Helens? 105.3 FACTOID What was the date of Mt. St. Helens' eruption? 105.4 FACTOID How many people died when it erupted? 105.5 FACTOID What was the height of the volcano after the eruption? 105.6 LIST List names of eyewitnesses of the eruption. 105.7 OTHER Other

Target 106 : 1998 Baseball World Series 106.1 FACTOID What is the name of the winning team? 106.2 FACTOID What is the name of the losing team? 106.3 FACTOID Who was named Most Valuable Player (MVP)? 106.4 FACTOID How many games were played in the series? 106.5 FACTOID What is the name of the winning manager? 106.6 LIST Name the players in the series. 106.7 OTHER Other

Target 107 : Chunnel 107.1 FACTOID How long is the Chunnel? 107.2 FACTOID What year did construction of the tunnel begin? 107.3 FACTOID What year did the Chunnel open for traffic? 107.4 FACTOID How many people use the Chunnel each year? 107.5 FACTOID Who operates the Chunnel? 107.6 LIST List dates of Chunnel closures. 107.7 OTHER Other

Target 108 : Sony Pictures Entertainment (SPE) 108.1 FACTOID Who is the parent company of Sony Pictures? 108.2 FACTOID What U.S. company did Sony purchase to form SPE? 108.3 FACTOID Name the president and COO of the SPE. 108.4 LIST Name movies released by SPE. 108.5 LIST Name TV shows by the SPE. 108.6 FACTOID Who is the vice-president of SPE? 108.7 OTHER Other

Target 109 : Telefonica of Spain 109.1 FACTOID How many customers does it have? 109.2 FACTOID How many countries does it operate in?

82 109.3 FACTOID How is it ranked in size among the world's telecommunications companies? 109.4 FACTOID Name the chairman. 109.5 LIST Name companies involved in mergers with Telefonica of Spain. 109.6 OTHER Other

Target 110 : Lions Club International 110.1 FACTOID What is the mission of the Lions Club? 110.2 FACTOID When was the club founded? 110.3 FACTOID Where is the club's world-wide headquarters? 110.4 FACTOID Who is the Lions Club president? 110.5 LIST Name officials of the club. 110.6 LIST Name programs sponsored by the Lions Club. 110.7 OTHER Other

Target 111 : AMWAY 111.1 FACTOID When was AMWAY founded? 111.2 FACTOID Where is it headquartered? 111.3 FACTOID Who is the president of the company? 111.4 LIST Name the officials of the company. 111.5 FACTOID What is the name "AMWAY" short for? 111.6 OTHER Other

Target 112 : McDonald's Corporation 112.1 FACTOID When did the first McDonald's restaurant open in the U.S.? 112.2 FACTOID Where is the headquarters located? 112.3 FACTOID What is the corporation's annual revenue? 112.4 FACTOID Who made McDonald's the largest fast-food chain? 112.5 LIST Name the corporation's top officials. 112.6 LIST Name the non-hamburger restaurant holdings of the corporation. 112.7 OTHER Other

Target 113 : Paul Newman 113.1 FACTOID What is his primary career? 113.2 FACTOID What is his second successful career? 113.3 FACTOID What is the name of the company that he started? 113.4 LIST Name the camps started under his Hole in the Wall Foundation. 113.5 LIST Name some of his movies. 113.6 FACTOID Who is he married to? 113.7 OTHER Other

Target 114 : Jesse Ventura 114.1 FACTOID What is his political party affiliation? 114.2 FACTOID What is his birth name? 114.3 LIST List his various occupations.

83 114.4 LIST Name movies/TV shows he appeared in. 114.5 FACTOID What is his wife's name? 114.6 FACTOID How many children do they have? 114.7 OTHER Other

Target 115 : Longwood Gardens 115.1 FACTOID When was the initial land purchased? 115.2 FACTOID Where is it? 115.3 FACTOID How large is it? 115.4 FACTOID Who created it? 115.5 FACTOID How many visitors does it get per year? 115.6 FACTOID When is the best month to visit the gardens? 115.7 LIST List personnel of the gardens. 115.8 OTHER Other

Target 116 : Camp David 116.1 FACTOID Where is it? 116.2 FACTOID How large is it? 116.3 FACTOID What was it originally called? 116.4 FACTOID When was it first used? 116.5 FACTOID What U.S. President first used it? 116.6 LIST Who are some world leaders that have met there? 116.7 OTHER Other

Target 117 : kudzu 117.1 FACTOID What kind of plant is kudzu? 117.2 FACTOID When was it introduced into the U.S.? 117.3 FACTOID From where was it introduced? 117.4 LIST What are other names it is known by? 117.5 FACTOID Why is it a problem? 117.6 FACTOID What has been found to kill it? 117.7 OTHER Other

Target 118 : U.S. Medal of Honor 118.1 FACTOID When was it first awarded? 118.2 FACTOID Who authorized it? 118.3 FACTOID How many have received the award since 1863? 118.4 LIST What Medal of Honor recipients are in Congress? 118.5 FACTOID Who is the only woman to receive it? 118.6 FACTOID How many veterans have received the honor twice? 118.7 OTHER Other

Target 119 : Harley-Davidson 119.1 FACTOID When was the company founded? 119.2 FACTOID Where is it based?

84 119.3 FACTOID They are best known for making what product? 119.4 LIST What other products do they produce? 119.5 FACTOID What is the average age of a Harley-Davidson rider? 119.6 FACTOID What company did Harley-Davidson buy out? 119.7 OTHER Other

Target 120 : Rose Crumb 120.1 FACTOID What was her occupation? 120.2 FACTOID Where was she from? 120.3 FACTOID What organization did she found? 120.4 FACTOID When did she found it? 120.5 LIST What awards has she received? 120.6 FACTOID How old was she when she won the awards? 120.7 OTHER Other

Target 121 : Rachel Carson 121.1 FACTOID What was her vocation? 121.2 FACTOID Where was her home? 121.3 LIST What books did she write? 121.4 FACTOID When did she write her book exposing dangers of pesticides? 121.5 FACTOID Her book caused what pesticide to be banned? 121.6 FACTOID What did she die of? 121.7 FACTOID When did she die? 121.8 OTHER Other

Target 122 : Paul Revere 122.1 FACTOID When was he born? 122.2 FACTOID When did he die? 122.3 FACTOID In what cemetery is he buried? 122.4 FACTOID When did he make his famous ride? 122.5 FACTOID From where did he begin his famous ride? 122.6 FACTOID Where did his famous ride end? 122.7 LIST What were some of his occupations? 122.8 OTHER Other

Target 123 : Vicente Fox 123.1 FACTOID When was Vicente Fox born? 123.2 FACTOID Where was Vicente Fox educated? 123.3 FACTOID Of what country is Vicente Fox president? 123.4 FACTOID What job did he hold before becoming president? 123.5 LIST What countries did Vicente Fox visit after election? 123.6 OTHER Other

Target 124 : Rocky Marciano 124.1 FACTOID When was he born?

85 124.2 FACTOID Where did he live? 124.3 FACTOID When did he die? 124.4 FACTOID How did he die? 124.5 FACTOID How many fights did he win? 124.6 LIST Who were some of his opponents? 124.7 OTHER Other

Target 125 : Enrico Caruso 125.1 LIST What operas has Caruso sung? 125.2 FACTOID Whom did he marry? 125.3 FACTOID How many children did he have? 125.4 FACTOID How many opening season performances did he have at the Met? 125.5 FACTOID How many performances did he sing at the Met? 125.6 FACTOID At what age did he die? 125.7 FACTOID Where did he die? 125.8 OTHER Other

Target 126 : Pope Pius XII 126.1 FACTOID When was he elected Pope? 126.2 FACTOID What was his name before becoming Pope? 126.3 LIST What official positions did he hold prior to becoming Pius XII? 126.4 FACTOID How long was his pontificate? 126.5 FACTOID How many people did he canonize? 126.6 FACTOID What caused the death of Pius XII? 126.7 FACTOID What pope followed Pius XII? 126.8 OTHER Other

Target 127 : U.S. Naval Academy 127.1 FACTOID Where is the U.S. Naval Academy? 127.2 FACTOID When was it founded? 127.3 FACTOID What is the enrollment? 127.4 FACTOID What are the students called? 127.5 FACTOID Who is the father of the U.S. Navy? 127.6 LIST List people who have attended the Academy. 127.7 OTHER Other

Target 128 : OPEC 128.1 FACTOID What does OPEC stand for? 128.2 FACTOID How many countries are members of OPEC? 128.3 LIST What countries constitute the OPEC committee? 128.4 FACTOID Where is the headquarters of OPEC located? 128.5 LIST List OPEC countries. 128.6 OTHER Other

Target 129 : NATO

86 129.1 FACTOID What does the acronym NATO stand for? 129.2 FACTOID When was NATO established? 129.3 FACTOID Where was the agreement establishing NATO signed? 129.4 LIST Which countries were the original signers? 129.5 FACTOID Where is NATO headquartered? 129.6 OTHER Other

Target 130 : tsunami 130.1 FACTOID What causes tsunamis? 130.2 FACTOID Where does it commonly occur? 130.3 FACTOID What is its maximum height? 130.4 FACTOID How fast can it travel? 130.5 LIST What countries has it struck? 130.6 FACTOID What language does the term "tsunami" come from? 130.7 OTHER Other

Target 131 : Hindenburg disaster 131.1 FACTOID What type of craft was the Hindenburg? 131.2 FACTOID How fast could it travel? 131.3 FACTOID When did the Hindenburg disaster occur? 131.4 FACTOID Where did the disaster occur? 131.5 FACTOID How many people were on board? 131.6 FACTOID How many of them were killed? 131.7 LIST Name individuals who witnessed the disaster. 131.8 OTHER Other

Target 132 : Kim Jong Il 132.1 FACTOID When was Kim Jong Il born? 132.2 FACTOID Who is Kim Jong Il's father? 132.3 FACTOID What country does Kim Jong Il rule? 132.4 LIST What posts has Kim Jong Il held in the government of this country? 132.5 FACTOID To whom is Kim Jong Il married? 132.6 OTHER Other

Target 133 : Hurricane Mitch 133.1 FACTOID Where did this hurricane occur? 133.2 FACTOID When did this hurricane occur? 133.3 LIST As of the time of Hurricane Mitch, what previous hurricanes had higher death totals? 133.4 LIST What countries offered aid for this hurricane? 133.5 FACTOID What country had the highest death total from this hurricane? 133.6 OTHER Other

Target 134 : genome 134.1 FACTOID What is a genome?

87 134.2 LIST List species whose genomes have been sequenced. 134.3 LIST List the organizations that sequenced the Human genome. 134.4 FACTOID How many chromosomes does the Human genome contain? 134.5 FACTOID What is the length of the Human genome? 134.6 OTHER Other

Target 135 : Food-for-Oil Agreement 135.1 FACTOID What country was the primary beneficiary of this agreement? 135.2 FACTOID Who authorized this agreement? 135.3 FACTOID When was this agreement authorized? 135.4 FACTOID When was this agreement signed? 135.5 LIST What countries participated in this agreement by providing food or medicine? 135.6 OTHER Other

Target 136 : Shiite 136.1 FACTOID Who was the first Imam of the Shiite sect of Islam? 136.2 FACTOID Where is his tomb? 136.3 FACTOID What was this person's relationship to the Prophet Mohammad? 136.4 FACTOID Who was the third Imam of Shiite Muslims? 136.5 FACTOID When did he die? 136.6 FACTOID What portion of Muslims are Shiite? 136.7 LIST What Shiite leaders were killed in Pakistan? 136.8 OTHER Other

Target 137 : Kinmen Island 137.1 FACTOID What is the former name of Kinmen? 137.2 FACTOID What country governs Kinmen? 137.3 LIST What other island groups are controlled by this government? 137.4 FACTOID In the 1950's, who regularly bombarded Kinmen? 137.5 FACTOID How far is Kinmen from this country? 137.6 FACTOID Of the two governments involved over Kinmen, which has air superiority? 137.7 OTHER Other

Target 138 : International Bureau of Universal Postal Union (UPU) 138.1 FACTOID When was the UPU organized? 138.2 FACTOID When did the UPU become part of the UN? 138.3 LIST Where were UPU congresses held? 138.4 FACTOID When did China first join the UPU? 138.5 FACTOID Who is the Director-General of the UPU? 138.6 OTHER Other

Target 139 : Organization of Islamic Conference (OIC) 139.1 FACTOID When was the Organization of Islamic Conference organized? 139.2 LIST Which countries are members of the OIC?

88 139.3 LIST Who has served as Secretary General of the OIC? 139.4 FACTOID Where was the 8th summit of the OIC held? 139.5 FACTOID Where was the 24th Islamic Conference of Foreign Ministers of the OIC held? 139.6 OTHER Other

Target 140 : PBGC 140.1 FACTOID What government organization goes by the acronym PBGC? 140.2 FACTOID Who is the head of PBGC? 140.3 FACTOID When was PBGC established? 140.4 LIST Employees of what companies are receiving benefits from this organization? 140.5 FACTOID What is the average waiting time for this organization to determine benefits? 140.6 OTHER Other

APPENDIX C: TREC 2005 QA FACTOID ANSWERS

90 66.1 (Aug(.|ust)? 12,? )?2000 66.2 (Capt(.?|ain) )?(Gennady )?Lyachin 66.3 Northern( Fleet)? 66.4 118( crewmen)? 66.6 Barents( Sea)? 67.1 (Lara )?Dutta 67.2 India 67.3 78 67.4 (Nicosia|Cyprus) 67.5 May.*13( 2000)? 68.1 (Tasmania|) 68.2 (April 28,? )?1996 68.3 (32|35)( (persons|people))? 68.4 (Martin )?Bryant 68.5 Australian? 69.1 (July 12,? )?1998 69.2 69.3 (3 ?- ?0|0-3) 69.4 les Bleus 69.5 (Stade de France|Stadium of France) 69.6 Aime Jacquet 70.1 (Feb.? 3,? )?1998 70.2 (Cavalese|Mount Cermis|Italian Alps) 70.3 (20|twenty)( people)? 70.4 (U.S.|U . S .|US|Marine Corps|) 70.5 (Richard )?Ashby 70.6 acquitt(ed|al) 71.1 fighter( (bombers?|planes?))? 71.2 400 to 500 miles per hour 71.3 Lockheed(-Martin)? 71.4 Bethesda ?,? (Maryland|Md) 71.5 (Pratt (and|&) Whitney|General Electric) 72.1 (Bombay|Mumbai|India) 72.2 (Bombay|Mumbai) 72.3 Malabar Hill 72.4 International India Film Awards 72.5 (2nd|second) 73.1 (impotence?|(erectile|sexual).*dysfunctions?) 73.2 Pfizer( Pharmaceuticals)?( Inc.?)? 73.3 State Drug Administration( of China)? 73.4 (S|s)ildenafil citrate 73.5 (April )?1998 74.1 ([cC]ollege|[uU]niversity|liberal arts) 74.2 (Greencastle(,? (Ind.?|Indiana))?|Ind.?|Indiana) 74.3 1837

91 74.4 (Robert )?(G.? )?Bottoms 74.5 (2,?200|2,334) 75.1 (Whitehouse Station(,? (N.J.|New Jersey))?|N.?J.?|New Jersey) 75.2 (drugs?|Vioxx|pharmaceutics?) 75.3 MRK 75.4 www.merck.com 75.6 Raymond V Gilmarman 76.1 (singer|actor|entertainer) 76.2 Going My Way 76.3 Der Bingle 76.4 White Christmas 76.5 Gonzaga 76.6 (74|73) 77.1 (Jan.? 10,? |January 10,? )?1949 77.2 (Marshall|Texas) 77.3 1973 77.4 Joe Frazier 78.1 (Sep.? 6,? |September 6,? )?1998 78.2 (March 23,? )?1910 78.4 ((movie |film )?director|filmmaker) 78.5 (The )?Emperor 78.6 actress 79.1 (May 21,? )?1998 79.2 (22|two ?dozen|25|Twenty-five) 79.4 (two|2)( students)? 79.5 15(-year)? 79.6 (five|5)( bombs)? 80.1 Nantucket( Island)?(,? Mass(.?|achusetts))? 80.2 (Ahmed( Mahmoud)? )?(el-|El |al )Habash(y|i) 80.3 ((Gamil|Gameel)( Hamid)? )?(Al-|al-|el-|El-|El )?Batout(y|i) 80.4 (15|17|18)( crew members)? 80.5 (197|199)( passengers)? 81.1 Real Quiet 81.3 (Bob )?Baffert 81.4 Victory Gallop 81.5 103,269 82.1 1947 82.2 (Sept.?|September) 30,? 1960 82.5 Elmer 83.1 palace 83.2 1793 83.5 (3|3.1|3.2|3.6|5.1|6) million 83.6 Pierre Rosenberg 84.1 Willamette?( Meteorite)? 84.2 (14 metric tons|15-ton)

92 84.3 Tomanoas 84.4 84.5 60(-| )ton(s)? 84.6 910( metric tons)? 85.2 (Carnival|Carnival( Cruise Lines|Corp.?)|Star( Cruises)?) 85.3 Great Stirrup Cay 85.4 fourth(-| )largest 85.5 chartered for gay passengers 86.1 (June 1993|November 1993) 86.2 heart attack 86.3 54 86.4 (Abdulsalam )?Abubakar 87.1 (Sept(.|ember) 29,? )?1901 87.2 (Nov(.|ember) 28,? )?1954 87.3 Physics 87.5 (first controlled nuclear chain reaction|first atom bomb) 87.6 fermium,? Fm 87.7 Italy 88.1 Atlanta 88.2 (James P.|Jim) Kelly 88.3 (Nov(.?|ember) 10,? )?1999 88.5 brown 88.6 (\$ 99.8 million|about $22 million) 89.1 (Williamsport|Pa|Pennsylvania) 89.2 Windsor Ave.? 89.4 (fifth|five|5) 89.5 1947 89.6 (http://)?www.littleleague.org 90.2 1,500( acres)? 90.4 (Thomas )?Jefferson 90.6 Antony Champ 91.1 Cliff Hillegass 91.2 (William )?Shakespeare 91.4 IDG( Books( Worldwide)?)? 91.5 5 million( copies)? 92.1 (four|4)( times)? 92.2 (two|2) 92.5 PGA( Championship)? 92.6 (Winnie|Winifred) 93.1 Jim Lehrer 93.2 90( |\-)minutes? 93.3 (University of )?Massachusetts 93.4 NBC 93.5 Florida 93.6 Oct(.|ober) 3

93 94.1 Mark Geragos 94.2 (Mark Barrett|Julie Myers|Kenneth Starr|W. Hickman Ewing Jr.) 94.3 innocent 94.5 (acquitt(ed|al)|innocent) 94.6 deadlock(ed)? 95.1 ((six|6)(.(4|8))? million|6,782,100) 95.2 (July 1,? )?1997 95.3 Jiang( Zemin) 95.4 Robin Cook 96.1 pine and bamboo 96.2 (2,?923 meters|2,920-meter) 96.4 Russia 96.5 ((Deborah )?Compagnoni|Karine Ruby) 96.6 (70|71|72|82) 97.1 Adam Duritz 97.2 1990 97.3 Mr. Jones 97.4 August and Everything After 98.1 (March 15,? )?1919 98.2 Paris 98.3 ((2.8|2.9) million|2800000) 98.4 Sons of the Legion 99.2 (July 14,? )?1912 99.3 (Okla(homa)?|Okemah(,? (Okla.?|Oklahoma))?) 99.4 1967 99.5 New York 99.6 Huntington's( (chorea|disease))? 100.1 (Santo Domingo|Dominican( Republic)?) 100.2 (Chicago )?Cubs 100.3 66 100.4 (Mark )?McGwire 100.5 70 100.6 ((National League|NL) )?(Most Valuable Player|MVP|most-valuable-player)( award)? 101.1 (1976|1977) 101.2 Audrey Weisiger 101.3 1999 101.4 2000 101.5 Lisa (Thornton|Weiss) 101.6 Fairfax(,? (VA|Virginia))? 102.1 Central Artery ?(/|-)? ?(Third Harbor )?(Tunnel )?Project 102.2 1991 102.3 (\$ 10.8 billion|10.8 billion dollars?|\$ ?2.6 billion) 102.4 (2004|(early )?2005) 102.5 (three|3|71/2|7.5)( |-)miles? 103.1 Atlanta

94 103.2 (St. Louis )?Rams 103.3 (23 ?-? ?16|16-23) 103.4 (72,?000|72,652) 103.5 132( plays)? 104.1 Detroit 104.2 trucks 104.3 Beetle 104.5 48 104.6 800,000 104.7 1907 105.1 Cascade( (M|m)ountains)? 105.2 (George )?Vancouver 105.3 May 18 ?,? 1980 105.4 57( people)? 105.5 8,363 feet 106.1 (New York )?Yankees 106.2 (San Diego( Padres)?|Padres) 106.3 (Scott )?Brosius 106.4 (four|4) 106.5 (Joe )?Torre 107.1 (31|32) miles 107.2 1987 107.3 1994 107.4 millions 107.5 Eurotunnel 108.1 Sony( Corporation)? 108.2 (Columbia|Tristar)( Pictures)? 108.3 ((John )?Calley|(Bob )?Wynne|(Mel )?Harris|Sagansky) 108.6 (Patrick Kennedy|Masayuki Nozoe) 109.1 (55|62) million( customers)? 109.2 17 109.3 12th( largest)? 109.4 ((Juan )?Villalonga|Alierta) 110.1 (aid|help)(ing)? (for|the)? blind( and visually impaired)?( people)? 110.2 (1917|1937) 110.4 ((Kajit )?Habanananda|(James )?Ervin|Augustin Soliva) 111.1 1959 111.2 (Ada)?(,? ?Mich(.?|igan)) 111.3 ((Richard|Dick) )?DeVos 112.1 (1940|1955) 112.2 (Oak Brook(,? Ill(.?|inois))?|Illinois) 112.3 \$ ?(31.8|33|35|36) billion 112.4 (Ray )?Kroc 113.1 actor 113.2 (racing|race-car driver)

95 113.3 Newman ?'?s ?Own 113.6 (Joanne )?Woodward 114.1 Reform( Party)? 114.2 (James( George)? )?Janos 114.5 Terry( Ventura)? 114.6 (two|2)( children)? 115.1 1906 115.2 Kennett Square(,? (Pa.?|Pennsylvania))? 115.3 (thousand|1,050)( ?- ?| )acres? 115.4 ((Pierre )((S.?|Samuel) ))?(d|D)u Pont 115.5 (a million|1,000,000|800,000)( visitors)? 115.6 May 116.1 ((Thurmont|Catoctin Mountains),? )?(Maryland|Md.?) 116.2 143(-| )acres? 116.3 Shangri-La 116.4 (during World War II|1942|1943) 116.5 (Franklin( D.?)? )?Roosevelt 117.1 (weed|vine) 117.2 (1800s|1920s|1876) 117.3 (Asia|Chin(a|ese)|) 117.5 (smother(ing|s)|grows over) everything 117.6 (Myrothecium verrucaria|fungus|sicklepod|herbicides) 118.1 1862 118.2 Congress 118.3 (3,399|3,408|3,410)( people)? 118.5 (Dr.? )?(Mary )?(Edwards )?Walker 118.6 nineteen 119.1 1903 119.2 Milwaukee(,? Wisconsin)? 119.3 motorcycles? 119.5 40 years 119.6 AMF 120.1 (nurse|hospice care crusader) 120.2 Port Angeles(,? Wash(.?|ington))? 120.3 Hospice of Clallam County 120.4 1978 120.6 72 121.1 ((marine )?biologist|author) 121.2 Springdale(,? (Pa|Pennsylvania))? 121.4 1962 121.5 DDT 121.6 (breast )?cancer 121.7 1964 122.1 ((January 1,? )?1735|(January 7,? )?1942) 122.2 1818

96 122.3 Granary (Bur(ying|ial) )?Ground 122.4 (April 18,? )?1775 122.5 (Boston|Charlestown(,? Mass(.?|achusetts))?) 122.6 Lexington(,? Mass(.?|achusetts))? 123.1 (July 2,? )?1942 123.2 (Universidad Iberoamericana|Harvard( University)?) 123.3 123.4 governor( of Guanajuato( state)?)? 124.1 (Sept(.?|ember) 1,? )?1923 124.2 Brockton(,? Mass(.?|achusetts))? 124.3 (Aug(.?|ust) 31,? )?1969 124.4 (plane )?crash 124.5 49( fights)? 125.2 (Dorothy( Park)? )?Benjamin 125.3 (four|five|4|5)( children)? 125.4 17 125.5 607 performances 125.6 48 126.1 (March 2,? )?1939 126.2 (Eugenio )?Pacelli 126.4 19((-| )years?(,? seven months and seven days))? 126.6 stroke 126.7 John XXIII 127.1 Annapolis(,? (Md.?|Maryland))? 127.2 1845 127.3 4,000 127.4 midshipm(a|e)n 127.5 ((Commodore )?(John )?Barry|(John Paul )?Jones) 128.1 (Oil Producing and|Orga?nization ?(of ?(the )?)?(Oil|Petroleum)) Exporting Countries 128.2 (11|13) 128.4 (Vienna|) 129.1 North Atlantic Treaty Organi(s|z)ation 129.2 (April 4,? )?1949 129.3 Washington 129.5 (Brussels(,? )?|Belgium) 130.1 ((undersea|underwater)? ?earthquakes?|landslides|volcanoe?s) 130.2 Pacific Ocean 130.3 (100 feet \(30 meters\)|100 feet|30 meters) 130.4 (450|500) (mph|miles per hour|milesperhour) 130.6 Japanese 131.1 (Zeppelin|airship|dirigible|(Zeppelin )?(dirigible )?(passenger )?airship) 131.2 (80 miles|90 mph) 131.3 (May 6(th)?,? )?1937 131.4 (Lakehurst,? ?)?(N.J.|New Jersey)? 131.5 9[67]( people)?

97 131.6 (3[56]( people)?|35 people on board and one on the ground|35 of the 97 people on board and a Navy crewman on the ground) 132.1 (February 16,? )?1942 132.2 Kim Il S(o|u)ng 132.3 (North Korea|Democratic People's Republic of Korea|DPRK) 132.5 Kim Young Sook 133.1 (Atlantic|Caribbean|Central America|( and )|Honduras|Nicaragua||) 133.2 (((the last week of |late )?October,? )?1998|late October and early November) 133.5 Honduras 134.1 (DNA( sequence)?|genetic blueprint|a complete haploid set of chromosomes of an organism) 134.4 (23( pairs of chromosomes)?|46( chromosomes)?) 134.5 ((3|four) billion( units of DNA| DNA units| chemical units)?|(80|100),000 genes) 135.1 (Iraq|China) 135.2 (U.?N.?|United Nations) 135.3 (April 1995|(December,? )?1996) 135.4 (May,? (20 )?)?1996 136.1 (Talib|((Imam )?Ali( (Ibn|Bin) (Abi|Abu) Talib( \(AS\))?)?)) 136.2 (Iraqi?|Najaf(,? Iraq)?) 136.3 son[- ]in[- ]law 136.4 (Imam )?Hussein 136.5 (1,000 years ago|680) 136.6 (10 percent|about 15%) 137.1 Quemoy 137.2 Taiwan 137.4 China 137.5 (a mile|(((two|2|1.8) miles)|((3|three) (km|kilometers)( \(1.8 miles\))?))) 137.6 (China|Taiwan) 138.1 1874 138.2 1948 138.4 (March 1,? )?1914 138.5 Thomas (E. )?Leavey 139.1 (1969|(March|May) 1970) 139.4 Tehran(,? (capital of )?Iran)? 139.5 (Jakarta(,? Indonesia)?|Indonesia) 140.1 Pension Benefits? Guaranty Corp(.|oration)? 140.2 David (M. )?Strauss 140.3 1974 140.5 3.2 years

REFERENCES

Abductive Inference Page. 1997. Abductive Inference in Reasoning and Perception. http://www.cis.ohio-state.edu/lair/Projects/Abduction/abduction.html.

Abney, S. 1996. Partial Parsing via Finite-State Cascades. In Proceedings of the Workshop on Robust Parsing, 8th European Summer School in Logic, Language and Information, 8-15. Prague, Czech Republic.

Ahn, K.; Bos, J.; Clark, S.; Curran, J.; Dalmas, T.; Leidner, J.; Smillie, M.; and Webber, B. 2005. Question Answering with QED and Wee at TREC-2004. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), NIST Special Publication 500-261. Gaithersburg, MD: National Institute of Standards and Technology.

Appelt, D.E. and Israel, D.J. 1999. Introduction to Information Extraction Technology. A Tutorial Prepared for the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999.

Brill, E. 1994. Some Advances in Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94). Seattle, Washington: American Association for Artificial Intelligence.

Fellbaum, C. ed. 1998. WordNet: An Electronic Lexical Database. Cambridge, Mass.: The MIT Press.

Gaizauskas, R.; Greenwood, M.; Hepple, M.; Roberts, I.; and Saggion, H. 2005. The University of Sheffield's TREC 2004 Q&A Experiments. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), NIST Special Publication 500-261. Gaithersburg, MD: National Institute of Standards and Technology.

Glinos, D. 1999. An Intelligent Editor for Natural Language Processing of Unrestricted Text, Master’s Thesis, Department of Computer Science, University of Central Florida, Orlando, Florida, 1999.

Gomez, F. 2000. Why Base the Knowledge Representation Language on Natural Language? In Journal of Intelligent Systems, Vol.10, No.2, 2000, Freund and Pettman Publishers.

Gomez, F. 2001. An Algorithm for Aspects of Semantic Interpretation Using an Enhanced WordNet. In Proceedings of the North American Chapter of the Association for Computational Linguistics annual meeting (NAACL 2001), June 2001, Pittsburgh, Pennsylvania.

99 Gomez, F. 2004a. Building Verb Predicates: A Computational View. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL-04) pp. 359-366. Barcelona, Spain, July 2004.

Gomez, F. 2004b. Grounding the Ontology on the Semantic Interpretation Algorithm. In Proceedings of the Second Global WordNet Conference (GWC) pp. 124-129. Masaryk University, Brno, Czech Republic, January 2004.

Greenwood, M.; Roberts, I.; and Gaizauskas, R. 2003. The University of Sheffield TREC 2002 Q&A System. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), NIST Special Publication 500-251. Gaithersburg, MD: National Institute of Standards and Technology.

Greenwood, M., Saggion, H. 2004. A Pattern Based Approach to answering Factoid, List and Definition Questions. In: Proceeding of the 7th RIAO Conference (RIAO 2004), Avignon, France, April 27, 2004.

Harabagiu, S.; Moldovan, D.; Clark, C.; Bowden, M.; Williams, J.; and Bensley, J. 2004. Answer Mining by Combining Extraction Techniques with Abductive Reasoning. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), 375-382. NIST Special Publication 500-255. Gaithersburg, MD: National Institute of Standards and Technology.

Hildebrandt, W., Katz, B., and Lin, J. 2004. Answering Definition Questions Using Multiple Knowledge Sources. In: Proceedings of the 2004 Human Language Technology conference / North American Chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL 2004), May 2-7, 2004, Boston, Massachusetts.

Kies, D. 2004. Modern English Grammar, In The HyperTextBooks, Department of English, College of DuPage, Glen Ellyn, Illinois http://www.papyr.com/hypertextbooks/engl_126

Litkowski, K. 2003. Question Answering Using XML-Tagged Documents. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002). NIST Special Publication 500-251. Gaithersburg, MD: National Institute of Standards and Technology.

Miller, G.A. 1995. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11, pp. 39-41, November 1995.

Montes-y-Gomez, M.; Lopez-Lopez, A.; and Gelbukh, A. 2000. Information Retrieval with Conceptual Graph Matching. In Proceedings of the 11th International Conference and Workshop on Database and Expert Systems Applications (DEXA-2000), Lecture Notes in Computer Science, N 1873, Springer-Verlag, pp. 312-321. Greenwich, England, September 2000.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceeding of Eleventh National Conference on Artificial Intelligence, 1993, pages 811-816.

100

Riloff, E, and Jones, R. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceeding of the Sixtenth National Conference on Artificial Intelligence (AAAI-99), 474-479.

Schone, P.; Ciany, G.; McNamee, P.; Mayfield, J.; Bassi, T.; and Kulman, A. 2005. Question Answering with QACTIS at TREC-2004. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), NIST Special Publication 500-261. Gaithersburg, MD: National Institute of Standards and Technology.

Voorhees, E.M. 2005a: Overview of TREC 2004. In Voorhees, E.M., and Buckland, L.P. eds.: Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), NIST Special Publication 500-261. Gaithersburg, MD: National Institute of Standards and Technology.

Voorhees, E.M. 2005b: Overview of the TREC 2004 Question Answering Track. In Voorhees, E.M., and Buckland, L.P. eds.: Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), NIST Special Publication 500-261. Gaithersburg, MD: National Institute of Standards and Technology.

Yang, H., Cui, H., Kan, M., Maslennikov, M., Qui, L., and Chua, T. 2004. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003). NIST Special Publication 500-255. Gaithersburg, MD: National Institute of Standards and Technology.

101