SEMANTIC ANALYSIS OF NATURAL LANGUAGE AND DEFINITE CLAUSE GRAMMAR USING STATISTICAL PARSING AND THESAURI

by
Björn Dagerman

A thesis submitted in partial fulfillment of the requirements for the degree of
BACHELOR OF SCIENCE in Computer Science

Examiner: Baran Çürüklü
Supervisor: Batu Akan

MÄLARDALEN UNIVERSITY
School of Innovation, Design and Engineering
2013

ABSTRACT

Services that rely on semantic computations over users' natural-language input are becoming more frequent. Computing semantic relatedness between texts is problematic due to the inherent ambiguity of natural language. The purpose of this thesis was to show how a sentence could be compared to a predefined semantic Definite Clause Grammar (DCG). Furthermore, it should show how a DCG-based system could benefit from such capabilities.

Our approach combines openly available, specialized NLP frameworks for statistical parsing, part-of-speech tagging and word-sense disambiguation. We compute the semantic relatedness using a large lexical and conceptual-semantic thesaurus. We also extend an existing programming language for multimodal interfaces that uses static, predefined DCGs: COactive Language Definition (COLD). That is, every word that COLD should accept needs to be explicitly defined. By applying our solution, we show how our approach can remove dependencies on word definitions and improve grammar definitions in DCG-based systems.

(27 pages)

SAMMANFATTNING

Services that depend on semantic computations over users' natural speech are becoming increasingly common. Computing semantic similarity between texts is problematic because of the ambiguity that natural speech carries. The purpose of this thesis is to show how a sentence can be compared with a predefined semantic Definite Clause Grammar (DCG). In addition, the work should show how a DCG-based system can benefit from that capability.
Our approach combines openly available, specialized NLP frameworks for statistical parsing, part-of-speech tagging and word-sense disambiguation. We compute the semantic similarity using a large lexicographic and conceptual-semantic thesaurus. Furthermore, we extend an existing programming language for multimodal interfaces that uses statically predefined DCGs: COactive Language Definition (COLD). That is, every word that COLD should be able to accept must be explicitly defined. By applying our solution, we show how our method can reduce dependencies on word definitions and improve grammar definitions in DCG-based systems.

(27 pages)

CONTENTS

ABSTRACT
SAMMANFATTNING
LIST OF FIGURES

CHAPTER
1 INTRODUCTION
    1.1 Thesis Statement and Contributions
2 BACKGROUND
    2.1 Statistical Parsing
    2.2 Semantic Knowledge Base
    2.3 Word Sense Disambiguation
    2.4 COactive Language Definition (COLD)
3 COMPUTING SEMANTIC RELATEDNESS
    3.1 Semantic Comparison of Words
    3.2 Dynamic Rule Declarations
    3.3 Comparing Natural Language and Definite Clause Grammar
4 RESULTS
    4.1 Semantic Analysis of Natural Language and DCGs
    4.2 A Framework for Computing Semantic Relatedness
5 CONCLUSIONS
REFERENCES
LIST OF FIGURES

Figure
2.1 COLD source code sample
3.1 Graph walk in WordNet
3.2 POS dictionary sample
3.3 Defining almost identical COLD rules
3.4 Nested rule declarations
3.5 Simple dictionary sample
3.6 Using dictionary-defined parameters
3.7 Using multiple dictionary-defined parameters in one rule
4.1 COLD semantic DCG sample
4.2 A dictionary sample
4.3 A natural sentence to be used as an input
4.4 Test application

CHAPTER 1
INTRODUCTION

Applications for computing semantic similarity are becoming more prevalent [1, 5]. Semantic similarity is a measure of how close in meaning two concepts are. These concepts are associated with some ambiguity, normally related to natural language. Comparisons could be between two words, two natural sentences, or a sentence and a predefined statement (a rule). Such a rule would have some grammatical structure, e.g., its grammar could be described as a set of definite clauses in first-order logic. Such a representation is denoted Definite Clause Grammar (DCG).

A problem with DCGs is that they are static as to what they can express. Every possible combination of words needs to be explicitly defined in order to be accepted by a rule. This is a complex problem for systems that compare a user's natural language with DCGs, because a user cannot be expected to construct sentences that exactly match the rules. As such, rules are defined with close consideration of natural language, as can be seen in [3].

Conventional approaches for comparing texts fail to deliver human-level (common sense) results [24].
This is understandable given the many different ways in which semantically identical sentences can be expressed in natural language. Computed semantic similarity measurements are used in a wide range of services that rely on natural language. Therefore, increasing the confidence with which texts can be compared is desirable in many natural language processing applications, including machine translation [20], conversational agents [23], web-page retrieval (e.g., by search engines) and image retrieval [21, 19].

This thesis is part of a larger project within the field of human-robot interaction, aiming to reduce the drawbacks of robot deployment for small and medium enterprises (SMEs). The investment cost of deploying robots for SMEs is partly related to hardware purchases, but also to the cost of contracting expert robot programmers. A high-level natural language framework is proposed in [4], which aims to remove the dependency on robot programmers, allowing such tasks, and (re)purposing, to be performed by manufacturing engineers.

COLD, or COactive Language Definition, is a high-level programming language for the rapid development of multimodal interfaces [3]. In its current version, possible multimodal sentences are described as a context-free definite clause grammar. That is, every word that COLD should accept needs to be explicitly written as a rule in a COLD grammar file. This limits the capabilities of the language and makes programming in it cumbersome. For instance, the sentences "go to your house" and "go home" are similar both semantically and lexically. However, two different rules must be defined to handle both cases.

1.1 Thesis Statement and Contributions

This thesis contributes to the field of natural language processing. Specifically, it discusses the benefits of extending static definite clause grammar (DCG) systems, which match users' inputs with predefined rules, with tools for semantic analysis.
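The limitation of static rules, and the kind of relaxation a thesaurus can offer, may be illustrated with a minimal Python sketch. It is neither COLD syntax nor the algorithm of this thesis: the rule table STATIC_RULES, the hand-written TOY_THESAURUS, the FILLER_WORDS set and both matcher functions are hypothetical names introduced only for illustration, and a real system would consult a lexical database and a parser rather than small dictionaries.

```python
# Hypothetical toy rule table; this is NOT COLD syntax, only an illustration.
STATIC_RULES = {
    "go_home": [["go", "home"]],
}

# Hand-written stand-in for a thesaurus lookup (assumption: a real system
# would query a large lexical database instead of this dictionary).
TOY_THESAURUS = {
    "home": {"house", "residence"},
    "go": {"move", "travel"},
}

# Words the relaxed matcher ignores (assumption made for this sketch only).
FILLER_WORDS = {"to", "your", "the", "my"}


def match_static(sentence, rules):
    """Return the names of rules whose word sequence equals the input exactly."""
    tokens = sentence.lower().split()
    return [name for name, variants in rules.items() if tokens in variants]


def related(word, rule_word, thesaurus):
    """True if the words are identical or listed as related in the thesaurus."""
    return word == rule_word or word in thesaurus.get(rule_word, set())


def match_relaxed(sentence, rules, thesaurus):
    """Match after dropping filler words and allowing thesaurus-related words."""
    tokens = [t for t in sentence.lower().split() if t not in FILLER_WORDS]
    matches = []
    for name, variants in rules.items():
        for variant in variants:
            same_length = len(tokens) == len(variant)
            if same_length and all(
                related(t, r, thesaurus) for t, r in zip(tokens, variant)
            ):
                matches.append(name)
                break  # one matching variant is enough for this rule
    return matches


if __name__ == "__main__":
    for text in ("go home", "go to your house"):
        print(text, match_static(text, STATIC_RULES),
              match_relaxed(text, STATIC_RULES, TOY_THESAURUS))
```

Under these assumptions, the exact matcher accepts "go home" but rejects "go to your house", whereas the relaxed matcher maps both inputs to the single rule go_home, so no second rule has to be written.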
The thesis also supplies a modular framework for DCG–text and text–text comparisons. Although this framework is not the purpose of the thesis, it can serve as a basis for further development and verification. The purpose of this thesis is to:

1. present an approach for computing semantic confidence measurements between natural-language phrases and DCGs, and
2. show how an existing static DCG-based system can be extended with said functionality.

Our algorithm (Section 4.1) combines common natural language processing tasks such as statistical parsing, tokenization, part-of-speech tagging and word-sense disambiguation. We apply these techniques in the context of the combined word sense of the input and the DCGs. We extend the understood meaning of concepts using large lexical and semantic thesauri. Doing so, our algorithm successfully matches linguistic phrases with (somewhat) ambiguous predefined grammar. We show how our approach can benefit DCG-based systems. This includes:

• Greater freedom in the definitions of the grammar rules.
• Not requiring all usable words to be predefined.
• Shifting parts of the parsing to the statistical parser, which allows early termination of parses where further traversal of the parse tree would not be beneficial.

Furthermore, this could enable rule definitions to be written without explicit consideration of users' natural language, and instead in a way that better conveys the semantic goal of the rule. Although our approach focuses on DCG-based systems, it is still applicable in any system where the semantic relatedness of two texts is desired, because, in essence, a semantic DCG is a text.

CHAPTER 2
BACKGROUND

Before a meaningful semantic comparison can be performed on two sentences of natural text, they first need to be parsed. This process involves chunking the sentences and tagging the individual words of a given phrase with its corresponding part-of-speech