SEMANTIC ANALYSIS OF NATURAL AND DEFINITE CLAUSE GRAMMAR USING STATISTICAL PARSING AND THESAURI

by

Björn Dagerman

A thesis submitted in partial fulfillment of the requirements for the degree

of

BACHELOR OF SCIENCE

in

Computer Science

Examiner: Baran Çürüklü

Supervisor: Batu Akan

MÄLARDALEN UNIVERSITY School of Innovation, Design and Engineering

2013

ABSTRACT

Services that rely on semantic computations over users’ natural linguistic inputs are becoming more frequent. Computing semantic relatedness between texts is problematic due to the inherent ambiguity of natural language. The purpose of this thesis was to show how a natural sentence could be compared to a predefined semantic Definite Clause Grammar (DCG).

Furthermore, it should show how a DCG-based system could benefit from such capabilities.

Our approach combines openly available specialized NLP frameworks for statistical parsing, part-of-speech tagging and word-sense disambiguation. We compute the semantic relatedness using a large lexical and conceptual-semantic thesaurus. Also, we extend an existing programming language for multimodal interfaces, which uses static predefined DCGs: COactive Language Definition (COLD). That is, every word that should be acceptable by COLD needs to be explicitly defined. By applying our solution, we show how our approach can remove dependencies on word definitions and improve grammar definitions in DCG-based systems.

(27 pages)

SAMMANFATTNING

Tjänster som beror på semantiska beräkningar av användares naturliga tal blir allt vanligare. Beräkning av semantisk likhet mellan texter är problematiskt på grund av naturligt tals medförda tvetydighet. Syftet med det här examensarbetet är att visa hur en mening kan jämföras med en fördefinierad semantisk Definite Clause Grammar (DCG). Dessutom bör arbetet visa hur ett DCG-baserat system kan dra nytta av den möjligheten.

Vårt tillvägagångssätt kombinerar öppet tillgängliga specialiserade NLP-frameworks för statistical parsing, part-of-speech tagging och word-sense disambiguation. Vi beräknar den semantiska likheten med hjälp av en stor lexikografisk och konceptuellt semantisk synonymordbok. Vidare utökar vi ett befintligt programmeringsspråk för multimodala gränssnitt som använder statiskt fördefinierade DCGs: COactive Language Definition (COLD). Dvs. alla ord som ska kunna accepteras av COLD måste explicit definieras. Genom att tillämpa vår lösning visar vi hur vår metod kan minska beroenden på ord-definitioner och förbättra grammatik-definitioner i DCG-baserade system.

(27 sidor)

CONTENTS

ABSTRACT
SAMMANFATTNING
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
1.1 Thesis Statement and Contributions
2 BACKGROUND
2.1 Statistical Parsing
2.2 Semantic Knowledge Base
2.3 Word Sense Disambiguation
2.4 COactive Language Definition (COLD)
3 COMPUTING SEMANTIC RELATEDNESS
3.1 Semantic Comparison of Words
3.2 Dynamic Rule Declarations
3.3 Comparing Natural Language and Definite Clause Grammar
4 RESULTS
4.1 Semantic Analysis of Natural Language and DCGs
4.2 A Framework for Computing Semantic Relatedness
5 CONCLUSIONS
REFERENCES

LIST OF FIGURES


2.1 COLD source code sample

3.1 Graph walk in WordNet

3.2 POS dictionary sample

3.3 Defining almost identical COLD rules

3.4 Nested rule declarations

3.5 Simple dictionary sample

3.6 Using dictionary-defined parameters

3.7 Using multiple dictionary-defined parameters in one rule

4.1 COLD semantic DCG sample

4.2 A dictionary sample

4.3 A natural sentence to be used as an input

4.4 Test application

CHAPTER 1

INTRODUCTION

Applications for computing semantic similarity are becoming more prevalent [1, 5]. A semantic similarity is a comparison of how similar in meaning two concepts are. Such comparisons are associated with some ambiguity, normally stemming from natural language. Comparisons could be between two words, two natural sentences, or between a sentence and a predefined statement (a rule). Such a rule would have some grammatical structure, e.g., its grammar could be described as a set of definite clauses in first-order logic. Such a representation is denoted Definite Clause Grammar (DCG). A problem with DCGs is that they are static as to what they can express. Every combination of possible words needs to be explicitly defined in order to form acceptable rules. This is a complex problem for systems comparing a user’s natural language with DCGs, because a user cannot be expected to construct sentences exactly matching rules. As such, rules are defined with close consideration of natural language, as can be seen in [3].

Conventional approaches for comparing texts fail to deliver human-level (common sense) results [24]. This is understandable due to the many different ways semantically identical sentences can be expressed using natural language. Computed semantic similarity measurements are used in a wide range of services which rely on natural language. Therefore, increasing the confidence with which texts can be compared is desirable in many natural language processing applications, including: machine translation [20], conversational agents [23], web-page retrieval (e.g., by search engines) and image retrieval [21, 19].

This thesis is part of a larger project within the field of human-robot interaction, which aims to lower the barriers to robot deployment for small and medium enterprises (SMEs).

The investment cost of deploying robots for SMEs is partly related to hardware purchases, but also to the cost of contracting expert robot programmers. A high-level natural language framework is proposed in [4], which aims to remove the dependency on robot programmers, allowing programming and (re)purposing tasks to be performed by manufacturing engineers.

COLD, or COactive Language Definition, is a high-level programming language for the rapid development of multimodal interfaces [3]. In its current version, possible multimodal sentences are described as a context-free definite clause grammar. That is, every word that is desired to be acceptable by COLD needs to be explicitly written as a rule in a COLD grammar file. This limits the capabilities of the language and makes programming in it cumbersome. For instance, the sentences ”go to your house” and ”go home” are similar both semantically and lexically. However, two different rules must be defined to be able to handle both cases.

1.1 Thesis Statement and Contributions

This thesis contributes to the field of natural language processing. Specifically, it discusses the benefits of extending static definite clause grammar (DCG) systems, which match users’ inputs against predefined rules, with tools for semantic analysis. It also supplies a modular framework for DCG-to-text and text-to-text comparisons. Although said framework is not the purpose of this thesis, it can serve as a basis for further development and verification.

The purpose of this thesis is to:

1. present an approach for computing semantic confidence measurements between natural lingual phrases and DCGs, and

2. show how an existing static DCG-based system can be extended with said functionality.

Our algorithm (Section 4.1) combines common natural language processing tasks such as statistical parsing, tokenization, part-of-speech tagging and word-sense disambiguation.

We apply these techniques through the context of the combined word sense of the input and DCGs. We extend the understood meaning of concepts using large lexical and semantic thesauri. Doing so, our algorithm successfully matches linguistic phrases with (somewhat) ambiguous predefined grammar. We show how our approach can benefit DCG-based systems.

This includes:

• Greater freedom in the definitions of the grammar rules.

• Not requiring all usable words to be predefined.

• Parts of parsing can be shifted to the statistical parser, allowing for early termination of parses where further traversal of the parse tree would not be beneficial.

Furthermore, this could enable rule definitions to be written without explicit consideration of users’ natural language, and instead in a way that better conveys the semantic goal of the rule. Although our approach focuses on DCG-based systems, it is applicable in any system where the semantic relatedness of two texts is desired, because in essence, a semantic DCG is a text.

CHAPTER 2

BACKGROUND

Before a meaningful semantic comparison can be performed on two sentences of natural text, they first need to be parsed. This process involves chunking the sentences and tagging the individual words of a given phrase with their corresponding parts-of-speech (POS). In addition, some semantic relation (known through a knowledge base) must exist between the words. This chapter presents a broad overview of related work. Furthermore, common specialized frameworks for natural language processing tasks are presented.

Agent/dialogue systems use Semantic-based Conversational Agents (SCAF) to interact with users through dialogue, often with scripted responses [23]. Without understanding the semantics of a user’s input, no intelligible response is possible. Untranslated words are a major problem in machine translation, a common cause being statistical training of systems where only a small percentage of the corpus is understood [20]. Web-page retrieval often involves comparing header titles of pages. This is discussed in [21] among algorithms for web and text classification. The related subject of image retrieval often relies on semantic hierarchies to classify related images [19]. Social bookmarking enables users to annotate and share bookmarks of web-pages. Comparing tags to find related sites requires semantic knowledge of the tag concepts [22]. Tagging of images on social networks is a recent application which combines these techniques.

2.1 Statistical Parsing

A statistical parser is a type of natural language parser which, given a text, determines its structure: which words are related, which subjects and objects belong to which verb, and which words are verbs/nouns/adjectives. The parser tries to produce the most likely analysis of the text, using knowledge previously obtained through supervised parses [12].

There are two actions of specific interest provided by the statistical parser for use in semantic similarity comparison: tokenization and part-of-speech tagging:

Tokenization: splitting the given text into a set of tokens, where a token represents the smallest meaningful unit, i.e., the sentence "its color is red" would yield the tokens: "its", "color", "is", "red".

Part-of-speech (POS) tagging: every token is tagged with its corresponding POS, i.e., red, which is an adjective, is tagged with adjective; every noun is tagged with noun, etc.

OpenNLP is a machine learning library for processing natural language that can perform the above desired tasks [6]. Other features include: sentence segmentation, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.
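These two steps can be sketched in a few lines of Python. Note that this is only an illustration of the tokenize/tag pipeline, not the OpenNLP API: the toy lexicon below is a hypothetical stand-in for a trained statistical model.

```python
# Minimal illustration of tokenization and POS tagging.
# The thesis uses OpenNLP; this sketch replaces its trained
# statistical model with a hypothetical toy lexicon.

TOY_LEXICON = {
    "its": "PRP$", "color": "NN", "is": "VBZ", "red": "JJ",
}

def tokenize(text):
    """Split a sentence into its smallest units (tokens)."""
    return text.lower().split()

def pos_tag(tokens):
    """Tag each token with its part of speech from the lexicon."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]

tokens = tokenize("its color is red")
print(tokens)            # ['its', 'color', 'is', 'red']
print(pos_tag(tokens))   # [('its', 'PRP$'), ('color', 'NN'), ('is', 'VBZ'), ('red', 'JJ')]
```

A real statistical tagger would resolve ambiguous tokens (e.g., a word that can be both noun and verb) from context, which a lookup table cannot do.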

2.2 Semantic Knowledge Base

WordNet is the outcome of a research project at Princeton University regarding the creation of a large database of English words [2]. It is available both as an online and an offline resource. Any given word in WordNet is tagged as being part of one, or more, parts-of-speech (POS): verbs, adjectives, nouns or adverbs. Words of the same POS with similar (conceptual) meaning are grouped together in synonym sets, denoted synsets.

A synset contains a short text description of its conceptual meaning in addition to its list of words (synonyms). More interestingly, the synsets also contain semantic and lexical references to other synsets. That is, the various synsets reference each other, forming complex networks. The references are of a super-subordinate relation, meaning general sets link to more specific ones (and vice versa), e.g., the set containing the word { red } would link to the set with { crimson }, as crimson is a special kind of red. Furthermore, it would also link to { color }, as red is a color.

Relations are inherited from superordinates, meaning that from the previous example the conclusion could be made that crimson must be a color, even though it is not directly connected. The same conclusions cannot be drawn from the subordinate’s point of view, e.g., a color could be crimson, but it certainly doesn’t have to be. The same is true for verbs, where hypernym and troponym are related, e.g., { walk } is linked with the synset { travel, go, move }, since walking is a more precise way of moving.

For the most part there are no connections between different POS. Some adjectives have a special relation denoted attribute. An attribute describes the given adjective’s connection to a noun, linking two POS together (e.g., linking the adjective happy with the noun happiness).
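The super-subordinate structure described above can be sketched as a small graph. The synset names below are illustrative stand-ins, not real WordNet identifiers; the point is that relations are inherited by following hypernym links upward.

```python
# Toy WordNet-like graph: each synset maps to its hypernym
# (its more general synset). Names are illustrative only.
HYPERNYM = {
    "crimson": "red",
    "red": "chromatic_color",
    "chromatic_color": "color",
    "color": "entity",
    "walk": "travel",
}

def hypernym_closure(synset):
    """All more-general synsets reachable via hypernym links."""
    out = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        out.append(synset)
    return out

# Relations are inherited upward: crimson is a color, even though
# the two synsets are not directly connected.
print(hypernym_closure("crimson"))  # ['red', 'chromatic_color', 'color', 'entity']
```

The reverse inference (that a color must be crimson) does not hold, which is why the closure is computed only in the hypernym direction.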

2.3 Word Sense Disambiguation

Word sense disambiguation is the process of determining which sense of a word is used in a text. For instance, consider the sentence ”I’m having an old friend for dinner”. Does the word having refer to what is being served, or does it refer to the company?

Techniques involve supervised, semi-supervised, unsupervised and cross-lingual evidence methods; they are explained in [13]. Other methods involve the use of dictionaries and knowledge bases. One such method is the Lesk algorithm. Lesk assumes that the words in a given phrase are likely related to each other and that the relations can be observed through the words’ definitions [8]. Words can thus be compared, through their definitions, to find the pairs that are most closely related.
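The core of the Lesk idea, choosing the sense whose definition overlaps most with the surrounding context, can be sketched as follows. The glosses below are invented for illustration; a real implementation would fetch them from a knowledge base such as WordNet.

```python
# Simplified Lesk: pick the sense whose gloss shares the most
# words with the context. Glosses are invented for illustration.
GLOSSES = {
    ("bank", "finance"): "an institution that accepts deposits of money",
    ("bank", "river"):   "sloping land beside a body of water",
}

def lesk(word, context):
    """Return the sense of `word` with maximal gloss/context overlap."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(GLOSSES[sense].split()))
    senses = [s for s in GLOSSES if s[0] == word]
    return max(senses, key=overlap)

print(lesk("bank", "he sat on the bank of the river watching the water"))
# ('bank', 'river')
```

Real variants also compare glosses against the glosses of neighboring words, rather than only against the raw context words.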

2.4 COactive Language Definition (COLD)

The main motivation behind this thesis is the inclusion of statistical parsing and semantic analysis, in an existing high-level framework for natural language, in human-robot interaction. The framework uses the programming language COactive Language Definition (COLD) for semantic Definite Clause Grammar (DCG) rule definitions. COLD allows for incremental multimodal grammar rules. Text, and speech (which can also be represented as text), are of most significant interest for our approach. The other modalities are explained in [3].

100 command( Retval ) --> set the color to red { Retval = setColor(red); }.
101 command( Retval ) --> set the color to blue { Retval = setColor(blue); }.
102
103 command --> pickup <Object> { pickup(obj); }
104
105 Object( Retval ) --> the box { Retval = (theBox); }.
106 Object( Retval ) --> the sphere { Retval = property(theSphere); }.

Figure 2.1: COLD source code sample

There are two categories of syntax: grammar definitions and semantic representation.

Both are illustrated in Figure 2.1. The semantic representation is any valid C# code and can be declared in between two ”{ }” clauses.

Note the Prolog-like syntax in Figure 2.1, with the --> arrow. The text on the right-hand side of the arrow represents the speakable grammar rule. That is, to call the command on line 100 the words ”set the color to red” should be the input. Calls can be nested, as shown on line 103, which declares a command with the property ”<Object>”, defined on lines 105-106. This implies that the valid inputs for rule 103 would be ”pickup the box” and ”pickup the sphere”.

The COLD syntax is declared in a script file, and an integrated development environment (COLD IDE) parses these files. It also provides XML files to be used together with Microsoft’s Speech API (for the speech recognition). The IDE compiles a multimodal grammar graph, which is used in COLD programs to reference the grammar. See [3] and [14] for further details.

CHAPTER 3

COMPUTING SEMANTIC RELATEDNESS

In its most general form the input and definite clause grammar system depends on (a) a set of predefined rules, e.g., ”Set the color to red”, ”Pick up the biggest box”, etc., and (b) some input (i.e., a text). Through analyzing the supplied input together with the rules, a semantic similarity value sim ∈ [0, 1] can be computed. A value of sim = 1 indicates a perfect, unambiguous relation, whereas sim = 0 implies no connection whatsoever. For every rule the general approach consists of:

1. tokenization and part-of-speech tagging,

2. identifying the words from the input with the closest semantic meaning to those of the given rule (word-sense disambiguation),

3. comparing pairs of words to find the overall highest scoring combination.

3.1 Semantic Comparison of Words

Popular techniques for evaluating word similarity are compared in [7]. Wu-Palmer similarity is one such technique, originally developed to allow inexact matches to achieve correct lexical selection in machine translation [10]. Wu-Palmer exploits the fact that within a conceptual domain, the relation between two concepts is directly related to how closely they are defined in the word hierarchy. This is the same hypothesis on which the knowledge base WordNet, described in section 2.2, is defined. Within a graph, consider the two concepts c1 and c2, which are connected at their least common super-concept c3. Denote the distances between the nodes as d1 = dist(c1, c3), d2 = dist(c2, c3) and d3 = dist(c3, root). The similarity sim between c1 and c2 can then be evaluated through:

sim(c1, c2) = (2 · d3) / (d1 + d2 + 2 · d3)    (3.1)

The possibility of multiple paths between c1 and c2 should be noted; naturally, the distances should follow the shortest path. Figure 3.1 shows an illustration of a graph search and the synonym sets along the shortest path.

Figure 3.1: Graph walk in WordNet from the hyponyms tangerine and red to their common hypernym chromatic/spectral color. The numbers represent the length of the walk.
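Equation (3.1) can be sketched directly. The concept graph below is a toy stand-in for WordNet, and the least common super-concept is passed in explicitly rather than searched for, to keep the sketch short.

```python
from collections import deque

# Toy concept graph (child -> parents); names are illustrative.
PARENTS = {
    "red": ["chromatic_color"],
    "tangerine": ["chromatic_color"],
    "chromatic_color": ["color"],
    "color": ["root"],
}

def dist(c, ancestor):
    """Shortest number of edges from concept c up to an ancestor."""
    q = deque([(c, 0)])
    while q:
        node, d = q.popleft()
        if node == ancestor:
            return d
        for p in PARENTS.get(node, []):
            q.append((p, d + 1))
    return None

def wu_palmer(c1, c2, lcs):
    """Eq. (3.1): sim = 2*d3 / (d1 + d2 + 2*d3)."""
    d1, d2 = dist(c1, lcs), dist(c2, lcs)
    d3 = dist(lcs, "root")
    return 2 * d3 / (d1 + d2 + 2 * d3)

# red and tangerine meet at their least common super-concept:
print(round(wu_palmer("red", "tangerine", "chromatic_color"), 3))  # 0.667
```

The deeper the shared super-concept lies (larger d3), the higher the similarity, which matches the intuition that siblings deep in the hierarchy are more alike than siblings near the root.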

3.2 Dynamic Rule Declarations

Declaring completely static rules is partly a tedious task for the system engineers but, more importantly, it constrains the users in how they can interact with the system. Relying on a user’s ability to construct lexically identical phrases to exactly match predefined grammar rules is not reasonable [17]. It might be admissible for a classical simple telephone redirection service, which states a few rules, e.g., ”Press 1 to ...”, ”Press 2 to ...”, where the user is presented with the rules and given instructions on how to supply an exact and unambiguous input (a key press). However, even such systems suffer, and often structure the instructions so that the last rule provides information on how to repeat the previous rules.

In recent years, dialogue-based systems have become more popular [1]. Such systems rely on a written (or spoken) phrase from the user to be used as an input. This poses

100 JJ:red,green,blue,orange,black,small,large,huge,
101 active,inactive
102 VB:add,take,change,attach,set,take
103 NN:house,home,building,box,mountain,sun
104 RB:quickly,slowly
105
106 ACTION:pick_up,put_down
107 OBJECT:box,sphere

Figure 3.2: Excerpt from a dictionary file showing the predefined POS categories JJ (adjectives), VB (verbs), NN (nouns), and RB (adverbs) on lines 100-104. The POS abbreviations are the same as those obtained through the statistical parsing, see section 2.1. Lines 106-107 contain user-defined categories.

a problem when declaring the rules: they should be formulated in such a manner as to best match the user’s expected natural language. To help alleviate this problem, a local dictionary for the rule system is proposed: a simple text file containing categories with comma-separated entries (words), as shown in Figure 3.2. The annotations follow the Penn Treebank [18] word-level tags, as they are commonly used and also present in the statistical parser.

Using the dictionary from Figure 3.2, a rule such as ”Perform <ACTION> on <NN>” would be able to match ”Perform pick_up on box” (as ”box” is a defined noun). However, it would equally match ”Perform pick_up on sun”. In cases where it is desirable to only have specific words matchable to a given rule or rules, optional user categories can be added.

For instance, defining the category OBJECT on line 107 allows the rule to be changed to ”Perform <ACTION> on <OBJECT>”.

Some words belong to several categories; for instance, the word ”word” can be a noun or a verb, depending on the context in which it is used. If such a word is desired to be usable in more than one POS, it should either be defined multiple times (once per POS category), or declared in a user-defined category.
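A dictionary file of this shape is straightforward to parse into categories. The sketch below reconstructs the format of Figure 3.2 (category name, colon, comma-separated words, with continuation lines); the exact parsing rules of the actual system are an assumption here.

```python
# Parse a dictionary file like the one in Figure 3.2 into categories.
# Lines without ':' are continuations of the previous category.
SAMPLE = """\
JJ:red,green,blue,orange,black,small,large,huge,
active,inactive
VB:add,take,change,attach,set
NN:house,home,building,box,mountain,sun
OBJECT:box,sphere
"""

def parse_dictionary(text):
    cats, current = {}, None
    for line in text.splitlines():
        if ":" in line:
            current, _, words = line.partition(":")
            cats[current] = []
        else:
            words = line  # continuation of the previous category
        cats[current] += [w for w in words.strip(",").split(",") if w]
    return cats

d = parse_dictionary(SAMPLE)
print(sorted(d))  # ['JJ', 'NN', 'OBJECT', 'VB']
```

Note how ”box” ends up in both NN and the user-defined OBJECT category, matching the multi-category behavior described above.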

The rule definitions are further improved by allowing the statistical parser and semantic analyzer to operate on the rules, in the context of the input. For instance, consider the rule ”Set the color to <JJ>”, using the above dictionary. If given an input such as ”Set the color to crimson”, a match would not be possible in a static environment (as crimson is unknown). However, the semantic analyzer could perform a word-sense disambiguation, realizing that crimson is a color, a special type of red, allowing the rule ”Set the color to red” to be matched. Furthermore, this allows rules to be defined without consideration of an agent’s natural language, but rather in a way that best explains the semantic goal of the rule.

3.2.1 Extending COLD Grammar

The COLD language syntax is explained in section 2.4. The main limitation of the grammar defined rules is that they are completely static. For instance, say we want to define a rule to change the color property of an object. In COLD this is possible through two solutions, as shown in Figures 3.3 and 3.4.

100 command( Retval ) --> set the color to red
101 { Retval = setColor( "red" ); }.
102
103 command( Retval ) --> set the color to blue
104 { Retval = setColor( "blue" ); }.
105
106 command( Retval ) --> set the color to green
107 { Retval = setColor("green"); }.

Figure 3.3: Defining almost identical new rules

Arguably, the second alternative is more dynamic, but both require predefinition of every single color to be used (i.e., in this example only the colors red, green and blue would be valid inputs). If we want to add another color then we need to write more rules.

To improve upon this solution we apply the concepts presented above regarding the local dictionary, as shown in Figure 3.2. Firstly, rules are added to the COLD compiler to allow any text to be written in between two ”< >” brackets. When the compiler matches such a token,

100 command( Retval ) --> set the color to <Color(clr)>
101 { Retval = setColor( clr ); }.
102
103 Color( Retval ) --> red { Retval = "red"; }.
104 Color( Retval ) --> blue { Retval = "blue"; }.
105 Color( Retval ) --> green { Retval = "green"; }.

Figure 3.4: Nested calls through defining another rule ”Color”

it (a) checks if it is defined in the script, or (b) checks if it is defined in the dictionary. If either is the case then it is valid. In case (a) it also appends these definitions to the dictionary. Failure to match both (a) and (b) results in a compile error. This allows a dictionary such as the one shown in Figure 3.5 to be used together with the code snippet in Figure 3.6 to define all the rules from Figure 3.4.
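The compiler’s token-resolution logic just described can be sketched as a single function. The data structures below (a dict of script rules and a dict of dictionary categories) are assumptions for illustration, not the actual COLD compiler internals.

```python
# Sketch of <token> resolution: (a) a token defined as a rule in
# the script is valid, and its definitions are appended to the
# dictionary; (b) otherwise it must appear in the dictionary;
# failing both is a compile error.

def resolve_token(token, script_rules, dictionary):
    if token in script_rules:              # case (a)
        dictionary.setdefault(token, [])
        dictionary[token] += script_rules[token]
        return True
    if token in dictionary:                # case (b)
        return True
    raise SyntaxError(f"undefined token <{token}>")

script = {"Color": ["red", "blue", "green"]}
dic = {"JJ": ["red", "green", "blue"]}
assert resolve_token("Color", script, dic)      # (a): valid, appended
assert dic["Color"] == ["red", "blue", "green"]
assert resolve_token("JJ", script, dic)         # (b): found in dictionary
```

An undefined token (neither in the script nor in the dictionary) raises an error, mirroring the compile error described above.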

100 JJ:red,green,blue

Figure 3.5: A simple dictionary containing only three color adjectives

100 command( Retval ) --> set the color to <JJ> { Retval = setColor(JJ); }

Figure 3.6: A rule using a dictionary-defined parameter, JJ

A common bracket syntax is proposed to allow different words within the same dictionary category to be distinguishable within the semantic representation. See Figure 3.7 for a rule that would be able to match an input such as ”change the color from red to blue”.

Through these additions, COLD rules can be defined in a more natural way. The approach has no limitations on which words can be used. As such, categories containing

100 command( Retval ) --> change the color from <JJ> to <JJ> {
101     if (getColor == JJ[0])
102         setColor( JJ[1] );
103 }.

Figure 3.7: A rule using two of the same dictionary-defined parameters. They do not have to be represented by the same word. The two parameters are distinguished in the semantic representation (within the ”{ }”) through indexing in ”[ ]” brackets.

synonyms could be defined, allowing for even more general inputs. However, this still requires the structure of the input sentence to be the same as that of the rule. Instead of relying on defined synonyms in the dictionary, this information could be obtained through a semantic knowledge base. The statistical parser, semantic analyzer and the semantic knowledge base are all contained within the same .NET library, making the integration into COLD simple.

The extra overhead associated with the added NLP methods should be considered. COLD relies on multimodal inputs, where speech is the most interesting for our purpose. We suggest that because these computations can be performed while an agent is speaking, the calculations will have sufficient time to conclude before the speech is complete (as it is incrementally parsed). As such, no additional overhead would be present in this environment. An asymptotic time complexity analysis of this algorithm was outside the scope of this thesis; such an analysis should be carried out in order to support the above claim. See Chapter 5 Conclusions.

3.3 Comparing Natural Language and Definite Clause Grammar

Comparing any two natural language texts is a similar problem to that of comparing one text with definite clause grammar (DCG). The different DCG rules together provide a context in which the natural language text can be compared. Such a context is not present in the former problem.

Firstly, the input’s (a natural language text) word sense is disambiguated using an adaptation of the Lesk algorithm [9], partly as suggested in [11]. The Lesk algorithm operates on the hypothesis that neighboring words in a text have related meanings. This idea is further extended by using the context of the different rules, together with a dictionary, as suggested in Section 3.2 Dynamic Rule Declarations. Through this process the input is modified by replacing words with synonyms or hypernyms from the knowledge base, so as to better match the context of the DCG.

Secondly, all individual pairs of words obtainable by selecting one word from the input and one word from the rule are compared to find the semantic similarity between each of them, as explained in Section 3.1 Semantic Comparison of Words. Pairs of words are selected so as to find the overall highest scoring combination. However, in the case where multiple dictionary tags are defined in the DCG, i.e., the two JJ’s in ”Change the color from <JJ> to <JJ>”, different matches are required for each tag. Consider the input ”Turn the red to blue”; the word red can only be matched with one of the JJ tags, not both. This is exclusive to matching dictionary tags and has no impact on how red is matched with other words, i.e., color.

Through this process the input is scored against all rules, yielding a comparison confidence value together with the suggested dictionary matches, for each rule. Provided with this data, selecting which rule or rules to fire is a simple design decision (e.g., all rules with confidence > 0.5).
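The exclusivity constraint on dictionary tags can be sketched as a small assignment search: each tag in a rule must consume a different input word, and the assignment with the highest total similarity wins. The similarity lookup below is a hypothetical stand-in for the Wu-Palmer scores of Section 3.1, and exhaustive permutation search is used only because the number of tags per rule is small.

```python
from itertools import permutations

# Stand-in similarity scores (a real system would compute these
# with Wu-Palmer over the knowledge base).
SIM = {("red", "crimson"): 1.0, ("blue", "crimson"): 0.9,
       ("red", "big"): 0.1, ("blue", "big"): 0.1}

def sim(a, b):
    return SIM.get((a, b), 0.0)

def best_assignment(input_words, tag_candidates):
    """Exclusively assign one input word per tag, maximizing total sim."""
    best, best_score = None, -1.0
    for combo in permutations(input_words, len(tag_candidates)):
        score = sum(sim(w, c) for w, c in zip(combo, tag_candidates))
        if score > best_score:
            best, best_score = combo, score
    return best, best_score

# Two tags may not both consume the same input word:
combo, score = best_assignment(["red", "blue"], ["crimson", "big"])
print(combo, round(score, 2))  # ('red', 'blue') 1.1
```

Even though ”blue” also scores well against ”crimson”, assigning ”red” there leaves ”blue” free for the second tag, giving the higher overall total.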

CHAPTER 4

RESULTS

In this chapter, detailed annotated results of a semantic analysis, using the discussed methods, are presented in Section 4.1. In addition, a general framework for semantic relatedness is presented in Section 4.2.

4.1 Semantic Analysis of Natural Language and DCGs

This section details the semantic analysis algorithm, using the methods discussed in Chapter 3 Computing Semantic Relatedness. First consider a set of DCGs as shown in Figure 4.1, together with a lexical dictionary as shown in Figure 4.2. Secondly, consider the input from Figure 4.3.

100 command( Retval ) --> Set the color to <JJ>
101 { Retval = setColor( JJ ); }.
102
103 command( Retval ) --> Change the color from <JJ> to <JJ>
104 {
105     var color = getColor();
106     if (color == JJ[0]) Retval = setColor( JJ[1] );
107     else Retval = color;
108 }
109
110 command( Retval ) --> Change the size to <JJ>
111 { Retval = setSize( JJ ); }.

Figure 4.1: Snippet from a COLD script showing some semantic DCGs. The speakable components are to the right of the --> arrows

That is, the acceptable sentences would be ”Set the color to <JJ>”, ”Change the color

100 JJ:yellow,crimson,green, ... , big,small,large
101 NN: ...
102 VB: ...
103 RB: ...
104 ...

Figure 4.2: A dictionary file detailing some of its color and size adjectives

from <JJ> to <JJ>” and ”Change the size to <JJ>”, where <JJ> is any word defined in the dictionary entry represented by line 100 in Figure 4.2.

100 "Turn it red"

Figure 4.3: A natural sentence to be used as an input

The semantic analysis through these criteria is performed as follows:

1. The input is tokenized (split into the smallest word units), e.g., ”Turn it red” → ”[Turn] [it] [red]”

2. The tokenized input is tagged with its part-of-speech (POS): ”[Turn] [it] [red]” → ”[VP Turn/VB ] [NP it/PRP ] red/JJ”

3. For every token, get the synonym set (synset) for that word, using its corresponding POS, from the thesaurus

4. For every rule:

(a) Get the synset for every dictionary entry of each contained category (<JJ>, <NN>, ...)

(b) Find the best matching tags (using Wu-Palmer similarity) with confidence value c ∈ [0, 1], such that no input word is matched twice (i.e., two <JJ>s in a rule cannot match one tag in the input)

i.e., rule 100: sim(red, crimson) = 1.0 ⇒ <JJ>[0] = crimson, and rule 110: sim(red, crimson) = 1.0 ⇒ <JJ>[0] = crimson

(c) Discard any rule that failed to match all tags, i.e., rule 103: sim(red, crimson) = 1.0, sim(N/A, X) = NaN; <JJ>[0] = crimson, <JJ>[1] = N/A, therefore discard rule 103

5. For the remaining rule(s), create the sentence s using the best matches from (4b) and remove all determiners:

s100: ”Set color to crimson”
s110: ”Change size to crimson”

6. For every word in the input and every sentence from (5), find the best matched sentence by comparing the similarity between pairs of words:

input: ”Turn it red”, [VP Turn/VB ] [NP it/PRP ] red/JJ
s100: ”Set color to crimson”, [VP set/VBN ] [NP color/NN ] [PP to/TO ] crimson/JJ
s110: ”Change size to crimson”, [VP change/VB ] [NP size/NN ] [PP to/TO ] crimson/JJ

s100: sim(turn, set) = 0.666, sim(red, crimson) = 1.0, sim(red, color) = 0.428 ⇒ s100 = 0.698
s110: sim(turn, change) = 0.571, sim(red, crimson) = 1.0, sim(red, size) = 0.0 ⇒ s110 = 0.523

7. Select the rule r corresponding to the best matched sentence from (6), with the POS matches from (4b):

r: ”Set the color to crimson”

Note how the selected rule is the closest match, even though the input contained none of the actual words. The input ”Turn it red” contains the color red, which is not defined in the lexical dictionary of accepted words (Figure 4.2). However, crimson is. Due to the word sense, the understood word crimson could be used instead. The grammar rules ”Set the color to <JJ>” and ”Change the size to <JJ>” are almost identical, except for the words color and size. Both use <JJ> tags, which do not restrict which words from that category can be matched. I.e., ”Change the size to crimson” and ”Set the color to crimson” are both equally good matches as far as the DCGs and lexical dictionary are concerned. Because of the word sense of the input and rules, the algorithm could determine that in this context, the word crimson is more closely related to color than it is to size. As such, the desired rule was matched.
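The final scores in step (6) can be reproduced by averaging the best word-pair similarities per rule sentence. The aggregation method (a plain mean over the matched pairs) is inferred from the reported numbers, not stated explicitly in the text.

```python
# Reproducing the scoring in step (6): each rule sentence's score
# is taken as the mean of its best word-pair similarities.
def sentence_score(pair_sims):
    return sum(pair_sims) / len(pair_sims)

s100 = sentence_score([0.666, 1.0, 0.428])   # "Set color to crimson"
s110 = sentence_score([0.571, 1.0, 0.0])     # "Change size to crimson"
print(round(s100, 3), round(s110, 3))
# 0.698 0.524 (the text reports 0.523, apparently truncating)
```

Whichever rounding convention is used, s100 > s110 holds, so ”Set the color to <JJ>” is selected.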

The example shows how the approach handles ambiguity. It could be noted that the ambiguity could instead have been avoided altogether by defining the DCGs better.

For instance, this specific problem could have been circumvented by using user-defined categories instead of general tags, i.e., defining the categories colors and sizes.

4.2 A Framework for Computing Semantic Relatedness

Computing the semantic relatedness of texts combines many different NLP techniques. Though these methods complement each other for our purpose, research mostly focuses on improving just one or a few of them in isolation. For instance, recent research has discussed the use of WordNet and how Wikipedia could replace it as a knowledge base [15, 16].

We have created a general modular framework for the computation of semantic relatedness, in which each module implements one NLP technique. All modules share a common interface that governs how they interact to perform the semantic analysis. As such, individual methods can be replaced as more capable solutions become available, or to better fit the needs of a specific system. For example, adding support for Wikipedia in conjunction with WordNet would not affect how statistical parsing or part-of-speech tagging is performed.
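The common-interface idea can be sketched like this. The class and method names are our own illustration, not the framework's actual API, and the exact-match backend is a trivial stand-in for the WordNet measure:

```python
from abc import ABC, abstractmethod

class Relatedness(ABC):
    """Interface for the relatedness module; backends are interchangeable."""

    @abstractmethod
    def sim(self, a: str, b: str) -> float:
        ...

class ExactMatch(Relatedness):
    """Trivial stand-in backend; WordNet or Wikipedia would slot in here."""

    def sim(self, a: str, b: str) -> float:
        return 1.0 if a == b else 0.0

class SemanticAnalyzer:
    """Depends only on the interface, so swapping the relatedness backend
    does not affect parsing or POS tagging elsewhere in the pipeline."""

    def __init__(self, relatedness: Relatedness):
        self.relatedness = relatedness

    def word_sim(self, a: str, b: str) -> float:
        return self.relatedness.sim(a, b)

analyzer = SemanticAnalyzer(ExactMatch())
```

Replacing `ExactMatch` with a different `Relatedness` implementation leaves the rest of the analyzer untouched, which is the point of the modular design.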

4.2.1 Test Application

A simple test application is provided to test the functionality of the different NLP techniques. It is included in the delivered modular framework for semantic analysis in COLD. A screenshot of the test environment can be seen in Figure 4.4.

Figure 4.4: Screenshot of the test application. The input ”increase it to large” in the upper input box is interpreted and the results are printed in the box below.

Any sentence can be written as input. The sentence is parsed, tokenized, and POS-tagged, and the gloss and synonyms for each individual word are retrieved from the thesaurus and shown. The input is matched against a set of DCGs; the confidence values for these matches are presented, together with the most closely matched rule. The application reads DCGs from a script file and passes them as text strings to a data structure. Alternatively, sentences could be passed directly to the data structure, in which case the input would instead be compared to any given text strings.
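The dual input path described above, DCG rules read from a script file or plain sentences passed directly, might be sketched as follows; the structure and names are illustrative assumptions, not the actual COLD data structure:

```python
class RuleStore:
    """Holds the texts the input is matched against: DCG rule strings
    read from a script file, or plain sentences added directly."""

    def __init__(self):
        self.entries = []  # (identifier, text) pairs

    def add_rule(self, rule_id, rule_text):
        # Rules arrive as text strings parsed out of the script file.
        self.entries.append((rule_id, rule_text))

    def add_sentence(self, text):
        # Alternative path: any sentence can be stored and matched directly.
        self.entries.append(("s%d" % len(self.entries), text))

store = RuleStore()
store.add_rule("r100", "set the color to crimson")
store.add_sentence("increase it to large")
```

Either way, the matcher only sees (identifier, text) pairs, so rules and free text are compared uniformly.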

CHAPTER 5

CONCLUSIONS

In this thesis we have proposed a general natural language processing framework for computing the semantic relatedness of texts. We have also shown how our approach can be used to solve the more specific problem of comparing a text with a definite clause grammar (DCG) rule. We did this by introducing a lexical dictionary to the DCG and computing the word-sense using the context of the DCGs and a thesaurus. Although the approach is explained through a DCG-based system, it is applicable in any system where the semantic relatedness of two texts is desired. This is because a semantic DCG includes a set of words: expanding the grammar with the possible acceptable words forms sentences. Although these sentences are likely fragmented, conceptually they are no different from a natural language text.
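The expansion argument can be made concrete with a small sketch. The tag syntax and the word list are illustrative assumptions, not COLD's actual notation:

```python
# Illustrative lexical dictionary: a category name mapped to its words.
lexicon = {"adjective": ["crimson", "large"]}

def expand(rule_tokens, lexicon):
    """Expand a rule containing dictionary tags into the set of sentences
    it accepts; each sentence can then be compared like natural text."""
    sentences = [[]]
    for tok in rule_tokens:
        if tok.startswith("<") and tok.endswith(">"):
            # A tag expands into every word of its category.
            words = lexicon[tok[1:-1]]
            sentences = [s + [w] for s in sentences for w in words]
        else:
            sentences = [s + [tok] for s in sentences]
    return [" ".join(s) for s in sentences]

sentences = expand(["set", "the", "color", "to", "<adjective>"], lexicon)
# -> ["set the color to crimson", "set the color to large"]
```

Each expansion is a (possibly fragmented) sentence, which is why techniques for comparing natural language texts carry over to DCG rules.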

We have shown how an existing incremental multimodal framework (COLD) can benefit from our approach compared with its current static grammar definitions. These benefits include: not requiring a list of all possible words, early termination of rule parsing, and the ability to make use of word types. The DCG rules can therefore be made much more dynamic. Furthermore, when defining the rules, no consideration of the users' natural language is required. As such, rules can be defined to best express their semantic meaning, minimizing the ambiguities of natural language. Before our solution was applied to COLD, to match a rule defined as ”set the color to crimson” a user would have to input exactly that. As can be seen in Section 4.1, using our method the input ”turn it red” could convincingly be matched to the rule ”set the color to crimson”, even though none of the words ”turn”, ”it”, or ”red” were explicitly defined. In this case it was sufficient for the words to be known from the thesaurus, in conjunction with a rule that conveyed its underlying semantic meaning, to reach the correct conclusion.

It should be noted that the goal of this thesis was not to find "the best" method for comparing DCGs and texts, but rather any method that would be applicable to COLD (see Section 2.4). Therefore, extensive testing of the suggested method was outside the scope of this thesis. Likewise, the effectiveness of the proposed algorithm for the comparison of natural language and DCGs (Sections 3.3 and 4.1) has not been thoroughly analyzed; further testing is required to determine it.

Our approach uses the thesaurus WordNet. Although it is a respected source, newer research suggests that Wikipedia can provide better results [15]. More details can be found in [16], which also discusses the benefits of combining WordNet, Wikipedia, and Google. Although an increase in the confidence values of the semantic relatedness could be achieved, the full implications of the (presumed) increase in asymptotic time and space complexity are unknown and need to be analyzed further.

Our approach has made the rule definitions of COLD more dynamic, in part due to the introduction of the lexical dictionary and the dictionary tags. In the current solution it is possible to define grammars using adjective, verb, noun, and adverb tags, in addition to user-defined categories. We would like to extend this to also cover phrases, i.e., verb phrases and noun phrases. This would further improve the grammar definitions and allow for even more dynamic rules.

REFERENCES

[1] Y. Li et al., “Sentence similarity based on semantic nets and corpus statistics,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138–1150, August 2006.

[2] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.

[3] A. Ameri, “A general framework for incremental processing of multimodal inputs,” in Proceedings of the 13th International Conference on Multimodal Interaction, 2011, pp. 225–228.

[4] B. Akan, “Human robot interaction solutions for intuitive industrial robot programming,” Licentiate Thesis No. 149, Mälardalen University, 2012.

[5] D. Michie, “Return of the imitation game,” 2001.

[6] (2010) Apache OpenNLP. [Online]. Available: http://opennlp.apache.org

[7] E. Blanchard, M. Harzallah, H. Briand, and P. Kuntz, “A typology of ontology-based semantic measures,” in EMOI-INTEROP, 2005.

[8] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998.

[9] M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone,” in Proceedings of the 5th Annual International Conference on Systems Documentation. ACM, 1986, pp. 24–26.

[10] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1994, pp. 133–138.

[11] S. Banerjee and T. Pedersen, “An adapted Lesk algorithm for word sense disambiguation using WordNet,” in Computational Linguistics and Intelligent Text Processing. Springer, 2002, pp. 136–145.

[12] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, 2003, pp. 423–430.

[13] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[14] M. Johnston, “Unification-based multimodal parsing,” in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 1998, pp. 624–630.

[15] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness of words and texts in Wikipedia-derived semantic space,” in IJCAI’07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.

[16] M. Strube and S. P. Ponzetto, “WikiRelate! Computing semantic relatedness using Wikipedia,” in AAAI, vol. 6, 2006, pp. 1419–1424.

[17] B. Galitsky, “Building a repository of background knowledge using semantic skeletons,” in AAAI Spring Symposium 2006 - Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, 2006, pp. 22–27.

[18] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[19] J. Deng, A. C. Berg, and L. Fei-Fei, “Hierarchical semantic indexing for large scale image retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 785–792.

[20] Y. Marton, C. Callison-Burch, and P. Resnik, “Improved statistical machine translation using monolingually-derived paraphrases,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 2009, pp. 381–390.

[21] X. Qi and B. D. Davison, “Web page classification: Features and algorithms,” ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 12, 2009.

[22] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme, “Evaluating similarity measures for emergent semantics of social tagging,” in Proceedings of the 18th International Conference on World Wide Web. ACM, 2009, pp. 641–650.

[23] K. O’Shea, “An approach to conversational agent design using semantic sentence similarity,” Applied Intelligence, vol. 37, no. 4, pp. 558–568, 2012.

[24] Y. Li, Z. A. Bandar, and D. McLean, “An approach for measuring semantic similarity between words using multiple information sources,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.