Vision Document 1.0

For

Natural Language to Machine Readable Format

Damian Aaron Tamayo
CIS 895 – MSE Project
Department of Computing and Information Sciences
Kansas State University

Committee Members
Dr. Daniel Andresen
Dr. David A. Gustafson
Dr. William Hsu

1. Introduction

1.1 Purpose

The idea of natural language processing (NLP) emerged between the mid-1940s and the early 1950s, before computers were widely available, with machine translation as its intended use. When NLP was first theorized, many saw it as a form of code breaking: English was treated as a “code” to be broken so that it could be translated into another “code”. This avenue was explored extensively in early NLP research, but it presented many problems. A major one was that language proved too complex for this way of thinking. Even when a sentence is stated in few words, fully understanding it takes knowledge of the environment in which it is used. Without that knowledge, a translation may be valid, but only under a specific set of circumstances, and not necessarily the circumstances intended when the sentence was written.

An example of such a sentence would be as follows:

Xavier bought a play ticket for Katherine in McCain.

While this sentence seems straightforward, several of its aspects can only be determined by knowing the environment in which it takes place. A cursory examination tells us that Xavier is the main agent of the sentence. From placement we also know that the action “bought” belongs to Xavier, that the object of “bought” is “play ticket”, and that all of this takes place in McCain. Finally, we know that the intended recipient is Katherine. However, although a cursory examination yields several pieces of information, the knowledge obtained is still incomplete. First of all, there is no conception of time between the two agents. Was Katherine present with Xavier when he bought the ticket? Or did Xavier buy the ticket in McCain and then give it to her later on? From examining the sentence alone, we can only assume that Katherine was present when the ticket was bought, even if this is not a correct interpretation of the sentence.

Furthermore, how do we make a computer understand such nuances of a sentence? How do we have a computer understand that the literal interpretation of a sentence may not be what is needed to understand its intended meaning? From this example we can see that perceiving language strictly as a code to be broken is far too narrow to represent language in its entirety. Language is far too varied in use, context, and interpretation to be coded directly from one language to another. Understanding language requires not only grammatical rules but also inference, context, and knowledge of the surrounding environment. Because language has such a dynamic nature, there are many avenues that can be taken to analyze and attempt to solve this problem.

As a result, many researchers choose to tackle small subsets of NLP in the hope that the solutions to these individual areas can later be combined into a solution to NLP in its entirety.

Some approaches focus on the input and how to correctly tag its parts of speech (POS). Other approaches focus on Named Entity Recognition (NER) along with chronological and geographical placement. These approaches ask questions such as: How do we tell the context of a word? When did an event happen, and how many times? Where does an event take place? The contextual meaning of words and phrases is vital to understanding the English language, and extracting this information in the correct context can be invaluable to later analysis. NER helps to extract the atomic elements of a sentence, such as people, locations, and quantities.

No matter what part of NLP is being solved, all approaches aim to increase the accuracy and precision of human-to-machine interaction.

Due to the complexity and challenges offered by NLP, the focus of the Master of Software Engineering (MSE) project is to develop a program that takes an English sentence as its input and produces a logical output that can then be utilized by another computer.

1.2 NLP

Natural language processing (NLP) concerns the translation of natural human language for communication between humans and computers. The term natural human language refers to spoken or written languages such as English, Spanish, and French. Communication here means translating this natural language into a logical output that a computer can read, so that the computer can understand the sentence at the level intended by its originator.

2. Project Overview

This section presents an overall view of how the system works.

2.1 Part of Speech Taggers

The POS taggers used in this project are stanfordNLP and openNLP. A POS tagger takes a sentence in some language, in our case English, and tags each word with its part of speech, such as noun, verb, adjective, preposition, pronoun, or conjunction. In addition, there may be subtypes of these main types, such as a noun in the plural form, which help to further define a word's purpose in the sentence.
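As a minimal illustration of what tagger output looks like, the sketch below hand-tags the example sentence from section 1.1. It assumes Penn Treebank-style tag names, which English models for taggers of this kind commonly emit; the tags are written by hand here rather than produced by stanfordNLP or openNLP, and the names `TAGGED_SENTENCE`, `TAG_DESCRIPTIONS`, and `describe` are hypothetical.

```python
# Hand-written (word, tag) pairs for illustration only; no tagger is called.
# Assumption: Penn Treebank-style tag names.
TAGGED_SENTENCE = [
    ("Xavier", "NNP"),
    ("bought", "VBD"),
    ("a", "DT"),
    ("play", "NN"),
    ("ticket", "NN"),
    ("for", "IN"),
    ("Katherine", "NNP"),
    ("in", "IN"),
    ("McCain", "NNP"),
    (".", "."),
]

# Short descriptions for the tags used above. "NNS" shows a subtype:
# a plural noun, as opposed to the singular "NN".
TAG_DESCRIPTIONS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}


def describe(tagged):
    """Return one human-readable line per (word, tag) pair."""
    return [
        f"{word}: {TAG_DESCRIPTIONS.get(tag, 'unknown tag')}"
        for word, tag in tagged
    ]


for line in describe(TAGGED_SENTENCE):
    print(line)
```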

2.1.1 stanfordNLP

StanfordNLP is a POS tagger that can be used to obtain the parts of speech of an English sentence. It is originally trained on a default corpus but can be retrained on another, more specific corpus in order to improve its tagging. The project currently uses the default corpus. StanfordNLP is used in conjunction with openNLP to obtain any valid parses that openNLP does not produce.

2.1.2 openNLP

OpenNLP performs much like the StanfordNLP POS tagger. However, unlike the StanfordNLP POS tagger, it is originally trained on the Wall Street Journal corpus.

2.1.3 Proprietary POS tagger

This POS tagger will supersede the aforementioned taggers to ensure that the system always has correct input. It will allow the user to construct a POS tree that is known to be correct. The program will first check this POS tagger for a parse. In the event that no parse is found here, the program will then check the other two taggers. The purpose of this tagger is to ensure that even when stanfordNLP and openNLP fail to produce correct parses, we still have the ability to obtain a correct parse.

2.2 Project Goal

The goal of the MSE project is to 1.) accurately represent the inputted text and 2.) understand the parsed structure that is represented in part 1. The first part focuses on how to take the grammatical structure given by the POS taggers and represent its varying parts internally in the developed program. The second part focuses on the logical representation of these internal structures, which represents the input and can further be used as a source of information for another program.

2.2.1 MSE Project Focus part 1

The first part of the MSE project requires that an English sentence be inputted into a graphical user interface (GUI). The English sentence should conform to most standard modern English sentences. The program should not be expected to handle archaic English or English of a more imaginative nature, such as poetry, neither of which conforms to known grammatical standards. In essence, the English sentence should follow most of the following rules:

• Conform to most grammatical rules
• Use known sentence structures
• Able to be understood in a casual conversation amongst laypersons
• Must not be so complex that a dictionary is needed to know the meaning of the sentence
• Should generally fit onto a single line, as restricted by the text box in the GUI

After English text is inputted into the GUI, the GUI should split it into sentences and put each sentence on its own line. The new group of sentences should then be displayed to the user, who should be able to select a specific sentence. That sentence should be run through the POS taggers currently in use, and the output from the taggers should be displayed to the user for inspection.

At this point, the program should be able to take the tags provided by the POS taggers and store the sentence in an internal representation that represents the sentences via programming structures. This structure should be viewable in some way to the user. The programming structures that are used are at the programmer’s discretion. However, modern practices should be used when programming the MSE project.
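One possible shape for such an internal representation is sketched below. This is only an assumption about how the structure might look: the class name `SentenceStructure`, its role fields (taken from the agent, action, object, recipient, and place discussed in section 1.1), and the `view` method are all hypothetical, and the roles are filled in by hand rather than derived from tagger output.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SentenceStructure:
    """Hypothetical internal representation of one tagged sentence."""

    text: str
    tagged: List[Tuple[str, str]]  # (word, POS tag) pairs from a tagger
    agent: Optional[str] = None
    action: Optional[str] = None
    obj: Optional[str] = None
    recipient: Optional[str] = None
    location: Optional[str] = None

    def view(self) -> str:
        """Render the structure as indented text, e.g. for a GUI tab."""
        lines = [f"sentence: {self.text}"]
        for role in ("agent", "action", "obj", "recipient", "location"):
            value = getattr(self, role)
            if value is not None:
                lines.append(f"  {role}: {value}")
        lines.append("  tags: " + " ".join(f"{w}/{t}" for w, t in self.tagged))
        return "\n".join(lines)


# Roles assigned by hand for the example sentence from section 1.1.
example = SentenceStructure(
    text="Xavier bought a play ticket for Katherine in McCain.",
    tagged=[("Xavier", "NNP"), ("bought", "VBD"), ("a", "DT"),
            ("play", "NN"), ("ticket", "NN"), ("for", "IN"),
            ("Katherine", "NNP"), ("in", "IN"), ("McCain", "NNP"),
            (".", ".")],
    agent="Xavier",
    action="bought",
    obj="play ticket",
    recipient="Katherine",
    location="McCain",
)
print(example.view())
```

Keeping the roles as named fields, rather than positions in a list, is one way to make the later logical output straightforward to derive.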

2.2.2 MSE Project Focus part 2

The program should take the representation constructed in part 1 and construct a logical output for the sentence. The logical output should represent the sentence in a predicate logic format, and the internal representation should be such that it allows for this logical interpretation. The user should be able to view the ontology output in some way. The ontology set is defined by the programmer but should conform to known logic standards. If the logical output does not conform to those standards, a justification should explain why the output was produced in that specific way.
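To make the idea of a predicate logic output concrete, the sketch below renders the roles of the example sentence from section 1.1 as a single formula. It assumes one common convention, an event-based (often called neo-Davidsonian) form in which each role is a predicate over an event variable; the document does not prescribe this form, and the function name `to_predicate_logic` and the predicate names are hypothetical.

```python
from typing import Optional


def _atom(name: str) -> str:
    """Turn a phrase such as 'play ticket' into a logic constant."""
    return name.strip().replace(" ", "_")


def to_predicate_logic(
    agent: str,
    action: str,
    obj: str,
    recipient: Optional[str] = None,
    location: Optional[str] = None,
) -> str:
    """Build an event-based formula; optional roles are added if present."""
    clauses = [
        f"{_atom(action)}(e)",
        f"agent(e, {_atom(agent)})",
        f"object(e, {_atom(obj)})",
    ]
    if recipient:
        clauses.append(f"recipient(e, {_atom(recipient)})")
    if location:
        clauses.append(f"location(e, {_atom(location)})")
    return "exists e. " + " & ".join(clauses)


formula = to_predicate_logic(
    agent="Xavier",
    action="bought",
    obj="play ticket",
    recipient="Katherine",
    location="McCain",
)
print(formula)
```

Note that this formula says nothing about whether Katherine was present at the purchase, which matches the ambiguity discussed in section 1.1: the output records only what the sentence states.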

2.2.3 Purpose

The purpose of the project is not to improve upon the existing POS taggers. Instead, it is to develop a program that can improve upon the definition and manipulation of the information provided by the POS taggers.

As stated in 2.2.1, the first focus is to make sure that the structure of the sentence can be represented accurately. To do this, the sentence must be in the expected format described in 2.2.1. This expected format should ensure that if there is a correct parse for the sentence, the POS taggers are able to produce it. However, the Proprietary tagger will always be checked first for a parse of the sentence. If the parse is not found in any of the POS taggers, the sentence will be manually tagged by the user using the Proprietary POS tagger tools.

The order of the POS taggers is as follows:

1.) Proprietary POS tagger
2.) openNLP tagger
3.) stanfordNLP tagger

The Proprietary tagger is meant to ensure that the program will always be given correct input to represent and transform into logic. Therefore, it has the highest priority. The openNLP tagger was trained on the Wall Street Journal corpus. It has the second highest priority simply because we know what corpus it was trained on, and the types of sentences that we are currently using tend to be representative of that corpus. It is unknown what the default corpus is for the StanfordNLP tagger, which therefore has the third highest priority. The StanfordNLP tagger can be trained on other corpora, though it is unknown at this time how to do so.
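The priority order described above can be sketched as a simple fallback chain that returns the first parse found. The tagger functions here are stand-ins, not calls into openNLP or stanfordNLP: `proprietary_tagger` looks up a hand-tagged sentence, `opennlp_stub` pretends to find no parse, and `stanford_stub` returns placeholder tags. All names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Tagged = List[Tuple[str, str]]
Tagger = Callable[[str], Optional[Tagged]]

# Sentences the user has tagged by hand (proprietary tagger data).
MANUAL_PARSES: Dict[str, Tagged] = {
    "Xavier ran.": [("Xavier", "NNP"), ("ran", "VBD"), (".", ".")],
}


def proprietary_tagger(sentence: str) -> Optional[Tagged]:
    """Return the user's manual parse, or None if there is none."""
    return MANUAL_PARSES.get(sentence)


def opennlp_stub(sentence: str) -> Optional[Tagged]:
    """Stand-in for the openNLP server; pretends no parse was found."""
    return None


def stanford_stub(sentence: str) -> Optional[Tagged]:
    """Stand-in for the stanfordNLP server; returns placeholder tags."""
    return [(word, "UNK") for word in sentence.split()]


# Highest priority first, matching the order listed in the text.
TAGGERS: Sequence[Tuple[str, Tagger]] = [
    ("proprietary", proprietary_tagger),
    ("openNLP", opennlp_stub),
    ("stanfordNLP", stanford_stub),
]


def first_parse(
    sentence: str, taggers: Sequence[Tuple[str, Tagger]] = TAGGERS
) -> Tuple[Optional[str], Optional[Tagged]]:
    """Try each tagger in priority order; return (name, parse) of the first hit."""
    for name, tagger in taggers:
        parse = tagger(sentence)
        if parse:
            return name, parse
    return None, None  # caller should ask the user to tag manually


print(first_parse("Xavier ran."))
print(first_parse("Katherine smiled."))
```

When `first_parse` returns `(None, None)`, the program would fall back to manual tagging, as described in section 2.2.3.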

The main focus in using these taggers, though, is to ensure that we can accurately build a programmatic structure to encapsulate inputted English sentences, as described in 2.2.1.

Furthermore, a correct representation helps determine the correct predicate logic to form for a given sentence, as described in section 2.2.2.

3.0 Project Requirements

This section discusses the requirements for the project.

3.1 Program Requirements

3.1.1 Req1 (Critical)

The program shall have a graphical user interface. This gives the user ease of use, since command-line interaction would be too complicated to accomplish anything reasonable.

3.1.1.1 Req1a (Non-Critical)

The GUI should have a tab to handle the POS tagger output. When a sentence is parsed, the resulting output from the POS taggers is shown on this tab.

3.1.1.2 Req1b (Critical)

The GUI should have a tab for user input and a button to process user input. When the button is pressed, the user input should be processed and outputted to the split sentences tab.

3.1.1.3 Req1c (Non-Critical)

The GUI should have a tab for sentence splitting output (one sentence per line). This is where the output goes for the sentences that were processed as input from the user.

3.1.1.4 Req1d (Critical)

The GUI should have a tab for viewing the internal structure representation. The tab should show the structure of the internal representation of the sentence that was chosen to be processed.

3.1.1.5 Req1e (Critical)

The GUI should have a tab for viewing the ontology output. The tab should show the logic that was derived from the sentence that was chosen to be processed.

3.1.1.6 Req1f (Critical)

The GUI should have a tab for proprietary POS tagger input. This tab is where the user can manually tag a sentence so that the manual tagging overrides results from the other POS taggers.

3.1.2 Req2 (Non-Critical)

The program should display an error message when it cannot connect to the POS tagger servers. The message should tell the user which server it could not connect to. If the program cannot connect to the servers, it will then shut down.
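A minimal sketch of this check is shown below, assuming the taggers are reachable over TCP. The host names and ports in `SERVERS` are placeholders, not the project's actual addresses, and the function names are hypothetical. The `connect` parameter defaults to the standard library's `socket.create_connection` and exists only so the check can be exercised without a network.

```python
import socket
import sys
from typing import Callable, Dict, List, Tuple

# Placeholder addresses; the real hosts and ports are not given in this document.
SERVERS: Dict[str, Tuple[str, int]] = {
    "Server1 (openNLP)": ("opennlp.example", 9000),
    "Server2 (stanfordNLP)": ("stanford.example", 9001),
    "Server3 (proprietary POS tagger)": ("proprietary.example", 9002),
}


def unreachable_servers(
    servers: Dict[str, Tuple[str, int]],
    timeout: float = 3.0,
    connect: Callable = socket.create_connection,
) -> List[str]:
    """Return the names of servers that refuse or time out on connect."""
    down = []
    for name, address in servers.items():
        try:
            connect(address, timeout=timeout).close()
        except OSError:  # covers refused connections, timeouts, DNS failures
            down.append(name)
    return down


def error_message(down: List[str]) -> str:
    """Build the message that names each unreachable server."""
    return "Could not connect to POS tagger server(s): " + ", ".join(down)


def check_servers_or_exit(servers: Dict[str, Tuple[str, int]]) -> None:
    """Report unreachable servers and shut down, as Req2 describes."""
    down = unreachable_servers(servers)
    if down:
        print(error_message(down), file=sys.stderr)
        sys.exit(1)
```

A program would call `check_servers_or_exit(SERVERS)` once at startup, before any sentence is sent to a tagger.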

3.1.3 Req3 (Non-Critical)

English sentence input should conform to modern English sentences. The English sentence should be free of typos and grammatical errors.

3.1.4 Req4 (Non-Critical)

English sentences should not be ambiguous (e.g. "The pretty tall girl ran to school").

3.1.5 Req5 (Non-Critical)

The GUI should allow the user to select a group of sentences to be processed, whereupon the sentences are split up and shown on the split sentences tab.

3.1.6 Req6 (Non-Critical)

The program shall allow the user to enter dynamic sentences for processing. In other words, the user can enter text via the keyboard.

3.1.7 Req7 (Non-Critical)

The inputted text does not have to be entered as one sentence per line. Instead, it can be entered as a paragraph as long as each sentence ends with appropriate punctuation.
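Splitting a paragraph into one sentence per line can be sketched with a regular expression that breaks after sentence-ending punctuation. This is a naive illustration, not the project's splitter: it assumes sentences end in ".", "!", or "?" followed by whitespace, so an abbreviation such as "Dr." would be split incorrectly. The function name `split_sentences` is hypothetical.

```python
import re
from typing import List

# Split at whitespace that directly follows ".", "!" or "?".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the sentences in text, one per list entry."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]


paragraph = (
    "Xavier bought a play ticket for Katherine in McCain. "
    "Katherine thanked Xavier! Was the play good?"
)
for sentence in split_sentences(paragraph):
    print(sentence)
```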

3.1.8 Req8 (Critical)

Internal representation output should be correct for the sentences as defined by the Major Professor.

3.1.9 Req9 (Non-Critical)

User input should not use pronouns.

3.1.10 Req10 (Non-Critical)

User input should not use inference.

3.2 Proprietary POS Tagger Requirements

3.2.1 PReq1 (Non-Critical)

The proprietary POS tagger tab should take in an English sentence of the same complexity as required by the program and allow the user to specify the parts of speech for the sentence.

3.2.1.1 PReq3a (Non-Critical)

The Proprietary POS tagger will work in conjunction with openNLP and stanfordNLP on a separate server. The Proprietary POS tagger should allow the user to connect to it in the same manner as the other two POS taggers.

3.2.1.2 PReq3b (Non-Critical)

The Proprietary POS tagger should return a parse of the sentence in the same manner as the openNLP and stanfordNLP taggers.
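As an illustration of how a manually tagged sentence might be read back into the same (word, tag) form the other taggers return, the sketch below parses a `word/TAG` text line. The `word/TAG` format is an assumption; this document does not specify the textual representation, and the function name `parse_manual_tags` is hypothetical.

```python
from typing import List, Tuple


def parse_manual_tags(tagged_text: str) -> List[Tuple[str, str]]:
    """Parse 'word/TAG word/TAG ...' into (word, tag) pairs.

    The last '/' in each token separates the word from its tag, so a
    token such as './.' is read as the word '.' with the tag '.'.
    """
    pairs = []
    for token in tagged_text.split():
        word, separator, tag = token.rpartition("/")
        if not separator or not word or not tag:
            raise ValueError(f"expected word/TAG, got {token!r}")
        pairs.append((word, tag))
    return pairs


print(parse_manual_tags("Xavier/NNP bought/VBD a/DT ticket/NN ./."))
```

Rejecting malformed tokens with an error, rather than guessing a tag, fits the purpose of the proprietary tagger: its output is meant to be trusted as correct.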

3.2.2 PReq2 (Non-Critical)

The Proprietary POS tagger will have tag options as defined by the openNLP and stanfordNLP taggers.

3.2.3 PReq3 (Critical)

The Proprietary POS tagger will allow the user to dynamically tag the sentence that was entered, using a textual representation similar to that seen on the POS tagger output tab. That representation overrides all of the other taggers in use.

4.0 Assumptions

• The user will have an executable program to interact with
• The servers will be up and running before interaction with the program begins
    o Server1 – openNLP program on a separate computer
    o Server2 – stanfordNLP program on a separate computer
    o Server3 – proprietary POS tagger

5.0 Constraints

• The openNLP program is used but not built upon in this project
• The StanfordNLP program is used but not built upon in this project

6.0 Environment

• Microsoft Visual Studio 2008
• Version control using TortoiseSVN