A Question Answering System Developed as a Project in a Natural Language Processing Course*

W. Wang, J. Auer, R. Parasuraman, I. Zubarev, D. Brandyberry, and M. P. Harper
Purdue University, West Lafayette, IN 47907
{wang28,jauer,pram,dbrandyb,harper}@ecn.purdue.edu and [email protected]

* We would like to thank the Deep Read group for giving us access to their test bed.

Abstract

This paper describes the Question Answering System constructed during a one-semester graduate-level course on Natural Language Processing (NLP). We hypothesized that by using a combination of syntactic and semantic features and machine learning techniques, we could improve the accuracy of question answering on the test set of the Remedia corpus over the reported levels. The approach, although novel, was not entirely successful in the time frame of the course.

1 Introduction

This paper describes a preliminary reading comprehension system constructed as a semester-long project for a natural language processing course. This was the first exposure to this material for all but one student, and so much of the semester was spent learning about and constructing the tools that would be needed to attack this comprehensive problem. The course was structured around the project of building a question answering system following the HumSent evaluation as used by the Deep Read system (Hirschman et al., 1999). The Deep Read reading comprehension prototype system (Hirschman et al., 1999) achieves a level of 36% of the answers correct using a bag-of-words approach together with limited linguistic processing. Since the average number of sentences per passage is 19.41, this performance is much better than chance (i.e., 5%). We hypothesized that by using a combination of syntactic and semantic features and machine learning techniques, we could improve the accuracy of question answering on the test set of the Remedia corpus over these reported levels.

[Figure 1: The architecture for our question answering system. Plain text (story and questions) passes through POS tagging, name identification, partial parsing (supported by the lexicon, WordNet, grammar rules, and pronoun resolution), and sentence-to-question comparison; the comparison output feeds a rule-based classifier, a neuro-fuzzy network, a neural network, and a genetic algorithm, followed by a voting step that returns the answer with the highest score.]

2 System Description

The overall architecture of our system is depicted in Figure 1. The sentences of each story and its five questions (who, what, where, when, and why) are first preprocessed and tagged by the Brill part-of-speech (POS) tagger distributed with the Deep Read system. This tagged text is then passed to the Name Identification Module, which updates the tags of named entities with semantic information and gender when appropriate. The Partial Parser Module then takes this updated text and breaks it into phrases while attempting to lexically disambiguate the text. The Pronoun Resolution Module is consulted by the parser in order to resolve pronouns before passing partially parsed sentences and questions to the Sentence-to-Question Comparison Module. The Comparison Module determines how strongly the phrases of a sentence are related to those of a question, and this information is passed to several modules which attempt to learn which features of the comparison are the most important for identifying whether a sentence is a strong answer candidate. We intended to set up a voting scheme among the various modules; however, this part of the work has not been completed (as indicated by the dashed lines in Figure 1).
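This flow can be summarized as a simple pipeline. The sketch below is only an illustration of the module ordering in Figure 1; the function names and signatures are hypothetical stand-ins, not those of our implementation.

    # A minimal sketch of the Figure 1 pipeline; all function arguments are
    # hypothetical stand-ins for the modules described in this section.
    def answer_questions(story_sentences, questions, pos_tag, identify_names,
                         partial_parse, resolve_pronouns, compare, classifiers):
        """Return, for each question, the story sentences ranked as answer candidates."""
        tagged   = [pos_tag(s) for s in story_sentences]        # Brill POS tagging
        named    = [identify_names(s) for s in tagged]          # semantic type/gender of names
        parsed   = [partial_parse(s) for s in named]            # NP/VP/PP segments with features
        resolved = [resolve_pronouns(s) for s in parsed]        # antecedent features filled in
        ranked = {}
        for q in questions:
            # phrase-to-phrase comparison features between each sentence and the question,
            # scored by each learned classifier (a voting step would combine these scores)
            scores = [max(clf(compare(s, q)) for clf in classifiers) for s in resolved]
            ranked[q] = [s for _, s in sorted(zip(scores, resolved), key=lambda p: -p[0])]
        return ranked
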

Our system, like Deep Read, uses as the development set 28 stories from grade 2 and 27 from grade 5, each with five short answer questions (who, what, when, where, and why), and 60 stories with questions from grades 3 and 4 for testing [1]. We will refer to the development and testing data as the Remedia corpus. The following example shows the information added to a plain text sentence as it progresses through each module of the system we have created (see Figure 2). Each module is described in more detail in the following sections.

[Figure 2: Processing an example sentence for matching with a question in our system. The sentence "They are called Cub Scouts." (from story Rm3-5, the answer to question 1) is shown after POS tagging, name identification, initial partial parsing, pronoun resolution, and feature updating, together with the bracketed forms of the sentence and of question 1.]

[1] There were differences between the Deep Read electronic version of the passages and the Remedia published passages. We used the electronic passages.

2.1 Name Identification Module

The Name Identification Module expects as input a file that has been tagged by the Brill tagger distributed with the Deep Read system. The most important named entities in the Remedia corpus are the names of people and the names of places. To distinguish between these two types, we created dictionaries for names of people and names of places. The first and last name dictionaries were derived from the files at http://www.census.gov/genealogy/names/. First names had an associated gender feature; names that were either male or female included gender frequency. Place names were extracted from atlases and other references, and included names of countries, major cities and capital cities, major attractions and parks, continents, etc. WordNet was also consulted because of its coverage of place names. There are 5,165 first name entries, 88,798 last name entries, and 1,086 place name entries in the dictionaries used by this module.

The module looks up possible names to decide whether a word is a person's name or a location. If it cannot find the word in the dictionaries, it then looks at the POS tags provided in the input file to determine whether or not it is a propernoun. Heuristics (e.g., looking at titles like Mr. or word endings like ville) are then applied to decide the semantic type of the propernoun, and if the type cannot be determined, the module returns both person and location as its type. The accuracy of the Name Identification Module on the testing set was 79.6%. The accuracy adjusted to take into account incorrect tagging was 83.6%.

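The decision procedure can be sketched as follows. The dictionaries and helper names below are toy stand-ins, but the fallback order (dictionary lookup, then the POS tag, then title and suffix heuristics, then the ambiguous person-or-location default) follows the description above.

    # Hypothetical sketch of the name-type decision described above.
    FIRST_NAMES = {"george": "male", "mary": "female"}   # stand-in for census-derived lists
    LAST_NAMES = {"washington", "smith"}
    PLACE_NAMES = {"washington", "london"}               # stand-in for atlas-derived list
    TITLES = {"mr.", "mrs.", "ms.", "dr."}
    PLACE_SUFFIXES = ("ville", "town", "land")

    def name_type(word, pos_tag, prev_word=None):
        w = word.lower()
        types = set()
        if w in FIRST_NAMES or w in LAST_NAMES:
            types.add("person")
        if w in PLACE_NAMES:
            types.add("location")
        if len(types) == 1:
            return types.pop()                      # unambiguous dictionary hit
        if not types and pos_tag not in ("NNP", "NNPS"):
            return None                             # not treated as a propernoun
        # heuristics: a preceding title suggests a person, a place-like ending a location
        if prev_word and prev_word.lower() in TITLES:
            return "person"
        if w.endswith(PLACE_SUFFIXES):
            return "location"
        return ("person", "location")               # type cannot be determined

    print(name_type("Washington", "NNP", prev_word="Mr."))   # person
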

2.2 Partial Parser Module

The Partial Parser Module follows sequentially after the Name Identification Module. The input is the set of story sentences and questions, such that the words in each are tagged with POS tags and the names are marked with type and gender information. Initially pronouns have not been resolved; the partial parser provides segmented text with rich lexical information and role labels directly to the Pronoun Resolution Module. After pronoun resolution, the segmented text with resolved pronouns is returned to the partial parser for the parser to update the feature values corresponding to the pronouns. Finally, the partial parser provides bracketed text to the Comparison Module, which extracts features that will be used to construct modules for answering questions. The Partial Parser Module utilizes information in a lexicon and a grammar to provide the partial parses. The lexicon and the parser will be detailed in the next two subsections.

2.2.1 The Lexicon

There were two methods we used to construct the lexicon: an open lexicon, which includes all words from the development set along with all determiners, pronouns, prepositions, particles, and conjunctions (these words are essential to achieving good sentence segmentation), and a closed lexicon, which includes all of the development and testing words [2]. We constructed the closed lexicon with the benefit of the development corpus only (i.e., we did not consult the test materials to design the entries). To improve coverage in the case of the open lexicon, we constructed a module for obtaining features for words that do not appear in the development set (unknown words) that interfaces with WordNet to determine a word's base/stem, semantic type, and synonyms. When an unknown word has multiple senses, we have opted to choose the first sense because WordNet orders senses by frequency of use. Ignoring numbers, there are 1,999 unique words in the development set of the Remedia corpus, and 2,067 in the testing data, of which 1,008 do not appear in the development set. Overall, there are 3,007 unique words across both training and testing.

One of our hypotheses was that by creating a lexicon with a rich set of features, we would improve the accuracy of question answering. The entries in the lexicon were constructed using the conventions adopted for the Parsec parser (Harper and Helzerman, 1995; Harper et al., 1995; Harper et al., 2000). Each word entry contains information about its root word (if there is one) and its lexical category (or categories), along with a corresponding set of allowable features and their corresponding values. Lexical categories include noun, verb, pronoun, propernoun, adjective, adverb, preposition, particle, conjunction, determiner, cardinal, ordinal, predeterminer, noun modifier, and month. Feature types used in the lexicon include subcat, gender, agr, case, vtype (e.g., progressive), mood, gap, inverted, voice, behavior (e.g., mass), type (e.g., interrogative, relative), semtype, and conjtype (e.g., noun-type, verb-type, etc.). We hypothesized that semtype should play a significant role in improving question answering performance, but the choice of semantic granularity is a difficult problem. We chose to keep the number of semantic values relatively small. By using the lexicographers' files in WordNet to group the semantic values, we selected 25 possible semantic values for the nouns and 15 for the verbs. A script was created to semi-automate the construction of the lexicon from information extracted from previously existing dictionaries and from WordNet.

[2] Initially, we created the closed lexicon because this list of words was in the Deep Read materials. Once we spotted that the list contained words not in the development material, we kept it as an alternative to see how important full lexical knowledge would be for answering questions.

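To make the kind of information stored per entry concrete, a lexicon record might be rendered as below. This is only a schematic illustration; the actual entries follow the Parsec conventions cited above, and the field names and values shown are our own.

    # Schematic lexicon entries (illustrative only, not the Parsec entry format).
    LEXICON = {
        "called": {"root": "call", "categories": ["verb"],
                   "features": {"semtype": ["equate"], "voice": ["active", "passive"]}},
        "scout":  {"root": "scout", "categories": ["noun"],
                   "features": {"semtype": ["person"], "agr": ["3s"],
                                "gender": ["male", "female"]}},
    }
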
2.2.2 The Partial Parser

The parser segments each sentence into noun phrases (NPs), verb phrases (VPs), and prepositional phrases (PPs), each with various feature sets. NPs have the feature types: Base (the root word of the head word of the NP), AGR (number/person information), SemType (the semtype of the root form in the lexicon, e.g., person, object, event, artifact, organization), Label (the role type of the word in the sentence, e.g., subject), and Gender. Verb phrases (VPs) have the feature types: Base, AGR, SemType (the semtype of the root form in the lexicon, e.g., contact, act, possession), Tense (e.g., present, past), and Voice. Prepositional phrases (PPs) have the feature types: Prep (the root form of the preposition word), SemType (the semtype of the root form in the lexicon, e.g., at-loc, at-time), Need (the object of the preposition), and NeedSemType (the semtype of the object of the preposition). Feature values are assigned using the lexicon, the Pronoun Resolution Module, and the grammar rules.

We implemented a bottom-up partial parser to segment each sentence into syntactic subparts. The grammar used in the bottom-up parser is shown below:

1. NP → DET ADJ+ NOUN+
2. NP → DET NOUN
3. NP → ADJ PROPERNOUN+
4. VP → (AUX-VERB) MAIN-VERB
5. PP → ADV
6. PP → ADJ (PRED)
7. PP → PREP NP

At the outset, the parser checks whether there are any punctuation marks in the sentence, with commas and periods being the most helpful. A comma is used in two ways in the Remedia corpus: it acts as a signal for the conjunction of a group of nouns or propernouns, or it acts as punctuation signalling an auxiliary phrase (usually a PP) or sentence. In the NP conjunction case, the parser groups the conjoined nouns or propernouns together as a plural NP. In the second case, the sentence is partially parsed. The partial parser operates in a bottom-up fashion, taking as input a POS-tagged and name-identified sentence and matching it against the right-hand sides of the grammar rules. Starting from the beginning of the sentence or auxiliary phrase, the parser looks at the POS tags of the words, transforms them into the corresponding lexical categories, and tries to match the RHS of the rules. Phrases are maintained on an agenda until they are finalized.

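The core operation, matching the right-hand side of a rule against a run of lexical categories, can be sketched as below. This illustrates only the mechanism behind the '+' and parenthesized optional elements in the rules above; it is not the actual agenda-based implementation.

    # Sketch of matching one rule's right-hand side against lexical categories.
    # 'X+' means one or more X; 'X?' means optional X (parenthesized in the rules).
    def match_rhs(pattern, cats, start=0):
        """Return how many categories of cats[start:] the pattern consumes, or -1."""
        i = start
        for elem in pattern:
            required, repeat = not elem.endswith("?"), elem.endswith("+")
            name = elem.rstrip("+?")
            matched = 0
            while i < len(cats) and cats[i] == name:
                i += 1
                matched += 1
                if not repeat:
                    break
            if matched == 0 and required:
                return -1
        return i - start

    # e.g., rule 1 (NP -> DET ADJ+ NOUN+) against "the little brown dog":
    print(match_rhs(["DET", "ADJ+", "NOUN+"], ["DET", "ADJ", "ADJ", "NOUN"]))  # 4
    # rule 4 (VP -> (AUX-VERB) MAIN-VERB) against "are called":
    print(match_rhs(["AUX-VERB?", "MAIN-VERB"], ["AUX-VERB", "MAIN-VERB"]))    # 2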

NPs often require merging, since some consecutive NPs form a single multi-word token (i.e., multi-word names and conjunctions). An NP that results from merging two tokens into a single multi-word token has its Base set to the root word of the combined token, and its AGR and SemType features are updated according to the information retrieved from the lexicon based on the multi-word token. In the case of an NP conjunction, the Base is the union of the Base of each NP, AGR is set to 3p, and SemType is assigned as that of the head word of the merged NP. The rule for finding the head word of an NP is: find the FIRST consecutive noun (propernoun) group in the NP; the LAST noun (propernoun) in this group is then defined as the head word of the NP.

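A minimal rendering of this head-word rule, assuming parallel lists of the NP's words and their lexical categories:

    # Sketch of the NP head-word rule: take the FIRST consecutive run of
    # nouns/propernouns in the NP, and return the LAST word of that run.
    def np_head_word(words, categories):
        head = None
        in_run = False
        for word, cat in zip(words, categories):
            if cat in ("NOUN", "PROPERNOUN"):
                head = word          # keep updating within the first run
                in_run = True
            elif in_run:
                break                # the first noun run has ended
        return head

    print(np_head_word(["the", "Cub", "Scout", "handbook"],
                       ["DET", "PROPERNOUN", "PROPERNOUN", "NOUN"]))   # handbook
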
The partial parser performs word-sense disambiguation as it parses. Words such as Washington have multiple semtype values in the lexicon for one lexical category. The following are rules for word-sense disambiguation used by the parser:

• NP plus VP rules for word-sense disambiguation:
  - If there are verbs such as name, call, or be, which have the semtype of equate, then the NPs that precede and follow the VP have the same semtype.
  - If a noun is the object of a verb, then the subcat feature value of the verb can be used to disambiguate its word sense (e.g., take generally has the subcat of obj+time).
• PP rules for word-sense disambiguation:
  - For some nouns (propernouns) which are the object of a preposition, the intersection of the semtype value sets of the preposition word and its object determines their semtype.
• NPs in the date line of each passage are all either dates or places, with the typical order being place then time. For example, in (WASHINGTON, June, 1989), Washington is assigned the semtype of location rather than person.

To process unknown words (the 1,008 words in the testing set that do not appear in the development set) in the case of the open lexicon, WordNet is used to assign the semtype feature for nouns and verbs, the AGR feature for verbs can be obtained in part from the POS tag, and AGR for unknown nouns can be determined when they are used as the subject of a sentence. For the closed lexicon, the only unknown words are numbers. If a number is a four-digit number starting with 16 to 19 or is followed by A.D. or B.C., then generally it is a year, so its semtype is defined as time. Other numbers tend to be modifiers or predicates and have the semtype of num.

2.3 Pronoun Resolution Module

A pronoun resolution module was developed using the rules given in Allen's text (Allen, 1995) along with other rules described in the work of Hobbs (Hobbs, 1979). The module takes as input the feature-augmented and segmented text provided by the partial parser. Hence, the words are marked with lexical (including gender) and semantic feature information, and the phrase structure is also available. After the input file is provided by the Partial Parser Module, the Pronoun Resolution Module searches for the pronouns by looking through the NPs identified by the partial parser. Candidate antecedents are identified, and a comparison of the features is made between the pronoun and each possible antecedent. The phrase that passes the most rule filters is chosen as the antecedent. First and second person pronouns are handled by using default values (i.e., writer and reader). If the system fails to arrive at an antecedent, the pronoun is marked as non-referential, which is often the case for pronouns like it or they. Some of the most useful rules are listed below:

• Reflexives must refer to an antecedent in the same sentence. For simplicity, we chose the closest noun preceding the pronoun in the sentence with matching Gender, AGR, and SemType.
• Two NPs that co-refer must agree in AGR, Gender, and SemType (e.g., person, location). Since in many cases the gender cannot be determined, this information was used only when available.
• A subject was preferred over the object when the pronoun occurred as the subject in a sentence.
• When it occurs at the beginning of a paragraph, it is considered non-referential.
• We prefer a global entity (the first named entity in a paragraph) when there is a feature match. In the absence of such, we prefer the closest propernoun preceding the pronoun with a feature match. If that fails, we prefer the closest preceding noun or pronoun with a feature match.

The accuracy of our pronoun resolution module on the training corpus was 79.5% for grade 2 and 79.4% for grade 5. On testing, it was 81.33% for grade 3 and 80.39% for grade 4. The overall accuracy of this module on both the testing and training corpus was 80.17%. This was an improvement over the baseline Deep Read coreference module, which achieved a 51.61% accuracy on training and a 50.91% accuracy on testing, giving an overall accuracy of 51.26%. This accuracy was determined based on Professor Harper's manual pronoun resolution of both the training and testing sets (the perfect coreference information was not included in the distribution of the Deep Read system).

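A schematic version of this rule-filter selection is shown below. The feature names follow the phrase features of Section 2.2.2, but the candidate representation and scoring helper are our own simplification of "the phrase that passes the most rule filters is chosen."

    # Sketch: score candidate antecedent NPs for a pronoun by counting the rule
    # filters they pass, then pick the best-scoring (closest on ties).
    def resolve_pronoun(pronoun, candidates):
        """pronoun and each candidate are dicts of features (AGR, Gender, SemType,
        Label, is_global, distance); candidates are ordered nearest-first."""
        def score(c):
            s = 0
            if c["AGR"] == pronoun["AGR"]:
                s += 1
            # gender is used only when it could be determined for both phrases
            if c.get("Gender") and pronoun.get("Gender"):
                s += 1 if c["Gender"] == pronoun["Gender"] else -1
            if c["SemType"] == pronoun["SemType"]:
                s += 1
            if pronoun["Label"] == "subject" and c["Label"] == "subject":
                s += 1                     # prefer subject antecedents for subject pronouns
            if c.get("is_global"):
                s += 1                     # prefer the first named entity in the paragraph
            return s
        if not candidates:
            return None                    # marked non-referential
        best = max(candidates, key=lambda c: (score(c), -c["distance"]))
        return best if score(best) > 0 else None
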

2.4 Sentence-to-Question Comparison Module

The Sentence-to-Question Comparison Module takes as input a set of tagged stories for which phrase types and features have been identified. The semantic and syntactic information is coded as shown in Figure 2 (using XML tags). A mechanism to quantify a qualitative comparison of questions and sentences has been developed. The comparison provides data about how questions compare to their answers and how questions compare to non-answers. The classification of answers and non-answers is implemented by using feature comparison vectors of phrase-to-phrase comparisons in questions and potential answer sentences.

A comparison is made using phrase-to-phrase comparisons between each sentence and each question in a passage. In particular, NP-to-NP, VP-to-VP, PP-to-PP, and NP-to-PP comparisons are made between each sentence and each of the five questions. These comparisons are stored for each sentence in the following arrays. Note that in these arrays Q varies from 1 to 5, signifying the question that the sentence matches, and F varies over the features for the phrase match.

CN[Q][F]   Comparison of NP features (F = |{Base, AGR, SemType}|) between question Q and the sentence.
CV[Q][F]   Comparison of VP features (F = |{Base, AGR, SemType, Tense}|) between question Q and the sentence.
CP[Q][F]   Comparison of PP features (F = |{NeedBase, Prep, PPSemType, NeedSemType}|) between question Q and the sentence.
CPN[Q][F]  Comparison of PP features in the sentence to NP features in question Q. Here F = 2, comparing NeedBase with Base, and NeedSemType with SemType.

Values for these comparison matrices were calculated for each sentence by comparing the features of each phrase type in the sentence to the features of the indicated phrase types in each of the five questions. The individual matrix values describe the comparison of the best match between a sentence and a question for NP-to-NP (the three feature match scores for the best matching NP pair of the sentence and question Q are stored in CN[Q]), VP-to-VP (stored in CV[Q]), PP-to-PP (stored in CP[Q]), and PP-to-NP (stored in CPN[Q]). Selecting the comparison vector for the phrase pair that best matches a sentence phrase to a question phrase was chosen as a heuristic to avoid placing more importance on a sentence only because it contains more information.

Comparisons between features were calculated using the following equations. The first is used when comparing features such as Base, NeedBase, and Prep, where a partial match must be quantified. The second is used when comparing features such as SemType, AGR, and Tense, where only exact matches make sense.

    c = 1                                                 if Str1 = Str2
    c = min_length(Str1, Str2) / max_length(Str1, Str2)   if length(Str1) ≠ length(Str2) and
                                                             (Str1 is a substring of Str2 or Str2 is a substring of Str1)
    c = 0                                                 otherwise

    c = 1   if Str1 = Str2
    c = 0   if Str1 ≠ Str2

The matrices for the development set were provided to the algorithms in the Answer Module for training the component answer classifiers. The matrices for the testing set were also passed to the algorithms for testing. Additionally, specific information about the feature values for each sentence was passed to the Answer Module.

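These two comparisons can be transcribed directly into code; the sketch below assumes the containment test treats one string as a substring of the other, as in the equations above.

    # Feature comparison scores used in the comparison matrices.
    def partial_match(s1, s2):
        """For Base, NeedBase, and Prep: quantify partial string matches."""
        if s1 == s2:
            return 1.0
        if len(s1) != len(s2) and (s1 in s2 or s2 in s1):
            return min(len(s1), len(s2)) / max(len(s1), len(s2))
        return 0.0

    def exact_match(s1, s2):
        """For SemType, AGR, and Tense: only exact matches make sense."""
        return 1.0 if s1 == s2 else 0.0

    print(partial_match("scout", "scouts"))   # 5/6 ~ 0.83
    print(exact_match("person", "location"))  # 0.0
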
2.5 Answer Modules

Several methods were developed in parallel in an attempt to learn the features that are central to identifying the sentence from a story that correctly answers a question. These methods are described in the following subsections. Due to time constraints, the evaluations of these Answer Modules were carried out with a closed lexicon and perfect pronoun resolution.

2.5.1 A Neuro-Fuzzy Network Classifier

An Adaptive Network-based Fuzzy Inference System (ANFIS) (Jang, 1993) from the Matlab Fuzzy Logic Toolbox was used as one method to resolve the story questions. A separate network was trained for each question type in an attempt to make the networks learn relationships between phrases that classify answer sentences and non-answer sentences differently. ANFIS has the ability to learn complex relationships between its input variables. It was expected that by learning the relationships in the training set, the resolution of questions could be performed on the testing set.

For ANFIS, the set of sentence-question pairs was divided into five groups according to question type. Currently the implementation of ANFIS in Matlab is restricted to 4 inputs. Hence, we needed to devise a way to aggregate the feature comparison information for each comparison vector. The comparison vectors for each phrase-to-phrase comparison were reduced to a single number for each comparison pair (i.e., NP-NP, VP-VP, PP-PP, NP-PP). This reduction was performed by multiplying the vector values by normalized weighting constants for the feature values (e.g., NP-comparison = (Base weight)*(Base comparison value) + (AGR weight)*(AGR comparison value) + (SemType weight)*(SemType comparison value), with the weights summing to 1). In most cases where a match is found, the comparison values are 1 (exact match), so weights were chosen that allowed ANFIS to tell something about the match characteristics (e.g., if the AGR weight is 0.15 and the SemType weight is 0.1, and the NP-comparison value is 0.25, it can be concluded that the NP pair that matched best between the sentence and the question had the same AGR and SemType features). The aggregation weights were chosen so that all combinations of exact matches on features would have unique values, and the magnitudes of the weights were chosen based on the belief that the higher weighted features contribute more useful information. The weights, ordered to correspond to the features in the arrays listed above, are: (.55, .15, .3) for CN, (.55, .1, .22, .13) for CV, (.55, .15, .2, .1) for CP, and (.55, .45) for CPN.

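The reduction of each comparison vector to a single ANFIS input, using the weights listed above, can be sketched as follows. The weights and feature order come from the text; the function itself is our paraphrase, not the Matlab code.

    # Weighted reduction of each comparison vector to one ANFIS input value.
    AGGREGATION_WEIGHTS = {
        "CN":  (0.55, 0.15, 0.30),        # Base, AGR, SemType
        "CV":  (0.55, 0.10, 0.22, 0.13),  # Base, AGR, SemType, Tense
        "CP":  (0.55, 0.15, 0.20, 0.10),  # NeedBase, Prep, PPSemType, NeedSemType
        "CPN": (0.55, 0.45),              # NeedBase vs. Base, NeedSemType vs. SemType
    }

    def aggregate(kind, comparison_vector):
        """Collapse a feature-comparison vector into the single input ANFIS expects."""
        weights = AGGREGATION_WEIGHTS[kind]
        return sum(w * v for w, v in zip(weights, comparison_vector))

    # e.g., an NP pair that matches exactly on AGR and SemType but not on Base:
    print(aggregate("CN", (0.0, 1.0, 1.0)))   # 0.45 uniquely identifies this pattern
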

ANFIS was trained on the development set data provided by the Sentence-to-Question Comparison Module and aggregated as described above. During testing, the data provided by the Comparison Module and aggregated as described above is used as input to ANFIS. The output is a confidence value that describes the likelihood of a sentence being an answer. Every sentence is compared with every question in ANFIS, and then, within each question, the sentences are ranked by the likelihood that they are that question's answer.

The accuracy of the best classifier produced with ANFIS was quite poor. In the grade 3 set, we achieved an accuracy of 13.33% on who questions, 6.67% on what questions, 0% on where questions, 6.67% on when questions, and 3.33% on why questions. In the grade 4 set, we achieved an accuracy of 3.54% on who questions, 10.34% on what questions, 10.34% on where questions, 0% on when questions, and 6.9% on why questions. Although the best ranking sentence produced poor accuracy results on the testing set, with some additional knowledge the top-ranking incorrect answers might be eliminated. The plots in Figure 3 display the number of times the answer sentence was assigned a particular rank by ANFIS. The rank of the correct sentence tends to be in the top 10 fairly often for most question types. This rank tendency is most noticeable for who, what, and when questions, but it is also present for where questions. The rank distribution for why questions appears to be random, which is consistent with our belief that they require a deeper analysis than would be possible with simple feature comparisons.

2.5.2 A Neural Network Classifier

Like ANFIS, this module uses a neural network, but it has a different topology and uses an extended feature set. The nn (Neureka) neural network simulation system (Mat, 1998) was used to create a multi-layer (one hidden layer) back-propagation network. A single training/testing instance was generated from each story sentence. The network contains an input layer with two groups of features. The sentence/question feature vectors that compare a sentence to each of the five story questions comprise the first group. Sentence features that are independent of the questions, i.e., contains a location, contains a time/date, and contains a human, comprise the second group. The hidden layer contains a number of nodes that was experimentally varied to achieve the best performance. The output layer contains five nodes, each of which has a binary output value indicating whether or not the sentence is the answer to the corresponding question (i.e., questions 1 through 5).

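A sketch of how a training instance for this network might be assembled is given below. The vector layout and helper names are our reconstruction from the description; the actual experiments used the nn (Neureka) simulator.

    # One training instance per story sentence: comparison vectors against all five
    # questions (first feature group), plus three question-independent flags (second group).
    def build_instance(comparisons, sentence_flags, answer_to):
        """comparisons: per-question dict with CN, CV, CP, CPN vectors;
        sentence_flags: (has_location, has_time, has_human);
        answer_to: the question number 1..5 this sentence answers, or None."""
        x = []
        for q in range(1, 6):
            for kind in ("CN", "CV", "CP", "CPN"):
                x.extend(comparisons[q][kind])
        x.extend(float(flag) for flag in sentence_flags)
        # five binary outputs, one per question (the best network used 16 hidden nodes)
        y = [1.0 if answer_to == q else 0.0 for q in range(1, 6)]
        return x, y
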
Several training trials were performed to determine the optimum parameters for the network. We trained using various subsets of the full input feature set, since some features could be detrimental to creating a good classifier. However, in the end, the full set of features performed better than or equivalently to the various subsets. Increasing the number of hidden nodes can often improve the accuracy of the network because it can learn more complex relationships; however, this did not help much in the current domain, and so the number of hidden nodes was set to 16. For this domain, there are many more sentences that are not the answer to a question than sentences that are. An effort was made to artificially change this distribution by replicating the answer sentences in the training set; however, no additional accuracy was gained by this experimentation. Finally, we created a neural network for each question type as in ANFIS; however, these small networks had lower accuracy than the single network approach.

The overall test set accuracy of the best neural network classifier was 14%. In the grade 3 set, we achieved an accuracy of 30% on who questions, 0% on what questions, 23.3% on when questions, 13.3% on where questions, and 3.3% on why questions. In the grade 4 set, we achieved an accuracy of 17.2% on who questions, 10.3% on what questions, 23.6% on when questions, 10.3% on where questions, and 3.4% on why questions.

2.5.3 A Rule-based Classifier based on C5.0

We attempted to learn rules for filtering out sentences that are not good candidates as answers to questions using C5.0 (Rul, 1999). First we extracted information from the sentence-to-question correspondence data, ignoring the comparison values, to make the input C5.0-compatible, and produced five different files (one for each question type). These files were then fed to C5.0; however, the program did not produce a useful tree. The problem may have been that most sentences in the passages are negative instances of answers to questions.

2.5.4 GAS

GAS (Jelasity and Dombi, 1998) is a steady-state genetic algorithm with subpopulation support. It is capable of optimizing functions with a high number of local optima. The initial parameters were set theoretically. In the current matching problem, because the number of local optima can be high due to the coarse level of sentence information (there can be several sentence candidates with very close scores), this algorithm is preferred over other common genetic algorithms. This algorithm was trained on the training set, but due to the high noise level in the training data, the algorithm fails to produce a winning population based on the mean square minimization function.

[Figure 3: Correct answers ordered by ANFIS preference. One plot per question type (who, what, when, where, and why questions), showing how many times the correct answer sentence received each position in the N-best order of sentences.]

3 A Closer Look at the Features

After observing the question answering accuracy results of the above classifiers, we concluded that the features we extracted for the classifiers are affected by noise. The fact that we take into consideration only the top matching phrase-to-phrase matches on a specific set of features may have contributed to this noisiness. To analyze the noise source of the features, given that SemType was hypothesized to be essential for answer candidate discrimination, we examined those SemType values that occurred most frequently and calculated statistics on how often the values occurred in story sentences that are answers versus non-answers to the questions. We observed the following phenomena:

1. For who questions, the SemType value person plays an important role in identifying answer sentences, since 83.64% of answers have person as their NP SemType value, and 21.82% have it as their PP NeedSemType value. However, 66.83% of the non-answer sentences also have person as their NP SemType and 15.85% as their PP NeedSemType. Phrases with person SemType appear in most sentences, whether they are answers or not, and this weakens its ability to act as an effective filter.

2. For what questions, the SemType value person appears as the NP SemType of most answer and non-answer sentences. The next most dominant feature is the SemType value object, which appears in the NP for 29.41% of the answer sentences and as the PP NeedSemType for 15.68% of the answer sentences. Most of the other SemType values, such as time, contribute trivially to distinguishing answers from non-answers, as might be expected.

3. For when questions, person appears dominant among NP SemType values; however, time features appear to be the second most dominant, since 19.30% of the answer sentences have time as their NP SemType, and 26.32% have at-time as their PP SemType. Note that the PP NeedSemType and VP SemType appear to be less capable of guiding the selection of the correct answer.

4. For where questions, location features are important, with 24.07% of answer sentences having location as their NP SemType value, and 20.37% having at-loc as their PP SemType. However, the distribution of values for VP SemType and PP NeedSemType shows no interesting patterns.

The current training strategy weights the NP-NP, VP-VP, PP-PP, and NP-PP comparisons equivalently. The above observations suggest that training classifiers based on these equally weighted comparisons may have prevented the detection of a clear class boundary, resulting in poor classification performance. Since different phrase types do not appear to contribute in the same way across different question types, it may be better to generate a rule base as a prefilter to assign more weight to certain phrases or discard others before inputting the feature vector into the classifier for training.

4 Future Directions

As a next step, we will try to tame our feature set. One possibility is to use a rule-based classifier that is less impacted than C5.0 by the serious imbalance between negative and positive instances, in order to learn more effective feature sets for answer candidate discrimination corresponding to different question types. We could then use the classifier as a preprocessing filter to discard the less relevant comparison vector elements before inputting them into the classifiers, instead of inputting comparison results based on the complete feature sets. This should help to reduce the noise generated by irrelevant features. Also, we will perform additional data analysis on the classification results to gain further insight into the noise sources.

The classifiers we developed covered a wide range of approaches. To optimize the classification performance, we would like to implement a voting module to process the answer candidates from the different classifiers. The confidence rankings of the classifiers would be determined from their corresponding answer selection accuracies on the training set, and would be used horizontally over the classifiers to provide a weighted confidence measure for each sentence, giving a final ordered list, where the head of the list is the proposed answer sentence. We propose to use a voting neural network to train the confidence weights on the different classifiers based on different question types, since we also want to explore the relationship of classifier performance with question types. We believe this voting scheme will optimize the bagging of different classifiers and improve the hypothesis accuracy.

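A simple weighted-voting version of this idea might look like the sketch below. This is only an illustration of the proposal, not an implemented module; the per-classifier confidences would come from training-set accuracy or from the proposed voting network.

    # Sketch of confidence-weighted voting over the per-classifier rankings.
    def vote(candidate_scores, classifier_confidence):
        """candidate_scores[clf][sentence_id] -> that classifier's score;
        classifier_confidence[clf] -> weight (e.g., its training-set accuracy)."""
        combined = {}
        for clf, scores in candidate_scores.items():
            w = classifier_confidence[clf]
            for sentence_id, score in scores.items():
                combined[sentence_id] = combined.get(sentence_id, 0.0) + w * score
        # the head of the returned list is the proposed answer sentence
        return sorted(combined, key=combined.get, reverse=True)
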
References

J. Allen. 1995. Natural Language Understanding. The Benjamin/Cummings Publishing Company, Menlo Park, CA.

M. P. Harper and R. A. Helzerman. 1995. Managing multiple knowledge sources in constraint-based parsing spoken language. Fundamenta Informaticae, 23(2,3,4):303-353.

M. P. Harper, R. A. Helzerman, C. B. Zoltowski, B. L. Yeo, Y. Chan, T. Stewart, and B. L. Pellom. 1995. Implementation issues in the development of the Parsec parser. SOFTWARE - Practice and Experience, 25:831-862.

M. P. Harper, C. M. White, W. Wang, M. T. Johnson, and R. A. Helzerman. 2000. Effectiveness of corpus-induced dependency grammars for post-processing speech. In Proceedings of the 1st Annual Meeting of the North American Association for Computational Linguistics.

L. Hirschman, M. Light, E. Breck, and J. D. Burger. 1999. Deep Read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 325-332.

J. R. Hobbs. 1979. Coherence and coreference. Cognitive Science, 1:67-90.

J.-S. R. Jang. 1993. ANFIS: Adaptive-Network-based Fuzzy Inference System. IEEE Transactions on Systems, Man, and Cybernetics, 23(3):665-685.

M. Jelasity and J. Dombi. 1998. GAS, a concept on modeling species in genetic algorithms. Artificial Intelligence, 99(1):1-19.

The MathWorks, Inc. 1998. Neural Network Toolbox, v3.0.1.

Rulequest Research. 1999. Data Mining Tools See5 and C5.0. http://www.rulequest.com/see5-info.html.
