QED: The Edinburgh TREC-2003 Question Answering System

Jochen L. Leidner, Johan Bos, Tiphaine Dalmas, James R. Curran, Stephen Clark, Colin J. Bannard, Bonnie Webber, Mark Steedman
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh, EH8 9LW, UK.
[email protected]

Abstract

This report describes a new open-domain answer retrieval system developed at the University of Edinburgh and gives results for the TREC-12 question answering track. Phrasal answers are identified by increasingly narrowing down the search space from a large text collection to a single phrase. The system uses document retrieval, query-based passage segmentation and ranking, semantic analysis from a wide-coverage parser, and a unification-like matching procedure to extract potential answers. A simple Web-based answer validation stage is also applied. The system is based on the Open Agent Architecture and has a parallel design so that multiple questions can be answered simultaneously on a Beowulf cluster.

1 Introduction

This report describes QED, a new question answering (QA) system developed at the University of Edinburgh for TREC-12. A key feature of QED is the use of natural language processing (NLP) technology at all stages in the QA process; recent papers have shown the benefit of using NLP for QA (Moldovan et al., 2002). In particular, we parse both the question and blocks of text potentially containing an answer, producing dependency graphs which are transformed into a fine-grained semantic interpretation. A matching phase then determines if a potential answer is present, using the relations in WordNet to constrain the answer.

In order to process very large text collections, the system first uses shallow methods to identify text segments which may contain an answer, and these segments are passed to the parser. The segments are identified using a "tiler", which uses simple heuristics based on the words in the question and the text being processed.

We also use additional state-of-the-art text processing tools, including maximum entropy taggers for POS tagging and named entity (NE) recognition. POS tags and NE-tags are used during the construction of the semantic representation. Section 2 describes each component of the system in detail.

The main characteristics of the system architecture are the use of the Open Agent Architecture (OAA) and a parallel design which allows multiple questions to be answered simultaneously on a Beowulf cluster. The architecture is shown in Figure 1.

[Figure 1: The QED system architecture. The diagram traces the 500 questions and the document collection through the preprocessor (tokenizer and indexer), query construction, the search engine and retriever (100 documents), the tiler (100 passages), the C&C POS and NE models, the parser, semantic analysis (consulting WordNet), answer extraction and the reranker, producing at least one answer per question.]

2 Component Description

2.1 Pre-processing and Indexing

The AQUAINT document collection, which forms the basis for TREC-2003, was pre-processed with a set of Perl scripts, one per newspaper collection, to identify and normalize meta-information. This meta-information included the document id and paragraph number, the title, publication date and story location. The markup for these last three fields was inconsistent, or even absent, in the various collections, and so collection-specific extraction scripts were required.

The collection was tokenized offline using a combination of the Penn Treebank sed script and Tom Morton's statistical tokenizer, available from the OpenNLP project. Ratnaparkhi's MXTERMINATOR program was used to perform sentence boundary detection (Reynar and Ratnaparkhi, 1997). The result was indexed with the Managing Gigabytes (MG 1.3g) search engine (Witten et al., 1999). For our TREC-2003 experiments, we used case-sensitive indexing without stop-word removal and without stemming.

2.2 Query Generation and Retrieval

Using ranked document retrieval, we obtained the best 100 documents from MG, using a query generated from the question. The question words were first augmented with their base forms obtained from the morphological analyser of Minnen et al. (2001), which also performs case normalisation. Stopwords were then removed to form the query keywords. Any remaining upper-case keywords were required to be in the returned documents using MG's plus operator. All remaining lower-case keywords were weighted by a factor of 12 (determined by experimentation). Without such a weighting, MG's ranking is too heavily influenced by the upper-case keywords (which are predominantly named entities).
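To make the query construction concrete, the following sketch (ours, in Python, not taken from the system) walks through the three steps just described. The function base_form() is a crude placeholder for the morphological analyser of Minnen et al. (2001), the stopword list is abridged, and the (term, operator, weight) triples merely represent the information handed to MG; they are not MG's actual query syntax.

    # Illustrative sketch of the query construction described above.
    # base_form() stands in for the real morphological analyser; the
    # triples below represent the query abstractly, not MG syntax.

    STOPWORDS = {"when", "did", "do", "the", "a", "an", "of", "in", "was"}

    def base_form(token):
        # Placeholder lemmatiser: case-normalise and strip a plural 's'.
        return token.lower().rstrip("s")

    def build_query(question_tokens):
        # 1. Augment the question words with their base forms.
        terms = set(question_tokens) | {base_form(t) for t in question_tokens}
        # 2. Remove stopwords to obtain the query keywords.
        keywords = [t for t in terms if t.lower() not in STOPWORDS]
        # 3. Upper-case keywords must occur in the returned documents
        #    (MG's plus operator); lower-case keywords are weighted by 12
        #    so that MG's ranking is not dominated by the upper-case
        #    (mostly named-entity) keywords.
        return [(kw, "+", 1) if kw[0].isupper() else (kw, "", 12)
                for kw in keywords]

    print(build_query(["When", "did", "Amtrak", "begin", "operations"]))
    # e.g. [('Amtrak', '+', 1), ('begin', '', 12), ('operation', '', 12), ...]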
2.3 Passage Segmentation and Ranking

Since our approach involves full parsing to obtain grammatical relations in later stages, we need to reduce the amount of text to be processed to a fraction of the amount returned by the search engine. To this end, we have implemented QTILE, a simple query-based text segmentation and passage ranking tool. This "tiler" extracts from the set of documents a set of segments ("tiles") based on the occurrence of relevant words in a query, which comprises the words of the question. A sliding window is shifted sentence by sentence over the text stream, retaining all window tiles that contain at least one of the words in the query and contain all upper-case query words.

Each tile is assigned a score based on the following: the number of non-stopword query word tokens (as opposed to types) found in the tile; a comparison of the capitalization of query occurrence and tile occurrence of a term; and the occurrence of 2-grams and 3-grams in both question and tile. The score for every tile is multiplied with a window function (currently a simple triangle function) which weights sentences in the centre of a window higher than in the periphery.

Our tiler is implemented in C++, has linear asymptotic time complexity and requires constant space. For TREC-2003 we use a window size of 3 sentences and pass forward the top-scoring 100 tiles (with duplicates eliminated using a hash signature test).
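The tile scoring just described can be sketched as follows. This Python fragment is illustrative rather than a port of the C++ tiler: the paper names the signals that are combined (query-word token matches, capitalisation agreement, shared 2- and 3-grams, and a triangular position weight) but not their relative weights, so the constants below are our assumptions.

    # Illustrative reimplementation of QTILE's tile scoring; the real
    # tiler is C++. Feature weights are assumed, not taken from the paper.

    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def tile_score(query_tokens, tile_sentences, stopwords):
        """tile_sentences: one token list per sentence in the window."""
        content = [q for q in query_tokens if q.lower() not in stopwords]
        qset = {q.lower() for q in content}
        centre = (len(tile_sentences) - 1) / 2.0
        score = 0.0
        for i, sent in enumerate(tile_sentences):
            s = 0.0
            for tok in sent:                      # count tokens, not types
                if tok.lower() in qset:
                    s += 1.0                      # query word found in tile
                    if tok in content:
                        s += 0.5                  # capitalisation agrees too
            q_low = [q.lower() for q in query_tokens]
            s_low = [t.lower() for t in sent]
            for n in (2, 3):                      # shared 2-grams and 3-grams
                s += len(ngrams(q_low, n) & ngrams(s_low, n))
            # Triangle window: the centre sentence of the tile counts most.
            weight = 1.0 - abs(i - centre) / (centre + 1.0)
            score += s * weight
        return score

    # A tile qualifies only if it contains at least one query word and every
    # upper-case query word; duplicates are removed by hashing the tile text.

With a window size of 3 sentences, as used for TREC-2003, this triangle function gives the tile's sentences weights of 0.5, 1.0 and 0.5.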
2.4 Tagging and Syntactic Analysis

The C&C maximum entropy POS tagger (Curran and Clark, 2003a) is used to tag the question words and the text segments returned by the tiler. The C&C NE-tagger (Curran and Clark, 2003b) is also applied to the question and text segments, identifying named entities from the standard MUC-7 data set (locations, organisations, persons, dates, times and monetary amounts). The POS tags and NE-tags are used to construct a semantic representation from the output of the parser (see Section 2.5).

We used the RADISP system (Briscoe and Carroll, 2002) to parse the question and the text segments returned by the tiler. The RADISP parser returns syntactic dependencies represented by grammatical relations such as ncsubj (non-clausal subject), dobj (direct object), ncmod (non-clausal modifier), and so on. The set of dependencies for a sentence is annotated with POS and NE information and converted into a graph in Prolog format. The next section contains an example dependency graph.

To increase the quality of the parser output, we reformulated imperatives in "list questions" (e.g. Name countries in Europe) into proper question form (What are countries in Europe?). The RADISP parser was much better at returning the correct dependencies for such questions, largely because the RADISP POS tagger typically assigned the incorrect tag to Name in the imperative form. We applied a similar approach to other question types not handled well by the parser.

2.5 Semantic Analysis

The aim of this component is to build a semantic representation from the output of the parser. It is used for both the question under consideration and the text passages that might contain an answer to the question. The input to the semantic analysis is a set of dependency relations (describing a graph) between syntactic categories, as Figure 2 illustrates. Categories contain the following information: the surface word-form, the lemmatized word-form, the word position in the sentence, the sentence position in the text, named-entity information, and a POS tag defining the category.

    top(1, node('originate', 9) ).
    cat(1, 'croquet',   node('croquet', 8),   'NN1', 'O' ).
    cat(1, 'the',       node('the', 5),       'AT',  'O' ).
    cat(1, 'did',       node('do', 4),        'VDD', 'O' ).
    cat(1, 'originate', node('originate', 9), 'VV0', 'O' ).

Figure 2: A dependency graph in Prolog format (excerpt).

The semantic representation is a discourse representation structure (DRS); the following DRS-conditions are considered: pred(x,S), named(x,S), card(x,S), event(e,S), and argN(e,x), rel(x,y,S), mod(x,S), where e, x, y are discourse referents, S a constant, and N a number between 1 and 3. Questions introduce a special DRS-condition of the form answer(x,T) for a question type T. We call this the answer literal; answer literals play an important role in answer extraction (see Section 2.6).

The semantic analysis is implemented in Prolog and reaches a recall of around 80% (by recall we mean the percentage of categories that contributed semantic information to the DRS).

Note that each passage or question is translated into one single DRS; hence DRSs can span several sentences. Some basic techniques for pronoun resolution are implemented as well. However, to avoid complicating the answer extraction task too much, we only considered non-recursive DRSs in our TREC-2003 implementation, i.e. DRSs without complex conditions introducing nested DRSs for dealing with negation, disjunction, or universal quantification.

Finally, a set of DRS normalisation rules is applied in a post-processing step, dealing with active-passive alternations, question typing, inferred semantic information, and the disambiguation of noun-noun compounds. The resulting DRS is enriched with information about the original surface word-forms and POS tags, by
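To make the flat DRS notation and the role of the answer literal concrete, we close with a small self-contained sketch. The condition names (pred, named, event, argN, rel, answer) follow the paper; the tuple encoding, the invented question and passage DRSs, and the greedy unification-like matcher are our own illustration, far simpler than the system's Prolog implementation and the WordNet-constrained matching of Section 2.6.

    # Illustrative only: flat DRS conditions as tuples, with a toy
    # unification-like matcher. The DRS contents below are invented for a
    # question like "When did croquet originate?".

    QUESTION = [
        ("answer", "x", "time"),          # answer literal for a When-question
        ("pred",   "y", "croquet"),
        ("event",  "e", "originate"),
        ("arg1",   "e", "y"),
    ]
    Q_REFS = {"x", "y", "e"}              # the question's discourse referents

    PASSAGE = [                           # invented passage DRS
        ("pred",  "y1", "croquet"),
        ("event", "e1", "originate"),
        ("arg1",  "e1", "y1"),
        ("rel",   "e1", "t1", "in"),
        ("named", "t1", "1865"),          # a date entity in the passage
    ]

    def match(question, passage, refs):
        """Greedily bind question referents to passage referents so that
        every non-answer question condition has a passage counterpart."""
        bindings = {}
        for q in (c for c in question if c[0] != "answer"):
            for p in passage:
                if p[0] != q[0] or len(p) != len(q):
                    continue
                trial = dict(bindings)
                for qa, pa in zip(q[1:], p[1:]):
                    if qa in refs:
                        if trial.setdefault(qa, pa) != pa:
                            break          # inconsistent referent binding
                    elif qa != pa:
                        break              # constants differ
                else:
                    bindings = trial
                    break                  # condition matched
            else:
                return None                # no counterpart found
        return bindings

    b = match(QUESTION, PASSAGE, Q_REFS)   # {'y': 'y1', 'e': 'e1'}
    # Any passage referent related to the matched event is a candidate
    # filler for the answer literal answer(x, time):
    candidates = [p[2] for p in PASSAGE if p[0] == "rel" and p[1] == b["e"]]
    print(candidates)                      # ['t1']  ->  named(t1, '1865')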