Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan

GeoTime Retrieval through Passage-based Learning to Rank ∗

Xiao-Lin Wang¹,², Hai Zhao¹,², Bao-Liang Lu¹,²†
¹Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University
²MOE-Microsoft Key Laboratory for Intelligent Computing and Intelligent Systems, Shanghai Jiao Tong University
800 Dong Chuan Rd., Shanghai 200240, China
[email protected], {zhaohai; blu}@cs.sjtu.edu.cn

ABSTRACT
The NTCIR-9 GeoTime task is to retrieve documents that answer questions about when and where certain events happened. In this paper we propose a Passage-based Learning to Rank (PGLR) method to address this task. The proposed method recognizes texts that are both strongly related to the target topics and contain geographic and temporal expressions. The implemented system provides more accurate search results than a system without PGLR. Performance, according to the official evaluation, is average among submitted systems.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process

General Terms
Algorithms, Performance, Experimentation

Keywords
GeoTime Retrieval, BM25F, Learning to Rank

1. INTRODUCTION
Queries that request geographic and temporal information, such as "where and when did ... happen?", are very common in information retrieval. Therefore, a customized search method that processes geographic and temporal expressions may benefit a large number of users. The NTCIR-9 GeoTime task provides researchers with a platform for testing such methods.

The Center for Brain-like Computing and Machine Intelligence, Shanghai Jiao Tong University (SJTU-BCMI) participates in the NTCIR-9 GeoTime English subtask. We propose a Passage-based GeoTime Learning to Rank (PGLR) method to address this task. PGLR first evaluates the relevance and GeoTime informativeness of each passage in candidate documents, which are retrieved by a BM25F algorithm, and then re-ranks the documents according to the maximum score of their passages.

This paper takes paragraphs as passages, and various methods of feature extraction and ranking are applied. The underlying intuition is that a document could contain both query terms and GeoTime expressions without any semantic relation between them, and such a document would not be a correct answer for GeoTime retrieval. On the other hand, if the query terms and GeoTime expressions appear in the same passage, they are more likely to have a semantic relation, and a document containing such a passage is more likely to be a correct answer.

This paper describes our proposed method and discusses its evaluation results. The rest of the paper is organized as follows: the system is presented in Sec. 2; the submitted runs are described and discussed in Sec. 3; finally, the paper is concluded in Sec. 4.

∗This work was partially supported by the National Natural Science Foundation of China (Grant No. 60903119, Grant No. 61170114 and Grant No. 90820018), the National Basic Research Program of China (Grant No. 2009CB320901), the Science and Technology Commission of Shanghai Municipality (Grant No. 09511502400), and the European Union Seventh Framework Programme (Grant No. 247619).
†Corresponding author

2. SYSTEM DESCRIPTION
The core of our system for NTCIR-9 GeoTime is to employ a passage-based GeoTime Learning to Rank (PGLR) method to improve the relevance related to GeoTime retrieval. Figure 1 presents its framework, where PGLR is used as post-processing after page retrieval and page ranking. The system works as follows:

1. Parse the GeoTime topics into query terms;

2. Find the top-N relevant documents by BM25F similarity;

3. Re-rank with the PGLR algorithm.

The following three subsections give details.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NTCIR-9 Workshop Meeting, Tokyo, Japan. Copyright 2011 ACM X-XXXXX-XX-X/XX/XX ...$10.00.


[Figure 1 (diagram): a GeoTime question is parsed by the GeoTime Query Parser into query unigrams, bigrams, and dependencies; a BM25F search over the articles retrieves candidate documents, which are then re-ranked by Passage-based Learning to Rank.]

Figure 1: Framework of SJTU-BCMI's system for the NTCIR-9 GeoTime task
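The pipeline of Figure 1 can be sketched end-to-end as follows. Every component here is an illustrative stand-in for the real system — the actual query parser, retrieval, and re-ranker are the Lucene/BM25F search of Sec. 2.2 and the SVMrank-based PGLR of Sec. 2.3 — and the stop list, scoring, and passage splitting are toy assumptions.

```python
# Toy sketch of the framework in Figure 1; all components are stand-ins
# for the real Lucene/BM25F retrieval and SVMrank-based PGLR.

def parse_topic(description):
    # Stand-in query parser: keep non-stop unigrams (toy stop list).
    stop = {"when", "where", "did", "the", "was", "in", "of", "and", "it"}
    return [t for t in description.lower().strip("?").split() if t not in stop]

def bm25f_like_top_n(query, articles, n):
    # Stand-in for BM25F search: rank articles by raw term-occurrence counts.
    score = lambda doc: sum(doc.count(t) for t in query)
    return sorted(articles, key=score, reverse=True)[:n]

def passage_rerank(query, docs):
    # Stand-in for PGLR: rank by the best query-term coverage of any passage,
    # where passages are newline-separated paragraphs.
    best = lambda doc: max(sum(t in p for t in query) for p in doc.split("\n"))
    return sorted(docs, key=best, reverse=True)

def geotime_search(description, articles, n=10):
    query = parse_topic(description)                   # step 1: query parsing
    candidates = bm25f_like_top_n(query, articles, n)  # step 2: retrieval
    return passage_rerank(query, candidates)           # step 3: PGLR re-ranking
```

The point of the final step is that a document whose query terms and GeoTime expressions co-occur within a single passage outranks one whose matches are scattered across unrelated paragraphs.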

2.1 GeoTime Topic Parsing
A GeoTime topic is normally represented as a question that starts with "when" and "where", noted as DESCRIPTION, together with a small piece of narrowing description, noted as NARRATIVE. In this step they are parsed into queries. Three kinds of representations – unigram, bigram and dependency – are employed in order to capture the semantics of a GeoTime topic. Tab. 1 gives four examples from NTCIR-9 GeoTime.

The query unigrams, bigrams and dependencies are generated according to the following rules:

• Unigrams of DESCRIPTION (UoD) are the unigrams of the DESCRIPTION not in a customized stop word list. This list consists of common English stop words plus words with high frequency in GeoTime topics, such as "when" and "where".

• Bigrams of DESCRIPTION (BoD) are pairs of UoDs which are neighbors, with stop words skipped.

• Dependencies of DESCRIPTION (DoD) are the results of the Stanford parser¹ [3, 1] that contain at least one UoD.

• Unigrams of NARRATIVE (UoN) are the unigrams which appear in the bigrams and dependencies of NARRATIVE (BoN and DoN).

• Bigrams of NARRATIVE (BoN) are those neighboring non-stop unigrams in NARRATIVE that have at least one UoD.

• Dependencies of NARRATIVE (DoN) are those dependencies in NARRATIVE that have at least one UoD.

2.2 Document Retrieval with BM25F
The document retrieval component is a modified Lucene search engine. Lucene is an open-source search engine from the Apache Software Foundation². The default similarity model of Lucene is a combination of a boolean model and a vector space model³. We implemented the standard BM25F similarity formula on Lucene [6, 9], which slightly raises the accuracy according to our pilot experiments.

The implementation of BM25F follows Eq. 1, as presented in [4]:

    BM25F(d, q) = \sum_{t \in q} idf(t) \cdot \frac{tf(t, d)}{k_1 + tf(t, d)}    (1)

    tf(t, d) = \sum_{c \in d} w_c \cdot tf_c(t, d)    (2)

    tf_c(t, d) = \frac{occur_c(t, d)}{1 - b_c + b_c \cdot \frac{l_{d,c}}{l_c}}    (3)

    idf(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}    (4)

where d is a document, q is a query, t represents a word, c represents a field contained in d, occur_c(t, d) is the number of occurrences of t in c of d, l_{d,c} is the length of c in d, l_c is the average length of c, N is the number of documents, n(t) is the number of documents containing t, and k_1, w_c and b_c are parameters. tf_c(t, d) is called the field term frequency function, tf(t, d) is called the term frequency function, and idf(t) is called the inverse document frequency.

In our system, the title and body of a document are taken as the two fields in BM25F (see Sec. 3 for details).

2.3 Passage-based GeoTime Learning to Rank (PGLR)
The intuition of PGLR is that if a document has a small piece of text that is both strongly related to the GeoTime topic and contains temporal and geographic expressions, this document is highly likely to be a correct one. After observing many real-world news documents, the passage is taken as the unit of computation.

Learning to rank [7] is used to evaluate the rightness of each passage given a GeoTime topic. The passages of previously retrieved documents are first converted to vector representations based on a group of features; then a trained SVMrank [2]⁴ predicts the rightness of these passages. In the end, the documents are re-ranked according to the maximum rightness of their passages.

The following nine features are used to represent passages:

• the document's BM25F score
• the TF·IDF similarity of DESCRIPTION's unigrams
• the TF·IDF similarity of DESCRIPTION's bigrams
• the TF·IDF similarity of DESCRIPTION's dependencies

¹http://nlp.stanford.edu/software/lex-parser.shtml
²http://lucene.apache.org/
³http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.
⁴http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
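Eqs. 1–4 can be sketched in code as follows. This is a minimal sketch assuming a toy two-field document representation (token lists for "title" and "body"), not the actual Lucene implementation; the default parameters follow the tuned values reported in Table 2 (footnote a).

```python
import math

def bm25f_score(query, doc, corpus, k1=2.0,
                w={"title": 4.0, "body": 2.0}, b={"title": 0.1, "body": 0.1}):
    """BM25F score of `doc` for `query` following Eqs. 1-4.

    `doc` and each document in `corpus` map field names ("title", "body")
    to lists of tokens; `w` and `b` are the per-field weight and
    length-normalization parameters.
    """
    N = len(corpus)
    # l_c: average length of field c over the corpus (Eq. 3)
    avg_len = {c: sum(len(d[c]) for d in corpus) / N for c in w}
    score = 0.0
    for t in set(query):
        # n(t): number of documents containing t in any field (Eq. 4)
        n_t = sum(any(t in d[c] for c in w) for d in corpus)
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        # tf(t, d) = sum over fields of w_c * tf_c(t, d) (Eqs. 2 and 3)
        tf = sum(
            w[c] * doc[c].count(t)
            / (1 - b[c] + b[c] * len(doc[c]) / avg_len[c])
            for c in w
        )
        score += idf * tf / (k1 + tf)  # Eq. 1
    return score
```

Note how the per-field weight w_c lets a title match contribute more than a body match, while the saturation term k_1 + tf(t, d) bounds the gain from repeated occurrences.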


Table 1: Examples of Parsing GeoTime Questions into Queries

Topic 26
(D) Where and when did the space shuttle Columbia disaster take place?
    unigrams: space, shuttle, columbia, disaster
    bigrams: space shuttle; shuttle columbia; columbia disaster
    dependencies: shuttle space; disaster columbia; shuttle disaster
(N) The user wants to know when and in what state the space shuttle Columbia exploded.
    unigrams: state, space, shuttle, columbia, explode
    bigrams: state space; space shuttle; shuttle columbia; columbia explode
    dependencies: columbia space; columbia shuttle; explode columbia

Topic 27
(D) When was the last flight of Concorde and where did it land?
    unigrams: last, flight, concorde, land
    bigrams: last flight; flight concorde; concorde land
    dependencies: flight last; flight concorde
(N) The user wants to know when was the last time that the supersonic airliner Concorde flew and in what city it landed.
    unigrams: last, time, supersonic, airliner, concorde, fly, city, land
    bigrams: last time; airliner concorde; concorde fly; city land; time land
    dependencies: time last; concorde supersonic; concorde airliner; fly concorde; land fly

Topic 28
(D) When and where were the Washington beltway snipers arrested?
    unigrams: washington, beltway, sniper, arrest
    bigrams: washington beltway; beltway sniper; sniper arrest
    dependencies: arrest washington; sniper beltway; washington sniper
(N) The user wants to know when and in what state were the Washington snipers arrested who killed ten people and critically injured three more.
    unigrams: state, washington, sniper, arrest, kill
    bigrams: state washington; washington sniper; sniper arrest; arrest kill
    dependencies: sniper state; sniper washington; sniper arrest; sniper kill

Topic 29
(D) When was the euro put in circulation and which three member states of the eurozone by that time declined its use?
    unigrams: euro, put, circulation, three, member, state, eurozone, time, decline, use
    bigrams: euro put; put circulation; circulation three; three member; member state; state eurozone; eurozone time; time decline; decline use
    dependencies: put euro; put circulation; state three; state member; decline state; state eurozone; decline time; decline use
(N) The user wants to know when and in what state were the Washington snipers arrested who killed ten people and critically injured three more.
    unigrams: exact, euro, official, currency, eurozone, enter, circulation, information, 15, member, state, european, union, adopt, require
    bigrams: exact euro; euro official; currency eurozone; eurozone enter; enter circulation; circulation information; 15 member; member state; state european; adopt euro; euro require
    dependencies: enter euro; euro currency; currency eurozone; state member; adopt state; state union; adopt euro; euro require
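The UoD/BoD rules of Sec. 2.1 can be illustrated on topic 27's DESCRIPTION. The tokenizer and stop list below are simplified assumptions — the paper's customized list also covers GeoTime-frequent words such as "when" and "where", which this toy list includes explicitly.

```python
# Illustrative UoD/BoD extraction (Sec. 2.1); the stop list is a toy
# approximation of the paper's customized list.
STOP_WORDS = {"when", "was", "the", "of", "and", "where", "did", "it",
              "in", "a", "to", "what", "were", "who", "that", "by", "its"}

def parse_description(description):
    tokens = [t.strip("?.,") for t in description.lower().split()]
    # UoD: unigrams of the DESCRIPTION not in the stop word list
    unigrams = [t for t in tokens if t not in STOP_WORDS]
    # BoD: neighboring pairs of UoDs, with stop words skipped in between
    bigrams = [f"{a} {b}" for a, b in zip(unigrams, unigrams[1:])]
    return unigrams, bigrams

uod, bod = parse_description(
    "When was the last flight of Concorde and where did it land?")
```

On this topic the sketch reproduces the unigram and bigram columns of Table 1, including the bigram "concorde land", which pairs two UoDs separated only by stop words.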


• the TF·IDF similarity of NARRATIVE's unigrams
• the TF·IDF similarity of NARRATIVE's bigrams
• the TF·IDF similarity of NARRATIVE's dependencies
• whether it contains temporal named entities recognized by Stanford's parser
• whether it contains geographic named entities recognized by Stanford's parser

3. EXPERIMENTS

3.1 Experimental Setting
Four runs of SJTU-BCMI were submitted to the NTCIR-9 GeoTime task. Table 2 summarizes the detailed approach of each run. The parameters of BM25F are tuned to maximize performance on the dataset of the NTCIR-8 GeoTime task. The features are transferred into vectors for classification. SVMrank is used to learn the weights of the features, and the resulting weights are presented in the last column of Table 2.

3.2 Experimental Results
NTCIR-9 GeoTime employs five evaluation metrics: MAP, Q, nDCG@10, nDCG@100 and nDCG@1000. The evaluation results for SJTU-BCMI's runs are presented in Table 3. The runs using the PGLR component provide higher accuracies as expected: the scores of RUN2 and RUN4 are higher than those of RUN1 and RUN3.

4. CONCLUSIONS
In this paper, we proposed a Passage-based GeoTime Learning to Rank (PGLR) method to address the problem of GeoTime retrieval. The implemented system achieves above-average performance on the NTCIR-9 GeoTime task. The proposed PGLR clearly improves the GeoTime-related relevance according to the official evaluations of the comparison experiments.

The idea of PGLR is essentially to recognize the rightness of a piece of text serving as the answer to a GeoTime topic (or question). In this paper we employ a linear model whose inputs are TF·IDF similarities and the presence of temporal and geographic expressions. In the future, we plan to incorporate more technologies from the domain of automatic question answering [5, 8].

5. REFERENCES
[1] M. De Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006. Citeseer, 2006.
[2] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.
[3] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[4] J. Pérez-Agüera, J. Arroyo, J. Greenberg, J. Iglesias, and V. Fresno. Using BM25F for semantic search. In Proceedings of the 3rd International Semantic Search Workshop, pages 1–8. ACM, 2010.
[5] D. Ravichandran and E. Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 41–47. Association for Computational Linguistics, 2002.
[6] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109–126. NIST, 1995.
[7] A. Trotman. Learning to rank. Information Retrieval, 8(3):359–381, 2005.
[8] E. Voorhees and H. Dang. Overview of the TREC 2002 question answering track. NIST Special Publication SP, pages 57–68, 2003.
[9] H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC 2004. Citeseer, 2004.


Table 2: Description of SJTU-BCMI submitted runs to the GeoTime English subtask

Runs                  NARRATIVE?  Doc. Retrieval  PGLR? (weights)
SJTUBCMI-EN-EN-01-DN  Yes         BM25F^a         No
SJTUBCMI-EN-EN-02-DN  Yes         BM25F^a         Yes (6.8, 1.1, 0.1, 0.0, 1.2, 0.3, 0.0, 0.6, 0.5)^b
SJTUBCMI-EN-EN-03-D   No          BM25F^a         No
SJTUBCMI-EN-EN-04-D   No          BM25F^a         Yes (6.8, 1.1, 0.1, 0.0, 0.6, 0.5)^c

^a The parameters are k_1 = 2, b_title = 0.1, w_title = 4, b_main = 0.1, w_main = 2.
^b Weights correspond to the features in Sec. 2.3.
^c Weights correspond to the features in Sec. 2.3 with the three NARRATIVE similarities removed.
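Given learned weights, the PGLR re-ranking step of Sec. 2.3 reduces to a dot product per passage followed by a maximum over each document's passages. The sketch below uses the nine-dimensional weight vector of run SJTUBCMI-EN-EN-02-DN from Table 2; the feature values in the usage example are invented for illustration.

```python
# Re-ranking sketch using the nine feature weights learned for
# SJTUBCMI-EN-EN-02-DN (Table 2); feature vectors below are invented.
WEIGHTS_RUN02 = (6.8, 1.1, 0.1, 0.0, 1.2, 0.3, 0.0, 0.6, 0.5)

def passage_score(features, weights=WEIGHTS_RUN02):
    # Linear model learned by SVMrank: dot product with the feature vector.
    return sum(w * f for w, f in zip(weights, features))

def pglr_rerank(doc_passages):
    """Order document ids by the maximum score over their passages.

    `doc_passages` maps a document id to a list of nine-dimensional
    per-passage feature vectors (Sec. 2.3).
    """
    best = {doc_id: max(passage_score(p) for p in passages)
            for doc_id, passages in doc_passages.items()}
    return sorted(best, key=best.get, reverse=True)
```

The large first weight (6.8, on the document's BM25F score) suggests the learned model treats the passage features as a correction on top of the document-level ranking rather than a replacement for it.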

Table 3: Evaluation of SJTU-BCMI submitted runs to the NTCIR-9 GeoTime English subtask

RunName               MAP     Q       nDCG@10  nDCG@100  nDCG@1000
SJTUBCMI-EN-EN-01-DN  0.3326  0.3557  0.4498   0.4511    0.5772
SJTUBCMI-EN-EN-02-DN  0.3648  0.3884  0.5127   0.4977    0.6045
SJTUBCMI-EN-EN-03-D   0.2885  0.3111  0.4217   0.4085    0.5338
SJTUBCMI-EN-EN-04-D   0.3141  0.3370  0.4429   0.4523    0.5567
MAXIMUM               0.5549  0.5738  0.7379   0.6883    0.7654
AVERAGE               0.3517  0.3710  0.4794   0.4676    0.5684
MEDIAN                0.3326  0.3512  0.4591   0.4563    0.5772
MINIMUM               0.1392  0.1474  0.1712   0.2409    0.3195
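For reference, the nDCG@k columns of Table 3 follow the usual pattern of discounted cumulative gain normalized by that of the ideal ranking. The sketch below uses the common rel_i / log2(i + 1) gain; evaluation campaigns differ in the exact gain function and treatment of unjudged documents, so this is only illustrative of the metric, not the official NTCIR evaluation tool.

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k of a ranked list of graded relevance labels (illustrative;
    the official NTCIR evaluation may use a different formulation)."""
    def dcg(rels):
        # Discounted cumulative gain over the top-k positions.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that places the most relevant documents first scores 1.0; swapping a relevant document below an irrelevant one lowers the score, with early positions discounted least.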
