Leveraging Temporal Expressions in IR∗

Time Will Tell: Leveraging Temporal Expressions in IR∗ Irem Arikan Srikanta Bedathur Klaus Berberich Max-Planck Institute for Informatics Saarbrücken, Germany {iarikan, bedathur, kberberi}@mpi-inf.mpg.de Search Engine Y 1. List of state leaders in 1977 ABSTRACT http://en.wikipedia.org/wiki/List of state leaders in 1977 Temporal expressions, such as between 1992 and 2000, are 2. Prime minister frequent across many kinds of documents. Text retrieval, http://en.wikipedia.org/wiki/Prime minister though, treats them as common terms, thus ignoring their 3. List of state leaders in 1976 inherent semantics. For queries with a strong temporal com- en.wikipedia.org/wiki/List of state leaders in 1976 ponent, such as U.S. president 1997, this leads to a decrease 4. List of state leaders in 1974 in retrieval effectiveness, since relevant documents (e.g., a bi- http://en.wikipedia.org/wiki/List of state leaders in 1974 ography of Bill Clinton containing the aforementioned tem- 5. List of state leaders in 1978 poral expression) can not be reliably matched to the query. http://en.wikipedia.org/wiki/List of state leaders in 1978 We propose a novel approach, based on language models, to make temporal expressions first-class citizens of the Search Engine G retrieval model. In addition, we present experiments that 1. List of state leaders in 1977 show actual improvements in retrieval effectiveness. http://en.wikipedia.org/wiki/List of state leaders in 1977 2. French municipal elections, 1977 Categories and Subject Descriptors http://en.wikipedia.org/wiki/French municipal elections, 1977 H.3.3 [Information Search and Retrieval]: Retrieval models 3. France-Albert René http://en.wikipedia.org/wiki/France-Albert René General Terms 4. 1977 Algorithms, Experimentation, Performance http://en.wikipedia.org/wiki/1977 5. Anthony Eden Keywords http://en.wikipedia.org/wiki/Anthony Eden Temporal Information Retrieval, Language modeling Figure 1: Search results “Prime Minister France 1977” 1. INTRODUCTION describing a futuristic science-fiction plot or a Wikipedia ar- Increasing amounts of content, not only created at differ- ticle about the French revolution. ent times but also pertaining to different times, are avail- As a consequence, retrieval effectiveness suffers for queries able on the World Wide Web. Prominent examples of such that have a strong temporal component (e.g., such aimed at content include news articles, blogs, and wikis. Typical ap- finding historical information). To illustrate this problem, proaches to retrieval either treat the temporal expressions consider a user who wants to find out who was prime minis- contained in these documents simply as common terms, or ter of France in 1977. We ran the query “Prime Minister take the creation time of a document as a surrogate for the France 1977” on two popular web search engines – code- temporal context of the document’s content. named Y and G – while restricting the domain of search However, both families of approaches fail to capture the to http://en.wikipedia.org/. The top-5 answers for the semantics inherent to the time dimension. Treating tempo- query are listed in Figure 1. As these results show, the top ral expressions as common terms, on the one hand, ignores result for the query is simply the full list of world leaders their inherent semantics. Unless document and query con- in 1977 – a special feature of Wikipedia that is typically tain exactly the same temporal expression, the document not available in text collections. When ignoring this spe- will not be ranked high in the results. The creation time of cial result, none of the remaining results is relevant to our a document, on the other hand, can be way offthe time the information need. contents of the document pertain to – think of a web page In order to improve retrieval effectiveness for such queries, it is therefore essential to pay special attention to tempo- ∗Partially supported by the EU within the 7th Framework ral expressions contained in documents. In this paper, we Programme under contract 216267 “Living Web Archives address this very issue. Our key contribution is a novel ap- (LiWA)” proach that seamlessly integrates the temporal dimension into a language model based retrieval framework. Exper- Permission to make digital or hard copies of all or part of this work for imental evidence shows that our approach yields improve- personal or classroom use is granted without fee provided that copies are ments in retrieval effectiveness. For instance, when eval- not made or distributed for profit or commercial advantage and that copies uating the above query Prime Minister France 1977 using bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific our approach, we obtain the article about Raymond Barre permission and/or a fee. (http://en.wikipedia.org/Raymond Barre), who was prime min- WSDM’09, February 9–12, 2009, Barcelona, Spain. ister of France at the time of interest, at the second position. 2. MODEL an overlapping interval of a temporal expression given by In this section, we lay out the model and the notation that the user. Intuitively, a document containing temporal ex- will be used throughout the remainder. We let D denote our pressions of these kinds is favorable to documents that do document collection. When modeling the contents of a doc- not contain any relevant temporal expressions. ument d D, we distinguish between terms and temporal Our first approach to take into account temporal expres- expressions∈. Formally, a document consists of a bag of tex- sions follows this intuition in a radical way. The idea behind tual terms dtx and a bag of temporal expressions dte.A the approach, coined LmF, is to not report documents that temporal expression T found in a document is a time inter- do not contain any temporal expressions of relevance to the val T =[b, e ] with a begin boundary b and end boundary user. The approach therefore filters out documents that do e drawn from a time domain. Queries in our setting consist not contain (i) a superinterval, (ii) a subinterval, or (iii) an of a textual part qtx and temporal part qte. The textual overlapping interval of a temporal expression specified in the part qtx is a set of terms and can thus be thought of as user’s query. a standard keyword query. Analogously, the temporal part Formally, LmF reports only documents from the query- qte is a set of temporal expressions that captures the times dependent subset of the collection of interest to the user. As an example, a user interested in D(qte)={ d D | T qte T # dte : T T # = } . (2) who were presidents of the U.S. in the 1990s could formulate ∈ ∃ ∈ ∃ ∈ ∩ & ∅ the query U.S. president 1990s. The relative ranking of result documents is exactly the same as the one obtained from the Ponte and Croft model 3. LEVERAGING for the textual part qtx of the query. In fact, the approach is not dependent on the use of language models, but can TEMPORAL EXPRESSIONS be used with other relevance models as, for instance, Okapi We now proceed to the core of this work and describe how BM25 [14]. Moreover, it can easily be implemented on top temporal expressions can be leveraged to improve retrieval of an existing system as a post-filtering step. effectiveness. 3.3 Weighted Model 3.1 Ponte and Croft’s Model One drawback of the LmF approach just described is that Our approach builds on language models as originally pro- it assumes a black-and-white perspective on the world. A posed by Ponte and Croft [13]. Due to space constraints document is either considered or not – there is no thing we only give an informal description of their approach and in between. In particular, the approach does not take into point to the original work [13] for full details. For a recent account (i) how many relevant temporal expressions a docu- more complete description of language models, we refer to ment contains and (ii) how closely they match the temporal to Manning et al. [9]. expressions specified in the user’s query. With regard to the In Ponte and Croft’s approach each document has a gen- second point and considering our earlier example, consider P t|d erative model of terms associated. The probability ( tx) two documents that talk about earthquakes in San Francisco t d of producing the term from document tx depends on the in April 1906 and the 1900s, respectively. Given otherwise t d term frequency of in tx, but also on the collection fre- equal relevance of the two documents, it is reasonable to t D quency of (i.e., its total number of occurrences in ). favor the first document, as the temporal expression con- Assuming independence for the generation of individual tained is closer to the temporal expression specified in the d q terms, the relevance of document tx to the query tx is user’s query, namely, April 18th, 1906. q then assessed as the probability of generating tx from the Our second approach, coined LmW addresses these issues. d generative model associated with tx, i.e., LmW assigns higher relevance to a document, if it contains more temporal expressions that provide a closer match to P(qtx|dtx)= P(q|dtx) 1.0 − P(q|dtx) . (1) × the temporal part qte of the user’s query. q qtx q q ∈! "∈!tx At the core of LmW lies a generative model for temporal In the remainder, we will refer to the Ponte and Croft expressions.

Load more