
A Graphical Framework for Contextual Search and Name Disambiguation in Email

Einat Minkov, Language Technologies Inst., Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]
William W. Cohen, Machine Learning Dept., Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]
Andrew Y. Ng, Computer Science Dept., Stanford University, Stanford, CA 94305, [email protected]

Abstract

Similarity measures for text have historically been an important tool for solving information retrieval problems. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a structural graph. The suggested framework is evaluated for the task of disambiguating names in email documents. We show that reranking schemes based on the graph-walk similarity measures often outperform baseline methods, and that further improvements can be obtained by use of appropriate learning methods.

1 Introduction

Many tasks in information retrieval can be performed by clever application of textual similarity metrics. In particular, the canonical IR problem of ad hoc retrieval is often formulated as the task of finding documents "similar to" a query. In modern IR settings, however, documents are usually not isolated objects: instead, they are frequently connected to other objects, via hyperlinks or meta-data. (An email message, for instance, is connected via header information to other emails in the same thread and also to the recipient's social network.) Thus it is important to understand how text-based document similarity measures can be extended to documents embedded in complex structural settings.

Our similarity metric is based on a lazy graph walk, and is closely related to the well-known PageRank algorithm (Page et al., 1998). PageRank and its variants are based on a graph walk of infinite length with random resets. In a lazy graph walk, there is instead a fixed probability of halting the walk at each step. In previous work (Toutanova et al., 2004), lazy walks over graphs were used for estimating word dependency distributions; in that case, the graph was constructed especially for the task, and its edges represented different flavors of word-to-word similarity. Other recent papers have also used walks over graphs for query expansion (Xi et al., 2005; Collins-Thompson and Callan, 2005). In these tasks, the walk propagates similarity to a start node through edges in the graph, incidentally accumulating evidence of similarity over multiple connecting paths.

In contrast to this previous work, we consider schemes for propagating similarity across a graph that naturally models a structured dataset like an email corpus: entities correspond to objects including email addresses and dates (as well as the usual types of documents and terms), and edges correspond to relations like sent-by. We view the similarity metric as a tool for performing search across this structured dataset, in which related entities that are not directly similar to a query can be reached via a multi-step graph walk.

In this paper, we formulate and evaluate this extended similarity metric. The principal problem we

consider is disambiguating personal names in email, which we formulate as the task of retrieving the person most related to a particular name mention. We show that for this task, the graph-based approach improves substantially over plausible baselines. After retrieval, learning can be used to adjust the ranking of retrieved names based on the edges in the paths traversed to find these names, which leads to an additional performance improvement. Name disambiguation is a particular application of the suggested general framework, which is also applicable to any real-world setting in which structural data is available as well as text.

This paper proceeds as follows. Sections 2 and 3 formalize the general framework and its instantiation for email. Section 4 gives a short summary of the learning approach. Section 5 includes experimental evaluation, describing the corpora and results for the person name disambiguation task. The paper concludes with a review of related work, a summary, and future directions.

Workshop on TextGraphs, at HLT-NAACL 2006, pages 1-8, New York City, June 2006. (c) 2006 Association for Computational Linguistics

2 Email as a Graph

A graph G consists of a set of nodes and a set of labeled directed edges. Nodes will be denoted by letters like x, y, or z, and an edge from x to y with label ℓ will be denoted x −ℓ→ y. Every node x has a type, denoted T(x), and we assume that there is a fixed set of possible types. We also assume, for convenience, that there are no edges from a node to itself (this assumption can easily be relaxed).

We use these graphs to represent real-world data. Each node represents some real-world entity, and each edge x −ℓ→ y asserts that some binary relation ℓ(x, y) holds. The entity types used here to represent an email corpus are shown in the leftmost column of Table 1. They include the traditional types in information retrieval systems, namely file and term. In addition, however, they include the types person, email-address and date. These entities are constructed from a collection of email messages in the obvious way: for example, a recipient of "Einat Minkov <[email protected]>" indicates the existence of a person node "Einat Minkov" and an email-address node "[email protected]". (We assume here that person names are unique identifiers.)

The graph edges are directed. We assume that edge labels determine the source and target node types: i.e., if x −ℓ→ z and w −ℓ→ y, then T(w) = T(x) and T(y) = T(z). However, multiple relations can hold between any particular pair of node types: for instance, it could be that x −ℓ→ y or x −ℓ′→ y, where ℓ ≠ ℓ′. (For instance, an email message x could be sent-from y, or sent-to y.) Note also that edges need not denote functional relations: for a given x and ℓ, there may be many distinct nodes y such that x −ℓ→ y. For instance, for a file x, there are many distinct terms y such that x −has-term→ y holds.

In representing email, we also create an inverse label ℓ⁻¹ for each edge label (relation) ℓ. Note that this means that the graph will definitely be cyclic. Table 1 gives the full set of relations used in our email representation scheme.

3 Graph Similarity

3.1 Edge weights

Similarity between two nodes is defined by a lazy walk process, and a walk on the graph is controlled by a small set of parameters Θ. To walk away from a node x, one first picks an edge label ℓ; then, given ℓ, one picks a node y such that x −ℓ→ y. We assume that the probability of picking the label ℓ depends only on the type T(x) of the node x, i.e., that the probability of following an edge with label ℓ out of node x is:

    Pr(ℓ | x) = Pr(ℓ | T_i) ≡ θ_{ℓ,T_i}    where T_i = T(x)

Let S_{T_i} be the set of possible labels for an edge leaving a node of type T_i. We require that the weights over all outgoing edge types given the source node type form a probability distribution, i.e., that

    Σ_{ℓ ∈ S_{T_i}} θ_{ℓ,T_i} = 1

In this paper, we will assume that once ℓ is picked, y is chosen uniformly from the set of all y such that x −ℓ→ y. That is, the weight of an edge with label ℓ connecting source node x to node y is:

    Pr(x −ℓ→ y | ℓ) = θ_{ℓ,T_i} / |{y′ : x −ℓ→ y′}|

This assumption could easily be generalized, however: for instance, for the type T(x) = file and ℓ = has-term, the weights for terms y such that x −ℓ→ y might be distributed according to an appropriate language model (Croft and Lafferty, 2003).
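As a concrete illustration, the typed graph and the uniform edge weights described above can be sketched in Python. This is a minimal sketch: the node names, labels, and θ values are invented for illustration, not taken from the actual system.

```python
from collections import defaultdict

# Typed, labeled directed graph: edges[(x, label)] -> list of target nodes y.
# As in the email representation, every edge also gets an inverse-labeled twin.
edges = defaultdict(list)

def add_edge(x, label, y, inv_label=None):
    edges[(x, label)].append(y)
    edges[(y, inv_label or label + "-inv")].append(x)

add_edge("msg1", "sent-from", "Einat Minkov")
add_edge("msg1", "has-term", "graph")
add_edge("msg1", "has-term", "walk")

# theta[(label, source_type)] = Pr(l | T(x)); for each source type, the
# values over its outgoing labels must sum to 1.
theta = {("sent-from", "file"): 0.5, ("has-term", "file"): 0.5}

def edge_weight(x, x_type, label, y):
    """Pr(x -l-> y | l) = theta_{l,T(x)} / |{y' : x -l-> y'}| (uniform over y)."""
    fanout = edges[(x, label)]
    return theta[(label, x_type)] / len(fanout) if y in fanout else 0.0

print(edge_weight("msg1", "file", "has-term", "graph"))  # 0.5 / 2 = 0.25
```

The inverse edges make the graph cyclic by construction, which is what allows a walk to return from, e.g., a person node back to the files that mention it.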

    source type     edge type            target type
    file            sent-from            person
                    sent-from-email      email-address
                    sent-to              person
                    sent-to-email        email-address
                    date-of              date
                    has-subject-term     term
                    has-term             term
    person          sent-from⁻¹          file
                    sent-to⁻¹            file
                    alias                email-address
                    has-term             term
    email-address   sent-to-email⁻¹      file
                    sent-from-email⁻¹    file
                    alias-inverse        person
                    is-email⁻¹           term
    term            has-term⁻¹           file
                    has-subject-term⁻¹   file
                    is-email             email-address
                    has-term⁻¹           person
    date            date-of⁻¹            file

Table 1: Graph structure: node and relation types

3.2 Graph walks

Conceptually, the edge weights above define the probability of moving from a node x to some other node y. At each step in a lazy graph walk, there is also some probability γ of staying at x. Putting these together, and denoting by M_xy the probability of being at node y at time t+1 given that one is at x at time t, we define:

    M_xy = (1 − γ) · Σ_ℓ Pr(x −ℓ→ y | ℓ) · Pr(ℓ | T(x))    if x ≠ y
    M_xy = γ                                               if x = y

If we associate nodes with integers, and make M a matrix indexed by nodes, then a walk of k steps can be defined by matrix multiplication: specifically, if V_0 is some initial probability distribution over nodes, then the distribution after a k-step walk is proportional to V_k = V_0 · M^k. Larger values of γ increase the weight given to shorter paths between x and y. In the experiments reported here, we consider small values of k, and this computation is carried out directly using sparse-matrix multiplication methods.[1]

If V_0 gives probability 1 to some node x and probability 0 to all other nodes, then the value given to y in V_k can be interpreted as a similarity measure between x and y.

In our framework, a query is an initial distribution V_q over nodes, plus a desired output type T_out, and the answer is a list of nodes y of type T_out, ranked by their score in the distribution V_k. For instance, for an ordinary ad hoc document retrieval query (like "economic impact of recycling tires"), an appropriate V_q would be a distribution over the query terms, with T_out = file. Replacing T_out with person would find the person most related to the query, e.g., an email contact heavily associated with the economics of retreads. Replacing V_q with a point distribution over a particular document would find the people most closely associated with the given document.

[1] We have also explored an alternative approach based on sampling; this method scales better, but introduces some additional variance into the procedure, which is undesirable for experimentation.

3.3 Relation to TF-IDF

It is interesting to view this framework in comparison to more traditional IR methods. Suppose we restrict ourselves to two types, terms and files, and allow only in-file edges. Now consider an initial query distribution V_q which is uniform over the two terms "the aardvark". A one-step matrix multiplication will result in a distribution V_1 which includes file nodes. The common term "the" will spread its probability mass into small fractions over many file nodes, while the unusual term "aardvark" will spread its weight over only a few files: hence the effect will be similar to the use of an IDF weighting scheme.
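The k-step lazy walk, and the IDF-like effect just described, can be sketched as follows. The sketch is pure Python with dictionaries for clarity; the two-term, three-file graph is invented, and a real implementation would use sparse-matrix methods as noted above.

```python
def lazy_walk(transitions, v0, gamma=0.5, k=2):
    """V_k from V_0 under M: stay put with probability gamma, otherwise
    move according to the edge-weight transition probabilities."""
    v = dict(v0)
    for _ in range(k):
        nxt = {node: gamma * p for node, p in v.items()}
        for x, p in v.items():
            for y, q in transitions.get(x, {}).items():
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * q
        v = nxt
    return v

# Two terms, three files, in-file edges only: "the" occurs in every file,
# "aardvark" only in f1.
transitions = {
    "the": {"f1": 1 / 3, "f2": 1 / 3, "f3": 1 / 3},
    "aardvark": {"f1": 1.0},
}
v1 = lazy_walk(transitions, {"the": 0.5, "aardvark": 0.5}, gamma=0.5, k=1)
# The rare term concentrates its mass, so f1 ends up well ahead of f2 and f3.
print(sorted(v1.items(), key=lambda kv: -kv[1]))
```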
4 Learning

As suggested by the comments above, this graph framework could be used for many types of tasks, and it is unlikely that a single set of parameter values will be best for all tasks. It is thus important to consider the problem of learning how to better rank graph nodes.

Previous researchers have described schemes for adjusting the parameters θ using gradient-descent-like methods (Diligenti et al., 2005; Nie et al., 2005). In this paper, we suggest an alternative approach: learning to re-order an initial ranking. This reranking approach has been used in the past for meta-search (Cohen et al., 1999) and also for several natural-language tasks (Collins and Koo, 2005). The advantage of reranking over parameter tuning is that the learned classifier can take advantage of "global" features that are not easily used in a walk.

Note that node reranking, while it can be used as an alternative to weight manipulation, is better viewed as a complementary approach, as the two techniques can be naturally combined: first tune the parameters θ, and then rerank the result using a classifier which exploits non-local features. This hybrid approach has been used successfully in the past on tasks like parsing (Collins and Koo, 2005).

We give here a short overview of the reranking approach, which is described in detail elsewhere (Collins and Koo, 2005). The reranking algorithm is provided with a training set containing n examples. Example i (for 1 ≤ i ≤ n) includes a ranked list of l_i nodes. Let w_ij be the jth node for example i, and let p(w_ij) be the probability assigned to w_ij by the graph walk. A candidate node w_ij is represented through m features, which are computed by m feature functions f_1, ..., f_m. We require that the features be binary; this restriction allows a closed-form parameter update. The ranking function for node x is defined as:

    F(x, ᾱ) = α_0 · L(x) + Σ_{k=1}^{m} α_k · f_k(x)

where L(x) = log(p(x)) and ᾱ is a vector of real-valued parameters. Given a new test example, the output of the model is the given node list re-ranked by F(x, ᾱ).

To learn the parameter weights ᾱ, we use a boosting method (Collins and Koo, 2005), which minimizes the following loss function on the training data:

    ExpLoss(ᾱ) = Σ_i Σ_{j=2}^{l_i} e^{−(F(x_{i,1}, ᾱ) − F(x_{i,j}, ᾱ))}

where x_{i,1} is, without loss of generality, a correct target node. The weights for the function are learned with a boosting-like method, where in each iteration the feature f_k that has the most impact on the loss function is chosen, and α_k is modified. Closed-form formulas exist for calculating the optimal additive updates and their impact per feature (Schapire and Singer, 1999).

5 Evaluation

We experiment with three separate corpora. The Cspace corpus contains email messages collected from a management course conducted at Carnegie Mellon University in 1997 (Minkov et al., 2005). In this course, MBA students, organized in teams of four to six members, ran simulated companies in different market scenarios. The corpus we used here includes the emails of all teams over a period of four days. The Enron corpus is a collection of mail from the Enron corporation that has been made available for the research community (Klimt and Yang, 2004). Here, we used the saved email of two different users.[2] To eliminate spam and news postings, we removed email files sent from email addresses with the suffix ".com" that are not Enron's, as well as widely distributed email files (sent from "enron.announcement", to "[email protected]", etc.). Text from forwarded or replied-to messages was also removed from the corpus.

[2] Specifically, we used the "all documents" folder, including both incoming and outgoing files.

Table 2 gives the size of each processed corpus and the number of nodes in its graph representation. In deriving terms for the graph, terms were Porter-stemmed and stop words were removed. The processed Enron corpora are available from the first author's home page.

    corpus      files   nodes   person set
                                train   test
    Cspace        821    6248     26     80
    Sager-E      1632    9753     11     51
    Shapiro-R     978   13174     11     49

Table 2: Corpora details

5.1 Person Name Disambiguation

5.1.1 Task definition

Consider an email message containing a common name like "Andrew". Ideally an intelligent mailer would, like the user, understand which person "Andrew" refers to, and would rapidly perform tasks like retrieving Andrew's preferred email address or home page. Resolving the referent of a person name is also an important complement to the ability to perform named entity extraction for tasks like social network analysis or studies of social interaction in email.
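The formulation used throughout this task, mapping a name mention to an initial query distribution V_q over graph nodes, can be sketched as follows. The node naming here is invented, and the option of sharing mass with the containing file anticipates the file+term variant evaluated in Section 5.2.3.

```python
def make_query(name_term, message_id=None):
    """V_q for a name mention: all mass on the mention's term node, or
    (file+term variant) split evenly with the node of the containing file."""
    if message_id is None:
        return {("term", name_term): 1.0}
    return {("term", name_term): 0.5, ("file", message_id): 0.5}

print(make_query("andrew"))           # term-only query
print(make_query("andrew", "msg17"))  # file+term query
```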

However, although the referent of the name is unambiguous to the recipient of the email, it can be non-trivial for an automated system to find out which "Andrew" is indicated. Automatically determining that "Andrew" refers to "Andrew Y. Ng" and not "Andrew McCallum" (for instance) is especially difficult when an informal nickname is used, or when the mentioned person does not appear in the email header. As noted above, we model this problem as a search task: based on a name mention in an email message m, we formulate a query distribution V_q, and then retrieve a ranked list of person nodes.

5.1.2 Data preparation

Unfortunately, building a corpus for evaluating this task is non-trivial, because (if trivial cases are eliminated) determining a name's referent is often non-trivial for a human other than the intended recipient. We evaluated this task using three labeled datasets, as detailed in Table 2.

The Cspace corpus has been manually annotated with personal names (Minkov et al., 2005). Additionally, with the corpus, there is a great deal of information available about the composition of the individual teams, the way the teams interact, and the full names of the team members. Using this extra information it is possible to manually resolve name mentions. We collected 106 cases in which single-token names were mentioned in the body of a message but did not match any name from the header. Instances for which there was not sufficient information to determine a unique person entity were excluded from the example set. In addition to names that refer to people who are simply not in the header, the names in this corpus include people who are in the email header but cannot be matched because they are referred to using: initials, as is commonly done in the sign-off to an email; nicknames, including common nicknames (e.g., "Dave" for "David") and unusual nicknames (e.g., "Kai" for "Keiko"); or American names adopted in place of a foreign name (e.g., "Jenny" for "Qing").

For Enron, two datasets were generated automatically. We collected name mentions which correspond uniquely to names that appear on the email "Cc" header line; then, to simulate a non-trivial matching task, we eliminated the collected person name from the email header. We also used a small dictionary of 16 common American nicknames to identify nicknames that mapped uniquely to full person names on the "Cc" header line.

For each dataset, some examples were picked randomly and set aside for learning and evaluation purposes.

                initials   nicknames   other
    Cspace        11.3%      54.7%     34.0%
    Sager-E          -       10.2%     89.8%
    Shapiro-R        -       15.0%     85.0%

Table 3: Person name disambiguation datasets

5.2 Results for person name disambiguation

5.2.1 Evaluation details

All of the methods applied generate a ranked list of person nodes, and there is exactly one correct answer per example.[3] Figure 1 gives results[4] for two of the datasets as a function of recall at rank k, up to rank 10. Table 4 shows the mean average precision (MAP) of the ranked lists as well as accuracy, which we define as the percentage of correct answers at rank 1 (i.e., precision at rank 1).

[3] If a ranking contains a block of items with the same score, a node's rank is counted as the average rank of the "block".
[4] Results refer to test examples only.

5.2.2 Baseline method

To our knowledge, there are no previously reported experiments for this task on email data. As a baseline, we apply a reasonably sophisticated string matching method (Cohen et al., 2003). Each name mention in question is matched against all of the person names in the corpus. The similarity score between the name term and a person name is calculated as the maximal Jaro similarity score (Cohen et al., 2003) between the term and any single token of the personal name (ranging between 0 and 1). In addition, we incorporate a nickname dictionary,[5] such that if the name term is a known nickname of a name, the similarity score of that pair is set to 1.

[5] The same dictionary that was used for dataset generation.

The results are shown in Figure 1 and Table 4. As can be seen, the baseline approach is substantially less effective for the more informal Cspace dataset. Recall that the Cspace corpus includes many cases such as initials, and also nicknames that have no literal resemblance to the person's name (Section 5.1.2), which are not handled well by the string similarity approach. For the Enron datasets, the baseline approach performs generally better (Table 4). In all the corpora there are many ambiguous instances, e.g., common names like "Dave" or "Andy" that match many people with equal strength.
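The string-matching baseline can be sketched as below. The sketch is illustrative only: the nickname dictionary is a two-entry stub, and difflib.SequenceMatcher.ratio() stands in for the Jaro similarity used by the actual method.

```python
from difflib import SequenceMatcher

NICKNAMES = {"dave": "david", "andy": "andrew"}  # stub nickname dictionary

def name_score(mention, person_name):
    """Max per-token similarity between the mention and the person name;
    a known nickname match overrides the score with 1.0."""
    mention = mention.lower()
    best = 0.0
    for token in person_name.lower().split():
        if NICKNAMES.get(mention) == token:
            return 1.0
        best = max(best, SequenceMatcher(None, mention, token).ratio())
    return best

people = ["David Smith", "Andrew Ng", "Andrew McCallum"]
ranked = sorted(people, key=lambda p: name_score("Dave", p), reverse=True)
print(ranked[0])  # "David Smith", via the nickname entry
```

Note the failure mode discussed above: a mention like "Andy" scores 1.0 against every "Andrew" in the corpus, so ambiguous mentions tie and string matching alone cannot break the tie.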

5.2.3 Graph walk methods

We perform two variants of the graph walk, corresponding to different methods of forming the query distribution V_q. Unless otherwise stated, we use a uniform weighting of labels, i.e., θ_{ℓ,T} = 1/|S_T|; γ = 1/2; and a walk of length 2.

In the first variant, we concentrate all the probability in the query distribution on the name term. The column labeled term gives the results of the graph walk from this probability vector. Intuitively, using this variant, the name term propagates its weight to the files in which it appears, and weight is then propagated to person nodes which co-occur frequently with these files. Note that in our graph scheme there is also a direct path from terms to person names, so person names receive weight in that way as well.

As can be seen in the results, this leads to very effective performance: e.g., 61.3% accuracy versus 41.3% for the baseline approach on the Cspace dataset. However, it does not handle ambiguous terms as well as one would like, as the query does not include any information about the context in which the name occurred: the top-ranked answer for ambiguous name terms (e.g., "Dave") will always be the same person. To address this problem, we also used a file+term walk, in which the query V_q gives equal weight to the name term node and the file in which it appears.

We found that adding the file node to V_q provides useful context for ambiguous instances; e.g., the correct "David" would in general be ranked higher than other persons with the same name. On the other hand, adding the file node reduces the contribution of the term node. Although MAP and accuracy are decreased, file+term has better performance than term at higher recall levels, as can be seen in Figure 1.

5.2.4 Reranking the output of a walk

We now examine reranking as a technique for improving the results. After some preliminary experimentation, we adopted the following types of features f for a node x; the feature set is fairly generic. Edge unigram features indicate, for each edge label ℓ, whether ℓ was used in reaching x from V_q. Edge bigram features indicate, for each pair of edge labels ℓ1, ℓ2, whether ℓ1 and ℓ2 were used (in that order) in reaching x from V_q. Top edge bigram features are similar, but indicate whether ℓ1, ℓ2 were used in one of the two highest-scoring paths between V_q and x (where the "score" of a path is the product of Pr(y −ℓ→ z) over all edges in the path).

We believe that these features could all be computed using dynamic programming methods. Currently, however, we compute features using a method we call path unfolding, which is similar to the back-propagation-through-time algorithm (Haykin, 1994; Diligenti et al., 2005) used in training recurrent neural networks. Path unfolding is based on a backward breadth-first visit of the graph, starting at the target node at time step k and expanding the unfolded paths by one layer per time step. This procedure is more expensive, but offers more flexibility in choosing alternative features, and was useful in determining an optimal feature set.

In addition, we used some problem-specific features for this task. One feature indicates whether the set of paths leading to a node originates from one or two nodes in V_q. (We conjecture that in the file+term walk, nodes that are connected to both the source term and file nodes are more relevant than nodes that are reached from the file node or term node only.) We also form features that indicate whether the given term is a nickname of the person name, per the nickname dictionary, and whether the Jaro similarity score between the term and the person name is above 0.8. This information is similar to that used by the baseline method.

The results (for the test set, after training on the train set) are shown in Table 4 and, for two representative cases, Figure 1. In each case the top 10 nodes were reranked. Reranking substantially improves performance, especially for the file+term walk: the accuracy rate is higher than 75% across all datasets. The features that were assigned the highest weights by the re-ranker were the literal similarity features and the source count feature.

                              MAP    Accuracy
    Cspace
      Baseline                49.0     41.3
      Graph - term            72.6     61.3
      Graph - file+term       66.3     48.8
      Reranking - term        85.6     72.5
      Reranking - file+term   89.0     83.8
    Sager-E
      Baseline                67.5     39.2
      Graph - term            82.8     66.7
      Graph - file+term       61.7     41.2
      Reranking - term        83.2     68.6
      Reranking - file+term   88.9     80.4
    Shapiro-R
      Baseline                60.8     38.8
      Graph - term            84.1     63.3
      Graph - file+term       56.5     38.8
      Reranking - term        87.9     65.3
      Reranking - file+term   85.5     77.6

Table 4: Person name disambiguation results

[Figure 1: Person name disambiguation results: recall at rank k, shown for the CSPACE and SHAPIRO-R datasets. Each panel plots cumulative rate against rank for baseline, term, term reranked, file+term, and file+term re-ranked.]

6 Related Work

As noted above, the similarity measure we use is based on graph-walk techniques which have been adopted by many other researchers for several different tasks. In the information retrieval community, infinite graph walks are prevalent for determining document centrality (e.g., Page et al., 1998; Diligenti et al., 2005; Kurland and Lee, 2005). A related line of research concerns spreading activation over semantic or association networks, where the underlying idea is to propagate activation from source nodes via weighted links through the network (Berger et al., 2004; Salton and Buckley, 1988).

The idea of representing structured data as a graph is widespread in the data mining community, which is mostly concerned with relational or semi-structured data. Recently, the idea of PageRank has been applied to keyword search in structured databases (Balmin et al., 2004). Analysis of inter-object relationships has been suggested for entity disambiguation in a graph with unlabelled edges (Kalashnikov et al., 2005). It has also been suggested to model similarity between objects in relational data in terms of structural-context similarity (Jeh and Widom, 2002).

We propose the use of learned re-ranking schemes to improve the performance of a lazy graph walk. Earlier authors have considered instead using hill-climbing approaches to adjust the parameters of a graph walk (Diligenti et al., 2005). We have not compared directly with such approaches; preliminary experiments suggest that the performance gain of such methods is limited, due to their inability to exploit the global features used here.[6] Related research explores random walks for semi-supervised learning (Zhu et al., 2003; Zhou et al., 2005).

The task of person disambiguation has been studied in the field of social networks (e.g., Malin et al., 2005). In particular, it has been suggested to perform name disambiguation in email using traffic information, as derived from the email headers (Diehl et al., 2006). Our approach differs in that it allows integration of email content and a timeline, in addition to social network information, in a unified framework. In addition, we incorporate learning to tune the system parameters automatically.

[6] For instance, re-ranking using a set of simple locally-computable features only modestly improved the performance of the "random" weight set for the CSpace threading task.

7 Conclusion

We have presented a scheme for representing a corpus of email messages as a graph of typed entities, and an extension of the traditional notions of document similarity to documents embedded in a graph. Using a boosting-based learning scheme to rerank outputs based on graph-walk-related features, as well as other domain-specific features, provides an additional performance improvement. The final results are quite strong: for the explored name disambiguation task, the method yields MAP scores in the mid-to-upper 80's. The person name disambiguation task also illustrates a key advantage of our approach: context can be easily incorporated in entity disambiguation.

In future work, we plan to further explore the scalability of the approach, and also ways of integrating it with language-modeling approaches for document representation and retrieval. An open question with regard to the contextual (multi-source) graph walk in this framework is whether it is possible to further focus probability mass on nodes that are reached from multiple source nodes. This may prove beneficial for complex queries.

References

Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. 2004. ObjectRank: Authority-based keyword search in databases. In VLDB.

Helmut Berger, Michael Dittenbach, and Dieter Merkl. 2004. An adaptive information retrieval system based on associative networks. In APCCM.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1999. Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10:243-270.

William W. Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In IIWeb.

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-69.

Kevyn Collins-Thompson and Jamie Callan. 2005. Query expansion using random walk models. In CIKM.

W. Bruce Croft and John Lafferty. 2003. Language Modeling for Information Retrieval. Springer.

Christopher P. Diehl, Lise Getoor, and Galileo Namata. 2006. Name reference resolution in organizational email archives. In SIAM.

Michelangelo Diligenti, Marco Gori, and Marco Maggini. 2005. Learning web page scores by error back-propagation. In IJCAI.

Simon Haykin. 1994. Neural Networks. Macmillan College Publishing Company.

Glen Jeh and Jennifer Widom. 2002. SimRank: A measure of structural-context similarity. In SIGKDD.

Dmitri Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen. 2005. Exploiting relationships for domain independent data cleaning. In SIAM.

Bryan Klimt and Yiming Yang. 2004. The Enron corpus: A new dataset for email classification research. In ECML.

Oren Kurland and Lillian Lee. 2005. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In SIGIR.

Bradley Malin, Edoardo M. Airoldi, and Kathleen M. Carley. 2005. A social network analysis model for name disambiguation in lists. Journal of Computational and Mathematical Organization Theory, 11(2).

Einat Minkov, Richard Wang, and William Cohen. 2005. Extracting personal names from emails: Applying named entity recognition to informal text. In HLT-EMNLP.

Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In WWW.

Larry Page, Sergey Brin, R. Motwani, and T. Winograd. 1998. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University.

Gerard Salton and Chris Buckley. 1988. On the use of spreading activation methods in automatic information retrieval. In SIGIR.

Robert E. Schapire and Yoram Singer. 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336.

Kristina Toutanova, Christopher D. Manning, and Andrew Y. Ng. 2004. Learning random walk models for inducing word dependency distributions. In ICML.

Wensi Xi, Edward Allan Fox, Weiguo Patrick Fan, Benyu Zhang, Zheng Chen, Jun Yan, and Dong Zhuang. 2005. SimFusion: Measuring similarity using unified relationship matrix. In SIGIR.

Dengyong Zhou, Bernhard Scholkopf, and Thomas Hofmann. 2005. Semi-supervised learning on directed graphs. In NIPS.

Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML.
