Adaptive Context Features for Toponym Resolution in Streaming News∗

In SIGIR’12: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 731–740, Portland, OR, August 2012. Adaptive Context Features for Toponym Resolution in Streaming News∗ Michael D. Lieberman Hanan Samet Center for Automation Research, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland College Park, MD 20742 {codepoet, hjs}@cs.umd.edu ABSTRACT 1. INTRODUCTION News sources around the world generate constant streams Today’s increasingly informed and connected society de- of information, but effective streaming news retrieval re- mands ever growing volumes of news and information. Thou- quires an intimate understanding of the geographic content sands of newspapers, and millions of bloggers and tweeters of news. This process of understanding, known as geotag- around the world generate constant streams of data, and ging, consists of first finding words in article text that corre- the demand for such data is skyrocketing as people strive spond to location names (toponyms), and second, assigning to stay up-to-date. Also, Internet-enabled mobile devices each toponym its correct lat/long values. The latter step, are increasingly common, which expands the requirement called toponym resolution, can also be considered a classi- for location-based services and other highly local content— fication problem, where each of the possible interpretations information that is relevant to where users are, or the places for each toponym is classified as correct or incorrect. Hence, in which they are interested. News itself often has a strong techniques from supervised machine learning can be applied geographic component, and newspapers tend to character- to improve accuracy. New classification features to improve ize their readership in terms of location, and publish news toponym resolution, termed adaptive context features, are articles describing events that are relevant to geographic lo- introduced that consider a window of context around each cations of interest to their readers. We wish to collect these toponym, and use geographic attributes of toponyms in the articles and make them available for location-based retrieval window to aid in their correct resolution. Adaptive pa- queries, which requires special techniques. rameters controlling the window’s breadth and depth afford To enable news retrieval queries with a geographic com- flexibility in managing a tradeoff between feature compu- ponent, we must first understand the geographic content tation speed and resolution accuracy, allowing the features present in the articles. However, currently, online news to potentially apply to a variety of textual domains. Ex- sources rarely have articles’ geographic content present in tensive experiments with three large datasets of streaming machine-readable form. As a result, we must design algo- news demonstrate the new features’ effectiveness over two rithms to understand and extract the geographic content widely-used competing methods. from the article’s text. This process of extraction is called geotagging of text, which amounts to identifying locations in natural language text, and assigning lat/long values to Categories and Subject Descriptors them. Put another way, geotagging can be considered as H.3.1 [Information Storage and Retrieval]: Content enabling the spatial indexing of unstructured or semistruc- Analysis and Indexing tured text. This spatial indexing provides a way to exe- cute both feature-based queries (“Where is X happening?”) and location-based queries (“What is happening at location General Terms Y ?”) [5] where the location argument is specified textually Algorithms, Design, Performance rather than geometrically as in our related systems such as QUILT [34] and the SAND Browser [31]. Geotagging methods have been implemented in many different textual Keywords domains, such as Web pages [3, 23, 27], blogs [28], ency- Toponym resolution, geotagging, streaming news, adaptive clopedia articles [14, 36], tweets [33], spreadsheets [2, 17], context, machine learning the hidden Web [19], and of most relevance for us, news articles [8, 11, 18, 20, 21, 29, 32, 37]. Particular domains ∗ This work was supported in part by the National Science such as blogs and tweets may pose additional challenges, Foundation under Grants IIS-10-18475, IIS-09-48548, IIS- such as having few or no formatting or grammatical re- 08-12377, and CCF-08-30618. quirements. The methods in this paper were applied in the NewsStand system [37], which uses a geotagger to assign geographic locations to clusters of news articles based on their Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are content, which allows users to visually explore the news in not made or distributed for profit or commercial advantage and that copies NewsStand’s interactive map interface. Also, several com- bear this notice and the full citation on the first page. To copy otherwise, to mercial products for geotagging text are available, such as republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR’12, August 12–16, 2012, Portland, Oregon, USA. Copyright 2012 ACM 978-1-4503-1472-5/12/08 ...$15.00. In SIGIR’12: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 731–740, Portland, OR, August 2012. MetaCarta’s Geotagger1, Thomson Reuters’s OpenCalais2, ...in and around [Louisville 17] and and Yahoo!’s Placemaker3, the latter two of which we inves- [Lexington 31], [Kentucky 6], [Nashville 27] and tigate here. [Cordova 55], [Tennessee 5], [Richmond 69], Geotagging consists of two steps: finding all textual ref- [Virginia 42], [Fort Lauderdale 1] and erences to geographic locations, known as toponyms, and [Orlando 9], [Florida 96], [Indianapolis 3], then choosing the correct location interpretation for each [Indiana 8] and [Atlanta 22], [Georgia 12]. toponym (i.e., assigning lat/long values) from a gazetteer (database of locations). These two steps are known as to- Figure 1: Excerpt from an Earth Times press release [25] ponym recognition and toponym resolution, the second of with toponyms and their number of interpretations high- which we investigate here, and are difficult due to ambigu- lighted, showing the extreme ambiguity of these toponyms ities present in natural language. Importantly, both these and illustrating the need for adaptive context features. steps can be considered as classification [10] problems: To- ponym recognition amounts to classifying each word in the we consider all possible combinations of resolutions for these document’s text as part of a toponym or not, and toponym toponyms, this results in about 3·1017 possibilities, an as- resolution amounts to classifying each toponym interpreta- tonishingly large number for this relatively small portion of tion as correct or incorrect. With this understanding, and text, which is far too many to check in a reasonable time. with appropriately annotated datasets, we can leverage tech- Instead, we can set parameters which we term the window’s niques from supervised machine learning to create an effec- breadth and depth, named analogously to breadth-first and tive geotagging framework. These techniques take as input depth-first search, which control the number of toponyms sets of values known as feature vectors, along with a class la- in the window and the number of interpretations examined bel for each feature vector, and learn a function that will pre- for each toponym in the window, respectively. The adaptive dict the class label for a new feature vector. Many such tech- context features thus afford us flexibility since by varying niques for classification, and other machine learning prob- these parameters, we can control a tradeoff between feature lems, exist and have been used for geotagging purposes, in- computation time and resolution accuracy. The more to- cluding SVM [4, 13, 24], Bayesian schemes [9, 12, 41], and ponyms and toponym interpretations we examine, the more expectation maximization [6]. likely we are to find the correct interpretation, but the longer The effectiveness of such techniques for a given problem resolution will take, and vice versa. Some textual domains domain depends greatly on the design of the input features such as Twitter, where tweets arrive at a furious rate, de- that comprise each feature vector. One common feature used mand faster computation times, while in other, offline do- for geotagging is the population of each interpretation, since mains, the time constraint is relaxed and we can afford to larger places will tend to be mentioned more frequently and spend more time to gain higher accuracy. While window-like are more likely to be correct. However, using population features and heuristics have been used in other work related alone or overly relying on it, as many methods do, resulted to geotagging (e.g., [15, 16, 24, 30, 35, 42]), these features’ in greatly reduced accuracy in our experiments, especially adaptive potential has not been explored. for toponym recall. Instead, in this paper, we consider a As we pointed out, in this paper our focus is on toponym new class of features to improve the accuracy of toponym resolution, while toponym recognition makes use of our pre- resolution, termed adaptive context features. These features vious work [18]. Our work differs from that of others in t construct

Load more