Words in a Text
Total Page:16
File Type:pdf, Size:1020Kb
Cross-lingual geo-parsing for non-structured data Judith Gelernter Wei Zhang Language Technologies Institute Language Technologies Institute School of Computer Science School of Computer Science Carnegie Mellon University Carnegie Mellon University [email protected] [email protected] ABSTRACT to cross-language geo-information retrieval. We will examine A geo-parser automatically identifies location words in a text. We Strötgen’s view that normalized location information is language- have generated a geo-parser specifically to find locations in independent [16]. Difficulties in cross-language experiments are unstructured Spanish text. Our novel geo-parser architecture exacerbated when the languages use different alphabets, and when combines the results of four parsers: a lexico-semantic Named the focus is on proper names, as in this case, the names of Location Parser, a rules-based building parser, a rules-based street locations. parser, and a trained Named Entity Parser. Each parser has Geo-parsing structured versus unstructured text requires different different strengths: the Named Location Parser is strong in recall, language processing tools. Twitter messages are challenging to and the Named Entity Parser is strong in precision, and building geo-parse because of their non-grammaticality. To handle non- and street parser finds buildings and streets that the others are not grammatical forms, we use a Twitter tokenizer rather than a word designed to do. To test our Spanish geo-parser performance, we tokenizer. We use an English part of speech tagger created for compared the output of Spanish text through our Spanish geo- tweets, which was not available to us in Spanish. Our machine parser, with that same Spanish text translated into English and run learning models for recognizing location words acknowledge through our English geo-parser. The results were that the Spanish Twitter-specific features such as hash tags (index terms specific geo-parser identified toponyms with an F1 of .796, and the to Twitter), website addresses and names in Twitter @mentions. English geo-parser identified toponyms with an F1 of .861 (and Both our English and Spanish models were trained on tweets. this is despite errors introduced by translation from Spanish to English), compared to an F1 of .114 from a commercial off-the- Our focus is social media for crisis response due to the importance shelf Spanish geo-parser. Results suggest (1) geo-parsers should of location in message text. The use of social media for disaster be built specifically for unstructured text, as have our Spanish and relief had been established by the time of the Haiti earthquake in English geo-parsers, and (2) location entities in Spanish that have January 2010, when posts to Facebook, Flickr, YouTube and been machine translated to English are robust to geo-parsing in Twitter were numerous [5]. Research revealed that those English. messages sent during the Haiti earthquake that were considered “actionable” and more useful than the others were generally the Categories and Subject Descriptors ones that included locations [10]. Moreover, Twitterers themselves are more likely to pass along, or re-tweet, messages H.3.1 [Content Analysis and Indexing]: Linguistic processing that include geo-location and situational updates [17], indicating General Terms that Twitterers find such messages important. Our data for this Algorithms, Design research are tweets pertaining to a 2010 earthquake in Chile. The basis of our experiment is to send a test set of Spanish Keywords Chilean earthquake messages through our Spanish geo-parser, a geo-parse, location, translation, Spanish, Twitter, microtext, geo- commercial Spanish geo-parser, and in English translation reference, cross-language geographic information retrieval (CL- through our own English geo-parser. The output of the three GIR) algorithms were scored in comparison to manual geo-tagging annotation of the Spanish tweets by fluent Spanish speakers. 1. INTRODUCTION Translating a document into a language plentiful in language Our English translations of the Spanish tweets come from Google processing tools and location resources, such as English, amounts Translate (see Table 1 for example).1 This tool finds statistically- recurring patterns and selects out of hundreds of millions of human-translated documents what seems to be the best translation for unseen word groupings. Given the vast amount of high quality Permission to make digital or hard copies of all or part of this work for Spanish-English translations, the Google Translate output is also personal or classroom use is granted without fee provided that copies are of high quality. not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for Our research questions are: components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to How does our Spanish geo-parser perform on identifying post on servers or to redistribute to lists, requires prior specific permission toponyms in Spanish tweets in comparison to our English geo- and/or a fee. Request permissions from [email protected]. GIR'13, November 05 2013, Orlando, FL, USA Copyright is held by the owner/author(s). Publication rights licensed to ACM. Copyright 2013 ACM 978-1-4503-2241-6/13/11$15.00 1 http://dx.doi.org/10.1145/2533888.2533943 http://translate.google.com/ and according to Google Translate “About”, http://translate.google.com/about/ 2.2 What to geo-parse? Table 1. Sample tweet on the February earthquake Locations found by geo-parsers are usually toponyms, or political in Chile, 2010 (with locations in bold) and administrative place names at the resolution of city or higher Spanish Info geologica Terremoto Chile: …construidos sobre la parte in the spatial hierarchy. Some have mined for telephone number tweet plana de concepcion, dundadas sobre suelo lando (arenas del and postal codes for additional location precision [1], but neither bio bio y relleno. telephone nor postal information appear in tweets with any Google- Chile earthquake geological info: … built on the flat of frequency. We parse for streets and buildings as well as translated conception, on soil dundadas lando (bio bio sands and fillers) toponyms because that is what our preliminary inquiry into tweet locations in tweet text uncovered [7]. The OpenCalais by Thomson Reuters parses for facility, which we call building.5 parser finding toponyms on those tweets Google-translated into 2.3 Geo-parser structure English? Multi-lingual text mining systems have been developed in one language and ported to another, as for example Gamon et al. [4]. How does our Spanish geo-parser for unstructured text compare For cross-lingual applications such as cross-lingual topic tracking to a commercial geo-parser for Spanish?2 or linking related documents across languages, the most common approach is to use Machine Translation or bilingual dictionaries to Our contributions include (1) a Spanish geo-parser that is tuned to translate foreign languages into one language, typically English informal, unstructured text,3 and (2) a prediction based on our [15]. We will try this approach for geo-parsing tweets. experimental results about the relative utility of geo-parsing within a language compared with geo-parsing across languages 2.4 Misspelling which use the same alphabet. The Levenshtein algorithm that identifies misspellings was proposed in the 1960s, and the original paper is in Russian.6 The 2. LITERATURE REVIEW distance between two spellings has been called the “edit distance” 2.1 Geo-parsing in that it is the number of edits needed to transform one string into Geo-parsing is a subset of named entity recognition, which another string. Edits are insertion, deletion or substitution. Other identifies entities in text and classifies them into categories such methods of identifying mis-spelled words have been based on as agent, organization and location. Our literature review focuses web-wide data with many more words spelled correctly than mis- therefore on language processing rather than geographic spelled [18]. This, along with the edit distance of Levenshtein, is information retrieval. the basis of the Norvig algorithm that uses probability theory to find the correct spelling of a word [11.]. Misspelling sub-routines Geo-parsing is performed automatically based on language rules. are not standard in geo-parsers, so that locations spelled A model based on language rules is then applied to unseen incorrectly or in non-standard form are often missed. instances to find location words in unseen text [2]. The text of a tweet is not only unstructured, it is also brief. If not supplemented 3. GAZETTEER STRUCTURE with metadata, it is potentially difficult to parse due to lack of This section describes the structure of our gazetteer as it is part of context. It has been found that a data set too small or with both the Named Location Parser and the Named Entity insufficient context for the location word as in the geo-parsing of Recognizer parser. Gazetteer structure is essential because of its word or phrase search queries is liable to yield low recall and immense multi-GB size. Without accommodating the data precision results [8]. structure, gazetteer lookup would make processing time Our Spanish geo-parser was created specifically for this research. unacceptable. Our English geo-parser was created earlier [6], but it has 3.1 Gazetteer as trie and inverted index continued to be developed in concert with our Spanish geo-parser We use the structure of a trie tree (from retrieval) because it is for the benefit of this research. Other geo-parsers include those by flexible for information look-up. Searching character by character MetaCarta, Yahoo! Placemaker, GeoGravy, NetOwl, and open allows us to match word approximations that are misspellings, and source systems such as CLAVIN, TextGrounder, Geodict, adjectival locations that are similar in stem but differ in ending GeoDoc, Geotxt, Europeana, OpenSextant and DIGMAP, as well from the gazetteer term.