A Hierarchical Location Normalization System for Text
Total Page:16
File Type:pdf, Size:1020Kb
A Hierarchical Location Normalization System for Text Dongyun Liang, Guohua Wang, Jing Nie, Binxu Zhai and Xiusen Gu Tencent, China fdylanliang, guohuawang, [email protected] Abstract congestion at Shiling Interchange)” refers to a defi- nite location, but there’s no indication of where the It’s natural these days for people to know the Shiling Interchange is to make an exact response, local events from massive documents. Many unless we know it belongs to Longquanyi district, texts contain location information, such as city Chengdu city, Sichuan province. name or road name, which is always incom- plete or latent. It’s significant to extract the Countries are divided up into different units to administrative area of the text and organize manage their land and the affairs of their people the hierarchy of area, called location normal- easier. Administrative division (AD) is a portion ization. Existing detecting location systems of a country or other region delineated for the pur- either exclude hierarchical normalization or pose of administration. Due to China’s large pop- present only a few specific regions. We pro- ulation and area, the AD of China have consisted pose a system named ROIBase 1 that normal- of several levels since ancient times. For clarity izes the text by the Chinese hierarchical ad- ministrative divisions. ROIBase adopts a co- and convenience, we cover three levels in our sys- occurrence constraint as the basic framework tem, and treat the largest administrative division to score the hit of the administrative area, of a country as 1st-level, next subdivisions as 2nd- achieves the inference by special embeddings, level and 3rd-level, which matches the provincial and expands the recall by the ROI (region of in- (province, autonomous region, municipality, and terest). It has high efficiency and interpretabil- special administrative region), prefecture-level city ity because it mainly establishes on the definite and county in China, shown as Table1. China knowledge and has less complex logic than the supervised models. We demonstrate that administers more than 3,200 divisions in these flat- ROIBase achieves better performance against tened levels. In such a large and complex hierarchy, feasible solutions and is useful as a strong sup- much work stops at extracting the relevant loca- port system for location normalization. tions, such as named entity tagging (Srihari, 2000). There are many similar named entity recognition 1 Introduction (NER) toolkits (Che et al., 2010; Finkel et al., 2005) for location extraction. As the ambiguity is very In every day and every place, various events are be- high for location name, Li et al.(2002) and Al- ing reported in the form of texts, and many of these Olimat et al.(2017) develop to the disambiguation don’t present hierarchical and standard locations. arXiv:2001.07320v1 [cs.CL] 21 Jan 2020 of location extraction. We take a step closer to In the context-aware text, location is a fundamental extract normalization information, and determine component that supports a wide range of applica- which the three hierarchical administrative area the tions. We need to focus on the normalizing location document mainly describes. to process massive texts effectively in specific sce- The challenges are a bit different in our loca- narios. As the text stream in social media are more tion normalization, which are mainly in ambiguity quickly in accident or disaster response (Munro, and explicit absence. For example, there is a du- 2011), location normalization is crucial for situa- plicate Chaoyang district as 3rd-level in Beijing tional awareness in these fields, in which the omit- and Changchun city, and “Chaoyan” also means ted writing style often avoids redundant content. the rising sun in Chinese, which may cause ambi- For example, “Auˤ路段¤通拥5 (Traffic guity. If “Beijing” and “Chaoyang” are mentioned 1github.com/waterblas/ROIBase-lite in the same context, it is confident that “Chaoyang” 2019/12/23 Administrative divisions of China - Wikipedia Structural hierarchy of the administrative divisions and basic level autonomies of China Provincial level (1st) Prefectural level (2nd) County level (3rd) 省级行政区 地级行政区 县级行政区 Sub-provincial-level autonomous prefecture 副省级自治州 District 市辖区 County-level city 县 级市 Prefectural-level city 地级市 Autonomous region County 县 自治区 Autonomous prefecture 自治州 Autonomous count y 自治县 Prefecture 地区 Banner 旗 Leagues 盟 Autonomous bann er 自治旗 Sub-provincial-level city 副省级城市 District 市辖区 Special district 特 区 Prefectural-level city 地级市 County-level city 县级 市 Province Autonomous prefecture 自治州 County 县 省 Prefecture 地区 Autonomous count y 自治县 Sub-prefectural-level city 副地级市 Forestry district 林区 Sub-provincial-level new area 副省级市辖新区 Municipality District 市辖区 直辖市 County 县 Special administrative region Region 地区 (informal) District 区 特别行政区 Civic and Municipal Affairs Bureau 民政总署 Freguesia 堂区 (informal) (Part of the One country, two systems) Municipality 市 (informal) Table 1: Structural hierarchy of the administrative divisions. https://en.wikipedia.org/wiki/Administrative_divisions_of_China 1/1 should refer to the district of Beijing city. Simi- naturally, we propose the concept of ROI, which larly, Yarowsky(1995) proposes a corpus-based has a bi-directional association with AD. Given an unsupervised approach that avoids the need for ROI mapping the fixed hierarchical administrative costly truthed training data. However, it’s com- area, ROI has high confidence to represent the area, mon that some contexts lack enough co-occurrence as well as the area contains it definitely. In the of AD to disambiguate or the explicit information absence of explicit patterns, the co-occurring ROI completely misses. We refer to it as the explicit in the context can be good evidence to predict the absence problem, and neither NER nor disambigua- most likely administrative area. The main contri- tion makes it work unless more hidden information butions of the system are as follows, which can be is explored. There are many specific AD-related applied to other languages: points identifying which division is, including: 1. We provide a structured AD database, and • Location alias, e.g. “OÎ (Pengcheng)” is the use the co-occurrence constraints to make a alias name of Shenzhen city; decision; • Old calling or customary title, e.g. “老ø北 2. The ROIBase is equipped with geographic (Old Zhabei)” is a municipal district that once embeddings trained by special location se- existed in Shanghai city; quences to make an inference; • The phrase about the spatial region event, e.g. 3. We use a large news corpus to build a knowl- “-国国E½F'会 (China Huishang Con- edge base that is made up of ROIs, which ference)” has been held in Hefei city; helps normalization. • Some POIs (point of interest), e.g. The well- known “颐和园 (Summer Palace)” is situated 2 User Interface in the northwestern suburbs of Beijing. We design a web-based online demo 2 to show the location normalization. As shown in Figure1, there We summary them as a concept named ROI, are three cases split by blue lines, and each case which is both similar and different from POI. mainly contains two components: query and result. POI dataset collects specific location points that Query Input the document into the textbox with someone may find useful or interesting. It maps a green border to query for ROIBase. The query the detailed address that covers the administra- accepts the Chinese format sentences, such as the tive division. However, many POIs only build an text from news or social media. uni-directional association with AD. For example, Result On the right of the textbox, it will show Bank of China as a common POI is opened across the structured result from ROIBase after submit- the China. We can find many Bank of China at a ting the query. The result consists of there parts: specific AD, but if only “Bank of China” exists in a Confidence, Inference and ROI. context, we can’t directly confirm its location with- out more area information. Since POI is uncertain 2http://research.dylra.com/v/roibase/ Figure 1: User interface of ROIBase. Confidence represents the result that can be ex- ministrative areas in China, which are organized tracted and identified from explicit information. in the form of hierarchy. Each record is associated For example, we have confidence to fill “新疆 (Xin- with its parent and children, for example, “D3市 jiang)” when “尉犁¿ (Yuli County)” and “巴音郭 (Xiangyang city)” is at 2nd-level, its alias is Xiang- ^蒙古ê»州 (Bayingol Mongolian Autonomous fan, its parent is Hubei province, and some children Prefecture)” are coming together in context. of its divisions are Gucheng County, Xiangzhou Inference is complement for the Confidence by District, etc. we develop a co-occurrence constraint embeddings, where the nearest uncertain admin- based on this database to Confidence result, shown istrative level will be inferred from the implicit in Algorithm1. information of the input. For example, none of the explicit administrative area appears in middle case ALGORITHM 1: processing Confidence of the Figure1, so the Inference will start with 1st- Input: S, sentences from text level (the largest division), and it infers “广东省 Output D, hierarchical administrative division (Guangdong Province)”. If the Confidence comes T ;;Q fg up with 1st-level, the Inference will start with 2nd- foreach word phrase w 2 S do level. If the Confidence is filled with three levels, if w hit AD database then Inference does nothing and keeps it as before. expand w to three levels [l1; l2; l3] by ROI is derived from the ROI knowledge base. standard AD, and add them into T ; We will match the input with the ROI knowledge end base, and return the ROI associated with the admin- foreach hierarchical candidate t 2 T do istrative area when the match is successful.