Using an Optimized Chinese Address Matching Method to Develop a Geocoding Service: a Case Study of Shenzhen, China

International Journal of Geo-Information

Article Using an Optimized Chinese Address Matching Method to Develop a Geocoding Service: A Case Study of Shenzhen, China

Qin Tian 1, Fu Ren 1,2,3, Tao Hu 4, Jiangtao Liu 5, Ruichang Li 1 and Qingyun Du 1,2,3,6,*

1 School of Resource and Environmental Science, Wuhan University, 129 Luoyu Road, Wuhan 430079, China; [email protected] (Q.T.); [email protected] (F.R.); [email protected] (R.L.)
2 Key Laboratory of Geographic Information System, Ministry of Education, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
3 Key Laboratory of Digital Mapping and Land Information Application Engineering, National Administration of Surveying, Mapping and Geo-information, Wuhan University, 129 Luoyu Road, Wuhan 430072, China
4 School of Information Engineering, Xuchang University, 88 Bayi Road, Xuchang 461000, China; [email protected]
5 Shenzhen Municipal Planning & Land Real Estate Information Center, 8007 Hongli Road, Shenzhen 518034, China; [email protected]
6 Collaborative Innovation Center Of Geospatial Technology, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
* Correspondence: [email protected]; Tel.: +86-27-8766-4557; Fax: +86-27-6877-8893

Academic Editor: Wolfgang Kainz
Received: 19 January 2016; Accepted: 2 May 2016; Published: 13 May 2016

Abstract: With the coming era of big data and the rapid development and widespread applications of Geographical Information Systems (GISs), geocoding technology is playing an increasingly important role in bridging the gap between non-spatial data resources and spatial data in various fields. However, Chinese geocoding faces great challenges because of the complexity of the address string format in Chinese, which contains no delimiters between Chinese words, and the poor address management resulting from the existence of multiple address authorities spread among different governmental agencies. This paper presents a geocoding service based on an optimized Chinese address matching method, including address modeling, address standardization and address matching. The address model focuses on the spatial semantics of each address element, and the address standardization process is based on an address tree model. A geocoding service application is implemented in practice using a large quantity of data from Shenzhen Municipality. More than 1,460,000 data records were used to test the geocoding service, and good matching rates were achieved with good adaptability and intelligence.

Keywords: Geographical Information System; Chinese address match; address model; address matching; geocoding service

ISPRS Int. J. Geo-Inf. 2016, 5, 65; doi:10.3390/ijgi5050065 www.mdpi.com/journal/ijgi

1. Introduction

Increasing amounts of various types of data are being collected in various domains with the coming era of big data and the rapid development and widespread application of Geographical Information Systems (GISs) [1]. It is becoming urgent to be able to assign coordinates to these data and to pursue deeper research into big data from a spatial perspective [2]. One key topic of interest is text that contains spatial information, known as addresses, which may be the resource that citizens use most often to convey geographical information or locations [3]. The Global Sourcebook of Address Data Management introduced the address systems of 194 countries [4]. Most urban GIS applications require geocoding technology, which can obtain spatial coordinates from addresses [5]. To date, geocoding has been the most popular and efficient method for translating addresses into spatial coordinates and has been employed in many fields, including public health [6,7], criminology [8], cancer research [9], transportation [10], traffic crash response [11,12], and others [7,13,14].

Geocoding, which is also referred to as address matching, is the process of mapping a text-based descriptive address to digital geographic coordinates that can be used in spatial analysis and visualization [15–17]. In other words, geocoding is a type of positioning technology that can extract coordinates from a given text or natural-language description of spatial location information, such as an urban address, building numbers, street addresses, postal codes, building names and company names [15]. Geocoding is the most important available means of bridging the gap between non-spatial information and spatial information, and accuracy and certainty are of great importance for geocoding systems and services [5,18–21]. Many researchers are working to improve geocoding matching rates [16,22–26].
Geocoding is a worldwide challenge, and is currently more problematic in China than in most Western countries because the Chinese language is much more complex than the English language regarding segmentation and semantics [27,28]: the English language has delimiters between words, and Western countries have many successful solutions for performing such tasks, such as the Geographic Base File/Dual Independent Map Encoding (GBF/DIME) address geocoding solution, Topologically Integrated Geographic Encoding and Reference (TIGER, or TIGER/Line) Geocoding solution, ESRI address geocoding solution and so on [29]. Almost all commercial GIS software packages (e.g., ArcGIS and MapInfo) and web-based map service vendors (e.g., Google, Bing and MapQuest) provide geocoding modules that function well for non-Chinese text addresses. However, these solutions are designed for Western countries and are unsuitable for application to Chinese addresses. Chinese geocoding remains a persistent problem because of the complexity of Chinese syntax and semantics, as described below [24,27,28]:

(1) Chinese text has no delimiters between words, unlike English [2,28]. It is extremely difficult to parse the string and extract exact address elements, such as administrative region names, road names, street numbers, and building numbers, to match the reference database.
(2) Various descriptive styles exist among Chinese addresses. The regulation of Chinese address structures is poor and received little attention from urban planning and administrative managers before the early 1990s. A large number of addresses that have been collected and stored in basic geographic databases lack a uniform structure and have been defined according to local custom in the absence of explicit authoritative rules to follow. With the increasing availability of such address data, the disordered state of Chinese addresses is becoming more evident.
(3) Urban planning and address management in many areas of China is chaotic. For a long time, city layouts in China have been developed without sufficient urban planning, and building numbers were assigned randomly and without regulation. Consequently, it is difficult to add, delete and update records to accommodate new addresses, and address interpolation and address matching are difficult because, for example, several different roads are named Zhongshan Road. This further increases the difficulty and inaccuracy of the address matching task in China.
(4) The rights for address naming and approval are distributed among several different governmental agencies. The right to name administrative regions, such as provinces, cities and districts, belongs to the State Council, whereas street numbers and building numbers are managed by the Department of Public Safety. Meanwhile, Urban Planning Departments hold the naming rights for roads and streets. For example, in the address “Shenzhen City, Futian District, Hongli West Road Number 8890”, the address elements “Shenzhen City” and “Futian District” are determined by the State Council, whereas “Hongli West Road” is defined by the Urban Planning Department and “Number 8890” is assigned by the Department of Public Safety. The lack of unified address planning makes it quite difficult to manage addresses and perform address matching because different agencies have different address formats and descriptive forms. This situation makes it quite difficult to standardize even a single Chinese address.

In this paper, a set of optimized Chinese address matching methods is presented to resolve the situation described above. The proposed methods are intended to: (1) adapt to the complexity of Chinese text; (2) account for the topological relationships among address elements; and (3) improve the efficiency and accuracy of traditional geocoding algorithms. In the remainder of this paper, we first describe the study area and data in Section 2, followed by the optimized method in detail, including the structure of the address model and the address standardization and address matching process in Section 3. Then, the conducted experiments and their results are presented in Section 4. A discussion is provided in Section 5, and the conclusions are summarized in Section 6.

2. Study Area and Data

2.1. Study Area

Shenzhen is a major city located in the southeast of China between longitudes 113°46′ and 114°37′ east and latitudes 22°27′ and 22°52′ north. Shenzhen covers a total area of 1952 km² and spans 81.4 km from east to west and 10.8 km from north to south. Shenzhen is connected to Hong Kong in the south, Dongguan in the north, and Huizhou in the east. Shenzhen is one of the special economic zones in China. At present, Shenzhen consists of 10 districts (Luohu, Futian, Nanshan, Yantian, Baoan, Longgang, Guangming, Pingshan, Longhua and Dapeng). Figure 1 shows the location of Shenzhen Municipality and a map of the districts of Shenzhen.

Figure 1. Location of Shenzhen in China and Districts in Shenzhen.

With the rapid development of Shenzhen Municipality, the government of Shenzhen has collected and stored a massive amount of data in recent years. These data can be divided into the two following general categories: data that include spatial coordinates, namely, spatial data, and data that include only text descriptions of spatial information, such as addresses, postal codes, and toponyms. The latter type of data does not have coordinates, and are called non-spatial data. For more effective utilization of these data, the government plans to initiate efforts towards providing open data. However, these plans face the critical problem of how to attach spatial information to these non-spatial data. For this purpose, geocoding is the best solution.


2.2. Data

As mentioned above, large amounts of spatial data and non-spatial data have been collected by the Shenzhen government over the years, including spatial data for roads, points of interest (POIs), buildings, street numbers, and building numbers and non-spatial data for hospitals, schools, and companies.

The spatial data that we used in this study to construct a reference database for the development of the proposed geocoding service were obtained from the Urban Planning, Land and Resources Commission of Shenzhen Municipality (SZPL). This database contains administrative data, road data, POI data, street number data, toponym data and house number data (a total of approximately 708,000 data records).

We used non-spatial disease data associated with home addresses and company data associated with company addresses. The disease data contain more than 100,000 records obtained from the Shenzhen Center for Health Information (SCHI), and the company data include approximately 1,360,000 records obtained from the Market and Quality Supervision Commission of Shenzhen Municipality (SZSCJG).

3. Address Match Methods

Figure 2 illustrates the fundamental principles of the geocoding process [26,27], which include address model building, address standardization, address matching and reference database construction. The address model explains how a single address is organized and defines the relationships between address elements. The establishment of an internationally standardized geospatially enabled address model is the purpose of ISO 19160 [30]. Unfortunately, little significant progress in address internationalization has been made in China. Address standardization is the process of parsing an input address string into a list of address elements and then reordering these elements based on their spatial semantics. The process also involves the elimination of errors, the correction of address imperfections and the confirmation of the destination of the input address string. Address matching is the next step of geocoding, and refers to the process of assigning spatial coordinates to a standardized address by searching in a reference database using a particular address matching algorithm.

Figure 2. The geocoding process.

Geocoding accuracy is one of the largest concerns for users when seeking the most appropriate geocoding service, because lower geocoding accuracy will result in errors in geocoding applications. Most geocoding solutions employ a general method in which addresses are matched based solely on text comparison and the spatial semantics of the address elements are neglected. By contrast, the geocoding solution proposed in this paper fully considers the relevant spatial semantics, especially the topological relationships between address elements. With the constraints provided by the spatial relationships, the fitting rules used to reconstruct the destination address will be more precise, and spatial operations can be used in the geocoding process.


3.1. Address Model

The address model is an abstract description of how a given address containing certain address elements is organized and expressed. A Chinese address consists of three major components: administrative elements, basic constraint objects and local point locations [31,32]. The organization rules for these address elements are often described as below:

<address> ::= <administrative elements> <basic constraint object> <local point location>

The elements are defined as follows:

<administrative elements> ::= … [district] … [village]
<basic constraint object> ::= … | … | … | …
<local point location> ::= <… numbers> [house numbers] | … | … | <point of interest>

As an example, consider the address “No. 8088, Hongli Road, Futian District, Shenzhen”, as shown in Figure 3.

Figure 3. An example of the organization rules for addresses.
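The three-component structure described above can be illustrated with a minimal Python sketch. The class name `ChineseAddress`, its field names and the `elements` helper are hypothetical names introduced only for illustration, and the assignment of the example address parts to the three components is our reading of the model, not a definition taken from the reference data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChineseAddress:
    """Hypothetical container for the three major components of an address."""

    administrative_elements: List[str]  # ordered from coarse to fine, e.g. city, district
    basic_constraint_object: str        # e.g. a road
    local_point_location: str           # e.g. a street number

    def elements(self) -> List[str]:
        """Return all address elements from the highest to the lowest level."""
        return [
            *self.administrative_elements,
            self.basic_constraint_object,
            self.local_point_location,
        ]


# The example address "No. 8088, Hongli Road, Futian District, Shenzhen",
# split by hand into the three components (an assumed decomposition).
example = ChineseAddress(
    administrative_elements=["Shenzhen", "Futian District"],
    basic_constraint_object="Hongli Road",
    local_point_location="No. 8088",
)
print(example.elements())
```

Storing the administrative elements as an ordered list keeps the optional parts of the rule (such as a district) simple to omit.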

The organization rules for addresses provide important guidance regarding how to describe a location in natural language, but they do little to represent the spatial relationships and constraints between address elements, which are important for address parsing and standardization. Certain spatial relationships do exist between address elements: an analysis of Chinese addresses reveals topological relationships (such as containing, adjacency, and adjoining), distance relationships and direction relationships. In most cases, the topological relationships between address elements, especially containing relationships, are very common and obvious; an address model with this hierarchical characteristic is commonly called a hierarchical address model [33]. Figure 4 shows an example of a Shenzhen address with multiple types of spatial relationships. There are topological (containing) relationships between the address elements “Shenzhen City”, “Hongli Road” and “Number 8088”; a direction relationship between the address elements “Number 8088” and “West”; and a distance relationship between the address elements “Number 8088” and “200 m”.

Figure 4. An example of the spatial relationship model.

According to the organization rules mentioned above, containing relationships exist among the three classes of address elements. Administrative elements contain basic constraint objects; for example, province elements contain city elements, and district or county elements are contained in city elements. Basic constraint objects contain local point locations; for example, street numbers or building numbers are located on certain roads or blocks. In certain cases, direction relationships and distance relationships may simultaneously exist among local point location elements; for example, in the local point location description “No. 8088 West 200 m”, “West” indicates direction and “200 m” indicates distance. The combination of the local point descriptor “No. 8088”, the direction “West” and the distance “200 m” specifies an accurate location; none of these elements can be omitted. Figure 5 illustrates the detailed relationships in the sample address.
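To make the combined direction and distance relationship concrete, the following minimal Python sketch resolves a description such as “No. 8088 West 200 m” to a point. It assumes planar coordinates in metres and only the four cardinal directions; the function `offset_location`, the table `DIRECTION_VECTORS` and the coordinates of the anchor point are hypothetical and are not taken from the reference database.

```python
from typing import Dict, Tuple

# Unit vectors (east, north) for the four cardinal directions.
# Assumption: coordinates are planar and expressed in metres.
DIRECTION_VECTORS: Dict[str, Tuple[float, float]] = {
    "East": (1.0, 0.0),
    "West": (-1.0, 0.0),
    "North": (0.0, 1.0),
    "South": (0.0, -1.0),
}


def offset_location(x: float, y: float, direction: str, distance_m: float) -> Tuple[float, float]:
    """Move from the anchor point (x, y) by distance_m in the given direction."""
    dx, dy = DIRECTION_VECTORS[direction]
    return (x + dx * distance_m, y + dy * distance_m)


# Made-up planar coordinates standing in for the location of "No. 8088".
anchor = (1000.0, 500.0)
result = offset_location(anchor[0], anchor[1], "West", 200.0)
print(result)
```

All three inputs are required by the function, which mirrors the observation that none of the three elements can be omitted.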

Figure 5. Relationships between address elements.

3.2. Address Standardization

Address standardization is the most important process involved in geocoding. In this process, an input address string is parsed and then exported in a standard format. This process usually involves two steps [34]: address parsing and address normalization. The purpose of address parsing is to segment the input address string into meaningful address elements with exact spatial semantics; in address normalization, any informal or abbreviated address elements are converted into a standard format, and address elements that are written incorrectly are re-recorded in the correct format.

In this study, we employed a standardization process based on an address tree model, which was first proposed in our previously published paper [35]. The process is defined based on the following three rigorous definitions:

(1) An address is a collection of address elements and is allowed to point to multiple different spatial entities.
(2) Each address element possesses certain address semantics. The semantic meaning of an address element is the spatial object that is its real destination; this definition reflects the phenomenon that the same position may have multiple different addresses.
(3) The semantic level of an address element is defined by its class. According to this definition, administrative elements have the highest semantic level, and detailed local elements have the lowest semantic level.

As Figure 6 shows, the purpose of address standardization is to find a connected path in the semantic collection that has the correct spatial constraint relationships, and each connected path that is found can be regarded as a subtree of an address tree model. Address standardization can be suitably performed using a tree model based on the semantic characteristics of that model. The detailed standardization process is described as follows:

(1) Parse the input address string and organize it as a collection of address elements X and a semantic collection S.
(2) Create the root node, extract address element X1, and then traverse all of the semantic elements in S1 associated with X1 to create address semantic nodes and connect them to the root node.
(3) Continue to extract address element Xi and traverse all of its child nodes. For example, consider the comparison of Si1 with the current leaf node Li. First, perform the semantic level comparison, and if the semantic level of Si1 is lower than that of Li, then evaluate the consistency of the spatial constraint relationship between these two nodes. If this relationship is consistent, then connect Si1 to the current leaf node Li.

Figure 6. Relationship among addresses, address elements and address semantics.

If the relationship is not consistent, the operation must be repeated, tracing back along the current subtree until another node Li′ is found that has a higher semantic level than Si1 and the consistency of these two nodes can be confirmed. Thus, Step 3 is an iterative process of comparing Si1 with each subsequent node Li′; if they are consistent, Si1 is inserted into the current position in this subtree, and otherwise, Si1 must be compared with the next leaf node of the address tree.

For the same address element, if AddrLevel(Si) ≠ AddrLevel(Sj) (i ≠ j) and Sj has been designated as a new leaf node of the address model, then skip this leaf node.

Through this process, an input address string with confused, disorganized and even incorrect descriptions can be correctly reorganized to allow the spatial semantic structure of the address to be extracted for subsequent processing.
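The tree-building steps of the standardization process can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: semantic levels are encoded as integer ranks (0 is the highest level), spatial consistency is reduced to a lookup in a toy set of direct containing pairs, candidates that are consistent with no node are simply dropped, and the rule for skipping leaf nodes of differing levels is omitted. All names (`Node`, `leaves`, `find_anchor`, `build_address_tree`) and all place names are hypothetical.

```python
from typing import List, Optional, Set, Tuple


class Node:
    """One address semantic in the tree; rank 0 is the highest semantic level."""

    def __init__(self, name: str, rank: int, parent: Optional["Node"] = None):
        self.name = name
        self.rank = rank
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)

    def path(self) -> List[str]:
        """Names from the top of the subtree down to this node (root excluded)."""
        names, node = [], self
        while node.parent is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


def leaves(root: Node) -> List[Node]:
    """Collect the current leaf nodes of the tree."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            found.append(node)
    return found


def find_anchor(leaf: Node, name: str, rank: int,
                contains: Set[Tuple[str, str]]) -> Optional[Node]:
    """Trace back from a leaf to the first node of higher level that contains `name`."""
    node = leaf
    while node.parent is not None:  # stop when the root is reached
        if node.rank < rank and (node.name, name) in contains:
            return node
        node = node.parent
    return None


def build_address_tree(candidates: List[List[Tuple[str, int]]],
                       contains: Set[Tuple[str, str]]) -> Node:
    """candidates[i] lists the (semantic name, rank) pairs of address element X_i."""
    root = Node("ROOT", rank=-1)
    for index, semantics in enumerate(candidates):
        current_leaves = leaves(root)  # snapshot taken before this element is inserted
        for name, rank in semantics:
            if index == 0:  # Step 2: semantics of X_1 hang off the root
                Node(name, rank, root)
                continue
            for leaf in current_leaves:  # Step 3: compare with each leaf, tracing back
                anchor = find_anchor(leaf, name, rank, contains)
                if anchor is not None and all(c.name != name for c in anchor.children):
                    Node(name, rank, anchor)
    return root


# Toy gazetteer: the road name "Road R" is ambiguous between two districts.
contains = {
    ("City A", "District B"), ("City A", "District C"),
    ("District B", "Road R in B"), ("District C", "Road R in C"),
    ("Road R in B", "No. 8 on Road R in B"),
}
candidates = [
    [("City A", 0)],
    [("District B", 1)],
    [("Road R in B", 2), ("Road R in C", 2)],
    [("No. 8 on Road R in B", 3)],
]
tree = build_address_tree(candidates, contains)
standardized = max(leaves(tree), key=lambda n: len(n.path())).path()
print(standardized)
```

In this toy run, the candidate “Road R in C” is rejected because no node on the current path contains it, so the longest connected path gives the standardized address.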
Address matching is the process of translating an input address string into spatial coordinates [7], which can then be downloaded as a formatted file or be displayed on maps. Address matching technology is of great importance for integrating non-spatial data with spatial data and provides the ability to perform address-based mapping and spatial analysis to discover new spatial patterns and spatial correlations that are difficult to see from statistical data, for example, disease mapping and crime mapping. Address matching strives to return the most accurate matching results for any input address string. First, an attempt is made to accurately match the input address at the house number level; if no match result is found, matching will be performed at the next higher level represented in the address, such as the community, street or district level, until a result is found. Finally, the input address is assigned spatial coordinate information for mapping and spatial analysis.

Clearly, there are two possibilities when given a particular set of input address data: the existence or absence of corresponding address data in the reference database. In the first case, address matching is quite simple: the corresponding address and its geographical coordinates are found in the reference database. However, sometimes the desired address data do not exist in the reference database, perhaps because the input is a new address. In the second case, interpolation is required to find the most likely correct address. In addition, the numbers of the neighboring houses and their geographical coordinates must be determined, and the following formulas must be used to calculate the approximate geographical coordinates of the desired address [22,36]:

$$X = X_1 + \frac{X_2 - X_1}{N_2 - N_1} \times (N - N_1) \tag{1}$$

$$Y = Y_1 + \frac{Y_2 - Y_1}{N_2 - N_1} \times (N - N_1) \tag{2}$$

The variables in the formulas above are defined as follows: X1 is the longitudinal coordinate of the first house to the right of the given house number, X2 is the longitudinal coordinate of the first house to the left of the given house number, Y1 is the latitudinal coordinate of the first house to the right of the given house number, Y2 is the latitudinal coordinate of the first house to the left of the given house number, N1 is the number of the first house to the right, N2 is the number of the first house to the left, and N is the given house number. If there is no house number to the right of the given house number, then N2 is assigned the closest possible value to the given number and X = X2 and Y = Y2. If there is no house number to the left of the given house number, then N1 is assigned the closest possible value to the given number and X = X1 and Y = Y1. The flow chart of the interpolation algorithm is shown in Figure 7.

Figure 7. The flow chart of the interpolation algorithm for address matching.
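Equations (1) and (2), together with the two fallback cases, can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's implementation: the function name `interpolate`, the `House` tuple layout and the handling of the degenerate case N1 = N2 are assumptions that do not come from the paper.

```python
from typing import Optional, Tuple

# Hypothetical record layout: (house number, X coordinate, Y coordinate).
House = Tuple[int, float, float]


def interpolate(n: int, right: Optional[House], left: Optional[House]) -> Tuple[float, float]:
    """Estimate (X, Y) for house number n from its nearest neighbours.

    `right` corresponds to (N1, X1, Y1) and `left` to (N2, X2, Y2)
    in Equations (1) and (2).
    """
    if right is None and left is None:
        raise ValueError("at least one neighbouring house is required")
    if right is None:
        # No house to the right: use the left neighbour's coordinates.
        return left[1], left[2]
    if left is None:
        # No house to the left: use the right neighbour's coordinates.
        return right[1], right[2]

    n1, x1, y1 = right
    n2, x2, y2 = left
    if n1 == n2:
        # Degenerate case not covered by the text (assumption): avoid dividing by zero.
        return x1, y1

    ratio = (n - n1) / (n2 - n1)
    return x1 + (x2 - x1) * ratio, y1 + (y2 - y1) * ratio
```

For example, with a right neighbour numbered 10 and a left neighbour numbered 20, house number 14 is placed four tenths of the way from the first to the second.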

4. Geocoding Service Application

To verify the address matching methods proposed in this paper, a geocoding service system was developed for Shenzhen Municipality, China. This geocoding system uses the methods described in the previous section, including the address model, the address standardization procedure and the address matching algorithm.

The geocoding service system contains datasets collected from the SZPL. In total, there are approximately 708,000 records. Specifically, the dataset contains 714 records of administrative names, 9475 records of district names, 2000 records of landmarks, 340,000 records of POIs, 25,000 records of roads, 250,000 records of building numbers and 81,000 records of house numbers.

4.1. Developing the Geocoding Service

The geocoding system was built for the city of Shenzhen, China using the address matching methods presented above. The system consists of five components: a reference database, an address-matching engine, a geocoding service based on the address-matching engine, an online geocoding display system, and a geocoding database management system, which are visually summarized in Figure 8.

The reference database is based on Oracle 11g Release 2 and contains an administrative region dataset, a road dataset, a POI dataset, a building numbers dataset, index data, rules data and some other auxiliary datasets.

The address matching engine is the most important part of the system and runs on the database described above. It consists of a preprocessor, an address parser and an address-matching module. The preprocessor removes meaningless symbols, and the address parser attempts to parse each address input by the user in accordance with the address model and the characteristics of Chinese addresses. The address matching engine is coded in Java and uses Lucene for fuzzy and intelligent matching.
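The preprocessing step can be pictured with a short Python sketch. The actual engine is written in Java, and the paper does not list which symbols count as meaningless, so the character set kept below (CJK ideographs, ASCII letters and digits, and the hyphen used in unit numbers) is purely an assumption, as is the function name `preprocess`.

```python
import re
import unicodedata

# Assumption: everything except CJK ideographs, ASCII letters/digits and
# hyphens is treated as a "meaningless symbol" and dropped.
_NOT_KEPT = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff\-]")


def preprocess(raw: str) -> str:
    """Normalize an address string before it is handed to the parser."""
    # NFKC folds full-width digits, letters and punctuation to their ASCII forms.
    text = unicodedata.normalize("NFKC", raw)
    # Remove whitespace, punctuation and any other character outside the kept set.
    return _NOT_KEPT.sub("", text)
```

A real preprocessor would probably need a more careful rule set; for instance, this sketch would also delete a separator such as the backslash between two house numbers.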


Figure 8. Architecture of the geocoding service system.

Figure 9. The online geocoding service: (a) address matching and (b) reverse address matching. Note: In (a), * 4 shows the address matched, * 5 shows the matching degree, * 6 gives the coordinates, and * 7 gives the place name. In (b), * 5 gives the distance between the matched point and the point provided by the user.


Figure 10. The online geocoding display system (accurate matching).

Figure 11. The online geocoding display system (fuzzy matching).

The geocoding service is based on the address-matching engine and was built using standard web services. Figure 9 shows an example of the geocoding service output. The online geocoding display system is based on the geocoding service and uses the OpenLayers library for map visualization; Figures 10 and 11 show the interface of the online geocoding display system. Finally, a management system that includes integrated mapping and data-reading functionalities, which were implemented by embedding the ArcObjects product offered by ESRI, is proven to be able to manage the data. Figure 12 shows the main user interface of the database management system.

According to the address model described in the previous section, we identified the most suitable address model for Shenzhen Municipality, which is shown in Figure 13. When accounting for the address model, the higher stable location indicated in a descriptive address was chosen. At a finer level of detail, building numbers are usually the most important address information, followed by landmarks and POIs. Once the address model was created, it was used in the address standardization and address matching processes.

Figure 12. The geocoding management system.

Figure 13. Address Model for Shenzhen.


4.2. Experiment and Results

The address matching degree refers to the fit of a matching address with its destination address. The candidate address of an original address can be obtained and chosen based on the candidate address that best matches the matching address. The matching degree can be expressed as follows:

$$D = \frac{M_s}{O_s} = \frac{s(t_1 + t_2 + t_3 + \cdots + t_n)}{t_1 + x_1 + t_2 + t_3 + x_2 + \cdots + t_n + x_i} \approx \frac{s_1 + s_2 + s_3 + \cdots + s_n}{t_1 + x_1 + t_2 + t_3 + x_2 + \cdots + t_n + x_i} = \frac{s_n}{t_1 + x_1 + t_2 + t_3 + x_2 + \cdots + t_n + x_i} \tag{3}$$

where D is the matching degree, Ms is the matching address of the original address, Os is the destination address of the original address, and M is the spatial semantic set of the original address segmentation results, which can be described as s(t1 + t2 + t3 + ... + tn) or s1 + s2 + s3 + ... + sn. The most detailed address is sn. The original address can be divided into address elements: t means that the address element can be identified, and x means that the address element cannot be identified. We can calculate the matching degree from Equation (3). As a better calculation method, we adopted the modified vector space model (M-VSM) to calculate the matching degree, which is expressed as follows:

$$D = \cos\theta = \frac{M \cdot O}{\|M\| \, \|O\|}$$

where M is the weight vector of the matching address elements and O is the weight vector of the original address segmentation. In the results, a matching degree of 100% means that the address description is quite standard and fits the address model regulation. However, several factors can reduce the matching degree, such as an incomplete reference database and unformulated addresses without relative locations or spatial direction relations.

More than 100,000 records of disease data from the SCHI and approximately 1,360,000 records of company data from the SZSCJG were used to test the developed geocoding service system. In this study, we consider the building data as the reference dataset and the disease data and company data as the target data. If the address of a target record can be found in the reference dataset, we assume that the record is completely matched, with a matching degree of 100%. If only some elements of the target record can be found in the reference dataset, we consider the record partially matched; we then calculate the matching degree for each match process and choose the best one, with a matching degree between 0% and 100%. However, if no elements of the target record can be found in the reference dataset, the result is unmatched, with a matching degree of 0%. Thus, higher matching degrees correspond to more accurate matching processes.

The experimental results are shown in Table 1. Figure 14 shows the matching degree distribution. Overall, the results reveal the following:

(1) For the disease data, 61.1% of the records were matched with full accuracy, 3.3% of the records were matched with a matching degree of 90%–100%, 8.5% of the records were matched with a matching degree of 80%–90%, and only 13.9% of the addresses had a matching degree of less than 60%.

(2) For the company data, 98.6% of the records were matched with full accuracy, 81.1% of the records were matched with a matching degree of 90%–100%, 17.4% of the records were matched with a matching degree of 70%–80%, and only 1.4% of the addresses had a matching degree of less than 60%.

For a nonstandard address, such as “Hongli Road, Nanshan District, Shenzhen” (Hongli Road is not located in Nanshan District), the address standardization module will standardize and correct the address (in this case, to “Hongli Road, Futian District, Shenzhen”).
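The Hongli Road correction can be illustrated with a toy Python sketch. It is not the address tree model itself: a single hypothetical lookup table, `ROAD_TO_DISTRICT`, stands in for the spatial constraint check between a road node and its parent district node, and it assumes each road lies in exactly one district, which real roads need not do.

```python
# Hypothetical miniature reference data: road -> district that contains it.
# Only the Hongli Road entry is taken from the text.
ROAD_TO_DISTRICT = {
    "Hongli Road": "Futian District",
}


def correct_district(city: str, district: str, road: str) -> str:
    """Return the address with the district replaced when it contradicts the reference data."""
    expected = ROAD_TO_DISTRICT.get(road)
    if expected is not None and expected != district:
        # The stated district is inconsistent with the road's known parent: fix it.
        district = expected
    # Roads missing from the reference data are left as entered.
    return f"{road}, {district}, {city}"
```

Calling `correct_district("Shenzhen", "Nanshan District", "Hongli Road")` yields the corrected string quoted above, while an address whose road is unknown to the table passes through unchanged.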

Table 1. Experimental results.

| Matching degree | Unmatched | 0%–50% | 50%–60% | 60%–70% | 70%–80% | 80%–90% | 90%–100% | 100% |
|---|---|---|---|---|---|---|---|---|
| Company data (match rate) | 13.8% | 0.1% | 0% | 13.88% | 0% | 8.5% | 3.3% | 60.4% |
| Disease data (match rate) | 0.1% | 1.3% | 0% | 17.4% | 0% | 20.2% | 11.6% | 49.3% |

Figure 14. Matching degree distributions: (a) matching degree results for the disease data; and (b) matching degree results for the company data.
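The cosine form of the matching degree used in this section can be illustrated with a few lines of Python. This is a minimal sketch of plain cosine similarity between two weight vectors; the modifications that distinguish M-VSM from the standard vector space model are not spelled out in the text, so none are implemented here, and the function name `matching_degree` and the example weights are hypothetical.

```python
import math
from typing import Sequence


def matching_degree(m: Sequence[float], o: Sequence[float]) -> float:
    """Cosine of the angle between the matched-element vector m and the original vector o."""
    if len(m) != len(o):
        raise ValueError("weight vectors must have the same length")
    norm = math.sqrt(sum(a * a for a in m)) * math.sqrt(sum(b * b for b in o))
    if norm == 0.0:
        # Nothing matched (or an empty address): treat as unmatched.
        return 0.0
    return sum(a * b for a, b in zip(m, o)) / norm


# Hypothetical weights for four address elements (district, road, building, house).
original = [1.0, 1.0, 1.0, 1.0]
matched = [1.0, 1.0, 1.0, 0.0]   # the house number could not be found
degree = matching_degree(matched, original)
```

With equal weights, losing one of four elements gives a degree of about 0.87; a system that wanted finer elements to matter more or less would choose unequal weights.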

5. Discussions

In accordance with the proposed address matching methods, a geocoding service system was developed for Shenzhen, China. To measure the effectiveness of this method, the matching degree, the matching rate, and measures of adaptability and intelligence were used to assess the quality of the address matching method. The matching degree indicates the quality of a geocoded record. A record is considered to be accurately matched, inaccurately matched or unmatched. This system provides a concise score for every matched address; a higher score indicates more accurate matching. Table 2 summarizes the interpretation of the matching degree.

Table 2. Matching degree and its interpretation.

| Matching Degree | Matching Level Information |
|---|---|
| 100% | Matched with full accuracy |
| 90%–100% | Matched at the building level (building numbers); building number interpolation is possible |
| 80%–90% | Matched at the block level |
| 70%–80% | Matched at the community level |
| 60%–70% | Matched at the road level |
| 50%–60% | Matched at the sub-district level |
| 0%–50% | Matched at the district level |
| Unmatched | Unmatched at any level |

The matching rate is the proportion of addresses matched successfully at a given matching degree. Figure 15 shows the matching rates determined from the experimental results.

From the experimental results, we can see that for the company data, 60.4% of the addresses were matched accurately and 86% of the addresses were matched successfully at a matching degree greater than 60%; the corresponding matching rates for the disease data were 49.3% and 98.6%, respectively. After analyzing the matching logs, we identified two main reasons for these findings. First, the company data were recorded in a more regular manner, whereas the address descriptions in the disease data are highly irregular because they were recorded by different people with different understandings of address formatting. Second, many of the addresses in the disease data are not located in Shenzhen. However, the results are accurate enough to support some research regarding public health [37,38].

Figure 15. Matching rate statistics: (a) for the disease data; and (b) for the company data.
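The matching rates quoted in this section can be recomputed from the bins of Table 1. The short Python sketch below does that arithmetic; the names `BINS`, `TABLE_1` and `rate_at_or_above` are hypothetical, and the values are simply transcribed from Table 1.

```python
# Matching-degree bins of Table 1, from worst to best.
BINS = ["unmatched", "0-50", "50-60", "60-70", "70-80", "80-90", "90-100", "100"]

# Match rates (percent) transcribed from Table 1.
TABLE_1 = {
    "company": [13.8, 0.1, 0.0, 13.88, 0.0, 8.5, 3.3, 60.4],
    "disease": [0.1, 1.3, 0.0, 17.4, 0.0, 20.2, 11.6, 49.3],
}


def rate_at_or_above(dataset: str, first_bin: str) -> float:
    """Sum the match rates of `first_bin` and every better bin."""
    start = BINS.index(first_bin)
    return sum(TABLE_1[dataset][start:])


company_over_60 = rate_at_or_above("company", "60-70")
disease_over_60 = rate_at_or_above("disease", "60-70")
```

Summing the bins from 60%–70% upward gives about 86.1% for the company row and 98.5% for the disease row. The latter differs slightly from the 98.6% quoted in the text, presumably because the table entries are rounded.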

The developed geocoding service exhibits good performance in terms of adaptability. For example, consider the address “Hongli Road, Nanshan District, Shenzhen” (Hongli Road is not located in Nanshan District). The system can correct this address to “Hongli Road, Futian District, Shenzhen” while parsing the address using the address tree model.

Furthermore, by virtue of the address tree model (described in Section 2.2) and Apache Lucene, the solution can provide some support for fuzzy and intelligent matching. For instance, when a user inputs the address “KFC, Futian District, Shenzhen”, the system will return all of the KFCs in Futian District.

Upon analyzing the results with low matching degrees, we observed that most of them involve house numbers, along with some addresses that have changed or have a different identifier (alias). House number descriptions are highly chaotic; different people often use different formats for describing house numbers. Table 3 shows several examples of complex house numbers that make it difficult for a computer to understand the address.

Table 3. Examples of complex house numbers.

Complex House Number

深圳市南山区桃源街道龙光社区南山区龙井路光前村西区14号A\14号B
En: No.14A\No.14B, West Area, GuangQian Village, LongJing Road, NanShan District, LongGuang Community, TaoYuan Street, NanShan District, Shenzhen City

深圳市南山区南头街道田厦社区南新路2108号南头影剧院综合大楼4F402
En: 4F402, Comprehensive building of Nantou Theater, No.2108, NanXin Road, TianSha Community, NanTou Street, NanShan District, Shenzhen City

福田区华发北路桑达新村22-5-202
En: 22-5-202, SangDa New Village, North Hua Fa Road, FuTian District

光明新区光明街道笔架山临时住宅37栋1号整
En: No.1, 37 Building, Temporary residence in Beacon Hill, GuangMing Street, GuangMing New District

Another deficiency of the system arises in the case of aliases or historical addresses. With the rapid development of a city, many place names and addresses may change. In addition, some administrative boundaries may change; however, the historical records still use the old place names and addresses. This issue continues to pose a significant challenge.

6. Conclusions

This paper introduced the concept of geocoding Chinese addresses and described the challenges encountered in Chinese geocoding. Then, this paper proposed an optimized address matching method for Chinese geocoding with three components: address modeling, address standardization and address matching. Based on the proposed method, we created a geocoding service system for Chinese addresses using more than 708,000 address records from the SZPL. Finally, we conducted an experiment using more than 100,000 records of disease data and approximately 1,360,000 records of company data. In these experiments, the average matching rate was greater than 90% for a matching degree higher than 60%. The geocoding service system has been widely used in many fields, indicating that the optimized address matching method can be highly effective and accurate. The fact that our proposed system can parse not only standard addresses but also some nonstandard addresses illustrates its good adaptability. Furthermore, the proposed optimized method provides some degree of support for fuzzy and intelligent matching capabilities. Future improvements will focus on the correct handling of aliases and historical addresses as well as on better parsing of the sometimes chaotic descriptions of building numbers and house numbers.

Acknowledgments: This study was supported by the National Natural Science Foundation of China (Project No. 41271455 and 41371427). The authors would like to thank Mengjun Kang, Haijing Zhang, Yuexin Lu and Tianqi Qiu from Wuhan University for their valuable suggestions.

Author Contributions: Qin Tian, Fu Ren, Tao Hu, Jiangtao Liu, Ruichang Li and Qingyun Du worked collectively. Specifically, Qingyun Du and Fu Ren developed the original idea for the study and conducted the organization of the content. Qin Tian and Tao Hu conducted the geocoding experiments and performed the analysis of the results. Jiangtao Liu and Ruichang Li provided literature guidance and proposals for the study. All of the co-authors drafted and revised the article collectively, and all authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Coetzee, S.; Bishop, J. Address databases for national SDI: Comparing the novel data grid approach to data harvesting and federated databases. Int. J. Geogr. Inf. Sci. 2009, 23, 1179–1209. [CrossRef] 2. Jing, Z.; Qi, L. Research on the application of geocoding. Geogr. Geo-Inf. Sci. 2003, 19, 22–25. 3. Eichelberger, P. The importance of addresses: The locus of GIS. In Proceedings of the URISA Annual Conference, Atlanta, GA, USA, 25–29 July 1993; pp. 212–222. 4. Rhind, G.R. Global Sourcebook of Address Data Management: A Guide to Address Formats And Data in 194 Countries; Gower: Aldershot, UK, 1999; Volume 615. 5. Davis, C.A.; Fonseca, F.T. Assessing the certainty of locations produced by an address geocoding system. Geoinformatica 2007, 11, 103–129. [CrossRef] 6. Shah, T.I.; Bell, S.; Wilson, K. Geocoding for public health research: Empirical comparison of two geocoding services applied to canadian cities. Can. Geogr. 2014, 58, 400–417. [CrossRef] 7. Baldovin, T.; Zangrando, D.; Casale, P.; Ferrarese, F.; Bertoncello, C.; Buja, A.; Marcolongo, A.; Baldo, V. Geocoding health data with geographic information systems: A pilot study in northeast italy for developing a standardized data-acquiring format. J. Prev. Med. Hyg. 2015, 56, 88–94. 8. Ratcliffe, J.H. Geocoding crime and a first estimate of a minimum acceptable hit rate. Int. J. Geogr. Inf. Sci. 2004, 18, 61–72. [CrossRef] 9. Rushton, G.; Armstrong, M.; Gittler, J. Geocoding in cancer research: A review. Am. J. Prev. Med. 2006, 30, 16–24. [CrossRef][PubMed] 10. Goodchild, M. GIS and transportation: Status and challenges. GeoInformatica 2000, 4, 127–139. [CrossRef] 11. Qin, X.; Parker, S.; Liu, Y.; Graettinger, A.J.; Forde, S. Intelligent geocoding system to locate traffic crashes. Accid. Ana. Prev. 2013, 50, 1034–1041. [CrossRef][PubMed] 12. Mammadrahimli, A. Assessment of crash location improvements in map-based geocoding systems and subsequent benefits to geospatial crash analysis. 
In Proceedings of 94th Transportation Research Board Annual Meeting, Washington, DC, USA, 11–15 January 2015. 13. Krieger, N.; Chen, J.T.; Waterman, P.D.; Soobader, M.-J.; Subramanian, S.V.; Carson, R. Geocoding and monitoring of us socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter?: The Public Health Disparities Geocoding Project. Am. J. Epidemiol. 2002, 156, 471–482. [CrossRef][PubMed] ISPRS Int. J. Geo-Inf. 2016, 5, 65 16 of 17

14. Shi, X. Evaluating the uncertainty caused by post office box addresses in environmental health studies: A restricted monte carlo approach. Int. J. Geogr. Inf. Sci. 2007, 21, 325–340. [CrossRef] 15. Goldberg, D.W.; Wilson, J.P.; Knoblock, C.A. From text to geographic coordinates: The current state of geocoding. URISA J. 2007, 19, 33–46. 16. Karimi, H.A.; Sharker, M.H.; Roongpiboonsopit, D. Geocoding recommender: An algorithm to recommend optimal online geocoding services for applications. Trans. GIS 2011, 15, 869–886. [CrossRef] 17. Goldberg, D.W. Advances in geocoding research and practice. Trans. GIS 2011, 15, 727–733. [CrossRef] 18. Bonner, M.R.; Han, D.; Nie, J.; Rogerson, P.; Vena, J.E.; Freudenheim, J.L. Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology 2003, 14, 408–412. [CrossRef][PubMed] 19. Goldberg, D.W.; Ballard, M.; Boyd, J.H.; Mullan, N.; Garfield, C.; Rosman, D.; Ferrante, A.M.; Semmens, J.B. An evaluation framework for comparing geocoding systems. Int. J. Health Geogr. 2013, 12.[CrossRef] [PubMed] 20. Roongpiboonsopit, D.; Karimi, H.A. Comparative evaluation and analysis of online geocoding services. Int. J. Geogr. Inf. Sci. 2010, 24, 1081–1100. [CrossRef] 21. Whitsel, E.A.; Quibrera, P.M.; Smith, R.L.; Catellier, D.J.; Liao, D.; Henley, A.C.; Heiss, G. Accuracy of commercial geocoding: Assessment and implications. Epidemiol. Perspect. Innov. 2006, 3.[CrossRef] [PubMed] 22. Goldberg, D.W. Improving geocoding match rates with spatially-varying block metrics. Trans. GIS 2011, 15, 829–850. [CrossRef] 23. Lovasi, G.S.; Weiss, J.C.; Hoskins, R.; Whitsel, E.A.; Rice, K.; Erickson, C.F.; Psaty, B.M. Comparing a single-stage geocoding method to a multi-stage geocoding method: How much and where do they disagree? Int. J. Health Geogr. 2007, 6.[CrossRef][PubMed] 24. Ran, W.; Xuehu, Z.; Linfang, D.; Haoming, M.; Qi, L. A knowledge-based agent prototype for chinese address geocoding. 
In Proceedings of the Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Advanced Spatial Data Models and Analyses, Guangzhou, China, 28–29 June 2008. 25. Weihong, L.; Ao, Z.; Kan, D. An efficient bayesian framework based place name segmentation algorithm for geocoding system. In Proceedings of the Fifth International Conference on Intelligent Systems Design and Engineering Applications (ISDEA), Zhangjiajie, China, 15–16 June 2014; pp. 141–144. 26. Zandbergen, P.A. A comparison of address point, parcel and street geocoding techniques. Comput. Environ. Urban Syst. 2008, 32, 214–232. [CrossRef] 27. Zhang, L.; Yi, J. A brief analysis of geocoding. In Software Engineering and Knowledge Engineering: Theory and Practice: Volume 2; Wu, Y., Ed.; Springer: Berlin, Germany, 2012; pp. 987–993. 28. Zhang, X.; Ma, H.; Li, Q. An address geocoding solution for Chinese cities. In Proceedings of the Geoinformatics 2006: Geospatial Information Science, Wuhan, China, 28–29 October 2006. 29. Ratcliffe, J. On the accuracy of tiger-type geocoded address data in relation to cadastral and census areal units. Int. J. Geogr. Inf. Sci. 2010, 15, 473–485. [CrossRef] 30. Gaitanis, H.; Winter, S. Is a richer address data model relevant for lbs? In Principle and Application Progress in Location-Based Services; Liu, C., Ed.; Springer: Berlin, Germany, 2014; pp. 121–137. 31. Chinese Academy of Surveying and Mapping. GB/T 23705-2009: The Rules of Coding for Address in the Common Platform for Geospatial Information Service of Digital City; Standards Press of China: Beijing, China, 2009. 32. Huanju, Y.; Qingwen, Q.; Yunling, L. Study on city address geocoding model based on street. J. Geo-Inf. Sci. 2013, 15, 175–179. 33. Haibo, Y. Techniques on Geocoding in Digital Cities and Their Applications. Master Thesis, China, University of Petroleum, Beijing, China, 2009. 34. Zhaoting, M.Z.; Zhigang, L.; Wei, S.; Jie, Y. An automatic geocoding algorithm based on address segmentation. Bull. 
Surv. Map. 2011, 2, 59–63. 35. Mengjun, K.; Qingyun, D.; Mingjun, W. A new method of chinese address extraction based on address tree model. Acta Geod. Cartogr. Sin. 2015, 44, 99–107. 36. Bakshi, R.; Knoblock, C.; Thakkar, S. Exploiting online sources to accurately geocode addresses. Int. J. Geogr. Inf. Sci. 2004.[CrossRef] ISPRS Int. J. Geo-Inf. 2016, 5, 65 17 of 17

37. Hu, T.; Du, Q.; Ren, F.; Liang, S.; Lin, D.; Li, J.; Chen, Y. Spatial analysis of the home addresses of hospital patients with hepatitis b infection or hepatoma in shenzhen, China from 2010 to 2012. Int. J. Environ. Res. Public Health 2014, 11, 3143–3155. [CrossRef][PubMed] 38. Wang, Z.; Du, Q.; Liang, S.; Nie, K.; Lin, D.N.; Chen, Y.; Li, J.J. Analysis of the spatial variation of hospitalization admissions for hypertension disease in shenzhen, China. Int. J. Environ. Res. Public Health 2014, 11, 713–733. [CrossRef][PubMed]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).