International Journal of Geo-Information
Article

A Polygon and Point-Based Approach to Matching Geospatial Features
Juan J. Ruiz-Lendínez *, Manuel A. Ureña-Cámara and Francisco J. Ariza-López
Departamento de Ingeniería Cartográfica, Geodésica y Fotogrametría, Escuela Politécnica Superior de Jaén, Universidad de Jaén, 23071 Jaén, Spain; [email protected] (M.A.U.-C.); [email protected] (F.J.A.-L.)
* Correspondence: [email protected]; Tel.: +34-953-21-24-70; Fax: +34-953-21-28-54
Received: 19 October 2017; Accepted: 1 December 2017; Published: 5 December 2017
Abstract: A methodology for matching bidimensional entities is presented in this paper. The matching is proposed for both area and point features extracted from geographical databases. The procedure used to obtain homologous entities is achieved in a two-step process: The first matching, polygon to polygon matching (inter-element matching), is obtained by means of a genetic algorithm that allows the classification of area features from two geographical databases. After this, we apply a point to point matching (intra-element matching) based on the comparison of changes in their turning functions. This study shows that genetic algorithms are suitable for matching polygon features even if these features are quite different. Our results show up to 40% of matched polygons with differences in geometrical attributes. With regard to the matching of points (the vertices of homologous polygons), the function and threshold values proposed in this paper provide a useful method for obtaining precise vertex matching.
Keywords: area entities matching; point entities matching; genetic algorithm; turning function descriptor; geographical databases
1. Introduction

Pattern recognition and matching of features are essential processes in many problems when dealing with graphic information [1,2]. Whether in vector format (object-oriented) or in raster format, there are many studies related to graphic information. One specific type of graphic information is geographic information or spatial data, which is very much in demand nowadays because of the use of location in many areas (business, military, international aid, tourism, fleet management, smart territories, etc.) and has great economic importance [3]. The main feature of spatial data is the geographical position, which means that spatial data refer to positions in the real world. Position is the basis of mapping and navigation, which are essential for engineering, natural sciences, land management, etc. Different matching techniques based on position have been developed in response to the need to create new cartographic products, which are the result of integrating Geospatial Databases (GDB) from heterogeneous sources with equally heterogeneous levels of quality [4–6]. This integration process has not only allowed the development of new ways of mapping based on Volunteered Geographic Information (VGI) and Spatial Data Infrastructures (SDI), but has also been the cause of an ever-increasing necessity to implement more efficient accuracy assessment procedures for evaluating the quality of these products. Under this new geospatial paradigm, knowledge of the positional accuracy of a GDB is a critical aspect. Thus, a GDB with an insufficient level of positional accuracy for a particular purpose can become useful when integrated with another GDB of higher accuracy. Therefore, and regardless of this level of accuracy, it is necessary to know it. Positional accuracy has traditionally been evaluated by means of statistical methods based on measuring errors in points between the real world and the
ISPRS Int. J. Geo-Inf. 2017, 6, 399; doi:10.3390/ijgi6120399 www.mdpi.com/journal/ijgi
dataset (Figure 1a). From a statistical point of view, one of the most controversial aspects of these methods is the number of points that must be used to ensure, with a given level of confidence, that a GDB with an unacceptable quality level will not be acquired [7]. Thus, although recommendations always suggest at least 20 points [8,9], other studies [10] and institutions [11,12] suggest that this size (n = 20) seems to be very small, and larger sizes are proposed [13]. In this sense, the new capabilities offered by Global Navigation Satellite Systems (GNSS) in the last few decades have allowed a great improvement in the process of field data acquisition; however, their use still presents the inconvenience of having to use a limited number of points. This factor, combined with the very high cost of field surveying, represents the main weakness of the traditional points-based methods (Figure 1a).
Figure 1. (a) Traditional points-based methods of positional quality assessment and (b) the new approach based on automated positional assessment. GDB, Geospatial Databases.
Having taken into account these considerations, and in order to overcome the difficulties mentioned previously, a new assumption was developed: that positional accuracy could also be defined by measuring the differences between the location stored in a GDB (tested source) and a location determined by another GDB (reference source) of higher accuracy (Figure 1b). Thus, if the accuracy of the second source is high enough, the unmeasured difference between it and the real one can be ignored [14]. Under this new approach to the overall process of positional assessment, the first and most important stage is to identify the homologous elements between both GDB, quickly and reliably. Thus, we consider that now is the right time for developing new methodologies based on the automated assessment of the positional and geometric components of geographic information. In this sense, the authors developed a methodology to analyze the assessment of the positional components of lines based on a previous matching and a set of descriptors [15]. Latterly, they continued their study on positional accuracy by means of point-based methodologies [16]. Both studies are based on a two-step matching process that is extensively described in this paper. Thus, we can consider that the assessment process is essentially dealing with a matching problem where there are two key issues [2]: (1) how to measure the similarity between elements from both datasets (in our case, GDB) in order to match them (Figure 2a) and (2) how to achieve the optimal vertex-to-vertex correspondence between the previously-matched elements (Figure 2b). Following these ideas, in this paper, we show the matching process in two steps (inter-element and intra-element matching) that we have been developing in order to allow quality control assessment of two GDBs.
Figure 2. (a) Similarity between elements from both GDB and (b) vertex-to-vertex correspondence between the elements matched previously.
1.1. Inter-Elements Matching (Inter-Polygons)

With regard to the measurement of similarity in order to match elements, this is generally evaluated by means of a quantitative variable, which measures the distance between the elements compared and analyzes them by means of a set of characteristics. In the field of geographic information, there are several studies regarding the classification of these measures, among them Tong et al. [17], who defined three different categories:
• Geometric measures: considers all those aspects arising from the coordinates that define a spatial object: (i) positional aspect and (ii) shape, which is represented by means of the geometry.
• Spatial relationship measures: considers geometric aspects related to the relation between two features: (i) distance between features, (ii) direction and (iii) topological relations.
• Attribute measures: considers thematic aspects of the geographic information.
The issue examined in this paper is related both to the geometric measures used to determine a geometric similarity and the spatial relationship measures employed in order to determine distances from matching features. In our case, the position and the shape adopted by cartographic elements that represent real-world phenomena are the most significant aspects taken into consideration in measuring the similarity between them and their subsequent matching.

Over the years, some matching techniques have been developed. However, the majority of the current approaches concentrate on network matching and mainly on road networks [18–27]. In contrast to network matching, there are few studies concerning area matching; nevertheless, despite this shortage of research, there are relevant area-based matching algorithms that must be analyzed.

Overall, it can be said that the techniques employed to match objects using this type of algorithm are based on: the percentage of overlapped area [28–30], the context by means of a Delaunay triangulation [31,32], the belief theory on position and orientation [33] and probability classifiers over a set of “evidence”, both geometric and attribute-based [34]. A review of these studies shows that even though the overall performance in matching between features is satisfactory, there are some aspects that need to be improved. In this sense, the main problems are usually related to: the generation of false matching pairs in the cases of 1:n or n:m correspondences with polygons with relatively complex contours [29,30], the matching between networks with different levels of detail [33,34] and the process's low level of automatization [32]. Especially interesting are the ideas presented by Mortara and Spagnolo [31]. In their study, the correspondences between polygons are found by taking into account the morphological characterization of the shapes by means of their approximate skeleton. Thus, the correspondence problem is essentially reduced to a question of a skeleton matching process. It could be said, therefore, that this approach represents a combined solution (area-based and network-based) to the matching problem. On the other hand, Zhang et al. [35] compare different matching algorithms like the Classification And Regression Tree (CART) [36], C4.5 algorithms [37], the naive Bayes classifier (a probabilistic model) and Support Vector Machines (SVM) over the same set of measures (area relationship, shape similarity and an adapted wall statistical weighting). The conclusions of these authors do not mark a difference between the results of all these classifiers.

Another type of matching algorithm is inspired by the concept of shape blending [38,39]. Based on the morphological characterization of objects, this process is related to the gradual transformation of one shape into another and requires, as a first step, finding correspondences between features from the two shapes. Although the main and the most numerous practical applications of shape blending are related to computer animation, scientific visualization and solid modeling, some authors have developed methods applied to spatial data. Among these methods, we must highlight the approach developed by Prakash and Robles-Kelly [40]. These last authors measure the similarity (or dissimilarity) between objects by counting the number of edit operations required to transform one object into another, and although this approach is closely related to the problem of measuring geometrical similarities, it does not seem to provide an optimal solution when the number of correspondences is too high. For this reason, recent efforts have been focused on comparing an input shape with various template shapes (patterns), which represent potential solutions to the matching problem, and reporting the template with the highest similarity (or lowest dissimilarity) to the input shape. In our case, and because of the need to match a large amount of shapes, it makes sense to use this last mechanism based on pattern comparison. In addition, there are various similarity/dissimilarity measures to be considered for this task [41].
1.2. Intra-Elements Matching (Vertex-to-Vertex)

With regard to achieving an optimal vertex-to-vertex correspondence between previously matched elements, the background information available in the field of spatial data is once again very sparse. Thus, the main contributions center on applications related to graph matching, such as shape analysis, face recognition and structural pattern recognition. Among the proposed methods, we must highlight the studies of Gösseln and Sester [42] and Butenuth et al. [43]. These authors match polygon datasets by using an iterative closest point algorithm that detects corresponding point pairs for two point sets derived from each contour of corresponding objects. Zhang and Meng [44] integrated two spatial datasets by means of an iterative matching approach, and Safra et al. [45] developed a location-based join algorithm for searching for corresponding objects based on the distance proximity between their representative points. On the other hand, Huh et al. [46] developed a method for detecting a corresponding point pair between a polygonal object pair with a string matching method based on a confidence region model of a line segment. Another common approach is that of isometric point pattern matching [47]. According to these authors, in such a case, we assume that a “copy” of an object G (the “template”) appears isometrically transformed within G′ (the “target”). Song et al. (2011) [6] developed a procedure based on a relaxation of the probability of matching between different junctures inside a traffic network. However, these methods have some limitations in detecting vertex-to-vertex correspondences between previously matched elements.
Some of them require topological properties such as the valence of each node or the directions of the edges connected at a node [6,44], while others consider the vertices or locations of objects as isolated points, not as geometric primitives that form the shape of an object [42,43,45], and finally, others are restricted to a low density of objects to be matched [42,46]. Following Fan et al. [30], in this last case, when neighboring polygons are located immediately next to each other and are similar in shape and size, matching errors will occur. In order to overcome the aforementioned limitations, contour-matching algorithms have been developed. These algorithms take into account both the order and location of an object's vertices and match the contours of an object pair by comparing their vertex strings [48]. Nevertheless, these methods, like those described above, present some disadvantages, which in this case are related mainly to the comparison between vertex strings that belong to objects with relatively different contours. In addition, they cannot be applied to a 1:n or an m:n corresponding object pair [29].
A different interpretation of the contour-matching algorithms is proposed by Arkin et al. [49], who compare two polygonal curves by means of their turn angle representations. Specifically, this last approach allows us to solve the point matching problem more efficiently than in the previous cases due to the characteristics of the geographic information with which we work.
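To illustrate the idea behind this kind of comparison, two polygonal contours can be reduced to their turn angle representations, i.e., step functions of normalized arc length, and then compared with an Lp metric. The sketch below is our own minimal illustration, not the implementation of Arkin et al.; the function name `step_distance` and the breakpoint-list representation are assumptions made here for clarity.

```python
def step_distance(f, g, p=2):
    """Lp distance between two step functions defined on [0, 1].

    Each function is a list of (s, value) breakpoints: `value` holds from
    arc length `s` until the next breakpoint. The breakpoints of both
    functions are merged so the difference is constant on every
    sub-interval, which makes the integral exact.
    """
    breaks = sorted({s for s, _ in f} | {s for s, _ in g} | {1.0})

    def value_at(fn, s):
        v = fn[0][1]
        for start, val in fn:
            if start <= s:
                v = val
            else:
                break
        return v

    total = 0.0
    for a, b in zip(breaks, breaks[1:]):
        diff = abs(value_at(f, a) - value_at(g, a))
        total += (diff ** p) * (b - a)
    return total ** (1.0 / p)
```

Declaring two contours homologous then amounts to checking whether this distance, computed over the relevant portions of the two turning functions, falls below a previously fixed threshold.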
1.3. Paper Outline

Taking into account all of the above, we propose an approach for the matching of objects from GDBs, which tries to overcome as far as possible the limitations of the methods discussed above. Constrained to a unique spatial phenomenon like urban zones and a specific type of element like polygonal shapes, our method considers the first two measures mentioned by Veltkamp and Hagedoorn (2000) [41] applied in two major steps:
• Quantification of similarity (inter-elements): The matching of polygonal shapes is effected by means of a weight-based classification methodology using the polygon's low-level feature descriptors (edge-based features such as area, perimeter, number of angles, etc.). This classification allows for the categorization of the matching quality from a quantification of the similarity. Among all the existing classifiers, we have chosen a Genetic Algorithm (GA). The GA belongs to a family of stochastic global optimization methods based on the concept of Darwinian evolution in populations. Their performance has been sufficiently studied and developed by several authors [50–54], and all their possible variants share the main mechanism that operates on an initial population of randomly-generated chromosomes or individuals, who represent possible solutions to the problem, and consists of three operations: evaluation of individual fitness, development of an intermediate population through selection mechanisms and recombination through crossover and mutation operators. The main reason for employing a GA is that, compared with other matching methods based on descriptors such as Fourier descriptors [55], these classifiers have proven their applicability not only to solve matching problems for all those approaches that need a powerful search procedure for problems with large, complex and poorly understood search spaces [56], but also to produce quality values from this matching [57,58]. These quality values in turn are closely related to the geometric similarity measures and are very relevant for our work since they allow us to select 1:1 corresponding object pairs among all the possible correspondences. This is the simplest and most efficient approach to matching geospatial data [59] and avoids the disadvantages described in Section 1.1 (false matching pairs in the cases of 1:n or n:m correspondences).
• Threshold of dissimilarity (intra-element): After having matched the polygonal shapes from the two GDBs, the second and last step involves using a metric to compute homologous points, which represent a vertex-to-vertex matching. Specifically, we used a contour-matching metric based on the method defined by Arkin et al. [49] for comparing two polygonal shapes, which essentially consists of comparing their turning functions and checking whether the distance between these two functions in certain areas of their graphic representations is smaller than a previously fixed threshold. In this case, the use of this metric applied to 1:1 correspondences allowed us to overcome the disadvantages described in Section 1.2. In addition, this metric has also allowed us to define a powerful new edge-based shape descriptor, which takes into account the relationships between adjacent boundaries of polygons.
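The GA of the first step can be sketched in miniature as follows: chromosomes are descriptor weights, fitness is classification accuracy over labelled training pairs, and the three classic operations (selection, crossover, mutation) drive the search. Everything below — the function names, the truncation selection scheme and the toy weighted-similarity fitness — is our illustrative assumption, not the authors' actual implementation.

```python
import random

def similarity(desc_a, desc_b, weights):
    """Weighted similarity of two descriptor vectors pre-normalized to [0, 1]."""
    return sum(w * (1.0 - abs(a - b))
               for w, a, b in zip(weights, desc_a, desc_b)) / sum(weights)

def fitness(weights, labelled_pairs, threshold=0.5):
    """Fraction of labelled pairs classified correctly at `threshold`."""
    hits = sum((similarity(a, b, weights) >= threshold) == is_match
               for a, b, is_match in labelled_pairs)
    return hits / len(labelled_pairs)

def evolve(labelled_pairs, n_descriptors, pop_size=30, generations=50,
           mutation_rate=0.1, seed=42):
    """Evolve descriptor weights that best separate matches from non-matches."""
    rng = random.Random(seed)
    pop = [[rng.random() + 0.01 for _ in range(n_descriptors)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, labelled_pairs), reverse=True)
        parents = pop[:pop_size // 2]              # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_descriptors)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:       # random-reset mutation
                child[rng.randrange(n_descriptors)] = rng.random() + 0.01
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, labelled_pairs))
```

Because the best individuals are carried over unchanged in each generation, the best fitness is non-decreasing; the evolved weights then yield a similarity score per candidate pair, from which 1:1 correspondences can be selected.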
The rest of this paper is organized as follows: Section 2 defines the two geospatial datasets and the polygonal features used. Sections 3 and 4 focus on the matching methodology. Specifically, in Section 3, the matching of polygonal shapes is addressed, and in Section 4, we examine the issue of vertex-to-vertex matching. Finally, experimental results and discussions are presented in Section 5, followed by conclusions in Section 6.
2. The Two Geospatial Databases and Polygonal Features Used

The two GDBs selected for our study needed to meet certain requirements in order to assert that both GDBs must be independently produced and that neither of these two GDBs, in turn, can be derived from another cartographic product of a larger scale through any process. All this implies that none of them have undergone any action involving a degradation of their quality. In addition, both datasets should be interoperable, which means that we must ensure that they will be comparable both in terms of reference system and cartographic projection [60]. Therefore, we used two official cartographic databases in Andalusia (southern Spain) that meet the aforementioned requirements: the BCN25 (from “Base Cartográfica Numérica E25k”) and the MTA10 (from “Mapa Topográfico de Andalucía E10k”).

The BCN25 dataset is the digital version of the “Mapa Topográfico Nacional” at scale 1:25,000 (MTN25K map series), produced by the “Instituto Geográfico Nacional” (National Geographic Institute of Spain) in 2010 [61]. Its planimetric accuracy has been estimated at between 2.5 and 7.5 m, depending on the type of entity considered. However, several authors estimate its planimetric accuracy at 4.0 m [62]. The MTA10 dataset, known as the “Mapa Topográfico de Andalucía E10K” (Andalusian Topographic Map at scale 1:10,000), is produced by the “Instituto de Estadística y Cartografía de Andalucía” (IECA), and more precisely, the GDB used was produced in 2009. The declared positional accuracy of the MTA10 is RMSE = 3 m [63]. The MTA10 would play the role of “reference source” in the subsequent process of positional accuracy assessment of the BCN25 database.

After describing the GDB, we must define the polygonal features used in our matching methodology. In this case, we selected buildings from urban zones. The use of this type of object is based on the following: (1) buildings are a huge set of polygonal features in any GDB, at least in the number of elements, although perhaps not in area, and (2) the buildings included in any GDB contain a sufficient quantity of geometrical information (vertexes and the edges that join them) for our purpose. Owing to the lack of relevant previous studies, in the initial stages of this study, we realized the need to apply the proposed methodology to a wider population sample. For this reason, the sample comprises nine urban areas selected to meet as far as possible all the variability requirements with regard to the typologies of polygons. These requirements include a wide geographical distribution and an ample variety in the type of buildings and topographical terrain. The urban areas selected are included in three sheets of the MTN50k (National Topographic Map of Spain at scale 1:50,000). More specifically, we used the following urban areas: Mairena de Alcor, Alcalá de Guadaira and Carmona (0985 Sheet); Fuente Vaqueros, Santa Fe and Granada (1009 Sheet); and Coín, Alhaurín de la Torre and Fuengirola (1066 Sheet) (Figure 3a). Figure 3b shows two examples of polygonal features corresponding to buildings belonging to MTA10 and BCN25 (Carmona, 0985 sheet).
Figure 3. (a) Selected sheets of the Mapa Topográfico Nacional (MTN) at the scale of 1:50,000 of the region of Andalusia (southern Spain) and its administrative boundaries and (b) examples of polygonal features belonging to Mapa Topográfico de Andalucía E10K (MTA10) and Base Cartográfica Numérica E25k (BCN25) (Carmona, 0985 sheet).
3. Inter-Elements Matching
3.1. Polygon Low-Level Feature Descriptors The main purpose of object descriptors involves finding discriminative features that describe the object of interest [64]. These features can represent different information levels. In our case, due to the characteristics of the geographic information and to the specific type of objects (two-dimensional polygonal shapes) we worked with, the features used as descriptors to encode the shape of an object belong to low-level information. Following Shih et al. [65], low-level feature descriptors are usually extracted to describe the geometric properties [66], spatial properties [67] and shape distributions [68] of two- and three-dimensional objects. In addition, the similarity between two 2D objects can be measured by comparing their features. In our approach, the GDB used must be able to compute these features efficiently and effectively. For this reason, the descriptors selected were edge-based features and included the following:
• Number of convex and concave angles: Both features are conditioned by edge orientation. This descriptor informs us about the complexity of the polygon’s perimeter.
• Perimeter: This is a feature that is not conditioned by edge orientation, but depends on their lengths. This characteristic should indicate both the complexity of the polygon and some basic shape description in relation to the area.
• Area: This depends both on the orientation of the edges and on their lengths. Despite not being a particularly interesting descriptor with respect to the shape of the element, it has a very important visual weight in the GDB to be considered in the matching process.
• Minimum moment of inertia of second order: This can be computed easily by means of the polygon’s total area, centroid and vertices, which are connected by straight lines in a 2D system. It depends both on the orientation and the lengths of the edges.
• Arkin Graph Area (AGA): For a polygon A, we define the AGA descriptor as the area of the region below the turning function θA. In order to avoid differences of AGA based on the starting points, all AGA are computed starting from the northwestern point in a clockwise direction, and the perimeter distance is normalized to one. In addition, as mentioned in Section 1, this descriptor takes into account the relationships between adjacent boundaries of polygons and depends both on the orientation and the lengths of the edges. Its computation is addressed in Section 4.
• Minimum Bounding Rectangle (MBR): This is specified by coordinate pairs (Xmin, Xmax, Ymin, Ymax). This feature represents the absolute spatial position of the polygon and in the case of geographic information is a key aspect in object characterization.
These descriptors are used as the attributes of each polygon from the first GDB that tries to match another polygon of the second GDB. All of them are used in a normalized form of their difference values, except the minimum bounding rectangle, which is determined as the total area of normalized overlapping rectangles. Finally, we must note that all of these features aim to define, as much as possible, the parameters that humans tend to apply in a manual match of the objects. For example, the minimum bounding rectangle represents the first attempt of a cartographer at identifying homologous objects, that is, objects that share a similar position in 2D space.
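As a concrete illustration, most of the edge-based descriptors above can be computed directly from a polygon's vertex list. The following Python sketch (function and key names are ours, not taken from the paper's implementation) computes the perimeter, area, convex/concave angle counts and MBR for a polygon given as a list of (x, y) vertices; the moment of inertia, AGA and normalization steps are omitted for brevity.

```python
import math

def descriptors(poly):
    """Compute a subset of the low-level edge-based descriptors
    (perimeter, area, convex/concave vertex counts, MBR) for a
    simple polygon given as a list of (x, y) vertices.
    Illustrative sketch only; names are not from the paper's code."""
    n = len(poly)
    perimeter = sum(math.dist(poly[i], poly[(i + 1) % n]) for i in range(n))
    # Shoelace formula: the signed doubled area is positive for
    # counterclockwise vertex order.
    area2 = sum(poly[i][0] * poly[(i + 1) % n][1] -
                poly[(i + 1) % n][0] * poly[i][1] for i in range(n))
    ccw = area2 > 0
    convex = concave = 0
    for i in range(n):
        ax, ay = poly[i - 1]
        bx, by = poly[i]
        cx, cy = poly[(i + 1) % n]
        # Cross product of consecutive edges: its sign relative to the
        # polygon orientation tells convex from concave (reflex) vertices.
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if (cross > 0) == ccw:
            convex += 1
        else:
            concave += 1
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    mbr = (min(xs), max(xs), min(ys), max(ys))  # (Xmin, Xmax, Ymin, Ymax)
    return {"perimeter": perimeter, "area": abs(area2) / 2,
            "convex": convex, "concave": concave, "mbr": mbr}

# A unit square: 4 convex angles, perimeter 4, area 1.
d = descriptors([(0, 0), (1, 0), (1, 1), (0, 1)])
```

For an L-shaped building footprint the same function would report one concave vertex, which is the kind of perimeter-complexity signal the first descriptor is meant to capture.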
3.2. GA Classification of Polygonal Shapes
After characterizing the polygonal shapes, a set of homologous polygons from both GDBs was determined in order to match them. For this purpose, a GA based on real number representation (called Real Coded GA (RCGA)) was used. In an RCGA, a chromosome is a vector of floating point numbers whose size equals the number of proposed descriptors and which represents a solution to the problem. This type of codification has been sufficiently developed by several authors [53,69–72], and for this reason, this issue is not described here.
The conceptual framework for the proposed methodology is shown in Figure 4.
Figure 4. Conceptual framework for the classification process. RCGA, Real Coded GA.
The classification methodology is based on computing a set of weights that must be assigned to the feature sets and then using these weights to confront each object from both GDBs. To achieve this goal, it is necessary not only to establish the capability of the RCGA as a classification tool, but also to demonstrate that it can produce matches of sufficiently good quality by means of a fitness function. The fitness function evaluates the quality of each solution (chromosome) [54]. In our approach, this function evaluates each of the weights assigned to each of the descriptors used to identify pairs of objects, that is, the genes included in the chromosome. The weights evaluate the similarity of the polygon matching over a supervised training database, and the value obtained represents the quality of this solution. Obtaining this database using a supervised training procedure was the first stage of the process.
The training database was composed of 805 polygons, with their respective homologous polygons, that were manually obtained by three different cartographers. The sets of polygons obtained by these three cartographers contained 265, 265 and 275 homologous polygons, respectively. These three sets were independent of each other and different from the final GDB used for the matching. However, the polygons were extracted from GDBs similar to those used in this paper. These two hypotheses, using different cartographers and a different set of GDB polygons, were adopted to avoid any influence from the person selecting the homologous polygons and from the particular polygons used. In addition, the supervised training process was externally evaluated by fifteen reviewers selected from a pool of internationally recognized experts. Their participation in the procedure reveals the robustness of the training process and the quality of the training dataset. Both the details relating to this procedure and the software implemented specifically for carrying it out can be found in [73,74].
The initial population, the set of weights for each descriptor, represents possible solutions to the matching problem and is generated through a process called chromosome initialization [75]. Paying special attention to this process may have a positive effect on the success of the GA. Although there are several alternatives, such as quasi-random sequences and the simple sequential inhibition process, in our case the chromosome initialization was carried out by means of a Pseudo-random Number Generator Algorithm (PNGA). This is because it is particularly well suited to cases where the speed of convergence is high [76], which may happen with the specific type of spatial objects with which we work. In addition, it is fast and easy to use (a detailed description of this process is addressed in Section 5.1). The initial population evolves toward the final solution (iterative process) by developing an intermediate population through mechanisms of selection, crossover and mutation. These mechanisms or requirements also change during the execution of the algorithm. Thus, we prevent the premature convergence of the RCGA to suboptimal solutions. In addition, multiple runs of the RCGA using different initial populations were carried out in order to ensure that the initial selection did not affect the proposed solution. With regard to the size of the initial population, this is related to: (i) the crossover and mutation probabilities (inversely related) and (ii) the number of iterations (called generations). Regarding this last aspect, some authors suggest using a ratio value (size of population/number of generations) close to 1/5 [77]. In our approach, the initial population comprised m chromosomes of seven genes each, with each gene treated as the weight assigned to one of the descriptors. Defining the main mechanism, which operates on an initial population, was the next stage in our approach.
In order to develop an intermediate population, we employed the proportional selection scheme [78], the Blend Alpha (BLX-α) crossover operator [69] and the non-uniform mutation [79]. This choice is justified because the experiments carried out highlight these genetic operators as the most suitable ones for building RCGAs [53]. On the other hand, both the BLX-α crossover operator and the non-uniform mutation are controlled by two parameters (α and β), which determine the exploitation interval in the search space. The values assigned to these parameters depend in large part on the population size. The weight computation was the last stage of our process. This is the final solution to the problem and represents the chromosome that achieved the best results in the evaluation. Having computed the weights (after finalizing the training phase of the RCGA), a classification of the remaining features can be achieved. As mentioned in Section 1.3, this classification allows for the categorization of the matching quality from a quantification of the similarity and consists of weights assigned to each of the polygon descriptors described above. Keeping this in mind, we consider two matched polygons (PA, PB) exactly equal if the Match Accuracy Value (MAV) achieved is equal to one (100%). This value is obtained as a linear combination of the n polygon descriptors (At(PA), At(PB)) and the n weights (W(At)) computed in the training phase (Equation (1)).
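As an illustration of the two operators named above, the following sketch implements BLX-α crossover and a simplified non-uniform mutation. The mutation here scales the perturbation by the full gene range rather than by the distance to each bound, a simplification of Michalewicz's operator, and all parameter values (α, β, bounds) are illustrative rather than those tuned in the paper.

```python
import random

def blx_alpha(parent1, parent2, alpha=0.5):
    """BLX-α crossover: each child gene is drawn uniformly from the
    parents' gene interval extended by alpha on both sides.
    Minimal sketch; parameter values are illustrative."""
    child = []
    for g1, g2 in zip(parent1, parent2):
        lo, hi = min(g1, g2), max(g1, g2)
        span = hi - lo
        child.append(random.uniform(lo - alpha * span, hi + alpha * span))
    return child

def nonuniform_mutation(chrom, gen, max_gen, bounds=(0.0, 1.0), beta=2.0):
    """Simplified non-uniform mutation: the perturbation magnitude
    shrinks as gen approaches max_gen, shifting the search from
    exploration to exploitation. At gen == max_gen the perturbation
    vanishes entirely."""
    lo, hi = bounds
    out = []
    for g in chrom:
        # Perturbation decays with the generation counter.
        delta = (hi - lo) * (1 - random.random() ** ((1 - gen / max_gen) ** beta))
        g2 = g + delta if random.random() < 0.5 else g - delta
        out.append(min(hi, max(lo, g2)))  # clamp to the gene bounds
    return out

random.seed(42)  # reproducible demo
child = blx_alpha([0.2, 0.8], [0.4, 0.6], alpha=0.5)
mutated = nonuniform_mutation([0.5, 0.5], gen=10, max_gen=10)
```

With α = 0.5, the first gene of `child` falls in [0.1, 0.5] and the second in [0.5, 0.9]; at the final generation the mutation leaves the chromosome unchanged, which is exactly the exploitation-over-exploration behavior the β parameter controls.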
$$\mathrm{MAV}(P_A, P_B) = \max \sum_{i=1}^{n} \frac{At_i(P_A)}{At_i(P_B)} \cdot W_{At_i}, \quad P_A \in \mathrm{GDB}_{\mathrm{BCN25}},\ \forall P_B \in \mathrm{GDB}_{\mathrm{MTA10}} \qquad (1)$$
Equation (1) defines two characteristics: the first allows for determining the matching between polygons (PA is selected, then we can choose PB from the second GDB based on the maximum value of the linear combination); the second is the value of the linear combination, which represents the accuracy of the matching, or in other words, the shape similarity measure between two polygons. Thus, polygons having different degrees of similarity will provide non-equal values of MAV ranging from zero to one. Another important consideration for achieving good results is avoiding the acceptance of erroneously-matched polygons (false positives), because such acceptance would also imply the subsequent introduction of mismatched points. For this reason, two matched polygons will be used to compute the matching between points depending on the MAV value achieved, and only if this value exceeds a previously fixed threshold.
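The maximize-then-threshold logic of Equation (1) can be sketched as follows. This is a simplified illustration: descriptors are compared as raw ratios, whereas the actual method normalizes the descriptor differences and treats the MBR as an overlap area, and the 0.85 threshold is an assumed placeholder, not the value fixed in the paper.

```python
def mav(desc_a, desc_b, weights):
    """Match Accuracy Value as a weighted linear combination of
    descriptor ratios, in the spirit of Equation (1). Sketch only:
    the paper normalizes descriptor differences rather than using
    raw ratios."""
    return sum(w * (da / db) for da, db, w in zip(desc_a, desc_b, weights))

def best_match(poly_a, candidates, weights, threshold=0.85):
    """Pick the candidate from the second GDB that maximizes MAV
    against poly_a; reject the pair (return None) when the best MAV
    stays below the fixed threshold, to avoid accepting false
    positives. The threshold value here is illustrative."""
    best = max(candidates, key=lambda c: mav(poly_a, c, weights))
    score = mav(poly_a, best, weights)
    return (best, score) if score >= threshold else (None, score)

# Trained weights (hypothetical) over three descriptors; the second
# candidate is identical to poly_a, so its MAV equals one.
weights = [0.5, 0.3, 0.2]
candidates = [[8.0, 2.0, 8.0], [4.0, 1.0, 4.0]]
match, score = best_match([4.0, 1.0, 4.0], candidates, weights)
```

Because the weights sum to one, an identical pair of polygons yields MAV = 1, matching the paper's reading of the value as a 0-to-1 similarity measure.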
Finally, we must note that an extensive description of the codification of the RCGA employed can be found in [80]. The most powerful feature of this software is that control parameters of selection, crossover and mutation operators can be defined in order to find the most adequate solution for our approach.
4. Intra-Elements Matching (Vertex-to-Vertex Matching)
After having matched the polygonal shapes (buildings) from both GDBs, and in order to achieve a vertex-to-vertex matching, a metric derived from the method defined by Arkin et al. [49] was used. Following these authors, a common representation of the boundary of a simple polygon PA can be obtained by the turning function θ_PA(s). However, we used this function, which measures the angle of the counterclockwise tangent as a function of the arc-length s, in the clockwise direction, measured from some reference point 0 on A’s boundary. Thus, θ_PA(0) is the angle v that the tangent at the reference point 0 makes with some reference orientation associated with the polygon (such as the x-axis). θ_PA(s) keeps track of the turning that takes place, increasing with left-hand turns and decreasing with right-hand turns. If we assume that each polygon is rescaled so that the total perimeter length is one, then θ_PA(s) is a function from (0, 1) to (−4π, 4π) (Figure 5).
The function θ_PA(s) has several properties, which make it especially suitable for identifying and extracting homologous points between two polygons. As mentioned in [49], it is piecewise-constant for polygons, making computations particularly easy and fast. By definition, the function θ_PA(s) is invariant under translation and scaling of the polygon A. Rotation of A corresponds to a simple shift of θ_PA(s) in the θ direction. In addition, we used the area of the region below θ_PA(s) in order to define what we denominated as the Arkin Graph Area (AGA). This area can be easily computed by the value of the integral (Equation (2)) and was used as another polygon shape descriptor in the classification stage. This integral can be obtained numerically by a sum of prisms, because the changes in height only take place at each vertex.
$$AGA(P_A) = \int_{0}^{1} \theta_{P_A}(s)\,ds \qquad (2)$$
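Because θ_PA(s) is piecewise constant over the normalized perimeter, Equation (2) reduces to a weighted sum of the constant pieces. The sketch below builds the turning function from a vertex list and evaluates the AGA; unlike the paper, it starts from the first vertex in the given orientation rather than from the northwestern point in a clockwise direction, so it is illustrative only.

```python
import math

def turning_function(poly):
    """Turning function of a polygon as a list of
    (normalized edge length, cumulative turning angle) pieces.
    The perimeter is normalized to one, so the AGA integral reduces
    to a sum of rectangles. Illustrative sketch: the starting point
    and orientation conventions differ from the paper's."""
    n = len(poly)
    edges = [(poly[(i + 1) % n][0] - poly[i][0],
              poly[(i + 1) % n][1] - poly[i][1]) for i in range(n)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    total = sum(lengths)
    # Angle of the first edge against the x-axis reference orientation.
    theta = math.atan2(edges[0][1], edges[0][0])
    pieces = []
    for i in range(n):
        pieces.append((lengths[i] / total, theta))
        # Signed turn between consecutive edge directions, wrapped
        # to (-pi, pi]: left-hand turns increase theta, right-hand
        # turns decrease it.
        a1 = math.atan2(edges[i][1], edges[i][0])
        a2 = math.atan2(edges[(i + 1) % n][1], edges[(i + 1) % n][0])
        theta += (a2 - a1 + math.pi) % (2 * math.pi) - math.pi
    return pieces

def aga(poly):
    """Arkin Graph Area: Equation (2) evaluated as the sum of the
    piecewise-constant rectangles of the turning function."""
    return sum(width * angle for width, angle in turning_function(poly))

# Unit square traversed counterclockwise from the origin: the turning
# function steps through 0, pi/2, pi and 3*pi/2, each over a quarter
# of the normalized perimeter, so AGA = 3*pi/4.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
value = aga(square)
```

Comparing two such piecewise functions piece by piece is what makes the vertex-to-vertex matching of Section 4 computationally cheap.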