
RISIS / Working with geographical data

Geographical concentration of S&T activities

Nanosciences and Nanotechnologies databases

Lionel Villard, Michel Revollo

1/50 Goals / Geocoding process / Identifying the areas of aggregation / Uses / Future challenges

• Goals and responsibility
• Geocoding process
  • Key elements to geocode addresses
  • Data pre-processing
  • Geocoding with postal codes
  • Geocoding with the names of toponyms
  • City identification for Batch Geocode engine
  • Results for patents database
  • Other solutions
• Identifying the areas of aggregation
  • Common problems
  • The two propositions for RISIS
  • Main families of algorithms
  • Our approach with two sequential analyses
  • Main advantages of this method
  • Parameters and Java interface
• Examples of uses
  • Examples of thresholds
  • Collaborations between clusters
  • Temporal and dynamic characteristics
• Future challenges

Main goals

Analyzing the distribution of S&T activities (here through patents and publications) and measuring the aggregation effects by identifying the existing geographical spaces where a high density of activity takes place.

The ambition is to look at clustering effects as they happen, rather than through administrative borders, which differ widely between countries.

Responsibility of the data producers at Micro or Meso levels

Because some results (characterizations and indicators) can have a high impact, data producers have a responsibility toward policy makers, firms...

• At the Macro level: missing information or a wrong assignment can be partially tolerated, because it is hidden by data aggregation (e.g. group consolidation levels, aggregation at national or continental levels);
• At Micro or Meso levels: missing information or a wrong assignment can drastically affect the results and the understanding of a phenomenon, and in the end the decisions based on the analysis (e.g. city or cluster level, subsidiary or laboratory level).


Beginning of the project in 2004

In 2006, none of the available solutions was robust enough or adapted to the variety of address situations.

We chose to develop our own geocoding engine, which had to:

• be adapted to the heterogeneity and specificities of the data sources (addresses in scientific articles and in patents);
• be able to fill as many blanks as possible;
• solve some problematic or ambiguous situations.

This involves three steps:

• extracting the geographical information (toponyms, building names, postal codes...);
• geocoding this information;
• building the cluster boundaries to identify the geographical aggregation of S&T activities.

Key elements to geocode addresses

Identifying the name of the city is a key element of the geocoding process. When information is available at a finer scale, we use dictionaries (postal codes or building names) to geocode the address.

A fast disambiguation of the toponyms can be done by identifying the names of regions (states, prefectures...) for:
• ambiguities on the type of toponym: a city with the same name as the regional level;
• homonyms: cities with the same name in the same country.

Data pre-processing

Data expansion:
• first, we extract the addresses remaining in the inventor and applicant names;
• secondly, we use external sources of information, such as INPI or RegPat (OECD) for patents, to extend the coverage of addresses;
• finally, addresses are propagated internally to fill in records where they are missing. INPADOC families can be used as a reference to propagate information, using a fuzzy comparison of names (inventors or applicants).

Data cleaning and identification of the best candidates at each scale:
• punctuation removal (except commas);
• removal of special characters;
• standardisation of country names or country codes using the ISO 3166-1 alpha-2 standard.
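The cleaning rules listed above can be sketched in a few lines of Python. This is only an illustration, not the authors' engine: the function names, the decision to keep hyphens (so that hyphenated city names survive) and the tiny country lookup table are assumptions made for the example.

```python
import re

# Hypothetical lookup; a real pipeline would load the full ISO 3166-1 list.
COUNTRY_TO_ALPHA2 = {
    "united states": "US",
    "usa": "US",
    "germany": "DE",
    "france": "FR",
    "south korea": "KR",
}

def clean_address(raw: str) -> str:
    """Remove punctuation and special characters, keeping commas.

    Assumption: hyphens are also kept, since city names may contain them.
    """
    text = re.sub(r"[^\w\s,\-]", " ", raw)   # drop everything but letters, digits, commas, hyphens
    text = re.sub(r"\s*,\s*", ", ", text)    # normalise spacing around commas
    return re.sub(r"\s+", " ", text).strip() # collapse repeated whitespace

def to_alpha2(country: str):
    """Map a country name or code to an ISO 3166-1 alpha-2 code, or None."""
    value = country.strip()
    if len(value) == 2 and value.isalpha():
        return value.upper()
    return COUNTRY_TO_ALPHA2.get(value.lower())
```

For instance, `clean_address("Univ. Paris-Est (ESIEE);  Noisy-le-Grand, FRANCE!")` keeps the comma and the hyphens but drops the other punctuation.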

Parsing addresses:
• initial number of addresses for the nanotechnologies database (patents): 2 891 986;
• after cleaning, selection of the last three comma-separated sections, and grouping: 703 576 distinct addresses.
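The parsing step (keep the last three comma-separated sections, then group identical results) might look like the sketch below. The function names and the upper-casing used to group addresses are assumptions, and the sample addresses are invented for the example.

```python
from collections import Counter

def last_sections(address: str, n: int = 3) -> str:
    """Keep the last n non-empty comma-separated sections of an address."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    return ", ".join(parts[-n:])

def group_addresses(addresses):
    """Count how many raw addresses share the same trailing sections.

    Assumption: grouping is case-insensitive, hence the upper-casing.
    """
    return Counter(last_sections(a).upper() for a in addresses)

# Invented sample addresses, for illustration only.
sample = [
    "ESIEE Paris, 2 bd Blaise Pascal, 93160, Noisy-le-Grand, FR",
    "IFRIS, 93160, Noisy-le-Grand, FR",
    "Grenoble, FR",
]
groups = group_addresses(sample)
```

Here the first two raw addresses collapse into a single distinct address, which is the effect that reduces the 2 891 986 initial addresses to 703 576 distinct ones.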

Geocoding: postal codes phase (45.5%)

• Pattern detection for the position of the postal code: "/[0-9]{4,}/"
• Comparison with GeoNames postal codes

External resource: GeoNames all_postal_doc (911 346 postal codes)

Geocoding: postal codes phase (45.5%)

• Pattern detection for the position of the potential city name: "/([A-Za-z]{3,50}-?)+/"
• Comparison with the GeoNames place names of the postal codes

External resource: GeoNames all_postal_doc (911 346 place names)
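The two patterns quoted on these slides can be tried out as follows. The regular expressions are the ones given in the text; everything else (function names, the one-entry lookup table standing in for the GeoNames postal code file, the approximate coordinates) is assumed for the sake of a runnable sketch.

```python
import re

# Patterns quoted in the slides.
POSTAL_CODE = re.compile(r"[0-9]{4,}")
CITY_NAME = re.compile(r"([A-Za-z]{3,50}-?)+")

# Illustrative stand-in for the GeoNames postal code table:
# (country code, postal code) -> (place name, latitude, longitude)
POSTAL_TABLE = {("FR", "38000"): ("Grenoble", 45.19, 5.72)}

def find_candidates(address: str):
    """Return (section index, text) pairs for postal code and city name candidates."""
    sections = [s.strip() for s in address.split(",")]
    postal = [(i, m.group()) for i, s in enumerate(sections)
              for m in POSTAL_CODE.finditer(s)]
    cities = [(i, m.group()) for i, s in enumerate(sections)
              for m in CITY_NAME.finditer(s)]
    return postal, cities

def geocode_by_postal_code(address: str, country: str, table=POSTAL_TABLE):
    """Look up the first postal code candidate that exists in the table."""
    postal, _ = find_candidates(address)
    for _, code in postal:
        hit = table.get((country, code))
        if hit:
            return hit
    return None
```

Note that, as written, the city pattern requires at least three letters per hyphen-separated segment, so a name such as Noisy-le-Grand is not matched as a single token.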

Geocoding: toponym names phase (46.5%)

• Geocoding based on toponym names
• Constraint on the country code
• Comparison with all the selected toponyms of GeoNames

External resource: GeoNames AllCountries_normalise (3 310 006 entities selected among the 8 255 731 toponym names)

Geocoding: toponym names phase (46.5%)

Selected feature codes in GeoNames: "ADM2" OR "ADM3" OR "ADM4" OR "ADM5" OR "PPL%" (3 310 006 entities selected among the 8 255 731 toponym names)

Selected | Feature code | Section or code definition | Description
---------|--------------|----------------------------|------------
No  | All codes of class L | parks, areas | All sections of those places
No  | All codes of class H | streams, lakes, ... | All sections of those places
No  | All codes of class R | roads, railroads | All sections of those places
No  | All codes of class S | spots, buildings, farms | All sections of those places
No  | All codes of class T | mountains, hills, rocks | All sections of those places
No  | All codes of class U | undersea | All sections of those places
No  | All codes of class V | forests, heaths, ... | All sections of those places
No  | ADM1 | first-order administrative division | a primary administrative division of a country, such as a state
Yes | ADM2 | second-order administrative division | a subdivision of a first-order administrative division
Yes | ADM3 | third-order administrative division | a subdivision of a second-order administrative division
Yes | ADM4 | fourth-order administrative division | a subdivision of a third-order administrative division
Yes | ADM5 | fifth-order administrative division | a subdivision of a fourth-order administrative division
Yes | PPL | populated place | a city, town, village, or other agglomeration of buildings where people live and work
Yes | PPLA | seat of a first-order administrative division | seat of a first-order administrative division (PPLC takes precedence over PPLA)
Yes | PPLA2 | seat of a second-order administrative division |
Yes | PPLA3 | seat of a third-order administrative division |
Yes | PPLA4 | seat of a fourth-order administrative division |
Yes | PPLC | capital of a political entity |
Yes | PPLCH | historical capital of a political entity | a former capital of a political entity
Yes | PPLF | farm village | a populated place where the population is largely engaged in agricultural activities
Yes | PPLG | seat of government of a political entity |
Yes | PPLH | historical populated place | a populated place that no longer exists
Yes | PPLL | populated locality | an area similar to a locality but with a small group of dwellings or other buildings
Yes | PPLQ | abandoned populated place |
Yes | PPLR | religious populated place | a populated place whose population is largely engaged in religious occupations
Yes | PPLS | populated places | cities, towns, villages, or other agglomerations of buildings where people live and work
Yes | PPLW | destroyed populated place | a village, town or city destroyed by a natural disaster, or by war
Yes | PPLX | section of populated place |
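The selection rule stated above ("ADM2" OR "ADM3" OR "ADM4" OR "ADM5" OR "PPL%") amounts to a simple predicate on the GeoNames feature code. A minimal sketch, with an assumed record layout of (name, feature code, country code) and a few sample rows chosen for illustration:

```python
SELECTED_ADMIN_CODES = {"ADM2", "ADM3", "ADM4", "ADM5"}

def is_selected(feature_code: str) -> bool:
    """True for ADM2..ADM5 and for every code starting with PPL (the 'PPL%' rule)."""
    return feature_code in SELECTED_ADMIN_CODES or feature_code.startswith("PPL")

# Sample records in an assumed (name, feature code, country code) layout.
records = [
    ("Paris", "PPLC", "FR"),
    ("Île-de-France", "ADM1", "FR"),
    ("Seine", "STM", "FR"),
    ("Isère", "ADM2", "FR"),
]
selected = [name for name, code, _ in records if is_selected(code)]
```

ADM1 entries are deliberately excluded here, consistent with the "No" in the table; regional names are used for disambiguation rather than as geocoding targets.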

Geocoding: toponym names phase (46.5%)

• Geocoding based on the vernacular names of toponyms
• Constraint on the country code
• Identification of the corresponding official name and its coordinates

External resource: GeoNames Alter_name (7 137 897 alternate names for the official toponyms)

Geocoding: missing cities and coordinates (6.5% of new city names)

• General address structure for 35 countries: [Postal code, Regional code/Country code, City]
• Extraction of the city names inside address elements

Geocoding: missing cities and coordinates (6.5% of new city names)

• Specific rules for 8 countries with a regional division inside the address
• Use of the regional division (state...) as a pivot string to extract the city names inside address elements

Geocoding: missing cities and coordinates (6.3%)

• All remaining addresses without coordinates are submitted to the web application BatchGeocode.
• It adds 210 009 coordinates; after applying filters on accuracy, this represents 6.3% of all addresses.

Selected | Accuracy value | Description | Nb addresses (patents)
---------|----------------|-------------|-----------------------
    | 0  | Unknown accuracy | 18 351
    | 1  | Country level accuracy | 4 297
    | 2  | Region (state, province, prefecture, etc.) level accuracy | 21 566
yes | 3  | Sub-region (county, municipality, etc.) level accuracy | 6 514
yes | 4  | Town (city, village) level accuracy | 162 493
yes | 5  | Post code (zip code) level accuracy | 210
yes | 6  | Street level accuracy | 8 959
yes | 7  | Intersection level accuracy | 26
yes | 8  | Address level accuracy | 35
yes | 9  | Premise (building name, property name, shopping center, etc.) level accuracy | 5 909
    | 10 | Manual search (Google Maps and reverse geobatch) | 0
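The accuracy filter can be expressed as a small predicate. The counts below are copied from the table; the set of retained levels (3 to 9) is an assumption based on how the "yes" marks read in the flattened table, and the function names are hypothetical.

```python
# Number of patent addresses per BatchGeocode accuracy level (from the table).
ACCURACY_COUNTS = {
    0: 18351, 1: 4297, 2: 21566, 3: 6514, 4: 162493, 5: 210,
    6: 8959, 7: 26, 8: 35, 9: 5909, 10: 0,
}

# Assumption: levels 3 (sub-region) to 9 (premise) are the ones marked "yes".
SELECTED_LEVELS = set(range(3, 10))

def keep_result(accuracy: int) -> bool:
    """Keep a geocoding result only if its accuracy level is selected."""
    return accuracy in SELECTED_LEVELS

def count_kept(counts) -> int:
    """Total number of addresses whose accuracy level passes the filter."""
    return sum(n for level, n in counts.items() if keep_result(level))

# Addresses returned with a known accuracy (levels 1 to 10).
known_accuracy = sum(n for level, n in ACCURACY_COUNTS.items() if level >= 1)
kept = count_kept(ACCURACY_COUNTS)
```

The sum over levels 1 to 10 reproduces the 210 009 coordinates quoted above, which suggests that this figure counts every result with a known accuracy, before filtering.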


Geocoding : a focus on priority patents

Starting from the initial application (the first filing, or priority patent, of the family), which sets out the main features of an invention, it is not unusual for the patent to be completed over time (technological extension) or extended to other patent offices (geographical extension) to protect the invention in new markets. The resulting collection of patents is called a patent family.

As a proxy to understand the spatial distribution of activities, we chose to focus on priority patents that are not singletons (those with at least one extension).

Geocoding: results for priority patents

• The geocoding process is validated by an intensive manual check and by using a gold dataset of geocoded addresses.
• The share of geocoded addresses is high; depending on the patent office, some addresses are still missing.

Inventors' addresses for non-singleton priority patents:

Harmonized country | Inventors' addresses | % of all addresses geolocalized | Filled addresses | Geolocalized addresses | % of filled addresses geolocalized
-------------------|----------------------|---------------------------------|------------------|------------------------|-----------------------------------
Total for the 108 countries | 195 143 | 67.5% | 134 477 | 131 771 | 98.0%
UNITED STATES (US) | 90 974 | 97.8% | 89 121 | 88 952 | 99.8%
DE | 27 068 | 15.1% | 4 293 | 4 075 | 94.9%
KR | 18 595 | 3.9% | 898 | 722 | 80.4%
FR | 17 599 | 97.4% | 17 254 | 17 146 | 99.4%
JP | 6 684 | 80.1% | 5 606 | 5 353 | 95.5%
TW | 6 260 | 37.3% | 2 527 | 2 336 | 92.4%
CN | 4 082 | 8.5% | 426 | 349 | 81.9%
CANADA (CA) | 3 473 | 66.3% | 2 406 | 2 301 | 95.6%
GB | 2 713 | 37.6% | 1 106 | 1 020 | 92.2%
ES | 2 709 | 15.8% | 635 | 428 | 67.4%
NL | 1 719 | 82.4% | 1 427 | 1 417 | 99.3%
CH | 1 610 | 76.8% | 1 257 | 1 237 | 98.4%
BE | 1 591 | 81.0% | 1 300 | 1 289 | 99.2%
RU | 1 525 | 21.5% | 585 | 328 | 56.1%
IT | 1 374 | 66.7% | 1 146 | 917 | 80.0%
IN | 723 | 87.4% | 647 | 632 | 97.7%

Geocoding: results for scientific publications

The geocoding process is easier with publications: the share of geocoded addresses is high.
• Good completion of addresses: the addresses are of good quality, with toponyms (cities, states...) and postal codes.
• Good coverage of addresses for authors: most scientific journals provide their authors' addresses.

Authors' addresses:

Harmonised country | Number of addresses | Geocoded addresses | %
-------------------|---------------------|--------------------|---
Total for all the 166 countries | 2 176 376 | 2 153 142 | 98.9%
UNITED STATES (US) | 471 352 | 471 322 | 100.0%
CHINA (CN) | 268 630 | 268 488 | 100.0%
JAPAN (JP) | 216 934 | 215 834 | 99.5%
GERMANY (DE) | 138 001 | 137 994 | 100.0%
FRANCE (FR) | 109 136 | 109 118 | 100.0%
SOUTH KOREA (KR) | 101 996 | 85 863 | 84.2%
UNITED KINGDOM (GB) | 83 113 | 83 103 | 100.0%
ITALY (IT) | 73 211 | 73 203 | 100.0%
INDIA (IN) | 57 754 | 57 700 | 99.9%
TAIWAN (TW) | 56 723 | 56 723 | 100.0%
RUSSIA (RU) | 49 725 | 49 599 | 99.8%
SPAIN (ES) | 49 060 | 49 054 | 100.0%
CANADA (CA) | 43 182 | 43 182 | 100.0%
BR | 31 320 | 31 319 | 100.0%
AU | 30 501 | 28 327 | 92.9%
NETHERLANDS (NL) | 26 337 | 26 335 | 100.0%

Other geocoding solutions for RISIS

Google Geocoding API
• Users of the free API: 2 500 requests per 24-hour period; 5 requests per second.
• Google Maps API for Work customers: 100 000 requests per 24-hour period; 10 requests per second.

Gisgraphy
• Uses GeoNames and OpenStreetMap data.
• Web service limits depend on the subscription: 30 qpm (queries per minute) for 80 €/month; 60 qpm for 150 €/month; 120 qpm for 250 €/month; 300 qpm for 300 €/month.
• Open source; dump files and databases to install Gisgraphy on a local server (all data, GeoNames and OpenStreetMap): 600 € for all countries, 400 € for the US only.

Geopy
• geopy is a Python package that provides interface classes for geocoder services: OpenStreetMap Nominatim, ESRI ArcGIS, Google Geocoding API (V3), Baidu Maps, Bing Maps API, Yahoo! PlaceFinder, Yandex, IGN France, GeoNames, NaviData, OpenMapQuest, OpenCage, SmartyStreets, geocoder.us, and GeocodeFarm.

• Pelias: https://github.com/pelias/pelias
• Photon (Komoot): https://github.com/komoot/photon (uses the Nominatim database extracted from OpenStreetMap)


Common problems

• Dataset-specific geographical distribution: each dataset has its own geographical distribution, depending on the subject studied;
• Country-specific boundaries: each country has its own administrative boundaries and statistical geographical scales.

Two main propositions in RISIS

• Administrative-based approaches: starting from a common definition of what an urban area is, these methods take into account population concentration and some other demographic characteristics, and merge administrative units (mostly specific to each country). The boundaries produced can be used to map several variables in different contexts, and to characterise the new areas uniformly (OECD, 2012).
• Bottom-up approaches: these project the geographical information of the field studied and build boundaries based on the specific geographical distribution of the data (IFRIS/ESIEE). They capture the local geographical concentration of the activities.

Main families of algorithms used in geographical clusters analysis


A quick overview : algorithms and geocoded data

Main families of algorithms that can be used with geographical data (M. Ouattara, 2010):

• Hierarchical methods: the algorithm successively groups objects together or divides them into subgroups;
• Partition-based methods (such as k-means): objects are assigned to groups whose members share some characteristics;
• Density-based algorithms: the boundaries of the clusters are built strictly by analyzing the geographical distribution of the activities.

A method based on a combination of two sequential approaches

1 / Identification of the initial clusters with a density-based algorithm (DBScan, 1996) that is able to identify the areas where the activities are concentrated. The clusters are defined by two parameters fixed before the calculation: all points of a cluster are surrounded by at least X points in a circle with a diameter of Y km.

Where are the areas in which activity is the most intense?

DBScan (Density-Based Spatial Clustering of Applications with Noise, M. Ester, HP. Kriegel, J. Sander & X. Xu, 1996)

[Figure: DBScan illustration with MinPts=3 or MinPts=4. Points A are core points. Points B and C are density-reachable from A, and thus density-connected, and belong to the same cluster. Point N is a noise point that is neither a core point nor density-reachable.]

Main advantages
• does not require the number of clusters to be specified a priori;
• can find arbitrarily shaped clusters, even a cluster completely surrounded by (but not connected to) a different cluster;
• has a notion of noise, and is robust to outliers.

Main disadvantages
• not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster;
• the quality of the clustering depends on the distance measure ("curse of dimensionality" for high-dimensional data);
• cannot cluster data sets well when there are large differences in densities.
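To make the density criterion concrete, here is a compact, unweighted DBSCAN over latitude/longitude pairs. It is a sketch of the textbook algorithm, not the Java program presented in this deck: `eps_km` is the neighbourhood radius of standard DBSCAN, `min_pts` the minimum number of neighbours, and the great-circle (haversine) distance and the sample coordinates are assumptions of the example.

```python
from math import asin, cos, radians, sin, sqrt

NOISE = -1

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def dbscan(points, eps_km, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force region query; the point itself is included.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels

# Illustrative coordinates: three points near Paris, three near Grenoble, one isolated.
points = [(48.85, 2.35), (48.86, 2.35), (48.85, 2.36),
          (45.19, 5.72), (45.18, 5.73), (45.20, 5.72),
          (0.0, 0.0)]
labels = dbscan(points, eps_km=5.0, min_pts=3)
```

The brute-force neighbour search is quadratic in the number of points; for millions of addresses a spatial index would be needed. The method described in this deck also uses a minimal weight in addresses rather than a plain point count, which this sketch does not model.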


2 / In a second step, we compare two different dimensions of the relation between the initial clusters:

2.1 How intense are the relations between the initial clusters (less than 100 km between the centroids)? RI / Relative Interconnectivity

2.2 Will the final cluster have a collaboration profile similar to that of the two initial clusters taken separately (to avoid large variations in the density of links in the final cluster)? RC / Relative Closeness

CHAMELEON method (Karypis, Han & Kumar, 1999)

A cluster i is defined by:
• Ti: the number of nodes (with different geographical coordinates);
• Ei: the links between these nodes;
• Ci: the value of these links (the value of a link is the number of collaborations between the two nodes it connects).

The relations between two clusters i and j:
• E(i,j): the number of links between these two clusters;
• C(i,j): the total number of collaborations supported by these links.

[Figure: two initial clusters (Cluster DBScan 1 and Cluster DBScan 2) drawn as graphs of nodes connected by weighted links]

Relative Interconnectivity (measures the coherence of connectivity between clusters)
RI is the ratio between the total number of collaborations between the two clusters (C(i,j)) and the average number of internal collaborations of the two clusters.

$$RI_{(i,j)} = \frac{C_{(i,j)}}{\dfrac{C_i + C_j}{2}}$$

Relative Closeness (measures the similarity of the collaboration profiles of the two clusters)
The relative closeness between two clusters (RC(i,j)) is the ratio between the absolute closeness of the two clusters (the ratio between the total collaborations observed between the two clusters and the number of links between them) and the average internal closeness of the two clusters (weighted by the number of nodes of the two clusters, Ti + Tj).

$$Cl_i = \frac{C_i}{E_i}, \qquad Cl_{(i,j)} = \frac{C_{(i,j)}}{E_{(i,j)}}, \qquad RC_{(i,j)} = \frac{Cl_{(i,j)}}{\dfrac{T_i}{T_i + T_j} \times Cl_i + \dfrac{T_j}{T_i + T_j} \times Cl_j}$$
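The two ratios can be computed directly from the quantities defined above (Ti, Ei, Ci for each cluster, and E(i,j), C(i,j) for the pair). The sketch below is an illustration with invented numbers; the `Cluster` class, the function names and the merge rule (merge when both RI and RC exceed their thresholds) are assumptions, and clusters without internal links are not handled.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    T: int    # number of nodes (distinct coordinates)
    E: int    # number of internal links
    C: float  # total number of internal collaborations

def relative_interconnectivity(ci: Cluster, cj: Cluster, c_ij: float) -> float:
    """RI(i,j) = C(i,j) / ((Ci + Cj) / 2)."""
    return c_ij / ((ci.C + cj.C) / 2)

def closeness(collaborations: float, links: int) -> float:
    """Average number of collaborations per link."""
    return collaborations / links

def relative_closeness(ci: Cluster, cj: Cluster, c_ij: float, e_ij: int) -> float:
    """RC(i,j) = Cl(i,j) / (Ti/(Ti+Tj) * Cl_i + Tj/(Ti+Tj) * Cl_j)."""
    total_nodes = ci.T + cj.T
    internal = (ci.T / total_nodes) * closeness(ci.C, ci.E) \
             + (cj.T / total_nodes) * closeness(cj.C, cj.E)
    return closeness(c_ij, e_ij) / internal

def should_merge(ci, cj, c_ij, e_ij, ri_min, rc_min) -> bool:
    """Assumed rule: merge two initial clusters when both ratios exceed their thresholds."""
    return (relative_interconnectivity(ci, cj, c_ij) > ri_min
            and relative_closeness(ci, cj, c_ij, e_ij) > rc_min)

# Invented example: two clusters linked by 3 links carrying 6 collaborations.
a = Cluster(T=4, E=5, C=20)
b = Cluster(T=6, E=5, C=10)
ri = relative_interconnectivity(a, b, c_ij=6)
rc = relative_closeness(a, b, c_ij=6, e_ij=3)
```

With these numbers RI is 6 / 15 = 0.4, and RC is 2 / (0.4 × 4 + 0.6 × 2) = 2 / 2.8, roughly 0.71.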

Main advantages of this method

The combination of a purely geometric approach (a fixed perimeter to measure a density) and the analysis of the relations between spaces makes it possible to build cluster boundaries:

at a micro scale (clusters in a specific part of a town) as well as at a macro scale (at a regional level for worldwide comparisons).

Example of local parameters selected for the databases on nanotechnologies (publications and patents)

Parameters for DBScan and Chameleon:

Criteria | Publications | Patents
---------|--------------|--------
Relative Interconnectivity | 0.28 | 2
Relative Closeness | 0.32 | 12.50
Minimal weight (addresses) | 1 500 | 1 250
Maximum distance | 25 km | 20 km
Number of addresses analysed | 2 129 107 | 1 523 093
Number of initial clusters (after DBScan) | 313 | 186
Number of final clusters (after Chameleon) | 295 | 155

[Figures: RI (publications) and RC (publications)]

A program in Java with an interface (Michel Revollo, 2014)

Inputs: CSV files exported from the SQL database:
• files with the addresses for each source (with temporal information, if needed);
• one file with all the geographical coordinates.

Outputs: CSV files (clustered coordinates and RI/RC values).

The software, including the code of the algorithms, is freely available on the RISIS GitHub account:

https://github.com/risis-eu/geoclust



Patents: R = 20 km, minimum weight (MinPoids) = 1 000 addresses (RI > 0.3 and RC > 20)

Patents: R = 20 km, minimum weight (MinPoids) = 1 000 addresses (RI > 0.1 and RC > 20)

Scientific publications

With a common definition of the geographical unit of analysis, it is possible to characterize and compare the different spaces where the activities are concentrated.

For example, in terms of collaboration between continents or inside countries (networks of co-authors): hubs, centrality, openness...

Or to characterise clusters: link between size and rate of growth (A. Delemarle et al., 2009)


Future challenges for RISIS concerning the geographical aspects and the concentration of the activities

Concerning the geocoding process:
• Improving the geocoding process so that it can work with large datasets from heterogeneous sources (of other facilities).
• Applying a concave hull algorithm to delineate the perimeters of the clusters, and storing the polygons in a spatial database.

Concerning the OECD FUA:
• Do we want to apply the OECD method to other countries that are not covered by the current Functional Urban Areas?
• How can the Urban Area names be used to label the clusters created in our datasets (when there is an overlap)?

Estimating the overlap between our method and the OECD urban area boundaries:
• Where are the perfect overlaps?
• Are there any OECD urban areas that are split by our method?
• Are there new places that emerge outside the OECD urban area boundaries in the covered countries?

Future challenges: 4 demonstrators on spatial dimensions for RISIS

• Characteristics and dynamics of R & I of metropolitan areas in Europe (in association with the OECD, see http://measuringurban.oecd.org)
• What polarization processes associated with European funding and 'cross-border'
• Functional spatial distribution and ranking of higher education in Europe
• The role of 'urban areas' in the growth of new technology firms