WP 9 Workshop – UPEM Presentation 1
RISIS / Working with geographical data
Geographical concentration of S&T activities
Nanosciences and Nanotechnologies databases
Lionel Villard, Michel Revollo

Goals / Geocoding process / Identifying the areas of aggregation / Uses / Future challenges

Goals and responsibility
Geocoding process
  - Key elements to geocode addresses
  - Data pre-processing
  - Geocoding with postal codes
  - Geocoding with the names of toponyms
  - City identification for Batch Geocode engine
  - Results for patents database
  - Other solutions
Identifying the areas of aggregation
  - Common problems
  - The two propositions for RISIS
  - Main families of algorithms
  - Our approach with two sequential analyses
  - Main advantages of this method
  - Parameters and Java interface
Examples of uses
  - Examples of thresholds
  - Collaborations between clusters
  - Temporal and dynamic characteristics
Future challenges

Main goals

Analyze the distribution of S&T activities (here through patents and publications) and measure aggregation effects by identifying the existing geographical spaces where a high density of activity takes place.
The ambition is to observe clustering effects as they actually occur, rather than through administrative borders, which differ widely between countries.

Responsibility of the data producers at Micro or Meso levels

Because some results (characterizations and indicators) can have a high impact, data producers have a responsibility toward policy makers, firms...
  - At the Macro level: missing information or a wrong assignment can be partially tolerated and is hidden by data aggregation (e.g. group consolidation levels, aggregation at national or continental levels);
  - At Micro or Meso levels: missing information or a wrong assignment can drastically affect the results and the understanding of a phenomenon, and ultimately the decisions based on the analysis (e.g. city or cluster level, subsidiary or laboratory level).

Beginning of the project in 2004

In 2006, none of the available solutions was robust enough or adapted to the variety of address situations.
We chose to develop our own geocoding engine, which should:
  - be adapted to the heterogeneity and specificities of the data sources (addresses in scientific articles and in patents);
  - be able to fill as many blanks as possible;
  - resolve problematic or ambiguous situations.

This requires three steps:
  - extracting the geographical information (toponyms, building names, postal codes...);
  - geocoding this information;
  - building the cluster boundaries to identify the geographical aggregation of S&T activities.

Key elements to geocode addresses

Identifying the name of the city is a key element of the address for the geocoding process.
When information is available at a finer scale, we use dictionaries (postal codes or building names) to geocode the address.
A fast disambiguation of toponyms can be done by identifying region names (states, provinces, prefectures) for:
  - ambiguities on the type of toponym: a city with the same name as the regional level;
  - homonyms: cities with the same name in the same country.

Data pre-processing

Data expansion:
  - first, we extract the addresses remaining in the inventor and applicant names;
  - second, we use external sources of information such as INPI or RegPat (OECD) for patents to extend the coverage of addresses;
  - finally, an internal propagation of addresses fills in information where it is missing. INPADOC families can be used as a reference to propagate information, using a fuzzy comparison of names (inventors or applicants).

Data cleaning (cleaning and identification of the best candidates at each scale):
  - removal of punctuation (except commas);
  - removal of special characters;
  - standardisation of country names or country codes using the ISO 3166-1 alpha-2 standard.
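
The cleaning rules listed above can be illustrated with a minimal Python sketch. It is not the engine described in these slides: the function names, the small `COUNTRY_TO_ALPHA2` lookup, the choice to strip accents, and the choice to replace removed characters with a space are all assumptions made for the example.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical, deliberately tiny lookup; a real pipeline would load
# the complete ISO 3166-1 alpha-2 list.
COUNTRY_TO_ALPHA2 = {
    "FRANCE": "FR",
    "GERMANY": "DE",
    "UNITED STATES": "US",
    "USA": "US",
    "JAPAN": "JP",
}


def clean_address(raw: str) -> str:
    """Remove punctuation (except commas) and special characters."""
    # Assumption: accented letters are reduced to their base letter.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: every character other than an ASCII letter, digit,
    # comma or space is replaced by a space rather than deleted.
    text = re.sub(r"[^A-Za-z0-9, ]+", " ", text)
    # Collapse repeated spaces and normalise spacing around commas.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s*,\s*", ", ", text)
    return text.strip(" ,")


def standardise_country(name_or_code: str) -> Optional[str]:
    """Return an alpha-2 code for a known name or code, else None."""
    key = name_or_code.strip().upper()
    if len(key) == 2 and key in COUNTRY_TO_ALPHA2.values():
        return key
    return COUNTRY_TO_ALPHA2.get(key)


if __name__ == "__main__":
    print(clean_address("Univ. Paris-Est, 77454 Marne-la-Vallée, France"))
    print(standardise_country("France"))
```

Keeping the commas matters because they delimit the sections of an address, which a later parsing step can rely on.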

Parsing addresses

Initial number of addresses for the nanotechnologies database (patents): 2 891 986
After cleaning, selection of the last three comma-separated sections, and grouping: 703 576 distinct addresses

Geocoding: postal codes phase (45.5 %)

  - Pattern detection for the position of the postal code: "/[0-9]{4,}/"
  - Comparison with GeoNames postal codes
  - External resource: GeoNames all_postal_doc (911 346 postal codes)

Geocoding: postal codes phase (45.5 %) (continued)

  - Pattern detection for the position of the potential city name: "/([A-Za-z]{3,50}-?)+/"
  - Comparison with the GeoNames place names of the postal codes
  - External resource: GeoNames all_postal_doc (911 346 place names)

Geocoding: toponym names phase (46.5 %)

  - Geocoding based on toponym names
  - Constraint on the country code
  - Comparison with all the selected toponyms of GeoNames
  - External resource: GeoNames AllCountries_normalise (3 310 006 entities selected among the 8 255 731 toponym names)

Geocoding: toponym names phase (46.5 %) (continued)

Selected feature codes in GeoNames: "ADM2" OR "ADM3" OR "ADM4" OR "ADM5" OR "PPL%" (3 310 006 entities selected among the 8 255 731 toponym names)

Selected | Feature code | Section or code definition | Description
No  | All codes | L: parks, area | All sections of those places
No  | All codes | H: stream, lakes, … | All sections of those places
No  | All codes | R: road, railroad | All sections of those places
No  | All codes | S: spot, building, farm | All sections of those places
No  | All codes | T: mountain, hill, rock | All sections of those places
No  | All codes | U: undersea | All sections of those places
No  | All codes | V: forest, heath, … | All sections of those places
No  | ADM1  | first-order administrative division | a primary administrative division of a country, such as a state in the United States
Yes | ADM2  | second-order administrative division | a subdivision of a first-order administrative division
Yes | ADM3  | third-order administrative division | a subdivision of a second-order administrative division
Yes | ADM4  | fourth-order administrative division | a subdivision of a third-order administrative division
Yes | ADM5  | fifth-order administrative division | a subdivision of a fourth-order administrative division
Yes | PPL   | populated place | a city, town, village, or other agglomeration of buildings where people live and work
Yes | PPLA  | seat of a first-order administrative division | seat of a first-order administrative division (PPLC takes precedence over PPLA)
Yes | PPLA2 | seat of a second-order administrative division |
Yes | PPLA3 | seat of a third-order administrative division |
Yes | PPLA4 | seat of a fourth-order administrative division |
Yes | PPLC  | capital of a political entity |
Yes | PPLCH | historical capital of a political entity | a former capital of a political entity
Yes | PPLF  | farm village | a populated place where the population is largely engaged in agricultural activities
Yes | PPLG  | seat of government of a political entity |
Yes | PPLH  | historical populated place | a populated place that no longer exists
Yes | PPLL  | populated locality | an area similar to a locality but with a small group of dwellings or other buildings
Yes | PPLQ  | abandoned populated place |
Yes | PPLR  | religious populated place | a populated place whose population is largely engaged in religious occupations
Yes | PPLS  | populated places | cities, towns, villages, or other agglomerations of buildings where people live and work
Yes | PPLW  | destroyed populated place | a village, town or city destroyed by a natural disaster, or by war
Yes | PPLX  | section of populated place |

Geocoding: toponym names phase (46.5 %) (continued)

  - Geocoding based on vernacular toponym names
  - Constraint on the country code
  - Identification of the corresponding official name and its coordinates
  - External resource: GeoNames Alter_name (7 137 897 other ways of naming the official toponyms)

Geocoding: missing cities and coordinates (6.5