<<

International Journal of Geo-Information

Article A Century of French Railways: The Value of Remote Sensing and VGI in the Fusion of Historical Data

Robert Jeansoulin

CNRS UMR 8049 LIGM, Université Gustave-Eiffel, Cité Descartes, 5 Descartes, 77454 Marne-la-Vallée, ; [email protected]

Abstract: Providing long-term data about the evolution of railway networks in Europe may help us understand how European Union (EU) member states behave in the long-term, and how they can comply with present EU recommendations. This paper proposes a methodology for collecting data about railway stations, at the maximal extent of the French railway network, a century ago.The expected outcome is a geocoded dataset of French railway stations (gares), which: (a) links gares to each other, (b) links gares with French communes, the basic administrative level for statistical information. Present stations are well documented in public data, but thousands of past stations are sparsely recorded, not geocoded, and often ignored, except in volunteer geographic information (VGI), either collaboratively through Wikipedia or individually. VGI is very valuable in keeping of that heritage, and remote sensing, including aerial photography is often the last chance to obtain precise locations. The approach is a series of steps: (1) meta-analysis of the public datasets, (2) three- steps fusion: measure-decision-combination, between public datasets, (3) computer-assisted geocoding for ‘gares’ where fusion fails, (4) integration of additional gares gathered from VGI, (5) automated quality control, indicating where quality is questionable. These five families of methods, form a   comprehensive computer-assisted reconstruction process (CARP), which constitutes the core of this paper. The outcome is a reliable dataset—in geojson format under open license—encompassing (by Citation: Jeansoulin, R. A Century of January 2021) more than 10,700 items linked to about 7500 of the 35,500 communes of France: that is French Railways: The Value of 60% more than recorded before. This work demonstrates: (a) it is possible to reconstruct transport Remote Sensing and VGI in the data from the past, at a national scale; (b) the value of remote sensing and of VGI is considerable Fusion of Historical Data. ISPRS Int. J. Geo-Inf. 2021, 10, 154. https:// in completing public sources from an historical perspective; (c) data quality can be monitored all doi.org/10.3390/ijgi10030154 along the process and (d) the geocoded outcome is ready for a large variety of further studies with statistical data (demography, density, space coverage, CO2 simulation, environmental policies, etc.). Academic Editor: Wolfgang Kainz Keywords: historical GIS; railway network; meta-analysis; geo-information fusion; information Received: 27 December 2020 revision; crowdsourcing open data; volunteer geographic information; VGI; Wikipedia geo-information Accepted: 6 March 2021 extraction; spatial data quality; data consistency checking Published: 10 March 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in 1. Introduction published maps and institutional affil- Transport infrastructure for people and goods is as old as urban civilizations, e.g., iations. Via Appia, Aemilia, Aurelia, etc. have helped in structuring the European landscape for centuries. From 1920 to 2020, the evolution of road and rail networks is a well known element of their competition [1–3]. Today climate challenges reignite that century-old debate:for instance the European Union (EU) transportation white paper [4] notes that Copyright: © 2021 by the author. in 2010 rail yielded11 million tons of CO2, versus road: 191 m tons, but regrets “ ... that Licensee MDPI, Basel, Switzerland. only limited data are available ... ” for measuring the impact of measures by member states. This article is an open access article Linking railway data with socio-eco-demographic data in historical geographic information distributed under the terms and systems (GIS), would contribute to understanding long-term trends. This approach has conditions of the Creative Commons been developed by Siebert [5], and, in Europe by Gregory et al. [6], and Morillas-Torné [7] Attribution (CC BY) license (https:// with a very similar goal, and uses several European public datasets, which record main rail creativecommons.org/licenses/by/ 4.0/). lines (no secondary rail lines).

ISPRS Int. J. Geo-Inf. 2021, 10, 154. https://doi.org/10.3390/ijgi10030154 https://www.mdpi.com/journal/ijgi ISPRS Int. J. Geo-Inf. 2021, 10, 154 2 of 37

This paper describes a method for building a digital representation of all the ‘stations’ (we use the French term ‘gare’) that have existed, between the maximum extent of the French railway network (1920s) and now, a century later. The goal is to deliver a dataset to answer questions such as: Does a commune possesses a gare? How far from the closest gare? Also, the aim is to give a comparison at different dates. For that we need: (a) to link gares to each other along their ‘rail line’ (we use French term ‘ligne’), (b) to link every gare to a ‘city or town’: we use the French term ‘commune’, the smallest French administrative level at which most basic statistical data are collected. There is no digital dataset providing all French gares at the maximum extent of the network, and even their number is unknown. There are two kinds of source: public records (open data from French public institutions: SNCF, Insee, IGN over 2014–2020), and a lot of volunteered geographic information (VGI), either collective through Wikipedia, or individual, providing sparse data about old gares (even simple stops) and lignes. Sections2 and3 present materials and methods, respectively for public and VGI sources, with the goal to “geocode” all the gares, in a reliable, controllable way; as “points- of-interest” multi-source matching is presented in Li et al. [8]. Multi-source datasets are expected to show discrepancies between names (toponym changes), as investigated by Tian-Lan and Longley [9], and locations (existence and geocod- ing) of each individual gare. In order to resolve discrepancies, target constraints are formu- lated, and a schema is proposed. Each source is scrutinized by a “meta-analysis”[10,11], in order to identify what may conflict the constraints. Then public datasets are inputted in a fusion process, following Bloch [12], which more precisely is an asymmetric revision, as in Benferhat et al. [13], in a background similar to Reichgelt [14]. Issues raised by VGI have been addressed, e.g., by Johnson and Iizuka [15] successfully combining OpenStreetMap features and Landsat image classification; by Younghoon et al. [16] integrating several crowdsourcing sets in a single graph,or possibly developing collab- orative platform such as FeatureHub [17]; or multiple platforms as in Juhász et al. [18]. The specific goal assigned to VGI in this paper is to gather data for revising some already geocoded gares (whose quality is questioned), and mostly for integrating more gares, not recorded in public datasets. Chen et al. [19] have developed CrowdFusion, a framework aiming at data refinement (a hard non-polynomial task), while Gouvëa et al. [20], and Hastings [21], have focused on toponym conflation and gazetteer-based geo-referencing. Section3 describes routines querying Wikipedia for coordinates of gares, and mostly to parsing railway-oriented “Infobox”[22]. There are some inspiringpapers about how to populatethem [23], or to extract information based on common-sense rules [24]. An Infobox is a very valuable means to obtain at once the information about a ligne, the other lignes it connects with, and the gares it encompasses. The parsing of the information that complies with target constraints is one important contribution of the present work, because, beyond French peculiarities, railway semantics is shared by most countries. Next, Section3 describes several ways in querying various VGI sources, mostly about gares on secondary lignes (e.g., the popular rural tramways in France, between WWI and WWII). The last opportunity to resolve failed geocoding issues, is to exploit remote-sensing images, old aerial photographs, and contemporary or 20th-century maps. Issues of enacting data quality within a data collection process have been addressed by Vasseur et al. [25], who evaluates the quality in the context of a target use, and [26,27], two reviews about quality control in crowdsourcing, by Daniel et al. and Senaratne et al. The mix of manual and software steps presented above, combines into what we name the computer-assisted reconstruction procedure (CARP). Finally, the resulting CARP dataset is cleaned up by an assisted data quality control.

2. Materials and Methods. Part 1: Public Data, Revision and Control 2.1. Shaping the Expected Target 2.1.1. Target Goal, Constraints and Initial Schema The target should allow us to answer questions such as: ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 3 of 38

2. Materials and Methods. Part 1: Public Data, Revision and Control

ISPRS Int. J. Geo-Inf. 2021, 10, 154 2.1. Shaping the Expected Target 3 of 37 2.1.1. Target Goal, Constraints and Initial Schema The target should allow us to answer questions such as: -- ForFor each each of of the the French French communecommunes,s, how how far far from from its its center center is is the the closest closest garegare? ? -- BetweenBetween closest closest garesgaresofof communescommunes AA and and B, B, what what is is the the closest closest path path (kilometers)? (kilometers)? -- ComparingComparing today today and and a a century century ago. ago. For that purpose, we need the minimal information to linking gares with communes,, and to to linking linking garesgares withwith ligneslignes. That. That minimal minimal information information can be can aligned be aligned with withsome somecom- ponentscomponents of the of standards the standards for transportation for transportation networks networks developed developed by the by European the European infra- structureinfrastructure INSPIRE INSPIRE [28] or [28 the] or Open the Open geograph geographicic consortium consortium OGC OGC[29]. Four [29]. classes Four classes seem theseem minimum the minimum required: required: gare, communegare, commune, ligne, ,andligne a, node and aclassnode thatclass allows that allows one gare one togare be at-to tachedbe attached to several to several lignes lignes(a “junction(a “junction” gare).” gare). The UML-class diagram (Figure 11)) illustratesillustrates thethe relationshipsrelationships andand theirtheir cardinalities,cardinalities, between these these classes, classes, and and so someme additional additional constraints. constraints.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 4 of 38

FigureFigure 1. 1.Conceptual Conceptual targettarget schemaschema forfor communescommunes, gares, lignes andand nodesnodes(primary(primary key: key: bold bold font). font).

- validToThe “belongs_to” “belongs_to” = year of endrelationships relationships of service of of Figure a Figure ligne 1,1(hence, are are very very of all important: important: its gares). A ligne can be disused • partlyThe garegare (%––commune communeof full length) relationshiprelationship at different is is straightforward, straightforward, dates, definitively once once all all(French: garesgares areare déclassement geocoded; geocoded; ) or re- • versibly.The garegare–– Mostligneligne relationshiprecentrelationship year is is is used the the challenge: challenge:(INSPIRE: how howvalidTo to to find). find old old ligneslignes andand all all their their garesgares. . AdditionalAdditional attributes spatial spatial (useful constraint constraint for display (contains (contains purposes,) controls) controls quality that that a garecontrol): a gare ‘belongs_to’‘belongs_to’ a commune a commune, and, -(and inherits_one (labelinherits_one = )name states) statesgiven that at thatto leasta gare at leastone (should node one nodebegives ingives bijection its coordinates its coordinates with uic to) orthe to a garenode the gare;(what (what about about the -otherthe othernom ones, = ones,toponym if any? if any? This for Thisais commune anissue is anissue studied (when studied usedin the for in sequel). the a gare sequel)., may differ from the label); - capaSome = target targetuse status, concepts concepts e.g., matching‘cargo matching only’, some some ‘borde INSPIRE INSPIREr point’, ones ones etc. (Figure (Figureused for 2),2), displaye.g., e.g., purposes. - node → RailwayNode || RailwayStationNode; - gare → [RailwayNode(West), RailwayStationNode, RailwayNode(East)]; - km (point kilométrique) → linear referencing, - uic (Union Internationale des Chemins de fer, the Worldwide Railway Organization (UIC) code) → RailwayStationCode, etc.

2.1.2. Target Attributes for Classes Commune, Gare, Ligne Main attributes (mandatory) are: - insee = [primary key] official unique code for a commune, (may evolve with time); - num = [primary key] official number for a ligne, or any given unique code.; - uic = [primary key] official RailwayStationCodefor a gare, given by the UIC, the inter- national railway organization, or any given unique code; - km = linear referencing, French:”point kilométrique”, German: “Streckenkilometer”; - the couple (num, km) is [primary key] for a gare (not two equal km for one num); Geographical attributes (mandatory): - point = geocode of a node, and for the associated gare; - polygon = contour of a commune; - geometry in the sequel, is used to denote either point, or polygon.

Time attributes (mandatory, only one by now): FigureFigure 2.2.an an illustrationillustration fromfrom the INSPIRE railway ne networktwork standard standard (excerpted (excerpted from from [28]). [28]).

Summing up the constraints that must be satisfied: - node → RailwayNode || RailwayStationNode;

(1).- Everygare → gare[RailwayNode(West), belongs to one existing RailwayStationNode, ligne: num is mandatory RailwayNode(East)]; (see Section 3); (2).- Akm couple (point (num kilom, kmétrique) must) →be linearunique; referencing, (3). A couple (num, label), must be unique (see Section 2.3.1 and Appendix B); (4). The subset of all nodes of a same ligne (= num) must be strictly ordered by km; (5). Every gare belongs to one commune, whose polygon contains the garepoint. Sometimes, purposely, a gare has been built at the border of two communes (only one is recorded); (6). Every ligne has two ends: an origin node, and a terminus node. One ligne either is “ac- tive”, or has a “validTo” (alias “end”) date; (7). Every gare has one or more node(s), but a node may have no gare (e.g., a fork); Constraints (1–4) are checked a priori (go/no-go), also a posteriori for quality control. Constraint (5) can be checked a posteriori, setting, or controlling the insee value. Constraint (6): a failure must trigger a search for additional information (Section 3). Constraint (7): several nodes attached to the same gare, may differ in geometry, what is a major cause of indecision in the fusion process.

2.1.3. Target Notation and Generic Step-by-Step Approach for Collecting Values The overall objective is to gather as many gares as possible that have ever existed. Therefore, the process starts with an initial dataset, then grow it, in a logical consistent way, by adding complementary or supplementary information from more available data. To add a gare, means to create a node and to fill in attribute data step by step (Table 1), starting with the knowledge that a gare named label belongs to the ligne identified by num. In general, information about a ligne contains a list of the gares in between origin and ter- minus, which allows to set km values in correct order, although approximate in kilometers.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 4 of 37

- uic (Union Internationale des Chemins de fer, the Worldwide Railway Organization (UIC) code) → RailwayStationCode, etc.

2.1.2. Target Attributes for Classes Commune, Gare, Ligne Main attributes (mandatory) are: - insee = [primary key] official unique code for a commune, (may evolve with time); - num = [primary key] official number for a ligne, or any given unique code.; - uic = [primary key] official RailwayStationCodefor a gare, given by the UIC, the inter- national railway organization, or any given unique code; - km = linear referencing, French:”point kilométrique”, German: “Streckenkilometer”; - the couple (num, km) is [primary key] for a gare (not two equal km for one num); Geographical attributes (mandatory): - point = geocode of a node, and for the associated gare; - polygon = contour of a commune; - geometry in the sequel, is used to denote either point, or polygon. Time attributes (mandatory, only one by now): - validTo = year of end of service of a ligne (hence of all its gares). A ligne can be disused partly (% of full length) at different dates, definitively (French: déclassement) or reversibly. Most recent year is used (INSPIRE: validTo). Additional attributes (useful for display purposes, quality control): - label = name given to a gare (should be in bijection with uic) or a node; - nom = toponym for a commune (when used for a gare, may differ from the label); - capa = use status, e.g., ‘cargo only’, ‘border point’, etc. used for display purposes. Summing up the constraints that must be satisfied: (1). Every gare belongs to one existing ligne: num is mandatory (see Section3); (2). A couple (num, km) must be unique; (3). A couple (num, label), must be unique (see Section 2.3.1 and AppendixB); (4). The subset of all nodes of a same ligne (= num) must be strictly ordered by km; (5). Every gare belongs to one commune, whose polygon contains the garepoint. Sometimes, purposely, a gare has been built at the border of two communes (only one is recorded); (6). Every ligne has two ends: an origin node, and a terminus node. One ligne either is “active”, or has a “validTo” (alias “end”) date; (7). Every gare has one or more node(s), but a node may have no gare (e.g., a fork); Constraints (1–4) are checked a priori (go/no-go), also a posteriori for quality control. Constraint (5) can be checked a posteriori, setting, or controlling the insee value. Constraint (6): a failure must trigger a search for additional information (Section3). Constraint (7): several nodes attached to the same gare, may differ in geometry, what is a major cause of indecision in the fusion process.

2.1.3. Target Notation and Generic Step-by-Step Approach for Collecting Values The overall objective is to gather as many gares as possible that have ever existed. Therefore, the process starts with an initial dataset, then grow it, in a logical consistent way, by adding complementary or supplementary information from more available data. To add a gare, means to create a node and to fill in attribute data step by step (Table1 ), starting with the knowledge that a gare named label belongs to the ligne identified by num. In general, information about a ligne contains a list of the gares in between origin and terminus, which allows to set km values in correct order, although approximate in kilometers. ISPRS Int. J. Geo-Inf. 2021, 10, 154 5 of 37

Table 1. Step-by-step generic approach for adding a gare into the reconstructed network.

(a). LABEL: the gare with that name belongs to the ligne identified by num, which has at least an originand a terminus; If working from a given list of gares: to start at origin. (b). KM: if unknown, can be interpolated between adjacent gares, or between origin, terminus, using the length of the ligne. The hard assumption is about the strict order of the list; (c). Point[LONGITUDE,LATITUDE]: coordinates to be obtained by software or visual inspection. The geocoding, and its quality, is extensively addressed in the sequel. (d). If working from a list: to iterate with next gare in list, until reaching terminus.

Attribute uic, is either given and the node is linked to this gare, or derived from (num, km). Attribute insee, is either given or derived from geometry of gare and,commune. Attributes capa, info are merely informative (voidable). The baseline goal is to build a dataset, compliant with the above target constraints. Constraints are flexible enough for semi-structured VGI (Section3), while allowing consistency checking at every stage, for every source, and for the result as well. Connectivity is not explicit in that model, but can be retrieved from the existence of several couples (num, uic) for a same uic, which denotes a “junction-gare”, or by the explicit identification of a fork between 2 lignes (see Discussion and AppendixB).

2.2. Public Open Data Sources, and the Information Revision Problem In a project it is customary to define a baseline, which encompasses: goal, constraints (listed above), resources (listed below), estimated cost (time to target goal). Baseline resources are versions (2020, 2017, 2014) of a public datasets originating from SNCF-Réseau (French national operator) [30], somewhat similar to the target schema. Baseline cost is the cost per gare times the number of expected gares. The cost for one gare, depends on the steps outlined in Table1. The number of expected gares, has not been found in the literature: for instance Auphan [31] estimates the maximum length of the network at 70,000 km in 1920, but does notmention a number of gares or communes impacted at that time. An estimation can use length and density of the network, e.g., 24,000 km today, serving 3029 “gares du réseau français”[32], gives a gare every 7.92 km. That inter-gare distance was lower in 1920, with stops in every crossed commune: a sampling on already reconstructed data gives about 6 km. Therefore, the number can be up to 12,000 gares.

2.2.1. Description of the Available Public and Administrative Sources The public datasets that provide gare and node information (no explicit difference), are listed in Table2, with data schema, version, and availability. The public datasets about communes and lignes, listed in Table3, are used for joining information (insee, num) and for display purposes.

Table 2. Public data sources for stations: ‘gare’ or ‘node’ features.

Sncf2020: “Réseau ferré de France”, last update: 2020, accessed: 8 March 2020 url: https://ressources.data.sncf.com/explore/dataset/liste-des-gares/ Schema (geojson) and counts (4148 features, with geometry): {code_ligne, libelle, fret, voyageurs, code_uic, pk, departemen, commune} +{geometry} Sncf2017, orSNCF2: “Réseau ferré de France”, last update: 2017 url: (authors copy) not available anymore since end 2019, Schema (=SNCF2020) and counts (7702 features, only 6812 with geometry): Sncf2014, orSNCF1: “Réseauferré de France”, last update: 2014, accessed: 8 March 2020 https://data.gouv.fr/fr/datasets/gares-ferroviaires-de-tous-types-exploitees-ou-non/ Schema (CSV) and counts (6442 features, all attributes: 6442): [code_ligne, nom, nature, latitude (WGS84), longitude (WGS84)] ISPRS Int. J. Geo-Inf. 2021, 10, 154 6 of 37

Table 3. Public data sources for cities and towns: ‘commune’, and railway lines: ‘lignes’.

Insee-new: official codes for commune (+2003–2021 administrative changes) url: https://www.insee.fr/fr/information/2028028 CSV, counts (35,589 features, 2577 changes in 2018, no geometry) Insee-geo: commune contours (curated by G. David), accessed: 8 March 2020 url: https://france-geojson.gregoiredavid.fr/ geojson, counts (35,798 communes contours with geometry) Lignes:“SNCF Réseau ferré de France”, created: (before 2017), accessed: 8 March 2020 https://ressources.data.sncf.com/explore/dataset/formes-des-lignes-du-rfn/ Schema (geojson) and counts (1779 features, with geometry): {libelle, code_ligne, mnemo}+{geometry}

2.2.2. Matching Attributes of the Public Sources with Target Attributes Among the attributes, there are some direct, or some indirect matches: • Target.num: matches SNCF*|Lignes:code_ligne, primary key for lignes. Mandatory; • Target.label: matches SNCF1:nom, and SNCF2*:libelle; • Target.km: matches SNCF2*:pk. Mandatory. Order must reflect the strict order of the nodes on the ligne; • Target.uic: matches SNCF2*:code_uic. In SNCF1, uses Target.label instead; • couples SNCF1:(code_ligne, nom) and SNCF2:(code_ligne, pk), are used as primary key for nodes what requires to check its uniqueness; • Target.capa: derived from SNCF:nature, SNCF*:fret,voyageurs or Lignes:mnemo, whichever available; • Target.geometry: matches SNCF1:(latitude,longitude), SNCF2*:geometry:point, • Target.insee: present in Insee-geo or Insee-new datasets, is primary key for com- munes, and must be retrieved for a gare, by identifying which commune verifies Point_in_Polygon(geometry(gare), geometry(commune)); • Target.nom: matches SNCF2:commune, would rather be retrieved via Target.insee;

2.3. Meta-Analysis of Public Sources, Building a Similarity Measure, and Corrections The various Sncf versions provide scarce metadata [28], only about their lineage: how to check if a dataset complies with the target constraints? the only way is to perform a “meta- analysis”, a technique using “individual participant data” (IPD-MA), mainly developed in medicine [9,10]. The math behind meta-analysis uses variance-covariance matrices. In this application, “individual participants” are nodes of gares, and data are qualitative: hence, only equality can be checked, leading to true/false values. Therefore mathematics boils down to simple counts of items sharing equal attribute values. The meta-analysis is applied directly on the individual items of each dataset, and proves useful in the subsequent fusion. Counting single attribute occurrences performs anunivariate meta-analysis, counting joint occurrences by a couple of attributes performs a multivariate meta-analysis. This subsection is devoted to detailing what issues are revealed by the meta-analysis, helping to design the steps of a fusion approach. For aquick read, skip directly to Section 2.4 to discover the fusion, and return later to understanding the reasons why.

2.3.1. Meta-Analysis of Each Dataset, and Comparison This is performed by the metaAnalysis routine (AppendixA). • Step 1: univariate meta-analysis with num and label values. Result: each node has a num and a label value, in both 2014 and 2017, what fulfills the target first requirement. All 6442 nodes have a geometry in 2014, but only 6812 nodes have a geometry in 2017. • Step 2: multivariate meta-analysis with uic, label, and (uic-label) values: 2017 only. There are 6813 different label, but 6817 different (uic-label) values, what means that 4 couples (=label, 6=uic) are denoting a toponym ambiguity, e.g.,: "Cernay". Ambiguity ISPRS Int. J. Geo-Inf. 2021, 10, 154 7 of 37

is inherent to toponymy (in most countries): this is a well known issue [20,21]. Several mismatching examples and an explanation are proposed in AppendixB for what is named the “toponym-pattern”. • Step 3: multivariate meta-analysis with (label-num) and (uic-num) couples In both 2014 and 2017, 6 nodes have equal values (=label, =num) or (=uic, =num), meaning data duplication e.g.,“Saintes” (cf. routine disDuplicate in AppendixA). • Step 4: inspect the geometry of Step3 couples: if a same label or uic occurs in several nodes, it denotes a junction-gare (=uic, 6=num). Associated geometryare expected to be equal: the computed deviation is broken down into five distance intervals (Table4).

Table 4. Breakdown of distances between the multiple nodes of gares.

Intervals (nb. Nodes) [0, 150 m] [150, 300] [300, 1 km] [1 km, 2 km] [2 km, ∞] “Close Enough” “Rather Far” SNCF2014 (642 nodes/6442) 550 48 32 6 6 SNCF2017 (1360 nodes/7702) 541 184 375 157 103

The choice of the values (150 m, 300 m, 1 km, 2 km) is explained in Section 2.4. In 2014, among 642 nodes, 93% are “close enough”, and 1% “rather far”. For instance, above 2 km, the “-” case, is named a “fork-pattern”, for the closure of a ligne has “moved” the node to a remote location at the last fork on that ligne. An analysis of the “fork-pattern” is proposed in AppendixB. In 2017, among 1360 nodes, 53% are “close enough”, but 635 are not (47%). The cause of so many discrepancies is that more “fork-patterns” are taken into account in 2017. It’s impossible to choose the right geometry without extra information, moreover, the wrong location is not an error, but a meaningful node: the solution is to mark it (Section 2.4). • Step 5: inspect the cardinal of Step3 couples, for every num value: thisgives the number of nodes per ligne (Table5).

Table 5. Breakdown of lignes according to their number of nodes.

Datasets and Link Attribute Isolated Gare 2-Gares Ligne 3 to 7-Gares 8 to 15-Gares 16 to 30-Gares 31 to Max Gares (Total) SNCF2017 num 107 76 245 171 91 44 734 SNCF2014 num 94 62 191 130 76 40 593

A ligne with one gare denotes an “isolated” gare, which is a conflicting constraint. Most of these errors correspond to lignes now closed to traffic (fork-pattern again), which requires extra information, not available at that stage.

2.3.2. Conclusion of the Meta-Analysis The meta-analysis has been applied to the two main public datasets providing gare information: the result is a better understanding of what data must be handled, which helps designing a fusion approach to merging data from 2014 and 2017. In particular, it helps in understanding that the “toponymy-patter” and the “fork-pattern” constitute hurdles in the fusion process, and will require additional information from external VGI sources. The meta-analysis can be used as a quality control tool at any time at any stage of the reconstruction process, and will be used to control the result (Section4). Concerning Sncf2014 and Sncf2017, some metadata can be summarized (Table6). ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 8 of 38

Table 6. Initial metadata after meta-analysis of the individual datasets. ISPRS Int. J. Geo-Inf. 2021, 10, 154 8 of 37 METADATASncf2014: { “mime”:”csv”, “year”:”2014”, “nbfeatures”:6442, “schema#”: {“label”:6442, “num”:6442, “nature”:6442, “geometry”:6442}, (1) “unos#”: {“label”:6098, “num”:593},”duos#”{“num-label”:6436} (2) } Table 6. Initial metadata after meta-analysis of the individual datasets. METADATASncf2017: { “mime”:”geojson”, “year”:”2017”, “nbfeatures”:7702, “schema#”METADATA: {“label”:7702,Sncf2014: { “mime”:”csv”, “uic”:7702, “year”:”2014”, “num”:7702, “nbfeatures”:6442, “km”:7702, “capa”:7702, “schema#”: {“label”:6442, “num”:6442, “nature”:6442, “geometry”:6442}, (1) “geometry” 6812}, “unos#”: {“label”:6098,: “num”:593},”duos#”{“num-label”:6436} (2) “unos#”} : {“label”:6813, “uic”:6817, “num”:734} } METADATASncf2017: { “mime”:”geojson”, “year”:”2017”, “nbfeatures”:7702, 1 schema 2 unos duos ( ) notation“schema#”: {“label”:7702,# stands for: “uic”:7702, count of all “num”:7702, occurrences “km”:7702, of properties. “capa”:7702, ( ) # and “geometry”:6812}, #: count of unique“unos#”: value {“label”:6813, occurrence for “uic”:6817, some properties “num”:734} or couple of properties. } (1) notationThe outcome schema# standsof the for: meta-analysis count of all occurrences of the initial of properties. datasets, (2) unos# is that and 890 duos#: nodes count are of uniquenot geo- value coded,occurrence and for 635 some otherhave properties a or poor couple geometry of properties. (Table 4), and would rather be avoided in the fusion process. The outcome of the meta-analysis of the initial datasets, is that 890 nodes are not 2.4.geocoded, Combining and the 635 Available otherhave Sources: a poor The geometryFusion/Revision (Table Problem4), and would rather be avoided in theVersions fusion process. Sncf2014 and Sncf2014 probably have a same origin, then evolved separately. Sncf2017 has more attributes, notably the km, and uic, and a few more geocoded features 2.4. Combining the Available Sources: The Fusion/Revision Problem (6812 versus 6442) although we do notknow which version would provide the best quality. TheVersions expectationSncf2014 is thatand Sncf2014Sncf2014 probablyand Sncf2017 have, aeither same origin,confirm then each evolved other separately.on some nodesSncf2017, or addhas up more unique attributes, nodes (Figure notably 3).the km, and uic, and a few more geocoded features (6812The versus relational 6442) approach, although wealthough do notknow natural which with tabular-like version would datasets, provide has the been best refuted quality. by theThe meta- expectation analysis, which is that revealedSncf2014 thatand 1525Sncf2017nodes have, either poor confirm or no geometry, each other and on somethat labelnodes is, or not add a reliable up unique attribute.nodes (Figure3).

A: keep SNCF 2014 B: SNCF 2017 extend

Fusion: if OK(2014,2017)

FigureFigure 3. 3.MatchingMatching Sncf2014Sncf2014 andand Sncf2017Sncf2017. . The relational approach, although natural with tabular-like datasets, has been refuted Let’s consider instead the fusion approach, following Bloch [12]: by the meta- analysis, which revealed that 1525nodes have poor or no geometry, and that label is not a reliable attribute. DEFINITION. Fusion of information consists of combining information originating from several Let’s consider instead the fusion approach, following Bloch [12]: sources in order to improve decision making.

DEFINITION. Fusion of information consists of combining information originating from several More precisely, let usconsider the context of Revision, or reverse-Updating [13] for we sources in order to improve decision making. aim at reconstructing the past. Revising Sncf2017 by Sncf2014, means to compute 𝑂𝐾(𝑥, 𝑦) 𝑥 𝑦 𝑥 in orderMore to undertake precisely, “pairing” let usconsider (2014 the) contextwith a of (Revision2017), then, or reverse-Updatingmerging attributes[13] forwith we 𝑦 𝑥 𝑦 𝑥 thoseaim atof reconstructing, or “impairing” the past., if no Revising fits, Sncf2017then addingby Sncf2014 into Sncf2017, means, toafter compute some conver-OK(x, y) sion.in order All “impaired” to undertake 𝑦 of “pairing” Sncf2017 xremain(2014 )unchanged. with a y (2017 ), then merging x attributes with thoseFollowing of y, or “impairing” Bloch, three xsteps, if no arey fits, requ thenired: adding measure,x into decision,Sncf2017 combination., after some conversion. All “impaired” y of Sncf2017 remain unchanged. 2.4.1. KnowledgeFollowing Bloch,Representation three steps and are Measure required: of measure,Similarity decision, combination.

2.4.1. Knowledge Representation and Measure of Similarity Consider a declarative approach based on the notion of neighborhood V(x) of a node x, and a measure of membership to that neighborhood:|y∈V(x)|, degree of similarity of another node y to x.

The meta-analysis tells us that, once duplicates and ambiguities are solved, a dataset is complying with the target constraints, and has an attribute space into which to de- ISPRS Int. J. Geo-Inf. 2021, 10, 154 9 of 37

fine “neighborhoods”, and to build a “similarity” measure (Table7), which is a function OK : (x, y) → [0, 1] , where 0 can be interpreted as false and 1 as true.

Table 7. The similarity measures membership to V(x). Dissimilarity measures non membership.

|y∈V(x)| = |similarNumLabel(x,y)–normalizedShortDistance(x,y)| (a) similarNumLabel(x, y): (b) (num(y) = num(x)) ∧ (slug(label(y)) = slug(label(x))) → 1 or 0 normalizedShortDistance(x, y): let d = distanceGreatCircleInMeter(geometry(y), geometry(x)); (c) if (d ≤ D1) → 1; elseif (D1 < d ≤ D2) → (D2 − d)/(D2-D1); elseif (d > D2) → 0 |y∈/V(x)|: |¬similarNumLabel(x,y)–normalizedGreatDistance(x,y)| (a) ¬similarNumLabel(x, y): (b) (num(y) 6= num(x)) ∨ (slug(label(y)) 6= slug(label(x))) → 1 or 0 normalizedGreatDistance(x, y): let d = distanceGreatCircleInMeter(geometry(y), geometry(x)); (c) if (D4 ≤ d) → 1; elseif (D3 < d ≤ D4) → (D4 − d)/(D4 − D3); elseif (D3 > d) → 0

Let usdescribe that similarity/dissimilarity measure components: (a) and (a) quantify num, label, geometry for both x and y, into the [0, 1] interval; (b) is qualitative (true/false: 1/0). Checking equality on slug(label) rather than on label; (c) uses threshold values D1and D2,(c) uses D3 and D4, whichmeans that dissimilarity is not the complement of the similarity measure. The initial step is to seek, for each 2017 nodey, the closest 2014 nodex:

distance(x, y) = mindistance(z, y), z∈B

to form a pair of “closest nodes”, using the GreatCircle algorithm, whose precision is suffi- cient in this application, in the range [0, 100 km]. This routine is part of Annex A. Table8 gives the distance breakdown for “direct pairs” = pairs of closest nodes with the same label, and “reverse pairs”, with the closest 2017 counterpart of each 2014 node.

Table 8. Breakdown by distance intervals of (2014, 2017) direct pairs; and (2014, 2017) reverse pairs.

Intervals [0, 50 m] [50, 100] [100, 150] [150, 200] [200, 250] [250, 300] [300, 1 km] [1 km, ∞] direct positive 4618 1082 375 161 113 72 321 58 reverse positive 4501 1072 348 135 70 48 129 18

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 10 of 38 The histogram of the distances grouped by 50 m intervals (Figure4), provides a clue about choosing relevant thresholds:

FigureFigure 4. Breakdown 4. Breakdown by by 50 50m m distance distance intervals:intervals: typi typicalcal shape shape and and best best inflection inflection position. position.

• the number of directpairs (left) decreases rapidly up to 150 m (88%), and slows down after 300 m (93%); and for reversepairs (right): 92% at 150 m, 96% at 300 m; • both kinds of pair are less than 1.5% after 1 km; and only a few after 2 km.

Therefore the choice for D1–D2 is 150–300 m, and 1–2 km or D3–D4.

2.4.2. Decision to Include a Node in the Similarity Neighborhood

To include x in class Ci according to the absolute majority (or threshold rule):

𝑥 ∈ 𝐶 𝑖𝑓 |𝑦 ∈𝑉(𝑥),𝑦| ∈𝐶 >  ∗ |𝑉| , 𝑤𝑖𝑡ℎ:  ∈ ]0,1[ α=% means: at least % of the nodes of V(x) must be in Ci The inferential aspect of the declarative approach [14] ensures that: (a) rules are independent of the actual knowledge in the knowledge base; (b) rules are not restricted to the conceptual schema representation. The choice to “pair” x and y is made by comparing y∈V(x) to some threshold, causing to keep x and y whose distance is α% between D1 and D2: x and yare “positive” candidates to fusion. The uncertainty is that some of them can be “false positives”. Similar nodes, whose dist≥ D4, are tagged “supposed negative” (SN). The choice to “impair” an x, is to consider, using D3 and D4, that xcannotbe compared to any y, and is candidate to be “integrated” to the final result. However, we may get “false negatives”. Dissimilar x whose dist ≤ D1, are tagged “supposed positive” (SP). About “undecided” (und) nodes, the choice is between: (1) to reduce the D2–D3 gap, even to D3=D2, whichreduces the number of und, but increases the number of both false negatives and false positives; or (2) to obtain only a few FNs and FPs, and many undecid- ed. The overall tagging decision is sketched in Figure 5, with different possible tags. Corresponding results for direct and reverse revision, are provided in Table 9 (a,b), where signs + and ¬ denote respective use of similarity or dissimilarity measure. Unde- cided cases (und) are grey cells. Counts for the direct revision do not include the nodes with no geometry.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 10 of 37

• the number of directpairs (left) decreases rapidly up to 150 m (88%), and slows down after 300 m (93%); and for reversepairs (right): 92% at 150 m, 96% at 300 m; • both kinds of pair are less than 1.5% after 1 km; and only a few after 2 km.

Therefore the choice for D1–D2 is 150–300 m, and 1–2 km or D3–D4.

2.4.2. Decision to Include a Node in the Similarity Neighborhood

To include x in class Ci according to the absolute majority (or threshold rule):

x ∈ Ci i f |y ∈ V(x), y ∈ Ci| > α ∗ |V| , with : α ∈ [0, 1] α= % means: at least % of the nodes of V(x) must be in Ci The inferential aspect of the declarative approach [14] ensures that: (a) rules are independent of the actual knowledge in the knowledge base; (b) rules are not restricted to the conceptual schema representation. The choice to “pair” x and y is made by comparing |y∈V(x)| to some threshold, causing to keep x and y whose distance is α% between D1 and D2: x and y are “positive” candidates to fusion. The uncertainty is that some of them can be “false positives”. Similar nodes, whose dist≥ D4, are tagged “supposed negative” (SN). The choice to “impair” an x, is to consider, using D3 and D4, that xcannotbe compared to any y, and is candidate to be “integrated” to the final result. However, we may get “false negatives”. Dissimilar x whose dist ≤ D1, are tagged “supposed positive” (SP). About “undecided” (und) nodes, the choice is between: (1) to reduce the D2–D3 gap, even to D = D , whichreduces the number of und, but increases the number of both false ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 3 2 11 of 38 negatives and false positives; or (2) to obtain only a few FNs and FPs, and many undecided. The overall tagging decision is sketched in Figure5, with different possible tags.

ℎ𝑖𝑔ℎ𝑙𝑦 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑝𝑟𝑜𝑏𝑎𝑏𝑙𝑦 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑢𝑛𝑑𝑒𝑐𝑖𝑑𝑒𝑑 𝑝𝑟𝑜𝑏𝑎𝑏𝑙𝑦 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒 ℎ𝑖𝑔ℎ𝑙𝑦 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒

0 𝐷1 ...... 𝐷2 𝐷3 ...... 𝐷4 ∞ ↓ ↓ ↓ ↓ ↓ Similar. AP PP und und SN Dissimi. SP und und PN AN

FigureFigure 5. Decision5. Decision diagram diagram depending depending on the distance on the in dist fusion-candidateance in fusion-can pairs of nodesdidate. Notations: pairs of APnodes:”assumed. Notations: positive”, PPAP:”possible:”assumed positive”, positive”,AN:”assumed PP:”possible negative”, positive”,PN:”possible AN negative”,:”assumedSP:”supposed negative”, positive”, PN:”possibleSN:”supposed negative”, negative”, undSP:”undecided”.:”supposed positive”, SN:”supposed negative”, und:”undecided”. Corresponding results for direct and reverse revision, are provided in Table9 (a,b), Table 9. (a) From 6979 direct pairs (only those withwhere geometry signs amon + andg¬ 7702).denote (b) respective From 6442 use reverse of similarity pairs (all or dissimilaritywith geometry). measure. Unde- cided cases (und) are grey cells. Counts for the direct revision do not include the nodes with no geometry.(a) intervals 0–D1 = 150 m. D1–D2= 250 m. D2–D3= 1000 m. D3–D4= 2000 m. D4–∞. direct + 6070 AP: 87.2%Table 9. (a) From 353 6979PP: direct5.0% pairs (onlytotal those undecided with geometry among 7702). (b) From 6442 reverse pairs9 SN (all: with 0.1% geometry). direct ¬ 59 SP: 0.8%  450 und: 6.4% (a) 18 PN: 0.2% 25 AN: 0.4% intervals 0–D1 = 150 m.(b) D1–D2 = 250 m. D2–D3 = 1000 m. D3–D4 = 2000 m. D4–∞. direct + 6070 AP: 87.2% 353 PP: 5.0% total undecided ← 9 SN: 0.1% intervals 0–D1 = 150 m.direct ¬ D1–D2= 250 m.59 SP : 0.8% D2–D3= 1000 m. → D3–D4504= und2000: 6.4% m. 18 PN: 0.2% D4–∞. 25 AN: 0.4% reverse + 5944 AP: 91.9% 251 PP: 3.9% total undecided (b)  4 SNs: 0.3% intervals 0–D = 150 m. D –D = 250 m. D –D = 1000 m. D –D = 2000 m. D –∞. 1 1 2 2 3 3 4 4 reverse ¬ 46 SP: 0.7%reverse + 5944 AP: 91.9%163 und 251 PP: 2.5%: 3.9% 4 PNtotal: undecided0.2% ← 30 AN: 0.8%4 SNs : 0.3% reverse ¬ 46 SP: 0.7% → 163 und: 2.5% 4 PN: 0.2% 30 AN: 0.8% The role of the fusionTags routine (Appendix A), is to set the tag value (AP, AN, PP, PN, SP, SN or und), of everyThe role node of the and fusionTags to store routineit together (Appendix with Athe), is closest to set the counterpart tag value ( AP node, AN , PP, PN, SP, SN or und), of every node and to store it together with the closest counterpart node and distance of that pair, resulting in: and distance of that pair, resulting in: 2017-node: f + {“tag”:tag(f), “dist”:dist(f,g′), “closest”:link_to-2014(g′)} 2014-node: g + {“tag”:tag(g), “dist”:dist(g,f′), “closest”:link_to-2017(f′)} dist(a,b) measures the distance between node from A and its closest B counterpart.

Then a fusion sort is made into three classes, fusionable, integrable or undecided: • fusionable: means that the node and its counterpart (closest is ok) are supposed to in- form the same real gare and we can proceed with the fusion of their data; • integrable: means that the node has no counterpart (closest is irrelevant) and can be kept as is (if in the right set), or adapted for integration (if from the other set). Using f for 2017and g for 2014, the decision follows the rules: - direct revision: (𝑡𝑎𝑔(𝑓) =𝐀𝐏) (𝑡𝑎𝑔(𝑓) =𝐏𝐏 𝑑𝑖𝑠𝑡(𝑓,𝑔′) 𝐷) 𝑓𝑢𝑠𝑖𝑜𝑛𝑎𝑏𝑙𝑒(𝑓), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦: 𝑜𝑘

(𝑡𝑎𝑔(𝑓) =𝐏𝐏 (𝐷 𝑑𝑖𝑠𝑡(𝑓,𝑔′) 𝐷))  𝑓𝑢𝑠𝑖𝑜𝑛𝑎𝑏𝑙𝑒(𝑓), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦: 𝑚𝑒𝑑𝑖𝑢𝑚

(𝑡𝑎𝑔(𝑓) =𝐀𝐍) (𝑡𝑎𝑔(𝑓) =𝐏𝐍 𝑑𝑖𝑠𝑡(𝑓, 𝑔′) 𝐷) 𝑖𝑛𝑡𝑒𝑔𝑟𝑎𝑏𝑙𝑒(𝑓), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦:𝑜𝑘 (𝑡𝑎𝑔(𝑓) =𝐏𝐍 (𝐷 >𝑑𝑖𝑠𝑡(𝑓,𝑔 )𝐷))  𝑖𝑛𝑡𝑒𝑔𝑟𝑎𝑏𝑙𝑒(𝑔), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦: 𝑚𝑒𝑑𝑖𝑢𝑚 - reverse revision (for integration of “forgotten nodes”):

(𝑡𝑎𝑔(𝑔) =𝐀𝐍) (𝑡𝑎𝑔(𝑔) =𝐏𝐍 𝑑𝑖𝑠𝑡(𝑔, 𝑓′) 𝐷) 𝑖𝑛𝑡𝑒𝑔𝑟𝑎𝑏𝑙𝑒(𝑔), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦:𝑜𝑘 (𝑡𝑎𝑔(𝑔) =𝐏𝐍 (𝐷 >𝑑𝑖𝑠𝑡(𝑔, 𝑓 )𝐷))  𝑖𝑛𝑡𝑒𝑔𝑟𝑎𝑏𝑙𝑒(𝑔), 𝑞𝑢𝑎𝑙𝑖𝑡𝑦: 𝑚𝑒𝑑𝑖𝑢𝑚

Among the tags, onlyAP, PP, AN, and PN are reliable enough to be used in the deci- sion. Indeed, we must question the SP-pairs (dist< D1, ≠label), for most uncertainty is due to the toponym-pattern (Appendix B.1). Also we must question SN-pairs (dist> D4, =label): see fork-pattern (Appendix B.2). Theses tags and distances are helpful when confronting extra open data, for instance with a specific lexicon, e.g.: Insee-new list of toponym changes, SNCF list of gares at different dates (see Section 3). Also helpful for further processing is the quality of the fusion decision which can be a fuzzy measure between [D1, D2] or [D3, D4] respectively:

ISPRS Int. J. Geo-Inf. 2021, 10, 154 11 of 37

2017-node: f + {“tag”:tag(f), “dist”:dist(f,g0), “closest”:link_to-2014(g0)} 2014-node: g + {“tag”:tag(g), “dist”:dist(g,f0), “closest”:link_to-2017(f0)} dist(a,b) measures the distance between node from A and its closest B counterpart. Then a fusion sort is made into three classes, fusionable, integrable or undecided: • fusionable: means that the node and its counterpart (closest is ok) are supposed to inform the same real gare and we can proceed with the fusion of their data; • integrable: means that the node has no counterpart (closest is irrelevant) and can be kept as is (if in the right set), or adapted for integration (if from the other set). Using f for 2017 and g for 2014, the decision follows the rules: - direct revision:

(tag( f ) = AP) ∨ (tag( f ) = PP ∧ dist( f , g0) ≤ D1) ⇒ f usionable( f ), quality : ok (tag( f ) = PP ∧ (D1 < dist( f , g0) ≤ D2)) ⇒ f usionable( f ), quality : medium (tag( f ) = AN) ∨ (tag( f ) = PN ∧ dist( f , g0) ≤ D4) ⇒ integrable( f ), quality : ok 0 (tag( f ) = PN ∧ (D4 > dist( f , g )D3)) ⇒ integrable(g), quality : medium

- reverse revision (for integration of “forgotten nodes”):

(tag(g) = AN) ∨ (tag(g) = PN ∧ dist(g, f 0) ≤ D4) ⇒ integrable(g), quality : ok 0 (tag(g) = PN ∧ (D4 > dist(g, f )D3)) ⇒ integrable(g), quality : medium

Among the tags, only AP, PP, AN, and PN are reliable enough to be used in the decision. Indeed, we must question the SP-pairs (dist < D1, 6=label), for most uncertainty is due to the toponym-pattern (Appendix B.1). Also we must question SN-pairs (dist > D4, =label): see fork-pattern (Appendix B.2). Theses tags and distances are helpful when confronting extra open data, for instance with a specific lexicon, e.g.,: Insee-new list of toponym changes, SNCF list of gares at different ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 12 of 38 dates (see Section3). Also helpful for further processing is the quality of the fusion decision which can be a fuzzy measure between [D1, D2] or [D3, D4] respectively:

fusionable:quality integrable:quality I I I I I > 0 D1 D2 D3 D4 ∞

2.4.3. Fusion Combination, Integration, Revision and Conclusion 2.4.3. Fusion Combination, Integration, Revision and Conclusion The last fusion step is to make a single “fused” node from a node pair, that is choosing The last fusion step is to make a single “fused” node from a node pair, that is choosing appropriate values for every attribute, e.g., best coordinates?, or to convert into the attribute appropriate values for every attribute, e.g., best coordinates?, or to convert into the attrib- schema of the other dataset. ute schema of the other dataset. The fusion-combination of 2014 attributes concerns: - nature:The fusion-combination which can be translated of 2014 intoattributes a capa concerns: value (though losing some specificity). - - geometry:nature: which the maximal can be translated distance beinginto a Dcapa2, use value2017 (thoughcoordinates; losing some specificity). - - info:geometry add approximation: the maximal distance info about being geometry D2, use (depending2017 coordinates; on dist). - Theinfo integration: add approximation of 2017 nodes infois straightforward.about geometry (depending on dist). TheThe integration integration of of2014 2017nodes nodesisis more straightforward. difficult, concerning: - geometry:The integration straightforward, of 2014 nodes useis 2014morecoordinates; difficult, concerning: - - nature:geometry translate: straightforward, into a capa use value 2014 (rather coordinates; easy); - - km:nature seek: translate for the closest into a 2capanodes valueon the (rather same easy);ligne, then interpolate a value; - - uic:km: buildseek for a unique the closest code 2 usingnodeson numand the same km; ligne, then interpolate a value; - - info:uic: addbuild uncertainty a unique code info using about num kmand and km uic.; - Theinfo revised: add uncertainty dataset is the info new about “baseline-resource” km and uic. for the rest of this work. It complies with target constraints (see metadata: Table 10). Quality information is added whenever an The revised dataset is the new “baseline-resource” for the rest of this work. It complies approximation is made. with target constraints (see metadata: Table 10). Quality information is added whenever an approximation is made.

Table 10. Updated metadata after the revision process (differences with Table 6 are underlined).

METADATASncf2014: { “mime”:”csv”, “year”:”2014”, “nbfeatures”:6442, “sch#”: {“label” … :6442, “closest”:6442, “fusable”:6189, “integrable”:33} } METADATASncf2017: { “url”:”g2017”, “mime”:”geojson”, “year”:”2017”, “nbfeatures”:7702, “sch#”: {“label” …:7702, “closest”:6984, “fusable”:6415, “integrable”:41, “coordinates”:6984} }

Overall conclusion before Section 3: - Software fusion is safe (6415 AP + PP tags = almost no false positives); - Software integration is safe (74 AN + PN tags = almost no false negatives). - The remaining 250 nodesfrom 2014 and 564 from 2017 for which the fusion decision is too uncertain, plus 718 with no geometry, from 2017are in total 1532. The important information after the fusion is that: - 6489 nodes have been qualified; - 1532 nodes (label-num-km) -i.e.,: about 950 gares− can be confronted to VGI and remote sensing imagery for setting a controlled geometry.

3. Materials and Methods Part 2: Voluntary Geographic Information (VGI) Data Inte- gration and Control So far, the revision process has yield an important and consistent dataset, complying with target constraints (unique code, revised label, correct km, …), plus some quality in- formation. However, the target horizon, amounts to twice as manynodes.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 12 of 37

Table 10. Updated metadata after the revision process (differences with Table6 are underlined).

METADATASncf2014:{ “mime”:”csv”, “year”:”2014”, “nbfeatures”:6442, “sch#”: {“label” . . . :6442, “closest”:6442, “fusable”:6189, “integrable”:33} }

METADATASncf2017:{ “url”:”g2017”, “mime”:”geojson”, “year”:”2017”, “nbfeatures”:7702, “sch#”: {“label” . . . :7702, “closest”:6984, “fusable”:6415, “integrable”:41, “coordinates”:6984} }

Overall conclusion before Section3: - Software fusion is safe (6415 AP + PP tags = almost no false positives); - Software integration is safe (74 AN + PN tags = almost no false negatives). - The remaining 250 nodesfrom 2014 and 564 from 2017 for which the fusion decision is too uncertain, plus 718 with no geometry, from 2017are in total 1532. The important information after the fusion is that: - 6489 nodes have been qualified; - 1532 nodes (label-num-km) -i.e.,: about 950 gares− can be confronted to VGI and remote sensing imagery for setting a controlled geometry.

3. Materials and Methods Part 2: Voluntary Geographic Information (VGI) Data Integration and Control So far, the revision process has yield an important and consistent dataset, complying with target constraints (unique code, revised label, correct km, ... ), plus some quality information. However, the target horizon, amounts to twice as manynodes. Doubling up the total, starting by investigating the 1532 nodesleft behind, then discov- ering thousands more “forgotten” gares, and further improving the overall quality are the challenges that we ask the VGI to help us overcome. This section investigates Wikipedia, focusing on its semantic aspect, which makes profit of the UIC ontology (Section 3.1), then investigates various VGI sources, developing some helpers to integratesparse data from less structured and less semantic web data, and it takes advantage of the aerial pictures archives made publicly available by the French IGN.

3.1. Gathering Information from Wikipedia Railway Specific Pages There are many pages devoted to railways. Luckily, standardization efforts of the UIC (UIC: Union Internationale des Chemins de fer, the Worldwide Railway Organization, since 1922) propose an international ontology (thousands terms in the UIC railway Dictionary), allowing the design of semantic Wikipedia “templates” for gares, lignes and communes.

3.1.1. Extracting Coordinates from Wikipedia Let usstart with a simple use: the extraction of the coordinates of a gare named label. The Wikipedia API (application program interface) gives access to the coordinates, at the highest level. Querying it with “Gare de label”, returns the coordinates (if any), or “missing”. The wikiCoordinates routine (Annex) is an asynchronous code that must be used in a “Promise” (technology to awaiting answer, provided by most computing languages). In case of a “missing”result, we can query the API with “label” as commune toponym, for quite often a gare is labeled by its commune name. However the gare can be somewhat far from the center. For instance, with Martigues (Figure6), the distance is 5.3 km (great-circle distance). Therefore, coordinates obtained by that means provides only a rough positioning, which may help when using remote-sensing imagery. ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 13 of 38

Doubling up the total, starting by investigating the 1532 nodesleft behind, then dis- covering thousands more “forgotten” gares, and further improving the overall quality are the challenges that we ask the VGI to help us overcome. This section investigates Wikipedia, focusing on its semantic aspect, which makes profit of the UIC ontology (Section 3.1), then investigates various VGI sources, developing some helpers to integratesparse data from less structured and less semantic web data, and it takes advantage of the aerial pictures archives made publicly available by the French IGN.

3.1. Gathering Information from Wikipedia Railway Specific Pages There are many pages devoted to railways. Luckily, standardization efforts of the UIC (UIC: Union Internationale des Chemins de fer, the Worldwide Railway Organization, since 1922) propose an international ontology (thousands terms in the UIC railway Dic- tionary), allowing the design of semantic Wikipedia “templates” for gares, lignes and com- munes.

3.1.1. Extracting Coordinates from Wikipedia Let usstart with a simple use: the extraction of the coordinates of a gare named label. The Wikipedia API (application program interface) gives access to the coordinates, at the highest level. Querying it with “Gare de label”, returns the coordinates (if any), or “missing”. The wikiCoordinates routine (Annex) is an asynchronous code that must be used in a “Promise” (technology to awaiting answer, provided by most computing languages). In case of a “missing”result, we can query the API with “label” as commune topo- nym, for quite often a gare is labeled by its commune name. However the gare can be ISPRS Int. J. Geo-Inf. 2021, 10, 154 somewhat far from the center. For instance, with Martigues (Figure 6), the distance is 135.3 of 37 km (great-circle distance). Therefore, coordinates obtained by that means provides only a rough positioning, which may help when using remote-sensing imagery.

FigureFigure 6. The6. The coordinates coordinates of of a acommune commune centercenter and its its garegare (5.31(5.31 km km apart apart in inthat that example). example).

TheThe wikiCoordinates wikiCoordinates routine routine is usedis used for improvingfor improving the “revised”the “revised” dataset, dataset, when when several locationseveral competelocation compete for one label,for one for label instance,, for ininstance, the case in ofthe Saintes case of (Section Saintes 2.3 (Section), the distance 2.3), betweenthe distance the duplicatesbetween the was duplicates 348 m, was and 348 the m, answer and the of answer wikiCoordinates of wikiCoordinates is: is: → Query:”GareQuery:”Gare de de Saintes” Saintes”→ geometry:[ geometry:−[-0.617596,45.748672]0.617596,45.748672], Quality:”good”, Quality:”good” ThatThat resultresult isis removing the the uncertainty. uncertainty. ISPRSISPRS Int.Int. J.J. Geo-Inf.Geo-Inf. 20212021,, 1010,, xx FORFOR PEERPEER REVIEWREVIEW 1414 ofof 3838

3.1.2.3.1.2. Extracting Extracting Semantic Semantic Information Information Using Usin Wikipedia’sg Wikipedia’s Infobox Infobox TheThe Wikipedia Wikipedia API-v2API-v2 provides much much richer richer information information from from an an“ontology-driven” “ontology-driven” Infobox [22]. InfoboxAnAn InfoboxInfobox[22]. isis aa panel,panel, usuallyusually nextnext toto thethe leadlead section,section, whichwhich usesuses aa specificspecific ontology:ontology: • An Infobox is a panel, usually next to the lead section, which uses a specific ontology: • commune:commune:{{Infobox{{Infobox CommuneCommune dede France}}France}} containscontains coordinates,coordinates, inseeinsee andand moremore •(Figurecommune:{{Infobox 7); Commune de France}} contains coordinates, insee and more (Figure(Figure 7);7 ); • gare: {{Infobox Gare …}}, contains coordinates, nom of the including commune, all or • • gare:gare: {{Infobox {{Infobox Gare ...…}},}}, containscontains coordinates, coordinates, nom nom of theof includingthe includingcommune commune, all or , all or lignes somesomesome ofof ofthethe the ligneslignes that thatthat connect connectconnect to toto that thatthatgare garegare(Figure (Figure(Figure8). 8).8). • • •ligne:ligne:ligne:{{Infobox{{Infobox{{Infobox LigneLigneLigne ferroviaire ferroviaireferroviaire . . . }} see …}}…}} next seesee Section nextnext 3.1.3 SectionSection. 3.1.3.3.1.3.

FigureFigureFigure 7.7. InfoboxInfobox 7. Infobox forfor aafor communecommune a commune ((ErcéErcé(Ercé).).). RelevantRelevant Relevant itemsitemsitems underlined underlinedunderlined (excerpt (excerpt(excerpt from fromfrom screen screenscreen console). console).console).

Figure 8. Infobox for a gare (Gare de Rennes). Relevant items underlined. FigureFigure 8.8. InfoboxInfobox forfor aa garegare ((GareGare dede RennesRennes).). RelevantRelevant itemsitems underlined.underlined.

TheThe routineroutine wikiInfoboxwikiInfobox (Appendix(Appendix A)A) cancan parseparse bothboth typestypes ofof InfoboxInfobox..

3.1.3.3.1.3. ExtractingExtracting DataData forfor allall thethe GaresofGaresof aa SameSame Ligne:Ligne: TheThe RoutemapRoutemap SemanticsSemantics GivenGiven aa ligneligne name,name, e.g.,e.g., Ligne_d’Audun-le-Tiche_à_Hussigny-GodbrangeLigne_d’Audun-le-Tiche_à_Hussigny-Godbrange,, wewe cancan queryquery thethe WikipediaWikipedia API,API, andand getget anan {{Infobox{{Infobox LigneLigne ferroviaireferroviaire …}}…}} description.description. ThatThat InfoboxInfobox (French(French template)template) containscontains simplesimple items,items, suchsuch asas lengthlength ((longueurlongueur),), gaugegauge ((écartementécartement),), thethe officialofficial numnum ((numéronuméro),), closingclosing datedate ((fermeturefermeture),), …,…, andand aa muchmuch moremore com-com- plexplex item:item: aa tabletable whosewhose syntaxsyntax isis intendedintended toto displaydisplay aa ““routemaproutemap”” ((schémaschéma)) ofof thatthat ligneligne.. TheThe routemaproutemap semanticssemantics (French(French sites,)sites,) refersrefers toto Modèles/BSModèles/BS (“BS”(“BS” standsstands forfor Bahn-Bahn- StreckeStrecke forfor thethe modelmodel hashas beenbeen initiatedinitiated byby GermGermans).ans). EnglishEnglish sitessites referrefer toto thethe generalgeneral RouteRoute diagramdiagram templatetemplate descriptiondescription [33,34].[33,34]. AA simplifiedsimplified versionversion ofof aa ligneligne InfoboxInfobox,, isis inin TableTable 11.11. TheThe ““schémaschéma”” extractedextracted fromfrom thisthis InfoboxInfobox,, isis presentedpresented inin aa simplifiedsimplified version,version, inin TableTable 12:12: itit isis aa {{BS-table}}{{BS-table}},, complyingcomplying withwith thethe BSBS model,model, whichwhich givesgives thethe orderedordered listlist ofof nodesnodes,, alongalong thatthat ligneligne..

TableTable 11.11. SimplifiedSimplified versionversion ofof anan exampleexample ofof Infobox_Ligne_ferroviaireInfobox_Ligne_ferroviaire.. {{Infobox{{Infobox LigneLigne ferroviaire\n|ferroviaire\n| nomlignenomligne == d’Audun-le-Tiched’Audun-le-Tiche àà Hussi-Hussi- gny-Godbrange\n|gny-Godbrange\n| misemise enen serviceservice == 1880\n|1880\n| fermeturefermeture == 1966\n|1966\n| fermeture2fermeture2 == 1987\n|1987\n| numéronuméro == 196000\n|196000\n| longueurlongueur == 8.8\n|8.8\n| écartementécartement == normal\n|normal\n| …… …… (mor(moree items)items) …… schémaschéma == \n{{BS-table}}\n\n{{BS-table}}\n …… {{BSnn}}{{BSnn}} …… seriesseries ofof itemsitems describingdescribing thethe nodesnodes alongalong thatthat lineline (see(see nextnext Table)Table) …… \n{{BS-table-fin}}\n\n{{BS-table-fin}}\n }}}}

ISPRS Int. J. Geo-Inf. 2021, 10, 154 14 of 37

The routine wikiInfobox (AppendixA) can parse both types of Infobox.

3.1.3. Extracting Data for all the Garesof a Same Ligne: The Routemap Semantics Given a ligne name, e.g., Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrange, we can query the Wikipedia API, and get an {{Infobox Ligne ferroviaire . . . }} description. That Infobox (French template) contains simple items, such as length (longueur), gauge (écartement), the official num (numéro), closing date (fermeture), ... , and a much more complex item: a table whose syntax is intended to display a “routemap”(schéma) of that ligne. The routemap semantics (French sites,) refers to Modèles/BS (“BS” stands for Bahn- Strecke for the model has been initiated by Germans). English sites refer to the general Route diagram template description [33,34]. A simplified version of a ligne Infobox, is in Table 11. The “schéma” extracted from this Infobox, is presented in a simplified version, in Table 12: it is a {{BS-table}}, complying with the BS model, which gives the ordered list of nodes, along that ligne.

Table 11. Simplified version of an example of Infobox_Ligne_ferroviaire.

{{Infobox Ligne ferroviaire\n| nomligne = d’Audun-le-Tiche à Hussigny-Godbrange\n| mise en service = 1880\n| fermeture = 1966\n| fermeture2 = 1987\n| numéro = 196000\n| longueur = 8.8\n| écartement = normal\n| ...... (more items) . . . schéma = \n{{BS-table}}\n . . . {{BSnn}} . . . series of items describing the nodes along that line (see next Table) ... \n{{BS-table-fin}}\n }}

Table 12. Simplified version of the BS-table (of the same ligne).

{{BS-table}} {{BSbis| ... | ...... |Ligne d’Esch-sur-Alzette à Audun-le-Tiche}} {{BS3bis| ... | ...... |Frontière entre la France et le Luxembourg} border {{BS5bis| ... | ...... |Ligne de Fontoy à Audun-le-Tiche}} {{BS5bis| ... | ...... |(1) Ligne non déclassée, non exploitée}} {{BS5bis| ... | 21,144|Audun-le-Tiche |(301 m)}} origin {{BS5bis| ... | ...... |Ancienne voie vers Villerupt-Micheville}} {{BSbis| ... | 2,4xx| |Bif vers raccordement d’Audun-le-Tiche|}} branch {{BSbis| ... | 4,6xx|Rédange |(340 m)}} label {{BSbis| ... | 5,5xx| |Tunnel d’Adlergrund|(409m)}} tunnel {{BSbis| ... | 6,6xx| |Ancienne frontière France - Empire allemand}} km {{BS5bis| ... | ...... |Ligne de Longwy à Villerupt-Micheville}} {{BSbis| ... | 8,8xx|Hussigny-Godbrange |(346 m)}} branch {{BSbis|..| ...... |Ligne de Longwy à Villerupt-Micheville}} terminus {{BS-table-fin}}

The right column indicates which attribute to be parsed (values underlined on left): origin to terminus, gares with label, km, also forks to other lines, bridges, international border crossings, etc.and most of the expected information, except the geometry. Wikipedia parses the BS-table to draw a graphic representation of the ligne (Figure9). There are 1000 such Wikipedia pages for French lignes, each ligne being linked to one or several other lignes. Therefore, it is worth developing a specific routine for the parsing of an Infobox and its BS-table. However, the hard part is twofold: • the Infobox BS-table complies with Modèles/BS, whose syntax is somewhat tricky, but was parsed successfully in most cases: only a few code failures, hard to overcome, have been met. Figure 10 illustrates a simple case of {{BS-table}}, code and result. • The other difficulty is to deal with the cascade of asynchronous Internet requests for processing a single line. ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 15 of 38

Table 12. Simplified version of the BS-table (of the same ligne). {{BS-table}} {{BSbis|…| … … |Ligne d’Esch-sur-Alzette à Audun-le-Tiche}} border {{BS3bis|…| … … |Frontière entre la France et le Luxembourg}

{{BS5bis|…| … … |Ligne de Fontoy à Audun-le-Tiche}} {{BS5bis|…| … … |(1) Ligne non déclassée, non exploitée}} origin {{BS5bis|…| 21,144|Audun-le-Tiche |(301 m)}} {{BS5bis|…| … … |Ancienne voie vers Villerupt-Micheville}} branch {{BSbis|…| 2,4xx| |Bif vers raccordement d’Audun-le-Tiche|}} label tunnel {{BSbis| | 4,6xx|Rédange |(340 m)}} … km {{BSbis|…| 5,5xx| |Tunnel d’Adlergrund|(409m)}} {{BSbis|…| 6,6xx| |Ancienne frontière France - Empire allemand}}branch {{BS5bis|…| … … |Ligne de Longwy à Villerupt-Micheville}} terminus {{BSbis|…| 8,8xx|Hussigny-Godbrange |(346 m)}} {{BSbis|..| … … |Ligne de Longwy à Villerupt-Micheville}} {{BS-table-fin}}

The right column indicates which attribute to be parsed (values underlined on left):

ISPRS Int. J. Geo-Inf. 2021, 10, 154 origin to terminus, gares with label, km, also forks to other lines, bridges, international15 of 37 bor- der crossings, etc.and most of the expected information, except the geometry. Wikipedia parses the BS-table to draw a graphic representation of the ligne (Figure 9).

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 16 of 38 FigureFigure 9. 9.Infobox_Ligne_ferroviaire Infobox_Ligne_ferroviaire with with Ligne_d’Audun-le-Tiche_ Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrangeà_Hussigny-Godbrange. .

There are 1000 such Wikipedia pages for French lignes, each ligne being linked to one or several other lignes. Therefore, it is worth developing a specific routine for the parsing of an Infobox and its BS-table. However, the hard part is twofold: • the Infobox BS-table complies with Modèles/BS, whose syntax is somewhat tricky, but was parsed successfully in most cases: only a few code failures, hard to overcome, have been met. Figure 10 illustrates a simple case of {{BS-table}}, code and result. • {“label”:”Audun-le-Tiche”,The other difficulty is to deal“num”:”196,000”, with the cascade “km”:”0+0”, of asynchronous “info”:”origin Internet requests for Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrange”}, “geometry”:null, processing a single line. {“label”:”Rédange”, “num”:”196,000”, “km”:”4+570”},”geometry”: null, {“label”:”Hussigny-Godbrange”, “num”:”196,000”, “km”:”8+8x”, “info”: ”terminus”}, “geometry”:null,

FigureFigure 10. 10.Top: Top:Infobox Infoboxsample sample code code(left) (left) andand displaydisplay (right).(right). Bottom: Bottom: result result of of processing processing the the InfoboxInfobox of: ‘Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrange’ (Figure 10). of: ‘Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrange’ (Figure 10).

AttributesAttributes are are set set from from thethe parsing:parsing: • • numnum is is in in the theInfobox Infoboxmain main partpart (InfoboxInfobox attribute numéro numéro = =196,000); 196,000); • • label,label, kmare kmare in in the the BS-table, BS-table, andand kmkm can be be interpolated, interpolated, if if absent; absent; • • nomnom is is often often given given (if (if differingdiffering fromfrom label), insee insee must must be be determined determined later; later; • • geometrygeometry is is null, null, but but cancan bebe seekonseekon ‘Gare d’Audun-le-Tiche’, d’Audun-le-Tiche’, ‘Gare ‘Gare de de Rédange’,etc. Rédange’, etc. RoutineRoutine wikiBSTable wikiBSTable (Appendix (AppendixA A)) shows shows how how the the parsing parsing ofof oneone tabletable cancan triggertrigger aa cascade cascade of of asynchronousasynchronous requests: initial initial fetch, fetch, possible possible second second fetch fetch if there if there is an is in- an indirectiondirection on the BS-table, then then a a fetch fetch trying trying to to get get the the geometry of of each each individual individual garegare foundfound in in the the BS-table. BS-table. The The full full code code isis muchmuch longer,longer, andand outout of the scope of of that that paper. paper. Moreover,Moreover, when when parsing parsing a lignea ligne, one, one or or several several connected connectedlignes lignescan can be be found, found, and and for everyfor every such lignesuch, ligne we can, we trigger can trigger a new a requestnew request for that for new that lignenew, ligne and, so and on. so Piling on. Piling up these up requeststhese requests should should in theory in allowtheory us allow to browse us to browse the entire the network, entire network, but in practicebut in practice rapidly saturatesrapidly saturates memory andmemory computing and computing resources, reso aturces, least on at aleast personal on a personal computer. computer. To develop a full scale “big-data” procedure would improve greatly the time spent linking data, from lignes to gares, and from gares to their geocoded location. Instead, we are querying Wikipedia with only a few (2 to 4) lignes requests at a time.

3.2. Gathering Information from VGI Pages Wikipedia is a great, irreplaceable source of collaborative VGI, covering almost the full range of the lignes that have been included in the main French railway network. However, “secondary” lignes (rural lines and tramways), which were never embraced in the nationalization of 1937 into what became the SNCF [27], have not yet been treated so extensively: a dozen or two are fully described (with BS-table). About 100 are qualita- tively described in the page of their owning operator, e.g., “Voies_ferrées_des_Landes” lists 12lignes, only two having their specific page (w/o BS-table). However, in France as in most countries, railways arouse vocations for “volunteers”, and a lot of French websites (Table 13) are devoted to keep track of past gares, lignes, vi- aducts and tunnels. Even the collection of street-signs (“place de la gare”, “rue du pet- it-train”) is valuable: it is the toponymy-memory of the location of a past gare.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 16 of 37

To develop a full scale “big-data” procedure would improve greatly the time spent linking data, from lignes to gares, and from gares to their geocoded location. Instead, we are querying Wikipedia with only a few (2 to 4) lignes requests at a time.

3.2. Gathering Information from VGI Pages Wikipedia is a great, irreplaceable source of collaborative VGI, covering almost the full range of the lignes that have been included in the main French railway network. However, “secondary” lignes (rural lines and tramways), which were never embraced in the nationalization of 1937 into what became the SNCF [27], have not yet been treated so extensively: a dozen or two are fully described (with BS-table). About 100 are qualitatively described in the page of their owning operator, e.g., “Voies_ferrées_des_Landes” lists 12lignes, only two having their specific page (w/o BS-table). However, in France as in most countries, railways arouse vocations for “volunteers”, and a lot of French websites (Table 13) are devoted to keep track of past gares, lignes, viaducts and tunnels. Even the collection of street-signs (“place de la gare”, “rue du petit-train”) is valuable: it is the toponymy-memory of the location of a past gare.

Table 13. Some websites about railway lignes and gares in France (short name on the left).

http://archeoferroviaire.free.fr accessed on: 8 March 2021 Archef records hundreds of secondary lines, including industrial, military lines, at a high-resolution http://chemindeferenfrance.blogspot.com accessed on: 8 March 2021 cdfFR gathers cartographic records about a hundred ancient lines http://www.train.eryx.net/hscff/ accessed on: 8 March 2021 GeoCF records hundreds of secondary lines (in revision) https://www.lignes-oubliees.com accessed on: 8 March 2021 Ligou “forgotten lines” records 1768 gares and 361 lignes http://rd-rail.fr/ RDrail by the author of “Les 400 profils des lignes voyageurs du réseau” (2 volumes) https://routes.fandom.com/wiki/Portail:Transport_ferroviaire_français accessed Routes on: 8 March 2021 several tables about ancient lines, about chronology https://rue_du_petit_train.pagesperso-orange.fr accessed on: 8 March 2021 RuePT records 6800 street signs mentioning the presence of a former gare (useful for disambiguating precise localization) http://chemins.de.traverses.free.fr accessed on: 8 March 2021 Traver gathers photographs and cartographic records about ancient lignes

The variety of information, and unstructured representation, make it difficult to code any generic software. But some routines may help, which are proposed in thissection. There are two critical steps in the overall process described in Section 2.1.3: - First step(a): identifying the existence of a gare onto a ligne; - Last step(d): obtaining coordinates data. The generic approach to extracting gare information from VGI starts with step(a): fortunately, most mentioned VGI providesome data, loosely structured by ligne (be a text or a table), for instance: “ligne A to B has such characteristics, and comprises: gare1, gare2 ... ”, The challenge is then to code a “parser” that delivers a list of nodes such as: {label:“A”, num:“AtoBcode”, km:“0”, info:“origin of A to B, characteristics“} {label:“gare1_name”, num:“AtoBcode”, km:“somerank information, or 1”} {label:“gare2_name”, num:“AtoBcode”, km:“somerank information, or 2”} ... ISPRS Int. J. Geo-Inf. 2021, 10, 154 17 of 37

{label:“B”, num:“AtoBcode”, km:“max = length, or max rank”, info: “terminus”} Here, follow three cases that are common among several VGI.

3.2.1. HTML-Style Tables As opposed to a Wikipedia Infobox, the HTML

tag is not a semantic tag, and it is uneasy to determine its content, unless by direct reading (no simple artificial intelligence (AI) routine at hand, although research is becoming active). For decision making, the most important question is: is there a relevant key in that table? (e.g., num, insee, label?). A few cases, from Routes or RDrail, were processed this way, giving information only about origin-terminus, and some ligne characteristics (closing date, gauge, length): • ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER visual:REVIEWmake a decision about relevance; 18 of 38 • manual: copy and paste the table text into a spreadsheet; convert into CSV format; • software: fetch CSV file; join objects with routine mergeLigneInfo (AppendixA).

3.2.2.3.2.2. CoordinatesCoordinates ListsLists (Polyline(Polyline for for a aLigneLigne) )in in Google Google KML KML or or GPX GPX Format Format SomeSome contributorscontributors have have digitalized digitalizedlignes lignesusing using the theGoogleMyMaps GoogleMyMapsfacility, facility, and some-and times,sometimes,gare are gare individualized are individualized in such in kindssuch kinds of data. of data. The procedureThe procedure is: is: • • manual:manual: exportexport the ligne data into KML format (from (from the the Google Google page); page); • • software:software: convertconvert into geojson, and if the the gare isis individualized: individualized: add add it it directly, directly, or: or: • • manual:manual: copycopy and paste the relevant geojson ofof the the ligneligne;; • • visual:visual: (see(see Section 3.33.3)) locate each expected garegare alongalong that that ligneligne. . ThisThis hashas beenbeen usedused with source Traver and a few ligneslignes fromfrom cdfFRcdfFR (Table(Table 13). 13).

3.2.3.3.2.3. SimpleSimple ListsLists of gares in Plain Text ItIt isis notnot easyeasy toto automate automate a a detection, detection, but but the the result result proves proves useful useful in in obtaining obtaining a a list list of garesof garesalong along a same a sameligne ligne: : •• manual:manual: copy-pastecopy-paste the text into a string constant; •• software:software: checkcheck ifif that ligne is already documented, or or add add a a new new ligneligne numbernumber num. num. TheThe routineroutine stationsAlongLinestationsAlongLine (Annex) (Annex) is used is used in inparsing parsing the the example example of (Figure of (Figure 11). 11 ).

FigureFigure 11. 11.Example Example of of semi-structured semi-structured input input text text with with list oflistgares of gares(top );(top result); result of the of parsing the parsing (middle (middle); and); resulting and resultinggeojson listgeojson of nodes list (ofbottom nodes ).(bottom).

Parsing the text extracted from one page of Archef provides an ordered list of gares, from origin to terminus, a length, creates the lignenum, and two dates for its lifetime.

3.3. Remote-Sensing Information for the Assisted Visual Geocoding of Gares If the direct geocoding of a gare (wikiCoordinates) has failed, the last opportunity is to visually inspect maps or aerial pictures to locate that gare, in a way similar to [35] with land-cover. For geocoding a gare, there is a two-step procedure: • (1) software: use a geocoder API (e.g., Google), querying coordinates for: French word “gare” + label. For instance, with Cardet (Gard), we can try a few query vari- ants (Table 14). It is impossible to know which query will give the best answer (dis- tance data have been added in Table 14, only after visual inspection);

ISPRS Int. J. Geo-Inf. 2021, 10, 154 18 of 37

Parsing the text extracted from one page of Archef provides an ordered list of gares, from origin to terminus, a length, creates the lignenum, and two dates for its lifetime.

3.3. Remote-Sensing Information for the Assisted Visual Geocoding of Gares If the direct geocoding of a gare (wikiCoordinates) has failed, the last opportunity is to visually inspect maps or aerial pictures to locate that gare, in a way similar to [35] with land-cover. For geocoding a gare, there is a two-step procedure: • (1) software: use a geocoder API (e.g., Google), querying coordinates for: French word “gare” + label. For instance, with Cardet (Gard), we can try a few query variants ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 19 of 38 (Table 14). It is impossible to know which query will give the best answer (distance data have been added in Table 14, only after visual inspection);

Table 14. Example of geocoding queries for gare_de_Cardet (département: Gard). Table 14. Example of geocoding queries for gare_de_Cardet (département: Gard). Query [lat,lon] dist [44.023710, Query [lat,lon]0 dist ... true coordinates (see: Figure 14 and Section 3.3.3) = 4.088814] ... true coordinates (see: Figure 14 and Section 3.3.3) = [44.023710,[44.026042, 4.088814] 0 [lat,lon] = geocode(“gare, Cardet, France”) 672 m [lat,lon] = geocode(“gare, Cardet, France”) [44.026042,4.081059] 4.081059] 672 m [44.010420, [lat,lon] = geocode(“rue de la gare, Cardet, France”) 1515 m [lat,lon] = geocode(“rue de la gare, Cardet, France”) [44.010420,4.084597] 4.084597] 1515 m [lat,lon] = geocode(“chemin de la gare, Cardet, [44.019122, [lat,lon] = geocode(“chemin de la gare, Cardet, France”) [44.019122, 4.086764]536 536m m France”) 4.086764]

• • (2)(2) manual/visual manual/visual:: use use anyany approximation, or orskip skip first first step, step, and andinput input the toponym the toponym intointo an an online online map map facility facility providing, preferably preferably ancient, ancient, maps maps and aerial and aerial photos. photos. In France,In France, the the best best datadata sourcesource for for historic historic (mid-20th (mid-20th century) century) aerial aerial images images and and maps,maps, is IGN, is IGN, the the French French national national geographic organization, organization, which which provides provides the website: the website: https://remonterletemps.ign.frhttps://remonterletemps.ign.fr, giving, giving access access to: to: Today (2016–2018): Aerial image (≈1 m resolution), Vector cartography (≈1/10,000) Today (2016–2018): Aerial image (≈1 m resolution), Vector cartography (≈1/10,000) Past (1947–1960): Aerial image (≈10 m resolution), Digitized map (≈1/25,000) Past (1947–1960): Aerial image (≈10 m resolution), Digitized map (≈1/25,000) Figure 12 illustrates this website with toponym ‘Chantelle’ (cf. Section 3.2.3), at zoomFigure = 16, 12 andillustrates two sources this website (1954-aerial with and toponym 1952-map): ‘Chantelle’ the gare(cf. is visible Section (big 3.2.3 arrow)), at zoomand = 16, andthe two fork sources between (1954-aerial lines (yellow and arrow), 1952-map): about 200m the SWgare ofis the visible gare (scale (big arrow)at bottom-left and the fork betweencorner). lines (yellow arrow), about 200 m SW of the gare (scale at bottom-left corner).

Figure 12. Website ‘RemonterLeTemps’: geocoding Chantelle (Allier) at roughly 1/20,000, using a 1954 aerial image, and Figure 12.the Websitescanned map ‘RemonterLeTemps’: (1952). The double arrow geocoding shows theChantelle gare, the (Allier) smallerat arrow roughly points 1/20,000, to a junction using of two a 1954 lignes aerial (the line image, to and the scannedEbreuil map makes (1952). a bend The to the double south). arrow shows the gare, the smaller arrow points to a junction of two lignes (the line to Ebreuil makes a bend to the south). 3.3.1. Dual Use of the Website RemonterLeTemps, (a) Direct Geocoding of Toponyms Routine checkToponymAt (Appendix A) encodes the identifier URI pointing to the relevant remonterLeTemps data for a given nom, what we did above with Chantelle (Figure 12).

ISPRS Int. J. Geo-Inf. 2021, 10, 154 19 of 37

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 20 of 38 3.3.1. Dual Use of the Website RemonterLeTemps, (a) Direct Geocoding of Toponyms Routine checkToponymAt (AppendixA) encodes the identifier URI pointing to the rel- evant remonterLeTempsThe next step todata detecting for a given a gare nom,, is visual: what we quite did often above rather with Chantelle easy, sometimes(Figure 12 more). difficultThe next (cf. step examples to detecting in Figure a gare 13a–c)., is visual: A gare quite is always often ratherat the easy,junction sometimes of several more linear gare difficultelements (cf. (rail examples + road), in Figureeasier to 13 detecta–c). Awhen findingis always the at appropriate the junction zoom of several level. linear elements (rail + road), easier to detect when finding the appropriate zoom level.

(a) Jaligny-sur-Besbre (Allier): a small road is visi- ble, but where is the gare? The toponym “la Gare” is still present on the recent IGN map: (toponymy memory!), … now we can “see” the ancient rail- way bending NE (line of trees).

(b) Cardet (Gard): the railway line is clearly visible (E-W), also two roads converging towards vil- lage center, crossing the railway in two places. Where is the gare? intersection with the biggest ? road (east?), or closer to the center (west?). The answer is given later in the text. ?

(c) Le Cateau-Cambrésis (): railway very clearly visible, half-circling the city. But where is the stop “Le Cateau-Halte”? Comparing ‘km’ with previous and next stops, by rough interpo- lation, the stop should be at the bridge, near “Faubourg de ” in the middle, different ? from another gare, just South of the picture, where two lines meet. ?

FigureFigure 13. Some13. Some examples examples where where visual visual geocoding geocoding is not is not obvious. obvious.

Therefore,Therefore, the the different different steps steps of theof the procedure procedure can can be summarizedbe summarized as: as: • • software:software: generate generate URI URI using using routine routine checkToponymAt; checkToponymAt; • • visual:visual: start start with with an intermediatean intermediate zoom zoom (e.g., (e.g., z = 16)z=16) and and focus focus on on the the section section where where thethe line line is expected is expected to reachto reach the thecommune commune(e.g., (e.g., SW SW corner); corner); • visual: inspect map-aerial photo combinations, focus on road-rail intersections; • manual: input back coordinates into geometry of the gare.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 21 of 38

ISPRS Int. J. Geo-Inf. 2021, 10, 154 20 of 37 Adding more automation is beyond the scope of this study. The care brought to this task, must insure a high quality, for it will be very difficult to detect errors later. • visual: inspect map-aerial photo combinations, focus on road-rail intersections; 3.3.2.• Dualmanual: Use of input the backWebsite coordinates RemonterLeTemps: into geometry of (b) the Visualgare. Reverse-Geocoding To showAdding where more automationa geometry is points beyond to the in scope aerial of or this map study. representation The care brought at a to given this date: task, must insure a high quality, for it will be very difficult to detect errors later. • software: generate URI using routine checkGeocodeAt (Annex), cross-check with a 3.3.2.geocoder Dual, to Use get of commune the Website name; RemonterLeTemps: (b) Visual Reverse-Geocoding • visual:To show start wherewith a geometryhigh zoom points (e.g., to in19), aerial and or inspect map representation map-aerial atphoto a given combinations, date: •zoom-outsoftware: to generateget neighbor URI using toponyms routine (railway checkGeocodeAt terminology (Annex), may cross-check appear); with a • manual:geocoder confirm,, to get commune or infirm,name; the quality of the information for that node. • visual: start with a high zoom (e.g., 19), and inspect map-aerial photo combinations, 3.3.3. Hintszoom-out for Alternatives to get neighbor toponyms (railway terminology may appear); • manual: confirm, or infirm, the quality of the information for that node. Cartography is notan exact science, even at “the scale of a mile to the mile!” (Lewis Carroll3.3.3.), different Hints for Alternativesmaps may choose to display different toponyms. Sometimes Bingmaps or GooglemapsCartography may help is notan to solve exact science,an indecision, even at “sometimesthe scale of a the mile “ tocarte the mile!IGN”(”, Lewiswhat is the case Carrollfor the), differentCardet indecision maps may choose (from to Figure display 13b): different the toponyms. gare is at Sometimesthe East intersection,Bingmaps or “Che- min deGooglemaps la Gare” may(Figure help 14, to solveright), an not indecision, at the West sometimes intersection, the “carte although IGN”, what closer is the to case the center of thefor village. the Cardet indecision (from Figure 13b): the gare is at the East intersection, “Chemin de la Gare” (Figure 14, right), not at the West intersection, although closer to the center ofThe the problem village. is to find the right zoom level, which displays the street name.

Figure Figure14. Using 14. Using alternative alternative maps maps to toresolve resolve indecision: indecision: therethere is is a “aChemin “Chemin de la de Gare la Gare” indication.” indication. The problem is to find the right zoom level, which displays the street name. Note:Note: to the to the same same query: query: “ “Chemin de de la la Gare, Gare, Cardet Cardet”, the”, Googlethe Googlegeocoder geocoder answers answers a a locationlocation 563 563 m (more m (more South), South), versus versus 5050 m m with with the theIGN IGNgeocoder. geocoder. In that Inparticular that particular case, case, we havewe have the thevisual visual result result exactl exactlyy at thethe crossing crossing with with the the road. road. 4. Results: The Overall Procedure, the Reconstructed Network, and Its Quality Control 4. Results: The Overall Procedure, the Reconstructed Network, and Its Quality Control There are four main results out of this study: 1.There the are dataset four “ mainCARP ”results of French outgares of this, over study: the 1920–2020 time span; 1. 2.the datasetthe developed “CARP routines” of French combined gares in, a over computer-assisted the 1920–2020 reconstruction time span; procedure; 2. 3.the developedquality controls routines increase combined confidence, in or a pointcomp oututer-assisted odd/missing reconstruction data, to be fixed procedure; later. 3. quality controls increase confidence, or point out odd/missing data, to be fixed later. 4.1. The Reconstruted Network (CARP) The Computer-Assisted Reconstruction Procedure –CARP- data result from the fusion 4.1. The Reconstruted Network (CARP) (Section 4.2.1) and from the VGI integration (Section 4.2.2). So far, as of January 2021: the CARPThe Computer-Assisted dataset contains about Reconstruction 10,900 nodes, accounting Procedure for about –CARP 9100 different- data resultgares (aboutfrom the fu- sion 100(Section in neighbor 4.2.1) countries), and from 450 the fork VGInodes ,integration and 68 border-crossing (Section 4.2.2).nodes. So far, as of January 2021:the CARP dataset contains about 10,900 nodes, accounting for about 9100 different gares (about 100 in neighbor countries), 450 fork nodes, and 68 border-crossing nodes.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 22 of 38

Displaying the CARP Dataset Figure 15 illustrates a part of the CARP on the Region-Bretagne, active gares only are on Figure 16, and disused gares on Figure 17. All data are plotted on top of a classical tiled OpenStreetMap, and green lignes are from the LignesSNCF dataset.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 22 of 38

ISPRS Int. J. Geo-Inf. 2021, 10, 154 21 of 37

Displaying the CARP Dataset Displaying the CARP Dataset Figure 15 illustrates a part of the CARP on the Region-Bretagne, active gares only are Figure 15 illustrates a part of the CARP on the Region-Bretagne, active gares only are on Figureon Figure 16, 16 and, and disused disused garesgares on FigureFigure 17 17.. All All data data are are plotted plotted on top on of top a classical of a classical tiled tiled OpenStreetMap,OpenStreetMap, and and green green ligneslignes areare from from the theLignesSNCF LignesSNCFdataset. dataset.

Figure 15. “Bretagne” region full set of gares, on top of OpenStreetMap: green dots are 2020 active gares on green 2018-active lignes; red dots are ‘past’ gares (mostly on rural tramways).

All gares are sorted by ligne, and by km increasing order: it allows a simple routine to convert these sequences into a line-string, plotted as dotted red lines that approximate the disused lignes: no extra information is added. The Region Bretagne has been thoroughly processed: the CARP data represent al- most 100% of what has ever existed. Region Provence is almost complete as well. The CARP dataset is deposited, under open license, on the data.gouv.fr website. Figure 15.Figure “Bretagne 15. “Bretagne” region” region full full set set of garesgares, on, on top oftop OpenStreetMap: of OpenStreetMap: green dots green are 2020 dots active aregares 2020on greenactive 2018-active gares on green The format is geojson. Periodic revisions will bring additional data until full completion. 2018-activelignes lignes; red dots; red are dots ‘past’ aregares ‘past’(mostly gares on (mostly rural tramways). on rural tramways).

All gares are sorted by ligne, and by km increasing order: it allows a simple routine to convert these sequences into a line-string, plotted as dotted red lines that approximate the disused lignes: no extra information is added. The Region Bretagne has been thoroughly processed: the CARP data represent al- most 100% of what has ever existed. Region Provence is almost complete as well. The CARP dataset is deposited, under open license, on the data.gouv.fr website. The format is geojson. Periodic revisions will bring additional data until full completion.

FigureFigure 16. “ 16.Bretagne“Bretagne”: active”: active garesgares (small(small green dots, dots, 2018) 2018) on on top top of active of activelignes lignes. .

Figure 16. “Bretagne”: active gares (small green dots, 2018) on top of active lignes.

ISPRS Int.ISPRS J. Geo-Inf. Int. J. 2021 Geo-Inf., 102021, x FOR, 10, 154 PEER REVIEW 22 of 37 23 of 38

FigureFigure 17. “ 17.Bretagne“Bretagne”: disused”: disused or or past past garesgares,, andandlignes lignes. Small. Small black black circles circles = forks = .forks.

TheAll meta-analysisgares are sorted appliedto by ligne, CARP and by, km has increasing helped in order: understanding it allows a simple more routine discrepancies to (systematicconvert thesebias sequencesrather than into arbitrary a line-string, errors): plotted as dotted red lines that approximate the disused lignes: no extra information is added. 1. labelThe disparity Region Bretagne due to has change been thoroughly between dates: processed: the thetoponym-pattern CARP data represent (Appendix almost B.1); 2. 100%geometry of what deviation, has ever existed. due to Region network Provence evolution is almost and complete “fork-pattern as well.” (Appendix B.2). The CARP dataset is deposited, under open license, on the data.gouv.fr website. The The CARP tries to restore the evolution of the data as best as possible, and to bring format is geojson. Periodic revisions will bring additional data until full completion. additionalThe correctionsmeta-analysis whereverappliedto identified.CARP, has helped in understanding more discrepancies (systematic bias rather than arbitrary errors): 4.2. The1. Computer-Assistedlabel disparity due toReconstruction change between Procedure dates: the fortoponym-pattern Past Railway(Appendix Stations B.1); 4.2.1.2. The geometry Generic deviation,Procedure due for to Reconstruction network evolution from and “Publicfork-pattern Datasets:” (Appendix CARP-Main B.2). ProceduresThe CARP described tries to restore in Sections the evolution 2 and of 3 the result data asin best that as CARP, possible, summarized and to bring in the additional corrections wherever identified. next two paragraphs: • 4.2.Meta-analysis The Computer-Assisted (Figure Reconstruction18): production Procedure of formetadata, Past Railway counting Stations missing attributes, 4.2.1.checking The Generic constraints Procedure (e.g., for uniqueness), Reconstruction fromcorrection Public Datasets:of ambiguities CARP-Main and duplicates, possiblyProcedures completing described some in Sections geometry2 and,3 per- resultnode in that addition CARP, of summarized quality information; in the next • twoThree-steps paragraphs: fusion: similarity–measure, fusion–decision, attribute–combination •(FigureMeta-analysis 19), whichyields (Figure 18 ):the production revised ofdataset metadata, (=6489 counting nodes missing), plus attributes, dataset of check- remaining ing constraints (e.g., uniqueness), correction of ambiguities and duplicates, possibly undecidednodes, i.e.,: no geometry (718), or too uncertain (250 in 2014, 564 in 2017); • completing some geometry, per-node addition of quality information; •QuestThree-steps for geometry fusion: of similarity–measure,undecided nodes (Figure fusion–decision, 20), using attribute–combination Wikipedia or any direct geocoder,(Figure to19 ),get whichyields gare coordinates: the revised it helped dataset to (=6489 integratenodes), about plus dataset 400 additional of remain- nodes; • Visualing undecidednodesinspection of, i.e.,: Remote no geometry Sensing (718), / old or too maps, uncertain for (250final in 2014,improvements: 564 in 2017); it gave •aboutQuest 600 for additional geometry nodes of undecided originallynodes in(Figure public 20 datasets), using Wikipedia but without or any geometry direct (Fig- ure geocoder,21). to get gare coordinates: it helped to integrate about 400 additional nodes; • Visual inspection of Remote Sensing/old maps, for final improvements: it gave about 600 additional nodes originally in public datasets but without geometry (Figure 21).

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 24 of 38 ISPRSISPRS Int.Int. J.J. Geo-Inf.Geo-Inf. 20212021,,, 1010,,, xxx FORFORFOR PEERPEERPEER REVIEWREVIEWREVIEW 2424 ofof 3838 ISPRS Int. J. Geo-Inf. 2021, 10, 154 23 of 37

Figure 18. The meta-analysis: metadata and improved datasets. FigureFigureFigure 18.18. 18. TheTheThe meta-analysis:meta-analysis: meta-analysis: metadatametadata metadata and andand improved improvedimproved datasets. datasets.datasets.

Figure 19. The full revision procedure: 3-steps fusion, integration and reminder. FigureFigureFigure 19.19. TheThe 19. The fullfull full revisionrevision revision procedure:procedure: procedure: 3-step3-step 3-stepsss fusion, fusion,fusion, integration integrationintegration and andand reminder. reminder.reminder.

Figure 20. The improved revision procedure: Wikipedia querying, integration and reminder. FigureFigureFigure 20.20. TheThe 20. improvedimprovedThe improved revisionrevision revision procedure:procedure: procedure: WikipeWikipe Wikipediadiadia querying, querying,querying, integration integrationintegration and andand reminder. reminder.reminder.

FigureFigure 21. The 21. assistedThe assisted computer-assisted computer-assisted reconstruction reconstruction procedureprocedure (CARP) (CARP) visual visual decision decision to geo-coding to geo-coding nodes. nodes. FigureFigure 21.21. TheThe assistedassisted computer-assistedcomputer-assisted reconstructionreconstruction procedprocedureure (CARP)(CARP) visualvisual decisiondecision toto geo-codinggeo-coding nodes.nodes. Only the last step (Figure 21) is “manual/visual”, although assisted by some direct or OnlyOnly thethe lastlast stepstep (Figure(Figure 21)21) isis “manual/visual”,“manual/visual”, althoughalthough assistedassisted byby somesome directdirect oror reverse geocoding, depending on what is available first (toponym, or closest geometry). reversereverse geocoding,geocoding, dependingdepending onon whatwhat isis avaavailableilable firstfirst (toponym,(toponym, oror closestclosest geometry).geometry). The dataset built at this step is named “CARP-main”. TheThe datasetdataset builtbuilt atat thisthis stepstep isis namednamed “CARP-main”.“CARP-main”.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 24 of 37 ISPRSISPRS Int.Int. J.J. Geo-Inf.Geo-Inf. 2021,, 10,, xx FORFOR PEERPEER REVIEWREVIEW 25 of 38

Only the last step (Figure 21) is “manual/visual”, although assisted by some direct or 4.2.2.reverse Addition geocoding, of New depending Lignesand on what Gares, is available from VGI first (toponym, or closest geometry). The dataset built at this step is named “CARP-main”. A gare existsexists onlyonly ifif aa ligneligne existsexists toto whichwhich thatthat gare belongs.belongs. MostMost ofof thethe ‘secondary’‘secondary’ ligneslignes4.2.2. areare Addition notnot recordedrecorded of New inin Lignesand publicpublic datasets,datasets, Gares, from andand VGI havehave nono officialofficial num code.code. VGIVGI hashas recordedrecorded many ofA them.gare exists Depending only if a onligne whatexists can to whichbe found, that gare variousbelongs. routines Most of may the ‘secondary’apply. lignesFiguresare not 22 recordedand 23 sketch in public the datasets, three step and haves of a no possible official num full code. software VGI has chain: recorded many of them. Depending on what can be found, various routines may apply. (a)(a) fromfromFigures aa ligneligne 22:: andlistlist of23of nodessketch (routines(routines the three steps executedexecuted of a possible onon thethe full server):server): software chain: f[i] = {“label”:”l”, “num”:”n”, “km”:”k”, “info”:”…”}, “geometry”:null; (a) from a ligne: list of nodes (routines executed on the server): f[i] = {“label”:”l”, “num”:”n”, (b)(b) forfor “km”:”k”,eacheach node “info”:”:: askask aa geocoder,geocoder, . . . ”}, “geometry”:null; asas inin FigureFigure 20,20, oror visual inspection as in Figure 21: (b)f[i].geometryfor each node: = ask {“type”:”Point”, a geocoder, as in Figure “coordinates”:[lon,lat]}; 20, or visual inspection as in Figure 21: (c)(c) forfor eacheachf[i].geometry node withwith = {“type”:”Point”, geometry (Figure(Figure “coordinates”:[lon,lat]}; 23),23), checkcheck nom andand insee,, checkcheck km strictstrict order,order, (c)checkfor label each node uniquenessuniquenesswith geometry inin CARPCARP (Figure dataset,dataset, 23), check checkcheck nom geometry and insee, deviationdeviation check km strictifif aa garegare or- ofof thethe sameder, label check existsexists labeluniqueness already.already. ReportReport in CARP controlcontrol dataset, resultresult check inin geometry info:: deviation if a gare of {f[i].nom,the same f[i].insee} label exists already. = getCommuneFromPoint(f[i].geometry?.coordinates); Report control result in info: {f[i].nom, f[i].insee} = f[i].infogetCommuneFromPoint(f[i].geometry?.coordinates); += “ (commune_ok | commune_inconsistent) f[i].info += (km_checked) “ (commune_ok …”; | commune_inconsistent) (km_checked) ... ”; The dataset built at this step is named The“CARP-VGI”. dataset built at this step is named “CARP-VGI”.

FigureFigure 22. thethe 22. semi-automatedsemi-automatedthe semi-automated CARP-VGICARP-VGI CARP-VGI forfor integrating integratingintegrating a new aa newnewligne lignelignein the inin network. thethe network.network.

FigureFigure 23. 23. TheThe CARP CARP posterior quality quality control. control. 4.3. Posterior Quality Control, Revealing Additional Errors 4.3. Posterior4.3.1. Enacting Quality Quality Control, through Revealing Constraint Additional Checking Errors 4.3.1. EnactingThe automated Quality constraint through checking Constraint is performed Checking a posteriori, checking: 1.Thea automatedgare has a label, constraint a num, and checking a km, identifying is performedperformed it and aa the posteriori,posteriori,ligne to which checking:checking: it belongs; 2. all gares sharing the same num, have their km in strict increasing order, from the gare 1. a gare has a label, a num, and a km, identifying it and the ligne to which it belongs; 1. a garetagged has a origin to, the a gare, taggedand a terminus,, identifying what means it and at the least ligne two togares whichper ligne it belongs;; 2. 3.all garesa gare sharingsharingwith an insee thethecommune samesame numcode,,, havehave has its theirtheir geometry-point km inin strictstrict inside increasingincreasing the geometry-polygon order,order, fromfrom thethe gare taggedtaggedof that origincommune toto ;thethe every garegare taggedtaggedlocated terminus in France,, can whatwhat receive meansmeans an insee atat leastleastcommune twotwo code;gares perper ligneligne;; 3. a gare withwith anan inseecommune code,code, hashas itsits geometry--point insideinside thethe geome- try-polygon ofof thatthat commune;; everyevery gare locatedlocated inin FranceFrance cancan receivereceive anan in- seecommune code;code; 4. a garelabel isis consistentconsistent withwith itsits communenom andand insee.. QuiteQuite oftenoften aa label isis thethe name of an existing commune (sometimes(sometimes 22 communes names).names). Rules 1–4were implemented, It helped to reporting a few dozen inconsistencies, e.g.: about 60 ambiguities, and a dozen duplicates were identified and corrected. Only a few

ISPRS Int. J. Geo-Inf. 2021, 10, 154 25 of 37

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR 4.PEERa REVIEWgarelabel is consistent with its communenom and insee. Quite often a label is the 26 of 38 name of an existing commune (sometimes 2 communes names). Rules 1–4 were implemented, It helped to reporting a few dozen inconsistencies, e.g.,: about 60 ambiguities, and a dozen duplicates were identified and corrected. Only a few havinghaving required required a manuala manual correction, correction, and onlyand only one was one interpreted was interpreted as an arbitrary as an arbitrary error error identifiedidentified so so far. far. Some Some cases cases revealed revealed a ‘fork-pattern a ‘fork-pattern’ again,’ again, such as such this example:as this example: {“label”:”L{“label”é:”Lézan”zan”, “nom”:”L, “nom”:”Lézan”ézan”, . . . “coordinates”:[4.11488,, …“coordinates”:[4 44.04241]},.11488, 44.04241]}, CheckingChecking coordinates, coordinates, rule rule 4 answers: 4 answers: nom =nomVé z=é nobresVézénobres, about, about 7 km East7 km of EastLézan of. Lézan. TheThe Wikipedia Wikipedia Infobox:sch Infobox:schémaéma (Figure (Figure 24) indicates 24) indicates the fork thejust fork before just the before viaduct. the viaduct.

FigureFigure 24. 24.Quality Quality control control after after automated automated identification identifica oftion a geographic of a geographic inconsistency inconsistency (rule 4). (rule 4).

Result:Result:fork-pattern fork-patternagain, again, discovered discovered by the by final the computer-assistedfinal computer-assisted quality control.quality control.

4.3.2. A Posteriori Control of the Reconstructed Network 4.3.2. A Posteriori Control of the Reconstructed Network Meta-analysis of the CARP dataset is compared (Tables 15 and 16) with previous data. Meta-analysis of the CARP dataset is compared (Tables 15 and 16) with previous Table 15. Comparingdata. breakdown of junction-gares, by distances between their nodes.

Intervals (nb.Nodes)Table [0, 150 15. m] Comparing % breakdown [150, of junction-gares 300] [300,, by 1 distances km] [1 between km, 2 km] their nodes. [2 km, ∞] 2014 (642 nodes/6442) 550 85.7 48 32 6 6 [0, 150 [150, [300, 1 [1 km, 2 [2 km, 2017 (1360 nodes/7702)Intervals 541 (nb.Nodes)40.0 184% 375 157 103 m] 300] km] km] ∞] CARP (2116 nodes/10,747) 1418 67.0 184 344 115 55 2014(642 nodes/6442) 550 85.7 48 32 6 6 2017(1360 nodes/7702) 541 40.0 184 375 157 103 Table 16.CARPComparing(2116 nodes/10747) breakdowns of nodes1418 according 67.0 to their count184 per ligne. 344 115 55 Datasets Isolated Gare % 2-Gares Ligne 3 to 7-Gares 8 to Max- (Total) Table 16. Comparing breakdowns of nodes according to their count per ligne. 2014 num 94 15.8 62 191 130 593 2017 num 107 14.6 Isolated76 2452-Gares 171 734 Datasets % 3 to 7-Gares 8 to Max- (Total) CARP num 50 5.2 163Gare 259Ligne 486 958 2014 num 94 15.8 62 191 130 593 The2017 result num for CARP is everywhere 107 14.6 better than76Sncf2017 , and 245 in particular171 the lignes 734 are muchCARP better num represented with 50 CARP5.2. 163 259 486 958

5. DiscussionThe result for CARP is everywhere better than Sncf2017, and in particular the lignes are muchThis paper better has represented focused on with designing CARP a. computer-assisted reconstruction procedure with the goal to gather as many gares as possible from any kind of available source of information.5. Discussion Several software solutions were considered and then abandoned forrequiring too much development (concurrent cascading queries), or for having too little expected benefitThis (deep paper machine-learning has focusedgare on detection,designing based a computer-assisted on already learned reconstructiongare locations). procedure withBecause the goal the to spectrum gather isas rather many large, gares we as need possible some from guidelines any kind to assess of available the overall source of methodinformation. from an Several organizational software point solutions of view (Section were considered 5.1). Then, to and open then some abandoned perspectives forrequir- ining possible too much applications, development one example (concurrent is given incascad Sectioning 5.2 queries),, and in European or for having cooperation, too little ex- becausepected itbenefit was the (deep initial machine-learning motivation of starting gare this detection, work: to understand based on howalready a century learned of gare lo- Frenchcations). transportation policies has led to shrinkage of the national coverage so much, what Because the spectrum is rather large, we need some guidelines to assess the overall method from an organizational point of view (Section 5.1). Then, to open some perspec- tives in possible applications, one example is given in Section 5.2, and in European co- operation, because it was the initial motivation of starting this work: to understand how a century of French transportation policies has led to shrinkage of the national coverage so much, what shifts cargo transportation to roads, and what scientific arguments to reverse that policy trends are important. This can be grounded on historical analysis, in com- parison with other European countries that have, or have not, followed the same trend in

ISPRS Int. J. Geo-Inf. 2021, 10, 154 26 of 37

shifts cargo transportation to roads, and what scientific arguments to reverse that policy trends are important. This can be grounded on historical analysis, in comparison with other European countries that have, or have not, followed the same trend in rail transportation policies (the author’s feeling is that important differences can be revealed).

5.1. Theoretical Issues in Building a Comprehensive Dataset from Internet Sources The problem of building a dataset from the vast amount of information available on the Internet has been theorized, noticeably in the medical sector, for instance by Sreenivasaiah et al. [36], which lists four categories of a total 11 issues for integrating information. This present work is confronted with at these categories.

5.1.1. Data Representation and Standards; Interoperability; Data Visualization; Issues in the Development of Tools These four issues are very critical in geographic information in general, because geography by itself is a reservoir of very strong constraints, which must be consistent with all other constraints: e.g., linear referencing, digraph connectivity, etc. whichhas been studied extensively for decades, leading to the definition of standards by INSPIRE or OGC [28,29], or implementations as in OSM [37]. The CARP dataset compliance with such standards should be more extensively checked, whichwould help interoperability [38], in particular if considering other European past railway datasets. For instance, Linear referencing (km attribute), and Network interconnections (6=gares-same node “tolerance”), are explicit terms of INSPIRE Transport Networks. Moreover, adding time constraints, as happens with a historical railway network, should also be treated more carefully: so far, only an “end” (validTo) date is provided, when a gare is no longer active. We indirectly use the standardization efforts of the UIC (Section 2.4): today railway ontology in Wikipedia benefits from it for keepinghomogeneity in representation, visualization and associated tools (e.g., Infobox/BS-tables). However, toponymy is much less standardized: changes in names and boundaries of administrative objects is accelerating in France, without real retrospective help for data reconstruction (e.g., simple list of communes merging, see Section 2.3). Overall, interoperability should be improved with a better compliance with INSPIRE, especially if a European collaboration is foresighted. Data visualization: has been limited to the use of geojson, which can be displayed on line, by many websites, and which is easily read as an open source. Examples are displayed with OpenLayers, on top of OpenStreetMap tiles. Issues in the development of tools: the code is in JavaScript, well suited to direct interaction with the web, with web-workers (what allows some parallelism if the computer supports this), but the maximum number of actually possible Internet requests is rapidly reached on a personal computer. Any perspective of development would imply working on a stronger architecture more adapted to Big Data. We have not yet started research in that direction although this is a bottleneck issue.

5.1.2. Data Quantity; Computation Intensivity These two issues are met all along this work: we are working with about 1000 lignes 10,000 gares, 35,000 communes and a gare-commune join requires 350 M possibly complex tests. For instance, when seeking for insee codes of the gares, the process (point-in-polygon) should, and can, be parallelized for 90% of data, because most gares of a same ligne are in the same area (5 or 6 large parts of France), excepting a few very long lignes (e.g., Paris-Nice). The technology of web-workers has been used successfully on a personal computer with 4 processors. In the Wikipedia realm, it is rather easy to obtain gares from a ligne, then posting asynchronous queries about coordinates of these gares, which is computationally difficult beyond a dozen concurrent gares: see above comment about better Big Data access (Section 5.1.1). In the crowd of VGI, where the most sophisticated structure isoften a list of names, the rule is to work “schemaless”, as proposed in [39], possibly after a text analysis looking for words ligne and gare (in French, or other language of course), and a “text pattern” ISPRS Int. J. Geo-Inf. 2021, 10, 154 27 of 37

identifying a list, then deriving for instance a ligne–gare relationship. Important work is yet to be done in the future, in particular if aiming at handling more European countries.

5.1.3. Data Quality; Version Control Undoubtedly big issues. Data quality issues are addressed in almost every stage of this work, for different quality components: - Attribute accuracy: label (cf. both toponym-pattern and fork-pattern); - Logical consistency e.g., label, uic and (num,km) consistency (cf. ambiguities, dupli- cates); - Completeness: e.g., creating missing km by interpolation; - Positional accuracy: cf. fork pattern, computer-assisted visual geocoding; - Time period and contemporaneousness: e.g., comparing capa/gare with mnemo and validTo/ligne. Most standard quality metadata components are concerned and have been used, and a posteriori quality control is largely automated by the systematic constraint checking. Version control should be better taken into account, in order to track: - the evolution of the CARP dataset: a Git-type solution would improve it; - the lineage of data used in setting attribute values. Some retrospective information could be retrieved, partially, for that lineage mixes software and non-reproducible manual steps.

5.1.4. Data Availability; Data Access; Security These issues are not really addressed in this work. We are considering sources only from public datasets (excepting an author copy of a dataset that is no longer on-line), or Internet accessible open data, and the outcome is archived on a public portal. Together with version control, these aspects should be given more attention.

5.2. A Dataset for Further Social Sciences Studies The CARP resulting dataset is open, accessible on a public website, and is intended as a tool, for geographers and transportation specialists to further investigate how long-term trends form and evolve. The present version (January 2021) is about 80% complete (?), but compared to the previous public datasets, already shows an increase in all total counts: nodes: +56%, lignes (6=num): +30%, communes (6=insee): +39%. One typical question used as an example of the opportunity to reconstruct this past network, was: Does a commune possess a gare? Today? A century ago? The CARP dataset can list, and count the number of such gares (Table 17).

Table 17. Communes reached by at least one gare according to public data versions and CARP.

Sncf2014 (origin 2014) 5266 14.7% Sncf2017 (origin 2014) 5279 14.7% % of the 35,798 communes Sncf2020 (update 2020) 3112 8.7% (updated year-2019) CARP (as of Jan.2021) 7341 20.5%

To date, 20% of communes, have been checked as having been connected to the railway network, at least by a simple stop for some time between 1920 and now. This is already significant material for demographic, socio-economicand even electoral studies. Almost 4000 additional nodes have been geocoded and quality checked. Also in CARP: every national border crossing with neighbor countries (68 such places), showing how European rail liaisons, from or to France have evolved since WWII (Figure 25). ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 29 of 38

Also in CARP: every national border crossing with neighbor countries (68 such places), showing how European rail liaisons, from or to France have evolved since WWII (Figure 25). In particular, surprisingly, some French gares are still in operation only because they are served by foreign trains from Luxembourg (Audun-le-Tiche, Volmerange), or Ger- many ( to Sarreguemines) or Switzerland (Line 10 of the Basel tramway). This CARP dataset, can also confirm broadly to some local patterns mentioned in the literature [2–3,29,40], for instance the lignes that have been closed just before WWII in Region Bretagne, or in the late 1990s.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 These observations are intended to be studied further in collaboration 28with of 37 transport geographers, including those at a European level. Already cited papers [2,7] demonstrate that there is an interest in developing such research in Europe.

Figure 25. Display of rail border crossings (bigger dots) North-EastNorth-East ofof France.France. Color indicate which border points are presently open (green).

5.3. SimilarIn particular, Past Railway surprisingly, Network someReconstruction French gares for Europeanare still in Countries operation only because they are served by foreign trains from Luxembourg (Audun-le-Tiche, Volmerange), or Though the present work has been specifically developed, for France, several as- (SaarBahn to Sarreguemines) or Switzerland (Line 10 of the Basel tramway). pects demonstrate that similar work can be undertakenin other countries. This CARP dataset, can also confirm broadly to some local patterns mentioned in For instance, a Wikipedia entry for a same ligne or gare, may exist in many lan- the literature [2,3,29,40], for instance the lignes that have been closed just before WWII in guages, because the different Infobox-templates, share a common semantics for a ligne Region Bretagne, or in the late 1990s. (e.g., num, gauge…) translated in spoorlijn, or a gare (e.g., label, ...) translated in station. These observations are intended to be studied further in collaboration with transport Example: (all accessed 8 March 2021) geographers, including those at a European level. Already cited papers [2,7] demonstrate https://nl.wikipedia.org/wiki/Spoorlijn_Audun-le-Tiche_-_Hussigny-Godbrange that there is an interest in developing such research in Europe. https://fr.wikipedia.org/wiki/Ligne_d’Audun-le-Tiche_à_Hussigny-Godbrange and https://nl.wikipedia.org/wiki/Station_Audun-le-Tiche5.3. Similar Past Railway Network Reconstruction for European Countries https://fr.wikipedia.org/wiki/Gare_d’Audun-le-TicheThough the present work has been specifically developed, for France, several aspects demonstratePresentation that similarmay differ work (Figure can be 26), undertakenin but the underlying other countries. semantics is quite the same, what Forwould instance, make atranslatable Wikipedia entrymost forof the a same proceduresligne or presentedgare, may existin this in paper. many languages, because the different Infobox-templates, share a common semantics for a ligne (e.g., num, gauge ... ) translated in spoorlijn, or a gare (e.g., label, ...) translated in station. Example: (all accessed 8 March 2021)

https://nl.wikipedia.org/wiki/Spoorlijn_Audun-le-Tiche_-_Hussigny-Godbrange https://fr.wikipedia.org/wiki/Ligne_dAudun-le-Tiche_a_Hussigny-Godbrange and https://nl.wikipedia.org/wiki/Station_Audun-le-Tiche https://fr.wikipedia.org/wiki/Gare_dAudun-le-Tiche Presentation may differ (Figure 26), but the underlying semantics is quite the same, what would make translatable most of the procedures presented in this paper. ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 30 of 38

In Belgium, some lignes, near the border with France, have been processed directly with wikiInfoBox., e.g., https://fr.wikipedia.org/wiki/Ligne_73_(Infrabel). In Germany routemap tables are used instead, instead of BS-tables, e.g.: ISPRS Int. J. Geo-Inf. 2021, 10, 154 https://de.wikipedia.org/wiki/Stuttgart_Hauptbahnhof, uses Infobox_Bahnhof29 of 37a similar template, whose Strecken entry (= list of lignes), has a different syntax.

Figure 26. Internationalization of the Wikipedia-InfoboxesInfoboxes forfor ligneslignes ((left),left), and gares ((right).right).

OtherIn Belgium, templates some havelignes been, near developed the border for withmost France,European have countries, been processed also Russia, directly Chi- na,with … wikiInfoBox., (total 22 different), e.g., https://fr.wikipedia.org/wiki/Ligne_73_(Infrabel whichmeans that analogs to the wikiInfobox routine). (Section In Germany routemap tables are used instead, instead of BS-tables, e.g.,: https://de. 3.1.2), tailored for the French template, can be developed, at the cost of some adaptations. wikipedia.org/wiki/Stuttgart_Hauptbahnhof, uses Infobox_Bahnhof a similar template, whose Strecken entry (=list of lignes), has a different syntax. 5.4. Conclusions Other templates have been developed for most European countries, also Russia, China, ... (totalMaking 22 different), observations whichmeans about the that long analogs evolut toion the of wikiInfoboxthe network routine is important, (Section in 3.1.2 par-), ticulartailored for for environmental the French template, policies. can be developed, at the cost of some adaptations. The deviation between the EU White Paper objectives [4], and decisions made by member5.4. Conclusions states cannot be understood without understanding in turn the combination of multipleMaking factors, observations by multiple about actors, the longover evolution several decades. of the network In France, is important, rail carried in 9.9% particu- of inlandlar for environmentalfreight transport policies. in 2018, versus 18.7% for average EU [41], while the EU report observesThe that deviation road freight between is about the EU 100 White times Paper more objectives polluting [than4], and rail, decisions and that madebetween by 2008–2013member states road cannot freight be increased understood from without 70% to understanding 72%, which was in turnopposite the combination to the goal of shiftingmultiple 30% factors, freight by from multiple road actors, to rail/maritime over several by decades.2030. In France, rail carried 9.9% of inlandAn freight EU recent transport update in [42] 2018, states: versus “Natio 18.7%nal for projections average EU compiled [41], while by thethe EUEuropean report Economicobserves thatArea road (EEA) freight suggest is about that 100transpor timest moreemissions polluting in 2030 than will rail, remain and that above between 1990 levels,2008–2013 even road with freightmeasures increased currently from planned 70% toin 72%,Member which Stat wases. Further opposite action to the is needed goal of particularlyshifting 30% in freight road fromtransport, road tothe rail/maritime highest contributor by 2030. to transport emissions, as well as aviationAn EUand recent shipping update …”, [ 42whichwe] states: can “National read as: projections rail is the only compiled transportation by the European mode able Eco- tonomic significantly Area (EEA) decrease suggest CO that2 and transport greenhouse emissions gas (GHG) in 2030 emissions. will remain above 1990 levels, evenThe with first measures sentence currently of the EU planned report’s in Memberexecutive States. summary Further states: action “ The is needed fact that particu- only limitedlarly in roaddata transport,are available the and highest that contributor the impacts to of transport most of emissions,the initiatives as well cannot as aviation yet be observedand shipping do not... ”,allow whichwe the proper can read assessmen as: railt is of the the only effectiveness transportation of the mode measures able to significantly decrease CO2 and greenhouse gas (GHG) emissions. The first sentence of the EU report’s executive summary states: “ The fact that only limited data are available and that the impacts of most of the initiatives cannot yet be observed do not allow the proper assessment of the effectiveness of the measures adopted so far and their contribution to reaching the goals “. Therefore, better long-term data could help the EU to better measure the adoption of its recommendations by the member states, and better understand what hurdles exist on the route. ISPRS Int. J. Geo-Inf. 2021, 10, 154 30 of 37

Funding: This research received no specific external funding. Institutional Review Board Statement: Not applicable: no humans or animals involved in this study. Informed Consent Statement: Not applicable: no humans involved in this study. Data Availability Statement: The data presented in this study can be found here: https://www.data. gouv.fr/fr/datasets/gares-du-reseau-ferroviaire-francais-en-service-fermees-ou-disparues/ (accessed date 8 March 2021). Acknowledgments: Work initiated with the help of the sophomores 2018 at IUT-Marne-la-Vallée, and pursued with the encouraging comments from IGN (seminar on patrimonial data), SNCF-Réseau and several VGI contributors. First version of the manuscript submitted in August 2020, extensively rewrote thanks to the acute, thorough and highly relevant comments of two rounds of reviewers, who deserve heartfelt thanks. Conflicts of Interest: The author declares no conflict of interest.

Appendix A. Code Description for Routines Mentioned in the Text Routines A1–A7 are components of the meta-analysis and fusion, both methods applied to the public datasets. and meta-analysis applied to the CARP result. Routines A8–A15 are components of methods for gathering data from VGI, and assisting location geocoding from remote-sensing visual inspection.

Table A1. The metaAnalysis routine (pseudo JavaScript).

const metaAnalysis = (ff) => { /* maps nodes by same value of single attribute: (f.properties[key] = v) or couples of attributes: [k1,k2] with (v,w) */ }; metaAnalysis(gares); // analyzes quality based on keys and values

Table A2. The disDuplicate routine.

const disDuplicate /* marks “duplicate”= to be removed / to be kept */ gares.map(f =>disDuplicate(f))

Table A3. The slugify routine.

const slugify = (lib) => { /* light version, more rules may apply */ const dia = “éèêàâ...”, nodia = “eeeaa...”; /* diachritics */ lib.replace(“St-”,”Saint-”); /* also:“Ste-”,”Sainte-” */ lib = lib.toLowerCase(lib); dia.split().forEach((x,i) =>lib.replace(x,nodia[i]) /* removes */ return lib;} const similName = (fp, gp) => slugify(fp.label) == slugify(gp.label);

Table A4. The specific disAmbiguous routine.

const disAmbiguous /* adds the correct department, marks “disambig” */ gares.map(f =>disAmbiguous(f));

Table A5. The geom_ify routine.

const sameuic = (uicf) => /* lists nodes whose uic = uicf, with geometry */ const bestgeo /* choose best geometry among list, or null */ const geom_ify /* if(bestgeo(list)) return (f.geometry = geo) or void */ gares.filter(hasGeometry).map(f =>geom_ify(f, sameuic(f.properties.uic)); ISPRS Int. J. Geo-Inf. 2021, 10, 154 31 of 37

Table A6. The checkIsolated routine.

const uniques = (ff) => /* array of nodes whose num is unique */ const checkIsolated = (f,ff) => { const garesOnNum = (ff,num) =>ff.filter(f =>f.properties.num == num); const is_closest = (f,g,cum) => /* returns g, or cum, if closest to f */ const getClosest /* extracts: is_closest(f,g,cum) for each node */ f.properties.unic = getClosest(f, garesOnNum(ff, num)); return f;} uniques(ff).map(f =>checkIsolated (f,ff));

Table A7. The fusionTags routine.

const closestNode /* gets: closest node to f (f,old,new) */ const breaks = [ D1, D2, D3, D4, Infinity]; /* thresholds */ const pos = [“AP”,”PP”,”und”,”und”,”SN”], neg = [”SP”,”und”,”und”,“PN”,”AN”]; const dd = /* geographic “greatCircle distance” in meters */ const tt = /* starts with “und”, then update tag[i] depending on: similar label and rank in breaks[i-1] */ const fusionTags /* returns {fuse: tt; dist: dd; closest: closestNode } */ const garesA = fetch(“A_dataset”), garesB = fetch(“B_dataset”); garesB.filter(hasGeometry).map(f =>fusionTags (f,closestNode(f,garesA)));

Table A8. Routine prefixWithGare for preparing Wikipedia API to get “article” coordinates.

/* Wikipedia English sites postfix any toponym N →N_railway_station, French sites have a variety, e.g.,: Gare_de_Martigues, Gare_d’Audun, Gare_des_Aubrais, Gare_du_Havre ... */ const prefix = [Gare_de_, Gare_d’, Gare_des_, Gare_du_]; const prefixWithGare = N => prefix[correctIndexOf(N)] + articleRemoved(N);

Table A9. Routine wikiCoordinates: get geometry of “topon” from Wikipedia.

const Q1 =“.../api.php?action=query&prop=coordinates&format=json&titles=“; const ccOf /* get coordinates */ const ggOf = (cc,k) => cc? ({“coordinates”: cc, ”quality”:k}): null; const P1 = fetch(encodeURI(Q1 + prefixWithGare(topon)) .then(a =>a.json()).then(b => !b.contains(“missing”)? ggOf(ccOf(b.query.pages),”ok”): fetch(encodeURI(Q1 + topon)) // no prefix .then(a =>a.json().then(b =>ggOf(ccOf(b.query.pages),”approx”)); /* then use that promise P1 in combination with other promises */

Table A10. Routine wikiInfobox: parsing a simple Wikipedia Infobox (no tables) for “topon”.

const Q2 = Q1.replace(“prop=coordinates”, “prop=revisions&rvslots=*&rvprop=content&formatversion=2”); const boxOf = json =>json.query.pages[0]?.revisions[0]?.slots?.main?.content; const addItem /* seek for item in boxOf */ const parseBox = (box, list) => box &&list&&list.length> 0 && list.reduce((acc,x) =>addItem(acc,x,box), {}) || ({“missing”:true}); const P2 = fetch(encodeURI(Q2 + topon)).then(a =>a.json) .then (b => parseBox(boxOf(b), [“latitude”, “longitude”, “insee”, “lignes” /* and as many as available */ ]); ISPRS Int. J. Geo-Inf. 2021, 10, 154 32 of 37

Table A11. Routine mergeLigneInfo: get additional info (e.g., dates) from VGI.

const A = “URI_of_LignesSNCF”, B = “URI_of_CSV_file”; function CSV2Json(txt) /* returns JSON from CSV [{num, enddate},..] */ function mergeLigneInfo(a, b){ const match = (s,r) => r.enddate? (s.enddate= r.enddate&& s): s; return a.map(s => match(s, b.find(r =>r.num === s.num)[0])) } const P3 = Promise.all([fetch(A).then(a=>a.json()), fetch(B).then(a=>a.text()).then(CSV2Json)]) .then(([a,b])=> mergeLigneInfo(a,b))

Table A12. Routine stationsAlongLine: from VGI list of gares to list of geojson nodes.

const SST = ‘semi_structured_text‘, Q2 = “cf A.10”, num = “unique_number”; function prefixLigne (A,B) { /* returns “Ligne_de_A_à_B” . . . */ } function SSTtoJSON /* returns {origin, terminus, length, gares} from SST */ function stationsAlongLine (json, num) { const km = /* interpolates km according to length and rank */ const info = /* sets: “origin” or “terminus” or ... */ const properties = {“label”:json.gares[i], num, km, info}; return json.gares.map((s,i,t) => ({properties, “geometry”:null}); } const json = SSTtoJSON(SST); /* check if line exists, apply wikiBSTable, or stationsAlongLine */ fetch(Q2 + prefixLigne(json.origin, json.terminus))).then(a =>a.json()) .then(b => wikiBSTable(boxOf(b)), /* resolve with Routine */ b =>stationsAlongLine (json, num)); /* process the reject */

Table A13. Direct geo-location of an expected gare around a given commune toponym.

/* GEO-LOCATION: open a window with RLT around toponym location – if found */ const RLT = “https://remonterletemps.ign.fr/comparer/basic?mode=doubleMap” + “&layer1=ORTHOIMAGERY.ORTHOPHOTOS.1950-1965” + “&layer2=GEOGRAPHICALGRIDSYSTEMS.MAPS.SCAN-EXPRESS.STANDARD”; /* then locate visualy */

Table A14. Reverse geo-location from a given (lon-lat) couple of coordinates.

/* REVERSE GEO-LOCATION: open window at mkURI(lon,lat) */ const RLT = “as above”; const nom = reverseGeocode(lon,lat); // fromany online geocoder (or via API) const mkURI = (x,y) => RLT + “&x=“+x +”&y=“+y +”&z=20”; /* then locate visualy */ ISPRS Int. J. Geo-Inf. 2021, 10, 154 33 of 37

Table A15. Routine wikiBSTable: from Wikipedia list of gares to list of geojson features.

/* const Q2, constboxOf(json) from A.10 */ function prefixLigne (A,B) /* returns “Ligne_de_A_à_B” or “Ligne_d’A . . . */ function parseBStable(box){ /* get relevant items using parseBox(box, itemlist) */ const share = parseBox(box, [“num”, . . . , “schema”, “schema2”]); if(share.missing) return null; // page doesn’texist if(share.schema) return processBStable(share.schema); if(share.schema2) /* thereis an indirection to another page */ returnfetch(schema2).then(a =>a.json()).then(processBStable); //recursion! } constlineName = prefixWithLigne(“Audun-le-Tiche”, “Hussigny-Godbrange”); const P3 = fetch(encodeURI(Q2 + lineName)).then(a =>a.json()) .then(b =>parseBStable(boxOf(b)), b =>reject (b, “some comment”));

Appendix B. The Toponym-Pattern, the Fork-Pattern, and Graph Representation Issues Two recording patterns, not documented in the public datasets, were causing headaches until it has been possible to analyze them, and to name them, respectively the toponym- pattern and the fork-pattern.

Appendix B.1. Revealing “Toponym-Patterns” from Suspect Fusion Examples There are multiple causes of toponymy ambiguities. With French toponyms, here are some usual ones: • accents: e.g., Benauge is sometimes recorded with an accent Bénauge; • abbreviations: e.g., St-Paul instead of Saint-Paul; • non-standard way of associating two names, e.g., the gare “Ermont-Eaubonne” is between two communes in Val d’Oise, sometimes written Ermont-Eaubonne (hyphen w/o space); • non-uniqueness of a commune name: e.g., Cernay exists in five French départements. To distinguish them, add a department name: Cernay (Haut-Rhin); • commune merging (2500+ merges in 2015–2019, versus 50 splits) in France, implies using a lexicon (Insee-geo) for converting to today’s nom and insee values; • also: Paris-Nord can be Paris Gare du Nord and many similar cases of a base-name plus a specifier (no rule applies). The accents, extra spaces, and “St.e” versus “Saint.e”, and space around hyphen, can be handled using what is named a “slug” version of the toponym. Also, a specific lexicon (department names) can be used to “specify” a toponym in order to removing ambiguity (the case of Cernay, La Joux, etc.). These two techniques are mixed into slugify and disAmbiguous routines (Annex.A) useful when confronting gares and communes datasets. However, the negation slug(a)6=slug(b) does notprove that related gares are really differ- ent. More complex string distances (Hamming or Levenshtein), did not prove satisfactory on the remaining ambiguities, which are too scarce, to deserve specific software development. Besides the cases of Cernay/Cernay (Haut-Rhin) and other department-specified toponyms, let us recall some other examples of suspected false negatives (SP) in Table A16.

Table A16. Cases for false negatives out of the (2014, 2017) direct pairs (top); and lists (bottom).

Cause Label2017 Dist(m) Label2014 Ambérieu-en-Bugey 0 Ambérieu SP: Arras 0 Achicourt label 6= Paris-Nord-Surface 0 Paris-Nord-Souterraine and Bainville-sur-Madon 0 Bainville dist ≤2 m. Paris-Austerlitz 1 Paris-Austerlitz (Banlieue) ISPRS Int. J. Geo-Inf. 2021, 10, 154 34 of 37

Question is: Why different names at a same place? ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER SomeREVIEW pairs: Ambérieu-en-Bugey/Ambérieu, Bainville-sur-Madon/ Bainville, Crécy- 35 of 38

en-Brie-La Chapelle/Crécy-La Chapelle, Vernon (Eure)/Vernon-Giverny, ... have a “base” name with or without a “specifier”. The difference is explained by a change in the name, as it occurs from time to time. The pair should be “fusionable”. “typespecifierOther pairs:”: a Blanzy/Blanzy-Canal, French word (Ville, Ch âCanal,teaulin/Ch Banlieue,âteaulin-Ville, Surface Huningue/Huningue …), or an acronym (IE, MIN,(IE), Paris-Nord-Surface/ TT, TER, TGV …), Paris-Nord-Souterraine, pointing the specific Argenteuil-Triage/Le function of a nearby Val-d’Argenteuil, but different gare (e.g., cargo-only.St-Césaire/St-C vs.é sairevoyager, (MIN) or... -trainhave a precise. vs. “conventionaltypespecifier”: a train). French Then, word (Ville, both Canal,nodes of the pair shouldBanlieue, be Surface kept as... “integrable), or an acronym”. (IE, MIN, TT, TER, TGV ... ), pointing the specific functionWe ofhave a nearby now butan identified different gare list:(e.g., accents, cargo-only. Saint.e vs. abbreviations, voyager, or tram-train. spaced-hyphens, vs. three conventional train). Then, both nodes of the pair should be kept as “integrable”. kindsWe of have basename now an - identified specifier list: association accents, Saint.e (department, abbreviations, new spaced-hyphens, name, gare-type), three which may explainkinds of toponym basename -changes. specifier associationWe name that (department, recording new pattern name, gare-type),thetoponym-pattern which may. explainTo toponymdistinguish changes. cases We and name resolve that recording pairs either pattern “fusionable thetoponym-pattern” or “integrable. ” is difficult.The solutionTo distinguish is to undertake cases and manual resolve editing, pairs either or to “fusionable keep both” or nodes “integrable as if” really is difficult.The different. solution is to undertake manual editing, or to keep both nodes as if really different. Appendix B.2. Revealing “Fork-Patterns” from Suspect Fusion Examples Appendix B.2. Revealing “Fork-Patterns” from Suspect Fusion Examples label LetLet usus namenamefork-pattern fork-pattern, the, the recording recording pattern pattern consisting consisting in attributing in attributing the label the of someof some garegare, to, the to the last last (active) (active) nodenode locatedlocated where where the the ligneligne forksforks towards towards that gare.. This is the caseThis with is the the case INSPIRE with the illustration INSPIRE illustration in Figure in 2: Figure a RailwayNode2: a RailwayNode just beforejust a before RailwayStationNode a . RailwayStationNodeThe case Bauvin-Provin. is also a fork-pattern: a former, partly dismantled ligne, going to thatThe gare case, ends Bauvin-Provin then (at recording is also a fork-pattern date) at: aa former,forknode partly, therefore dismantled this ligneis the, going last fork before to that gare, ends then (at recording date) at a forknode, therefore this is the last fork before Bauvin-Provin Saintes Bauvin-Provin (see (see Figure Figure A1). TheA1). case The of case Saintes, of mentioned, withmentioned the duplicates, with the is also duplicates, is alsoa fork-pattern a fork-pattern. .

FigureFigure A1.A1.The The Bauvin–Provin Bauvin–Provin “fork-pattern”. “fork-pattern”.

The gare location is on top of the image, between toponyms Bauvin and Provin, the same label in a pop-up print, near the Harnes toponym is explained by the history of the ligne (small blue dots passing through Harnes, Courrières and Carvin). That lignen°285,000 (ligned’Hénin-Beaumont à Bauvin-Provin) has been closed by segments between 1960 and 1970, and the geometry of the terminus has been attributed twice: (1) the original terminus Bauvin–Provin, was at km = 233.0, (2) then the segment end was near the Harnesgare, at km = 222.39, but with the Bauvin–Provin label, although 11.6 km further. Let’s use some examples in Table A17, of the suspected false positives (SN) that were mentioned in Section 2.4.2.

ISPRS Int.ISPRS J. Geo-Inf. Int. J. Geo-Inf.2021, 102021, x ,FOR10, 154 PEER REVIEW 35 of 37 36 of 38

Table A17. Cases for falseThe negativesgare location out isof onthe top (2014, of the 2017) image, direct between pairs (top); toponyms and lists Bauvin (bottom). and Provin, the same label in a pop-up print, near the Harnes toponym is explained by the history of the Cause Labelligne (small blue dotsDist passing (m) through Harnes, Courrières and Carvin). ◦ Reims That lignen 285,0002273 (ligned’H énin-Beaumont à Bauvin-Provin) has been closed by seg- ments between 1960 and 1970, and the geometry of the terminus has been attributed Champdôtre-Pont twice: (1) the original terminus2299 Bauvin–Provin, was at km = 233.0, (2) then the segment Sillé-le-Guillaumeend was near the Harnes2342gare, at km = 222.39, but with the Bauvin–Provin label, although Argentan11.6 km further. 2355 SN: Let’s use some examples in Table 17, of the suspected false positives (SN) that were Toucy-Ville 3446 mentioned inISPRS Section Int. J. Geo-Inf. 2.4.2 2021, 10. , x FOR PEER REVIEW 36 of 38 label = Épinal 3644

& Table 17. Cases for false negatives out ofTable the A17. (2014, Cases for 2017) false negatives direct out pairs of the (2014, (top); 2017) direct and pairs lists (top); (bottom). and lists (bottom). dist Cause Label Dist (m) Cause LabelReims Dist (m) 2273 > 2.2 km Champdôtre-Pont 2299 Fougères Reims4964Sillé-le-Guillaume 2273 2342 Champdôtre-Pont Argentan 2299 2355 SN: SN: Toucy-Ville 3446 Sillé-le-Guillaumelabel = 2342 label = Épinal 3644 & Argentan& 2355 Toucy-Villedist 3446 dist > 2.2 km Fougères > 2.2 km Épinal 3644 4964 QuestionFoug is: Whyères the same gare names 4964 so far away? Reims In the example of , thereQuestion are is: Why 4 the nodes same gare, namestwo so farbeing away? very close (Reims on picture Ta- ble A16), one for the North ligne, andIn the examplethe farthest of Reims, there fork are 4 nodes(N-E), two being being very close 2.273 (Reims kmon picture away. Ta- Question is: Why the same gare namesble A16), one so for far the away North ?ligne, and the farthest fork (N-E) being 2.273 km away. In the example of Fougères (FigureIn the example A3, of Fougères next (Figure section), A3, next section), two two locations locations are 4.964 are km 4.964apart: km apart: In the example of Reims, thereone is the arenow a 4 closednodes gare, the two other being is the last veryfork of the close former ligne (Reims, still located on on picture an one is the now a closed gare, theactive other ligne. is the last fork of the former ligne, still located on an Table A16), one for the North ligneAll, and cases thein the farthestlist (Table A17)fork result(N-E) from the being fork-pattern 2.273. It was km right away.to tag them active ligneIn the. example of Fougères“suspected (Figure negatives”, A3, next and to section), not fusion them two with locationstheir respective aregares. 4.964 km apart: oneAll is cases the now in athe closed list gare(Table, theAppendix A17) other B.3. isresult The the Target last Schemafromfork and ofthe the Graph the fork-pattern formerRepresentationligne of the. ,NetworkIt still was located right on to an tag them active ligne. The case with the initial UML-class diagram is that it is does nottake into account “suspected negatives”, and to notcorrectly fusion the difference them between with a gare-node their, and respective a non-gare-node, such gares as a .fork . This is All cases in the list (Tablenot 17 an) issue, result for instance from when the confrontingfork-pattern gares and. communes It was, which right requires to a tag simple them “suspected negatives”, and to not“point_in_polygon” fusion them algorithm. with But their a simple respective example (Figuregares A2) .illustrates the prob- Appendix B.3. The Target Schemalem: and one garethe (yellow Graph double Representation circle) is at the junction of oftwo thelignes Networkcoming from two sim- ple gares (yellow simple circle), and a fork (white circle) gives the exact location of where B.3.The The case Target with Schema the andinitial the Graph UML-classthe two Representationlignes meet (Figure diagram A2-case ofthe a). is Network that it is does nottake into account correctlyThe the case difference with the initialbetween UML-class a gare-node diagram, and is thata non-gare it is does-node nottake, such into as account a fork. This is not correctlyan issue, the for difference instance betweenwhen confronting a gare-node, andgaresa non-gareand communes-node, such, which as a fork requires. This is a simple “point_in_polygon”not an issue, for instance algorithm. when confronting But a simplegares exampleand communes (Figure, which A2) requires illustrates a simple the prob- “point_in_polygon” algorithm. But a simple example (Figure A2) illustrates the problem: lem: one gare (yellowFigure A2.double Mixing [simple circle)]gare-node, [junctionis at]gare the-node junctionand fork-node: (a ) ofno link, two (b) links lignes ignoring thecoming forks, (c) links from two sim- one gare (yellowpreserving double forks circle). is at the junction of two lignes coming from two simple gares ple gares (yellow simple circle), and a fork (white circle) gives the exact location of where (yellow simple circle), and a fork (whiteCase (b): the circle) fork is ignored gives and the a simple exact “digraph” location representation of where will work the cor- two the twolignes lignesmeet (Figuremeet (Figure A2-case A2-case a). rectly with a). algorithms: shortest-path (well known algorithm also named after Dijkstra), or path-contraction, an algorithm to build the minor graph of the original graph, which ignores intermediate [simple] gares and nodes (not terminus, not junction). Case (c) is more complex: it preserves the information of the fork, whichleads to a better visual result. With an historical viewpoint, archiving more geographical infor- mation about the past network is much better. The fork-pattern example of Fougères (Figure A3), involves four lignes (N, S-E, S-W, N-W). In the original data, the fork North was the last node in the list, hence plotted as the terminus: this is solution (b), illustrated on the left. Solution (c) is on right.

Figure A2.Figure Mixing A2. Mixing [simple [simple]gare]-node,gare-node, [junction [junction]gare]gare-node-node and and forkfork-node:-node: (a ()a no) no link, link, (b) links(b) links ignoring ignoring the forks the,(c forks) links, (c) links preservingpreserving forks. forks. Case (b): the fork is ignored and a simple “digraph” representation will work correctly withCase algorithms: (b): the forkshortest-path is ignored(well and known a simple algorithm “digraph” also named representation after Dijkstra), will or path- work cor- rectlycontraction with algorithms:, an algorithm shortest-path to build the (wellminor known graph of algorithm the original also graph, named which after ignores Dijkstra), or path-contractionintermediate, [ simplean algorithm] gares and tonodes build(not the terminus, minor graph not junction). of the original graph, which ignores intermediate [simple] gares and nodes (not terminus, not junction). Case (c) is more complex: it preserves the information of the fork, whichleads to a better visual result. With an historical viewpoint, archiving more geographical infor- mation about the past network is much better. The fork-pattern example of Fougères (Figure A3), involves four lignes (N, S-E, S-W, N-W). In the original data, the fork North was the last node in the list, hence plotted as the terminus: this is solution (b), illustrated on the left. Solution (c) is on right.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 36 of 37

Case (c) is more complex: it preserves the information of the fork, whichleads to a better visual result. With an historical viewpoint, archiving more geographical information about the past network is much better. ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEWThe fork-pattern example of Fougères (Figure A3), involves four lignes (N, S-E, S-W, 37 of 38 N-W). In the original data, the fork North was the last node in the list, hence plotted as the terminus: this is solution (b), illustrated on the left. Solution (c) is on right.

Figure A3. Example of the gare of Fougères before and after addition of the fork on ligne North-West. Figure A3. Example of the gare of Fougères before and after addition of the fork on ligne North-West. References References 1. De Block, G.; Polasky, J. Light railways and the rural–urban continuum: Technology, space and society in late nineteenth-century 1. De Block,Belgium. G.; Polasky,J. Hist. Geogr. J. Light2011, 37railways, 312–328. [andCrossRef the ] rural–urban continuum: Technology, space and society in late nine- teenth-century2. Martí-Henneberg, Belgium. J.J. EuropeanHist. Geogr. integration 2011, 37 and,312–328. national models for railway networks (1840–2010). J. Transp. Geogr. 2013, 26, 2. Martí-Henneberg,126–138. [CrossRef J. European] integration and national models for railway networks (1840–2010). J. Transp. Geogr. 2013, 26, 126–138,doi:10.1016/j.jtra3. Auphan, E. La contractionngeo.2012.09.004. du réseau ferré français dans le temps et dans l’espace. In Colloque international Le secteur des transports ferroviaires dans la mondialisation; Univ. Versailles-Saint-Quentin-en-Yvelines: Versailles, France, 2013. Available 3. Auphan, E. La contraction du réseau ferré français dans le temps et dans l’espace. In Colloque international Le secteur des trans- online: https://f-origin.hypotheses.org/wp-content/blogs.dir/2536/files/2015/03/auphan-etienne-atelier-f.pdf (accessed on ports ferroviaires14 September 2020).dans la mondialisation; Univ. Versailles-Saint-Quentin-en-Yvelines: 2013. Available online: https://f-origin.hypotheses.org/wp-conten4. Report on implementation of the 2011t/blogs.dir/2536/files/2015/03/auphan-etienne EU White Paper on Transport. Roadmap to a Single-atelier-f.pdf European Transport (accessed Area—Towards on 14 Septem- ber 2020).a Competitive and Resource-Efficient Transport System. Five Years Later: Achievements and Challenges. 2016. Available online: 4. Report onhttps://ec.europa.eu/transport/sites/transport/files/themes/strategies/doc/2011_white_paper/swd(2016)226.pdf implementation of the 2011 EU White Paper on Transport. Roadmap to a Single European Transport (accessedArea—Towards a Competitiveon 14 and September Resource-Efficient 2020). Transport System. Five Years Later: Achievements and Challenges; 2016. Available online: J. Transp. Hist. 2004 25 https://ec.europa.eu/transport/sites/tran5. Siebert, L. Using GIS to Map Rail Networksport/files/themes/strategies/doc/2011_whi History. , , 84–104. [CrossRefte_paper/swd(2016)226.pdf] (accessed on 14 6. Gregory, I.; Ell, P.S. Historical GIS: Technologies, Methodologies and Scholarship; (Cambridge Studies in Historical Geography); SeptemberCambridge 2020). University Press: Cambridge, UK, 2007; p. 227, (Cambridge Studies in Historical Geography). 5. Siebert,7. Morillas-TornL. Using GISé ,to M. Map Creation Rail of Network a Geo-Spatial History. Database J. Transp. to Analyse Hist. Railways 2004, 25 in, 84–104. Europe (1830–2010). A Historical GIS Approach. 6. Gregory,J. Geogr.I.; Ell, Inf. P.S. Syst. Historical2012, 4, 176–187. GIS: Technologies, [CrossRef] Methodologies and Scholarship; (Cambridge Studies in Historical Geography); Cambridge8. Li, C.; University Liu, L.; Dai, Press: Z.; Liu, Cambridge, X. Different SourcingUK, 2007; Point p. of227. Interest Matching Method Considering Multiple Constraints. ISPRS Int. 7. Morillas-Torné,J. GeoInf. 2020M. Creation, 9, 214. [CrossRef of a Geo-Spatial] Database to Analyse Railways in Europe (1830–2010). A Historical GIS Approach. J. Geogr.9. Normand, Inf. Syst. 2012 S.-L.T., 4 Meta-analysis:, 176–187, doi:10.4236/jgis.2012.42023. Formulating, evaluating, combining, and reporting. Stat. Med. 1999, 15, 321–359. [CrossRef] [PubMed] 8. Li, C.; Liu, L.; Dai, Z.; Liu, X. Different Sourcing Point of Interest Matching Method Considering Multiple Constraints. ISPRS 10. Riley, R.D.; Price, M.J.; Jackson, D.; Wardle, M.; Gueyffier, F.; Wang, J.; Staessen, J.A.; White, I.R. Multivariate meta-analysis using Int. J. GeoInfindividual. 2020 participant, 9, 214, doi:10.3390/ijgi9040214. data. Res. Syn. Meth. 2015 , 6, 157–174. [CrossRef][PubMed] 9. Normand,11. Lan, S-L.T. T.; Longley, Meta-analysis: P. Geo-Referencing Formulating, and Mapping evaluating, 1901 Census combining, Addresses for and England reporting. and Wales. Stat.ISPRS Med. Int. J. Geo-Inf.1999, 152019, ,321–359, doi:10.1002/(sici)1097-0258(19990215)18:3<321::a8, 320. [CrossRef] id-sim28>3.0.co;2-p. PMID:10070677. 10. Riley,12. R.D.;Bloch, Price, I. Information M.J.; Jackson, Fusion in D.; Signal Wardle, and Image M.; Processing:Gueyffier, Major F.; Wang, Probabilistic J.; Staessen, and Non-Probabilistic J.A.; White, Numerical I.R. Multivariate Approaches; Wiley-meta-analysis using individualOnline Library: participant Hoboken, data. NJ, USA,Res. Syn. 2008; Meth. ISBN 9781848210196.2015, 6, 157–174, doi:10.1002/jrsm.1129. 11. Lan,13. T.; Benferhat,Longley, S.;P. Kacia,Geo-Referencing S.; LeBerre, D.; and Williams, Mappi M.-A.ng 1901 Weakening Census conflicting Addresses information for England for iterated and Wales. revision ISPRS and knowledgeinte- Int. J. GeoInf. 2019, 8, gration. Artif. Intell. 2004, 153, 339–371. [CrossRef] 320,14. doi:10.3390/ijgi8080320. Reichgelt, H. Knowledge Representation: An AI Perspective; AblexPublishing: New York, NY, USA, 1991; ISBN 10:0893915904. 12. Bloch, I. Information Fusion in Signal and Image Processing: Major Probabilistic and Non-Probabilistic Numerical Approaches; Wiley- Online Library: Hoboken, New Jersey, USA, 2008. ISBN 9781848210196. 13. Benferhat, S.; Kacia, S.; LeBerre, D.; Williams, M.-A. Weakening conflicting information for iterated revision and knowledgeintegration. Artif. Intell. 2004, 153, 339–371. 14. Reichgelt, H. Knowledge Representation: An AI Perspective; AblexPublishing: New York, NY, USA, 1991. ISBN 10:0893915904 15. Johnson, B.A.; Iizuka, K. Integrating OpenStreetMap crowdsourced data and Landsattime-series imagery for rapid land use-land cover (LULC) mapping: Case study of the Laguna de Bay area of the Philippines. Appl. Geogr. 2016, 67, 140–149. 16. Younghoon, K.; Woohwan, J.; Kyuseok, S. Integration of graphs from different data sources using crowdsourcing. Inf. Sci. 2017, 385–386, 438–456, doi:10.1016/j.ins.2017.01.006. 17. Smith, M.J.; Wedge, R.; Veeramachaneni, K. FeatureHub: Towards collaborative datascience. In Proceedings of the IEEE In- ternational Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; pp. 590–600. 18. Juhász, L.; Rousell, A.; Jokar Arsanjani, J. Technical Guidelines to Extract and Analyze VGI from Different Platforms. Data 2016, 1, 15, doi:10.3390/data1030015. 19. Chen, Y.; Chen, L.; Zhang, C. CrowdFusion: A Crowdsourced Approach on Data Fusion Refinement. In Proceedings of the IEEE 33rd International Conference on Data Engineering (ICDE), Tokyo, Japan, 19–21 Octobetr 2017; pp. 127–130.

ISPRS Int. J. Geo-Inf. 2021, 10, 154 37 of 37

15. Johnson, B.A.; Iizuka, K. Integrating OpenStreetMap crowdsourced data and Landsattime-series imagery for rapid land use-land cover (LULC) mapping: Case study of the Laguna de Bay area of the Philippines. Appl. Geogr. 2016, 67, 140–149. [CrossRef] 16. Younghoon, K.; Woohwan, J.; Kyuseok, S. Integration of graphs from different data sources using crowdsourcing. Inf. Sci. 2017, 385–386, 438–456. [CrossRef] 17. Smith, M.J.; Wedge, R.; Veeramachaneni, K. FeatureHub: Towards collaborative datascience. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; pp. 590–600. 18. Juhász, L.; Rousell, A.; Jokar Arsanjani, J. Technical Guidelines to Extract and Analyze VGI from Different Platforms. Data 2016, 1, 15. [CrossRef] 19. Chen, Y.; Chen, L.; Zhang, C. CrowdFusion: A Crowdsourced Approach on Data Fusion Refinement. In Proceedings of the IEEE 33rd International Conference on Data Engineering (ICDE), Tokyo, Japan, 19–21 October 2017; pp. 127–130. 20. Gouvêa, C.; Loh, S.; FortesGarcia, L.F.; Brasil da Fonseca, E.; Wendt, I. Discovering Location Indicators of Toponyms from News to Improve Gazetteer-Based Geo-Referencing. In Proceedings of Brazilian Symposium on Geoinformatics; 2008; pp. 51–62. Available online: http://www.geoinfo.info/portuguese/geoinfo2008/artigos/p13.pdf (accessed on 8 March 2021). 21. Hastings, J.T. Automated conflation of digital gazetteer data. Int. J. Geogr. Inf. Sci. 2008, 22, 1109–1127. [CrossRef] 22. Wikipedia. Route-Diagram. Available online: https://simple.wikipedia.org/wiki/Template:Infobox_rail_line (accessed on 8 February 2021). 23. Lange, D.; Böhm, C.; FelixNaumann, F. Extracting Structured Information from Wikipedia Articles to Populate Infoboxes. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, CA, USA, 26 October 2010; pp. 1661–1664. [CrossRef] 24. Auer, S.; Lehmann, J. What Have Innsbruck and in Common? Extracting Semantics from Wiki Content. In The Semantic Web: Research and Applications; Franconi, E., Kifer, M., May, W., Eds.; Lecture Notes in Computer Science; Springer: /, Germany, 2007; Volume 4519. [CrossRef] 25. Vasseur, B.; Jeansoulin, R.; Devillers, R.; Frank, A. External quality evaluation of geographical applications: An ontological approach. Fundam. Spat. Data Qual. 2006, 255–270. [CrossRef] 26. Daniel, F.; Kucherbaev, P.; Cappiello, C.; Benatallah, B.; Allahbakhsh, M. Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. ACM Comput. Surv. 2018.[CrossRef] 27. Senaratne, H.; Mobasheri, A.; Ali, L.; Capineri, C.; Haklay, M. Are view of volunteered geographic information quality assessment methods. Int. J. Geogr. Inf. Sci. 2017, 31, 139–167. [CrossRef] 28. INSPIRE Thematic Working Group Transport Networks. Data Specification on Transport Networks—Technical Guidelines (D2.8.I.7). European Commission Joint Research Centre. April 2014. Available online: https://inspire.ec.europa.eu/id/ document/tg/tn (accessed on 8 February 2021). 29. Axelsson, P.; Wikström, L. OGC, InfraGML1.0: Part5 Railways—Encoding Standard. 2017. Available online: http://www.opengis. net/doc/standard/infragml/part5/1.0 (accessed on 8 February 2021). 30. SNCF Réseaux. Available online: https://ressources.data.sncf.com/explore/dataset/liste-des-gares (accessed on 13 September 2020). 31. Auphan, E. L’apogée des chemins de fer secondaires en France: Essai d’interprétation cartographique. Rev. D’histoire Des Chemins De Fer 2002, 24–25, 24–46. Available online: https://journals.openedition.org/rhcf/2028 (accessed on 8 February 2021). [CrossRef] 32. SNCF. Available online: https://fr.m.wikipedia.org/wiki/SNCF_Gares_&_Connexions (accessed on 17 August 2020). 33. Wikipedia. Available online: https://en.wikipedia.org/wiki/Wikipedia:Route_diagram_template (accessed on 17 August 2020). 34. Wikipedia. Available online: https://de.wikipedia.org/wiki/Wikipedia:Formatvorlage_Bahnstrecke( accessed on 8 February 2021). 35. Comber, A.; Wulder, M. Considering spatiotemporal processes in big data analysis: Insights from remote sensing of land cover and land use. Trans. GIS 2019, 23, 879–891. [CrossRef] 36. Sreenivasaiah, P.K.; Kim, D.H. Current Trends and New Challenges of Databases and Web Applications for Systems Driven Biological Research. Front. Physiol. 2010, 1, 147. [CrossRef][PubMed] 37. OSM. Available online: https://wiki.openstreetmap.org/wiki/Railway_stations (accessed on 8 February 2021). 38. Gervais, M.; Bédard, Y.; Levesque, M.; Bernier, E.; Devillers, R. Data Quality issues and Geographic Knowledge Discovery. In Geographic Data Mining and Knowledge Discovery; Miller, H.J., Han, J., Eds.; 2009; Chapter 5; pp. 99–115. Available online: http://yvanbedard.scg.ulaval.ca/wp-content/documents/publications/518.pdf (accessed on 8 February 2021). 39. Yang, S.; Wu, Y.; Sun, H.; Yan, X. Schemaless and Structureless Graph Querying. In Proceedings of the 40th International Conference on Very Large Databases (VLDB); 2014; Available online: https://yinghwu.github.io/mat/papers/Schemaless_and_Structureless_ graph_querying-vldb14.pdf (accessed on 8 February 2021). 40. Auphan, E.; Dupuy, G. Vingt ans de travaux scientifiques sur les réseaux et la mobilité ferroviaires. Rev. D’histoire Des Chemins De Fer 2009, 39, 95–101. [CrossRef] 41. Eurostat. Modal Split of Inland Freight Transport. Statistics Explained Website. 2018. Available online: https://ec.europa. eu/eurostat/statistics-explained/index.php?title=File:Modal_split_of_inland_freight_transport,_2018_(%25_share_in_tonne- kilometres).png (accessed on 18 September 2020). 42. European Environment Agency. Greenhouse Gas Emissions from Transport in Europe. 2020. Available online: https://www.eea. europa.eu/data-and-maps/indicators/transport-emissions-of-greenhouse-gases-7/assessment (accessed on 8 February 2021).