International Journal of Geo-Information

Article A Hybrid Population Distribution Prediction Approach Integrating LSTM and CA Models with Micro-Spatiotemporal Granularity: A Case Study of ,

Pengyuan Wang 1, Xiao Huang 2 , Joseph Mango 1,3 , Di Zhang 1 , Dong Xu 1 and Xiang Li 1,*

1 Key Laboratory of Geographical Information Science (Ministry of Education), School of Geographic Sciences, East Normal University, Shanghai 200241, China; [email protected] (P.W.); [email protected] (J.M.); [email protected] (D.Z.); [email protected] (D.X.) 2 Department of Geosciences, University of Arkansas, Fayetteville, AR 72701, USA; [email protected] 3 Department of Transportation and Geotechnical Engineering, University of Dar es Salaam, Dar es Salaam 35131, Tanzania * Correspondence: [email protected]

Abstract: Studying population prediction under micro-spatiotemporal granularity is of great sig- nificance for modern and refined urban traffic management and emergency response to disasters. Existing population studies are mostly based on census and statistical yearbook data due to the limitation of data collecting methods. However, with the advent of techniques in this information age, new emerging data sources with fine granularity and large sample sizes have provided rich   materials and unique venues for population research. This article presents a new population predic- tion model with micro-spatiotemporal granularity based on the long short-term memory (LSTM) Citation: Wang, P.; Huang, X.; and cellular automata (CA) models. We aim at designing a hybrid data-driven model with good Mango, J.; Zhang, D.; Xu, D.; Li, X. A adaptability and scalability, which can be used in more refined population prediction. We not only Hybrid Population Distribution Prediction Approach Integrating try to integrate these two models, aiming to fully mine the spatiotemporal characteristics, but also LSTM and CA Models with propose a method that fuses multi-source geographic data. We tested its functionality using the Micro-Spatiotemporal Granularity: A data from Chongming District, Shanghai, China. The results demonstrated that, among all scenarios, Case Study of Chongming District, the model trained by three consecutive days (ordinary dates), with the granularity of one hour, Shanghai. ISPRS Int. J. Geo-Inf. 2021, incorporated with road networks, achieves the best performance (0.905 as the mean absolute error) 10, 544. https://doi.org/10.3390/ and generalization capability. ijgi10080544 Keywords: population distribution prediction; micro-spatiotemporal granularity; long short-term Academic Editor: Wolfgang Kainz memory (LSTM); cellular automata (CA); integrated model

Received: 3 June 2021 Accepted: 10 August 2021 Published: 13 August 2021 1. Introduction

Publisher’s Note: MDPI stays neutral Population distribution prediction refers to estimating the population of a specific with regard to jurisdictional claims in geographic unit, taking into account the impact of natural geography and the socioe- published maps and institutional affil- conomic environment and applying scientific methods (predictive models) to estimate iations. population development in another temporal period [1]. However, traditional population prediction generally has extremely coarse spatiotemporal granularity. With the advent of information handling techniques, new data collection methods, data processing algo- rithms, and improvements in computing power have made population prediction with micro-spatiotemporal granularity possible. Since ancient times, population estimation Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. and prediction have been essential in human society for policymaking and socioeconomic This article is an open access article planning, such as urban and health care planning at different administration levels. Studies distributed under the terms and on population prediction can be traced back to the United Kingdom in 1695 [2]. After years conditions of the Creative Commons of exploration and efforts by mathematicians, statisticians, demographers, geographers, Attribution (CC BY) license (https:// and other scholars, a series of models for population prediction is proposed. creativecommons.org/licenses/by/ In general, population prediction models can be grouped into four categories, given 4.0/). the approaches they adopt: (1) mathematical models, (2) statistical models, (3) demographic

ISPRS Int. J. Geo-Inf. 2021, 10, 544. https://doi.org/10.3390/ijgi10080544 https://www.mdpi.com/journal/ijgi ISPRS Int. J. Geo-Inf. 2021, 10, 544 2 of 16

models, and (4) machine learning models. In the past, many mathematical models were proposed [3,4]. Due to insufficient assumptions, however, data utilization is limited in these mathematical models, and it is difficult to determine parameters in an objective manner. The above constraints led to the development and application of statistical models in population prediction. Statistical models in population prediction mainly include regression, such as linear regression and non-linear regression models (e.g., Gompertz models), and time series models such as the autoregressive integrated moving average model (ARIMA) that combines differential, moving average, and autoregressive analysis. For example, Zakria and Muhammad [5] applied this model to simulate the population dynamics in Pakistan from 1951 to 2007, and they predicted that the country’s population would reach 229 million by 2025. Despite more data and parameters that ultimately improve the prediction results, statistical models often fail to consider other factors that play significant roles in population growth, such as factors regarding societal settings and economic influences. In comparison, demographic models such as the cohort-component models consider not only the information of birth-rate, mortality-rate, and migration- rate but also age-specific information, providing detailed descriptions on the population spectrum, thus rendering more accurate results compared to pure statistical methods [6]. A study by Stoto found that the distribution of error statistics in the US Census Bureau and the UN’s forecast samples was relatively stable, and such information could be utilized to predict the total population of the United States in 2000 [7]. Lee and Carter [8] used the time series method to predict age-specific mortality in the United States between 1990 and 2065. In 1996, based on an expert’s judgment on the trends and uncertainties of future birth rate, mortality, and migration rates in global regions, Lutz et al. [9] predicted the demographics at global and selected regions by 2100. Based on this study’s perspectives, all methods mentioned above can be classified as the traditional methods because they largely rely on structured census and statistical data from yearbooks. Moreover, their predictions tend to have coarse spatiotemporal granularity, with yearly temporal units and coarse spatial scales (e.g., national or global). Recently, the improvement of data collecting methods and the arrival of the Big Data era, the development of Big Data analytics, and the increasing maturity of machine learning have fostered new ideas in population prediction studies. For example, Reichstein et al. [10] argued that the next-generation researches in geographic sciences should involve hybrid models that integrate physical and machine learning models. Robinson et al. [11] trained a Convolutional Neural Network (CNN) [12] with one-year synthetic remote sensing images of Landsat with a grid resolution of 0.01 × 0.01 (unit: degree) to predict the US population. Griffith et al. [13] observed the increasing popularity of fine-scale population studies and indicated that the future prediction methods could start at the family or even the individual level, thanks to the development of new modeling algorithms, such as agent models and artificial neural networks (ANN). LSTM is a variation of recurrent neural networks (RNN) first proposed by Hochreiter and Schmidhuber in 1997 [14]. Since RNNs often experience problems of gradient vanishing during training, such networks can only retain short-term memory. LSTM relies on a unique set of internal mechanisms (e.g., gates and memory cells) to grant the networks the capability of long-term memory. It has been widely adopted in handwriting [15] and speech recognition [16]. However, we have found that this feature is particularly suitable for expressing the irregularities within the population change. The cellular automata (CA) with discrete states and local spatiotemporal interactions were proposed by American mathematician Stanisław Marcin Ulam in the 1940s and were used by Von Neumann to study the logical nature of self-replicating systems (White and Engelen, 1993) [17]. At present, CA models are mainly used for modeling complex systems. Compared with mathematical models, CA can simulate the evolution of natural phenomena in a more accurate manner [18], and therefore it has been adopted in simulation studies such as human migration, urban development, and land-use dynamics [19,20]. ISPRS Int. J. Geo-Inf. 2021, 10, 544 3 of 16

This study uses mobile phone signaling data, an emerging type of population data different from population data collected by traditional methods. Mobile phone data enables population prediction with micro-spatiotemporal granularity, characterized by fine-grained spatiotemporal granularity compared with traditional methods. The temporal resolution can be as fine as hours or even minutes, and the spatial resolution can be at a sub−1 km scale. Mobile phone signaling data have extensive spatial coverages, high data collection frequencies, long temporal spans, and spatial grid as a regular statistical unit. Based on the literature, extremely few methods can handle population datasets with such spatiotemporal granularity. LSTM has the ability to solve long-term dependency problems, given its unique design structure. Its design makes it suitable for processing and predicting important events with long intervals and delays in time series, ideal for investigating population dynamics. CA is a dynamic grid model with discrete time, space, and state. Since its spatial interactions and time causality are local-focused, CA is often adopted to simulate the process of the spatiotemporal evolution of complex systems (e.g., geographical environment). In addition, the concept of cells in CA is suitable to express the spatial distribution of the population. Although there exist some efforts that integrate LSTM and CA models [21,22], their models lack the ability to integrate multi-source geographic information and, to our best knowledge, no effort has been made to investigate such integration in predicting the spatial distribution of population. In light of the background of population prediction with micro- spatiotemporal granularity and the unique characteristics of such data, we attempt to build a data-driven population prediction framework by integrating two basic models: LSTM and CA.

2. Study Area and Data 2.1. Overview of the Study Area This study uses data from Chongming District (extending 121◦0903000 E to 121◦5400000 E and 31◦2700000 N to 31◦5101500 N), which is a municipal district in Shanghai, China. It borders the River to the west and the to the east. On the northern side, it borders Haimen City and Qidong City. Towards the south, the Chongming District is surrounded by three other districts of , Jiading, and Baoshan (Figure1). With a total area of 1413 square kilometers, the Chongming District is mainly composed of islands that include Chongming Island, , and Changxing Island. Among these, Chongming Island is the third-largest Chinese island and the world’s largest estuary alluvial island [23]. Located at the midpoint of the Chinese coastline, Chongming District is the estuary of the Yangtze River, the longest river in China. The terrain of Chongming Island is largely flat, and the elevation of more than 90% of the area is between 3.21 m and 4.20 m (Wujing Elevation). According to the 2018 Shanghai Yearbook Statistics, by the end of 2017, the registered population in Chongming District was 675,900, including 694,600 permanent residents and 141,900 foreign residents. In the whole district, the density of population is 586 people per square kilometer. The natural population growth rate was −0.51%. In 2017, according to the Statistical Communique of National Economic and Social Development of Chongming District, the per capita disposable income of all residents in the region was 5654 US dollars, and the per capita disposable income of rural residents was 3930 US dollars. The annual GDP was 5.14 billion US dollars, of which the tertiary industries grew up more by 12.8% over the previous year, thus making a total of 2.66 billion US dollars, equivalent to 51.8% of the region’s GDP. ISPRS Int. J. Geo-Inf. 2021,ISPRS 10, x Int.FOR J. PEER Geo-Inf. REVIEW2021, 10, 544 4 of 16 4 of 16

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 4 of 16

Figure 1. Location of the study area: Chongming District. Figure 1. Figure 1. Location of the study area:Location Chongming of the study District. area: Chongming District. 2.2.2.2. Data Characteristics 2.2. Data CharacteristicsThe data used used in in this this study study are are mobile mobile ph phoneone signaling signaling data data collected collected by China by China Uni- The data usedUnicom,com, in this covering coveringstudy the are theentire mobile entire Chongming ph Chongmingone signaling District, District, dataShanghai, Shanghai, collected with with by a temporalChina a temporal Uni- span span from from 27 com, covering the27September entire September Chongming 2018, 2018 to, 10 to October10District, October 2018. Shanghai, 2018. As mobile As mobile with phone phonea temporal signaling signaling dataspan data involve from involve 27sensitive sensitive and September 2018, toandprivate 10 privateOctober information, information, 2018. As the mobile actual the actual phonedata datahas signaling been has been aggregated data aggregated involve to 10 to minsensitive 10 minas the as and temporal the temporal unit private information,unitand the to and regular actual to regular gridsdata grids ofhas 250 been ofm 250by aggregated 250 m bym. 250We are m. to not We10 involved aremin not as involvedthe in the temporal mobile in the phoneunit mobile signaling phone signalingdata collection data collectionand aggregation and aggregation process, but process, we canbut confirm we can that confirm the data that wethe retrieve data wehas and to regular grids of 250 m by 250 m. We are not involved in the mobile phone signaling retrievebeen anonymized. has been anonymized. This dataset This collects dataset mobile collects phone mobile signaling phone data, signaling precisely data, reflecting precisely data collection andreflectingthe aggregation number the of number mobile process, services, of but mobile we thus services,can serving confirm thus as thatan serving ideal the proxydata as an we idealto populationretrieve proxy has to distribution. population been anonymized.distribution.The This file dataset size is The about collects file size871 mobile isMB about (MByte), phone 871 MB with signaling (MByte), a totalwith of data, 21,833,490 a total precisely of 21,833,490lines reflecting of records. lines of The records. orig- the number of mobileTheinal original dataservices, is in data textthus is format, in serving text format, with as aneach with ideal line each proxystoring line storing to one population record one record and distribution. fields and fields separated separated by “|” by The file size is about“|”(Figure 871 (Figure MB2). 2There (MByte),). There are are four with four attribute a attribute total offields. fields.21,833,490 Ta Tableble 1 1 lines presentspresents of records. detailed The descriptionsdescriptions orig- ofof eacheach inal data is in textfield.field. format, The with longitude each andline latitude storing is one the record central pointand fields of thethe separated aggregatedaggregated by populationpopulation “|” data, (Figure 2). There areandand four centralcentral attribute pointpoint valuevalue fields. suggestssuggests Table thethe 1 presentsnumbernumber ofof detailed peoplepeople inin descriptions eacheach gridgrid ofof 250250of each mm byby 250250 m.m. field. The longitude and latitude is the central point of the aggregated population data, and central point value suggests the number of people in each grid of 250 m by 250 m.

FigureFigure 2.2. Raw format of mobile phonephone signalingsignaling data.data.

Figure 2. Raw format of mobile phone signaling data.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 5 of 16

Table 1. Field descriptions of mobile phone signaling data.

Field Meaning Format Remark

Counting time year/month/day/hour: Time Accurate to 10 min minute

Floating point number retained to six WGS-84 geographic coordinate Longitude decimal places system ISPRS Int. J. Geo-Inf. 2021, 10, 544 5 of 16 Floating point number retained to six WGS-84 geographic coordinate Latitude decimal places system Table 1. Field descriptions of mobile phone signaling data. Statistics Integer Mobile users in the grid Field Meaning Format Remark Time Counting time year/month/day/hour: minute Accurate to 10 min Longitude FloatingThe point original number data retained contain to six decimal collected places time,WGS-84 latitude, geographic and coordinate longitude system information. Since Latitude Floating point number retained to six decimal places WGS-84 geographic coordinate system Statisticsthe Integer integrated model requires serialization Mobile characteristics users in the grid for the input data, we re-ranked the data according to the timestamp and divided them into individual files by day. Finally, 14 textThe files original are obtained, data contain ranging collected time,from latitude, 27 September and longitude 2018 information. to 10 October Since 2018, sorted by timestamp.the integrated Thus, model data requires of different serialization longitudes characteristics and for latitudes the input data, at the we re-rankedsame time are gathered together.the data according It should to thebe timestampnoted that and all divided data ar theme referenced into individual on files the by World day. Finally, Geographic Coordi- 14 text files are obtained, ranging from 27 September 2018 to 10 October 2018, sorted by natetimestamp. System Thus, (WGS-84), data of different and their longitudes spatial and sampling latitudes ataccuracy the same time(250 are m gathered× 250 m) is determined duringtogether. collection. It should be notedSince that it allis datadifficult are referenced to determine on the World spatial Geographic characteristics Coordinate of data in text format,System we (WGS-84), used andthe theirGeospatial spatial sampling Data Abstract accuracyion (250 Library m × 250 m)(GDAL) is determined to convert these text during collection. Since it is difficult to determine spatial characteristics of data in text filesformat, into we raster used format the Geospatial to facilitate Data Abstraction visual re Librarypresentation. (GDAL) toFigure convert 3 thesepresents text the population distributionfiles into raster at format12:00 top.m. facilitate on 27 visual September representation. 2018 Figure in Chongming3 presents the District population in a raster format. Thedistribution cell attribute at 12:00 indicates p.m. on 27 the September number 2018 of in people Chongming in each District grid. in a raster format. The cell attribute indicates the number of people in each grid.

Figure 3. Population distribution in Chongming District in a raster format at 12:00 p.m. on Figure27 September 3. Population 2018. distribution in Chongming District in a raster format at 12:00 p.m. on 27 Sep- tember3. Method 2018. of Model Integration 3.1. Overview of the LSTM and CA Models 3. MethodIn this of study, Model the temporal Integration period for data collection is considerably long. Compared 3.1.with Overview classic time of seriesthe LSTM models, and LSTM CA can Models benefit the extraction of long-term features in the temporal dimension and the extraction of those features in an unsupervised manner, largely increasingIn this its study, utility andthe generalizability.temporal period The internalfor data structure collection of the is neural considerably networks is long. Compared withpresented classic in Figuretime series4. Xt denotes models, the originalLSTM inputcan benefit in the current the extraction time step. C tof−1 andlong-termCt features in therepresent temporal the memory dimension units of and the previousthe extraction time step andof those the current features time step, in an respectively. unsupervised manner, Ht−1 and Ht represent the hidden state of the previous time step and the current time largelystep, respectively. increasing its utility and generalizability. The internal structure of the neural net- works is presented in Figure 4. X denotes the original input in the current time step. C

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 6 of 16

and C represent the memory units of the previous time step and the current time step, ISPRS Int. J. Geo-Inf. 2021, 10, 544 respectively. H and H represent the hidden state of the previous6 of 16 time step and the current time step, respectively.

FigureFigure 4. The4. The internal internal structure structure of long short-term of long memory short- (LSTM)term neural memory networks. (LSTM) neural networks. As mentioned above, to mitigate the problem of short-term memory, LSTM constructs a specialAs structure mentioned inside the above, networks to with mitigate memory cells, the which problem can pass of the short-term information memory, LSTM con- along the sequence, and the structure can be understood as the “memory” of the networks. Itstructs includes a three special gates: structure forget gate, inputinside gate, the and networks output gate. with memory cells, which can pass the in- formationGiven time along step t, the calculating sequence, process and of the the forgetstruct gate,ure input can gate,be understood and output as the “memory” of gatethe cannetworks. be presented It asincludes follows: three gates: forget gate, input gate, and output gate. Given time step 𝑡, the calculating process of the forget gate, input gate, and output ft = σ xtWx f + ht−1Wh f + b f (1) gate can be presented as follows: it = σ(xtWxi + ht−1Whi + bi) (2) 𝑓 𝜎𝑥𝑊 ℎ𝑊 𝑏 (1) ot = σ(xtWxo + ht−1Who + bo) (3) n∗h where ft, it, ot ∈ R (n is the number of samples𝑖 𝜎𝑥 and𝑊h is theℎ number𝑊 of𝑏 hidden units) (2) n∗d are forget gate, input gate, and output gate, respectively. The current time step xt ∈ R (d is the number of features) is the input,𝑜 and𝜎𝑥 the hidden𝑊 ℎ state of𝑊 the previous𝑏 time (3) n∗h d∗h h∗h step is ht−1 ∈ R . Wx f , Wxi, Wxo ∈ R and Wh f , Whi, Who ∈ R are the weights. 1∗h ∗ bwheref , bi, bo ∈𝑓R,𝑖,𝑜denote ∈ℝ the bias (n parameters. is the number of samples and h is the number of hidden units) n∗h At time step t, the candidate memory cell Cet ∈ R is calculated as: ∗ are forget gate, input gate, and output gate, respectively. The current time step 𝑥 ∈ℝ

(d is the number of features)Cet = tanh( xistW thexc + input,ht−1Whc +andbc) the hidden state of(4) the previous time step is ℎ ∈ℝ∗ . 𝑊 ,𝑊 ,𝑊 ∈ℝ∗ and 𝑊 ,𝑊 ,𝑊 ∈ℝ∗ are the weights. 𝑏 ,𝑏,𝑏 ∈ where W ∈ d∗h andW ∈ h∗h are the weights of the parameters, and b ∈ 1∗h is the ∗ xc R hc R c R biasℝ parameter. denote The the update bias of parameters. the memory cells is based on the memory cells passed in the previous time step. The combination of the forget gate, and the input gate is∗ able to control At time step 𝑡, the candidate memory cell 𝐶 ∈ℝ is calculated as: the flow of internal information:

𝐶 𝑡𝑎𝑛ℎ𝑥𝑊 ℎ𝑊 𝑏 (4) Ct = ftCt−1 + itCet (5) 𝑊 ∈ℝ∗ 𝑊 ∈ℝ∗ 𝑏 ∈ℝ∗ wherewhere the forget gate ft determinesand how much are information the weights is retained of in the the memoryparameters, and is n∗h celltheC biast−1 ∈ Rparameter., and the input The gate updateit determines of the how memory much information cells is isbased added on in the the memory cells passed candidatein the previous memory cell timeCet. step. The combination of the forget gate, and the input gate is able to The hidden state is realized by controlling the flow of internal information of the memorycontrol cell the through flow of the internal output gate. information: The calculating process of the hidden state is as follows: ht = ot ∗ tanh(C𝐶t) 𝑓𝐶 𝑖𝐶 (6) (5) whereWith the the gatesforget mentioned gate 𝑓 above, determines LSTM can how effectively much extract information long-term temporal is retained in the memory cell patterns in mobile phone signaling data. ∗ 𝐶 ∈ℝ , and the input gate 𝑖 determines how much information is added in the can- didate memory cell 𝐶. The hidden state is realized by controlling the flow of internal information of the memory cell through the output gate. The calculating process of the hidden state is as follows:

ℎ 𝑜 ∗tanh 𝐶 (6) With the gates mentioned above, LSTM can effectively extract long-term temporal patterns in mobile phone signaling data. A CA system contains four basic elements: cells, states, neighborhoods, and conver- sion rules. Cells are the basic objects (or units) of a CA system. Space formed by all cells

ISPRS Int. J. Geo-Inf. 2021, 10, 544 7 of 16

A CA system contains four basic elements: cells, states, neighborhoods, and conver- sion rules. Cells are the basic objects (or units) of a CA system. Space formed by all cells is called cell space. States of cells (generally discrete) have different choices based on different scenarios. However, the population data in this study are continuous. Conversion rules act on local scopes, and neighboring cells need to be defined to form neighborhoods. The conversion rule is a function that determines the state of the next moment according to the current state of the cell and the states of the neighboring cells. Therefore, this study aims to determine the neighborhood and extract the conversion rules effectively.

3.2. Model Integration The integrated model we proposed is a data-driven model that can adapt to the spatiotemporal characteristics of mobile phone signaling data. Due to the high frequency and long acquisition period, classic time series methods such as ARIMA may not be feasible, as they cannot analyze multi-dimensional data and require stationary data or stationary after differentiating. LSTM benefits the extraction of both short-term and long- term temporal features. In addition, the population data is collected at the grid level, similar to the concept of cells in CA. Moreover, the neighborhood and extraction algorithms of conversion rules in CA have good scalability. Therefore, we integrate these two models to build a population prediction model with a micro-spatiotemporal granularity that can fully adapt to the characteristics of mobile phone signaling data. In the process of integrating LSTM and CA models, we consider two main aspects. First, we need to take advantage of the respective advantages of LSTM in sequence data and CA in spatial simulation. Second, we need to ensure the generalizability of the model. From a macro perspective, a CA model is the upper-level framework of the integrated model, which defines the interaction between the cells in the simulation space through varying definitions of neighborhoods. The LSTM can be seen as an underlying algorithm for automatically extracting conversion rules in CA. In this study, LSTM is able to handle sequential data with the input of a three-dimensional tensor that includes samples (cor- responding to the grids), timesteps, and features. Relying on the upper-level CA model, features can incorporate geographical information from various neighboring scenarios. The output of the CA-LSTM-integrated model is a two-dimensional tensor that includes samples (grids) and the corresponding population counts (in this study, mobile user counts). In this integration framework, CA and LSTM, respectively, handle information from spatial and temporal perspectives. Their integration benefits efficient spatiotemporal data fusion that leads to accurate population prediction. Specifically, there are two considerations: one is to use the cells in the CA to represent the grid. Different definitions of the neighborhood can incorporate more spatial information into the model. The second is to take advantage of the unique mechanism of LSTM for extracting complex time series features in population data. Figure5 presents the schematic diagram of the integration between LSTM and CA. In Figure5, X denotes the original input while X0 denotes the model input after temporal selection and neighborhood construction. Different uses of t represent different timestamps. Ct−1, Ct, Ht−1, and Ht are the intermediate input and output of the LSTM unit during the training process, and y represents the model output. In general, the integrated model first selects the time series of the original data, then builds features in the attribute dimension according to the definition of the neighborhood in CA. Further, structurally transformed data are fed into the LSTM for training. The calculation at each time step generates new memory cells and hidden states. At the end of the entire training process, LSTM learns a set of parameters inside the model that can be used for later predictions. ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 7 of 16

is called cell space. States of cells (generally discrete) have different choices based on dif- ferent scenarios. However, the population data in this study are continuous. Conversion rules act on local scopes, and neighboring cells need to be defined to form neighborhoods. The conversion rule is a function that determines the state of the next moment according to the current state of the cell and the states of the neighboring cells. Therefore, this study aims to determine the neighborhood and extract the conversion rules effectively.

3.2. Model Integration The integrated model we proposed is a data-driven model that can adapt to the spa- tiotemporal characteristics of mobile phone signaling data. Due to the high frequency and long acquisition period, classic time series methods such as ARIMA may not be feasible, as they cannot analyze multi-dimensional data and require stationary data or stationary after differentiating. LSTM benefits the extraction of both short-term and long-term tem- poral features. In addition, the population data is collected at the grid level, similar to the concept of cells in CA. Moreover, the neighborhood and extraction algorithms of conver- sion rules in CA have good scalability. Therefore, we integrate these two models to build a population prediction model with a micro-spatiotemporal granularity that can fully adapt to the characteristics of mobile phone signaling data. In the process of integrating LSTM and CA models, we consider two main aspects. First, we need to take advantage of the respective advantages of LSTM in sequence data and CA in spatial simulation. Second, we need to ensure the generalizability of the model. From a macro perspective, a CA model is the upper-level framework of the inte- grated model, which defines the interaction between the cells in the simulation space through varying definitions of neighborhoods. The LSTM can be seen as an underlying algorithm for automatically extracting conversion rules in CA. In this study, LSTM is able to handle sequential data with the input of a three-dimensional tensor that includes sam- ples (corresponding to the grids), timesteps, and features. Relying on the upper-level CA model, features can incorporate geographical information from various neighboring sce- ISPRS Int. J. Geo-Inf. 2021narios., 10, x TheFOR outputPEER REVIEW of the CA-LSTM-integrated model is a two-dimensional tensor that 8 of 16

includes samples (grids) and the corresponding population counts (in this study, mobile user counts). In this integration framework, CA and LSTM, respectively, handle infor- mation from spatial and temporal perspectives. Their integration benefits efficient spatio- temporal data fusionIn Figurethat leads 5, toX accurate denotes population the original prediction. input whileSpecifically, X′ denotes there are the model input after tem- two considerations:poral oneselection is to use and the cellsneighborhood in the CA to representconstruction. the grid. Different Different defi-uses of 𝑡 represent different nitions of the neighborhood can incorporate more spatial information into the model. The timestamps. C, C, H, and H are the intermediate input and output of the LSTM second is to take advantage of the unique mechanism of LSTM for extracting complex unit during the training process, and y represents the model output. ISPRS Int. J. Geo-Inf.time2021 series, 10, 544 features in population data. Figure 5 presents the schematic diagram8 of 16of the integration betweenIn LSTM general, and CA.the integrated model first selects the time series of the original data, then builds features in the attribute dimension according to the definition of the neighborhood in CA. Further, structurally transformed data are fed into the LSTM for training. The cal- culation at each time step generates new memory cells and hidden states. At the end of the entire training process, LSTM learns a set of parameters inside the model that can be used for later predictions. Another consideration of this model integration is whether the model itself is with sufficient scalability and universality. The structural requirements of the integrated model for the input data are three-dimensional tensors with the form of sample, timestep, and feature, where the timestep corresponds to the information on the time series, and the feature corresponds to the definition of the neighborhood in the CA. The training of the integrated model is data-driven, and the information in the temporal and spatial dimen-

Figure 5. Schematic diagram of integrated LSTM and CA model. Figure 5. Schematicsions diagram is implicit of integrated in the LSTM input and data. CA model. As no additional controls are required besides the ad- justmentAnother of consideration input data, of this the model integrated integration is model whether theis with model itselfstrong is with scalability and generalizability. Insufficient addition, scalability given and universality. that internal The structural parameters requirements ar ofe thewith integrated acceptable model initialization settings, man- for the input data are three-dimensional tensors with the form of sample, timestep, and ualfeature, adjustments where the timestep are corresponds not required to the information, which demonstrates on the time series, andthe the robustness and automaticity of thefeature proposed corresponds model. to the definition of the neighborhood in the CA. The training of the integrated model is data-driven, and the information in the temporal and spatial dimensions is implicit in the input data. As no additional controls are required besides the 3.3.adjustment Spatial of input Extension data, the integrated model is with strong scalability and generalizability. In addition, given that internal parameters are with acceptable initialization settings, manualThe adjustments spatial are notextension required, which of demonstratesthe model the robustnessin this andpaper automaticity is achieved via the perspective of of the proposed model. neighborhood construction. In the experiment, we select three different neighborhoods: the3.3. Spatial special Extension neighborhood of the central cell, the Moore neighborhood (R = 1), and the ex- The spatial extension of the model in this paper is achieved via the perspective of tendedneighborhood Moore construction. neighborhood In the experiment, (R we = select 1) by three extrac differentting neighborhoods: cell weights the based on road networks. specialThe neighborhood first neighborhood of the central cell, the construction Moore neighborhood is (Rrelati = 1), andvely the extendedsimple, that is, the choice of the cen- Moore neighborhood (R = 1) by extracting cell weights based on road networks. tralThe cell first while neighborhood ignoring construction its neighborhoods. is relatively simple, that Th is,e theMoore choice ofneighborhood the (R = 1) is a classic methodcentral cell while of defining ignoring its neighborhoods.neighborhood, The Moore as shown neighborhood in (RFigure = 1) is a 6. classic method of defining neighborhood, as shown in Figure6.

Figure 6. Moore 6. Moore neighborhood neighborhood (R = 1). (R = 1).

The combination of the Moore neighborhood (R = 1) and algorithm for extracting cell weights based on road networks is another way to build neighborhoods. This paper draws on the idea of “decay neighborhood with distance” [24–26] from the previous studies and uses data of road networks to ensure that proper spatial information is considered. Figure 7 presents the road networks in Chongming District.

ISPRS Int. J. Geo-Inf. 2021, 10, 544 9 of 16

The combination of the Moore neighborhood (R = 1) and algorithm for extracting cell weights based on road networks is another way to build neighborhoods. This paper draws ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEERon the REVIEW idea of “decay neighborhood with distance” [24–26] from the previous studies and 9 of 16

uses data of road networks to ensure that proper spatial information is considered. Figure7 presents the road networks in Chongming District.

Figure 7. Road networks in Chongming District. Figure 7. Road networks in Chongming District. The method of constructing neighborhoods from road networks is based on the assumptionThe method that population of constructing movements are neighborhoods affected by surrounding from areas road of interest networks (such is based on the as- as roads, tourist areas, etc.). With the consideration of road elements, people are with a sumptiontendency to movethat towardspopulation the road movements in a random state, are asaffe thected cells closerby surrounding to the road have areas a of interest (such asgreater roads, impact tourist on the areas, central cell.etc.). To ensureWith thatthe the consideration total impact weight of ofroad all cells elements, is 1, the people are with a tendencyweights are to normalized move towards via the following the road formula: in a random state, as the cells closer to the road have

a greater impact on the central cell. eTo−D i ensure that the total impact weight of all cells is 1, W = (7) i 9 − the weights are normalized via thee ∑followingj=1 Dj formula:

where Wi represents the weight of the i-th cell, and Di representse the distance of the i-th cell from the nearest road. Given the distance to the nearestW road, the weights are first calculated (7) ∑ considering an exponential decaying effect, then normalizede after the summation. WithW the use of the Moore neighborhood (R = 1), road data areD integrated into the wheremodel, as cell represents weights of the the neighborhood weight of are the calculated i-th cell, based and on the road represents network. Such the distance of the i-th cellinformation from the affects nearest the extraction road. of Given spatial featuresthe distance during model to the training. nearest road, the weights are first calculated considering an exponential decaying effect, then normalized after the summa- 4. Experiments and Results tion. In the first experiment, we selected the central cell without considering its neigh- borhood.With Since the the use period of the of the Moore data covers neighborhood the Chinese National (R = 1), Day road (1 October) data andare integrated into the model,ordinary as dates, cell the weights training was of conductedthe neighborhood separately to investigateare calculated the heterogeneity based on in the road network. Suchthe temporal information pattern of affects population the changes. extraction In addition, of spatial the minimum features time during granularity model of training. the data (10 min) and different time granularities (i.e., 1 h) were used to show the impact on the generalization effect of the different models. We also considered combining data from 4.multiple Experiments days to reveal and the Results longer-term characteristics in the time series. An additional experiment was conducted by applying extended neighborhoods that aim to include spatial informationIn the in first the model.experiment, Figure8 presents we selected the specific the training central method. cell without considering its neighbor- hood. Since the period of the data covers the Chinese National Day (1 October) and ordi- nary dates, the training was conducted separately to investigate the heterogeneity in the temporal pattern of population changes. In addition, the minimum time granularity of the data (10 min) and different time granularities (i.e., 1 h) were used to show the impact on the generalization effect of the different models. We also considered combining data from multiple days to reveal the longer-term characteristics in the time series. An additional experiment was conducted by applying extended neighborhoods that aim to include spa- tial information in the model. Figure 8 presents the specific training method.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 10 of 16

ISPRS Int. J. Geo-Inf. 2021, 10, 544 10 of 16

Figure 8. Roadmap of experiments. Figure 8. Roadmap of experiments. In total, we trained six models, in which the first four models were constructed based on a single temporal dimension, and the other two were composite models that integrated Inspatiotemporal total, we trained perspectives. six models, For purposes in which of evaluating the first four the generalization models were effect constructed of the based on a singlemodel, temporal we used two dimension, evaluation methods: and the an othe absoluter two evaluation were composite and a relative models evaluation. that integrated spatiotemporalThe absolute perspectives. evaluation applies For the purposes mean absolute of evaluating error (MAE) the as the generalization evaluation metric: effect of the model, we used two evaluation methods: an1 nabsolute evaluation and a relative evaluation. The absolute evaluation appliesJ(yi, fthe(xi; θmean)) = abso∑|ylutei − f (errorxi; θ)| (MAE) as the evaluation(8) metric: n i=1 1 where y is the true value, θ is the training weights, and f (x ; θ) is the model output. i 𝐽𝑦, 𝑓𝑥;𝜃 |𝑦 i 𝑓𝑥;𝜃| (8) Relative evaluation is a baseline assessment.𝑛 Before training the model, an ideal model based on common sense as a reasonable reference point was established. After 𝑦 𝜃 𝑓𝑥 ;𝜃 wherecomparing is the the true trained value, model is with the the training baseline weights, model, the and relative quality is of the the model model output. Relativecan be judged. evaluation Since our is research a baseline problem assess targetsment. time seriesBefore population training prediction, the model, the an ideal modelsimplest based model on common assumption sense is that as the a predicted reasonable results reference at the current point time was step inestablished. the test After comparingset are the the same trained as the valuemodel at thewith same the time baseline in the training model, set. the In this relative study, thequality division of of the model training set and testing set is unique, given the fact that our model has spatial and temporal can be judged. Since our research problem targets time series population prediction, the dependencies. Therefore, unlike the traditional method of splitting the dataset in a fixed simplestratio, model we train assumption with data for is a periodthat the of predicted time and predict results with at data the forcurrent another time period step of in the test set aretime. the All same implementations as the value in at this the study same are time completed in the via training Python programmingset. In this study, language, the division of trainingspecifically set and using testing TensorFlow set is and unique, Keras libraries.given the In fact this study,that our the hyperparametersmodel has spatial for and tem- poralall dependencies. models are the same. Therefore, We use RELUunlike as the the activationtraditional function method with of the splitting Adam algorithm the dataset in a as the optimizer (default setting). fixed ratio, we train with data for a period of time and predict with data for another period of time.4.1. TimeAll implementations Dimension in this study are completed via Python programming lan- guage, specificallyNeural network using training TensorFlow requires a and balance Keras between libraries. optimization In this and study, generalization. the hyperparam- eters Onefor all of themodels focuses are of the this same. study isWe to improveuse RELU the as generalizability the activation of thefunction model, with as the the Adam application of population prediction models requires a certain degree of robustness. Since algorithm as the optimizer (default setting). neural networks have a strong ability in data fitting, the problem of overfitting often occurs during training. We added an L2 regular term to the loss function when designing the 4.1. Timemodel Dimension to solve this problem. Another possible solution is to perform manual adjustment Neuralby regulating network the number training of learningrequires parameters. a balance The between number optimization of learning parameters and generaliza- determines the number of features that a model can learn, and the larger the number, the tion. moreOne complexof the focuses the model. of Athis small study number is to of improve parameters the can generalizability lead to underfitting, of while the model, as the applicationa high number of canpopulation lead to overfitting, prediction which model greatlys requires affects the a model’scertain generalizationdegree of robustness. Sinceability. neural Therefore, networks the have adjustment a strong of hyperparameters ability in data in fitting, studies should the proble primarilym of target overfitting this often occursparameter. during Despitetraining. that, We there added are many an L2 hyperparameters regular term into the the deep loss learning function framework when designing (e.g., batch size and epoch); only the number of learning parameters remained adjustable the model to solve this problem. Another possible solution is to perform manual adjust- in this experiment with other hyperparameters fixed to ensure comparability in this study. ment by regulatingFour models the were number trained, respectively,of learning using parameters. 1 h granularity The number data on 27 of September learning parame- ters determines2018 (Model the 1), 1number h granularity of features data on that 1 October a model 2018 can (Model learn, 2), and 10 min the granularity larger the number, the more complex the model. A small number of parameters can lead to underfitting, while a high number can lead to overfitting, which greatly affects the model’s generaliza- tion ability. Therefore, the adjustment of hyperparameters in studies should primarily tar- get this parameter. Despite that, there are many hyperparameters in the deep learning framework (e.g., batch size and epoch); only the number of learning parameters remained adjustable in this experiment with other hyperparameters fixed to ensure comparability in this study. Four models were trained, respectively, using 1 h granularity data on 27 September 2018 (Model 1), 1 h granularity data on 1 October 2018 (Model 2), 10 min granularity data

ISPRS Int. J. Geo-Inf. 2021, 10, 544 11 of 16

data on 27 September 2018 (Model 3), and 1 h granularity data on 27–29 September 2018 (Model 4). The numbers of best learning parameters are, respectively, set as 10, 10, 18, and 22. The baseline errors are 1.245, 1.636, 1.245, and 1.871, respectively. The generalization performance on the test set is shown in Tables2–5. Figure9 presents the generalization error plot of the four models on the test set. Baseline error is calculated from baseline assessment, and the test error is derived from the prediction of the trained model.

Table 2. Errors on the entire test set of the 10-parameter model by 1 h granularity on 27 September 2018 (Model 1).

Date Baseline Error Test Error 28 September 2018 1.245 1.161 29 September 2018 1.459 1.122 30 September 2018 2.31 1.161 1 October 2018 6.141 1.174 2 October 2018 6.587 1.229 3 October 2018 6.73 1.27 4 October 2018 6.626 1.314 5 October 2018 6.022 1.131 6 October 2018 5.478 1.187 7 October 2018 5.289 1.371

Table 3. Errors on the entire test set of the 10-parameter model by 1 h granularity on 1 October 2018 (Model 2).

Date Baseline Error Test Error 27 September 2018 6.141 1.216 28 September 2018 6.104 1.202 29 September 2018 5.901 1.168 30 September 2018 5.177 1.176 2 October 2018 1.636 1.286 3 October 2018 1.949 1.334 4 October 2018 2.076 1.369 5 October 2018 1.97 1.138 6 October 2018 2.393 1.102 7 October 2018 3.107 1.21

Table 4. Errors on the entire test set of the 18-parameter model by 10 min granularity 27 September 2018 (Model 3).

Date Baseline Error Test Error 28 September 2018 1.245 0.902 29 September 2018 1.459 0.997 30 September 2018 2.31 1.214 1 October 2018 6.141 2.034 2 October 2018 6.587 2.215 3 October 2018 6.73 2.42 4 October 2018 6.626 2.359 5 October 2018 6.022 2.136 6 October 2018 5.478 2.133 7 October 2018 5.289 1.92 ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 12 of 16

ISPRS Int. J. Geo-Inf. 2021, 10, 544 12 of 16

Table 5. Errors on the entire test set of the 22-parameter model by 1 h granularity during 29 Sep- temberTable5. 2018–27Errors September on the entire 2019 (Model test set 4). of the 22-parameter model by 1 h granularity during 29 SeptemberDate 2018–27 September 2019 (Model Baseline 4). Error Test Error 30 SeptemberDate 2018 Baseline 1.871 Error Test 0.963 Error 1 October 2018 5.901 0.956 30 September 2018 1.871 0.963 21 October October 2018 6.325 5.901 0.992 0.956 32 October October 2018 6.439 6.325 0.992 1.04 43 October October 2018 6.351 6.439 1.134 1.04 4 October 2018 6.351 1.134 5 October 2018 5.749 0.982 5 October 2018 5.749 0.982 66 October October 2018 5.273 5.273 0.919 0.919 77 October October 2018 5.166 5.166 0.961 0.961

Figure 9. TheThe plot of daily test errors of model-1, 2, 3, 4. According to Figure9, it is observed that with an increase in the amount of data and According to Figure 9, it is observed that with an increase in the amount of data and the decrease in time granularity, the number of optimal learning parameters of the model the decrease in time granularity, the number of optimal learning parameters of the model increases, suggesting that the model needs more parameters to express such characteristics increases, suggesting that the model needs more parameters to express such characteris- in the time dimension. When comparing Model 1 and Model 2, we found that there exists a tics in the time dimension. When comparing Model 1 and Model 2, we found that there certain periodic feature in the single-day population data, and the expression of this feature exists a certain periodic feature in the single-day population data, and the expression of on the normal date and the National Day is different. Despite such discrepancies, the thissimilarity feature of on changes the normal in population date and dynamicsthe National during Day oneis different. day is notable. Despite We such also discrepan- observed cies,that thisthe similarity rule performs of changes better in on population data of ordinary dynamics dates. during When one comparing day is notable. Model We 1 also and observedModel 3, wethat found this rule that performs a small timebetter granularity on data of doesordinary not necessarilydates. When perform comparing better. Model The 1generalization and Model 3, errorwe found of the that data a atsmall 1 h granularitytime granularity is lower does than not that necessarily of the data perform at 10 min. better. As Thein certain generalization occasions, error specific of the characteristics data at 1 h granularity exist under is a lower specific than granularity. that of the Sometimes, data at 10 min.data withAs in a finecertain time occasions, granularity specific may lose charac keyteristics features. exist In addition, under a even specific random granularity. data can Sometimes,express certain data correlations, with a fine time as some granularity scholars may have lose found key features. that some In correlationsaddition, even can ran- be domfound data even can for express data with certain complete correlations, noises as [21 some]. When scholars comparing have found Model that 1 and some Model corre- 4, lationswe found can that be found the multi-day even for data data generally with complete contain noises richer [21]. information When comparing and have Model hidden 1 andfeatures Model on 4, the we long-term found that sequence. the multi-day Given data its great generally performance, contain richer the subsequent information model and havetraining hidden is based features on Model on the 4. long-term sequence. Given its great performance, the subse- quent model training is based on Model 4.

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 13 of 16 ISPRS Int. J. Geo-Inf. 2021, 10, 544 13 of 16

4.2. Spatiotemporal Dimension 4.2. Spatiotemporal Dimension The above four models only consider a single time dimension and do not adopt the conceptThe of above neighborhood. four models In this only experiment, consider a singlewe incorporate time dimension spatial andattribute do not information adopt the intoconcept the model of neighborhood. from two perspectives: In this experiment, (1) the weclassic incorporate Moore neighborhood spatial attribute (R information= 1) (Model 5),into and the model(2) the from extended two perspectives: Moore neighborhood (1) the classic (R Moore = 1) combined neighborhood with (R extracting = 1) (Model cell 5), weightsand (2) thebased extended on road Moore networks neighborhood (Model 6). (RThe = best 1) combined learning withparameters extracting are set cell as weights 30 and based on road networks (Model 6). The best learning parameters are set as 30 and 35, 35, respectively, and the baseline errors are both 1.871. The generalization errors on the respectively, and the baseline errors are both 1.871. The generalization errors on the test set test set are shown in Tables 6 and 7. Figure 10 presents the errors for all six models. are shown in Tables6 and7. Figure 10 presents the errors for all six models. Table 6. Errors on the entire test set of the 30-parameter model by 1 h granularity and Moore neigh- borhoodTable 6. 29Errors September on the 2018–27 entire testSeptember set of the 2019 30-parameter (Model 5). model by 1 h granularity and Moore neighborhood 29 September 2018–27 September 2019 (Model 5). Date Baseline Error Test Error 30 SeptemberDate 2018 Baseline 1.871 Error Test 0.922 Error 301 SeptemberOctober 2018 2018 1.871 5.901 0.9350.922 21 October October 20182018 5.901 6.325 0.935 0.96 2 October 2018 6.325 0.96 33 October October 20182018 6.439 6.439 0.978 44 October October 20182018 6.351 6.351 1.06 55 October October 20182018 5.749 5.749 0.906 66 October October 20182018 5.273 5.273 0.845 7 October 2018 5.166 0.931 7 October 2018 5.166 0.931

Table 7. ErrorsErrors on on the the entire entire test test set setof a of 35-paramete a 35-parameterr model model by 1 h by granularity 1 h granularity and extended and extended neigh- borhood 29 September 2018–27 September 2019 (Model 6). neighborhood 29 September 2018–27 September 2019 (Model 6). Date Baseline Error Test Error Date Baseline Error Test Error 30 September 2018 1.871 0.874 30 September 2018 1.871 0.874 1 October 2018 5.901 0.899 1 October 2018 5.901 0.899 22 October October 2018 6.325 6.325 0.909 0.909 33 October October 2018 6.439 6.439 0.947 0.947 44 October October 2018 6.351 6.351 1.048 1.048 5 October 2018 5.749 0.882 5 October 2018 5.749 0.882 6 October 2018 5.273 0.813 67 October October 2018 5.273 5.166 0.813 0.867 7 October 2018 5.166 0.867

Figure 10. TheThe plot plot of daily test errors of model-1, 2, 3, 4, 5, 6.

ISPRS Int. J. Geo-Inf. 2021, 10, 544 14 of 16

We observed that the quality of the models is further improved with the consideration of spatial information. The spatial assumption of the Moore neighborhood is spatially contiguous, which limits model performance to a certain degree. In comparison, Model 6 fuses road network information, allowing more detailed spatial information to be intro- duced into the model, thereby further improving the quality of the model. However, the introduction of road network information leads to increased model complexity.

5. Discussion Population distribution prediction has always been a hot research topic in geogra- phy and other spatial science-related fields. Better population distribution knowledge is expected to benefit a wide range of domains that include smart cities [27], regional planning [28,29], transportation management [29], and disaster mitigation [30], to list a few. After reviewing traditional population distribution prediction models, we notice that traditional methods, largely relying on the structured census, tend to have coarse spa- tiotemporal granularity. In recent years, the advances in information handing techniques, the introduction of data processing approaches, and improved computational power led to the popularity of population distribution prediction with fine spatiotemporal granularity. Entering the Big Data era, scholars start to resort to machine learning and automated algorithms, aiming to address the challenges in fine-scale population modeling studies. LSTM and CA are popular approaches that, respectively, address temporal and spatial problems. Given its design, LSTM is featured by its capability in predicting events with long intervals and delays in time series, making it an ideal algorithm for population distri- bution prediction. On the other hand, CA has the capability of simulating spatiotemporal evolutions. The unique features from LSTM and CA motivate us to think about their potential integration, with CA as the upper-level framework and LSTM as the underlying algorithm to derive the conversion rules. Taking advantage of the mobile phone signaling data collected by China Unicom, we test the proposed CA-LSTM-integrated model in predicting the population distribution in Chongming District, Shanghai by setting different time granularity, timesteps, and neighboring scenarios. The results suggested that models with finer time granularity do not guarantee better performance, as data separated by finer granularity may lose key features, posing challenges for the models when they are making the prediction. We also notice that an increase in data put benefits the accuracy of the prediction, which is expected because deep learning models rely on data to establish a robust mapping between the input and the output. As for neighboring scenarios, we notice that models that consider spatial information outperform models that do not, which proves the important role of spatial dependency in population distribution prediction. In addition, we observe that models that consider spatial information characterized by road networks outperform models that consider spatial information characterized by the Moore neighborhood (spatially contiguous). This result points to the fact that population dynamics are more intertwined with road network settings than spatial closeness. Further efforts can be made to involve other spatial scenarios in population distribution prediction, aiming to better capture the factors that drive population movement. Unfortunately, the mobile phone signaling data we have only lasted 14 days, which, to some degree, limits our investigation. In addition, cellphone counts only capture a certain spectrum of the population, while people who do not have access to mobile devices or who do not carry devices outside are underrepresented. In our future work, we plan to explore the capability of the proposed model in handling a longer time period and include different time granularities. In addition, the performance of different granularities needs to be tested with more case studies. We also intend to integrate more geographic information to investigate the potential improvement of the model’s performance.

6. Conclusions Although there exist some efforts that integrate LSTM and CA models, their models lack the ability to integrate multi-source geographic information and, to our best knowl- ISPRS Int. J. Geo-Inf. 2021, 10, 544 15 of 16

edge, no effort has been made to investigate such integration in predicting the spatial distribution of population. Taking mobile phone signaling data as input and Chongming District of China as the study case, we proposed a model that integrates LSTM and CA to extract population dynamics from spatiotemporal dimensions. The integrated model is constructed in an unsupervised manner with great scalability and the capability of fusing multi-source geographic data. We also applied different time granularities and varying methods of spatial neighborhoods. Despite the great performance of the proposed model, challenges still remain. In this case study, we observed that the generalization abilities of trained models differ greatly, indicating that the training methods should be dependent on different research data types and scenarios. That is to say, different training data can lead to different best parameter settings. The internal parameters of the model should be adjusted appropriately in order to achieve optimal performance. The best model obtained in this paper is based on the training data of the extended Moore neighborhood (R = 1), combined with three-day ordinary dates, one-hour granularity, and extracting cell weights from road networks. The calculated testing error (MAE) for all cells is 0.905. The generality and robustness of the integrated model can be further improved by including more data sources. Benefiting from the proposed method of neighborhood construction, other data sources can be easily incorporated into our proposed model.

Author Contributions: Conceptualization, Pengyuan Wang and Xiang Li; methodology, Pengyuan Wang and Dong Xu; formal analysis, Pengyuan Wang and Di Zhang; investigation, Joseph Mango; resources, Xiao Huang; writing—original draft preparation, Pengyuan Wang; writing—review and editing, Xiang Li and Xiao Huang; funding acquisition, Xiang Li. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the National Natural Science Foundation of China, Grant/Award Number: 41771410; Ministry of Education of China, Grant/Award Number: 19JZD023. Data Availability Statement: The dataset and codes in this study can be found at https://figshare. com/s/ea7fcdff4e977135a20c (accessed on 25 July 2021). Conflicts of Interest: The authors declare no conflict of interest.

References 1. Wu, S.S.; Qiu, X.; Wang, L. Population estimation methods in GIS and remote sensing: A review. GIScience Remote Sens. 2005, 42, 80–96. [CrossRef] 2. Nusteling, H.P. The population of England, 1539–1873: An issue of demographic homeostasis. Hist. Mes. 1993, 8, 59–92. [CrossRef] 3. Glass, D.; Appleman, P. Thomas Robert Malthus: An Essay on the Principle of Population. Popul. Stud. 1976, 30, 369. [CrossRef] 4. Leslie, P.H. On the use of matrices in certain population mathematics. Biometrika 1945, 33, 183–212. [CrossRef][PubMed] 5. Zakria, M.; Muhammad, F. Forecasting the population of Pakistan using ARIMA models. Pak. J. Agric. Sci. 2009, 46, 214–223. 6. Lutz, W.; Sanderson, W.; Scherbov, S. Doubling of world population unlikely. Nature 1997, 387, 803–805. [CrossRef] 7. Stoto, M.A. The accuracy of population projections. J. Am. Stat. Assoc. 1983, 78, 13–20. [CrossRef] 8. Carter, L.; Lee, R. Modeling and forecasting US sex differentials in mortality. Int. J. Forecast. 1992, 8, 393–411. [CrossRef] 9. Lutz, W.; Sanderson, W.; Scherbov, S. Expert-Based Probabilistic Population Projections. Popul. Dev. Rev. 1998, 24, 139. [CrossRef] 10. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N. Prabhat Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204. [CrossRef] 11. Robinson, C.; Hohman, F.; Dilkina, B. A deep learning approach for population estimation from satellite imagery. In Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities, Redondo Beach, CA, USA, 7–10 November 2017; pp. 47–54. 12. Rumelhart, D.; Hinton, G.; Williams, R. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [CrossRef] 13. Griffith, C.S.; Swanson, D.A.; Knight, M. DOMICILE 1.0: An agent-based simulation model for population estimates at the domicile level. In Opportunities and Challenges for Applied Demography in the 21st Century; Springer: Dordrecht, The Netherlands, 2012; pp. 345–370. 14. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef][PubMed] 15. Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 855–868. [CrossRef] ISPRS Int. J. Geo-Inf. 2021, 10, 544 16 of 16

16. Zeyer, A.; Doetsch, P.; Voigtlaender, P.; Schlüter, R.; Ney, H. A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2462–2466. 17. White, R.; Engelen, G. Cellular Automata and Fractal Urban Form: A Cellular Modelling Approach to the Evolution of Urban Land-Use Patterns. Environ. Plan. A Econ. Space 1993, 25, 1175–1199. [CrossRef] 18. Itami, R. Simulating spatial dynamics: Cellular automata theory. Landsc. Urban Plan. 1994, 30, 27–47. [CrossRef] 19. Barlovic, R.; Santen, L.; Schadschneider, A.; Schreckenberg, M. Metastable states in cellular automata for traffic flow. Eur. Phys. J. B 1998, 5, 793–800. [CrossRef] 20. Esser, J.; Schreckenberg, M. Microscopic Simulation of Urban Traffic Based on Cellular Automata. Int. J. Mod. Phys. C 1997, 8, 1025–1036. [CrossRef] 21. Devi, N.S.S.S.N.U.; Mohan, R. Long Short-Term Memory with Cellular Automata (LSTMCA) for Stock Value Prediction. In Data Engineering and Communication Technology; Springer: Singapore, 2020; pp. 841–848. 22. Liu, L.; Sun, X.-K. Volcanic Ash Cloud Diffusion From Remote Sensing Image Using LSTM-CA Method. IEEE Access 2020, 8, 54681–54690. [CrossRef] 23. Liu, S.; Xue, D.; Gao, J. The Assessment of Island-Type Urban Ecosystem Based on the Environmental Carrying Capacity Model—A Case Study of Shanghai’s ChongMing Island. Adv. Mater. Res. 2012, 573–574, 325–330. [CrossRef] 24. Padró-Martínez, L.T.; Patton, A.; Trull, J.B.; Zamore, W.; Brugge, D.; Durant, J.L. Mobile monitoring of particle number concentration and other traffic-related air pollutants in a near-highway neighborhood over the course of a year. Atmos. Environ. 2012, 61, 253–264. [CrossRef] 25. Requia, W.J.; Roig, H.L.; Adams, M.D.; Zanobetti, A.; Koutrakis, P. Mapping distance-decay of cardiorespiratory disease risk related to neighborhood environments. Environ. Res. 2016, 151, 203–215. [CrossRef][PubMed] 26. Freedman, L.; Pee, D. Return to a Note on Screening Regression Equations. Am. Stat. 1989, 43, 279. 27. Garcia-Retuerta, D.; Chamoso, P.; Hernández, G.; Guzmán, A.S.; Yigitcanlar, T.; Corchado, J.M. An efficient management platform for developing smart cities: Solution for real-time and future crowd detection. Electronics 2021, 10, 765. [CrossRef] 28. Langford, M.; Higgs, G.; Radcliffe, J.; White, S. Urban population distribution models and service accessibility estimation. Comput. Environ. Urban. Syst. 2008, 32, 66–80. [CrossRef] 29. Martín, Y.; Li, Z.; Ge, Y.; Huang, X. Introducing Twitter DAILY estimates of residents and NON-RESIDENTS at the county level. Soc. Sci. 2021, 10, 227. [CrossRef] 30. Huang, X.; Wang, C.; Li, Z.; Ning, H. A 100 m POPULATION grid in the Conus by DISAGGREGATING census data with Open-source Microsoft building footprints. Big Earth Data 2020, 5, 112–133. [CrossRef]