Spatiotemporal Prediction of Theft Risk with Deep Inception-Residual Networks

smart cities

Article Spatiotemporal Prediction of Theft Risk with Deep Inception-Residual Networks

Xinyue Ye 1 , Lian Duan 2 and Qiong Peng 3,*

1 Department of Landscape Architecture & Urban Planning, Texas A&M University, College Station, TX 77840, USA; [email protected] 2 School of Geographical Sciences and Planning, Guangxi Teachers Education University, Nanning 530100, China; [email protected] 3 Department of Urban Studies and Planning and National Center for Smart Growth, University of Maryland, College Park, MD 20742, USA * Correspondence: [email protected]

Abstract: Spatiotemporal prediction of crime is crucial for public safety and smart cities operation. As crime incidents are distributed sparsely across space and time, existing deep-learning methods constrained by coarse spatial scale offer only limited values in prediction of crime density. This paper proposes the use of deep inception-residual networks (DIRNet) to conduct fine-grained, theft-related crime prediction based on non-emergency service request data (311 events). Specifically, it outlines the employment of inception units comprising asymmetrical convolution layers to draw low-level spatiotemporal dependencies hidden in crime events and complaint records in the 311 dataset. Afterward, this paper details how residual units can be applied to capture high-level spatiotemporal features from low-level spatiotemporal dependencies for the final prediction. The effectiveness of the proposed DIRNet is evaluated based on theft-related crime data and 311 data in New York City from

2010 to 2015. The results conﬁrm that the DIRNet obtains an average F1 of 71%, which is better than other prediction models.

Citation: Ye, X.; Duan, L.; Peng, Q. Spatiotemporal Prediction of Theft Keywords: crime prediction; Inception networks; residual networks; deep convolution neural Risk with Deep Inception-Residual networks; New York City Networks. Smart Cities 2021, 4, 204–216. https://doi.org/10.3390/ smartcities4010013 1. Introduction Academic Editor: Pierluigi Siano Safety is an important goal that smart cities should pursue. As Neirotti et al. stated, the Received: 24 December 2020 initiatives of smart cities are characterized by modern technology and aiming at improving Accepted: 26 January 2021 the lives of the urban residents in various domains, such as development, safety, energy, Published: 29 January 2021 etc. [1]. Furthermore, an ever-increasing volume of data provides a precious opportunity for researchers and practitioners to predict crime events and enable police departments Publisher’s Note: MDPI stays neutral to develop more effective strategies for crime prevention. Therefore, the integration of with regard to jurisdictional claims in heterogeneous big datasets and spatiotemporal prediction of crime [2,3] becomes critical. published maps and institutional affil- Spatiotemporal prediction of crimes helps law enforcement agencies identify the iations. patterns relating to the proliferation of crime [4] to efficiently deploy limited police re- sources [5,6]. However, crimes exhibit complex spatial and temporal dependencies em- bedded in dynamic spatial-social environments [7–9]. A growing list of methods has been developed to address this challenge, which includes spatiotemporal point processes [10], Copyright: © 2021 by the authors. random space-time distribution hypotheses [11], risk-terrain analysis [12], spatial regressive Licensee MDPI, Basel, Switzerland. models, and Bayesian models [13]. A variety of engineering techniques have recently been This article is an open access article developed to characterize crime-related features in order to enhance the prediction power distributed under the terms and using Foursquare data [14], Twitter data [15], 911 events [16], and taxi trajectories [17]. conditions of the Creative Commons Attribution (CC BY) license (https:// However, the challenge is that the aforementioned methods require either extensive cover- creativecommons.org/licenses/by/ age of the relevant indicators—which often suffers from insufficient data availability—or 4.0/). heavy reliance on feature-engineering processes that tend to be cumbersome.

Smart Cities 2021, 4, 204–216. https://doi.org/10.3390/smartcities4010013 https://www.mdpi.com/journal/smartcities Smart Cities 2021, 4 205

This paper proposes a novel deep learning model—a deformable image registration deep learning network derived from deep inception-residual networks (DIRNet)—that can extract the information of spatial and temporal crime dependence before fusing multiple datasets for prediction. In addition, the proposed DIRNet has deep layers in its architecture to account for complex and nonlinear relationships between inputs and a predictor. To test the proposed DIRNet, we examined New York City as a case study and predicted theft-related crime compared with four baseline machine-learning models. The evaluation results of those models illustrate that the proposed model outperforms other baseline models in crime prediction. The remainder of this paper is structured as below: Section2 reviews the studies that predict crime using both conventional methods and machine-learning approaches. Section3 describes the New York City case study and the datasets used. Section4 illustrates the procedure of the proposed DIRNet model. Section5 examines whether the proposed DIRNet performs better in crime prediction than other baseline machine-learning models. The data includes theft-related crime data and 311 data in New York City. We highlight key ﬁndings, present implications of this study, and propose a research agenda in the last section.

2. Literature Review Our research speaks to an important portion of the existing academic literature in the urban context: the use of big data approaches to predict events that are spatially and temporally dependent, including, specifically, crimes. Conventionally, scholars have used various statistical models to investigate the causality mechanism of crimes and predict the occurrence of criminal incidents. For example, Igbal et al. employed Naïve Bayes and decision trees to predict a multiclass classification using three labels (low, medium, and high) at an accuracy level of 83.95% [18]. Based on the linguistic analysis of tweets, Wang and Brown applied topic modeling methods to discover hidden semantic structures and discussion themes in a city, and enhance the crime prediction accuracy through a kernel density estimation [19]. Law et al. explored Bayesian spatiotemporal methods to analyze the changing local patterns of crime over time [13]. Chohlas-Wood et al. focused on random forest regression and Poisson regression with features built from 911 data, 311 data, weather, and building data to forecast the criminal activity [16]. Wang et al. crafted crime data, demographic information, points of interest (POIs) data, and taxi data to explain the crime rate of a neighborhood [17]. Chandrasekar et al. applied gradient boosted trees, support vector machines, and random forest models to deal with the blue-collar/white- collar crime classification issue [20]. However, these studies had an intrinsic limitation that the large amounts of data types made these research procedures difficult to be replicated for crime prediction in other cities that do not have such rich data types. At the same time, Kang and Kang developed a deep learning model to predict crime hotspots, involving the images associated with crime locations retrieved from Google Maps’ street view as well as weather, demographics, and other crime incident information [4]. In addition, Chun et al. applied a machine-learning method—deep neural networks (DNNs)— using individual criminal charge history data to estimate the likelihood of crime and the likely type of crime at the individual level [21]. Wang et al. employed deep convolutional neural networks (DCNNs) to predict urban crime distribution while the study area was divided into 16 × 16 grids—a coarse spatial scale to target policing patrol strategies [22]. To address these issues, Duan et al. developed a DCNNs-based model to predict crime risk at the finer spatiotemporal scale [23]. However, such efforts require time-consuming manual processing to tune the specific dropout paths for convolutional layers, with the unstable and lower prediction performances compared with the proposed model in this paper (detailed in Section5). This study aims to fill the aforementioned gaps through a novel deep network for end-to-end prediction, through which we can estimate theft-related crime risks using fine-grained spatial grids. Importantly, only theft-related 311 events will be employed as Smart Cities 2021, 4, FOR PEER REVIEW 3

This study aims to fill the aforementioned gaps through a novel deep network for end-to-end prediction, through which we can estimate theft-related crime risks using fine- grained spatial grids. Importantly, only theft-related 311 events will be employed as the Smart Citiesauxiliary2021, 4 geo-data for crime prediction, which can make this study highly transferable to 206 other urban areas that lack various types of auxiliary data for modeling.

3. Research Contextthe and auxiliary the Datasets geo-data for crime prediction, which can make this study highly transferable New York Cityto is otheramong urban the areas largest that cities lack various in the typesworld. of auxiliaryIt is famous data for modeling.its numerous tourist attractions and3. Research its cultural Context diversity. and the DatasetsNew York City boasts a global population, with nearly 40% of residentsNew Yorkborn City overseas. is among A therecord-high largest cities 3.3 in million the world. foreign-born It is famous forimmi- its numerous grants reside in Newtourist York attractions City, topping and its culturalall other diversity. cities worldwide New York City (U.S. boasts Census a global Bureau, population, with 2015). As identified nearlyfrom Figure 40% of residents1, Queens born and overseas. Brooklyn A record-high have a higher 3.3 million proportion foreign-born of for- immigrants eign-born residents residethan other in New boroughs. York City, Socioeconomic topping all other diversity cities worldwide is pronounced (U.S. Census in New Bureau, 2015). As identified from Figure1, Queens and Brooklyn have a higher proportion of foreign- York City as well. Neighborhoods in Manhattan, specifically Midtown, Lower Manhattan, born residents than other boroughs. Socioeconomic diversity is pronounced in New York and Upper East Side,City have as well. the Neighborhoodshighest median in household Manhattan, specificallyincome in Midtown, New York Lower City. Manhattan, It is and unsurprising that theUpper aforementioned East Side, have neighbor the highesthoods median also householdhave the highest income median in New Yorkrents City. It is and median home values,unsurprising as evidenced that the aforementioned by the two subfigures neighborhoods of Figure also have 1 featured the highest below. median rents The distribution of socioeconomicand median home inequality values, as evidencedby race and by thehousehold two subfigures income of Figuremight1 featuredaffect below. the geography of crimeThe rates distribution [24]. We of socioeconomicare interested inequality in theft-related by race andcrime household research income because might affect the geography of crime rates [24]. We are interested in theft-related crime research because (1) New York City is heavily featured in crime-related research; numerous studies in the (1) New York City is heavily featured in crime-related research; numerous studies in the criminology literaturecriminology evaluate literature New York evaluate City; New and York (2) City; New and (2)York New City’s York City’stheft-related theft-related crime crime data and 311 data andare 311publicly data are accessible. publicly accessible.

Figure 1. Maps that convey characteristics of demographics and housing in New York City (left top: the proportion of foreign-bornFigure 1: residents;Maps that right convey top: median characteristics household of income; demo leftgraphics bottom: and median housing rent; rightin New bottom: York median City (left home value). top: the proportion of foreign-born residents; right top: median household income; left bottom: median rent; right bottom: median home value) Smart Cities 2021, 4, FOR PEER REVIEW 4

The datasets used in this paper were downloaded from the New York City Open Data website over the span of 1 January 2010 to 31 December 2015, to include crime data and 311 data. The datasets are illustrated in Figure 2. Crime data: There were 825,064 incidents for three theft-related types: burglary, grand larceny, and petit larceny. Each incident had the properties of offense type, geographic coordinates, and time. The density of theft-related incidents is shown in the left subfigure of Figure 2. From the subfigure, we can see that downtown Manhattan had the highest theft-related incident density in the city. 311 data: The reason why we include 311 data is that civil disorder and conducts hostile to society could be associated with wrongdoing activities [25]. Hence, complaint records from the 311 dataset can be an explainer and predictor for crime events. This da-

Smart Citiestaset2021 includes, 4 1,069,786 complaint records (311 events) about three theft-related types: 207 street condition, street light condition, and dirty conditions. Each complaint record incorporated coordinates, complaint type, and the time at which the complaint was documented. For the density Theof the datasets complaint used in thisrecords, paper weresee the downloaded right subfigur from thee Newof Figure York City 2. This Open Data subfigure shows thatwebsite Brooklyn over the and span the of 1Bronx January have 2010 the to 31 highest December density 2015, to of include theft-related crime data and complaints. 311 data. The datasets are illustrated in Figure2.

Figure 2. Theft related crime (left) and theft-related 311 (right) incidents intensities of New York City from 2012 through 2015.Figure A unit 2: Theft is the total related number crime of incidents (left) and per theft-related grid. 311 (right) incidents intensities of New York City from 2012 through 2015. A unit is the total number of incidents per grid. Crime data: There were 825,064 incidents for three theft-related types: burglary, grand 4. Methods larceny, and petit larceny. Each incident had the properties of offense type, geographic coordinates, and time. The density of theft-related incidents is shown in the left subfigure Before illustratingof Figure the method2. From the we subfigure, developed, we can some see thatbasic downtown concepts Manhattan need to be had speci- the highest fied as follows: theft-related incident density in the city. Definition 1 (Region):311 The data: study The reason area whyis partitioned we include into 311 data disjointed is that civil 80 × disorder 80 grids, and G conducts = hostile to society could be associated with wrongdoing activities [25]. Hence, complaint {g1, g2, ⋯, g80×80}. Eachrecords grid fromin G the is 311considered dataset can abe separate an explainer region. and predictor The area for of crime each events. region This is dataset about 0.28 km2 (0.70includes km × 0.46 1,069,786 km). complaint records (311 events) about three theft-related types: street Definition 2 (Spatialcondition, neighbor street light set): condition, The target and dirty region conditions. is the Eachfocus complaint region record where incorporated the researchers aim to predictcoordinates, whether complaint a crime type, is andmost the possible time at which to take the place. complaint The was s × documented.s regions For the density of the complaint records, see the right subfigure of Figure2. This subfigure that geographically showssurround that Brooklynthe target and region the Bronx r, which have the is highestcentered density in the of theft-relateds × s regions, complaints. are called the spatial neighbor set of g. Definition 3 (Time4. Methods window): the duration of M days is thought of as a time window. Before illustrating the method we developed,g someg basic concepts need to be specified Problem formulation: Given history observation X= {Xt , Xt1,…, our goal is to esti- as follows: g mate the probability distribution P (y |X) to ascertain whether there will be crime(s) in tk g day t+k for the targetDefinition region 1g.. (Region):If P (y The= study1|X) area>=0.5, is partitioned then it may into disjointedbe considered 80 × 80 grids,a high G = {g , tk 1 g2, ... , g80×80}. Each grid in G is considered a separate region. The area of each region is about crime risk, else it may0.28 be km regarded2 (0.70 km ×as0.46 a low km). crime risk. We observe that the crime study suffers from a data imbalance problem. Ninety-three percent of samples wereDefinition labeled 2. (Spatial as 0, indicating neighbor set): a Theclass target imbalance region is the problem focus region in our where dataset. the researchers aim to predict whether a crime is most possible to take place. The s × s regions that geographically surround the target region r, which is centered in the s × s regions, are called the spatial neighbor set of g.

Deﬁnition 3. (Time window): the duration of M days is thought of as a time window. Smart Cities 2021, 4 208

g g o Problem formulation: Given history observation X= {Xt ,Xt−1,... , our goal is to g estimate the probability distribution P (yt+k|X) to ascertain whether there will be crime(s) g in day t+k for the target region g. If P (y = 1|X) >=0.5, then it may be considered a high Smart Cities 2021, 4, FOR PEER REVIEW t+k 5 crime risk, else it may be regarded as a low crime risk. We observe that the crime study suffers from a data imbalance problem. Ninety-three percent of samples were labeled as 0, indicating a class imbalance problem in our dataset. This study addressed this issue in a two-fold manner. First, the crime-label is a dummy variable, 𝑦g ∈0,1. Value of 0 indicates no crime took place in the region on day t, while variable, yt ∈ {0, 1}. Value of 0 indicates no crime took place in the region on day t, while thethe value value of of 1 1 indicates indicates some some crimes crimes took place in the region g on day t. There is a severe classclass imbalance. Grids Grids in in which which crime crime has has occurred occurred are much less than grids in which no crimecrime has occurred. To To account for this imbalance, we, secondly, under-sampledunder-sampled 0-label data, as Chohlas-Wood et al. did in their stud studyy [16], [16], and augmented the sample size of the 1-label sample based on the synthetic minorityminority oversampling techniquetechnique(SMOTE (SMOTE) ) [[26].26].

4.1. DIRNet DIRNet A deformable image image registration registration deep deep learning learning network network (DIRNet) (DIRNet) is iscomposed composed of of a convolutionala convolutional neural neural network network repressor, repressor, a spatial a spatial transformer, transformer, and anda resampler. a resampler. The con- The volutionalconvolutional neural neural network network investigates investigates a pair a pair of offixed fixed and and moving moving images images and and outputs parameters forfor thethe spatialspatial transformer, transformer, which which produces produces the the displacement displacement vector vector that that enables ena- blesthe resamplerthe resampler to adjust to adjust the the moving moving image image to theto the fixed fixed image. image. In In this this case, case, we we propose propose a aDIRNet. DIRNet. This This section section first first presents presents the the DIRNet DIRN architectureet architecture and and then then explains explains how how its key its keycomponents components take take spatiotemporal spatiotemporal dependencies dependencies into into account. account. Architecture Figure3 3 presents presents an an overview overview of theof the proposed proposed DIRNet DIRNet architecture, architecture, which which is composed is com- posedof three of functionalthree functional modules modules handling handling the crime the datasetcrime dataset and the and 311 the dataset 311 dataset as well as as well the asfusion the fusion of these of two these datasets. two datasets. The top The left top branch left branch of DIRNet of DIRNet is responsible is responsible for building for build- latent ingrepresentations latent representations using inception using unitsinception (IncUnits) units (IncUnits) for crime data.for crime Similarly, data. theSimilarly, lower-left the lower-leftbranch is branch used to is extract used to 311 extract spatiotemporal 311 spatiotemporal dependencies. dependencies. Both outputsBoth outputs are fed are into fed intothe sub-module,the sub-module, which which mainly mainly consists consists of nine of nine residual residual units units (ResUnits) (ResUnits) to capture to capture the thehigh-level high-level interactions interactions of crimeof crime and and 311 311 events. events. Finally, Finally, the the networks networks end end withwith aa global average pooling layer and a two-way dense laye layerr with a Softmax function to realize crime prediction.

FigureFigure 3. 3. CriminalCriminal ( (leftleft) and 311 ( right) incidents intensities in New York CityCity fromfrom 20122012 throughthrough 2015.2015.

4.2. Spatiotemporal Spatiotemporal Dependencies Dependencies Represented Represented in Convolutions This section outlines outlines the the utilities utilities of of various various convolution convolution layers layers in in their their distributed distributed representationrepresentation of the spatiotemporal dependencies for the crime.

4.2.1. Convolution Layer In the convolution layer, the convolution operation acts as the inner product of the linear kernel followed by rectified linear units (ReLU) [27], as a nonlinear activation function f, Smart Cities 2021, 4 209

4.2.1. Convolution Layer In the convolution layer, the convolution operation acts as the inner product of the linear kernel followed by rectiﬁed linear units (ReLU) [27], as a nonlinear activation Smart Cities 2021, 4, FOR PEER REVIEW function f, 6 Nl ! T fl(x) = f ∑ wljxj + bl = f wl xl + bl (1) j=1

where l indicates the lth layer, Nl indicates the size of the kernel, xj indicates the entry 𝑓 (𝑥) = 𝑓 𝑤 𝑥 +𝑏=𝑓(𝑤𝑥 +𝑏) covered by the linear kernel in the input feature map, wl is the learnable(1) weight in the linear kernel, and bl is the linear bias. A particular convolutional operation (1) decides whether a neuron in a current layer can be activated, depending on the local neurons in the input where l indicates the lth layer, 𝑁 indicates the size of the kernel, 𝑥 indicates the entry feature map, which can then be used to capture a particular spatiotemporal relationship covered by the linear kernel in the input feature map, 𝑤 is the learnable weight in the among crime-related events. linear kernel, and 𝑏 is the linear bias. A particular convolutional operation (1) decides whether a neuron in a current layer can be activated, depending on the local neurons in 4.2.2. Pooling Layer the input feature map, which can then be used to capture a particular spatiotemporal relationship among Therecrime-related is a need events. to summarize the local spatiotemporal features to reduce the sensitivity 4.2.2. Poolingor Layer tune out noise related to crime incidents [28]. On the other hand, pooling layers usually serve to control overfitting by reducing the number of nonsignificant neurons. There is a need to summarize the local spatiotemporal features to reduce the sensitivity or tune4.2.3. out noise Concat related Layer to crime incidents [28]. On the other hand, pooling layers usually serve to control overfitting by reducing the number of nonsignificant neurons. To combine multiple correlations in crime risk, this study uses the merging layer to 4.2.3. Concat layer collect multiple output feature maps from different layers into a “thick” feature map [29,30]. To combine multiple correlations in crime risk, this study uses the merging layer to collect multiple4.2.4. output Dense feature Layer maps from different layers into a “thick” feature map [29,30]. In a dense layer, each entry of the input feature map is connected to each of its 4.2.4. Dense layerneurons. The dense layer can model global patterns in the whole input feature map, thus In a denserepresenting layer, each entry interactions of the input of crime-related feature map is incidents connected globally. to each of its neurons. The dense layer can model global patterns in the whole input feature map, thus representing interactions4.2.5. Inception of crime-related Unit and incidents Residual globally. Unit 4.2.5. Inception unitMutual and residual influences unit of geo-events between distant locations and long periods can be Mutual representedinfluences of bygeo-events overlapping between convolution distant locations layers [and31]. long However, periods direct can be stacking of convolu- represented bytion overlapping layers only conv makesolution the layers enlarged [31]. However, network moredirect pronestacking to of overfitting convolu- [32]. To overcome tion layers onlythis makes weakness, the enlarged this study network developed more prone the IncUnit to overfitting and ResUnit [32]. To toovercome improve crime prediction this weakness,performance. this study developed the IncUnit and ResUnit to improve crime prediction performance. We construct an inception unit obeying the principle [32] as shown in Figure4, by We constructstacking an inception asymmetric unit convolutionobeying the principle layers in[32] three as shown branches in Figure to capture 4, by the information stacking asymmetricdescribed convolution below: layers in three branches to capture the information described below: Input

Conv，1×1，64 Conv，1×1，64 Conv，1×1，64 Max，3×3

Conv，1×1，64 Conv，1×1，64

Conv，1×1，64 Concat

Figure 4: Inception unit. Figure 4. Inception unit.

Such asymmetric neural structures can express the various trends in different local spatiotemporal scopes while producing cheaper computations with fewer parameters. Smart Cities 2021, 4 210

Smart Cities 2021, 4, FOR PEER REVIEW 7

Such asymmetric neural structures can express the various trends in different local spatiotemporal scopes while producing cheaper computations with fewer parameters. According to residual networks [[33],33], thisthis studystudy buildsbuilds aa residualresidual unitunit fromfrom everyevery threethree stackedstacked convolutionconvolution layers layers with with asymmetric asymmetric kernels kernels (e.g., (e.g., 1 1× ×3 3 and and 3 3× × 11 kernels),kernels), toto useuse fewerfewer parametersparameters whilewhile stillstill increasingincreasing expression expression ability, ability, as as shown shown in in Figure Figure5. 5.

FigureFigure 5.5. ResidualResidual unit.unit. This residual network structure can capture very subtle spatiotemporal relationships This residual network structure can capture very subtle spatiotemporal relationships between geo-events with more overlapping convolution layers. Otherwise, deep structures between geo-events with more overlapping convolution layers. Otherwise, deep struc- usually lead to “gradient vanishing,” while shallow structures do not point to sufﬁciently tures usually lead to “gradient vanishing,” while shallow structures do not point to suffi- effective relationships among geo-events. Both cases will result in prediction performance ciently effective relationships among geo-events. Both cases will result in prediction per- degradations. formance degradations. 5. Experiments and Results 5. Experiments and Results This section validates the proposed DIRNet on the crime and 311 datasets of New York CityThis andsection compares validates the performancesthe proposed DIRNet against fouron the baseline crime models.and 311 datasets of New York City and compares the performances against four baseline models. 5.1. Experimental Setup 5.1. ExperimentalPython 3.5, TensorFlowSetup 1.01, Keras 1.0, and CUDA 8.0 were used to build CNNs modelsPython on a desktop 3.5, TensorFlow computer 1.01, with Keras an Intel 1.0, i7 and 3.4GHz CUDA CPU, 8.0 32GB were memory, used to and build NVIDIA CNNs GT1080models TIon GPU.a desktop For DIRNet,computer samples with an were Intel fedi7 3.4GHz into the CPU, model 32GB with memory, a mini-batch and NVIDIA size of 64.GT1080 All the TI convolutionGPU. For DIRNet, layers usedsamples ReLU were activation fed into functionsthe model with with batch a mini-batch normalization size of technology.64. All the convolution We use the layers He initialization used ReLU methodactivation to functions initialize allwith of batch the weights normalization in the deeptechnology. learning We model, use the which He initialization rely on the binarymethod cross to initialize entropy all and of the the weights adaptive inmoment the deep estimationlearning model, (Adm) which methods rely ason its the loss binary function cross and entropy optimizer, and the respectively. adaptive moment estimation (Adm) methods as its loss function and optimizer, respectively. 5.2. Evaluation Metrics for Machine Learning 5.2. EvaluationTo assess Metrics the performance for Machine of Learning a machine-learning model, researchers can draw a confusionTo assess matrix the andperformance use an accuracy of a machine-le indicatorarning to determine model, researchers how often can the draw model a con- is correct.fusion matrix The problem and use of an using accuracy the accuracy indicator indicator to determine is that how the often result the is model not meaningful is correct. whenThe problem a severe of class using imbalance the accuracy exists. indicator This study is that falls the into result this is scenario. not meaningful For example, when if a the positives are grids in which crime has occurred and the negatives are grids in which severe class imbalance exists. This study falls into this scenario. For example, if the posi- no crime has occurred, an accuracy rate of 99% means that the outcome is correct 99% tives are grids in which crime has occurred and the negatives are grids in which no crime of the time across all classes. However, the number of grids representing crime is much has occurred, an accuracy rate of 99% means that the outcome is correct 99% of the time lower than the number of grids without crime. For the positive class, this data may be across all classes. However, the number of grids representing crime is much lower than current as little as 30% of the time even though the accuracy rate is considered to be 99%. the number of grids without crime. For the positive class, this data may be current as little Therefore, we prefer to use another indicator integrating the precision and recall—the F1 as 30% of the time even though the accuracy rate is considered to be 99%. Therefore, we prefer to use another indicator integrating the precision and recall—the F1 score (Precision = True positives/ (True positives + False positives), while recall = True positives/ (True positives + False negatives). F1 score combines precision and recall). Smart Cities 2021, 4 211

Smart Cities 2021, 4, FOR PEER REVIEW 8 score (Precision = True positives/ (True positives + False positives), while recall = True positives/ (True positives + False negatives). F1 score combines precision and recall).

precision × recall F1 = 2 × precision× recall F12=×precision + recall precision+ recall A good F1 score shows that both low false positives and low false negatives exist, A good F1 score shows that both low false positives and low false negatives exist, which allows the classes to be correctly identiﬁed. The higher the F1 score is, the more robustwhich theallows model’s the classes performance. to be correctly (There areiden multipletified. The dimensions higher the in F1 model score comparisonis, the more inrobust addition the model’s to the F1 performance. score, such (There as the executionare multiple time. dimensions We did notin model compare comparison execution in timeaddition among to the models, F1 score, simply such because as the theexec trainingution time. time We that did either not compare the proposed execution model time or comparedamong models, models simply cost is because less than the 3 hourstraining on ti ame standard that either PC. Therefore, the proposed the executionmodel or timecom- ispared not a models major concern cost is less in this than analysis). 3 hours on a standard PC. Therefore, the execution time is not a major concern in this analysis). 5.3. Baseline Machine-Learning Models 5.3. BaselineWe select Machine-Learning four machine-learning Modelsmodels that are recently and frequently implemented in predictingWe select crime four as machine-learning the baseline machine models models. that The are baseline recently machine-learning and frequently models imple- aremented in comparison in predicting with crime our proposed as the baselin approache machine with regardsmodels. toThe the baseline F1 score: machine-learn- •ing modelsSupport are vector in comparison machines (SVMs)with our [ 20proposed]: Following approach the literature, with regards we ran to the SVMs F1 score: using • theSupport Gaussian vector radial machines basis (SVMs) function(RBF) [20]: Following kernel the to literature, map the we original ran SVMs features using tothe a Gauss- high- dimensionalian radial basis feature function(RBF) space. kernel to map the original features to a high-dimensional feature space. • Random Forests: Random Forests is a popular ensemble-learning method that effec- • Random Forests: Random Forests is a popular ensemble-learning method that effectively han- tivelydles high-dimensional handles high-dimensional data to scale dataup to tomany scale samples up to many[16,20]. samples [16,20]. •• ST-ResNetST-ResNet [31]: [31]: Spatio-temporal Spatio-temporal residual residual networ networkk (ST-ResNet) (ST-ResNet) can be deemed can be deemedas a variant as aof variantDIRNet. ofWe DIRNet. also call it WeDIRNet-Inception. also call it DIRNet-Inception. ST-ResNet mainly contains ST-ResNet ResUnits mainly without contains incep- ResUnitstion blocks, without as shown inception in Figure blocks, 6. as shown in Figure6.

Conv,3×3,64 IncUnit*4 Max,3×3

311 Concat Result

Dense,128

Crime Conv,3×3,64 IncUnit*4 Max,3×3

FigureFigure 6.6. DeepDeep inception-residual inception-residual networks. networks • • STCN:STCN: Spatiotemporal crime crime network network (STCN) (STCN) applies applies inception inception networks networks and fractal and networks fractal networkssimultaneously simultaneously to forecast crime to forecast risks. The crime parameters risks. of The STCN parameters implemented of STCNin this case imple- are mentedthe same in as this the best case model are the described same as in the Duan best et model al. [23]. described in Duan et al. [23].

5.4.5.4. PerformancePerformance ComparisonComparison ThisThis sectionsection presentspresents experimentalexperimental resultsresults andand givesgives insightsinsights onon thethe effectivenesseffectiveness ofof these approaches. these approaches. 5.4.1.5.4.1. ComparativeComparative PerformancePerformance StudiesStudies onon TimeTime WindowWindow N N InIn thisthis experiment, experiment, the the size size of of the the spatial spatial neighbor neighbor set set was was ﬁxed fixed at 3 at× 33. × The 3. The predictive predic- resultstive results of models of models with awith positive a positive F1 and F1 an and increasing an increasing timewindow time window n have n beenhave depicted been de- inpicted Figure in7 .Figure The horizontal 7. The horizontal axis of Figure axis 7of indicates Figure 7 theindicates amounts the of amounts days that of we days take that into we take into considerations when we train the model. For example, if n = 10, it means we use the last 10 days’ crime data and 311 data to train the model. Every time the time window is changed, the model has to be retrained. In this test, we have 10 time windows. Hence, Smart Cities 2021, 4 212 Smart Cities 2021, 4, FOR PEER REVIEW 9

considerations when we train the model. For example, if n = 10, it means we use the last 10 wedays’ trained crime datathe model and 311 10 data times. to train The the vertical model. axis Every of timeFigure the 7 time indicates window the is F1 changed, score. As the the model has to be retrained. In this test, we have 10 time windows. Hence, we trained time window increases, more and more crime samples that occurred during the increased the model 10 times. The vertical axis of Figure7 indicates the F1 score. As the time window windowincreases, are more included and more in crime training samples the models that occurred. Figure during 7 shows the increased that the window proposed are DIRNet approachincluded inoutperforms training the other models. baseline Figure models7 shows across that the various proposed time DIRNet windows. approach This suggests thatoutperforms the DIRNet other model baseline is models capable across of retrievi variousng time the windows. spatial pattern This suggests that thatunderlies the both short-timeDIRNet model span is capablegeographical of retrieving data and the spatiallong-time pattern span that geographical underlies both data. short-time More specifically,span geographicalthere is more data than and a long-time2% improvement span geographical in F1 in comparison data. More speciﬁcally, with STCN, there which is is the bestmore classifier than a 2% among improvement the baseline in F1 in models. comparison Since with the STCN, SPPT whichmethod is thealways best classiﬁer analyzes crime eventsamong over the baseline an entire models. area to Since yield the a SPPTpredicti methodon, its always F1 value analyzes is constant crime eventsand the over prediction an entire area to yield a prediction, its F1 value is constant and the prediction data are the data are the least reliable. For most algorithms, prediction performance improves as the least reliable. For most algorithms, prediction performance improves as the time window timeincreases window to a certain increases threshold, to a certain due to more threshold, temporal due patterns to more or trends temporal being patterns discovered or trends beingas more discovered historical crimesas more and historical 311 data crimes continue and to be311 fed data into continue models. Itto is be worth fed into noting models. It isthat worth the F1 noting of DIRNet that isthe still F1 higher of DIRNet than all is baselines still higher when than the all length baselines of the time when window the length of thereaches time 100 window days, proving reaches that 100 the days, proposed proving model that has the the proposed best ability model to reveal has the distant best ability totemporal reveal dependenciesdistant temporal from dependencies sparse data. from sparse data.

Figure 7. Comparison of F1 scores of models across time windows. Figure 7. Comparison of F1 scores of models across time windows. 5.4.2. Comparative Performance Studies on Spatial Ranges 5.4.2.The Comparative influence of Performance varying spatial Studies ranges on (size Spatial of the Ranges neighbor set) on the positive F1 is shownThe influence in Figure8 of. Here,varying the spatial length ranges of the time (size window of the neighbor is fixed at set) 60 on days the for positive all F1 is models. The horizontal axis of the figure indicates spatial windows while the vertical axis shown in Figure 8. Here, the length of the time window is fixed at 60 days for all models. demonstrates F1 scores. As shown in Figure8, the proposed DIRNet approach achieves Thethe highesthorizontal F1 score axis amongof the figure all the indicates other models, spatial most windows of which while obtain the the vertical lower F1axis by demon- strates6% compared F1 scores. to the As DIRNet.shown in This Figure suggests 8, the the proposed DIRNet DIRNet model is approach more able achieves to capture the high- estinformation F1 score ofamong spatial all dependency the other than models, other modelsmost of across which various obtain spatial the lower windows. F1 by The 6% com- paredSPPT method to the DIRNet. offers the This worst suggests performance. the DIRNet When the model spatial is sizemore ranges able fromto capture 1 to 3, mostinformation ofmodels spatial experience dependency a performance than other promotion models in F1,across demonstrating various spatial that the compositionwindows. The of SPPT methodadjacent offers regions the is importantworst performance. for capturing When complicated the spatial spatial size patterns. ranges However,from 1 to as3, most the mod- elsspatial experience range continues a performance to increase, promotion the performance in F1, demonstrating of each of the overall that modelsthe composition degrades. of ad- jacentThe reasons regions for thisis important may be that for crime capturing and 311 intensitiescomplicated in a spatial broader patterns. spatial range However, tend to as the fluctuate more and have a less significant spatial dependency on a daily temporal scale. spatial range continues to increase, the performance of each of the overall models degrades. The reasons for this may be that crime and 311 intensities in a broader spatial range tend to fluctuate more and have a less significant spatial dependency on a daily temporal scale. SmartSmart Cities Cities 20212021, 4,, 4FOR PEER REVIEW 213 10

Figure 8. Comparison of F1 scores of models across various spatial windows. Figure 8. Comparison of F1 scores of models across various spatial windows. 5.5. Discussion 5.5. DiscussionThe aforementioned results show that the proposed deep learning network, DIRNet, outperformsThe aforementioned other models in results terms of show F1. The that reason the proposed is that the DIRNetdeep learning model extracts network, the DIRNet, information of spatial-temporal dependence before merging crime data and 311 data. As outperforms other models in terms of F1. The reason is that the DIRNet model extracts such, the spatial-temporal information can be used in other layers to increase the prediction theaccuracy information of crime. of In spatial-temporal contrast, the STCN dependence model combines before crime merging data and crime 311 data atand the 311 data. Asfirst such, layer, the and spatial-temporal fails to extract information information on spatial-temporal can be used in dependence other layers of to crime increase from the pre- dictionthe data; accuracy thus, the STCNof crime. model In hascontrast, a lower the prediction STCN accuracymodel combines rate than the crime DIRNet data model. and 311 data atAnother the first baseline layer, model,and fails ST-ResNet, to extract effectively information extracts on informationspatial-temporal on spatial-temporal dependence of crime fromdependence the data; of thus, crime the before STCN merging model the has crime a lowe datar andprediction 311 data. accuracy However, rate this than model the DIRNet model.offers fewer Another layers baseline in its neural model, network ST-ResNet, than our proposedeffectively DIRNet, extracts which information thus results on in spatial- temporala lower prediction dependence accuracy of crime rate. before merging the crime data and 311 data. However, this This paper shows that non-urgent social events, like those conveyed by 311 data, model offers fewer layers in its neural network than our proposed DIRNet, which thus often correlate with worse events, such as crimes, but the complex space-time relationships resultsbetween in them a lower are oftenprediction difficult accuracy to obtain. rate. There are many categories into which 311 dataThis can be paper classified shows (e.g., that noise, non-urgent quarrel, and social water ev leakages).ents, like Determiningthose conveyed which by of 311 the data, of- ten311 correlate data categories with worse can be usedevents to, efficientlysuch as crimes, indicate but crimes—as the complex well asspace-time corresponding relationships betweenresearch—is them further are often needed difficult in future to studies. obtain. There are many categories into which 311 data can beAs classified the momentary (e.g. noise, retrospective quarrel, features and water from leakages). crimes and Determining 311 calls are scantwhich and of the 311 dataunlikely categories to provide can enough be used resolution to efficiently for prompt indicate forecasting crimes—as [16], advanced well as modelscorresponding are re- usually preferred for long-run features following the ordinary natures of periods or seasons. search—is further needed in future studies. The properties of the crime series distributions in the forms of mean, median, standard deviation,As the and momentary Shannon entropy retrospective are important features in thefrom characterization crimes and 311 of crimecalls are intensity scant and un- likelyfluctuations. to provide Such tailor-madeenough resolution features canfor decreaseprompt computationalforecasting [16], complexity advanced and models feature are usu- allyspace, preferred while still for preserving long-run useful features data following interpretability. the ordinary Even more, natures existing of research periods also or seasons. Theshows properties that the usage of the of crime human series behavioral distributi data [ons14], in rich the statistical forms of demographical mean, median, data, standard deviation,and weather and data Shannon [31] all provide entropy finer are temporal important and spatial in the resolution characterization for comprehensively of crime intensity fluctuations.understanding Such local tailor-made crime. As a features consequence, can decrease future work computational will involve multiple complexity data and fea- turesources space, and while leverage still the preserving statistical index useful from data time interpretability. series to produce Even deeper more, retrospective existing research features. also shows that the usage of human behavioral data [14], rich statistical demographical Unequal class distribution representation in data often emerges in urban investiga- data,tions, and as happened weather indata this [31] study. all Itprovide tends to finer make temporal a statistical and model spatial or classifier resolution heavily for compre- hensivelybiased towards understanding the majority local class crime. [34]. To As address a consequence, this challenge, future previous work will studies involve in the multiple datacrime sources analysis and domain leverage applied the several statistical policies index to reach from the time minority series class. to produce For example, deeper retro- spectiveKadar et features. al. split crime categories into “violent” and ”non-violent crimes” classes [14]. GerberUnequal only included class distribution the locations representation with at least one in crimedata often during emerges a long-term in urban period investiga- tions,as samples as happened [15]. However, in this these study. methods It tends only to adopted make a samplesstatistical with model at least or one classifier crime heavily biased towards the majority class [34]. To address this challenge, previous studies in the crime analysis domain applied several policies to reach the minority class. For example, Kadar et al. split crime categories into “violent” and ”non-violent crimes” classes [14]. Gerber only included the locations with at least one crime during a long-term period as Smart Cities 2021, 4 214

incident and thus could not be used to predict when and where there would be no crime. For this study, the researchers augmented the sample size of the minority class through the over-sampling method of SMOT [26] while reducing the majority class through an under- sampling method. Therefore, this approach of generating synthetic samples to alleviate the imbalance problem saw improved results and appears to offer a more robust solution that is applicable to the real world. However, in the experiment, lowering the under-sampling ratio to a value such as 0.2 (indicating that the number of 0 class is large), would lead to an increase in accuracy of the 0 class but the decline in accuracy of the 1 class as well, especially when using the DIRNet model. On the contrary, when the under-sampling ratio is large, such as at a value of 0.8 (indicating that the number of 0 class is small), the accuracies of both the 0 and 1 classes decrease slightly. This may happen as the decrease in samples of both the 0 and 1 classes undermines the capacity of the model to be used to explore more valid spatiotemporal patterns. If the under-sampling ratio was set in the range of 0.2–0.8, there would be no obvious variation pattern in the predictive performance in accuracy or the F1. In practice, we set the under-sampling ratio to 0.5.

6. Conclusions This paper proposes a novel deep-learning model that can offer end-to-end learn abstract latent features for fine-grain crime prediction. The proposed model can extract information on spatial and temporal dependence before fusing the crime data and 311 data, has deep layers, and hence may have a good performance in crime prediction. Experiments using real datasets validated the effectiveness of the proposed DIRNet by comparing the model with four other baseline machine-learning models. Machine-learning plays a vital role in the built environment and planning fields (e.g., [35]). This study, centering on a novel deep learning approach has critical implications in academia and practice. (1) It shows that researchers should extract information on spatial and temporal dependence before merging multiple datasets in order to apply the proposed neural network to predict events that are spatially and temporally dependent. This study also indicates that a neural network with deeper layers could be applied to achieve accurate predictions as long a large sample size is used. (2) The proposed DIRNet can accurately predict crime in such a way that could benefit policing strategies and crime reduction. One such example application for this technique might be hotspot policing (directed patrol). To reduce hotspots and peak times for criminal activity, a more precise policy patrol presence should concentrate on those hotspots at time windows. With the potential accuracy of crime hotspot forecasting, other strategies, such as broken windows policing, zero tolerance policing, rapid response, and increase police numbers can also be applied to reduce criminal activity. (3) The proposed DIRNet can not only be used to predict crime but it can also be implemented to predict other events in urban areas, such as traffic accidents, etc. that are spatially and temporally dependent. Even more, the proposed DIRNet could capture those dependencies and thus serve a unique role in traffic accident prediction. However, some limitations should be noted. First, we did not take other factors into account and these factors may influence the crime rate, such as weather conditions. Second, we did not subcategorize the 311 compliant records. Some subcategories of 311 complaints may have better explanatory power of crime incidents than the other subcategories. Here we would like to see if the proposed neural network can automatically capture the spatial patterns of crime incidents that underlies noisy large-scale data (unrefined crime data and unrefined 311 data), and has a high F1 score, and it does. The acknowledgment of the research limitations also inspires the directions of future research. (1) Researchers could apply and test the proposed neural network for crime prediction in other cities; (2) more factors that may influence crime rate could be included in crime prediction; and (3) the proposed neural network could be developed further to take advantages of prior knowledge of crime, which will promote the prediction accuracy. Smart Cities 2021, 4 215

Big data applications hold promise for addressing both shortfalls of existing predictive methods and potential dangers. The successful advancement of the criminology of place requires elevating the “why” question to equal status with those of “where” and “what” in the analysis of crime [3]. As such, our future research efforts will further investigate the “why” factors, along with the advances of deep-learning approaches.

Author Contributions: Conceptualization, X.Y. and L.D.; methodology, X.Y. and L.D.; software, L.D.; validation, L.D.; formal analysis, X.Y., L.D., and Q.P.; original draft preparation, X.Y., L.D., and Q.P.; Writing—Review and editing, X.Y., Q.P., and L.D. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the National Natural Science Foundation of China (grant number 41961062) and Guangxi Natural Science Foundation (grant number 2018JJA150089). Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: https://opendata.cityofnewyork.us/. Conﬂicts of Interest: The authors declare no conﬂict of interest.

References 1. Neirotti, P.; De Marco, A.; Cagliano, A.C.; Mangano, G.; Scorrano, F. Current trends in Smart City initiatives: Some stylised facts. Cities 2014, 38, 25–36. [CrossRef] 2. Butt, U.M.; Letchmunan, S.; Hassan, F.H.; Ali, M.; Baqir, A.; Sherazi, H.H.R. Spatio-Temporal Crime HotSpot Detection and Prediction: A Systematic Literature Review. IEEE Access 2020, 8, 166553–166574. [CrossRef] 3. Hossain, S.; Abtahee, A.; Kashem, I.; Hoque, M.M.; Sarker, I.H. Crime Prediction Using Spatio-Temporal Data. arXiv 2020, arXiv:2003.09322. preprint. 4. Kang, W.H.; Kang, H.-B. Prediction of crime occurrence from multi-modal data using deep learning. PLoS ONE 2017, 12, e0176244. [CrossRef] 5. Chainey, S. The Crime Prediction Framework—A Spatial Temporal Framework for Targeting Patrols, Crime Prevention and Strategic Policy. In Proceedings of the National Security Summit, San Diego, CA, USA, 18 July 2015. 6. Bannister, J.; O’Sullivan, A.; Bates, E. Place and time in the Criminology of Place. Theor. Criminol. 2017, 23, 315–332. [CrossRef] 7. Kalantari, M.; Yaghmaei, B.; Ghezelbash, S. Spatio-temporal analysis of crime by developing a method to detect critical distances for the Knox test. Int. J. Geogr. Inf. Sci. 2016, 30, 2302–2320. [CrossRef] 8. Duan, L.; Ye, X.; Hu, T.; Zhu, X. Prediction of Suspect Location Based on Spatiotemporal Semantics. ISPRS Int. J. Geo-Inf. 2017, 6, 185. [CrossRef] 9. Rosser, G.; Davies, T.; Bowers, K.J.; Johnson, S.D.; Cheng, T. Predictive Crime Mapping: Arbitrary Grids or Street Networks? J. Quant. Criminol. 2017, 33, 569–594. [CrossRef] 10. Adepeju, M.; Rosser, G.; Cheng, T. Novel evaluation metrics for sparse spatio-temporal point process hotspot predictions - a crime case study. Int. J. Geogr. Inf. Sci. 2016, 30, 2133–2154. [CrossRef] 11. Wells, W.; Wu, L.; Ye, X. Patterns of Near-Repeat Gun Assaults in Houston. J. Res. Crime Delinq. 2011, 49, 186–212. [CrossRef] 12. Caplan, J.M.; Kennedy, L.W.; Miller, J. Risk Terrain Modeling: Brokering Criminological Theory and GIS Methods for Crime Forecasting. Justice Q. 2011, 28, 360–381. [CrossRef] 13. Law, J.; Quick, M.; Chan, P. Bayesian Spatio-Temporal Modeling for Analysing Local Patterns of Crime Over Time at the Small-Area Level. J. Quant. Criminol. 2013, 30, 57–78. [CrossRef] 14. Kadar, C.; Iria, J.; Cvijikj, I.P. Exploring Foursquare-derived Features for Crime Prediction in New York City. In Proceedings of the 5th International Workshop on Urban Computing (UrbComp 2016), San Francisco, CA, USA, 14 August 2016. 15. Gerber, M.S. Predicting crime using Twitter and kernel density estimation. Decis. Support Syst. 2014, 61, 115–125. [CrossRef] 16. Chohlas-Wood, A.; Merali, A.; Reed, W.; Damoulas, T. Mining 911 Calls in New York City: Temporal Patterns, Detection, and Fore- casting. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. 17. Wang, H.; Kifer, D.; Graif, C.; Li, Z. Crime Rate Inference with Big Data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 635–644. 18. Iqbal, R. An Experimental Study of Classification Algorithms for Crime Prediction. Indian J. Sci. Technol. 2013, 6, 1–7. [CrossRef] 19. Wang, X.; Brown, D.E. The spatio-temporal modeling for criminal incidents. Secur. Inf. 2012, 1, 2. [CrossRef] 20. Chandrasekar, A.; Raj, A.S.; Kumar, P. Crime Prediction and Classification in San Francisco City. Available online: http: //cs229.stanford.edu/proj2015/228\{$\backslash$_\}report.pdf (accessed on 24 December 2020). 21. Chun, S.A.; Paturu, V.A.; Yuan, S.; Pathak, R.; Atluri, V.; Adam, N.R. Crime Prediction Model using Deep Neural Networks. In Proceedings of the 20th Annual International Conference on Digital Government Research, Dubai, UAE, 18–20 June 2019; pp. 512–514. 22. Wang, B.; Zhang, D.; Zhang, D.; Brantingham, J.P.; Bertozzi, L.A. Deep learning for real time crime forecasting. arXiv 2017, arXiv:1707.03340. preprint. Smart Cities 2021, 4 216

23. Duan, L.; Hu, T.; Cheng, E.; Zhu, J.; Gao, C. Deep Convolutional Neural Networks for Spatiotemporal Crime Prediction. In Proceedings of the International Conference on Information and Knowledge Engineering (IKE); CSREA Press: Las Vegas, NV, USA, 2017; The Steering Committee of the World Congress in Computer Science, Computer. 24. Hipp, J.R. Income inequality, race and place: Does the distribution of race and class within neighborhoods affect crime rates? Criminology 2007, 45, 665–697. [CrossRef] 25. Wilson, Q.J.; Kelling, G.L. Broken windows. Atl. Mon. 1982, 249, 29–38. 26. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [CrossRef] 27. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] 28. Pinheiro, P.; Collobert, R. Recurrent Convolutional Neural Networks for Scene Labeling. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. 29. Zheng, Y. Methodologies for Cross-Domain Data Fusion: An Overview. IEEE Trans. Big Data 2015, 1, 16–34. [CrossRef] 30. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. 31. Zhang, J.; Zheng, Y.; Qi, D. Deep spatio-temporal residual networks for citywide crowd flows prediction. arXiv 2016, arXiv:1610.00081. preprint. 32. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. 33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385. 34. Ganganwar, V. An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 42–47. 35. Pagani, A.; Mehrotra, A.; Musolesi, M. Graph input representations for machine learning applications in urban network analysis. Environ. Plan. B: Urban Anal. City Sci. 2019.[CrossRef]