Senseable City Lab, Massachusetts Institute of Technology. This paper might be a pre-copy-editing or a post-print author-produced PDF of an article accepted for publication. For the definitive publisher-authenticated version, please refer directly to the publishing house's archive system.

ISPRS Journal of Photogrammetry and Remote Sensing 153 (2019) 48–58


Social sensing from street-level imagery: A case study in learning spatio-temporal urban mobility patterns

Fan Zhang a, Lun Wu a, Di Zhu a,b, Yu Liu a,⁎

a Institute of Remote Sensing and Geographical Information Systems, School of Earth and Space Sciences, Peking University, 100871, China
b SpaceTimeLab, Department of Civil, Environmental and Geomatic Engineering, University College London, London WC1E 6BT, United Kingdom

ARTICLE INFO

Keywords: Street-level imagery; Urban physical environment; Urban mobility; Social sensing; Deep learning

ABSTRACT

Street-level imagery has covered the comprehensive landscape of urban areas. Compared to satellite imagery, this new source of image data has the advantage of fine-grained observations of not only the physical environment but also social sensing. Prior studies using street-level imagery focus primarily on urban physical environment auditing. In this study, we demonstrate the potential of street-level imagery in uncovering spatio-temporal urban mobility patterns. Our method assumes that the streetscape depicted in street-level imagery reflects urban functions and that urban streets of similar functions exhibit similar temporal mobility patterns. We present how a deep convolutional neural network (DCNN) can be trained to identify high-level scene features from street view images that can explain up to 66.5% of the hourly variation of taxi trips along the urban road network. The study shows that street-level imagery, as the counterpart of remote sensing imagery, provides an opportunity to infer fine-scale human activity information of an urban region and bridge gaps between the physical space and human space. This approach can therefore facilitate urban environment observation and smart urban planning.

1. Introduction

Street-level imagery refers to the photographs taken along street networks, depicting the profile view of the urban streetscape from a view similar to human vision and describing the urban physical environment comprehensively. Due to the fast development of web mapping services (e.g., Google Maps and Tencent Maps) (Anguelov et al., 2010), social media (e.g., Weibo, Instagram, and Flickr) (Liu et al., 2016), and vehicle-mounted intelligent hardware (e.g., driving recorders and autonomous vehicle cameras) (Balali and Golparvar-Fard, 2015), street-level imagery is being generated at a rapid speed and densely covers every corner of a city. This data source lays the foundation of new approaches to observe, perceive and understand the urban environment.

Compared to satellite imagery, street-level imagery describes the urban environment from the human perspective and contains substantial fine-grained information associated with cities. Considerable research in recent years has demonstrated the advantages of this new data source in monitoring neighborhood changes (Naik et al., 2014, 2017), calculating the sky view factor (Gebru et al., 2017; Gong et al., 2018), quantifying neighbourhood types (Zhang et al., 2018), discovering distinct place features (Zhang et al., 2019; Kang et al., 2019), and measuring perceptions (Kang et al., 2017; Zhang et al., 2018). However, most of the existing works focus on extracting information about the physical urban environment and encounter challenges in perceiving human activity and the urban socioeconomic environment (Liu et al., 2015), due to a lack of computational tools to extract high-level representations of street-level imagery. Khosla et al. identified this issue and proposed to "look beyond the visible scene", training a binary-classification model to predict which of two images is closer to a grocery or appears safer (Khosla et al., 2014). Despite that work's inspiration for sensing the "unseen" space using street-level imagery, it is still difficult to obtain a precise quantitative measure of human dynamics and socioeconomic status by employing conventional computer vision features, e.g., texture (Ojala et al., 2002), colour (Khan et al., 2013), and gradient (Dalal and Triggs, 2005). Predicting latent scene affordances quantitatively remains underdeveloped due to the lack of computational tools and approaches to learn image content comprehensively.

To automate the learning of high-level visual knowledge of a streetscape and further to perceive the urban socioeconomic and sociodemographic environment, we design a multi-task deep convolutional neural network (DCNN). DCNNs have recently exhibited outstanding performance in learning effective and interpretable image

⁎ Corresponding author. E-mail address: [email protected] (Y. Liu).
https://doi.org/10.1016/j.isprsjprs.2019.04.017
Received 11 December 2018; Received in revised form 22 April 2019; Accepted 25 April 2019
0924-2716/ © 2019 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

features (LeCun et al., 2015). As a case study, we demonstrate the potential advantage of street-level imagery in uncovering urban diurnal activities using an improved DCNN model. Without vehicles necessarily being present in the scenes, the DCNN model is expected to infer the high-level urban functional information of scenes in which vehicles might perform certain diurnal behaviours on a weekday. Inspired by the evidence that urban mobility is highly relevant to urban land use and the urban physical environment (Qi et al., 2011; Liu et al., 2012; Yuan et al., 2012), our method assumes that the streetscape depicted in street-level imagery reflects urban functions and that urban activities in streets of similar functions present similar temporal patterns. Specifically, we take the hourly variation of taxi trip numbers throughout one month as the proxy of urban mobility and predict the diurnal temporal signature of taxi trips from street view images using a novel DCNN model. We first discretize the number of trips into ten levels to formulate the problem as a discriminative classification task, which achieves an overall accuracy of 74.1%. Second, we estimate the exact number of trips based on the model outputs and achieve a mean absolute percentage error (MAPE) of 30.7% and an R² of 0.665.

The results demonstrate that street-level imagery can reveal urban mobility patterns and serve as a fine-grained measure of urban dynamic factors. Our results are significant in two respects. Scientifically, the study sheds light on the connections between the physical settings and human activities of a place. Practically, it takes advantage of street-level imagery to derive fine-scale socioeconomic information of an urban area, in which urban planners and social scientists engaged in urban observation, urban studies and planning might be interested.

2. Literature review

2.1. Social sensing

The proliferation of crowdsourcing technology and the emergence of individual-level big geospatial data bring unprecedented opportunities for researchers to better understand the physical and socioeconomic environment of urban regions. Liu et al. (2015) used the term "social sensing" to define such sources of information and the associated approaches. Social sensing has been considered the analogue and complement of remote sensing in terms of capturing socioeconomic features and individual-level observation. On the one hand, floating vehicle trajectories, social media posts, cellphone call records, smartcard transactions, and LBSN check-ins have been used to characterize human dynamics and socioeconomic environments such as urban traffic patterns (Liu et al., 2012), human mobility patterns (Liu et al., 2014; Ahas et al., 2015), crime occurrence (Kang and Kang, 2017; Kadar and Pletikosa, 2018), land use (Pei et al., 2014; Yao et al., 2017), and job-housing relationships (Masucci et al., 2013; Huang et al., 2018). On the other hand, place locales can be considered a result of a reciprocal interaction between human behaviours and their physical settings (Stedman, 2003; Zhang et al., 2018). As a simulation and representation of place locales, street-level imagery, which potentially reflects the social environments of neighbourhoods and urban functions, is considered an important data source of social sensing. Similarly, remote sensing has been widely employed to observe the physical environment of urban space.

2.2. Comparison between street-level imagery and remote sensing imagery

Satellite and airborne remote sensing involve information extraction from objects or scenes of the Earth's surface using radiometer, radar, or LiDAR sensors (Benediktsson et al., 2012; Bioucas-Dias et al., 2013). Remote sensing powers many aspects of modern society, such as meteorology, agriculture, forestry, and landscape and regional planning, based on its outstanding ability in large-scale Earth observation (Lillesand et al., 2014). Recently, with the rapid development of web mapping services and social media platforms, a new source of information for urban physical environment observation and auditing has emerged and densely covered the road network of a city, in the form of street-level imagery (Anguelov et al., 2010; Zhang et al., 2018). Street-level imagery refers to the photographs taken along urban street networks, depicting the profile view of urban streetscapes from a view similar to human vision and describing the urban physical environment in detail, potentially offering new approaches for researchers to observe, perceive and understand the urban environment comprehensively.

The two data sources have similar aims and different scopes. In Table 1 we compare street-level imagery with remote sensing imagery.

Table 1. Comparison between street-level imagery and remote sensing imagery.

|               | Remote sensing imagery                                          | Street-level imagery                                                                    |
| Data source   | Radiometer, radar, and LiDAR sensors of satellites and aircraft | Map services (Google, Tencent, etc.); social media photos; customized collection        |
| Main task     | Physical features of the Earth's surface                        | Physical and social features of the urban environment                                   |
| View          | Nadir/limb view                                                 | Profile/human view                                                                      |
| Content       | Natural and built features: rivers, grass, buildings, roads, etc. | Neighbourhood facilities and amenities: groceries, bus stations, house facades, signs, etc. |
| Accessibility | Limited; ready-made; expensive                                  | Convenient; allows for high customization; economical                                   |
| Resolution    | Sub-metre spatial resolution; hyperspectral bands               | Millimetre spatial resolution; panchromatic band                                        |

Both remote sensing imagery and street-level imagery aim to observe the physical environment of the Earth's surface. However, the former shows more advantage in large-scale Earth observation, while the superiority of the latter lies in fine-grained urban physical environment description. Moreover, the detailed information present in street-level imagery, e.g., the number of amenities, the condition of the facilities, and the historical and cultural style of the streetscape, provides opportunities to understand the human dynamics and socioeconomic environments of an urban region.

A number of studies have taken advantage of both types of data sources to understand urban environments comprehensively. Law et al. recovered the visual attributes of the urban environment using fused features from street view images and satellite images to estimate the house prices of neighbourhoods (Law et al., 2018). Lin et al. learned features from street view images and satellite images to find where a photo was taken by matching the photo to a city-scale aerial view image database (Lin et al., 2015). To enrich the streetscape of a specific location, Deng et al. trained a conditional generative adversarial network using street view images and satellite images to generate natural-looking street-level imagery (Deng et al., 2018).

2.3. Understanding urban mobility patterns with social sensing

Urban mobility plays a crucial role in the growth, employment and sustainable development of a city (Bulkeley and Betsill, 2005). Benefiting from multi-level and multi-sourced big geospatial data, research efforts have been made to approximate spatio-temporal urban mobility patterns using Global Positioning System (GPS) data (Adriansen and Nielsen, 2005), smart card records (Liu et al., 2009), mobile positioning data (Ahas et al., 2010), etc. In addition, several studies have focused on characterizing urban/human mobility patterns and have attempted to explore a universal law (Gonzalez et al., 2008; Noulas et al., 2012; Gallotti et al., 2016).

A common thread that ties together previous works in modeling urban mobility is their focus on the driving force of mobility. Previous

works identified the strong associations between urban mobility and land use (Liu et al., 2012), spatial structure (Zhong et al., 2014), the built environment (Chen et al., 2009), etc., demonstrating the connections between urban mobility and urban physical environments. This study is based on the assumption that urban mobility in streets with similar physical settings and functions presents similar patterns. Street-level imagery is used to simulate and represent the urban physical environments and further to approximate urban mobility patterns.

3. Learning urban mobility patterns from street-level imagery

3.1. Study area and data sources

Beijing is one of the most populous cities and has one of the most complex street networks in China. This study focuses on the streets within the 5th Ring Road of Beijing and treats each street as an individual unit to aggregate street view images and urban mobility data. A total of 917 streets are selected based on the availability of street view image data. We request and collect more than 120,000 street view images along the streets at intervals of 30 meters by calling the street view image API, which enables users to obtain a static street view image with certain camera parameters. The street view images were taken in 2016.

We take the diurnal signatures of taxi trips, i.e., the total number of taxi pick-ups and drop-offs in each hour of each day, as the proxy of urban mobility (the bias issues are discussed in the following parts). The taxi trip dataset contains more than 20,000 taxis in Beijing in September 2016 and records each taxi's pick-up and drop-off locations and times. We retain only the records of weekdays and average the records by each hour of a day for each street. Fig. 1 presents the study area, street view image samples of four streets and the temporal signatures of taxi trips. The urban mobility patterns vary from place to place according to the different urban functions. Similar taxi trip and street view image datasets have been used in previous works; detailed descriptions can be found in Zhu et al. (2017) and Zhang et al. (2018).

3.2. Data preprocessing

In the study area, we treat each street as an individual unit. Taxi data and street view images are spatially aggregated into 917 streets. The advantage of using the street as the spatial unit for big geo-data analysis has been demonstrated in previous work (Zhu et al., 2017; Zhang et al., 2018). As shown in Fig. 2, the training set consists of sample pairs (Fig. 2a) of a street view image (Fig. 2b) and the corresponding taxi trip temporal signature vector of the street, where the street view image is the input of the DCNN model and the taxi trip temporal signature vector is the prediction target (Fig. 2c). For street view images, we resize the images for consistency with the DCNN model input size. For taxi data, first, we aggregate the trip data (including pick-ups and drop-offs) in September 2016 by each hour of each day. Since the spatio-temporal patterns of urban mobility are quite different between weekdays and weekends (Liu et al., 2012), we remove the data for weekends and focus only on weekdays. Second, to address the skewed trip values of different hours and streets, we adopt logarithmic transformation and normalization (this regularization makes the data distribution resemble a Gaussian distribution, allowing us to learn more robust prediction models (Isola et al., 2014)). Third, since we implement a classification-then-regression strategy for our prediction problem (see the next subsection), we discretize the quantity of trips into N levels with equal intervals [b_n, b_{n+1}] (n = 0, 1, ..., N - 1), where b_n is an interval boundary calculated from the maximum and minimum values of trips in the whole dataset after applying logarithmic transformation and normalization.

Fig. 1. The region within the 5th Ring Road of Beijing, street view image samples of four streets and the corresponding taxi hourly trip curves. (a) Zhongguancun Street. (b) North Street. (c) Guanghua Road. (d) Guangshun North Street.
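The preprocessing of the taxi trip targets (logarithmic transformation, normalization, and discretization into N magnitude levels) can be sketched as follows. This is a minimal NumPy illustration under our own naming; the paper does not publish code, so treat it as an assumption-laden sketch rather than the authors' implementation:

```python
import numpy as np

def preprocess_trip_counts(hourly_trips, n_levels=10):
    """Log-transform, min-max normalize, and discretize hourly taxi trip
    counts into n_levels equal-width magnitude levels.

    hourly_trips: array of shape (n_streets, 24) with raw trip counts.
    Returns (normalized, levels), where levels[i, h] is in 0 .. n_levels - 1.
    """
    # Logarithmic transformation to reduce skew (log1p handles zero counts).
    log_trips = np.log1p(hourly_trips)

    # Min-max normalization over the whole dataset.
    lo, hi = log_trips.min(), log_trips.max()
    normalized = (log_trips - lo) / (hi - lo)

    # Discretize into equal intervals [b_n, b_{n+1}]; the boundaries b_n
    # follow from the min/max of the transformed data (here 0 and 1).
    boundaries = np.linspace(0.0, 1.0, n_levels + 1)
    levels = np.clip(np.digitize(normalized, boundaries) - 1, 0, n_levels - 1)
    return normalized, levels
```

The equal-width boundaries mirror the paper's definition of b_n from the extrema of the transformed data; the choice of `log1p` over a plain logarithm is our assumption to accommodate hours with zero trips.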


Fig. 2. Data preprocessing and training set organization. A sample pair (a) consists of a street view image (b) and the 24-h taxi trip magnitude of a street (c). Taxi data are preprocessed through spatio-temporal aggregation, logarithmic transformation, normalization and discretization.

3.3. Deep convolutional neural networks

The mechanisms that govern urban mobility patterns are complex and nonlinear. Modeling traffic flow using a single factor of the surroundings has proved difficult. In this work, we employ a deep convolutional neural network (DCNN) to learn the visual knowledge of a physical environment and utilize the deep representation of a scene image to predict traffic flow. A DCNN is an improved form of the artificial neural network, a family of computation models taking inspiration from the neuronal structure of the brain. A DCNN learns and extracts image features layer by layer automatically, which has been proved to be efficient and invariant in image feature learning and representation (LeCun et al., 2015).

DenseNet is a basic DCNN architecture that has been widely used in image classification (Huang et al., 2017), segmentation (Zhou et al., 2017) and object detection (Shen et al., 2017). Relying on its novel structure, DenseNet alleviates the vanishing-gradient problem, encourages feature reuse, reduces the number of parameters and outperforms other models in various computer vision tasks (Huang et al., 2017). We design a new DCNN model based on the DenseNet architecture (Huang et al., 2017) to predict the taxi trips of each hour from a single street view image.

Regarding the number of tasks, predicting the taxi trips from h o'clock to (h + 1) o'clock (h ∈ [0, 23]) is a multi-task learning problem for a machine learning model. For instance, predicting the taxi trips from 8 AM to 9 AM and those from 10 AM to 11 AM are two separate (but related) tasks. Inspired by previous works (Ranjan et al., 2017; Devin et al., 2017), we design a model that learns a shared representation at high-level layers, followed by 24 task-specific decoders that make decisions for each task. In this work, we choose the softmax function, a generalization of the logistic function (Bridle, 1990), as the decoder and classifier. The softmax function outputs a probability distribution over all possible outcomes, i.e., the N magnitude categories in this case. This property enables our method to estimate the exact quantity of trips.

Regarding the type of task, predicting the number of taxi trips is typically a regression problem for a machine learning model. However, regression tasks involve more of the complexity, uncertainty, and noise of the taxi trip data and are hard to train for a DCNN model. We therefore employ a classification-then-regression strategy. First, we discretize the number of taxi trips into N levels with equal intervals [b_n, b_{n+1}] (n = 0, 1, ..., N - 1) and train the DCNN model to estimate the magnitude level of taxi trips in a discriminative classification manner. Second, we estimate the exact quantity of trips based on the softmax function's output. For a sample pair of an image I of a street and the taxi hourly trips Q_h (h ∈ [0, 23]) along the street, the predicted probability distribution is P_h^n, where n is the magnitude category. We calculate the exact quantity of trips from h o'clock to (h + 1) o'clock as

\hat{Q}_h = \sum_{n=0}^{N-1} P_h^n \cdot \frac{b_n + b_{n+1}}{2}, \quad h = 0, 1, \ldots, 23.

As shown in Fig. 3, our model is composed of two parts: the DenseNet blocks (Fig. 3b) and the 24 classifiers connected to the shared deep feature, which is the last layer of the DenseNet blocks (Fig. 3c). The h-th classifier following the DenseNet blocks is designed to estimate the magnitude of taxi trips from h o'clock to (h + 1) o'clock. Given an input image (Fig. 3a), our proposed model is expected to predict the 24-dimensional vector of taxi hourly trips (Fig. 3d).

3.4. Transfer learning

Transfer learning helps solve new problems through the transfer of knowledge gained while learning a different but related task (Torrey et al., 2010). Transfer learning reduces training time, improves the model's generalization capability and prevents over-fitting by sharing learned image representations, such as colour patterns, basic shapes, and object parts, across models. This approach is common practice in learning deep features of natural images (Zhou et al., 2016), satellite imagery (Jean et al., 2016; Albert et al., 2017) and medical imagery (Greenspan et al., 2016) for a wide variety of applications. As shown in Fig. 3e, we apply transfer learning to this work by first training a DCNN model M_s to carry out a street classification task, i.e., recognizing which street a street view image belongs to, and then training a DCNN model M_t to predict the taxi trips along a street by fine-tuning the pre-trained model M_s.
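The expected-value estimate of Section 3.3, which converts each hour's softmax distribution over the N magnitude levels into an exact trip quantity via the interval midpoints, can be sketched as follows (array names are ours, not the paper's):

```python
import numpy as np

def expected_trips(probs, boundaries):
    """Estimate the exact trip quantity from a softmax distribution.

    probs:      shape (24, N), probs[h, n] = P_h^n over N magnitude levels.
    boundaries: shape (N + 1,), the interval boundaries b_0 .. b_N.
    Returns shape (24,): sum_n P_h^n * (b_n + b_{n+1}) / 2 for each hour h.
    """
    midpoints = (boundaries[:-1] + boundaries[1:]) / 2.0  # (b_n + b_{n+1}) / 2
    return probs @ midpoints
```

For instance, with a uniform distribution over ten levels spanning [0, 10], the estimate is simply the mean of the interval midpoints.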


Fig. 3. The architecture of our proposed DCNN model. (a) Input street view image of a street. (b) DenseNet blocks. (c) Twenty-four task-specific classifiers that are connected with the last layer of DenseNet blocks independently. (d) Twenty-four-dimensional temporal signature vector of taxi hourly trips of a street. (e) Model parameter initialization using feature learning from a street classification task.

Our method assumes that, through the street classification task, the model M_s learns the deep representation of a streetscape, which improves the learning process of the model M_t.

3.5. Performance assessment

To deal with the imbalanced category problem (the total numbers of images differ across streets), we build the dataset with a stratified sampling strategy for each street. We randomly split the whole dataset into training, validation, and test sets with a ratio of 0.6 : 0.2 : 0.2, and use the four following metrics on the three splits to evaluate the performance of our proposed method in predicting the average taxi hourly trips.

Overall accuracy. For the N-category discriminative classification task described in the data preprocessing section, the quantity of trips is discretized into N levels with equal intervals [b_n, b_{n+1}] (n = 0, 1, ..., N - 1). The accuracy of predicting the N-level magnitude of taxi trips from h o'clock to (h + 1) o'clock is denoted A_h (h = 0, 1, ..., 23), and the overall accuracy is obtained by

OA = \frac{1}{24} \sum_{h=0}^{23} A_h.

Mean absolute error (MAE). Let T_h be the total number of samples from h o'clock to (h + 1) o'clock. We evaluate the performance of estimating the exact quantity of trips using the mean absolute error:

MAE_h = \frac{1}{T_h} \sum_{t=1}^{T_h} \left| Q_h^t - \hat{Q}_h^t \right|, \quad h = 0, 1, \ldots, 23.

Mean absolute percentage error (MAPE). The MAPE measures the prediction errors as a percentage of the observed values:

MAPE_h = \frac{100\%}{T_h} \sum_{t=1}^{T_h} \frac{\left| Q_h^t - \hat{Q}_h^t \right|}{Q_h^t}, \quad h = 0, 1, \ldots, 23.

Coefficient of determination (R²). R² measures how well observed outcomes are reproduced by the model, based on the proportion of the total variation of outcomes explained by the model:

R_h^2 = 1 - \frac{\sum_{t=1}^{T_h} (Q_h^t - \hat{Q}_h^t)^2}{\sum_{t=1}^{T_h} (Q_h^t - \bar{Q}_h)^2}, \quad h = 0, 1, \ldots, 23.

4. Experiment and results

We trained the designed model on the street view imagery to estimate the hourly variation of taxi trips along each street. The number of magnitude levels of taxi trips N is set to 10. We first train the DCNN model with parameters transferred from an ImageNet object classification model, which recognizes the object in an image from 1000 categories with over 14 million training samples (Russakovsky et al., 2015). In a second experiment, by contrast, we initialize the model with parameters learned in a street classification task, i.e., recognizing which street a street view image shows out of the 917 streets across the study area (as Fig. 3e depicts). We hypothesize that the model initialized with the streetscape features from the street classification task will outperform the one using the object features from the ImageNet object classification task.

4.1. Training results

Table 2 lists the accuracy, MAPE and R² of the experiment, which demonstrate the strong performance of the DCNN model. The model achieves an overall R² of 0.665 on the test set, indicating that, by using a DCNN learning model, street view imagery can explain up to 66.5% of the variation of urban mobility.

Table 2. Model performance.

| Split      | Samples (#) | Accuracy | MAPE  | R²    |
| Training   | 75,020      | 0.995    | 0.154 | 0.941 |
| Validation | 25,005      | 0.743    | 0.307 | 0.672 |
| Test       | 25,010      | 0.741    | 0.307 | 0.665 |

Fig. 4 depicts the learning curves during the training process. The model initialized with street view imagery features achieved an accuracy of 74.3% on the validation set, outperforming the one with ImageNet features (55.3% on the validation set). The result shows that the image features learned in the street classification task, which we regard as the deep representation of a streetscape, are more effective in carrying out tasks related to the urban physical environment than the image features learned in an object detection task, which we consider to be the deep representation of objects.
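The four metrics reported above can be computed directly from the model outputs; a minimal NumPy sketch, with helper names of our own (it assumes strictly positive observed trip counts, so the MAPE denominator is well defined):

```python
import numpy as np

def evaluate_hourly(q_true, q_pred, level_true, level_pred):
    """Compute OA, MAE, MAPE and R^2 as defined in Section 3.5.

    q_true, q_pred:         (T, 24) exact trip quantities per sample and hour.
    level_true, level_pred: (T, 24) discrete magnitude levels.
    Returns a dict of scalars (hourly metrics averaged over the 24 hours).
    """
    oa = (level_true == level_pred).mean()                  # overall accuracy
    mae = np.abs(q_true - q_pred).mean(axis=0)              # MAE_h, shape (24,)
    mape = 100.0 * (np.abs(q_true - q_pred) / q_true).mean(axis=0)
    ss_res = ((q_true - q_pred) ** 2).sum(axis=0)
    ss_tot = ((q_true - q_true.mean(axis=0)) ** 2).sum(axis=0)
    r2 = 1.0 - ss_res / ss_tot                              # R^2_h per hour
    return {"OA": oa, "MAE": mae.mean(), "MAPE": mape.mean(), "R2": r2.mean()}
```

A perfect prediction yields OA = 1, MAE = MAPE = 0 and R² = 1, which is a useful sanity check for the implementation.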


Fig. 4. Learning curves during the training process. The model initialized with street view imagery features (curve in blue) outperforms the one initialized with ImageNet features (curve in orange). The solid curves depict the accuracy during training, while the dashed curves depict the accuracy on the test set. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

We show more detailed model performance evaluations on the test set for each hour in Fig. 5. Generally, the model achieves high and stable accuracy in estimating the magnitude and quantity of taxi trips over all 24 hours. Fig. 5a shows that the prediction accuracy is slightly higher in the daytime (~75.0%) than late at night (~72.6%). A similar pattern is revealed in Fig. 5b, which presents the MAE for each hour of the day, implying that the urban mobility pattern is easier to depict from street view imagery in the daytime.

Fig. 6 shows the overall spatial distribution of actual taxi trips along each street in each time interval compared to the estimated ones. The model generally achieves high estimation performance for each hour. However, the taxi flow during the night is slightly over-estimated by the model, which is consistent with the results shown in Fig. 5.

4.2. Residual analysis

We then investigate the model robustness by examining the distribution of the residuals of the estimation for each street. Fig. 7a ranks the MAPE of the 917 streets. A lower MAPE indicates a better estimation. The model achieves an MAPE lower than 40% for 89% (817/917) of the streets. Some of the streets with poor estimates are shown in red in Fig. 7b. The physical settings of these areas are considered inconsistent with their traffic demand, and they deserve attention in terms of traffic management and planning.

As an example, we identify abnormal streets by examining the outliers of residual plots. In detail, we observe the relationship between the actual taxi trip numbers and the mean absolute percentage errors (MAPE) of the 24-h taxi trip numbers. For an ideal estimation model, the value of MAPE should be insensitive to the input. In Fig. 8a, as expected, the MAPE shows a fairly constant value (an average of 23.5%) when the number of taxi trips is larger than 5. However, several streets are over-estimated when the actual number of taxi trips is low. We analyze three types of outliers, taking Side Road of South 3rd Ring, Caishikou Street, and Jingshan Front Street as examples. First, Side Road of South 3rd Ring is over-estimated due to the modifiable areal unit problem (MAUP): the street is too short (~200 m), so few taxi trips were aggregated into the street unit when calculating the true values for the training set. One potential solution is to merge the street with its neighbouring streets. Second, a number of residential communities are located along Caishikou Street, but the number of taxi trips is low because two nearby subway stations, Caishikou Station and Taoranting Station, attract travel flows and result in a small number of taxi trips on the street. Third, Jingshan Front Street, located between the Forbidden City and Jingshan Park, is a well-developed street and is predicted by the model to feature more taxi pick-ups and drop-offs. However, the actual flow is low because the street is regulated: taxis are only allowed to stop at a taxi stand, which is located in an adjacent street. Image examples of the three streets are shown in Fig. 8b. Overall, by identifying and analyzing abnormal places whose physical conditions are inconsistent with their traffic demand, this approach holds great promise to inform traffic management and urban planning.

4.3. Sample evaluation

In Fig. 9, we present the evaluation of twelve selected street view image samples with their actual and predicted mobility curves. The first column shows the images; the second column depicts the predicted and ground-truth taxi trip curves during the day in blue and orange, respectively; and the third column displays the informative image regions using class activation mapping (CAM), a technique used to identify the discriminative regions of an image for model decision making (implementation details of CAM are described in Zhou et al. (2016)). Note that greenery, bus stations, and the density and condition of buildings can provide important visual cues relevant to urban mobility. Future work is expected to measure the associated relationships quantitatively.
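For a model whose classifiers are single linear layers on globally pooled convolutional features, CAM reduces to a weighted sum of the final feature maps (Zhou et al., 2016). A minimal sketch, with array names of our own (one hour's classifier at a time; not the authors' implementation):

```python
import numpy as np

def class_activation_map(feature_maps, head_weights, class_idx):
    """Compute a CAM highlighting image regions that drive one prediction.

    feature_maps: (C, H, W) conv features before global average pooling.
    head_weights: (n_levels, C) weight matrix of one hour's linear classifier.
    class_idx:    the predicted magnitude level to explain.
    Returns an (H, W) map; larger values mark more discriminative regions.
    """
    c, h, w = feature_maps.shape
    # Weighted sum of feature channels by the class's classifier weights.
    cam = head_weights[class_idx] @ feature_maps.reshape(c, h * w)
    cam = cam.reshape(h, w)
    # Normalize to [0, 1] for visualization.
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

In practice the resulting low-resolution map would be upsampled to the input image size and overlaid on the street view image, as in the third column of Fig. 9.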

Fig. 5. Evaluation of the performance of estimating taxi trips for each hour on the test set. (a) Accuracy for different hours. (b) Mean absolute error for different hours (# taxi trips).


Fig. 6. Spatial distribution of taxi hourly trips within the 5th Ring Road of Beijing. Left: the observed distribution; Right: the predicted distribution. The results are shown for every three hours. A darker colour indicates a higher quantity of trips in each hour. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5. Discussion and conclusions

Street-level imagery describes urban environments from the human perspective and contains fine-grained physical and social information of cities. A number of studies have demonstrated the advantages of this new data source in extracting information about the physical urban environment. However, perceiving human mobility and the urban socioeconomic environment remains difficult due to the lack of feasible methods for extracting high-level representations of street-level imagery. In this work, we present an approach to inferring properties of the

Fig. 7. Evaluation of the performance of estimating taxi trips along the 917 streets. A lower MAPE means a better estimation. The model achieves high and stable performance for most of the streets. (a) The rank of the mean absolute percentage error (MAPE) of each street. More than 89% of the streets achieve an MAPE lower than 40%. (b) The spatial distribution of the MAPE of each street. More red indicates a higher MAPE. The model performs poorly for just a few short streets. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
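The per-street MAPE ranking of Fig. 7a, and the screening of high-error streets examined in Fig. 8, reduce to a per-street error aggregated over the 24 hourly slots. A sketch with simulated counts (the data are synthetic; the metric matches the MAPE described in the text):

```python
import numpy as np

def street_mape(y_true, y_pred):
    """Mean absolute percentage error per street over 24 hourly slots.
    y_true, y_pred: arrays of shape (num_streets, 24); y_true must be > 0."""
    return 100.0 * np.abs((y_true - y_pred) / y_true).mean(axis=1)

rng = np.random.default_rng(2)
obs = rng.poisson(40, size=(917, 24)).astype(float) + 1.0   # avoid zeros
est = obs * (1.0 + rng.normal(0.0, 0.1, size=obs.shape))
mape = street_mape(obs, est)
ranked = np.sort(mape)                 # ranking as plotted in Fig. 7a
share_ok = (mape < 40.0).mean()        # share of streets below 40% MAPE
worst = np.argsort(mape)[-3:]          # candidate outlier streets to inspect
```

Streets with few taxi trips inflate the percentage error, which is why the residual analysis treats low-flow, high-MAPE streets as outliers to be inspected individually rather than as model failures.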


Fig. 8. MAPE shows a fairly constant value (an average of 23.5%) when the amount of taxi trips is larger than 5. Three streets with low taxi trips and high MAPE, Side Road of South 3rd Ring, Caishikou Street and Jingshan Front Street, are identified as outliers. Street view image samples are presented.

urban environment, instead of simply surveying the objects present in a scene. A DCNN-based method is proposed to estimate the spatio-temporal patterns of urban mobility using street-level imagery, without the need for vehicles to appear in the images. The DCNN model approximates the temporal signature of the taxi trips of 917 streets in Beijing with high performance. In terms of the temporal variation, we obtain the taxi flow predictions for all 24 time slots based on a multi-task deep learning framework. The results demonstrate that street-level imagery can reveal urban mobility patterns and serve as a fine-grained measure of urban dynamic factors. The contributions and significance of this work are as follows.

First, we employ a new data source, street-level imagery, to simulate and represent the urban physical environment and to approximate urban mobility patterns. This data source opens a new door to observing and understanding the sociodemographic and socioeconomic environments. Moreover, street-level imagery can potentially complement traditional high-resolution remote sensing imagery in fine-grained urban environment observation and social sensing. The two sources are functionally similar in physical environment observation, but they describe the Earth's surface at different scales, scopes and views. Incorporating street-level imagery and remote sensing imagery together to better model and understand spatial separation and urban form is an exciting avenue for future research.

Second, from a theoretical aspect, seeking connections between physical settings and human activities has long been of interest in a wide variety of fields (Lynch, 1960; Tuan, 1979; Chen et al., 2009). How much urban physical environments can reveal about urban social environments remains unclear. This work presents how "big data thinking", mining complex associations among various data sources, can be applied to shed light on this issue using a deep learning model. The results show that human mobility patterns can be well estimated from the urban physical environment, demonstrating that visual


Fig. 9. Sample evaluation: predicting the 24-h variation of taxi flow using a single street view image. Twelve image samples with taxi trip curves and informative image regions are shown. The regions in red color are considered to be the discriminative areas that the model relies on. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

cues are helpful in approximating human behaviours. This conclusion provides evidence for those working on understanding the reciprocal interactions between the urban built environment and human behaviour. This framework is also promising for evaluating whether the development of a city is oversupplied or insufficient, by observing the relationship between urban physical settings and urban activities.

Third, in a technical sense, how to learn and interpret high-level visual knowledge of a scene automatically has been of great interest in computer vision (Zhou et al., 2014; Reed et al., 2016). This study proposes a transfer learning and DCNN-based approach to extract high-level representations of a streetscape. We find that, regarding understanding the urban physical environment, the visual features learned in our designed street classification task largely outperform the visual features learned in the ImageNet object detection task. This work provides a ready-to-use computational tool for extracting visual concepts and high-level visual knowledge of a streetscape to facilitate studies using street-level imagery.

Fourth, from a practical sense, our approach provides an alternative way of estimating the spatio-temporal urban mobility patterns of a city. We show that the relationship between urban traffic flows and the corresponding visual scenes is adequately described by a DCNN model. With a single street-level image, we estimate the hourly variation of taxi flows of a street, which can be applied to the estimation of general traffic flows of a street. Urban mobility reflects the degree of development and vitality of a city. According to Jacobs (1961), a city's vitality and diversity can be promoted by mixed land use and block density (Jacobs, 1992). Future works are anticipated to address regions whose physical environment conditions are inconsistent with traffic flows by focusing on the regions with larger residuals in the predictions. The results will potentially have implications for improving the mobility of a city and could inform future urban design and traffic planning.

The limitations of this work are twofold. First, big data representativeness and bias have been widely examined recently (Martí et al., 2019; Wardrop et al., 2018). Urban travels may feature multiple modes, and taxi trip data might not reflect a complete picture of urban mobility. In this work, bias exists in approximating urban mobility using taxi pick-ups and drop-offs. More promising results can be achieved by applying the proposed general and scalable method to more representative datasets or multi-source urban travel datasets to investigate urban mobility. Second, the generality of the proposed method has been tested locally (in Beijing). Considering geographical heterogeneity, the relationship between the streetscape and urban mobility may be different in a different city. Employing the pre-trained model to measure the differences among cities is expected to be a promising approach. Also, integrating external information (weekday/weekend, holiday/ordinary, weather conditions, etc.) will

improve the performance of the estimation model in predicting urban mobility patterns under various conditions and for different areas.

Acknowledgement

This work was supported by the National Key R&D Program of China under Grant 2017YFB0503602, the National Natural Science Foundation of China under Grants 41830645 and 41625003, and the China Postdoctoral Science Foundation under Grant 2018M641068. The authors wish to thank Dr. Yaoli Wang for her constructive comments on this paper.

References

Adriansen, H.K., Nielsen, T.T., 2005. The geography of pastoral mobility: a spatio-temporal analysis of GPS data from Sahelian Senegal. GeoJournal 64 (3), 177–188.
Ahas, R., Aasa, A., Silm, S., Tiru, M., 2010. Daily rhythms of suburban commuters' movements in the Tallinn metropolitan area: case study with mobile positioning data. Transport. Res. Part C: Emerg. Technol. 18 (1), 45–54.
Ahas, R., Aasa, A., Yuan, Y., Raubal, M., Smoreda, Z., Liu, Y., Ziemlicki, C., Tiru, M., Zook, M., 2015. Everyday space-time geographies: using mobile phone-based sensor data to monitor urban activity in , Paris, and Tallinn. Int. J. Geogr. Inform. Sci. 29 (11), 2017–2039.
Albert, A., Kaur, J., Gonzalez, M.C., 2017. Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 1357–1366.
Anguelov, D., Dulong, C., Filip, D., Frueh, C., Lafon, S., Lyon, R., Ogale, A., Vincent, L., Weaver, J., 2010. Google street view: capturing the world at street level. Computer 43 (6), 32–38.
Balali, V., Golparvar-Fard, M., 2015. Segmentation and recognition of roadway assets from car-mounted camera video streams using a scalable non-parametric image parsing method. Autom. Constr. 49, 27–39.
Benediktsson, J.A., Chanussot, J., Moon, W.M., 2012. Very high-resolution remote sensing: challenges and opportunities. Proc. IEEE 100 (6), 1907–1910.
Bioucas-Dias, J.M., Plaza, A., Camps-Valls, G., Scheunders, P., Nasrabadi, N., Chanussot, J., 2013. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Rem. Sens. Mag. 1 (2), 6–36.
Bridle, J.S., 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Neurocomputing. Springer, pp. 227–236.
Bulkeley, H., Betsill, M., 2005. Rethinking sustainable cities: multilevel governance and the urban politics of climate change. Environ. Polit. 14 (1), 42–63.
Chen, C., Chen, J., Barry, J., 2009. Diurnal pattern of transit ridership: a case study of the New York City subway system. J. Transport Geogr. 17 (3), 176–186.
Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, pp. 886–893.
Deng, X., Zhu, Y., Newsam, S., 2018. What is it like down there?: generating dense ground-level views and image features from overhead imagery using conditional generative adversarial networks. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems – SIGSPATIAL '18. ACM Press, pp. 43–52.
Devin, C., Gupta, A., Darrell, T., Abbeel, P., Levine, S., 2017. Learning modular neural network policies for multi-task and multi-robot transfer. In: 2017 IEEE International Conference on Robotics and Automation, pp. 2169–2176.
Gallotti, R., Bazzani, A., Rambaldi, S., Barthelemy, M., 2016. A stochastic model of randomly accelerated walkers for human mobility. Nat. Commun. 7, 12600.
Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E.L., Fei-Fei, L., 2017. Using deep learning and Google street view to estimate the demographic makeup of neighborhoods across the United States. Proc. Natl. Acad. Sci. 114 (50), 13108–13113.
Gong, F.-Y., Zeng, Z.-C., Zhang, F., Li, X., Ng, E., Norford, L.K., 2018. Mapping sky, tree, and building view factors of street canyons in a high-density urban environment. Build. Environ. 134, 155–167.
Gonzalez, M.C., Hidalgo, C.A., Barabasi, A.-L., 2008. Understanding individual human mobility patterns. Nature 453 (7196), 779.
Greenspan, H., Van Ginneken, B., Summers, R.M., 2016. Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans. Med. Imag. 35 (5), 1153–1159.
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2261–2269.
Huang, J., Levinson, D., Wang, J., Zhou, J., Wang, Z., 2018. Tracking job and housing dynamics with smartcard data. In: Proceedings of the National Academy of Sciences.
Isola, P., Xiao, J., Parikh, D., Torralba, A., Oliva, A., 2014. What makes a photograph memorable? IEEE Trans. Pattern Anal. Mach. Intell. 36 (7), 1469–1482.
Jacobs, J., 1992. The Death and Life of Great American Cities.
Jean, N., Burke, M., Xie, M., Davis, W.M., Lobell, D.B., Ermon, S., 2016. Combining satellite imagery and machine learning to predict poverty. Science 353 (6301), 790–794.
Kadar, C., Pletikosa, I., 2018. Mining large-scale human mobility data for long-term crime prediction. Available from: arXiv preprint arXiv:1806.01400.
Kang, H.-W., Kang, H.-B., 2017. Prediction of crime occurrence from multi-modal data using deep learning. PLoS ONE 12 (4), e0176244.
Kang, Y., Wang, J., Wang, Y., Angsuesser, S., Fei, T., 2017. Mapping the sensitivity of the public emotion to the movement of stock market value: a case study of Manhattan. Int. Arch. Photogram. Rem. Sens. Spat. Inform. Sci. 42, 1213–1221.
Kang, Y., Jia, Q., Gao, S., Zeng, X., Wang, Y., Angsuesser, S., Liu, Y., Ye, X., Fei, T., 2019. Extracting human emotions at different places based on facial expressions and spatial clustering analysis. Trans. GIS. https://doi.org/10.1111/tgis.12552.
Khan, R., Van de Weijer, J., Shahbaz Khan, F., Muselet, D., Ducottet, C., Barat, C., 2013. Discriminative color descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2866–2873.
Khosla, A., An, B., Lim, J.J., Torralba, A., 2014. Looking beyond the visible scene. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3710–3717.
Law, S., Paige, B., Russell, C., 2018. Take a look around: using street view and satellite images to estimate house prices. Available from: arXiv preprint arXiv:1807.07155.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
Lillesand, T., Kiefer, R.W., Chipman, J., 2014. Remote Sensing and Image Interpretation. John Wiley & Sons.
Lin, T.-Y., Cui, Y., Belongie, S., Hays, J., 2015. Learning deep representations for ground-to-aerial geolocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5007–5015.
Liu, L., Hou, A., Biderman, A., Ratti, C., Chen, J., 2009. Understanding individual and collective mobility patterns from smart card records: a case study in . In: International IEEE Conference on Intelligent Transportation Systems. IEEE, pp. 1–6.
Liu, Y., Kang, C., Gao, S., Xiao, Y., Tian, Y., 2012. Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 14 (4), 463–483.
Liu, Y., Wang, F., Xiao, Y., Gao, S., 2012. Urban land uses and traffic 'source-sink areas': evidence from GPS-enabled taxi data in . Landscape Urban Plan. 106 (1), 73–87.
Liu, Y., Sui, Z., Kang, C., Gao, Y., 2014. Uncovering patterns of inter-urban trip and spatial interaction from social media check-in data. PLoS ONE 9 (1), e86026.
Liu, Y., Liu, X., Gao, S., Gong, L., Kang, C., Zhi, Y., Chi, G., Shi, L., 2015. Social sensing: a new approach to understanding our socioeconomic environments. Ann. Assoc. Am. Geogr. 105 (3), 512–530.
Liu, L., Zhou, B., Zhao, J., Ryan, B.D., 2016. C-IMAGE: city cognitive mapping through geo-tagged photos. GeoJournal 81 (6), 817–861.
Lynch, K., 1960. The Image of the City, vol. 11. MIT Press.
Martí, P., Serrano-Estrada, L., Nolasco-Cirugeda, A., 2019. Social media data: challenges, opportunities and limitations in urban studies. Comput. Environ. Urban Syst. 74, 168–174.
Masucci, A.P., Serras, J., Johansson, A., Batty, M., 2013. Gravity versus radiation models: on the importance of scale and heterogeneity in commuting flows. Phys. Rev. E 88 (2), 022812.
Naik, N., Philipoom, J., Raskar, R., Hidalgo, C., 2014. Streetscore: predicting the perceived safety of one million streetscapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 779–785.
Naik, N., Kominers, S.D., Raskar, R., Glaeser, E.L., Hidalgo, C.A., 2017. Computer vision uncovers predictors of physical urban change. Proc. Natl. Acad. Sci. 114 (29), 7571–7576.
Noulas, A., Scellato, S., Lambiotte, R., Pontil, M., Mascolo, C., 2012. A tale of many cities: universal patterns in human urban mobility. PLoS ONE 7 (5), e37027.
Ojala, T., Pietikainen, M., Maenpaa, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24 (7), 971–987.
Pei, T., Sobolevsky, S., Ratti, C., Shaw, S.-L., Li, T., Zhou, C., 2014. A new insight into land use classification based on aggregated mobile phone data. Int. J. Geogr. Inform. Sci. 28 (9), 1988–2007.
Qi, G., Li, X., Li, S., Pan, G., Wang, Z., Zhang, D., 2011. Measuring social functions of city regions from large-scale taxi behaviors. In: IEEE International Conference on Pervasive Computing and Communications Workshops. IEEE, pp. 384–388.
Ranjan, R., Patel, V.M., Chellappa, R., 2017. Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans. Pattern Anal. Mach. Intell. 1.
Reed, S., Akata, Z., Lee, H., Schiele, B., 2016. Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), 211–252.
Shen, Z., Liu, Z., Li, J., Jiang, Y.-G., Chen, Y., Xue, X., 2017. DSOD: learning deeply supervised object detectors from scratch. In: The IEEE International Conference on Computer Vision, pp. 1937–1945.
Stedman, R.C., 2003. Is it really just a social construction?: the contribution of the physical environment to sense of place. Soc. Nat. Resour. 16 (8), 671–685.
Torrey, L., Shavlik, J., 2010. Transfer learning. In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, pp. 242–264.
Tuan, Y.-F., 1979. Space and place: humanistic perspective. In: Philosophy in Geography. Springer, pp. 387–427.
Wardrop, N., Jochem, W., Bird, T., Chamberlain, H., Clarke, D., Kerr, D., Bengtsson, L., Juran, S., Seaman, V., Tatem, A., 2018. Spatially disaggregated population estimates in the absence of national population and housing census data. Proc. Natl. Acad. Sci. 115 (14), 3529–3537.


Yao, Y., Li, X., Liu, X., Liu, P., Liang, Z., Zhang, J., Mai, K., 2017. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. Int. J. Geogr. Inform. Sci. 31 (4), 825–848.
Yuan, J., Zheng, Y., Xie, X., 2012. Discovering regions of different functions in a city using human mobility and POIs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 186–194.
Zhang, F., Zhou, B., Liu, L., Liu, Y., Fung, H.H., Lin, H., Ratti, C., 2018. Measuring human perceptions of a large-scale urban region using machine learning. Landscape Urban Plan. 180, 148–160.
Zhang, F., Zhang, D., Liu, Y., Lin, H., 2018. Representing place locales using scene elements. Comput. Environ. Urban Syst. 71, 153–164.
Zhang, F., Zhou, B., Ratti, C., Liu, Y., 2019. Discovering place-informative scenes and objects using social media photos. Roy. Soc. Open Sci. 6 (3), 181375.
Zhong, C., Arisona, S.M., Huang, X., Batty, M., Schmitt, G., 2014. Detecting the dynamics of urban structure through spatial network analysis. Int. J. Geogr. Inform. Sci. 28 (11), 2178–2199.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A., 2014. Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A., 2017. Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5122–5130.
Zhu, D., Wang, N., Wu, L., Liu, Y., 2017. Street as a big geo-data assembly and analysis unit in urban studies: a case study using Beijing taxi data. Appl. Geogr. 86, 152–164.
