Predicting neighborhoods’ socioeconomic attributes using restaurant data

Lei Dong (a,b), Carlo Ratti (a), and Siqi Zheng (b,1)

(a) Senseable City Lab, Department of Urban Studies and Planning, Massachusetts Institute of Technology, Cambridge, MA 02139; and (b) China Future City Lab, Center for Real Estate, Department of Urban Studies and Planning, Massachusetts Institute of Technology, Cambridge, MA 02139

Edited by William A. V. Clark, University of California, Los Angeles, CA, and approved June 10, 2019 (received for review February 21, 2019)

Accessing high-resolution, timely socioeconomic data such as data on population, employment, and enterprise activity at the neighborhood level is critical for social scientists and policy makers to design and implement location-based policies. However, in many developing countries or cities, reliable local-scale socioeconomic data remain scarce. Here, we show an easily accessible and timely updated location attribute, restaurant, can be used to accurately predict a range of socioeconomic attributes of urban neighborhoods. We merge restaurant data from an online platform with 3 microdatasets for 9 Chinese cities. Using features extracted from restaurants, we train machine-learning models to estimate daytime and nighttime population, number of firms, and consumption level at various spatial resolutions. The trained model can explain 90 to 95% of the variation of those attributes across neighborhoods in the test dataset. We analyze the tradeoff between accuracy, spatial resolution, and number of training samples, as well as the heterogeneity of the predicted results across different spatial locations, demographics, and firm industries. Finally, we demonstrate the cross-city generality of this method by training the model in one city and then applying it directly to other cities. The transferability of this restaurant model can help bridge data gaps between cities, allowing all cities to enjoy big data and algorithm dividends.

restaurant | urban studies | machine learning

Significance
High-resolution socioeconomic data are crucial for place-based policy design and implementation, but it remains scarce for many developing cities and countries. We show that an easily accessible and timely updated neighborhood attribute, restaurant, when combined with machine-learning models, can be used to effectively predict a range of socioeconomic attributes. This approach allows us to collect training samples from representative neighborhoods and then use the trained model to infer unsampled neighborhoods in the city in a granular, timely, and low-cost manner. The good cross-city transferability performance of our model can also help bridge the “data gap” between cities, by training the model in cities with rich survey data and then applying it to cities where such data are unavailable.

Author contributions: L.D., C.R., and S.Z. designed research; L.D. performed research; L.D. analyzed data; and L.D., C.R., and S.Z. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
Data deposition: The aggregated data and code, to replicate results in this paper, have been deposited in GitHub, https://github.com/leiii/restaurant.
1 To whom correspondence may be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1903064116/-/DCSupplemental.
Published online July 15, 2019.
www.pnas.org/cgi/doi/10.1073/pnas.1903064116
PNAS | July 30, 2019 | vol. 116 | no. 31 | 15447–15452

High-resolution socioeconomic data for cities are critical for researchers and policy makers. Such data, like spatial–temporal distribution of population and economic activity, provide essential guidance for researchers in urban, economic, and environmental fields as they seek to understand city activities (1). The main source of socioeconomic data are various types of surveys. For instance, population census provides comprehensive demographic information, while household consumption survey allows us to understand changes in the consumption structure. However, because of the high cost, surveys cannot consider both high spatial resolution and timeliness: a census is typically conducted once every 10 y. More importantly, in some developing countries such as China, key socioeconomic data are often scarce or of low quality. Even in some big Chinese cities like Shanghai, most socioeconomic databases are only publicly accessible at the district level, offering fewer than 20 spatial observations.

Given the lack of timely and spatially detailed socioeconomic data, researchers and policy makers seek alternative approaches to measuring socioeconomic outcomes of interest (2, 3). “Night-time lights” data, for instance, has contributed to estimating regional economic activities (4–6). Massive online data, generated by social media, mobile phone, and e-commerce platforms, have been used to infer unemployment (7), individual’s personality (8–10), population distribution (11), consumption index (12), wealth (13), and so on. Moreover, current progress in machine-learning algorithms has also made “unstructured data” (e.g., satellite/street-view imagery and text content) valuable in inferring socioeconomic outcomes (14–16). Nevertheless, accessibility, updatability, and interpretability of these data sources remain significant challenges.

In this paper, we show the potential of using restaurant data to predict 4 important socioeconomic variables: daytime population, nighttime population, number of firms, and volume of consumption at various spatial resolutions. The restaurant data we use are a set of attributes, such as average price of a meal, category, and so on, that characterize a restaurant. This exercise has 3 motivations. First, China has experienced rapid urbanization over the past decades, but fine granular and regularly updated urban socioeconomic data are rarely available. Second, the restaurant industry is one of the most decentralized and deregulated local (location-based) consumption industries, whether in developing or developed cities. It is highly correlated with local socioeconomic attributes, like population size, wealth, and consumption. Third, national-wide restaurant data are easily available via online platforms such as Dianping (China) and Yelp (United States), and variables extracted from restaurant data (e.g., average price, number of reviews) have reasonable economic interpretation, providing the possibility of explaining the model’s results.

This paper differs from recent literature that has evaluated the relation between restaurant data from Yelp and business patterns (17), neighborhood change (18), hygiene quality (19), or US county segregation (20). We combine restaurant data with 3 extensive datasets (population, firm, and consumption) and test prediction accuracy at different spatial resolutions, the heterogeneous effect, as well as the cross-city generality by training the model in one city and then applying it to other cities. Our

[Fig. 1 graphic. (A) Feature engineering: a location-features matrix (x) with rows for locations and columns for cuisine types (e.g., Anhui, Beijing, Bread, Buffet, Chinese, Coffee) and features such as review count, log(review sum + 1), log(price mean + 1), and taste mean; a prediction model f fitted with 5-fold cross validation on an 80% train / 20% test split; attributes-to-predict (y): daytime population, nighttime population, firm, and consumption. (B) Rolling window augmentation.]

Fig. 1. Data and methodology. (A) A schematic of feature engineering and the training process. For each grid cell and for each cuisine type of restaurant, we calculate 7 metrics (see SI Appendix, Note 2 for details). We merge restaurant features with attributes-to-predict (daytime population, nighttime population, firm, and consumption) by grid cell index. (B) Rolling window data augmentation. For each grid, we move the sampling window along the x and y direction (and both) by half the length of the grid size to produce a new sample that is 3 times the original sample size, and then we split the combined data into the 80% training set and the 20% test set.
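The rolling-window step in panel B can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes restaurant locations are already projected to planar coordinates in kilometers, and the function names `grid_counts` and `rolling_window_samples` and the toy coordinates are hypothetical. It counts points per cell for the original grid and for the three windows shifted by half a cell along x, along y, and along both.

```python
from collections import Counter


def grid_counts(points, cell_km, dx=0.0, dy=0.0):
    """Count points per square cell of side cell_km, with the grid origin shifted by (dx, dy)."""
    counts = Counter()
    for x, y in points:
        ix = int((x - dx) // cell_km)  # floor division keeps negative indices consistent
        iy = int((y - dy) // cell_km)
        counts[(ix, iy)] += 1
    return counts


def rolling_window_samples(points, cell_km):
    """Return one Counter per window position: original, x-shift, y-shift, and both."""
    half = cell_km / 2.0
    offsets = [(0.0, 0.0), (half, 0.0), (0.0, half), (half, half)]
    return {off: grid_counts(points, cell_km, *off) for off in offsets}


# Toy restaurant coordinates in km (hypothetical values for illustration only).
restaurants = [(0.2, 0.3), (0.9, 0.8), (1.6, 0.4), (2.4, 2.6)]
samples = rolling_window_samples(restaurants, cell_km=2.0)

for offset, counts in samples.items():
    print(offset, dict(counts))
```

In a full pipeline, the attributes to predict would have to be aggregated over the same shifted cells so that each augmented row pairs features and targets from an identical window.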

approach is especially helpful for developing countries/cities with a fast speed of urbanization and limited resources: 1) It allows us to collect training samples of socioeconomic variables from a small number of neighborhoods and then infer unsampled neighborhoods from statistical models. 2) The transferability of the model can help bridge the “data gap” between big and small cities, allowing small cities to benefit from data and algorithm advances.

Results
For this research, we collect restaurant data from Dianping (the largest online rating and deal service platform for restaurants in China) for 9 Chinese cities: Beijing, Shenzhen, Chengdu, Shenyang, Zhengzhou, Kunming, Baoding, Yueyang, and Hengyang. SI Appendix, Table S1 shows the descriptive statistics of these cities. These selected cities are located in different geographic regions of China and vary greatly in population size, which helps test the robustness of our method. We collect the following restaurant attributes: name, longitude, latitude, cuisine category, price, taste rating, environment rating, service rating, average score, and number of reviews (SI Appendix, Fig. S1). From these restaurant attributes, we construct hundreds of features, for instance, the number of restaurants, the number of reviews, the mean of reviews, and the mean of taste ratings at various spatial resolutions, from 1 to 5 km (see Fig. 1A and SI Appendix, Note 2 for details). We then merge restaurant features with daytime/nighttime population data (estimated by mobile phone data), firm registration record data, and volume of consumption (estimated by the bank card records) for each grid cell (Materials and Methods). We propose a data augmentation method to improve model performance. For each grid cell, we move the sampling window along the x and y direction by half the length of the grid size (Fig. 1B). This data augmentation method provides more information about the spatial distribution of variables and could significantly improve the accuracy and stability of the model (SI Appendix, Fig. S7). For each city and each spatial resolution, we apply LASSO (least absolute shrinkage and selection operator) regressions (21) with 5-fold cross-validations on 80% of the observations (training set) and then test the model performance on the remaining 20% of the sample (testing set) (see Materials and Methods for details).

Model Performance. Fig. 2 and SI Appendix, Fig. S8 show the prediction accuracy (R2) of daytime population, nighttime population, number of firms, and volume of consumption across 9 cities at various spatial resolutions (from 1 to 5 km). In general, the models trained on restaurant data are strongly predictive of all 4 variables of interest. At a cell size of 4.5 km (the resolution of the administrative boundary, Jiedao, in China, similar to the neighborhood level in the United States) predictions on the test dataset can explain 95% (minimum: 86%; maximum: 98%) of the variation in daytime population, 95% (91%; 98%) in nighttime population, 93% (87%; 97%) in number of firms, and 90% (82%; 94%) in the volume of consumption (Fig. 2D). The highest accuracy is achieved for daytime/nighttime population, which are proxies for employment and residents, respectively (22), suggesting that as a local market-driven industry, restaurants are highly correlated with characteristics of the local population. The result of population estimation (95%) is also much higher than remote sensing-based models at a similar granular level (R2 of the predicted population density reported in ref. 23 was 86% and 85% in ref. 11).

The R2 values of firm prediction are a little bit lower than for daytime/nighttime population. This may be attributable to the limitation of the firm dataset. For firm data, we only have the registered address, which may not be the same as the operation address, and firm size is unreported in the raw dataset and thus may also bias the results. For the consumption side, restaurants should highly correlate with consumption, while the category of consumption is unknown in the dataset, making it difficult to test this hypothesis.

Fig. 2 A–D also shows that accuracy increases with the size of the grid. This may be because an increase in cell size reduces heterogeneity among cells, thus reducing the impact of extreme values on the model. As far as the population size of the city is concerned, except for consumption volume, models trained on small- and medium-sized cities appear more accurate than on big cities at high resolution (Fig. 2E). Spatial heterogeneity in small- and medium-sized cities is less than that found in big cities, suggesting that small/medium cities may be more predictable for socioeconomic characteristics than big cities.

From the application perspective, one important tradeoff is between the number of samples used to train a model and the accuracy that can be reached, because this tradeoff directly determines the cost–benefit of the model application. To calculate this tradeoff relationship, we fix grid size at 3 km and randomly select subsamples of the training set. Results for Beijing, as depicted in Fig. 2F, show that even collecting very few random samples can result in fairly high prediction accuracy. Using 10% of the training samples (∼100 observations) achieves 86% (SD = 2.3%) accuracy for daytime population, 87% (SD = 2.0%) for nighttime population, 74% (SD = 3.3%) for number of firms, and 82% (SD = 2.6%) for consumption volume. Collecting more samples could increase the accuracy but with diminishing marginal returns (Fig. 2F).

To justify the predictive power of restaurants, we include “night-time light,” the most commonly used proxy for urban activities, as baseline models (SI Appendix, Table S2). As shown

in SI Appendix, Table S2, the “night-time light” performs well in daytime/nighttime population estimation. However, the predictive power of the nighttime lighting is still far from the accuracy of the restaurant model, especially for firm and consumption. In terms of mean squared error (MSE), the predictive error of the restaurant model is only one quarter of that of nighttime lighting.

[Fig. 2 graphic. Panels A–D: accuracy (R2) versus grid size (km), from 1 to 5 km, with the Jiedao resolution marked, for Beijing, Zhengzhou, Baoding, and the average. Panel E: accuracy (R2) by city size group (small, medium, big) at 1 km and 5 km. Panel F: accuracy (R2) versus percentage (%) of training samples for daytime population, nighttime population, firm, and consumption.]

Fig. 2. Prediction accuracy. (A) Beijing. (B) Zhengzhou. (C) Baoding. (D) Averaged accuracy of 9 cities. The red, yellow, gray, and blue lines represent the accuracy of daytime population, nighttime population, firms, and consumption, respectively. (E) The relationship between city size and model accuracy. The hollow and solid circles represent models trained at 1-km resolution and 5-km resolution, respectively. Big cities are cities with a population of over 10 million: Beijing, Shenzhen, and Chengdu in our sample. Medium-sized cities have a population of 3 to 10 million: Shenyang, Zhengzhou, and Kunming. Small-sized cities have a population of less than 3 million: Baoding, Yueyang, and Hengyang. (F) The percentage of training samples and accuracy. Models were trained from the data of Beijing at 3-km resolution.

Heterogeneous Effect. To draw further information from the model, we investigate the heterogeneity of the results across different spatial locations, demographics, and firm industries.

Fig. 3A shows the spatial distribution of the predictive error of predicting daytime population in Beijing, and SI Appendix, Fig. S9 gives results for additional cities. Good prediction accuracy is achieved near the urban center. The relatively lower accuracy exists around suburban areas. This differs from remote-sensing data sources, which are more likely to underestimate population in the densely populated areas (23). Night lighting and other satellite data have the saturation effect, making it difficult for the model to accurately predict high-density populations.

Another general pattern in Fig. 3A is that the model overestimates the daytime population in some residential areas and underestimates in industrial areas (this pattern reverses itself for nighttime population model; SI Appendix, Fig. S10). This can be explained by the fact that restaurants reflects the “combination” of employment and residents at a neighborhood scale. Thus, the model can achieve good performance in highly mixed land use areas, i.e., urban centers, and lower performance in the opposite direction, i.e., suburban residential or industrial zones. From another angle, we can also treat underestimated regions as potential business opportunities for the restaurant industry.

Without the demographic information available in the mobile phone dataset, we use census data as an alternative to explore predictive accuracy across different groups of people. The latest census in China was administrated in 2010, and the highest spatial resolution data available is at the Jiedao level (similar to neighborhoods in the United States), which includes 309 observations in Beijing. Note that there are very few attributes in the census data at the Jiedao level, and some important variables like education, religion, and income are not accessible. Here, we include age structure (3 groups: 0 to 14 y, 15 to 64 y, 65+ y) and percentage of immigrants as variables to be predicted. We also take median housing price as a proxy for household wealth. Housing price data are collected from Lianjia.com, the largest real estate agent company in China.

Fig. 3B presents the prediction accuracy of different demographic groups. The highest accuracy is achieved for the 15- to 64-y-old group (R2 = 0.73), as they are the main force of urban activities and customers of the restaurant industry. Closely following are the household wealth (R2 = 0.70), percentage of immigrants (R2 = 0.59), age group over 65 y (R2 = 0.59), and age group below 14 y (R2 = 0.56). Note that there is a 7-y gap between demographic data and restaurant data, and this gap may lower the predictive power of the restaurant model, which could be evaluated with more timely survey data.

Different types of firms have very different preferences for spatial locations. For example, high-skilled firms are more likely to benefit from agglomeration and thus to be concentrated near the city center; low-skill firms, in contrast, are more

[Fig. 3 graphic. (A) Map of prediction error (%), color scale from −100 to 100, with a residential area and an industrial area marked. (B) Accuracy (R2) by group: age 14 and under, age 15–64, age 65 and over, immigrant, and wealth. (C) Accuracy (R2) by firm industry: agriculture, architecture, business service, entertainment, finance, hotel, manufacture, neighbor service, real estate, retail, and science.]

Fig. 3. Heterogeneity. (A) Spatial distribution of prediction errors of daytime population at 3-km resolution (Beijing). (B) Prediction accuracy of different age groups, immigrant percentage, and wealth (housing price) at the Jiedao level. (C) Prediction accuracy of different industries of firms at the Jiedao level (Chinese neighborhood unit).
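A per-group accuracy comparison like the one in panels B and C can be sketched as below. This is only an illustration under assumptions: it takes R2 to be the coefficient of determination, 1 − SS_res/SS_tot (the text does not spell out the exact formula), and the helper names and the toy observed/predicted values are hypothetical, not data from the paper.

```python
from collections import defaultdict


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def r_squared_by_group(records):
    """records: iterable of (group, observed, predicted); returns {group: R2}."""
    by_group = defaultdict(lambda: ([], []))
    for group, obs, pred in records:
        by_group[group][0].append(obs)
        by_group[group][1].append(pred)
    return {g: r_squared(o, p) for g, (o, p) in by_group.items()}


# Toy values (hypothetical): one row per neighborhood and firm category.
records = [
    ("finance", 10.0, 11.0), ("finance", 20.0, 19.0), ("finance", 30.0, 30.0),
    ("agriculture", 5.0, 6.0), ("agriculture", 7.0, 7.0), ("agriculture", 9.0, 7.0),
]
scores = r_squared_by_group(records)
print(scores)
```

Computing the score separately per group, rather than pooling all rows, is what makes differences between groups visible.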

sensitive to land price, leading to a dispersed spatial distribution. We train the model by firm category in terms of the industry code and investigate the predictive accuracy across different industries. Fig. 3C shows that agriculture and manufacturing are industries with the lowest prediction accuracy. Business service, entertainment, finance, and many other service industries have nearly the same prediction accuracy. This is in line with our hypothesis, as the spatial variation of firms causes an uneven distribution of employees with different income levels and skill sets, which is also reflected in the distribution of restaurants.

SI Appendix, Table S3 shows the top predictors for demographic and firm categories (Materials and Methods). For example, the best predictors of the presence of immigrants include cuisines of “Beijing,” “Bakery,” “Cooked,” and “Xinjiang,” whereas housing price (wealth) is indicated by “Hubei,” “Coffee,” and “Beijing.” Good predictors of the 15 to 64 age group include “Bakery,” “Hotpot,” and “.” Although some restaurant features clearly relate to their predicted attribute, as in the case of “Coffee” for the housing price prediction (also noted in ref. 18), other pairs are more elusive; there is no obvious connection between “Bakery” and the presence of immigrants.

Transferable. Finally, based on data from 9 different cities, we study the transferability of the restaurant model, that is, to what extent the model trained with one city’s data can estimate other cities’ socioeconomic variables. Exploring whether one specific city’s model generalizes across regions is helpful. Since most of the “big data”-related research or applications are taken from big cities, if models trained in big cities with rich data sources can be transferred to small cities, then small cities could also benefit from big data and algorithms.

For the cross-city model, we fix the spatial resolution to 3 km and traverse all city pairs. For each pair “A–B,” we train the model with any of A’s data that shares features with B and apply the model to predict B’s outcomes. The results are summarized in Fig. 4. We show that for all variables of interest, as expected, most of the models trained in-city (diagonals) outperform models out-of-city. In some cases, such as firm prediction in Shenyang, the out-of-city models can outperform in-city models (Fig. 4C). Generally, models appear to transfer well across cities. This transferability is particularly evident in the daytime/nighttime population and firm datasets, where R2 exceeds 0.7 in most out-of-city models. We also find that models trained in big cities to infer outcomes for other cities outperform models applied in the opposite way in most cases (SI Appendix, Table S4). These results indicate that restaurant data can capture common indicators of socioeconomic outcomes, and these commonalities can be transferred by models trained in cities with rich survey data and estimate outcomes of interest with reasonable accuracy in cities where survey outcomes are unobserved.

For the transfer models, we summarize the top predictors in SI Appendix, Table S5. “/Zhejiang,” “Guangdong,” and “Crayfish” appear as top features for all 4 variables of interest, showing the popularity of these in different cities. “Coffee” only enters the top features in the model for the daytime population. Similar to SI Appendix, Table S3, the important predictors learned by the model warrant further exploration.

Discussion and Conclusion
Measuring and mapping the socioeconomic outcomes of cities are of great importance to policy makers and researchers. Here, we demonstrate that local restaurants can accurately infer the spatial distribution of socioeconomic activities within cities at high granularity. Notably, we also show that collecting only a few training samples can result in high accuracy in inferring unsampled locations using machine-learning models, suggesting that the restaurant model can help city governors and researchers monitor city performance in a timely and low-cost manner. Another potential implication of our work is helping city governors optimize decision-making regarding the efficient allocation of public facilities with the inferred daytime and nighttime population distributions.

The similarity between restaurants and other geolocated digital traces (e.g., online maps) suggests that the potential to reveal a neighborhood’s attributes is unlikely to be limited to restaurants. Moreover, the rich attributes of digital traces also make it possible to measure outcomes that were never included in traditional datasets (3).

Taking the national perspective, we show that models trained in one city can achieve good predictive accuracy when applied to other cities. Despite differences in geographical, cultural, and economic conditions, cities share many common features in restaurants, which are strongly correlated with socioeconomic characteristics across cities. The transferability of the model could help bridge the “socioeconomic data gap” between large and small cities. Currently, we only demonstrate the transferability between cities within the same country. One important follow-up question is whether the restaurant model (or more generally, the socioeconomic predictive model) can be transferred even between different countries. Addressing this issue is beyond the scope of this research, but it is a promising direction for further exploration.

Given the limited availability of high-resolution time series data for population, employment, and other key socioeconomic indicators, we have not yet been able to evaluate the ability of the restaurant data and the machine-learning approach to predict the temporal changes in a location’s socioeconomic attributes over time. Such investigation should be possible in

15450 | www.pnas.org/cgi/doi/10.1073/pnas.1903064116 Dong et al. Downloaded by guest on September 30, 2021 Downloaded by guest on September 30, 2021 i.4. Fig. oge al. It et on Dong China. listed in restaurants administrations million 8 county-level were (80%) there 2,298 2017, covering of Dianping, end the By ping.com. Data. Restaurant Methods decision-making and Materials assist to practice. data and these research inter- in-depth use more of to requires outcomes how inexpensively socioeconomic However, for on est. useful data be granular could data. producing here short-term proposed using approach trained are The deci- models (related most predictive while changes), current decisions since land-use the long-term long-term cities, and are infrastructure for durable development to important city about especially decision” sions a is making and This prediction a (25). making num- a between are gaps “there of However, ber agenda. urban the in roles important of sources (24). data possibility restaurants geolocated to the limited with not us data Newly city show aggregated instruments. vision, interpolating computer survey as in other such superresolution communities, or machine-learning in data aggregated algorithms coarse developed census from like data granular sources high deriving important regularly very is more but direction are untouched data Another gathered. restaurant frequently and and data survey as future the Reported evaluated. (C population. Medium C AB seegn ra,bgdt n ahn erigaeplaying are learning machine and data big areas, emerging As Small Big rs-iymdlgnrlzto.Cross-validated generalization. model Cross-city Zhengzhou Zhengzhou Hengyang Hengyang Shenzhen Shenzhen Shenyang Shenyang Chengdu Chengdu Kunming Kunming Yueyang Yueyang Baoding Baoding osmto.Cte nthe on Cities Consumption. (D) Firm. ) Beijing Beijing ecletdrsarn aai 07t 08fo dian- from 2018 to 2017 in data restaurant collected We R

2 Beijing Beijing ausaeaeae vr5 ras n h ihs au o ahct swti h lc box. black the within is city each for value highest the and trials, 50 over averaged are values atm ouainNighttimepopulation Daytime population imConsumption Firm 0 0 0 0 0.85 0 0 0 0 0 0 0 0 0 0 0 ...... 7 . .7 . .7 Shenzhen 95 64 83 91 89 88 86 86 87 84 66 Shenzhen . 6 5 4 7 0 0 0 0 0.8 0 0 0 0 0 0 0 0 0 0 0 0 0 . . . .7 . . . .7 . . . .7 .7 .7 83 85 86 63 69 54 .7 6 86 67 . .

Chengdu 8 6 Chengdu 9 1 1 5 9 6 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .7 . . . .8 .7 .5 .7 .7 .7 .7 . . . . .7 Shenyang 65 94 86 . . 69 Sheny 67 81 87 7 6 6 1 8 8 3 1 6 1 9 0 0 0 0 0.83 0 0 0 0 0 0 0 0 0 0 0 0 Zhengzhou Zhengzhou 0 ...... ang .75 .7 . . .7 .7 . . . 6 81 97 69 66 . 68 84 68 94 67 61 67 8 1 1 8 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .7 .9 .7 . . .7 . .7 . . . .5 . . 59 82 83 . .7 68 66 9 8 8 87 . 9 Kunming 9 Kunming 7 4 4 8 2 5 5 7 5 0 0 0 0.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .7 .7 . . .7 . .75 .7 .5 . .7 . . . . 84 95 82 . .7 82 96 64 62 6 6 6 7 3 Baoding 8 Baoding 6 8 3 7 x 0 0 0 0 0.64 0 0 0 0 0 0 0 0 0 0 0 0 0 . .7 .7 .74 . . .7 R . .5 ...... xsidct hr h oe a rie,wt iison cities with trained, was model the where indicate axis 53 94 8 . 64 8 44 67 89 63 81 . 9 6 Yueyang 2 1 9 7 Yueyang 5 2 5 Nighttime ( B) population. Daytime (A) cities. other to applied and city one in trained models of 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .7 . .7 . .7 . . . .5 .7 .5 . .7 . . Hengyang 66 52 33 94 . .

should be noted that although Dianping data cover a very large share of the restaurant industry, the penetration rate still does not reach 100%. Using Beijing as an example: according to the report by Beijing Cuisine Association, as of the end of 2016, there were 147,575 restaurants in operation. In our Dianping restaurant data, we have 139,131 restaurants, which accounts for 94% of the total number of restaurants. For the other cities, we have collected 98,531 (Shenzhen), 134,497 (Chengdu), 57,915 (Shenyang), 66,458 (Zhengzhou), 56,140 (Kunming), 46,219 (Baoding), 15,864 (Yueyang), and 18,209 (Hengyang) restaurants, respectively. In total, we have collected 632,964 restaurants for 9 cities.

[Figure: R² matrices across the 9 cities (Beijing, Shenzhen, Chengdu, Shenyang, Zhengzhou, Kunming, Baoding, Yueyang, Hengyang); panel label D; color scale R² from 0.5 to 1.0; surviving caption fragment: "y axis showing where the model was ...". Cell values are not recoverable from the extraction.]

PNAS | July 30, 2019 | vol. 116 | no. 31 | 15451

Daytime and Nighttime Population Data. The distributions of daytime and nighttime populations were estimated by mobile phone location data for the year 2015. The raw data are generated when people use location-based services; it contains a series of geopositioning points (time stamp, longitude, latitude) for each anonymous user. To infer the daytime and nighttime population distributions, we take the following steps (see SI Appendix, Note 1 for details). 1) We identify each user's stay points, defined by when moving distance is less than 200 m within a 10-min time threshold. 2) We then cluster the stay points into different clusters using density-based spatial clustering of applications with noise (DBSCAN) algorithm. 3) Finally, we apply Xgboost, a tree-based machine-learning algorithm, to train 2 classifiers for home and work location classification, respectively. The distributions of home and work locations are regarded as the nighttime and daytime population distributions, respectively. After data preprocess, we have 56.3 million (mobile phone inferred) population for 9 cities (see SI Appendix, Table S1 for details).
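Step 1 of the population pipeline (stay-point identification) can be illustrated with a minimal sketch. It is not the authors' implementation, which is described only in SI Appendix, Note 1. It assumes one common reading of the definition: a run of consecutive fixes that all lie less than 200 m from the first fix of the run and that spans at least 10 min counts as one stay point, located at the mean coordinate. The function names `haversine_m` and `stay_points` and the tuple layout are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(fixes, max_dist_m=200.0, min_stay_s=600.0):
    """Detect stay points for one user.

    fixes: list of (timestamp_seconds, lat, lon), sorted by time.
    Returns a list of (mean_lat, mean_lon, arrive_ts, leave_ts).
    Assumed rule: consecutive fixes closer than max_dist_m to the first
    fix of the run, spanning at least min_stay_s, form one stay point.
    """
    result = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the run while fixes stay within max_dist_m of anchor fix i.
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) < max_dist_m:
            j += 1
        run = fixes[i:j]
        if run[-1][0] - run[0][0] >= min_stay_s:
            lat = sum(f[1] for f in run) / len(run)
            lon = sum(f[2] for f in run) / len(run)
            result.append((lat, lon, run[0][0], run[-1][0]))
            i = j          # continue after the detected stay
        else:
            i += 1         # too short to be a stay; slide the anchor forward
    return result

if __name__ == "__main__":
    demo = [
        (0, 39.9000, 116.4000),
        (300, 39.9002, 116.4001),
        (900, 39.9001, 116.3999),
        (1000, 39.9100, 116.4100),
        (1060, 39.9200, 116.4200),
    ]
    print(stay_points(demo))
```

In a full pipeline the resulting stay points would then be passed to a DBSCAN implementation (for example scikit-learn's `sklearn.cluster.DBSCAN`) for step 2 and to a gradient-boosted tree classifier for step 3; the clustering and classifier parameters are not stated in this section, so they are left out here.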

To protect user privacy, the datasets are anonymized during the whole process and aggregated into 100 m × 100 m grid cells for further analysis.

At the aggregated level, we calculate the correlation between mobile phone-inferred home locations (nighttime distribution) and the population microcensus data at the district level. The R² is 0.97 for Beijing, indicating that the mobile phone inferred population fits well with the microcensus data in terms of spatial distribution (SI Appendix, Fig. S11).

Firm Data. We collected firm registration record data (2000 to 2017) from the registry database of the State Administration for Industry and Commercial Bureau of China. These data cover the registered information for all firms in China, with variables including firm name, year established, operation status, address, industry code, and so on. There are 8.72 million firms in these cities we study (SI Appendix, Table S1). We geocode firm addresses into longitude and latitude using Baidu Map application program interface and then aggregate firms by grid cells of each city.

Consumption Data. Consumption data were aggregated by the bank card records for point-of-sales (POS) from July to September in 2016. The original record is anonymized, and we only know the amount of consumption in each location. These data have several limitations. First, the consumption category (e.g., food, hotel, transportation, etc.) is not included in this dataset, making it impossible to distinguish food consumption from all records. Second, POS and bank cards have different adoption rates across cities, which may affect the model's generalizability. However, it is the highest level of spatial granularity dataset we could access to estimate the consumption at grid cell level across different cities. To reduce the noise caused by outliers, we set an upper limit for each record: any single purchase of more than 10,000 renminbi (RMB) was set to the upper limit (10,000 RMB).

Prediction Model. To estimate the grid cell level socioeconomic outcomes, we train LASSO regression models using the glmnet package (26) in R. As a commonly used machine-learning method, LASSO performs both regularization and variable selection by introducing ℓ1 penalty, which minimizes:

$$\min_{\beta_0,\,\beta} J(\beta_0, \beta) = \frac{1}{2}\,\lVert y - \beta_0 \mathbf{1} - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1, \qquad [1]$$

where λ (≥ 0) controls the sparsity of the model and could be selected using cross-validation. By shrinking the coefficient of some variables to zero, LASSO effectively reduces overfitting and improves prediction accuracy on small datasets. To avoid overfitting, we adopt the following process. First, we randomly divide the data into training (80%) and test (20%) sets. For the training set, we train the model using fivefold cross-validation (splitting the training set into 5 randomly sampled folds; 4 folds are used to train the model and then test the accuracy on the fifth fold). This procedure is repeated 50 times to determine the average accuracy on the test set.

Variables' Importance. Although LASSO can shrink many coefficients to zero and report the value of "used" coefficients, we usually cannot directly compare coefficients to show variables' contribution. As demonstrated in ref. 27, a variable used in one partition of the dataset may not be used in another (although the prediction accuracy of different partitions is similar). We assume that important variables should appear more times than unimportant ones in different partitions. Thus, we compare the frequency with which different variables appear over 50 trials.

Data Availability. All data and code needed to replicate this research can be downloaded from https://github.com/leiii/restaurant (28).

ACKNOWLEDGMENTS. We thank Hao Li and Meng Li for their great assistance with data preprocessing and Kevin P. O'Keeffe, Hongbin Pei, Zhan Shi, Haishan Wu, and seminar participants at The Institute for Operations Research and the Management Sciences, Massachusetts Institute of Technology (MIT) for helpful comments. We also thank Allianz, Amsterdam Institute for Advanced Metropolitan Solutions, Brose, Cisco, Ericsson, Fraunhofer Institute, Liberty Mutual Institute, Kuwait-MIT Center for Natural Resources and the Environment, Shenzhen, Singapore-MIT Alliance for Research and Technology, Uber, the Vitoria State Government, Volkswagen Group of America, and all of the members of the MIT Senseable City Lab Consortium and the China Future City Lab for supporting this research. This work was partially supported by National Natural Science Foundation of China Grants 41801299 and 71625004.
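The procedure described under Prediction Model and Variables' Importance (ℓ1-penalized regression, repeated 80/20 partitions, and counting how often each variable is retained) can be illustrated with a small pure-Python sketch. This is not the authors' glmnet code. It minimizes the objective of Eq. 1 by cyclic coordinate descent, and for brevity it assumes a fixed λ passed in by the caller rather than one selected by fivefold cross-validation. All function names (`soft_threshold`, `lasso_fit`, `r_squared`, `selection_frequency`) and the synthetic data are hypothetical.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Minimise 0.5*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent.

    X is a list of rows (lists of floats). Returns (b0, b).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre columns and response so the intercept drops out of the updates.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual when all coefficients are 0
    b = [0.0] * p
    z = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # constant column carries no information
            # Inner product of column j with the residual that excludes j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) + z[j] * b[j]
            new_bj = soft_threshold(rho, lam) / z[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                b[j] = new_bj
    b0 = y_mean - sum(x_mean[j] * b[j] for j in range(p))
    return b0, b

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def selection_frequency(X, y, lam, n_trials=50, train_frac=0.8, seed=0):
    """Refit on random train/test partitions.

    Returns (share of trials in which each variable has a non-zero
    coefficient, mean R^2 on the held-out test sets).
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    scores = []
    for _ in range(n_trials):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(train_frac * n)
        train, test = idx[:cut], idx[cut:]
        b0, b = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        pred = [b0 + sum(X[i][j] * b[j] for j in range(p)) for i in test]
        scores.append(r_squared([y[i] for i in test], pred))
        for j in range(p):
            if b[j] != 0.0:
                counts[j] += 1
    return [c / n_trials for c in counts], sum(scores) / n_trials

if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: the outcome depends on variables 0 and 1, not on 2.
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(100)]
    y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 0.1) for r in X]
    print(selection_frequency(X, y, lam=20.0, n_trials=10))
```

Counting non-zero coefficients across partitions, rather than comparing coefficient magnitudes from a single fit, mirrors the reasoning in the text: a variable that is retained in nearly every partition is treated as more important than one that appears only occasionally.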

1. N. Wardrop et al., Spatially disaggregated population estimates in the absence of national population and housing census data. Proc. Natl. Acad. Sci. U.S.A. 115, 3529–3537 (2018).
2. L. Einav, J. Levin, Economics in the age of big data. Science 346, 1243089 (2014).
3. E. L. Glaeser, S. D. Kominers, M. Luca, N. Naik, Big data and big cities: The promises and limitations of improved measures of urban life. Econ. Inq. 56, 114–137 (2018).
4. X. Chen, W. D. Nordhaus, Using luminosity data as a proxy for economic statistics. Proc. Natl. Acad. Sci. U.S.A. 108, 8589–8594 (2011).
5. J. V. Henderson, A. Storeygard, D. N. Weil, Measuring economic growth from outer space. Am. Econ. Rev. 102, 994–1028 (2012).
6. D. Donaldson, A. Storeygard, The view from above: Applications of satellite data in economics. J. Econ. Perspect. 30, 171–198 (2016).
7. M. Kosinski, D. Stillwell, T. Graepel, Private traits and attributes are predictable from digital records of human behavior. Proc. Natl. Acad. Sci. U.S.A. 110, 5802–5805 (2013).
8. H. Choi, H. Varian, Predicting the present with google trends. Econ. Rec. 88, 2–9 (2012).
9. J. L. Toole et al., Tracking employment shocks using mobile phone data. J. R. Soc. Interf. 12, 20150185 (2015).
10. L. Dong et al., Measuring economic activity in China with mobile big data. EPJ Data Sci. 6, 29 (2017).
11. P. Deville et al., Dynamic population mapping using mobile phone data. Proc. Natl. Acad. Sci. U.S.A. 111, 15888–15893 (2014).
12. J. Blumenstock, G. Cadamuro, R. On, Predicting poverty and wealth from mobile phone metadata. Science 350, 1073–1076 (2015).
13. A. Cavallo, Scraped data and sticky prices. Rev. Econ. Stat. 100, 105–119 (2018).
14. N. Naik, S. D. Kominers, R. Raskar, E. L. Glaeser, C. A. Hidalgo, Computer vision uncovers predictors of physical urban change. Proc. Natl. Acad. Sci. U.S.A. 114, 7571–7576 (2017).
15. N. Jean et al., Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794 (2016).
16. T. Gebru et al., Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the United States. Proc. Natl. Acad. Sci. U.S.A. 114, 13108–13113 (2017).
17. E. L. Glaeser, H. Kim, M. Luca, "Nowcasting the local economy: Using yelp data to measure economic activity" (NBER Working Paper 24010, National Bureau of Economic Research, Cambridge, MA, 2017).
18. E. L. Glaeser, H. Kim, M. Luca, Measuring gentrification: Using yelp data to quantify neighborhood change. AEA Pap. Proc. 108, 77–82 (2018).
19. D. R. Davis, J. I. Dingel, J. Monras, E. Morales, How segregated is urban consumption? J. Polit. Econ., in press (2019).
20. J. S. Kang, P. Kuznetsova, M. Luca, Y. Choi, "Where not to eat? Improving public policy by predicting hygiene inspections using online reviews" in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Seattle, WA, 2013), pp. 1443–1448.
21. R. Tibshirani, Regression shrinkage and selection via the LASSO. J. Roy. Stat. Soc. B Met. 58, 267–288 (1996).
22. D. Martin, S. Cockings, S. Leung, Developing a flexible framework for spatiotemporal population modeling. Ann. Am. Assoc. Geogr. 105, 754–772 (2015).
23. A. E. Gaughan et al., Spatiotemporal patterns of population in mainland China, 1990 to 2010. Sci. Data 3, 160005 (2016).
24. T. Vandal et al., "DeepSD: Generating high resolution climate change projections through single image super-resolution" in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computer Machinery, New York, NY, 2017), pp. 1663–1672.
25. S. Athey, Beyond prediction: Using big data for policy problems. Science 355, 483–485 (2017).
26. J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1 (2010).
27. S. Mullainathan, J. Spiess, Machine learning: An applied econometric approach. J. Econ. Perspect. 31, 87–106 (2017).
28. L. Dong, C. Ratti, S. Zheng, Predicting neighborhoods' socioeconomic attributes using restaurant data. GitHub. https://github.com/leiii/restaurant. Deposited 3 May 2019.
