Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States

Timnit Gebru^a,1, Jonathan Krause^a, Yilun Wang^a, Duyun Chen^a, Jia Deng^b, Erez Lieberman Aiden^c,d,e, and Li Fei-Fei^a

^a Artificial Intelligence Laboratory, Computer Science Department, Stanford University, Stanford, CA 94305; ^b Vision and Learning Laboratory, Computer Science and Engineering Department, University of Michigan, Ann Arbor, MI 48109; ^c The Center for Genome Architecture, Department of Genetics, Baylor College of Medicine, Houston, TX 77030; ^d Department of Computer Science, Rice University, Houston, TX 77005; and ^e The Center for Genome Architecture, Department of Computational and Applied Mathematics, Rice University, Houston, TX 77005

Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved October 16, 2017 (received for review January 4, 2017)

The United States spends more than $250 million each year on the American Community Survey (ACS), a labor-intensive door-to-door study that measures statistics relating to race, gender, education, occupation, unemployment, and other demographic factors. Although a comprehensive source of data, the lag between demographic changes and their appearance in the ACS can exceed several years. As digital imagery becomes ubiquitous and machine vision techniques improve, automated data analysis may become an increasingly practical supplement to the ACS. Here, we present a method that estimates socioeconomic characteristics of regions spanning 200 US cities by using 50 million images of street scenes gathered with Google Street View cars. Using deep learning-based computer vision techniques, we determined the make, model, and year of all motor vehicles encountered in particular neighborhoods. Data from this census of motor vehicles, which enumerated 22 million automobiles in total (8% of all automobiles in the United States), were used to accurately estimate income, race, education, and voting patterns at the zip code and precinct level. (The average US precinct contains ∼1,000 people.) The resulting associations are surprisingly simple and powerful. For instance, if the number of sedans encountered during a drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next presidential election (88% chance); otherwise, it is likely to vote Republican (82%). Our results suggest that automated systems for monitoring demographics may effectively complement labor-intensive approaches, with the potential to measure demographics with fine spatial resolution, in close to real time.

computer vision | deep learning | social analysis | demography

Significance

We show that socioeconomic attributes such as income, race, education, and voting patterns can be inferred from cars detected in Google Street View images using deep learning. Our model works by discovering associations between cars and people. For example, if the number of sedans in a city is higher than the number of pickup trucks, that city is likely to vote for a Democrat in the next presidential election (88% chance); if not, then the city is likely to vote for a Republican (82% chance).

Author contributions: T.G., J.K., J.D., E.L.A., and L.F.-F. designed research; T.G., J.K., Y.W., D.C., J.D., E.L.A., and L.F.-F. performed research; T.G. and J.K. contributed new reagents/analytic tools; T.G., J.K., Y.W., D.C., J.D., E.L.A., and L.F.-F. analyzed data; and T.G., J.K., E.L.A., and L.F.-F. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

^1 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1700035114/-/DCSupplemental.

For thousands of years, rulers and policymakers have surveyed national populations to collect demographic statistics. In the United States, the most detailed such study is the American Community Survey (ACS), which is performed by the US Census Bureau at a cost of $250 million per year (1). Each year, ACS reports demographic results for all cities and counties with a population of 65,000 or more (2). However, due to the labor-intensive data-gathering process, smaller regions are interrogated less frequently, and data for geographical areas with less than 65,000 inhabitants are typically presented with a lag of ∼2.5 y. Although the ACS represents a vast improvement over the earlier, decennial census (3), this lag can nonetheless impede effective policymaking. Thus, the development of complementary approaches would be desirable.

In recent years, computational methods have emerged as a promising tool for tackling difficult problems in social science. For instance, Antenucci et al. (4) have predicted unemployment rates from Twitter; Michel et al. (5) have analyzed culture using large quantities of text from books; and Blumenstock et al. (6) used mobile phone metadata to predict poverty rates in Rwanda. These results suggest that socioeconomic studies, too, might be facilitated by computational methods, with the ultimate potential of analyzing demographic trends in great detail, in real time, and at a fraction of the cost.

Recently, Naik et al. (7) used publicly available imagery to quantify people’s subjective perceptions of a neighborhood’s physical appearance. They then showed that changes in these perceptions correlate with changes in socioeconomic variables (8). Our work explores a related theme: whether socioeconomic statistics can be inferred from objective characteristics of images from a neighborhood.

Here, we show that it is possible to determine socioeconomic statistics and political preferences in the US population by combining publicly available data with machine-learning methods. Our procedure, designed to build upon and complement the ACS, uses labor-intensive survey data for a handful of cities to train a model that can create nationwide demographic estimates. This approach allows for estimation of demographic variables with high spatial resolution and reduced lag time.

Specifically, we analyze 50 million images taken by Google Street View cars as they drove through 200 cities, neighborhood-by-neighborhood and street-by-street. In Google Street View images, only the exteriors of houses, landscaping, and vehicles on the street can be observed. Of these objects, vehicles are among the most personalized expressions of American culture: Over 90% of American households own a motor vehicle (9), and their choice of automobile is influenced by disparate demographic factors including household needs, personal preferences, and economic wherewithal (10). (Note that, in principle, other factors such as spacing between houses, number of stories, and extent of shrubbery could also be integrated into such models.) Such street scenes are a natural data type to explore: They already cover
much of the United States, and the emergence of self-driving cars will bring about a large increase in the frequency with which different locations are sampled.

13108–13113 | PNAS | December 12, 2017 | vol. 114 | no. 50 | www.pnas.org/cgi/doi/10.1073/pnas.1700035114

We demonstrate that, by deploying a machine vision framework based on deep learning, specifically Convolutional Neural Networks (CNN), it is possible not only to recognize vehicles in a complex street scene but also to reliably determine a wide range of vehicle characteristics, including make, model, and year. Whereas many challenging tasks in machine vision (such as photo tagging) are easy for humans, the fine-grained object recognition task we perform here is one that few people could accomplish for even a handful of images. Differences between cars can be imperceptible to an untrained person; for instance, some car models can have subtle changes in tail lights (e.g., 2007 Honda Accord vs. 2008 Honda Accord) or grilles (e.g., 2001 Ford F-150 Supercrew LL vs. 2011 Ford F-150 Supercrew SVT). Nevertheless, our system is able to classify automobiles into one of 2,657 categories, taking 0.2 s per vehicle image to do so. While it classified the automobiles in 50 million images in 2 wk, a human expert, assuming 10 s per image, would take more than 15 y to perform the same task. Using the classified motor vehicles in each neighborhood, we infer a wide range of demographic statistics, socioeconomic attributes, and political preferences of its residents.

In the first step of our analysis, we collected 50 million Google Street View images from 3,068 zip codes and 39,286 voting precincts spanning 200 US cities (Fig. 1). Using these images and annotated photos of cars, our object recognition algorithm [a “Deformable Part Model” (DPM) (11)] learned to automatically localize motor vehicles on the street (12) (see Materials and Methods). This model took advantage of a gold-standard dataset we generated by asking humans (both laypeople, recruited using Amazon Mechanical Turk, and car experts recruited through Craigslist) to identify cars in Google Street View scenes.

We successfully detected 22 million distinct vehicles, comprising 32% of all the vehicles in the 200 cities we studied and 8% of all vehicles in the United States. After localizing each vehicle, we deployed a CNN (13, 14), the most successful deep learning algorithm to date for object classification, to determine the make, model, body type, and year of each vehicle (Fig. 1). Specifically, we were able to classify each vehicle into one of 2,657 fine-grained categories, which form a nearly exhaustive list of all visually distinct automobiles sold in the United States since 1990 (Fig. 1). Using our human-annotated gold standard images, we trained the CNN to distinguish between different types of cars. For instance, our models accurately identified cars (identifying 95% of such vehicles in the test data), vans (83%), minivans (91%), SUVs (86%), and pickup trucks (82%). See SI Appendix, Fig. S1.

[Figure 1 panels: 200 cities; 50,000,000 images; 22,000,000 cars analyzed; 2,657 car categories. Example annotations (Make, Model, Year, Body Type, Trim, Price): Nissan, Sentra, 2006, sedan, 1.8 s, $5,417; Ford, Econoline-Cargo, 2003, van, e-150, $3,778; Honda, Accord, 1994, sedan, lx, $3,591; Honda, Civic, 2004, sedan, ex, $8,773.]

Fig. 1. We perform a vehicular census of 200 cities in the United States using 50 million Google Street View images. In each image, we detect cars with computer vision algorithms based on DPM and count an estimated 22 million cars. We then use CNN to categorize the detected vehicles into one of 2,657 classes of cars. For each type of car, we have metadata such as the make, model, year, body type, and price of the car in 2012. Images courtesy of Google Maps/Google Earth.

Using the resulting motor vehicle data, we estimate demographic statistics and voter preferences as follows. For each geographical region we examined (city, zip code, or precinct), we count the number of vehicles of each make and model that were identified in images from that region. We also include additional features such as aggregate counts for various vehicle types (trucks, vans, SUVs, etc.), the average price and fuel efficiency, and the overall density of vehicles in the region (see Materials and Methods).

We then partitioned our dataset, by county, into two subsets (Fig. 2). The first is a “training set,” comprising all regions that lie mostly in a county whose name starts with “A,” “B,” or “C” (such as Ada County, Baldwin County, Cabarrus County, etc.). This training set encompasses 35 of the 200 cities, ∼15% of the zip codes, and ∼12% of the precincts in our data. The second is a “test set,” comprising all regions in counties starting with the letters “D” through “Z” (such as Dakota County, Maricopa County, Yolo County). We used the test set to evaluate the model that resulted from the training process.

Using ACS and presidential election voting data for regions in our training set, we train a logistic regression model to estimate race and education levels and a ridge regression model to estimate income and voter preferences on the basis of the collection of vehicles seen in a region. This simple linear model is sufficient to identify positive and negative associations between
the presence of specific vehicles (such as Hondas) and particular demographics (e.g., the percentage of Asians) or voter preferences (e.g., Democrat).

[Figure 2 panels: map of training (Train) and test (Test) cities; actual vs. predicted maps for i. White (Seattle, Washington); ii. Black (Seattle, Washington); iii. Asian (Seattle, Washington); iv. Less than High school (Milwaukee, Wisconsin), scale 0% to 34%; v. Graduate school (Milwaukee, Wisconsin), scale 0% to 48%; vi. Income (Tampa, Florida), scale $7K to $111K.]

Fig. 2. We use all of the cities in counties starting with A, B, and C (shown in purple on the map) to train a model estimating socioeconomic data from car attributes. Using this model, we estimate demographic variables at the zip code level for all of the cities shown in green. We show actual vs. predicted maps for the percentage of Black, Asian, and White people in Seattle, WA (i–iii); the percentage of people with less than a high school degree in Milwaukee, WI (iv); and the percentage of people with graduate degrees in Milwaukee, WI (v). (vi) Maps the median household income in Tampa, FL. The ground truth values are mapped on Left, and our estimated results are on Right. We accurately localize zip codes with the highest and lowest concentrations of each demographic variable such as the three zip codes in Eastern Seattle with high concentrations of Caucasians, one Northern zip code in Milwaukee with highly educated inhabitants, and the least wealthy zip code in Southern Tampa.

Our model detects strong associations between vehicle distribution and disparate socioeconomic factors. For instance, several studies have shown that people of Asian descent are more likely to drive Asian cars (15), a result we observe here as well: The two brands that most strongly indicate an Asian neighborhood are Hondas and Toyotas. Cars manufactured by Chrysler, Buick, and Oldsmobile are positively associated with African American neighborhoods, which is again consistent with existing research (16). And vehicles like pickup trucks, Volkswagens, and Aston Martins are indicative of mostly Caucasian neighborhoods. See SI Appendix, Fig. S2.

In some cases, the resulting associations can be easily applied in practice. For example, the vehicular feature that was most strongly associated with Democratic precincts was sedans, whereas Republican precincts were most strongly associated with extended-cab pickup trucks (a truck with rear-seat access). We found that by driving through a city while counting sedans and pickup trucks, it is possible to reliably determine whether the city voted Democratic or Republican: If there are more sedans, it probably voted Democrat (88% chance), and if there are more pickup trucks, it probably voted Republican (82% chance). See Fig. 3A, iii.

As a result, it is possible to apply the associations extracted from our training set to vehicle data from our test set regions to generate estimates of demographic statistics and voter preferences, achieving high spatial resolution in over 160 cities. To be clear, no ACS or voting data for any region in the test set were used to create the estimates for the test set.

To confirm the accuracy of our demographic estimates, we began by comparing them with actual ACS data, city-by-city, across all 165 test set cities. We found a strong correlation between our results and ACS values for every demographic statistic we examined. (The r values for the correlations were as follows: median household income, r = 0.82; percentage of Asians, r = 0.87; percentage of Blacks, r = 0.81; percentage of Whites, r = 0.77; percentage of people with a graduate degree, r = 0.70; percentage of people with a bachelor’s degree, r = 0.58; percentage of people with some college degree, r = 0.62; percentage of people with a high school degree, r = 0.65; percentage of people with less than a high school degree, r = 0.54). See SI Appendix, Figs. S3–S5. Taken together, these results show our ability to estimate demographic parameters, as assessed by the ACS, using the automated identification of vehicles in Google Street View data.

Although our city-level estimates serve as a proof-of-principle, zip code-level ACS data provide a much more fine-grained portrait of constituencies. To investigate the accuracy of our methods at zip code resolution, we compared our zip code-by-zip code estimates to those generated by the ACS, confirming a close correspondence between our findings and ACS values. For instance, when we looked closely at the data for Seattle, we found that our estimates of the percentage of people in each zip code who were Caucasian closely matched the values obtained by the ACS (r = 0.84, p < 2e−7). The results for Asians (r = 0.77, p = 1e−6) and African Americans (r = 0.58, p = 7e−4) were similar. Overall, our estimates accurately determined that Seattle, Washington is 69% Caucasian, with African Americans mostly residing in a few Southern zip codes (Fig. 2 i and ii). As another example, we estimated educational background in Milwaukee, Wisconsin zip codes, accurately determining the fraction of the population with less than a high school degree (r = 0.70, p = 8e−5), with a bachelor’s degree (r = 0.83, p < 1e−7), and with postgraduate education (r = 0.82, p < 1e−7). We also accurately determined the overall concentration of highly educated inhabitants near the city’s northeast border (Fig. 2 iv and v). Similarly, our income estimates closely match those of the ACS in Tampa, Florida (r = 0.87, p < 1e−7). The lowest income zip code, at the southern tip, is readily apparent.
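
The r values quoted above are correlation coefficients between the image-based estimates and the ACS figures for the same regions. As a minimal sketch of that comparison, assuming these are ordinary Pearson correlations computed over two equal-length lists of per-zip-code values, the calculation could look as follows. The helper name `pearson_r` is hypothetical, and the numbers are made-up placeholders rather than data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Placeholder values (assumption): fraction of residents in some demographic
# group per zip code, as reported by a survey and as estimated by a model.
acs_values = [0.10, 0.25, 0.40, 0.55, 0.70]
estimated_values = [0.15, 0.20, 0.45, 0.50, 0.65]

r = pearson_r(acs_values, estimated_values)
print(f"r = {r:.2f}")
```

A value of r close to 1 means the estimates rank and scale the regions much as the survey does; it does not by itself show that the estimates are unbiased.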

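The sedan-versus-pickup rule described earlier amounts to a one-line classifier. The sketch below is an illustration under stated assumptions: the function names `predict_city_vote` and `rule_accuracy` and the example counts are hypothetical, and ties are assigned to the Republican branch only because the abstract phrases that branch as "otherwise."

```python
def predict_city_vote(sedan_count, pickup_count):
    """More sedans than pickup trucks -> Democrat; otherwise Republican."""
    return "Democrat" if sedan_count > pickup_count else "Republican"


def rule_accuracy(cities):
    """Fraction of (sedans, pickups, actual_label) triples the rule gets right."""
    correct = sum(predict_city_vote(s, p) == label for s, p, label in cities)
    return correct / len(cities)


# Hypothetical counts (assumption), not figures from the study.
example_cities = [
    (1200, 300, "Democrat"),
    (400, 900, "Republican"),
    (800, 750, "Republican"),  # a case the rule gets wrong
    (650, 200, "Democrat"),
]

accuracy = rule_accuracy(example_cities)
print(f"rule accuracy on the toy data: {accuracy:.2f}")
```

The rule uses only two vehicle counts per city, which is why it can be applied by hand, whereas the full model uses the whole collection of vehicle features.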
While the ACS does not collect voter preference data, our automated machine-learning procedure can infer such preferences using associations between vehicles and the voters that surround them. To confirm the accuracy of our voter preference estimates, we began by comparing them with the voting results of the 2008 presidential election, city-by-city, across all 165 test set cities. We found a very strong correlation between our estimates and actual voter preferences (r = 0.73, p << 1e−7). See SI Appendix, Fig. S5. These results confirm the ability of our approach to accurately estimate voter behavior.

While city-level data provide a general picture, precinct-level voter preferences identify patterns within a particular city. By comparing our precinct-by-precinct estimates to the 2008 presidential election results, we found that our estimates continued to closely match the ground truth data. For instance, in Milwaukee, Wisconsin, a very Democratic city with 311 precincts, we correctly classify 264 precincts [85% accuracy (Fig. 3B)]. Most notably, we accurately determine that there are a few Republican precincts in the South, West, and Northeastern borders of the city. Similarly, in Gilbert, Arizona, a Republican city, we correctly classify 58 out of 60 precincts (97% accuracy), identifying one out of the two small Democratic precincts in the city (Fig. 3B). And in Birmingham, Alabama, a city that is 23% Republican, we correctly classify 87 out of the 105 precincts (83% accuracy). Overall, there was a strong correlation between our estimates and actual electoral outcomes at the single-precinct level (r = 0.57, p < 1e−7).

[Figure 3 panels: A, i. Actual Percent of Voters for Obama in 2008; ii. Predicted Percent of Voters for Obama in 2008; iii. Ratio of Sedans to Extended-cab Trucks. B, precinct maps for Los Angeles, California; Milwaukee, Wisconsin; Birmingham, Alabama; Lexington, Kentucky; Casper, Wyoming; Garland, Texas; Gilbert, Arizona. Legend: Democrat, Republican.]

Fig. 3. Actual and inferred voting patterns. (A) i and ii map the actual and predicted percentage of people who voted for Barack Obama in the 2008 presidential election (r = 0.74). iii maps the ratio of detected pickup trucks to sedans in the 165 cities from our test set. As can be seen from the map, the ratio is very low in Democratic cities such as those in the East Coast and high in Republican cities such as those in Texas and Wyoming. (B) Shows actual vs. predicted voter affiliations for various cities in our test set at the precinct level using our full model. Democratic precincts are shown in blue, and Republican precincts are shown in red. Our model correctly classifies Casper, WY as a Republican city and Los Angeles, CA as a Democratic city. We accurately predict that Milwaukee, WI is a Democratic city except for a few Republican precincts in the southern, western, and northeastern borders of the city.

These results illustrate the ability of our machine-learning algorithm to accurately estimate both demographic statistics and voter preferences using a large database of Google Street View images. They also suggest that our demographic estimates are accurate at higher spatial resolutions than those available for yearly ACS data. Using our approach, zip code- or precinct-level survey data collected for a few cities can be used to automatically provide up-to-date demographic information for many American cities.

Thus, we find that the application of fully automated computer vision methods to publicly available street scenes can inexpensively determine social, economic, and political patterns in neighborhoods across America. By collecting surveys for a few cities (the type of data routinely obtained via ACS) and inferring data for other regions using our model, we can quickly determine demographic patterns.

As self-driving cars with onboard cameras become increasingly widespread, the type of data we use (footage of neighborhoods from vehicle-mounted cameras) is likely to become increasingly ubiquitous. For instance, Tesla vehicles currently take as many images as were studied here every single day. It is also important to note that similar data can be obtained, albeit at a slower pace, using low-tech methods: for instance, by walking around a target neighborhood with a camera and a notepad. Thus, street scenes stand in contrast to the massive textual
corpora presently used in many computational social science studies, which can be constrained by privacy and copyright concerns that prevent individual researchers from obtaining the raw data underlying published analyses.

The automated method we present could be substantially improved by expanding our object recognition beyond vehicles (17, 18) and incorporating global image features (7, 19–21). For instance, our experiments show that global image features extracted from CNN can also be used to infer demographics. But this approach requires more data (at least 50% of our dataset) rather than the 12% to 15% we use with our current method (see SI Appendix). The model could also be improved by integrating other types of data, such as satellite images (22), social networks (4), and economic data pertaining to local consumer behavior in particular geographic regions. Nevertheless, there are many characteristics that our methodology, which relies on publicly available data, may not be able to infer (see SI Appendix). For instance, the age of children in a neighborhood can be estimated with moderate accuracy (r = 0.54), while the percentage of farmers in a neighborhood was not successfully inferred using our method (r = 0.0).

Although automated methods could be powerful resources for both researchers and policymakers, their progress will raise important ethical concerns; it is clear that public data should not be used to compromise reasonable privacy expectations of individual citizens, and this will be a central concern moving forward. In the future, such automated methods could lead to estimates that are accurately updated in real time, dramatically improving upon the time resolution of a manual survey.

Materials and Methods

Here, we describe our methodology for data collection, car detection, car classification, and demographic inference. Some of these methods were partially developed in an earlier paper (12), which served as a proof of concept focusing on a limited set of predictions (e.g., per capita carbon emission, Massachusetts Department of Vehicles registration data, income segregation). Our work builds on these methods to show that income, race, education levels, and voting patterns can be predicted from cars in Google Street View images. In the sections below, we discuss our dataset and methodology in more detail.

Dataset. While learning to recognize automobiles, a model needs to be trained with many images of vehicles annotated with category labels. To this end, we used Amazon Mechanical Turk to gather a dataset of labeled car images obtained from edmunds.com, cars.com, and craigslist.org. Our dataset consists of 2,657 visually distinct car categories, covering all commonly used automobiles in the United States produced from 1990 onward. We refer to these images as product shot images. We also hired experts to annotate a subset of our Google Street View images. The annotations include a bounding box around each car in the image and the type of car contained in the box. We partition the images into training, validation, and test sets. In addition to our annotated images, we gathered 50 million Google Street View images from 200 cities, sampling GPS points every 25 m. We captured 6 images per GPS point, corresponding to different camera rotations. Each Street View image has dimensions 860 by 573 pixels and a horizontal field of view of ∼90°. Since the horizontal field of view is larger than the change in viewpoint between the 6 images per GPS point, the images have some overlapping content. In total, we collected 50,881,098 Google Street View images for our 200 cities. They were primarily acquired between June and December of 2013 with a small fraction (3.1%) obtained in November and December of 2014. See SI Appendix for more detail on the data collection process.

Car Detection. In computer vision, detection is the task of localizing objects within an image and is most commonly framed as predicting the x, y, width, and height coordinates of an axis-aligned bounding box around an object of interest. The central challenge for our work is designing an object detector that is fast enough to run on 50 million images within a reasonable amount of time and accurate enough to be useful for demographic inference. Our computation resources consisted of 4 Tesla K40 graphics processing units and 200 2.1 GHz central processing unit cores. As we were willing to trade a couple of percentages in accuracy for efficiency (12), we turned to the previous state-of-the-art in object detection, DPMs (11), instead of recent algorithms such as ref. 23.

For DPMs, there are two main parameters that influence the running time and performance, which are the number of components and the number of parts in the model. SI Appendix, Table S3 provides an analysis of the performance/time tradeoff on our data, measured on the validation set. Based on this analysis, using a DPM with a single component and eight parts strikes the right balance between performance and efficiency, allowing us to detect cars on all 50 million images in 2 wk. In contrast, the best performing parameters would have taken 2 months to run and only increased average precision (AP) by 4.5.

As discussed in ref. 12, we also introduce a prior on the location and size of predicted bounding boxes and use it to improve detection accuracy. Incorporating this prior into our detection pipeline improves AP on the validation set by 1.92 at a negligible cost. SI Appendix, Fig. S6B visualizes this prior. The output of our detection system is a set of bounding boxes and scores where each score indicates the likelihood of its associated box containing a car.

We converted these scores into estimated probabilities via isotonic regression (24) (see SI Appendix for details). We report numbers using a detection threshold of −1.5 (applied before the location prior). At test time, after applying the location prior (which lowers detection decision values), we use a detection threshold of −2.3. This reduces the average number of bounding boxes per image to be classified from 7.9 to 1.5 while only degrading AP by 0.6 (66.1 to 65.5) and decreasing the probability mass of all detections in an image from 0.477 to 0.462 (a 3% drop). SI Appendix, Fig. S8 shows examples of car detections using our model. Bounding boxes with cars have high estimated probabilities, whereas the opposite is true for those containing no cars. The AP of our final model is 65.7, and its precision recall curve is visualized in SI Appendix, Fig. S7B. We calculate chance performance using a uniform sample of bounding boxes greater than 50 pixels in width and height.

Car Classification. Our pipeline, described in ref. 12, classifies automobiles into one of 2,657 visually distinct categories with an accuracy of 33.27%. We use a CNN (25) following the architecture of ref. 14 to categorize cars. CNNs, like other supervised machine-learning methods, perform best when trained on data from a similar distribution as the test data (in our case, Street View images). However, the cost of annotating Street View photos makes it infeasible to collect enough images to train our CNN only using this source. Thus, we used a combination of Street View and the more plentiful product shot images as training data. We modified the traditional CNN training procedure in a number of ways.

First, taking inspiration from domain adaptation, we approximated the WEIGHTED method of Daumé (26) by duplicating each Street View image 10 times during training. This roughly equalizes the number of training Street View and product shot images, preventing the classifier from overfitting on product shot images.

Product shot and Street View images differ significantly in image resolution: Cars in product shot images occupy many more pixels in the image. To compensate for this difference, we first measured the distribution of bounding box resolutions in Street View images used for training. Then, during training, we dynamically downsized each input image according to this distribution before rescaling it to fit the input dimensions of the CNN. Resolutions are parameterized by the geometric mean of the bounding box width and height, and the probability distribution is given as a histogram over 35 different such resolutions. The largest resolution is 256, which is the input resolution of the CNN (see SI Appendix for additional details).

At test time, we input each detected bounding box into the CNN and obtain softmax probabilities for each car category through a single forward pass. We only keep the top 20 predictions, since storing a full 2,657-dimensional floating point vector for each bounding box is prohibitively expensive in terms of storage. On average, these top 20 predictions account for 85.5% of the softmax layer activations’ probability mass. After extensive code optimization to make this classification step as fast as possible, we are primarily limited by the time spent reading images from disk, especially when using multiple GPUs to perform classification. At the most fine-grained level (2,657 classes), we achieve a surprisingly high accuracy of 33.27%. We classify the car make and model with 66.38% and 51.83% accuracy, respectively. Whether it was manufactured in or outside of the United States can be determined with 87.71% accuracy.

We show confusion matrices for classifying the make, model, body type, and manufacturing country of the car (SI Appendix, Fig. S9 A–D). Body type misclassifications tend to occur among similar categories. For example, the most frequent misclassification for “coupe” is “sedan,” and the most frequent misclassification for trucks with a regular cab is trucks with an

extended cab. On the other hand, there are no two makes (such as Honda and Mercedes-Benz) that are more visually similar than others. Thus, when a car's make is misclassified, it is mostly to a more popular make. Similarly, most errors at the manufacturing country level occur by misclassifying the manufacturing country as either “Japan” or “USA,” the two most popular countries. Due to the large number of classes, the only clear pattern in the model-level confusion matrix is a strong diagonal, indicative of our correct predictions.

Demographic Estimation. In all of our demographic estimations, we use the following set of 88 car-related attributes: the average number of detected cars per image; average car price; miles per gallon (city and highway); percent of total cars that are hybrids; percent of total cars that are electric; percent of total cars that are from each of seven countries; percent of total cars that are foreign (not from the USA); percent of total cars from each of 11 body types; percent of total cars whose year (selected as the minimum of possible year values for the car) falls within each of 5 year ranges (1990–1994, 1995–1999, 2000–2004, 2005–2009, and 2010–2014); and percent of total cars whose make is each of the 58 makes in our dataset.

Socioeconomic data were obtained from the ACS (2) and were collected between 2008 and 2012. See SI Appendix for more detail on ground truth data used in our analysis (e.g., census codes). Data for the 2008 US presidential election were provided to us by the authors of ref. 27 and consist of precinct-level vote counts for Barack Obama and John McCain. We ignore votes cast for any other person; that is, the count of total votes is determined solely by votes for Obama and McCain.

To perform our experiments, we partitioned the zip codes, precincts, and cities in our dataset into training and test sets as discussed in the main text, training a model on the training set and predicting on the test set. We used a ridge regression model for income and voter affiliation estimation. We used logistic regression for race and education, to use the structure inherent in the data. Specifically, for each region, summing the percentage of people with each of the 5 possible educational backgrounds (or each race) should yield 100%. In all cases, we trained 5 models using fivefold cross-validation to select the regularization parameter and averaged the trained models. We normalize the features to have zero mean and unit SD (parameters determined on the training set). We also clip predictions to stay within the range of the training data, preventing our estimates from having extreme values. The geographical regions of interest are restricted to be ones with a population of at least 500 and at least 50 detected cars.

We compute the probability of voting Democrat/Republican conditioned on being in a city with more pickup trucks than sedans as follows. Let $r$ be the ratio of pickup trucks to sedans. We would like to estimate $P(\mathrm{Republican} \mid r > 1)$ and $P(\mathrm{Democrat} \mid r < 1)$:

\[ P(\mathrm{Republican} \mid r > 1) = \frac{P(\mathrm{Republican},\, r > 1)}{P(r > 1)} \tag{1} \]

\[ P(\mathrm{Democrat} \mid r < 1) = \frac{P(\mathrm{Democrat},\, r < 1)}{P(r < 1)} \tag{2} \]

We estimate $P(\mathrm{Republican},\, r > 1)$, $P(\mathrm{Democrat},\, r < 1)$, $P(r > 1)$, and $P(r < 1)$ as follows. Let $S_d = \{c_i\}$ be the set of cities with more votes for Barack Obama than John McCain. Similarly, let $S_p = \{c_j\}$ be the set of cities with more votes for John McCain than Barack Obama, $S_{r>1}$ be the set of cities with more pickup trucks than sedans, and $S_{r<1}$ be the set of cities with more sedans than pickup trucks. Let $n_{r>1}$ be the number of elements in $S_{r>1}$, $n_{r<1}$ the number of elements in $S_{r<1}$, $n_{p,r>1}$ the number of elements in $S_p \cap S_{r>1}$, and $n_{d,r<1}$ the number of elements in $S_d \cap S_{r<1}$. Finally, let $C$ be the number of cities in our test set:

\[ P(r > 1) \approx \frac{n_{r>1}}{C} \tag{3} \]

\[ P(r < 1) \approx \frac{n_{r<1}}{C} \tag{4} \]

\[ P(\mathrm{Republican},\, r > 1) \approx \frac{n_{p,r>1}}{C} \tag{5} \]

\[ P(\mathrm{Democrat},\, r < 1) \approx \frac{n_{d,r<1}}{C} \tag{6} \]

Using these estimates, we calculate $P(\mathrm{Republican} \mid r > 1)$ and $P(\mathrm{Democrat} \mid r < 1)$ according to Eqs. 1 and 2.

ACKNOWLEDGMENTS. We thank Neal Jean, Stefano Ermon, and Marshall Burke for helpful suggestions and edits; everyone who worked on annotating our car dataset for their dedication; and our friends and family and the entire Stanford Vision lab, especially Brendan Marten, Serena Yeung, and Selome Tewoderos, for their support, input, and encouragement. This research is partially supported by NSF Grant IIS-1115493, the Stanford DARE fellowship (to T.G.), and NVIDIA (through donated GPUs).

1. Department of Commerce, US Census Bureau (2013) US census bureau's budget estimates. FY2013. Available at www.osec.doc.gov/bmi/budget/fy13cbj/Census CongressionalJustification-FINAL.pdf. Accessed September 13, 2014.
2. Department of Commerce, US Census Bureau (2012) American community survey 5 year data (2008-2012). Available at https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk. Accessed September 13, 2014.
3. Department of Commerce, US Census Bureau (2010) Decennial census. Available at https://www.census.gov/data/developers/data-sets/decennial-census.html. Accessed September 13, 2014.
4. Antenucci D, Cafarella M, Levenstein M, Ré C, Shapiro MD (2014) Using Social Media to Measure Labor Market Flows (National Bureau of Economic Research, Cambridge, MA), Technical Report 20010.
5. Michel JB, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331:176–182.
6. Blumenstock J, Cadamuro G, On R (2015) Predicting poverty and wealth from mobile phone metadata. Science 350:1073–1076.
7. Naik N, Philipoom J, Raskar R, Hidalgo C (2014) Streetscore–Predicting the perceived safety of one million streetscapes. 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (IEEE, New York), pp 793–799.
8. Naik N, Kominers SD, Raskar R, Glaeser EL, Hidalgo CA (2017) Computer vision uncovers predictors of physical urban change. Proc Natl Acad Sci USA 114:7571–7576.
9. American Association of State Highway and Transportation Officials (2013) Commuting in America 2013: Vehicle and Transit Availability (American Association of State Highway and Transportation Officials, Washington, DC), Report 7.
10. Choo S, Mokhtarian PL (2004) What type of vehicle do people drive? The role of attitude and lifestyle in influencing vehicle type choice. Transport Res Pol Pract 38:201–222.
11. Felzenszwalb P, Girshick R, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part based models. IEEE Trans Pattern Anal Mach Intell 32:1627–1645.
12. Gebru T, et al. (2017) Fine-grained car detection for visual census estimation. AAAI, in press.
13. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324.
14. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. Available at papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. Accessed November 9, 2017.
15. Bland M (2012) Asian consumers and the automotive market. Available at app.compendium.com/uploads/user/a33eed35-8a44-4da7-84c4-16f3751fe303/9855ee60-f764-43b4-84c4-40950ff36307/File/3e1e2e5d8d20fad49eaac919e38abc8e/polk 3af presentation.pdf. Accessed November 6, 2017.
16. Auto Remarketing Staff (2011) Which brands most attract African-American buyers? Available at www.autoremarketing.com/content/trends/which-brands-most-attract-african-american-buyers. Accessed October 24, 2016.
17. Simo-Serra E, Fidler S, Moreno-Noguer F, Urtasun R (2015) Neuroaesthetics in fashion: Modeling the perception of beauty. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, New York), pp 869–877.
18. Matzen K, Bala K, Snavely N (2017) Streetstyle: Exploring world-wide clothing styles from millions of photos. arXiv:1706.01869.
19. Ordonez V, Berg TL (2014) Learning high-level judgments of urban perception. European Conference on Computer Vision (Springer, Boston), pp 494–510.
20. Khosla A, An B, Lim JJ, Torralba A (2014) Looking beyond the visible scene. 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, New York), pp 3710–3717.
21. Zhou B, Liu L, Oliva A, Torralba A (2014) Recognizing city identity via attribute analysis of geo-tagged images. European Conference on Computer Vision (Springer, Boston), pp 519–534.
22. Jean N, et al. (2016) Combining satellite imagery and machine learning to predict poverty. Science 353:790–794.
23. Ren S, He K, Girshick R, Sun J (2017) Faster r-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39:1137–1149.
24. Barlow RE, Bartholomew DJ, Bremner J, Brunk HD (1972) Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression (Wiley, New York).
25. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324.
26. Daumé H III (2007) Frustratingly easy domain adaptation. Conference of the Association for Computational Linguistics (ACL, Prague, Czech Republic).
27. Ansolabehere S, Palmer M, Lee A (2014) Precinct-Level Election Data. Available at hdl.handle.net/1903.1/21919. Accessed January 13, 2015.

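The top-20 truncation of the softmax output used at test time can be sketched as follows. This is a standard-library illustration under assumed inputs (a plain list of logits), not the code we ran; the function names and the toy logits are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_predictions(
    probs: Sequence[float], k: int = 20
) -> Tuple[List[Tuple[int, float]], float]:
    """Keep the k most probable classes and report the probability mass retained."""
    top = heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
    retained_mass = sum(p for _, p in top)
    return top, retained_mass


# Toy example with 12 classes instead of 2,657, keeping the top 2.
toy_logits = [0.0] * 10 + [5.0, 4.0]
toy_probs = softmax(toy_logits)
top, retained = top_k_predictions(toy_probs, k=2)
print(top, retained)
```

Storing `(class index, probability)` pairs for 20 classes instead of a dense 2,657-dimensional vector is what reduces the per-box storage cost; the retained mass shows how much of the distribution survives the truncation.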
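The resolution-matching step used when training on product shot images can be sketched in the same spirit. The histogram bins and weights below are invented placeholders (the 35-bin histogram itself is described in SI Appendix), the helper names are hypothetical, and the actual image resampling is omitted; only the target shape is computed.

```python
import math
import random
from typing import Sequence, Tuple


def box_resolution(width: float, height: float) -> float:
    """Resolution of a bounding box: geometric mean of its width and height."""
    return math.sqrt(width * height)


def sample_target_resolution(
    bin_resolutions: Sequence[float],
    bin_probs: Sequence[float],
    rng: random.Random,
) -> float:
    """Draw one resolution from a histogram over Street View box resolutions."""
    return rng.choices(bin_resolutions, weights=bin_probs, k=1)[0]


def downsized_shape(width: int, height: int, target_resolution: float) -> Tuple[int, int]:
    """Shape after shrinking an image so its resolution matches the target.

    Images already at or below the target resolution are left unchanged.
    """
    scale = min(1.0, target_resolution / box_resolution(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# Placeholder histogram with 4 bins; the largest bin equals the CNN input size.
toy_bins = [32.0, 64.0, 128.0, 256.0]
toy_probs = [0.4, 0.3, 0.2, 0.1]
rng = random.Random(0)
target = sample_target_resolution(toy_bins, toy_probs, rng)
print(target, downsized_shape(400, 100, target))
```

After this downsizing, the image would be rescaled to the CNN input dimensions, so a product shot ends up with roughly the level of detail of a Street View crop.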