
Places: A 10 Million Image Database for Scene Recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba

Abstract—The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database, along with the Places-CNNs, offers a novel resource to guide future progress on scene recognition problems.

Index Terms—Scene classification, visual recognition, deep learning, deep feature, image dataset.

1 INTRODUCTION

If a current state-of-the-art visual recognition system were to send you a text to describe what it sees, the text might read something like: "There is a sofa facing a TV set. A person is sitting on the sofa holding a remote control. The TV is on and a talk show is playing". Reading this, you would likely imagine a living-room. However, that scenery can very well happen in a resort by the beach.

For an agent acting in the world, there is no doubt that object and event recognition should be a primary goal of its visual system. But knowing the place or context in which the objects appear is equally important for an intelligent system to understand what might have happened in the past and what may happen in the future. For instance, a table inside a kitchen can be used to eat or prepare a meal, while a table inside a classroom is intended to support a notebook or a laptop to take notes.

A key aspect of scene recognition is to identify the place in which the objects sit (e.g., beach, forest, corridor, office, street, ...). Although one can avoid using the place category by providing a more exhaustive list of the objects in the picture and a description of their spatial relationships, a place category provides the appropriate level of abstraction to avoid such a long and complex description. Note that one could avoid using object categories in a description by only listing parts (i.e. two eyes on top of a mouth for a face). Like objects, places have functions and attributes. They are composed of parts, and some of those parts can be named and correspond to objects, just like objects are composed of parts, some of which are nameable as well (e.g., legs, eyes).

Whereas most datasets have focused on object categories (providing labels, bounding boxes or segmentations), here we describe the Places database, a quasi-exhaustive repository of 10 million scene photographs, labeled with 434 scene semantic categories, comprising about 98 percent of the types of places a human can encounter in the world. Image samples are shown in Fig. 1, while Fig. 2 shows the number of images per category, sorted in decreasing order.

Departing from Zhou et al. [1], we describe in depth the construction of the Places Database and evaluate the performance of several state-of-the-art Convolutional Neural Networks (CNNs) for place recognition. We compare how the features learned in a CNN for scene classification behave when used as generic features in other visual recognition tasks. Finally, we visualize the internal representations of the CNNs and discuss one major consequence of training a deep learning model to perform scene recognition: object detectors emerge as an intermediate representation of the network [2]. Therefore, while the Places database does not contain any object labels or segmentations, it can be used to train new object classifiers.

1.1 The Rise of Multi-million Datasets

What does it take to reach human-level performance with a machine-learning algorithm? In the case of supervised learning, the problem is two-fold. First, the algorithm must be suitable for the task, such as Convolutional Neural Networks for large-scale visual recognition [1], [3] and Recursive Neural Networks for natural language processing [4], [5]. Second, it must have access to a training dataset of appropriate coverage (quasi-exhaustive representation of classes and variety of exemplars) and density (enough samples to cover the diversity of each class). The optimal space for these datasets is often task-dependent, but the rise of multi-million-item sets has enabled unprecedented performance in many domains of artificial intelligence.

• B. Zhou, A. Khosla, A. Oliva, and A. Torralba are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
• A. Lapedriza is with Universitat Oberta de Catalunya, Spain.

Fig. 1. Image samples from various categories of the Places Database (two samples per category). The dataset contains three macro-classes: Indoor, Nature, and Urban.

Fig. 2. Sorted distribution of the number of images per category in the Places Database. Places contains 10,624,928 images from 434 categories. Category names are shown every 6 intervals.

The successes of Deep Blue in chess, Watson in "Jeopardy!", and AlphaGo in Go against their expert human opponents may thus be seen as not just advances in algorithms, but also as a result of the increasing availability of very large datasets: 700,000, 8.6 million, and 30 million items, respectively [6]–[8]. Convolutional Neural Networks [3], [9] have likewise achieved near human-level visual recognition, trained on 1.2 million object [10]–[12] and 2.5 million scene images [1]. Expansive coverage of the space of classes and samples allows getting closer to the right ecosystem of data that a natural system, like a human, would experience. The history of image datasets for scene recognition has likewise seen rapid growth in the number of image samples, as described next.

1.2 Scene-centric Datasets

The first benchmark for scene recognition was the Scene15 database [13], extended from the initial 8-scene dataset in [14]. This dataset contains only 15 scene categories with a few hundred images per class, and current classifiers are saturated, reaching near human performance at 95%. The MIT Indoor67 database [15], with 67 indoor categories, and the SUN (Scene Understanding, with 397 categories and 130,519 images) database [16] provided a larger coverage of place categories, but fell short in terms of the quantity of data needed to feed deep learning algorithms. To complement large object-centric datasets such as ImageNet [11], we build the Places dataset described here.

Meanwhile, the Pascal VOC dataset [17] is one of the earliest image datasets with diverse object annotations in scene context. The Pascal VOC challenge has greatly advanced the development of models for object detection and segmentation tasks. Nowadays, the COCO dataset [18] focuses on collecting object instances, with both polygon and bounding box annotations, for images depicting everyday scenes of common objects. The recent Visual Genome dataset [19] aims at collecting dense annotations of objects, attributes, and their relationships. ADE20K [20] collects precise dense annotations of scenes, objects, and parts of objects with a large and open vocabulary. Altogether, annotated datasets further enable artificial systems to learn visual knowledge linking parts, objects and scene context.

2 PLACES DATABASE

2.1 Coverage of the categorical space

The first asset of a high-quality dataset is an expansive coverage of the categorical space to be learned. The strategy of Places is to provide an exhaustive list of the categories of environments encountered in the world, bounded by spaces where a human body would fit (e.g. closet, shower). The SUN (Scene UNderstanding) dataset [16] provided that initial list of semantic categories. The SUN dataset was built around a quasi-exhaustive list of scene categories with different functionalities, namely categories with unique identities in discourse. Through the use of WordNet [21], the SUN database team selected 70,000 words and concrete terms that described scenes, places and environments that can be used to complete the phrase "I am in a place", or "let's go to the/a place". Most of the words referred to basic and entry-level names ([22]), resulting in a corpus of 900 different scene categories after bundling together synonyms, and separating classes described by the same word but referring to different environments (e.g. inside and outside views of churches). Details about the building of that initial corpus can be found in [16]. The Places Database has inherited the same list of scene categories from the SUN dataset, with a few changes that are described in Section 2.2.4.

2.2 Construction of the database

The construction of the Places Database is composed of four steps, from querying and downloading images, labeling images with ground truth category, to scaling up the dataset using a classifier, and further improving the separation of similar classes. The details of each step are introduced in the following sections.

The data collection process of the Places Database is similar to the image collection in other common datasets, like ImageNet and COCO. The definition of categories for the ImageNet dataset [11] is based on the synsets of WordNet [21]. Candidate images are queried from several image search engines using the set of WordNet synonyms. Images are cleaned up through AMT in the format of a binary task similar to ours. Quality control is done by multiple users annotating the same image. There are about 500-1200 ground-truth images per synset. On the other hand, the COCO dataset [18] focuses on annotating the object instances inside the images with more scene context. The candidate images are mainly collected from Flickr, in order to include fewer of the iconic images commonly returned by image search engines. The image annotation process of COCO is split into category labeling, instance spotting, and instance segmentation, with all the tasks done by AMT workers. COCO has 80 object categories with more than 2 million object instances.

2.2.1 Step 1: Downloading images using scene category and attributes

From online image search engines (Google Images, Bing Images, and Flickr), candidate images were downloaded using a query word from the list of scene classes provided by the SUN database [16]. In order to increase the diversity of visual appearances in the Places dataset, each scene class query was combined with 696 common English adjectives¹ (e.g., messy, spare, sunny, desolate, etc.). In Fig. 3 we show some examples of images in Places grouped by queries. About 60 million images (color images of at least 200×200 pixels in size) with unique URLs were identified. Importantly, the Places and SUN datasets are complementary: PCA-based duplicate removal was conducted within each scene category in both datasets, so that they do not contain the same images.

2.2.2 Step 2: Labeling images with ground truth category

Image ground truth label verification was done by crowdsourcing the task to Amazon Mechanical Turk (AMT). Fig. 4 illustrates the experimental paradigm used. First, AMT workers were given instructions relating to a particular category at a time (e.g. cliff), with a definition, sample images belonging to the category (true images), and sample images not belonging to the category (false images). As an example, Fig. 4a shows the instructions for the category cliff. Workers then performed a verification task for the corresponding category. Fig. 4b shows the AMT interface for the verification task. The experimental interface displayed a central image, flanked by smaller versions of the images the worker had just responded to (on the left), and the images the worker would respond to next (on the right). Information gleaned from the construction of the SUN dataset suggests that in the first iteration of labeling more than 50% of the downloaded images are not true exemplars of the category. For this reason the default answer in the interface was set to NO (notice that all the smaller versions of the images on the left are marked with a bold red contour, which denotes that the image does not belong to the category). Thus, if the worker just presses the space bar to move on, images keep the default NO label. Whenever a true exemplar appears in the center, the worker can press a specific key to mark it as a positive exemplar (responding YES). As the response is set to YES, the bold contour of the image turns green. The interface also allows moving backwards to revise previous annotations. Each AMT HIT (Human Intelligence Task, one assignment for one worker) consisted of 750 images for manual annotation. A control set of 30 positive samples and 30 negative samples with ground-truth category labels from the SUN database was intermixed in the HIT as well. As a quality control measure, only worker HITs with an accuracy of 90% or higher on these control images were kept.

1. The list of adjectives used in querying can be found at https://github.com/CSAILVision/places365/blob/master/adjectives_download.csv
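The PCA-based duplicate removal mentioned in Step 1 is only named, not specified, above; the following is a minimal sketch of one way such a within-category check could be implemented, assuming each image has already been reduced to a fixed-length descriptor (the `features` array, the number of PCA components, and the distance threshold are illustrative assumptions, not the actual pipeline).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

def deduplicate_category(features, n_components=64, threshold=1e-3):
    """Return indices of images to keep within one scene category.

    features: (n_images, d) array of per-image descriptors.
    Images whose PCA-projected descriptors are closer than `threshold`
    to an already-kept image are treated as near-duplicates and dropped.
    """
    projected = PCA(n_components=n_components).fit_transform(features)
    dist = pairwise_distances(projected)          # (n, n) Euclidean distances
    keep, dropped = [], set()
    for i in range(len(projected)):
        if i in dropped:
            continue
        keep.append(i)
        close = np.where(dist[i] < threshold)[0]  # near-duplicates of image i
        dropped.update(int(j) for j in close if j > i)
    return keep
```

An exact all-pairs comparison like this is quadratic in the number of images; at the scale of the roughly 60 million downloaded images it would have to be applied per category and/or approximated.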

Fig. 3. Image samples from four scene categories grouped by queries to illustrate the diversity of the dataset. For each query we show 9 annotated images.

Fig. 4. Annotation interface in Amazon Mechanical Turk for selecting the correct exemplars of the scene from the downloaded images. a) Instructions given to the workers, in which we define positive and negative examples. b) Binary selection interface.

The positive images resulting from the first cleaning iteration were sent for a second iteration of cleaning. We used the same task interface but with the default answer set to YES. In this second iteration, 25.4% of the images were relabeled as NO. We tested a third cleaning iteration on a few exemplars but did not pursue it further, as the percentage of images relabeled as NO was not significant.

After the two iterations of annotation, we collected one scene label for 7,076,580 images pertaining to 476 scene categories. As expected, the number of images per scene category varies greatly (i.e. there are many more images of bedrooms than of caves on the web). There were 413 scene categories that ended up with at least 1,000 exemplars, and 98 scene categories with more than 20,000 exemplars.

2.2.3 Step 3: Scaling up the dataset using a classifier

As a result of the previous round of image annotation, there were 53 million remaining downloaded images not assigned to any of the 476 scene categories (e.g. a bedroom picture could have been downloaded when querying images for the living-room category, but marked as negative by the AMT worker). Therefore, a third annotation task was designed to re-classify and then re-annotate those images, using a semi-automatic bootstrapping approach.

A deep learning-based scene classifier, AlexNet [3], was trained to classify the remaining 53 million images: we first randomly selected 1,000 images per scene category as a training set and 50 images as a validation set (for the 413 categories which had more than 1,000 samples). AlexNet achieved 32% scene classification accuracy on the validation set after training. The trained AlexNet was then used to classify the 53 million images. We used the class score predicted by the AlexNet to rank the images within one scene category as follows: for a given category with too few exemplars, the top-ranked images with a predicted class confidence higher than 0.8 were sent to AMT for a third round of manual annotation using the same interface shown in Fig. 4. The default answer was set to NO.

After completing this third round of AMT annotation, the distribution of the number of images per category flattened out: 401 scene categories had more than 5,000 images per category and 240 scene categories had more than 20,000 images. In total, about 3 million images were added into the dataset.

2.2.4 Step 4: Improving the separation of similar classes

Despite the initial effort to bundle synonyms from WordNet, the scene list from the SUN database still contained a few categories with very close synonyms (e.g. 'ski lodge' and 'ski resort', or 'garbage dump' and 'landfill'). We manually identified 46 synonym pairs like these and merged their images into a single category.

Additionally, we observed that some scene categories could be easily confused, with blurry categorical boundaries, as illustrated in Fig. 5. This means that, for images at these blurry boundaries, answering the question "Does image I belong to class A?" might be difficult. However, it can be easier to answer the question "Does image I belong to class A or B?". With this question, the decision boundary becomes clearer for a human observer and it also gets closer to the final task that a computer system will be trained to solve, which is actually separating classes even when the boundaries are blurry.
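The ranking step of the bootstrapping in Section 2.2.3 can be summarized as in the sketch below; `class_scores` stands in for the softmax outputs of the trained AlexNet, and the function and variable names are hypothetical (the actual system used Caffe).

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8   # threshold quoted in Section 2.2.3

def candidates_for_category(class_scores, image_ids, category_index,
                            max_candidates=20000):
    """Rank the still-unlabeled images for one scarce scene category and
    return the confident ones to be sent to AMT for a third round of
    verification (with the default answer in the interface set to NO).

    class_scores: (n_images, n_classes) softmax scores from the scene classifier.
    image_ids:    sequence of n_images identifiers (URLs or file names).
    """
    scores = class_scores[:, category_index]
    ranked = np.argsort(-scores)                  # most confident first
    confident = [int(i) for i in ranked if scores[i] > CONFIDENCE_THRESHOLD]
    return [image_ids[i] for i in confident[:max_candidates]]
```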

Fig. 5. Boundaries between place categories can be blurry, as some images can be made of a mixture of different components. The images shown in this figure show a soft transition between a field and a forest. Although the extreme images can be easily classified as field and forest scenes, the middle images can be ambiguous.

Fig. 6. Annotation interface in Amazon Mechanical Turk for differentiating images from two similar categories. a) Instructions in which we give several typical examples of each category. b) The binary selection interface, in which the worker has to classify the shown image into one of the classes or none of them.

After checking the annotations, we confirmed that in the previous three steps of AMT annotation, workers were confused by some pairs (or groups) of scene categories. For instance, there was an overlap between 'canyon' and 'mountain' or 'butte' and 'mountain'. There were also images mixed in the following category pairs: 'jacuzzi' and 'swimming pool indoor'; 'pond' and 'lake'; 'volcano' and 'mountain'; 'runway' and 'highway and road'; 'operating room' and 'hospital room'; among others. In the whole set of categories, we identified 53 different ambiguous pairs.

To further differentiate the images from categories with shared content, we designed a new interface for a fourth annotation step. The instructions for the task are shown in Fig. 6a, while Fig. 6b shows the annotation interface. The interface combines exemplar images from the two categories with shared content (such as 'art school' and 'art studio'), and AMT workers were asked to classify images into one of the categories or neither of them.

After this fourth annotation step, the Places database was finalized with over 10 million labeled exemplars (10,624,928 images) from 434 place categories.

3 PLACES BENCHMARKS

Here we describe four subsets of the Places database as benchmarks. Places205 and Places88 are from [1]. Two new benchmarks have been added: from the 434 categories, we selected 365 categories with more than 4,000 images each to create Places365-Standard and Places365-Challenge. The details of each benchmark are as follows:
• Places365-Standard has 1,803,460 training images, with the number of images per class varying from 3,068 to 5,000. The validation set has 50 images per class and the test set has 900 images per class. Note that the experiments in this paper are reported on Places365-Standard.
• Places365-Challenge contains the same categories as Places365-Standard, but the training set is significantly larger, with a total of 8 million training images. The validation set and testing set are the same as in Places365-Standard. This subset was released for the Places Challenge 2016², held in conjunction with the European Conference on Computer Vision (ECCV) 2016, as part of the ILSVRC Challenge.
• Places205. Places205, described in [1], has 2.5 million images from 205 scene categories. The number of images per class varies from 5,000 to 15,000. The training set has 2,448,873 images in total, with 100 images per category for the validation set and 200 images per category for the test set.
• Places88. Places88 contains the 88 scene categories common to the ImageNet [12], SUN [16] and Places205 databases. Places88 contains only the images obtained in round 2 of annotation, from the first version of Places used in [1]. We call the other two corresponding subsets ImageNet88 and SUN88 respectively. These subsets are used to compare performance across different scene-centric databases, as the three datasets contain different exemplars per category (i.e. none of these three datasets contain common images). Note that finding correspondences between the classes defined in ImageNet and Places brings some challenges. ImageNet follows the WordNet definitions, but some WordNet definitions are not always appropriate for describing places. For instance, the class 'elevator' in ImageNet refers to an object. In Places, 'elevator' takes different meanings depending on the location of the observer: elevator door, elevator interior, or elevator lobby. Many categories in ImageNet do not differentiate between indoor and outdoor (e.g., ice-skating rink), while in Places, indoor and outdoor versions are separated as they do not necessarily afford the same function.

4 COMPARING SCENE-CENTRIC DATASETS

Scene-centric datasets correspond to images labeled with a scene or place name, as opposed to object-centric datasets, where images are labeled with object names. In this section we use the Places88 benchmark to compare the Places dataset with the two other largest scene datasets: ImageNet88 and SUN88. Fig. 7 illustrates the differences among the number of images found in the different categories for Places88, ImageNet88 and SUN88. Notice that the Places Database is the largest scene-centric image dataset so far. The next subsection presents a comparison of these three datasets in terms of image diversity.

2. http://places2.csail.mit.edu/challenge.html

4.1 Dataset Diversity

Given the types of images found on the internet, some categories will be more biased than others in terms of viewpoints, types of objects, or even image style [23]. However, bias can be compensated for by a high diversity of images, with many appearances represented in the dataset. In this section we describe a measure of dataset diversity to compare how diverse the images from three scene-centric datasets (Places88, SUN88 and ImageNet88) are.

Comparing datasets is an open problem. Even datasets covering the same visual classes have notable differences, providing different generalization performance when used to train a classifier [23]. Beyond the number of images and categories, there are aspects that are important but difficult to quantify, like the variability in camera poses, in decoration styles or in the types of objects that appear in the scene.

Although the quality of a database is often task dependent, it is reasonable to assume that a good database should be dense (with a high degree of data concentration) and diverse (it should include a high variability of appearances and viewpoints). Imagine, for instance, a dataset composed of 100,000 images all taken within the same bedroom. This dataset would have a very high density but a very low diversity, as all the images would look very similar. An ideal dataset, expected to generalize well, should have high diversity as well. While one can achieve high density by collecting a large number of images, diversity is not an obvious quantity to estimate in image sets, as it assumes some notion of similarity between images. One way to estimate similarity is to ask the question: are these two images similar? However, similarity in the wild is a subjective and loose concept, as two images can be viewed as similar if they contain similar objects, and/or have similar spatial configurations, and/or have similar decoration styles, and so on. A way to circumvent this problem is to define relative measures of similarity for comparing datasets.

Several measures of diversity have been proposed, particularly in biology for characterizing the richness of an ecosystem (see [24] for a review). Here, we propose to use a measure inspired by the Simpson index of diversity [25]. The Simpson index measures the probability that two random individuals from an ecosystem belong to the same species. It is a measure of how well distributed the individuals are across the different species of an ecosystem, and it is related to the entropy of the distribution. Extending this measure to evaluate the diversity of images within a category is non-trivial if there are no annotations of sub-categories. For this reason, we propose to measure the relative diversity of image datasets A and B based on the following idea: if set A is more diverse than set B, then two random images from set B are more likely to be visually similar than two random samples from A. Then, the diversity of A with respect to B can be defined as

Div_B(A) = 1 − p(d(a_1, a_2) < d(b_1, b_2))    (1)

where a_1, a_2 ∈ A and b_1, b_2 ∈ B are randomly selected. With this definition of relative diversity we have that A is more diverse than B if, and only if, Div_B(A) > Div_A(B). For an arbitrary number of datasets, A_1, ..., A_N, the diversity of A_1 with respect to A_2, ..., A_N can be defined as

Div_{A_2,...,A_N}(A_1) = 1 − p(d(a_11, a_12) < min_{i=2:N} d(a_i1, a_i2))    (2)

where a_i1, a_i2 ∈ A_i are randomly selected, i = 2 : N.

We measured the relative diversities between SUN88, ImageNet88 and Places88 using AMT. Workers were presented with different pairs of images and they had to select the pair that contained the most similar images. The pairs were randomly sampled from each database. Each trial was composed of 4 pairs from each database, giving a total of 12 pairs to choose from. We used 4 pairs per database to increase the chances of finding a similar pair and to avoid users having to skip trials. AMT workers had to select the most similar pair on each trial. We ran 40 trials per category and two observers per trial, for the 88 categories in common between the ImageNet88, SUN88 and Places88 databases. Fig. 8a-b shows some examples of pairs from the diversity experiments for the scene categories playground (a) and bedroom (b). In the figure only one pair from each database is shown. We observed that different annotators were consistent in deciding whether a pair of images was more similar than another pair of images.

Fig. 8c shows the histograms of relative diversity for all the 88 scene categories common to the three databases. If the three datasets were identical in terms of diversity, the average diversity would be 2/3 for all three. Note that this measure of diversity is a relative measure between the three datasets. In the experiment, users selected pairs from the SUN database to be the closest to each other 50% of the time, while the pairs from the Places database were judged to be the most similar on only 17% of the trials. ImageNet88 pairs were selected 33% of the time. The results show that there is a large variation in terms of diversity among the three datasets, showing Places to be the most diverse of the three. The average relative diversity is 0.83 for Places, 0.67 for ImageNet88 and 0.50 for SUN. The categories with the largest variation in diversity across the three datasets were playground, veranda and waiting room.

4.2 Cross Dataset Generalization

As discussed in [23], training and testing across different datasets generally results in a drop of performance due to the dataset bias problem. In this case, the bias between datasets is due, among other factors, to the differences in the diversity between the three datasets.
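In the paper, the probability in Eqs. (1) and (2) is estimated from pairwise human similarity judgments collected on AMT. Purely as an illustration of the estimator itself, the same quantity can be approximated by Monte Carlo sampling when a feature-space distance stands in for the human judgments; the feature matrices below are hypothetical inputs.

```python
import numpy as np

def relative_diversity(feats_a, feats_b, n_pairs=10000, seed=0):
    """Monte Carlo estimate of Div_B(A) = 1 - p(d(a1, a2) < d(b1, b2)),
    with d taken as Euclidean distance in some feature space.

    feats_a, feats_b: (n_a, d) and (n_b, d) arrays of image features
    sampled from datasets A and B.
    """
    rng = np.random.default_rng(seed)
    ia = rng.integers(0, len(feats_a), size=(n_pairs, 2))   # random pairs from A
    ib = rng.integers(0, len(feats_b), size=(n_pairs, 2))   # random pairs from B
    d_a = np.linalg.norm(feats_a[ia[:, 0]] - feats_a[ia[:, 1]], axis=1)
    d_b = np.linalg.norm(feats_b[ib[:, 0]] - feats_b[ib[:, 1]], axis=1)
    return 1.0 - np.mean(d_a < d_b)

# A is judged more diverse than B when
# relative_diversity(A, B) > relative_diversity(B, A), per Eq. (1).
```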

Fig. 7. Comparison of the number of images per scene category for the common 88 scene categories in the Places, ImageNet, and SUN datasets.

Fig. 9. Cross dataset generalization of training on the 88 common scenes between Places, SUN and ImageNet, then testing on the 88 common scenes from: a) SUN, b) ImageNet and c) Places database.

Fig. 9 shows the classification results obtained from training and testing on different permutations of the 3 datasets. For these results we use the features extracted from a pre-trained ImageNet-CNN and a linear SVM. In all three cases, training and testing on the same dataset provides the best performance for a fixed number of training examples. As the Places database is very large, it achieves the best performance on two of the test sets when all the training data is used.

Fig. 8. Examples of pairs for the diversity experiment for a) playground and b) bedroom. Which pair shows the most similar images? The bottom pairs were chosen in these examples. c) Histogram of relative diversity per category (88 categories) and dataset. Places88 (blue line) contains the most diverse set of images, followed by ImageNet88 (red line); the lowest diversity is in the SUN88 database (yellow line), as most of its images are prototypical of their class.

5 CONVOLUTIONAL NEURAL NETWORKS FOR SCENE CLASSIFICATION

Given the impressive performance of deep Convolutional Neural Networks (CNNs), particularly on the ImageNet benchmark [3], [12], we choose three popular CNN architectures, AlexNet [3], GoogLeNet [26], and the VGG 16-convolutional-layer CNN [27], and train them on Places205 and Places365-Standard respectively to create baseline CNN models. The trained CNNs are named PlacesSubset-CNN, i.e., Places205-AlexNet or Places365-VGG. All the Places-CNNs presented here were trained using the Caffe package [28] on Nvidia Tesla K40 and Titan X GPUs³.
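The released baselines were trained with Caffe as described above; purely as a hedged sketch of the same kind of setup in PyTorch (assuming Places365-Standard arranged in an ImageFolder directory layout; the path, schedule and hyperparameters are placeholders, not the settings of the released models):

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 365   # Places365-Standard

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("places365_standard/train", train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                     shuffle=True, num_workers=8)

# Fine-tune an ImageNet-pretrained ResNet for 365 scene classes,
# analogous to the fine-tuned ResNet152 baseline described below.
model = models.resnet152(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(10):                      # placeholder schedule
    for images, labels in loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```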


Additionally, given the recent breakthrough performance of the Residual Network (ResNet) on ImageNet classification [29], we further fine-tuned a ResNet152 on Places365-Standard (termed Places365-ResNet) and compared it with the other, trained-from-scratch, Places-CNNs for scene classification.

5.1 Results on Places205 and Places365

After training the various Places-CNNs, we used the final output layer of each network to classify the test set images of Places205 and SUN205 (see [1]). The classification results for Top-1 accuracy and Top-5 accuracy are listed in Table 1. The Top-1 accuracy is the percentage of testing images where the top predicted label exactly matches the ground-truth label. The Top-5 accuracy is the percentage of testing images where the ground-truth label is among the 5 top-ranked predicted labels given by the algorithm. Since there is some ambiguity between some scene categories, the Top-5 accuracy is a more suitable criterion for measuring scene classification performance.

As a baseline comparison, we show the results of a linear SVM trained on ImageNet-CNN features of 5,000 images per category in Places205 and 50 images per category in SUN205 respectively. We observe that Places-CNNs perform much better than the ImageNet feature+SVM baseline while, as expected, Places205-GoogLeNet and Places205-VGG outperformed Places205-AlexNet by a large margin, due to their deeper structures. To date (Oct 2, 2016) the top-ranked result on the test set of the Places205 leaderboard⁴ is 64.10% Top-1 accuracy and 90.65% Top-5 accuracy. Note that for the test set of SUN205, we did not fine-tune the Places-CNNs on the training set of SUN205; we directly evaluated them on the test set of SUN.

We further evaluated the baseline Places365-CNNs on the validation set and test set of Places365. The results are shown in Table 2. We can see that Places365-VGG and Places365-ResNet have similar top performance compared with the other two CNNs⁵. Even though Places365 has 160 more categories than Places205, the Top-5 accuracy of the Places205-CNNs (trained on the previous version of Places [1]) on the test set only drops by 2.5%.

To evaluate the improvements brought by the extra categories, we compute the accuracy for the 182 categories common to Places205 and Places365 (we merged some Places205 categories when building Places365, so there are fewer common categories) for Places205-CNN and Places365-CNN. On the validation set of Places365, we select the images of the 182 common categories, then use the aligned 182 outputs of the Places205-AlexNet and Places365-AlexNet to predict the labels respectively. The Top-1 accuracy for Places205-AlexNet is 0.572, while the one for Places365-AlexNet is 0.577. Thus Places365-AlexNet not only predicts more categories, but also has better accuracy on the previously existing categories.

Fig. 10 shows the responses to examples correctly predicted by the Places365-VGG. Notice that most of the Top-5 responses are very relevant to the scene description. Some failure or ambiguous cases are shown in Fig. 11. Broadly, we can identify two kinds of mis-classification given the current label attribution of Places: 1) less typical activities happening in a scene, such as taking a group photo on a construction site or camping in a junkyard; and 2) images composed of multiple scene parts, which makes a single ground-truth scene label insufficient to describe the whole environment. These results suggest the need for multiple ground-truth labels to describe environments.

It is important to emphasize that for many scene categories the Top-1 accuracy might be an ill-defined measure: environments are inherently multi-label in terms of their semantic description. Different observers will use different terms to refer to the same place, or to different parts of the same environment, and all of the labels might fit the description of the scene well, as we observe in the examples of Fig. 11.

5.2 Web-demo for Scene Recognition

Based on our trained Places-CNN, we created a web-demo for scene recognition⁶, accessible through a computer browser or a mobile phone. People can upload photos to the web-demo to predict the type of environment, with the 5 most likely semantic categories and relevant scene attributes. Two screenshots of the prediction results on a mobile phone are shown in Fig. 12. Note that people can submit feedback about the result. The top-5 recognition accuracy of our recognition web-demo in the wild is about 72% (from the 9,925 anonymous feedback submissions dated from Oct. 19, 2014 to May 5, 2016), which is impressive given that people uploaded all kinds of real-life photos and not necessarily places-like photos. Places205-AlexNet is the back-end prediction model in the demo.

5.3 Places365 Challenge Result

The subset Places365-Challenge, which contains more than 8 million images from 365 scene categories, was used in the Places Challenge 2016, held as part of the ILSVRC Challenge at the European Conference on Computer Vision (ECCV) 2016.

The rule of the challenge is that each team can only use the data provided in Places365-Challenge to train their networks. Standard CNN models trained on ImageNet-1.2million and the previous Places are allowed. Each team can submit at most 5 prediction results. Ranks are based on the top-5 classification error of each submission. Winning teams are then invited to give talks at the ILSVRC-COCO Joint Workshop at ECCV'16.

3. All the Places-CNNs are available at https://github.com/CSAILvision/places365
4. http://places.csail.mit.edu/user/leaderboard.php
5. The performance of the ResNet might result from fine-tuning or under-training, as the ResNet is not trained from scratch.
6. http://places.csail.mit.edu/demo.html


TABLE 1 Classification accuracy on the test set of Places205 and the test set of SUN205. We use the class score averaged over 10-crops of each test image to classify the image. ∗ shows the top 2 ranked results from the Places205 leaderboard.

                                  Test set of Places205        Test set of SUN205
                                  Top-1 acc.   Top-5 acc.      Top-1 acc.   Top-5 acc.
ImageNet-AlexNet feature+SVM      40.80%       70.20%          49.60%       80.10%
Places205-AlexNet                 50.04%       81.10%          67.52%       92.61%
Places205-GoogLeNet               55.50%       85.66%          71.6%        95.01%
Places205-VGG                     58.90%       87.70%          74.6%        95.92%
SamExynos*                        64.10%       90.65%          -            -
SIAT MMLAB*                       62.34%       89.66%          -            -

TABLE 2 Classification accuracy on the validation set and test set of Places365. We use the class score averaged over 10-crops of each testing image to classify the image.

                      Validation Set of Places365      Test Set of Places365
                      Top-1 acc.   Top-5 acc.          Top-1 acc.   Top-5 acc.
Places365-AlexNet     53.17%       82.89%              53.31%       82.75%
Places365-GoogLeNet   53.63%       83.88%              53.59%       84.01%
Places365-VGG         55.24%       84.91%              55.19%       85.01%
Places365-ResNet      54.74%       85.08%              54.65%       85.07%
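The Top-1 and Top-5 accuracies reported in Tables 1 and 2 follow the definitions given in Section 5.1 and can be computed as in this short sketch (the array names are hypothetical).

```python
import numpy as np

def top_k_accuracy(class_scores, labels, k=5):
    """Fraction of test images whose ground-truth label is among the
    k highest-scoring predicted classes.

    class_scores: (n_images, n_classes) prediction scores.
    labels:       (n_images,) ground-truth class indices.
    """
    top_k = np.argsort(-class_scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Top-1 accuracy is the special case k=1.
```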

Fig. 10. The predictions given by the Places365-VGG for images from the validation set. The ground-truth label (GT) and the top 5 predictions are shown. The number beside each label indicates the prediction confidence.

There were 92 valid submissions from 27 teams in total. Finally, team Hikvision [30] won 1st place with a 9.01% top-5 error, team MW [31] won 2nd place with a 10.19% top-5 error, and team Trimps-Soushen [32] won 3rd place with a 10.30% top-5 error. The leaderboard and the team information are available at the challenge result page⁷. The ranked results of all the submissions are plotted in Fig. 13. The entry from the winning team outperforms our best baseline by a large margin (∼6% in top-5 accuracy). Note that our baselines are trained on Places365-Standard, while the challenge entries are trained on Places365-Challenge, which has 5.5 million more training images.

5.4 Generic Visual Features from Places-CNNs and ImageNet-CNNs

We further used the activations from the trained Places-CNNs as generic features for visual recognition tasks on different image classification benchmarks. Activations from the higher-level layers of a CNN, also termed deep features, have proven to be effective generic features with state-of-the-art performance on various image datasets [33], [34]. But most of the deep features used are from CNNs trained on ImageNet, which is mostly an object-centric dataset. Here we evaluated the classification performance of the deep features from scene-centric CNNs and object-centric CNNs in a systematic way.

7. http://places2.csail.mit.edu/results2016.html


Fig. 13. The ranked results of all 92 valid submissions. The best baseline trained on Places365-Standard is the ResNet152, with a top-5 error of 14.9%, while the winning network from Hikvision achieves a 9.01% top-5 error, outperforming the baseline by a large margin.

Fig. 11. Examples of predictions rated as incorrect on the validation set by the Places365-VGG. GT stands for ground-truth label. Note that some of the top-5 responses are often not wrong per se, predicting semantic categories near the GT category. See the text for details.

Fig. 12. Two screenshots of the scene recognition demo based on the Places-CNN. The web-demo predicts the type of environment, the semantic categories, and associated scene attributes for uploaded photos.

The deep features from several Places-CNNs and ImageNet-CNNs are tested on the following scene and object benchmarks: SUN397 [16], MIT Indoor67 [15], Scene15 [13], SUN Attribute [35], Caltech101 [36], Caltech256 [37], Stanford Action40 [38], and UIUC Event8 [39].

All of the presented experiments follow the standards in the papers mentioned above. In the SUN397 experiment [16], the training set size is 50 images per category. Experiments were run on the 5 splits of the training set and test set given with the dataset. In the MIT Indoor67 experiment [15], the training set size is 100 images per category. The experiment is run on the split of the training set and test set given with the dataset. In the Scene15 experiment [13], the training set size is 50 images per category. Experiments are run on 10 random splits of the training set and test set. In the SUN Attribute experiment [35], the training set size is 150 images per attribute. The reported result is the average precision. The splits of the training set and test set are given in the paper. In the Caltech101 and Caltech256 experiments [36], [37], the training set size is 30 images per category. The experiments are run on 10 random splits of the training set and test set. In the Stanford Action40 experiment [38], the training set size is 100 images per category. Experiments are run on 10 random splits of the training set and test set. The reported result is the classification accuracy. In the UIUC Event8 experiment [39], the training set size is 70 images per category and the test set size is 60 images per category. The experiments are run on 10 random splits of the training set and test set.

Places-CNNs and ImageNet-CNNs have the same network architectures for AlexNet, GoogLeNet, and VGG, but they are trained on scene-centric data (Places) and object-centric data (ImageNet) respectively. For AlexNet and VGG, we used the 4096-dimensional feature vector from the activation of the fully connected layer (fc7) of the CNN. For GoogLeNet, we used the 1024-dimensional feature vector from the response of the global average pooling layer before the softmax that produces the class predictions. The classifier in all of the experiments is a linear SVM with the default parameters for all of the features.
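A minimal sketch of the protocol just described, standing in for the exact pipeline used in the paper: fc7-style activations are extracted from a pretrained AlexNet-type network and fed to a linear SVM with default parameters (the model handle, preprocessing and benchmark splits are placeholders).

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

# Truncate an AlexNet-style network right after fc7 to get 4096-d deep features.
cnn = models.alexnet(pretrained=True).eval()
feature_extractor = nn.Sequential(
    cnn.features, cnn.avgpool, nn.Flatten(),
    *list(cnn.classifier.children())[:-1],   # keep everything up to fc7
)

@torch.no_grad()
def deep_features(batch):
    """batch: (n, 3, 224, 224) normalized images -> (n, 4096) fc7 features."""
    return feature_extractor(batch).cpu().numpy()

def evaluate_benchmark(X_train, y_train, X_test, y_test):
    """Train a linear SVM with default parameters (C=1) on deep features
    and return test accuracy, as done for each benchmark in Table 3."""
    clf = LinearSVC(C=1.0).fit(X_train, y_train)
    return float(np.mean(clf.predict(X_test) == y_test))
```

The same deep feature + linear SVM recipe, with ImageNet-CNN features, is also what produces the cross-dataset comparison of Fig. 9.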


Fig. 14. Classification accuracy on the SUN397 dataset. We compare the deep features of Places365-VGG, Places205-AlexNet (result reported in [1]), and ImageNet-AlexNet to hand-designed features (HOG, gist, etc.). The deep features of Places365-VGG outperform the other deep features and the hand-designed features by a large margin. Results for the other hand-designed features/kernels are taken from [16].

Table 3 summarizes the classification accuracy on the various datasets for the deep features of the Places-CNNs and the deep features of the ImageNet-CNNs. Fig. 14 plots the classification accuracy for different visual features on the SUN397 database over different numbers of training samples per category. The classifier is a linear SVM with the same default parameters for the two deep feature layers (C=1) [40]. The Places-CNN features show impressive performance on scene-related datasets, outperforming the ImageNet-CNN features. On the other hand, the ImageNet-CNN features show better performance on object-related image datasets. Importantly, our comparison shows that Places-CNNs and ImageNet-CNNs have complementary strengths on scene-centric tasks and object-centric tasks, as expected from the type of data used to train these networks. Furthermore, the deep features from the Places365-VGG achieve the best performance (63.24%) on the most challenging scene classification dataset, SUN397, while the deep features of Places205-VGG perform best on the MIT Indoor67 dataset. As far as we know, these are the state-of-the-art scores achieved by a single feature + linear SVM on those two datasets. Finally, we merge the 1,000 classes from ImageNet and the 365 classes from Places365-Standard to train a VGG (Hybrid1365-VGG). The deep feature from the Hybrid1365-VGG achieves the best score averaged over all eight image datasets.

5.5 Visualization of the Internal Units

Through the visualization of the unit responses at various levels of the network layers, we can gain a better understanding of what has been learned inside the CNNs and of the differences between the object-centric CNN trained on ImageNet and the scene-centric CNN trained on Places, given that they share the same AlexNet architecture. Following the methodology in [2], we feed 100,000 held-out testing images into the two networks and record the max activation of each unit, pooled over the whole spatial feature map, for each of the images. For each unit, we take the top three images ranked by their max activations, then segment those images by bilinearly upsampling the binarized spatial feature map mask.

The image segmentation results for units from different layers are shown in Fig. 15. We can see that from conv1 to conv5, the units detect visual concepts ranging from low-level edges/textures to high-level object/scene parts. Furthermore, in the object-centric ImageNet-CNN there are more units detecting object parts, such as dog and people's heads, in the conv5 layer, while in the scene-centric Places-CNN there are more units detecting scene parts, such as beds, chairs, or buildings, in the conv5 layer. This difference in the specialty of the units of the object-centric CNN and the scene-centric CNN yields the very different performances of their generic visual features on the variety of recognition benchmarks (object-centric datasets vs. scene-centric datasets) in Table 3.

We further synthesized preferred input images for the Places-CNN by using the image synthesis technique proposed in [41]. This method uses a learned prior deep generator network to generate images which maximize the final class activation or an intermediate unit activation of the Places-CNN. The synthetic images for 50 scene categories are shown in Fig. 16. These abstract image contents reveal the knowledge of the specific scene learned and memorized by the Places-CNN: examples include the buses within a road environment for the bus station, and the tents surrounded by forest-type features for the campsite. Here we used Places365-AlexNet (other Places365-CNNs generated similar results). We further used the synthesis technique to generate the images preferred by the units in the conv5 layer of Places365-AlexNet. As shown in Fig. 17, the synthesized images are very similar to the image regions segmented by the estimated receptive field of the units.

6 CONCLUSION

From the Tiny Image dataset [42], to ImageNet [11] and Places [1], the rise of multi-million-item dataset initiatives and other densely labeled datasets [18], [20], [43], [44] has enabled data-hungry machine learning algorithms to reach near-human semantic classification of visual patterns, like objects and scenes. With its category coverage and high diversity of exemplars, Places offers an ecosystem of visual context to guide progress on scene understanding problems. Such problems could include determining the actions happening in a given environment, spotting inconsistent objects or human behaviors for a particular place, and predicting future events or the cause of events given a scene.


TABLE 3 Classification accuracy/precision on scene-centric databases (the first four datasets) and object-centric databases (the last four datasets) for the deep features of various Places-CNNs and ImageNet-CNNs. All results are top-1 accuracy/precision.

Deep Feature         SUN397  MIT Indoor67  Scene15  SUN Attribute  Caltech101  Caltech256  Action40  Event8  Average
Places365-AlexNet     56.12         70.72    89.25          92.98       66.40       46.45     46.82   90.63    69.92
Places205-AlexNet     54.32         68.24    89.87          92.71       65.34       45.30     43.26   94.17    69.15
ImageNet-AlexNet      42.61         56.79    84.05          91.27       87.73       66.95     55.00   93.71    72.26
Places365-GoogLeNet   58.37         73.30    91.25          92.64       61.85       44.52     47.52   91.00    70.06
Places205-GoogLeNet   57.00         75.14    90.92          92.09       54.41       39.27     45.17   92.75    68.34
ImageNet-GoogLeNet    43.88         59.48    84.95          90.70       89.96       75.20     65.39   96.13    75.71
Places365-VGG         63.24         76.53    91.97          92.99       67.63       49.20     52.90   90.96    73.18
Places205-VGG         61.99         79.76    91.61          92.07       67.58       49.28     53.33   93.33    73.62
ImageNet-VGG          48.29         64.87    86.28          91.78       88.42       74.96     66.63   95.17    77.05
Hybrid1365-VGG        61.77         79.49    92.15          92.93       88.22       76.04     68.11   93.13    81.48


Fig. 15. Visualization of the units' receptive fields at different layers of the ImageNet-CNN and the Places-CNN. Subsets of units at each layer are shown; in each row we show the top 3 most activated images, segmented based on the binarized spatial feature map of the corresponding unit. Here we take ImageNet-AlexNet and Places205-AlexNet as the comparison examples. See [2] for the detailed visualization methodology.
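As a rough illustration of the segmentation step named in the caption, the sketch below, assuming a torchvision AlexNet in place of the Caffe models used here, upsamples the spatial feature map of a single conv5 unit to image resolution, binarizes it at a fraction of its maximum response, and masks the image with the result. The full receptive-field estimation of [2] is more involved, and the unit index, threshold, and image path are illustrative placeholders.

# Sketch: segment an image by the binarized spatial feature map of one unit.
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
conv5 = cnn.features[:12]   # layers up to the ReLU after conv5 (before the last max-pool)

preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def segment_by_unit(image_path, unit_index, threshold=0.5):
    """Mask an image by the binarized conv5 feature map of one unit."""
    pil = Image.open(image_path).convert("RGB")
    x = preprocess(pil).unsqueeze(0)
    with torch.no_grad():
        fmap = conv5(x)[0, unit_index]                   # 13x13 spatial response map
    fmap = F.interpolate(fmap[None, None], size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]
    mask = (fmap >= threshold * fmap.max()).numpy()      # binarize at 50% of the max
    out = np.array(pil.resize((224, 224)))               # H x W x 3, uint8
    out[~mask] = 0                                       # keep only the activated region
    return Image.fromarray(out)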

ACKNOWLEDGMENTS

The authors would like to thank Santani Teng, Zoya Bylinskii, Mathew Monfort and Caitlin Mullin for comments on the paper. Over the years, the Places project was supported by the National Science Foundation under Grants No. 1016862 to A.O. and No. 1524817 to A.T.; the Vannevar Bush Faculty Fellowship program sponsored by the Basic Research Office of the Assistant Secretary of Defense for Research and Engineering and funded by the Office of Naval Research through grant N00014-16-1-3116 to A.O.; the MIT Big Data Initiative at CSAIL; the Toyota Research Institute / MIT CSAIL Joint Research Center; Google, Xerox and Amazon Awards; and a hardware donation from NVIDIA Corporation, to A.O. and A.T. B.Z. is supported by a Facebook Fellowship.

REFERENCES

[1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014.
[2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene cnns," International Conference on Learning Representations, 2015.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[5] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[6] M. Campbell, A. J. Hoane, and F.-h. Hsu, "Deep blue," Artificial Intelligence, 2002.
[7] D. Ferrucci, A. Levas, S. Bagchi, D. Gondek, and E. T. Mueller, "Watson: beyond jeopardy!" Artificial Intelligence, 2013.
[8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, 2016.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proc. ICCV, 2015.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Proc. CVPR, 2009.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," Int'l Journal of Computer Vision, 2015.
[13] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006.


Fig. 16. The synthesized images preferred by the final output of Places365-AlexNet for 20 scene categories.

Fig. 17. The synthesized images preferred by the conv5 units of Places365-AlexNet correspond to the images segmented by the receptive fields of those units. The synthetic images are very similar to the segmented image regions of the units. Each row of segmented images corresponds to one unit.
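Figs. 16 and 17 were generated with the deep-generator-network method of [41], which optimizes the latent code of a learned image prior. The sketch below is a deliberately simpler stand-in for the same idea: plain activation maximization by gradient ascent in pixel space, with no learned prior, so its outputs will look far noisier than the figures. ImageNet weights are used as a proxy for the released Places365-AlexNet weights, and the class index, step count, and learning rate are illustrative.

# Sketch: synthesize an input that maximizes one class activation of a CNN.
import torch
import torchvision.models as models

cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
for p in cnn.parameters():
    p.requires_grad_(False)          # optimize the input, not the weights

def synthesize(class_index, steps=200, lr=0.05, weight_decay=1e-4):
    """Gradient ascent in pixel space to maximize one class logit."""
    x = (0.1 * torch.randn(1, 3, 224, 224)).requires_grad_()   # random init
    opt = torch.optim.Adam([x], lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        score = cnn(x)[0, class_index]
        (-score).backward()          # minimizing the negative ascends the activation
        opt.step()
    return x.detach()

# Example: image preferred by class 0 of the loaded network.
preferred = synthesize(0)

Maximizing an intermediate conv5 unit instead of a class logit, as in Fig. 17, only requires replacing the scored output with that unit's activation.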

[14] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int'l Journal of Computer Vision, 2001.
[15] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. CVPR, 2009.
[16] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "Sun database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010.
[17] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorkó et al., "The pascal visual object classes challenge 2007 (voc2007) results," 2007.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," Int'l Journal of Computer Vision, 2016.
[20] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proc. CVPR, 2017.
[21] G. A. Miller, "Wordnet: a lexical database for english," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[22] P. Jolicoeur, M. A. Gluck, and S. M. Kosslyn, "Pictures and names: Making the connection," Cognitive Psychology, 1984.
[23] A. Torralba and A. A. Efros, "Unbiased look at dataset bias," in Proc. CVPR, 2011.
[24] C. Heip, P. Herman, and K. Soetaert, "Indices of diversity and evenness," Oceanis, 1998.
[25] E. H. Simpson, "Measurement of diversity," Nature, 1949.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. CVPR, 2015.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations, 2014.
[28] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," http://caffe.berkeleyvision.org/, 2013.


[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[30] Q. Zhong, C. Li, Y. Zhang, H. Sun, S. Yang, D. Xie, and S. Pu, "Towards good practices for recognition and detection," http://image-net.org/challenges/talks/2016/Hikvision at ImageNet 2016.pdf, 2016.
[31] L. Shen, Z. Lin, G. Sun, and J. Hu, "Places401 and places365 models," https://github.com/lishen-shirley/Places2-CNNs, 2016.
[32] J. Shao, X. Zhang, Z. Ding, Y. Zhao, Y. Chen, J. Zhou, W. Wang, L. Mei, and C. Hu, "Good practices for deep feature fusion," http://image-net.org/challenges/talks/2016/Trimps-Soushen@ILSVRC2016.pdf, 2016.
[33] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "Decaf: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[34] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "Cnn features off-the-shelf: an astounding baseline for recognition," in CVPR Workshops, 2014.
[35] G. Patterson and J. Hays, "Sun attribute database: Discovering, annotating, and recognizing scene attributes," in Proc. CVPR, 2012.
[36] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, 2007.
[37] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[38] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proc. ICCV, 2011.
[39] L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in Proc. ICCV, 2007.
[40] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," 2008.
[41] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Advances in Neural Information Processing Systems, 2016.
[42] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2008.
[43] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," Int'l Journal of Computer Vision, 2010.
[44] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proc. CVPR, 2016.

Bolei Zhou is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). He received an M.Phil. degree in Information Engineering from the Chinese University of Hong Kong and a B.Eng. degree in Biomedical Engineering from Shanghai Jiao Tong University in 2010. His research interests are computer vision and machine learning. He is an award recipient of the Facebook Fellowship, the Microsoft Research Asia Fellowship, and the MIT Greater China Fellowship.

Agata Lapedriza is an Associate Professor at the Universitat Oberta de Catalunya. She received her MS degree in Mathematics from the Universitat de Barcelona in 2003, and her Ph.D. degree in Computer Science from the Computer Vision Center at the Universitat Autònoma de Barcelona in 2009. She was a visiting researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology from 2012 until 2015. Her research interests are related to high-level image understanding, scene and object recognition, and affective computing.

Aditya Khosla received the BS degree in computer science, electrical engineering and economics from the California Institute of Technology, and the MS degree in computer science from Stanford University, in 2009 and 2011 respectively. He completed his PhD in computer science at the Massachusetts Institute of Technology in 2016 with a focus on computer vision and machine learning. In his thesis, he developed machine learning techniques that predict human behavior and the impact of visual media on people.
Aude Oliva is a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). After a French baccalaureate in Physics and Mathematics, she received two M.Sc. degrees and a Ph.D. in Cognitive Science from the Institut National Polytechnique de Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004 and CSAIL in 2012. Her research on vision and memory is cross-disciplinary, spanning human perception and cognition, computer vision, and human neuroscience. She received the 2006 National Science Foundation (NSF) Career award, the 2014 Guggenheim fellowship, and the 2016 Vannevar Bush fellowship.

Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he was a postdoctoral researcher at the Brain and Cognitive Sciences Department and the Computer Science and Artificial Intelligence Laboratory, MIT. He is now a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. Prof. Torralba is an Associate Editor of the International Journal of Computer Vision, and served as program chair for the Computer Vision and Pattern Recognition conference in 2015. He received the 2008 National Science Foundation (NSF) Career award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).