Places: A 10 Million Image Database for Scene Recognition
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba

• B. Zhou, A. Khosla, A. Oliva, and A. Torralba are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
• A. Lapedriza is with the Universitat Oberta de Catalunya, Spain.

Abstract—The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database and the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.

Index Terms—Scene classification, visual recognition, deep learning, deep feature, image dataset.

1 INTRODUCTION

If a current state-of-the-art visual recognition system were to send you a text to describe what it sees, the text might read something like: "There is a sofa facing a TV set. A person is sitting on the sofa holding a remote control. The TV is on and a talk show is playing." Reading this, you would likely imagine a living room. However, that scenery could just as well occur in a resort by the beach.

For an agent acting in the world, there is no doubt that object and event recognition should be a primary goal of its visual system. But knowing the place or context in which the objects appear is equally important for an intelligent system to understand what might have happened in the past and what may happen in the future. For instance, a table inside a kitchen can be used to eat or prepare a meal, while a table inside a classroom is intended to support a notebook or a laptop for taking notes.

A key aspect of scene recognition is to identify the place in which the objects sit (e.g., beach, forest, corridor, office, street, ...). Although one can avoid using the place category by providing a more exhaustive list of the objects in the picture and a description of their spatial relationships, a place category provides the appropriate level of abstraction to avoid such a long and complex description. Note that one could likewise avoid using object categories in a description by only listing parts (e.g., two eyes on top of a mouth for a face). Like objects, places have functions and attributes. They are composed of parts, some of which can be named and correspond to objects, just as objects are composed of parts, some of which are nameable as well (e.g., legs, eyes).

Whereas most datasets have focused on object categories (providing labels, bounding boxes, or segmentations), here we describe the Places database, a quasi-exhaustive repository of 10 million scene photographs, labeled with 434 scene semantic categories, comprising about 98 percent of the types of places a human can encounter in the world. Image samples are shown in Fig. 1, while Fig. 2 shows the number of images per category, sorted in decreasing order.

Departing from Zhou et al. [1], we describe in depth the construction of the Places Database and evaluate the performance of several state-of-the-art Convolutional Neural Networks (CNNs) for place recognition. We compare how the features learned in a CNN for scene classification behave when used as generic features in other visual recognition tasks. Finally, we visualize the internal representations of the CNNs and discuss one major consequence of training a deep learning model to perform scene recognition: object detectors emerge as an intermediate representation of the network [2]. Therefore, while the Places database does not contain any object labels or segmentations, it can be used to train new object classifiers.
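To make the generic-feature idea concrete, the sketch below extracts penultimate-layer activations from a pretrained CNN and treats them as an off-the-shelf descriptor for a new task. This is a minimal sketch, not the authors' pipeline: the architecture (a torchvision ResNet-18 stand-in) and the weights path are assumptions.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet18()  # stand-in architecture, not the paper's exact CNN
    # model.load_state_dict(torch.load("places_cnn.pth"))  # hypothetical scene-CNN weights
    model.fc = torch.nn.Identity()  # drop the final scene-classification layer
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def deep_feature(image_path):
        # Returns a 512-D generic feature vector for one image.
        x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return model(x).squeeze(0)

A linear classifier (e.g., an SVM) trained on such vectors is the usual transfer baseline for comparing scene-centric and object-centric features.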
1.1 The Rise of Multi-million Datasets

What does it take to reach human-level performance with a machine-learning algorithm? In the case of supervised learning, the problem is two-fold. First, the algorithm must be suitable for the task, such as Convolutional Neural Networks for large-scale visual recognition [1], [3] and Recursive Neural Networks for natural language processing [4], [5]. Second, it must have access to a training dataset of appropriate coverage (quasi-exhaustive representation of classes and variety of exemplars) and density (enough samples to cover the diversity of each class). The optimal space for these datasets is often task-dependent, but the rise of multi-million-item sets has enabled unprecedented performance in many domains of artificial intelligence.

The successes of Deep Blue in chess, Watson in "Jeopardy!", and AlphaGo in Go against their expert human opponents may thus be seen as reflecting not just advances in algorithms, but also the increasing availability of very large datasets: 700,000, 8.6 million, and 30 million items, respectively [6]–[8]. Convolutional Neural Networks [3], [9] have likewise achieved near human-level visual recognition, trained on 1.2 million object images [10]–[12] and 2.5 million scene images [1]. Expansive coverage of the space of classes and samples allows getting closer to the right ecosystem of data that a natural system, like a human, would experience. The history of image datasets for scene recognition likewise shows rapid growth in the number of image samples, as described next.

Fig. 1. Image samples from various categories of the Places Database (two samples per category). The dataset contains three macro-classes: Indoor, Nature, and Urban.

Fig. 2. Sorted distribution of the number of images per category in the Places Database. Places contains 10,624,928 images from 434 categories. Category names are shown for every 6 intervals.
1.2 Scene-centric Datasets

The first benchmark for scene recognition was the Scene15 database [13], extended from the initial 8-scene dataset of [14]. It contains only 15 scene categories with a few hundred images per class, and current classifiers are saturated on it, reaching near-human performance at 95%. The MIT Indoor67 database [15], with 67 indoor categories, and the SUN database [16] (Scene Understanding, with 397 categories and 130,519 images) provided larger coverage of place categories, but fell short of the quantity of data needed to feed deep learning algorithms. To complement large object-centric datasets such as ImageNet [11], we build the Places dataset described here.

Meanwhile, the Pascal VOC dataset [17] is one of the earliest image datasets with diverse object annotations in scene context. The Pascal VOC challenge has greatly advanced the development of models for object detection and segmentation tasks. Nowadays, the COCO dataset [18] focuses on collecting object instances, with both polygon and bounding-box annotations, for images depicting everyday scenes of common objects. The recent Visual Genome dataset [19] aims at collecting dense annotations of objects, attributes, and their relationships. ADE20K [20] collects precise, dense annotations of scenes, objects, and parts of objects with a large and open vocabulary. Altogether, these annotated datasets further enable artificial systems to learn visual knowledge linking parts, objects, and scene context.

2 PLACES DATABASE

2.1 Coverage of the categorical space

The first asset of a high-quality dataset is an expansive coverage of the categorical space to be learned. The strategy of Places is to provide an exhaustive list of the categories of environments encountered in the world, bounded by spaces where a human body would fit (e.g.

2.2.1 Step 1: Downloading images using scene category and attributes

From online image search engines (Google Images, Bing Images, and Flickr), candidate images were downloaded using a query word from the list of scene classes provided by the SUN database [16]. In order to increase the diversity of visual appearances in the Places dataset, each scene class query was combined with 696 common English adjectives (e.g., messy, spare, sunny, desolate). In Fig. 3 we show some examples of images in Places grouped by queries.
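To illustrate the query-expansion step, here is a minimal sketch of how such class-adjective query strings could be generated. The class and adjective lists shown are small illustrative subsets of the SUN class list and the 696 adjectives, and the print statement stands in for submitting each query to the three search engines.

    from itertools import product

    scene_classes = ["bedroom", "beach", "corridor"]      # illustrative subset of the SUN class list
    adjectives = ["messy", "spare", "sunny", "desolate"]  # illustrative subset of the 696 adjectives

    # One query per bare class name, plus one per (adjective, class) combination.
    queries = list(scene_classes)
    queries += [f"{adj} {cls}" for cls, adj in product(scene_classes, adjectives)]

    for q in queries:
        print(q)  # each string would be submitted to Google Images, Bing Images, and Flickr

Pairing every class with every adjective multiplies the number of distinct queries per class by roughly 700, which is what drives the diversity of the downloaded candidate pool.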