Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs
Total Page:16
File Type:pdf, Size:1020Kb
Noname manuscript No. (will be inserted by the editor) Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs Rahul Raguram · Changchang Wu · Jan-Michael Frahm · Svetlana Lazebnik Received: date / Accepted: date Abstract This article presents an approach for mod- by users and uploaded to photo-sharing websites. For eling landmarks based on large-scale, heavily contami- example, on Flickr.com, locations form the single most nated image collections gathered from the Internet. Our popular category of user-supplied tags [Sigurbj¨ornsson system efficiently combines 2D appearance and 3D geo- and van Zwol, 2008]. With the growth of community- metric constraints to extract scene summaries and con- contributed collections of place-specific imagery, there struct 3D models. In the first stage of processing, im- comes a growing need for algorithms that can distill ages are clustered based on low-dimensional global ap- their content into representations suitable for summa- pearance descriptors, and the clusters are refined using rization, visualization, and browsing. 3D geometric constraints. Each valid cluster is repre- In this article, we consider collections of Flickr im- sented by a single iconic view, and the geometric re- ages associated with a landmark keyword such as “Statue lationships between iconic views are captured by an of Liberty,” with often noisy annotations and metadata. iconic scene graph. Using structure from motion tech- Our goal is to efficiently identify all photos that actu- niques, the system then registers the iconic images to ally represent the landmark of interest, and to organize efficiently produce 3D models of the different aspects of these photos to reveal the spatial and semantic struc- the landmark. To improve coverage of the scene, these ture of the landmark. Any system that aims to meet 3D models are subsequently extended using additional, this goal must address several challenges inherent in non-iconic views. We also demonstrate the use of iconic the nature of the data: images for recognition and browsing. Our experimental – Contamination: When dealing with community- results demonstrate the ability to process datasets con- contributed landmark photo collections, it has been taining up to 46,000 images in less than 20 hours, using observed that keywords and tags are accurate only a single commodity PC equipped with a graphics card. approximately 50% of the time [Kennedy et al., 2006]. This is a significant advance towards Internet-scale op- Since we obtain our input using keyword searches, eration. a large fraction of the input images comprises of “noise,” or images that are unrelated to the concept of interest. 1 Introduction – Diversity: The issue of contamination aside, even “valid” depictions of landmarks have a remarkable Today, more than ever before, it is evident that “to col- degree of diversity. Landmarks may have multiple lect photography is to collect the world” [Sontag, 1977]. aspects (sometimes geographically dispersed), they More and more of the Earth’s cities and sights are pho- may be photographed at different times of day and tographed each day from a variety of digital cameras, in different weather conditions, to say nothing of viewing positions and angles, weather and illumination non-photorealistic depictions and cultural references conditions; more and more of these photos get tagged (Figure 1). Dept. of Computer Science, University of North Carolina, Chapel –Scale: The typical collection of photos annotated Hill, NC 27599-3175 with a landmark-specific phrase has tens to hun- E-mail: {rraguram,ccwu,jmf,lazebnik}@cs.unc.edu dreds of thousands of images. For example, there 2 scene graph. In the next step, this graph is used for efficient reconstruction of a 3D skeleton model, which is subsequently extended using additional relevant im- ages. Given a new test image, we can register it into the model in order to answer the question of whether the landmark is present in the test image. In addition, as a natural consequence of the structure of our ap- proach, the image collection can be cleanly organized into a hierarchy for browsing. Since our method efficiently filters out unrelated im- ages using 2D appearance-based constraints, which are computationally cheap, and applies more computation- ally demanding geometric constraints to much smaller Fig. 1 The diversity of photographs depicting “Statue of Lib- erty.” There are copies of the statue in New York, Las Vegas, subsets of “promising” images, it is scalable to large Tokyo, and Paris. The appearance of the images can vary signif- photo collections. Unlike approaches based purely on icantly based on time of day and weather conditions. Further SfM, e.g., [Agarwal et al., 2009], it does not require complicating the picture are parodies (e.g., people dressed as a massively parallel cluster of hundreds of computing the statue) and non-photorealistic representations. The approach presented in this article relies on rigid 3D constraints, so it is not cores and can process datasets consisting of tens of applicable to the latter two types of depictions. thousands of images within hours on a single commod- ity PC. The rest of this article is organized as follows. Sec- are over 140,000 images on Flickr associated with tion 2 places our research in the context of other related the keyword “Statue of Liberty.” If we wanted to work on landmark modeling. In Section 3 we introduce process such collections using a traditional structure the steps of our implemented system. Section 4 presents from motion (SfM) pipeline, we would have to take experimental results on three datasets: the Notre Dame every pair of images and try to establish a two-view cathedral in Paris, Statue of Liberty, and Piazza San relation between them. The running time of such an Marco in Venice. Finally, Section 5 closes the presenta- approach would be at least quadratic in the number tion with a discussion of limitations and directions for of input images. Clearly, such brute-force matching future work. is not scalable; we need smarter and more efficient An earlier version of this work was originally pre- ways of organizing the images. sented in [Li et al., 2008]. For the present article, the system has been completely re-implemented to include Fortunately, landmark photo collections also possess much faster GPU-based feature extraction and geomet- helpful characteristics that can actually make large- ric verification, an improved image registration algo- scale modeling easier. The main such characteristic is rithm leading to higher precision and recall, and a new redundancy: people tend to take pictures from certain incremental reconstruction strategy delivering larger and viewpoints and to frame their compositions in consis- more complete models. Videos of computed 3D models, tent ways, giving rise to many large groups of very along with complete browsing summaries, can be found similar-looking photos. Our system is based on the ob- on the project website.1 servation that such groups can be discovered using 2D appearance-based constraints that are considerably more efficient than full-blown SfM constraints, and that the 2 Previous Work iconic views representing these groups form a complete and concise summary of the scene, so that most of the Our system offers a comprehensive solution to the prob- subsequent computation can be restricted to the iconic lems of dataset collection, 3D reconstruction, scene sum- views without much loss of content. marization, browsing and recognition for landmark im- Figure 2 gives an overview of our system and Algo- ages. In this section, we discuss related recent work in rithm 1 shows a more detailed summary of the mod- these areas. eling steps. Our system begins by clustering all input At a high level, one of the goals of our work can images based on 2D appearance descriptors, and then it be described as follows: starting with the heavily con- progressively refines these clusters with geometric con- taminated output of an Internet image search query, straints to select iconic images that represent dominant we want to extract a high-precision subset of images aspects of the scene. These images and the pairwise geometric relationships between them define an iconic 1 http://www.cs.unc.edu/PhotoCollectionReconstruction 3 Raw photo collection Iconic summary Iconic scene graph Query: “Statue of Liberty” 454 images 47,238 images, ~40% irrelevant RRttieconstruction Hierarchical browsing Las Vegas New York Tokyo Fig. 2 Overview of our system. The input to the system is a raw, contaminated photo collection, which is reduced to a collection of representative iconic images by 2D appearance-based clustering followed by geometric verification. The geometric relationships between iconic views are captured by the iconic scene graph. The structure of the iconic scene graph is used to automatically generate 3D point cloud models, as well as to impose a hierarchical organization on the images for browsing. Videos of the models, along with a browsing interface, can be found at www.cs.unc.edu/PhotoCollectionReconstruction. that are actually relevant to the query. Several exist- gain in efficiency and an improved ability to group simi- ing approaches consider this problem of dataset collec- lar viewpoints. Another difference between our method tion for generic visual categories not characterized by and [Philbin and Zisserman, 2008, Zheng et al., 2009] rigid 3D structure[Fergus et al., 2004, Berg and Forsyth, is that we perform geometric verification by applying 2006, Li et al., 2007, Schroff et al., 2007, Collins et al., full 3D SfM constraints instead of loose 2D spatial con- 2008]. These approaches use statistical models to com- straints. bine different kinds of 2D image features (texture, color, To discover all the images belonging to the land- keypoints), as well as text and tags. However, for our mark, we first try to find a set of iconic views, corre- specific application of landmark modeling, such statis- sponding to especially popular and salient aspects.