IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012

Harvesting Social Images for Bi-Concept Search

Xirong Li, Cees G. M. Snoek, Senior Member, IEEE, Marcel Worring, Member, IEEE, and Arnold W. M. Smeulders, Member, IEEE

Abstract—Searching for the co-occurrence of two visual concepts in unlabeled images is an important step towards answering complex user queries. Traditional visual search methods use combinations of the confidence scores of individual concept detectors to tackle such queries. In this paper we introduce the notion of bi-concepts, a new concept-based retrieval method that is directly learned from social-tagged images. As the number of potential bi-concepts is gigantic, manually collecting training examples is infeasible. Instead, we propose a multimedia framework to collect de-noised positive as well as informative negative training examples from the social web, to learn bi-concept detectors from these examples, and to apply them in a search engine for retrieving bi-concepts in unlabeled images. We study the behavior of our bi-concept search engine using 1.2 M social-tagged images as a data source. Our experiments indicate that harvesting examples for bi-concepts differs from traditional single-concept methods, yet the examples can be collected with high accuracy using a multi-modal approach. We find that directly learning bi-concepts is better than oracle linear fusion of single-concept detectors, with a relative improvement of 100%. This study reveals the potential of learning high-order semantics from social images, for free, suggesting promising new lines of research.

Index Terms—Bi-concept, semantic index, visual search.

I. INTRODUCTION

Fig. 1. Searching for two visual concepts co-occurring in unlabeled images. A (green) tick indicates a positive result. Given two single-concept detectors with reasonable accuracy, a combination using their individual confidence scores yields a bad retrieval result (c). We propose to answer the complex query using a bi-concept detector optimized in terms of mutual training examples (d). (a) Searching for "car" by a "car" detector. (b) Searching for "horse" by a "horse" detector. (c) Searching for "car horse" by combining the results of (a) and (b). (d) Searching for "car horse" using the proposed bi-concept search engine.

SEARCHING pictures on smart phones, PCs, and the Internet for specific visual concepts, such as objects and scenes, is of great importance for users with all sorts of information needs. As the number of images is growing so rapidly, full manual annotation is infeasible. Therefore, automatically determining the occurrence of visual concepts in the visual content is crucial. Compared to low-level visual features such as color and local descriptors used in traditional content-based image retrieval, the concepts provide direct access to the semantics of the visual content. Thanks to continuous progress in generic visual concept detection [1]–[4], followed by novel exploitation of the individual detection results [5]–[8], an effective approach to unlabeled image search is dawning.

In reality, however, a user's query is often more complex than a single concept can represent [9]. For instance, consider the query: "an image showing a horse next to a car". To answer this query, one might expect to employ a "car" detector and a "horse" detector and combine their predictions, which is indeed the mainstream approach in the literature [6]–[8], [10]–[12]. But is this approach effective? We observe that the single-concept detectors are trained on typical examples of the corresponding concept, e.g., cars on a street for the "car" detector, and horses on grass for the "horse" detector. We hypothesize that images with horses and cars co-occurring also have a characteristic visual appearance, while the individual concepts might not be present in their common form. Hence, combining two reasonably accurate single-concept detectors is mostly ineffective for finding images with both concepts visible, as illustrated in Fig. 1.

Ideally, we treat the combination of the concepts as a new concept, which we term bi-concept. To be precise, we define a bi-concept as the co-occurrence of two distinct visual concepts, where its full meaning cannot be inferred from one of its component concepts alone. According to this definition, not all combinations of two concepts are bi-concepts. For instance, a combination of a concept and its superclass such as "horse animal" is not a bi-concept, because "horse animal" bears no more information than "horse". Besides, specialized single concepts consisting of multiple tags such as "white horse" [13], [14] and "car driver" are not bi-concepts, as the two tags refer to the same visual concept.

Manuscript received June 28, 2011; revised November 26, 2011; accepted March 14, 2012. Date of publication April 03, 2012; date of current version July 13, 2012. This work was supported in part by the Dutch national program COMMIT and in part by the STW SEARCHER project. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Samson Cheung.

X. Li is with the MOE Key Laboratory of Data Engineering and Knowledge Engineering, School of Information, Renmin University of China, Beijing, China (e-mail: [email protected]).

C. G. M. Snoek, M. Worring, and A. W. M. Smeulders are with the Intelligent Systems Lab Amsterdam, University of Amsterdam, 1098 XH Amsterdam, The Netherlands.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2012.2191943
The same holds for events such as "airplane landing", where the tag "landing" is not a distinct visual concept by itself.

Although not all semantic combinations are bi-concepts, the number of possible bi-concepts is still quadratic in the number of single concepts. Even when we assume that a set of only 5000 concepts is enough for general-purpose search [6], finding sufficient labeled examples for each possible bi-concept already becomes a problem of considerable proportions. The amount of labeling effort is so considerable that it puts the scalability of expert annotation and the recent Amazon Mechanical Turk service into question. We consider obtaining bi-concept examples without expert labeling as a key problem for bi-concept search in unlabeled images.

A novel source of labeled images for concept detection are user-tagged images on the social web, such as those on Flickr and Facebook. However, due to the subjectiveness of social tagging, social tags often do not reflect the actual content of the image. It has been shown in previous studies that directly training on social-tagged images results in suboptimal single-concept detectors [15]–[17]. Learning bi-concept detectors from social-tagged images is, to the best of our knowledge, non-existing in the literature. By definition, the number of images labeled with a bi-concept is less than the number of images labeled with a single concept, meaning bi-concepts have a worse starting point for obtaining examples. In addition, a scene with two concepts present tends to be visually more complex, requiring multi-modal analysis. Given these difficulties, effective bi-concept search demands an approach to harvesting appropriate examples from social-tagged images for learning bi-concept detectors.

In this paper, we introduce the notion of bi-concepts, and propose a multimedia search engine to study their instantiation, development, and applicability. We present a multi-modal approach to collect de-noised positive as well as informative negative training examples from the social web. We learn bi-concept detectors from these examples, and later apply them for retrieving bi-concepts in unlabeled images. A conceptual diagram of the proposed search engine is illustrated in Fig. 2.

Fig. 2. Conceptual diagram of the proposed bi-concept image search engine.

II. RELATED WORK

We first review related work on combining single-concept detectors for visual search, and then discuss recent progress on harvesting training examples from the (social) web. For ease of consistent description, we use x to represent an image, w for a single concept, and w_ij for a bi-concept comprised of single concepts w_i and w_j. A single-concept detector yields a confidence score p(w|x), interpretable as the probability of observing the concept given the image. In a similar fashion, we define a bi-concept detector p(w_ij|x).

A. Visual Search by Combining Single Concepts

Given hundreds of single-concept detectors trained on well-labeled examples [18], a considerable number of papers have been published on how to combine the detectors for visual search. We refer to the review paper by Snoek and Worring [19] for a comprehensive discussion. Here we discuss two effective, simple, and popular combination methods: the product rule [12] and linear fusion [7], [8], [11], [20]–[22].

If the assumption would hold that individual concepts are conditionally independent given the image content [12], a bi-concept detector can be approximated by the product of its component concepts, namely

    p(w_ij|x) ≈ p(w_i|x) · p(w_j|x).    (1)

The linear fusion version of the bi-concept detector is commonly expressed as

    p(w_ij|x) = λ · p(w_i|x) + (1 − λ) · p(w_j|x),    (2)

where λ ∈ [0, 1] is a weighting parameter. To automatically determine the weight, many have used a heuristic approach. Natsev et al. [21] suggest average fusion with λ = 0.5. Chang et al. [22] weight the individual single-concept detectors in terms of their training performance. However, Snoek et al. [7] argue that the best performing individual detectors such as "person" and "outdoor" are often the least informative for retrieval. So Li et al. [11] set λ to be proportional to the informativeness of the concepts. How to determine the proper combination of the single-concept detectors remains an open question [19]. This is the reason why many not only show real results but also resort to an oracle combination using the best possible weights [6], [7], [21]. One may also consider utilizing an object localization system, which relies on region-based image analysis to pinpoint regions of the single concepts [23].
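The two combination rules and the oracle baseline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy confidence scores, and the use of average precision as the oracle's selection criterion are assumptions introduced for the example.

```python
# Sketch of the product rule (1), linear fusion (2), and an oracle weight
# search over labeled data. All names and scores here are illustrative.

def product_rule(p_i: float, p_j: float) -> float:
    """Eq. (1): assumes the concepts are conditionally independent given x."""
    return p_i * p_j

def linear_fusion(p_i: float, p_j: float, lam: float = 0.5) -> float:
    """Eq. (2): lam = 0.5 corresponds to average fusion (Natsev et al.)."""
    return lam * p_i + (1.0 - lam) * p_j

def oracle_weight(scored, relevant, steps=101):
    """Oracle baseline: sweep lambda and keep the value maximizing
    average precision on labeled data (the best-possible-weights setting).

    scored:   dict mapping image id -> (p_i, p_j) detector confidences.
    relevant: set of image ids labeled positive for the bi-concept.
    """
    def average_precision(ranked):
        hits, precisions = 0, []
        for rank, (img, _) in enumerate(ranked, start=1):
            if img in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / max(len(relevant), 1)

    best_lam, best_ap = 0.0, -1.0
    for k in range(steps):
        lam = k / (steps - 1)
        ranked = sorted(
            ((img, linear_fusion(p_i, p_j, lam))
             for img, (p_i, p_j) in scored.items()),
            key=lambda t: t[1], reverse=True)
        ap = average_precision(ranked)
        if ap > best_ap:
            best_lam, best_ap = lam, ap
    return best_lam, best_ap

# Toy usage: per-image ("car", "horse") confidences.
scores = {"img1": (0.9, 0.1), "img2": (0.6, 0.7), "img3": (0.2, 0.3)}
print(product_rule(*scores["img2"]))   # ~0.42
print(linear_fusion(*scores["img2"]))  # ~0.65
```

Note how the product rule penalizes images where either confidence is low, while linear fusion lets one strong detector dominate; neither models the joint appearance of the bi-concept, which is the gap the paper's learned bi-concept detectors address.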