Semantic Image Search from Multiple Query Images

Gonzalo Vaca-Castano and Mubarak Shah
Center for Research in Computer Vision, University of Central Florida
[email protected], [email protected]

ABSTRACT

This paper presents a novel search paradigm that uses multiple images as input to perform retrieval of images. While earlier work focuses on using single or multiple query images to retrieve images with views of the same instance, the proposed paradigm uses each query image to discover common concepts that are implicitly shared by all of the query images, and retrieves images considering the found concepts. Our implementation uses high level visual features extracted from a deep convolutional network to retrieve images similar to each query input. These images have associated text previously generated by implicit crowdsourcing. A Bag of Words (BoW) textual representation of each query image is built from the associated text of the retrieved similar images. A learned vector space representation of English words, extracted from a corpus of 100 billion words, allows computing the conceptual similarity of words. The words that represent the input images are used to find new words that share conceptual similarity across all the input images. These new words are combined with the representations of the input images to obtain a BoW textual representation of the search, which is used to perform image retrieval. The retrieved images are re-ranked to enhance visual similarity with respect to any of the input images. Our experiments show that the discovered concepts are meaningful and that, according to user ratings on the studied cases, 72.43% of the top 25 retrieved images are correct.

Figure 1: Results of the proposed paradigm with three input images. (a) Query images. (b) Top retrieved images.

1. INTRODUCTION

The goal of any Image Retrieval system is to retrieve images from a large visual corpus that are similar to the input query. To date, most Image Retrieval systems (including commercial search engines like Google, http://images.google.com/) base their search on a single image input query. Recently, multiple images have been proposed as input queries for image retrieval [1, 2]. In these cases, the objective of the multiple image inputs is to acquire different viewing conditions of the unique object or concept for the user.

In this work, we follow a different approach for multiple query image inputs, since the input images are employed jointly to extract underlying concepts common to the input images. The input images do not need to be part of the same concept. In fact, each additional input image either enriches, amplifies, or reduces the importance of one descriptive aspect of the multiple significances that any image inherently has. Consider the example in Figure 1. Three input images are used to query the system. The first input image is a milk bottle, the second contains a piece of meat, and the third contains a farm. These three images are dissimilar from a visual point of view. However, conceptually, they could be linked by the underlying concept "cattle," since farmers obtain milk and meat from cattle. These types of knowledge-based conceptual relations are the ones that we propose to capture in this new search paradigm. Figure 1b shows the top images retrieved by our system.
The use of multiple images as input queries in real scenarios is straightforward, especially when mobile and wearable devices are the user interfaces. Consider for instance a wearable device. In this case, pictures can be passively sampled and used as query inputs of a semantic search to discover the context of the user. Once meaningful semantic concepts are found, many specific applications can be built, including personalization of textual search, reduction of the search space for visual object detection, and personalization of output information such as suggestions, advertising, and recommendations.

The utility of the proposed search paradigm is enhanced when the user has an unclear search in mind or lacks the knowledge to describe it, but does have some idea of images of isolated concepts. For example, consider a user looking for ideas for a gift. While walking through the mall, the user adds some pictures of vague ideas for the gift. Based on the provided images, semantic concepts are found and used to retrieve images of gift suggestions. Another example of this type of application is retrieving image suggestions from a domain-specific knowledge area (painting, cooking, etc.), based on the user's more general visual inputs, where the user does not have specialized knowledge in the specific area. Hence, given a couple of paintings that you like as input, the semantic search could find other images for you that share similar common concepts.

This paper presents a new Image Retrieval paradigm with the following novelties:
1) Multiple image queries are used as inputs.
2) The input images are used to retrieve the concepts that the images represent, instead of merely visual similarities.
3) Implicit crowdsourcing is used to obtain textual descriptions of pictures that are less noisy than the typical surrounding text of web images.
4) Text descriptors are used as operands to discover underlying concepts common to the input images. The retrieved images capture semantic similarity from knowledge gained via text concepts as well as visual similarity.

2. METHOD

The proposed method to retrieve images from multiple query input images is explained next. Initially, each query image Ii is processed individually to retrieve candidate images from the retrieval dataset, based on visual similarity. Each query image Ii produces a set of k nearest neighbor candidates, denoted as Cij, where j goes from 1 to k.

All of the images from the dataset used for retrieval are assumed to have a dual representation: a visual representation given by a global image descriptor, and a textual descriptor represented by a histogram of all of the texts describing the image. Hence, every candidate image Cij has an associated textual descriptor, represented by a word histogram, that serves to link the visual representation with the conceptual representation.

Candidate images that are visually closer to the query image have a higher impact on the text representation of the query image. The most representative text words that describe the query images are processed using Natural Language Processing (NLP) techniques to discover new words that share conceptual similarity. Afterwards, the summation of the weighted word histograms of the candidate images Cij, together with the aggregation of bins for the discovered words, produces a histogram that represents the joint search query. Later, the cosine distance between the tf-idf textual representation of the dataset images and the joint search representation is computed to retrieve images. Finally, a re-ranking is performed to privilege images with high visual similarity to any of the query images.

2.1 Visual Similarity

We use a Convolutional Neural Network (CNN) to generate a high level visual representation of the images. The CNN presented by Krizhevsky et al. [5] contains eight layers with weights; the first five layers are convolutional and the last three layers are fully connected. The last layer contains 1000 output units fed to a softmax function that determines the class label output. Since we are interested in a high level representation of the image instead of a classifier, we remove the last layer of the network. Features are then calculated by forward propagation of a mean-subtracted 224x224 RGB image through the five convolutional layers and two fully connected layers. The result is a 4096-dimensional vector that represents the image as a global descriptor.

Since our image descriptor is global, we perform Locality-Sensitive Hashing (LSH) to obtain fast approximate nearest neighbor retrieval of candidate images. The presented scheme produces retrievals that are similar in appearance, but that also account for the meaning of the image as an object, which is one of the main goals of our proposed framework.
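To make the candidate-retrieval stage concrete, the following is a minimal sketch of indexing 4096-dimensional CNN descriptors with a sign-random-projection form of LSH. The descriptor extraction is left as a stub (extract_cnn_descriptor is a hypothetical wrapper around the truncated network), and the hashing scheme and its parameters are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def extract_cnn_descriptor(image_rgb_224):
    """Hypothetical wrapper around the truncated CNN (e.g., the Caffe reference
    network with the softmax layer removed); returns the 4096-D activation."""
    raise NotImplementedError  # depends on the deployed model

class SignRandomProjectionLSH:
    """Approximate nearest-neighbor index over global CNN descriptors."""

    def __init__(self, dim=4096, n_bits=32, seed=0):
        rng = np.random.RandomState(seed)
        self.planes = rng.randn(n_bits, dim)   # random hyperplanes define the hash
        self.buckets = {}                      # hash code -> list of image ids
        self.descriptors = {}                  # image id -> descriptor

    def _hash(self, x):
        return tuple((self.planes @ x > 0).astype(np.uint8))

    def add(self, image_id, descriptor):
        self.descriptors[image_id] = descriptor
        self.buckets.setdefault(self._hash(descriptor), []).append(image_id)

    def query(self, descriptor, k=20):
        # Candidates fall in the same bucket; rank them by Euclidean distance.
        candidates = self.buckets.get(self._hash(descriptor), [])
        ranked = sorted(candidates,
                        key=lambda c: np.linalg.norm(self.descriptors[c] - descriptor))
        return ranked[:k]
```

Each dataset image would be indexed once with add(); at query time, query(extract_cnn_descriptor(Ii)) returns the k candidates Cij for query image Ii, whose associated text is used in the next subsections.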
2.2 Textual Representation of Dataset Images

The images of the dataset are assumed to have a dual representation (visual / textual). To extract the textual representation of an image, traditionally the text surrounding the image or its associated meta-tags are used as text descriptors. However, surrounding text is highly noisy, and meta-tags are frequently absent, incomplete, or inconsistent. We believe that implicit crowdsourcing is a source of cleaner text descriptions. Implicit crowdsourcing engages users in an activity in order to gather information about another topic based on the users' actions or responses. The dataset provided for the Bing challenge [9] is an example of implicit crowdsourcing, where text for the images was added by users after they clicked on the best results for their textual searches. Many phrases can describe a single image. Instead of deciding which text best represents the image, we represent the image as a Bag of Words over all of the collected text.

2.3 Textual Representation of Query Images

The input to our retrieval system is a set of query images without tags, labels, or text descriptions. In order to operate at the conceptual level, a textual representation of the content of the query images must be generated. The textual representations of the top images retrieved using visual features are transferred to obtain a textual representation of each query image. The top retrieved candidates are weighted using a decreasing function of their visual distance to the query image. Hence, if a retrieved image matches the query image exactly, it receives a maximum weight equal to one, while the others receive lower weights. The textual representation of the query image is given by the weighted sum of the retrieved textual representations.
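A small sketch of how this textual transfer could be implemented is shown below; the paper specifies only a "decreasing function of the visual distance" with a maximum weight of one for an exact match, so the exponential form and the per-candidate normalization are assumptions.

```python
import math
from collections import Counter

def query_text_representation(candidates, sigma=1.0):
    """Build the BoW histogram of a query image from its k visual neighbors.

    `candidates` is a list of (visual_distance, word_histogram) pairs, where each
    word_histogram is a Counter over the text attached to that dataset image.
    A candidate identical to the query (distance 0) receives the maximum weight 1.
    """
    bow = Counter()
    for dist, hist in candidates:
        weight = math.exp(-dist / sigma)            # decreasing function of distance
        total = float(sum(hist.values())) or 1.0
        for word, count in hist.items():
            bow[word] += weight * count / total     # weighted, normalized word counts
    return bow
```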
2.4 Word Representation in Vector Space

The phrases that describe images are morphologically simple. The images in the dataset are fairly well described by a Bag of Words (BoW) textual representation. From a relatively broad vocabulary, only a small set of words is enough to describe the content of images. Hence, similarities between pairs of words can be enough to capture semantic relations between images.

In the proposed multi-query input image retrieval system, we combine the text descriptors of the input images, each represented by a small set of words, to infer new words that capture the common meaning of the inputs.

Discrete words can be represented in a continuous dense vector space that captures semantic knowledge learned in the text domain. The skip-gram and Continuous Bag of Words (CBOW) model architectures proposed by Mikolov et al. [7, 8] efficiently learn semantically meaningful floating point representations of words from very large text datasets. The cosine distance can be used to find the semantic similarity between two words of the vocabulary, by normalizing the vector representation of each word and computing the dot product between them.

A kernel Kij that measures the similarity between any word i and any word j of the vocabulary is calculated during the training phase. The purpose of the kernel Kij is to quickly find the similarity of any pair of words.

2.5 Concept Discovery

The representation of a query image is typically formed by only tens of words. If N is the number of query inputs and one word is selected from each query image, an N-tuple of words is formed. We want to examine the N words of the N-tuple to discover new words that relate them conceptually. A word Wl is considered a new word concept when Wl simultaneously has high similarity to all the words of the N-tuple, measured in the vector space.

The kernel Kij enables finding the similarity between any pair of words of the vocabulary. A row (or column) q of the kernel matrix Kij gives the distance of the word indexed by q to every other word of the vocabulary. Performing a descending sort and taking the indices in the selected column q gives the word indices that are conceptually closest to the word Wq.

Given an N-tuple, we can find the closest words for all the words that are part of the N-tuple. If a word Wl is found in the top positions of all the sorted lists of the N-tuple words, then the word Wl is declared as a new word conceptually shared by the N-tuple. We call this procedure "intersection."

There are many combinations of N-tuples that could be formed; however, the most interesting ones are the N-tuples created from the words that have the highest weights in the textual representation of each query image. A small number T of words with the highest weights is used to create the N-tuples. When no shared word is found in the given set of N-tuples, the number of words used to create N-tuples and the number of sorted terms to be intersected are increased until at least one common word is found. Algorithm 1 summarizes this procedure.

Algorithm 1: Discover the words conceptually shared by the input images from their textual representations.
  input : BoW representation of the N input images;
          index of the sorted columns of the kernel Kij
  output: list of new words conceptually shared by the N input images
  initialization;
  while size(ListNewWords) == 0 do
      increase the number T of words used to create N-tuples;
      increase the number of sorted terms to be intersected;
      for all N-tuples from top ranked words do
          V <- Intersection(current N-tuple);
          if size(V) != 0 then add V to ListNewWords;
      end
  end
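The discovery step can be sketched as follows, assuming the word vectors are unit-normalized so that a row of the kernel Kij is simply a dot product against the vocabulary matrix. The function names, the initial values of T and the neighbor-list depth, and their growth schedule are illustrative; only the overall intersection logic follows Algorithm 1.

```python
import numpy as np
from itertools import product

def top_neighbors(word_vecs, vocab, word, depth):
    """Indices of the `depth` words most similar to `word`; with unit-normalized
    vectors the cosine similarity is a plain dot product (one column of kernel K)."""
    sims = word_vecs @ word_vecs[vocab[word]]
    return set(np.argsort(-sims)[:depth].tolist())

def discover_shared_words(query_bows, word_vecs, vocab, T=3, depth=50):
    """Find words conceptually shared by the N query images (cf. Algorithm 1).

    `query_bows` are the weighted BoW histograms (Counters) of the query images,
    `word_vecs` is the (vocabulary size x dimension) matrix of unit-normalized
    word vectors, and `vocab` maps a word to its row index.
    """
    inv_vocab = {i: w for w, i in vocab.items()}
    new_words = set()
    while not new_words:
        # Top-T highest-weighted in-vocabulary words of each query image.
        tops = [[w for w, _ in bow.most_common() if w in vocab][:T] for bow in query_bows]
        # Every N-tuple takes one word per query image; intersect their neighbor lists.
        for ntuple in product(*tops):
            neighbor_sets = [top_neighbors(word_vecs, vocab, w, depth) for w in ntuple]
            new_words |= set.intersection(*neighbor_sets)
        T += 1        # nothing shared yet: use more top words per image ...
        depth += 50   # ... and intersect deeper similarity lists on the next pass
    return [inv_vocab[i] for i in new_words]
```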
2.6 Image Retrieval

Image retrieval is performed from a unique textual representation that accounts for all input images and the list of words conceptually shared by the input images. The textual representation of the joint query is the summation of the individual query input representations plus new bins indexed by the list of words shared by the input images. Image retrieval is performed based on the ranking score produced by the cosine scoring algorithm [6] between the joint search representation and the tf-idf representation of the images of the dataset.

Instead of comparing against the entire set of images of the dataset, we score only a smaller subset of possible outputs. This subset is formed by all the images of the dataset that contain at least one word from either the list of words conceptually shared by the inputs or the most representative words of each query input. Hence, the number of images to evaluate is reduced from one million to just a couple of thousand candidate images in the Bing dataset.

The retrieved images are ranked by their conceptual score only. Therefore, a re-scoring of the retrieved images is performed based on their visual similarity to the input images. Images that are visually inconsistent with all the input images are penalized and re-ranked to lower positions. The penalization is calculated using an inverse exponential function of the Euclidean norm of the difference between the descriptor of the retrieved image and that of its most visually similar input image.
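A sketch of this scoring and re-ranking stage is given below; the cosine score follows standard tf-idf cosine scoring [6], while the exact inverse-exponential penalty (here exp(-alpha * d) applied multiplicatively) is an assumption consistent with, but not necessarily identical to, the formulation above.

```python
import math
import numpy as np

def cosine_score(joint_bow, doc_tfidf):
    """Cosine score between the joint query histogram and one image's tf-idf
    vector, both given as sparse {word: weight} dictionaries."""
    dot = sum(w * doc_tfidf.get(t, 0.0) for t, w in joint_bow.items())
    nq = math.sqrt(sum(w * w for w in joint_bow.values()))
    nd = math.sqrt(sum(w * w for w in doc_tfidf.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rerank(scored_images, query_descriptors, image_descriptors, alpha=1.0):
    """Penalize retrievals that are visually far from every query image.

    `scored_images` is a list of (image_id, conceptual_score) pairs; the penalty
    depends on the Euclidean distance to the most visually similar query input.
    """
    reranked = []
    for image_id, score in scored_images:
        d = min(np.linalg.norm(image_descriptors[image_id] - q) for q in query_descriptors)
        reranked.append((image_id, score * math.exp(-alpha * d)))  # inverse-exponential penalty
    return sorted(reranked, key=lambda t: -t[1])
```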

3. EXPERIMENTS

We use the public domain implementation and trained model available from the caffe library [4] to calculate our visual image representation. Image resizing and feature extraction for one million images can be performed in less than 24 hours on a regular quad-core personal computer.

The dataset [9] provided for the 2013 MSR-Bing Image Retrieval Challenge (http://acmmm13.org/submissions/call-for-multimedia-grand-challenge-solutions/msr-bing-grand-challenge-on-image-retrieval-scientific-track/) was sampled from one-year click logs of the Microsoft Bing image search engine. We chose this dataset for image retrieval since the texts associated with the images are fairly accurate, because they are built from users' search criteria and click preferences. The most important advantage is that the image labeling does not require humans dedicated to this activity and that it is the product of implicit crowdsourcing. The average number of queries per image is 23.1; consequently, there is a significant amount of text describing each image.

Table 1: Mean accuracy of the retrieved images according to user ratings on 101 pairs of query images. Results are shown at different top retrieval levels.

Method                                  Top 5    Top 10   Top 15   Top 20   Top 25
Baseline (before visual re-ranking)     0.5058   0.4843   0.4823   0.4774   0.4784
Baseline (after visual re-ranking)      0.5549   0.5294   0.5124   0.5093   0.5011
Ours (before visual re-ranking)         0.7509   0.7450   0.7327   0.7172   0.7098
Ours (after visual re-ranking)          0.7765   0.7579   0.7490   0.7358   0.7243

A stop list is created to ignore frequent words in the dataset. Words that appear in over 50,000 images are ignored, since they correspond to non-discriminative words. The resulting text representation of the images is very sparse, since only a few tens of words describe each image. For this reason, the text representation can be stored very compactly; in fact, the full text representation of the one million dataset images can be loaded in 51 MB of memory.

For our experiments, we used vector representations of words trained on part of the Google News dataset, which contains about 100 billion words (https://code.google.com/p/word2vec/). The word vectors of this model have dimensionality D = 300.

3.1 Multiple Query Image Retrieval

We performed experiments to evaluate our system on a set of 101 pairs (N = 2) of image inputs. The input pairs of images were defined with the help of semantic maps downloaded from the internet (http://www.bing.com/images/search?q=semantic+mapping). A semantic map is a visual strategy for vocabulary expansion and extension of knowledge that displays words and their relations with other words [3]. Any person can define a semantic map about any topic; therefore, there is no "correct" semantic map. We chose several sets of pairs of words that were related according to the semantic maps found. Words that were not easily represented pictorially were discarded, and the remaining words were used to download example images and form conceptually related query pairs.

For each pair of available query images, we asked several users to rate the top 25 retrieved images. They were asked to provide a binary answer to the following question: Is the retrieved image conceptually similar to both input images or not?

Based on the user ratings, we calculated the mean accuracy of the retrieved images over the 101 pairs of query images. Accuracy is reported for the top X retrieved images, with X ranging from 5 to 25 in intervals of 5.

The baseline method is defined as image retrieval performed from a joint search representation given by the summation of the individual input textual representations, without the addition of the words conceptually shared by the input images.

Table 1 presents the accuracy of our method and the baseline. For both methods we include the results before and after applying visual re-ranking. Our method clearly outperforms the baseline by more than 22% in the proposed task. The visual re-ranking also helps to improve the users' ratings. The improvement is more significant in the baseline case, where the retrieved images are worse at capturing the shared concepts. This observation indicates that some users tend to rate near-identical images positively when the conceptual meaning of the query images cannot be clearly established.

4. CONCLUSIONS

We have introduced a new search paradigm that leverages the representations of multiple input images to infer concepts shared by all the input images. The retrieved images are conceptually and visually meaningful.

We presented a complete solution to the problem. Our solution includes a novel visual and textual representation of the images, exploits a dataset with annotations obtained by implicit crowdsourcing, and operates in a vector space that allows measuring the conceptual similarity between words. The proposed approach achieved a mean accuracy of 77.65% on the top 5 retrievals and 72.43% on the top 25, according to user ratings, and outperforms the baseline by more than 22%, which shows that the method works very well on the proposed problem.

5. REFERENCES

[1] R. Arandjelovic and A. Zisserman. Multiple queries for large scale specific object retrieval. In BMVC, 2012.
[2] F. Basura and T. Tuytelaars. Mining multiple queries for image retrieval: On-the-fly learning of an object-specific mid-level representation. In ICCV, 2013.
[3] G. G. Duffy. Explaining Reading, Second Edition: A Resource for Teaching Concepts, Skills, and Strategies. Guilford Press, 2009.
[4] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[6] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, 2008.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[8] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, 2013.
[9] MSR-Bing. Image retrieval challenge dataset. http://web-ngram.research.microsoft.com/GrandChallenge/Datasets.aspx.