Compact Deep Aggregation for Set Retrieval
Total Page:16
File Type:pdf, Size:1020Kb
Compact Deep Aggregation for Set Retrieval Yujie Zhong · Relja Arandjelovic´ · Andrew Zisserman Abstract The objective of this work is to learn a compact such as a group of friends or a family. Then we would like the embedding of a set of descriptors that is suitable for efficient retrieved images that contain all of the set to be ranked first, retrieval and ranking, whilst maintaining discriminability of followed by images containing subsets, e.g. if there are three the individual descriptors. We focus on a specific example of friends in the query, then first would be images containing this general problem – that of retrieving images containing all three friends, then images containing two of the three, multiple faces from a large scale dataset of images. Here the followed by images containing only one of them. Consider set consists of the face descriptors in each image, and given another example where a fashion savvy customer is research- a query for multiple identities, the goal is then to retrieve, in ing a dress and a handbag. A useful retrieval system would order, images which contain all the identities, all but one, etc. first retrieve images in which models or celebrities wear both To this end, we make the following contributions: first, items, so that the customer can see if they match. Following we propose a CNN architecture – SetNet – to achieve the ob- these, it would also be beneficial to show images containing jective: it learns face descriptors and their aggregation over a any one of the query items so that the customer can discover set to produce a compact fixed length descriptor designed for other fashionable combinations relevant to their interest. For set retrieval, and the score of an image is a count of the num- these systems to be of practical use, we would also like this ber of identities that match the query; second, we show that retrieval to happen in real time. this compact descriptor has minimal loss of discriminability These are examples of a set retrieval problem: the database up to two faces per image, and degrades slowly after that – far consists of many sets of elements (e.g. the set is an image exceeding a number of baselines; third, we explore the speed containing multiple faces, or of a person with multiple fash- vs. retrieval quality trade-off for set retrieval using this com- ion items), and we wish to order the sets according to a query pact descriptor; and, finally, we collect and annotate a large (e.g. multiple identities, or multiple fashion items) such that dataset of images containing various number of celebrities, those sets satisfying the query completely are ranked first which we use for evaluation and is publicly released. (i.e. those images that contain all the identities of the query, Keywords Image retrieval · Set retrieval · Face recognition · or all the query fashion items), followed by sets that satisfy Descriptor aggregation · Representation learning · Deep all but one of the query elements, etc. An example of this learning ranking for the face retrieval problem is shown in Fig. 1 for two queries. arXiv:2003.11794v1 [cs.CV] 26 Mar 2020 In this work, we focus on the problem of retrieving a set 1 Introduction of identities, but the same approach can be used for other set retrieval problems such as the aforementioned fashion Suppose we wish to retrieve all images in a very large collec- retrieval. We can operationalize this by scoring each face in tion of personal photos that contain a particular set of people, each photo of the collection as to whether they are one of the identities in the query. Each face is represented by a fixed Y. Zhong and A. Zisserman Visual Geometry Group, Department of Engineering Science, University length vector, and identities are scored by logistic regression of Oxford, UK classifiers. But, consider the situation where the dataset is E-mail: fyujie,[email protected] very large, containing millions or billions of images each R. Arandjelovic´ containing multiple faces. In this situation two aspects are DeepMind crucial for real time retrieval: first, that all operations take E-mail: [email protected] place in memory (not reading from disk), and second that an 2 Y. Zhong, R. Arandjelovic´ and A. Zisserman Query: Top 3 Query: Top 3 Barbara Hershey Martin Freeman Natalie Portman Mark Gatiss Vincent Cassel Fig. 1: Images ranked using set retrieval for two example queries. The query faces are given on the left of each example column, together with their names (only for reference). Left: a query for two identities; right: a query for three identities. The first ranked image in each case contains all the faces in the query. Lower ranked images partially satisfy the query, and contain progressively fewer faces of the query. The results are obtained using the compact set retrieval descriptor generated by the SetNet architecture, by searching over 200k images of the Celebrity Together dataset introduced in this paper. efficient algorithm is used when searching for images that is small, typically 128 (to keep the memory footprint low), satisfy the query. The problem is that storing a fixed length and N is in the thousands. vector for each face in memory is prohibitively expensive We make the following contributions: first, we introduce at this scale, but this cost can be significantly reduced if a a trainable CNN architecture for the set-retrieval task that fixed length vector is only stored for each set of faces in an is able to learn to aggregate face vectors into a fixed length image (since there are far fewer images than faces). As well descriptor in order to minimize interference, and also is able as reducing the memory cost this also reduces the run time to rank the face sets according to how many identities are cost of the search since fewer vectors need to be scored. in common with the query using this descriptor. To do this, So, the question we investigate in this paper is the fol- we propose a paradigm shift where we draw motivation from lowing: can we aggregate the set of vectors representing the image retrieval based on local descriptors. In image retrieval, multiple faces in an image into a single vector with little it is common practice to aggregate all local descriptors of loss of set-retrieval performance? If so, then the cost of both an image into a fixed-size image-level vector representa- memory and retrieval can be significantly reduced as only tion, such as bag-of-words (Sivic and Zisserman, 2003) and one vector per image (rather than one per face) have to be VLAD (Jegou´ et al., 2010); this brings both memory and stored and scored. speed improvements over storing all local descriptors indi- Although we have motivated this question by face re- vidually. We generalize this concept to set retrieval, where trieval it is quite general: there is a set of elements (which instead of aggregating local interest point descriptors, set ele- are permutation-invariant), each element is represented by ment descriptors are pooled into a single fixed-size set-level representation. For the particular case of face set retrieval, a vector of dimension De, and we wish to represent this set this corresponds to aggregating face descriptors into a set by a single vector of dimension D, where D = De in practice, without losing information essential for the task. Of course, representation. The novel aggregation procedure is described if the total number of elements in all sets, N, is such that in Sec. 3 where the aggregator can be trained in an end-to-end N ≤ D then this certainly can be achieved provided that the manner with different feature extractor CNNs. set of vectors are orthogonal. However, we will consider the Our second contribution is to introduce a dataset anno- situation commonly found in practice where N D, e.g. D tated with multiple faces per images. In Sec. 4 we describe Compact Deep Aggregation for Set Retrieval 3 a pipeline for automatically generating a labelled dataset of VLAD has recently been adapted into a differentiable pairs (or more) of celebrities per image. This Celebrity To- CNN layer, NetVLAD (Arandjelovic´ et al., 2016), making gether dataset contains around 200k images with more than it end-to-end trainable. We incorporate a modified form of half a million faces in total. It is publicly released. the NetVLAD layer in our SetNet architecture. An alterna- The performance of the set-level descriptors is evaluated tive, but related, very recent approach is the memory vector in Sec. 5. We first ‘stress test’ the descriptors by progressively formulation proposed by Iscen et al. (2017), but we have not increasing the number of faces in each set, and monitoring employed it here as it has not been made differentiable yet. their retrieval performance. We also evaluate retrieval on the Celebrity Together dataset, where images contain a variable number of faces, with many not corresponding to the queries, 2.2 Image retrieval and explore efficient algorithms that can achieve immediate (real-time) retrieval on very large scale datasets. Finally, we Another strand of research we build on is category level investigate what the trained network learns by analyzing why retrieval, where in our case the category is a face. This is an- the face descriptors learnt by the network are well suited for other classical area of interest with many related works (Chat- aggregation. field et al., 2011; Perronnin et al., 2010b; Torresani et al., Note, although we have grounded the set retrieval prob- 2010; Wang et al., 2010; Zhou et al., 2010).