Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Online Learning to Rank for Content-Based Image Retrieval∗

Ji Wan1,2,3, Pengcheng Wu2, Steven C. H. Hoi2, Peilin Zhao4, Xingyu Gao1,2,3, Dayong Wang5, Yongdong Zhang1, Jintao Li1
1 Key Laboratory of Intelligent Information Processing of CAS, ICT, CAS, China
2 Singapore Management University
3 University of the Chinese Academy of Sciences
4 Institute for Infocomm Research, A*STAR, Singapore
5 Michigan State University, MI, USA
{wanji,gaoxingyu,zhyd,jtli}@ict.ac.cn, {pcwu,chhoi}@smu.edu.sg, [email protected], [email protected]

Abstract

A major challenge in Content-Based Image Retrieval (CBIR) is to bridge the semantic gap between low-level image contents and high-level semantic concepts. Although researchers have investigated a variety of retrieval techniques using different types of features and distance functions, no single best retrieval solution can fully tackle this challenge. In a real-world CBIR task, it is often highly desired to combine multiple types of different feature representations and diverse distance measures in order to close the semantic gap. In this paper, we investigate a new framework of learning to rank for CBIR, which aims to seek the optimal combination of different retrieval schemes by learning from large-scale training data in CBIR. We first formulate the problem formally as a learning to rank task, which can be solved in general by applying the existing batch learning to rank algorithms from text information retrieval (IR). To further address the scalability towards large-scale online CBIR applications, we present a family of online learning to rank algorithms, which are significantly more efficient and scalable than classical batch algorithms for large-scale online CBIR. Finally, we conduct an extensive set of experiments, in which encouraging results show that our technique is effective, scalable and promising for large-scale CBIR.

1 Introduction

Content-based image retrieval (CBIR) has been extensively studied for many years in the multimedia and computer vision communities. Extensive efforts have been devoted to various low-level feature descriptors [Jain and Vailaya, 1996] and different distance measures defined on some specific sets of low-level features [Manjunath and Ma, 1996]. Recent years also witness the surge of research on local feature based representations, such as the bag-of-words models [Sivic et al., 2005] using local feature descriptors (e.g., SIFT [Lowe, 1999]).

Although CBIR has been studied extensively for years, it is often hard to find a single best retrieval scheme, i.e., some pair of feature representation and distance measure, which can consistently beat the others in all scenarios. It is thus highly desired to combine multiple types of diverse feature representations and different kinds of distance measures in order to improve the retrieval accuracy of a real-world CBIR task. In practice, it is however nontrivial to seek an optimal combination of different retrieval schemes, especially in web-scale CBIR applications with millions or even billions of images. Besides, for real-world CBIR applications, the optimal combination weights for different image retrieval tasks may vary across different application domains. Thus, it has become an urgent research challenge to investigate an automated and effective learning solution for seeking the optimal combination of multiple diverse retrieval schemes in CBIR.

To tackle the above challenge, in this paper, we investigate a framework of applying learning to rank algorithms to seek the optimal combination of multiple diverse retrieval schemes for CBIR by learning from large-scale training data automatically. In particular, we first formulate the problem as a learning to rank task, which thus can be solved in general by applying the existing batch learning to rank algorithms in text IR. However, to further improve efficiency and scalability, we present a family of online learning to rank algorithms to cope with the challenge of large-scale learning in CBIR. We give a theoretical analysis of the proposed online learning to rank algorithms, and empirically show that the proposed algorithms are both effective and scalable for large-scale CBIR tasks.

In summary, the main contributions of this paper include: i) We conduct a comprehensive study of applying learning to rank techniques to CBIR, aiming to seek the optimal combination of multiple retrieval schemes; ii) We propose a family of efficient and scalable online learning to rank algorithms for CBIR; iii) We analyze the theoretical bounds of the proposed online learning to rank algorithms, and also examine their empirical performance extensively.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents our problem formulation and a family of online learning to rank algorithms for CBIR, and Section 4 gives theoretical analysis. Section 5 discusses our experiments and Section 6 concludes this work.

∗ This work was supported by Singapore MOE tier 1 research grant (C220/MSS14C003) and the National Natural Science Foundation of China (61428207).

2 Related Work

2.1 Learning to Rank and CBIR
Learning to rank has been extensively studied in text Information Retrieval (IR) [Qin et al., 2010]. In general, most existing approaches can be grouped into three major categories: (i) pointwise, (ii) pairwise, and (iii) listwise approaches. We briefly review related work in each category below.

The first group, the family of pointwise learning to rank approaches, simply treats ranking as a regular classification or regression problem by learning to predict numerical ranking values of individual objects. For example, in [Cooper et al., 1992; Crammer and Singer, 2001; Li et al., 2007], the ranking problem was formulated as a regression task in different forms. In addition, [Nallapati, 2004] formulated the ranking problem as a binary classification of relevance on document objects, and solved it by applying some discriminative models such as SVM.

The second group of learning to rank algorithms, the family of pairwise approaches, treats pairs of documents as training instances and formulates ranking as a task of learning a classification or regression model from the collection of pairwise instances of documents. A variety of pairwise learning to rank algorithms have been proposed by applying different machine learning algorithms [Joachims, 2002; Burges et al., 2005; Tsai et al., 2007]. The well-known algorithms include SVM-based approaches such as RankSVM [Joachims, 2002], neural network based approaches such as RankNet [Burges et al., 2005], and boosting-based approaches such as RankBoost [Freund et al., 2003], etc. This group is the most widely explored research direction of learning to rank, in which many techniques have been successfully applied in real-world commercial systems. In general, our proposed approaches belong to this category.

The third group, the family of listwise learning to rank approaches, treats the list of documents for a query as a training instance and attempts to learn a ranking model by optimizing some loss functions defined on the predicted list and the ground-truth list. There are two different kinds of approaches in this category. The first is to directly optimize some IR metrics, such as Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) [Järvelin and Kekäläinen, 2000]. Example algorithms include AdaRank [Xu and Li, 2007] and SVM-MAP [Yue et al., 2007], which optimize MAP, and SoftRank [Taylor et al., 2008] and NDCG-Boost [Valizadegan et al., 2009], which optimize NDCG, etc. The other is to indirectly optimize the IR metrics by defining some listwise loss function, such as ListNet [Cao et al., 2007] and ListMLE [Xia et al., 2008].

Unlike the extensive studies in the text IR literature, learning to rank has seldom been explored in CBIR, except for some recent studies in [Faria et al., 2010; Pedronette and da S Torres, 2013] which simply applied some classical batch learning to rank algorithms. Unlike their direct use of batch learning to rank algorithms, which are hardly scalable for large-scale CBIR applications, we propose a family of efficient and scalable online learning to rank algorithms and evaluate them extensively on a comprehensive testbed. Finally, we note that our work is also very different from a large family of diverse existing studies in CBIR [He et al., 2004; Hoi et al., 2006; Chechik et al., 2010] that usually aim to apply machine learning techniques (supervised or semi-supervised learning) to learn a good ranking function on a single type of features or some combined features. Such existing techniques could potentially be incorporated as one component of our scheme, which is out of the scope of the discussions in this work.

2.2 Online Learning
Online learning is a family of efficient and scalable machine learning algorithms [Rosenblatt, 1958; Crammer et al., 2006] that has been extensively studied in machine learning for years. In general, online learning operates in a sequential manner. Consider online classification: at each time step, an online learner processes an incoming example by first predicting its class label; after that, it receives the true class label from the environment, which is then used to measure the loss between the predicted label and the true label; at the end of each time step, the learner is updated whenever the loss is nonzero. Typically, the goal of an online learning task is to minimize the cumulative mistakes over the entire sequence of predictions.

In the literature, a variety of algorithms have been proposed for online learning [Hoi et al., 2014]. The most well-known example is the Perceptron algorithm [Rosenblatt, 1958]. In recent years, various algorithms have been proposed to improve the Perceptron [Li and Long, 1999; Crammer et al., 2006], which usually follow the criterion of the maximum margin learning principle. A notable approach is the family of Passive-Aggressive (PA) learning algorithms [Crammer et al., 2006], which update the classifier whenever the online learner fails to produce a large margin on the current instance. These algorithms are often more efficient and scalable than batch learning algorithms. In this work, we aim to extend the existing online learning principle for developing new learning to rank algorithms. In addition, we note that our work is also very different from another study in [Grangier and Bengio, 2008] which focuses on text-based image retrieval by applying PA algorithms. By contrast, our CBIR study focuses on image retrieval based on visual similarity. Finally, the proposed online learning to rank is based on linear models and is thus more scalable than kernel-based similarity learning approaches [Xia et al., 2014].

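To make the sequential protocol described above concrete, the following minimal Python sketch (our own illustration, not part of the original paper; the data stream and dimensionality names are hypothetical) shows a Perceptron-style online classifier that predicts, receives the true label, and updates only when it makes a mistake:

```python
import numpy as np

def online_perceptron(stream, dim):
    """Generic online learning loop: predict, receive label, update on error.

    `stream` yields (x, y) pairs with x a dim-dimensional feature vector and
    y in {+1, -1}; both names are illustrative placeholders.
    """
    w = np.zeros(dim)                         # current linear model
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if w.dot(x) >= 0 else -1    # predict before seeing the label
        if y_hat != y:                        # loss is nonzero: update the model
            w += y * x                        # Perceptron update [Rosenblatt, 1958]
            mistakes += 1
    return w, mistakes
```

The online learning to rank algorithms presented in Section 3 follow the same predict-then-update pattern, but operate on pairwise ranking instances instead of single examples.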
3 Online Learning to Rank for CBIR

In this section, we present the problem formulation and the proposed online learning to rank algorithms for CBIR.

3.1 Problem Formulation
Let us denote by I an image space. Each training instance received at time step t is represented by a triplet (q_t, p_t^1, p_t^2), where q_t ∈ I denotes the t-th query in the entire collection of queries, and p_t^1 ∈ I and p_t^2 ∈ I denote a pair of images for ranking prediction w.r.t. the query q_t. We also denote by y_t ∈ {+1, −1} the true ranking order of the pairwise instance at step t, such that image p_t^1 is ranked before p_t^2 if y_t = +1; otherwise p_t^1 is ranked after p_t^2. We introduce a mapping function

    φ : I × I → R^n,

which creates an n-dimensional feature vector from an image pair. For example, consider φ(q, p) ∈ R^n: one way to extract one of the n features is based on different similarity measures over different feature descriptors.

The goal of a learning to rank task is to search for the optimal ranking model w ∈ R^n with the following target ranking function for any triplet instance (q_t, p_t^1, p_t^2):

    f(q_t, p_t^1, p_t^2) = w^T φ(q_t, p_t^1, p_t^2) = w^T (φ(q_t, p_t^1) − φ(q_t, p_t^2)).

By learning an optimal model, we expect the prediction output by the function f(q_t, p_t^1, p_t^2) to be positive if image p_t^1 is more similar to the query q_t than another image p_t^2, and negative otherwise.

In particular, for a sequence of T triplet training instances, our goal is to optimize the sequence of ranking models w_1, ..., w_T so as to minimize the prediction mistakes during the entire online learning process. Below we present a family of online learning algorithms to tackle the learning to rank tasks. We note that we mainly explore first-order online learning techniques due to their high efficiency, but a similar idea could also be extended by exploring second-order online learning techniques [Dredze et al., 2008].

3.2 Online Perceptron Ranking (OPR)
The Online Perceptron Ranking (OPR) algorithm follows the idea of the Perceptron [Rosenblatt, 1958], a classical online learning algorithm. In particular, given any training instance (q_t, p_t^1, p_t^2) and true label y_t at step t, OPR makes the following update:

    w_{t+1} = w_t + y_t (φ(q_t, p_t^1) − φ(q_t, p_t^2)),    (1)

whenever y_t w_t^T (φ(q_t, p_t^1) − φ(q_t, p_t^2)) < 0; otherwise, the ranking model remains unchanged.

3.3 Online Passive-Aggressive Ranking (OPAR)
The Online Passive-Aggressive Ranking (OPAR) algorithm follows the idea of online passive-aggressive (PA) learning [Crammer et al., 2006] to tackle this challenge. In particular, we first formulate the problem as an optimization task (OPAR-I):

    w_{t+1} = arg min_w (1/2) ‖w − w_t‖^2 + C ℓ(w; (q_t, p_t^1, p_t^2), y_t),    (2)

where ℓ(w) is a hinge loss defined as

    ℓ(w) = max(0, 1 − y_t w^T (φ(q_t, p_t^1) − φ(q_t, p_t^2))),

and C > 0 is a penalty cost parameter. We can also formulate this problem as another variant (OPAR-II):

    w_{t+1} = arg min_w (1/2) ‖w − w_t‖^2 + C ℓ(w; (q_t, p_t^1, p_t^2), y_t)^2.    (3)

The above two optimizations trade off two major concerns: (i) the updated ranking model should not deviate too much from the previous ranking model w_t, and (ii) the updated ranking model should suffer a small loss on the triplet training instance (q_t, p_t^1, p_t^2). The tradeoff is essentially controlled by the penalty cost parameter C. Finally, we can derive the following proposition for the closed-form solutions to the above optimizations.

Proposition 1. The optimizations in (2) and (3) have the following closed-form solution:

    w_{t+1} = w_t + λ_t y_t (φ(q_t, p_t^1) − φ(q_t, p_t^2)),    (4)

where λ_t for (2) is computed as

    λ_t = min( C, ℓ_t(w_t) / ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2 ),    (5)

and λ_t for (3) is computed as

    λ_t = max(0, 1 − y_t w_t^T (φ(q_t, p_t^1) − φ(q_t, p_t^2))) / ( ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2 + 1/(2C) ).

The above proposition can be obtained by following the similar idea of passive-aggressive learning in [Crammer et al., 2006]. We omit the details here due to the space limitation. From the results, we can see that the ranking model remains unchanged if y_t w_t^T (φ(q_t, p_t^1) − φ(q_t, p_t^2)) ≥ 1. That is, we will update the ranking model whenever the current ranking model fails to rank the order of p_t^1 and p_t^2 w.r.t. query q_t correctly at a sufficiently large margin.

3.4 Online Gradient Descent Ranking (OGDR)
The Online Gradient Descent Ranking (OGDR) algorithm follows the idea of Online Gradient Descent [Zinkevich, 2003] to tackle our problem. When receiving a training instance (q_t, p_t^1, p_t^2) and its true label y_t at each time step t, we suffer a hinge loss ℓ(w_t; (q_t, p_t^1, p_t^2), y_t) as defined in Section 3.3. We then update the ranking model by gradient descent on the loss function:

    w_{t+1} = w_t − η ∇ℓ(w_t; (q_t, p_t^1, p_t^2), y_t),    (6)

where η is the learning rate. More specifically, whenever the loss ℓ(w_t; (q_t, p_t^1, p_t^2), y_t) is nonzero, OGDR makes the following update:

    w_{t+1} = w_t + η y_t (φ(q_t, p_t^1) − φ(q_t, p_t^2)).    (7)
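As an illustration (a sketch under our own assumptions, not the authors' released code), the three update rules above can be implemented in a few lines of Python/NumPy; `phi` stands for the query-dependent feature mapping φ(q, p), and all function names are hypothetical:

```python
import numpy as np

def pairwise_feature(phi, q, p1, p2):
    """x_t = phi(q, p1) - phi(q, p2); phi is any query-image feature map."""
    return phi(q, p1) - phi(q, p2)

def opr_update(w, x, y):
    """Online Perceptron Ranking, Eq. (1): update only on a ranking mistake."""
    if y * w.dot(x) < 0:
        w = w + y * x
    return w

def opar_update(w, x, y, C, variant="I"):
    """Online Passive-Aggressive Ranking, Eqs. (4)-(5): closed-form step size."""
    loss = max(0.0, 1.0 - y * w.dot(x))        # hinge loss on the pair
    sq_norm = x.dot(x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                               # passive step: no update needed
    if variant == "I":
        lam = min(C, loss / sq_norm)           # OPAR-I step size
    else:
        lam = loss / (sq_norm + 1.0 / (2.0 * C))   # OPAR-II step size
    return w + lam * y * x

def ogdr_update(w, x, y, eta):
    """Online Gradient Descent Ranking, Eqs. (6)-(7): gradient step on the hinge loss."""
    if 1.0 - y * w.dot(x) > 0:
        w = w + eta * y * x
    return w
```

Each update touches only one n-dimensional vector, which is why the per-instance cost is linear in n and the methods scale to millions of training triplets.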

4 Theoretical Analysis

In this section, we analyze the performance of the proposed online algorithms. We first present a lemma that discloses the relationship between the cumulative loss and an IR performance measure, i.e., mean average precision (MAP).

Lemma 1. For one query q_t and its related images, the MAP is lower bounded as follows:

    MAP ≥ 1 − (γ_MAP / T) Σ_t ℓ(w; (q_t, p_t^1, p_t^2), y_t),

where γ_MAP = 1/m, and m is the number of relevant documents.

Proof. Using the essential loss idea defined in [Chen et al., 2009], from Theorem 1 of [Chen et al., 2009] we can see that the essential loss is an upper bound of measure-based ranking errors; besides, the essential loss is a lower bound of the sum of the pairwise hinge losses, using the properties of the hinge loss, which is non-negative, non-increasing, and satisfies ℓ(0) = 1.

The above lemma indicates that if we can prove bounds on the online cumulative hinge loss relative to the best ranking model chosen with all data available beforehand, we can obtain cumulative IR measure bounds. Fortunately, there are strong theoretical loss bounds for the proposed online learning to rank algorithms. Therefore, we can prove the MAP bounds for each of the proposed algorithms as follows.

Theorem 1. Assume max_t ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2 ≤ X. The MAP of the Online Perceptron Ranking algorithm is bounded as

    MAP ≥ 1 − (γ_MAP X)/2 − (γ_MAP / T) { (1/2)‖w‖^2 + Σ_t ℓ_t(w) }.

Proof. Define ∆_t = ‖w_t − w‖^2 − ‖w_{t+1} − w‖^2. It is not difficult to see that

    Σ_{t=1}^T ∆_t = ‖w‖^2 − ‖w_{T+1} − w‖^2 ≤ ‖w‖^2.

In addition, according to the update rule, we have

    ∆_t = −2 y_t (w_t − w)·(φ(q_t, p_t^1) − φ(q_t, p_t^2)) − ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2
        ≥ 2 ℓ_t(w_t) − 2 ℓ_t(w) − X.

Combining the above two inequalities results in

    ‖w‖^2 ≥ Σ_t [ 2 ℓ_t(w_t) − 2 ℓ_t(w) − X ].

Re-arranging the above inequality gives

    Σ_t ℓ_t(w_t) ≤ (1/2)‖w‖^2 + Σ_t [ ℓ_t(w) + X/2 ].

Plugging the above inequality into Lemma 1 concludes the proof.

Theorem 2. Assume max_t ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2 ≤ X. The MAP of the OPAR-I algorithm is bounded as follows:

    MAP ≥ 1 − (γ_MAP C X)/(2 λ_*) − (γ_MAP / (T λ_*)) { (1/2)‖w‖^2 + C Σ_t ℓ_t(w) },

where λ_* = min_{λ_t > 0} λ_t, while the MAP of the OPAR-II algorithm is bounded as:

    MAP ≥ 1 − (γ_MAP (X + 1/(2C)) / (ℓ_* T)) { ‖w‖^2 + 2C Σ_t ℓ_t(w)^2 },

where ℓ_* = min_{ℓ_t > 0} ℓ_t.

Proof. Firstly, for the OPAR-I algorithm, it is not difficult to show that

    Σ_t ∆_t = ‖w‖^2 − ‖w_{T+1} − w‖^2 ≤ ‖w‖^2,

where ∆_t = ‖w_t − w‖^2 − ‖w_{t+1} − w‖^2. In addition, using the relation between w_t and w_{t+1} gives

    ∆_t = −2 λ_t y_t (w_t − w)^T [φ(q_t, p_t^1) − φ(q_t, p_t^2)] − λ_t^2 ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2
        ≥ 2 λ_t ℓ_t(w_t) − 2 λ_t ℓ_t(w) − λ_t^2 X.

Combining the above two inequalities gives

    ‖w‖^2 ≥ Σ_t { 2 λ_t ℓ_t(w_t) − 2 λ_t ℓ_t(w) − λ_t^2 X }.

Denote λ_* = min_{λ_t > 0} λ_t; re-arranging the above inequality, we get

    Σ_t ℓ_t(w_t) ≤ (1/λ_*) { (1/2)‖w‖^2 + C Σ_t ℓ_t(w) + (1/2) C X T }.

Plugging the above inequality into Lemma 1 concludes the first part of this theorem.

Similarly, assuming ℓ_* = min_{ℓ_t(w_t) > 0} ℓ_t(w_t) for the OPAR-II algorithm, we can prove

    Σ_t ℓ_t(w_t) ≤ ((X + 1/(2C)) / ℓ_*) { ‖w‖^2 + 2C Σ_t ℓ_t(w)^2 }.

Combining the above inequality with Lemma 1 concludes the second part of this theorem.

Theorem 3. Assume max_t ‖φ(q_t, p_t^1) − φ(q_t, p_t^2)‖^2 ≤ X. The MAP of the Online Gradient Descent Ranking algorithm is bounded as:

    MAP ≥ 1 − (γ_MAP / T) { (1/(2η))‖w‖^2 + (η X T)/2 + Σ_{t=1}^T ℓ_t(w) }.

Proof. Firstly, according to Eq. (6), we have

    ‖w_{t+1} − w‖^2 = ‖w_t − η ∇ℓ_t(w_t) − w‖^2
                    = ‖w_t − w‖^2 − 2 ⟨w_t − w, η ∇ℓ_t(w_t)⟩ + ‖η ∇ℓ_t(w_t)‖^2.

The above equality can be reformulated as follows:

    ⟨w_t − w, ∇ℓ_t(w_t)⟩ = (1/(2η)) [ ‖w_t − w‖^2 − ‖w_{t+1} − w‖^2 + ‖η ∇ℓ_t(w_t)‖^2 ].    (8)

Secondly, ℓ_t(·) is convex, so

    ℓ_t(w) ≥ ℓ_t(w_t) + ⟨∇ℓ_t(w_t), w − w_t⟩.

Reformulating this inequality and plugging it into Eq. (8) results in

    ℓ_t(w_t) − ℓ_t(w) ≤ ⟨∇ℓ_t(w_t), w_t − w⟩
                      = (1/(2η)) [ ‖w_t − w‖^2 − ‖w_{t+1} − w‖^2 + ‖η ∇ℓ_t(w_t)‖^2 ].

Summing the above inequality over t, we get

    Σ_{t=1}^T ℓ_t(w_t) − Σ_{t=1}^T ℓ_t(w)
      ≤ (1/(2η)) Σ_{t=1}^T [ ‖w_t − w‖^2 − ‖w_{t+1} − w‖^2 + ‖η ∇ℓ_t(w_t)‖^2 ]
      = (1/(2η)) [ ‖w_1 − w‖^2 − ‖w_{T+1} − w‖^2 ] + (1/(2η)) Σ_{t=1}^T ‖η ∇ℓ_t(w_t)‖^2
      ≤ (1/(2η)) ‖w‖^2 + (η/2) Σ_{t=1}^T ‖y_t (φ(q_t, p_t^1) − φ(q_t, p_t^2))‖^2.

Re-arranging the above inequality results in

    Σ_{t=1}^T ℓ_t(w_t) ≤ (1/(2η)) ‖w‖^2 + (η X T)/2 + Σ_{t=1}^T ℓ_t(w).

Plugging the above inequality into Lemma 1 concludes the proof.
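As a side note of our own (following the standard analysis of online gradient descent rather than anything stated in the paper), the two η-dependent terms in Theorem 3 can be balanced by tuning the learning rate, which makes the bound's dependence on T explicit:

```latex
% Our own remark: balancing \frac{1}{2\eta}\|w\|^2 and \frac{\eta X T}{2} in Theorem 3
\eta^{\star} = \frac{\|w\|}{\sqrt{XT}}
\quad\Longrightarrow\quad
\mathrm{MAP} \;\ge\; 1 - \frac{\gamma_{\mathrm{MAP}}}{T}\Big(\|w\|\sqrt{XT} + \sum_{t=1}^{T}\ell_t(w)\Big).
```

In other words, when a fixed ranking model with small cumulative hinge loss exists, the extra gap introduced by learning online shrinks at a rate of O(1/√T).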

5 Experiments

We conduct an extensive set of experiments for benchmark evaluations of varied learning to rank algorithms for CBIR tasks, including both batch and online learning algorithms.

5.1 Testbeds for Learning to Rank
Table 1 shows the list of image databases in our testbed. For each database, we randomly split it into five folds, in which one fold is used for test, one is for validation, and the rest are for training. Besides, to test the scalability of our technique for large-scale CBIR, we also include a large database ("ImageCLEFFlickr"), which includes ImageCLEF as a ground-truth subset and 1-million distracting images from Flickr.

Table 1: List of image databases in our testbed.
Datasets          #images     #classes   #train-instances
Holiday           1,491       500        200,000
Caltech101        8,677       101        200,000
ImageCLEF         7,157       20         200,000
Corel             5,000       50         200,000
ImageCLEFFlickr   1,007,157   21         3,000,000

To generate training data of query-dependent descriptors, for each query in a dataset, we involve all positive/relevant images and sample a subset of negative/irrelevant images. The feature mapping φ(q, p) ∈ R^n is computed over 9 different features with 4 similarity measurements, which results in a 36-dimensional feature representation.
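To illustrate how such a query-dependent mapping can be assembled (a sketch under our own assumptions; the paper does not list the exact descriptors or similarity functions, so the names below are placeholders), one can concatenate one similarity score per (descriptor, measure) pair:

```python
import numpy as np

def build_phi(descriptors, similarities):
    """Return phi(q, p): one similarity score per (descriptor, measure) pair.

    `descriptors`: list of functions mapping an image to a feature vector
                   (e.g., color, texture, edge, or bag-of-words histograms).
    `similarities`: list of functions s(u, v) -> float (e.g., cosine similarity,
                    negative L1/L2 distance, histogram intersection).
    With 9 descriptors and 4 similarity measures this yields a 36-dimensional
    vector, matching the dimensionality described above.
    """
    def phi(q, p):
        feats = []
        for extract in descriptors:
            fq, fp = extract(q), extract(p)
            for sim in similarities:
                feats.append(sim(fq, fp))
        return np.asarray(feats)
    return phi
```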
Due to the low efficiency of the existing batch learning to rank algorithms, we design two different experiments. The first aims to evaluate different learning to rank algorithms on all the standard databases, in which we can only sample a total of 200,000 training instances as the training data set to ensure that all the batch learning to rank algorithms can be completed. The second aims to examine if the proposed technique can cope with a large amount of training data, in which a total of 3-million training instances were generated in the training data set. For the validation and test data sets, we randomly choose 300 validation images and 150 test images from each fold.

5.2 Setup and Compared Algorithms
To conduct a fair evaluation, we choose the parameters of the different algorithms via the same cross-validation scheme in all the experiments. To evaluate the retrieval performance, we adopt the mean Average Precision (mAP), a metric widely used in IR, which is calculated based on the Average Precision (AP) value of all the queries, where the value of AP is the area under the precision-recall curve for a query.

To evaluate the efficacy of our scheme, we compare the proposed family of online learning to rank algorithms, including OPR, OPAR-I, OPAR-II and OGDR, against several representative batch learning to rank algorithms in text IR, including RankNet [Burges et al., 2005], Coordinate Ascent ("C-Ascent") [Metzler and Croft, 2007], RankSVM [Herbrich et al., 2000] and LambdaMART ("λ-MART") [Wu et al., 2010]. Besides, we also evaluate two straightforward baselines: (i) "Best-Fea": it selects the best query-dependent descriptor for ranking via cross validation; and (ii) "Uni-Con": it uniformly combines all the query-dependent descriptors for ranking.

5.3 Evaluation on Standard Datasets
We first evaluate the algorithms on the standard datasets. Table 2 shows the average MAP performance on four standard datasets. Several observations can be drawn as follows.

Table 2: Evaluation of the average MAP performance.
Algorithm   Holiday   Caltech101   ImageCLEF   Corel
Best-Fea    0.4892    0.2664       0.5777      0.1846
Uni-Con     0.5175    0.2594       0.6174      0.2990
RankNet     0.6292    0.2753       0.6326      0.3133
C-Ascent    0.6373    0.3193       0.6803      0.3406
RankSVM     0.6429    0.3270       0.6585      0.3366
λ-MART      0.6230    0.3650       0.6796      0.3683
OPR         0.6219    0.3285       0.6555      0.3292
OPAR-I      0.6329    0.3070       0.6556      0.3340
OPAR-II     0.6283    0.3157       0.6632      0.3389
OGDR        0.6368    0.3024       0.6626      0.3228

First, we observe that all the learning to rank algorithms outperform the two heuristic baselines ("Best-Fea" and "Uni-Con") in most cases. This clearly demonstrates that the proposed learning to rank framework can effectively combine different feature representations and distance measures for improving image retrieval performance. Second, comparing the different batch learning to rank algorithms, we observe that no single method can beat the others on all datasets, which is consistent with previous empirical studies in text IR; λ-MART tends to perform slightly better, attaining the best performance on 2 out of 4 datasets. Third, by examining the proposed online learning to rank algorithms, we found that their average mAP performance is fairly comparable to the batch algorithms, which indicates that the online algorithms are at least as effective as the existing batch algorithms in terms of retrieval efficacy.

To evaluate the efficiency and scalability, we measure the time cost taken by the different learning to rank algorithms given different amounts of training data. Figure 1 shows the evaluation of CPU time cost on the Corel dataset for different amounts of training instance streams drawn from a total of 200,000 training instances.

[Figure 1: Cumulative time cost on Corel w/ 200k instances. (Log-scale CPU time in seconds vs. number of training instances, for RankNet, C-Ascent, RankSVM, λ-MART, OPR, OPAR-I, OPAR-II and OGDR.)]

The online learning algorithms take only tens of seconds for training, while the batch learning algorithms are much slower, e.g., C-Ascent takes around 2 hours. It is clear to see that the proposed online algorithms are considerably more efficient and scalable than most of the existing batch algorithms.

5.4 Evaluation on the Large-scale Dataset
In this experiment, we evaluate the proposed family of online learning to rank algorithms on the large-scale dataset, i.e., the ImageCLEFFlickr data set with over 1-million images and 3-million training instances. For the batch algorithms, we can only evaluate RankSVM, since the other algorithms are too computationally intensive to run on this data set.

[Figure 2: Evaluation of the MAP performance on the ImageCLEFFlickr dataset with over 1-million images. (mAP per fold for Best-Fea, Uni-Con, RankSVM, OPR, OPAR-I, OPAR-II and OGDR.)]

Figure 2 shows the evaluation of mAP performance on five different folds and Table 3 shows the evaluation of running time cost on 3-million training instances. We can draw several observations from the results. First, the online learning to rank algorithms generally outperform the baseline algorithms without learning to rank significantly. Furthermore, our proposed algorithms achieve better or at least comparable accuracy performance than the state-of-the-art batch learning to rank approaches. Finally, the online learning to rank algorithms are generally more efficient than the batch algorithm.

Table 3: Running time (s) on 3-million training instances.
RankSVM   OPR    OPAR-I   OPAR-II   OGDR
4737      1154   1370     1708      2307

5.5 Evaluation on Large-Scale Online CBIR
We now simulate a real-world online CBIR system by assuming that training data arrive sequentially. This is a more realistic setting, especially for web image search engines where user query log data collected from click actions often arrive sequentially. At each iteration, after receiving a query image, we first apply the previously learned model for CBIR, and then assume that the top 50 retrieved images will be labeled, e.g., interactively via a relevance feedback scheme. After that, the newly received labeled data are adopted to update the model, which will then serve the next query image. Because batch learning algorithms require all the labeled data (including those from earlier iterations) to be available for training, we employ a reservoir scheme that caches all the labeled data for RankSVM.
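The simulated protocol can be summarized by the following sketch (our own rendering; the retrieval, labeling, and triplet-generation helpers are hypothetical placeholders rather than functions defined in this paper):

```python
def simulate_online_cbir(queries, database, model, retrieve, label_top_k, make_triplets, k=50):
    """Sequential protocol of Section 5.5: retrieve, label the top-k, then update.

    `retrieve(model, q, database, k)` returns the k images ranked highest by the
    current model; `label_top_k(q, results)` simulates relevance feedback and
    returns (image, relevance) pairs; `make_triplets(q, labeled)` turns them into
    (q, p1, p2, y) training triplets. All of these are hypothetical helpers.
    """
    reservoir = []                                   # cache of all labeled data (for batch re-training)
    for q in queries:
        results = retrieve(model, q, database, k)    # apply the current model first
        labeled = label_top_k(q, results)            # relevance feedback on the top-k results
        reservoir.extend(labeled)
        for (q_, p1, p2, y) in make_triplets(q, labeled):
            model.update(q_, p1, p2, y)              # online update (e.g., OPR/OPAR/OGDR)
    return model, reservoir
```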

Specifically, we use ImageCLEFFlickr as the database set, and randomly select 2,000 images as sequential queries. Figure 3 shows the improvement in NDCG@50 of the different algorithms over the Uni-Con baseline, whose average NDCG@50 over the 2,000 queries is about 0.80.

[Figure 3: Online cumulative retrieval performance. (Improvement of NDCG@50 over the Uni-Con baseline as the number of arrived queries grows, for RankSVM, OPR, OPAR-I, OPAR-II and OGDR.)]

We also measure the cumulative CPU time cost taken by the different algorithms, shown in Figure 4. The batch learning algorithm RankSVM takes a few hours for re-training, while all the online learning methods take only several seconds. It is clear to observe that the batch algorithms are impractical for this application, whereas the proposed online algorithms are significantly more efficient and scalable.

[Figure 4: Online cumulative time cost. (Log-scale cumulative CPU time in seconds vs. the number of arrived queries.)]

6 Conclusions

This paper investigates a new framework of efficient and scalable learning to rank for CBIR, which aims to learn an optimal combination of multiple feature representations and different distance measures. We formulate the problem as a learning to rank task, and explore online learning to solve it. To overcome the drawbacks of existing batch learning to rank techniques, we present a family of efficient and scalable online learning to rank algorithms, which are empirically as effective as the batch algorithms for CBIR, but significantly more scalable by avoiding re-training. Finally, we note that our technique is rather generic and could be extended for solving many other types of multimedia retrieval tasks.

References

[Burges et al., 2005] Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In ICML, 2005.
[Cao et al., 2007] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[Chechik et al., 2010] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.
[Chen et al., 2009] Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhiming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
[Cooper et al., 1992] William S. Cooper, Fredric C. Gey, and Daniel P. Dabney. Probabilistic retrieval based on staged logistic regression. In SIGIR, 1992.
[Crammer and Singer, 2001] Koby Crammer and Yoram Singer. Pranking with ranking. In NIPS, pages 641–647, 2001.
[Crammer et al., 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. JMLR, 7:551–585, 2006.
[Dredze et al., 2008] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In ICML, pages 264–271, 2008.
[Faria et al., 2010] Fabio F. Faria, Adriano Veloso, Humberto M. Almeida, Eduardo Valle, Ricardo da S. Torres, Marcos A. Gonçalves, and Wagner Meira, Jr. Learning to rank for content-based image retrieval. In MIR, 2010.
[Freund et al., 2003] Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.
[Grangier and Bengio, 2008] David Grangier and Samy Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE TPAMI, 30(8):1371–1384, 2008.
[He et al., 2004] Jingrui He, Mingjing Li, Hong-Jiang Zhang, Hanghang Tong, and Changshui Zhang. Manifold-ranking based image retrieval. In ACM MM, pages 9–16, 2004.
[Herbrich et al., 2000] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132, 2000.
[Hoi et al., 2006] Steven C. H. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In CVPR, volume 2, pages 2072–2078, 2006.
[Hoi et al., 2014] Steven C. H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL: A library for online learning algorithms. JMLR, 15:495–499, 2014.
[Jain and Vailaya, 1996] Anil K. Jain and Aditya Vailaya. Image retrieval using color and shape. Pattern Recognition, 29:1233–1244, 1996.
[Järvelin and Kekäläinen, 2000] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, 2000.
[Joachims, 2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD, pages 133–142, 2002.
[Li and Long, 1999] Yi Li and Philip M. Long. The relaxed online maximum margin algorithm. In NIPS, pages 498–504, 1999.
[Li et al., 2007] Ping Li, Christopher J. C. Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
[Lowe, 1999] David G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999.
[Manjunath and Ma, 1996] B. S. Manjunath and Wei-Ying Ma. Texture features for browsing and retrieval of image data. IEEE TPAMI, 18(8):837–842, 1996.
[Metzler and Croft, 2007] Donald Metzler and W. Bruce Croft. Linear feature-based models for information retrieval. Inf. Retr., 10(3):257–274, 2007.
[Nallapati, 2004] Ramesh Nallapati. Discriminative models for information retrieval. In SIGIR, 2004.
[Pedronette and da S Torres, 2013] Daniel Carlos Guimarães Pedronette and Ricardo da S. Torres. Image re-ranking and rank aggregation based on similarity of ranked lists. Pattern Recognition, 2013.
[Qin et al., 2010] Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Inf. Retr., 13(4):346–374, 2010.
[Rosenblatt, 1958] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
[Sivic et al., 2005] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman. Discovering objects and their location in images. In CVPR, 2005.
[Taylor et al., 2008] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: optimizing non-smooth rank metrics. In WSDM, 2008.
[Tsai et al., 2007] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. FRank: a ranking method with fidelity loss. In SIGIR, pages 383–390, 2007.
[Valizadegan et al., 2009] Hamed Valizadegan, Rong Jin, Ruofei Zhang, and Jianchang Mao. Learning to rank by optimizing NDCG measure. In NIPS, 2009.
[Wu et al., 2010] Qiang Wu, Christopher J. Burges, Krysta M. Svore, and Jianfeng Gao. Adapting boosting for information retrieval measures. Inf. Retr., 13(3):254–270, 2010.
[Xia et al., 2008] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
[Xia et al., 2014] Hao Xia, Steven C. H. Hoi, Rong Jin, and Peilin Zhao. Online multiple kernel similarity learning for visual search. IEEE TPAMI, 36(3):536–549, 2014.
[Xu and Li, 2007] Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. In SIGIR, pages 391–398, 2007.
[Yue et al., 2007] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR, 2007.
[Zinkevich, 2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.
