Partial Order Embedding with Multiple Kernels

Brian McFee [email protected] Department of Computer Science and Engineering, University of California, San Diego, CA 92093 USA

Gert Lanckriet [email protected] Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093 USA

Abstract

We consider the problem of embedding arbitrary objects (e.g., images, audio, documents) into Euclidean space subject to a partial order over pairwise distances. Partial order constraints arise naturally when modeling human perception of similarity. Our partial order framework enables the use of graph-theoretic tools to more efficiently produce the embedding, and exploit global structure within the constraint set. We present an embedding algorithm based on semidefinite programming, which can be parameterized by multiple kernels to yield a unified space from heterogeneous features.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

A notion of distance between objects can be a powerful tool for predicting labels, retrieving similar objects, or visualizing high-dimensional data. Due to its simplicity and mathematical properties, the Euclidean distance metric is often applied directly to data, even when there is little evidence that said data lies in a Euclidean space. It has become the focus of much research to design algorithms which adapt the space so that Euclidean distance between objects conforms to some other presupposed notion of similarity, e.g., class labels or human perception measurements.

When dealing with multi-modal data, the simplest first step is to concatenate all of the features together, resulting in a single vector space on which metric learning algorithms can be applied. This approach suffers in situations where features cannot be directly concatenated, such as in kernel methods, where the features are represented (implicitly) by infinite-dimensional vectors. For example, if each object consists of mixed audio, video, and text content, entirely different types of transformations may be natural for each modality. It may therefore be better to learn a separate transformation for each type of feature. When designing metric learning algorithms for heterogeneous data, care must be taken to ensure that the transformation of features is carried out in a principled manner.

Moreover, in such feature-rich data sets, a notion of similarity may itself be subjective, varying from person to person. This is particularly true of multimedia data, where a person may not be able to consistently decide if two objects (e.g., songs or movies) are similar or not, but can more reliably produce an ordering of similarity over pairs. Algorithms in this regime must use a sufficiently expressive language to describe similarity constraints.

Our goal is to construct an algorithm which integrates subjective similarity measurements and heterogeneous data to produce a low-dimensional embedding. The main, novel contributions of this paper are two-fold. First, we develop the partial order embedding framework, which allows us to apply graph-theoretic tools to solve relative comparison embedding problems more efficiently. Our second contribution is a novel kernel combination technique to produce a unified Euclidean space from multi-modal data.

The remainder of this paper is structured as follows. Section 2 formalizes the embedding problem and develops some mathematical tools to guide algorithm design. Section 3 provides algorithms for non-parametric and multi-kernel embedding. Section 4 describes two experiments: one on synthetic and one on human-derived constraint sets. Section 5 discusses the complexity of exact dimensionality minimization in the partial order setting.

1.1. Related work

Metric learning algorithms adapt the space to fit some provided similarity constraints, typically consisting of pairs which are known to be neighbors or belong to the same class (Wagstaff et al., 2001; Xing et al., 2003; Tsang & Kwok, 2003; Bilenko et al., 2004; Hadsell et al., 2006; Weinberger et al., 2006; Globerson & Roweis, 2007).

Schultz and Joachims (2004) present an SVM-type algorithm to learn a metric from relative comparisons, with applications to text classification. Their formulation considered metrics derived from axis-aligned scaling, which, in the kernelized version, translates to a weighting over the training set. Agarwal et al. (2007), motivated by modeling human perception data, present a semidefinite programming (SDP) algorithm to construct an embedding of general objects from paired comparisons. Both of these algorithms treat constraints individually, and do not exploit global structure (e.g., transitivity).

Song et al. (2008) describe a metric learning algorithm that seeks a locally isometric embedding which is maximally aligned to a PSD matrix derived from side information. Although our use of (multiple) side-information sources differs from that of (Song et al., 2008), and we do not attempt to preserve local isometry, the techniques are not mutually exclusive.

Lanckriet et al. (2004) present a method to combine multiple kernels into a single kernel, outperforming the original kernels in a classification task. To our knowledge, similar results have not yet been demonstrated for the present metric learning problem.

1.2. Preliminaries

Let X denote a set of n objects. Let C denote a partial order over pairs drawn from X:

    C = {(i, j, k, ℓ) : i, j, k, ℓ ∈ X, (i, j) < (k, ℓ)},

where the less-than relation is interpreted over dissimilarity between objects. Because C is a partial order, it can be represented by a DAG with vertices in X^2 (see Figure 1). For any pair (i, j), let depth(i, j) denote the length of the longest path from a source to (i, j) in the DAG. Let diam(C) denote the length of the longest (possibly weighted) source-to-sink path in C, and let length(C) denote the number of edges in that path.

Figure 1. A partial order over similarity in DAG form: vertices represent pairs, and a directed edge from (i, j) to (i, k) indicates that (i, j) are more similar than (i, k).

A Euclidean embedding is a function g : X → R^n which maps X into n-dimensional space equipped with the Euclidean (ℓ_2) metric: ||x − y|| = sqrt((x − y)^T (x − y)).

A symmetric matrix A ∈ R^{n×n} has eigen-decomposition A = V Λ V^T, where Λ is a diagonal matrix containing the eigenvalues of A in descending order: λ_1 ≥ λ_2 ≥ ... ≥ λ_n. A is positive semidefinite (PSD), denoted A ⪰ 0, if each of its eigenvalues is non-negative. For A ⪰ 0, let A^{1/2} denote the matrix Λ^{1/2} V^T. Finally, for any matrix B, let B_i denote its i-th column vector.

2. Partial order embedding

Previous work has considered the setting where similarity information is coded as tuples (i, j, k, ℓ), where objects (i, j) are more similar than (k, ℓ), but treats each tuple individually and without directly taking into consideration larger-scale structure within the constraints. Such global structure can be revealed by the graph representation of the constraints.

If the constraints do not form a partial order, then the corresponding graph must contain a cycle, and therefore cannot be satisfied by any embedding. Moreover, it is easy to localize subsets of constraints which cannot all be satisfied by looking at the strongly-connected components of the graph. We therefore restrict attention to constraint sets which satisfy the properties of a partial order: transitivity and anti-symmetry. Exploiting transitivity allows us to more compactly represent otherwise large sets of independently considered local constraints, which can improve the efficiency of the algorithm.

2.1. Problem statement

Formally, the partial order embedding problem is defined as follows: given a set X of n objects and a partial order C over X^2, produce a map g : X → R^n such that

    ∀ (i, j, k, ℓ) ∈ C : ||g(i) − g(j)||^2 < ||g(k) − g(ℓ)||^2.

For numerical stability, g is restricted to force margins between constrained distances:

    ∀ (i, j, k, ℓ) ∈ C : ||g(i) − g(j)||^2 + e_{ijkℓ} < ||g(k) − g(ℓ)||^2.

Many algorithms take e_{ijkℓ} = 1 out of convenience, but uniform margins are not strictly necessary. We augment the DAG representation of C with positive edge weights corresponding to the desired margins. We will refer to this representation as a margin-weighted constraint graph, and to (X, C) as a margin-weighted instance.

In the special case where C is a total ordering over all pairs (i.e., a chain graph), the problem reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying embedding can always be found by constant-shift embedding (Roth et al., 2003). In general, C is not a total order, but a C-respecting embedding can always be produced by reducing the partial order to a total order (see Algorithm 1).
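As an illustration (not part of the paper's implementation), the margin-weighted constraint graph and the quantities depth and diam can be sketched with networkx; the constraint tuples and unit margins below are placeholder data.

```python
# Sketch: a margin-weighted constraint graph as a DAG over pairs, with the
# depth and diameter quantities defined in Section 2.
import networkx as nx

# Each tuple (i, j, k, l) asserts that pair (i, j) is closer than pair (k, l);
# the data and the unit margins below are illustrative placeholders.
constraints = [(0, 1, 0, 2), (0, 2, 1, 2), (0, 1, 1, 2)]
margins = {c: 1.0 for c in constraints}

G = nx.DiGraph()
for (i, j, k, l) in constraints:
    G.add_edge(frozenset((i, j)), frozenset((k, l)), weight=margins[(i, j, k, l)])

# A cyclic constraint set cannot be satisfied by any embedding.
assert nx.is_directed_acyclic_graph(G)

order = list(nx.topological_sort(G))
depth = {v: 0 for v in order}            # depth(k, l): longest source-to-(k, l) path
for v in order:
    for u in G.predecessors(v):
        depth[v] = max(depth[v], depth[u] + 1)

diam_C = nx.dag_longest_path_length(G, weight="weight")     # weighted diameter
length_C = len(nx.dag_longest_path(G, weight="weight")) - 1  # edges on that path
```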

Algorithm 1 Naïve total order construction
  Input: objects X, margin-weighted partial order C
  Output: symmetric dissimilarity matrix Δ ∈ R^{n×n}

  for each i in 1 ... n do
    Δ_ii ← 0
  end for
  for each (k, ℓ) in topological order do
    if in-degree(k, ℓ) = 0 then
      Δ_kℓ, Δ_ℓk ← 1
    else
      Δ_kℓ, Δ_ℓk ← max_{(i,j,k,ℓ) ∈ C} (Δ_ij + e_{ijkℓ})
    end if
  end for

Algorithm 1 produces arbitrary distances for unconstrained pairs, and is therefore unsuitable for applications where the solution must be constrained to somehow respect features of the objects. However, it can be used to construct an upper bound on the diameter of a satisfying solution, which will be useful when designing more sophisticated algorithms in Section 3.

2.2. A diameter bound

Let Δ be the output of Algorithm 1 on an instance (X, C). An embedding can be found by first applying classical multidimensional scaling (MDS) (Cox & Cox, 1994) to Δ:

    A = −(1/2) H Δ H,                                        (1)

where H = I − (1/n) 1 1^T is the n × n centering matrix. Shifting the spectrum of A yields

    A − λ_n(A) I = A* ⪰ 0,                                   (2)

where λ_n(A) is the minimum eigenvalue of A. The embedding g can be found by decomposing A* = V Λ* V^T, so that g(i) is the i-th column of (A*)^{1/2}; this is the solution constructed by the constant-shift embedding non-metric MDS algorithm (Roth et al., 2003).

The construction of A* leads to a bound on the embedding's diameter which depends only on the size of X and the diameter of the margin-weighted constraint graph.

Lemma 2.1. For every margin-weighted instance (X, C), there exists a margin-preserving map g : X → R^{n−1} such that for all i ≠ j,

    1 ≤ ||g(i) − g(j)|| ≤ sqrt((4n + 1)(diam(C) + 1)).

Proof. Let Δ be the output of Algorithm 1 on (X, C), and A, A* as defined in (1) and (2). From (2), the squared-distance matrix derived from A* can be expressed as

    Δ*_ij = Δ_ij − 2 λ_n(A) · 1[i ≠ j].                      (3)

Expanding (1) yields

    −2 A_ij = Δ_ij − (1/n) Δ_i^T 1 − (1/n) 1^T Δ_j + (1/n^2) 1^T Δ 1,

and since Δ_ij ≥ 0, it is straightforward to verify the following inequalities:

    ∀ i, j : −2 max_{x,y} Δ_xy ≤ A_ij ≤ 2 max_{x,y} Δ_xy.

It follows from the Geršgorin circle theorem (Varga, 2004) that

    λ_n(A) ≥ −2n max_{x,y} Δ_xy.                             (4)

Since Δ_ij ≤ 1 + diam(C), substituting (4) into (3) yields the desired result.

In general, unconstrained distances can lie anywhere in the range [1, diam(C) + 1], and Lemma 2.1 still applies. Even within a bounded diameter, there are infinitely many possible solutions, so we must further specify optimality criteria to define a unique solution.

2.3. Constraint graphs

The partial order framework enables pre-processing of the constraint set by operations performed directly on the constraint graph. This has the advantage that redundancies and inconsistencies (i.e., cycles) can be identified and resolved ahead of time, resulting in more robust solutions.

In particular, the transitivity property can be exploited to dramatically simplify a constraint set: any redundant edges may be safely removed from the graph without altering its semantics. We can therefore operate equivalently on the transitive reduction of the constraint graph (Aho et al., 1972).

Additionally, real constraint data can exhibit inconsistencies, in which case we would like to produce a maximal, consistent subset of the constraints. If the data contains sufficient variability, it may be reasonable to assume that inconsistencies are confined to small (local) regions of the graph. This corresponds to subsets of pairs which are too close in perceived similarity to reliably distinguish. Pruning edges within strongly connected components (SCC) can resolve local inconsistencies while retaining the unambiguous constraints. By construction, this always produces a valid partial order.

Some data sets, such as in Section 4.1, exhibit insufficient variability for the SCC method. If it is possible to obtain multiple observations for candidate constraints, the results may be filtered by majority vote or hypothesis testing to generate a robust subset of the constraints.

Although real data typically contains inconsistencies, denoising the constraint set is not the focus of this paper. For the remainder of this paper, we assume that the constraints have already been pre-processed to yield a valid partial ordering.
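The two pre-processing operations of Section 2.3 (transitive reduction and SCC-based pruning) can be sketched in the same networkx setting; again an assumption, not the paper's code, and removing every edge inside a strongly connected component is one simple instantiation of the SCC pruning idea.

```python
# Sketch: pre-process a raw constraint DiGraph G (as built above) before embedding.
import networkx as nx

def preprocess(G: nx.DiGraph) -> nx.DiGraph:
    # 1. Resolve local inconsistencies: drop every edge that lies inside a
    #    strongly connected component (i.e., inside a cycle of conflicting claims).
    scc_id = {v: c for c, comp in enumerate(nx.strongly_connected_components(G))
              for v in comp}
    H = nx.DiGraph()
    H.add_nodes_from(G.nodes)
    for u, v, data in G.edges(data=True):
        if scc_id[u] != scc_id[v]:
            H.add_edge(u, v, **data)
    assert nx.is_directed_acyclic_graph(H)   # by construction, a valid partial order

    # 2. Drop redundant (transitively implied) edges.
    R = nx.transitive_reduction(H)
    for u, v in R.edges:                     # transitive_reduction discards edge data,
        R[u][v].update(H[u][v])              # so copy the margin weights back
    return R
```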

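Before moving on to Section 3, here is a hedged numpy sketch of Algorithm 1 followed by the constant-shift construction of Eqs. (1)-(2); it assumes a constraint graph like the one built above, and setting unconstrained pairs to 1 is just one admissible choice from the range [1, diam(C) + 1].

```python
# Sketch: Algorithm 1 (naive total order) followed by constant-shift MDS (Eqs. 1-2).
import networkx as nx
import numpy as np

def naive_total_order(objects, G):
    """Fill a symmetric dissimilarity matrix Delta that respects the constraint DAG G."""
    n = len(objects)
    idx = {x: i for i, x in enumerate(objects)}
    Delta = np.zeros((n, n))
    for pair in nx.topological_sort(G):
        value = 1.0                                    # sources get distance 1
        for p in G.predecessors(pair):
            i, j = (idx[x] for x in p)
            value = max(value, Delta[i, j] + G[p][pair]["weight"])
        k, l = (idx[x] for x in pair)
        Delta[k, l] = Delta[l, k] = value
    # unconstrained pairs may take any value in [1, diam(C) + 1]; use 1 here
    off_diag = ~np.eye(n, dtype=bool)
    Delta[off_diag & (Delta == 0)] = 1.0
    return Delta

def constant_shift_embedding(Delta):
    """Classical MDS on Delta (Eq. 1), then shift the spectrum to be PSD (Eq. 2)."""
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * H @ Delta @ H
    A_star = A - np.linalg.eigvalsh(A).min() * np.eye(n)
    lam, V = np.linalg.eigh(A_star)
    return np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T   # g(i) = i-th column

# Example usage with the toy graph above:
#   Delta = naive_total_order([0, 1, 2], G)
#   X_embed = constant_shift_embedding(Delta)   # columns are the embedded points
```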
3. Algorithms

The naïve embedding algorithm in Section 2 gives no principled way to determine unconstrained distances, and may artificially inflate the dimensionality of the data to satisfy the ordering. Previous algorithms address these problems by filling in unconstrained distances according to the optimum of an objective function (Agarwal et al., 2007; Schultz & Joachims, 2004).

Often, the objective function used is a convex approximation to the rank of the embedding. We take a different approach here, and use a maximum-variance unfolding (MVU) objective to stretch all distances while obeying the constraints (Weinberger et al., 2004). This choice of objective relates more closely to the form of the constraints (distances), and in practice yields low-dimensional solutions.

3.1. Non-parametric embedding

In the non-parametric embedding problem, we seek a map g : X → R^n which respects C. We do not assume any explicit representation of the elements of X. Our algorithm optimizes the MVU objective: max Σ_{i,j} ||g(i) − g(j)||^2, subject to C.

To ensure that the objective is bounded, it suffices to bound the diameter of the embedding. Unlike the original MVU algorithm for manifold learning, we do not assume the existence of a neighborhood graph, so we cannot deduce an implicit bound. Instead, we use Lemma 2.1 to achieve an explicit bound on the objective, thereby resolving scale invariance of the solution.

Similarly, without a neighborhood graph, there is no direct way to choose which distances to preserve (or minimize) and which to maximize. However, we can exploit the acyclic structure of C by constructing margins e_{ijkℓ} which grow with depth in C. This creates nested families of neighborhoods around each constrained point. For certain problems, it is possible to automatically construct margin values to ensure rapid neighborhood growth. Section 4.2 provides one such example.

Algorithm 2 Partial order embedding
  Input: objects X, margin-weighted partial order C
  Output: inner product matrix A of embedded points
  Let d(i, j) = A_ii + A_jj − 2A_ij.

  max_A    Σ_{i,j} d(i, j)
  s.t.     ∀ i, j ∈ X :          (4n + 1)(diam(C) + 1) ≥ d(i, j)
           ∀ (i, j, k, ℓ) ∈ C :  d(i, j) + e_{ijkℓ} ≤ d(k, ℓ)
           Σ_{i,j} A_ij = 0,  A ⪰ 0

By combining the MVU objective with the diameter bound, margin constraints, and a centering constraint to resolve translation invariance, we arrive at Algorithm 2.

3.2. Parametric embedding

In the parametric embedding problem, we are provided side information about X (e.g., vector representations), which parameterizes the map g. The simplest such parameterization is a linear projection M, yielding g(x) = Mx. As in (Globerson & Roweis, 2007), this generalizes to non-linear transformations by using kernels.

In this setting, including a regularization term Tr(M^T M) (scaled by a factor γ > 0) in the objective function allows us to invoke the generalized representer theorem (Schölkopf et al., 2001). It follows that an optimal solution must lie in the span of the training data: if X contains the training set features as column vectors, then M = N X^T for some matrix N. Therefore, the regularization term can be expressed as

    Tr(M^T M) = Tr(X N^T N X^T) = Tr(N^T N X^T X).

Because the embedding function depends only upon inner products between feature vectors, the procedure can be generalized to non-linear transformations by the kernel trick. Let K be a kernel matrix over X, i.e., K_ij = ⟨φ(i), φ(j)⟩ for some feature map φ. Then we can replace the inner products in g(·) with the kernel function, yielding an embedding of the form g(x) = N K_x, where K_x is the column vector formed by evaluating the kernel function at x against the training set.

Within the optimization, all distance calculations between embedded points can be expressed in terms of inner products:

    ||g(i) − g(j)||^2 = (K_i − K_j)^T N^T N (K_i − K_j).

This allows us to formulate the optimization over a PSD matrix W = N^T N. The regularization term can be expressed equivalently as Tr(WK), and the inner product matrix for the embedded points is A = KWK. We can then factor W to obtain an embedding function g(x) = W^{1/2} K_x.

As in (Schultz & Joachims, 2004), this formulation can be interpreted as learning a Mahalanobis distance metric Φ W Φ^T in the feature space of K, where for i ∈ X, Φ_i = φ(i). If we further restrict W to be a diagonal matrix, it can be interpreted as learning a weighting over the training set which forms a C-respecting metric. Moreover, restricting W to be diagonal changes the semidefinite constraint to a set of linear constraints W_ii ≥ 0, so the resulting problem can be solved by linear programming.

To allow for a tradeoff between dimensionality and constraint-satisfaction, we relax the distance constraints by introducing slack variables, which are then penalized with a positive scaling parameter β. We assume that K is pre-centered, allowing us to drop the centering constraint on A. The resulting algorithm can be seen as a special case of Algorithm 3.

3.3. Multiple kernels

For heterogeneous data, we learn a unified space from multiple kernels, each of which may code for different features. We now extend the reasoning of Section 3.2 to construct a multi-kernel embedding algorithm.

Let K^(1), K^(2), ..., K^(m) be a set of kernel matrices, each having a corresponding feature map φ^(p), for p ∈ 1, ..., m. One natural way to combine the kernels is to look at the product space:

    φ(i) = (φ^(p)(i))_{p=1}^m,

with inner product

    ⟨φ(i), φ(j)⟩ = Σ_{p=1}^m ⟨φ^(p)(i), φ^(p)(j)⟩.

The product space kernel matrix can be conveniently represented in terms of the original kernels: K̂ = Σ_p K^(p).

Previous kernel combination algorithms learn a weighting Σ_p a_p K^(p) such that informative kernels get higher weight, thereby contributing more to the final prediction (Lanckriet et al., 2004). In general, the discriminative power of each K^(p) may vary over different subsets of X, and learning a single transformation of the combined kernel could fail to exploit this effect. We therefore extend the formulation from the previous section by learning a separate transformation matrix W^(p) for each kernel K^(p). Each W^(p) defines a partial embedding

    g^(p)(x) = (W^(p))^{1/2} K_x^(p),

wherein the distance is defined as

    d^(p)(i, j) = (K_i^(p) − K_j^(p))^T W^(p) (K_i^(p) − K_j^(p)).

If we then concatenate the partial embeddings to form a single vector g(x) = (g^(p)(x))_{p=1}^m ∈ R^{nm}, the total distance is

    d(i, j) = Σ_p d^(p)(i, j) = ||g(i) − g(j)||^2.

We can then add regularization terms Tr(W^(p) K^(p)) and apply the analysis from Section 3.2 to each portion of the objective and embedding function independently. The final multi-kernel embedding algorithm is given as Algorithm 3.

Algorithm 3 Multi-kernel partial order embedding
  Input: objects X, margin-weighted partial order C, kernel matrices K^(1), K^(2), ..., K^(m) over X
  Output: matrices W^(1), W^(2), ..., W^(m) ⪰ 0
  Let d(i, j) = Σ_p d^(p)(i, j).

  max_{W^(p), ξ}   Σ_{i,j} d(i, j) − β Σ_C ξ_{ijkℓ} − γ Σ_p Tr(W^(p) K^(p))
  s.t.             ∀ i, j ∈ X :          (4n + 1)(diam(C) + 1) ≥ d(i, j)
                   ∀ (i, j, k, ℓ) ∈ C :  d(i, j) + e_{ijkℓ} − ξ_{ijkℓ} ≤ d(k, ℓ)
                                         ξ_{ijkℓ} ≥ 0
                   ∀ p ∈ 1, 2, ..., m :  W^(p) ⪰ 0

In this context, diagonal constraints on W^(p) not only simplify the problem to linear programming, but also carry the added interpretation of weighting the contribution of each (kernel, training point) pair in the construction of the embedding. A large value at W_ii^(p) corresponds to point i being a landmark for the features compared by K^(p).
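To make the diagonal variant concrete, the following is a minimal cvxpy sketch of Algorithm 3 with diagonal W^(p); it is not the SeDuMi/YALMIP implementation used for the experiments below, and all inputs (kernel list, constraint tuples, margins, β, γ) are assumed to be supplied.

```python
# Sketch: Algorithm 3 restricted to diagonal W^(p), i.e., a non-negative weight per
# (kernel, training point) pair.  Everything is linear in the weights, so this is an LP.
import cvxpy as cp
import numpy as np

def multikernel_embedding(kernels, constraints, margins, diam_C, beta=1.0, gamma=0.1):
    """kernels: list of centered n-by-n kernel matrices; constraints: (i, j, k, l) tuples;
    margins: dict of e_ijkl values; beta and gamma are illustrative trade-off weights."""
    n, m = kernels[0].shape[0], len(kernels)
    w = [cp.Variable(n, nonneg=True) for _ in range(m)]      # diag(W^(p)) >= 0
    xi = cp.Variable(len(constraints), nonneg=True)          # slack variables

    def dist(i, j):
        # d(i, j) = sum_p (K_i - K_j)^T diag(w_p) (K_i - K_j)
        return sum(cp.sum(cp.multiply(w[p], (K[:, i] - K[:, j]) ** 2))
                   for p, K in enumerate(kernels))

    bound = (4 * n + 1) * (diam_C + 1)
    cons = [dist(i, j) <= bound for i in range(n) for j in range(i + 1, n)]
    cons += [dist(i, j) + margins[(i, j, k, l)] - xi[t] <= dist(k, l)
             for t, (i, j, k, l) in enumerate(constraints)]

    spread = sum(dist(i, j) for i in range(n) for j in range(i + 1, n))
    reg = sum(cp.sum(cp.multiply(w[p], np.diag(K))) for p, K in enumerate(kernels))
    cp.Problem(cp.Maximize(2 * spread - beta * cp.sum(xi) - gamma * reg), cons).solve()
    return [np.asarray(wp.value) for wp in w]
```

Because every expression is linear in the diagonal entries, a standard LP solver suffices. The learned weights give the out-of-sample map described above: g^(p)(x) = diag(w^(p))^{1/2} K_x^(p), where K_x^(p) is formed by evaluating the p-th kernel between x and the training set.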
4. Experiments

The following experiments were conducted with the SeDuMi and YALMIP optimization packages (Sturm, 1999; Löfberg, 2004). In the first experiment, we demonstrate both the parametric and non-parametric algorithms on a human perception modeling task. The second experiment formulates a hierarchical multi-class visualization problem as an instance of partial order embedding, which is then solved with the multi-kernel algorithm.
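For reference in Section 4.1, the non-parametric SDP of Algorithm 2 admits an equally short sketch, again with cvxpy standing in for the SeDuMi/YALMIP setup and with all inputs assumed.

```python
# Sketch: Algorithm 2 as a single SDP over the Gram matrix A of the embedded points.
import cvxpy as cp
import numpy as np

def partial_order_embedding(n, constraints, margins, diam_C):
    """constraints: (i, j, k, l) index tuples over n objects; margins: e_ijkl values."""
    A = cp.Variable((n, n), PSD=True)
    d = lambda i, j: A[i, i] + A[j, j] - 2 * A[i, j]

    bound = (4 * n + 1) * (diam_C + 1)
    cons = [cp.sum(A) == 0]                                   # centering constraint
    cons += [d(i, j) <= bound for i in range(n) for j in range(i + 1, n)]
    cons += [d(i, j) + margins[(i, j, k, l)] <= d(k, l) for (i, j, k, l) in constraints]

    # MVU-style objective: the sum of all pairwise distances is 2n*tr(A) - 2*sum(A)
    cp.Problem(cp.Maximize(2 * n * cp.trace(A) - 2 * cp.sum(A)), cons).solve()

    lam, V = np.linalg.eigh(A.value)                          # A = V diag(lam) V^T
    return np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T    # g(i) = i-th column
```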

4.1. Relative similarity

Our first experiment reproduces the human perception experiment of (Agarwal et al., 2007). The data consists of 55 rendered images of rabbits with varying surface reflectance properties, and 13049 human-derived relative similarity measurements. The constraint set contains numerous redundancies and inconsistencies, which we removed by randomly sampling constraints to produce a maximal partial order. This results in a constraint DAG of 8770 edges and length 55.

The constraint graph was further simplified by computing its transitive reduction (Aho et al., 1972), thereby suppressing redundant constraints, i.e., those which can be deduced from others. This results in an equivalent, but more compact representation of 2369 unique constraints, and a much simpler optimization problem.

Using unit margins, we first constructed a non-parametric embedding (Figure 2(a)) with Algorithm 2. The results are qualitatively similar to those presented in (Agarwal et al., 2007); the x-axis in the figure roughly captures gloss, and the y-axis captures brightness.

We then constructed a parametric embedding with a diagonally constrained W. For features, we used radial basis function (RBF) kernels over image intensity histograms. The embedding, shown in Figure 2(b), exhibits more structure than the non-parametric embedding, but still conforms to the constraints. Unlike in (Agarwal et al., 2007), this embedding extends to new points by applying the kernel function to the novel point and each point in the training set, and then applying the learned transformation to the resulting vector.

Figure 2. (a) Non-parametric embedding of the rabbit data set. (b) Parametric embedding of the rabbit data set, using an RBF kernel over intensity histograms.

4.2. Hierarchical embedding

Our second experiment is a hierarchical visualization task. Given a set of labeled objects and a taxonomy over the labels, we seek a low-dimensional embedding which reflects the global structure of the labels. We selected 10 classes from the Amsterdam Library of Object Images (ALOI) (Geusebroek et al., 2005), which were then merged into 3 higher-level classes to form the taxonomy in Table 1.

A label taxonomy can be translated into partial order constraints by following the subset relation up from the leaves to the root of the hierarchy. Let LCA(i, j) denote the least common ancestor of i and j in the taxonomy tree; that is, the smallest subset which contains both i and j. For each internal node S in the taxonomy tree, we generate constraints of the form:

    {(i, j, i, k) : i, j, k ∈ S, LCA(i, j) ≠ S, LCA(i, k) = S}.

From each of the 10 classes, we sampled 10 images, each of varying out-of-plane rotation. Each image was full-color with dimensions 192 × 144. We parameterized the embedding by forming five kernels: K^(1) is the inner product between grayscale images, and K^(2,...,5) are radial basis function (RBF) kernels over histograms in the red, green, and blue channels, and grayscale intensity. All kernels were centered by K ↦ HKH, and then normalized by K_ij ↦ K_ij / sqrt(K_ii K_jj).

Figure 3(a) illustrates the PCA projection of the product-space kernel for this experiment. Although each of the 10 classes is individually clustered, the product space does not conform to the higher-level structure defined by the taxonomy.

When building the constraint graph, we randomly sampled edges to obtain an equal number of constraints at each depth. We supplied margin weights of the form e_{ijkℓ} = 2^{depth(k,ℓ)−1}. This forces tight clusters at lower levels in the taxonomy, but subset diameters increase rapidly at higher levels in the taxonomy. We applied Algorithm 3 with diagonal constraints on each W^(p), resulting in the embedding shown in Figure 3(b).

Table 2 shows the fraction of satisfied taxonomy constraints (ignoring margins) for each kernel, before learning (i.e., distances native to the kernel) and after. The first 6 rows were computed from the single-kernel algorithm of Section 3.2, and the last was computed from the multi-kernel algorithm. The metric derived from the dot product kernel outperforms each other individual kernel, indicating that colors are not generally discriminative in this set. The metric learned from the product space kernel can be interpreted as a weighting over the data, with uniform weighting over kernels, and performs much worse than the single best kernel. However, the multi-kernel metric adapts each kernel individually, achieving the highest fraction of satisfied constraints.

Figure 3(c) depicts the weights learned by the algorithm. Although there were 500 weight variables, only 42 received non-zero values, largely eliminating the color features. As Table 2 indicates, the color features are relatively weak when compared to the dot product kernel, but they can be combined to produce a much better embedding.
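A sketch of the constraint-generation scheme above; the toy taxonomy and helper names are hypothetical, and the taxonomy level is used as a stand-in for the DAG depth in the margin schedule e_{ijkℓ} = 2^{depth(k,ℓ)−1}.

```python
# Sketch: generate partial-order constraints from a label taxonomy (Section 4.2)
# using the rule {(i, j, i, k) : i, j, k in S, LCA(i, j) != S, LCA(i, k) = S}.
from itertools import combinations

# ancestry chain (leaf class -> root) for each item id; a toy taxonomy
ancestors = {
    0: ["Shoe", "Clothing", "All"], 1: ["Shoe", "Clothing", "All"],
    2: ["Hat", "Clothing", "All"],  3: ["Smurf", "Toys", "All"],
    4: ["Lemon", "Fruit", "All"],   5: ["Pear", "Fruit", "All"],
}

def lca(i, j):
    return next(a for a in ancestors[i] if a in ancestors[j])

def members(node):
    return [x for x, chain in ancestors.items() if node in chain]

constraints, margins = {}, {}
constraints = []
levels = [["Clothing", "Toys", "Fruit"], ["All"]]       # internal nodes, leaves upward
for level, nodes in enumerate(levels):
    for S in nodes:
        for i, j in combinations(members(S), 2):
            if lca(i, j) == S:
                continue
            for k in members(S):
                if k not in (i, j) and lca(i, k) == S:
                    constraints.append((i, j, i, k))
                    # margins double with each level, approximating e = 2^(depth - 1)
                    margins[(i, j, i, k)] = 2.0 ** level
```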

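The kernel pre-processing used in Section 4.2 (centering K ↦ HKH, then normalizing K_ij ↦ K_ij / sqrt(K_ii K_jj)) is a short numpy sketch:

```python
# Sketch: center and normalize a kernel matrix as in Section 4.2.
import numpy as np

def center_and_normalize(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H                                   # K -> HKH
    d = np.sqrt(np.clip(np.diag(K), 1e-12, None))   # guard against a zero diagonal
    return K / np.outer(d, d)                       # K_ij -> K_ij / sqrt(K_ii K_jj)
```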
Table 1. The label taxonomy for the hierarchical embedding experiment of Section 4.2.

                     All
    Clothing       Toys                 Fruit
    Shoe           Xmas bear            Lemon
    Hat            Pink animal          Pear
    White shoe     Red/yellow block     Orange
                   Smurf

Figure 4. (a) A square in R^2. (b) The partial order over distances induced by the square: each side is less than each diagonal. This constraint set cannot be satisfied in R^1.

Table 2. The fraction of taxonomy constraints satisfied before and after learning in each kernel individually, the product-space kernel, and finally the multi-kernel embedding.

    Kernel        Before learning   After learning
    Dot product   0.8255            0.8526
    RBF Red       0.6303            0.6339
    RBF Green     0.6518            0.6746
    RBF Blue      0.7717            0.8349
    RBF Gray      0.6782            0.6927
    Product       0.7628            0.7674
    Multi         —                 0.9483

5. Hardness in R^1

So far, we've focused on algorithms that attempt to produce low-dimensional embeddings, but it is natural to ask if solutions of minimal dimensionality can be found efficiently. As a special case, one may ask if any instance (X, C) can be satisfied in R^1. However, as Figure 4 demonstrates, not all instances can be satisfied in the line.

Because rank constraints are not convex, convex optimization techniques cannot efficiently minimize dimensionality. This does not necessarily imply other techniques could not work.

However, we show that it is NP-Complete to decide if a given C can be satisfied in R^1. A satisfying embedding can be verified in polynomial time by traversing the constraint graph, so it remains to show that the R^1 partial order embedding problem (hereafter referred to as 1-POE) is NP-Hard. We reduce from the following problem:

Definition 5.1 (Betweenness (Opatrny, 1979)). Given a finite set Z and a collection T of ordered triples (a, b, c) of distinct elements from Z, is there a one-to-one function f : Z → R such that for each (a, b, c) ∈ T, either f(a) < f(b) < f(c) or f(c) < f(b) < f(a)?

Theorem 5.1. 1-POE is NP-Hard.

Proof. Let (Z, T) be an instance of Betweenness. Let X = Z, and for each (a, b, c) ∈ T, introduce constraints (a, b, a, c) and (b, c, a, c) to C. Since Euclidean distance in R^1 is simply line distance, these constraints force b to lie between a and c. Therefore, the original instance (Z, T) ∈ Betweenness if and only if the new instance (X, C) ∈ 1-POE. Since Betweenness is NP-Hard, 1-POE is NP-Hard as well.

6. Conclusion

We have presented a metric learning algorithm which constructs a single unified space from heterogeneous data sources, supplied by multiple kernels. Our kernel combination technique adapts to varying utility of kernels over the training set.

Gearing our algorithm toward subjective similarity motivates the partial order framework, which enables graph-theoretic analysis of the constraints. This leads to more efficient algorithms and more reliable embeddings.

References

Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D., & Belongie, S. (2007). Generalized non-metric multi-dimensional scaling. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics.

Aho, A. V., Garey, M. R., & Ullman, J. D. (1972). The transitive reduction of a directed graph. SIAM Journal on Computing, 1, 131–137.

Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. Proceedings of the Twenty-first International Conference on Machine Learning (pp. 81–88).

Cox, T. F., & Cox, M. A. (1994). Multidimensional scaling. Chapman and Hall.



Figure 3. (a) PCA of the clothes/toy/fruit set in the native product-space kernel. Higher level classes (clothes, toys, fruit, coded by color) are not clustered. (b) PCA of the same set after learning: high-level structure is now captured in low dimensions. (c) The weighting learned to produce (b). From top to bottom: grayscale inner product, red, green, blue, and grayscale histogram RBF kernels.

Geusebroek, J. M., Burghouts, G. J., & Smeulders, A. W. M. (2005). The Amsterdam library of object images. Int. J. Comput. Vis., 61, 103–112.

Globerson, A., & Roweis, S. (2007). Visualizing pairwise similarity via semidefinite embedding. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics.

Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Computer Vision and Pattern Recognition.

Kruskal, J. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5, 27–72.

Löfberg, J. (2004). YALMIP: A toolbox for modeling and optimization in MATLAB. 2004 IEEE International Symposium on Computer Aided Control Systems Design (pp. 284–289).

Opatrny, J. (1979). Total ordering problem. SIAM J. Computing, 111–114.

Roth, V., Laub, J., Buhmann, J. M., & Müller, K.-R. (2003). Going metric: denoising pairwise data. Advances in Neural Information Processing Systems 15 (pp. 809–816). Cambridge, MA: MIT Press.

Schölkopf, B., Herbrich, R., Smola, A. J., & Williamson, R. (2001). A generalized representer theorem. Proceedings of the 14th Annual Conference on Computational Learning Theory (pp. 416–426).

Schultz, M., & Joachims, T. (2004). Learning a distance metric from relative comparisons. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.

Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored maximum variance unfolding. Advances in Neural Information Processing Systems (pp. 1385–1392). Cambridge, MA: MIT Press.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12, 625–653.

Tsang, I. W., & Kwok, J. T. (2003). Distance metric learning with kernels. International Conference on Artificial Neural Networks (pp. 126–129).

Varga, R. (2004). Geršgorin and his circles. Springer-Verlag.

Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577–584).

Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems 18 (pp. 451–458). Cambridge, MA: MIT Press.

Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proceedings of the Twenty-first International Conference on Machine Learning (pp. 839–846).

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems 15 (pp. 505–512). Cambridge, MA: MIT Press.