Partial Order Embedding with Multiple Kernels

Brian McFee [email protected]
Department of Computer Science and Engineering, University of California, San Diego, CA 92093 USA

Gert Lanckriet [email protected]
Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093 USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider the problem of embedding arbitrary objects (e.g., images, audio, documents) into Euclidean space subject to a partial order over pairwise distances. Partial order constraints arise naturally when modeling human perception of similarity. Our partial order framework enables the use of graph-theoretic tools to more efficiently produce the embedding, and exploit global structure within the constraint set.

We present an embedding algorithm based on semidefinite programming, which can be parameterized by multiple kernels to yield a unified space from heterogeneous features.

1. Introduction

A notion of distance between objects can be a powerful tool for predicting labels, retrieving similar objects, or visualizing high-dimensional data. Due to its simplicity and mathematical properties, the Euclidean distance metric is often applied directly to data, even when there is little evidence that said data lies in a Euclidean space. It has become the focus of much research to design algorithms which adapt the space so that Euclidean distance between objects conforms to some other presupposed notion of similarity, e.g., class labels or human perception measurements.

When dealing with multi-modal data, the simplest first step is to concatenate all of the features together, resulting in a single vector space on which metric learning algorithms can be applied. This approach suffers in situations where features cannot be directly concatenated, such as in kernel methods, where the features are represented (implicitly) by infinite-dimensional vectors. For example, if each object consists of mixed audio, video, and text content, entirely different types of transformations may be natural for each modality. It may therefore be better to learn a separate transformation for each type of feature. When designing metric learning algorithms for heterogeneous data, care must be taken to ensure that the transformation of features is carried out in a principled manner.

Moreover, in such feature-rich data sets, a notion of similarity may itself be subjective, varying from person to person. This is particularly true of multimedia data, where a person may not be able to consistently decide if two objects (e.g., songs or movies) are similar or not, but can more reliably produce an ordering of similarity over pairs. Algorithms in this regime must use a sufficiently expressive language to describe similarity constraints.

Our goal is to construct an algorithm which integrates subjective similarity measurements and heterogeneous data to produce a low-dimensional embedding. The main, novel contributions of this paper are two-fold. First, we develop the partial order embedding framework, which allows us to apply graph-theoretic tools to solve relative comparison embedding problems more efficiently. Our second contribution is a novel kernel combination technique to produce a unified Euclidean space from multi-modal data.

The remainder of this paper is structured as follows. Section 2 formalizes the embedding problem and develops some mathematical tools to guide algorithm design. Section 3 provides algorithms for non-parametric and multi-kernel embedding. Section 4 describes two experiments: one on synthetic and one on human-derived constraint sets. Section 5 discusses the complexity of exact dimensionality minimization in the partial order setting.

1.1. Related work

Metric learning algorithms adapt the space to fit some provided similarity constraints, typically consisting of pairs which are known to be neighbors or belong to the same class (Wagstaff et al., 2001; Xing et al., 2003; Tsang & Kwok, 2003; Bilenko et al., 2004; Hadsell et al., 2006; Weinberger et al., 2006; Globerson & Roweis, 2007).

Schultz and Joachims (2004) present an SVM-type algorithm to learn a metric from relative comparisons, with applications to text classification. Their formulation considered metrics derived from axis-aligned scaling, which, in the kernelized version, translates to a weighting over the training set. Agarwal et al. (2007), motivated by modeling human perception data, present a semidefinite programming (SDP) algorithm to construct an embedding of general objects from paired comparisons. Both of these algorithms treat constraints individually, and do not exploit global structure (e.g., transitivity).

Song et al. (2008) describe a metric learning algorithm that seeks a locally isometric embedding which is maximally aligned to a PSD matrix derived from side information. Although our use of (multiple) side-information sources differs from that of Song et al. (2008), and we do not attempt to preserve local isometry, the techniques are not mutually exclusive.

Lanckriet et al. (2004) present a method to combine multiple kernels into a single kernel, outperforming the original kernels in a classification task. To our knowledge, similar results have not yet been demonstrated for the present metric learning problem.

1.2. Preliminaries

Let $\mathcal{X}$ denote a set of $n$ objects. Let $\mathcal{C}$ denote a partial order over pairs drawn from $\mathcal{X}$:

$$\mathcal{C} = \{(i, j, k, \ell) : i, j, k, \ell \in \mathcal{X},\ (i, j) < (k, \ell)\},$$

where the less-than relation is interpreted over dissimilarity between objects. Because $\mathcal{C}$ is a partial order, it can be represented by a DAG with vertices in $\mathcal{X}^2$ (see Figure 1). For any pair $(i, j)$, let $\mathrm{depth}(i, j)$ denote the length of the longest path from a source to $(i, j)$ in the DAG. Let $\mathrm{diam}(\mathcal{C})$ denote the length of the longest (possibly weighted) source-to-sink path in $\mathcal{C}$, and let $\mathrm{length}(\mathcal{C})$ denote the number of edges in the path.

Figure 1. A partial order over similarity in DAG form: vertices represent pairs, and a directed edge from $(i, j)$ to $(i, k)$ indicates that $(i, j)$ are more similar than $(i, k)$.

A Euclidean embedding is a function $g : \mathcal{X} \to \mathbb{R}^n$ which maps $\mathcal{X}$ into $n$-dimensional space equipped with the Euclidean ($\ell_2$) metric: $\|x - y\| = \sqrt{(x - y)^T (x - y)}$.

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ has eigen-decomposition $A = V \Lambda V^T$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of $A$ in descending order: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. $A$ is positive semidefinite (PSD), denoted $A \succeq 0$, if each of its eigenvalues is non-negative. For $A \succeq 0$, let $A^{1/2}$ denote the matrix $\Lambda^{1/2} V^T$. Finally, for any matrix $B$, let $B_i$ denote its $i$-th column vector.

2. Partial order embedding

Previous work has considered the setting where similarity information is coded as tuples $(i, j, k, \ell)$ where objects $(i, j)$ are more similar than $(k, \ell)$, but treats each tuple individually, without directly taking into consideration larger-scale structure within the constraints. Such global structure can be revealed by the graph representation of the constraints.

If the constraints do not form a partial order, then the corresponding graph must contain a cycle, and the constraints therefore cannot be satisfied by any embedding. Moreover, it is easy to localize subsets of constraints which cannot all be satisfied by looking at the strongly-connected components of the graph. We therefore restrict attention to constraint sets which satisfy the properties of a partial order: transitivity and anti-symmetry. Exploiting transitivity allows us to more compactly represent otherwise large sets of independently considered local constraints, which can improve the efficiency of the algorithm.

2.1. Problem statement

Formally, the partial order embedding problem is defined as follows: given a set $\mathcal{X}$ of $n$ objects, and a partial order $\mathcal{C}$ over $\mathcal{X}^2$, produce a map $g : \mathcal{X} \to \mathbb{R}^n$ such that

$$\forall (i, j, k, \ell) \in \mathcal{C} : \|g(i) - g(j)\|^2 < \|g(k) - g(\ell)\|^2.$$

For numerical stability, $g$ is restricted to force margins between constrained distances:

$$\forall (i, j, k, \ell) \in \mathcal{C} : \|g(i) - g(j)\|^2 + e_{ijk\ell} < \|g(k) - g(\ell)\|^2.$$

Many algorithms take $e_{ijk\ell} = 1$ out of convenience, but uniform margins are not strictly necessary. We augment the DAG representation of $\mathcal{C}$ with positive edge weights corresponding to the desired margins. We will refer to this representation as a margin-weighted constraint graph, and $(\mathcal{X}, \mathcal{C})$ as a margin-weighted instance.

In the special case where $\mathcal{C}$ is a total ordering over all pairs (i.e., a chain graph), the problem reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying embedding can always be found by constant-shift embedding (Roth et al., 2003).

Algorithm 1 Naïve total order construction
  Input: objects $\mathcal{X}$, margin-weighted partial order $\mathcal{C}$
  Output: symmetric dissimilarity matrix $\Delta \in \mathbb{R}^{n \times n}$
  for each $i$ in $1 \ldots n$ do
    $\Delta_{ii} \leftarrow 0$
  end for
  for each $(k, \ell)$ in topological order do
    if in-degree$(k, \ell) = 0$ then
      $\Delta_{k\ell}, \Delta_{\ell k} \leftarrow 1$
    else
      $\Delta_{k\ell}, \Delta_{\ell k} \leftarrow \max_{(i, j, k, \ell) \in \mathcal{C}} \Delta_{ij} + e_{ijk\ell}$
    end if
  end for

[...] that for all $i \neq j$,

$$1 \leq \|g(i) - g(j)\| \leq \sqrt{(4n + 1)(\mathrm{diam}(\mathcal{C}) + 1)}.$$

Proof. Let $\Delta$ be the output of Algorithm 1 on $(\mathcal{X}, \mathcal{C})$, and $A$, $A^*$ as defined in (1) and (2). From (2), the squared-distance matrix derived from $A^*$ can be expressed as

$$\Delta^*_{ij} = \Delta_{ij} - 2 \lambda_n(A) \mathbf{1}_{i \neq j}. \qquad (3)$$

Expanding (1) yields

$$A_{ij} = \Delta_{ij} - \frac{1}{n} \Delta_i^T \mathbf{1} - \frac{1}{n} \mathbf{1}^T \Delta_j + \frac{1}{n^2} \mathbf{1}^T \Delta \mathbf{1},$$

and since $\Delta_{ij} \geq 0$, it is straightforward to verify the following inequalities:

$$\forall i, j : -2 \max_{x,y} \Delta_{xy} \leq A_{ij} \leq 2 \max_{x,y} \Delta_{xy}.$$