Permutation Learning via Lehmer Codes
Aïssatou Diallo¹, Markus Zopf¹, Johannes Fürnkranz²

¹ Technische Universität Darmstadt, Germany, email: [diallo@aiphes, zopf@ke].tu-darmstadt.de
² Johannes Kepler Universität Linz, Austria, email: juffi@faw.jku.at

Abstract. Many machine learning problems require learning to permute a set of objects. Notable applications include ranking and sorting. One of the difficulties of learning in such a combinatorial space is the definition of meaningful and differentiable distances and loss functions. Lehmer codes are an elegant way of mapping permutations into a vector space in which the distance between two codes reflects the Kendall tau distance between the corresponding rankings. This transformation therefore allows the use of a surrogate loss, converting the original permutation learning problem into a surrogate prediction problem that is easier to optimize. To that end, learning to predict Lehmer codes transforms the inference problem from one over permutation matrices to one over row-stochastic matrices, thereby relaxing the requirement that both rows and columns sum to one. This permits a straightforward optimization of the permutation prediction problem. We demonstrate the effectiveness of this approach for object ranking problems by providing empirical results and comparing it to competitive baselines on different tasks.

1 INTRODUCTION

Permutation learning is a long-standing problem in many scientific fields. The need to deal with permutations arises in a variety of applications, such as ranking [13], information retrieval, or multi-object tracking [6] in computer vision. The fundamental underlying problem is the correct identification of the permutation that allows the original sequence to be reconstructed. Ranking, sorting, and matching problems are classical examples of the necessity to perform inference over such a combinatorial space, and they can all be represented by permutations over the set of objects to sort. However, one of the main difficulties in dealing with such combinatorial objects is the factorial growth of the number of permutations with a growing number of elements. Moreover, there are mutual exclusivity constraints associated with permutations. Finally, the space of all possible permutations of a set of objects is not smooth: permutations are discrete objects, and thus it is not possible to optimize over them directly with gradient-based methods. Exact solutions are therefore intractable in most cases.

One strategy for working with combinatorial objects is to embed them into continuous spaces, in order to impose a structure and topology so that conventional tools for inference can be used. Prior works have proposed different approaches to this problem. For example, several authors [4, 6] derived an approximation of a general probability distribution with a restricted set of basis functions, performing inference in the Fourier domain and using accompanying transformations to project back to the combinatorial space.

Another line of work investigated convex surrogates of the permutations themselves. The classical workflow for tackling the discrete optimization problem over the set of permutations of n elements is the following: (1) use permutation matrices to represent permutations; (2) as these matrices are discrete, gradient-based optimization is not possible, hence use a relaxation in order to transform them into the nearest convex surrogate, a doubly stochastic matrix, i.e., a matrix in which all rows and all columns sum to one. In particular, permutation matrices may be viewed as the special case of doubly stochastic matrices that contain only zeros and ones.
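To make step (2) of this workflow concrete, the following is a minimal NumPy sketch of the iterative row/column normalization described above; the iteration count and the example matrix are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def sinkhorn(m, n_iters=20):
    """Map a strictly positive square matrix to an (approximately)
    doubly stochastic one by alternately normalizing rows and columns."""
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)   # rows sum to one
        m = m / m.sum(axis=0, keepdims=True)   # columns sum to one
    return m

rng = np.random.default_rng(0)
scores = np.exp(rng.normal(size=(4, 4)))       # positive entries
ds = sinkhorn(scores)
print(ds.sum(axis=0), ds.sum(axis=1))          # both close to all-ones
```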
In this paper, we propose a different approach to learning permutations, which differs from previous work [6] in that we transform the permutation matrix from the combinatorial space to a Hilbert space using Lehmer codes [9]. Unlike [7], where Lehmer codes were proposed for label ranking, we employ them for the task of object ranking. Lehmer codes provide a bijective mapping from permutations onto sequences of integers. Moreover, Lehmer codes enjoy algorithmic advantages, such as the fact that the L1 norm of a code equals the Kendall tau distance between the permutation that generated the code and the lexicographic permutation; Kendall's tau is a commonly used metric for assessing the ordinal association between rankings. Lehmer codes also provide vector representations of permutations in which the coordinates are decoupled, i.e., the values of each coordinate are not restricted by the other coordinates. This mapping allows us to transform the problem from learning a doubly stochastic matrix to learning an integer-valued vector, which enjoys great advantages over the direct optimization of permutation matrices.

The remainder of the paper is organized as follows: Section 2 lays the necessary mathematical background and defines the notation used throughout the paper, whereas Section 3 reviews prior work on permutation learning and embeddings of permutations. Section 4 presents our primary contribution and motivates our theoretical choices. Finally, Section 5 presents experiments that compare our proposed approach against alternative methods, illustrating the benefits of Lehmer codes for the task of permutation learning.

2 BACKGROUND

In this section, we formally state the permutation learning problem and introduce the notation used throughout the paper, which closely follows the literature [7, 11].

2.1 Notation

Given an ordered list of items indexed by $\{1, \dots, n\}$, let $\pi_{I_n}$ denote, without loss of generality, the n-dimensional identity permutation vector. A permutation is a rearrangement of the elements of $\pi_{I_n}$, i.e., a bijection from $\pi_{I_n}$ to $\pi_n$, a generic permutation vector with $\pi_n \in \mathcal{P}^n$. $\mathcal{P}^n$ denotes the set of all permutation vectors of length n in the symmetric group and has cardinality n!.

2.2 Learning problem

The permutation learning problem aims at learning a function $s : \mathcal{X} \to \mathcal{P}^n$ that maps a feature space $\mathcal{X}$ to the permutation space $\mathcal{P}^n$. Loosely speaking, given a sequence of shuffled items, the goal is to recover their natural order as defined by some specific criterion c. Formally, the underlying problem is to predict the permutation matrix $P_i \in \{0,1\}^{n \times n}$, given a sequence of shuffled items $\tilde{X}_i$, such that $P_i^{-1}$ recovers the original ordered sequence $X_i$. In such a setting, the cardinality of $\mathcal{P}^n$ is n!, which is often orders of magnitude larger than the number of training samples. This is one of the difficulties in dealing with combinatorial spaces such as $\mathcal{P}^n$.
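As a toy illustration of this setup (with the hypothetical criterion c taken to be ascending numeric order, and all variable names illustrative rather than from the paper):

```python
import numpy as np

x = np.array([10, 20, 30, 40])       # original ordered sequence X_i
perm = np.array([2, 0, 3, 1])        # hidden shuffle
x_shuffled = x[perm]                 # observed shuffled input

# Permutation matrix with a single 1 per row and column: row j of P
# is the perm[j]-th row of the identity, so that P @ x == x_shuffled.
P = np.eye(len(x))[perm]

assert np.allclose(P @ x, x_shuffled)
# For permutation matrices the inverse is the transpose, so P^{-1}
# recovers the original ordered sequence from the shuffled one:
assert np.allclose(P.T @ x_shuffled, x)
```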
Moreover, permutations are represented as discrete points in Euclidean space. This discrete nature typically prohibits gradient-based optimization, because such solvers work by iteratively moving towards an optimum via stochastic updates along gradient directions. A large part of the dedicated literature has therefore proposed relaxations of the permutation matrix, the best known of which involve the Sinkhorn operator [18]. This operator takes as input a non-negative square matrix and outputs a doubly stochastic matrix by iteratively normalizing rows and columns [1].

Another approach consists of embedding the permutations into continuous spaces. An example of this line of work is given by [7], which proposes three different encodings suitable to act as embeddings for the task of label ranking. Among these encodings, Lehmer codes showed great algorithmic advantages, in particular the fact that the encoding and decoding steps have a cost of O(n).

Figure 1: The Birkhoff (left) and Lehmer (right) polytopes of size 3. Permutation matrices are in the column representation.

2.3 Permutation matrices

As previously stated, there are several ways to represent permutations. In matrix theory, a permutation matrix is a square matrix with the peculiarity of having exactly a single unit value in each row and each column, and zeros elsewhere. Permutation matrices are a natural and concise way to represent permutations. In fact, given an ordered sequence of elements represented as a vector $\pi_{I_n}$, it is possible to derive the permutation $\pi_n$ from the permutation matrix $P_n \in \{0,1\}^{n \times n}$ simply by performing a matrix-vector multiplication:

$$\pi_n = P_n \pi_{I_n} \qquad (1)$$

where $P_n$ is obtained by swapping the rows of the identity matrix according to the desired permutation $\pi_n$. The set of n × n permutation matrices has cardinality n! and is a subset of the non-singular matrices in $\mathbb{R}^{n \times n}$.

The Birkhoff–von Neumann theorem states that every doubly stochastic matrix is a convex combination of a finite number of permutation matrices. The set of n × n doubly stochastic matrices is called the Birkhoff polytope (cf. Figure 1, left). A natural consequence that arises from this theorem is to consider doubly stochastic matrices as a convex relaxation of permutation matrices. However, it is difficult to learn matrices with this particular structure. The dedicated literature has proposed Sinkhorn normalization [18, 5] to ensure that rows and columns sum to one; this procedure works by iteratively normalizing the rows and columns of the square matrix.

2.4 Lehmer codes

Several works in theoretical computer science and discrete mathematics deal with effective ways of representing permutations. The Lehmer code [9] is a particular way to encode each possible permutation of a sequence of n numbers. Concretely, it is a bijection that maps every permutation of $\llbracket N \rrbracket = \{1, \dots, N\}$ to a function $\phi : \llbracket N \rrbracket \to \{0, 1, \dots, N-1\}$. The Lehmer code, also known as the inversion vector, is a word of the form $c_\pi \in C_N \triangleq \{0\} \times \{0,1\} \times \{0,1,2\} \times \cdots \times \{0,1,\dots,N-1\}$, where, for $j = 1, \dots, N$,

$$c_\pi(j) = \#\{\, i \in \llbracket N \rrbracket : i < j,\ \pi(i) > \pi(j) \,\} \qquad (2)$$

Thus, the coordinate $c_\pi(j)$ represents the number of elements i with index smaller than j that are ranked higher than j in the permutation π. Consider the following example, which shows the canonical set of items e, a permutation π, and the corresponding Lehmer code $c_\pi$:

e    1 2 3 4 5 6 7 8 9
π    2 1 4 5 7 3 6 9 8
cπ   0 1 0 0 0 3 1 0 1
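As a concrete check of Eq. (2) and of the example above, here is a naive Python encoder; it is a direct O(n²) transcription of the definition, whereas the O(n) encoders referenced earlier require additional bookkeeping.

```python
def lehmer_encode(pi):
    """c_pi(j) = #{i < j : pi(i) > pi(j)}, cf. Eq. (2)."""
    return [sum(pi[i] > pi[j] for i in range(j)) for j in range(len(pi))]

pi = [2, 1, 4, 5, 7, 3, 6, 9, 8]
c = lehmer_encode(pi)
assert c == [0, 1, 0, 0, 0, 3, 1, 0, 1]   # matches the table above
# The L1 norm of the code counts the inversions of pi, i.e., its
# Kendall tau distance to the identity permutation:
assert sum(c) == 6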
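Decoding inverts this map: among the first j entries of π, exactly $c_\pi(j)$ exceed π(j), so, filling positions from right to left, π(j) is the (j − c_π(j))-th smallest of the still-unused values. The sketch below is O(n²) with plain Python lists; the linear-time decoding attributed to [7] would need a more careful data structure.

```python
def lehmer_decode(c):
    """Recover the permutation pi (1-based values) from its Lehmer code."""
    n = len(c)
    remaining = list(range(1, n + 1))        # unused values, kept sorted
    pi = [0] * n
    for j in range(n, 0, -1):                # positions n, ..., 1
        # pi(j) is the (j - c(j))-th smallest remaining value
        pi[j - 1] = remaining.pop(j - c[j - 1] - 1)
    return pi

assert lehmer_decode([0, 1, 0, 0, 0, 3, 1, 0, 1]) == [2, 1, 4, 5, 7, 3, 6, 9, 8]
```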