Discriminative Probabilistic Prototype Learning

Edwin V. Bonilla  [email protected]
NICTA & Australian National University, Locked Bag 8001, Canberra ACT 2601, Australia

Antonio Robles-Kelly  [email protected]
NICTA & Australian National University, Locked Bag 8001, Canberra ACT 2601, Australia

Abstract

In this paper we propose a simple yet powerful method for learning representations in scenarios where an input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities. We represent an input datapoint as a K-dimensional vector, where each component is a mixture of probabilities over its corresponding set of feature vectors. Each probability indicates how likely a feature vector is to belong to one-out-of-K unknown prototype patterns. We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and can therefore be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. We show that our model can be seen as a probabilistic generalization of learning vector quantization (LVQ). We apply our method to the problems of shape classification, hyperspectral imaging classification and people's work class categorization, showing the superior performance of our method compared to the standard prototype-based classification approach and other competitive benchmarks.

1. Introduction

A fundamental problem in machine learning is that of coming up with useful characterizations of the input so that we can achieve better generalization capabilities with our learning algorithms. We refer to this problem as that of learning representations. This has been a long-standing goal in machine learning and has been addressed throughout the years from different perspectives. In fact, one of the simplest and oldest attempts to tackle this problem was given by Rosenblatt (1962) with the Perceptron algorithm for classification problems. In such an approach it was suggested that we can have non-linear mappings of the inputs so that the obtained representation allows us to discriminate between the classes with a simple linear function. However, the mapping (or "features") had to be engineered beforehand instead of being learned from the available data. Neural networks and their back-propagation algorithm (Rumelhart et al., 1986) became popular because they offered an automatic way of learning flexible representations by introducing the so-called hidden layers and hidden units into multilayer Perceptron architectures. Kernel-based algorithms and, in particular, support vector machines (SVMs, Scholkopf & Smola, 2001) offered a clever alternative to neural nets, circumventing the problem of learning representations by using kernels to map the input into feature spaces where the patterns are likely to be linearly separable. However, SVMs are inherently non-probabilistic and unsuitable for applications where one requires uncertainty measures around their predictions.

In this paper we propose a simple yet powerful approach to learning representations for classification problems where an original input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities, degrees of membership or noisy labels. Our approach to this problem is to represent an input datapoint as a K-dimensional vector, where each component is a mixture of probabilities over its corresponding set of feature vectors. Each probability indicates how likely a feature vector is to belong to one-out-of-K unknown prototype patterns.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and can therefore be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. To our knowledge, previous approaches have not addressed the problem of discriminative prototype learning within a consistent multi-class probabilistic framework (see Section 5 for details).

2. Problem Setting

In this paper we are interested in multi-class classification problems for which an input point is characterized by a set of feature vectors S = {x^(1), ..., x^(M)}, where each feature vector may describe, for example, some local characteristics of the input point. Additionally, we consider the general case where the outputs may be given by soft labels indicating, for example, class probabilities, degrees of membership or noisy class labels. Hence, our goal is to build a probabilistic classifier based on given training data, which is comprised of the tuples D = {(S^(n), P̃^(n)), n = 1, ..., N}, where S^(n) is the set of feature vectors in the nth tuple. In general, the number of feature vectors is different for each input and therefore we shall denote it by M_n. Similarly, P̃^(n) ∈ [0, 1]^C corresponds to the C-dimensional vector of soft labels (e.g. empirical probabilities) associated with the C output classes of the nth training instance. Obviously, these probabilities are constrained by ∑_{j=1}^C P̃(y^n = j) = 1, where y^n denotes the latent class assignment of datapoint n.

A common approach to prototype-based learning describes an input by a histogram of words from a vocabulary of size K. This histogram is commonly known as the bag-of-words representation. In order to specify the vocabulary it is customary to use clustering methods such as K-means or generative models such as Gaussian mixtures, which are often used as a disjoint step before training a specific classifier. The model for extracting such representations is given by:

    f_k(x) = \begin{cases} 1 & \text{iff } \|\mu_k - x\| < \|\mu_j - x\| \ \forall j \neq k \\ 0 & \text{otherwise} \end{cases}    (1)

    z_k^{(n)} = \sum_{x \in S^{(n)}} \pi_k f_k(x),    (2)

where z is a K-dimensional vector to be used as the input representation for a specific classifier; {µ_k}_{k=1}^K are usually referred to as the centers; and π_k is set to 1/M_n. We will refer to each f_k(x) as a prototype function, as it performs the encoding of each D-dimensional vector x into its corresponding (binary) K-dimensional representation. It is important to realize that this method is a winner-takes-all approach where each feature vector x is assigned to only one centre. This is the main motivation for our method, where we relax this assumption and propose a fully discriminative probabilistic model for learning such representations.

Our model assumes a bag-of-words representation as given by Equation (2). However, here we consider that the prototypes are probabilities and are given by:

    f_k(x) = \frac{\exp(-\beta \|\mu_k - x\|^2)}{\sum_{j=1}^{K} \exp(-\beta \|\mu_j - x\|^2)},    (3)

where β is a rate parameter (or inverse temperature). Note that when β → ∞ Equation (3) becomes equivalent to the hard limit in Equation (1). Therefore, f_k(x) is the probability of feature vector x belonging to "cluster" k, and z_k is a mixture of these probabilities.

In addition to defining how to map the set of input vectors into parameterized probabilistic prototype representations, our method requires the definition of a discriminative probabilistic classifier. In principle, this could be any classifier that focuses on defining the conditional probability

    \sigma^i(z(X); \Theta) \overset{\text{def}}{=} p(y = i \mid z(X), \Theta)    (4)

directly in terms of our prototype representation z. Here we have used X = {x^(j)}_{j=1}^M and made explicit the dependency of z on its corresponding feature vectors. In the sequel, for simplicity of notation, we will drop this dependency. Note that our model is a (conditional) directed probabilistic model. This contrasts with other approaches such as latent-variable CRFs, which are undirected graphical models. As we shall see later, we will focus on a softmax classifier due to the simplicity and efficacy of this parametric model. However, it is clear that we can also incorporate non-parametric classifiers. We expect such approaches to be more effective than their parametric counterparts and we postpone their study to future work.

3. Parameter Learning

In this section we are interested in learning the parameters of our discriminative probabilistic prototype framework. These parameters are: the rate parameter β, the vocabulary or centers {µ_k}_{k=1}^K, and the parameters of the discriminative classifier Θ. This could be effected using a number of optimization methods, including simulated annealing and Markov Chain Monte Carlo. Here we propose direct gradient-based optimization of the data log-likelihood.
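To make the mapping above concrete, the following sketch (our own hypothetical NumPy code, not part of the paper; array shapes and the value of beta are illustrative assumptions) computes the soft encoding of Equations (2)–(3) that produces the representation z consumed by the classifier of Equation (4).

```python
import numpy as np

def soft_prototype_representation(X, mu, beta):
    """Map a set of M feature vectors X (M x D) onto the K-dimensional
    representation z of Equations (2)-(3).

    mu   : K x D array of prototype centers mu_k.
    beta : rate (inverse temperature) parameter; as beta -> infinity the
           soft assignments approach the hard rule of Equation (1).
    """
    # Squared distances ||mu_k - x||^2 for every (x, mu_k) pair: shape M x K.
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignments f_k(x): a numerically stable softmax over -beta * distance^2.
    logits = -beta * sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    f = np.exp(logits)
    f /= f.sum(axis=1, keepdims=True)
    # z_k = sum_x pi_k f_k(x) with pi_k = 1/M_n (Equation (2)).
    return f.mean(axis=0)

# Toy usage: M = 5 feature vectors in D = 2 dimensions, K = 3 prototypes.
X = np.random.randn(5, 2)
mu = np.random.randn(3, 2)
z = soft_prototype_representation(X, mu, beta=2.0)
print(z, z.sum())   # z is a K-dimensional probability vector (sums to 1)
```

Because each row of f sums to one and π_k = 1/M_n, the resulting z is itself a K-dimensional probability vector, which is the sense in which each z_k mixes the probabilities f_k(x).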

3.1. Direct Likelihood Maximization with Gradient-Based Methods

Assuming iid data, the log-likelihood of the model parameters given the data can be expressed as:

    L(\Theta, \{\mu_k\}_{k=1}^{K}, \beta) = \sum_{n=1}^{N} L^n(\Theta, \{\mu_k\}_{k=1}^{K}, \beta)    (5)

    = \sum_{n=1}^{N} \tilde{P}(y^n) \log P\big(y^n \mid z^n(\Theta, \{\mu_k\}_{k=1}^{K}, \beta)\big),    (6)

where P̃(y^n) refers to the soft labels associated with input n, e.g. the empirical class probabilities. In order to optimize the data log-likelihood we use a quasi-Newton method (BFGS) to which we provide the following gradient information:

    \nabla_{\mu^\ell} L = \sum_{n=1}^{N} \nabla_{\mu^\ell} L^n, \qquad \frac{\partial L}{\partial \beta} = \sum_{n=1}^{N} \frac{\partial L^n}{\partial \beta}    (7)

and

    \nabla_{\mu^\ell} L^n = (G^{(n,\ell)})^T g^{(n)},    (8)

where:

    g^{(n)} \overset{\text{def}}{=} \nabla_{z^n} L^n, \qquad G^{(n,\ell)} \overset{\text{def}}{=} \nabla_{\mu^\ell} z^n    (9)

    g_k^{(n)} = \frac{\partial L^n}{\partial z_k^n}, \qquad G_{k,d}^{(n,\ell)} = \frac{\partial z_k^n}{\partial \mu_d^\ell}    (10)

for k, ℓ = 1, ..., K and d = 1, ..., D. The advantage of the above formulation is that the partial derivatives wrt the cluster centers can be computed in closed form. Although these updates are straightforward to derive, we specify all the details here for completeness and for their later use when showing the relation between our method and LVQ in Section 3.3. The derivatives are given as follows:

    \frac{\partial L^n}{\partial \mu_d^\ell} = \sum_{k=1}^{K} \frac{\partial L^n}{\partial z_k^n} \frac{\partial z_k^n}{\partial \mu_d^\ell}.    (11)

Moreover, by using Equations (2) and (3) we have that:

    \nabla_{\mu^\ell} z_k^n = \sum_{x \in S^n} \nabla_{\mu^\ell} f_k(x)    (12)

    = \begin{cases} 2\beta \sum_{x \in S^n} f_k(x)\,(1 - f_k(x))\,(x - \mu^\ell), & \text{iff } k = \ell \\ 2\beta \sum_{x \in S^n} f_k(x)\, f_\ell(x)\,(\mu^\ell - x), & \text{otherwise.} \end{cases}    (13)

A similar approach can be followed to compute the partial derivatives wrt β. Hence we have that:

    \frac{\partial L^n}{\partial \beta} = \sum_{k=1}^{K} \frac{\partial L^n}{\partial z_k^n} \frac{\partial z_k^n}{\partial \beta}    (14)

    \frac{\partial z_k^n}{\partial \beta} = \sum_{x \in S^n} f_k(x) \left( \sum_{\ell=1}^{K} \|\mu^\ell - x\|^2 f_\ell(x) - \|\mu^k - x\|^2 \right).    (15)

3.2. Discriminative Parametric Model: Softmax Classifier

For the case of a parametric model such as the softmax classifier:

    P(y = i \mid z, \Theta) = \frac{\exp((\theta^i)^T z)}{\sum_{j=1}^{C} \exp((\theta^j)^T z)} \overset{\text{def}}{=} \sigma^i(z; \Theta),    (16)

the local log-likelihood terms can be written as:

    L^n = \sum_{j=1}^{C} \tilde{P}(y^n = j) \log \sigma^j(z^n; \Theta).    (17)

The corresponding derivatives wrt the prototype representations are given by:

    \nabla_{z^n} L^n = \sum_{j=1}^{C} \left( \tilde{P}(y^n = j) - \sigma^j(z^n; \Theta) \right) \theta^j.    (18)

Finally, we also need the gradient information wrt the model parameters Θ:

    \nabla_{\theta^i} L = \sum_{n=1}^{N} \left( \tilde{P}(y^n = i) - \sigma^i(z^n; \Theta) \right) z^n.    (19)

Equations (5) and (19) can be modified so as to include a regularization term -\lambda\, \mathrm{tr}(\Theta^T \Theta), where λ is a regularization parameter; tr(·) is the trace operator; and Θ is the matrix of weights with columns given by each θ^i, i = 1, ..., C. It is well known that the solution for the weights in this case corresponds to the MAP solution when considering a Gaussian prior over the weights, θ^i ~ N(θ^i | 0, λI). We can have a similar prior or regularization for β.

We note here that optimization of the objective function in Equation (5) wrt all the parameters is a non-convex problem. However, as we shall see in Section 4, we use coordinate ascent, and optimization of the model parameters Θ, given all others fixed, is a convex problem.
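As an illustration of Equations (16)–(19), the sketch below (hypothetical code under our own naming conventions, not the authors' implementation) evaluates the soft-label log-likelihood of the softmax classifier and its gradient with respect to Θ, including the -λ tr(Θ^T Θ) regularizer mentioned above; the closed-form gradients with respect to the centers and β in Equations (11)–(15) would be supplied to BFGS in the same way.

```python
import numpy as np

def softmax_probs(Z, Theta):
    """sigma^i(z; Theta) of Equation (16) for N inputs.
    Z: N x K prototype representations; Theta: K x C, column i is theta^i."""
    logits = Z @ Theta
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def log_likelihood_and_grad(Z, P_tilde, Theta, lam=0.0):
    """Soft-label log-likelihood (Equation (17) summed over n) and its gradient
    wrt Theta (Equation (19)), with the optional -lam * tr(Theta^T Theta) term."""
    S = softmax_probs(Z, Theta)                        # N x C
    ll = np.sum(P_tilde * np.log(S)) - lam * np.sum(Theta ** 2)
    grad = Z.T @ (P_tilde - S) - 2.0 * lam * Theta     # K x C, matches Eq. (19)
    return ll, grad

# Toy usage: N = 4 inputs, K = 3 prototypes, C = 2 classes with soft labels.
Z = np.random.rand(4, 3)
P_tilde = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]])
Theta = np.zeros((3, 2))
ll, grad = log_likelihood_and_grad(Z, P_tilde, Theta, lam=0.1)
```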

3.3. Relation to LVQ

Learning Vector Quantization (LVQ, Kohonen, 1990) is a prototype-based learning algorithm that uses the class label information to adapt the prototypes (i.e. the cluster centers). The idea is that the prototypes should move towards the training examples in their corresponding class and away from training examples with different labels. As in the k-means algorithm, the assignment of datapoints to prototypes corresponds to the rule in Equation (1). If we were to update the prototypes in our model with gradient ascent, this update would become:

    \mu_\ell^{(\text{new})} = \mu_\ell^{(\text{old})} + \eta \nabla_{\mu_\ell} L,    (20)

where η is the learning rate. By expanding Equation (11), substituting Equations (12) and (18), and assuming hard labels, i.e. P̃(y^n) is 1 for only one label and zero for all the others, we obtain:

    \mu^{\ell\,(\text{new})} = \mu^{\ell} + \eta \Bigg[ \sum_{x \in S^n} \sum_{k=1}^{K} \Big( \theta_k^{y^n} - \sum_{j=1}^{C} \theta_k^{j} \sigma^{j}(z; \theta) \Big) f_k(x) f_\ell(x) (\mu^{\ell} - x) + \sum_{x \in S^n} \Big( \theta_\ell^{y^n} - \sum_{j=1}^{C} \theta_\ell^{j} \sigma^{j}(z; \theta) \Big) f_\ell(x) (\mu^{\ell} - x) \Bigg].    (21)

By taking the hard limit for the assignment of the samples to a single prototype (and assuming that all samples corresponding to a single datapoint are assigned to the same cluster), i.e. f_ℓ(x) = 1 and, consequently, f_k(x) = 0 for k ≠ ℓ, we have:

    \mu^{\ell\,(\text{new})} = \mu^{\ell} + \eta \sum_{x \in S^n} \Big( \theta_\ell^{y^n} - \sum_{j=1}^{C} \theta_\ell^{j} \sigma^{j}(z; \theta) \Big) (\mu^{\ell} - x).    (22)

If |S^n| = 1 then the factor premultiplying (µ^ℓ − x) can be absorbed into the learning rate and, therefore, we obtain the LVQ update. In our model, this factor is interpreted as the difference between the parameter corresponding to prototype ℓ for the label of the current datapoint, i.e. θ_ℓ^{y^n}, and the average (wrt the posterior probabilities) of the parameters of the other classes. As a result, we can view the gradient-ascent updates for our method as a relaxed version of LVQ.

4. Experiments and Results

In this section we present results on synthetic data, shape classification, hyperspectral image classification and people's work class categorization. For all our experiments, we employ our prototype learning approach, where we iterate the learning of the prototype parameters {µ_ℓ}_{ℓ=1}^K, β and the model parameters Θ via coordinate ascent on the model likelihood (a schematic code sketch of this loop is given below). We have initialized the cluster centers making use of k-means and varied the size of the vocabulary, i.e. K, according to the experimental vehicle. To illustrate the behavior of our algorithm and to show its performance with respect to competitive benchmarks, the first three problems studied (synthetic data, shape classification, hyperspectral image classification) consider the common case of hard labels, and our last experiment on people's work class categorization investigates the use of class probabilities as soft labels.

4.1. Illustration on Synthetic Data

These experiments are based on a set of 20 datapoints, each comprised of a set of M_n ∈ {1, ..., 20} feature vectors in a 2-D space. Here we compare the baseline model yielded by k-means clustering (with K = 2) against our discriminative model. Figure 1 shows the original datapoints in the two-dimensional space along with the cluster centers (top). We see that the cluster centers learned by our discriminative approach (Panel b) are quite different, even though the clusters obtained by k-means (Panel a) are very close to those used to generate the data. In the bottom panels we show the prototype representation given by both methods, and we observe that the representation learned by our method is much more discriminative as it takes into consideration the class labels.

4.2. MPEG-7 Dataset

Our first real dataset is given by the MPEG-7 CE-1 Part B database (Latecki et al., 2000), which we will refer to as the mpeg-7 dataset. This contains 1400 binary shapes organized in 70 classes, each comprised of 20 images. We have sampled 1 in every 10 pixels on the shape contours and we have built a fully connected graph whose edge-weights are given by the Euclidean distances between each pair of pixel locations. These weights are then normalized to be in the interval [0, 1]. The feature vectors are given by the frequency histograms of these distances for every node. In our experiments, we have used 10 bins for the frequency histogram computation and set K = 200.

For purposes of shape categorization, we compare our method to three alternatives. The first one is a prototype-based baseline akin to the bag-of-words approaches in computer vision (Fei-Fei & Perona, 2005). Our baseline recovers prototypes via k-means clustering.
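The overall training protocol mentioned at the start of this section (k-means initialization followed by coordinate ascent over Θ and over {µ_ℓ}, β) could be sketched as follows. This is a hypothetical, simplified illustration on synthetic toy data rather than the authors' code: it relies on SciPy's numerical gradients instead of the closed-form derivatives of Section 3.1, and all names, sizes and hyperparameter values are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def encode(S, mu, beta):
    """Soft prototype representation z (Equations (2)-(3)) for one set S (M x D)."""
    d2 = ((S[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logits = -beta * d2
    f = np.exp(logits - logits.max(axis=1, keepdims=True))
    f /= f.sum(axis=1, keepdims=True)
    return f.mean(axis=0)

def neg_ll(theta_flat, Z, P, K, C, lam=1e-2):
    """Negative soft-label log-likelihood of the softmax classifier (Equation (17))."""
    Theta = theta_flat.reshape(K, C)
    logits = Z @ Theta
    logits -= logits.max(axis=1, keepdims=True)
    log_sigma = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(P * log_sigma).sum() + lam * (Theta ** 2).sum()

# Toy data: N inputs, each described by M feature vectors in D dimensions.
rng = np.random.default_rng(0)
N, M, D, K, C = 40, 10, 2, 3, 2
sets = [rng.normal(loc=rng.choice([-2.0, 2.0]), size=(M, D)) for _ in range(N)]
P = rng.dirichlet(np.ones(C), size=N)                        # soft labels

mu = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(sets)).cluster_centers_
beta, Theta = 1.0, np.zeros((K, C))

for _ in range(3):                                           # coordinate ascent
    Z = np.array([encode(S, mu, beta) for S in sets])
    # (i) classifier parameters Theta given the prototypes: a convex sub-problem.
    Theta = minimize(neg_ll, Theta.ravel(), args=(Z, P, K, C)).x.reshape(K, C)

    # (ii) prototype parameters (mu, beta) given Theta: a non-convex sub-problem.
    def proto_obj(v):
        m, b = v[:-1].reshape(K, D), np.exp(v[-1])           # log-parameterize beta > 0
        Zv = np.array([encode(S, m, b) for S in sets])
        return neg_ll(Theta.ravel(), Zv, P, K, C)

    v = minimize(proto_obj, np.append(mu.ravel(), np.log(beta)), method="BFGS").x
    mu, beta = v[:-1].reshape(K, D), np.exp(v[-1])
```

In practice one would supply the analytic gradients of Equations (7)–(15) and (19) to the optimizer, as described in Section 3.1, rather than relying on finite differences.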

[Figure 1 appears here: four panels. Panels (a) and (b) plot the synthetic data and cluster centers in the (x_1, x_2) plane; panels (c) and (d) plot the corresponding prototype representations in the (z_1, z_2) plane. See the caption below.]

Figure 1. Illustration of our model's discriminative power on toy data. (a) The original two-dimensional data drawn from a mixture of Gaussians with two components: one is an isotropic Gaussian with variance 1 and the other one is a correlated Gaussian with covariance 0.95 and variance 1 on both dimensions. An input is represented by several datapoints in the plot and the cluster centers as learned by k-means are shown as black dots. (b) The same data with the cluster centers as learned by our discriminative probabilistic prototype classifier. (c) The representation obtained when using a standard prototype-based approach (k-means). (d) The representation learned when using our discriminative probabilistic prototype model.

The shapes under study are then described by the prototype-based representation, which we then employ as input to a classifier. The classifier used for this baseline is also a softmax classifier, as in our method. In other words, the only difference between this baseline method and our approach (probabilistic prototype) is how the prototypes are learned.

The other two alternatives are specifically designed for purposes of shape classification. These are the shape and skeletal contexts in Belongie et al. (2002) and Xie et al. (2008), respectively. Once the shape and skeletal contexts are at hand, we train one-versus-all SVM classifiers whose parameters have been selected through ten-fold cross validation. For all our shape categorization experiments, we have divided the graphs in the dataset into a training and a testing set. The training set comprises 50% randomly selected graphs, i.e. 700, and the other 50% of the data was used for testing. For these partitions we have effected five trials. The categorization results are shown in Table 1. The table shows the mean percentage of correctly classified shapes and the corresponding variance. Despite the basic strategy taken for the construction of our graphs, which contrasts with the specialized nature of the skeletal and shape contexts, our probabilistic prototype method outperforms the alternatives.

4.3. SPECTRAL Dataset

This application considers hyperspectral imagery from real-world scenes that include various types of materials. We have annotated these images at a pixel level considering 10 different classes (C = 10): tree trunk, light poles, shadow on grass, grass, road, white line on road, shadow on road, leaves, sky, and white regions on sky. We considered 24 different images from which we have extracted 1,746,708 data points with their corresponding labels. In order to characterize an input data point with a set of vectors (S^n) we have considered neighborhood information according to 7×7 windows. Hence, for each data point we have 49 feature vectors. We have subsampled the data so as to include 1000 training instances and 16643 test instances across 10 different partitions.
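As an illustration of this per-pixel characterization, the hypothetical sketch below builds the 49-vector sets S^n from 7×7 windows of a hyperspectral cube; the number of bands, the edge padding at the image borders and all array names are our own assumptions, since the paper does not specify these details.

```python
import numpy as np

def pixel_feature_sets(cube, window=7):
    """For a hyperspectral cube of shape (H, W, B), return for every pixel the
    spectra of its window x window neighbourhood, i.e. window**2 feature vectors
    per data point (49 for a 7x7 window), as an array of shape (H*W, window**2, B).
    Border pixels are handled here by edge padding (an assumption)."""
    r = window // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W, B = cube.shape
    sets = np.empty((H * W, window * window, B))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + window, j:j + window, :]   # neighbourhood of pixel (i, j)
            sets[i * W + j] = patch.reshape(-1, B)
    return sets

# Toy example: a 16 x 16 image with 5 spectral bands.
cube = np.random.rand(16, 16, 5)
S = pixel_feature_sets(cube)    # shape (256, 49, 5): 49 feature vectors per pixel
```

Each row of S then plays the role of one set S^n that is mapped to a representation z^n as in Section 2.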

Table 1. Shape categorization results for the mpeg-7 dataset. The mean classification accuracy is shown along with (±) one standard deviation when using 50% of the data for training and the rest for testing. Our probabilistic prototype approach outperforms all other baseline methods.

    Probabilistic Prototype    Standard Prototype    Shape Context (Belongie et al., 2002)    Skeletal Context (Xie et al., 2008)
    85.49 ± 1.43 %             82.20 ± 0.98 %        77.55 ± 2.39 %                           79.91 ± 1.78 %

Note that our method is effectively learning the prototypes that can be used to represent the spectra of the objects in the scene as a linear combination of the spectral signatures of their material constituents. In geosciences and process control this is known as unmixing (Bergman, 2006). Current unmixing methods assume availability of the end-member spectra, which usually involves a cumbersome labeling of the end-member data, effected through expert intervention.

For comparison we use the standard prototype approach (based on k-means for learning the centers) and the InfoLoss method (Lazebnik & Raginsky, 2009). InfoLoss adapts the prototypes based on the optimization of an information-theoretical criterion. For this method, we have used an in-house implementation whose parameter settings have been set as described in Lazebnik & Raginsky (2009). For the standard prototype approach and InfoLoss, as in our method, we have used a softmax classifier so as to allow a direct representation-quality comparison via the classification rates for our approach and those yielded by the alternatives. In Table 2, we show the accuracy of the methods evaluated along with two standard errors. As before, probabilistic prototype refers to our method and standard prototype refers to the representation computed via the direct application of k-means for learning the centers. From the table, we can conclude that our method outperforms the alternatives by delivering a mean categorization rate of 81.25%.

4.4. Test Likelihoods on MPEG-7 and SPECTRAL

Now we turn our attention to the evaluation of the methods from a probabilistic point of view by reporting the likelihood of the models on the test data. In Figure 2(a), we show the test data log-likelihood for our probabilistic prototype approach and the baseline. Each bar corresponds to the mean across the five trials and the segments account for two standard errors. Note that the log-likelihood for our approach is significantly greater than the one yielded by the standard approach. This is because the performance measures above account only for the correctness of the labels yielded by the classifier, devoid of their probability of error. Thus, our method not only delivers better performance than the alternatives, but also provides better probability estimates. In Figure 2(b) we show similar results for the spectral dataset. As with the mpeg-7 dataset, the test data log-likelihood for the spectral dataset delivered by our approach is greater than the one yielded by the alternatives and shows less variability.

4.5. ADULT Dataset

Finally, we applied our method to the adult dataset from the UCI machine learning repository (Frank & Asuncion, 2010). This dataset has been extracted from census-based data and the original task was to predict whether a person makes over $50K a year. It contains continuous and discrete input variables including education, age, work class (with eight different categories), etc. In order to use this dataset as a test benchmark for our method, we have grouped people's records according to the values of the discrete variables except native country and work class, and we have selected the latter variable as our target labels. Hence, we have an 8-dimensional output variable. We represented an input instance by the set of vectors that belong to the same group (i.e. with the same values for all the discrete features excluding native country). Similarly, we have computed soft labels (P̃) by calculating for each group the proportion of records that have the same value for the corresponding label. Finally, we have sub-sampled the data to have 4146 different instances and considered 50% for training and the other 50% for testing. The regularization parameters for our model and the baseline have been set via cross-validation.

The (average) KL-divergence between the true (empirical) class distribution and the predictive distribution is 1.12 bits, and if the soft labels were thresholded to provide a hard class assignment the accuracy of the model would be 77%. One interesting application of our model on this dataset is the automatic "discovery" of the latent population prototypes and how these relate to the class labels under consideration. Due to space limitations, we have not included these results here and postpone their analysis to future work.
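The grouping and soft-label construction described above, and the KL-based evaluation, could be sketched as follows (hypothetical pandas/NumPy code; the column names and the tiny example table are illustrative and are not the UCI attribute names).

```python
import numpy as np
import pandas as pd

# Records sharing the same values of the grouping attributes form one input; the
# empirical distribution of the target attribute within the group is its soft label.
df = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "BSc"],
    "sex":       ["M",  "M",  "F",   "F",   "F"],
    "workclass": ["Private", "Gov", "Private", "Private", "SelfEmp"],
})
group_cols = ["education", "sex"]          # discrete features used for grouping
classes = sorted(df["workclass"].unique())

soft_labels = (
    df.groupby(group_cols)["workclass"]
      .value_counts(normalize=True)        # proportion of each target value per group
      .unstack(fill_value=0.0)
      .reindex(columns=classes, fill_value=0.0)
)
print(soft_labels)                         # each row sums to 1: the soft labels P~

def mean_kl_bits(P_true, P_pred, eps=1e-12):
    """Average KL divergence (in bits) between empirical and predicted class
    distributions, the evaluation measure reported above."""
    P_true, P_pred = np.asarray(P_true), np.asarray(P_pred)
    return float(np.mean(np.sum(P_true * np.log2((P_true + eps) / (P_pred + eps)), axis=1)))
```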

Table 2. Classification results for the spectral dataset. The mean classification accuracy is shown along with (±) two standard errors when using 100 data-points per class for training and the rest for testing. Our probabilistic prototype method outperforms the baseline methods.

    Probabilistic Prototype    Standard Prototype    InfoLoss (Lazebnik & Raginsky, 2009)
    81.25 ± 0.72 %             75.17 ± 0.49 %        72.16 ± 0.82 %

[Figure 2 appears here: two bar charts of mean test log-likelihood with error bars; panel (a) compares Std Prototype and Our Method on mpeg-7, panel (b) compares Std Prototype, InfoLoss and Our Method on spectral. See the caption below.]

Figure 2. (a) The test data log-likelihood on the mpeg-7 dataset, with K = 200 and a stratified split using 50% of the data for training and the rest for testing. The mean values are reported along with two standard errors across 5 replications of the experiment. (b) The test data log-likelihood on the spectral dataset when using 1000 data-points for training and K = 50.

5. Related Work

Generative models have been proposed as probabilistic generalizations of ad-hoc learning methods, mostly for unsupervised scenarios (see e.g. Bishop et al., 1998). The work by Seo & Obermayer (2002) is related to ours in their attempt to generalize LVQ, but their probabilistic model is inherently generative (see their equation 4). Their objective function is based on likelihood ratios and does not arise naturally from their original model.

Neural networks and, more recently, deep belief networks (DBNs) have proved popular as methods for learning latent representations with the goal of tackling difficult problems in AI (see e.g. Hinton & Salakhutdinov, 2006; Bengio & LeCun, 2007). In particular, our method can be seen as a generalization of radial basis function (RBF) networks that considers multiple probabilistic observations; applies a pooling operator over the set of vectors belonging to the same instance; and optimizes the parameters via a coordinate ascent mechanism. There is also an interesting relation between our proposed learning algorithm and how deep architectures are currently trained, as we initialize our method with k-means, which can be seen as a pre-training stage, followed by a fine-tuning stage, as in DBNs.

In the computer vision community, summarization into a codebook has been used by a number of approaches in the literature. For example, Gemert et al. (2008) replace histograms with kernel density estimation (KDE) in the construction of codebooks and use the extracted features in conjunction with SVMs (a non-probabilistic approach) for classification. It is interesting to mention the difference between InfoLoss (Lazebnik & Raginsky, 2009) and our approach. The methods are similar in spirit but their difference is analogous to the difference between filter and wrapper methods in feature selection. While InfoLoss focuses on a filter metric (an information-theoretical objective function), our method directly includes the specific classifier in the learning process (wrapper). In a similar vein, Boureau et al. (2010) propose the learning of mid-level features for recognition in computer vision. Their work focuses on (multiple) binary classifiers and does not provide consistent probabilistic outputs across all classes. It is well known in the machine learning theoretical literature that simple approaches to combining such classifiers, such as one-against-all, are inconsistent in that, given optimal binary classifiers, this "reduction" may not yield optimal multi-class classifiers (see e.g. Beygelzimer et al., 2009).

Finally, recent work by Bergamo et al. (2011) has proposed an approach to learning compact representations for novel category recognition. Their focus is on learning compact descriptions that can be used, for example, in image retrieval.

6. Conclusions

We have presented a probabilistic model for the discriminative learning of latent representations, which corresponds to a relaxed version of the popular approach of prototype-based classification. From the application viewpoint, our method can be viewed as a discriminative technique that can be used for unmixing in geosciences and remote sensing (Bergman, 2006). It can also be applied to other problems, such as population modeling, where labels and proportions of these can be associated with groups of instances.

Our method requires, in general, a model for a probabilistic discriminative classifier, and for purposes of illustrating the utility of our approach we have used a softmax classifier. However, our approach can, in principle, be used with any discriminative and probabilistic classifier, including non-parametric methods. In the future we aim to explore the combination of non-parametric methods with our prototype learning approach and also to investigate better optimization algorithms for parameter learning, e.g. based on mean field approximations.

Acknowledgments

We thank Sarah Namin and Lars Petersson for providing us with the spectral dataset. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. IEEE TPAMI, 24(4):509–522, 2002.

Bengio, Yoshua and LeCun, Yann. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007.

Bergamo, Alessandro, Torresani, Lorenzo, and Fitzgibbon, Andrew. PiCoDes: Learning a compact code for novel-category recognition. In NIPS 24, 2011.

Bergman, M. Some unmixing problems and algorithms in spectroscopy and hyperspectral imaging. In Applied Imagery and Pattern Recognition Workshop, 2006.

Beygelzimer, Alina, Langford, John, and Ravikumar, Pradeep. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory, 2009.

Bishop, Christopher M., Svensén, Markus, and Williams, Christopher K. I. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.

Boureau, Y-Lan, Bach, Francis, LeCun, Yann, and Ponce, Jean. Learning mid-level features for recognition. In CVPR, 2010.

Fei-Fei, L. and Perona, P. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Gemert, Jan C., Geusebroek, Jan-Mark, Veenman, Cor J., and Smeulders, Arnold W. Kernel codebooks for scene categorization. In ECCV, 2008.

Hinton, Geoffrey and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Kohonen, Teuvo. Improved versions of learning vector quantization. In International Joint Conference on Neural Networks, pp. 545–550, 1990.

Latecki, L. J., Lakamper, R., and Eckhardt, U. Shape descriptors for non-rigid shapes with a single closed contour. In CVPR, 2000.

Lazebnik, Svetlana and Raginsky, Maxim. Supervised learning of quantizer codebooks by information loss minimization. IEEE TPAMI, pp. 1294–1309, 2009.

Rosenblatt, Frank. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Scholkopf, Bernhard and Smola, Alexander J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Seo, Sambu and Obermayer, Klaus. Soft learning vector quantization. Neural Computation, 15:1589–1604, 2002.

Xie, J., Heng, P., and Shah, M. Shape matching and modeling using skeletal context. Pattern Recognition, 41(5):1756–1767, 2008.