Discriminative Probabilistic Prototype Learning

Edwin V. Bonilla  [email protected]
NICTA & Australian National University, Locked Bag 8001, Canberra ACT 2601, Australia

Antonio Robles-Kelly  [email protected]
NICTA & Australian National University, Locked Bag 8001, Canberra ACT 2601, Australia

Abstract

In this paper we propose a simple yet powerful method for learning representations in scenarios where an input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities. We represent an input datapoint as a K-dimensional vector, where each component is a mixture of probabilities over its corresponding set of feature vectors. Each probability indicates how likely a feature vector is to belong to one-out-of-K unknown prototype patterns. We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and can therefore be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. We show that our model can be seen as a probabilistic generalization of learning vector quantization (LVQ). We apply our method to the problems of shape classification, hyperspectral imaging classification and people's work class categorization, showing the superior performance of our method compared to the standard prototype-based classification approach and other competitive benchmarks.

1. Introduction

A fundamental problem in machine learning is that of coming up with useful characterizations of the input so that we can achieve better generalization capabilities with our learning algorithms. We refer to this problem as that of learning representations. This has been a long-standing goal in machine learning and has been addressed throughout the years from different perspectives. In fact, one of the simplest and oldest attempts to tackle this problem was given by Rosenblatt (1962) with the Perceptron algorithm for classification problems. In such an approach it was suggested that we can have non-linear mappings of the inputs so that the obtained representation allows us to discriminate between the classes with a simple linear function. However, the mapping (or "features") had to be engineered beforehand instead of being learned from the available data. Neural networks and their back-propagation algorithm (Rumelhart et al., 1986) became popular because they offered an automatic way of learning flexible representations by introducing the so-called hidden layers and hidden units into multilayer Perceptron architectures. Kernel-based algorithms and, in particular, support vector machines (SVMs, Scholkopf & Smola, 2001) offered a clever alternative to neural nets, circumventing the problem of learning representations by using kernels to map the input into feature spaces where the patterns are likely to be linearly separable. However, SVMs are inherently non-probabilistic and unsuitable for applications where one requires uncertainty measures around their predictions.

In this paper we propose a simple yet powerful approach to learning representations for classification problems where an original input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities, degrees of membership or noisy labels. Our approach to this problem is to represent an input datapoint as a K-dimensional vector, where each component is a mixture of probabilities over its corresponding set of feature vectors. Each probability indicates how likely a feature vector is to belong to one-out-of-K unknown prototype patterns.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and can therefore be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. To our knowledge, previous approaches have not addressed the problem of discriminative prototype learning within a consistent multi-class probabilistic framework (see Section 5 for details).

2. Problem Setting

In this paper we are interested in multi-class classification problems for which an input point is characterized by a set of feature vectors S = {x^(1), ..., x^(M)}, where each feature vector may describe, for example, some local characteristics of the input point. Additionally, we consider the general case where the outputs may be given by soft labels indicating, for example, class probabilities, degrees of membership or noisy class labels. Hence, our goal is to build a probabilistic classifier based on given training data, which is comprised of the tuples D = {(S^(n), P̃^(n)), n = 1, ..., N}, where S^(n) is the set of feature vectors in the nth tuple. In general, the number of feature vectors is different for each input and therefore we shall denote it by M_n. Similarly, P̃^(n) ∈ [0, 1]^C corresponds to the C-dimensional vector of soft labels (e.g. empirical probabilities) associated with the C output classes of the nth training instance. Obviously, these probabilities are constrained by ∑_{j=1}^C P̃(y^n = j) = 1, where y^n denotes the latent class assignment of datapoint n.

A common approach to prototype-based learning describes an input by a histogram of words from a vocabulary of size K. This histogram is commonly known as the bag-of-words representation. In order to specify the vocabulary it is customary to use clustering methods such as K-means or generative models such as Gaussian mixtures, which are often used as a disjoint step before training a specific classifier. The model for extracting such representations is given by:

    f_k(x) = \begin{cases} 1 & \text{iff } \|\mu_k - x\| < \|\mu_j - x\| \ \forall j \neq k \\ 0 & \text{otherwise} \end{cases}    (1)

    z_k^{(n)} = \sum_{x \in S^{(n)}} \pi_k f_k(x),    (2)

where z is a K-dimensional vector to be used as the input representation for a specific classifier; {µ_k}_{k=1}^K are usually referred to as the centers; and π_k is set to 1/M_n. We will refer to each f_k(x) as a prototype function, as it performs the encoding of each D-dimensional vector x into its corresponding (binary) K-dimensional representation. It is important to realize that this method is a winner-takes-all approach where each feature vector x is assigned to only one centre. This is the main motivation for our method, where we relax this assumption and propose a fully discriminative probabilistic model for learning such representations.

Our model assumes a bag-of-words representation as given by Equation (2). However, here we consider that the prototypes are probabilities and are given by:

    f_k(x) = \frac{\exp(-\beta \|\mu_k - x\|^2)}{\sum_{j=1}^{K} \exp(-\beta \|\mu_j - x\|^2)},    (3)

where β is a rate parameter (or inverse temperature). Note that when β → ∞ Equation (3) becomes equivalent to the hard limit in Equation (1). Therefore, f_k(x) is the probability of feature vector x belonging to "cluster" k, and z_k is a mixture of these probabilities.

In addition to defining how to map the set of input vectors into parameterized probabilistic prototype representations, our method requires the definition of a discriminative probabilistic classifier. In principle, this could be any classifier that focuses on defining the conditional probability

    \sigma^i(z(X); \Theta) \overset{\text{def}}{=} p(y = i \mid z(X), \Theta)    (4)

directly in terms of our prototype representation z. Here we have used X = {x^(j)}_{j=1}^M and made explicit the dependency of z on its corresponding feature vectors. In the sequel, for simplicity of notation, we will drop this dependency. Note that our model is a (conditional) directed probabilistic model. This contrasts with other approaches such as latent-variable CRFs, which are undirected graphical models. As we shall see later, we will focus on a softmax classifier due to the simplicity and efficacy of this parametric model. However, it is clear that we can also incorporate non-parametric classifiers. We expect such approaches to be more effective than their parametric counterparts and we postpone their study to future work.

3. Parameter Learning

In this section we are interested in learning the parameters of our discriminative probabilistic prototype framework. These parameters are: the rate parameter β, the vocabulary or centers {µ_k}_{k=1}^K, and the parameters of the discriminative classifier Θ. This could be effected using a number of optimization methods, including simulated annealing and Markov Chain Monte Carlo. Here we propose direct gradient-based optimization of the data log-likelihood.
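To make the mapping above concrete, the following sketch (our own hypothetical NumPy code, not part of the paper; array shapes and the value of beta are illustrative assumptions) computes the soft encoding of Equations (2)–(3) that produces the representation z consumed by the classifier of Equation (4).

```python
import numpy as np

def soft_prototype_representation(X, mu, beta):
    """Map a set of M feature vectors X (M x D) onto the K-dimensional
    representation z of Equations (2)-(3).

    mu   : K x D array of prototype centers mu_k.
    beta : rate (inverse temperature) parameter; as beta -> infinity the
           soft assignments approach the hard rule of Equation (1).
    """
    # Squared distances ||mu_k - x||^2 for every (x, mu_k) pair: shape M x K.
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignments f_k(x): a numerically stable softmax over -beta * distance^2.
    logits = -beta * sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    f = np.exp(logits)
    f /= f.sum(axis=1, keepdims=True)
    # z_k = sum_x pi_k f_k(x) with pi_k = 1/M_n (Equation (2)).
    return f.mean(axis=0)

# Toy usage: M = 5 feature vectors in D = 2 dimensions, K = 3 prototypes.
X = np.random.randn(5, 2)
mu = np.random.randn(3, 2)
z = soft_prototype_representation(X, mu, beta=2.0)
print(z, z.sum())   # z is a K-dimensional probability vector (sums to 1)
```

Because each row of f sums to one and π_k = 1/M_n, the resulting z is itself a K-dimensional probability vector, which is the sense in which each z_k mixes the probabilities f_k(x).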

3.1. Direct Likelihood Maximization with Gradient-Based Methods

Assuming iid data, the log-likelihood of the model parameters given the data can be expressed as:

    L(\Theta, \{\mu_k\}_{k=1}^{K}, \beta) = \sum_{n=1}^{N} L^n(\Theta, \{\mu_k\}_{k=1}^{K}, \beta)    (5)

    = \sum_{n=1}^{N} \tilde{P}(y^n) \log P\big(y^n \mid z^n(\Theta, \{\mu_k\}_{k=1}^{K}, \beta)\big),    (6)

where P̃(y^n) refers to the soft labels associated with input n, e.g. the empirical class probabilities. In order to optimize the data log-likelihood we use a quasi-Newton method (BFGS) to which we provide the following gradient information:

    \nabla_{\mu^\ell} L = \sum_{n=1}^{N} \nabla_{\mu^\ell} L^n, \qquad \frac{\partial L}{\partial \beta} = \sum_{n=1}^{N} \frac{\partial L^n}{\partial \beta}    (7)

and

    \nabla_{\mu^\ell} L^n = (G^{(n,\ell)})^T g^{(n)},    (8)

where:

    g^{(n)} \overset{\text{def}}{=} \nabla_{z^n} L^n, \qquad G^{(n,\ell)} \overset{\text{def}}{=} \nabla_{\mu^\ell} z^n    (9)

    g_k^{(n)} = \frac{\partial L^n}{\partial z_k^n}, \qquad G_{k,d}^{(n,\ell)} = \frac{\partial z_k^n}{\partial \mu_d^\ell}    (10)

for k, ℓ = 1, ..., K and d = 1, ..., D. The advantage of the above formulation is that the partial derivatives wrt the cluster centers can be computed in closed form. Although these updates are straightforward to derive, we specify all the details here for completeness and for their later use when showing the relation between our method and LVQ in Section 3.3. The derivatives are given as follows:

    \frac{\partial L^n}{\partial \mu_d^\ell} = \sum_{k=1}^{K} \frac{\partial L^n}{\partial z_k^n} \frac{\partial z_k^n}{\partial \mu_d^\ell}.    (11)

Moreover, by using Equations (2) and (3) we have that:

    \nabla_{\mu^\ell} z_k^n = \sum_{x \in S^n} \nabla_{\mu^\ell} f_k(x)    (12)

    = \begin{cases} 2\beta \sum_{x \in S^n} f_k(x)\,(1 - f_k(x))\,(x - \mu^\ell), & \text{iff } k = \ell \\ 2\beta \sum_{x \in S^n} f_k(x)\, f_\ell(x)\,(\mu^\ell - x), & \text{otherwise.} \end{cases}    (13)

A similar approach can be followed to compute the partial derivatives wrt β. Hence we have that:

    \frac{\partial L^n}{\partial \beta} = \sum_{k=1}^{K} \frac{\partial L^n}{\partial z_k^n} \frac{\partial z_k^n}{\partial \beta}    (14)

    \frac{\partial z_k^n}{\partial \beta} = \sum_{x \in S^n} f_k(x) \left( \sum_{\ell=1}^{K} \|\mu^\ell - x\|^2 f_\ell(x) - \|\mu^k - x\|^2 \right).    (15)

3.2. Discriminative Parametric Model: Softmax Classifier

For the case of a parametric model such as the softmax classifier:

    P(y = i \mid z, \Theta) = \frac{\exp((\theta^i)^T z)}{\sum_{j=1}^{C} \exp((\theta^j)^T z)} \overset{\text{def}}{=} \sigma^i(z; \Theta),    (16)

the local log-likelihood terms can be written as:

    L^n = \sum_{j=1}^{C} \tilde{P}(y^n = j) \log \sigma^j(z^n; \Theta).    (17)

The corresponding derivatives wrt the prototype representations are given by:

    \nabla_{z^n} L^n = \sum_{j=1}^{C} \left( \tilde{P}(y^n = j) - \sigma^j(z^n; \Theta) \right) \theta^j.    (18)

Finally, we also need the gradient information wrt the model parameters Θ:

    \nabla_{\theta^i} L = \sum_{n=1}^{N} \left( \tilde{P}(y^n = i) - \sigma^i(z^n; \Theta) \right) z^n.    (19)

Equations (5) and (19) can be modified so as to include a regularization term -\lambda\, \mathrm{tr}(\Theta^T \Theta), where λ is a regularization parameter; tr(·) is the trace operator; and Θ is the matrix of weights with columns given by each θ^i, i = 1, ..., C. It is well known that the solution for the weights in this case corresponds to the MAP solution when considering a Gaussian prior over the weights, θ^i ~ N(θ^i | 0, λI). We can have a similar prior or regularization for β.

We note here that optimization of the objective function in Equation (5) wrt all the parameters is a non-convex problem. However, as we shall see in Section 4, we use coordinate ascent, and optimization of the model parameters Θ, given all others fixed, is a convex problem.
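As an illustration of Equations (16)–(19), the sketch below (hypothetical code under our own naming conventions, not the authors' implementation) evaluates the soft-label log-likelihood of the softmax classifier and its gradient with respect to Θ, including the -λ tr(Θ^T Θ) regularizer mentioned above; the closed-form gradients with respect to the centers and β in Equations (11)–(15) would be supplied to BFGS in the same way.

```python
import numpy as np

def softmax_probs(Z, Theta):
    """sigma^i(z; Theta) of Equation (16) for N inputs.
    Z: N x K prototype representations; Theta: K x C, column i is theta^i."""
    logits = Z @ Theta
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def log_likelihood_and_grad(Z, P_tilde, Theta, lam=0.0):
    """Soft-label log-likelihood (Equation (17) summed over n) and its gradient
    wrt Theta (Equation (19)), with the optional -lam * tr(Theta^T Theta) term."""
    S = softmax_probs(Z, Theta)                        # N x C
    ll = np.sum(P_tilde * np.log(S)) - lam * np.sum(Theta ** 2)
    grad = Z.T @ (P_tilde - S) - 2.0 * lam * Theta     # K x C, matches Eq. (19)
    return ll, grad

# Toy usage: N = 4 inputs, K = 3 prototypes, C = 2 classes with soft labels.
Z = np.random.rand(4, 3)
P_tilde = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]])
Theta = np.zeros((3, 2))
ll, grad = log_likelihood_and_grad(Z, P_tilde, Theta, lam=0.1)
```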

3.3. Relation to LVQ

Learning Vector Quantization (LVQ, Kohonen, 1990) is a prototype-based learning algorithm that uses the class label information to adapt the prototypes (i.e. the cluster centers). The idea is that the prototypes should move towards the training examples in their corresponding class and away from training examples with different labels. As in the k-means algorithm, the assignment of datapoints to prototypes corresponds to the rule in Equation (1). If we were to update the prototypes in our model with gradient ascent, this update would become:

    \mu_\ell^{(\text{new})} = \mu_\ell^{(\text{old})} + \eta \nabla_{\mu_\ell} L,    (20)

where η is the learning rate. By expanding Equation (11), substituting Equations (12) and (18), and assuming hard labels, i.e. P̃(y^n) is 1 for only one label and zero for all the others, we obtain:

    \mu^{\ell\,(\text{new})} = \mu^{\ell} + \eta \Bigg[ \sum_{x \in S^n} \sum_{k=1}^{K} \Big( \theta_k^{y^n} - \sum_{j=1}^{C} \theta_k^{j} \sigma^{j}(z; \theta) \Big) f_k(x) f_\ell(x) (\mu^{\ell} - x) + \sum_{x \in S^n} \Big( \theta_\ell^{y^n} - \sum_{j=1}^{C} \theta_\ell^{j} \sigma^{j}(z; \theta) \Big) f_\ell(x) (\mu^{\ell} - x) \Bigg].    (21)

By taking the hard limit for the assignment of the samples to a single prototype (and assuming that all samples corresponding to a single datapoint are assigned to the same cluster), i.e. f_ℓ(x) = 1 and, consequently, f_k(x) = 0 for k ≠ ℓ, we have:

    \mu^{\ell\,(\text{new})} = \mu^{\ell} + \eta \sum_{x \in S^n} \Big( \theta_\ell^{y^n} - \sum_{j=1}^{C} \theta_\ell^{j} \sigma^{j}(z; \theta) \Big) (\mu^{\ell} - x).    (22)

If |S^n| = 1 then the factor premultiplying (µ^ℓ − x) can be absorbed into the learning rate and, therefore, we obtain the LVQ update. In our model, this factor is interpreted as the difference between the parameter corresponding to prototype ℓ for the label of the current datapoint, i.e. θ_ℓ^{y^n}, and the average (wrt the posterior probabilities) of the parameters of the other classes. As a result, we can view the gradient-ascent updates for our method as a relaxed version of LVQ.

4. Experiments and Results

In this section we present results on synthetic data, shape classification, hyperspectral image classification and people's work class categorization. For all our experiments, we employ our prototype learning approach, where we iterate the learning of the prototype parameters {µ_ℓ}_{ℓ=1}^K, β and the model parameters Θ via coordinate ascent on the model likelihood (a schematic code sketch of this loop is given below). We have initialized the cluster centers making use of k-means and varied the size of the vocabulary, i.e. K, according to the experimental vehicle. To illustrate the behavior of our algorithm and to show its performance with respect to competitive benchmarks, the first three problems studied (synthetic data, shape classification, hyperspectral image classification) consider the common case of hard labels, and our last experiment on people's work class categorization investigates the use of class probabilities as soft labels.

4.1. Illustration on Synthetic Data

These experiments are based on a set of 20 datapoints, each comprised of a set of M_n ∈ {1, ..., 20} feature vectors in a 2-D space. Here we compare the baseline model yielded by k-means clustering (with K = 2) against our discriminative model. Figure 1 shows the original datapoints in the two-dimensional space along with the cluster centers (top). We see that the cluster centers learned by our discriminative approach (Panel b) are quite different, even though the clusters obtained by k-means (Panel a) are very close to those used to generate the data. In the bottom panels we show the prototype representation given by both methods, and we observe that the representation learned by our method is much more discriminative as it takes into consideration the class labels.

4.2. MPEG-7 Dataset

Our first real dataset is given by the MPEG-7 CE-1 Part B database (Latecki et al., 2000), which we will refer to as the mpeg-7 dataset. This contains 1400 binary shapes organized in 70 classes, each comprised of 20 images. We have sampled 1 in every 10 pixels on the shape contours and we have built a fully connected graph whose edge-weights are given by the Euclidean distances between each pair of pixel locations. These weights are then normalized to be in the interval [0, 1]. The feature vectors are given by the frequency histograms of these distances for every node. In our experiments, we have used 10 bins for the frequency histogram computation and set K = 200.

For purposes of shape categorization, we compare our method to three alternatives. The first one is a prototype-based baseline akin to the bag-of-words approaches in computer vision (Fei-Fei & Perona, 2005). Our baseline recovers prototypes via k-means clustering.
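The overall training protocol mentioned at the start of this section (k-means initialization followed by coordinate ascent over Θ and over {µ_ℓ}, β) could be sketched as follows. This is a hypothetical, simplified illustration on synthetic toy data rather than the authors' code: it relies on SciPy's numerical gradients instead of the closed-form derivatives of Section 3.1, and all names, sizes and hyperparameter values are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def encode(S, mu, beta):
    """Soft prototype representation z (Equations (2)-(3)) for one set S (M x D)."""
    d2 = ((S[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    logits = -beta * d2
    f = np.exp(logits - logits.max(axis=1, keepdims=True))
    f /= f.sum(axis=1, keepdims=True)
    return f.mean(axis=0)

def neg_ll(theta_flat, Z, P, K, C, lam=1e-2):
    """Negative soft-label log-likelihood of the softmax classifier (Equation (17))."""
    Theta = theta_flat.reshape(K, C)
    logits = Z @ Theta
    logits -= logits.max(axis=1, keepdims=True)
    log_sigma = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(P * log_sigma).sum() + lam * (Theta ** 2).sum()

# Toy data: N inputs, each described by M feature vectors in D dimensions.
rng = np.random.default_rng(0)
N, M, D, K, C = 40, 10, 2, 3, 2
sets = [rng.normal(loc=rng.choice([-2.0, 2.0]), size=(M, D)) for _ in range(N)]
P = rng.dirichlet(np.ones(C), size=N)                        # soft labels

mu = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(sets)).cluster_centers_
beta, Theta = 1.0, np.zeros((K, C))

for _ in range(3):                                           # coordinate ascent
    Z = np.array([encode(S, mu, beta) for S in sets])
    # (i) classifier parameters Theta given the prototypes: a convex sub-problem.
    Theta = minimize(neg_ll, Theta.ravel(), args=(Z, P, K, C)).x.reshape(K, C)

    # (ii) prototype parameters (mu, beta) given Theta: a non-convex sub-problem.
    def proto_obj(v):
        m, b = v[:-1].reshape(K, D), np.exp(v[-1])           # log-parameterize beta > 0
        Zv = np.array([encode(S, m, b) for S in sets])
        return neg_ll(Theta.ravel(), Zv, P, K, C)

    v = minimize(proto_obj, np.append(mu.ravel(), np.log(beta)), method="BFGS").x
    mu, beta = v[:-1].reshape(K, D), np.exp(v[-1])
```

In practice one would supply the analytic gradients of Equations (7)–(15) and (19) to the optimizer, as described in Section 3.1, rather than relying on finite differences.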

[Figure 1 appears here: four panels. Panels (a) and (b) plot the synthetic data and cluster centers in the (x_1, x_2) plane; panels (c) and (d) plot the corresponding prototype representations in the (z_1, z_2) plane. See the caption below.]

Figure 1. Illustration of our model's discriminative power on toy data. (a) The original two-dimensional data drawn from a mixture of Gaussians with two components: one is an isotropic Gaussian with variance 1 and the other one is a correlated Gaussian with covariance 0.95 and variance 1 on both dimensions. An input is represented by several datapoints in the plot and the cluster centers as learned by k-means are shown as black dots. (b) The same data with the cluster centers as learned by our discriminative probabilistic prototype classifier. (c) The representation obtained when using a standard prototype-based approach (k-means). (d) The representation learned when using our discriminative probabilistic prototype model.

The shapes under study are then described by the prototype-based representation, which we then employ as input to a classifier. The classifier used for this baseline is also a softmax classifier, as in our method. In other words, the only difference between this baseline method and our approach (probabilistic prototype) is how the prototypes are learned.

The other two alternatives are specifically designed for purposes of shape classification. These are the shape and skeletal contexts in Belongie et al. (2002) and Xie et al. (2008), respectively. Once the shape and skeletal contexts are at hand, we train one-versus-all SVM classifiers whose parameters have been selected through ten-fold cross validation. For all our shape categorization experiments, we have divided the graphs in the dataset into a training and a testing set. The training set comprises 50% randomly selected graphs, i.e. 700, and the other 50% of the data was used for testing. For these partitions we have effected five trials. The categorization results are shown in Table 1. The table shows the mean percentage of correctly classified shapes and the corresponding variance. Despite the basic strategy taken for the construction of our graphs, which contrasts with the specialized nature of the skeletal and shape contexts, our probabilistic prototype method outperforms the alternatives.

4.3. SPECTRAL Dataset

This application considers hyperspectral imagery from real-world scenes that include various types of materials. We have annotated these images at a pixel level considering 10 different classes (C = 10): tree trunk, light poles, shadow on grass, grass, road, white line on road, shadow on road, leaves, sky, and white regions on sky. We considered 24 different images from which we have extracted 1,746,708 data points with their corresponding labels. In order to characterize an input data point with a set of vectors (S^n) we have considered neighborhood information according to 7×7 windows. Hence, for each data point we have 49 feature vectors. We have subsampled the data so as to include 1000 training instances and 16643 test instances across 10 different partitions.
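As an illustration of this per-pixel characterization, the hypothetical sketch below builds the 49-vector sets S^n from 7×7 windows of a hyperspectral cube; the number of bands, the edge padding at the image borders and all array names are our own assumptions, since the paper does not specify these details.

```python
import numpy as np

def pixel_feature_sets(cube, window=7):
    """For a hyperspectral cube of shape (H, W, B), return for every pixel the
    spectra of its window x window neighbourhood, i.e. window**2 feature vectors
    per data point (49 for a 7x7 window), as an array of shape (H*W, window**2, B).
    Border pixels are handled here by edge padding (an assumption)."""
    r = window // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W, B = cube.shape
    sets = np.empty((H * W, window * window, B))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + window, j:j + window, :]   # neighbourhood of pixel (i, j)
            sets[i * W + j] = patch.reshape(-1, B)
    return sets

# Toy example: a 16 x 16 image with 5 spectral bands.
cube = np.random.rand(16, 16, 5)
S = pixel_feature_sets(cube)    # shape (256, 49, 5): 49 feature vectors per pixel
```

Each row of S then plays the role of one set S^n that is mapped to a representation z^n as in Section 2.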

Table 1. Shape categorization results for the mpeg-7 dataset. The mean classification accuracy is shown along with (±) one standard deviation when using 50% of the data for training and the rest for testing. Our probabilistic prototype approach outperforms all other baseline methods.

    Probabilistic Prototype    Standard Prototype    Shape Context (Belongie et al., 2002)    Skeletal Context (Xie et al., 2008)
    85.49 ± 1.43 %             82.20 ± 0.98 %        77.55 ± 2.39 %                           79.91 ± 1.78 %

Note that our method is effectively learning the prototypes that can be used to represent the spectra of the objects in the scene as a linear combination of the spectral signatures of their material constituents. In geosciences and process control this is known as unmixing (Bergman, 2006). Current unmixing methods assume availability of the end-member spectra, which usually involves a cumbersome labeling of the end-member data, effected through expert intervention.

For comparison we use the standard prototype approach (based on k-means for learning the centers) and the InfoLoss method (Lazebnik & Raginsky, 2009). InfoLoss adapts the prototypes based on the optimization of an information-theoretical criterion. For this method, we have used an in-house implementation whose parameter settings have been set as described in Lazebnik & Raginsky (2009). For the standard prototype approach and InfoLoss, as in our method, we have used a softmax classifier so as to allow a direct representation-quality comparison via the classification rates for our approach and those yielded by the alternatives. In Table 2, we show the accuracy of the methods evaluated along with two standard errors. As before, probabilistic prototype refers to our method and standard prototype refers to the representation computed via the direct application of k-means for learning the centers. From the table, we can conclude that our method outperforms the alternatives by delivering a mean categorization rate of 81.25%.

4.4. Test Likelihoods on MPEG-7 and SPECTRAL

Now we turn our attention to the evaluation of the methods from a probabilistic point of view by reporting the likelihood of the models on the test data. In Figure 2(a), we show the test data log-likelihood for our probabilistic prototype approach and the baseline. Each bar corresponds to the mean across the five trials and the segments account for two standard errors. Note that the log-likelihood for our approach is significantly greater than the one yielded by the standard approach. This is because the performance measures above account only for the correctness of the labels yielded by the classifier, devoid of their probability of error. Thus, our method not only delivers better performance than the alternatives, but also provides better probability estimates. In Figure 2(b) we show similar results for the spectral dataset. As with the mpeg-7 dataset, the test data log-likelihood for the spectral dataset delivered by our approach is greater than the one yielded by the alternatives and shows less variability.

4.5. ADULT Dataset

Finally, we applied our method to the adult dataset from the UCI machine learning repository (Frank & Asuncion, 2010). This dataset has been extracted from census-based data and the original task was to predict whether a person makes over $50K a year. It contains continuous and discrete input variables including education, age, work class (with eight different categories), etc. In order to use this dataset as a test benchmark for our method, we have grouped people's records according to the values of the discrete variables except native country and work class, and we have selected the latter variable as our target labels. Hence, we have an 8-dimensional output variable. We represented an input instance by the set of vectors that belong to the same group (i.e. with the same values for all the discrete features excluding native country). Similarly, we have computed soft labels (P̃) by calculating for each group the proportion of records that have the same value for the corresponding label. Finally, we have sub-sampled the data to have 4146 different instances and considered 50% for training and the other 50% for testing. The regularization parameters for our model and the baseline have been set via cross-validation.

The (average) KL-divergence between the true (empirical) class distribution and the predictive distribution is 1.12 bits, and if the soft labels were thresholded to provide a hard class assignment the accuracy of the model would be 77%. One interesting application of our model on this dataset is the automatic "discovery" of the latent population prototypes and how these relate to the class labels under consideration. Due to space limitations, we have not included these results here and postpone their analysis to future work.
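The grouping and soft-label construction described above, and the KL-based evaluation, could be sketched as follows (hypothetical pandas/NumPy code; the column names and the tiny example table are illustrative and are not the UCI attribute names).

```python
import numpy as np
import pandas as pd

# Records sharing the same values of the grouping attributes form one input; the
# empirical distribution of the target attribute within the group is its soft label.
df = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "BSc"],
    "sex":       ["M",  "M",  "F",   "F",   "F"],
    "workclass": ["Private", "Gov", "Private", "Private", "SelfEmp"],
})
group_cols = ["education", "sex"]          # discrete features used for grouping
classes = sorted(df["workclass"].unique())

soft_labels = (
    df.groupby(group_cols)["workclass"]
      .value_counts(normalize=True)        # proportion of each target value per group
      .unstack(fill_value=0.0)
      .reindex(columns=classes, fill_value=0.0)
)
print(soft_labels)                         # each row sums to 1: the soft labels P~

def mean_kl_bits(P_true, P_pred, eps=1e-12):
    """Average KL divergence (in bits) between empirical and predicted class
    distributions, the evaluation measure reported above."""
    P_true, P_pred = np.asarray(P_true), np.asarray(P_pred)
    return float(np.mean(np.sum(P_true * np.log2((P_true + eps) / (P_pred + eps)), axis=1)))
```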

Table 2. Classification results for the spectral dataset. The mean classification accuracy is shown along with (±) two standard errors when using 100 data-points per class for training and the rest for testing. Our probabilistic prototype method outperforms the baseline methods.

    Probabilistic Prototype    Standard Prototype    InfoLoss (Lazebnik & Raginsky, 2009)
    81.25 ± 0.72 %             75.17 ± 0.49 %        72.16 ± 0.82 %

[Figure 2 appears here: two bar charts of mean test log-likelihood with error bars; panel (a) compares Std Prototype and Our Method on mpeg-7, panel (b) compares Std Prototype, InfoLoss and Our Method on spectral. See the caption below.]

Figure 2. (a) The test data log-likelihood on the mpeg-7 dataset, with K = 200 and a stratified split using 50% of the data for training and the rest for testing. The mean values are reported along with two standard errors across 5 replications of the experiment. (b) The test data log-likelihood on the spectral dataset when using 1000 data-points for training and K = 50.

5. Related Work

Generative models have been proposed as probabilistic generalizations of ad-hoc learning methods, mostly for unsupervised scenarios (see e.g. Bishop et al., 1998). The work by Seo & Obermayer (2002) is related to ours in their attempt to generalize LVQ, but their probabilistic model is inherently generative (see their equation 4). Their objective function is based on likelihood ratios and does not arise naturally from their original model.

Neural networks and, more recently, deep belief networks (DBNs) have proved popular as methods for learning latent representations with the goal of tackling difficult problems in AI (see e.g. Hinton & Salakhutdinov, 2006; Bengio & LeCun, 2007). In particular, our method can be seen as a generalization of radial basis function (RBF) networks that considers multiple probabilistic observations; applies a pooling operator over the set of vectors belonging to the same instance; and optimizes the parameters via a coordinate ascent mechanism. There is also an interesting relation between our proposed learning algorithm and how deep architectures are currently trained, as we initialize our method with k-means, which can be seen as a pre-training stage, followed by a fine-tuning stage, as in DBNs.

In the computer vision community, summarization into a codebook has been used by a number of approaches in the literature. For example, Gemert et al. (2008) replace histograms with kernel density estimation (KDE) in the construction of codebooks and use the extracted features in conjunction with SVMs (a non-probabilistic approach) for classification. It is interesting to mention the difference between InfoLoss (Lazebnik & Raginsky, 2009) and our approach. The methods are similar in spirit but their difference is analogous to the difference between filter and wrapper methods in feature selection. While InfoLoss focuses on a filter metric (an information-theoretical objective function), our method directly includes the specific classifier in the learning process (wrapper). In a similar vein, Boureau et al. (2010) propose the learning of mid-level features for recognition in computer vision. Their work focuses on (multiple) binary classifiers and does not provide consistent probabilistic outputs across all classes. It is well known in the machine learning theoretical literature that simple approaches to combining such classifiers, such as one-against-all, are inconsistent in that, given optimal binary classifiers, this "reduction" may not yield optimal multi-class classifiers (see e.g. Beygelzimer et al., 2009).

Finally, recent work by Bergamo et al. (2011) has proposed an approach to learning compact representations for novel category recognition. Their focus is on learning compact descriptions that can be used, for example, in image retrieval.

6. Conclusions

We have presented a probabilistic model for the discriminative learning of latent representations, which corresponds to a relaxed version of the popular approach of prototype-based classification. From the application viewpoint, our method can be viewed as a discriminative technique that can be used for unmixing in geosciences and remote sensing (Bergman, 2006). It can also be applied to other problems, such as population modeling, where labels and proportions of these can be associated with groups of instances.

Our method requires, in general, a model for a probabilistic discriminative classifier, and for purposes of illustrating the utility of our approach we have used a softmax classifier. However, our approach can, in principle, be used with any discriminative and probabilistic classifier, including non-parametric methods. In the future we aim to explore the combination of non-parametric methods with our prototype learning approach and also to investigate better optimization algorithms for parameter learning, e.g. based on mean field approximations.

Acknowledgments

We thank Sarah Namin and Lars Petersson for providing us with the spectral dataset. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. IEEE TPAMI, 24(4):509–522, 2002.

Bengio, Yoshua and LeCun, Yann. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007.

Bergamo, Alessandro, Torresani, Lorenzo, and Fitzgibbon, Andrew. PiCoDes: Learning a compact code for novel-category recognition. In NIPS 24, 2011.

Bergman, M. Some unmixing problems and algorithms in spectroscopy and hyperspectral imaging. In Applied Imagery and Pattern Recognition Workshop, 2006.

Beygelzimer, Alina, Langford, John, and Ravikumar, Pradeep. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory, 2009.

Bishop, Christopher M., Svensén, Markus, and Williams, Christopher K. I. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.

Boureau, Y-Lan, Bach, Francis, LeCun, Yann, and Ponce, Jean. Learning mid-level features for recognition. In CVPR, 2010.

Fei-Fei, L. and Perona, P. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Gemert, Jan C., Geusebroek, Jan-Mark, Veenman, Cor J., and Smeulders, Arnold W. Kernel codebooks for scene categorization. In ECCV, 2008.

Hinton, Geoffrey and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Kohonen, Teuvo. Improved versions of learning vector quantization. In International Joint Conference on Neural Networks, pp. 545–550, 1990.

Latecki, L. J., Lakamper, R., and Eckhardt, U. Shape descriptors for non-rigid shapes with a single closed contour. In CVPR, 2000.

Lazebnik, Svetlana and Raginsky, Maxim. Supervised learning of quantizer codebooks by information loss minimization. IEEE TPAMI, pp. 1294–1309, 2009.

Rosenblatt, Frank. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Scholkopf, Bernhard and Smola, Alexander J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Seo, Sambu and Obermayer, Klaus. Soft learning vector quantization. Neural Computation, 15:1589–1604, 2002.

Xie, J., Heng, P., and Shah, M. Shape matching and modeling using skeletal context. Pattern Recognition, 41(5):1756–1767, 2008.