Unimodal Probability Distributions for Deep Ordinal Classification

Unimodal Probability Distributions for Deep Ordinal Classification Christopher Beckham 1 Christopher Pal 1 Abstract other classes may not necessarily make sense in context. We present an example of this in Figure1, showing three Probability distributions produced by the cross- probability distributions: A, B, and C, all conditioned on entropy loss for ordinal classification problems some input image. Highlighted in orange is the ground can possess undesired properties. We propose truth (i.e. the image is of an adult), and all probability a straightforward technique to constrain discrete distributions have identical cross-entropy: this is because ordinal probability distributions to be unimodal the loss only takes into account the ground truth class, via the use of the Poisson and binomial proba- − log(p(yjx) ), where c = adult, and all three distribu- bility distributions. We evaluate this approach in c tions have the same probability mass for the adult class. the context of deep learning on two large ordinal image datasets, obtaining promising results. Despite all distributions having the same cross-entropy loss, some distributions are ‘better’ than others. For example, between A and B, A is preferred, since B puts an 1. Introduction unusually high mass on the baby class. However, A and B are both unusual, because the probability mass does not Ordinal classification (sometimes called ordinal regression) gradually decrease to the left and right of the ground truth. is a prediction task in which the classes to be predicted are In other words, it seems unusual to place more confidence discrete and ordered in some fashion. This is different from on ‘schooler’ than ‘teen’ (distribution A) considering that discrete classification in which the classes are not ordered, a teenager looks more like an adult than a schooler, and and different from regression in that we typically do not it seems unusual to place more confidence on ’baby’ than know the distances between the classes (unlike regression, ’teen’ considering that again, a teenager looks more like an in which we know the distances because the predictions lie adult than a baby. Distribution C makes the most sense be- on the real number line). Some examples of ordinal clas- cause the probability mass gradually decreases as we move sification tasks include predicting the stages of disease for further away from the most confident class. In this paper, a cancer (Gentry et al., 2015), predicting what star rating we propose a simple method to enforce this constraint, util- a user gave to a movie (Koren & Sill, 2011), or predicting ising the probability mass function of either the Poisson or the age of a person (Eidinger et al., 2014). binomial distribution. Two of the easiest techniques used to deal with ordinal For the remainder of this paper, we will refer to distribu- problems include either treating the problem as a discrete tions like C as ‘unimodal’ distributions; that is, distribu- classification and minimising the cross-entropy loss, or tions where the probability mass gradually decreases on treating the problem as a regression and using the squared both sides of the class that has the majority of the mass. error loss. The former ignores the inherent ordering between the classes, while the latter takes into account the arXiv:1705.05278v2 [stat.ML] 22 Jun 2017 1.1. Related work distances between them (due to the square in the error term) but assumes that the labels are actually real-valued – that Our work is inspired by the recent work of Hou et al. is, adjacent classes are equally distant. Furthermore, the (2016), who shed light on the issues associated with differ- cross-entropy loss – under a one-hot target encoding – is ent probability distributions having the same cross-entropy formulated such that it only ‘cares’ about the ground truth loss for ordinal problems. In their work, they alleviate this class, and that probability estimates corresponding to the issue by minimising the ‘Earth mover’s distance’, which is defined as the minimum cost needed to transform one prob- 1 Montreal´ Institute of Learning Algorithms, Quebec,´ ability distribution to another. Because this metric takes Canada. Correspondence to: Christopher Beckham <christo- [email protected]>. into account the distances between classes – moving probability mass to a far-away class incurs a large cost – the Proceedings of the 34 th International Conference on Machine metric is appropriate to minimise for an ordinal problem. It Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 turns out that in the case of an ordinal problem, the Earth by the author(s). Unimodal Probability Distributions for Deep Ordinal Classification distribution A distribution B distribution C 0.30 0.30 0.35 0.25 0.25 0.30 0.25 0.20 0.20 0.20 0.15 0.15 p(y|x) p(y|x) 0.15 0.10 0.10 0.10 0.05 0.05 0.05 0.00 0.00 0.00 kid kid kid teen teen teen baby baby baby adult adult adult elder elder elder senior senior senior schooler schooler schooler (a) An adult woman (b) Three probability distributions exhibiting the same mass for the ‘adult’ class and therefore the same cross-entropy error. Figure 1. Three ordinal probability distributions conditioned on an image of an adult woman. Distributions A and B are unusual in the sense that they are multi-modal. mover’s distance reduces down to Mallow’s distance: though it is a somewhat odd formulation as it assumes c is 1 continuous when it is really discrete. 1 l emd(^y; y) = jjcmf(^y) − cmf(y)jj ; (1) K l Cheng(2007) proposed the use of binary cross-entropy or where cmf(·) denotes the cumulative mass function for squared error on an ordinal encoding scheme rather than the some probability distribution, y denotes the ground truth one-hot encoding which is commonly used in discrete clas- (one-hot encoded), ^y the corresponding predicted probabil- sification. For example, if we have K classes, then we have ity distribution, and K the number of classes. The authors labels of length K − 1, where the first class is [0;:::; 0], evaluate the EMD loss on two age estimation and one aes- second class is [1;:::; 0], third class is [1; 1;:::; 0] and so thetic estimation dataset and obtain state-of-the-art results. forth. With this formulation, we can think of the i’th output However, the authors do not show comparisons between the unit as computing the cumulative probability p(y > ijx), probability distributions learned between EMD and cross- where i 2 f0;:::;K − 2g. Frank & Hall(2001) also entropy. proposed this scheme but in a more general sense by using multiple classifiers (not just neural networks) to model Unimodality has been explored for ordinal neural networks each cumulative probability, and Niu et al.(2016) proposed in da Costa et al.(2008). They explored the use of the bi- a similar scheme using CNNs for age estimation. This nomial and Poisson distributions and a non-parametric way technique however suffers from the issue that the cumu- of enforcing unimodal probability distributions (which we lative probabilities p(y > 0 j x); : : : ; p(y > K − 2 j x) do not explore). One key difference between their work are not guaranteed to be monotonically decreasing, which and ours here is that we evaluate these unimodal distribu- means that if we compute the discrete probabilities p(y = tions in the context of deep learning, where the datasets are 0 j x); : : : ; p(y = K − 1 j x) these are not guaranteed generally much larger and have more variability; however, to be strictly positive. To address the monotonicity issue, there are numerous other differences which we will high- Schapire et al.(2002) proposed a heuristic solution. light throughout this paper. There are other ordinal techniques but which do not Beckham & Pal(2016) explored a loss function with an impose unimodal constraints. The proportional odds intermediate form between a cross-entropy and regression model (POM) and its neural network extensions (POMNN, loss. In their work the squared error loss is still used, but a CHNN (Gutierrez´ et al., 2014)) do not suffer from the probability distribution over classes is still learned. This is monotonicity issue due to the utilization of monotonically done by adding a regression layer (i.e. a one-unit layer) at increasing biases in the calculation of probabilities. The the top of what would normally be the classification layer, stick-breaking approach by Khan et al.(2012), which is p(yjx). Instead of learning the weight vector a it is fixed a reformulation of the multinomial logit (softmax), could T to [0;:::;K − 1] and the squared error loss is minimised. also be used in the ordinal case as it technically imposes an This can be interpreted as drawing the class label from a ordering on classes. 2 Gaussian distribution p(cjx) = N(c; E[a]p(yjx); σ ). This technique was evaluated against the diabetic retinopathy dataset and beat most of the baselines employed. Inter- estingly, since p(cjx) is a Gaussian, this is also unimodal, Unimodal Probability Distributions for Deep Ordinal Classification 1*log f(x) 1.2. Poisson distribution softmax – f(x) – log(1!) f(x)1 The Poisson distribution is commonly used to model the probability of the number of events, k 2 [ 0 occurring N . in a particular interval of time. The average frequency of f(x) these events is denoted by λ 2 R+. The probability mass deep net . function is defined as: . λk exp(−λ) p(k; λ) = ; (2) f(x) K*log f(x) k! K – f(x) – log(K!) where 0 ≤ k ≤ K − 1.

Load more