Evaluating Bregman Divergences for Probability Learning from Crowd
Francisco Mena and Ricardo Ñanculef
Univ. Técnica Federico Santa María, Depto. de Informática, Santiago, Chile
Contact: [email protected]

Abstract

Crowdsourcing scenarios are a good example of having a probability distribution over categories that reflects what people think from a global perspective. Learning a predictive model of this probability distribution can be much more valuable than learning only a discriminative model that gives the most likely category of the data. Here we present different models that take a probability distribution as the target to train a machine learning model. We focus on the Bregman divergence framework as the source of objective functions to minimize. The results show that special care must be taken when building an objective function and when assuming that a neural network in the Keras framework will optimize equivalent objectives equally.

1 Introduction

Knowing the probability distribution over the different categories (discrete variables) of some domain can be very valuable when one wants a predictive model that gives a prediction together with its uncertainty. Classic machine learning models try to learn a discriminative model of only the true category of the data.

When facing the problem of learning the whole probability distribution over the categories of some data, classic machine learning methods may not fit well. This is a different objective, because the model needs to learn to give a prediction even for the least likely categories and, with this, express the uncertainty, i.e. avoid assigning a priori zero probability to the categories to which the data does not belong.

Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT, http://www.mturk.com), allow one to obtain various annotations over some dataset, with multiple annotations (not necessarily equal) per data point. With this, one can group all the annotations into a vector of repeats and then normalize it to get the probability of each category, representing what a common/regular annotator thinks or how she behaves. For instance, in a problem with two categories (dog, cat), an image of a big cat can have the probability distribution (0.8, 0.2), which shows how an annotator behaves and also how she can get confused and give a wrong annotation for the image. The experimental work of [Snow et al., 2008] over different text datasets shows that multiple inexpert annotators can perform similarly to expert annotators, with a strong correlation between them. So we can assume that multiple annotators exhibit, on average, a reasonably good behavior.
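As a concrete illustration of this target construction, the sketch below (our own, not code from the paper) normalizes a hypothetical vector of annotation counts into per-category probabilities; the count values are assumptions chosen to match the (0.8, 0.2) example above.

    import numpy as np

    # Vector of repeats r_i for a single data point: how many annotators chose
    # each of the K categories (here K = 2, e.g. (dog, cat) as in the example above).
    repeats = np.array([8, 2], dtype=float)

    # Normalize the counts to obtain the empirical probability of each category;
    # this normalized vector p_i is the target the model is trained to predict.
    p_i = repeats / repeats.sum()
    print(p_i)  # -> [0.8 0.2]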
In this application we need a function that compares two probability distributions and can be used as a cost function to optimize. The one commonly used in the state of the art is the KL divergence [Thomas, 1991], which measures the difference between two pdfs. Here we explore different dissimilarity measures between vectors of probabilities over the categories. We focus on the family of Bregman divergences [Bregman, 1967], which measure dissimilarity between objects but are not, by themselves, metrics.

We evaluate different functions trying to understand what kind of objective function works better. The results report that Bregman divergences and similar objective functions do not behave in the same way. We suspect that this is due to the approximate optimization performed by the neural network framework, such as stochastic optimization or approximate functional derivatives.

The paper is structured as follows. In Section 2 we formally define the problem we are facing. In Section 3 we present the models proposed to solve the problem and compare them, while the evaluation metrics used for the comparison are given in Section 4. Section 5 presents the related work and in Section 6 we show the results of all the experiments. Finally, the conclusions are drawn in Section 7.

2 Problem

The task is the following: given a dataset of N pairs \{(x_i, p_i)\}_{i=1}^{N}, with x_i \in R^d the data and p_i \in R^K the normalized vector of repeats, i.e. the vector of probabilities of each of the K categories for that data point, we need to learn a model that maps the data to the probabilities, f : R^d \to R^K with f(x_i) = \hat{p}_i.

This objective differs from classic machine learning, which tries to learn only the most probable category (argmax). Here we have a loss function \ell(p_i, f(x_i)) that uses the whole probability distribution, so the model jointly learns to predict correctly and to quantify the uncertainty of its predictions.

To build the data probabilities, which in some cases already come with the data, we assume that the annotators model the probability of each category correctly, i.e. they define the ground truth. With r_{ij} the entry of the vector of repeats that stores the number of times that data point i was annotated with category j:

    p_{ij} = \frac{r_{ij}}{\sum_{l} r_{il}}    (1)

3 Models

The models proposed to solve the problem are the well-studied deep neural networks [LeCun et al., 2015], used to model f with different objective functions adapted to probabilities. In particular, we work with the Bregman divergences [Bregman, 1967]. Below we define the objective functions to compare; each is evaluated on every (target, prediction) pair in a batch and the values are then merged with an arithmetic average, as in a standard neural network.

3.1 Based on Keras

First we define some functions commonly used in Keras [Chollet et al., 2015] as metrics to evaluate deep learning models.

• Mean Squared Error (MSE):

    MSE(p_i, \hat{p}_i) = \frac{1}{K} \sum_{j=1}^{K} (p_{ij} - \hat{p}_{ij})^2    (2)

• Root Mean Squared Error (RMSE):

    RMSE(p_i, \hat{p}_i) = \sqrt{\frac{1}{K} \sum_{j=1}^{K} (p_{ij} - \hat{p}_{ij})^2}    (3)

• Cross-entropy:

    H(p_i, \hat{p}_i) = -\sum_{j=1}^{K} p_{ij} \log \hat{p}_{ij}    (4)

• Reverse KL:

    KL(\hat{p}_i \,\|\, p_i) = \sum_{j=1}^{K} \hat{p}_{ij} \log \frac{\hat{p}_{ij}}{p_{ij}}    (5)

• Jensen-Shannon divergence [Lin, 1991] (also known as symmetric KL):

    JS(p_i, \hat{p}_i) = \frac{1}{2} \sum_{j=1}^{K} \left[ KL(p_{ij} \,\|\, m_{ij}) + KL(\hat{p}_{ij} \,\|\, m_{ij}) \right],  with  m_{ij} = \frac{p_{ij} + \hat{p}_{ij}}{2}    (6)
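Since Keras accepts any callable of the form loss(y_true, y_pred) in model.compile, objectives such as the reverse KL of Eq. (5) can be plugged in as custom losses. The following minimal sketch is our own illustration (not the authors' implementation); it assumes the tf.keras backend and uses a small clipping constant to avoid log(0).

    from tensorflow.keras import backend as K

    def reverse_kl(y_true, y_pred):
        """Reverse KL divergence KL(p_hat || p) of Eq. (5) as a Keras loss.

        y_true: target probability vectors p_i, shape (batch, K)
        y_pred: predicted probability vectors p_hat_i, shape (batch, K)
        """
        # Clip both distributions away from zero to keep the logarithm finite.
        y_true = K.clip(y_true, K.epsilon(), 1.0)
        y_pred = K.clip(y_pred, K.epsilon(), 1.0)
        # Per-example divergence; Keras averages these values over the batch.
        return K.sum(y_pred * K.log(y_pred / y_true), axis=-1)

    # Usage sketch: any model whose last layer is a K-way softmax can be
    # compiled with this objective instead of a built-in loss, e.g.
    # model.compile(optimizer="adam", loss=reverse_kl)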
3.2 Bregman divergences

Here we define the Bregman divergence (the inverse of a similarity) [Chen et al., 2008, Banerjee et al., 2005] to measure the difference between two probability distributions. These functions come from a family that shares some properties, since they are derived from a general framework/structure.

Given Φ, a strictly convex differentiable function, the Bregman divergence d_Φ is defined as:

    d_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \nabla\Phi(y) \rangle    (7)

with \langle a, b \rangle the inner product between a and b. As can be seen, the order of the arguments of d_Φ matters; since it does not fulfill the symmetry or triangle-inequality properties, it is not a metric. Nonetheless, it has other properties that are good for optimization purposes:

• Convexity: d_Φ is convex in its first argument x.

• Non-negativity: d_Φ(x, y) ≥ 0 for every x, y.

• Duality: if Φ has a convex conjugate, it can be used.

• The mean as minimizer in a random scenario: given a set of random vectors x, the y that minimizes the expected divergence d_Φ(x, y), for any function Φ, is the mean of the vectors [Banerjee et al., 2005].

Then, the divergence d_Φ(p_i, \hat{p}_i) for different Φ functions:

• Squared Euclidean distance (also known as the sum of squared errors, SSE), for Φ(p_i) = \|p_i\|^2:

    SSE(p_i, \hat{p}_i) = \sum_{j=1}^{K} (p_{ij} - \hat{p}_{ij})^2    (8)

• Forward KL, for the negative entropy function Φ(p_i) = \sum_{j=1}^{K} p_{ij} \log p_{ij}:

    KL(p_i \,\|\, \hat{p}_i) = \sum_{j=1}^{K} p_{ij} \log \frac{p_{ij}}{\hat{p}_{ij}}    (9)

• Generalized I-divergence, similar to the forward KL but generalized to the positive reals:

    GenI(p_i \,\|\, \hat{p}_i) = \sum_{j=1}^{K} \left[ p_{ij} \log \frac{p_{ij}}{\hat{p}_{ij}} - (p_{ij} - \hat{p}_{ij}) \right]    (10)

• Itakura-Saito distance, for Φ(p_i) = -\log p_i:

    IS(p_i, \hat{p}_i) = \sum_{j=1}^{K} \left[ \frac{p_{ij}}{\hat{p}_{ij}} - \log \frac{p_{ij}}{\hat{p}_{ij}} - 1 \right]    (11)

We expect that an objective function based on probabilities achieves a better behavior than a standard objective function for continuous variables.

4 Metrics

In order to make a fair comparison of the effect of each objective function, we evaluate the predictions with metrics that include the following:

• Normalized Discounted Cumulative Gain (NDCG), a metric from learning to rank [Cao et al., 2007], whose objective is to measure the ordering of the predicted probabilities.

• Accuracy on ranking decrease:

    acc_rank = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K} \sum_{k=1}^{K} I(y_i^{(k)} = \hat{y}_i^{(k)})    (14)

  with y_i^{(k)} the real category of data point i at position k and I the indicator function.

5 Related Work

Since the Bregman divergences were proposed [Bregman, 1967], several recent works have studied their benefits. For instance, [Vemuri et al., 2011] propose a different way to find the optimum, or t-center, of the objective, i.e. the representative in a clustering algorithm (for example, for MSE the center is the mean). They present a robust and efficient formulation to seek the center of a different formulation of the Bregman divergences.

The work of [Banerjee et al., 2005] uses the Bregman divergences to measure distances between data points and to cluster them. That work is closely related to ours because it also uses the Bregman divergences as objective functions, but in an unsupervised scenario. The good results shown there suggest that the Bregman divergences can be powerful for recognizing patterns and are a good dissimilarity measure for clustering data.
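To make the connection with [Banerjee et al., 2005] concrete, the sketch below (ours, not code from either paper) illustrates Bregman hard clustering: each point is assigned to the representative with the smallest d_Φ, and each representative is then updated as the plain mean of its assigned points, which minimizes the average divergence for any Φ. The choice of Φ (negative entropy, giving the generalized I-divergence of Eq. (10)), the toy data and all names are illustrative assumptions.

    import numpy as np

    def gen_i_divergence(x, y):
        """Generalized I-divergence of Eq. (10): the Bregman divergence induced by
        the negative entropy Phi(p) = sum_j p_j log p_j, defined for positive vectors."""
        return np.sum(x * np.log(x / y) - (x - y), axis=-1)

    def bregman_hard_clustering(points, n_clusters, n_iter=50, seed=0):
        """Minimal sketch of Bregman hard clustering [Banerjee et al., 2005]:
        assign each point to the closest center under d_Phi, then update every
        center as the mean of its assigned points."""
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), n_clusters, replace=False)]
        for _ in range(n_iter):
            # Divergence of every point to every center: shape (n_points, n_clusters).
            d = np.stack([gen_i_divergence(points, c) for c in centers], axis=1)
            labels = d.argmin(axis=1)
            new_centers = []
            for k in range(n_clusters):
                members = points[labels == k]
                # Keep the previous center if a cluster happens to become empty.
                new_centers.append(members.mean(axis=0) if len(members) else centers[k])
            centers = np.stack(new_centers)
        return labels, centers

    # Toy usage on random probability vectors (each row sums to one).
    probs = np.random.default_rng(1).dirichlet(np.ones(3), size=100)
    labels, centers = bregman_hard_clustering(probs, n_clusters=2)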