Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss
Amichai Painsky, Member, IEEE, and Gregory W. Wornell, Fellow, IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIT.2019.2958705, IEEE Transactions on Information Theory.

0018-9448 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Manuscript received Oct. 2018, revised July 2019. The material in this paper was presented in part at the Int. Symp. Inform. Theory (ISIT-2018) [1], Vail, Colorado, June 2018. This work was supported, in part, by NSF under Grant No. CCF-1717610.

A. Painsky is with the Department of Industrial Engineering, Tel Aviv University, Tel Aviv 6139001, Israel (E-mail: [email protected]).

G. W. Wornell is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (E-mail: [email protected]).

Abstract: A loss function measures the discrepancy between the true values and their estimated fits, for a given instance of data. In classification problems, a loss function is said to be proper if a minimizer of the expected loss is the true underlying probability. We show that for binary classification, the divergence associated with smooth, proper, and convex loss functions is upper bounded by the Kullback-Leibler (KL) divergence, to within a normalization constant. This implies that by minimizing the logarithmic loss associated with the KL divergence, we minimize an upper bound to any choice of loss from this set. As such, the logarithmic loss is universal in the sense of providing performance guarantees with respect to a broad class of accuracy measures. Importantly, this notion of universality is not problem-specific, enabling its use in diverse applications, including predictive modeling, data clustering and sample complexity analysis. Generalizations to arbitrary finite alphabets are also developed. The derived inequalities extend several well-known f-divergence results.

Index Terms: Kullback-Leibler (KL) divergence, logarithmic loss, Bregman divergences, Pinsker inequality.

I. INTRODUCTION

One of the major roles of statistical analysis is making predictions about future events and providing suitable accuracy guarantees. For example, consider a weather forecaster that estimates the chances of rain on the following day. Its performance may be evaluated by multiple statistical measures. We may count the number of times it assessed the chance of rain as greater than 50%, when there was eventually no rain (and vice versa). This corresponds to the so-called 0-1 loss, with threshold parameter $t = 1/2$. Alternatively, we may consider a variety of values for $t$, or even a completely different measure. Indeed, there are many candidates, including the quadratic loss, Bernoulli log-likelihood loss, boosting loss, etc. [2]. Choosing a good measure is a well-studied problem, mostly in the context of scoring rules in decision theory [3]–[6].

Assuming that the desired measure is known in advance, the predictor may be designed accordingly, i.e., to minimize that measure. However, in practice, different tasks may require inferring different information from the provided estimates. Moreover, designing a predictor with respect to one measure may result in poor performance when evaluated by another. For example, the minimizer of a 0-1 loss may result in an unbounded loss when measured with a Bernoulli log-likelihood loss. In such cases, it would be desirable to design a predictor according to a "universal" measure, i.e., one that is suitable for a variety of purposes and provides performance guarantees for different uses.

In this paper, we show that for binary classification, the Bernoulli log-likelihood loss (log-loss) is such a universal choice, dominating all alternative "analytically convenient" (i.e., smooth, proper, and convex) loss functions. Specifically, we show that by minimizing the log-loss we minimize the regret associated with all possible alternatives from this set. Our result justifies the use of log-loss in many applications.

As we develop, our universality result may be equivalently viewed from a divergence analysis viewpoint. In particular, we establish that the divergence associated with the log-loss, i.e., the Kullback-Leibler (KL) divergence, upper bounds a set of Bregman divergences that satisfy a condition on their Hessian. Additionally, we show that any separable Bregman divergence that is convex in its second argument is a member of this set. This result provides a new set of Bregman divergence inequalities. In this sense, our Bregman analysis is complementary to the well-known f-divergence inequality results [7]–[10].

We further develop several applications for our results, including universal forecasting, universal data clustering, and universal sample complexity analysis for learning problems, in addition to establishing the universality of the information bottleneck principle. We emphasize that our universality results are derived in a rather general setting, and are not restricted to a specific problem. As such, they may find a wide range of additional applications.

The remainder of the paper is organized as follows. Section II summarizes related work on loss function analysis, universality and divergence inequalities. Section III provides the needed notation, terminology, and definitions. Section IV contains the main results for binary alphabets, and their generalization to arbitrary finite alphabets is developed in Section V. Additional numerical analysis and experimental validation are provided in Section VI, and the implications of our results in three distinct applications are described in Section VII. Finally, Section VIII contains some concluding remarks.

II. RELATED WORK

The Bernoulli log-likelihood loss function plays a fundamental role in information theory, machine learning, statistics and many other disciplines. Its unique properties and broad applications have been extensively studied over the years.

The Bernoulli log-likelihood loss function arises naturally in the context of parameter estimation. Consider a set of $n$ independent, identically distributed (i.i.d.) observations $y^n = (y_1, \ldots, y_n)$ drawn from a distribution $p_Y(\cdot; \theta)$ whose parameter $\theta$ is unknown. Then the maximum likelihood estimate of $\theta$ in $\Theta$ is

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta; y^n),$$

where

$$L(\theta; y^n) = p_{Y^n}(y^n; \theta) = \prod_{i=1}^{n} p_Y(y_i; \theta).$$

Intuitively, it selects the parameter values that make the data most probable. Equivalently, this estimate minimizes a loss that is the (negative, normalized, natural) logarithm of the likelihood function, viz.,

$$\ell(\theta; y^n) \triangleq -\frac{1}{n} \log L(\theta; y^n) = -\frac{1}{n} \sum_{i=1}^{n} \log p_Y(y_i; \theta),$$

whose mean is

$$\mathbb{E}[\ell(\theta; Y)] = -\mathbb{E}[\log p_Y(Y; \theta)].$$

Hence, by minimizing this Bernoulli log-likelihood loss, termed the log-loss, over a set of parameters we maximize the likelihood of the given observations.

The log-loss also arises naturally in information theory. The self-information loss function $-\log p_Y(y)$ defines the ideal codeword length for describing the realization $Y = y$ [11]. In this sense, minimizing the log-loss corresponds to minimizing the amount of information that is necessary to convey the observed realizations. Further, the expected self-information is simply Shannon's entropy, which reflects the average uncertainty associated with sampling the random variable $Y$.

The logarithmic loss function is known to be "universal" from several information-theoretic points of view. In [12], … any discrete memoryless source is successively refinable under an arbitrary distortion criterion for the second decoder.

It is important to emphasize that universality results of the type discussed above are largely limited to relatively narrowly-defined problems and specific optimization criteria. By contrast, our development is aimed at a broader notion of universality that is not restricted to a specific problem, and considers a broader range of criteria.

An additional information-theoretic justification for the wide use of the log-loss is introduced in [14]. This work focuses on statistical inference with side information, showing that for an alphabet size greater than two, the log-loss is the only loss function that benefits from side information and satisfies the data processing lemma. This result extends some well-known properties of the log-loss with respect to the data processing lemma, as later described.

Within decision theory, statistical learning and inference problems, the log-loss also plays a further key role in the context of proper loss functions, which produce estimates that are unbiased with respect to the true underlying distribution. Proper loss functions have been extensively studied, compared, and suggested for a variety of tasks [3]–[6], [15]. Among these, the log-loss is special: it is the only proper loss that is local [16], [17]. This means that the log-loss is the only proper loss function that assigns an estimate for the probability of the event $Y = y_0$ that depends only on the outcome $Y = y_0$.

In turn, proper loss functions are closely related to Bregman divergences, with which there exists a one-to-one correspondence [4]. For the log-loss, the associated Bregman divergence is the KL divergence, which is also an instance of an f-divergence [18]. Significantly, for probability distributions, the KL divergence is the only divergence measure that is a member of both of these classes of divergences [19]. The Bregman divergences are the only divergences that satisfy a "mean-as-minimizer" property [20], while any divergence that satisfies the data processing inequality is necessarily an f-divergence (or a unique (one-to-one) mapping thereof) [21].
