
Bregman Bounds and Universality Properties of the Logarithmic Loss
Amichai Painsky, Member, IEEE, and Gregory W. Wornell, Fellow, IEEE

Abstract—A loss function measures the discrepancy between the true values and their estimated fits, for a given instance of data. In classification problems, a loss function is said to be proper if a minimizer of the expected loss is the true underlying probability. We show that for binary classification, the divergence associated with smooth, proper, and convex loss functions is upper bounded by the Kullback-Leibler (KL) divergence, to within a normalization constant. This implies that by minimizing the logarithmic loss associated with the KL divergence, we minimize an upper bound to any choice of loss from this set. As such, the logarithmic loss is universal in the sense of providing performance guarantees with respect to a broad class of accuracy measures. Importantly, this notion of universality is not problem-specific, enabling its use in diverse applications, including predictive modeling, data clustering and sample complexity analysis. Generalizations to arbitrary finite alphabets are also developed. The derived inequalities extend several well-known f-divergence results.

Index Terms—Kullback-Leibler (KL) divergence, logarithmic loss, Bregman divergences, Pinsker inequality.

arXiv:1810.07014v2 [cs.IT] 2 Jan 2020

I. INTRODUCTION

One of the major roles of statistical analysis is making predictions about future events and providing suitable accuracy guarantees. For example, consider a weather forecaster that estimates the chances of rain on the following day. Its performance may be evaluated by multiple statistical measures. We may count the number of times it assessed the chance of rain as greater than 50%, when there was eventually no rain (and vice versa). This corresponds to the so-called 0-1 loss, with threshold parameter t = 1/2. Alternatively, we may consider a variety of values for t, or even a completely different measure. Indeed, there are many candidates, including the quadratic loss, Bernoulli log-likelihood loss, boosting loss, etc. [2]. Choosing a good measure is a well-studied problem, mostly in the context of scoring rules in decision theory [3]–[6].

Assuming that the desired measure is known in advance, the predictor may be designed accordingly—i.e., to minimize that measure. However, in practice, different tasks may require inferring different information from the provided estimates. Moreover, designing a predictor with respect to one measure may result in poor performance when evaluated by another. For example, the minimizer of a 0-1 loss may result in an unbounded loss, when measured with a Bernoulli log-likelihood loss. In such cases, it would be desirable to design a predictor according to a "universal" measure, i.e., one that is suitable for a variety of purposes, and provides performance guarantees for different uses.

In this paper, we show that for binary classification, the Bernoulli log-likelihood loss (log-loss) is such a universal choice, dominating all alternative "analytically convenient" (i.e., smooth, proper, and convex) loss functions. Specifically, we show that by minimizing the log-loss we minimize the regret associated with all possible alternatives from this set. Our result justifies the use of log-loss in many applications.

As we develop, our universality result may be equivalently viewed from a divergence analysis viewpoint. In particular, we establish that the divergence associated with the log-loss—i.e., Kullback-Leibler (KL) divergence—upper bounds a set of Bregman divergences that satisfy a condition on their Hessians. Additionally, we show that any separable Bregman divergence that is convex in its second argument is a member of this set. This result provides a new set of Bregman divergence inequalities. In this sense, our Bregman analysis is complementary to the well-known f-divergence inequality results [7]–[10].

We further develop several applications of our results, including universal forecasting, universal data clustering, and universal sample complexity analysis for learning problems, in addition to establishing the universality of the information bottleneck principle. We emphasize that our universality results are derived in a rather general setting, and not restricted to a specific problem. As such, they may find a wide range of additional applications.

The remainder of the paper is organized as follows. Section II summarizes related work on loss function analysis, universality and divergence inequalities. Section III provides the needed notation, terminology, and definitions. Section IV contains the main results for binary alphabets, and their generalization to arbitrary finite alphabets is developed in Section V. Additional numerical analysis and experimental validation is provided in Section VI, and the implications of our results in three distinct applications are described in Section VII. Finally, Section VIII contains some concluding remarks.

Manuscript received Oct. 2018, revised July 2019. The material in this paper was presented in part at the Int. Symp. Inform. Theory (ISIT-2018) [1], Vail, Colorado, June 2018. This work was supported, in part, by NSF under Grant No. CCF-1717610.
A. Painsky is with the Department of Industrial Engineering, Tel Aviv University, Tel Aviv 6139001, Israel (E-mail: [email protected]).
G. W. Wornell is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (E-mail: [email protected]).

II. RELATED WORK

The Bernoulli log-likelihood loss function plays a fundamental role in information theory, machine learning, and many other disciplines. Its unique properties and broad applications have been extensively studied over the years.

The Bernoulli log-likelihood loss function arises naturally in the context of parameter estimation. Consider a set of n independent, identically distributed (i.i.d.) observations yⁿ = (y₁, ..., yₙ) drawn from a distribution p_Y(·; θ) whose parameter θ is unknown. Then the maximum likelihood estimate of θ in Θ is

  θ̂ = arg max_{θ∈Θ} L(θ; yⁿ),

where

  L(θ; yⁿ) = p_{Yⁿ}(yⁿ; θ) = ∏_{i=1}ⁿ p_Y(y_i; θ).

Intuitively, it selects the parameter values that make the data most probable. Equivalently, this estimate minimizes a loss that is the (negative, normalized, natural) logarithm of the likelihood function, viz.,

  ℓ(θ; yⁿ) ≜ −(1/n) log L(θ; yⁿ) = −(1/n) ∑_{i=1}ⁿ log p_Y(y_i; θ),

whose mean is

  E[ℓ(θ; Y)] = −E[log p_Y(Y; θ)].

Hence, by minimizing this Bernoulli log-likelihood loss, termed the log-loss, over a set of parameters we maximize the likelihood of the given observations.

The log-loss also arises naturally in information theory. The self-information loss function −log p_Y(y) defines the ideal codeword length for describing the realization Y = y [11]. In this sense, minimizing the log-loss corresponds to minimizing the amount of information that is necessary to convey the observed realizations. Further, the expected self-information is simply Shannon's entropy, which reflects the average uncertainty associated with sampling the random variable Y.

The logarithmic loss function is known to be "universal" from several information-theoretic points of view. In [12], Feder and Merhav consider the problem of universal sequential prediction, where a future observation is to be estimated from a given set of past observations. The notion of universality comes from the assumption that the underlying distribution is unknown, or even nonexistent. In this case, it is shown that if there exists a universal predictor (with uniformly rapidly decaying redundancy rates) that minimizes the logarithmic loss function, then there exist universal predictors for any other loss function.

More recently, No and Weissman [13] introduced log-loss universality results in the context of lossy compression. They show that for any fixed-length lossy compression problem under an arbitrary distortion criterion, there is an equivalent lossy compression problem under a log-loss criterion where the optimum schemes coincide. This result implies that without loss of generality, one may restrict attention to the log-loss problem (under an appropriate reconstruction alphabet). In addition, [13] considers the successive refinement problem, showing that if the first decoder operates under log-loss, then any discrete memoryless source is successively refinable under an arbitrary distortion criterion for the second decoder.

It is important to emphasize that universality results of the type discussed above are largely limited to relatively narrowly-defined problems and specific optimization criteria. By contrast, our development is aimed at a broader notion of universality that is not restricted to a specific problem, and considers a broader range of criteria.

An additional information-theoretic justification for the wide use of the log-loss is introduced in [14]. This work focuses on settings with side information, showing that for an alphabet size greater than two, the log-loss is the only loss function that benefits from side information and satisfies the data processing lemma. This result extends some well-known properties of the log-loss with respect to the data processing lemma, as later described.

Within decision theory, statistical learning and inference problems, the log-loss also plays a further key role in the context of proper loss functions, which produce estimates that are unbiased with respect to the true underlying distribution. Proper loss functions have been extensively studied, compared, and suggested for a variety of tasks [3]–[6], [15]. Among these, the log-loss is special: it is the only proper loss that is local [16], [17]. This means that the log-loss is the only proper loss function that assigns an estimate for the probability of the event Y = y₀ that depends only on the outcome Y = y₀.

In turn, proper loss functions are closely related to Bregman divergences, with which there exists a one-to-one correspondence [4]. For the log-loss, the associated Bregman divergence is KL divergence, which is also an instance of an f-divergence [18]. Significantly, for probability distributions, the KL divergence is the only divergence measure that is a member of both of these classes of divergences [19]. The Bregman divergences are the only divergences that satisfy a "mean-as-minimizer" property [20], while any divergence that satisfies the data processing inequality is necessarily an f-divergence (or a unique (one-to-one) mapping thereof) [21]. As a consequence, any divergence that satisfies both of these important properties simultaneously is necessarily proportional to the KL divergence [22, Corollary 6]. Additional properties of KL divergence are also discussed in [22].

Finally, divergence inequalities have been studied extensively. The most celebrated example is the Pinsker inequality [23], which expresses that KL divergence upper bounds the squared total-variation distance. More recently, the detailed studies of Reid and Williamson [10], Harremoës and Vajda [9], Sason and Verdú [8], and Sason [7] have extended this result to a broader set of f-divergence inequalities. Moreover, f-divergence inequalities for non-probability measures appear in, e.g., Stummer and Vajda [24]. In [25], Zhang demonstrated an important Bregman inequality in the context of statistical learning, showing that the KL divergence upper bounds the squared excess-risk associated with the 0-1 loss, and thus controls this traditionally important performance measure. Within this context, our work can be viewed as extending such Bregman inequalities and their analysis.

III. NOTATION, TERMINOLOGY AND DEFINITIONS

Let Y ∈ {0, 1} be a Bernoulli random variable with parameter p = p_Y(1), which may be unknown. A loss function l(y, ŷ) quantifies the discrepancy between a realization Y = y and its corresponding estimate ŷ. In this work we focus on probabilistic estimates ŷ ≜ q ∈ [0, 1], whereby q is an estimate of p rather than of y itself; as such, q is a "soft" decision. A loss function for such estimation takes the form

  l(y, q) = 1{y = 0} l₀(q) + 1{y = 1} l₁(q),   (1)

with 1{·} denoting the Kronecker (indicator) function, where l_k(q) is a loss function associated with the event Y = k, for k ∈ {0, 1}. In turn, the corresponding expected loss is

  L(p, q) ≜ E[l(Y, q)] = (1 − p) l₀(q) + p l₁(q),   (2)

where we note that L(p, q) depends only on p and the estimate q. An example is the log-loss, for which

  l_log(y, q) ≜ y log(1/q) + (1 − y) log(1/(1 − q)).   (3)

Loss functions with additional properties are of particular interest. A loss function is proper (or, equivalently, Fisher-consistent or unbiased) if a minimizer of the expected loss is the true underlying distribution of the random variable we are to estimate; specifically,

  p ∈ arg min_{q∈[0,1]} L(p, q),  p ∈ [0, 1].   (4)

A strictly proper loss function means that q = p is the unique minimizer, i.e.,

  p = arg min_{q∈[0,1]} L(p, q),  p ∈ [0, 1].   (5)

A proper loss function is fair if

  l₀(0) = l₁(1) = 0,   (6)

in which case there is no loss incurred for accurate prediction. Additionally, a proper loss function is regular if

  lim_{q→0} q l₁(q) = lim_{q→1} (1 − q) l₀(q) = 0.   (7)

Intuitively, (7) ensures that making mistakes on events that cannot happen does not incur a penalty.

The minimum of the expected loss for proper loss functions, which we denote using

  G(p) ≜ L(p, p),

is referred to as the generalized entropy function [4], Bayes risk [26] or Bayesian envelope [27]. As an example, the Shannon entropy associated with the log-loss (3) is

  G_log(p) ≜ p log(1/p) + (1 − p) log(1/(1 − p)).   (8)

The regret is defined as the difference between the expected loss and its minimum, so for proper loss functions it takes the form

  ΔL(p, q) = L(p, q) − G(p).   (9)

Savage [28] shows that a loss function l(y, q) is proper and regular if and only if G(·) is concave and for every p, q ∈ [0, 1] we have that

  L(p, q) = G(q) + (p − q) G′(q).   (10)

This property allows us to draw an immediate connection between regret and Bregman divergence. In particular, let f: S ↦ ℝ be a continuously differentiable, strictly convex function over some interval S ⊂ ℝ. Then its associated Bregman divergence takes the form

  D_f(s‖t) ≜ f(s) − f(t) − (s − t) f′(t)   (11)

for any s, t ∈ S. We focus on closed intervals, in which case the formal definition of D_f(s‖t) at boundary points requires more care; the details are summarized in Appendix A, following [29].

In the special case S = [0, 1], using (10) in (9) and comparing the result to (11), we obtain

  ΔL(p, q) = D₋G(p‖q),   (12)

i.e., the regret of a proper loss function is uniquely associated with a Bregman divergence. As an important example, the Bregman divergence associated with the Shannon entropy (8) is the KL divergence

  D_KL(p‖q) ≜ p log(p/q) + (1 − p) log((1 − p)/(1 − q)).   (13)

Of particular interest are loss functions that are convex, i.e., l such that l(y, ·) is convex. Such loss functions play a special role in learning theory and optimization [2], [26]. For example, suppose¹ X^d and Y are a set of d explanatory variables (features) and a (target) variable, respectively. Then given a set of n i.i.d. samples of X^d and Y, the empirical risk minimization (ERM) criterion seeks to minimize

  (1/n) ∑_{i=1}ⁿ l(y_i, q_i),

where q_i ≜ q_i(x_i^d) denotes a functional of the ith sample of X^d. This minimization is much easier to carry out when the loss function l is convex, particularly when d is large. In addition, the minimum of the expected loss L(p, ·) for a given p subject to constraints is typically much easier to characterize and compute when l is convex.

Conveniently, convex proper loss functions l(y, q) correspond to Bregman divergences D₋G such that D₋G(p‖·) is convex [26]. This family of divergences is of special interest in many applications [30], [31], and has an important role in our results, as will become apparent.

Accordingly, our development emphasizes the following class of analytically convenient loss functions.

Definition 1: A loss function l: {0, 1} × [0, 1] ↦ ℝ, which takes the form (1), is admissible if it satisfies the following three properties:
P1.1) l(y, q) is strictly proper, fair, and regular, i.e., satisfies (5)–(7).
P1.2) l(y, ·) is convex for each y ∈ {0, 1}.
P1.3) l(y, ·) is in C³ for each y ∈ {0, 1}, i.e., ∂^k l(y, q)/∂q^k exist and are continuous for k = 1, 2, 3.

For convenience, we refer to loss functions that satisfy property P1.3 as smooth.

¹The sequence notation a^m = (a₁, ..., a_m) is convenient in our exposition.

As further terminology, for a proper, smooth loss function l(y, q) with generalized entropy G(p),

  w(p) ≜ −G″(p)   (14)

is referred to as its weight function, which we note is non-negative. As an example, that corresponding to the log-loss is

  w_KL(q) = 1/(q(1 − q)).   (15)

Using (14), we obtain, for example,

  (∂/∂q) D₋G(p‖q) = (q − p) w(q),   (16)

by differentiating (10), which emphasizes the one-to-one correspondence between D₋G and w for such loss functions; see Appendix B for additional properties and characterizations.

Finally, representative examples of loss functions are provided in Table I, along with their generalized entropies, their associated Bregman divergences, and their weight functions.

IV. UNIVERSALITY PROPERTIES OF THE LOGARITHMIC LOSS FUNCTION

Our main result is as follows, a proof of which is provided in Appendix C.

Theorem 1: Given a loss function l(y, q) satisfying Definition 1 with corresponding generalized entropy function G, then for every p, q ∈ [0, 1],

  D_KL(p‖q) ≥ (1/C(G)) D₋G(p‖q),   (17a)

where

  C(G) > −(1/2) G″(1/2)   (17b)

is a positive normalization constant (that does not depend on p or q).

Note that a further consequence of Theorem 1 expresses that KL divergence is a "dominating" Bregman divergence in the sense that if another Bregman divergence D̃(p‖q) is such that [cf. (17a)]

  D̃(p‖q) ≥ (1/C̃(G)) D₋G(p‖q)

holds for any Bregman divergence D₋G for some C̃(G), then the theorem asserts that there exists C̃_KL such that

  D_KL(p‖q) ≥ (1/C̃_KL) D̃(p‖q).

In essence, the dominating Bregman divergences form an equivalence class, of which KL divergence is a member.

We emphasize the necessity of scaling constants in Theorem 1. Indeed, the class of loss functions satisfying Definition 1 is closed under (nonnegative) scaling, i.e., if l(y, q) (with a corresponding G) satisfies Definition 1, then so does γ l(y, q)—with a corresponding γG—for any γ > 0. A typical approach to placing loss functions on a common scale is to define a universal scaling by setting, for instance,

  −(1/2) G″(1/2) = 1,

as appears, e.g., in [2], [10]. Theorem 1 avoids imposing such a normalization, and instead absorbs such scaling into the constant C(G) to obtain the desired invariance. As an example, for the quadratic loss G″(1/2) = −2, so any C(G) > 1 suffices in this case, whence

  D_KL(p‖q) ≥ (p − q)².   (18)

The practical implications of Theorem 1 are quite immediate. Assume that the performance measure according to which a learning algorithm is to be evaluated is unknown a priori to the application (as is the case, e.g., in the weather forecasting example of Section I). In such cases, minimizing the log-loss provides an upper bound on any possible choice of measure that is associated with an "analytically convenient" loss function. As such, the log-loss is a universal choice for classification problems with respect to this class of measures.

More generally, as discussed in Section II, designing suitable loss functions is an active research field with many applications. Via Theorem 1, one obtains universality guarantees for any (current or future) loss function that is proper, convex, and smooth. We emphasize that this class of loss functions is quite rich. For instance, it is straightforward to verify that the loss functions satisfying Definition 1 form a convex set: any convex combination of such loss functions also satisfies Definition 1.

The local behavior of proper, convex, and smooth loss functions can be derived from Theorem 1. In particular, we have the following corollary.

Corollary 2: Given a loss function l(y, q) satisfying Definition 1, whose corresponding generalized entropy function is G, we have, for every p, p + dp ∈ [0, 1] and some finite C(G) > 0,

  (1/C(G)) D₋G(p‖p + dp) ≤ (dp²/2) J(p) + o(dp²),   (19a)

where

  J(p) ≜ 1/(p(1 − p))   (19b)

denotes the Fisher information of a Bernoulli distributed random variable with parameter p.

Proof: With p, p + dp ∈ [0, 1], the Taylor series expansion of the KL divergence around p is

  D_KL(p‖p + dp) = (dp²/2) J(p) + o(dp²),   (20)

where J(p) is as given in (19b). Substituting (20) into (17a) yields the desired inequality.

Corollary 2 establishes that when q is sufficiently close to p, the divergence associated with the set of smooth, proper and convex binary loss functions is effectively upper bounded by the Fisher information that locally characterizes KL divergence. As such, we conclude that the rate of convergence of any D₋G(p‖q) to zero as q → p is upper bounded by

Table I
EXAMPLES OF COMMONLY USED BINARY LOSS FUNCTIONS

Loss function    | l(y, q)                                | G(p) = L(p, p)                    | D₋G(p‖q)                                                      | w(p)
Quadratic loss   | y(1−q)² + (1−y)q²                      | p(1−p)                            | D_QL(p‖q) = (p−q)²                                            | 2
Logarithmic loss | y log(1/q) + (1−y) log(1/(1−q))        | p log(1/p) + (1−p) log(1/(1−p))   | D_KL(p‖q) = p log(p/q) + (1−p) log((1−p)/(1−q))               | 1/(p(1−p))
Boosting loss    | 2y √((1−q)/q) + 2(1−y) √(q/(1−q))      | 4 √(p(1−p))                       | D_BL(p‖q) = 2[p √((1−q)/q) + (1−p) √(q/(1−q))] − 4 √(p(1−p))  | 1/(p(1−p))^(3/2)

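The entries of Table I are easy to check numerically. The following sketch (an illustration we add here, not part of the original) evaluates two of the tabulated divergences on an interior grid and verifies the quadratic-loss instance (18) of Theorem 1, D_KL(p‖q) ≥ (p − q)², using natural logarithms throughout:

```python
import numpy as np

def d_kl(p, q):
    # Binary KL divergence (13); valid for interior points 0 < p, q < 1.
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def d_ql(p, q):
    # Bregman divergence of the quadratic loss (Table I).
    return (p - q) ** 2

def d_bl(p, q):
    # Bregman divergence of the boosting loss (Table I).
    return (2 * (p * np.sqrt((1 - q) / q) + (1 - p) * np.sqrt(q / (1 - q)))
            - 4 * np.sqrt(p * (1 - p)))

# Interior grid of (p, q) pairs, avoiding the boundary points 0 and 1.
p, q = np.meshgrid(np.linspace(0.01, 0.99, 99), np.linspace(0.01, 0.99, 99))

# Bound (18): D_KL(p||q) >= (p - q)^2 everywhere on the grid.
assert np.all(d_kl(p, q) >= d_ql(p, q) - 1e-9)
# Every Bregman divergence of a proper loss is nonnegative, D_BL included.
assert np.all(d_bl(p, q) >= -1e-9)
```

The check is restricted to interior points only to avoid 0·log 0 bookkeeping; the boundary cases are handled formally in Appendix A of the paper.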
the rate of D_KL(p‖q). This reveals that the price paid for the universality of the log-loss is its slower rate of convergence. Such behavior will be demonstrated empirically in Section VI.

V. EXTENDED BREGMAN DIVERGENCE INEQUALITIES

To extend our result to arbitrary finite alphabets, we consider the corresponding broader class of Bregman divergences. In particular, for a continuously differentiable, strictly convex function f: S ↦ ℝ over some convex set S ⊂ ℝ^m, its associated Bregman divergence takes the form

  D_f(s^m‖t^m) ≜ f(s^m) − f(t^m) − ⟨s^m − t^m, ∇f(t^m)⟩   (21)

for any s^m, t^m ∈ S when S is open, where ∇f(t^m) is the gradient of f at t^m. We focus on the set S = [0, 1]^m, and let p^m, q^m ∈ S. We emphasize that this is an extension beyond the unit simplex. Let

  H_f(p^m) ≜ ∇² f(p^m)   (22)

denote the m × m Hessian matrix of f. For example, the divergence associated with

  f(p^m) = ∑_{i=1}^m p_i log p_i   (23)

is the generalized KL divergence

  D̃_KL(p^m‖q^m) ≜ ∑_{i=1}^m p_i log(p_i/q_i) − ∑_{i=1}^m p_i + ∑_{i=1}^m q_i,   (24)

the corresponding Hessian for which is

  H_KL(p^m) ≜ ∇² (∑_{i=1}^m p_i log p_i),

which we note is a diagonal matrix whose ith diagonal element is 1/p_i. In the special case wherein p^m and q^m are probability measures (i.e., restricted to the unit simplex), we have

  D̃_KL(p^m‖q^m) = D_KL(p^m‖q^m) ≜ ∑_{i=1}^m p_i log(p_i/q_i),

which generalizes the definition in Table I.

We focus on the following class of Bregman divergences.

Definition 2: For some integer K, a Bregman divergence generator f: [0, 1]^m ↦ ℝ is K-admissible if it satisfies the following properties:
P2.1) f is a strictly convex function that is well-defined on its boundaries, in the sense of generalizing the requirements of Appendix A.
P2.2) f ∈ C^K, i.e., ∂^k f(p^m)/∂p_1 ⋯ ∂p_k exist and are continuous for k = 1, ..., K.

Our first generalization is the following theorem, whose proof is provided in Appendix D.

Theorem 3: Given a positive integer m, let f: [0, 1]^m ↦ ℝ satisfy Definition 2 for K = 2, and let D_f(p^m‖q^m) and H_f(p^m) denote the associated Bregman divergence and Hessian matrix, respectively. If there exists a (finite) positive constant C(f) such that²

  C(f) H_KL(p^m) − H_f(p^m) ≻ 0,  all p^m ∈ [0, 1]^m,   (25a)

then for every p^m, q^m ∈ [0, 1]^m,

  D̃_KL(p^m‖q^m) ≥ (1/C(f)) D_f(p^m‖q^m).   (25b)

²We use A ≻ 0 to denote that a matrix A is positive definite.

We emphasize that, in contrast to Theorem 1, the inequality (25b) applies to any Bregman divergence satisfying Definition 2, and in particular does not require D_f(p^m‖·) to be convex for any p^m ∈ [0, 1]^m. However, at the same time, we stress that Theorem 3 is restricted to the class of divergences satisfying (25a).

As an example application, when³

  f(p^m) = (1/2) (p^m)^T Q p^m,   (26)

with positive definite matrix parameter Q, the corresponding Bregman divergence

  D_f(p^m‖q^m) = (1/2) (p^m − q^m)^T Q (p^m − q^m)

is the well-known Mahalanobis distance, and the associated Hessian is

  H_f(p^m) = Q.

³Here, and elsewhere as needed, we construe a sequence a^m as a column vector.

For this divergence we have the following corollary, whose proof is provided in Appendix E.

Corollary 4: If D_f is a Mahalanobis distance, whereby f takes the form (26) with Q ≻ 0, then (25b) holds for

  C(Q) > λ_max(Q),   (27)

where λ_max(Q) is the largest eigenvalue of Q.

Our second generalization of Theorem 1 focuses on the class of separable Bregman divergences, a member of which takes the form

  D_g(p^m‖q^m) ≜ ∑_{i=1}^m d_g(p_i‖q_i)   (28a)

with

  d_g(p_i‖q_i) ≜ g(p_i) − g(q_i) − (p_i − q_i) g′(q_i),   (28b)

for p^m, q^m ∈ (0, 1)^m, where g: [0, 1] ↦ ℝ denotes a continuously differentiable, strictly convex function with additional constraints analogous to those discussed in Appendix A, and via which (28a) is extended to p^m, q^m ∈ [0, 1]^m.

Such divergences hold a special role in divergence analysis, as discussed in, e.g., [22], [32]. Note that in this case, the Bregman generator function takes the form

  f(p^m) = ∑_{i=1}^m g(p_i),   (29)

via which we obtain the Hessian as

  [H_f(p^m)]_{ij} = 1{i = j} g″(p_i).

As an example,

  g(p) = p log p   (30)

matches (23), and when used in (28b) yields

  d̃_KL(p_i‖q_i) ≜ p_i log(p_i/q_i) − p_i + q_i,   (31)

so that (28a) specializes to the generalized KL divergence (24).

Our main result is the following theorem, a proof of which is provided in Appendix F.

Theorem 5: Given a positive integer m, let D_g(p^m‖q^m) be a separable Bregman divergence satisfying Definition 2 for K = 3 and for which D_g(p^m‖·) is convex for every p^m ∈ [0, 1]^m. Then for every p^m, q^m ∈ [0, 1]^m,

  D̃_KL(p^m‖q^m) ≥ (1/C(g)) D_g(p^m‖q^m)   (32a)

when C(g) satisfies

  C(g) > g″(1).   (32b)

We remark that when g″(1) is unbounded, Theorem 5 does not yield a useful bound. By contrast, Theorem 1 is guaranteed to produce a bound, since G″(1/2) is always finite.

It is important to emphasize that while Theorem 1 restricts attention to divergences defined both over binary alphabets and only on the unit simplex—i.e., in the notation of this section,

  m = 2,  p₁ = p,  p₂ = 1 − p,  p ∈ [0, 1],

by contrast the divergences in Theorem 5 are defined for any positive integer m and, in addition, over the entire hypercube p^m, q^m ∈ [0, 1]^m. As such, we emphasize that Theorem 1 is not a special case of Theorem 5. In particular, because (32a) must hold for a domain that extends beyond the unit simplex, the smallest C(g) for which it is satisfied when m = 2 must generally be bigger than the smallest C(G) for which (17a) holds.⁴

⁴That said, if desired, via similar analysis, together with the use of Lagrange multipliers, one can obtain a version of Theorem 5 restricted to the unit simplex, for which smaller constants will generally be obtained.

As a simple application of Theorem 5, choosing g(p) = p² generates the quadratic divergence

  D_g(p^m‖q^m) = ∑_{i=1}^m (p_i − q_i)²,

which is a special case of the Mahalanobis distance. In this case, since g″(1) = 2, Theorem 5 requires C(g) > 2, yielding

  D̃_KL(p^m‖q^m) ≥ (1/2) ∑_{i=1}^m (p_i − q_i)².   (33)

Consistent with the preceding discussion, inf{C(g): C(g) > 2} = 2, corresponding to (33), is larger than the corresponding inf{C(G): C(G) > 1} = 1 in the bound (18).

Additionally, it is worth noting that (33) resembles the well-known Pinsker inequality [11], viz.,

  D_KL(p^m‖q^m) ≥ (1/2) D²_TV(p^m‖q^m),   (34)

where

  D_TV(p^m‖q^m) ≜ ∑_{i=1}^m |p_i − q_i|   (35)

is the total-variation distance (or Csiszár divergence [11]), which is not a Bregman divergence, but rather an f-divergence. It is straightforward to verify that (34) is tighter than (33) when p^m and q^m are restricted to the unit simplex. Nevertheless, this simple example serves to illustrate that Theorem 3 and Theorem 5 may be viewed as Bregman divergence extensions of some well-known f-divergence results, as discussed in Section II.

VI. NUMERICAL ANALYSIS AND EXPERIMENTS

To complement the results of Section V, we use numerical analysis to examine the dependence of

  D̃_KL(p^m‖q^m) − (1/C(f)) D_f(p^m‖q^m)

on p^m and q^m, for some choices of f such that Theorem 3 applies, and with C(f) chosen according to (25a).
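The bound (33) and its relation to Pinsker's inequality (34) can be probed directly. The following sketch (our addition, not from the paper) samples random points in the hypercube and on the simplex and checks both inequalities, in natural logarithms:

```python
import numpy as np

rng = np.random.default_rng(1)

def gen_kl(p, q):
    # Generalized KL divergence (24), defined on the hypercube (0, 1]^m.
    return np.sum(p * np.log(p / q) - p + q, axis=-1)

m = 3
p = rng.uniform(0.01, 1.0, size=(100_000, m))
q = rng.uniform(0.01, 1.0, size=(100_000, m))

# Bound (33): generalized KL dominates half the squared Euclidean distance.
quad = 0.5 * np.sum((p - q) ** 2, axis=-1)
assert np.all(gen_kl(p, q) >= quad - 1e-9)

# Restricted to the simplex, Pinsker's inequality (34) holds and is the
# tighter of the two lower bounds, since (sum |d_i|)^2 >= sum d_i^2.
p_s = p / p.sum(axis=-1, keepdims=True)
q_s = q / q.sum(axis=-1, keepdims=True)
tv = np.sum(np.abs(p_s - q_s), axis=-1)
pinsker = 0.5 * tv ** 2
assert np.all(gen_kl(p_s, q_s) >= pinsker - 1e-9)
assert np.all(pinsker >= 0.5 * np.sum((p_s - q_s) ** 2, axis=-1) - 1e-9)
```

Sampling is kept away from the origin only to avoid 0·log 0 edge cases; the formal boundary treatment is in the paper's appendices.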

To begin, we consider a random variable Y ∈ 𝒴 with |𝒴| = m, and restrict q^m to lie in a subset S of the unit simplex P^𝒴. For a given p^m ∈ P^𝒴, with

  q_f^m(S) ≜ arg min_{q^m ∈ S ⊂ P^𝒴} D_f(p^m‖q^m)   (36)

and, in turn,

  q_KL^m(S) ≜ q_{f_KL}^m(S),   (37)

where f_KL denotes the separable generator (29) with g given by (30), the minimum KL divergence upper bounds the minimum of any Bregman divergence according to

  D_KL(p^m‖q_KL^m(S)) ≥ (1/C(f)) D_f(p^m‖q_KL^m(S)) ≥ (1/C(f)) D_f(p^m‖q_f^m(S)),   (38)

where to obtain the first inequality we have used Theorem 3 with C(f) satisfying (25a), and to obtain the second inequality we have used (36).

In the first experiment, we set

  S = {q^m ∈ P^𝒴 : E_{q^m}[h(Y)] = μ},

for some h: 𝒴 ↦ ℝ and μ, i.e., we constrain q^m to lie in a hyperplane restricted to the unit simplex P^𝒴. More specifically, we choose 𝒴 = {−1, 0, 1}, h(y) = y, and p³ = (1/4, 1/2, 1/4) to illustrate our results. The results of our experiment, which compares the minima in (38) as a function of μ, are depicted in Fig. 1. The top (blue) curve is (with a minor abuse of notation) D_KL(p³‖q³_KL(μ)), and the progressively lower (red, purple, and black) curves are (with a similar abuse of notation) D_f(p³‖q³_f(μ))/C(f) with f corresponding to the quadratic divergence, the separable Mahalanobis distance with parameters Q_s, and the nonseparable Mahalanobis distance with parameters Q_ns, respectively. The specific values of these parameters are

  Q_s = [3 0 0; 0 2 0; 0 0 1]  and  Q_ns = [3 1/2 1/2; 1/2 2 1/2; 1/2 1/2 1].   (39)

[Figure 1. Divergence bound behavior under mean constraint. Depicted is the minimum of D_f(p³‖q³)/C(f) with respect to q³ ∈ P^𝒴 with E_{q³}[Y] fixed, for Y ∈ 𝒴 = {−1, 0, 1}, p³ = (1/4, 1/2, 1/4), and different choices for f, as described in text.]

Note that since E_{p³}[Y] = 0 for our choice of p³, all the minimum divergences are zero at μ = E_{q³}[Y] = 0, and thus q³_f(0) = p³ for all Bregman generators f. However, when μ ≠ 0, the optimizing q³_f(μ) must differ from p³, and Fig. 1 quantifies these differences as a function of the bias μ. Consistent with the analysis of Section V, KL divergence upper bounds normalized measures of all these differences.

In the second experiment we show that the bounds (38) hold for a broader range of problems. To model a statistical, computational, or even algorithmic constraint that prevents q^m from converging to some given p^m ∈ P^𝒴, we impose that q^m ∈ S where

  S = {q^m ∈ P^𝒴 : D(p^m‖q^m) ≥ ε}   (40)

for some D and ε > 0. In Fig. 2, we compare the terms in (38) for different choices for f, and two different (non-Bregman) examples of D in (40). In particular, the upper plots correspond to choosing for D in (40) the total-variation distance D_TV as defined in (35). For contrast, the lower plots correspond to choosing for D in (40) the (Neyman) chi-square divergence, i.e.,

  D_χ²(p^m‖q^m) = ∑_{i=1}^m (p_i − q_i)²/q_i.   (41)

The plots on the left compare D_KL(p^m‖q^m_KL(S)) from (37) to D_f(p^m‖q^m_f(S))/C(f) from (36), for f corresponding to the quadratic and separable Mahalanobis distances (where the latter has parameters Q_s as specified in (39)). Consistent with (38), D_KL(p^m‖q^m_KL(S)) upper bounds D_f(p^m‖q^m_f(S))/C(f) for both the quadratic divergence and the separable Mahalanobis distance. Moreover, we see that larger values of ε result in a greater bias, as we would expect.

The plots on the right compare D_KL(p^m‖q^m_KL(S)) to the middle term in (38), i.e., D_f(p^m‖q^m_KL(S))/C(f), for f corresponding to the quadratic distance. The results demonstrate that q^m_KL(S) can, indeed, be an effective approximation to q^m_f(S) with respect to minimizing D_f(p^m‖·).

In the third experiment we demonstrate the application of our bounds to weather forecasting, as discussed in Section I. Recall that weather forecasters typically assign probabilistic estimates to future meteorological events. The estimates are designed to minimize a performance measure, according to which the weather forecaster is evaluated. However, weather estimates serve a wide audience, within which different recipients may be interested in different and often conflicting measures. For example, by minimizing the quadratic loss, a forecaster may reasonably assign zero probability of occurrence to very rare events, but this would result in an unbounded logarithmic loss.

To demonstrate the value of using log-loss minimization to control a large set of commonly used performance measures, we analyze weather data collected by the Australian Bureau of Meteorology [33]. This publicly available dataset contains the

Figure 2. Divergence bound behavior under a divergence constraint. On the left is depicted the minimum of D_f(p^3 ‖ q^3)/C(f) over q^3 for p^3 = (1/4, 1/2, 1/4) subject to D(p^3 ‖ q^3) ≥ ε. In the plots on the right, the minimizing q^3 is replaced with that minimizing KL divergence. The upper and lower plots correspond to D being the total-variation and chi-square divergences, respectively. The different choices for f are as in Fig. 1.
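To make the comparison in (38) concrete, the following small numerical sketch (our own illustration, not code from the paper) reproduces the setup of the first experiment by grid search: 𝒴 = {−1, 0, 1}, p^3 = (1/4, 1/2, 1/4), the mean constraint E_q[Y] = µ, and the quadratic divergence, for which C(f) = 2 suffices (pointwise, D_KL ≥ D_f/2 by Pinsker's inequality).

```python
import math

# First-experiment setup: Y in {-1, 0, 1}, p = (1/4, 1/2, 1/4), and the
# constraint set S = {q : E_q[Y] = mu}, a hyperplane in the simplex.
p = (0.25, 0.5, 0.25)
mu = 0.3

def kl(p, q):
    # KL divergence in nats between distributions on a common support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def quad(p, q):
    # Quadratic (squared Euclidean) Bregman divergence
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def constrained_minimum(div, n_grid=2000):
    # Parametrize S by a = q_{-1}: q = (a, 1 - 2a - mu, a + mu), so that
    # sum(q) = 1 and E_q[Y] = q_{+1} - q_{-1} = mu; grid-search over a.
    best = float("inf")
    for k in range(1, n_grid):
        a = k / n_grid * (1 - mu) / 2
        q = (a, 1 - 2 * a - mu, a + mu)
        if min(q) > 0:
            best = min(best, div(p, q))
    return best

min_kl = constrained_minimum(kl)
min_quad = constrained_minimum(quad)
C = 2.0  # a valid C(f) for the quadratic divergence on the simplex
assert min_kl >= min_quad / C  # the chain of inequalities in (38)
```

Since the pointwise bound D_KL(p ‖ q) ≥ D_f(p ‖ q)/C(f) holds at every grid point, it is preserved by the minima over the same grid, mirroring the first inequality in (38).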

This publicly available dataset contains the observed weather and its corresponding forecasts in multiple weather stations in Australia. In our experiment we focus on the predicted chances of rain (where a rainfall is defined as over 2 mm of rain) compared with the true event of rain. Our dataset contains n = 33,134 pairs {(x_1, y_1), …, (x_n, y_n)} of forecasts and corresponding weather observations that were collected during the period Apr. 28–30, 2016. For reference, in this period, a fraction

  (1/n) Σ_{i=1}^n y_i = 0.09

of the observations correspond to an event of rain. We evaluate the accuracy of the Australian weather forecasts by three commonly used proper loss measures: logarithmic, quadratic, and 0-1 losses, with the latter defined via

  l(y, q) = y 1{q < t} + (1 − y) 1{q ≥ t},

and where we choose as its parameter t = 0.35, following the Bureau's guidelines. The first row of Table II summarizes our results.

Table II
WEATHER FORECAST EXPERIMENT

                          0-1 loss    Quadratic loss    Logarithmic loss
  Australian Forecaster    0.0898         0.0676                ∞
  Modified Forecaster      0.0901         0.0675             0.234

Note that the unbounded logarithmic loss is a consequence of the fact that there are several instances in which the forecaster predicted zero chance of rain but it ultimately rained. In correspondence, Australia's National Meteorological Service confirmed that their forecasts are typically internally evaluated by both a quadratic loss and a 0-1 loss with parameter t = 0.35. In addition, they perform more sophisticated evaluation analysis which is beyond the scope of this work.

Next, we consider a method for revising the existing forecasts based on our log-loss universality results. Since the available forecasts are generated by a prediction algorithm whose features are unavailable to us, our revised forecasts can only be based on the existing forecasts. Accordingly, we make use of a simple logistic regression in which the target is the observed data and the single feature is the corresponding original forecast. Specifically, given an original weather forecast of x ∈ [0, 1], we generate the following updated weather forecast according to

  q_{β0,β1}(x) = 1/(1 + e^{−β0−β1x}),    (42)

where the regression parameters β0 and β1 are fit to training data {(x̃_1, ỹ_1), …, (x̃_ñ, ỹ_ñ)} according to

  arg min_{β0,β1} (1/ñ) Σ_{i=1}^{ñ} l_log(ỹ_i, q_{β0,β1}(x̃_i)).

To avoid over-fitting, the training data was from Jan. 2016, and thus different from the test data. The accuracy of the resulting updated forecasts is presented in the second row of Table II.

Note that the updated forecasts now incur a bounded log-loss, and that this robustness is achieved without significantly affecting accuracy with respect to the other loss functions.
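The recalibration step (42) can be sketched as follows. The code below (our own illustration) fits the two regression parameters by batch gradient descent on the empirical log-loss; the training pairs are synthetic stand-ins, since the Bureau's data is not reproduced here, and the rain model generating them is hypothetical.

```python
import math, random

# A minimal sketch of the forecast-revision step (42): a one-feature
# logistic regression fit by minimizing the empirical log-loss.
random.seed(0)
train = []
for _ in range(2000):
    x = random.random()                               # original forecast in [0, 1]
    y = 1 if random.random() < 0.8 * x + 0.02 else 0  # hypothetical rain model
    train.append((x, y))

def q(x, b0, b1):
    # Updated forecast, eq. (42)
    return 1.0 / (1.0 + math.exp(-b0 - b1 * x))

b0 = b1 = 0.0
lr = 0.5
for _ in range(500):  # batch gradient descent on the log-loss
    g0 = g1 = 0.0
    for x, y in train:
        err = q(x, b0, b1) - y  # gradient of log-loss w.r.t. the logit
        g0 += err
        g1 += err * x
    b0 -= lr * g0 / len(train)
    b1 -= lr * g1 / len(train)

# A raw forecast of 0 is mapped to a strictly positive probability, so the
# revised forecasts incur a bounded log-loss even when it ultimately rains.
assert 0.0 < q(0.0, b0, b1) < 1.0
```

Because the sigmoid never outputs exactly 0 or 1, the revised forecasts cannot incur an infinite log-loss, which is precisely the robustness reported in the second row of Table II.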

Evidently, even such simple post-processing improves log-loss performance while controlling a large set of alternative measures, consistent with the results of Theorem 1 (and of those in [25] for the 0-1 loss).

VII. EXAMPLE APPLICATIONS

The Bernoulli log-likelihood loss function is widely used in a variety of scientific fields. Several key examples, in addition to those discussed above, include logistic regression in statistical analysis [34], the info-max criterion in machine learning [35], independent component analysis in signal processing [36], [37], splitting criteria in classification trees [38], DNA sequence alignment [39], and many others. In this section we demonstrate the potential applicability of our universality results in the context of three key examples.

A. Universal Clustering with Bregman Divergences

Data clustering is an unsupervised learning procedure that has been extensively studied across a variety of disciplines over many decades. Most clustering methods assign each data sample to one of a pre-specified number of partitions, with each partition defined by a cluster representative, and where the quality of clustering is measured by the proximity of samples to their assigned cluster representatives, as measured by a pre-defined distance function.

Several popular algorithms for data clustering have been developed over the years. These include the well-known k-means algorithm [40], which minimizes the quadratic distance. Another widely used example is the Linde–Buzo–Gray (LBG) algorithm [41], [42], based on the Itakura–Saito distance [43]. More recently, Dhillon et al. [44] proposed an information-theoretic approach to clustering probability distributions based on KL divergence.

All of these clustering methods are based on an Expectation-Maximization (EM) framework for minimizing the aggregate distance, and share the same optimality property: the centroid (representative) of each cluster is the mean of the data points that are assigned to it. Moreover, all of these algorithms use a Bregman divergence as their measure of distance, as do some promising emerging methods. For example, a new class of clustering methods has been shown to offer significant improvement in various domains by utilizing the so-called total Bregman divergence, a rotation-invariant version of the classical Bregman divergence [45]–[49].

The connection between clustering and the Bregman divergence is developed in Banerjee et al. [20]. In particular, a key result is that a random variable X satisfies

  E[X] = arg min_z E[D_f(X ‖ z)]    (43)

if and only if D_f is a Bregman divergence. It follows that any clustering algorithm that satisfies the "mean-as-minimizer" centroid property minimizes a Bregman divergence, and thus we need look no further than among the Bregman divergences in selecting a candidate distance measure for EM-based data clustering.

Even with this restriction, it is frequently not clear how to choose an appropriate Bregman divergence for a given clustering task. Banerjee et al. [20] show that there is a unique correspondence between exponential families and Bregman divergences. As such, if the data are from an exponential family, with different parameters for different clusters, then the natural distance for clustering is the corresponding Bregman divergence. As an example, for Gaussian distributions with differing means, the quadratic distance used by k-means is the natural distance. However, in practice, information about the generative model for the data is rarely known.

As an alternative, our results suggest a "universal" approach to clustering that provides performance guarantees with respect to any Bregman divergence that might turn out to be relevant. Specifically, suppose we are given samples x^n to be partitioned into k clusters with corresponding representatives µ^k = (µ_1, …, µ_k). Then the optimum solution for measure D_f is

  µ_f^k ≜ arg min_{µ^k} Σ_{j=1}^k Σ_{i ∈ I_j^f(µ^k)} D_f(x_i ‖ µ_j),

where^5

  I_j^f(µ^k) ≜ {i ∈ {1, …, n}: D_f(x_i ‖ µ_j) < D_f(x_i ‖ µ_{j′}), all j′ ≠ j}.

Similarly, for measure D̃_KL, we use the (slightly simpler) notation I_j^KL(µ^k) = I_j^{f_KL}(µ^k), and µ_KL^k ≜ µ_{f_KL}^k.

Using Theorem 3 (for f and C(f) satisfying the conditions of the theorem), we can then bound performance with respect to D_f according to [cf. (38)]

  Σ_{j=1}^k Σ_{i ∈ I_j^KL(µ_KL^k)} D̃_KL(x_i ‖ µ_j^KL)
    ≥ (1/C(f)) Σ_{j=1}^k Σ_{i ∈ I_j^KL(µ_KL^k)} D_f(x_i ‖ µ_j^KL)
    ≥ (1/C(f)) Σ_{j=1}^k Σ_{i ∈ I_j^f(µ_f^k)} D_f(x_i ‖ µ_j^f),    (44)

where µ_f^k = (µ_1^f, …, µ_k^f).

Via (44), we conclude that by applying a clustering algorithm that minimizes KL divergence, we provide performance guarantees for any (reasonable) choice of clustering method. As such, our analysis provides further justification for the popularity of the KL divergence in distributional clustering [50] and specifically in the context of natural language processing and text classification [51]–[53].

B. The Universality of the Information Bottleneck

The information bottleneck [54] is a conceptual machine learning framework for extracting an informative but compact representation of an explanatory variable^6 X with respect to inferences about a target Y, generalizing the notion of a minimal sufficient statistic from classical parametric statistics.

^5 When a sample is equidistant to multiple representatives, we pick one arbitrarily.
^6 Note that X can equivalently represent a collection of variables.
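Returning to the clustering guarantee (44) of Section VII-A, the following numerical sketch (our own, on synthetic probability vectors) runs Lloyd-style Bregman clustering under KL divergence and checks that C(f) times the resulting KL cost upper-bounds the quadratic-divergence cost, both at the KL solution and after a quadratic refinement started from it. Here C(f) = 2, a valid constant for the quadratic divergence on the simplex.

```python
import math, random

# A sketch of the universal clustering guarantee (44), with synthetic data.
random.seed(1)

def normalize(v):
    s = sum(v)
    return [vi / s for vi in v]

data = [normalize([random.random() + 0.05 for _ in range(4)]) for _ in range(60)]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def quad(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean(points):
    # The Bregman centroid is the mean of the assigned points, eq. (43).
    return [sum(c) / len(points) for c in zip(*points)]

def lloyd(div, centers, iters=20):
    # Lloyd iterations: nearest-representative assignment, then mean update.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda j: div(x, centers[j]))
            clusters[j].append(x)
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    cost = sum(min(div(x, m) for m in centers) for x in data)
    return centers, cost

init = random.sample(data, 3)
centers_kl, cost_kl = lloyd(kl, init)
# Quadratic cost of the KL solution, and a quadratic refinement started there.
cost_quad_at_kl = sum(min(quad(x, m) for m in centers_kl) for x in data)
_, cost_quad = lloyd(quad, centers_kl)

C = 2.0
assert cost_quad <= cost_quad_at_kl + 1e-9   # Lloyd steps never increase cost
assert cost_quad_at_kl <= C * cost_kl + 1e-9  # pointwise D_quad <= C * D_KL
```

The two assertions mirror the chain in (44): the KL clustering controls the quadratic cost of its own solution, which in turn dominates any quadratic refinement of it.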

Given the joint distribution p_{X,Y}, the method selects the compressed representation for X that preserves the maximum amount of information about Y. As such, Y effectively regulates the compression of X, so as to maintain a level of explanatory relevance with respect to Y. Specifically, with T denoting the compressed representation, the information bottleneck problem is

  max_{p_{T|X}} I(T; Y)  subject to  I(X; T) ≤ Ī,    (45)

where T ↔ X ↔ Y form a Markov chain, and thus the maximization is over all possible (generally randomized) mappings of X to T. Here, Ī is a constant parameter that sets the level of compression to be attained. As Ī is varied, the tradeoff between I(X; T) (corresponding to the representation complexity) and I(T; Y) (corresponding to the predictive power) is a continuous, concave function.

Information bottleneck analysis is a powerful tool in a variety of machine learning domains and related areas; see, e.g., [55]–[59]. It is also applicable in a variety of other fields, including neuroscience [60] and optimal control [61]. Recently, there have been demonstrations of its ability to analyze the performance of deep neural networks [62]–[64].

It is useful to recognize that the information bottleneck problem (45) is an instance of a remote-source rate-distortion problem [11]. In particular, let Y be a remote source that is unavailable to the encoder, and let X be a random variable that depends on Y through a (known) mapping p_{X|Y}, which is available to the encoder. The remote source coding problem is to achieve the highest possible compression rate for X given a prescribed maximum tolerable reconstruction error of Y from the compressed representation T. In this setting, the reconstruction error is measured by a predefined distortion (loss) function, where the choice of log-loss leads to the standard information bottleneck problem [65].

While the choice of log-loss is typically justified by several properties of KL divergence [22], the results of this paper can be applied to show that its use provides valuable universality guarantees for the remote source coding problem.

To develop this view, first note that

  I(T; Y) = I(X; Y) − E_{p_{X,T}}[D_KL(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))],

which follows from straightforward algebra. In this form, we recognize p_{Y|X} as the full predictive model and p_{Y|T} as the compressed predictive one. Since p_{X,Y} is given, we maximize I(T; Y) (as (45) dictates) by minimizing E_{p_{X,T}}[D_KL(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))]. In the more general source coding problem, we instead seek to minimize

  E_{p_{X,T}}[D_f(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))],    (46)

with f chosen as desired.

When the appropriate choice of f is not clear, via Theorem 3 we have

  E_{p_{X,T}}[D_f(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))] ≤ C(f) E_{p_{X,T}}[D_KL(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))]    (47)

for any Bregman divergence D_f that satisfies Definition 2 and (25a) for some C(f) > 0. Therefore, by minimizing E_{p_{X,T}}[D_KL(p_{Y|X}(·|X) ‖ p_{Y|T}(·|T))] we effectively minimize (46) for any divergence that might reasonably be of interest.

Finally, it is worth noting that in classification problems, separable divergence measures are popular. In this case, then, via Theorem 5 we obtain a universality bound of the form (47) for any separable Bregman divergence D_f that is convex in its second argument.

C. Universal PAC-Bayes Bounds

Probably approximately correct (PAC)-Bayes theory blends Bayesian and frequentist approaches to the analysis of machine learning. The PAC-Bayes formulation assumes a probability distribution on events occurring in nature, and a prior on the class of candidate hypotheses (estimators) that expresses a learner's preference for some hypotheses over others. PAC-Bayes generalization bounds [66]–[68] govern the performance (loss) when stochastically selecting hypotheses from a posterior distribution. We begin this section with a summary of those aspects of PAC-Bayes theory needed for our development.

Let X be an explanatory variable^7 (feature) and Y an independent variable (target). Assume that X and Y follow a joint probability distribution p_{X,Y}. Let H be a class of hypotheses (estimators) for Y, where each estimator q ∈ H is some functional of X. As an example, in logistic regression, each hypothesis is an estimator of the form (42) for some constants β0 and β1.

Next, we view q as a realization of a random variable Q that is independent of X and Y and governed by a (prior) distribution p⁰_Q on H, and let l(y, q(x)) be the loss between the realization y and the estimate q(x), for a given estimator q and loss function l, such that l(y, q(x)) ∈ [0, L_max] for some constant L_max > 0 and all x, y, and q. We select q ∈ H based on i.i.d. training samples

  T_n ≜ {(x_1, y_1), …, (x_n, y_n)}

from p_{X,Y} so as to minimize the generalization loss

  L_q = E_{p_{X,Y}}[l(Y, q(X))].

In particular, the selection is based on the training loss

  L̂_q ≜ (1/n) Σ_{i=1}^n l(y_i, q(x_i)).

An example of a standard generalization bound of this type is the following, in which H is assumed to be countable.

Theorem 6 (PAC bound [68]): Given training data T_n from p_{X,Y}, with probability at least 1 − δ,

  L_q ≤ (2λ/(2λ − 1)) [ L̂_q + (λ L_max/n) log(1/(δ p⁰_Q(q))) ],

for all q ∈ H and all λ > 1/2.

^7 While the development generalizes naturally to collections of explanatory variables, to simplify the exposition we focus on a single such variable.
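For a finite hypothesis class, the bound of Theorem 6 can be evaluated directly. The sketch below (our own illustration, with a hypothetical data model) uses constant forecasters under the quadratic loss, a uniform prior p⁰_Q over H, and L_max = 1.

```python
import math, random

# A small sketch of the PAC bound of Theorem 6: a finite class H of
# constant forecasters q(x) = c, a uniform prior over H, quadratic loss.
random.seed(2)
n, delta, L_max = 500, 0.05, 1.0
H = [c / 10 for c in range(1, 10)]   # nine constant hypotheses
prior = 1.0 / len(H)                 # uniform prior p0_Q(q)

# Hypothetical Bernoulli(0.3) targets standing in for training data.
samples = [1 if random.random() < 0.3 else 0 for _ in range(n)]

def emp_loss(c):
    # Training loss L_hat_q under the quadratic loss
    return sum((y - c) ** 2 for y in samples) / n

def pac_bound(c, lam):
    # Right-hand side of Theorem 6 for hypothesis q(x) = c
    return (2 * lam / (2 * lam - 1)) * (
        emp_loss(c) + lam * L_max / n * math.log(1 / (delta * prior)))

# The bound holds simultaneously for every q in H and every lambda > 1/2;
# in particular it always dominates the training loss itself.
for c in H:
    assert pac_bound(c, lam=1.0) >= emp_loss(c)
```

The log(1/(δ p⁰_Q(q))) term shows how a diffuse prior over a large class inflates the bound, which is the mechanism the PAC-Bayes extension (Theorem 7) replaces with the KL term D_KL(p_Q ‖ p⁰_Q).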

In the PAC-Bayes extensions of Theorem 6, we allow H to be continuous (uncountable). Moreover, in addition to p⁰_Q we let p_Q be another distribution over H, and define

  L_{p_Q} ≜ E_{p_{X,Y}}[L̃_{p_Q}(X, Y)]    (48)

and

  L̂_{p_Q} ≜ (1/n) Σ_{i=1}^n L̃_{p_Q}(x_i, y_i),    (49)

where

  L̃_{p_Q}(x, y) ≜ E_{p_Q}[l(y, Q(x))].    (50)

While tighter PAC-Bayes bounds have been developed—see, e.g., [67], [69]–[72]—the original is the following, which can be derived as a corollary of results by Catoni [69].

Theorem 7 (PAC-Bayes bound [66]): Given training data T_n from p_{X,Y}, with probability at least 1 − δ,

  L_{p_Q} ≤ (2λ/(2λ − 1)) [ L̂_{p_Q} + (λ L_max/n) ( D_KL(p_Q ‖ p⁰_Q) + log(1/δ) ) ],

for all p_Q on H and all λ > 1/2.

Evidently, the bounds in both Theorem 6 and Theorem 7 are specific to the choice of loss function l. For scenarios where such a choice is not clear, a "universal" PAC-Bayes bound based on log-loss, which we now develop, is useful.

A complication in the development is that PAC-Bayes bounds apply only to bounded loss functions, as they focus on worst-case performance [68], and thus log-loss is inadmissible. Different approaches have been introduced to overcome this limitation. In [68], McAllester suggests modifying an unbounded loss by applying an "outlier threshold" L_max to replace l(y, q(x)) with min{l(y, q(x)), L_max}, which is always bounded. This approach introduces analytical difficulties, as the new loss is typically neither continuous nor convex.

An alternative approach, which we follow and whose use is more widespread, assumes that the underlying distribution for the data is bounded away from zero [73]–[76]. Equivalently, the model p_{Y|X} is not deterministic (singular), and the hypothesis class is chosen accordingly. Specifically, for some ∆ > 0 we have p_{Y|X}(y|x), q(x) ∈ [∆, 1 − ∆] for every x, y, and q.

Via the latter methodology, the loss function is bounded on the domain of interest, and we obtain the following universal PAC-Bayes inequality, a proof of which is provided in Appendix G.

Theorem 8: Let l(y, q) be a loss function that satisfies Definition 1, and G its corresponding generalized entropy function. If p(y|x), q(x) ∈ [∆, 1 − ∆] for some ∆ > 0 and every x, y, and q ∈ H, then with probability at least 1 − δ,

  L_{p_Q} ≤ (2λC(G)/(2λ − 1)) [ L̂^log_{p_Q} + (λ L_max/n) ( D_KL(p_Q ‖ p⁰_Q) + log(1/δ) ) ],    (51)

for all p_Q on H and all λ > 1/2. In (51), L_max = − log ∆,

  C(G) > −(1/2) G″(1/2)

is a normalization constant that depends only on G, and L̂^log_{p_Q} is of the form (49), where L̃_{p_Q} is specialized to [cf. (50)]

  L̃^log_{p_Q}(x, y) ≜ E_{p_Q}[l_log(y, Q(x))],

with l_log as defined in (3).

Theorem 8 establishes that even when we do not know a priori the loss function with respect to which we are to be measured, it is often possible to bound the generalization loss. Such universal generalization bounds have a potentially wide range of applications.

VIII. DISCUSSION AND CONCLUSIONS

In this work we introduce a fundamental inequality for two-class classification problems. We show that the KL divergence, associated with the Bernoulli log-likelihood loss, upper bounds any divergence measure that corresponds to a smooth, proper, and convex binary loss function. This property makes the log-loss a universal choice, in the sense that it controls any "analytically convenient" alternative one may be interested in.

This result has implications in a wide range of applications. There are many examples beyond those we have explicitly described. For instance, in binary classification trees [38], the split criterion in each node is typically chosen between the Gini impurity (which corresponds to quadratic loss) and information gain (which corresponds to log-loss). The best choice for a splitting mechanism is a long-standing open question with many statistical and computational implications; see, e.g., [77]. Our results indicate that by minimizing the information-gain criterion we implicitly obtain guarantees for the Gini impurity (but not vice versa). This provides a new and potentially useful perspective on the question.

Finally, by viewing our bounds from a Bregman divergence perspective, we extend the well-studied f-divergence inequalities by providing complementary Bregman inequalities. Collectively, these results contribute to our growing understanding of the fundamental role that KL divergence plays in these two important classes of divergences.

APPENDIX A
BREGMAN DIVERGENCE CHARACTERIZATION

Following [29], a Bregman divergence generator is a continuous, strictly convex (finite) function f: S → R on some appropriately chosen open interval S = (a, b) such that [a, b] covers (at least) the union of the ranges of s and t, as appears in (11)—e.g., S = [0, 1] in the binary classification problem of Section IV. Due to condition P2.2, we further restrict our attention to continuously differentiable f.

We continuously extend f to f̄: [a, b] → R ∪ {+∞} via

  f̄(s) ≜ { lim_{s′→a} f(s′),  s = a;
           f(s),              s ∈ (a, b);
           lim_{s′→b} f(s′),  s = b },

which can be infinite only for s ∈ {a, b}. Moreover, we APPENDIX C 0 continuously extend the derivative f (s):(a, b) 7→ R to PROOFOF THEOREM 1 ¯0 f :[a, b] 7→ R ∪ {−∞, +∞} via First, due to the convexity of the loss (with respect to q),  we have lim f 0(s) s = a s→a 2  ∂ ∂ 0 f¯0(s) f 0(s) s ∈ (a, b) L(p, q) = w(q)(q − p) = w(q) + (q − p)w (q) ≥ 0 , ∂q2 ∂q  0 lim f (s) s = b. (53) s→b for every fixed p ∈ [0, 1] and q ∈ (0, 1). Specializing (53) to Using these extensions, for S = [a, b], we let the cases p = 0 and p = 1 then yields 0 ¯ 1 w (q) 1 Df (skt) , ψ(s, t), − ≤ ≤ (54) q w(q) 1 − q ¯ 2 where ψf :[a, b] 7→ R ∪ {+∞} is following lower semi- for all q ∈ (0, 1). In turn, (54) implies continuous nonnegative function. First, Z 1/2 1 Z 1/2 w0(q) Z 1/2 1 − dq ≤ dq ≤ dq ¯ ¯ 0 ψf (s, t) , f(s) − f(t) − (s − t)f (t), s ∈ [a, b], t ∈ (a, b). 0 q 0 w(q) 0 1 − q Z 1 1 Z 1 w0(q) Z 1 1 Next, for s ∈ (a, b), − dq ≤ dq ≤ dq, 1/2 q 1/2 w(q) 1/2 1 − q ¯ ψf (s, a) i.e., ( ¯0  ¯0  ¯0 w(1/2) w(1/2) f(s) − sf (a) + lim tf (a) − f(t) f (a) > −∞ ≤ w(q) ≤ , q ∈ (0, 1/2) (55a) , t→a 2(1 − q) 2q ∞ f¯0(a) = −∞ w(1/2) w(1/2) ≤ w(q) ≤ , q ∈ [1/2, 1). (55b) and 2q 2(1 − q) Similar results appear in, e.g., [26, Theorem 29]. We empha- ψ¯ (s, b) f size that we have not assumed that w(·) is integrable on (0, 1), ( ¯0  ¯0  ¯0 f(s) − sf (b) + lim tf (b) − f(t) f (b) < +∞ so as to accommodate loss functions such that l0(·) and/or l1(·) , t→b ∞ f¯0(b) = +∞, are unbounded at 0 and 1, respectively [2]. Next, we show there exists a constant C such that where we note the limits exist but may be infinite. Finally, R(p, q) , CDKL(pkq) − D−G(pkq) (56)  0 (s, t) = (a, a) is nonnegative for all p, q ∈ [0, 1]. For any p ∈ [0, 1], since   0   lim f(s) − sf¯ (b) R(p, p) = 0 it suffices to show that R(p, ·) has a minimum at  s→a   ¯0  (s, t) = (a, b) p for a suitable choice of C. 
From  + lim tf (b) − f(t) ψ¯ (s, t) = t→b ∂  C  f  0   lim f(s) − sf¯ (a) R(p, q) = (q − p) − w(q) , (57)  s→b ∂q q(1 − q)   ¯0  (s, t) = (b, a)  + lim tf (a) − f(t) we see that q = p is a unique stationary point. Moreover, this  t→a 0 (s, t) = (b, b). stationary point is a minimum when

∂2 R(p, q) APPENDIX B ∂q2 q=p WEIGHT FUNCTIONSOF SMOOTH PROPER LOSSES     p 1−p 0 As a complementary view of weight functions, we note = C + − w(q) − (q−p) w (q) q2 (1−q)2 that when a smooth loss function is proper, its expected loss q=p C satisfies = − w(p) > 0, (58) p(1 − p) ∂ 0 0 L(p, q) = p l1(p) + (1 − p) l0(p) = 0, ∂q for all p ∈ (0, 1). q=p Now for every q ∈ (0, 1/2), we have whence C C 1  1 1 −l0 (p) l0 (p) −w(q) > −w(q) ≥ C − w , (59a) 1 = 0 = w(p), (52) q(1 − q) q q 2 2 1 − p p where the first inequality follows since q > 0, and the last where the last equality in (52) is obtained by matching terms inequality follows from (55a). Similarly, for q ∈ [1/2, 1) we in the forms (2) and (10), and using (14). Shuford et al. [78] have establish that the converse is also true: a smooth loss function C C 1  1 1 −w(q) > −w(q) ≥ C − w , is proper only if (52) holds for some nonnegative w(p) that q(1 − q) 1 − q 1 − q 2 2 R 1− satisfies  w(p) dp < ∞, for all  > 0. (59b) 13 where the first inequality follows since q < 1, and the last with inequality follows from (55b). Hence, choosing λ1(V ) ≥ · · · ≥ λm(V ) , λmin(V ) 1 1 1 1 C > w = − G00 (60) denoting its eigenvalues, and note that according to (25a) 2 2 2 2 it suffices to show that when C(Q) satisfies (27), we have m ensures the right-hand side of (the relevant variant of) (59) is V (p ) 0, for which the condition λmin(V ) > 0 is positive for all q ∈ (0, 1), and thus (58) holds for all p ∈ (0, 1). equivalent. m Hence, we conclude that R(p, q) ≥ 0 for all p, q ∈ (0, 1). Next, let V˜ = V (1 ), whose eigenvalues we denote via Next consider the case p ∈ {0, 1} and q ∈ (0, 1). If choose λ (V˜ ) ≥ · · · ≥ λ (V˜ ) λ (V˜ ), C according to (60), then (58) holds for all p ∈ (0, 1). In 1 m , min m m this case, (57) must be strictly positive for all q ∈ (0, 1) and note that for every x ∈ R , when p = 0, so R(0, ·) is monotonically increasing, and m 2 thus its minimum is attained at 0. 
Likewise (57) must be m T m X x m T m (x ) V x = C(Q) i − (x ) Q x strictly negative for all q ∈ (0, 1) when p = 1, so R(1, ·) p i=1 i is monotonically decreasing, and thus its minimum is attained m X 2 m T m at 1. In turn, since R(0, 0) = R(1, 1) = 0, it follows that ≥ C(Q) xi − (x ) Q x R(p, q) ≥ 0 also holds for p ∈ {0, 1} and q ∈ (0, 1). i=1 It remains to consider the case p ∈ [0, 1] and q ∈ {0, 1}. = (xm)TV˜ xm. When p = q ∈ {0, 1} we have R(p, q) ≥ 0 since R(0, 0) = R(1, 1) = 0. Finally, when p 6= q ∈ {0, 1} we have DKL(pkq) Hence, is unbounded, so (17a) holds trivially. m T m λmin(V ) = min (x ) V x m P 2 {x : i xi =1} APPENDIX D m T m ˜ ≥ min (x ) V x = λmin(V ), PROOFOF THEOREM 3 m P 2 {x : i xi =1} It suffices to show that where the equalities follow from the Rayleigh quotient theo- R(pm, qm) C(f) D˜ (pmkqm) − D (pmkqm) rem [79, Theorem 4.2.2]. , KL f ˜ 8 ˜ Finally, λi(V ) = C(Q) − λi(Q) since V = C(Q) I − Q, is nonnegative for all pm, qm ∈ [0, 1]m. Using (21) in the form so ˜ m λmin(V ) ≥ λmin(V ) = C(Q) − λmax(Q). m m m m X ∂ m D (p kq ) = f(p ) − f(q ) − (p − q )f(q ), f ∂q k k m k=1 k Accordingly, setting C(Q) > λmax(Q) yields V (p ) 0 for (61) all pm ∈ [0, 1]m. we have  m  ∂ m m ∂f(p ) APPENDIX F R(p , q ) = C(f) log pi − ∂pi ∂pi PROOFOF THEOREM 5  ∂f(qm) − C(f) log q − (62) First, note that i ∂q i ∂ d (pkq) = (q − p)g00(q) (64a) and, in turn, ∂q g 2 2 m 2 ∂ m m 1{i = j} ∂ f(p ) ∂ 000 00 R(p , q ) = C(f) − 2 dg(pkq) = (q − p)g (q) + g (q). (64b) ∂pi∂pj pi ∂pi∂pj ∂q = C(f) H (pm) − H (pm) , (63) KL f i,j Since dg(p, ·) is convex for every p ∈ [0, 1],(64b) is nonneg- ative for every p ∈ [0, 1] and q ∈ (0, 1). Choosing p = 0 we where [·]i,j denotes the i, jth element of its matrix argument. m obtain R(·, q ) 000 Hence, it follows that is strictly convex if there exists 1 g (q) a constant C(f) such that (25a) is satisfied. Moreover, from − ≤ 00 , q ∈ (0, 1). 
m m q (62) we have that p = q is a stationary point, so provided g (q) (25a) is statisfied, this stationary point is a minimum. Finally, Hence, we have since R(pm, pm) = 0, it follows that R(pm, qm) ≥ 0 for all Z 1 Z 1 000 pm, qm ∈ [0, 1]m. 1 g (q) − dq ≤ 00 dq, q q q g (q) APPENDIX E whence 00 PROOFOF COROLLARY 4 g (1) g00(q) ≤ . (65) First, let q

m m m m 8 V (p ) , C(Q) HKL(p ) − Q, p ∈ [0, 1] , We use I to denote the identity matrix. 14

Next, following an approach similar to that in the proof of Theorem 1, we define

  R(p^m, q^m) ≜ Σ_{i=1}^m r(p_i, q_i)    (66a)

with

  r(p, q) ≜ C d_KL(p ‖ q) − d_g(p ‖ q),    (66b)

and show that when C is chosen as prescribed, (66a) is nonnegative for p^m, q^m ∈ [0, 1]^m. Note that it is sufficient to show that for such C, (66b) is nonnegative for every p, q ∈ [0, 1].

Accordingly, we fix p and analyze r(p, q) with respect to q. Via (64) (including its specialization to (30)) we obtain

  ∂r(p, q)/∂q = (q − p) [ C/q − g″(q) ]    (67a)
  ∂²r(p, q)/∂q² = C p/q² − g″(q) − (q − p) g‴(q).    (67b)

First, consider the case p ∈ (0, 1). Since r(p, p) = 0, if r(p, q) ≥ 0 then a global minimum of r(p, ·) must occur at p. Proceeding, from (67a), we see that the unique stationary point is q = p. Moreover, this stationary point is a minimum when

  ∂²r(p, q)/∂q² |_{q=p} = C/p − g″(p)

is positive, from which we obtain the requirement

  C/q − g″(q) > 0,  for all q ∈ (0, 1).    (68)

Choosing C > g″(1) we obtain

  C/q − g″(q) > g″(1)/q − g″(q) ≥ 0,

where the last inequality follows from (65). Hence, r(p, q) ≥ 0 for p, q ∈ (0, 1).

Next, consider the case p ∈ {0, 1}. Again, with the choice C > g″(1), (68) holds for all q, and thus (67a) is positive for q ∈ (0, 1) when p = 0, so r(0, ·) is an increasing function. Since r(0, 0) = 0, then, we conclude r(0, q) ≥ 0. Likewise, (67a) is negative for q ∈ (0, 1) when p = 1, so r(1, ·) is a decreasing function. Since r(1, 1) = 0, then, we conclude r(1, q) ≥ 0. Hence, r(p, q) ≥ 0 for p ∈ {0, 1} and q ∈ (0, 1).

It remains only to consider the case q ∈ {0, 1}, for any p ∈ [0, 1]. When p = q, we have r(p, q) = r(q, q) = 0. When p ≠ q = 0, (31) is unbounded so r(p, 0) ≥ 0. For the case q = 1, straightforward calculation yields

  ∂r(p, 1)/∂p = α(p) − α(1),  with  α(p) ≜ C log p − g′(p).    (69)

But

  α′(p) = C/p − g″(p),

which matches the left-hand side of (68), and thus is positive for all p ∈ (0, 1) when C > g″(1), in which case α(·) is an increasing function. As a result, (69) is negative for p ∈ (0, 1), and thus r(·, 1) is a decreasing function. Since, in addition, r(1, 1) = 0, we conclude r(p, 1) ≥ 0. Hence, r(p, q) ≥ 0 for p ∈ [0, 1] and q ∈ {0, 1}.

APPENDIX G
PROOF OF THEOREM 8

The following lemma will be useful.

Lemma 9: If l(y, q) is a loss function that satisfies Definition 1, with corresponding generalized entropy function G, then

  G(p) − C G_log(p) ≤ 0

when

  C > −(1/2) G″(1/2),    (70)

where G_log(p) is the Shannon entropy as defined in (8).

Proof: With

  R(p) = G(p) − C G_log(p)

we have

  R′(p) = G′(p) − C log((1 − p)/p)    (71a)
  R″(p) = G″(p) + C/(p(1 − p)) = C/(p(1 − p)) − w(p),    (71b)

where to obtain the second equality in (71b) we have used (14). In turn, using (59) from the proof of Theorem 1, we likewise conclude that choosing C according to (70) ensures

  C/(p(1 − p)) − w(p) > 0,

in which case R(p) is strictly convex. In addition, we have

  G(p) = L(p, p) = (1 − p) l₀(p) + p l₁(p),

where the first and second equalities follow from (10) and (2), respectively, and thus using (7) we have G(p) = 0 for p ∈ {0, 1}. Since G_log(p) = 0 for p ∈ {0, 1} as a special case, it follows that R(p) = 0 for p ∈ {0, 1}. Hence, R(p) ≤ 0.

Proceeding to the proof of Theorem 8, from (12) with (9) we obtain D_{−G}(p ‖ q) = L(p, q) − G(p), which when used in conjunction with (17a) of Theorem 1 yields

  L(p, q) ≤ C L_log(p, q) + G(p) − C G_log(p)    (72)
          ≤ C L_log(p, q),    (73)

where in (72)

  L_log(p, q) ≜ E_p[l_log(Y, q)],

and where to obtain (73) we have used Lemma 9.

Next, we have

  L_{p_Q} = E_{p_Q}[ E_{p_{X,Y}}[l(Y, Q(X))] ]
         = E_{p_X} E_{p_Q}[ E_{p_{Y|X}(·|X)}[l(Y, Q(X))] ]    (74)
         ≤ C E_{p_X} E_{p_Q}[ E_{p_{Y|X}(·|X)}[l_log(Y, Q(X))] ]    (75)
         = C L^log_{p_Q},    (76)

where to obtain (75) we have used an instance of (73) to bound the inner expectation in (74).

Moreover, since p_{Y|X}(y|x), q(x) ∈ [∆, 1 − ∆] we have

  l_log(Y, Q(X)) ∈ [0, − log ∆]    (77)

with probability one.

Finally, using (76) followed by Theorem 7 specialized to the log-loss, together with (77), we obtain that with probability 1 − δ,

  L_{p_Q} ≤ C L^log_{p_Q} ≤ (2λC/(2λ − 1)) [ L̂^log_{p_Q} + (λ L_max/n) ( D_KL(p_Q ‖ p⁰_Q) + log(1/δ) ) ],

for any λ > 1/2 and L_max = − log ∆.

REFERENCES

[1] A. Painsky and G. Wornell, "On the universality of the logistic loss function," in Proc. Int. Symp. Inform. Theory (ISIT), Vail, Colorado, June 2018, pp. 936–940.
[2] A. Buja, W. Stuetzle, and Y. Shen, "Loss functions for binary class probability estimation and classification: Structure and applications," Statistics Dept., Wharton School, University of Pennsylvania, Philadelphia, PA, Tech. Rep., Nov. 2005. [Online]. Available: https://faculty.wharton.upenn.edu/wp-content/uploads/2012/04/Paper-proper-scoring.pdf
[3] R. L. Winkler, J. Munoz, J. L. Cervera, J. M. Bernardo, G. Blattenberger, J. B. Kadane, D. V. Lindley, A. H. Murphy, R. M. Oliver, and D. Ríos-Insua, "Scoring rules and the evaluation of probabilities," Test, vol. 5, no. 1, pp. 1–60, June 1996.
[4] T. Gneiting and A. E. Raftery, "Strictly proper scoring rules, prediction, and estimation," J. Am. Stat. Assoc., vol. 102, no. 477, pp. 359–378, Mar. 2007.
[5] E. C. Merkle and M. Steyvers, "Choosing a strictly proper scoring rule," Decision Analysis, vol. 10, no. 4, pp. 292–304, Dec. 2013.
[6] A. P. Dawid and M. Musio, "Theory and applications of proper scoring rules," METRON, vol. 72, no. 2, pp. 169–183, Aug. 2014.
[7] I. Sason, "On f-divergences: Integral representations, local behavior, and inequalities," Entropy, vol. 20, no. 5, 383, 2018.
[8] I. Sason and S. Verdú, "f-divergence inequalities," IEEE Trans. Inform. Theory, vol. 62, no. 11, pp. 5973–6006, Nov. 2016.
[9] P. Harremoës and I. Vajda, "On pairs of f-divergences and their joint range," IEEE Trans. Inform. Theory, vol. 57, no. 6, pp. 3230–3235, Jun. 2011.
[10] M. D. Reid and R. C. Williamson, "Information, divergence and risk for binary experiments," J. Mach. Learn. Res., vol. 12, pp. 731–817, 2011.
[11] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York, NY: John Wiley & Sons, 2006.
[12] N. Merhav and M. Feder, "Universal prediction," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2124–2147, June 1998.
[13] A. No and T. Weissman, "Universality of logarithmic loss in lossy compression," in Proc. Int. Symp. Inform. Theory (ISIT), Hong Kong, China, June 2015, pp. 2166–2170.
[14] J. Jiao, T. A. Courtade, K. Venkat, and T. Weissman, "Justification of logarithmic loss via the benefit of side information," IEEE Trans. Inform. Theory, vol. 61, no. 10, pp. 5357–5365, 2015.
[15] J. E. Bickel, "Some comparisons among quadratic, spherical, and logarithmic scoring rules," Decision Analysis, vol. 4, no. 2, pp. 49–65, June 2007.
[16] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. New York, NY: Wiley, 2000.
[17] M. Parry, A. P. Dawid, and S. Lauritzen, "Proper local scoring rules," Ann. Stat., vol. 40, no. 1, pp. 561–592, 2012.
[18] I. Csiszár and P. C. Shields, "Information theory and statistics: A tutorial," Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.
[19] I. Csiszár, "Generalized projections for non-negative functions," Acta Math. Hungar., vol. 68, no. 1–2, pp. 161–185, 1995.
[20] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705–1749, Oct. 2005.
[21] M. Zakai and J. Ziv, "A generalization of the rate-distortion theory and applications," in Information Theory New Trends and Open Problems, G.
[23] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299–318, 1967.
[24] W. Stummer and I. Vajda, "On divergences of finite measures and their applicability in statistics and information theory," Statistics, vol. 44, no. 2, pp. 169–187, 2010.
[25] T. Zhang, "Statistical behavior and consistency of classification methods based on convex risk minimization," Ann. Stat., vol. 32, no. 1, pp. 56–134, 2004.
[26] M. D. Reid and R. C. Williamson, "Composite binary losses," J. Mach. Learn. Res., vol. 11, pp. 2387–2422, 2010.
[27] N. Merhav and M. Feder, "Universal schemes for sequential decision from individual data sequences," IEEE Trans. Inform. Theory, vol. 39, no. 4, pp. 1280–1292, Apr. 1993.
[28] L. J. Savage, "Elicitation of personal probabilities and expectations," J. Am. Stat. Assoc., vol. 66, no. 336, pp. 783–801, Dec. 1971.
[29] M. Broniatowski and W. Stummer, "Some universal insights on divergences for statistics, machine learning and artificial intelligence," in Geometric Structures of Information, F. Nielsen, Ed. Cham, Switzerland: Springer International Publishing, 2019, pp. 149–211.
[30] H. H. Bauschke and J. M. Borwein, "Joint and separate convexity of the Bregman distance," in Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, D. Butnariu, Y. Censor, and S. Reich, Eds. Elsevier, 2001, vol. 8, pp. 23–36.
[31] C. L. Byrne, Iterative Optimization in Inverse Problems. Boca Raton, FL: CRC Press, 2014.
[32] J. Jiao, T. A. Courtade, A. No, K. Venkat, and T. Weissman, "Information measures: the curious case of the binary alphabet," IEEE Trans. Inform. Theory, vol. 60, no. 12, pp. 7616–7626, Dec. 2014.
[33] Australian Bureau of Meteorology, "Australian data archive for meteorology," http://www.bom.gov.au/climate/data/, retrieved 2017.
[34] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, 2nd ed. New York, NY: Springer-Verlag, 2009.
[35] R. Linsker, "Self-organization in a perceptual network," Computer, vol. 21, no. 3, pp. 105–117, Mar. 1988.
[36] A. Painsky, S. Rosset, and M. Feder, "Generalized independent component analysis over finite alphabets," IEEE Trans. Inform. Theory, vol. 62, no. 2, pp. 1038–1053, Feb. 2016.
[37] ——, "Linear independent component analysis over finite fields: Algorithms and bounds," IEEE Trans. Signal Processing, vol. 66, no. 22, Nov. 2018.
[38] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[39] J. Keith and D. P. Kroese, "Sequence alignment by rare event simulation," in Proc. Winter Simulation Conf. (WSC), vol. 1, 2002, pp. 320–327.
[40] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. Berkeley Symp. Math. Stat., Prob., vol. 1, 1967, pp. 281–297.
[41] Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, Jan. 1980.
[42] A. Buzo, A. Gray, R. Gray, and J. Markel, "Speech coding based upon vector quantization," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 5, pp. 562–574, 1980.
[43] F. Itakura and S. Saito, "Analysis synthesis telephony based on the maximum likelihood method," in Proc. Int. Congress Acoustics, Aug. 1968, pp. C17–C20.
[44] I. S. Dhillon, S. Mallela, and R. Kumar, "A divisive information-theoretic feature clustering algorithm for text classification," J. Mach. Learn. Res., vol. 3, pp. 1265–1287, 2003.
[45] M. Liu, Total Bregman Divergence, a Robust Divergence Measure, and Its Applications. University of Florida, 2011.
[46] M. Liu, B. C. Vemuri, S.-I. Amari, and F. Nielsen, "Total Bregman divergence and its applications to shape retrieval," in Proc. Conf. Comp. Vision, Patt. Recog. (CVPR), San Francisco, CA, Jun. 2010, pp. 3463–3468.
[47] B. C. Vemuri, M. Liu, S.-I. Amari, and F. Nielsen, "Total Bregman divergence and its applications to DTI analysis," IEEE Trans. Med. Imag., vol. 30, no. 2, pp. 475–483, Feb. 2010.
[48] M. Liu, B. C. Vemuri, S.-I. Amari, and F. Nielsen, "Shape retrieval using hierarchical total Bregman soft clustering," IEEE Trans. Pattern Anal.,
Longo, Ed. Vienna, Austria: Springer, 1975, pp. 87–123. Mach. Intell., vol. 34, no. 12, pp. 2407–2419, Dec. 2012. [22] P. Harremoes¨ and N. Tishby, “The information bottleneck revisited or [49] R. Nock, F. Nielsen, and S.-i. Amari, “On conformal divergences and how to choose a good distortion measure,” in Proc. Int. Symp. Inform. their population minimizers,” IEEE Trans. Inform. Theory, vol. 62, no. 1, Theory (ISIT). Nice, France: IEEE, June 2007, pp. 566–570. pp. 527–538, Jan. 2015. 16

[50] F. Pereira, N. Tishby, and L. Lee, "Distributional clustering of English words," in Proc. Meet. Assoc. Comput. Ling. Columbus, OH: Association for Computational Linguistics, June 1993, pp. 183–190.
[51] L. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in Proc. Int. Conf. Res., Dev. Inform. Retrieval (SIGIR), Melbourne, Australia, Aug. 1998, pp. 96–103.
[52] A. Clark, "Unsupervised induction of stochastic context-free grammars using distributional clustering," in Proc. Workshop Comput. Nat. Lang. Learn. (ConLL), vol. 7, Toulouse, France, July 2001.
[53] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "On feature distributional clustering for text categorization," in Proc. Int. Conf. Res., Dev. Inform. Retrieval (SIGIR), New Orleans, LA, 2001, pp. 146–153.
[54] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proc. Allerton Conf. Commun., Contr., Computing, Monticello, IL, Sep. 1999, pp. 368–377.
[55] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," in Proc. Int. Conf. Res., Dev. Inform. Retrieval (SIGIR), Athens, Greece, July 2000, pp. 208–215.
[56] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby, "Multivariate information bottleneck," in Proc. Conf. Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, Aug. 2001, pp. 152–161.
[57] J. Sinkkonen and S. Kaski, "Clustering based on conditional distributions in an auxiliary space," Neural Comput., vol. 14, no. 1, pp. 217–239, 2002.
[58] N. Slonim, G. S. Atwal, G. Tkačik, and W. Bialek, "Information-based clustering," Proc. Nat. Acad. Sci. (PNAS), vol. 102, no. 51, pp. 18297–18302, 2005.
[59] R. M. Hecht, E. Noor, and N. Tishby, "Speaker recognition by Gaussian information bottleneck," in Proc. Interspeech, Brighton, UK, Sep. 2009, pp. 1567–1570.
[60] E. Schneidman, N. Slonim, N. Tishby, R. de Ruyter van Steveninck, and W. Bialek, "Analyzing neural codes using the information bottleneck method," Unpublished manuscript, 2001. [Online]. Available: ftp://ftp.cis.upenn.edu/pub/cse140/public_html/2002/schneidman.pdf
[61] N. Tishby and D. Polani, "Information theory of decisions and actions," in Perception-Action Cycle: Models, Architectures, and Hardware, V. Cutsuridis, A. Hussain, and J. G. Taylor, Eds. New York, NY: Springer, 2011, pp. 601–636.
[62] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in Proc. Inform. Theory Workshop (ITW), Apr. 2015.
[63] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," CoRR, vol. abs/1703.00810, 2017. [Online]. Available: http://arxiv.org/abs/1703.00810
[64] Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, "Estimating information flow in deep neural networks," in Proc. Int. Conf. Mach. Learn. (ICML), vol. 97, Long Beach, CA, June 2019, pp. 2299–2308.
[65] Y. Y. Shkel and S. Verdú, "A single-shot approach to lossy source coding under logarithmic loss," IEEE Trans. Inform. Theory, vol. 64, no. 1, pp. 129–147, Jan. 2017.
[66] D. A. McAllester, "PAC-Bayesian model averaging," in Proc. Conf. Comput. Learn. Theory (COLT), Santa Cruz, CA, July 1999, pp. 164–170.
[67] J. Langford, "Tutorial on practical prediction theory for classification," J. Mach. Learn. Res., vol. 6, no. Mar., pp. 273–306, 2005.
[68] D. A. McAllester, "A PAC-Bayesian tutorial with a dropout bound," CoRR, vol. abs/1307.2118, 2013. [Online]. Available: http://arxiv.org/abs/1307.2118
[69] O. Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, ser. Lecture Notes–Monograph. Beachwood, OH: Institute of Mathematical Statistics, 2007, vol. 56.
[70] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand, "PAC-Bayesian learning of linear classifiers," in Proc. Int. Conf. Mach. Learn. (ICML), Montréal, Canada, 2009, pp. 353–360.
[71] A. Maurer, "A note on the PAC Bayesian theorem," CoRR, vol. cs.LG/0411099, 2004. [Online]. Available: http://arxiv.org/abs/cs.LG/0411099
[72] M. Seeger, "PAC-Bayesian generalisation error bounds for Gaussian process classification," J. Mach. Learn. Res., vol. 3, no. Oct., pp. 233–269, 2002.
[73] D. Haussler, "Decision theoretic generalizations of the PAC model for neural net and other learning applications," Inf. Comput., vol. 100, no. 1, pp. 78–150, Sep. 1992.
[74] S. Bharadwaj and M. Hasegawa-Johnson, "A PAC-Bayesian approach to minimum perplexity language modeling," in Proc. Int. Conf. Comput. Ling. (COLING), Dublin, Ireland, Aug. 2014, pp. 130–140.
[75] N. Abe, J.-i. Takeuchi, and M. K. Warmuth, "Polynomial learnability of stochastic rules with respect to the KL-divergence and quadratic distance," IEICE Trans. Inform., Syst., vol. 84, no. 3, pp. 299–316, Mar. 2001.
[76] R. Shwartz-Ziv, A. Painsky, and N. Tishby, "Representation compression and generalization in deep neural networks," Unpublished manuscript, 2018. [Online]. Available: https://openreview.net/pdf?id=SkeL6sCqK7
[77] A. Painsky and S. Rosset, "Cross-validated variable selection in tree-based methods improves predictive performance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2142–2153, Nov. 2017.
[78] E. H. Shuford, A. Albert, and H. E. Massengill, "Admissible probability measurement procedures," Psychometrika, vol. 31, no. 2, pp. 125–145, June 1966.
[79] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge University Press, 2012.

Amichai Painsky (S'12–M'18) received his B.Sc. in Electrical Engineering from Tel Aviv University (2007), his M.Eng. degree in Electrical Engineering from Princeton University (2009), and his Ph.D. in Statistics from the School of Mathematical Sciences at Tel Aviv University. He was a Post-Doctoral Fellow, co-affiliated with the Israeli Center of Research Excellence in Algorithms (I-CORE) at the Hebrew University of Jerusalem and the Signals, Information and Algorithms (SIA) Lab at MIT (2016–2018). Since 2019, he has been a faculty member in the Industrial Engineering Department at Tel Aviv University, where he leads the Statistics and Data Science Laboratory. His research interests include Data Mining, Machine Learning, Statistical Learning and Inference, and their connection to Information Theory.

Gregory W. Wornell (S'83–M'91–SM'00–F'04) received the B.A.Sc. degree from the University of British Columbia, Canada, and the S.M. and Ph.D. degrees from the Massachusetts Institute of Technology, all in electrical engineering and computer science, in 1985, 1987, and 1991, respectively. Since 1991 he has been on the faculty at MIT, where he is the Sumitomo Professor of Engineering in the Department of Electrical Engineering and Computer Science. At MIT he leads the Signals, Information, and Algorithms Laboratory within the Research Laboratory of Electronics. He is also chair of Graduate Area I (information and system science, electronic and photonic systems, physical science and nanotechnology, and bioelectrical science and engineering) within the EECS department's doctoral program. He has held visiting appointments at the former AT&T Bell Laboratories, Murray Hill, NJ; the University of California, Berkeley, CA; and Hewlett-Packard Laboratories, Palo Alto, CA. His research interests and publications span the areas of information theory, statistical inference, signal processing, digital communication, and information security, and include architectures for sensing, learning, computing, communication, and storage; systems for computational imaging, vision, and perception; aspects of computational biology and neuroscience; and the design of wireless networks. He has been involved in the Information Theory and Signal Processing societies of the IEEE in a variety of capacities and maintains a number of close industrial relationships and activities. He has won a number of awards for both his research and teaching, including the 2019 IEEE Leon K. Kirchmayer Graduate Teaching Award.