A New Family of Bounded Divergence Measures and Application to Signal Detection

Shivakumar Jolad (1), Ahmed Roman (2), Mahesh C. Shastry (3), Mihir Gadgil (4) and Ayanendranath Basu (5)
(1) Department of Physics, Indian Institute of Technology Gandhinagar, Ahmedabad, Gujarat, India
(2) Department of Mathematics, Virginia Tech, Blacksburg, VA, U.S.A.
(3) Department of Physics, Indian Institute of Science Education and Research Bhopal, Bhopal, Madhya Pradesh, India
(4) Biomedical Engineering Department, Oregon Health & Science University, Portland, OR, U.S.A.
(5) Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, West Bengal-700108, India

Keywords: Divergence Measures, Bhattacharyya Distance, Error Probability, F-divergence, Pattern Recognition, Signal Detection, Signal Classification.

Abstract: We introduce a new one-parameter family of divergence measures, called bounded Bhattacharyya distance (BBD) measures, for quantifying the dissimilarity between probability distributions. These measures are bounded, symmetric and positive semi-definite, and do not require absolute continuity. In the asymptotic limit, the BBD measure approaches the squared Hellinger distance. A generalized BBD measure for multiple distributions is also introduced. We prove an extension of a theorem of Bradt and Karlin for BBD relating Bayes error probability and divergence ranking. We show that BBD belongs to the class of generalized Csiszar f-divergences and derive some properties such as its curvature and relation to Fisher Information. For distributions with vector-valued parameters, the curvature matrix is related to the Fisher-Rao metric. We derive certain inequalities between BBD and well-known measures such as the Hellinger and Jensen-Shannon divergences, as well as bounds on the Bayes error probability. We give an application of these measures to the problem of signal detection, where we compare two monochromatic signals buried in white noise and differing in frequency and amplitude.

1 INTRODUCTION

Divergence measures for the distance between two probability distributions are a statistical approach to comparing data and have been extensively studied in the last six decades (Kullback and Leibler, 1951; Ali and Silvey, 1966; Kapur, 1984; Kullback, 1968; Kumar et al., 1986). These measures are widely used in varied fields such as pattern recognition (Basseville, 1989; Ben-Bassat, 1978; Choi and Lee, 2003), speech recognition (Qiao and Minematsu, 2010; Lee, 1991), signal detection (Kailath, 1967; Kadota and Shepp, 1967; Poor, 1994), Bayesian model validation (Tumer and Ghosh, 1996) and quantum information theory (Nielsen and Chuang, 2000; Lamberti et al., 2008). Distance measures try to achieve two main objectives (which are not mutually exclusive): to assess (1) how "close" two distributions are compared to others and (2) how "easy" it is to distinguish between one pair than another (Ali and Silvey, 1966).

There is a plethora of distance measures available to assess the convergence (or divergence) of probability distributions. Many of these measures are not metrics in the strict mathematical sense, as they may not satisfy either the symmetry of arguments or the triangle inequality. In applications, the choice of the measure depends on the interpretation of the metric in terms of the problem considered, its analytical properties and ease of computation (Gibbs and Su, 2002). One of the most well-known and widely used divergence measures, the Kullback-Leibler divergence (KLD) (Kullback and Leibler, 1951; Kullback, 1968), can create problems in specific applications: it is unbounded above and requires that the distributions be absolutely continuous with respect to each other. Various other information-theoretic measures have been introduced keeping in view ease of computation and utility in problems of signal selection and pattern recognition.




Of these measures, the Bhattacharyya distance (Bhattacharyya, 1946; Kailath, 1967; Nielsen and Boltz, 2011) and the Chernoff distance (Chernoff, 1952; Basseville, 1989; Nielsen and Boltz, 2011) have been widely used in signal processing. However, these measures are again unbounded from above. Many bounded divergence measures, such as the variational distance, the Hellinger distance (Basseville, 1989; DasGupta, 2011) and the Jensen-Shannon metric (Burbea and Rao, 1982; Rao, 1982b; Lin, 1991), have been studied extensively. The utility of these measures varies depending on properties such as the tightness of bounds on error probabilities, information-theoretic interpretations, and the ability to generalize to multiple probability distributions.

Here we introduce a new one-parameter (α) family of bounded measures based on the Bhattacharyya coefficient, called bounded Bhattacharyya distance (BBD) measures. These measures are symmetric, positive semi-definite and bounded between 0 and 1. In the asymptotic limit (α → ±∞) they approach the squared Hellinger divergence (Hellinger, 1909; Kakutani, 1948). Following Rao (Rao, 1982b) and Lin (Lin, 1991), a generalized BBD is introduced to capture the divergence (or convergence) between multiple distributions. We show that BBD measures belong to the generalized class of f-divergences and inherit useful properties such as curvature and its relation to Fisher Information. Bayesian inference is useful in problems where a decision has to be made on classifying an observation into one of a possible array of states whose prior probabilities are known (Hellman and Raviv, 1970; Varshney and Varshney, 2008). Divergence measures are useful in estimating the error in such classification (Ben-Bassat, 1978; Kailath, 1967; Varshney, 2011). We prove an extension of the Bradt-Karlin theorem for BBD, which establishes the existence of prior probabilities relating Bayes error probabilities with ranking based on the divergence measure. Bounds on the error probability Pe can be calculated through BBD measures using certain inequalities between the Bhattacharyya coefficient and Pe. We derive two inequalities for a special case of BBD (α = 2) with the Hellinger and Jensen-Shannon divergences. Our bounded measure with α = 2 has been used by Sunmola (Sunmola, 2013) to calculate the distance between Dirichlet distributions in the context of Markov decision processes. We illustrate the applicability of BBD measures by focusing on the signal detection problem that comes up in areas such as gravitational wave detection (Finn, 1992). Here we consider discriminating two monochromatic signals, differing in frequency or amplitude and corrupted with additive white noise. We compare the Fisher Information of the BBD measures with that of KLD and Hellinger distance for these random processes, and highlight the regions where the Fisher Information is insensitive to large parameter deviations. We also characterize the performance of BBD for different signal-to-noise ratios, providing thresholds for signal separation.

Our paper is organized as follows. Section 1 is the current introduction. In Section 2, we recall the well-known Kullback-Leibler and Bhattacharyya divergence measures, and then introduce our bounded Bhattacharyya distance measures. We discuss some special cases of BBD, in particular the Hellinger distance, and introduce the generalized BBD for multiple distributions. In Section 3, we show the positive semi-definiteness of the BBD measure and the applicability of the Bradt-Karlin theorem, and prove that BBD belongs to the generalized f-divergence class. We also derive the relation between curvature and Fisher Information, discuss the curvature metric, and prove some inequalities with other measures such as the Hellinger and Jensen-Shannon divergences for a special case of BBD. In Section 4, we discuss the application to the signal detection problem: we first briefly describe the basic formulation of the problem, and then compute distances between random processes and compare the BBD measure with Fisher Information and KLD. In the Appendix we provide expressions for the BBD measure, with α = 2, for some commonly used distributions. We conclude the paper with a summary and outlook.

2 DIVERGENCE MEASURES

In the following subsections we consider a measurable space Ω with σ-algebra B and the set of all probability measures M on (Ω, B). Let P and Q denote probability measures on (Ω, B), with p and q denoting their densities with respect to a common measure λ. We recall the definition of absolute continuity (Royden, 1986):

Absolute Continuity: A measure P on the Borel subsets of the real line is absolutely continuous with respect to a measure Q if P(A) = 0 for every Borel subset A ∈ B for which Q(A) = 0; this is denoted by P << Q.

2.1 Kullback-Leibler Divergence

The Kullback-Leibler divergence (KLD), or relative entropy (Kullback and Leibler, 1951; Kullback, 1968), between two distributions P, Q with densities p and q is given by

    I(P,Q) \equiv \int p \log\left(\frac{p}{q}\right) d\lambda.    (1)

The symmetrized version is given by J(P,Q) \equiv (I(P,Q) + I(Q,P))/2 (Kailath, 1967). Note that I(P,Q) ∈ [0, ∞]; it diverges if there exists x_0 such that q(x_0) = 0 and p(x_0) ≠ 0. KLD is defined only when P is absolutely continuous w.r.t. Q. This feature can be problematic in numerical computations when the measured distribution has zero values.
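As a quick illustration of Eq. (1) and its symmetrized form (our own sketch, not part of the paper; function names are ours), the following Python snippet computes I(P,Q) and J(P,Q) for discrete distributions on a common support:

    import numpy as np

    def kld(p, q):
        """I(P,Q) = sum p log(p/q) for discrete distributions.
        Terms with p = 0 contribute 0; the result is infinite if q = 0
        somewhere p > 0 (P not absolutely continuous w.r.t. Q)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        if np.any(q[mask] == 0):
            return np.inf
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def symmetrized_kld(p, q):
        """J(P,Q) = (I(P,Q) + I(Q,P)) / 2."""
        return 0.5 * (kld(p, q) + kld(q, p))

    if __name__ == "__main__":
        p = np.array([0.5, 0.3, 0.2])
        q = np.array([0.4, 0.4, 0.2])
        print(kld(p, q), symmetrized_kld(p, q))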


2.2 Bhattacharyya Distance

The Bhattacharyya distance is a widely used measure in signal selection and pattern recognition (Kailath, 1967). It is defined as

    B(P,Q) \equiv -\ln\left( \int \sqrt{pq}\, d\lambda \right) = -\ln(\rho),    (2)

where the term in parentheses, ρ(P,Q) ≡ ∫ √(pq) dλ, is called the Bhattacharyya coefficient (Bhattacharyya, 1943; Bhattacharyya, 1946) in pattern recognition, affinity in theoretical statistics, and fidelity in quantum information theory. Unlike the KLD, the Bhattacharyya distance avoids the requirement of absolute continuity. It is a special case of the Chernoff distance

    C_\alpha(P,Q) \equiv -\ln\left( \int p^{\alpha}(x)\, q^{1-\alpha}(x)\, dx \right),

with α = 1/2. For discrete probability distributions, ρ ∈ [0,1] is interpreted as a scalar product of the probability vectors P = (√p_1, √p_2, ..., √p_n) and Q = (√q_1, √q_2, ..., √q_n). The Bhattacharyya distance is symmetric, positive semi-definite and unbounded (0 ≤ B ≤ ∞). It is finite as long as there exists some region S ⊂ X such that p(x)q(x) ≠ 0 whenever x ∈ S.

2.3 Bounded Bhattacharyya Distance Measures

In many applications, boundedness is required in addition to the desirable properties of the Bhattacharyya distance. We propose a new family of bounded measures of Bhattacharyya distance as follows:

    B_{\psi,b}(P,Q) \equiv -\log_b(\psi(\rho)),    (3)

where ρ = ρ(P,Q) is the Bhattacharyya coefficient and ψ_b(ρ) satisfies ψ(0) = b^{-1}, ψ(1) = 1. In particular we choose the following form:

    \psi(\rho) = \left[ 1 - \frac{1-\rho}{\alpha} \right]^{\alpha}, \qquad b = \left( \frac{\alpha}{\alpha-1} \right)^{\alpha},    (4)

where α ∈ [-∞, 0) ∪ (1, ∞]. This gives the measure

    B_\alpha(\rho(P,Q)) \equiv -\log_{(1-1/\alpha)^{-\alpha}} \left[ 1 - \frac{1-\rho}{\alpha} \right]^{\alpha},    (5)

which can be simplified to

    B_\alpha(\rho) = \frac{\log\left[ 1 - \frac{1-\rho}{\alpha} \right]}{\log\left( 1 - \frac{1}{\alpha} \right)}.    (6)

It is easy to see that B_α(0) = 1 and B_α(1) = 0.

2.4 Special Cases

1. For α = 2 we get

    B_2(\rho) = -\log_2\left( \frac{1+\rho}{2} \right).    (7)

We study some of its special properties in Sec. 3.7.

2. α → ∞:

    B_\infty(\rho) = -\log_e e^{-(1-\rho)} = 1 - \rho = H^2(\rho),    (8)

where H(ρ) is the Hellinger distance (Basseville, 1989; Kailath, 1967; Hellinger, 1909; Kakutani, 1948),

    H(\rho) \equiv \sqrt{1 - \rho(P,Q)}.    (9)

3. α = -1:

    B_{-1}(\rho) = -\log_2\left( \frac{1}{2-\rho} \right).    (10)

4. α → -∞:

    B_{-\infty}(\rho) = -\log_e e^{-(1-\rho)} = 1 - \rho = H^2(\rho).    (11)

We note that BBD measures approach the squared Hellinger distance when α → ±∞. In general, they are convex (concave) in ρ for α > 1 (α < 0), as seen by evaluating the second derivative

    \frac{\partial^2 B_\alpha(\rho)}{\partial\rho^2} = \frac{-1}{\alpha^2 \log\left(1-\frac{1}{\alpha}\right)\left[1 - \frac{1-\rho}{\alpha}\right]^2}
    \quad \begin{cases} > 0 & \alpha > 1 \\ < 0 & \alpha < 0 \end{cases}    (12)

From this we deduce B_{α>1}(ρ) ≤ H²(ρ) ≤ B_{α<0}(ρ) for ρ ∈ [0,1]. A comparison between the Hellinger and BBD measures for different values of α is shown in Fig. 1.
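To make Eq. (6) and the special cases above concrete, here is a small Python sketch (ours, not from the paper) that computes the Bhattacharyya coefficient of two discrete distributions and evaluates B_α for a few values of α, checking the Hellinger limit numerically:

    import numpy as np

    def bhattacharyya_coefficient(p, q):
        """rho(P,Q) = sum_i sqrt(p_i q_i) for discrete distributions."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(np.sqrt(p * q)))

    def bbd(rho, alpha):
        """Bounded Bhattacharyya distance B_alpha(rho), Eq. (6),
        valid for alpha in [-inf, 0) U (1, inf]."""
        return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

    if __name__ == "__main__":
        p = np.array([0.5, 0.3, 0.2])
        q = np.array([0.2, 0.5, 0.3])
        rho = bhattacharyya_coefficient(p, q)
        print("rho          =", rho)
        print("B_2          =", bbd(rho, 2.0))      # equals -log2((1+rho)/2)
        print("B_{-1}       =", bbd(rho, -1.0))
        print("B_{alpha>>1} =", bbd(rho, 1e6))      # approaches Hellinger^2
        print("H^2          =", 1.0 - rho)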


Figure 1: [Color Online] Comparison of the Hellinger distance H²(ρ) and the bounded Bhattacharyya distance measures B_α(ρ) for different values of α (α = -10, -1, 1.01, 2, 10).

2.5 Generalized BBD Measure

In decision problems involving more than two random variables, it is very useful to have divergence measures involving more than two distributions (Lin, 1991; Rao, 1982a; Rao, 1982b). We use the generalized geometric mean (G) concept to define a bounded Bhattacharyya measure for more than two distributions. The generalized geometric mean G_β({p_i}) of n variables p_1, p_2, ..., p_n with weights β_1, β_2, ..., β_n, such that β_i ≥ 0 and Σ_i β_i = 1, is given by

    G_\beta(\{p_i\}) = \prod_{i=1}^{n} p_i^{\beta_i}.

For n probability distributions P_1, P_2, ..., P_n, with densities p_1, p_2, ..., p_n, we define a generalized Bhattacharyya coefficient, also called the Matusita measure of affinity (Matusita, 1967; Toussaint, 1974):

    \rho_\beta(P_1, P_2, ..., P_n) = \int_\Omega \prod_{i=1}^{n} p_i^{\beta_i}\, d\lambda,    (13)

where β_i ≥ 0 and Σ_i β_i = 1. Based on this, we define the generalized bounded Bhattacharyya measures as

    B_\alpha^\beta(\rho_\beta(P_1, P_2, ..., P_n)) \equiv \frac{\log\left(1 - \frac{1-\rho_\beta}{\alpha}\right)}{\log\left(1 - \frac{1}{\alpha}\right)},    (14)

where α ∈ [-∞, 0) ∪ (1, ∞]. For brevity we denote it as B_α^β(ρ). Note that 0 ≤ ρ_β ≤ 1 and 0 ≤ B_α^β(ρ) ≤ 1, since the weighted geometric mean is maximized when all the p_i's are the same, and minimized when any two of the probability densities p_i are perpendicular to each other.
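As an illustration of Eqs. (13) and (14) (our own sketch, not from the paper), the generalized coefficient and measure can be computed for several discrete distributions with given weights:

    import numpy as np

    def matusita_affinity(dists, weights):
        """Generalized Bhattacharyya (Matusita) coefficient, Eq. (13):
        rho_beta = sum_x prod_i p_i(x)^beta_i, for discrete distributions
        given as rows of `dists` on a common support."""
        dists = np.asarray(dists, float)
        weights = np.asarray(weights, float)
        assert np.isclose(weights.sum(), 1.0) and np.all(weights >= 0)
        return float(np.sum(np.prod(dists ** weights[:, None], axis=0)))

    def generalized_bbd(rho_beta, alpha):
        """Generalized bounded Bhattacharyya measure, Eq. (14)."""
        return np.log(1.0 - (1.0 - rho_beta) / alpha) / np.log(1.0 - 1.0 / alpha)

    if __name__ == "__main__":
        dists = [[0.5, 0.3, 0.2],
                 [0.2, 0.5, 0.3],
                 [0.3, 0.3, 0.4]]
        weights = [1/3, 1/3, 1/3]
        rho_b = matusita_affinity(dists, weights)
        print(rho_b, generalized_bbd(rho_b, 2.0))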

3 PROPERTIES

3.1 Symmetry, Boundedness and Positive Semi-definiteness

Theorem 3.1. B_α(ρ(P,Q)) is symmetric, positive semi-definite and bounded in the interval [0,1] for α ∈ [-∞, 0) ∪ (1, ∞].

Proof. Symmetry: Since ρ(P,Q) = ρ(Q,P), it follows that B_α(ρ(P,Q)) = B_α(ρ(Q,P)).

Positive semi-definiteness and boundedness: Since B_α(0) = 1, B_α(1) = 0 and

    \frac{\partial B_\alpha(\rho)}{\partial\rho} = \frac{1}{\alpha \log(1 - 1/\alpha)\left[1 - (1-\rho)/\alpha\right]} < 0

for 0 ≤ ρ ≤ 1 and α ∈ [-∞, 0) ∪ (1, ∞], it follows that

    0 \le B_\alpha(\rho) \le 1.    (15)

3.2 Error Probability and Divergence Ranking

Here we recap the definition of error probability and prove the applicability of the Bradt and Karlin (Bradt and Karlin, 1956) theorem to the BBD measure.

Error Probability: The optimal Bayes error probability (see e.g. (Ben-Bassat, 1978; Hellman and Raviv, 1970; Toussaint, 1978)) for classifying two events P_1, P_2 with densities p_1(x) and p_2(x) and prior probabilities Γ = {π_1, π_2} is given by

    P_e = \int \min[\pi_1 p_1(x), \pi_2 p_2(x)]\, dx.    (16)

Error Comparison: Let p_i^β(x) (i = 1, 2) be parameterized by β (e.g. β = {μ_1, σ_1; μ_2, σ_2}). In the signal detection literature, a signal set β is considered better than a set β_0 for the densities p_i(x) when the error probability is smaller for β than for β_0, i.e. P_e(β) < P_e(β_0) (Kailath, 1967).

Divergence Ranking: We can also rank the parameters by means of some divergence D: the signal set β is better (in the divergence sense) than β_0 if D_β(P_1,P_2) > D_{β_0}(P_1,P_2).

In general it is not true that D_β(P_1,P_2) > D_{β_0}(P_1,P_2) implies P_e(β) < P_e(β_0).
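As a numerical illustration of Eq. (16) (our own sketch, not part of the paper), the Bayes error for two univariate Gaussian densities with given priors can be approximated on a grid:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    def bayes_error(mu1, s1, mu2, s2, pi1=0.5, pi2=0.5, n=200001):
        """Numerical approximation of P_e = int min(pi1 p1, pi2 p2) dx, Eq. (16)."""
        lo = min(mu1 - 10 * s1, mu2 - 10 * s2)
        hi = max(mu1 + 10 * s1, mu2 + 10 * s2)
        x = np.linspace(lo, hi, n)
        integrand = np.minimum(pi1 * gaussian_pdf(x, mu1, s1),
                               pi2 * gaussian_pdf(x, mu2, s2))
        return float(np.trapz(integrand, x))

    if __name__ == "__main__":
        # Equal priors, unit-variance Gaussians one standard deviation apart.
        print(bayes_error(0.0, 1.0, 1.0, 1.0))   # ~ Phi(-1/2) ~ 0.3085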


Bradt and Karlin proved the following theorem relating error probabilities and divergence ranking for the symmetric Kullback-Leibler divergence J:

Theorem 3.2 (Bradt and Karlin (Bradt and Karlin, 1956)). If J_β(P_1,P_2) > J_{β_0}(P_1,P_2), then there exists a set of prior probabilities Γ = {π_1, π_2} for the two hypotheses g_1, g_2 for which

    P_e(\beta, \Gamma) < P_e(\beta_0, \Gamma),    (17)

where P_e(β,Γ) is the error probability with parameter β and prior probability Γ.

The theorem asserts existence, but gives no method of finding these prior probabilities. Kailath (Kailath, 1967) proved the applicability of the Bradt-Karlin theorem for the Bhattacharyya distance measure. We follow the same route and show that the B_α(ρ) measure satisfies a similar property, using the following theorem by Blackwell.

Theorem 3.3 (Blackwell (Blackwell, 1951)). P_e(β_0,Γ) ≤ P_e(β,Γ) for all prior probabilities Γ if and only if

    E_{\beta_0}[\Phi(L_{\beta_0}) \,|\, g] \le E_{\beta}[\Phi(L_{\beta}) \,|\, g]

for all continuous concave functions Φ(L), where L_β = p_1(x,β)/p_2(x,β) is the likelihood ratio and E_ω[Φ(L_ω)|g] is the expectation of Φ(L_ω) under the hypothesis g = P_2.

Theorem 3.4. If B_α(ρ(β)) > B_α(ρ(β_0)), or equivalently ρ(β) < ρ(β_0), then there exists a set of prior probabilities Γ = {π_1, π_2} for the two hypotheses g_1, g_2 for which

    P_e(\beta, \Gamma) < P_e(\beta_0, \Gamma).    (18)

Proof. The proof closely follows Kailath (Kailath, 1967). First note that √L is a concave function of L (the likelihood ratio), and

    \rho(\beta) = \sum_{x \in X} \sqrt{p_1(x,\beta)\, p_2(x,\beta)}
                = \sum_{x \in X} p_2(x,\beta) \sqrt{\frac{p_1(x,\beta)}{p_2(x,\beta)}}
                = E_\beta\!\left[ \sqrt{L_\beta} \,\middle|\, g_2 \right].    (19)

Similarly,

    \rho(\beta_0) = E_{\beta_0}\!\left[ \sqrt{L_{\beta_0}} \,\middle|\, g_2 \right].    (20)

Hence ρ(β) < ρ(β_0) implies

    E_\beta\!\left[ \sqrt{L_\beta} \,\middle|\, g_2 \right] < E_{\beta_0}\!\left[ \sqrt{L_{\beta_0}} \,\middle|\, g_2 \right].    (21)

Suppose the assertion of the theorem is not true; then for all Γ, P_e(β_0,Γ) ≤ P_e(β,Γ). Then by Theorem 3.3, E_{β_0}[Φ(L_{β_0})|g_2] ≤ E_β[Φ(L_β)|g_2], which contradicts our result in Eq. 21.

3.3 Bounds on Error Probability

Error probabilities are hard to calculate in general, and tight bounds on P_e are often extremely useful in practice. Kailath (Kailath, 1967) has shown bounds on P_e in terms of the Bhattacharyya coefficient ρ:

    \frac{1}{2}\left[ 1 - \sqrt{1 - 4\pi_1\pi_2\rho^2} \right] \le P_e \le \sqrt{\pi_1\pi_2}\,\rho,    (22)

with π_1 + π_2 = 1. If the priors are equal, π_1 = π_2 = 1/2, the expression simplifies to

    \frac{1}{2}\left[ 1 - \sqrt{1 - \rho^2} \right] \le P_e \le \frac{1}{2}\rho.    (23)

Inverting the relation in Eq. 6 for ρ(B_α), we can obtain these bounds in terms of the B_α(ρ) measure. For the equal-prior case, the Bhattacharyya coefficient gives a tight upper bound for large systems when ρ → 0 (zero overlap) and the observations are independent and identically distributed. These bounds are also useful for discriminating between two processes with arbitrarily low error probability (Kailath, 1967). We expect that tighter upper bounds on the error probability can be derived through Matusita's measure of affinity (Bhattacharya and Toussaint, 1982; Toussaint, 1977; Toussaint, 1975), but this is beyond the scope of the present work.
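The bounds in Eqs. (22)-(23) are easy to check numerically. The sketch below (ours, with an assumed Gaussian example) computes the Bhattacharyya coefficient and the exact Bayes error for two Gaussians on a grid and verifies that P_e lies between the Kailath bounds:

    import numpy as np

    def kailath_bounds(rho, pi1=0.5, pi2=0.5):
        """Lower and upper bounds on the Bayes error P_e from the
        Bhattacharyya coefficient rho, Eq. (22)."""
        lower = 0.5 * (1.0 - np.sqrt(1.0 - 4.0 * pi1 * pi2 * rho ** 2))
        upper = np.sqrt(pi1 * pi2) * rho
        return lower, upper

    def gaussian_grid(mu, sigma, x):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    if __name__ == "__main__":
        x = np.linspace(-15, 15, 400001)
        p1 = gaussian_grid(0.0, 1.0, x)
        p2 = gaussian_grid(2.0, 1.0, x)
        rho = np.trapz(np.sqrt(p1 * p2), x)               # Bhattacharyya coefficient
        pe = np.trapz(np.minimum(0.5 * p1, 0.5 * p2), x)  # exact Bayes error, Eq. (16)
        lo, up = kailath_bounds(rho)
        print(f"rho = {rho:.4f}, bounds = ({lo:.4f}, {up:.4f}), P_e = {pe:.4f}")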
3.4 F-divergence

A class of divergence measures called f-divergences was introduced by Csiszar (Csiszar, 1967; Csiszar, 1975) and independently by Ali and Silvey (Ali and Silvey, 1966) (see (Basseville, 1989) for a review). It encompasses many well-known divergence measures including the KLD, variational, Bhattacharyya and Hellinger distances. In this section, we show that the B_α(ρ) measure for α ∈ (1, ∞] belongs to the generic class of f-divergences defined by Basseville (Basseville, 1989).

F-divergence (Basseville, 1989). Consider a measurable space Ω with σ-algebra B. Let λ be a measure on (Ω,B) such that any probability laws P and Q are absolutely continuous with respect to λ, with densities p and q. Let f be a continuous convex real function on R+, and g an increasing function on R. The class of divergence coefficients between two probabilities

    d(P,Q) = g\left( \int_\Omega f\!\left(\frac{p}{q}\right) q\, d\lambda \right)    (24)

are called f-divergence measures w.r.t. the functions (f, g). Here p/q = L is the likelihood ratio. The term in the parentheses of g gives Csiszar's (Csiszar, 1967; Csiszar, 1975) definition of f-divergence.


The B_α(ρ(P,Q)) measure, for α ∈ (1, ∞], can be written as the following f-divergence:

    f(x) = -\left[ 1 - \frac{1-\sqrt{x}}{\alpha} \right], \qquad g(F) = \frac{\log(-F)}{\log(1 - 1/\alpha)},    (25)

where

    F = \int_\Omega -\left[ 1 - \frac{1 - \sqrt{p/q}}{\alpha} \right] q\, d\lambda
      = -\int_\Omega \left[ q\left(1 - \frac{1}{\alpha}\right) + \frac{\sqrt{pq}}{\alpha} \right] d\lambda
      = -\left[ 1 - \frac{1-\rho}{\alpha} \right],    (26)

and

    g(F) = \frac{\log\left(1 - \frac{1-\rho}{\alpha}\right)}{\log(1 - 1/\alpha)} = B_\alpha(\rho(P,Q)).    (27)

3.5 Curvature and Fisher Information

In statistics, the information that an observable random variable X carries about an unknown parameter θ (on which it depends) is given by the Fisher information. One of the important properties of the f-divergence of two distributions of the same parametric family is that its curvature measures the Fisher information. Following the approach pioneered by Rao (Rao, 1945), we relate the curvature of BBD measures to the Fisher information and derive the differential curvature metric. The discussion below closely follows (DasGupta, 2011).

Definition. Let {f(x|θ); θ ∈ Θ ⊆ R} be a family of densities indexed by a real parameter θ, with some regularity conditions (f(x|θ) is absolutely continuous). Define

    Z_\theta(\phi) \equiv B_\alpha(\theta,\phi) = \frac{\log\left(1 - \frac{1-\rho(\theta,\phi)}{\alpha}\right)}{\log(1 - 1/\alpha)},    (28)

where ρ(θ,φ) = ∫ √(f(x|θ) f(x|φ)) dx.

Theorem 3.5. The curvature of Z_θ(φ) at φ = θ is the Fisher information of f(x|θ), up to a multiplicative constant.

Proof. Expand Z_θ(φ) around θ:

    Z_\theta(\phi) = Z_\theta(\theta) + (\phi - \theta)\frac{dZ_\theta(\phi)}{d\phi}
                     + \frac{(\phi-\theta)^2}{2}\frac{d^2 Z_\theta(\phi)}{d\phi^2} + \dots    (29)

Let us observe some properties of the Bhattacharyya coefficient: ρ(θ,φ) = ρ(φ,θ), ρ(θ,θ) = 1, and its derivatives satisfy

    \left.\frac{\partial\rho(\theta,\phi)}{\partial\phi}\right|_{\phi=\theta}
      = \frac{1}{2}\frac{\partial}{\partial\theta}\int f(x|\theta)\, dx = 0,    (30)

    \left.\frac{\partial^2\rho(\theta,\phi)}{\partial\phi^2}\right|_{\phi=\theta}
      = -\frac{1}{4}\int \frac{1}{f}\left(\frac{\partial f}{\partial\theta}\right)^2 dx + \frac{1}{2}\int \frac{\partial^2 f}{\partial\theta^2}\, dx
      = -\frac{1}{4}\int f(x|\theta)\left(\frac{\partial\log f(x|\theta)}{\partial\theta}\right)^2 dx
      = -\frac{1}{4} I_f(\theta),    (31)

where I_f(θ) is the Fisher Information of the distribution f(x|θ):

    I_f(\theta) = \int f(x|\theta)\left(\frac{\partial\log f(x|\theta)}{\partial\theta}\right)^2 dx.    (32)

Using the above relationships, we can write down the terms in the expansion of Eq. 29: Z_θ(θ) = 0, ∂Z_θ(φ)/∂φ |_{φ=θ} = 0, and

    \left.\frac{\partial^2 Z_\theta(\phi)}{\partial\phi^2}\right|_{\phi=\theta} = C(\alpha)\, I_f(\theta) > 0,    (33)

where C(α) = -1/[4α log(1 - 1/α)] > 0. The leading term of B_α(θ,φ) is therefore given by

    B_\alpha(\theta,\phi) \sim \frac{(\phi-\theta)^2}{2}\, C(\alpha)\, I_f(\theta).    (34)
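A quick numerical check of Eq. (34) (our own sketch, not from the paper): for a Gaussian location family with known σ the Fisher information is I_f(μ) = 1/σ², and the closed form ρ = exp(-(Δμ)²/(8σ²)) for two equal-variance Gaussians is a standard result, so B_α(μ, μ+δ) should approach (δ²/2) C(α)/σ² for small δ.

    import numpy as np

    def bhattacharyya_coeff_gauss(mu1, mu2, sigma):
        """Closed-form rho for two Gaussians with common sigma."""
        return np.exp(-(mu1 - mu2) ** 2 / (8.0 * sigma ** 2))

    def bbd(rho, alpha):
        return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

    def c_alpha(alpha):
        """C(alpha) = -1 / [4 alpha log(1 - 1/alpha)], Eq. (33)."""
        return -1.0 / (4.0 * alpha * np.log(1.0 - 1.0 / alpha))

    if __name__ == "__main__":
        alpha, sigma, mu = 2.0, 1.5, 0.0
        fisher_info = 1.0 / sigma ** 2     # I_f(mu) for a Gaussian location family
        for delta in (0.5, 0.1, 0.01):
            exact = bbd(bhattacharyya_coeff_gauss(mu, mu + delta, sigma), alpha)
            approx = 0.5 * delta ** 2 * c_alpha(alpha) * fisher_info  # Eq. (34)
            print(delta, exact, approx)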

3.6 Differential Metrics

Rao (Rao, 1987) generalized the Fisher information to multivariate densities with vector-valued parameters to obtain a "geodesic" distance between two parametric distributions P_θ, P_φ of the same family. The Fisher-Rao metric has found applications in many areas such as image structure and shape analysis (Maybank, 2004; Peter and Rangarajan, 2006), quantum statistical inference (Brody and Hughston, 1998) and black hole thermodynamics (Quevedo, 2008). We derive such a metric for the BBD measure using the f-divergence property.

Let θ, φ ∈ Θ ⊆ R^p. Then, using the fact that ∂Z(θ,φ)/∂θ_i |_{φ=θ} = 0, we can easily show that

    dZ_\theta^2 = \sum_{i,j=1}^{p} \frac{\partial^2 Z_\theta}{\partial\theta_i \partial\theta_j}\, d\theta_i\, d\theta_j + \dots
                = \sum_{i,j=1}^{p} g_{ij}\, d\theta_i\, d\theta_j + \dots    (35)

The curvature metric g_ij can be used to find the geodesic on a curve η(t), t ∈ [0,1], with

    C = \{\eta(t) : \eta(0) = \theta,\ \eta(1) = \phi\}.    (36)

Details of the geodesic equation are given in many standard differential geometry books. In the context of probability distance measures, the reader is referred to Section 15.4.2 of DasGupta (DasGupta, 2011) for details.


The curvature metric of all Csiszar f-divergences is just a scalar multiple of the KLD curvature metric (DasGupta, 2011; Basseville, 1989), given by

    g_{ij}^{f}(\theta) = f''(1)\, g_{ij}(\theta).    (37)

For our BBD measure,

    f''(x) = \frac{d^2}{dx^2}\left\{ -\left[ 1 - \frac{1-\sqrt{x}}{\alpha} \right] \right\} = \frac{1}{4\alpha x^{3/2}}, \qquad f''(1) = \frac{1}{4\alpha}.    (38)

Apart from the factor 1/log(1 - 1/α), this is the same as C(α) in Eq. 34. It follows that the geodesic distance for our metric is the same as the KLD geodesic distance up to a multiplicative factor. KLD geodesic distances are tabulated in DasGupta (DasGupta, 2011).

3.7 Relation to Other Measures

Here we focus on the special case α = 2, i.e. B_2(ρ).

Theorem 3.6.

    B_2 \le H^2 \le \log 4 \cdot B_2,    (39)

where the constants 1 and log 4 are sharp.

Proof. The sharpest upper bound is obtained by taking sup_{ρ∈[0,1)} [H²(ρ)/B_2(ρ)]. Define

    g(\rho) \equiv \frac{1-\rho}{-\log_2\left(\frac{1+\rho}{2}\right)}.    (40)

We note that g(ρ) is continuous and has no singularities whenever ρ ∈ [0,1). Furthermore,

    g'(\rho) = \log 2\; \frac{\frac{1-\rho}{1+\rho} + \log\left(\frac{1+\rho}{2}\right)}{\log^2\left(\frac{1+\rho}{2}\right)} \ge 0.

It follows that g(ρ) is non-decreasing and hence sup_{ρ∈[0,1)} g(ρ) = lim_{ρ→1} g(ρ) = log 4. Thus

    H^2 / B_2 \le \log 4.    (41)

Combining this with the convexity property of B_α(ρ) for α > 1, we get B_2 ≤ H² ≤ log 4 · B_2. Using the same procedure we can prove a generic version of this inequality for α ∈ (1, ∞], given by

    B_\alpha(\rho) \le H^2 \le -\alpha \log\left(1 - \frac{1}{\alpha}\right) B_\alpha(\rho).    (42)
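A simple numerical check of Eq. (39) and its generic form Eq. (42) (our own sketch, not part of the paper): sweep ρ over [0,1) and confirm that the sandwich inequalities hold.

    import numpy as np

    def bbd(rho, alpha):
        return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

    def check_inequalities(alpha=2.0, n=1000):
        """Verify B_alpha(rho) <= H^2(rho) <= -alpha*log(1-1/alpha)*B_alpha(rho)
        on a grid of rho values (Eq. 42; for alpha = 2 the factor is log 4)."""
        rho = np.linspace(0.0, 0.999, n)
        b = bbd(rho, alpha)
        h2 = 1.0 - rho
        factor = -alpha * np.log(1.0 - 1.0 / alpha)
        assert np.all(b <= h2 + 1e-12)
        assert np.all(h2 <= factor * b + 1e-12)
        return factor

    if __name__ == "__main__":
        print("alpha=2 factor:", check_inequalities(2.0), "(log 4 =", np.log(4.0), ")")
        print("alpha=5 factor:", check_inequalities(5.0))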

Jensen-Shannon Divergence: The Jensen difference between two distributions P, Q with densities p, q and weights (λ_1, λ_2), λ_1 + λ_2 = 1, is defined as

    J_{\lambda_1,\lambda_2}(P,Q) = H_s(\lambda_1 p + \lambda_2 q) - \lambda_1 H_s(p) - \lambda_2 H_s(q),    (43)

where H_s is the Shannon entropy. The Jensen-Shannon divergence (JSD) (Burbea and Rao, 1982; Rao, 1982b; Lin, 1991) is based on the Jensen difference and is given by

    JS(P,Q) = J_{1/2,1/2}(P,Q) = \frac{1}{2}\int \left[ p\log\left(\frac{2p}{p+q}\right) + q\log\left(\frac{2q}{p+q}\right) \right] d\lambda.    (44)

The structure and goals of the JSD and BBD measures are similar. The following theorem compares the two measures using Jensen's inequality.

Lemma 3.7 (Jensen's Inequality). For a convex function ψ, E[ψ(X)] ≥ ψ(E[X]).

Theorem 3.8 (Relation to the Jensen-Shannon measure). JS(P,Q) ≥ 2 log 2 · B_2(P,Q) - log 2.

We use the un-symmetrized Jensen-Shannon divergence for the proof.

Proof.

    JS(P,Q) = \int p(x) \log\left(\frac{2p(x)}{p(x)+q(x)}\right) dx
            = E_P\left[ -2\log\sqrt{\frac{p(X)+q(X)}{2p(X)}} \right]
            \ge -2\log E_P\left[ \sqrt{\frac{p(X)+q(X)}{2p(X)}} \right]
            = -2\log \int \sqrt{\frac{p(x)\,[p(x)+q(x)]}{2}}\, dx,

where the inequality follows from Jensen's inequality, E[-log f(X)] ≥ -log E[f(X)]. Since √(p+q) ≤ √p + √q,

    JS(P,Q) \ge -2\log \int \frac{\sqrt{p(x)}\left(\sqrt{p(x)} + \sqrt{q(x)}\right)}{\sqrt{2}}\, dx
             = -2\log\left( \frac{1 + \int\sqrt{p(x) q(x)}\, dx}{\sqrt{2}} \right)
             = -2\log\left( \frac{1+\rho}{2} \right) - \log 2
             = 2\log 2\; B_2(P,Q) - \log 2.    (45)

4 APPLICATION TO SIGNAL DETECTION

Signal detection is a common problem occurring in many fields such as communication engineering, pattern recognition, and gravitational wave detection (Poor, 1994). In this section, we briefly describe the problem and the terminology used in signal detection, and we illustrate through simple cases how divergence measures, in particular BBD, can be used for discriminating and detecting signals buried in white noise with correlator receivers (matched filters). For greater detail on the formalism used we refer the reader to the review articles in the context of gravitational wave detection by Jaranowski and Krolak (Jaranowski and Krolak, 2007) and Sam Finn (Finn, 1992).

One of the central problems in signal detection is to detect whether a deterministic signal s(t) is embedded in observed data x(t) corrupted by noise n(t). This can be posed as a hypothesis testing problem where the null hypothesis is the absence of the signal and the alternative is its presence. We take the noise to be additive, so that x(t) = n(t) + s(t). We define the following terms used in signal detection: the correlation G (also called the matched filter) between x and s, and the signal-to-noise ratio ρ (Finn, 1992; Budzynski et al., 2008):

    G = (x \,|\, s), \qquad \rho = \sqrt{(s \,|\, s)},    (46)

where the scalar product (.|.) is defined by

    (x \,|\, y) := 4\,\Re \int_0^\infty \frac{\tilde{x}(f)\, \tilde{y}^*(f)}{\tilde{N}(f)}\, df.    (47)

Here ℜ denotes the real part of a complex expression, the tilde denotes the Fourier transform and the asterisk denotes complex conjugation. Ñ is the one-sided spectral density of the noise. For white noise, the probability densities of G when the signal is, respectively, present and absent are given by (Budzynski et al., 2008)

    p_1(G) = \frac{1}{\sqrt{2\pi}\,\rho} \exp\left[ -\frac{(G - \rho^2)^2}{2\rho^2} \right],    (48)

    p_0(G) = \frac{1}{\sqrt{2\pi}\,\rho} \exp\left[ -\frac{G^2}{2\rho^2} \right].    (49)

4.1 Distance between Gaussian Processes

Consider a stationary Gaussian random process X, which contains signal s_1 or s_2 with probability densities p_1 and p_2, respectively, of being present in it. These densities follow the form of Eq. 48 with signal-to-noise ratios ρ_1 and ρ_2 respectively. The probability density p(X) of the Gaussian process can be modeled as a limit of multivariate Gaussian distributions. The divergence measures between these processes, d(s_1,s_2), are in general functions of the correlator (s_1 - s_2 | s_1 - s_2) (Budzynski et al., 2008). Here we focus on distinguishing a monochromatic signal s(t) = A cos(ωt + φ) and a filter s_F(t) = A_F cos(ω_F t + φ_F) (both buried in noise), separated in frequency or amplitude.

The Kullback-Leibler divergence between the signal and the filter, I(s, s_F), is given by the correlation (s - s_F | s - s_F):

    I(s, s_F) = (s - s_F \,|\, s - s_F) = (s|s) + (s_F|s_F) - 2(s|s_F)
              = \rho^2 + \rho_F^2 - 2\rho\rho_F \left[ \langle\cos(\Delta\omega t)\rangle \cos(\Delta\phi) - \langle\sin(\Delta\omega t)\rangle \sin(\Delta\phi) \right],    (50)

where ⟨·⟩ denotes the average over the observation time [0, T]. Here we have assumed that the noise spectral density N(f) = N_0 is constant over the frequencies [ω, ω_F]. The SNRs are given by

    \rho^2 = \frac{A^2 T}{N_0}, \qquad \rho_F^2 = \frac{A_F^2 T}{N_0}    (51)

(for detailed discussions we refer the reader to Budzynski et al. (Budzynski et al., 2008)). The Bhattacharyya distance between Gaussian processes with signals of the same energy is just a multiple of the KLD, B = I/8 (Eq. 14 in (Kailath, 1967)). We use this result to extract the Bhattacharyya coefficient:

    \rho(s, s_F) = \exp\left[ -\frac{(s - s_F \,|\, s - s_F)}{8} \right].    (52)


4.1.1 Frequency Difference

Let us consider the case where the SNRs of the signal and the filter are equal and the phase difference is zero, but the frequencies differ by ∆ω. The KL divergence is obtained by evaluating the correlator in Eq. 50:

    I(\Delta\omega) = (s - s_F \,|\, s - s_F) = 2\rho^2 \left[ 1 - \frac{\sin(\Delta\omega T)}{\Delta\omega T} \right],    (53)

by noting that ⟨cos(∆ωt)⟩ = sin(∆ωT)/(∆ωT) and ⟨sin(∆ωt)⟩ = [1 - cos(∆ωT)]/(∆ωT). Using this, the expression for the BBD family can be written as

    B_\alpha(\Delta\omega) = \frac{\log\left[ 1 - \frac{1}{\alpha}\left( 1 - e^{-\frac{\rho^2}{4}\left(1 - \frac{\sin(\Delta\omega T)}{\Delta\omega T}\right)} \right) \right]}{\log\left( 1 - \frac{1}{\alpha} \right)}.    (54)

As we have seen in Section 3.4, both BBD and KLD belong to the f-divergence family, and their curvature for distributions belonging to the same parametric family is a constant times the Fisher information (FI) (see Theorem 3.5). Here we discuss where BBD and KLD deviate from FI when we account for higher-order terms in the expansion of these measures.

Figure 2: Comparison of Fisher Information, KLD, BBD and Hellinger distance for two monochromatic signals differing by frequency ∆ω, buried in white noise. The inset shows the wider range ∆ω ∈ (0,1). We have set ρ = 1 and chosen parameters T = 100 and N_0 = 10^4.

The Fisher matrix element for frequency is g_{ω,ω} = E[(∂ log Λ / ∂ω)²] = ρ²T²/3 (Budzynski et al., 2008), where Λ is the likelihood ratio. Using the relation for the line element ds² = Σ_{i,j} g_{ij} dθ_i dθ_j and noting that only the frequency is varied, we get

    ds = \frac{\rho\, T\, \Delta\omega}{\sqrt{3}}.    (55)

Using the relation between the curvature of the BBD measure and Fisher Information in Eq. 34, we can see that for small frequency differences the line element varies as

    \sqrt{\frac{2 B_\alpha(\Delta\omega)}{C(\alpha)}} \sim ds.

Similarly, √dKL ∼ ds at low frequencies. However, at higher frequencies both KLD and BBD deviate from the Fisher information metric. In Fig. 2, we have plotted ds, √dKL and √(2B_α(∆ω)/C(α)) with α = 2, together with the Hellinger distance (α → ∞), for ∆ω ∈ (0, 0.1). We observe that up to ∆ω = 0.01 (i.e. ∆ωT ∼ 1), KLD and BBD follow the Fisher Information, after which they start to deviate. This suggests that the Fisher Information is not sensitive to large deviations. There is not much difference between KLD, BBD and Hellinger for large frequency differences because the correlator G becomes essentially constant over a wide range of frequencies.

Figure 3: Comparison of the Fisher information line element with KLD, BBD and Hellinger distance for signals differing in amplitude and buried in white noise. We have set A = 1, T = 100 and N_0 = 10^4.

4.1.2 Amplitude Difference

We now consider the case where the frequency and phase of the signal and the filter are the same but they differ in amplitude by ∆A (which is reflected in differing SNRs). The correlation reduces to

    (s - s_F \,|\, s - s_F) = \frac{A^2 T}{N_0} + \frac{(A+\Delta A)^2 T}{N_0} - \frac{2A(A+\Delta A) T}{N_0} = \frac{(\Delta A)^2 T}{N_0}.    (56)

This gives I(∆A) = (∆A)²T/N_0, which is the same as the squared line element with the Fisher metric, ds² = (T/N_0)(∆A)². In Fig. 3, we have plotted ds, √dKL and √(2B_α/C(α)) for ∆A ∈ (0, 40). The KLD and FI line elements are the same; deviations of BBD and Hellinger can be observed only for ∆A > 10.
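The comparison underlying Fig. 2 can be reproduced with a few lines (our own sketch, using the parameter values quoted in the caption): compute ds from Eq. (55) and compare it with √I(∆ω) and √(2B_α(∆ω)/C(α)) from Eqs. (53)-(54).

    import numpy as np

    def kld_freq(d_omega, rho, T):
        """I(d_omega) = 2 rho^2 [1 - sin(d_omega T)/(d_omega T)], Eq. (53)."""
        x = d_omega * T
        return 2.0 * rho ** 2 * (1.0 - np.sinc(x / np.pi))  # np.sinc(y) = sin(pi y)/(pi y)

    def bbd_freq(d_omega, rho, T, alpha=2.0):
        """B_alpha(d_omega), Eq. (54)."""
        r = np.exp(-kld_freq(d_omega, rho, T) / 8.0)
        return np.log(1.0 - (1.0 - r) / alpha) / np.log(1.0 - 1.0 / alpha)

    def c_alpha(alpha):
        return -1.0 / (4.0 * alpha * np.log(1.0 - 1.0 / alpha))

    if __name__ == "__main__":
        rho, T, alpha = 1.0, 100.0, 2.0
        for d_omega in (0.001, 0.01, 0.05, 0.1):
            ds = rho * T * d_omega / np.sqrt(3.0)                    # Eq. (55)
            d_kl = np.sqrt(kld_freq(d_omega, rho, T))
            d_bbd = np.sqrt(2.0 * bbd_freq(d_omega, rho, T, alpha) / c_alpha(alpha))
            print(f"d_omega={d_omega:6.3f}  ds={ds:.4f}  sqrt(KLD)={d_kl:.4f}  "
                  f"sqrt(2B/C)={d_bbd:.4f}")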


Discriminating between two signals s_1, s_2 requires minimizing the error probability between them. By Theorem 3.4, there exist priors for which this problem translates into maximizing the divergence for BBD measures. For the monochromatic signals discussed above, the distance depends on the parameters (ρ_1, ρ_2, ∆ω, ∆φ). We can maximize the distance for a given frequency difference by differentiating with respect to the phase difference ∆φ (Budzynski et al., 2008). In Fig. 4, we show the variation of the maximized BBD for different signal-to-noise ratios (ρ_1, ρ_2), for a fixed frequency difference ∆ω = 0.01. The intensity map shows different bands which can be used for setting the threshold for signal separation.

Detecting a signal of known form involves minimizing the distance measure over the parameter space of the signal. A threshold on the maximum "distance" between the signal and the filter can be set so that a detection is said to occur whenever the measure falls within this threshold. Based on a series of tests, Receiver Operating Characteristic (ROC) curves can be drawn to study the effectiveness of the distance measure in signal detection. We leave such details for future work.

Figure 4: Maximized BBD for different signal-to-noise ratios (ρ_1, ρ_2) at a fixed frequency difference. We have set T = 100 and ∆ω = 0.01.

5 SUMMARY AND OUTLOOK

In this work we have introduced a new family of bounded divergence measures based on the Bhattacharyya distance, called bounded Bhattacharyya distance measures. We have shown that it belongs to the class of generalized f-divergences and inherits all its properties, such as those relating Fisher Information and the curvature metric. We have discussed several special cases of our measure, in particular the squared Hellinger distance, and studied its relation to other measures such as the Jensen-Shannon divergence. We have also extended the Bradt-Karlin theorem on error probabilities to the BBD measure. Tight bounds on Bayes error probabilities can be put by using properties of the Bhattacharyya coefficient.

Although many bounded divergence measures have been studied and used in various applications, no single measure is useful in all types of problems studied. Here we have illustrated an application to the signal detection problem by considering the "distance" between a monochromatic signal and a filter buried in white Gaussian noise and differing in frequency or amplitude, and comparing it to the Fisher Information and the Kullback-Leibler divergence. A detailed study with chirp-like signals and the colored noise occurring in gravitational wave detection will be taken up in a future study. Although our measures have a tunable parameter α, here we have focused on the special case α = 2. In many practical applications where extremum values are desired, such as minimal error or minimal false acceptance/rejection ratio, exploring the BBD measure by varying α may be desirable. Further, the utility of BBD measures is to be explored in parameter estimation based on minimal disparity estimators and the Divergence Information Criterion in Bayesian model selection (Basu and Lindsay, 1994). However, since the focus of the current paper is introducing a new measure and studying its basic properties, we leave such applications to statistical inference and data processing to future studies.

ACKNOWLEDGEMENTS

One of us (S.J.) thanks Rahul Kulkarni for insightful discussions and Anand Sengupta for discussions on the application to signal detection, and acknowledges financial support in part by grants DMR-0705152 and DMR-1005417 from the US National Science Foundation. M.S. would like to thank the Penn State Electrical Engineering Department for support.

REFERENCES

Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 28(1):131-142.

Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18:349-369.

Basu, A. and Lindsay, B. G. (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46(4):683-705.

Ben-Bassat, M. (1978). f-entropies, probability of error, and feature selection. Information and Control, 39(3):227-242.


Bhattacharya, B. K. and Toussaint, G. T. (1982). An upper bound on the probability of misclassification in terms of Matusita's measure of affinity. Annals of the Institute of Statistical Mathematics, 34(1):161-165.

Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99-109.

Bhattacharyya, A. (1946). On a measure of divergence between two multinomial populations. Sankhya: The Indian Journal of Statistics (1933-1960), 7(4):401-406.

Blackwell, D. (1951). Comparison of experiments. In Second Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 93-102.

Bradt, R. and Karlin, S. (1956). On the design and comparison of certain dichotomous experiments. The Annals of Mathematical Statistics, pages 390-409.

Brody, D. C. and Hughston, L. P. (1998). Statistical geometry in quantum mechanics. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 454(1977):2445-2475.

Budzynski, R. J., Kondracki, W., and Krolak, A. (2008). Applications of distance between probability distributions to gravitational wave data analysis. Classical and Quantum Gravity, 25(1):015005.

Burbea, J. and Rao, C. R. (1982). On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489-495.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493-507.

Choi, E. and Lee, C. (2003). Feature extraction based on the Bhattacharyya distance. Pattern Recognition, 36(8):1703-1709.

Csiszar, I. (1967). Information-type distance measures and indirect observations. Stud. Sci. Math. Hungar., 2:299-318.

Csiszar, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146-158.

DasGupta, A. (2011). Probability for Statistics and Machine Learning. Springer Texts in Statistics. Springer, New York.

Finn, L. S. (1992). Detection, measurement, and gravitational radiation. Physical Review D, 46(12):5236.

Gibbs, A. and Su, F. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3):419-435.

Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelle's Journal), (136):210-271.

Hellman, M. E. and Raviv, J. (1970). Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 16(4):368-372.

Jaranowski, P. and Krolak, A. (2007). Gravitational-wave data analysis. Formalism and sample applications: the Gaussian case. arXiv preprint arXiv:0711.1115.

Kadota, T. and Shepp, L. (1967). On the best finite set of linear observables for discriminating two Gaussian signals. IEEE Transactions on Information Theory, 13(2):278-284.

Kailath, T. (1967). The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communications, 15(1):52-60.

Kakutani, S. (1948). On equivalence of infinite product measures. The Annals of Mathematics, 49(1):214-224.

Kapur, J. (1984). A comparative assessment of various measures of directed divergence. Advances in Management Studies, 3(1):1-16.

Kullback, S. (1968). Information Theory and Statistics. Dover, New York, 2nd edition.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.

Kumar, U., Kumar, V., and Kapur, J. N. (1986). Some normalized measures of directed divergence. International Journal of General Systems, 13(1):5-16.

Lamberti, P. W., Majtey, A. P., Borras, A., Casas, M., and Plastino, A. (2008). Metric character of the quantum Jensen-Shannon divergence. Physical Review A, 77:052311.

Lee, Y.-T. (1991). Information-theoretic distortion measures for speech recognition. IEEE Transactions on Signal Processing, 39(2):330-335.

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151.

Matusita, K. (1967). On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19(1):181-192.

Maybank, S. J. (2004). Detection of image structures using the Fisher information and the Rao metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12):1579-1589.

Nielsen, F. and Boltz, S. (2011). The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455-5466.

Nielsen, M. and Chuang, I. (2000). Quantum Computation and Quantum Information. Cambridge University Press, Cambridge, UK.

Peter, A. and Rangarajan, A. (2006). Shape analysis using the Fisher-Rao Riemannian metric: Unifying shape representation and deformation. In Biomedical Imaging: Nano to Macro, 3rd IEEE International Symposium on, pages 1164-1167. IEEE.

Poor, H. V. (1994). An Introduction to Signal Detection and Estimation. Springer.

Qiao, Y. and Minematsu, N. (2010). A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884-3890.

Quevedo, H. (2008). Geometrothermodynamics of black holes. General Relativity and Gravitation, 40(5):971-984.


Rao, C. (1982a). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhya: The Indian Journal of Statistics, Series A, pages 1-22.

Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81-91.

Rao, C. R. (1982b). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21(1):24-43.

Rao, C. R. (1987). Differential metrics in probability spaces. Differential Geometry in Statistical Inference, 10:217-240.

Royden, H. (1986). Real Analysis. Macmillan Publishing Company, New York.

Sunmola, F. T. (2013). Optimising Learning with Transferable Prior Information. PhD thesis, University of Birmingham.

Toussaint, G. T. (1974). Some properties of Matusita's measure of affinity of several distributions. Annals of the Institute of Statistical Mathematics, 26(1):389-394.

Toussaint, G. T. (1975). Sharper lower bounds for discrimination information in terms of variation (corresp.). IEEE Transactions on Information Theory, 21(1):99-100.

Toussaint, G. T. (1977). An upper bound on the probability of misclassification in terms of the affinity. Proceedings of the IEEE, 65(2):275-276.

Toussaint, G. T. (1978). Probability of error, expected divergence and the affinity of several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8(6):482-485.

Tumer, K. and Ghosh, J. (1996). Estimating the Bayes error rate through classifier combining. Proceedings of the 13th International Conference on Pattern Recognition, pages 695-699.

Varshney, K. R. (2011). Bayes risk error is a Bregman divergence. IEEE Transactions on Signal Processing, 59(9):4470-4472.

Varshney, K. R. and Varshney, L. R. (2008). Quantization of prior probabilities for hypothesis testing. IEEE Transactions on Signal Processing, 56(10):4553.

APPENDIX

BBD Measures of Some Common Distributions

Here we provide explicit expressions for the BBD measure B_2 for some common distributions. For brevity we denote ζ ≡ B_2.

Binomial:

    P(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad Q(k) = \binom{n}{k} q^k (1-q)^{n-k},

    \zeta_{bin}(P,Q) = -\log_2\left( \frac{1 + \left[\sqrt{pq} + \sqrt{(1-p)(1-q)}\right]^n}{2} \right).    (57)

Poisson:

    P(k) = \frac{\lambda_p^k e^{-\lambda_p}}{k!}, \qquad Q(k) = \frac{\lambda_q^k e^{-\lambda_q}}{k!},

    \zeta_{poisson}(P,Q) = -\log_2\left( \frac{1 + e^{-(\sqrt{\lambda_p}-\sqrt{\lambda_q})^2/2}}{2} \right).    (58)

Gaussian:

    P(x) = \frac{1}{\sqrt{2\pi}\,\sigma_p} \exp\left( -\frac{(x-x_p)^2}{2\sigma_p^2} \right), \qquad
    Q(x) = \frac{1}{\sqrt{2\pi}\,\sigma_q} \exp\left( -\frac{(x-x_q)^2}{2\sigma_q^2} \right),

    \zeta_{Gauss}(P,Q) = -\log_2\left[ \frac{1}{2}\left( 1 + \sqrt{\frac{2\sigma_p\sigma_q}{\sigma_p^2+\sigma_q^2}} \exp\left( -\frac{(x_p-x_q)^2}{4(\sigma_p^2+\sigma_q^2)} \right) \right) \right].    (59)

Exponential: P(x) = \lambda_p e^{-\lambda_p x}, Q(x) = \lambda_q e^{-\lambda_q x},

    \zeta_{exp}(P,Q) = -\log_2\left[ \frac{\left(\sqrt{\lambda_p} + \sqrt{\lambda_q}\right)^2}{2(\lambda_p + \lambda_q)} \right].    (60)

Pareto: Assuming the same cutoff x_m,

    P(x) = \begin{cases} \dfrac{\alpha_p x_m^{\alpha_p}}{x^{\alpha_p+1}} & x \ge x_m \\ 0 & x < x_m \end{cases}    (61)
    \qquad
    Q(x) = \begin{cases} \dfrac{\alpha_q x_m^{\alpha_q}}{x^{\alpha_q+1}} & x \ge x_m \\ 0 & x < x_m \end{cases}    (62)

    \zeta_{pareto}(P,Q) = -\log_2\left[ \frac{\left(\sqrt{\alpha_p} + \sqrt{\alpha_q}\right)^2}{2(\alpha_p + \alpha_q)} \right].    (63)
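The closed forms above are easy to sanity-check numerically. The sketch below (ours, not part of the paper) compares the Gaussian expression of Eq. (59) with a direct grid evaluation of the Bhattacharyya coefficient and B_2:

    import numpy as np

    def zeta_gauss_closed_form(xp, sp, xq, sq):
        """Eq. (59): B_2 between two univariate Gaussians."""
        rho = np.sqrt(2.0 * sp * sq / (sp ** 2 + sq ** 2)) * \
              np.exp(-(xp - xq) ** 2 / (4.0 * (sp ** 2 + sq ** 2)))
        return -np.log2(0.5 * (1.0 + rho))

    def zeta_gauss_numeric(xp, sp, xq, sq, n=400001):
        """B_2 computed from a grid approximation of rho = int sqrt(P Q) dx."""
        lo = min(xp - 12 * sp, xq - 12 * sq)
        hi = max(xp + 12 * sp, xq + 12 * sq)
        x = np.linspace(lo, hi, n)
        P = np.exp(-(x - xp) ** 2 / (2 * sp ** 2)) / (np.sqrt(2 * np.pi) * sp)
        Q = np.exp(-(x - xq) ** 2 / (2 * sq ** 2)) / (np.sqrt(2 * np.pi) * sq)
        rho = np.trapz(np.sqrt(P * Q), x)
        return -np.log2(0.5 * (1.0 + rho))

    if __name__ == "__main__":
        print(zeta_gauss_closed_form(0.0, 1.0, 2.0, 1.5))
        print(zeta_gauss_numeric(0.0, 1.0, 2.0, 1.5))   # should agree closely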
