A New Family of Bounded Divergence Measures and Application to Signal Detection

Shivakumar Jolad1, Ahmed Roman2, Mahesh C. Shastry3, Mihir Gadgil4 and Ayanendranath Basu5
1 Department of Physics, Indian Institute of Technology Gandhinagar, Ahmedabad, Gujarat, INDIA
2 Department of Mathematics, Virginia Tech, Blacksburg, VA, USA
3 Department of Physics, Indian Institute of Science Education and Research Bhopal, Bhopal, Madhya Pradesh, INDIA
4 Biomedical Engineering Department, Oregon Health & Science University, Portland, OR, USA
5 Indian Statistical Institute, Kolkata, West Bengal-700108, INDIA
[email protected], [email protected], [email protected], [email protected], [email protected]

Keywords: Divergence Measures, Bhattacharyya Distance, Error Probability, F-divergence, Pattern Recognition, Signal Detection, Signal Classification.

Abstract: We introduce a new one-parameter family of divergence measures, called bounded Bhattacharyya distance (BBD) measures, for quantifying the dissimilarity between probability distributions. These measures are bounded, symmetric and positive semi-definite, and do not require the distributions to be absolutely continuous with respect to each other. In the asymptotic limit, BBD approaches the squared Hellinger distance. A generalized BBD measure for multiple distributions is also introduced. We prove an extension of a theorem of Bradt and Karlin for BBD relating Bayes error probability and divergence ranking. We show that BBD belongs to the class of generalized Csiszar f-divergences and derive some properties such as curvature and its relation to Fisher Information. For distributions with vector-valued parameters, the curvature matrix is related to the Fisher-Rao metric. We derive certain inequalities between BBD and well-known measures such as Hellinger and Jensen-Shannon divergence. We also derive bounds on the Bayesian error probability. We give an application of these measures to the problem of signal detection, where we compare two monochromatic signals buried in white noise and differing in frequency and amplitude.

1 INTRODUCTION

Divergence measures for the distance between two probability distributions are a statistical approach to comparing data and have been extensively studied in the last six decades [Kullback and Leibler, 1951, Ali and Silvey, 1966, Kapur, 1984, Kullback, 1968, Kumar et al., 1986]. These measures are widely used in varied fields such as pattern recognition [Basseville, 1989, Ben-Bassat, 1978, Choi and Lee, 2003], speech recognition [Qiao and Minematsu, 2010, Lee, 1991], signal detection [Kailath, 1967, Kadota and Shepp, 1967, Poor, 1994], Bayesian model validation [Tumer and Ghosh, 1996] and quantum information theory [Nielsen and Chuang, 2000, Lamberti et al., 2008]. Distance measures try to achieve two main objectives (which are not mutually exclusive): to assess (1) how "close" two distributions are compared to others and (2) how "easy" it is to distinguish between one pair than another [Ali and Silvey, 1966].

There is a plethora of distance measures available to assess the convergence (or divergence) of probability distributions. Many of these measures are not metrics in the strict mathematical sense, as they may not satisfy either the symmetry of arguments or the triangle inequality. In applications, the choice of the measure depends on the interpretation of the metric in terms of the problem considered, its analytical properties and ease of computation [Gibbs and Su, 2002]. One of the most well-known and widely used divergence measures, the Kullback-Leibler divergence (KLD) [Kullback and Leibler, 1951, Kullback, 1968], can create problems in specific applications. Specifically, it is unbounded above and requires that the distributions be absolutely continuous with respect to each other. Various other information theoretic measures have been introduced keeping in view ease of computation and utility in problems of signal selection and pattern recognition. Of these measures, the Bhattacharyya distance [Bhattacharyya, 1946, Kailath, 1967, Nielsen and Boltz, 2011] and the Chernoff distance [Chernoff, 1952, Basseville, 1989, Nielsen and Boltz, 2011] have been widely used in signal processing. However, these measures are again unbounded from above. Many bounded divergence measures such as the variational and Hellinger distances [Basseville, 1989, DasGupta, 2011] and the Jensen-Shannon metric [Burbea and Rao, 1982, Rao, 1982b, Lin, 1991] have been studied extensively. The utility of these measures varies depending on properties such as tightness of bounds on error probabilities, information theoretic interpretations, and the ability to generalize to multiple probability distributions.

Here we introduce a new one-parameter (α) family of bounded measures based on the Bhattacharyya coefficient, called bounded Bhattacharyya distance (BBD) measures. These measures are symmetric, positive-definite and bounded between 0 and 1. In the asymptotic limit (α → ±∞) they approach the squared Hellinger divergence [Hellinger, 1909, Kakutani, 1948]. Following Rao [Rao, 1982b] and Lin [Lin, 1991], a generalized BBD is introduced to capture the divergence (or convergence) between multiple distributions. We show that BBD measures belong to the generalized class of f-divergences and inherit useful properties such as curvature and its relation to Fisher Information. Bayesian inference is useful in problems where a decision has to be made on classifying an observation into one of a possible array of states whose prior probabilities are known [Hellman and Raviv, 1970, Varshney and Varshney, 2008]. Divergence measures are useful in estimating the error in such classification [Ben-Bassat, 1978, Kailath, 1967, Varshney, 2011]. We prove an extension of the Bradt-Karlin theorem for BBD, which establishes the existence of prior probabilities relating Bayes error probabilities with ranking based on the divergence measure. Bounds on the error probability Pe can be calculated through BBD measures using certain inequalities between the Bhattacharyya coefficient and Pe. We derive two inequalities for a special case of BBD (α = 2) with the Hellinger and Jensen-Shannon divergences. Our bounded measure with α = 2 has been used by Sunmola [Sunmola, 2013] to calculate the distance between Dirichlet distributions in the context of Markov decision processes. We illustrate the applicability of BBD measures by focusing on the signal detection problem that comes up in areas such as gravitational wave detection [Finn, 1992]. Here we consider discriminating two monochromatic signals, differing in frequency or amplitude, and corrupted with additive white noise. We compare the Fisher Information of the BBD measures with that of KLD and Hellinger distance for these random processes, and highlight the regions where FI is insensitive to large parameter deviations. We also characterize the performance of BBD for different signal to noise ratios, providing thresholds for signal separation.

Our paper is organized as follows: Section 1 is the current introduction. In Section 2, we recall the well known Kullback-Leibler and Bhattacharyya divergence measures, and then introduce our bounded Bhattacharyya distance measures. We discuss some special cases of BBD, in particular Hellinger distance. We also introduce the generalized BBD for multiple distributions. In Section 3, we show the positive semi-definiteness of the BBD measure, the applicability of the Bradt-Karlin theorem, and prove that BBD belongs to the generalized f-divergence class. We also derive the relation between curvature and Fisher Information, discuss the curvature metric and prove some inequalities with other measures such as Hellinger and Jensen-Shannon divergence for a special case of BBD. In Section 4, we discuss the application to the signal detection problem. Here we first briefly describe the basic formulation of the problem, and then move on to computing the distance between random processes and comparing the BBD measure with Fisher Information and KLD. In the Appendix we provide the expressions for BBD measures, with α = 2, for some commonly used distributions. We conclude the paper with a summary and outlook.

2 DIVERGENCE MEASURES

In the following subsections we consider a measurable space Ω with σ-algebra B and the set of all probability measures M on (Ω, B). Let P and Q denote probability measures on (Ω, B), with p and q denoting their densities with respect to a common measure λ. We recall the definition of absolute continuity [Royden, 1986]:

Absolute Continuity. A measure P on the Borel subsets of the real line is absolutely continuous with respect to the Lebesgue measure Q if P(A) = 0 for every Borel subset A ∈ B for which Q(A) = 0; this is denoted by P << Q.

2.1 Kullback-Leibler Divergence

The Kullback-Leibler divergence (KLD), or relative entropy, [Kullback and Leibler, 1951, Kullback, 1968] between two distributions P, Q with densities p and q is given by

    I(P,Q) ≡ ∫ p log(p/q) dλ.    (1)

The symmetrized version is given by J(P,Q) ≡ (I(P,Q) + I(Q,P))/2 [Kailath, 1967]. I(P,Q) ∈ [0,∞], and it diverges if ∃ x0 such that q(x0) = 0 and p(x0) ≠ 0. KLD is defined only when P is absolutely continuous w.r.t. Q. This feature can be problematic in numerical computations when the measured distribution has zero values.

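The support issue noted above is easy to see numerically. The following is a minimal Python sketch of our own (not the authors' code): the discrete KLD of Eq. 1 diverges as soon as the second distribution assigns zero probability to an outcome on which the first one does not.

```python
import numpy as np

def kld(p, q):
    """Discrete Kullback-Leibler divergence I(P,Q) = sum_i p_i log(p_i/q_i), cf. Eq. (1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide='ignore'):
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p     = [0.5, 0.3, 0.2]
q_ok  = [0.4, 0.4, 0.2]
q_bad = [0.8, 0.2, 0.0]    # assigns zero probability where p does not
print(kld(p, q_ok))         # finite
print(kld(p, q_bad))        # inf: P is not absolutely continuous w.r.t. Q
```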
2.2 Bhattacharyya Distance

The Bhattacharyya distance is a widely used measure in signal selection and pattern recognition [Kailath, 1967]. It is defined as

    B(P,Q) ≡ −ln( ∫ √(pq) dλ ) = −ln(ρ),    (2)

where the term in parentheses, ρ(P,Q) ≡ ∫ √(pq) dλ, is called the Bhattacharyya coefficient [Bhattacharyya, 1943, Bhattacharyya, 1946] in pattern recognition, affinity in theoretical statistics, and fidelity in quantum information theory. Unlike the KLD, the Bhattacharyya distance avoids the requirement of absolute continuity. It is a special case of the Chernoff distance

    Cα(P,Q) ≡ −ln( ∫ p^α(x) q^(1−α)(x) dx ),

with α = 1/2. For discrete probability distributions, ρ ∈ [0,1] is interpreted as the scalar product of the probability vectors P = (√p1, √p2, ..., √pn) and Q = (√q1, √q2, ..., √qn). The Bhattacharyya distance is symmetric, positive semi-definite and unbounded (0 ≤ B ≤ ∞). It is finite as long as there exists some region S ⊂ X such that whenever x ∈ S: p(x)q(x) ≠ 0.

2.3 Bounded Bhattacharyya Distance Measures

In many applications, in addition to the desirable properties of the Bhattacharyya distance, boundedness is required. We propose a new family of bounded measures of the Bhattacharyya distance as

    B_{ψ,b}(P,Q) ≡ −log_b(ψ(ρ)),    (3)

where ρ = ρ(P,Q) is the Bhattacharyya coefficient and ψ_b(ρ) satisfies ψ(0) = b^(−1), ψ(1) = 1. In particular we choose the following form:

    ψ(ρ) = (1 − (1 − ρ)/α)^α,   b = (α/(α − 1))^α,    (4)

where α ∈ [−∞,0) ∪ (1,∞]. This gives the measure

    B_α(ρ(P,Q)) ≡ −log_{(1−1/α)^(−α)} [ (1 − (1 − ρ)/α)^α ],    (5)

which can be simplified to

    B_α(ρ) = log[1 − (1 − ρ)/α] / log[1 − 1/α].    (6)

It is easy to see that B_α(0) = 1 and B_α(1) = 0.

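For discrete distributions, Eqs. 2-6 reduce to elementary array operations. The sketch below is our own illustration (function names are not from the paper); it computes the Bhattacharyya coefficient ρ and B_α(ρ) from Eq. 6, and shows that large α approaches the squared Hellinger distance 1 − ρ discussed in the next subsection.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """rho(P,Q) = sum_i sqrt(p_i q_i), the discrete form of the coefficient in Eq. (2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.sqrt(p * q))

def bbd(rho, alpha):
    """Bounded Bhattacharyya distance B_alpha(rho), Eq. (6), for alpha in [-inf,0) U (1,inf]."""
    return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]
rho = bhattacharyya_coefficient(p, q)
print("rho          =", rho)
print("B_2          =", bbd(rho, 2.0))        # equals -log2((1+rho)/2)
print("B_1000       =", bbd(rho, 1000.0))     # close to the Hellinger limit
print("H^2 = 1 - rho =", 1.0 - rho)
```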
2.4 Special Cases

1. For α = 2 we get

    B_2(ρ) = −log_4[ ((1 + ρ)/2)² ] = −log_2[ (1 + ρ)/2 ].    (7)

We study some of its special properties in Sec. 3.7.

2. For α → ∞,

    B_∞(ρ) = −log_e[ e^(−(1−ρ)) ] = 1 − ρ = H²(ρ),    (8)

where H(ρ) is the Hellinger distance [Basseville, 1989, Kailath, 1967, Hellinger, 1909, Kakutani, 1948],

    H(ρ) ≡ √(1 − ρ(P,Q)).    (9)

3. For α = −1,

    B_{−1}(ρ) = −log_2[ 1/(2 − ρ) ].    (10)

4. For α → −∞,

    B_{−∞}(ρ) = log_e[ e^(1−ρ) ] = 1 − ρ = H²(ρ).    (11)

We note that BBD measures approach the squared Hellinger distance when α → ±∞. In general, they are convex (concave) in ρ when α > 1 (α < 0), as seen by evaluating the second derivative

    ∂²B_α(ρ)/∂ρ² = −1 / { α² log(1 − 1/α) [1 − (1 − ρ)/α]² }   { > 0 for α > 1;  < 0 for α < 0 }.    (12)

From this we deduce B_{α>1}(ρ) ≤ H²(ρ) ≤ B_{α<0}(ρ) for ρ ∈ [0,1]. A comparison between Hellinger and BBD measures for different values of α is shown in Fig. 1.

Figure 1: [Color Online] Comparison of Hellinger and bounded Bhattacharyya distance measures for different values of α.

2.5 Generalized BBD Measure

In decision problems involving more than two random variables, it is very useful to have divergence measures involving more than two distributions [Lin, 1991, Rao, 1982a, Rao, 1982b]. We use the generalized geometric mean (G) concept to define a bounded Bhattacharyya measure for more than two distributions. The geometric mean G_β({p_i}) of n variables p1, p2, ..., pn with weights β1, β2, ..., βn, such that βi ≥ 0 and ∑i βi = 1, is given by

    G_β({p_i}) = ∏_{i=1}^{n} p_i^{β_i}.

For n probability distributions P1, P2, ..., Pn with densities p1, p2, ..., pn, we define a generalized Bhattacharyya coefficient, also called the Matusita measure of affinity [Matusita, 1967, Toussaint, 1974]:

    ρ_β(P1, P2, ..., Pn) = ∫_Ω ∏_{i=1}^{n} p_i^{β_i} dλ,    (13)

where βi ≥ 0, ∑i βi = 1. Based on this, we define the generalized bounded Bhattacharyya measures as

    B_α(ρ_β(P1, P2, ..., Pn)) ≡ log(1 − (1 − ρ_β)/α) / log(1 − 1/α),    (14)

where α ∈ [−∞,0) ∪ (1,∞]. For brevity we denote it as B_α(ρ_β). Note that 0 ≤ ρ_β ≤ 1 and 0 ≤ B_α(ρ_β) ≤ 1, since the weighted geometric mean is maximized when all the p_i's are the same, and minimized when any two of the probability densities p_i are perpendicular to each other.

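A direct discrete implementation of Eqs. 13-14 is sketched below. This is our own illustration under the paper's definitions; the function names (e.g. `matusita_affinity`) are ours.

```python
import numpy as np

def matusita_affinity(dists, weights):
    """Generalized Bhattacharyya coefficient rho_beta, Eq. (13), for discrete distributions.

    dists   : (n, m) array, each row a probability distribution over m outcomes
    weights : (n,) array of beta_i >= 0 summing to 1
    """
    dists = np.asarray(dists, float)
    weights = np.asarray(weights, float)
    # weighted geometric mean of the n densities at each outcome, summed over outcomes
    return np.sum(np.prod(dists ** weights[:, None], axis=0))

def generalized_bbd(rho_beta, alpha):
    """Generalized bounded Bhattacharyya measure, Eq. (14)."""
    return np.log(1.0 - (1.0 - rho_beta) / alpha) / np.log(1.0 - 1.0 / alpha)

dists = [[0.25, 0.25, 0.25, 0.25],
         [0.40, 0.30, 0.20, 0.10],
         [0.10, 0.20, 0.30, 0.40]]
weights = [1/3, 1/3, 1/3]
rho_b = matusita_affinity(dists, weights)
print("rho_beta =", rho_b, "  B_2 =", generalized_bbd(rho_b, 2.0))
```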
3 PROPERTIES

3.1 Symmetry, Boundedness and Positive Semi-definiteness

Theorem 3.1. B_α(ρ(P,Q)) is symmetric, positive semi-definite and bounded in the interval [0,1] for α ∈ [−∞,0) ∪ (1,∞].

Proof. Symmetry: since ρ(P,Q) = ρ(Q,P), it follows that

    B_α(ρ(P,Q)) = B_α(ρ(Q,P)).

Positive semi-definiteness and boundedness: since B_α(0) = 1, B_α(1) = 0 and

    ∂B_α(ρ)/∂ρ = 1 / { α log(1 − 1/α) [1 − (1 − ρ)/α] } < 0

for 0 ≤ ρ ≤ 1 and α ∈ [−∞,0) ∪ (1,∞], it follows that

    0 ≤ B_α(ρ) ≤ 1.    (15)

3.2 Error Probability and Divergence Ranking

Here we recap the definition of error probability and prove the applicability of the Bradt and Karlin [Bradt and Karlin, 1956] theorem to the BBD measure.

Error probability: the optimal Bayes error probability (see e.g. [Ben-Bassat, 1978, Hellman and Raviv, 1970, Toussaint, 1978]) for classifying two events P1, P2 with densities p1(x) and p2(x) and prior probabilities Γ = {π1, π2} is given by

    P_e = ∫ min[π1 p1(x), π2 p2(x)] dx.    (16)

Error comparison: let p_i^β(x) (i = 1,2) be parameterized by β (e.g. β = {µ1, σ1; µ2, σ2} in the Gaussian case). In the signal detection literature, a signal set β is considered better than a set β′ for the densities p_i(x) when the error probability is less for β than for β′, i.e. P_e(β) < P_e(β′) [Kailath, 1967].

Divergence ranking: we can also rank the β parameters by means of some divergence D. The signal set β is better (in the divergence sense) than β′ if D_β(P1,P2) > D_{β′}(P1,P2).

In general it is not true that D_β(P1,P2) > D_{β′}(P1,P2) implies P_e(β) < P_e(β′). Bradt and Karlin proved the following theorem relating error probabilities and divergence ranking for the symmetric Kullback-Leibler divergence J.

Theorem 3.2 (Bradt and Karlin [Bradt and Karlin, 1956]). If J_β(P1,P2) > J_{β′}(P1,P2), then there exists a set of prior probabilities Γ = {π1, π2} for the two hypotheses g1, g2 for which

    P_e(β, Γ) < P_e(β′, Γ),    (17)

where P_e(β, Γ) is the error probability with parameter β and prior probability Γ.

It is clear that the theorem asserts existence, but gives no method of finding these prior probabilities. Kailath [Kailath, 1967] proved the applicability of the Bradt-Karlin theorem for the Bhattacharyya distance measure. We follow the same route and show that the B_α(ρ) measure satisfies a similar property, using the following theorem by Blackwell.

Theorem 3.3 (Blackwell [Blackwell, 1951]). P_e(β′, Γ) ≤ P_e(β, Γ) for all prior probabilities Γ if and only if

    E_{β′}[Φ(L_{β′}) | g2] ≤ E_β[Φ(L_β) | g2]

for all continuous concave functions Φ(L), where L_β = p1(x, β)/p2(x, β) is the likelihood ratio and E_ω[Φ(L_ω) | g2] is the expectation of Φ(L_ω) under the hypothesis g2 = P2.

Theorem 3.4. If B_α(ρ(β)) > B_α(ρ(β′)), or equivalently ρ(β) < ρ(β′), then there exists a set of prior probabilities Γ = {π1, π2} for the two hypotheses g1, g2 for which

    P_e(β, Γ) < P_e(β′, Γ).    (18)

Proof. The proof closely follows Kailath [Kailath, 1967]. First note that √L is a concave function of L (the likelihood ratio), and

    ρ(β) = ∑_{x∈X} √( p1(x,β) p2(x,β) ) = ∑_{x∈X} p2(x,β) √( p1(x,β)/p2(x,β) ) = E_β[ √(L_β) | g2 ].    (19)

Similarly,

    ρ(β′) = E_{β′}[ √(L_{β′}) | g2 ].    (20)

Hence ρ(β) < ρ(β′) implies

    E_β[ √(L_β) | g2 ] < E_{β′}[ √(L_{β′}) | g2 ].    (21)

Suppose the assertion of the stated theorem is not true: then for all Γ, P_e(β′, Γ) ≤ P_e(β, Γ). Then by Theorem 3.3, E_{β′}[Φ(L_{β′}) | g2] ≤ E_β[Φ(L_β) | g2] for all continuous concave Φ, which contradicts our result in Eq. 21.

3.3 Bounds on Error Probability

Error probabilities are hard to calculate in general. Tight bounds on P_e are often extremely useful in practice. Kailath [Kailath, 1967] has shown bounds on P_e in terms of the Bhattacharyya coefficient ρ:

    (1/2)[ 1 − √(1 − 4π1π2ρ²) ] ≤ P_e ≤ √(π1π2) ρ,    (22)

with π1 + π2 = 1. If the priors are equal, π1 = π2 = 1/2, the expression simplifies to

    (1/2)( 1 − √(1 − ρ²) ) ≤ P_e ≤ (1/2) ρ.    (23)

Inverting the relation in Eq. 6 for ρ(B_α), we can obtain the bounds in terms of the B_α(ρ) measure. For the equal prior probabilities case, the Bhattacharyya coefficient gives a tight upper bound for large systems when ρ → 0 (zero overlap) and the observations are independent and identically distributed. These bounds are also useful to discriminate between two processes with arbitrarily low error probability [Kailath, 1967]. We expect that tighter upper bounds on the error probability can be derived through Matusita's measure of affinity [Bhattacharya and Toussaint, 1982, Toussaint, 1977, Toussaint, 1975], but this is beyond the scope of the present work.
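The bounds in Eq. 23 are straightforward to evaluate numerically. Below is a minimal Python sketch of our own (names such as `error_probability_bounds` are not from the paper) that recovers ρ by inverting Eq. 6 for a given BBD value and then computes the equal-prior bounds on P_e.

```python
import numpy as np

def rho_from_bbd(b_alpha, alpha):
    """Invert Eq. (6): recover the Bhattacharyya coefficient rho from B_alpha(rho)."""
    return 1.0 - alpha * (1.0 - (1.0 - 1.0 / alpha) ** b_alpha)

def error_probability_bounds(rho):
    """Equal-prior Kailath bounds, Eq. (23): 0.5*(1 - sqrt(1 - rho^2)) <= Pe <= 0.5*rho."""
    lower = 0.5 * (1.0 - np.sqrt(1.0 - rho ** 2))
    upper = 0.5 * rho
    return lower, upper

alpha = 2.0
for b in [0.1, 0.3, 0.6, 0.9]:
    rho = rho_from_bbd(b, alpha)
    lo, hi = error_probability_bounds(rho)
    print(f"B_2 = {b:.2f}  ->  rho = {rho:.3f},  {lo:.4f} <= Pe <= {hi:.4f}")
```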
3.4 f-divergence

A class of divergence measures called f-divergences was introduced by Csiszar [Csiszar, 1967, Csiszar, 1975] and independently by Ali and Silvey [Ali and Silvey, 1966] (see [Basseville, 1989] for a review). It encompasses many well known divergence measures, including the KLD, variational, Bhattacharyya and Hellinger distances. In this section we show that the B_α(ρ) measure, for α ∈ (1,∞], belongs to the generic class of f-divergences defined by Basseville [Basseville, 1989].

f-divergence [Basseville, 1989]. Consider a measurable space Ω with σ-algebra B. Let λ be a measure on (Ω,B) such that any probability laws P and Q are absolutely continuous with respect to λ, with densities p and q. Let f be a continuous convex real function on R+, and g an increasing function on R. The class of divergence coefficients between two probabilities

    d(P,Q) = g( ∫_Ω f(p/q) q dλ )    (24)

are called f-divergence measures w.r.t. the functions (f,g). Here p/q = L is the likelihood ratio, and the term in the argument of g gives Csiszar's [Csiszar, 1967, Csiszar, 1975] definition of f-divergence.

The B_α(ρ(P,Q)) measure, for α ∈ (1,∞], can be written as the following f-divergence:

    f(x) = −1 + (1 − √x)/α,   g(F) = log(−F) / log(1 − 1/α),    (25)

where

    F = ∫_Ω f(p/q) q dλ = ∫_Ω [ q(−1 + 1/α) − (1/α)√(pq) ] dλ = −1 + (1 − ρ)/α,    (26)

and

    g(F) = log(1 − (1 − ρ)/α) / log(1 − 1/α) = B_α(ρ(P,Q)).    (27)

3.5 Curvature and Fisher Information

In statistics, the information that an observable random variable X carries about an unknown parameter θ (on which it depends) is given by the Fisher information. One of the important properties of the f-divergence of two distributions of the same parametric family is that its curvature measures the Fisher information. Following the approach pioneered by Rao [Rao, 1945], we relate the curvature of BBD measures to the Fisher information and derive the differential curvature metric. The discussion below closely follows [DasGupta, 2011].

Definition. Let { f(x|θ); θ ∈ Θ ⊆ R } be a family of densities indexed by a real parameter θ, with some regularity conditions (f(x|θ) is absolutely continuous). Define

    Z_θ(φ) ≡ B_α(θ,φ) = log(1 − (1 − ρ(θ,φ))/α) / log(1 − 1/α),    (28)

where ρ(θ,φ) = ∫ √( f(x|θ) f(x|φ) ) dx.

Theorem 3.5. The curvature of Z_θ(φ) at φ = θ is the Fisher information of f(x|θ), up to a multiplicative constant.

Proof. Expand Z_θ(φ) around θ:

    Z_θ(φ) = Z_θ(θ) + (φ − θ) dZ_θ(φ)/dφ |_{φ=θ} + ((φ − θ)²/2) d²Z_θ(φ)/dφ² |_{φ=θ} + ...    (29)

Let us observe some properties of the Bhattacharyya coefficient: ρ(θ,φ) = ρ(φ,θ), ρ(θ,θ) = 1, and its derivatives satisfy

    ∂ρ(θ,φ)/∂φ |_{φ=θ} = (1/2) (∂/∂θ) ∫ f(x|θ) dx = 0,    (30)

    ∂²ρ(θ,φ)/∂φ² |_{φ=θ} = −(1/4) ∫ (1/f) (∂f/∂θ)² dx + (1/2) (∂²/∂θ²) ∫ f dx
                          = −(1/4) ∫ f(x|θ) (∂ log f(x|θ)/∂θ)² dx = −(1/4) I_f(θ),    (31)

where I_f(θ) is the Fisher information of the distribution f(x|θ),

    I_f(θ) = ∫ f(x|θ) (∂ log f(x|θ)/∂θ)² dx.    (32)

Using the above relationships, we can write down the terms in the expansion of Eq. 29: Z_θ(θ) = 0, ∂Z_θ(φ)/∂φ |_{φ=θ} = 0, and

    ∂²Z_θ(φ)/∂φ² |_{φ=θ} = C(α) I_f(θ) > 0,    (33)

where C(α) = −1/(4α log(1 − 1/α)) > 0. The leading term of B_α(θ,φ) is therefore given by

    B_α(θ,φ) ∼ ((φ − θ)²/2) C(α) I_f(θ).    (34)
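As a quick numerical illustration of Eq. 34 (our own sketch, not part of the paper), one can compare B_α(θ, θ+δ) for two Gaussians N(θ, σ²) and N(θ+δ, σ²) against the quadratic approximation C(α) I_f(θ) δ²/2, using the standard closed form ρ = exp(−δ²/(8σ²)) for equal-variance Gaussians and I_f = 1/σ² for the mean parameter.

```python
import numpy as np

def bbd(rho, alpha):
    """Bounded Bhattacharyya distance, Eq. (6)."""
    return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

def c_alpha(alpha):
    """Curvature constant C(alpha) from Eq. (33)."""
    return -1.0 / (4.0 * alpha * np.log(1.0 - 1.0 / alpha))

alpha, sigma = 2.0, 1.0
fisher_info = 1.0 / sigma**2           # Fisher information of the mean of N(theta, sigma^2)
for delta in [0.01, 0.05, 0.1, 0.5]:
    rho = np.exp(-delta**2 / (8.0 * sigma**2))   # Bhattacharyya coefficient, equal variances
    exact = bbd(rho, alpha)
    quadratic = 0.5 * c_alpha(alpha) * fisher_info * delta**2   # leading term, Eq. (34)
    print(f"delta={delta:5.2f}  B_alpha={exact:.3e}  C(a)*I*d^2/2={quadratic:.3e}")
```

For small δ the two columns agree, while the difference grows with δ, consistent with Eq. 34 being only the leading term of the expansion.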
3.6 Differential Metrics

Rao [Rao, 1987] generalized the Fisher information to multivariate densities with vector valued parameters to obtain a "geodesic" distance between two parametric distributions P_θ, P_φ of the same family. The Fisher-Rao metric has found applications in many areas such as image structure and shape analysis [Maybank, 2004, Peter and Rangarajan, 2006], quantum mechanics [Brody and Hughston, 1998] and black hole thermodynamics [Quevedo, 2008]. We derive such a metric for the BBD measure using a property of f-divergences.

Let θ, φ ∈ Θ ⊆ R^p. Using the fact that ∂Z(θ,φ)/∂θ_i |_{φ=θ} = 0, we can easily show that

    dZ = ∑_{i,j=1}^{p} (∂²Z_θ/∂θ_i∂θ_j) dθ_i dθ_j + ··· = ∑_{i,j=1}^{p} g_{ij} dθ_i dθ_j + ... .    (35)

The curvature metric g_{ij} can be used to find the geodesic on the curve η(t), t ∈ [0,1], with

    C = { η(t) : η(0) = θ, η(1) = φ }.    (36)

Details of the geodesic equation are given in many standard differential geometry books; in the context of probability distance measures, the reader is referred to Sec. 15.4.2 of DasGupta [DasGupta, 2011] for details. The curvature metrics of all Csiszar f-divergences are just scalar multiples of the KLD curvature metric [DasGupta, 2011, Basseville, 1989], given by:

    g^f_{ij}(θ) = f″(1) g_{ij}(θ).    (37)

For our BBD measure,

    f″(x) = [ −1 + (1 − √x)/α ]″ = 1/(4α x^(3/2)),   so   f″(1) = 1/(4α).    (38)

Apart from the factor −1/log(1 − 1/α), this is the same as C(α) in Eq. 34. It follows that the geodesic distance for our metric is the same as the KLD geodesic distance up to a multiplicative factor. KLD geodesic distances are tabulated in DasGupta [DasGupta, 2011].

3.7 Relation to Other Measures

Here we focus on the special case α = 2, i.e. B_2(ρ).

Theorem 3.6.

    B_2 ≤ H² ≤ (log 4) B_2,    (39)

where the constants 1 and log 4 are sharp.

Proof. The sharpest upper bound is achieved by taking sup_{ρ∈[0,1)} H²(ρ)/B_2(ρ). Define

    g(ρ) ≡ (1 − ρ) / ( −log_2[(1 + ρ)/2] ).    (40)

g(ρ) is continuous and has no singularities whenever ρ ∈ [0,1). Hence

    g′(ρ) = log 2 · [ (1 − ρ)/(1 + ρ) + log((1 + ρ)/2) ] / log²((ρ + 1)/2) ≥ 0.

It follows that g(ρ) is non-decreasing, and hence sup_{ρ∈[0,1)} g(ρ) = lim_{ρ→1} g(ρ) = log 4. Thus

    H²/B_2 ≤ log 4.    (41)

Combining this with the convexity property of B_α(ρ) for α > 1 (Sec. 2.4), we get

    B_2 ≤ H² ≤ (log 4) B_2.

Using the same procedure we can prove a generic version of this inequality for α ∈ (1,∞], given by

    B_α(ρ) ≤ H² ≤ −α log(1 − 1/α) B_α(ρ).    (42)

Jensen-Shannon Divergence: the Jensen difference between two distributions P, Q with densities p, q and weights (λ1, λ2), λ1 + λ2 = 1, is defined as

    J_{λ1,λ2}(P,Q) = H_s(λ1 p + λ2 q) − λ1 H_s(p) − λ2 H_s(q),    (43)

where H_s is the Shannon entropy. The Jensen-Shannon divergence (JSD) [Burbea and Rao, 1982, Rao, 1982b, Lin, 1991] is based on the Jensen difference and is given by

    JS(P,Q) = J_{1/2,1/2}(P,Q) = (1/2) ∫ [ p log(2p/(p + q)) + q log(2q/(p + q)) ] dλ.    (44)

The structure and goals of the JSD and BBD measures are similar. The following theorem compares the two metrics using Jensen's inequality.

Lemma 3.7 (Jensen's inequality). For a convex function ψ, E[ψ(X)] ≥ ψ(E[X]).

Theorem 3.8 (Relation to the Jensen-Shannon measure).

    JS(P,Q) ≥ 2 log 2 · B_2(P,Q) − log 2.

We use the un-symmetrized Jensen-Shannon divergence for the proof.

Proof. We note that

    JS(P,Q) = ∫ p(x) log[ 2p(x)/(p(x) + q(x)) ] dx
            = −2 ∫ p(x) log √[ (p(x) + q(x))/(2p(x)) ] dx
            ≥ −2 ∫ p(x) log[ (√p(x) + √q(x)) / √(2p(x)) ] dx    (since √(p + q) ≤ √p + √q)
            = E_P[ −2 log( (√p(X) + √q(X)) / √(2p(X)) ) ].

By Jensen's inequality, E[−log f(X)] ≥ −log E[f(X)], we have

    E_P[ −2 log( (√p(X) + √q(X)) / √(2p(X)) ) ] ≥ −2 log E_P[ (√p(X) + √q(X)) / √(2p(X)) ].

Hence,

    JS(P,Q) ≥ −2 log ∫ [ (√p(x) + √q(x)) / √(2p(x)) ] p(x) dx
            = −2 log[ (1 + ∫ √(p(x)q(x)) dx) / 2 ] − log 2
            = 2 log 2 · B_2(p,q) − log 2
            = 2 log 2 · B_2(P,Q) − log 2.    (45)

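The inequalities in Theorems 3.6 and 3.8 can be spot-checked numerically for random discrete distributions. The sketch below is ours, written under the conventions above (B_2 in base-2 logarithms, the other logarithms natural, and the un-symmetrized JS of the proof of Theorem 3.8).

```python
import numpy as np

rng = np.random.default_rng(0)

def b2(rho):
    """B_2(rho) = -log2((1 + rho)/2), Eq. (7)."""
    return -np.log2((1.0 + rho) / 2.0)

for _ in range(5):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    rho = np.sum(np.sqrt(p * q))
    h2 = 1.0 - rho                                   # squared Hellinger distance
    js = np.sum(p * np.log(2 * p / (p + q)))         # un-symmetrized JS used in Theorem 3.8
    lhs, mid, rhs = b2(rho), h2, np.log(4.0) * b2(rho)
    assert lhs <= mid + 1e-12 and mid <= rhs + 1e-12                 # Theorem 3.6
    assert js >= 2 * np.log(2.0) * b2(rho) - np.log(2.0) - 1e-12     # Theorem 3.8
    print(f"B2={lhs:.4f}  H^2={mid:.4f}  log4*B2={rhs:.4f}  JS={js:.4f}")
```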
4 APPLICATION TO SIGNAL DETECTION

Signal detection is a common problem occurring in many fields such as communication engineering, pattern recognition and gravitational wave detection [Poor, 1994]. In this section we briefly describe the problem and the terminology used in signal detection, and illustrate through simple cases how divergence measures, in particular BBD, can be used for discriminating and detecting signals buried in white noise with correlator receivers (matched filters). For greater detail on the formalism used, we refer the reader to review articles in the context of gravitational wave detection by Jaranowski and Królak [Jaranowski and Królak, 2007] and Sam Finn [Finn, 1992].

One of the central problems in signal detection is to detect whether a deterministic signal s(t) is embedded in observed data x(t) corrupted by noise n(t). This can be posed as a hypothesis testing problem where the null hypothesis is the absence of the signal and the alternative is its presence. We take the noise to be additive, so that x(t) = n(t) + s(t). We define the following terms used in signal detection: the correlation G (also called the matched filter) between x and s, and the signal to noise ratio ρ [Finn, 1992, Budzyński et al., 2008]:

    G = (x|s),   ρ = √((s|s)),    (46)

where the scalar product (·|·) is defined by

    (x|y) := 4ℜ ∫_0^∞ [ x̃(f) ỹ*(f) / Ñ(f) ] df.    (47)

Here ℜ denotes the real part of a complex expression, the tilde denotes the Fourier transform and the asterisk denotes complex conjugation. Ñ is the one-sided spectral density of the noise. For white noise, the probability densities of G when the signal is present and absent, respectively, are given by [Budzyński et al., 2008]

    p1(G) = (1/(√(2π) ρ)) exp( −(G − ρ²)²/(2ρ²) ),    (48)
    p0(G) = (1/(√(2π) ρ)) exp( −G²/(2ρ²) ).    (49)

4.1 Distance Between Gaussian Processes

Consider a stationary Gaussian random process X, which has signals s1 or s2, with probability densities p1 and p2 respectively of being present in it. These densities follow the form of Eq. 48 with signal to noise ratios ρ1 and ρ2, respectively. The probability density p(X) of the Gaussian process can be modeled as a limit of multivariate Gaussian distributions. The divergence measures between these processes, d(s1,s2), are in general functions of the correlator (s1 − s2 | s1 − s2) [Budzyński et al., 2008]. Here we focus on distinguishing a monochromatic signal s(t) = A cos(ωt + φ) and filter s_F(t) = A_F cos(ω_F t + φ_F) (both buried in noise), separated in frequency or amplitude.

The Kullback-Leibler divergence between the signal and filter, I(s,s_F), is given by the correlation (s − s_F | s − s_F):

    I(s,s_F) = (s − s_F | s − s_F) = (s|s) + (s_F|s_F) − 2(s|s_F)
             = ρ² + ρ_F² − 2ρρ_F [ ⟨cos(∆ωt)⟩ cos(∆φ) − ⟨sin(∆ωt)⟩ sin(∆φ) ],    (50)

where ⟨·⟩ is the average over the observation time [0,T]. Here we have assumed that the noise spectral density N(f) = N0 is constant over the frequencies [ω, ω_F]. The SNRs are given by

    ρ² = A²T/N0,   ρ_F² = A_F²T/N0    (51)

(for detailed discussions we refer the reader to Budzyński et al. [Budzyński et al., 2008]).

The Bhattacharyya distance between Gaussian processes with signals of the same energy is just a multiple of the KLD, B = I/8 (Eq. 14 in [Kailath, 1967]). We use this result to extract the Bhattacharyya coefficient:

    ρ(s, s_F) = exp( −(s − s_F | s − s_F)/8 ).    (52)
4.1.1 Frequency Difference

Let us consider the case when the SNRs of the signal and filter are equal, the phase difference is zero, but the frequencies differ by ∆ω. The KL divergence is obtained by evaluating the correlator in Eq. 50,

    I(∆ω) = (s − s_F | s − s_F) = 2ρ² ( 1 − sin(∆ωT)/(∆ωT) ),    (53)

by noting that ⟨cos(∆ωt)⟩ = sin(∆ωT)/(∆ωT) and ⟨sin(∆ωt)⟩ = (1 − cos(∆ωT))/(∆ωT). Using this, the expression for the BBD family can be written as

    B_α(∆ω) = log[ 1 − (1 − e^{−(ρ²/4)(1 − sin(∆ωT)/(∆ωT))}) / α ] / log(1 − 1/α).    (54)

As we have seen in Section 3.4, both BBD and KLD belong to the f-divergence family. Their curvature for distributions belonging to the same parametric family is a constant times the Fisher information (FI) (see Theorem 3.5). Here we discuss where the BBD and KLD deviate from FI when we account for higher order terms in the expansion of these measures.

The Fisher matrix element for frequency is g_{ω,ω} = E[(∂ log Λ/∂ω)²] = ρ²T²/3 [Budzyński et al., 2008], where Λ is the likelihood ratio. Using the relation for the line element ds² = ∑_{i,j} g_{ij} dθ_i dθ_j, and noting that only the frequency is varied, we get

    ds = ρT∆ω/√3.    (55)

Using the relation between the curvature of the BBD measure and Fisher information in Eq. 34, we can see that for low frequency differences the line element varies as

    √( 2B_α(∆ω)/C(α) ) ∼ ds.

Similarly, √(2 d_KL) ∼ ds at low frequencies. However, at higher frequencies both KLD and BBD deviate from the Fisher information metric. In Fig. 2 we have plotted ds, √(2 d_KL) and √(2B_α(∆ω)/C(α)), with α = 2 and the Hellinger distance (α → ∞), for ∆ω ∈ (0, 0.1). We observe that up to ∆ω = 0.01 (i.e. ∆ωT ∼ 1), KLD and BBD follow the Fisher information, and after that they start to deviate. This suggests that the Fisher information is not sensitive to large deviations. There is not much difference between KLD, BBD and Hellinger for large frequency differences, due to the correlator G becoming essentially a constant over a wide range of frequencies.

Figure 2: Comparison of Fisher Information, KLD, BBD and Hellinger distance for two monochromatic signals differing by frequency ∆ω, buried in white noise. The inset shows the wider range ∆ω ∈ (0,1). We have set ρ = 1 and chosen parameters T = 100 and N0 = 10^4.

4.1.2 Amplitude Difference

We now consider the case where the frequency and phase of the signal and the filter are the same but they differ in amplitude by ∆A (which is reflected in differing SNRs). The correlation reduces to

    (s − s_F | s − s_F) = A²T/N0 + (A + ∆A)²T/N0 − 2A(A + ∆A)T/N0 = (∆A)²T/N0.    (56)

This gives I(∆A) = (∆A)²T/N0, which is the same as the line element ds² with the Fisher metric ds = √(T/(2N0)) ∆A. In Fig. 3 we have plotted ds, √(2 d_KL) and √(2B_α(∆A)/C(α)) for ∆A ∈ (0,40). The KLD and FI line elements are the same. Deviations of BBD and Hellinger can be observed only for ∆A > 10.

Figure 3: Comparison of the Fisher information line element with KLD, BBD and Hellinger distance for signals differing in amplitude and buried in white noise. We have set A = 1, T = 100 and N0 = 10^4.
Discriminating between two signals s1, s2 requires minimizing the error probability between them. By Theorem 3.4, there exist priors for which the problem translates into maximizing the divergence for BBD measures. For the monochromatic signals discussed above, the distance depends on the parameters (ρ1, ρ2, ∆ω, ∆φ). We can maximize the distance for a given frequency difference by differentiating with respect to the phase difference ∆φ [Budzyński et al., 2008]. In Fig. 4 we show the variation of the maximized BBD for different signal to noise ratios (ρ1, ρ2), for a fixed frequency difference ∆ω = 0.01. The intensity map shows different bands which can be used for setting the threshold for signal separation.

Figure 4: BBD for different signal to noise ratios at a fixed frequency difference. We have set T = 100 and ∆ω = 0.01.

Detecting a signal of known form involves minimizing the distance measure over the parameter space of the signal. A threshold on the maximum "distance" between the signal and filter can be set so that a detection is said to occur whenever the measure falls within this threshold. Based on a series of tests, Receiver Operating Characteristic (ROC) curves can be drawn to study the effectiveness of the distance measure in signal detection. We leave such details for future work.
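A minimal numerical sketch of the frequency-difference comparison is given below. It is our own illustration with hypothetical parameter choices (not the authors' code); it evaluates B_α(∆ω) from Eq. 54 together with the Fisher-information line element ds of Eq. 55, the quantities compared in Fig. 2.

```python
import numpy as np

def bbd(rho, alpha):
    """Bounded Bhattacharyya distance, Eq. (6)."""
    return np.log(1.0 - (1.0 - rho) / alpha) / np.log(1.0 - 1.0 / alpha)

def c_alpha(alpha):
    """Curvature constant C(alpha), Eq. (33)."""
    return -1.0 / (4.0 * alpha * np.log(1.0 - 1.0 / alpha))

def bbd_frequency(delta_omega, snr, T, alpha):
    """B_alpha between two monochromatic signals differing in frequency by delta_omega (Eq. 54)."""
    x = delta_omega * T
    sinc = np.sinc(x / np.pi)                              # sin(x)/x with the limit 1 at x = 0
    rho_coeff = np.exp(-(snr**2 / 4.0) * (1.0 - sinc))     # Eqs. (52)-(53)
    return bbd(rho_coeff, alpha)

snr, T, alpha = 1.0, 100.0, 2.0                            # illustrative values, as in Fig. 2
for dw in [1e-3, 5e-3, 1e-2, 5e-2, 1e-1]:
    ds = snr * T * dw / np.sqrt(3.0)                       # Fisher line element, Eq. (55)
    d_bbd = np.sqrt(2.0 * bbd_frequency(dw, snr, T, alpha) / c_alpha(alpha))
    print(f"dw={dw:6.3f}   ds={ds:8.4f}   sqrt(2*B_a/C(a))={d_bbd:8.4f}")
```

For ∆ωT ≲ 1 the two columns track each other, and they separate for larger ∆ω, in line with the discussion of Fig. 2.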

5 SUMMARY AND OUTLOOK

In this work we have introduced a new family of bounded divergence measures based on the Bhattacharyya distance, called bounded Bhattacharyya distance (BBD) measures. We have shown that it belongs to the class of generalized f-divergences and inherits their properties, such as those relating Fisher information and the curvature metric. We have discussed several special cases of our measure, in particular the squared Hellinger distance, and studied its relation with other measures such as the Jensen-Shannon divergence. We have also extended the Bradt-Karlin theorem on error probabilities to the BBD measure. Tight bounds on Bayes error probabilities can be obtained by using properties of the Bhattacharyya coefficient.

Although many bounded divergence measures have been studied and used in various applications, no single measure is useful in all types of problems studied. Here we have illustrated an application to the signal detection problem by considering the "distance" between a monochromatic signal and a filter buried in white Gaussian noise, differing in frequency or amplitude, and comparing it to the Fisher information and Kullback-Leibler divergence.

A detailed study with chirp-like signals and colored noise occurring in gravitational wave detection will be taken up in a future study. Although our measures have a tunable parameter α, here we have focused on the special case α = 2. In many practical applications where extremum values are desired, such as minimal error or minimal false acceptance/rejection ratio, exploring the BBD measure by varying α may be desirable. Further, the utility of BBD measures is to be explored in parameter estimation based on minimal disparity estimators and the Divergence Information Criterion in Bayesian model selection [Basu and Lindsay, 1994]. However, since the focus of the current paper is introducing a new measure and studying its basic properties, we leave such applications to statistical inference and data processing to future studies.

ACKNOWLEDGEMENTS

One of us (S.J.) thanks Rahul Kulkarni for insightful discussions, Anand Sengupta for discussions on the application to signal detection, and acknowledges financial support in part by grants DMR-0705152 and DMR-1005417 from the US National Science Foundation. M.S. would like to thank the Penn State Electrical Engineering Department for support.

REFERENCES

Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 28(1):131–142.

Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18:349–369.

Basu, A. and Lindsay, B. G. (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46(4):683–705.

Ben-Bassat, M. (1978). f-entropies, probability of error, and feature selection. Information and Control, 39(3):227–242.

Bhattacharya, B. K. and Toussaint, G. T. (1982). An upper bound on the probability of misclassification in terms of Matusita's measure of affinity. Annals of the Institute of Statistical Mathematics, 34(1):161–165.

Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109.

Bhattacharyya, A. (1946). On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics (1933-1960), 7(4):401–406.

Blackwell, D. (1951). Comparison of experiments. In Second Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 93–102.

Bradt, R. and Karlin, S. (1956). On the design and comparison of certain dichotomous experiments. The Annals of Mathematical Statistics, pages 390–409.

Brody, D. C. and Hughston, L. P. (1998). Statistical geometry in quantum mechanics. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 454(1977):2445–2475.

Budzyński, R. J., Kondracki, W., and Królak, A. (2008). Applications of distance between probability distributions to gravitational wave data analysis. Classical and Quantum Gravity, 25(1):015005.

Burbea, J. and Rao, C. R. (1982). On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507.

Choi, E. and Lee, C. (2003). Feature extraction based on the Bhattacharyya distance. Pattern Recognition, 36(8):1703–1709.

Csiszar, I. (1967). Information-type distance measures and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318.

Csiszar, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158.

DasGupta, A. (2011). Probability for Statistics and Machine Learning. Springer Texts in Statistics. Springer, New York.

Finn, L. S. (1992). Detection, measurement, and gravitational radiation. Physical Review D, 46(12):5236.

Gibbs, A. and Su, F. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435.

Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelle's Journal), (136):210–271.

Hellman, M. E. and Raviv, J. (1970). Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 16(4):368–372.

Jaranowski, P. and Królak, A. (2007). Gravitational-wave data analysis. Formalism and sample applications: the Gaussian case. arXiv preprint arXiv:0711.1115.

Kadota, T. and Shepp, L. (1967). On the best finite set of linear observables for discriminating two Gaussian signals. IEEE Transactions on Information Theory, 13(2):278–284.

Kailath, T. (1967). The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communications, 15(1):52–60.

Kakutani, S. (1948). On equivalence of infinite product measures. The Annals of Mathematics, 49(1):214–224.

Kapur, J. (1984). A comparative assessment of various measures of directed divergence. Advances in Management Studies, 3(1):1–16.

Kullback, S. (1968). Information Theory and Statistics. Dover, New York, 2nd edition.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Kumar, U., Kumar, V., and Kapur, J. N. (1986). Some normalized measures of directed divergence. International Journal of General Systems, 13(1):5–16.

Lamberti, P. W., Majtey, A. P., Borras, A., Casas, M., and Plastino, A. (2008). Metric character of the quantum Jensen-Shannon divergence. Physical Review A, 77:052311.

Lee, Y.-T. (1991). Information-theoretic distortion measures for speech recognition. IEEE Transactions on Signal Processing, 39(2):330–335.

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

Matusita, K. (1967). On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19(1):181–192.

Maybank, S. J. (2004). Detection of image structures using the Fisher information and the Rao metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12):1579–1589.

Nielsen, F. and Boltz, S. (2011). The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466.

Nielsen, M. and Chuang, I. (2000). Quantum Computation and Quantum Information. Cambridge University Press, Cambridge, UK.

Peter, A. and Rangarajan, A. (2006). Shape analysis using the Fisher-Rao Riemannian metric: Unifying shape representation and deformation. In Biomedical Imaging: Nano to Macro, 3rd IEEE International Symposium on, pages 1164–1167. IEEE.

Poor, H. V. (1994). An Introduction to Signal Detection and Estimation. Springer.

Qiao, Y. and Minematsu, N. (2010). A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884–3890.

Quevedo, H. (2008). Geometrothermodynamics of black holes. General Relativity and Gravitation, 40(5):971–984.

Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81–91.

Rao, C. R. (1982a). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 1–22.

Rao, C. R. (1982b). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21(1):24–43.

Rao, C. R. (1987). Differential metrics in probability spaces. Differential Geometry in Statistical Inference, 10:217–240.

Royden, H. (1986). Real Analysis. Macmillan Publishing Company, New York.

Sunmola, F. T. (2013). Optimising Learning with Transferable Prior Information. PhD thesis, University of Birmingham.

Toussaint, G. T. (1974). Some properties of Matusita's measure of affinity of several distributions. Annals of the Institute of Statistical Mathematics, 26(1):389–394.

Toussaint, G. T. (1975). Sharper lower bounds for discrimination information in terms of variation. IEEE Transactions on Information Theory, 21(1):99–100.

Toussaint, G. T. (1977). An upper bound on the probability of misclassification in terms of the affinity. Proceedings of the IEEE, 65(2):275–276.

Toussaint, G. T. (1978). Probability of error, expected divergence and the affinity of several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8(6):482–485.

Tumer, K. and Ghosh, J. (1996). Estimating the Bayes error rate through classifier combining. In Proceedings of the 13th International Conference on Pattern Recognition, pages 695–699.

Varshney, K. R. (2011). Bayes risk error is a Bregman divergence. IEEE Transactions on Signal Processing, 59(9):4470–4472.

Varshney, K. R. and Varshney, L. R. (2008). Quantization of prior probabilities for hypothesis testing. IEEE Transactions on Signal Processing, 56(10):4553.

APPENDIX

BBD measures of some common distributions

Here we provide explicit expressions of the BBD measure B_2 for some common distributions. For brevity we denote ζ ≡ B_2.

• Binomial:

    P(k) = (n choose k) p^k (1 − p)^(n−k),   Q(k) = (n choose k) q^k (1 − q)^(n−k);

    ζ_bin(P,Q) = −log_2[ (1 + [√(pq) + √((1 − p)(1 − q))]^n) / 2 ].    (57)

• Poisson:

    P(k) = λ_p^k e^(−λ_p)/k!,   Q(k) = λ_q^k e^(−λ_q)/k!;

    ζ_poisson(P,Q) = −log_2[ (1 + e^(−(√λ_p − √λ_q)²/2)) / 2 ].    (58)

• Gaussian:

    P(x) = (1/(√(2π) σ_p)) exp( −(x − x_p)²/(2σ_p²) ),   Q(x) = (1/(√(2π) σ_q)) exp( −(x − x_q)²/(2σ_q²) );

    ζ_Gauss(P,Q) = 1 − log_2[ 1 + √( 2σ_pσ_q/(σ_p² + σ_q²) ) exp( −(x_p − x_q)²/(4(σ_p² + σ_q²)) ) ].    (59)

• Exponential: P(x) = λ_p e^(−λ_p x), Q(x) = λ_q e^(−λ_q x);

    ζ_exp(P,Q) = −log_2[ (√λ_p + √λ_q)² / (2(λ_p + λ_q)) ].    (60)

• Pareto: assuming the same cut-off x_m,

    P(x) = α_p x_m^(α_p) / x^(α_p+1) for x ≥ x_m, and 0 for x < x_m,    (61)
    Q(x) = α_q x_m^(α_q) / x^(α_q+1) for x ≥ x_m, and 0 for x < x_m;    (62)

    ζ_pareto(P,Q) = −log_2[ (√α_p + √α_q)² / (2(α_p + α_q)) ].    (63)
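The closed forms above are easy to check numerically. Below is a short sketch of our own (not from the paper) that verifies the Gaussian expression in Eq. 59 by computing the Bhattacharyya coefficient with numerical integration and comparing ζ = B_2(ρ) against the closed form.

```python
import numpy as np
from scipy.integrate import quad

def b2(rho):
    """BBD with alpha = 2: B_2(rho) = -log2((1 + rho)/2), Eq. (7)."""
    return -np.log2((1.0 + rho) / 2.0)

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def zeta_gauss_closed(mu_p, sigma_p, mu_q, sigma_q):
    """Closed form of Eq. (59)."""
    s2 = sigma_p ** 2 + sigma_q ** 2
    rho = np.sqrt(2.0 * sigma_p * sigma_q / s2) * np.exp(-(mu_p - mu_q) ** 2 / (4.0 * s2))
    return 1.0 - np.log2(1.0 + rho)

mu_p, sigma_p, mu_q, sigma_q = 0.0, 1.0, 1.5, 2.0
rho_num, _ = quad(lambda x: np.sqrt(gauss(x, mu_p, sigma_p) * gauss(x, mu_q, sigma_q)),
                  -np.inf, np.inf)
print("numerical  :", b2(rho_num))
print("closed form:", zeta_gauss_closed(mu_p, sigma_p, mu_q, sigma_q))
```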