
Preprint. Accepted for publication in Inform. Sciences (2013) 235: 214–223

Measures of statistical dispersion based on Shannon and Fisher information concepts

Lubomir Kostal,∗ Petr Lansky, and Ondrej Pokora
Institute of Physiology of the Czech Academy of Sciences, Videnska 1083, 14220 Prague 4, Czech Republic
∗ E-mail: [email protected]

We propose and discuss two information-based measures of statistical dispersion of positive continuous random variables: the entropy-based dispersion and the Fisher information-based dispersion. Although standard deviation is the most frequently employed dispersion measure, we show that it is not well suited to quantify some aspects that are often expected intuitively, such as the degree of randomness. The proposed dispersion measures are not entirely independent, though each describes the quality of the probability distribution from a different point of view. We discuss relationships between the measures, describe their extremal values and illustrate their properties on the Pareto, the lognormal and the lognormal mixture distributions. Application possibilities are also mentioned.

Keywords: Statistical dispersion, Entropy, Fisher information, Positive random variables

1. INTRODUCTION

In recent years, information-based measures of randomness (or "regularity") of signals have gained popularity in various branches of science [1–4]. In this paper we construct measures of statistical dispersion based on Shannon and Fisher information concepts and we describe their properties and mutual relationships. The effort was initiated in [5], where the entropy-based dispersion was employed to quantify certain aspects of neuronal timing precision. Here we extend the previous effort by taking into account the concept of Fisher information (FI), which was employed in different contexts [2, 6–9]. In particular, FI about the location parameter has been employed in the analysis of EEG [8, 10], of the atomic shell structure [11] (together with Shannon entropy) or in the description of variations among the two-electron correlated wavefunctions [12].

The goal of this paper is to propose different dispersion measures and to justify their usefulness. Although the standard deviation is used ubiquitously for characterization of variability, it is not well suited to quantify certain "intuitively intelligible" properties of the underlying probability distribution. For example, highly variable data might not be random at all if it only consists of "very small" and "very large" values. Although the probability density function (or histogram of data) provides a complete view, one needs quantitative methods in order to make a comparison between different experimental scenarios.

The methodology investigated here does not adhere to any specific field of applications. We believe that the general results are of interest to a wide group of researchers who deal with positive continuous random variables, in theory or in experiments.

2. MEASURES OF DISPERSION

2.1. Generic case: standard deviation

We consider a continuous positive random variable (r.v.) T with a probability density function (p.d.f.) f(t) and finite first two moments. Generally, statistical dispersion is a measure of "variability" or "spread" of the distribution of r.v. T, and such a measure has the same physical units as T. There are different dispersion measures described in the literature and employed in different contexts, e.g., the standard deviation, the interquartile range, the mean difference or the LV coefficient [13–16].

By far, the most common measure of dispersion is the standard deviation, σ, defined as

    σ = √( E[ (T − E(T))² ] ).  (1)

The corresponding relative dispersion measure is obtained by dividing σ with E(T). The resulting quantity is denoted as the coefficient of variation, CV,

    CV = σ / E(T).  (2)

The main advantage of CV over σ is that CV is dimensionless and thus probability distributions with different means can be compared meaningfully.

From Eq. (1) follows that σ (or CV) essentially measures how off-centered (with respect to E(T)) is the distribution of T. Furthermore, since the difference (T − E(T)) is squared in Eq. (1), it follows that σ is sensitive to outlying values [15]. On the other hand, σ does not quantify how random, or unpredictable, are the outcomes of r.v. T. Namely, a high value of σ (high variability) does not indicate that the possible values of T are distributed evenly [5].
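To make this point concrete, the following minimal Python sketch (not part of the original paper; the bimodal test sample is an illustrative assumption) evaluates the sample counterparts of Eqs. (1) and (2) for data that are highly variable yet concentrated around two values only.

```python
# Sample versions of Eqs. (1)-(2): a large C_V does not imply evenly spread outcomes.
import numpy as np

rng = np.random.default_rng(0)
# most mass near 0.1, occasional values near 10 -> highly variable, not "even"
t = np.where(rng.random(100_000) < 0.9, 0.1, 10.0) * rng.lognormal(0, 0.05, 100_000)
sigma = t.std()              # Eq. (1), sample version
C_V = sigma / t.mean()       # Eq. (2)
print(C_V)                   # roughly 2.7: strongly overdispersed, yet values cluster at two points
```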

2.2. Dispersion measure based on Shannon entropy

For continuous r.v.'s the association between entropy and randomness is less straightforward than for discrete r.v.'s.

The (differential) entropy h(T) of r.v. T is defined as [17]

    h(T) = −∫₀^∞ f(t) ln f(t) dt.  (3)

The value of h(T) may be positive or negative, therefore h(T) is not directly usable as a measure of statistical dispersion [5]. In order to obtain a properly behaving quantity, the entropy-based dispersion, σh, is defined as

    σh = exp[h(T)].  (4)

The interpretation of σh relies on the asymptotic equipartition property theorem [17]. Informally, the theorem states that almost any sequence of n realizations of the random variable T comes from a rather small subset (the typical set) in the n-dimensional space of all possible values. The volume of this subset is approximately σh^n = exp[n h(T)], and the volume is bigger for those random variables which generate more diverse (or unpredictable) realizations. A further connection between σ and σh follows from the analogue to the entropy power concept [17]: σh/e is equal to the standard deviation of an exponentially distributed random variable with entropy equal to h(T).

Analogously to Eq. (2), we define the relative entropy-based dispersion coefficient, Ch, as

    Ch = σh / E(T).  (5)

The values of σh and Ch quantify how "evenly" the probability is distributed over the entire support. From this point of view, σh is more appropriate than σ when discussing randomness of data generated by r.v. T.

2.3. Dispersion measure based on Fisher information

The FI plays a key role in the theory of statistical estimation of continuously varying parameters [18]. Let X ∼ p(x; θ) be a family of r.v.'s defined for all values of parameter θ ∈ Θ, where Θ is an open subset of the real line. Let θ̂(X) be an unbiased estimator of parameter θ, i.e., E[θ̂(X) − θ] = 0. If for all θ ∈ Θ and both ϕ(x) ≡ 1 and ϕ(x) ≡ θ̂(x) the following equation is satisfied ([19, p. 169] or [18, p. 31]),

    ∂/∂θ ∫_X ϕ(x) p(x; θ) dx = ∫_X ϕ(x) ∂p(x; θ)/∂θ dx,  (6)

then the variance of the estimator θ̂(X) satisfies the Cramer-Rao bound,

    Var(θ̂(X)) ≥ 1/J(θ|X),  (7)

where

    J(θ|X) = ∫_X p(x; θ) [∂ ln p(x; θ)/∂θ]² dx  (8)

is the FI about parameter θ contained in a single observation of r.v. X.

Exact conditions (the regularity conditions) under which Eq. (7) holds are stated slightly differently by different authors. In particular, it is sometimes required that the set {x : p(x; θ) > 0} does not depend on θ, which is an unnecessarily strict assumption [18]. For any given p.d.f. f(t) one may conveniently "generate" a simple parametric family by introducing a location parameter. The appropriate regularity conditions for this case are stated below.

The family of location parameter densities p(x; θ) satisfies

    p(x; θ) = p0(x − θ),  (9)

where we consider Θ to be the whole real line and p0(x) is the p.d.f. of the "generating" r.v. X0. Let the location family p(x; θ) be generated by the r.v. T ∼ f(t), thus p(x; θ) = f(x − θ) and Eq. (8) can be written as

    J(θ|X) = ∫_θ^∞ f(x − θ) [∂ ln f(x − θ)/∂θ]² dx = ∫₀^∞ f(t) [∂ ln f(t)/∂t]² dt ≡ J(T),  (10)

where the last equality follows from the fact that the derivatives of f(x − θ) with respect to θ or x are equal up to a sign, and due to the location-invariance of the integral (thus justifying the notation J(T)). Since the value of J(T) depends only on the "shape" of the p.d.f. f(t), it is sometimes denoted as the FI about the random variable T [17].

To interpret J(T) according to the Cramer-Rao bound in Eq. (7), the required regularity conditions on f(t) are: f(t) must be continuously differentiable for all t > 0 and f(0) = f′(0) = 0. The integral (10) may exist and be finite even if f(t) does not satisfy these conditions, e.g., if f(t) is differentiable almost everywhere or f(0) ≠ 0. However, in such a case the value of J(T) does not provide any information about the efficiency of the location parameter estimation [18].

The units of J(T) correspond to the inverse of the squared units of T, therefore we propose the FI-based dispersion measure, σJ, as

    σJ = 1/√J(T).  (11)

Heuristically, σJ quantifies the change in the p.d.f. f(t) subject to an infinitesimally small shift δθ in t, i.e., it quantifies the difference between f(t) and f(t − δθ). Any peak, or generally any "non-smoothness" in the shape of f(t), decreases σJ. Analogously to Eqns. (2) and (5) we define the relative dispersion coefficient CJ as

    CJ = σJ / E(T).  (12)

In this paper we do not introduce different symbols for CJ in dependence on whether the Cramer-Rao bound holds or not. We evaluate CJ whenever the integral in Eq. (10) exists and we comment on the regularity conditions in the text.
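In practice all three coefficients can be evaluated by direct numerical quadrature once f(t) is known. The following Python sketch (not part of the original paper) computes CV, Ch and CJ from Eqs. (2), (3)–(5) and (10)–(12); the lognormal test density, the integration limits and the finite-difference step are illustrative assumptions.

```python
# A minimal sketch: C_V, C_h and C_J by numerical quadrature for a density on (0, inf).
import numpy as np
from scipy.integrate import quad

def lognormal_pdf(t, m=0.0, s=0.5):
    # illustrative test density (lognormal with log-mean m and log-sd s)
    return np.exp(-(np.log(t) - m) ** 2 / (2 * s ** 2)) / (t * s * np.sqrt(2 * np.pi))

def dispersion_coefficients(f, lo=1e-4, hi=50.0, dt=1e-5):
    mean = quad(lambda t: t * f(t), lo, hi)[0]
    var = quad(lambda t: (t - mean) ** 2 * f(t), lo, hi)[0]
    C_V = np.sqrt(var) / mean                                # Eq. (2)
    h = quad(lambda t: -f(t) * np.log(f(t)), lo, hi)[0]      # Eq. (3)
    C_h = np.exp(h) / mean                                   # Eqs. (4)-(5)
    # Fisher information J(T) of Eq. (10), with f'(t) from a central difference
    J = quad(lambda t: (f(t + dt) - f(t - dt)) ** 2 / (4 * dt ** 2 * f(t)), lo, hi)[0]
    C_J = 1.0 / (np.sqrt(J) * mean)                          # Eqs. (11)-(12)
    return C_V, C_h, C_J

print(dispersion_coefficients(lognormal_pdf))
```

For the lognormal case the output can be checked against the closed-form expressions given later in Eqs. (36) and (37).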

3. RESULTS

3.1. Extrema of variability

Generally, the value CV can be any non-negative real number, 0 ≤ CV < ∞. The lower bound, CV → 0, is approached by a p.d.f. highly peaked at the mean value, in the limit corresponding to the Dirac delta function, f(t) = δ(t − E(T)). There is, however, no unique upper bound distribution for which CV → ∞. For example, the p.d.f. examples analyzed in the next section allow arbitrarily high values of CV and yet their shapes are different.

3.2. Extrema of entropy and its relation to variability

The relation between CV and entropy was investigated in a series of papers [5, 20, 21]. The results can be re-stated in terms of Ch as follows. From the definition of Ch by Eq. (5) and from the properties of entropy [17] it follows that 0 < Ch ≤ e. The lower bound, Ch → 0, is not realized by any unique distribution. Highly-peaked (possibly multimodal) densities approach the bound and in the limit any discrete-valued distribution achieves it. From this fact it follows that the relationship between CV and Ch is not unique: small CV implies small Ch but not vice versa.

The maximum value of Ch is connected with the problem of maximum entropy (ME), which is well known in the literature, see e.g. [17, 22]. The goal is to find such a p.d.f. that maximizes the functional (3) subject to n constraints of the form E(αi(T)) = ξi, where αi(t) and ξi are known and i = 1, ..., n. The ME p.d.f. satisfying these constraints can be written in the form [17]

    f(t) = [1/Z(λ1, ..., λn)] exp( Σ_{i=1}^{n} λi αi(t) ),  (13)

where the "partition function" Z(λ1, ..., λn) is the normalization factor, Z(λ1, ..., λn) = ∫₀^∞ exp( Σ_{i=1}^{n} λi αi(t) ) dt. The introduced Lagrange multipliers, λi, are related to the averages ξi as [22]

    −∂/∂λi ln Z(λ1, ..., λn) = ξi.  (14)

It is well known [17] that the distribution maximizing the entropy on [0, ∞) for given E(T) is the exponential distribution,

    f(t) = [1/E(T)] exp( −t/E(T) ),  (15)

with entropy h(T) = 1 + ln E(T). Thus the upper bound, Ch = e, is unique: it is achieved only if f(t) is exponential. For the exponential distribution holds CV = 1, however, non-exponential distributions may have CV = 1 too (see the next section). In other words, the maximum of Ch does not correspond to any exclusive value of CV. This fact highlights the main difference between these two measures: the variability (described by CV) and randomness (described by Ch) are not interchangeable notions. High variability (overdispersion), CV > 1, results in decreased randomness for many common distributions, see Fig. 2, although there are exceptions, e.g., the Pareto distribution discussed later.

In order to find the ME distribution on [0, ∞) given both E(T) and CV, we first realize that the problem is equivalent to finding the ME distribution given E(T) and E(T²). Applying the Lagrange formalism results in a p.d.f. based on the Gaussian, with the probability of all negative values aliased onto the positive half-line,

    f(t) = (1/Z) exp( −(t − α)²/(2β²) ),  (16)

where

    Z = β √(π/2) [ 1 + erf( α/(√2 β) ) ].  (17)

The density in Eq. (16) is also known as the density of the folded normal r.v. [23]. The parameters α, β > 0 and E(T), CV are related as

    E(T) = α + (β²/Z) exp( −α²/(2β²) ),  (18)

    CV = β √[ exp(α²/β²) − (α/Z) exp(α²/(2β²)) − β²/Z² ] × [ α exp(α²/(2β²)) + β²/Z ]^(−1).  (19)

The entropy and FI can be calculated for Eq. (16) to be

    h(T) = 1/2 − (α/(2Z)) exp( −α²/(2β²) ) + ln Z,  (20)

    J(T) = (1/β²) [ 1 − (α/Z) exp( −α²/(2β²) ) ].  (21)

Note that CV in Eq. (19) is limited to CV ∈ (0, 1), and therefore the p.d.f. in Eq. (16) provides a solution to the ME problem only in this range. The density of the ME distribution given by Eq. (16) is shown for different values of CV in Fig. 1. Although it is not possible to express α, β in terms of E(T), CV from Eqns. (18) and (19), we obtain all distinct shapes (neglecting the scale) of the folded normal density by fixing, e.g., β = 1 and varying α ∈ (−∞, ∞), since lim_{α→−∞} CV(α) = 1, lim_{α→∞} CV(α) = 0 and noting that CV(α) is monotonously decreasing. In the limit CV = 1 the density in Eq. (16) becomes exponential, and for CV > 1 there is no unique ME distribution. However, we can always construct a p.d.f. with CV > 1 which is arbitrarily close to the exponential p.d.f., e.g., an almost-exponential density with a small peak located at some large value of t. Therefore, the maximum value of entropy is 1 + ln E(T) − ε for CV > 1, where ε > 0 can be arbitrarily small. The corresponding Ch is shown in Fig. 2.
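The relations (17)–(21) are straightforward to verify numerically. The sketch below (not part of the original paper; the parameter values α = 0.8, β = 1 and the finite integration range are illustrative assumptions) evaluates them and cross-checks the mean, CV, entropy and FI by quadrature.

```python
# Checking Eqs. (16)-(21) for the maximum-entropy density on (0, inf).
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

alpha, beta = 0.8, 1.0                                                     # illustrative values
Z = beta * np.sqrt(np.pi / 2) * (1 + erf(alpha / (np.sqrt(2) * beta)))     # Eq. (17)
f = lambda t: np.exp(-(t - alpha) ** 2 / (2 * beta ** 2)) / Z              # Eq. (16)

w = np.exp(-alpha ** 2 / (2 * beta ** 2))
mean = alpha + beta ** 2 * w / Z                                           # Eq. (18)
h = 0.5 - alpha * w / (2 * Z) + np.log(Z)                                  # Eq. (20)
J = (1 - alpha * w / Z) / beta ** 2                                        # Eq. (21)
C_V = beta * np.sqrt(np.exp(alpha ** 2 / beta ** 2)
                     - alpha * np.exp(alpha ** 2 / (2 * beta ** 2)) / Z
                     - beta ** 2 / Z ** 2) / (alpha * np.exp(alpha ** 2 / (2 * beta ** 2))
                                              + beta ** 2 / Z)             # Eq. (19)

hi = 30.0   # upper limit adequate for these parameter values
var_q = quad(lambda t: (t - mean) ** 2 * f(t), 0, hi)[0]
print(mean, quad(lambda t: t * f(t), 0, hi)[0])                     # closed form vs quadrature
print(C_V, np.sqrt(var_q) / mean)
print(h, quad(lambda t: -f(t) * np.log(f(t)), 0, hi)[0])
print(J, quad(lambda t: f(t) * ((t - alpha) / beta ** 2) ** 2, 0, hi)[0])
```

For these parameter values the resulting CV lies in (0, 1), consistent with the range of validity noted above.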

3.3. Extrema of Fisher information and its relation to entropy

From Eqns. (10) and (12) follows CJ > 0. Similarly to Ch, the lower bound is not achieved by a unique distribution, since any continuous, highly peaked density (possibly multimodal) approaches it. Determination of the maximum value of CJ is, however, more difficult. In the following we solve the problem of CJ maximization (FI minimization) subject to ξ = E(T), both when the regularity conditions hold and when they do not, see Fig. 1.

It is convenient [1, 2] to rewrite the FI functional by employing the real probability amplitude u(t) = √f(t), so that Eq. (10) becomes

    J(T) = 4 ∫_T u′(t)² dt,  (22)

where u′(t) = du(t)/dt. The extremum of FI satisfies the Euler-Lagrange equation

    ∂L/∂u − d/dt (∂L/∂u′) = 0,  (23)

where the Lagrangian L is

    L = ∫_T u′(t)² dt + λ1 [ ∫_T u(t)² dt − 1 ] + λ2 [ ∫_T t u(t)² dt − ξ ],  (24)

and the multiplicative constants resulting from the substitution f → u are contained in the Lagrange multipliers λ1, λ2. Substituting from Eq. (24) into Eq. (23) results in the differential equation

    u″(t) − u(t)[λ1 + λ2 t] = 0.  (25)

The solution to this equation can be written as [24]

    u(t) = C1 Ai( (λ1 + λ2 t)/λ2^(2/3) ) + C2 Bi( (λ1 + λ2 t)/λ2^(2/3) ),  (26)

where C1, C2 are constants and Ai(·), Bi(·) are the Airy functions. Since the integrability of the solution is required, it must be C2 = 0. The remaining parameters λ1, λ2, C1 are determined by requiring that ∫₀^∞ f(t) dt = 1, that the mean equals ξ, and from the regularity conditions (f(0) = f′(0) = 0). Due to the presence of the Airy function, these parameters must be determined by numerical means. The resulting p.d.f. can be written as

    f(t) = (1/Z1) Ai²( a1 + b1 t/E(T) ),  (27)

where Z1 is the normalizing constant, a1 ≈ −2.3381 and b1 ≈ 1.5587. The expression for the FI of this p.d.f. can be obtained by integrating Eq. (22) and by combining Eq. (25) with the constraint values,

    J(T) = −4 (b1³ + a1 b1²)/E(T)² ≈ 7.5744/E(T)²,  (28)

thus the maximum value of CJ is CJ ≈ 0.363. The density from Eq. (27) is shown in Fig. 1. Due to convexity of the FI functional (similarly to the concavity of the entropy functional) in f(t), the maximum of CJ is global. For p.d.f. (27) also holds CV ≈ 0.447, Ch ≈ 1.77.

If the regularity conditions are relaxed, we arrive by similar means to the p.d.f.

    f(t) = (1/Z2) Ai²( a2 + b2 t/E(T) ),  (29)

where a2 ≈ −1.0188, b2 ≈ 0.6792. It holds CJ ≈ 1.263, CV ≈ 0.79 and Ch ≈ 2.63; the density is shown in Fig. 1. The resulting p.d.f. differs from the exponential shape in both cases, showing that Ch and CJ are two different measures. On the other hand, the p.d.f. which achieves maximum Ch can be fitted to approximate the extremal density of CJ rather well (shown in Fig. 1), further demonstrating the complex relationships between CV, Ch and CJ. Particularly, even though the shape of the CJ-maximizing (C-R not valid) density differs from the exponential density, it holds Ch ≈ 2.63, which is close to the maximum value Ch = e ≈ 2.72 (corresponding to the exponential density).

The main properties of the dispersion coefficients CV, Ch and CJ are summarized in Table 1. The evenness of the p.d.f. (described by Ch) is related to the "smoothness" of the density (described by CJ). However, a more detailed analysis of CJ shows that Ch and CJ are not interchangeable, and that the requirement on the differentiability of f(t) plays an important role. Namely, CJ is sensitive to the modes of the density, while Ch is sensitive to the overall spread of the density. Since multimodal densities can be more evenly spread than unimodal ones, it is obvious that the behavior of Ch cannot be deduced from CJ (and vice versa).

Another relationship between Ch and CJ follows from the de Bruijn identity [17, p. 672]

    ∂/∂ε h(T + √ε Z) |_{ε=0} = (1/2) J(T),  (30)

where r.v. Z ∼ N(0, 1) is standard normal, and from the entropy power inequality [17, p. 674]

    e^{2h(X+Y)} ≥ e^{2h(X)} + e^{2h(Y)},  (31)

for independent r.v.'s X and Y. The entropy of r.v. √ε Z in Eq. (30) is h(√ε Z) = (1/2) ln(2πeε), thus from Eq. (31) we have

    h(T + √ε Z) ≥ (1/2) ln[ e^{2h(T)} + 2πeε ].  (32)

Taking the derivative with respect to ε and evaluating it at ε = 0 leads to

    πe^{1−2h(T)} ≤ J(T),  (33)

with equality if and only if T is Gaussian (the inequality is thus always strict in the context of this paper). In terms of the relative dispersion coefficients Ch, given by Eq. (5), and CJ, given by Eq. (12), we have from Eq. (33)

    CJ/Ch ≤ 1/√(πe).  (34)
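As a numerical cross-check, the sketch below (not part of the original paper) evaluates the CJ-maximizing density of Eq. (27) with the quoted constants a1, b1 and E(T) = 1; the integration limits are illustrative assumptions. It should approximately reproduce the values CV ≈ 0.447, Ch ≈ 1.77 and CJ ≈ 0.363 quoted above.

```python
# Numerical check of the C_J-maximizing density, Eq. (27), for E(T) = 1.
import numpy as np
from scipy.integrate import quad
from scipy.special import airy

a1, b1 = -2.3381, 1.5587
Ai  = lambda x: airy(x)[0]     # Airy function Ai
Aip = lambda x: airy(x)[1]     # its derivative Ai'

Z1 = quad(lambda t: Ai(a1 + b1 * t) ** 2, 0, 40)[0]             # normalizing constant
f  = lambda t: Ai(a1 + b1 * t) ** 2 / Z1                        # Eq. (27)

mean = quad(lambda t: t * f(t), 0, 40)[0]                       # should be close to 1
var  = quad(lambda t: (t - mean) ** 2 * f(t), 0, 40)[0]
h    = quad(lambda t: -f(t) * np.log(f(t)), 1e-9, 30)[0]
# Eq. (22): J = 4 * int u'(t)^2 dt with u = sqrt(f), i.e. u'(t) = b1*Ai'(a1+b1*t)/sqrt(Z1)
J    = (4 * b1 ** 2 / Z1) * quad(lambda t: Aip(a1 + b1 * t) ** 2, 0, 40)[0]

print(np.sqrt(var) / mean, np.exp(h) / mean, 1 / (np.sqrt(J) * mean))
# expected approximately: 0.447, 1.77, 0.363
```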

|                                        | CV                                                                              | Ch                                                                              | CJ                                                                  |
|----------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------|
| Interpretation                         | Distribution of the probability mass of the outcomes of T with respect to E(T)   | Distribution of the probability mass of the outcomes of T with respect to E(T)   | Smoothness of f(t)                                                  |
| Sensitive to                           | Probability of the values away from E(T)                                         | Concentration of the probability mass                                            | Changes in f′(t), modes                                             |
| Assumptions                            | Var(T) exists                                                                     | No assumptions                                                                    | f(t) continuously differentiable for t > 0 and(∗) f(0) = f′(0) = 0  |
| Minimum                                | 0                                                                                 | 0                                                                                 | 0                                                                   |
| Minimizing density                     | δ(t − E(T))                                                                       | Not unique                                                                        | Not unique                                                          |
| Maximum                                | ∞                                                                                 | e                                                                                 | 1.263; 0.363(∗)                                                     |
| Maximizing density                     | Not unique                                                                        | exp[−t/E(T)]/E(T)                                                                 | Ai²[a + b t/E(T)]/Z                                                 |
| Peaked unimodal, f(t) → δ(t − E(T))    | → 0                                                                               | → 0                                                                               | → 0                                                                 |
| Peaked multimodal, f(t) → Σi δ(t − τi) | > 0                                                                               | → 0                                                                               | → 0                                                                 |
| Extreme variance of T                  | → ∞                                                                               | ≥ 0                                                                               | ≥ 0                                                                 |
| f(t) exponential                       | implies CV = 1                                                                    | equal to Ch = e                                                                   | implies CJ = 1                                                      |

Table 1. Summary of properties of the discussed statistical dispersion coefficients of positive continuous random variable T with probability density function f (t) and finite mean value E (T) . The starred(∗) entries are valid if the Cramer-Rao bound holds.

[Figure 1 near here: probability density function f(t) versus t for the extremal densities; legend: max CJ (C-R valid), max Ch (CV = 0.44), max CJ (C-R not valid), max Ch (CV = 0.79).]

Figure 1. Comparison of probability density functions maximizing the relative dispersion coefficients, Ch and CJ, for different values of the coefficient of variation, CV, and E(T) = 1.

4. APPLICATIONS

4.1. Lognormal and Pareto distributions

Both lognormal and Pareto distributions appear in a broad range of scientific applications [25]. The lognormal distribution is found in the description of, e.g., the concentration of elements in the Earth's crust, the distribution of organisms in the environment or in human medicine, see [26] for a review. The Pareto distribution is often described as an alternative model in situations similar to the lognormal case, e.g., the sizes of human settlements, sizes of particles or the allocation of wealth among individuals [27, 28]. Another common aspect of lognormal and Pareto distributions is that both can be derived from exponential transforms of common distributions: normal and exponential.

The lognormal p.d.f., parametrized by the mean value and coefficient of variation, is

    fln(t) = 1/[ t √(2π ln(1 + CV²)) ] × exp( −[ ln(1 + CV²) + 2 ln(t/E(T)) ]² / [ 8 ln(1 + CV²) ] ).  (35)

The coefficients Ch and CJ of the lognormal distribution can be calculated to be

    Ch = √[ 2πe ln(1 + CV²)/(1 + CV²) ],  (36)

    CJ = √( ln(1 + CV²) / { (1 + CV²)³ [1 + ln(1 + CV²)] } ).  (37)
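A short sketch (not part of the original paper) tabulating Eqs. (36) and (37) over a grid of CV values makes the "∩" shapes discussed below easy to reproduce; the grid range is an illustrative choice.

```python
# Closed-form lognormal dispersion coefficients, Eqs. (36)-(37), as functions of C_V.
import numpy as np

def lognormal_Ch(cv):
    v = np.log(1 + cv ** 2)
    return np.sqrt(2 * np.pi * np.e * v / (1 + cv ** 2))       # Eq. (36)

def lognormal_CJ(cv):
    v = np.log(1 + cv ** 2)
    return np.sqrt(v / ((1 + cv ** 2) ** 3 * (1 + v)))         # Eq. (37)

cv = np.linspace(0.01, 4.0, 4000)
print("C_h maximal near C_V =", cv[np.argmax(lognormal_Ch(cv))])   # about 1.31
print("C_J maximal near C_V =", cv[np.argmax(lognormal_CJ(cv))])   # about 0.55
```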

The dependencies of Ch and CJ on CV are shown in Fig. 2a, b. We see that both Ch and CJ as functions of CV show a "∩" shape, with maximum for CV = √(e − 1) ≈ 1.31 (for Ch) and around CV ≈ 0.55 (for CJ), confirming that each of the proposed dispersion coefficients provides a different point of view. The max Ch p.d.f., Eq. (16), exists only for CV ≤ 1; for CV > 1 the upper bound Ch = e is shown in Fig. 2a. Note that the max Ch distribution does generally not satisfy the regularity conditions, since f(0) ≠ 0.

The dependence of CJ on Ch is shown in Fig. 2c. We observe that Ch and CJ indeed do not describe the same qualities of the distribution, since for the lognormal distribution a single Ch value does not correspond to a single CJ value (and vice versa). In the lognormal case, the dependence between Ch and CJ forms a closed loop, where Ch = CJ = 0 for both CV → 0 and CV → ∞. In other words, both Ch and CJ fail to distinguish between very different p.d.f. shapes (CV → 0 or CV → ∞).

The p.d.f. fP(t) of the Pareto distribution is

    fP(t) = 0 for t ∈ (0, b),    fP(t) = a b^a t^(−a−1) for t ∈ [b, ∞),  (38)

with parameters a > 0 and b > 0 (the expression in terms of E(T), CV is cumbersome). Note that E(T) exists only if a > 1 and Var(T) only if a > 2, thus we restrict ourselves to the case a > 1 if only Ch and CJ are to be evaluated, and additionally to a > 2 if CV is required. From Eq. (38) follows that fP(t) is not differentiable at t = b, thus J(T) cannot be interpreted in terms of the Cramer-Rao bound, although J(T) is finite for all a > 0.

The parameters a and b are related to E(T) and CV by

    a = 1 + √(1 + CV²)/CV,  (39)

    b = E(T) [ 1 + CV² − CV √(1 + CV²) ].  (40)

The coefficients Ch and CJ of the Pareto distribution can be expressed in terms of parameter a as

    Ch = [(a − 1)/a²] exp(1 + 1/a),  (41)

    CJ = (a − 1) √(2 + a) / [ √(a³) (1 + a) ].  (42)

Both Ch and CJ have a non-zero limit as CV → ∞, namely Ch = √(e³)/4 ≈ 1.12 and CJ = 1/(3√2) ≈ 0.2357. However, while Ch as a function of CV is monotonously increasing, CJ attains its maximum value, max CJ ≈ 0.2361, for CV ≈ 2.3591 (Fig. 2). The monotonous shape of Ch versus the non-monotonous shape of CJ in dependence on CV is a significant qualitative difference in the behavior of Ch and CJ, although the effect is numerically very small. The shape of the dependence between Ch and CJ forms a closed loop if both the a > 2 and 2 ≥ a > 1 regions are added together, since Ch = CJ = 0 occurs for both CV → 0 and a → 1 (where CV does not exist).
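The Pareto expressions (39)–(42) can be cross-checked against direct quadrature, as in the sketch below (not part of the original paper); the chosen CV = 1.5 and E(T) = 1 are illustrative assumptions.

```python
# Cross-checking the Pareto closed forms, Eqs. (39)-(42), by quadrature.
import numpy as np
from scipy.integrate import quad

C_V, ET = 1.5, 1.0                                               # illustrative values
a = 1 + np.sqrt(1 + C_V ** 2) / C_V                              # Eq. (39)
b = ET * (1 + C_V ** 2 - C_V * np.sqrt(1 + C_V ** 2))            # Eq. (40)

f = lambda t: a * b ** a * t ** (-a - 1)                         # Eq. (38), for t >= b
C_h = (a - 1) / a ** 2 * np.exp(1 + 1 / a)                       # Eq. (41)
C_J = (a - 1) * np.sqrt(2 + a) / (np.sqrt(a ** 3) * (1 + a))     # Eq. (42)

mean = quad(lambda t: t * f(t), b, np.inf)[0]
h = quad(lambda t: -f(t) * np.log(f(t)), b, np.inf)[0]
J = quad(lambda t: f(t) * ((a + 1) / t) ** 2, b, np.inf)[0]      # d ln f/dt = -(a+1)/t
print(C_h, np.exp(h) / mean)                                      # should agree
print(C_J, 1 / (np.sqrt(J) * mean))
```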
(42) where h(X) is the entropy of r.v. X and E (X) = pm + (1 − a3(1 + a) 1 p)m2 is the mean value of r.v. X. Both C√h and CJ have non-zero limit√ as CV → ∞, namely Similarly, we employ r.v. X for the evaluation of CJ , since 3 Ch = e /4  1.12 and CJ = 1/(3 2)  0.2357. How- it holds ever, while Ch as a function of CV is monotonously increas- d −x d ing, CJ attains its maximum value, max CJ  0.2361, for gm (ln t) = e gm (x), (48) dt dx CV  2.3591 (Fig.2). The monotonous shape of Ch ver- sus the non-monotonous shape of CJ in dependence on CV is thus Eq. (10) can be written in terms of gm (x) as a significant qualitative difference in the behavior of Ch and ∞ 2 CJ , although the effect is numerically very small. The shape 1 dgm (x) −2x J(T) = − 1 e gm (x) dx. (49) of the dependence between Ch and CJ forms a closed loop if −∞ " gm (x) dx # both a > 2 and 2 ≥ a > 1 regions are added together, since Ch = CJ = 0 occurs for both CV → 0 and a → 1 (CV does The parametric space of the lognormal mixture model is large, not exist). in the following we illustrate the behavior of this model just in two different situations. First, we vary the weight p while keeping the other param- 4.2. Example: lognormal mixture eters fixed, Fig.3 (top row). While the mean value, E (T) , decreases with p monotonically, CV reaches its maximum Finally, we analyze a more complex example, a mixture CV  1.4 for p  0.5. The shapes of Ch and CJ in de- of two distributions of the same type. The mixture models pendence on CV are radically different: Ch initially increases are met in diverse situations, e.g., in modeling of populations while CJ decreases. This difference in behavior is explained composed of subpopulations, in neuronal coding of odorant by the basic properties of Ch and CJ , namely, that Ch is lowest mixtures [29] or in the description spiking activity of bursting when the p.d.f. is most concentrated (smallest CV ), however, neurons [30, 31]. the shape of the density is “smoother” for higher values of Recently, Bhumbra and Dyball [32] have successfully em- CV . Obviously, this behavior is distribution-dependent. Fur- ployed a mixture of two lognormal distributions to describe thermore, to each Ch corresponds a unique CV (the reverse the neuronal firing in supraoptic nucleus. statement is not true), while the relationship between CJ and The p.d.f. of the lognormal mixture model is CV is non-unique both ways. The relationship between CJ and Ch is unique only in the sense that to each CJ corresponds a f m (t) = p fln(t; µ1,CV 1) + (1 − p) fln(t; µ2,CV 2), (43) unique Ch (the reverse statement is not true). 7

[Figure 2 near here: three panels. (a) Entropy-based dispersion Ch versus the coefficient of variation CV; (b) FI-based dispersion CJ versus CV; (c) CJ versus Ch. Curves shown: lognormal, Pareto and max Ch; panel (c) additionally includes the Pareto branch with 2 ≥ a > 1.]

Figure 2. Relationships between CV, Ch and CJ for the lognormal, Pareto and Ch-maximizing distribution. The max Ch density is unique for CV ≤ 1; for CV > 1 only the upper bound can be given. The dependence of Ch on CV for the lognormal has a global maximum, while for the Pareto distribution Ch grows monotonously. For all distributions holds Ch → 0 as CV → 0. For the lognormal distribution the dependence of CJ on CV resembles a scaled version of the Ch-CV dependence. For the Pareto distribution the CJ-CV dependence shows a global maximum at CV ≈ 2.36, contrary to the monotonicity of the Ch-CV dependence. This confirms that "smoothness" and "evenness" of the distribution are different notions, although, e.g., CJ = 0 for CV → 0 for all distributions. The Pareto distribution with parameter 1 < a ≤ 2 is added to the Ch-CJ dependence plot, since for this case both Ch and CJ can be calculated but CV is undefined.

[Figure 3 near here: two rows of three panels for the lognormal mixture. In each row: entropy-based dispersion Ch versus CV; FI-based dispersion CJ versus CV; CJ versus Ch.]

Figure 3. Top row: lognormal mixture with variable weight of the components, p ∈ [0, 1], in the direction of arrows (m1 = −1, m2 = −0.5, s1 = 0.2 and s2 = 1). Although E(T) decreases with p, CV exhibits a maximum at p ≈ 0.6. While the relationship between CV and CJ is non-unique, Ch describes CV uniquely (although the reverse statement is not true). Bottom row: lognormal mixture with increasing separation between the mean values of the logarithmically transformed components, m2 ∈ [−1, 2], in the direction of arrows (p = 0.2, m1 = −1, s1 = 0.2 and s2 = 0.5). Although the mean value E(T) and CV increase and CJ decreases monotonically, the shape of Ch is unimodal with maximum at CV ≈ 0.69. The example shows the specific sensitivity of CJ to modes (CJ decreases as the modes become more apparent), while Ch is sensitive to the overall spread (at CV ≈ 0.69 the bimodal distribution is more evenly distributed than for any other value of CV).

In the second example, shown in Fig. 3 (bottom row), we vary the parameter m2. Both E(T) and CV increase monotonically with increasing m2. While CJ decreases monotonically, Ch shows a unimodal behavior with a maximum around CV ≈ 0.69. Thus, although the shape of fm(t) becomes increasingly bimodal with growing m2 (i.e., CJ decreases), at the same time the distribution becomes more spread (or equiprobable, thus Ch increases) up to a point CV ≈ 0.69. From that point on, the bimodality becomes too strong and decreases the evenness (or equiprobability) of the distribution, and both Ch and CJ decrease. The increasing tendency of the density to become multimodal (decreasing CJ) may result in more unpredictable outcomes of the random variable T (increasing Ch). Both these examples show that Ch and CJ describe different aspects of the p.d.f. shape.

5. DISCUSSION AND CONCLUSIONS

We propose and discuss two measures of statistical dispersion for continuous positive random variables: the entropy-based dispersion (Ch) and the Fisher information-based dispersion (CJ). Both Ch and CJ describe the overall spread of the distribution differently than the coefficient of variation. While Ch is most sensitive to the concentration of the probability mass (the predictability of random variable outcomes), CJ is sensitive to the modes of the p.d.f. or any non-smoothness in the p.d.f. shape in general. The difference between Ch and CJ is further demonstrated by the fact that the distributions maximizing their values are not the same. On the other hand, we do not claim that Ch (or CJ) is "more informative" than CV due to taking into account, e.g., higher moments of the distribution. For example, one can find different distributions with equal CV's but differing Ch's, and vice-versa, distributions with equal Ch's but differing CV's, see Fig. 2a.

It is also important to emphasize what is the benefit of employing the proposed measures once the full distribution function (and therefore a complete description of the situation) is known. The answer is that it is often required to compare (or "categorize") individual distributions according to some specific property, i.e., to assign a number to each function. The advantage of employing the newly proposed measures lies in the possibility to describe p.d.f. qualities from different points of view that might be of interest in various applications, see e.g. [1, 2, 9]. The parametrical estimates of the proposed coefficients (for both simulated and experimental data from olfactory neurons) were treated in detail in [33]. However, it is natural to ask for the non-parametric versions, which are arguably more valuable in practice. Lansky and Ditlevsen [34] discussed the disadvantages of the "classical" CV estimator based on the sample mean and deviation, proposing solutions especially for the problem of biasedness. Non-parametric reliable estimates of the entropy (and thus of Ch) are well known [35, 36]. Recently, Kostal and Pokora [37] employed the maximum penalized likelihood method of Good and Gaskins [38] to jointly estimate Ch and CJ from simulated data.

Acknowledgements. This work was supported by the Institute of Physiology RVO:67985823, the Centre for Neuroscience P304/12/G069 and by the Grant Agency of the Czech Republic projects P103/11/0282 and P103/12/P558.

[1] J. F. Bercher and C. Vignat, "On minimum Fisher information distributions with restricted support and fixed variance," Inform. Sciences 179, 3832–3842 (2009).
[2] B. R. Frieden, Physics from Fisher information: a unification (Cambridge University Press, New York, 1998).
[3] A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra, "A maximum entropy approach to natural language processing," Comput. Linguist. 22, 39–71 (1996).
[4] S. A. Della Pietra, V. J. Della Pietra, and J. Lafferty, "Inducing features of random fields," IEEE Trans. on Pattern Anal. and Machine Int. 19, 380–393 (1997).
[5] L. Kostal, P. Lansky, and J-P. Rospars, "Review: neuronal coding and spiking randomness," Eur. J. Neurosci. 26, 2693–2701 (2007).
[6] L. Bonnasse-Gahot and J-P. Nadal, "Perception of categories: From coding efficiency to reaction times," Brain Res. 1434, 47–61 (2012).
[7] L. Telesca, V. Lapenna, and M. Lovallo, "Fisher information measure of geoelectrical signals," Physica A 351, 637–644 (2005).
[8] C. Vignat and J. F. Bercher, "Analysis of signals in the Fisher–Shannon information plane," Phys. Lett. A 312, 27–33 (2003).
[9] V. Zivojnovic, "A robust accuracy improvement method for blind identification using higher order statistics," in IEEE Internat. Conf. Acous. Speech. Signal Proc., ICASSP-93, Vol. 4 (1993), pp. 516–519.
[10] M. T. Martin, F. Pennini, and A. Plastino, "Fisher's information and the analysis of complex signals," Phys. Lett. A 256, 173–180 (1999).
[11] J. B. Szabó, K. D. Sen, and Á. Nagy, "The Fisher-Shannon information plane for atoms," Phys. Lett. A 372, 2428–2430 (2008).
[12] I. A. Howard, A. Borgoo, P. Geerlings, and K. D. Sen, "Comparative characterization of two-electron wavefunctions using information-theory measures," Phys. Lett. A 373, 3277–3280 (2009).
[13] S. R. Chakravarty, Ethical social index numbers (Springer-Verlag, New York, 1990).
[14] H. Cramér, Mathematical methods of statistics (Princeton University Press, Princeton, 1946).
[15] M. Kendall, A. Stuart, and J. K. Ord, The advanced theory of statistics. Vol. 1: Distribution theory (Charles Griffin, London, 1977).
[16] S. Shinomoto, K. Shima, and J. Tanji, "Differences in spiking patterns among cortical neurons," Neural Comput. 15, 2823–2842 (2003).
[17] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley and Sons, Inc., New York, 1991).
[18] E. J. G. Pitman, Some basic theory for statistical inference (John Wiley and Sons, Inc., New York, 1979).
[19] J. Shao, Mathematical Statistics (Springer, New York, 2003).

[20] L. Kostal and P. Lansky, "Classification of stationary neuronal activity according to its information rate," Netw. Comput. Neural Syst. 17, 193–210 (2006).
[21] L. Kostal, P. Lansky, and C. Zucca, "Randomness and variability of the neuronal activity described by the Ornstein-Uhlenbeck model," Netw. Comput. Neural Syst. 18, 63–75 (2007).
[22] E. T. Jaynes and G. L. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, Cambridge, 2003).
[23] F. C. Leone, L. S. Nelson, and R. B. Nottingham, "The folded normal distribution," Technometrics 3, 543–550 (1961).
[24] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables (Dover, New York, 1965).
[25] N. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, Vol. 1 (John Wiley & Sons, New York, 1994).
[26] E. Limpert, W. A. Stahel, and M. Abbt, "Log-normal distributions across the sciences: keys and clues," BioScience 51, 341–352 (2001).
[27] H. A. Simon and C. P. Bonini, "The size distribution of business firms," Am. Economic Rev., 607–617 (1958).
[28] W. J. Reed, "The Pareto, Zipf and other power laws," Economics Lett. 74, 15–19 (2001).
[29] J-P. Rospars, P. Lansky, M. Chaput, and P. Viret, "Competitive and noncompetitive odorant interactions in the early neural coding of odorant mixtures," J. Neurosci. 28, 2659–2666 (2008).
[30] B. C. DeBusk, E. J. DeBruyn, R. K. Snider, J. F. Kabara, and A. B. Bonds, "Stimulus-dependent modulation of spike burst length in cat striate cortical cells," J. Neurophysiol. 78, 199–213 (1997).
[31] H. C. Tuckwell, Introduction to Theoretical Neurobiology, Vol. 2 (Cambridge University Press, New York, 1988).
[32] G. S. Bhumbra, A. N. Inyushkin, and R. E. J. Dyball, "Assessment of spike activity in the supraoptic nucleus," J. Neuroendocrinol. 16, 390–397 (2004).
[33] L. Kostal, P. Lansky, and O. Pokora, "Variability measures of positive random variables," PLoS ONE 6, e21998 (2011).
[34] S. Ditlevsen and P. Lansky, "Firing variability is higher than deduced from the empirical coefficient of variation," Neural Comput. 23, 1944–1966 (2011).
[35] O. Vasicek, "A test for normality based on sample entropy," J. Roy. Stat. Soc. B 38, 54–59 (1976).
[36] A. B. Tsybakov and E. C. van der Meulen, "Root-n consistent estimators of entropy for densities with unbounded support," Scand. J. Statist. 23, 75–83 (1994).
[37] L. Kostal and O. Pokora, "Nonparametric estimation of information-based measures of statistical dispersion," Entropy 14, 1221–1233 (2012).
[38] I. J. Good and R. A. Gaskins, "Nonparametric roughness penalties for probability densities," Biometrika 58, 255–277 (1971).