<<

arXiv:0704.1867v2 [cond-mat.other] 11 Aug 2007 fapwrlwti ygahclmasaeotnue npract in used mathematicall often a are are exponent the for graphical estimators by phen likelihood tail ubiquitous power-law an a are of power-law a following Distributions aaee siainfrpwrlwdistributions power-law for estimation Parameter xod el od xod X N,Uie Kingdom United 3NP, OX1 Oxford, Road, Keble 1 Oxford, uofPirsCnr o hoeia hsc,University , Theoretical for Centre Peierls Rudolf ymxmmlklho methods likelihood maximum by hoeia Physics Theoretical ek Bauke Heiko Preprint on lentv ogahclmethods. graphical to alternative sound y mnn ehd o eemnn h exponent the determining for Methods omenon. c u r nrnial neibe Maximum unreliable. intrinsically are but ice of ∗ . Parameter estimation for power-law distributions by maximum likelihood methods

Heiko Bauke Rudolf Peierls Centre for Theoretical Physics, University of Oxford, 1 Keble Road, Oxford, OX1 3NP, United Kingdom∗ (Dated: August 11, 2007) Distributions following a power-law are an ubiquitous phenomenon. Methods for determining the exponent of a power-law tail by graphical means are often used in practice but are intrinsically unreliable. Maximum likelihood estimators for the exponent are a mathematically sound alternative to graphical methods.

PACS numbers: 02.50.Tt, 89.75.-k

1. Introduction situations probabilities p(k) may differ from a power-law for k kmax as well, e. g. the distribution may have an exponential cut-off.≥ The distribution of a discrete random variable is referred to as Therefore, I will generalize the maximum likelihood ap- a distribution with a power-law tail if it falls as proach introduced in [3] to distributions that follow a power- γ law within a certain range kmin k < kmax but differ from a p(k) k− (1) ≤ ∼ power-law outside this range in an arbitrary way. Furthermore, I will give some statements about the large sample properties for k N and k kmin. Power-laws are ubiquitous distribu- tions∈ that can be≥ found in many systems from different disci- of the estimate of the power-law exponent and present a nu- plines, see [1, 2] and references therein for some examples. merical procedure to identify the power-law regime of the dis- Experimental data of quantities that follow a power-law are tribution p(k). But first, let us see what is wrong with popular usually very noisy; and therefore obtaining reliable estimates graphical methods. for the exponent γ is notoriously difficult. Estimates that are based on graphical methods are certainly used most often in practice. But simple graphical methods are intrinsically unre- 2. Trouble with graphical methods liable and not able to establish a reliable estimate of the expo- nent γ. For that reason, the authors of [3] introduced an alternative All graphical methods for estimating power-law exponents are approach based on a maximum likelihood estimator for the based on a linear least squaresfit of some empirical data points exponent γ. Unfortunately the authors concentrate on a rather (x1,y1), (x2,y2),..., (xM,yM) to the function idealized type of power-law distributions, namely y(x)= a0 + a1x. (4) γ k− p1(k;γ)= (2) The linear least squares fit minimizes the residual ζ(γ,1) N ζ γ M with k , where the normalization constant ( ,1) is given ∆ = ∑(y a a x )2 . (5) ∈ ζ γ i 0 1 i by the Hurwitz- -function which is defined for > 1 and a > i=1 − − 0 by Estimatesa ˆ anda ˆ of the parameters a and a are given by ∞ 0 1 0 1 1 [4] ζ(γ,a)= ∑ . (3) (i + a)γ i=0 ∑M ∑M 2 ∑M ∑M i=1 yi i=1 xi i=1 xi i=1 yixi aˆ0 = − (6) The distribution (2) is characterized by one parameter only, M 2 M 2 M ∑i 1 x ∑i 1 xi  and therefore all properties of this distribution (e.g. its ) = i − = are determined solely by the exponent γ. In many applications   and the power-law (2) is too restrictive. If one states that a quantity follows a power-law, then this ∑M ∑M ∑M M i=1 yixi i=1 xi i=1 yi means usually that the tail (k kmin) of the distribution p(k) aˆ1 = − . (7) γ ≥ ∑M 2 ∑M 2 falls proportionally to k− . Probabilities p(k) for k < kmin may M i=1 xi i=1 xi  differ from the power-law and admit the possibility to tune − the mean or other characteristics independently of γ. In some The ansatz for the residual (5) and derivation of (6) and (7) are based on several assumptions regarding the data points (xi,yi). It is assumed that there are no statistical uncertain- ties in xi, but yi may contain some statistical error. The errors ∗E-mail:[email protected] in different yi are independent identically distributed random 2 variables with mean zero. In particular the standard devia- Table I: Mean and of the distribution of the esti- tion of the error is independent of x . For various graphical i mate for the power-law exponent γ for various methods. All methods methods for the estimation of the exponent of a power-law have been applied to the same data sets of random numbers from distribution these conditions are not met, leading to the poor distribution (2) with γ = 2.5. See text for details. performance of these methods. To illustrate the failure of graphical methods by a computer mean standard deviation method estimate of estimate experiment N = 10000 random numbers mi had been drawn from distribution (2) with γ = 2.5 and an estimate γˆ for the ex- fit on empirical distribution 1.597 0.167 ponent γ was determined by various graphical methods. The fit on cumulative empirical distri- 2.395 0.304 estimator is a random variable and its distribution depends on bution the method that has been used to obtain the estimate. Impor- fit on empirical distribution with 2.397 0.080 tant measures of the quality of an estimator are its mean and logarithmic binninga its standard deviation. If the mean of the estimator equals the fit on cumulative empirical distri- 2.544 0.127 γ true exponent then the estimator is unbiased and estimators bution with logarithmic binning with a distribution that is concentrated around γ are desirable. maximum likelihood 2.500 0.016 For each graphical method a histogram of the distribution of the estimator was calculated to rate the quality of the estimator aIn [3] a similar experiment is reported. For a fit of the logarithmically by repeating the numerical experiment 500 times. binned empirical the authors find a systematical bias of 29 %. I cannot reconstruct such a strong bias, instead I get a bias of 5 % The most straight forward (and most unreliable) graphical only. Probably the quality of this method depends on the details of the binning approach is based on a plot of the empirical probability dis- procedure. tributionp ˆ(k) on a double-. Introducing the indicator function I[ ], which is one if the statement in the brackets is true and else· zero, the empirical probability distri- to a straight line gives much better estimates for the exponent, bution is given by see Figure 1 b. But there is still a small bias to too small values and the distribution of this estimate is rather broad, see Table I. N 1 I Logarithmic binning reduces the noise in the tail of the em- pˆ(k)= ∑ [mi = k] . (8) pirical distributionsp ˆ(k) and Pˆ(k) by merging data points into N i 1 = groups. By introducing the logarithmically scaled boundaries γ γ An estimate ˆ for the power-law exponent is established by i a least squares fit to bi = roundc with some c > 1 (14) (The function roundx rounds x to the nearest integer.) a linear (xi,yi) = (lnk,lnp ˆ(k)) for all k N withp ˆ(k) > 0, (9) ∈ least squares fit is performed to γ ˆ equals the estimate (7) for the slope, see Figure 1a. Because bi+1 1 bi + bi+1 1 − pˆ(k) the lack of data points in the tail of the empirical distribution (xi,yi)= ln − ,ln ∑ (15) γ 2 bi+1 bi this procedure underestimates systematically the exponent , k=bi − ! see Table I. or There are two ways to deal with the sparseness in the tail of the empirical distribution, logarithmic binning and consider- bi+1 1 bi + bi+1 1 − Pˆ(k) ing the empirical cumulative distribution Pˆ(k) instead ofp ˆ(k). (xi,yi)= ln − ,ln ∑ , (16) 2 bi+1 bi The cumulative probability distribution of (2) is defined by k=bi − ! ∞ i γ respectively. As a consequence of the binning the width of P(k)= ∑ − . (10) the distribution of the estimate γˆ of the power-law exponent ζ(γ,1) i=k γ is reduced, see Figure 1c, 1d and Table I. According to If p(k) has a power-law tail with exponent γ then P(k) follows the numerical experiments a fit of the logarithmically binned approximately a power-law with exponent γ 1 because for cumulative distribution gives the best results among graphical k 1 the distribution P(k) can be approximated− by methods. It shows the smallest systematic bias. ≫ All the methods that have been considered so far have a ∞ i γ k1 γ common weakness. In the deviation of (6) and (7) it was as- P(k) − di = − . (11) ≈ k ζ(γ,1) (γ 1)ζ(γ,1) sumed that the standard deviation of the distribution of the Z − error in yi is the same for all data points (xi,yi). But this is The empirical cumulative probability distribution is given by obviously not the case. For fixed k the empirical distribution pˆ(k) is a random variable with mean p (k;γ) and standard 1 N 1 Pˆ(k)= ∑ I[m k] . (12) deviation p1(k;γ)(1 p1(k;γ))/N. For the corresponding i − N i=1 ≥ data on a logarithmic scale the standard deviation is approxi- p It is less sensitive to the noise in the tail of the distribution and mately given by the quotient therefore a fit of p1(k;γ)(1 p1(k;γ))/N 1 p1(k;γ) − γ = − γ . (17) (xi,yi) = (lnk,lnPˆ(k)) for all k N with Pˆ(k) > 0 (13) p1(k; ) s Np1(k; ) ∈ p 3

fit on the distributionp ˆ(k) fit on the cumulative distribution Pˆ(k) = ∑i k pˆ(i) ≥ no binning

a) b) logarithmic binning

c) d)

Figure 1: Comparison of various methods for estimating the exponent of a power-law. Each figure shows data for a single data set of N = 10000 samples drawn from distribution (2) with γ = 2.5. Insets present histograms of estimates for γ for 500 different data sets.

A power-law distribution p1(k;γ) is a monotonically decreas- size N is given by ing function and therefore (17) is an increasing function of k. θˆ θ θ Because the variation of the statistical error is not taken into N = argmax[L( )] = argmax[lnL( )], (18) θ θ account, the distribution of the estimate γˆ is very broad. Methods that deal with the cumulative distribution have an where additional weakness. Cumulation has the side-effect that the N statistical errors in y are not independent any more, which i L(θ)= ∏ p(mi;θ) (19) violates another assumption of the deviation of (6) and (7). i=1 To sum up, estimates of exponents of power-law distribu- tions based on a linear least squares fit are intrinsically inac- denotes the likelihood function. In the limit of asymptotically curate and lack a sound mathematical justification. large samples and under some regularity conditions maximum likelihood estimators share some desirable features [5, 6].

The estimator θˆN exists and is unique. • 3. Maximum likelihood estimators The estimator θˆN is consistent, that means for every ε > 0 • lim P θˆN θ < ε = 1, (20) Maximum likelihood estimators offer a solid alternative to N ∞ | − | graphical methods. Let p(k;θ) denote a single parameter prob- → ˆ   ability distribution. The maximum likelihood estimator θˆ for where P θN θ < ε denotes the probability that the dif- N | − | the unknown parameter based on a sample m1,m2,...,mN of ference θˆN θ is less than ε. | − |  4

The estimator θˆ is asymptotically normal with mean θ 25 N k = 1 • and min k = 2 1 min 2 − 2 . d 20 k = 4 (∆θˆN ) = N E ln p(k;θ) , (21) min dθ k = 8 "  #! min where E[ ] indicates the expectation value of the quantity 15 · N

in the brackets. ∆γ 1/2 Maximum likelihood estimators have asymptotically mini- N 10 • mal variance among all asymptotically unbiased estimators. One says, they are asymptotically efficient. 5

4. Maximum likelihood estimators for 0 1 2 3 4 5 6 7 8 genuine power-laws γ

The most general discrete genuine power-law distribution has Figure 2: Asymptotic standard deviation for maximum likelihood esti- a lower as well as an upper bound and is given by mators for the exponent of a power-law distribution (24). γ γ k− pkmin,kmax (k; )= (22) ζ(γ,kmin,kmax) for k N with kmin k < kmax. Where the non-standard nota- tion ∈ ≤

ζ(γ,kmin,kmax) := ζ(γ,kmin) ζ(γ,kmax) (23) − has been introduced. If the upper bound is missing the distri- bution γ γ k− pkmin (k; )= (24) ζ(γ,kmin) has to be considered for k N with k kmin. The distributions (22) and (24) are generalizations∈ of (2)≥ and will be useful for the analysis of more general distributions that show a power- law behavior only in a certain range but have an arbitrary pro- file outside the power-law regime. This kind of distributions will be considered in section 5. The maximum likelihood estimator γˆN for the parameter γ of the distribution (22) follows from (18) and is given by Figure 3: Empirical distribution of the maximum likelihood estima- tor (histogram) versus its theoretical asymptotic distribution, which is N given by a with mean γ = 2.5 and variance (27). γˆN = argmax γ ∑ lnmi N lnζ(γ,kmin,kmax) (25) The histogram has been obtained from the same data as in Figure 1. γ "− i=1 ! − # or equivalently by the implicit equation In the limit k ∞ equations (25), (26), and (27) give the ζ γ N max ′( ˆN ,kmin,kmax) 1 maximum likelihood→ estimator and the asymptotic variance ζ γ + ∑ lnmi = 0, (26) ( ˆN ,kmin,kmax) N i=1 of this estimator for power-law distributions lacking an upper cut-off (24). A graphical representation of the standard devia- which has to be solved numerically. The prime denotes the tion (27) in the limit k ∞ is given in Figure 2. For each γ max derivative with respect to . The asymptotic variance of this fixed k the quantity ∆γˆ →√N grows faster than linear with γ. γ min N estimator ˆN follows from (21) and equals Therefore the larger the exponent γ the larger the sample size that is necessary to get an estimate within a given error bound. . 1 (∆γˆ )2 = N N If the maximum likelihood method is applied (assuming a distribution (24)) to the same data as in section 2, numeri- ζ(γ,k ,k )2 min max cal experiments show that the estimates for the exponent are 2 . × ζ (γ,kmin,kmax)ζ(γ,kmin,kmax) ζ (γ,kmin,kmax) ′′ − ′ much more precise. The estimate has no identifiable system- (27) atic bias, the standard deviation of the distribution of the esti- 5 mate is smaller by an order of magnitude compared to graphi- If the empirical data is drawn from a distribution having cal methods, see Table I and Figure 3. both a lower crossover point as well as an upper crossover point a sequence of estimates γˆ (k ) is determined by re- N′ cmin stricting the data to a sliding window kcmin mi < wkcmin = ≤ kcmax with w > 1. As long as the window lies completely 5. Maximum likelihood method for within the power-law regime the maximum likelihood esti- general power-law distributions mate obtained from the restricted data set will give a reliable estimate of the power-law exponent. If the window lies at least partly outside the power-law regime the estimate is systemati- The maximum likelihood procedure outlined in section 4 can cally biased. be generalized further to distributions p k that are no pure ( ) To illustrate the procedures outlined above I generated two power-laws (22) or (24) but follow a power-law within a cer- data sets from two distributions having a power-law regime. tain finite range or follow a power-law in the whole tail of the The first data set of N = 10000 samples was drawn from a distribution and have an arbitrary profile outside the power- distribution with a power-law tail which is given by law regime. The popurse of this section is to establish meth- ods for identifying the power-law regime and for estimating 5 2.5 for 1 k 5 the exponent of the power-law regime without making special p(k) − ≤ ≤ (28) 2.5 assumptions about the profile of the probability distribution ∼ (k− for k > 5. beyond the power-law regime. γ The main problem for a generalization of the maximum Plotting the sequence of estimates ˆN (kcmin) against the pa- γ′ likelihood approach is that there might be no good hypothe- rameter kcmin reveals the exponent = 2.5 as wells as the sis for the profile of the probability distribution beyond the crossover point k = 5 very clearly, see Figure 4. The second power-law regime. To overcome this difficulty the empirical data set of N = 100000 samples was drawn from a distribution with two crossover points, viz. data set is restricted to a window kcmin mi < kcmax. (The following discussion covers the case of a≤ power-law tail distri- ∞ 5 2.25 for 1 k 5 butions as well by setting kcmax = .) Assuming that p(k) has − ≤ ≤ a power-law profile for kcmin k < kcmax then the probabil- p(k) k 2.25 for 5 k 100 (29) ≤ ∼  − ≤ ≤ ity distribution of the restricted data set is given by (22) with  2.25 0.05(k 100) 100− e− − for k > 100. kmin = kcmin and kmax = kcmax (or by (24) with kmin = kcmin) and some unknown exponent γ. This allows to estimate the  Figure 5 shows the sequence of estimates γˆ (kcmin) that had power-law exponent by the application of the maximum like- N′ been determined from restricted data sets of samples within lihood method on the restricted data set of size N as presented ′ the sliding window k m < 5k . This sequence ex- in section 4 without making a hypothesis about the profile of cmin i cmin hibits a broad plateau that corresponds≤ to the power-law expo- the probability distribution beyond the power-law regime. nent γ = 2.25. If the window does not lie completely inside In order to apply the maximum likelihood method one has to determine the cut-off points kcmin and kcmax first. Here it has to be taken into account that if the window kcmin mi < ≤ kcmax is chosen too large the estimate γˆ is systematically bi- 4 ased, but on the other hand if it is too small the statistical error is larger than necessary. In some cases one can make conser- 3.5 vative estimates for kcmin and kcmax by plotting the empirical probability distribution (8) on a double-logarithmic scale. An 3 appropriate window can also be found by determining esti- ) mates γˆ (k ) as a function of the window and a χ2-test. N′ cmin c min

k 2.5

Assuming the empirical data is drawn from a distribution ( ’ N with a power-law tail (no upper cut-off)the lower cut-off point γ kcmin can be determined in the following systematic way. By 2 varying the parameter kcmin the maximum likelihood approach gives a sequence of estimates γˆ (k ). If k is very large N′ cmin cmin 1.5 the estimate will be quite inaccurate because only a tiny frac- tion of the experimental data is taken into account; but the 1 0 1 2 3 smaller the cut-off kcmin the more accurate the estimate of the 10 10 10 10 k exponent. If kcmin approaches the point from above (but is c min still above) where the probability distribution starts do differ from a power-law γˆ (k ) will give a very precise estimate Figure 4: Sequence of estimates γˆ (k ) as a function of the cut- N′ cmin N′ cmin for the exponent γ. On the other hand, if kcmin is too small the off kcmin for a data set of N = 10000 samples from distribution (28). hypothesis that the (restricted) empirical data is drawn from a Filled symbols mark where the χ2-test has rejected the hypothesis power-law distribution is violated which causes a significant that the restricted data follows a power-law (24) with exponent γ = γˆ (k ). An error probability of α = 0.001 was chosen. change of the estimate of the power-law exponent. N′ cmin 6

1 k 2 If the to kcmin mi < kcmax restricted data is given by the 10 c max 10 ≤ 4 power-law (22) or (24) with kmin = kcmin, kmax = kcmax, 2 and the exponent γ = γˆN (kcmin) then the statistic c follows 2 ′ 3.5 asymptotically a χ -distribution with ν = (b 1) degrees of freedom, which is given by −

3 xν/2 1e x/2

) − − pχ2 (x,ν)= . (34) 2 ν/2 Γ(ν/2)

c min −

k 2.5 ( ’

N χ2 α γ Let α be the (1 )- of the distribution (34). The hy- 2 pothesis that the− restricted data is given by the power-law (22) or (24), respectively, with kmin = kcmin, kmax = kcmax, and the γ γ 1.5 exponent = ˆN (kcmin) is accepted with the error probability 2 2 ′ α if c χα . If the window kcmin mi < kcmax lies not com- pletely≤ within the power-law regime≤ this hypothesis will be 1 0 1 2 2 10 10 10 rejected by the χ -test and one can detect the upper crossover k c min point as well as the lower crossover point (where the power- law loses its validity) in a reliable way, see Figure 4 and Fig- Figure 5: Sequence of estimates γˆ as a function of the lower ure 5. N′ (kcmin) cut-off kcmin for a data set of N = 100000 samples from distribution (29). Filled symbols mark where the χ2-test has rejected the hypoth- esis that the restricted data follows a power-law (22) with exponent γ γˆ . An error probability of α and the window 6. Computational remarks = N′ (kcmin) = 0.001 width w = 5 had been chosen. The normalizing factors of the probability distributions (22) ζ the power-law regime the estimate γˆ (k ) deviates system- and (24) are given by the Hurwitz- -function. This func- N′ cmin atically from the known exponent. tion is less common than other special functions and may not Apart from a visual inspection of the γˆ (k ) plot the be available in the reader’s favorite statistical software pack- N′ cmin crossover point(s) to the power-law regime can be determined age but the GNU Scientific Library [7] offers an open source by means of a χ2-test. To apply a χ2-test the data set has to be implementation of this function. A direct calculation of the ζ divided into some bins and the following binning turned out Hurwitz- -function by truncating the sum (3) gives unsatis- to be appropriate: The data is partitioned into a small number factory results. b, say b = 6, of bins. In the case of a distribution with a power- The maximum likelihood estimator of the exponent can be computednumerically either by solving (25) or (26). Equation law tail this means each bin j collects n j items such that (25) has the advantagethat it can be solved without calculating n = N pˆ(k ) q = p (k ;γˆ (k )) (30) ζ 1 ′ cmin 1 kcmin cmin N′ cmin derivatives of the Hurwitz- -function [8], whereas the solu- n = N pˆ(k + 1) q = p (k + 1;γˆ (k )) tion of (26) involves its first derivative (e. g. bisection method) 2 ′ cmin 2 kcmin cmin N′ cmin (31) or even higher derivatives (e.g. Newton-Raphson method). . . An explicit implementation of these derivatives is often not . . available but may be calculated numerically. and finally ∞ ∞ 7. Conclusion n = N ∑ pˆ(k) q = ∑ p (k;γˆ (k )), b ′ b kcmin N′ cmin k=k +b 1 k=k +b 1 cmin − cmin − (32) Methods based on a least squares fit are not suited to establish estimates for power-law distribution exponents because least where q j denotes the probability that a data point falls into squares fits rely on assumptions about the data set that are bin j under the assumption that the (restricted) data follows not fulfilled by empirical data from power-law distributions. γ the power-law (24) with kmin = kcmin and the exponent = In this paper maximum likelihood estimators have been intro- γˆ (k ). For distributions with a finite power-law regime N′ cmin duced as a reliable alternative to graphical methods. These es- the binning procedure can be carried out in a similar way. In timators are asymptotically efficient and can be applied to data this case the summation index in (32) is bounded by kcmin + from a wide class of distributions having a power-law regime. b 1 k < k and the probability p (k;γˆ (k )) cmax kcmin,kcmax N′ cmin The crossover points that separate the power-law regime from has− to≤ be considered instead of p (k;γˆ (k )). kcmin N′ cmin the rest of the distribution can be determined by a procedure The test statistic of the χ2-test is given by based on a χ2-test. b 2 Finally I would like to mention that the idea to plot a se- 2 (n j N′q j) quence of estimates γˆ (k ) as shown in Figure 4 is re- c = ∑ − . (33) N′ cmin j=1 N′q j lated to so-called Hill plots [9, 10]. The Hill estimator is 7 a maximum likelihood estimator for the inverse of the ex- mation Society Technologies programme under contract IST- ponent of the continuous p(k) = (γ 001935, EVERGROW. γ − 1)(k/kmin)− /kmin, see [10] for a detailed discussion.

Acknowledgments

Work sponsored by the European Community’s FP6 Infor-

[1] M.E.J. Newman, Contemporary Physics 46(5), 323 (2005) [6] Y. Pawitan, In all likelihood: statistical modelling and inference [2] E.F. Keller, BioEssays 27(10), 1060 (2005) using likelihood, Oxford science publications (Oxford Univer- [3] M.L. Goldstein, S.A. Morris, G.G. Yen, The European Physical sity Press, 2001) Journal B 41(2), 255 (2004) [7] GNU Scientific Library, http://www.gnu.org/software/gsl/ [4] F. Ramsey, D. Schafer, The Statistical Sleuth: A Course in Meth- [8] R.P. Brent, Algorithms for Minimization without Derivatives ods of Data Analysis, 2nd edn. (Duxbury Press, Pacific Grove, (Prentice-Hall, Englewood Cliffs, New Jersey, 1973) CA, 2002) [9] B.M. Hill, Annals of 3(5), 1163 (1975) [5] L.J. Bain, M. Engelhardt, Introduction to Probability and Math- [10] H. Drees, L. de Haan, S. Resnick, Annals of Statistics 28(25), ematical Statistics, Duxbury Classic Series, 2nd edn. (Duxbury 254 (2000) Press, 2000)