Fitting to the Power-Law Distribution
Michel L. Goldstein, Steven A. Morris, Gary G. Yen
School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078
(Receipt date: 02/11/2004)

This paper reviews and compares methods of fitting power-law distributions and methods to test the goodness-of-fit of power-law models. It is shown that maximum likelihood estimation (MLE) and Bayesian methods are far more reliable for estimation than graphical fitting on log-log transformed data, which is the most commonly used fitting technique. The Kolmogorov-Smirnov (KS) goodness-of-fit test is explained and a table of KS values designed for the power-law distribution is given. The techniques presented here will advance the application of complex network theory by allowing reliable estimation of power-law models from data and further allowing quantitative assessment of the goodness-of-fit of proposed power-law models to empirical data.

PACS Number(s): 02.50.Ng, 05.10.Ln, 89.75.-k

I. INTRODUCTION

In recent years, a significant amount of research has focused on showing that many physical and social phenomena follow a power-law distribution. Some examples of these phenomena are the World Wide Web [1], metabolic networks [2], Internet router connections [3], journal paper reference networks [4], and sexual contact networks [5]. Often, simple graphical methods are used to establish the fit of empirical data to a power-law distribution. Such graphical analysis can be erroneous, especially for data plotted on a log-log scale, where a pure power-law distribution appears as a straight line with constant slope.
The pure power-law distribution, known as the zeta distribution or discrete Pareto distribution [6], is expressed as:

p(k) = k^{-γ} / ζ(γ)    (1)

where k is an integer usually measuring some variable of interest, e.g., the number of links per network node; p(k) is the probability of observing the value k; γ is the power-law exponent; and ζ(γ) is the Riemann zeta function.

Without a quantitative measure of goodness-of-fit, it is difficult to draw firm conclusions about how well the data approximate a power-law distribution. Moreover, a quantitative analysis of the goodness-of-fit enables the identification of possibly interesting external phenomena that could be causing the distribution to deviate from a power law. In some cases the underlying process may not actually generate power-law distributed data, but outside influences, such as data collection techniques, may cause the data to appear power-law distributed. Quantitative assessment of the goodness-of-fit for the power-law distribution can assist in identifying these cases.

In the remainder of this paper, Section II discusses methods for fitting a power law. Section III presents two candidate goodness-of-fit tests for a power-law distribution: the Kolmogorov-Smirnov test and the χ2 goodness-of-fit test. Section IV illustrates the application of fitting and goodness-of-fit testing to the analysis of a series of collections of journal papers. Finally, Section V presents conclusions about the problem of fitting power-law distributions and discusses some further analysis that could be implemented.

II. FITTING POWER-LAW DISTRIBUTIONS

Many methods exist in the theory of parameter estimation that can be used for estimating the exponent of the power-law distribution [7]. This section overviews three methods, namely maximum likelihood estimation (MLE), Bayesian estimation, and linear regression-based methods.
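As an illustration, the pmf of Eq. (1) is easy to evaluate numerically. The sketch below is our own construction, not code from the paper; the function names are illustrative, and SciPy's `scipy.special.zeta` supplies the Riemann zeta function.

```python
# Minimal sketch of the zeta (discrete Pareto) pmf of Eq. (1).
# Function and variable names are illustrative, not from the paper.
from scipy.special import zeta  # Riemann zeta function zeta(gamma), gamma > 1

def zeta_pmf(k, gamma):
    """p(k) = k^(-gamma) / zeta(gamma), for integer k >= 1 and gamma > 1."""
    return k ** (-gamma) / zeta(gamma)

# The probabilities sum to 1 (checked over a long but finite range), and
# for gamma = 3.0 nearly all of the mass sits at k = 1 and k = 2.
total = sum(zeta_pmf(k, 3.0) for k in range(1, 100000))
head = zeta_pmf(1, 3.0) + zeta_pmf(2, 3.0)
```

For γ = 3.0, `head` evaluates to roughly 0.936, consistent with the 93.6% figure quoted later in the discussion of regression-based estimators.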
In some cases the head of the distribution may deviate from a power law, while the tail appears to be a power law. A good example of this is the distribution of outbound links on a webpage [8]. Most pages have only a few links, but some have a large number of links, especially pages that give a list of interesting pages, also called hubs [9]. On a log-log plot, the number of outbound links in the tail appears to be linear, suggesting a "power-law tail." It is necessary to have a strict definition of a power-law tail and to define estimators and tests for this distribution. It is important to note that the tail usually contains only a small fraction of the data. Thus, no statistical methods may be available to accurately estimate the power-law exponent, or even to determine that the distribution has a power-law tail. An analysis of this uncertainty was recently performed by Jones and Handcock [10]. The scope of this paper is limited to fitting power-law functions to an entire distribution and applying goodness-of-fit tests for validation and comparison. A deeper analysis of tail distributions would first require an analysis of the basis of a tail-only distribution and is beyond the scope of this paper.

A. Maximum likelihood estimation (MLE)

MLE is often used for estimating the exponent of a power-law distribution [6]. It is based on finding the maximum value of the likelihood function:

l(γ|x) = ∏_{i=1}^{N} x_i^{-γ} / ζ(γ)

L(γ|x) = log l(γ|x) = Σ_{i=1}^{N} (−γ log x_i − log ζ(γ)) = −γ Σ_{i=1}^{N} log x_i − N log ζ(γ)    (2)

where l(γ|x) is called the likelihood function of γ given the data x, and L(γ|x) is the log-likelihood function. The log-likelihood is used because it simplifies the calculation and, since the log function is monotonically increasing, it does not change the location of the maximum.
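The log-likelihood of Eq. (2) can be evaluated directly. The sketch below, our own construction rather than the paper's code, scores a grid of candidate exponents against simulated zeta-distributed data (SciPy calls this distribution `zipf`) and keeps the maximizer; a crude grid search stands in here for the analytic root-finding of the derivative.

```python
# Sketch of the log-likelihood of Eq. (2); names are ours, not the paper's.
import numpy as np
from scipy.special import zeta
from scipy.stats import zipf  # SciPy's name for the zeta distribution

def log_likelihood(gamma, x):
    """L(gamma | x) = -gamma * sum(log x_i) - N * log zeta(gamma)."""
    x = np.asarray(x, dtype=float)
    return -gamma * np.log(x).sum() - x.size * np.log(zeta(gamma))

# Simulated data with true exponent 2.5, then a crude grid maximization.
x = zipf.rvs(2.5, size=5000, random_state=0)
gammas = np.linspace(1.5, 4.0, 251)
gamma_hat = gammas[np.argmax([log_likelihood(g, x) for g in gammas])]
```

With 5000 samples, `gamma_hat` lands close to the true exponent of 2.5.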
This maximum can be obtained analytically for the zeta distribution by finding the root of the derivative of the log-likelihood function:

d/dγ L(γ|x) = −Σ_{i=1}^{N} log x_i − N ζ′(γ)/ζ(γ) = 0

⇒ −ζ′(γ)/ζ(γ) = (1/N) Σ_{i=1}^{N} log x_i    (3)

where ζ′(γ) is the derivative of the Riemann zeta function. A table of values of the ratio ζ′(γ)/ζ(γ) is given in [11], or the values can be generated with most modern mathematical and engineering calculation programs.

Calculation of the MLE is very fast and robust; however, it offers only a single estimate of the power-law exponent, without information to define the confidence interval of that estimate. To deal with this deficiency, a Bayesian estimator, discussed below, can be used.

B. Bayesian estimation

A Bayesian estimator, derived from MLE, differs from MLE in the meaning of the parameter estimate [12]. In a Bayesian approach, the unknown parameter is not a single value but a distribution, called the posterior distribution. Moreover, this approach incorporates what is known about the values of the parameter before any data are analyzed, speeding up convergence and ensuring that the final estimate is constrained to values deemed reasonable. This is done through the definition of a prior distribution p(γ). The final posterior distribution is given by a normalized multiplication of the prior and the likelihood. For the power-law distribution:

p(γ|x) = p(x|γ)·p(γ) / ∫ p(x|γ)·p(γ) dγ = l(γ|x)·p(γ) / ∫ l(γ|x)·p(γ) dγ ∝ ∏_{i=1}^{N} (x_i^{-γ}/ζ(γ)) · p(γ)    (4)

The choice of the prior, as mentioned, relates to the range and, possibly, the distribution over this range. Moreover, the prior also defines a possible discretization of the estimated parameter: a discrete prior always generates a discrete posterior distribution. Another good feature of the Bayesian estimation process is that it is naturally adaptable to iterative estimation, where one sample or a subset of the samples is analyzed at a time.
This is particularly useful when the influence of individual samples is of interest, or when the amount of memory available for implementation is low (the only quantity that must be stored is the posterior at each step).

C. Linear regression-based estimators

For power-law exponent estimation, linear regression is an often-used procedure [13]. Different variations of this technique are all based on the same principle: a linear fit is made to the data plotted on a log-log scale. Indeed, with reasonable accuracy, the linear fit can be made by hand on a log-log plot of the distribution. However, the linear fitting does not take into consideration that almost all of the observed data lie on the first few points of the distribution. For example, for an exponent γ of 3.0, 93.6% of the data are expected to have k=1 or k=2. Therefore, an estimation method that does not incorporate this fact will fit to the "noise" in the tail, where very few observations occur [14]. Because of this, two modifications to direct linear fitting have been proposed: 1) the use of only the first five points for the regression, and 2) the use of logarithmically binned data. The first variation is straightforward to implement. The second is based on adding all values that fall into bins that are logarithmically spaced (same size on the logarithmic scale), and then performing the linear regression on the log of the totals of these groups of data. This method is similar to binning methods used for curve estimation [1]. Its advantage is that, by grouping the data points, the noise is reduced; the amount of noise reduction depends on the bin size chosen. However, this method generates a graph that is only approximately linear even when the distribution is a power law, as can be seen in Figure 1 for "log2" bins, i.e., bin boundaries at (0, 1, 2, 4, 8, ...). This linearization error decreases the estimation accuracy.
More importantly, the slope obtained is not directly the value of the power-law exponent. This can be observed by plotting the exponent of the power-law distribution against the slope obtained by simulating the distribution and applying the method explained above.
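This mismatch can be checked with a small simulation. The sketch below is our own construction, not the paper's code: it samples from a zeta distribution with a known exponent of 2.5, bins the samples at the "log2" boundaries, and fits a line to the binned counts on a log-log scale.

```python
# Simulation of the log-binning bias: the fitted slope is not -gamma.
# All names are ours; scipy.stats.zipf implements the zeta distribution.
import numpy as np
from scipy.stats import zipf

gamma_true = 2.5
x = zipf.rvs(gamma_true, size=200000, random_state=1)

# "log2" bins with boundaries at 1, 2, 4, 8, ..., 256.
edges = 2 ** np.arange(0, 9)
counts, _ = np.histogram(x, bins=edges)

# Linear regression of log(bin count) on log(bin lower edge).
slope, _ = np.polyfit(np.log(edges[:-1]), np.log(counts), 1)

# The fitted slope differs markedly from -gamma_true = -2.5, so the
# log-binned slope cannot be read off directly as the exponent.
```

In our runs the recovered slope sits well above −2.5, closer to −(γ − 1), which is consistent with summing counts within each logarithmic bin rather than averaging them.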