Entropy 2010, 12, 1743-1764; doi:10.3390/e12071743
OPEN ACCESS
entropy
ISSN 1099-4300
www.mdpi.com/journal/entropy

Article

Fitting Ranked Linguistic Data with Two-Parameter Functions

Wentian Li 1,⋆, Pedro Miramontes 2,4 and Germinal Cocho 3,4

1 Feinstein Institute for Medical Research, North Shore LIJ Health Systems, 350 Community Drive, Manhasset, NY 11030, USA
2 Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, Mexico; E-Mail: [email protected]
3 Departamento de Sistemas Complejos, Instituto de Física, Universidad Nacional Autónoma de México, Apartado Postal 20-364, México 01000 DF, Mexico; E-Mail: cocho@fisica.unam.mx
4 Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Circuito Escolar, Ciudad Universitaria, México 04510 DF, Mexico

⋆ Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel.: +1 516 562 1076; Fax: +1 516 562 1153.

Received: 4 April 2010; in revised form: 18 May 2010 / Accepted: 1 July 2010 / Published: 7 July 2010

Abstract: It is well known that many ranked linguistic data can be fitted well by one-parameter models, such as Zipf's law for ranked word frequencies. However, in cases where discrepancies from the one-parameter model occur (these appear at the two extremes of the rank range), it is natural to use one more parameter in the fitting model. In this paper, we compare several two-parameter models, including the Beta function, the Yule function, and the Weibull function (all can be framed as a multiple regression in the logarithmic scale), in their performance at fitting several ranked linguistic data, such as letter frequencies, word-spacings, and word frequencies. We observed that the Beta function fits the ranked letter frequencies the best, the Yule function fits the ranked word-spacing distribution the best, and the Altmann, Beta, and Yule functions all slightly outperform Zipf's power-law function on the ranked word-frequency distribution.

Keywords: Zipf's law; regression; model selection; Beta function; letter frequency distribution; word-spacing distribution; word frequency distribution; weighting

1. Introduction

Empirical linguistic laws tend to be functions with one key parameter. Zipf's law [1], which describes the relationship between the number of occurrences of a word (f) and the resulting rank of the word from common to rare (r), states that f ∼ 1/r^a, with the scaling exponent a as the free parameter. Herdan's or Heaps' law describes how the number of distinct words used in a text (V, or "vocabulary size", or number of types) increases with the total number of appearances of words (N, or the number of tokens) as V ∼ N^θ [2,3]. Although in both Zipf's and Herdan's law there is a second parameter C, in f = C/r^a and V = C N^θ, this parameter is either a y-intercept in linear regression and thus not considered to be functionally important, or subtracted from the total number of parameters by the normalization constraint (e.g., ∑_i (f_i/N) = 1).

Interestingly, one empirical linguistic law, the Menzerath-Altmann law [4,5], concerning the relationship between the lengths of two linguistic units, contains two free parameters. Suppose the higher linguistic unit is the sentence, whose length (y) is measured by the number of the lower linguistic unit, words, and the length of words (x) is measured by an even lower linguistic unit, e.g., phonemes or letters. Then the Menzerath-Altmann law states that y ∼ x^b e^{−c/x}, with two free parameters, b and c. Few people paid attention to the two-parameter nature of this function, as it does not fit its data as well as Zipf's law fits its respective data, although Baayen listed several multiple-parameter theoretical models in his review of word frequency distributions [6].

We argue here for the need to apply two-parameter functions to fit ranked linguistic data, for two reasons. First, some ranked linguistic data, such as letter usage frequencies, do not follow a power-law trend well. It is then natural to be flexible in data fitting by using two-parameter functions. Second, even in the known cases of "good fits", such as Zipf's law for the ranked word-frequency distribution (i.e., the rank-frequency plot), the fit tends to be not so good when the full ranking range is included. This imperfection is more often ignored than investigated. We would like to check whether one can model the data even better than Zipf's law by using two-parameter fitting functions.

The first two-parameter function we consider is the Beta function, which attempts to fit the two ends of a rank-frequency distribution by power-laws with different exponents [7,8]. Suppose the variable values are ranked: f_(1) ≥ f_(2) ≥ f_(3) ≥ ··· ≥ f_(n). We define the normalized frequencies p_(r) ≡ f_(r)/N such that ∑_{r=1}^{n} p_(r) = 1 (n denotes the number of items to be ranked, r denotes the rank number, and N = ∑_{r=1}^{n} f_(r) is the normalization factor). In a Beta function, the ranked distribution is modeled by

Beta: p_(r) = C (n + 1 − r)^b / r^a    (1)

where the parameter a characterizes the scaling for the low-rank-number (i.e., high frequency) items, and b characterizes the scaling for the high-rank-number (i.e., low frequency) items. For the example of English words, n is the vocabulary size, and "the" is usually the r = 1 (most frequent) word. If a logarithmic transformation is applied to both sides of Equation (1), the Beta function can be cast as a multiple regression model:

Beta (linear regression form): log(p_(r)) = c_0 + c_1 log(r) + c_2 log(r′)    (2)

where r′ = n + 1 − r, c_1 = −a, c_2 = b, and c_0 = log(C).

The second two-parameter function we are interested in is the Yule function [9]:

Yule: p_(r) = C b^r / r^a    (3)

A previous application of Yule's function to linguistic data can be found in [10]. The Menzerath-Altmann function mentioned earlier:

Menzerath-Altmann: p_(r) = C r^b e^{−a/r}    (4)

cannot be guaranteed to be a model for ranked distributions, as the monotonically decreasing property does not always hold, even when the parameter b stays negative. Note that the Menzerath-Altmann function is a special case of f = C r^b exp(−a/r − d·r), which shares the same functional form as the generalized inverse Gauss-Poisson distribution [6]. Another two-parameter function, proposed by Mandelbrot [11], cannot be easily cast in a regression framework, and is not used in this paper:

Mandelbrot: p_(r) = A / (r + b)^a    (5)

For random texts, one can calculate the value of b, e.g., b = 26/25 = 1.04 for languages with a 26-letter alphabet [12]. Clearly, all these two-parameter functions attempt, one way or another, to modify the power-law function:

power-law, Zipf: p_(r) = C / r^a    (6)

For comparison purposes, the exponential function is also applied:

exponential: p_(r) = C e^{−ar}    (7)

Two more fitting functions, whose origin will be explained in the next section, are labeled as the Gusein-Zade [13] and Weibull functions [14]:

Gusein-Zade: p_(r) = C log((n + 1)/r)    (8)

Weibull: p_(r) = C (log((n + 1)/r))^a    (9)

In this paper, these functions will be applied to ranked letter frequency distributions, ranked word-spacing distributions, and ranked word frequency distributions. A complete list of functions, their corresponding multiple linear regression forms, and the dataset each function is applied to, is included in Table 1. When the normalization factor N is fixed, the parameter C is no longer freely adjustable, and the number of free parameters, K − 1, is one less than the number of fitting parameters, K. The value of K − 1 is also listed in Table 1.

The general framework we adopt in comparing different fitting functions is the Akaike information criterion (AIC) [15] for regression models. In regression, the parameters of a model y = F(x) are estimated to minimize the sum of squared errors (SSE):

SSE = ∑_{i=1}^{n} (y_i − F(x_i))^2    (10)

It can be shown that when the variance of the error is unknown and is to be estimated from the data, least squares leads to the maximum likelihood L̂ (the underlying statistical model is that the fitting noise or deviance is normally distributed) (p. 185 of [16]):

log(L̂) = C − (n/2) log(SSE/n) − n/2    (11)

The model selection based on AIC minimizes the following term:

AIC = −2 log(L̂) + 2K    (12)

where K is the number of fitting parameters in the model. Combining the above two equations, it is clear that the AIC of a regression model can be expressed through the SSE (any constant term will be canceled when two models are compared, so it is removed from the definition):

AIC = n log(SSE/n) + 2K    (13)

For two models applied to the same dataset, the difference between their AICs is:

AIC_2 − AIC_1 = n log(SSE_2/SSE_1) + 2(K_2 − K_1)    (14)

If model-2 has one more parameter than model-1, model-2 will be selected if SSE_2/SSE_1 < e^{−2/n}. Note that almost all fitting functions listed in Table 1 are variations of the power-law function, so the regression model is better applied in the log-frequency scale: y = log(p).
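As a concrete illustration of Equations (13) and (14), the following R sketch computes the AIC difference of two models directly from their residuals; the residual vectors and parameter counts here are made-up stand-ins, not values from this paper:

res1 <- c(0.12, -0.30, 0.25, -0.08, 0.15)  # hypothetical residuals of model-1
res2 <- c(0.10, -0.28, 0.20, -0.05, 0.11)  # hypothetical residuals of model-2
n <- length(res1)
K1 <- 2; K2 <- 3                           # numbers of fitting parameters
sse1 <- sum(res1^2); sse2 <- sum(res2^2)   # Equation (10)
aic1 <- n*log(sse1/n) + 2*K1               # Equation (13), constants dropped
aic2 <- n*log(sse2/n) + 2*K2
aic2 - aic1                                # Equation (14); model-2 wins if negative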

Table 1. List of functions discussed in this paper. The number of free parameters in a function is K − 1. The functional forms in the third and fourth columns are for p_(r) = f_(r)/N and log(p_(r)), respectively. We define: x_1 = log(r), x_2 = log(n + 1 − r), x_3 = 1/r, and x_4 = log(log((n + 1)/r)). The last three columns indicate which dataset a function is applied to (marked by x): dataset 1 is the ranked letter frequency distribution, dataset 2 is the ranked inter-word spacing distribution, and dataset 3 is the ranked word frequency distribution.

model        K − 1  formula (p_(r) = f_(r)/N)  linear regression (y = log p_(r))  data1  data2  data3
Gusein-Zade  0      C log((n+1)/r)             c_0 + x_4                          x      x      -
Weibull      1      C (log((n+1)/r))^a         c_0 + c_4 x_4                      x      x      -
power-law    1      C/r^a                      c_0 + c_1 x_1                      x      x      x
exponential  1      C e^{−ar}                  c_0 + c_r r                        x      x      -
Beta         2      C (n+1−r)^b / r^a          c_0 + c_1 x_1 + c_2 x_2            x      x      x
Yule         2      C b^r / r^a                c_0 + c_r r + c_1 x_1              x      x      x
Altmann      2      C r^b e^{−a/r}             c_0 + c_1 x_1 + c_3 x_3            x      -      x
Mandelbrot   2      C/(r+b)^a                  -                                  -      -      -

Before we fit and compare the various functions, we explain in the next section the origin of the Gusein-Zade and Weibull functions by showing the equivalence between an empirical ranked distribution and the empirical cumulative distribution.

2. Equivalence between an Empirical Rank-Value Distribution (eRD) and an Empirical Cumulative Distribution (eCD)

Our usage of the rank-frequency distribution may raise the question of "why not use a more standard statistical distribution?" It is tempting to relate a sample-based (empirical) ranked distribution to "order statistics" [17]. However, they are not equivalent. Suppose a particular realization of the values of n variables is ranked (ordered) such that x_(1) > x_(2) > x_(3) > ··· > x_(n), and there are many (e.g., m) realizations; the distribution of the m top-ranking (rank-1, order-1, maximum) variable values ({x^i_(1)}, i = 1, 2, ···, m) characterizes an order statistic, whereas an empirical rank-value distribution (x-axis: (1, 2, ···, n); y-axis: (x_(1), x_(2), ···, x_(n))) only characterizes the distribution of one particular realization.

However, the empirical rank-value distribution (eRD) is equivalent to the empirical cumulative distribution (eCD) of the n values of the variable x from a dataset. Figure 1 shows a step-by-step transformation (starting from the upper-right, counter-clockwise) from an eRD to an eCD, using n = 100 values randomly sampled from the log-normal distribution. The empirical rank-value distribution in Figure 1(A) is transformed to Figure 1(B) by a mirror image with respect to the y-axis, plus a transformation of the x-axis: the new x-axis is reversed from the old x-axis and is normalized to (0,1). The purpose of this operation is to start the empirical cumulation from the lowest ranking point. Figure 1(C) is a mirror image of Figure 1(B) with respect to the x-axis, and Figure 1(D) is a counter-clockwise rotation of Figure 1(C) by 90°. Figure 1(D) is then the empirical cumulative distribution of this dataset, which shows the proportion of values that are smaller than a threshold marked on the x-axis. Actually, there is a simpler operation from Figure 1(A) to Figure 1(D): a clockwise rotation by 90°, plus a reversal and normalization of the old x-axis to the new y-axis in the range (0,1).

The relationship between eRD and eCD has been discussed in the literature a number of times [18–23]. The reason we emphasize this relationship is to reinforce the understanding that a functional fitting of the rank-value distribution corresponds to a mathematical description of the empirical cumulative distribution, and vice versa. Since the derivative of the empirical cumulative distribution is the empirical probability density (ePD), we then have an estimation of the ePD as well, even though it may not be in a closed form. The relationship between an ePD and the true underlying distribution is always tricky [6,24].

The frequency spectrum (FS) [6] is defined as the number of different events occurring with a given frequency. The cumulative sum of the FS, called the empirical structural type distribution (eSTD), is closely related to the empirical cumulative distribution. Generally speaking, the eCD can be converted to the eSTD by mirroring with respect to the y = 1 line, then rescaling the new y-axis from 1 to n (see Figure 1(D)). The situation is slightly more complicated when there are horizontal bars in the eRD (equivalent to vertical lines in the eCD) (see section 5). In that case, a horizontal bar is converted to a point at its right end before the mirror/rotation is carried out. We note that Alfred James Lotka used the FS in his discovery of the power-law pattern in data [22,25].
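The chain of operations in Figure 1 can be reproduced in a few lines of R; this is a minimal sketch, assuming log-normal samples as in the figure:

set.seed(1)
x <- rlnorm(100)                       # n = 100 log-normal values
n <- length(x)
xr <- sort(x, decreasing = TRUE)       # eRD: value x_(r) against rank r
plot(1:n, xr, xlab = "r=rank", ylab = "x=value")     # as in Figure 1(A)
# eCD: the r-th largest value sits at cumulative probability (n+1-r)/(n+1)
plot(xr, (n + 1 - (1:n))/(n + 1),
     xlab = "value", ylab = "cumulative prob")       # as in Figure 1(D)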

Figure 1. (A) n = 100 log-normally distributed values being ranked (x-axis is the rank r, with r = 1 for the largest value and r = n for the smallest value; y-axis is the value itself); (B) Mirror image of (A) with respect to the y-axis. Note that the new x-axis is both reversed and normalized (so that the lowest ranking value is at 0 and the highest ranking value at 1); (C) Mirror image of (B) with respect to the x-axis; (D) Rotation of (C) by −90°. The highest ranking value is marked in red and the lowest ranking value in blue.


The Gusein-Zade and Weibull functions introduced in section 1 can be understood by a reverse transformation from eCD to eRD. Suppose an eCD converges to 1 exponentially, which also describes the gap length distribution of a Poisson process with mean gap length f_m:

CD = 1 − e^{−f/f_m}    (15)

Similar to the survival curve in biostatistics [26], an eCD is a step function with n vertical jumps. The lowest-ranking point (r = n) corresponds to the first step, with cumulative probability 1/(n + 1), and the highest-ranking point (r = 1) corresponds to the last step, with cumulative probability n/(n + 1). This can be used to rewrite Equation (15) as:

(n + 1 − r)/(n + 1) = 1 − e^{−f_(r)/f_m}    (16)

or,

f_(r) = f_m log((n + 1)/r)    (17)

which is exactly Equation (8). In data fitting, we treat any coefficient in the function as a parameter; thus f_m/N becomes C. This rank-frequency distribution is exactly the same as the one proposed by Sabir Gusein-Zade [13,27,28]. Suppose the CD approaches 1 in a stretched exponential relaxation [29,30]:

CD = 1 − e^{−(f/f_m)^α}    (18)

The converted RD is the Weibull function defined in Equation (9), with C = f_m/N and a = 1/α. In general, any cumulative distribution of

(n + 1 − r)/(n + 1) = CD(f)    (19)

can be converted to an RD by solving for f as a function of r.
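The conversion between Equations (15) and (17) can be checked numerically; a short R sketch, assuming f_m = 1 and n = 26 for concreteness:

n <- 26; fm <- 1
r <- 1:n
cd <- (n + 1 - r)/(n + 1)               # step heights of the eCD, Equation (16)
f.inverted <- -fm * log(1 - cd)         # solve 1 - exp(-f/fm) = cd for f
f.guseinzade <- fm * log((n + 1)/r)     # Equation (17)
all.equal(f.inverted, f.guseinzade)     # TRUE: the two forms coincide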

3. Ranked Letter Distribution

The frequency of letters in a language is of fundamental importance to communication and encryption. For example, the Morse code is designed so that the most common letters have the shortest codes. Shannon used the frequencies of letters, as well as of other units, to study the redundancy and predictability of languages [31]. It is well known that the letter "e" is the most frequent letter in English [32,33]. On the other hand, the most common letter in other languages can be different, e.g., "a" for Turkish, while "a", "e", "i" share the top ranks in Italian, etc. (http://en.wikipedia.org/wiki/Letter_frequency). We plot the ranked letter frequencies in the text of Moby Dick in Figure 2, both in linear-linear and log-log scales (see Appendix A for the source information). The number of bits in the Morse code for each letter is also listed in Figure 2(A), which does show a tendency toward fewer bits for the more commonly used letters. Seven fitting functions (one zero-parameter, three one-parameter, and three two-parameter functions) are applied (Table 1). All can be framed as multiple regressions (see Appendix B for programming information). The fitting of the Gusein-Zade equation in a regression for the log-frequency data needs more explanation.

From Table 1, we see that there is no regression coefficient for the x_4 = log log(27/r) term, and the only parameter to be estimated is the constant term. To show this fact, the regression model is re-written as:

log(p_(r)) − x_4 = log( p_(r) / log((n + 1)/r) ) = c_0    (20)

From Figure 2, it is clear that the power-law, Altmann, and Yule functions do not fit the data as well as the other functions. When the one-parameter functions are compared, the exponential function clearly outperforms the power-law function. For the regression on the log-frequency data, SSE and AIC can be determined, and a conclusion on which model is the best can be reached. Table 2 shows that the Beta function is the best fitting function, followed by the Weibull and Gusein-Zade functions. The performance of the Gusein-Zade function is impressive, considering that it has two fewer parameters than the Beta and Weibull functions. Another commonly used measure of fitting performance is R^2, which ignores the number of parameters. R^2 can either be defined as the proportion of variance "explained" by the regression or as the squared correlation between the observed and the fitted values (see Appendix C for more details).
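Concretely, every fit in Table 1 is one call to R's lm() on the transformed regressors; a minimal sketch for the Beta, power-law, and Gusein-Zade fits follows, with a randomly generated stand-in for the letter counts (the real analysis uses the Moby Dick counts):

set.seed(4)
f <- sort(rexp(26, 1/100), decreasing = TRUE)  # stand-in for 26 letter counts
n <- length(f); r <- 1:n
y <- log(f/sum(f))                     # log normalized frequency
x1 <- log(r); x2 <- log(n + 1 - r); x4 <- log(log((n + 1)/r))
beta.fit <- lm(y ~ x1 + x2)            # Beta: c_0 + c_1 x_1 + c_2 x_2
zipf.fit <- lm(y ~ x1)                 # power-law: c_0 + c_1 x_1
gz.fit <- lm(I(y - x4) ~ 1)            # Gusein-Zade: only c_0, Equation (20)
sum(resid(beta.fit)^2)                 # SSE, to be plugged into Equation (13)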

Figure 2. Ranked letter frequency distribution for the text of Moby Dick (normalized so that the sum of all letter frequencies is equal to 1: p_(r) = f_(r)/∑_i f_(i)). (A) in the linear-linear scale, and (B) in the log-log scale. Seven fitting functions (see Table 1) are overlapped on the data: power-law function (brown, a = 1.34), exponential (lightblue, a = 0.175), Beta function (red, a = 0.0886, b = 1.64), Yule function (blue, a = −0.938, b = 0.764), Gusein-Zade (green), Weibull (purple, a = 1.27), and Altmann (pink, a = 5.78, b = −2.58). The number of bits in the Morse code for the corresponding letter is written at the bottom of (A). The R^2 values listed are R^2_cor or R^2_cor,log in Equation (26).

[Figure 2 panels: (A) linear-linear and (B) log-log plots of the normalized letter frequency against rank, with the seven fitted curves and their R^2 values overlaid; the per-letter Morse code bit counts appear below panel (A).]

The R^2 values in Table 2 show again that the Beta function is the best fitting function, followed by the Weibull and Gusein-Zade functions. The R^2_log and R^2_cor,log for the Gusein-Zade function are not equal, because the variance decomposition does not hold true (Appendix C) when Equation (20) is converted back to the form log(p_(r)) = c_0 + x_4. When the regression is carried out in the log-frequency scale, whereas the SSE and R^2 are calculated in the linear scale, we do not expect the variance decomposition to hold true, and do not expect R^2 to be equal to R^2_cor. The last two columns in Table 2 show the SSE and R^2_cor for these models when the y-axis is in the linear scale. There are a few minor changes in the relative fitting performance among the different functions. However, we caution against reading too much into these numbers, as the parameter values are fitted to ensure the best performance in the log-frequency scale, not in the frequency scale.

Table 2. Comparison of regression models in fitting the ranked letter frequency distribution obtained from the text of Moby Dick. The regression is applied to the log-frequency scale (Table 1). SSE: sum of squared errors (Equation (10)); ΔAIC: Akaike information criterion relative to the best model (whose AIC is set at 0) (Equation (14)); R^2_log: variance ratio of Equation (25); R^2_cor,log: an alternative definition of R^2 based on correlation (Equation (26)). The SSE and R^2_cor based on the regression in the logarithmic scale, but evaluated in the linear y scale, are shown in the last two columns.

model        K − 1  log-scale                              linear-scale
                    SSE    ΔAIC  R^2_cor,log  R^2_log      SSE       R^2_cor
Gusein-Zade  0      5.77   16.9  0.933        0.892        0.00196   0.962
power-law    1      22.2   53.9  0.586        0.586        0.145     0.580
exponential  1      8.91   30.2  0.834        0.834        0.0124    0.918
Weibull      1      3.59   6.52  0.933        0.933        0.00726   0.928
Beta         2      2.58   0     0.952        0.952        0.000716  0.973
Yule         2      6.66   24.6  0.876        0.876        0.00752   0.795
Altmann      2      14.83  47.4  0.723        0.723        0.0323    0.628

The fitted Beta function in Figure 2(B) is:

log(p_(r)) = −7.5254 + 1.6355 log(27 − r) − 0.08856 log(r)    (21)

i.e., the two parameters in Equation (1) are a ≈ 0.09 and b ≈ 1.6. In [34], the relative magnitude of a and b is used to measure consistency with power-law behavior (a >> b provides a confirming answer, and a << b does not). Equation (21) indicates that the ranked letter frequency distribution in English is very different from a power-law function.

4. Ranked Inter-word Spacing Distribution

Motivated by level statistics in quantum disordered systems, it has been proposed that the distance between successive occurrences of the same word might be related to whether or not that word plays an important role in the text (the so-called "keyword") [35,36]. Similar studies of gap distributions are also common in bioinformatics, such as the in-frame start-to-stop codon distances (which define "open reading frames") [37,38] or the in-frame stop-to-stop codon distances [39]. In the example provided in [36], for the text of Don Quijote, the function word "but" tends to be randomly scattered along the text, whereas the content word "Quijote" is more clustered. Here we will use the text of Moby Dick to obtain the ranked nearest-neighbor distance distributions of selected words, and check which functions fit the data. Figure 3 shows two ranked nearest-neighbor spacing distributions (after normalization), one for a function word and another for a content word. The representative of function words is "THE", which is the most common word in Moby Dick. The representative of content words is "WHALE", which is the 23rd ranked word but the most common noun. The main conclusion of [35,36] is confirmed: function words such as THE tend to be evenly scattered, whereas content words such as WHALE tend to be clustered, leaving huge gaps between clusters (and the low-rank-number spacings are much larger).
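A hedged sketch of how such ranked spacing data can be prepared in R; the character vector words stands for the tokenized text and is a toy stand-in here:

words <- c("the", "whale", "the", "sea", "the", "whale")  # toy tokenized text
pos <- which(words == "the")             # positions of the target word
spacing <- diff(pos)                     # nearest-neighbor distances, in words
spacing.ranked <- sort(spacing, decreasing = TRUE)
p <- spacing.ranked/sum(spacing.ranked)  # normalized, as plotted in Figure 3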

Figure 3. Ranked nearest-neighbor spacing (normalized) between occurrences of the word THE (in units of words) (A,B), and of the word WHALE (C,D), in the text of Moby Dick. Both the x- and y-axes are in log scale. Six functions are used to fit the data: power-law (brown, a = 0.333 (THE, weighted), a = 0.733 (THE), a = 0.712 (WHALE, weighted), a = 1.23 (WHALE)), exponential (lightblue, a = 0.000314 (THE, weighted), a = 0.000187 (THE), a = 0.0579 (WHALE, weighted), a = 0.00385 (WHALE)), Beta (red, a = 0.278, b = 0.665 (THE, weighted), a = 0.542, b = 0.295 (THE), a = 0.576, b = 0.980 (WHALE, weighted), a = 0.791, b = 0.673 (WHALE)), Yule (blue, a = 0.234, b = 0.9999 (THE, weighted), a = 0.274, b = 0.9999 (THE), a = 0.475, b = 0.997 (WHALE, weighted), a = 0.422, b = 0.997 (WHALE)), Weibull (purple, a = 0.982 (THE, weighted), a = 0.559 (THE), a = 1.57 (WHALE, weighted), a = 0.999 (WHALE)), and Gusein-Zade functions (green), using either weighted least square regression (B,D) or regular least square regression (A,C). The weighting used is w = 1/r (r is the rank).


When a power-law function is applied to fit the inter-word spacings, the fitting performance is not good. In particular, it fails badly at fitting the low-rank-number points (see the brown line in Figure 3(A,C)). Since the density of points is higher at the tail of the ranking distribution, the least-square regression result that uses all data points equally is not consistent with the visual impression, in the log-log scale, of the overlap between the fitting line and the data.

There are at least two solutions to this problem. The first is to use a subset of the data points, chosen so that they are evenly distributed along the log-transformed x-axis. The second is to assign a weight to each point in the least-square fitting, such that the weight is smaller where the density of data points along the log-transformed x-axis is higher. We adopt the second solution and choose a weight inversely proportional to the rank: w = 1/r. This particular weight was chosen by a trial-and-error process, and no theoretical justification will be provided in this paper. Weighting is allowed in the R subroutine for linear regression (Appendix B; see also the sketch below). When the weighting option is chosen for the power-law function, the regression line does indeed seem to pass through the low-rank-number points (brown line in Figure 3(B,D)), although it still fails to fit the tail (high-rank-number) portion of the ranking distribution. The other functions, whose parameter values are determined by the weighted regression in the log-spacing scale, are drawn in Figure 3(B,D) as well: the exponential, Beta, Yule, Weibull, and Gusein-Zade functions. Table 3 summarizes the fitting results of these functions by SSE and AIC. Two sets of results are presented: one with the weight w = 1/r and the other without. Interestingly, the Yule function is the best fitting function in most situations, for both the function word "the" and the content word "whale". The Beta function is a close second. The AIC difference between models for the weighted regression is calculated by the formula in Appendix E.
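In R, the weighting amounts to a single extra argument to lm(); a minimal sketch of the weighted versus unweighted power-law fit, with generated stand-in data:

set.seed(5)
p <- sort(rexp(1000), decreasing = TRUE); p <- p/sum(p)  # stand-in ranked data
r <- seq_along(p)
w <- 1/r                                     # down-weight the dense tail
fit.w <- lm(log(p) ~ log(r), weights = w)    # weighted power-law regression
fit.u <- lm(log(p) ~ log(r))                 # unweighted, for comparison
coef(fit.w); coef(fit.u)                     # the slopes differ, as in Figure 3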

Table 3. Comparison of regression models in fitting the ranked inter-word spacing distributions obtained from the text of Moby Dick for the words THE and WHALE. The regression is applied to the log-distance scale (4th column of Table 1). Two sets of results are presented: one for the weighted regression (weight w_r = 1/r), another without weights. SSE: sum of squared errors (Equation (10)); ΔAIC: Akaike information criterion relative to the best model (whose AIC is set at 0) (Equation (14)). One R^2 (R^2_cor,log,weight in Equation (31)) is listed in the last column.

word   model        K − 1  weighted regression  unweighted regression  R^2_cor,log,weight
                           SSE     ΔAIC         SSE      ΔAIC
THE    Gusein-Zade  0      0.701   25.9         5860.9   73588.2       0.933
       power-law    1      0.947   31.0         1014.0   48720.0       0.909
       exponential  1      2.49    40.8         295.1    31222.6       0.761
       Weibull      1      0.698   27.9         1332.5   52591.9       0.933
       Beta         2      0.265   20.1         297.4    31334.1       0.975
       Yule         2      0.0366  0            32.6     0             0.996
WHALE  Gusein-Zade  0      3.83    18.7         155.2    2537.2        0.931
       power-law    1      1.62    14.2         306.0    3320.2        0.919
       exponential  1      3.58    20.2         122.4    2265.8        0.821
       Weibull      1      1.38    12.9         155.2    2539.2        0.931
       Beta         2      0.360   4.69         17.1     0             0.982
       Yule         2      0.195   0            75.0     1703.2        0.990

Among the several choices of R^2 (see Appendices C and D), we list R^2_cor,log,weight in Table 3. The relative fitting performance judged by R^2 is consistent with that by SSE and AIC. It is worth noting that the R^2 values achieved by the Yule function (R^2 = 0.996 for the word THE, 0.990 for the word WHALE) are quite high, indicating that the Yule function fits the data very well. It was proposed in [35] that if a word is randomly distributed along the word sequence, it follows a Poisson process and the nearest-neighbor spacing will follow an exponential distribution. The corresponding ranked distribution is exactly the Gusein-Zade function of Equation (8). Both Figure 3 and Table 3 show, however, that the Gusein-Zade function is not the best fitting function when compared to the others.

5. Revisiting Zipf’s Law

The ranked word frequency distribution in natural [1] and artificial languages [40] follows a reasonably good power-law, especially when only the low-rank-number words are fitted. Ranking words by their frequency of usage has practical applications; for example, it may establish priorities in learning a language [41]. However, the fit of a power-law function over the whole range of ranks is by no means perfect. In [42], it was suggested that two power-law fits with two different scaling exponents are needed to fit the ranked word frequencies. In a more striking illustration [43], 42 million words (tokens) from the Wall Street Journal were used to draw the ranked word frequencies, and they deviate from Zipf's law at r ≈ 5000.

Figure 4(A) shows the ranked word frequency distribution for the text of Moby Dick. In order to examine more closely how well Zipf's law fits the data, we carry out three regression analyses: the first is a local regression using only points whose rank is within a certain range; the second is a regression over all points whose rank number is lower than a certain value (r < r_0); the third repeats the second regression with the weight w = 1/r. The scaling exponents from these three types of regressions are plotted in Figure 4(B) (see the sketch below for the local fits). It is clear that the slope becomes steeper towards the tail (high-rank-number words). The regression with weights manages to slow down the increase of the slope, as the points near the tail are weighted down. Still, the slope is not constant. The slope of −1 expected by Zipf's law is marked in Figure 4(B) as a reference.

Figure 4 confirms the previous observations that Zipf's law is not a perfect fit of the ranked word frequency distribution over the full rank range [42,43]. This leaves room for improvement by a two-parameter fitting function. Figure 5 shows the results of three two-parameter fitting functions (all with weight w = 1/r) plus that of the power-law function, for the word frequencies in Moby Dick and in Don Quijote (see Appendix A for source information). The exponential, Gusein-Zade, and Weibull functions all fit the data badly, so they are not included in the plot. Table 4 shows the SSE and ΔAIC (relative to the best model) of these four functions, in both weighted and unweighted regression. For the weighted regression, the Altmann function is the best model, although the other three functions are not far behind. For the unweighted regression, the Beta function is the best model. The reason for this difference might be the ability of the Altmann function to make a nonlinear turn (in the log-log plot) at the head area of the curve (low rank numbers), and that of the Beta function at the tail area (high rank numbers). Table 4 also shows that the ranking of functions by their R^2_cor,log,weight values is consistent with that by SSE.
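The local fits of Figure 4 can be sketched in R as repeated power-law regressions over rank windows; the toy counts below are generated, not the Moby Dick data:

set.seed(6)
f <- sort(table(sample(1:2000, 50000, replace = TRUE, prob = 1/(1:2000))),
          decreasing = TRUE)           # toy Zipf-like word counts
r <- seq_along(f); y <- log(as.numeric(f)); x <- log(r)
breaks <- round(exp(seq(0, log(length(f)), length.out = 10)))
for (i in 1:(length(breaks) - 1)) {    # one local regression per rank window
  idx <- which(r >= breaks[i] & r < breaks[i + 1])
  if (length(idx) > 2)
    cat(breaks[i], coef(lm(y[idx] ~ x[idx]))[2], "\n")  # local scaling exponent
}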

Figure 4. (A) Ranked word frequency distribution of the text of Moby Dick, in log-log scale. The number of word types is 16713, and the number of tokens is 212791. Points within certain rank ranges are fitted with a power-law locally, alternately colored red and blue. (B) The slope of the local regression lines (red/blue), and the slope of the regression using all points whose rank number is lower than a certain value (r < r_0): without weight ("n") and with weight w = 1/r ("w").


We may conclude from Table 4 that it is worthwhile to use two-parameter models, as all three (the Beta, Yule, and Altmann functions) outperform Zipf's law, even if not by much. One important point made in [6] is that when dealing with the word frequency distribution, due to the presence of a "large number of rare words" [6,24], the length of the text may affect the result. Consequently, it is preferable to check the obtained word statistics as a function of the sample size. In order to check whether the conclusion that two-parameter models outperform Zipf's law still holds at a different sample size, we split Don Quijote into its two books: El ingenioso hidalgo don Quijote de la Mancha, and Segunda parte del ingenioso caballero don Quijote de la Mancha. The SSE and relative AIC results for these two books are shown in Table 4. The main conclusions, that the Altmann function is the best model in the weighted regression and the Beta function is the best in the unweighted regression, remain the same.

Figure 5. Ranked word frequency (normalized) in the English text of Moby Dick (MD) (A), and the Spanish text of Don Quijote (DQ) (B). The following weighted regressions are applied and the resulting fitting lines are shown (Table 1): power-law function or Zipf’s law (brown, a = 0.999 (MD, weighted), a = 1.05 (DQ, weighted) ), Beta function (red, a = 0.975, b = 0.293 (MD, weighted), a = 1.02, b = 0.312 (DQ, weighted) ), Yule function (blue, a = 0.946, b = 0.9999 (MD, weighted), a = 0.99, b = 0.9999 (DQ, weighted) ), and Menzerath-Altmann function (green, a = 0.932, b = −1.07 (MD, weighted), a = 1.41, b = −1.15 (DQ, weighted)).


6. Discussion

It is natural to add more parameters to models when they fail to fit certain sets of data satisfactorily (e.g., [44]). Nevertheless, if the mathematician's task were only to fit the data, then with as many parameters as there are points in the dataset it would be possible, in principle, to fit anything; but the resulting model would be useless, because it would be good only for that particular set of data. The goal should then be to find a balance between the extremes of too few parameters (bad fits) and too many parameters (overfitting). For many linguistic data, one-parameter functions perform reasonably well, but when analyzing large datasets there are discrepancies at the extremes of the abscissae. It is easy to understand why we would like to try more parameters, starting with two-parameter models. These models have not been extensively used, partly because of a lack of critical comparison among them, and partly because their performance may depend on the particular dataset used.

Table 4. Comparison of regression models in fitting the ranked word frequency distributions obtained from the English text of Moby Dick and the Spanish text of Don Quijote. The two books of Don Quijote are also analyzed separately, with the goal of checking the sample-size effect. The regression is applied to the log-frequency scale (4th column of Table 1). Two sets of results are presented: one for the weighted regression (weight w_r = 1/r), another without weights. SSE: sum of squared errors (Equation (10)); ΔAIC: Akaike information criterion relative to the best model (whose AIC is set at 0) (Equation (14)). The last column is R^2_cor,log,weight in Equation (31).

data     model      K − 1  weighted regression  unweighted regression  R^2_cor,log,weight
                           SSE    ΔAIC          SSE    ΔAIC
Moby     power-law  1      0.656  7.54          452.7  551.7           0.993
Dick     Beta       2      0.523  7.21          438.0  0               0.994
         Yule       2      0.369  3.69          442.8  184.0           0.996
         Altmann    2      0.260  0             441.2  120.3           0.997
Don      power-law  1      1.10   17.9          783.8  6018.2          0.990
Quijote  Beta       2      0.950  18.3          603.4  0               0.991
         Yule       2      0.721  15.3          748.7  4959.5          0.993
         Altmann    2      0.170  0             771.5  5654.3          0.998
Don      power-law  1      0.831  16.3          560.9  6045.8          0.991
Quijote  Beta       2      0.751  17.3          377.2  0               0.992
book-1   Yule       2      0.592  14.8          483.0  3767.8          0.994
         Altmann    2      0.138  0             556.9  5937.7          0.999
Don      power-law  1      0.964  19.0          591.2  6861.8          0.990
Quijote  Beta       2      0.899  20.3          391.0  0               0.991
book-2   Yule       2      0.758  18.5          499.1  4053.0          0.992
         Altmann    2      0.125  0             588.7  6791.9          0.999

For the ranked letter frequency data, because the range of the abscissae is short, the differences between the various fitting functions can be small. Only when the range of the independent variable is expanded could we see a divergence among different functions. In this paper, we limit ourselves to regression models in the log-frequency versus log-rank scale. It is interesting to find that the Beta function outperforms the Gusein-Zade function traditionally applied to letter frequency distributions [27]. Further analyses are needed, both using other datasets and using regression in the frequency scale, to gain more confidence in the conclusions reached here.

For the inter-word spacing distribution, if a word appears in the text randomly, the distribution of its spacings is a geometric distribution [35]. The geometric distribution is the discrete version of the exponential distribution, and because the integral of an exponential function is still exponential, by the method in section 2 we would expect the Gusein-Zade function to fit the ranked data well. It is again interesting that the best fitting function is not the Gusein-Zade function, but the Yule function.

The word frequency distribution is one of the most studied linguistic data, following the work of Zipf [1], and a tremendous amount of knowledge has been accumulated on this topic [6]. The Mandelbrot function is an example of extending a one-parameter function to a two-parameter one in fitting the word frequency distribution [11]. Most usage of two-parameter models on linguistic data has perhaps been on the phoneme distribution [10,29]. In our analysis, although several two-parameter models outperform the power-law function, Zipf's law is still reasonably good.

In this paper we reveal some technical subtleties in fitting a function modified from a power-law function. Since the power-law function is a linear function when both axes are in logarithmic scale, it is much easier to apply a linear regression to the log-transformed data. However, in the log-log plot, the density of data points at the low-ranking end (large r values) is much higher, even though their contribution to the visual impression is much less than their proportion. Whether or not we put a lesser weight on these points may affect the data fitting result. Another issue not yet addressed in this paper is that a logarithmic transformation changes the data to another dataset, and the fitting result can differ from that obtained before the logarithmic transformation. Since one major purpose of the logarithmic transformation is to allow an easier linear regression, avoiding the data transformation requires non-linear regression. Such an analysis is deferred to a future study.

Fitting linguistic data with more than two parameters (e.g., 3 or 4) remains a possibility for future studies. However, for the three types of data considered here, the best R^2's are all larger than 0.95 using two-parameter functions, and there are no obvious systematic deviations. Two-parameter functions seem to achieve a good balance between under- and over-fitting. As in any statistical modeling, more does not always mean better [45].

In conclusion, even though we have not exhausted all two-parameter functions in fitting ranked linguistic data, we have at least collected in one place several that used to be scattered in the literature. Some preliminary conclusions have been reached: among the functions we studied, and by fitting the log-transformed data, the Beta function is the best fitting function for the English letter frequency distribution, the Yule function is the best for the inter-word spacing distribution, and the Altmann function fits the word frequency distribution slightly better than the others. However, these specific conclusions need confirmation on more datasets. It is important to emphasize that a good statistical fit could lead to a good phenomenological or first-principles approach to the same problem. We hope that this paper contributes to paving the way in this direction for linguistic data.

Acknowledgments

We would like to thank Yaning Yang for discussions, and three anonymous reviewers for carefully reading the manuscript and providing many thoughtful comments.

References

1. Zipf, G.K. The Psycho-Biology of Language; Houghton-Mifflin: Boston, MA, USA, 1935.
2. Herdan, G. The Advanced Theory of Language as Choice and Chance; Springer: Berlin, Germany, 1966.

3. Heaps, H.S. Information Retrieval: Computational and Theoretical Aspects; Academic Press: Orlando, FL, USA, 1978.
4. Menzerath, P. Die Architektonik des Deutschen Wortschatzes; Dümmler: Bonn, Germany, 1954.
5. Altmann, G. Prolegomena to Menzerath's law. Glottometrika 1980, 2, 1–10.
6. Baayen, R.H. Word Frequency Distributions; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2001.
7. Mansilla, R.; Köppen, E.; Cocho, G.; Miramontes, P. On the behavior of journal impact factor rank-order distribution. J. Informetrics 2007, 1, 155–160.
8. Naumis, G.G.; Cocho, G. Tail universalities in rank distributions as an algebraic problem: The beta-like function. Physica A 2008, 387, 84–96.
9. Yule, G.U. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Phil. Trans. Royal Soc. London B 1925, 213, 21–87.
10. Martindale, C.; Gusein-Zade, S.M.; McKenzie, D.; Borodovsky, M.Y. Comparison of equations describing the ranked frequency distributions of graphemes and phonemes. J. Quant. Linguist. 1996, 3, 106–112.
11. Mandelbrot, B. An information theory of the statistical structure of language. In Proceedings of the Symposium on Applications of Communication Theory, London, UK, 22–26 September 1952; Butterworth: London, UK; pp. 486–500.
12. Li, W. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inf. Theory 1992, 38, 1842–1845.
13. Gusein-Zade, S.M. On the frequency of meeting of key words and on other ranked series. Sci. Tech. Inf. Ser. 2 Inf. Process. Syst. 1987, 1, 28–32.
14. Weibull, W. A statistical distribution function of wide applicability. J. Appl. Mech. 1951, 18, 292–297.
15. Akaike, H. A new look at statistical model identification. IEEE Trans. Automat. Control 1974, 19, 716–722.
16. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S-PLUS, 2nd Edition; Springer-Verlag: New York, NY, USA, 1999.
17. David, H.A.; Nagaraja, H.N. Order Statistics, 3rd Edition; Wiley-Interscience: Malden, MA, USA, 2003.
18. Miller, G.A. Introduction. In The Psycho-Biology of Language; MIT Press: Cambridge, MA, USA, 1965.
19. Rapoport, A. Rank-size relations. In International Encyclopedia of Statistics; Kruskal, W.H., Tanur, J.M., Eds.; The Free Press: New York, NY, USA, 1978; Volume 2, pp. 847–854.
20. Kamecke, U. Testing the rank size rule hypothesis with an efficient estimator. J. Urban Econ. 1990, 27, 222–231.
21. Urzúa, C.M. A simple and efficient test for Zipf's law. Econ. Lett. 2000, 66, 257–260.
22. Rousseau, R. George Kingsley Zipf: Life, ideas, his law and informatics. Glottometrics 2002, 3, 11–18.
23. Li, W. Zipf's law everywhere. Glottometrics 2002, 5, 14–21.

24. Evert, S. A simple LNRE model for random character sequences. In Proceedings of Actes des 7es Journées Internationales d'Analyse Statistique des Données Textuelles (JADT), Louvain-la-Neuve, Belgium, 10–12 March 2004; pp. 411–422.
25. Lotka, A.J. The frequency distribution of scientific productivity. J. Washington Acad. Sci. 1926, 16, 317–323.
26. Kleinbaum, D.G.; Klein, M. Survival Analysis: A Self-Learning Text, 2nd Edition; Springer: New York, NY, USA, 2005.
27. Gusein-Zade, S.M. On the frequency distribution of letters in the Russian language. Probl. Peredachi Informatsii 1988, 24, 102–107 (in Russian).
28. Borodovsky, M.Y.; Gusein-Zade, S.M. A general rule for ranged series of codon frequencies in different genomes. J. Biomol. Struct. Dyn. 1989, 6, 1001–1012.
29. Nabeshima, T.; Gunji, Y.P. Zipf's law in phenograms and Weibull distribution in ideograms: Comparison of English with Japanese. BioSystems 2004, 73, 131–139.
30. Altmann, E.G.; Pierrehumbert, J.B.; Motter, A.E. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One 2009, 4, e7678.
31. Shannon, C.E. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951, 30, 50–64.
32. Richardson, M.; Gabrosek, J.; Reischman, D.; Curtiss, P. Morse code, scrabble, and the alphabet. J. Stat. Educ. 2004, 12, available at: http://www.amstat.org/publications/jse/v12n3/richardson.html (accessed 6 April 2010).
33. Jernigan, R.W. A photographic view of cumulative distribution functions. J. Stat. Educ. 2008, 16, available at: http://www.amstat.org/publications/jse/v16n1/jernigan.html (accessed 6 April 2010).
34. Martínez-Mekler, G.; Martínez, R.A.; Beltrán del Río, M.; Mansilla, R.; Miramontes, P.; Cocho, G. Universality of rank-ordering distributions in the arts and sciences. PLoS One 2009, 4, e4791.
35. Ortuño, M.; Carpena, P.; Bernaola-Galván, P.; Muñoz, E.; Somoza, A.M. Keyword detection in natural languages and DNA. Europhys. Lett. 2002, 57, 759–764.
36. Carpena, P.; Bernaola-Galván, P.; Hackenberg, M.; Coronado, A.V.; Oliver, J.L. Level statistics of words: Finding keywords in literary texts and symbolic sequences. Phys. Rev. E 2009, 79, 035102.
37. Li, W. Statistical properties of open reading frames in complete genome sequences. Comput. Chem. 1999, 23, 283–301.
38. McCoy, M.W.; Allen, A.P.; Gillooly, J.F. The random nature of genome architecture: Predicting open reading frame distributions. PLoS One 2009, 4, e6456.
39. Carpena, P.; Bernaola-Galván, P.; Román-Roldán, R.; Oliver, J.L. A simple and species-independent coding measure. Gene 2002, 300, 97–104.
40. Manaris, B.; Pellicoro, L.; Pothering, G.; Hodges, H. Investigating Esperanto's statistical proportions relative to other languages using neural networks and Zipf's law. In Proceedings of the 2006 IASTED International Conference on Artificial Intelligence and Applications (AIA 2006), Innsbruck, Austria, 13–16 February 2006; ACTA Press: Anaheim, CA, USA, 2006; pp. 102–108.
41. Davies, M. A Frequency Dictionary of Spanish: Core Vocabulary for Learners; Routledge: New York, NY, USA, 2006.

42. Ferrer i Cancho, R.; Solé, R.V. Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited. J. Quant. Linguist. 2001, 8, 165–173.
43. Ha, L.Q.; Sicilia-Garcia, E.I.; Ming, J.; Smith, F.J. Extension of Zipf's law to word and character n-grams for English and Chinese. J. Comp. Linguist. Chin. Lang. Process. 2003, 8, 77–102.
44. Li, W.; Freudenberg, J. Two-parameter characterization of chromosome-scale recombination rate. Genome Res. 2009, 19, 2300–2307.
45. Li, W. The-more-the-better and the-less-the-better. Bioinformatics 2006, 22, 2187–2188.
46. Baayen, R.H. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R; Cambridge University Press: Cambridge, UK, 2008.
47. Evert, S.; Baroni, M. zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Prague, Czech Republic, 25–27 June 2007; Association for Computational Linguistics: Morristown, NJ, USA, 2007; pp. 29–32.
48. Ryan, T.P. Modern Regression Methods; John Wiley & Sons: New York, NY, USA, 1997.

Appendix A: Source of Texts

The text of Herman Melville's Moby-Dick was downloaded from the Project Gutenberg website (http://www.gutenberg.org/). A few pre-processing steps were carried out, such as removing the chapter titles, removing punctuation, manually transforming some colloquial terms to their corresponding formal terms (I'd → "I would", ne'er → never, can't → "can not", 'tis → "it is", etc.), and converting uppercase to lowercase. The text contains more than 212,000 tokens with 16713 word types. 7267 word types (43% of all) appear only once; such words are known as hapax legomena. Similarly, 60%, 70%, and 75% of all word types appear at most 2, 3, and 4 times, respectively. The Spanish text of Don Quijote was also downloaded from the Project Gutenberg website, with more than 80,000 tokens and more than 23000 word types. Around half (49%) of the word types (∼ 11000) are hapax legomena.
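For reference, token, type, and hapax counts of this kind can be computed in R once the text is cleaned; a sketch assuming the pre-processed text sits in a local file mobydick.txt (a hypothetical file name):

txt <- tolower(readLines("mobydick.txt"))    # assumed pre-processed text file
tokens <- unlist(strsplit(txt, "[^a-z']+"))  # crude split on non-letter runs
tokens <- tokens[tokens != ""]
tab <- table(tokens)                         # word frequency table
length(tokens)                               # number of tokens
length(tab)                                  # number of word types
sum(tab == 1)/length(tab)                    # proportion of hapax legomena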

Appendix B: Using the R Statistical Package

We use the statistical package R (http://www.r-project.org/) (for linguistic applications using R, see [46,47]). The regression y = a_0 + a_1 x_1 + a_2 x_2 is represented by a shorthand notation in R: y ∼ x1 + x2. A typical R session for the above multiple regression is (the text after # is a comment):

lm(y ~ x1 + x2)        # for "linear model"
res = lm(y ~ x1 + x2)  # save the result to an object
summary(res)           # details of the result
res$coefficients       # regression coefficients

The R command for regression with weights (suppose w is an array of weights) is:

lm(y ~ x, weights=w)

Appendix C: The Condition for the Equivalence of Two Definitions of R^2 in Linear Regression

The concept of the coefficient of determination R^2 is based on the variance decomposition. Suppose our raw data are {x_i, y_i} (i = 1, 2, ···, n); the following variance decomposition holds true for the regression model y ∼ x:

∑_i (y_i − ȳ)^2 = ∑_i (y_i − ŷ_i)^2 + ∑_i (ŷ_i − ȳ)^2 = SSE + ∑_i (ŷ_i − ȳ)^2    (22)

where ȳ denotes the mean of y, ŷ the value fitted by the regression model, and the name SSE means sum of squared errors. For the variance decomposition to be true, the cross-product term should be zero. The cross-product term in Equation (22) is 2 ∑_i (y_i − ŷ_i)(ŷ_i − ȳ). It is indeed zero, because the conditions ∑_i (y_i − ŷ_i) = 0 and ∑_i (y_i − ŷ_i) ŷ_i = 0 have to be satisfied, as these are related to the derivatives of the sum of squared errors with respect to the two coefficients; for the sum of squared errors to reach its minimum, the derivatives have to be zero. A similar variance decomposition can be obtained for the log(y) ∼ log(x) regression:

∑_i (log(y_i) − ⟨log(y)⟩)^2 = ∑_i (log(y_i) − log(ŷ_i))^2 + ∑_i (log(ŷ_i) − ⟨log(y)⟩)^2    (23)

where ⟨log(y)⟩ denotes the mean of the log(y_i). When the variance decomposition holds, the ratio of a variance component to the total can be defined; for the y ∼ x regression:

R^2 = ∑_i (ŷ_i − ȳ)^2 / ∑_i (y_i − ȳ)^2 = 1 − SSE / ∑_i (y_i − ȳ)^2    (24)

and for the log(y) ∼ log(x) regression:

R^2_log = ∑_i (log(ŷ_i) − ⟨log(y)⟩)^2 / ∑_i (log(y_i) − ⟨log(y)⟩)^2    (25)

Note that R^2_log is generally not identical to R^2, as the regression is applied to the transformed data. There is another definition of R^2, based on the correlation coefficient between the observed and the fitted y values:

R^2_cor = Cor(y, ŷ)^2 = |∑_i (y_i − ȳ)(ŷ_i − ȳ)|^2 / ( ∑_i (y_i − ȳ)^2 · ∑_i (ŷ_i − ȳ)^2 )

R^2_cor,log = Cor(log(y), log(ŷ))^2    (26)

For the regression y ∼ x,

∑_i (y_i − ȳ)(ŷ_i − ȳ) = ∑_i (y_i − ŷ_i + ŷ_i − ȳ)(ŷ_i − ȳ)
                       = ∑_i (y_i − ŷ_i)(ŷ_i − ȳ) + ∑_i (ŷ_i − ȳ)^2
                       = ∑_i (y_i − ŷ_i) ŷ_i − ȳ ∑_i (y_i − ŷ_i) + ∑_i (ŷ_i − ȳ)^2
                       = ∑_i (ŷ_i − ȳ)^2    (27)

making R^2_cor in Equation (26) equal to R^2 in Equation (24).

When the variance decomposition does not hold, such as when the parameter values are not estimated by least squares, the two definitions of R^2 are not equivalent. In that situation, R^2_cor seems to be more robust (see R^2_raw on p. 194 of [48]). It is always recommended to state explicitly which definition of R^2 is used (p. 79 of [48]).
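The equivalence is easy to verify numerically; a toy R sketch comparing Equations (24) and (26) on an ordinary least-square fit:

set.seed(2)
x <- 1:20; y <- 2*x + rnorm(20)                       # toy data
fit <- lm(y ~ x); yhat <- fitted(fit)
R2.var <- 1 - sum((y - yhat)^2)/sum((y - mean(y))^2)  # Equation (24)
R2.cor <- cor(y, yhat)^2                              # Equation (26)
all.equal(R2.var, R2.cor)   # TRUE: the decomposition holds for least squares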

Appendix D: Weighted Linear Regression

The standard regression y ∼ x by least-square fitting aims at minimizing the expression ∑_i (y_i − a − b x_i)^2. In order to assign priority to data points in their contribution to the regression, it is possible to minimize ∑_i w_i (y_i − a − b x_i)^2 instead, where the {w_i} (i = 1, 2, ···, n) are pre-set weights. For example, for ranked distributions, one may decide that the low-rank-number (small r) points are more important than the low-ranking (large r) points. When one is more interested in fitting the low-rank-number points, it is possible to choose a monotonically decreasing weight function, such as w_i = 1/i. It can be shown that the variance decomposition continues to hold:

∑_i w_i (y_i − ⟨y⟩_w)^2 = ∑_i w_i (y_i − ŷ_i)^2 + ∑_i w_i (ŷ_i − ⟨y⟩_w)^2 = SSE_w + ∑_i w_i (ŷ_i − ⟨y⟩_w)^2    (28)

where ⟨y⟩_w denotes the weighted mean of y: ⟨y⟩_w ≡ ∑_i w_i y_i / ∑_i w_i. Note that the weighted mean ⟨y⟩_w cannot be replaced by the unweighted mean ȳ. The similar variance decomposition for the log-scaled regression is:

∑_i w_i (log(y_i) − ⟨log(y)⟩_w)^2 = ∑_i w_i (log(y_i) − log(ŷ_i))^2 + ∑_i w_i (log(ŷ_i) − ⟨log(y)⟩_w)^2    (29)

With the variance decomposition, R^2 can be defined for the weighted regression:

R^2_weight = ∑_i w_i (ŷ_i − ⟨y⟩_w)^2 / ∑_i w_i (y_i − ⟨y⟩_w)^2 = 1 − SSE_w / ∑_i w_i (y_i − ⟨y⟩_w)^2

R^2_log,weight = ∑_i w_i (log(ŷ_i) − ⟨log(y)⟩_w)^2 / ∑_i w_i (log(y_i) − ⟨log(y)⟩_w)^2    (30)

The correlation-based definition of R^2 for the weighted regression is:

R^2_cor,weight = |∑_i w_i (y_i − ⟨y⟩_w)(ŷ_i − ⟨y⟩_w)|^2 / ( ∑_i w_i (y_i − ⟨y⟩_w)^2 · ∑_i w_i (ŷ_i − ⟨y⟩_w)^2 )

R^2_cor,log,weight = |∑_i w_i (log(y_i) − ⟨log(y)⟩_w)(log(ŷ_i) − ⟨log(y)⟩_w)|^2 / ( ∑_i w_i (log(y_i) − ⟨log(y)⟩_w)^2 · ∑_i w_i (log(ŷ_i) − ⟨log(y)⟩_w)^2 )    (31)

We have so far defined eight R^2's: based on the variance ratio or on the squared correlation, based on regression in the log scale or the linear scale, and with or without weights. Some of them are equivalent under certain conditions; others are not.
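A toy R sketch of Equations (30) and (31), with weights w_i = 1/i, confirming that the two weighted definitions also agree for a weighted least-square fit:

set.seed(3)
x <- 1:50; y <- 3 - 0.8*x + rnorm(50); w <- 1/(1:50)        # toy data, weights
fit <- lm(y ~ x, weights = w); yhat <- fitted(fit)
ybar.w <- sum(w*y)/sum(w)                                   # weighted mean <y>_w
R2.weight <- 1 - sum(w*(y - yhat)^2)/sum(w*(y - ybar.w)^2)  # Equation (30)
R2.cor.weight <- sum(w*(y - ybar.w)*(yhat - ybar.w))^2 /
  (sum(w*(y - ybar.w)^2) * sum(w*(yhat - ybar.w)^2))        # Equation (31)
all.equal(R2.weight, R2.cor.weight)                         # TRUE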

Appendix E: Relationship between AIC and SSE in Weighted Regression

We examine whether the relationship between AIC and SSE discussed in Section 1 still holds for weighted regression. The most important assumption made is that the log-likelihood is weighted, not the likelihood itself; under the normal noise model:

log(L) = ∑_i w_i log(L_i) = ∑_i w_i log( e^{−(y_i − ŷ_i)^2/(2σ^2)} / √(2πσ^2) ) = −(n_w/2) log(2π) − (n_w/2) log(σ^2) − ∑_i w_i (y_i − ŷ_i)^2 / (2σ^2)    (32)

where n_w = ∑_i w_i. When the variance is unknown, it can be estimated as σ̂^2 = ∑_i w_i (y_i − ŷ_i)^2 / ∑_i w_i = SSE_w/n_w. Plugging the estimated variance into the above equation:

log(L̂) = C − (n_w/2) log(SSE_w/n_w)    (33)

so the AIC can be derived (after removing constant terms):

AIC = −2 log(L̂) + 2K = n_w log(SSE_w/n_w) + 2K    (34)
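In practice, Equation (34) is used exactly like Equation (13), with SSE and n replaced by their weighted counterparts; a minimal R sketch with made-up residuals:

res <- c(0.2, -0.1, 0.3, -0.2, 0.1)  # hypothetical residuals of a weighted fit
w <- 1/(1:5)                         # the weights used in the fit
K <- 3                               # number of fitting parameters
ssew <- sum(w*res^2)                 # weighted SSE
nw <- sum(w)                         # n_w = sum of the weights
nw*log(ssew/nw) + 2*K                # Equation (34), constants dropped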

© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an Open Access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).