Journal of Modern Applied Statistical Methods

Volume 12 | Issue 1 Article 10

5-1-2013 The Length-Biased Versus Random Sampling for the Binomial and Poisson Events Makarand V. Ratnaparkhi Wright State University, [email protected]

Uttara V. Naik-Nimbalkar Pune University, Pune, India

Follow this and additional works at: http://digitalcommons.wayne.edu/jmasm Part of the Applied Statistics Commons, Social and Behavioral Sciences Commons, and the Statistical Theory Commons

Recommended Citation Ratnaparkhi, Makarand V. and Naik-Nimbalkar, Uttara V. (2013) "The Length-Biased Versus Random Sampling for the Binomial and Poisson Events," Journal of Modern Applied Statistical Methods: Vol. 12 : Iss. 1 , Article 10. DOI: 10.22237/jmasm/1367381340 Available at: http://digitalcommons.wayne.edu/jmasm/vol12/iss1/10

This Regular Article is brought to you for free and open access by the Open Access Journals at DigitalCommons@WayneState. It has been accepted for inclusion in Journal of Modern Applied Statistical Methods by an authorized editor of DigitalCommons@WayneState. Journal of Modern Applied Statistical Methods Copyright © 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 54-57 1538 – 9472/13/$95.00

The Length-Biased Versus Random Sampling for the Binomial and Poisson Events

Makarand V. Ratnaparkhi Uttara V. Naik-Nimbalkar Wright State University, Pune University, Dayton, OH Pune, India

The equivalence between the length-biased and the random sampling on a non-negative, discrete is established. The length-biased versions of the binomial and Poisson distributions are discussed.

Key words: Length-biased data, weighted distributions, binomial, Poisson, convolutions.

Introduction (probabilities), which was The occurrence of so-called length-biased data appropriate at that time. The methodology for has been documented by many researchers. No such discrete data analysis was formalized by formal mathematical definition of length- Rao (1965) who introduced the concept of biasedness exists; however, if for collected data weighted distributions (or in informal the probability of inclusion of an observation in nomenclature, the distorted distributions). Rao a sample is proportional to the magnitude of an also considered the weighted distributions for observation then the data is referred to as the the analysis of various discrete data sets. length-biased data. Further, if the probability of Recently, Gao, et al. (2011) found that the inclusion depends on a certain known function length-biased Poisson (events) data arise in (weight function) of an observation then it is bioinformatics. There are a number of fields referred to as the size-biased data. such as ecology, geology, medical science and The realization of such data as a sample, engineering where the data is length-biased and as opposed to the realization of a well-defined the researchers have used the weighted random sampling procedure, appeared for the distribution for the analysis of such data. first time in a paper by Fisher (1934). In This study follows Ratnaparkhi and particular, the data collected for estimating the Naik-Nimbalkar (2012) regarding the length- proportion of individuals having certain rare biased data arising in oil filed exploration and genetic traits such as albinism are length-biased. seeks to associate length-biased data with a However, in statistical literature, the term random sampling procedure. It is hoped that length-biased data was introduced during the these efforts will lead to new methods for 1960s. Fisher (1934) referred to the data building appropriate models for length-biased collection procedure for counting the number of data and for the estimation of the related albino children in a family as the method of parameters using such data. Since, the general ascertainment, which clearly is not a random discussion of length-biasedness (regardless of sampling procedure. For the analysis of the whether the data is on continuous or discrete collected data he considered the modifications of random variables is more involved, here, the discussion is limited to the discrete random variables. Further, the binomial and Poisson distributions are commonly used for modeling Makarand V. Ratnaparkhi is a Professor of discrete data. Thus, the derivation of results and Statistics. Email him at: the discussion are restricted to these [email protected]. Uttara V. distributions; however, the extensions and Naik-Nimbalkar is a Professor of Statistics. modifications to other discrete distributions are Email her at: [email protected]. possible.

54

RATNAPARKHI & NAIK-NIMBALKAR

Length-Biased Data versus Random Sample simple or are the known properties of the Data: Examples distributions of random variables. Therefore, the detailed proofs are omitted for brevity. For Example 1: Estimation of Proportion of Families future reference, the notations and definitions with Albino Children (Fisher, 1934) used herein are listed below. An albino child is located and his/her family is included in a sample. Therefore, larger • X: A random variable representing the the number of albino children in a family higher populations of interest (the original random is the probability of inclusion of the family in variable) X is assumed to be a non-negative, the sample. Let X= the number of albino non-degenerate discrete random variable. children in a family. Clearly, the data on X is • Y, Z1, Z2: Other random variables that will length-biased. The binomial probabilities are be introduced as needed. considered for the analysis if the observed • f (;)xθ : The pdf of the original frequencies of the values of X. X distribution of a random variable X where θ

is a scalar or a vector of parameters. Example 2: Estimation of the Population of • θ Moose Using Aerial Transect Sampling gxX (;): The pdf of the length-biased Moose, in general, stay and move as θ version of f X (;)x . groups. The group is located (aerial survey) and • LBS: Length-biased sampling the numbers of animals in a group are recorded by a team on the ground. For locating the group, The pdf of the length-biased version of at least one animal from the group must be θ is given by visible while traveling on a randomly selected f X (;)x transect in the aerial survey of the population. θθ=<∞ The Poisson distribution is considered for the gxXX(;) xfx (;)/[ EXEX ],[ ] . analysis of these data. Note that larger the (1) number of animals in group, the higher is the probability of sighting at least one animal and For a more general case of (1), x is replaced by hence the length-biasedness in the collected wx() > 0, and EX[] by EwX[()].<∞ data.

• ()= X Example 3: Length-Biased Data on RNA GtX Et[]:The probability generating Sequences (Gao, et al., 2011) function (pgf) of X . The subscript of Gt() The quantification of transcripts (RNA- denotes the pgf of the corresponding sequences) and related properties, such as the variable. differential expression of the transcripts, arise in • B(,np ): The binomial distribution with microarray data analysis. The differential expression of longer transcripts is more likely to parameters n and p. • be identified than that of a shorter transcript with LBB(,) m p : The length-biased version of the same effect size. This situation is described B(,np ). as length-biasedness in the related data. For the • P()λ : The with analysis of such data, the Poisson distribution is parameter λ. considered as a model. • LBP()λ : The length-biased version of the

Methodology Poisson distribution. The derivations of the results for this study are related to the distributions arising in the analysis Results of length-biased data on discrete random Convolution of the Poisson and Degenerate variables. These are based on certain definitions Random Variables and the probability generating function of the Suppose that the original distribution of random variables. The derivations are either a random variable XP ()λ with pdf

55

LENGTH-BIASED VS. RANDOM SAMPLING FOR BINOMIAL AND POISSON EVENTS

f (;)xexλλ= −λ x /!, interpretation must come from the description of X (2) =∞ the research problem form where such data is x 0,1, 2, , . arising in practice. For practical purposes, if it is known then using (1), the pdf of the length biased that while sampling for the original random Poisson distribution is given by variable X the probability of inclusion of an observation is proportional to the magnitude of gx(;)λλ=− e−−λ (1)x /( x 1)!, the observation, a researcher should define the X (3) x =∞1, 2, , . random variables Z1 and Y appropriately and then collect sample data on these variables independently, for example, take a sample first λ =∞ Further, if Z11Pz ( ), 0,1,2, , , then on Y followed by a sample on Z1 and then define Gt( )=−− exp(λ (1 t )). And, if Y has a X = Z1 + Y for practical purposes. It is likely that Z1 such a sampling plan is more involved in terms = degenerate distribution at Y = 1 with GtY () t, of its execution and planning as compared to the =−−λ practice of collecting the data on X and calling it then,Gtt()ZY+ ( ) exp( (1 t )), which is the 1 a length-biased variable without any justification pgf of (3). Thus, the length-biased version of the with reference to the theory of probability. But, Poisson distribution is a convolution of the it provides a theoretical base for the collected Poisson random variable and the degenerate data and a general theoretical justification to the distribution of Y. Further, this result shows that statistical inference based of these data. the length-biased version of X has the same Amari (1985) introduced the role of distribution as that of (Z + Y). 1 differential geometrical properties of statistical This observation, interpreted in terms of manifolds, which are well defined objects, in sampling, implies that the random sampling on estimation theory. Clearly, the manifolds of the the original Poisson variable denoted by, Z , and 1 family of distributions given by (2) and (3) are the degenerate random variable Y is equivalent not the same. Therefore, due care is needed to the so-called LBS on random variable X. when using (3) for estimating parameter λ when It is known that the chance mechanism, it is known that it is associated with a different namely, the well-defined Poisson process, leads manifold corresponding to the family of to the pdf given by (2) and a well-defined P(λ) distributions given by (2) and not the one exists as a corresponding probability model. associated with distribution (3). However, no known stochastic process can be In view of the discussion related to the associated that will lead to the pdf (3), the so- λ Poisson distribution, it seems reasonable to called the LBP( ) distribution; therefore, in a consider other discrete distributions for which strict sense the pdf in (3) is an artificial the above conclusions are applicable. In general, mathematical construct if the above convolution it can be shown that these conclusions are result is not associated for deriving (3). The applicable to the commonly used original related implication is: because Z1 and Y are discrete distributions with Ω = {0, 1, 2, random variables, the random sampling on Z1 …, ∞}. For example, it was observed that the and Y is well defined and, as a result, the random method based on the convolution of random sampling on a length-biased random variable X variables was useful for generating the length- is in order. Thus, the data on X is a realization of biased version of the original geometric random a random sample on X= (Z1 + Y) and there variable defined on Ω. seems to be no need to refer to the data as a length-biased sample. Regarding the theory of Convolution of the Binomial and Degenerate statistics, this provides a theoretical basis for Random Variables length-biased data as a random sample. Note As noted, the length-biased binomial that, in practice, there is a need for providing a distribution also arises as a model for the length- practical justification for introducing variable Y biased data on the original discrete random and its interpretation; however, such variable X; therefore, it seems reasonable to

56

RATNAPARKHI & NAIK-NIMBALKAR

obtain the results, similar to those obtained for Therefore, a similar situation will be present in the Poisson distribution, for the binomial case other distributions with finite support. This and to further interpret these results in view of raises the following question: Can the so-called applications. length-biased binomial distribution be Suppose that the original distribution of considered as a model for the so-called length- a random variable X  Bnp(, ) with pdf: biased binomial data when it is known that the support of the random variable of interest is finite? At this stage, the exact answer is n − x − nx unknown; therefore, it is left to the reader. fxnpX (;, ) =  p (1 p ) , x (4) xn=0,1,2, . Conclusion It has been shown that, at least for some

situations where data are so-called length- Using (1) the pdf of the length-biased binomial biased, the convolution property could be used distribution is given by: for providing a more rigorous treatment for

modeling such data within the framework of n −1 =−x−−1 nx random sampling and thereby justifies the gxnpX (;, )  p(1 p ) , x −1 (5) probability models such as a Poisson = distribution. Length-biased data on binomial xn1, 2 , . random variables needs further investigation in data analysis rather than using the mathematical = If Z(,),0,1,2,,,22B np z n then constructs such as the weighted binomial n distributions. The extension of results obtained Gt()=+ ( qpt ). If Y has a degenerate Z2 in this study is not straightforward; in general, = distribution at Y=1 with GtY () t, then for length-biased data on continuous random n variables, for example, survival data arising in Gttqpt+ ()=+ ( ) , which is not the pgf of ()ZY2 medical science. However, it is worth (5). Thus, the length-biased version of the investigating such cases in view of the number binomial distribution of random variable is not a of practical situations where the collected data convolution of a binomial distribution and the are on continuous random variables. degenerate distribution of Y. Thus, the situation is different than for P()λ and LBP()λ ; References further, the length-biased sampling on the Amari, S. I. (1985). Differential- original random variable X is not equivalent to geometrical methods in statistics. New York, NY: Springer-Verlag. the random sampling on (+)Z2 Y . Fisher, R. A. (1934). The effects of the It is not possible to associate a natural methods of ascertainment upon the estimation of chance mechanism with the length-biased frequencies. Annals of Eugenics, 6, 13-25. binomial distribution – either as repetitions of Gao, L., et al. (2011). Length bias Bernoulli trials or as a convolution of random correction for RNA-seq gene set analyses. variables. Therefore, the LBB (m, p) is a Bioinformtics, 27,662-669. mathematical construct lacking statistical Rao, C. R. (1965). On discrete significance. Under such a situation, as opposed distributions arising out of methods of to using LBB distribution it might be necessary ascertainment. Sankhya A, 27, 311-324. to investigate other distributions as a model for Ratnaparkhi, M. V., & Naik-Nimbalkar, the length-biased data on (original) binomial U. V. (2012). Length-biased lognormal random variable X. The lack of convolution distribution and its application in the analysis of property and, hence, the non-availability of data from the oil field exploration studies. corresponding chance mechanism for the LBB Journal of Modern Applied Statistical Methods, distribution appears to be due to the finite 11(1), 255-260. support for binomial random variable X.

57