Smooth Distribution Function Estimation for Lifetime Distributions using Szasz-Mirakyan Operators
Ariane Hanebeck (Institute of Applied Mathematical Statistics, Technical University of Munich, [email protected])
Bernhard Klar (Institute of Stochastics, Karlsruhe Institute of Technology, [email protected])
January 29, 2021
In this paper, we introduce a new smooth estimator for continuous distribution functions on the positive real half-line using Szasz-Mirakyan operators, a construction similar to Bernstein's approximation theorem. We show that the proposed estimator outperforms the empirical distribution function in terms of asymptotic (integrated) mean squared error and generally compares favourably with other competitors in theoretical comparisons. In addition, we conduct simulations to demonstrate the finite-sample performance of the proposed estimator.
Keywords. Distribution function estimation, Nonparametric, Szasz-Mirakyan operator, Hermite estimator, Mean squared error, Asymptotic properties
1 Introduction
arXiv:2005.09994v5 [math.ST] 27 Jan 2021

This paper considers the nonparametric smooth estimation of continuous distribution functions on the positive real half-line. Arguably, such distributions are among the most important univariate probability models, occurring in diverse fields such as the life sciences, engineering, actuarial science, and finance, under various names such as life, lifetime, loss, or survival distributions. The well-known compendium of Johnson et al. (1994) treats in its first volume solely distributions on the positive half-line, with the exception of the normal and the Cauchy distribution. In the two volumes Johnson et al. (1994, 1995), as well as in the compendia on life and loss distributions of Marshall and Olkin (2007) and Hogg and Klugman (1984), respectively, an abundance of parametric models for the distribution of non-negative random variables and pertaining estimation methods can be found. However, there is a paucity of nonparametric estimation methods especially tailored to this situation. It is the aim of this paper to close this gap by introducing a new nonparametric estimator for distribution functions on [0, ∞) using Szasz-Mirakyan operators.

Let X_1, X_2, ... be a sequence of independent and identically distributed (i.i.d.) random variables having an underlying unknown distribution function F and associated density function f. In the case of parametric distribution function estimation, the model structure is already defined before
knowing the data. For example, it may be known that the distribution is of the form N(µ, σ²); the only goal is then to estimate the parameters, here µ and σ². In contrast, in the nonparametric setting, the model structure is not specified a priori but is determined only by the sample. In this paper, all the considered estimators are of nonparametric type. The goal is to investigate properties of a random sample and its underlying distribution. Of utmost importance is the probability P(a ≤ X_1 ≤ b) = F(b) − F(a), which can be estimated directly, without the need to integrate as in the density estimation setting. By taking the generalized inverse of F, it is also possible to calculate quantiles

    x_p = inf{x ∈ ℝ : F(x) ≥ p} = F^{-1}(p).

An important application of the inverse of F is the so-called inverse transform sampling. It can be used to generate further samples from the same distribution using the implication
    Y ∼ U[0, 1]  ⇒  F^{-1}(Y) ∼ X_1.

The best-known distribution function estimators with well-established properties are the empirical distribution function (EDF) and the kernel estimator. The EDF is the simplest way to estimate the underlying distribution function, given a finite random sample X_1, ..., X_n, n ∈ ℕ. It is defined by

    F_n(x) = (1/n) ∑_{i=1}^{n} I(X_i ≤ x),

where I is the indicator function. This estimator is obviously not continuous. The kernel distribution function estimator, however, is a continuous estimator. The univariate kernel density estimator is defined by

    f_{h,n}(x) = (1/(nh)) ∑_{i=1}^{n} K((x − X_i)/h),  x ∈ ℝ,

where the parameter h > 0 is called the bandwidth and K : ℝ → ℝ is a kernel that has to fulfill specific properties (see, e.g., Gramacki (2018)). It was first introduced by Rosenblatt (1956) and Parzen (1962). The idea is that the number of kernels is higher in regions with many samples, which leads to a higher density. The width and height of each kernel is determined by the bandwidth h; in the above case, the bandwidth is the same for all kernels. To estimate the distribution function, the kernel density estimator is integrated. Hence, the kernel distribution function estimator is of the form

    F_{h,n}(x) = ∫_{−∞}^{x} f_{h,n}(u) du = (1/(nh)) ∑_{i=1}^{n} ∫_{−∞}^{x} K((u − X_i)/h) du = (1/n) ∑_{i=1}^{n} 𝒦((x − X_i)/h),

where 𝒦(t) = ∫_{−∞}^{t} K(u) du is a cumulative kernel function. This estimator was first introduced by Yamato (1973). Different methods to choose the bandwidth in the case of the distribution function are given in Altman and Léger (1995), Bowman et al. (1998), Polansky and Baker (2000), and Tenreiro (2006).

The two previous estimators can estimate distribution functions on any arbitrary real interval. The Bernstein estimator, on the other hand, is designed for functions on [0, 1]. The goal of the Bernstein estimator is the estimation of a distribution function F with density f supported on [0, 1], given a finite random sample X_1, ..., X_n, n ∈ ℕ.
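As an illustration, the EDF and the kernel distribution function estimator defined above can be sketched in a few lines of Python. The Gaussian kernel, the bandwidth h, and the exponential sample are illustrative choices for this sketch, not prescriptions from the paper:

```python
import numpy as np
from math import erf, sqrt

def edf(sample, x):
    """Empirical distribution function F_n(x) = (1/n) * #{i : X_i <= x}."""
    sample = np.asarray(sample, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return (sample[None, :] <= x[:, None]).mean(axis=1)

def kernel_cdf(sample, x, h):
    """Kernel distribution function estimator F_{h,n}(x) with a Gaussian
    kernel, i.e. the cumulative kernel is the standard normal CDF."""
    sample = np.asarray(sample, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    std_normal_cdf = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    # F_{h,n}(x) = (1/n) * sum_i cumulative_kernel((x - X_i) / h)
    return std_normal_cdf((x[:, None] - sample[None, :]) / h).mean(axis=1)

# Example on an exponential sample (scale, size, and grid are arbitrary).
rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0, size=500)
xs = np.linspace(0.0, 4.0, 9)
print(edf(sample, xs))
print(kernel_cdf(sample, xs, h=0.3))
```

The EDF is a step function, while the kernel estimate is smooth; both are monotone and take values in [0, 1].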
The Bernstein estimator makes use of the following theorem.
Theorem. If u is a continuous function on [0, 1], then, as m → ∞,

    B_m(u; x) = ∑_{k=0}^{m} u(k/m) P_{k,m}(x) → u(x)
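Numerically, this convergence is easy to observe. The sketch below assumes that P_{k,m} denotes the standard Bernstein basis polynomial P_{k,m}(x) = C(m, k) x^k (1 − x)^(m−k), and evaluates B_m(u; x) for a test function u at increasing degrees m:

```python
import numpy as np
from math import comb

def bernstein(u, m, x):
    """Bernstein polynomial B_m(u; x) = sum_{k=0}^m u(k/m) * P_{k,m}(x),
    where P_{k,m}(x) = C(m, k) * x^k * (1 - x)^(m - k)."""
    k = np.arange(m + 1)
    binom = np.array([comb(m, j) for j in k], dtype=float)
    basis = binom * x**k * (1.0 - x)**(m - k)  # Bernstein basis P_{k,m}(x)
    return float(np.sum(u(k / m) * basis))

# The approximation error at a fixed point shrinks as the degree m grows.
u = lambda t: np.sin(np.pi * t)
for m in (10, 50, 200):
    print(m, abs(bernstein(u, m, 0.3) - u(0.3)))
```

Note that B_m reproduces affine functions exactly, since the basis polynomials sum to one and have mean k/m equal to x.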