Evaluating the CDF of the Skew Normal Distribution
Christine Amsler, Michigan State University
Alecos Papadopoulos, Athens University of Economics and Business ∗
Peter Schmidt, Michigan State University

April 3, 2020
Final version

Abstract

In this paper we consider various methods of evaluating the cdf of the Skew Normal distribution. This distribution arises in the stochastic frontier model because it is the distribution of the composed error, which is the sum (or difference) of a Normal and a Half Normal random variable. The cdf must be evaluated in models in which the composed error is linked to other errors using a Copula, in some methods of goodness of fit testing, or in the likelihood of models with sample selection bias. We investigate the accuracy of the evaluation of the cdf using expressions based on the bivariate Normal distribution, and also using simulation methods and some approximations. We find that the expressions based on the bivariate Normal distribution are quite accurate in the central portion of the distribution, and we propose several new approximations that are accurate in the extreme tails. By a simulated example we show that the use of approximations instead of the theoretical exact expressions may be critical in obtaining meaningful and valid estimation results.

Keywords: skew normal distribution, bivariate normal distribution, stochastic frontier, simulation, computational software.

JEL classification: C46, C87, C88, C13.

∗Corresponding author, e-mail: [email protected], ORCID ID: 0000-0003-2441-4550.

1 Introduction

This article deals with the evaluation of the cumulative distribution function of the Skew Normal distribution (hereafter SN-cdf). This is the distribution of the Normal / Half-Normal composed error in stochastic frontier (SF) models, and that is our main interest. However, the Skew Normal distribution also arises in a number of other contexts.
For example, the maximum and the minimum of two correlated bivariate standard Normal random variables follow Skew Normal distributions (Loperfido, 2002). As another example, if $X, Y$ are correlated bivariate Normal variables, then the distribution of $X \mid Y > 0$ is Skew Normal (Azzalini and Capitanio, 2014, p. 28). As a third example, in the Normal / Half Normal specification of the two-tier stochastic frontier model, the composed error is made up of three components and its density includes the difference of two Skew Normal cdfs (Papadopoulos, 2015). As a final more general example, the Skew Normal distribution is a useful way to model a moderately skewed regression error (which is often the case), and if one wants to account for sample selection bias, the SN-cdf will appear in the likelihood.

Our work is an extension of Amsler et al. (2019) (hereafter AST), who addressed the same problem but who considered a much more limited set of methods of evaluation of the SN-cdf than we do in this paper. They proposed a simulation-based method that is accurate in the lower tail, and an approximation due to Tsay et al. (2013) that is accurate in the upper tail. For some uses, such as testing goodness of fit using the Kolmogorov-Smirnov test, the tails are not important. However, for other uses they are important. For example, suppose that a Skew Normal random variable is linked to other random variables using a Copula such as the Gaussian Copula. Then the SN-cdf is an argument of the inverse of the standard Normal cdf. This will explode to infinity (or minus infinity) if the argument is exactly one (or zero), causing the estimation algorithm to break down. So it is important that the evaluations of the SN-cdf are not exactly zero or one, even unreasonably far into the tails. The methods considered by AST achieve this, but it is not clear how accurate they are in the central portion of the distribution, which is presumably where most of the observations will lie.
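
To illustrate why exact zeros and ones matter, here is a minimal Python sketch using only the standard library. The helper name `copula_argument` is hypothetical and not from the paper. It shows that in double precision $1 - 10^{-20}$ is stored as exactly 1, so the inverse standard Normal cdf has no finite value there. Python's `statistics.NormalDist.inv_cdf` raises an error at 0 and 1; other libraries may return plus or minus infinity instead.

```python
from statistics import NormalDist, StatisticsError

std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1


def copula_argument(cdf_value):
    """Return the inverse standard Normal cdf of cdf_value.

    Returns None when cdf_value is exactly 0 or 1 (or outside (0, 1)),
    where no finite value exists. Hypothetical helper for illustration.
    """
    try:
        return std_normal.inv_cdf(cdf_value)
    except StatisticsError:
        return None


# In double precision, 1 - 1e-20 cannot be told apart from 1.
upper_tail_value = 1.0 - 1e-20
print(upper_tail_value == 1.0)            # True
print(copula_argument(upper_tail_value))  # None: no finite quantile

# A tiny but nonzero lower-tail value still maps to a finite number.
print(copula_argument(1e-300))
```

This asymmetry is one reason to treat the two tails separately: values near zero keep their relative precision in floating point, while values near one do not.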
We will consider two other simulation-based methods, two other upper tail and two other lower tail approximations, and a variety of methods of evaluation of closed-form expressions based on the cdf of the bivariate Normal distribution. The numerically evaluated closed-form expressions turn out to be quite accurate in the central portion of the distribution, where “central portion” is defined rather liberally as the range where the cdf value is greater than $10^{-20}$ and less than $1 - 10^{-20}$. Our lower tail approximations are also quite accurate in the extreme lower tail (cdf value less than $10^{-20}$ but large enough to calculate, which, given standard floating-point double-precision arithmetic, means larger than about $10^{-310}$). Similarly, our upper tail approximations are quite accurate in the extreme upper tail (cdf value greater than $1 - 10^{-20}$, but small enough to calculate, which means smaller than about $1 - 10^{-310}$).

The plan of the paper is as follows. Section 2 will define notation and present some preliminary results. Sections 3, 4 and 5 will present the various methods of evaluation of the SN-cdf that we will consider. Section 6 will present and discuss the results of our evaluations. In Section 7 we provide a simulated example to show that accurate evaluation deep into the tails of the SN-cdf and the use of approximate expressions may be necessary to obtain meaningful and valid estimation results. Finally, Section 8 will give our conclusions. An Appendix and an on-line supplementary file complete the work.

2 Preliminaries

We will follow the AST notation. We consider the Skew Normal distribution with location parameter equal to zero; scale parameter $\sigma > 0$; and shape parameter $\lambda \in \mathbb{R}$. This distribution is characterized by the Skew Normal density $\mathrm{sn}_{\lambda,\sigma}(\varepsilon) = (2/\sigma)\,\phi(\varepsilon/\sigma)\,\Phi(\lambda\varepsilon/\sigma)$, where $\phi$ and $\Phi$ are the standard Normal pdf and cdf respectively. We want to calculate the Skew Normal cdf $P_{\lambda,\sigma}(Q) = P(\varepsilon \le Q) = \int_{-\infty}^{Q} \mathrm{sn}_{\lambda,\sigma}(\varepsilon)\,d\varepsilon$.

More specifically, we will consider the case of $\sigma = 1$ and therefore investigate the accuracy of the evaluation of $P_{\lambda,1}(Q)$, which we will write more simply as $P_\lambda(Q)$. This does not entail any loss of generality, because for $\sigma \ne 1$, one can use the fact that $P_{\lambda,\sigma}(Q) = P_\lambda(Q/\sigma)$. Similarly, we will consider only the case of $\lambda \ge 0$. For $\lambda < 0$ we can use the so-called “reflection property” of the distribution, that $P_{-\lambda}(Q) = 1 - P_\lambda(-Q)$.

The connection of the Skew Normal distribution to the SF model is as follows. The “composed error” is $\varepsilon = v + u$, where $v \sim N(0, \sigma_v^2)$, $u \sim N^+(0, \sigma_u^2)$, and $v$ and $u$ are independent. Let $\sigma^2 = \sigma_u^2 + \sigma_v^2$ and $\lambda = \sigma_u/\sigma_v$. Then $\varepsilon$ has the Skew Normal density $\mathrm{sn}_{\lambda,\sigma}(\varepsilon)$. This discussion is for the case of $\varepsilon = v + u$, which would be natural in a cost frontier, and follows the discussion in Tsay et al. (2013) and AST. In the case of a production frontier, as in the original papers of Aigner, Lovell, and Schmidt (1977) and Meeusen and van den Broeck (1977), we would want to consider $\varepsilon^* = v - u$. Then $\varepsilon^*$ has density $\mathrm{sn}_{\lambda,\sigma}(\varepsilon^*) = (2/\sigma)\,\phi(\varepsilon^*/\sigma)\,\Phi(-\lambda\varepsilon^*/\sigma)$; that is, we just change the sign of $\lambda$.

AST evaluated the cdf $P_\lambda(Q)$ by a simulation based on the representation

$$P_\lambda(Q) = E_u\left(\Phi[(Q - u)/\sigma_v]\right). \qquad (1)$$

Here $\sigma_v^2 = 1/(1 + \lambda^2)$, $\sigma_u^2 = 1 - \sigma_v^2 = \lambda^2/(1 + \lambda^2)$, $u \sim N(0, \sigma_u^2)^+$, and $E_u$ represents the expected value over the Half-Normal distribution of $u$. They estimated this by averaging $\Phi[(Q - u)/\sigma_v]$ over a large number of draws from the distribution of $u$. Specifically, they used 10,000,000 draws and got results that they regarded as reliable for negative values and small positive values of $Q$, but not for large positive values of $Q$. They created a large tabulation of the cdf as a function of $\lambda$ and $Q$ and suggested interpolation in this table. However, many people would probably prefer a simple calculation that is easier to program than an interpolation, which we will provide.

AST also considered an approximation due to Tsay et al.
(2013), p. 261, equations (12) and (13). AST concluded that this approximation is accurate for large positive values of $Q$ but not for negative values.[1] For the special case of $\lambda = 1$, AST also provided an “exact” result, namely $P_1(Q) = [\Phi(Q)]^2$.[2] They were unaware that this result was already known.[3] This result can be regarded as exact because it is not an approximation and it can be calculated very accurately over an extremely wide range of $Q$. It is useful as a check on the accuracy of the other methods of evaluating the SN-cdf.[4]

[1] Ashour and Abdel-hameed (2010) also presented closed-form approximations to the Skew Normal density and cdf, but these are piecewise functions whose partition of the support depends on the value of $\lambda$. They considered only the interval $[-4, 4]$.
[2] This is the cdf of the maximum of two i.i.d. standard Normal variables, and can also be obtained as the limiting case of the results in Loperfido (2002) mentioned earlier.
[3] Azzalini (1985), p. 174. This paper by Adelchi Azzalini has become by all accounts a classic of the statistical literature, and so the Mark Twain remark on classic works applies.
[4] Gupta and Chen (2001) also used it as a validation tool for their SN-cdf tables.
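
The simulation in equation (1), the reduction to $\sigma = 1$ and $\lambda \ge 0$, and the exact $\lambda = 1$ check can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the AST implementation: the function names are hypothetical, only the standard library is used, and the default of 200,000 draws (rather than the 10,000,000 used by AST) is chosen only to keep the run time short.

```python
import math
import random


def std_normal_cdf(x):
    """Standard Normal cdf, written with erfc to avoid underflow to zero in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def sn_cdf_sim(q, lam, n_draws=200_000, seed=12345):
    """Monte Carlo estimate of P_lambda(Q) from equation (1), for sigma = 1 and lam >= 0."""
    sigma_v = 1.0 / math.sqrt(1.0 + lam * lam)   # sigma_v^2 = 1 / (1 + lambda^2)
    sigma_u = lam / math.sqrt(1.0 + lam * lam)   # sigma_u^2 = lambda^2 / (1 + lambda^2)
    rng = random.Random(seed)                    # fixed seed, for reproducibility only
    total = 0.0
    for _ in range(n_draws):
        u = abs(rng.gauss(0.0, sigma_u))         # Half-Normal draw
        total += std_normal_cdf((q - u) / sigma_v)
    return total / n_draws


def sn_cdf(q, lam, sigma=1.0, **sim_options):
    """Reduce a general (lam, sigma) to the standard case by scaling and reflection."""
    z = q / sigma                                # P_{lam,sigma}(Q) = P_lam(Q / sigma)
    if lam < 0:
        # Reflection property: P_{-lam}(Q) = 1 - P_lam(-Q)
        return 1.0 - sn_cdf_sim(-z, -lam, **sim_options)
    return sn_cdf_sim(z, lam, **sim_options)


def sn_cdf_exact_lambda1(q):
    """Exact result for lambda = 1: P_1(Q) = [Phi(Q)]^2."""
    return std_normal_cdf(q) ** 2


if __name__ == "__main__":
    for q in (-1.0, 0.0, 1.5):
        print(q, sn_cdf(q, 1.0), sn_cdf_exact_lambda1(q))
```

Note that the subtraction in the reflection step loses precision when $P_\lambda(-Q)$ is extremely small, so this sketch is only meant for the central portion of the distribution.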