
STATISTICAL INFERENCE FOR A NEW CLASS OF SKEW T DISTRIBUTION AND ITS RELATED PROPERTIES

Doaa Basalamah

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

August 2017

Committee:

Wei Ning and Arjun Gupta, Advisors

Joseph Chao, Graduate Faculty Representative

Junfeng Shang

Copyright © August 2017
Doaa Basalamah
All rights reserved

ABSTRACT

Wei Ning and Arjun Gupta, Advisors

Generalized skew distributions have been widely studied in the literature, and numerous authors have developed various classes of these distributions; see Cordeiro and de Castro (2011). To provide a wide and flexible family of models that accounts for skewness and heavy tail weight in data automatically, Jones (2004) introduced the beta generated distribution as a generalization of the distribution of order statistics of a random sample from a distribution F, or, equivalently, by applying the inverse probability integral transformation to the beta distribution. Cordeiro and de Castro (2011) proposed a new class of distributions called the Kumaraswamy generalized distribution (KwF), which is capable of fitting skewed data that cannot be fitted well by existing distributions. Azzalini (1985) introduced the skew normal distribution as an extension of the normal distribution to accommodate asymmetry. Inspired by Azzalini's work, numerous papers have been published on the applications of skewed distributions. Among all skewed distributions, the skew t distribution received special attention after the introduction of the skew multivariate normal distribution by Azzalini and Dalla Valle (1996). In this study we introduce new generalizations of the skew t distribution based on the beta generalized distribution and on the Kumaraswamy generalized distribution. The new classes of distributions, which we call the beta skew t (BST) and the Kumaraswamy skew t (KwST) distributions, have the ability to fit skewed and heavy tailed data, and they are more general than the skew t distribution as they contain the skew t and some other important distributions as special cases. Related properties of the new distributions, such as mathematical properties, moments, and order statistics, are derived. A new approach using L-moments estimation, as well as classical maximum likelihood inference, is used to estimate the parameters of the proposed distributions. The proposed distributions are applied to simulated data and real data to illustrate the fitting procedure.
Further, we study the problem of analyzing a mixture of KwST distributions from the likelihood-based perspective. A computational technique using an EM-type algorithm is employed for computing the maximum likelihood estimates. The proposed methodology is illustrated by analyzing simulated and real data examples.

To my parents Maha Baabbad and Abdulrahman Basalamah... To my beloved husband Hosam Batarfi... To everyone who supported me throughout this journey... I dedicate this work.

ACKNOWLEDGMENTS

I am grateful to Dr. Arjun Gupta and Dr. Wei Ning, who supervised my dissertation, for their infinite support and patience. Every meeting I had with them confirmed to me that I made the right decision to be supervised by them. They always pushed my limits and proved to me that I am capable of obtaining great results. I would like to thank Dr. Gupta for accepting to supervise me. He made me believe in myself again after I nearly lost hope of pursuing a Ph.D. degree. I would like to express the deepest appreciation to Dr. Ning for the continuous support of my Ph.D. study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and the writing of this dissertation. I am also thankful to the other members of my dissertation committee, Dr. Junfeng Shang and Dr. Joseph Chao, for the time they took to review my dissertation. My deepest gratitude goes to the professors who inspired me during graduate school. I owe special thanks to my husband Hosam Batarfi for supporting me during this journey and believing in me. I also would like to thank my kids Mohammed and Ibrahim for helping me make this work possible by sacrificing much of our family time. I am most grateful to my family for their support, love, and patience, which provided me with the strength to overcome all the difficult times. Living far away from your family and friends would have been insufferable without being surrounded by kind people who made it less painful. Thanks to my friends Amani Al Ghamdi, Manar Al Amoudi, and to anyone who put a smile on my face. The financial support, encouragement, and guidance from the Saudi Arabian government made studying abroad possible and easier, so I would like to take this opportunity to thank everyone who works there.

TABLE OF CONTENTS Page

CHAPTER 1 LITERATURE REVIEW ...... 1
1.1 Introduction ...... 1
1.2 The Skew Normal Distribution ...... 1
1.3 The Student t Distribution ...... 3
1.4 The Skew t Distribution ...... 7
1.5 Dissertation Structure ...... 12

CHAPTER 2 THE BETA SKEW t DISTRIBUTION ...... 14
2.1 Introduction ...... 14
2.2 Density and Distribution Functions ...... 15
2.2.1 Expansion of the Density and Distribution ...... 17
2.3 Properties and Simulations ...... 17
2.3.1 Properties ...... 18
2.3.2 Graphical Illustration ...... 22
2.3.3 Simulations ...... 25
2.4 Moments ...... 27
2.5 Order Statistics ...... 31
2.6 Maximum Likelihood Estimation ...... 34
2.6.1 Illustrative Examples ...... 36
2.7 L-moments Estimation ...... 40
2.7.1 Theoretical and Sample L-moments ...... 40
2.7.2 L-moments Estimation ...... 44
2.7.3 Illustrative Examples ...... 45

CHAPTER 3 THE KUMARASWAMY SKEW t DISTRIBUTION ...... 48
3.1 Introduction ...... 48
3.2 Density and Distribution Functions ...... 49
3.2.1 Expansion of the Density Function ...... 50
3.3 Properties and Simulations ...... 52
3.3.1 Properties ...... 52
3.3.2 Graphical Illustrations ...... 57
3.3.3 Simulations ...... 59
3.4 Moments ...... 60
3.5 Order Statistics ...... 66
3.6 Maximum Likelihood Estimation ...... 70
3.6.1 Simulation Study ...... 71
3.6.2 Illustrative Examples ...... 72
3.7 L-moments ...... 75
3.7.1 Theoretical and Sample L-moments ...... 76
3.7.2 L-moments Parameter Estimation ...... 78
3.7.3 Illustrative Examples ...... 79

CHAPTER 4 MIXTURE MODELING USING TWO KUMARASWAMY SKEW t DISTRIBUTIONS ...... 82
4.1 Introduction ...... 82
4.2 The Kumaraswamy Skew t ...... 83
4.3 The EM Algorithm ...... 84
4.4 The Observed Information ...... 87
4.5 Simulation Studies ...... 89
4.6 Application ...... 94

CHAPTER 5 FINAL REMARKS AND DISCUSSION ...... 97

BIBLIOGRAPHY ...... 100

APPENDIX A SELECTED R PROGRAMS ...... 106

LIST OF FIGURES Figure Page

1.1 Standard skew normal density functions for different values of the shape parameter λ = −10, −3, −1, 0, 1, 3, and 10. ...... 3
1.2 Student t density function vs Cauchy(0,1) density function. ...... 4

1.3 Student t density function tr for different degrees of freedom r...... 4

1.4 st2(0, 1, λ) density for different shape parameter λ...... 11

1.5 str(0, 1, 2) density for different degrees of freedom r...... 12

2.1 BST(a, b = 5, λ = 1, r = 3) density curves as a varies. ...... 23
2.2 BST(a = 5, b, λ = −1, r = 3) density curves as b varies. ...... 23
2.3 BST(a = 5, b = 3, λ, r = 3) density curves as λ varies. ...... 24
2.4 BST(a = 5, b = 3, λ = −1, r) density curves as r varies. ...... 25
2.5 Histograms for random samples of size 1000 of BST distribution. ...... 26
2.6 Histogram and Q-Q plot for U.S. indemnity losses data. ...... 37
2.7 Histogram and fitted density curves to the U.S. indemnity losses data. ...... 38
2.8 Closer look at the histogram and fitted density curves to the U.S. indemnity losses data. ...... 38

2.9 BST vs. str MLE fitting to the U.S. indemnity losses dataset...... 39

2.10 Closer look of BST vs. str MLE fitting to the U.S. indemnity losses dataset. ...... 39
2.11 Fitted density of BST(a, b, µ, σ, λ, r). ...... 46
2.12 Fitted density of BST(a, b, µ, σ, λ, r) to the Danish fire losses data set. ...... 47
2.13 Closer look at the fitted density of BST(a, b, µ, σ, λ, r) to the Danish fire losses data set. ...... 47
3.1 KwST(a, 3, 1, 3) density as the parameter a varies. ...... 57
3.2 KwST(5, b, −1, 3) density as the parameter b varies. ...... 58
3.3 KwST(5, 3, λ, 3) density as the parameter λ varies. ...... 58
3.4 KwST(5, 3, −1, r) density as the degrees of freedom r varies. ...... 59
3.5 Histogram for random samples of size 500 of KwST variates. ...... 60
3.6 Fitting densities to GLD random sample. ...... 72
3.7 Histogram and Q-Q plot for Nidd river dataset. ...... 73
3.8 Histogram and fitted density curves to the Nidd river data. ...... 75
3.9 Fitted density curves to the Nidd river data. ...... 75
3.10 Fitted density of KwST(a, b, µ, σ, λ, r). ...... 80
3.11 Histogram and Q-Q plot for (fe) data set. ...... 80
3.12 Fitted density of KwST(a, b, µ, σ, λ, r). ...... 81

4.1 The density curve of the KwST.mix with two different parameter vectors: Ψ1 = (a1 = 1, b1 = 1, µ1 = −2, σ1 = 5, λ1 = 1, r1 = 2, a2 = 3, b2 = 7, µ2 = 10, σ2 = 1, λ2 = 3, r2 = 2, π1 = 0.9) on the left and Ψ2 = (a1 = 1, b1 = 1, µ1 = −2, σ1 = 4, λ1 = 2, r1 = 3, a2 = 3, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.62) on the right. ...... 84
4.2 The density curve of the KwST.mix with two different parameter vectors: Ψ1 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 0, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.75) on the left and Ψ2 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 4, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.5) on the right. ...... 85
4.3 Fitted densities of mixture of two components of KwST.mix, st.mix and the snorm.mix models to the mixture of GLD random sample. ...... 91
4.4 Fitted densities of mixture of two components of KwST.mix, st.mix and the snorm.mix models to the mixture of skew t sample. ...... 94
4.5 Fitted densities of mixture of two components of Kumaraswamy skew t (KwST.mix), skew t (st.mix) and skew normal (snorm.mix) models to the bmi data. ...... 96

LIST OF TABLES Table Page

2.1 Summary description of the U.S. indemnity losses data set. ...... 37
2.2 Parameter estimations for the U.S. indemnity losses data set. ...... 37
2.3 Parameter estimations for the U.S. indemnity losses data set. ...... 39

2.4(a) Estimation of the L-location (L1), L-scale (L2), L-skewness (τ3), and L-kurtosis (τ4) of BST(a, b, λ, r) for different values of a, b, and λ. ...... 43
2.4(b) Estimation of the L-location (L1), L-scale (L2), L-skewness (τ3), and L-kurtosis (τ4) of BST(a, b, λ, r) random variable for different values of a, b, and r. ...... 44
2.5 Parameters estimation of BST(a, b, µ, σ, λ, r) using the method of L-moments and MLE. ...... 45
2.6 BST(a, b, µ, σ, λ, r) parameters estimation of Danish fire losses data. ...... 46

3.1(a) Moments estimation of the mean (µKwST), variance (σ²KwST), skewness (γ1), and kurtosis (γ2) of KwST(a, b, λ, r) random variable for different values of a, b, and λ. ...... 66
3.1(b) Moments estimation of the mean (µKwST), variance (σ²KwST), skewness (γ1), and kurtosis (γ2) of KwST(a, b, λ, r) random variable for different values of a, b, and r. ...... 67
3.2 Parameter estimations for random sample generated from GLD distribution. ...... 72
3.3 Summary description of the Nidd river data. ...... 73
3.4 Parameter estimations for the Nidd river dataset. ...... 74
3.5 Parameter estimations for the Nidd river dataset. ...... 74

3.6(a) Estimation of the L-mean (L1), L-variance (L2), L-skewness (τ3), and L-kurtosis (τ4) of KwST(a, b, λ, r) random variable for different values of a, b, and λ. ...... 78
3.6(b) Estimation of the L-mean (L1), L-variance (L2), L-skewness (τ3), and L-kurtosis (τ4) of KwST(a, b, λ, r) random variable for different values of a, b, and r. ...... 79

3.7 Parameters estimation of KwST(a, b, µ, σ, λ, r) using an st3(µ = 2, σ = 1, λ = 2) random sample. ...... 80
3.8 Parameters estimation of KwST(a, b, µ, σ, λ, r) using AIS (plasma) data set. ...... 81

4.1 Parameter estimations for samples generated from the mixture of GLD densities. . 91

4.2 Computed KL divergence I(1 : 2) and overlapping coefficient δ(f1, f2). ...... 92
4.3 Parameter estimations for samples generated from the mixture of skew t densities. ...... 93
4.4 Summary description of the percent bmi data. ...... 95
4.5 Parameter estimations for the percent bmi data. ...... 95

CHAPTER 1 LITERATURE REVIEW

1.1 Introduction

The normal distribution is a common continuous probability model which is used widely to analyze data in many applications. It has many useful applications due to the central limit theorem. The normal distribution is symmetric about its mean, and it ranges over the entire real line. Therefore, it may not be the best option for data that are inherently positive or skewed. Roberts (1966) introduced the skew normal distribution as an example of a weighted model, but he did not use the term "skew normal" then. As a generalization of the normal distribution, Azzalini (1985) formalized the skew normal distribution for modeling asymmetric data. Since then, the literature has expanded enormously to extend Azzalini's methodology to skewed distributions other than the normal distribution.

1.2 The Skew Normal Distribution

Azzalini (1985) studied the univariate skew normal distribution as an extension of the normal distribution to accommodate asymmetry. This new class of distributions shares similar properties with the normal distribution. The skew normal distribution, denoted by SN, introduces a shape parameter λ to model skewness. In this section we introduce the univariate skew normal distribution and give some of its important properties.

Definition 1.2.1. (Standard Skew Normal Distribution, Azzalini (1985)) A random variable X is said to have the standard skew normal distribution with shape parameter λ ∈ ℝ if it has the pdf given by:

f(x; λ) = 2φ(x)Φ(λx), −∞ < x < ∞, (1.2.1)

where φ(·) and Φ(·) denote the probability density function and the distribution function of the standard normal distribution, respectively, and we say X ∼ SN(λ).

A location-scale extension of the skew normal distribution has been defined as follows.

Definition 1.2.2. (Skew Normal Distribution, Azzalini (1985)) Consider a linear transformation Y = µ + σX, where X ∼ SN(λ). Then the random variable Y is said to have the skew normal distribution with location parameter µ, scale parameter σ, and shape parameter λ if it has the pdf given by:

f(y; µ, σ, λ) = \frac{2}{σ}\,φ\!\left(\frac{y − µ}{σ}\right)Φ\!\left(λ\,\frac{y − µ}{σ}\right), (1.2.2)

denoted by Y ∼ SN(µ, σ, λ), where −∞ < y < ∞, µ, λ ∈ ℝ and σ > 0.
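As a quick numerical illustration, the density in (1.2.2) can be evaluated directly from the standard normal pdf and cdf. The following is a minimal Python sketch, assuming SciPy is available; the function name `skew_normal_pdf` is ours, not part of any library:

```python
from scipy.stats import norm

def skew_normal_pdf(y, mu=0.0, sigma=1.0, lam=0.0):
    """Skew normal density (2/sigma) * phi(z) * Phi(lam*z), with z = (y - mu)/sigma."""
    z = (y - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)
```

With lam = 0 the factor Φ(λz) equals 1/2 and the expression collapses to the N(µ, σ) density, consistent with the symmetric special case.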

Properties of the skew normal distribution have been studied by Azzalini and other researchers. For further discussion of the skew normal distribution and proofs of the properties, readers are referred to Azzalini and Capitanio (2014). We give some basic properties of the skew normal distribution without proofs, as follows.

Proposition 1.2.1. Properties of the Skew Normal Distribution Let X ∼ SN(µ, σ, λ). Then,

(a) If µ = 0 and σ = 1, then X ∼ SN(λ).

(b) If X ∼ SN(λ), then X² ∼ χ²₁.

(c) If λ = 0, then X ∼ N(µ, σ).

(d) As λ → ±∞, X converges to ±|Z|, where Z ∼ N(µ, σ).

(e) As |λ| increases, the skewness of the distribution increases.

(f) The measure of skewness of X, γ₁(X), ranges between −0.9953 and 0.9953.

(g) The measure of kurtosis of X, γ₂(X), ranges between 0 and 0.869.

Figure 1.1 shows the standard skew normal density function SN(λ) for different values of the shape parameter λ = −10, −3, −1, and 0 in the left hand figure, and λ = 0, 1, 3, and 10 in the right hand figure. It shows that for positive values of λ the skew normal density curve is skewed to the right, and for negative λ the density curve is skewed to the left. Further, when λ = 0 the skew normal density curve overlaps the standard normal density curve.

Figure 1.1: Standard skew normal density functions for different values of the shape parameter λ = −10, −3, −1, 0, 1, 3, and 10.

1.3 The Student t Distribution

The Student t distribution is a symmetric, bell-shaped distribution similar to the normal distribution. However, it has heavier tails than the normal distribution, which makes it more prone to producing values that fall far from its mean. The Student t distribution is the second most popular distribution, after the normal distribution, due to its application in estimating the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown. It is parametrized by one parameter called the degrees of freedom, denoted by r. For finite values of the degrees of freedom r, the tails of the density function decay as an inverse power of order r + 1. When the degrees of freedom r = 1, the Student t distribution reduces to the Cauchy(0,1) distribution, while as the degrees of freedom tends to infinity the distribution converges to the normal distribution. Figure 1.2 shows that the Student t density with degrees of freedom r = 1 overlaps the Cauchy(0,1) density curve, and Figure 1.3 shows the convergence of the Student t density to the normal density as the degrees of freedom r approaches ∞.

Figure 1.2: Student t density function vs Cauchy(0,1) density function.

Figure 1.3: Student t density function tr for different degrees of freedom r.

There are different approaches to derive the Student t distribution. It can be derived as a mixture of normal and inverted gamma distributions, as a scale mixture of normal distributions, or as a predictive distribution in the Bayesian approach. Another popular method is to derive the Student t distribution from the normal distribution. This method is useful in generating random samples for simulation studies. Let Z₀, Z₁, Z₂, ..., Z_r be independent and identically distributed standard normal random variables. Let Y = \sum_{i=1}^{r} Z_i^2. The random variable Y has been studied in detail in the literature and is known to follow the chi-squared distribution with r degrees of freedom, denoted by χ²_r. Let X = Z₀/\sqrt{Y/r}. This random variable is known to have the Student t distribution with r degrees of freedom.
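The construction above translates directly into a sampling routine. The following Python sketch assumes NumPy; the name `student_t_samples` is our own:

```python
import numpy as np

def student_t_samples(r, size, seed=None):
    """Draw Student t variates as X = Z0 / sqrt(Y/r), where Y is the sum of r
    squared iid standard normals, i.e. a chi-squared with r degrees of freedom."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal(size)
    y = (rng.standard_normal((r, size)) ** 2).sum(axis=0)  # Y ~ chi^2_r
    return z0 / np.sqrt(y / r)
```

For r = 10, a large sample's variance should be close to r/(r − 2) = 1.25, matching Proposition 1.3.1(f) below in spirit.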

Definition 1.3.1. (Student t Distribution) A random variable X is said to have the Student t distribution with degrees of freedom r if it has the pdf given by:

t(x; r) = \frac{1}{\sqrt{r}\,B(\frac{r}{2}, \frac{1}{2})}\left(1 + \frac{x^2}{r}\right)^{-\frac{r+1}{2}}, −∞ < x < ∞, (1.3.1)

where B(a, b) denotes the beta function defined by the integral

B(a, b) = \int_0^1 x^{a−1}(1 − x)^{b−1}\,dx ≡ \frac{Γ(a)Γ(b)}{Γ(a + b)}, (1.3.2)

where a, b > 0, the degrees of freedom r > 0, and we say X ∼ t_r.

The Student t distribution can be generalized to a three parameter location-scale family defined as follows.

Definition 1.3.2. (Non-Standard Student t Distribution) Consider a linear transformation Y = µ + σX, where X ∼ t_r. Then the random variable Y is said to have the non-standard Student t distribution with location parameter µ, scale parameter σ, and degrees of freedom parameter r if it has the pdf given by:

t(y; µ, σ, r) = \frac{1}{σ\sqrt{r}\,B(\frac{r}{2}, \frac{1}{2})}\left(1 + \frac{1}{r}\left(\frac{y − µ}{σ}\right)^2\right)^{-\frac{1+r}{2}}, (1.3.3)

denoted by Y ∼ t_r(µ, σ), where −∞ < y < ∞, µ ∈ ℝ and σ, r > 0.

Properties of the non-standard Student t distribution have been well studied in the literature. We give some basic properties of the non-standard Student t distribution without proofs.

Proposition 1.3.1. Properties of the Non-Standard Student t Distribution

Let X ∼ t_r(µ, σ). Then,

(a) If µ = 0 and σ = 1, then X ∼ t_r.

(b) If r ≤ k, then E(X^k) does not exist.

(c) E(X^k) = 0, for all odd k with r > k.

(d) E(X^k) = r^{k/2}\,\frac{Γ(\frac{k+1}{2})\,Γ(\frac{r−k}{2})}{Γ(\frac{1}{2})\,Γ(\frac{r}{2})}, for all even k with r > k.

(e) The mean is E(X) = 0, for r > 1.

(f) The variance is Var(X) = \frac{r}{r−2}, for r > 2.

(g) The measure of skewness of X is γ₁(X) = 0, for r > 3.

(h) The measure of kurtosis of X is γ₂(X) = \frac{3(r−2)}{r−4}, for r > 4.

(i) The Student t distribution is unimodal with maximum density \frac{1}{\sqrt{r}\,B(\frac{r}{2}, \frac{1}{2})}.
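Properties (c) and (d) are easy to check numerically. The following Python sketch uses only the standard library; `t_moment` is a hypothetical helper name of our own:

```python
from math import gamma

def t_moment(k, r):
    """E(X^k) for X ~ t_r: zero for odd k, closed form for even k, both requiring r > k."""
    if r <= k:
        raise ValueError("the moment of order k exists only for r > k")
    if k % 2 == 1:
        return 0.0
    return r ** (k / 2) * gamma((k + 1) / 2) * gamma((r - k) / 2) / (
        gamma(0.5) * gamma(r / 2))
```

For example, t_moment(2, 5) recovers Var(X) = 5/3 from property (f), and the ratio t_moment(4, r) / t_moment(2, r)² recovers the kurtosis 3(r − 2)/(r − 4) from property (h).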

Since the moments of the Student t distribution do not always exist, as seen in part (b) of Proposition 1.3.1, this distribution does not possess a moment generating function. For more information regarding the moment generating function of the Student t distribution, readers are referred to Mood et al. (1974), among others. The characteristic function of this distribution has been a topic of interest for researchers from both theoretical and applied points of view. Ifram (1970) derived the characteristic function of the Student t distribution as

ψ_X(t) = E(e^{itX}) = \frac{1}{B(\frac{r}{2}, \frac{1}{2})}\int_{−∞}^{∞} e^{it\sqrt{r}\,x}\,(1 + x^2)^{−(\frac{r}{2} + \frac{1}{2})}\,dx, (1.3.4)

where i denotes the imaginary unit. Mitra (1978) obtained the characteristic function of the Student t distribution as given by

ψ_X(t) = e^{−|\sqrt{r}\,t|}\sum_{j=0}^{m−1} c_{j,(m−1)}\,|\sqrt{r}\,t|^{j}, (1.3.5)

where m = \frac{r+1}{2} and the c_{j,m}'s satisfy the following recurrence relations:

c_{0,m} = 1,
c_{1,m} = 1,
c_{(m−1),m} = \frac{1}{1 \cdot 3 \cdots (2m−5)(2m−3)}, (1.3.6)
c_{j,m} = \frac{c_{(j−1),(m−1)} + (2m−3−j)\,c_{j,(m−1)}}{2m−3}, 1 ≤ j ≤ m − 1.

Pestana (1977) developed the characteristic function of the Student t distribution further by providing comments and corrections to Ifram's (1970) results. Further, Dreier and Kotz (2002) derived the characteristic function of the Student t distribution as given by

ψ_X(t) = \frac{2^{r}\,r^{r/2}}{Γ(r)}\int_0^{∞} e^{−\sqrt{r}(2x+|t|)}\left[x(x + |t|)\right]^{\frac{r}{2} − \frac{1}{2}}\,dx, t ∈ ℝ. (1.3.7)

Definition 1.3.3. A random variable X is said to have a half t distribution with location parameter µ, scale parameter σ, and degrees of freedom r if it has the pdf given by:

h(x; µ, σ, r) = 2\,\frac{Γ(\frac{r+1}{2})}{Γ(\frac{r}{2})}\sqrt{\frac{σ}{rπ}}\left[1 + \frac{1}{r}\left(\sqrt{σ}(x − µ)\right)^2\right]^{−\frac{r+1}{2}}, (1.3.8)

for x > µ, −∞ < µ < ∞, σ, r > 0, denoted as X ∼ |t_r|.

Note that as r → ∞, the half t distribution approaches the half normal distribution. For additional information about the half t distribution, readers are referred to Psarakis and Panaretos (1990), among others.

1.4 The Skew t Distribution

The Student t distribution is more likely than the normal to produce values that fall far from its mean, and is thus better suited to capturing heavy tailed data sets. However, it is a symmetric distribution that cannot capture asymmetry. To accommodate asymmetry and long tailed data, Hansen (1994) introduced the so-called skewed t distribution while maintaining the property of a zero mean and variance equal to one. His skew t distribution is derived as a generalization of the Student t distribution as follows:

f(x; λ, r) = b\,\frac{Γ(\frac{r+1}{2})}{\sqrt{π(r−2)}\,Γ(\frac{r}{2})}\left(1 + \frac{ζ^2}{r−2}\right)^{−\frac{r+1}{2}}, (1.4.1)

where

ζ = (bx + a)/(1 − λ), if x < −a/b,
ζ = (bx + a)/(1 + λ), if x ≥ −a/b.

The constant terms a and b are defined by

a = 4λc\,\frac{r−2}{r−1},
b^2 = 1 + 3λ^2 − a^2,

and

c = \frac{Γ(\frac{r+1}{2})}{\sqrt{π(r−2)}\,Γ(\frac{r}{2})}.

In this distribution, 2 < r < ∞ denotes the degrees of freedom parameter and −1 < λ < 1 is the asymmetry parameter. A further extension of the Student t distribution is given by Frees and Valdez (1998). They proposed a general approach to generating asymmetric distributions and specialized it to the case of the Student t distribution via

f(x; λ, r) = \frac{2}{λ + \frac{1}{λ}}\left[t(xλ; r)\,1_{(−∞,0)}(x) + t\!\left(\frac{x}{λ}; r\right)1_{(0,∞)}(x)\right].

Gupta (2003) defined the skew multivariate t distribution using a pair of independent standard skew normal and chi-squared random variables. Jones and Faddy (2003) proposed a tractable skew t distribution given by

f(x; a, b) = \frac{1}{2^{a+b−1}B(a, b)(a + b)^{1/2}}\left\{1 + \frac{x}{(a + b + x^2)^{1/2}}\right\}^{a+\frac{1}{2}}\left\{1 − \frac{x}{(a + b + x^2)^{1/2}}\right\}^{b+\frac{1}{2}}, (1.4.2)

where B(·, ·) denotes the beta function and a, b ∈ ℝ⁺. Their distribution reduces to the t distribution with 2a degrees of freedom when a = b. Moreover, this skew t distribution is skewed to the left when a < b and skewed to the right when a > b. Another popular approach to generating asymmetric distributions was proposed by Azzalini and Capitanio (2003). They defined a skew t variate as a scale mixture of skew normal and chi-squared variables. They defined the skew t distribution as follows.

Definition 1.4.1. (Skew t Distribution, Azzalini and Capitanio (2003)) Let Y ∼ SN(λ) be independent of Z ∼ χ²_r. Let

X = \frac{Y}{\sqrt{Z/r}}. (1.4.3)

Then the random variable X is said to have the skew t distribution with shape parameter λ ∈ ℝ and degrees of freedom r > 0 if it has the pdf given by:

f(x; λ, r) = 2t(x; r)\,T\!\left(λx\sqrt{\frac{r + 1}{x^2 + r}};\ r + 1\right), −∞ < x < ∞, (1.4.4)

where t(·) and T(·) are the probability density function and the distribution function of the standard Student t distribution, respectively, and we denote this as X ∼ st_r(λ).
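Definition 1.4.1 gives an immediate simulation recipe. The Python sketch below assumes NumPy; the SN(λ) draw uses the standard representation δ|U₀| + √(1 − δ²)U₁ with δ = λ/√(1 + λ²), which is an implementation choice of ours, and `skew_t_samples` is our own name:

```python
import numpy as np

def skew_t_samples(lam, r, size, seed=None):
    """Draw st_r(lam) variates as X = Y / sqrt(Z/r), with Y ~ SN(lam), Z ~ chi^2_r."""
    rng = np.random.default_rng(seed)
    delta = lam / np.sqrt(1.0 + lam ** 2)
    u0, u1 = rng.standard_normal((2, size))
    y = delta * np.abs(u0) + np.sqrt(1.0 - delta ** 2) * u1  # Y ~ SN(lam)
    z = rng.chisquare(r, size)                               # Z ~ chi^2_r
    return y / np.sqrt(z / r)
```

A large sample's mean should agree with the formula E(X) = b_r δ given in Proposition 1.4.1 for r > 1.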

Consider a linear transformation Y = µ + σX, where X ∼ st_r(λ). Then the random variable Y is said to have the skew t distribution with location parameter µ, scale parameter σ, shape parameter λ ∈ ℝ, and degrees of freedom r > 0 if it has the pdf given by:

f(y; µ, σ, λ, r) = \frac{2}{σ}\,t\!\left(\frac{y − µ}{σ}; r\right)T\!\left(λ\,\frac{y − µ}{σ}\sqrt{\frac{r + 1}{(\frac{y − µ}{σ})^2 + r}};\ r + 1\right), (1.4.5)

denoted by Y ∼ st_r(µ, σ, λ), where −∞ < y < ∞, µ, λ ∈ ℝ and σ, r > 0. Arellano-Valle and Genton (2005) discussed generalized skew distributions in the multivariate setting, including the skew t. Huang and Chen (2006) studied generalized skew t distributions and used them in data analysis. Hasan (2013) presented a new approach to define the non-central skew t distribution. Shafiei and Doostparast (2014) introduced the Balakrishnan skew t distribution and its associated statistical characteristics, to name a few. Azzalini and Capitanio (2014) studied the skew t distribution in detail and discussed its moments and some of its properties. We present some basic properties of the skew t distribution without proofs.

Proposition 1.4.1. Properties of the Skew t Distribution Let X ∼ ST(µ, σ, λ, r). Then,

(a) If µ = 0 and σ = 1, then X ∼ st_r(λ).

(b) If λ = 0, then X ∼ t_r(µ, σ).

(c) As λ → ±∞, X converges to ±|t_r|.

(d) As r → ∞, X converges to SN(µ, σ, λ).

(e) If X ∼ st_r(λ), then X² ∼ F(1, r), where F(r₁, r₂) denotes Snedecor's F distribution with r₁ and r₂ degrees of freedom.

(f) The skew t density is unimodal.

(g) If r ≤ k, then E(X^k) does not exist.

(h) The kth moment is E(X^k) = \frac{(r/2)^{k/2}\,Γ(\frac{r−k}{2})}{Γ(\frac{r}{2})}\,E(Z^k), where r > k and Z ∼ SN(λ).

(i) The mean is E(X) = µ + σb_rδ, if r > 1.

(j) The variance is Var(X) = σ^2\left[\frac{r}{r−2} − (b_rδ)^2\right], if r > 2.

(k) The measure of skewness of X is γ₁(X) = \frac{b_rδ}{\left[\frac{r}{r−2} − (b_rδ)^2\right]^{3/2}}\left[\frac{r(3 − δ^2)}{r−3} − \frac{3r}{r−2} + 2(b_rδ)^2\right], if r > 3.

(l) γ₁(X) ranges between −4 and 4 if r > 4, but covers the whole real line if we only require r > 3.

(m) The measure of kurtosis of X is

γ₂(X) = \frac{1}{\left[\frac{r}{r−2} − (b_rδ)^2\right]^{2}}\left[\frac{3r^2}{(r−2)(r−4)} − \frac{4(b_rδ)^2 r(3 − δ^2)}{r−3} + \frac{6(b_rδ)^2 r}{r−2} − 3(b_rδ)^4\right] − 3, if r > 4,

where b_r = \frac{\sqrt{r}\,Γ(\frac{r−1}{2})}{\sqrt{π}\,Γ(\frac{r}{2})}, if r > 1, and δ = \frac{λ}{\sqrt{1 + λ^2}} ∈ (−1, 1).
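Properties (i) and (j) are straightforward to evaluate numerically. The following Python sketch uses only the standard library; the name `skew_t_mean_var` is ours:

```python
from math import gamma, pi, sqrt

def skew_t_mean_var(mu, sigma, lam, r):
    """Mean and variance of ST(mu, sigma, lam, r) via properties (i)-(j); needs r > 2."""
    delta = lam / sqrt(1.0 + lam ** 2)
    b_r = sqrt(r) * gamma((r - 1) / 2.0) / (sqrt(pi) * gamma(r / 2.0))
    mean = mu + sigma * b_r * delta
    var = sigma ** 2 * (r / (r - 2.0) - (b_r * delta) ** 2)
    return mean, var
```

At λ = 0 (so δ = 0) the formulas collapse to the Student t values E(X) = µ and Var(X) = σ²r/(r − 2), consistent with Proposition 1.3.1.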

Like the Student t distribution, the skew t distribution does not possess a moment generating function. Figures 1.4 and 1.5 illustrate the effect of the shape parameter λ and the degrees of freedom r, respectively, on the shape of the skew t density. Similar to the skew normal density, for positive shape parameter λ the distribution is skewed to the right, and for negative λ it is skewed to the left. Because we fixed the location and scale parameters at µ = 0 and σ = 1, when λ = 0 we obtain the Student t density. We also see that the skew t density approaches the half t density as λ approaches ∞. For different values of the degrees of freedom r = 3, 10, 20, 60, and 120, we note that the skew t density approaches the skew normal density as r approaches ∞, as presented in Figure 1.5.

Figure 1.4: st2(0, 1, λ) density for different values of the shape parameter λ.

Figure 1.5: str(0, 1, 2) density for different degrees of freedom r.

1.5 Dissertation Structure

Using Azzalini and Capitanio's (2003) definition of the skew t distribution, we introduce new generalizations of the skew t distribution and study their statistical inference. These new classes of distributions are more flexible for modeling some data sets than the skew t distribution, as they contain the latter as a special case. This dissertation is organized as follows.

In Chapter 2, we develop the beta skew t distribution, denoted by BST, based on the beta generalized distribution. We study the related properties of the BST distribution such as mathematical properties, moments, and order statistics. Further, we study the maximum likelihood and the L-moments inference for the proposed distribution. Finally, the proposed distribution is applied to real data to illustrate the fitting procedure.

In Chapter 3, we introduce a new class of distributions called the Kumaraswamy skew t distribution (KwST). This distribution is derived based on the Kumaraswamy generalized distribution. Related properties of the KwST distribution such as mathematical properties, moments, and order statistics are derived in detail. A new approach of statistical inference using L-moments, as well as classical maximum likelihood inference, is used to estimate the proposed distribution's parameters. Using simulated data and real data, we illustrate the superiority of the KwST distribution proposed here as compared with some of its sub-models using both parameter estimation methods: MLE and L-moments.

In Chapter 4, we study the problem of analyzing a mixture of KwST distributions from the likelihood-based perspective. First, we provide a brief introduction to the mixture model problem in general. Then, we apply the definition of a finite mixture model to the Kumaraswamy skew t distribution to construct a mixture of two KwST components. A computational technique using an EM-type algorithm is employed for computing the maximum likelihood estimates. Simulation and real data analysis are conducted to show the performance of the mixture model.

In Chapter 5, we provide some discussion and final remarks on the study.

CHAPTER 2 THE BETA SKEW t DISTRIBUTION

2.1 Introduction

Azzalini (1985) introduced the univariate skew normal distribution as an extension of the normal distribution to accommodate asymmetry. Inspired by Azzalini's work, numerous studies have been conducted on the applications of skewed distributions. Among all skewed distributions, the skew t distribution received special attention after the introduction of the skew multivariate normal distribution by Azzalini and Dalla Valle (1996). Gupta (2003) defined the skew multivariate t distribution using a pair of independent standard skew normal and chi-squared random variables. Azzalini and Capitanio (2003) defined a skew t variate as a scale mixture of skew normal and chi-squared variables. Several authors have studied possible extensions and generalizations of the skew t distribution. Arellano-Valle and Genton (2005) discussed generalized skew distributions in the multivariate setting, including the skew t. Huang and Chen (2006) studied generalized skew t distributions and used them in data analysis. Hasan (2013) presented a new approach to define the non-central skew t distribution. Shafiei and Doostparast (2014) introduced the Balakrishnan skew t distribution and its associated statistical characteristics, to name a few.

To provide a wide and flexible family to model data that accounts for skewness and heavy tail weight, Jones (2004) introduced the beta generated distribution as a generalization of the distribution of order statistics of a random sample from a distribution F, or, equivalently, by applying the inverse probability integral transformation to the beta distribution. The beta normal distribution was introduced by Eugene et al. (2002). In their work they studied the shape properties of the beta normal distribution as well as estimation of the parameters using the maximum likelihood method. Silva et al. (2010) proposed the beta modified Weibull distribution. Cordeiro and de Castro (2011) studied the beta Weibull distribution and its properties. Cordeiro et al. (2011) derived a closed form expression for the moments of the class of beta generalized distributions. Rêgo and Nadarajah (2011) provided more detailed properties of the beta normal distribution. As a generalization of the skew normal, Mameli and Musio (2013) introduced the beta skew normal distribution, to name a few.

In this chapter we introduce a new generalization of the skew t distribution based on the beta generalized distribution. The new class of distributions, called the beta skew t (BST), has the ability to fit skewed and heavy tailed data and is more general than the skew t distribution, as it contains the skew t distribution as a special case. Related properties of the new distribution, such as moments and order statistics, are derived. The proposed distribution is applied to real data to illustrate the fitting procedure using the maximum likelihood method and the L-moments method. Further, parameter estimation for simulated and real life data is conducted to illustrate the advantage of the L-moments method over the MLEs.

2.2 Density and Distribution Functions

For a continuous distribution F with the density function f and parameters a > 0 and b > 0,

Jones (2004) defined the density of the beta generated distribution gF by:

g_F(x; a, b) = \frac{1}{B(a, b)} f(x) F(x)^{a-1} (1 - F(x))^{b-1},    (2.2.1)

where B(a, b) is the complete beta function defined in (1.3.2). The distribution function GF is given by:

G_F(x; a, b) = I_{F(x)}(a, b),    (2.2.2)

where I_{F(x)}(a, b) is the incomplete beta function ratio defined by

I_{F(x)}(a, b) = \frac{B_{F(x)}(a, b)}{B(a, b)}, \quad 0 \le F(x) \le 1,    (2.2.3)

and B_{F(x)}(a, b) is the incomplete beta function defined by

B_{F(x)}(a, b) = \int_0^{F(x)} z^{a-1} (1 - z)^{b-1}\, dz.    (2.2.4)

Thus, the distribution function GF can be written as

G_F(x; a, b) = \frac{1}{B(a, b)} \int_0^{F(x)} z^{a-1} (1 - z)^{b-1}\, dz, \quad 0 \le F(x) \le 1.    (2.2.5)

Throughout this dissertation we denote by t_r the Student t distribution with cdf T(x; r) and pdf t(x; r) defined in (1.3.1), by st_r(λ) the skew t distribution with cdf F(x; λ, r) and pdf f(x; λ, r) defined in (1.5.3), by Kw(a, b) the Kumaraswamy distribution, and by KwST(a, b, λ, r) the Kumaraswamy skew t distribution with pdf g(x; a, b, λ, r) and cdf G(x; a, b, λ, r) as in (3.2.3) and (3.2.4), respectively. Replacing F(x) by F(x; λ, r) in (2.2.5), we define the beta skew t distribution, denoted by BST(a, b, λ, r), as follows.

Definition 2.2.1. A random variable X is said to have the beta skew t distribution if it has the distribution function given by

G_F(x; a, b, λ, r) = \frac{1}{B(a, b)} \int_0^{F(x; λ, r)} z^{a-1} (1 - z)^{b-1}\, dz,    (2.2.6)

and probability density function

g_F(x; a, b, λ, r) = \frac{1}{B(a, b)} f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1},    (2.2.7)

where −∞ < x < ∞, a, b > 0, λ ∈ ℝ, and the degrees of freedom r > 0.
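The density and distribution function just defined are straightforward to evaluate numerically. The sketch below is in Python with SciPy (the dissertation's own computations use R); it assumes the Azzalini–Capitanio form of the skew t density for f(x; λ, r), since (1.5.3) is not reproduced in this chapter, and all helper names are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def st_pdf(x, lam, r):
    """Skew t density f(x; lambda, r), assuming the Azzalini-Capitanio form."""
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

def st_cdf(x, lam, r):
    """Skew t distribution function F(x; lambda, r), by numerical integration."""
    return quad(st_pdf, -np.inf, x, args=(lam, r))[0]

def bst_pdf(x, a, b, lam, r):
    """BST density (2.2.7): beta-generated density with skew t baseline."""
    F = st_cdf(x, lam, r)
    return st_pdf(x, lam, r) * F**(a - 1.0) * (1.0 - F)**(b - 1.0) / beta_fn(a, b)

def bst_cdf(x, a, b, lam, r):
    """BST distribution function (2.2.6): the incomplete beta ratio at F(x)."""
    return stats.beta.cdf(st_cdf(x, lam, r), a, b)
```

Note that the cdf is simply the beta(a, b) distribution function evaluated at F(x; λ, r), which is exactly the incomplete beta ratio of (2.2.2).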

The BST(a, b, λ, r) distribution can be extended to include location and scale parameters µ ∈ ℝ and σ > 0. If X ∼ BST(a, b, λ, r), then Y = µ + σX leads to a six-parameter BST distribution with parameter vector θ = (a, b, µ, σ, λ, r). We denote it as Y ∼ BST(a, b, µ, σ, λ, r).

2.2.1 Expansion of the Density and Distribution Function

Using Newton's binomial expansion for b ∈ ℝ⁺, the pdf of BST(a, b, λ, r) in (2.2.7) can be rewritten as

g_F(x; a, b, λ, r) = \frac{1}{B(a, b)} \sum_{k=0}^{\infty} (-1)^k \binom{b-1}{k} f(x; λ, r) F(x; λ, r)^{a+k-1}.    (2.2.8)

If b ∈ ℤ⁺, then the index k in the sum in (2.2.8) stops at b − 1. In the order statistics literature, Rohatgi and Ehsanes Saleh (1988) generalized (2.2.2) as follows:

G_{F(x)}(a, b) = \sum_{k=0}^{b-1} \binom{a+b-1}{k} F(x)^{a+b-k-1} (1 - F(x))^k,    (2.2.9)

where b ∈ ℤ⁺ and a ∈ ℝ⁺, and

G_{F(x)}(a, b) = 1 - \sum_{k=0}^{a-1} \binom{a+b-1}{k} F(x)^k (1 - F(x))^{a+b-k-1},    (2.2.10)

where a ∈ ℤ⁺ and b ∈ ℝ⁺. According to (Gupta and Nadarajah, 2004, p. 12), the integral representation of the incomplete beta ratio G_{F(x; λ, r)}(a, b) can be written as

G_{F(x; λ, r)}(a, b) = \frac{F(x; λ, r)^a}{a B(a, b) B(1-b, a+b)} \int_0^1 z^{-b} (1 - z)^{a+b-1} (1 - z F(x; λ, r))^{-a}\, dz.    (2.2.11)
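When b is a positive integer, the series (2.2.8) terminates at k = b − 1 and must agree with (2.2.7) exactly. The Python sketch below checks this numerically at a single point; it assumes the Azzalini–Capitanio form of the skew t density (the chapter's (1.5.3) is not reproduced here), and all helper names are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import beta as beta_fn, comb

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

def st_cdf(x, lam, r):
    return quad(st_pdf, -np.inf, x, args=(lam, r))[0]

def bst_pdf_direct(x, a, b, lam, r):
    # Direct form (2.2.7)
    F = st_cdf(x, lam, r)
    return st_pdf(x, lam, r) * F**(a - 1) * (1 - F)**(b - 1) / beta_fn(a, b)

def bst_pdf_series(x, a, b, lam, r):
    # Expansion (2.2.8); for integer b the series terminates at k = b - 1
    F = st_cdf(x, lam, r)
    return sum((-1)**k * comb(b - 1, k) * st_pdf(x, lam, r) * F**(a + k - 1)
               for k in range(int(b))) / beta_fn(a, b)
```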

2.3 Properties and Simulations

In this section we study some theoretical properties of the proposed distribution. Then we provide graphical illustrations of these properties. Finally, we discuss a classical approach to generate a random sample from the BST distribution.

2.3.1 Properties

Proposition 2.3.1. Let X ∼ BST (a, b, λ, r). Then

(a) If a = b = 1, then X ∼ str(λ).

(b) If λ = 0 and a = b = 1, then X ∼ tr.

(c) If λ = 0 and a = b = r = 1, then X ∼ Cauchy(0, 1).

(d) If λ = 0, then X ∼ beta − tr(a, b).

(e) If λ = 0 and r = 1, then X ∼ beta − Cauchy(a, b, 0, 1).

(f) If a = 1, then X ∼ KwST (1, b, λ, r).

(g) If b = 1, then X ∼ KwST (a, 1, λ, r).

(h) If Y = F(X; λ, r), then Y ∼ beta(a, b).

(i) If Y = 1 − F(X; λ, r), then Y ∼ beta(b, a).

(j) Y = (F (X; λ, r))1/a ∼ Kw(a, b).

(k) Y = (1 − F (X; λ, r))1/b ∼ Kw(b, a).

The proof of Proposition 2.3.1 follows directly from (2.2.7) and elementary properties of the skew t distribution provided in Proposition 1.5.1. Note that in parts (d) and (e), the distribution functions of beta−t_r(a, b) and beta−Cauchy(a, b, 0, 1) are obtained by substituting F(x) in (2.2.2) with the distribution function of the Student t with r degrees of freedom and the distribution function of Cauchy(0, 1), respectively.
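Several of the special cases in Proposition 2.3.1 can be checked numerically. The hedged Python sketch below compares the BST density at one point against the skew t, Student t and Cauchy densities for parts (a)–(c); the Azzalini–Capitanio skew t form is assumed, and all helper names are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

def bst_pdf(x, a, b, lam, r):
    # BST density (2.2.7) with a numerically integrated baseline cdf
    F = quad(st_pdf, -np.inf, x, args=(lam, r))[0]
    return st_pdf(x, lam, r) * F**(a - 1) * (1 - F)**(b - 1) / beta_fn(a, b)

x = 0.8
case_a = bst_pdf(x, 1, 1, 2.0, 4.0) - st_pdf(x, 2.0, 4.0)    # (a) skew t
case_b = bst_pdf(x, 1, 1, 0.0, 4.0) - stats.t.pdf(x, 4)      # (b) Student t
case_c = bst_pdf(x, 1, 1, 0.0, 1.0) - stats.cauchy.pdf(x)    # (c) Cauchy
```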

Proposition 2.3.2. Let X ∼ BST (a, b, λ, r) with pdf gF (x; a, b, λ, r) in (2.2.7), then

(a) As a → ∞ or b → ∞, the probability density function gF (x; a, b, λ, r) degenerates to zero.

(b) As r → ∞, X ∼ beta − SN(a, b, λ).

(c) As λ → ∞, X ∼ beta − |tr|(a, b, r).

Proof. (a) For fixed x, λ, r, b, and as a → ∞,

\lim_{a\to\infty} g_F(x; a, b, λ, r) = \lim_{a\to\infty} \frac{f(x; λ, r)}{B(a, b)} F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1}
= \frac{f(x; λ, r)(1 - F(x; λ, r))^{b-1}}{\Gamma(b)} \lim_{a\to\infty} \frac{\Gamma(a+b)}{\Gamma(a)} F(x; λ, r)^{a-1} = 0,

since 0 < F(x; λ, r) < 1, so the geometric decay of F(x; λ, r)^{a-1} dominates the polynomial growth of Γ(a+b)/Γ(a).

Similarly, for fixed x, λ, r, a, and as b → ∞,

\lim_{b\to\infty} g_F(x; a, b, λ, r) = 0.

This completes the proof of (a).

(b) Recall that a skew t random variable X with pdf f(x; λ, r), as defined by Azzalini and Capitanio (2003), is constructed as a scale mixture of skew normal and chi-squared variables through the transformation

X \stackrel{D}{=} \frac{Y}{\sqrt{Z/r}},

where Y ∼ SN(λ) and Z ∼ χ²_r are independent random variables. By the strong law of large numbers (SLLN),

\lim_{r\to\infty} \sqrt{\frac{Z}{r}} = \sqrt{\lim_{r\to\infty} \frac{Z}{r}} = 1.

Thus,

\lim_{r\to\infty} X \stackrel{D}{=} \lim_{r\to\infty} \frac{Y}{\sqrt{Z/r}} \sim SN(λ),

and

\lim_{r\to\infty} f(x; λ, r) = \phi(x; λ),

where φ(·; λ) and Φ(·; λ) are the pdf and cdf of the skew normal distribution, respectively. Thus, when X ∼ BST(a, b, λ, r) with pdf g_F(x; a, b, λ, r) defined in (2.2.7), for fixed x, a, b, λ, and as r → ∞,

\lim_{r\to\infty} g_F(x; a, b, λ, r) = \frac{1}{B(a, b)} \phi(x; λ) \Phi(x; λ)^{a-1} (1 - \Phi(x; λ))^{b-1}.

That is, X ∼ BSN(a, b, λ).

This completes the proof of (b).

(c) Let Y ∼ SN(λ) be the skew normal component in the representation above. We have

\lim_{λ\to\infty} \phi(y; λ) = \lim_{λ\to\infty} 2\phi(y)\Phi(λ y) = 2\phi(y) \lim_{λ\to\infty} \Phi(λ y) = 2\phi(y)\, I_{[0,\infty)}(y),

where φ(·) and Φ(·) are the pdf and cdf of the standard normal distribution, respectively. This indicates \lim_{λ\to\infty} Y \stackrel{D}{=} |W|, where W ∼ N(0, 1). Then,

\lim_{λ\to\infty} X \stackrel{D}{=} \lim_{λ\to\infty} \frac{Y}{\sqrt{Z/r}} \stackrel{D}{=} \frac{|W|}{\sqrt{Z/r}} \stackrel{D}{=} |t_r|.

Thus, when X ∼ BST(a, b, λ, r) with pdf g_F(x; a, b, λ, r) defined in (2.2.7), for fixed x, a, b, r, and as λ → ∞,

\lim_{λ\to\infty} g_F(x; a, b, λ, r) = \frac{1}{B(a, b)} h(x; r) H(x; r)^{a-1} (1 - H(x; r))^{b-1},

where h(x; r) and H(x; r) are the pdf and cdf of the half t distribution respectively with the degrees of freedom r. This completes the proof of (c).

Proposition 2.3.3. Let Y_1, Y_2, ..., Y_n be a random sample of size n from st_r(λ) with probability density function f(x; λ, r) defined in (1.5.3) and distribution function F(x; λ, r). Let Y_{1:n} ≤ Y_{2:n} ≤ ... ≤ Y_{n:n} be the order statistics of the random sample. Then

(a) The ith order statistic Y_{i:n} ∼ BST(i, n − i + 1, λ, r), where i = 1, 2, ..., n.

(b) The largest order statistic Y_{n:n} = max{Y_1, ..., Y_n} ∼ BST(n, 1, λ, r).

(c) The smallest order statistic Y_{1:n} = min{Y_1, ..., Y_n} ∼ BST(1, n, λ, r).

Proof. It follows directly from the definition of the probability density function of order statistics.
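The result can also be checked by simulation. The hedged Python sketch below samples skew t variates through the scale-mixture representation of Azzalini and Capitanio (2003) (Y/√(Z/r) with Y skew normal and Z chi-squared) and compares the empirical cdf of the sample maximum with F(x)^n, the BST(n, 1, λ, r) distribution function implied by part (b); the Azzalini–Capitanio skew t density form is assumed and all names are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

rng = np.random.default_rng(0)
lam, r, n, reps = 1.0, 5.0, 5, 20000

# Sample skew t via the scale-mixture representation Y / sqrt(Z / r)
y = stats.skewnorm.rvs(lam, size=(reps, n), random_state=rng)
z = stats.chi2.rvs(r, size=(reps, n), random_state=rng)
maxima = (y / np.sqrt(z / r)).max(axis=1)

# BST(n, 1, lam, r) has cdf F(x)^n, matching the maximum of n skew t draws
x0 = 1.0
F_x0 = quad(st_pdf, -np.inf, x0, args=(lam, r))[0]
empirical = (maxima <= x0).mean()
```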

Proposition 2.3.4. Let X ∼ BST(a, b, λ, r) be independent of (Y_1, Y_2, ..., Y_n), a random sample of size n from st_r(λ) with probability density function f(x; λ, r) defined in (1.5.3) and distribution function F(x; λ, r). Let Y_{1:n} ≤ Y_{2:n} ≤ ... ≤ Y_{n:n} be the order statistics of the random sample. Then,

(a) W = (X | Y_{1:n} ≥ X) ∼ BST(a, b + n, λ, r),

(b) W* = (X | Y_{n:n} ≤ X) ∼ BST(a + n, b, λ, r),

where Y_{1:n} = min{Y_1, Y_2, ..., Y_n} and Y_{n:n} = max{Y_1, Y_2, ..., Y_n}.

Proof. (a) First,

P(Y_{1:n} ≥ X) = \int_{ℝ} \int_x^{\infty} g_{Y_{1:n}}(y; λ, r)\, g_X(x; a, b, λ, r)\, dy\, dx.

For the inner integral, with the substitution s = F(y; λ, r),

\int_x^{\infty} g_{Y_{1:n}}(y; λ, r)\, dy = \int_x^{\infty} n f(y; λ, r)(1 - F(y; λ, r))^{n-1}\, dy
= \int_{F(x; λ, r)}^{1} n (1 - s)^{n-1}\, ds = (1 - F(x; λ, r))^n.

Thus, with t = F(x; λ, r),

P(Y_{1:n} ≥ X) = \int_{ℝ} \frac{f(x; λ, r)}{B(a, b)} F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{n+b-1}\, dx
= \frac{1}{B(a, b)} \int_0^1 t^{a-1} (1 - t)^{n+b-1}\, dt = \frac{B(a, n+b)}{B(a, b)}.

Then,

P(W ≤ w) = \frac{\int_{-\infty}^{w} \frac{1}{B(a, b)} f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{n+b-1}\, dx}{B(a, n+b)/B(a, b)},

and differentiating with respect to w gives the density

g_W(w) = \frac{1}{B(a, n+b)} f(w; λ, r) F(w; λ, r)^{a-1} (1 - F(w; λ, r))^{n+b-1},

which is the pdf of W ∼ BST(a, n + b, λ, r).

(b) Similar to the proof of (a).

Proposition 2.3.4 can be generalized to any c, d ∈ ℝ⁺ as follows.

Proposition 2.3.5. Let X ∼ BST(a, b, λ, r) be independent of Y ∼ BST(c, 1, λ, r) and Z ∼ BST(1, d, λ, r), where c ∈ ℝ⁺ and d ∈ ℝ⁺. Then

(a) (X|Y ≤ X) ∼ BST(a + c, b, λ, r).

(b) (X|Z ≥ X) ∼ BST(a, b + d, λ, r).

2.3.2 Graphical Illustration

To understand the effect of the parameters on the overall shape of the beta skew t probability density, we illustrate different shapes of the density curve by fixing five parameters and varying the sixth one in the following figures. For simplicity, we set the location parameter µ to zero and the scale parameter σ to one. In Figure 2.1, we study the effect of the parameter a on the density shape by fixing the remaining parameters (b = 5, λ = 1, r = 3) and graphing the density of BST for different values of a. Figure 2.1 shows that the left tail of the BST density curve gets lighter as a increases.

Figure 2.1: BST (a, b = 5, λ = 1, r = 3) density curves as a varies.

On the other hand, when b varies and all other parameters are fixed (a = 5, λ = −1, r = 3), we note that the parameter b controls the right tail weight of the BST density, as shown in Figure 2.2. In addition, Figures 2.1 and 2.2 show that the BST density curve degenerates to zero as a or b approaches infinity.

Figure 2.2: BST(a = 5, b, λ = −1, r = 3) density curves as b varies.

Figure 2.3 studies the effect of the parameter λ on the shape of the BST density curve by fixing the parameters (a = 5, b = 3, r = 3) and taking the parameter λ ranging from −5 to 100. Then, we

compare the density curves of BST(5, 3, λ, 3) with the curve of beta − |t_r|(a = 5, b = 3, r = 3). As expected, the graph is skewed to the right for positive values of λ and skewed to the left for negative values of λ. Moreover, we observe that as λ increases the BST density curve overlaps the beta − |t_r| density curve, which graphically illustrates part (c) of Proposition 2.3.2.

Figure 2.3: BST (a = 5, b = 3, λ, r = 3) density curves as λ varies.

In Figure 2.4, we study the effect of the degrees of freedom r on the shape of the BST density by fixing the parameters (a = 5, b = 3, λ = −1) and taking the degrees of freedom r = 1, 5, 15 and 50. We observe that the shape of the BST(5, 3, −1, r) density gets closer to that of the BSN(5, 3, −1) as the degrees of freedom r increase, which agrees with part (b) of Proposition 2.3.2. The tail gets thicker as the degrees of freedom decrease. These two properties are inherited from the baseline skew t distribution. Furthermore, Figures 2.1 to 2.4 show that the BST inherits its unimodality from the baseline distribution.

Figure 2.4: BST (a = 5, b = 3, λ = −1, r) density curves as r varies.

2.3.3 Simulations

A random sample from the BST distribution can be generated using the classical inverse probability integral transform technique as follows.

1. Generate a random sample Y1,Y2, ..., Yn from beta(a, b) distribution.

2. Let X_i = F^{-1}(Y_i; λ, r), where F^{-1}(·; λ, r) is the quantile function of the skew t distribution.

3. Then X_1, X_2, ..., X_n is a random sample from BST(a, b, λ, r).
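The three steps above can be sketched in Python/SciPy as follows. The skew t quantile F^{-1} is obtained here by tabulating the cdf on a grid and inverting it by interpolation; the Azzalini–Capitanio skew t form is assumed, and all helper names are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import cumulative_trapezoid

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

a, b, lam, r = 2.0, 3.0, 1.0, 5.0
rng = np.random.default_rng(1)

# Tabulate the skew t cdf on a grid; the tiny truncated tail mass is normalized away
grid = np.linspace(-40.0, 40.0, 8001)
cdf = cumulative_trapezoid(st_pdf(grid, lam, r), grid, initial=0.0)
cdf /= cdf[-1]

# Step 1: beta(a, b) draws.  Step 2: skew t quantile by inverse interpolation.
# Step 3: the result is (approximately) a BST(a, b, lam, r) sample.
u = stats.beta.rvs(a, b, size=5000, random_state=rng)
x = np.interp(u, cdf, grid)
```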

Figure 2.5 shows histograms of three random samples of size 1000 generated from the BST(θ), θ = (a, b, λ, r), distribution using the classical inverse probability integral transform technique with different parameter vectors: θ_1 = (a = 2, b = 3, λ = −1, r = 2) as in Figure 2.5(a), θ_2 = (a = 2, b = 2, λ = 3, r = 2) as in Figure 2.5(b), and θ_3 = (a = 2, b = 2, λ = 0, r = 2) as in Figure 2.5(c).

(a) θ1 = (a = 2, b = 3, λ = −1, r = 2) (b) θ2 = (a = 2, b = 2, λ = 3, r = 2)

(c) θ_3 = (a = 2, b = 2, λ = 0, r = 2)

Figure 2.5: Histograms of random samples of size 1000 from the BST distribution.

The BST quantile function, denoted by Q_BST(u), can be obtained from the quantile functions of the skew t distribution and the beta distribution, denoted by Q_st(u) and Q_B(u) respectively, as follows:

Q_BST(u) = Q_st(Q_B(u)).    (2.3.1)

This quantile function can be calculated using standard statistical software.
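For instance, a hedged Python sketch of (2.3.1), inverting a numerically integrated skew t cdf by root-finding (the Azzalini–Capitanio skew t form is assumed; helper names and the search bracket are ours):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import brentq

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

def st_cdf(x, lam, r):
    return quad(st_pdf, -np.inf, x, args=(lam, r))[0]

def st_quantile(u, lam, r):
    # Numeric skew t quantile Q_st(u); the bracket is an assumption of the sketch
    return brentq(lambda x: st_cdf(x, lam, r) - u, -100.0, 100.0)

def bst_quantile(u, a, b, lam, r):
    # Composition (2.3.1): Q_BST(u) = Q_st(Q_B(u))
    return st_quantile(stats.beta.ppf(u, a, b), lam, r)
```

For example, `bst_quantile(0.5, 2, 3, 1, 5)` returns the median of BST(2, 3, 1, 5).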

The hazard rate function, defined by h(x) = f(x)/(1 − F(x)), is an important quantity that characterizes the life phenomena of a system. The associated hazard function for the BST distribution is given by

h(x) = \frac{f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1}}{B(a, b) - \int_0^{F(x; λ, r)} z^{a-1} (1 - z)^{b-1}\, dz}.    (2.3.2)

2.4 Moments

In this section we derive an explicit form of the nth moment as a function of the nth moment of the baseline distribution, the skew t distribution.

Theorem 2.4.1. Let X ∼ BST(a, b, µ, σ, λ, r). Then the nth moment, for r > n, is given by

E(X^n) = \frac{σ^n}{B(a, b)} \sum_{j=0}^{\infty} \sum_{i=0}^{n} (-1)^j \binom{b-1}{j} \binom{n}{i} \left(\frac{µ}{σ}\right)^i E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - 1 \right],    (2.4.1)

where Y ∼ st_r(λ). If b ∈ ℤ⁺, then the index j stops at b − 1.

Proof.

E(X^n) = \int_{ℝ} \frac{x^n}{B(a, b)σ} f\!\left(\frac{x-µ}{σ}; λ, r\right) F\!\left(\frac{x-µ}{σ}; λ, r\right)^{a-1} \left(1 - F\!\left(\frac{x-µ}{σ}; λ, r\right)\right)^{b-1} dx
= \frac{1}{B(a, b)σ} \sum_{j=0}^{\infty} (-1)^j \binom{b-1}{j} \int_{ℝ} x^n f\!\left(\frac{x-µ}{σ}; λ, r\right) F\!\left(\frac{x-µ}{σ}; λ, r\right)^{a+j-1} dx.

Substituting z = (x − µ)/σ, so that dx = σ dz, and using the binomial expansion, we obtain

\int_{ℝ} x^n f\!\left(\frac{x-µ}{σ}; λ, r\right) F\!\left(\frac{x-µ}{σ}; λ, r\right)^{a+j-1} dx = σ \int_{ℝ} (µ + σz)^n f(z; λ, r) F(z; λ, r)^{a+j-1}\, dz
= σ \sum_{i=0}^{n} \binom{n}{i} \int_{ℝ} (zσ)^{n-i} µ^i f(z; λ, r) F(z; λ, r)^{a+j-1}\, dz
= σ^{n+1} \sum_{i=0}^{n} \binom{n}{i} \left(\frac{µ}{σ}\right)^i \int_{ℝ} z^{n-i} f(z; λ, r) F(z; λ, r)^{a+j-1}\, dz.

Applying integration by parts to the quantity \int_{ℝ} z^{n-i} f(z; λ, r) F(z; λ, r)^{a+j-1}\, dz, we let

u = F(z; λ, r)^{a+j-1},

and

dv = z^{n-i} f(z; λ, r)\, dz.

Then,

du = (a + j - 1) f(z; λ, r) F(z; λ, r)^{a+j-2}\, dz,

and

v = \int_{ℝ} z^{n-i} f(z; λ, r)\, dz = E(Y^{n-i}),

where Y ∼ st_r(λ). Thus,

\int_{ℝ} y^{n-i} f(y; λ, r) F(y; λ, r)^{a+j-1}\, dy = E_Y(Y^{n-i}) F(y; λ, r)^{a+j-1} - E_Y(Y^{n-i})(a + j - 1) \int_{ℝ} f(y; λ, r) F(y; λ, r)^{a+j-2}\, dy
= E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - (a + j - 1) \int_{ℝ} F(y; λ, r)^{a+j-2}\, dF(y; λ, r) \right]
= E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - (a + j - 1) \frac{1}{a + j - 1} \right]
= E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - 1 \right].

Thus,

E(X^n) = \frac{1}{B(a, b)} \sum_{j=0}^{\infty} (-1)^j \binom{b-1}{j} σ^n \sum_{i=0}^{n} \binom{n}{i} \left(\frac{µ}{σ}\right)^i E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - 1 \right]
= \frac{σ^n}{B(a, b)} \sum_{j=0}^{\infty} \sum_{i=0}^{n} (-1)^j \binom{b-1}{j} \binom{n}{i} \left(\frac{µ}{σ}\right)^i E_Y(Y^{n-i}) \left[ F(y; λ, r)^{a+j-1} - 1 \right].

Alternatively, the nth moment of a X ∼ BST(a, b, λ, r) random variable with integers a ≥ 2 and b ≥ 2 can be expressed as the nth moment of the baseline distribution st_r(λ) multiplied by a constant, as presented in the following theorem.

Theorem 2.4.2. Let X ∼ BST(a, b, λ, r) with integers a ≥ 2, b ≥ 2, n > 0 and r > n. Then

E(X^n) = c(a, b)\, E_Y(Y^n),    (2.4.2)

where

c(a, b) = \frac{1}{B(a, b)} \left\{ \sum_{i=0}^{b-2} \frac{(-1)^i}{B(i+1, b-i-1)} \left[ \frac{1}{a+i} - \frac{(a-1)}{(a+i-1)(b-i-1)} \right] - (-1)^{b-1} \frac{(a-1)}{a+b-2} \right\},

and Y ∼ str(λ).

Proof. By applying integration by parts, we have

E(X^n) = \frac{1}{B(a, b)} \int_{ℝ} x^n f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1}\, dx.

Let

u = F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1},

and

dv = x^n f(x; λ, r)\, dx.

Then,

du = (a-1) f(x; λ, r) F(x; λ, r)^{a-2} (1 - F(x; λ, r))^{b-1}\, dx - (b-1) f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-2}\, dx,

and

v = \int_{ℝ} x^n f(x; λ, r)\, dx.

Note that v is the nth moment of a st_r(λ) random variable. Then,

E(X^n) = \frac{1}{B(a, b)} \left[ v F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-1} \Big|_{-\infty}^{\infty} - \int_{ℝ} v (a-1) f(x; λ, r) F(x; λ, r)^{a-2} (1 - F(x; λ, r))^{b-1}\, dx + \int_{ℝ} v (b-1) f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-2}\, dx \right]
= \frac{v}{B(a, b)} \left[ (b-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-2}\, dx - (a-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a-2} (1 - F(x; λ, r))^{b-1}\, dx \right].    (2.4.3)

Note that

(b-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a-1} (1 - F(x; λ, r))^{b-2}\, dx
= (b-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a-1} \sum_{i=0}^{b-2} (-1)^i \binom{b-2}{i} F(x; λ, r)^i\, dx
= \sum_{i=0}^{b-2} (-1)^i \binom{b-2}{i} (b-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a+i-1}\, dx
= \sum_{i=0}^{b-2} (-1)^i \binom{b-2}{i} (b-1) \frac{F(x; λ, r)^{a+i}}{a+i} \Big|_{-\infty}^{\infty}
= \sum_{i=0}^{b-2} \frac{(-1)^i}{B(i+1, b-i-1)(a+i)}.    (2.4.4)

Similarly,

(a-1) \int_{ℝ} f(x; λ, r) F(x; λ, r)^{a-2} (1 - F(x; λ, r))^{b-1}\, dx = \sum_{i=0}^{b-1} (-1)^i \binom{b-1}{i} \frac{(a-1)}{a+i-1}.    (2.4.5)

Substituting (2.4.4) and (2.4.5) into (2.4.3), we obtain

E(X^n) = \frac{v}{B(a, b)} \left[ \sum_{i=0}^{b-2} \frac{(-1)^i}{B(i+1, b-i-1)(a+i)} - \sum_{i=0}^{b-1} (-1)^i \binom{b-1}{i} \frac{(a-1)}{a+i-1} \right]
= \frac{v}{B(a, b)} \left\{ \sum_{i=0}^{b-2} \frac{(-1)^i}{B(i+1, b-i-1)} \left[ \frac{1}{a+i} - \frac{(a-1)}{(a+i-1)(b-i-1)} \right] - (-1)^{b-1} \frac{(a-1)}{a+b-2} \right\}.

To illustrate the use of (2.4.2), in the following corollary we provide the mean and the variance of a X ∼ BST(a, b, λ, r) random variable when a ≥ 2 and b ≥ 2 are integers.

Corollary 2.4.1. Let X ∼ BST(a, b, λ, r) with integers a ≥ 2 and b ≥ 2. Then

E(X) = δ \frac{\sqrt{r}\, \Gamma\!\left(\frac{r-1}{2}\right)}{\sqrt{π}\, \Gamma\!\left(\frac{r}{2}\right)} c(a, b),
\qquad
Var(X) = r\, c(a, b) \left[ \frac{1}{r-2} - \frac{δ^2\, \Gamma\!\left(\frac{r-1}{2}\right)^2}{π\, \Gamma\!\left(\frac{r}{2}\right)^2} c(a, b) \right],

where δ = λ/\sqrt{1+λ^2}, r > 2, Γ(·) is the gamma function, and c(a, b) is the constant defined in Theorem 2.4.2.
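Moments of the BST distribution can also be obtained by direct numerical integration, which gives an independent cross-check on closed-form expressions. The Python sketch below computes the mean and variance for one parameter choice by quadrature on a grid and, separately, by Monte Carlo via the inverse transform of Section 2.3.3; the Azzalini–Capitanio skew t form is assumed and all names and parameter values are ours.

```python
import numpy as np
from scipy import stats
from scipy.integrate import cumulative_trapezoid, trapezoid
from scipy.special import beta as beta_fn

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

a, b, lam, r = 3.0, 4.0, 1.0, 6.0
grid = np.linspace(-40.0, 40.0, 8001)
pdf_st = st_pdf(grid, lam, r)
cdf_st = cumulative_trapezoid(pdf_st, grid, initial=0.0)
cdf_st /= cdf_st[-1]

# BST density on the grid, then first two moments by numerical integration
g = pdf_st * cdf_st**(a - 1) * (1 - cdf_st)**(b - 1) / beta_fn(a, b)
mean = trapezoid(grid * g, grid)
var = trapezoid(grid**2 * g, grid) - mean**2

# Independent Monte Carlo check via the inverse transform of Section 2.3.3
rng = np.random.default_rng(2)
u = stats.beta.rvs(a, b, size=20000, random_state=rng)
xs = np.interp(u, cdf_st, grid)
```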

2.5 Order Statistics

Order statistics make their appearance in many areas of statistical theory and practice. In this section we derive an explicit form of the probability density function of the BST order statistics.

Theorem 2.5.1. Let X_1, ..., X_n be a random sample from the BST distribution with distribution function G_F(x; a, b, λ, r) in (2.2.6) and probability density function g_F(x; a, b, λ, r) in (2.2.7). Let X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n} be the order statistics of the random sample. The density function of the ith order statistic, for i = 1, ..., n, is given by

g_{i:n}(x) = \sum_{k=0}^{n-i} (-1)^k \binom{n}{i} i \binom{n-i}{k} g_F(x; a^*, b, λ, r)\, h^*(x)\, c_k(a, b),    (2.5.1)

where a∗ = a(k + i),

c_k(a, b) = \frac{B(a^*, b)}{B(a, b)\,[a B(a, b) B(1-b, a+b)]^{k+i-1}},

and

h^*(x) = \left[ \int_0^1 z^{-b} (1 - z)^{a+b-1} (1 - z F(x; λ, r))^{-a}\, dz \right]^{k+i-1}.

Proof. We use the integral representation (2.2.11) of the incomplete beta function ratio. Letting h(x) = \int_0^1 z^{-b} (1 - z)^{a+b-1} (1 - z F(x; λ, r))^{-a}\, dz, we have

g_{i:n}(x) = \frac{n!}{(i-1)!(n-i)!} G_F(x; a, b, λ, r)^{i-1} (1 - G_F(x; a, b, λ, r))^{n-i} g_F(x; a, b, λ, r)
= \frac{n!}{(i-1)!(n-i)!} g_F(x; a, b, λ, r) G_F(x; a, b, λ, r)^{i-1} \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} G_F(x; a, b, λ, r)^k
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a, b, λ, r) G_F(x; a, b, λ, r)^{k+i-1}
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a, b, λ, r) \left\{ \frac{F(x; λ, r)^a h(x)}{a B(a, b) B(1-b, a+b)} \right\}^{k+i-1}
= \sum_{k=0}^{n-i} \frac{(-1)^k \binom{n-i}{k} \binom{n}{i} i}{B(a, b)} f(x; λ, r) F(x; λ, r)^{a(k+i)-1} [1 - F(x; λ, r)]^{b-1} \left\{ \frac{h(x)}{a B(a, b) B(1-b, a+b)} \right\}^{k+i-1}
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a(k+i), b, λ, r) \frac{B(a(k+i), b)}{B(a, b)} \left\{ \frac{h(x)}{a B(a, b) B(1-b, a+b)} \right\}^{k+i-1}
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a(k+i), b, λ, r) \frac{B(a(k+i), b)\, h^*(x)}{B(a, b)\,[a B(a, b) B(1-b, a+b)]^{k+i-1}},

where h^*(x) = h(x)^{k+i-1}.

According to Thukral (2014), the beta function B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} can be relaxed to include all real a and b. Therefore, the above expression for the density function g_{i:n}(x) of the ith order statistic holds for all real a, b ∈ ℝ. Using the density of the ith order statistic derived in Theorem 2.5.1, we provide the expressions for the largest and the smallest order statistics of a BST(a, b, λ, r) random sample as follows.

Corollary 2.5.1. Let X_1, ..., X_n be a random sample from the BST(a, b, λ, r) distribution. Then, for b ∈ ℤ⁺ and a ∈ ℝ⁺:

(a) The density of the largest order statistic X_{n:n} = max{X_1, ..., X_n} is given by

g_{n:n}(x) = n\, g_F(x; na, b, λ, r)\, h(x)^{n-1} \frac{B(an, b)}{B(a, b)\,[a B(a, b) B(1-b, a+b)]^{n-1}}.    (2.5.2)

(b) The density of the smallest order statistic X_{1:n} = min{X_1, ..., X_n} is given by

g_{1:n}(x) = \sum_{k=0}^{n-1} (-1)^k n \binom{n-1}{k} g_F(x; a^*, b, λ, r)\, h(x)^k \frac{B(a^*, b)}{B(a, b)\,[a B(a, b) B(1-b, a+b)]^k},    (2.5.3)

where h(x) = \int_0^1 z^{-b} (1 - z)^{a+b-1} (1 - z F(x; λ, r))^{-a}\, dz and a^* = a(k + 1).

Using the finite-sum expansion (2.2.9) of the incomplete beta function ratio for integer b > 0, the ith order statistic of X ∼ BST(a, b, λ, r) can alternatively be written as follows.

Theorem 2.5.2. Let X_1, ..., X_n be a random sample from the BST distribution with distribution function G_F(x; a, b, λ, r) in (2.2.6) and probability density function g_F(x; a, b, λ, r) in (2.2.7). Let X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n} be the order statistics. The density of the ith order statistic, for i = 1, ..., n, is given by

g_{i:n}(x) = \sum_{k=0}^{n-i} (-1)^k \binom{n}{i} i \binom{n-i}{k} g_F(x; a, b, λ, r) \left\{ \frac{1 - F(x; λ, r)}{f(x; λ, r)} \sum_{j=0}^{b-1} g_F(x; a+b-j, j, λ, r) \right\}^{k+i-1},    (2.5.4)

where b ∈ ℤ⁺ and a ∈ ℝ⁺.

Proof. By the definition of order statistics we have

g_{i:n}(x) = \frac{n!}{(i-1)!(n-i)!} G_F(x; a, b, λ, r)^{i-1} (1 - G_F(x; a, b, λ, r))^{n-i} g_F(x; a, b, λ, r)
= \frac{n!}{(i-1)!(n-i)!} g_F(x; a, b, λ, r) G_F(x; a, b, λ, r)^{i-1} \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} G_F(x; a, b, λ, r)^k
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a, b, λ, r) G_F(x; a, b, λ, r)^{k+i-1}.

If b ∈ ℤ⁺, by (2.2.9) we obtain

g_{i:n}(x) = \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a, b, λ, r) \left\{ \sum_{j=0}^{b-1} \binom{a+b-1}{j} F(x; λ, r)^{a+b-j-1} (1 - F(x; λ, r))^j \right\}^{k+i-1}
= \sum_{k=0}^{n-i} (-1)^k \binom{n-i}{k} \binom{n}{i} i\, g_F(x; a, b, λ, r) \left\{ \frac{1 - F(x; λ, r)}{f(x; λ, r)} \sum_{j=0}^{b-1} \frac{f(x; λ, r) F(x; λ, r)^{a+b-j-1} (1 - F(x; λ, r))^{j-1}}{B(a+b-j, j)} \right\}^{k+i-1}
= \sum_{k=0}^{n-i} (-1)^k \binom{n}{i} i \binom{n-i}{k} g_F(x; a, b, λ, r) \left\{ \frac{1 - F(x; λ, r)}{f(x; λ, r)} \sum_{j=0}^{b-1} g_F(x; a+b-j, j, λ, r) \right\}^{k+i-1}.

2.6 Maximum Likelihood Estimation

In this section, the maximum likelihood estimators (MLEs) of the BST parameters are given.

Let x_1, x_2, ..., x_n be a random sample of size n from the BST(a, b, µ, σ, λ, r) distribution. The log-likelihood function l(θ) for the parameter vector θ = (a, b, µ, σ, λ, r) can be written as

l(θ) = n \log Γ(a+b) - n \log Γ(a) - n \log Γ(b) - n \log σ + \sum_{i=1}^{n} \log f(z_i; λ, r)
+ (a-1) \sum_{i=1}^{n} \log F(z_i; λ, r) + (b-1) \sum_{i=1}^{n} \log(1 - F(z_i; λ, r)),    (2.6.1)

where z_i = (x_i − µ)/σ. The log-likelihood can be maximized either directly, using the optim function in R, or by solving the nonlinear likelihood equations obtained by differentiating equation (2.6.1). The components of the score vector U(θ) are given by

U_a(θ) = n ψ(a+b) - n ψ(a) + \sum_{i=1}^{n} \log F(z_i; λ, r),

U_b(θ) = n ψ(a+b) - n ψ(b) + \sum_{i=1}^{n} \log(1 - F(z_i; λ, r)),

U_µ(θ) = \sum_{i=1}^{n} \frac{1}{f(z_i; λ, r)} \frac{∂f(z_i; λ, r)}{∂µ} + (a-1) \sum_{i=1}^{n} \frac{1}{F(z_i; λ, r)} \frac{∂F(z_i; λ, r)}{∂µ} + (b-1) \sum_{i=1}^{n} \frac{1}{1 - F(z_i; λ, r)} \frac{∂(1 - F(z_i; λ, r))}{∂µ},

U_σ(θ) = -\frac{n}{σ} + \sum_{i=1}^{n} \frac{1}{f(z_i; λ, r)} \frac{∂f(z_i; λ, r)}{∂σ} + (a-1) \sum_{i=1}^{n} \frac{1}{F(z_i; λ, r)} \frac{∂F(z_i; λ, r)}{∂σ} + (b-1) \sum_{i=1}^{n} \frac{1}{1 - F(z_i; λ, r)} \frac{∂(1 - F(z_i; λ, r))}{∂σ},

U_λ(θ) = \sum_{i=1}^{n} \frac{1}{f(z_i; λ, r)} \frac{∂f(z_i; λ, r)}{∂λ} + (a-1) \sum_{i=1}^{n} \frac{1}{F(z_i; λ, r)} \frac{∂F(z_i; λ, r)}{∂λ} + (b-1) \sum_{i=1}^{n} \frac{1}{1 - F(z_i; λ, r)} \frac{∂(1 - F(z_i; λ, r))}{∂λ},

U_r(θ) = \sum_{i=1}^{n} \frac{1}{f(z_i; λ, r)} \frac{∂f(z_i; λ, r)}{∂r} + (a-1) \sum_{i=1}^{n} \frac{1}{F(z_i; λ, r)} \frac{∂F(z_i; λ, r)}{∂r} + (b-1) \sum_{i=1}^{n} \frac{1}{1 - F(z_i; λ, r)} \frac{∂(1 - F(z_i; λ, r))}{∂r},

where ψ(x) is the digamma function, defined by ψ(x) = \frac{d}{dx} \log Γ(x).
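Direct maximization of (2.6.1) is usually more convenient than solving the score equations. The dissertation uses R's optim; the sketch below does the same in Python with scipy.optimize.minimize, fixing µ = 0 and σ = 1 for brevity and tabulating the (assumed Azzalini–Capitanio) skew t cdf on a grid; all helper names are ours.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.integrate import cumulative_trapezoid

def st_pdf(x, lam, r):
    # Skew t density (assumed Azzalini-Capitanio form)
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (x * x + r)), r + 1.0)

GRID = np.linspace(-40.0, 40.0, 4001)

def st_cdf_grid(z, lam, r):
    # Tabulated skew t cdf, a numeric stand-in for F(z; lambda, r)
    cdf = cumulative_trapezoid(st_pdf(GRID, lam, r), GRID, initial=0.0)
    return np.interp(z, GRID, cdf / cdf[-1])

def negloglik(theta, x):
    # Negative of (2.6.1) with mu = 0 and sigma = 1 held fixed
    a, b, lam, r = theta
    if min(a, b, r) <= 0:
        return np.inf
    F = np.clip(st_cdf_grid(x, lam, r), 1e-12, 1.0 - 1e-12)
    logB = gammaln(a) + gammaln(b) - gammaln(a + b)
    return -(np.sum(np.log(st_pdf(x, lam, r))) - len(x) * logB
             + (a - 1.0) * np.sum(np.log(F)) + (b - 1.0) * np.sum(np.log(1.0 - F)))

# Simulate from BST(2, 3, 1, 5) by inverse transform, then fit by Nelder-Mead
rng = np.random.default_rng(3)
cdf0 = cumulative_trapezoid(st_pdf(GRID, 1.0, 5.0), GRID, initial=0.0)
x = np.interp(stats.beta.rvs(2.0, 3.0, size=200, random_state=rng),
              cdf0 / cdf0[-1], GRID)
theta0 = np.array([2.0, 3.0, 1.0, 5.0])
fit = minimize(negloglik, theta0, args=(x,), method="Nelder-Mead")
```

Nelder-Mead is chosen here because the numeric cdf makes gradients noisy; gradient-based optimizers would need the score components above.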

2.6.1 Illustrative Examples

We illustrate the advantage of the proposed BST distribution by comparing it with some of its sub-models, such as the beta t distribution Bt_r and the t distribution t_r, using the Akaike information criterion (AIC) and the Schwarz information criterion (SIC). We give an application using a well-known data set to demonstrate the applicability of the proposed model. Tables are used to display the estimates of the six parameters θ = (µ, σ, λ, r, a, b) for each model, together with the negative log-likelihood, AIC and SIC values. The data set used here is the U.S. indemnity losses used in Frees and Valdez (1998) and Eling (2012). This data set contains 1500 general liability claims, giving for each the indemnity payment, denoted by "loss". For the purposes of scaling, we divide the data set by 1000. The U.S. indemnity losses data is available in the R packages copula and evd. Figure 2.6 presents the histogram of the U.S. indemnity losses data set, as well as the corresponding normal Q-Q plot. The histogram shows a large number of small losses and a lower number of very large losses, which is a typical feature of insurance claims data. A summary description of the U.S. indemnity losses data set is given in Table 2.1.

Figure 2.6: Histogram and Q-Q plot for U.S. indemnity losses data set.

Table 2.1: Summary description of the U.S. indemnity losses data set.

Min.   Median   Mean    sd        Max.      skewness   kurtosis
0.01   12.00    41.21   102.74    2174.00   9.154      141.978

From Table 2.2, we note that the BST model has the smallest SIC value among all other models. Hence, the BST model is the best fit to the dataset.

Table 2.2: Parameter estimations for the U.S. indemnity losses data set.

Dist.   µ       σ       λ       r       a       b       −log L      AIC        SIC
BST     0.532   1.548   0.644   0.383   8.554   2.764   6596.379    13204.76   13236.64
Btr     1.539   2.533   −       0.238   7.429   3.363   6722.746    13455.49   13482.06
tr      7.383   7.317   −       0.788   −       −       7243.32     14492.64   14508.58

Figure 2.7 presents a graphical display of the fitted density curves over the histogram of the U.S. indemnity losses data, showing the BST, Bt_r, and t_r density curves. Figure 2.8 presents a closer look at the fitted density curves.

Figure 2.7: Histogram and fitted density curves to the U.S. indemnity losses data.

Figure 2.8: Closer look of the histogram and fitted density curves to the U.S. indemnity losses data.

Finally, in Table 2.3 we compare the fit of the BST distribution with that of the baseline distribution st_r. We observe that the BST distribution is a competitive candidate to fit the data, as its AIC and SIC values are very close to those of the skew t distribution. Further, note that for the st_r distribution the estimated skewness parameter λ is very large, while the BST distribution produced a reasonable estimated value of the parameter λ. Therefore, we suggest using the BST distribution to fit this data set. Figure 2.9 shows the graphical display of the fitted density curves over the histogram of the U.S. indemnity losses data, while a closer look to demonstrate the tail fitting for both distributions is presented in Figure 2.10. From the fitting results we conclude that the BST distribution is a very promising distribution that has the ability to fit very skewed and heavy tailed data.

Table 2.3: Parameter estimations for the U.S. indemnity losses data set.

Dist.   µ        σ        λ          r       a       b       −log L      AIC        SIC
BST     0.532    1.548    0.644      0.383   8.554   2.764   6596.379    13204.76   13236.64
str     0.0096   10.687   80448.45   0.859   −       −       6594.952    13197.9    13219.16

Figure 2.9: BST vs. str MLE fitting to the U.S. indemnity losses dataset.

Figure 2.10: Closer look of BST vs. str MLE fitting to the U.S. indemnity losses dataset.

2.7 L-moments Estimation

The L-moments are defined as linear combinations of expectations of order statistics, which exist for any random variable with a finite mean. L-moments are useful in fitting distributions because they specify location, scale, skewness and kurtosis. There are many advantages of L-moments over the ordinary moments. Unlike the ordinary moments, L-moments exist whenever the underlying random variable has a finite mean. In addition, when dealing with data that have large variation, large skewness and heavy tails, L-moments have the advantage of natural unbiasedness, robustness, and often smaller sampling variance than other estimators. In this section, following the definition of L-moments by Hosking (1990), we derive the first seven theoretical L-moments of the proposed BST distribution. Then, we estimate the first four L-moments and the first two L-moment ratios by varying one parameter while fixing the other parameters. Further, we conduct parameter estimation for simulated and real life data using the L-moments method. Finally, we illustrate the fitting performance of the L-moments parameter estimates and compare them with the classical ML estimators by the AIC and SIC values.

2.7.1 Theoretical and Sample L-moments

Denote the theoretical L-moments by L_1, L_2, ... throughout this dissertation. From the expectations of order statistics, Hosking (1990) defined the theoretical L-moments for a real valued random variable X as follows:

L_m = \frac{1}{m} \sum_{k=0}^{m-1} (-1)^k \binom{m-1}{k} E[X_{m-k:m}], \quad m = 1, 2, ...,    (2.7.1)

where E[X_{m-k:m}] is the expectation of the (m − k)th order statistic of a sample of size m. The first four theoretical L-moments are expressed by

L_1 = E[X],
L_2 = \frac{1}{2} E[X_{2:2} - X_{1:2}],
L_3 = \frac{1}{3} E[X_{3:3} - 2X_{2:3} + X_{1:3}],
L_4 = \frac{1}{4} E[X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}].

The L-moment ratios are independent of the units of measurement of X and are defined for higher moments, m ≥ 3, as

τ_m = \frac{L_m}{L_2}, \quad m = 3, 4, ... .    (2.7.2)

It is clear that L_1 is the mean of X and hence is a measure of location, also known as the L-location. L_2 is known as the L-scale, and the L-moment ratios τ_3 and τ_4 are the L-skewness and L-kurtosis, respectively. Based on the definition of the theoretical L-moments in Hosking (1990), we derive the theoretical L-moments for the BST distribution as follows.

Theorem 2.7.1. The theoretical L-moments for a BST random variable X with distribution function G(X; a, b, λ, r), provided in (2.2.6), are defined as

L_m = 2 \sum_{k=0}^{m-1} \sum_{j=0}^{k} (-1)^{k+j} \binom{m-1}{k} \binom{k}{j} E[X G(X; a, b, λ, r)^{m-k+j-1}].    (2.7.3)

Corollary 2.7.1. The first seven BST theoretical L-moments are expressed by

L_1 = E[X G(X; a, b, λ, r)],
L_2 = -E[X G(X; a, b, λ, r)] + 2E[X G(X; a, b, λ, r)^2],
L_3 = E[X G(X; a, b, λ, r)] - 6E[X G(X; a, b, λ, r)^2] + 6E[X G(X; a, b, λ, r)^3],
L_4 = -E[X G(X; a, b, λ, r)] + 12E[X G(X; a, b, λ, r)^2] - 30E[X G(X; a, b, λ, r)^3] + 20E[X G(X; a, b, λ, r)^4],
L_5 = E[X G(X; a, b, λ, r)] - 20E[X G(X; a, b, λ, r)^2] + 90E[X G(X; a, b, λ, r)^3] - 140E[X G(X; a, b, λ, r)^4] + 70E[X G(X; a, b, λ, r)^5],
L_6 = -E[X G(X; a, b, λ, r)] + 30E[X G(X; a, b, λ, r)^2] - 210E[X G(X; a, b, λ, r)^3] + 560E[X G(X; a, b, λ, r)^4] - 630E[X G(X; a, b, λ, r)^5] + 252E[X G(X; a, b, λ, r)^6],
L_7 = E[X G(X; a, b, λ, r)] - 42E[X G(X; a, b, λ, r)^2] + 420E[X G(X; a, b, λ, r)^3] - 1680E[X G(X; a, b, λ, r)^4] + 3150E[X G(X; a, b, λ, r)^5] - 2772E[X G(X; a, b, λ, r)^6] + 924E[X G(X; a, b, λ, r)^7].

The L-location (L_1), L-scale (L_2), L-skewness (τ_3) and L-kurtosis (τ_4) measures of X ∼ BST(a, b, µ, σ, λ, r) can be computed numerically using existing software. Table 2.4 shows numerical estimations of these measures by computing the first four L-moments for various values of the parameters a, b, λ, and r with fixed µ = 0 and σ = 1, where Table 2.4(a) presents the numerical estimations for a BST(a, b, λ, r) random variable for different values of a, b, and λ and fixed degrees of freedom r = 5, while in Table 2.4(b) the parameter λ = 2 is fixed and a, b, and the degrees of freedom r vary.

Since the theoretical L-moments L_m are defined as linear functions of the expected order statistics of a sample of size m, the sample L-moments are computed from the ordered sample x_{1:n}, x_{2:n}, ..., x_{n:n} of size n as follows:

l_m = \frac{1}{m \binom{n}{m}} \sum_{i=1}^{n} x_{i:n} \left[ \sum_{j=0}^{m-1} (-1)^j \binom{m-1}{j} \binom{i-1}{m-j-1} \binom{n-i}{j} \right].    (2.7.4)

The sample L-moment ratios, denoted by τ̂_m, m ≥ 3, are defined as

τ̂_m = \frac{l_m}{l_2}, \quad m = 3, 4, ... .    (2.7.5)
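In practice the sample L-moments in (2.7.4) are often computed through the equivalent probability weighted moments b_k of Hosking (1990), using the standard relations l_1 = b_0, l_2 = 2b_1 − b_0, l_3 = 6b_2 − 6b_1 + b_0 and l_4 = 20b_3 − 30b_2 + 12b_1 − b_0. A Python sketch (the function name is ours):

```python
import numpy as np
from scipy.special import comb

def sample_lmoments(x):
    """First four sample L-moments via probability weighted moments b_k."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Unbiased PWM estimators b_k = n^{-1} sum_i [C(i-1,k)/C(n-1,k)] x_{i:n}
    bk = [np.sum(comb(i - 1, k) / comb(n - 1, k) * x) / n for k in range(4)]
    l1 = bk[0]
    l2 = 2 * bk[1] - bk[0]
    l3 = 6 * bk[2] - 6 * bk[1] + bk[0]
    l4 = 20 * bk[3] - 30 * bk[2] + 12 * bk[1] - bk[0]
    return np.array([l1, l2, l3, l4])

lm = sample_lmoments([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the evenly spaced sample above, l_1 is the sample mean and the odd L-moments l_3 (and hence τ̂_3) vanish by symmetry.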

Table 2.4(a): Estimation of the L-location (L_1), L-scale (L_2), L-skewness (τ_3), and L-kurtosis (τ_4) of a BST(a, b, λ, r) random variable for different values of a, b, and λ.

Dist.   a   b     λ    r    L_1      L_2     τ_3      τ_4
BST     1   1    −5    5   −0.931   0.451   −0.258   0.182
                 −1        −0.671   0.585   −0.097   0.194
                  0         0.000   0.692    0.000   0.194
                  1         0.671   0.585    0.097   0.194
                  5         0.931   0.451    0.258   0.182
                 50         0.949   0.435    0.295   0.169
                500         0.949   0.435    0.295   0.169
BHT               −         0.949   0.435    0.295   0.169
BST     5   3    −5    5   −0.535   0.157   −0.120   0.132
                 −1        −0.277   0.230   −0.006   0.138
                  0         0.388   0.293    0.039   0.139
                  1         0.932   0.264    0.080   0.141
                  5         1.051   0.232    0.130   0.137
                 50         1.052   0.231    0.132   0.136
                500         1.052   0.231    0.132   0.136
BHT               −         1.052   0.231    0.132   0.136
BST     2  20    −5    5   −2.307   0.351   −0.166   0.156
                 −1        −2.261   0.357   −0.160   0.156
                  0        −1.747   0.341   −0.142   0.153
                  1        −0.691   0.220   −0.114   0.149
                  5         0.038   0.077   −0.007   0.145
                 50         0.120   0.044    0.202   0.134
                500         0.121   0.043    0.210   0.130
BHT               −         0.121   0.043    0.210   0.130

Table 2.4(b): Estimation of the L-mean (L1), L-variance(L2), L-skewness(τ3), and L-kurtosis(τ4) of BST (a, b, λ, r) random variable for different values of a, b, and r.

Dist   a   b    λ    r     L1       L2      τ3      τ4
BST    1   1    2    1    43.012   47.411   0.876   0.975
                     5     0.849    0.505   0.174   0.193
                    50     0.725    0.401   0.086   0.133
                   300     0.715    0.394   0.079   0.128
                   500     0.715    0.393   0.078   0.128
BSN    1   1    2    -     0.714    0.392   0.078   0.128
BST    5   3    2    1     1.877    0.727   0.382   0.278
                     5     1.029    0.241   0.109   0.140
                    50     0.930    0.198   0.058   0.126
                   300     0.922    0.195   0.053   0.125
                   500     0.921    0.195   0.053   0.125
BSN    5   3    2    -     0.920    0.194   0.052   0.125
BST    2   20   2    1    -0.565    0.395  -0.455   0.374
                     5    -0.245    0.144  -0.084   0.146
                    50    -0.217    0.125  -0.034   0.128
                   300    -0.215    0.124  -0.030   0.127
                   500    -0.215    0.123  -0.029   0.126
BSN    2   20   2    -    -0.214    0.123  -0.029   0.127

2.7.2 L-moments Parameter Estimation

To estimate the parameters by L-moments, Hosking (1990) suggested equating the sample L-moments to the corresponding population quantities. We therefore obtain parameter estimates for the proposed BST(a, b, µ, σ, λ, r) distribution numerically by minimizing the sum of squared differences between the theoretical and sample quantities, as given by

$$(L_1 - l_1)^2 + (L_2 - l_2)^2 + (\tau_3 - \hat{\tau}_3)^2 + (\tau_4 - \hat{\tau}_4)^2 + (\tau_5 - \hat{\tau}_5)^2 + (\tau_6 - \hat{\tau}_6)^2 + (\tau_7 - \hat{\tau}_7)^2,$$

where L_i and τ_i are the theoretical L-moments and L-moment ratios of the BST distribution, and l_i and τ̂_i are the corresponding sample quantities, respectively. This technique is implemented using the optim function in R for the minimization.

2.7.3 Illustrative Examples

To demonstrate the performance of the L-moments method compared with the maximum likelihood method, we carry out parameter estimation and data fitting using data simulated from a skew t distribution and using the Danish fire losses data set. The Danish fire losses data set consists of 2156 fire losses of over one million Danish Kroner (DKK) from 1980 to 1990. The losses reported in the data set correspond to damage to buildings, furnishings, and personal property, as well as loss of profits. This data set has been studied previously in the literature by many authors, such as McNeil (1997), Resnick (1997), Cooray and Ananda (2005), Ahn et al. (2012), and Farias et al. (2016), to name a few. We conduct parameter estimation for the BST model and then compare the MLE method and the L-moments method using the information criteria AIC and SIC. The following is the BST parameter estimation using L-moments for a random sample of size 100 generated from a skew t distribution with parameter vector (µ = 2, σ = 1, λ = 2, r = 3). Table 2.5 shows the parameter estimates for the BST model using the L-moments and MLE methods. Based on the AIC and SIC criteria, the method of L-moments provides a good alternative to the method of MLE. Figure 2.10 shows the fitted density of the BST(a, b, µ, σ, λ, r) model using both estimation procedures, where one curve presents the BST fitted density using the L-moments parameter estimates and the other the MLE ones. We observe that the L-moments method captures the density peak better than the MLE method.
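The minimization step described in Section 2.7.2 is generic. The Python sketch below (using scipy.optimize.minimize in place of R's optim, and with illustrative function names) demonstrates it on a simple two-parameter normal model, whose theoretical L-moments are known in closed form (L1 = µ and L2 = σ/√π); for the BST model, the theoretical L-moments and ratios would instead be evaluated numerically.

```python
import numpy as np
from math import comb, sqrt, pi
from scipy.optimize import minimize

def sample_l_moments(data, max_order=2):
    """Direct sample L-moment estimators, eq. (2.7.4)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    out = []
    for m in range(1, max_order + 1):
        w = [sum((-1) ** j * comb(m - 1, j) * comb(i, m - j - 1) * comb(n - 1 - i, j)
                 for j in range(m)) for i in range(n)]
        out.append(float(np.dot(w, x)) / (m * comb(n, m)))
    return out

def fit_normal_by_l_moments(data):
    """Match theoretical to sample L-moments by least squares, as in Section 2.7.2."""
    l1, l2 = sample_l_moments(data, 2)
    def objective(theta):
        mu, sigma = theta
        return (mu - l1) ** 2 + (sigma / sqrt(pi) - l2) ** 2
    return minimize(objective, x0=[0.0, 1.0], method="Nelder-Mead").x

rng = np.random.default_rng(0)
mu_hat, sigma_hat = fit_normal_by_l_moments(rng.normal(2.0, 1.0, size=500))
```

The objective here mirrors the squared-distance criterion above, restricted to the first two L-moments because the toy model has only two parameters.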

Table 2.5: Parameter estimates of BST(a, b, µ, σ, λ, r) using the method of L-moments and MLE.

            a      b      µ      σ       λ      r      AIC       SIC
L-moments   1.975  1.270  1.976  0.799   0.525  1.924  276.0901  291.7211
MLE         1.753  1.153  1.570  1.0101  1.986  3.156  273.5269  289.1579

Figure 2.11: Fitted density of BST (a, b, µ, σ, λ, r)

Table 2.6 presents the BST parameter estimates obtained with the L-moments and MLE estimation methods for the Danish fire losses data set. Similarly, based on the AIC and SIC criteria, the L-moments method provides a good alternative to the MLE method. Figure 2.12 presents a close-up look at the fitted density curves of BST(a, b, µ, σ, λ, r) using both estimation procedures, where one curve presents the BST fitted density using the L-moments parameter estimates and the other the MLE ones. In comparison with the MLE method, we note that the L-moments method provides a very good fit to the data set.

Table 2.6: BST (a, b, µ, σ, λ, r) parameters estimation of Danish fire losses data.

            a      b      µ       σ       λ       r       AIC       SIC
L-moments   4.641  7.470  0.945   0.908   1.872   0.235   6959.424  6999.192
MLE         2.680  4.343  0.9573  0.5925  4.7408  0.2730  6855.387  6889.473

Figure 2.12: Fitted density of BST (a, b, µ, σ, λ, r) to the Danish fire losses data set

Figure 2.13: Closer look at the fitted density of BST(a, b, µ, σ, λ, r) for the Danish fire losses data set

CHAPTER 3 THE KUMARASWAMY SKEW t DISTRIBUTION

3.1 Introduction

Kumaraswamy (1980) introduced a distribution on (0, 1), called the distribution of a double bounded random process (DB), which has been used widely in hydrological applications. The DB distribution shares many similarities with the beta distribution, but it has the advantage that its distribution function is tractable and its density function does not involve any special function, which makes the computation of the MLEs easier. Jones (2009) provided a detailed survey of the similarities and differences between the beta distribution and the distribution of the double bounded random process (DB). Based on the double bounded random process, Cordeiro and de Castro (2011) proposed a new class of distributions called the Kumaraswamy generalized distribution, denoted by KwF. They extended this class to the normal, Weibull, gamma, Gumbel, and inverse Gaussian distributions by choosing F to be the corresponding distribution function. A major benefit of the Kumaraswamy generalized distribution is its ability to fit skewed data that cannot be fitted well by existing distributions. Since then, the Kumaraswamy generalized distribution has been widely studied, and many authors have developed various generalized versions based on it. Cordeiro et al. (2011) studied moments for various classes of Kumaraswamy generalized distributions such as the Kumaraswamy normal, Kumaraswamy Student t, Kumaraswamy beta, and Kumaraswamy Snedecor F distributions. Nadarajah et al. (2012) studied further properties of the Kumaraswamy generalized distribution, including asymptotes, shapes, moments, moment generating function, and mean deviations. Mameli (2015) introduced the Kumaraswamy skew normal distribution and derived its moments, moment generating function, and the maximum likelihood estimators for special values of the parameters, to name a few.
In this chapter we introduce a new generalization of the skew t distribution based on the Kumaraswamy generalized distribution. The new class of distributions, which we call the Kumaraswamy skew t (KwST), has the ability to fit skewed and heavy tailed data and is more flexible than the skew t distribution, as it contains the skew t distribution and other important distributions as special cases. Related properties of the KwST, such as the moments and order statistics, are discussed. The proposed distribution is applied to a real data set to illustrate its superior fit, and it is compared with other existing distributions to indicate its advantage. The fitting procedures are performed using the maximum likelihood method and the L-moments method. Further, parameter estimation for simulated and real life data is conducted to illustrate the L-moments method in comparison with the ML method.

3.2 Density and Distribution Functions

Let F(x) and f(x) be the cdf and pdf of a continuous random variable X. Cordeiro and de Castro (2011) proposed the Kumaraswamy generalized distribution, denoted by KwF(a, b), with pdf g(x; a, b) and cdf G(x; a, b) given by:

$$g(x; a, b) = a b\, f(x)\, F(x)^{a-1}\left(1 - F(x)^{a}\right)^{b-1}, \qquad (3.2.1)$$

$$G(x; a, b) = 1 - \{1 - F(x)^{a}\}^{b}, \qquad (3.2.2)$$

where a > 0 and b > 0 are parameters that control the skewness and tail weights. By taking F(x) to be the cdf of the normal, Weibull, gamma, Gumbel, and inverse Gaussian distributions, Cordeiro et al. (2010) defined the Kw-normal, Kw-Weibull, Kw-gamma, Kw-Gumbel, and Kw-inverse Gaussian distributions. We take F(x) in (3.2.2) to be the distribution function of the skew t distribution and introduce a new distribution, called the Kumaraswamy skew t distribution and denoted by KwST(a, b, λ, r), with pdf g(x; a, b, λ, r) and cdf G(x; a, b, λ, r) as follows.

Definition 3.2.1. A random variable X is said to have the Kumaraswamy skew t distribution if its probability density function is given by

$$g(x; a, b, \lambda, r) = a b\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1}, \qquad (3.2.3)$$

and its distribution function is given by

$$G(x; a, b, \lambda, r) = 1 - \{1 - F(x; \lambda, r)^{a}\}^{b}, \qquad (3.2.4)$$

where x ∈ ℝ, a > 0, b > 0, and f(x; λ, r) and F(x; λ, r) are the pdf and cdf of the skew t distribution given by Azzalini and Capitanio (2003), with degrees of freedom r > 0 and shape parameter λ ∈ ℝ.

The KwST model can be extended to include location and scale parameters µ ∈ ℝ and σ > 0, respectively. If X ∼ KwST(a, b, λ, r), then Y = µ + σX leads to a six-parameter KwST distribution with parameter vector ξ = (a, b, µ, σ, λ, r). We denote it by Y ∼ KwST(a, b, µ, σ, λ, r).
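For concreteness, a numerical sketch of the density (3.2.3) and distribution function (3.2.4) can be written in Python. This is an assumed implementation built on scipy: the skew t pdf follows the Azzalini-Capitanio form 2 t_r(x) T_{r+1}(λx√((r+1)/(r+x²))), the skew t cdf is obtained by numerical integration, and the function names are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def st_pdf(x, lam, r):
    """Azzalini-Capitanio skew t pdf with shape lam and degrees of freedom r."""
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (r + x * x)), r + 1.0)

def st_cdf(x, lam, r):
    """Skew t cdf F(x; lam, r), by numerical integration of the pdf."""
    return quad(st_pdf, -np.inf, x, args=(lam, r))[0]

def kwst_pdf(x, a, b, lam, r, mu=0.0, sigma=1.0):
    """KwST density, eq. (3.2.3), with the optional location-scale extension."""
    z = (x - mu) / sigma
    F = st_cdf(z, lam, r)
    return (a * b / sigma) * st_pdf(z, lam, r) * F ** (a - 1) * (1.0 - F ** a) ** (b - 1)

def kwst_cdf(x, a, b, lam, r, mu=0.0, sigma=1.0):
    """KwST distribution function, eq. (3.2.4)."""
    z = (x - mu) / sigma
    return 1.0 - (1.0 - st_cdf(z, lam, r) ** a) ** b
```

With a = b = 1 these reduce to the skew t density and distribution function, in line with the special cases discussed in Section 3.3.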

3.2.1 Expansion of the Density Function

According to Cordeiro and de Castro (2011), using the binomial expansion for b ∈ ℝ⁺, the pdf of the KwST distribution (3.2.3) can be rewritten as

$$g(x; a, b, \lambda, r) = f(x; \lambda, r) \sum_{i=0}^{\infty} w_i\, F(x; \lambda, r)^{a(i+1)-1}, \qquad (3.2.5)$$

where the coefficient $w_i = (-1)^i \binom{b-1}{i} ab$ is defined for all real b. If b ∈ ℤ⁺, then the index i in the sum of (3.2.5) stops at b − 1. If a ∈ ℤ⁺, then (3.2.5) expresses the density as the skew t density multiplied by an infinite weighted power series of the cdf of the same distribution. On the other hand, if a is not an integer, we can expand the term F(x; λ, r)^{a(i+1)−1} as follows:

$$\begin{aligned}
F(x; \lambda, r)^{a(i+1)-1} &= \left[1 - \left(1 - F(x; \lambda, r)\right)\right]^{a(i+1)-1} \\
&= \sum_{j=0}^{\infty} (-1)^j \binom{a(i+1)-1}{j}\left(1 - F(x; \lambda, r)\right)^{j} \\
&= \sum_{j=0}^{\infty}\sum_{k=0}^{j} (-1)^{j+k} \binom{a(i+1)-1}{j}\binom{j}{k} F(x; \lambda, r)^{k}.
\end{aligned}$$

Further, the density g(x; a, b, λ, r) in (3.2.3) can be rewritten as

$$g(x; a, b, \lambda, r) = f(x; \lambda, r) \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{k=0}^{j} w_{i,j,k}\, F(x; \lambda, r)^{k}, \qquad (3.2.6)$$

where the coefficient wi,j,k is defined as,

$$w_{i,j,k} = (-1)^{i+j+k}\, ab\, \binom{b-1}{i}\binom{a(i+1)-1}{j}\binom{j}{k}.$$

According to Nadarajah et al. (2012), a physical interpretation of the KwST distribution for a ∈ ℤ⁺ and b ∈ ℤ⁺ can be given as follows. Consider a system of b independent components, where each component contains a independent subcomponents connected in parallel. The system fails if any of the b components fails, and each component fails only if all of its a subcomponents fail. Let X_{j1}, X_{j2}, ..., X_{ja} denote the lifetimes of the subcomponents within the j-th component, where each X_{ji} ∼ st_r(λ) (j = 1, 2, ..., b and i = 1, 2, ..., a). Let X_j denote the lifetime of the j-th component, j = 1, ..., b, and let X denote the lifetime of the entire system. Then the cumulative distribution function of X is

$$\begin{aligned}
P(X \le x) &= 1 - P(X_1 > x, X_2 > x, \ldots, X_b > x) \\
&= 1 - \left[P(X_1 > x)\right]^b = 1 - \left[1 - P(X_1 \le x)\right]^b \\
&= 1 - \left[1 - P(X_{11} \le x, X_{12} \le x, \ldots, X_{1a} \le x)\right]^b \\
&= 1 - \left[1 - \left(P(X_{11} \le x)\right)^a\right]^b \\
&= 1 - \left[1 - F(x; \lambda, r)^a\right]^b.
\end{aligned}$$

It follows that the KwST distribution given by (3.2.4) is precisely the time-to-failure distribution of the entire system.

The hazard rate function, defined by τ(x) = f(x)/(1 − F(x)), is an important quantity characterizing the life phenomena of a system. The associated hazard function of the KwST distribution is

$$\tau(x) = \frac{a b\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}}{1 - F(x; \lambda, r)^{a}}. \qquad (3.2.7)$$

3.3 Properties and Simulations

In this section we study some theoretical properties of the KwST distribution and provide graphical illustrations of them. Finally, we present different approaches to generating a random sample from the KwST distribution.

3.3.1 Properties

Proposition 3.3.1. Let X ∼ KwST (a, b, λ, r). Then,

(a) If a = b = 1, then X ∼ str(λ).

(b) If λ = 0 and a = b = 1, then X ∼ tr.

(c) If λ = 0 and a = b = r = 1, then X ∼ Cauchy(0, 1).

(d) If λ = 0, then X ∼ Kwt_r(a, b).

(e) If λ = 0 and r = 1, then X ∼ KwCauchy(a, b, 0, 1).

(f) If Y = F(X; λ, r), then Y ∼ Kw(a, b).

The proof of Proposition 3.3.1 follows from (3.2.3) and elementary properties of the skew t distribution. Note that in parts (d) and (e), the distribution functions of Kwt_r(a, b) and KwCauchy(a, b, 0, 1) are obtained by substituting for F(x) in (3.2.2) the distribution functions of the Student t with r degrees of freedom and the Cauchy(0, 1), respectively.
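These special cases are easy to check numerically. The Python sketch below (assuming the Azzalini-Capitanio form of the skew t density; the helper name is illustrative) verifies parts (a)-(c): with a = b = 1 the Kw factor abF^{a−1}(1 − F^a)^{b−1} is identically one, so the KwST density collapses to the skew t density, which for λ = 0 reduces to Student's t and, when additionally r = 1, to the Cauchy(0, 1) density.

```python
import numpy as np
from scipy import stats

def st_pdf(x, lam, r):
    """Azzalini-Capitanio skew t pdf (equal to the KwST pdf when a = b = 1)."""
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (r + x * x)), r + 1.0)

xs = np.linspace(-4.0, 4.0, 9)

# (b) lam = 0: the skew t density collapses to Student's t with r df
assert np.allclose(st_pdf(xs, 0.0, 3.0), stats.t.pdf(xs, 3.0))

# (c) lam = 0 and r = 1: Student's t with 1 df is the Cauchy(0, 1) density
assert np.allclose(st_pdf(xs, 0.0, 1.0), stats.cauchy.pdf(xs))
```

The check relies on T_{r+1}(0) = 1/2, so the factor 2 T_{r+1}(·) drops out exactly when λ = 0.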

Proposition 3.3.2. Let X ∼KwST(a, b, λ, r) and Y ∼KwST(a, d, λ, r) be two independent random variables. Then, (X|Y ≥ X) ∼KwST(a, b + d, λ, r), where a, b, and d > 0.

Proof. Let W = X|(Y ≥ X). We have

$$P(Y \ge X) = \int_{\mathbb{R}} \int_{x}^{\infty} g_Y(y; a, d, \lambda, r)\, g_X(x; a, b, \lambda, r)\, dy\, dx,$$

where,

$$\begin{aligned}
\int_{x}^{\infty} g_Y(y; a, d, \lambda, r)\, dy &= \int_{x}^{\infty} a d\, f(y; \lambda, r)\, F(y; \lambda, r)^{a-1}\left(1 - F(y; \lambda, r)^{a}\right)^{d-1} dy \\
&= d \int_{x}^{\infty} \left(1 - F(y; \lambda, r)^{a}\right)^{d-1} dF(y; \lambda, r)^{a} \\
&= d \int_{F(x; \lambda, r)^{a}}^{1} (1 - s)^{d-1}\, ds = \left[-(1-s)^{d}\right]_{F(x; \lambda, r)^{a}}^{1} = \left(1 - F(x; \lambda, r)^{a}\right)^{d},
\end{aligned}$$

where s = F(y; λ, r)^a. Thus,

$$\begin{aligned}
P(Y \ge X) &= \int_{\mathbb{R}} a b\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b+d-1} dx \\
&= b \int_{0}^{1} a\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b+d-1} dF(x; \lambda, r) \\
&= b \int_{0}^{1} (1 - t)^{b+d-1}\, dt = \left[-\frac{b\,(1-t)^{b+d}}{b+d}\right]_{0}^{1} = \frac{b}{b+d},
\end{aligned}$$

where t = F(x; λ, r)^a. Thus, we obtain

$$P(W \le w) = \frac{\displaystyle\int_{-\infty}^{w} a b\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b+d-1} dx}{b/(b+d)}.$$

Differentiating with respect to w gives the density of W,

$$g_W(w) = a(b+d)\, f(w; \lambda, r)\, F(w; \lambda, r)^{a-1}\left(1 - F(w; \lambda, r)^{a}\right)^{b+d-1},$$

which is the pdf of KwST (a, b + d, λ, r).

Proposition 3.3.3. Let X ∼ KwST (a, 1, λ, r) and Y ∼ KwST (c, 1, λ, r) be two independent random variables. Then, (X|Y ≤ X) ∼ KwST (a + c, 1, λ, r), where a and c > 0.

Proof. Similar to the proof of proposition 3.3.2.

The following proposition studies the limiting distribution of the KwST(a, b, λ, r) probability density function as one of the parameters approaches ∞ while the others remain fixed.

Proposition 3.3.4. Let X ∼ KwST (a, b, λ, r) be a random variable with probability density function g(x; a, b, λ, r) defined in (3.2.3). Then,

(a) As a → ∞ or b → ∞, the probability density function g(x; a, b, λ, r) degenerates to zero.

(b) As r → ∞, X ∼ KwSN(a, b, λ).

(c) As λ → ∞, X ∼ Kw|tr|(a, b, r).

Proof. (a) For fixed x, λ, r, and b, as a → ∞,

$$\lim_{a\to\infty} g(x; a, b, \lambda, r) = \lim_{a\to\infty} \frac{a b\, f(x; \lambda, r)}{\left[F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1}\right]^{-1}}.$$

By L'Hôpital's rule,

$$\lim_{a\to\infty} g(x; a, b, \lambda, r) = \lim_{a\to\infty} \frac{b\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b}}{\log\left(F(x; \lambda, r)\right)\left(b\, F(x; \lambda, r)^{a} - 1\right)} = 0.$$

For fixed x, λ, r, and a, as b → ∞,

$$\lim_{b\to\infty} g(x; a, b, \lambda, r) = \lim_{b\to\infty} \frac{a b\, f(x; \lambda, r)}{\left[F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1}\right]^{-1}}.$$

By L'Hôpital's rule,

$$\lim_{b\to\infty} g(x; a, b, \lambda, r) = \lim_{b\to\infty} \frac{a\, f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1}}{-\log\left(1 - F(x; \lambda, r)^{a}\right)} = 0.$$

This completes the proof of (a).

(b) Recall that the skew t random variable X with pdf f(x; λ, r) in (1.5.3) of Azzalini and Capitanio (2003) is constructed as a scale mixture of the skew normal distribution through the transformation

$$X \stackrel{d}{=} \frac{Y}{\sqrt{Z/r}},$$

where Y ∼ SN(λ), Z ∼ χ²_r, and Y and Z are independent random variables. By the strong law of large numbers (SLLN),

$$\lim_{r\to\infty} \sqrt{\frac{Z}{r}} = \sqrt{\lim_{r\to\infty} \frac{Z}{r}} = 1.$$

Thus,

$$\lim_{r\to\infty} X \stackrel{d}{=} \lim_{r\to\infty} \frac{Y}{\sqrt{Z/r}} \sim SN(\lambda), \qquad \text{and} \qquad \lim_{r\to\infty} f(x; \lambda, r) = \phi(x; \lambda).$$

Thus, when X ∼ KwST(a, b, λ, r) with pdf g(x; a, b, λ, r) defined in (3.2.3), for fixed x, λ, a, and b, as r → ∞,

$$\lim_{r\to\infty} g(x; a, b, \lambda, r) = a b\, \phi(x; \lambda)\, \Phi(x; \lambda)^{a-1}\left(1 - \Phi(x; \lambda)^{a}\right)^{b-1},$$

where φ(x; λ) and Φ(x; λ) are the pdf and cdf of the skew normal distribution, respectively. That is, X ∼ KwSN(a, b, λ).

This completes the proof of (b).

(c) Let X be a skew t random variable, X ≝ Y/√(Z/r) with Y ∼ SN(λ) as in part (b). For the skew normal component we have

$$\lim_{\lambda\to\infty} \phi(y; \lambda) = \lim_{\lambda\to\infty} 2\phi(y)\Phi(\lambda y) = 2\phi(y) \lim_{\lambda\to\infty} \Phi(\lambda y) = 2\phi(y)\, I_{[0,\infty)}(y),$$

which indicates that lim_{λ→∞} Y ≝ |W|, where W ∼ N(0, 1). Then,

$$\lim_{\lambda\to\infty} X \stackrel{d}{=} \lim_{\lambda\to\infty} \frac{Y}{\sqrt{Z/r}} \stackrel{d}{=} \frac{|W|}{\sqrt{Z/r}} \stackrel{d}{=} |t_r|.$$

Thus, when X ∼ KwST(a, b, λ, r) with pdf g(x; a, b, λ, r) defined in (3.2.3), for fixed x, a, b, and r, as λ → ∞,

$$\lim_{\lambda\to\infty} g(x; a, b, \lambda, r) = a b\, h(x; r)\, H(x; r)^{a-1}\left(1 - H(x; r)^{a}\right)^{b-1},$$

where h(x; r) and H(x; r) are the pdf and cdf of the half t distribution with r degrees of freedom, respectively. This completes the proof of (c).

Part (a) of Proposition 3.3.4 can be generalized to the class of Kumaraswamy generalized distributions KwF(a, b) with pdf g(x; a, b) defined in (3.2.1) as follows.

Proposition 3.3.5. Let X ∼ KwF(a, b). As a → ∞ or b → ∞, the probability density function g(x; a, b) degenerates to zero.

3.3.2 Graphical Illustrations

To understand the effect of each parameter on the overall shape of the KwST density, we present some graphs with five parameters fixed and the sixth varying. For simplicity, we fix the location parameter µ at zero and the scale parameter σ at one in all graphs. In Figure 3.1 we fix the parameters (b = 3, λ = 1, r = 3) and graph the density of the KwST(a, 3, 1, 3) distribution for different values of a. Figure 3.1 shows that as a increases, the left tail of the KwST density gets lighter.

Figure 3.1: KwST (a, 3, 1, 3) density as the parameter a varies.

On the other hand, we note that the parameter b controls the right tail weight of the KwST density when b varies and all other parameters are fixed (a = 5, λ = −1, r = 3), as shown in Figure 3.2. In addition, Figures 3.1 and 3.2 show that as a or b approaches infinity the KwST density degenerates to zero.

Figure 3.2: KwST (5, b, −1, 3) density as the parameter b varies.

Figure 3.3 studies the effect of the parameter λ on the shape of the KwST density by fixing the parameters (a = 5, b = 3, r = 3) and letting λ range from −5 to 100. We then compare the density curves of KwST(5, 3, λ, 3) with the curve of Kw|t_r|(a = 5, b = 3, r = 3). As expected, the graph is skewed to the right for positive values of λ and to the left for negative values of λ. Moreover, we observe that as λ increases the KwST density curve approaches the Kw|t_r| density curve, which illustrates part (c) of Proposition 3.3.4 graphically.

Figure 3.3: KwST(5, 3, λ, 3) density as the parameter λ varies.

In Figure 3.4, we study the effect of the degrees of freedom r on the shape of the KwST density by fixing the parameters (a = 5, b = 3, λ = −1) and taking r = 1, 5, 15, and 50. We observe that the shape of the KwST(5, 3, −1, r) density gets closer to that of the KwSN(5, 3, −1) as the degrees of freedom r increase, which agrees with part (b) of Proposition 3.3.4. The tail gets thicker as the degrees of freedom decrease. These two properties are inherited from the baseline skew t distribution. Furthermore, Figures 3.1 to 3.4 show that the KwST inherits unimodality from its baseline distribution.

Figure 3.4: KwST (5, 3, −1, r) density as the degrees of freedom r varies.

3.3.3 Simulations

In this section, we provide several methods to generate samples from the KwST(a, b, λ, r) distribution. The KwST quantile function is obtained by inverting (3.2.4):

$$x = Q(u) = G^{-1}(u) = F^{-1}\left(\left[1 - (1-u)^{1/b}\right]^{1/a}\right), \qquad (3.3.1)$$

where u is an observation of a uniform random variable U on (0, 1) and F⁻¹ is the inverse of the distribution function of st_r(λ). Applying the inverse transformation technique, we generate a KwST random sample using (3.3.1). An alternative method to generate a random sample from KwST(a, b, λ, r) is to use the acceptance-rejection algorithm proposed by Nadarajah et al. (2012), as follows.

Define the constant

$$M = \frac{a^{b}\, b\, (a-1)^{1-1/a}\, (b-1)^{b-1}}{(ab-1)^{\,b-1/a}}$$

for given a > 1 and b > 1. Then the following scheme generates KwST(a, b, λ, r) variates:

(a) Generate X = x from the skew t pdf f(x; λ, r).

(b) Generate Y = U M f(x; λ, r), where U is a uniform variate on (0, 1) independent of x.

(c) Accept X = x as a KwST variate if Y < g(x; a, b, λ, r). Otherwise, return to step (a).

An additional method to generate a KwST(a, b, λ, r) random sample is to directly apply part (f) of Proposition 3.3.1: generate V ∼ Kw(a, b) and set X = F⁻¹(V; λ, r).
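A minimal Python sketch of the inverse-transformation method (3.3.1) follows. It is an assumed implementation: F⁻¹ is approximated by tabulating the skew t cdf on a grid and interpolating (instead of repeated root finding), and the helper names are illustrative.

```python
import numpy as np
from scipy import stats

def st_pdf(x, lam, r):
    """Azzalini-Capitanio skew t pdf."""
    return 2.0 * stats.t.pdf(x, r) * stats.t.cdf(
        lam * x * np.sqrt((r + 1.0) / (r + x * x)), r + 1.0)

def kwst_sample(n, a, b, lam, r, seed=0):
    """Inverse-transform sampling via (3.3.1): x = F^{-1}([1 - (1-u)^{1/b}]^{1/a})."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    p = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)   # Kw(a, b) variates, Prop. 3.3.1(f)
    # tabulate the skew t cdf on a wide grid and invert by interpolation
    grid = np.linspace(-60.0, 60.0, 4001)
    pdf = st_pdf(grid, lam, r)
    cdf = np.concatenate(([0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2.0 * np.diff(grid))))
    cdf /= cdf[-1]                                    # normalize the numerical mass
    return np.interp(p, cdf, grid)

sample = kwst_sample(1000, a=2.0, b=3.0, lam=1.0, r=5.0)
```

The grid bounds and size trade accuracy against speed; a production implementation would invert F by root search on the integrated pdf instead.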

(a) ξ1 = (a = 5, b = 2, λ = −2, r = 2). (b) ξ2 = (a = 2, b = 4, λ = 1, r = 2).

Figure 3.5: Histograms of random samples of size 500 of KwST variates

Figure 3.5 shows the histograms of two random samples of size 500 generated from the KwST(ξ) distribution using the acceptance-rejection method, with parameter vectors ξ1 = (a = 5, b = 2, µ = 0, σ = 1, λ = −2, r = 2) and ξ2 = (a = 2, b = 4, µ = 0, σ = 1, λ = 1, r = 2), respectively.

3.4 Moments

In this section we derive explicit expressions for the moments of a KwST random variable using different techniques. Numerical estimates of the mean (µ_KwST), variance (σ²_KwST), skewness (γ1), and kurtosis (γ2) of the KwST(a, b, λ, r) random variable for selected values of the parameters a, b, λ, and r are also provided.

Theorem 3.4.1. Let X ∼ KwST(a, b, µ, σ, λ, r), where a, b, n ∈ ℤ⁺ and r ≥ n. Then

$$E(X^n) = a b \sigma^{n} \sum_{i=0}^{n} \binom{n}{i}\left(\frac{\mu}{\sigma}\right)^{i} \sum_{j=0}^{b-1} (-1)^{j} \binom{b-1}{j} E_Y\!\left(Y^{n-i}\right)\left[F(y; \lambda, r)^{a(j+1)-1} - 1\right], \qquad (3.4.1)$$

where Y ∼ str(λ).

Proof. Consider the transformation Z = (X − µ)/σ. Applying the binomial expansion to (Zσ + µ)^n and to (1 − F(z; λ, r)^a)^{b−1}, we obtain

$$\begin{aligned}
E(X^n) &= \int_{\mathbb{R}} (z\sigma + \mu)^n\, ab\, \frac{1}{\sigma} f(z; \lambda, r) F(z; \lambda, r)^{a-1}\left(1 - F(z; \lambda, r)^{a}\right)^{b-1} \sigma\, dz \\
&= a b \sigma^{n} \sum_{i=0}^{n}\binom{n}{i}\left(\frac{\mu}{\sigma}\right)^{i}\sum_{j=0}^{b-1}(-1)^{j}\binom{b-1}{j}\int_{\mathbb{R}} z^{n-i} F(z; \lambda, r)^{a(j+1)-1} f(z; \lambda, r)\, dz \\
&= a b \sigma^{n} \sum_{i=0}^{n}\binom{n}{i}\left(\frac{\mu}{\sigma}\right)^{i}\sum_{j=0}^{b-1}(-1)^{j}\binom{b-1}{j}\, E_Y\!\left(Y^{n-i} F(Y; \lambda, r)^{a(j+1)-1}\right),
\end{aligned}$$

where Y ∼ st_r(λ). Next, we apply integration by parts to the quantity

$$E_Y\!\left(Y^{n-i} F(Y; \lambda, r)^{a(j+1)-1}\right) = \int_{\mathbb{R}} y^{n-i} f(y; \lambda, r)\, F(y; \lambda, r)^{a(j+1)-1}\, dy.$$

Let u = F(y; λ, r)^{a(j+1)−1} and dv = y^{n−i} f(y; λ, r) dy. Then

$$du = \left(a(j+1)-1\right) f(y; \lambda, r)\, F(y; \lambda, r)^{a(j+1)-2}\, dy, \qquad v = \int_{\mathbb{R}} y^{n-i} f(y; \lambda, r)\, dy = E_Y(Y^{n-i}).$$

Thus,

$$\begin{aligned}
\int_{\mathbb{R}} y^{n-i} f(y; \lambda, r)\, F(y; \lambda, r)^{a(j+1)-1}\, dy &= E_Y(Y^{n-i})\left[F(y; \lambda, r)^{a(j+1)-1} - \left(a(j+1)-1\right)\int_{\mathbb{R}} F(y; \lambda, r)^{a(j+1)-2}\, dF(y; \lambda, r)\right] \\
&= E_Y(Y^{n-i})\left[F(y; \lambda, r)^{a(j+1)-1} - \left(a(j+1)-1\right)\cdot\frac{1}{a(j+1)-1}\right] \\
&= E_Y(Y^{n-i})\left[F(y; \lambda, r)^{a(j+1)-1} - 1\right].
\end{aligned}$$

Hence,

$$E(X^n) = a b \sigma^{n} \sum_{i=0}^{n}\binom{n}{i}\left(\frac{\mu}{\sigma}\right)^{i}\sum_{j=0}^{b-1}(-1)^{j}\binom{b-1}{j}\, E_Y(Y^{n-i})\left[F(y; \lambda, r)^{a(j+1)-1} - 1\right].$$

Further, the n-th moment of X with pdf (3.2.6) can be expressed in terms of the n-th moments of the baseline distribution st_r(λ) as follows.

Proposition 3.4.2. Let X ∼ KwST(a, b, λ, r) be a random variable, where a, b ∈ ℝ⁺, n ∈ ℤ⁺, n ≥ 1, and r ≥ n. Then

$$E(X^n) = \sum_{i,j=0}^{\infty}\sum_{k=0}^{j} w_{i,j,k}\, E_Y(Y^{n})\left[F(y; \lambda, r)^{a+k-1} - 1\right], \qquad (3.4.2)$$

where Y ∼ st_r(λ) and $w_{i,j,k} = (-1)^{i+j+k}\, ab\, \binom{b-1}{i}\binom{a(i+1)-1}{j}\binom{j}{k}$. If a ∈ ℤ⁺, then we use the pdf (3.2.5) to derive the n-th moment of X ∼ KwST(a, b, λ, r) as

$$E(X^n) = \sum_{i=0}^{\infty} w_i\, E_Y(Y^{n})\left[F(y; \lambda, r)^{a+i-1} - 1\right], \qquad (3.4.3)$$

where $w_i = (-1)^{i}\, ab\, \binom{b-1}{i}$. If b ∈ ℤ⁺, the index i in the first sum in (3.4.2) and in the sum in (3.4.3) stops at b − 1.

Alternatively, the n-th moment of a KwST(a, b, λ, r) random variable X with integers a ≥ 2 and b ≥ 2 can be expressed as the n-th moment of st_r(λ) multiplied by a constant, as presented in the following proposition.

Proposition 3.4.3. Let X ∼ KwST(a, b, λ, r) with integers n, a, and b, where a, b ≥ 2, n ≥ 1, and r ≥ n. Then

$$E(X^n) = E_Y(Y^n)\, c(a, b), \qquad (3.4.4)$$

where

$$c(a, b) = ab\left\{\sum_{i=0}^{b-2} \frac{(-1)^i}{B(i+1,\, b-i-1)}\left[\frac{a}{a(2+i)-1} - \frac{a-1}{(b-i-1)\left(a(1+i)-1\right)}\right] - (-1)^{b-1}\frac{a-1}{ab-1}\right\},$$

where B(·, ·) is the complete beta function and Y ∼ st_r(λ).

Proof. Integrating by parts, we have

$$E(X^n) = ab \int_{\mathbb{R}} x^{n} f(x; \lambda, r)\, F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} dx.$$

Let u = F(x; λ, r)^{a−1}(1 − F(x; λ, r)^a)^{b−1} and dv = x^n f(x; λ, r) dx. Then

$$du = \left[(a-1) f(x; \lambda, r)\, F(x; \lambda, r)^{a-2}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} - a(b-1) f(x; \lambda, r)\, F(x; \lambda, r)^{2(a-1)}\left(1 - F(x; \lambda, r)^{a}\right)^{b-2}\right] dx,$$

and

$$v = \int_{\mathbb{R}} x^{n} f(x; \lambda, r)\, dx.$$

Note that v is the n-th moment of an st_r(λ) random variable. Then

$$\begin{aligned}
E(X^n) &= ab\left[F(x; \lambda, r)^{a-1}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} v\,\Big|_{-\infty}^{\infty} - (a-1)\int_{\mathbb{R}} v f(x; \lambda, r) F(x; \lambda, r)^{a-2}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} dx \right. \\
&\qquad \left. + \ a(b-1)\int_{\mathbb{R}} v f(x; \lambda, r) F(x; \lambda, r)^{2(a-1)}\left(1 - F(x; \lambda, r)^{a}\right)^{b-2} dx\right] \\
&= abv\left[a(b-1)\int_{\mathbb{R}} f(x; \lambda, r) F(x; \lambda, r)^{2(a-1)}\left(1 - F(x; \lambda, r)^{a}\right)^{b-2} dx - (a-1)\int_{\mathbb{R}} f(x; \lambda, r) F(x; \lambda, r)^{a-2}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} dx\right]. \quad (*)
\end{aligned}$$

Note that

$$\begin{aligned}
a(b-1)\int_{\mathbb{R}} f(x; \lambda, r) F(x; \lambda, r)^{2(a-1)}\left(1 - F(x; \lambda, r)^{a}\right)^{b-2} dx &= \sum_{i=0}^{b-2}(-1)^{i}\binom{b-2}{i}\, a(b-1)\int_{\mathbb{R}} f(x; \lambda, r) F(x; \lambda, r)^{a(2+i)-2}\, dx \\
&= \sum_{i=0}^{b-2}(-1)^{i}\binom{b-2}{i}\, a(b-1)\left[\frac{F(x; \lambda, r)^{a(2+i)-1}}{a(2+i)-1}\right]_{-\infty}^{\infty} \\
&= \sum_{i=0}^{b-2}(-1)^{i}\frac{\binom{b-2}{i}\, a(b-1)}{a(2+i)-1}. \quad (**)
\end{aligned}$$

Similarly,

$$(a-1)\int_{\mathbb{R}} f(x; \lambda, r) F(x; \lambda, r)^{a-2}\left(1 - F(x; \lambda, r)^{a}\right)^{b-1} dx = \sum_{i=0}^{b-1}(-1)^{i}\frac{\binom{b-1}{i}(a-1)}{a(1+i)-1}. \quad (***)$$

Substituting (**) and (***) into (*), we get

$$\begin{aligned}
E(X^n) &= abv\left[\sum_{i=0}^{b-2}(-1)^{i}\frac{\binom{b-2}{i}\, a(b-1)}{a(2+i)-1} - \sum_{i=0}^{b-1}(-1)^{i}\frac{\binom{b-1}{i}(a-1)}{a(1+i)-1}\right] \\
&= abv\left\{\sum_{i=0}^{b-2}(-1)^{i}\binom{b-2}{i}\left[\frac{a(b-1)}{a(2+i)-1} - \frac{(a-1)(b-1)}{(b-i-1)\left(a(i+1)-1\right)}\right] - (-1)^{b-1}\frac{a-1}{ab-1}\right\} \\
&= abv\left\{\sum_{i=0}^{b-2}\frac{(-1)^{i}}{B(i+1,\, b-i-1)}\left[\frac{a}{a(2+i)-1} - \frac{a-1}{(b-i-1)\left(a(i+1)-1\right)}\right] - (-1)^{b-1}\frac{a-1}{ab-1}\right\}.
\end{aligned}$$

The mean, variance, skewness, and kurtosis can be computed numerically using existing software. Table 3.1 shows numerical estimates of these measures, obtained from the first four moments, for various values of the parameters a, b, λ, and r with µ = 0 and σ = 1. Table 3.1(a) presents the numerical estimates for a KwST(a, b, λ, r) random variable for different values of a, b, and λ with fixed degrees of freedom r = 5, while in Table 3.1(b) the parameter λ = 2 is fixed and a, b, and the degrees of freedom r are varied. Skewness and kurtosis are calculated as

$$\gamma_1(X) = E\left[\left(\frac{X - E(X)}{\sqrt{Var(X)}}\right)^{3}\right], \qquad \gamma_2(X) = E\left[\left(\frac{X - E(X)}{\sqrt{Var(X)}}\right)^{4}\right].$$
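These standardized moment measures are straightforward to estimate from a sample. The Python sketch below (illustrative names; a standard normal sample, for which γ1 ≈ 0 and γ2 ≈ 3, serves as a sanity check) computes them exactly as defined above:

```python
import numpy as np

def moment_measures(x):
    """Sample mean, variance, and the standardized skewness/kurtosis defined above."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    v = x.var()
    z = (x - m) / np.sqrt(v)      # standardize by the square root of the variance
    return m, v, np.mean(z ** 3), np.mean(z ** 4)

rng = np.random.default_rng(1)
mean, var, g1, g2 = moment_measures(rng.normal(size=200_000))
```

For a KwST sample the same function would be applied to draws generated by any of the methods of Section 3.3.3.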

From the numerical results in Table 3.1, we observe that the KwST(a, b, λ, r) density degenerates to zero as a or b increases; the skewness and kurtosis then do not exist, and their values are replaced by NA. Further, the KwST(a, b, λ, r) moment estimates get closer to the KwSN(a, b, λ) ones as the degrees of freedom r increase, and to the Kw|t_r|(a, b, r) ones as the parameter λ increases, where the numerical estimates for the KwSN and Kw|t_r| are presented in the last line of each block. The numerical results in Table 3.1 agree with Proposition 3.3.4.

Table 3.1(a): Estimates of the mean (µ_KwST), variance (σ²_KwST), skewness (γ1), and kurtosis (γ2) of a KwST(a, b, λ, r) random variable for different values of a, b, and λ.

Dist      a    b      λ    r    µ_KwST   σ²_KwST   γ1      γ2
KwST      1    0.50  -10   5   -0.563    0.616    -2.419   28.051
                      -1   5    0.119    2.842     4.919  119.904
                       0   5    1.107    6.303     5.757   97.085
                       1   5    1.753    6.804     6.674  108.903
                      10   5    1.927    6.321     7.448  125.575
                      50   5    1.929    6.312     7.463  125.904
Kw|t_r|   1    0.50    -   5    1.929    6.311     7.465  125.943
KwST      5    3      -1   5   -0.055    0.157    -0.005    3.465
                       0   5    0.676    0.281     0.298    3.623
                       1   5    1.197    0.251     0.542    3.934
                      10   5    1.288    0.213     0.761    4.248
                      50   5    1.288    0.213     0.761    4.247
Kw|t_r|   5    3       -   5    1.288    0.213     0.761    4.247
KwST      10   1      -1   5    0.850    0.463     2.161   19.937
                       0   5    2.003    1.270     2.462   25.524
                       1   5    2.545    1.477     2.737   30.106
                      10   5    2.591    1.457     2.812   31.254
                      50   5    2.591    1.457     2.812   31.655
Kw|t_r|   10   1       -   5    2.591    1.457     2.812   31.655
KwST      10   2000  -10   5    0        0         NA       NA
                       0   5    0        0         NA       NA
                      10   5    0        0         NA       NA

3.5 Order Statistics

Order statistics appear in many areas of statistical theory and practice. Cordeiro and de Castro (2011) derived the density of the order statistics of the KwF distribution as the baseline density multiplied by infinite weighted sums of powers of the distribution function F(x), as given by

$$g_{i:n}(x) = \frac{f(x)}{B(i,\, n-i+1)} \sum_{j=0}^{n-i} (-1)^{j}\binom{n-i}{j} \sum_{r,u,v=0}^{\infty}\sum_{t=0}^{v} w_{u,v,t}\, p_{r,\,i+j-1}(a,b)\, F(x)^{r+t},$$

Table 3.1(b): Estimates of the mean (µ_KwST), variance (σ²_KwST), skewness (γ1), and kurtosis (γ2) of a KwST(a, b, λ, r) random variable for different values of a, b, and r.

Dist   a     b    λ   r     µ_KwST   σ²_KwST   γ1      γ2
KwST   1     1    2   5     0.849    0.946     1.791   16.428
                      10    0.773    0.652     0.865    5.042
                      50    0.767    0.453     0.661    4.053
                      300   0.758    0.412     0.873    3.547
KwSN   1     1    2   -     0.714    0.491     0.453    3.301
KwST   10    1    2   5     2.587    1.461     2.800   31.086
                      10    2.176    0.574     1.283    6.703
                      50    1.932    0.304     0.683    3.856
                      300   1.889    0.269     0.588    3.587
KwSN   10    1    2   -     0.714    0.491     0.453    3.301
KwST   2     5    2   5     0.447    0.137     0.158    3.495
                      10    0.432    0.124     0.090    3.257
                      50    0.422    0.115     0.040    3.109
                      200   0.420    0.113     0.031    3.109
KwSN   2     5    2   -     0.419    0.113     0.028    3.073
KwST   1000  15   2   5     0        0         NA       NA
                      10    0        0         NA       NA
                      20    0        0         NA       NA

where a ∈ ℝ⁺,

$$w_{u,v,t} = w_{u,v,t}(a, b) = (-1)^{u+v+t}\, ab\, \binom{b-1}{u}\binom{a(u+1)-1}{v}\binom{v}{t},$$

and

$$p_{r,\,i+j-1}(a, b) = \sum_{k=0}^{i+j-1}(-1)^{k}\binom{i+j-1}{k}\sum_{m=0}^{\infty}\sum_{l=0}^{\infty}(-1)^{m+l}\binom{bk}{m}\binom{am}{l}\binom{l}{r}.$$

If a ∈ ℤ⁺, then the density of the order statistics of the KwF distribution is given by

$$g_{i:n}(x) = \frac{f(x)}{B(i,\, n-i+1)} \sum_{j=0}^{n-i} (-1)^{j}\binom{n-i}{j} \sum_{r,u=0}^{\infty} w_{u}\, p_{r,\,i+j-1}(a, b)\, F(x)^{a(u+1)+r-1},$$

where $w_u = w_u(a, b) = (-1)^{u}\, ab\, \binom{b-1}{u}$. We derive a new and simpler representation for the density of the order statistics of a KwST random sample, and we generalize the result to the order statistics of the Kumaraswamy generalized family KwF.

Theorem 3.5.1. Let X₁, ..., X_n be a random sample from a KwST distribution with probability density function g(x; a, b, λ, r) defined in (3.2.3) and distribution function G(x; a, b, λ, r) defined in (3.2.4). Let X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n} be the order statistics of the random sample. The density of the i-th order statistic X_{i:n}, for i = 1, ..., n, is given by

$$g_{i:n}(x) = \sum_{k=0}^{i-1} s_{i,n,k}\, g\big(x;\, a,\, b(n-i+k+1),\, \lambda,\, r\big), \qquad (3.5.1)$$

where $s_{i,n,k} = (-1)^{k}\binom{n-i+k}{k}\binom{n}{i-k-1}$.

Proof.

$$\begin{aligned}
g_{i:n}(x) &= \frac{n!}{(i-1)!(n-i)!}\left[1 - \{1 - F(x; \lambda, r)^{a}\}^{b}\right]^{i-1}\left[1 - \left(1 - \{1 - F(x; \lambda, r)^{a}\}^{b}\right)\right]^{n-i} ab\, f(x) F(x)^{a-1}\left(1 - F(x)^{a}\right)^{b-1} \\
&= \frac{n!}{(i-1)!(n-i)!}\left[1 - \{1 - F(x; \lambda, r)^{a}\}^{b}\right]^{i-1}\{1 - F(x; \lambda, r)^{a}\}^{b(n-i)}\, ab\, f(x) F(x)^{a-1}\left(1 - F(x)^{a}\right)^{b-1}.
\end{aligned}$$

Applying the binomial expansion of [1 − {1 − F(x; λ, r)^a}^b]^{i−1} in powers of {1 − F(x; λ, r)^a}^b, we get

$$\begin{aligned}
g_{i:n}(x) &= \frac{n!}{(i-1)!(n-i)!}\sum_{k=0}^{i-1}(-1)^{k}\binom{i-1}{k}\, ab\, f(x) F(x)^{a-1}\{1 - F(x; \lambda, r)^{a}\}^{b(n-i+k+1)-1} \\
&= \frac{n!}{(i-1)!(n-i)!}\sum_{k=0}^{i-1}\frac{(-1)^{k}\binom{i-1}{k}}{n-i+k+1}\, g\big(x;\, a,\, b(n-i+k+1),\, \lambda,\, r\big) \\
&= \sum_{k=0}^{i-1}(-1)^{k}\frac{n!\,(n-i+k)!}{(n-i)!\, k!\, (i-k-1)!\,(n-i+k+1)!}\, g\big(x;\, a,\, b(n-i+k+1),\, \lambda,\, r\big) \\
&= \sum_{k=0}^{i-1}(-1)^{k}\binom{n-i+k}{k}\binom{n}{i-k-1}\, g\big(x;\, a,\, b(n-i+k+1),\, \lambda,\, r\big) \\
&= \sum_{k=0}^{i-1} s_{i,n,k}\, g\big(x;\, a,\, b(n-i+k+1),\, \lambda,\, r\big).
\end{aligned}$$

Formula (3.5.1) expresses the density of the order statistics of the KwST distribution as a finite weighted sum of densities from the same KwST class with the new parameter b* = b(n − i + k + 1), a function of the sample size n, the order i, and a constant k with 0 ≤ k ≤ i − 1. Hence, the ordinary moments of the order statistics of the KwST distribution can be written as finite weighted sums of moments of KwST distributions with the new parameter b*. From Theorem 3.5.1, the densities of the smallest and largest order statistics are given as follows.

Proposition 3.5.2. (a) The density of the largest order statistic X_{n:n} is given by

$$g_{n:n}(x) = \sum_{k=0}^{n-1} s_{k}\, g\big(x;\, a,\, b(k+1),\, \lambda,\, r\big), \qquad (3.5.2)$$

where $s_k = \frac{(-1)^k}{(k+1)\, B(k+1,\, n-k)} = (-1)^{k}\binom{n}{k+1}$.

(b) The density of the smallest order statistic X_{1:n} is g(x; a, nb, λ, r), which is the density of Y ∼ KwST(a, bn, λ, r).

The result in Theorem 3.5.1 can be generalized to the class of the Kumaraswamy generalized family KwF defined in (3.2.1).

Theorem 3.5.3. Let X1, ..., Xn be a random sample from a Kumaraswamy generalized family KwF distribution with the defined probability density function g(x; a, b) in (3.2.1) and distribution function G(x; a, b) in (3.2.2). Let X1:n ≤ X2:n ≤ ... ≤ Xn:n be the order statistics of the random 70 th sample. The density function gi:n(x) of the i order statistic Xi:n, for i = 1, ..., n , is given by

g_{i:n}(x) = Σ_{k=0}^{i−1} s_{i,n,k} g(x; a, b(n−i+k+1)),  (3.5.3)

where s_{i,n,k} = (−1)^k C(n−i+k, k) C(n, i−k−1).
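Theorem 3.5.3 is easy to verify numerically. The following Python sketch (our own illustrative code, not part of the dissertation) compares the classical order-statistic density with the mixture representation (3.5.3) for a Kw-F model. A Student t with 5 degrees of freedom is used as a stand-in base distribution F; the identity holds for any continuous base.

```python
import math
import numpy as np
from scipy import stats

# Base distribution F (illustrative stand-in; any continuous F works).
base = stats.t(df=5)

def kw_pdf(x, a, b):
    """Kw-F density g(x; a, b) = a b f(x) F(x)^{a-1} (1 - F(x)^a)^{b-1}."""
    F, f = base.cdf(x), base.pdf(x)
    return a * b * f * F**(a - 1) * (1 - F**a)**(b - 1)

def kw_cdf(x, a, b):
    """Kw-F distribution function G(x; a, b) = 1 - (1 - F(x)^a)^b."""
    return 1 - (1 - base.cdf(x)**a)**b

def order_stat_pdf_direct(x, a, b, i, n):
    """Density of the i-th order statistic from the classical formula."""
    G, g = kw_cdf(x, a, b), kw_pdf(x, a, b)
    c = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return c * g * G**(i - 1) * (1 - G)**(n - i)

def order_stat_pdf_mixture(x, a, b, i, n):
    """Same density via the finite mixture representation (3.5.3)."""
    total = 0.0
    for k in range(i):
        s = (-1)**k * math.comb(n - i + k, k) * math.comb(n, i - k - 1)
        total += s * kw_pdf(x, a, b * (n - i + k + 1))
    return total

xs = np.linspace(-3, 3, 7)
for i, n in [(1, 5), (2, 5), (5, 5)]:
    assert np.allclose(order_stat_pdf_direct(xs, 1.3, 0.7, i, n),
                       order_stat_pdf_mixture(xs, 1.3, 0.7, i, n)), (i, n)
```

The i = 1 case also confirms Proposition 3.5.2(b): the single surviving term is g(x; a, nb).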

3.6 Maximum Likelihood Estimation

Likelihood-based inference is a primary approach to statistical methodology, and maximum likelihood inference is a well-known concept with quite standard notation. In this section, the maximum likelihood estimators (MLEs) of the KwST parameters are given.

Consider a sample x_1, x_2, ..., x_n from the KwST(a, b, µ, σ, λ, r) distribution. The log-likelihood function l(ξ) for the parameter vector ξ = (a, b, µ, σ, λ, r) is

l(ξ) = n log(a) + n log(b) − n log(σ) + Σ_{i=1}^{n} log f(z_i; λ, r) + (a−1) Σ_{i=1}^{n} log F(z_i; λ, r) + (b−1) Σ_{i=1}^{n} log(1 − F(z_i; λ, r)^a),  (3.6.1)

where z_i = (x_i − µ)/σ. The components of the score vector U(ξ) are given by

U_a(ξ) = n/a + Σ_{i=1}^{n} log F(z_i; λ, r) − (b−1) Σ_{i=1}^{n} [F(z_i; λ, r)^a log F(z_i; λ, r)] / [1 − F(z_i; λ, r)^a],

U_b(ξ) = n/b + Σ_{i=1}^{n} log(1 − F(z_i; λ, r)^a),

U_µ(ξ) = Σ_{i=1}^{n} [1/f(z_i; λ, r)] ∂f(z_i; λ, r)/∂µ + (a−1) Σ_{i=1}^{n} [1/F(z_i; λ, r)] ∂F(z_i; λ, r)/∂µ + (b−1) Σ_{i=1}^{n} [1/(1 − F(z_i; λ, r)^a)] ∂(1 − F(z_i; λ, r)^a)/∂µ,

U_σ(ξ) = −n/σ + Σ_{i=1}^{n} [1/f(z_i; λ, r)] ∂f(z_i; λ, r)/∂σ + (a−1) Σ_{i=1}^{n} [1/F(z_i; λ, r)] ∂F(z_i; λ, r)/∂σ + (b−1) Σ_{i=1}^{n} [1/(1 − F(z_i; λ, r)^a)] ∂(1 − F(z_i; λ, r)^a)/∂σ,

U_λ(ξ) = Σ_{i=1}^{n} [1/f(z_i; λ, r)] ∂f(z_i; λ, r)/∂λ + (a−1) Σ_{i=1}^{n} [1/F(z_i; λ, r)] ∂F(z_i; λ, r)/∂λ + (b−1) Σ_{i=1}^{n} [1/(1 − F(z_i; λ, r)^a)] ∂(1 − F(z_i; λ, r)^a)/∂λ,

U_r(ξ) = Σ_{i=1}^{n} [1/f(z_i; λ, r)] ∂f(z_i; λ, r)/∂r + (a−1) Σ_{i=1}^{n} [1/F(z_i; λ, r)] ∂F(z_i; λ, r)/∂r + (b−1) Σ_{i=1}^{n} [1/(1 − F(z_i; λ, r)^a)] ∂(1 − F(z_i; λ, r)^a)/∂r,

where each derivative is taken with z_i = (x_i − µ)/σ substituted, so that differentiation with respect to µ and σ acts through z_i as well. Setting the components of the score vector to zero and solving simultaneously yields the maximum likelihood estimates (MLEs) of the six parameters. The system has no closed-form solution, so the estimates are obtained using numerical procedures available in computational software; we used the optim function available in R to do so.
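Since the score equations have no closed-form solution, the log-likelihood is maximized numerically. The dissertation uses R's optim; the following is a hedged Python sketch of the same idea using scipy.optimize.minimize. An ordinary Student t base stands in for the skew t, so the code literally fits a Kw-t model; replacing stats.t.pdf/cdf by the skew t pdf and cdf would give the KwST. All function names and starting values are illustrative assumptions.

```python
import numpy as np
from scipy import optimize, stats

def kw_negloglik(params, x):
    """Negative of the log-likelihood (3.6.1) for a Kw-G model whose base
    pdf f and cdf F are an ordinary Student t (stand-in for the skew t).
    params = (a, b, mu, sigma, r)."""
    a, b, mu, sigma, r = params
    if min(a, b, sigma, r) <= 0:
        return np.inf  # reject points outside the parameter space
    z = (x - mu) / sigma
    F = stats.t.cdf(z, df=r)
    ll = (len(x) * (np.log(a) + np.log(b) - np.log(sigma))
          + np.sum(stats.t.logpdf(z, df=r))
          + (a - 1) * np.sum(np.log(F))
          + (b - 1) * np.sum(np.log1p(-F**a)))
    return -ll

# Simulate from the Kw-t model by inverse transform:
# G^{-1}(u) = F^{-1}((1 - (1 - u)^{1/b})^{1/a})
rng = np.random.default_rng(1)
a0, b0, mu0, s0, r0 = 2.0, 1.5, 1.0, 0.5, 5.0
u = rng.uniform(size=300)
x = mu0 + s0 * stats.t.ppf((1 - (1 - u)**(1 / b0))**(1 / a0), df=r0)

start = np.array([1.0, 1.0, np.median(x), x.std(), 4.0])
res = optimize.minimize(kw_negloglik, start, args=(x,), method="Nelder-Mead",
                        options={"maxiter": 2000})
assert res.fun <= kw_negloglik(start, x)  # the optimizer improved on the start
```

A quick sanity check of the likelihood itself: with a = b = 1 the Kw-G model reduces to its base distribution, so the negative log-likelihood must match the base t negative log-likelihood exactly.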

3.6.1 Simulation Study

The generalized lambda distribution, denoted GLD, is a four-parameter generalization of Tukey's lambda family. It was proposed originally by Ramberg and Schmeiser (1974), who called it the "RS distribution." The four-parameter GLD family is known for its high flexibility, as it can create distributions with a large range of different shapes; this wide range of shapes allows it to accommodate almost any heavy-tailed and skewed data. For further discussion of the generalized lambda distribution, readers are referred to Karian and Dudewicz (2011). In this section we consider a sample generated from the generalized lambda distribution. We fixed the parameter values at λ1 = 2, λ2 = 1, λ3 = 0, λ4 = 1. The sample size considered was n = 500. We fit the KwST, st and sn distributions using the maximum likelihood method. From Table 3.2 we note that even though the AIC and SIC are better for the st and sn distributions than for the KwST distribution, the estimated skewness parameter λ and the degrees of freedom r are very large, which suggests that the sn distribution is converging to the half-normal distribution and the st distribution is converging to the half-t distribution. Therefore, the sn and st distributions are not recommended. However, the KwST shows better flexibility in fitting such data.
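Because the RS quantile function is available in closed form, a GLD sample can be drawn by inverse transform. The sketch below is an illustrative Python stand-in (the analysis itself was done in R); note that, taken literally in the RS form with λ1 = 2, λ2 = 1, λ3 = 0, λ4 = 1, the quantile function reduces to Q(p) = 2 + p.

```python
import numpy as np

def gld_quantile(p, l1, l2, l3, l4):
    """RS-parameterization quantile Q(p) = l1 + (p**l3 - (1-p)**l4) / l2."""
    p = np.asarray(p, dtype=float)
    return l1 + (p**l3 - (1 - p)**l4) / l2

rng = np.random.default_rng(0)
u = rng.uniform(size=500)
# the parameter values used in the text: (2, 1, 0, 1)
sample = gld_quantile(u, 2.0, 1.0, 0.0, 1.0)
```

With these parameter values Q(p) = 2 + p, so the generated values all lie in (2, 3), which is a convenient self-check on the implementation.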

Table 3.2: Parameter estimates for a random sample generated from the GLD distribution.

Dist.   µ       σ       λ            r        a       b        −log L     AIC        SIC
KwST    3.932   0.254   0.267        12.196   0.233   17.869   709.6132   1431.226   1456.514
st      2.981   1.820   −201.512     267.66   -       -        679.912    1367.825   1384.684
sn      2.999   1.906   −41896.852   -        -       -        685.677    1377.355   1389.999

Comparing the density curve fitting for the three distributions in Figure 3.6, we see that the KwST distribution provides a better fit to the histogram of the random sample than the st and sn distributions. We also note that the st and sn density curves look like truncated densities.

Figure 3.6: Fitting densities to GLD random sample.

3.6.2 Illustrative Examples

We illustrate the superiority of the KwST distribution proposed here as compared with some of its sub-models using the Akaike information criterion (AIC) and the Schwarz information criterion (SIC). We give an application using a well-known data set to demonstrate the applicability of the proposed model. Tables are used to display the six parameter estimates ξ = (µ, σ, λ, r, a, b) for each model, together with the negative log-likelihood, AIC and SIC values. The data set used here is the nidd.thresh data, analyzed by Nadarajah and Eljabri (2013), Papastathopoulos and Tawn (2013) and by Welchance (2016). The data set consists of 154 exceedances of the threshold level 65 m³s⁻¹ by the River Nidd at Hunsingore Weir from 1934 to 1969. The data set was collected by NERC (1975) and is available in the R package evir. We divide the data set by 10 for scaling purposes. Descriptive statistics of the dataset are given in Table 3.3. Figure 3.7 presents the histogram for the Nidd river dataset, as well as the corresponding normal Q-Q plot.

Table 3.3: Summary description of the Nidd river data.

Min.    Median   Mean    sd      Max.     skewness   kurtosis
6.508   8.152    9.787   4.094   30.580   2.615      7.558

The histogram shows that the data set is heavily skewed to the right, and the normal Q-Q plot shows that the data depart from normality in the tail, which makes this dataset an adequate candidate for our new model. We fit the KwST, KwSN, and BST distributions to the data.

Figure 3.7: Histogram and Q-Q plot for Nidd river dataset.

The MLEs of the parameters, the negative log-likelihood and the information criteria (AIC and SIC) are listed in Table 3.4. The proposed six-parameter distribution provides a significant improvement over the five-parameter Kumaraswamy skew normal (KwSN) distribution, and a small improvement over the BST distribution proposed in Chapter 2. The KwST distribution has the smallest SIC value among all the distributions considered, which indicates that the KwST is the best fit to this data set.

Table 3.4: Parameter estimations for the Nidd river dataset.

Dist.   µ       σ       λ        r       a        b        −log L     AIC        SIC
KwST    6.397   1.242   5.090    0.870   2.351    1.8101   329.629    671.2587   689.4804
KwSN    7.906   4.811   29.853   -       0.0794   0.477    333.515    677.03     692.2147
BST     6.373   0.619   1.565    7.259   2.076    0.2043   330.1769   671.3539   689.5756

Table 3.5 presents the parameter estimates for the Nidd river data using the ML method for the KwST, Kwt, and st distributions. We compare the fits using the information criteria. Across these three distributions, the negative log-likelihood values are very close, but there is a noticeable difference in the AIC and SIC values. This difference is due to the number of parameters in each model, as the KwST model has the largest number of parameters among them. It suggests that the KwST model is a competitive candidate to fit the data.

Table 3.5: Parameter estimations for the Nidd river dataset.

Dist.   µ       σ       λ        r       a       b        −log L     AIC        SIC
KwST    6.397   1.242   5.090    0.870   2.351   1.8101   329.629    671.2587   689.4804
Kwt     6.191   0.666   -        4.336   8.512   0.330    330.3885   670.7771   685.9618
st      6.632   2.074   20.252   1.679   -       -        330.442    668.884    681.0318

The following figures are graphical displays of the fitted density curves over the histogram of the Nidd river data. Figure 3.8 presents the fitted density curves of the KwST, KwSN, and BST distributions against the histogram of the Nidd river data. We note that the KwST and BST have similar performance; however, the KwST does better in capturing the peak of the data. Figure 3.9 compares the fitted density curves of the KwST, Kwt and st distributions. Although none of the three distributions provides a significantly better fit than the others, we note that the st performs worst in capturing the histogram peak. From Figure 3.8 and Figure 3.9 we conclude that the KwST distribution is a promising distribution for modeling skewed and heavy-tailed data.

Figure 3.8: Histogram and fitted density curves to the Nidd river data.

Figure 3.9: Fitted density curves to the Nidd river data.

3.7 L-moments

Following the definition of L-moments by Hosking (1990), we derive the first seven theoretical L-moments of the proposed KwST distribution in this section. Then we estimate the first two L-moments and the first two L-moment ratios by varying one parameter while fixing all other parameters. Further, using the L-moments method, we conduct parameter estimation for simulated and real-life data. Finally, we illustrate the fitting quality of the L-moments parameter estimates and compare them with the classical MLEs using AIC and SIC values.

3.7.1 Theoretical and Sample L-moments

Based on the definition of the theoretical L-moments by Hosking (1990), we derive the theo- retical L-moments for the KwST distribution as follows.

Theorem 3.7.1. The theoretical L-moments for a KwST random variable X are defined as

L_m = Σ_{k=0}^{m−1} Σ_{j=0}^{m−k−1} [(−1)^{k+j}/(k+j+1)] C(k+j, j) C(m−1, k) C(m−1, k+j) E_Y(Y),  (3.7.1)

where Y ∼ KwST (a, b(k + j + 1), λ, r).

Proof. Recall from Theorem 3.5.1 that the probability density function of an order statistic of X ∼ KwST(a, b, µ, σ, λ, r) is a linear combination of probability density functions of the same class of distributions, KwST(a, b(n − i + k + 1), λ, r). Hence, the moments of the order statistics follow directly from the ordinary moments derived in Section 3.4.1. The expectation of an order statistic can be written as

E_X(X_{m−k:m}) = Σ_{j=0}^{m−k−1} (−1)^j C(k+j, j) C(m, m−k−j−1) E_Y(Y),  (3.7.2)

where Y ∼ KwST(a, b(k + j + 1), λ, r). Substituting this expression in equation (2.7.1),

L_m = Σ_{k=0}^{m−1} Σ_{j=0}^{m−k−1} (1/m) (−1)^{k+j} C(m−1, k) C(k+j, j) C(m, m−k−j−1) E_Y(Y).  (3.7.3)

Note that

(1/m) C(m, m−k−j−1) = [1/(k+j+1)] C(m−1, k+j).  (3.7.4)

Thus,

L_m = Σ_{k=0}^{m−1} Σ_{j=0}^{m−k−1} [(−1)^{k+j}/(k+j+1)] C(k+j, j) C(m−1, k) C(m−1, k+j) E_Y(Y),  (3.7.5)

where Y ∼ KwST(a, b(k + j + 1), λ, r).

Corollary 3.7.1. The first seven KwST theoretical L-moments are expressed by

L1 = E(X),

L2 = E(X) − E(Y1),

L3 = E(X) − 3E(Y1) + 2E(Y2),

L4 = E(X) − 6E(Y1) + 10E(Y2) − 5E(Y3),

L5 = E(X) − 10E(Y1) + 30E(Y2) − 35E(Y3) + 14E(Y4),

L6 = E(X) − 15E(Y1) + 70E(Y2) − 140E(Y3) + 126E(Y4) − 42E(Y5),

L7 = E(X) − 21E(Y1) + 140E(Y2) − 420E(Y3) + 630E(Y4) − 462E(Y5) + 132E(Y6),

where Yi ∼ KwST (a, b(i + 1), λ, r).

The L-location (L1), L-scale (L2), L-skewness (τ3) and L-kurtosis (τ4) measures can be computed numerically using existing software. Table 3.6 shows numerical estimates of these measures obtained by computing the first four L-moments for various values of the parameters a, b, λ, and r with µ = 0 and σ = 1. Table 3.6(a) presents the numerical estimates for a KwST(a, b, λ, r) random variable for different values of a, b, and λ with the degrees of freedom fixed at r = 5, while in Table 3.6(b) the parameter λ = 2 is fixed and a, b, and the degrees of freedom r are varied. From Table 3.6 we note that when a = b = 1 the L-moment estimates of the KwST(a, b, λ, r) coincide with those of the skew t distribution, which can be computed using built-in functions in R. In addition, the KwST(a, b, λ, r) L-moment estimates get closer to the KwSN(a, b, λ) ones as the degrees of freedom r increase, and to the Kw|tr|(a, b, r) ones as the parameter λ increases; the numerical estimates for the KwSN and Kw|tr| are presented in the last line of each block. The numerical results in Table 3.6 agree with Proposition 3.3.4.

Table 3.6(a): Estimation of the L-mean (L1), L-variance(L2), L-skewness(τ3), and L-kurtosis(τ4) of KwST (a, b, λ, r) random variable for different values of a, b, and λ.

Dist.    a   b    λ     r   L1        L2       τ3        τ4
KwST     1   1    -5    5   -0.9305   0.4508   -0.2580   0.1823
                  -1        -0.6710   0.5845   -0.0970   0.1939
                  0         0.000     0.692    0.000     0.1936
                  1         0.6710    0.5845   0.0970    0.1939
                  5         0.9305    0.4508   0.2580    0.1823
                  50        0.9488    0.4351   0.2948    0.1694
                  500       0.9490    0.4349   0.2953    0.1690
Kw|tr|            -         0.9490    0.4350   0.2954    0.1691
KwST     5   3    -5    5   -0.3912   0.1322   -0.1324   0.1434
                  -1        -0.0545   0.2215   -0.0020   0.1408
                  0         0.6764    0.2958   0.0416    0.1410
                  1         1.1967    0.2781   0.0771    0.1413
                  5         1.2877    0.255    0.1114    0.1369
                  50        1.2880    0.2550   0.1122    0.1365
                  500       1.2880    0.2550   0.1122    0.1365
Kw|tr|            -         1.2881    0.2550   0.1122    0.1365
KwST     2   20   -5    5   -1.6245   0.2703   -0.1825   0.1609
                  -1        -1.5595   0.2818   -0.1667   0.1599
                  0         -1.0566   0.2842   -0.1411   0.1548
                  1         -0.2249   0.1986   -0.1057   0.1480
                  5         0.2327    0.0941   0.0212    0.1339
                  50        0.2628    0.0777   0.1196    0.1081
                  500       0.2629    0.0777   0.1196    0.1081
Kw|tr|            -         0.2629    0.0777   0.1202    0.1082

3.7.2 L-moments Parameter Estimation

To obtain L-moments parameter estimates, Hosking (1990) suggested equating the first p sample L-moments to the corresponding population quantities, where p is the (finite) number of unknown parameters. Therefore, we obtain parameter estimates for the proposed KwST(a, b, µ, σ, λ, r) distribution using the L-moments method numerically by minimizing the sum of squared errors

(L_1 − l_1)² + (L_2 − l_2)² + (τ_3 − τ̂_3)² + (τ_4 − τ̂_4)² + (τ_5 − τ̂_5)² + (τ_6 − τ̂_6)² + (τ_7 − τ̂_7)²,  (3.7.6)

Table 3.6(b): Estimation of the L-mean (L1), L-variance(L2), L-skewness(τ3), and L-kurtosis(τ4) of KwST (a, b, λ, r) random variable for different values of a, b, and r.

Dist.   a   b    λ   r     L1        L2         τ3        τ4
KwST    1   1    2   1     43.0116   47.41140   0.8764    0.9754
                     5     0.8488    0.5050     0.1742    0.1920
                     50    0.7245    0.4010     0.0847    0.1321
                     300   0.7147    0.3930     0.0788    0.1272
                     500   0.7147    0.3930     0.0788    0.1272
KwSN                 -     0.71365   0.39227    0.07765   0.12753
KwST    5   3    2   1     2.7461    1.072      0.40018   0.28451
                     5     1.2734    0.2610     0.0996    0.1417
                     50    1.1258    0.2020     0.0396    0.1287
                     300   1.1139    0.1980     0.0353    0.1262
                     500   1.1129    0.1980     0.0353    0.1262
KwSN                 -     1.1116    0.1970     0.0342    0.1250
KwST    2   20   2   1     0.0456    0.2070     -0.2125   0.2705
                     5     0.0743    0.1420     -0.0633   0.1408
                     50    0.0748    0.1330     -0.0526   0.1278
                     300   0.0749    0.1320     -0.0530   0.1287
                     500   0.0749    0.1320     -0.0454   0.1287
KwSN                 -     0.0749    0.1320     -0.0490   0.1281

where L_m and τ_m are the theoretical L-moments and L-moment ratios of the KwST distribution, and l_m and τ̂_m are the corresponding sample L-moments and L-moment ratios, respectively. This technique is implemented using the optim function of R for the minimization.
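The sample quantities in (3.7.6) can be computed from probability-weighted moments. The following Python sketch (an illustrative stand-in for the R implementation described above; the helper name is ours) computes the first four sample L-moments, which would then be matched against their theoretical counterparts by a numerical minimizer.

```python
import math
import numpy as np

def sample_lmoments(x, nmom=4):
    """First nmom sample L-moments l_1..l_nmom via the probability-weighted
    moments b_k = n^{-1} sum_i [C(i-1,k)/C(n-1,k)] x_(i)  (Hosking, 1990)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    b = [sum(math.comb(i, k) / math.comb(n - 1, k) * x[i]
             for i in range(n)) / n for k in range(nmom)]
    # l_{r+1} = sum_{k<=r} (-1)^{r-k} C(r,k) C(r+k,k) b_k
    return [sum((-1)**(r - k) * math.comb(r, k) * math.comb(r + k, k) * b[k]
                for k in range(r + 1)) for r in range(nmom)]

l1, l2, l3, l4 = sample_lmoments([1.0, 2.0, 3.0, 4.0])
assert abs(l1 - 2.5) < 1e-12        # l1 is the sample mean
assert abs(l2 - 5.0 / 6.0) < 1e-12  # l2 is half the Gini mean difference
```

The sample L-moment ratios are τ̂_m = l_m / l_2; handing (3.7.6), built from these quantities, to a minimizer such as scipy.optimize.minimize mirrors the optim-based procedure in the text.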

3.7.3 Illustrative Examples

To demonstrate the performance of the L-moments method in comparison with the method of maximum likelihood, we conduct parameter estimation and data fitting using simulated data from a skew t distribution and the well-known "ais" data set, which is available in the R package sn (see Azzalini and Azzalini (2016)). This data set consists of various measurements on 202 Australian athletes. In this analysis, we concentrate on the variable measuring the plasma ferritin concentration (fe). The "ais" dataset has been previously studied by Weisberg (2005) and Cordeiro and Bager (2015), to name a few. The following is parameter estimation using L-moments for a random sample of size 100 generated from the skew t distribution with parameter vector (µ = 2, σ = 1, λ = 2, r = 3). Table 3.7 shows parameter estimates for the KwST using the L-moments and ML methods. Based on the AIC and SIC criteria, the method of L-moments provides better estimates than the method of maximum likelihood. Figure 3.10 shows the fitted density of KwST(a, b, µ, σ, λ, r) using both estimation procedures. We note that the L-moments method provides a better fit than the ML method.

Table 3.7: Parameter estimates of KwST(a, b, µ, σ, λ, r) for a random sample from st(µ = 2, σ = 1, λ = 2, r = 3).

Method      a       b       µ       σ       λ        r       AIC        SIC
L-moments   2.381   1.530   1.436   1.042   1.593    2.407   273.4304   289.0614
MLE         2.978   0.854   2.018   0.752   −0.309   2.854   274.4424   290.0734

Figure 3.10: Fitted density of KwST (a, b, µ, σ, λ, r).

Figure 3.11: Histogram and Q-Q plot for the plasma (fe) data set.

In Figure 3.11, we note that the plasma ferritin concentration (fe) data set departs from the normality line in the Q-Q plot, which suggests that the data set is skewed to the right with heavy tails. We fit the KwST distribution to the plasma (fe) data set using the L-moments and the maximum likelihood procedures, and we use the AIC value to compare the two. Table 3.8 presents the parameter estimates for the plasma (fe) data using the L-moments and ML methods. Although it shows that the ML method has the advantage over the L-moments method for this data set, the latter performs well in estimating the parameters, and the L-moments estimates have the advantage that they always exist. Estimated density curves fitted to the plasma (fe) data set are presented in Figure 3.12.

Table 3.8: Parameters estimation of KwST (a, b, µ, σ, λ, r) using AIS (plasma) data set.

Method      a        b        µ        σ        λ         r        AIC        SIC
L-moments   3.624    0.912    19.309   21.832   0.960     2.389    2163.561   2183.411
MLE         3.9303   0.3100   29.144   28.35    −0.7898   8.8142   2078.034   2097.884

Figure 3.12: Fitted density of KwST(a, b, µ, σ, λ, r).

CHAPTER 4

MIXTURE MODELING USING TWO KUMARASWAMY SKEW t DISTRIBUTIONS

4.1 Introduction

The concept of mixture modeling was introduced over 100 years ago by the famous biometrician Karl Pearson. In his work, Pearson (1894) studied a mixture of two normal distributions with different means µ1 and µ2 and variances σ1 and σ2 in proportions π1 and π2 = 1 − π1. He used his model to fit a data set consisting of measurements on the ratio of forehead to body length of 1000 crabs. The data set was provided by Weldon (1893), and a single normal distribution did not fit it well due to the asymmetry of the data. Since then, the literature on mixture modeling has expanded enormously. Finite mixture modeling has been applied to a wide variety of statistical problems, such as discriminant analysis, image analysis and clustering analysis. In many applied problems, the shapes of fitted normal mixture components may be distorted, and inferences can be misleading when the data involve highly asymmetric observations, Lin et al. (2007b). Sometimes the normal mixture model overfits when additional components are included to capture the skewness. McLachlan and Peel (2004) developed a robust mixture of multivariate t distributions to broaden the normal mixture model for potential outliers or data with tails longer than the normal. Lin et al. (2007b) proposed the finite mixture of skew normal distributions to overcome the potential weakness of normal mixtures. To accommodate asymmetry and long tails simultaneously, Lin et al. (2007a) introduced the skew t mixture model. A random variable X taken from a finite mixture model has the density

g(x; Ψ) = Σ_{i=1}^{k} π_i g_i(x; θ_i),  (4.1.1)

where g_i(x; θ_i) is the ith component density with parameter vector θ_i = (a_i, b_i, µ_i, σ_i, λ_i, r_i), π_i is the mixing proportion, i = 1, 2, ..., k, and Ψ = (π, ξ), where ξ = (θ_1, θ_2, ..., θ_k)^T, π = (π_1, π_2, ..., π_k)^T and Σ_{i=1}^{k} π_i = 1.

4.2 The Kumaraswamy Skew t Mixture Model

Motivated by previous work on finite mixture models, we develop a wider class of mixture distributions that accommodates asymmetry and long tails simultaneously and contains many important distributions as sub-models. In this study we consider a two-component mixture model in which a random sample X1, X2, ..., Xn arises from a mixture of KwST distributions given by

g(x; Ψ) = Σ_{i=1}^{2} π_i g_i(x; θ_i),  (4.2.1)

where g_i(x; θ_i) is the probability density function of KwST(θ_i) with θ_i = (a_i, b_i, µ_i, σ_i, λ_i, r_i) as defined in (3.2.3), π_i is the mixing proportion, i = 1, 2, and Ψ = (π, ξ), where ξ = (θ_1, θ_2)^T, π = (π_1, π_2)^T and π_1 + π_2 = 1. With this Kumaraswamy skew t mixture (KwST.mix) model approach, the normal mixture, t mixture, and skew t mixture models can be treated as special cases of this family.

Definition 4.2.1. A random variable X is said to have a mixture of two Kumaraswamy skew t distributions if it has the pdf defined by

g(x; Ψ) = Σ_{i=1}^{2} π_i a_i b_i f(x; λ_i, r_i) F(x; λ_i, r_i)^{a_i−1} (1 − F(x; λ_i, r_i)^{a_i})^{b_i−1},  (4.2.2)

where 0 < π_i < 1, Σ_{i=1}^{2} π_i = 1, f(x; λ_i, r_i) is the probability density function of the skew t distribution defined in (1.5.3), F(x; λ_i, r_i) is the distribution function of the skew t distribution, µ_i, λ_i ∈ ℝ, and σ_i, r_i, a_i, b_i > 0 (i = 1, 2).
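To make the definition concrete, the following hedged Python sketch evaluates a two-component Kw-G mixture density of the form (4.2.2) and checks that it integrates to one. An ordinary Student t base stands in for the skew t pdf and cdf of (1.5.3), and all names and parameter values are illustrative.

```python
import numpy as np
from scipy import integrate, stats

def kw_pdf(x, a, b, mu, sigma, r):
    """Kw-G component density; the t base is a stand-in for the skew t."""
    z = (x - mu) / sigma
    F = stats.t.cdf(z, df=r)
    f = stats.t.pdf(z, df=r) / sigma
    return a * b * f * F**(a - 1) * (1 - F**a)**(b - 1)

def mix_pdf(x, pi1, theta1, theta2):
    """Two-component mixture density, as in (4.2.2)."""
    return pi1 * kw_pdf(x, *theta1) + (1 - pi1) * kw_pdf(x, *theta2)

# components loosely inspired by the first parameter vector of Figure 4.1
theta1 = (1.0, 1.0, -2.0, 5.0, 2.0)   # (a, b, mu, sigma, r)
theta2 = (3.0, 7.0, 10.0, 1.0, 2.0)
# integrate in pieces around the two component centers for robustness
pieces = [(-np.inf, -2.0), (-2.0, 10.0), (10.0, np.inf)]
total = sum(integrate.quad(lambda x: mix_pdf(x, 0.9, theta1, theta2),
                           lo, hi, limit=200)[0] for lo, hi in pieces)
assert abs(total - 1.0) < 1e-4  # a proper density integrates to one
```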

The Kumaraswamy skew t mixture model offers wide flexibility, as it can produce mixtures of a variety of distributions such as the skew t, skew normal and normal, among others, as special cases. Here we present some graphical illustrations of the mixture of two KwST components to examine the performance of the proposed model. In the following figures we provide graphical illustrations of the mixture of two KwST distributions for fixed values of the parameter vector Ψ. We observe that this model is very flexible, as it can fit different forms of data, as presented in Figures 4.1 and 4.2.

Figure 4.1: The density curve of the KwST.mix with two different parameter vectors, Ψ1 = (a1 = 1, b1 = 1, µ1 = −2, σ1 = 5, λ1 = 1, r1 = 2, a2 = 3, b2 = 7, µ2 = 10, σ2 = 1, λ2 = 3, r2 = 2, π1 = 0.9) on the left and Ψ2 = (a1 = 1, b1 = 1, µ1 = −2, σ1 = 4, λ1 = 2, r1 = 3, a2 = 3, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.62) on the right.

Figure 4.1 shows the density curve of the KwST.mix for the two parameter vectors Ψ1 and Ψ2 given in its caption. The right plot in Figure 4.1 shows that the first component contributes the broad shoulder seen in the left half of the plot, while the second component contributes the sharper main peak centered at one. Figure 4.2 also shows the density curve of the KwST.mix with two different parameter vectors, Ψ1 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 0, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.75) and Ψ2 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 4, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.5). Although we only change the values of two parameters, λ1 and the mixing proportion π1, we obtain different shapes of the density: the graph on the left shows that the first component has a thicker left tail and more variation than the second one, while the second component has a longer and thinner right tail.

4.3 The EM Algorithm

In the literature, the problem of fitting mixture distributions can be handled using many techniques, such as graphical methods, the maximum likelihood method, the method of moments and Bayesian methods. A classical way to tackle the mixture problem is the maximum likelihood method via the EM algorithm.

Figure 4.2: The density curve of the KwST.mix with two different parameter vectors, Ψ1 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 0, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.75) on the left and Ψ2 = (a1 = 1, b1 = 7, µ1 = −2, σ1 = 2, λ1 = 4, r1 = 3, a2 = 2, b2 = 2, µ2 = 1, σ2 = 1, λ2 = 5, r2 = 2, π1 = 0.5) on the right.

The basic idea of the EM algorithm is to view the mixture model in the sense of an incomplete data structure. A zero-one indicator variable Zj is introduced to define the component in the mixture model (4.2.2) from which the observed random variable xj is viewed to have arisen.

That is, the variable Zj is a Bernoulli random variable with probability πi. In this setting, the mixture model arises because the component-label variable Zj is missing from the complete data set.

Therefore, instead of estimating the joint distribution of the featured variable Xj and its component label Zj, we only need to estimate the parameters based on the data available from the marginal distribution of Xj.

Assume that Z1, Z2, ..., Zn ∼ Bernoulli(π1). Let Xc = (X, Z)^T be the complete data vector, where X = (X1, X2, ..., Xn) and Z = (Z1, Z2, ..., Zn). Then the distribution of the complete data vector Xc implies the appropriate distribution for the incomplete data vector X = (X1, X2, ..., Xn).

The complete data log likelihood for Ψ, logLc(Ψ), is given by

log L_c(Ψ) = Σ_{i=1}^{2} Σ_{j=1}^{n} z_ij {log(π_i) + log(g_i(x_j; θ_i))},  (4.3.1)

log(g_i(x_j; θ_i)) = log(a_i) + log(b_i) + log(f(x_j; λ_i, r_i)) + (a_i − 1) log(F(x_j; λ_i, r_i)) + (b_i − 1) log(1 − F(x_j; λ_i, r_i)^{a_i}).  (4.3.2)

E-step: In this step the problem of the unobserved data z_ij is handled by taking the conditional expectation of the complete-data log-likelihood given the observed data x, using the current fit for Ψ, E(log L_c(Ψ) | x). Let Ψ^(k) denote the value of Ψ specified at the kth iteration of the EM algorithm, with ψ^(0) denoting the initial parameter vector. Then the conditional expectation in the first iteration can be written as

Q(ψ, ψ^(0)) = E_{ψ^(0)}(log L_c(Ψ) | x).  (4.3.3)

Note that the complete-data log-likelihood, log L_c(Ψ), is linear in the unobserved data z_ij. Thus, in this step we only need to calculate the conditional expectation of z_ij given the observed data x, as follows.

E_{Ψ^(k)}(z_ij | x) = P(z_ij = 1 | x) = τ_i(x_j; Ψ^(k)).

That is

τ_i(x_j; Ψ^(k)) = π_i^(k) g_i(x_j; θ_i^(k)) / [Σ_{h=1}^{2} π_h^(k) g_h(x_j; θ_h^(k))]

= π_i^(k) a_i^(k) b_i^(k) f(x_j; λ_i^(k), r_i^(k)) F(x_j; λ_i^(k), r_i^(k))^{a_i^(k)−1} (1 − F(x_j; λ_i^(k), r_i^(k))^{a_i^(k)})^{b_i^(k)−1} / [Σ_{h=1}^{2} π_h^(k) a_h^(k) b_h^(k) f(x_j; λ_h^(k), r_h^(k)) F(x_j; λ_h^(k), r_h^(k))^{a_h^(k)−1} (1 − F(x_j; λ_h^(k), r_h^(k))^{a_h^(k)})^{b_h^(k)−1}]

= Σ_{s=0}^{b_i^(k)−1} w_(i,s) π_i^(k) a_i^(k) b_i^(k) f(x_j; λ_i^(k), r_i^(k)) F(x_j; λ_i^(k), r_i^(k))^{a_i^(k)(s+1)−1} / [Σ_{h=1}^{2} Σ_{s=0}^{b_h^(k)−1} w_(h,s) π_h^(k) a_h^(k) b_h^(k) f(x_j; λ_h^(k), r_h^(k)) F(x_j; λ_h^(k), r_h^(k))^{a_h^(k)(s+1)−1}],  (4.3.4)

where w_(i,s) = (−1)^s C(b_i^(k) − 1, s), s = 0, 1, ..., b_i^(k) − 1, i = 1, 2, and j = 1, ..., n. It is clear that τ_i(x_j; Ψ^(k)) is the posterior probability that the jth member of the sample, with observed value x_j, belongs to the ith component of the mixture.

M-step: This step deals with updating the parameter vector at the (k+1)th iteration. We calculate the updated estimate π_i^(k+1) of the mixing proportions π_i independently of the updated estimate θ_i^(k+1) of the parameter vector ξ. Using equation (4.3.4), the mixing proportion is estimated by

π_i^(k+1) = Σ_{j=1}^{n} ẑ_ij / n = Σ_{j=1}^{n} τ_i(x_j; Ψ^(k)) / n,  (i = 1, 2).  (4.3.5)

From the previous step, the conditional expectation of the complete data log likelihood, logLc(Ψ) is given by

Q(ψ, ψ^(k)) = Σ_{i=1}^{2} Σ_{j=1}^{n} τ_i(x_j; Ψ^(k)) {log(π_i) + log(g_i(x_j; θ_i))}.  (4.3.6)

Update ξ^(k) by maximizing Q(ψ, ψ^(k)) over ξ, obtaining the appropriate root of

Σ_{i=1}^{2} Σ_{j=1}^{n} τ_i(x_j; Ψ^(k)) ∂log(g_i(x_j; θ_i))/∂ξ = 0,  (4.3.7)

which leads to the updated ξ^(k+1) = (θ_1^(k+1), θ_2^(k+1)). This process is repeated alternately until some distance between two successive evaluations of the actual log-likelihood, |L(Ψ^(k+1)) − L(Ψ^(k))|, changes by an arbitrarily small amount, indicating convergence of the sequence of likelihood values. For more information in the context of finite mixtures, readers are referred to McLachlan and Peel (2004).
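The E- and M-steps above can be sketched generically. In the following hedged Python sketch, pdf(x, theta) is any component density; normal components are used as a runnable stand-in for the KwST density, the E-step computes the responsibilities in the ratio form of (4.3.4), and the M-step updates the mixing proportion as in (4.3.5) and maximizes the weighted log-likelihood numerically. All names and settings are illustrative, not the dissertation's implementation.

```python
import numpy as np
from scipy import optimize, stats

def em_two_component(x, pdf, theta1, theta2, pi1=0.5, max_iter=50, tol=1e-6):
    """Generic two-component EM sketch for a mixture pi1*g1 + (1-pi1)*g2."""
    def weighted_nll(th, w):
        # negative weighted log-likelihood used for the numerical M-step
        return -np.sum(w * np.log(np.maximum(pdf(x, th), 1e-300)))
    loglik = -np.inf
    for _ in range(max_iter):
        # E-step: posterior component probabilities tau_{1j} (cf. (4.3.4))
        d1 = pi1 * pdf(x, theta1)
        d2 = (1 - pi1) * pdf(x, theta2)
        tau1 = d1 / (d1 + d2)
        # M-step: mixing proportion (4.3.5) and component parameters
        pi1 = tau1.mean()
        theta1 = optimize.minimize(weighted_nll, theta1, args=(tau1,),
                                   method="Nelder-Mead").x
        theta2 = optimize.minimize(weighted_nll, theta2, args=(1 - tau1,),
                                   method="Nelder-Mead").x
        new_loglik = np.sum(np.log(d1 + d2))
        if new_loglik - loglik < tol:
            break
        loglik = new_loglik
    return pi1, theta1, theta2

def npdf(x, th):
    # normal component as a runnable stand-in for the KwST component density
    mu, sigma = th
    return stats.norm.pdf(x, mu, abs(sigma) + 1e-12)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
pi1, th1, th2 = em_two_component(x, npdf,
                                 np.array([-1.0, 1.0]), np.array([2.0, 1.0]))
```

On this well-separated example the estimated mixing proportion settles near the true value 0.6 and the component means near −2 and 3.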

4.4 The Observed Information Matrix

Once the parameters of the mixture are estimated using the EM algorithm, we usually want to assess the standard errors of the estimated parameter vector Ψ̂. Under regularity conditions, the expected Fisher information matrix about the parameter vector Ψ, Υ(Ψ), is given by

Υ(Ψ) = EΨ{I(Ψ; X)}, (4.4.1) where I(Ψ; X) is the negative of the Hessian of the log likelihood function defined by

I(Ψ; X) = −∂² log L(Ψ)/∂Ψ².  (4.4.2)

The observed information matrix I(Ψ;ˆ X) is given by

I(Ψ̂; X) = −∂² log L(Ψ)/∂Ψ² |_{Ψ = Ψ̂}.  (4.4.3)

In this section the observed information matrix of the mixture of two KwST distributions is obtained. For the calculation and approximation of the observed information matrix I(Ψ̂; X) within the EM framework, McLachlan and Peel (2004) and Basford et al. (1997), among others, showed that in the case of independent data it is computationally convenient to approximate I(Ψ̂; X) in terms of the gradient vector of the complete data log-likelihood function, as given by

I(Ψ̂; X) = Σ_{j=1}^{n} ŝ_j ŝ_j^T,  (4.4.4)

where sˆj = ∂logLc(Ψ; xj)/∂Ψ|Ψ=Ψˆ and

log L_c(Ψ; x_j) = Σ_{i=1}^{2} τ_i(x_j; Ψ) {log(π_i) + log(g(x_j; θ_i))}.  (4.4.5)

Let y_ij = (x_j − µ_i)/σ_i and let Ψ̂ be the vector of all the unknown parameters, partitioned as

Ψ = (π_1, a_1, b_1, µ_1, σ_1, λ_1, r_1, a_2, b_2, µ_2, σ_2, λ_2, r_2)^T.  (4.4.6)

Thus, the corresponding partition of sˆj is given by

ŝ_j = (ŝ_j(π_1), ŝ_j(a_1), ŝ_j(b_1), ŝ_j(µ_1), ŝ_j(σ_1), ŝ_j(λ_1), ŝ_j(r_1), ŝ_j(a_2), ŝ_j(b_2), ŝ_j(µ_2), ŝ_j(σ_2), ŝ_j(λ_2), ŝ_j(r_2))^T.  (4.4.7)

The elements of sˆj are given by

ŝ_j(π̂_1) = τ(Ψ̂, y_{1,j})/π̂_1 − τ(Ψ̂, y_{2,j})/π̂_2,

ŝ(â_i) = 1/â_i + log F(y_{i,j}; λ̂_i, r̂_i) − (b̂_i − 1) [F(y_{i,j}; λ̂_i, r̂_i)^{â_i} log F(y_{i,j}; λ̂_i, r̂_i)] / [1 − F(y_{i,j}; λ̂_i, r̂_i)^{â_i}],

ŝ(b̂_i) = 1/b̂_i + log(1 − F(y_{i,j}; λ̂_i, r̂_i)^{â_i}),

ŝ(µ̂_i) = [A(µ̂_i) − B(µ̂_i) + C(µ̂_i)]/σ̂_i,  (4.4.8)

ŝ(σ̂_i) = −1/σ̂_i + [A(σ̂_i) − B(σ̂_i) + C(σ̂_i)]/σ̂_i,

ŝ(λ̂_i) = A(λ̂_i) − B(λ̂_i) + C(λ̂_i),

ŝ(r̂_i) = A(r̂_i) − B(r̂_i) + C(r̂_i),

where A(ĥ_i) = D_{ĥ_i}(f(y_{i,j}; λ̂_i, r̂_i))/f(y_{i,j}; λ̂_i, r̂_i), B(ĥ_i) = (â_i − 1) D_{ĥ_i}(F(y_{i,j}; λ̂_i, r̂_i))/F(y_{i,j}; λ̂_i, r̂_i), C(ĥ_i) = (b̂_i − 1) D_{ĥ_i}(F(y_{i,j}; λ̂_i, r̂_i)^{â_i})/[1 − F(y_{i,j}; λ̂_i, r̂_i)^{â_i}], and D_{ĥ_i}(H) = ∂H/∂ĥ_i, for i = 1, 2 and j = 1, 2, ..., n.

Once the observed information matrix I(Ψ̂; X) is computed, the variance of each of the estimates in Ψ̂, partitioned according to (4.4.6), can be obtained from the corresponding diagonal elements of I^{−1}(Ψ̂; X).
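The outer-product approximation (4.4.4) can be implemented with numerically differentiated per-observation scores. The sketch below is illustrative, not the dissertation's implementation: it uses central differences for ŝ_j and, as a runnable stand-in for the per-observation log-density (4.4.5), a simple normal model whose information matrix is known in closed form.

```python
import numpy as np
from scipy import stats

def observed_info(x, loglik_obs, theta, eps=1e-5):
    """Approximate I(theta-hat) = sum_j s_j s_j^T  (cf. (4.4.4)) using
    central-difference per-observation score vectors s_j."""
    theta = np.asarray(theta, dtype=float)
    scores = np.empty((len(x), len(theta)))
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = eps
        scores[:, k] = (loglik_obs(x, theta + e)
                        - loglik_obs(x, theta - e)) / (2 * eps)
    return scores.T @ scores

def norm_loglik_obs(x, th):
    # stand-in model: N(mu, sigma); the KwST.mix per-observation
    # log-likelihood (4.4.5) would be plugged in here instead
    mu, sigma = th
    return stats.norm.logpdf(x, mu, sigma)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 2000)
I = observed_info(x, norm_loglik_obs, [x.mean(), x.std()])
# standard errors from the diagonal of the inverse information matrix
se = np.sqrt(np.diag(np.linalg.inv(I)))
```

For this normal model the standard error of the mean should be close to σ̂/√n ≈ 0.022, which serves as a quick sanity check of the approximation.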

4.5 Simulation Studies

In order to examine the performance of the proposed model, we conducted the following simulation studies. We investigate the ability of the KwST.mix model to fit observations generated from a mixture of generalized lambda distributions and from a mixture of skew t distributions.

Ramberg and Schmeiser (1974) proposed the generalized lambda distribution (GLD) as a four-parameter generalization of Tukey's lambda family, defined by the quantile function

F^{−1}(p; λ1, λ2, λ3, λ4) = λ1 + [p^{λ3} − (1 − p)^{λ4}]/λ2,

where p ∈ [0, 1], λ1 and λ2 are the location and scale parameters, and λ3 and λ4 are shape parameters jointly related to the strengths of the lower and upper tails, respectively. This distribution reduces to Tukey's lambda distribution when λ1 = 0 and λ2 = λ3 = λ4 = λ. The probability density function of the GLD at the point x = F^{−1}(p) is given by

f(x) = f(F^{−1}(p)) = λ2 / [λ3 p^{λ3−1} + λ4 (1 − p)^{λ4−1}].

The four-parameter GLD family is known for its high flexibility, as it can create distributions with a large range of different shapes. However, the GLD probability density function is a legitimate probability density only in four restricted regions of parameter values. We consider a sample generated from a mixture of two generalized lambda distributions (GLD)

using R package GLDEX. We fixed the parameter values at θ1 = (λ1 = 0, λ2 = 1, λ3 = 2, λ4 =

3) and θ1 = (λ1 = 2, λ2 = 1, λ3 = 0, λ4 = 1) and π = 0.5. The sample sizes considered was n = 500. We fit the KwST.mix, st.mix and the sn.mix distributions using the EM algorithm proposed in section 3. Due to the slow convergence of the EM algorithm and the limited ability of the devices that are used to preform the analysis, we fixed the convergence rate in this

example to 0.1 and we assume that the degrees of freedom r1 = r2 = r. Table 4.1 presents the parameter estimations, the AIC and the SIC values corresponding to each distribution. Comparing the AIC value for the three models in Table 4.1 and the density curves fitting in Figure 4.3, we conclude that the KwST.mix model represents the best fit. The SIC value of the st.mix and the KwST.mix is very close, the SIC of the KwST.mix is larger because the KwST.mix has more parameters than the st.mix. Further, inspired by the work of Ning et al. (2008), we use the KL distance and the overlapping 91 Table 4.1: Parameter estimations for samples generated from the mixture of GLD densities.

KwST.mix st.mix snorm.mix π 0.618 0.621 0.630 a1 0.738 - - b1 1.517 - - µ1 0.249 0.105 0.159 σ1 0.276 0.140 0.349 λ1 -1.246 -1.018 -1.481 r 5.643 4.4258 - a2 1.511 - - b2 1.784 - - µ2 2.666 2.390 2.425 σ2 0.916 0.493 0.602 λ2 -1.391 -0.959 -0.996 AIC 1233.925 1242.335 1383.266 SIC 1284.5 1271.837 1412.769

Figure 4.3: Fitted densities of mixture of two components of the KwST.mix, st.mix and snorm.mix models to the mixture of GLD random sample. coefficient to show the advantage of using the KwST.mix model to fit this GLD.mix random sample. Kullback and Leibler (1951) defined a distance function between the true probability distribution and the target probability distribution, denoted as the KL distance. This function measures the similarity of the behavior of the true probability distribution, GLD.mix, and the target probability distribution, KwST.mix, and is defined as follows.

Definition 4.5.1. (KL Distance, Kullback and Leibler (1951))

Let f1(x) be the true density function for a variable X and f2(x) be a target density function, which is used for X in practice. The KL distance is defined as

I(1 : 2) = ∫E f1(x) ln( f1(x) / f2(x) ) dx,    (4.5.1)

where E is a set on which f1(x) and f2(x) are well defined.
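Equation (4.5.1) can be checked numerically against densities with a known KL distance; a small Python sketch (the two normal densities here are illustrative stand-ins, not the fitted mixture models):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# I(1:2) = integral of f1 * ln(f1/f2) over the support (Eq. 4.5.1)
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=0.5, scale=1.0).pdf
kl, _ = quad(lambda x: f1(x) * np.log(f1(x) / f2(x)), -10, 10)
# for two normals with common scale s, I(1:2) = (mu1 - mu2)^2 / (2 s^2)
```

Here the closed form gives (0.5)^2 / 2 = 0.125, which the numerical integral reproduces.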

The smaller the KL distance, the closer the target density is to the true density; in particular, the KL distance is 0 if the target distribution used to fit the data is equal to the true distribution. The overlapping coefficient calculates the area under both densities and is also used to measure the similarity between two probability densities.

Definition 4.5.2. (The Overlapping Coefficient)

Let f1(x) and f2(x) be two density functions, then the overlapping coefficient δ(f1, f2) is given by

δ(f1, f2) = ∫ min(f1(x), f2(x)) dx.    (4.5.2)
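As with the KL distance, δ(f1, f2) is easy to evaluate by numerical integration; a Python sketch with two unit-variance normal densities (again stand-ins for the fitted models, not the actual KwST.mix and st.mix densities):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# overlapping coefficient (Eq. 4.5.2): the area under min(f1, f2)
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=1.0, scale=1.0).pdf
delta, _ = quad(lambda x: np.minimum(f1(x), f2(x)), -10, 10)
# two unit normals whose means are d apart overlap by 2 * Phi(-d / 2)
```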

The overlapping coefficient satisfies 0 ≤ δ(f1, f2) ≤ 1, where the closer the value is to 1, the more similar the behavior of the two densities. In this simulation study we used the estimated parameters of the KwST.mix and the st.mix presented in Table 4.1 and we computed the KL distance I(1 : 2) and

the overlapping coefficient δ(f1, f2) as presented in Table 4.2.

Table 4.2: Computed KL distance I(1 : 2) and overlapping coefficient δ(f1, f2).

Distribution   I(1 : 2)   δ(f1, f2)
KwST.mix       0.060      0.805
st.mix         1.395      0.541

Table 4.2 confirms the superiority of the KwST.mix fit to the GLD.mix random sample over the st.mix model. We computed the KL distance using the functions "KL.plugin" or "KLD", which are available in R. From the computed values of the KL distance and the overlapping coefficient we conclude that the KwST.mix is a very good fit to the random sample generated from the GLD.mix. Second, we generate a random sample from a mixture of two skew t densities. We

fixed the parameter values at µ1 = 15, µ2 = 20, σ1 = 20, σ2 = 4, λ1 = 6, λ2 = −4, r = 3, and π = 0.2. The sample size considered was n = 500. Ignoring the known true values of the parameters, we fit the KwST.mix, st.mix and snorm.mix models using the EM algorithm described in Section 3. Due to the slow convergence of the EM algorithm and the limited computing resources used to perform the analysis, we fixed the convergence rate in this example to 0.01 and we assume that the degrees of freedom r1 = r2 = r. Comparing the AIC values in Table 4.3 and the fitted density curves in Figure 4.4, we see that the KwST.mix model fits the skew t sample better than the st.mix and snorm.mix models. The SIC values of the st.mix and the KwST.mix are very close; the SIC of the KwST.mix is larger because the KwST.mix has more parameters than the st.mix.
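A skew t mixture sample of this kind can be drawn from the stochastic representation of Azzalini's skew t: a skew normal variate divided by the square root of an independent chi-square over its degrees of freedom. A Python sketch with the parameter values above (in the study itself the R package sn was used):

```python
import numpy as np

def rskewt(n, mu, sigma, lam, nu, rng):
    # skew normal via the conditioning representation, then a
    # chi-square scale mixture to obtain the skew t
    delta = lam / np.sqrt(1.0 + lam**2)
    u0 = rng.standard_normal(n)
    u1 = rng.standard_normal(n)
    z = delta * np.abs(u0) + np.sqrt(1.0 - delta**2) * u1
    w = rng.chisquare(nu, n) / nu
    return mu + sigma * z / np.sqrt(w)

rng = np.random.default_rng(3)
n, mix_pi = 500, 0.2
k = rng.random(n) < mix_pi
x = np.where(k,
             rskewt(n, 15.0, 20.0, 6.0, 3, rng),    # component 1
             rskewt(n, 20.0, 4.0, -4.0, 3, rng))    # component 2
```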

Table 4.3: Parameter estimations for samples generated from the mixture of skew t densities.

         KwST.mix    st.mix     snorm.mix
π          0.150       0.892      0.159
a1         1.554       -          -
b1         1.114       -          -
µ1        19.487      19.010     20.008
σ1        17.455      16.566    673.522
λ1        13.165      -1.029      1.048
r          3.824       3.822      -
a2         1.194       -          -
b2         1.784       -          -
µ2        21.377      33.336     19.35
σ2         4.799     220.753     23.412
λ2       -11.477       1.125     -1.367
AIC     3278.293    3281.486   3303.089
SIC     3328.868    3310.988   3332.592

Figure 4.4: Fitted densities of mixture of two components of KwST.mix, st.mix and the snorm.mix models to the mixture of skew t sample.

4.6 Application

For comparison purposes, we fit the data with a two-component mixture model using our proposed model (KwST.mix), the skew t (st.mix) and the skew normal (snorm.mix) as component densities. We assume that the degrees of freedom are equal so that we can use the built-in parameter estimation function for the skew normal and skew t mixture models available in the package "mixsmsn". For comparing the fitting results, the ML estimates, the negative log-likelihood, and the AIC and SIC values for the KwST.mix, st.mix and snorm.mix models are summarized in tables. As an application of the methodology proposed here, we consider the data set collected by the National Center for Health Statistics (NCHS) of the Centers for Disease Control (CDC). The NCHS has conducted the National Health and Nutrition Examination Survey (NHANES) annually since 1999, with the survey data released in two-year cycles. In this application we consider the body mass index (bmi) for men aged between 18 and 80 years. The full data set had 4579 participants with bmi records, but only those participants with weights within [39.50 kg, 70.00 kg] and [95.01 kg, 196.80 kg] were retained, to explore the pattern of mixture. This data set was analyzed by Lin et al. (2007a) and Basso et al. (2010), who considered the reports made in 1999–2000 and 2001–2002. The resulting data set, which contains the bmi measurements of 2107 people, can be found in the R package mixsmsn.
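The E-step of the EM algorithm used for these two-component fits only needs the component responsibilities τij; a generic Python sketch (plain normal densities stand in here for the KwST component densities):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi1, f1, f2):
    # responsibilities tau_1j = pi1 f1(x) / (pi1 f1(x) + (1 - pi1) f2(x))
    s1 = pi1 * f1(x)
    s2 = (1.0 - pi1) * f2(x)
    return s1 / (s1 + s2)

x = np.array([-2.0, 0.0, 3.0])
tau = e_step(x, 0.5, norm(0.0, 1.0).pdf, norm(3.0, 1.0).pdf)
# observations near a component's center get responsibility near 1 for it
```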

Table 4.4: Summary description of the bmi data.

Min.    Median   Mean    sd      Max.    skewness   kurtosis
14.86   26.89    28.19   7.498   64.16   0.7142     3.295
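The skewness and kurtosis in Table 4.4 are the usual moment estimators; a quick Python sketch of how such summaries are computed (kurtosis here is the raw fourth standardized moment, not excess kurtosis):

```python
import numpy as np

def moment_skew_kurt(x):
    # third and fourth standardized sample moments
    m = x.mean()
    s = x.std(ddof=0)
    skew = ((x - m)**3).mean() / s**3
    kurt = ((x - m)**4).mean() / s**4
    return skew, kurt
```

For a symmetric sample the skewness is 0 and a normal sample gives kurtosis near 3; the bmi values of 0.7142 and 3.295 therefore indicate right skewness with slightly heavier-than-normal tails.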

Applying the EM algorithm for 50 iterations with convergence rate 0.1 and fixing the degrees of

freedom r1 = r2, we fit the KwST.mix, st.mix and snorm.mix models to the bmi data and use the AIC and SIC values to compare the three models. Table 4.5 contains the maximum likelihood estimates (MLE) of the parameters of the three models as well as the AIC and SIC values. We display the fitting results on a single graph in Figure 4.5. Comparing the AIC and SIC values in Table 4.5, we conclude that the KwST.mix model fits the bmi data set best.
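The model comparisons throughout this section use the standard information criteria computed from the minimized negative log-likelihood; a short Python sketch (the input values below are hypothetical, purely for illustration):

```python
import numpy as np

def aic_sic(neg_loglik, k, n):
    # AIC = 2*negLL + 2k ; SIC = 2*negLL + k*log(n)
    return 2.0 * neg_loglik + 2.0 * k, 2.0 * neg_loglik + k * np.log(n)

aic, sic = aic_sic(neg_loglik=100.0, k=3, n=100)  # hypothetical values
```

Since SIC penalizes extra parameters more heavily for large n, a richer model such as the KwST.mix can win on AIC while losing slightly on SIC, as observed in the simulation studies.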

Table 4.5: Parameter estimations for the bmi data.

         KwST.mix    st.mix     snorm.mix
π          0.511       0.605      0.554
a1         1.402       -          -
b1         3.875       -          -
µ1        21.715      21.508     21.199
σ1         4.484       8.709     11.397
λ1         0.977       0.9897     1.0107
r1        10.946       3.659      -
a2         1.557       -          -
b2         1.075       -          -
µ2        27.416      33.396     33.117
σ2         6.736      23.095     33.306
λ2        10.729       0.973      1.122
r2        10.946       3.659      -
AIC    13782.680   14169.64   14200.69
SIC    13850.54    14223.99   14240.26

Figure 4.5: Fitted densities of mixture of two components of the Kumaraswamy skew t (KwST.mix), skew t (st.mix) and skew normal (snorm.mix) models to the bmi data.

CHAPTER 5 FINAL REMARKS AND DISCUSSION

The t distribution is commonly used to model heavy tailed data. However, it does not accommodate skewed data. We propose two new statistical distributions. The first is the beta skew t distribution, denoted BST, together with some of its structural properties. The proposed distribution provides flexibility in modeling heavy tailed and skewed data, and it is more general than the skew t distribution as it includes the tr, str(λ), beta-t, BSN and some other important distributions as special cases of its parameters. Since the distribution function of the BST does not have a closed form, we also proposed the Kumaraswamy skew t distribution, denoted KwST, which is considered even more flexible than the BST since its distribution function has a closed form. Similar to the BST distribution, the KwST distribution provides flexibility in modeling heavy-tailed and skewed data and

it is more general than the skew t distribution as it includes the tr, str(λ), Kw-t, KwSN and some other important distributions as special cases of its parameters. We studied both distributions in detail by deriving some mathematical properties, moments, order statistics and parameter estimation using the MLE and the L-moments approach. We provided applications using simulated data and some well-known data sets to demonstrate the applicability of the proposed models by comparing them with some of their sub-models using the Akaike information criterion (AIC) and the Schwarz information criterion (SIC). Based on our studies we conclude that the BST and the KwST distributions are very promising for fitting skewed and heavy tailed data, as they sometimes fit data that cannot be fitted by the skew t distribution. Further, we can make the following recommendations about the BST and the KwST distributions.

1. We developed the BST and the KwST for the univariate case of the skew t distribution, and we recommend applying our techniques to introduce new extensions of the multivariate skew t distribution.

2. The special cases of our densities outlined in Proposition 2.3.1 parts (c) and (e) and Proposition 3.3.1 parts (c) and (e) need to be kept in mind when computing the moments, as the moments of the Cauchy and skew Cauchy distributions do not exist.

3. The special cases of our densities outlined in Proposition 2.3.2 and Proposition 3.3.4 need to be kept in mind when performing parameter estimation or model comparison. If the fitted values of the BST or KwST parameters coincide with one of the special cases, then the special case needs to be used.

4. The BST and the KwST distributions significantly outperform the KwSN distribution for skewed and heavy tailed data, as shown in the illustrative examples in Chapters 2 and 3. Hence the BST and the KwST distributions are recommended for analyzing data with rare events, such as financial data, as they perform very well in detecting such rare events or outliers.

5. The likelihood ratio (LR) statistic can be used for testing the goodness of fit of the KwST distribution and for comparing this distribution with some of its special sub-models. We can compute the maximum values of the unrestricted and restricted log-likelihoods to construct LR statistics for testing some sub-models of the KwST and the sn distribution. For example, we may use the LR statistic to check whether the fit using the KwST distribution is statistically superior to a fit using the skew t distribution str(λ) or the Student t distribution for a given data set. In any case, hypothesis tests of the type H0 : ψ = ψ0 versus H1 : ψ ≠ ψ0, where ψ is a vector formed with some components of ξ and ψ0 is a specified vector, can be performed using the LR statistic. For example, the test of H0 : a = b = 1 versus H1 : H0 is not true is equivalent to comparing the KwST distribution with the str(λ) distribution, and the LR statistic reduces to

w = 2[l(λ̂, r̂, â, b̂) − l(λ̃, r̃, 1, 1)],

where λ̂, r̂, â, b̂ are the MLEs under H1 and λ̃, r̃ are the estimates under H0.
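Under H0 : a = b = 1, w is asymptotically chi-square with 2 degrees of freedom (one per restricted parameter). A Python sketch of the resulting test (the two maximized log-likelihood values are hypothetical, for illustration only):

```python
from scipy.stats import chi2

# hypothetical maximized log-likelihoods under H1 (KwST) and H0 (skew t)
l_unrestricted = -612.3
l_restricted = -615.9
w = 2.0 * (l_unrestricted - l_restricted)   # LR statistic
p_value = chi2.sf(w, df=2)                  # reject H0 for small p
```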

After proposing the KwST distribution, we discussed the problem of finite mixtures of KwST distributions, denoted KwST.mix, which accommodates both asymmetry and heavy tails jointly and thus allows analyzing data in a wide variety of combinations. We considered the mixture of two components for simplicity and provided the probability density function of this mixture. In a flexible complete-data framework we presented an EM-type algorithm for ML estimation. We obtained the Fisher information matrix to assess the standard errors of our ML estimates. Using simulated data we compared the proposed model with the GLD mixture model by using the Kullback-Leibler (KL) distance and the overlapping coefficient (δ). Simulated and real data sets were used to demonstrate our approach and to show that the KwST.mix model performs better than the other competitors. Based on our studies we can make the following recommendations about the KwST.mix distribution.

1. In our study we only considered the mixture of two components (k = 2) of KwST distributions. We recommend studying the case of finite (k) KwST mixture models.

2. One common criticism of the EM algorithm is its quite slow convergence. McLachlan and Peel (2004) discussed several methods to accelerate the EM algorithm in the context of mixture models, such as the expectation conditional maximization (ECME) algorithm of Liu and Rubin (1994) and the alternating expectation conditional maximization (AECM) algorithm of Meng and Van Dyk (1997), among others. Our EM algorithm took a very long time to converge, and in some examples we had to wait for more than two days to get the parameter estimation results. So it is worthwhile to carry out one of the suggested methods in the context of the KwST.mix model.

3. Due to recent advances in computational technology, it is highly recommended to carry out a Bayesian treatment via Markov chain Monte Carlo (MCMC) sampling in the case of the KwST.mix model.

BIBLIOGRAPHY

Ahn, S., Kim, J. H., and Ramaswami, V. (2012). A new class of models for heavy tailed distribu- tions in finance and insurance risk. Insurance: Mathematics and Economics, 51(1):43–52.

Arellano-Valle, R. B. and Genton, M. G. (2005). On fundamental skew distributions. Journal of Multivariate Analysis, 96(1):93–116.

Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, pages 171–178.

Azzalini, A. and Azzalini, M. A. (2016). Package sn.

Azzalini, A. and Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):367–389.

Azzalini, A. and Capitanio, A. (2014). The Skew-Normal and Related Families, volume 3. Cam- bridge University Press.

Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83(4):715–726.

Basford, K., Greenway, D., McLachlan, G., and Peel, D. (1997). Standard errors of fitted compo- nent means of normal mixtures. Computational Statistics, 12(1):1–18.

Basso, R. M., Lachos, V. H., Cabral, C. R. B., and Ghosh, P. (2010). Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics & Data Analy- sis, 54(12):2926–2941.

Cooray, K. and Ananda, M. M. (2005). Modeling actuarial data with a composite lognormal-pareto model. Scandinavian Actuarial Journal, 2005(5):321–334.

Cordeiro, G. M. and Bager, R. d. S. (2015). Moments for some Kumaraswamy generalized distributions. Communications in Statistics-Theory and Methods, 44(13):2720–2737.

Cordeiro, G. M. and de Castro, M. (2011). A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81(7):883–898.

Cordeiro, G. M., Nadarajah, S., et al. (2011). Closed-form expressions for moments of a class of beta generalized distributions. Brazilian Journal of Probability and Statistics, 25(1):14–33.

Cordeiro, G. M., Ortega, E. M., and Nadarajah, S. (2010). The Kumaraswamy Weibull distribution with application to failure data. Journal of the Franklin Institute, 347(8):1399–1429.

Dreier, I. and Kotz, S. (2002). A note on the characteristic function of the t-distribution. Statistics & Probability Letters, 57(3):221–224.

Eling, M. (2012). Fitting insurance claims to skewed distributions: Are the skew-normal and skew-student good models? Insurance: Mathematics and Economics, 51(2):239–248.

Eugene, N., Lee, C., and Famoye, F. (2002). Beta-normal distribution and its applications. Com- munications in Statistics-Theory and Methods, 31(4):497–512.

Farias, R. B., Montoril, M. H., and Andrade, J. A. (2016). Bayesian inference for extreme quantiles of heavy tailed distributions. Statistics & Probability Letters, 113:103–107.

Frees, E. W. and Valdez, E. A. (1998). Understanding relationships using copulas. North American Actuarial Journal, 2(1):1–25.

Gupta, A. K. (2003). Multivariate skew t-distribution. Statistics: A Journal of Theoretical and Applied Statistics, 37(4):359–363.

Gupta, A. K. and Nadarajah, S. (2004). Handbook of Beta Distribution and its Applications. CRC Press.

Hansen, B. E. (1994). Autoregressive conditional density estimation. International Economic Review, 35:705–730.

Hasan, A. M. (2013). A Study of Non-Central Skew t Distributions and Their Applications in Data Analysis and Change Point Detection. PhD thesis, Bowling Green State University.

Hosking, J. R. (1990). L-moments: analysis and estimation of distributions using linear combi- nations of order statistics. Journal of the Royal Statistical Society. Series B (Methodological), pages 105–124.

Huang, W.-J. and Chen, Y.-H. (2006). Quadratic forms of multivariate skew normal-symmetric distributions. Statistics & Probability Letters, 76(9):871–879.

Ifram, A. F. (1970). On the characteristic functions of the F and t distributions. Sankhyā: The Indian Journal of Statistics, Series A, pages 350–352.

Jones, M. (2004). Families of distributions arising from distributions of order statistics. Test, 13(1):1–43.

Jones, M. (2009). Kumaraswamy's distribution: A beta-type distribution with some tractability advantages. Statistical Methodology, 6(1):70–81.

Jones, M. and Faddy, M. (2003). A skew extension of the t-distribution, with applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):159–174.

Karian, Z. A. and Dudewicz, E. J. (2011). Handbook of Fitting Statistical Distributions with R. CRC Press Boca Raton.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathemat- ical Statistics, 22(1):79–86.

Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88.

Lin, T. I., Lee, J. C., and Hsieh, W. J. (2007a). Robust mixture modeling using the skew t distribution. Statistics and Computing, 17(2):81–92.

Lin, T. I., Lee, J. C., and Yen, S. Y. (2007b). Finite mixture modelling using the skew normal distribution. Statistica Sinica, pages 909–927.

Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, pages 633–648.

Mameli, V. (2015). The Kumaraswamy skew-normal distribution. Statistics & Probability Letters, 104:75–81.

Mameli, V. and Musio, M. (2013). A generalization of the skew-normal distribution: the beta skew-normal. Communications in Statistics-Theory and Methods, 42(12):2229–2244.

McLachlan, G. and Peel, D. (2004). Finite Mixture Models. John Wiley & Sons.

McNeil, A. J. (1997). Estimating the tails of loss severity distributions using extreme value theory. ASTIN Bulletin, 27(01):117–137.

Meng, X.-L. and Van Dyk, D. (1997). The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):511–567.

Mitra, S. (1978). Recursive formula for the characteristic function of Student t distributions for odd degrees of freedom. Manuscript, Pennsylvania State University, State College.

Mood, A. M., Graybill, F. A., and Boes, D. C. (1974). Introduction to the Theory of Statistics, 3rd edition. McGraw-Hill.

Nadarajah, S., Cordeiro, G. M., and Ortega, E. M. (2012). General results for the Kumaraswamy-G distribution. Journal of Statistical Computation and Simulation, 82(7):951–979.

Nadarajah, S. and Eljabri, S. (2013). The Kumaraswamy GP distribution. Journal of Data Science, 11(4):739–766.

Ning, W., Gao, Y., and Dudewicz, E. J. (2008). Fitting mixture distributions using generalized lambda distributions and comparison with normal mixtures. American Journal of Mathematical and Management Sciences, 28(1-2):81–99.

Papastathopoulos, I. and Tawn, J. A. (2013). Extended generalised pareto models for tail estima- tion. Journal of Statistical Planning and Inference, 143(1):131–143.

Pearson, K. (1894). Mathematical contributions to the theory of evolution. II. Skew variation in homogeneous material. Proceedings of the Royal Society of London, 57(340-346):257–260.

Pestana, D. (1977). Note on a paper of Ifram. Sankhyā Ser. A, 39:396–397.

Psarakis, S. and Panaretos, J. (1990). The folded t distribution. Communications in Statistics-Theory and Methods, 19(7):2717–2734.

Ramberg, J. S. and Schmeiser, B. W. (1974). An approximate method for generating asymmetric random variables. Communications of the ACM, 17(2):78–82.

Rêgo, G. and Nadarajah, S. (2011). On some properties of beta normal distribution. Communications in Statistics-Theory and Methods, 41(20):3722–3738.

Resnick, S. I. (1997). Discussion of the Danish data on large fire insurance losses. ASTIN Bulletin, 27(01):139–151.

Roberts, C. (1966). A correlation model useful in the study of twins. Journal of the American Statistical Association, 61(316):1184–1190.

Rohatgi, V. K. and Ehsanes Saleh, A. M. (1988). A class of distributions connected to order statistics with nonintegral sample size. Communications in Statistics-Theory and Methods, 17(6):2005–2012.

Shafiei, S. and Doostparast, M. (2014). Balakrishnan skew-t distribution and associated statistical characteristics. Communications in Statistics-Theory and Methods, 43(19):4109–4122.

Silva, G. O., Ortega, E. M., and Cordeiro, G. M. (2010). The beta modified Weibull distribution. Lifetime Data Analysis, 16(3):409–430.

Thukral, A. K. (2014). Factorials of real negative and imaginary numbers-a new perspective. SpringerPlus, 3(1):658.

Weisberg, S. (2005). Applied , volume 528. John Wiley & Sons.

Welchance, J. (2016). Estimation and the Kumaraswamy Generalized . PhD thesis, Tennessee Technological University.

APPENDIX A SELECTED R PROGRAMS

• Simulation of a KwST random sample using the acceptance-rejection method proposed by Nadarajah (2011).

library(sn)

# function to compute the pdf of the Kumaraswamy skew t distribution
pdf.kwst <- function(x, mu, sigma, alpha, r, a, b) {
  y <- cdf.st(x, xi = mu, omega = sigma, alpha = alpha, nu = r)
  return(a * b * dst(x, xi = mu, omega = sigma, alpha = alpha, nu = r) *
           y^(a - 1) * (1 - y^a)^(b - 1))
}

# function to compute the cdf of the skew t distribution
cdf.st <- function(x, xi, omega, alpha, nu) {
  if (nu == Inf)
    return(psn(x, xi, omega, alpha))
  if (nu == 1)
    return(psc(x, xi, omega, alpha))
  int.nu <- (round(nu) == nu)
  ok <- !(is.na(x) | (x == Inf) | (x == -Inf))
  zz <- log(abs(x - xi)) - log(omega)
  z <- (sign(x - xi) * exp(zz))[ok]
  if (abs(alpha) == Inf) {
    z0 <- replace(z, alpha * z < 0, 0)
    p <- pf(z0^2, 1, nu)
    return(if (alpha > 0) p else (1 - p))
  }
  fp <- function(v, alpha, nu, t.value)
    psn(sqrt(v) * t.value, 0, 1, alpha) * dchisq(v * nu, nu) * nu
  p <- numeric(length(z))
  for (i in seq_len(length(z))) {
    if (abs(z[i]) == Inf) {
      p[i] <- (1 + sign(z[i])) / 2
    } else {
      upper <- 10 + 50 / nu   # kept from the original code; not used below
      p[i] <- integrate(dst, -Inf, z[i], dp = c(0, 1, alpha, nu),
                        stop.on.error = FALSE)$value
    }
  }
  pr <- rep(NA, length(x))
  pr[x == Inf] <- 1
  pr[x == -Inf] <- 0
  pr[ok] <- p
  return(pmax(0, pmin(1, pr)))
}

# function to simulate KwST using the acceptance-rejection method
KwST.RejectionSampling <- function(n, alfa, r, a, b) {
  RN <- NULL
  for (i in 1:n) {
    OK <- 0
    while (OK < 1) {
      x <- rst(1, xi = 0, omega = 1, alpha = alfa, nu = r)
      M <- (a^b * b * (a - 1)^(1 - (1 / a)) * (1 - b)^(b - 1)) /
        (a * b - 1)^((b - 1) / a)
      u <- runif(1, min = 0, max = 1)
      y <- u * M * dst(x, xi = 0, omega = 1, alpha = alfa, nu = r)
      # standard case xi = 0, omega = 1 to match the proposal density
      if (y <= pdf.kwst(x, 0, 1, alfa, r, a, b)) {
        OK <- 1
        RN <- c(RN, x)
      }
    }
  }
  return(RN)
}

rs <- KwST.RejectionSampling(500, 1, 2, 2, 4)
hist(rs, breaks = 50, prob = TRUE, col = "gray84",
     main = " ", xlab = "x")

• The KwST L-moments parameter estimation for nidd.annual data.

library(evir)
library(sn)
library(fBasics)
library(bssn)
library(lmomco)

data("nidd.annual")
nidd.annual <- nidd.annual / 10
n.obs <- length(nidd.annual)
summary(nidd.annual)
sd(nidd.annual)
skewness(nidd.annual)
kurtosis(nidd.annual)

# function to compute the cdf of the skew t distribution
# this function is based on the built-in (pst) function, method 2
cdf.st <- function(x, xi, omega, alpha, nu) {
  if (nu == Inf)
    return(psn(x, xi, omega, alpha))
  if (nu == 1)
    return(psc(x, xi, omega, alpha))
  int.nu <- (round(nu) == nu)
  ok <- !(is.na(x) | (x == Inf) | (x == -Inf))
  zz <- log(abs(x - xi)) - log(omega)
  z <- (sign(x - xi) * exp(zz))[ok]
  if (abs(alpha) == Inf) {
    z0 <- replace(z, alpha * z < 0, 0)
    p <- pf(z0^2, 1, nu)
    return(if (alpha > 0) p else (1 - p))
  }
  fp <- function(v, alpha, nu, t.value)
    psn(sqrt(v) * t.value, 0, 1, alpha) * dchisq(v * nu, nu) * nu
  p <- numeric(length(z))
  for (i in seq_len(length(z))) {
    if (abs(z[i]) == Inf) {
      p[i] <- (1 + sign(z[i])) / 2
    } else {
      p[i] <- integrate(dst, -Inf, z[i], dp = c(0, 1, alpha, nu),
                        stop.on.error = FALSE)$value
    }
  }
  pr <- rep(NA, length(x))
  pr[x == Inf] <- 1
  pr[x == -Inf] <- 0
  pr[ok] <- p
  return(pmax(0, pmin(1, pr)))
}

# function to compute the mean of a KwST random variable
m <- function(a, b, mu, sigma, alpha, r) {
  gi <- function(x) {
    x * a * b * dst(x, xi = mu, omega = sigma, alpha = alpha, nu = r) *
      cdf.st(x, xi = mu, omega = sigma, alpha = alpha, nu = r)^(a - 1) *
      (1 - cdf.st(x, xi = mu, omega = sigma, alpha = alpha, nu = r)^a)^(b - 1)
  }
  return(integrate(gi, -9999, 9999, stop.on.error = FALSE)$value)
}

# function to compute the first seven theoretical L-moment quantities
# of a KwST random variable
theoretical <- function(a, b, mu, sigma, lambda, r) {
  x <- val <- l <- numeric(7); t <- numeric(5)
  for (i in 1:7) { x[i] <- m(a, i * b, mu, sigma, lambda, r) }
  l[1] <- x[1]                             # L-mean
  l[2] <- x[1] - x[2]                      # L-variance
  l[3] <- x[1] - 3 * x[2] + 2 * x[3]
  l[4] <- x[1] - 6 * x[2] + 10 * x[3] - 5 * x[4]
  l[5] <- x[1] - 10 * x[2] + 30 * x[3] - 35 * x[4] + 14 * x[5]
  l[6] <- x[1] - 15 * x[2] + 70 * x[3] - 140 * x[4] + 126 * x[5] - 42 * x[6]
  l[7] <- x[1] - 21 * x[2] + 140 * x[3] - 420 * x[4] + 630 * x[5] -
    462 * x[6] + 132 * x[7]
  t[1] <- l[3] / l[2]                      # L-skewness
  t[2] <- l[4] / l[2]                      # L-kurtosis
  t[3] <- l[5] / l[2]
  t[4] <- l[6] / l[2]
  t[5] <- l[7] / l[2]
  val[1] <- l[1]; val[2] <- l[2]
  val[3] <- t[1]; val[4] <- t[2]
  val[5] <- t[3]; val[6] <- t[4]
  val[7] <- t[5]
  return(val)
}

# function to compute the sample L-moments by editing the
# built-in lmoms(x) function
slm <- function(x, nmom = 7, no.stop = FALSE) {
  n <- length(x)
  if (nmom > n) {
    if (no.stop)
      return(NULL)
    stop("More L-moments requested by parameter 'nmom' than
          data points available in 'x'")
  }
  if (length(unique(x)) == 1) {
    if (no.stop)
      return(NULL)
    stop("all values are equal -- L-moments can not be computed")
  }
  z <- TLmoms(x, nmom = nmom)
  z$source <- "lmoms"
  return(z)
}

# function to compute the first seven sample L-moment quantities
sample.lmom <- function(x) {
  builtin.slmom <- sl <- numeric(7)
  builtin.slmom <- slm(x)$lambdas
  sl[1] <- round(builtin.slmom[1], digits = 5)
  sl[2] <- round(builtin.slmom[2], digits = 5)
  sl[3] <- round(builtin.slmom[3] / builtin.slmom[2], digits = 5)
  sl[4] <- round(builtin.slmom[4] / builtin.slmom[2], digits = 5)
  sl[5] <- round(builtin.slmom[5] / builtin.slmom[2], digits = 5)
  sl[6] <- round(builtin.slmom[6] / builtin.slmom[2], digits = 5)
  sl[7] <- round(builtin.slmom[7] / builtin.slmom[2], digits = 5)
  return(sl)
}
sample.lmom(nidd.annual)

# function to compute the combined square error between the
# theoretical and the sample L-moments
cse <- function(a, b, mu, sigma, lambda, r) {
  L <- numeric(7)
  L <- theoretical(a, b, mu, sigma, lambda, r)
  return((L[1] - 13.66689)^2 + (L[2] - 3.34307)^2 +
           (L[3] - 0.25353)^2 + (L[4] - 0.09269)^2 +
           (L[5] - 0.05837)^2 + (L[6] - 0.07533)^2 +
           (L[7] - 0.00706)^2)
}

# function to minimize the combined square error of the data
mini.cse <- function(y) {
  a <- y[1]; b <- y[2]; mu <- y[3]; sigma <- y[4]; lambda <- y[5]; r <- y[6]
  return(cse(a, b, mu, sigma, lambda, r))
}

fit.lmom <- optim(c(1, 5, 10, 5, 2, 2), mini.cse, control = list(trace = 5))
fit.lmom

# function to compute the pdf of the KwST
pdf.kwst <- function(x, mu, sigma, lambda, r, a, b) {
  y <- cdf.st(x, xi = mu, omega = sigma, alpha = lambda, nu = r)
  g <- a * b * dst(x, xi = mu, omega = sigma, alpha = lambda, nu = r) *
    y^(a - 1) * (1 - y^a)^(b - 1)
  return(g)
}

# function to compute the negative log-likelihood for the KwST
LL.kwst <- function(y, data1) {
  mu <- y[1]; sigma <- y[2]; alpha <- y[3]; r <- y[4]; a <- y[5]; b <- y[6]
  l <- pdf.kwst(data1, mu, sigma, alpha, r, a, b)
  return(-sum(log(l)))
}

## To compute AIC and SIC (reordering the L-moment estimates into
## the parameter order expected by LL.kwst):
y <- fit.lmom$par[c(3, 4, 5, 6, 1, 2)]
value <- LL.kwst(y, nidd.annual)
k <- 7
AIC <- 2 * value + 2 * k; AIC
SIC <- 2 * value + k * log(n.obs); SIC

• The EM parameter estimation for the mixture of two KwST distributions for the bmi data with fixed degrees of freedom r1 = r2 = r.

library(sn)

library(mixsmsn) # for data

library(mvtnorm) # is needed for the above library

library(fBasics) #to compute skewness and Kourtosis

d a t a ( bmi )

d=as.vector(as.matrix(bmi))

summary ( d )

var(d); sd(d)

skewness(d); kurtosis(d, method = ”moment”)

# function to compute the pdf of kumaraswamy−skew t distribution

pdf.kwst= function(x, mu, sigma, alpha, r, a, b){

y<− cdf.st(x, xi=mu, omega=sigma, alpha=alpha , nu=r)

r e t u r n ( a∗b∗ dst(x, xi=mu, omega=sigma, alpha=alpha , nu=r)∗

( y ˆ ( a −1)∗(1−y ˆ a ) ˆ ( b −1)))

}

# function to compute the cdf of skew t distribution

cdf.st=function (x, xi, omega, alpha, nu)

{

if (nu == Inf)

return(psn(x, xi, omega, alpha))

i f ( nu == 1)

return(psc(x, xi, omega, alpha))

i n t . nu <− (round(nu) == nu)

ok <− ! ( i s . na ( x ) | ( x == I n f ) | ( x == −I n f ) ) 115 zz <− l o g ( abs ( x − x i )) − l o g ( omega )

z <− ( s i g n ( x−x i ) ∗ exp(zz))[ok]

if (abs(alpha) == Inf) {

z0 <− replace(z, alpha ∗ z < 0 , 0)

p <− pf(z0ˆ2, 1, nu)

return(if (alpha > 0) p e l s e (1 − p ) )

}

fp <− function(v, alpha, nu, t.value) psn(sqrt(v) ∗ t . value ,

0 , 1 , a l p h a ) ∗ d c h i s q ( v ∗ nu , nu ) ∗ nu

p <− numeric(length(z))

f o r ( i i n s e q len(length(z))) {

if (abs(z[i]) == Inf) {

p [ i ] <− (1 + sign(z[i]))/2 }

e l s e {

upper <− 10 + 5 0 / nu

p [ i ] <− integrate(dst , −Inf, z[i], dp = c(0, 1, alpha, nu),

stop.on. error = FALSE)$value

}

}

pr <− rep(NA, length(x))

pr [ x == I n f ] <− 1

pr [ x == −I n f ] <− 0

pr [ ok ] <− p

return(pmax(0, pmin(1, pr)))

}

# the conditional expectation of zij given the observed data

tau.ij= function(y){ 116 a<− b<− mu<− sigma<− lambda<− p<− s<− numeric ( 2 )

a[1]=y[1]; b[1]=y[2]; mu[1]=y[3]; sigma[1]=y[4]; lambda[1]=y[5]; r=y[6]

a[2]=y[7]; b[2]=y[8]; mu[2]=y[9]; sigma[2]=y[10]; lambda[2]=y[11]

p[1]= y[12]; p[2]=1 −p [ 1 ] ; s1<− s2<− t 1 <− t2<− numeric ( n )

s1= p [ 1 ] ∗ pdf.kwst(data1 ,mu[1],sigma[1],lambda[1],r ,a[1],b[1])

s2= p [ 2 ] ∗ pdf.kwst(data1 ,mu[2],sigma[2],lambda[2],r ,a[2],b[2])

t1<− s1 / ( s1+s2 )

t2<− s2 / ( s1+s2 )

newList <− list(”tau.1j” = t1, ”tau.2j”= t2)

return(newList)

}

# function to compute pi.hat

pi.hat= function(y,i){

i f ( i ==1)

return(sum(tau.ij(y)$tau.1j) / n)

e l s e

return(sum(tau.ij(y)$tau.2j) / n)

}

# the log likelihood function for the M−s t e p

M.step= function(y){

a<− b<− mu<− sigma<− lambda<− numeric ( 2 )

a[1]=y[1]; b[1]=y[2]; mu[1]=y[3]; sigma[1]=y[4] 117 lambda[1]=y[5]; r=y[6]

a[2]=y[7]; b[2]=y[8]; mu[2]=y[9]; sigma[2]=y[10]

lambda[2]=y[11]

y[12]= pi.hat(y,1); g<− numeric ( n )

g= tau.ij(y)$tau.1j ∗ log(pdf.kwst(data1 ,mu[1],sigma[1],lambda[1],

r,a[1] ,b[1]))+

tau. ij(y)$tau.2j ∗ log(pdf.kwst(data1 ,mu[2],sigma[2],lambda[2],

r,a[2],b[2]))

r e t u r n (−sum ( g ) )

}

KwST.mix= function(x, y){

a<− b<− mu<− sigma<− lambda<− p<− numeric ( 2 )

a[1]=y[1]; b[1]=y[2]; mu[1]=y[3]; sigma[1]=y[4]

lambda[1]=y[5]; r=y[6]

a[2]=y[7]; b[2]=y[8]; mu[2]=y[9]; sigma[2]=y[10]

lambda[2]=y[11]

p[1]=y[12]; p[2]=1 −p [ 1 ]

g=p [ 1 ] ∗ pdf.kwst(x, mu[1], sigma[1], lambda[1], r, a[1], b[1]) +

p [ 2 ] ∗ pdf.kwst(x, mu[2], sigma[2], lambda[2], r, a[2], b[2])

r e t u r n ( g )

}

# the log likelihood function using the estimated parameters

LL = function(y){

l=KwST.mix(d, y)

r e t u r n (− sum(log(l ))) 118 }

i n i t i a l <− c( 1,1,19,3,.25,8,1,1,29,6,1,8,.5); p0<− 0 . 5

# the stopping rule and estimation
iteration <- 1000
rate <- .0001
counter <- matrix(0, iteration, 14)
counter[1, 1:12] = initial
counter[1, 14] = 100
i = 1
while (counter[i, 14] >= rate){
  i = i + 1
  counter[i, 12] <- pi.hat(counter[i - 1, 1:12], 1)
  fit.mix = optim(c(counter[i - 1, 1:11], counter[i, 12]), M.step)
  counter[i, 1:11] <- fit.mix$par[1:11]
  counter[i, 13] <- LL(counter[i, 1:12])
  counter[i, 14] = abs(counter[i, 13] - counter[i - 1, 13])
  print(i)
  if (i == iteration) break
}
print(counter)

# this code finds where the smallest successive difference occurs and
## prints out the estimation result
result <- function(counter){
  min.diff <- min(counter[2:iteration, 14])
  est <- numeric(14); i <- 0; value <- NA
  for (j in 2:iteration){
    if (counter[j, 14] <= min.diff){
      est <- counter[j, ]
      value <- counter[j, 13]
      i <- j
    }
  }
  newlist <- list(est = est, i = i, value = value)
  return(newlist)
}

# curve fitting
h = result(counter)$est
x = seq(min(d) - .5, max(d), by = 0.1)
m = length(x)
yy = numeric(m)
for (i in 1:m){
  yy[i] = KwST.mix(x[i], h)
}
lines(x, yy, lwd = 1, lty = 1, col = "red")

# To compute the AIC and SIC
value = result(counter)$value
AIC.mix.kwst = 2 * value + 2 * 12
SIC.mix.kwst = 2 * value + 12 * log(n)
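The two criteria above follow the usual definitions AIC = 2k - 2 log L and SIC = k log(n) - 2 log L, with k = 12 free parameters and `value` holding the negative log-likelihood returned by LL(). A small Python sketch of the same arithmetic (the 250.0 below is a hypothetical negative log-likelihood, not a value computed from the data):

```python
import math

def aic(neg_loglik, k):
    # AIC = 2k - 2*logL, where neg_loglik = -logL
    return 2 * neg_loglik + 2 * k

def sic(neg_loglik, k, n):
    # SIC (BIC) = k*log(n) - 2*logL
    return 2 * neg_loglik + k * math.log(n)

a = aic(250.0, 12)       # hypothetical neg. log-likelihood, 12 parameters
s = sic(250.0, 12, 100)  # hypothetical sample size n = 100
```

Both criteria penalize the same fit term; SIC penalizes model size more heavily once n > e^2, favoring the more parsimonious mixture.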

# ST.mix
library(mixsmsn)  # provides smsn.mix() and mix.lines()
St.analysis <- smsn.mix(d, nu = 3, mu = c(19, 29), sigma2 = c(3, 6),
                        shape = c(.25, 1), pii = .5, g = 2, get.init = TRUE,
                        criteria = TRUE, group = FALSE, family = "Skew.t",
                        error = 0.2096789, iter.max = 50, calc.im = FALSE,
                        obs.prob = FALSE, kmeans.param = NULL)
St.analysis

# SN.mix
Snorm.analysis <- smsn.mix(d, nu = 3, mu = c(19, 29), sigma2 = c(3, 6),
                           shape = c(.25, 1), pii = .5, g = 2, get.init = TRUE,
                           criteria = TRUE, group = FALSE, family = "Skew.normal",
                           error = 0.2096789, iter.max = 50, calc.im = FALSE,
                           obs.prob = FALSE, kmeans.param = NULL)
Snorm.analysis

## curve fitting
hist(d, prob = TRUE, breaks = 70,
     main = "Fitted Mixture Densities to bmi Data")
lines(x, yy, type = "l", lwd = 1, lty = 1, col = "red")
mix.lines(data1, St.analysis, col = "green")
mix.lines(data1, Snorm.analysis, col = "blue")
colors <- c("red", "green", "blue")
labels <- c("Mix.KwST", "Mix.ST", "Mix.SN")
legend("topright", inset = .05, title = "Distributions",
       legend = labels, lwd = 1, lty = 1, col = colors)

• Computing the overlapping coefficient between two densities.

# the KwST density, built from the skew t density dst() and
# distribution function pst() of the sn package
library(sn)
pdf.kwst <- function(x, mu, sigma, alpha, r, a, b){
  y = pst(x, xi = mu, omega = sigma, alpha = alpha, nu = r, method = 2)
  return(a * b * dst(x, xi = mu, omega = sigma, alpha = alpha, nu = r)
         * (y)^(a - 1) * (1 - (y)^a)^(b - 1))
}

# This is a function to compute the pdf of the mixture of KwST

KwST.mix1 <- function(x, y){
  a <- b <- mu <- sigma <- lambda <- p <- numeric(2)
  a[1] = y[1]; b[1] = y[2]; mu[1] = y[3]; sigma[1] = y[4]
  lambda[1] = y[5]; r = y[6]
  a[2] = y[7]; b[2] = y[8]; mu[2] = y[9]; sigma[2] = y[10]
  lambda[2] = y[11]
  p[1] = y[12]; p[2] = 1 - p[1]
  g = p[1] * pdf.kwst(x, mu[1], sigma[1], lambda[1], r, a[1], b[1]) +
      p[2] * pdf.kwst(x, mu[2], sigma[2], lambda[2], r, a[2], b[2])
  return(g)
}
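The construction in pdf.kwst is the general Kumaraswamy-generated density g(x) = a b f(x) F(x)^(a-1) (1 - F(x)^a)^(b-1), which is a proper density for any base cdf F. A language-neutral numerical check in Python, using a standard normal base in place of the skew t (grid limits and step size are arbitrary choices for the illustration):

```python
import math

def base_cdf(x):
    # standard normal CDF as the base distribution F
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def base_pdf(x):
    # standard normal density f
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def kw_pdf(x, a, b):
    # Kumaraswamy-generated density: a*b*f(x)*F(x)^(a-1)*(1-F(x)^a)^(b-1)
    F = base_cdf(x)
    return a * b * base_pdf(x) * F ** (a - 1) * (1 - F ** a) ** (b - 1)

# verify numerically that the density integrates to ~1 on a wide grid
a, b = 2.0, 3.0
xs = [-10 + i * 0.001 for i in range(20001)]
total = sum(kw_pdf(x, a, b) for x in xs) * 0.001
```

The same check applied with the skew t base recovers pdf.kwst above; a = b = 1 reduces g to the base density itself.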

# This is a function to compute the pdf of the mixture of GLD
library(GLDEX)
gld.mix <- function(x, y){
  lambda1 <- lambda2 <- lambda3 <- lambda4 <- p <- numeric(2)
  lambda1[1] = y[1]; lambda2[1] = y[2]; lambda3[1] = y[3]; lambda4[1] = y[4]
  lambda1[2] = y[5]; lambda2[2] = y[6]; lambda3[2] = y[7]; lambda4[2] = y[8]
  p[1] = y[9]; p[2] = 1 - p[1]
  g = p[1] * dgl(x, lambda1[1], lambda2[1], lambda3[1], lambda4[1],
                 param = "fmkl", inverse.eps = 1e-08, max.iterations = 500) +
      p[2] * dgl(x, lambda1[2], lambda2[2], lambda3[2], lambda4[2],
                 param = "fmkl", inverse.eps = 1e-08, max.iterations = 500)
  return(g)
}

## To compute the overlapping coefficient
min.f1f2 <- function(x, y1, y2){
  f1 <- gld.mix(x, y1)
  f2 <- KwST.mix1(x, y2)
  pmin(f1, f2)
}
y1 <- c(0, 1, 2, 3, 2, 1, 0, 1, .6)  # true values
y2 <- c(.738, 1.517, .249, .276, -1.246, 5.643, 1.511, 1.784, 2.666,
        .916, -1.39, .618)  # estimated values
print(paste("OVL:", integrate(min.f1f2, -Inf, Inf, y1 = y1, y2 = y2)$value))
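The quantity computed above is the overlapping coefficient OVL = integral of min(f1(x), f2(x)) dx, which equals 1 for identical densities and shrinks toward 0 as the two densities separate. A minimal Python sketch with two normal densities in place of the fitted mixtures, where a plain Riemann sum stands in for R's integrate() (grid limits and step are arbitrary illustration choices):

```python
import math

def normal_pdf(x, mu, sigma):
    # normal density standing in for the fitted mixture densities
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def ovl(f1, f2, lo=-50.0, hi=50.0, step=0.001):
    # OVL = integral of min(f1(x), f2(x)) dx, via a plain Riemann sum
    n = int((hi - lo) / step)
    return sum(min(f1(lo + i * step), f2(lo + i * step)) for i in range(n)) * step

same = ovl(lambda x: normal_pdf(x, 0, 1), lambda x: normal_pdf(x, 0, 1))
shifted = ovl(lambda x: normal_pdf(x, 0, 1), lambda x: normal_pdf(x, 3, 1))
```

For two unit-variance normals three standard deviations apart, OVL reduces to 2*Phi(-1.5), roughly 0.13, so `shifted` quantifies the modest overlap while `same` recovers 1.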