Study of the longevity in Sardinia: an application of the Beta skew-normal regression

Valentina Mameli1, Monica Musio2, Luca Deiana3

1 Dipartimento di Scienze Statistiche, Università di Padova, 2 Dipartimento di Matematica ed Informatica, Università di , Italy 3 Dipartimento di Scienze Biomediche, Università di Sassari, Italy

E-mail for correspondence: [email protected]

Abstract: In applications, normality of the errors is a routine assumption for the linear model, but it may be unrealistic. Residuals often exhibit a non-normal shape, with a heavy right or left tail. In this work, we relax the normality assumption by letting the errors follow a Beta skew-normal distribution. The new regression model includes the skew-normal and the normal models as special cases. We apply this model to study longevity in Sardinia.

Keywords: Centenarians; Beta skew-normal; regression; longevity; Sardinia.

1 Introduction

Sardinia has been called “the Centenarian island” (Deiana and Vaupel, 2006). Indeed, it is one of the regions with the most living centenarians in the world. In recent years, several projects have been started with the objective of understanding which factors may be related to longevity in Sardinia. In particular, the AKEA project (see for example Deiana and Vaupel (2006), Poulain et al. (2004) and the references therein) focused on a census of all centenarians living in Sardinia. This project highlighted the presence on the island of geographical areas in which the phenomenon of longevity is particularly pronounced. In this work we analyse data from the AKEA project, collected and validated in two villages of Sardinia. Our aim is to address the following question: do members of families in which there are centenarians live longer on average? We do so by comparing the mean age at death of individuals belonging to a centenarian's family with that of individuals from families without centenarians. Our response variable, the age at death, is strongly asymmetric.

Recent statistical literature has seen an increasing interest in the construction of flexible parametric families of distributions whose skewness and kurtosis differ from those of the normal distribution. Azzalini (1985) defined the skew-normal (SN) distribution and studied its properties. Subsequently, Azzalini and Capitanio (1999) introduced the skew-normal regression model. Recently, Mameli and Musio (2013) proposed a new distribution, called Beta skew-normal (BSN), which generalizes the SN distribution. In this work we propose an extension of the SN regression model in which the errors follow a BSN distribution. We apply this model to study longevity in Sardinia.

The paper unfolds as follows: the BSN distribution and the BSN regression model are defined in Section 2. Section 3 is devoted to the data and the results of the statistical analysis. A brief discussion is given in Section 4.

2 Statistical model

2.1 The Beta skew-normal distribution

A random variable Z is said to have a Beta skew-normal distribution with parameters λ, a and b, written BSN(λ, a, b), if its density is given by

$$ g^{B}_{\Phi_\lambda(z)}(z; \lambda, a, b) = \frac{1}{B(a, b)} \left(\Phi_\lambda(z)\right)^{a-1} \left(1 - \Phi_\lambda(z)\right)^{b-1} \phi_\lambda(z), \qquad z \in \mathbb{R}, $$

with a > 0, b > 0, λ ∈ R. The functions $\Phi_\lambda(\cdot)$ and $\phi_\lambda(\cdot)$ are the cdf and the pdf of the skew-normal distribution, respectively, and B(a, b) is the beta function. The BSN distribution can be generalized by including location and scale parameters, which we denote by µ and σ > 0. Thus, if Z ∼ BSN(λ, a, b), then X = µ + σZ is a BSN(µ, σ, λ, a, b).
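As an illustration of the density above, the following Python sketch evaluates the BSN(λ, a, b) density using only the standard library. It is not the authors' implementation. The function names (`owens_t`, `sn_pdf`, `sn_cdf`, `bsn_pdf`) are hypothetical, and the way the skew-normal cdf is computed is an assumption of this sketch: it uses the identity Φ_λ(z) = Φ(z) − 2T(z, λ), with Owen's T function approximated by Simpson's rule.

```python
import math


def norm_pdf(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def owens_t(h, a, n=200):
    """Owen's T function T(h, a), by composite Simpson's rule on [0, a].

    n must be even. A negative `a` gives a negative step, so T(h, -a) = -T(h, a).
    """
    def f(x):
        return math.exp(-0.5 * h * h * (1.0 + x * x)) / (1.0 + x * x)

    step = a / n
    total = f(0.0) + f(a)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(k * step)
    return total * step / 3.0 / (2.0 * math.pi)


def sn_pdf(z, lam):
    """Skew-normal pdf: 2 * phi(z) * Phi(lam * z)."""
    return 2.0 * norm_pdf(z) * norm_cdf(lam * z)


def sn_cdf(z, lam):
    """Skew-normal cdf: Phi(z) - 2 * T(z, lam)."""
    return norm_cdf(z) - 2.0 * owens_t(z, lam)


def bsn_pdf(z, lam, a, b):
    """Density of BSN(lam, a, b) at z."""
    # Clamp the cdf away from 0 and 1 so the logarithms stay finite in the tails.
    v = min(max(sn_cdf(z, lam), 1e-12), 1.0 - 1e-12)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_kernel = (a - 1.0) * math.log(v) + (b - 1.0) * math.log(1.0 - v)
    return math.exp(log_kernel - log_beta) * sn_pdf(z, lam)
```

With a = b = 1 the function reduces to the skew-normal density, and with λ = 0 in addition it reduces to the standard normal density. The density of X = µ + σZ is obtained as `bsn_pdf((x - mu) / sigma, lam, a, b) / sigma`. For very large |λ| the fixed number of Simpson nodes may need to be increased.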

2.2 The Beta skew-normal regression model

The Beta skew-normal regression model is defined as

$$ y_i = x_i \beta + \sigma \varepsilon_i, \qquad i = 1, \ldots, N, $$

where β = (β_1, …, β_k) is a k × 1 vector of unknown regression parameters (k < N) and $x_i = (x_{i1}, \ldots, x_{ik})$ is the vector of k covariates for observation i. Under the assumption that each error $\varepsilon_i$ is distributed as BSN(λ, a, b), the response variable $y_i$ is distributed as BSN($x_i\beta$, σ, λ, a, b). The new model contains as special cases the skew-normal (a = b = 1) and the normal (a = b = 1, λ = 0) regression models. The log-likelihood function based on a sample of N independent observations is

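$$ l(\theta) = \sum_{i=1}^{N} \left\{ \log(t_i) + (a-1)\log(v_i) + (b-1)\log(1 - v_i) - \log\left(\sigma B(a, b)\right) \right\}, $$

where $t_i = \phi_\lambda(z_i)$ and $v_i = \Phi_\lambda(z_i)$, with $z_i = (y_i - x_i\beta)/\sigma$. The score vector U(θ), obtained by differentiating l(θ) with respect to θ = (β, σ, λ, a, b), has the following components:

$$
\begin{aligned}
U_\beta(\theta) &= \sum_{i=1}^{N} \left[ \frac{x_i \left( z_i - w_i t_i - \lambda h_i \right)}{\sigma} \right]; \\
U_\sigma(\theta) &= \sum_{i=1}^{N} \left[ \frac{z_i^2 - 1 - z_i \left( w_i t_i + \lambda h_i \right)}{\sigma} \right]; \\
U_\lambda(\theta) &= \sum_{i=1}^{N} \left( z_i h_i + w_i \frac{\partial v_i}{\partial \lambda} \right); \\
U_a(\theta) &= \sum_{i=1}^{N} \left\{ \log v_i - \left[ \psi(a) - \psi(a + b) \right] \right\}; \\
U_b(\theta) &= \sum_{i=1}^{N} \left\{ \log(1 - v_i) - \left[ \psi(b) - \psi(a + b) \right] \right\};
\end{aligned}
$$

where ψ(t) is the digamma function. Furthermore,

$$ w_i = \frac{(a-1) + (2-a-b)\,v_i}{(1 - v_i)\,v_i}, \qquad h_i = \frac{\phi(\lambda z_i)}{\Phi(\lambda z_i)}, $$

where Φ(·) and φ(·) are the cdf and the pdf of the standard normal distribution, respectively. Maximum likelihood estimates of the parameters can be found by setting the above expressions equal to zero and solving them simultaneously. Since closed-form solutions are not available, the estimates are found numerically with the R package maxLik.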
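The log-likelihood and the β component of the score can be sketched in Python as follows. This is an illustrative translation rather than the authors' R/maxLik code. The function names (`sn_cdf`, `terms`, `bsn_loglik`, `score_beta`), the toy data and the parameter values are all hypothetical. The skew-normal cdf is again approximated through Owen's T function with Simpson's rule, which is an assumption of the sketch. The last lines compare the analytic score with central finite differences of the log-likelihood.

```python
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sn_cdf(z, lam, n=200):
    """Skew-normal cdf as Phi(z) - 2*T(z, lam); Owen's T by Simpson's rule."""
    def f(x):
        return math.exp(-0.5 * z * z * (1.0 + x * x)) / (1.0 + x * x)

    step = lam / n
    total = f(0.0) + f(lam)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(k * step)
    return norm_cdf(z) - 2.0 * (total * step / 3.0 / (2.0 * math.pi))


def terms(theta, yi, xi):
    """Return (z_i, t_i, v_i); theta = [beta_1..beta_k, sigma, lam, a, b]."""
    k = len(xi)
    sigma, lam = theta[k], theta[k + 1]
    zi = (yi - sum(x * bj for x, bj in zip(xi, theta[:k]))) / sigma
    ti = 2.0 * norm_pdf(zi) * norm_cdf(lam * zi)  # skew-normal pdf at z_i
    vi = sn_cdf(zi, lam)                          # skew-normal cdf at z_i
    return zi, ti, vi


def bsn_loglik(theta, y, X):
    """Log-likelihood l(theta); no tail guarding, so v_i must lie in (0, 1)."""
    k = len(X[0])
    sigma, lam, a, b = theta[k:]
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    total = 0.0
    for yi, xi in zip(y, X):
        zi, ti, vi = terms(theta, yi, xi)
        total += (math.log(ti) + (a - 1.0) * math.log(vi)
                  + (b - 1.0) * math.log(1.0 - vi)
                  - math.log(sigma) - log_beta)
    return total


def score_beta(theta, y, X):
    """Analytic score U_beta(theta), one entry per regression coefficient."""
    k = len(X[0])
    sigma, lam, a, b = theta[k:]
    score = [0.0] * k
    for yi, xi in zip(y, X):
        zi, ti, vi = terms(theta, yi, xi)
        wi = ((a - 1.0) + (2.0 - a - b) * vi) / ((1.0 - vi) * vi)
        hi = norm_pdf(lam * zi) / norm_cdf(lam * zi)
        for j in range(k):
            score[j] += xi[j] * (zi - wi * ti - lam * hi) / sigma
    return score


# Hypothetical toy data (not the AKEA data): intercept plus one dummy covariate.
y = [68.0, 85.0, 74.0, 79.0, 81.0]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
theta = [75.0, 4.0, 8.0, -1.5, 2.0, 1.5]  # beta_0, beta_1, sigma, lambda, a, b

# Central finite differences of the log-likelihood with respect to beta.
eps = 1e-5
numeric = []
for j in range(2):
    up, down = list(theta), list(theta)
    up[j] += eps
    down[j] -= eps
    numeric.append((bsn_loglik(up, y, X) - bsn_loglik(down, y, X)) / (2.0 * eps))
analytic = score_beta(theta, y, X)
```

On this toy example `numeric` and `analytic` should agree up to finite-difference error. In practice a function such as `bsn_loglik` would be handed to a general-purpose numerical optimizer, which is the role maxLik plays in the paper.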

3 Data description and statistical analysis

Data were collected in the two villages of Ovodda and Tiana in Sardinia, located 5 km apart in the province of , in an area characterized by exceptional longevity. The data were taken from the AKEA project and cover the period from 1860 to 2009. The experimental design was as follows: we first identified the most recent centenarians who had died; then we included in the study all present and previous members of their families, as identified by the family name. In this way, 8 families were identified and compared with 7 families in which, in the same period, there were no centenarians. The centenarians used to identify their families were removed from the study. As the causes of death in children are very different in nature from those in adults, we have restricted our analysis to individuals who died in adulthood (at 30 years or more). The dataset contains 932 cases, 698 of which come from families of centenarians.

We take as response variable the age at death, which is strongly asymmetric. For each individual sampled we also know the date of birth, the sex and the year of death, which we consider as potential predictors. Since this area is characterized by an exceptional male longevity as well as a low female/male centenarian ratio (see Poulain et al. (2004)), we have excluded the variable sex from the model. As predictors we consider the year of death of each individual and the dummy variable Cent1 (which is 1 if there is at least one centenarian in the family and 0 otherwise). We then assume the following model

$$ y_i = \beta_0 + \beta_1\,\text{year death}_i + \beta_2\,\text{Cent1}_i + \sigma \varepsilon_i, \qquad i = 1, \ldots, 932, $$

where the error $\varepsilon_i$ follows a BSN(λ, a, b) density function. We use the likelihood ratio (LR) test statistic for comparing the SN model against the BSN one. Based on the value of the LR test statistic, the BSN model provides a better fit than the SN (LR = 9.604, p-value = 0.008). The estimates of the regression parameters are $\hat\beta_0 = 73.999$, $\hat\beta_1 = 8.211$ and $\hat\beta_2 = 2.394$. The estimated density function is plotted in Figure 1.

FIGURE 1. Histogram of the residuals with the BSN density superimposed.
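The reported p-value of the LR test can be checked with a short calculation. The text does not state the reference distribution, so the sketch below assumes a chi-square distribution with 2 degrees of freedom, on the grounds that the SN model is the BSN model with the two constraints a = b = 1. For 2 degrees of freedom the chi-square survival function has the closed form exp(−x/2). The function name `chi2_sf_2df` is hypothetical.

```python
import math


def chi2_sf_2df(x):
    """Survival function of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-0.5 * x)


lr_statistic = 9.604                  # LR value reported in the text
p_value = chi2_sf_2df(lr_statistic)   # assumes 2 degrees of freedom (a = b = 1)
print(round(p_value, 3))              # prints 0.008
```

Under this assumption the computed value agrees with the p-value of 0.008 given above.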

4 Discussion

In this paper we have introduced the BSN regression model and used it to study longevity in Sardinia. Our analysis shows a significant effect on longevity for persons belonging to a centenarian's family. Indeed, the effect is probably underestimated: families were identified by family name, so the mother's family is not taken into account. Even so, it seems clear that longevity has a genetic cause. This finding should motivate more detailed studies to investigate the genetic characteristics of centenarians' families.

References

Azzalini, A. (1985). A Class of Distributions Which Includes the Normal Ones. Scandinavian Journal of Statistics, 12(2), 171–178.

Azzalini, A. and Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society, Series B, 61, 579–602.

Deiana, L. and Vaupel, J. (2006). Longevity in Sardinia, The Centenarian Island. Biochimica Clinica, 30(1), ISSN 0393 3504.

Mameli, V. and Musio, M. (2013). A new generalization of the Skew-normal distribution: the Beta skew-normal. Communications in Statistics: Theory and Methods, to appear.

Poulain, M., Pes, G., Grasland, C., Carru, C., Ferrucci, L., Baggio, G., Franceschi, C., and Deiana, L. (2004). Identification of a geographic area characterized by extreme longevity in Sardinia Island: the AKEA study. Experimental Gerontology, 39, 1423–1429.