<<

Lassoing Mixtures and Bayesian Robust Estimation

by

Guan Xing

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

September, 2006

CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

_______________________________ candidate for the Ph.D. degree*.

(Signed) _______________________________
J. Sunil Rao, Ph.D., Department of Epidemiology and Biostatistics, Chair of the committee

_______________________________
Robert C. Elston, Ph.D., Department of Epidemiology and Biostatistics

_______________________________
Hemant Ishwaran, Ph.D., Cleveland Clinic Foundation

_______________________________
Joe Sedransk, Ph.D., Department of

_______________________________
Christopher C. Whalen, Ph.D., Department of Epidemiology and Biostatistics

(date) _______________

*We also certify that written approval has been obtained for any proprietary material therein.

Contents

I Lassoing Mixtures 15

1 Finite Mixture Models 16

1.1 Introduction ...... 16

1.2 NPMLE ...... 20

1.3 EM Algorithm for Fixed k ...... 22

1.4 Frequentist Methods for Choosing k ...... 23

1.4.1 Hypothesis Testing ...... 23

1.4.2 Model Selection Approaches ...... 29

1.5 Bayesian Methods ...... 32

1.5.1 Bayes Factor ...... 33

1.5.2 Distance Measures in Bayesian Model Selection ...... 36

1.5.3 Fully Bayesian Approach ...... 37

1.6 Summary ...... 38

2 Proposed Lassoing Mixture Method 40

2.1 Introduction ...... 40

2.2 Linear Model Coefficient Shrinkage and Variable Selection: Background . . 42

2.2.1 Introduction ...... 42

2.2.2 Subset Selection ...... 44

2.2.3 Shrinkage: Ridge Regression, Bridge Regression and LASSO . . . 45

2.2.4 LARS: Efficient Computation for LASSO Solutions ...... 51

2.3 Lassoing Mixtures ...... 55

2.3.1 Lassoing Mixture Algorithm ...... 55

2.3.2 Connection with Other Methods ...... 58

2.4 Some Theoretical Results ...... 60

2.4.1 Lassoing mixtures with unknown number of components is a penal-

ized maximum likelihood estimator ...... 61

2.4.2 Consistency of the Final Estimated Mixture and Order Selection . . 63

3 Application Examples for Lassoing Mixture Approach 64

3.1 Simulation Studies ...... 64

3.2 Some Classic Real Data Sets ...... 70

3.3 Extended example: Analysis of proteomic mass spectroscopy data for ovar-

ian cancer classification ...... 73

4 Discussion 80

II Bayesian Robust Estimation 82

5 Literature Review for Bayesian Robust Estimation 83

5.1 Introduction ...... 83

5.1.1 Diagnostic Methods ...... 84

5.1.2 Robust Methods ...... 86

5.2 Summary ...... 92

6 Proposed Approach 93

6.1 Introduction ...... 93

6.2 Bayesian Robust Estimation Algorithm ...... 95

7 Applications of the Bayesian Robust Estimation Approach 98

7.1 Linear Regression ...... 98

7.2 Some Theoretical Results: MSE Comparison ...... 102

7.3 Generalized Linear Regression ...... 103

7.4 Density Estimation ...... 105

8 Discussion and Future Work 107

A Proof of Theorems 109

B Regularity Conditions 116

List of Figures

2.1 Bridge penalty plot ...... 47

2.2 LASSO estimation plot ...... 48

2.3 LASSO trace plot ...... 54

2.4 Histogram Plot1 ...... 57

2.5 Trace Plot1 ...... 58

2.6 Trace Plot2 ...... 59

3.1 Simulation densities ...... 66

3.2 Real Data Densities ...... 71

3.3 Real Data Densities ...... 73

3.4 Protein mass spectrometry sample plot ...... 75

3.5 PMS prediction error plot ...... 77

3.6 ...... 79

7.1 Simulated data ...... 99

7.2 StackLoss data ...... 101

7.3 Stars data ...... 103

7.4 Prostatic data ...... 105

List of Tables

3.1 Simulations 1-3 ...... 67

3.2 Simulations 4-6 ...... 68

3.3 Simulations 7-10 ...... 69

3.4 galaxy ...... 72

3.5 stamp ...... 72

3.6 stamp ...... 74

3.7 Results for 5-fold cross validation with window width 2-6 Da ...... 77

7.1 Analysis Results for the Simulated Data ...... 100

7.2 Analysis Results for the StackLoss Data ...... 102

7.3 Analysis Results for the Stars Data ...... 104

7.4 Analysis Results for the Prostatic Data ...... 106

7.5 Analysis Results for the Barnett Data ...... 106

Acknowledgements

I would like to thank my advisor, Dr. J. Sunil Rao, for his guidance in the past five years.

The whole dissertation is based on his original ideas, and he helped me through the research process with his deep insight into the field, enlightening suggestions, kindness, and patience.

Besides, he taught me how to develop an academic career and how to collaborate with other researchers. His help, both in research and in everyday life, is deeply appreciated.

Dr. Ishwaran gave many helpful comments on my dissertation proposal and helped us clarify some confusing concepts in finite mixture models. Dr. Sedransk pointed out a critical error in my earlier Bayesian model and helped me correct it. Dr. Elston allowed me to sit in the student office of his track, which provided me with great opportunities to discuss problems with other students and aroused my interest in genetic epidemiology. Dr. Whalen gave me some suggestions on the application of my research. I thank them all for serving as my committee members.

I have many friends who have been very helpful during my Ph.D study, including Simin Hu,

Chao Xing, Tao Wang, Qing Lu, Ritwik Sinha, and Moumita Sinha. I would like to give them my sincere thanks.

Finally, I would like to thank my wife, Xiangfen Liang, for her love and support.

List of Symbols

Throughout this dissertation, we use the following definitions for the symbols:

1. X denotes the random variable to be measured.

2. X denotes the design matrix in linear regression settings.

3. x denotes a particular observation.

4. n denotes the sample size.

5. i denotes the index of the observations.

6. j denotes the index of the component densities.

7. G(θ) denotes the mixing density.

8. θ : {θ1, . . . , θk} denotes the distribution parameters for the corresponding component

densities.

9. Θ denotes the parameter space.

10. π : {π1, . . . , πk} denotes the component weights for the component densities.

11. k denotes the component number of the finite mixture.

12. L denotes the likelihood function.

13. l denotes the log-likelihood function.

14. Y denotes the response variable in linear regression settings.

15. y denotes a particular observed response.

Regularity conditions used in the dissertation are included in Appendix B.

On Lassoing Mixtures and Bayesian Robust Estimation

Abstract

by

GUAN XING

This dissertation includes two parts. The first part describes a new estimation method for

finite mixture models. Unlike traditional methods, we borrow ideas from variable selection approaches in linear models. After generating a pseudo-response from a saturated mixture model and constructing predictors using candidate component densities, we transform the mixture density estimation problem into a variable selection problem in linear models. Using a variant of the LASSO constraint approach, we can do component number selection and parameter estimation simultaneously for finite mixture models. The performance of this method is illustrated with simulated data and some well-known real data sets.

In the second part, we deal with estimation from contaminated data. Traditional

Bayesian approaches use the variance-inflation model or the mean-shift model. We extend the Bayesian contaminated model to the general case without assuming a specific distribution for the potential outliers, instead using a constructed reference population to draw random samples. With the proposed latent indicator variables for each observation, we construct a

Bayesian hierarchical model and use the Gibbs sampler to draw posterior samples. The parameter inference based on the Gibbs samples is robust, and a series of simulations and classic

real data analyses indicates that our methods perform better than other approaches in linear models, generalized linear models, and density estimation.

Part I

Lassoing Mixtures

Chapter 1

Finite Mixture Models

1.1 Introduction

Mixture models have been widely used throughout the history of statistics. The earliest

application can be traced back to Pearson’s (1894) paper on the decomposition of normal

mixtures by the method of moments. Since then, mixture modelling has been applied to

many disciplines including fisheries (Macdonald et al., 1979), genetics (Ott, 1999), physics (Tanner, 1962), and several other fields (Broadben, 1966; Clark et al., 1968; Gordon & Prentice, 1977), as well as disease mapping (Schlattmann & Böhning, 1993) and meta-analysis (Böhning, 1999). For example, Smith (1961, 1963) introduced the mixture model for genetic linkage heterogeneity, based on the recombination fraction θ and the proportion of linked

families. He assumed the presence of two types of families, those with linkage (θ < 1/2) and those without linkage (θ = 1/2), and that the observed data come from these two groups

with probability π and 1 − π, respectively. In a fishing study, the fish size distribution of

a single species is modelled as a mixture of k age groups, and each group consists of the fish of a single yearly spawning (MacDonald & Pitcher, 1979). For patients suffering from the hypertensive condition known as Conn's syndrome, the major cause could be an adenoma or other illnesses. Titterington (1976) used a multivariate normal mixture model to categorize the diagnosis. More examples can be found in Titterington et al. (1985) and McLachlan and Peel (2000).

Parametrically, a mixture density has the following continuous functional form:

\[ f(x|G) = \int f(x|\theta)\, dG(\theta), \qquad (1.1) \]

where f(x|θ) is the kernel density and G(θ) is the mixing density. For example, the Student-

t distribution can be interpreted as a mixture of normal distributions with a common mean,

and the variance distributed as a scaled inverse-χ2:

\[ t_\nu(x|\mu, \sigma^2) = \int N(x|\mu, V)\, dG(V|\nu, \sigma^2). \qquad (1.2) \]

The negative binomial distribution is a continuous mixture of Poisson distributions with the

parameter λ following a Gamma(α, β) distribution.

\[ NB(x|\alpha, \beta) = \int \mathrm{Poisson}(x|\lambda)\, d\,\mathrm{Gamma}(\lambda|\alpha, \beta). \qquad (1.3) \]

When the structure of the mixing distribution is unknown, research on the nonparametric maximum likelihood estimator (NPMLE) of G(θ), which will be introduced in Section 1.2, showed that the estimate Ĝ(θ) is a step function with finitely many support points. Hence researchers prefer a semi-parametric representation of the mixture density that is more adaptable and tractable.

This can be achieved by treating G(θ) as a discrete distribution function with a finite number of support points. Let θ1, ..., θk be the support points for G(θ) and π1, ..., πk be the corresponding support probabilities. If k is finite, then a semi-parametric mixture density is of the form
\[ f(x|\Theta) = \sum_{j=1}^{k} \pi_j f(x|\theta_j), \qquad (1.4) \]
with $\theta_j \in \Theta$, $\sum_{j=1}^{k}\pi_j = 1$, and $\pi_j \ge 0$. Note that the whole model is invariant to permutation of the labels j = 1, 2, ..., k, so it is not uniquely identifiable. To avoid this unidentifiability, a unique labelling with the parameters in increasing order is usually adopted. Consistent with the methods in the literature (Chen, 1995), we define the mixing distribution as

\[ G(\theta) = \sum_{j=1}^{k} \pi_j I(\theta_j \le \theta). \qquad (1.5) \]
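To make the notation in Equations 1.4 and 1.5 concrete, the short sketch below (a minimal illustration in Python; the normal kernel, the common scale, and the function names are our own assumptions, not part of the original development) evaluates a finite mixture density and draws samples from it.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, pi, theta, sigma=1.0):
    """Evaluate f(x|Theta) = sum_j pi_j N(x | theta_j, sigma^2) (Equation 1.4)."""
    x = np.atleast_1d(x)
    # one column per component density f(x | theta_j)
    comps = norm.pdf(x[:, None], loc=theta[None, :], scale=sigma)
    return comps @ pi

def mixture_sample(n, pi, theta, sigma=1.0, rng=None):
    """Draw n observations: first pick a component label, then draw from it."""
    rng = np.random.default_rng(rng)
    labels = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(loc=theta[labels], scale=sigma)

# Example: a three-component normal mixture
pi = np.array([0.5, 0.3, 0.2])
theta = np.array([-2.0, 0.0, 3.0])
x = mixture_sample(500, pi, theta)
print(mixture_pdf(np.array([0.0, 1.0]), pi, theta))
```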

Finite mixture densities are mainly used in modelling heterogeneous data where observations come from different categories, but we do not know the exact source of each observation. Let

X denote a random variable to be measured, we use f(x|θj) to denote the density function for the jth category and πj to denote the corresponding probability that X comes from this category. A mixture model also offers a flexible approximation to burdensome non-standard distributions. Researchers often use mixture densities to approximate an intractable distrib- ution with a heavy tail. For example, Priebe (1994) showed that a log normal density can be approximated by a normal mixture with a large enough component number. Dalal and Hall

(1983) proved that any prior distribution in Bayesian modelling can be arbitrarily closely approximated by a suitable discrete mixture of natural conjugate priors. The corresponding posterior distribution will be a discrete mixture of the updated conjugate components. The

18 finite mixture model also provides a parametric alternative for some non-parametric statis-

tical methods, such as kernel density estimation. Hastie et al. (1994, 1996) generalized

the Fisher-Rao linear discriminant analysis (LDA) to allow non-linear decision boundaries

by using nonparametric regression (FDA) and Gaussian mixtures (MDA) equivalently. The

finite mixture models closely resemble the Bayesian hierarchical model structure. The mixture parameters π are thought of as the prior probabilities of the unobserved indicator variables,

G(θ) is the prior density of θ, and the observed data are conditional on the indicators and the distribution parameters. Given these characteristics, there is no surprise that finite mixture models have been widely used by both frequentists and Bayesians alike.

Suppose we have a random sample of size n from a finite mixture distribution with k com-

ponents, where k and other distribution parameters are unknown. We consider fitting finite

mixture models with simultaneous estimation of the component number k, the coefficients

π, and the parameters of the component densities θ. With a fixed k, the EM method introduced

in Section 1.3 is adaptive and efficient to estimate other parameters (McLachlan & Basford,

1988). However, in many cases, the inference of the unknown k is essential and difficult.

In cluster analysis, the component number of a mixture is interpreted as how many homo-

geneous subgroups there are in the sample. An over-parameterized model can lead to less

precise estimates of parameters and slower convergence of estimation (Chen, 1995). To

choose an appropriate component number k, frequentists usually do hypothesis testing or model selection, whereas Bayesians do indirect or direct inference of k.

In the following sections, Section 1.2 describes the history of research on finding NPMLE for the mixing distribution. Section 1.3 introduces the EM algorithm for fitting the mixture

19 model with the component number fixed. Section 1.4 describes the frequentists’ approaches

to estimate k. The hypothesis testing methods are introduced in Section 1.4.1. The reasons

why the classical approaches such as the likelihood ratio test (LRT) fail are explained, and

several possible remedy solutions are addressed. Model selection methods are discussed in

Section 1.4.2. Section 1.5 introduces Bayesian approaches. Bayes factor is described in Sec-

tion 1.5.1. Bayesian model selection based on distance measure is addressed in Section 1.5.2.

And a complete Bayesian approach which includes k in the model structure is introduced in

Section 1.5.3. Finally, we briefly summarize the advantages and disadvantages of previous

methods in Section 1.6.

1.2 NPMLE

Suppose we have a random sample of size n from a mixture distribution

\[ f(x|G) = \int f(x|\theta)\, dG(\theta), \qquad (1.6) \]
where the form of the kernel density f(x|θ) is known. The objective is to find an estimate Ĝ that maximizes the likelihood
\[ L(G) = \prod_{i=1}^{n} f(x_i|G). \]
If Ĝ(θ) is estimated without assuming a specific parametric distributional form, but instead using a step function
\[ \hat{G}(\theta) = \sum_{j=1}^{\hat{k}} \hat{\pi}_j I(\hat{\theta}_j \le \theta), \qquad (1.7) \]
it is called a nonparametric maximum likelihood estimator (NPMLE).

20 In a preliminary report, Robbins (1950) mentioned the idea of estimating the mixing distri-

bution. The first substantial theoretical developments concerning the maximum likelihood

estimator of the mixing distribution were made by Kiefer and Wolfowitz (1956). They es-

tablished some quite general conditions for the consistency of the maximum likelihood es-

timator, though they did not talk about the form of this estimator and how to compute it. In

1976, Simar studied the problem of estimating the compounding distribution of a compound

Poisson process, which is a discrete case of Equation 1.6. He demonstrated the existence,

uniqueness, and convergence of the NPMLE, and suggested a computing algorithm. Later,

Laird (1978) extended Simar's result to the general problem and showed that the NPMLE of a

mixing distribution is a step function, and it is self-consistent. A computation approach was

also proposed. Jewell (1982) considered mixtures of exponential and Weibull distributions,

and showed that the maximum likelihood estimate of the mixing distribution in this case is

unique, consistent, and has finite support.

In the 1980s, Bruce Lindsay (1981, 1983a, 1983b, 1983c) wrote a series of papers indicating

that the existence, uniqueness, and finite support properties of the NPMLE hold for many

parametric families, including most one parameter exponential families. These fundamental

properties of the NPMLE of a mixing distribution are directly related to the geometrical

properties of the convex hull of the likelihood set. The support size is no larger than the

number of distinct observations. To determine whether a given estimate G0 is the MLE,

Lindsay proposed a gradient function

\[ D_{G_0}(\theta) = \sum_{i=1}^{n} \left( \frac{L_i(\theta)}{L_i(G_0)} - 1 \right). \]

$D_{G_0}(\theta)$ equals 0 at all support points if $G_0$ maximizes the likelihood. A forward searching

21 algorithm for Gˆ was also described in Lindsay’s papers.

1.3 EM Algorithm for Fixed k

Using the semi-parametric representation of the finite mixture model (1.4) and assuming independent observations, for some fixed k the log-likelihood function of a mixture model is
\[ l_n(x|\pi,\theta) = \sum_{x} \log\Big[ \sum_{j=1}^{k} \pi_j f(x|\theta_j) \Big]. \qquad (1.8) \]

The parameters π, θ are usually estimated with the maximum likelihood estimation (MLE) method. An efficient and general algorithm, the EM algorithm developed by Dempster,

Laird, and Rubin (1977), is often used. In the mixture model context, the EM algorithm calls the log-likelihood function above the incomplete log-likelihood and invents latent, un- observed variables Z := {Z1,...,Zk} to get the complete log-likelihood. Zj, j = 1, . . . , k,

takes the value 1 if X is from the sub-population j, and 0 otherwise. Let zij denote the value

of $Z_j$ for the observation $x_i$; the likelihood for the pair $(x_i, z_{ij})$ is
\begin{align*}
\Pr(X_i = x_i, Z_{i1} = z_{i1}, \ldots, Z_{ik} = z_{ik})
&= \Pr(X_i = x_i \mid Z_{i1} = z_{i1}, \ldots, Z_{ik} = z_{ik}) \times \Pr(Z_{i1} = z_{i1}, \ldots, Z_{ik} = z_{ik}) \\
&= \prod_{j=1}^{k} f(x_i|\theta_j)^{z_{ij}}\, \pi_j^{z_{ij}}.
\end{align*}
The full likelihood is then
\[ \prod_{i=1}^{n}\prod_{j=1}^{k} f(x_i|\theta_j)^{z_{ij}}\, \pi_j^{z_{ij}}, \qquad (1.9) \]

22 and the associated complete log-likelihood is

\[ l_{com}(x|\pi,\theta) = \sum_{i}\sum_{j} z_{ij}\log\pi_j + \sum_{i}\sum_{j} z_{ij}\log f(x_i|\theta_j). \qquad (1.10) \]

The EM algorithm includes two steps. The E step replaces the unobserved $z_{ij}$ with its expected value given the data $x_1, \ldots, x_n$ and θ, π:
\begin{align*}
\hat{z}_{ij} &= E(Z_{ij} \mid x_i, \theta, \pi) \\
&= \Pr(Z_{ij} = 1 \mid X_i = x_i, \theta, \pi) \\
&= \frac{f(x_i|\theta_j)\pi_j}{\sum_{j'} f(x_i|\theta_{j'})\pi_{j'}}. \qquad (1.11)
\end{align*}

The M step maximizes the complete log-likelihood over the weights π and the parameters θ:
\[ \hat{\pi}_j^{new} = \frac{1}{n}\sum_{i} \hat{z}_{ij} = \frac{1}{n}\sum_{i} \frac{f(x_i|\theta_j)\pi_j}{\sum_{j'} f(x_i|\theta_{j'})\pi_{j'}}, \qquad (1.12) \]
\[ \hat{\theta}^{new} = \operatorname*{argmax}_{\theta}\{ l_{com}(x|\pi,\theta) \}. \qquad (1.13) \]

Given some initial values $\pi^0$ and $\theta^0$, the EM algorithm repeats the E step and the M step

until convergence is achieved. Under some regularity conditions, it can be shown that the

estimates by MLE or EM are consistent (Wald, 1949; Dempster et al., 1977).
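As a concrete illustration of Equations 1.11 through 1.13, the sketch below runs the E and M steps for a univariate normal mixture with fixed k and a known common scale; this is a minimal outline under those assumptions (the function and variable names are ours), in which the θ update reduces to a weighted mean.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(x, k, sigma=1.0, n_iter=200, seed=0):
    """EM for a k-component normal mixture with common, known scale sigma."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                      # initial weights
    theta = rng.choice(x, size=k, replace=False)  # initial means from the data
    for _ in range(n_iter):
        # E step: posterior membership probabilities z_hat[i, j]  (Eq. 1.11)
        dens = norm.pdf(x[:, None], loc=theta[None, :], scale=sigma) * pi
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights (Eq. 1.12) and component means (Eq. 1.13)
        pi = z_hat.mean(axis=0)
        theta = (z_hat * x[:, None]).sum(axis=0) / z_hat.sum(axis=0)
    return pi, theta

# Example with simulated data from a two-component mixture
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 200)])
print(em_normal_mixture(x, k=2))
```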

1.4 Frequentist Methods for Choosing k

1.4.1 Hypothesis Testing

The likelihood ratio test is one of the most widely used methods for hypothesis testing. It is

based on calculating the ratio of the supremum of likelihood under all conditions relative to

23 that under the null hypothesis. The null hypothesis is rejected when this ratio is larger than

some certain cutoff value. The LRT is simple to interpret, locally most powerful, and the test

statistic has an asymptotic χ2 distribution under some standard regularity conditions (Wilks,

1938). However, it has long been recognized that the LRT statistic does not have the usual

χ2 limiting distribution in the case of finite mixture models (Hartigan, 1985).

Here, we use a simple two-component mixture (1 − π)f(x|θ0) + πf(x|θ1) as an example,

where θ0 is fixed and θ1, π are unknown parameters. The hypothesis being tested is H0:

θ1 = θ0 vs. H1: θ1 6= θ0, or H0: π = 0 vs. H1: π 6= 0. The corresponding likelihood ratio

test statistic is
\[ LRT_n = \frac{\sup_{(\theta_1,\pi)} \prod_{i=1}^{n} \big[(1-\pi) f(x_i|\theta_0) + \pi f(x_i|\theta_1)\big]}{\prod_{i=1}^{n} f(x_i|\theta_0)}, \qquad (1.14) \]
and its logarithm is $LLRT_n = \log\{LRT_n\}$.

For example, if the null density is the standard normal, $f(x|\theta_0) = N(0, 1)$, and the alternative density is a normal with a different location parameter $\theta_1$, $f(x|\theta_1) = N(\theta_1, 1)$, the likelihood ratio test statistic becomes
\[ LLRT_n = \sup_{(\theta_1,\pi)} \sum_{i=1}^{n} \log\Big[(1-\pi) + \pi \exp\big(x_i\theta_1 - \tfrac{1}{2}\theta_1^2\big)\Big]. \qquad (1.15) \]

Hartigan (1985) pointed out that in the normal distribution case, LLRTn goes to infinity at

the rate of $\log\!\big(\sqrt{2\log n}\,\big)$ with probability 1 when Θ is unbounded. He also claimed that θ is not consistently estimated by the maximum likelihood estimator when θ = 0.
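This degenerate behaviour is easy to see numerically. The sketch below (our illustration; the grid ranges and function name are assumptions) approximates $LLRT_n$ of Equation 1.15 by a grid search over $(\pi, \theta_1)$ for data simulated under the null, so the supremum is only approximate.

```python
import numpy as np

def llrt_normal_location(x, theta_grid, pi_grid):
    """Approximate LLRT_n (Eq. 1.15) by maximizing over a grid of (pi, theta_1)."""
    best = -np.inf
    for theta1 in theta_grid:
        ratio = np.exp(x * theta1 - 0.5 * theta1**2)   # f(x|theta1) / f(x|0)
        for pi in pi_grid:
            ll = np.sum(np.log((1 - pi) + pi * ratio))
            best = max(best, ll)
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=500)                      # data generated under H0: N(0, 1)
theta_grid = np.linspace(-3, 3, 121)
pi_grid = np.linspace(0, 1, 101)
print(llrt_normal_location(x, theta_grid, pi_grid))
```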

The reason for the failure of the likelihood ratio test is that the regularity conditions required are not satisfied in the mixture case. The regularity conditions require that the null hypothesis

24 is in the interior of the parameter space, whereas in the mixture case, the null hypothesis lies

on the boundary of the parameter space. Also, the model is unidentifiable under the null

hypothesis of homogeneity, i.e., we could derive the null distribution from the alternative

distribution with π = 0 or θ1 = θ0 equivalently.

Some statisticians (Chernoff & Lander, 1995; Lemdani & Pons, 1997) suggested reparame-

terization to eliminate unidentifiability. For example, in the genetic linkage test, whether

the recombination fraction follows a mixture of two binomial distributions Bi(k, p) and

Bi(k, 1/2) or a pure distribution Bi(k, 1/2), the alternative density function is f(x|π, p) =

πBi(k, p) + (1 − π)Bi(k, 1/2). Chernoff and Lander (1995) constructed new parameters 1 1 1 θ = (θ , θ )T where θ = π(p − ), θ = π(p − )2. Now the null hypothesis corresponds 1 2 1 2 2 2 2

to H0 : θ = 0 or H0 : θ1 = θ2 = 0 equivalently, which is uniquely identifiable. But new

difficulties regarding the degeneration of the Fisher information arise and the maximum like-

lihood estimate could have zero variance unless we add additional conditions.

Though the asymptotic distribution of the LRT statistic under mixture models is not a χ2 dis- tribution, many statisticians, including Ghosh and Sen (1985), Bickel and Chernoff (1993),

Chernoff and Lander (1995), and Chen and Chen (2001), have studied this topic in detail.

All found similar conclusions involving the supremum of a Gaussian process. For instance, Chen and Chen (2001) showed that under some regularity conditions, the limiting distribution of the LRT statistic under the null is $\sup_{\theta\in\Theta}\{W^{+}(\theta)\}$, where $W^{+}$ is the positive part of a Gaussian process W with mean 0 and variance 1, and Θ is the compact parameter space.

25 The auto correlation function for the Gaussian process W is

\[ \rho(\theta,\theta') = \frac{\mathrm{cov}\{Z_i(\theta) - h(\theta)Y_i(\theta_0),\; Z_i(\theta') - h(\theta')Y_i(\theta_0)\}}{\sqrt{\mathrm{var}\{Z_i(\theta) - h(\theta)Y_i(\theta_0)\}\,\mathrm{var}\{Z_i(\theta') - h(\theta')Y_i(\theta_0)\}}}. \qquad (1.16) \]

Here

\[ Y_i(\theta) = Y_i(\theta,\theta_0) = \frac{f(X_i;\theta) - f(X_i;\theta_0)}{(\theta-\theta_0)\, f(X_i;\theta_0)}, \ \ \theta \ne \theta_0; \qquad Y_i(\theta_0) = Y_i(\theta_0,\theta_0) = \frac{f'(X_i;\theta_0)}{f(X_i;\theta_0)}, \ \ \theta = \theta_0. \]

\[ Z_i(\theta) = Z_i(\theta,\theta_0) = \frac{Y_i(\theta) - Y_i(\theta_0)}{\theta-\theta_0}, \ \ \theta \ne \theta_0; \qquad Z_i(\theta_0) = Z_i(\theta_0,\theta_0) = \frac{dY_i(\theta,\theta_0)}{d\theta}\Big|_{\theta=\theta_0}, \ \ \theta = \theta_0. \]

\[ h(\theta) = \frac{E\{Y_i(\theta_0)\, Z_i(\theta)\}}{E\{Y_i^2(\theta_0)\}}. \]

The result provided by Chen and Chen (2001) implies that the asymptotic null distribution of the LRT now is not distribution free. It depends on the true parameter, and different para- metric families will have different autocorrelation functions. The limiting null distribution of the LRT will be more complex for more complicated hypotheses. Chen and Chen (2003) studied the test for homogeneity in the normal mixture setting when there is an unknown scale parameter σ2. They proved that the asymptotic null distribution of the likelihood ratio

2 test statistic is the maximum of a χ2 variable and the supremum of the square of a truncated

Gaussian process with mean 0 and variance 1. Also, the existence of an unknown scale parameter slows down the convergence rate of the estimator.

With a Gaussian process, the calculation for the p value of the test statistic is not trivial.

Charnigo and Sun (2004a) pointed out that it can be estimated, in principle, using Sun’s

(1993) approximation to the tail probability of $\sup_{\theta\in\Theta}\{W(\theta)\}$. Generally, researchers approximate it with permutation or bootstrap. For example, Aitkin et al. (1981) used a re-sampling

26 method to assess the null distribution of −2logL. McLanchlan (1987) and Chen and Chen

(2001) used a bootstrap procedure to estimate the p value. These methods generate null samples by resampling from the empirical distribution or using the parametric bootstrap from the null model, and compare the observed LRT statistic with the 95th percentile of the constructed null distribution.

Chen et al. (2001) proposed a modified likelihood ratio test (MLRT), which is distribution free and asymptotically locally most powerful. To avoid the boundary and un-identifiability problem, they proposed a penalized log-likelihood function

\[ l_n(\pi,\theta_1,\theta_2) = \sum_{i=1}^{n} \log\{(1-\pi) f(x_i|\theta_1) + \pi f(x_i|\theta_2)\} + C\log\{4\pi(1-\pi)\}, \qquad (1.17) \]

where 0 < π < 1, θ1, θ2 ∈ Θ with θ1 ≤ θ2 and C is a positive constant that controls the

degree of penalty. Now π is in the interior of the parameter space whereas the corresponding

distributions for π = 0 or π = 1 are covered by θ1 = θ2. They also proved that the modified

LRT statistic $M_n = 2\{l_n(\hat{\pi},\hat{\theta}_1,\hat{\theta}_2) - l_n(1/2,\hat{\theta},\hat{\theta})\}$ has an asymptotic null distribution of $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$ under some regularity conditions. The constant C is defined as C = log(M), where [−M, M] is large enough to include the parameters θ. They found that this method

is not sensitive to the choice of C based on simulation studies. This approach can be extended to complicated case such that k ≥ 2 component mixture models. More recently, Chen et

al. (2004) considered the problem of testing the hypothesis $H_0: k = 2$ vs. $H_1: k \ge 3$. Using the penalized likelihood function $l_n^{k}(G) = l_n(G) + C_k \sum_{j=1}^{k} \log(\pi_j)$, a very complicated asymptotic distribution for the LRT statistic $R_n = 2\{l_n(\hat{G}) - l_n(\hat{G}_0)\}$ was obtained:

\[ \Big(\tfrac{1}{2} - \tfrac{\alpha}{2\pi}\Big)\chi^2_0 + \tfrac{1}{2}\chi^2_1 + \tfrac{\alpha}{2\pi}\chi^2_2, \qquad (1.18) \]

27 where α is decided by the geometric structure of parameters.

The LRT and MLRT are equivalent to using the Kullback-Leibler (KL) information

\[ d(g, f) = \int \log\Big\{\frac{g(x)}{f(x)}\Big\}\, g(x)\, dx \]

as a measure of the distance between two models. Charnigo and Sun (2004a) proposed

using the $L_2$ distance ($d(g, f) = \int [g(x) - f(x)]^2\, dx$) between competing models to test the

homogeneity in a continuous mixture distribution. They argued that the L2 distance is more

closely related to the visual separation between two density curves than the KL distance,

and that the L2 distance leads to a closed-form statistic. Their method, D-test, calculates

the (weighted) L2 distance between a fitted homogeneous model and a fitted k component

heterogeneous model, and compares it with some critical value. The D-test statistic is defined

as follows:

\[ d(k, n) = \int \Big[ \sum_{j=1}^{k} \hat{\pi}_j f(x|\hat{\theta}_j) - f(x|\hat{\theta}_0) \Big]^2 dx = \int \Big[ \sum_{j=0}^{k} \hat{\pi}_j f(x|\hat{\theta}_j) \Big]^2 dx, \qquad (1.19) \]
where $\hat{\pi}_0 = -1$. This statistic has a simple closed-form expression in terms of parameter estimates if the components are from standard parametric families. For example, with the univariate normal case,

\[ d(k, n) = \sum_{j=0}^{k}\sum_{j'=0}^{k} \frac{\hat{\pi}_j \hat{\pi}_{j'}}{\sqrt{2\pi(\hat{\sigma}_j^2 + \hat{\sigma}_{j'}^2)}} \exp\Big[-\frac{1}{2}\,\frac{(\hat{\mu}_j - \hat{\mu}_{j'})^2}{\hat{\sigma}_j^2 + \hat{\sigma}_{j'}^2}\Big], \qquad (1.20) \]

where the estimates are calculated by the EM method. They also proved that d(2, n) = OP (1)

under the alternative hypothesis and that $d(2, n) = O_P(n^{-1})$ under the null hypothesis.
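As an illustration of the closed form in Equation 1.20, the sketch below (ours, with hypothetical names; in practice the estimates would come from an EM fit as described above) evaluates the D-statistic for univariate normal components, treating the fitted homogeneous model as component 0 with weight −1.

```python
import numpy as np

def d_statistic(pi_hat, mu_hat, sigma_hat):
    """Closed-form D-statistic (Eq. 1.20) for univariate normal components.

    Component 0 is the fitted homogeneous model and carries weight -1;
    components 1..k come from the fitted k-component mixture.
    """
    pi_hat, mu_hat, sigma_hat = map(np.asarray, (pi_hat, mu_hat, sigma_hat))
    var_sum = sigma_hat[:, None] ** 2 + sigma_hat[None, :] ** 2
    kernel = np.exp(-0.5 * (mu_hat[:, None] - mu_hat[None, :]) ** 2 / var_sum)
    kernel /= np.sqrt(2 * np.pi * var_sum)
    return float(pi_hat @ kernel @ pi_hat)

# Homogeneous fit N(0.4, 1.3^2) versus a two-component fit
print(d_statistic(pi_hat=[-1.0, 0.6, 0.4],
                  mu_hat=[0.4, -0.5, 1.8],
                  sigma_hat=[1.3, 1.0, 1.0]))
```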

Using simulated studies, Charnigo and Sun (2004a) showed that the D-statistic has more

power than the MLRT for normal location families. For exponential scale and normal loca-

28 tion/scale families, they used a weighting function, a positive increasing function, to empha-

size the disparities in right tails, and the weighted D-statistic was shown to have competitive

power with MLRT. But the D-statistic does not have a clear asymptotic null distribution; they

had to tabulate the critical values based on simulation. In another paper, Charnigo and Sun

(2004b) generalized the D-test to discrete mixture distributions, and showed the equivalence

between the D-test statistic and the MLRT statistic: “under homogeneity and with a penal-

ized maximum likelihood estimation framework, $n\, d_{w(x)}(2, n)$ behaves asymptotically as a constant times $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$” (pp. 3).

1.4.2 Model Selection Approaches

Henna's (1985) paper concerned the direct estimation of the component number of a finite

mixture. Let $F_n(x)$ denote the empirical CDF of the observations, and $F_{G_k}(x)$ denote the

cumulative distribution function of a k-component mixture distribution $G_k: \sum_{j=1}^{k} \pi_j f(x|\theta_j)$ with k, π, θ unknown. Henna used $\hat{G}_{k,n}$ to denote the estimate of $G_k$ that minimizes the function
\[ S_n(G_k) = \int \{F_{G_k}(x) - F_n(x)\}^2 dF_n(x) = \frac{1}{n}\sum_{i=1}^{n} \Big\{ \sum_{j=1}^{k} \pi_j F(x_{(i)}|\theta_j) - \frac{i}{n} \Big\}^2, \qquad (1.21) \]

where k = 1, 2, . . . , n and x(i) is the ith order statistic of the sample.

Then he proposed an estimator $\hat{k}_n$ for the true and unknown component number:
\[ \hat{k}_n = \text{the minimal integer } k \text{ such that } S_n(\hat{G}_{k,n}) < \lambda^2(n)/n, \qquad (1.22) \]
where λ(n) is a sequence with the properties that $\lambda(n) \to \infty$, $\lambda^2(n)/n \to 0$ as $n \to \infty$, and $\sum_n \{\lambda^2(n)/n\}\, e^{-2\lambda^2(n)} < \infty$. The existence and consistency of $\hat{k}_n$ are proved theoretically.

29 Common model selection methods based on a penalized likelihood, including AIC and BIC

have been considered in fitting mixture models by Leroux (1992). He considered a sequence

of nested mixture models with possible number of components k = 1, . . . , n, and proposed an estimator $\hat{k}$ for the true value of k,
\[ \hat{k} = \operatorname*{argmax}_{k}\{ l_n(\hat{F}_k) - a_{k,n} \}. \]
The penalty term $a_{k,n}$ is an increasing function of the component number k of the fitted model: $a_{k,n} = \dim(F_k)$ for AIC, and $a_{k,n} = (\tfrac{1}{2}\log n)\dim(F_k)$ for BIC, where $\dim(F_k) = k - 1 + k\,\dim(\theta)$. The penalized maximum likelihood methods usually help reduce overestimation of the model. However, Leroux pointed out that the estimated number of components is at least as large as the true number.
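A minimal sketch of this recipe follows (our illustration; it uses scikit-learn's GaussianMixture as the fixed-k fitter rather than the EM code of Section 1.3, and it selects k by the smallest BIC, which corresponds, up to constants, to maximizing the penalized log-likelihood above with the BIC penalty).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(0, 1, 300),
                    rng.normal(4, 1, 200)]).reshape(-1, 1)

# Fit k = 1..6 and keep the BIC of each fit; a smaller BIC is better.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
        for k in range(1, 7)}
k_hat = min(bics, key=bics.get)
print(bics, k_hat)
```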

Chen and Kalbfleisch (1996) proposed a method to estimate the number of mixture compo- nents based on the penalized minimum distance, and showed that their estimator is consis- tent. They calculated the distance between the CDF of the fitting distribution F (x|G) and the

CDF of the empirical distribution $F_n(x)$, with a penalty term given by the sum of the log

weights, and then chose the model with the minimum distance. Let $F_n(x) = n^{-1}\sum_i I(x_i \le x)$ and $F(x|G) = \sum_{j=1}^{k} \pi_j F(x|\theta_j)$; the penalized distance is defined as
\[ D(F_n, F(x|G)) = d(F_n, F(x|G)) - c_n \sum_{j=1}^{k} \log \pi_j, \qquad (1.23) \]

where cn is a sequence of positive constants. The penalty becomes severe when we have very

small πj values, which commonly corresponds to a large k. The conditions for the estima-

tor convergence are satisfied by most of the distance measures, including the Kolmogorov-

Smirnov distance ($d(F_1,F_2) = \sup_x |F_1(x) - F_2(x)|$), the Cramer-von Mises distance ($d(F_1,F_2) = \int [F_1(x) - F_2(x)]^2 d[F_1(x) + F_2(x)]$), and the Kullback-Leibler information ($d(F_1,F_2) = \sum_{i=1}^{n} \pi_{1i}\log\frac{\pi_{1i}}{\pi_{2i}}$ with $\pi_{li} = F_l(i) - F_l(i-1)$). Chen and Kalbfleisch suggested that the appropriate distance measure and the related constant sequence should be chosen specifically for different problems. Other alternative penalties, such as $\sum_i \pi_i^{-1}$, are also suggested.

James et al. (2001) suggested another consistent method for estimating the component num-

ber of a finite mixture model. Let $\tilde{f}_h$ denote the nonparametric kernel density estimator with bandwidth h, defined as $\tilde{f}_h = (1/n)\sum_{i=1}^{n} \phi_h(x - x_i)$, where φ(x) is the normal kernel density function. They proposed the estimator for the true component number k as
\[ \hat{k}_n = \min\{k : KL(\tilde{f}_h, \phi_h * \hat{g}^{k}) \le KL(\tilde{f}_h, \phi_h * \hat{g}^{k+1}) + a_{n,k+1}\}, \]
where $\{a_{n,j}: j \ge 1\}$ is a positive sequence that converges to zero as $n \to \infty$. The nonparametric kernel density estimator is consistent, so the consistency of this estimator $\hat{k}_n$ is proved similarly. The iterative fitting algorithm is described in Algorithm 1.

Algorithm 1 James' forward fitting algorithm
  Fit the data with $\tilde{f}_h = (1/n)\sum_{i=1}^{n} \phi_h(x - x_i)$.
  Set k = 1.
  Compute the Kullback-Leibler distance between $\tilde{f}_h$ and a k-component mixture model $g_k$: $KL(g_k, \tilde{f}_h) = \int g_k(x)\log\{g_k(x)/\tilde{f}_h(x)\}\, dx$.
  repeat
    k = k + 1.
    Compare $KL(g_{k+1}, \tilde{f}_h) - KL(g_k, \tilde{f}_h)$ with a threshold number $a_{n,k}$
  until the change is less than $a_{n,k}$.

31 The choice of the threshold number is critical, they suggested a = 3/n based on the mini-

mum description length (MDL) penalty of Rissanen (1978).

1.5 Bayesian Methods

To fit a mixture distribution with the component number k known, Bayesian approaches can

provide estimators with closed form expressions if the prior distributions are proper. By

introducing the unobserved indicator variables Z := {Z1, ..., Zk} with
\[ Z_{ij} = \begin{cases} 1 & \text{if the } i\text{th sample is drawn from the } j\text{th mixture component,} \\ 0 & \text{otherwise,} \end{cases} \]

the joint distribution of the observed data x and the unobserved indicators Z is conditional

on the model parameters:

\[ p(x, Z|\theta, \pi) = p(Z|\pi)\, p(x|Z, \theta) = \prod_{i=1}^{n}\prod_{j=1}^{k} \big(\pi_j f(x_i|\theta_j)\big)^{z_{ij}}. \qquad (1.24) \]

The vector of mixture indicators Z = {Z1,...,Zk} is viewed as a multinomial random

variable with parameter π, and the natural distribution for π is the Dirichlet

distribution, π ∼ Dirichlet(α1, . . . , αk). The prior distribution of θ is independent of that

of π. For example, in a mixture of normals, a common prior for $(\mu_j, \sigma_j^2)$ is
\[ \mu_j \sim N(\xi, \kappa^{-1}) \quad \text{and} \quad \sigma_j^2 \sim \text{Inv-Gamma}(\alpha, \beta). \]

Posterior inferences of θ and π are obtained by averaging over the mixture indicators. How-

ever, the closed form of the posterior distribution does not help for the burdensome integra-

tion. Bayesians usually draw samples from the posterior distribution using the Gibbs sampler or

Markov chain Monte Carlo (MCMC) methods, and then obtain inferences for θ and π from these simulations (Diebolt & Robert, 1994).
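To make the data-augmentation scheme concrete, the sketch below (our illustration, not the dissertation's code) performs one Gibbs sweep for a normal mixture with known common scale, a conjugate Dirichlet prior on the weights, and normal priors on the means; the hyperparameter names and values are assumptions for the sketch.

```python
import numpy as np
from scipy.stats import norm

def gibbs_sweep(x, pi, mu, sigma=1.0, alpha=1.0, xi=0.0, kappa=0.01, rng=None):
    """One Gibbs sweep for a k-component normal mixture with known scale sigma."""
    rng = np.random.default_rng(rng)
    k = len(pi)
    # 1. Sample indicators Z_i from their multinomial full conditional.
    probs = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma) * pi
    probs /= probs.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=p) for p in probs])
    counts = np.bincount(z, minlength=k)
    # 2. Sample weights pi | Z from the conjugate Dirichlet.
    pi_new = rng.dirichlet(alpha + counts)
    # 3. Sample each mean mu_j | Z, x from its conjugate normal full conditional.
    mu_new = np.empty(k)
    for j in range(k):
        prec = kappa + counts[j] / sigma**2
        mean = (kappa * xi + x[z == j].sum() / sigma**2) / prec
        mu_new[j] = rng.normal(mean, 1.0 / np.sqrt(prec))
    return pi_new, mu_new, z
```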

When the component number of a mixture model is unknown, and the interest is in the estimation of k, Bayesians construct mixture models with different numbers of components separately, and use a significance test or other criteria, such as a Bayes factor, the Kullback-

Leibler distance, or the Schwarz criterion (BIC), to draw the inference for k.

1.5.1 Bayes Factor

The Bayes factor is a summary of the evidence provided by the data in favor of one model as opposed to another. It is the ratio of the marginal likelihood under one model to that under another model. Suppose we have two competing models M1 and M2, the ratio of their prior probabilities is p(M2)/p(M1), and the ratio of their posterior probabilities is p(M2|x)/p(M1|x). The Bayes factor is defined as

\[ B_{21} = \text{Bayes factor}(M_2; M_1) = \frac{p(x \mid M_2)}{p(x \mid M_1)}. \qquad (1.25) \]

And that

posterior ratio = Bayes factor × prior ratio.

The values of the Bayes factor are interpreted as follows (Kass & Raftery, 1995):

B21        Evidence against M1
1 - 3      Not worth more than a bare mention
3 - 20     Positive
20 - 150   Strong
> 150      Very strong

The marginal densities $p(x|M_k)$ (k = 1, 2) are obtained by integrating over the parameter space:
\[ p(x|M_k) = \int p(x|\theta_k, M_k)\, p(\theta_k|M_k)\, d\theta_k. \]

The marginal density usually does not have a closed form. With the advent of MCMC meth-

ods, researchers try to estimate the marginal density based on the draws from the posterior

density. Newton and Raftery (1994) gave an estimator for the marginal density under model

k:
\[ \hat{f}_{NR}(x \mid M_k) = \Big\{ \frac{1}{G}\sum_{g=1}^{G} \frac{1}{f(x \mid \theta_k^{(g)}, M_k)} \Big\}^{-1}, \qquad (1.26) \]
which is the harmonic mean of the likelihood values. This estimator is simulation consistent

but unstable, because it uses the inverse of the likelihood. Gelfand and Dey (1993) proposed

another estimator:

\[ \hat{f}_{GD}(x \mid M_k) = \Big\{ \frac{1}{G}\sum_{g=1}^{G} \frac{p(\theta_k^{(g)})}{f(x \mid \theta_k^{(g)}, M_k)\,\pi(\theta_k^{(g)} \mid M_k)} \Big\}^{-1}, \qquad (1.27) \]
where p(θ) is a density with thinner tails than the product of the prior and the likelihood. This

estimator is consistent, but the choice of p(θ) is hard to determine. Chib (1995) suggested another approach. Based on the Bayes theorem

\[ f(x) = \frac{f(x|\theta)\,\pi(\theta)}{\pi(\theta|x)}, \qquad (1.28) \]

34 the logarithm of the marginal density can be estimated with

\[ \log\hat{f}(x) = \log f(x|\theta^*) + \log\pi(\theta^*) - \log\hat{\pi}(\theta^*|x), \qquad (1.29) \]

in which evaluations can be computed from the output. For example,

\[ \hat{\pi}(\theta^* \mid x) = G^{-1}\sum_{g=1}^{G} \pi(\theta^* \mid x, z^{(g)}). \qquad (1.30) \]

The standard error of the estimates can also be computed numerically.
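For intuition about Equation 1.26, the toy sketch below (ours; not from the dissertation) computes the harmonic-mean estimate of a marginal likelihood in a conjugate normal-normal model, where the exact answer is available for comparison, working on the log scale for numerical stability.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=50)           # data, known sigma = 1
tau2, n, xbar = 4.0, len(x), x.mean()       # prior theta ~ N(0, tau2)

# Exact log marginal likelihood of the conjugate model (for reference)
exact = (norm.logpdf(x, 0.0, 1.0).sum()
         + 0.5 * np.log(1.0 / (1.0 + n * tau2))
         + 0.5 * (n * xbar) ** 2 * tau2 / (1.0 + n * tau2))

# Posterior draws theta^(g), then the harmonic-mean estimator (Eq. 1.26)
post_var = 1.0 / (n + 1.0 / tau2)
draws = rng.normal(post_var * n * xbar, np.sqrt(post_var), size=20000)
loglik = norm.logpdf(x[:, None], draws[None, :], 1.0).sum(axis=0)
log_fhat = -(logsumexp(-loglik) - np.log(len(draws)))
print(exact, log_fhat)   # the harmonic-mean estimate is noticeably unstable
```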

To directly estimate the posterior distribution of the component number, Carlin and Chib

(1995) included the model indicator M into the sampling scheme. But now one necessary condition for the MCMC convergence is violated if different models have different sizes.

They introduced a pseudo-prior $p(\theta_j \mid M \ne j)$ to complete the full conditional distributions and estimated the posterior distribution with

\[ \hat{p}(M = j \mid x) = \frac{\text{number of } M^{(g)} = j}{\text{total number of } M^{(g)}}. \qquad (1.31) \]

Then a Bayes factor is computed between any two competing models.

Another approach is suggested by Roeder and Wasserman (1997). They assumed a fixed

upper bound K for the component number, used a partially proper prior, and approximated the marginal distribution $f_k(x) = \int p(x \mid \theta_k)\,\pi_k(\theta_k)\, d\theta_k$ with

\[ \hat{f}_k = n^{-d_k/2}\, p(x \mid \hat{\theta}_k), \qquad (1.32) \]

where dk is the number of parameters in the model. Then they estimated p(k | x) by

\[ \hat{p}(k \mid x) = \frac{\hat{f}_k}{\sum_j \hat{f}_j}. \qquad (1.33) \]

35 Ishwaran et al. (2001) dealt with this mixture model estimation problem with Bayesian model selection. They also assumed a fixed upper bound K for the component numbers, and decomposed the marginal density for the data into K densities, each corresponding to the contribution from the prior with k = 1, . . ., K components. The weighted Bayes factor then was used for selecting the dimension k. They used an i.i.d. generalized weighted Chi- nese restaurant (GWCR) Monte Carlo algorithm, and proved that the posterior distribution is

consistent and has an $O_P(n^{-1/4})$ convergence rate, which is the optimal frequentist rate.

1.5.2 Distance Measures in Bayesian Model Selection

Mengersen and Robert (1996) proposed using the Bayesian entropy (Kullback-Leibler) distance to compare two distributions. They only considered testing the homogeneity of a two-component normal mixture. To avoid the unidentifiability caused by the parameter space geometry, they reparameterized the mixture distribution as $\pi N(\mu, \tau^2) + (1-\pi)N(\mu + \tau\theta, \tau^2\sigma^2)$ instead of using the classical $\pi N(\mu, \tau_1^2) + (1-\pi)N(\mu_2, \tau_2^2)$. They also claimed that this modified parametrization allows for more noninformative prior modelling. They viewed the testing problem as comparing some critical value α with the KL distance between the mixture distribution and the closest null distribution. $H_0: \pi = 0$ is accepted if $\alpha \ge d(\pi f(x|\theta_1) + (1-\pi)f(x|\theta_2),\, f(x|\theta^*))$, where $\theta^*$ minimizes $d(\pi f(x|\theta_1) + (1-\pi)f(x|\theta_2),\, f(x|\theta))$. The bound α is chosen by considering the boundaries of the region in which a $\pi N(0, 1) + (1-\pi)N(\theta, \sigma^2)$ mixture can be approximated by its $N\big((1-\pi)\theta,\; \pi + (1-\pi)\sigma^2 + \pi(1-\pi)\theta^2\big)$ projection.

36 A modified version with extension to mixtures with large component number was proposed by Sahu and Cheng (2002). They chose a reasonably large number for k as the start, and then considered merging two components with other components fixed. The KL distance between the original model f k and the collapsed model f k−1 is computed and compared with a critical value. The whole process continues until further merging will demolish the

fit. However, the choice of the critical value αk depends on the specific problem. Sahu and

Cheng suggested a graphical diagnosis to find guideline values of αk. The KL distance can- not be computed analytically for the mixture problem. Mengersen and Robert (1996) used a Laplace approximation, whereas Sahu and Cheng(2002) suggested an alternative weighted

KL distance.

1.5.3 Fully Bayesian Approach

In a seminal paper, Richardson and Green (1997) described a fully Bayesian treatment for mixture modelling. They modelled the number of components k, the unobserved indicator z, and the component parameters θ, π jointly. The inference of k is made based on the simulated posterior probabilities. The full joint distribution is expressed as

p(k, π, z, θ, x) = p(k)p(π|k)p(z|π, k)p(θ|k)p(x|θ, z). (1.34)

The difficulty with a full model is that the distributions with different k will have different parameter dimensions, which violates a necessary condition for the convergence of the usual

MCMC method. Richardson and Green used the reversible jump MCMC methods devel- oped by Green (1995) to sample from mixtures with varying number of components. Their

37 program moves between models with different k by splitting one mixture component into two or combining two into one. The birth or death of an empty component is also considered in each move. The probabilities for forth and back moves ensure that the Markov Chain will possess the stationary distribution which is the joint posterior distribution of the parameters and the model.

Stephens (2000) proposed a new MCMC algorithm for the mixture problem when k is unknown. He considered the parameters of the model as a point process, with each point representing a component density, and constructed a continuous-time Markov birth-death process.

The birth of new components occurs at a constant rate, and the death rates of components vary according to the importance of the corresponding components. Components that do not help to explain the data die quickly. The relationship between these rates ensures that this Markov chain has the correct stationary distribution. Stephens' (2000) study suggested that this method is competitive with the reversible jump MCMC both in results and in computational complexity.

1.6 Summary

The usual likelihood ratio test does not apply in the mixture case. Although many statisticians have studied the reasons for the failure and suggested some solutions, such as reparameterization and the penalized LRT, these methods are computationally complex and lose the simple interpretation of the LRT. Besides, the asymptotic null distribution becomes complicated.

Sequential model selection is suggested by both frequentists and Bayesians. It involves

38 stepwise actions. Forward stepwise starts with one component, while backward stepwise

chooses a reasonably large number K as the beginning. Then a modified model is proposed

and these two models are compared based on some criteria. If the new model is preferred,

another more parsimonious model is proposed and the whole process is repeated until no

further change is possible. Stepwise methods are very greedy and hard to calibrate. For

many methods, the stopping rule is not readily constructed.

Traditional Bayesian methods fit the mixture models with different k, and do model selection

or give an estimate kˆ based on MCMC output. Some researchers, such as Richardson and

Green (1997) and Stephens (2000), aimed at estimating k directly. Their methods can give a simulated posterior distribution of k. However, their models are usually very complicated

(see Equation 1.34) and include several tuning parameters that can not be easily chosen and have much effect on the estimation. As Stephens (2000) noted: “The posterior distribution for k can be highly dependent on not just the prior chosen for k, but also the prior chosen for the other parameters of the mixture model (pp. 64)”. The fact that “Being fully non- informative and obtaining proper posterior distribution are not possible in a mixture context”

(Richardson & Green, 1997, pp. 735) makes situations worse. Besides, several researchers have reported that the convergence of the reversible jump MCMC algorithm is slow (Byers

& Raftery, 1996).

39 Chapter 2

Proposed Lassoing Mixture Method

2.1 Introduction

Suppose there is a random sample of n observations, x1, . . . , xn, from a finite mixture distribution $\sum_{j=1}^{k} \pi_j f(x|\theta_j)$. The component number k, the component weights π, and the distribution parameters θ are all unknown. We use Y to denote the mixture density. The true density values at the sample points are $y_i = \sum_{j=1}^{k} \pi_j f(x_i|\theta_j)$, i = 1, . . . , n. Suppose we have a fitted saturated mixture model $\tilde{f}$ for the data, and use $\tilde{Y}$ to denote the density of the sample points

estimated with f˜. Then we can approximate the relationship between Y and Y˜ .

Let εi denote the difference between the estimate y˜i and the true value yi,

\[ \tilde{y}_i = y_i + \varepsilon_i = \sum_{j=1}^{k} \pi_j f(x_i|\theta_j) + \varepsilon_i, \qquad i = 1, \ldots, n. \qquad (2.1) \]

However, since we do not know the true value of k, we use n to approximate it as the start,

40 and expect that there are n − k components with weight 0.

\begin{align*}
\tilde{y}_i &= \sum_{j=1}^{k} \pi_j f(x_i|\theta_j) + \sum_{j=k+1}^{n} 0 \cdot f(x_i|\theta_j) + \varepsilon_i \\
&= \sum_{j=1}^{n} \pi_j f(x_i|\theta_j) + \varepsilon_i \quad \text{with } \sum_{j} I_{\{\pi_j = 0\}} = n - k, \qquad i = 1, \ldots, n. \qquad (2.2)
\end{align*}

If we assume that the error term ε satisfies E(ε) = 0 and Var(ε) = Σ for some well-defined positive definite matrix, we can treat π as regression parameters conditional on the response $\tilde{Y}$ and the design matrix X with (i, j)th entry $f(x_i|\theta_j)$. In other words, if we temporarily assume that the component density parameters θ are known, we can re-express Equation 2.2 as
\[ \tilde{y}_i = \sum_{j=1}^{n} \pi_j X_{ij} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.3) \]
where the design matrix X is constructed using $f(x_i|\theta_j)$, and n − k of the regression parameters $\pi_j$ should have true value 0.
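A small sketch of this construction follows (ours; the kernel density estimate standing in for the saturated pseudo-response, and the unit kernel scale, are assumptions made only for illustration): each observation contributes both a row of the design matrix and a candidate component density.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def lasso_mixture_design(x, sigma=1.0):
    """Build the pseudo-response y_tilde and design matrix X of Equation 2.3.

    Candidate components are normal densities centred at each observation
    (theta_j = x_j); y_tilde comes from a kernel density estimate standing in
    for the saturated mixture fit (an assumption for this sketch).
    """
    x = np.asarray(x)
    y_tilde = gaussian_kde(x)(x)                            # saturated estimate at x_i
    X = norm.pdf(x[:, None], loc=x[None, :], scale=sigma)   # X[i, j] = f(x_i | theta_j)
    return y_tilde, X

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y_tilde, X = lasso_mixture_design(x)
print(y_tilde.shape, X.shape)   # (200,), (200, 200)
```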

We see that the estimated density can be viewed as the sum of the true density, some pseudo component densities with weight zero, and an error term. Thus we can estimate the mixture weights as regression parameters, thanks to the linear structure of the mixture model.

To construct a saturated model, we need to use a component number large enough. For many scientific questions, we can have an upper bound d for the estimate of k, and the model with component number larger than d is a saturated model. While our inference is robust to the choice of the saturated model, we suggest using the model with k = n, the most saturated model.

We can use either the cumulative distribution function (CDF) or the probability density func-

41 tion (PDF) to relate Y to Y˜ . There are subtle differences between these two relations though.

The following Theorem 1 shows that the correlation between predictors (f(x|θj)) is larger under the CDF case, while the design matrix is nearly orthogonal under the PDF case.

Theorem 1. Let $X_{PDF}$ and $X_{CDF}$ be the design matrices constructed from a saturated mixture model as shown above, using either the PDF or the CDF relation respectively. Then

$X_{PDF}$ is nearly orthogonal, while $X_{CDF}$ has highly correlated columns.
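A quick numerical check of this contrast (our sketch; normal kernels with unit scale and the observed points as support points are assumptions): build both design matrices and compare the average absolute off-diagonal correlation of their columns.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=100))

# X_pdf[i, j] = f(x_i | theta_j), X_cdf[i, j] = F(x_i | theta_j), with theta_j = x_j
X_pdf = norm.pdf(x[:, None], loc=x[None, :], scale=1.0)
X_cdf = norm.cdf(x[:, None], loc=x[None, :], scale=1.0)

def mean_abs_offdiag_corr(X):
    """Average absolute correlation between distinct columns of X."""
    c = np.corrcoef(X, rowvar=False)
    return np.abs(c[~np.eye(len(c), dtype=bool)]).mean()

# The CDF design shows much stronger column correlation than the PDF design.
print(mean_abs_offdiag_corr(X_pdf), mean_abs_offdiag_corr(X_cdf))
```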

The impact of the correlation is in how it affects the imposition of sparsity on the mixing weights.

In our approximation, we know that in Equation 2.2, some true πj values should be exactly zero. The ordinary least square (OLS) estimator of the mixing weights under the regression- type relation derived in Equation 2.3 does not help here because it gives n non-zero regres- sion parameters. To impose sparsity, we will use a variant of Tibshirani’s least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996).

2.2 Linear Model Coefficient Shrinkage and Variable Se-

lection: Background

2.2.1 Introduction

As indicated in Equation 2.2 and Equation 2.3, we can treat the mixture density estimation as a linear regression variable selection problem with collinearity between predictors. Be-

42 low we will introduce the traditional solutions for linear models: coefficients shrinkage and variable selection.

Letting Y denote an n × 1 response vector and X denote an n × p design matrix, the linear regression model assumes that E(Y |X) is a linear function of X, i.e.,

E(Y |X) = β0 + X1β1 + ... + Xpβp. (2.4)

The ordinary least square (OLS) estimator minimizes the residual sum of squares (RSS).

n p X X 2 RSS(β) = (yi − β0 − xijβj) , (2.5) i=1 j=1 with solution βˆ = (XT X)−1XT Y , which is an unbiased estimate of β. However, we might not be satisfied with the OLS estimator when we have a large number of predictors where there may be collinearity among them. OLS estimates can have poor performance with respect to prediction because of their large variance. We could possibly improve the prediction accuracy by allowing a little bias to reduce the variance. In addition, for some scientific problems, keeping all predictors may block good interpretation. We are interested in sparse solutions when many of the elements of β are exactly zero, as in the case of model

(2.2). This is the problem of variable or model selection. The following sections describe some typical methods that can be used for this problem, including simultaneous shrinkage and selection operators like the LASSO.

In the following sections, Section 2.2.2 describes the all subsets selection. Section 2.2.3 introduces shrinkage methods, including ridge regression, bridge regression and the LASSO.

LARS, a more general and efficient computational approach is introduced in Section 2.2.4.

43 2.2.2 Subset Selection

Subset selection is commonly carried out by a stepwise method (forward or backward). For- ward stepwise selection starts with a null model (with intercept only), then sequentially adds predictors that increase most an appropriate goodness-of-fit measure. It stops until no pre- dictor can produce an important increase in the goodness-of-fit. Backward elimination starts with the full model and then sequentially deletes those predictors one at-a-time that are least significant. A hybrid stepwise procedure is also suggested. For each step of forward se- lection, we can use an F statistic to test if one can recalibrate the current model to delete predictors that are no longer important. Another method called Best Subset Regression

(BSR) searches exhaustively and finds the best model of size k ∈ {0, 1, . . . , p}. It includes the information-theoretic approach (Miller, 1990), resampling based approach (Shao, 1993,

1996), and Bayesian methods (George & McCulloch, 1993; Ishwaran & Rao, 2005a). How- ever, the optimal subset size may not be easy to determine. Over-fitting and under-fitting occur in practice, and asymptotic results do not always imply good finite sample perfor- mance (Miller, 1990).

Stepwise selection methods can be very greedy (Hastie et al., 2001) and stop too early.

Best Subset Selection (BSR) methods are not computationally feasible for a large number of predictors. In addition, in our model of Eq.2.2, we have a more discretized version of a continuous response. As a result, it often exhibits high variance in selection, and small changes of data can result in increased instability of subset selection methods. Coefficients shrinkage methods introduced in the next section are continuous in nature and as a result,

44 have less variability (Tibshirani, 1996; Fu, 1998).

2.2.3 Shrinkage: Ridge Regression, Bridge Regression and LASSO

Ridge regression was introduced by Hoerl and Kennard (1970). The idea is to minimize the

P 2 residual sum of squares (RSS) subject to the constraint j βj ≤ t for some t > 0.

n p ˆridge X X 2 X 2 β = argmin (yi − β0 − xijβj) subject to βj ≤ t. (2.6) β i=1 j=1 j

This ends up being equivalent to solving the penalized optimization,

n p p ˆridge X X 2 X 2 β = argmin{ (yi − β0 − xijβj) + λ βj }. (2.7) β i=1 j=1 j=1

Here t and λ are positive numbers that control the amount of shrinkage. There is a one-to- one correspondence between these two parameters. The ridge regression solutions are of the closed form

βˆridge = (XT X + λI)−1XT Y. (2.8)

The constant term λI makes the inverse operation possible even when XT X is not of the full rank. The relationship between βˆols and βˆridge can be derived as

1 βˆridge = βˆols, (2.9) 1 + d where d depends on λ or t. It implies that for any given design matrix X, there is an interval in which any value of t will give a ridge estimator βˆridge that has less mean squared error

(MSE) than that of βˆols. The detailed proof can refer to Hoerl and Kennard (1970).

45 The solution for ridge regression is not invariant under location and scale transformation.

Commonly we should standardize the predictors to have mean 0 and unit variance. We esti-

mate β0 by the mean of the response and assume that the response has mean 0 and estimate the ridge regression parameters without intercept. This standardization procedure has also been applied to other shrinkage operators including bridge regression, LASSO, and LARS, as indicated below.

Frank and Friedman (1993) proposed an alternative constraint

X γ |βj| ≤ t, (2.10) j

with γ ≥ 0 for minimizing the RSS. This was called bridge regression because it bridges

an entire family of shrinkage operators. Bridge regression includes ridge regression with

γ = 2 and subset regression with γ = 0 as special cases. Different choices of γ will produce

different constraints on the parameter space. Figure 2.1 shows the contours of equal value

of (2.10) for several values of γ. When γ < 1, the parameter space will not be concave.

Frank and Friedman (1993) did not give a solution for a specific γ, but suggested optimizing

the parameter γ. In Fu’s (1998) paper, he used generalized cross-validation to choose the

optimal γ.

One special member of the bridge family suggested by Tibshirani (1996) uses the absolute

P constraint |βj| ≤ t. This was called the least absolute shrinkage and selection operator

(LASSO). In orthogonal design settings (XT X = I), the solution of LASSO has the explicit

form

βˆlasso = sign(βˆols)(| βˆols | −d)+, (2.11)

46 Figure 2.1: Contours of equal value for Bridge penalty with different γ

where d is determined by γ or t. The function a+ defines a function that takes the value a if a is larger than 0, and takes the value 0 otherwise.

Because of the special geometry of the LASSO constraint (the rotated diamond), the LASSO solution can have exact 0 coefficients for some predictors. For the orthogonal case, the “soft thresholding” solution of Eq 2.11 explains the reason. For the nonorthogonal case, we use a picture to explain the effect of the constraints. Figure 2.2 shows the LASSO and ridge regression with two parameters. The constraint region is a diamond (|β1| + |β2| ≤ t) for

2 2 LASSO and circle (β1 + β2 ≤ t) for ridge regression. The solution is the tangent point of the constraint region and the contours of the residual sum of squares for both methods. We see that if the LASSO solution occurs at the corner, one parameter will be exactly zero. The

47 constraint regions for ridge regression and other bridge ridge regression with γ > 1 do not have corners, so they could not produce sparse solutions. LASSO does coefficients shrinkage and predictors selection simultaneously and continuously. It retains the good features of both subset selection and shrinkage regression while minimizing their shortcomings.

Figure 2.2: Estimation picture for LASSO (left) and ridge regression (right). The ellipses are the contours of the least squares error function.

It is very interesting that we can see LASSO, ridge and bridge regression from a Bayesian point of view. For example, if we take the independent double- as the prior distribution for each βj,

1 | β | f(β ) = exp(− j ), (2.12) j 2τ τ

the LASSO estimate for βj is just the posterior . The double exponential density has a sharp mode and heavy tails, which reflects the tendency of LASSO to produce either 0 or large estimates. The implicit prior for ridge regression is the normal density. The general prior density function for bridge regression is (Fu, 1998)

γ2−(1+1/γ)λ1/γ π (β) = , (2.13) λ,γ Γ(1/γ)

48 and the posterior distribution is

1 X f(β|Y ) ∼ Cexp{− (RSS + λ |β |γ)}. (2.14) 2 j j

The algorithm for fitting lasso requires quadratic programming techniques. Below is an

outline of the algorithm from Tibshirani (1996).

Algorithm 2 LASSO algorithm by Tibshirani (1996) ˆols ˆols Start with E = {i0} where δi0 = sign(β ), β is the overall least squares estimate. n ˆ X X 2 Find β = min (yi − βjxij) subject to GEβ ≤ t1, where GE is the matrix whose i=1 j

rows are δi for i ∈ E.

P ˆ while |βj| > t do

ˆols Add i to the set E, where δi0 = sign(β ). n ˆ X X 2 β = min (yi − βjxij) subject to GEβ ≤ t1. i=1 j end while

It must converge in at most 2p steps. Tibshirani (1996) suggested that the average number

of iterations is in the range of (0.5p, 0.75p) based on practice. Later, Fu (1998) suggested a shooting algorithm for LASSO as indicated in Algorithm 3.

The LASSO estimates in general do not have an analytic form, so it is difficult to compute the standard errors of the estimates. To address this, bootstrap approaches (Efron & Tibshirani, 1993) are usually applied, either fixing t or optimizing t for each bootstrap sample. The penalty parameter λ or the bound t in the LASSO (bridge) is determined by cross-validation or generalized cross-validation. For cross-validation, five-fold or ten-fold splitting is preferred. A grid of values of t/\sum_j|\hat\beta_j^{ols}| is chosen on [0, 1], the prediction error is estimated for each value, and the value yielding the lowest prediction error is the optimal choice.

Algorithm 3 LASSO (shooting) algorithm by Fu (1998)
  Start with \hat\beta^{ols} and m = 1.
  repeat
    m = m + 1
    for each j = 1, ..., p, let S_0 = S_j(0, \hat\beta^{-j}, X, y), where S_j(\beta, X, y) = \partial \mathrm{RSS}/\partial\beta_j, and set
      \hat\beta_j^{m} = (\lambda - S_0)/(2x_j^Tx_j)   if S_0 > \lambda,
      \hat\beta_j^{m} = (-\lambda - S_0)/(2x_j^Tx_j)  if S_0 < -\lambda,
      \hat\beta_j^{m} = 0                              if |S_0| \le \lambda.
  until \hat\beta^{m} converges
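A minimal Python sketch of the shooting updates in Algorithm 3 is given below, assuming a fixed λ; the function and variable names are ours, and the example data are purely illustrative.

```python
import numpy as np

def shooting_lasso(X, y, lam, n_iter=100, tol=1e-8):
    """Fu's shooting idea: cyclic coordinate-wise soft-threshold updates."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the OLS estimate
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual with beta_j removed
            S0 = -2.0 * X[:, j] @ r_j                # dRSS/dbeta_j evaluated at beta_j = 0
            denom = 2.0 * X[:, j] @ X[:, j]
            if S0 > lam:
                beta[j] = (lam - S0) / denom
            elif S0 < -lam:
                beta[j] = (-lam - S0) / denom
            else:
                beta[j] = 0.0
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.standard_normal(50)
print(shooting_lasso(X, y, lam=20.0))                # small coefficients are driven to zero
```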

For generalized cross-validation, writing the penalty \sum_j|\beta_j| as \sum_j\beta_j^2/|\beta_j| yields an estimator of ridge-regression form, \tilde\beta = (X^TX + \lambda W^-)^{-1}X^TY, where W = \mathrm{diag}(|\beta_j|) and W^- is a generalized inverse. The number of effective parameters can be approximated by p(t) = \mathrm{tr}\{X(X^TX + \lambda W^-)^{-1}X^T\}. The generalized cross-validation (GCV) statistic is then

GCV(t) = \frac{1}{N}\,\frac{\mathrm{RSS}(t)}{\{1 - p(t)/N\}^2},  (2.15)

where RSS(t) is the residual sum of squares for the fit with constraint t. The optimal t is the value with the smallest GCV(t).

The degrees of freedom of the LASSO have been studied recently by Zou, Hastie, and Tibshirani (2005). Using Stein's unbiased risk estimation (SURE) theory, they showed that the number of non-zero coefficients is an unbiased estimate of the effective degrees of freedom of the LASSO. Based on this measure, various model selection criteria that require a model complexity estimate (e.g. Cp, AIC and BIC) can be used to obtain the optimal LASSO fit. In the LARS algorithm introduced next, a Cp-type criterion is used.

Knight and Fu (2000) studied the asymptotic behavior of the bridge regression estimator, which includes the LASSO as a special case. Assuming the regularity conditions \frac{1}{n}\max_{1\le i\le n} X_i^TX_i \to 0 and C_n = \frac{1}{n}\sum_{i=1}^{n} X_iX_i^T \to C, where C is a non-negative definite matrix, they proved that the LASSO estimator \hat\beta^{lasso} is consistent, i.e. \hat\beta^{lasso} \xrightarrow{p} \beta, provided \lambda_n = o(n). Moreover, \hat\beta^{lasso} is \sqrt{n}-consistent if \lambda_n = O(\sqrt{n}).

The LASSO has been applied to many research areas. Tibshirani (1997) applied it to Cox’s

proportional hazards model for variable selection and shrinkage. Fu (2003) applied the

penalty model to the generalized estimating equations (GEE) in longitudinal studies. Other

applications include tree-based models (LeBlanc & Tibshirani, 1999), logistic regression

(Mak, 1999), neural networks (Sun, 1999), and B-spline models (Osborne et al., 1998).

2.2.4 LARS: Efficient Computation for LASSO Solutions

Efron et al. (2004) proposed a new model building algorithm called Least Angle Regression (LARS) for linear regression. The basic idea of LARS is to add predictors to the model one at a time, based on the correlation between the residuals and the predictors. LARS is a less greedy version of the traditional forward selection methods, and with slight modifications it can compute the LASSO or the forward stagewise linear regression solutions. It is computationally efficient and can give all possible LASSO estimates for a given problem. We use the diabetes data from Efron et al. (2004) as an example, which includes measurements of ten baseline variables (e.g. age, sex, blood pressure, ...) and one response variable (a quantitative measure of disease progression one year after baseline) for 442 diabetes patients. The left panel of Figure 2.3 shows all LASSO solutions for the diabetes data, from the model having no predictors to the model including all predictors. The LARS algorithm is shown as Algorithm 4.

Algorithm 4 LARS algorithm
  Start with all coefficients equal to zero and the estimate \hat\mu_0 = 0.
  Construct the predictor set E = \emptyset.
  repeat
    Find the predictor X_i, i \notin E, that has the largest correlation with the response Y; set E = E \cup \{i\}.
    Let \hat\mu_1 = \hat\mu_0 + \gamma_1 X_i, with residual r = Y - \hat\mu_1. Increase \gamma_1 as far as possible, until another predictor X_j has as much correlation with the residual as X_i: X_i^T(Y - \hat\mu_1) = X_j^T(Y - \hat\mu_1).
    Letting u_2 be the unit vector that bisects the angle between X_i and X_j, the next LARS estimate is \hat\mu_2 = \hat\mu_1 + \gamma_2 u_2. Increase \gamma_2 as far as possible, until another predictor X_k has as much correlation with the residual as X_i and X_j.
  until all predictors are included in the model.

The final step makes \hat\mu_p equal to the OLS fit \hat Y^{ols} and \hat\beta^{p} equal to the OLS estimate for the full set of p covariates.
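The full LASSO path produced by the LASSO-modified LARS can be computed, for example, with scikit-learn's lars_path; the short sketch below is only meant to illustrate the idea on the diabetes data and is not the implementation used in this thesis.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
y = y - y.mean()                       # center the response; LARS as described fits no intercept
# method="lasso" applies the LASSO modification of LARS discussed in the text
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)                     # one row per predictor, one column per step of the path
```

If the installed version supports it, passing positive=True to lars_path restricts the coefficients to be non-negative, which corresponds to the positive LASSO discussed below.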

If we assume that increases and decreases of the active set involve only one predictor at a time, and modify the LARS algorithm so that a nonzero coefficient that hits zero is removed from the active set of predictors, LARS yields all possible LASSO solutions. To select the optimal LASSO estimate, Efron et al. (2004) suggested a Cp-type selection criterion. Defining the degrees of freedom of an estimator \hat\mu as

df_{\mu,\sigma^2} = \sum_{i=1}^{n} \mathrm{cov}(\hat\mu_i, y_i)/\sigma^2,  (2.16)

a Cp-type risk estimate can be defined as

C_p(\hat\mu) = \frac{\|y - \hat\mu\|^2}{\sigma^2} - n + 2\,df_{\mu,\sigma^2}.  (2.17)

The model with the minimum Cp is the model recommended. The whole process requires O(p^3 + np^2) computations, which is of the same order as ordinary least squares regression.
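Combining the degrees-of-freedom result of Zou, Hastie, and Tibshirani (2005), which estimates df by the number of non-zero coefficients, with Eq. 2.17 gives a simple rule for choosing a point on the path. The sketch below assumes X, y and coefs from the lars_path example above and estimates σ² from the full OLS fit; the helper name is ours.

```python
import numpy as np

def cp_statistic(X, y, coefs, sigma2):
    """Cp of Eq. 2.17 for each column of coefs, using df = number of nonzeros."""
    cps = []
    for k in range(coefs.shape[1]):
        beta = coefs[:, k]
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)
        cps.append(rss / sigma2 - len(y) + 2 * df)
    return np.array(cps)

beta_ols, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.sum((y - X @ beta_ols) ** 2) / (len(y) - X.shape[1])   # sigma^2 from the full OLS fit
best_step = int(np.argmin(cp_statistic(X, y, coefs, sigma2)))
print("step with smallest Cp:", best_step)
```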

With another single but important modification, the LASSO-modified LARS can give the LASSO solution under the constraints \sum_j|\beta_j| \le t and \beta_j \ge 0. This is called the positive LASSO, and our approach for mixture model selection is based on it. The idea was introduced in Efron et al. (2004), but they did not give any application examples. The right panel of Figure 2.3 shows the positive LASSO solution for the diabetes data; predictors having negative correlations with the response do not enter the model. The positive LASSO solution usually does not converge to the OLS solution. However, in our setting the parameters naturally satisfy this positivity constraint, since all regression coefficients correspond to component weights, which must be positive as Equation 2.2 implies.

Figure 2.3: Regression coefficients for the diabetes data. The left panel shows the LASSO estimates and the right panel the positive LASSO estimates.

2.3 Lassoing Mixtures

When the component density parameters θ are known, as Eq. 2.3 shows, we can treat the mixture model order selection problem as a variable selection problem in linear regression and apply the positive LASSO approach introduced in the previous section, with the additional constraint \sum_j \beta_j = 1. Once the component number k of the mixture model is decided, we can use the EM algorithm to estimate the component density parameters. Our lassoing algorithm iterates these two steps until convergence.

2.3.1 Lassoing Mixture Algorithm

Suppose we want to fit a finite normal mixture distribution to the data with n samples. The

fitting algorithm is described in Algorithm 5.

Algorithm 5 Lassoing Mixture
  Fit a saturated mixture model \tilde f and let \tilde y_i = \tilde f(x_i).
  With initial values \theta^0, construct the design matrix X_{ij} = N(x_i|\theta_j^0).
  repeat
    \hat\pi = \arg\min_\pi \sum_{i=1}^{n}(\tilde y_i - \sum_j X_{ij}\pi_j)^2 subject to \sum_j \pi_j = 1 and \pi_j \ge 0.
    Delete the redundant components with weight zero.
    Update the component parameters using EM with fixed weights.
    X_{ij} = N(x_i|\theta_j^{m+1}).
  until the parameters converge.

A final full EM fit is suggested to optimize the location and scale parameters.
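The following Python sketch illustrates the flavour of Algorithm 5 for a univariate normal mixture with a common variance. It is a simplification, not the implementation used in this thesis: the simplex-constrained positive-lasso step is approximated by non-negative least squares with a heavily weighted sum-to-one row (weight rho), which also tends to zero out redundant components, and all function and variable names are ours.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import norm

def lassoing_mixture_sketch(x, n_iter=50, tol=1e-6, rho=1e3):
    """Simplified sketch of Algorithm 5 for a univariate normal mixture."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    mu = x.copy()                                  # saturated model: one component per point
    sigma = x.std() / np.sqrt(n)                   # heuristic initial common sd
    # pseudo-response: saturated (equal-weight) mixture density at the data points
    y_tilde = norm.pdf(x[:, None], loc=x[None, :], scale=sigma).mean(axis=1)
    for _ in range(n_iter):
        D = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma)   # design matrix
        A = np.vstack([D, rho * np.ones((1, D.shape[1]))])       # append sum-to-one row
        b = np.concatenate([y_tilde, [rho]])
        pi, _ = nnls(A, b)                         # non-negative weights, roughly summing to one
        keep = pi > 1e-8
        mu, pi = mu[keep], pi[keep] / pi[keep].sum()
        # EM M-step with the weights held fixed
        dens = pi * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        sigma = np.sqrt((resp * (x[:, None] - mu_new) ** 2).sum() / n)
        shift = np.max(np.abs(mu_new - mu))
        mu = mu_new
        if shift < tol:
            break
    return pi, mu, sigma

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
print(lassoing_mixture_sketch(data))               # typically settles near two components
```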

We start with the most saturated model, an n-component normal mixture. The contrived response is the cumulative density or probability density of the fitted n-component normal mixture evaluated at the data points. To avoid the unboundedness of the likelihood function, we assume the component models have a homogeneous covariance. We use the ordered data as the initial corresponding location parameters, and the sample variance divided by n as the initial common variance parameter. The entries of the design matrix X are simply the CDF/PDF of each component distribution at the data points; for example, X_{ij} is the CDF/PDF of the jth component distribution N(x|\mu_j, \sigma_j^2) evaluated at the ith data point.

Using the ordered data as the initial values for the corresponding location parameters of an n-component normal mixture is a reasonable choice, since it implies that each data point comes from its own component distribution. However, the initial value of the common variance cannot be determined as easily. In practice, we find that too large an initial value tends to produce solutions that are too sparse, while too small an initial value gives a model with redundant components. Our choice of the sample variance divided by n is a heuristic based on our experience. Moreover, we find that the model selection result is very robust to the choice of the saturated model: a mixture with a sufficiently large number of components can be used as the saturated model instead of the n-component mixture.

We use two simulated data sets from simulation design 2, presented later in the paper, to show the parameter traces of our algorithm. Each data set consists of 100 observations from a 2-component normal mixture with homogeneous unit variance, equal weights, and location parameters µ = (0, 3). Figure 2.4 shows the histograms with kernel density curves for these two data sets. Figures 2.5 and 2.6 are the parameter trace plots. Both analyses start with a 100-component normal mixture, using the ordered data as the initial mean vector and \sigma_{data}/\sqrt{100} as the initial common standard deviation. Because the component number drops sharply from 100 to a small number at the first iteration, we only plot the trace from the second step until convergence.

Figure 2.4: a) Histogram of simulated data set 1 with the kernel density curve. b) Histogram of simulated data set 2 with the kernel density curve.

From the trace plots, we see that the parameters π and µ jump around at the beginning. As the component number k decreases, the other parameter estimates gradually converge to stationary values. Finally, our algorithm settles on the correct number of 2 final components. Interestingly, components with larger weights at early iterations are not guaranteed to survive in the end.

Figure 2.5: Trace plots for simulated data set 1. a) Trace of the component number k. b) Trace of the common standard deviation σ. c) Trace of the weight vector π. d) Trace of the mean vector µ.

2.3.2 Connection with Other Methods

The idea of using penalized optimization for mixture models has been suggested by other statisticians. In this section, we want to show the relationship between our approach and these methods.

Figure 2.6: Trace plots for simulated data set 2. a) Trace of the component number k. b) Trace of the common standard deviation σ. c) Trace of the weight vector π. d) Trace of the mean vector µ.

The modified likelihood ratio test suggested by Chen et al. (2001) has been popular because it is distribution-free and asymptotically locally most powerful. It adds a penalty term on the mixture weights to the likelihood function and solves the penalized likelihood optimization

(\hat\pi, \hat\theta) = \arg\max_{\pi,\theta}\left\{\sum_{i=1}^{n}\log f(x_i|\pi, \theta) + C\sum_{j=1}^{k}\log(\pi_j)\right\}.  (2.18)

The LASSO solves the constrained optimization problem

\hat\pi^{LASSO} = \arg\min_\pi\left\{\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\pi_j\Big)^2 + \lambda\sum_{j=1}^{p}|\pi_j|\right\},  (2.19)

which is equivalent in spirit to Equation 2.18, because in the regression setting the ordinary least squares estimate is the maximum likelihood estimate:

\hat\beta^{ols} = \arg\min_\beta\left\{\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2\right\} = \arg\max_\beta\{\log f(Y|X\beta)\}.  (2.20)

Chen et al. (2001) also discussed the Bayesian view of their method: the penalty term can be interpreted as a prior distribution on π. This parallels the Bayesian interpretation of the LASSO, which corresponds to a double-exponential prior.

The general idea of generating a pseudo-response for non-regression problems and then employing linear regression modelling has also been suggested by other researchers. Zou, Hastie, and Tibshirani (2006) introduced sparse principal component analysis (SPCA), in which the ordinary principal components serve as contrived responses and the LASSO (elastic net) is used to produce modified principal components with sparse loadings.

2.4 Some Theoretical Results

In this section, we aim to show that when the component number k of a mixture model is known, the OLS + fixed-weight EM algorithm provides a maximum likelihood estimator (MLE). Our approach is then a natural extension to the situation in which k is unknown.

2.4.1 Lassoing mixtures with unknown number of components is a penalized maximum likelihood estimator

First, we restrict the problem to the situation where the mixture component number k is known. The common approach is to fit the model with EM to obtain the maximum likelihood estimates (MLE) of the other parameters. We want to show that we can also obtain the MLE with the following Algorithm 6, which iterates between the OLS estimator and EM with fixed weights.

Algorithm 6 OLS + EM
  Fit a saturated mixture model \tilde f and let \tilde y_i = \tilde f(x_i).
  With initial values \theta^0, construct the design matrix X_{ij} = N(x_i|\theta_j^0).
  repeat
    \hat\pi = \arg\min_\pi \sum_{i=1}^{n}(\tilde y_i - \sum_{j=1}^{k} X_{ij}\pi_j)^2.
    Update the component parameters using EM with the fixed weights.
    X_{ij} = N(x_i|\theta_j^{m+1}).
  until the parameters converge.

The following Lemma 1, Lemma 2, and Theorem 2 show that the likelihood increases at each step of Algorithm 6; thus, the final estimates are MLE.

Lemma 1. Suppose in the finite mixture model estimation, the number of components k is known and πˆ is the EM estimate for the weight vector π, then πˆ is unbiased.

Lemma 2. Suppose in the finite mixture model estimation, the number of components k is known, the component parameters θ are known, and only the weight vector π is unknown. Construct the design matrix X and response vector \tilde Y as in Algorithms 5 and 6, and estimate π with \hat\pi^{ols} = \arg\min_\pi(\tilde Y - X\pi)^2. Then \hat\pi^{ols} gives the mixture model with the largest likelihood among all unbiased estimates of π.

Theorem 2. Suppose in the finite mixture model estimation, the component number k is known. With some initial parameter values, the estimates by Algorithm 6 are MLE.

When the mixture component number k is unknown, a general extension of Algorithm 6 is

Algorithm 5, in which we replace the OLS estimator with the positive LASSO and start from the k = n model, which is the largest possible saturated model. Setting k = n allows the lassoing mixture approach to be cast as a penalized MLE (PMLE) with penalty imposed on the mixture weights by the L-1 parameter constraints of the (positive) LASSO. Due to the sparsity property of the LASSO, redundant mixture component weights may be estimated to be 0. Thus the lassoing mixtures approach can be used in the general case when the mixture component number is unknown.

Next, we will show that the lassoed mixture estimate from Algorithm 5 is indeed consistent.

Additionally, consistency of the estimate of the number of components k can be proven by making use of the adaptive lasso algorithm of Zou (2005). Note that the general strategy of using a saturated model as a starting point for fitting the mixture model has been suggested, in some form, as far back as Laird (1978), although not in the regularized fashion presented here.

2.4.2 Consistency of the Final Estimated Mixture and Order Selection

The following Theorem 3 shows the consistency of the final lassoed mixture estimate.

Theorem 3. Let f denote the true mixture density and \hat f denote our estimate; then \hat f - f \xrightarrow{p} 0.

Remarks: Clearly, order selection for finite mixtures can be cast as a variable selection problem. However, variable selection with the LASSO is not always consistent. Meinshausen and Bühlmann (2006) and Ishwaran and Rao (2005b) showed the conflict between prediction and variable selection: the optimal λ for prediction gives inconsistent variable selection results. Meinshausen and Bühlmann (2006) derived sufficient conditions for the consistency of lasso variable selection, but these are not always satisfied. To obtain variable selection consistency in general, Zou (2005) proposed an adaptive lasso approach.

Theorem 4. [Zou (2005)] Under the model of Equation 2.3, define data-dependent weights w_j, j = 1, ..., p, for the different coefficients, and let

\hat\pi^{Ada\text{-}lasso} = \arg\min_\pi\left\{\sum_{i=1}^{n}\Big(\tilde y_i - \sum_{j=1}^{p} X_{ij}\pi_j\Big)^2 + \lambda\sum_{j=1}^{p} w_j|\pi_j|\right\}.

Then if \hat w_j = 1/|\hat\pi_j^{ols}|^{\gamma} with \gamma > 0, the estimate \hat\pi^{Ada\text{-}lasso} is consistent and achieves the optimal estimation rate.
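As a small illustration of Theorem 4, the adaptive lasso can be computed by rescaling the columns of the design matrix and solving an ordinary lasso in the rescaled variables. The sketch below reuses the shooting_lasso function sketched in Section 2.2.3; the names are ours and this is not the algorithm used in the thesis.

```python
import numpy as np

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive lasso via column rescaling: an ordinary lasso in beta* = w * beta."""
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    w = 1.0 / np.abs(beta_ols) ** gamma      # data-dependent weights w_j
    X_star = X / w                           # column j scaled by 1/w_j
    beta_star = shooting_lasso(X_star, y, lam)
    return beta_star / w                     # transform back to the original scale
```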

Ishwaran and Rao (2005b) also suggested that we could use a hard thresholding procedure for the LASSO estimates where the thresholding parameter must be a function of the sample size to achieve the consistency of LASSO variable selection.

Chapter 3

Application Examples for Lassoing

Mixture Approach

3.1 Simulation Studies

To illustrate how our method performs, we make use of the simulation design from Ishwaran

et al. (2001). There are 10 different simulations in which data are independently drawn

from a finite location normal mixture with unit variance. All experiments except Experiment

1 have uniform weights for the components. In Experiment 1, π = (1/3, 2/3). Experiments 1-3 have two components and µ = (0, 3), (0, 3), (0, 1.8) respectively. Experiments 4-6 have four components and µ = (0, 3, 6, 9), (0, 1.5, 3, 4.5), (0, 1.5, 3, 6) respectively. Experiments 7-10 have seven components and µ = (0, 3, 6, 9, 12, 15, 18), (0, 1.5, 3, 4.5, 6, 7.5, 9),

(0, 1.5, 3, 4.5, 6, 9.5, 12.5), and (0, 1.5, 3, 4.5, 9, 10.5, 12) respectively. Experiments 1-6 have sample size n = 100, while Experiments 7-10 have sample size n = 400. Each simulation is repeated 500 times. We fix the same random seed for all experiments to avoid the random

fluctuation due to changing random seeds. Figure 3.1 shows the true mixture densities in each simulation. For Experiments 3, 5, 6, 8-10, the mode number is less than the component number because of the close distance between components. We apply our lassoing mixture method to each setting and compare the selection performance to the results published in

Ishwaran et al. (2001).

Tables 3.1-3.3 present the results of our method using the CDF and PDF pseudo-responses, together with the results for the AIC, BIC, and GWCR algorithms (Ishwaran et al. 2001). In Experiments 1-2, Lassoing recognizes the 2-component mixture, though not as consistently as AIC, BIC, and GWCR. In Experiment 3, Lassoing discovers the true 2 components, while the other methods tend to select 1 component. In Experiment 4, AIC and Lassoing PDF are the winners, and the Lassoing CDF method uncovers the true model about 20% of the time. In Experiments 5-6, the Lassoing methods tend to find the correct number of components more frequently than the other methods. In Experiments 7-10, the Lassoing methods are better than BIC. A proportion of the Lassoing results have large estimates for the component number k; one possible reason is that we start from the largest possible model k = n. Lassoing PDF tends to give larger estimates than Lassoing CDF, which may be attributable to their different design matrices.

Figure 3.1: Mixture densities of the simulations (Experiments 1-10).

Exp n d m k AIC BIC GWCR Lassoing

CDF PDF

1 100 2 2 1 0.018 0.150 0.018 0.050 0.074

2 0.896 0.838 0.920 0.584 0.432

3 0.062 0.012 0.058 0.186 0.220

4 0.024 0.000 0.004 0.068 0.136

5 0.000 0.000 0.000 0.046 0.062

≥ 6 0.000 0.000 0.000 0.066 0.076

2 100 2 2 1 0.022 0.212 0.030 0.022 0.016

2 0.900 0.780 0.916 0.510 0.386

3 0.050 0.006 0.054 0.232 0.238

4 0.028 0.002 0.000 0.132 0.172

5 0.000 0.000 0.000 0.026 0.084

≥ 6 0.000 0.000 0.000 0.078 0.104

3 100 2 1 1 0.702 0.968 0.868 0.106 0.194

2 0.264 0.030 0.130 0.572 0.416

3 0.024 0.002 0.002 0.186 0.184

4 0.000 0.000 0.000 0.044 0.096

5 0.000 0.000 0.000 0.016 0.048

≥ 6 0.000 0.000 0.000 0.076 0.062

Table 3.1: Results of Simulations 1-3: Entries in the last five columns are the percentage of times out of the 500 samples for which the component number estimate equals a candidate dimension value k. Percentages highlighted by boxes indicate the highest value and thus represent the best model for a specific procedure.

Exp n d m k AIC BIC GWCR Lassoing

CDF PDF

4 100 4 4 1 0.000 0.110 0.000 0.014 0.034

2 0.178 0.596 0.102 0.348 0.080

3 0.110 0.110 0.554 0.198 0.090

4 0.674 0.182 0.306 0.194 0.268

5 0.038 0.002 0.038 0.078 0.182

6 0.000 0.000 0.000 0.050 0.108

≥ 7 0.000 0.000 0.000 0.118 0.238

5 100 4 1 1 0.244 0.748 0.144 0.028 0.060

2 0.556 0.246 0.818 0.494 0.312

3 0.142 0.004 0.032 0.230 0.274

4 0.044 0.002 0.006 0.122 0.142

5 0.014 0.000 0.000 0.040 0.086

6 0.000 0.000 0.000 0.018 0.060

≥ 7 0.000 0.000 0.000 0.068 0.066

6 100 4 2 1 0.016 0.188 0.000 0.022 0.060

2 0.474 0.698 0.612 0.476 0.216

3 0.392 0.106 0.368 0.208 0.280

4 0.102 0.008 0.020 0.106 0.184

5 0.014 0.000 0.000 0.054 0.120

6 0.000 0.000 0.000 0.036 0.062

≥ 7 0.002 0.000 0.000 0.098 0.078

Table 3.2: Results of Simulations 4-6: Format and methods used are similar to that described in Table 3.1.

Exp  n    d  m    k     AIC    BIC    GWCR   Lassoing CDF   Lassoing PDF
7    400  7  7    1     0.004  0.816  0.000  0.008          0.026
                  2     0.000  0.000  0.000  0.260          0.218
                  3     0.000  0.000  0.010  0.228          0.232
                  4     0.302  0.168  0.188  0.234          0.206
                  5     0.212  0.016  0.424  0.160          0.086
                  6     0.098  0.000  0.178  0.062          0.054
                  7     0.326  0.000  0.114  0.032          0.116
                  8     0.036  0.000  0.056  0.006          0.042
                  9     0.022  0.000  0.030  0.010          0.012
                  ≥ 10  0.000  0.000  0.000  0.000          0.008
8    400  7  1    1     0.030  0.538  0.000  0.002          0.014
                  2     0.684  0.462  0.078  0.354          0.282
                  3     0.000  0.000  0.590  0.252          0.326
                  4     0.248  0.000  0.272  0.234          0.246
                  5     0.000  0.000  0.048  0.096          0.094
                  6     0.012  0.000  0.008  0.036          0.026
                  7     0.024  0.000  0.004  0.016          0.010
                  8     0.002  0.000  0.000  0.010          0.002
9    400  7  2+1  1     0.002  0.458  0.000  0.014          0.101
                  2     0.000  0.000  0.002  0.384          0.334
                  3     0.144  0.398  0.120  0.220          0.268
                  4     0.460  0.138  0.408  0.196          0.170
                  5     0.308  0.006  0.312  0.106          0.138
                  6     0.048  0.000  0.128  0.046          0.058
                  7     0.016  0.000  0.024  0.020          0.014
                  8     0.022  0.000  0.006  0.010          0.006
                  9     0.000  0.000  0.000  0.004          0.002
10   400  7  1+1  1     0.000  0.000  0.000  0.010          0.014
                  2     0.496  0.992  0.020  0.292          0.232
                  3     0.000  0.000  0.370  0.220          0.310
                  4     0.302  0.006  0.466  0.256          0.242
                  5     0.118  0.002  0.128  0.112          0.150
                  6     0.064  0.000  0.010  0.060          0.026
                  7     0.016  0.000  0.006  0.034          0.010
                  8     0.004  0.000  0.006  0.008          0.008
                  9     0.000  0.000  0.000  0.008          0.002
                  ≥ 10  0.000  0.000  0.000  0.000          0.006

Table 3.3: Results of Simulations 7-10: Format and methods used are similar to that described in Table 3.1.

3.2 Some Classic Real Data Sets

We use three well-known real data sets to illustrate our methods. The first is the Galaxy data, first presented by Postman, Huchra, and Geller (1986). It consists of the velocities of 82 galaxies in the Corona Borealis region. These galaxies are moving away from our galaxy from six conic sections of space at different speeds, and the multimodal distribution of the velocities suggests that there are superclusters of galaxies, each moving at its own speed. A histogram and kernel density plot of the data is shown in the left panel of Figure 3.2. This data set has been analyzed using finite mixture models by several researchers including Roeder (1990), Chib (1995), Carlin and Chib (1995), Richardson and Green (1997), Roeder and Wasserman (1997), Stephens (2000), and Ishwaran et al. (2001). A review of these papers suggests that 3-7 modes are plausible.

We apply our lassoing approach with the pseudo-response constructed from the PDF or the CDF. The component distributions have homogeneous variance. The results are summarized in Table 3.4. Both analyses select 3-component normal mixtures with slightly different parameters. We suggest using EM with the selected component number to optimize the location and scale parameters.

The second data set is the Hidalgo stamp data. It consists of the thickness measurements of 485 Hidalgo postage stamps of Mexico. A histogram and kernel density plot of the data is shown in the right panel of Figure 3.2. There are historical certificates indicating that these stamps were printed on several different types of paper. Izenman and Sommer (1988) fitted a normal mixture model with unequal variances to the data and obtained a component number equal to 3. However, they also used a non-parametric kernel density estimation technique, Silverman's test, to determine the most plausible number of modes in the underlying density, which yielded 7 modes. Basford et al. (1997) explained the conflict by applying a normal mixture model with equal variances to the data and showed that 7 is a feasible choice for the component number. Efron and Tibshirani (1993) used the bootstrap to examine the number of modes in the data; they argued that a 2-component normal mixture model could not be rejected, whereas the component number might also be 7. Ishwaran et al. (2001) applied their GWCR algorithm to the data and obtained an 8-component model.

Figure 3.2: Kernel density plots of some real data sets. Left: histogram of the Galaxy data with the kernel density curve. Right: histogram of the Stamp data with the kernel density curve.

A summary of the detailed results from our lassoing approach is given in Table 3.5. For models with homogeneous variance, lassoing-CDF identifies a 3-component normal mixture and lassoing-PDF selects a 2-component normal mixture.

                  Lassoing (CDF)                  Lassoing (PDF)
  π               (0.111, 0.844, 0.044)           (0.234, 0.746, 0.020)
  µ               (9.765, 21.403, 32.918)         (9.850, 21.413, 32.984)
  σ²              2.080                           2.082
  LogL            −213.0364                       −219.0603

Table 3.4: Results of fitting the Galaxy data using lassoing mixtures.

                  Lassoing (CDF)                  Lassoing (PDF)
  π               (0.703, 0.227, 0.070)           (0.812, 0.188)
  µ               (0.077, 0.102, 0.119)           (0.078, 0.107)
  σ²              0.0058                          0.0074
  LogL            1474.565                        1430.054

Table 3.5: Results of fitting the Stamp data using lassoing mixtures.

The third data set we use is the Old Faithful data, which consists of 272 eruptions of the Old Faithful geyser in Yellowstone National Park. Each observation has two variables: the duration time and the waiting time before the next eruption. A scatter plot of the data is shown in Figure 3.3, which indicates the possibility of 2 or more groups. Stephens (2000) analyzed these data and concluded that 2 or 3 modes are most plausible, with the posterior distribution of the component number depending heavily on the prior. Sahu and Cheng (2002) also asserted that the hypothesis of a component number equal to 3 could not be rejected.

We assume that the component covariance matrices are homogeneous. Table 3.6 shows that both lassoing-CDF and lassoing-PDF approaches select 2-component normal mixtures with slightly different parameters.

Figure 3.3: Scatter plot of the Old Faithful data.

3.3 Extended example: Analysis of proteomic mass spectroscopy data for ovarian cancer classification

Microarray technology is widely used because it can measure the expression of thousands of genes in a sample at the same time. However, it has been argued that proteins are closer to the actual biological functions of cells than mRNA or DNA, so protein biomarkers of a disease should offer more information about the disease than genetic biomarkers (Wu et al., 2003). Protein mass spectrometry is a newer technique for analyzing protein expression. It produces the mass/charge ratio (m/z) spectra of the proteins of interest with high resolution. Figure 3.4 is a zoomed plot of a short interval of the spectra. The x axis of the spectra is the protein mass divided by the number of charges introduced by ionization, and the y axis is the protein intensity at the corresponding x value. This analysis can be conducted on thousands of proteins over a large number of samples simultaneously and can be used to detect quantitative or qualitative changes between samples. An important application in early cancer detection is to classify and predict cancer on the basis of protein spectra.

  Lassoing (CDF):  π = (0.381, 0.619)
                   µ₁ = (2.047, 54.602)ᵀ, µ₂ = (4.296, 80.039)ᵀ
                   Σ = [0.133 0.752; 0.752 35.174]
                   LogL = −1140.467
  Lassoing (PDF):  π = (0.431, 0.569)
                   µ₁ = (2.048, 54.615)ᵀ, µ₂ = (4.297, 80.046)ᵀ
                   Σ = [0.133 0.752; 0.752 35.179]
                   LogL = −1143.037

Table 3.6: Results of fitting the Old Faithful data using lassoing mixtures.

The ovarian cancer data set was produced by the Keck Laboratory at Yale University. It consists of MALDI-MS spectra of 47 patients and 42 normal persons. Each spectrum includes

91360 measurements spaced 0.019 Da apart. It has been analyzed by Wu et al. (2003) and

Tibshirani et al. (2004).

Figure 3.4: Protein mass spectrometry sample plots. The top two spectra are from the cancer samples and the bottom two spectra are from the control samples.

First, since each spectrum consists of intensity measurements at ordered grid points, it is natural to use a density curve to smooth the data and construct the classifier. Because of the wide span of the measurements (800 Da-3500 Da), it is inappropriate to use only one density curve to fit the data. We split the data in order into M pieces of width L Da each, and construct a classifier for each piece. Let G_m(x) denote the classifier for the mth piece, m = 1, 2, ..., M, and let C ∈ {−1, 1} denote the patient and control classes. The predictions from all of the classifiers are then combined through a majority vote to produce the final prediction,

G(x) = \mathrm{sign}\Big(\sum_{m=1}^{M} G_m(x)\Big).  (3.1)

For each piece, we fit finite normal mixture densities to both the patient data and the control data. An observation is assigned to the class with the smaller distance:

G_m = -1 if \sum (Y_m - \hat F_{m,p}(x))^2 \le \sum (Y_m - \hat F_{m,c}(x))^2, and G_m = 1 otherwise.  (3.2)

Here Y_m is the normalized intensity (density) vector of the mth piece, and \hat F_{m,p}(x) and \hat F_{m,c}(x) are the predicted densities for the mth piece from the mixture models for the patients and controls, respectively.
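A schematic version of the classification rule in Eqs. 3.1 and 3.2 is sketched below; the inputs are hypothetical lists of aligned per-piece vectors (normalized intensities and the two fitted mixture densities on a common grid), and the names are ours.

```python
import numpy as np

def classify_spectrum(Y_pieces, F_patient, F_control):
    """Combine piecewise decisions (Eq. 3.2) by majority vote (Eq. 3.1)."""
    votes = []
    for Ym, Fp, Fc in zip(Y_pieces, F_patient, F_control):
        d_patient = np.sum((Ym - Fp) ** 2)
        d_control = np.sum((Ym - Fc) ** 2)
        votes.append(-1 if d_patient <= d_control else 1)
    return int(np.sign(np.sum(votes)))   # -1: patient class, 1: control class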

To fit the finite normal mixture models, we use the data points in a window of the piece width ±m/2 Da. The original intensities are normalized to make them a good approximation of the densities at the m/z values; in other words, \int Y_m(x)\,dx \approx \sum_i Y_m(x_i)(x_{i+1} - x_i) = 1. The common standard deviation of the component densities is fixed at 0.2, based on some preliminary studies, to keep the mixture component number in the interval 1-10. The mixture component number and the location parameters of the component densities are selected by our lassoing approach. Because of the optimization complexity, we do not update our estimates by iteration. Instead, we use a sequence of finely gridded location candidates to reduce the potential bias. We construct the candidate component densities with ordered location parameters spanning the corresponding piece from beginning to end with spacing 0.01, so the precision of the location parameter estimates is 0.01.

We use 5-fold cross-validation to choose the piece width L; the results are summarized in Table 3.7. Window width 5 Da has the best performance, with a cross-validation error rate of 26/89. Further improvement in prediction accuracy might be achieved by biomarker selection and background noise detection. The peak probability contrasts (PPC) approach was proposed by Tibshirani et al. (2004); it extracts a common set of peaks from the individual spectra and applies the nearest shrunken centroid classifier to the extracted features. They used the same ovarian cancer data as an example and showed that the ten-fold cross-validated misclassification rate is 23/89-30/89, depending on the options in the algorithm.

Width (Da) 2 3 4 5 6

Cross Validation Errors/89 35 31 35 26 31

Table 3.7: Results for 5-fold cross validation with window width 2-6 Da

These results are comparable with those of our method.

We also tried randomly selecting 60 patients (30 from each group) as the training sample and predicting disease status for the 29 remaining patients. From Figure 3.5, we can see that the prediction error rate is around 33 percent.

Figure 3.5: Protein mass spectrometry sample prediction error plot.

Instead of using only extracted peak measurements, we use all of the information to construct the classification model. The reason is that a biomarker with a large relative intensity measurement does not always behave differently in patients and controls; it may have large measurements in all spectra and hence not be a useful feature for classification. Our method uses mixture models, so it can provide pattern features that differentiate patients from controls. The common peaks extracted by the PPC method represent a discrete re-expression of the pattern features we find, and they do not contain the discriminative information in other parts of the spectra.

Interestingly, if we plot the mixture component locations versus the corresponding probabilities, as in Figure 3.6, we obtain a "peak" centroid plot. Each component of the mixture model represents a "peak", which does not have a one-to-one correspondence with the real peaks in the spectra. Usually there are more components ("peaks") than real peaks, since several mixture components may be needed to capture one peak feature. We see that the patient model and the control model share some components (peaks), though with different weights, and there are some components (peaks) that distinguish the two groups.

Figure 3.6: Mixture component locations vs. probability plot.

Chapter 4

Discussion

We propose a new method for mixture model estimation, especially for the situation when the number of components is unknown. By generating a pseudo-response and a candidate design matrix, we treat the mixture model estimation problem as a variable selection problem in linear regression. We show that our lassoing method is competitive with other methods such as GWCR, AIC, and BIC. Our method has a computational efficiency advantage, since we do not perform sequential model construction but instead use soft thresholding via a variant of the LARS algorithm to select an appropriate model, which has the same computational order as OLS estimation.

Multivariate mixture models are usually difficult for other methods to handle, because of the model complexity and the heavy computational load of calculating the supremum of a multivariate density or drawing posterior samples from a multivariate parameter space. As we noted, only a few papers, such as Stephens (2000), included multivariate data application examples. With our approach, the linear relationship among the densities is the same as in univariate mixture models; the only additional computation is the EM algorithm for the multivariate densities.

Part II

Bayesian Robust Estimation

Chapter 5

Literature Review for Bayesian Robust

Estimation

5.1 Introduction

When suspicious heterogeneity appears in the data, i.e. when some observations may come from distributions different from the assumed one, we need special robust estimation and detection approaches to deal with the problem. Robust estimation and outlier detection are two faces of the same coin. When a method focuses on parameter estimation and does not specify which observations are suspicious, it is called robust estimation. When the main purpose of an approach is to detect observations that do not fit the model, we call it an outlier detection method.

Since Box and Tiao's (1968) seminal paper, robust estimation and outlier detection have been studied by many Bayesian statisticians. Bayesian methods for robust estimation and outlier detection can be classified into two groups, depending on whether or not a generating model for the outliers is specified. Diagnostic methods propose a null model for the data generation excluding the outliers; these include the predictive distribution used by Geisser (1985) and Pettit and Smith (1985), and the parameter posterior distribution used by Johnson and Geisser (1983), Chaloner and Brant (1988), and Guttman and Pena (1993). Robust methods propose a generating model for the entire data set including the possible outliers, as in the work of Box and Tiao (1968), Abraham and Box (1978), and Sharples (1990).

Most studies involve robust estimation and outlier detection in linear regression. Suppose we have n i.i.d. observations with p predictors. Let Y denote the response and X the full-rank design matrix, let β be the set of p unknown regression parameters, and let ε = (ε_1, ..., ε_n)^T be independent normally distributed random errors. The common linear model is

Y = Xβ + ε.  (5.1)

The common non-informative priors assume p(ε) ∼ N(0, σ²I_n) and p(β, τ) = τ^{-1}, where τ = 1/σ². In the following sections, we introduce several Bayesian robust estimation and outlier detection methods for linear regression.

5.1.1 Diagnostic Methods

To check the compatibility of one set of observations with the rest of the sample, a commonly used diagnostic tool is the predictive distribution p(y_I|y_{(I)}), where y_{(I)} is the sample without the set of data y_I,

p(y_I|y_{(I)}) = \int p(y_I|\theta)\,p(\theta|y_{(I)})\,d\theta.  (5.2)

Let the set size be k. With the non-informative prior,

p(y_I|y_{(I)}) = K\,|I - H_I|^{1/2}(1 + Q_I)^{-(n-p-k)/2},

where K = \left[\frac{(n-p)s^2}{n-p-k}\right]^{-k/2}\frac{\Gamma((n-p)/2)}{\Gamma(1/2)^k\,\Gamma((n-p-k)/2)\,(n-p-k)^{k/2}}, H_I is the hat matrix, and

Q_I = \frac{(y_I - X_I\hat\beta_{(I)})'(I - H_I)(y_I - X_I\hat\beta_{(I)})}{(n-p-k)s^2_{(I)}} = \frac{e_I'(I - H_I)^{-1}e_I}{(n-p-k)s^2_{(I)}}.

The larger the studentized residual Q_I, the smaller p(y_I|y_{(I)}). Thus Geisser (1985) and Pettit and Smith (1985) viewed the observations with the lowest p(y_I|y_{(I)}) as potential outliers.

Another approach, suggested by Guttman and Pena (1988, 1993), is to study the change in the posterior distribution of the regression parameters when some data points are removed. The posterior distributions of the parameters based on the remaining data y_{(I)} are denoted p(β|y_{(I)}) and p(σ|y_{(I)}). Guttman and Pena used the symmetric Kullback-Leibler distance J(f_1, f_2) = I(f_1, f_2) + I(f_2, f_1) to measure the change in the posterior distributions, where the Kullback-Leibler information between two distributions is defined as

I(f_1, f_2) = E_{f_1}[\ln(f_1/f_2)] = \int f_1 \ln\frac{f_1}{f_2},

in which f_1 and f_2 are two probability density functions (PDFs). Guttman and Pena showed that the change in the marginal posterior of the regression parameter β is

J(p(β|y), p(β|y_{(i)})) = \frac{p}{2}(D^2 + D_{(i)}^2) + \frac{1}{2}\frac{s_{(i)}^2}{s^2}\,\mathrm{tr}\big[(X'X)(X_{(i)}'X_{(i)})^{-1}\big] + \frac{1}{2}\frac{s^2}{s_{(i)}^2}\,\mathrm{tr}\big[(X'X)^{-1}(X_{(i)}'X_{(i)})\big] - p,  (5.3)

where D^2 is Cook's distance, which has been used as a measure of the influence of y_{(i)}. Hence, it offers a straightforward diagnosis of influence. When there is only one outlier, the change in the marginal posterior of σ² is

J(p(σ^2|y), p(σ^2|y_{(i)})) \approx \frac{1}{2}(t_i^2 - r_i^2),

where r_i^2 is the standardized residual and t_i^2 is the studentized residual. This divergence provides a measure of "outlyingness". The change in the joint posterior when there is only one outlier is a summation of these two:

J(p(β, σ^2|y), p(β, σ^2|y_{(i)})) = \frac{1}{2}(t_i^2 - r_i^2) + \frac{p}{2}\left[\frac{s^2}{s_{(i)}^2}D^2 + \frac{s_{(i)}^2}{s^2}D_{(i)}^2\right] + \frac{1}{2}\frac{h_i}{1 - h_i}.

Using the non-informative prior for the linear model 5.1, Chaloner and Brant (1988) suggested an approach based on the posterior probabilities p_i = Pr(|ε_i| > kσ | y, X), where ε_i = y_i − x_i'β are the unobserved residuals. They defined the ith observation to be an outlier if |ε_i| > kσ with high probability. Points with a high posterior probability p_i of being an outlier will have a large |\hat ε_i| or a large h_{ii}, or both. The p_i's are compared with the prior probability 2Φ(−k). The value of k is chosen to make the prior probability of no outlier large, say 0.95, giving k = Φ^{-1}(0.5 + 0.5 \cdot 0.95^{1/n}). This posterior probability diagnostic measure is appropriate for situations in which there is no obvious way of modelling the contaminants.
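For reference, the threshold k and the comparison value 2Φ(−k) are straightforward to compute; the small sketch below (function name ours) uses scipy.

```python
from scipy.stats import norm

def chaloner_brant_k(n, prior_no_outlier=0.95):
    """k chosen so the prior probability of no outlier among n observations
    equals prior_no_outlier, i.e. k = Phi^{-1}(0.5 + 0.5 * 0.95^{1/n})."""
    return norm.ppf(0.5 + 0.5 * prior_no_outlier ** (1.0 / n))

k = chaloner_brant_k(21)          # threshold for a sample of n = 21
print(k, 2 * norm.cdf(-k))        # k and the prior probability 2*Phi(-k)
```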

5.1.2 Robust Methods

Box and Tiao (1968) suggested the variance-inflation model, in which ε has a normal mixture distribution: ε ∼ (1 − α)N(0, σ²) + αN(0, k²σ²). Let a_{(r)} denote the event that a particular set of r of the n ε's come from N(0, k²σ²) and the remaining n − r from N(0, σ²). There are 2^n possible combinations of ε's. They showed that the marginal posterior distribution of β is a weighted average of the 2^n conditional distributions,

p(β|y) = \sum_{(r)} w_{(r)}\,p(β|a_{(r)}, y),

where p(β|a_{(r)}, y) is the conditional posterior distribution of β given that a particular combination of r of the ε's come from N(0, k²σ²) and the rest from N(0, σ²); this conditional posterior is a p-dimensional multivariate t-distribution. Let p^{(r)} be the prior probability of the event a_{(r)}; then

w_{(r)} = p(a_{(r)}|y) = \frac{p^{(r)}\,h(y_{(r)} \sim g;\, y_{(s)} \sim f)}{\sum_{(r)} p^{(r)}\,h(y_{(r)} \sim g;\, y_{(s)} \sim f)}.

For computational efficiency, they arranged the summation over the 2^n possibilities as

p(β|y) = w_0 p_0(β|y) + \sum_{i=1}^{n} w_i p_i(β|y) + \sum_{i<j} w_{ij} p_{ij}(β|y) + \sum_{i<j<t} w_{ijt} p_{ijt}(β|y) + ....  (5.4)

The objective of Box and Tiao (1968) was to estimate the posterior of β, in other words, to make inferences about β when the possibility of spurious observations is real.

Guttman (1973) introduced the mean-shift model and focused on detecting the occurrence of a spurious observation.

Suppose ε ∼ N(0, σ2), but one of these observations may come from N(a, σ2). Using the non-informative prior for β, σ2, the posterior distribution of a is a weighted combination of generalized t distributions. They defined the posterior odds in favor of a exceeding 0 as

γ = P (a > 0|y1, . . . , yn)/P (a < 0|y1, . . . , yn).

If γ is close to 1, the decision that "a = 0" should be made. If γ has an extreme value, say greater than or equal to 5 or less than or equal to 1/5, then the decision that spuriosity has occurred is clearly called for, and the observation with the largest weight should be deemed an outlier.

Abraham and Box (1978) used the mean-shift model to make inferences about β:

y = Xβ + δZ + ε,

where Z = (Z_1, Z_2, ..., Z_n)' is a vector of r ones and (n − r) zeros. The posterior distribution of β|y is a weighted average of multivariate t-distributions, with the weights being the posterior probabilities of a particular set of observations being spurious,

P(β|y) = w_0 P(β|y, Z = 0) + \sum_{i=1}^{n} w_i P_i(β|y, Z = U_1) + \sum_{i<j} w_{ij} P_{ij}(β|y, Z = U_2) + ....

Sharples (1990) generalized the robustness considerations to hierarchical models, and showed that the variance-inflation model can be incorporated easily into general normal linear hierarchical models while retaining tractability of the analysis. To avoid a computational explosion, only a simple situation was considered, with at most one outlier per group and at most one outlying group. The posterior distribution of the parameters of interest is a weighted mean over the possible events, and the weights provide a measure of the occurrence of outlyingness.

Guttman and Pena (1993) showed that with the mean-shift model, the probability that the k observations indexed by I are spuriously generated is

C_I = K\,[s_{(I)}^2]^{-(n-p-k)/2}\,|I - H_I|^{-1/2},

where K is the constant that makes the sum over all sets I of size k equal to 1. For the variance-inflation model, Pena and Tiao (1992) showed that

P_I = C\,|I - H_I|^{-1/2}\,[s_{(I)}^2]^{-(n-p)/2}.

For large n and small k, the probabilities C_I and P_I are essentially the same (Pena & Guttman, 1993). It was also shown that there is a strong relationship between C_I, P_I and the predictive density 5.2, namely

C_I = K\,[p(y_I|y_{(I)})]^{-1}.

With the advent of modern MCMC methods, Bayesian statisticians have powerful computing tools to deal with problems that were previously intractable. Some approximations in the Box and Tiao model or the mean-shift model that were made for computational reasons are no longer necessary.

Verdinelli and Wasserman (1991) showed that the Gibbs sampler makes the problem of computing posterior marginals in the Bayesian analysis of outlier models conceptually clear, and computationally simple and efficient. They considered the mean-shift model in the density estimation problem, with y = (y_1, ..., y_n) sampled from the density

f(y_i|µ, σ², ε, A_i) = (1 − ε)φ(y_i|µ, σ²) + ε φ(y_i|µ + A_i, σ²),  (5.5)

and re-expressed it as

y_i|µ, σ², A, δ ∼ N(µ + δ_iA_i, σ²),

where δ = (δ_1, ..., δ_n) are independent Bernoulli trials with success probability ε. They used the standard conjugate priors for µ and σ²,

µ ∼ N(θ, υ²),
σ² ∼ Inv-χ²(ν, λ),
A_i ∼ N(0, τ²),
ε ∼ Beta(ρ_1, ρ_2).

Using y* to denote the corrected data, with y*_i = y_i − δ_iA_i, they calculated the conditional posterior distributions

f(µ|y, σ², δ, A) = N\!\left(\frac{θ/υ² + n\bar y^*/σ²}{1/υ² + n/σ²},\; (1/υ² + n/σ²)^{-1}\right),

f((ns² + νλ)/σ² \,|\, y, µ, δ, A, ε) = χ²(n + ν),

f(δ_i|y, µ, σ², A, ε) = \mathrm{Bernoulli}\!\left(\frac{φ((y_i − µ − A_i)/σ)\,ε}{φ((y_i − µ − A_i)/σ)\,ε + φ((y_i − µ)/σ)(1 − ε)}\right),

f(A_i|y, µ, σ², ε, δ_i = 1) = N\!\left(\frac{(y_i − µ)/σ²}{1/τ² + 1/σ²},\; (1/τ² + 1/σ²)^{-1}\right),

f(A_i|y, µ, σ², ε, δ_i = 0) = N(0, τ²),

f(ε|y, µ, σ², δ, A) = \mathrm{Beta}(ρ_1 + k, ρ_2 + n − k),

where k is the number of δ_i's equal to 1. Inference about µ is made with samples drawn from f(µ|y, σ², δ, A), and the posterior samples of f(δ_i|y, µ, σ², A, ε) are used to calculate the probability that each observation is an outlier.

Justel and Pena (1996) generalized Verdinelli and Wasserman's (1991) approach to regression models, and pointed out that the convergence of the Gibbs sampler may slow down when the parameters are highly correlated, and may fail because of strong masking between multiple outliers.

Chung and Kim (1999) used a mean-shift model different from that of Guttman (1973). Supposing y_i = \sum_{j=1}^{p} x_{ij}θ_j + m_i + ε_i, they assumed a normal mixture distribution for m_i by introducing a latent variable γ_i = 0 or 1,

m_i|γ_i ∼ (1 − γ_i)N(0, σ_i²) + γ_iN(0, c_i²σ_i²),

with Pr(γ_i = 1) = 1 − Pr(γ_i = 0) = p_i. By appropriately setting the parameters σ and c, if γ_i = 0 then m_i is small and can be estimated as 0, and if γ_i = 1 then m_i is large, indicating that the corresponding observation is probably an outlier. Using the stochastic search variable selection (SSVS) method (George & McCulloch, 1993), Chung and Kim (1999) calculated the posterior density f(γ|y) over the 2^n possible values of γ; f(γ|y) provides a ranking that can be used to select a subset of the most suspicious observations. Chung and Kim also discussed the extension of this method to the linear mixed model.

5.2 Summary

Some Bayesian diagnostic tools are very similar to those of frequentists. Bayesians have used Bayesian p-values or distance measures between competing models to perform diagnostics for each observation. The predictive distribution has also been suggested for outlier checking.

Robust models, which specify the generating distributions for both the common observations and the possible outliers, are more popular. Before the invention of modern MCMC computing techniques, Bayesians had already constructed the variance-inflation model and the mean-shift model and had derived the posterior inference analytically. However, because of the computational complexity, only approximations for a single outlier were carried out. With the advent of MCMC, Bayesians can calculate the posterior probability for any sets of observations of interest, and applications of Bayesian robust estimation have developed quickly.

However, these robust models assume that possible outliers come from a normal distribution with a different location parameter or a larger variance, which may not always be a good approximation to the true situation. Moreover, the choice of the location-shift constant or the variance-inflation constant can affect the final inference. In the next chapter, we introduce a newly proposed Bayesian robust estimation approach that does not specify a particular distribution for the outliers and does not require such distributional constants.

Chapter 6

Proposed Latent Variable Approach

6.1 Introduction

In statistical analysis, when not all observations follow the assumed idealized distribution, we say the data are contaminated. Classical maximum likelihood estimation does not work well in this situation, in the sense that the outliers can have a substantial influence on the estimates. More robust distributions may be used to fit the model, such as replacing the normal distribution with a t distribution for heavy-tailed data, or using the negative binomial distribution instead of the Poisson distribution for overdispersed counts. Another strategy is to develop special robust estimation methods to deal with the problem. A common contamination model is

(1 − α)f1(x) + αf2(x), (6.1)

which assumes most observations come from distribution f1(x), while a few "bad" observations come from distribution f2(x). The probability of an observation being "bad" is α.

From the Bayesian viewpoint, Box and Tiao (1968) and Justel and Pena (1996) suggested the variance-inflation model (1 − α)N(θ, σ²) + αN(θ, k²σ²); Guttman (1973) and Verdinelli and Wasserman (1991) suggested the mean-shift model (1 − α)N(θ, σ²) + αN(θ + λ, σ²), where k and λ are positive constants. Both models are special cases of the common contamination model 6.1.

To extend these Bayesian robust models to more general cases, we propose a new Bayesian approach. A Bernoulli indicator variable is introduced for each observation. If the indicator variable takes the value 1, the corresponding observation is included in the model construction; otherwise, the corresponding observation does not participate in the modelling.

Our model still belongs to the common contamination model and can be written as

(1 − α)f(x; θ) + αg(x).

The subtle but important difference between our method and the other models is that we do not specify the spurious distribution g(x) and do not make inferences about θ based on the "bad" observations. Instead, we use a constructed reference distribution to approximate g(x). The conditional posterior distributions of θ and the Bernoulli indicator variables are computed using the Gibbs sampler, and a summary statistic of the indicator variables can be used to identify possible outliers.

The advantage of our approach is that we do not need to worry about choosing an appropriate distribution for the outliers, and the parameter estimation based only on the "good" observations should be more accurate. We show theoretically and numerically in the following sections that our estimates have smaller mean square errors, and that the masking problem has less effect on our method. Moreover, our method can be applied in situations where the mean-shift or variance-inflation models cannot be applied, or are very hard to apply, such as generalized linear regression, where the relationship among the response, the predictors, and the error is not linear.

6.2 Bayesian Robust Estimation Algorithm

Suppose we have observed data x_1, x_2, ..., x_n. Most observations come from the distribution f(x; θ), and a few come from another, unspecified source g(x). The goal is to estimate the unknown parameter θ. Both x and θ may be vectors. We introduce an indicator vector b = {b_1, b_2, ..., b_n} to denote the memberships of the observations: b_i is the indicator for the ith observation. If b_i = 1, the corresponding observation x_i comes from f(x; θ); if b_i = 0, the corresponding observation belongs to g(x). Each b_i follows a Bernoulli distribution with parameter q_i, the probability that b_i = 1. Without loss of generality, we assume all b_i share a common parameter q, and we assume a beta prior for q with hyper-parameters (α, β). With the prior distribution of θ, the joint distribution of all the variables is

p(x, θ, b, q) = p(x|θ, b)\,p(b)\,p(θ)\,p(q) = \prod_{i=1}^{n} f(x_i; θ)^{b_i} g(x_i)^{1-b_i}\;\prod_{i=1}^{n} q^{b_i}(1-q)^{1-b_i}\;\cdot p(θ)\,q^{α-1}(1-q)^{β-1}.  (6.2)

Using x_b to denote the observations with b_i = 1, the conditional posterior distribution of θ is

p(θ|x, b, q) = p(θ|x_b) ∝ p(x_b|θ)\,p(θ).

The conditional posterior distribution of q is

p(q|x, θ, b) ∝ q^{\sum b_i + α - 1}(1 - q)^{n - \sum b_i + β - 1},

which is again a Beta distribution. The conditional posterior distribution of b_i is

p(b_i|x, θ, q) ∝ f(x_i; θ)^{b_i} g(x_i)^{1-b_i} q^{b_i}(1 - q)^{1-b_i},

which is a Bernoulli distribution with parameter f(x_i; θ)q / [f(x_i; θ)q + (1 − q)g(x_i)]. Since g(x) is unknown, we initially fit the classical model x ∼ f(x; θ) with all the data and use the predicted densities at the observations as the constructed reference distribution to approximate g(x). Since the marginal posterior distribution of b_i|x does not have a simple analytical form, we use the Gibbs sampler to make inferences. The general procedure is given in Algorithm 7.

We make inferences about θ using the samples θ_1, θ_2, ..., θ_M retained after the burn-in period. Since the estimates are based on the "good" observations only, the inference should be more precise and robust. In the next section, we show that our linear regression estimates have smaller mean square errors (MSE) under some conditions. A summary statistic of the samples of b_i indicates how often the ith observation is chosen by the model; observations chosen less frequently are possible outliers.

The quantity f\,q/[f\,q + (1 − q)\,g] is an increasing function of f, so an observation with a larger likelihood, i.e. one that fits the model better, has a much greater chance of being selected by the model. In Algorithm 7, we use the same s to approximate g(x_i) for every observation within an iteration. This does not affect the selection of the observations, though it may affect the convergence rate of the MCMC.

Algorithm 7 Bayesian Robust Estimation
  Fit the classical model with all data, and construct the reference distribution from the predicted densities at the observations.
  Choose a total iteration number M.
  Initially, set all b_i = 1.
  repeat
    Draw θ from p(θ|x, b, q).
    Draw q from p(q|x, θ, b).
    Draw one sample s from the reference distribution and set all g(x_i) = s.
    Draw b from p(b|x, θ, q).
  until the iteration number reaches M.

We suggest using a large MCMC iteration number M. In our simulations, we ran the Gibbs sampler 20000 times, discarded the first 10000 as burn-in, and chose the 10th, 20th, ..., 100th quantiles of the remaining draws as the sample. It is reasonable to assume that the chance of an observation being an outlier is around 0.05 and, with high probability, less than 0.5. As a result, the hyper-parameters we suggest are α = 3.4 and β = 0.1789, which make the mean of q equal to 0.95 and p(q > 0.5) = 0.99.
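A minimal Python sketch of Algorithm 7 for the simplest case, a univariate normal mean with known σ and a flat prior on θ, is given below. It is meant only to show the flow of the sampler, not the implementations used in the applications that follow; the run lengths are shortened relative to the 20000/10000 setting above, and all names are ours.

```python
import numpy as np
from scipy.stats import norm

def bayesian_robust_mean(x, sigma=1.0, alpha=3.4, beta=0.1789,
                         n_iter=4000, burn_in=2000, seed=0):
    """Sketch of Algorithm 7 for a normal mean with known sigma, flat prior on theta."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    ref = norm.pdf(x, loc=x.mean(), scale=sigma)   # reference densities from the all-data fit
    b = np.ones(n, dtype=int)
    theta_draws, b_draws = [], []
    for m in range(n_iter):
        xb = x[b == 1]
        theta = rng.normal(xb.mean(), sigma / np.sqrt(len(xb)))     # theta | x_b (flat prior)
        q = rng.beta(b.sum() + alpha, n - b.sum() + beta)           # q | b
        s = rng.choice(ref)                                         # one draw s, g(x_i) = s
        f = norm.pdf(x, loc=theta, scale=sigma)
        p1 = f * q / (f * q + (1.0 - q) * s)
        b = rng.binomial(1, p1)
        if b.sum() == 0:                                            # guard: keep one observation
            b[np.argmax(p1)] = 1
        if m >= burn_in:
            theta_draws.append(theta)
            b_draws.append(b)
    b_draws = np.array(b_draws)
    return np.mean(theta_draws), 1.0 - b_draws.mean(axis=0)         # theta hat, P(b_i = 0)

data = np.concatenate([np.random.default_rng(2).normal(0, 1, 20), [8.0, 9.0]])
theta_hat, p_outlier = bayesian_robust_mean(data)
print(round(theta_hat, 3), np.round(p_outlier[-2:], 2))   # the two shifted points should flag high
```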

Chapter 7

Applications of the Bayesian Robust

Estimation Approach

In this chapter, we apply our method to three different areas: linear regression, logistic regression, and density estimation. The analysis of simulated data and some real data sets shows that our approach has better performance than other methods.

7.1 Linear Regression

For linear regression Y = Xβ + ε, common contamination models for outliers assume a normal mixture distribution for the errors: either the normal variance-inflation model (Box & Tiao 1968; Hoeting et al. 1996), ε ∼ (1 − α)N(0, σ²) + αN(0, k²σ²), or the normal mean-shift model (Guttman 1973; Abraham & Box 1978), ε ∼ (1 − α)N(0, σ²) + αN(λ, σ²). We use one simulated data set and two real data sets to compare the performance of our approach and these two models.
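For concreteness, errors from these two contamination models can be simulated as follows (a small sketch; the values of α, σ, k and λ are arbitrary illustrations):

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, sigma, k, lam = 1000, 0.1, 1.0, 7.0, 5.0

    outlier = rng.random(n) < alpha   # indicator of the contaminating component

    # variance-inflation model: eps ~ (1 - alpha) N(0, sigma^2) + alpha N(0, k^2 sigma^2)
    eps_vi = rng.normal(0.0, np.where(outlier, k * sigma, sigma))

    # mean-shift model: eps ~ (1 - alpha) N(0, sigma^2) + alpha N(lam, sigma^2)
    eps_ms = rng.normal(np.where(outlier, lam, 0.0), sigma)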

Simulation 1  We design a simple regression simulation with only one predictor variable and 10 observations. The true intercept is 3 and the slope is 2. Special errors are added to make observations 3 and 8 outliers, which can easily be seen in Figure 7.1 a).


Figure 7.1: Simulated data. a) The solid line has the true intercept and slope. The dashed line is the result of OLS, and the dotted line is our result. b) Posterior probability that bi = 0.
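A hypothetical reconstruction of such a simulated data set is sketched below (the outlier shift of 8 is an assumption, not the value used in the dissertation); it can be passed directly to the bre_linear_regression sketch given after Algorithm 7:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.arange(1, 11, dtype=float)
    y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=10)
    y[[2, 7]] += 8.0                       # make observations 3 and 8 outliers (assumed shift)

    X = np.column_stack([np.ones(10), x])  # design matrix with an intercept column
    # beta_draws, b_draws = bre_linear_regression(X, y)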

We use the common noninformative prior distribution f(β, σ²) ∝ σ⁻². For the mean-shift model, we use the prior distribution N(0, 1) for λ, following Verdinelli and Wasserman (1991). The constant k in the variance-inflation model is set to 7, as suggested by Hoeting et al. (1996). The regression estimates and standard errors are summarized in Table 7.1. Our estimates are very close to the true parameter values, and the standard errors are also smaller. The estimates from ordinary least squares and the mean-shift model are obviously biased. The posterior probabilities that b_i = 0 are also plotted in Figure 7.1 b). We can easily identify observations 3 and 8 as outliers.

Method                Estimate
BRE                   β = (3.048, 1.998),   sd(β) = (0.8985, 0.160)
OLS                   β = (5.583, 1.594),   sd(β) = (2.408, 0.388)
OLS w/o outlier       β = (2.910, 2.048),   sd(β) = (0.534, 0.086)
Mean-Shift            β = (5.646, 1.587),   sd(β) = (2.672, 0.429)
Variance-Inflation    β = (3.186, 1.995),   sd(β) = (0.857, 0.155)

Table 7.1: Analysis Results for the Simulated Data

Stack Loss Data The first real data set we are using is the stack loss data (Brownlee, 1965), which consists of measurements from a plant for 21 days. The response is the percent of unconverted ammonia that escapes from the plant, which is called stack loss. The predictors include air flow, temperature, and acid concentration. This data set has been studied by many researchers including Daniel and Wood (1980), Atkinson (1985), and Hoeting et al. (1996).

The common conclusion is that observations 1, 3, 4, and 21 are outliers. In Figure 7.2, we plot the posterior probability that b_i = 0 for each observation. The four possible outliers are identified correctly. The regression parameter estimates are summarized in Table 7.2. Our estimates are very close to the estimates of OLS without the four outliers.


Figure 7.2: Posterior probability plot of the Stack Loss data

Stars Data  The stars data consist of the logarithm of the effective temperature at the surface and the logarithm of the light intensity of 47 stars from the star cluster CYG OB1. The research interest is whether there is a linear relationship between log intensity and log temperature. The data can be found in Rousseeuw and Leroy (1987). Four observations (11, 20, 30, 34) are clearly separated from the others in the scatter plot of Figure 7.3; they correspond to giant stars. Justel and Peña (1996) could not identify any outliers using the variance-inflation model and ascribed the failure to masking between the outliers. However, our posterior plot of b in Figure 7.3 clearly confirms that these four observations are outliers. We summarize the regression estimates in Table 7.3. Using the estimate from OLS without the outliers as the reference, our approach is the only one that finds the correct direction of the fitted line.

Method                Estimate
BRE                   β = (−37.177, 0.794, 0.502, −0.051),   sd(β) = (6.691, 0.138, 0.253, 0.098)
OLS                   β = (−39.920, 0.716, 1.295, −0.152),   sd(β) = (11.896, 0.135, 0.368, 0.156)
OLS w/o outlier       β = (−37.653, 0.798, 0.577, −0.067),   sd(β) = (4.732, 0.067, 0.166, 0.062)
Mean-Shift            β = (−39.108, 0.718, 1.298, −0.165),   sd(β) = (11.847, 0.132, 0.373, 0.156)
Variance-Inflation    β = (−36.809, 0.815, 0.522, −0.074),   sd(β) = (4.876, 0.082, 0.210, 0.069)

Table 7.2: Analysis Results for the Stack Loss Data

7.2 Some Theoretical Results: MSE Comparison

Below we prove that our estimate of β has smaller mean square error (MSE) than that of the ordinary least squares approach.

Theorem 5. Suppose ε ∼ (1 − α)N(0, σ²) + αN(0, k²σ²) and, in one step of the MCMC calculation, X_b is the set of selected observations used for modelling and the proportion of outliers in X_b is α′. Let tr denote the trace of (XᵀX)⁻¹ and tr_b denote the trace of (X_bᵀX_b)⁻¹. If β̂_b is our estimate and β̂ is the OLS estimate, then MSE(β̂_b) < MSE(β̂) if

(α − α′)/α > (tr_b − tr)/tr_b.
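The condition of Theorem 5 is easy to evaluate numerically; the sketch below uses a made-up design matrix, a made-up set of selected rows, and assumed values of α and α′ purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

    selected = np.arange(n) % 10 != 0          # hypothetical rows kept in one MCMC step
    Xb = X[selected]

    tr = np.trace(np.linalg.inv(X.T @ X))
    trb = np.trace(np.linalg.inv(Xb.T @ Xb))

    alpha, alpha_prime = 0.10, 0.02            # assumed outlier proportions in X and X_b
    print((alpha - alpha_prime) / alpha > (trb - tr) / trb)   # if True, Theorem 5 gives MSE(beta_b) < MSE(beta)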

Theorem 6. Suppose ε ∼ (1 − α)N(0, σ²) + αN(λ, σ²) and, in one step of the MCMC calculation, X_b is the set of selected observations used for modelling and the proportion of outliers in X_b is α′. Let tr denote the trace of (XᵀX)⁻¹ and tr_b denote the trace of (X_bᵀX_b)⁻¹. If β̂_b is our estimate and β̂ is the OLS estimate, then MSE(β̂_b) < MSE(β̂) if

[α(1 − α) − α′(1 − α′)] / [α(1 − α)] > (tr_b − tr)/tr_b.

Figure 7.3: Results of the Stars data. a) is the scatter plot. b) is the posterior distribution plot

Theorem 7. Suppose ε ∼ (1 − α)N(0, σ²) + αN(λ, σ²), and let β̂_{M−S} be the estimate using the normal mean-shift model. Then MSE(β̂_{M−S}) > MSE(β̂_{OLS}).

7.3 Generalized Linear Regression

Another real data set we use is from Brown (1980); it consists of measurements on 53 prostatic cancer patients. Each patient has six measurements: age, acid level, X-ray result, tumor size, tumor grade, and nodal involvement. The research purpose is to explore the relationship between the binary response of nodal involvement and the other variables. This data set has been analyzed by Collett (1991) and Albert and Chib (1995). Albert and Chib considered the model with four covariates: log(acid), X-ray, size, and grade. With a Bayesian residual analysis, they claimed that observations 9, 26, 35, and 37 are outliers.

Method                Estimate
BRE                   β = (−8.707, 3.095),   sd(β) = (2.677, 0.605)
OLS                   β = (6.794, −0.413),   sd(β) = (1.237, 0.286)
OLS w/o outlier       β = (−4.057, 2.047),   sd(β) = (1.844, 0.420)
Mean-Shift            β = (6.721, −0.397),   sd(β) = (1.264, 0.293)
Variance-Inflation    β = (6.801, −0.414),   sd(β) = (1.201, 0.278)

Table 7.3: Analysis Results for the Stars Data

Using a logistic regression model, the likelihood function is

p(y | x, β, b) = ∏_{i=1}^n [ (e^{x_iβ} / (1 + e^{x_iβ}))^{y_i} (1 / (1 + e^{x_iβ}))^{1−y_i} ]^{b_i} g(x_i)^{1−b_i}.        (7.1)

We assumed a noninformative prior f(β) = const, and used the normal approximation for the posterior distribution of β | x, y, namely p(β | x, y) ≈ N(β̂, V_β), where β̂ is the mode of p(β | x, y) and V_β = (Xᵀ diag(−L″) X)⁻¹. The Splus function “glm” was used to calculate β̂, and the diagonal elements of −L″ are easily calculated as e^{x_iβ} / (1 + e^{x_iβ})².
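This approximation is straightforward to compute; a minimal numpy sketch (the logistic fit producing beta_hat is assumed to come from any standard routine in place of the Splus glm call):

    import numpy as np

    def posterior_normal_approx(X, beta_hat):
        # covariance of the normal approximation N(beta_hat, V_beta) for logistic regression
        eta = X @ beta_hat
        prob = 1.0 / (1.0 + np.exp(-eta))
        W = prob * (1.0 - prob)                    # diagonal of -L'': e^{x beta} / (1 + e^{x beta})^2
        V_beta = np.linalg.inv(X.T @ (W[:, None] * X))
        return V_beta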


Figure 7.4: Result of the Prostatic data

Figure 7.4 shows that observations 9, 26, 35, and 37 are clearly different from the others. The estimates are summarized in Table 7.4. Our estimates are closer to those calculated without the outliers.

7.4 Density Estimation

The example is a two-dimensional normal sample data set from the book of Barnett and Lewis (1994, p. 289). We assume that the observations x have a multivariate normal distribution X | µ, Σ ∼ N(µ, Σ). The commonly proposed noninformative prior distribution p(µ, Σ) ∝ |Σ|^{−(d+1)/2} was used.
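Under this prior, one draw of (µ, Σ) given the currently selected observations x_b can be sketched as follows (illustrative only; the use of scipy's invwishart and the variable names are assumptions):

    import numpy as np
    from scipy.stats import invwishart

    def draw_mu_sigma(xb):
        # one posterior draw of (mu, Sigma) under p(mu, Sigma) prop. to |Sigma|^{-(d+1)/2}
        nb, d = xb.shape                      # requires nb > d for a proper draw
        xbar = xb.mean(axis=0)
        S = (xb - xbar).T @ (xb - xbar)       # sum-of-squares and cross-products matrix
        Sigma = invwishart.rvs(df=nb - 1, scale=S)
        mu = np.random.multivariate_normal(xbar, Sigma / nb)
        return mu, Sigma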

Method             Estimate
BRE                β = (−1.388, 2.636, 2.176, 1.678, 0.818),   sd(β) = (0.887, 1.344, 0.943, 0.933, 0.904)
GLM                β = (−1.306, 2.512, 2.011, 1.544, 0.851),   sd(β) = (0.727, 1.173, 0.821, 0.780, 0.775)
GLM w/o outlier    β = (−3.263, 4.730, 4.593, 3.604, 1.827),   sd(β) = (1.365, 1.996, 1.784, 1.570, 1.151)

Table 7.4: Analysis Results for the Prostatic Data

The two observations with the lowest frequencies of being selected by the model are 25 and 31. This result coincides with the results of Barnett and Lewis (1994) and Varbanov (1998). The estimates are summarized in Table 7.5. Compared with the estimates using all data, our estimates are closer to those calculated without the outliers.

Method             Estimate
BRE                µ = (1.1384, 1.3437),   Σ = [0.1378 0.1285; 0.1285 0.1815]
MLE                µ = (1.1740, 1.4492),   Σ = [0.1877 0.1996; 0.1996 0.3392]
MLE w/o outlier    µ = (1.1344, 1.3765),   Σ = [0.1390 0.1277; 0.1277 0.2158]

Table 7.5: Analysis Results for the Barnett Data

Theorem 8 below shows that our estimate has smaller MSE than the estimate using all data when the data follow a contaminated distribution.

Theorem 8. Suppose X ∼ (1 − α)N(µ₁, σ₁²) + αN(µ₂, σ₂²) and, in one step of the MCMC calculation, the proportion of outliers in our selected observations is α′. If µ̂_b is our estimate of the location parameter and µ̂ is the estimate using all data, then MSE(µ̂_b) < MSE(µ̂) if α′ < α.

Chapter 8

Discussion and Future Work

In Part II, we introduce an indicator variable for each observation and use only the “good” observations to make inferences. The conditional posterior distributions of the indicator variables are used to locate possible outliers. The method can easily be applied to many statistical models with an explicit likelihood function. Our method is similar to the variance-inflation and mean-shift models, but there are important differences: we do not assume a special distribution for the error term, and we do not use the possible outliers for modelling.

We assume that only a few observations are outliers and choose the hyper-parameters based on this assumption. If prior knowledge about the data is available, more appropriate hyper-parameters should be used. For example, the Rousseeuw data set is a simulated data set with 30 good data points and 20 outliers. Our method with the suggested hyper-parameters does not work well on this data set. However, if we know that 40 per cent of the sample are outliers and select the hyper-parameters based on this prior information, our method performs very well. The posterior plot of the indicator variables can be used as an index for identifying outliers. We emphasize that there is no universal criterion for outliers, and suspicious observations need to be investigated carefully.

We use logistic regression as the application example; our method can also be applied to other generalized linear models. Survival analysis uses a partial likelihood function, and it deserves further effort to investigate how to apply our idea there. We are also considering combining our approach with variable selection methods to perform variable selection and outlier detection simultaneously. Candidate methods are SSVS (George & McCulloch, 1993) and the spike and slab model (Ishwaran & Rao, 2005a).

Appendix A

Proof of Theorems

Proof of Theorem 1:

Let X_PDF and X_CDF denote the design matrices constructed from the PDF and the CDF. The dimension of the design matrix is n × p. For a large n, each column of the design matrix approximately represents, in discrete form, the density function of the corresponding component distribution. The inner product of columns i and j of X_PDF is Σ_{l=1}^n X_il X_jl, which is approximately equal to ∫ N(x | µ_i, σ²) N(x | µ_j, σ²) dx, and can be further approximated by exp[−(µ_i − µ_j)²/(4σ²)]. With well separated component means, the inner product between two columns is almost zero, and X_PDF is nearly orthogonal.

For X_CDF, the inner product of columns i and j can be approximated by ∫ Φ(x | µ_i, σ²) Φ(x | µ_j, σ²) dx. This integral does not have a closed form. However, if n is large, each column is approximately a sample from the uniform [0, 1] distribution, and the inner product between two columns will be close to one. The design matrix will therefore be highly correlated.
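The contrast is easy to see numerically; the sketch below (with an assumed grid, component means and common σ) compares the normalized inner products of PDF columns with those of CDF columns:

    import numpy as np
    from scipy.stats import norm

    def cosines(A):
        # normalized Gram matrix: inner products between unit-length columns
        A = A / np.linalg.norm(A, axis=0)
        return (A.T @ A).round(3)

    x = np.linspace(-10, 10, 2000)               # fine grid standing in for a large sample
    means, sigma = [-4.0, 0.0, 4.0], 1.0         # well separated component means (assumed)

    X_pdf = np.column_stack([norm.pdf(x, m, sigma) for m in means])
    X_cdf = np.column_stack([norm.cdf(x, m, sigma) for m in means])

    print(cosines(X_pdf))   # off-diagonal entries near 0: X_PDF is nearly orthogonal
    print(cosines(X_cdf))   # off-diagonal entries much larger: X_CDF columns are highly collinear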

Proof of Lemma 1:

Suppose we want to maximize the mixture likelihood ∏_{i=1}^n [Σ_{j=1}^k π_j f(x_i | θ_j)], where k is known. The complete likelihood is

L = ∏_{i=1}^n ∏_{j=1}^k [π_j f(x_i | θ_j)]^{Z_ij},

where Z_ij is the indicator of whether the ith observation comes from the jth component. The weight estimates are π̂_j^{(s+1)} = Σ_{i=1}^n Ẑ_ij^{(s)} / n and Ẑ_ij^{(s+1)} = π_j^{(s)} f(x_i | θ_j) / Σ_{j=1}^k π_j^{(s)} f(x_i | θ_j), where s is the EM iteration number.

Z_{1j}, . . . , Z_{nj} are i.i.d., so we have

E(π̂_j) = Σ_{i=1}^n E(Z_ij) / n = E(Z_{1j}) = π_j E f(x_1 | θ_j) / Σ_{j=1}^k π_j E f(x_1 | θ_j) = π_j / Σ_{j=1}^k π_j = π_j.

The estimate of π is unbiased.

Proof of Lemma 2:

Let y_1, y_2, . . . , y_n denote the density estimates of a saturated model at the observations. Suppose there are two other density estimates Xπ_1 and Xπ_2, which we denote by U = {u_i, i = 1, . . . , n} and V = {v_i, i = 1, . . . , n}; that is, U = Xπ_1 and V = Xπ_2. We want to show that if Σ_{i=1}^n (y_i − u_i)² ≤ Σ_{i=1}^n (y_i − v_i)², then ∏_{i=1}^n u_i ≥ ∏_{i=1}^n v_i.

We restrict the choice of π to the class of unbiased estimates; in other words, Eπ_1 = Eπ_2, which is a reasonable assumption based on Lemma 1. It is equivalent to show that if Σ_{i=1}^n (y_i − u_i)² ≤ Σ_{i=1}^n (y_i − v_i)², then Σ_{i=1}^n ln u_i ≥ Σ_{i=1}^n ln v_i. Using the Weak Law of Large Numbers, we change the problem to: if E(Y − U)² ≤ E(Y − V)², then E ln U > E ln V.

E(Y − U)² ≤ E(Y − V)² ⇒ EU² − 2EY EU ≤ EV² − 2EY EV.

Because Eπ_1 = Eπ_2 implies EU = EV, we get EU² ≤ EV².

Using the Taylor expansion ln x = (x − 1) − ½(x − 1)² + O((x − 1)³) and keeping only the first two terms,

E ln U − E ln V ≈ E(2U − 2V + ½V² − ½U²) = ½(EV² − EU²) ≥ 0,

since EU = EV. Because π̂^{ols} = argmin_π (Y − Xπ)², π̂^{ols} gives the largest likelihood among all unbiased π̂.

Proof of Theorem 2:

Let Y denote the pseudo-response from a saturated mixture model. When we use Algorithm 6, at the OLS step, for fixed X, Xπ̂^{ols} is closer to Y than any other Xπ̂, so the likelihood increases according to Lemma 2. At the fixed-weight-EM step, the likelihood also increases. Therefore our final estimates are maximum likelihood estimates.

Proof of Theorem 3:

Suppose our algorithm stops after m steps. Let f̃ denote the random variable corresponding to the saturated distribution, f_1, . . . , f_m the random variables with the corresponding mixture distributions at the m steps, and f̂ the random variable of the final fitted mixture distribution. f_1, . . . , f_m have different parameter sizes, and f̂ is derived from f_m by EM. We assume that all random variables are uniformly bounded. Then f − c →_p 0 is equivalent to E(f − c)² → 0. We discuss three situations: m = 1, m = o_p(n), and m of the same order as n.

If m = 1,

E(f̃ − f̂)² = E(f̃ − f_1 + f_1 − f̂)²
           = E(f̃ − f_1)² + E(f_1 − f̂)² + 2E[(f̃ − f_1)(f_1 − f̂)].

From the consistency of the LASSO and of EM, we have f_1 − f̃ →_p 0 and f̂ − f_1 →_p 0. So the cross term is zero, E(f̃ − f̂)² → 0, and f̂ − f̃ →_p 0. Because f̃ is consistent, f̃ − f →_p 0, so our estimate f̂ is also consistent.

If m ≠ 1 but m = o_p(n), we have a finite sum of o_p(1) terms, which is still o_p(1). Our estimate is still consistent.

If m is of the same order as n, we cannot show the consistency of our estimate. However, the simulation study shows that our algorithm converges very fast, so this situation is not a practical concern.

Proof of Theorem 5

For the normal variance-inflation model, the mean square error of the ordinary least squares (OLS) estimate of β is

E(β̂ − β)ᵀ(β̂ − β) = [(1 − α)σ² + αk²σ²] tr((XᵀX)⁻¹).

The mean square error of our estimate is [(1 − α′)σ² + α′k²σ²] tr((X_bᵀX_b)⁻¹). Then

MSE(β̂) − MSE(β̂_b) = (α − α′) tr_b (k² − 1)σ² + σ²[1 + α(k² − 1)](tr − tr_b)
                   = { (α − α′)/α − [1 + α(k² − 1)]/[α(k² − 1)] · (tr_b − tr)/tr_b } · α(k² − 1)σ² tr_b.

So MSE(β̂) > MSE(β̂_b) when (α − α′)/α > [1 + α(k² − 1)]/[α(k² − 1)] · (tr_b − tr)/tr_b. The factor [1 + α(k² − 1)]/[α(k² − 1)] is close to 1 when k is large. Hence MSE(β̂) > MSE(β̂_b) when (α − α′)/α > (tr_b − tr)/tr_b.

We have α′ < α because we choose observations based on the likelihood, and observations with large errors have less chance of being selected. We can make the difference between tr((XᵀX)⁻¹) and tr((X_bᵀX_b)⁻¹) small by using a large prior probability parameter for b.

Proof of Theorem 6

For the normal mean-shift model, the mean square error of the ordinary least squares estimate β̂ is

E(β̂ − β)ᵀ(β̂ − β) = [σ² + λ²α(1 − α)] tr((XᵀX)⁻¹).

The mean square error of our estimate is [σ² + λ²α′(1 − α′)] tr((X_bᵀX_b)⁻¹). Then

MSE(β̂) − MSE(β̂_b) = [α(1 − α) − α′(1 − α′)] tr_b λ² − [σ² + λ²α(1 − α)](tr_b − tr)
                   = { [α(1 − α) − α′(1 − α′)]/[α(1 − α)] − [σ² + λ²α(1 − α)]/[λ²α(1 − α)] · (tr_b − tr)/tr_b } · α(1 − α)λ² tr_b.

So MSE(β̂) > MSE(β̂_b) when [α(1 − α) − α′(1 − α′)]/[α(1 − α)] > [σ² + λ²α(1 − α)]/[λ²α(1 − α)] · (tr_b − tr)/tr_b. The factor [σ² + λ²α(1 − α)]/[λ²α(1 − α)] is close to 1 when λ is large. Hence MSE(β̂) > MSE(β̂_b) when [α(1 − α) − α′(1 − α′)]/[α(1 − α)] > (tr_b − tr)/tr_b.

Proof of Theorem 7

For the normal mean-shift model, the mean square error of the ordinary least squares estimate is

E(β̂ − β)ᵀ(β̂ − β) = [σ² + α(1 − α)λ²] tr((XᵀX)⁻¹),

which is Var(ε) · tr((XᵀX)⁻¹).

The mean-shift model corrects y_i with y_i* = y_i − λ when the ith observation is assumed to be an outlier. Suppose the mean-shift model claims that a proportion γ of the data are outliers. Then a proportion 1 − γ of the errors still have the mixture distribution in Eq. 5.5, and a proportion γ of the errors have the mixture distribution

(1 − α)N(−λ, σ²) + αN(0, σ²).

Combining them, the error term has a 3-component mixture distribution:

ε_{M−S} ∼ (1 − α − γ + 2αγ)N(0, σ²) + α(1 − γ)N(λ, σ²) + (1 − α)γN(−λ, σ²).

Then E(ε_{M−S}) = α(1 − γ)λ − (1 − α)γλ = (α − γ)λ, and

E(ε_{M−S}²) = (1 − α − γ + 2αγ)σ² + α(1 − γ)(λ² + σ²) + (1 − α)γ(λ² + σ²),

so that

Var(ε_{M−S}) = E(ε_{M−S}²) − [E(ε_{M−S})]² = σ² + [α(1 − α) + γ(1 − γ)]λ²,

which is larger than Var(ε). Hence MSE(β̂_{M−S}) > MSE(β̂_{OLS}).
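The variance calculation above can be verified symbolically; a small sketch with sympy:

    import sympy as sp

    a, g, lam, sigma = sp.symbols('alpha gamma lambda sigma', positive=True)

    # first two moments of the 3-component mixture eps_{M-S}
    m1 = a * (1 - g) * lam - (1 - a) * g * lam
    m2 = ((1 - a - g + 2 * a * g) * sigma**2
          + a * (1 - g) * (lam**2 + sigma**2)
          + (1 - a) * g * (lam**2 + sigma**2))

    target = sigma**2 + (a * (1 - a) + g * (1 - g)) * lam**2
    print(sp.simplify(m2 - m1**2 - target))   # 0: Var(eps_{M-S}) matches the expression above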

Proof of Theorem 8

X ∼ (1 − α)N(µ₁, σ₁²) + αN(µ₂, σ₂²), so we have X̄ ∼ (1 − α)N(µ₁, σ₁²/n) + αN(µ₂, σ₂²/n). Then

MSE(µ̂) = MSE(X̄) = (EX̄ − µ₁)² + Var(X̄) = (1 − α)σ₁²/n + ασ₂²/n + α(µ₁ − µ₂)².

Let µ̂_b denote our estimate; then MSE(µ̂_b) = (1 − α′)σ₁²/n + α′σ₂²/n + α′(µ₁ − µ₂)², and

MSE(µ̂) − MSE(µ̂_b) = (α − α′)[σ₂²/n − σ₁²/n + (µ₁ − µ₂)²].

Since σ₂ ≥ σ₁ is the general assumption of the robust models, the bracketed term is nonnegative, so MSE(µ̂) > MSE(µ̂_b) whenever α′ < α.
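The MSE expression used here can be checked symbolically as well (an illustrative sympy sketch, treating X̄ as the two-component mixture above):

    import sympy as sp

    a, n, mu1, mu2, s1, s2 = sp.symbols('alpha n mu1 mu2 sigma1 sigma2', positive=True)

    # first two moments of Xbar under (1 - a) N(mu1, s1^2/n) + a N(mu2, s2^2/n)
    m1 = (1 - a) * mu1 + a * mu2
    m2 = (1 - a) * (s1**2 / n + mu1**2) + a * (s2**2 / n + mu2**2)

    mse = m2 - 2 * mu1 * m1 + mu1**2          # E[(Xbar - mu1)^2]
    target = (1 - a) * s1**2 / n + a * s2**2 / n + a * (mu1 - mu2)**2
    print(sp.simplify(mse - target))          # 0: matches MSE(mu_hat) above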

Appendix B

Regularity Conditions

Throughout the dissertation, we assume the following conditions:

1. Identifiability

We assume that the mixture pdf is identifiable in the sense that

∫ f(x, θ) dG₁(θ) = ∫ f(x, θ) dG₂(θ)        (B.1)

for all x implies G₁ = G₂.

2. Compactness

We assume the parameter space Θ is compact and the true parameter θ0 is an interior

point of Θ.

3. Smoothness

The function f(x, θ) has support independent of θ and is twice continuously differen-

tiable with respect to θ in Θ.

116 4. Wald’s integrability conditions

The mixture distribution should satisfy Wald's integrability conditions for consistency of the maximum likelihood estimate.

(a) E|log f(x, θ)| < ∞ for each θ ∈ Θ.

(b) There exists ρ > 0 such that for each θ ∈ Θ, f(x, θ, ρ) is measurable and E log f(x, θ, ρ) < ∞, where

f(x, θ, ρ) = 1 + sup_{|θ′ − θ| ≤ ρ} {f(x, θ′)}.

The following additional conditions are required for Chen and Chen (1998, 2001).

1. Strong Identifiability

For any θ₁ ≠ θ₂ in Θ,

Σ_{j=1}^2 {a_j f(x, θ_j) + b_j f′(x, θ_j) + c_j f″(x, θ_j)} = 0  for all x

implies that a_j = b_j = c_j = 0, j = 1, 2.

2. Uniform Strong Law of Large Numbers

There exists an integrable function g and some δ > 0 such that |Y_i(θ)|^{4+δ} ≤ g(X_i) and |Y_i(θ)|³ ≤ g(X_i) for all θ ∈ Θ.

3. Tightness

The processes n^{−1/2} Σ_i Y_i(θ), n^{−1/2} Σ_i Y_i′(θ) and n^{−1/2} Σ_i Y_i″(θ) are tight.

Bibliography

[1] Abraham, B. and Box, G. E. P. (1978). Linear models and spurious observations. Ap-

plied Statistics, 27, 131-138.

[2] Aitkin, M., Anderson, D. and Hinde, J. (1981). Statistical modeling of data on teaching

styles. Journal of the Royal Statistical Society: A, 144, 419-461.

[3] Akaike, H. (1973). Information theory and an extension of the maximum likelihood

principle. In Second International Symposium on Information Theory (B. N. Petrov

and F. Csaki Eds.), pp. 267-281. Akademiai Kiado, Budapest.

[4] Albert, J. and Chib, S. (1995). Bayesian residual analysis for binary response regression

models. Biometrika, 4, 747-759.

[5] Atkinson, A. C. (1985). Plots, Transformations and Regression. Clarendon Press, Ox-

ford.

[6] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed. Chichester: John

Wiley & Sons.

[7] Basford, K. E., McLachlan, G. J., and York, M. G. (1997). Modelling the distribution

of stamp paper thickness via finite normal mixtures: The 1872 Hidalgo stamp issue of

Mexico revisited. Journal of Applied Statistics, 24, 169-179.

[8] Bickel, P. and Chernoff, H. (1993). Asymptotic distribution of the likelihood ratio sta-

tistic in prototypical nonregular problem. In Statistics and Probability: a Raghu Raj

Bahadur Festschrift (J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, and B. L. S. Prakasa

Rao Eds.), pp. 83–96. New York: Wiley.

[9] Böhning, D. (1999). Computer-Assisted Analysis of Mixtures and Applications: Meta-

Analysis, Disease Mapping and Others. Boca Raton, FL: Chapman & Hall/CRC.

[10] Box, G. E. P. and Tiao, C. G. (1968). A Bayesian approach to some outlier problems.

Biometrika, 55, 119-129.

[11] Broadbent, D. E. (1966). A difficulty in assessing bimodality in certain distributions.

British Journal of Mathematical and Statistical Psychology, 19, 125-126.

[12] Brown, B. W. (1980). Prediction analysis for binary data. In Biostatistics Casebook,

(R.J. Miller, B. Efron, B. W. Brown and L. E. Moses eds.), pp. 3-18. New York: Wiley.

[13] Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineer-

ing 2nd Ed. Wiley: New York.

[14] Byers, S. D., and Raftery, A. E. (1998). Nearest-neighbor clutter removal for estimating

features in spatial point processes. Journal of the American Statistical Association, 93,

577-584.

[15] Carlin, B. P. and Chib, S. (1995). Bayesian model choice via Markov Chain Monte

Carlo. Journal of the Royal Statistical Society: B., 57, 473-484.

[16] Chaloner, K. and Brant, R. (1988). A Bayesian approach to outlier detection and resid-

ual analysis. Biometrika, 75, 651-659.

[17] Charnigo, R. and Sun, J. (2004a). Testing homogeneity in a mixture distribution via the

L2 distance between competing models. Journal of the American Statistical Associa-

tion, 99, 488-498.

[18] Charnigo, R. and Sun, J. (2004b). Testing homogeneity in discrete mixtures. Submitted.

[19] Chen, J. (1995). Optimal rate of convergence in finite mixture models. Annals of Sta-

tistics, 23, 221–234.

[20] Chen, H. and Chen, J. (2001). The likelihood ratio test for homogeneity in the finite

mixture models. Canadian Journal of Statistics. 29, 201-216.

[21] Chen, H. and Chen, J. (2003). Tests for homogeneity in normal mixtures in the presence

of a structural parameter. Statistica Sinica, 13, 351-365.

[22] Chen, H., Chen, J. and Kalbfleisch, J. D. (2001). A modified likelihood ratio test for

homogeneity in finite mixture models. Journal of Royal Statistical Society: B, 63, 19-

29.

[23] Chen, H., Chen, J., and Kalbfleisch, J. D. (2004). Testing for a finite mixture model with

two components. Journal of Royal Statistical Society: B, 66, 95-115.

[24] Chen, J. and Kalbfleisch, J. D. (1996). Penalized minimum distance estimates in finite

mixture models. Canadian Journal of Statistics, 2, 167-176.

[25] Chernoff, H., and Lander, E. (1995). Asymptotic distribution of the likelihood ratio test

that a mixture of two binomials is a single binomial. Journal of Statistical planning and

inference, 43, 19-40.

[26] Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American

Statistical Association, 90, 1313-1321.

[27] Clark, V. A., Chapman, J. M., Coulson, A. H., and Hasselblad, V. (1968). Dividing the

blood pressures from the Los Angeles heart study into two normal distributions. Johns

Hopkins Medical Journal, 122, 77-83.

[28] Collett, D. (1991). Modelling Binary Data. Chapman and Hall: London.

[29] Chung, Y. and Kim, H. (1999). Bayesian outlier detection in regression model. Journal

of the Korean Statistical Society, 28, 311-324.

[30] Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman

and Hall: New York.

[31] Dalal, S. and Hall, W. J. (1983). Approximating priors by mixtures of natural conjugate

priors. Journal of the Royal Statistical Society: B, 45, 278-286.

[32] Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data. Wiley: New York.

[33] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood estima-

tion from incomplete data via the EM algorithm (with discussion). Journal of the Royal

Statistical Society: B, 39, 1-38.

[34] Diebolt, J. and Robert, C. (1994). Estimation of finite mixture distributions through

Bayesian sampling. Journal of the Royal Statistical Society: B, 56, 363-375.

[35] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression.

The Annals of Statistics, 32, 407-499.

[36] Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York:

Chapman and Hall.

[37] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics re-

gression tools. Technometrics, 35, 109-135.

[38] Fu, W. (1998). Penalized regressions: The Bridge versus the Lasso. Journal of Compu-

tational and Graphical Statistics, 7, 397-416.

[39] Fu, W. (2003). Penalized estimating equations. Biometrics, 59, 126-132.

[40] Geisser, S. (1985). On the prediction of observables: a selective update. In Bayesian

Statistics 2, (J. M. Bernardo, M. H. Degroot, D. V. Lindley and A. F. M. Smith Eds.),

pp. 473-494. Amsterdam: Elsevier.

[41] Gelfand, A. E. and Dey, K. K. (1994). Bayesian model choice: Asymptotics and exact

calculation. Journal of the Royal Statistical Society: B , 56, 501-514.

[42] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analy-

sis, 2nd. Chapman and Hall.

[43] George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling.

Journal of the American Statistical Association, 88, 881-889.

[44] Ghosh, J. K. and Sen, P. K. (1985). On the asymptotic performance of the log likelihood

ratio statistic for the mixture model and related results. In Proceedings of the Berkeley

Conference in Honor of J. Neyman and J. Kiefer (L. LeCam and R. A. Olshen Eds.),

Vol. 2, 789–806.

[45] Gordon, A. D. and Prentice, I. C. (1977). Numerical methods in quaternary palaeoe-

cology. IV. Separating mixtures of morphologically similar pollen taxa. The Review of

Palaeobotany and Palynology, 23, 359-372.

[46] Green, P. J. (1995). Reversible jump Markov Chain Monte Carlo computation and

Bayesian model determination. Biometrika, 82, 711-732.

[47] Guttman, I. (1973). Care and handling of univariate or multivariate outliers in detecting

spuriousity - A Bayesian approach. Technometrics, 15, 723-738.

[48] Guttman, I. and Pena, D. (1988). outliers and influence: evaluation by posteriors of

parameters in the linear model. In Bayesian Statistics 3 (J. M. Bernardo et al. Eds.), pp.

631-640. Oxford University Press

[49] Guttman, I. and Pena, D. (1993). A Bayesian look at diagnostics in the univariate linear

model. Statistica Sinica, 3, 367-390.

[50] Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. Pro-

ceedings of the Berkeley Conference in Honor of J. Neyman and J. Kiefer, Vol. II,

806-810.

[51] Hastie, T. and Tibshirani, R. (1996). Discriminant analysis via Gaussian mixtures. Jour-

nal of the Royal Statistical Society: B, 58, 158-176.

[52] Hastie, T., Tibshirani, R., and Buja, A. (1994). Flexible discriminant analysis by opti-

mal scoring. Journal of the American Statistical Association, 89, 1255-1270.

[53] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learn-

ing. Springer.

[54] Henna, J. (1985). On estimating of the number of constituents of a finite mixture of

continuous distributions. Annals of the Institute of Statistical Mathematics, 37, 235-

240.

[55] Hoerl, A.E. and Kennard, R.W. (1970). Ridge regression: biased estimation for

nonorthogonal problems. Technometrics, 12, 55-67.

[56] Hoeting, J., Raftery, A. E., and Madigan, D. (1996). A method for simultaneous variable

selection and outlier identification in linear regression. Computational Statistics and

Data Analysis, 22, 251-270.

[57] Izenman, A. J. and Sommer, C. J. (1988). Philatelic mixtures and multimodal densities.

Journal of American Statistical Association, 83, 941-953.

[58] Ishwaran, H., James, L. F., and Sun, J. (2001). Bayesian model selection in finite mix-

tures by marginal density decomposition. Journal of the American Statistical Associa-

tion, 96, 1316-1332.

[59] Ishwaran, H. and Rao, J. S. (2005a). Spike and slab variable selection: Frequentist and

Bayesian strategies. Annals of Statistics, 33, 730-773.

[60] Ishwaran, H. and Rao, J. S. (2005b). Total risk for model selection. Technical Report.

[61] James, L. F., Priebe, C. E., and Marchette, D. J. (2001). Consistent estimation of mixture

complexity. Annals of Statistics, 29, 1281-1296.

[62] Jewell, N. (1982). Mixtures of exponential distributions. Annals of Statistics, 10, 479-

484.

[63] Johnson, W. and Geisser, S. (1983). A predictive view of the detection and charac-

terization of influential observations in regression analysis. Journal of the American

Statistical Association, 78, 137-144.

[64] Justel, A. and Peña, D. (1996). Gibbs sampling will fail in outlier problems with strong

masking. Journal of Computational and Graphical Statistics, 5, 176-189.

[65] Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical

Association, 90, 773-795.

[66] Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estima-

tor in the presence of infinitely many incidental parameters. Annals of Mathematical

Statistics, 27, 887-906.

[67] Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. Annals of Statis-

tics, 28, 1356-1378.

[68] Laird, N. M. (1978). Nonparametric maximum likelihood estimation of a mixing dis-

tribution. Journal of the American Statistical Association, 73, 805-811.

[69] LeBlanc, M. and Tibshirani, R. (1999). Monotone shrinkage trees. Journal of Compu-

tational and Graphical Statistics. 7, 417-433.

[70] Lemdani, M. and Pons, O. (1997). Likelihood ratio tests for genetic linkage. Statistics

& Probability Letters, 33, 15-22.

[71] Leroux, B. G. (1992). Consistent estimation of a mixture distribution. Annals of Statis-

tics, 20, 1350-1360.

[72] Lindsay, B. G. (1981). Properties of the maximum likelihood estimator of a mixing

distribution. In Statistical Distribution in Scientific Work (G.P. Patil, ed.), 5, 95-109.

Reidel, Boston.

[73] Lindsay, B. G. (1983a). The geometry of mixing likelihoods: A general story. Annals

of Statistics, 11, 86-94.

[74] Lindsay, B. G. (1983b). Efficiency of the conditional score in a mixture setting. Annals

of Statistics, 11, 486-497.

[75] Lindsay, B. G. (1983c). The geometry of mixing likelihoods, Part II: The exponential

family. Annals of Statistics, 11, 783-792.

[76] Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications. Hay-

ward: Institute of Mathematical Statistics.

[77] Macdonald, P. D. M. and Pitcher, T. J. (1979). Age groups from size-frequency data: a

versatile and efficient method of analyzing distribution mixtures. Journal of the Fish-

eries Research Board of Canada, 36, 987-1001.

[78] Mak, C. (1999). Polychotomous logistic regression via the Lasso. Ph. D. Thesis, Uni-

versity of Toronto, Toronto, Canada.

[79] McLachlan, G. J. (1987). On bootstrapping the likelihood ratio test statistic for the

number of components in a normal mixture. Applied Statistics, 36, 318–324.

[80] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applica-

tions to Clustering. New York: Marcel Dekker.

[81] McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.

[82] Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable se-

lection with the Lasso. The Annals of Statistics, to appear.

[83] Mengersen, K. and Robert, C. P. (1996), Testing for mixtures: A Bayesian entropic

approach ( with discussion). In Bayesian Statistics 5, (J.M. Bernardo, J. O. Berger, A.

P. Dawid and A. F. M. Smith eds.), pp. 255-276. Oxford.

[84] Miller, A. J. (1990). Subset Selection in Regression. Chapman & Hall: London.

[85] Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the

weighted likelihood bootstrap. Journal of the Royal Statistical Society: B, 56, 3-48.

[86] Osborne, M. R., Presnell, B., and Turlach, B. A. (1998). Knot selection for regression

splines via the lasso. Dimension Reduction, Computational Complexity, and Infor-

mation, Vol. 30 of Computing Science and Statistics, Interface Foundation of North

America, Inc., pp. 44-49.

[87] Ott, J. (1999). Analysis of Human Genetic Linkage, 2nd ed. Baltimore and London:

The Johns Hopkins University Press.

[88] Pearson, K. (1894). Contribution to the mathematical theory of evolution. Philosophi-

cal Transactions of the Royal Society: A , 185, 71-110.

[89] Pena, D. and Guttman, I. (1993). Comparing probabilistic methods for outlier detection

in linear models. Biometrika, 80, 603-610.

[90] Pena, D. and Tiao, G. C. (1992). Bayesian robustness functions for linear models. In

Bayesian Statistics 4, (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith

Eds.), pp. 365-388. Oxford University Press.

[91] Pettit, L. I. and Smith, A. F. M. (1985), Outliers and influential observations in linear

models. In Bayesian Statistics 2, (J. M. Bernardo, M. H. Degroot, D. V. Lindley and A.

F. M. Smith Eds.), pp. 473-494. Amsterdam: Elsevier.

[92] Postman, M., Huchra, J. P., and Geller, M. J. (1986). Probes of large-scale structures in

the Corona Borealis region. The Astronomical Journal, 92, 1238-1247.

[93] Priebe, C. E. (1994). Adaptive mixtures. Journal of the American Statistical Associa-

tion, 89, 796-806.

[94] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an un-

known number of components (with discussion). Journal of the Royal Statistical Soci-

ety: B, 59, 473-484.

[95] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465-471.

[96] Robbins, H. (1950). A generalization of the method of maximum likelihood: Estimat-

ing a mixing distribution (Preliminary report). Annals of Mathematical Statistics, 21,

314.

[97] Roeder, K. (1990). Density estimation with confidence sets exemplified by superclus-

ters and voids in the galaxies. Journal of the American Statistical Association, 85, 617-

624.

[98] Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mix-

tures of normals. Journal of the American Statistical Association, 92, 894-902.

[99] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection,

New York: John Wiley.

[100] Sahu, S. and Cheng, R. (2002). A fast distance-based approach for determining the

number of components in mixtures. Canadian Journal of Statistics, 31, 3-22.

[101] Schlattmann, P. and Böhning, D. (1993). Mixture models and disease mapping. Sta-

tistics in Medicine, 12, 943-950.

[102] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6,

461-464.

[103] Shao, J. (1993). Linear model selection by cross-validation. Journal of the American

Statistical Association, 88, 486-494.

[104] Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Asso-

ciation, 91, 655-665.

[105] Sharples, L. D. (1990). Identification and accommodation of outliers in general hier-

archical models. Biometrika, 77, 445-453.

[106] Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process.

Annals of Statistics, 4, 1200-1209.

[107] Smith, C. A. B. (1961). Homogeneity tests for linkage data. Proc. Sec. Int. Congr.

Hum. Genet., 1, 212-213.

[108] Smith, C. A. B. (1963). Testing for heterogeneity of recombination values in human

genetics. Annals of Human Genetics, 84, 175-182.

[109] Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number

of components- an alternative to reversible jump methods. Annals of Statistics, 28, 40-

74.

[110] Sun, J. (1993). Tail probabilities of the maxima of Gaussian random fields. The Annals

of Probability, 21, 34-71.

[111] Sun, X. (1999). The Lasso and its implementation for neural networks. Ph. D. Thesis,

University of Toronto, Toronto, Canada.

[112] Tanner, W. F. (1962). Components of the hypsometric curve of the earth. Journal of

Geophysical Research, 67, 2841-2843.

[113] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of

the Royal Statistical Society: B, 58, 267-288.

[114] Tibshirani, R. (1997). The Lasso method for variable selection in the Cox model.

Statistics in Medicine, 16, 385-395.

[115] Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Koong, A., and Le, Q. (2004).

Sample classification from protein mass spectroscopy, by “peak probability contrasts”.

Bioinformatics, 20, 3034-3044.

[116] Titterington, D. M. (1976). Updating a diagnostic system using unconfirmed cases.

Applied Statistics, 25, 238-347.

[117] Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical Analysis of

Finite Mixture Distributions, John Wiley & Sons.

[118] Varbanov, A. (1998). Bayesian approach to outlier detection in multivariate normal

samples and linear models. Communications in Statistics: Theory and Methods, 27,

547-557.

[119] Verdinelli, I. and Wasserman, L. (1991). Bayesian analysis of outlier problems using

the Gibbs sampler. Statistics and Computing, 1, 105-117.

[120] Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals

of Mathematical Statistics, 20, 595-601.

[121] Wilks, S. S. (1938). The large sample distribution of the likelihood ratio for testing

composite hypothesis. Annals of Mathematical Statistics, 9, 60-62.

[122] Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D.,

Williams, K., and Zhao, H. (2003). Comparison of statistical methods for classifica-

tion of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636-1643.

[123] Zou, H. (2005). The adaptive Lasso and its oracle properties. Technical report 645,

School of Statistics, University of Minnesota.

[124] Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis.

Journal of Computational and Graphical Statistics, 15, 265-286.

[125] Zou, H., Hastie, T., and Tibshirani, R. (2005). On the “Degrees of Freedom” of the

Lasso (manuscript).
