Model Selection via Minimum Description Length

by

Li Li

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of

c Copyright by Li Li 2011 Model Selection via Minimum Description Length

Li Li Submitted for the Degree of Doctor of Philosophy Department of Statistics, University of Toronto

November 2011

Abstract

The minimum description length (MDL) principle originated from data compression literature and has been considered for deriving statistical model selection procedures. Most existing methods utilizing the MDL principle focus on models consisting of independent data, particularly in the context of linear regression. The data considered in this thesis are in the form of repeated measurements, and the exploration of MDL principle begins with classical linear mixed-effects models. We distinct two kinds of research focuses: one concerns the population parameters and the other concerns the cluster/subject parameters. When the research interest is on the population level, we propose a class of MDL procedures which incorporate the dependence structure within individual or cluster with data-adaptive penalties and enjoy the advantages of Bayesian information criteria. When the number of covariates is large, the penalty term is adjusted by data-adaptive structure to diminish the under selection issue in BIC and try to mimic the behaviour of AIC. Theoretical justifications are provided from both data compression and statistical perspectives. Extensions to categorical response modelled by generalized estimating equations and functional data modelled by functional principle components are illustrated. When the interest is on the cluster level, we use group LASSO to set up a class of candidate models. Then we derive a MDL criterion for this LASSO technique in a group manner to selection the final model via the tuning parameters. Extensive numerical experiments are conducted to demonstrate the usefulness of the proposed MDL procedures on both population level and cluster level.

ii Acknowledgements

I would like to thank all the people who have helped me during my doctoral study.

I wish to express my gratitude to my supervisors Professor Fang Yao and Professor Radu Craiu, without whose guidance and encouragement this work would not have been achieved.

I would like to thank my PhD committee members, Professor Michael Evans, Pro- fessor Jeffrey Rosenthal, Professor James Stafford and Professor Zhou Zhou.

I would like specially thank my external examiner, Professor Mu Zhu. His detailed and excellent comments have greatly improved this thesis.

I am grateful to all the professors who have taught me in various graduate courses, Professor David Brenner, Professor Keith Knight, Professor Nancy Reid and Professor Balint Virag.

I would also like to thank Laura Kerr, Andrea Carter and Dermot Wheland who have made my student life in this department smooth and enjoyable.

I am indebted to my colleagues, Meng Du, Yan Bai, Tao Wang, Panpan Wu, Ximing Xu, Alexander Shestopaloff, Chunyi Wang, Weichi Wu and Lizhen Xu. They generously share their knowledge with me.

The last and important thanks go to my parents for their understanding and sup- porting.

iii Contents

1 Introduction 1

1.1 Random Effects Models ...... 3

1.2 Existing Model Selection Methods ...... 9

1.3 Minimum Description Length ...... 12

2 Variable Selection for Random Effects Models via MDL 23

2.1 Linear Mixed-Effects Model ...... 25

2.1.1 Case A: σ2 and D are known ...... 26

2.1.2 Case B: σ2 and D are unknown ...... 32

2.2 Theoretical Results ...... 36

2.3 Simulation ...... 46

2.4 An Extension of MDL for Linear Mixed Model: MDL for Functional Data ...... 49

2.4.1 Functional Data Analysis ...... 50

2.4.2 Reduced Rank Model ...... 53

2.4.3 Principle Components Analysis through Conditional Expectation 57

iv 3 Variable Selection for Generalized Estimating Equation via MDL 63

3.1 Generalized Estimating Equations ...... 65

3.1.1 A Brief Review ...... 65

3.1.2 Case C: φ is known ...... 67

3.1.3 Case D: φ is unknown ...... 71

3.2 Simulation ...... 74

3.3 Conclusion ...... 81

4 Random Effects Selection via Group LASSO 82

4.1 Group LASSO ...... 85

4.2 MDL Criterion for Group LASSO ...... 91

4.3 Random Effects Selection via Group LASSO ...... 94

4.4 Functional Eigenfunctions Selection via Group LASSO ...... 103

A Proof of Consistence 108

B Newton-Raphson Iteration for deriving the MDL Criteria 111

B.1 Newton-Raphson Iteration for gMDL1 ...... 111

B.2 Newton-Raphson Iteration for gMDL2 ...... 112

Bibliography 121

v List of Figures

1.1 Rat Data (in three different groups) ...... 4

1.2 Huffman Code ...... 14

2.1 Trajectories of CD4 percentages ...... 51

4.1 Constrain region for l1-type regulation (within solid line) and l2-type regulation (within dotted line)...... 88

4.2 Residual sum of squares given by glMDL and cAIC. Y-axis denotes the mean residual sum of squares for each Monto Carlo run. The solid line demonstrates the results given by the new method with the classical estimation; the dotted line shows the results given by cAIC; and the asters indicate the fitted values estimated and selected by group LASSO.102

vi List of Tables

2.1 Comparison of lMDL2 criterion with AIC and BIC for LME, shown are the selected number of cases out of 100 Monte Carlo runs. In the table, F < T (or F > T) denotes that a given criterion chooses a model with less (or more) covariates than the true model, while F = T indicates that the true model is selected...... 48

2.2 Comparison of the lMDL2 criterion with AIC and BIC for reduced rank model based on 100 Monte Carlo runs. We use three different structures of the basis. In the table, q < 2 means we use less eigen- functions to fit the model; q = 2 denotes that we use the true number of eigenfunctions to fit the model; and q > 2 indicates that we use more eigenfunctions to fit the model. However, the number of eigenfunction should be less than p. We use B-Spline to derive the basis...... 57

2.3 Comparison of the lMDL2 criterion with AIC and BIC for reduced rank model based on 100 Monte Carlo runs. We use the criteria to jointly select p and q subject to q ≤ p with p ∈ {4, 5, 6}. In the table, q < 2 means less eigenfunctions are selected to fit the model; q = 2 denotes the true number of eigenfunctions is chosen; and q > 2 indicates over-selection of eigenfunctions...... 57

vii 2.4 Comparison of fMDL criterion with AIC and BIC for PACE model, shown are the selected number of cases out of 50 Monte Carlo runs. In the table, F < T or F > T corresponds to eigenvalue under-selection or over-selection, respectively, while F = T indicates that the true model is selected...... 61

3.1 The procedure to generate three correlated binary random variables. 77

3.2 Comparison of gMDL2 criterion with AIC, BIC and QIC for GEE based on 100 Monte Carlo runs. The exchangeable working correlation structure is used to fit the model. We illustrate here two choices of the covariance matrix for the coding device ω(β): the inverse Fisher information (MDL-F) and the identity matrix (MDL-I). In the table, F < T or F > T corresponds to covariate under-selection or over-selection, respectively, while F = T indicates that the true model is selected. . . 79

3.3 Comparison of the MDL criterion with AIC, BIC and QIC for GEE based on 100 Monte Carlo runs. The AR(1) working correlation struc- ture is used to fit the model. In the table, F < T or F > T corresponds to covariate under-selection or over-selection, respectively, while F = T indicates that the true model is selected...... 80

4.1 Comparison of glMDL criterion with cAIC for random effects, shown are the selected number of cases out of 100 Monte Carlo runs. Since none of the methods selects the model with less than three covariates, those cases are not shown in the table...... 101

viii Chapter 1

Introduction

1 1 Introduction 2

The concept of a “true model” responsible for generating a given set of data is usually introduced solely for theoretical purposes since, in practice, the mechanisms producing the data are often much more complex than the models that are being contemplated. From a model selection perspective, a more reasonable aim is to detect, in a class of models, the one that best approximates, or describes, the observed data - this is the approach adopted in my research. Unfortunately, a model selection criterion that would perform optimally under a wide variety of scenarios has been elusive so far, this area of research continuing to be very dynamic even after more than 30 years of development. The initial model selection criteria were established from a “true model” perspective for independent data (Akaike, 1973; Schwarz, 1978). However, there has been an increasing demand for criteria applicable for correlated models such as those in longitudinal studies (Pan, 2001; Vaida and Blanchard, 2005). My research interest is especially focusing on proposing valid model selection methods for random effects model on population level and cluster level, respectively. The methods are extended to the functional data models. Generalized Estimating Equations (GEE,

Liang and Zeger, 1986) is also discussed as one of the models for dependent data. The

Minimum Description Length (MDL, Rissanen, 1978) is used as the framework of the model selection criteria. 1 Introduction 3

1.1 Random Effects Models

Cluster data are found in many different types of studies. All relate to grouping or di- viding a collection of observations into subsets (clusters), such that those within each cluster are more closely related to one another than observations assigned to different clusters. The nature of the observations, the design of the study or the method used to sample the data, all contribute to clustering. For example, a medical study was designed to investigate the role of endogenous testosterone in the craniofacial growth of the young male rat (Verdonck et al., 1998). After weaning, 64 male offspring were randomly assigned into three groups: two experimental groups (with low and high dose of drug to inhibit the pubertal testosterone rise) and a control group (without drug). Both experimental groups were injected intramuscularly with 10 µ triptorelin at day 25. The researchers gave the high dose group a second injection at day 45. The sizes of skulls of the rats were treated as the responses, which were measured every

10 days from day 30 to day 110. In this example, the measurements corresponding to each rat form a cluster. It’s represented as a vector of repeated measurements

T yi = (yi1, . . . , yini ) for individual i, where ni is the number of records made for the ith subject. In Figure 1.1, the records are plotted against age. Each solid line illus- trates a trajectory for one rat. Therefore, the pattern of each individual rate can be seen. Usually, we assume that all the vectors of repeated measurements are uncorre- 1 Introduction 4

Control Group 85 80 75 Response 70 65

50 60 70 80 90 100 110

Days

High Dose 85 80 75 Response 70 65

50 60 70 80 90 100 110

Days

Low Dose 85 80 75 Response 70 65

50 60 70 80 90 100 110

Days

Figure 1.1: Rat Data (in three different groups) 1 Introduction 5 lated across individuals. If we consider the measurements from a particular rat, all the records are definitely related to one another. To model the observed data, when we consider a single rat, a normal model can be used. Then the positive dependence structure can be modelled via a multivariate Gaussian model. However, when we con- sider all the subjects, it may be not appropriate to use the classical Gaussian model, since this model usually works well under the balanced design. In the rat data, a

fixed number of repeated measurements was scheduled to be taken on all subjects at

fixed time points. However, during the study, 14 rats died. This example shows that, in practice, it may be impossible to get all the responses at the same time points.

Furthermore, it may be hard for the researchers to take the same number of repeated measurements of all subjects. In general, due to the highly unbalanced nature of the cluster data, the multivariate regression techniques are difficult to be used directly for many data sets (see, for example, Laird and Ware (1982), Seber (2004)). New models are set up to deal with such data set, such as, Linear Mixed-Effects Model

(LMM, Laird and Ware, 1982), Generalized Estimating Equations (GEE, Liang and

Zeger, 1986), etc.

The hierarchical formulation of the Linear Mixed-Effects Model was defined by 1 Introduction 6

Laird and Ware (1982)

   yi = Xiβ + Ziγi + εi     γi ∼ N(0,D), (1.1)   εi ∼ N(0, Λi),     γ1, ...γn, ε1, ..., εn independent,

where yi is the ni-dimensional vector, Xi and Zi are (ni × p) and (ni × q) matrices of known covariates, β is the p-dimensional vector of fixed effects parameters, which are population-specific, i.e. same for all subjects, γi is the q-dimensional vector of random effects, which are subject-specific, εi is a ni-dimensional vector of residual

ni×ni components. The covariance matrix Λi ∈ R are population-specific parameters

2 and throughout this thesis, we take Λi = σ Ini . Finally, D is the q × q covariance matrix of the random effects.

It follows from the above hierarchical formulation (1.1) that, given γi, yi is normally

2 distributed with mean vector Xiβ + Ziγi and variance-covariance matrix σ Ini . We

2 use f(yi|γi, β, σ ) to denote this conditional distribution of yi. It gives the features of ith particular subject. If the interest is on the population properties, the marginal density function of yi is given by

Z 2 2 f(yi|β, σ ,D) = f(yi|γi, β, σ )f(γi|D)dγi, 1 Introduction 7

where f(γi|D) is the density function of γi which is a multivariate normal distribution.

It is easy to get that marginal density of yi which is multivariate normal distribution

T 2 with mean vector Xiβ and variance-covariance matrix Σi = ZiDZi + σ Ini .

If we check the rat data, the trajectory of each rat looks roughly like a straight line. The overall trend of the sizes of the skulls increases with age and dose. The marginal model captures this property of our study. However, some variability can be detected in Figure 1.1. The rate of changing varies across individuals. Therefore the conditional model should be considered for each specific rat, if rat-specific inference is needed. LMM is a flexible tool for us to focus on different aspects of the data/study.

Assume that θ ∈ Θ is the vector of parameters under focus, where Θ is a k di- mension parameter space. Let α denote the vector of all parameters in Σi which, as discussed before, is a population parameter and depends on neither i, nor the dimen- sion of the matrix ni. For the marginal model, θ = (β, α). The classical technique to make inference is based on the log-likelihood estimations of the marginal model with respect to θ. The log-likelihood is

n Y   1  L (θ) = (2π)−ni/2|Σ (α)|−1/2exp − (y − X β)T Σ−1(α)(y − X β) (1.2) ML i 2 i i i i i i=1 1 Introduction 8

Maximizing (1.2) given α, the conditional maximum likelihood estimator of β is

n !−1 n ! ˆ X T −1 X T −1 β(α) = Xi Σi (α)Xi Xi ΣI (α)yi . (1.3) i=1 i=1

Replacing β by (1.3), the maximum likelihood estimator of α is given by maximizing

(1.2) with respect to α.

The maximum likelihood estimating procedure of α doesn’t take into account the loss in degrees of freedom resulting from estimating β. If the number of fixed effects are large, this issue can affect the inference. Therefore, the restricted maximum likelihood (REML Patterson and Thompson, 1971; Harville, 1974) was introduced as a likelihood function with respect to a specific set of n − p linearly independent error contrasts. The error contrasts are in the form U = AT y, where A is any n × (n − p) full-rank matrix with columns orthogonal to the columns of the X matrix such that

AT X = 0. In practice, such approach ensures that the inaccuracy of β estimation will not influence the estimate of α. It is easy to show that U follows a normal distribution with mean zero and covariance matrix AT Σ(α)A, where Σ(α) is a block- diagonal matrix with blocks Σi(α) on the main diagonal and zeros elsewhere. Then 1 Introduction 9 the likelihood function with respect to the error contrasts is

n n −(n−p)/2 X T −1/2 X T −1 −1/2 L(α) = (2π) | Xi Xi| | Xi Σi Xi| i=1 i=1 ( n ) 1 X ×exp − (y − X β)T Σ−1(y − X β) . (1.4) 2 i i i i i i=1

The estimator of α doesn’t depend on the form of the error contrast.

The conditional distribution of the random effects γi given yi is a multivariate normal distribution. Usually, γi is estimated by the mean of the posterior distribution given by

γˆi = E(γi|yi)

T −1 = DZi Σi (α)(yi − Xiβ).

1.2 Existing Model Selection Methods

For the model under consideration (1.1) and given a sample Yn= {y1, y2, . . . , yn} of size n, both the Akaike Information Criterion (AIC, Akaike, 1973) and the Bayesian

Information Criterion (BIC, Schwarz, 1978) take the form of a penalized log-likelihood

−2l(θ|Yn)+R(n, k), where θ ∈ Θ is the vector of parameters under focus and k is the 1 Introduction 10 dimension of the parameter space Θ, and R(n, k) is the penalty term that depends on the sample size and the number of parameters. Under independence, the AIC and BIC penalties are, respectively, RAIC = 2k and RBIC = k log(n), where k is the number of free parameters. Considering the marginal model, the definition of k is straight forward which is the number of the fixed parameters. In linear mixed-effects model, the fixed parameters contain the mean parameters β and variance components α i.e.

θ = (β, α). For instance, if the correlation matrix of D has exchangeable structure, i.e.   1 τ . . . τ        τ 1 . . . τ    T V   V ,    ......        τ τ . . . 1 then there are q + 2 parameters for the covariance matrix, q for the variances of the

2 2 2 diagonal matrix V = diag{σ1, . . . , σq }, 1 for τ and the other 1 for σ . Modifications of the AIC have been proposed to account for small sample sizes (AICc, Hurvich and

Tsai, 1989), for overdispersion in count data (QAIC, Burnham and Anderson, 2002) and for GEE models (QIC, Pan, 2001). There are also bunch of modifications of BIC.

For example, Pauler (1998) derived a general form of the Schwarz’s approximation and extended this to choose fixed effects.

When we consider the conditional model, γi is part of the density function of yi. 1 Introduction 11

2 Then the parameter in focus is θ = (β, γi, σ ). For each cluster. there are q random effects. Therefore, in total there are nq random effects, each cluster contributes q of them. However, the total number of the free parameters given by the random effects depends on the structure of D, the covariance matrix in (1.1). For instance, if

D = 0, the model reduces to a simple linear regression and the random effects vanish.

Therefore it is not reasonable to directly count the number of random effects in the penalty term. The “degrees of freedom” ρ was introduced to clarify the definition of the number of free parameters and is used to qualify the complexity of a statistical model. In linear regression, it is the number of estimated predictors which is the dimension of the vector space where the fitted values of the observations lie. Hodges and Sargent (2001) provided a formula to calculate the degree of freedom for the con- ditional model. Based on this, Vaida and Blanchard (2005) proposed the conditional

AIC, for which the penalty term is 2(ρ + 1). The primary theoretical approach of cAIC is based on the assumption that the structure of D is known. Later, Liang et al.

(2008) gave a more general justification of cAIC to eliminate the previous restriction.

Spiegelhalter et al. (2002) proposed deviance information criterion (DIC) which also focuses on the cluster level. They estimated the effective number of parameters pD from the Baysian perspective which has a similar form of ρ. In this criteria, pD was defined from the information theoretic argument. It is the effective number of parameters in a model, which is the difference between the posterior mean of the 1 Introduction 12 deviance and the deviance at the posterior means of the parameters of interest. The

final form of DIC is defined as a classical estimate of fit, plus the penalty term which

¯ ¯ is twice the effective number of parameters, i.e. DIC = D(θ) + 2pD, where D(θ) is the posterior mean deviance.

It is well known that the AIC and BIC address different goals of model selection where the former achieves asymptotic optimality and the latter possesses selection consistency (e.g., Shibata, 1981; Nishii, 1984). However, Yang (2005) has shown that the main strengths of AIC and BIC cannot be shared. This motivates our proposed model selection criterion for models with correlated data built upon the minimum description length (MDL) principle, as it attempts to find a good balance between

AIC and BIC.

1.3 Minimum Description Length

In my research, the minimum description length principle is used as the underlying framework to derive the model selection criteria for dependent data. MDL has been introduced by Rissanen (1978). Formally, Rissanen has found a practically useful way to interpret the description of data from the code length used for data transmission, as developed in Shannon’s coding theory (Shannon, 1948). From MDL standpoint, 1 Introduction 13 any probability distribution is evaluated based on its ability to describe the data and does not have to be identical to the distribution underlying the data-generating mechanism. In line with the approach outlined at the beginning of this section, we use the MDL paradigm as a tool for detecting the model that best approximates the data. The connection between MDL and statistical analysis has matured with the work of Barron et al. (1998), Hansen and Yu (2001), Lee (2001) and Hansen and Yu (2003). Hansen and Yu (2001, 2003) presented various frameworks in which the MDL principle can be applied to (generalized) linear models with independent data. They also pointed out numerous connections between MDL and other model selection techniques traditionally used in frequentist and Bayesian statistics, e.g. AIC and BIC. In this work, we derive valid model selection criteria based on MDL principle for most commonly adopted correlated data models, e.g., linear mixed-effects (LME) models. The main contribution of our proposed approach is to systematically take into account the dependence structure by interweaving the estimation of variance- covariance structure with the code length calculation. The extension does not follow straightforwardly from the independent case and has large effects on the performance of the criterion. The methods developed for LME models are justified as a “valid” description length in the sense of achieving the smallest redundancy (i.e., Kullback-

Leibler divergence). Moreover, the proposed criteria possess the selection consistency that does mimic the desirable property of BIC. A significant advantage is that the 1 Introduction 14

Coding Fn. C: A - {0,1}*=strings of 0’s and 1’s a - 0 a b - 10 (0) node bc c - 11

b c (10) (11) Figure 1.2: Huffman Code proposed methodology for LME is general enough to be easily implementable to models of increased complexity, e.g., involving functional data and generalized-type response (GEE).

As already mentioned, the MDL principle relies on the length of the code used for data description (or transmission) based on a given model. The Huffman’s Coding

Algorithm is used as an example to explain the basic idea of the coding theorem.

Assume that A is any set of symbols. A code C is a map from A to a set of codewords.

For instance, let A = {a, b, c} with probability Q(a) = 1/2, Q(b) = 1/4 and Q(c) =

1/4. To encode A into a set of 0, 1 binary codewords, we can construct a binary tree from the end nodes {a, b, c}. We start with the nodes with the smallest probability, b and c, and assign the leaves 0 and 1 arbitrarily to them to achieve the intermediate node bc with node probability 1/2. Then we repeat the process in the first step to assign the leaves 0 and 1 to the nodes a and bc to reach the tree’s root. The procedure is shown in Figure 1.2. 1 Introduction 15

Let L denote the code length function which is the length of the codewords for y ∈ A for this specific code. For instance, L(a) = 1, since the codeword for a is

0. There is only one digit. Therefore, the code length is 1. For node b, we have

L(b) = 2, since the codeword for b is 10 which has two digits. The relation between the code length and probability is explicit: L(y) = − log2 Q(y), for any y ∈ A in this example. However, if Q(a) = Q(b) = Q(c) = 1/3, we have more flexibility to encode the nodes. We can randomly choose any two of them to build the first intermediate node. Therefore, the algorithm can lead to different codes depending on which nodes we choose to merge. However, no matter how we encode the data, we find that the decoding way is unique. Considering the general case, a prefix code is any code that satisfies the unique decodability condition, i.e. no codeword is the prefix of any other (Hansen and Yu, 2001). All Huffman codes given by a binary tree are in this class. There is a direct correspondence between a prefix code on a finite set A and the quantity − log2 Q where Q is a probability distribution on A. An integer-valued function L corresponds to the code length of a binary prefix code if and only if it

P −L(a) satisfies Kraft’s inequality a∈A 2 ≤ 1 (Cover and Thomas, 1991). Therefore, for any prefix code on A with code length L we can define a distribution Q on A using 2−L(a) Q(a) = , (1.5) P −L(y) y∈A 2 for any a ∈ A. Conversely, for a distribution Q on A we can define a prefix code with 1 Introduction 16 code length

L(y) = d− log2 Q(y)e, where dye is the smallest integer greater than or equal to y. If the data generating function P is known, the expected code length of C is defined following the usual statistical way: X LC = P (y)L(y). y∈A

Shannon’s Source Coding Theorem states that the expected code length is minimized when the code length function is constructed by the true density function in (1.5).

For any code, the inequality

X LC ≥ − P (y) log2 P (y), y∈A

holds and becomes an equality if and only if L = − log2 P . Logically, if y is a common symbol, then its associated probability is relatively large. Therefore, short codeword is signed to this frequent symbol. In Figure 1.2, a has the largest probability and it has the shortest code length. In contrast, if y is a rare symbol with small probability, then long codeword is given to this isolated symbol. Mathematically, if we assume that the true distribution function P is given in Figure 1.2, and Q1 and Q2 are two distributions defined on A, then we can calculate the expected code lengths, 1 Introduction 17 respectively, based on these two distributions:

X L1C = P (y)L(y) y∈a,b,c

= P (a)d− log2 P (a)e + P (b)d− log2 P (b)e + P (c)d− log2 P (c)e 1 1 1 1 1 1 = d− log e + d− log e + d− log e 2 2 2 4 2 4 4 2 4 = 3/2,

where Q1(y) = P (y), and

X L2C = P (y)L(y) y∈a,b,c

= P (a)d− log2 Q2(a)e + P (b)d− log2 Q2(b)e + P (c)d− log2 Q2(c)e 1 1 1 1 1 1 = d− log e + d− log e + d− log e 2 2 3 4 2 3 4 2 3 1 = d− log e = 2, 2 3

where Q2(a) = Q2(b) = Q2(c) = 1/3. If Q1 and Q2 are the candidate models, from the above calculations, we can see that the true model gives the shorter expected code length. More generally, if our sample size is large, then in the asymptotic sense, the code length given by the true density function is shorter than those calculated based on the other candidate models. Therefore, in practice, we can get a bunch of code lengths of the observations given by the candidate models. Then we choose 1 Introduction 18 the model that gives the shortest code length as the final model. However, all our discussion is based on the Shannon’s Source Coding Theorem which assumes that the true distribution function is known. The problem is that, in the real world, the underlying distribution of y is usually of interest and it is unknown.

Barron et al. (1998) solved the problem; they extended Shannon’s theorem into a more general case where P is partly known as a member of a model class M. Let each element in this class be indexed by a unique parameter. Then each component in the class corresponds to a code length function. If the observations are sampled from this model class, then the code length given by the data generating function is the shortest one among the whole class in an asymptotic sense. However, some times, it is even hard for the researchers to know whether the true model is actually considered as a potential model. It is also possible that the true model is infinite- dimensional. Therefore, our goal is usually to find a good approximation to the true model in order to make valid inference instead of searching the data generating function. MDL principle gives a flexible approach for choosing the model that best balances complexity and goodness-of-fit. The model selection method is concisely expressed by the MDL principle (Rissanen, 1978):

“Choose the model that gives the shortest description of data”.

Hansen and Yu (2001) have pointed out several ways to measure the code length 1 Introduction 19 of a given data set under this framework.

T T T Let us assume that the data string y = (y1 , ..., yn ) is modelled using a model in

M = {f(y|θ): θ ∈ Θ}, a class of models known up to θ. If θ is given, then the description length of the data can be simply found using the density function indexed by θ, i.e. L(y|θ) = − log fθ(y). However, since the parameter needs to be estimated, it is necessary to transmit the estimator θˆ too.

In regular parametric family, the convergence rate of the estimator is proportional √ to 1/ n. Therefore description length used to communicate the k-dimensional pa- rameter θˆ is L(θˆ) = k ∗ log(n)/2 nats. Combined with the description length of the actual data, the total code length used for data transmission is

L(y) = L(y|θˆ) + L(θˆ) k = − log f (y) + log(n), (1.6) θˆ 2 which is called two-stage description length. In the first stage, encode the parameter θˆ; in the second stage, encode the sequence of data with the distribution fθˆ characterized by the parameter θˆ. One can see that the description length is equivalent to the penalized log-likelihood used by BIC under the independent assumption. The term log(n)/2 reflects the precision used to encode each parameter with a uniform encoder.

If we assume that θ has the density function ω(θ), then the least integral upper bound 1 Introduction 20

Pk of L(θ) is − log ω(θ) − j=1 log δj, where δj is the precision to which the parameter

θj is truncated. In order to get the shortest description code length of the data, we minimize the formula L(y|θ) + L(θ). If we use rough precision to describe the parameter, the code length to encode the data will be longer since in general we use the incorrect estimate of the parameter. In most situations, the estimator has the √ precision 1/ n. In this case, Rissanen (1983) shows that:

k k X X √ − log ω(θ) − log δj = − log ω(θ) + log cj(n) n j=1 j=1 k = log n + O(k), 2 where ω(θ) is under the universal distribution. Therefore, when n is large, O(k) can be ignored and log n/2 is the general code length we used to encode one parameter. √ There is no need for us to use more accurate precision, since 1/ n is the highest precision we can get in the model estimating procedure.

An alternative formulation is found once we consider the model class in its entirety and we assume a distribution for the data using the mixture distribution induced by the user-defined probability distribution ω(θ) on the parameter space Θ. The mixture description length of y is then

Z − log m(y) = − log fθ(y)ω(θ)dθ. (1.7) 1 Introduction 21

If λ is a hyperparameter, i.e. ω(θ) = ω(θ|λ), then the code length used to transmit λ should be added. Rissanen (1989) emphasized that ω(θ) is not a Bayesian prior, but

“an artificial device to minimize the description length”.

In the next chapter we apply a hybrid form of MDL that combines the two-stage and the mixture MDL to the dependent data.

As AIC and BIC, the MDL criteria derived in this thesis take the forms of penalized log-likelihood. In two-stage method, the penalty k log(n)/2 is the code length to √ encode k parameters and 1/ n reflects the precision to encode one of the parameters with uniform encoder without considering the possible prior distribution of θ. Since, √ in general cases, 1/ n represents the convergence rate of the estimator which also reflects the scale of the estimation error, it is redundant to use more digits to transmit the parameter. In this case, log(n)/2 is a general approximate code length to encode one parameter. However, the coding rule suggests us to assign a short codeword to a common symbol and a long codeword to a rare symbol. If we believe the parameter follows a certain distribution, it is very possible that the code length to encode one parameter is other than log(n)/2. If the parameter is a high frequently used value, the actual code length may be shorter than log(n)/2. On the other hand, if the parameter is a low frequently used value, then the actual code length may be longer than log(n)/2. Therefore, the code length of the parameter should be adjusted given 1 Introduction 22 the data set under consideration. The mixture method shows its advantage at this point. Therefore, this method may performs better given a certain data set. However, when mixture method is used, the prior distribution of θ needs to be considered.

The mixture formulation involves the calculation of an integral that often cannot be computed in closed form. That restricts the application of the mixture method. In this case, the two-stage method shows its advantage. Since dependent data usually have more complicated structure than that of the independent data, we will combine the two-stage method and mixture method together to derive the selection criteria. Chapter 2

Variable Selection for Random

Effects Models via MDL

23 2 Variable Selection for Random Effects Models via MDL 24

Making valid statistical inferences is based on the data analysis process which in- cludes specifying a proper model formulation, a model selection criterion, a parameter estimation procedure and a measure of precision. Among all of these, model selection is one of the critical parts. There is an extensive model selection literature. How- ever, most model selection procedures are designed for independent data. Among those, AIC and BIC are the most widely used model selection criteria. However, due to the increasing need for vastly complex models, high-dimensional correlated data has attracted considerable attention in recent years. cAIC is one of the model selection criteria for such dependent data, especially for linear mixed-effects model.

As the name suggests, it is an extension of AIC. There are various criteria for the dependent data most of which were developed based on AIC or BIC. However, nei- ther AIC nor BIC dominates the other. The theoretical and simulation studies (for example, Breiman and Freedman (1983), Shibata (1981) and Speed and Yu (1993)) reveal that AIC works better, if the underlying model is infinitely dimensional which is possible due to the complicated data generating mechanism. On the other hand,

BIC does better, if the underlying model is finitely dimensional. Intuitively, BIC has stronger penalty which favours models with less variables than AIC. Therefore, we try to introduce another framework which try to produce a balance between these two classical model selection methods. Besides, the true model is an important issue we have to consider. Usually, it is impossible for the researchers to know. As we 2 Variable Selection for Random Effects Models via MDL 25 mentioned, the data generating mechanism is often complicated which increases the difficulty for the researchers to detect the true model. Therefore, our aim is to select good approximating model with enough information of the empirical data to make valid inferences. In the Minimum Description Length (MDL) principle, any proba- bility function is considered to measure the description length of the data, which is not necessarily the true underlying data generating function. There are a number of articles about the application of the MDL principle to independent data (for exam- ple, Hansen and Yu (2001), Hansen and Yu (2003)). In this chapter, we extend the principle to the random effects model. As an extension of the linear mixed-effects model, the functional data models are also considered in this chapter.

2.1 Linear Mixed-Effects Model

Recall the hierarchical formulation of the linear mixed-effects model (Laird and Ware,

1982)    yi = Xiβ + Ziγi + εi     γi ∼ N(0,D), (2.1)   εi ∼ N(0, Λi),     γ1, ...γn, ε1, ..., εn independent, 2 Variable Selection for Random Effects Models via MDL 26

T where yi = (yi1, . . . , yini ) is the ni ×1 response vector for the i-th subject, 1 ≤ i ≤ n,

Xi and Zi are (ni × p) and (ni × q) design matrices for fixed and random effects, respectively, β is the p × 1 vector of fixed effects parameters, γi is the q × 1 vector of random effects, and n is the number of subjects. It is often assumed that the random effects γi follows a multivariate normal distribution with mean zero and covariance

T matrix D, i.e., N(0,D), the ni × 1 vector of residuals, i = (i1, . . . , ini ) , follows

0 N(0, Λi), and γi, εi0 are mutually independent for all 1 ≤ i, i ≤ n. Throughout this

2 article we use the canonical form Λi = σ Ini for a clear exposition. As a consequence, the cluster-specific response vectors yi are independent and normally distributed, i.e.

T 2 yi ∼ N(Xiβ, ZiDZi + σ Ini ),

where i = 1, . . . , n.

Our current focus is on MDL-based methods for selecting the fixed effects covariates

β.

2.1.1 Case A: σ2 and D are known

In order to demonstrate the application of MDL to linear mixed-effects model, we begin with the simple case in which only β is unknown. In order to make the procedure 2 Variable Selection for Random Effects Models via MDL 27

T 2 and final criteria forms as simple and concise as possible, let Σi = ZiDZi + σ Ini ; then the marginal density function of yi has the form

  1 1 T −1 fθ(yi) = n /2 1/2 exp − (yi − Xiβ) Σi (yi − Xiβ) . (2.2) (2π) i |Σi| 2

Two-stage MDL

A straightforward way to measure the code length is to take the minus logarithm of the density function. Then add the uniform code length log(n)/2 for each dimension of the parameter. When β is the only unknown parameter, θ = β and k = p. Sub- stituting the function (2.2) as the density function in equation (1.6) and ignoring the constant terms, the description code length is

n X 1 p L(y) = (y − X βˆ)T Σ−1(y − X βˆ) + log(n), (2.3) 2 i i i i i 2 i=1 which obviously yields the similar model selection criterion as the BIC. However, for

Pn BIC, the penalty term is p log(N)/2, where N = i=1 ni. 2 Variable Selection for Random Effects Models via MDL 28

Mixture MDL

To transmit the unknown parameter, we also can apply the mixture MDL principle.

That involves finding a suitable prior distribution of β. A convenient distribution choice for assigning code length to β is the multivariate normal with covariance hy- perparameter V , say β ∼ N(0,V ), which is a conjugate prior for the linear model.

Then the marginal distribution of y is

m(y) Z = fβ(y)ω(β)dβ Z n ! ( n ) Y 1 X 1 T −1 = exp − (yi − Xiβ) Σi (yi − Xiβ) × (2π)ni/2|Σ |1/2 2 i=1 i i=1 1  1  exp − βT V −1β dβ (2π)p/2|V |1/2 2 n ! Pn T −1 −1 −1 1/2 Y 1 |( X Σ Xi + V ) | = i=1 i i × (2π)ni/2|Σ |1/2 |V |1/2 i=1 i ( " n n n n #) 1 X X X X exp − yT Σ−1y − ( yT Σ−1X )( XT Σ−1X + V −1)−1( XT Σ−1y ) . 2 i i i i i i i i i i i i i=1 i=1 i=1 i=1

That leads to the description length L(y|V ) = − log m(y|V ). We ignore the terms that do not contain any information about the unknown parameter β, since they won’t make any contribution to the model selection. The code length has thus the 2 Variable Selection for Random Effects Models via MDL 29 expression

L(y|V )

n 1 X = log |V XT Σ−1X + I | 2 i i i p i=1  n n ! n !−1 n ! 1 X X X X + yT Σ−1y − yT Σ−1X XT Σ−1X + V −1 XT Σ−1y . 2  i i i i i i i i i i i i  i=1 i=1 i=1 i=1 (2.4)

It is noticed that the length function depends on V that needs to be specified by the user. Since in practice, we usually use the generalized least squares estimator of β which has the form

n !−1 n ! ˆ X T −1 X T −1 βGLS = Xi Σi Xi Xi Σi yi , i=1 i=1 with variance covariance matrix

n !−1 ˆ X T −1 Var(βGLS) = Xi Σi Xi , i=1 only depending on Xi and the known variance-covariance matrix of y, we assume

Pn T −1 −1 that V = c( i=1 Xi Σi Xi) , where c ≥ 0 is a scalar. This leads to a simplifica- tion of the code length expression. The new code length function contains only one 2 Variable Selection for Random Effects Models via MDL 30 hyperparameter c

L(y|c) p = log(1 + c) 2  n n ! n !−1 n ! 1 X c X X X + yT Σ−1y − yT Σ−1X XT Σ−1X XT Σ−1y . 2  i i i 1 + c i i i i i i i i i  i=1 i=1 i=1 i=1 (2.5)

The MDL criterion will be obtained for that c ≥ 0 which yields the minimum de- scription length of the data, i.e. the solution to

∂ 0 = L(y|c) ∂c n ! n !−1 n ! p 1 X X X = − yT Σ−1X XT Σ−1X XT Σ−1y . 2(1 + c) 2(1 + c)2 i i i i i i i i i i=1 i=1 i=1

It is possible that there is no solution of the equation under the restriction that c ≥ 0

Pn T −1 −1 Pn T −1 −1 (This is imposed by the fact that V = c( i=1 Xi Σi Xi) and ( i=1 Xi Σi Xi) are positive definite matrices.), so we modify the estimator as

 n ! n !−1 n !  X T −1 X T −1 X T −1 cˆ = max  yi Σi Xi Xi Σi Xi Xi Σi yi /p − 1, 0 . (2.6) i=1 i=1 i=1 2 Variable Selection for Random Effects Models via MDL 31

To make the final form of the criterion more comprehensive, let

n n n X T −1 X T −1 −1 X T −1 FSSσ = ( yi Σi Xi)( Xi Σi Xi) ( Xi Σi yi). i=1 i=1 i=1

The symbol FSSσ is defined as the generalized form of factor sum of squares (FSS) in the linear model which has the form: FSS = yT X(XT X)−1XT y. In the following subsection the generalized form of the residual sum of squares (RSS) is also defined.

Substitutingc ˆ to (2.5), the lMDL1 criterion has the form

  1 Pn T −1 p FSSσ 1  2 [ i=1 yi Σi yi − FSSσ] + 2 [1 + log( p )] + 2 log(n), if FSSσ > p (2.7)  1 Pn T −1  2 i=1 yi Σi yi, otherwise, where log(n)/2 is the length of code needed for transmittingc ˆ. The prefix l is used to specify that this MDL criterion is derived for the linear mixed effects model. It is worth mentioning that, when FSSσ ≤ p, the estimatec ˆ = 0 and the “device” distribution of β becomes a point mass. This corresponds to the null model with all

fixed effects being zero.

Pn T −1 It is worth noting that the first term of the lMDL1 criterion, ( i=1 yi Σi yi −

FSSσ), is the log-likelihood. Examining (2.7) we see that lMDL1 has the same form of a penalized log-likelihood as AIC and BIC, but its penalty is data-adaptive.

More precisely, while in AIC and BIC the penalty only depends on the dimension 2 Variable Selection for Random Effects Models via MDL 32

of the parameter and the number of subjects, in lMDL1 both the data size and its dependence structure are taken into account. For instance, the penalty term depends on FSSσ which involves the covariance matrices Σi, i = 1, . . . , n.

2.1.2 Case B: σ2 and D are unknown

The criterion (2.7) is usually impractical since in most applications all the parameters,

{β, σ2,D}, are unknown. Since for modelling purpose the focus is on β, the variance- covariance components are treated as nuisance parameters. Nevertheless, one still needs to consider the impact of the dependence structure on the derivation of MDL criteria. As before we assume a normal distribution for β but we modify the variance specification to include the parameter σ2. More precisely, we scale the variance-

2 ∗ T 2 ∗ covariance matrix Σi = σ Wi, where Wi = ZiD Zi + Ini and D = σ D . Then

2 Pn T −1 −1 2 β ∼ N(0, σ V ), where V = c( i=1 Xi Wi Xi) . Put τ = σ and assume an inverse gamma distribution as the coding device,

√ a ω(τ) = √ e−a/2τ τ −3/2, 2π i.e., τ ∼ InvGamma(a, 3/2). The joint distribution of β and τ is

1  1  ω(β, τ) = exp − βT V −1β ω(τ). (2π)p/2|τV |1/2 2τ 2 Variable Selection for Random Effects Models via MDL 33

That leads to the marginal distribution of y

ZZ ∗ ∗ m(y|D ) = fβ,σ2 (y)ω(β, τ|D )dβdτ

n ! Z Y 1 Z 1 = × (2π)ni/2|τW |1/2 (2π)p/2|τV |1/2 i=1 i ( n ) X 1 1 exp − (y − X β)T W −1(y − X β) − βT V −1β ω(τ)dβdτ 2τ i i i i i 2τ i=1 n ! Pn T −1 −1 −1 1/2 Y 1 |( X W Xi + V ) | = i=1 i i × (2π)ni/2|W |1/2 |V |1/2 i=1 i Z   √ −N/2 1 a −a/2τ −3/2 τ exp − RSSV √ e τ dτ 2τ 2π n ! √ (N+1) Pn T −1 −1 −1 1/2  − 2 Y 1 |( X W Xi + V ) | a RSSV + a = i=1 i i √ × (2π)ni/2|W |1/2 |V |1/2 2 i=1 i 2π N + 1 Γ , 2 where

n n ! n !−1 n ! X T −1 X T −1 X T −1 −1 X T −1 RSSV = yi Wi yi− yi Wi Xi Xi Wi Xi + V Xi Wi yi , i=1 i=1 i=1 i=1

Pn and N = i=1 ni is the total number of observations. If we ignore the terms that do not depend on the particular choice of model, we obtain

n n 1 X 1 X − log m(y|D∗) = log |W | + log | XT W −1X + V −1| + 2 i 2 i i i i=1 i=1 1 1 N + 1 + log |V | − log a + log(RSS + a). (2.8) 2 2 2 V 2 Variable Selection for Random Effects Models via MDL 34

As we mentioned in the introduction, a is a device used to minimize the code length.

That yields the estimatea ˆ = RSSV /N. After substitution, equation (2.8) becomes

n n 1 X 1 X − log m(y|D∗) = log|W | + log | XT W −1X + V −1| + 2 i 2 i i i i=1 i=1 1 N + log|V | + log(RSS ). (2.9) 2 2 V

Pn T −1 −1 Replace V with c( i=1 Xi Wi Xi) . Then (2.9) becomes

− log m(y|D∗)

n n ! 1 X p N X c = log|W | + log(1 + c) + log yT W −1y − FSS . 2 i 2 2 i i i 1 + c W i=1 i=1 (2.10)

In turn, (2.10) is minimized when

(N − p)FSS  cˆ = max W − 1, 0 , pRSSW

Pn T −1 Pn T −1 −1 Pn T −1 where FSSW = ( i=1 yi Wi Xi)( i=1 Xi Wi Xi) ( i=1 Xi Wi yi)

Pn T −1 and RSSW = i=1 yi Wi yi − FSSW . Then we get the code length function

n 1 X p (N − p)FSSW N N ∗ RSSW L(y|D∗) = log |W | + log + log . (2.11) 2 i 2 pRSS 2 N − p i=1 W 2 Variable Selection for Random Effects Models via MDL 35

In this formulation the criterion depends on D∗. However introducing a device to encode the q × q covariance matrix D∗ does neither yield a closed form criterion, nor is amenable to efficient numerical computation. We thus adopt a two-stage principle to deal with D∗ by plugging in its consistent estimate and increasing the code length by s log(n)/2, where s is the number of distinct parameters used for modelling D∗.

Considering the code length log(n) to transmit the two hyperparameters a and c, the criterion has the form

 p N 1 n h (N−p)FSS ˆ i 2  N∗RSS ˆ  2 s+2 FSS ˆ /RSS ˆ  P ˆ W W 2 W W  i=1 log |Wi| + log + log + log(n ), if > 1 2 pRSSWˆ N−p p/(N−p)  1 PN ˆ N Pn T ˆ −1 s+1  2 i=1 log |Wi| + 2 log i=1 yi Wi yi + 2 log(n), otherwise, (2.12) which is denoted by lMDL2. In this formula, “b” is generic notation for those quan- tities by substituting any root-n consistent estimate of D∗, e.g., maximum likelihood

(ML) or restricted ML (REML) estimate. Similar to lMDL1, the second form is taken whenc ˆ = 0 leading to the null model. To better understand (2.12), if we ignore the constant part, n 1 X N N · RSS ˆ − log |Wˆ | − log W 2 i 2 N − p i=1 is in fact the log-likelihood expression with REML estimate of D∗ and the rest plays the role of penalty on model complexity which creates a distinction from AIC and

BIC by incorporating the dependence structure inherent to the data. 2 Variable Selection for Random Effects Models via MDL 36

2.2 Theoretical Results

In this section we show that the MDL procedures not only possess the desirable property from data compression perspective, but also enjoy the consistency of BIC.

On the other hand, due to the data-adaptive penalty, when the model contains more covariates, where the strength of BIC penalty term may be overly improved, the under selection issue can be adjusted. In this case, the model selection criteria lean to the

AIC’s behaviour while the performance of the model selection criteria is improved for the finite sample performance as illustrated by the simulation studies in the following sections.

We begin our exposition from the point of view of noiseless compression of data which bears a close resemblance to statistical modelling. In general, suppose that fθ(y) is the true density function and Q(y) is any other density function with the corresponding code length (− log Q). To encode the data string y, some extra nats

(the unit of the length function) are needed to transmit the data without knowing the true distribution. The expected value of this extra code length is called the redundancy of Q defined as R(Q) = Eθ{− log Q(y) − [− log fθ(y)]}. One can see that the redundancy coincides with the Kullback-Leibler divergence. Hansen and

Yu (2001) pointed out that if Q can achieve the “smallest” redundancy possible for all members in M = {f(y|θ): θ ∈ Θ}, then (− log Q) is a valid description 2 Variable Selection for Random Effects Models via MDL 37 length for the data string based on models from the class M. Rissanen (1986) has shown that, given the sample size n and the k-dimensional unknown parameter θ, √ if a n-rate estimator θˆ(y) exists and it has uniformly summable tail probabilities,

√ ˆ P Pθ{ n||θ(y) − θ|| ≥ log(n)} ≤ δn, for all θ and n δn ≤ ∞, then the redundancy for any density Q satisfies, for all θ ∈ Θ except on a set with Lebesgue measure zero,

E log[f (y)/Q(y))] lim inf θ θ ≥ 1. (2.13) n→∞ (k/2) log n

This implies that one needs at least (k/2) log n additional bits to encode the data without knowing the true distribution fθ. If R(Q) = (k/2) log n[1 + o(1)], we say that the redundancy of Q achieves the lower bound. If one focuses on the primary parameters, the following proposition can be obtained.

2 −1 Pn T −1 Proposition 1. When σ and D is known, assume that n i=1 Xi Σi Xi → C as n → ∞ for some positive definite matrix C, max1≤i≤n ni < ∞, then the redundancy of the density that induces lMDL1 achieves the lower bound.

Proof of Proposition 1. In our case Q(y) = m(y) and θ = β, the numerator in (2.13) 2 Variable Selection for Random Effects Models via MDL 38 is

2Eβ[log fβ(y) − log m(y)] n X T −1 = Eβ[lMDL1 − (yi − Xiβ) Σi (yi − Xiβ)] i=1

= pEβ log(FSSσ/p)

≤ p log(EβFSSσ/p).

Pn T −1 Pn T −1 −1 Pn T −1 Recall that FSSσ = ( i=1 yi Σi Xi)( i=1 Xi Σi Xi) ( i=1 Xi Σi yi), and

n n ˆ X T −1 −1 X T −1 βGLS = ( Xi Σi Xi) ( Xi Σi yi), i=1 i=1 with n ˆ X T −1 −1 Varβ(βGLS) = ( Xi Σi Xi) . i=1 2 Variable Selection for Random Effects Models via MDL 39

Then

n ˆT X T −1 ˆ EFFSσ = EββGLS( Xi Σi Xi)βGLS i=1 n ˆT X T −1 ˆ = EβT rβGLS( Xi Σi Xi)βGLS i=1 n X T −1 ˆ ˆT = T r[( Xi Σi Xi)Eβ(βGLSβGLS)] i=1 n n X T −1 X T −1 −1 T = T r[( Xi Σi Xi)(( Xi Σi Xi) + ββ )] i=1 i=1 n X T −1 T = T r[Ip + ( Xi Σi Xi)ββ ] i=1 n T X T −1 = p + β ( Xi Σi Xi)β. i=1

−1 Pn T −1 T Pn T −1 The assumption n i=1 Xi Σi Xi → C > 0 indicates that β ( i=1 Xi Σi Xi)β =

O(n). Then

E [log f (y) − log m(y)] lim inf β β n→∞ (p/2) log n (p/2) log(1 + βT (Pn XT Σ−1X )β/p) ≤ lim inf i=1 i i i n→∞ (p/2) log n = 1,

together with (2.13), completes the proof.

Although the MDL principle is well motivated from data compression perspective, 2 Variable Selection for Random Effects Models via MDL 40 it is of more interest to assess whether the proposed MDL formulations lead to sta- tistically sensible model selection procedures. As we mentioned in the introduction section, AIC and BIC can not share each other’s advantages. AIC achieves asymp- totic optimality when the true underlying model is infinite-dimensional, while BIC possesses selection consistency when the “true” model is finite-dimensional. The con- sistency property leads to predictive optimal. Since the proposed MDL procedures contain the data dependent penalties, we show that they join the advantages of BIC and try to mimic the advantages of AIC.

For technical convenience, we consider the balance design with ni = m < ∞, for i = 1, . . . , n, and all the individuals are independent and identical realizations. We borrow the ideas from Breiman and Freedman (1983) to assume that the columns of the m × p design matrix Xi are independent and identically distributed (i.i.d.) as multivariate-normal N(0, Φ). To make the model meaningful in both finite and

2 P∞ 2 2 infinite dimensional cases, we should have σ0 = j=1 βj < ∞, and then define σp =

P∞ 2 2 j=p+1 βj . When the “true” model is finite-dimensional, there is a p0, such that σp =

0, for all p ≥ p0. Since W is symmetric positive definite matrix, there is a matrix M s.t. M 2 = W . Then the hat matrix has the form H = M −1X(XT W −1X)−1XT M −1, where X is the N × p covariate matrix. Let X(j) denote the jth column of the covariate matrix. Given the assumption, it is independent with X when j > p. It can

(j)T −1 −1 (j) be proved that T r(X M (I − H)M X )/(N − p) → c1 when N − p = Op(n) 2 Variable Selection for Random Effects Models via MDL 41

(j)T −1 (j) and T r(X W X )/N → c0. Then under the random design assumptions stated above, we have

Theorem 1. If N − p = Op(n), then

N FSS = [c σ2 − c σ2 + (p/N)(c σ2 + σ2)](1 + o (1)). σ σ2 0 0 1 p 1 p p

2 2 (N − p)FSSW N c0σ0 − c1σp = [ 2 2 + 1](1 + op(1)). pRSSW p c1σp + σ

for some positive constants c0 and c1, where FSSσ is as in (2.7) and FSSW and

RSSW are as in (2.12).

Proof of Theorem 1. Since our interest is about the marginal distribution of y, we sim- plify the form of the model as yi = Xiβ + i, where i = Ziγi + εi. Since Wi is positive

T 2 definite, we can find Mi = Mi such that Wi = Mi . Then we can consider the prob-

−1 −1 −1 −1 −1 lem as a simple linear regressing Mi Xi on Mi yi, i.e. Mi yi = Mi Xiβ + Mi i.

∗ −1 ∗ −1 ∗ −1 Let yi = Mi yi, Xi = Mi Xi and i = Mi i. The new errors are independent

∗ 2 and identically distributed, i.e. var(i ) = σ Ini . Following Breiman and Freedman

(1983), we can extend part of their paper to the correlated case. Let

n m 1 X X R = (y∗ − yˆ∗ )2 = RSS /(N − p), np N − p ik ik W i=1 k=1 2 Variable Selection for Random Effects Models via MDL 42

∗ Pp ∗ ˆ ∗ P∞ ∗ wherey ˆik = j=1 xijkβj. Furthermore, let δi = j=p+1 Xijβj and S = S(n, p) =

∗ 2 ∗ ∗T ∗ −1 ∗T ∗ ||(I − H)y || = (N − p)Rnp, where H = X (X X ) X . We use X to denote the first p covariates. Then S can be decomposed as

∗ ∗ 2 S = ||(I − H)( + δ )|| = S1 + S2 + S3,

where

∗ 2 S1 = ||(I − H) || ,

∗ 2 S2 = ||(I − H)δ || ,

∗ ∗ S3 =< (I − H) , (I − H)δ > .

Since ∗ and δ∗ are independent and they are also independent from (I − H), the cross term S3 can be ignored. It is straightforward that

(I − H)2 = I − H − H + H2

= I − 2H + X∗(X∗T X∗)−1X∗T X∗(X∗T X∗)−1X∗T

= I − H. 2 Variable Selection for Random Effects Models via MDL 43

Therefore, (I − H) is a idempotent matrix. The trace of it is

T r(I − H) = T r(I) − T r(H)

= N − T r(X∗(X∗T X∗)−1X∗T )

= N − T r((X∗T X∗)−1X∗T X∗)

= N − p.

Since (I − H) and ∗ are independent, given that fact that (I − H) is a idempotent

2 PN−p 2 matrix and the trace of the matrix is N − p, we have S1 = σ j=1 ξi , where ξis are iid standard normal distributed. If we have N − p → ∞ when N → ∞, given

. 2 . the strong law of large number, S1 = (N − p)σ , where = has the same definition as in Breiman and Freedman (1983), which means nearly equal and is used only informally. Let Q2 = diag{Φ,..., Φ} and X(j) denotes the jth column of X. Then define η(k) = Q−1X(k), where all the elements of η(k) are i.i.d. standard normal 2 Variable Selection for Random Effects Models via MDL 44

distributed. The second term S2 can be written as

∗ 2 S2 = ||(I − H)δ ||

= δ∗T (I − H)δ∗

∞ X 2 (j)T −1 −1 (j) = βj X M (I − H)M X j=p+1 ∞ X 2 (j)T −1 −1 (j) = βj η QM (I − H)M Qη . j=p+1

. 2 PN−p 2 Since Q and M are full rank matrices and rank(I − H) = N − p, S2 = σp j=i λjξj ,

−1 −1 where λjs are the eigenvalues of QM (I − H)M Q. Given the fact that all the matrices are block wised with number of blocks n and all blocks within one matrix have the same distribution, T r(QM −1(I − H)M −1Q)/n convergences to a constant

−1 −1 number. When N −p = Op(n), we have the form T r(QM (I −H)M Q)/(N −p) →

. 2 c1 a.s., for some c1 > 0. In another word, S2/(N − p) = c1σp. As for the term,

2 FSSW = ||Hy ∗ || , one has

∗ 2 ∗ 2 ∗ 2 FSSW = ||y || + ||Hy || − ||y || ∞ X 2 (j)∗ ∗ 2 = || βj X +  || − (N − p)RSSW j=1 . 2 2 2 2 = Nc0σ0 + Nσ − (N − p)σ − (N − p)c1σp

2 2 2 2 = pσ + Ncoσ0 − Nc1σp + pc1σp, 2 Variable Selection for Random Effects Models via MDL 45

where $\mathrm{Tr}(QW^{-1}Q)/N \to c_0$ a.s., for some $c_0 > 0$. Therefore, we have the final form
\[
\frac{(N-p)\,FSS_W}{p\,RSS_W} = \frac{N}{p}\left[\frac{c_0\sigma_0^2 - c_1\sigma_p^2 + (p/N)(c_1\sigma_p^2 + \sigma^2)}{c_1\sigma_p^2 + \sigma^2}\right](1 + o_p(1)).
\]
Since $FSS_W > 0$, $c_0\sigma_0^2 - c_1\sigma_p^2 + (p/N)(c_1\sigma_p^2 + \sigma^2) > 0$. When $N \to \infty$, $c_0\sigma_0^2 - c_1\sigma_p^2 \ge 0$. We know that $FSS_\sigma = FSS_W/\sigma^2$, which leads to
\[
FSS_\sigma = \frac{N}{\sigma^2}\left[c_0\sigma_0^2 - c_1\sigma_p^2 + (p/N)(c_1\sigma_p^2 + \sigma^2)\right](1 + o_p(1)),
\]
and this completes the proof.

Corollary 1. Since the BIC with a penalty of $\log(N)$ is consistent, it follows that when the model is finite-dimensional and the maximum dimension satisfies $p = o(n)$, lMDL1 and lMDL2 are consistent.

It is worth mentioning that the choice of the consistent estimate of $D^*$ does not affect the selection consistency of lMDL2. For instance, either the ML or the REML estimate of $D^*$ can be used and yields comparable results, as evidenced by our empirical studies (not reported here for length considerations). Consequently, when the model is finite-dimensional, it is clear that $FSS_\sigma = O_p(N)$ (consistency is proved in a different way in Appendix A) and $(N-p)FSS_{\hat W}/(p\,RSS_{\hat W}) = O_p(N)$. By analogy with BIC, lMDL1 and lMDL2 are then consistent model selection procedures.

Interestingly, when the model is infinite-dimensional in the sense that $p$ may increase with sample size, $[c_0\sigma_0^2 - c_1\sigma_p^2]/[p(c_1\sigma_p^2 + \sigma^2)]$ is a factor that reflects "the average signal-to-noise ratio for the fitted model". This factor balances between $N$ and $p$: it makes the penalty relatively small when $p$ is large, so that more covariates tend to be included, thus mimicking the behaviour of AIC. Therefore the proposed MDL criteria adjust themselves to the underlying model, which makes them suitable for different data generating mechanisms and helps the MDL criterion keep performing well on complicated real-world data. The simulation results also support this theoretical analysis.

2.3 Simulation

The simulations have been designed to compare the performance of the MDL criteria proposed here for the linear mixed-effects model with that of AIC and BIC. Previous comparisons have shown that MDL has a more balanced performance than AIC and BIC, as its capability is close to the best of the two in a wide range of scenarios for independent data (see, for example, Hansen and Yu, 2003; Craiu and Lee, 2005), and we expect similar findings for the dependent data models considered here. To demonstrate the application of the MDL criterion for the linear mixed-effects model, we simulate data for $n = 50$ clusters/individuals, each containing $n_i = 5$ repeated measurements, for all $1 \le i \le 50$. We consider $p = 6$ covariates for the fixed effects and $q = 3$ covariates for the random effects (all covariates are independently generated). We assume the $k$th column of the fixed-effects matrix $X_i$ is $x_{ki} \in \mathbb{R}^{n_i}$, $1 \le k \le p$, where $x_{1ij} = 1$, $x_{2ij} \sim \mathrm{Exponential}(1.5)$, $x_{3ij} \sim \mathrm{Uniform}(-1.5, 1.5)$, $x_{4ij} \sim \mathrm{Beta}(2, 2)$, $x_{5ij} \sim \mathrm{Exponential}(2)$ and $x_{6ij} \sim N(0, 1)$, $1 \le j \le n_i$.

Considering a similar structure for the random-effects matrix, we set $z_{1ij} = 1$, $z_{2ij} \sim \mathrm{Uniform}(-1, 1)$ and $z_{3ij} \sim N(0, 1)$, while the error vector $\varepsilon_i \sim N(0, 2I_5)$.

Since our interest is in the population properties of the outcomes, we focus our selection on the covariates of the fixed effects. When we estimate the parameters and select the final model, the true covariates of the random effects are used.

The fixed coefficients are $\beta_1 = 0.5$, $\beta_2 = 1$, $\beta_3 = 0.6$, $\beta_4 = 2$, and $\beta_5 = 0.8$, and the random coefficients are generated from a multivariate normal distribution with mean zero, variances 1, 4, 3, and $\mathrm{cov}(\gamma_{i1}, \gamma_{i2}) = 0.8$, $\mathrm{cov}(\gamma_{i1}, \gamma_{i3}) = 0.5$, $\mathrm{cov}(\gamma_{i2}, \gamma_{i3}) = 0.4$.
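To make the design concrete, the following is a minimal Python/NumPy sketch of the data generating mechanism just described. It treats the stated cov values as covariances of the random coefficients and reads Exponential(lambda) as the rate parameterization; both readings, and the function name simulate_lme, are our own assumptions, since the thesis does not spell them out.

```python
import numpy as np

def simulate_lme(n=50, ni=5, sigma2=2.0, seed=0):
    """Sketch of the LME simulation design above (illustrative, not thesis code)."""
    rng = np.random.default_rng(seed)
    # random-effects covariance: variances 1, 4, 3 with the stated covariances
    D = np.array([[1.0, 0.8, 0.5],
                  [0.8, 4.0, 0.4],
                  [0.5, 0.4, 3.0]])
    # beta_6 = 0 here so that the largest true model uses x1,...,x5
    beta = np.array([0.5, 1.0, 0.6, 2.0, 0.8, 0.0])
    data = []
    for _ in range(n):
        X = np.column_stack([np.ones(ni),
                             rng.exponential(scale=1 / 1.5, size=ni),  # rate 1.5 assumed
                             rng.uniform(-1.5, 1.5, size=ni),
                             rng.beta(2, 2, size=ni),
                             rng.exponential(scale=1 / 2, size=ni),    # rate 2 assumed
                             rng.normal(size=ni)])
        Z = np.column_stack([np.ones(ni),
                             rng.uniform(-1, 1, size=ni),
                             rng.normal(size=ni)])
        gamma = rng.multivariate_normal(np.zeros(3), D)
        eps = rng.normal(scale=np.sqrt(sigma2), size=ni)
        y = X @ beta + Z @ gamma + eps
        data.append((y, X, Z))
    return data
```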

Our comparison focuses on lMDL2, as this is the criterion most likely to be used in a realistic data analysis. In Table 2.1 we compare the performance of lMDL2 with that of AIC and BIC based on 100 replicates for each true model setting. It can be seen that when the number of fixed-effect covariates is low, AIC always chooses a bigger model and its performance is inferior to that of BIC. In the case of 5 fixed effects included in the model, though AIC and BIC perform similarly, they err in opposite directions in terms of over-/under-selection. We remark that the lMDL2 criterion's performance is consistently close to the winner between AIC and BIC. This feature can be very useful in practical data analysis where one has little idea whether to use BIC or AIC.

True (T)   (x1,x2)        (x1,x2,x3)     (x1,x2,x3,x4)  (x1,x2,x3,x4,x5)
Method     AIC BIC MDL    AIC BIC MDL    AIC BIC MDL    AIC BIC MDL
F < T        0   0   0      1   3   2      0   5   4      2  13  10
F = T       78  99  95     74  96  90     78  91  92     87  87  88
F > T       22   1   5     25   1   8     22   4   4     11   0   2

Table 2.1: Comparison of the lMDL2 criterion with AIC and BIC for LME; shown are the numbers of cases out of 100 Monte Carlo runs. In the table, F < T (or F > T) denotes that a given criterion chooses a model with fewer (or more) covariates than the true model, while F = T indicates that the true model is selected.

The results are not very surprising. All three criteria take the form of a penalized log-likelihood, as mentioned in the introduction. When our interest is at the population level, we can focus the discussion on the parameters of the fixed effects. The penalty terms depending on the fixed effects are $R_{AIC} = 2p$, $R_{BIC} = p\log(N)/2$, and $2R_{MDL} = p\log\big(\frac{(N-p)FSS_{\hat W}}{p\,RSS_{\hat W}}\big)$. If we consider these three penalties from the information point of view, they can be treated as the code lengths needed to encode a $p$-dimensional parameter. In AIC, the code length to encode one parameter is a constant, regardless of the sample size and the data structure. However, to achieve the precision required to describe the data and to make a model selection decision, the code length for encoding one parameter is $\log(n)/2$ in the common situation, where $n$ is the number of subjects in the linear mixed-effects model; when the sample size is large, this exceeds a constant code length. In BIC, the code length is thus governed by the number of individuals. In coding theory, we assign short codewords to symbols used with high frequency and long codewords to low-frequency symbols. Therefore, the code length to encode one parameter should vary with the sample and the underlying model. In particular, when we consider the correlation within each subject, $\log(N)/2$ may be too long to encode one parameter that takes a commonly occurring value, while for a parameter taking a rare value given the observed data, $\log(N)/2$ may be shorter than the desired code length. Therefore, we seek a more precise way to calculate the code length; this is also the reason we prefer mixture MDL to two-stage MDL. In lMDL2, $\log\big(\frac{(N-p)FSS_{\hat W}}{p\,RSS_{\hat W}}\big)/2$ is the average code length to encode one parameter given the observed data set. Therefore, the MDL criterion in the table is more stable.

2.4 An Extension of MDL for Linear Mixed

Model: MDL for Functional Data

The proposed MDL procedures for linear mixed-effects models can easily be extended to other correlated data models. We present here two scenarios: 1) the reduced rank model for functional data based on the principal component representation, and 2) principal components analysis through conditional expectation. The former is a direct extension based on the spline specification that adapts the functional model to the LME framework, while the latter needs further derivation.

2.4.1 Functional Data Analysis

Functional data take the form of random curves which vary with time; the observations are noisy sampled points from these curves. Consider, for instance, the CD4 data in Figure 2.1. CD4 is the primary receptor used by HIV-1 to gain entry into host T cells, and it plays a critical role in the immune system. CD4 counts or percentages decrease on average throughout the disease incubation period, so they can be used to track the health status of an HIV-infected person. Intuitively, there is a decreasing function that roughly sketches the relationship between the CD4 percentages and time. The data in Figure 2.1 are part of the results of the Multicenter AIDS Cohort Study, in which 283 homosexual men who became HIV-positive were followed between 1984 and 1991 and their CD4 percentages recorded. The measurements were scheduled to be made semi-annually; however, due to individual circumstances, the data are sparse, with an unequal number (1 to 14) of repeated measurements per person. Besides the overall trend, there is noise in the response values, which manifests as features specific to each individual. Therefore, our goals are to uncover the overall trend over time, to reveal subject-specific variation patterns, to extract the dominant modes of variation, and to recover individual trajectories from sparse measurements.

[Figure 2.1: Trajectories of CD4 percentages (CD4 percentage versus year).]

This can be done by functional principal components analysis. We assume that all the curves are independent and can be expressed by a smooth random function $X(t)$ with the unknown overall mean function $EX(t) = \mu(t)$ and the covariance function $\mathrm{cov}(X(s), X(t)) = G(s,t)$. Typically, the index variable is time within a bounded and closed interval $\mathcal{T}$; in practice, however, it could be any other continuous variable. In the classical Functional Principal Components (FPC) analysis (see, for example, Ramsay and Silverman (2005)), the $i$th random curve has the form

\[
X_i(t) = \mu(t) + \sum_{k=1}^{\infty}\xi_{ik}\phi_k(t), \quad t \in \mathcal{T}, \qquad (2.14)
\]
where the $\phi_k(t)$ are eigenfunctions and the $\xi_{ik}$ are uncorrelated random variables with mean zero and variances $E\xi_{ik}^2 = \lambda_k$, where the $\lambda_k$ are non-increasing eigenvalues, i.e. $\lambda_1 \ge \lambda_2 \ge \ldots$ and $\sum_k \lambda_k < \infty$. Intuitively, we want the leading components to contain the richest information, i.e., the largest variance. Moreover, the covariance can be expressed in an orthogonal (in the $L^2$ sense) form through the eigenvalues and eigenfunctions, i.e. $G(s,t) = \sum_{k=1}^{\infty}\lambda_k\phi_k(s)\phi_k(t)$, $s, t \in \mathcal{T}$. However, in practice it is impossible to estimate an infinite number of eigenfunctions; therefore, we truncate the random-effects part to the $q$ leading terms. Measurement error is unavoidable in any model. Let $\epsilon_{ij}$ denote the additional measurement error for the $j$th observation on the $i$th curve, $j = 1, \ldots, n_i$, where $n_i$ is the number of repeated measurements on the $i$th curve. The noise $\epsilon_{ij}$ are i.i.d. Gaussian with mean zero and variance $\sigma^2$, and are independent of all the other random variables. Let $y_{ij}$ denote the observations. Then the model we consider is
\[
y_{ij} = \mu(t_{ij}) + \sum_{k=1}^{q}\xi_{ik}\phi_k(t_{ij}) + \epsilon_{ij}, \quad t_{ij} \in \mathcal{T}. \qquad (2.15)
\]
Usually, there are two ways to select the number of eigenfunctions. The first is the percentage of variability explained: assuming there are $q$ principal components, we can choose the number of components that explains, say, 90% of the variability. The problem, however, is how to justify the 90% threshold; unfortunately, there is no uniform standard for choosing it. The second way is to use traditional model selection criteria, such as AIC and BIC, whose advantages and disadvantages have already been discussed. We will demonstrate the application of the MDL criteria for the linear mixed-effects model on two different functional data models.

2.4.2 Reduced Rank Model

A Brief Review

As an intuitive extension of the linear mixed-effects model, Rice and Wu (2001) proposed mixed-effects models for functional data. The design matrices are replaced by two spline bases $B_k(t)$ and $\bar B_k(t)$ on $[0, T]$. The model has the form
\[
y_{ij} = \sum_{k=1}^{p} B_k(t_{ij})\beta_k + \sum_{k=1}^{q}\bar B_k(t_{ij})\gamma_{ik} + \epsilon_{ij}, \qquad (2.16)
\]
where $\beta_k$ is the coefficient for the mean function and $\gamma_{ik}$ is the random coefficient with mean zero and covariance $D$. The first term on the right side of equation (2.16) is the mean function, and the second term represents the variation between individuals, analogous to the second term in equation (2.15).

James et al. (2000) combined the ideas of the mixed-effects model and functional principal components to propose the reduced rank model. The name reflects the property that the number of eigenvalues is less than the dimension of the fixed parameter for the mean trend. As in the mixed-effects model, the mean function $\mu(t)$ is represented by a $p$-dimensional orthonormal spline basis and a $p$-dimensional vector of coefficients $\beta_\mu$, i.e. $\mu(t) = b(t)^T\beta_\mu$. The eigenfunctions $\phi_k(t)$ are also portrayed in terms of the spline functions plus a $p \times q$ matrix $\Theta$ of coefficients for the eigenfunctions, where $p > q$, i.e. $(\phi_1(t), \ldots, \phi_q(t)) = b(t)^T\Theta$. Therefore, the FPC model (2.15) has an LME-like representation,
\[
y_{ij} = b(t_{ij})^T\beta_\mu + b(t_{ij})^T\Theta\xi_i + \epsilon_{ij}, \qquad (2.17)
\]
where $\xi_i \overset{i.i.d.}{\sim} N(0, D)$, $\epsilon_{ij} \overset{i.i.d.}{\sim} N(0, \sigma^2)$, and $\Theta$ and $b(t)$ satisfy the orthonormality constraints $\Theta^T\Theta = I_q$ and $\int_{\mathcal{T}} b(t)b(t)^T dt = I_p$. To coincide with equation (2.15), we restrict $D$ to be a diagonal matrix. This model is called the reduced rank model and has a form similar to the linear mixed model: $b(t)^T$ is equivalent to the raw elements of the design matrices, and $\beta_\mu$ plays the same role as $\beta$ in the linear mixed-effects model. The primary task of model selection in this setting is to choose an appropriate number of eigenfunctions, $q$, for approximating the data. It is also necessary to have a sufficient number of basis functions, $p$, determined by the knot sequence and the degree of the spline functions. Therefore the proposed lMDL2 can readily be used to choose $q$ and $p$ simultaneously. The approach requires encoding the estimated $\Theta$ with an additional $[pq - q(q+1)/2]$ parameters resulting from the constraint $\Theta^T\Theta = I_q$.

Simulation

We test the MDL criterion's ability to choose a suitable number of eigenfunctions (i.e., the model dimension) on a sample of $n = 100$ curves, each generated from the underlying model
\[
X_i(t) = \mu(t) + \sum_{m=1}^{2}\xi_{im}\phi_m(t), \qquad 1 \le i \le 100,
\]
where $\xi_{i1} \sim N(0, 4)$, $\xi_{i2} \sim N(0, 1)$, $\mu(t) = t + \sin(t)$, $\phi_1(t) = -\sqrt{2/T}\cos(2\pi t/T)$, $\phi_2(t) = \sqrt{2/T}\sin(2\pi t/T)$, with $T = 10$ and $t \in [0, T]$. The number of repeated measurements within each individual, $n_i$, is the smallest integer larger than a random number generated from the uniform distribution on $[0, 10]$. The time points $t_{ij}$ are independently generated from $U[0, T]$, while the i.i.d. noise $\epsilon_{ij} \sim N(0, 0.25)$. We adopt the commonly used orthonormal cubic splines with knots at equi-quantiles of all observations $\{t_{ij} : j = 1, \ldots, n_i;\ i = 1, \ldots, n\}$ to avoid the difficulty caused by the clustered design. We replicate the results using 100 independent Monte Carlo runs produced under the same data generating mechanism.
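As an illustration, a minimal NumPy sketch of this curve-generating mechanism might look as follows; the function name and the returned format are ours, not the thesis's code.

```python
import numpy as np

def simulate_curves(n=100, T=10.0, seed=1):
    """Sketch of the reduced rank simulation design above (illustrative)."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n):
        ni = int(np.ceil(rng.uniform(0, 10)))       # smallest integer above U(0, 10)
        t = rng.uniform(0, T, size=ni)              # observation times
        phi1 = -np.sqrt(2 / T) * np.cos(2 * np.pi * t / T)
        phi2 = np.sqrt(2 / T) * np.sin(2 * np.pi * t / T)
        xi1, xi2 = rng.normal(0, 2), rng.normal(0, 1)   # variances 4 and 1
        mu = t + np.sin(t)
        y = mu + xi1 * phi1 + xi2 * phi2 + rng.normal(0, 0.5, size=ni)  # noise var 0.25
        data.append((t, y))
    return data
```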

Note that in LME the number of fixed effects, $p$, and the number of random effects, $q$, are not necessarily related. However, the reduced rank model is designed to capture the most variability in the data using as few eigenfunctions as possible, so the condition $q < p$ is generally imposed in model (2.17). Keeping in mind that the choice of $q$ is of primary interest, we need only an adequate number of basis functions to characterize the curves, i.e., $p$ serves as a nuisance tuning parameter. Choosing $p$ is therefore less urgent, and we focus our interest on $q$. We know that the true number of eigenfunctions is 2, so we choose $p$ from 4 to 6 to compare the performance of the three criteria under these three different basis setups.

The multiplicative factor (i.e., degrees of freedom) in the penalty terms of AIC and BIC is $p + [pq - q(q+1)/2]$, due to the additional parameters in $\Theta$. In Table 2.2, we can see that it is enough to use 5 basis functions to fit the model. Therefore, we use the three criteria, AIC, BIC and lMDL2, to jointly select $p$ and $q$ subject to $q < p$ with $p \in \{4, 5, 6\}$. The results are shown in Table 2.3. From Tables 2.2 and 2.3 we see that

AIC doesn't perform well, with too many over-selected cases, while BIC is comparable with the MDL criterion. The MDL criterion selects the true number of eigenfunctions most of the time.

p          4              5              6
Method     AIC BIC MDL    AIC BIC MDL    AIC BIC MDL
q < 2        0   9  19      0   0   0      0   0   0
q = 2       47  70  66     70  98 100     68  99 100
q > 2       53  21  15     30   2   0     32   1   0

Table 2.2: Comparison of the lMDL2 criterion with AIC and BIC for the reduced rank model based on 100 Monte Carlo runs, using three different structures of the basis. In the table, q < 2 means that fewer eigenfunctions than the truth are used to fit the model; q = 2 denotes that the true number of eigenfunctions is used; and q > 2 indicates that more eigenfunctions are used. The number of eigenfunctions must be less than p. We use B-splines to derive the basis.

Method     AIC            BIC            MDL
p          4   5   6      4   5   6      4   5   6
q < 2      0   0   0      0   0   0      0   0   0
q = 2      0  63   5      0  93   5      0  97   3
q > 2      0  27   5      0   2   0      0   0   0

Table 2.3: Comparison of the lMDL2 criterion with AIC and BIC for the reduced rank model based on 100 Monte Carlo runs. We use the criteria to jointly select p and q subject to q < p with p ∈ {4, 5, 6}. In the table, q < 2 means fewer eigenfunctions than the truth are selected; q = 2 denotes that the true number of eigenfunctions is chosen; and q > 2 indicates over-selection of eigenfunctions.

2.4.3 Principal Components Analysis through Conditional

Expectation

A Brief Review

Yao et al. (2005) proposed functional principal component analysis through conditional expectation, in which the idea of pooling information from all subjects is adopted to overcome sparseness. The mean function, covariance, eigenfunctions and eigenvalues are estimated based on the whole population; however, the random effect $\xi_{ik}$ is estimated by PACE (principal component analysis through conditional expectation) given each individual trajectory. If we assume that the first $p$ leading terms are enough to capture the main features of the curves, then the marginal distribution of $y_i = (y_{i1}, \ldots, y_{in_i})^T$ is approximately multivariate normal with mean $\mu(t)$ and variance-covariance matrix $\Sigma_i = G(t) + \sigma^2 I_{n_i}$, where $G(s, t) = \sum_{k=1}^{p}\lambda_k\phi_k(s)\phi_k(t)$.

To choose the number of eigenfunctions, an AIC-type criterion was suggested by Yao et al. (2005) via the pseudo-Gaussian log-likelihood
\[
\hat L = \sum_{i=1}^{n}\Big\{-\frac{n_i}{2}\log(2\pi\hat\sigma^2) - \frac{1}{2\hat\sigma^2}\Big(y_i - \hat\mu_i - \sum_{k=1}^{p}\hat\xi_{ik}\hat\phi_{ik}\Big)^T\Big(y_i - \hat\mu_i - \sum_{k=1}^{p}\hat\xi_{ik}\hat\phi_{ik}\Big)\Big\}.
\]
It is obvious that the pseudo log-likelihood is built at the cluster level, since the subject-specific estimates $\hat\xi_{ik}$ are included in the formula. However, the eigenfunctions and eigenvalues are concepts defined at the population level. It is therefore more reasonable to use a model selection method derived at the population level to select the eigenvalues. We will use the same idea as in Section 2.1 to derive the MDL criterion for the functional PACE model. We again assume that $\sigma^2 = \tau$ follows the inverse-gamma distribution

\[
\omega(\tau) = \frac{\sqrt a}{\sqrt{2\pi}}\, e^{-a/(2\tau)}\, \tau^{-3/2}.
\]

Then the marginal distribution of y is

\[
m(y|\mu, \lambda_k) = \int f_{\sigma^2}(y)\,\omega(\tau)\,d\tau
= \int \left(\prod_{i=1}^{n}\frac{1}{(2\pi)^{n_i/2}|\tau W_i|^{1/2}}\right)\exp\left\{-\frac{1}{2\tau}\sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t)\right\}\frac{\sqrt a}{\sqrt{2\pi}}\, e^{-a/(2\tau)}\,\tau^{-3/2}\, d\tau
\]
\[
= \left(\prod_{i=1}^{n}\frac{1}{(2\pi)^{n_i/2}|W_i|^{1/2}}\right)\frac{\sqrt a}{\sqrt{2\pi}}\int \tau^{-(N+3)/2}\exp\left\{-\frac{\sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t) + a}{2\tau}\right\} d\tau
\]
\[
= \left(\prod_{i=1}^{n}\frac{1}{(2\pi)^{n_i/2}|W_i|^{1/2}}\right)\frac{\sqrt a}{\sqrt{2\pi}}\,\Gamma\!\left(\frac{N+1}{2}\right)\left(\frac{\sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t) + a}{2}\right)^{-(N+1)/2},
\]
where the last step follows because the inverse-gamma density with shape $(N+1)/2$ and the matching scale integrates to one.

Excluding the terms that do not depend on the particular choice of the model,
\[
-\log m(y|\mu, \lambda_k) = \frac{1}{2}\sum_{i=1}^{n}\log|W_i| - \frac{1}{2}\log a + \frac{N+1}{2}\log\Big(\sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t) + a\Big). \qquad (2.18)
\]

The code length function contains only one hyperparameter, $a$. The estimator $\hat a = \sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t)/N$ minimizing the description length (2.18) leads to
\[
-\log m(y|\mu, \lambda_k) = \frac{1}{2}\sum_{i=1}^{n}\log|W_i| + \frac{N}{2}\log\sum_{i=1}^{n} y_i(t)^T W_i^{-1} y_i(t).
\]

Plugging in the eigenvalues and the mean function, adding the code length needed to encode the parameters, and excluding the terms containing no information about the specification of the model, we get the final form of the model selection criterion fMDL,
\[
\frac{1}{2}\sum_{i=1}^{n}\log|\hat W_i| + \frac{N}{2}\log\sum_{i=1}^{n}\big(Y_i(t) - \hat\mu(t)\big)^T \hat W_i^{-1}\big(Y_i(t) - \hat\mu(t)\big) + \frac{m+1}{2}\log n, \qquad (2.19)
\]
where $\hat W_i(s, t) = \sum_{k=1}^{m}\hat\lambda_k\hat\phi_k(s)\hat\phi_k(t)/\hat\sigma^2$ if $s \ne t$, and $\hat W_i(s, s) = \sum_{k=1}^{m}\hat\lambda_k\hat\phi_k^2(s)/\hat\sigma^2 + 1$.
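Given the estimated mean function, eigenvalues and eigenfunctions, criterion (2.19) is direct to evaluate. The sketch below assumes mu_hat and the entries of eigfuns are callables returning fitted values at the observation times; the interface is hypothetical and chosen only for illustration.

```python
import numpy as np

def fmdl(y_list, t_list, mu_hat, eigvals, eigfuns, sigma2_hat, m):
    """Evaluate criterion (2.19) for m eigenfunctions (illustrative interface)."""
    n = len(y_list)
    N = sum(len(y) for y in y_list)
    logdet, quad = 0.0, 0.0
    for y, t in zip(y_list, t_list):
        Phi = np.column_stack([eigfuns[k](t) for k in range(m)])  # n_i x m
        # W_i(s,t) = sum_k lambda_k phi_k(s) phi_k(t) / sigma^2, plus 1 on the diagonal
        W = Phi @ np.diag(np.asarray(eigvals)[:m]) @ Phi.T / sigma2_hat + np.eye(len(t))
        r = y - mu_hat(t)
        logdet += np.linalg.slogdet(W)[1]
        quad += r @ np.linalg.solve(W, r)
    return 0.5 * logdet + 0.5 * N * np.log(quad) + 0.5 * (m + 1) * np.log(n)
```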

Simulation

To demonstrate the application of the MDL criterion for the PACE model, we simulate a sample with $n = 100$ trajectories. Since our interest focuses on selecting the eigenvalues, without loss of generality we set $\mu(t) = 0$. The underlying model has the form
\[
X_i(t) = \sum_{k=1}^{p}\xi_{ik}\phi_k(t), \qquad 1 \le i \le 100, \qquad (2.20)
\]
where $\phi_1(t) = -\sqrt{2/T}\cos(2\pi t/T)$ and the remaining eigenfunctions are generated from the Fourier basis, orthonormal to the previous ones, with $T = 10$ and $t \in [0, T]$. The number of repeated measurements within each individual, $n_i$, is the smallest integer larger than a random number generated from the uniform distribution on $[5, 10]$. The time points are then uniformly generated from $U[0, 10]$. The measurement error is normally distributed with mean zero and variance 0.25.

True (T)   λ1=4, λ2=1     λ1=6, λ2=4, λ3=1    λ1=10, λ2=6, λ3=4, λ4=1
Method     AIC BIC MDL    AIC BIC MDL         AIC BIC MDL
F < T        0   0   0      7  14  12           7  19  14
F = T       21  37  41     28  25  29          32  27  30
F > T       29  13   9     15  11   9          11   4   6

Table 2.4: Comparison of the fMDL criterion with AIC and BIC for the PACE model; shown are the numbers of cases out of 50 Monte Carlo runs. In the table, F < T or F > T corresponds to eigenvalue under-selection or over-selection, respectively, while F = T indicates that the true model is selected.

In Table 2.4, the AIC and BIC are calculated via the marginal distribution of the responses. The comparison is based on 50 Monte Carlo runs. It can be seen that AIC always tends to over-select while BIC tends to under-select. Generally, fMDL is better than the other two criteria. However, when the underlying model includes more eigenvalues, none of the criteria shows excellent performance. This may be caused by the misspecification of the model we used: in this simulation study, the random effects are treated as fixed effects, and the correlation structure is ignored in the model selection procedure. Therefore, a new model selection procedure should be derived to choose the random effects; further research on this topic is carried out in Chapter 4.

Chapter 3

Variable Selection for Generalized

Estimating Equation via MDL


The method of estimating functions is commonly used when one desires to conduct inference about some parameters of interest but the full distribution of the observations is unknown. The generalized estimating equations (GEE) method is an extension of the quasi-likelihood method (Wedderburn, 1974) and was proposed by Liang and Zeger (1986) as a tool for marginal inference, useful for analyzing categorical responses with cluster-specific dependence. Instead of specifying the entire distribution of the response data, the GEE approach requires only a specification of the relationship between the mean of the response and the covariates, and of the response variance as a function of the mean. In most data analyses, model selection is an important aspect of model fitting. The most commonly used model selection criteria, AIC and BIC, both take the form of a penalized log-likelihood. However, GEE is not a likelihood-based method; only the score function is defined. Therefore AIC, BIC, and many of the standard model selection methods are not immediately available. As an extension of AIC, the quasi-likelihood under the independence model criterion (QIC) was proposed by Pan (2001) for GEE. QIC takes the form of a penalized quasi-likelihood in which the penalty term takes the correlation structure into account. Since the likelihood is missing, it is less clear how to derive a variable selection method for GEE. Therefore, the literature on this topic is very limited, and we hope to close at least some of the gap in this area.

3.1 Generalized Estimating Equations

3.1.1 A Brief Review

Let $y_i = (y_{i1}, \ldots, y_{in_i})^T$ be the $n_i \times 1$ vector of responses corresponding to the $i$-th cluster, let $X_i = (x_{i1}^T, \ldots, x_{in_i}^T)^T \in \mathbb{R}^{n_i \times p}$ be the covariate matrix, and set $\mu_{ij} = E(y_{ij}|x_{ij})$. The relationship between $\mu_{ij}$ and $x_{ij}$ is introduced via the link function $g(\cdot)$ so that
\[
g(\mu_i) \equiv (g(\mu_{i1}), \ldots, g(\mu_{in_i}))^T = X_i\beta, \qquad 1 \le i \le n.
\]
In addition, we assume that $v$ is the variance function, so that $\mathrm{var}(y_{ij}) = \phi v(\mu_{ij})$, where $\phi$ is the over-dispersion parameter, $1 \le j \le n_i$, $1 \le i \le n$. The GEE is then defined as
\[
S(y, \beta, R) = \sum_{i=1}^{n}\Psi_i^T\Omega_i^{-1}(y_i - \mu_i) = 0, \qquad (3.1)
\]
where $\Psi_i = \partial\mu_i/\partial\beta$, and $\Omega_i = \Delta_i^{1/2} R_i \Delta_i^{1/2}$ with $\Delta_i = \mathrm{diag}\{v(\mu_{i1}), \ldots, v(\mu_{in_i})\}$, and $R_i$ is a user-defined correlation matrix (the so-called "working" correlation) for $y_i$, having its structure characterized by an $r \times 1$ parameter vector $\alpha$ so that $R_i = R(\alpha; X_i)$. Note that when $R_i(\alpha)$ equals the true correlation structure (unknown in practice), $\Omega_i$ is the scaled (by $\phi$) variance-covariance matrix of $y_i$.
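For concreteness, the two working correlation structures used later in this chapter, exchangeable and AR(1), can be built as in the small sketch below; this is our own illustration and is not tied to any particular GEE implementation.

```python
import numpy as np

def working_correlation(ni, alpha, structure="exchangeable"):
    """Construct a working correlation matrix R_i(alpha) of size ni x ni."""
    if structure == "exchangeable":
        # constant correlation alpha between any two observations in a cluster
        return (1 - alpha) * np.eye(ni) + alpha * np.ones((ni, ni))
    if structure == "ar1":
        # correlation alpha^{|j - k|} decaying with the lag
        idx = np.arange(ni)
        return alpha ** np.abs(np.subtract.outer(idx, idx))
    raise ValueError(f"unknown structure: {structure}")
```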

One important feature of GEE is that no distribution for the data needs to be specified, so, unlike in the LME model, no marginal distribution of the data can be obtained directly from the estimating equation. This offers a convenient way to deal with correlated discrete responses and bypasses the difficulty of not having an explicit description of the joint multivariate outcomes. However, it makes it difficult to derive model selection methods. Fortunately, the MDL principle can be used in conjunction with the quasi-likelihood (Wedderburn, 1974)
\[
Q(y, \mu) = \int^{\mu}\frac{y - t}{v(t)}\, dt.
\]
If $R$ is the identity matrix, then given the observed data set the quasi-likelihood can be written as
\[
Q(y, \beta) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\int^{\mu_{ij}}\frac{y_{ij} - t}{v(t)}\, dt.
\]
Following this, the quasi-deviance function
\[
De(y, \mu) = -2\{Q(y, \mu) - Q(y, y)\} = -2\int_{y}^{\mu}\frac{y - t}{v(t)}\, dt
\]
is an analogue of the deviance function in the generalized linear model. An approximation of the distribution of $y_{ij}$ can be expressed in exponential form via the quasi-deviance function under mild regularity conditions (Nelder and Pregibon, 1987),
\[
f_\beta(y_{ij}|\phi) = \frac{v(y_{ij})^{-1/2}}{(2\pi\phi)^{1/2}}\exp\left\{-\frac{1}{2\phi}De(y_{ij}, \mu_{ij})\right\}. \qquad (3.2)
\]

It is known that a general form of $R_i$ does not necessarily lead to a valid definition of the quasi-likelihood (see McCullagh and Nelder, 1991). We thus proceed to derive MDL criteria for GEE models under working independence, i.e. $R_i = I_{n_i}$, where $I_k$ denotes the $k \times k$ identity matrix.
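As a worked instance of the quasi-deviance, take the Bernoulli variance function $v(\mu) = \mu(1-\mu)$: integrating $(y - t)/v(t)$ from $y$ to $\mu$ recovers the familiar binomial deviance contribution $2[y\log(y/\mu) + (1-y)\log\{(1-y)/(1-\mu)\}]$. A small sketch of this example (ours, with clipping added for numerical safety):

```python
import numpy as np

def quasi_deviance_binomial(y, mu, eps=1e-10):
    """De(y, mu) for v(mu) = mu (1 - mu): the binomial deviance contribution."""
    y = np.clip(y, eps, 1 - eps)
    mu = np.clip(mu, eps, 1 - eps)
    return 2 * (y * np.log(y / mu) + (1 - y) * np.log((1 - y) / (1 - mu)))
```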

3.1.2 Case C: φ is known

When $\beta$ is the only unknown parameter, we can use an encoding device $\omega(\beta)$ equal to the normal distribution $N(0, \phi V)$, where the scaled covariance matrix $V$ is usually assumed proportional to some user-defined form $V^*$ with a hyperparameter $c$, i.e., $V = cV^*$. More will be said about the choice of $V^*$ later. The marginal density function of $y$ is expressed using the Laplace approximation

\[
m(y|\phi) = \int f_\beta(y)\,\omega(\beta)\, d\beta \approx (2\pi)^{p/2}\,|-H(\tilde\beta)|^{-1/2} f_{\tilde\beta}(y|\phi)\,\omega(\tilde\beta), \qquad (3.3)
\]

where $H(\beta)$ is the Hessian matrix of $h(\beta) = \log f_\beta(y|\phi) + \log\omega(\beta|X)$ and $\tilde\beta$ is the posterior mode of $\beta$. We know that if $M_0$ is the true model and $\beta_0$ is the true parameter, we have
\[
E_{M_0}\left(\frac{\partial De(y, \beta, R)}{\partial\beta}\Big|_{\beta=\beta_0}\right) = 0,
\]
and the Fisher information matrix
\[
J(\beta_0) = E\left(-\frac{1}{2}\frac{\partial^2 De(y, \beta, R)}{\partial\beta\,\partial\beta^T}\Big|_{\beta=\beta_0}\right) = \sum_{i=1}^{n}\Psi_i^T\Omega_i^{-1}\Psi_i.
\]

Then the Hessian matrix has the form $H(\beta) = -\frac{1}{\phi}(J(\beta) + V^{-1})$. Furthermore, the posterior mode of $\beta$ can be expressed as
\[
\tilde\beta = \beta - H^{-1}(\beta)\frac{\partial}{\partial\beta}h(\beta)
= \beta - \phi(J(\beta) + V^{-1})^{-1}\frac{1}{\phi}V^{-1}\beta
= \beta - (J(\beta) + V^{-1})^{-1}V^{-1}\beta.
\]
The approximate form of $\tilde\beta$ can be derived via a one-step Newton-Raphson iteration when we replace $\beta$ with its estimator:
\[
\tilde\beta \approx VJ(\hat\beta)\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta,
\]

where $I_p$ is the $p \times p$ identity matrix. In this approximation formula, $\hat\beta$ is the estimate of $\beta$, which converges to the true parameter; therefore the estimate $\hat\beta$ and the posterior mode $\tilde\beta$ are very close, which makes the approximation hold. One then applies a Taylor expansion with respect to $\tilde\beta$ around $\hat\beta$ to obtain the approximation of the quasi-deviance function
\[
De(y, \tilde\beta|\phi) \approx De(y, \hat\beta|\phi) + \frac{\partial De(y, \hat\beta|\phi)}{\partial\beta}(\tilde\beta - \hat\beta) + \frac{1}{2}(\tilde\beta - \hat\beta)^T\left(\frac{\partial^2 De(y, \hat\beta|\phi)}{\partial\beta\,\partial\beta^T}\right)(\tilde\beta - \hat\beta)
\]
\[
\approx De(y, \hat\beta|\phi) + (\tilde\beta - \hat\beta)^T J(\hat\beta)(\tilde\beta - \hat\beta)
\approx De(y, \hat\beta|\phi) + [\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta]^T J(\hat\beta)[\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta],
\]

and the log-density of the encoding device
\[
\log\omega(\tilde\beta) \approx \log\omega(VJ(\hat\beta)\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta)
\approx -\frac{p}{2}\log 2\pi - \frac{1}{2}\log|\phi V| - \frac{1}{2\phi}[VJ(\hat\beta)\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta]^T V^{-1}[VJ(\hat\beta)\{VJ(\hat\beta) + I_p\}^{-1}\hat\beta].
\]

Ignoring the terms that do not contain any information about the unknown parameter $\beta$ and plugging all the approximate expressions into (3.3), we obtain the approximate code length
\[
-\log m(y|\phi) \approx -\frac{p}{2}\log 2\pi + \frac{1}{2}\log|-H| - \log f_{\tilde\beta}(y|\phi) - \log\omega(\tilde\beta)
= \frac{1}{2}\log|VJ + I_p| + \frac{1}{2\phi}\hat\beta^T(V + J^{-1})^{-1}\hat\beta + \frac{1}{2\phi}De(y, \hat\beta|\phi),
\]
which is a function of $c$ given our assumption that $V = cV^*$. As in the LME model case, the MDL criterion is obtained at the minimizing value of $c$.

It is clear that whether or not a closed form of this MDL criterion is available depends on the choice of $V$ (or $V^*$) used in the coding device for $\beta$. In general, one can solve for $c$ numerically and then calculate the MDL criterion, e.g., using the Newton-Raphson algorithm (see the end of this subsection). We denote this type of MDL criterion, with various choices of $V$, by gMDL1. For instance, there is no closed-form solution when the sandwich estimator is used, i.e., $V^* = J(\hat\beta)^{-1}BJ(\hat\beta)^{-1}$, where $B = \sum_{i=1}^{n}\Psi_i^T\Omega_i^{-1}\widehat{\mathrm{cov}}(y_i)\Omega_i^{-1}\Psi_i/\phi$. However, when using $V^* = J(\hat\beta)^{-1}$, we obtain an explicit form. In this case, the code length function has the form
\[
-\log m(y|\phi) = \frac{p}{2}\log(1 + c) + \frac{1}{2\phi(1+c)}\hat\beta^T J(\hat\beta)\hat\beta + \frac{1}{2\phi}De(y, \hat\beta|\phi). \qquad (3.4)
\]

Minimizing equation (3.4) yields
\[
\hat c = \max\left(\frac{\hat\beta^T J(\hat\beta)\hat\beta}{p\phi} - 1,\ 0\right).
\]

Substituting $\hat c$, we get the final form of the criterion gMDL1,
\[
\begin{cases}
\dfrac{p}{2}\log\Big(\dfrac{\hat\beta^T J(\hat\beta)\hat\beta}{p\phi}\Big) + \dfrac{1}{2\phi}De(y, \hat\beta|\phi) + \dfrac{p}{2} + \dfrac{1}{2}\log(n), & \hat\beta^T J(\hat\beta)\hat\beta > p\phi, \\[2ex]
\dfrac{1}{2\phi}De(y, 0|\phi), & \text{otherwise},
\end{cases} \qquad (3.5)
\]
where $De(y, 0|\phi)$ is the deviance function when all the regression parameters are zero, and the term $\frac{1}{2}\log(n)$ is the code length used to transmit $c$. The prefix g is used to specify that this MDL criterion is derived for generalized estimating equations.
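Criterion (3.5) needs only $\hat\beta$, $J(\hat\beta)$ and the quasi-deviances from a working-independence GEE fit. A minimal sketch, with an interface of our own choosing:

```python
import numpy as np

def gmdl1(beta_hat, J, deviance, dev_null, phi, n):
    """Criterion (3.5) for V* = J(beta_hat)^{-1} (illustrative interface).

    beta_hat: GEE estimate; J: Fisher information at beta_hat;
    deviance: De(y, beta_hat | phi); dev_null: De(y, 0 | phi)."""
    p = len(beta_hat)
    quad = beta_hat @ J @ beta_hat
    if quad > p * phi:
        return (0.5 * p * np.log(quad / (p * phi)) + deviance / (2 * phi)
                + 0.5 * p + 0.5 * np.log(n))
    return dev_null / (2 * phi)
```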

If $V^* \ne J(\hat\beta)^{-1}$, we can use Newton-Raphson iteration to obtain $\hat c$; see Appendix B.1.

3.1.3 Case D: φ is unknown

Of course, in practice we have limited knowledge about the value of $\phi$. So when both $\beta$ and $\phi$ are unknown, denoting $\theta = (\beta, \phi)$, one needs to consider both encoding devices, $\beta|\phi \sim N(0, \phi V)$ and $\phi \sim \mathrm{InvGamma}(a, 3/2)$ as in Section 2.1, i.e.
\[
\omega(\phi) = \frac{\sqrt a}{\sqrt{2\pi}}\, e^{-a/(2\phi)}\,\phi^{-3/2}.
\]
Then the joint distribution for the parameters is
\[
\omega(\beta, \phi) = \frac{1}{(2\pi)^{p/2}|\phi V|^{1/2}}\exp\left\{-\frac{1}{2\phi}\beta^T V^{-1}\beta\right\}\omega(\phi).
\]

The basic construction is similar to the case of known over-dispersion. When the over-dispersion is unknown, the marginal density function of $y$ has the form
\[
m(y) = \iint f_\theta(y)\,\omega(\beta, \phi)\, d\beta\, d\phi. \qquad (3.6)
\]
Using the Laplace approximation at $\tilde\beta$ to replace the inner integral,
\[
m(y) \approx \int (2\pi)^{p/2}\,|-H(\tilde\beta)|^{-1/2} f_{\tilde\beta}(y)\,\omega(\tilde\beta, \phi)\, d\phi.
\]

After eliminating the terms that contain no information about the unknown parameters, the code length function has the form
\[
-\log m(y) \approx -\frac{1}{2}\log a + \frac{N+1}{2}\log\big(a + \tilde\beta^T V^{-1}\tilde\beta + De(y, \tilde\beta)\big) + \frac{1}{2}\log|V| + \frac{1}{2}\log|-H|.
\]

Minimizing the code length function with respect to the hyperparameter $a$ yields
\[
-\log m(y) \approx \frac{N}{2}\log\big(\tilde\beta^T V^{-1}\tilde\beta + De(y, \tilde\beta)\big) + \frac{1}{2}\log|V| + \frac{1}{2}\log|J + V^{-1}|. \qquad (3.7)
\]

If we choose $V^* = J(\hat\beta)^{-1}$, an explicit form can be achieved by minimizing the above function with respect to the hyperparameter $c$. This yields
\[
\hat c = \max\left(\frac{(N-p)\,\hat\beta^T J(\hat\beta)\hat\beta}{p\,De(y, \hat\beta)} - 1,\ 0\right).
\]
Then the final criterion gMDL2 is
\[
\begin{cases}
\dfrac{p}{2}\log\Big(\dfrac{(N-p)\,\hat\beta^T J(\hat\beta)\hat\beta}{p\,De(y, \hat\beta)}\Big) + \dfrac{N}{2}\log\Big(\dfrac{De(y, \hat\beta)}{N-p}\Big) + \log(n), & \text{if } (N-p)\,\hat\beta^T J(\hat\beta)\hat\beta > p\,De(y, \hat\beta), \\[2ex]
\dfrac{N}{2}\log\big(De(y, 0)\big) + \dfrac{1}{2}\log(n), & \text{otherwise}.
\end{cases} \qquad (3.8)
\]
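Analogously, (3.8) can be evaluated as in the sketch below; the interface is illustrative, and the inputs are assumed to come from a working-independence GEE fit.

```python
import numpy as np

def gmdl2(beta_hat, J, deviance, dev_null, N, n):
    """Criterion (3.8) for V* = J(beta_hat)^{-1} (illustrative interface)."""
    p = len(beta_hat)
    quad = beta_hat @ J @ beta_hat
    if (N - p) * quad > p * deviance:
        return (0.5 * p * np.log((N - p) * quad / (p * deviance))
                + 0.5 * N * np.log(deviance / (N - p)) + np.log(n))
    return 0.5 * N * np.log(dev_null) + 0.5 * np.log(n)
```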

When we choose another matrix for $V^*$, we need to use numerical optimization (e.g. Newton-Raphson); see Appendix B.2 for details.

It is important to mention that both the sandwich estimator and the inverse Fisher information are to some extent awkward choices for the encoding device of $\beta$, due to their dependence on the response $y$, thus contradicting the basic idea of the MDL principle (see Hansen and Yu, 2003, for a similar occurrence in the case of MDL for generalized linear models). As we know, the MDL principle is based on the length of the code used for data transmission; therefore, from the coding perspective, both the sender and the receiver should know $J(\hat\beta)$ if we want to use it directly in the final form of the MDL criteria. However, $\hat\beta$ is estimated from the GEE model, so it does not make sense for $\hat\beta$ to be known before the data transmission. For this reason, in the simulation study we also use the identity choice $V^* = I_p$, which is found to provide the most encouraging results.

3.2 Simulation

We will use correlated binary outcomes as an example to demonstrate the performance of the MDL criteria, and we begin this section by describing the data generating method. Several methods have been proposed for generating correlated binary random variables. Lee (1993) provided flexible methods to simulate binary data which are based on linear programming and copulas and which require solving a set of nonlinear equations. Later, Gange (1995) proposed a method to generate multivariate categorical variates using the iterative proportional fitting algorithm. Park et al. (1996) suggested a simple method to generate binary random variables. In this approach, Poisson random variables $X(\alpha)$ with mean $\alpha \ge 0$ are used. We demonstrate the method in the simplest case, where only two correlated random variables are generated. Assume that $X_1(\alpha_{11} - \alpha_{12})$, $X_2(\alpha_{22} - \alpha_{12})$ and $X_3(\alpha_{12})$ are mutually independent. Let

\[
Z_1 = X_1(\alpha_{11} - \alpha_{12}) + X_3(\alpha_{12})
\]
and
\[
Z_2 = X_2(\alpha_{22} - \alpha_{12}) + X_3(\alpha_{12}).
\]
The new variables $Z_1$ and $Z_2$ are correlated via $X_3(\alpha_{12})$. Then we can define the binary random variable $Y_i$ via $Z_i$:
\[
Y_i = \begin{cases} 1, & Z_i = 0, \\ 0, & \text{otherwise}, \end{cases}
\]
where $i = 1$ or $2$. Given the $\alpha_{ij}$'s, we can get the expected values of $Y_1$ and $Y_2$,
\[
E(Y_i) = E(Y_i^2) = Pr(X_i = X_3 = 0) = e^{-\alpha_{ii}} = p_i.
\]

We also have

\[
E(Y_1 Y_2) = Pr(X_1 = X_2 = X_3 = 0) = p_1 p_2 \exp(\alpha_{12}).
\]

Then the correlation coefficient is

\[
\mathrm{corr}(Y_1, Y_2) = \frac{\mathrm{Cov}(Y_1, Y_2)}{\sqrt{\mathrm{Var}(Y_1)}\sqrt{\mathrm{Var}(Y_2)}}
= p_1 p_2 (e^{\alpha_{12}} - 1)/(p_1 p_2 q_1 q_2)^{1/2},
\]

which is denoted by $\rho$. Conversely, if the $p_i$'s and $\rho$ are known, we can get the values of the $\alpha_{ij}$'s by solving the above equations, which leads to
\[
\alpha_{ij} = \log\big(1 + \rho(q_i p_i^{-1} q_j p_j^{-1})^{1/2}\big).
\]

The method can be extended from generating two correlated binary variables to k correlated binary variables. We quote the computational steps given by the authors below:

Step 0. Compute $\alpha_{ij}$ for $1 \le i, j \le k$. Let $l = 0$.

Step 1. Let $l = l + 1$. Let $T_l = \{\alpha_{ij} : \alpha_{ij} > 0,\ 1 \le i, j \le k\}$, and let $\beta_l = \alpha_{rs}$ be the smallest element in the set $T_l$. If $\alpha_{rr} = 0$ or $\alpha_{ss} = 0$, then stop. Otherwise, choose an index set $S_l$ containing $\{r, s\}$ and satisfying $\alpha_{ij} > 0$ for all $\{i, j\} \in S_l$; the set $S_l$ is chosen to have as many elements as possible.

Step 2. For all $\{i, j\} \in S_l$, replace $\alpha_{ij}$ with $\alpha_{ij} - \beta_l$. If all $\alpha_{ij} = 0$, then go to Step 3; otherwise, go to Step 1.

Step 3. Let $\tau = l$. For $i = 1, 2, \ldots, k$, let $Z_i = \sum_{l=1}^{\tau} X_l(\beta_l)\, I_{S_l}(i)$ and set $Y_i = I_{\{0\}}(Z_i)$, where $I_A$ is the indicator function of set $A$.

In Step 1, if $\alpha_{rs}$ or $S_l$ is not unique, we can choose one arbitrarily. A simple example illustrates the application of this method. Assume our target is to generate three binary outcomes with $p_1 = 0.4$, $p_2 = 0.5$, $p_3 = 0.6$ and $\rho_{12} = 0.5$, $\rho_{13} = 0.1$, $\rho_{23} = 0.3$. Table 3.1 shows the calculation process step by step.

         l = 1                l = 2                l = 3
         0.916 0.478 0.095    0.821 0.383 0        0.821 0.383 0
               0.693 0.219          0.598 0.124          0.474 0
                     0.511                0.416                0.292
(r, s)   (1,3)                (2,3)                (3,3)
βl       0.095                0.124                0.292
Sl       {1, 2, 3}            {2, 3}               {3}

         l = 4                l = 5                l = 6
         0.821 0.383 0        0.438 0     0        0.438 0 0
               0.474 0              0.091 0              0 0
                     0                    0                    0
(r, s)   (1,2)                (2,2)                (1,1)
βl       0.383                0.091                0.438
Sl       {1, 2}               {2}                  {1}

Table 3.1: The procedure to generate three correlated binary random variables.

According to Table 3.1, six basic Poisson random variables are selected. Then we have

\[
\begin{aligned}
Z_1 &= X_1(0.095) + X_4(0.383) + X_6(0.438),\\
Z_2 &= X_1(0.095) + X_2(0.124) + X_4(0.383) + X_5(0.091),\\
Z_3 &= X_1(0.095) + X_2(0.124) + X_3(0.292).
\end{aligned}
\]
Finally, let $Y_i = I_{\{0\}}(Z_i)$, $i = 1, 2, 3$, which are the three correlated binary random variables we are seeking.
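The steps above translate directly into code. The following NumPy sketch is our reading of the Park et al. (1996) construction; it assumes a feasible pair (p, ρ), a correlation matrix ρ with unit diagonal, and uses a small tolerance in place of exact zero tests.

```python
import numpy as np

def correlated_binary(p, rho, rng):
    """Generate one draw of k correlated binary variables (Park et al., 1996)."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    k = len(p)
    # alpha_ij = log(1 + rho_ij sqrt(q_i q_j / (p_i p_j))); the diagonal gives -log p_i
    alpha = np.log(1.0 + rho * np.sqrt(np.outer(q / p, q / p)))
    Z = np.zeros(k)
    tol = 1e-12
    while alpha.max() > tol:
        # Step 1: beta_l is the smallest positive alpha_rs
        masked = np.where(alpha > tol, alpha, np.inf)
        r, s = np.unravel_index(np.argmin(masked), alpha.shape)
        beta_l = alpha[r, s]
        # greedy index set S_l containing {r, s} with all pairwise alphas positive
        S = {r, s}
        for i in range(k):
            if all(alpha[i, j] > tol for j in S | {i}):
                S.add(i)
        # Step 2: subtract beta_l on S_l x S_l; the shared Poisson draw creates dependence
        idx = sorted(S)
        X = rng.poisson(beta_l)
        Z[idx] += X
        alpha[np.ix_(idx, idx)] -= beta_l
        alpha[alpha < tol] = 0.0
    # Step 3: Y_i = 1 iff Z_i = 0
    return (Z == 0).astype(int)
```

For the worked example, calling correlated_binary(np.array([0.4, 0.5, 0.6]), rho, np.random.default_rng(0)) with the stated ρ matrix carries out the same decomposition as Table 3.1 internally.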

This method can handle different correlation structures, but it still has its limitations. Since $Z_i$ and $Z_j$ are positively correlated, $Y_i$ and $Y_j$ also have to be positively correlated by their definition; in other words, this method can only generate positively correlated outcomes. Furthermore, $\rho_{ij}$ is restricted by an upper bound: because $\mathrm{Cov}(Y_i, Y_j) \le p_i q_j$ and $\mathrm{Cov}(Y_i, Y_j) \le p_j q_i$, we have the inequalities $\rho_{ij} \le [p_i q_j/(q_i p_j)]^{1/2}$ and $\rho_{ij} \le [p_j q_i/(q_j p_i)]^{1/2}$. As for the $\alpha_{ij}$'s, we have $\alpha_{ij} \le \alpha_{ii}$ and $\alpha_{ij} \le \alpha_{jj}$. From the above example, we can see that essentially we partition a Poisson random variable into several parts; the common part shared by $Z_i$ and $Z_j$ accounts for the correlation. Consequently, it is almost impossible to generate a very high-dimensional binary outcome in which each component has a relatively large success probability.

In the simulation study, we use Park et al. (1996)'s approach to generate correlated binomial outcomes via latent Poisson random variables. We use the same setup for the fixed-effects covariate structure as in Section 2.3, with $n = 100$, $n_i = 3$ for all $1 \le i \le 100$, $p \le 5$ and $x_{1ij} = 1$, $x_{2ij} \sim \mathrm{Exponential}_{(0,2)}(0.5)$, where $\mathrm{Exponential}_{(a,b)}(\lambda)$ is the exponential distribution with parameter $\lambda$ truncated to the interval $(a, b)$, $x_{3ij} \sim \mathrm{Uniform}(-1, 1)$, $x_{4ij} \in \{0, 1, 2\}$ marks the time the measurements are recorded, and $x_{5ij} \sim N(0, 0.5)$. The response variable is $y_{ij} \sim \mathrm{Binomial}(m_i, p_{ij})$, where $m_i$ takes the value 1, 2 or 3 with equal probability and $\mathrm{logit}(p_{ij}) = \eta_{ij}$. We use two underlying true models in our simulations, corresponding to $\eta_{ij} = 0.2x_{1ij} + 0.2x_{2ij}$ and $\eta_{ij} = 0.2x_{1ij} + 0.2x_{2ij} - 0.1x_{3ij} + 0.3x_{4ij}$. The comparison is based on 100 Monte

Carlo replicates. The underlying correlation between distinct $y_{ij}$'s is exchangeable with $\rho = 0.5$, while the working correlation used in estimation has the exchangeable or AR(1) structure. In Table 3.2 we also compare the performance of MDL with different covariance specifications of $V$ used in the coding device for $\beta$. Besides AIC and BIC, we have included as a reference the Quasi Information Criterion (QIC) derived under working independence (Pan, 2001). Since the over-dispersion parameter is rarely known in practice, we illustrate our comparison on the gMDL2-type criteria with two choices for $V$: (a) the inverse of the Fisher information matrix (denoted by MDL-F in Table 3.2), which is calculated using (3.8), and (b) the identity matrix (denoted by MDL-I in Table 3.2), which is evaluated using the Newton-Raphson algorithm. One can see that the performance of gMDL2 is better when using (b) rather than (a) to build the covariance matrix of the coding device $\omega(\beta)$. Following the MDL principle, it is suboptimal to allow the response to influence the coding device, which may explain why using $V = cI_p$ in $\omega(\beta)$ yields better results than using either the inverse Fisher information or the sandwich estimator of variance (not reported here). In fact, gMDL2 with $V = cI_p$ tends to perform close to the winner among AIC, QIC and BIC in all simulation scenarios. Since robustness is desirable in practice, one may wonder whether gMDL2's performance carries over under misspecification of the working correlation. We thus consider the case in which a working autoregressive AR(1) correlation structure is used for fitting the model. The results shown in Table 3.3 suggest that even when the correlation structure is misspecified, the gMDL2 criterion with $V = cI_p$ still tends to perform as well as the best of AIC, QIC and BIC.

True            Fitted   AIC  BIC  QIC  MDL-F  MDL-I
(x1,x2)         F = T     87  100   80     60     92
                F > T     13    0   24     40      8
(x1,x2,x3,x4)   F < T      4   71    2     67      9
                F = T     81   28   74     31     81
                F > T     15    1   24      2     10

Table 3.2: Comparison of the gMDL2 criterion with AIC, BIC and QIC for GEE based on 100 Monte Carlo runs. The exchangeable working correlation structure is used to fit the model. We illustrate here two choices of the covariance matrix for the coding device ω(β): the inverse Fisher information (MDL-F) and the identity matrix (MDL-I). In the table, F < T or F > T corresponds to covariate under-selection or over-selection, respectively, while F = T indicates that the true model is selected.

True            Fitted   AIC  BIC  QIC  MDL-F  MDL-I
(x1,x2)         F = T     87  100   81     63     95
                F > T     13    0   19     37      5
(x1,x2,x3,x4)   F < T      5   73    2     61     11
                F = T     80   26   76     36     82
                F > T     15    1   22      3      7

Table 3.3: Comparison of the gMDL2 criterion with AIC, BIC and QIC for GEE based on 100 Monte Carlo runs. The AR(1) working correlation structure is used to fit the model. In the table, F < T or F > T corresponds to covariate under-selection or over-selection, respectively, while F = T indicates that the true model is selected.

3.3 Conclusion

We propose MDL-based model selection criteria for a variety of models built around dependent data. The model selection procedures proposed here are shown to be consistent and can easily be applied to models designed for longitudinal, functional, and clustered data. Simulations show that the MDL criteria diminish the under-selection problem of BIC and mimic the "winner" between BIC and AIC, thus exhibiting desirable robustness properties. In spite of many documented instances where the MDL principle yields reliable model selection procedures, its use in the statistical literature has been rather infrequent. We hope that the extension of such criteria to models with complex dependence structures will stimulate the growth and usage of this exciting area of statistics.

Chapter 4

Random Effects Selection via

Group LASSO


In the previous chapters, our focus was on the selection of population-level parameters; in the linear mixed-effects model, our interest was in the selection of $\beta$. However, besides the properties of the whole population, researchers are also interested in the changes within each subject, which are captured by the random effects. When random effects are considered, the traditional penalized log-likelihood type model selection criteria cannot be directly applied. The degrees of freedom (d.f.) $k$ is one of the key values in the penalty term. In the marginal model, it is the number of fixed parameters, which includes the mean parameter $\beta$ and the variance component $\alpha$. However, in the conditional model, it is inappropriate to take into account all the distinct parameters, since the estimation of the random effects depends on the estimated covariance, which may affect the actual degrees of freedom of the model. For example, in the linear mixed-effects model (2.1), if $D$ is a zero matrix and $\Gamma_i = I_{n_i}$, where $I_{n_i}$ is the $n_i \times n_i$ identity matrix, then the model reduces to a standard linear model consisting of independent observations. There are generally two types of approaches for selection in conditional models: amending the penalty term of the selection criterion, or modifying the estimation process and thus the selection criterion.

For amending the penalty term of the model selection criterion, the key is to justify the degrees of freedom $k$, the number of parameters that are free to vary. The degrees of freedom measure the complexity of the model. In a linear model, it is the dimension of the vector space onto which the response $y_{ij}$ is projected to obtain the fitted value $\hat y_{ij}$. However, when we consider the conditional model, the value of $k$ cannot simply be calculated as the number of distinct parameters in the model: once the correlation structure is taken into account, not all the parameters can vary freely. In this case, the degrees of freedom are usually smaller than the number of parameters. Hodges and Sargent (2001) extended the theory of degrees of freedom from the linear model to richly parametrized models, which include the random-effects model; in their extension, the correlation structure is used to calculate $k$. Based on this new definition of degrees of freedom, Vaida and Blanchard (2005) proposed the conditional AIC (cAIC) for the linear mixed-effects model, which has the form $-2\,l(\theta\,|\,Y_n) + 2(\rho + 1)$. To distinguish it from the marginal model, instead of using $k$ we slightly abuse notation and use $\rho$ to denote the effective number of parameters in the conditional model. In the formula, the authors use 1 d.f. to account for the unknown error variance $\sigma^2$. The conditional AIC offers an explicit and convenient way to select the random effects. As an extension of AIC, cAIC has the advantage that the "true model" does not necessarily have to be included in the candidate model class; in practice, we only need it to be "close" to the candidate model class. However, because it is an extension of AIC, it cannot avoid the main disadvantage of AIC, namely over-selection when the underlying model contains only a few variables. Therefore, we propose to select the random effects under the latter framework, i.e., modifying the estimation process and thus the selection criterion. In particular, group LASSO is used to select the final model.

4.1 Group LASSO

LASSO is a shrinkage estimation and model selection method proposed by Tibshirani (1996). Suppose we have the observations $y = (y_1, y_2, \ldots, y_n)^T$ and the independent predictors $X = [x_1, \ldots, x_p]$, where $x_j = (x_{1j}, \ldots, x_{nj})^T$. LASSO minimizes the residual sum of squares, $\arg\min_\beta \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j)^2$, subject to the sum of the absolute values of the coefficients, $\sum_{j=1}^{p}|\beta_j|$, being less than a constant $t$. The equivalent penalized form is
\[
\hat\beta = \arg\min_\beta \Big\|y - \sum_{j=1}^{p} x_j\beta_j\Big\|^2 + \lambda\sum_{j=1}^{p}|\beta_j|,
\]
where $\lambda$ is the tuning parameter. It is known that all the coefficients $\beta_j$ estimated via ordinary least squares are nonzero (with probability 1). However, with the $l_1$ penalty term, the coefficients are continuously shrunk towards zero as $\lambda$ increases, and when $\lambda$ is large enough, some coefficients are shrunk to exactly zero. Fan and Li (2001) showed that, since the $l_1$ penalty is singular at the origin, the LASSO can automatically select variables. All the candidate models are then indexed by the value of $\lambda$. The final model is selected from the candidate model class by a $C_p$-type statistic or by cross-validation.

While LASSO is computationally efficient, it was initially developed for selecting individual input variables. Even if the individual variables can be grouped by certain properties, LASSO still tends to shrink based on the strength of each individual variable, not on the strength of each group. For instance, we may use a group of dummy variables to express a categorical factor with multiple levels, or, in an additive model, each component function may be represented by a combination of basis functions. In these two cases, the typical target is to choose the significant factors or functions; however, LASSO may shrink only some of the individual dummy variables or basis functions to zero. Yuan and Lin (2006) showed that LASSO tends to select more factors in such cases and that different parametrizations of the factors may lead to different LASSO solutions. What we are interested in is choosing the random effects $\gamma_{ij}$, which are usually multivariate-normally distributed. Notice that the elements of the $j$th random effect $\gamma_{\cdot j} = (\gamma_{1j}, \ldots, \gamma_{nj})^T$ are the coefficients of the same covariate, but vary from subject to subject. While the $j$th covariate may be an important input and proved significant, some entries of $\gamma_{\cdot j}$ may remain small for some subjects. In this case, LASSO may set such entries to zero, which cannot reflect the significance of that random effect at the population level. Therefore it is more suitable and reasonable to choose the whole vector $\gamma_{\cdot j}$ rather than the individual $\gamma_{ij}$.

Group LASSO (Yuan and Lin, 2006) provides a powerful tool to estimate and select random effects at the group level. For convenience, we abuse notation in the general description of the group LASSO in this section. Suppose there is a natural partition of the predictor variables into $L$ groups. Define the $n \times p_l$ matrix $X_l$ as the design matrix corresponding to the $l$th group, and let the coefficient $\beta_l$ be a $p_l \times 1$ vector. Suppose that $y$ and the $X_l$, $l = 1, \ldots, L$, are all centred. Then the group LASSO solution is obtained by minimizing
\[
\Big\|y - \sum_{l=1}^{L} X_l\beta_l\Big\|^2 + \lambda\sum_{l=1}^{L}\sqrt{p_l}\,\|\beta_l\|. \qquad (4.1)
\]
In formula (4.1), the $l_2$ norm is used within each group, i.e. $\|\beta_l\| = (\beta_l^T\beta_l)^{1/2}$.

Figure 4.1 shows the constraint regions for the $l_1$-type and $l_2$-type penalties when we consider the constraints $|\beta_1| + |\beta_2| \le 1$ and $(\beta_1^2 + \beta_2^2)^{1/2} \le 1$, respectively. In higher dimensions, the penalty term of group LASSO is intermediate between the $l_1$-type and $l_2$-type penalties. The $l_2$-type penalty within each group treats all the elements equally and does not encourage sparsity; therefore, group LASSO yields consistent shrinkage within a group: if one element in a group is non-zero, then all the elements in that group are non-zero. The $l_1$-type penalty is used between groups, which leads to an entire group of coefficients being set to zero when the tuning parameter $\lambda$ is sufficiently large. Obviously, if $p_l = 1$, $l = 1, \ldots, L$, group LASSO reduces to the standard LASSO.

[Figure 4.1: Constraint regions for the $l_1$-type penalty (within the solid line) and the $l_2$-type penalty (within the dotted line).]

Friedman et al. (2010) provided the details of the general algorithm for group LASSO. Taking the first derivative with respect to $\beta_l$, we get the subgradient equation
\[
-X_l^T\Big(y - \sum_{l=1}^{L} X_l\beta_l\Big) + \lambda\cdot s_l = 0,
\]
where $s_l = \beta_l/\|\beta_l\|$ if $\beta_l \ne 0$. Yuan and Lin (2006) proved necessary and sufficient conditions for the existence of a solution set. Assume that $\hat\beta_l$, $l = 1, \ldots, L$, are the solutions, and let $r_l = y - \sum_{k\ne l} X_k\hat\beta_k$. If $\|X_l^T r_l\| < \lambda$, then $\hat\beta_l = 0$; otherwise
\[
\hat\beta_l = \big(X_l^T X_l + (\lambda/\|\hat\beta_l\|) I_{p_l}\big)^{-1} X_l^T r_l. \qquad (4.2)
\]
If we assume that each $X_l$ is orthonormalized, i.e. $X_l^T X_l = I_{p_l}$, which can be achieved by Gram-Schmidt orthonormalization, then we can get the explicit form of $\hat\beta_l$,
\[
\hat\beta_l = (1 - \lambda/\|v_l\|)\, v_l, \qquad (4.3)
\]
where $v_l = X_l^T r_l$.

We present the simple version of the algorithm step by step: 4 Random Effects Selection via Group LASSO 90

ˆ Step 1. Set β = β0 as the initial value.

Step 2. For l = 1,...,L,

P ˆ let rl = y − k6=l Xkβk. ˆ If kXlrlk < λ, βl = 0,

ˆ otherwise, βl = (1 − λ/kvlk)vl.

Step 3. Iterate the entire step 2 until convergence.

The algorithm is efficient and usually achieves reasonable convergence. If the design matrices are not orthonormal, we can use the numerical solution of equation (4.2) in Step 2 instead of the explicit form of $\hat\beta_l$, which is more time-consuming.
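The three steps above amount to block coordinate descent, sketched below for orthonormalized groups; this is a minimal illustration under the orthonormality assumption $X_l^T X_l = I$, not the authors' implementation.

```python
import numpy as np

def group_lasso(y, X_groups, lam, n_iter=200, tol=1e-8):
    """Block coordinate descent for the group LASSO steps above.

    Assumes each X in X_groups has orthonormal columns (X' X = I)."""
    L = len(X_groups)
    beta = [np.zeros(X.shape[1]) for X in X_groups]      # Step 1: initialize at zero
    for _ in range(n_iter):
        max_change = 0.0
        for l in range(L):                                # Step 2: cycle over groups
            partial = y - sum(X_groups[k] @ beta[k] for k in range(L) if k != l)
            v = X_groups[l].T @ partial
            norm_v = np.linalg.norm(v)
            new = np.zeros_like(v) if norm_v < lam else (1 - lam / norm_v) * v
            max_change = max(max_change, np.linalg.norm(new - beta[l]))
            beta[l] = new
        if max_change < tol:                              # Step 3: iterate to convergence
            break
    return beta
```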

Meinshausen and Yu (2009) established the $l_2$ consistency of estimation via LASSO; they proved that LASSO can produce a good estimator even when the sparsity pattern may not be recovered. Liu and Zhang (2009) demonstrated that the estimators given by group LASSO are also $l_2$-consistent under a modified condition. Meier et al. (2008) extended group LASSO to the logistic model.

Group LASSO has been demonstrated to be competitive both in simulation studies and in theoretical analysis. However, it still has the limitation that the covariate matrices should be orthonormal. Even though the optimization problem can be solved numerically for non-orthonormal matrices, in practice the solution may not be elegant (Friedman et al., 2010). If we orthonormalize the covariate matrix and then apply the group LASSO, the final result is not the solution to the original problem. We will discuss this problem in Section 4.3.

4.2 MDL Criterion for Group LASSO

Once the candidate model class is set by group LASSO, we can choose the final model based on the tuning parameter $\lambda$ and the estimates. For group LASSO, a $C_p$-type statistic was suggested by Yuan and Lin (2006),
\[
C_p(\hat\mu) = \frac{\|y - \hat\mu\|^2}{\sigma^2} - n + 2\, df_{\mu,\sigma^2},
\]
which is an unbiased estimate of the true risk $E(\|\hat\mu - \mu\|^2)/\sigma^2$, where $\mu$ is the expectation of $y$ given $X$ and $\hat\mu$ is its estimate. In this definition, $\sigma^2$ is the constant variance of the error, which is unknown in most practical cases, and $df_{\mu,\sigma^2}$ is the degrees of freedom of the model. Ye (1998) pointed out that the degrees of freedom are not guaranteed to equal the number of parameters in the model. For example, if we know that one covariate has the strongest correlation with the response among all the candidate covariates and we use this covariate to fit a model, then the degrees of freedom of the fitted model are greater than one, since the covariate carries more statistical information. Efron (2004) provided a formulation of the degrees of freedom,
\[
df_{\mu,\sigma^2} = \sum_{i=1}^{n}\mathrm{cov}(\hat\mu_i, y_i)/\sigma^2,
\]
which can be estimated by numerical methods; this may, however, be computationally expensive. Yuan and Lin (2006) gave an approximate form of the d.f. under the orthonormal design assumption, and also derived the asymptotic properties of the degrees of freedom. Their simulation study demonstrated the excellent performance of group LASSO. However, the calculation of the $C_p$-type statistic for LASSO and group LASSO requires the true variance $\sigma^2$. Zou et al. (2007) proved that the least squares estimate of the variance is a sufficiently good approximation of the true value to be used in the criterion; for group LASSO, however, the form of the variance estimator is still ambiguous. Therefore, we borrow the idea from the minimum description length principle (Rissanen, 1978) and assume that $\sigma^2$ follows an inverse-gamma distribution. We can then quantify the information about the variance from the data and reduce the risk caused by estimating the variance.

We assume that the conditional distribution of $y$ is normal with p.d.f.
\[
f_\beta(y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Big\{-\frac{1}{2\sigma^2}\Big(y - \sum_{l=1}^{L} X_l\beta_l\Big)^T\Big(y - \sum_{l=1}^{L} X_l\beta_l\Big)\Big\},
\]
and we use $\tau$ instead of $\sigma^2$ to keep the notation consistent with the previous chapters. Then the marginal distribution of $y$ given $\beta$ is
\[
m(y|\beta) = \int f_\beta(y)\,\omega(\tau)\,d\tau
= \int \frac{1}{(2\pi\tau)^{n/2}}\exp\Big\{-\frac{1}{2\tau}\Big(y - \sum_{l=1}^{L} X_l\beta_l\Big)^T\Big(y - \sum_{l=1}^{L} X_l\beta_l\Big)\Big\}\frac{\sqrt a}{\sqrt{2\pi}}\exp\Big\{-\frac{a}{2\tau}\Big\}\tau^{-3/2}\,d\tau
\]
\[
= \sqrt a\,(2\pi)^{-\frac{n+1}{2}}\left(\frac{(y - \sum_{l=1}^{L} X_l\beta_l)^T(y - \sum_{l=1}^{L} X_l\beta_l) + a}{2}\right)^{-\frac{n+1}{2}}\Gamma\Big(\frac{n+1}{2}\Big).
\]

Removing the terms that do not influence the choice of the model,

$$-\log m(y|\beta) = \frac{n+1}{2}\log\Big\{\Big(y-\sum_{l=1}^{L}X_l\beta_l\Big)^T\Big(y-\sum_{l=1}^{L}X_l\beta_l\Big)+a\Big\} - \frac{1}{2}\log a.$$

Taking the first derivative and setting it to zero yields the hyperparameter estimate that minimizes the above function,

$$\hat{a} = \Big(y-\sum_{l=1}^{L}X_l\beta_l\Big)^T\Big(y-\sum_{l=1}^{L}X_l\beta_l\Big)\Big/\,n.$$

Substituting â and removing the terms that depend only on n, the conditional code length is

$$L(y|\beta) = \frac{n}{2}\log\frac{(y-\sum_{l=1}^{L}X_l\beta_l)^T(y-\sum_{l=1}^{L}X_l\beta_l)}{n} + \frac{1}{2}\log n.$$

The last term is the code length needed to encode the hyperparameter a. We then plug the group LASSO estimate of β into the above function. Since we know that this estimator is root-n consistent, it is reasonable and convenient to use a uniform encoder for β. The final form of the MDL criterion for group LASSO is therefore

$$\mathrm{glMDL} = \frac{n}{2}\log\frac{(y-\sum_{l=1}^{L}X_l\hat{\beta}_l)^T(y-\sum_{l=1}^{L}X_l\hat{\beta}_l)}{n} + \frac{1}{2}\log n + \frac{p}{2}\log n, \qquad (4.4)$$

where p is the number of nonzero regression parameters in β.
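As an illustration of how (4.4) is used along a group LASSO path, the following is a minimal sketch. The solver fit_group_lasso is a bare-bones proximal gradient (block soft-thresholding) routine for the objective ½‖y − Xβ‖² + λΣ_l‖β_l‖ (for simplicity we omit the group-size weights that Yuan and Lin attach to each group), and glmdl evaluates (4.4) at the fitted coefficients; all names are ours.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink a whole coefficient group toward zero (prox of the group penalty)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def fit_group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal gradient (ISTA) for 0.5*||y - X b||^2 + lam * sum_l ||b_l||."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2         # 1/L, L = largest singular value^2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))   # gradient step
        for g in np.unique(groups):                # blockwise prox step
            idx = groups == g
            beta[idx] = group_soft_threshold(z[idx], step * lam)
    return beta

def glmdl(X, y, beta):
    """Criterion (4.4): (n/2) log(RSS/n) + (1/2) log n + (p/2) log n."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return 0.5 * n * np.log(rss / n) + 0.5 * np.log(n) \
        + 0.5 * np.count_nonzero(beta) * np.log(n)

# select the tuning parameter by minimising glMDL over a grid:
# lams = np.logspace(-3, 2, 40)
# lam_best = min(lams, key=lambda lam: glmdl(X, y, fit_group_lasso(X, y, groups, lam)))
```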

4.3 Random Effects Selection via Group LASSO

Now we are ready to explore the group LASSO technique, coupled with the MDL criterion, to select the random effects via the tuning parameter λ in the linear mixed-effects model. Recall the linear mixed-effects model,

$$y_i = X_i\beta + Z_i\gamma_i + \varepsilon_i, \qquad \gamma_i \sim N(0, D), \qquad \varepsilon_i \sim N(0, \Lambda_i),$$

with γ₁, ..., γₙ, ε₁, ..., εₙ independent,

where n is the number of subjects; y_i is an n_i × 1 vector; X_i and Z_i are (n_i × p) and (n_i × q) design matrices; β is the p-dimensional vector of fixed-effects parameters; γ_i is the q-dimensional vector of random effects; and ε_i is an n_i-dimensional vector of residual components. Usually we assume the canonical form Λ_i = σ²I_{n_i}.

As mentioned in Section 4.1, group LASSO has desirable performance for estimation and variable selection, supported by theoretical analysis and simulation studies. However, it was developed for the standard linear model consisting of independent observations. When we consider the random effects, the correlation structure should be taken into account. The two-stage representation of the linear mixed-effects model is given by Laird and Ware (1982). In the first stage, β and γ_i are treated as fixed effects and do not depend on ε_i, where the definition of ε_i is the same as before. Therefore, each individual y_i has mean vector X_iβ + Z_iγ_i and variance-covariance matrix σ²I_{n_i}; this is our motivation and theoretical basis for further considering the conditional model. In the second stage, however, β is considered as the population parameter and γ_i is multivariate normally distributed with mean zero and covariance matrix D. Even though our interest is in the conditional model defined in the first stage, we still cannot ignore the covariance, as it is not sensible to separate stage 1 and stage 2. When we consider the estimation procedure, both aspects of the model need to be included. We will use the existing, well-developed estimation methods for the linear mixed-effects model; group LASSO will be treated as part of the model selection procedure.

When the covariance structure is known, Harville (1976) gave an explicit form of the estimator,

$$\hat{\gamma}_i = DZ_i^T(Z_iDZ_i^T + \sigma^2 I_{n_i})^{-1}(y_i - X_i\hat{\beta}).$$
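In matrix code this estimator is a direct transcription of the displayed formula; the sketch below (names ours) uses a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def blup_gamma_i(y_i, X_i, Z_i, beta_hat, D, sigma2):
    """gamma_i hat = D Z_i' (Z_i D Z_i' + sigma^2 I)^{-1} (y_i - X_i beta_hat)."""
    V_i = Z_i @ D @ Z_i.T + sigma2 * np.eye(len(y_i))
    return D @ Z_i.T @ np.linalg.solve(V_i, y_i - X_i @ beta_hat)
```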

When the covariance structure is unknown, the Expectation-Maximization (EM) algorithm is often used to estimate the random effects. Let α denote the vector of all parameters in σ²I_{n_i} and D. We show the details of the EM algorithm under the REML estimators, as given by Laird and Ware (1982):

• M-step: Find the maximum likelihood estimate of α when t₁ and t₂ are known, with the definitions

  – $\hat{\sigma}^2 = \sum_{i=1}^{n}\varepsilon_i^T\varepsilon_i \big/ \sum_{i=1}^{n} n_i = t_1 \big/ \sum_{i=1}^{n} n_i$,

  – $\hat{D} = \sum_{i=1}^{n}\gamma_i\gamma_i^T \big/ n = t_2/n$.

  Given the definition of α, σ̂² and D̂ are sufficient statistics for it.

• E-step: Calculate the expected values given the current estimate α̂:

  – $\hat{t}_1 = E\big(\sum_{i=1}^{n}\varepsilon_i^T\varepsilon_i \mid y, \hat{\alpha}\big) = \sum_{i=1}^{n}\{\hat{\varepsilon}_i^T\hat{\varepsilon}_i + \mathrm{Tr}\,\mathrm{Var}(\varepsilon_i \mid y_i, \hat{\alpha})\}$,

  – $\hat{t}_2 = E\big(\sum_{i=1}^{n}\gamma_i\gamma_i^T \mid y, \hat{\alpha}\big) = \sum_{i=1}^{n}\{\hat{\gamma}_i\hat{\gamma}_i^T + \mathrm{Var}(\gamma_i \mid y_i, \hat{\alpha})\}$.

To make the above procedure clearer, we record the following formulas. The conditional expectations are

$$E(\beta|y, \hat{\alpha}) = \hat{\beta}(\hat{\alpha}), \qquad E(\gamma_i|y, \hat{\alpha}) = \hat{\gamma}_i(\hat{\alpha}) \qquad \text{and} \qquad E(\varepsilon_i|y, \hat{\alpha}) = \hat{\varepsilon}_i(\hat{\alpha}).$$

The residual has a form similar to that in linear regression, ε̂_i = y_i − X_iβ̂ − Z_iγ̂_i. To calculate the variances, a useful trick is the decomposition over β,

$$\mathrm{Var}(\gamma_i|y_i, \alpha) = \mathrm{Var}(\gamma_i|y_i, \beta, \alpha) + \mathrm{Var}\{E(\gamma_i|y_i, \beta, \alpha)\},$$

and

$$\mathrm{Var}(\varepsilon_i|y_i, \alpha) = \mathrm{Var}(\varepsilon_i|y_i, \beta, \alpha) + \mathrm{Var}\{E(\varepsilon_i|y_i, \beta, \alpha)\},$$

where both Var(γ_i|y_i, β, α) and Var(ε_i|y_i, β, α) do not depend on β. When the iteration converges, we obtain the estimates of α, β and the γ_i from the last E-step.

When we use the ML estimators, the M-step stays the same, and the E-step has the form

$$\hat{t}_1 = E\Big(\sum_{i=1}^{n}\varepsilon_i^T\varepsilon_i \,\Big|\, y, \hat{\beta}(\hat{\alpha}), \hat{\alpha}\Big) = \sum_{i=1}^{n}\{\hat{\varepsilon}_i^T\hat{\varepsilon}_i + \mathrm{Tr}\,\mathrm{Var}(\varepsilon_i \mid y_i, \hat{\beta}(\hat{\alpha}), \hat{\alpha})\},$$

and

$$\hat{t}_2 = E\Big(\sum_{i=1}^{n}\gamma_i\gamma_i^T \,\Big|\, y, \hat{\beta}(\hat{\alpha}), \hat{\alpha}\Big) = \sum_{i=1}^{n}\{\hat{\gamma}_i\hat{\gamma}_i^T + \mathrm{Var}(\gamma_i \mid y_i, \hat{\beta}(\hat{\alpha}), \hat{\alpha})\}.$$
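A compact sketch of the ML version of this iteration might look as follows (the REML version would additionally propagate the variability of β̂ into the conditional variances). The closed forms Var(γ_i|y_i) = D − DZ_i^T V_i^{-1} Z_i D and Var(ε_i|y_i) = σ²I − σ⁴V_i^{-1}, with V_i = Z_i D Z_i^T + σ²I, are standard consequences of joint normality; all variable names are ours.

```python
import numpy as np

def em_lmm_ml(X_list, Z_list, y_list, n_iter=100):
    """EM (ML version) for y_i = X_i beta + Z_i gamma_i + eps_i."""
    q = Z_list[0].shape[1]
    sigma2, D = 1.0, np.eye(q)
    N = sum(len(y) for y in y_list)
    for _ in range(n_iter):
        # GLS estimate of beta at the current alpha = (sigma2, D)
        Vinv = [np.linalg.inv(Z @ D @ Z.T + sigma2 * np.eye(len(y)))
                for Z, y in zip(Z_list, y_list)]
        A = sum(X.T @ Vi @ X for X, Vi in zip(X_list, Vinv))
        b = sum(X.T @ Vi @ y for X, Vi, y in zip(X_list, Vinv, y_list))
        beta = np.linalg.solve(A, b)
        # E-step: BLUPs, conditional residuals and conditional variances
        t1, t2 = 0.0, np.zeros((q, q))
        for X, Z, y, Vi in zip(X_list, Z_list, y_list, Vinv):
            r = y - X @ beta
            g = D @ Z.T @ Vi @ r                             # gamma_i hat
            e = r - Z @ g                                    # eps_i hat
            Vg = D - D @ Z.T @ Vi @ Z @ D                    # Var(gamma_i | y_i)
            Ve = sigma2 * np.eye(len(y)) - sigma2**2 * Vi    # Var(eps_i | y_i)
            t1 += e @ e + np.trace(Ve)
            t2 += np.outer(g, g) + Vg
        # M-step: sufficient-statistic updates
        sigma2 = t1 / N
        D = t2 / len(y_list)
    return beta, sigma2, D
```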

After obtaining efficient estimates, selection of the final model is the key step. The model selection literature for the conditional model of the LMM is very limited. Conditional AIC, proposed by Vaida and Blanchard (2005), is designed specifically for selecting the random effects in the linear mixed-effects model. In their work, they defined the conditional Akaike information on the cluster/subject level. The unbiased estimator of the cAIC is of the form

$$\mathrm{cAIC} = -2\log f(y \mid \hat{\beta}, \hat{\sigma}^2, \hat{\gamma}) + 2K,$$

where K is the finite-sample penalty,

$$K = \frac{N(N-p-1)}{(N-p)(N-p-2)}(\rho+1) + \frac{N(p+1)}{(N-p)(N-p-2)}$$

for the ML estimator, and

$$K = \frac{N-p-1}{N-p-2}(\rho+1) + \frac{p+1}{N-p-2}$$

for the REML estimator. In these formulas, ρ is the effective degrees of freedom defined by Hodges and Sargent (2001). More specifically, pseudo zero responses 0 = γ + (−γ) are added to handle the random effects, treating γ as a parameter and −γ as the error, so that the original model can be represented as the linear model

$$\begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} X & Z \\ 0 & I_r \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \begin{pmatrix} \varepsilon \\ -\gamma \end{pmatrix}. \qquad (4.5)$$

In this general form, we use Γ to denote the variance-covariance matrix of the error term. Multiplying both sides of equation (4.5) by Γ^{−1/2} makes the new error term distributed as N(0, σ²I), and the generalized least squares estimator is obtained directly,

$$\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = (M^TM)^{-1}\begin{pmatrix} X^T \\ Z^T \end{pmatrix}y, \qquad \text{where} \quad M = \begin{pmatrix} X & Z \\ 0 & \Delta \end{pmatrix}.$$

If we use D* to denote the variance-covariance matrix of γ, then Δ is determined by the equation D* = (Δ^TΔ)^{-1}. The estimator β̂ is identical to that in Harville (1977), and γ̂ coincides with the empirical Bayes estimator. The fitted responses are given by

$$\hat{y} = Hy = \begin{pmatrix} X & Z \end{pmatrix}(M^TM)^{-1}\begin{pmatrix} X^T \\ Z^T \end{pmatrix}y,$$

where H is the analogue of the hat matrix in the linear model. It is well known that, in the standard linear model, the trace of the hat matrix equals the degrees of freedom of the model. By analogy, the degrees of freedom of the above linear mixed-effects model can be defined as the trace of H, which is ρ. In the model selection comparisons that follow, we compare our MDL criterion with cAIC.
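To fix ideas, here is a sketch (names ours) of ρ = Tr(H) computed through the augmented representation (4.5), with Δ obtained from a Cholesky factor so that D* = (ΔᵀΔ)⁻¹, followed by the two penalties K:

```python
import numpy as np

def effective_df(X, Z, D_star):
    """rho = Tr(H) for the augmented model (4.5)."""
    Delta = np.linalg.cholesky(np.linalg.inv(D_star)).T  # Delta'Delta = inv(D*)
    r = Z.shape[1]
    M = np.block([[X, Z],
                  [np.zeros((r, X.shape[1])), Delta]])
    XZ = np.hstack([X, Z])
    H = XZ @ np.linalg.solve(M.T @ M, XZ.T)              # hat matrix analogue
    return np.trace(H)

def caic_penalty(N, p, rho, reml=True):
    """Finite-sample cAIC penalty K of Vaida and Blanchard (2005)."""
    if reml:
        return (N - p - 1) / (N - p - 2) * (rho + 1) + (p + 1) / (N - p - 2)
    c = N / ((N - p) * (N - p - 2))
    return c * (N - p - 1) * (rho + 1) + c * (p + 1)
```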

When we consider the model on the subject/cluster level, the random effects are treated as fixed within each subject in the final expression of the model; the correlation structure vanishes in the conditional model. Therefore, LASSO can be used to select the final model. However, as discussed earlier, it may eliminate random coefficients for some subjects whose "true" values are close to zero, even when the whole group of these random effects is in fact significant in the underlying model on the population level. Therefore, we adopt the idea of group LASSO to select the random effects. Our model selection procedure consists of two steps. In the first step, we use group LASSO to derive a candidate model class indexed by the tuning parameter λ. In the second step, we use the MDL criterion derived for group LASSO to select the final model within this class. Since our interest focuses on the random effects, without loss of generality we set β = 0. Assume that the true model contains the random effects γ_i = (γ_{i1}, ..., γ_{iq})^T, i = 1, ..., n. To apply group LASSO, we need to divide the random effects into groups. Naturally, we organize (γ_{1l}, γ_{2l}, ..., γ_{nl}) as the lth group. The following simulation study shows that our method performs excellently at identifying the underlying significant random effects.

The simulation is conducted by generating data for n = 50 subjects. Within each subject there are 6 repeated measurements, i.e. n_i = 6 for all 1 ≤ i ≤ 50, and there are 5 candidate covariates. We assume that the kth column of the random-effects design matrix Z_i is z_{ki}, an n_i × 1 vector, where z_{1ij} ~ N(0, 1), z_{2ij} ~ Uniform(−1, 1), z_{3ij} ~ Exponential(1.5), z_{4ij} ~ Uniform(0, 1) and z_{5ij} ~ N(0, 2). The error vector ε_i ~ N(0, 0.1 I_6). The random coefficients are generated from a multivariate normal with mean zero, variances {1, 2, 1}, and correlation coefficients corr(γ_{i1}, γ_{i2}) = 0.3, corr(γ_{i1}, γ_{i3}) = 0.2 and corr(γ_{i2}, γ_{i3}) = 0.1. All possible candidate models are considered, and group LASSO with glMDL is used to choose the model. In Table 4.1, we compare the performance of the new selection method with that of cAIC on 100 replicates for each model. It can be seen that cAIC always chooses a bigger model; by contrast, the proposed method performs excellently.

    Final Model                 cAIC   MDL
    (γ1, γ2, γ3)                   0    82
    (γ1, γ2, γ3, γ4)               1     5
    (γ1, γ2, γ3, γ5)               1     2
    (γ1, γ2, γ3, γ4, γ5)          98    11

Table 4.1: Comparison of the glMDL criterion with cAIC for random effects; shown are the numbers of cases selected out of 100 Monte Carlo runs with true model (γ1, γ2, γ3). Since neither method selects a model with fewer than three covariates, those cases are not shown.

We also compare the residual sum of squares given by these two model selection methods in Figure 4.2.

As mentioned earlier, we use the proposed method and cAIC to select the final model and then use the classical estimator to compute the fitted values, which are shown in Figure 4.2 by the solid line and the dotted line, respectively. We also use asterisks to denote the results given by simultaneous selection and estimation via group LASSO.

Figure 4.2: Residual sum of squares given by glMDL and cAIC. The y-axis denotes the mean residual sum of squares for each Monte Carlo run. The solid line shows the results given by the new method with the classical estimation; the dotted line shows the results given by cAIC; and the asterisks indicate the fitted values estimated and selected by group LASSO.

The mean residual sum of squares given by the proposed two-stage method is smaller than or equal to the value given by cAIC. Group LASSO shows excellent performance in random effects model selection. However, due to the correlation, it does not excel in estimation: comparing the solid line and the asterisks, the mean residual sum of squares based on the group LASSO estimates is much larger than that based on the classical estimation. This may also be caused by the fact that the design matrix is not orthonormal.
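For completeness, here is a sketch of the data-generating scheme just described; where the text leaves a convention open we state our reading in the comments (Exponential(1.5) is taken as rate 1.5, and N(0, 2) as variance 2), and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ni, q = 50, 6, 5                        # subjects, measurements per subject, candidates
sd = np.sqrt([1.0, 2.0, 1.0])              # std devs of the three active effects
R = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.1, 1.0]])            # correlations among gamma_1, gamma_2, gamma_3
Sigma_g = np.outer(sd, sd) * R             # implied covariance matrix

Z_list, y_list = [], []
for i in range(n):
    Z = np.column_stack([
        rng.normal(0.0, 1.0, ni),          # z1 ~ N(0, 1)
        rng.uniform(-1.0, 1.0, ni),        # z2 ~ Uniform(-1, 1)
        rng.exponential(1.0 / 1.5, ni),    # z3 ~ Exponential(rate 1.5), our reading
        rng.uniform(0.0, 1.0, ni),         # z4 ~ Uniform(0, 1)
        rng.normal(0.0, np.sqrt(2.0), ni)  # z5 ~ N(0, 2), variance 2 assumed
    ])
    gamma = np.zeros(q)
    gamma[:3] = rng.multivariate_normal(np.zeros(3), Sigma_g)
    y = Z @ gamma + rng.normal(0.0, np.sqrt(0.1), ni)   # eps_i ~ N(0, 0.1 I_6)
    Z_list.append(Z)
    y_list.append(y)

# for the group LASSO step, the l-th random effect across all subjects,
# (gamma_{1l}, ..., gamma_{nl}), forms the l-th group
```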

4.4 Functional Eigenfunction Selection via Group LASSO

Recall the model commonly used to describe sparse functional data. We assume that the ith curve has the form

$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty}\xi_{ik}\phi_k(t), \qquad t \in \mathcal{T}, \qquad (4.6)$$

where the φ_k(t) are eigenfunctions and the ξ_{ik} are uncorrelated random variables with mean zero and variances Eξ²_{ik} = λ_k, where the λ_k are non-increasing eigenvalues, i.e. λ₁ ≥ λ₂ ≥ ... and Σ_k λ_k < ∞. Since it is implausible to handle an infinite number of eigenfunctions in practice, we truncate the FPC representation to the q leading terms. To be realistic, the jth observation y_{ij} on the ith subject is assumed to be contaminated with measurement error ε_{ij}, j = 1, ..., n_i, where n_i is the number of repeated measurements on the ith curve. The measurement errors ε_{ij} are i.i.d. Gaussian with mean zero and variance σ², and are independent of all other random variables. Then we have the data model

$$y_{ij} = \mu(t_{ij}) + \sum_{k=1}^{q}\xi_{ik}\phi_k(t_{ij}) + \varepsilon_{ij}, \qquad t_{ij} \in \mathcal{T}. \qquad (4.7)$$

There is an extensive literature on functional principal component analysis for dense, regular grids of time points. By contrast, Rice and Wu (2001), James et al. (2000), James and Sugar (2003) and Yao et al. (2005) considered the modelling of irregularly spaced and potentially sparse functional data. Among them, James et al. (2000) proposed the reduced rank model coupled with a B-spline basis expansion for representing sparse functional data, which we mentioned in Section 2.4.2, while Yao et al. (2005) proposed functional principal component analysis through conditional expectation, in which the idea of pooling information across all subjects is adopted to overcome the sparseness. We follow this line of work and adapt our proposed random effects selection to the choice of the number of FPC components. To estimate the random effects, we first estimate the population-level parameters: the mean function µ(t) = EX(t) and the covariance function cov(X(s), X(t)) = G(s, t) are estimated by smoothing techniques, e.g., local linear regression, that combine all observations from the entire sample; we refer to Yao et al. (2005) for the detailed algorithms.

We use Ĝ(s, t) to denote the estimated covariance surface, G̃(t) its estimate along the diagonal, and V̂(t) the estimated variance of an individual observation at time t. To make the estimation of σ² more stable, the two ends of the time interval are cut off. If the resulting value is positive we set

$$\hat{\sigma}^2 = \frac{2}{|\mathcal{T}|}\int_{\mathcal{T}_1}\{\hat{V}(t) - \tilde{G}(t)\}\,dt,$$

and otherwise σ̂² = 0. In the integration, $\mathcal{T}_1 = [\inf(\mathcal{T}) + |\mathcal{T}|/4,\; \sup(\mathcal{T}) - |\mathcal{T}|/4]$, and |T| is the length of the interval T.
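On a discrete grid this estimator reduces to a truncated trapezoidal integral; a minimal sketch (names ours), taking V_hat and G_diag as the estimated curves evaluated on t_grid:

```python
import numpy as np

def sigma2_hat(t_grid, V_hat, G_diag):
    """(2/|T|) * integral over T1 of {V_hat(t) - G_tilde(t)} dt, floored at 0."""
    T_len = t_grid[-1] - t_grid[0]
    keep = (t_grid >= t_grid[0] + T_len / 4) & (t_grid <= t_grid[-1] - T_len / 4)
    val = 2.0 / T_len * np.trapz(V_hat[keep] - G_diag[keep], t_grid[keep])
    return max(val, 0.0)
```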

The eigenfunctions are estimated by discretizing the smoothed covariance, subject to the constraints $\int_{\mathcal{T}}\hat{\phi}_k(t)^2\,dt = 1$ and $\int_{\mathcal{T}}\hat{\phi}_k(t)\hat{\phi}_m(t)\,dt = 0$ for m < k, using the eigen-relationship

$$\int_{\mathcal{T}}\hat{G}(s, t)\hat{\phi}_k(s)\,ds = \hat{\lambda}_k\hat{\phi}_k(t).$$

Differently from the previous estimation of the random effects, conditional expectation under a Gaussian assumption on the process X is used to estimate ξ_{ik}. Rewrite the components of each trajectory in vector form: $X_i = (X_i(T_{i1}), \ldots, X_i(T_{in_i}))^T$, $y_i = (y_{i1}, \ldots, y_{in_i})^T$, $\mu_i = (\mu(T_{i1}), \ldots, \mu(T_{in_i}))^T$ and $\phi_{ik} = (\phi_k(T_{i1}), \ldots, \phi_k(T_{in_i}))^T$. The estimator of the FPC scores, named PACE (principal analysis through conditional expectation), is given by

$$\hat{\xi}_{ik} = \hat{E}[\xi_{ik}|y_i] = \hat{\lambda}_k\hat{\phi}_{ik}^T\hat{\Sigma}_i^{-1}(y_i - \hat{\mu}_i),$$

where the element in the jth row and lth column of Σ̂_i is Ĝ(T_{ij}, T_{il}) + σ̂²δ_{jl}, with δ_{jl} = 1 when j = l and δ_{jl} = 0 otherwise. This approach is the analogue of the best linear unbiased predictor (BLUP) in the classical LMM and carries many of the desirable properties of BLUP (Yao et al., 2005).
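A sketch of the PACE formula for a single subject (names ours); here Σ̂_i is assembled from the truncated expansion Ĝ(s, t) ≈ Σ_k λ̂_k φ̂_k(s)φ̂_k(t), which is one convenient way to build it from the estimated components:

```python
import numpy as np

def pace_scores(y_i, mu_i, Phi_i, lam, sigma2):
    """xi_ik hat = lam_k * phi_ik' Sigma_i^{-1} (y_i - mu_i) for k = 1..q.
    Phi_i: (n_i x q) eigenfunctions evaluated at this subject's times;
    lam: length-q eigenvalues; Sigma_i = Phi diag(lam) Phi' + sigma2 I."""
    Sigma_i = (Phi_i * lam) @ Phi_i.T + sigma2 * np.eye(len(y_i))
    return lam * (Phi_i.T @ np.linalg.solve(Sigma_i, y_i - mu_i))
```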

To choose the eigenfunctions, an AIC-type criterion is used in Yao et al. (2005). However, when the AIC is used to choose eigenfunctions, the final model should explain the population properties; in this case the correlation structure should be considered, which conflicts with the conditional likelihood used in the AIC of Yao et al. (2005). Therefore we do not use the heuristic AIC criterion suggested by Yao et al. (2005), but use cAIC as the comparison benchmark, since the penalty term of cAIC also accounts for the correlation structure. Here we treat the FPC scores ξ_{ik} as random effects and apply the proposed method instead. The estimated eigenfunctions evaluated at the observation times form the entries of the design matrix, and the random effects corresponding to the same eigenfunction are put into one group. We can then easily adopt group LASSO to set up a class of candidate models indexed by the tuning parameter λ, and apply the MDL criterion derived for group LASSO to select the number of groups, i.e., the number of eigenfunctions.

We also conducted numerical studies to demonstrate the usefulness of the above proposal. The simulation setting and data-generating scheme are similar to those in Section 2.4.3. There are 50 trajectories, each of which contains 8 to 15 time points: the number of time points n_i within each subject is generated from {8, 9, ..., 15} with equal probability, and the time points are then generated uniformly on [0, 10]. Three underlying eigenfunctions generated from the classical Fourier basis are used, with eigenvalues 9, 4 and 1. The measurement error is normally distributed with mean zero and variance 0.004. The largest candidate model contains 5 eigenfunctions. Not unexpectedly, the simulation results suggest that cAIC always tends to over-select the number of FPCs (in over 90% of the Monte Carlo runs), while the proposed method performs well at picking the "true" number of eigenfunctions, namely 49 times out of 50.

Appendix A

Proof of Consistency

As we have already proved, $E_\beta FSS_\sigma = p + \beta^T(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}X_i)\beta$. Let ξ_i = Z_iγ_i + ε_i, which follows a normal distribution with mean zero and variance-covariance matrix Σ_i.

We now demonstrate that (FSS_σ − E_β FSS_σ)/n converges to 0 in probability. Write

$$\begin{aligned}
\frac{1}{n}FSS_\sigma - \frac{1}{n}E_\beta FSS_\sigma &= \frac{1}{n}\Big(\sum_{i=1}^{n}y_i^T\Sigma_i^{-1}X_i\Big)\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}y_i\Big) - \frac{1}{n}\beta^T\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}X_i\Big)\beta - \frac{p}{n} \\
&= \frac{1}{n}\Big(\sum_{i=1}^{n}\xi_i^T\Sigma_i^{-1}X_i\Big)\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}\xi_i\Big) - \frac{p}{n} + \frac{2}{n}\beta^T\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}\xi_i\Big). \qquad (A.1)
\end{aligned}$$


Since

$$E_\beta(X_i^T\Sigma_i^{-1}\xi_i) = X_i^T\Sigma_i^{-1}E\xi_i = 0,$$

by the strong law of large numbers,

$$\frac{2}{n}\beta^T\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}\xi_i\Big) \to 0 \quad \text{a.s.}$$

For convenience, rewrite the first part of (A.1) as

$$\frac{1}{n}\Big(\sum_{i=1}^{n}\xi_i^T\Sigma_i^{-1}X_i\Big)\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}X_i\Big)^{-1}\Big(\sum_{i=1}^{n}X_i^T\Sigma_i^{-1}\xi_i\Big) - \frac{p}{n} = \frac{1}{n}\xi^T\big(\Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\big)\xi - \frac{p}{n},$$

where ξ, X and Σ denote the stacked vector of the ξ_i, the stacked design matrix, and the block-diagonal matrix of the Σ_i, respectively.

If $A = \frac{1}{n}\Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}$, then

$$E\,\xi^TA\xi = \mathrm{Tr}[A\,\mathrm{Var}(\xi)] + (E\xi)^TA\,E\xi = \mathrm{Tr}\Big[\frac{1}{n}\Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\Sigma\Big] + 0 = \frac{p}{n}.$$

Hence the first part of (A.1) has expected value zero. The variance of the quadratic form is

$$\begin{aligned}
\mathrm{Var}(\xi^TA\xi) &= 2\,\mathrm{Tr}[A\,\mathrm{Var}(\xi)\,A\,\mathrm{Var}(\xi)] + 4(E\xi)^TA\,\mathrm{Var}(\xi)\,A\,E\xi \\
&= \frac{2}{n^2}\mathrm{Tr}[\Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\Sigma\,\Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\Sigma] \\
&= \frac{2}{n^2}\mathrm{Tr}\,I_p = \frac{2p}{n^2} \to 0.
\end{aligned}$$

This indicates that the first part converges to 0 in L₂ and thus in probability. Therefore, FSS_σ/n converges to E_β FSS_σ/n in probability. If we assume that all the n_i are bounded, then E_β FSS_σ = O(N), and it follows that FSS_σ = O_p(N).

Appendix B

Newton-Raphson Iteration for Deriving the MDL Criteria

B.1 Newton-Raphson Iteration for gMDL1

To avoid confusion with the previous derivation, let V* = M. Then we have

$$h(c) = \frac{d}{dc}\{-\log m(y|\phi)\} = \frac{1}{2}\mathrm{Tr}[(cM+J^{-1})^{-1}M] - \frac{1}{2\phi}\hat{\beta}^T(cM+J^{-1})^{-1}M(cM+J^{-1})^{-1}\hat{\beta},$$

and

$$H(c) = \frac{d^2}{dc^2}\{-\log m(y|\phi)\} = -\frac{1}{2}\mathrm{Tr}[(cM+J^{-1})^{-1}M(cM+J^{-1})^{-1}M] + \frac{1}{\phi}\hat{\beta}^T(cM+J^{-1})^{-1}M(cM+J^{-1})^{-1}M(cM+J^{-1})^{-1}\hat{\beta}.$$

Finally, the iteration has the form

$$c^{(l+1)} = c^{(l)} - h(c^{(l)})/H(c^{(l)}).$$
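A direct transcription of this iteration (names ours), stopping once successive values of c agree to a tolerance:

```python
import numpy as np

def newton_raphson_c(c0, M, J_inv, beta_hat, phi, tol=1e-8, max_iter=100):
    """Newton-Raphson for the hyperparameter c in gMDL1: c <- c - h(c)/H(c),
    with h and H as displayed above."""
    c = c0
    for _ in range(max_iter):
        A = np.linalg.inv(c * M + J_inv)          # (cM + J^{-1})^{-1}
        h = 0.5 * np.trace(A @ M) \
            - beta_hat @ A @ M @ A @ beta_hat / (2.0 * phi)
        H = -0.5 * np.trace(A @ M @ A @ M) \
            + beta_hat @ A @ M @ A @ M @ A @ beta_hat / phi
        c_new = c - h / H
        if abs(c_new - c) < tol:
            return c_new
        c = c_new
    return c
```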

B.2 Newton-Raphson Iteration for gMDL2

Let

$$\Sigma = (cM+J^{-1})^{-1}M(cM+J^{-1})^{-1} \qquad \text{and} \qquad \mathrm{Dom} = \hat{\beta}^T(cM+J^{-1})^{-1}\hat{\beta} + D(y, \hat{\beta}).$$

Then

$$h(c) = \frac{1}{2}\mathrm{Tr}[(cM+J^{-1})^{-1}M] - \frac{N}{2}\,\frac{\hat{\beta}^T\Sigma\hat{\beta}}{\mathrm{Dom}},$$

and

$$H(c) = -\frac{1}{2}\mathrm{Tr}[\Sigma M] + \frac{N}{2}\,\frac{2\hat{\beta}^T\Sigma M(cM+J^{-1})^{-1}\hat{\beta}\,\mathrm{Dom} - (\hat{\beta}^T\Sigma\hat{\beta})^2}{\mathrm{Dom}^2}.$$

Bibliography

H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, Proceedings of the Second International Symposium on Information Theory, pages 267–281. Akademiai Kiado, Budapest, 1973.

A. Barron, J. Rissanen, and Bin Yu. Minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44:2743–2760, 1998.

L. Breiman and D. Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78:131–136, 1983.

K.P. Burnham and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information Theoretic Approach. Springer, New York, 2002.

T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

R.V. Craiu and Thomas C.M. Lee. Model selection for the competing risks model with and without masking. Technometrics, 47(4):457–467, 2005.

Bradley Efron. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 96:619–642, 2004.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. Preprint, 2010.

Stephen J. Gange. Generating multivariate categorical variates using the iterative proportional fitting algorithm. The American Statistician, 49(2):134–138, 1995.

Mark Hansen and Bin Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454):746–774, 2001.

Mark Hansen and Bin Yu. Minimum description length model selection criteria for generalized linear models. In Statistics and Science: A Festschrift for Terry Speed, volume 40, pages 145–163, 2003.

David Harville. Bayesian inference for variance components using only error contrasts. Biometrika, 61:383–385, 1974.

David Harville. Extension of the Gauss-Markov theorem to include the estimation of random effects. The Annals of Statistics, 4:384–395, 1976.

David Harville. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72:320–338, 1977.

James S. Hodges and Daniel J. Sargent. Counting degrees of freedom in hierarchical and other richly-parameterised models. Biometrika, 88:367–379, 2001.

Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection in small samples. Biometrika, 76:297–307, 1989.

Gareth M. James and Catherine A. Sugar. Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 46:397–408, 2003.

Gareth M. James, Trevor J. Hastie, and Catherine A. Sugar. Principal component models for sparse functional data. Biometrika, 87:587–602, 2000.

Nan M. Laird and James H. Ware. Random-effects models for longitudinal data. Biometrics, 38:963–974, 1982.

Alan J. Lee. Generating random binary deviates having fixed marginal distributions and specified degrees of association. The American Statistician, 47(3):209–215, 1993.

Thomas C.M. Lee. An introduction to coding theory and the two-part minimum description length principle. International Statistical Review, 69(2):169–183, 2001.

Hua Liang, Hulin Wu, and Guohua Zou. A note on conditional AIC for linear mixed-effects models. Biometrika, 95:773–778, 2008.

Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.

Han Liu and Jian Zhang. Estimation consistency of the group lasso and its applications. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, 2nd edition, 1991.

Lukas Meier, Sara van de Geer, and Peter Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70:53–71, 2008.

Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37:246–270, 2009.

J.A. Nelder and D. Pregibon. An extended quasi-likelihood function. Biometrika, 74:221–232, 1987.

R. Nishii. Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12(2):758–765, 1984.

Wei Pan. Akaike's information criterion in generalized estimating equations. Biometrics, 57:120–125, 2001.

Chul Gyu Park, Taesung Park, and Dong Wan Shin. A simple method for generating correlated binary variates. The American Statistician, 50(4):306–310, 1996.

H.D. Patterson and R. Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58:545–554, 1971.

Donna K. Pauler. The Schwarz criterion and related methods for normal linear models. Biometrika, 85(1):13–27, 1998.

James O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer, New York, 2005.

John A. Rice and Colin O. Wu. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57:253–259, 2001.

J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.

Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983.

Jorma Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100, 1986.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

George Arthur Frederick Seber. Multivariate Observations. John Wiley & Sons, New York, 2004.

C.E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 1948.

Ritei Shibata. An optimal selection of regression variables. Biometrika, 68:45–54, 1981.

T.P. Speed and Bin Yu. Model selection and prediction: Normal regression. Ann. Inst. Statist. Math., 45:35–54, 1993.

David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64:583–639, 2002.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

Florin Vaida and Suzette Blanchard. Conditional Akaike information for mixed-effects models. Biometrika, 92:351–370, 2005.

A. Verdonck, L. De Ridder, G. Verbeke, J.P. Bourguignon, C. Carels, E.R. Kuhn, V. Darras, and F. de Zegher. Comparative effects of neonatal and prepubertal castration on craniofacial growth in rats. Archives of Oral Biology, 43:861–871, 1998.

R.W.M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61:439–447, 1974.

Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937–950, 2005.

Fang Yao, Hans-Georg Muller, and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100:577–590, 2005.

Jianming Ye. On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93:120–131, 1998.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.

Hui Zou, Trevor Hastie, and Robert Tibshirani. On the "degrees of freedom" of the lasso. The Annals of Statistics, 35:2173–2192, 2007.