Biometrics 63, 22–32 DOI: 10.1111/j.1541-0420.2006.00662.x March 2007
AModified Bayes Information Criterion with Applications to the Analysis of Comparative Genomic Hybridization Data
Nancy R. Zhang∗ and David O. Siegmund∗∗ Department of Statistics, Stanford University, Stanford, California 94305, U.S.A. ∗email: [email protected] ∗∗email: [email protected]
Summary. In the analysis of data generated by change-point processes, one critical challenge is to deter- mine the number of change-points. The classic Bayes information criterion (BIC) statistic does not work well here because of irregularities in the likelihood function. By asymptotic approximation of the Bayes factor, we derive a modified BIC for the model of Brownian motion with changing drift. The modified BIC is similar to the classic BIC in the sense that the first term consists of the log likelihood, but it differs in the terms that penalize for model dimension. As an example of application, this new statistic is used to analyze array-based comparative genomic hybridization (array-CGH) data. Array-CGH measures the number of chromosome copies at each genome location of a cell sample, and is useful for finding the regions of genome deletion and amplification in tumor cells. The modified BIC performs well compared to existing methods in accurately choosing the number of regions of changed copy number. Unlike existing methods, it does not rely on tuning parameters or intensive computing. Thus it is impartial and easier to understand and to use. Key words:Bayes information criterion; Change-point; Comparative genomic hybridization; Model selec- tion.
1. Introduction The approach that we take here is similar to that suggested by The Bayes information criterion (BIC) proposed by Schwarz Siegmund, but we will treat problems that have a more nat- (1978) is a popular method for model selection. However, its ural theoretical formulation and reduce in continuous time to usage is not theoretically justified for change-point problems, finding change-points in the drift of Brownian motion. where the likelihood functions do not satisfy the required reg- Both the original and the modified BIC can be interpreted ularity conditions. An aspect of model selection in change- as penalized likelihood methods, although it should be em- point problems is to determine the number of change-points. phasized that the “penalty” arises as a natural consequence In this article, we will give a modified version of the BIC for of the computation and is not a regularization device to avoid data consisting of independent normally distributed observa- overfitting. A brief comparison of the modified BIC with other tions with constant variance and piecewise constant means. penalized likelihood methods is given in Section 2. There does not exist a standard method for model selec- As an example, we will apply the new version of the BIC to tion in change-point problems. Although lacking in theoretical the analysis of array-based comparative genomic hybridiza- justification, the BIC has previously been applied to models tion (array-CGH) data, which quantitatively measures the of the form treated in this article (e.g., see Yao, 1988; Li, DNA copy number at thousands of locations linearly ordered 2001). The BIC was derived from large-sample estimates of along the genome. The goal in array-CGH data analysis is to posterior model probabilities that do not depend on the prior detect accurately regions of DNA deletion or amplification. probability distributions of models nor of parameters. More More details of array-CGH studies and comparison with the recent model selection methods have taken a computationally methods of Olshen et al. (2004) and Fridyland et al. (2004) based Bayesian approach (e.g., George and McCulloch, 1993; are given in Sections 4 and 5. Green, 1995). They use Monte Carlo methods to estimate posterior model probabilities, and hence involve substantial 2. A Modified BIC for Gaussian Change-Point Models computation and subjective choice of prior. Consider a sequence of observations y =(y1, y2, ...,yT ), where Recently, Siegmund (2004) suggested a new approach to- yi are independently distributed Gaussian random variables: ward obtaining a large-sample approximation to the Bayes 2 factor that bypasses the Taylor expansion used in the deriva- yi ∼ N µj ,σ , for i = τj +1,...,τj+1,j=0,...,m. (1) tion of the BIC. Using this approach, he derived a new BIC- like statistic for use in mapping quantitative trait loci in ge- Here we will refer to the τ’s as the change-points of this pro- netic linkage studies, a problem which is related to finding cess. Of course the change-points are constrained to lie in the change-points in the mean of an Ornstein–Uhlenbeck process. set
22 C 2006, The International Biometric Society Modified BIC with Applications to Analysis of CGH Data 23
Dm = {(t1,...,tm):0≤ t1 ≤ t2 ≤ ···≤tm ≤ T }. where
ni(ˆt)=tˆi − tˆi−1, Forobvious reasons we add the restriction that µj = µj+1. ˆ − ti 1 T We will assume that for large sample sizes, the change-points 1 1 are far enough from each other in that y¯i(ˆt)= yj , y¯ = yj ni(ˆt) T j=tˆi−1 j=1 lim τi/T → ri for 1 ≤ i ≤ m, (2) →∞ T ˆt =(tˆ1,...,tˆm) where 0