Statistica Sinica 5 (1995), 19-39

ML ESTIMATION OF THE t DISTRIBUTION USING EM AND ITS EXTENSIONS, ECM AND ECME

Chuanhai Liu and Donald B. Rubin

Harvard University

Abstract: The multivariate $t$ distribution has many potential applications in applied statistics. Current computational advances will make it routinely available in practice in the near future. Here we focus on maximum likelihood estimation of the parameters of the multivariate $t$, with known and unknown degrees of freedom, with and without missing data, and with and without covariates. We describe EM, ECM and ECME algorithms and indicate their relative computational efficiencies. All three algorithms are analytically quite simple, and all have stable monotone convergence to a local maximum likelihood estimate. ECME, however, can have a dramatically faster rate of convergence.

Key words and phrases: EM, ECM, ECME, incomplete data, missing data, multivariate $t$, robust estimation.

1. Introduction

The multivariate $t$ distribution can be a useful theoretical tool for applied statistics. Of particular importance, it can be used for robust estimation of means, regression coefficients, and variance-covariance matrices in multivariate linear models, even in cases with missing data. A brief history of the theoretical development leading to such uses is as follows. Dempster, Laird and Rubin (1977) show that the EM algorithm can be used to find maximum likelihood (ML) estimates (MLEs) with complete univariate data and fixed degrees of freedom, and Dempster, Laird and Rubin (1980) extend these results to the regression case. Rubin (1983) shows how this result is easily extended to the multivariate $t$, and Little and Rubin (1987) and Little (1988) further extend the results to show how EM can deal with cases with missing data. Lange, Little and Taylor (1989) consider the more general situation with unknown degrees of freedom and find the joint MLEs of all parameters using EM; they also provide several applications of this general model. Related discussion appears in many places; a recent example is Lange and Sinsheimer (1993).

Here, using a generalization of the ECM algorithm (Meng and Rubin (1993)), called the ECME algorithm (Liu and Rubin (1994)), we find the joint MLE much more efficiently than by EM or ECM. We include comparisons of ECME with both EM and multicycle ECM and provide some new theoretical results. Care must be used with these ML procedures, however, especially with small or unknown degrees of freedom, because the likelihood function can have many spikes with very high likelihood values but little associated posterior mass under any reasonable prior. The associated parameter estimates can, therefore, be of little practical interest by themselves even though they are formally local or even global maxima of the likelihood function. It is, nevertheless, important to locate such maxima because they can critically influence the behavior of iterative simulation algorithms designed to summarize the entire posterior distribution.

The notation and theory needed to present our results are presented sequentially. First, in Section 2, we summarize fundamental results concerning the multivariate $t$ distribution, and in Section 3, derive the "complete-data" likelihood equations and associated ML estimates.
In Section 4, we present the EM algorithm for ML estimation with known degrees of freedom, and in Section 5 the EM and multicycle ECM algorithms when the degrees of freedom are to be estimated. Section 6 derives the efficient ECME algorithm, and Section 7 extends ECME for the $t$ to the case of linear models with fully observed predictor variables. Section 8 illustrates the extra efficiency of ECME over EM and ECM in two examples, and Section 9 provides a concluding discussion.

2. Multivariate t Distribution

When we say a $p$-dimensional random variable $Y$ follows the multivariate $t$ distribution $t_p(\mu, \Psi, \nu)$ with center $\mu$, positive definite inner product matrix $\Psi$, and degrees of freedom $\nu \in (0, \infty]$, we mean that, first, given the weight $\tau$, $Y$ has the multivariate normal distribution, and second, that the weight $\tau$ is Gamma distributed:

$$Y \mid \mu, \Psi, \nu, \tau \sim N_p(\mu, \Psi/\tau), \quad \text{and} \quad \tau \mid \mu, \Psi, \nu \sim \mathrm{Gamma}(\nu/2, \nu/2), \tag{1}$$

where the $\mathrm{Gamma}(\alpha, \beta)$ density function is $\beta^{\alpha}\tau^{\alpha-1}\exp\{-\beta\tau\}/\Gamma(\alpha)$ for $\tau > 0$, $\alpha > 0$, $\beta > 0$. As $\nu \to \infty$, $\tau \to 1$ with probability 1, and $Y$ becomes marginally $N_p(\mu, \Psi)$. Standard algebraic operations integrating $\tau$ from the joint density of $(Y, \tau)$ lead to the density function of the marginal distribution of $Y$, namely $t_p(\mu, \Psi, \nu)$:

$$\frac{\Gamma\big(\frac{\nu+p}{2}\big)\,|\Psi|^{-1/2}}{\Gamma\big(\frac{\nu}{2}\big)\,(\pi\nu)^{p/2}\,\big[1 + \delta_Y(\mu, \Psi)/\nu\big]^{(\nu+p)/2}}, \tag{2}$$

where

$$\delta_Y(\mu, \Psi) = (Y - \mu)'\,\Psi^{-1}\,(Y - \mu),$$

which is the Mahalanobis distance from $Y$ to the center $\mu$ with respect to $\Psi$. If $\nu > 1$, $\mu$ is the mean of $Y$, and if $\nu > 2$, $\nu\Psi/(\nu - 2)$ is its variance-covariance matrix. Because density (2) depends on $Y$ only through $\delta_Y(\mu, \Psi)$, the density is the same for all $Y$ that have the same distance from $\mu$, and thus the distribution is ellipsoidally symmetric about $\mu$.

Further critical properties of the multivariate $t$ concern its marginal and conditional distributions. Suppose $Y$ is partitioned into $Y = (x, y)$, where the dimensions of $x$ and $y$ are $p_x$ and $p_y$, respectively. Given $\tau$, we have the well-known normal results:

$$x \mid \mu, \Psi, \nu, \tau \sim N_{p_x}(\mu_x, \Psi_x/\tau) \tag{3}$$

and

$$y \mid x, \mu, \Psi, \nu, \tau \sim N_{p_y}(\mu_{y \cdot x}, \Psi_{y \cdot x}/\tau), \tag{4}$$

where

$$\mu_{y \cdot x} = \mu_y + \Psi_{y,x}\,\Psi_x^{-1}\,(x - \mu_x) \tag{5}$$

and

$$\Psi_{y \cdot x} = \Psi_y - \Psi_{y,x}\,\Psi_x^{-1}\,\Psi_{x,y}, \tag{6}$$

with $\mu = (\mu_x, \mu_y)$, $\Psi_x$ the inner product matrix corresponding to the components $x$ of $Y$, and $\Psi_{y,x} = \Psi_{x,y}'$ the submatrix of $\Psi$ corresponding to the $x$ columns and $y$ rows of $\Psi$; $\mu_{y \cdot x}$ and $\Psi_{y \cdot x}$ can be found either analytically as in (5) and (6) or numerically by the sweep operator (e.g., Goodnight (1979), Little and Rubin (1987)).

Thus, for the marginal distribution of $x$ we have from (1) and (3)

$$x \sim t_{p_x}(\mu_x, \Psi_x, \nu).$$

From (3), given $(\mu, \Psi, \nu, \tau)$ the random variable $\tau\,\delta_x(\mu_x, \Psi_x)$ is $\chi^2_{p_x}$ distributed, that is, $\mathrm{Gamma}(p_x/2, 1/2)$, so that treating $x$ as data, the likelihood of $\tau$ given $(\mu, \Psi, \nu, x)$ is

$$L(\tau \mid \mu, \Psi, \nu, x) \propto \mathrm{Gamma}\Big(\frac{p_x}{2}, \frac{\delta_x(\mu_x, \Psi_x)}{2}\Big).$$

Since the Gamma distribution is the conjugate prior distribution for the parameter $\tau$, from (1) and this likelihood, the conditional posterior distribution of $\tau$, i.e., its distribution given $(\mu, \Psi, \nu, x)$, is

$$\tau \mid x, \mu, \Psi, \nu \;=\; \tau \mid \delta_x(\mu_x, \Psi_x), \mu, \Psi, \nu \;\sim\; \mathrm{Gamma}\Big(\frac{\nu + p_x}{2}, \frac{\nu + \delta_x(\mu_x, \Psi_x)}{2}\Big), \tag{7}$$

whence

$$E(\tau \mid x, \mu, \Psi, \nu) = \frac{\nu + p_x}{\nu + \delta_x(\mu_x, \Psi_x)}. \tag{8}$$

From (4) and (7), the conditional distribution of $y$ given $x$ is $t_{p_y}(\mu_{y \cdot x}, \Psi^*_{y \cdot x}, \nu + p_x)$, where

$$\Psi^*_{y \cdot x} = \Psi_{y \cdot x}\,\frac{\nu + \delta_x(\mu_x, \Psi_x)}{\nu + p_x}.$$
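To make these results concrete, here is a minimal numerical sketch; it is our own illustration, not code from the paper, and the function names (`log_t_density`, `expected_weight`) and test values are assumptions. It evaluates the marginal log-density (2), computes the conditional weight expectation (8) with $x$ taken to be all of $Y$ (so $p_x = p$), and draws from $t_p(\mu, \Psi, \nu)$ via the normal/Gamma mixture (1).

```python
import numpy as np
from scipy.special import gammaln

def log_t_density(y, mu, psi, nu):
    """Log of the t_p(mu, psi, nu) density, equation (2), evaluated at y."""
    p = len(mu)
    diff = y - mu
    delta = diff @ np.linalg.solve(psi, diff)   # Mahalanobis distance delta_Y(mu, psi)
    _, logdet = np.linalg.slogdet(psi)
    return (gammaln((nu + p) / 2) - gammaln(nu / 2)
            - 0.5 * logdet - 0.5 * p * np.log(np.pi * nu)
            - 0.5 * (nu + p) * np.log1p(delta / nu))

def expected_weight(y, mu, psi, nu):
    """E(tau | y) = (nu + p) / (nu + delta), equation (8) with x = Y."""
    p = len(mu)
    diff = y - mu
    delta = diff @ np.linalg.solve(psi, diff)
    return (nu + p) / (nu + delta)

# Draw from t_p(mu, psi, nu) via the mixture (1): tau ~ Gamma(nu/2, rate nu/2),
# then Y | tau ~ N_p(mu, psi / tau). NumPy's gamma is parameterized by scale,
# so rate nu/2 corresponds to scale 2/nu.
rng = np.random.default_rng(0)
mu, psi, nu = np.zeros(2), np.eye(2), 4.0
tau = rng.gamma(nu / 2, 2 / nu)
y = rng.multivariate_normal(mu, psi / tau)
print(log_t_density(y, mu, psi, nu), expected_weight(y, mu, psi, nu))
```

Note from (8) that the expected weight decreases as the observation moves away from $\mu$; conditional expectations of exactly this form supply the downweighting of outlying points that underlies the robust estimation role of the $t$ noted in Section 1.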
3. ML Estimation of $(\mu, \Psi, \nu)$ with Observed $Y$ and $\tau$

From the definition of the multivariate $t$ distribution, $n$ independent draws from $t_p(\mu, \Psi, \nu)$ can be described as:

$$Y_i \mid \mu, \Psi, \nu, \tau_i \overset{\text{ind}}{\sim} N_p(\mu, \Psi/\tau_i) \quad \text{for } i = 1, \ldots, n, \tag{9}$$

and

$$\tau_i \mid \nu \overset{\text{iid}}{\sim} \mathrm{Gamma}\Big(\frac{\nu}{2}, \frac{\nu}{2}\Big) \quad \text{for } i = 1, \ldots, n. \tag{10}$$

When both $Y = \{Y_1, \ldots, Y_n\}$ and $\tau = \{\tau_1, \ldots, \tau_n\}$ are considered observed, $\{Y_1, \ldots, Y_n, \tau_1, \ldots, \tau_n\}$ comprise the complete data.

Because of the conditional structure of the complete-data model given by distributions (9) and (10), the complete-data likelihood function can be factored into the product of two distinct functions: the likelihood of $(\mu, \Psi)$ corresponding to the conditional distribution of $Y$ given $\tau$, and the likelihood function of $\nu$ corresponding to the marginal distribution of $\tau$. More precisely, given the complete data $(Y, \tau)$, the log-likelihood function of the parameters $\mu$, $\Psi$ and $\nu$, ignoring constants, is

$$L(\mu, \Psi, \nu \mid Y, \tau) = L_N(\mu, \Psi \mid Y, \tau) + L_G(\nu \mid \tau), \tag{11}$$

where

$$L_N(\mu, \Psi \mid Y, \tau) = -\frac{n}{2}\ln|\Psi| - \frac{1}{2}\,\mathrm{trace}\Big(\Psi^{-1}\sum_{i=1}^n \tau_i Y_i Y_i'\Big) + \mu'\Psi^{-1}\sum_{i=1}^n \tau_i Y_i - \frac{1}{2}\,\mu'\Psi^{-1}\mu\sum_{i=1}^n \tau_i, \tag{12}$$

and

$$L_G(\nu \mid \tau) = -n\ln\Gamma\Big(\frac{\nu}{2}\Big) + \frac{n\nu}{2}\ln\Big(\frac{\nu}{2}\Big) + \frac{\nu}{2}\sum_{i=1}^n \big(\ln(\tau_i) - \tau_i\big). \tag{13}$$

The complete-data sufficient statistics for $\mu$, $\Psi$ and $\nu$, appearing in (12) and in (13), are

$$S_\tau = \sum_{i=1}^n \tau_i, \quad S_{\tau Y} = \sum_{i=1}^n \tau_i Y_i, \quad S_{\tau YY} = \sum_{i=1}^n \tau_i Y_i Y_i', \quad \text{and} \quad S_G = \sum_{i=1}^n \big(\ln(\tau_i) - \tau_i\big). \tag{14}$$

Given the complete data $(Y, \tau)$, $\{S_\tau, S_{\tau Y}, S_{\tau YY}\}$ is the set of complete-data sufficient statistics for $(\mu, \Psi)$, and $S_G$ is the sufficient statistic for $\nu$. Given the sufficient statistics $S_\tau$, $S_{\tau Y}$, $S_{\tau YY}$, and $S_G$, the MLE of $(\mu, \Psi)$ and the MLE of $\nu$ can be obtained from $L_N(\mu, \Psi \mid Y, \tau)$ and $L_G(\nu \mid \tau)$, respectively. Specifically, the maximum likelihood estimates of $\mu$ and $\Psi$ from $L_N(\mu, \Psi \mid Y, \tau)$ are

$$\hat{\mu} = \frac{S_{\tau Y}}{S_\tau} = \frac{\sum_{i=1}^n \tau_i Y_i}{\sum_{i=1}^n \tau_i}, \tag{15}$$

and

$$\hat{\Psi} = \frac{1}{n}\Big(S_{\tau YY} - \frac{1}{S_\tau}\,S_{\tau Y} S_{\tau Y}'\Big) = \frac{1}{n}\sum_{i=1}^n \tau_i (Y_i - \hat{\mu})(Y_i - \hat{\mu})'. \tag{16}$$

Therefore, the MLE of the center $\mu$, namely $\hat{\mu}$, is the weighted mean of the observations $Y_1, \ldots, Y_n$ with weights $\tau_1, \ldots, \tau_n$; the MLE of the inner product matrix $\Psi$, namely $\hat{\Psi}$, is the average weighted sum of squares of the observations $Y_1, \ldots, Y_n$ about $\hat{\mu}$ with weights $\tau_1, \ldots, \tau_n$. The estimators of $(\mu, \Psi)$ given by (15) and (16) are known as weighted least squares estimators.
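The following sketch illustrates (15) and (16) on simulated complete data drawn according to (9) and (10); it is our own illustration under assumed names (`complete_data_mle`), not the paper's code.

```python
import numpy as np

def complete_data_mle(Y, tau):
    """Weighted least squares MLEs (15)-(16) given data Y (n x p) and observed weights tau (n,)."""
    n = len(tau)
    S_tau = tau.sum()                               # S_tau  = sum_i tau_i
    S_tauY = tau @ Y                                # S_tauY = sum_i tau_i Y_i
    mu_hat = S_tauY / S_tau                         # equation (15): weighted mean
    resid = Y - mu_hat
    psi_hat = (tau[:, None] * resid).T @ resid / n  # equation (16): average weighted scatter
    return mu_hat, psi_hat

# Simulate complete data from the model (9)-(10) with mu = 0, psi = I, nu = 5.
rng = np.random.default_rng(1)
nu, n = 5.0, 500
tau = rng.gamma(nu / 2, 2 / nu, size=n)             # tau_i ~ Gamma(nu/2, rate nu/2)
Y = rng.normal(size=(n, 2)) / np.sqrt(tau)[:, None] # Y_i | tau_i ~ N_2(0, I / tau_i)
mu_hat, psi_hat = complete_data_mle(Y, tau)
print(mu_hat, psi_hat, sep="\n")
```

When the weights $\tau_i$ are unobserved, as in the EM algorithms discussed in the following sections, each $\tau_i$ is replaced by a conditional expectation of the form (8), and the M-step reuses these same weighted least squares formulas.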