International Journal of Clinical Biostatistics and Biometrics
ISSN: 2469-5831 | Volume 4 | Issue 1 | DOI: 10.23937/2469-5831/1510017

RESEARCH ARTICLE

Partial Variable Selection and its Applications in Biostatistics

Jingwen Gu1, Ao Yuan1,2*, Chunxiao Zhou2, Leighton Chan2 and Ming T Tan1*

1Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, USA
2Epidemiology and Biostatistics Section, Rehabilitation Medicine Department, Clinical Center, National Institutes of Health, USA

*Corresponding authors: Ao Yuan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington DC 20057, USA, and Epidemiology and Biostatistics Section, Rehabilitation Medicine Department, Clinical Center, National Institutes of Health, Bethesda MD 20892, USA, E-mail: [email protected]; Ming T Tan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington DC 20057, USA, E-mail: [email protected]

Abstract

We propose and study a method for partial covariate selection, which selects only the covariate values that fall in their effective ranges. The coefficient estimates based on the resulting model are more interpretable, being based on the effective covariate values. This is in contrast to the existing methods of variable selection, in which variables are selected or deleted as a whole. To test the validity of a partial variable selection, we extend the Wilks theorem to handle this case. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data analysis as illustration.

Keywords

Covariate, Effective range, Partial variable selection, Linear model, Likelihood ratio test

Introduction

Variable selection is a common practice in biostatistics, and there is a vast literature on the topic. Commonly used methods include the likelihood ratio test [1], the Akaike information criterion (AIC) [2], the Bayesian information criterion (BIC) [3], the minimum description length [4,5], and the Lasso [6]. The principal components method models linear combinations of the original covariates and reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some measure. May, et al. [7] addressed variable selection in artificial neural network models, Mehmood, et al. [8] gave a review of variable selection with partial least squares regression, Wang, et al. [9] addressed variable selection in generalized additive partial linear models, and Liu, et al. [10] addressed variable selection in semiparametric additive partial linear models. The Lasso [6,11] and its variants [12,13] are used to select a few significant variables in the presence of a large number of covariates.

However, the existing methods only select whole variables to enter the model, which may not be the most desirable in some biomedical practice. For example, in two heart disease studies [14,15] more than ten risk factors were identified by medical researchers over long-term investigations. With the existing variable selection methods, some of these risk factors would be deleted wholly from the investigation. This is not desirable, since a risk factor is really risky only when it falls into some risk range. Deleting the whole variable in this case is therefore not reasonable; a more reasonable approach is to find the risk ranges of these variables and delete the variable values in the non-risky ranges. In some other studies, some of the covariate values may be just random errors which do not contribute to the response, and removing those covariate values will make the model interpretation more accurate. In this sense we select a variable only where its values fall within some range. To our knowledge, a method for this kind of partial variable selection has not appeared in the literature, and developing one is the goal of this study.

Citation: Gu J, Yuan A, Zhou C, Chan L, Tan MT (2018) Partial Variable Selection and its Applications in Biostatistics. Int J Clin Biostat Biom 4:017. doi.org/10.23937/2469-5831/1510017

Received: January 08, 2018; Accepted: April 12, 2018; Published: April 14, 2018

Copyright: © 2018 Gu J, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Note that in the existing methods of variable selection, some variables are selected or deleted as a whole, while in our method some variables are partially selected or deleted, i.e., only some proportions of the observations of a variable are selected or deleted. The latter is very different from the existing methods. In summary, with traditional variable selection methods, such as stepwise regression or the Lasso, each covariate is removed either wholly or not at all from the analysis. This is not very reasonable: some of the removed covariates may be partially effective, so removing all of their values may yield misleading results, or at least cost information; and for the variables remaining in the model, not all of their values are necessarily effective for the analysis. With the proposed method, only the non-effective values of the covariates are removed, and the effective values are kept in the analysis. This is more reasonable than the existing all-or-nothing removal.

In the existing method of deleting whole variables, the validity of the selection can be justified by the Wilks result: under the null hypothesis of no effect of the deleted variables, two times the log-likelihood ratio is asymptotically chi-squared distributed. We extend the Wilks theorem to the proposed partial variable deletion and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data set as illustration.

The Proposed Method

The observed data are (y_i, x_i) (i = 1,...,n), where y_i is the response and x_i ∈ R^d is the covariate vector of the i-th subject. Denote y_n = (y_1,...,y_n)' and X_n = (x_1,...,x_n)'. Consider the linear model

  y_n = X_n β + ε_n,  (1)

where β = (β_1,...,β_d)' is the vector of regression parameters and ε_n = (ε_1,...,ε_n)' is the vector of random errors, or residual departures from the linear model assumption. Without loss of generality we consider the case where the ε_i's are independently and identically distributed (iid), i.e., with variance matrix Var(ε_n) = σ²I_n, where I_n is the n-dimensional identity matrix. When the ε_i's are not iid, it is often assumed that Var(ε_n) = Ω for some known positive-definite Ω; then, making the transformation ỹ_n = Ω^{-1/2}y_n, X̃_n = Ω^{-1/2}X_n and ε̃_n = Ω^{-1/2}ε_n, we get the model ỹ_n = X̃_n β + ε̃_n, in which the ε̃_i's are iid with Var(ε̃_n) = I_n. When Ω is unknown, it can be estimated in various ways; so below we only need to discuss the case where the ε_i's are iid.

Summary of existing work

We first give a brief review of the existing methods of variable selection. Assume the model residual ε = y − x'β has some known density function f(·) (such as the normal), with possibly some unknown parameters. For simplicity of discussion we assume there are no unknown parameters. Then the log-likelihood is

  ℓ_n(β) = Σ_{i=1}^n log f(y_i − x_i'β).

Let β̂ be the maximum likelihood estimate (MLE) of β (when f(·) is the standard normal density, β̂ is just the least squares estimate). If we delete k (≤ d) columns of X_n and the corresponding components of β, denote the remaining covariate matrix by X_n⁻, the resulting β by β⁻, and the corresponding MLE by β̂⁻. Under the hypothesis H_0 that the deleted columns of X_n have no effects, or equivalently that the deleted components of β are all zero, asymptotically [1]

  2[ℓ_n(β̂) − ℓ_n(β̂⁻)] →_D χ²_k,

where χ²_k is the chi-squared distribution with k degrees of freedom. For a given nominal level α, let χ²_k(1−α) be the (1−α)-th upper quantile of the χ²_k distribution. If 2[ℓ_n(β̂) − ℓ_n(β̂⁻)] ≥ χ²_k(1−α), then H_0 is rejected at significance level α, and it is not good to delete these columns of X_n; otherwise we accept H_0 and delete these columns of X_n.
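To make the review concrete, the following is a minimal sketch of the whole-column likelihood ratio test above, assuming f(·) is the standard normal density so that β̂ is the least squares estimate; the function names and the use of numpy/scipy are our own illustrative choices, not from the study.

```python
import numpy as np
from scipy import stats

def loglik(y, X):
    """Gaussian log-likelihood at the MLE; with f the standard normal
    density, the MLE of beta is the least squares estimate."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum(stats.norm.logpdf(y - X @ beta_hat)))

def whole_column_lrt(y, X, drop_cols, alpha=0.05):
    """Wilks test for deleting whole columns `drop_cols` of X: accept the
    deletion if 2[l_n(full) - l_n(reduced)] < chi2_k(1 - alpha)."""
    keep = [j for j in range(X.shape[1]) if j not in set(drop_cols)]
    lr = 2.0 * (loglik(y, X) - loglik(y, X[:, keep]))
    cutoff = stats.chi2.ppf(1 - alpha, df=len(drop_cols))
    return lr, cutoff, lr < cutoff  # True: the columns may be deleted
```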
There are some other methods for selecting columns of X_n, such as the AIC, the BIC and their variants, as in the model selection field. In these methods, the optimal deletion of columns of X_n corresponds to the best selected model, the one which optimizes the AIC or BIC. These methods are not as solid as the one above, as they may sometimes depend on inspection to choose the model which optimizes the AIC or BIC.

All the above methods require the models under consideration to be nested within each other, i.e., one is a sub-model of the other. Another, more general, model selection criterion is the minimum description length (MDL), a measure of complexity, developed by Kolmogorov [4], Wallace and Boulton [16], and others. The Kolmogorov complexity has a close relationship with the entropy: it is the output of a Markov information source, normalized by the length of the output, and it converges almost surely (as the length of the output goes to infinity) to the entropy of the source. Let ℱ = {g(·,·)} be a finite set of candidate models under consideration, and Θ = {θ_j : j = 1,...,h} be the set of parameters of interest. θ_i may or may not be nested within some other θ_j, and θ_i and θ_j may both be in Θ with the same dimension but different parametrizations. Consider a fixed density f(·|θ_j), with the parameter θ_j running through a subset Γ_j ⊂ R^{k_j}. To emphasize the index j of the parameter, we denote the MLE of θ_j under model f(·|θ_j) by θ̂_j (instead of by θ̂_n, which would emphasize the dependence on the sample size), I(θ_j) the Fisher information for θ_j under f(·|θ_j), |I(θ_j)| its determinant, and k_j the dimension of θ_j. The MDL criterion (for example, Rissanen [17], the review paper by Hansen and Yu [5], and references therein) chooses the θ_j minimizing

  −Σ_{i=1}^n log f(Y_i | θ̂_j) + (k_j/2) log(n/(2π)) + log ∫_{Γ_j} |I(θ_j)|^{1/2} dθ_j,  j = 1,...,h.  (3)

This method does not require the models to be nested, but it still selects or deletes whole columns. The other existing methods for variable selection, such as stepwise regression and the Lasso, are likewise all for deleting or keeping whole variables, and do not apply to our problem.


The proposed work

Now we come to our question, which is non-standard; we are not aware of a formal method addressing it, but we think it is of practical meaning. Consider deleting some of the components within k (k ≤ d) fixed columns of X_n, with deleted proportions γ_1,...,γ_k (0 < γ_j < 1) for these columns. Denote by X_n⁻ the remaining covariate matrix, which is X_n with some entries replaced by 0's, corresponding to the deleted elements. Before the partial deletion, the model is

  y_n = X_n β + ε_n.

After the partial deletion of covariates, the model becomes

  y_n = X_n⁻ β⁻ + ε_n.

Note that here β and β⁻ have the same dimension, as no covariate is completely deleted. β is the effects of the original covariates; β⁻ is the effects of the covariates after some possible partial deletion, i.e., the effects of the effective covariate values. As an over-simplified example, suppose we have five individuals, with responses y_n = (y_1, y_2, y_3, y_4, y_5)' and covariate vectors x_1 = (1.3, 0.2, −1.5)', x_2 = (−0.1, 0.9, 1.3)', x_3 = (1.1, 1.4, −0.3)', x_4 = (0.8, 1.2, −1.7)', x_5 = (1.0, 2.1, −1.1)', and X_n = (x_1, x_2, x_3, x_4, x_5)'. Then β is the effect vector in the regression of y_n on X_n. If we remove the seemingly insignificant covariate components, for example setting x_1⁻ = (1.3, 0, −1.5)', x_2⁻ = (0, 0.9, 1.3)', x_3⁻ = (1.1, 1.4, 0)', x_4⁻ = (0.8, 1.2, −1.7)', x_5⁻ = (1.0, 2.1, −1.1)' and X_n⁻ = (x_1⁻, x_2⁻, x_3⁻, x_4⁻, x_5⁻)', then β⁻ is the effect vector in the regression of y_n on X_n⁻. Thus, though β and β⁻ have the same structure, they have different interpretations. The problem can be formulated as testing the hypothesis

  H_0: β = β⁻ vs H_1: β ≠ β⁻.

If H_0 is accepted, the partial deletion is valid.

Note that, differently from the standard null hypothesis that some components of the parameter are zero, the above null hypothesis is not a nested hypothesis: β⁻ is not a sub-vector of β, so the existing Wilks theorem for the likelihood ratio statistic does not directly apply here. Denote by ℓ_n⁻(β) the log-likelihood based on the data (y_n, X_n⁻), and the corresponding MLE by β̂⁻. Since after the partial deletion β̂⁻ is the MLE under a constrained log-likelihood, while β̂ is the MLE under the full likelihood, we have ℓ_n⁻(β̂⁻) ≤ ℓ_n(β̂). Parallel to the log-likelihood ratio for whole-variable deletion, let, for our case,

  Λ_n = 2[ℓ_n(β̂) − ℓ_n⁻(β̂⁻)].

Let (j_1,...,j_k) be the columns with partial deletions, C_{j_r} = {i : x_{i,j_r} is deleted, 1 ≤ i ≤ n} be the index set of the deleted covariate values in the j_r-th column (r = 1,...,k), |C_{j_r}| be the cardinality of C_{j_r}, and γ_r = |C_{j_r}|/n (r = 1,...,k). For different j_r and j_s, C_{j_r} and C_{j_s} may or may not have common components. We first give Theorem 1, for the simple case in which the index sets C_{j_r} are mutually exclusive; then in Corollary 1 we give the result for the more general case in which the index sets need not be mutually exclusive.

For a given X_n there are many different ways of partial column deletion, and we may use Theorem 1 to test each of them. Given a significance level α, a deletion is valid at level α if Λ_n < χ²(1−α), where χ²(1−α) is the (1−α)-th upper quantile of the distribution of Σ_{j=1}^k γ_j χ_j², which can be computed by simulation for given (γ_1,...,γ_k).

The following theorem is a generalization of the Wilks theorem [1]: deleting some whole columns of X_n corresponds to γ_j = 1 (j = 1,...,k) in the theorem, and then we recover the existing Wilks theorem.

Theorem 1: Under H_0, suppose C_{j_r} ∩ C_{j_s} = ∅, the empty set, for all 1 ≤ r ≠ s ≤ k. Then

  Λ_n →_D Σ_{j=1}^k γ_j χ_j²,

where χ_1²,...,χ_k² are iid chi-squared random variables with 1 degree of freedom.

Note that in the Wilks problem the null hypothesis is that the coefficients corresponding to some variables are zero, and that null hypothesis is nested within the alternative; the null hypothesis in our problem concerns the coefficients corresponding to some partial variables, and it is not nested within the alternative. So the results of the two methods are not really comparable.
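As a minimal sketch of the testing procedure just described (again assuming Gaussian errors, with the same `loglik` as in the earlier sketch; all names are ours), partial deletion zeroes the flagged entries of X_n, Λ_n compares the two fits, and the quantile of the chi-squared mixture in Theorem 1 is obtained by Monte Carlo:

```python
import numpy as np
from scipy import stats

def loglik(y, X):
    # same Gaussian log-likelihood as in the earlier sketch
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum(stats.norm.logpdf(y - X @ b)))

def partial_delete(X, delete_sets):
    """X^- : X with entry (i, j) set to 0 for each i in delete_sets[j]."""
    X_minus = X.copy()
    for j, rows in delete_sets.items():
        X_minus[sorted(rows), j] = 0.0
    return X_minus

def lambda_n(y, X, delete_sets):
    """Lambda_n = 2 [ l_n(beta_hat) - l_n^-(beta_hat^-) ]."""
    return 2.0 * (loglik(y, X) - loglik(y, partial_delete(X, delete_sets)))

def mixture_quantile(gammas, alpha=0.05, n_mc=200_000, seed=0):
    """(1 - alpha)-th upper quantile of sum_j gamma_j chi2_1, the limit in
    Theorem 1 for mutually exclusive index sets, computed by simulation."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_mc, len(gammas))) @ np.asarray(gammas)
    return float(np.quantile(draws, 1 - alpha))
```

A candidate deletion is then declared valid at level α when lambda_n(y, X, delete_sets) < mixture_quantile(gammas, alpha).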


The case where the C_{j_r}'s are not mutually exclusive is a bit more complicated. We first re-write the sets C_{j_r} as

  ∪_{r=1}^k C_{j_r} = ∪_{r=1}^k ∪_{j_1,...,j_r} D_{j_1,...,j_r},

where the D_{j_1,...,j_r}'s are mutually exclusive: the D_{j_1}'s are index sets for one column of X_n⁻ only, the D_{j_1,j_2}'s are index sets common to columns j_1 and j_2 only, the D_{j_1,j_2,j_3}'s are index sets common to columns j_1, j_2 and j_3 only, and so on. Generally some of the D_{j_1,...,j_r}'s are empty sets. Let |D_{j_1,...,j_r}| be the cardinality of D_{j_1,...,j_r} and γ_{j_1,...,j_r} = |D_{j_1,...,j_r}|/n (r = 1,...,k). By examining the proof of Theorem 1, we get the following corollary, which gives the result in this more general case.

Corollary 1: Under H_0, we have

  Λ_n = 2[ℓ_n(β̂) − ℓ_n⁻(β̂⁻)] →_D Σ_{r=1}^k Σ_{j_1,...,j_r} γ_{j_1,...,j_r} χ²_{j_1,...,j_r},

where the χ²_{j_1,...,j_r}'s are all independent chi-squared random variables with r degrees of freedom (r = 1,...,k).

Below we give two examples to illustrate the usage of Theorem 1 and Corollary 1.

Example 1: n = 1000, d = 5, k = 3. Columns (1,2,4) have partial deletions with C_1 = {201, 202,...,300}, C_2 = {351, 352,...,550}, and C_3 = {601, 602,...,850}; the C_j's have no overlap, and γ_1 = 1/10, γ_2 = 1/5, γ_3 = 1/4. So by Theorem 1, under H_0 we have

  2[ℓ_n(β̂) − ℓ_n⁻(β̂⁻)] →_D (1/10)χ_1² + (1/5)χ_2² + (1/4)χ_3²,

where all the chi-squared random variables are independent, each with 1 degree of freedom.

Example 2: n = 1000, d = 5, k = 3. Columns (1,2,4) have partial deletions with C_1 = {101,...,300} ∪ {651,...,750}, C_2 = {201,...,350}, and C_3 = {251,...,300} ∪ {701,...,800}. In this case the C_j's have overlaps, so Theorem 1 cannot be used directly and we use Corollary 1. Here D_1 = {101,...,200} ∪ {651,...,700}, D_2 = {301,...,350}, D_3 = {751,...,800}, D_{1,2} = {201,...,250}, D_{1,3} = {701,...,750}, D_{2,3} = ∅, D_{1,2,3} = {251,...,300}; hence γ_1 = 3/20, γ_2 = 1/20, γ_3 = 1/20, γ_{1,2} = 1/20, γ_{1,3} = 1/20, γ_{2,3} = 0, γ_{1,2,3} = 1/20. So by Corollary 1, under H_0 we have

  2[ℓ_n(β̂) − ℓ_n⁻(β̂⁻)] →_D (3/20)χ_1² + (1/20)χ_2² + (1/20)χ_3² + (1/20)χ²_{1,2} + (1/20)χ²_{1,3} + (1/20)χ²_{1,2,3},

where all the chi-squared random variables are independent, χ_1², χ_2² and χ_3² each with 1 degree of freedom, χ²_{1,2} and χ²_{1,3} each with 2 degrees of freedom, and χ²_{1,2,3} with 3 degrees of freedom.
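Because the D-decomposition is fiddly to carry out by hand, the following small sketch (function names ours) derives the mutually exclusive D-sets and their weights from arbitrary index sets C_j and simulates the quantile of the chi-squared mixture of Corollary 1; run on Example 2 it reproduces the weights listed above.

```python
import numpy as np
from collections import defaultdict
from itertools import chain

def d_partition(C, n):
    """Split the union of the index sets C = {column: set of row indices}
    into the mutually exclusive D-sets of Corollary 1; returns a dict
    {tuple of columns sharing the indices: gamma = |D| / n}."""
    groups = defaultdict(set)
    for i in set(chain.from_iterable(C.values())):
        cols = tuple(sorted(j for j, rows in C.items() if i in rows))
        groups[cols].add(i)
    return {cols: len(rows) / n for cols, rows in groups.items()}

def corollary1_quantile(weights, alpha=0.05, n_mc=200_000, seed=0):
    """(1 - alpha)-th upper quantile of sum_D gamma_D chi2_r, where r is
    the number of columns sharing the D-set, computed by simulation."""
    rng = np.random.default_rng(seed)
    total = np.zeros(n_mc)
    for cols, gamma in weights.items():
        total += gamma * rng.chisquare(df=len(cols), size=n_mc)
    return float(np.quantile(total, 1 - alpha))

# Example 2: columns 1, 2 and 4 with overlapping deletion index sets
C = {1: set(range(101, 301)) | set(range(651, 751)),
     2: set(range(201, 351)),
     4: set(range(251, 301)) | set(range(701, 801))}
weights = d_partition(C, n=1000)   # {(1,): 0.15, (1, 2): 0.05, ...}
print(corollary1_quantile(weights))
```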

Next, we discuss the consistency of the estimate β̂⁻ under the null hypothesis H_0. Let x⁻ = x_r⁻ with probability γ_r (r = 0,1,...,k), where x_r⁻ is an i.i.d. copy of the x_{i,r}⁻'s, whose components with index in C_{j_r} are deleted; in particular, C_{j_0} is the index set of those covariates without partial deletion.

Theorem 2: Under the conditions of Theorem 1,

(i) β̂⁻ → β_0 (a.s.);

(ii) √n (β̂⁻ − β_0) →_D N(0, Ω), where

  Ω = E_{β_0}[ℓ̇(β_0) ℓ̇'(β_0)] = E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] ∫ (ḟ²(ε)/f(ε)) dε.

To extend the results of Theorem 2 to the general case, we need some more notation. Let x_{j_1,...,j_k} be an i.i.d. copy of the data in the set D_{j_1,...,j_k}, and let x⁻ = x⁻_{j_1,...,j_r} with probability γ_{j_1,...,j_r} (r = 0,1,...,k), where x⁻_{j_1,...,j_r} is an i.i.d. copy of the x⁻_{i,j_1,...,j_r}'s, whose components with index in C_{j_1,...,j_r} are deleted.

Corollary 2: Under the conditions of Corollary 1, the results of Theorem 2 hold with x⁻ as given above.

Computationally, E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] is well approximated by

  E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] ≈ (1/n) Σ_{r=0}^k Σ_{j_1,...,j_r} Σ_{(i,j)∈D_{j_1,...,j_r}} (x⁻_{i,j} − μ̂⁻_{j_1,...,j_r})(x⁻_{i,j} − μ̂⁻_{j_1,...,j_r})',

where Σ_{(i,j)∈D_{j_1,...,j_r}} denotes summation over those x⁻_{i,j}'s with deletion index in D_{j_1,...,j_r}, and μ̂⁻_{j_1,...,j_r} = |D_{j_1,...,j_r}|⁻¹ Σ_{(i,j)∈D_{j_1,...,j_r}} x⁻_{i,j}.

Simulation Study and Application

Simulation study

We illustrate the proposed method with two examples, Examples 3 and 4 below; the former rejects the null hypothesis H_0, while the latter accepts it. In each case we simulate n = 1000 i.i.d. data points with response y_i and covariates x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5})' (i = 1,...,n). We first generate the covariates, sampling the x_i's from the 5-dimensional normal distribution with mean vector μ = (3.1, 1.8, −0.5, 0.7, 1.5)' and a given covariance matrix Γ. Then we generate the response data: given the covariates, the y_i's are generated as

  y_i = x_i'β_0 + ε_i  (i = 1,...,n),

with β_0 = (0.42, 0.11, 0.65, 0.83, 0.72)' and the ε_i's i.i.d. N(0,1).

A hypothesis test is conducted to examine whether the partial deletion is valid. The significance level is set as α = 0.05. The simulation is repeated 1000 times; "Prop" represents the proportion of replications with Λ_n > Q(1−α), where Q(1−α) is the (1−α)-th upper quantile of the distribution Σ_{j=1}^k γ_j χ_j² given in Theorem 1, computed via simulation.

Example 3: In this example, five data sets are generated according to the method just described, with five different values of β_0. We are interested in whether the covariate values with |x_ij| < 1/10 can be deleted. The proportions γ = (γ_1,...,γ_k) of x_ij's with |x_ij| < 1/10 are shown for each data set; the results are shown in Table 1, whose five rows correspond to the five data sets. For each data set, the parameter β is estimated, the test is conducted using the given γ, Λ_n is computed, Q(1−α) is given, and the corresponding p-value is provided.
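A sketch of the Example 3 pipeline, reusing lambda_n and mixture_quantile from the sketches above; the covariance matrix Γ above is left unspecified, so an arbitrary positive-definite stand-in is used here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
mu = np.array([3.1, 1.8, -0.5, 0.7, 1.5])
Gamma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))  # stand-in for the unspecified Gamma
beta0 = np.array([0.42, 0.11, 0.65, 0.83, 0.72])

X = rng.multivariate_normal(mu, Gamma, size=n)
y = X @ beta0 + rng.standard_normal(n)

# flag the entries with |x_ij| < 1/10 for deletion, column by column
delete_sets = {j: set(np.flatnonzero(np.abs(X[:, j]) < 0.1)) for j in range(d)}
gammas = [len(s) / n for s in delete_sets.values()]

lam = lambda_n(y, X, delete_sets)          # from the earlier sketch
q = mixture_quantile(gammas, alpha=0.05)   # from the earlier sketch
print(lam, q, "keep the values" if lam >= q else "delete the values")
```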


Table 1: Simulation results: γ, Λ_n, Q(1−α) and p-value for each β_0.

| No. | β_0 | γ | Λ_n | Q(1−α) | p-value |
| 1 | (0.42, 0.11, 0.65, 0.83, 0.72) | (0.008, 0.022, 0.043, 0.037, 0.030) | 14492.91 | 4.5767 | 0.006 |
| 2 | (0.12, 0.85, 0.44, 0.73, 0.62) | (0.004, 0.020, 0.041, 0.040, 0.020) | 13010.97 | 4.5748 | 0.016 |
| 3 | (0.59, 0.27, 0.73, 0.35, 0.66) | (0.008, 0.032, 0.031, 0.048, 0.025) | 13505.90 | 4.5786 | 0.000 |
| 4 | (0.21, 0.45, 0.78, 0.56, 0.63) | (0.007, 0.022, 0.039, 0.053, 0.033) | 12487.58 | 4.5281 | 0.005 |
| 5 | (0.77, 0.51, 0.48, 0.89, 0.32) | (0.010, 0.022, 0.042, 0.045, 0.026) | 15437.66 | 4.5317 | 0.000 |

Table 2: Simulation results: γ, Λ_n, Q(1−α) and p-value for each β_0.

| No. | β_0 | γ | Λ_n | Q(1−α) | p-value |
| 1 | (0.42, 0.11, 0.65, 0.83, 0.72) | (0.1, 0.1, 0.1) | 1.0146 | 4.6034 | 0.998 |
| 2 | (0.12, 0.85, 0.44, 0.73, 0.62) | (0.1, 0.1, 0.1) | 0.3576 | 4.6414 | 0.977 |
| 3 | (0.59, 0.27, 0.73, 0.35, 0.66) | (0.1, 0.1, 0.1) | 3.2480 | 4.6756 | 0.965 |
| 4 | (0.21, 0.45, 0.78, 0.56, 0.63) | (0.1, 0.1, 0.1) | 3.3003 | 4.6306 | 0.972 |
| 5 | (0.77, 0.51, 0.48, 0.89, 0.32) | (0.1, 0.1, 0.1) | 3.3531 | 4.6326 | 0.955 |

Note that for our problem a p-value smaller than α means a significant value of Λ_n, i.e., a significant difference between the regression coefficients of the original covariates and those of the covariates after partial deletion, which in turn implies that the null hypothesis should be rejected and the partial deletion should not be conducted (Table 1).

We see that the p-values for rejecting H_0 are all smaller than 0.05 for the five sets of β_0. This suggests that covariate values with |x_ij| < 1/10 should not be deleted at significance level α = 0.05.

Example 4: In this example the original X is as in Example 3, but now we replace the entries in the first 100 rows and first three columns by noise ε ~ N(0, 1/9). The deletion proportion γ = (0.1, 0.1, 0.1) is fixed, with the x_ij's whose absolute values fall below the lower 0.1 quantile being deleted. We are interested in whether these noise values can be deleted in this case, i.e., whether H_0 is accepted. The results are shown in Table 2.

We see that the p-values for rejecting H_0 are all greater than 0.95 for the five sets of β_0. This suggests that the data provide strong evidence that the deleted values are noise and are not necessary to the data set, at the 0.05 significance level.
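A sketch of the Example 4 construction, continuing the code above; on our reading of the design the response is kept from Example 3, so the injected entries are pure noise with respect to y:

```python
# replace the first 100 rows of the first three columns by noise ~ N(0, 1/9)
X4 = X.copy()
X4[:100, :3] = rng.normal(0.0, 1.0 / 3.0, size=(100, 3))

# delete a fixed proportion gamma_j = 0.1 per column: the 10% of values
# smallest in absolute value, which should capture the injected noise
delete_sets4 = {}
for j in range(3):
    cut = np.quantile(np.abs(X4[:, j]), 0.10)
    delete_sets4[j] = set(np.flatnonzero(np.abs(X4[:, j]) <= cut))

lam4 = lambda_n(y, X4, delete_sets4)
q4 = mixture_quantile([0.1, 0.1, 0.1], alpha=0.05)
print("deletion valid" if lam4 < q4 else "deletion rejected")
```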

Application to a real data problem

We analyze a data set from the Deprenyl and Tocopherol Antioxidative Therapy of Parkinsonism study, obtained from the National Institutes of Health (NIH) (for a detailed description and data link, see https://www.ncbi.nlm.nih.gov/pubmed/2515723). It is a multi-center, placebo-controlled trial that aimed to determine a treatment for early Parkinson's disease patients to prolong the time until levodopa therapy is required. The number of patients enrolled was 800. The selected subjects were untreated patients with Parkinson's disease (stage I or II) for less than five years who met other eligibility criteria. They were randomly assigned, according to a two-by-two factorial design, to one of four treatment groups: 1) placebo; 2) active tocopherol; 3) active deprenyl; 4) active deprenyl and tocopherol. The observation continued for 14 ± 6 months, with re-evaluation every 3 months. At each visit the Unified Parkinson's Disease Rating Scale (UPDRS), including its motor, mental and activities of daily living components, was evaluated. The statistical analysis was based on 800 subjects. The results revealed no beneficial effect of tocopherol, while deprenyl was found to significantly prolong the time until levodopa therapy was required, reducing the risk of disability by 50 percent according to the UPDRS measurement.

Our goal is to examine whether some of the covariates can be partially deleted. If traditional variable selection methods are used, such as stepwise regression or the Lasso, some covariates will end up removed wholly from the analysis. This is not very reasonable, since some of the removed covariates may be partially effective, and removing all of their values may yield misleading results, or at least cost information. We use the proposed method to examine three response variables, PDRS, TREMOR and PIGD, with three covariates, Age, Motor and ADL, for each response. The deleted covariate values are the ones below the γ-th quantile of the data, with γ = 0.01, 0.02, 0.03 and 0.05. We examine each response and covariate one by one; the results are shown in Table 3, Table 4 and Table 5 below, with a code sketch of the screening loop given after this paragraph.
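The following sketch shows the screening loop used to produce Tables 3-5, reusing lambda_n and mixture_quantile from the earlier sketches; the data frame `datatop` and its column names are placeholders standing in for the real NIH data set, not identifiers from it.

```python
import numpy as np
import pandas as pd

def screen_partial_deletions(df, response, covariates, proportions, alpha=0.05):
    """For each covariate and deletion proportion gamma, test whether the
    values below the gamma-th quantile can be zeroed out."""
    X = df[covariates].to_numpy(float)
    y = df[response].to_numpy(float)
    rows = []
    for j, cov in enumerate(covariates):
        for gamma in proportions:
            cut = np.quantile(X[:, j], gamma)
            deletion = {j: set(np.flatnonzero(X[:, j] <= cut))}
            lam = lambda_n(y, X, deletion)
            q = mixture_quantile([gamma], alpha)
            rows.append((response, cov, gamma, lam, q, lam < q))
    return pd.DataFrame(rows, columns=["response", "covariate", "gamma",
                                       "Lambda_n", "Q(1-alpha)", "delete_ok"])

# hypothetical call; `datatop` stands in for the real data set
# screen_partial_deletions(datatop, "TREMOR", ["Age", "Motor", "ADL"],
#                          [0.01, 0.02, 0.03, 0.05])
```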


Table 3: Response TREMOR: Λ_n values and estimated regression coefficients (Age 0.0240456, Motor 0.1801616, ADL 0.00451205).

| Delete proportion | Age: Λ_n | Age: Q(1−α) | Motor: Λ_n | Motor: Q(1−α) | ADL: Λ_n | ADL: Q(1−α) |
| 0.01 | 11.5171 | 0.0593 | 0.35929 | 0.8787 | 0.00425 | 0.1897 |
| 0.03 | 20.0485 | 0.1245 | 6.2598 | 0.6924 | 0.00425 | 0.1861 |
| 0.05 | 14.0114 | 0.2937 | 8.7075 | 0.9034 | 0.00425 | 0.1496 |
| 0.1 | | | | | 0.0238 | 0.3841 |

Table 4: Response PIGD: Λ_n values and estimated regression coefficients (Age −0.0049032, Motor 0.02467423, ADL 0.2084862).

| Delete proportion | Age: Λ_n | Age: Q(1−α) | Motor: Λ_n | Motor: Q(1−α) | ADL: Λ_n | ADL: Q(1−α) |
| 0.02 | 0.4956 | 0.0849 | 0.0031 | 0.0972 | 3.5607 | 0.2210 |
| 0.03 | 1.3908 | 0.1306 | 0.0166 | 0.1513 | 3.5607 | 0.2256 |
| 0.05 | 0.4816 | 0.2596 | 1.2607 | 0.2188 | 3.6607 | 0.1925 |

Table 5: Response PDRS: Λ_n values and estimated regression coefficients (Age 1.563389, Motor 0.0476914, ADL −1.139864).

| Delete proportion | Age: Λ_n | Age: Q(1−α) | Motor: Λ_n | Motor: Q(1−α) | ADL: Λ_n | ADL: Q(1−α) |
| 0.01 | 1.0073 | 0.0497 | 80.7411 | 0.0944 | 6.3392 | 0.0453 |
| 0.02 | 1.0475 | 0.0669 | 142.3528 | 0.0841 | 57.5051 | 0.2216 |
| 0.03 | 0.8609 | 0.1486 | 321.5332 | 0.1906 | 57.5051 | 0.2111 |
| 0.05 | 0.8650 | 0.2379 | 397.6481 | 0.2227 | 57.5051 | 0.2199 |

In Table 3, the response TREMOR is examined. For the covariate Age, the likelihood ratio Λ_n is larger than the cut-off point Q(1−α) at all the deletion proportions, suggesting that for Age no partial deletion with these proportions should be made. For the covariate Motor, Λ_n is smaller than the cut-off point Q(1−α) at the 0.01 proportion, so this covariate can be partially deleted at this proportion. In other words, the values of Motor below its 1% quantile have no impact on the analysis, can be treated as noise, and should be removed from the analysis. For the covariate ADL, with deletion proportions 0.01-0.1 the likelihood ratio Λ_n is smaller than Q(1−α), which suggests that the lower 1%-10% of the values of this covariate have no impact on the analysis and should be deleted. After removing the corresponding proportions of Motor and ADL, the model is re-fitted to obtain the parameter estimates shown there. These estimates are more meaningful than the ones based on the whole covariate data, since now the noise values of the covariates are removed and only the effective covariate values enter the analysis. If instead traditional variable selection methods were used, such as stepwise regression or the Lasso, the whole covariate Motor, ADL, or both might end up removed, leading to loss of information or even misleading results.

In Table 4, the response PIGD is investigated. For the covariate Age, Λ_n is larger than the cut-off point Q(1−α) at the 0.02, 0.03 and 0.05 proportions, suggesting that partial deletions with these proportions are not appropriate. For the covariate Motor, Λ_n is smaller than the cut-off point Q(1−α) at the deletion proportions 0.02 and 0.03, suggesting that the lower 2%-3% of its values should be deleted from the analysis. For the covariate ADL, Λ_n is larger than the cut-off point Q(1−α) at the deletion proportions 0.02, 0.03 and 0.05, hence partial deletion at these proportions is not valid. After deleting 3% of the smallest values of Motor, the model is re-fitted to obtain the parameter estimates shown in the table. The new estimates are more meaningful since the non-effective values of the covariate Motor are removed from the analysis.

In Table 5, the response is PDRS. The likelihood ratios Λ_n of Age, Motor and ADL are all larger than Q(1−α) at the deletion proportions 0.01, 0.02, 0.03 and 0.05. Thus the null hypotheses are rejected at all these proportions, no deletion is valid at these proportions, and the analysis should be based on the original full data, with the parameter estimates shown in the tables (Table 3, Table 4 and Table 5).

Note that the coefficient for Age is insignificant, and hence the corresponding Λ_n values at the deletion proportions are not meaningful.

Concluding Remarks

We proposed a method for partial variable deletion, in which only some proportion(s) of the values of some covariate(s) are deleted. This is in contrast to the existing methods, which either select or delete entire variables; thus the method is new and is a generalization of existing variable selection. The question is motivated by practical problems. The method can be used to find the effective ranges of the covariates, or to remove possible noise in the covariates, so that the corresponding estimated effects are more interpretable. The proposed test statistic is a generalization of the Wilks likelihood ratio statistic; the asymptotic distribution of the proposed statistic is generally a chi-squared mixture distribution, and the corresponding cut-off point can be computed by simulation.

Simulation studies are conducted to evaluate the performance of the method, and it is applied to the analysis of a real Parkinson's disease data set as illustration. A drawback of the current version of the method is that it requires specifying the proportions of possible deletions for the variables, which makes the optimal proportions not easy to find. In our next step of research we will try to implement an algorithm which finds the optimal proportions automatically and is easier to use. As suggested by a reviewer, simulation studies should be performed for a statistical significance test between the proposed method and existing variable selection methods, to address the contribution of the proposed method; this is potential future research work. (The proofs of the theorems are given in the Appendix.)

Acknowledgment

This research was supported by the Intramural Research Program of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

1. Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9: 60-62.

2. Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.

3. Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics 6: 461-464.

4. Kolmogorov A (1963) On tables of random numbers. Sankhya Ser A 25: 369-375.

5. Hansen M, Yu B (2001) Model selection and the principle of minimum description length. Journal of the American Statistical Association 96: 746-774.

6. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58: 267-288.

7. May RJ, Maier HR, Dandy GC, Fernando TG (2008) Non-linear variable selection for artificial neural networks using partial mutual information. Environmental Modelling and Software 23: 1312-1326.

8. Mehmood T, Liland KH, Snipen L, Saebo S (2012) A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems 118: 62-69.

9. Wang L, Liu X, Liang H, Carroll RJ (2011) Estimation and variable selection for generalized additive partial linear models. Annals of Statistics 39: 1827-1851.

10. Liu X, Wang L, Liang H (2011) Estimation and variable selection for semiparametric additive partial linear models. Statistica Sinica 21: 1225-1248.

11. Tibshirani R (1997) The lasso method for variable selection in the Cox model. Statistics in Medicine 16: 385-395.

12. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96: 1348-1360.

13. Fan J, Li R (2002) Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics 30: 74-99.

14. Wang HX, Leineweber C, Kirkeeide R, Svane B, Schenck-Gustafsson K, et al. (2007) Psychosocial stress and atherosclerosis: family and work stress accelerate progression of coronary disease in women. The Stockholm Female Coronary Angiography Study. J Intern Med 261: 245-254.

15. Shara NM, Wang H, Valaitis E, Pehlivanova M, Carter EA, et al. (2011) Comparison of estimated glomerular filtration rates and albuminuria in predicting risk of coronary heart disease in a population with high prevalence of diabetes mellitus and renal disease. Am J Cardiol 107: 399-405.

16. Wallace CS, Boulton DM (1968) An information measure for classification. Computer Journal 11: 185-194.

17. Rissanen J (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory 42: 40-47.

18. Stat 701 (2002) Proof of Wilks' Theorem on LRT.

19. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA (1993) Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.


Appendix

Proof of Theorem 1: Let β_0 be the true parameter value generating the observed data, ℓ̇_n(β) = ∂ℓ_n(β)/∂β and ℓ̈_n(β) = ∂²ℓ_n(β)/(∂β∂β'). Since β̂ is the MLE, ℓ̇_n(β̂) = 0, so we have

  −ℓ̇_n(β_0) = ℓ̇_n(β̂) − ℓ̇_n(β_0) = ℓ̈_n(β̃_n)(β̂ − β_0),

where β̃_n lies between β̂ and β_0. Since β̂ → β_0 (a.s.), we have β̃_n → β_0 (a.s.). We get

  √n (β̂ − β_0) = (−n⁻¹ℓ̈_n(β̃_n))⁻¹ n^{−1/2} ℓ̇_n(β_0).

Note that −n⁻¹ℓ̈_n(β̃_n) →_P I(β_0), and that ℓ̇_n(β_0) = Σ_{i=1}^n v_i with v_i = ∂ log f(y_i − x_i'β)/∂β evaluated at β = β_0. The v_i's are iid with E(v_i) = 0 and Var(v_i) = E(v_i v_i') = I(β_0); consequently,

  √n (β̂ − β_0) →_D N(0, I⁻¹(β_0)).  (A.1)

To simplify notation, assume the columns with partial deletions are (1,...,k). Denote by (n_1,...,n_k) the numbers of x_ij's deleted from columns (1,...,k) of X_n, so that (n_1,...,n_k)/n = (γ_1,...,γ_k); denote n_0 = n − (n_1 + ... + n_k) and γ_0 = n_0/n. Let ℓ_n⁻(β) be the log-likelihood with proportions (γ_1,...,γ_k) deleted from columns (1,...,k) of X_n, and denote by ℓ̇_n⁻(β) and ℓ̈_n⁻(β) its partial derivatives accordingly. Write

  ℓ_n⁻(β) = ℓ_{0,n_0}(β) + Σ_{j=1}^k ℓ_{−j,n_j}(β),

where ℓ_{0,n_0}(β) is the part of the log-likelihood for the n_0 data points without covariate deletion, and ℓ_{−j,n_j}(β) is that for the n_j data points with the j-th covariate deleted. Denote ℓ̈_n⁻(β) = ℓ̈_{0,n_0}(β) + Σ_{j=1}^k ℓ̈_{−j,n_j}(β) accordingly.

For the log-likelihood without data deletion we make the similar decomposition

  ℓ_n(β) = ℓ_{0,n_0}(β) + Σ_{j=1}^k ℓ_{j,n_j}(β),

where ℓ_{j,n_j}(β) is the part of the log-likelihood using data from the same individuals as those in ℓ_{−j,n_j}(β), but without covariate deletion, and denote ℓ̈_n(β) = ℓ̈_{0,n_0}(β) + Σ_{j=1}^k ℓ̈_{j,n_j}(β) accordingly.

Note that the same term ℓ_{0,n_0}(β) appears in both decompositions of ℓ_n(β) and ℓ_n⁻(β), the same term ℓ̈_{0,n_0}(β) appears in both decompositions of ℓ̈_n(β) and ℓ̈_n⁻(β), and that for j = 1,...,k, ℓ_{j,n_j}(β) differs from ℓ_{−j,n_j}(β) in that there is no deletion in ℓ_{j,n_j}(β), although both partial log-likelihoods use data from the same set.

We have

  ℓ_n(β̂) = ℓ_n(β_0) − ℓ̇_n(β̂)(β_0 − β̂) − (1/2)(β̂ − β_0)'ℓ̈_n(β̃_n)(β̂ − β_0)

  = ℓ_n(β_0) − (1/2)[ (β̂ − β_0)'ℓ̈_{0,n_0}(β̃_n)(β̂ − β_0) + Σ_{j=1}^k (β̂ − β_0)'ℓ̈_{j,n_j}(β̃_n)(β̂ − β_0) ].


P −1 Also, let Z = (Z1,...,Zd) with Z j ’s iid N (0,1) , and note −n  ( β )→ I ( β ) for ( jk= 0,..., ) , so we have j j,n j n 0 ' − βˆ − β  β βˆ − β = −γ n βˆ − β n−1 β n βˆ − β ( 0 ) j,n j ( n )( 0 ) j ( 0 ) j j,n j ( n ) ( 0 ) D

→γ j ZZ' .

So we get

1 k ˆ = + γ +  n ( β)  n ( β0 ) ∑ j Z 'Z op (1). 2 j=0

ˆ − ˆ − ˆ Similarly to (A.1) we have, with βn lies between β and β,

−1 ˆ − − = − −1 − ˆ − −1 −  n ( β β0 ) n ( n  n ( βn )) n  n ( β0 ). Consequently,

D n βˆ − − β → N 0, I −1 ( β ) ( 0 ) ( − 0 ) (A.2)

− Where − − , and is with the -th covariate being removed with probability γ = . I− ( β0 ) = EH (vi vi ') vi ' vi j j ( jk1,..., ) 0 Also

1  ' k '  − βˆ − = − β − βˆ − − β  β − βˆ − − β + βˆ − − β  β − βˆ − − β n ( ) n ( 0 ) ( 0 ) 0,n0 ( n )( 0 ) ∑( 0 ) − j,n j ( n )( 0 ) 2  j=1  and we have

' ' − βˆ − − β  β − βˆ − − β = −γ n βˆ − − β n−1 β − n βˆ − − β ( 0 ) 0,n0 ( n )( 0 ) 0 ( 0 ) 0 0,n0 ( n ) ( 0 ) D

→γ 0ZZ'

Note that  − is a × matrix with the -th row and -th column be zeros, so  − j,n ( βn ) dd j j j

' ' − − −  − βˆ − β  β βˆ − β = βˆ − β  β βˆ − β ( 0 ) − j,n j ( n )( 0 ) ( − j 0,− j ) − j,n j ( n )( − j 0,− j )

ˆ ˆ − where β− is the (d −1) -dimensional vector with the j -th element removed from β , β0,− j is the (d −1) -dimensional j vector with the -th element removed from , and  − is the −× − matrix with the -th column and j β0  − j,n ( βn ) (dd11) ( ) j j j -th row removed from  β . − j,n j ( n ) − Since under H0 , ln (β0 ) = ln (β0 ), now we have, under H0 ,

k ' T  ˆ − ˆ −  ˆ −1 ˆ ˆ −1 − ˆ 2  n β −  n β = γ j n β − β0 −n j  j,n ( βn ) n β − β0 − n β− j − β0,− j −n0  − j,n ( βn ) n β − β0  ( ) ( ) ∑ ( ( ) ( j ) ( ) ( ) ( j ) ( )) j=0

k ' T ˆ −1 ˆ ˆ −1 − ˆ = γ j n β − β0 −n j  j,n ( βn ) n β − β0 − n β− j − β0,− j −n0  − j,n ( βn ) n β− j − β0,− j + op (1), ∑ ( ( ) ( j ) ( ) ( ) ( j ) ( )) j=1


since the j = 0 terms cancel asymptotically. Note that the first term in the above bracket is asymptotically a χ² random variable with d degrees of freedom, while the second term is asymptotically a χ² random variable with (d−1) degrees of freedom. As in the proof of the Wilks theorem (or some more recent proofs, such as in Stat 701, 2002 [18]), for each j we have

  [√n(β̂ − β_0)]' [−n_j⁻¹ℓ̈_{j,n_j}(β̃_n)] [√n(β̂ − β_0)] − [√n(β̂⁻_{−j} − β_{0,−j})]' [−n_j⁻¹ℓ̈^{(−j)}_{−j,n_j}(β̃_n⁻)] [√n(β̂⁻_{−j} − β_{0,−j})] →_D χ_j²,

where χ_j² is a chi-squared random variable with 1 degree of freedom, and the χ_j²'s are independent for different j. Hence we get

  2[ℓ_n(β̂) − ℓ_n⁻(β̂⁻)] →_D Σ_{j=1}^k γ_j χ_j².

Proof of Theorem 2: (i) follows from the standard argument for the consistency of the MLE. (ii) After deleting the irrelevant covariate values, the model is

  y_i = x_i⁻'β + ε_i,  ε_i ~ f(·),

where the x_i⁻ are i.i.d. copies of x⁻, and x⁻ = x_r⁻ with probability γ_r (r = 0,1,...,k), where x_r⁻ is an i.i.d. copy of the x_{i,r}⁻'s, whose components with index in C_{j_r} are deleted; in particular, C_{j_0} is the index set of those covariates without partial deletion. The log-likelihood is

  ℓ⁻(β) = Σ_{i=1}^n log f(y_i − x_i⁻'β).

By the standard result on regression parameter estimation (e.g., Proposition 4.3.1 and Example 4.3.1 in Bickel, et al. [19]), the efficient score for β based on ℓ⁻(β) is

  ℓ̇(β) = ∂ log f(y − x⁻'β)/∂β = (x⁻ − μ⁻) ḟ(y − x⁻'β)/f(y − x⁻'β),

where μ⁻ = E(x⁻). Under the common assumption that ε = y − x⁻'β_0 is independent of x⁻ (or just conditioning on x⁻), it follows that

  √n (β̂⁻ − β_0) →_D N(0, Ω),

where

  Ω = E_{β_0}[ℓ̇(β_0) ℓ̇'(β_0)] = E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] ∫ (ḟ²(ε)/f(ε)) dε.
