Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Jue Hou, Zijian Guo, and Tianxi Cai

May 3, 2021

Abstract: Risk modeling with EHR data is challenging due to the lack of direct observations on the disease outcome and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate assisted semi-supervised learning (SAS) approach to risk modeling with high dimensional predictors, leveraging a large unlabeled dataset containing candidate predictors and surrogates of the outcome, as well as a small labeled dataset with annotated outcomes. The SAS procedure borrows information from the surrogates along with the candidate predictors to impute the unobserved outcomes via a sparse working imputation model with moment conditions to achieve robustness against mis-specification of the imputation model, and uses a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach over existing supervised methods. We apply the method to derive genetic risk prediction of type 2 diabetes mellitus using an EHR biobank cohort.

Keywords: generalized linear models, high dimensional inference, model mis-specification, risk prediction, semi-supervised learning.

1 Introduction

Precise risk prediction is vitally important for successful clinical care. High risk patients can be assigned to more intensive monitoring or intervention to improve outcomes. Traditionally, risk prediction models are developed based on cohort studies or registry data. Population-based disease registries, while remaining a critical source for epidemiological studies, collect information on a relatively small set of pre-specified variables and hence may limit researchers' ability to develop comprehensive risk prediction models (Warren and Yabroff, 2015). Most clinical care is delivered in healthcare systems (Thompson et al., 2015), and electronic health records (EHR) embedded in healthcare systems accrue rich clinical data in broad patient populations. EHR systems centralize the data collected during routine patient care, including structured elements such as codes for the International Classification of Diseases (ICD), medication prescriptions, and medical procedures, as well as free-text narrative documents such as physician notes and pathology reports that can be processed through natural language processing (NLP) for analysis. EHR data are also often linked with biobanks, which provide additional rich molecular information to assist in developing comprehensive risk prediction models for a broad patient population. Risk modeling with EHR data, however, is challenging for several reasons. First, precise information on the clinical outcome of interest, Y, is often embedded in free-text notes and requires manual effort to extract accurately. Readily available surrogates of Y, denoted S, such as diagnostic codes or NLP mentions of the outcome, are at best good approximations to the true outcome Y. For example, using EHR data from Mass General Brigham (MGB), we found that the PPV was only 0.48 for having at least 1 ICD code of T2DM and 0.19 for having at least 1 NLP mention of T2DM. Directly using these EHR proxies as the true disease status to derive risk models may lead to substantial biases. On the other hand, extracting precise disease status requires manual chart review, which is not feasible at a large scale. It is thus of great interest to develop risk prediction models under a semi-supervised learning (SSL) framework using both a large unlabeled dataset of size N containing information on predictors X along with surrogates S and a small labeled dataset of size n with additional observations on Y curated via chart review.

Additional challenges arise from the high dimensionality of the predictor vector X and potential model mis-specification. Although much progress has been made in high dimensional regression in recent years, there is a paucity of literature on high dimensional inference under the SSL setting. Precise estimation of the high dimensional risk model is even more challenging if the risk model is not sparse. Allowing the risk model to be dense is particularly important when X includes genomic markers, since a large number of genetic markers appear to contribute to the risk of complex traits (Frazer et al., 2009). For example, Vujkovic et al. (2020) recently identified 558 genetic variants as significantly associated with T2DM risk. An additional challenge arises when the fitted risk model is mis-specified, which occurs frequently in practice, especially in the high dimensional setting. Model mis-specification can also lead the fitted model of Y | X to be dense. There are limited methods currently available to make inference about high dimensional risk prediction models in the SSL setting, especially under a possibly mis-specified dense model. In this paper, we fill this gap by proposing an efficient surrogate assisted SSL (SAS) prediction procedure that leverages the fully observed surrogates S to make inference about a high dimensional risk model under such settings.

Under the supervised setting where both Y and X are fully observed, much progress has been made in recent years in the area of high dimensional inference. High dimensional regression methods have been developed for commonly used generalized linear models (GLM) under sparsity assumptions on the regression parameters (van de Geer and Bühlmann, 2009; Negahban et al., 2010; Huang and Zhang, 2012). Recently, Zhu and Bradic (2018b) studied the inference of a linear combination of coefficients under a dense linear model with a sparse precision matrix. Inference procedures have also been developed for both sparse (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014) and dense combinations of the regression parameters (Cai et al., 2019; Zhu and Bradic, 2018a). High-dimensional inference under the logistic regression model has also been studied recently (van de Geer et al., 2014; Ma et al., 2020; Guo et al., 2020).

Under the SSL setting with n ≪ N, however, there is a paucity of literature on high dimensional inference. Although the SSL can be viewed as a missing data problem, it

differs from the standard missing data setting in a critical way. Under the SSL setting, the missing probability tends to 1, which would violate a key assumption required in the missing data literature (e.g., Bang and Robins, 2005; Smucler et al., 2019; Chakrabortty et al., 2019). Existing work on SSL with high-dimensional covariates largely focuses on post-estimation inference on global parameters under sparse linear models, with examples including SSL estimation of the population mean (Zhang et al., 2019; Zhang and Bradic, 2019), the explained variance (Cai and Guo, 2018), and the average treatment effect (Cheng et al., 2018; Kallus and Mao, 2020). To the best of our knowledge, our SAS procedure is the first to conduct semi-supervised inference on the high-dimensional coefficients and the individual prediction in a high-dimensional, dense, and possibly mis-specified risk prediction model.

Our proposed estimation and inference procedures are as follows. For estimation, we first use the labelled data to fit a regularized imputation model with surrogates and high-dimensional covariates; we then impute the missing outcomes for the unlabeled data and fit the risk model using the imputed outcomes and high-dimensional predictors. For inference, we devise a novel bias correction method, which corrects the bias due to the regularization in both the imputation and the estimation. For our proposed methods, we allow the fitted risk model for Y | X to be both mis-specified and potentially dense, but only require sparsity of the fitted imputation model of Y | S, X. The sparsity assumption on the imputation model is less stringent since we anticipate that most information on Y can be well captured by the low dimensional S, while the fitted model of Y | X might be dense, especially under possible model mis-specification.

The remainder of the paper is organized as follows. We introduce our population parameters and model assumptions in Section 2. In Section 3, we propose the SAS estimation method along with its associated inference procedures. In Section 4, we state the theoretical guarantees of the SAS procedures, whose proofs are provided in the Supplementary Materials. We also remark on the sparsity relaxation and the efficiency gain of the SSL. In Section 5, we present simulation results highlighting the finite sample performance of the SAS estimators and comparisons to existing methods. In Section 6, we apply the proposed method to derive individual risk prediction for T2DM using EHR data from MGB.

2 Settings and Notations

For the i-th observation, Yi ∈ R denotes the outcome variable, Si ∈ R^q denotes the surrogates for Yi, and Xi ∈ R^(p+1) denotes the high-dimensional covariates with the first element being the intercept. Under the SSL setting, we observe n independent and identically distributed (i.i.d.) labeled observations, L = {(Yi, Xiᵀ, Siᵀ)ᵀ, i = 1, ..., n}, and (N − n) i.i.d. unlabeled observations, U = {Wi = (Xiᵀ, Siᵀ)ᵀ, i = n + 1, ..., N}. We assume that the labeled subjects are randomly sampled by design and that the proportion of labelled samples is n/N = ρ ∈ (0, 1) with ρ → 0 as n → ∞. We focus on the high-dimensional setting where the dimensions p and q grow with n, and we allow p + q to be larger than n. Motivated by our application, our main focus is on the setting where N is much larger than p, but our approach can be extended to N ≤ p under specific conditions.

To predict Yi with Xi, we consider a possibly mis-specified working conditional mean model with a known monotone and smooth link function g,

E(Yi | Xi) = g(βᵀXi).  (1)

Our procedure generally allows for a wide range of link functions; detailed requirements on g(·) and its anti-derivative G are given in Section 4. In our motivating example, Y is a binary indicator of T2DM status and g(x) = 1/(1 + e^(−x)) with G(x) = log(1 + e^x). Our goal is to accurately estimate the high-dimensional parameter β0, defined as the solution of the estimating equation

E[Xi{Yi − g(β0ᵀXi)}] = 0.  (2)

We shall further construct confidence intervals for g(β0ᵀxnew) for any xnew ∈ R^(p+1). The predicted outcome g(β0ᵀxnew) is the conditional mean of Y given Xi = xnew when (1) holds, and can be interpreted as the best generalized linear prediction with link g even when (1) fails to hold. To enable estimation of β0 under possible model mis-specification, we define a pseudo log-likelihood (PL) function

ℓ(y, x) = yx − G(x)  (3)

such that ∂ℓ(y, βᵀXi)/∂β = Xi{y − g(βᵀXi)} corresponds to the moment condition (2).
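As a concrete illustration of the pseudo log-likelihood (not part of the original text), the following Python snippet evaluates the logistic link g, its anti-derivative G, and ℓ(y, x) = yx − G(x), and numerically checks that the gradient of ℓ(y, βᵀXi) with respect to β equals Xi{y − g(βᵀXi)}, the quantity appearing in the moment condition (2).

```python
import numpy as np

def g(x):
    """Logistic (anti-logit) link: g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def G(x):
    """Anti-derivative of g: G(x) = log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)

def pl_loss(y, x):
    """Pseudo log-likelihood l(y, x) = y * x - G(x)."""
    return y * x - G(x)

# Numerical check that d/dbeta l(y, beta^T X) = X * {y - g(beta^T X)}.
rng = np.random.default_rng(0)
X, y, beta = rng.normal(size=5), 1.0, rng.normal(size=5)
eps = 1e-6
num_grad = np.array([
    (pl_loss(y, (beta + eps * e) @ X) - pl_loss(y, (beta - eps * e) @ X)) / (2 * eps)
    for e in np.eye(5)
])
ana_grad = X * (y - g(beta @ X))
assert np.allclose(num_grad, ana_grad, atol=1e-5)
```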

[Figure 1 schematic: a biobank nested within the EHR (30,450 patients) inside the full EHR cohort (60,864 patients); at baseline, genes (SNPs) X1 and clinical risk factors of T2DM X2 relate to the onset of T2DM Y, which is monitored during follow-up through ICD codes and NLP mentions of T2DM (S1, S2).]

Figure 1: A dense prediction model (graph with dashed lines) can be compressed to a sparse imputation model (graph with solid lines) when the effects of most baseline covariates are reflected in a few variables in the EHR monitoring the development of the event of interest.

We make no assumption on the sparsity of β0, and hence it is not feasible to perform valid supervised learning for β0 when sβ = ‖β0‖₀ > n.

We shall derive an efficient SSL estimate for β0 by leveraging U . To this end, we fit a working imputation model

E(Yi | Wi) = g(γᵀWi)  (4)

with the population parameter γ0 defined as

γ0:  E[Wi{Yi − g(γ0ᵀWi)}] = 0.  (5)

The definition of γ0 guarantees

E[Xi{Yi − g(γ0ᵀWi)}] = 0,  (6)

and hence if we impute Yi as Ȳi = g(γ0ᵀWi), we have E[Xi{Ȳi − g(β0ᵀXi)}] = 0 regardless of the adequacy of the imputation model (4); indeed, adding and subtracting Yi and applying (6) and (2) gives E[Xi{Ȳi − g(β0ᵀXi)}] = E[Xi{g(γ0ᵀWi) − Yi}] + E[Xi{Yi − g(β0ᵀXi)}] = 0. It is thus feasible to carry out an SSL procedure by first deriving an estimate for Ȳi using the labelled data L and then regressing the estimated

Ȳi against Xi using the whole data L ∪ U. Although we do not require β0 to be sparse or any of the fitted models to hold, we do assume that γ0 defined in (5) is sparse. When the surrogates S are strongly predictive of the outcome, the sparsity assumption on γ0 is reasonable since the majority of the information in Y can be captured by S.

Notations. We focus on the setting where min{n, p + q, N} → ∞. For convenience, we shall use n → ∞ in the asymptotic analysis. For two sequences of random variables An and Bn, we use An = Op(Bn) and An = op(Bn) to denote limc→∞ limn→∞ P(|An| ≥ c|Bn|) = 0 and limc→0 limn→∞ P(|An| ≥ c|Bn|) = 0, respectively. For two positive sequences an and bn, an = O(bn) or bn ≳ an means that there exists C > 0 such that an ≤ Cbn for all n; an ≍ bn if an = O(bn) and bn = O(an); and an ≪ bn or an = o(bn) if lim supn→∞ an/bn = 0. We use Zn →ᴸ N(0, 1) to denote that the sequence of random variables Zn converges in distribution to a standard normal random variable.

3 Methodology

3.1 SAS Estimation of β0

The SAS estimation procedure for β0 consists of two key steps: (i) fitting the imputation model to L to obtain the estimate γ̂ of γ0 defined in (5); and (ii) estimating β0 in (2) by fitting the imputed outcomes Ŷi = g(γ̂ᵀWi) against Xi on U. In Step (i), we estimate γ0 by the regularized PL estimator γ̂, defined as

γ̂ = argmin_{γ ∈ R^(p+q+1)} ℓimp(γ) + λγ‖γ₋₁‖₁ with λγ ≍ √(log(p + q)/n),  (7)

where a₋₁ denotes the sub-vector of a vector a containing all coefficients except the intercept, and

ℓimp(γ) = (1/n) Σ_{i=1}^{n} ℓ(Yi, γᵀWi) with ℓ(y, x) defined in (3).  (8)

The imputation loss (8) corresponds to the negative log-likelihood when Y is binary, the imputation model holds, and g is the anti-logit. With γ̂, we impute the unobserved outcomes for subjects in U as Ŷi = g(γ̂ᵀWi), for n + 1 ≤ i ≤ N.

In Step (ii), we estimate β0 by β̂ = β̂(γ̂), defined as

β̂(γ̂) = argmin_{β ∈ R^(p+1)} ℓ†(β; γ̂) + λβ‖β₋₁‖₁ with λβ ≍ √(log(p)/N),  (9)

where ℓ†(β; γ̂) is the imputed PL:

ℓ†(β; γ̂) = (1/N) Σ_{i>n} ℓ(Ŷi, βᵀXi) + (1/N) Σ_{i=1}^{n} ℓ(Yi, βᵀXi) with ℓ(y, x) defined in (3).  (10)

We denote the complete-data PL of the full data as

ℓPL(β) = (1/N) Σ_{i=1}^{N} ℓ(Yi, βᵀXi),  (11)

and define the gradients of the losses (8)-(11) as

ℓ̇imp(γ) = ∇ℓimp(γ), ℓ̇PL(β) = ∇ℓPL(β), ℓ̇†(β; γ) = (∂/∂β) ℓ†(β; γ).  (12)
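The two estimation steps can be sketched in Python as follows. This is a minimal illustration under the logistic link with hypothetical inputs (X_lab, S_lab, y_lab for L; X_unl, S_unl for U), not the authors' implementation: step (i) fits the L1-penalized imputation model (7) on the labeled data, and step (ii) fits the imputed PL in (9)-(10), where the fractional imputed outcomes are handled by duplicating each unlabeled observation with labels 1 and 0 weighted by Ŷi and 1 − Ŷi, which reproduces the imputed logistic PL. The sklearn penalty constant C is only a rough stand-in for λγ and λβ, which would be tuned by cross-validation in practice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sas_estimate(X_lab, S_lab, y_lab, X_unl, S_unl, lam_gamma=0.05, lam_beta=0.05):
    """Two-step SAS estimation sketch for a binary outcome with logistic link.

    X_lab, X_unl: covariates without the intercept column (sklearn adds it).
    Step (i): L1-penalized logistic imputation model of Y on W = (X, S),
              fit on the labeled data only.
    Step (ii): L1-penalized fit of the imputed pseudo log-likelihood of Y on X,
               using imputed outcomes on the unlabeled data and observed
               outcomes on the labeled data.
    """
    n, N = len(y_lab), len(y_lab) + len(X_unl)

    # Step (i): imputation model gamma-hat on the labeled data.
    W_lab = np.hstack([X_lab, S_lab])
    imp = LogisticRegression(penalty="l1", solver="saga",
                             C=1.0 / (n * lam_gamma), max_iter=5000)
    imp.fit(W_lab, y_lab)

    # Impute outcomes for the unlabeled data.
    W_unl = np.hstack([X_unl, S_unl])
    y_hat = imp.predict_proba(W_unl)[:, 1]

    # Step (ii): fractional labels handled by weighted duplication, so that the
    # weighted logistic log-likelihood equals the imputed PL in (10).
    X_fit = np.vstack([X_lab, X_unl, X_unl])
    y_fit = np.concatenate([y_lab, np.ones(len(X_unl)), np.zeros(len(X_unl))])
    w_fit = np.concatenate([np.ones(n), y_hat, 1.0 - y_hat])
    pred = LogisticRegression(penalty="l1", solver="saga",
                              C=1.0 / (N * lam_beta), max_iter=5000)
    pred.fit(X_fit, y_fit, sample_weight=w_fit)
    beta_hat = np.concatenate([pred.intercept_, pred.coef_.ravel()])
    return beta_hat, imp
```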

3.2 SAS Inference for Individual Prediction

Since g(·) is specified, the inference on g(xnewᵀβ) follows immediately from the inference on xnewᵀβ. We shall consider inference on the standardized linear prediction xstdᵀβ with the standardized covariates

xstd = xnew/‖xnew‖₂,

and then scale the confidence interval back. This way, the scaling with ‖xnew‖₂ is made explicit in the expression of the confidence interval. The estimation error of β̂ can be decomposed into two components corresponding to the respective errors associated with (7) and (9). Specifically, we write

β̂ − β0 = {β̄(γ̂) − β0} + {β̂ − β̄(γ̂)},  (13)

where β̄(γ̂) is defined as the minimizer of the expected imputed loss conditional on the labeled data, that is,

β̄(γ̂) = argmin_{β ∈ R^(p+1)} E[ℓ†(β; γ̂) | L].  (14)

The term β̄(γ̂) − β0 denotes the error from the imputation model in (7), while the term β̂ − β̄(γ̂) denotes the error from the prediction model in (9) given the imputation model parameter γ̂. As ℓ1 penalization is involved in both steps, we shall correct the regularization bias from the two sources. Following the typical one-step debiased LASSO (Zhang and Zhang, 2014), the bias β̂ − β̄(γ̂) is estimated by Θ̂ℓ̇†(β̂; γ̂), where Θ̂ is an estimator of [E{g′(β0ᵀXi)XiXiᵀ}]⁻¹, the inverse Hessian of ℓ†(·; γ̂) at β = β0.

The bias correction for β̄(γ̂) − β0 requires some innovation since we need to conduct the bias correction for a nonlinear functional β̄(·) of the LASSO estimator γ̂, which has not been studied in the literature. We identify β̄(γ̂) and β0 by the first-order moment conditions,

β̄(γ̂):  Ei>n[Xi{g(β̄(γ̂)ᵀXi) − g(γ̂ᵀWi)} | L] ≈ 0,

β0:  E[Xi{g(β0ᵀXi) − Yi}] = E[Xi{g(β0ᵀXi) − g(γ0ᵀWi)}] = 0.  (15)

Here Ei>n[· | L] denotes the conditional expectation of a single copy of the unlabeled data given the labelled data. Equating the two estimating equations in (15) and applying a first-order approximation, we approximate the difference β̄(γ̂) − β0 by

β̄(γ̂) − β0 ≈ −[E{g′(β0ᵀXi)XiXiᵀ}]⁻¹ Ei>n[Xi{g(γ0ᵀWi) − g(γ̂ᵀWi)} | L].  (16)

Combining the corrections for β̄(γ̂) − β0 and β̂ − β̄(γ̂) motivates the debiasing procedure

β̂ − ((1 − ρ)/n) Σ_{i=1}^{n} Θ̂Xi{g(γ̂ᵀWi) − Yi} − Θ̂ℓ̇†(β̂; γ̂).

The 1 − ρ factor, which tends to one when n is much smaller than N, comes from the proportion of unlabeled data whose missing outcomes are imputed.

For theoretical considerations, we devise a cross-fitting scheme in our debiasing process. We split the labelled and the unlabeled data into K folds of approximately equal size, respectively. The number of folds does not grow with the dimension (e.g., K = 10). We denote the index sets for the folds of the labelled data L as I1, ..., IK, and those of the unlabeled data U as J1, ..., JK. We denote the respective sizes of each fold in the labelled data and the full data as nk = |Ik| and Nk = nk + |Jk|, where |I| denotes the cardinality of I. Define Ik^c = {1, ..., n}\Ik and Jk^c = {n + 1, ..., N}\Jk. For each labelled fold k, we fit the imputation model with the out-of-fold labelled samples:

γ̂^(k) = argmin_{γ ∈ R^(p+q+1)} (1/(n − nk)) Σ_{i ∈ Ik^c} ℓ(Yi, γᵀWi) + λγ‖γ₋₁‖₁.  (17)

Using γ̂^(k), we fit the prediction model with the out-of-fold data Ik^c ∪ Jk^c:

β̂^(k) = argmin_{β ∈ R^(p+1)} (1/(N − Nk)) { Σ_{i ∈ Jk^c} ℓ(g(γ̂^(k)ᵀWi), βᵀXi) + Σ_{i ∈ Ik^c} ℓ(Yi, βᵀXi) } + λβ‖β₋₁‖₁.  (18)

To estimate the projection

u0 = [E{g′(β0ᵀXi)XiXiᵀ}]⁻¹ xstd,  (19)

we propose an L1-penalized estimator

û^(k) = argmin_{u ∈ R^(p+1)} (1/(N − Nk)) Σ_{k′ ≠ k} Σ_{i ∈ Ik′ ∪ Jk′} (1/2) g′(β̂^(k,k′)ᵀXi)(Xiᵀu)² − uᵀxstd + λu‖u‖₁,  (20)

where β̂^(k,k′) is trained with the samples outside folds k and k′,

β̂^(k,k′) = argmin_{β ∈ R^(p+1)} [ Σ_{i ∈ (Jk ∪ Jk′)^c} ℓ(g(γ̂^(k,k′)ᵀWi), βᵀXi) + Σ_{i ∈ (Ik ∪ Ik′)^c} ℓ(Yi, βᵀXi) ] / (N − Nk − Nk′) + λβ‖β₋₁‖₁,  (21)

with γ̂^(k,k′) = argmin_{γ ∈ R^(p+q+1)} [ Σ_{i ∈ Ik^c ∩ Ik′^c} ℓ(Yi, γᵀWi) ] / (n − nk − nk′) + λγ‖γ₋₁‖₁.

The estimators in (21) take similar forms as those in (17) and (18) except that their training samples exclude the two folds of data Ik ∪ Jk and Ik′ ∪ Jk′. In the summand of (20), the data (Yi, Xi, Si) in fold k′, Ik′ ∪ Jk′, are independent of β̂^(k,k′), which is trained without folds k and k′. The estimation of u0 requires an estimator of β0, and both estimators are subsequently used in the debiasing step. Using the same data multiple times for β̂, û, debiasing, and variance estimation may induce over-fitting bias, so we implement the cross-fitting scheme to reduce the over-fitting bias. As a remark, cross-fitting might not be necessary for the theory under additional assumptions and/or with empirical process techniques.
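The fold bookkeeping of the cross-fitting scheme can be sketched as follows (a schematic only; fit_imputation, fit_prediction and fit_projection are hypothetical callables standing in for (17), (18) and (20)-(21)).

```python
import numpy as np

def split_folds(n, N, K, seed=0):
    """Split labeled indices {0,...,n-1} and unlabeled indices {n,...,N-1}
    into K folds of roughly equal size."""
    rng = np.random.default_rng(seed)
    lab = rng.permutation(n)
    unl = n + rng.permutation(N - n)
    I = np.array_split(lab, K)   # labeled folds I_1, ..., I_K
    J = np.array_split(unl, K)   # unlabeled folds J_1, ..., J_K
    return I, J

def cross_fit(data, K, fit_imputation, fit_prediction, fit_projection):
    """Out-of-fold fitting loop: for each fold k, gamma-hat^(k) and beta-hat^(k)
    are trained on the out-of-fold data only (mirroring (17)-(18)); the
    projection direction u-hat^(k) additionally excludes a second fold
    (mirroring (20)-(21))."""
    n, N = data["n"], data["N"]
    I, J = split_folds(n, N, K)
    fits = []
    for k in range(K):
        out_lab = np.concatenate([I[j] for j in range(K) if j != k])
        out_unl = np.concatenate([J[j] for j in range(K) if j != k])
        gamma_k = fit_imputation(data, out_lab)
        beta_k = fit_prediction(data, gamma_k, out_lab, out_unl)
        u_k = fit_projection(data, I, J, k)   # uses beta-hat^(k,k') internally
        fits.append({"fold": k, "I": I[k], "J": J[k],
                     "gamma": gamma_k, "beta": beta_k, "u": u_k})
    return fits
```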

We obtain the cross-fitted debiased estimator for xstdᵀβ, denoted x̂stdᵀβ and defined as

(1/K) Σ_{k=1}^{K} xstdᵀβ̂^(k) − (1/N) Σ_{k=1}^{K} Σ_{i ∈ Jk} û^(k)ᵀXi{g(β̂^(k)ᵀXi) − g(γ̂^(k)ᵀWi)}
  − (1/n) Σ_{k=1}^{K} Σ_{i ∈ Ik} û^(k)ᵀXi{(1 − ρ)·g(γ̂^(k)ᵀWi) + ρ·g(β̂^(k)ᵀXi) − Yi}.  (22)

The second term is used to correct the bias β̄(γ̂) − β0 and the third term is used to correct the bias β̂ − β̄(γ̂). The corresponding variance estimator is

V̂SAS = (1/n) Σ_{k=1}^{K} Σ_{i ∈ Ik} (û^(k)ᵀXi)² {(1 − ρ)·g(γ̂^(k)ᵀWi) + ρ·g(β̂^(k)ᵀXi) − Yi}²
  + (ρ²/n) Σ_{k=1}^{K} Σ_{i ∈ Jk} (û^(k)ᵀXi)² {g(β̂^(k)ᵀXi) − g(γ̂^(k)ᵀWi)}².  (23)

Through the link g and the scaling factor ‖xnew‖₂, we estimate g(xnewᵀβ0) by g(‖xnew‖₂ · x̂stdᵀβ) and construct the (1 − α) × 100% confidence interval for g(xnewᵀβ0) as

[ g(‖xnew‖₂{x̂stdᵀβ − Zα/2√(V̂SAS/n)}),  g(‖xnew‖₂{x̂stdᵀβ + Zα/2√(V̂SAS/n)}) ],  (24)

where Zα/2 is the 1 − α/2 quantile of the standard normal distribution.

4 Theory

We introduce assumptions required for both estimation and inference in Section 4.1. We state our theories for estimation and inference, respectively in Sections 4.2 and 4.3.

4.1 Assumptions

We assume the complete data consist of i.i.d. copies of (Yi, Xi, Si), for i = 1, ..., N. In our SSL setting of interest, only the first n outcome labels Y1, ..., Yn are observed. Under the i.i.d. assumption, our SSL setting is equivalent to the missing completely at random (MCAR) assumption. The sparsities of γ0, β0 and u0 are denoted as

sγ = ‖γ0‖₀, sβ = ‖β0‖₀, su = ‖u0‖₀.

We focus on the setting with n, p + q, N → ∞, with n being allowed to be smaller than p + q.

We allow sγ, sβ and su to grow with n, p + q, N, satisfying sγ ≪ n and sβ + su ≪ N. To achieve sharper dimension conditions, we consider a sub-Gaussian design as in Portnoy (1984, 1985) and Negahban et al. (2010). We denote the sub-Gaussian norm of both random variables and random vectors by ‖·‖ψ2. The detailed definition is given in Appendix D.

Assumption 1. For constants ν1, ν2 and M independent of n, p and N,

a) the residuals Yi − g(γ0ᵀWi) and Yi − g(β0ᵀXi) are sub-Gaussian random variables with sub-Gaussian norms bounded by ‖Yi − g(γ0ᵀWi)‖ψ2 ≤ ν1 and ‖Yi − g(β0ᵀXi)‖ψ2 ≤ ν2;

b) the link function g satisfies the monotonicity and smoothness conditions: inf_{x∈R} g′(x) ≥ 0, sup_{x∈R} g′(x) < M and sup_{x∈R} g″(x) < M.

In our motivating example with a binary Yi and g(x) = eˣ/(1 + eˣ), Assumptions 1a and 1b are satisfied. The conditions are also satisfied for the probit link function and the identity link function. Condition 1a is universal in high-dimensional regression. Admittedly, the Lipschitz requirement in 1b rules out some GLM links with unbounded derivatives, such as the exponential link, but we may substitute for this condition by assuming a bounded ‖Xi‖∞.

Assumption 2. For constants σ²max and σ²min independent of n, p, N,

a) Wi is a sub-Gaussian vector with sub-Gaussian norm ‖Wi‖ψ2 ≤ σmax/√2;

b) the weak overlapping condition holds at the population parameters β0 and γ0:

(i) inf_{‖v‖2=1} vᵀE[{g′(β0ᵀXi) ∧ 1}XiXiᵀ]v ≥ σ²min,

(ii) inf_{‖v‖2=1} vᵀE[{g′(γ0ᵀWi) ∧ 1}WiWiᵀ]v ≥ σ²min;

c) the non-degeneracy of the average residual variance:

inf_{‖v‖2=1} E[{Yi − (1 − ρ)·g(γ0ᵀWi) − ρ·g(β0ᵀXi)}²(Xiᵀv)²] ≥ σ²min.

Assumption 2a is typical for high-dimensional regression (Negahban et al., 2010); it also implies a bounded maximal eigenvalue of the second moment,

sup_{‖v‖2=1} vᵀE[WiWiᵀ]v ≤ σ²max.

Notably, we do not require two conditions that are common for high-dimensional generalized linear models (Huang and Zhang, 2012; van de Geer et al., 2014): 1) an upper bound on sup_{i=1,...,N} ‖Xi‖∞; 2) a lower bound on inf_{i=1,...,N} g′(β0ᵀXi), often known as the overlapping condition for the logistic regression model. Compared to the overlapping condition under logistic regression that g(β0ᵀXi) and g(γ0ᵀWi) are bounded away from zero, our Assumptions 2b and 2c are weaker because they are implied by the typical minimal eigenvalue condition

inf_{‖v‖2=1} vᵀE(WiWiᵀ)v ≥ σ²min

plus the overlapping condition.

4.2 Consistency of the SAS Estimation

We now state the L2 and L1 convergence rates of our proposed SAS estimator.

Theorem 1 (Consistency of SAS estimation). Under Assumptions 1 and 2, and with

sγ = o(n/log(p + q)), sβ = o(N/log(p)), λβ ≳ √(log(p)/N),  (25)

we have

‖β̂ − β0‖₂ = Op(√sβ λβ + (1 − ρ)√(sγ log(p + q)/n)),

‖β̂ − β0‖₁ = Op(sβλβ + (1 − ρ)² sγ log(p + q)/(nλβ)).

A few remarks are in order for Theorem 1. First, the dimension requirement under which our SAS estimator achieves L2 consistency significantly weakens the existing dimension requirement in the supervised setting (Negahban et al., 2010; Huang and Zhang, 2012; Bühlmann and Van De Geer, 2011; Bickel et al., 2009). With λβ ≍ √(log(p)/N), Theorem 1 implies the L2 consistency of β̂ under the dimension condition

(1 − ρ)² sγ log(p + q)/n + sβ log(p)/N = o(1).  (26)

In the setting N ≫ n, our requirement on the sparsity of β0, sβ = o(N/log(p)), is significantly weaker than sβ = o(n/log(p)), which is known as the fundamental sparsity limit for identifying the high-dimensional regression vector in the supervised setting. Theorem 1 indicates that, with the assistance of the surrogates S observed in U, the SAS procedure allows sβ > n provided that N is sufficiently large and the imputation model is sparse. This distinguishes our result from most estimation results in high-dimensional supervised settings.

Second, we briefly discuss the L1 consistency. If the L1 consistency is of interest, the penalty levels are chosen as

λβ ≍ max{√(log(p)/N), √(sγ/sβ) λγ},  (27)

which produces the L1 estimation rate from Theorem 1,

‖β̂ − β0‖₁ = Op(sβ√(log(p)/N) + √(sγ sβ log(p)/n)).

Compared to the condition for L1 consistency under supervised learning, sβ = o(√(n/log(p))), the condition from the SAS estimation, sβ = o((n/sγ + N)/log(p)), allows a denser β0 in the setting with a very sparse γ0 and a large unlabeled dataset. On the other hand, the L2 estimation rate in Theorem 1 remains the same if

√(log(p)/N) ≲ λβ ≲ max{√(log(p)/N), √(sγ/sβ) λγ}.

We shall point out that our subsequent theory on the SAS inference procedure is based on the

L2 consistency, instead of L1 consistency. Theorem 1 implies the following prediction consistency result.

Corollary 2 (Consistency of individual prediction). Suppose xnew is a sub-Gaussian random vector satisfying sup_{‖v‖2=1} vᵀE[xnew xnewᵀ]v ≤ σ²max. Under the conditions of Theorem 1, we have

|g(β̂ᵀxnew) − g(β0ᵀxnew)| = Op(‖β̂ − β0‖₂) = op(1).

The concentration result of Corollary 2 is established with respect to the joint distribution of the data and the new observation xnew. This is in sharp contrast to the individual prediction conditional on a given new observation xnew. If the goal is to conduct inference for any given xnew, the theoretical justification is provided in Theorem 3 and Corollary 4 below.

4.3 √n-Inference with the Debiased SAS Estimator

We state the validity of our SSL inference in Theorem 3. We use A →ᴸ B to denote that the random variable A converges in distribution to the distribution B.

Theorem 3 (SAS Inference). Let xnew be the random vector representing the covariates of a new individual. Under Assumptions 1 and 2 and the dimension condition

(1 − ρ)⁴ sγ² log²(p + q)/n + ρ(sβ² + sβsu) log(p)/N + (1 − ρ)² sγ su log(p + q) log(p)/N = o(1),  (28)

we can draw inference on xnewᵀβ0 conditionally on xnew according to

√n V̂SAS^(−1/2) (x̂stdᵀβ − xnewᵀβ0/‖xnew‖₂) | xnew →ᴸ N(0, 1),

where V̂SAS defined in (23) is the estimator of the asymptotic variance

VSAS = E[(u0ᵀXi)²{Yi − (1 − ρ)·g(γ0ᵀWi) − ρ·g(β0ᵀXi)}²]
  + ρ(1 − ρ) E[(u0ᵀXi)²{g(γ0ᵀWi) − g(β0ᵀXi)}²],

with u0 = Θ0 xnew/‖xnew‖₂ = [E{g′(β0ᵀXi)XiXiᵀ}]⁻¹ xnew/‖xnew‖₂.  (29)

By Young's inequality, the condition (28) is implied by

(1 − ρ)⁴ sγ² log²(p + q)/n + ρ(sβ + su) log(p)/√N = o(1).  (30)

When p is much smaller than the full sample size N, our condition (30) allows the sparsity levels of β0 and u0 to be as large as p. Even if p is larger than N, our SAS inference procedure is valid if sβ + su ≲ √N/log(p). In the literature on confidence interval construction in the high-dimensional supervised setting, a valid inference procedure for a single regression coefficient in linear regression requires sβ ≲ √n/log(p) (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer et al., 2014). Such a sparsity condition

has been shown to be necessary for constructing a confidence interval at the parametric rate (Cai et al., 2017). We have leveraged the unlabeled data to significantly relax the fundamental limit of statistical inference from sβ ≲ √n/log(p) to sβ ≲ √N/log(p). Leveraging the unlabeled data thus validates statistical inference for a dense model in high dimensions.

The sparsity of u0 is determined by xnew and the precision matrix Θ0. In the supervised learning setting, confidence interval construction for a single regression coefficient in van de Geer et al. (2014) requires su ≲ n/log(p). According to (30), our SAS inference requires su ≲ √N/log(p), which can be weaker than su ≲ n/log(p) if the amount of unlabeled data is larger than n². Theorem 3 implies that our proposed CI in (24) is valid in terms of coverage, as summarized in the following corollary.

Corollary 4. Under Assumptions 1 and 2, as well as (28), the CI defined in (24) satisfies

P( g(‖xnew‖₂{x̂stdᵀβ − Zα/2√(V̂SAS/n)}) ≤ g(xnewᵀβ0) ≤ g(‖xnew‖₂{x̂stdᵀβ + Zα/2√(V̂SAS/n)}) ) = 1 − α + o(1).

Moreover, the length of the CI in (24) is of the order

2g′(β0ᵀxnew)‖xnew‖₂ Zα/2 √(VSAS/n) ≲ ‖xnew‖₂/√n,

where VSAS is the asymptotic variance defined in (29).

Confidence interval construction for g(xnewᵀβ0) in the high-dimensional supervised setting has recently been studied in Guo et al. (2020). Guo et al. (2020) assumes the prediction model to be correctly specified as a high-dimensional sparse logistic regression, and the inference procedure is valid if sβ ≲ √n/log p. In contrast, we leverage the unlabeled data to allow for a mis-specified prediction model and a dense regression vector, as long as the dimension requirement in (28) is satisfied.

4.4 Efficiency comparison of SAS Inference

Efficiency in the high-dimensional setting, or in the SSL setting in which the proportion of labelled data decays to zero, is yet to be formalized. Here we use the efficiency bound in the classical low-dimensional setting with a fixed ρ as the benchmark. Apart from the relaxation of various sparsity conditions, we illustrate next that our SAS inference achieves decent efficiency under a properly specified imputation model compared to supervised learning and to the benchmark.

Similar to the phenomenon discovered by Chakrabortty and Cai (2018), if the imputation model is correct, we can guarantee the efficiency gain of SAS inference in comparison to the asymptotic variance of supervised learning,

VSL = E[(u0ᵀXi)²{Yi − g(β0ᵀXi)}²].  (31)

Proposition 5. If E(Yi | Si, Xi) = g(γ0ᵀWi), we have VSL ≥ VSAS.

Moreover, we can show that our SAS inference attains the benchmark efficiency derived from classical fixed ρ setting (Tsiatis, 2007). To simplify the derivation, we describe the missing-completely-at-random mechanism through the binary observation indicator Ri, i =

1,...,N, independent of Yi, Xi and Si. We still denote the proportion of labelled data as

ρ = E(Ri). The unsorted data take the form

D = {Di = (Xiᵀ, Siᵀ, Ri, RiYi)ᵀ, i = 1, ..., N}.

We consider the following class of complete-data semi-parametric models

Mcomp = { fX,Y,S,R(x, y, s, r) = fX(x) ρʳ (1 − ρ)^(1−r) fY|S,X(y | s, x) fS|X(s | x) :  fY|S,X, fX, fS|X are arbitrary densities },  (32)

and establish the efficiency bound for regular and asymptotically linear (RAL) estimators under Mcomp by deriving the associated efficient influence function in the following proposition. We denote the nuisance parameters for fY|S,X, fX and fS|X as η, and use η0 to denote the true underlying nuisance parameter that generates the data. The parameter of interest β0 is not part of the model Mcomp but is defined implicitly through the moment condition (2).

Proposition 6. The efficient influence function for θ = xstdᵀβ0 under Mcomp is

φeff(Di; θ0, η0) = (Ri/ρ) u0ᵀXi{Yi − E(Yi | Si, Xi)} + u0ᵀXi{E(Yi | Si, Xi) − g(β0ᵀXi)}.

Under the assumptions of Theorem 3 and additionally E(Yi | Si, Xi) = g(γ0ᵀWi), our SAS debiased estimator admits the same influence function:

x̂stdᵀβ − xnewᵀβ0/‖xnew‖₂ = (1/N) Σ_{i=1}^{N} φeff(Di; θ0, η0) + op((ρN)^(−1/2)),

according to Appendix B3, Step 2, (A.31).

5 Simulation

We have conducted extensive simulation studies to evaluate the finite sample performance of the SAS estimation and inference procedures under various scenarios. Throughout, we let p = 500, q = 100, N = 20000 and consider n = 500. The signals in β are varied to be approximately sparse or fully dense with a mixture of strong and weak signals. The surrogates S are either moderately or strongly predictive of Y, as specified below. For each configuration, we summarize the results based on 500 simulated datasets. To mimic the zero-inflated discrete distribution of EHR features, we first generate

Zˣi,1, ..., Zˣi,p, Zᵘi, Zˢi,1, ..., Zˢi,q independently from N(0, 25). Then we construct Xi from {Zᵘi, Zˣi = (Zˣi,1, ..., Zˣi,p)ᵀ} via the transformation ς(z) = ⌊log{1 + exp(z)}⌋:

Xi,1 = [ς( Σ_{j=2}^{p} √2 Xi,j/√(p − 1) + Zˣi,1/√2 ) − µX]/σX,

Xi,j = [ς( Zˣi,j √(1 − p⁻¹) + Zᵘi/√p ) − µX]/σX,  j = 2, ..., p.

We standardize Xi,j to roughly mean zero and unit variance with µX = 1.80 and σX = 2.74. The shared term Zᵘi induces correlation among the covariates. For S and Y, we consider two scenarios under which the imputation model is either correctly or incorrectly specified. We present "Scenario I: neither the risk prediction model nor the imputation model is correctly specified" in the main text and "Scenario II: the imputation model is correctly specified and exactly sparse" in Section A of the Supplementary Materials.
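For concreteness, the covariate-generating design above can be coded as in the following sketch (illustrative only; the standardization constants µX = 1.80 and σX = 2.74 are taken from the text).

```python
import numpy as np

def generate_X(n_obs, p=500, mu_X=1.80, sigma_X=2.74, seed=0):
    """Generate zero-inflated, correlated covariates via
    varsigma(z) = floor(log(1 + exp(z))), as in the simulation design."""
    rng = np.random.default_rng(seed)
    Zx = rng.normal(scale=5.0, size=(n_obs, p))   # N(0, 25)
    Zu = rng.normal(scale=5.0, size=n_obs)        # shared term inducing correlation

    def varsigma(z):
        return np.floor(np.logaddexp(0.0, z))

    X = np.empty((n_obs, p))
    # X_{i,j}, j = 2, ..., p (0-based columns 1, ..., p-1)
    X[:, 1:] = (varsigma(Zx[:, 1:] * np.sqrt(1 - 1 / p) +
                         Zu[:, None] / np.sqrt(p)) - mu_X) / sigma_X
    # X_{i,1} depends on the remaining standardized covariates and Z^x_{i,1}
    X[:, 0] = (varsigma(np.sqrt(2) * X[:, 1:].sum(axis=1) / np.sqrt(p - 1) +
                        Zx[:, 0] / np.sqrt(2)) - mu_X) / sigma_X
    return X
```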

Scenario I: neither the risk prediction model nor the imputation model is correctly specified. In this scenario, we first generate Yi from the probit model

P(Yi = 1 | Zˣi) = Φ(αᵀZˣi) with Φ(x) = ∫_{−∞}^{x} (2π)^(−1/2) e^(−t²/2) dt,

and then generate S from

Si,1 = {ς(Zˢi,1/2 + θYi) − µS}σS⁻¹ + ξᵀXi, and Si,j = {ς(Zˢi,j) − µX}σX⁻¹, j = 2, ..., q.

We choose µS and σS depending on α such that Si,1 is roughly of mean 0 and variance 1. Under this setting, a logistic imputation model would be mis-specified but nevertheless approximately sparse with appropriately chosen ξ. The coefficients α control the optimal prediction accuracy of X for Y, while θ controls the optimal prediction accuracy of S for Y. We consider two α of different sparsity patterns, which also determine the rest of the parameters:

Sparse (sα = 3):  α = (0.45, 0.318, 0.318, 0ᵀ_{497×1})ᵀ, µS = 1.82, σS = 2.01,

Dense (sα = 500):  α = (0.316, 0.059ᵀ_{29×1}, 0.007ᵀ_{470×1})ᵀ, µS = 2.71, σS = 2.68,

where a_{k×1} = (a, ..., a)_{k×1} denotes the k-dimensional vector with all entries equal to a. The sparsity of α in turn affects the approximate sparsity of β (Table 1), which we measure by the squared ratio between the ℓ1 norm and the ℓ2 norm,

S(β) = ‖β‖₁²/‖β‖₂², min_{j: βj ≠ 0} |βj| ≤ S(β)/‖β‖₀ ≤ 1.  (33)

We consider two values of θ: (a) θ = 0.6, for S moderately predictive of Y; and (b) θ = 1, for strong surrogates. The parameter ξ depends on both the choice of α and θ:

sα = 3, θ = 0.6:  ξ = (0.407, 0.330, 0.330, 0.005ᵀ_{497×1})ᵀ,

sα = 3, θ = 1:  ξ = (0.199, 0.163, 0.163, 0.002ᵀ_{497×1})ᵀ,

sα = 500, θ = 0.6:  ξ = (0.350, 0.064ᵀ_{29×1}, 0.011ᵀ_{470×1})ᵀ,

sα = 500, θ = 1:  ξ = (0.169, 0.032ᵀ_{29×1}, 0.005ᵀ_{470×1})ᵀ.

Table 1: AUC table for simulations with 500 labels under Scenario I. The AUCs are evaluated on an independent testing set of size 100. We approximately measure the sparsity by S(v) = ‖v‖₁²/‖v‖₂².

Scenario                       Prediction Accuracy (AUC)
Surrogate   S(β0)   S(γ0)      Oracle   SLASSO   SAS
Strong      174     1.32       0.724    0.660    0.711
Moderate    174     1.26       0.724    0.660    0.713
Strong      28.3    1.33       0.719    0.694    0.713
Moderate    28.3    1.24       0.719    0.694    0.711

Due to the complexity of the data generating process and the non-collapsibility of logistic regression models, we cannot analytically express the true β0 in either scenario. Instead, we numerically evaluate β0 with a large simulated dataset, using the oracle knowledge of the exchangeability among covariates, according to the model

logit{P(Yi = 1 | Xi)} ∼ η0 + η1 Xi,1 + η2 Σ_{j=2}^{sα} Xi,j + η3 Σ_{j=sα+1}^{p} Xi,j.

We derive the true β0 as

β0 = (η0, η1, (η2)ᵀ_{(sα−1)×1}, (η3)ᵀ_{(p−sα)×1})ᵀ.
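The numerical evaluation of β0 described above can be sketched as follows (an illustration, assuming a very large simulated sample (X_big, y_big) is available; sklearn's unpenalized logistic regression stands in for solving the moment condition (2) on the reduced, exchangeable design).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def approximate_beta0(X_big, y_big, s_alpha):
    """Approximate beta_0 on a very large simulated sample by exploiting the
    exchangeability of covariates: fit the reduced logistic model with
    intercept, X_1, the sum of X_2..X_{s_alpha}, and the sum of the rest,
    then expand (eta_0, eta_1, eta_2, eta_3) back to the full vector."""
    p = X_big.shape[1]
    Z = np.column_stack([
        X_big[:, 0],                      # X_{i,1}
        X_big[:, 1:s_alpha].sum(axis=1),  # sum over j = 2, ..., s_alpha
        X_big[:, s_alpha:].sum(axis=1),   # sum over j = s_alpha + 1, ..., p
    ])
    fit = LogisticRegression(penalty=None, max_iter=10000).fit(Z, y_big)
    eta0 = fit.intercept_[0]
    eta1, eta2, eta3 = fit.coef_.ravel()
    return np.concatenate([[eta0, eta1],
                           np.full(s_alpha - 1, eta2),
                           np.full(p - s_alpha, eta3)])
```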

We report the simulation settings under Scenario I in Table 1, where we present the predictive power of the oracle estimation and the LASSO estimation. We also report the average area-under-the-curve (AUC) of the receiver operating characteristic (ROC) curve for the oracle β0, the supervised LASSO (SLASSO), and the proposed SAS estimation. Our SAS estimation achieves a better AUC than the supervised LASSO across all scenarios, and is comparable to the AUC with the true coefficient β0. In addition, we observe that the AUC of the supervised LASSO is sensitive to the approximate sparsity S(β0), while the AUC of the SAS estimation does not seem to be affected by S(β0).

To evaluate the SAS inference for the individualized prediction, we consider six different choices of xnew. We first select {xᴸnew, xᴹnew, xᴴnew} from a random sample of xnew generated from the distribution of Xi such that their predicted risks are around 0.2, 0.5, and 0.7, corresponding to low, moderate and high risk. We additionally consider three sets of xnew with different levels of sparsity:

Sparse: xˢnew = (1, 1, 0ᵀ_{499×1})ᵀ;

Intermediate: xᴵnew = (1, 0.183ᵀ_{30×1}, 0ᵀ_{470×1})ᵀ;

Dense: xᴰnew = (1, 0.045ᵀ_{500×1})ᵀ.

In Table 2, we compare our SAS estimator of xnewᵀβ0 with the corresponding SLASSO across all settings under Scenario I. The root mean-squared error (rMSE) of the SAS estimation decays proportionally with the sample size, while the rMSE of the supervised LASSO provides evidence of inconsistency for the moderate and dense deterministic xnew. The bias of the supervised LASSO is also significantly larger than that of the SAS estimation. The performance of the SAS estimation is insensitive to the sparsity of β0, while that of the supervised LASSO severely deteriorates with a dense β0. The improvement from the supervised LASSO to the SAS estimation is regulated by the surrogate strength.

In Table 3, we compare our SAS inference with the supervised debiased LASSO across the settings under Scenario I. Our SAS inference procedure attains approximately honest coverage of the 95% confidence intervals for all types of xnew under all scenarios. Unsurprisingly, the debiased SLASSO exhibits under-coverage for the deterministic xnew as a consequence of the violation of the sparsity assumptions on β0 and the precision matrix. Under our design, the first covariate X1 has the strongest dependence on the other covariates, and its associated row in the precision matrix is thus the densest. Consequently, the inference for β0ᵀxˢnew, the sum of the intercept and the coefficient of Xi,1, is particularly challenging for the debiased SLASSO. The debiased SLASSO also has acceptable coverage for the random xᴸnew, xᴹnew, xᴴnew sampled from the covariate distribution despite the presence of substantial bias, which we attribute to the even larger variance that dominates the bias. In contrast, our SAS inference has small bias across all scenarios and improved variance owing to the strong surrogates. According to Tables A1, A2 and A3 in Appendix A, the results under Scenario II are consistent with our findings under Scenario I.

Table 2: Comparison of the SAS estimation to the supervised LASSO (SLASSO) with Bias, Empirical standard error (ESE) and root mean-squared error (rMSE) of the linear predictions xnewᵀβ0 under Scenario I with 500 labels, moderate or large S(β0), and strong or moderate surrogates.

              SLASSO                 SAS: Moderate           SAS: Strong
Type      Bias    ESE    rMSE    Bias    ESE    rMSE    Bias    ESE    rMSE
Moderate S(β0)
xᴸnew     0.605   0.387  0.719   0.165   0.249  0.298   0.118   0.196  0.229
xᴹnew    -0.083   0.337  0.347  -0.008   0.246  0.246  -0.016   0.195  0.196
xᴴnew    -0.718   0.521  0.887  -0.234   0.294  0.376  -0.176   0.225  0.286
xˢnew    -0.072   0.144  0.161  -0.080   0.094  0.123  -0.018   0.078  0.080
xᴵnew    -0.460   0.096  0.470  -0.110   0.093  0.143  -0.055   0.071  0.090
xᴰnew    -0.413   0.091  0.423  -0.110   0.089  0.141  -0.114   0.069  0.133
Large S(β0)
xᴸnew     0.389   0.275  0.477   0.161   0.215  0.269   0.133   0.264  0.296
xᴹnew    -0.017   0.280  0.280  -0.014   0.213  0.213  -0.017   0.268  0.268
xᴴnew    -0.600   0.481  0.769  -0.251   0.271  0.370  -0.164   0.296  0.339
xˢnew    -0.202   0.140  0.246  -0.074   0.097  0.122  -0.009   0.078  0.079
xᴵnew    -0.178   0.098  0.203  -0.075   0.086  0.115  -0.071   0.075  0.103
xᴰnew    -0.185   0.090  0.206  -0.109   0.084  0.138  -0.113   0.073  0.135

Table 3: Bias, Empirical standard error (ESE), average of the estimated standard error (ASE), and empirical coverage of the 95% confidence intervals (CP) for the debiased supervised LASSO (SLASSO) and the debiased SAS estimator of the linear predictions xnewᵀβ0 under Scenario I with 500 labels, moderate or large S(β0), and strong or moderate surrogates.

           Debiased SLASSO               Debiased SAS: Moderate Surrogates   Debiased SAS: Strong Surrogates
Type     Bias    ESE    ASE    CP       Bias    ESE    ASE    CP            Bias    ESE    ASE    CP
Risk prediction model approximately sparse
xᴸnew   -0.290  1.901  1.896  0.948     0.021  1.873  1.864  0.949          0.018  1.531  1.531  0.950
xᴹnew   -0.091  1.994  1.981  0.947    -0.007  1.961  1.954  0.950         -0.015  1.560  1.570  0.953
xᴴnew    0.348  2.106  2.074  0.942    -0.050  2.036  2.039  0.950         -0.011  1.632  1.623  0.950
xˢnew    0.171  0.157  0.128  0.694    -0.019  0.149  0.150  0.950         -0.001  0.132  0.125  0.924
xᴵnew   -0.001  0.129  0.125  0.938    -0.013  0.123  0.116  0.932          0.010  0.101  0.094  0.920
xᴰnew    0.141  0.137  0.138  0.812    -0.011  0.123  0.118  0.944         -0.001  0.096  0.095  0.940
Large S(β0)
xᴸnew   -0.134  1.918  1.914  0.951     0.018  1.875  1.878  0.951          0.018  1.529  1.524  0.948
xᴹnew   -0.056  1.970  1.962  0.948    -0.020  1.911  1.927  0.952          0.005  1.603  1.597  0.950
xᴴnew    0.109  2.051  2.029  0.945    -0.022  1.997  1.991  0.950         -0.040  1.671  1.668  0.951
xˢnew    0.029  0.155  0.127  0.892    -0.008  0.153  0.147  0.946         -0.013  0.133  0.131  0.938
xᴵnew    0.002  0.131  0.125  0.930     0.001  0.122  0.114  0.936          0.002  0.101  0.098  0.936
xᴰnew    0.113  0.135  0.139  0.874    -0.007  0.119  0.116  0.938         -0.003  0.099  0.097  0.960

6 Application of SAS to EHR Study

We applied the proposed SAS method to the risk prediction of type II diabetes mellitus (T2DM) using EHR and genomic data of participants in the Mass General Brigham (MGB) Biobank study. To define the study cohort, we extracted from the EHR of each patient their date of first EHR encounter (tini), follow-up period (C), and the counts and dates of the ICD codes and NLP mentions of clinical concepts related to T2DM as well as its risk factors. We only included patients who did not have any ICD code or NLP mention of T2DM up to baseline, where the baseline time is defined as 1990 if tini is prior to 1990 and as the year of tini otherwise. Although neither the ICD code nor the NLP mention of T2DM is sufficiently specific, they are highly sensitive and can be used to accurately remove patients who had already developed T2DM at baseline. This exclusion criterion resulted in N = 20216 patients who were free of T2DM at baseline and have both EHR and genomic features for risk modeling. Among those, we have a total of n = 271 patients whose T2DM status during follow-up, Y, has been obtained via manual chart review. The prevalence of T2DM was about 14% based on the labeled data.

We aim to develop a risk prediction model for Y by fitting a working model P(Y = 1 | X) = g(β0ᵀX), where the baseline covariate vector X includes age, gender, indicators for the occurrence of ICD codes and NLP counts for obesity, hypertension, coronary artery disease (CAD), and hyperlipidemia during the first-year window, as well as a total of 49 single nucleotide polymorphisms (SNPs) previously reported as associated with T2DM in Mahajan et al. (2018) with odds ratios greater than 1.1. We additionally adjust for follow-up by including log(C) and allow for non-linear effects by including two-way interactions between the SNPs and the other baseline covariates. All variables with fewer than 10 nonzero values within the labelled set are removed, resulting in a final covariate vector of dimension p = 260. We standardize the covariates to have mean 0 and variance 1. To impute the outcome, we used the predicted probability of T2DM derived from the unsupervised phenotyping method MAP (Liao et al., 2019), which achieves an AUC of 0.98, indicating a strong surrogate. In addition to the proposed SAS procedure, we derive risk prediction models based on the supervised LASSO with the same set of covariates. We let K = 5 in the cross-fitting and use 5-fold cross-validation for tuning parameter selection. To compare the performance of different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC. We repeat the process 10 times and take the average of the predicted probabilities across the repeats for each labelled sample and each method in comparison.

[Figure 2 displays expit(coefficient) on a 0-1 scale for the variables Log age, Log follow-up time, rs62368490, Obesity NLP:rs10811660, Hyperlipidemia NLP:rs7978610, Obesity NLP:rs7756992, Hyperlipidemia Code:rs10830963, Hypertension Code:rs2249105, and Hypertension Code:rs1708302, by method (SAS, SLASSO) and estimate type (initial, debiased).]

Figure 2: Point and 95% confidence interval estimates for the coefficients with nominal p-value < 0.05 from the SAS inference. The horizontal bars indicate the estimated 95% confidence intervals. The solid points indicate the (initial) estimates, and the triangles indicate the debiased estimates. The colors red and green indicate the two methods, SAS and SLASSO, respectively.

In Figure 2, we present the estimated β coefficients for the covariates that received a p-value less than 0.05 from the SAS inference. The confidence intervals are generally narrower from the SAS inference. For the coefficients of baseline age and follow-up time, which are expected to have a positive effect on T2DM onset during the observation period, the SAS inference produced much narrower confidence intervals than the debiased SLASSO. In addition, the SAS inference identified one global genetic risk factor and 6 other subgroup genetic risk factors, while SLASSO identified none of these.

In Table 4, we present the AUCs of the estimated risk prediction models using the high dimensional X.

Table 4: The cross-validated (CV) AUC of the estimated risk prediction models with high dimensional EHR and genetic features based on SAS and the supervised LASSO. Also shown is the AUC of the imputation model derived for the SAS procedure.

Method   Imputation   SAS     SLASSO
CV AUC   0.928        0.763   0.488

[Figure 3 consists of three panels (Low risk: <5%, Moderate risk: 5%-15%, High risk: >15%), each plotting the estimated T2DM onset probability against patient index 1-10.]

Figure 3: Point and 95% confidence interval estimates for the predicted risks of 30 randomly selected patients. The vertical bars indicate the estimated 95% confidence intervals. The circle and triangle shapes correspond to the (initial) estimates and the debiased estimates, respectively. Solid points indicate the observed T2DM cases. The colors red and green indicate the two methods, SAS and SLASSO.

It is important to note that AUC is a measurement of prediction accuracy, so debiasing might lead to a worse AUC by accepting larger variability for reduced bias. The AUC from SLASSO is very poor, probably due to the over-fitting bias with the small sample size of the labeled set. With the information from the large unlabeled data, SAS produced a significantly higher AUC than SLASSO.

For illustration, we present in Figure 3 the individual risk predictions with 95% confidence intervals for three sets of 10 patients, with each set randomly selected from the low (< 5%), medium (5%-15%) or high risk (> 15%) subgroups. These risk groups are constructed for illustration purposes, and a patient with covariates xnew is classified to the low, medium or high risk group if expit(β̂ᵀxnew) belongs to the low, medium or high tertiles of {expit(β̂ᵀXi), i =

1, ..., N}. We observe that the confidence intervals for the patients' predicted risks from the debiased SLASSO inference are not very informative, with most error bars stretching from zero to one. The contrast between the SAS CIs and the SLASSO CIs demonstrates the improved efficiency resulting from leveraging information in the unlabeled data through the predictive surrogates.

References

Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61 (4), 962–973.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009, 08). Simultaneous analysis of lasso and dantzig selector. Ann. Statist. 37 (4), 1705–1732.

Bühlmann, P. and S. Van De Geer (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

Cai, T., T. Cai, and Z. Guo (2019, Apr). Individualized Treatment Selection: An Optimal Hypothesis Testing Approach In High-dimensional Models. arXiv e-prints, arXiv:1904.12891.

Cai, T. T. and Z. Guo (2018, Jun). Semi-supervised Inference for Explained Vari- ance in High-dimensional Linear Regression and Its Applications. arXiv e-prints, arXiv:1806.06179.

Cai, T. T., Z. Guo, et al. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of statistics 45 (2), 615–646.

Chakrabortty, A. and T. Cai (2018, 08). Efficient and adaptive linear regression in semi- supervised settings. Ann. Statist. 46 (4), 1541–1572.

Chakrabortty, A., J. , T. T. Cai, and H. (2019). High dimensional m-estimation with missing outcomes: A semi-parametric framework.

27 Cheng, D., A. Ananthakrishnan, and T. Cai (2018, Mar). Efficient and Robust Semi- Supervised Estimation of Average Treatment Effects in Electronic Medical Records Data. arXiv e-prints, arXiv:1804.00195.

Frazer, K. A., S. S. Murray, N. J. Schork, and E. J. Topol (2009). Human genetic variation and its contribution to complex traits. Nature Reviews Genetics 10 (4), 241–251.

Guo, Z., P. Rakshit, D. S. Herman, and J. (2020). Inference for the case probability in high-dimensional logistic regression. arXiv preprint arXiv:2012.07133 .

Huang, J. and C.-H. Zhang (2012, June). Estimation and selection via absolute penal- ized convex minimization and its multistage adaptive applications. J. Mach. Learn. Res. 13 (1), 1839–1864.

Javanmard, A. and A. Montanari (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15, 2869–2909.

Kallus, N. and X. Mao (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data.

Liao, K. P., J. , and 18 others (2019, 08). High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association 26 (11), 1255–1262.

Ma, R., T. T. Cai, and H. Li (2020). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Associ- ation 0 (0), 1–15.

Mahajan, A., D. Taliun, and 113 others. (2018, Nov). Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature Genetics 50 (11), 1505–1513.

Negahban, S., P. Ravikumar, M. J. Wainwright, and B. Yu (2010). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Technical Report 797, University of California Berkeley, Department of Statistics.

28 Portnoy, S. (1984, 12). Asymptotic behavior of m-estimators of p regression parameters when p2/n is large. i. consistency. Ann. Statist. 12 (4), 1298–1309.

Portnoy, S. (1985, 12). Asymptotic behavior of m estimators of p regression parameters when p2/n is large; ii. normal approximation. Ann. Statist. 13 (4), 1403–1417.

Smucler, E., A. Rotnitzky, and J. M. Robins (2019, Apr). A unifying approach for doubly-

robust `1 regularized estimation of causal contrasts. arXiv e-prints, arXiv:1904.03737.

Thompson, C. A., A. W. Kurian, and H. S. Luft (2015). Linking electronic health records to better understand breast cancer patient pathways within and between two health systems. eGEMs 3 (1).

Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer New York.

van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014, 06). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 (3), 1166–1202.

van de Geer, S. A. and P. Bühlmann (2009). On the conditions used to prove oracle results for the lasso. Electron. J. Statist. 3, 1360–1392.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Vujkovic, M., J. M. Keaton, and 48 others (2020). Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nature genetics 52 (7), 680–691.

Warren, J. L. and K. R. Yabroff (2015). Challenges and opportunities in measuring cancer recurrence in the united states. Journal of the National Cancer Institute 107 (8), djv134.

29 Zhang, A., L. D. Brown, and T. T. Cai (2019, 10). Semi-supervised inference: General theory and estimation of means. Ann. Statist. 47 (5), 2538–2566.

Zhang, C.-H. and S. S. Zhang (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1), 217–242.

Zhang, Y. and J. Bradic (2019, Feb). High-dimensional semi-supervised learning: in search for optimal inference of the mean. arXiv e-prints, arXiv:1902.00772.

Zhu, Y. and J. Bradic (2018a). Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association 113 (524), 1583–1600.

Zhu, Y. and J. Bradic (2018b). Significance testing in non-sparse high-dimensional linear models. Electron. J. Statist. 12 (2), 3312–3364.

Supplementary Material

We present the simulation Scenario II, in which the imputation model is correctly specified and exactly sparse, in Appendix A. The proofs of Theorems 1 and 3, Corollary 2, and Propositions 5 and 6 are given in Appendix B. The technical details are collected in Appendix C. Definitions and existing results are stated in Appendix D.

A Additional Simulation

Table A1: AUC table for simulations with 500 labels under Scenario II. The AUCs are evaluated on an independent testing set of size 100. We approximately measure the sparsity by S(v) = ‖v‖₁²/‖v‖₂².

Scenario                       Prediction Accuracy (AUC)
Surrogate   S(β0)   S(γ0)      Oracle   SLASSO   SAS
Strong      159     1.10       0.715    0.660    0.702
Moderate    128     1.06       0.715    0.665    0.704
Strong      26.4    1.09       0.710    0.691    0.708
Moderate    18.4    1.03       0.709    0.693    0.707

Scenario II: the imputation model is correctly specified and exactly sparse. In the second scenario, we first generate Si from

Si,1 = [ς{νZˢi,1 + αᵀ(Zˣi √(1 − p⁻¹) + Zᵘi/√p)} − µS]/σS

and Si,j = {ς(Zˢi,j) − µX}/σX for j = 2, ..., q, and then generate Yi from the sparse model

P(Yi = 1 | Wi) = expit(θSi,1).

We chose µS ≈ 0.66 and σS ≈ 1 such that Si,1 is roughly mean 0 and variance 1. Under this setting, the imputation model holds with sγ = 1. The factor ν and the coefficients α control the predictiveness of X for S1 and Y while θ controls the predictiveness of W for

31 Y . We consider two α of different sparsity patterns,

Sparse (sα = 3):  α = (0.3, 0.212, 0.212, 0ᵀ_{497×1})ᵀ,

Dense (sα = 500):  α = (0.211, 0.039ᵀ_{29×1}, 0.004ᵀ_{470×1})ᵀ,

where a_{k×1} = (a, ..., a)_{k×1} for any a. Similar to Scenario I, the sparsity of α regulates the approximate sparsity of β measured by (33) (see Table A1). We consider two sets of (ν, θ) to allow W to be either moderately or strongly predictive of Y:

Moderate: ν = 0.4, θ = 2; and Strong: ν = 0.6, θ = 3.7.

The layouts of Tables A2 and A3 differ from those of Tables 2 and 3 because of the different data generating mechanisms. The distribution of Yi | Xi is not affected by the distribution of Si in Scenario I, while this property does not hold in Scenario II.

Table A2: Comparison of the SAS estimation to the supervised LASSO (SLASSO) with Bias, Empirical standard error (ESE) and root mean-squared error (rMSE) of the linear predictions xnewᵀβ0 under Scenario II with 500 labels, moderate or large S(β0), and strong or moderate surrogates.

           Moderate Surrogates                            Strong Surrogates
           SLASSO               SAS                       SLASSO               SAS
Type     Bias    ESE    rMSE  Bias    ESE    rMSE       Bias    ESE    rMSE  Bias    ESE    rMSE
Risk prediction model approximately sparse
xᴸnew    0.505   0.378  0.631  0.163   0.283  0.327      0.349   0.278  0.446  0.085   0.222  0.238
xᴹnew   -0.140   0.331  0.359 -0.047   0.272  0.276     -0.113   0.282  0.304 -0.058   0.217  0.225
xᴴnew   -0.713   0.512  0.878 -0.262   0.313  0.408     -0.678   0.469  0.825 -0.210   0.241  0.320
xˢnew   -0.111   0.143  0.181 -0.058   0.081  0.100     -0.190   0.142  0.237 -0.037   0.063  0.072
xᴵnew   -0.437   0.098  0.448 -0.119   0.076  0.141     -0.155   0.098  0.183 -0.065   0.061  0.089
xᴰnew   -0.349   0.093  0.361 -0.138   0.078  0.158     -0.150   0.093  0.176 -0.112   0.063  0.129
Large S(β0)
xᴸnew    0.366   0.266  0.453  0.142   0.224  0.265      0.482   0.398  0.625  0.117   0.300  0.322
xᴹnew   -0.060   0.275  0.282 -0.035   0.213  0.216     -0.199   0.337  0.391 -0.082   0.299  0.310
xᴴnew   -0.656   0.475  0.810 -0.272   0.257  0.374     -0.749   0.503  0.903 -0.214   0.325  0.389
xˢnew   -0.236   0.139  0.274 -0.087   0.079  0.117     -0.054   0.138  0.148  0.003   0.063  0.063
xᴵnew   -0.173   0.096  0.197 -0.078   0.077  0.109     -0.409   0.097  0.420 -0.094   0.057  0.110
xᴰnew   -0.144   0.092  0.171 -0.101   0.080  0.129     -0.359   0.093  0.371 -0.154   0.060  0.166

Table A3: Bias, empirical standard error (ESE), average standard error (ASE), and empirical coverage of the 95% confidence intervals (CP) for the debiased supervised LASSO (SLASSO) and the debiased SAS estimator of the linear predictions $x_{\rm new}^T\beta_0$ under Scenario II with 500 labels, moderate or large $S(\beta_0)$, and strong or moderate surrogates.

                   Debiased SLASSO                 Debiased SAS
Type               Bias     ESE     ASE     CP     Bias     ESE     ASE     CP
Risk prediction model approximately sparse, moderate surrogates
x_new^L           -0.236    1.936   1.915   0.947   0.014   1.786   1.771   0.950
x_new^M           -0.044    2.031   1.997   0.944  -0.028   1.873   1.853   0.947
x_new^H            0.364    2.110   2.084   0.944  -0.045   1.943   1.924   0.947
x_new^S            0.133    0.156   0.127   0.784  -0.028   0.133   0.130   0.944
x_new^I            0.004    0.124   0.126   0.942  -0.014   0.102   0.100   0.936
x_new^D            0.149    0.121   0.139   0.848  -0.014   0.104   0.105   0.948
Risk prediction model approximately sparse, strong surrogates
x_new^L           -0.070    1.953   1.935   0.947   0.021   1.371   1.366   0.949
x_new^M           -0.031    2.019   1.986   0.946  -0.026   1.408   1.401   0.949
x_new^H            0.148    2.073   2.055   0.948  -0.010   1.458   1.444   0.949
x_new^S            0.029    0.153   0.127   0.894  -0.016   0.103   0.096   0.928
x_new^I            0.018    0.134   0.126   0.938  -0.004   0.081   0.079   0.944
x_new^D            0.134    0.128   0.141   0.842  -0.007   0.081   0.083   0.956
Large S(beta_0), moderate surrogates
x_new^L           -0.092    1.942   1.925   0.950   0.004   1.796   1.792   0.951
x_new^M           -0.034    1.995   1.969   0.947  -0.018   1.852   1.835   0.951
x_new^H            0.082    2.061   2.036   0.946  -0.027   1.912   1.890   0.948
x_new^S           -0.009    0.155   0.125   0.876  -0.027   0.131   0.125   0.922
x_new^I            0.000    0.126   0.125   0.952  -0.009   0.104   0.103   0.950
x_new^D            0.119    0.126   0.139   0.894  -0.012   0.108   0.108   0.940
Large S(beta_0), strong surrogates
x_new^L           -0.221    1.929   1.926   0.949   0.022   1.353   1.349   0.951
x_new^M            0.032    2.047   2.017   0.947  -0.003   1.427   1.414   0.950
x_new^H            0.442    2.137   2.104   0.940  -0.039   1.479   1.469   0.951
x_new^S            0.176    0.150   0.128   0.698  -0.018   0.094   0.099   0.946
x_new^I            0.030    0.128   0.129   0.936  -0.002   0.079   0.077   0.952
x_new^D            0.167    0.125   0.142   0.804  -0.008   0.082   0.080   0.954

B Proofs of Main Results

We first summarize the notation used in Section 3 for conditional expectations given different parts of the data.

Definition A1. The conditional expectation over samples with index in a set $S$, conditionally on a subset $D$ of the data, is denoted by

$$E_{i\in S}\{f(Y_i, X_i, S_i) \mid D\}, \quad S \subseteq \{1, \ldots, n+N\}, \quad D \subset \mathscr{L} \cup \mathscr{U}.$$

We denote the conditional expectation over the unlabeled data given the labeled data by $E_{i>n}\{f(W_i) \mid \mathscr{L}\}$, and the conditional probability for a new copy of the data given the current data by $P_{\rm new}\{f(W_i) \mid D\}$. With $\mathscr{L}$ and $\mathscr{U}$ partitioned into $K$ folds indexed respectively by $\{I_k, k = 1, \ldots, K\}$ and $\{J_k, k = 1, \ldots, K\}$, we denote the conditional expectations over the fold-$k$ labeled and unlabeled data given the out-of-fold data respectively by

$$E_{i\in I_k}\{f(Y_i, X_i, S_i) \mid D_k^{c}\} \quad \text{and} \quad E_{i\in J_k}\{f(W_i) \mid D_k^{c}\},$$

where $D_k^{c} = \{S_i, X_i : i \in J_k^{c}\} \cup \{Y_i, S_i, X_i : i \in I_k^{c}\}$.
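For concreteness, the following small helper (ours, not the paper's) builds the fold structure $\{(I_k, J_k)\}$ used throughout the cross-fitting arguments.

```python
import numpy as np

def make_folds(n, N_unlabeled, K=10, seed=0):
    """Randomly partition labeled indices {0,...,n-1} and unlabeled indices
    {n,...,n+N_unlabeled-1} into K folds (I_k, J_k) for cross-fitting."""
    rng = np.random.default_rng(seed)
    I = np.array_split(rng.permutation(n), K)
    J = np.array_split(rng.permutation(np.arange(n, n + N_unlabeled)), K)
    return list(zip(I, J))
```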

B1 Proof of Theorem 1

Our proof shares its general steps with the restricted strong convexity framework laid out in Negahban et al. (2010), while we carry out a more delicate analysis of the symmetrized Bregman divergence to establish the improved rate of estimation under the semi-supervised setting. To bound $\hat{\beta}$ through the symmetrized Bregman divergence $(\hat{\beta} - \beta_0)^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})$, instead of directly applying Hölder's bound, we first split it into two parts,

$$(\hat{\beta} - \beta_0)^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) = \underbrace{(\hat{\beta} - \beta_0)^T\big[\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} + E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big]}_{\text{variance from unlabeled data}} + \underbrace{(\hat{\beta} - \beta_0)^T E\big\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\big\}}_{\text{bias from } \hat{\gamma}} \quad (A.1)$$

and discuss which part dominates the estimation error. When the first, variance, term in (A.1) is dominant, the bias from $\hat{\gamma}$ becomes negligible, and we recover the usual error bound for the LASSO as if $\gamma_0$ were used. When the second, bias, term in (A.1) is dominant, the error bound for $\hat{\beta}$ can be controlled by the error bound for $\hat{\gamma}$. Combining the error bounds in the two cases, we obtain the oracle inequalities.

Lemma A1. On the event

$$\Omega = \Big\{\ell_{PL}(\beta_0 + \Delta) - \ell_{PL}(\beta_0) - \Delta^T\dot{\ell}_{PL}(\beta_0) \ge \kappa_{rsc,1}\|\Delta\|_2\big\{\|\Delta\|_2 - \kappa_{rsc,2}\sqrt{\log(p)/N}\,\|\Delta\|_1\big\},\ \forall \|\Delta\|_2 \le 1\Big\},$$

setting $\lambda_\beta \gtrsim \sqrt{\log(p)/N}$ such that

$$\lambda_\beta \ge 3\big\|\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} + E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big\|_\infty + \kappa_{rsc,1}\kappa_{rsc,2}\sqrt{\frac{\log(p)}{N}},$$

we have the oracle inequalities for the estimation error of $\hat{\beta}$,

$$\|\hat{\beta} - \beta_0\|_2 \le \max\big\{14\sqrt{s_\beta}\,\lambda_\beta/\kappa_{rsc,1},\ 7(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2/\kappa_{rsc,1}\big\},$$
$$\|\hat{\beta} - \beta_0\|_1 \le \max\big\{84\,s_\beta\lambda_\beta/\kappa_{rsc,1},\ 21(1-\rho)^2M^2\sigma_{\max}^4\|\gamma_0 - \hat{\gamma}\|_2^2/(\kappa_{rsc,1}\lambda_\beta)\big\}.$$

The constants $\kappa_{rsc,1}$, $\kappa_{rsc,2}$ are the restricted strong convexity parameters specified in Lemma A6.
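The estimator $\hat{\beta}$ analyzed here minimizes the imputed loss plus an $\ell_1$ penalty. As a concrete illustration, the sketch below solves this convex program by proximal gradient descent (ISTA) for the logistic link $g = \mathrm{expit}$; the fixed iteration count, the step-size rule, and the helper name `fit_sas_beta` are our assumptions rather than the paper's actual implementation.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fit_sas_beta(X_lab, Y_lab, X_unlab, imputed_unlab, lam, n_iter=500, step=None):
    """ISTA sketch for the l1-penalized imputed logistic loss.

    Unlabeled outcomes are replaced by the imputed probabilities g(gamma_hat^T W_i);
    labeled outcomes enter as observed.
    """
    X = np.vstack([X_lab, X_unlab])                    # N x p design
    pseudo_y = np.concatenate([Y_lab, imputed_unlab])  # responses / imputations
    N, p = X.shape
    if step is None:
        # 1/L with L an upper bound on the spectral norm of the logistic Hessian
        step = 4.0 * N / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (expit(X @ beta) - pseudo_y) / N  # gradient of the imputed loss
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```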

We next prove the oracle inequalities. First, we note that by the definition of $\hat{\beta}$,

$$\ell^{\dagger}(\hat{\beta}; \hat{\gamma}) + \lambda_\beta\|\hat{\beta}\|_1 \le \ell^{\dagger}(\beta_0; \hat{\gamma}) + \lambda_\beta\|\beta_0\|_1. \quad (A.2)$$

Denote the standardized estimation error as $\delta = (\hat{\beta} - \beta_0)/\|\hat{\beta} - \beta_0\|_2$. Due to the convexity of the loss function, we have for $t = \|\hat{\beta} - \beta_0\|_2 \wedge 1$

$$\ell^{\dagger}(\beta_0 + t\delta; \hat{\gamma}) + \lambda_\beta\|\beta_0 + t\delta\|_1 \le \ell^{\dagger}(\beta_0; \hat{\gamma}) + \lambda_\beta\|\beta_0\|_1. \quad (A.3)$$

By the triangle inequality $\|\beta_0\|_1 - \|\beta_0 + t\delta\|_1 \le t\|\delta\|_1$, we have from (A.3)

$$\ell^{\dagger}(\beta_0 + t\delta; \hat{\gamma}) - \ell^{\dagger}(\beta_0; \hat{\gamma}) \le t\lambda_\beta\|\delta\|_1. \quad (A.4)$$

To apply the restricted strong convexity of the complete-data loss (11) established in Lemma A6, we show that the second-order approximation error of the imputed loss is identical to that of the complete-data loss,

$$\ell^{\dagger}(\beta_0 + t\delta; \hat{\gamma}) - \ell^{\dagger}(\beta_0; \hat{\gamma}) - t\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) = \ell_{PL}(\beta_0 + t\delta) - \ell_{PL}(\beta_0) - t\delta^T\dot{\ell}_{PL}(\beta_0).$$

Then, by applying the restricted strong convexity event $\Omega$, we obtain

$$\ell^{\dagger}(\beta_0 + t\delta; \hat{\gamma}) - \ell^{\dagger}(\beta_0; \hat{\gamma}) - t\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \ge t^2\kappa_{rsc,1} - t\kappa_{rsc,1}\kappa_{rsc,2}\sqrt{\log(p)/N}\,\|\delta\|_1. \quad (A.5)$$

Applying (A.5) to (A.4), we have with large probability

$$t\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) + t^2\kappa_{rsc,1} - t\kappa_{rsc,1}\kappa_{rsc,2}\sqrt{\log(p)/N}\,\|\delta\|_1 \le t\lambda_\beta\|\delta\|_1,$$

where $\|\delta\|_2 = 1$ by definition. Thus, we arrive at

$$t\kappa_{rsc,1} \le \lambda_\beta\|\delta\|_1 - \delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) + \kappa_{rsc,1}\kappa_{rsc,2}\sqrt{\log(p)/N}\,\|\delta\|_1. \quad (A.6)$$

Next, we analyze $\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})$ by the decomposition

$$\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) = \delta^T\big[\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} + E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big] + \delta^T E\big\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\big\}$$
$$\le \|\delta\|_1\big\|\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} + E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big\|_\infty + (1-\rho)\big\|E_{i>n}[X_i\{g(\beta_0^TX_i) - g(\hat{\gamma}^TW_i)\} \mid \mathscr{L}]\big\|_2. \quad (A.7)$$

To establish the rate for the $L_2$-norm of $E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\}$, we note that

$$E_{i>n}\big[X_i\{g(\beta_0^TX_i) - g(\hat{\gamma}^TW_i)\} \mid \mathscr{L}\big] = E_{i>n}\big[X_i\{Y_i - g(\hat{\gamma}^TW_i)\} \mid \mathscr{L}\big]. \quad (A.8)$$

By the characterization of $\gamma_0$ in (6), we may rewrite (A.8) as

$$E_{i>n}\big[X_i\{Y_i - g(\hat{\gamma}^TW_i)\} \mid \mathscr{L}\big] = E_{i>n}\big[g'(\gamma_u^TW_i)X_iW_i^T \mid \mathscr{L}\big](\gamma_0 - \hat{\gamma}), \quad (A.9)$$

where $\gamma_u = u\hat{\gamma} + (1-u)\gamma_0$ for some $u \in [0, 1]$. Under Assumptions 1b and 2a, as well as the fact that $X_i$ is a sub-vector of $W_i$, we have

$$\big\|E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\}\big\|_2 \le \big\|E_{i\in\mathscr{U}}\{g'(\gamma_u^TW_i)W_i^{\otimes 2} \mid \mathscr{L}\}\big\|_2\|\gamma_0 - \hat{\gamma}\|_2 \le M\big\|E(W_i^{\otimes 2})\big\|_2\|\gamma_0 - \hat{\gamma}\|_2 \le M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2, \quad (A.10)$$

where for any vector $x$, $x^{\otimes 2} = xx^T$. By the bound (A.10) and the definition of $\lambda_\beta$, we have from (A.6)

$$t\kappa_{rsc,1} \le 2\lambda_\beta\|\delta\|_1 + (1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2. \quad (A.11)$$

Hence, we can obtain an immediate bound for the estimation error from (A.11) without considering the sparsity of $\beta_0$. We shall proceed to derive a sharper bound that involves the sparsity of $\beta_0$. We separately analyze two cases.

Case 1: $(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2 \ge \|\delta\|_1\lambda_\beta/3$.

In this case, the estimation error is dominated by $\hat{\gamma} - \gamma_0$. We simply have from (A.11)

$$t\kappa_{rsc,1} \le 7(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2, \qquad t\kappa_{rsc,1}\|\delta\|_1\lambda_\beta/3 \le 7(1-\rho)^2M^2\sigma_{\max}^4\|\gamma_0 - \hat{\gamma}\|_2^2.$$

Thus, we have

$$\|\hat{\beta} - \beta_0\|_2 \le 7(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2/\kappa_{rsc,1}, \qquad \|\hat{\beta} - \beta_0\|_1 \le 21(1-\rho)^2M^2\sigma_{\max}^4\|\gamma_0 - \hat{\gamma}\|_2^2/(\kappa_{rsc,1}\lambda_\beta). \quad (A.12)$$

If Case 1 does not hold, then instead

Case 2: $(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2 \le \|\delta\|_1\lambda_\beta/3. \quad (A.13)$

In this case, the estimation error is comparable to that obtained when the true $\gamma_0$ is used for the imputation. Thus, the sparsity of $\beta_0$ may affect the estimation error. Following the typical approach to establishing the cone condition for $\delta$, we analyze the symmetrized Bregman divergence,

$$(\hat{\beta} - \beta_0)^T\big\{\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big\} = \|\hat{\beta} - \beta_0\|_2\,\delta^T\big\{\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big\}. \quad (A.14)$$

Due to the convexity of the loss $\ell^{\dagger}(\cdot; \hat{\gamma})$ under Assumption 1b, the symmetrized Bregman divergence (A.14) is nonnegative through a mean-value theorem,

$$(\hat{\beta} - \beta_0)^T\big\{\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big\} \ge \inf_{\beta \in \mathbb{R}^{p+1}}\frac{1}{N}\sum_{i>n} g'(\beta^TX_i)\{(\hat{\beta} - \beta_0)^TX_i\}^2 \ge 0.$$

Denote the index set of the nonzero coefficients in $\beta_0$ as $O_\beta = \{j: \beta_{0,j} \neq 0\}$. Denote by $\delta_{O_\beta}$ and $\delta_{O_\beta^c}$ the sub-vectors of $\delta$ at positions in $O_\beta$ and at positions not in $O_\beta$, respectively. The solution $\hat{\beta}$ satisfies the KKT condition

$$\|\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})\|_\infty \le \lambda_\beta, \qquad \dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})_j = -\lambda_\beta\,{\rm sign}(\hat{\beta}_j), \quad j: \hat{\beta}_j \neq 0.$$

From the KKT condition and the definitions of $\delta$ and $O_\beta$, we have

$$\delta_j\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})_j \le |\delta_j|\lambda_\beta, \ j \in O_\beta; \qquad \delta_j\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})_j = \frac{-\hat{\beta}_j\lambda_\beta\,{\rm sign}(\hat{\beta}_j)}{\|\hat{\beta} - \beta_0\|_2} = -\lambda_\beta|\delta_j|, \ j \in O_\beta^c. \quad (A.15)$$

Applying (A.15) to (A.14), we have the upper bound,

$$\delta^T\big\{\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma}) - \dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big\} = \sum_{j\in O_\beta}\delta_j\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})_j + \sum_{j\in O_\beta^c}\delta_j\dot{\ell}^{\dagger}(\hat{\beta}; \hat{\gamma})_j - \delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})$$
$$\le \lambda_\beta\sum_{j\in O_\beta}|\delta_j| - \lambda_\beta\sum_{j\in O_\beta^c}|\delta_j| + \big|\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big| \le \lambda_\beta\|\delta_{O_\beta}\|_1 - \lambda_\beta\|\delta_{O_\beta^c}\|_1 + \big|\delta^T\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})\big|.$$

Then, we apply (A.10), the definition of $\lambda_\beta$ and (A.13),

$$0 \le \lambda_\beta\|\delta_{O_\beta}\|_1 - \lambda_\beta\|\delta_{O_\beta^c}\|_1 + \tfrac{2}{3}\lambda_\beta\|\delta\|_1, \quad \text{and hence} \quad \lambda_\beta\|\delta_{O_\beta^c}\|_1 \le 5\lambda_\beta\|\delta_{O_\beta}\|_1.$$

Therefore, we can bound the $L_1$ norm of $\delta$ by the cone property,

$$\|\delta\|_1 \le 6\|\delta_{O_\beta}\|_1 \le 6\sqrt{s_\beta}\,\|\delta\|_2 = 6\sqrt{s_\beta}. \quad (A.16)$$

We then apply the cone condition (A.16) and the case condition (A.13) to the bound (A.11),

$$t\kappa_{rsc,1} \le 14\sqrt{s_\beta}\,\lambda_\beta, \quad \text{and} \quad t\kappa_{rsc,1}\|\delta\|_1 \le 84\,s_\beta\lambda_\beta.$$

Thus, we obtain the rate for the estimation error

$$\|\hat{\beta} - \beta_0\|_2 \le 14\sqrt{s_\beta}\,\lambda_\beta/\kappa_{rsc,1}, \quad \text{and} \quad \|\hat{\beta} - \beta_0\|_1 \le 84\,s_\beta\lambda_\beta/\kappa_{rsc,1}. \quad (A.17)$$

Since Case 1 and Case 2 are complementary, one of them must occur. Thus, the estimation error is controlled by the larger of the two bounds,

$$\|\hat{\beta} - \beta_0\|_2 \le \max\big\{14\sqrt{s_\beta}\,\lambda_\beta/\kappa_{rsc,1},\ 7(1-\rho)M\sigma_{\max}^2\|\gamma_0 - \hat{\gamma}\|_2/\kappa_{rsc,1}\big\},$$
$$\|\hat{\beta} - \beta_0\|_1 \le \max\big\{84\,s_\beta\lambda_\beta/\kappa_{rsc,1},\ 21(1-\rho)^2M^2\sigma_{\max}^4\|\gamma_0 - \hat{\gamma}\|_2^2/(\kappa_{rsc,1}\lambda_\beta)\big\},$$

which is our oracle inequality in Lemma A1.

Consistency. We next show that the oracle inequality leads to consistency under the dimension condition (26). To show

$$\big\|\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} + E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big\|_\infty \le \big\|\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\}\big\|_\infty + \big\|E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big\|_\infty = O_p\big(\sqrt{\log(p)/N}\big),$$

we express the terms of interest as the following empirical processes,

$$\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\} = \frac{1}{N}\sum_{i>n}\Big[X_i\{g(\beta_0^TX_i) - Y_i + Y_i - g(\hat{\gamma}^TW_i)\} - E_{i>n}\big[X_i\{g(\beta_0^TX_i) - Y_i + Y_i - g(\hat{\gamma}^TW_i)\} \mid \mathscr{L}\big]\Big],$$
$$E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\} = \frac{1}{N}\sum_{i=1}^n X_i\{g(\beta_0^TX_i) - Y_i\}.$$

Under Assumptions 1a and 2a, $X_i$ and $g(\beta_0^TX_i) - Y_i$ are sub-Gaussian. According to Lemma A8, the event $\|\hat{\gamma} - \gamma_0\|_2 \le 1$ occurs with large probability, on which we have a bound for the sub-Gaussian norm of $Y_i - g(\hat{\gamma}^TW_i)$ by Lemma A3,

$$\|Y_i - g(\hat{\gamma}^TW_i)\|_{\psi_2} \le \max\{2\nu_1, \sqrt{2}M\sigma_{\max}\}, \quad i > n. \quad (A.18)$$

Thus, we obtain from (A.18) that $Y_i - g(\hat{\gamma}^TW_i)$ is sub-Gaussian with large probability. By the properties of sub-Gaussian random variables in Lemmas A4-d and A4-f, the elements of the summands of $\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma})$ are all sub-exponential random variables conditionally on the labelled data. We apply Bernstein's inequality (Lemma A4-h) conditionally on the labelled data to obtain

$$\big\|\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) - E\{\dot{\ell}^{\dagger}(\beta_0; \hat{\gamma}) \mid \mathscr{L}\}\big\|_\infty = O_p\big(\sqrt{(1-\rho)\log(p)/N}\big), \qquad \big\|E\{\dot{\ell}^{\dagger}(\beta_0; \gamma_0) \mid \mathscr{L}\}\big\|_\infty = O_p\big(\sqrt{\rho\log(p)/N}\big).$$

This establishes the order for $\lambda_\beta$,

$$\lambda_\beta \gtrsim \sqrt{(1-\rho)\log(p)/N} + \sqrt{\rho\log(p)/N} \asymp \sqrt{\log(p)/N}. \quad (A.19)$$

By Lemma A6, which follows from Negahban et al. (2010), the probability of the restricted strong convexity event converges to one,

$$P(\Omega) \ge 1 - \kappa_{rsc,3}e^{-\kappa_{rsc,4}N} \to 1.$$

Setting $\lambda_\beta \asymp \sqrt{\log(p)/N}$ for optimal $L_2$ estimation, we achieve the stated conclusion

$$\|\hat{\beta} - \beta_0\|_2 = O_p\big(\sqrt{s_\beta\log(p)/N} + (1-\rho)\sqrt{s_\gamma\log(p+q)/n}\big),$$
$$\sqrt{\log(p)/N}\,\|\hat{\beta} - \beta_0\|_1 = O_p\Big(s_\beta\log(p)/N + (1-\rho)^2\frac{s_\gamma\log(p+q)}{n}\Big),$$

by applying the rates from Lemma A8 and (A.19). For optimal $L_1$ estimation, we set a larger penalty $\lambda_\beta' \asymp \sqrt{\log(p)/N} \vee \sqrt{s_\gamma\log(p+q)/(s_\beta n)} \gtrsim \lambda_\beta$ to achieve

$$\|\hat{\beta} - \beta_0\|_1 = O_p\bigg(s_\beta\sqrt{\log(p)/N} + (1-\rho)^2\sqrt{\frac{s_\gamma s_\beta\log(p+q)}{n}}\bigg).$$

B2 Proof of Corollary 2

Under Assumption 1b, we have

$$|g(\hat{\beta}^Tx_{\rm new}) - g(\beta_0^Tx_{\rm new})| \le M|(\hat{\beta} - \beta_0)^Tx_{\rm new}|. \quad (A.20)$$

Since $x_{\rm new}$ satisfies Assumption 2a, we have

$$\|(\hat{\beta} - \beta_0)^Tx_{\rm new}\|_{\psi_2} \le \|\hat{\beta} - \beta_0\|_2\sigma_{\max}/\sqrt{2}.$$

The tail probability is controlled by the sub-Gaussian norm via Lemma A4-a,

$$P_{\rm new}\big(|(\hat{\beta} - \beta_0)^Tx_{\rm new}| \ge t \mid \mathscr{D}\big) \le 2\exp\big[-2t^2/\{\|\hat{\beta} - \beta_0\|_2\sigma_{\max}\}^2\big]. \quad (A.21)$$

Combining (A.20) and (A.21), we obtain

$$P_{\rm new}\big(|g(\hat{\beta}^Tx_{\rm new}) - g(\beta_0^Tx_{\rm new})| \ge tM\|\hat{\beta} - \beta_0\|_2\sigma_{\max}/\sqrt{2} \mid \mathscr{D}\big) \le 2e^{-t^2}.$$

Thus,

$$g(\hat{\beta}^Tx_{\rm new}) - g(\beta_0^Tx_{\rm new}) = O_p\big(\|\hat{\beta} - \beta_0\|_2\big).$$

B3 Proof of Theorem 3

Our proof is organized in five parts. In Part 1, we establish the consistency of the cross-fitting estimator of the precision matrix, namely $\|\hat{u}^{(k)} - u_0\|_2 = o_p(1)$, with $\hat{u}^{(k)}$ and $u_0$ defined in (20) and (19), respectively. In Part 2, we show that the debiased estimator can be approximated by the empirical process

$$\sqrt{n}\big(\widehat{x_{\rm std}^T\beta} - x_{\rm std}^T\beta_0\big) = -(1-\rho)\sqrt{n}\,u_0^T\dot{\ell}_{\rm imp}(\gamma_0) - \sqrt{n}\,u_0^T\dot{\ell}^{\dagger}(\beta_0; \gamma_0) + o_p(1)$$
$$= -n^{-1/2}\bigg[\sum_{i=1}^n u_0^TX_i\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\} + \rho\sum_{i>n} u_0^TX_i\{g(\beta_0^TX_i) - g(\gamma_0^TW_i)\}\bigg] + o_p(1).$$

As long as the asymptotic variance $V_{SAS}$ defined in (29) is bounded and bounded away from zero, the asymptotic normality of the leading term follows from the Central Limit Theorem,

$$-n^{-1/2}V_{SAS}^{-1/2}\bigg[\sum_{i=1}^n u_0^TX_i\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\} + \rho\sum_{i>n} u_0^TX_i\{g(\beta_0^TX_i) - g(\gamma_0^TW_i)\}\bigg] \rightsquigarrow N(0, 1).$$

In Part 3, we deal with the asymptotic variance $V_{SAS}$ and the consistency of the variance estimator $\hat{V}_{SAS}$ defined in (23). In Part 4, we reach the conclusion of the theorem based on

$$(1-\rho)\|\hat{\gamma}^{(k)} - \gamma_0\|_2 + \|\hat{u}^{(k)} - u_0\|_2 + \sqrt{n}\,\|\hat{\beta}^{(k)} - \beta_0\|_2\big(\|\hat{\beta}^{(k)} - \beta_0\|_2 + \|\hat{u}^{(k)} - u_0\|_2\big) = o_p(1), \quad (A.22)$$

for all $1 \le k \le K$. Following Part 4, we show in Part 5 that (28) implies (A.22).

Part 1: Consistency of estimated precision matrix

The definitions of $u_0$ and $\hat{u}^{(k)}$ are given in (19) and (20). In this part, we show

$$\|\hat{u}^{(k)} - u_0\|_2 = O_p\big(\sqrt{(s_u + s_\beta)\log(p)/(N - N_k)} + (1-\rho)\sqrt{s_\gamma\log(p+q)/(n - n_k)}\big) = O_p\big(\sqrt{(s_u + s_\beta)\log(p)/N} + (1-\rho)\sqrt{s_\gamma\log(p+q)/n}\big).$$

Since the number of folds $K \le 10$ is finite, the estimation rate applies to $\hat{u}^{(k)}$ for all $k = 1, \ldots, K$.

We denote the components of the quadratic loss function in (20) and their derivatives as

$$m^{(k,k')}(u; \beta) = \frac{1}{N_{k'}}\sum_{i\in I_{k'}\cup J_{k'}}\frac{1}{2}g'(\beta^TX_i)(X_i^Tu)^2 - u^Tx_{\rm std},$$

$$\dot{m}^{(k,k')}(u; \beta) = \frac{\partial}{\partial u}m^{(k,k')}(u; \beta), \qquad \ddot{m}^{(k,k')}(\beta) = \frac{\partial}{\partial u}\dot{m}^{(k,k')}(u; \beta), \quad (A.23)$$

for $k' \in \{1, \ldots, K\}\setminus\{k\}$. We may express (20) as

$$\hat{u}^{(k)} = \mathop{\rm argmin}_{u\in\mathbb{R}^p}\ \sum_{k'\neq k}\frac{N_{k'}}{N - N_k}m^{(k,k')}\big(u; \hat{\beta}^{(k,k')}\big) + \lambda_u\|u\|_1.$$

Similar to the proof of Theorem 1, we establish the estimation rate for $\hat{u}^{(k)}$ through an oracle inequality.

Lemma A2. Under Assumptions 1 and 2, on the event

$$\Omega^{(k)} = \bigcap_{k'\neq k}\Big\{\Delta^T\ddot{m}^{(k,k')}\big(\hat{\beta}^{(k,k')}\big)\Delta \ge \kappa^*_{rsc,1}\|\Delta\|_2\big\{\|\Delta\|_2 - \kappa^*_{rsc,2}\sqrt{\log(p)/N_{k'}}\,\|\Delta\|_1\big\},\ \forall\|\Delta\|_2 \le 1\Big\},$$

setting $\lambda_u \asymp \sqrt{\log(p)/N}$ such that

$$\lambda_u \ge 3\bigg\|\sum_{k'\neq k}\frac{N_{k'}}{N - N_k}\Big[\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) - E\big\{\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) \mid D_{k'}^c\big\}\Big]\bigg\|_\infty + \kappa^*_{rsc,1}\kappa^*_{rsc,2}\sqrt{\log(p)/N_{k'}},$$

we have the oracle inequalities for the estimation error of $\hat{u}^{(k)}$,

$$\|\hat{u}^{(k)} - u_0\|_2 \le \max\Big\{14\sqrt{s_u}\,\lambda_u/\kappa^*_{rsc,1},\ 7M\sigma_{\max}^3\|u_0\|_2\sup_{k'\neq k}\big\|\beta_0 - \hat{\beta}^{(k,k')}\big\|_2/\kappa^*_{rsc,1}\Big\},$$
$$\|\hat{u}^{(k)} - u_0\|_1 \le \max\Big\{84\,s_u\lambda_u/\kappa^*_{rsc,1},\ 21M^2\sigma_{\max}^6\|u_0\|_2^2\sup_{k'\neq k}\big\|\beta_0 - \hat{\beta}^{(k,k')}\big\|_2^2/(\kappa^*_{rsc,1}\lambda_u)\Big\}.$$

The constants $\kappa^*_{rsc,1}$, $\kappa^*_{rsc,2}$ are the restricted strong convexity parameters specified in Lemma A7.
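For intuition, the quadratic program defining $\hat{u}^{(k)}$ can again be solved by proximal gradient descent. The sketch below is a simplified, non-cross-fitted version using a single estimated Hessian, and the helper name `fit_projection_u` is our own; it only illustrates the $\ell_1$-penalized quadratic loss $\tfrac{1}{2}u^T\hat{A}u - u^Tx_{\rm std} + \lambda\|u\|_1$.

```python
import numpy as np

def fit_projection_u(X, beta_hat, x_std, lam, n_iter=500):
    """ISTA sketch for the l1-penalized quadratic loss defining u-hat.

    A_hat is the weighted Gram matrix (1/N) sum_i g'(beta_hat^T X_i) X_i X_i^T,
    with g'(t) = expit(t)(1 - expit(t)) for the logistic link.
    """
    N, p = X.shape
    prob = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    w = prob * (1.0 - prob)                       # g'(beta_hat^T X_i)
    A_hat = (X * w[:, None]).T @ X / N
    step = 1.0 / np.linalg.norm(A_hat, 2)         # 1 / Lipschitz constant
    u = np.zeros(p)
    for _ in range(n_iter):
        grad = A_hat @ u - x_std                  # gradient of the quadratic loss
        z = u - step * grad
        u = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return u
```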

The proof of Lemma A2 repeats the proof of the oracle inequality for Theorem 1, so we defer the details to Section C.

To use Lemma A2 for the estimation rate of $\hat{u}^{(k)}$, we only need to verify two conditions: first, that the event $\Omega^{(k)}$ occurs with probability tending to one, and second, that the oracle choice of $\lambda_u$ is of order $\sqrt{\log(p)/N}$. Repeating Theorem 1 for each $\hat{\beta}^{(k,k')}$, we have under (28)

$$\big\|\hat{\beta}^{(k,k')} - \beta_0\big\|_2 = o_p(1).$$

Then by Lemma A7, the sets whose intersection forms $\Omega^{(k)}$ each occur with probability tending to one. Since the number of folds $K \le 10$ is finite, we can take a union bound to obtain that $\Omega^{(k)}$ occurs with probability tending to one. We may write

$$\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) - E\big\{\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) \mid D_{k'}^c\big\} = \frac{1}{N_{k'}}\sum_{i\in I_{k'}\cup J_{k'}}\Big[g'\big(\hat{\beta}^{(k,k')T}X_i\big)X_iX_i^Tu_0 - E_{i\in I_{k'}\cup J_{k'}}\big\{g'\big(\hat{\beta}^{(k,k')T}X_i\big)X_iX_i^Tu_0 \mid D_{k'}^c\big\}\Big]. \quad (A.24)$$

Each element of (A.24) is an empirical process. Under Assumptions 1b and 2a, each summand is a sub-exponential random variable by Lemmas A4-e and A4-f,

$$\big\|g'\big(\hat{\beta}^{(k,k')T}X_i\big)X_{i,j}X_i^Tu_0\big\|_{\psi_1} \le M\|X_{i,j}\|_{\psi_2}\|X_i^Tu_0\|_{\psi_2} \le M\sigma_{\max}^2\|u_0\|_2/2.$$

Hence, we can apply Bernstein's inequality to show that

$$\big\|\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) - E\big\{\dot{m}^{(k,k')}\big(u_0; \hat{\beta}^{(k,k')}\big) \mid D_{k'}^c\big\}\big\|_\infty = O_p\big(\sqrt{\log(p)/N_{k'}}\big).$$

Using the fact that $N_{k'} \asymp N$, we obtain that the oracle $\lambda_u$ is of order $O_p\big(\sqrt{\log(p)/N}\big)$. Therefore, we can apply Lemma A2 to obtain

$$\|\hat{u}^{(k)} - u_0\|_2 = O_p\Big(\sqrt{s_u\log(p)/N} + \sup_{k'\neq k}\big\|\hat{\beta}^{(k,k')} - \beta_0\big\|_2\Big) = O_p\big(\sqrt{(s_\beta + s_u)\log(p)/N} + (1-\rho)\sqrt{s_\gamma\log(p+q)/n}\big).$$

Part 2: Asymptotic approximation

Under Assumption 2b-i, we also have the tightness of $\|\hat{u}^{(k)}\|_2$ from the bound on $\|u_0\|_2$,

$$\|u_0\|_2 \le \|\Sigma_0^{-1}\|_2\|x_{\rm std}\|_2 \le \sigma_{\min}^{-2}, \qquad \|\hat{u}^{(k)}\|_2 \le \|u_0\|_2 + \|\hat{u}^{(k)} - u_0\|_2 = O_p(1). \quad (A.25)$$

Define the scores of the in-fold data as

$$\dot{\ell}^{\dagger(k)}(\beta; \gamma) = \frac{1}{N_k}\bigg[\sum_{i\in J_k}X_i\{g(\beta^TX_i) - g(\gamma^TW_i)\} + \sum_{i\in I_k}X_i\{g(\beta^TX_i) - Y_i\}\bigg],$$
$$\dot{\ell}_{\rm imp}^{(k)}(\gamma) = \frac{1}{n_k}\sum_{i\in I_k}X_i\{g(\gamma^TW_i) - Y_i\}. \quad (A.26)$$

Since $\widehat{x_{\rm std}^T\beta}$ is the average over $K$ (at most 10) cross-fitted estimators, it suffices to study one of the cross-fitted estimators,

$$\widehat{x_{\rm std}^T\beta}^{(k)} = x_{\rm std}^T\hat{\beta}^{(k)} - \hat{u}^{(k)T}\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) - (1-\rho)\hat{u}^{(k)T}\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big), \qquad \widehat{x_{\rm std}^T\beta} = \frac{1}{K}\sum_{k=1}^K\widehat{x_{\rm std}^T\beta}^{(k)}. \quad (A.27)$$

We denote the expected Hessian matrices of the losses in (A.26) as

$$H(\beta) = E\{g'(\beta^TX_i)X_iX_i^T\}, \quad \Sigma_0 = H(\beta_0),$$
$$H_{\rm imp}(\gamma) = E\{g'(\gamma^TW_i)X_iW_i^T\}, \quad \Sigma_{\rm imp} = H_{\rm imp}(\gamma_0). \quad (A.28)$$
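To connect (A.27) with computation, the sketch below evaluates the one-step debiased estimator for a single fold, given fold-specific estimates `beta_hat`, `gamma_hat` and the projection direction `u_hat` from the earlier sketches; the function name and the simple (non-cross-fitted) data split are illustrative assumptions.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def debiased_linear_prediction(x_std, beta_hat, gamma_hat, u_hat,
                               X_lab, W_lab, Y_lab, X_unlab, W_unlab, rho):
    """One-step debiased estimate of x_std^T beta_0, following the form of (A.27)."""
    N = len(Y_lab) + X_unlab.shape[0]
    n = len(Y_lab)
    # imputed score: unlabeled residuals against the imputation model plus labeled residuals
    score_dagger = (X_unlab.T @ (expit(X_unlab @ beta_hat) - expit(W_unlab @ gamma_hat))
                    + X_lab.T @ (expit(X_lab @ beta_hat) - Y_lab)) / N
    # labeled-only score of the imputation working model
    score_imp = X_lab.T @ (expit(W_lab @ gamma_hat) - Y_lab) / n
    return x_std @ beta_hat - u_hat @ score_dagger - (1 - rho) * u_hat @ score_imp
```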

Our analysis of the approximation error is based on the first-order mean value theorem identity,

$$E\big\{\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) \mid D_k^c\big\} + (1-\rho)E\big\{\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big) \mid D_k^c\big\}$$
$$= \underbrace{E\big\{\dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0)\big\}}_{=0} + H(\tilde{\beta})\big\{\hat{\beta}^{(k)} - \beta_0\big\} - (1-\rho)H_{\rm imp}(\tilde{\gamma})\big\{\hat{\gamma}^{(k)} - \gamma_0\big\} + (1-\rho)\underbrace{E\big\{\dot{\ell}_{\rm imp}^{(k)}(\gamma_0)\big\}}_{=0} + (1-\rho)H_{\rm imp}(\tilde{\gamma})\big\{\hat{\gamma}^{(k)} - \gamma_0\big\}$$
$$= H(\tilde{\beta})\big\{\hat{\beta}^{(k)} - \beta_0\big\} \quad (A.29)$$

for some $\tilde{\beta}$ on the path from $\hat{\beta}^{(k)}$ to $\beta_0$ and some $\tilde{\gamma}$ on the path from $\hat{\gamma}^{(k)}$ to $\gamma_0$. The conditional expectation notation is declared in Definition A1. Based on (A.29), we analyze the approximation error for $\sqrt{n}\big(\widehat{x_{\rm std}^T\beta}^{(k)} - x_{\rm std}^T\beta_0\big)$ through the following decomposition,

$$\sqrt{n}\,\widehat{x_{\rm std}^T\beta}^{(k)} - \sqrt{n}\,x_{\rm std}^T\beta_0 + \sqrt{n}\,u_0^T\dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0) + \sqrt{n}(1-\rho)u_0^T\dot{\ell}_{\rm imp}^{(k)}(\gamma_0)$$
$$= \sqrt{n}\,x_{\rm std}^T\big(\hat{\beta}^{(k)} - \beta_0\big) - \sqrt{n}\,\hat{u}^{(k)T}\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) - \sqrt{n}(1-\rho)\hat{u}^{(k)T}\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big) + \sqrt{n}\,u_0^T\dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0) + \sqrt{n}(1-\rho)u_0^T\dot{\ell}_{\rm imp}^{(k)}(\gamma_0)$$
$$\quad + \sqrt{n}\,\hat{u}^{(k)T}E\big\{\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) \mid D_k^c\big\} + \sqrt{n}(1-\rho)\hat{u}^{(k)T}E\big\{\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big) \mid D_k^c\big\} - \sqrt{n}\,\hat{u}^{(k)T}H(\tilde{\beta})\big\{\hat{\beta}^{(k)} - \beta_0\big\}$$
$$= \underbrace{\sqrt{n}\,\big\{x_{\rm std} - H(\tilde{\beta})u_0\big\}^T\big(\hat{\beta}^{(k)} - \beta_0\big)}_{T_1} + \underbrace{\sqrt{n}\,\big(u_0 - \hat{u}^{(k)}\big)^TH(\tilde{\beta})\big(\hat{\beta}^{(k)} - \beta_0\big)}_{T_2}$$
$$\quad + \underbrace{\sqrt{n}\,\hat{u}^{(k)T}\Big[E\big\{\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) \mid D_k^c\big\} - \big\{\dot{\ell}^{\dagger(k)}\big(\hat{\beta}^{(k)}; \hat{\gamma}^{(k)}\big) - \dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0)\big\}\Big]}_{T_3}$$
$$\quad + \underbrace{\sqrt{n}(1-\rho)\hat{u}^{(k)T}\Big[E\big\{\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big) \mid D_k^c\big\} - \big\{\dot{\ell}_{\rm imp}^{(k)}\big(\hat{\gamma}^{(k)}\big) - \dot{\ell}_{\rm imp}^{(k)}(\gamma_0)\big\}\Big]}_{T_4} + \underbrace{\sqrt{n}\,\big(u_0 - \hat{u}^{(k)}\big)^T\big\{\dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0) + (1-\rho)\dot{\ell}_{\rm imp}^{(k)}(\gamma_0)\big\}}_{T_5}. \quad (A.30)$$

Here we state the rates for $T_1$–$T_5$,

$$T_1 = O_p\big(\sqrt{n}\,\|\hat{\beta}^{(k)} - \beta_0\|_2^2\big), \qquad T_2 = O_p\big(\sqrt{n}\,\|\hat{u}^{(k)} - u_0\|_2\|\hat{\beta}^{(k)} - \beta_0\|_2\big),$$
$$T_3 = O_p\big(\sqrt{\rho}\,\|\hat{\beta}^{(k)} - \beta_0\|_2 + \sqrt{\rho(1-\rho)}\,\|\hat{\gamma}^{(k)} - \gamma_0\|_2\big),$$
$$T_4 = O_p\big((1-\rho)\|\hat{\gamma}^{(k)} - \gamma_0\|_2\big), \qquad T_5 = O_p\big(\|\hat{u}^{(k)} - u_0\|_2\big).$$

With the assumed estimation rates in (A.22), we have

$$T_1 + T_2 + T_3 + T_4 + T_5 = o_p(1).$$

Thus, we have shown

$$\sqrt{n}\big(\widehat{x_{\rm std}^T\beta} - x_{\rm std}^T\beta_0\big) = \frac{1}{K}\sum_{k=1}^K\sqrt{n}\big(\widehat{x_{\rm std}^T\beta}^{(k)} - x_{\rm std}^T\beta_0\big) = \frac{1}{K}\sum_{k=1}^K\Big\{-\sqrt{n}\,u_0^T\dot{\ell}^{\dagger(k)}(\beta_0; \gamma_0) - \sqrt{n}(1-\rho)u_0^T\dot{\ell}_{\rm imp}^{(k)}(\gamma_0)\Big\} + o_p(1)$$
$$= -\sqrt{n}\,u_0^T\dot{\ell}^{\dagger}(\beta_0; \gamma_0) - \sqrt{n}(1-\rho)u_0^T\dot{\ell}_{\rm imp}(\gamma_0) + o_p(1).$$

Using the indicator $R_i = I(i \le n)$, we can alternatively write

$$\widehat{x_{\rm std}^T\beta} - x_{\rm std}^T\beta_0 = \frac{1}{N}\sum_{i=1}^N\Big[\frac{R_i}{\rho}u_0^TX_i\{Y_i - g(\gamma_0^TW_i)\} - u_0^TX_i\{g(\beta_0^TX_i) - g(\gamma_0^TW_i)\}\Big] + o_p\big\{(\rho n)^{-1/2}\big\}. \quad (A.31)$$

We provide the details of $T_1$–$T_5$ in Section C2.

46 Part 3: Variance estimation

Finally, we show that the asymptotic variance $V_{SAS}$ defined in (29) is bounded away from infinity and from zero, and that $\hat{V}_{SAS}$ defined in (23) is a consistent estimator. By the Cauchy-Schwarz inequality, we have an upper bound for the variance,

$$V_{SAS} = E\big[(u_0^TX_i)^2\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\}^2\big] + \rho(1-\rho)E\big[(u_0^TX_i)^2\{g(\gamma_0^TW_i) - g(\beta_0^TX_i)\}^2\big]$$
$$\le \sqrt{E\big[(u_0^TX_i)^4\big]E\big[\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\}^4\big]} + \rho(1-\rho)\sqrt{E\big[(u_0^TX_i)^4\big]E\big[\{g(\gamma_0^TW_i) - g(\beta_0^TX_i)\}^4\big]}.$$

Under Assumptions 1a and 2a, we have the sub-Gaussian bounds

$$\|u_0^TX_i\|_{\psi_2} \le \|u_0\|_2\sigma_{\max}/\sqrt{2} \le \sigma_{\min}^{-2}\sigma_{\max}/\sqrt{2},$$
$$\|(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\|_{\psi_2} \le 2(\nu_1 \vee \nu_2),$$
$$\|g(\gamma_0^TW_i) - g(\beta_0^TX_i)\|_{\psi_2} \le 2\big\{\|g(\gamma_0^TW_i) - Y_i\|_{\psi_2} \vee \|g(\beta_0^TX_i) - Y_i\|_{\psi_2}\big\} \le 2(\nu_1 \vee \nu_2).$$

By the bounds on the moments of sub-Gaussian and sub-exponential random variables in Lemma A4-b, we have

$$V_{SAS} \le 8\sqrt{2}\,\sigma_{\min}^{-4}\sigma_{\max}^2(\nu_1 \vee \nu_2)^2.$$

Under Assumptions 1b, 2a, 2b-i and 2c, we have a lower bound for $V_{SAS}$,

$$V_{SAS} \ge u_0^TE\big[X_iX_i^T\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\}^2\big]u_0 \ge \nu_3\|u_0\|_2^2\sigma_{\min}^2 \ge M^{-2}\sigma_{\max}^{-4}\sigma_{\min}^2\nu_3\|x_{\rm std}\|_2^2,$$

which is bounded away from zero.

We analyze the estimation error of the variance estimator, $\hat{V}_{SAS} - V_{SAS}$, through the decomposition

$$\hat{V}_{SAS} - V_{SAS}$$
$$= \underbrace{\sum_{k=1}^K\frac{n_k}{n}\bigg[\frac{1}{n_k}\sum_{i\in I_k}(\hat{u}^{(k)T}X_i)^2\big\{(1-\rho)g(\hat{\gamma}^{(k)T}W_i) + \rho\,g(\hat{\beta}^{(k)T}X_i) - Y_i\big\}^2 - E_{i\in I_k}\big[(\hat{u}^{(k)T}X_i)^2\big\{(1-\rho)g(\hat{\gamma}^{(k)T}W_i) + \rho\,g(\hat{\beta}^{(k)T}X_i) - Y_i\big\}^2 \mid D_k^c\big]\bigg]}_{T_1'}$$
$$+ \underbrace{\sum_{k=1}^K\frac{n_k}{n}\bigg[E_{i\in I_k}\big[(\hat{u}^{(k)T}X_i)^2\big\{(1-\rho)g(\hat{\gamma}^{(k)T}W_i) + \rho\,g(\hat{\beta}^{(k)T}X_i) - Y_i\big\}^2 \mid D_k^c\big] - E\big[(u_0^TX_i)^2\{(1-\rho)g(\gamma_0^TW_i) + \rho\,g(\beta_0^TX_i) - Y_i\}^2\big]\bigg]}_{T_2'}$$
$$+ \rho(1-\rho)\underbrace{\sum_{k=1}^K\frac{N_k - n_k}{N - n}\bigg[\frac{1}{N_k - n_k}\sum_{i\in J_k}(\hat{u}^{(k)T}X_i)^2\big\{g(\hat{\beta}^{(k)T}X_i) - g(\hat{\gamma}^{(k)T}W_i)\big\}^2 - E_{i\in J_k}\big[(\hat{u}^{(k)T}X_i)^2\big\{g(\hat{\beta}^{(k)T}X_i) - g(\hat{\gamma}^{(k)T}W_i)\big\}^2 \mid D_k^c\big]\bigg]}_{T_3'}$$
$$+ \rho(1-\rho)\underbrace{\sum_{k=1}^K\frac{N_k - n_k}{N - n}\bigg[E_{i\in J_k}\big[(\hat{u}^{(k)T}X_i)^2\big\{g(\hat{\beta}^{(k)T}X_i) - g(\hat{\gamma}^{(k)T}W_i)\big\}^2 \mid D_k^c\big] - E\big[(u_0^TX_i)^2\{g(\beta_0^TX_i) - g(\gamma_0^TW_i)\}^2\big]\bigg]}_{T_4'}.$$

Here we state the rates for $T_1'$–$T_4'$,

$$T_1' = O_p\big(n^{-1/2}\big), \qquad T_2' = O_p\big(\|\hat{u} - u_0\|_2 + (1-\rho)\|\hat{\gamma} - \gamma_0\|_2 + \rho\|\hat{\beta} - \beta_0\|_2\big),$$
$$T_3' = O_p\big(\rho(1-\rho)N^{-1/2}\big), \qquad T_4' = O_p\big(\rho(1-\rho)\{\|\hat{u} - u_0\|_2 + \|\hat{\gamma} - \gamma_0\|_2 + \|\hat{\beta} - \beta_0\|_2\}\big).$$

With the assumed estimation rates in (A.22), we have

$$T_1' + T_2' + T_3' + T_4' = o_p(1).$$

We provide the details of $T_1'$–$T_4'$ in Section C2.

Part 4: Conclusion with estimation rates

From the approximation in Part 2 and the boundedness and non-degeneracy of $V_{SAS}$ in Part 3, we have shown the asymptotic normality of the cross-fitted debiased estimator,

$$\sqrt{n}\,V_{SAS}^{-1/2}\big(\widehat{x_{\rm std}^T\beta} - x_{\rm std}^T\beta_0\big) \rightsquigarrow N(0, 1).$$

Together with the consistency of $\hat{V}_{SAS}$ from Part 3, we have

$$\sqrt{n}\,\hat{V}_{SAS}^{-1/2}\big(\widehat{x_{\rm std}^T\beta} - x_{\rm std}^T\beta_0\big) \rightsquigarrow N(0, 1).$$
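In practice, the display above yields Wald-type intervals for the linear predictor and, through the monotone link, for the predicted risk. The sketch below is a minimal illustration of that construction; the variance estimate `v_hat` is assumed to have been computed as in (23), and the helper name is ours.

```python
import numpy as np

def wald_ci_for_risk(xbeta_debiased, v_hat, n_labeled, z=1.96):
    """95% Wald interval for x_std^T beta_0 and the implied risk g(x_std^T beta_0).

    xbeta_debiased : cross-fitted debiased estimate of x_std^T beta_0
    v_hat          : consistent estimate of the asymptotic variance V_SAS
    n_labeled      : number of labeled observations n
    """
    half_width = z * np.sqrt(v_hat / n_labeled)
    lo, hi = xbeta_debiased - half_width, xbeta_debiased + half_width
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    # the link is monotone, so transforming the endpoints gives an interval for the risk
    return (lo, hi), (expit(lo), expit(hi))
```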

Part 5: Sufficient dimension condition

We have established the rates of estimation for $\hat{\gamma}$, $\hat{\beta}$ and $\hat{u}$ from Lemma A8, Theorem 1 and Part 1 of this proof. Since we only leave one fold of the data out for the cross-fitted estimators, they attain the same rates of estimation,

$$\|\hat{\gamma}^{(k)} - \gamma_0\|_2 = O_p\big(\sqrt{s_\gamma\log(p+q)/n}\big),$$
$$\|\hat{\beta}^{(k)} - \beta_0\|_2 = O_p\big(\sqrt{s_\beta\log(p)/N} + (1-\rho)\sqrt{s_\gamma\log(p+q)/n}\big),$$
$$\|\hat{u}^{(k)} - u_0\|_2 = O_p\big(\sqrt{(s_\beta + s_u)\log(p)/N} + (1-\rho)\sqrt{s_\gamma\log(p+q)/n}\big).$$

Applying these rates, we see that the dimension condition (28) is sufficient for (A.22).

B4 Efficiency of SAS Inference

Relative Efficiency to Supervised Learning

Proof of Proposition 5. We prove the proposition by direct calculation,

$$V_{SL} - V_{SAS}$$
$$= E\big[(u_0^TX_i)^2\{Y - g(\beta_0^TX_i)\}^2\big] - E\big[(u_0^TX_i)^2\{Y - (1-\rho)E(Y\mid S_i, X_i) - \rho\,g(\beta_0^TX_i)\}^2\big]$$
$$= E\big[(u_0^TX_i)^2\big\{(1-\rho^2)g(\beta_0^TX_i)^2 - 2(1-\rho^2)g(\beta_0^TX_i)E(Y\mid S_i, X_i) + (1-\rho^2)E(Y\mid S_i, X_i)^2\big\}\big]$$
$$= (1-\rho)^2E\big[(u_0^TX_i)^2\{E(Y\mid S_i, X_i) - g(\beta_0^TX_i)\}^2\big] + 2\rho(1-\rho)E\big[(u_0^TX_i)^2\{E(Y\mid S_i, X_i) - g(\beta_0^TX_i)\}^2\big].$$

Each term in the last expression is the expectation of a complete square multiplied by a non-negative coefficient, so the difference must be non-negative. Thus, we have shown that the SAS asymptotic variance is no greater than the supervised learning variance. Equality holds only if (1) $\rho = 1$, i.e., all samples are labelled, or (2) $u_0^TX_i\{E(Y\mid S_i, X_i) - g(\beta_0^TX_i)\} = 0$ almost surely.

Efficiency Bound among Semi-parametric RAL Estimators

Proof of Proposition 6. The proof follows the flow of Section D.2 in Kallus and Mao (2020). The semi-parametric model for the observed data is

$$\mathcal{M}_{\rm obs} = \Big\{f_{X,Y,S,R}(x, y, s, r) = f_X(x)f_{S\mid X}(s\mid x)\big\{\rho f_{Y\mid S,X}(y\mid s, x)\big\}^r(1-\rho)^{1-r} : f_X, f_{S\mid X}, f_{Y\mid S,X}\ \text{are arbitrary pdf/pmf}\Big\}. \quad (A.32)$$

We consider the parametric sub-model

$$\mathcal{M}_{\rm par} = \Big\{f_{X,Y,S,R}(x, y, s, r; \zeta) = f_X(x; \zeta)f_{S\mid X}(s\mid x; \zeta)\big\{\rho f_{Y\mid S,X}(y\mid s, x; \zeta)\big\}^r(1-\rho)^{1-r} : \zeta \in \mathbb{R}^d\Big\}. \quad (A.33)$$

The score vector of the parametric sub-model is

$$\Psi(X, Y, S, R) = \frac{\partial\log\{f_{X,Y,S,R}(X, Y, S, R; \zeta)\}}{\partial\zeta}\bigg|_{\zeta=\zeta_0}$$
$$= \frac{\partial\log\{f_X(X; \zeta)\}}{\partial\zeta}\bigg|_{\zeta=\zeta_0} + \frac{\partial\log\{f_{S\mid X}(S\mid X; \zeta)\}}{\partial\zeta}\bigg|_{\zeta=\zeta_0} + R\,\frac{\partial\log\{f_{Y\mid S,X}(Y\mid S, X; \zeta)\}}{\partial\zeta}\bigg|_{\zeta=\zeta_0}$$
$$= \Psi_X(X) + \Psi_S(S, X) + R\Psi_Y(Y, S, X). \quad (A.34)$$

Next, we decompose the Hilbert space $\mathcal{H}$ of mean-zero, finite-variance random variables measurable with respect to $\sigma\{X, S, R, YR\}$. The model tangent space spanned by the score (A.34) is a linear sub-space of $\mathcal{H}$,

$$\Lambda = \Lambda_X \oplus \Lambda_S \oplus \Lambda_Y,$$
$$\Lambda_X = \bigcup_{\mathcal{M}_{\rm par}}{\rm span}\{\Psi_X(X)\} = \{h(X) \in \mathcal{H} : E[h(X)] = 0\},$$
$$\Lambda_S = \bigcup_{\mathcal{M}_{\rm par}}{\rm span}\{\Psi_S(S, X)\} = \{h(S, X) \in \mathcal{H} : E[h(S, X)\mid X] = 0\},$$
$$\Lambda_Y = \bigcup_{\mathcal{M}_{\rm par}}{\rm span}\{R\Psi_Y(Y, S, X)\} = \{Rh(Y, S, X) \in \mathcal{H} : E[h(Y, S, X)\mid S, X] = 0\}. \quad (A.35)$$

The orthogonal complement of the model tangent space $\Lambda$ is

$$\Lambda^\perp = \{h(R, S, X) \in \mathcal{H} : E[h(R, S, X)\mid S, X] = 0\}, \qquad \mathcal{H} = \Lambda \oplus \Lambda^\perp. \quad (A.36)$$

Now, we verify that the supervised learning influence function

$$\phi_{SL}(\theta; \beta) = \frac{R}{\rho}u_0^TX\{Y - g(\beta^TX)\}$$

is indeed an influence function for $x_{\rm std}^T\beta$ by showing

$$E\{\phi_{SL}(\theta_0; \beta_0)\Psi(X, Y, S, R)\} = \frac{d}{d\zeta}x_{\rm std}^T\beta(\zeta)\bigg|_{\zeta=\zeta_0}.$$

Since $\beta(\zeta)$ is an implicit function of $\zeta$ through the moment condition

$$E_\zeta\big[X\{g(\beta(\zeta)^TX) - Y\}\big] = 0,$$

we solve for its derivative by differentiating the moment condition,

$$\frac{d}{d\zeta}E_\zeta\big[X\{g(\beta(\zeta)^TX) - Y\}\big]\bigg|_{\zeta=\zeta_0} = 0,$$
$$\frac{d}{d\zeta}E_\zeta\big[X\{g(\beta_0^TX) - Y\}\big]\bigg|_{\zeta=\zeta_0} + E_{\zeta_0}\big\{XX^Tg'(\beta_0^TX)\big\}\frac{d}{d\zeta}\beta(\zeta)\bigg|_{\zeta=\zeta_0} = 0,$$
$$-\Theta_0E_{\zeta_0}\big[X\{g(\beta_0^TX) - Y\}\{\Psi_X(X) + \Psi_S(S, X) + \Psi_Y(Y, S, X)\}\big] = \frac{d}{d\zeta}\beta(\zeta)\bigg|_{\zeta=\zeta_0}.$$

Then, we verify that the supervised learning influence function is valid,

$$\frac{d}{d\zeta}x_{\rm std}^T\beta(\zeta)\bigg|_{\zeta=\zeta_0} = -E_{\zeta_0}\Big[\frac{R}{\rho}u_0^TX\{g(\beta_0^TX) - Y\}\Psi(X, Y, S, R)\Big] = E\{\phi_{SL}(\theta_0; \beta_0)\Psi(X, Y, S, R)\}.$$

Finally, we derive the efficient influence function by subtracting from $\phi_{SL}$ its projection onto $\Lambda^\perp = \Lambda_R$. Let $\Pi[h(D)\mid\Lambda]$ denote the projection of $h(D) \in \mathcal{H}$ onto the space $\Lambda$. We can easily calculate the projection of $\phi_{SL}$ onto $\Lambda_R$,

$$\Pi[\phi_{SL}(\theta_0; \beta_0)\mid\Lambda_R] = E\{\phi_{SL}(\theta_0; \beta_0)\mid R, S, X\} - E\{\phi_{SL}(\theta_0; \beta_0)\mid S, X\}$$
$$= \frac{R}{\rho}u_0^TX\{E(Y\mid S, X) - g(\beta_0^TX)\} - u_0^TX\{E(Y\mid S, X) - g(\beta_0^TX)\}.$$

The efficient influence function is thus

$$\phi_{\rm eff}(\theta_0; \beta_0) = \phi_{SL}(\theta_0; \beta_0) - \Pi[\phi_{SL}(\theta_0; \beta_0)\mid\Lambda_R]$$
$$= \frac{R}{\rho}u_0^TX\{Y - g(\beta_0^TX)\} - \frac{R}{\rho}u_0^TX\{E(Y\mid S, X) - g(\beta_0^TX)\} + u_0^TX\{E(Y\mid S, X) - g(\beta_0^TX)\}$$
$$= \frac{R}{\rho}u_0^TX\{Y - E(Y\mid S, X)\} + u_0^TX\{E(Y\mid S, X) - g(\beta_0^TX)\}.$$

C Auxiliary Results

C1 General

Lemma A3. Under Assumptions 1a, 1b and 2a, the residuals of the imputed loss are sub-Gaussian random variables,

$$\|g(\beta^TX_i) - g(\gamma^TW_i)\|_{\psi_2} \le 4\max\{\nu_1, \nu_2, M\|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2}, M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}\}.$$

Similarly,

$$\|Y_i - g(\gamma^TW_i)\|_{\psi_2} \le 2\max\{\nu_1, M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}\},$$
$$\|g(\gamma^TW_i) - g(\gamma_0^TW_i)\|_{\psi_2} \le M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2},$$
$$\|g(\beta^TX_i) - g(\beta_0^TX_i)\|_{\psi_2} \le M\|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2},$$
$$\|\rho\,g(\beta^TX_i) + (1-\rho)g(\gamma^TW_i) - Y_i\|_{\psi_2} \le 4\max\{(1-\rho)\nu_1, \rho\nu_2, \rho M\|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2}, (1-\rho)M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}\},$$
$$\|g(\beta^TX_i) - g(\beta_0^TX_i) - g(\gamma^TW_i) + g(\gamma_0^TW_i)\|_{\psi_2} \le \sqrt{2}M\sigma_{\max}\max\{\|\beta - \beta_0\|_2, \|\gamma - \gamma_0\|_2\}.$$

Proof of Lemma A3. To establish the sub-Gaussian tail, we consider the following decomposition,

$$g(\beta^TX_i) - g(\gamma^TW_i) = \{g(\beta_0^TX_i) - Y_i\} - \{g(\gamma_0^TW_i) - Y_i\} + \{g(\beta^TX_i) - g(\beta_0^TX_i)\} - \{g(\gamma^TW_i) - g(\gamma_0^TW_i)\}. \quad (A.37)$$

According to Assumption 1a, the first two terms on the right-hand side of (A.37) are sub-Gaussian,

$$\|g(\beta_0^TX_i) - Y_i\|_{\psi_2} \le \nu_1, \qquad \|g(\gamma_0^TW_i) - Y_i\|_{\psi_2} \le \nu_2.$$

According to Assumption 1b, the latter two terms on the right-hand side of (A.37) are bounded by

$$|g(\beta^TX_i) - g(\beta_0^TX_i)| \le M|(\beta - \beta_0)^TX_i|, \qquad |g(\gamma^TW_i) - g(\gamma_0^TW_i)| \le M|(\gamma - \gamma_0)^TW_i|.$$

Under Assumption 2a, $(\beta - \beta_0)^TX_i$ and $(\gamma - \gamma_0)^TW_i$ are sub-Gaussian random variables,

$$\|(\beta - \beta_0)^TX_i\|_{\psi_2} \le \|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2}, \qquad \|(\gamma - \gamma_0)^TW_i\|_{\psi_2} \le \|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}.$$

By Lemma A4-e,

$$\|g(\beta^TX_i) - g(\beta_0^TX_i)\|_{\psi_2} \le M\|(\beta - \beta_0)^TX_i\|_{\psi_2} \le M\|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2},$$
$$\|g(\gamma^TW_i) - g(\gamma_0^TW_i)\|_{\psi_2} \le M\|(\gamma - \gamma_0)^TW_i\|_{\psi_2} \le M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}.$$

Finally, we apply Lemma A4-d,

$$\|g(\beta^TX_i) - g(\gamma^TW_i)\|_{\psi_2} \le 4\max\big\{\|g(\beta_0^TX_i) - Y_i\|_{\psi_2}, \|g(\gamma_0^TW_i) - Y_i\|_{\psi_2}, \|g(\beta^TX_i) - g(\beta_0^TX_i)\|_{\psi_2}, \|g(\gamma^TW_i) - g(\gamma_0^TW_i)\|_{\psi_2}\big\}$$
$$\le 4\max\big\{\nu_1, \nu_2, M\|\beta - \beta_0\|_2\sigma_{\max}/\sqrt{2}, M\|\gamma - \gamma_0\|_2\sigma_{\max}/\sqrt{2}\big\}.$$

Therefore, we have reached the conclusion. The remaining bounds follow from the same derivation.

C2 Inference

Analysis of Estimated Precision Matrix

Proof of Lemma A2. The definitions of the cross-fitted loss functions $m^{(k,k')}$ and their derivatives can be found in (A.23). By the definition of $\hat{u}^{(k)}$, we have

0 (k,k0) 0 (k,k0) X Nk (k,k0)  (k)  (k) X Nk (k,k0)   m ub ; βb + λukub k1 ≤ m u0; βb + λuku0k1. N − Nk N − Nk k06=k k06=k

53 (k) (k) Denote the standardized estimation error as δ = (ub − u0)/kub − u0k2. Due to convexity (k) of the loss function, we have for t = kub − u0k2 ∧ 1

0 (k,k0) 0 (k,k0) X Nk (k,k0)   X Nk (k,k0)   m u0 + tδ; βb +λuku0+tδk1 ≤ m u0; βb +λuku0k1. N − Nk N − Nk k06=k k06=k (A.38)

By the triangle inequality ku0k1 − ku0 + tδk1 ≤ tkδk1, we have from (A.38)

0 (k,k0) (k,k0) X Nk n (k,k0)   (k,k0)  o m u0 + tδ; βb − m u0; βb ≤ tλukδk1 (A.39) N − Nk k06=k

Because the loss functions m(k) are quadratic functions of u, we can apply the restricted strong convexity event Ω(k) to obtain

(k,k0) (k,k0) (k,k0) (k,k0)   (k,k0)   T (k,k0)   m u0 + tδ; βb − m u0; βb − tδ m˙ u0; βb

0 0  (k,k ) =t2δTm¨ (k,k ) βb δ

2 ∗ ∗ ∗ p 0 ≥t κrsc,1 − tκrsc,1κrsc,2 log(p)/Nk kδk1. (A.40)

Applying (A.40) to (A.39), we have with large probability

0 (k,k0) X Nk n T (k,k0)   2 ∗ ∗ ∗ p o 0 tδ m˙ u0; βb + t κrsc,1 − tκrsc,1κrsc,2 log(p)/Nk kδk1 ≤ tλukδk1 N − Nk k06=k where kδk2 = 1 from definition. Thus, we have reach

0 (k,k0) ∗ X Nk n T (k,k0)   ∗ ∗ p o 0 tκrsc,1 ≤ λukδk1 − δ m˙ u0; βb − κrsc,1κrsc,2 log(p)/Nk kδk1 . (A.41) N − Nk k06=k

n  (k,k0) o (k,k0) c The target parameter u0 can be identify by E m˙ u0; βb | Dk0 = 0. We use (k,k0) T (k,k0)   the fact to do a careful analysis of δ m˙ u0; βb by the decomposition

(k,k0) (k,k0) (k,k0) T (k,k0)   T h (k,k0)   n (k,k0)   c oi δ m˙ u ; β =δ m˙ u ; β − m˙ u ; β | 0 0 b 0 b E 0 b Dk (k,k0) T h n (k,k0)   c o  (k,k0) c i + δ E m˙ u0; βb | Dk0 − E m˙ (u0; β0) | Dk0 0 0 0  (k,k ) n 0  (k,k ) o (k,k ) (k,k ) c ≤kδk1 m˙ u0; βb − E m˙ u0; βb | Dk0 ∞ (k,k0)T T T c + Ei∈I 0 ∪J 0 [XiXi u0{g(β0 Xi) − g(βb Xi)} | Dk0 ] . (A.42) k k 2

54 We establish the rate for L2-norm of the population score at u0 through analyzing

(k,k0)T T T T c 0 sup Ei∈Ik0 ∪Jk0 [v XiXi u0{g(β0 Xi) − g(βb Xi)} | Dk ], kvk2=1 whose bound can be derived from Assumptions 1b, 2a, 2a, the Cauchy-Schwartz inequality and Lemma A4-b,

(k,k0)T T T T c 0 Ei∈Ik0 ∪Jk0 [v XiXi u0{g(β0 Xi) − g(βb Xi)} | Dk ]

(k,k0) T T T   c 0 ≤MEi∈Ik0 ∪Jk0 [|v XiXi u0{ β0 − βb Xi}| | Dk ] v u " 2 #  (k,k0) T   T 4 T 4 1/4 u   c ≤M E{(v Xi) }E{(v Xi) } tEi∈Ik0 ∪Jk0 β0 − βb Xi | Dk0

(k,k0) 3 ≤Mσmaxkvk2ku0k2 β0 − βb . 2 Hence, we have shown

h n  (k,k0)T o i (k,k0) T T c 3 Ei∈I 0 ∪J 0 XiXi u0 g(β0 Xi) − g βb Xi | Dk0 ≤ Mσmaxku0k2 β0 − βb . k k 2 2 (A.43)

By the bound for (A.42) through (A.43) and the definition of λβ, we have the bound from (A.41) (k,k0) ∗ 3 tκrsc,1 ≤ 2λukδk1 + Mσmaxku0k2 sup β0 − βb . (A.44) k06=k 2 Hence, we can reach an immediate bound for estimation error from (A.44) without con- sidering the sparsity of u0. We shall proceed to derive a sharper bound that involves the sparsity of u0. We separately analyze two cases. Case 1:

(k,k0) 3 Mσmaxku0k2 sup β0 − βb ≥ kδk1λu/3 k=1,...,K 2

(k,k0) In this case, the estimation error is dominated by β0 − βb . We simply have from (A.44)

(k,k0) ∗ 3 tκrsc,1 ≤ 7Mσmaxku0k2 sup β0 − βb , k=1,...,K 2

(k,k0) 2 ∗ 6 2 tκrsc,1kδk1λu/3 ≤ 7Mσmaxku0k2 sup β0 − βb . k=1,...,K 2

55 Thus, we have

(k,k0) ku(k) − u k ≤ 7Mσ3 ku k sup β − β /κ∗ , b 0 2 max 0 2 0 b rsc,1 k06=k 2

(k,k0) 2 ku(k) − u k ≤ 21Mσ6 ku k2 sup β − β /(κ∗ λ ). (A.45) b 0 1 max 0 2 0 b rsc,1 u k06=k 2 Case 2: (k,k0) 3 Mσmaxku0k2 sup β0 − βb ≤ kδk1λu/3 (A.46) k06=k 2

In this case, the estimation error is comparable to the situation that we have the true β0 for the Hessian. Thus, the sparsity of u0 may affect the estimation error. Following the typical approach to establish the cone condition for δ, we analyze the symmetrized Bregman’s divergence,

0 (k,k0) (k,k0) (k) T X Nk n (k,k0)  (k)  (k,k0)  o (ub − u0) m˙ ub ; βb − m˙ u0; βb N − Nk k0=k

0 (k,k0) (k,k0) (k) X Nk T n (k,k0)  (k)  (k,k0)  o =kub − u0k2 δ m˙ ub ; βb − m˙ u0; βb . (A.47) N − Nk k0=k

0 0  (k,k ) Due to the convexity of the quadratic loss m(k,k ) ·; βb , the symmetrized Bregman’s divergence (A.47) is nonnegative through a mean-value theorem,

0 (k,k0) (k,k0) (k) T X Nk n (k,k0)  (k)  (k,k0)  o (ub − u0) m˙ ub ; βb − m˙ u0; βb N − Nk k0=k

0 (k,k0)T X Nk X 0 (k) T 2 = g (βb Xi){(ub − u0) Xi} 0 N − Nk k =k i∈Ik0 ∪Jk0 ≥0.

Denote the indices set of nonzero coefficient in u0 as Ou = {j : u0,j 6= 0}. We denote

c the δOu and δOu as the sub-vectors for δ at positions in Ou and at positions not in Ou,

(k) respectively. The solution ub satisfies the KKT condition

0 N 0 0  (k,k ) X k (k,k ) (k) m˙ ub ; βb ≤ λu, N − Nk k0=k ∞ 0 (k,k0) X Nk (k,k0)  (k)  (k) (k) m˙ ub ; βb = −λu sign(ubj ), j : ubj 6= 0. N − Nk j k0=k

56 From the KKT condition and the definitions of δ and Ou, we have

0 (k,k0) X Nk (k,k0)  (k)  δj m˙ ub ; βb ≤ |δj|λu, j ∈ Oβ; N − Nk j k0=k (k) (k) 0 0 0  (k,k ) −u λu sign(u ) X Nk (k,k ) (k) bj bj c δj m˙ u ; β = = −λu|δj|, j ∈ O . (A.48) b b (k) β N − Nk j ku − u0k2 k0=k b Applying the (A.48) to (A.47), we have the upper bound,

0 (k,k0) (k,k0) T X Nk n (k,k0)  (k)  (k,k0)  o δ m˙ ub ; βb − m˙ u0; βb N − Nk k0=k

0 (k,k0) 0 (k,k0) X X Nk (k,k0)  (k)  X X Nk (k,k0)  (k)  = δj m˙ u ; βb + δj m˙ u ; βb b j b j 0 N − Nk c 0 N − Nk j∈Ou k =k j∈Ou k =k

0 (k,k0) X Nk T (k,k0)   + δ m˙ u0; βb N − Nk k0=k X X † ≤λ |δ | − λ |δ | + δT`˙ (β ; γ) u j u j 0 b c j∈Ou j∈Ou

0 (k,k0) X Nk T (k,k0)   c ≤λukδOu k1 − λukδOu k1 + δ m˙ u0; βb . N − Nk k0=k

Then, we apply (A.42), the definition of λu and (A.46),

2 0 ≤λ kδ k − λ kδ c k + λ kδk , and λ kδ c k ≤ 5λ kδ k . u Ou 1 u Ou 1 3 u 1 u Ou 1 u Ou 1

Therefore, we can bound the L1 norm of δ by the cone property, √ √ kδk1 ≤ 6λukδOu k1 ≤ 6 sukδk2 = 6 su. (A.49)

Now, we apply the cone condition (A.49) and the case condition (A.46) to the bound (A.44), ∗ √ ∗ tκrsc,1 ≤ 14 suλu, tκrsc,1kδk1 ≤ 84suλu

Thus, we obtain the rate for estimation error

√ ku(k) − u k ≤ 14 s λ /κ∗ , ku(k) − u k ≤ 84s λ /κ∗ . (A.50) b 0 2 u u rsc,1 b 0 1 u u rsc,1

Conclusion:

Since Case 1 and Case 2 are complementary, one of them must occur. Thus, the estimation error is controlled by the larger of the two bounds,

 √ (k,k0)  ku(k) − u k ≤ max 14 s λ /κ∗ , 7Mσ3 ku k sup β − β /κ∗ , b 0 2 u u rsc,1 max 0 2 0 b rsc,1 k06=k 2

 (k,k0) 2  ku(k) − u k ≤ max 84s λ /κ∗ , 21Mσ6 ku k2 sup β − β /(κ∗ λ ) , b 0 1 u u rsc,1 max 0 2 0 b rsc,1 u k06=k 2 which is our oracle inequality.

Analysis for Terms T1-T5 in Part 2

To show √ n o (k) √ (k) 2 T1 = n xstd − H(βe)u0 (βb − β0) = Op nkβb − β0k2 , we rewrite the term as a conditional expectation

√ (k) T n o T1 = nu0 Σ0 − H(βe) (βb − β0)

√ (k) T (k)T h T T 0 0 ci = nEi∈Jk u0 Xi(βb − β0) Xi{g (βe Xi) − g (βb Xi)} | Dk .

Under Assumptions 1b, 2a, we derive the bound for the expectation using the Cauchy- Schwartz inequality and Lemma A4-b,

√ (k) h T T 2 ci |T1| ≤M nEi∈Jk u0 Xi{(βb − β0) Xi} | Dk r h (k) i T 2 c T 4 c ≤M nEi∈Jk {(u0 Xi) | Dk} Ei∈Jk {(βb − β0) Xi} | Dk

q (k) ≤M n8kuTX k2 k(β − β )TX k4 0 i ψ2 b 0 i ψ2

√ (k) 2 3 ≤ nMku0k2 βb − β0 σmax. (A.51) 2

Since ku0k2 is bounded according to (A.25), we have established in

√ (k) 2 |T1| = Op nkβb − β0k2 as declared. To show

√ (k) √ (k)  T = n (u − u(k))T (β)(β − β ) = O nkβ − β k ku(k) − u k , 2 0 b H e b 0 p b 0 2 b 0 2

58 we rewrite the term as a conditional expectation

√ h (k) T i T = n (u − u(k))T X (β − β )TX g0(β X ) | c . 2 Ei∈Jk 0 b i b 0 i e i Dk Similar to (A.51), we derive the bound for the expectation under Assumptions 1b, 2a through the Cauchy-Schwartz inequality and Lemma A4-b, A4-f,

√ h (k) i |T | ≤M n | (u − u(k))T X (β − β )TX | | c 2 Ei∈Jk 0 b i b 0 i Dk √ (k) ≤2M nk (u − u(k))T X (β − β )TX k 0 b i b 0 i ψ1 √ (k) ≤2M nk (u − u(k))T X k k(β − β )TX k 0 b i ψ2 b 0 i ψ2 √ (k) ≤M nku(k) − u k kβ − β k σ2 . b 0 2 b 0 2 max This bound immediately implies √  T = O nkβ − β k ku(k) − u k . 2 p b 0 2 b 0 2 To show

√ h †(k) (k) n †(k) (k) †(k) oi T = nu(k)T {`˙ (β ; γ(k)) | c} − `˙ (β ; γ(k)) − `˙ (β ; γ ) 3 b E b b Dk b b 0 0  (k)  =O ρkβ − β k + pρ(1 − ρ)kγ(k) − γ k , p b 0 2 b 0 2 we rewrite the term as two empirical processes with diminishing summands  √ 1 X (k)T (k)T T = − n u(k)TX {g(β X ) − g(βTX ) − g(γ W ) + g(γTW )} 3 b i b i 0 i b i 0 i Nk i∈Jk

h (k)T i  − u(k)TX {g(β X ) − g(βTX ) − g(γ(k)TW ) + g(γTW )} | c Ei∈Jk b i b i 0 i b i 0 i Dk  √ 1 X (k)T − n u(k)TX {g(β X ) − g(βTX )} b i b i 0 i Nk i∈Ik

h (k)T i  − u(k)TX {g(β X ) − g(βTX )} | c . Ei∈Jk b i b i 0 i Dk

˙ †(k) c We have used the identity E{` (β0; γ0) | Dk} = 0 above. Using Lemmas A3, A4-h and Assumptions (2a) and (2a), we show that each summand is sub-exponential

(k)T ku(k)TX {g(β X ) − g(βTX ) − g(γ(k)TW ) + g(γTW )}k b i b i 0 i b i 0 i ψ1 (k)T ≤ku(k)TX k kg(β X ) − g(βTX ) − g(γ(k)TW ) + g(γTW )k b i ψ2 b i 0 i b i 0 i ψ2  (k)  ≤Mσ2 ku(k)k kβ − β k + kγ(k) − γ k , max b 2 b 0 2 b 0 2

59 (k)T (k) ku(k)TX {g(β X ) − g(βTX )}k ≤ Mσ2 ku(k)k kβ − β k /2. b i b i 0 i ψ1 max b 2 b 0 2

Applying the Bernstein’s inequality, we obtain

 n√ (k) o T = O ku(k)k ρkβ − β k + pρ(1 − ρ)kγ(k) − γ k . 3 p b 2 b 0 2 b 0 2

(k) We achieve the stated rate with the tightness of kub k2 from (A.25). To show

√ h (k) n (k) (k) oi T = n(1 − ρ)u(k)T {`˙ (γ(k)) | c} − `˙ (γ(k)) − `˙ (γ ) 4 b E imp b Dk imp b imp 0 =O (1 − ρ)kγ(k) − γ k  , p b 0 2 we rewrite the term as the empirical process with diminishing summands  √ 1 X (k)T T = − n(1 − ρ) u(k)TX {g(γ W ) − g(γTW )} 4 b i b i 0 i nk i∈Ik  − u(k)TX {g(γ(k)TW ) − g(γTW )} | c . Ei∈Jk b i b i 0 i Dk

˙ (k) c We have used the identity E{`imp(γ0) | Dk} = 0 above. Similar to the analysis of T3, we show that each summand is sub-exponential

ku(k)TX {g(γ(k)TW ) − g(γTW )}k ≤ Mσ2 ku(k)k kγ(k) − γ k /2. b i b i 0 i ψ1 max b 2 b 0 2

Applying the Bernstein’s inequality, we obtain

T = O (1 − ρ)ku(k)k kγ(k) − γ k  . 4 p b 2 b 0 2

(k) We achieve the stated rate with the tightness of kub k2 from (A.25). To show

√ n †(k) (k) o T = n (u − u(k))T `˙ (β ; γ ) + (1 − ρ)`˙ (γ ) = O (ku(k) − u k ) , 5 0 b 0 0 imp 0 p b 0 2 we rewrite the term as the empirical process with diminishing summands

√ 1 X T = − n (u(k) − u )TX {g(βTX ) − g(γTW )} 5 b 0 i 0 i 0 i Nk i∈Jk

60 √ 1 X − n (u(k) − u )TX {ρ · g(βTX ) + (1 − ρ) · g(γTW ) − Y }. b 0 i 0 i 0 i i nk i∈Ik The summands have zero mean because

[(u(k) − u )TX {g(βTX ) − g(γTW )} | c] Ei∈Jk b 0 i 0 i 0 i Dk =(u(k) − u )T [X {g(βTX ) − g(γTW )}] b 0 E i 0 i 0 i =0,

[(u(k) − u )TX {ρ · g(βTX ) + (1 − ρ) · g(γTW ) − Y } | c] Ei∈Ik b 0 i 0 i 0 i i Dk =(u(k) − u )T (ρ [X {g(βTX ) − Y }] + (1 − ρ) [X {g(γTW ) − Y }]) b 0 E i 0 i i E i 0 i i =0.

Similar to the analysis of T3, we show that each summand is sub-exponential √ (k) T T T (k) k(u − u0) Xi{g(β Xi) − g(γ Wi)}k ≤ 2σmax(ν1 ∨ ν2)ku − u0k2 b 0 0 ψ1 b √ (k) T T T (k) k(u − u0) Xi{ρ · g(β Xi) + (1 − ρ) · g(γ Wi) − Yi}k ≤ 2σmax(ν1 ∨ ν2)ku − u0k2 b 0 0 ψ1 b

Applying the Bernstein’s inequality, we obtain

 (k) np o (k) T5 = Op kub − u0k2 ρ(1 − ρ) + 1 = Op (kub − u0k2) .

Analysis for Terms T1'-T4' in Part 3

Conditionally on the out-of-fold data, the term T1' is the empirical average of i.i.d. mean-zero random variables,

K (k)T X nk 1 X (k)T 2 (k)T 2 T1 = (ub Xi) {(1 − ρ) · g(γb Wi) + ρ · g(βb Xi) − Yi} n nk k=1 i∈Ik ! h (k)T i − (u(k)TX )2{(1 − ρ) · g(γ(k)TW ) + ρ · g(β X ) − Y }2 | c . Ei∈Ik b i b i b i i Dk

We bound the variance of each summand by the Cauchy-Schwartz inequality and Lemmas A4-b, A4-d, A4-e,

(k)T h (k)T 2 (k)T 2 ci Var (ub Xi) {(1 − ρ) · g(γb Wi) + ρ · g(βb Xi) − Yi} | Dk i∈Ik

61 h (k)T i ≤ (u(k)TX )4{(1 − ρ) · g(γ(k)TW ) + ρ · g(β X ) − Y }4 | c Ei∈Ik b i b i b i i Dk r h (k)T i ≤ {(u(k)TX )8 | c} {(1 − ρ) · g(γ(k)TW ) + ρ · g(β X ) − Y }8 | c Ei∈Ik b i Lk Ei∈Ik b i b i i Dk r  (k)T 8 ≤ 48ku(k)TX k8 48 ρkg(β X ) − Y k ∨ (1 − ρ)kg(γ(k)TX ) − Y k (A.52) b i ψ2 b i i ψ2 b i i ψ2

Under Assumption 1a, 1b, 2a, we have √ ku(k)TX k ≤ (ku k + ku(k) − u k )σ / 2 = O (1 + ku(k) − u k ) . b i ψ2 0 2 b 0 2 max p b 0 2

We apply Lemma A3 to obtain

(k)T  (k)  kg(βb Xi) − Yikψ2 = Op 1 + kβb − β0k2 ,

kg(γ(k)TW ) − Y k = O 1 + kγ(k) − γ k  . b i i ψ2 p b 0 2

We have shown that the variance in (A.52) is of order

 (k)  O 1 + ku(k) − u k4 + ρ4kβ − β k4 + (1 − ρ)4kγ(k) − γ k4 . p b 0 2 b 0 2 b 0 2

Thus by the Tchebychev’s inequality, we obtain

n (k) o √  T 0 = O 1 + ku(k) − u k + ρkβ − β k + (1 − ρ)kγ(k) − γ k / n 1 p b 0 2 b 0 2 b 0 2

(k) (k) (k) Applying the consistency of γb , βb and ub from (A.22)

0 −1/2 T1 = Op n = op(1).

0 To analyze T2, we consider the decomposition in which the estimators are replaced by the estimands one by one,

0 T2 K  (k)T X nk h (k)T 2 (k)T 2 ci = i∈I (u Xi) {(1 − ρ) · g(γ Wi) + ρ · g(βb Xi) − Yi} | D n E k b b k k=1   T 2 T T 2 − E (u0 Xi) {(1 − ρ) · g(γ0 Wi) + ρ · g(β0 Xi) − Yi}

K (k)T X nk h (k) T (k)T (k)T 2 ci = i∈I {(u − u0) Xi}u Xi{(1 − ρ) · g(γ Wi) + ρ · g(βb Xi) − Yi} | D n E k b b b k k=1

62 K (k)T X nk h (k) T T (k)T 2 ci + i∈I {(u − u0) Xi}u Xi{(1 − ρ) · g(γ Wi) + ρ · g(βb Xi) − Yi} | D n E k b 0 b k k=1 K (k)T X nk h T 2 (k)T + i∈I (u Xi) {(1 − ρ) · g(γ Wi) + ρ · g(βb Xi) − Yi} n E k 0 b k=1 (k)T i × [(1 − ρ){g(γ(k)TW ) − g(γTW )} + ρ{g(β X ) − g(βTX )}] | c b i 0 i b i 0 i Dk K X nk h + (uTX )2{(1 − ρ) · g(γTW ) + ρ · g(βTX ) − Y } n Ei∈Ik 0 i 0 i 0 i i k=1 (k)T i × [(1 − ρ){g(γ(k)TW ) − g(γTW )} + ρ{g(β X ) − g(βTX )}] | c . b i 0 i b i 0 i Dk

Following the same calculation as in (A.52), we can bound the expectations

 (k)  T 0 = O ku(k) − u k + ρkβ − β k + (1 − ρ)kγ(k) − γ k 2 p b 0 2 b 0 2 b 0 2

Applying the consistency of γb, βb and ub from Lemma A8, Theorem 1 and Part 1 in the proof of Theorem 3, we have established

0 T2 = op(1).

0 0 Repeating the analyses for T1 and T2, we can show

 n (k) o T 0 = O ρp(1 − ρ)/N 1 + ku(k) − u k + kβ − β k + kγ(k) − γ k = o (1), 3 p b 0 2 b 0 2 b 0 2 p  n (k) o T 0 = O ρ(1 − ρ) ku(k) − u k + kβ − β k + kγ(k) − γ k = o (1) 4 p b 0 2 b 0 2 b 0 2 p

D Additional Technical Details

D1 Definitions

We adopt the following definitions of sub-Gaussian and sub-exponential random variables.

Definition A2 (Sub-Gaussian and Sub-Exponential Random Variables). The sub-Gaussian parameter for a random variable V is defined as

$$\|V\|_{\psi_2} = \inf\big\{\sigma > 0 : E\big(e^{V^2/\sigma^2}\big) \le 2\big\}.$$

The random variable $V$ is sub-Gaussian if $\|V\|_{\psi_2}$ is finite. The sub-Gaussian parameter for a random vector $U$ is defined as

$$\|U\|_{\psi_2} = \sup_{\|v\|_2=1}\|v^TU\|_{\psi_2}.$$

The sub-exponential parameter for a random variable $V$ is defined as

$$\|V\|_{\psi_1} = \inf\big\{\nu > 0 : E\big(e^{|V|/\nu}\big) \le 2\big\}.$$

The random variable $V$ is sub-exponential if $\|V\|_{\psi_1}$ is finite. The more general Orlicz norm for $\alpha \in (0, 1)$ is defined as

$$\|V\|_{\psi_\alpha} = \inf\big\{\nu > 0 : E\big(e^{(|V|/\nu)^\alpha}\big) \le 2\big\}.$$
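As a quick empirical companion to Definition A2, the sketch below estimates the $\psi_2$ parameter of a sample by bisecting on $\sigma$ in the defining condition $E\exp(V^2/\sigma^2) \le 2$, with the expectation replaced by a sample average; the function name and the bracketing interval are illustrative choices of ours.

```python
import numpy as np

def empirical_psi2_norm(v, lo=1e-6, hi=None, tol=1e-6):
    """Monte Carlo estimate of ||V||_{psi_2} = inf{sigma > 0 : E exp(V^2/sigma^2) <= 2}.

    `v` is a 1-d array of draws of V; the population expectation is replaced
    by the sample mean, so this is only a rough empirical proxy.
    """
    v = np.asarray(v, dtype=float)
    if hi is None:
        hi = 10.0 * (np.max(np.abs(v)) + 1e-12)   # large enough that the condition holds
    def condition(sigma):
        with np.errstate(over="ignore"):
            return np.mean(np.exp((v / sigma) ** 2)) <= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if condition(mid):
            hi = mid          # condition holds, so try a smaller sigma
        else:
            lo = mid
    return hi

# example: standard normal draws have psi_2 norm sqrt(8/3), roughly 1.63
rng = np.random.default_rng(0)
print(empirical_psi2_norm(rng.standard_normal(100_000)))
```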

Mimicking the (minimal) Restricted Eigenvalue condition on the minimal eigenvalue of a matrix over a cone (Bickel et al., 2009), we define the maximal Restricted Eigenvalue in Definition A3.

Definition A3 (Maximal Restricted Eigenvalue). For the cone associated with an index set $O \subset \{1, \ldots, p\}$,

$$C_\gamma(\xi, O) := \big\{v \in \mathbb{R}^{p+q+1} : \|v_{O^c}\|_1 \le \xi\|v_O\|_1\big\}, \quad (A.53)$$

we define the maximal Restricted Eigenvalue of a matrix $\Sigma$ as

$$RE_{\max}(\xi, O; \Sigma) = \sup_{v\in C_\gamma(\xi, O)\setminus\{0\}}\frac{\sqrt{v^T\Sigma v}}{\|v\|_2}. \quad (A.54)$$
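Because the supremum in (A.54) is over a cone rather than the whole space, it is not an ordinary eigenvalue computation. The sketch below gives a Monte Carlo lower bound by sampling directions and pushing them into the cone; the sampling scheme and function name are our own illustrative choices, and the returned value only bounds $RE_{\max}$ from below.

```python
import numpy as np

def re_max_lower_bound(Sigma, O, xi=3.0, n_draws=5000, seed=0):
    """Monte Carlo lower bound for RE_max(xi, O; Sigma) in Definition A3."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    mask = np.zeros(d, dtype=bool)
    mask[list(O)] = True
    best = 0.0
    for _ in range(n_draws):
        v = rng.standard_normal(d)
        budget = xi * np.sum(np.abs(v[mask]))
        off_l1 = np.sum(np.abs(v[~mask]))
        if off_l1 > budget:                 # project into the cone by shrinking v_{O^c}
            v[~mask] *= budget / off_l1
        val = np.sqrt(v @ Sigma @ v) / np.linalg.norm(v)
        best = max(best, val)
    return best
```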

D2 Statements of Existing Results

The properties in Lemmas A4 and A5 are covered in Vershynin (2018), Chapters 2 and 4.

Lemma A4 (Properties of sub-Gaussian and sub-exponential random variables).

a) Tail probability:

$$P(|V| \ge x) \le 2e^{-x/\|V\|_{\psi_1}}, \qquad P(|V| \ge x) \le 2e^{-x^2/\|V\|_{\psi_2}^2};$$

b) Moments: $E(|V|^r) \le \min\{\kappa_{\psi,1}\|V\|_{\psi_1}^r, \kappa_{\psi,2}\|V\|_{\psi_2}^r\}$ with $\kappa_{\psi,1} = 2\,r!$ and $\kappa_{\psi,2} = r\Gamma(r/2)$, and $E(|V|) \le \sqrt{\pi}\|V\|_{\psi_2}$;

c) Hierarchy: $\|V\|_{\psi_1} \le \|V\|_{\psi_2}$;

d) Arbitrary addition: $\|\sum_{i=1}^m V_i\|_{\psi_2} \le m\max_{i=1,\ldots,m}\|V_i\|_{\psi_2}$ and $\|\sum_{i=1}^m V_i\|_{\psi_1} \le m\max_{i=1,\ldots,m}\|V_i\|_{\psi_1}$;

e) Multiplication with a bounded random variable: $\|V_1V_2\|_{\psi_2} \le K\|V_1\|_{\psi_2}$ and $\|V_1V_2\|_{\psi_1} \le K\|V_1\|_{\psi_1}$ when $|V_2| \le K$ almost surely;

f) Multiplication of sub-Gaussian random variables: $\|V_1V_2\|_{\psi_1} \le \|V_1\|_{\psi_2}\|V_2\|_{\psi_2}$; in particular, $\|V_1\|_{\psi_1} \le \|V_1\|_{\psi_2}/\sqrt{\log(2)}$;

g) Hoeffding's inequality: if $V_1, \ldots, V_m$ are independent mean-zero sub-Gaussian random variables, then for $t > 0$,

$$P\bigg(\bigg|\sum_{i=1}^m V_i\bigg| \ge t\bigg) \le 4\exp\bigg(-\frac{t^2}{\kappa_{\psi,3}\sum_{i=1}^m\|V_i\|_{\psi_2}^2}\bigg), \qquad \kappa_{\psi,3} = 8;$$

h) Bernstein's inequality: if $V_1, \ldots, V_m$ are independent mean-zero sub-exponential random variables, then for $t > 0$, with $\kappa_{\psi,4} = 16$ and $\kappa_{\psi,5} = 4$,

$$P\bigg(\bigg|\sum_{i=1}^m V_i\bigg| \ge t\bigg) \le 2\exp\bigg[-\min\bigg\{t^2\bigg(\kappa_{\psi,4}\sum_{i=1}^m\|V_i\|_{\psi_1}^2\bigg)^{-1},\ t\big(\kappa_{\psi,5}\max_{i=1,\ldots,m}\|V_i\|_{\psi_1}\big)^{-1}\bigg\}\bigg].$$
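For a quick sanity check of Lemma A4-h, the snippet below compares the empirical tail probability of a sum of Rademacher variables (bounded, hence sub-exponential, with $\|V_i\|_{\psi_1} = 1/\log 2$ exactly) against the stated bound; the choice of distribution, sample sizes and thresholds is arbitrary and only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, reps = 200, 20_000
psi1 = 1.0 / np.log(2.0)            # exact psi_1 norm of a Rademacher variable
sums = rng.choice([-1.0, 1.0], size=(reps, m)).sum(axis=1)

for t in (20.0, 60.0, 100.0):
    emp_tail = np.mean(np.abs(sums) >= t)
    # Bernstein bound from Lemma A4-h with kappa_{psi,4} = 16 and kappa_{psi,5} = 4
    bound = 2.0 * np.exp(-min(t ** 2 / (16.0 * m * psi1 ** 2), t / (4.0 * psi1)))
    print(f"t = {t:5.1f}: empirical tail {emp_tail:.4f} <= bound {min(bound, 1.0):.4f}")
```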

Lemma A5. Let $V_1, \ldots, V_m$ be i.i.d. sub-Gaussian vectors in $\mathbb{R}^p$ such that

$$\|v^TV\|_{\psi_2}^2 \le K^2E\{(v^TV)^2\} \quad \text{for some } 1 \le K < \infty.$$

Then,

$$\bigg\|\frac{1}{m}\sum_{i=1}^m V_iV_i^T - E(VV^T)\bigg\|_2 = O_p\big(p/m + \sqrt{p/m}\big).$$

From Negahban et al. (2010) and Huang and Zhang (2012), among other works, we have the following results concerning the LASSO under generalized linear models.

Lemma A6. Under Assumptions 1b, 2a and 2b,

$$P\Big(\ell_{\rm imp}(\gamma_0 + \Delta) - \ell_{\rm imp}(\gamma_0) - \Delta^T\dot{\ell}_{\rm imp}(\gamma_0) \ge \kappa_{rsc,1}\|\Delta\|_2\big\{\|\Delta\|_2 - \kappa_{rsc,2}\sqrt{\log(p+q)/n}\,\|\Delta\|_1\big\},\ \forall\|\Delta\|_2 \le 1\Big) \ge 1 - \kappa_{rsc,3}e^{-\kappa_{rsc,4}n};$$
$$P\Big(\ell_{PL}(\beta_0 + \Delta) - \ell_{PL}(\beta_0) - \Delta^T\dot{\ell}_{PL}(\beta_0) \ge \kappa_{rsc,1}\|\Delta\|_2\big\{\|\Delta\|_2 - \kappa_{rsc,2}\sqrt{\log(p)/N}\,\|\Delta\|_1\big\},\ \forall\|\Delta\|_2 \le 1\Big) \ge 1 - \kappa_{rsc,3}e^{-\kappa_{rsc,4}N}.$$

The negative log-likelihoods are defined in (8) and (10), and their gradients are defined in (12). See Definition A1 for the conditional expectation notation. The constants are all absolute.

The two inequalities in Lemma A6 are direct applications of Proposition 2 (page 22) of Negahban et al. (2010). We can construct an auxiliary loss function to prove the following lemma.

Lemma A7. Under Assumptions 1b, 2a and 2b,

$$P\bigg(\frac{1}{N_{k'}}\sum_{i\in I_{k'}\cup J_{k'}}g'\big(\hat{\beta}^{(k,k')T}X_i\big)(\Delta^TX_i)^2 \ge 2\kappa^*_{rsc,1}\|\Delta\|_2^2 - \kappa^*_{rsc,1}\kappa^*_{rsc,2}\sqrt{\log(p)/N_{k'}}\,\|\Delta\|_2\|\Delta\|_1,\ \forall\|\Delta\|_2 \le 1\ \bigg|\ \big\|\hat{\beta}^{(k,k')} - \beta_0\big\|_2 \le \frac{\sigma_{\min}^2}{2\sigma_{\max}^3}\bigg) \ge 1 - \kappa^*_{rsc,3}e^{-\kappa^*_{rsc,4}N}.$$

The constants are all absolute.

Proof of Lemma A7. First, we show that $\sqrt{g'\big(\hat{\beta}^{(k,k')T}X_i\big)}\,X_i$ is a sub-Gaussian random vector whose second moment has all eigenvalues bounded away from infinity and zero. Under Assumptions 1b and 2a, we may apply Lemma A4-e,

$$\Big\|v^T\sqrt{g'\big(\hat{\beta}^{(k,k')T}X_i\big)}\,X_i\Big\|_{\psi_2} \le \sqrt{M}\|v^TX_i\|_{\psi_2} \le \sqrt{M}\sigma_{\max}\|v\|_2/\sqrt{2}.$$

(k,k0)T T n 0   T c o T 2 2 2 0 v Ei∈Ik0 ∪Jk0 g βb Xi XiXi | Dk v ≤ ME{(v Xi) } ≤ Mkvk2σmax.

66 We derive the lower bound for the minimal eigenvalue of its second moment from Assump- tions 1b, 2a, 2a, 2b-i, the Cauchy-Schwartz inequality and Lemma A4-b,

(k,k0)T T n 0   T c o 0 v Ei∈Ik0 ∪Jk0 g βb Xi XiXi | Dk v

T 0 T T c 0 ≥v Ei∈Ik0 ∪Jk0 {g (β0 Xi)XiXi | Dk }v (k,k0)T h T 2 n T  o c i 0 − Ei∈Ik0 ∪Jk0 (v Xi) g(β0 Xi) − g βb Xi | Dk

  (k,k0) T   2 2 T 2   c ≥kvk2σmin − MEi∈I 0 ∪J 0 (v Xi) β0 − βb Xi | Dk0 k k v u " 2 #  (k,k0)T  2 2 u T 4 c ≥kvk2σmin − MtE{(v Xi) }Ei∈Ik0 ∪Jk0 β0 − βb Xi | Dk0

 (k,k0)  2 2 3 ≥kvk2 σmin − Mσmax βb − β0 . 2

(k) σ2 min Whenever βb − β0 ≤ 2σ3 , we have 2 max

(k,k0)T T n 0   T c o 2 2 0 v Ei∈Ik0 ∪Jk0 g βb Xi XiXi | Dk v ≥ kvk2σmin/2.

Second, we construct an auxiliary least square loss to apply Negahban et al. (2010).

Let εi be independent standard normal random variables. Construct the loss function ( )2 r 0 0  (k,k )T  (k,k ) 1 X T 0 L (v) = εi + (v0 − v) g βb Xi Xi . Nk0 i∈Ik0 ∪Jk0 By the design, we have

(k,k0)T (k,k0) (k,k0) T ∂ (k,k0) 1 X 0   T 2 L (v0 + ∆) − L (v0) − ∆ L (v0 + ∆) = g βb Xi (∆ Xi) . ∂v Nk0 i∈Ik0 ∪Jk0

We apply Proposition 2 in Negahban et al. (2010) for L(k,k0)(v) conditionally on out-of-fold 0 n (k,k ) σ2 o c min data Dk0 and the event βb − β0 ≤ 2σ3 to finish the proof. 2 max

Lemma A8. For a constant $\kappa_{\rm cone}(n, p, q, \varepsilon_r) \asymp \sqrt{\log(p+q)/n}$, the event

$$\Omega_{\rm cone} = \bigg\{\|\dot{\ell}_{\rm imp}(\gamma_0)\|_\infty = \bigg\|\frac{1}{n}\sum_{i=1}^n W_i\{g(\gamma_0^TW_i) - Y_i\}\bigg\|_\infty \le \kappa_{\rm cone}(n, p, q, \varepsilon_r)\bigg\}$$

occurs with probability greater than $1 - \varepsilon_r$ under Assumptions 1a and 2a. Setting $\lambda_\gamma = 2\kappa_{\rm cone}(n, p, q, \varepsilon_r)$, we have on the event $\Omega_{\rm cone}$ that

$$\hat{\gamma} - \gamma_0 \in C_\gamma(3, {\rm supp}(\gamma_0)) = \big\{v \in \mathbb{R}^{p+q+1} : \|v_{O_\gamma^c}\|_1 \le 3\|v_{O_\gamma}\|_1\big\},$$

where $O_\gamma = \{j : \gamma_{0,j} \neq 0\}$ is the index set of the nonzero coefficients of $\gamma_0$. Moreover, we have

$$\|\hat{\gamma} - \gamma_0\|_2 = O_p\big(\sqrt{s_\gamma\log(p+q)/n}\big).$$

The concentration on the event $\Omega_{\rm cone}$ is established by a union bound over element-wise concentration, which is in turn obtained from Bernstein's inequality for sub-exponential random variables (Lemma A4-h). The rest of Lemma A8 follows from Lemma 1 of Huang and Zhang (2012) (page 1843 of the issue) and Corollary 5 (page 23) of Negahban et al. (2010).
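Lemma A8 concerns the initial LASSO fit of the imputation model on the labeled data. A minimal sketch of that fit, using scikit-learn's penalized logistic regression as a stand-in for the paper's actual implementation (the solver, the penalty parametrization C = 1/(n·lam_gamma), and the function name are our assumptions), is given below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_gamma_lasso(W_lab, Y_lab, lam_gamma):
    """l1-penalized logistic regression of Y on W = (X, S) over the labeled data.

    scikit-learn parametrizes the penalty through C; with the average log-loss,
    C = 1 / (n * lam_gamma) corresponds to a penalty of lam_gamma * ||gamma||_1.
    """
    n = len(Y_lab)
    model = LogisticRegression(penalty="l1", C=1.0 / (n * lam_gamma),
                               solver="liblinear", fit_intercept=True)
    model.fit(W_lab, Y_lab)
    # return the fitted coefficient vector (intercept first) as gamma_hat
    return np.concatenate([model.intercept_, model.coef_.ravel()])
```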
