High-dimensional predictive regression in the presence of

Bonsoo Koo Heather Anderson Monash University Monash University

Myung Hwan Seo† Wenying Yao Seoul National University Deakin University

Abstract

We propose a Least Absolute Shrinkage and Selection Operator (LASSO) estimator of a predictive regression in which stock returns are conditioned on a large set of lagged co- variates, some of which are highly persistent and potentially cointegrated. We establish the asymptotic properties of the proposed LASSO estimator and validate our theoreti- cal findings using simulation studies. The application of this proposed LASSO approach to forecasting stock returns suggests that a cointegrating relationship among the persis- tent predictors leads to a significant improvement in the prediction of stock returns over various competing forecasting methods with respect to squared error.

Keywords: Cointegration, High-dimensional predictive regression, LASSO, Return pre- dictability.

JEL: C13, C22, G12, G17

⇤The authors acknowledge financial support from the Australian Research Council Discovery Grant DP150104292. †Corresponding author. His work was supported by Promising-Pioneering ResearcherProgram through Seoul National University(SNU) and from the Ministry of Education of the Republic of Korea and the Na- tional Research Foundation of Korea (NRF-0405-20180026). Author contact details are [email protected], [email protected], [email protected], [email protected]

1 1 Introduction

Return has been of perennial interest to financial practitioners and scholars within the economics and finance professions since the early work of Dow (1920). To this end, empirical studies employ linear predictive regression models, as illustrated by Stam- baugh (1999), Binsbergen et al. (2010) and the references therein. Predictive regression allows forecasters to focus on the prediction of asset returns conditioned on a set of one-period lagged predictors. In a typical example of a predictive regression, the dependent variable is the rate of return on stocks, bonds or the spot rate of exchange, and the predictors usually consist of financial ratios and macroeconomic variables. The literature that has investigated the predictability of stock returns is very large, and it has proposed many economically plausible predictive variables and documented their em- pirical forecast performance. Koijen and Nieuwerburgh (2011) and Rapach and Zhou (2013) provide recent surveys of this literature. However, despite the plethora of possible predictors that have been suggested, there is no clear-cut evidence as to whether any of these variables are successful predictors. Indeed, many studies have concluded that there is very little dif- ference between predictions derived from a predictive regression and the historical mean of stock returns. Welch and Goyal (2008) provided a detailed study of stock return predictability based on many predictors suggested in the literature, and concluded that ”a healthy skepticism is appropriate when it comes to predicting the equity premium, at least as of 2006”. predictive regression has provided the main estimation tool for empirical efforts to forecast stock returns, but this involves several econometric issues. See Phillips (2015) for a summary of these issues, together with relevant econometric theory. Among these issues, we restrict our attention to two specific problems, which we believe could be tackled by using a different estimation approach. Firstly, the martingale difference features of excess returns are not readily compatible with the high persistence in many of the predictors suggested by the empirical literature. It is well-documented that excess returns are stationary and that financial ratios are often highly persistent and potentially non-stationary processes. The regression of a stationary variable on a non-stationary variable provides an example of an unbalanced regression, and such a regression is difficult to reconcile with the assumptions required for conventional estimation.1 This so-called unbalanced regression has the potential

1Although the issues associated with unbalanced regression can be tackled by incorporating an accommo- dating error structure such as (fractionally) integrated noise, this error structure is not readily compatible with making predictions.

2 to induce substantial small sample estimation bias, and it also renders many of the usual types of conducted in standard regression contexts invalid. See Campbell and Yogo (2006), Phillips (2014) and Phillips (2015) for more details. Secondly, the literature does not provide guidance on which predictors are to be included or excluded from predictive regressions. Given a large pool of covariates available for return predictability, selecting an appropriate subset of relevant covariates can improve the precision of prediction. However, the choice of the right predictors is extremely difficult in practice, and precision is often sacrificed for parsimony, especially if there are many predictors but the sample size used for estimation is small. Quite commonly a single predictor, or only a handful of predictors are used. Against this backdrop, our paper studies the use of the highly celebrated LASSO (orig- inally proposed by Tibshirani, 1996) to estimate a predictive regression when some of the predictors are persistent and it is possible that some of these persistent predictors are cointe- grated. We are particularly interested in (i) the LASSO’s capability in in this context; and (ii) the properties of LASSO estimation when potentially cointegrated predictors are involved, especially if such properties can circumvent issues associated with unbalanced regression. As will become clear in our empirical study, there is at least one cointegrating relationship between persistent predictors for stock returns, which lends strong support to an estimation approach that allows for cointegration. In a setting in which the target vari- able is stationary and there are many persistent predictors, the fact that the model selection mechanism of the LASSO eliminates variables with small coefficients, plays a role with re- spect to excluding persistent variables that do not contribute to a cointegrating relationship. This leads to ”balance” between the properties of the target variable and a linear combination of the regressors, and tractable prediction errors. It is also possible that the in- corporation of cointegration in this context of predicting returns will improve forecasts, as Christoffersen and Diebold (1998) have noted in other settings. Although predictive regression enables us to analyze the explanatory power of each indi- vidual predictor, our primary objective is to improve the overall prediction of stock returns given a set of predictors. To do this, we focus on the out-of-sample forecasting performance of the set of predictors selected via LASSO estimation. In addition, we investigate the large sample properties of the LASSO estimation of the linear predictive regression. We show the consistency of the LASSO estimator of coefficients on both stationary and nonstationary pre- dictors within the linear predictive regression framework. Furthermore, in the presence of

3 cointegration, we derive the limiting distribution of the cointegrating vector under certain regularity conditions, in contrast with the usual LASSO estimation, in which case the limit- ing distribution of the estimator is not readily available. To the best of our knowledge, this paper is the first work that establishes the limiting distribution for the LASSO estimator of a cointegrating vector. The remainder of this paper is organized as follows. We briefly discuss related literature in Section 2. Section 3 introduces predictive regression models, and Section 4 discusses the LASSO selection of relevant predictors within the model framework. In particular, Section 4.1 demonstrates the consistency of the proposed LASSO estimators, and Section 4.2 derives the limiting distribution of the LASSO estimator of a cointegrating vector. Simulation results are discussed in Section 5. Also, a real application of our methodology based on the Welch and Goyal (2008) set is given in Section 6. Section 7 concludes. All proofs of theorems and lemmas are found in A.

2 Relevant literature

Most of the literature on forecasting stock returns has focused on whether a certain variable has predictive power. The present-value relationship between stock prices and their cash flows suggests a wide array of financial ratios and other financial variables as possible pre- dictors. Documented examples include the earnings-price ratio (Campbell and Shiller, 1988), the yield spread (Fama and Schwert, 1977; Campbell, 1987), the book-to-market ratio (Fama and French, 1993; Pontiff and Schall, 1998), the dividend-payout ratio (Lamont, 1998) and the short interest rate (Rapach et al., 2016). Various macroeconomic variables such as the consumption-wealth ratio (Lettau and Ludvigson, 2001) and the investment rate (Cochrane, 1991) have been considered as well. More recently, Neely et al. (2014) have suggested the use of technical indicators. An issue that arises when returns are conditioned on just one predictor, is that prediction can be biased if other relevant predictors are omitted. This issue might partially explain the failure of single predictor models, although the use of several predictors in a standard regression set-up has rarely improved out of sample forecasts. Welch and Goyal (2008) noted that despite the poor forecast performance of individual predictors over long periods of time, there are episodes during which an individual predictor has power. The latter authors used their findings to claim that most (return forecasting) models are unstable.

4 Recent research on forecasting stock returns has moved away from single predictor mod- els, placing more emphasis on using the large pool of predictors suggested in the literature and drawing on modern econometric forecasting techniques. This research comes with an ac- knowledgment that many predictors are potentially useful in forecasting stock returns in one way or another, but the high dimension of the set of predictors can pose a significant challenge due to the degrees-of-freedom problem (Ludvigson and Ng, 2007), and the temporal insta- bility of the predictive power of any single predictor renders forecasting extremely difficult (Rossi, 2013). Forecast combination methods (see Timmermann, 2006) are now popular, and have been used by Rapach et al. (2010) to draw on information from a large set of predictors while alleviating the effects of structural instability. Other forecasts that rely on a multivariate approach include the dynamic factor approach (Ludvigson and Ng, 2007), Bayesian averag- ing (Cremers, 2002), and use of a system of equations involving vector autoregressive models (Pastor´ and Stambaugh, 2009). Our LASSO approach in the presence of cointegration is an addition to this strand of lit- erature that uses multiple predictors. Since the seminal paper of Tibshirani (1996), LASSO estimation has gained popularity in various fields of study. See Fan and Lv (2010) and Tibshi- rani (2011) for recent developments. The wide variety of predictors available for forecasting stock returns necessitates a wise selection of predictors that capture the quintessential dy- namics of future returns. Moreover, the unstable nature of their prediction power at different time periods renders covariate selection even more crucial. A few studies have investigated the use of LASSO in the presence of non-stationary vari- ables. Caner and Knight (2013) proposed a test involving Bridge estimators in order to differentiate unit root from stationary alternatives. Liao and Phillips (2015) applied the adaptive LASSO approach to cointegrated system in order to estimate a vector error correc- tion model (VECM) with an unknown cointegrating rank and unknown transient lag order. Further, Wilms and Croux (2016) have explored the forecasting ability of a VECM using an ap- proach based on an l1 penalized log-likelihood. Quite recently, Kock (2016) further discussed the oracle efficiency of the adaptive LASSO in stationary and non-stationary autoregressions. These papers differ from ours in many aspects. First of all, instead of using autoregressive models, our paper focuses on predictive regressions in which stationary and non-stationary predictors co-exist. Secondly, we allow for the presence of cointegration and have a different objective from the other studies in the literature, which leads to a different approach. Although LASSO has been widely employed in various fields of studies, the financial lit-

5 erature has not made extensive use of this powerful model selection and estimation tool to address stock return predictability. Only two recent papers have emerged. Buncic and Tis- chhauser (2016) and Li and Tsiakas (2016) suggest that LASSO contributes to better forecasts of stock returns because it enables researchers to select the most relevant factors among a broad set of predictors. Both papers are closely related to the work of Ludvigson and Ng (2007) and Neely et al. (2014). However, each falls short in explaining the theoretical ground that underlies their empirical results, and neither takes the impact of persistent regressors on the predictions into account. Moreover, both Buncic and Tischhauser (2016) and Li and Tsi- akas (2016) impose sign constraints on their estimated coefficients to improve their forecasts. In contrast, we do not require any constraint on the sign of coefficients, and we explicitly ac- count for persistence in some of our predictors. The main difference between our work here is that we provide theoretical credence to the use of the LASSO in a linear predictive setting, when persistent predictors are taken into consideration. A fundamental problem associated with forecasting stock returns is model uncertainty, with respect to which predictors are relevant, and when. Timmermann (2006) pointed out that there are two schools of thought on how to improve forecasting when the underlying model is uncertain. The first suggests the use of forecast combinations, which has been actively investigated (see Elliott et al., 2013; Koo and Seo, 2015, among many others), and used by Rapach et al. (2010) as mentioned above. The other school of thought considers the entire set of possible predictors in a unified framework, allowing the data to select the best combination of predictors in an attempt to search for the best forecasting model. Most 2 literature based on `1-regularization, including the LASSO, falls under this second approach. The LASSO does not explicitly address the unstable nature of stock return predictability, although it is possible to use the automated selection of the LASSO at different points in time by rolling an estimation window throughout the sample period, and allowing the LASSO to pick different predictors for these different estimation windows and provide different coeffi- cient estimates for each set of selected predictors. We adopt this approach in our empirical analysis of stock return predictability in Section 6.

2The former approach pools forecasts whereas the latter approach pools information. See Timmermann (2006) for the pros and cons of these two different approaches.

6 3 Econometric framework: Predictive regression

Our econometric framework is a linear predictive regression in which a large set of condition- ing variables is available, but the prediction power of each variable could be unstable. Some of the conditioning variables could be non-stationary, in particular integrated of order 1, i.e. I(1) variables and might be cointegrated. Our model is framed in the context of stock return predictability because forecasting stock returns epitomizes the issues we are attempting to address: (i) a possibly large set of conditioning variables; (ii) some of them are quite persis- tent; and (iii) the prediction power of each conditioning variable is unstable. Let us consider the following linear predictive regression model:

yt = xt0 1,T bT + zt0 1aT + ut, where b Rk for k • with x being I(1), and a Rp for p • with z being I(0). T 2 ! t,T T 2 ! t I(q) denotes an integrated process of order q. The above array structure could also imply that the true values of the coefficients and the number of covariates may vary as the sample size, T changes. In what follows, however, we shall use the simpler notation that dispenses with the T subscript,

yt = xt0 1b + zt0 1a + ut. (1)

The high persistence of xt is compatible with the property of many commonly used stock 3 return predictors, such as the dividend-price ratio and the earning-price ratio. We use zt in model (1) to represent the other group of non-persistent return predictors, for example the long term bond rate and inflation. See Neely et al. (2014) and Ludvigson and Ng (2007) for a pool of plausible financial and macroeconomic predictors. Model (1) is in line with the most recent literature on stock return predictability. It is a multi-factor model employed in the empirical asset pricing literature. The primary objective is to estimate the conditional mean of yt, given a mixture of possibly high-dimensional sta- tionary and non-stationary conditioning variables available at time t 1. In the context of the present paper, the prediction objective yt is the rate of return on a broad market index over the risk free rate, commonly referred to as the equity premium. It is commonly accepted that the equity premium displays martingale difference features, whereas many return predictors are quite persistent. The stationarity feature of the martingale difference sequence on the left-

3Welch and Goyal (2008) showed that many other return predictors including the treasury bill rate, the long term bond yield and the term spread are also highly persistent, with a first order above 0.99.

7 hand side of model (1) hardly matches with the highly persistent predictors on the right-hand side. Furthermore, a close examination of the historical data of various financial ratios reveals a strong co-movement among these persistent variables, which lends strong credence to the existence of at least one cointegrating relationship among them. Without any loss of generality in the forecasting context, we assume that there is at least one cointegrating relationship within xt. Specifically, the cointegrating vector d and the corre- 4 sponding constant c are such that xt0 b = xt0dc. For identification when it comes to estimating the true parameters d0 and c0, we define b0 = c0d0, where d0 is the standardized cointegrat- 0 ing relation of unit `1 norm and c is the response of the outcome variable to the cointegrated term. b0 = 0 if and only if c0 = 0, under which d0 is not identified.

4 LASSO selection of predictors

This section examines the automated selection of predictors given many covariates whose cardinality can be larger than the sample size but sparsity is assumed. In particular, we propose LASSO estimation in which an `1-type penalty is applied. Note that in matrix form, model (1) can be written as

Y = Xb + Za + u. (2)

The LASSO estimator for (b0, a0)0 can be obtained by

ˆ 0 2 (b ,ˆa0)0 = arg min Y Xb Za 2 T + l( b 1 + a 1), (3) b,a k k k k k k where denotes an ` norm, l is a tuning parameter, and T is the sample size. k·km m Equation (3) describes the usual LASSO objective function that might, but need not in- corporate a cointegrating relationship. We reparameterize the first portion of this objective to investigate how the presence of a cointegrating relationship affects the LASSO estimation

4The interpretation of d is subtle in the case of multiple cointegrating relationships. In such a case, d might be the optimal linear combination of all the cointegrating vectors for predicting the outcome. Importantly, this paper does not seek to identify the number of the cointegrating relationships nor to estimate each cointegrating vector separately. For these, one needs to set up a cointegration system, for which we refer to Johansen (1995).

8 approach. With c = b , d = b/c and the true parameters b0 = c0d0 and a , k k1 0

Y Xb Za = u X(b b0) Z(a a0) = u X(d d0)c Xd0(c c0) Z(a a0) X = u pT(d d0)c W(c c0) Z(a a0) pT = u DB (q q0), c

0 where q =(pTd0, g0)0 with g =(c, a0)0, W = Xd , D =(X pT, G) with G =(W, Z), and

Bc denotes a diagonal matrix whose first k diagonal elements are c and the others are 1.

Likewise, Bcˆ denotes the matrix replacing c in Bc with cˆ.

In addition, the `1 penalty terms can be rewritten as

l( b + a )=l(c + a )=l g , k k1 k k1 k k1 k k1 with the constraint d = 1 holding by construction. k k1 Consequently, in the presence of a cointegrating relationship, the infeasible LASSO esti- mator for q can be represented by

qˆ = arg min u DB (q q0) 2 T + l g . (4) k c k2 k k1 q: d 1=1 k k Note that (4) is a constrained minimisation problem given that the estimator for q is obtained under the constraint that d = 1. Note, also, that l g can be replaced by l q without k k1 k k1 k k1 changing the minimisation problem in this set-up. This formulation separates the estima- tion of cointegrating relation and the effect of the error correction term and shows that the cointegrating relation is estimated without shrinkage. This is useful for inference. In the subsection that follows, we establish the large sample properties associated with the proposed LASSO estimator, qˆ =(pTdˆ 0,ˆg0)0. We start with the consistency of the proposed estimator and its convergence rate, and proceed to the large sample distribution of the LASSO estimator for the cointegrating vector d. All proofs for our theoretical results are provided in Appendix A. We need the following set of regularity conditions to derive the statistical properties of LASSO estimation of q in (4).

Assumptions

A1 Each element of the vector g belongs to a compact set and d = 1 for all d. k k1

9 A2 Let s = S , q be the subvector of q whose elements are in the index set S, and Sˆ = 0 | 0| S D0D/T be the properly scaled Gram matrix. Then, there exists a sequence of positive 0 constants fT such that for all q whose first k elements do not lie in the linear span of d

and q c 1 3 q 1, the following holds k S0 k  k S0 k

2 2 q q0Sˆ q s /f , k S0 k1  0 T with probability approaching one.

[Tt] A3 Let the nonstationary predictors be x = Â v . Then v u , and z u are t k=1 k { t t} tj t j=1,...,p centered a-mixing sequences such that for some positive g1, g2 > 1, a, and b ,

a (m) exp ( amg1 )  P z > z H (z) := exp 1 (z/b)g2 for all j (5) tj  1 1 1 g = g1 + g2 < 1,

⇣ g (1 2⌘g (1 g)/g) ln p = O T 1^ 1 . (6) ⇣ ⌘

Assumption A1 is concerned with restrictions on the parameter space and a normalization restriction on the cointegrating vector. Compatibility conditions such as A2 are common in the large sample theory for LASSO estimation, and their purpose is to place weak restrictions on the design matrix of regressors so as to bound the prediction and parameter estimation errors in terms of the `1 norm. Here, the format of D accounts for the different dimensions of the persistent and stationary regressors, and given the asymptotic orthogonality between these two sets of regressors, one can think of our compatibility condition as providing separate conditions for each set. The gram matrix of the persistent regressors does not converge to a deterministic matrix, and therefore the compatibility factor fT is allowed to vary over T and converge to zero. In the special case in which k is fixed, we may let it decay at a log T rate. This is reflected in the deviation bound in Theorem 1. Assumption A3 is a technical condition required for Lemmas 1 and 2. Notably, Assump- tion A3 implies Assumption 1 in Hansen (1992), which ensures weak convergence of the sum of the product of xt and ut in (1) to a stochastic integral. We can relax the tail condition for the stationary regressors zt at the expense of slower rates of convergence. It is worth noting that these conditions are quite general, accounting for the time series features of the data and the model in Section 3. For instance, we do not require that the error

10 process is the martingale difference sequence. It suffices that v and u are strong-mixing. { t} { t}

4.1 Consistency and rate of convergence

Theorem 1 Let p = O (Tr) for some r < • and k = O pT . Then, under the regularity conditions A1-A3 specified above, ⇣ ⌘

s l DB (qˆ q0) 2 T + B (qˆ q0) = O 0 , (7) k cˆ k2 k cˆ k1 p f ✓ T ◆ where s is the number of non-zero elements in q0 and l 1 = o pT/ ln p pT/ (Th k ln k) 0 ^ ^ for any h > 0. Furthermore, if (min q ) 1 = o((s l/f )⇣⇣1), then the⌘ LASSO⇣ selects all relevant⌘⌘ j | S0,j| 0 T predictors with probability approaching one.

Proof. See Appendix A. Theorem 1 establishes deviation bounds for the prediction error and the parameter esti- mation error in terms of the `1-norm. It permits the numbers of both stationary and non- stationary variables to diverge. However, our bound is more sensitive to the number k of nonstationary variables in the sense that k increases the bound more quickly than does p. Note that when k is fixed or increases only at a logarithmic rate, the deviation bound be- comes s0/(pTfT) up to a logarithmic factor. Furthermore, when k is allowed to increase to infinity faster, the compatibility sequence fT may need to decay to zero at a faster rate. Theorem 1 ensures that LASSO can serve as a predictor-selection tool in a linear predictive regression, in that P S Sˆ 1 where S := j : q0 = 0 for q0 with the corresponding 0 ⇢ ! 0 { j 6 } estimate Sˆ. This is soeven when there are two different types of covariates, either stationary

(zt) or nonstationary (xt), and the dimensions for zt are very large as long as the dimen- sion of xt increases at most at a logarithmic rate. This property is essential for stock return predictability because the number of plausible return predictors is very large and could be nonstationary or stationary. LASSO is an automated method that selects relevant predictors under a sparsity assumption so as to achieve parsimony without sacrificing precision.5 This becomes important when the number of observations is relatively small compared to the number of available predictors.

For the case where the dimension of xt increases to infinity, Theorem 1 shows that the convergence rate for qˆ is slower than Op( s0 ln p/T), which is the usual rate of convergence

5Here, sparsity that although there mightp be many potential predictors, the number of truly relevant predictors is small.

11 of the LASSO estimator for linear models when the data is independent (see Buhlmann¨ and van de Geer, 2011, page 106). The slower rate of convergence occurs in our case because we allow for quite general dependence. However, it is worth noting that due to the presence of cointegrating relationship, the rate of convergence for dˆ is faster than the usual rate of 1 convergence for LASSO estimation and can achieve the oracle rate of T under some proper conditions as shown below.

4.2 Large sample distribution of the LASSO estimator for the cointegrating vector

This section discusses the large sample distribution of the proposed estimator for the cointe- grating vector, d, when its dimension k is fixed. From (3), when cˆ anda ˆ are given, we can focus on the estimator for d, the coefficient associated with Xcˆ. The objective function (3) becomes

dˆ = arg min Y˜ X˜ d 2 T, (8) k k2 d: d 1=1 k k where Y˜ = Y Zaˆ and X˜ = Xcˆ. It is worth noting that there is no shrinkage on the cointegrating vector due to the re- striction on the parameter space, d = 1. This feature is helpful for better prediction and k k1 forecasting because the collective explanatory power coming from persistent predictors con- stituting the cointegrating vector is not compromised by LASSO estimation. This is relevant for the return predictive regression, in which it is highly likely that there exists a cointegrat- ing relationship within persistent predictors given its seemingly unbalanced nature. As can be seen from (8), we can derive the large sample distribution of dˆ based on a constrained least squares with a nonstandard norm restriction using the norm. The large sample k·k1 distribution of appropriately scaled dˆ d0 can be obtained by the minimizer of a Gaussian process.

Theorem 2 Let

1 1 0 0 2 D(b)= 2c b0 W (r)dU (r)+L + c b0 W (r)W (r)0drb, 0 | | 0 Z Z with T 1 i 1 1 L = lim E(vi jui), T • T Â Â ! i=1 j=1

12 where vt = xt xt 1 and (W ( ), U ( ))0 is the weak limit of the partial sum process of (vt, ut)0. Also · · 1 q let k be fixed, p = O(T), l = o(pT/ ln p), and s0l = O(T ) for some positive q. Then, under the regularity conditions A1-A3,

d T(dˆ d0) arg min D(b). (9) ! (d0) = 1 b:sgn 0b Âj bj d0=0 | | { j }

Proof. See Appendix A. Theorem 2 allows us to make inference on the cointegrating vector d. Theorem 2 is non- standard because there is a restriction on the parameter space. If it were not for this re- striction, we could obtain a closed-form expression by deriving the minimizer of the limit objective function D(b) in Theorem 2. Nevertheless, we can simulate the limit distribution for the purpose of making inference.6 We do not elaborate on inference here because our pri- mary objective is on forecast improvement rather than making inference on the cointegrating vector. We illustrate the finite sample performance of the asymptotic approximation in Theorem 2 by considering the following design:

yt = a + b1x1t 1 + b2x2t 1 + ut

2 where a = 0.1, b =(b1, b2)0 =(1, 1)0, var(vit)=sv = 1 for i = 1, 2 where vit = xit xit 1 i 2 and var(ut)=su = 1. We consider two cases in which T = 1000 and T = 10000 with 10000 repetitions. Without loss of generality, we assume there is no stationary regressor. Note that the constrained limit objective function we consider is

1 1 D(b˜)= 2b˜ 0[ W (r)dU (r)] + b˜ 0 W (r)W (r)0drb˜ Z0 Z0

0 ˜ ˜ ˜ subject to sgn(b )0b = Âj b j1 b0=0 where b = cb and b is as in Theorem 2. The minimizer | | { j } of the above reparameterized limit objective function with a constraint yields the limit distri- bution of T(bˆ b0). Following our design, the constraint becomes b˜ = b˜ = b¯. This implies 1 2 6The limit distribution cannot be tabulated as it is not asymptotically pivotal but it can be simulated. In ˜ ˆ 1 particular, the restriction on b can be imposed by replacing d0j with dj = dj dˆ c for some cT 0 and {| j | T } ! T c •. Due to the super-consistent convergence rate of dˆ, this event is equivalent to a restriction on the ⇥ T ! parameter space with a probability approaching one.

13 that

1 W1(r)dU (r) D(b˜)= 2b¯ 0 11 1 2 W (r)dU (r) 3 ⇣ ⌘ R0 2 4 5 1 2 1 R W (r)dr W1(r)W2(r)dr 1 +b¯2 0 1 0 . 11 1 1 2 W (r)W (r)dr W 2(r)dr 3 0 1 1 ⇣ ⌘ 0 R 1 2 R 0 2 4 5 @ A R R The minimizer of the above can then be obtained in a closed-form, and is

1 0 [W1(r)+W2(r)] dU (r) 1 b˜ ⇤ = . 2 2 1 0 1 ÂR i=1 Âj=1 0 Wi(r)Wj(r)dr 1 R @ A Hence, 0 d T(dˆ d ) b˜ ⇤/c ! where dˆ = bˆ/cˆ with bˆ obtained from a LASSO estimation cˆ = bˆ + bˆ , and d0 = b0/c0.To | 1| | 2| simulate the sample distribution of b˜ ⇤, note that

T 2 d 1 T Â xitxjt 0 Wi(r)Wj(r)dr t=1 ! T R 1 d 1 T Â xitut 0 Wi(r)dU (r). t=1 ! R Then, 1 T T Ât=1(x1t + x2t)ut b˜⇤ . 2 2 2 T ⇡ T Âi=1 Âj=1 Ât=1 xitxjt

Meanwhile, for the sample distribution of (dˆ d0), we employ the LASSO estimation with a fixed tuning parameter, l = ln p . This choice reflects the rate of convergence of the estimated 10pT cointegrating coefficients when selecting the set of predictors. Under this setting, Figures 1 and 2 provide of the limiting distribution of d and the sample distributions of (the elements of) (dˆ d0) for 1000 and 10,000 observations where dˆ is the LASSO estimator, as well as the associated QQ-plots. We can see that the distribution of (dˆ d0) is centered around 0, and although the QQ plots show some differences between the sampling and limiting distributions around the tails when T = 1, 000, these differences are barely apparent when T = 10, 000.

14 Figure 1: Limiting distribution and sample distribution of (dˆ d0) and their QQ-plot T = 1000

a The first panel depicts the limiting distribution of (dˆ d0) whereas the second panel depicts its sample 1 1 distribution. b The limiting and sample distributions of (dˆ d0) are identical to those for (dˆ d0) by construction. 2 2 1 1

Figure 2: Limiting distribution and sample distribution of (dˆ d0) and their QQ-plot T = 10000

5 Simulation study

In this section, we use simulation to evaluate the ability of the LASSO estimator to select a set of predictors that leads to out-of-sample forecasting gains. We start with a baseline predictive regression that incorporates a mixture of just four persistent and six non-persistent predictors, and then we expand the number of non-persistent predictors out to one hundred, which allows us to consider high-dimensional predictive regressions as well. Our data generating processes (DGPs) include a mixture of stationary and non-stationary cointegrated predictors as in (1), and although (1) and the other equations in our text omit a constant to simplify our exposition, we include a constant in all our simulated data and estimated equations. We consider four estimation sample sizes (each generated sample follows a burn-in of 1000 observations) and ten different forecast methods. Given an estimation sample of size T, we focus on forecasting yT+1 using different methods, including our proposed LASSO based

15 methods. We evaluate each forecast strategy in terms of the one-step-ahead forecast errors associated with yˆT+1.

5.1 Simulation of our baseline DGP

The baseline DGP that we study incorporates several well known characteristics of financial and macroeconomic data. These include high persistence in some (but not all) individual series, interdependencies between many variables, and rich system-wide dynamics.

The specification associated with the persistent variables xt in this DGP is based on a four variable vector error correction representation given by (10a), in which the first right-hand side vector represents a 4 1 loading matrix (1 is the cointegrating rank in this setting) while ⇥ the second right-hand vector is the cointegrating vector. Hence, xt is a 4-dimensional I(1) pro- cess with cointegrating rank 1 and from (10a), the cointegrating vector is d = (1, 0.5, 0, 0)0. The non-persistent predictors zt in this DGP follow a 6-dimensional stationary VAR(2) process with a complicated dependence structure. This structure, given in (10b), is partly taken from an estimated VAR(2) model of US macroeconomic data (see Martin et al., 2012, p. 485, example 13.19), so it is representative of data obtained from a real economy. The coefficients in the overall predictive regression model (1) are given in (10c), and these are set to be b = cd = 0.3d and a = ( 0.1, 0, 0.02, 0.4, 0, 0.2)0. All of the error terms in (10) (i.e. (u , # , # )) are i.i.d (0, I). Putting these together, the full specification of our baseline t x,t z,t N DGP is then

Dxt = 0 0.3 0 0.2 0 1 0.5 0 0 xt 1 + #x,t, (10a) 0.4 0 0 0 0 0 0 1.31 0.29 0.12 0.04 0 0 1 0 0.21 1.25 0.24 0.04 0 zt = zt 1 B 0 0.07 0.03 1.16 0.01 0 C B C B 0 0.08 0.27 0.07 1.25 0 C B C B 0 0 0 0 0 0.4C B C (10b) @ 00 0A 0 00 0 0.35 0.28 0.07 0.02 0 0 1 0 0.19 0.26 0.24 0.05 0 + zt 2 + #z,t, B0 0.07 0.02 0.16 0.01 0C B C B0 0.13 0.23 0.03 0.31 0C B C B00 0 0 00C B C @ A yt = 0.15 + xt0 1 0.3 0.15 0 0 0 + zt0 1 0.1 0 0.02 0.4 0 0.2 0 + ut. (10c)

16 5.2 Forecast methods

The first of our ten forecast strategies uses OLS and all predictors to estimate the standard linear predictive regression model in (1) given by

yt = xt0 1b + zt0 1a + ut to obtain yˆT+1, while our second strategy predicts yT+1 by using its historical average, i.e. 1 T yˆT+1 = T Âi=1 yi. The third method predicts yT+1 via the estimation of an AR(1) model:

yt = r yt 1 + ut. (11)

Then, since log-price series are often modelled as random walks, we also use a random walk without drift as the fourth forecast method for comparison. We expect this model to show inferior forecast performance when predicting returns because it restricts r = 0 in (11). The fifth method is based on principal component analysis (PCA), as suggested by Stock and Watson (2002) when there are many predictors. Due to the different statistical properties of I(1) and I(0) series, we extract the first 4 principal components from the (standardised) I(0) series, and then use them together with the 4 I(1) predictors in the forecasting exercise. For the sixth method, we employ a forecast combination a` la Timmermann (2006). More specifically, we compute the combined forecast, yˆT+1, as an equally weighted average of all forecasts that are separately obtained from distinctive linear predictive regression models that each condition on just one predictor:

M 1 m yˆT+1 = Â yˆT+1, (12) M m=1

m where M = k + p is the total number of predictors, and yˆT+1 is a forecast based on a bivariate linear predictive regression that conditions on just the m-th predictor. We include forecast combination as one of our competing forecast strategies because it has been widely recognized as a way to improve forecast performance relative to a forecast based on a single model (Timmermann, 2006; Rapach et al., 2010). Similar to the use of LASSO, the combined forecast approach alleviates the impact of model uncertainty and parameter instability. The next three methods are different variants of LASSO. We apply a standard LASSO

17 method with the following objective function:

T 1 2 arg min  yt xt0 1b zt0 1a + l b 1 + l a 1 , (13) b,a T k k k k ⇢ t=1 where is the ` -norm. The penalty term l can be estimated using the cross validation k·k1 1 algorithm due to Qian et al. (2013), that minimizes the cross-validated error loss (LASSOmin).

We call the resulting penalty lmin. An alternative way to estimate l is via forward chaining, which leads to l fwd. Here, we work with a grid of ls and the first 0.9T observations in the generated series of T observations to estimate a set of coefficients for each l. We then use each set of coefficients together with a rolling window of 0.9T observations to obtain one step ahead forecasts for the remaining 0.1T data points, and choose l fwd to minimise the associated forecast mean squared error (FMSE). We also consider setting l = ln p to obtain a 10pT penalty that we call l fix. This latter choice of l is based on a metric associated with the set of non-persistent variables zt (see Lemma 1), and the potential advantage of using this penalty is that it takes the rate of convergence of the estimated cointegrating coefficients into account when selecting the set of predictors. Our final forecasting technique is based on the elastic net by Zou and Hastie (2005), which is often used in macroeconomic or financial applications, when regressors are highly corre- lated. The elastic net mixes the `1 and `2 norms for the penalty term,

T 1 2 2 2 arg min  yt xt0 1b zt0 1a + l h( b 1 + a 1)+(1 h)( b 2 + a 2) , (14) b,a T k k k k k k k k ⇢ t=1 ⇥ ⇤ and in our case we set the mixing parameter, h, to be 0.5. We generate 5000 samples of T = 100 to estimate the parameters that are needed (together with the sample data) to predict yˆT+1 for each of the ten forecast methods. This generates

50000 predictions for estimation samples of size 100, and we use these, together with yT+1 to calculate 5000 forecast errors for each of the ten forecast methods. We then repeat this for T = 200, 400, and 1000, so that a total of four estimation sample sizes are considered. Note that we do not encounter any estimation problems when we generate data from our baseline DGP, but later, when we start generating high-dimensional predictive DGPs in Section 5.3 and work with small estimation samples, we will not have sufficient observations to implement the first of our forecast methods which is based on OLS. We compare the forecast performance for different methods by considering two measures.

18 The first of these is the forecasting mean squared error (FMSE), while the second is a measure of out-of-sample relative fit (RFIT) and is closely related to the out-of-sample R2 measure used by Welch and Goyal (2008). The former is defined as the expected squared difference between the actual realization and the one-step-ahead forecast: FMSE := E[y yˆ ]2. T+1 T+1 The latter is defined as RFIT:=1 FMSE /FMSE for each forecasting method i. We use i OLS the OLS based forecast method as our benchmark when calculating the out-of-sample RFIT because this method is typically used in empirical applications of predictive regressions.7 Table 1 presents FMSE and out-of-sample RFIT for the forecasting strategies based on

5000 repetitions. Overall, the LASSO methods dominate, with LASSO fix having lower FMSE and higher RFIT than all other forecast methods when the sample size is small, and LASSOmin playing this role when the sample size becomes larger. We can infer that the bias induced by adopting the LASSO is small, whereas the reduction in the estimation domi- nates. The LASSOmin, LASSO fix and elastic net methods produce smaller FMSEs than the benchmark OLS forecasts across the four different sample sizes (with only one exception), and hence they produce positive out-of-sample RFIT values. On the other hand, the other methods always lead to inferior forecasting performance compared to the OLS benchmark. The bias induced by using the historical average is particularly costly, and it is interesting to note that despite considerable empirical evidence that forecast combinations are often helpful, they do not appear to help in this setting. One possible reason for this is that the simulated DGP used here has a stable dependence structure, whereas an important practical advantage of forecast combination is that it can ameliorate the effects of structural change. The PCA4 and the forward chaining LASSO methods generate FMSEs that are of similar magnitudes as the three best performing forecast strategies, but they do not show any advantage. We conjecture that the LASSO approach renders the presence of a cointegrating relation- ship conspicuous due to the shrinkage imposed on the irrelevant non-stationary predictors, and this, together with the small but remaining nonzero coefficients on stationary predictors leads to better forecast performance. Since OLS can also ”pick up” cointegrating relation- ships as the number of observations increases, the forecasting performance based on the OLS estimation of the predictive regression becomes closer to that of the LASSO for large samples. This is confirmed in Table 1. When we have a small sample size T = 100, the improvement in the FMSE of LASSOmin over the OLS benchmark, as measured by the out-of-sample RFIT, is 2.96%. However, as the sample size increases to T = 1000, the out-of-sample RFIT of

7All code for our simulations in Section 5 and our forecasts in Section 6 can be found at https://sites.google.com/site/myunghseo/research

19 LASSOmin reduces to 0.03%.

5.3 Alternative DGPs

One clear advantage of the LASSO approach is that the dimension of the stationary variables zt can be arbitrarily large, and indeed can be larger than the sample size T. In the previous DGP we set k = 4 and p = 6. In this section we investigate how the forecasting performance of LASSO compares with alternative forecasting strategies as the dimension p of the stationary predictors increases. Starting from p = 6, we expand the dimension of the non-persistent regressors to p = 12, 20, 50, 100 and 200. Our DGP structure is similar to that in Section 5.1 in that our DGP for the nonstationary variables still follow (10a), and the stationary variables are still generated by VAR(2) processes, but the sets of coefficients for these processes are now randomly drawn (so that our analysis no longer focuses on just one VAR). The coefficients in the associated loading vector a are also drawn randomly, in a way that ensures that the set of elements of zt that actually contribute to the generation of yt is sparse. We provide more detail on the stationary components of the DGPs below. We use the same forecast methodologies as in the previous section (except when OLS is infeasible) and as before, we base our forecast comparison on the FMSE associated with one-step-ahead forecasts. The details relating to the stationary components of our DGPs are that we consider 100 different VAR(2)s for each sample size, and work with 1000 replications of each.8 The coeffi- cients for each of these 100 VAR(2)s are drawn randomly from a normal distribution with a zero mean and variance of less than 0.2, and we have made sure that each drawn VAR(2) pro- cess is stationary, by checking that its maximum eigenvalue is less than 0.8 in absolute value. We impose weak sparsity by randomly setting half of the a coefficients equal to zero and drawing the remaining non-zero coefficients in a from a uniform distribution over [ 0.3, 0.3]. The non-zero coefficients are small, making the dependence of yt on zt relatively weak. We should emphasize that this design is not favorable to the LASSO due to the many smallish non-zero coefficients. The average FMSE produced across 1000 replications of the 100 DGPs are tabulated in Table 2, and the smallest FMSE in each row appears in bold letters with an associated tick. The results observed previously are largely preserved when we increase the dimension of the stationary predictors zt. The three forecasting methods based on the LASSO and the elastic net produce more accurate forecast in terms of FMSE, with LASSO fix performing the

8These simulated VAR(2) processes are available upon request.

20 Table 1: Forecasting performance of different forecast strategies

T OLS Average AR(1) r-walk PCA4 F-comb LASSO fwd LASSOmin LASSO fix el-net FMSE

100 1.2693 7.1731 2.2369 2.3820 3.9902 4.4995 1.3382 1.2316 1.2132X 1.2399 200 1.1051 7.9231 2.1883 2.3349 4.3106 5.2708 1.1568 1.0952 1.0923X 1.0987 400 1.0183 7.8535 2.1278 2.3002 4.0072 5.4067 1.0305 1.0133X 1.0136 1.0150 1000 0.9977 8.3772 2.1469 2.3593 4.1956 5.9519 1.0045 0.9974X 1.0013 0.9975 21 Out-of-sample RFIT

100 -4.6514 -0.7624 -0.8767 -2.1437 -2.5450 -0.0544 0.0296 0.0442X 0.0232 200 -6.1696 -0.9802 -1.1129 -2.9007 -3.7696 -0.0468 0.0089 0.0115X 0.0057 400 -6.7126 -1.0896 -1.2590 -2.9353 -4.3098 -0.0120 0.0049X 0.0046 0.0032 1000 -7.3969 -1.1519 -1.3649 -3.2055 -4.9659 -0.0069 0.0003X -0.0037 0.0002

a Each column denotes one forecasting strategy, respectively, from left to right: (1) OLS model; (2) Historical average; (3) AR(1) model; (4) random walk model; (5) Principal component analysis (PCA); (6) Forecast combination; (7) Forward chaining LASSO; (8) Minimized cross validated LASSO; (9) Fixed penalty LASSO; (10) Elastic net. b The out-of-sample RFIT is defined as RFIT = 1 FMSE /FMSE for each model i. i OLS Table 2: Comparison of FMSE for different forecast strategies as the number of stationary predictors (p) grows

p T OLS Average AR(1) r-walk PCA4 F-comb LASSO fwd LASSOmin LASSO fix el-net 100 1.1637 1.5259 1.4629 2.3513 1.1875 1.4714 1.2162 1.1473 1.1401X 1.1542 200 1.0749 1.5311 1.4586 2.3548 1.1245 1.4887 1.1275 1.0706 1.0696X 1.0735 6 400 1.0335 1.5221 1.4487 2.3667 1.0876 1.4880 1.0550 1.0325X 1.0338 1.0335 1000 1.0122 1.5294 1.4554 2.3775 1.0731 1.5011 1.0214 1.0120X 1.0154 1.0122

100 1.2595 1.6997 1.6415 2.6931 1.3559 1.6387 1.3022 1.2241 1.2132X 1.2355 200 1.1139 1.6919 1.6247 2.6817 1.2803 1.6412 1.1637 1.1029X 1.1035 1.1086 12 400 1.0522 1.6998 1.6315 2.7119 1.23881 1.6542 1.0764 1.0501X 1.0554 1.0523 1000 1.0280 1.6877 1.6135 2.6796 1.2175 1.6466 1.0344 1.0273X 1.0350 1.0279

100 1.3842 1.8769 1.8299 3.0623 1.5670 1.8196 1.4226 1.3117 1.2990X 1.3288 200 1.1598 1.8876 1.8278 3.0391 1.4609 1.8359 1.2093 1.1417X 1.1447 1.1508 20 X

22 400 1.0801 1.8829 1.8181 3.0380 1.4161 1.8360 1.0930 1.0752 1.0818 1.0789 1000 1.0294 1.8742 1.8137 3.0688 1.3906 1.8301 1.0385 1.0283X 1.0398 1.0293

100 2.3141 2.8287 2.8047 4.8699 2.6074 2.7609 1.9903 1.7607 1.7151X 1.7767 200 1.3949 2.8073 2.7709 4.8875 2.4013 2.7430 1.4050 1.3159X 1.3243 1.3420 50 400 1.1767 2.8061 2.7636 4.8898 2.3206 2.7434 1.1687 1.1553X 1.1685 1.1667 1000 1.0659 2.7956 2.7439 4.8790 2.3009 2.7347 1.0691 1.0621X 1.0808 1.0651

100 N/A 3.9130 3.9167 7.0734 3.8034 3.8560 3.1791 2.9096 2.7334X 2.8456 200 2.1674 3.8956 3.8782 7.1138 3.6052 3.8407 1.8934 1.7428X 1.7469 1.7892 100 400 1.3697 3.8947 3.8688 7.0424 3.4771 3.8406 1.3182 1.2996X 1.3277 1.3263 1000 1.1208 3.9227 3.8854 7.0685 3.4284 3.8688 1.1180 1.1106X 1.1409 1.1179 100 N/A 6.7914 6.8466 12.8705 7.0913 6.7299 6.0837 5.9320 5.8183 5.7287X 200 N/A 6.8392 6.8440 12.7566 6.6639 6.7768 3.7295 3.4627 3.3619X 3.4517 200 400 2.0915 6.8380 6.8446 12.8389 6.4734 6.7762 1.7807 1.7538X 1.8223 1.8143 1000 1.2713 6.7071 6.6963 12.8750 6.3050 6.6462 1.2351 1.2323X 1.2770 1.2492 best when the sample size is small and LASSOmin performing the best when the sample size becomes larger. Interestingly, for p = 200 and T = 100, the elastic net dominates the LASSO methods in producing more accurate forecasts. The percentage improvement of the LASSOmin specification relative to the OLS benchmark is always larger when the sample size is small. Notably, as p increases, all forecast strategies considered here generate larger forecast errors. The historical average, AR(1) model, random walk, and principal components model are all mis-specified and hence lead to much larger forecast errors.

6 Stock return predictability

The predictability of the equity premium has been a controversial subject in the finance lit- erature, generating considerable debate about whether predictive models of excess returns can outperform the historical average of these returns, and if so, which predictors lead to good out of sample performance, and over which forecast periods. Welch and Goyal (2008) conducted a particularly influential study of this topic, providing an extensive review of the usefulness and stability of many commonly used predictors. They concluded that return pre- dictability is very weak, both in terms of in-sample fit and out-of-sample forecasting. Their paper inspired a burst of more recent work on this topic, that has not only proposed new predictors, but has also exploited innovative ways of extracting information from the existing set of predictors (see, for example Rapach and Zhou, 2013; Neely et al., 2014; Buncic and Tischhauser, 2016). In this section we use an extension of the data set examined in Welch and Goyal (2008), and we demonstrate how LASSO estimation can improve the predictability of stock returns by allowing for cointegration between persistent predictors.

6.1 Data

We follow the mainstream literature on forecasting the equity premium and use the same fourteen predictors as Welch and Goyal (2008). An updated version of their data is posted on the website (http://www.hec.unil.ch/agoyal/), and we use monthly data from January 1945 to December 2012 (816 observations) from this site in the work that follows. Table 3 presents a brief description of the predictors as well as their estimated first-order autocor- relation coefficients over the entire sample period. The dependent variable of interest is the equity premium yt, which is defined as the difference between the compounded return on the S&P 500 index and the three month Treasury bill rate, and most of the predictors are fi-

23 Table 3: Variable definitions and estimated first order autocorrelation coefficients

Predictor Definition Auto.corr(1) d/p dividend price ratio: difference between the log of dividends 0.9940 and the log of prices d/y dividend yield: difference between the log of dividends and the 0.9940 log of lagged prices e/p earning price ratio: difference between the log of earning and 0.9908 the log of prices d/e dividend payout ratio: difference between the log of dividends 0.9854 and the log of earnings b/m book-to-market ratio: ratio of book value to market value for 0.9926 the Dow Jones Industrial Average ntis net equity expansion: ratio of 12-month moving sums of net 0.9757 issues by NYSE listed stocks over the total end-of-year market capitalization of NYSE stocks tbl Treasury bill rates: 3-month Treasury bill rates 0.9891 lty long-term yield: long-term government bond yield 0.9935 tms term spread: difference between the long term bond yield and 0.9565 the Treasury bill rate dfy default yield spread: difference between Moody’s BAA and 0.9720 AAA-rated corporate bond yields dfr default return spread: difference between the returns of 0.9726 long-term corporate and government bonds svar stock variance: sum of squared daily returns on S&P500 index 0.4716 infl inflation: CPI inflation for all urban consumers 0.5268 ltr long-term return: return of long term government bonds 0.0476 nancial or macroeconomic variables. As shown in Table 3, the majority of these predictors are highly persistent, with eleven out of fourteen variables having a first order autocorrelation coefficient that is higher than 0.95. The remaining three variables (the stock index, inflation and the government bond return) have little persistence, as does the equity premium (EQ), which has a first order autocorrelation coefficient of only 0.1494. We use twenty year rolling windows to allow for possible parameter instability during our sample, and since our data starts from 1945:01, our out-of-sample predictions of the equity premium start from 1965:01. We treat the results for the twenty year windows as our benchmark, but we also conduct some robustness analysis using ten and thirty year rolling windows. Figure S1 in our online supplement plots the estimated AR(1) coefficients for each predictor and the equity premium (EQ) using twenty year rolling windows, and it is not only evident that many predictors exhibit a high persistence level over the entire sample, but also that the time series dependence structure changes throughout the forecasting period. We expect to observe cointegrating relationships between the persistent variables, given

24 the features of the predictors reported in Table 3 and illustrated in Figure S1. Further, fi- nancial considerations suggest that the dividend price ratio (d/p) and the dividend yield (d/y) should have a stable long-run relationship. If such cointegrating relationships exist, they would demonstrate our theoretical framework in Section 3 very nicely, so we study this aspect of our data next. Specifically, we conduct the tests proposed by Johansen (1991, 1995) and use the model selection technique proposed by Poskitt (2000) to assess whether there is cointegration between some of our persistent variables. We exclude the dividend payout ratio (d/e) and long-term yield (lty) when estimating models to avoid multi-collinearity, making our cointegration analysis relate to the nine remaining persistent predictors.

Figure 3: Test for cointegration using twenty year rolling windows

(a) all 9 persistent variables (b) exclude d/p and d/y

7 5 6 4 5 3 4 2 3 2 1 1 0

1965 1975 1985 1995 2005 2012 1965 1975 1985 1995 2005 2012

(c) d/p, d/y, e/p and tbl (d) d/p and d/y

Johansen Poskitt 3 2

2

1 1

1965 1975 1985 1995 2005 2012 1965 1975 1985 1995 2005 2012

Figure 3 shows the number of cointegrating relationships (i.e. cointegrating ranks) that are found in each of the twenty year windows in our sample.9 The blue solid line denotes the cointegrating rank selected using the Johansen test (Johansen, 1991, 1995), and the red dashed line denotes the selected cointegrating rank when the non-parametric method of Poskitt (2000) is used. The Johansen test is based on estimating a vector and it is one of the most widely used tests for cointegration among multiple time series. However, it is well known that this test often suffers from the curse of dimensionality (Gonzalo and Pitarakis, 1999), and in particular this test tends to overestimate the cointegrating rank. This is reflected in Figure 3. Here, we see that the Johansen test always finds at least three cointegrating relationships between the nine persistent variables, whereas the more conservative Poskitt

9The test outcomes based on ten and thirty year windows are qualitatively similar, and the relevant results are available upon request.

25 (2000) procedure only finds a cointegrating rank of one or two. It is interesting to note that when only d/p and d/y are included in the analysis, the Poskitt (2000) procedure always finds that these two variables are cointegrated.

6.2 Forecast comparison

We use the same set of forecasting methods as in the simulation study in Section 5, excepting the method based on principal components since we now have only three I(0) predictors. The benchmark forecast is based on OLS estimation of (1). Note that all forecasts (including the the historical averages) use just the observations in the relevant rolling window instead of the entire past history from 1945:01. One step ahead FMSEs are tabulated in Table 4, and out- of-sample RFIT measures are provided in Table S1 (in our online supplement), respectively. Following Welch and Goyal (2008), we first consider forecasts based on the generic “kitchen sink” model, where all of the predictors listed in Table 3 are included except for d/e and lty. In addition, we base predictions on a simplified model that contains the three non-persistent predictors (svar, infl and ltr) and two persistent predictors d/p and d/y, that appear from Figure 3 (d) to be cointegrated. Tables 4 and S1 show that in terms of overall performance, the forecasts produced by forecast combination, and use of the three LASSO methods or the elastic net produce the smallest one-step-ahead FMSE and hence the highest out-of-sample RFIT, over the three different window sizes. The improvement in out-of-sample RFIT can be as large as 13.7% over the forecasts based on the kitchen sink model. However, the LASSO fix model sometimes fails to generate better predictions than the OLS benchmark when the persistent predictors other than d/p and d/y are excluded from the model. Forecast combination produces the smallest FMSE using ten year and thirty year windows in the kitchen sink case but it fails to improve forecast accuracy upon the OLS benchmark using twenty and thirty year windows in the simplified case that includes only two persistent predictors. In contrast to our simulated forecasts, the elastic net performs slightly better than the LASSOmin for this simple example based on real data, perhaps reflecting higher correlations between the actual I(0) variables than between the simulated I(0) variables. We conduct pairwise Giacomini and White (2006) tests of equal conditional predictive ability of OLS forecasts and alternative forecast methods, using the one-step-ahead FMSE as our . We indicate the significance levels of these tests in Table 4. For the kitchen sink specification, forecast combinations, LASSOmin and the elastic net produce significantly

Table 4: Comparison of FMSE for different forecast methods

Kitchen sink variables (9 persistent variables as well as svar, infl and ltr)

window    OLS       Average   AR(1)     r-walk    F-comb       LASSO fwd   LASSOmin     LASSO fix   el-net
10-year   0.002089  0.001884  0.001900  0.003344  0.001803✓**  0.001913**  0.001831**   0.001839*   0.001849**
20-year   0.002046  0.002050  0.002036  0.003598  0.001929     0.002003    0.001928✓†   0.001951    0.001928†
30-year   0.002035  0.002051  0.002042  0.003651  0.001956     0.001928    0.001968     0.001961    0.001972

Two persistent variables (d/p and d/y) as well as svar, infl and ltr

window    OLS       Average   AR(1)     r-walk    F-comb       LASSO fwd   LASSOmin     LASSO fix   el-net
10-year   0.001909  0.001884  0.001900  0.003344  0.001808     0.001818    0.001799✓†   0.001843    0.001819†
20-year   0.001914  0.002050  0.002036  0.003598  0.001921     0.001924    0.001866     0.001944    0.001866✓†
30-year   0.001895  0.002051  0.002042  0.003651  0.001931     0.001895    0.001892     0.001929    0.001888✓

a. Bold entries with a tick mark (✓) indicate the forecast method with the smallest forecast mean squared error in each row.
b. Significance levels of pairwise Giacomini and White (2006) tests of equal conditional predictive ability of OLS forecasts and the alternative forecast method in the corresponding column: †, * and ** denote the 10%, 5% and 1% levels of significance.
c. Coloured (purple) blocks represent superior forecast methods compared to the others using the Hansen (2005) test for superior predictive ability.

For the kitchen sink specification, forecast combination, LASSOmin and the elastic net produce significantly better one-step-ahead forecasts (at the 1% level) than the corresponding OLS alternative using the ten year window, and LASSOmin out-performs OLS at the 10% level using the twenty year window. For the parsimonious two persistent predictors case, the LASSOmin model and the elastic net perform significantly better than the OLS benchmark at the 10% level over the ten year rolling window samples. The Hansen (2005) test for superior predictive ability leads us to a similar conclusion. For each forecasting method i, we test the null that

\[
H_0 : \mathrm{FMSE}_i \le \mathrm{FMSE}_j, \quad \text{for all } j \in \mathcal{J},\ j \ne i,
\]
where $\mathcal{J}$ contains all methods considered here. We use purple blocks in Table 4 to highlight those cases for which $H_0$ cannot be rejected at the 5% significance level. Regardless of the size of the estimation rolling window, we find no statistically significant evidence that the forecasts generated by forecast combination, LASSOmin or the elastic net are inferior to any of the other forecast strategies, for either the kitchen sink model or for the more parsimonious model. For the thirty year estimation samples we find no evidence that the LASSO fix and the OLS forecasts are dominated by other forecasts either.
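As a companion to the reported significance levels, the following is a minimal sketch of the one-step-ahead Giacomini and White (2006) Wald statistic; the instrument set (a constant plus the lagged loss differential) and the function name are our own illustrative choices.

import numpy as np
from scipy import stats

def gw_test(loss_a, loss_b, lags=1):
    """Wald statistic and p-value; loss_a, loss_b are squared-error series."""
    d = loss_a - loss_b                       # loss differential
    y = d[lags:]                              # d_{t+1}
    h = np.column_stack([np.ones(len(y))]     # h_t = (1, d_t, ..., d_{t-lags+1})'
                        + [d[lags - m - 1:len(d) - m - 1] for m in range(lags)])
    Z = h * y[:, None]                        # Z_t = h_t * d_{t+1}
    T, q = Z.shape
    zbar = Z.mean(axis=0)
    omega = Z.T @ Z / T                       # no HAC correction needed at horizon one
    stat = T * zbar @ np.linalg.solve(omega, zbar)
    return stat, stats.chi2.sf(stat, df=q)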

6.3 Robustness analysis

The tests for cointegrating rank shown in Figure 3 suggest that there is usually more than one cointegrating relationship between the nine persistent predictors. Although our modelling framework assumes just a single cointegrating vector, it is easy to accommodate multiple cointegrating relationships. To see this, suppose that there are two normalised cointegrating vectors d1 and d2 with normalisation constants c1 and c2. The fact that we are considering a one-equation predictive regression implies that b = c1 d1 + c2 d2, so the LASSO is simply estimating a linear combination of the cointegrating vectors. In other words, instead of estimating the two cointegrating relationships separately, the LASSO identifies a linear combination of them in b. The same intuition applies when we are working with sets of predictors that have a higher cointegrating rank, which is exactly the case for our data set on stock return prediction. Simple algebra shows that the LASSO estimates of the coefficients of the linear combination of independent cointegrating vectors are still consistent, and all the asymptotic properties of the LASSOmin estimates presented in Section 4 still hold.
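The balance argument behind this claim can be written out in one line; the display below is a minimal sketch in the notation of this section, assuming d1 and d2 are each cointegrating vectors for the persistent predictors x_t.

% b = c_1 d_1 + c_2 d_2 loads the predictive regression on a single combination:
\[
  x_t' b = c_1 \,(x_t' d_1) + c_2 \,(x_t' d_2),
\]
% a linear combination of two I(0) processes and hence itself I(0), so the
% regression stays balanced and the LASSO recovers b rather than d_1 and d_2
% separately.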

We plot the twenty year rolling window LASSO estimates of some of the coefficients from the kitchen sink model in Figure 4, and provide the full set of estimated coefficients in Figures S2 and S3 in our online supplement.

An estimated coefficient of 0 indicates that the LASSOmin model drops the associated variable from the predictive model. Interestingly, the first two panels of Figure 4 show that although the coefficients of d/p and d/y often move together (in opposite directions), they do not always reflect a (1, -1) cointegrating relationship. This is due to the presence of other persistent predictors whose predictive ability is sporadic; as these variables enter and leave the estimated cointegrating relationship, the coefficients of d/p and d/y change. There is an apparent break in early 1996, after which more than half of the persistent variables are discarded by the LASSO. Only the earnings price ratio (e/p), stock variance (svar) and inflation (infl) appear relevant in most windows after early 1996.

Figure 4, together with Figures S2 and S3, also shows that the signs of some of the estimated coefficients change in different parts of the sample. For instance, the earnings price ratio (e/p) is positively related to the equity premium prior to 1975 or after 1995, but it is negatively correlated with the equity premium from 1975 to 1995. This reflects the changing nature of stock return predictability through time. However, the signs of some coefficients stay the same throughout the sample. For instance, a higher Treasury bill rate (tbl) or inflation rate (infl) always leads to a lower equity premium, and a higher dividend price ratio (d/p), default yield spread (dfy) or long-term return (ltr) always leads to a higher premium.

We use colour to mark the NBER recession dates in Figures 4, S2 and S3 to investigate whether expansions or recessions have any systematic impact on the predictive ability of the regressors. There does not appear to be a systematic relationship between the relevance of any predictor and phases of the business cycle. However, several predictor coefficients (e.g. d/p, d/y and e/p) change very sharply immediately after the 1981-1982 recession, and then again just before the 1990-1991 recession, and several (e/p, ntis, svar and infl) take on different values during the 2007-2009 recession. These results demonstrate the sporadic nature of the relationship between the equity premium and the predictors commonly used in the literature, which highlights the usefulness of the LASSO in “auto-selecting” the most relevant and important predictors over successive estimation windows to achieve forecasting gains.

Finally, we examine the robustness of the forecasting results using the Welch and Goyal (2008) data, but starting from January 1927. The FMSEs of the competing forecast methods under consideration are tabulated in Table S2 in our online supplement.

Compared with Table 4, most of the results are qualitatively similar. Overall, the LASSOmin forecasts

Figure 4: LASSOmin estimated coefficients over twenty year windows for eight key predictors (with recession periods shaded in colour)

[Figure 4 comprises eight panels, one per predictor (d/p, d/y, e/p, tbl, dfy, dfr, svar and infl), each plotting the estimated coefficient over windows ending between 1965 and 2012.]

perform the best, and pairwise Giacomini and White (2006) tests in the kitchen sink set-up always find that forecasts based on LASSOmin estimation are statistically better than those based on the OLS regression. The Hansen (2005) test for superior predictive ability always ranks the LASSOmin model as one of the three best model specifications for forecasting the equity premium. In general, we can conclude that the LASSO has been successful in dealing with the sporadic relationship between the predictors and the prediction objective. This, together with the LASSO's ability to capture cointegration among persistent predictors, leads to superior forecasting performance.

7 Conclusion

This paper considers the use of the LASSO in a predictive regression in which the dependent variable is stationary, there are many predictors, and some of these predictors are non-stationary and potentially cointegrated. We show that, asymptotically, the LASSO yields a consistent estimator of the coefficients, including those of the cointegrating vector, even when the dimensions of both the stationary and nonstationary predictors are allowed to diverge, provided that the dimension of the nonstationary predictors does not grow as fast as that of the stationary ones. We also provide the asymptotic distribution of the cointegrating coefficients, which can be simulated and used for statistical inference. The set of assumptions required for the consistency of the LASSO estimator allows the dimensions of both the stationary and nonstationary predictors to be arbitrarily large, making it feasible to implement the LASSO even when the number of predictors exceeds the sample size. Our simulation studies show that the proposed LASSO approach leads to superior forecasting performance relative to the commonly used OLS kitchen sink model, an historical average, and several commonly used forecast procedures.

The application of the LASSO to the problem of predicting the equity premium provides an interesting illustrative example. Given that there are numerous predictors for the equity premium, the variable selection function of the LASSO tackles the challenge of selecting the most relevant variables very nicely. The LASSO also helps to deal with the non-stationarity of some of the predictors by retaining those in a cointegrating relationship and estimating the associated cointegrating coefficients. When we compare the performance of forecasts of the equity premium based on models estimated via the LASSO with forecasts based on other forecasting methods, we find that the former are usually superior, and we attribute this finding, at least in part, to the LASSO's ability to capture cointegration among persistent predictors.

To the best of our knowledge, our paper is the first to study the statistical properties of LASSO estimates of predictive regression coefficients in the presence of cointegration. We should emphasize that our methodology does not exclude other ways of improving the predictability of stock returns, for example, bringing in other macroeconomic variables (Buncic and Tischhauser, 2016) and technical indicators (Neely et al., 2014), or imposing constraints on the return forecasts and/or estimated coefficients (Campbell and Thompson, 2008). As documented by Buncic and Tischhauser (2016) and Li and Tsiakas (2016), combining additional information and parameter constraints with the LASSO might further improve the predictability of the equity premium.

Note: We refer readers to our online supplement for empirical results not contained in this paper.

References

van Binsbergen, J. H. and Koijen, R. S. (2010), ‘Predictive regressions: A present-value approach’, The Journal of Finance 65(4), 1439–1471.

Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data, New York: Springer.

Buncic, D. and Tischhauser, M. (2016), Macroeconomic factors and equity premium pre- dictability, Working paper.

Campbell, J. Y. (1987), ‘Stock returns and the term structure’, Journal of Financial Economics 18(2), 373–399.

Campbell, J. Y. and Shiller, R. J. (1988), ‘The dividend-price ratio and expectations of future dividends and discount factors’, Review of Financial Studies 1(3), 195–228.

Campbell, J. Y. and Thompson, S. B. (2008), ‘Predicting excess stock returns out of sample: Can anything beat the historical average?’, Review of Financial Studies 21(4), 1509–1531.

Campbell, J. Y. and Yogo, M. (2006), ‘Efficient tests of stock return predictability’, Journal of Financial Economics 81(1), 27–60.

Caner, M. and Knight, K. (2013), ‘An alternative to unit root tests: Bridge estimators differ- entiate between nonstationary versus stationary models and select optimal lag’, Journal of Statistical Planning and Inference 143(4), 691–715.

Christoffersen, P. and Diebold, F. (1998), ‘Cointegration and long-horizon forecasting’, Journal of Business and Economic Statistics 16, 450–458.

Cochrane, J. H. (1991), ‘Production-based asset pricing and the link between stock returns and economic fluctuations’, The Journal of Finance 46(1), 209–237.

Cremers, K. M. (2002), ‘Stock return predictability: A Bayesian model selection perspective’, Review of Financial Studies 15(4), 1223–1249.

Dow, C. H. (1920), Scientific stock speculation, Magazine of Wall Street.

Elliott, G., Gargano, A. and Timmermann, A. (2013), ‘Complete subset regressions’, Journal of Econometrics 177(2), 357–373.

Fama, E. F. and French, K. R. (1993), ‘Common risk factors in the returns on stocks and bonds’, Journal of Financial Economics 33(1), 3–56.

Fama, E. F. and Schwert, G. W. (1977), ‘Asset returns and inflation’, Journal of Financial Eco- nomics 5(2), 115–146.

Fan, J. and Lv, J. (2010), ‘A selective overview of variable selection in high dimensional feature space’, Statistica Sinica 20(1), 101.

Giacomini, R. and White, H. (2006), ‘Tests of conditional predictive ability’, Econometrica 74(6), 1545–1578.

Gonzalo, J. and Pitarakis, J.-Y. (1999), Dimensionality effect in cointegration analysis, in R. En- gle and H. White, eds, ‘Festschrift in Honour of Clive Granger’, Oxford University Press, pp. 212–229.

Hansen, B. E. (1992), ‘Convergence to stochastic integrals for dependent heterogeneous processes’, Econometric Theory 8(4), 489–500.

Hansen, P. R. (2005), ‘A test for superior predictive ability’, Journal of Business & Economic Statistics 23, 365–380.

Johansen, S. (1991), ‘Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models’, Econometrica 59, 1551–1580.

Johansen, S. (1995), Likelihood-based Inference on Cointegrated Vector Auto-regressive Models, Ox- ford University Press, Oxford.

Kock, A. B. (2016), ‘Consistent and conservative model selection with the adaptive lasso in stationary and nonstationary autoregressions’, Econometric Theory 32(1), 243–259.

Koijen, R. S. and Nieuwerburgh, S. V. (2011), ‘Predictability of returns and cash flows’, Annual Review of Financial Economics 3(1), 467–491.

Koo, B. and Seo, M. H. (2015), ‘Structural-break models under mis-specification: implications for forecasting’, Journal of Econometrics 188(1), 166–181.

Lamont, O. (1998), ‘Earnings and expected returns’, The Journal of Finance 53(5), 1563–1587.

Lettau, M. and Ludvigson, S. (2001), ‘Consumption, aggregate wealth, and expected stock returns’, The Journal of Finance 56(3), 815–849.

Li, J. and Tsiakas, I. (2016), A new approach to equity premium prediction: Economic fundamentals matter!, Working paper.

Liao, Z. and Phillips, P. C. (2015), ‘Automated estimation of vector error correction models’, Econometric Theory 31(3), 581–646.

Ludvigson, S. C. and Ng, S. (2007), ‘The empirical risk–return relation: A factor analysis approach’, Journal of Financial Economics 83(1), 171–222.

Martin, V., Hurn, S. and Harris, D. (2012), Econometric Modelling with Time Series, Cambridge University Press.

Neely, C., Rapach, D., Tu, J. and Zhou, G. (2014), ‘Forecasting the equity risk premium: The role of technical indicators’, Management Science 60(7), 1772–1791.

Pástor, L. and Stambaugh, R. F. (2009), ‘Predictive systems: Living with imperfect predictors’, The Journal of Finance 64(4), 1583–1628.

Phillips, P. C. (2014), ‘On confidence intervals for autoregressive roots and predictive regres- sion’, Econometrica 82(3), 1177–1195.

Phillips, P. C. (2015), ‘Halbert White Jr. Memorial JFEC Lecture: Pitfalls and possibilities in predictive regression’, Journal of Financial Econometrics 13(3), 521–555.

Pontiff, J. and Schall, L. D. (1998), ‘Book-to-market ratios as predictors of market returns’, Journal of Financial Economics 49(2), 141–160.

Poskitt, D. S. (2000), ‘Strongly Consistent Determination of Cointegrating Rank via Canonical Correlations’, Journal of Business and Economic Statistics 18, 77–90.

Qian, J., Hastie, T., Friedman, J., Tibshirani, R. and Simon, N. (2013), Glmnet for Matlab, http://www.stanford.edu/~hastie/glmnet_matlab/.

Rapach, D. E., Ringgenberg, M. C. and Zhou, G. (2016), ‘Short interest and aggregate stock returns’, Journal of Financial Economics 121(1), 46–65.

Rapach, D. E., Strauss, J. K. and Zhou, G. (2010), ‘Out-of-sample equity premium prediction: Combination forecasts and links to the real economy’, Review of Financial Studies 23(2), 821– 862.

Rapach, D. and Zhou, G. (2013), ‘Forecasting stock returns’, in Handbook of Economic Forecasting, Vol. 2, Elsevier.

Rossi, B. (2013), ‘Advances in forecasting under instability’, Handbook of Economic Forecasting 2, 1203–1324.

Stambaugh, R. F. (1999), ‘Predictive regressions’, Journal of Financial Economics 54(3), 375–421.

Stock, J. H. and Watson, M. W. (2002), ‘Forecasting using principal components from a large number of predictors’, Journal of the American Statistical Association 97(460), 1167–1179.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.

Tibshirani, R. (2011), ‘Regression shrinkage and selection via the lasso: a retrospective’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(3), 273–282.

Timmermann, A. (2006), ‘Forecast combinations’, Handbook of Economic Forecasting 1, 135–196.

Welch, I. and Goyal, A. (2008), ‘A comprehensive look at the empirical performance of equity premium prediction’, Review of Financial Studies 21(4), 1455–1508.

Wilms, I. and Croux, C. (2016), ‘Forecasting using sparse cointegration’, International Journal of Forecasting 32, 1256–1267.

Zou, H. and Hastie, T. (2005), ‘Regularization and variable selection via the elastic net’, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67(2), 301–320.

Appendix A

A.1 Proofs of Theorems

Proof of Theorem 1. Since by definition

\[
\|u - DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \|\hat\gamma\|_1 \le \|u\|_2^2 / T + \lambda \|\gamma^0\|_1,
\]

rearranging the above inequality yields

\begin{align}
\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \|\hat\gamma\|_1
&\le 2 u' DB_{\hat c}(\hat\theta - \theta^0) / T + \lambda \|\gamma^0\|_1, \nonumber\\
(\hat\theta - \theta^0)' \frac{B_{\hat c} D' D B_{\hat c}}{T} (\hat\theta - \theta^0) + \lambda \|\hat\gamma\|_1
&\le 2 u' DB_{\hat c}(\hat\theta - \theta^0) / T + \lambda \|\gamma^0\|_1. \tag{15}
\end{align}

We start with the first term on the right hand side of (15). Given A1 (i.e. c belongs to a compact parameter space), $\hat c = O_p(1)$, and thus for any $\eta > 0$,

\[
\Pr\left( 2\hat c \|u'\tilde X\|_\infty / T \le T^\eta \wedge k\ln k \right) \to 1
\]

by Lemma 2. Further, Lemma 1 implies that there exists a $\tilde\lambda$ such that

\[
\Pr\left( 2\|u' G\|_\infty / T \le \tilde\lambda \right) \to 1
\]

and $\tilde\lambda = o(\sqrt T / \ln p)$. Hence, with probability approaching one,

\[
2 |u' DB_{\hat c}(\hat\theta - \theta^0)| / T \le \tilde\lambda \|\hat\gamma - \gamma^0\|_1 + \hat c\, O_p(T^\eta \wedge k\ln k) \|\hat\delta - \delta^0\|_1
\]

due to Lemma 1 and the Hölder inequality. Therefore, regarding (15), for $\lambda \ge 2\big(\tilde\lambda \vee (T^\eta \wedge k\ln k)/\sqrt T\big)/3$,

\[
2\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + 2\lambda \|\hat\gamma\|_1 \le \lambda \|\hat\gamma - \gamma^0\|_1 + 2\lambda \|\gamma^0\|_1 + \hat c\, O_p(T^\eta \wedge k\ln k) \|\hat\delta - \delta^0\|_1.
\]

Note that due to the triangle inequality,

\[
\|\hat\gamma\|_1 = \|\hat\gamma_{S_0}\|_1 + \|\hat\gamma_{S_0^c}\|_1 \ge \|\gamma^0_{S_0}\|_1 - \|\hat\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\hat\gamma_{S_0^c}\|_1,
\]

where $S_0$ is the support of the parameter $\gamma^0$, i.e. $S_0 := \{j : \gamma^0_j \ne 0\}$ for $\gamma$, and by construction

\[
\|\hat\gamma - \gamma^0\|_1 = \|\hat\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\hat\gamma_{S_0^c}\|_1.
\]

Then, as $k/\sqrt T \le 3\lambda$,

\begin{align}
2\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \|\hat\gamma_{S_0^c}\|_1
&\le 3\lambda \|\hat\gamma_{S_0} - \gamma^0_{S_0}\|_1 + O_p\big((T^\eta \wedge k\ln k)/\sqrt T\big)\, \hat c\, \|\sqrt T(\hat\delta - \delta^0)\|_1 \nonumber\\
&\le 3\lambda \left( \|\hat\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \hat c\, \|\sqrt T(\hat\delta - \delta^0)\|_1 \right) \nonumber\\
&= 3\lambda \left\| B_* \big( \hat\theta_{S_*} - \theta^0_{S_*} \big) \right\|_1, \tag{16}
\end{align}

where $S_*$ is the index set combining $S_0$ and the indices for $\delta$, and $B_*$ is a diagonal matrix of size $|S_*|$ whose first $k$ elements are $\hat c$ and the others are 1. Also, the first $k$ elements of $\hat\theta - \theta^0$ do not lie within the span of $\delta^0$ due to the norm normalization. That is, if $\hat\delta - \delta^0 = a\delta^0$ for some $a$, then $\hat\delta = (1 + a)\delta^0$, which in turn implies that $a = 0$ due to the norm normalization.

This enables us to apply the compatibility condition A2, implying that there exists a sequence of positive constants $\phi_T$ such that

\begin{align}
\left\| B_* \big( \hat\theta_{S_*} - \theta^0_{S_*} \big) \right\|_1^2
&\le \|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2\, s_0(\lambda, \theta) / (T\phi_T^2) \nonumber\\
&= (\hat\theta - \theta^0)' \frac{B_{\hat c} D' D B_{\hat c}}{T} (\hat\theta - \theta^0)\, s_0(\lambda, \theta) / \phi_T^2, \tag{17}
\end{align}

where $s_0(\lambda, \theta)$ denotes the cardinality of the index set $S_*$ for $\theta$, i.e. $s_0 = |S_*|$. Then

\begin{align*}
2\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \big\| B_{\hat c}(\hat\theta - \theta^0) \big\|_1
&= 2\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \|\hat\gamma_{S_0^c}\|_1 + \lambda \big\| B_* \big( \hat\theta_{S_*} - \theta^0_{S_*} \big) \big\|_1 \\
&\le 4\lambda \big\| B_* \big( \hat\theta_{S_*} - \theta^0_{S_*} \big) \big\|_1 \\
&\le 4\lambda \sqrt{s_0(\lambda, \theta)}\, \|DB_{\hat c}(\hat\theta - \theta^0)\|_2 / (\sqrt T \phi_T) \\
&\le \|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + 4\lambda^2 s_0(\lambda, \theta) / \phi_T^2.
\end{align*}

The first inequality comes from (16), the second inequality is due to (17) and the last inequality uses $4uv \le u^2 + 4v^2$. Rearranging the last inequality,

\[
\|DB_{\hat c}(\hat\theta - \theta^0)\|_2^2 / T + \lambda \big\| B_{\hat c}(\hat\theta - \theta^0) \big\|_1 \le 4\lambda^2 s_0(\lambda, \delta) / \phi_T^2.
\]

When $\beta^0 = 0$, $\hat c = \|\hat\beta\| = O_p(\lambda/\sqrt T)$, since $\| B_{\hat c}(\hat\theta - \theta^0) \|_1 = \hat c \|\sqrt T(\hat\delta - \delta^0)\|_1 + \|\hat\gamma - \gamma^0\|_1$ and $\hat\delta$ is not consistent due to the lack of identification. $\square$

Proof of Theorem 2. First, we refine the convergence rate of $\hat\delta$ under the additional assumption that $k$ is fixed. Recall that

\[
D'D = \begin{pmatrix} \tilde X'\tilde X / T^2 & \tilde X' G / (T\sqrt T) \\ G'\tilde X / (T\sqrt T) & G'G / T \end{pmatrix}
\]

and note that

\[
\big\| \tilde X' G / (T\sqrt T) \big\| = O_p(T^{-1/2 + q}),
\]

where $q$ is any positive number, due to Lemma 2. Removing the positive term involving $G'G$ and substituting this and the convergence rate of $\hat\theta$ that we just obtained into the first inequality in (16), we can deduce that

\[
c^2 \big\| \sqrt T (\hat\delta - \delta^0) \big\|_1^2 \le O_p(1) \|\hat\delta - \delta^0\|_1 + O_p(s_0 \lambda^2 T^{-1+q}).
\]

This yields the desired rate for $\hat\delta$. Let

\[
S_T(\gamma, \delta) = \frac 1 T \sum_{t=1}^T \big( y_t - g_t'\gamma - x_t'\delta \big)^2.
\]

It is clear from Theorem 1 that

\[
\sup_\delta \big| S_T(\hat\gamma, \delta) - S_T(\gamma^0, \delta) \big| = o_p(T^{-1}),
\]

where the supremum is taken over any $T^{-1}$ neighborhood of $\delta^0$. Therefore, let $S_T(\delta) = S_T(\gamma^0, \delta)$ and consider weak convergence of the following stochastic process:

\begin{align*}
D_T(b) &= T \left( S_T(\delta^0 + T^{-1} b) - S_T(\delta^0) \right) \\
&= -\frac{2c}{T} \sum_{t=1}^T u_t x_t' b + \frac{c}{T^2}\, b' \sum_{t=1}^T x_t x_t'\, b\, c \\
&\Rightarrow -2c\, b' \left( \int_0^1 W(r)\, dU(r) + \Lambda \right) + c\, b' \int_0^1 W(r) W(r)'\, dr\, b\, c := D(b)
\end{align*}

over any given compact set for $b$. Furthermore, the estimator $\hat\delta$ is the minimizer of $S_T(\hat\gamma, \delta)$ under the constraint that $\|\delta\|_1 = 1$. This makes the parameter space for $b$ orthogonal to $\delta$ to prevent the quadratic term in $D_T$ from degenerating; the precise formulation takes some elaboration. Consistency of $\hat\delta$ means the sign-consistency of its nonzero elements as well. Thus,

\begin{align*}
0 &= T\big( \|\hat\delta\|_1 - \|\delta^0\|_1 \big) \\
&= \sum_{j: \delta^0_j \ne 0} \operatorname{sgn}(\delta^0_j)\, T(\hat\delta_j - \delta^0_j) + \sum_j T|\hat\delta_j|\, 1\{\delta^0_j = 0\} \\
&= \operatorname{sgn}(\delta^0)' b + \sum_j |b_j|\, 1\{\delta^0_j = 0\}
\end{align*}

with probability approaching 1. Therefore, we can conclude that

\[
T(\hat\delta - \delta^0) \xrightarrow{d} \operatorname*{arg\,min}_{b:\ \operatorname{sgn}(\delta^0)' b + \sum_j |b_j| 1\{\delta^0_j = 0\} = 0} D(b)
\]

by the argmax continuous mapping theorem, since the limit process is quadratic over the parameter space. $\square$

A.2 Lemmata

Lemma 1 Let $z_{tj}$, $j = 1, \dots, p$, be centered $\alpha$-mixing sequences satisfying the following: for some positive $\gamma_1$, $\gamma_2$, $a$, and $b$,

\begin{gather}
\alpha(m) \le \exp(-a m^{\gamma_1}), \nonumber\\
\Pr\big( |z_{tj}| > z \big) \le H(z) := \exp\big( 1 - (z/b)^{\gamma_2} \big) \ \text{for all } j, \tag{18}\\
1/\gamma = 1/\gamma_1 + 1/\gamma_2, \quad \gamma < 1, \nonumber\\
\ln p = O\big( T^{\gamma_1 \wedge (1 - 2\gamma_1(1-\gamma)/\gamma)} \big). \tag{19}
\end{gather}

Then,

\[
\max_{j \le p} \Big| \sum_{t=1}^T z_{tj} \Big| = O_p\big( \sqrt T \ln p \big).
\]

Proof of Lemma 1. Consider the sums of truncated variables first. Let $z^M_{tj} = z_{tj} 1\{z_{tj} \le M\} + M 1\{z_{tj} > M\}$ with $M = (1 + T^{\gamma_1})^{1/\gamma_2}$. Due to Proposition 2 in Merlevède et al. (2011), we have

\[
E \exp\left( T^{-1/2} \Big| \sum_{t=1}^T \big( z^M_{tj} - E z^M_{tj} \big) \Big| \right) = O(1)
\]

uniformly in $j$, and thus

\[
E \max_{j \le p} T^{-1/2} \Big| \sum_{t=1}^T \big( z^M_{tj} - E z^M_{tj} \big) \Big| = O(\ln p). \tag{20}
\]

For the remainder term, note that

\begin{align*}
E \max_{j \le p} \Big| \sum_{t=1}^T \big( z_{tj} - z^M_{tj} + E z^M_{tj} \big) \Big|
&\le T p\, E \big| z_{tj} - z^M_{tj} + E z^M_{tj} \big| \\
&\le 2 T p \int_M^\infty H(x)\, dx = O\big( T p \exp(-T^{\gamma_1}) \big),
\end{align*}

where the last equality is obtained by bounding the integral by $M H(M) = \exp(-T^{\gamma_1})$, since $H(x)$ decays at an exponential rate. Thus,

\begin{align*}
E \max_{j \le p} \Big| \sum_{t=1}^T z_{tj} \Big|
&\le E \max_{j \le p} \Big| \sum_{t=1}^T \big( z^M_{tj} - E z^M_{tj} \big) \Big| + E \max_{j \le p} \Big| \sum_{t=1}^T \big( z_{tj} - z^M_{tj} + E z^M_{tj} \big) \Big| \\
&= O\big( \sqrt T \ln p \big) + O\big( T p \exp(-T^{\gamma_1}) \big).
\end{align*}

Finally, recall that $\ln p = o(T^{\gamma_1})$ by (19). $\square$

Next, we derive another maximal inequality involving integrated processes by means of martingale approximation (e.g. Hansen, 1992).

Lemma 2 Let $\{x_{ti}\}$ be an integrated process such that $\{\Delta x_{ti}\}$ is centered and $x_{0i} = O_p(1)$. Let $\{\Delta x_{ti}\}$ and $\{z_{tj}\}$, $i = 1, \dots, k$, $j = 1, \dots, p$, be $\alpha$-mixing sequences satisfying the conditions in Lemma 1. Also let $k = O(\sqrt T)$ and $p = O(T^r)$ for some $r < \infty$. Then

\[
\max_{1 \le i \le k} \max_{1 \le j \le p} \Big| \sum_{t=1}^T z_{tj} x_{ti} \Big| = O_p\big( T^{1+\delta} \wedge T k p \big)
\]

for any $\delta > 0$.

Proof of Lemma 2. When $k$ and $p$ are of order $O(\ln^c T)$ for some finite $c$, bounding the maximum by the summation yields the bound $Tkp$, since $\sum_{t=1}^T z_{tj} x_{ti} = O_p(T)$, which is standard; see e.g. Hansen (1992). Thus, we turn to the case where $k$ and $p$ grow at a polynomial rate in $T$ and tighten this bound.

We apply the martingale difference approximation to $z_{tj}$ following Hansen (1992). Let $\mathcal F_t$ denote the sigma field generated by all $z_{t-s,j}$ and $x_{t+1-s,i}$, $s, i, j = 1, 2, \dots$, and let $E_t(\cdot)$ be the conditional expectation based on $\mathcal F_t$. Define

\[
e_{tj} = \sum_{s=0}^\infty \big( E_t z_{t+s,j} - E_{t-1} z_{t+s,j} \big) \quad \text{and} \quad \zeta_{tj} = \sum_{s=1}^\infty E_t z_{t+s,j}.
\]

Then it is clear that $z_{tj} = e_{tj} + \zeta_{t-1,j} - \zeta_{t,j}$ and that $e_{tj}$ is a martingale difference sequence.

We begin by deriving a tail probability bound for the sum of a martingale difference array $\varepsilon_t = e_{tj} x_{ti} / \sqrt T$ for each $i$ and $j$, where we suppress the dependence of $\varepsilon_t$ on $i$ and $j$ to ease notation. Consider the truncated sequence

\[
\varepsilon^C_t = \varepsilon_t 1\{|\varepsilon_t| < C\} + C 1\{\varepsilon_t \ge C\} - C 1\{\varepsilon_t \le -C\}
\]

and its centered version

\[
\tilde\varepsilon^C_t = \varepsilon^C_t - E\big( \varepsilon^C_t \mid \mathcal F_{t-1} \big).
\]

Then, by construction, $\tilde\varepsilon^C_t$ is a martingale difference sequence. Furthermore,

\[
|\tilde\varepsilon^C_t| \le |\varepsilon^C_t| + E\big( |\varepsilon^C_t| \mid \mathcal F_{t-1} \big) \le 2C
\]

by the triangle and Jensen's inequalities, and by the fact that $|\varepsilon^C_t| \le C$. Furthermore, let

\[
r^C_t = \varepsilon_t - \tilde\varepsilon^C_t = \big( \varepsilon_t - \varepsilon^C_t \big) - E\big( \varepsilon_t - \varepsilon^C_t \mid \mathcal F_{t-1} \big),
\]

where the last equality follows from the fact that $\{\varepsilon_t\}$ is an MDS, and thus

\[
\varepsilon_t = \tilde\varepsilon^C_t + r^C_t.
\]

By Azuma's (1967) inequality,

\[
\Pr\left\{ \Big| \frac{1}{\sqrt T} \sum_{t=1}^T \tilde\varepsilon^C_t \Big| \ge c \right\} \le 2 \exp\left( -\frac{c^2}{4 C^2} \right).
\]

On the other hand, since $r^C_t = 0$ if $|\varepsilon_t| \le C$, we have for any $c > 0$,

\[
\Pr\left\{ \Big| \frac{1}{\sqrt T} \sum_{t=1}^T r^C_t \Big| > c \right\} \le \Pr\left\{ \max_t |\varepsilon_t| > C \right\}. \tag{21}
\]

However,

\begin{align*}
\Pr\left\{ \max_t |\varepsilon_t| > C \right\}
&\le \Pr\left\{ \max_t |e_{tj}| \max_t \frac{|x_{ti}|}{\sqrt T} > C \right\} \\
&\le \sum_{t=1}^T \Pr\big\{ |e_{tj}| > \sqrt C \big\} + \Pr\left\{ \max_t \frac{|x_{Ti}|}{\sqrt T} > \sqrt C \right\} \\
&\le C^{-q} \sum_{t=1}^T E|e_{tj}|^{2q} + 2T \exp\left( -\frac{C^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right),
\end{align*}

where the first term comes from the Markov inequality and the second and third terms are given by Merlevède et al.'s (2011) equation (1.11).

Next, we derive some moment bounds for the sums of $z_{tj}$ and $e_{tj}$. By (A.1) of Hansen (1992), if $z_{tj}$ is $\alpha$-mixing of size $-qb/(q-b)$ with $E|z_{tj}|^q < \infty$,

\[
\big( E|\zeta_{tj}|^b \big)^{1/b} \le 6 \sum_{k=1}^\infty \alpha_k^{1/b - 1/q} \big( E|z_{tj}|^q \big)^{1/q}.
\]

Further, by the triangle inequality,

\[
\big( E|e_{tj}|^q \big)^{1/q} = \big( E|z_{tj} + \zeta_{tj} - \zeta_{t-1,j}|^q \big)^{1/q} \le \big( E|z_{tj}|^q \big)^{1/q} + 2 \big( E|\zeta_{tj}|^q \big)^{1/q}.
\]

Under our conditions and the mixing decay rate in Lemma 1, these quantities are bounded for any $b$ and $q$ such that $q > b$. Then, combining the preceding arguments, we conclude that for any $a_T$,

\begin{align*}
\Pr\left\{ \Big| \frac{1}{a_T \sqrt T} \sum_{t=1}^T \varepsilon_t \Big| \ge c \right\}
&\le \Pr\left\{ \Big| \frac{1}{\sqrt T} \sum_{t=1}^T \tilde\varepsilon^C_t \Big| \ge \frac{a_T c}{2} \right\} + \Pr\left\{ \max_t |\varepsilon_t| > C \right\} \\
&\le 2 \exp\left( -\frac{a_T^2 c^2}{4 C^2} \right) + C^{-q} \sum_{t=1}^T E|e_{tj}|^{2q} + 2T \exp\left( -\frac{C^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right).
\end{align*}

Given this bound, we can deduce from the union bound that

\begin{align*}
\Pr\left\{ \max_{1\le i\le k} \max_{1\le j\le p} \Big| \frac{1}{a_T T} \sum_{t=1}^T e_{tj} x_{ti} \Big| > c_1 \right\}
&\le k p\, \Pr\left\{ \Big| \frac{1}{a_T T} \sum_{t=1}^T e_{tj} x_{ti} \Big| > c_1 \right\} \\
&\le k p \left( 2 \exp\left( -\frac{a_T^2 c_1^2}{4 C^2} \right) + C^{-q} \sum_{t=1}^T E|e_{tj}|^{2q} + 2T \exp\left( -\frac{C^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right) \right).
\end{align*}

Since $E|e_{tj}|^{2q} < \infty$, we begin by choosing $C$ such that $k p T C^{-q} \to 0$. This choice of $C$ makes the last two terms degenerate, and for any $c_1 > 0$ the first term also degenerates as long as $a_T \ge b_T (k p T)^{1/(2q)} \ln 2kp$ for some $b_T \to \infty$. Since $q$ can be chosen arbitrarily big, we can set $a_T$ at any polynomial rate of $T$ to conclude that

\[
\max_{1\le i\le k} \max_{1\le j\le p} \Big| \frac 1 T \sum_{t=1}^T e_{tj} x_{ti} \Big| = O_p\big( T^\delta \big) \tag{22}
\]

for any $\delta > 0$.

Next, we note that by the union bound,

\begin{align}
& \max_{1\le i\le k} \max_{1\le j\le p} \Big| \frac 1 T \sum_{t=1}^T \big( \zeta_{t-1,j} - \zeta_{t,j} \big) x_{ti} \Big| \tag{23}\\
&\quad \le \max_{1\le i\le k} \max_{1\le j\le p} |\zeta_{0,j}| \frac{|x_{1i}|}{T} + \max_{1\le i\le k} \max_{1\le j\le p} |\zeta_{T,j}| \frac{|x_{Ti}|}{T} \nonumber\\
&\qquad + \max_{1\le i\le k} \max_{1\le j\le p} \Big| \frac 1 T \sum_{t=2}^T \big( \zeta_{t-1,j} \Delta x_{ti} - E \zeta_{t-1,j} \Delta x_{ti} \big) \Big| + \max_{1\le i\le k} \max_{1\le j\le p} \big| E \zeta_{t-1,j} \Delta x_{ti} \big| \nonumber\\
&\quad = O_p\left( \frac{k \log p}{\sqrt T} \right) + O_p\left( \frac{\log kp}{\sqrt T} \right) + O_p(1), \nonumber
\end{align}

since $\sup_t |x_{ti}| / \sqrt T = O_p(1)$, $\max_j |\zeta_{0,j}| = O_p(\log p)$, and we can use Lemma 1 with $\{|xy| > c\} \subset \{|x| > \sqrt c\} \cup \{|y| > \sqrt c\}$. Note that the bound in (22) dominates the one in (23) for $k = O(\sqrt T)$ and any $\delta > 0$. $\square$
