
Fourier Methods for Estimating the Central Subspace and the Central Mean Subspace in Regression

Yu Zhu and Peng Zeng

In regression with a high-dimensional predictor vector, it is important to estimate the central and central mean subspaces, which preserve sufficient information about the response and the mean response. Using the Fourier transform, we have derived candidate matrices whose column spaces recover the central and central mean subspaces exhaustively. Under the normality assumption on the predictors, explicit estimates of the central and central mean subspaces are derived. Bootstrap procedures are used for determining dimensionality and choosing tuning parameters. Simulation results and an application to a real dataset are reported. Our methods demonstrate competitive performance compared with SIR, SAVE, and other existing methods. The approach proposed in this article provides a novel view of sufficient dimension reduction and may lead to more powerful tools in the future.

KEY WORDS: Bootstrap; Candidate matrix; Central mean subspace; Central subspace; Fourier transform; SAVE; SIR.

1. INTRODUCTION that is the target of sufficient dimension reduction for FY|X.Un- der mild conditions, Cook (1996, 1998) showed that the central Suppose that Y is a univariate response and X is a p-dimen- subspace exists and is unique. Throughout this article, we as- sional vector of continuous predictors. Let F | denote the con- Y X sume the existence of the central subspace. ditional distribution of Y given X and let E[Y|X] denote the When only the mean response E[Y|X] is of interest, suffi- mean response at X. In full generality, the regression of Y on X cient dimension reduction can be defined for E[Y|X] in a similar is to infer about the conditional distribution FY|X, often with fashion as for FY|X. A subspace S is called a mean dimension- the mean response E[Y|X] of primary interest. When F | or Y X reduction subspace if E[Y|X] does not admit a proper parametric form, nonparametric methods such as local polynomial smoothing are usually used Y ⊥⊥ E[Y|X]|PS X. (2) for regression. Because of the curse of dimensionality, how- If the intersection of all mean dimension-reduction subspaces ever, these methods become impractical when the dimension is also a mean dimension-reduction subspace, then it is con- of X is high. To mitigate the curse of dimensionality, various sidered the central mean subspace, denoted by S [ | ] (Cook dimension reduction techniques have been proposed in the lit- E Y X and 2002). Similar to the central subspace, the central mean erature, including projection pursuit regression (Friedman and subspace exists under mild conditions, and so its existence is Stuetzle 1981) and principal component regression (Hotelling assumed throughout this article. S [ | ] is the target of suffi- 1957; Kendall 1957). One popular approach is to project X onto E Y X cient dimension reduction for the mean response E[Y|X] and a lower-dimensional subspace where the regression of Y on X is always a subspace of the central subspace S | (Cook and sufficient dimen- Y X can be performed. In this article we focus on Li 2002). Recently, Yin and Cook (2002) extended the cen- sion reduction (SDR), which further requires that the projection tral mean subspace to the central kth-moment subspace that is of X onto the lower-dimensional subspace does not result in any sufficient for the first k moments of the conditional distribu- loss of information about FY|X or E[Y|X]. tion FY|X. The theory of sufficient dimension reduction originated from Various dimension-reduction methods have been proposed the seminal works of Li (1991) and Cook and Weisberg (1991). in the literature, some of which can be used to estimate the During the past decade, much progress has been achieved in central subspace or the central mean subspace. For the cen- S SDR (see Cook 1998 for a comprehensive account). Let de- tral subspace, these include sliced inverse regression (SIR; Li Rp note a subspace of and let PS be the orthogonal projec- 1991) and sliced average variance estimation (SAVE; Cook and S S tion operator onto in the usual inner product. 
is called a Weisberg 1991); for the central mean subspace, they include dimension-reduction subspace if Y and X are independent con- principal Hessian direction (pHd; Li 1992), iterative Hessian ditioned on PS X, that is, transformation (IHT; Cook and Li 2002), the structure adap- Y ⊥⊥ X|PS X, (1) tive method (SAM; Hristache, Juditsky, Polzehl, and Spokoiny 2001), and minimum average variance estimation (MAVE; , where ⊥⊥ means “independent of.” The dimension-reduction Tong, Li, and Zhu 2002). SAM and MAVE are fundamen- subspace may not be unique. When the intersection of all tally different from the other methods mentioned earlier in dimension-reduction subspaces is also a dimension-reduction that both involve nonparametric estimation of the link func- subspace, it is defined to be the central subspace, denoted tion E[Y|X = x], which may be impractical when the dimension by SY|X (Cook 1996, 1998). The dimension of SY|X is called of X is high. All of the other aforementioned methods avoid the structural dimension of the regression of Y on X and is de- nonparametric estimation of the link function and target either noted by dim(SY|X). SY|X can be considered a metaparameter SY|X or SE[Y|X] directly. They usually follow a common proce- dure consisting of two steps. The first step is to define a p × p nonnegative definite matrix M called a candidate matrix ( Zhu is Assistant Professor, Department of Statistics, Purdue University, West Lafayette, IN 47907 (E-mail: [email protected]). Peng Zeng is As- sistant Professor, Department of Mathematics and Statistics, Auburn Univer- © 2006 American Statistical Association sity, Auburn, AL 36849 (E-mail: [email protected]). The authors thank the Journal of the American Statistical Association editor, the associate editor, and three anonymous referees for constructive com- December 2006, Vol. 101, No. 476, Theory and Methods ments and suggestions that greatly helped improve an earlier manuscript. DOI 10.1198/016214506000000140

1638 Zhu and Zeng: Fourier Subspace Estimation in Regression 1639 and Weiss 2003), whose columns span a subspace of SE[Y|X] or To facilitate our approach, we need to modify the model as- ˆ SY|X, and then propose a consistent estimate M of the candidate sumptions as follows. First, we assume that the joint distribu- | | matrix from a sample {(xi, )}1≤i≤n of (X, Y). The second step tion of (X, Y), the conditional distributions of X Y and Y X, and is to obtain the spectral decomposition of Mˆ and use the space the marginal distributions of X and Y admit densities, which are | | spanned by the eigenvectors of Mˆ corresponding to the largest denoted by fX,Y (x, y), fY|X(y x), fX|Y (x y), fX(x), and fY (y).Let = × q eigenvalues as the estimate of S [ | ] or S | , where q is the B (β1, β2,...,βq) be a p q matrix with its columns form- E Y X Y X S dimension of S [ | ] or S | . Recently, Cook and Ni (2005) ing a basis for Y|X. Then (1) can be restated in terms of con- E Y X Y X = proposed the minimum discrepancy method, which is more effi- ditional distributions as FY|X FY|Bτ X or in terms of density as cient than spectral decomposition for estimating SE[Y|X] or SY|X from a given candidate matrix. For these methods to work, some | = | τ = ; τ τ τ fY|X(y x) fY|Bτ X(y B x) h(y β1x, β2x,...,βqx), (3) distributional assumptions need to be imposed on X. For con- ; + venience, we assume that the mean of X is the origin of Rp and where h(y u1,...,uq) is a (q 1)-variate function. Similarly, let α ,...,α be a basis of S [ | ]; then (2) is equivalent to the covariance matrix of X is the standard p × p identity - 1 q E Y X [ | = ]= τ τ τ trix Ip. Then SIR and IHT require that X satisfies the linearity E Y X x g(α1x, α2x,...,αqx), (4) condition, E[X|PS X]=PS X, and SAVE and pHd require Y|X Y|X where g is a q-variate function. We assume the differentiabil- an additional condition called the constant variance condition, ity of h and g with respect to their coordinates wherever it is cov[X|PS X]=QS , where QS = I −PS . These con- Y|X Y|X Y|X p Y|X needed. X ditions are satisfied when the distribution of is multivariate The rest of the article is organized as follows. Section 2 normal. (For a detailed discussion of the conditions, see Cook derives MFM for the central mean subspace, and Section 3 1998.) derives MFC for the central subspace. Section 4 derives the Although SIR, SAVE, pHd, and IHT work well in practice, estimates of MFM and MFC under the assumption that X is nor- there has not been much study on when they can exhaustively mal and discusses the asymptotic properties of these estimates. recover the central subspace or the central mean subspace. It Section 5 focuses on implementation of the proposed meth- is known that SIR fails to capture directions along which Y is ods for estimating the central subspace and the central mean = τ 2 + symmetric about. For example, if Y (β X) ε, where β is a subspace. Section 6 compares the proposed methods with SIR, p-dimensional vector, τ denotes transpose, X follows N(0, Ip), SAVE,and other existing methods using synthetic and real data. and ε is a random error independent of X, then SIR will miss β. Section 7 contains conclusions and future work. The Appendix A sufficient condition for SAVE to exhaustively recover SY|X is gives proofs of propositions, theorems, and crucial equations. that the conditional distribution of X given Y is multivariate In this article we use S(M) to denote the linear space spanned normal, which may be restrictive in practice. 
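The symmetric-direction failure of SIR mentioned above is easy to see numerically: when Y = (β^τ X)^2 + ε with X standard normal, the within-slice means E[X | Y in slice] on which SIR is built are all essentially zero, so no linear combination of them points along β. The following sketch is our own minimal illustration (the simulation settings, function names, and the naive slicing are ours, not the article's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 6
beta = np.zeros(p)
beta[0] = 1.0                                   # true direction

X = rng.standard_normal((n, p))
Y = (X @ beta) ** 2 + 0.1 * rng.standard_normal(n)

# Slice means E[X | Y in slice]: the basic ingredient of SIR.
n_slices = 5
order = np.argsort(Y)
slice_means = np.array([X[idx].mean(axis=0)
                        for idx in np.array_split(order, n_slices)])

# Every slice mean is close to the origin, so the SIR candidate matrix
# (a weighted sum of their outer products) carries almost no signal
# along beta, even though Y depends on X only through beta' X.
print(np.round(slice_means, 2))
```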
A potential risk of by the columns of a matrix M. applying these methods is that they may lead to the loss of in- 2. CENTRAL MEAN SUBSPACE formation regarding FY|X or E[Y|X]. Therefore, it is desirable to derive new methods that can guarantee the exhaustive recov- In this section we propose a candidate matrix MFM whose ery of the central subspace or the central mean subspace under column space is exactly the central mean subspace SE[Y|X].We general conditions. Recently, Li, Zha, and Chiaromonte (2005) follow a commonly used idea for deriving candidate matrices in made progress in this direction by proposing contour regression the literature: identify vectors that belong to SE[Y|X] and com- for sufficient dimension reduction. Contour regression assumes bine them to generate the candidate matrix. The major tool that that X follows an elliptically contoured distribution and also re- we use is the Fourier transform. quires conditions that involve X, Y, and both vectors in SY|X Let m(x) = E[Y|X = x].From(4),m(x) = g(u), where u = S⊥ τ = and Y|X (see assumptions 2.1 and 4.1 and thm. 4.2 in Li et al. A x and A (α1, α2,...,αq), whose columns form a basis of ∂ ∂ ∂ ∂ 2005). S [ | ].Let = ( , ,..., )τ denote the gradient oper- E Y X ∂x ∂x1 ∂x2 ∂xp This article represents another effort to derive methods that ator. By the chain rule of differentiation, can fully recover the central mean subspace and the central sub- ∂ ∂ space under various conditions. The primary tool that we use is m(x) = A g(u). the Fourier transform. At the population level, we have derived ∂x ∂u two candidate matrices, M and M , whose column spaces Thus the gradient of m(x) at any fixed x is a linear combination FM FC S ={ ∈ are identical to the central mean subspace and the central sub- of α1, α2,...,αq; therefore, it is in E[Y|X]. Let supp(X) x Rp } space. Given a sample of (X, Y), if consistent estimates of M : fX(x)>0 be the support of X. The collection of all of the FM ∈ and M can be found, they can be used to exhaustively recover gradients of m(x) over x supp(X) spans the central mean sub- FC S the central mean subspace and the central subspace. In fact, the space E[Y|X], as does the collection of all the gradients of m(x) weighted by fX(x), that is, consistent estimates exist, but their exact formulas or calcula-   tions depend on the amount of prior knowledge that we have ∂ S [ | ] = span m(x), x ∈ supp(X) regarding the distribution of X. Due to space limitations, in this E Y X ∂x article we fully implement our methods only for the case where    ∂ p X is normally distributed. The implementation of our method = span m(x) fX(x), x ∈ R . (5) under more general conditions is briefly described herein and ∂x is currently under further investigation, and we will report the The proof of (5) is given in the Appendix. From (5), it appears results in future work. that we can immediately use the gradient of m(x) to generate a 1640 Journal of the American Statistical Association, December 2006 candidate matrix for SE[Y|X]. However, this idea leads to inef- The first property indicates that the real and imaginary parts ficient estimation of the central mean subspace, because much of ψ(ω) are the vectors that we can use to generate candidate effort must be spent on estimating g and its derivatives non- matrices for the central mean subspace. In the second property, parametrically (Hristache et al. 2001; Xia et al. 2002). 
Recall ψ(ω) is represented as an expectation of the random function that the primary goal of sufficient dimension reduction is to re- −(ıω + G(X))Y exp{ıωτ X}. Recall that ψ(ω) was originally cover the central mean subspace SE[Y|X] only; thus we hope defined in terms of m(x) as in (6). The new expression of ψ(ω) to achieve dimension reduction while avoiding fitting the link in (8) does not include m(x) explicitly, giving us the opportu- function g(x) and its derivatives as much as possible. This can nity to estimate ψ(ω) without directly estimating m(x).This be realized by considering the Fourier transform of the gradient distinguishes our method from the methods that directly esti- ∂ of m(x) instead of using ∂x m(x) directly. Another advantage of mate m(x) and its derivatives in the literature. ∂ using the Fourier transform of ∂x m(x) is that sufficient dimen- The third property is essentially the RiemannÐLebesgue sion reduction for the central subspace and the central mean lemma for the Fourier transform (Folland 1992, pp. 217, 243). subspace can be dealt with in a unified fashion, as we demon- It indicates that when the density-weighted gradient is ab- strate in the next section. solutely integrable, ψ(ω) decays to 0 as the norm of ω goes For ω ∈ Rp,let    to infinity. The fourth property is a result from applying the τ ∂ Plancheral theorem for the Fourier transform (Folland 1992, ψ(ω) = exp{ıω x} m(x) fX(x) dx. (6) ∂ ∂x pp. 222, 224) to ( ∂x m(x))fX(x) and ψ(ω), and it establishes the connection between the expected outer product of the density- Then ψ(ω) is the Fourier transform of the density-weighted [ ∂ ∂ τ ] ∂ weighted gradient of m(x) (i.e., E ( ∂x m(X))( ∂x m(X)) fX(X) ), gradient ( ∂x m(x))fX(x). Intuitively, ψ(ω) can also be consid- ∗ = { τ } and the integral of the outer product of ψ(ω).LetMFM ered an average of the gradient of m(x) weighted by exp ıω x −p ¯ τ ∗ over x with density f (x). In particular, when ω = 0, ψ(0) = (2π) ψ(ω)ψ(ω) dω. Then the column space of MFM, X S ∗ [ ∂ ] (MFM), is exactly equal to the central mean subspace, as E ∂x m(X) , which is exactly the average gradient of m(x) (Härdle and Stoker 1989). Therefore, in some sense, ψ(ω) is a stated in the following proposition. generalized average derivative of the mean response m(x).Let Proposition 2. M∗ is a real nonnegative definite matrix, ∗ FM a(ω) and b(ω) be the real and imaginary parts of ψ(ω), that is, and S(M ) = S [ | ]. = + FM E Y X ψ(ω) a(ω) ıb(ω). Because the gradient of m(x) belongs to ∗ SE[Y|X], both a(ω) and b(ω) belong to SE[Y|X]. Intuitively, MFM can be considered the sum of the outer prod- An appealing property of ψ(ω) is that it contains all of the uct of ψ(ω) over all ω. Because of (7), ψ(ω) can be considered information of the gradient ∂ m(x), because ∂ m(x) can be the vector of coefficients of exp{ıxτ ω} in the representation of ∂x ∂x ∂ { τ } recovered from ψ(ω) through the inverse Fourier transform ∂x m(x)fX(x). For different ω’s, the exp ıx ω are basic os- ∂ cillatory functions with different frequencies. So ψ(ω) with (Folland 1992, p. 244). Assuming that ( ∂x m(x))fX(x) is inte- Rp   ∂ grable and continuous on and that ψ(ω) also is integrable, small ω captures the patterns of ∂x m(x)fX(x) with low fre-    quencies, whereas ψ(ω) with large ω captures the patterns ∂ − m(x) f (x) = (2π) p exp{−ıxτ ω}ψ(ω) dω. (7) of ∂ m(x)f (x) with high frequencies. According to the third ∂x X ∂x X property of ψ(ω), ψ(ω) goes to 0 when ω goes to infinity. 
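The second property is the one exploited for estimation: because (8) writes ψ(ω) as an expectation over (X, Y), it can be approximated by a sample average without ever estimating m(x) or its derivatives. Below is a minimal sketch of our own, assuming standardized predictors that are approximately standard normal so that G(x) = −x; the function name and the toy model are ours:

```python
import numpy as np

def psi_hat(omega, X, Y):
    """Sample version of (8): psi(omega) = -E[(i*omega + G(X)) * Y * exp(i*omega'X)],
    with G(x) = -x under (approximately) standard normal predictors."""
    phase = np.exp(1j * (X @ omega))                         # exp(i * omega' x_i)
    terms = (1j * omega[None, :] - X) * (Y * phase)[:, None]
    return -terms.mean(axis=0)                               # complex p-vector

# Toy check: the response depends on X only through its first coordinate,
# so the real and imaginary parts of psi_hat concentrate on that coordinate.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)
print(np.round(psi_hat(np.array([0.5, 0.0, 0.0, 0.0]), X, Y), 3))
```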
From (5), we know that the central mean subspace is spanned by Therefore, when patterns with various frequencies are of differ- the gradients. Considering the correspondence between ψ(ω) ent interest, ψ(ω) of different ω’s should be treated differently. ∂ and ∂x m(x) as demonstrated in (6) and (7), we can use ψ(ω) This can be realized by assigning different weights to ω when to generate a candidate matrix for the central mean subspace combining the outer product of ψ(ω).WeuseK(ω) to denote S E[Y|X]. The other properties of ψ(ω) are summarized in the the weight function and generate a more flexible candidate ma- following proposition. trix for the central mean subspace as   Proposition 1. M = Re ψ(ω)ψ¯ (ω)τ K(ω) dω 1. The central mean subspace is spanned by a(ω) and b(ω), FM p  that is, SE[Y|X] = span{a(ω), b(ω) : ω ∈ R }. τ τ 2. Suppose that log fX(x) is differentiable and m(x)fX(x) = [a(ω)a(ω) + b(ω)b(ω) ]K(ω) dω, (10) goes to 0 as x→∞, then   where Re(·) means “the real part of.” If K(ω) is a radial weight ψ(ω) =−E (ıω + G(X))Y exp{ıωτ X} , (8)  (X,Y) function, then ψ(ω)ψ¯ (ω)τ K(ω) dω itself is a real matrix. where G(x) = ∂ log f (x). ∂x X Proposition 3. If K(ω) is a positive weight function on Rp, 3. If ( ∂ m(x))f (x) is absolutely integrable for 1 ≤ i ≤ p, S = ∂xi X then MFM is a nonnegative definite matrix and (MFM) →  →∞ then ψ(ω) 0 as ω . SE[Y|X]. ∂ ≤ ≤ 4. If ( ∂x m(x))fX(x) is squared integrable for 1 i p, then  i   Proposition 3 indicates that the central mean subspace  τ ∂ ∂ 2 SE[Y|X] can be exhaustively recovered by the column space m(x) m(x) fX(x) dx ∂x ∂x of MFM. In the proposition, we have assumed that K(ω) is  positive over all Rp. This condition can be substantially weak- − = (2π) p ψ(ω)ψ¯ (ω)τ dω, (9) ened; for example, if ψ(ω) is analytic, then the proposition holds true for any weight function K(ω) with bounded sup- where ψ¯ (ω) is the conjugate of ψ(ω). port that contains an open set. In this article, we focus on Zhu and Zeng: Fourier Subspace Estimation in Regression 1641 positive weight functions only. Although in theory the proposi- sponse of T(Y, t) at X = x is  tion is true for any positive function, in practice the particular   choice will affect the performance of dimension reduction m(x, t) = E T(Y, t)|X = x = exp{ıty}fY|X(y|x) dy. based on MFM, especially when the sample size is moder- ate. For simplicity, we choose the Gaussian function K(ω) = Thus m(x, t) is the Fourier transform, or the characteristic func- − 2 p/2 {− 2 2 } tion, of the conditional density function fY|X(y|x).BothT(Y, t) (2πσW) exp ω /2σW as the weight function in the 2 and m(x, t) are complex functions. The central mean subspace rest of the article, where σW is a constant (or tuning parameter) that controls how the weight is assigned to different ψ(ω)’s. for T(Y, t) is defined as the sum of the central mean sub- spaces for its real and imaginary parts, that is, SE[T(Y,t)|X] = Furthermore, K(ω) leads to an explicit expression of MFM,   SE[sin(tY)|X] +SE[cos(tY)|X]. The following proposition states that = MFM E(U1,V1),(U2,V2) JFM((U1, V1), (U2, V2)) , (11) the central subspace SY|X is equal to the sum of SE[T(Y,t)|X] over ∈ R where t . ∂ Proposition 4. Suppose that f | (y|x) exists and is ab- JFM((U1, V1), (U2, V2)) ∂x Y X solutely integrable with respect to y. Then 2 2 −σ /2U12 = V1V2e W   SY|X = SE[T(Y,t)|X]. × 2 + − 2 + 2 τ σWIp (G(U1) σWU12)(G(U2) σWU12) , t∈R

(U1, V1) and (U2, V2) are independent and identically distrib- Because {T(·, t) : t ∈ R} is a subset of all the possible trans- uted as (X, Y), Ip is the p × p identity matrix, and U12 = formations, Proposition 4 implies (12). U1 − U2. When defining the central kth-moment subspace, Yin and Cook (2002) considered the power transformations of Y, which 3. CENTRAL SUBSPACE are Yl with 1 ≤ l ≤ k. Although the transformations that we As discussed in Section 1, the central mean subspace can consider can be considered an extension of the power transfor- only capture the information in X regarding the mean response mations, they lead to entirely different methods for recovering E[Y|X]. In applications where the entire conditional distribu- the central subspace. Proposition 4 suggests that, to recover the S tion FY|X is of interest, sufficient dimension reduction should central subspace Y|X, we can first recover the central mean ∈ R aim at the central subspace SY|X. This section focuses on the subspace for T(Y, t) at fixed t, then combine them over t . derivation of the candidate matrix MFC for SY|X.Weagainuse The methods and results developed in Section 2 for the central the Fourier transform, as well as other similar ideas from Sec- mean subspace SE[Y|X] can be directly generalized for the cen- tion 2. tral mean subspace SE[T(Y,t)|X] with Y replaced by exp{ıtY}. S First, we establish a connection between SY|X and a family Hence we can combine the candidate matrices for E[T(Y,t)|X] of central mean subspaces. As noted in Section 1, the central to generate a candidate matrix for SY|X. The idea of combin- mean subspace SE[Y|X] is always a subspace of SY|X.LetT de- ing candidate matrices to generate new ones was originally note a transformation of Y; then T(Y) is a new response. It can mentioned by Li (1991) and was further investigated by Ye be shown that the central mean subspace of T(Y), denoted by and Weiss (2003). In what follows, we start with applying the SE[T(Y)|X], is also a subspace of SY|X. For two different trans- Fourier transform to the gradient of m(x, t) as in Section 2 and formations T1(Y) and T2(Y), their corresponding central mean materialize the foregoing idea to derive a candidate matrix MFC S subspaces SE[T (Y)|X] and SE[T (Y)|X] are not necessarily identi- for Y|X. 1 2 = cal and may cover different parts of SY|X. This provides a pos- Because m(x, t) is the mean response of T(Y, t) at X x, similar to (5) in Section 2, we have sibility to recover the entire central subspace by collecting the   central mean subspace of T(Y) over all of the possible transfor- S = ∂ ∈ mations, that is, E[T(Y,t)|X] span m(x, t), x supp(X) ∂x S = S    Y|X E[T(Y)|X], (12) ∂ = span m(x, t) f (x), x ∈ Rp , all possible T ∂x X where where the spanned space of complex vectors is defined to be SE[T(Y)|X] the space spanned by their real parts and imaginary parts. Using T Proposition 4, we have   = u : There exist a finite number of transformations ∂ S | = span m(x, t), x ∈ supp(X), t ∈ R Y X ∂x T1,...,Tk, such that u = u1 +···+uk and    ∂ p ui ∈ SE[T (Y)|X] for 1 ≤ i ≤ k . = span m(x, t) f (x), x ∈ R , t ∈ R . (13) i ∂x X The foregoing equation is indeed true and is implied by Propo- As in Section 2, to derive a candidate matrix for SY|X,wedo sition 4. In fact, it is not necessary to use all of the possible ∂ transformations in (12); a family of properly chosen transfor- not use the gradient ∂x m(x, t) directly; instead we consider its Fourier transform. 
For any ω ∈ R^p and t ∈ R, define

$$\phi(\omega, t) = \int \exp\{\imath \omega^{\tau} x\} \left(\frac{\partial}{\partial x} m(x, t)\right) f_X(x)\, dx. \qquad (14)$$

∂ Then φ(ω, t) is the Fourier transform of ∂x m(x, t) weighted by where the marginal density function of X, and it preserves all of the ∂ JFC((U1, V1), (U2, V2)) information about ∂x m(x, t).Leta(ω, t) and b(ω, t) be the real   and imaginary parts of φ(ω, t). The properties of φ(ω, t) are σ 2 σ 2 = exp − W U 2 − T V2 summarized in the following proposition. 2 12 2 12   ∈ R × 2 + − 2 + 2 τ Proposition 5. For each fixed t , the following assertions σWIp (G(U1) σWU12)(G(U2) σWU12) , hold: (20) S 1. Both a(ω, t) and b(ω, t) are vectors in E[T(Y,t)|X];fur- (U1, V1) and (U2, V2) are independent and identically distrib- thermore, uted as (X, Y), Ip is the identity matrix, U12 = U1 − U2, and V = V − V . p 12 1 2 S | = { : ∈ R ∈ R} Y X span a(ω, t), b(ω, t) ω , t . (15) The derivation of MFC can also be understood from the per- spective of inverse regression used in SIR, where the candidate 2. Suppose that log fX(x) is differentiable and that matrix is defined by the first moment of the conditional distri-  →∞ m(x, t)fX(x) goes to 0 when x ; then bution of X given Y. Next we present a connection between   φ(ω, t), which is used to define M , and the conditional dis- φ(ω, t) =−E (ıω +G(X)) exp{ıωτ X+ıtY} . (16) FC (X,Y) tribution of X given Y. Define   3. If ( ∂ m(x, t))f (x) is absolutely integrable for 1 ≤ i ≤ p, η(y, ω) =−E (ıω + G(X)) exp{ıωτ x}|Y = y . ∂xi X then φ(ω, t) → 0 as ω→∞. ∂ For fixed ω, η(y, ω) can be considered the mean response vector 4. If ( m(x, t))fX(x) is squared integrable for 1 ≤ i ≤ p, τ ∂xi for inversely regressing −(ıω + G(X)) exp{ıω X} on Y.The then    properties of η(y, ω) are given in the following proposition.  τ ∂ ∂ 2 Proposition 7. Suppose for any fixed y, f (x, y) goes to 0 m(x, t) m¯ (x, t) fX(x) dx X,Y ∂x ∂x as x→∞. Then the following hold:  − = (2π) p φ(ω, t)φ¯ (ω, t)τ dω, (17) 1. For any given y and ω, the real and imaginary parts of η(y, ω) are vectors in SY|X; in particular, those of η(y, 0) =−E[G(X)|Y = y] are vectors in S | . ¯ ¯ Y X where m(x, t) and φ(ω, t) are the conjugates of m(x, t) 2. φ(ω, t) = E[η(Y, ω) exp{ıtY}]. and φ(ω, t). Recall that φ(ω, t) serves as a building block for MFC.From Proposition 5 is a direct extension of Proposition 1, with Y the second statement in Proposition 7, φ(ω, t) can be considered replaced by exp{ıtY} and m(x) replaced by m(x, t). According the Fourier transform of η(y, ω) with respect to the marginal to the first property in Proposition 5, using similar arguments as density of Y. When X is multivariate normal with mean 0 and those after Proposition 1, we can combine a(ω, t) and b(ω, t) covariance matrix Ip, G(x) =−x and η(y, 0) = E[X|Y], which to derive a candidate matrix for SY|X.LetK(ω) be a weight serve as the building blocks in SIR. This observation leads to an ∈ Rp ∈ R function for ω and let k(t) be a weight function for t . important connection between MFC and the candidate matrix Define MSIR used in SIR.   ¯ τ Proposition 8. Suppose that X follows the normal distribu- MFC = Re φ(ω, t)φ(ω, t) K(ω)k(t) dω dt tion with mean 0 and covariance matrix Ip.IfK(ω) is chosen to  be a point mass at ω = 0, then S(MFC) = S(MSIR). = [a(ω, t)a(ω, t)τ + b(ω, t)b(ω, t)τ ] Proposition 8 indicates that the column spaces of MFC and M coincide when X follows the standard normal distribution × K(ω)k(t) dω dt. (18) SIR and the weight function K(ω) is degenerate at ω = 0. Hence Proposition 6. If K(ω) is a positive weight function on Rp MSIR could be considered a special case of MFC under the normality assumption. 
Although S(MFC) and S(MSIR) are the and k(t) is a positive weight function on R, then MFC is a non- same, the matrices are generally different from each other. It is negative definite matrix, and S(M ) = S | . FC Y X known that SIR fails to capture the directions along which Y is S 2 = Proposition 6 implies that Y|X can be exhaustively recov- symmetric about, as does MFC with σW 0. In general, we do 2 ered by the column space of MFC. In this article, for conve- not use a degenerate weight function for ω. When σW > 0, the nience, both K(ω) and k(t) are chosen to be the Gaussian func- entire central subspace can be successfully recovered, as stated = 2 −p/2 {− 2 2 } in Proposition 6. tions, which are K(ω) (2πσW) exp ω /2σW and k(t) = (2πσ2)−1/2 exp{−t2/2σ 2}, where σ 2 and σ 2 are two T T W T 4. ESTIMATION OF CANDIDATE MATRICES constants that control how the weights are assigned to different φ(ω, t). Furthermore, the Gaussian weight functions lead to an In this section we derive the estimates of MFC and MFM and explicit expression of MFC, discuss their asymptotic properties. We assume that the dimen- 2 2   sionality q and the tuning parameters σW and σT are known, = MFC E(U1,V1),(U2,V2) JFC((U1, V1), (U2, V2)) , (19) and discuss their selection in the next section. Zhu and Zeng: Fourier Subspace Estimation in Regression 1643

Let {(xi, yi)}1≤i≤n be a random sample of (X, Y). Without where loss of generality, we assume that E[X]=0 and cov[X]=Ip. JFCN((u1, v1), (u2, v2)) We first consider MFC. According to (19), MFC is the expecta-   2 2 tion of JFC over (U1, V1) and (U2, V2) that are independent and σ σ = exp − W u 2 − T v2 identically distributed as (X, Y).LetF(x, y) be the cumulative 2 12 2 12 distribution function of (X, Y). Then MFC can be expressed as    × 2 + + 2 − 2 τ σWIp (u1 σWu12)(u2 σWu12) . (25) MFC = JFC((u1, v1), (u2, v2)) dF(u1, v1) dF(u2, v2). Because Mˆ is a V-statistic, it can be expanded as the sum (21) FCN of a U-statistic and a low-order term (Lee 1990). The asymp- ˆ With the given sample {(xi, yi)}1≤i≤n, a natural estimate of totic distribution of MFCN can be obtained by the theory of F(x, y) is its empirical distribution, U-statistics. n 1 Theorem 1. Suppose that X follows the standard multi- F (x, y) = I[ ≤ ≤ ], n n xi x,yi y variate normal distribution and that the covariance matrix of i=1 vec(JFCN((U1, V1), (U2, V2))) exists. As n →∞, where I[·] is the indicator function. Therefore, a proper estimate n 1 − of MFC is derived by replacing F(u1, v1) and F(u2, v2) in (21) Mˆ = M + J(1) (x , y ) − 2M + o n 1/2 , FCN FCN n FCN i i FCN p with Fn(u1, v1) and Fn(u2, v2), which is i=1 n n 1 where Mˆ = J ((x , y ), (x , y )). FC 2 FC i i j j (22)  n = = (1) i 1 j 1 J (x, y) = E(U ,V ) JFCN((x, y), (U2, V2)) FCN 2 2  The explicit expression of JFC((xi, yi), (xj, yj)) is given in (20). + J ((x, y), (U , V ))τ . ˆ FCN 2 2 Note that to make MFC a legitimate estimate, we need to esti- = ∂ ≤ ≤ (1) mate G(xi) ∂x log fX(xi) for 1 i n. Let FCN be the covariance matrix of vec(JFCN(X, Y)). Then If we know that the distribution of X belongs to a cer- √ ˆ L tain parametric family [i.e., fX(x) = f0(x; θ), where f0(·) is of n vec(MFCN) − vec(MFCN) −→ N(0, FCN), as n →∞. known form and θ is an unknown parameter], then the max- ˆ In Theorem 1, vec is an operator that transforms a matrix imum likelihood estimate θ can be calculated using {xi}1≤i≤n, ∂ ; ˆ ; ˆ to a vector by stacking up all of its columns. For instance, if and G(xi) can be estimated by ∂x f0(xi θ)/f0(xi θ).IffX(x) be- M = (m1,...,mk) is a p × k matrix and the mi’s are the col- longs to the family of elliptically contoured distributions [i.e., = τ τ τ × 2 umn vectors, then vec(M) (m1,...,mk ) is a kp 1 vector. fX(x) = g(x ), where g(·) is an unknown function], then, us- ˆ Theorem 1 asserts that MFCN converges to MFCN at the rate ing {xi}1≤i≤n, a one-dimensional nonparametric procedure √ ˆ  2 ˆ  2 of n, which implies that the eigenvalues and eigenvectors of can be used to obtain g( x ) and g ( x ), which are the es- ˆ  ∂  2 MFCN converge to those of MFCN at the same rate. timates of g and its derivative g , and ∂x log(g( x )) can be −  Although the explicit expression of M in (23) is obtained estimated by 2xgˆ 1(x2)gˆ (x2). In general, if there is no FCN under the normality assumption, in fact it remains as a candi- prior knowledge about fX(x), then nonparametric density esti- date matrix for S | under a weaker condition, as stated in the mators can be used to estimate f (x) and ∂ f (x), and also to Y X X ∂x X following proposition. obtain an estimate of G(xi). The general case is currently under = investigation. Proposition 9. Suppose that B (β1,...,βq) is an ortho- In the rest of this article, we focus on the case where X fol- S ˜ = normal basis of Y|X and that B (βq+1,...,βp) is an ortho- p τ lows a multivariate normal distribution only. 
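Under the normality assumption, (24) with kernel (25) (and likewise the analogous kernel for the central mean subspace given later in this section) is a V-statistic over all pairs of standardized observations and can be coded directly. The sketch below is our own vectorized rendering of these formulas; the function names, the vectorization, and the default tuning values σ²_W = .1 and σ²_T = 1 are our choices:

```python
import numpy as np

def m_fcn_hat(X, Y, sw2=0.1, st2=1.0):
    """V-statistic estimate (24)-(25) of the FC candidate matrix under normality.
    X: (n, p) standardized predictors; Y: (n,) standardized response."""
    n, p = X.shape
    D = X[:, None, :] - X[None, :, :]                      # u_i - u_j, shape (n, n, p)
    w = np.exp(-0.5 * sw2 * (D ** 2).sum(-1)
               - 0.5 * st2 * (Y[:, None] - Y[None, :]) ** 2)
    A = X[:, None, :] + sw2 * D                            # u_i + sigma_W^2 (u_i - u_j)
    B = X[None, :, :] - sw2 * D                            # u_j - sigma_W^2 (u_i - u_j)
    outer = np.einsum('ij,ija,ijb->ab', w, A, B)           # sum_ij w_ij A_ij B_ij'
    return (sw2 * w.sum() * np.eye(p) + outer) / n ** 2

def m_fmn_hat(X, Y, sw2=0.1):
    """Analogous V-statistic for the central mean subspace (FM), with the
    response pair replacing the Gaussian weight in t."""
    n, p = X.shape
    D = X[:, None, :] - X[None, :, :]
    w = np.outer(Y, Y) * np.exp(-0.5 * sw2 * (D ** 2).sum(-1))
    A = X[:, None, :] + sw2 * D
    B = X[None, :, :] - sw2 * D
    outer = np.einsum('ij,ija,ijb->ab', w, A, B)
    return (sw2 * w.sum() * np.eye(p) + outer) / n ** 2
```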
Normality is a normal basis of the complementary space of SY|X in R .IfB X common assumption in regression and is at least approximately and B˜ τ X are independent of each other and B˜ τ X follows the valid in many applications. For applications where the nor- standard normal distribution, then S(MFCN) ⊆ SY|X. mality assumption is not valid, variable transformation or data resampling can be considered to alleviate the violation of nor- We call the assumption in Proposition 9 the weak normal- mality, so that the methods developed herein can still be ap- ity condition. Proposition 9 implies that MFCN can be used to plied. (See Brillinger 1994 for the resampling method and Cook recover the central subspace under the weak normality assump- and Nachtsheim 1994 for the Voronoi weighting method.) tion, although it does not guarantee that the central subspace Assume that X follows the multivariate normal distribution can be recovered exhaustively as under the normality condition. τ with mean 0 and covariance matrix Ip. Then G(x) =−x.For Note that the distribution of B X can be arbitrary. The weak ˆ clarity, we use JFCN, MFCN, and MFCN to denote JFC, MFC, normality condition represents a situation in practice where ˆ and MFC under the normality assumption. Hence Y depends only on a subset of predictors and the rest of the   predictors are random noises following normal distributions. MFCN = E(U ,V ),(U ,V ) JFCN((U1, V1), (U2, V2)) (23) 1 1 2 2 For the central mean subspace SE[Y|X] and its candidate and matrix MFM, similar discussions can be used to derive the n n estimates of M under various conditions on X. In what fol- 1 FM Mˆ = J ((u , v ), (u , v )), (24) lows, we report only the results under the normality assump- FCN n2 FCN i i j j ˆ i=1 j=1 tion on X.Again,weuseJFMN, MFMN, and MFMN for JFM, 1644 Journal of the American Statistical Association, December 2006

ˆ ˆ −1/2 MFM, and MFM under the normality condition. Recall that 1. Standardize data as follows: x˜i =  (xi − x¯) and y˜i = ˆ G(x) =−x; therefore, (yi −¯y)/sy, where x¯ and  are the sample mean and the   sample covariance matrix of the xi’s and y¯ and sy are the MFMN = E(U ,V ),(U ,V ) JFMN((U1, V1), (U2, V2)) , (26) 1 1 2 2 sample mean and standard deviation of the yi’s. ˆ ˆ and the estimate of MFMN is 2. Calculate MFCN (or MFMN) using the standardized data { x˜ y˜ } ≤ ≤ n n ( i, i) 1 i n. 1 ˆ ˆ Mˆ = J ((x , y ), (x , y )), (27) 3. Obtain the spectral decomposition of MFCN (or MFMN), FMN 2 FMN i i j j ˆ n = = that is, the eigenvectorÐeigenvalue pairs (eˆ1, λ1),..., i 1 j 1 ˆ ˆ ˆ (eˆp, λp) with λ1 ≥···≥λp. where ˆ ˆ ˆ −1/2 ˆ −1/2 4. Then SY|X(or SE[Y|X]) = span{ eˆ1,..., eˆq}.
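Steps 1-4 above amount to a few lines of linear algebra once a candidate-matrix routine is available. The following wrapper is our own sketch; it assumes the hypothetical m_fcn_hat function from the earlier sketch (an implementation of (24)-(25)) and simply standardizes the data, forms the candidate matrix, and maps the leading eigenvectors back to the original predictor scale:

```python
import numpy as np
from numpy.linalg import eigh

def fc_directions(X, Y, q, sw2=0.1, st2=1.0):
    """Steps 1-4 of the FC procedure: returns a p x q basis estimate of the
    central subspace expressed on the original scale of X."""
    x_bar = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = eigh(Sigma)                          # Sigma = V diag(vals) V'
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Xs = (X - x_bar) @ Sigma_inv_half                 # step 1: standardize
    Ys = (Y - Y.mean()) / Y.std()
    M = m_fcn_hat(Xs, Ys, sw2, st2)                   # step 2: candidate matrix
    M = 0.5 * (M + M.T)                               # guard against round-off asymmetry
    evals, evecs = eigh(M)                            # step 3: spectral decomposition
    top = evecs[:, np.argsort(evals)[::-1][:q]]       # leading q eigenvectors
    return Sigma_inv_half @ top                       # step 4: back-transform
```

Replacing m_fcn_hat with the FM kernel gives the corresponding FM procedure for the central mean subspace.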

JFMN((u1, v1), (u2, v2))   The foregoing procedure is fairly standard in sufficient di- σ 2 mension reduction, except that the estimated candidate matri- = v v exp − W u 2 ˆ ˆ 1 2 2 12 ces MFCN (or MFMN) based on the Fourier method is used in   step 2. For convenience, in the rest of the article we refer to the × 2 + + 2 − 2 τ ˆ σWIp (u1 σWu12)(u2 σWu12) . (28) procedure with MFCN simply as FC, representing the Fourier ˆ method for estimating the central subspace, and the procedure The asymptotic normality of MFMN is established in the follow- ˆ with MFMN as FM, representing the Fourier method for esti- ing theorem. 2 mating the central mean subspace. Two parameters, q and σW, Theorem 2. Suppose that X follows the standard multi- need to be specified in FM, whereas in FC, an additional para- 2 variate normal distribution and that the covariance matrix of meter σT must be chosen. The parameter q is different from the vec(JFMN((U1, V1), (U2, V2))) exists. As n →∞, other two in that the former is a model parameter and the latter n are tuning parameters. The foregoing procedures assume that 1 − Mˆ = M + J(1) (x , y ) − 2M + o n 1/2 , these parameters are known. Next, we discuss the determination FMN FMN n FMN i i FMN p i=1 of dimensionality q and the choice of the tuning parameters. where 5.2 Determination of Dimensionality q  (1) J (x, y) = E(U ,V ) JFMN((x, y), (U2, V2)) S S FMN 2 2  In practice, the dimension of Y|X (or E[Y|X]) is unknown + J ((x, y), (U , V ))τ . and must be inferred from data. One informal method is to gen- FMN 2 2 ˆ ˆ erate the scree plot of the eigenvalues of MFCN (or MFMN) (1) Let FMN be the covariance matrix of vec(JFMN(X, Y)). Then as in principal components analysis, and look for an “elbow” √ pattern in the plot. The dimension q is chosen to be the num- ˆ L n vec(MFMN)−vec(MFMN) −→ N(0, FMN), as n →∞. ber of dominant eigenvalues. Several more formal methods for choosing q have been proposed in the literature. For example, 5. IMPLEMENTATION Li (1991, 1992) proposed using a chi-squared statistic to se- In this section we describe the procedures for estimat- quentially test q = 0, 1, 2, and so on, whereas Cook and Yin ing SY|X and SE[Y|X] using the estimated candidate matrices (2001) advocated using permutation tests for the same purpose. ˆ ˆ Recently, Ye and Weiss (2003) proposed using the bootstrap MFCN and MFMN. We also discuss the determination of dimen- 2 2 procedure to determine q. Next, we follow the basic idea of Ye sionality q and the choice of tuning parameters σW and σT . and Weiss (2003) to develop the bootstrap procedure for choos- 5.1 Algorithms ing q in FC (or FM). First, we introduce a distance measure for two subspaces If the dimensionality of S | (or S [ | ]) is known to be q, Y X E Y X of Rp, then use it to define the variability of an estimated sub- then the first q eigenvectors of M (or M )formanor- FCN FMN space. Let A and B be two p × q matrices of full column rank thogonal basis of SY|X (or SE[Y|X]). From the preceding sec- ˆ ˆ and S(A) and S(B) be the column spaces of A and B.Let tions, it is known that the eigenvectors of MFCN (or MFMN) τ − τ τ − τ √ PA = A(A A) A and PB = B(B B) B be the projection converge to those of M (or M ) at the rate of n. There- − FCN FMN matrices onto S(A) and S(B), where “ ” represents the gener- fore, we can use the first q eigenvectors of Mˆ (or Mˆ ) FCN FMN alized inverse of a matrix. 
We define√ the trace correlation r be- to generate a linear subspace and use it as an estimate of S | Y X tween S(A) and S(B) to be r = 1 tr(P P ). It can be verified S Sˆ Sˆ q A B (or E[Y|X]), which is denoted by Y|X (or E[Y|X]). In practice, that 0 ≤ r ≤ 1, with r equal to 0 if S(A) and S(B) are perpen- X does not necessarily have mean 0 and identity covariance ma- dicular to each other and equal to 1 if S(A) and S(B) are iden- trix, so the data first need to be standardized. Standardization tical. The larger r is, the closer S(A) is to S(B). Hence we use generally does not affect the convergence rate of the estimates, d = 1 − r as a metric of the distance between S(A) and S(B). but may change their asymptotic variances. The procedure for ˆ ˆ (See Ye and Weiss 2003 for more discussion.) deriving SY|X (or SE[Y|X]) is summarized as follows: ˆ Given {(xi, yi)}1≤i≤n,letSq be the estimate of S for a fixed q, 2 2 2 S S S Sˆ 0. Specify parameters q, σW, and σT (σT is not needed when where represents Y|X or E[Y|X]. The variability of q can estimating SE[Y|X]). be evaluated by the following bootstrap procedure: Zhu and Zeng: Fourier Subspace Estimation in Regression 1645
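The distance d = 1 − r just defined is straightforward to compute from any two basis matrices, and it is the quantity averaged in the bootstrap procedure that follows. A small helper of our own (the function name and the use of a pseudoinverse for the generalized inverse are our choices):

```python
import numpy as np

def subspace_distance(A, B):
    """d = 1 - sqrt(tr(P_A P_B) / q) for p x q full-column-rank matrices A and B."""
    q = A.shape[1]
    PA = A @ np.linalg.pinv(A.T @ A) @ A.T      # projection onto span(A)
    PB = B @ np.linalg.pinv(B.T @ B) @ B.T      # projection onto span(B)
    r = np.sqrt(np.trace(PA @ PB) / q)
    return 1.0 - r
```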

1. Randomly resample from {(xi, yi)}1≤i≤n with replacement to patterns with high frequencies, which may not be as impor- to generate N bootstrap samples each of size n, and denote tant as the patterns with low frequencies and are sensitive to { ( j) ( j) } ≤ ≤ 2 the jth sample by (xi , yi ) 1≤i≤n for 1 j N. noise. Therefore, large σW makes FC unstable, especially when 2 2. Based on each bootstrap sample [e.g., the jth sample the sample size is moderate. On the other hand, if σW is too { ( j) ( j) } S (xi , yi ) 1≤i≤n], derive the estimate of and denote small (e.g., close to 0), then the weight assigned to φ(ω, t) is ˆ( j) almost 0 except for ω in a small neighborhood of the origin. it by Sq . ˆ( j) ˆ By Proposition 8, FC is close to SIR when σ 2 is small and 3. Calculate the distance between Sq and Sq and denote it W ( j) may miss some symmetric directions. Hence we need to use a by dq . 2  value of σW that is neither too large nor too small. Based on 4. Calculate d¯(q) = 1 N d( j), which is the average dis- N j=1 q our empirical study, we have found that σW = 1/3 (or, equiv- Sˆ( j) Sˆ ≤ ≤ ¯ 2 = tance between q and q for 1 j N.Weused(q) as alently, σW .1) generally works well for standardized data. ˆ 2 = a measure of the variability of Sq. Similarly, for FM, we also recommend using σW .1 in calcu- ˆ p lating M . Repeating this procedure for q = 1,...,p results in {d¯(q)} , FMN q=1 The interpretation of σ 2 is slightly different from that of σ 2 . which are the variabilities of Sˆ for 1 ≤ q ≤ p. T W q Theoretically, we use k(t) to pool the central mean subspaces Suppose that the true dimensionality of S is equal to q0. S [ | ] together to recover the entire S | . When σ 2 = 0, When q < q , Sˆ estimates a q-dimensional proper subspace E T(Y,t) X Y X T 0 q S(M ) degenerates to the null space. When σ 2 is too large, of S. Because there are infinitely many such subspaces, Sˆ FCN T q a relatively large amount of weight is assigned to the central is expected to demonstrate large variability, and the smaller q mean subspaces S [ | ] with large t, which corresponds is, the larger the variability [or d¯(q)] is. When q is slightly E T(Y,t) X to features of the response with high frequencies. This would larger than q , Sˆ estimates S ⊕ S˜, where S˜ is a (q − q )- 0 q 0 make FC unstable and sensitive to noise. Hence we need to use dimensional space orthogonal to S. Because S˜ can be arbitrary, a value of σ 2 that is neither too large nor too small. Our exten- Sˆ is also expected to show large variability, that is, large d¯(q). T q sive empirical study suggests that σ 2 = 1 is a good choice for Sˆ T When q is growing larger and closer to p, q estimates almost standardized response. Rp Sˆ the whole space , so the variability of q starts to decrease The foregoing recommendations for σ 2 and σ 2 are based on ˆ ˆ( j) W T and eventually becomes 0. When q = q0, Sq and Sq estimate heuristics and empirical study. A more formal approach is to ˆ the same fixed space S, and hence the variability of Sq is ex- use the bootstrap procedure introduced in the previous section. ¯ 2 pected to be small. In summary, d(q) demonstrates the follow- We use the choice of σW as an illustration. First, we choose m 2 2 ing overall trend. It decreases for 1 ≤ q ≤ q0, then increases candidate values σ ,...,σ that are equally spaced in a given ¯ 1 m for q ≤ q ≤ q∗, where q∗ is a maximizer of d(q), and then de- 2 = ¯ 2 0 interval. For any σi , i 1,...,m, calculate d(σi ) using a boot- creases to 0 for q∗ ≤ q ≤ p. 
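A compact version of this bootstrap, again our own sketch built on the hypothetical fc_directions and subspace_distance helpers from the earlier sketches, computes d̄(q) for every candidate dimension; the chosen dimension is then read off the plot of d̄(q) against q as described next:

```python
import numpy as np

def dimension_variability(X, Y, n_boot=100, sw2=0.1, st2=1.0, seed=0):
    """d_bar(q) for q = 1, ..., p via the bootstrap procedure of Section 5.2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    d_bar = np.zeros(p)
    for q in range(1, p + 1):
        B_full = fc_directions(X, Y, q, sw2, st2)       # estimate from the full data
        dists = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)            # resample with replacement
            B_boot = fc_directions(X[idx], Y[idx], q, sw2, st2)
            dists.append(subspace_distance(B_boot, B_full))
        d_bar[q - 1] = np.mean(dists)
    return d_bar              # plot against q to form the dimension variability plot
```

The same loop, with q held fixed and σ²_W (or σ²_T) varied over a grid, yields the d̄(σ²_W) curve used for tuning-parameter selection in Section 5.3.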
We call q0 the valley and q∗ the strap procedure similar to that described in the previous section. peak of this trend. In real data analysis, it is possible to have lo- Note that the dimensionality q is assumed to be fixed; instead, cal fluctuations that are not consistent with the overall trend. To σ 2 is varied here. The optimal σ 2 is chosen to be the σ 2 that ¯ W W i choose q0,weplotd(q) against q and look for the overall trend minimizes d¯(σ 2). The optimal σ 2 can be obtained using a sim- in the plot, ignoring possible local deviations; then the valley is i T ¯ ilar bootstrap procedure. chosen to be q0. The plot of d(q) versus q is called the dimen- In practice, some applications may require optimally choos- sion variability plot. Examples on how to use the dimension 2 2 ing q, σW, and σT . We recommend the following procedure to variability plot to choose q0 are given in Section 6. 2 = use the bootstrap repeatedly. First, q is determined with σW .1 2 2 and σ 2 = 1.0 and denoted by q ; second, σ 2 is chosen with 5.3 Choice of σW and σT T 1 W σ 2 = 1.0 and q and denoted by σ 2 ;third,σ 2 is selected with The tuning parameters σ and σ may be considered the T 1 W1 T W T σ 2 and q and denoted by σ 2 . The steps are iterated until bandwidths of the weight functions K(ω) and k(t). However, W1 1 T1 the parameters are stabilized. Our experience suggests that one the selection of σW and σT is fundamentally different from the selection of bandwidths for kernels used in nonparametric func- iteration is usually sufficient. The second and third steps can be combined if a two-dimensional grid is used for selecting tion estimation. In the latter case, the bandwidth of a kernel 2 2 needs to decrease to 0 to ensure the asymptotic consistency of σW and σT simultaneously. the estimated function as the sample size goes to infinity. In ˆ ˆ 6. EXAMPLES FC and FM, the consistency of MFMN and MFCN hold for any 2 2 fixed positive σW and σT ; see Propositions 3 and 6. Neverthe- In this section we present four examples to demonstrate 2 2 less, given a finite sample, the choice of σW and σT affects the the performance of FC and FM and compare them with variability of the resulted estimates, so they need to be cho- other existing methods. The first three examples are based τ sen carefully. In what follows, we first discuss the heuristics for on synthetic models where X = (X1,...,X10) denotes a ran- 2 2 R10 choosing σW and σT and give their recommended values, then dom vector in , ε denotes a random error, Xi’s and ε = briefly introduce a bootstrap procedure for their optimal selec- are independent and identically distributed as N(0, 1), β1 τ = τ tion again following the idea of Ye and Weiss (2003). (1, 1, 1, 1, 0, 0, 0, 0, 0, 0) and β2 (0, 0, 0, 0, 0, 0, 1, 1, 1, 1) . 2 2 We first discuss σW. When σW is too large, φ(ω, t) with In the last example, we apply FC to a real dataset called 1985   2 large ω will receive much larger weight than when σW is Automobile Data to study how the price of car depends on its small. As explained earlier, φ(ω, t) with large ω corresponds features. 1646 Journal of the American Statistical Association, December 2006

= τ 2 + τ + Example 1. Consider the model Y (β1X) /(3 (β2X and 2)2) + .2ε. In this model the central subspace and the cen- βˆ τ = (.0381,.2085, −.0590,.0025, −.1442, tral mean subspace are identical, that is, SY|X = SE[Y|X] = 2 S S = S (β1, β2).Let (β1, β2); thus, no matter whether a −.0137,.4853,.5143,.5186,.3658). method is targeting the central mean subspace or the central Sˆ = S ˆ ˆ S subspace, it can be used to estimate S. The dimension of S, The distance between Y|X (β1, β2) and the true Y|X = 2 = is .04388. q 2, is assumed to be known. We compare FC (σW .1, σ 2 = 1.0) and FM (σ 2 = .1) with other five existing methods To compare FC with SIR (five slices) and SAVE (five slices), T W = including SIR (five slices), SAVE (five slices), y-pHd, r-pHd, we draw 500 samples of size n 500 from the model, apply and IHT as follows. We randomly generate 500 samples of size the methods to the samples to obtain the estimated central sub- n = 500 from the model. For each sample, we apply the seven spaces, and calculate the distances between the estimated cen- S methods listed earlier one by one to obtain the estimates of S. tral subspaces and Y|X. The distances are summarized by the Then we calculate the distances between these estimates and S. side-by-side boxplots included in Figure 2(b). The boxplots in- For each method, we generate a boxplot for the 500 distances. dicate that FC outperforms both SIR and SAVE in this example. This procedure results in seven boxplots that are displayed side Example 3. Consider the following model with discrete re- by side in Figure 1. From Figure 1, we conclude that both FC sponse: Y = I[ τ + ] + 2I[ τ + ], where I[·] denotes the β1 X σε>1 β2 X σε>0 and FM outperform the other methods. IHT has similar perfor- indicator function and σ = .2. So the possible values for Y mance as FC and FM, but demonstrates slightly larger variabil- are 0, 1, 2, and 3. In this example we consider only the cen- ity. A possible explanation for IHT’s good performance is that tral subspace, which is spanned by β1 and β2. It is clear that it is carefully designed to capture the monotone and U-shaped Y is not a continuous function of X. In the derivation of FC, trends in the function that links E[Y|X] and X. a few differentiability conditions are required. But in the final = τ + formula for MFCN, no differentiation is involved. So we expect Example 2. Consider the heteroscedastic model Y (β1X) τ FC would still work in this example. In fact, the involved dif- 4(β2X)ε. For this model, the central mean subspace and the S = S ferentiability in deriving FC is required only in a generalized central subspace are different, because E[Y|X] (β1) and 2 = S = S S sense. We draw a sample of 500 points and apply FC (σW .1 Y|X (β1, β2). Clearly, E[Y|X] is only a proper subspace 2 2 and σ = 3.0) to estimate SY|X.Hereσ is chosen to be larger of SY|X, so we focus our attention on SY|X and the meth- T T than the usual recommended value, because the discontinuity in ods aimed at estimating S | . We draw a sample of 500 data Y X the model represents a feature with high frequency and it could points and apply FC (σ 2 = .1 and σ 2 = 1.0) to estimate S | . W T Y X not be well captured by the transformed response exp{ıty} with Only the first two eigenvalues of the estimated candidate ma- ˆ ˆ small t. The first two eigenvalues of MFCN are relatively large, trix MFCN are relatively large. We use the bootstrap procedure indicating that the dimension of S | is 2. 
This is further con- to generate the dimension variability plot based on 500 boot- Y X firmed by the dimension variability plot included in Figure 3(a). strap samples, which is shown in Figure 2(a). Using the rule Therefore, the estimated central subspace is spanned by the first described in Section 5.2, it confirms that the dimension of SY|X ˆ two eigenvectors of MFCN, which are is equal to 2. Therefore, SY|X is estimated by the space spanned ˆ ˆ τ = − the first two eigenvectors of MFCN, which are β1 (.0603,.0927,.0567,.0521, .0432, ˆ τ = − −.0106,.4964,.3776,.4923,.4571) β1 (.5012,.4414,.4869,.4015, .0205, .2531, −.1111,.0366, −.0709,.0266) and ˆ τ = β2 (.5274,.5209,.5306,.4528,.0580, .1041, −.0735, −.0771, −.0729, −.0819). S ˆ ˆ S The distance between (β1, β2) and the true subspace Y|X is d = .007809. We use the same procedure as in Example 2 to compare 2 = 2 = FC (σW .1 and σT 3.0) with SIR (four slices) and SAVE (four slices). The three boxplots corresponding to FC, SIR, and SAVE are shown in Figure 3(b). In this example, the perfor- mances of the three methods are comparable, with SIR slightly better than the other two. One explanation for the better per- formance of SIR is that SIR can be considered an extension of linear discriminant analysis. Example 4. In this example we use FC to analyze a real dataset known as 1985 Automobile Data. The objective is to Figure 1. Side-by-Side Boxplots for the Performance Comparison Between FC, FM, SIR, SAVE, IHT, y-pHd, and r-pHd in Example 1. The study how the price of car depends on its features. The dataset y-axis represents the distance between an estimated subspace and the is available at the UCI Machine Learning Repository ( ftp:// ftp. true subspace. Each boxplot is based on 500 samples. ics.uci.edu/pub/machine-learning-databases/autos). Originally, Zhu and Zeng: Fourier Subspace Estimation in Regression 1647
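As a quick end-to-end check in the spirit of Example 1, the following simulation sketch (ours, not the article's code; it reuses the hypothetical fc_directions and subspace_distance helpers and the settings n = 500, σ²_W = .1, σ²_T = 1 from the text) draws data from the Example 1 model and measures how close the FC estimate comes to span(β1, β2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 10
b1 = np.array([1.0] * 4 + [0.0] * 6)                  # beta_1
b2 = np.array([0.0] * 6 + [1.0] * 4)                  # beta_2

X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
Y = (X @ b1) ** 2 / (3 + (X @ b2 + 2) ** 2) + 0.2 * eps

B_hat = fc_directions(X, Y, q=2, sw2=0.1, st2=1.0)
B_true = np.column_stack([b1, b2])
print(subspace_distance(B_hat, B_true))               # small values = accurate recovery
```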


Figure 2. Dimension Variability Plot of d(q)¯ versus q Based on 500 Bootstrap Samples in Example 2 (a) and Side-by-Side Boxplots for the Performance Comparison Between FC, SIR, and SAVE in Example 2 (b). The y-axis represents the distance between an estimated subspace and the true subspace. Each boxplot is based on 500 samples. there were 205 instances (or cases) and 26 attributes (or vari- better choices. Using FC with the tuned parameters, we cal- ˆ ables) in the dataset, and there were some missing values. Be- culate MFCN and derive its spectral decomposition. The first ˆ ˆ ˆ cause most current dimension-reduction methods, including two eigenvalues of MFCN, λ1 = .1071 and λ2 = .0381, are FC, can only handle continuous variables, we remove eight dominant. The bootstrap procedure is run with 1,000 boot- categorical variables from the dataset. We remove one continu- strap samples, and the dimension variability plot is generated ous variable that contains many missing values. For simplicity, and shown in Figure 5(a). The plot suggests that q = 2, so we ˆ we also discard the instances with missing values. The result- use the first two eigenvectors of MFCN to derive the estimate Sˆ = S ˆ ˆ ing dataset contains 195 instances and 14 variables: Wheelbase Y|X (β1, β2), where (x1), Length (x2), Width (x3), Height (x4), Curb weight (x5), En- ˆ τ = − − β1 (.05, .19,.08,.09,.75, .24, gine size (x6), Bore (x7), Stroke (x8), Compression ratio (x9), Horsepower (x10), Peak rpm (x11), City mpg (x12), Highway 0, −.11,.17,.51,.04, −.08,.12) mpg (x ), and Price (y). We use the logarithm of Price [log(y)] 13 and as the response and x1 to x13 as the predictors. Before we apply ˆ τ = − FC to the data, we standardize each predictor using its mean β2 (.08, .38,.09,.08,.03,.70, and standard deviation. − − − − − 2 = 2 = .17, .20, .06, .08,.06,.49, .17). We first use FC with σW .1 and σT 1.0 and the boot- ˆ τ strap procedure to choose the dimension of the central sub- Figure 4 includes the projection plots of log (y) versus β1x (a) ˆ τ space. The results show that the dimension should be two. To and log (y) versus β2x (b), with 4(a) displaying a strong linear obtain sharper views, we fix q = 2 and use the bootstrap pro- relationship and 4(b) displaying a parabolic relationship. ˆ cedure mentioned at the end of Section 5 to tune the parame- Based on the relative magnitudes of the components, β1 is 2 2 2 = 2 = ters σW and σT . We have found that σW .08 and σT .6are determined mainly by Curb weight (x5) and Horsepower (x10),


Figure 3. Dimension Variability Plot Based on 500 Bootstrap Samples in Example 3 (a) and Side-by-Side Boxplots for the Performance Compar- ison Between FC, SIR, and SAVE in Example 3 (b). The y-axis represents the distance between an estimated subspace and the true subspace. Each boxplot is based on 500 samples. 1648 Journal of the American Statistical Association, December 2006


Figure 4. Plot of log(y) versus $\hat{\beta}_1^{\tau} x$ (a) and Plot of log(y) versus $\hat{\beta}_2^{\tau} x$ (b) in Example 4.

ˆ ˆ τ ˆ τ followed by the other variables, and β2 is determined mainly plots β1x versus β2x, which exhibits a parabolic relationship by Engine size (x6) and City mpg (x12). The strong linear re- between these two directions. (See Li 1997 for a detailed dis- ˆ τ lationship between log(y) and β1x indicates that the price of cussion of nonlinear confounding between predictors.) SIR and a car can be well predicted by Curb weight and Horsepower. SAVE are also used to analyze the data. The former gives sim- ˆ τ The parabolic relationship between log(y) and β2x reveals that ilar results as FC, whereas the latter fails to recover interesting the price of a car also depends on Engine size and City mpg, directions in this particular example. but in a slightly more complicated manner. After checking the 7. CONCLUSION makes and styles of the cars reported in the original data, we have found that the points in the upper branch of the parabola Using the Fourier transform, we have derived two candidate represent high-end cars, such as the sedans or the convertibles matrices, MFC and MFM, for recovering the entire central sub- of, say, Mercedes-Benz, BMW, Jaguar, and Porsche, whereas space and central mean subspace. Under further distributional the points in the lower branch represent lower-end cars, such as assumptions, we have derived explicit estimates of the candi- the hatchbacks of Honda, Chevrolet, Plymouth, Subaru, and so date matrices that lead to estimates of the central and central ˆ τ on. For the high-end cars, the price increases as β2x increases, mean subspaces. Selection of the tuning parameters and deter- ˆ τ whereas for the lower-end cars, the price decreases as β2x in- mination of dimensionality have been discussed. Synthetic and creases. The two relationships above further imply that there ex- real examples were used to demonstrate the performance of the ˆ τ ˆ τ ists a nonlinear confounding between β1x and β2x. Figure 5(b) proposed methods in comparison with other methods. Use of


2 = 2 = Figure 5. The Dimension Variability Plot Generated From the Bootstrap Procedure With 1,000 Bootstrap Samples, σW .08 and σT .6 (a) and ˆ τ ˆ τ Plot of β1 x versus β2 x That Shows a Nonlinear Relationship (b). Both plots are for Example 4. Zhu and Zeng: Fourier Subspace Estimation in Regression 1649 the Fourier transform may provide a different approach to di- 3. The proof is standard and has been given by Folland (1992, mension reduction in general regression, which is expected to pp. 217, 243). generate more interesting results in the future. Currently we are 4. The proof is standard and has been given by Folland (1992, focused on two issues: (1) to generalize the results of this arti- pp. 222, 244). cle to the case without distributional assumptions imposed on X Proof of Proposition 2 and (2) to develop dimension reduction techniques for regres- Because ψ(ω) = a(ω) + ıb(ω),wehave sion with multiple responses. Because our approach does not involve the slicing of the response and the partition of data, it ψ(ω)ψ¯ (ω)τ =[a(ω)a(ω)τ + b(ω)b(ω)τ ] may be more appropriate than SIR and SAVE for the latter case. + ı[b(ω)a(ω)τ − a(ω)b(ω)τ ]. APPENDIX: PROOFS  By (9), we know that [b(ω)a(ω)τ − a(ω)b(ω)τ ] dω = 0. Therefore, Proof of Equation (5)  ∗ = −p ¯ τ Because A = (α1, α2,...,αq) and the columns α1,...,αq form a MFM (2π) ψ(ω)ψ(ω) dω basis for SE[Y|X] with q dimensions, to prove (5), it is sufficient to  ∈ Rp τ = τ ∂ = −p τ τ show that for β , β A 0 is equivalent to β ∂x m(x) 0forall = (2π) [a(ω)a(ω) + b(ω)b(ω) ] dω. x ∈ supp(X). ∂ = ∂ ∗ By the chain rule of differentiation, ∂x m(x) A ∂u g(u). Thus Clearly, MFM is a real nonnegative definite matrix. For any p-dimen- βτ A = 0 immediately implies that βτ ∂ m(x) = βτ A ∂ g(u) = 0for τ ∗ = τ = τ = ∂x ∂u sional vector β, β MFMβ 0 is equivalent to β a(ω) β b(ω) 0 all x ∈ supp(X). ∗ S ∗ for all ω, which implies that the column space of MFM, (MFM),is Next, we show that the converse is also true by contradiction. the same as span{a(ω), b(ω) : ω ∈ Rp}. By the first property in Propo- Assume that there exists β ∈ Rp such that βτ ∂ m(x) = 0forall ∗ 0 0 ∂x sition 1, we have S(M ) = SE[Y|X]. ∈ τ = = τ  τ  FM x supp(X),butβ0A 0.Letξ1 A β0/ A β0 . Clearly, ξ 1 is a τ ∂ = τ ∂ = Proof of Proposition 3 q-dimensional nonzero vector. So β0 ∂x m(x) β0A ∂u g(u) 0 τ ∂ = means that ξ1 ∂u g(u) 0, which implies that the directional derivative Because = τ of g as a function of u (u1,...,uq) along ξ1 is always 0. This fur-  + = τ τ ther implies that g(u) is a constant along ξ1,thatis,g(u tξ 1) g(u) MFM = [a(ω)a(ω) + b(ω)b(ω) ]K(ω) dω, ∈ R for t . We can expand ξ 1 by bringing in ξ2,...,ξq to form an q τ orthonormal basis for R .LetD = (ξ ,...,ξ q) and v = D u = p 1 MFM is a nonnegative definite matrix. Because K(ω)>0forω ∈ R , τ ∂ τ ∂ (v ,...,vq) ;theng(u) = g(Dv) and g(Dv) = ξ g(u)= 0. ∈ Rp 1 ∂v1 1 ∂u for β we have Hence g does not depend on v1, so we can rewrite g(u) = g(Dv) = ˜ =˜ τ τ τ τ S τ = ⇐⇒ τ = τ = ∈ Rp g(v2,...,vq) g(ξ 2A x,...,ξqA x), which implies that (Aξ2, β MFMβ 0 β a(ω) β b(ω) 0forallω , ...,Aξq) is also a dimension-reduction subspace for E[Y|X] and that the central mean subspace has dimension at most q − 1, which contra- where ⇐⇒ means “being equivalent to.” Therefore, S(MFM) = { ∈ Rp}=S dicts to dim(SE[Y|X]) = q. Thus the proof is completed. span a(ω), b(ω) : ω E[Y|X]. Proof of Proposition 1 Proof of Proposition 4 1. 
Proof of Proposition 2

Because $\psi(\omega) = a(\omega) + \imath b(\omega)$, we have
\[ \psi(\omega)\bar{\psi}(\omega)^\tau = [a(\omega)a(\omega)^\tau + b(\omega)b(\omega)^\tau] + \imath[b(\omega)a(\omega)^\tau - a(\omega)b(\omega)^\tau]. \]
By (9), we know that $\int [b(\omega)a(\omega)^\tau - a(\omega)b(\omega)^\tau]\, d\omega = 0$. Therefore,
\[ M_{FM}^* = (2\pi)^{-p} \int \psi(\omega)\bar{\psi}(\omega)^\tau\, d\omega = (2\pi)^{-p} \int [a(\omega)a(\omega)^\tau + b(\omega)b(\omega)^\tau]\, d\omega. \]
Clearly, $M_{FM}^*$ is a real nonnegative definite matrix. For any $p$-dimensional vector $\beta$, $\beta^\tau M_{FM}^* \beta = 0$ is equivalent to $\beta^\tau a(\omega) = \beta^\tau b(\omega) = 0$ for all $\omega$, which implies that the column space of $M_{FM}^*$, $\mathcal{S}(M_{FM}^*)$, is the same as $\operatorname{span}\{a(\omega), b(\omega) : \omega \in \mathbb{R}^p\}$. By the first property in Proposition 1, we have $\mathcal{S}(M_{FM}^*) = \mathcal{S}_{E[Y|X]}$.

Proof of Proposition 3

Because
\[ M_{FM} = \int [a(\omega)a(\omega)^\tau + b(\omega)b(\omega)^\tau] K(\omega)\, d\omega, \]
$M_{FM}$ is a nonnegative definite matrix. Because $K(\omega) > 0$ for $\omega \in \mathbb{R}^p$, for $\beta \in \mathbb{R}^p$ we have
\[ \beta^\tau M_{FM} \beta = 0 \iff \beta^\tau a(\omega) = \beta^\tau b(\omega) = 0 \text{ for all } \omega \in \mathbb{R}^p, \]
where $\iff$ means "being equivalent to." Therefore, $\mathcal{S}(M_{FM}) = \operatorname{span}\{a(\omega), b(\omega) : \omega \in \mathbb{R}^p\} = \mathcal{S}_{E[Y|X]}$.

Proof of Proposition 4

Under the model assumption, following a similar argument as in the proof of (5), the central subspace can be written as
\[ \mathcal{S}_{Y|X} = \operatorname{span}\left\{ \frac{\partial}{\partial x} f_{Y|X}(y|x) : (x, y) \in \operatorname{supp}(X, Y) \right\}, \]
where $f_{Y|X}$ is the conditional density of $Y$ given $X$. To prove this proposition, it is sufficient to show that for $\beta \in \mathbb{R}^p$ and $x \in \operatorname{supp}(X)$,
\[ \beta^\tau \frac{\partial}{\partial x} m(x, t) = 0 \text{ for all } t \in \mathbb{R} \iff \beta^\tau \frac{\partial}{\partial x} f_{Y|X}(y|x) = 0 \text{ for all } y \in \operatorname{supp}(Y). \]
This is an immediate consequence of the fact that $\frac{\partial}{\partial x} m(x, t)$ is the Fourier transform of $\frac{\partial}{\partial x} f_{Y|X}(y|x)$, that is,
\[ \frac{\partial}{\partial x} m(x, t) = \frac{\partial}{\partial x} \int \exp\{\imath t y\} f_{Y|X}(y|x)\, dy = \int \exp\{\imath t y\} \frac{\partial}{\partial x} f_{Y|X}(y|x)\, dy. \]
Thus the proposition is proved.

Proof of Proposition 5

The proof is a straightforward modification of that of Proposition 1, with $Y$ replaced by $\exp\{\imath t Y\}$ and $m(x)$ replaced by $m(x, t)$. The details are thus omitted.
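Propositions 2 and 3 represent $M_{FM}$ as an integral of $a(\omega)a(\omega)^\tau + b(\omega)b(\omega)^\tau$ against a positive weight $K(\omega)$. When $K$ is taken to be a density that can be sampled from, this integral can be approximated by Monte Carlo. The sketch below illustrates the idea; the Gaussian choice of $K$, the plug-in estimate of $\psi(\omega)$ under $G(x) = -x$, and the final eigendecomposition step are assumptions made for illustration, not the explicit estimator derived in the article.

```python
import numpy as np

def m_fm_hat(X, y, sigma_w=0.3, n_freq=200, seed=0):
    """Monte Carlo approximation of M_FM = int [a(w)a(w)' + b(w)b(w)'] K(w) dw,
    with K taken as a N(0, sigma_w^2 I) density sampled at n_freq frequencies
    and psi(w) estimated by its sample analog under G(x) = -x."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    M = np.zeros((p, p))
    for _ in range(n_freq):
        omega = rng.normal(scale=sigma_w, size=p)          # draw omega ~ K
        phase = np.exp(1j * (X @ omega))                   # exp{i omega' x_k}
        psi = -((1j * omega[None, :] - X) * (y * phase)[:, None]).mean(axis=0)
        a, b = psi.real, psi.imag                          # a(omega), b(omega)
        M += np.outer(a, a) + np.outer(b, b)
    M /= n_freq
    eigvals, eigvecs = np.linalg.eigh(M)                   # ascending order
    return M, eigvecs[:, ::-1], eigvals[::-1]              # leading directions first
```

The leading eigenvectors of the returned matrix would then serve as an estimated basis of the central mean subspace under these assumptions.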

Proof of Proposition 6

Because
\[ M_{FC} = \int\!\!\int [a(\omega, t)a(\omega, t)^\tau + b(\omega, t)b(\omega, t)^\tau] K(\omega) k(t)\, d\omega\, dt, \]
$M_{FC}$ is a nonnegative definite matrix. Because $K(\omega) > 0$ for $\omega \in \mathbb{R}^p$ and $k(t) > 0$ for $t \in \mathbb{R}$, we have, for any $\beta \in \mathbb{R}^p$,
\[ \beta^\tau M_{FC} \beta = 0 \iff \beta^\tau a(\omega, t) = \beta^\tau b(\omega, t) = 0 \text{ for all } \omega \in \mathbb{R}^p \text{ and all } t \in \mathbb{R}. \]
Therefore, $\mathcal{S}(M_{FC}) = \operatorname{span}\{a(\omega, t), b(\omega, t) : \omega \in \mathbb{R}^p, t \in \mathbb{R}\}$. Using the first property of Proposition 5, we have $\mathcal{S}(M_{FC}) = \mathcal{S}_{Y|X}$, and the proposition is proved.

Proof of Proposition 7

Applying integration by parts and the condition that $f_{(X,Y)}(x, y)$ goes to 0 as $\|x\|$ goes to infinity with $y$ fixed, we have
\[ \eta(y, \omega) = -\int (\imath\omega + G(x)) \exp\{\imath\omega^\tau x\} f_{X|Y}(x|y)\, dx = -\int \imath\omega \exp\{\imath\omega^\tau x\} f_{X|Y}(x|y)\, dx - f_Y^{-1}(y) \int \exp\{\imath\omega^\tau x\} \left[\frac{\partial}{\partial x} f_X(x)\right] f_{Y|X}(y|x)\, dx = f_Y^{-1}(y) \int \exp\{\imath\omega^\tau x\} f_X(x) \left[\frac{\partial}{\partial x} f_{Y|X}(y|x)\right] dx. \]
Because $\frac{\partial}{\partial x} f_{Y|X}(y|x) \in \mathcal{S}_{Y|X}$, we have that $\eta(y, \omega) \in \mathcal{S}_{Y|X}$. The second part of the proposition can be easily verified.

Proof of Proposition 8

When $X$ follows the normal distribution with mean 0 and covariance matrix $I_p$, $G(X) = -X$. Because $K(\omega)$ is a point mass at $\omega = 0$,
\[ M_{FC} = \int [a(0, t)a(0, t)^\tau + b(0, t)b(0, t)^\tau] k(t)\, dt, \]
where $a(0, t)$ and $b(0, t)$ are the real and imaginary parts of $\phi(0, t)$. So $\mathcal{S}(M_{FC}) = \operatorname{span}\{a(0, t), b(0, t) : t \in \mathbb{R}\}$. By (16), $\phi(0, t)$ is the Fourier transform of $E[X|Y = y] f_Y(y)$, that is,
\[ \phi(0, t) = \int\!\!\int \exp\{\imath t y\}\, x\, f_{(X,Y)}(x, y)\, dx\, dy = \int \exp\{\imath t y\}\, E[X|Y = y]\, f_Y(y)\, dy. \]
Then for any $\beta \in \mathbb{R}^p$, $\beta^\tau E[X|Y = y] f_Y(y) = 0$ for $y \in \mathbb{R}$ is equivalent to $\beta^\tau a(0, t) = \beta^\tau b(0, t) = 0$ for $t \in \mathbb{R}$. Hence $\mathcal{S}(M_{FC}) = \operatorname{span}\{E[X|Y = y] : y \in \operatorname{supp}(Y)\}$. The latter is exactly the space aimed at by SIR, which is $\mathcal{S}(M_{SIR})$. Thus, when $\sigma_W^2 = 0$, $\mathcal{S}(M_{FC}) = \mathcal{S}(M_{SIR})$, and the proposition is proved.

Proof of Theorem 1

Because $\hat{M}_{FCN}$ is a V-statistic, it can be written as
\[ \hat{M}_{FCN} = \frac{n-1}{2n} \binom{n}{2}^{-1} \sum_{i<j} \left[ J_{FCN}((x_i, y_i), (x_j, y_j)) + J_{FCN}((x_i, y_i), (x_j, y_j))^\tau \right] + O_p(n^{-1}). \]
Applying the Hoeffding decomposition of the U-statistic, we can further write $\hat{M}_{FCN}$ as
\[ \hat{M}_{FCN} = \frac{n-1}{2n} \left[ 2M_{FCN} + \frac{2}{n} \sum_{i=1}^{n} \left( J_{FCN}^{(1)}(x_i, y_i) - 2M_{FCN} \right) + o_p(n^{-1/2}) \right] + O_p(n^{-1}) = M_{FCN} + \frac{1}{n} \sum_{i=1}^{n} \left( J_{FCN}^{(1)}(x_i, y_i) - 2M_{FCN} \right) + o_p(n^{-1/2}). \]
The second term in the foregoing expression is the average of $n$ independent and identically distributed random matrices $\big( J_{FCN}^{(1)}(x_i, y_i) - 2M_{FCN} \big)$ with $1 \le i \le n$. Because $\Sigma_{FCN}$ exists, which is guaranteed by the existence of the covariance matrix of $\operatorname{vec}\!\big( J_{FCN}((U_1, V_1), (U_2, V_2)) \big)$, by the central limit theorem, as $n$ goes to infinity,
\[ \sqrt{n}\, \big( \operatorname{vec}(\hat{M}_{FCN}) - \operatorname{vec}(M_{FCN}) \big) \stackrel{L}{\longrightarrow} N(0, \Sigma_{FCN}), \]
where $\Sigma_{FCN} = \operatorname{cov}\!\big[ \operatorname{vec}\!\big( J_{FCN}^{(1)}(X, Y) \big) \big]$ is a $p^2 \times p^2$ matrix.
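The V-statistic form of $\hat{M}_{FCN}$ used in the proof of Theorem 1 averages a symmetrized matrix-valued kernel over all pairs of observations. The generic sketch below illustrates this structure; the kernel `J` is a placeholder argument, since the specific kernel $J_{FCN}$ is defined earlier in the article and is not reproduced here.

```python
import numpy as np

def v_statistic_matrix(Z, J):
    """Matrix-valued V-statistic: average of the symmetrized kernel
    J(z_i, z_j) + J(z_i, z_j)' over all ordered pairs of observations.
    Z is a sequence of observations (e.g., tuples (x_i, y_i)), and
    J(z_i, z_j) returns a p x p matrix."""
    n = len(Z)
    p = J(Z[0], Z[0]).shape[0]
    M = np.zeros((p, p))
    for i in range(n):
        for j in range(n):
            Jij = J(Z[i], Z[j])
            M += Jij + Jij.T
    return M / (2 * n ** 2)
```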
Proof of Proposition 9

Because $\phi(\omega, t) = E[\eta(Y, \omega) \exp\{\imath t Y\}]$, it is sufficient to prove that $\eta(Y, \omega) \in \mathcal{S}_{Y|X}$ under the weak normality condition. Applying integration by parts, we have
\[ \eta(y, \omega) = \int \exp\{\imath\omega^\tau x\} \frac{\partial}{\partial x} f_{X|Y}(x|y)\, dx + \int x \exp\{\imath\omega^\tau x\} f_{X|Y}(x|y)\, dx = f_Y^{-1}(y) \int \exp\{\imath\omega^\tau x\} \left[ \frac{\partial}{\partial x} f_X(x) + x f_X(x) \right] f_{Y|X}(y|x)\, dx + f_Y^{-1}(y) \int \exp\{\imath\omega^\tau x\} \left[ \frac{\partial}{\partial x} f_{Y|X}(y|x) \right] f_X(x)\, dx. \]
The second term belongs to $\mathcal{S}_{Y|X}$, so we need only focus on the first term and denote it by $I_1$. Let $u_1 = B^\tau x$, $u_2 = \tilde{B}^\tau x$, and $u = (u_1^\tau, u_2^\tau)^\tau = (B, \tilde{B})^\tau x = M^\tau x$. Then
\[ I_1 = f_Y^{-1}(y) \int \exp\{\imath\tilde{\omega}^\tau u\}\, M \left[ \frac{\partial}{\partial u} \tilde{p}(u) + u \tilde{p}(u) \right] p(y|u_1)\, du = f_Y^{-1}(y)\, B \int \exp\{\imath\tilde{\omega}^\tau u\} \left[ \frac{\partial}{\partial u_1} \tilde{p}(u) + u_1 \tilde{p}(u) \right] p(y|u_1)\, du + f_Y^{-1}(y)\, \tilde{B} \int\!\!\int \exp\{\imath\tilde{\omega}^\tau u\} \left[ \frac{\partial}{\partial u_2} \tilde{p}(u) + u_2 \tilde{p}(u) \right] p(y|u_1)\, du_1\, du_2, \]
where $\tilde{\omega} = M^\tau \omega$ and $\tilde{p}(u) = f_X(Mu)$ is the density function of $u$. Note that the first term falls in $\mathcal{S}_{Y|X}$. Under the weak normality condition, we have $\frac{\partial}{\partial u_2} \tilde{p}(u_2|u_1) + u_2 \tilde{p}(u_2|u_1) = 0$, so the second term in the foregoing expression is 0. Therefore $I_1 \in \mathcal{S}_{Y|X}$. This proves the proposition.

Proof of Theorem 2

The proof is similar to that of Theorem 1 and thus is omitted.

[Received July 2004. Revised November 2005.]

REFERENCES

Cook, R. D., and Li, B. (2002), "Dimension Reduction for Conditional Mean in Regression," The Annals of Statistics, 30, 455–474.
Cook, R. D., and Nachtsheim, C. J. (1994), "Re-Weighting to Achieve Elliptically Contoured Covariates in Regression," Journal of the American Statistical Association, 89, 592–600.
Cook, R. D., and Ni, L. (2005), "Sufficient Dimension Reduction via Inverse Regression: A Minimum Discrepancy Approach," Journal of the American Statistical Association, 100, 410–428.
Cook, R. D., and Weisberg, S. (1991), Discussion of "Sliced Inverse Regression for Dimension Reduction," by K. C. Li, Journal of the American Statistical Association, 86, 328–332.
Cook, R. D., and Yin, X. (2001), "Dimension Reduction and Visualization in Discriminant Analysis" (with discussion), Australian and New Zealand Journal of Statistics, 43, 147–199.
Folland, G. B. (1992), Fourier Analysis and Its Applications, Belmont: Brooks/Cole.
Friedman, J. H., and Stuetzle, W. (1981), "Projection Pursuit Regression," Journal of the American Statistical Association, 76, 817–823.
Härdle, W., and Stoker, T. M. (1989), "Investigating Smooth Multiple Regression by the Method of Average Derivatives," Journal of the American Statistical Association, 84, 986–995.
Hotelling, H. (1957), "The Relations of the Newer Multivariate Statistical Methods to Factor Analysis," British Journal of Statistical Psychology, 10, 69–79.
Hristache, M., Juditsky, A., Polzehl, J., and Spokoiny, V. (2001), "Structure Adaptive Approach for Dimension Reduction," The Annals of Statistics, 29, 1537–1566.
Kendall, M. G. (1957), A Course in Multivariate Analysis, London: Charles Griffin.
Lee, A. J. (1990), U-Statistics: Theory and Practice, New York: Marcel Dekker.
Li, B., Zha, H., and Chiaromonte, F. (2005), "Contour Regression: A General Approach to Dimension Reduction," The Annals of Statistics, 33, 1580–1616.
Li, K. C. (1991), "Sliced Inverse Regression for Dimension Reduction" (with discussion), Journal of the American Statistical Association, 86, 316–342.
Li, K. C. (1992), "On Principal Hessian Directions for Data Visualization and Dimension Reduction: Another Application of Stein's Lemma," Journal of the American Statistical Association, 87, 1025–1039.
Li, K. C. (1997), "Nonlinear Confounding in High-Dimensional Regression," The Annals of Statistics, 25, 577–612.
Xia, Y., Tong, H., Li, W. K., and Zhu, L.-X. (2002), "An Adaptive Estimation of Dimension Reduction Space" (with discussion), Journal of the Royal Statistical Society, Ser. B, 64, 363–410.
Ye, Z., and Weiss, R. E. (2003), "Using the Bootstrap to Select One of a New Class of Dimension Reduction Methods," Journal of the American Statistical Association, 98, 968–979.
Yin, X., and Cook, R. D. (2002), "Dimension Reduction for the Conditional kth Moment in Regression," Journal of the Royal Statistical Society, Ser. B, 64, 159–175.