Kernel Mean Embedding of Probability Measures and its Applications to Functional Data Analysis
Saeed Hayati [email protected]
Kenji Fukumizu [email protected]
Afshin Parvardeh [email protected]

November 5, 2020
Abstract

This study introduces kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces induced by functional response statistical models. The embedded function represents the concentration of probability measures in small open neighborhoods, which identifies a pseudo-likelihood and fosters a rich framework for statistical inference. Utilizing Maximum Mean Discrepancy, we devise new tests in functional response models. The performance of the newly derived tests is evaluated against competitors in three major problems in functional data analysis: function-on-scalar regression, functional one-way ANOVA, and equality of covariance operators.
1 Introduction
Functional response models are among the major problems in the context of Functional Data Analysis. A fundamental issue in dealing with functional response statistical models arises from the lack of practical frameworks for characterizing probability measures on function spaces. This is mainly a consequence of the tremendous gap between how we represent probability measures in finite-dimensional and in infinite-dimensional spaces. A useful property of finite-dimensional spaces is the existence of a locally finite, strictly positive, and translation-invariant measure, such as the Lebesgue or counting measure, which makes it easy to use probability measures directly in statistical inference. Fitting a statistical model, estimating parameters, testing hypotheses, deriving confidence regions, and developing goodness-of-fit indices can all be carried out by integrating the distribution or conditional distribution of the response variables, as a presumption, into the statistical procedures.
Sporadic efforts have gone into approximating or representing probability measures on infinite-dimensional spaces. Let $H$ be a separable infinite-dimensional Hilbert space and $X$ be an $H$-valued random element with finite second moment and covariance operator $C$. Delaigle and Hall [5] approximated the probability of $B_r(x) = \{\|X - x\| < r\}$ by the surrogate density of a finite-dimensional approximation of $X$, obtained by projecting the random element $X$ onto the space spanned by the first few eigenfunctions of $C$ with the largest eigenvalues. The approximated small-ball probability rests on the Karhunen-Loève expansion and on the extra assumption that the component scores are independent. The precision of this approximation depends on the volume of the ball and on the probability measure itself. Let $I$ be a compact subset of $\mathbb{R}$, such as the closed interval $[0,1]$, and let $X$ be a zero-mean $L^2[I]$-valued random element with finite second moment and Karhunen-Loève expansion $X = \sum_{j\ge1} \lambda_j^{1/2} X_j \psi_j$, in which $X_j = \lambda_j^{-1/2}\langle X, \psi_j\rangle$ and $\{\lambda_j, \psi_j\}_{j\ge1}$ is the eigensystem of the covariance operator $C$. Suppose that the distribution of $X_j$ is absolutely continuous with respect to the Lebesgue measure with density $f_j$. The approximation of the logarithm of $p(x \mid r) = P(B_r(x)) = P(\{\|X - x\| < r\})$ given by Delaigle and Hall [5] is
$$\log p(x \mid r) = C_1\big(h, \{\lambda_j\}_{j\ge1}\big) + \sum_{j=1}^{h} \log f_j(x_j) + o(h),$$
in which $x_j = \langle x, \psi_j\rangle$, and $h$ is the number of components, which depends on $r$ and tends to infinity as $r$ declines to zero. $C_1(\cdot)$ depends only on the size of the ball and the sequence of eigenvalues, though the quantity $o(h)$, the precision of the approximation, depends on $P$. The quantity $h^{-1}\sum_{j=1}^{h} \log f_j(x_j)$ is called the log-density by Delaigle and Hall [5]. A serious concern with this approximation is its precision, which depends on the probability measure itself. Accordingly, it cannot be employed to compare small-ball probabilities within a family of probability measures. For example, in the case of estimating the parameters of a functional response regression model, the induced probability measure varies with different choices of parameters, so this approximation cannot be employed for parameter estimation or for comparing the goodness of fit of different regression models. Another work on representing probability measures on a general separable Hilbert space $H$ was presented by Lin et al. [17]. They constructed a dense subspace of $H$ called the Mixture Inner Product Space (MIPS), which is the union of a countable collection of finite-dimensional subspaces of $H$. An approximating version of the given $H$-valued random element lies in this subspace and, in consequence, lies in a finite-dimensional subspace of $H$ according to a given discrete distribution. They defined a base measure on the MIPS, which is not translation-invariant, and introduced density functions for MIPS-valued random elements. The absence of a proper method for representing probability measures over infinite-dimensional spaces has caused severe problems for statistical inference. To make it
clear, as an example, Greven et al. [9] developed a general framework for functional additive mixed-effect regression models. They considered a log-likelihood function obtained by summing the log-likelihoods of the response functions $Y_i$ at a grid of time points $t_{id}$, $d = 1,\dots,D_i$, assuming the $Y_i(t_{id})$ to be independent within the grid of time points. A simulation study by Kokoszka and Reimherr [16] revealed the weak performance of the proposed framework for statistical hypothesis testing in a simple Gaussian function-on-scalar linear regression problem. Currently, MLE and other density-based methods are out of reach in the context of functional response models. In this study, we follow a different path by identifying probability measures with their kernel mean functions, and we introduce a framework for statistical inference in infinite-dimensional spaces. A promising fact about kernel mean functions, shown in this paper, is their ability to reflect the concentration of probability measures in small open neighborhoods in a way that, unlike the approach of Delaigle and Hall [5], is comparable among different probability measures. This property of the kernel mean function motivates us to make use of it in fitting statistical models and introducing new statistical tests in the context of functional data analysis. This paper is organized as follows. In Section 2, kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces is discussed. In Section 3, the Maximum Kernel Mean estimation method is introduced and estimators for Gaussian response regression models are derived. In Section 4, new statistical tests are developed for three major problems in functional data analysis and their performance is evaluated using simulation studies. Section 5 is devoted to discussion and conclusion. Major proofs are collected in the appendix.
2 Kernel mean embedding of probability measures
We summarize the basics of kernel mean embedding; see Muandet et al. [20] for a general reference. Let $(H, \mathcal{B}(H), P)$ be a probability measure space. Throughout this study, $H$ is an infinite-dimensional separable Hilbert space equipped with inner product $\langle\cdot,\cdot\rangle_H$. A function $k: H \times H \to \mathbb{R}$ is a positive definite kernel if it is symmetric, i.e., $k(x, y) = k(y, x)$, and $\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) \ge 0$ for all $n \in \mathbb{N}$, $a_i \in \mathbb{R}$, and $x_i \in H$. $k$ is strictly positive definite if equality implies $a_1 = a_2 = \dots = a_n = 0$. $k$ is said to be integrally strictly positive definite if $\int k(x, y)\,\mu(dx)\,\mu(dy) > 0$ for any non-zero finite signed measure $\mu$ defined over $(H, \mathcal{B}(H))$. Any integrally strictly positive definite kernel is strictly positive definite, while the converse is not true [26]. A positive definite kernel induces a Hilbert space of functions over $H$, called the Reproducing Kernel Hilbert Space (RKHS), which equals $H_k = \overline{\mathrm{span}}\{k(x, \cdot);\, x \in H\}$ with inner product
$$\Big\langle \sum_{i\ge1} a_i k(x_i, \cdot),\ \sum_{j\ge1} b_j k(y_j, \cdot) \Big\rangle_{H_k} = \sum_{i\ge1}\sum_{j\ge1} a_i b_j k(x_i, y_j).$$
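For intuition, positive definiteness can be checked numerically on discretized curves: the Gram matrix of the Gaussian kernel is symmetric with non-negative eigenvalues. The following is a minimal sketch, not from the original paper; the grid, the test curves, and $\sigma = 1$ are arbitrary illustrative choices.

```python
import numpy as np

# Discretize L^2[0,1]: a curve is a vector of values on a grid,
# and ||x - y||_H^2 is approximated by a Riemann sum.
grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-sigma * ||x - y||_{L^2}^2)."""
    return np.exp(-sigma * np.sum((x - y) ** 2) * dt)

rng = np.random.default_rng(0)
curves = [np.sin(2 * np.pi * f * grid) + 0.1 * rng.standard_normal(grid.size)
          for f in (1, 2, 3, 4, 5)]

# Gram matrix; positive definiteness <=> all eigenvalues are non-negative.
K = np.array([[gauss_kernel(x, y) for y in curves] for x in curves])
assert np.allclose(K, K.T)
print("min eigenvalue of Gram matrix:", np.linalg.eigvalsh(K).min())
```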
For each $f \in H_k$ and $x \in H$ we have $f(x) = \langle f, k(\cdot, x)\rangle_{H_k}$, which is the reproducing property of the kernel $k$. A strictly positive definite kernel $k$ is said to be characteristic for a family of measures $\mathcal{P}$ if the map
$$m: \mathcal{P} \to H_k, \qquad P \mapsto \int k(x, \cdot)\,P(dx)$$
is injective. If $\mathbb{E}_P\big(\sqrt{k(X, X)}\big) < \infty$, then $m_P(\cdot) := (m(P))(\cdot)$ exists in $H_k$ [20], and the function $m_P(\cdot) = \int k(x, \cdot)\,P(dx)$ is called the kernel mean function.
Moreover, for any $f \in H_k$ we have $\mathbb{E}_P[f(X)] = \langle f, m_P\rangle_{H_k}$ [25]. Thus, if the kernel $k$ is characteristic, then every probability measure defined over $(H, \Sigma)$ is uniquely identified by an element $m_P$ of $H_k$, and the Maximum Mean Discrepancy (MMD), defined as
$$\mathrm{MMD}(H_k, P, Q) = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \left( \int f(x)\,P(dx) - \int f(x)\,Q(dx) \right) = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \langle f,\, m_P - m_Q \rangle = \| m_P - m_Q \|_{H_k}, \qquad (1)$$
is a metric on the family of measures $\mathcal{P}$ over $H$ [20]. A similar quantity, called Ball divergence, was proposed by Pan et al. [22] to distinguish probability measures defined over separable Banach spaces. In the case of infinite-dimensional spaces, Ball divergence distinguishes two probability measures if at least one of them has full support, that is, $\mathrm{Supp}(P) = H$. They employed Ball divergence for a two-sample test; according to their simulation results, the performance of MMD and Ball divergence is close, and both are superior to other tests. Kernel mean functions can also be used to reflect the concentration of probability measures in small balls if the kernel function is translation-invariant. A positive definite kernel $k$ is called translation-invariant if $k(x, y) = \psi(x - y)$ for some positive definite function $\psi$. The Gaussian kernel $e^{-\sigma\|x-y\|_H^2}$ and the Laplace kernel $e^{-\sigma\|x-y\|_H}$ are such kernels. If we choose a continuous characteristic kernel that is bounded and translation-invariant, then the kernel mean function $m_P$ can be employed to represent the concentration of the probability measure at different points of the Hilbert space $H$. For example, consider
$$m_P(x) = \int_H e^{-\sigma\|x-y\|_H^2}\, P(dy).$$
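In practice, $m_P$ and the MMD in (1) can be estimated by replacing $P$ and $Q$ with empirical measures. The following minimal sketch (not from the paper; the Gaussian kernel on a Riemann-sum approximation of the $L^2[0,1]$ norm, the biased V-statistic estimator, and the sample sizes are illustrative assumptions) computes both quantities for discretized curves.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]

def gram(X, Y, sigma=1.0):
    """Gaussian-kernel Gram matrix between two samples of curves.

    X, Y: arrays of shape (n, len(grid)); the L^2 norm is a Riemann sum.
    """
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2) * dt
    return np.exp(-sigma * sq)

def kernel_mean(sample, x, sigma=1.0):
    """Empirical kernel mean m_P(x) ~ (1/n) sum_i k(x, y_i)."""
    return gram(x[None, :], sample, sigma)[0].mean()

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 = ||m_P - m_Q||^2."""
    return gram(X, X, sigma).mean() + gram(Y, Y, sigma).mean() \
        - 2.0 * gram(X, Y, sigma).mean()

rng = np.random.default_rng(1)
# Two samples of rough curves with different mean functions.
P = np.sin(2 * np.pi * grid) + 0.3 * rng.standard_normal((50, grid.size))
Q = np.cos(2 * np.pi * grid) + 0.3 * rng.standard_normal((50, grid.size))
print("m_P at the P mean curve:", kernel_mean(P, np.sin(2 * np.pi * grid)))
print("MMD^2(P, Q):", mmd2(P, Q))
```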
If $m_P(\cdot)$ has an explicit form for a family of probability measures, then it can be employed to study and compare different probability measures. For example, if $m_P(x_1) > m_P(x_2)$, then it can be concluded that the concentration of the probability measure $P$ around the point $x_1$ is higher than around $x_2$; and if for two given probability measures $P_1$ and $P_2$ we have $m_{P_1}(x) > m_{P_2}(x)$, then we conclude that the concentration of $P_1$ around the point $x$ is higher than that of $P_2$. This property of kernel mean functions makes them a good candidate for representing probability measures in infinite-dimensional spaces. The representation property of probability measures by kernel mean functions is addressed in the next theorem and corollary. Proofs are provided in the appendix.
Theorem 1. Let $P_1$ and $P_2$ be two probability measures on a separable Hilbert space $H$ over the field $\mathbb{R}$. Let $\psi: \mathbb{R}^+ \to [0, 1]$ be a bounded, continuous, strictly decreasing, and positive definite function, e.g. $\psi(t) = e^{-t^2}$, such that $k(x, y) = \psi(\|x - y\|_H)$ is a translation-invariant characteristic kernel, and let $m_{P_1}(\cdot)$ and $m_{P_2}(\cdot)$ be the kernel mean embeddings of $P_1$ and $P_2$, respectively, for the kernel $k(\cdot,\cdot)$. If $m_{P_2}(y) > m_{P_1}(y)$ for a given $y \in H$, then there exists an open ball $B_r(y)$ such that $P_2(B_r(y)) > P_1(B_r(y))$, and $r$ depends only on the difference $m_{P_2}(y) - m_{P_1}(y)$ and on the characteristic kernel itself.
Corollary 2. Let $P$ be a probability measure on a separable Hilbert space $H$ over the field $\mathbb{R}$. Let $\psi: \mathbb{R}^+ \to [0, 1]$ be a bounded, continuous, strictly decreasing, and positive definite function, e.g. $\psi(t) = e^{-t^2}$, such that $k(x, y) = \psi(\|x - y\|_H)$ is a translation-invariant characteristic kernel, and let $m_P(\cdot)$ be the kernel mean embedding of $P$ for the kernel $k(\cdot,\cdot)$. If $m_P(y_2) > m_P(y_1)$ for some $y_1, y_2 \in H$, then there exist open balls of the same size, $B_r(y_1)$ and $B_r(y_2)$, such that $P(B_r(y_2)) > P(B_r(y_1))$, and $r$ depends only on the difference $m_P(y_2) - m_P(y_1)$ and the characteristic kernel itself.
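Theorem 1 can be illustrated numerically by Monte Carlo: estimate the kernel mean of two measures at a point $y$ and compare it with their small-ball probabilities around $y$. This is a hypothetical sketch; the two measures, $\sigma$, $r$, and the crude Gaussian-process stand-in are arbitrary choices made for illustration.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]
rng = np.random.default_rng(2)

def sample_gp(mean, n):
    """Draw n rough paths around a mean curve (a crude stand-in for N(mu, C))."""
    return mean + 0.3 * rng.standard_normal((n, grid.size))

def l2_dist2(Y, x):
    """Squared L^2 distances of each row of Y from the curve x."""
    return ((Y - x) ** 2).sum(axis=1) * dt

y = np.zeros(grid.size)                             # evaluation point y in H
P1 = sample_gp(np.sin(2 * np.pi * grid), 20000)     # mean curve far from y
P2 = sample_gp(np.zeros(grid.size), 20000)          # mean curve at y

sigma, r = 1.0, 0.5
m1 = np.exp(-sigma * l2_dist2(P1, y)).mean()        # Monte Carlo m_{P1}(y)
m2 = np.exp(-sigma * l2_dist2(P2, y)).mean()        # Monte Carlo m_{P2}(y)
b1 = (l2_dist2(P1, y) < r ** 2).mean()              # Monte Carlo P1(B_r(y))
b2 = (l2_dist2(P2, y) < r ** 2).mean()              # Monte Carlo P2(B_r(y))
print(f"m_P2(y)={m2:.3f} > m_P1(y)={m1:.3f};  "
      f"P2(B_r(y))={b2:.3f} > P1(B_r(y))={b1:.3f}")
```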
Kernel mean embedding of probability measures also has a connection with kernel scoring rules. Proper scoring rules are well-established instruments with applications in assessing probability models [7]. The following definition is borrowed from Steinwart and Ziegel [28] and adapted to our context. In the following definition, $c_{00}$ is the infinite-dimensional inner product space of eventually zero sequences, which is a dense subspace of $\ell^2$.
Definition 3. Let $X$ be an arbitrary measurable space; here it may be considered to be either the separable Hilbert space $\ell^2$ or the separable inner product space $c_{00}$. Let $M_1(X)$ be the space of probability measures on $X$. For $\mathcal{P} \subseteq M_1(X)$, a scoring rule is defined as a function $S: \mathcal{P} \times X \to [-\infty, \infty]$ such that the integral $\int_X S(P, x)\,Q(dx)$ exists for all $P, Q \in \mathcal{P}$. The scoring rule is proper if
$$\int_X S(P, x)\,P(dx) \le \int_X S(Q, x)\,P(dx), \qquad \forall P, Q \in \mathcal{P},$$
and is called strictly proper if equality implies $P = Q$. Kernel scores are a general class of proper scoring rules, in which the scoring rule is generated by a symmetric positive definite kernel $k: X \times X \to \mathbb{R}$ by
$$S_k(P, x) := -\int k(\omega, x)\,P(d\omega) + \frac{1}{2}\iint k(\omega, \omega')\,P(d\omega)\,P(d\omega')$$
$$= -m_P(x) + \frac{1}{2}\|m_P\|^2. \qquad (2)$$
The Maximum Mean Discrepancy distance between $P, Q \in \mathcal{P}$ satisfies
$$\|m_P - m_Q\|_{H_k}^2 = 2\left(\int S_k(Q, x)\,P(dx) - \int S_k(P, x)\,P(dx)\right). \qquad (3)$$
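Relation (3) can be verified numerically when $P$ and $Q$ are empirical measures, since every integral becomes a finite average. The following is a small sketch under stated assumptions: a finite-dimensional space stands in for $H$, and the Gaussian kernel and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_gram(X, Y, sigma=1.0):
    """Gaussian-kernel Gram matrix on R^d (finite-dimensional stand-in for H)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sigma * sq)

def kernel_score(sample, x, sigma=1.0):
    """Empirical kernel score S_k(P, x) = -m_P(x) + 0.5 * ||m_P||^2, eq. (2)."""
    m_at_x = gauss_gram(x[None, :], sample, sigma)[0].mean()
    m_norm2 = gauss_gram(sample, sample, sigma).mean()
    return -m_at_x + 0.5 * m_norm2

# Two empirical measures P, Q on R^5.
P = rng.normal(0.0, 1.0, size=(400, 5))
Q = rng.normal(0.5, 1.0, size=(400, 5))

# Left-hand side of (3): squared MMD between the empirical measures.
lhs = gauss_gram(P, P).mean() + gauss_gram(Q, Q).mean() - 2 * gauss_gram(P, Q).mean()
# Right-hand side of (3): 2 * (E_P S_k(Q, x) - E_P S_k(P, x)).
rhs = 2 * (np.mean([kernel_score(Q, x) for x in P])
           - np.mean([kernel_score(P, x) for x in P]))
print(f"MMD^2 = {lhs:.4f}, scoring-rule gap = {rhs:.4f}")  # equal by (3)
```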
If $k$ is bounded, then $\mathcal{P} = M_1(X)$ [28]. In effect, a kernel scoring rule $S_k$ is strictly proper if and only if the kernel mean embedding is injective, i.e., $k$ is characteristic. There is a plethora of studies on different classes of characteristic kernels over locally compact spaces. For example, Steinwart [27] proved that the Gaussian kernel is characteristic on compact sets, Sriperumbudur et al. [26, Theorem 9] showed that the Gaussian kernel is characteristic on the whole space $\mathbb{R}^d$, and Simon-Gabriel and Schölkopf [24] studied the connection between various properties of kernels, such as universality, being characteristic, and positive definiteness. Given a separable Hilbert space $H$, any integrally strictly positive definite kernel is characteristic [26, Theorem 7]; however, it is not clear which kernels are integrally strictly positive definite over infinite-dimensional separable Hilbert spaces. To the best of our knowledge, there is no study on the existence and construction of characteristic kernels for infinite-dimensional spaces. The following two theorems, proofs of which are provided in the appendix, tackle this problem. In Theorem 4, the result of Steinwart and Ziegel [28, Theorem 3.14] is used to show the existence of a continuous characteristic kernel for infinite-dimensional separable Hilbert spaces, and Theorem 5 shows that the Gaussian kernel is characteristic for $c_{00}$, the infinite-dimensional inner product space of eventually zero sequences, which is dense in $\ell^2$.
Theorem 4. Let $H$ be an infinite-dimensional separable Hilbert space. There exists a continuous characteristic kernel on $H$.

Theorem 5. Let $c_{00}$ be the space of eventually zero sequences in $\mathbb{R}^\infty$. The Gaussian kernel defined as $k(x, y) = e^{-\sigma\|x-y\|_2^2}$ is characteristic on $c_{00}$.

Besides what is presented in Theorem 4 and Theorem 5, we show in Proposition 6 that the Gaussian kernel is characteristic for the family of Gaussian probability measures over $H$.
3 Maximum Kernel Mean Estimation
In the context of multivariate statistics, the density function is one of the most ubiquitous tools in statistical inference. The density is a non-negative function that represents the amount of probability mass at a point, or the concentration of the probability measure in a very small neighborhood. Typically, a nominated family of probability measures is represented by the corresponding family of densities, and the aim is to choose a density from this family which is
the most likely one to have generated a set of observations obtained by a probability-based survey sample. The aforementioned family of probability measures is usually parameterized by a parameter $\theta$ taking values in a subset $\Theta$ of a finite-dimensional or infinite-dimensional space. Suppose that $\{P_\theta,\ \theta \in \Theta\}$ is a nominated family of probability measures indexed by $\theta$. The idea behind MLE is as follows: suppose we randomly survey the population according to a sampling method and the result is an observation $y$. If $\theta$ is unknown, an estimate of $\theta$ is one for which $P_\theta$ is the most likely generator of $y$. If the density function $f_\theta = dP_\theta/d\lambda$ exists, we seek a $\theta$ for which $f_\theta(y)$ is maximal. What makes a density function suitable for this kind of inference is the base measure $\lambda$ relative to which the density is defined. The counting measure and the Lebesgue measure are suitable options in finite-dimensional spaces. These base measures are positive, locally finite, and translation-invariant, and a nontrivial measure with these properties does not exist on an infinite-dimensional separable Hilbert space [6]. Employing the kernel mean function, we can introduce an idea rather similar to likelihood-based estimation in infinite-dimensional spaces. Suppose $k$ is a bounded, continuous, and translation-invariant characteristic kernel as described in Theorem 1, such as the Gaussian kernel $k(x, y) = e^{-\sigma\|x-y\|_H^2}$ for a fixed $\sigma > 0$. The kernel mean function $m_P(\cdot)$ is a bounded function over $H$ which reflects the concentration of $P$ in a small neighborhood of $y \in H$. Consider the family of probability measures $\mathcal{M} = \{P_\theta,\ \theta \in \Theta\}$ and its counterpart family of kernel mean functions $\{m_{P_\theta}(\cdot);\ \theta \in \Theta\}$. Assume that $\theta$ is unknown and the endeavor is to estimate it from an observed random sample $y$ from the population. Again the aim is to pick a $\hat\theta \in \Theta$ such that $P_{\hat\theta}$ is the most likely generator of $y$. Thus, it seems natural to estimate $\theta$ by $\hat\theta = \arg\max_{\theta \in \Theta} m_{P_\theta}(y)$. In this section, we derive the kernel mean embedding of probability measures induced by functional response models, and we show how parameter estimation and hypothesis testing are possible in this framework.
3.1 Kernel Mean Embedding of Gaussian Probability Measures

The assumption of Gaussianity is prevalent and fundamental to many statistical problems in the context of functional data analysis, including functional response regression, functional one-way ANOVA, and testing for homogeneity of covariance operators. In this regard, it is desirable to study the kernel mean embedding of the Gaussian probability measures induced by these categories of models. Let $H$ be an arbitrary infinite-dimensional separable Hilbert space. An $H$-valued random element $X$ is said to be a Gaussian random element with mean function $\mu$ and covariance operator $C$ if, for any $a \in H$, we have $\langle a, X\rangle \sim N(\langle a, \mu\rangle, \langle Ca, a\rangle)$. A Gaussian random element has a finite second moment, and its covariance operator is trace class [18]. Let $(\lambda_i, \psi_i)_{i\ge1}$ be the eigensystem
of $C$, and let $H$ be an arbitrary function space such as $L^2[0,1]$; then the kernel of the integral operator $C$ admits the decomposition $k_C(s, t) = \sum_{j\ge1} \lambda_j \psi_j(s)\psi_j(t) = \mathrm{Cov}[X(s), X(t)]$ [14]. The kernel mean function of the Gaussian family of probability measures with mean $\mu$ and covariance operator $C$, together with its uniqueness, is given in the following proposition, which is proved in the appendix.

Proposition 6. Let $Y \sim N(\mu, C)$, i.e. $\langle a, Y\rangle \sim N(\langle a, \mu\rangle, \langle Ca, a\rangle)$. Then for a Gaussian kernel,
$$m_P(x) = \int_H e^{-\sigma\|x-y\|_H^2}\, N(\mu, C)(dy) = |I + 2\sigma C|^{-1/2}\, e^{-\sigma\langle (I+2\sigma C)^{-1}(x-\mu),\, x-\mu\rangle},$$
the kernel mean embedding is injective, and
$$\|m_P\|_{H_k}^2 = |I + 4\sigma C|^{-1/2}.$$
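Since $|I + 2\sigma C|^{-1/2} = \prod_{j\ge1}(1 + 2\sigma\lambda_j)^{-1/2}$ and the quadratic form diagonalizes in the eigenbasis of $C$, Proposition 6 can be evaluated numerically by truncating the eigensystem. The sketch below is illustrative only: the grid, the Fourier-type eigensystem, $\sigma$, the one-parameter location family, and the grid search are all hypothetical choices used to preview the MKM idea of maximizing $m_{P_\theta}(y)$ over $\theta$.

```python
import numpy as np

def log_kernel_mean_gauss(x, mu, eigvals, eigfuns, dt, sigma=1.0):
    """log m_P(x) for P = N(mu, C), via Proposition 6.

    |I + 2 sigma C|^{-1/2} becomes prod_j (1 + 2 sigma lambda_j)^{-1/2}, and
    the quadratic form becomes sum_j <x - mu, psi_j>^2 / (1 + 2 sigma lambda_j),
    truncated to the leading eigenpairs.
    """
    scores = (eigfuns @ (x - mu)) * dt          # <x - mu, psi_j> on the grid
    quad = np.sum(scores ** 2 / (1.0 + 2.0 * sigma * eigvals))
    logdet = 0.5 * np.sum(np.log(1.0 + 2.0 * sigma * eigvals))
    return -logdet - sigma * quad

grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]
# Toy eigensystem: Fourier-type eigenfunctions with decaying eigenvalues.
J = 20
eigvals = 1.0 / (np.arange(1, J + 1) ** 2)
eigfuns = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, J + 1), np.pi * grid))

mu_true = np.sin(2 * np.pi * grid)
y = mu_true.copy()  # an observation near the true mean

# MKM idea: pick the location parameter maximizing m_{P_theta}(y) over a grid.
thetas = np.linspace(-2.0, 2.0, 81)
vals = [log_kernel_mean_gauss(y, t * mu_true, eigvals, eigfuns, dt) for t in thetas]
print("MKM estimate of theta:", thetas[int(np.argmax(vals))])  # ~ 1.0
```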
Consider a random sample $Y_i \in L^2[0,1]$, $i = 1, \dots, n$, of independent and identically distributed random elements with distribution $N(\mu, C)$. By choosing a suitable characteristic kernel for the product space $(L^2[0,1])^n$, the kernel mean embedding of the probability measure induced by the random sample $\{Y_i\}$, i.e. $\otimes_{i=1}^n N(\mu, C)$, can be computed. Let $k$ be the Gaussian kernel; then according to the following proposition, which is proved in the appendix, $\prod_{i=1}^n k(\cdot, \cdot)$ and $\sum_{i=1}^n k(\cdot, \cdot)$ are characteristic kernels for the family of product measures on $(L^2[0,1])^n$.

Proposition 7. Let $k(\cdot,\cdot)$ be a characteristic kernel defined over a separable Hilbert space $H$. Then the product kernel
$$k_P(\cdot,\cdot): H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \prod_{i=1}^n k(x_i, y_i)$$
and the sum kernel
$$k_S(\cdot,\cdot): H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \sum_{i=1}^n k(x_i, y_i)$$
are two characteristic kernels for the family of product probability measures $\mathcal{P}^n = \{\otimes_{j=1}^n P_j \mid P_j \in \mathcal{P},\ j = 1, \dots, n\}$ on $H^n$. For the case of the Gaussian product kernel, given a simple random sample $Y_1, \dots, Y_n$ drawn from the Gaussian distribution, the kernel mean function is
$$m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n) = \int e^{-\sigma \sum_{i=1}^n \|y_i - z_i\|^2}\, \otimes_{i=1}^n N(\mu, C)(dz_i),$$
whose logarithm equals
$$\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n) = \sum_{i=1}^n \sum_{j\ge1} \frac{-\sigma}{1 + 2\sigma\lambda_j}\, \langle y_i - \mu, \psi_j\rangle^2 - \frac{n}{2} \sum_{j\ge1} \log(1 + 2\sigma\lambda_j), \qquad (4)$$
where $y_1, \dots, y_n$ are the observed counterparts of $Y_1, \dots, Y_n$. Defining the sample mean function and the sample covariance operator as $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $\hat{C}_Y = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y}) \otimes (y_i - \bar{y})$, respectively, the logarithm of the kernel mean function for the Gaussian product kernel is also equal to
$$\log m_{\otimes_{i=1}^n N(\mu, C_Y)}(y_1, \dots, y_n) = \sum_{j\ge1} \frac{-n\sigma}{1 + 2\sigma\lambda_j} \left[ \langle \hat{C}_Y \psi_j, \psi_j\rangle + \langle \bar{y} - \mu, \psi_j\rangle^2 \right] - \frac{n}{2} \sum_{j\ge1} \log(1 + 2\sigma\lambda_j). \qquad (5)$$
We can see that the kernel mean function depends on $\{y_1, y_2, \dots, y_n\}$ only through $\bar{y}$ and $\hat{C}_Y$. Since the Gaussian product kernel is characteristic, Equation (5) shows that for the family of Gaussian probability measures, $(\bar{y}, \hat{C}_Y)$ is a joint sufficient statistic for the parameters $(\mu, C_Y)$. The possibility of identifying sufficient statistics through kernel mean functions, alongside Theorem 1 and Corollary 2, reveals how the kernel mean embedding of a probability measure behaves akin to a density function over finite-dimensional spaces. The location and covariance parameters of the distribution can be estimated by maximizing $\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n)$. The resulting estimator differs slightly, in the weights of the components, from the estimators one may obtain by the small-ball probability approximation proposed by Delaigle and Hall [5] or by the OLS approach. As highlighted in Proposition 10, the Maximum Kernel Mean (MKM) estimator of the location parameters converges, as $\sigma$ tends to infinity, to the OLS estimator, which is also the limiting estimator obtained by the small-ball probability approach. It is also worth noting that although the small-ball probability approximation approach provides no estimate of the covariance parameters, MKM does. In the context of functional regression, as addressed in Section 3.2, we may substitute a linear model for $\mu$ in (4) and estimate the parameters of the model either by maximizing $m_P(y_1, \dots, y_n) - \frac{1}{2}\|m_P\|_{H_k}^2$ as in (2), or, for the location parameters only, by maximizing $\log m_{\otimes P}(y_1, \dots, y_n)$, seeing as $\|m_P\|_{H_k}^2$ does not depend on the location parameters. The kernel mean approach also provides the rich toolbox of kernel methods developed by the machine learning community, which can be used in statistical inference. To give just a few examples: the Kernel Bayes Rule for Bayesian inference and latent variable modeling, Maximum Mean Discrepancy (MMD) for hypothesis testing and developing goodness-of-fit indices, and the Hilbert-Schmidt Independence Criterion (HSIC) for measuring dependence between random elements [see 20, 8, 13, 29]. In Section 4, MMD is used to derive and introduce new tests for three main problems in functional data analysis: function-on-scalar regression, one-way ANOVA, and testing for homogeneity of covariance operators. The power of these tests is studied and compared with competitors by simulation.
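Equation (5) is straightforward to evaluate on discretized curves once a truncated eigensystem of $C$ is fixed. The following is a minimal sketch; the Fourier-type eigensystem, grid, sample, and $\sigma$ are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

def log_kernel_mean_product(ys, mu, eigvals, eigfuns, dt, sigma=1.0):
    """Evaluate the log kernel mean of Eq. (5) on a grid.

    ys: (n, g) sample of discretized curves; eigvals/eigfuns: truncated
    eigensystem of the covariance operator C (eigfuns has shape (J, g)).
    """
    n = ys.shape[0]
    ybar = ys.mean(axis=0)
    centered = ys - ybar
    # <C_hat psi_j, psi_j> = (1/n) sum_i <y_i - ybar, psi_j>^2
    proj = (centered @ eigfuns.T) * dt                # (n, J) component scores
    chat_jj = (proj ** 2).mean(axis=0)
    mean_scores = (eigfuns @ (ybar - mu)) * dt        # <ybar - mu, psi_j>
    w = 1.0 + 2.0 * sigma * eigvals
    return (-n * sigma * np.sum((chat_jj + mean_scores ** 2) / w)
            - 0.5 * n * np.sum(np.log(w)))

grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]
J = 20
eigvals = 1.0 / (np.arange(1, J + 1) ** 2)
eigfuns = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, J + 1), np.pi * grid))

rng = np.random.default_rng(4)
# Draw n curves from the truncated Karhunen-Loeve expansion with mean mu.
mu = np.sin(2 * np.pi * grid)
scores = rng.standard_normal((30, J)) * np.sqrt(eigvals)
ys = mu + scores @ eigfuns

# The log kernel mean is larger at the true mean than at a wrong one.
print(log_kernel_mean_product(ys, mu, eigvals, eigfuns, dt),
      log_kernel_mean_product(ys, -mu, eigvals, eigfuns, dt))
```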
3.2 MKM Estimation of Parameters in Function-on-Scalar Regression
Let $Y$ be a Gaussian random element taking values in $H$. Given a random sample of $Y$, we can employ the kernel mean function to estimate the location and covariance parameters. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of $Y$ according to the following function-on-scalar regression model:
$$Y_i(t) = x_i^T \beta(t) + \varepsilon_i(t), \qquad i = 1, \dots, n, \qquad (6)$$
where $x_i$ is the vector of scalar covariates and $\beta$ is the vector of $p$ functional parameters. The residual functions $\varepsilon_i$ are $n$ independent copies of a Gaussian random element with mean function zero and covariance operator $C$. The following two propositions can be employed to obtain the MKM estimates of the location and covariance parameters.
Proposition 8. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of model (6), where $\varepsilon_i$ is an $H$-valued Gaussian random element with mean function zero and covariance operator $C$. The MKM estimates of the functional regression parameters coincide with the ordinary least squares estimates.
Proof. The logarithm of the kernel mean function, by (4), equals
$$\log m_\beta(y_1, \dots, y_n) := \log m_{\otimes_{i=1}^n N(\mu_i, C)}(y_1, \dots, y_n) = -\sigma \sum_{i=1}^n \left\langle (I + 2\sigma C)^{-1}\big(y_i - x_i^T\beta\big),\ y_i - x_i^T\beta \right\rangle - \frac{n}{2} \sum_{j\ge1} \log(1 + 2\sigma\lambda_j). \qquad (7)$$
The Fréchet derivative of (7) with respect to $\beta$ is an operator from $H^p$ to $\mathbb{R}$, i.e.
$$\frac{\partial}{\partial\beta} \log m_\beta(y_1, \dots, y_n): H^p \to \mathbb{R}.$$
$\hat\beta$ is a local extremum of $\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n)$ if
$$\frac{\partial}{\partial\beta} \log m_{\hat\beta}(y_1, \dots, y_n)(h) = 0 \qquad \forall h \in H^p.$$
Taking the Fréchet derivative of (7) with respect to $\beta$, for an arbitrary $h \in H^p$ we have
$$\frac{\partial}{\partial\beta} \log m_\beta(y_1, \dots, y_n)(h) = -\sigma\, \frac{\partial}{\partial\beta} \left[ \sum_{i=1}^n \left\langle (I + 2\sigma C)^{-1}\big(y_i - x_i^T\beta\big),\ y_i - x_i^T\beta \right\rangle \right](h) = 2\sigma \sum_{i=1}^n \left\langle (I + 2\sigma C)^{-1} x_i^T h,\ y_i - x_i^T\beta \right\rangle$$
$$= 2\sigma \sum_{k=1}^{p} \left\langle h_k,\ (I + 2\sigma C)^{-1} \sum_{i=1}^{n} x_{ik}\big(y_i - x_i^T\beta\big) \right\rangle.$$
So if $\hat\beta$ is a local extremum of (7), for each $1 \le k \le p$ we must have
$$\sum_{i=1}^{n} x_{ik}\big(y_i - x_i^T\hat\beta\big) = 0,$$
so $X^T(Y - X\hat\beta) = 0$, and consequently $\hat\beta = (X^T X)^{-1} X^T Y$. The remaining question is whether $\hat\beta$ maximizes (7) or not. Let $\beta = \hat\beta + \nu$; then
$$\log m_\beta(y_1, \dots, y_n) = \log m_{\hat\beta}(y_1, \dots, y_n) - \sigma \sum_{i=1}^n \left\langle (I + 2\sigma C)^{-1} x_i^T\nu,\ x_i^T\nu \right\rangle$$
$$\le \log m_{\hat\beta}(y_1, \dots, y_n),$$
which completes the proof.
From the last proposition, $\hat\beta = (X^T X)^{-1} X^T Y$ is the MKM estimator of the functional regression coefficients. It is also possible to derive a restricted MKM estimate of the covariance operator with an approach similar to restricted ML. Let $A = [u_1, \dots, u_{n-k}]$ be the first $n - k$ eigenvectors of $I - X(X^T X)^{-1} X^T$. Let $Y = [Y_i]_{i=1,\dots,n}$ be an $n \times 1$ matrix; then $Y_i^* = u_i^T Y$ is called the error contrast vector, and $Y_1^*, \dots, Y_{n-k}^*$ is a sequence of $n - k$ independent and identically distributed random elements with mean function zero and common covariance operator $C$. We can then use the sequence $y_1^*, \dots, y_{n-k}^*$ and employ Proposition 9 to estimate the covariance operator by $\hat{C} = \frac{1}{n-k} \sum_{i=1}^{n-k} y_i^* \otimes y_i^*$.
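The estimators of this subsection can be sketched in a few lines: $\hat\beta = (X^TX)^{-1}X^TY$ for model (6), and the restricted covariance estimate built from error contrasts, i.e., the eigenvectors of $I - X(X^TX)^{-1}X^T$ with unit eigenvalue. The following is a hypothetical numerical sketch on a grid; the design, coefficients, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0.0, 1.0, 101)
n, p = 60, 2

# Function-on-scalar model (6): Y_i(t) = x_i^T beta(t) + eps_i(t).
X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # design matrix
beta = np.vstack([np.sin(2 * np.pi * grid), grid ** 2])     # (p, g) coefficients
eps = 0.3 * rng.standard_normal((n, grid.size))             # rough Gaussian noise
Y = X @ beta + eps                                          # (n, g) responses

# MKM / OLS estimate of the functional coefficients (Proposition 8).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                # (X^T X)^{-1} X^T Y

# Restricted estimate of C via error contrasts: eigenvectors of the
# annihilator I - X (X^T X)^{-1} X^T with eigenvalue 1.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
w, U = np.linalg.eigh(M)
A = U[:, w > 0.5]                                           # the n - p contrasts
Ystar = A.T @ Y                                             # (n - p, g) contrasts
C_hat = Ystar.T @ Ystar / (n - p)                           # discretized C(s, t)
print(beta_hat.shape, C_hat.shape)
```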
Proposition 9. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of an $H$-valued Gaussian random element with mean function zero and covariance operator $C = \sum_{j\ge1} \lambda_j \psi_j \otimes \psi_j$. Let $\hat{C} = \frac{1}{n}\sum_{i=1}^n y_i \otimes y_i$. Then, as $\sigma \to \infty$, the MKM estimators of $\{\lambda_j, \psi_j\}_{j\ge1}$ converge to
1. $\hat\psi_k =$ the $k$-th eigenfunction of $\hat{C}$;
2. $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^n \langle y_i, \hat\psi_k \rangle^2$.
Proof. The logarithm of the kernel mean function of the product measure $\otimes_{i=1}^n N(\mu, C)$ is presented in (4), where we set $\mu = 0$. Parameter estimates are obtained by taking the Fréchet derivative of the kernel mean function with respect to $\psi_k$, and the usual derivative with respect to $\lambda_k$. In each case, it is shown that the local extremum is the global maximum of the kernel mean function.
1) $\psi_k$: First we obtain the estimate of $\psi_1$. Taking the Fréchet derivative of the kernel mean function with respect to $\psi_1$, we have
$$\frac{\partial}{\partial\psi_1} \log m_{\otimes N}(y_1, \dots, y_n)(h) = \frac{\partial}{\partial\psi_1} \left[ \sum_{i=1}^n \sum_{j\ge1} \frac{-\sigma}{1 + 2\sigma\lambda_j} \langle y_i, \psi_j\rangle^2 \right](h) = \frac{-2\sigma}{1 + 2\sigma\lambda_1} \sum_{i=1}^n \langle y_i, \psi_1\rangle \langle y_i, h\rangle = \frac{-2\sigma n}{1 + 2\sigma\lambda_1} \left\langle \hat{C}\psi_1, h \right\rangle.$$
Consider that $\psi_1$ lies on the sphere of radius 1; thus $\tilde\psi_1$ is an extremum point of $\log m_{\otimes N}(y_1, \dots, y_n)$ if $\langle \hat{C}\tilde\psi_1, h\rangle = 0$ for any arbitrary $h$ in the tangent space of the unit sphere at the point $\tilde\psi_1$, i.e.
$$\forall h \in \{\tilde\psi_1\}^\perp \ \Rightarrow\ \langle \hat{C}\tilde\psi_1,\ h \rangle = 0. \qquad (8)$$
In addition, for identifiability, $\tilde\psi_1$ must be associated with the largest eigenvalue of $\hat{C}$. This way, the MKM estimate of $\psi_1$ is the solution to the following optimization problem:
$$\hat\psi_1 = \arg\max_{\tilde\psi \in H}\ \frac{1}{n} \sum_{i=1}^{n} \langle y_i, \tilde\psi \rangle^2 \quad \text{s.t.} \quad \langle \hat{C}\tilde\psi,\ h \rangle = 0 \ \ \forall h \in \{\tilde\psi\}^\perp, \qquad (9)$$
from which it immediately follows that the MKM estimate of $\psi_1$ is the first eigenfunction of $\hat{C}$, and is independent of the kernel parameter $\sigma$. The remaining question is whether $\hat\psi_1$ maximizes (4) or not. Consider that for any arbitrary $h \in \{\hat\psi_1\}^\perp$ and $\tilde\psi_1 = \frac{\hat\psi_1 + h}{\|\hat\psi_1 + h\|}$,
$$\log m_{\tilde\psi_1}(y_1, \dots, y_n) = \log m_{\hat\psi_1}(y_1, \dots, y_n) - \frac{n\sigma}{\|\hat\psi_1 + h\|^2 (1 + 2\sigma\lambda_1)} \left\langle \hat{C}h,\ h \right\rangle \le \log m_{\hat\psi_1}(y_1, \dots, y_n).$$
So $\hat\psi_1$ is the MKM estimate of $\psi_1$. For the MKM estimate of $\psi_k$, $k \ge 2$, the following constraint should be added to the optimization problem (9):
$$\langle \hat\psi_k,\ \hat\psi_j \rangle = 0 \qquad \forall\, 1 \le j < k,$$
which shows that the MKM estimates of the eigenfunctions coincide with the eigenfunctions of $\hat{C}$.
2) $\lambda_k$: Taking the derivative of the kernel mean function with respect to $\lambda_k$ yields
$$\frac{\partial}{\partial\lambda_k} \log m_{\otimes N}(y_1, \dots, y_n) = \frac{\partial}{\partial\lambda_k} \left[ \sum_{i=1}^n \sum_{j\ge1} \frac{-\sigma}{1 + 2\sigma\lambda_j} \langle y_i, \psi_j\rangle^2 - \frac{n}{2} \sum_{j\ge1} \log(1 + 2\sigma\lambda_j) \right]$$
$$= \sum_{i=1}^{n} \frac{2\sigma^2}{(1 + 2\sigma\lambda_k)^2} \langle y_i, \psi_k\rangle^2 - \frac{n}{2}\,\frac{2\sigma}{1 + 2\sigma\lambda_k} = \frac{\sigma}{(1 + 2\sigma\lambda_k)^2} \left[ 2\sigma \sum_{i=1}^{n} \langle y_i, \psi_k\rangle^2 - n(1 + 2\sigma\lambda_k) \right]. \qquad (10)$$
Equating (10) to zero, and given $\psi_k$, the value of $\lambda_k$ that maximizes (4) with $\mu = 0$ is $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^n \langle y_i, \psi_k\rangle^2 - \frac{1}{2\sigma}$; in consequence, taking $\hat\psi_k$ to be the $k$-th eigenfunction of $\hat{C}$, we obtain $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^n \langle y_i, \hat\psi_k\rangle^2 - \frac{1}{2\sigma}$, which is a biased estimator of $\lambda_k$ and converges to $\frac{1}{n}\sum_{i=1}^n \langle y_i, \hat\psi_k\rangle^2$ as $\sigma$ tends to infinity.

In the following proposition, we obtain an estimate of the functional regression coefficients of model (6) by employing the small-ball (SB) probability approximation.

Proposition 10. In the case of function-on-scalar regression under the normality assumption of model (6), the MKM or OLS estimator of the location parameters is the same as the one obtained by the small-ball probability approximation proposed by Delaigle and Hall [5].
Proof. Let $Y_1, \dots, Y_n$ be a simple random sample generated by model (6), where $\varepsilon_i$ is a sequence of $n$ independent copies of an $H$-valued Gaussian random element with mean function zero and covariance operator $C = \sum_{j\ge1} \lambda_j \psi_j \otimes \psi_j$. Let the functions $\beta_k$ admit the Fourier decompositions $\beta_k = \sum_{j\ge1} \theta_{kj}\psi_j$, and define $\theta_j = (\theta_{kj})_{k=1,\dots,p}$, $x_i = (x_{ik})_{k=1,\dots,p}$, and $\beta = (\beta_k)_{k=1,\dots,p}$. The identity $\epsilon_i = Y_i - x_i^T\beta \overset{iid}{\sim} N(0, C)$ is equivalent to the situation where the component scores $\langle \epsilon_i, \psi_j\rangle/\sqrt{\lambda_j}$ are independent and identically distributed according to the standard normal distribution for each $j \in \mathbb{N}$. Fix $r > 0$ and let $h = \arg\max_j \{r^2 \le \lambda_j\}$. By the method of Delaigle and Hall [5], the log-density with radius $r$ equals
$$\ln P(\epsilon_i \mid r) \propto \sum_{j=1}^{h} \ln f_j\big(\sqrt{\lambda_j}\,\langle \epsilon_i, \psi_j\rangle\big) \propto -\sum_{j=1}^{h} \frac{\lambda_j}{2}\big(\langle Y_i, \psi_j\rangle - x_i^T\theta_j\big)^2, \qquad (11)$$
and thus
$$\sum_{i=1}^{n} \ln P(\epsilon_i \mid r) \propto -\sum_{i=1}^{n}\sum_{j=1}^{h} \frac{\lambda_j}{2}\big(\langle Y_i, \psi_j\rangle - x_i^T\theta_j\big)^2 = \sum_{j=1}^{h} \lambda_j \left( -\frac{1}{2}\theta_j^T(X^T X)\theta_j + B_j^T\theta_j \right),$$
in which $B_j = \sum_{i=1}^{n}\langle Y_i, \psi_j\rangle x_i$, $X$ is the $n \times p$ model matrix, and $Y$ is an $n \times 1$ column vector containing the functions $Y_i(\cdot)$. The estimate of $\theta_j$ can be obtained by solving the equation $\frac{\partial}{\partial\theta_j} \sum_{i=1}^{n} \ln P(\epsilon_i \mid r) = 0$, thus