Kernel Mean Embedding of Probability Measures and its Applications to Functional Data Analysis

Saeed Hayati [email protected]
Kenji Fukumizu [email protected]
Afshin Parvardeh [email protected]

November 5, 2020

Abstract This study introduces kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces induced by functional response statistical models. The embedded function represents the concentration of probability measures in small open neighborhoods, which identifies a pseudo-likelihood and fosters a rich framework for statistical inference. Utilizing Maximum Mean Discrepancy, we devise new tests in functional response models. The performance of the newly derived tests is evaluated against competitors in three major problems in functional data analysis: function-on-scalar regression, functional one-way ANOVA, and equality of covariance operators.

1 Introduction

Functional response models are among the major problems in the context of Functional Data Analysis. A fundamental issue in dealing with functional response statistical models arises from the lack of practical frameworks for characterizing probability measures on function spaces. This is mainly a consequence of the tremendous gap between how we represent probability measures in

finite-dimensional and infinite-dimensional spaces. A useful property of finite-dimensional spaces is the existence of a locally finite, strictly positive, and translation-invariant measure, such as the Lebesgue or counting measure, which makes it possible to use probability measures directly in statistical inference. Fitting a statistical model, estimating parameters, testing hypotheses, deriving confidence regions, and developing goodness-of-fit indices can all be carried out by integrating the distribution or conditional distribution of the response variables into statistical procedures.

Sporadic efforts have gone into approximating or representing probability measures on infinite-dimensional spaces. Let $H$ be a separable infinite-dimensional Hilbert space and $X$ an $H$-valued random element with finite second moment and covariance operator $C$. Delaigle and Hall [5] approximated the probability of $B_r(x) = \{\|X - x\| < r\}$ by the surrogate density of a finite-dimensional approximation of $X$, obtained by projecting the random element $X$ onto the space spanned by the first few eigenfunctions of $C$ with largest eigenvalues. The approximated small-ball probability is based on the Karhunen-Loève expansion and the extra assumption that the component scores are independent. The precision of this approximation depends on the volume of the ball and on the probability measure itself. Let $I$ be a compact subset of $\mathbb{R}$, such as the closed interval $[0, 1]$, and let $X$ be a zero-mean $L^2[I]$-valued random element with finite second moment and Karhunen-Loève expansion $X = \sum_{j\ge 1} \lambda_j^{1/2} X_j \psi_j$, in which $X_j = \lambda_j^{-1/2}\langle X, \psi_j\rangle$ and $\{\lambda_j, \psi_j\}_{j\ge 1}$ is the eigensystem of the covariance operator $C$. Suppose that the distribution of $X_j$ is absolutely continuous with respect to the Lebesgue measure, with density $f_j$. The approximation of the logarithm of $p(x \mid r) = P(B_r(x)) = P(\{\|X - x\| < r\})$ given by Delaigle and Hall [5] is

\[
\log p(x \mid r) = C_1\bigl(h, \{\lambda_j\}_{j\ge 1}\bigr) + \sum_{j=1}^{h} \log f_j(x_j) + o(h),
\]
in which $x_j = \langle x, \psi_j\rangle$, and $h$ is the number of components, which depends on $r$ and tends to infinity as $r$ declines to zero. $C_1(\cdot)$ depends only on the size of the ball and the sequence of eigenvalues, though the $o(h)$ term, the precision of the approximation, depends on $P$. The quantity $h^{-1}\sum_{j=1}^{h}\log f_j(x_j)$ is called the log-density by Delaigle and Hall [5]. A serious concern with this approximation is its precision, which depends on the probability measure itself. Accordingly, it cannot be employed to compare small-ball probabilities within a family of probability measures. For example, in the case of estimating the parameters of a functional response regression model, the induced probability measure varies with different choices of the parameters. Thus this approximation cannot be employed for parameter estimation or for comparing the goodness of fit of different regression models. Another approach to representing probability measures on a general separable Hilbert space $H$ was presented by Lin et al. [17]. They constructed a dense subspace of $H$ called the Mixture Inner Product Space (MIPS), which is the union of a countable collection of finite-dimensional subspaces of $H$. An approximating version of the given $H$-valued random element lies in this subspace and, in consequence, lies in a finite-dimensional subspace of $H$ according to a given discrete distribution. They defined a base measure on the MIPS, which is not translation-invariant, and introduced density functions for MIPS-valued random elements. The absence of a proper method for representing probability measures over infinite-dimensional spaces has caused severe problems for statistical inference. To make it

clear, as an example, Greven et al. [9] developed a general framework for functional additive mixed-effect regression models. They considered a log-likelihood function obtained by summing the log-likelihoods of the response functions $Y_i$ at a grid of time points $t_{id}$, $d = 1, \dots, D_i$, assuming the $Y_i(t_{id})$ to be independent within the grid of time points. A simulation study by Kokoszka and Reimherr [16] revealed the weak performance of the proposed framework in statistical hypothesis testing in a simple Gaussian function-on-scalar linear regression problem. Currently, MLE and other density-based methods are out of reach in the context of functional response models. In this study, we follow a different path by identifying probability measures with their kernel mean functions, and we introduce a framework for statistical inference in infinite-dimensional spaces. A promising fact about kernel mean functions, which is shown in this paper, is their ability to reflect the concentration of probability measures in small open neighborhoods in a way that, unlike the approach of Delaigle and Hall [5], is comparable among different probability measures. This property of the kernel mean function motivates us to make use of it in fitting statistical models and introducing new statistical tests in the context of functional data analysis. This paper is organized as follows. In Section 2, kernel mean embedding of probability measures over infinite-dimensional separable Hilbert spaces is discussed. In Section 3, the Maximum Kernel Mean estimation method is introduced and estimators for Gaussian response regression models are derived. In Section 4, new statistical tests are developed for three major problems in functional data analysis and their performance is evaluated using simulation studies. Section 5 is devoted to discussion and conclusion. Major proofs are collected in the appendix.

2 Kernel mean embedding of probability measures

We summarize the basics of kernel mean embedding; see Muandet et al. [20] for a general reference. Let $(H, \mathcal{B}(H), P)$ be a probability measure space. Throughout this study, $H$ is an infinite-dimensional separable Hilbert space equipped with the inner product $\langle\cdot,\cdot\rangle_H$. A function $k : H \times H \to \mathbb{R}$ is a positive definite kernel if it is symmetric, i.e., $k(x, y) = k(y, x)$, and $\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) \ge 0$ for all $n \in \mathbb{N}$, $a_i \in \mathbb{R}$, and $x_i \in H$. $k$ is strictly positive definite if equality implies $a_1 = a_2 = \dots = a_n = 0$. $k$ is said to be integrally strictly positive definite if $\iint k(x, y)\,\mu(dx)\,\mu(dy) > 0$ for any non-zero finite signed measure $\mu$ defined over $(H, \mathcal{B}(H))$. Any integrally strictly positive definite kernel is strictly positive definite, while the converse is not true [26]. A positive definite kernel induces a Hilbert space of functions over $H$, called the Reproducing Kernel Hilbert Space (RKHS), which equals $H_k = \overline{\mathrm{span}}\{k(x, \cdot);\, x \in H\}$ with inner product
\[
\Bigl\langle \sum_{i\ge 1} a_i k(x_i, \cdot),\, \sum_{j\ge 1} b_j k(y_j, \cdot) \Bigr\rangle_{H_k} = \sum_{i\ge 1}\sum_{j\ge 1} a_i b_j k(x_i, y_j).
\]

For each $f \in H_k$ and $x \in H$ we have $f(x) = \langle f, k(\cdot, x)\rangle_{H_k}$, which is the reproducing property of the kernel $k$. A strictly positive definite kernel $k$ is said to be characteristic for a family of measures $\mathcal{P}$ if the map
\[
m : \mathcal{P} \to H_k, \qquad P \mapsto \int k(x, \cdot)\,P(dx)
\]
is injective. If $E_P\bigl(\sqrt{k(X, X)}\bigr) < \infty$, then $m_P(\cdot) := (m(P))(\cdot)$ exists in $H_k$ [20], and the function $m_P(\cdot) = \int k(x, \cdot)\,P(dx)$ is called the kernel mean function.

Moreover, for any $f \in H_k$ we have $E_P[f(X)] = \langle f, m_P\rangle_{H_k}$ [25]. Thus, if the kernel $k$ is characteristic, then every probability measure defined over $(H, \mathcal{B}(H))$ is uniquely identified by an element $m_P$ of $H_k$, and the Maximum Mean Discrepancy (MMD), defined as
\[
\mathrm{MMD}(H_k, P, Q) = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \left( \int f(x)\,P(dx) - \int f(x)\,Q(dx) \right) = \sup_{f \in H_k,\, \|f\|_{H_k} \le 1} \langle f, m_P - m_Q \rangle = \|m_P - m_Q\|_{H_k}, \tag{1}
\]
is a metric on the family of measures $\mathcal{P}$ over $H$ [20]. A similar quantity, called Ball divergence, was proposed by Pan et al. [22] to distinguish probability measures defined over separable Banach spaces. In the case of infinite-dimensional spaces, Ball divergence distinguishes two probability measures if at least one of them possesses full support, that is, $\mathrm{Supp}(P) = H$. They employed Ball divergence for a two-sample test; according to their simulation results, the performances of MMD and Ball divergence are close, and both are superior to other tests. Kernel mean functions can also be used to reflect the concentration of probability measures in small balls, provided the kernel function is translation-invariant. A positive definite kernel $k$ is called translation-invariant if $k(x, y) = \psi(x - y)$ for some positive definite function $\psi$. The Gaussian kernel $e^{-\sigma\|x-y\|_H^2}$ and the Laplace kernel $e^{-\sigma\|x-y\|_H}$ are such kernels. If we choose a continuous characteristic kernel that is bounded and translation-invariant, then the kernel mean function $m_P$ can be employed to represent the concentration of the probability measure at different points of the Hilbert space $H$. For example, consider
\[
m_P(x) = \int_H e^{-\sigma\|x-y\|_H^2}\,P(dy).
\]
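To fix ideas, both the kernel mean function and the MMD can be estimated directly from samples of discretized functions. The sketch below is a minimal NumPy illustration (the function names and toy data are ours, not part of the paper): it approximates the $L^2[0,1]$ norm on a uniform grid and computes the biased empirical MMD between two samples.

```python
import numpy as np

def gauss_gram(X, Y, sigma, dt):
    """Gaussian kernel k(x, y) = exp(-sigma * ||x - y||^2), with the
    squared L2[0,1] norm approximated on a uniform grid of spacing dt."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.exp(-sigma * (diff ** 2).sum(axis=2) * dt)

def empirical_mmd(X, Y, sigma, dt):
    """Biased empirical MMD ||m_P_hat - m_Q_hat||_{H_k} between two
    samples of functions (one function per row)."""
    kxx = gauss_gram(X, X, sigma, dt).mean()
    kyy = gauss_gram(Y, Y, sigma, dt).mean()
    kxy = gauss_gram(X, Y, sigma, dt).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
dt = t[1] - t[0]
X = np.sin(2 * np.pi * t) + rng.normal(0, 0.5, (50, t.size))  # sample from P
Y = rng.normal(0, 0.5, (40, t.size))                          # sample from Q
print(empirical_mmd(X, Y, sigma=1.0, dt=dt))
```

A large MMD value indicates that the two kernel mean functions, and hence the two underlying measures, differ.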

If $m_P(\cdot)$ has an explicit form for a family of probability measures, then $m_P(\cdot)$ can be employed to study and compare different probability measures. For example, if $m_P(x_1) > m_P(x_2)$, then it can be concluded that the concentration of the probability measure $P$ around the point $x_1$ is higher than around $x_2$; and if for two given probability measures $P_1$ and $P_2$ we have $m_{P_1}(x) > m_{P_2}(x)$, then we conclude that the concentration of $P_1$ around the point $x$ is higher than that of $P_2$. This property of kernel mean

functions makes them a good candidate for representing probability measures in infinite-dimensional spaces. The representation property of probability measures by kernel mean functions is addressed in the next theorem and corollary. Proofs are provided in the appendix.

Theorem 1. Let $P_1$ and $P_2$ be two probability measures on a separable Hilbert space $H$ over the field $\mathbb{R}$. Let $\psi : \mathbb{R}^+ \to [0, 1]$ be a bounded, continuous, strictly decreasing, and positive definite function, e.g. $\psi(t) = e^{-t^2}$, such that $k(x, y) = \psi(\|x - y\|_H)$ is a translation-invariant characteristic kernel, and let $m_{P_1}(\cdot)$ and $m_{P_2}(\cdot)$ be the kernel mean embeddings of $P_1$ and $P_2$, respectively, for the kernel $k(\cdot,\cdot)$. If $m_{P_2}(y) > m_{P_1}(y)$ for a given $y \in H$, then there exists an open ball $B_r(y)$ such that $P_2(B_r(y)) > P_1(B_r(y))$, where $r$ depends only on the difference $m_{P_2}(y) - m_{P_1}(y)$ and the characteristic kernel itself.

Corollary 2. Let $P$ be a probability measure on a separable Hilbert space $H$ over the field $\mathbb{R}$. Let $\psi : \mathbb{R}^+ \to [0, 1]$ be a bounded, continuous, strictly decreasing, and positive definite function, e.g. $\psi(t) = e^{-t^2}$, such that $k(x, y) = \psi(\|x - y\|_H)$ is a translation-invariant characteristic kernel, and let $m_P(\cdot)$ be the kernel mean embedding of $P$ for the kernel $k(\cdot,\cdot)$. If $m_P(y_2) > m_P(y_1)$ for some $y_1, y_2 \in H$, then there exist open balls of the same radius, $B_r(y_1)$ and $B_r(y_2)$, such that $P(B_r(y_2)) > P(B_r(y_1))$, where $r$ depends only on the difference $m_P(y_2) - m_P(y_1)$ and the characteristic kernel itself.

Kernel mean embedding of probability measures also has a connection with kernel scoring rules. Proper scoring rules are well-established instruments with applications in assessing probability models [7]. The following definition is borrowed from Steinwart and Ziegel [28] and adapted to our context. In the following definition, $c_{00}$ is the infinite-dimensional inner product space of eventually zero sequences (sequences with finitely many nonzero terms), which is a dense subspace of $\ell^2$.

Definition 3. Let $X$ be an arbitrary measurable space; here it may be considered to be either the separable Hilbert space $\ell^2$ or the separable inner product space $c_{00}$. Let $\mathcal{M}_1(X)$ be the space of probability measures on $X$. For $\mathcal{P} \subseteq \mathcal{M}_1(X)$, a scoring rule is defined as a function $S : \mathcal{P} \times X \to [-\infty, \infty]$ such that the integral $\int_X S(P, x)\,Q(dx)$ exists for all $P, Q \in \mathcal{P}$. The scoring rule is proper if
\[
\int_X S(P, x)\,P(dx) \le \int_X S(Q, x)\,P(dx), \qquad \forall P, Q \in \mathcal{P},
\]
and is called strictly proper if equality implies $P = Q$.

Kernel scores are a general class of proper scoring rules, in which the scoring rule is generated by a symmetric positive definite kernel $k : X \times X \to \mathbb{R}$ by
\[
S_k(P, x) := -\int k(\omega, x)\,P(d\omega) + \frac{1}{2}\iint k(\omega, \omega')\,P(d\omega)\,P(d\omega') = -m_P(x) + \frac{1}{2}\|m_P\|^2. \tag{2}
\]
The Maximum Mean Discrepancy distance between $P, Q \in \mathcal{P}$ satisfies
\[
\|m_P - m_Q\|_{H_k}^2 = 2\left( \int S_k(Q, x)\,P(dx) - \int S_k(P, x)\,P(dx) \right). \tag{3}
\]

If $k$ is bounded, then $\mathcal{P} = \mathcal{M}_1(X)$ [28]. In effect, a kernel scoring rule $S_k$ is strictly proper if and only if the kernel mean embedding is injective, i.e., $k$ is characteristic. There is a plethora of studies on different classes of characteristic kernels over locally compact spaces. For example, Steinwart [27] proved that the Gaussian kernel is characteristic on compact sets, Sriperumbudur et al. [26, Theorem 9] showed that the Gaussian kernel is characteristic on the whole space $\mathbb{R}^d$, and Simon-Gabriel and Schölkopf [24] studied the connection between various concepts of kernels such as universality, characteristicness, and positive definiteness. Given a separable Hilbert space $H$, any integrally strictly positive definite kernel is characteristic [26, Theorem 7]; however, it is not clear which kernels are integrally strictly positive definite over infinite-dimensional separable Hilbert spaces. To the best of our knowledge, there is no study on the existence and construction of characteristic kernels for infinite-dimensional spaces. The following two theorems, whose proofs are provided in the appendix, tackle this problem. In Theorem 4, the result of Steinwart and Ziegel [28, Theorem 3.14] is used to show the existence of a continuous characteristic kernel for infinite-dimensional separable Hilbert spaces, and Theorem 5 shows that the Gaussian kernel is characteristic for $c_{00}$, the infinite-dimensional inner product space of eventually zero sequences, which is dense in $\ell^2$.

Theorem 4. Let $H$ be an infinite-dimensional separable Hilbert space. There exists a continuous characteristic kernel on $H$.

Theorem 5. Let $c_{00}$ be the space of eventually zero sequences in $\mathbb{R}^\infty$. The Gaussian kernel defined as $k(x, y) = e^{-\sigma\|x-y\|_2^2}$ is characteristic on $c_{00}$.

Besides what is presented in Theorem 4 and Theorem 5, we show in Proposition 6 that the Gaussian kernel is characteristic for the family of Gaussian probability measures over $H$.

3 Maximum Kernel Mean Estimation

In the context of multivariate statistics, the density function is one of the most ubiquitous tools in statistical inference. A density is a non-negative function representing the amount of probability mass at a point, or the concentration of the probability measure in a very small neighborhood. Typically, a nominated family of probability measures is represented by the corresponding family of densities, and the aim is to choose from this family the density that is most likely to have generated a set of observations obtained by a probability-based survey sample. The aforementioned family of probability measures is usually parameterized by a parameter $\theta$ taking values in a subset $\Theta$ of a finite-dimensional or infinite-dimensional space. Suppose that $\{P_\theta, \theta \in \Theta\}$ is a nominated family of probability measures indexed by $\theta$. The idea behind MLE is as follows: suppose we randomly survey the population according to a sampling method, and the result is an observation $y$. If $\theta$ is unknown, an estimate of $\theta$ is one for which $P_\theta$ is the most likely generator of $y$. If the density function $f_\theta = dP_\theta/d\lambda$ exists, we seek a $\theta$ for which $f_\theta(y)$ is maximal. What makes a density function suitable for this kind of inference is the base measure $\lambda$ relative to which the density is defined. The counting measure and the Lebesgue measure are suitable options in finite-dimensional spaces. These base measures are positive, locally finite, and translation-invariant, and a nontrivial measure with these properties does not exist in an infinite-dimensional separable Hilbert space [6]. Employing the kernel mean function, we can introduce a rather similar idea to likelihood-based estimation in infinite-dimensional spaces. Suppose $k$ is a bounded, continuous, and translation-invariant characteristic kernel as described in Theorem 1, such as the Gaussian kernel $k(x, y) = e^{-\sigma\|x-y\|_H^2}$ for a fixed $\sigma > 0$. The kernel mean function $m_P(\cdot)$ is a bounded function over $H$ that reflects the concentration of $P$ in a small neighborhood of $y \in H$. Consider the family of probability measures $\mathcal{M} = \{P_\theta, \theta \in \Theta\}$ and its counterpart family of kernel mean functions $\{m_{P_\theta}(\cdot);\, \theta \in \Theta\}$. Assume that $\theta$ is unknown and the endeavor is to estimate it through an observed random sample $y$ from the population. Again the aim is to pick a $\hat\theta \in \Theta$ such that $P_{\hat\theta}$ is the most likely generator of $y$. Thus it is natural to estimate $\theta$ by $\hat\theta = \arg\sup_{\theta \in \Theta} m_{P_\theta}(y)$. In this section, we derive the kernel mean embedding of probability measures induced by functional response models, and we show how parameter estimation and hypothesis testing can be carried out in this framework.
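As an illustration of this estimation principle, $m_{P_\theta}(y)$ can be approximated by Monte Carlo as $\frac{1}{m}\sum_{j=1}^{m} k(X_j, y)$ with $X_j \sim P_\theta$, and $\hat\theta$ found by maximizing over a grid of candidate values. The toy sketch below is our own construction (the location family and all names are illustrative assumptions, not the authors' code) and recovers a location parameter in this way.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 101)
dt = t[1] - t[0]
SIGMA = 1.0

def kernel(x, y):
    """Gaussian kernel on L2[0,1], discretized on a uniform grid."""
    return np.exp(-SIGMA * np.sum((x - y) ** 2) * dt)

def kernel_mean_mc(theta, y, m=500):
    """Monte Carlo estimate of m_{P_theta}(y) for the toy location family
    X = theta * cos(pi t) + rough Gaussian noise."""
    X = theta * np.cos(np.pi * t) + rng.normal(0, 0.3, (m, t.size))
    return np.mean([kernel(x, y) for x in X])

# One observed function generated with true theta = 0.7.
y_obs = 0.7 * np.cos(np.pi * t) + rng.normal(0, 0.3, t.size)
grid = np.linspace(-2, 2, 81)
theta_hat = grid[np.argmax([kernel_mean_mc(th, y_obs) for th in grid])]
print(theta_hat)  # should land near 0.7
```

For the Gaussian families studied below, this Monte Carlo step is unnecessary because the kernel mean is available in closed form.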

3.1 Kernel Mean Embedding of Gaussian Probability Measures

The assumption of Gaussianity is prevalent and fundamental to many statistical problems in the context of functional data analysis, including functional response regression, functional one-way ANOVA, and testing for homogeneity of covariance operators. In this regard, it is desirable to study the kernel mean embedding of the Gaussian probability measures induced by this category of models. Let $H$ be an arbitrary infinite-dimensional separable Hilbert space. An $H$-valued random element $X$ is said to be a Gaussian random element with mean function $\mu$ and covariance operator $C$ if, for any $a \in H$, we have $\langle a, X\rangle \sim N(\langle a, \mu\rangle, \langle Ca, a\rangle)$. A Gaussian random element has a finite second moment, and its covariance operator is nuclear [18]. Let $(\lambda_i, \psi_i)_{i\ge 1}$ be the eigensystem of $C$, and let $H$ be a function space such as $L^2[0, 1]$; then the kernel of the integral operator $C$ admits the decomposition $k_C(s, t) = \sum_{j\ge 1}\lambda_j\psi_j(s)\psi_j(t) = \mathrm{Cov}[X(s), X(t)]$ [14]. The kernel mean function of the Gaussian family of probability measures with mean $\mu$ and covariance operator $C$, together with its uniqueness, is given in the following proposition, which is proved in the appendix.

Proposition 6. Let $Y \sim N(\mu, C)$, i.e., $\langle a, Y\rangle \sim N(\langle a, \mu\rangle, \langle Ca, a\rangle)$. Then for a Gaussian kernel,
\[
m_P(x) = \int_H e^{-\sigma\|x-y\|_H^2}\, N(\mu, C)(dy) = |I + 2\sigma C|^{-1/2}\, e^{-\sigma\langle (I + 2\sigma C)^{-1}(x - \mu),\, (x - \mu)\rangle},
\]
the kernel mean embedding is injective, and
\[
\|m_P\|_{H_k}^2 = |I + 4\sigma C|^{-1/2}.
\]
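Numerically, this closed form can be evaluated through a truncated eigensystem of $C$: the determinant $|I + 2\sigma C|$ becomes $\prod_j (1 + 2\sigma\lambda_j)$, and the quadratic form is computed in eigen-coordinates. The sketch below is our own discretized illustration of Proposition 6, under the assumption that components beyond the truncation level are negligible.

```python
import numpy as np

def gaussian_kernel_mean(x, mu, lam, psi, sigma, dt):
    """m_P(x) = |I + 2 sigma C|^{-1/2} *
    exp(-sigma <(I + 2 sigma C)^{-1}(x - mu), x - mu>),
    via a truncated eigensystem of C: lam[j] are eigenvalues, psi[j] is
    the j-th (L2-orthonormal, discretized) eigenfunction stored as a row."""
    scores = psi @ (x - mu) * dt                 # <x - mu, psi_j>
    det = np.prod(1.0 + 2.0 * sigma * lam)       # truncated Fredholm determinant
    quad = np.sum(scores ** 2 / (1.0 + 2.0 * sigma * lam))
    return det ** -0.5 * np.exp(-sigma * quad)

t = np.linspace(0, 1, 201)
dt = t[1] - t[0]
J = 20
lam = 1.5 * 0.5 ** np.arange(J)                  # e.g. lambda_j = a * rho^(j-1)
psi = np.vstack([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(J)])
mu = np.zeros_like(t)
# At x = mu the value reduces to |I + 2 sigma C|^{-1/2}.
print(gaussian_kernel_mean(mu, mu, lam, psi, sigma=1.0, dt=dt))
```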

Consider a random sample $Y_i \in L^2[0,1]$, $i = 1, \dots, n$, of independent and identically distributed random elements with distribution $N(\mu, C)$. By choosing a suitable characteristic kernel for the product space $(L^2[0,1])^n$, the kernel mean embedding of the probability measure induced by the random sample $\{Y_i\}$, i.e. $\otimes_{i=1}^n N(\mu, C)$, can be computed. Let $k$ be the Gaussian kernel; then, according to the following proposition, which is proved in the appendix, $\prod_{i=1}^n k(\cdot,\cdot)$ and $\sum_{i=1}^n k(\cdot,\cdot)$ are characteristic kernels for the family of product measures on $(L^2[0,1])^n$.

Proposition 7. Let $k(\cdot,\cdot)$ be a characteristic kernel defined over a separable Hilbert space $H$. Then the product-kernel
\[
k_P(\cdot,\cdot) : H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \prod_{i=1}^{n} k(x_i, y_i)
\]
and the sum-kernel
\[
k_S(\cdot,\cdot) : H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \sum_{i=1}^{n} k(x_i, y_i)
\]
are characteristic kernels for the family of product probability measures $\mathcal{P}^n = \{\otimes_{j=1}^n P_j \mid P_j \in \mathcal{P},\ j = 1, \dots, n\}$ on $H^n$.

For the case of the Gaussian product-kernel, given a simple random sample $Y_1, \dots, Y_n$ drawn from the Gaussian distribution, the kernel mean function is

\[
m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n) = \int e^{-\sigma \sum_{i=1}^{n} \|y_i - z_i\|^2}\; \otimes_{i=1}^n N(\mu, C)(dz_i),
\]
whose logarithm equals

\[
\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n) = \sum_{i=1}^{n}\sum_{j\ge 1} \frac{-\sigma}{1 + 2\sigma\lambda_j} \langle y_i - \mu, \psi_j\rangle^2 - \frac{n}{2}\sum_{j\ge 1}\log(1 + 2\sigma\lambda_j), \tag{4}
\]
where $y_1, \dots, y_n$ are the observation counterparts of $Y_1, \dots, Y_n$. Defining the sample mean function and the sample covariance operator as $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\hat C_Y = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar y)\otimes(y_i - \bar y)$, respectively, the logarithm of the kernel mean function for the Gaussian product-kernel also equals

\[
\log m_{\otimes_{i=1}^n N(\mu, C_Y)}(y_1, \dots, y_n) = \sum_{j\ge 1} \frac{-n\sigma}{1 + 2\sigma\lambda_j} \Bigl[ \langle \hat C_Y \psi_j, \psi_j\rangle + \langle \bar y - \mu, \psi_j\rangle^2 \Bigr] - \frac{n}{2}\sum_{j\ge 1}\log(1 + 2\sigma\lambda_j). \tag{5}
\]

We can see that the kernel mean function depends on $\{y_1, y_2, \dots, y_n\}$ only through $\bar y$ and $\hat C_Y$. Since the Gaussian product-kernel is characteristic, Equation (5) shows that for the family of Gaussian probability measures, $(\bar y, \hat C_Y)$ is a typical joint sufficient statistic for the parameters $(\mu, C_Y)$. The possibility of identifying sufficient statistics through kernel mean functions, alongside Theorem 1 and Corollary 2, reveals how the kernel mean embedding of a probability measure behaves akin to a density function over finite-dimensional spaces. The location and covariance parameters of the distribution can be estimated by maximizing $\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n)$. The resulting estimator differs slightly, in the weights of the components, from the estimators one may obtain by the small-ball probability approximation proposed by Delaigle and Hall [5] or by the OLS approach. As highlighted in Proposition 10, the Maximum Kernel Mean (MKM) estimator of the location parameters converges, as $\sigma$ tends to infinity, to the OLS estimator, which is also the limiting estimator obtained by the small-ball probability approach. It is also worth noting that although the small-ball probability approximation approach provides no estimator of the covariance parameters, MKM does. In the context of functional regression, as addressed in Section 3.2, we may substitute a linear model for $\mu$ in (4) and estimate the parameters of the model either by maximizing $m_P(y_1, \dots, y_n) - \frac{1}{2}\|m_P\|_{H_k}^2$ as in (2), or by maximizing only $\log m_{\otimes P}(y_1, \dots, y_n)$ for the location parameters, seeing as $\|m_P\|_{H_k}^2$ does not depend on the location parameters. The kernel mean approach also provides a rich toolbox of kernel methods developed by the machine learning community that can be used in statistical inference. To give just a few examples, we can name Kernel Bayes' Rule for Bayesian inference and latent variable modeling, Maximum Mean Discrepancy (MMD) for hypothesis testing and developing goodness-of-fit indices, and the Hilbert-Schmidt Independence Criterion (HSIC) for measuring dependency between random elements [see 20, 8, 13, 29]. In Section 4, MMD is used to derive new tests for three main problems in functional data analysis: function-on-scalar regression, one-way ANOVA, and testing for homogeneity of covariance operators. The power of these tests is studied and compared with competitors by simulation.
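The sufficiency observation above translates directly into computation: Equation (5) can be evaluated from a sample using only $\bar y$ and the diagonal of $\hat C_Y$ in the eigenbasis. The following sketch is our own illustration (it reuses a truncated eigensystem `(lam, psi)` as in the previous snippet; all names are assumptions).

```python
import numpy as np

def log_kernel_mean_product(Y, mu, lam, psi, sigma, dt):
    """Logarithm of the Gaussian product-kernel mean, Equation (5).
    Y: (n, T) array of discretized functions, one per row.
    lam, psi: truncated eigensystem of the covariance operator
    (psi has one L2-orthonormal eigenfunction per row)."""
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    # <C_hat psi_j, psi_j> = (1/n) sum_i <y_i - ybar, psi_j>^2
    cov_diag = np.mean(((Y - ybar) @ psi.T * dt) ** 2, axis=0)
    mean_sc = psi @ (ybar - mu) * dt             # <ybar - mu, psi_j>
    w = 1.0 + 2.0 * sigma * lam
    return (-n * sigma * np.sum((cov_diag + mean_sc ** 2) / w)
            - 0.5 * n * np.sum(np.log(w)))
```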

3.2 MKM Estimation of Parameters in Function-on-Scalar Regression

Let $Y$ be a Gaussian random element taking values in $H$. Given a random sample of $Y$, we can employ the kernel mean function to estimate the location and covariance parameters. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of $Y$ according to the following function-on-scalar regression model:

\[
Y_i(t) = x_i^T \beta(t) + \varepsilon_i(t), \qquad i = 1, \dots, n, \tag{6}
\]
where $x_i$ is the vector of scalar covariates and $\beta$ is the vector of $p$ functional parameters. The residual functions $\varepsilon_i$ are $n$ independent copies of a Gaussian random element with mean function zero and covariance operator $C$. The following two propositions can be employed to obtain the MKM estimates of the location and covariance parameters.

Proposition 8. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of model (6), where $\varepsilon_i$ is an $H$-valued Gaussian random element with mean function zero and covariance operator $C$. The MKM estimate of the functional regression parameters coincides with the ordinary least squares estimate.

Proof. By (4), the logarithm of the kernel mean function equals
\[
\log m_\beta(y_1, \dots, y_n) := \log m_{\otimes_{i=1}^n N(\mu_i, C)}(y_1, \dots, y_n) = -\sigma \sum_{i=1}^{n} \bigl\langle (I + 2\sigma C)^{-1}\bigl(y_i - x_i^T\beta\bigr),\, y_i - x_i^T\beta \bigr\rangle - \frac{n}{2}\sum_{j\ge 1}\log(1 + 2\sigma\lambda_j). \tag{7}
\]

The Fréchet derivative of (7) with respect to $\beta$ is an operator from $H^p$ to $\mathbb{R}$, i.e.,
\[
\frac{\partial}{\partial \beta}\log m_\beta(y_1, \dots, y_n) : H^p \to \mathbb{R}.
\]

$\hat\beta$ is a local extremum of $\log m_{\otimes_{i=1}^n N(\mu, C)}(y_1, \dots, y_n)$ if
\[
\Bigl(\frac{\partial}{\partial \beta}\log m_{\hat\beta}(y_1, \dots, y_n)\Bigr)(h) = 0 \qquad \forall h \in H^p.
\]

Taking the Fréchet derivative of (7) with respect to $\beta$, for an arbitrary $h \in H^p$ we have
\begin{align*}
\Bigl(\frac{\partial}{\partial \beta}\log m_\beta(y_1, \dots, y_n)\Bigr)(h) &= -\sigma\,\frac{\partial}{\partial \beta}\Bigl[\sum_{i=1}^{n} \bigl\langle (I + 2\sigma C)^{-1}\bigl(y_i - x_i^T\beta\bigr),\, y_i - x_i^T\beta \bigr\rangle\Bigr](h) \\
&= 2\sigma \sum_{i=1}^{n} \bigl\langle (I + 2\sigma C)^{-1} x_i^T h,\, y_i - x_i^T\beta \bigr\rangle \\
&= 2\sigma \sum_{k=1}^{p} \Bigl\langle h_k,\, (I + 2\sigma C)^{-1} \sum_{i=1}^{n} x_{ik}\bigl(y_i - x_i^T\beta\bigr) \Bigr\rangle.
\end{align*}

So if $\hat\beta$ is a local extremum of (7), for each $1 \le k \le p$ we must have
\[
\sum_{i=1}^{n} x_{ik}\bigl(y_i - x_i^T\beta\bigr) = 0,
\]
so $X^T(Y - X\beta) = 0$, and consequently $\hat\beta = (X^T X)^{-1} X^T Y$. The remaining question is whether $\hat\beta$ maximizes (7). Let $\beta = \hat\beta + \nu$; then

\[
\log m_\beta(y_1, \dots, y_n) = \log m_{\hat\beta}(y_1, \dots, y_n) - \sigma \sum_{i=1}^{n} \bigl\langle (I + 2\sigma C)^{-1}\bigl(x_i^T\nu\bigr),\, x_i^T\nu \bigr\rangle \le \log m_{\hat\beta}(y_1, \dots, y_n),
\]
which completes the proof.

From the last proposition, $\hat\beta = (X^T X)^{-1} X^T Y$ is the MKM estimator of the functional regression coefficients. It is also possible to derive a restricted MKM estimate of the covariance operator by an approach similar to restricted ML. Let $A = [u_1, \dots, u_{n-k}]$ contain the first $n - k$ eigenvectors of $I - X(X^T X)^{-1} X^T$, and let $Y = [Y_i]_{i=1,\dots,n}$ be the $n \times 1$ matrix of response functions. Then $Y_i^* = u_i^T Y$ is called an error contrast, and $Y_1^*, \dots, Y_{n-k}^*$ is a sequence of $n - k$ independent and identically distributed random elements with mean function zero and common covariance operator $C$. We can then use the sequence $y_1^*, \dots, y_{n-k}^*$ and employ Proposition 9 to estimate the covariance operator by $\hat C = \frac{1}{n-k}\sum_{i=1}^{n-k} y_i^* \otimes y_i^*$.
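The estimators above are straightforward to compute on discretized data. The sketch below is our own illustration (names and the simulated design are assumptions): it fits $\hat\beta = (X^TX)^{-1}X^TY$ and builds the restricted covariance estimate from error contrasts.

```python
import numpy as np

def mkm_fit(X, Y):
    """MKM (= OLS) fit of Y_i = x_i^T beta + eps_i for functional responses.
    X: (n, p) design matrix; Y: (n, T) functions on a common grid.
    Returns beta_hat (p, T) and the restricted estimate of the
    covariance kernel C(s, t) built from error contrasts."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # (X'X)^{-1} X'Y
    # Error contrasts: eigenvectors of I - X (X'X)^{-1} X' with eigenvalue 1.
    P = X @ np.linalg.solve(X.T @ X, X.T)
    eigval, eigvec = np.linalg.eigh(np.eye(n) - P)
    A = eigvec[:, np.isclose(eigval, 1.0)]           # n x (n - p)
    Ystar = A.T @ Y                                  # (n - p) contrast functions
    C_hat = Ystar.T @ Ystar / Ystar.shape[0]         # covariance kernel on the grid
    return beta_hat, C_hat

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 101)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = (np.outer(X[:, 0], 2 * t) + np.outer(X[:, 1], -np.cos(np.pi * t))
     + rng.normal(0, 0.2, (n, t.size)))
beta_hat, C_hat = mkm_fit(X, Y)
```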

Proposition 9. Let $y_1, y_2, \dots, y_n$ be $n$ independent realizations of an $H$-valued Gaussian random element with mean function zero and covariance operator $C = \sum_{j\ge 1}\lambda_j \psi_j \otimes \psi_j$. Let $\hat C = \frac{1}{n}\sum_{i=1}^{n} y_i \otimes y_i$. Then, as $\sigma \to \infty$, the MKM estimators of $\{\lambda_j, \psi_j\}_{j\ge 1}$ converge to

1. $\hat\psi_k$ = the $k$'th eigenfunction of $\hat C$;

2. $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^{n} \langle y_i, \hat\psi_k\rangle^2$.

Proof. The logarithm of the kernel mean function of the product measure $\otimes_{i=1}^n N(\mu, C)$ is given in (4) with $\mu = 0$. The parameter estimates are obtained by taking the Fréchet derivative of the kernel mean function with respect to $\psi_k$ and the usual derivative with respect to $\lambda_k$. In each case, it is shown that the local extremum is the global maximum of the kernel mean function.

1) $\psi_k$: First we obtain the estimate of $\psi_1$. Taking the Fréchet derivative of the kernel mean function with respect to $\psi_1$, we have
\begin{align*}
\Bigl(\frac{\partial}{\partial \psi_1}\log m_{\otimes N}(y_1, \dots, y_n)\Bigr)(h) &= \frac{\partial}{\partial \psi_1}\Bigl[\sum_{i=1}^{n}\sum_{j\ge 1} \frac{-\sigma}{1 + 2\sigma\lambda_j}\langle y_i, \psi_j\rangle^2\Bigr](h) \\
&= \frac{-2\sigma}{1 + 2\sigma\lambda_1}\sum_{i=1}^{n}\langle y_i, \psi_1\rangle \langle y_i, h\rangle = \frac{-2\sigma n}{1 + 2\sigma\lambda_1}\bigl\langle \hat C\psi_1, h\bigr\rangle.
\end{align*}
Considering that $\psi_1$ lies on the sphere of radius 1, $\tilde\psi_1$ is an extremum point of $\log m_{\otimes N}(y_1, \dots, y_n)$ if $\langle \hat C\tilde\psi_1, h\rangle = 0$ for any arbitrary $h$ in the tangent space of the unit sphere at the point $\tilde\psi_1$, i.e.,

\[
\forall h \in \{\tilde\psi_1\}^\perp \;\Rightarrow\; \bigl\langle \hat C\tilde\psi_1, h\bigr\rangle = 0. \tag{8}
\]

In addition, for identifiability, $\tilde\psi_1$ must be associated with the largest eigenvalue of $\hat C$. This way, the MKM estimate of $\psi_1$ is the solution to the following optimization problem:

\[
\hat\psi_1 = \arg\max_{\tilde\psi \in H} \frac{1}{n}\sum_{i=1}^{n} \bigl\langle y_i, \tilde\psi\bigr\rangle^2 \quad \text{s.t.} \quad \bigl\langle \hat C\tilde\psi, h\bigr\rangle = 0 \;\; \forall h \in \{\tilde\psi\}^\perp, \tag{9}
\]
from which it immediately follows that the MKM estimate of $\psi_1$ is the first eigenfunction of $\hat C$, independently of the kernel parameter $\sigma$. The remaining question is whether $\hat\psi_1$ maximizes (4). Consider that for any arbitrary $h \in \{\hat\psi_1\}^\perp$ and $\tilde\psi_1 = \frac{\hat\psi_1 + h}{\|\hat\psi_1 + h\|}$,
\[
\log m_{\tilde\psi_1}(y_1, \dots, y_n) = \log m_{\hat\psi_1}(y_1, \dots, y_n) - \frac{n\sigma}{(1 + 2\sigma\lambda_1)\,\|\hat\psi_1 + h\|^2}\bigl\langle \hat Ch, h\bigr\rangle \le \log m_{\hat\psi_1}(y_1, \dots, y_n).
\]

So $\hat\psi_1$ is the MKM estimate of $\psi_1$. For the MKM estimate of $\psi_k$, $k \ge 2$, the constraint
\[
\bigl\langle \hat\psi_k, \hat\psi_j\bigr\rangle = 0 \qquad \forall\, 1 \le j < k
\]
should be added to the optimization problem (9), which shows that the MKM estimates of the eigenfunctions coincide with the eigenfunctions of $\hat C$.

2) $\lambda_k$: Taking the derivative of the kernel mean function with respect to $\lambda_k$ yields
\begin{align*}
\frac{\partial}{\partial \lambda_k}\log m_{\otimes N}(y_1, \dots, y_n) &= \frac{\partial}{\partial \lambda_k}\Bigl[\sum_{i=1}^{n}\sum_{j\ge 1} \frac{-\sigma}{1 + 2\sigma\lambda_j}\langle y_i, \psi_j\rangle^2 - \frac{n}{2}\sum_{j\ge 1}\log(1 + 2\sigma\lambda_j)\Bigr] \\
&= \frac{2\sigma^2}{(1 + 2\sigma\lambda_k)^2}\sum_{i=1}^{n}\langle y_i, \psi_k\rangle^2 - \frac{n}{2}\,\frac{2\sigma}{1 + 2\sigma\lambda_k} \\
&= \frac{\sigma}{(1 + 2\sigma\lambda_k)^2}\Bigl[2\sigma\sum_{i=1}^{n}\langle y_i, \psi_k\rangle^2 - n(1 + 2\sigma\lambda_k)\Bigr]. \tag{10}
\end{align*}
Equating (10) to zero, and given $\psi_k$, the value of $\lambda_k$ that maximizes (4) with $\mu = 0$ is $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^{n}\langle y_i, \psi_k\rangle^2 - \frac{1}{2\sigma}$. Consequently, taking $\hat\psi_k$ to be the $k$'th eigenfunction of $\hat C$, we obtain $\hat\lambda_k = \frac{1}{n}\sum_{i=1}^{n}\langle y_i, \hat\psi_k\rangle^2 - \frac{1}{2\sigma}$, which is a biased estimator of $\lambda_k$ and converges to $\frac{1}{n}\sum_{i=1}^{n}\langle y_i, \hat\psi_k\rangle^2$ as $\sigma$ tends to infinity.

In the following proposition, we obtain an estimate of the functional regression coefficients of model (6) by employing the small-ball (SB) probability approximation.

Proposition 10. In the case of function-on-scalar regression with the assumption of normality as in model (6), the MKM (equivalently, OLS) estimator of the location parameters is the same as the limit, as the ball radius tends to zero, of the estimator obtained by the small-ball probability approximation proposed by Delaigle and Hall [5].

Proof. Let $Y_1, \dots, Y_n$ be a simple random sample generated by model (6), where the $\varepsilon_i$ are $n$ independent copies of an $H$-valued Gaussian random element with mean function zero and covariance operator $C = \sum_{j\ge 1}\lambda_j\psi_j\otimes\psi_j$. Let the functions $\beta_k$ admit the Fourier decompositions $\beta_k = \sum_{j\ge 1}\theta_{kj}\psi_j$, and define $\theta_j = (\theta_{kj})_{k=1,\dots,p}$, $x_i = (x_{ik})_{k=1,\dots,p}$, and $\beta = (\beta_k)_{k=1,\dots,p}$. The identity $\varepsilon_i = Y_i - x_i^T\beta \overset{iid}{\sim} N(0, C)$ is equivalent to the component scores $\langle \varepsilon_i, \psi_j\rangle/\sqrt{\lambda_j}$ being independent and identically distributed according to the standard normal distribution for each $j \in \mathbb{N}$. Fix $r > 0$ and let $h = \arg\max_j \{r^2 \le \lambda_j\}$. By the method of Delaigle and Hall [5], the log-density with radius $r$ equals
\[
\ln P(\varepsilon_i \mid r) \propto \sum_{j=1}^{h}\ln f_j\bigl(\sqrt{\lambda_j}\,\langle \varepsilon_i, \psi_j\rangle\bigr) \propto -\sum_{j=1}^{h}\frac{\lambda_j}{2}\bigl(\langle Y_i, \psi_j\rangle - x_i^T\theta_j\bigr)^2, \tag{11}
\]
and thus, up to an additive constant not involving $\theta_j$,
\[
\sum_{i=1}^{n}\ln P(\varepsilon_i \mid r) \propto -\sum_{i=1}^{n}\sum_{j=1}^{h}\frac{\lambda_j}{2}\bigl(\langle Y_i, \psi_j\rangle - x_i^T\theta_j\bigr)^2 = \sum_{j=1}^{h}\lambda_j\Bigl(-\frac{1}{2}\theta_j^T(X^TX)\theta_j + B_j^T\theta_j\Bigr),
\]
in which $B_j = \sum_{i=1}^{n}\langle Y_i, \psi_j\rangle x_i$, $X$ is the $n \times p$ model matrix, and $Y$ is an $n \times 1$ column vector containing the functions $Y_i(\cdot)$. The estimate of $\theta_j$ can be obtained by solving the equation $\frac{\partial}{\partial\theta_j}\sum_{i=1}^{n}\ln P(\varepsilon_i \mid r) = 0$; thus
\[
\hat\theta_j = (X^TX)^{-1}B_j.
\]
For a given $r > 0$, or its coupled quantity $h \in \mathbb{N}$, the estimate of $\beta$ is
\begin{align*}
\hat\beta^r = \Bigl[\sum_{j=1}^{h}\hat\theta_{kj}\psi_j\Bigr]_{k=1,\dots,p} &= (X^TX)^{-1}\Bigl[\sum_{i=1}^{n}\sum_{j=1}^{h} x_{ik}\langle Y_i, \psi_j\rangle\psi_j\Bigr]_{k=1,\dots,p} \\
&= (X^TX)^{-1}X^T\Bigl[\sum_{j=1}^{h}\langle Y_i, \psi_j\rangle\psi_j\Bigr].
\end{align*}
Considering the limit of $\hat\beta^r$ as $r$ tends to zero, the limiting estimate ends up being $\lim_{r\to 0}\hat\beta^r = (X^TX)^{-1}X^TY$, which is the OLS estimate of $\beta$.

4 Applications

In the context of machine learning, the Maximum Mean Discrepancy (MMD) is a useful vehicle for hypothesis testing. If the kernel $k$ is characteristic, then MMD is a metric on the space of probability measures. The distance induced by this metric can be employed to derive statistical tests that are otherwise hard to handle in the context of functional data analysis, especially simultaneous hypothesis tests such as one-way ANOVA and testing for equality of covariance operators in more than two groups. To develop the new tests, the probability measures induced by the null and alternative hypotheses are embedded in a Hilbert space and their distance is computed by the MMD. Kernel-based methods such as kernel mean and covariance embeddings of probability measures have a wide range of applications in analyzing structured and non-structured data and also in developing non-parametric tests in finite-dimensional spaces, such as testing for homogeneity of location and variance parameters, change-point detection, and tests of independence; see for example Harchaoui et al. [13], Gretton et al. [8], and Tang et al. [29], among others. In this section, we employ MMD to develop new tests for three major problems in the context of functional response models: function-on-scalar regression, functional one-way ANOVA, and testing for equality of covariance operators. The performance of the new tests is compared to some state-of-the-art methods. Before proceeding, it should be noted that the methods developed in this section assume that the sampled random functions are observed completely. In practice, however, a function is observed only on a sparse or dense subset of its domain. Accordingly, at the first stage of the analysis, we may apply a smoothing procedure to construct the functions. To scrutinize the effect of smoothing on the results, the first simulation in this section was run with different numbers of points sampled per curve. The results appear acceptable despite using smoothed functions rather than completely observed functions.

4.1 Hypothesis Testing in the Function-on-Scalar Regression Model

Let H be the space of square-integrable functions L2[0, 1], and consider the following simple Function-on-Scalar regression problem:

\[
y_i(t) = \alpha(t) + x_i\beta(t) + \varepsilon_i(t), \qquad \varepsilon_i \overset{iid}{\sim} N(0, C); \quad i = 1, \dots, n, \; t \in [0, 1], \tag{12}
\]
where $N(0, C)$ is the Gaussian distribution over the space $L^2[0, 1]$ with mean function zero and covariance operator $C$. In Proposition 8 it is shown that the maximum kernel mean estimates of the intercept and slope functions coincide with the OLS estimates. To assess the uncertainty of the estimates and to test $H_0 : \beta = 0$, we run a simulation study proposed by Kokoszka and Reimherr [16] to compare the type-I error and power of the new MMD-based test with currently developed tests. Let $\alpha(t) = 2t$ and $\beta(t) = -c_0\cos(\pi t)$, where the parameter $c_0$ is used to switch between the null and alternative hypotheses. For the covariance operator $C$ in (12) we use the Matérn family of covariance operators, once with an infinite smoothness parameter and once with the smoothness parameter set to $1/2$:
\[
C_\infty(s, t) = \exp\Bigl\{-\frac{|s - t|^2}{\rho}\Bigr\} \quad \text{and} \quad C_{1/2}(s, t) = \exp\Bigl\{-\frac{|s - t|}{\rho}\Bigr\}.
\]

The kernel function $C_\infty$ is referred to as the squared-exponential covariance function and $C_{1/2}$ as the exponential covariance function. To test $H_0 : \beta = 0$, we devise a new test using the MMD statistic, employing the Gaussian kernel as a characteristic kernel on $H^n$. We use either the Gaussian sum-kernel
\[
k(\cdot,\cdot) : H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \sum_{i=1}^{n} e^{-\sigma\|x_i - y_i\|_H^2},
\]
or the Gaussian product-kernel
\[
k(\cdot,\cdot) : H^n \times H^n \to \mathbb{R}, \qquad (\{x_i\}, \{y_i\}) \mapsto \prod_{i=1}^{n} e^{-\sigma\|x_i - y_i\|_H^2}.
\]

Let $P_0$ be the probability measure induced by model (12) under the null hypothesis, and let $P_1$ be the probability measure induced when the parameter $\beta$ is considered free. Let $\hat C_0$ and $\hat\alpha_0$ be the estimates of the covariance and intercept functions under the null hypothesis, and let $\hat C_1$, $\hat\alpha_1$, and $\hat\beta_1$ be the estimates of the covariance, intercept, and slope functions under the alternative hypothesis, as described in Proposition 8 and Proposition 9. The plug-in estimators $\hat P_0$ and $\hat P_1$ are $\prod_{i=1}^{n} N(\hat\alpha_0, \hat C_0)$ and $\prod_{i=1}^{n} N(\hat\alpha_1 + x_i\hat\beta_1, \hat C_1)$, respectively. The MMD statistic can then be defined as the maximum mean discrepancy distance between $\hat P_0$ and $\hat P_1$. Using the Gaussian sum-kernel, the MMD statistic equals
\begin{align*}
\mathrm{MMD}_S &= \bigl\| m_{\hat P_0} - m_{\hat P_1} \bigr\|_{H_k} = \Bigl( \|m_{\hat P_0}\|_{H_k}^2 + \|m_{\hat P_1}\|_{H_k}^2 - 2\langle m_{\hat P_0}, m_{\hat P_1}\rangle_{H_k} \Bigr)^{1/2} \\
&= \Biggl[ n\,\bigl|I + 4\sigma\hat C_0\bigr|^{-1/2} + n\,\bigl|I + 4\sigma\hat C_1\bigr|^{-1/2} \\
&\qquad - 2\,\bigl|I + 2\sigma(\hat C_0 + \hat C_1)\bigr|^{-1/2} \sum_{i=1}^{n} e^{-\sigma\langle (I + 2\sigma(\hat C_0 + \hat C_1))^{-1}(\hat\alpha_0 - \hat\alpha_1 - x_i\hat\beta_1),\, (\hat\alpha_0 - \hat\alpha_1 - x_i\hat\beta_1)\rangle} \Biggr]^{1/2},
\end{align*}
and the Gaussian product-kernel yields
\begin{align*}
\mathrm{MMD}_P &= \bigl\| m_{\hat P_0} - m_{\hat P_1} \bigr\|_{H_k} = \Bigl( \|m_{\hat P_0}\|_{H_k}^2 + \|m_{\hat P_1}\|_{H_k}^2 - 2\langle m_{\hat P_0}, m_{\hat P_1}\rangle_{H_k} \Bigr)^{1/2} \\
&= \Biggl[ \bigl|I + 4\sigma\hat C_0\bigr|^{-n/2} + \bigl|I + 4\sigma\hat C_1\bigr|^{-n/2} \\
&\qquad - 2\,\bigl|I + 2\sigma(\hat C_0 + \hat C_1)\bigr|^{-n/2}\, e^{-\sigma\sum_{i=1}^{n}\langle (I + 2\sigma(\hat C_0 + \hat C_1))^{-1}(\hat\alpha_0 - \hat\alpha_1 - x_i\hat\beta_1),\, (\hat\alpha_0 - \hat\alpha_1 - x_i\hat\beta_1)\rangle} \Biggr]^{1/2}.
\end{align*}
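On a grid, the covariance estimates act as integral operators whose eigenvalues follow from the covariance kernel matrices, so the determinant factors $|I + c\,\hat C|^{-1/2}$ and the quadratic forms are all computable. The sketch below is our own grid-level approximation of $\mathrm{MMD}_S^2$ (function names are ours); as in the paper, the null distribution would be approximated by random permutation.

```python
import numpy as np

def op_logdet(K, c, dt):
    """log |I + c C| for the integral operator whose kernel matrix on a
    uniform grid is K; the operator eigenvalues are eig(K) * dt."""
    lam = np.clip(np.linalg.eigvalsh(K) * dt, 0.0, None)
    return np.sum(np.log1p(c * lam))

def mmd_s_sq(C0, C1, deltas, sigma, dt):
    """Squared Gaussian sum-kernel MMD between the fitted null and
    alternative models. deltas: (n, T) array whose i-th row is
    alpha0_hat - alpha1_hat - x_i * beta1_hat on the grid."""
    n, T = deltas.shape
    M = np.eye(T) + 2 * sigma * (C0 + C1) * dt  # grid version of I + 2 sigma (C0 + C1)
    quad = np.einsum('it,it->i', deltas, np.linalg.solve(M, deltas.T).T) * dt
    cross = (np.exp(-0.5 * op_logdet(C0 + C1, 2 * sigma, dt))
             * np.exp(-sigma * quad).sum())
    return (n * np.exp(-0.5 * op_logdet(C0, 4 * sigma, dt))
            + n * np.exp(-0.5 * op_logdet(C1, 4 * sigma, dt))
            - 2 * cross)
```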

The significance level is set at 0.05, and the type-I error and power of the test are compared with the test proposed by Greven et al. [9], implemented in the pffr function of the refund package. The rejection rates for different numbers of points sampled per curve ($m$) and two different covariance operators were computed using a Monte Carlo simulation study, and the results are presented in Tables 1 and 2. Note that the distribution of the MMD statistic under the null hypothesis is approximated using the random permutation method. It can be seen that the type-I error of pffr is inflated above the significance level, and this problem worsens as the number of sampling points per curve increases. Kokoszka and Reimherr [16] proposed a partial fix: they suggest ignoring the uncertainty estimates from pffr and instead combining the residual functions estimated by pffr with a classic estimate of uncertainty. To this end, let $\hat\beta_1(t)$ and $\hat\varepsilon_i(t)$ be the slope function and residual functions estimated by pffr, respectively. Then an estimate of the uncertainty of $\hat\beta_1(t)$ is
\[
\mathrm{Cov}\bigl(\hat\beta_1(s), \hat\beta_1(t)\bigr) = \mathrm{Cov}\bigl(\varepsilon(s), \varepsilon(t)\bigr)\bigl[(X^TX)^{-1}\bigr]_{(2,2)} \approx \Bigl( n\sum_{i=1}^{n}(x_i - \bar x)^2 \Bigr)^{-1} \sum_{i=1}^{n}\hat\varepsilon_i(s)\hat\varepsilon_i(t), \tag{13}
\]
in which $X$ is an $n \times 2$ design matrix with a vector of ones in the first column and the vector of scalar covariates $(x_i)$ in the second column. A variety of hypothesis tests can be run by plugging $\hat\beta_1(t)$ obtained from pffr and the uncertainty estimate (13) into the fregion.test function of the fregion package. Here we compare the proposed test with two existing ones: one based on the $L^2[0, 1]$ norm, and one based on the hyper-ellipsoid confidence regions proposed by Choi and Reimherr [3]. The simulation results are reported in Tables 1 and 2. The MMD tests have type-I errors close to the significance level, and their powers are superior to those of the norm and hyper-ellipse tests. In all situations where the number of sampling points per curve is moderate or high ($m = 20$ or $m = 50$), MMD$_S$ outperforms MMD$_P$, though in the case of the exponential covariance function, where the residual functions are less smooth, MMD$_P$ has higher power than MMD$_S$ when the number of sampling points is small ($m = 10$).

The performance of kernel methods depends on the choice of the kernel parameters. As shown in the proof of Theorem 1, the precision of the kernel mean function in representing probability measures depends on the kernel bandwidth. A large enough kernel parameter $\sigma$ works well in our simulation studies. In this simulation study, the kernel parameter was set to $\sigma = 5\mathrm{e}4$ for the MMD$_S$ test and $\sigma = 5\mathrm{e}1$ for the MMD$_P$ test. A Fourier basis was used in the smoothing procedure, with the number of components fixed at 41. The results presented in Tables 1 and 2 are based on 5000 iterations.

Table 1: Type-I errors and empirical powers (in percent) for $H_0 : \beta = 0$ with the squared-exponential covariance function.

                      n = 30                     n = 70
        c0:       0     0.2   0.4   0.6      0     0.1   0.2   0.3
m = 10  pffr     50.3  71.4  95.7  99.8     51.3  63.2  89.8  99.0
        Norm      5.2  10.7  44.3  88.2      4.7   8.4  24.8  61.6
        Ellipse   3.7  20.5  76.2  97.5      2.1  11.0  51.1  86.4
        MMDP      4.8  23.3  77.2  97.5      4.4  16.1  59.0  89.9
        MMDS      3.6  27.4  79.7  97.6      4.8  20.6  61.9  90.8
m = 20  pffr     69.6  86.6  98.5  99.9     66.7  80.6  96.8  99.7
        Norm      5.1  10.6  45.8  87.6      2.7   8.5  23.4  60.4
        Ellipse   4.6  25.6  73.7  98.6      2.5  14.3  49.1  87.0
        MMDP      5.1  23.8  74.1  98.7      4.5  17.7  52.8  88.2
        MMDS      5.5  26.9  75.4  98.7      4.8  22.2  62.5  92.0
m = 50  pffr     87.0  96.6  100   100      85.0  93.2  99.1  100
        Norm      5.2  14.2  47.4  86.4      7.2   8.6  26.0  62.3
        Ellipse   5.1  31.3  81.5  98.1      3.4  14.5  58.2  90.6
        MMDP      4.0  25.4  76.5  97.3      4.5  15.9  56.5  90.5
        MMDS      4.0  28.8  79.6  98.0      4.8  20.4  65.2  93.3

Table 2: Type-I errors and empirical powers (in percent) for $H_0 : \beta = 0$ with the exponential covariance function.

                      n = 30                     n = 70
        c0:       0     0.2   0.4   0.6      0     0.1   0.2   0.3
m = 10  pffr     43.9  68.2  95.9  100      43.0  57.8  85.9  98.3
        Norm      3.0  11.9  55.5  91.1      2.7   8.5  29.4  70.9
        Ellipse   0.7  10.8  56.4  92.0      0.1   1.5  18.7  60.0
        MMDP      5.1  15.8  58.1  91.1      4.7  12.4  41.65 78.2
        MMDS      5.0   8.5  23.9  52.2      4.6   5.5   9.3  15.3
m = 20  pffr     67.6  86.3  98.7  100      63.6  79.5  96.1  99.8
        Norm      3.3  11.9  54.2  92.2      3.8   7.4  29.5  71.7
        Ellipse   1.1  10.7  54.3  93.0      0.1   0.1   6.4  40.9
        MMDP      4.6  18.2  61.3  93.2      5.3  12.5  43.3  82.5
        MMDS      4.6  20.8  61.1  91.9      5.4  17.4  54.7  87.9
m = 50  pffr     86.1  95.3  99.8  100      87.0  93.8  99.3  100
        Norm      4.6  12.0  54.6  91.4      4.6   8.2  32.0  73.5
        Ellipse   0.5   6.8  41.5  83.9      0.1   0.1   3.1  23.3
        MMDP      4.4  18.7  61.6  92.6      4.9  11.7  44.9  82.3
        MMDS      3.8  21.2  66.2  93.5      5.1  19.9  61.5  91.6

4.2 Functional One-way ANOVA

The one-way ANOVA is a fundamental problem in statistical inference. Assume that $H$ is a separable Hilbert space, $Y_{ij}$ for $i = 1, \dots, k$ and $j = 1, \dots, n_i$ are independent random samples taking values in $H$, and $y_{ij}$ are their observation counterparts. For a typical functional one-way ANOVA problem with the assumption of homogeneity of covariance operators, the random elements $Y_{ij}$ are assumed to be generated according to the following model:

\[
Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} N(0, C); \quad i = 1, \dots, k, \; j = 1, \dots, n_i, \tag{14}
\]
where $\mu_i$ is the mean function of the $i$'th group and the covariance operator is equal across the $k$ groups: $C = E[\varepsilon_{ij}\otimes\varepsilon_{ij}]$ for all $i = 1, \dots, k$ and $j = 1, \dots, n_i$. It is of interest to test the equality of the $k$ mean functions, i.e., $H_0 : \mu_1 = \dots = \mu_k$. Let $P_1$ be the probability measure of samples generated by model (14) under $H_0$, and let $P_2$ be the probability measure induced by model (14) under the alternative hypothesis, that is, when the parameters $\mu_i$ are considered free. With the assumption of homogeneity of covariance operators, the squared MMD with the Gaussian product-kernel equals

\begin{align*}
\mathrm{MMD}^2 &= \|m_{P_1} - m_{P_2}\|_{H_k}^2 \\
&= |I + 4\sigma C|^{-n/2} + |I + 4\sigma C|^{-n/2} - 2\,|I + 4\sigma C|^{-n/2}\, e^{-\sigma\sum_{i=1}^{k} n_i \langle (I + 4\sigma C)^{-1}(\mu_i - \mu),\, (\mu_i - \mu)\rangle} \\
&= 2\,|I + 4\sigma C|^{-n/2}\Bigl( 1 - e^{-\sigma\sum_{i=1}^{k} n_i \langle (I + 4\sigma C)^{-1}(\mu_i - \mu),\, (\mu_i - \mu)\rangle} \Bigr).
\end{align*}

The covariance operator is assumed to be equal across the groups, so the new MMD test statistic can be simplified to

\[
\mathrm{MMD}_0 = \sum_{i=1}^{k} n_i \bigl\langle (I + 4\sigma C)^{-1}(\mu_i - \mu),\, (\mu_i - \mu) \bigr\rangle = \sum_{i=1}^{k} n_i \bigl\| (I + 4\sigma C)^{-1/2}(\mu_i - \mu) \bigr\|_H^2. \tag{15}
\]

Accordingly, the new test statistic is a weighted sum of the distances of the group mean functions $\mu_i$ from the total mean function $\mu$. By plugging the usual estimates of the mean functions (which are also the MKM estimates) into (15), the new test statistic becomes
\[
\widehat{\mathrm{MMD}}_0 = \sum_{i=1}^{k} n_i \bigl\langle (I + 4\sigma\hat C)^{-1}(\hat\mu_i - \hat\mu),\, (\hat\mu_i - \hat\mu) \bigr\rangle = \sum_{i=1}^{k} n_i \bigl\| (I + 4\sigma\hat C)^{-1/2}(\hat\mu_i - \hat\mu) \bigr\|_H^2.
\]
This test statistic is similar to the kernel Fisher discriminant analysis (KFDA) test statistic proposed by Harchaoui et al. [12] for a two-sample kernel-based non-parametric test when the underlying space is finite-dimensional. Let $H$ again be the space of square-integrable functions $L^2[0, 1]$. Motivated by Zhang et al. [33], a simulation study was run to evaluate the performance of the new MMD test against four competitors developed for the space of square-integrable functions: an $L^2$-norm based test proposed in Zhang and Chen [31], an $F$-type test proposed by Shen and Faraway [23], a Global Pointwise $F$-test (GPF) offered in Zhang and Liang [32], and the $F_{\max}$ test developed and proposed by Zhang et al. [33]. We used the same setup as Zhang et al. [33] for the data generating procedure in our simulation study. Hence it is assumed that the functional samples in (14) are generated from the following one-way ANOVA model:

\[
y_{ij}(t) = \mu_i(t) + \varepsilon_{ij}(t), \qquad \mu_i(t) = c_i^T\bigl(1, t, t^2, t^3\bigr)^T, \qquad \varepsilon_{ij}(t) = \sum_{r\ge 1}\sqrt{\lambda_r}\, z_{ijr}\psi_r(t), \tag{16}
\]
for $i = 1, \dots, k$, $j = 1, \dots, n_i$, $t \in [0, 1]$. While our method works without any restriction on the number of components, we follow the same setup as Zhang et al. [33] and assume a finite number $q$ of nonzero eigenvalues in (16). The parameter $n_i$ denotes the size of each group, and $\{\psi_r\}$ is the set of eigenfunctions. For all $i, j$, the design time points are balanced and equally spaced; thus all sampled curves are measured on the common grid of time points $t_j = j/(T + 1)$, $j = 1, \dots, T$. The eigenvalues are assumed to follow the pattern $\lambda_r = a\rho^{r-1}$ for fixed $a > 0$ and $\rho \in (0, 1)$. The tuning parameter $\rho$ determines the decay rate of the eigenvalues: for $\rho$ close to zero (resp. close to one), the eigenvalues decay fast (resp. slowly) and the residual functions are more (resp. less) smooth. We put $c_1 = [1, 2.3, 3.4, 1.5]^T$ and $u = [1, 2, 3, 4]^T/\sqrt{30}$. The vector $c_i = c_1 + (i - 1)\delta u$, for different values of $\delta$, represents the mean functions of the $k$ groups; the parameter $\delta$ switches between the null and alternative hypotheses. We fix $q = 11$, $a = 1.5$, and $T = 80$, where $T$ is the number of time points at which each curve is observed. For the eigenfunctions, we put $\psi_1(t) = 1$, $\psi_{2r}(t) = \sqrt{2}\sin(2\pi rt)$, and $\psi_{2r+1}(t) = \sqrt{2}\cos(2\pi rt)$ for $r = 1, \dots, q$. The different setups for the data generating procedure are combinations of the following parameters:

• $z_{ijr} \overset{iid}{\sim} N(0, 1)$ or $z_{ijr} \overset{iid}{\sim} t_4/\sqrt{2}$;

• $\rho = 0.1, 0.5, 0.9$ for different levels of smoothness of the residual functions;

• $(n_i) = (20, 30, 30)$ for the small-sample and $(n_i) = (70, 80, 100)$ for the large-sample cases.

The parameter $\delta$ for each pair of $\rho$ and $(n_i)$ is selected in such a way that the differences between the performances of the five tests can be distinguished. For all four competing test statistics ($L^2$-norm based, $F$-type, GPF, and $F_{\max}$), the authors proposed bootstrap methods to estimate the null distribution of the test statistic. Consult Zhang [30, Section 4.5.5] for the implementation of the $L^2$-norm based and $F$-type tests, Zhang and Liang [32, Section 2.4] for the implementation of the GPF test, and Zhang et al. [33, Section S.1] for the implementation of the $F_{\max}$ test. It should be mentioned that although in this paper the kernel mean embedding of probability measures and the MMD statistic are derived for the family of Gaussian probability distributions, the new MMD test dominates all four other tests in all situations, even in the non-Gaussian scenarios. Simulation results are shown in Tables 3 and 4. In our simulation study, we take $\sigma = 1\mathrm{e}3$, and a B-spline basis was used in the smoothing procedure, with the number of components fixed at 41. According to the results, the empirical power of the MMD test is higher than that of the other four tests in all situations. The results presented in these tables are based on 2000 iterations.
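For reference, the statistic $\widehat{\mathrm{MMD}}_0$ itself is inexpensive to compute on grid data. The sketch below is our own discretized version (names are ours), with the null distribution approximated by randomly permuting the group labels.

```python
import numpy as np

def mmd0_stat(groups, sigma, dt):
    """One-way ANOVA statistic
    sum_i n_i <(I + 4 sigma C_hat)^{-1}(mu_i_hat - mu_hat), mu_i_hat - mu_hat>.
    groups: list of (n_i, T) arrays of discretized functions."""
    Y = np.vstack(groups)
    mu = Y.mean(axis=0)
    resid = np.vstack([g - g.mean(axis=0) for g in groups])
    C = resid.T @ resid / resid.shape[0]          # pooled covariance kernel
    M = np.eye(C.shape[0]) + 4 * sigma * C * dt   # grid version of I + 4 sigma C
    stat = 0.0
    for g in groups:
        d = g.mean(axis=0) - mu
        stat += g.shape[0] * (d @ np.linalg.solve(M, d)) * dt
    return stat
```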

Table 3: Empirical powers (in percent) of $L^2$, $F$, GPF, $F_{\max}$ and $\mathrm{MMD}_0$ for the one-way ANOVA problem when $z_{ijr} \overset{iid}{\sim} N(0, 1)$.

                 (ni) = (20, 30, 30)             (ni) = (70, 80, 100)
ρ = 0.1
    δ:        0    0.015  0.03  0.05  0.065    0    0.01  0.02  0.03  0.04
    L2        7.2   8.1   16.2  43.8  70.9     4.6   9.8  24.9  55.9  86.2
    F         6.9   6.9   13.4  39.6  66.5     3.8   9.5  24.1  54.6  85.6
    GPF       6.3   7.7   15.9  44.1  71.2     4.4   9.9  24.7  56.0  86.4
    Fmax      5.9  15.8   56.8  95.1  100      4.4  26.9  82.1  99.4  100
    MMD       5.8  31.8   99.9  100   100      4.2  68.7  100   100   100
ρ = 0.5
    δ:        0    0.05   0.1   0.15  0.2      0    0.04  0.08  0.12  0.16
    L2        4.1   5.9   10.3  15.4  25.0     5.3   8.5  18.4  35.7  62.5
    F         3.3   4.6    9.1  13.2  21.8     4.7   7.9  17.2  34.5  60.7
    GPF       4.3   6.0   11.2  16.3  26.9     5.6   8.6  18.5  37.3  64.3
    Fmax      3.3   6.5   17.5  35.8  64.1     4.7  10.8  40.7  81.0  98.3
    MMD       4.8  19.7   92.7  100   100      4.9  65.7  100   100   100
ρ = 0.9
    δ:        0    0.15   0.3   0.45  0.6      0    0.1   0.2   0.3   0.4
    L2        4.9   5.8   12.5  25.3  48.1     5.1   8.3  20.5  46.9  74.4
    F         3.3   4.0   10.3  20.1  41.9     4.8   7.5  20.0  45.0  73.4
    GPF       6.2   7.7   16.2  29.7  52.1     5.6   9.1  21.9  49.2  76.4
    Fmax      6.5   6.5   12.3  23.5  44.3     5.1   7.6  18.2  42.2  72.0
    MMD       5.3   9.3   31.1  80.0  100      4.4  14.9  74.6  100   100

Table 4: Empirical powers (in percent) of $L^2$, $F$, GPF, $F_{\max}$ and $\mathrm{MMD}_0$ for the one-way ANOVA problem when $z_{ijr} \overset{iid}{\sim} t_4/\sqrt{2}$.

                 (ni) = (20, 30, 30)             (ni) = (70, 80, 100)
ρ = 0.1
    δ:        0    0.015  0.03  0.05  0.065    0    0.01  0.02  0.03  0.04
    L2        7.4   7.7   17.6  42.0  69.1     6.1   9.3  24.6  53.9  86.6
    F         5.8   6.0   14.7  35.6  64.1     5.8   8.6  23.2  52.5  85.7
    GPF       7.0   7.3   17.5  41.3  68.9     6.3   9.2  24.9  53.8  86.1
    Fmax      6.3  16.9   54.8  96.9  99.9     6.0  24.8  80.3  99.5  100
    MMD       6.3  34.2   100   100   100      6.0  71.0  100   100   100
ρ = 0.5
    δ:        0    0.05   0.1   0.15  0.2      0    0.04  0.08  0.12  0.16
    L2        4.4   6.1    9.8  16.0  25.8     4.9   8.0  17.4  36.7  59.6
    F         4.1   4.6    8.4  13.4  22.6     4.6   7.6  16.3  35.6  57.9
    GPF       4.6   6.6   10.1  16.6  25.8     5.0   8.3  18.2  37.8  62.0
    Fmax      4.2   7.0   14.6  35.9  62.6     4.8   9.6  40.8  82.6  98.5
    MMD       5.4  23.1   94.2  100   100      4.6  68.6  100   100   100
ρ = 0.9
    δ:        0    0.15   0.3   0.45  0.6      0    0.1   0.2   0.4   0.4
    L2        3.9   4.5   11.7  27.7  47.2     5.0   7.3  20.8  46.2  74.3
    F         2.8   3.2    8.4  23.4  40.9     4.6   6.7  19.5  44.6  73.2
    GPF       5.4   5.9   15.2  31.0  52.3     5.2   7.9  22.9  48.1  75.8
    Fmax      4.9   5.2   11.9  26.4  46.5     5.3   7.1  18.7  42.7  74.2
    MMD       5.0   8.9   30.9  78.7  100      5.3  13.1  75.7  100   100

4.3 Testing for Equality of Covariance Operators

Let H be a separable Hilbert space. Assume that Yij for i = 1, . . . , k and j = 1, . . . , ni are independent H-valued Gaussian random elements, and yij are their observation counterparts generated from the following model:

\[
Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{ind}{\sim} N(0, C_i); \quad i = 1, \dots, k, \; j = 1, \dots, n_i, \tag{17}
\]
where $\mu_i$ is the unknown mean function of group $i$, and $\varepsilon_{ij}$ accounts for the subject-effect functions with mean zero and covariance operator $C_i = E[\varepsilon_{ij}\otimes\varepsilon_{ij}]$. It is of interest to test the equality of the $k$ covariance operators, i.e., $H_0 : C_1 = \dots = C_k$. Based on the proof of Proposition 6, the squared MMD for comparing two Gaussian probability measures $N(\mu_1, C_1)$ and $N(\mu_2, C_2)$ equals
\begin{align*}
\|m_{P_1} - m_{P_2}\|_{H_k}^2 &= \|m_{P_1}\|_{H_k}^2 + \|m_{P_2}\|_{H_k}^2 - 2\langle m_{P_1}, m_{P_2}\rangle_{H_k} \\
&= |I + 4\sigma C_1|^{-1/2} + |I + 4\sigma C_2|^{-1/2} - 2\,|I + 2\sigma(C_1 + C_2)|^{-1/2}\, e^{-\sigma\langle (I + 2\sigma(C_1 + C_2))^{-1}(\mu_1 - \mu_2),\, (\mu_1 - \mu_2)\rangle}.
\end{align*}

To develop the MMD statistic based on a simple random sample from model (17), we first have to choose a proper characteristic kernel for the space $H^{\sum_{i=1}^{k} n_i}$. By Proposition 6, the Gaussian kernel $k(x, y) = e^{-\sigma\|x-y\|_H^2}$ is characteristic for the family of Gaussian probability measures on $H$, and from Proposition 7, $(\{x_i\}, \{y_i\}) \mapsto \sum_{i=1}^{n} k(x_i, y_i)$ is a characteristic kernel for the family of finite products of Gaussian probability measures on $H^n$. Let $P_1$ be the probability measure of samples generated by model (17) under $H_0$, i.e., $C_1 = \dots = C_k = C$, and let $P_2$ be the probability measure induced by model (17) when the parameters $C_i$ are considered free. For the centered version of model (17), that is, $\mu_1 = \dots = \mu_k = 0$, the squared MMD with the Gaussian sum-kernel equals
\begin{align*}
\mathrm{MMD}^2 &= \|m_{P_1} - m_{P_2}\|_{H_k}^2 \\
&= n\,|I + 4\sigma C|^{-1/2} + \sum_{i=1}^{k} n_i\,|I + 4\sigma C_i|^{-1/2} - 2\sum_{i=1}^{k} n_i\,|I + 2\sigma(C + C_i)|^{-1/2} \\
&= \sum_{i=1}^{k} n_i\Bigl( |I + 4\sigma C|^{-1/2} + |I + 4\sigma C_i|^{-1/2} - 2\,|I + 2\sigma(C + C_i)|^{-1/2} \Bigr).
\end{align*}

There are currently developed tests for the $k$-sample equality of covariance functions problem when $H$ is the space of square-integrable functions over a compact set, such as $L^2[0, 1]$. We address two recent, successful tests for homogeneity of covariance functions, namely quasi-GPF and quasi-$F_{\max}$, introduced by Guo et al. [11]. Other tests exist, but they have been shown to be less powerful in various settings relative to these; see Guo et al. [11] and references therein for more information and simulation studies. We compared the new MMD test against quasi-GPF and quasi-$F_{\max}$ in a simulation study. Our simulation study is motivated by Guo et al. [11], and we used the same setup for the data generating procedure. Assume that the mean function is zero and data are generated in a $k$-regime scheme according to the following model:
\[
y_{ij}(t) = \varepsilon_{ij}(t), \qquad \varepsilon_{ij}(t) = h(t)\sum_{r\ge 1}\sqrt{\lambda_r}\, z_{ijr}\psi_{ir}(t),
\]
for $i = 1, \dots, k$, $j = 1, \dots, n_i$, $t \in [0, 1]$, where $h(t)$ is common to all groups and $n_i$ denotes the size of each group. The set $\{\psi_{ir}\}_{r\ge 1}$ is, for each $i$, a set of basis functions, and we set the eigenvalues $\lambda_r = a\rho^{r-1}$ for fixed $a > 0$ and $\rho \in (0, 1)$. The tuning parameter $\rho$ determines the decay rate of the eigenvalues: for $\rho$ close to zero, the eigenvalues decay fast and the functional data are more correlated and smoother, whereas for $\rho$ close to one, the eigenvalues decay slowly and the realizations of the functional data are less correlated across the domain and thus less smooth. Although Guo et al. [11] assumed a finite number $q$ of nonzero eigenvalues in the simulation process, our test works well without any restriction on the number of components. Here we follow Guo et al. [11] and fix $q = 41$, $a = 1.5$, $T = 80$, and $h(t) = \frac{t+1}{T}$, where $T$ is the number of time points at which each curve is observed. The different setups for the data generating procedure are combinations of the following parameters:

• $z_{ijr} \overset{iid}{\sim} N(0, 1)$ or $z_{ijr} \overset{iid}{\sim} \sqrt{3/5}\;t_5$;

• $\rho = 0.1, 0.5, 0.9$ for three classes of high, moderate, and low correlation;

• $(n_i) = (20, 30, 30)$ for the small-sample and $(n_i) = (70, 80, 100)$ for the large-sample cases;

• $\psi_{ir}(t) = \phi_r(t)$ for $r = 1, 3, 4, \dots, q$ and $\psi_{i2}(t) = \phi_2(t) + (i - 1)\,\omega/h(t)$, for different choices of $\omega$, to reflect the between-group difference of the covariance operators; $\{\phi_r\}_{r\ge 1}$ can be taken to be either the Fourier or the B-spline basis of $L^2[0, 1]$.

Table 5: Empirical powers (in percent) of $L^2$, $T_{\max}$, GPF, $F_{\max}$ and MMD when $z_{ijr} \overset{iid}{\sim} N(0, 1)$.

                 (ni) = (20, 30, 30)             (ni) = (70, 80, 100)
ρ = 0.1
    ω:        0    0.5   1     2.5   5        0    0.5   1     2.5   4
    L2        5.3   5.4   6.3   7.1   8.1     4.8   4.7   6.3   6.0   7.2
    Tmax      4.8   5.0   5.7   6.7   7.6     4.5   4.6   6.4   6.2   8.1
    GPF       4.4   5.3   7.1  19.4  72.9     5.2   6.4   7.6  70.1  100
    Fmax      4.9   5.7   9.7  48.7  89.1     5.2   8.1  21.2  100   100
    MMD       4.4  100   100   100   100      4.7  100   100   100   100
ρ = 0.5
    ω:        0    0.5   1     1.5   3        0    0.5   0.8   1.1   1.4
    L2        4.6   5.0   6.3   6.7   5.3     4.8   4.6   4.4   6.2   5.3
    Tmax      4.7   4.5   6.1   4.0   5.4     4.6   4.9   5.1   5.8   5.4
    GPF       5.1   9.5  27.3  46.7  95.1     5.6  28.7  65.7  91.5  99.3
    Fmax      4.2  12.0  30.1  55.4  97.3     5.3  35.4  77.8  97.4  99.6
    MMD       4.8  99.7  100   100   100      4.4  100   100   100   100
ρ = 0.9
    ω:        0    0.5   0.8   1.2   1.5      0    0.4   0.6   0.8   1
    L2        5.7   6.6   5.7   5.4   6.7     5.9   4.8   5.6   5.1   5.8
    Tmax      4.7   6.7   6.0   5.3   6.8     4.6   4.7   6.0   5.4   5.7
    GPF       5.5  16.4  29.4  60.8  75.7     5.2  25.2  55.1  84.5  98.7
    Fmax      5.1   8.3  14.5  19.5  33.2     5.4  14.2  24.5  43.4  65.3
    MMD       4.4  100   100   100   100      4.3  100   100   100   100

Guo et al. [11] compared quasi-GPF and quasi-$F_{\max}$ with a few other tests, including $L^2$ and $T_{\max}$ [10]. According to their results, quasi-GPF is superior in low-correlation schemes and quasi-$F_{\max}$ is superior in high-correlation schemes. Although in this paper we derived the kernel mean embedding and the MMD statistic for the family of Gaussian probability distributions, it can be noticed from Tables 5 and 6 that the new MMD-based test dominates quasi-GPF and quasi-$F_{\max}$ in all situations, including the non-Gaussian scenarios. Let $\hat C$ be the usual estimate of the covariance operator under $H_0$ and $\hat C_i$ the usual covariance operator estimate of group $i$. Then our MMD test statistic equals
\[
\widehat{\mathrm{MMD}} = \Biggl[ \sum_{i=1}^{k} n_i\Bigl( \bigl|I + 4\sigma\hat C\bigr|^{-1/2} + \bigl|I + 4\sigma\hat C_i\bigr|^{-1/2} - 2\,\bigl|I + 2\sigma(\hat C + \hat C_i)\bigr|^{-1/2} \Bigr) \Biggr]^{1/2}.
\]

In this simulation study, we take $\sigma = 1\mathrm{e}3$. The null distributions of all five test statistics, $L^2$, $T_{\max}$, $F_{\max}$, GPF, and MMD, are approximated by the random permutation method, and their empirical powers are calculated in a simulation study. Results for $\alpha = 0.05$, with $\{\phi_r\}_{r\ge 1}$ selected to be the Fourier basis, are presented in Tables 5 and 6. The B-spline basis was used for the smoothing procedure, with the number of components fixed at 41. According to the results, the empirical powers of the MMD test are uniformly higher than those of the other four tests in all situations. The results presented in these tables were produced with 2000 iterations.

Table 6: Empirical powers (in percent) of $L^2$, $T_{\max}$, GPF, $F_{\max}$ and MMD when $z_{ijr} \overset{iid}{\sim} \sqrt{3/5}\;t_5$.

                 (ni) = (20, 30, 30)             (ni) = (70, 80, 100)
ρ = 0.1
    ω:        0    0.5   1     2.5   5        0    0.5   1     2.5   4
    L2        5.3   4.4   5.2   4.3   5.6     5.6   7.2   7.6   6.4   5.6
    Tmax      4.8   5.2   5.6   5.7   7.1     4.9   5.7   5.8   6.9   7.1
    GPF       4.9   4.5  10.0  12.9  58.0     4.8   8.5  10.1  30.6  80.3
    Fmax      5.2   8.5  13.1  34.1  70.1     5.7  12.8  20.4  88.5  98.5
    MMD       4.2  100   100   100   100      4.4  100   100   100   100
ρ = 0.5
    ω:        0    0.5   1     1.5   3        0    0.5   0.8   1.1   1.4
    L2        5.0   4.3   4.3   5.1   5.6     5.1   5.6   4.3   6.9   6.3
    Tmax      4.7   5.5   4.8   5.7   7.2     4.7   5.4   4.4   5.7   6.8
    GPF       5.4  14.2  21.4  34.2  72.8     4.9  15.7  44.2  65.7  84.2
    Fmax      4.6  12.8  22.6  40.3  68.5     5.4  24.2  61.4  84.2  91.4
    MMD       4.8  84.2  100   100   100      5.0  100   100   100   100
ρ = 0.9
    ω:        0    0.5   0.8   1.2   1.5      0    0.4   0.6   0.8   1
    L2        4.1   4.3   4.2   3.8   4.9     5.2   6.4   5.3   6.7   6.5
    Tmax      5.9   4.2   5.5   5.0   5.8     4.9   6.9   7.1   6.9   6.8
    GPF       5.2  15.7  18.5  31.4  54.9     4.4  10.3  44.3  67.1  72.8
    Fmax      4.1  11.4  10.3  12.8  30.3     5.1   7.1  12.8  23.8  50.3
    MMD       5.4  91.4  100   100   100      4.8  100   100   100   100

4.3.1 Medfly Data In this section, we apply MMD and the other four tests introduced in Section 2 4.3, L , Tmax, GPF and Fmax, to test for homogeneity of covariance operators in a real data example, according to the model (17) with k = 4. Medfly data set is a functional data of mortality rate of medflies. Approxi- mately, 7,200 medflies of a given size were maintained in aluminum cages. Adults were given either a diet of sugar and water, or a diet of sugar, water and ad libitum. Each day, dead flies were removed, counted, and their sex determined [1]. The number and rate of alive medflies were recorded over a period of 101 days. In effect, the aim is to assess the effects of nutrition and gender on survival or mortality of medlies. Cohorts of medflies consist of four groups, (a) Females on a sugar diet, (b) Females on a protein plus sugar diet, (c) Males on a sugar diet, and (d) Males on a protein plus sugar diet. The effect of gender and nutrition is studied before [see for example 15,2], and it is known that there is an interaction between gender and nutrition on the survival of medflies [21]. Survival functions of cohorts of medflies during a period of 30 days (days 2-31) are illustrated in Figure1. The panels in the first row demonstrate 33 sample functions in each group as


Figure 1: Survival functions of cohorts of medflies: (a) Females on a sugar diet, (b) Females on a protein plus sugar diet, (c) Males on a sugar diet, and (d) Males on a protein plus sugar diet; (Top row) Gray lines: survival functions of samples, Red line: mean function, (Bottom row) Gray lines: deviation of samples from the mean function, Blue line: first eigenfunction of the covariance operator, which accounts for 89.3%, 94.5%, 96.4% and 97.8% of the variation of survival functions in the four groups respectively.

The panels in the first row show 33 sample functions in each group as well as the mean functions; each sample function is the survival rate of medflies in one cage. The panels in the second row show the deviation of the samples from the group's mean function, together with the first eigenfunction of the covariance operator, which explains the major part of the within-group variation of the functional samples (89.3%, 94.5%, 96.4% and 97.8% in the four groups respectively). A slight difference in the eigenvalues and eigenfunctions of the covariance operators between groups can be noticed. The kernel functions of the covariance operators in the four groups are depicted in Figure 2, which magnifies the between-group differences of the covariance operators.


Figure 2: Estimated covariance functions of the four groups of medflies: (a) Females on a sugar diet, (b) Females on a protein plus sugar diet, (c) Males on a sugar diet, and (d) Males on a protein plus sugar diet.

MMD and the other four tests, $L^2$, $T_{\max}$, GPF and $F_{\max}$, were employed to test the equality of the covariance operators. The results are presented in Table 7. The p-values of the MMD test are generally smaller than those of the other four tests in all pairwise comparisons. As described in Guo et al. [11], the $F_{\max}$ test was expected to have higher power than GPF on this data set. However, as shown in the simulation studies, MMD has higher power than both $F_{\max}$ and GPF in all of the scenarios.

Table 7: p-values (in percent) of the $L^2$, $T_{\max}$, GPF, $F_{\max}$ and MMD tests applied to compare the covariance operators of the survival functions of the four groups of medflies.

              L^2    T_max    GPF    F_max    MMD
(a) vs (b)     1.4     0.4     0.6     0.1     0.1
(a) vs (c)     6.2     0.6     3.4     0.1     0.1
(a) vs (d)     0.1     0.1     0.1     0.1     0.1
(b) vs (c)    21.2    29.2    12.2     3.6     0.1
(b) vs (d)    33.8    47.0    13.8     1.2     0.2
(c) vs (d)    24.0    27.8    21.2     6.2     2.6
All groups     2.6     1.8     0.4     0.1     0.1
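
The pairwise rows of Table 7 amount to running the homogeneity test on each pair of groups. A hedged sketch, reusing the hypothetical mmd_covariance_stat and permutation_pvalue helpers from Section 4.3 and assuming the four groups' curves are stored in a list groups:

```python
from itertools import combinations

labels = ["a", "b", "c", "d"]
for (i, gi), (j, gj) in combinations(enumerate(groups), 2):
    p = permutation_pvalue([gi, gj],
                           lambda g: mmd_covariance_stat(g, sigma=1e3))
    print(f"({labels[i]}) vs ({labels[j]}): p = {100 * p:.1f}%")
```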

5 Conclusions and Discussion

This study explored kernel methods for probability measures and their applications to functional data analysis. We derived conditions on kernels to be characteristic for infinite-dimensional separable Hilbert spaces, and a framework for introducing a pseudo-likelihood function over such spaces. It is shown that the MKM estimators of the location and covariance operator obtained by maximizing this pseudo-likelihood coincide with the ordinary least squares estimators, mirroring finite-dimensional spaces, where ordinary least squares estimators coincide with the MLE in the Gaussian case. We also used Maximum Mean Discrepancy as a distance over the space of Gaussian probability measures induced by functional response models and derived new powerful tests for the problems of functional one-way ANOVA and homogeneity of covariance operators. An important question not covered in this paper is how to choose the Gaussian kernel bandwidth parameter $\sigma$. As proposed in Sriperumbudur et al. [26], one may choose a family of characteristic kernels $k_\sigma(x, y) := \exp\{-\sigma \|x - y\|^2\}$, $\sigma > 0$, and use the maximal RKHS distance

$\gamma(P, Q) = \sup_{\sigma > 0} \gamma_\sigma(P, Q)$, where $\gamma_\sigma$ is the MMD metric defined by the characteristic kernel $k_\sigma$. Since $\gamma$ is a stronger metric than $\gamma_\sigma$, new tests derived from $\gamma$ can be expected to perform better than those introduced in Section 4.
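
In practice the supremum can only be approximated; a minimal sketch, assuming a finite bandwidth grid of our own choosing and the hypothetical mmd_covariance_stat helper from Section 4.3:

```python
def sup_mmd_stat(groups, sigmas=(1e1, 1e2, 1e3, 1e4)):
    # Approximate gamma = sup_sigma gamma_sigma by maximizing over a grid.
    return max(mmd_covariance_stat(groups, sigma=s) for s in sigmas)
```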

A Appendix

A.1 Proof of Theorem 1 and Corollary 2

To provide the proof of Theorem 1 we need the following lemma.

Lemma 11. Let $\{b_j\}$ be a decreasing sequence of positive real numbers and $\{a_j\}$ a sequence of real numbers such that $\sum_{j \geq 1} |a_j| < \infty$ and $\sum_{j \geq 1} a_j b_j > 0$. Then there exists a finite $N \in \mathbb{N}$ such that $\sum_{j=1}^{N} a_j > 0$.
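
As a toy illustration of the lemma (our own example): take $b_1 = 1$, $b_2 = 3/4$ and $b_j = 2^{-j}$ for $j \geq 3$, with $a_1 = -1$, $a_2 = 2$ and $a_j = 0$ otherwise. Then $\sum_{j \geq 1} a_j b_j = -1 + 3/2 = 1/2 > 0$, and indeed $\sum_{j=1}^{N} a_j = 1 > 0$ for $N = 2$, although the partial sum for $N = 1$ is negative.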

Proof. Let $P = \{n_1, n_2, \ldots\} \subseteq \mathbb{N}$ be the set of indices for which $a_j > 0$, define $n_0 = 0$, and for any $n_i \in P$ let $T_{n_i} = \mathbb{N} \cap (n_{i-1}, n_i]$. Then for any $i \geq 1$ we have $b_{n_i} \sum_{j \in T_{n_i}} a_j \geq \sum_{j \in T_{n_i}} b_j a_j$. Let $n_k \in P$ be the first index such that $\sum_{j=1}^{n_k} a_j b_j > 0$. If $k = 1$, the proof is straightforward. If $k > 1$, then
\[
\sum_{j=1}^{n_k} a_j = \sum_{i=1}^{k} \sum_{j \in T_{n_i}} a_j \geq \sum_{i=1}^{k} \frac{1}{b_{n_i}} \sum_{j \in T_{n_i}} b_j a_j \geq \sum_{i=1}^{k-1} \frac{1}{b_{n_{k-1}}} \sum_{j \in T_{n_i}} b_j a_j + \frac{1}{b_{n_k}} \sum_{j \in T_{n_k}} b_j a_j \geq \frac{1}{b_{n_{k-1}}} \sum_{i=1}^{k} \sum_{j \in T_{n_i}} b_j a_j = \frac{1}{b_{n_{k-1}}} \sum_{j=1}^{n_k} b_j a_j > 0.
\]

Proof of Theorem 1: Suppose $m_{P_2}(y) - m_{P_1}(y) = \delta > 0$. There exists $r > 0$ large enough such that $\sup_{x \in B_r^c(y)} \psi(\|x - y\|_H) \leq \delta/2$, in which $B_r(y) = \{x \in H : \|x - y\|_H < r\}$. Then we have
\[
0 < \delta = \int_H \psi(\|x - y\|_H)(P_2 - P_1)(dx) = \int_{B_r(y)} \psi(\|x - y\|_H)(P_2 - P_1)(dx) + \int_{B_r^c(y)} \psi(\|x - y\|_H)(P_2 - P_1)(dx) \leq \int_{B_r(y)} \psi(\|x - y\|_H)(P_2 - P_1)(dx) + \delta/2,
\]
and thus
\[
\int_{B_r(y)} \psi(\|x - y\|_H)(P_2 - P_1)(dx) \geq \frac{\delta}{2} > 0. \tag{18}
\]
Define

• $r_{i,L} = (1 - L^i)\, r$, $i \geq 1$, $L \in (0, 1)$;

• $B_{i,L} = B_{r_{i,L}}(y)$, $i \geq 1$;

• $B'_{1,L} = B_{1,L}$ and $B'_{i,L} = B_{i,L} \setminus B_{i-1,L}$, $i \geq 2$.

Thus from (18) we have

\[
0 < \frac{\delta}{2} \leq \sum_{i \geq 1} \int_{B'_{i,L}} \psi(\|x - y\|_H)(P_2 - P_1)(dx) \leq \sum_{i \geq 1} m_{i,L} (P_2 - P_1)(B'_{i,L}) + \sum_{i \geq 1} \gamma_{i,L}\, P_2(B'_{i,L}) \leq \sum_{i \geq 1} m_{i,L} (P_2 - P_1)(B'_{i,L}) + \sup_{i \geq 1} \gamma_{i,L}\, P_2(B_r(y)),
\]
where $m_{i,L} = \inf_{x \in B'_{i,L}} \psi(\|x - y\|_H)$, $M_{i,L} = \sup_{x \in B'_{i,L}} \psi(\|x - y\|_H)$ and $\gamma_{i,L} = M_{i,L} - m_{i,L}$. Because $\psi$ is a bounded, non-negative, continuous and strictly decreasing function, we can choose $L \in (0, 1)$ such that $\sup_{i \geq 1} \gamma_{i,L}\, P_2(B_r(y)) < \delta/4$, i.e. $\sup_{i \geq 1} \gamma_{i,L} < \delta / (4 P_2(B_r(y)))$, and thus $\sum_{i \geq 1} m_{i,L}(P_2 - P_1)(B'_{i,L}) > 0$. By Lemma 11 there exists $N < \infty$ such that $\sum_{i=1}^{N} (P_2 - P_1)(B'_{i,L}) > 0$, from which it immediately follows that $(P_2 - P_1)(B_{r^*}(y)) > 0$, where $r^* = (1 - L^N)\, r$.

Proof of Corollary 2: Assume $\int_H \psi(\|x - y_2\|_H)\, P(dx) > \int_H \psi(\|x - y_1\|_H)\, P(dx)$, and let $P_{-a}$ be the push-forward of $P$ by the map $x \mapsto x + a$, which translates $x$ to $x + a$. Then $\int_H \psi(\|x\|_H)\, P_{-y_2}(dx) > \int_H \psi(\|x\|_H)\, P_{-y_1}(dx)$, and thus, by the same argument as in the proof of Theorem 1, there exists $r > 0$ such that $(P_{-y_2} - P_{-y_1})(B_r(0)) > 0$, and consequently $P(B_r(y_2)) > P(B_r(y_1))$.

A.2 Proof of Theorem 4

The existence proof of a continuous characteristic kernel for $\ell_2$ relies on the following theorem by Steinwart and Ziegel [28]; we also need Lemma 13 to complete the proof.

Theorem 12. [28, Theorem 3.14] For a compact topological Hausdorff space $(X, \tau)$, the following statements are equivalent:

1. There exists a universal kernel $k$ on $X$.

2. There exists a continuous characteristic kernel $k$ on $X$.

3. X is metrizable, i.e. there exists a metric generating the topology τ.

Lemma 13. Let $\bar{\mathbb{R}}$ be the extended real line, let $\mathbb{R}^\infty$ and $\bar{\mathbb{R}}^\infty$ be the countable products of $\mathbb{R}$ and $\bar{\mathbb{R}}$ respectively, equipped with the product topologies, and let $\ell_2 \subset \mathbb{R}^\infty$ be the Hilbert space of square summable sequences. If the function $f : \bar{\mathbb{R}}^\infty \to \mathbb{R}$ is continuous, then $f|_{\ell_2}$, the restriction of $f$ to $\ell_2$, is continuous with respect to the norm of $\ell_2$.

Proof. Define $\varphi : \bar{\mathbb{R}} \to [-1, 1]$ by
\[
\varphi(x) = \frac{x}{1 + |x|}, \quad \forall x \in \mathbb{R}, \qquad \varphi(-\infty) = -1, \quad \varphi(+\infty) = 1.
\]
Clearly $\varphi$ is a homeomorphism and order-preserving. Consider $\rho(x, y) := |\varphi(x) - \varphi(y)|$ for all $x, y \in \bar{\mathbb{R}}$. Then $\rho$ is a metric, and the topology it induces is equivalent to the order topology. On the other hand, it is well known that the product topology on $\bar{\mathbb{R}}^\infty$ can be generated by the following metric [4, Theorem 2.6.6]:
\[
d(x, y) := \sum_{k=1}^{\infty} \frac{\rho(x_k, y_k)}{2^k \bigl(1 + \rho(x_k, y_k)\bigr)}, \qquad \forall x = (x_k),\, y = (y_k) \in \bar{\mathbb{R}}^\infty.
\]
We now show that the topology induced by the metric $d$ on $\ell_2$ is weaker than the norm topology. For this purpose, let $x_n = (x_n^k), x = (x^k) \in \ell_2$ with $\|x_n - x\| \to 0$. Then $|x_n^k - x^k| \to 0$ for all $k \in \mathbb{N}$. Because $\varphi$ is a homeomorphism, $\rho(x_n^k, x^k) \to 0$ for any $k \in \mathbb{N}$, and hence $d(x_n, x) \to 0$. Thus, if $f$ is continuous with respect to the product topology, then $f|_{\ell_2}$ is continuous with respect to the norm topology.

Proof of Theorem 4: Without loss of generality assume $H$ to be the space of square summable sequences $\ell_2$, which is a subset of $\mathbb{R}^\infty$, and let $\mathcal{B}(\ell_2)$ be the Borel sigma-algebra generated by the open sets of $\ell_2$. $\mathbb{R}$ is a one-dimensional locally compact Hausdorff space, and the extended real line $\bar{\mathbb{R}}$ equipped with the order topology is a metrizable, Hausdorff and compact topological space. Equip both $\mathbb{R}^\infty$ and $\bar{\mathbb{R}}^\infty$ with the product topologies and let $\mathcal{B}(\mathbb{R}^\infty)$ and $\mathcal{B}(\bar{\mathbb{R}}^\infty)$ be the Borel sigma-algebras generated by the open sets of these topologies. Consider that $\mathcal{B}(\ell_2) = \{A \cap \ell_2 : A \in \mathcal{B}(\mathbb{R}^\infty)\}$, and we have $\mathcal{B}(\ell_2) \subseteq \mathcal{B}(\mathbb{R}^\infty) \subseteq \mathcal{B}(\bar{\mathbb{R}}^\infty)$. Note that $\mathcal{B}(\mathbb{R}^\infty) \subseteq \mathcal{B}(\bar{\mathbb{R}}^\infty)$ because we equipped the extended real line $\bar{\mathbb{R}}$ with the order topology, which includes the bases for the natural topology of $\mathbb{R}$. Let $\iota : \ell_2 \to \bar{\mathbb{R}}^\infty$ be the usual inclusion map; then for every $A \in \mathcal{B}(\bar{\mathbb{R}}^\infty)$ we have $\iota^{-1}(A) = A \cap \ell_2 \in \mathcal{B}(\ell_2)$, so $\iota$ is a $\mathcal{B}(\ell_2)$-$\mathcal{B}(\bar{\mathbb{R}}^\infty)$ measurable map. Hence every $\ell_2$-valued random element is an $\bar{\mathbb{R}}^\infty$-valued random element, and thus the space of Borel probability measures on $(\ell_2, \mathcal{B}(\ell_2))$ is a subset of the space of Borel probability measures on $(\bar{\mathbb{R}}^\infty, \mathcal{B}(\bar{\mathbb{R}}^\infty))$. $\bar{\mathbb{R}}^\infty$ itself is a metrizable compact topological Hausdorff space; thus, by invoking Theorem 12, there exists a continuous characteristic kernel $k(\cdot, \cdot)$ on $\bar{\mathbb{R}}^\infty$, and by Lemma 13 its restriction to $\ell_2$ is also continuous with respect to the norm of $\ell_2$.

A.3 Proof of Theorem 5

Let $\ell_2$ be the space of square summable sequences with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$, and let $\Lambda_\theta$ be the infinite-dimensional Gaussian measure on the measurable space $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$ defined as the product of countably many copies of the normal distribution with mean zero and variance $\theta$. The dual of $\mathbb{R}^\infty$ is $c_{00}$, the space of sequences with only finitely many non-zero terms, so the characteristic function of the Gaussian measure, for any $x \in c_{00}$, equals
\[
\psi(x) := \int_{\mathbb{R}^\infty} e^{-i\langle\omega, x\rangle}\, \Lambda_{2\sigma}(d\omega) = e^{-\sigma\|x\|^2}. \tag{19}
\]
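
Since any $x \in c_{00}$ has only finitely many non-zero coordinates, identity (19) reduces to a finite-dimensional Gaussian characteristic function and can be checked by simulation; the following Monte Carlo sketch is our own illustration, with arbitrary parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = np.array([0.3, -1.2, 0.7])        # a point of c00: finitely supported
# omega has iid N(0, 2*sigma) coordinates, matching Lambda_{2 sigma}
omega = rng.normal(0.0, np.sqrt(2 * sigma), size=(200_000, x.size))
mc = np.exp(-1j * omega @ x).mean()   # Monte Carlo estimate of (19)
exact = np.exp(-sigma * np.sum(x ** 2))
print(abs(mc - exact))                # small, of Monte Carlo order
```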

Let $P$ and $Q$ be two arbitrary probability measures over $c_{00}$ such that $\gamma_k(P, Q) = 0$. Then
\[
\begin{aligned}
0 = \gamma_k^2(P, Q) &= \int_{c_{00}}\!\int_{c_{00}} e^{-\sigma\|x - y\|^2}\, (P - Q)(dx)(P - Q)(dy) \\
&= \int_{c_{00}}\!\int_{c_{00}} \left( \int_{\mathbb{R}^\infty} e^{-i\langle\omega, x - y\rangle}\, \Lambda_{2\sigma}(d\omega) \right) (P - Q)(dx)(P - Q)(dy) \\
&\overset{(a)}{=} \int_{\mathbb{R}^\infty} \left( \int_{c_{00}}\!\int_{c_{00}} e^{-i\langle\omega, x - y\rangle}\, (P - Q)(dx)(P - Q)(dy) \right) \Lambda_{2\sigma}(d\omega) \\
&= \int_{\mathbb{R}^\infty} \left( \int_{c_{00}} e^{-i\langle\omega, x\rangle}\, (P - Q)(dx) \right) \left( \int_{c_{00}} e^{i\langle\omega, y\rangle}\, (P - Q)(dy) \right) \Lambda_{2\sigma}(d\omega) \\
&= \int_{\mathbb{R}^\infty} \bigl(\phi_P(\omega) - \phi_Q(\omega)\bigr)\, \overline{\bigl(\phi_P(\omega) - \phi_Q(\omega)\bigr)}\, \Lambda_{2\sigma}(d\omega) \\
&= \int_{\mathbb{R}^\infty} \bigl|\phi_P(\omega) - \phi_Q(\omega)\bigr|^2\, \Lambda_{2\sigma}(d\omega). 
\end{aligned} \tag{20}
\]

In the above equation, the Fubini-Tonelli theorem is invoked in (a). The dual of $c_{00}$ with norm $\|\cdot\|$ is the space of square summable sequences $\ell_2$. So, to show that $P = Q$, it is enough to show that $\phi_P$ and $\phi_Q$ agree on $\ell_2$. By (20), the definition of the integral, and the fact that $\mathrm{supp}(\Lambda_{2\sigma}) = \mathbb{R}^\infty$, for any open set $B$ we have
\[
\inf_{\omega \in B} |\phi_P(\omega) - \phi_Q(\omega)|^2 = 0.
\]

Fix $\omega_0 \in c_{00}$, and for any $m \in \mathbb{N}$ define
\[
B_m := \left\{ x \in \mathbb{R}^m : \sum_{i=1}^{m} (x_i - \omega_{0i})^2 < \frac{1}{m^2} \right\} \times \mathbb{R}^\infty,
\]
which is an open set in $\mathbb{R}^\infty$. Thus for each $m \in \mathbb{N}$ we have
\[
\inf_{\omega \in B_m} |\phi_P(\omega) - \phi_Q(\omega)|^2 = 0,
\]
and so there exists $\omega_m \in B_m$ such that $|\phi_P(\omega_m) - \phi_Q(\omega_m)|^2 < \frac{1}{m}$. Observe that the sequence $\omega_m$ converges to $\omega_0$ in the metric of $\mathbb{R}^\infty$, since
\[
d(\omega_m, \omega_0) = \sum_{k \geq 1} 2^{-k} \frac{|\omega_{mk} - \omega_{0k}|}{1 + |\omega_{mk} - \omega_{0k}|} \leq \sum_{k=1}^{m} 2^{-k} \frac{1/m}{1 + 1/m} + \sum_{k > m} 2^{-k} \leq \frac{1}{m + 1}\left(1 - 2^{-m}\right) + 2^{-m} \to 0.
\]

So $\langle\omega_m, x\rangle \to \langle\omega_0, x\rangle$ for any $x \in c_{00}$. By a simple application of the Bounded Convergence Theorem, we have
\[
\begin{aligned}
\lim_{m \to \infty} |\phi_P(\omega_m) - \phi_Q(\omega_m)|^2 &= \lim_{m \to \infty} \left| \int_{c_{00}} e^{-i\langle\omega_m, x\rangle} P(dx) - \int_{c_{00}} e^{-i\langle\omega_m, x\rangle} Q(dx) \right|^2 \\
&= \left| \lim_{m \to \infty} \int_{c_{00}} e^{-i\langle\omega_m, x\rangle} P(dx) - \lim_{m \to \infty} \int_{c_{00}} e^{-i\langle\omega_m, x\rangle} Q(dx) \right|^2 \\
&= \left| \int_{c_{00}} e^{-i\langle\omega_0, x\rangle} P(dx) - \int_{c_{00}} e^{-i\langle\omega_0, x\rangle} Q(dx) \right|^2 = |\phi_P(\omega_0) - \phi_Q(\omega_0)|^2,
\end{aligned}
\]
and thus
\[
|\phi_P(\omega_0) - \phi_Q(\omega_0)|^2 = \lim_{m \to \infty} |\phi_P(\omega_m) - \phi_Q(\omega_m)|^2 \leq \lim_{m \to \infty} \frac{1}{m} = 0.
\]
So $\phi_P = \phi_Q$ on $c_{00}$. The space $c_{00}$ is dense in $\ell_2$, so $\phi_P$ and $\phi_Q$ agree on $\ell_2$, and thus $P = Q$.

A.4 Proof of Proposition 6

Before providing the proof we need some tools, given in the upcoming theorems and lemmas. The next theorem is a generalization of Ky Fan's inequality; it is used to show the convexity of the map $A \mapsto |I + A|^{-1/2}$ on the convex set of positive trace-class operators, which is crucial to prove that the Gaussian kernel is characteristic for the family of Gaussian distributions. The following theorem is a special case of Minh [19, Theorem 1] with $\mu = \gamma = 1$.

Theorem 14. Let $H$ be an infinite-dimensional separable Hilbert space, and let $A, B$ be two arbitrary positive trace-class operators. For $0 \leq \alpha \leq 1$,
\[
|\alpha (I + A) + (1 - \alpha)(I + B)| \geq |I + A|^{\alpha}\, |I + B|^{1 - \alpha}.
\]

For $0 < \alpha < 1$, equality occurs if and only if $A = B$.

Lemma 15. Let $H$ be a separable Hilbert space, and let $|\cdot|$ denote the determinant of a non-negative symmetric operator on $H$. Then $A \mapsto |I + A|^{-1/2}$ is a convex function over the convex set of positive trace-class operators on $H$, and for any two arbitrary positive trace-class operators $A$ and $B$,
\[
2 \left| I + \frac{A + B}{2} \right|^{-1/2} \leq |I + A|^{-1/2} + |I + B|^{-1/2},
\]
with equality if and only if $A = B$.

Proof. By Theorem 14 we have
\[
\log |I + (\alpha A + (1 - \alpha) B)| \geq \alpha \log |I + A| + (1 - \alpha) \log |I + B|,
\]
so $A \mapsto \log |I + A|$ is a concave function on the convex set of positive trace-class operators. Thus $A \mapsto \log |I + A|^{-1/2}$ is convex, and so is $A \mapsto |I + A|^{-1/2}$, since $x \mapsto e^x$ is a non-decreasing convex function. Consequently
\[
\left| I + \left( \frac{1}{2} A + \frac{1}{2} B \right) \right|^{-1/2} \leq \frac{1}{2} |I + A|^{-1/2} + \frac{1}{2} |I + B|^{-1/2},
\]
and thus
\[
2 \left| I + \frac{A + B}{2} \right|^{-1/2} \leq |I + A|^{-1/2} + |I + B|^{-1/2}.
\]
By invoking Theorem 14, equality occurs if and only if $A = B$.

Lemma 16. [18, Proposition 1.2.8] Let $H$ be a separable Hilbert space and $N(\mu, C)$ a Gaussian probability measure on $H$ with mean $\mu$ and covariance operator $C$. For any $\sigma > 0$,
\[
\int_H e^{-\sigma \|x\|_H^2}\, N(\mu, C)(dx) = |I + 2\sigma C|^{-1/2}\, e^{-\sigma \langle (I + 2\sigma C)^{-1} \mu,\, \mu \rangle}.
\]
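
In finite dimension the identity of Lemma 16 can be verified by simulation; a sketch, our own illustration with arbitrary dimension and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 0.3
A = rng.normal(size=(d, d))
C = A @ A.T / d                       # a positive definite covariance matrix
mu = rng.normal(size=d)
Y = rng.multivariate_normal(mu, C, size=200_000)
mc = np.exp(-sigma * np.sum(Y ** 2, axis=1)).mean()
M = np.eye(d) + 2 * sigma * C
exact = np.linalg.det(M) ** -0.5 * np.exp(-sigma * mu @ np.linalg.solve(M, mu))
print(mc, exact)                      # the two values agree up to MC error
```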

Proof of Proposition 6: If $Y \sim N(\mu, C)$, then by Lemma 16 we have
\[
m_P(x) = \int_H e^{-\sigma \|y - x\|_H^2}\, N(\mu, C)(dy) = \int_H e^{-\sigma \|z\|_H^2}\, N(x - \mu, C)(dz) = |I + 2\sigma C|^{-1/2}\, e^{-\sigma \langle (I + 2\sigma C)^{-1} (x - \mu),\, (x - \mu) \rangle}.
\]

Let $T_1 = I + 2\sigma C_1$. Then
\[
\begin{aligned}
\langle m_{P_1}, m_{P_2} \rangle_{H_k} &= \int_H \int_H e^{-\sigma \|x - y\|_H^2}\, N(\mu_1, C_1)(dx)\, N(\mu_2, C_2)(dy) \\
&= \int_H |T_1|^{-1/2}\, e^{-\sigma \langle T_1^{-1} (y - \mu_1),\, (y - \mu_1) \rangle}\, N(\mu_2, C_2)(dy) \\
&= |T_1|^{-1/2} \int_H e^{-\sigma \langle T_1^{-1/2} (y - \mu_1),\, T_1^{-1/2} (y - \mu_1) \rangle}\, N(\mu_2, C_2)(dy) \\
&= |T_1|^{-1/2} \int_H e^{-\sigma \|z\|_H^2}\, N\bigl( T_1^{-1/2} (\mu_2 - \mu_1),\, T_1^{-1/2} C_2 T_1^{-1/2} \bigr)(dz) \\
&= |T_1|^{-1/2}\, \bigl| I + 2\sigma T_1^{-1/2} C_2 T_1^{-1/2} \bigr|^{-1/2}\, e^{-\sigma \bigl\langle (I + 2\sigma T_1^{-1/2} C_2 T_1^{-1/2})^{-1} T_1^{-1/2} (\mu_2 - \mu_1),\; T_1^{-1/2} (\mu_2 - \mu_1) \bigr\rangle} \\
&= |T_1|^{-1/2}\, \bigl| I + 2\sigma T_1^{-1} C_2 \bigr|^{-1/2}\, e^{-\sigma \bigl\langle T_1^{-1/2} (I + 2\sigma T_1^{-1/2} C_2 T_1^{-1/2})^{-1} T_1^{-1/2} (\mu_2 - \mu_1),\; (\mu_2 - \mu_1) \bigr\rangle} \\
&= |I + 2\sigma (C_1 + C_2)|^{-1/2}\, e^{-\sigma \langle (I + 2\sigma (C_1 + C_2))^{-1} (\mu_2 - \mu_1),\, (\mu_2 - \mu_1) \rangle},
\end{aligned}
\]
and thus
\[
\begin{aligned}
\|m_{P_1} - m_{P_2}\|_{H_k}^2 &= \|m_{P_1}\|_{H_k}^2 + \|m_{P_2}\|_{H_k}^2 - 2\, \langle m_{P_1}, m_{P_2} \rangle_{H_k} \\
&= |I + 4\sigma C_1|^{-1/2} + |I + 4\sigma C_2|^{-1/2} - 2\, |I + 2\sigma (C_1 + C_2)|^{-1/2}\, e^{-\sigma \langle (I + 2\sigma (C_1 + C_2))^{-1} (\mu_2 - \mu_1),\, (\mu_2 - \mu_1) \rangle}.
\end{aligned}
\]
By invoking Lemma 15 we have
\[
|I + 4\sigma C_1|^{-1/2} + |I + 4\sigma C_2|^{-1/2} \geq 2\, |I + 2\sigma (C_1 + C_2)|^{-1/2},
\]
with equality if and only if $C_1 = C_2$. So $\|m_{P_1} - m_{P_2}\|_{H_k}^2 = 0$ if and only if $\mu_1 = \mu_2$ and $C_1 = C_2$. Hence the Gaussian kernel is characteristic for the family of Gaussian distributions.
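
This closed form can be compared against an empirical (V-statistic) estimate of the squared MMD in finite dimension; a sketch with arbitrary parameters, again our own illustration rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, n = 3, 0.4, 2000

def rand_cov(rng, d):
    A = rng.normal(size=(d, d))
    return A @ A.T / d

mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
C1, C2 = rand_cov(rng, d), rand_cov(rng, d)
X = rng.multivariate_normal(mu1, C1, size=n)
Y = rng.multivariate_normal(mu2, C2, size=n)

def gram(a, b):
    # Gaussian kernel Gram matrix exp(-sigma * ||a_i - b_j||^2)
    d2 = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sigma * d2)

emp = gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

I, S = np.eye(d), np.eye(d) + 2 * sigma * (C1 + C2)
diff = mu2 - mu1
closed = (np.linalg.det(I + 4 * sigma * C1) ** -0.5
          + np.linalg.det(I + 4 * sigma * C2) ** -0.5
          - 2 * np.linalg.det(S) ** -0.5
          * np.exp(-sigma * diff @ np.linalg.solve(S, diff)))
print(emp, closed)                    # agree up to Monte Carlo error
```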

A.5 Proof of Proposition 7

Proof. We first give a proof for the product-kernel; a proof for the sum-kernel follows the same approach. Let $\mathcal{P}$ be the collection of probability measures on a separable Hilbert space $H$, and $k(\cdot, \cdot) : H \times H \to \mathbb{R}$ a characteristic kernel on $H$. Consider the kernel mean with the product-kernel,
\[
m^k : \mathcal{P}^n \to H_k^n, \qquad \otimes_{j=1}^n P_j \mapsto m_{\otimes_{j=1}^n P_j}(x_1, \ldots, x_n),
\]
such that for any $x_1, \ldots, x_n \in H$,
\[
m_{\otimes_{j=1}^n P_j}(x_1, \ldots, x_n) := \int_{H^n} \left( \prod_{i=1}^n k(x_i, y_i) \right) \otimes_{j=1}^n P_j(dy_j) = \prod_{i=1}^n \int_H k(x_i, y_i)\, P_i(dy_i) = \prod_{i=1}^n m_{P_i}(x_i).
\]
Let $P, Q \in \mathcal{P}^n$, i.e. $P = \otimes_{j=1}^n P_j$ and $Q = \otimes_{j=1}^n Q_j$, be such that $P \neq Q$. Since $k$ is characteristic on $H$, there exists $1 \leq i \leq n$ such that $P_i \neq Q_i$ and $m_{P_i}(\cdot) \neq m_{Q_i}(\cdot)$; thus there exists $(x_n) \in H^n$ such that $\prod_{i=1}^n m_{P_i}(x_i) \neq \prod_{i=1}^n m_{Q_i}(x_i)$.

Similarly, for the sum-kernel let
\[
m^k : \mathcal{P}^n \to H_k^n, \qquad \otimes_{j=1}^n P_j \mapsto m_{\otimes_{j=1}^n P_j}(x_1, \ldots, x_n),
\]
such that for any $x_1, \ldots, x_n \in H$,
\[
m_{\otimes_{j=1}^n P_j}(x_1, \ldots, x_n) := \int_{H^n} \left( \sum_{i=1}^n k(x_i, y_i) \right) \otimes_{j=1}^n P_j(dy_j) = \sum_{i=1}^n \int_H k(x_i, y_i)\, P_i(dy_i) = \sum_{i=1}^n m_{P_i}(x_i).
\]
Again, let $P, Q \in \mathcal{P}^n$ with $P = \otimes_{j=1}^n P_j$, $Q = \otimes_{j=1}^n Q_j$ and $P \neq Q$. Since $k$ is characteristic on $H$, there exists $1 \leq i \leq n$ such that $P_i \neq Q_i$ and $m_{P_i}(\cdot) \neq m_{Q_i}(\cdot)$; thus there exists $(x_n) \in H^n$ such that $\sum_{i=1}^n m_{P_i}(x_i) \neq \sum_{i=1}^n m_{Q_i}(x_i)$.

Acknowledgements

The first author is grateful to the Graduate Office of the University of Isfahan for their support. Part of this work was done while Saeed Hayati was visiting the Institute of Statistical Mathematics with the support of the Research Organization of Information and Systems. KF has been supported in part by JSPS KAKENHI 18K19793. Afshin Parvardeh gratefully thanks Professor Victor Panaretos and EPFL in Switzerland for the kind hospitality he received during his sabbatical leave at EPFL, during which part of this work was prepared.

References

[1] J. R. Carey, P. Liedo, D. Orozco, and J. W. Vaupel. Slowing of mortality rates at older ages in large medfly cohorts. Science, 258(5081):457-461, 1992. doi: 10.1126/science.1411540.

[2] Jeng-Min Chiou, Hans-Georg Müller, Jane-Ling Wang, and James R. Carey. A functional multiplicative effects model for longitudinal data, with application to reproductive histories of female medflies. Statist. Sinica, 13(4):1119-1133, 2003.

[3] Hyunphil Choi and Matthew Reimherr. A geometric approach to confidence regions and bands for functional parameters. J. R. Stat. Soc. Ser. B. Stat. Methodol., 80(1):239-260, 2018. doi: 10.1111/rssb.12239.

[4] John B. Conway. A Course in Point Set Topology. Undergraduate Texts in Mathematics. Springer, Cham, 2014. doi: 10.1007/978-3-319-02368-7.

[5] Aurore Delaigle and Peter Hall. Defining probability density for a distribution of random functions. Ann. Statist., 38(2):1171-1193, 2010. doi: 10.1214/09-AOS741.

[6] Tepper Gill, Aleks Kirtadze, Gogi Pantsulaia, and Anatolij Plichko. Existence and uniqueness of translation invariant measures in separable Banach spaces. Funct. Approx. Comment. Math., 50(2):401-419, 2014. doi: 10.7169/facm/2014.50.2.12.

[7] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc., 102(477):359-378, 2007. doi: 10.1198/016214506000001437.

[8] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723-773, 2012.

[9] Sonja Greven and Fabian Scheipl. A general framework for functional regression modelling. Statistical Modelling, 17(1-2):1-35, 2017. doi: 10.1177/1471082X16681317.

[10] Jia Guo, Bu Zhou, and Jin-Ting Zhang. Testing the equality of several covariance functions for functional data: a supremum-norm based test. Comput. Statist. Data Anal., 124:15-26, 2018. doi: 10.1016/j.csda.2018.02.002.

[11] Jia Guo, Bu Zhou, and Jin-Ting Zhang. New tests for equality of several covariance functions for functional data. J. Amer. Statist. Assoc., 114(527):1251-1263, 2019. doi: 10.1080/01621459.2018.1483827.

[12] Z. Harchaoui, F. Bach, O. Cappé, and E. Moulines. Kernel-based methods for hypothesis testing: a unified view. IEEE Signal Processing Magazine, 30(4):87-97, 2013.

[13] Zaïd Harchaoui, Eric Moulines, and Francis R. Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems 21, pages 609-616. Curran Associates, Inc., 2009.

[14] Tailen Hsing and Randall Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester, 2015. doi: 10.1002/9781118762547.

[15] Roger Koenker and Olga Geling. Reappraising medfly longevity: a quantile regression survival analysis. J. Amer. Statist. Assoc., 96(454):458-468, 2001. doi: 10.1198/016214501753168172.

[16] Piotr Kokoszka and Matthew Reimherr. Discussion of 'A general framework for functional regression modelling' by Greven and Scheipl. Stat. Model., 17(1-2):45-49, 2017. doi: 10.1177/1471082X16681331.

[17] Zhenhua Lin, Hans-Georg Müller, and Fang Yao. Mixture inner product spaces and their application to functional data analysis. Ann. Statist., 46(1):370-400, 2018. doi: 10.1214/17-AOS1553.

[18] Stefania Maniglia and Abdelaziz Rhandi. Gaussian measures on separable Hilbert spaces and applications. Quaderni di Matematica, 2004(1), 2004.

[19] Hà Quang Minh. Infinite-dimensional Log-Determinant divergences between positive definite trace class operators. Linear Algebra Appl., 528:331-383, 2017. doi: 10.1016/j.laa.2016.09.018.

[20] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: a review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.

[21] Hans-Georg Müller and Jane-Ling Wang. Statistical tools for the analysis of nutrition effects on the survival of cohorts. Pages 191-203. Springer US, Boston, MA, 1998. doi: 10.1007/978-1-4899-1959-5_12.

[22] Wenliang Pan, Yuan Tian, Xueqin Wang, and Heping Zhang. Ball divergence: nonparametric two sample test. Ann. Statist., 46(3):1109-1137, 2018. doi: 10.1214/17-AOS1579.

[23] Qing Shen and Julian Faraway. An F test for linear models with functional responses. Statist. Sinica, 14(4):1239-1257, 2004.

[24] Carl-Johann Simon-Gabriel and Bernhard Schölkopf. Kernel distribution embeddings: universal kernels, characteristic kernels and kernel metrics on distributions. J. Mach. Learn. Res., 19: Paper No. 44, 2018.

[25] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13-31. Springer, Berlin, Heidelberg, 2007.

[26] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517-1561, 2010.

[27] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93, 2001.

[28] Ingo Steinwart and Johanna F. Ziegel. Strictly proper kernel scores and characteristic kernels on compact spaces. Applied and Computational Harmonic Analysis, 2019. doi: 10.1016/j.acha.2019.11.005.

[29] Minh Tang, Avanti Athreya, Daniel L. Sussman, Vince Lyzinski, and Carey E. Priebe. A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23(3):1599-1630, 2017. doi: 10.3150/15-BEJ789.

[30] Jin-Ting Zhang. Analysis of Variance for Functional Data, volume 127 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL, 2014.

[31] Jin-Ting Zhang and Jianwei Chen. Statistical inferences for functional data. Ann. Statist., 35(3):1052-1079, 2007. doi: 10.1214/009053606000001505.

[32] Jin-Ting Zhang and Xuehua Liang. One-way ANOVA for functional data via globalizing the pointwise F-test. Scand. J. Stat., 41(1):51-71, 2014. doi: 10.1111/sjos.12025.

[33] Jin-Ting Zhang, Ming-Yen Cheng, Hau-Tieng Wu, and Bu Zhou. A new test for functional one-way ANOVA with applications to ischemic heart screening. Comput. Statist. Data Anal., 132:3-17, 2019. doi: 10.1016/j.csda.2018.05.004.
