Identifiability of interaction kernels in mean-field equations of interacting particles

Quanjun Lang˚, Fei Lu∗

Abstract. We study the identifiability of the interaction kernels in mean-field equations for interacting particle systems. The key is to identify function spaces on which a probabilistic loss functional has a unique minimizer. We prove that identifiability holds on any subspace of two reproducing kernel Hilbert spaces (RKHS), whose reproducing kernels are intrinsic to the system and are data-adaptive. Furthermore, identifiability holds on two ambient $L^2$ spaces if and only if the integral operators associated with the reproducing kernels are strictly positive. Thus, the inverse problem is ill-posed in general. We also discuss the implications of identifiability in computational practice.

Keywords: identifiability, mean-field, interacting particle systems, RKHS, regularization

Contents

1 Introduction

2 Main results
  2.1 Definition of identifiability
  2.2 The ambient function spaces of regression
  2.3 Main results

3 Radial interaction kernel
  3.1 The loss function
  3.2 The RKHS space and L2 space
  3.3 The weighted L2 space and RKHS

4 General interaction kernels
  4.1 The loss function
  4.2 The RKHS space and L2 space
  4.3 The weighted L2 space and RKHS
  4.4 Examples and general function spaces

5 Identifiability in computational practice
  5.1 Nonparametric regression in practice
  5.2 Identifiability and ill-conditioned normal matrix
  5.3 Two regularization norms
  5.4 Regularization and the RKHSs

A Appendix

∗Department of Mathematics, Johns Hopkins University. Email: [email protected]; [email protected]

1 Introduction

We study the identifiability of the interaction kernel $K_{\mathrm{true}}$ from data consisting of a solution of the mean-field equation of interacting particles

\[
\partial_t u = \nu \Delta u + \mathrm{div}\big[u\,(K_{\mathrm{true}} * u)\big], \qquad x \in D \subset \mathbb{R}^d, \tag{1.1}
\]
\[
\partial u|_{\partial D} = 0, \qquad \nabla u|_{\partial D} = 0,
\]

where the domain $D \subset \mathbb{R}^d$ is connected with smooth boundary and $K_{\mathrm{true}} = \nabla\Phi_{\mathrm{true}}: \mathbb{R}^d \to \mathbb{R}^d$ with $\Phi_{\mathrm{true}}$ being the interaction potential. Here $*$ denotes the convolution

\[
(K_{\mathrm{true}} * u)(x,t) = \int_{\mathbb{R}^d} K_{\mathrm{true}}(y)\, u(x - y, t)\, dy.
\]
In particular, if $\Phi_{\mathrm{true}}$ is radial, that is, $\Phi_{\mathrm{true}}(x) = \Phi_{\mathrm{true}}(|x|)$, we have

\[
K_{\mathrm{true}}(x) = \nabla\big(\Phi_{\mathrm{true}}(|x|)\big) = \phi_{\mathrm{true}}(|x|)\,\frac{x}{|x|}, \quad \text{with } \phi_{\mathrm{true}}(r) = \Phi_{\mathrm{true}}'(r), \tag{1.2}
\]
where $\phi_{\mathrm{true}}$ is also called the interaction kernel for simplicity. In either case, estimating the interaction function $K_{\mathrm{true}} = \nabla\Phi_{\mathrm{true}}$ is equivalent to estimating the potential $\Phi_{\mathrm{true}}$ up to a constant.

The mean-field equation (also called the aggregation-diffusion equation [6]) describes the population density of large systems of interacting particles or agents:

\[
dX_t^i = \frac{1}{N}\sum_{j \neq i} \phi_{\mathrm{true}}(|X_t^j - X_t^i|)\,\frac{X_t^j - X_t^i}{|X_t^j - X_t^i|}\, dt + \sqrt{2\nu}\, dB_t^i, \qquad \text{for } i = 1, \dots, N, \tag{1.3}
\]
where $X_t^i \in \mathbb{R}^d$ represents the position of agent $i$ at time $t$. Denote by $u^{(N)}(x,t)$ the probability density of particle $X_t^1$ at time $t$. Under suitable conditions on the potential $\Phi_{\mathrm{true}}$, we have $u^{(N)} \to u$ in relative entropy of the steady state as the number of particles $N \to \infty$ (see, e.g., [23, 22, 7, 12]).

Systems of interacting particles or agents are widely used in many areas of science and engineering (see [2, 30, 25, 1] and the references therein). Motivated by the applications, there has been increasing interest in inferring the interaction kernel (or the interaction potential) in a nonparametric fashion for generality [5, 21, 19, 20, 16]. The pioneering work [5] minimizes a loss functional based on matching the true and predicted trajectories of particles (see also the recent development for mean-field games [4]). When the system has finitely many particles and the data consist of multiple trajectories of all particles, the efforts [21, 19, 20, 18, 17] minimize loss functionals based on the likelihood, establishing computationally efficient algorithms that yield estimators achieving the minimax rate of convergence. When the number of particles is large (e.g., $10^6$ or $10^{23}$, the Avogadro number), it becomes impractical to collect trajectories of all particles, but one can often observe the population density, i.e., the solution of the mean-field equation. This leads to the inverse problem of inferring the interaction kernel of the mean-field equation from solutions of the PDE. In [16], based on a probabilistic loss functional, the authors introduced an efficient nonparametric regression algorithm: it completely avoids the costly numerical simulations of the mean-field equation or the particle system (which are necessary in a trajectory matching method); instead, it estimates the kernel simply by least squares regression. However, the fundamental issue of identifiability was left open.

In this study, we provide a complete characterization of the identifiability and discuss its implications for computational practice. Based on a probabilistic loss functional, we seek the function spaces on which it has a unique minimizer. We show that identifiability holds on any subspace of two reproducing kernel Hilbert spaces (RKHS), whose reproducing kernels are intrinsic to the system and adaptive to data. Furthermore, identifiability holds on two ambient $L^2$ spaces if and only if the integral operators associated with the reproducing kernels are strictly positive. Thus, the inverse problem is ill-posed in general because the compact integral operators have unbounded inverses.

In computational practice, this implies that the regression matrix becomes ill-conditioned as the dimension of the hypothesis space increases. Thus, regularization is necessary. The two ambient $L^2$ spaces provide natural norms for Tikhonov regularization. In the context of singular value decomposition (SVD) analysis, we demonstrate numerically that the weighted $L^2$ norm is preferable to the unweighted $L^2$ norm because it leads to larger eigenvalues. Our results are applicable to both radial and general vector-valued interaction kernels.

This study shows that the identifiability is universal for mean-field equations and for systems with $N$ interacting particles (which are SDEs) [18]. The key is the probabilistic loss functional from the likelihood. There are two technical differences: multiple-trajectory data are necessary for the SDEs (without using ergodicity), and the lower bound is slightly more flexible, with a factor $\frac{N-1}{N-2}$ (see [18, Proposition 2.1]). However, as both $N$ and the number of trajectories go to infinity, the identifiability in [18] agrees with the identifiability for mean-field equations in this study.

A key feature of our identifiability theory is that the RKHSs and their reproducing kernels are intrinsic to the system and are data-adaptive. This is different from the widely used kernel regression techniques [9, 26], where pre-selected parametric kernels are fitted to data. These intrinsic RKHSs invite further study to reveal the structure of the interaction kernels and applications to system classification. They also invite future study of learning functions hidden inside a linear functional, when the data consist of outputs of the functional instead of values of the function.

The exposition in our manuscript proceeds as follows. In Section 2, we provide a mathematical definition of identifiability and describe the main results. Section 3 proves identifiability for radial interaction kernels and Section 4 extends the results to general non-radial cases. We discuss in Section 5 the implications of identifiability for computational practice. Appendix A provides a brief review of positive definite functions and reproducing kernel Hilbert spaces.

2 Main results

Throughout the paper, we assume that the data u is a regular solution:

Assumption 2.1. Assume that $u$ is $C^2$ in space and $C^1$ in time and bounded above by $C_u$. Such a solution exists when the interaction function is locally Lipschitz with polynomial growth:

\[
|K_{\mathrm{true}}(x) - K_{\mathrm{true}}(y)| \le C\,(|x - y| \wedge 1)\,(1 + |x|^m + |y|^m), \qquad \forall x, y \in \mathbb{R}^d,
\]
for some constant $C > 0$. We refer to [29] for Lipschitz kernels, to [23, 22] for uniformly convex kernels that lead to convergence to equilibrium, and to [7, 12, 13] and the references therein for general (including singular) kernels.

2.1 Definition of identifiability

Our goal is to study the identifiability of the interaction kernel $K_{\mathrm{true}}$ (or $\phi_{\mathrm{true}}$ in the radial case) from data consisting of a solution $u$ on $[0,T]$. Since we know little about the interaction kernel a priori, it is natural to consider nonparametric approaches, in which one finds an estimator by minimizing a loss functional over a hypothesis space. Thus, we define identifiability as the uniqueness of the minimizer of a loss functional over a proper function space.

Definition 2.2 (Identifiability). Given data consisting of a solution $(u(x,t),\, x \in \mathbb{R}^d,\, t \in [0,T])$ to the mean-field equation (1.1), we say that the interaction kernel $K_{\mathrm{true}}$ (or $\phi_{\mathrm{true}}$ when the potential is radial) is identifiable in a hypothesis space $\mathcal{H}$ if there exists a data-based loss functional such that $K_{\mathrm{true}}$ is the unique minimizer of the loss functional on $\mathcal{H}$.

Identifiability consists of three elements: a loss functional $\mathcal{E}$, a function space of learning (hypothesis space) $\mathcal{H}$, and uniqueness of the minimizer of the loss functional over the hypothesis space. In [16], the authors introduce a probabilistic loss functional based on the likelihood of the diffusion process of the mean-field equation. An oracle version using the true interaction kernel is as follows (see Theorem 3.2 and Proposition 4.1 for the practical version based solely on data).

Definition 2.3 (Oracle loss function). Let $u$ be a solution of the mean-field equation (1.1) on $[0,T]$. For a candidate interaction potential $\Psi: \mathbb{R}^d \to \mathbb{R}$, we denote $K_\psi(x) = \nabla\Psi(x)$. If furthermore $\Psi$ is radial, we denote $\psi(r) = \Psi'(r)$ with an abuse of notation. In this case, $K_\psi(x) = \psi(|x|)\,\frac{x}{|x|}$. We consider the loss functional

\[
\text{Radial:}\quad \mathcal{E}(\psi) = \langle\!\langle \psi, \psi \rangle\!\rangle - 2\langle\!\langle \psi, \phi_{\mathrm{true}} \rangle\!\rangle, \qquad
\text{General:}\quad \mathcal{E}(K_\psi) = \langle\!\langle K_\psi, K_\psi \rangle\!\rangle - 2\langle\!\langle K_\psi, K_{\mathrm{true}} \rangle\!\rangle, \tag{2.1}
\]
where the bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ is defined by
\[
\text{Radial:}\quad \langle\!\langle \varphi, \psi \rangle\!\rangle := \frac{1}{T}\int_0^T\!\!\int_{\mathbb{R}^d} u\,(K_\varphi * u)\cdot (K_\psi * u)\, dx\, dt, \qquad
\text{General:}\quad \langle\!\langle K_\varphi, K_\psi \rangle\!\rangle := \frac{1}{T}\int_0^T\!\!\int_{\mathbb{R}^d} u\,(K_\varphi * u)\cdot (K_\psi * u)\, dx\, dt, \tag{2.2}
\]
assuming that the integrals are well-defined.

Note that in either case, the loss functional is quadratic. Hence its minimizer on a finite-dimensional hypothesis space can be estimated by least squares.

Remark 2.4 (Least squares estimator). For any space $\mathcal{H} = \mathrm{span}\{\phi_1, \dots, \phi_n\}$ such that
\[
A_{ij} = \langle\!\langle \phi_i, \phi_j \rangle\!\rangle, \qquad b_i = \langle\!\langle \phi_i, \phi_{\mathrm{true}} \rangle\!\rangle \tag{2.3}
\]
are well-defined, a minimizer of the loss functional $\mathcal{E}$ on $\mathcal{H}$ is given by least squares:
\[
\widehat{\phi}_{\mathcal{H}} = \sum_{i=1}^n \widehat{c}_i \phi_i, \quad \text{where } \widehat{c} = \arg\min_{c \in \mathbb{R}^n} \mathcal{E}(c), \qquad \mathcal{E}(c) = c^\top A c - 2 c^\top b.
\]
In particular, when $A$ is invertible, we have $\widehat{c} = A^{-1} b$ and $\widehat{\phi}_{\mathcal{H}}$ is the unique minimizer of $\mathcal{E}$ on $\mathcal{H}$.

By Remark 2.4, identifiability on a finite-dimensional function space is equivalent to the non-degeneracy of the normal matrix $A$. In general, as the next lemma shows, identifiability is equivalent to the non-degeneracy of the bilinear form.
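For readers who prefer a concrete picture, the following minimal sketch (Python with NumPy; the routine names and the generic `bilinear_form`/`rhs_form` callbacks are hypothetical stand-ins for the data-based integrals in (2.2)-(2.3), not part of the paper) illustrates how Remark 2.4 turns the minimization of the quadratic loss into a small linear solve.

```python
import numpy as np

def assemble_normal_system(basis, bilinear_form, rhs_form):
    """Assemble A_ij = <<phi_i, phi_j>> and b_i = <<phi_i, phi_true>> for a finite basis.

    `bilinear_form(f, g)` and `rhs_form(f)` are assumed to approximate the
    data-based integrals in (2.2)-(2.3), e.g. by Riemann sums over the data grid.
    """
    n = len(basis)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i, phi_i in enumerate(basis):
        b[i] = rhs_form(phi_i)
        for j, phi_j in enumerate(basis):
            A[i, j] = bilinear_form(phi_i, phi_j)
    return A, b

def least_squares_estimator(A, b, reg=0.0):
    """Return coefficients c minimizing c^T A c - 2 c^T b.

    When A is ill-conditioned, a small Tikhonov term reg*I (or a pseudo-inverse)
    stabilizes the solve, anticipating the regularization discussion in Section 5.
    """
    n = A.shape[0]
    return np.linalg.solve(A + reg * np.eye(n), b)
```

The estimated kernel is then recovered as the linear combination of the basis functions with these coefficients.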

Lemma 2.5. Identifiability holds on a linear space $\mathcal{H}$ if the bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ is well-defined and non-degenerate on $\mathcal{H}$. That is, the true kernel $\phi_{\mathrm{true}} \in \mathcal{H}$ (or $K_{\mathrm{true}} \in \mathcal{H}$) is the unique minimizer of the loss functional $\mathcal{E}$ in (2.1) on $\mathcal{H}$ if and only if $\langle\!\langle h, h \rangle\!\rangle > 0$ for all nonzero $h \in \mathcal{H}$.

Proof. Since the radial case and the general case have the same bilinear form, we consider only the radial case. Note that
\[
\mathcal{E}(\psi) = \langle\!\langle \psi - \phi_{\mathrm{true}}, \psi - \phi_{\mathrm{true}} \rangle\!\rangle - \langle\!\langle \phi_{\mathrm{true}}, \phi_{\mathrm{true}} \rangle\!\rangle. \tag{2.4}
\]
Thus, $\phi_{\mathrm{true}} \in \mathcal{H}$ is the unique minimizer of $\mathcal{E}$ if and only if $\langle\!\langle h, h \rangle\!\rangle > 0$ for all nonzero $h \in \mathcal{H}$.

The bilinear form is non-degenerate iff its Fréchet derivative, which is a linear operator on a suitable space, is invertible. Thus, our goal in studying identifiability is to find function spaces on which the bilinear form's derivative is invertible.

To find the function spaces, we first write the bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ explicitly. In the case of radial interaction, we have (see (3.7) for its derivation)
\[
\langle\!\langle \varphi, \psi \rangle\!\rangle = \int_{\mathbb{R}^+}\!\int_{\mathbb{R}^+} \varphi(r)\,\psi(s)\, \overline{G}_T(r,s)\, dr\, ds, \tag{2.5}
\]
with the integral kernel $\overline{G}_T$ given by
\[
\overline{G}_T(r,s) = \frac{1}{T}\int_0^T G_t(r,s)\, dt, \quad \text{with} \quad
G_t(r,s) := \int_{\mathbb{S}^d}\!\int_{\mathbb{S}^d}\!\int_{\mathbb{R}^d} \xi\cdot\eta\, (rs)^{d-1}\, u(x - r\xi, t)\, u(x - s\eta, t)\, u(x,t)\, dx\, d\xi\, d\eta, \tag{2.6}
\]
where $\mathbb{S}^d$ denotes the unit sphere in $\mathbb{R}^d$. In the general case, we have
\[
\langle\!\langle K_\varphi, K_\psi \rangle\!\rangle = \int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} K_\varphi(y)\cdot K_\psi(z)\, \overline{F}_T(y,z)\, dy\, dz,
\]
with an integral kernel (see (4.2) for its derivation)
\[
\overline{F}_T(y,z) = \frac{1}{T}\int_0^T F_t(y,z)\, dt, \quad \text{with} \quad F_t(y,z) := \int_{\mathbb{R}^d} u(x - y, t)\, u(x - z, t)\, u(x,t)\, dx. \tag{2.7}
\]
Hence the bilinear form's Fréchet derivative can be an integral operator either on $L^2(\mathbb{R}^+)$ with integral kernel $\overline{G}_T$ in the radial case, or on $L^2(\mathbb{R}^d)$ with integral kernel $\overline{F}_T$ in the general case.

However, those spaces do not reflect information from the data, not even the support of these kernels (note that both $\overline{G}_T$ and $\overline{F}_T$ are determined by the data $(u(x,t),\, x \in \mathbb{R}^d,\, t \in [0,T])$). For example, when the integral kernel has bounded support due to the data, little can be said outside of the support in a nonparametric regression method using a local basis. This prompts us to seek data-based function spaces of learning in the next section.

2.2 The ambient function spaces of regression

We introduce two ambient function spaces of learning (via nonparametric regression with a local basis): a weighted $L^2$ space with a data-based measure, and an unweighted $L^2$ space on the support of the measure. The function space of learning should be data-based because we can only learn the function in the region that the data explore. Intuitively, the convolution $K_\psi * u$ suggests that the independent variable of the interaction kernel $\psi$ is $|x - y|$ (or $x - y$ for a general kernel $K$), with $x$ and $y$ from the data $u(x,t)$ and $u(y,t)$. We show next that this intuition agrees with the probabilistic representation of the mean-field equation: $\psi$ has independent variable $|X_t - Y_t|$ (or $X_t - Y_t$ for the general case), where $X_t$ and $Y_t$ are independent random variables with probability density $u(\cdot, t)$.

Recall that Equation (1.1) is the Kolmogorov forward equation (also called the Fokker-Planck equation) of the following nonlinear stochastic differential equation
\[
\begin{cases}
d\bar{X}_t = -[K_{\mathrm{true}} * u](\bar{X}_t, t)\, dt + \sqrt{2\nu}\, dB_t, \\
\mathcal{L}(\bar{X}_t) = u(x,t)\, dx,
\end{cases} \tag{2.8}
\]
for all $t \ge 0$. Here $\mathcal{L}(\bar{X}_t)$ denotes the law of $\bar{X}_t$, with probability density $u(\cdot, t)$.

Let $(\bar{X}_t', t \ge 0)$ be an independent copy of $(\bar{X}_t, t \ge 0)$ and denote $\bar{r}_t = |\bar{X}_t - \bar{X}_t'|$. We can write the convolution $K_\psi * u$ as
\[
[K_\psi * u](\bar{X}_t, t) = \mathbb{E}\big[K_\psi(\bar{X}_t - \bar{X}_t') \,\big|\, \bar{X}_t\big] = \mathbb{E}\Big[\psi(\bar{r}_t)\, \frac{\bar{X}_t - \bar{X}_t'}{\bar{r}_t} \,\Big|\, \bar{X}_t\Big]. \tag{2.9}
\]
The above probabilistic representation of $K_\psi * u$ indicates that the independent variable of $\psi$ has the same distribution as $\{|\bar{X}_t - \bar{X}_t'|,\, t \in [0,T]\}$ (or $\{\bar{X}_t - \bar{X}_t',\, t \in [0,T]\}$ for a general kernel $K$). Let $\bar\rho_T$ denote the average of the probability densities of $\{|\bar{X}_t - \bar{X}_t'|,\, t \in [0,T]\}$ in the case of a radial kernel, or of $\{\bar{X}_t - \bar{X}_t',\, t \in [0,T]\}$ in the case of a general kernel:
\[
\begin{aligned}
\text{Radial case:}\quad & \bar\rho_T(r) = \frac{1}{T}\int_0^T \bar\rho_t(r)\, dt = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d}\!\int_{\mathbb{S}^d} r^{d-1}\, u(y - r\xi, t)\, u(y,t)\, d\xi\, dy\, dt, \\
\text{General case:}\quad & \bar\rho_T(x) = \frac{1}{T}\int_0^T \bar\rho_t(x)\, dt = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} u(y - x, t)\, u(y,t)\, dy\, dt,
\end{aligned} \tag{2.10}
\]
where $\bar\rho_t$ denotes the density of $|\bar{X}_t - \bar{X}_t'|$ (or $\bar{X}_t - \bar{X}_t'$ for the general case) for each $t$. Under Assumption 2.1, the probability density function $\bar\rho_T$ is $C^2$. We denote the support of $\bar\rho_T$ by $\mathcal{X}$:
\[
\mathcal{X} := \mathrm{supp}(\bar\rho_T).
\]
Note that $\mathcal{X}$ is bounded if $u(x,t)$ has a bounded support uniformly for all $t \in [0,T]$.

Thus, the ambient function space of nonparametric regression is $L^2(\mathcal{X}, \bar\rho_T(r)dr)$ for radial kernels and $L^2(\mathcal{X}, \bar\rho_T(x)dx)$ for general kernels. For simplicity of notation, hereafter we denote both $L^2(\mathcal{X}, \bar\rho_T(r)dr)$ and $L^2(\mathcal{X}, \bar\rho_T(x)dx)$ by $L^2(\bar\rho_T)$ as long as the context is clear.

Another natural function space of learning is $L^2(\mathcal{X})$, because the integral kernel $\overline{G}_T$ in (2.6) (or $\overline{F}_T$ in (2.7)) leads to an integral operator on $L^2(\mathcal{X})$ which is the Fréchet derivative of the loss functional.

Note that $L^2(\mathcal{X}) \subset L^2(\bar\rho_T)$. We will show that both function spaces are reasonable for learning and that they lead to different RKHSs on which identifiability holds.

2.3 Main results

With the loss functional in Definition 2.3 and the ambient function spaces of learning in Section 2.2, we can now define the Fréchet derivative of the loss functional. The function spaces of identifiability are those on which the Fréchet derivative is an invertible operator.

We provide a complete characterization of the function spaces of identifiability in both $L^2(\mathcal{X})$ and $L^2(\bar\rho_T)$. In the radial case, we have:

• In $L^2(\mathcal{X})$, identifiability holds on any subspace of the RKHS $H_G$ with reproducing kernel $\overline{G}_T$ in (2.6). Furthermore, identifiability holds on $L^2(\mathcal{X})$ iff $H_G$ is dense in it, equivalently, the integral operator $\mathcal{L}_{\overline{G}_T}$ on $L^2(\mathcal{X})$ with integral kernel $\overline{G}_T$ is strictly positive. See Theorems 3.6-3.7 and details in Section 3.2.

• In $L^2(\bar\rho_T)$, identifiability holds on any subspace of the RKHS $H_R$ with reproducing kernel $R_T(r,s) = \frac{\overline{G}_T(r,s)}{\bar\rho_T(r)\,\bar\rho_T(s)}$. Furthermore, identifiability holds on $L^2(\bar\rho_T)$ iff $H_R$ is dense in it, equivalently, the integral operator $\mathcal{L}_{R_T}$ on $L^2(\bar\rho_T)$ with integral kernel $R_T$ is strictly positive. See Theorems 3.9-3.10 and details in Section 3.3.

Similar results hold in the general case (see Section 4).

We point out that identifiability is weaker than well-posedness. Identifiability only ensures that the loss functional has a unique minimizer on a hypothesis function space $\mathcal{H}$. It does not ensure well-posedness of the inverse problem unless the bilinear form satisfies a coercivity condition on $\mathcal{H}$ (see Remark 3.11). When $\mathcal{H}$ is finite-dimensional, identifiability is equivalent to the invertibility of the normal matrix (see Remark 2.4), and identifiability implies well-posedness in this case. However, when $\mathcal{H}$ is infinite-dimensional, the inverse problem is ill-posed because the inverse of the compact operator $\mathcal{L}_{\overline{G}_T}$ (the derivative of the bilinear form) is unbounded (see Theorem 5.1). Thus, identifiability is weaker than well-posedness.

Table 1: Notations.

                                            Radial case                                                        General (non-radial) case
Interaction potential                       $\Psi(|x|)$ and $\psi = \Psi'$                                     $\Psi(x)$
Interaction kernel                          $K_\psi(x) = \nabla\Psi(|x|) = \psi(|x|)\,\frac{x}{|x|}$           $K_\psi(x) = \nabla\Psi(x)$
Loss functional                             $\mathcal{E}(\psi)$                                                $\mathcal{E}(K)$
Density of regression measure               $\bar\rho_T$, $\mathcal{X} = \mathrm{supp}(\bar\rho_T)$, in (2.10)
Function space of learning                  $L^2(\mathcal{X})$ and $L^2(\bar\rho_T)$
Mercer kernel & RKHS in $L^2(\mathcal{X})$  $\overline{G}_T$ in (2.6) and $H_G$                                $\overline{F}_T$ in (2.7) and $H_F$
Mercer kernel & RKHS in $L^2(\bar\rho_T)$   $R_T$ in (3.11) and $H_R$                                          $Q_T$ in (4.6) and $H_Q$

We also discuss the implications of identifiability for the computational practice of estimating the interaction kernel from data. Identifiability theory implies that the regression matrix will become ill-conditioned as the dimension of the hypothesis space increases (see Theorem 5.1). Thus, regularization becomes necessary. We compare two regularization norms, the norms of $L^2(\bar\rho_T)$ and $L^2(\mathcal{X})$, in the context of singular value decomposition (SVD) analysis. Numerical tests suggest that regularization based on the $L^2(\bar\rho_T)$ norm leads to a slightly better conditioned inversion (see Figure 1). We also show that regularization by truncated SVD is equivalent to learning on the low-frequency eigenspace of the RKHSs.

We shall use the notations in Table 1.

3 Radial interaction kernel

We start with the identifiability of radial interaction kernels. We first derive the probabilistic loss functional in (3.4) (see Section 3.1). Then, in Sections 3.2-3.3, we discuss the identifiability in $L^2(\mathcal{X})$ and in $L^2(\bar\rho_T)$, the ambient function spaces of learning defined in Section 2.2.

Throughout this section, we let $u$ be a solution of the mean-field equation (1.1) on $[0,T]$. Let $\psi \in L^2(\bar\rho_T)$ with $\bar\rho_T$ in (2.10), and let $\Psi: \mathbb{R}^+ \to \mathbb{R}$ be $\Psi(r) = \int_0^r \psi(s)\, ds$. Denote $K_\psi(x) = \psi(|x|)\,\frac{x}{|x|}$.

3.1 The loss function

We show first that the radial loss functional in (2.1) is well-defined.

Lemma 3.1. For any $\varphi, \psi \in L^2(\bar\rho_T)$, the bilinear form $\langle\!\langle \varphi, \psi \rangle\!\rangle$ in (2.2) satisfies
\[
\langle\!\langle \varphi, \psi \rangle\!\rangle \le \|\psi\|_{L^2(\bar\rho_T)}\, \|\varphi\|_{L^2(\bar\rho_T)}. \tag{3.1}
\]
Also, for any $\psi \in L^2(\bar\rho_T)$, the radial loss functional in (2.1) is bounded above by
\[
\mathcal{E}(\psi) \le \|\psi - \phi_{\mathrm{true}}\|_{L^2(\bar\rho_T)}^2 - \langle\!\langle \phi_{\mathrm{true}}, \phi_{\mathrm{true}} \rangle\!\rangle. \tag{3.2}
\]

Proof. Recall that $u(\cdot, t)$ is the law of $\bar{X}_t$ defined in (2.8). By the definition in (2.2), we have
\[
\begin{aligned}
\langle\!\langle \varphi, \psi \rangle\!\rangle &= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} (K_\varphi * u)\cdot (K_\psi * u)\, u(x,t)\, dx\, dt \\
&= \frac{1}{T}\int_0^T \mathbb{E}\big[K_\varphi * u(\bar{X}_t, t) \cdot K_\psi * u(\bar{X}_t, t)\big]\, dt.
\end{aligned} \tag{3.3}
\]
The integrand in time is controlled by

\[
\begin{aligned}
\mathbb{E}\big[K_\varphi * u(\bar{X}_t, t) \cdot K_\psi * u(\bar{X}_t, t)\big]
&\le \mathbb{E}\big[|K_\varphi * u(\bar{X}_t, t)|^2\big]^{1/2}\, \mathbb{E}\big[|K_\psi * u(\bar{X}_t, t)|^2\big]^{1/2} && \text{(Cauchy-Schwarz)} \\
&\le \mathbb{E}\big[|\mathbb{E}[K_\varphi(\bar{X}_t - \bar{X}_t')\,|\,\bar{X}_t]|^2\big]^{1/2}\, \mathbb{E}\big[|\mathbb{E}[K_\psi(\bar{X}_t - \bar{X}_t')\,|\,\bar{X}_t]|^2\big]^{1/2} && \text{(by (2.9))} \\
&\le \mathbb{E}\big[\mathbb{E}[|K_\varphi(\bar{X}_t - \bar{X}_t')|^2\,|\,\bar{X}_t]\big]^{1/2}\, \mathbb{E}\big[\mathbb{E}[|K_\psi(\bar{X}_t - \bar{X}_t')|^2\,|\,\bar{X}_t]\big]^{1/2} && \text{(by Jensen's inequality)} \\
&= \mathbb{E}\big[|K_\varphi(\bar{X}_t - \bar{X}_t')|^2\big]^{1/2}\, \mathbb{E}\big[|K_\psi(\bar{X}_t - \bar{X}_t')|^2\big]^{1/2}
= \|\varphi\|_{L^2(\bar\rho_t)}\, \|\psi\|_{L^2(\bar\rho_t)}.
\end{aligned}
\]
Then, we obtain (3.1). The upper bound in (3.2) follows from (3.1) and (2.4).

The next theorem shows that the radial loss functional is based on the maximum likelihood ratio of the SDE (2.8) representation of the mean-field equation, and it provides a derivative-free loss functional in practice, which leads to an efficient and robust least squares estimator (see [16]).

Theorem 3.2 (Loss function). The radial loss functional in (2.1) is the expectation of the average (in time) negative log-likelihood of the process $(\bar{X}_t, t \in [0,T])$ in (2.8). If $\Psi \in L^1(\bar\rho_T)$, it equals
\[
\mathcal{E}(\psi) = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \big[\, |K_\psi * u|^2\, u + 2\,\partial_t u\, (\Psi * u) + 2\nu\, \nabla u \cdot (K_\psi * u)\,\big]\, dx\, dt. \tag{3.4}
\]
Furthermore, if $\nabla\cdot K_\psi \in L^1(\bar\rho_T)$, we can replace the integrand $\nabla u \cdot (K_\psi * u)$ by $-u\,(\nabla\cdot K_\psi * u)$.

Proof. We denote by $\mathbb{P}_\psi$ the law of the process $(\bar{X}_t)$ on the path space with initial condition $\bar{X}_0 \sim u(\cdot, 0)$, with the convention that $\mathbb{P}_0$ denotes the Wiener measure. Then, the average negative log-likelihood ratio of a trajectory $\bar{X}_{[0,T]}$ from a law $\mathbb{P}_\psi$ with respect to the Wiener measure is (see, e.g., [15, Section 1.1.4] and [14, Section 3.5])
\[
\mathcal{E}_{\bar{X}_{[0,T]}}(\psi) = -\frac{1}{T}\log \frac{d\mathbb{P}_\psi}{d\mathbb{P}_0}(\bar{X}_{[0,T]}) = \frac{1}{T}\int_0^T \Big( |K_\psi * u(\bar{X}_t, t)|^2\, dt + 2\,\big\langle K_\psi * u(\bar{X}_t, t),\, d\bar{X}_t \big\rangle \Big), \tag{3.5}
\]
where $\frac{d\mathbb{P}_\psi}{d\mathbb{P}_0}$ is the Radon-Nikodym derivative.

Taking expectation and noting that $d\bar{X}_t = -K_{\mathrm{true}} * u(\bar{X}_t, t)\, dt + \sqrt{2\nu}\, dB_t$, we obtain
\[
\mathbb{E}\,\mathcal{E}_{\bar{X}_{[0,T]}}(\psi) = \frac{1}{T}\int_0^T \mathbb{E}\big[\, |K_\psi * u(\bar{X}_t, t)|^2 - 2\, K_\psi * u(\bar{X}_t, t)\cdot K_{\mathrm{true}} * u(\bar{X}_t, t)\,\big]\, dt
= \langle\!\langle \psi, \psi \rangle\!\rangle - 2\langle\!\langle \phi_{\mathrm{true}}, \psi \rangle\!\rangle = \mathcal{E}(\psi), \tag{3.6}
\]
where the second equality follows from (3.3).

To replace the true interaction kernel $K_{\mathrm{true}}$ by data, we resort to the mean-field equation (1.1). Noticing that $K_\psi * u = \nabla\Psi * u = \nabla(\Psi * u)$ for any $\psi = \Psi'$, we have
\[
\begin{aligned}
\int_0^T\!\int_{\mathbb{R}^d} u\,(K_{\mathrm{true}} * u)\cdot (K_\psi * u)\, dx\, dt
&= \int_0^T\!\int_{\mathbb{R}^d} u\,(K_{\mathrm{true}} * u)\cdot (\nabla\Psi * u)\, dx\, dt \\
&= -\int_0^T\!\int_{\mathbb{R}^d} \big(\nabla\cdot[u\,(K_{\mathrm{true}} * u)]\big)\,\Psi * u\, dx\, dt && \text{(integration by parts)} \\
&= -\int_0^T\!\int_{\mathbb{R}^d} (\partial_t u - \nu\Delta u)\,(\Psi * u)\, dx\, dt && \text{(by Equation (1.1))} \\
&= -\int_0^T\!\int_{\mathbb{R}^d} \partial_t u\,(\Psi * u)\, dx\, dt - \nu \int_0^T\!\int_{\mathbb{R}^d} \nabla u\cdot (K_\psi * u)\, dx\, dt && \text{(integration by parts)}.
\end{aligned}
\]
Combining this with (3.6) and (3.3), we obtain the radial loss functional in (3.4). Finally, applying integration by parts again, we can replace the integrand $\nabla u \cdot (K_\psi * u)$ by $-u\,(\nabla\cdot K_\psi * u)$.

Remark 3.3 (Loss functional in terms of expectations). Along with (2.9), we can write the loss functional in (3.4) in terms of expectations:
\[
\mathcal{E}(\psi) = \frac{1}{T}\int_0^T \Big( \mathbb{E}\big[|\mathbb{E}[K_\psi(\bar{X}_t - \bar{X}_t')\,|\,\bar{X}_t]|^2\big] + \partial_t\, \mathbb{E}\big[\Psi(\bar{X}_t - \bar{X}_t')\big] - 2\nu\, \mathbb{E}\big[\Delta\Psi(\bar{X}_t - \bar{X}_t')\big] \Big)\, dt,
\]
where we used the fact that $\mathbb{E}[\Psi(\bar{X}_t - \bar{X}_t')] = \mathbb{E}\big[\mathbb{E}[\Psi(\bar{X}_t - \bar{X}_t')\,|\,\bar{X}_t]\big]$ and
\[
\int_{\mathbb{R}^d} \partial_t u\, (\Psi * u)\, dx = \frac{1}{2}\,\partial_t \int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} u(x,t)\,\Psi(x - y)\, u(y,t)\, dy\, dx = \frac{1}{2}\,\partial_t\, \mathbb{E}\big[\mathbb{E}[\Psi(\bar{X}_t - \bar{X}_t')\,|\,\bar{X}_t]\big].
\]
Such a probabilistic loss functional can be approximated by Monte Carlo. This is particularly important in high-dimensional problems, where multidimensional integrals face the well-known curse of dimensionality, and the mean-field solution (population density) can be efficiently approximated by the empirical distribution of random samples.
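As a rough illustration of this remark, the sketch below (Python with NumPy; the function names, the crude replacement of the nested conditional expectation by a plain pairwise average, and the forward-difference treatment of the time derivative are our own illustrative assumptions, not the paper's algorithm) approximates the expectation form of the loss by Monte Carlo, given i.i.d. samples from $u(\cdot, t_k)$ at the observation times.

```python
import numpy as np

def mc_loss(psi, Psi, lap_Psi, samples, times, nu):
    """Crude Monte Carlo approximation of the loss functional in Remark 3.3.

    samples[k] : (n, d) array of i.i.d. draws from u(., times[k]);
    psi, Psi, lap_Psi : candidate kernel, its potential, and the Laplacian of the
    potential, each taking the pairwise distance r = |x - x'| (array of shape (n, 1)).
    """
    def pair_terms(X):
        Xp = X[np.random.permutation(len(X))]          # independent copy X'
        diff = X - Xp
        r = np.linalg.norm(diff, axis=1, keepdims=True)
        K = psi(r) * diff / np.maximum(r, 1e-12)       # K_psi(X - X')
        # E[|E[K_psi(X - X')|X]|^2] is replaced by E[|K_psi(X - X')|^2] here;
        # averaging over several copies of X' per X would reduce this bias.
        return np.mean(np.sum(K**2, axis=1)), np.mean(Psi(r)), np.mean(lap_Psi(r))

    vals = [pair_terms(X) for X in samples]
    total = 0.0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        sq, Ps0, lap0 = vals[k]
        _, Ps1, _ = vals[k + 1]
        # forward difference for d/dt E[Psi(X_t - X_t')]
        total += (sq + (Ps1 - Ps0) / dt - 2.0 * nu * lap0) * dt
    return total / (times[-1] - times[0])
```

Such a sample-based evaluation avoids quadrature over $\mathbb{R}^d$, which is the point made in the remark about high-dimensional problems.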

3.2 The RKHS space and L2 space

Now we show that the interaction kernel is identifiable on the RKHS $H_G$ with reproducing kernel $\overline{G}_T$ defined in (2.6), which is a subspace of $L^2(\mathcal{X})$. We show first that $\overline{G}_T$ is a Mercer kernel. Then we connect its RKHS with $L^2(\mathcal{X})$ through the integral operator of $\overline{G}_T$, which is the bilinear form's Fréchet derivative on $L^2(\mathcal{X})$.

Recall that the bilinear form leads to the function $\overline{G}_T$ defined in (2.6):
\[
\begin{aligned}
\langle\!\langle \varphi, \psi \rangle\!\rangle
&:= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} (K_\varphi * u)\cdot (K_\psi * u)\, u(x,t)\, dx\, dt \\
&= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} K_\varphi(y)\cdot K_\psi(z)\, u(x - y, t)\, u(x - z, t)\, u(x,t)\, dx\, dy\, dz\, dt \\
&= \int_0^\infty\!\int_0^\infty \varphi(r)\,\psi(s)\, \underbrace{\frac{1}{T}\int_0^T\!\int_{\mathbb{S}^d}\!\int_{\mathbb{S}^d}\!\int_{\mathbb{R}^d} \xi\cdot\eta\, (rs)^{d-1}\, u(x - r\xi, t)\, u(x - s\eta, t)\, u(x,t)\, dx\, d\xi\, d\eta\, dt}_{\overline{G}_T(r,s)}\, dr\, ds,
\end{aligned} \tag{3.7}
\]
where the third equality follows from a change of variables to polar coordinates, along with the notation $K_\varphi(y) = \varphi(|y|)\,\frac{y}{|y|} = \varphi(r)\,\xi$ with $r = |y|$ and $\xi = \frac{y}{|y|} \in \mathbb{S}^d$.

Lemma 3.4. The kernel $\overline{G}_T$ is a Mercer kernel, i.e., it is symmetric, continuous and positive definite.

Proof. The symmetry is clear from its definition, and the continuity follows from the continuity of $u$. Following Definition A.1, for any $\{r_1, \dots, r_k\} \subset \mathbb{R}^+$ and $(c_1, \dots, c_k) \in \mathbb{R}^k$, we have
\[
\begin{aligned}
\sum_{i,j} c_i c_j\, \overline{G}_T(r_i, r_j)
&= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \sum_{i,j} c_i c_j\, (r_i r_j)^{d-1} \int_{\mathbb{S}^d}\!\int_{\mathbb{S}^d} \xi\cdot\eta\, u(x - r_i\xi, t)\, u(x - r_j\eta, t)\, d\xi\, d\eta\; u(x,t)\, dx\, dt \\
&= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \Big|\sum_i c_i\, r_i^{d-1} \int_{\mathbb{S}^d} \xi\, u(x - r_i\xi, t)\, d\xi\Big|^2\, u(x,t)\, dx\, dt \;\ge\; 0.
\end{aligned}
\]
Thus, it is positive definite.

Since $\overline{G}_T$ is a Mercer kernel, it determines an RKHS $H_G$ with $\overline{G}_T$ as reproducing kernel (see Appendix A). We show next that identifiability holds on the RKHS $H_G$, i.e., $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ is non-degenerate on $H_G$, by studying the integral operator with kernel $\overline{G}_T$:

\[
\mathcal{L}_{\overline{G}_T} f(r) = \int_{\mathbb{R}^+} \overline{G}_T(r,s)\, f(s)\, ds. \tag{3.8}
\]

Note that by definition,
\[
\langle\!\langle \varphi, \psi \rangle\!\rangle = \big\langle \varphi,\, \mathcal{L}_{\overline{G}_T} \psi \big\rangle_{L^2(\mathcal{X})}. \tag{3.9}
\]

We start with a lemma on the boundedness and integrability of $\overline{G}_T$.

Lemma 3.5. The kernel function $\overline{G}_T$ in (2.6) satisfies the following properties:
(a) For all $r, s \in \mathbb{R}^+$, $\overline{G}_T(r,s) \lesssim \bar\rho_T(r)$. Hence $\overline{G}_T$ is supported in $\mathcal{X} \times \mathcal{X}$.
(b) $\overline{G}_T$ is in $L^2(\mathcal{X} \times \mathcal{X})$, and $\overline{G}_T(r, \cdot) \in L^2(\mathcal{X})$ for every $r \in \mathcal{X}$.

Proof. First we can write $\overline{G}_T$ as
\[
\overline{G}_T(r,s) = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \bigg(\int_{|y|=r} \frac{y}{|y|}\, u(x - y, t)\, dy\bigg) \cdot \bigg(\int_{|z|=s} \frac{z}{|z|}\, u(x - z, t)\, dz\bigg)\, u(x,t)\, dx\, dt.
\]
Notice that $\big(\int_{|z|=s} \frac{z}{|z|}\, u(x - z, t)\, dz\big)$ is uniformly bounded for all $s \in \mathbb{R}^+$ because of the density property and the continuity of $u$. Also recall that $\mathcal{X}$ is the support of $\bar\rho_T$; hence (a) is proved.

For (b), by Jensen's inequality and the boundedness of $u(\cdot, t)$ as a probability density,

2 s |GT pr, sq| drds ` ` żR żR T 1 2 2 2 ď 2 upx ´ y, tq dy ¨ upx ´ z, tq dz upx, tq dxdt. T ` ` d żR żR ż0 żR ˜ż|y|“r ¸ ˜ż|z|“s ¸ T 1 2 ď 2 |upx ´ y, tqupx ´ z, tqupx, tq| dxdydz dt ă 8. T d d d ż0 żR żR żR The second property follows similarly. By Lemma 3.5 and Theorem A.3, L is a positive compact self-adjoint operator on L2pX q, GT 8 8 and it has countably many positive eigenvalues tλiui“1 with orthonormal eigenfunctions tϕiui“1. In ? 8 particular, t λiϕiui“1 is an orthonormal basis of HG. The following theorem follows directly.

Theorem 3.6 (Identifiability on RKHS). Identifiability holds on $H_G$, the RKHS with reproducing kernel $\overline{G}_T$ in (2.6).

Proof. Since $\{\sqrt{\lambda_i}\,\varphi_i\}_{i=1}^\infty$ is an orthonormal basis of $H_G$, we can write each $f \in H_G$ as $f = \sum_{i=1}^\infty c_i \sqrt{\lambda_i}\,\varphi_i$, and we have $\langle f, f \rangle_{H_G} = \sum_{i=1}^\infty c_i^2$. Thus, $\langle\!\langle f \rangle\!\rangle^2 = \sum_{i=1}^\infty c_i^2 \lambda_i > 0$ for any $f \neq 0$ in $H_G$.

The RKHS $H_G$ has the nice feature of being data-informed: its reproducing kernel $\overline{G}_T$ depends solely on the data $(u(x,t),\, x \in \mathbb{R}^d,\, t \in [0,T])$. However, it is an implicit function space, so it is helpful to consider the explicit function space $L^2(\mathcal{X})$.

Theorem 3.7 (Identifiability on $L^2(\mathcal{X})$). The following are equivalent.
(a) Identifiability holds on $L^2(\mathcal{X})$.
(b) $\mathcal{L}_{\overline{G}_T}$ is strictly positive.
(c) $H_G$ is dense in $L^2(\mathcal{X})$.
Moreover, for any $h = \sum_{i=1}^n c_i \varphi_i \in \mathcal{H} = \mathrm{span}\{\varphi_1, \dots, \varphi_n\}$ with $\{\varphi_i\}$ being orthonormal eigenfunctions of $\mathcal{L}_{\overline{G}_T}$ corresponding to positive eigenvalues $\{\lambda_i\}$, we have
\[
\langle\!\langle h \rangle\!\rangle^2 = \sum_{i=1}^n c_i^2 \lambda_i, \qquad \|h\|_{H_G}^2 = \sum_{i=1}^n c_i^2 \lambda_i^{-1}, \qquad \|h\|_{L^2(\mathcal{X})}^2 = \sum_{i=1}^n c_i^2. \tag{3.10}
\]
In particular, the bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ satisfies the coercivity condition on $\mathcal{H}$:
\[
\langle\!\langle h \rangle\!\rangle \ge \min\{\sqrt{\lambda_i}\}_{i=1}^n\, \|h\|_{L^2(\mathcal{X})}, \qquad \forall h \in \mathcal{H}.
\]

Proof. (a) $\Rightarrow$ (b). Suppose $\mathcal{L}_{\overline{G}_T}$ is not strictly positive; then there exists an eigenfunction $h \in L^2(\mathcal{X})$ corresponding to eigenvalue $0$. But then we would also have $\langle\!\langle h \rangle\!\rangle = 0$.

(b) $\Rightarrow$ (a). If $\mathcal{L}_{\overline{G}_T}$ is strictly positive, then $\{\varphi_i\}_{i=1}^\infty$ is an orthonormal basis for $L^2(\mathcal{X})$ and all eigenvalues of $\mathcal{L}_{\overline{G}_T}$ are positive. Take $h = \sum_{j=1}^\infty c_j \varphi_j$. Then, $\langle\!\langle h \rangle\!\rangle = 0$ implies
\[
\big\langle \mathcal{L}_{\overline{G}_T} h, h \big\rangle_{L^2(\mathcal{X})} = \Big\langle \sum_{j=1}^\infty c_j \lambda_j \varphi_j, \sum_{j=1}^\infty c_j \varphi_j \Big\rangle_{L^2(\mathcal{X})} = \sum_{j=1}^\infty c_j^2 \lambda_j = 0.
\]
Hence $h = 0$ in $L^2(\mathcal{X})$.

(b) $\Leftrightarrow$ (c). Note that $\{\sqrt{\lambda_i}\,\varphi_i\}_{i=1}^\infty$ is a basis of $H_G$. Thus, $H_G$ is dense in $L^2(\mathcal{X})$ iff $\{\varphi_i\}_{i=1}^\infty$ is a basis of $L^2(\mathcal{X})$, i.e., $\mathcal{L}_{\overline{G}_T}$ is strictly positive.

At last, for any $h = \sum_{j=1}^n c_j \varphi_j \in \mathcal{H} = \mathrm{span}\{\varphi_1, \dots, \varphi_n\}$, we have
\[
\langle\!\langle h \rangle\!\rangle^2 = \big\langle \mathcal{L}_{\overline{G}_T} h, h \big\rangle_{L^2(\mathcal{X})} = \sum_{j=1}^n c_j^2 \lambda_j \ge \min\{\lambda_i\}_{i=1}^n\, \|h\|_{L^2(\mathcal{X})}^2.
\]
Also, we have $h = \sum_{j=1}^n c_j \lambda_j^{-1/2}\, \sqrt{\lambda_j}\,\varphi_j \in H_G$ with $\|h\|_{H_G}^2 = \sum_{i=1}^n c_i^2 \lambda_i^{-1}$.

3.3 The weighted $L^2$ space and RKHS

In this section, we study the identifiability on $L^2(\bar\rho_T)$ through the RKHS whose reproducing kernel is a weighted integral kernel. We define the following kernel $R_T$ on the set $\mathcal{X} \times \mathcal{X}$:

\[
R_T(r, s) = \frac{\overline{G}_T(r,s)}{\bar\rho_T(r)\,\bar\rho_T(s)}, \qquad (r,s) \in \mathcal{X} \times \mathcal{X}. \tag{3.11}
\]
Similar to Lemma 3.4, we can prove:

Lemma 3.8. The kernel $R_T$ defined in (3.11) is a Mercer kernel.

Since $R_T$ is a Mercer kernel, it leads to another RKHS $H_R$ (see Appendix A). To characterize the RKHS, we define the integral operator $\mathcal{L}_{R_T}: L^2(\bar\rho_T) \to L^2(\bar\rho_T)$:

\[
\mathcal{L}_{R_T} \varphi(s) = \int_{\mathbb{R}^+} R_T(r, s)\, \varphi(r)\, \bar\rho_T(r)\, dr. \tag{3.12}
\]
Here we assume that $R_T \in L^2(\bar\rho_T \otimes \bar\rho_T)$, so that $\mathcal{L}_{R_T}$ is a compact operator. This assumption holds true when $\mathcal{X}$ is compact, which is typical in practice. Note that by definition,
\[
\langle\!\langle \varphi, \psi \rangle\!\rangle = \big\langle \varphi,\, \mathcal{L}_{R_T} \psi \big\rangle_{L^2(\bar\rho_T)}. \tag{3.13}
\]

Under the above assumption, $\mathcal{L}_{R_T}$ is a positive, compact, self-adjoint operator on $L^2(\bar\rho_T)$, and it has countably many positive eigenvalues $\{\gamma_i\}_{i=1}^\infty$ with orthonormal eigenfunctions $\{\psi_i\}_{i=1}^\infty$. In particular, $\{\sqrt{\gamma_i}\,\psi_i\}_{i=1}^\infty$ is an orthonormal basis of $H_R$. Based on the eigenfunctions of $\mathcal{L}_{R_T}$ and by the same argument as in Theorem 3.6, we have:

Theorem 3.9 (Identifiability on RKHS). Identifiability holds on HR, the RKHS with reproducing kernel RT in (3.11).

2 Similar to Theorem 3.7, the identifiability holds on L pρT q if the RKHS HR is dense in it.

Theorem 3.10 (Identifiability on $L^2(\bar\rho_T)$). The following are equivalent.
(a) Identifiability holds on $L^2(\bar\rho_T)$.
(b) $\mathcal{L}_{R_T}$ in (3.12) is strictly positive.
(c) $H_R$ is dense in $L^2(\bar\rho_T)$.
Moreover, for any $h = \sum_{i=1}^n c_i \psi_i \in \mathcal{H} = \mathrm{span}\{\psi_1, \dots, \psi_n\}$ with $\{\psi_i\}$ being orthonormal eigenfunctions of $\mathcal{L}_{R_T}$ corresponding to eigenvalues $\{\gamma_i > 0\}$, we have
\[
\langle\!\langle h \rangle\!\rangle^2 = \sum_{i=1}^n c_i^2 \gamma_i, \qquad \|h\|_{H_R}^2 = \sum_{i=1}^n c_i^2 \gamma_i^{-1}, \qquad \|h\|_{L^2(\bar\rho_T)}^2 = \sum_{i=1}^n c_i^2. \tag{3.14}
\]
In particular, the bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ satisfies the coercivity condition on $\mathcal{H}$:
\[
\langle\!\langle h \rangle\!\rangle \ge \min\{\sqrt{\gamma_i}\}_{i=1}^n\, \|h\|_{L^2(\bar\rho_T)}, \qquad \forall h \in \mathcal{H}.
\]

Remark 3.11 (Relation to the coercivity condition of the bilinear form). Recall that a bilinear form $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ is said to be coercive on a subspace $\mathcal{H} \subset L^2(\bar\rho_T)$ if there exists a constant $c_{\mathcal{H}} > 0$ such that $\langle\!\langle \psi, \psi \rangle\!\rangle \ge c_{\mathcal{H}}\, \|\psi\|_{L^2(\bar\rho_T)}^2$ for all $\psi \in \mathcal{H}$. Such a coercivity condition has been introduced on subspaces of $L^2(\bar\rho_T)$ in [5, 21, 19, 20, 18] for systems of finitely many particles or agents. Theorem 3.10 shows that, for any finite-dimensional hypothesis space $\mathcal{H} = \mathrm{span}\{\psi_1, \dots, \psi_n\}$, the coercivity condition holds with $c_{\mathcal{H}} = \min\{\gamma_i\}_{i=1}^n$, but the coercivity constant vanishes as the dimension of $\mathcal{H}$ increases to infinity.

Now we have two RKHSs, $H_G$ and $H_R$, which are function spaces of identifiability under different topologies. They are the images $\mathcal{L}_{\overline{G}_T}^{1/2}(L^2(\mathcal{X}))$ and $\mathcal{L}_{R_T}^{1/2}(L^2(\bar\rho_T))$ (see Appendix A). The following remarks discuss the relation between them.

Remark 3.12 (The two integral operators). The integral operators $\mathcal{L}_{\overline{G}_T}$ and $\mathcal{L}_{R_T}$ are derived from the same bilinear form: for any $\varphi, \psi \in L^2(\mathcal{X}) \subset L^2(\bar\rho_T)$, by (3.9) and (3.13), we have
\[
\langle\!\langle \varphi, \psi \rangle\!\rangle = \big\langle \mathcal{L}_{\overline{G}_T} \varphi, \psi \big\rangle_{L^2(\mathcal{X})} = \big\langle \mathcal{L}_{R_T} \varphi, \psi \big\rangle_{L^2(\bar\rho_T)}.
\]
Since $L^2(\mathcal{X})$ is dense in $L^2(\bar\rho_T)$, the second equality implies that the null-space of $\mathcal{L}_{\overline{G}_T}$ is a subset of that of $\mathcal{L}_{R_T}$. However, there is no correspondence between their eigenfunctions of nonzero eigenvalues. To see this, let $\varphi_i \in L^2(\mathcal{X})$ be an eigenfunction of $\mathcal{L}_{\overline{G}_T}$ with eigenvalue $\lambda_i > 0$. Then, it follows from the second equality that
\[
\langle \lambda_i \varphi_i, \psi \rangle_{L^2(\mathcal{X})} = \big\langle \mathcal{L}_{\overline{G}_T} \varphi_i, \psi \big\rangle_{L^2(\mathcal{X})} = \big\langle \mathcal{L}_{R_T} \varphi_i, \psi \big\rangle_{L^2(\bar\rho_T)} = \big\langle \bar\rho_T\, \mathcal{L}_{R_T} \varphi_i, \psi \big\rangle_{L^2(\mathcal{X})}
\]
for any $\psi \in L^2(\mathcal{X})$. Thus, $\lambda_i \varphi_i = \bar\rho_T\, \mathcal{L}_{R_T} \varphi_i$ in $L^2(\mathcal{X})$, so neither $\bar\rho_T \varphi_i$ nor $\frac{\varphi_i}{\bar\rho_T}$ is an eigenfunction of $\mathcal{L}_{R_T}$. In the finite-dimensional approximation in Section 5, we will show that $\mathcal{L}_{R_T}$ leads to a generalized eigenvalue problem, while $\mathcal{L}_{\overline{G}_T}$ leads to the normal matrix in regression.

Remark 3.13 (Metrics on $H_G$ and $H_R$). We have three metrics on $H_G \subset L^2(\mathcal{X})$: the RKHS norm, the $L^2(\mathcal{X})$ norm, and the norm induced by the bilinear form. From (3.10), we have, for any $h = \sum_{i=1}^\infty c_i \varphi_i \in H_G$ with $\{\varphi_i\}$ being the eigenfunctions of $\mathcal{L}_{\overline{G}_T}$,
\[
\langle\!\langle h \rangle\!\rangle^2 = \sum_{i=1}^\infty c_i^2 \lambda_i, \qquad \|h\|_{H_G}^2 = \sum_{i=1}^\infty c_i^2 \lambda_i^{-1}, \qquad \|h\|_{L^2(\mathcal{X})}^2 = \sum_{i=1}^\infty c_i^2.
\]
Thus, the three metrics satisfy $\langle\!\langle h \rangle\!\rangle \le \max_i\{\sqrt{\lambda_i}\}\, \|h\|_{L^2(\mathcal{X})} \le \max_i\{\lambda_i\}\, \|h\|_{H_G}$. Similarly, there are three metrics on $H_R \subset L^2(\bar\rho_T)$ satisfying $\langle\!\langle h \rangle\!\rangle \le \max_i\{\sqrt{\gamma_i}\}\, \|h\|_{L^2(\bar\rho_T)} \le \max_i\{\gamma_i\}\, \|h\|_{H_R}$.

s 12 s 4 General interaction kernels

We consider the case of a general interaction kernel $K_{\mathrm{true}}: \mathbb{R}^d \to \mathbb{R}^d$, which can be non-radial. In comparison to the radial case, the major difference is that $K_{\mathrm{true}} = \nabla\Phi$ is a vector-valued function on $\mathbb{R}^d$. The extension is straightforward and the elements are similar: the quadratic loss functional and the RKHS spaces of identifiability.

4.1 The loss function

Let $K: \mathbb{R}^d \to \mathbb{R}^d$ be a general interaction kernel with potential $\Psi$, i.e., $K = \nabla\Psi$. We will denote it by $K_\psi$ to emphasize its relation with $\Psi$; otherwise, we will use $K$ for simplicity.

Proposition 4.1 (Loss function). Let $u$ be a solution of the mean-field equation (1.1) on $[0,T]$. The general loss functional $\mathcal{E}$ in (2.1) is the expectation of the average (in time) negative log-likelihood of the process $(\bar{X}_t, t \in [0,T])$ in (2.8). If $K_\psi = \nabla\Psi \in L^2(\bar\rho_T)$ with $\bar\rho_T$ being the general density in (2.10), the loss functional can be written as
\[
\mathcal{E}(K_\psi) := \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \Big[\, |K_\psi * u|^2\, u + 2\,\partial_t u\, (\Psi * u) + 2\nu\, \nabla u \cdot (K_\psi * u)\,\Big]\, dx\, dt. \tag{4.1}
\]
Furthermore, if $\nabla\cdot K_\psi \in L^1(\bar\rho_T)$, we can replace the integrand $\nabla u \cdot (K_\psi * u)$ by $-u\,(\nabla\cdot K_\psi * u)$.

Proof. The proof is the same as that of Theorem 3.2.

The loss functional $\mathcal{E}$ is quadratic. Thus, its Fréchet derivative (with respect to a suitable function space) is a linear operator, and the uniqueness of its minimizer is equivalent to the non-degeneracy of its Fréchet derivative. Recall that for vector-valued functions $K_1, K_2: \mathbb{R}^d \to \mathbb{R}^d$, we define the bilinear form in (2.2):

\[
\begin{aligned}
\langle\!\langle K_1, K_2 \rangle\!\rangle
&= \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} (K_1 * u)\cdot (K_2 * u)\, u(x,t)\, dx\, dt \\
&= \int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} K_1(y)\cdot K_2(z)\, \underbrace{\frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} u(x - y, t)\, u(x - z, t)\, u(x,t)\, dx\, dt}_{\overline{F}_T(y,z)}\, dy\, dz \\
&= \sum_{i=1}^d \int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} K_1^i(y)\, K_2^i(z)\, \overline{F}_T(y,z)\, dy\, dz =: \sum_{i=1}^d \langle\!\langle K_1^i, K_2^i \rangle\!\rangle,
\end{aligned} \tag{4.2}
\]
where the scalar-valued kernel $\overline{F}_T(y,z)$ is defined in (2.7), and in the last equality we abuse the notation of the bilinear form: we also use it for the components of the vector-valued functions. We also denote $\langle\!\langle K \rangle\!\rangle^2 := \langle\!\langle K, K \rangle\!\rangle$. The next lemma extends Lemma 2.5, highlighting the crucial role of the bilinear form.

Lemma 4.2. Suppose $\mathcal{H} = \prod_{i=1}^d \mathcal{H}_0$. Then, identifiability holds in $\mathcal{H}$ (or in $\mathcal{H}_0$) if and only if $\langle\!\langle \cdot \rangle\!\rangle^2$ is non-degenerate on $\mathcal{H}_0$, i.e., $\langle\!\langle h \rangle\!\rangle^2 > 0$ for any $h \neq 0$ in $\mathcal{H}_0$.

Proof. By Definition 2.2, it suffices to show that the loss functional $\mathcal{E}$ in (4.1) has a unique minimizer on $\mathcal{H}$. With the bilinear functional (4.2), we can write the loss functional as $\mathcal{E}(K) = \langle\!\langle K, K \rangle\!\rangle - 2\langle\!\langle K, K_{\mathrm{true}} \rangle\!\rangle = \langle\!\langle K - K_{\mathrm{true}} \rangle\!\rangle^2 - \langle\!\langle K_{\mathrm{true}} \rangle\!\rangle^2$. Let $K - K_{\mathrm{true}} = (k_1(x), \dots, k_d(x))$ with $k_i \in \mathcal{H}_0$ for $i = 1, \dots, d$; then $\langle\!\langle K - K_{\mathrm{true}} \rangle\!\rangle^2 = \sum_{i=1}^d \langle\!\langle k_i \rangle\!\rangle^2$. Thus, $K_{\mathrm{true}}$ is identifiable by $\mathcal{E}$ in $\mathcal{H}$ iff $\langle\!\langle k \rangle\!\rangle^2 > 0$ for any $k \neq 0$ in $\mathcal{H}_0$.

Remark 4.3 (Minimizer of the loss functional). Consider the hypothesis space $\mathcal{H} = \mathrm{span}\{K_1, \dots, K_n\}$ with $K_i = \nabla\Phi_i \in L^2(\mathbb{R}^d, \mathbb{R}^d)$ such that $A$ in (4.3) is invertible. Then, for any $K = \sum_{i=1}^n c_i K_i \in \mathcal{H}$, the loss functional $\mathcal{E}$ becomes $\mathcal{E}(K) = \mathcal{E}(c) = c^\top A c - 2 b^\top c$, and it has a unique minimizer on $\mathcal{H}$:
\[
\widehat{K} = \sum_{i=1}^n \widehat{c}_i K_i, \quad \text{with } \widehat{c} = A^{-1} b, \tag{4.3}
\]
where the normal matrix $A$ and vector $b$ are given by
\[
\begin{aligned}
A_{ij} &= \langle\!\langle K_i, K_j \rangle\!\rangle = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} (K_i * u)\cdot (K_j * u)\, u(x,t)\, dx\, dt, \\
b_i &= \langle\!\langle K_i, K_{\mathrm{true}} \rangle\!\rangle = -\frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \big[\partial_t u\, (\Phi_i * u) + \nu\,\nabla u\cdot (K_i * u)\big]\, dx\, dt.
\end{aligned}
\]

4.2 The RKHS space and $L^2$ space

We show next that identifiability holds on the RKHS with reproducing kernel $\overline{F}_T$ in (2.7), which is a subspace of $L^2(\mathcal{X})$. We show first that $\overline{F}_T$ is a Mercer kernel. Then, we connect its RKHS with $L^2(\mathcal{X})$ through its integral operator on $L^2(\mathcal{X})$.

Lemma 4.4. The function $\overline{F}_T$ in (2.7) is a Mercer kernel.

Proof. The symmetry of $\overline{F}_T$ follows from its definition, and its continuity follows from the continuity of $u$. It is positive definite by its definition and Theorem A.2.

Denote by HF the RKHS with reproducing kernel F T and denote its inner product by x¨, ¨yHF . We show next that identifiability holds on the RKHS HF , i.e., xx¨, ¨yy is non-degenerate on HF , by studying the integral operator with kernel F T . We start with the next lemma on the boundedness and integrability of F T .

Lemma 4.5. The symmetric function F T in (2.7) satisfies the following properties: d (a) For all x, y P R , F T px, yq ď CuρT pxq, where Cu “ supxPRd,tPr0,T s upx, tq. 2 d d 2 d d (b) F T is in L pR ˆ R q, and F T px, ¨q P L pR q for every x P R . s Proof. The boundedness in (a) follows directly:

1 T F T px, yq ď Cu upz ´ y, tqupz, tqdz dt “ CuρT pyq. T d ż0 żR d For (b), note that by symmetry, we have F T px, yq ď CuρT pyq for any x,s y P R . Then,

2 2 2 pF T px, yqq dxdy ă Cu ρTspxqρT pyqdxdy “ Cu, d d d d żR żR żR żR 2 2 and similarly, d pF px, yqq dy ď C ρ pxq for every x. s s R T u T By Lemmaş 4.5, the integral operator L : L2pX q Ñ L2pX q, s F T

L fpxq “ F py, xqfpyqdy. (4.4) F T T żX is a bounded, positive and compact operator on L2pX q (see e.g., [28, Proposition 1]). Hence L has F T at most countably many positive eigenvalues. We use the same notation tλ u8 and tϕ u8 as L i i“1 i i“1 GT

14 to denote the positive eigenvalues and the corresponding eigenfunctions of L , as long as there is F T ? ϕ 8 L2 X λ ϕ 8 no confusion in the context. Similarly, t iui“1 are orthonormal in p q. Moreover, i i i“1 1{2 2 form an orthonormal basis of HF and HF “ L pL pX qq (see e.g., [8, Section 4.4]). Note that F T (

xxK ,K yy“ xK , L K y 2 . (4.5) 1 2 1 F T 2 L pX q 8 When ρT ’s support is bounded, then by Mercer’s theorem, F T px, yq “ i“1 λiϕipxqϕipyq, which converges uniformly on X ˆ X . ř s Theorem 4.6 (Identifiability on RKHS). Identifiability holds on HF . ? 8 ? Proof. Since t λiϕiu is an orthonormal basis of HF , we can write each f P HF as f “ i“1 ci λiϕi 8 2 2 8 2 2 and we have hf, fi “ c . Thus, xxfyy “ c λ ą 0 for any f ‰ 0 in HF . HF i“1 i i“1 i i ř The RKHS HF has theř nice feature of being data-informed:ř its reproducing kernel F T depends d on the data pupx, tq, x P R , t P r0,T sq. However, it is an implicit function space. Thus, it is helpful to consider the explicit function space L2pX q. Theorem 4.7 (Identifiability on L2pX q). The following are equivalent. (a) Identifiability holds on L2pX q. (b) L in (4.4) is strictly positive. F T 2 (c) HF is dense in L pX q.

Moreover, for any finite dimensional space H “ span tϕ1, . . . , ϕnu with tϕiu being eigenfunctions ? n of L corresponding to eigenvalues tλ ą 0u, we have xxkyyě mint λ u }k} 2 , @k P H. F T i i i“1 L pX q Proof. The proof is the same as Theorem 3.7.

4.3 The weighted L2 space and RKHS d d In practice, the data pupx, tq, x P R , t P r0,T sq only explore a small region X in R with a weight 2 ρT defined in (2.10). Thus, it is important to consider the weighted space L pρT q (which denotes 2 2 the space L pρT pxqdxq by abusing the notation) instead of L pX q. In this section, we study the 2 identifiabilitys on L pρT q through the RKHS whose reproducing kernel is a weighteds kernel. Consider thes weighted integral kernel QT : s F T px, yq QT px, yq “ , px, yq P X ˆ X . (4.6) ρT pxqρT pyq With a proof similar to those for Lemma 4.4, we have s s Proposition 4.8. The kernel QT defined in (4.6) is a Mercer kernel over pX , ρT q, which is symmetric, continuous and positive definite. s With QT being a Mercer kernel, we denote its RKHS space by HQ with inner product x¨, ¨yHQ . 2 Similar to Section 4.2, we connect the RKHS HQ with L pρT q through QT ’s integral operator. Here 2 2 2 we assume that Q P L pρT b ρT q, so that its integral operator L : L pρT q Ñ L pρT q T QT s s LQs fpxq “ QT py, xqfpyqρT pyqdy, s s (4.7) T d żR is a compact operator (actually, a Hilbert-Schmidt operator).s This assumption holds true when X is compact, which is the most typical case in practice. But it may not hold when X is non-compact. The following two examples show that the assumption holds when the distribution of pXt, t ě 0q is a Gaussian process, but it does not hold when the process is stationary with a Cauchy distribution. s 15 2 1 ´ x Example 4.9. Suppose K x x and d 1. Then, U x ? e 2ν2 is a stationary solution p q “ “ p q “ 2π to the mean-field equation (1.1) (equivalently, N p0, ν2q is an invariant measure of the SDE (2.8)). Suppose that upx, 0q “ Upxq. Then, we have

2 1 ´ 1 px2`y2´xyq 1 ´ x F T px, yq “ ? e 3ν2 and ρT pxq “ ? e 4ν2 . 2ν 3π 2ν π

The weighted integral kernel QT is still square integrable dues to the fast decay of F T : 2 F x, y 1 1 2 2 2 2 T p q ´ 2 r4px´yq `x `y s t}Q } 2 “ dxdy “ e 12ν dxdy ă 8. T L pρT bρT q ρ pxqρ pyq 3π żR żR T T żR żR Example 4.10.s Supposes that the density of Cauchy distribution, Upxq “ 1 1 , is a steady solution s s π 1`x2 to (1.1), and suppose that upx, 0q “ Upxq. Then we have

2 px2 ´ xy ` y2 ` 12q 2 1 F T px, yq “ and ρT pxq “ π2 px2 ` 4qpy2 ` 4qpx2 ´ 2xy ` y2 ` 4q π px2 ` 4q

However, QT is not square integrable: s 2 2 2 2 F T px, yq px ´ xy ` y ` 12q }Q } 2 “ dxdy “ dxdy “ 8. T L pρT bρT q ρ pxqρ pyq 2px2 ´ 2xy ` y2 ` 4q żR żR T T żR żR s8 s 8 Denote tγiui“1 and tψiui“1 the positive eigenvalues and corresponding eigenfunctions of LQ . s s ? T ψ 8 L2 ρ γ ψ 8 H Similarly, t iui“1 are orthonormal in p T q and i i i“1 forms an orthonormal basis of Q. 1 2 Furthermore, H “ L { pL2pρ qq and we can write the bilinear form (4.2) as Q Q T ( T s xxK1,K2yy“ xK1, L K2y 2 . (4.8) s QT L pρT q

The following theorem on identifiability over the space HQ iss a counterpart of Theorem 4.6. d Theorem 4.11. Let H “‘i“1HQ. Then identifiability holds on H. 8 ? Proof. Let k P HQ and write k “ i“1 ci γiψi. Then xxkyy“ 0 implies ř xxkyy“ kpxqkpyqF T px, yqdxdy “ kpxqkpyqQT px, yqρT pxqρT pyqdxdy d d d d żR żR żR żR D E 8 2 2 s s “ LQ k, k “ cj γj “ 0. (4.9) T L2pρ q T j“1 ÿ s Thus cj “ 0 for all j and kpxq “ 0 a.e. in pX , ρT q. Suppose we have K P H, then we can apply the above discussion for the components of K and achieve the identifiability on H.

2 s The following theorem is an L pρT q version of Theorem 4.7. 2 Theorem 4.12 (Identifiability on L pρT q). The following are equivalent. s d 2 (a) Identifiability holds on H “‘i“1L pρT q. s (b) L in (4.4) is strictly positive. QT 2 s (c) HQ is dense in L pρT q.

Moreover, for any finite dimensional space H “ span tψ1, . . . , ψnu with tψiu being eigenfunctions of ? n L corresponding to eigenvaluess tγi ą 0u, we have: xxKyyě mint γiu }K} 2 , @K P H. QT i“1 L pρT q

Proof. The proof is identical to the proof in Theorem 4.7 and we omit it. s

16 4.4 Examples and general function spaces 2 2 d We show a few examples that identifiability holds on L pX q or L pρT q when X “ R , under mild d conditions on the initial condition u0. These examples are limited to the case when X “ R because we prove their integral kernel to be strictly positive definite based ons Fourier transformation. Recall that Ftpx, yq in (2.7) is positive definite for all t P r0,T s. The next lemma shows that we only need the integral kernel to be strictly positive definite at only a single time instant.

Lemma 4.13. Assume that Ft0 px, yq in (2.7) is strictly positive definite for some t0 P r0,T s, that is, 2 d for every function f P L pR q,

fpxqfpyqFt0 px, yqdxdy “ 0 (4.10) d d żR żR 2 d 2 implies f “ 0 a.e.. Then the identifiability holds on both L pR q and L pρq.

Proof. By the continuity of F T , xxfyy“ 0 implies d d fpxqfpyqFt0 px, yqdxdy “ 0. Then f “ 0 R R 2 d a.e. by the assumption, also f “ 0 a.e. in ρ by the Radon-Nikodym theorem. Hence, L pR q and ş ş L2pρq have the identifiability.

We show next that Ft0 px, yq is strictly positive definite when upx, t0q’s Fourier transform satisfy suitable conditions. Without loss of generality, we assume t0 “ 0 and consider the initial condition u0 in he following discussion.

d Theorem 4.14. Assume the initial value u0 is non-vanishing on R . d ixy (a) Suppose there exists a Borel measure µ on C such that u0pxq “ d e dµpyq. If the carrier d 2 d 2 R of µ is the entire C , then L pR q and L pρT q have the identifiability. ş d ixy (b) Suppose there exists a Borel measure µ on R such that u0pxq “ d e dµpyq. If the carrier of R d µ contains an open set, then the space of boundeds and compactly supported functions MbcpR q ş has the identifiability.

(c) Suppose u0pxq “ u0p|x|q, i.e. u0pxq is radial and there exists a Borel measure ν on r0, 8q 8 ´|x|2t such that u0pxq “ 0 e dνptq. If ν is not concentrated at zero, then the space of Borel d probability measures PpR q has the identifiability. ş 2 d Proof. As for paq, by Lemma 4.2 and Lemma 4.13, we only need to show that for all f P L pR q, (4.10) implies f “ 0. From the definition of Ftpx, yq in (2.7), we can write (4.10) as

2 |u0 ˚ fpxq| u0pxqdx “ 0, d żR d d which implies u0 ˚ f “ 0 in R a.e. since u0 is supported on R . Taking the Fourier transform on d both sides, we have Ftfuµ “ 0. Since µ is supported on C , we have Ftfu “ 0 and then f “ 0 a.e.. d The proof for pbq is due to [24]. By the previous steps, for f P MbcpR q we have Ftfuµ = 0. Thus, Ftfu vanishes on the open set supporting µ. By Paley-Winner Theorem, Ftfu is an entire function since f has compact support. Thus we have f “ 0 a.e.. The proof for pcq is due to [27] and [31]. In particular, Theorem 7.14 in [31] states that such u0 is strictly positive definite. Then, by Proposition 5 in [27], u0 must characteristic in the sense that µ ÞÑ F0p¨, yqµpdyq in injective on P. Thus, P has the identifiability. Note that in this case, we also 2 d 2 cannot get the identifiability over L pR q and L pρq. ş Note the condition for u0 is weakest in pbq and paq, pcq are stronger, hence the identifiable space in paq, pcq is larger than the one in pbq. Also, in either case (b) or (c), we cannot get the identifiability 2 d 2 over L pR q and L pρq.

17 Example 4.15. Let u0pxq be the density function of Gaussian distribution N pµ, Σq. The Fourier iµξ´ 1 ξJΣ ξ d transform, i.e., the characteristic function of u0 is e 2 . Since it is nonvanishing on C , it 2 d 2 satisfies the condition in paq, and thus L pR q and L pρq have the identifiability. By Theorem 4.7 2 d 2 and Theorem 4.12, HF is dense in L pR q and HQ is dense in L pρq. Note that this density follows from the continuity of u and the initial distribution u0.

Example 4.16. The condition in pcq for radial functions is equivalent to u0 being strictly positive 1 d [27]. Hence all the nonvanishing L pR q integrable radial strictly positive functions will work, such 2 2 ´β as the radial Gaussian density of N p0,Iq and the Cauchy distribution u0pxq9pc ` |x| q with c ą 0, β ą d{2 (see [31, Theorem 6.13]). Actually, these density functions also satisfy the condition of (a). The Gaussian density is verified in the above example. Similarly, the characteristic function β ξ d 2 d 2 of the Cauchy distribution is e´| | , which is nonzero for all ξ P C . Thus, L pR q and L pρq have 2 d 2 the identifiability, HF is dense in L pR q and HQ is dense in L pρq. 2 d 2 d We show next that identifiability hold on L pR q and L pρq when u0 with support on R is obtained from modifying a compactly supported probability density.

d Proposition 4.17. Let vpxq P CcpR q be a probability density, and ηε be the density function of 2 d 2 N p0, εIq. If the initial value u0 “ v ˚ ηε, then L pR q and L pρq has the identifiability. 2 d Proof. By Lemma 4.13, we show that f “ 0 a.e. if f P L pR q satisfy (4.10). Note that (4.10) implies that u0 ˚ f “ 0 a.e.. Taking Fourier transform, we have FtvuFtηεuFtfu “ 0. Since v is a compactly supported probability density, Ftvu is non-vanishing by Paley-Winner Theorem. Also, Ftηεu is non-vanishing. Then, we must have Ftfu “ 0, and therefore f “ 0 a.e..

5 Identifiability in computational practice

In this section, we discuss the implications of identifiability for the computational practice of estimating the interaction kernel from data. For simplicity, we consider only radial interaction kernels in the case that the spatial dimension is $d = 1$. We shall see that the regression matrix becomes ill-conditioned as the dimension of the hypothesis space increases (see Theorem 5.1). Thus, regularization becomes necessary. We compare two regularization norms, the norms of $L^2(\bar\rho_T)$ and $L^2(\mathcal{X})$, corresponding to the function spaces of learning, in the context of singular value decomposition (SVD) analysis. Numerical tests suggest that regularization based on the $L^2(\bar\rho_T)$ norm leads to a slightly better conditioned inversion (see (5.8) and Figure 1). We also show that regularization by truncated SVD is equivalent to learning on the low-frequency eigenspace of the RKHSs.

5.1 Nonparametric regression in practice

In computational practice, suppose we are given data $u(x,t)$ on $\mathbb{R} \times [0,T]$ on discrete space mesh grids; we find a minimizer of the loss functional by least squares as in Remark 2.4. We review only the fundamental elements and refer to [16] for more details.

We first select a set of adaptive-to-data basis functions from the data-based measure $\bar\rho_T$, guided by the function space of learning (either $L^2(\bar\rho_T)$ or $L^2(\mathcal{X})$). The starting point is to find the measure $\bar\rho_T$ using (2.10) on a fine mesh (the mesh on which $u$ is defined) and obtain its support. Let $\{r_i\}_{i=0}^n$ be a uniform partition of the support set $\mathcal{X}$ and denote the width of each interval by $\Delta r$. They provide the knots for the B-spline basis functions of $\mathcal{H}$. Here we use piecewise constant basis functions to facilitate the rest of the discussion, that is,
\[
\mathcal{H} = \mathrm{span}\{\phi_1, \dots, \phi_n\}, \quad \text{with } \phi_i(r) = \mathbf{1}_{[r_{i-1}, r_i]}(r).
\]
One may also use other partitions, for example, a partition with uniform probability for all intervals, as well as other basis functions, such as higher-degree B-splines (often depending on the regularity of the true kernel) or weighted orthogonal polynomials (which are global basis functions). With these basis functions, we let
\[
A_{ij} = \langle\!\langle \phi_i, \phi_j \rangle\!\rangle = \frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} (K_{\phi_i} * u)\cdot (K_{\phi_j} * u)\, u(x,t)\, dx\, dt, \tag{5.1}
\]
\[
b_i = \langle\!\langle \phi_i, \phi_{\mathrm{true}} \rangle\!\rangle = -\frac{1}{T}\int_0^T\!\int_{\mathbb{R}^d} \big[\partial_t u\, (\Phi_i * u) + \nu\,\nabla u\cdot (K_{\phi_i} * u)\big]\, dx\, dt, \tag{5.2}
\]
where $\Phi_i(r) = \int_0^r \phi_i(s)\, ds$ is an anti-derivative of $\phi_i$, $K_{\phi_i}(x) = \phi_i(|x|)\,\frac{x}{|x|}$, and the vector $b \in \mathbb{R}^n$ comes from Theorem 3.2. The integrals are approximated from data by the Riemann sum or the trapezoid rule. We then find a minimizer of the loss functional over $\mathcal{H}$ by least squares:

\[
\widehat{\phi}_{\mathcal{H}} = \sum_{i=1}^n \widehat{c}_i \phi_i, \quad \text{where } \widehat{c} = \arg\min_{c \in \mathbb{R}^n} \mathcal{E}(c), \qquad \mathcal{E}(c) = c^\top A c - 2 c^\top b. \tag{5.3}
\]
In practice, we compute the minimizer by $\widehat{c} = A^{-1} b$ when the normal matrix $A$ is well-conditioned. When $A$ is ill-conditioned or singular, we use a pseudo-inverse or regularization to obtain $\widehat{c} = (A + \lambda B)^{-1} b$, with $\lambda B$ being a proper regularization. We select the optimal dimension of the hypothesis space that minimizes the loss functional so as to avoid underfitting or overfitting.

As we show in the next section, the normal matrix becomes ill-conditioned as the dimension of the hypothesis space increases. Thus, regularization becomes necessary. In Section 5.3, we will discuss regularization by truncated SVD and the connection to the RKHSs.
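To make the first step of this pipeline concrete, here is a minimal sketch (Python with NumPy, for the one-dimensional setting of this section; the function name and the discrete-correlation treatment of (2.10) are our own illustrative choices, not the authors' code) of estimating the measure $\bar\rho_T$ from gridded data $u(x, t_k)$. The matrices in (5.1)-(5.2) can then be assembled by Riemann sums, as in the sketch after Remark 2.4.

```python
import numpy as np

def rho_bar_T(u, x_grid, t_weights=None):
    """Approximate the regression density rho_bar_T in (2.10) for d = 1.

    u : (n_t, n_x) array with u[k, j] ~ u(x_grid[j], t_k) on a uniform grid;
    returns (r_grid, rho) with rho[m] ~ rho_bar_T(r_grid[m]) for r > 0.
    """
    dx = x_grid[1] - x_grid[0]
    n_t, n_x = u.shape
    w = np.full(n_t, 1.0 / n_t) if t_weights is None else t_weights
    r_grid = dx * np.arange(1, n_x)
    rho = np.zeros(n_x - 1)
    for k in range(n_t):
        # density of the difference X_t - X_t' at lag m*dx: sum_j u[j+m] u[j] * dx
        corr = np.correlate(u[k], u[k], mode="full") * dx
        center = n_x - 1
        # fold the two signs of the difference to get the density of |X_t - X_t'|
        rho += w[k] * (corr[center + 1:] + corr[center - 1::-1])
    return r_grid, rho
```

The support of the returned density (where it exceeds a small threshold) then determines the partition $\{r_i\}$ used for the piecewise-constant basis.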

5.2 Identifiability and ill-conditioned normal matrix We show next that the normal matrix becomes ill-conditioned as the hypothesis space’s dimension increases, because its smallest eigenvalue converges to zero.

Theorem 5.1. Let $\mathcal{H} = \mathrm{span}\{\phi_1, \dots, \phi_n\} \subset L^2(\mathcal{X})$ and let $A$ and $b$ be the normal matrix and vector defined in (5.1) and (5.2). Then,
(a) $A = \mathcal{L}_{\overline{G}_T}|_{\mathcal{H}}: L^2(\mathcal{X}) \to L^2(\mathcal{X})$, that is, $A$ is a representation of the operator $\mathcal{L}_{\overline{G}_T}$ in (3.8) on $\mathcal{H} \subset L^2(\mathcal{X})$. Thus, $A$ has the same spectrum as $\mathcal{L}_{\overline{G}_T}|_{\mathcal{H}}$.
(b) $A = \mathcal{L}_{R_T}|_{\mathcal{H}}: L^2(\bar\rho_T) \to L^2(\bar\rho_T)$, with $\mathcal{L}_{R_T}$ defined in (3.12), in the sense that if $\varphi = \sum_{i=1}^n c_i \phi_i$ is an eigenfunction of $\mathcal{L}_{R_T}|_{\mathcal{H}}$ with eigenvalue $\lambda$, then $c = (c_1, \dots, c_n)^\top \in \mathbb{R}^n$ solves the generalized eigenvalue problem
\[
A c = \lambda P c, \quad \text{with } P = \big(\langle \phi_i, \phi_j \rangle_{L^2(\bar\rho_T)}\big). \tag{5.4}
\]
Thus, $A$ has the same spectrum as $\mathcal{L}_{R_T}|_{\mathcal{H}}$.
In particular, the smallest eigenvalue of $A$ converges to zero as $n \to \infty$.

Proof. Part (a) follows directly from (3.9). For Part (b), note that if $\varphi = \sum_{i=1}^n c_i \phi_i$ is an eigenfunction of $\mathcal{L}_{R_T}|_{\mathcal{H}}$ with eigenvalue $\lambda$, then by the definition of $P$, we have
\[
\lambda (P c)_k = \langle \lambda \varphi, \phi_k \rangle_{L^2(\bar\rho_T)} = \big\langle \mathcal{L}_{R_T} \varphi, \phi_k \big\rangle_{L^2(\bar\rho_T)} = \sum_{i=1}^n c_i \big\langle \mathcal{L}_{R_T} \phi_i, \phi_k \big\rangle_{L^2(\bar\rho_T)} = (A c)_k,
\]
where the last equality follows from (3.13). The smallest eigenvalue of $A$ converges to zero since the integral operators are compact.
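As a quick numerical illustration of Theorem 5.1(b), the sketch below (Python with SciPy; the matrices $A$ and $P$ are assumed to have been assembled as in (5.1) and (5.4), and the helper name is ours) computes the spectra in the two spaces, the second one via the generalized eigenvalue problem (5.4).

```python
import numpy as np
from scipy.linalg import eigh

def spectra(A, P):
    """Eigenvalues of A in L^2(X) and of the pencil (A, P) in L^2(rho_bar_T).

    A : normal matrix (5.1); P : Gram matrix of the basis in L^2(rho_bar_T), see (5.4).
    Returns both spectra sorted in decreasing order.
    """
    lam_unweighted = np.sort(eigh(A, eigvals_only=True))[::-1]   # A v = lambda v
    lam_weighted = np.sort(eigh(A, P, eigvals_only=True))[::-1]  # A v = gamma P v
    return lam_unweighted, lam_weighted
```

One typically observes that both spectra decay towards zero as $n$ grows, which is the ill-conditioning predicted by the theorem.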

Table 2: Notation of variables in the eigenvalue problems on $\mathcal{H}$.

                                  in $L^2(\mathcal{X})$                                                                                        in $L^2(\bar\rho_T)$
integral kernel and operator      $\widehat{G}_T \approx \overline{G}_T$, $\mathcal{L}_{\widehat{G}_T} \approx \mathcal{L}_{\overline{G}_T}$   $\widehat{R}_T \approx R_T$, $\mathcal{L}_{\widehat{R}_T} \approx \mathcal{L}_{R_T}$
eigenfunction and eigenvalue      $\mathcal{L}_{\widehat{G}_T}\, \widehat\varphi_k = \widehat\lambda_k\, \widehat\varphi_k$                    $\mathcal{L}_{\widehat{R}_T}\, \widehat\psi_k = \widehat\gamma_k\, \widehat\psi_k$
eigenvector and eigenvalue        $A\, \vec{\widehat\varphi}_k = \widehat\lambda_k\, \vec{\widehat\varphi}_k$                                  $A\, \vec{\widehat\psi}_k = \widehat\gamma_k\, P\, \vec{\widehat\psi}_k$

Remark 5.2 (Ill-conditioned normal matrix). It follows from the above theorem that the normal matrix $A$ becomes ill-conditioned as $n$ increases. The rank of $A$ is no larger than the dimension of either of the RKHSs $H_G$ or $H_R$, because they are the images of the square roots of the integral operators.

Remark 5.3 (Well-posedness). The inverse problem of identifying the interaction kernel is ill-posed in general: since the normal matrix becomes ill-conditioned as $n$ increases, a small perturbation in the eigenspace of the smallest eigenvalues may lead to large errors in the estimator. More specifically, we are solving the inverse problem $\widehat\phi = \mathcal{L}_{R_T}^{-1}(\mathcal{L}_{R_T} \phi_{\mathrm{true}})$ in $L^2(\bar\rho_T)$, where $\mathcal{L}_{R_T}^{-1}$ is an unbounded operator on $H_R$. In practice, the normal matrix approximates the operator $\mathcal{L}_{R_T}$, and $b$ approximates $\mathcal{L}_{R_T} \phi_{\mathrm{true}}$. On the other hand, if the error in $b$ occurs only in the eigenspace of large eigenvalues, then the error will be controlled and the inverse problem is well-posed.

We summarize the notation for these two eigenvalue problems in Table 2.

5.3 Two regularization norms

Theorem 5.1 suggests that the normal matrix $A$ is, albeit invertible, often ill-conditioned, so $A^{-1}$ will amplify the numerical error in $b$, leading to an error-prone estimator $\widehat{c} = A^{-1} b$. In this case, it is necessary to add regularization. The widely used Tikhonov regularization [10] with the $H^1(\mathcal{X})$ norm and with the L-curve method [11] works well in [16]. There are two factors in the method: the regularization norm and the parameter. The L-curve method provides a way to select the parameter for a given regularization norm. However, the choice of regularization norm is problem-dependent and less explored.

In this section we compare two regularization norms: the $L^2(\mathcal{X})$ and $L^2(\bar\rho_T)$ norms. We show by numerical examples that the weighted norm leads to a better conditioned inverse problem.

Tikhonov regularization. Instead of minimizing the loss functional $\mathcal{E}(\psi)$, Tikhonov regularization minimizes $\mathcal{E}_\lambda(\psi) = \mathcal{E}(\psi) + \lambda\, |||\psi|||^2$, where the regularization norm $|||\cdot|||$ is to be specified later. Intuitively, $|||\cdot|||$ encodes an a priori assumption on the regularity of the estimator. The minimizer of $\mathcal{E}_\lambda$ over the hypothesis space $\mathcal{H}$ is determined by

\[
\widehat{c}_\lambda = (A + \lambda B)^{-1} b. \tag{5.5}
\]

Here $B$ is an $n \times n$ matrix determined by the norm $|||\cdot|||$. The presence of $B$ can reduce the condition number of the linear system (5.5), so that the error in $b$ is not amplified by the inversion. The major difficulty in Tikhonov regularization is to choose the regularization norm $|||\cdot|||$ (hence the matrix $B$) along with an optimal regularization parameter $\lambda$. Given a regularization norm $|||\cdot|||$, an optimal parameter aims to balance the decrease of the loss functional $\mathcal{E}$ against the increase of the norm. Among various methods, the L-curve method [11] selects $\lambda$ where the largest curvature occurs on the parametric curve $(\log(\mathcal{E}(\widehat{c}_\lambda)), \log(|||\widehat{c}_\lambda|||))$; the truncated SVD methods [10] discard the smallest singular values of $A$ and solve the modified least squares problem.

2 Unweighted L norm: |||¨||| “ }¨}L2pX q Bij “ hφi, φjiL2pX q , 2 (5.6) Weighted L norm: |||¨||| “ }¨} 2 Bij “ hφi, φji 2 . L pρT q L pρT q

With the basis functions being piecewise constants on a uniform partition of $\mathcal X$, we have $B = \Delta r\, I_n$ for the first case and $B = P$ for the second, with $P$ defined in (5.4). In particular, when we take $\phi_i$ to be piecewise constants, the matrix $P$ becomes the diagonal matrix

$$P = \mathrm{diag}\big(\widehat\rho_T(r_1),\dots,\widehat\rho_T(r_n)\big)\,\Delta r, \qquad (5.7)$$

where $\widehat\rho_T(r_i)$ is the average density on the interval $[r_{i-1}, r_i]$, i.e., $\widehat\rho_T(r_i) = \frac{1}{\Delta r}\int_{r_{i-1}}^{r_i}\bar\rho_T(r)\,dr$. Note that we can represent $\bar\rho_T$ by $\bar\rho_T \approx \sum_{i=1}^n\widehat\rho_T(r_i)\phi_i$.

Weighted and unweighted SVD. Singular value decomposition (SVD) analysis provides an effective tool for understanding the features of Tikhonov regularization [10, 11] so as to select an optimal parameter. The idea is to study the decay of the singular values (or eigenvalues for symmetric matrices) and compare it with the projection of $b$ onto the eigenspaces. More precisely, let $\widehat A = \sum_{i=1}^n\sigma_i u_iu_i^\top$, where the $\sigma_i$ are the decreasingly ordered eigenvalues of $\widehat A$ and $u_i\in\mathbb R^n$ are the corresponding orthonormal eigenvectors. The regularized regression solution to $\widehat A c = b$ is $\widehat c = \sum_{i=1}^n f_i\frac{u_i^\top b}{\sigma_i}u_i$, where the filter factors $f_i$ from regularization aim to filter out the error-prone contributions from $\frac{u_i^\top b}{\sigma_i}u_i$ when $\sigma_i$ is small. This promotes the discrete Picard condition [10], which requires $u_i^\top b$ to decay faster than $\sigma_i$.

Here we demonstrate that the inverse problem is better conditioned on $L^2(\bar\rho_T)$ than on $L^2(\mathcal X)$, which corresponds to the two regularization norms in (5.6). These two function spaces lead to weighted and unweighted SVD, in which we compute the eigenvalues and eigenvectors as follows:

Unweighted SVD: $\widehat A\,\Phi = \Phi D_G$, with $\Phi^\top\Phi = I$;
Weighted SVD: $\ \ \widehat A\,\Psi = P\,\Psi D_R$, with $\Psi^\top P\,\Psi = I$. \qquad (5.8)

Here $D_G$ and $D_R$ are diagonal matrices. The solution of the unweighted SVD, $\Phi$ and $D_G$, provides the orthonormal eigenvectors and eigenvalues of $\widehat A$, the matrix representing $L_{\bar G_T}|_{\mathcal H}$ on $L^2(\mathcal X)$. In the weighted SVD, the generalized eigenvalue problem for $(\widehat A, P)$, with solution $\Psi$ and $D_R$, plays the same role for $L_{\bar R_T}|_{\mathcal H}$ on $L^2(\bar\rho_T)$.

Figure 1 shows the decay of the singular values from the unweighted SVD (denoted by G; "G: b-projection" refers to $|u_i^\top b|$ with $u_i$ being the columns of $\Phi$) and the weighted SVD (denoted by R; "R: b-projection" refers to $|u_i^\top b|$ with $u_i$ being the columns of $\Psi$) for the three examples in [16]. In all these examples, the weighted SVD has larger eigenvalues than those of the unweighted SVD, and it has slightly smaller ratios of b-projection to eigenvalue. Thus, the weighted SVD leads to a better-posed inversion.
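A minimal sketch of this comparison is given below, assuming synthetic stand-ins for the normal matrix $\widehat A$, the weight matrix $P$ of (5.7), and the data vector $b$ (they are placeholders for illustration, not the quantities computed in [16]). It solves the plain and generalized symmetric eigenvalue problems of (5.8) with scipy and prints the Picard-type ratios of b-projection to eigenvalue.

```python
import numpy as np
from scipy.linalg import eigh

# synthetic stand-ins (assumptions): a smooth kernel, a density, and a data vector b
n, r_max = 60, 5.0
dr = r_max / n
r = dr * (np.arange(n) + 0.5)                           # midpoints of a uniform partition
G_bar = np.exp(-0.5 * np.subtract.outer(r, r) ** 2)     # surrogate for the kernel Gbar_T
rho = np.exp(-r) + 1e-3
rho /= np.sum(rho) * dr                                 # surrogate density rho_bar_T
A = dr ** 2 * G_bar                                     # normal matrix A_ij ~ (dr)^2 Gbar_T(r_i, r_j)
P = np.diag(rho * dr)                                   # weight matrix, cf. (5.7)
b = A @ (r * np.exp(-r))                                # noiseless right-hand side, b = A c_true

# unweighted SVD: A Phi = Phi D_G with Phi^T Phi = I
lam, Phi = eigh(A)
lam, Phi = lam[::-1], Phi[:, ::-1]                      # sort eigenvalues decreasingly

# weighted SVD: A Psi = P Psi D_R with Psi^T P Psi = I
gam, Psi = eigh(A, P)
gam, Psi = gam[::-1], Psi[:, ::-1]

# Picard-type diagnostic: projection of b divided by the spectrum (smaller = better posed)
print("unweighted ratios:", (np.abs(Phi.T @ b) / lam)[:5])
print("weighted   ratios:", (np.abs(Psi.T @ b) / gam)[:5])
```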

5.4 Regularization and the RKHSs

Here we connect the regularization with the RKHSs. In particular, we show that regularization by truncated SVD is equivalent to selecting a low-frequency orthogonal basis of the RKHSs.

Unweighted SVD and the RKHS $H_G$. First, we show that the integral kernel $\bar G_T$, represented on the basis of $\mathcal H$, is equivalent to $\widehat A$ up to a multiplicative factor. We approximate $\bar G_T$ by $\widehat G_T(r,s) = \sum_{i,j=1}^n\widehat G_T(r_i,r_j)\phi_i(r)\phi_j(s)$, where $\widehat G_T(r_i,r_j)$ is the average of $\bar G_T$ over the region $[r_{i-1},r_i]\times[r_{j-1},r_j]$.


Figure 1: SVD analysis of the regression in three example potentials: the cubic potential with $\phi(r) = 3r^2$, opinion dynamics with $\phi$ piecewise linear, and the attraction-repulsion potential with $\phi(r) = r - r^{-1.5}$. Here R represents the weighted SVD and G represents the unweighted SVD. In all three examples, the weighted SVD has larger eigenvalues than those of the unweighted SVD, and it has slightly smaller ratios of b-projection to eigenvalue.

Recalling the definition of $\bar G_T$ in (2.6) and the bilinear form in (2.5), we have
$$\widehat G_T(r_i,r_j) = \frac{1}{(\Delta r)^2}\int\!\!\int \phi_i(r)\phi_j(s)\,\bar G_T(r,s)\,dr\,ds = \frac{1}{(\Delta r)^2}\langle\langle\phi_i,\phi_j\rangle\rangle = \frac{1}{(\Delta r)^2}\,\widehat A_{i,j}.$$

Consider the eigenfunctions of the approximated operator $L_{\widehat G_T}$, given by $\{\widehat\varphi_k\}$ with $\widehat\varphi_k(r) = \sum_{i=1}^n\widehat\varphi_k(r_i)\phi_i(r)$. Then $L_{\widehat G_T}\widehat\varphi_k = \widehat\lambda_k\widehat\varphi_k$ implies
$$\widehat\lambda_k\,\widehat\varphi_k(r) = \int_{\mathbb R^d}\widehat G_T(r,s)\,\widehat\varphi_k(s)\,ds = \int_{\mathbb R^d}\Big(\sum_{i,j=1}^n\widehat G_T(r_i,r_j)\phi_i(r)\phi_j(s)\Big)\Big(\sum_{i=1}^n\widehat\varphi_k(r_i)\phi_i(s)\Big)ds = (\Delta r)^2\sum_{j=1}^n\Big(\sum_{i=1}^n\widehat G_T(r_i,r_j)\,\widehat\varphi_k(r_i)\Big)\phi_j(r). \qquad (5.9)$$

This is equivalent to saying that $\vec\varphi_k = (\widehat\varphi_k(r_1),\dots,\widehat\varphi_k(r_n))$, $k = 1,\dots,n$, are eigenvectors of the matrix $\widehat A$ corresponding to the eigenvalues $\{\widehat\lambda_k\}_{k=1}^n$ in the unweighted SVD in (5.8). Thus, the discrete approximation of the RKHS $H_G$ is $\mathrm{span}\{\widehat\varphi_1,\dots,\widehat\varphi_m\}$, the span of the eigenfunctions with positive eigenvalues; in particular, the low-frequency eigenfunctions are those with the largest eigenvalues.

Next, we show that a truncated SVD regularization is a selection of a low-frequency basis of the RKHS. Recall that in a truncated SVD, one truncates the eigenvalues of $\widehat A$ to avoid amplifying the error in $b$. That is, one truncates the small eigenvalues after $m$ according to a threshold and obtains the least squares solution

$$\widehat c = \sum_{k=1}^m\widehat\lambda_k^{-1}\big(\vec\varphi_k^{\,\top} b\big)\,\vec\varphi_k. \qquad (5.10)$$

Equivalently, we can view the truncated SVD as finding a solution on $H_G^m = \mathrm{span}\{\widehat\varphi_1,\dots,\widehat\varphi_m\}$. Note that $H_G^m\subset H_G$ is spanned by the first $m$ eigenfunctions of the operator $L_{\widehat G_T}$, which filters out the high-frequency functions. Suppose the estimator $\widehat\phi$ in $H_G^m$ is given by $\widehat\phi(r) = \sum_{k=1}^m\alpha_k\widehat\varphi_k(r) = \sum_{i=1}^n(\Phi_m\alpha)_i\phi_i(r)$, where $\alpha = (\alpha_1,\dots,\alpha_m)$ is a vector in $\mathbb R^m$, $\Phi_m$ is the matrix consisting of the first $m$ columns of $\Phi$, and $(\Phi_m\alpha)_i$ denotes the $i$-th component of the vector $\Phi_m\alpha$. Then the loss functional evaluated on $\widehat\phi$ writes

J J J p Epφq “ α ΦmAΦmα ´ 2b Φmα.

The minimizer of $\mathcal E(\widehat\phi)$ is determined by the $\alpha$ such that $\Phi_m^\top\widehat A\,\Phi_m\alpha = \Phi_m^\top b$. Noticing that $\Phi_m^\top\widehat A\,\Phi_m = \mathrm{diag}(\widehat\lambda_1,\dots,\widehat\lambda_m) = D_{G,m}$, we have $\alpha = D_{G,m}^{-1}\Phi_m^\top b$. If we write $\widehat\phi(r) = \sum_{i=1}^n c_i\phi_i(r)$, then
$$c = \Phi_m\alpha = \Phi_m D_{G,m}^{-1}\Phi_m^\top b.$$

This is the same as the estimator in (5.10).
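This equivalence can be checked numerically. The sketch below uses a synthetic symmetric positive definite matrix standing in for $\widehat A$ and a random $b$ (both assumptions, for illustration only), and compares the truncated-SVD estimator (5.10) with the restricted least squares solution $c = \Phi_m D_{G,m}^{-1}\Phi_m^\top b$.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, m = 30, 8
M = rng.standard_normal((n, n))
A = M @ M.T + 1e-3 * np.eye(n)        # synthetic symmetric positive definite stand-in for A
b = rng.standard_normal(n)            # synthetic right-hand side

lam, Phi = eigh(A)
lam, Phi = lam[::-1], Phi[:, ::-1]    # decreasing eigenvalues and eigenvectors

# truncated-SVD estimator, cf. (5.10)
c_tsvd = sum((Phi[:, k] @ b) / lam[k] * Phi[:, k] for k in range(m))

# least squares restricted to the first m eigenvectors: c = Phi_m D_{G,m}^{-1} Phi_m^T b
Phi_m = Phi[:, :m]
c_sub = Phi_m @ ((Phi_m.T @ b) / lam[:m])

print(np.allclose(c_tsvd, c_sub))     # True: the two estimators coincide
```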

Weighted SVD and the RKHS $H_R$. We first show that solving the eigenvalue problem of the operator $L_{\widehat R_T}$ over $\mathcal H$ is equivalent to the generalized eigenvalue problem in (5.8). The approximation of $\bar R_T$ is

given by $\widehat R_T(r,s) = \sum_{i,j=1}^n\widehat R_T(r_i,r_j)\phi_i(r)\phi_j(s)$, where $\widehat R_T(r_i,r_j) = \frac{\widehat G_T(r_i,r_j)}{\widehat\rho_T(r_i)\,\widehat\rho_T(r_j)}$. Then we can derive the discrete eigenfunctions of the operator $L_{\widehat R_T}$. Suppose $\widehat\psi_k(r) = \sum_{i=1}^n\widehat\psi_k(r_i)\phi_i(r)$ is an eigenfunction of $L_{\widehat R_T}$ with corresponding eigenvalue $\widehat\gamma_k$. Then we have, as in (5.9),
$$\widehat\gamma_k\sum_{i=1}^n\widehat\psi_k(r_i)\phi_i(r) = (\Delta r)^2\sum_{j=1}^n\Big(\sum_{i=1}^n\frac{\widehat G_T(r_i,r_j)}{\widehat\rho_T(r_i)}\,\widehat\psi_k(r_i)\Big)\phi_j(r).$$
Denote $\vec\psi_k = \big(\widehat\psi_k(r_1),\dots,\widehat\psi_k(r_n)\big)$ for $k = 1,\dots,n$; with $P$ in (5.7), we have

$$\widehat A\,\vec\psi_k = \widehat\gamma_k\,P\,\vec\psi_k.$$

Since both $\widehat A$ and $P$ are symmetric, it is well-defined that $\{\vec\psi_k\}_{k=1}^n$ are the generalized eigenvectors of the pair $(\widehat A, P)$ with eigenvalues $\{\widehat\gamma_k\}_{k=1}^n$. Letting $\vec\psi_k$ be the columns of $\Psi$ and $D_R = \mathrm{diag}(\widehat\gamma_1,\dots,\widehat\gamma_n)$, we obtain the second equation in (5.8).

Next, consider the inference on the hypothesis space $H_R^m = \mathrm{span}\{\widehat\psi_1,\dots,\widehat\psi_m\}$. Suppose the estimator $\widehat\phi(r)$ in $H_R^m$ is given by $\widehat\phi = \sum_{k=1}^m\alpha_k\widehat\psi_k(r) = \sum_{i=1}^n(\Psi_m\alpha)_i\phi_i(r)$, where $\alpha = (\alpha_1,\dots,\alpha_m)$ is a vector in $\mathbb R^m$, $\Psi_m$ is the matrix consisting of the first $m$ columns of $\Psi$, and $(\Psi_m\alpha)_i$ denotes the $i$-th component of the vector $\Psi_m\alpha$. Then the loss functional evaluated on $\widehat\phi$ writes

$$\mathcal E(\widehat\phi) = \alpha^\top\Psi_m^\top\widehat A\,\Psi_m\alpha - 2b^\top\Psi_m\alpha.$$

The minimizer of $\mathcal E(\widehat\phi)$ is determined by the $\alpha$ such that $\Psi_m^\top\widehat A\,\Psi_m\alpha = \Psi_m^\top b$. Noticing that $\Psi_m^\top\widehat A\,\Psi_m = \mathrm{diag}(\widehat\gamma_1,\dots,\widehat\gamma_m) = D_{R,m}$, we have $\alpha = D_{R,m}^{-1}\Psi_m^\top b$. If we write $\widehat\phi(r) = \sum_{i=1}^n c_i\phi_i(r)$, then
$$c = \Psi_m\alpha = \Psi_m D_{R,m}^{-1}\Psi_m^\top b.$$

This is equivalent to $c = \sum_{k=1}^m\widehat\gamma_k^{-1}\big(\vec\psi_k^{\,\top} b\big)\,\vec\psi_k$, a truncated SVD with $m$ eigenvectors.
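For completeness, here is a sketch of the weighted counterpart under the same kind of synthetic stand-ins (a random symmetric positive definite $\widehat A$, a diagonal $P$, and a random $b$; all assumptions for illustration): the generalized eigenpairs of $(\widehat A, P)$ are computed with scipy, and the restricted solution $\Psi_m D_{R,m}^{-1}\Psi_m^\top b$ is checked against the truncated sum above.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, m = 30, 8
M = rng.standard_normal((n, n))
A = M @ M.T + 1e-3 * np.eye(n)                 # synthetic stand-in for the normal matrix
P = np.diag(rng.uniform(0.5, 2.0, size=n))     # synthetic weight matrix, cf. (5.7)
b = rng.standard_normal(n)

gam, Psi = eigh(A, P)                          # A psi_k = gamma_k P psi_k, with Psi^T P Psi = I
gam, Psi = gam[::-1], Psi[:, ::-1]             # decreasing generalized eigenvalues

Psi_m = Psi[:, :m]
c_sub = Psi_m @ ((Psi_m.T @ b) / gam[:m])      # c = Psi_m D_{R,m}^{-1} Psi_m^T b
c_tsvd = sum((Psi[:, k] @ b) / gam[k] * Psi[:, k] for k in range(m))
print(np.allclose(c_sub, c_tsvd))              # True: weighted truncated SVD
```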

A Appendix

Positive definite functions. We review the definitions and properties of positive definite kernels. The following is a real-variable version of the definition in [3, p.67].

Definition A.1 (Positive definite function). Let $X$ be a nonempty set. A function $G: X\times X\to\mathbb R$ is positive definite if and only if it is symmetric (i.e., $G(x,y) = G(y,x)$) and $\sum_{j,k=1}^n c_jc_kG(x_j,x_k)\ge 0$ for all $n\in\mathbb N$, $\{x_1,\dots,x_n\}\subset X$, and $c = (c_1,\dots,c_n)\in\mathbb R^n$. The function $G$ is strictly positive definite if the equality holds only when $c = 0\in\mathbb R^n$.

Theorem A.2 (Properties of positive definite kernels). Suppose that $k, k_1, k_2: X\times X\subset\mathbb R^d\times\mathbb R^d\to\mathbb R$ are positive definite kernels. Then

(a) $k_1k_2$ is positive definite ([3, p.69]).
(b) The inner product $\langle u,v\rangle = \sum_{j=1}^d u_jv_j$ is positive definite ([3, p.73]).
(c) $f(u)f(v)$ is positive definite for any function $f: X\to\mathbb R$ ([3, p.69]).
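As a quick numerical sanity check of these closure properties (illustrative only: a nonnegative Gram spectrum on sampled points is consistent with, but does not prove, positive definiteness), one can form Gram matrices of the kernels in (a)-(c) on random points and inspect their eigenvalues. The Gaussian kernel and $f = \cos$ below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=25)                    # sample points in X = [-1, 1]

k1 = np.exp(-0.5 * np.subtract.outer(x, x) ** 2)       # Gaussian kernel, positive definite
k2 = np.outer(x, x)                                    # inner-product kernel <u, v>, property (b), d = 1
k3 = np.outer(np.cos(x), np.cos(x))                    # f(u) f(v) with f = cos, property (c)
k_prod = k1 * k2 * k3                                  # entrywise product of kernels, property (a)

for K in (k1, k2, k3, k_prod):
    # Gram matrices of positive definite kernels have nonnegative eigenvalues
    print(np.linalg.eigvalsh(K).min() >= -1e-10)
```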

RKHS and positive integral operators. We review the definitions and properties of the Mercer kernel, the RKHS, and the related integral operator on a compact domain (see e.g., [8]) and a non-compact domain (see e.g., [28]). Let $(X, d)$ be a metric space and $G: X\times X\to\mathbb R$ be continuous and symmetric. We say that $G$ is a Mercer kernel if it is positive definite (as in Definition A.1). The reproducing kernel Hilbert space (RKHS) $H_G$ associated with $G$ is defined to be the closure of $\mathrm{span}\{G(x,\cdot): x\in X\}$ with the inner product

$$\langle f, g\rangle_{H_G} = \sum_{i=1,j=1}^{n,m} c_id_j\,G(x_i, y_j)$$

for any $f = \sum_{i=1}^n c_iG(x_i,\cdot)$ and $g = \sum_{j=1}^m d_jG(y_j,\cdot)$. It is the unique Hilbert space such that $\mathrm{span}\{G(\cdot,y): y\in X\}$ is dense in $H_G$ and the reproducing property holds, in the sense that $f(x) = \langle G(x,\cdot), f\rangle_{H_G}$ for all $f\in H_G$ and $x\in X$ (see [8, Theorem 2.9]).

By means of the Mercer theorem, we can characterize the RKHS $H_G$ through the integral operator associated with the kernel. Let $\mu$ be a nondegenerate Borel measure on $(X, d)$ (that is, $\mu(U) > 0$ for every open set $U\subset X$). Define the integral operator $L_G$ on $L^2(X,\mu)$ by

$$L_Gf(x) = \int_X G(x,y)f(y)\,d\mu(y).$$

The RKHS has the following operator characterization (see e.g., [8, Section 4.4] and [28]):

Theorem A.3. Assume that $G$ is a Mercer kernel and $G\in L^2(X\times X, \mu\otimes\mu)$. Then

1. $L_G$ is a compact positive self-adjoint operator. It has countably many positive eigenvalues $\{\lambda_i\}_{i=1}^\infty$ and corresponding orthonormal eigenfunctions $\{\phi_i\}_{i=1}^\infty$.
2. $\{\sqrt{\lambda_i}\,\phi_i\}_{i=1}^\infty$ is an orthonormal basis of the RKHS $H_G$.
3. The RKHS is the image of the square root of the integral operator, i.e., $H_G = L_G^{1/2}L^2(X,\mu)$.
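A short sketch of this characterization on a grid, assuming $X = [0,1]$ with the uniform measure and the Mercer kernel $G(x,y) = e^{-|x-y|}$ (both choices are only for illustration): the discretized operator $L_G$ is diagonalized, its eigenfunctions are orthonormal in $L^2(\mu)$, and the Mercer expansion $G(x,y) \approx \sum_i\lambda_i\phi_i(x)\phi_i(y)$ is recovered numerically.

```python
import numpy as np
from scipy.linalg import eigh

# grid discretization of L_G f(x) = \int_X G(x, y) f(y) dmu(y) on X = [0, 1]
n = 200
x = (np.arange(n) + 0.5) / n
w = np.full(n, 1.0 / n)                                # uniform measure mu (assumption)
G = np.exp(-np.abs(np.subtract.outer(x, x)))           # a Mercer kernel on [0, 1]

# symmetric discretization: eigenpairs of W^{1/2} G W^{1/2} approximate those of L_G
Wh = np.sqrt(w)
lam, V = eigh((Wh[:, None] * G) * Wh[None, :])
lam, V = lam[::-1], V[:, ::-1]                         # decreasing eigenvalues
phi = V / Wh[:, None]                                  # eigenfunctions, orthonormal in L^2(mu)
# {sqrt(lam_i) * phi_i} gives a discrete analogue of an orthonormal basis of H_G

# Mercer expansion: G(x, y) ~ sum_i lam_i phi_i(x) phi_i(y)
G_mercer = (phi * lam) @ phi.T
print(np.max(np.abs(G - G_mercer)))                    # near machine precision on the grid
```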

References

[1] Aaron S. Baumgarten and Ken Kamrin. A general constitutive model for dense, fine-particle suspensions validated in many geometries. Proc Natl Acad Sci USA, 116(42):20828–20836, 2019.

[2] Nathan Bell, Yizhou Yu, and Peter J. Mucha. Particle-based simulation of granular materials. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation - SCA '05, page 77, Los Angeles, California, 2005. ACM Press.
[3] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. New York: Springer, 1984.
[4] Mauro Bonafini, Massimo Fornasier, and Bernhard Schmitzer. Data-driven entropic spatially inhomogeneous evolutionary games. arXiv preprint arXiv:2103.05429, 2021.
[5] Mattia Bongini, Massimo Fornasier, Markus Hansen, and Mauro Maggioni. Inferring interaction rules from observations of evolutive systems I: The variational approach. Mathematical Models and Methods in Applied Sciences, 27(05):909–951, 2017.
[6] José A. Carrillo, Katy Craig, and Yao Yao. Aggregation-diffusion equations: dynamics, asymptotics, and singular limits. In Active Particles, Volume 2, pages 65–108. Springer, 2019.
[7] José A. Carrillo, Massimo DiFrancesco, Alessio Figalli, T. Laurent, and D. Slepčev. Global-in-time weak measure solutions and finite-time aggregation for nonlocal interaction equations. Duke Math. J., 156(2):229–271, 2011.
[8] Felipe Cucker and Ding Xuan Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
[9] Jianqing Fan and Qiwei Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York, NY, 2003.
[10] Per Christian Hansen. REGULARIZATION TOOLS: A Matlab package for analysis and solution of discrete ill-posed problems. Numer. Algor., 6(1):1–35, 1994.
[11] Per Christian Hansen. The L-curve and its use in the numerical treatment of inverse problems. pages 119–142, 2000.
[12] Pierre-Emmanuel Jabin and Zhenfu Wang. Mean Field Limit for Stochastic Particle Systems. In Nicola Bellomo, Pierre Degond, and Eitan Tadmor, editors, Active Particles, Volume 1, pages 379–402. Springer International Publishing, Cham, 2017.
[13] Pierre-Emmanuel Jabin and Zhenfu Wang. Quantitative estimates of propagation of chaos for stochastic systems with $W^{-1,\infty}$ kernels. Invent. Math., 214(1):523–591, 2018.
[14] Ioannis Karatzas and Steven E. Shreve. Brownian motion. In Brownian Motion and Stochastic Calculus. Springer, second edition, 1998.
[15] Yury A. Kutoyants. Statistical inference for ergodic diffusion processes. Springer, 2004.
[16] Quanjun Lang and Fei Lu. Learning interaction kernels in mean-field equations of 1st-order systems of interacting particles. arXiv preprint arXiv:2010.15694, 2020.
[17] Zhongyang Li and Fei Lu. On the coercivity condition in the learning of interacting particle systems. arXiv preprint arXiv:2011.10480, 2020.
[18] Zhongyang Li, Fei Lu, Mauro Maggioni, Sui Tang, and Cheng Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
[19] Fei Lu, Mauro Maggioni, and Sui Tang. Learning interaction kernels in heterogeneous systems of agents from multiple trajectories. Journal of Machine Learning Research, 22(32):1–67, 2021.
[20] Fei Lu, Mauro Maggioni, and Sui Tang. Learning interaction kernels in stochastic systems of interacting particles from multiple trajectories. To appear in Foundations of Computational Mathematics, 2021.

[21] Fei Lu, Ming Zhong, Sui Tang, and Mauro Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proceedings of the National Academy of Sciences of the United States of America, 116(29):14424–14433, 2019.
[22] Florent Malrieu. Convergence to equilibrium for granular media equations and their Euler schemes. Ann. Appl. Probab., 13(2):540–560, 2003.
[23] Sylvie Méléard. Asymptotic Behaviour of Some Interacting Particle Systems; McKean-Vlasov and Boltzmann Models, volume 1627, pages 42–95. Springer, Berlin, Heidelberg, 1996.
[24] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651–2667, December 2006.
[25] Sebastien Motsch and Eitan Tadmor. Heterophilious Dynamics Enhances Consensus. SIAM Review, 56(4):577–621, 2014.
[26] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
[27] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389–2410, 2011.
[28] Hongwei Sun. Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337–349, 2005.
[29] Alain-Sol Sznitman. Topics in Propagation of Chaos, volume 1464, pages 165–251. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991.
[30] Tamás Vicsek and Anna Zafeiris. Collective motion. Physics Reports, 517:71–140, 2012.
[31] Holger Wendland. Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2004.
