The Teaching Dimension of Kernel Perceptron
Akash Kumar (MPI-SWS), Hanqi Zhang (University of Chicago), Adish Singla (MPI-SWS), Yuxin Chen (University of Chicago)
Abstract

Algorithmic machine teaching has been studied under the linear setting where exact teaching is possible. However, little is known about teaching nonlinear learners. Here, we establish the sample complexity of teaching, aka teaching dimension, for kernelized perceptrons for different families of feature maps. As a warm-up, we show that the teaching complexity is Θ(d) for the exact teaching of linear perceptrons in R^d, and Θ(d^k) for kernel perceptrons with a polynomial kernel of order k. Furthermore, under certain smoothness assumptions on the data distribution, we establish a rigorous bound on the complexity of approximately teaching a Gaussian kernel perceptron. We provide numerical examples of the optimal (approximate) teaching set under several canonical settings for linear, polynomial and Gaussian kernel perceptrons.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

1 Introduction

Machine teaching studies the problem of finding an optimal training sequence to steer a learner towards a target concept (Zhu et al., 2018). An important learning-theoretic complexity measure of machine teaching is the teaching dimension (Goldman & Kearns, 1995), which specifies the minimal number of training examples required in the worst case to teach a target concept. Over the past few decades, the notion of teaching dimension has been investigated under a variety of learner's models and teaching protocols (e.g., Cakmak & Lopes (2012); Singla et al. (2013; 2014); Liu et al. (2017); Haug et al. (2018); Tschiatschek et al. (2019); Liu et al. (2018); Kamalaruban et al. (2019); Hunziker et al. (2019); Devidze et al. (2020); Rakhsha et al. (2020)). One of the most studied scenarios is the case of teaching a version-space learner (Goldman & Kearns, 1995; Anthony et al., 1995; Zilles et al., 2008; Doliwa et al., 2014; Chen et al., 2018; Mansouri et al., 2019; Kirkpatrick et al., 2019). Upon receiving a sequence of training examples from the teacher, a version-space learner maintains a set of hypotheses that are consistent with the training examples, and outputs a random hypothesis from this set.

As a canonical example, consider teaching a 1-dimensional binary threshold function f_θ∗(x) = 1{x − θ∗ ≥ 0} for x ∈ [0, 1]. For a learner with a finite (or countably infinite) version space, e.g., θ ∈ {i/n}_{i=0,...,n} where n ∈ Z₊ (see Fig. 1a), a smallest training set is {(i/n, 0), ((i+1)/n, 1)} where i/n ≤ θ∗ < (i+1)/n; thus the teaching dimension is 2. However, when the version space is continuous, the teaching dimension becomes ∞, because it is no longer possible for the learner to pick out a unique threshold θ∗ with a finite training set. This is due to two key (limiting) modeling assumptions of the version-space learner: (1) all (consistent) hypotheses in the version space are treated equally, and (2) there exists a hypothesis in the version space that is consistent with all training examples. As one can see, these assumptions fail to capture the behavior of many modern learning algorithms, where the best hypotheses are often selected via optimizing certain loss functions, and the data is not perfectly separable (i.e., not realizable w.r.t. the hypothesis/model class).

To lift these modeling assumptions, a more realistic teaching scenario is to consider the learner as an empirical risk minimizer (ERM). In fact, under the realizable setting, the version-space learner could be viewed as an ERM that optimizes the 0-1 loss—one that finds all hypotheses with zero training error. Recently, Liu & Zhu (2016) studied the teaching dimension of linear ERM, and established values of the teaching dimension for several classes of linear (regularized) ERM learners, including the support vector machine (SVM), logistic regression and ridge regression. As illustrated in Fig. 1b, for the previous example it suffices to use {(θ∗ − ε, 0), (θ∗ + ε, 1)} with any ε ≤ min(1 − θ∗, θ∗) as the training set to teach θ∗ as an optimizer of the SVM objective (i.e., ℓ₂-regularized hinge loss); hence the teaching dimension is 2. In Fig. 1c, we consider teaching an ERM learner with the perceptron loss, i.e., ℓ(f_θ(x), y) = max(−y · (x − θ), 0) (where y ∈ {−1, 1}). If the teacher is allowed to construct any training example with any labeling¹, then it is easy to verify that the minimal training set is {(θ∗, 1), (θ∗, −1)}.

¹If the teacher is restricted to only provide consistent labels (i.e., the realizable setting), then the ERM with perceptron loss reduces to the version-space learner, where the teaching dimension is ∞.

Figure 1: Teaching a 1D threshold function to an ERM learner. Training instances are marked in grey. (a) Version-space learner (0/1 loss) with a finite hypothesis set. (b) SVM (hinge loss) with training set {(θ∗ − ε, 0), (θ∗ + ε, 1)}. (c) ERM learner with perceptron loss and training set {(θ∗, 0), (θ∗, 1)}.

Table 1: Teaching dimension for kernel perceptron

                       linear    polynomial                 Gaussian
  TD (exact)           Θ(d)      Θ(\binom{d+k-1}{k})        ∞
  ε-approximate TD     -         -                          d^{O(log²(1/ε))}
  Assumption           -         3.2.1                      3.4.1, 3.4.2

While these results show promise at understanding optimal teaching for ERM learners, existing work (Liu & Zhu, 2016) has focused exclusively on the linear setting with the goal to teach the exact hypothesis (e.g., teaching the exact model parameters or the exact decision boundary for classification tasks). Aligned with these results, we establish an upper bound as shown in §3.1. It remains a fundamental challenge to rigorously characterize the teaching complexity for nonlinear learners. Furthermore, in the cases where exact teaching is not possible with a finite training set, the classical teaching dimension no longer captures the fine-grained complexity of the teaching tasks, and hence one needs to relax the teaching goals and investigate new notions of teaching complexity.

In this paper, we aim to address the above challenges. We focus on the kernel perceptron, a specific type of ERM learner that is less understood even under the linear setting. Following the convention in teaching ERM learners, we consider the constructive setting, where the teacher can construct arbitrary teaching examples in the support of the data distribution. Our contributions are highlighted below, with the main theoretical results summarized in Table 1.

• We formally define approximate teaching of kernel perceptrons, and propose a novel measure of teaching complexity, namely the ε-approximate teaching dimension (ε-TD), which captures the complexity of teaching a "relaxed" target that is close to the target hypothesis in terms of the expected risk. Our relaxed notion of teaching dimension strictly generalizes the teaching dimension of Liu & Zhu (2016): it trades off the teaching complexity against the risk of the taught hypothesis, and hence is more practical in characterizing the complexity of a teaching task (§2).

• We show that exact teaching is feasible for kernel perceptrons with finite-dimensional feature maps, such as the linear kernel and the polynomial kernel. Specifically, for data points in R^d, we establish a Θ(d) bound on the teaching dimension of the linear perceptron. Under a mild condition on the data distribution, we provide a tight bound of Θ(\binom{d+k-1}{k}) for the polynomial perceptron of order k. We also exhibit optimal training sets that match these teaching dimensions (§3.1 and §3.2).

• We further show that for the Gaussian kernelized perceptron, exact teaching is not possible with a finite set of hypotheses, and we then establish a d^{O(log²(1/ε))} bound on the ε-approximate teaching dimension (§3.4). To the best of our knowledge, these results constitute the first known bounds on (approximately) teaching a non-linear ERM learner (§3).

2 Problem Statement

Basic definitions. We denote by X the input space and by Y := {−1, 1} the output space. A hypothesis is a function h : X → Y. In this paper, we identify a hypothesis h_θ with its model parameter θ. The hypothesis space H is a set of hypotheses. By a training point we mean a pair (x, y) ∈ X × Y. We assume that the training points are drawn from an unknown distribution P over X × Y. A training set is a multiset D = {(x₁, y₁), ..., (xₙ, yₙ)}, where repeated pairs are allowed. Let 𝔻 denote the set of all training sets of all sizes. A learning algorithm A : 𝔻 → 2^H takes in a training set D ∈ 𝔻 and outputs a subset of the hypothesis space H. That is, A doesn't necessarily return a unique hypothesis.

Kernel perceptron. Consider a set of training points D := {(xᵢ, yᵢ)}ᵢ₌₁ⁿ where xᵢ ∈ R^d, and a hypothesis θ ∈ R^d. A linear perceptron is defined as f_θ(x) := sign(θ · x) in the homogeneous setting. We consider the algorithm A_opt that learns an optimal perceptron to classify D, as defined below:

    A_opt(D) := arg min_{θ ∈ R^d} Σᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ)    (1)

where the loss function is ℓ(f_θ(x), y) := max(−y · f_θ(x), 0). Similarly, we consider the non-linear setting via kernel-based hypotheses for perceptrons, defined with respect to a kernel operator K : X × X → R which adheres to Mercer's positive definite conditions (Vapnik, 1998). A kernel-based hypothesis has the form

    f(x) = Σᵢ₌₁ᵏ αᵢ · K(xᵢ, x)    (2)

where ∀i, xᵢ ∈ X and the αᵢ are reals. In order to simplify the derivation of the algorithms and their analysis, we associate a reproducing kernel Hilbert space (RKHS) with K in the standard way common to all kernel methods. Formally, let H_K be the closure of the set of all hypotheses of the form given in Eq. (2). A non-linear kernel perceptron corresponding to K optimizes Eq. (1) as follows:

    A_opt(D) := arg min_{θ ∈ H_K} Σᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ)    (3)

where f_θ(·) = Σᵢ₌₁ˡ αᵢ · K(aᵢ, ·) for some {aᵢ}ᵢ₌₁ˡ ⊂ X and αᵢ real. Alternatively, we also write f_θ(·) = θ · Φ(·), where Φ : X → H_K is the feature map of the kernel function K. A reproducing kernel Hilbert space with kernel K can be decomposed as K(x, x′) = ⟨Φ(x), Φ(x′)⟩ for any x, x′ ∈ X (Scholkopf & Smola, 2001). Thus, we also identify f_θ with Σᵢ₌₁ˡ αᵢ · Φ(aᵢ).

The teaching problem. We are interested in the problem of teaching a target hypothesis θ∗, where a helpful teacher provides labelled data points TS ⊆ X × Y, also called a teaching set. Assuming the constructive setting (Liu & Zhu, 2016), to teach a kernel perceptron learner the teacher can construct a training set TS with any items in R^d, i.e., for any (x′, y′) ∈ TS we have x′ ∈ R^d and y′ ∈ {−1, 1}. Importantly, for the purpose of teaching we do not assume that TS is drawn i.i.d. from a distribution. We define the teaching dimension for the exact parameter of θ∗ corresponding to a kernel perceptron as TD(θ∗, A_opt), which is the size of the smallest teaching set TS such that A_opt(TS) = {θ∗}. We refer to teaching the exact parameters of a target hypothesis θ∗ as exact teaching. Since a perceptron is agnostic to norms, we study the problem of teaching a target classifier (decision boundary), where A_opt(TS) = {tθ∗} for some real t > 0. Thus,

    TD({tθ∗}_{t>0}, A_opt) = min_{p>0} TD(pθ∗, A_opt).

Since it can be stringent to construct a teaching set for a decision boundary (see §3.4), exact teaching is not always feasible. We introduce and study approximate teaching, which is formally defined as follows:

Definition 1 (ε-approximate teaching set). Consider a kernel perceptron learner with a kernel K : X × X → R and the corresponding RKHS H_K with feature map Φ(·). For a target model θ∗ ∈ H_K and ε > 0, we say TS ⊆ X × Y is an ε-approximate teaching set w.r.t. P if the kernel perceptron θ̂ ∈ A_opt(TS) satisfies

    | E[max(−y · f∗(x), 0)] − E[max(−y · f̂(x), 0)] | ≤ ε    (4)

where the expectations are over (x, y) ∼ P, and f∗(x) = θ∗ · Φ(x) and f̂(x) = θ̂ · Φ(x).

Naturally, we define the approximate teaching dimension as:

Definition 2 (ε-approximate teaching dimension). Consider a kernel perceptron learner with a kernel K : X × X → R and the corresponding RKHS feature map Φ(·). For a target model θ∗ ∈ H_K and ε > 0, we define ε-TD(θ∗, A_opt) as the teaching dimension, which is the size of the smallest teaching set for ε-approximate teaching of θ∗ w.r.t. P.

According to Definition 2, exact teaching corresponds to constructing a 0-approximate teaching set for a target classifier (e.g., the decision boundary of a kernel perceptron). We study linear and polynomial kernelized perceptrons in the exact teaching setting. Under some mild assumptions on the smoothness of the data distribution, we establish a bound on the approximate teaching dimension of the Gaussian kernelized perceptron.

3 Teaching Dimension for Kernel Perceptron

In this section, we study the generic problem of teaching kernel perceptrons in three different settings: 1) linear (in §3.1); 2) polynomial (in §3.2); and 3) Gaussian (in §3.4). Before establishing our main result for Gaussian kernelized perceptrons, we first introduce two important results for linear and polynomial perceptrons inherently connected to the Gaussian perceptron. Our proofs are inspired by ideas from linear algebra and projective geometry, as detailed in the supplemental materials.
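To make the learner model concrete, the following is a minimal numerical sketch of the ERM objective in Eq. (1): batch subgradient descent on the perceptron loss for a homogeneous linear perceptron. The function names, step size, and toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def perceptron_loss(theta, X, y):
    """Total perceptron loss sum_i max(-y_i * (theta . x_i), 0), as in Eq. (1)."""
    margins = y * (X @ theta)
    return np.maximum(-margins, 0.0).sum()

def fit_perceptron(X, y, lr=0.1, n_iter=1000, seed=0):
    """Batch subgradient descent; returns one (of possibly many) minimizers."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ theta)
        mask = margins < 0  # points currently misclassified
        # subgradient of max(-y theta.x, 0) is -y*x on misclassified points
        grad = -(y[mask, None] * X[mask]).sum(axis=0)
        theta -= lr * grad
    return theta

# Toy data in R^2, separable through the origin by theta = (1, -1)
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([1, -1, 1, -1])
theta_hat = fit_perceptron(X, y)
assert perceptron_loss(theta_hat, X, y) == 0.0
```

Note that a zero-loss minimizer is not unique, which is exactly why teaching targets the decision boundary {tθ∗} rather than a single parameter vector.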
3.1 Homogeneous Linear Perceptron

In this subsection, we study the problem of teaching a linear perceptron. First, we consider an optimization problem similar to Eq. (1), as shown in Liu & Zhu (2016):

    A_opt := arg min_{θ ∈ R^d} Σᵢ₌₁ⁿ ℓ(θ · xᵢ, yᵢ) + (λ/2) ‖θ‖²_A    (5)

where ℓ(·, ·) is a convex loss function, A is a positive semi-definite matrix, ‖θ‖_A is defined as √(θᵀAθ), and λ > 0. For a convex loss function ℓ(·, ·), Theorem 1 of Liu & Zhu (2016) established a degree-of-freedom lower bound on the number of training items needed to obtain a unique solution θ∗. Since the loss function of the linear perceptron is convex, we immediately obtain a lower bound on the teaching dimension:

Corollary 1. If A = 0 and λ = 1, then Eq. (1) can be solved as Eq. (5). Moreover, the teaching dimension for the decision boundary corresponding to a target model θ∗ is lower-bounded by Ω(d).

Now we establish an upper bound on TD(A_opt, θ∗) for exact teaching of the decision boundary of a target model θ∗. The key idea is to find a set of points which span the orthogonal subspace of θ∗, which we use to force any solution θ̂ ∈ A_opt to have a component only along θ∗. Formally, we state the claim with its proof as follows:

Theorem 1. Given any target model θ∗, for solving Eq. (1) the teaching dimension for the decision boundary corresponding to θ∗ is Θ(d). The following is a teaching set:

    xᵢ = vᵢ, yᵢ = 1  ∀i ∈ [d−1];    x_d = −Σᵢ₌₁^{d−1} vᵢ,  y_d = 1;    x_{d+1} = θ∗,  y_{d+1} = 1

where {vᵢ}ᵢ₌₁^d is an orthogonal basis for R^d with v_d = θ∗.

Proof. Using Corollary 1, the lower bound for solving Eq. (1) is immediate. Thus, if we show that the labeled training points above form a teaching set, we obtain an upper bound, which implies a tight bound of Θ(d) on the teaching dimension for finding the decision boundary. Denote the set of labeled data points by D, and denote p(θ) := Σᵢ₌₁^{d+1} max(−yᵢ · θ · xᵢ, 0). Since {vᵢ}ᵢ₌₁^d is an orthogonal basis, vᵢ · θ∗ = 0 for all i ∈ [d−1], and it is not difficult to show that p(tθ∗) = 0 for some positive scalar t. Note that if θ̂ is a solution to Eq. (1), then

    θ̂ ∈ arg min_{θ ∈ R^d} Σᵢ₌₁^{d+1} max(−yᵢ · θ · xᵢ, 0).

Also, p(θ̂) = 0 ⟹ xᵢ · θ̂ ≥ 0 ∀i ∈ [d]; but then, since x_d = −Σᵢ₌₁^{d−1} xᵢ, it follows that xᵢ · θ̂ = 0 ∀i ∈ [d]. Note that θ̂ · θ∗ ≥ 0 then forces θ̂ = tθ∗ for some positive constant t. Thus, D is a teaching set for the decision boundary of θ∗. This establishes the upper bound, and hence the theorem follows.

Numerical example. To illustrate Theorem 1, we provide a numerical example of teaching a linear perceptron in R³, with θ∗ = (−3, 3, 5)ᵀ (illustrated in Fig. 2a). To construct the teaching set, we first obtain an orthogonal basis {(0.46, 0.86, −0.24)ᵀ, (0.76, −0.24, 0.6)ᵀ} for the subspace orthogonal to θ∗, and add the vector (−1.22, −0.62, −0.36)ᵀ, which is in the exact opposite direction of the first two combined. Finally, we add to TS an arbitrary vector with a positive dot product with the normal vector, e.g., (−0.46, 0.46, 0.76)ᵀ. Labeling all examples positive, we obtain a TS of size 4.

3.2 Homogeneous Polynomial Kernelized Perceptron

In this subsection, we study the problem of teaching a polynomial kernelized perceptron in the realizable setting. Similar to §3.1, we establish an exact teaching bound on the teaching dimension under a mild condition on the data distribution. We consider the homogeneous polynomial kernel K of degree k, for which, for any x, x′ ∈ R^d,

    K(x, x′) = (⟨x, x′⟩)^k.

If Φ(·) denotes the feature map of the corresponding RKHS, then the dimension of the map is \binom{d+k-1}{k}, where each component of the map can be represented as Φ_λ(x) = √(k! / ∏ᵢ₌₁^d λᵢ!) · x^λ, with λ ∈ (N ∪ {0})^d and Σᵢ λᵢ = k. Denote by H_K the RKHS corresponding to the polynomial kernel K. We use H_k := H_k(R^d) to represent the linear space of homogeneous polynomials of degree k over R^d. We mention an important result which shows that the RKHS for polynomial kernels is isomorphic to the space of homogeneous polynomials of degree k in d variables.

Proposition 1 (Chapter III.2, Proposition 6 (Cucker & Smale, 2001)). H_k = H_K as function spaces and inner product spaces.

The dimension dim H_k(R^d) of the linear space of homogeneous polynomials of degree k over R^d is \binom{d+k-1}{k}. Denote r := \binom{d+k-1}{k}. Since H_K is a vector space for the polynomial kernel K, for exact teaching there is an obvious lower bound of Ω(\binom{d+k-1}{k}) on the teaching dimension.
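The construction in Theorem 1 can be checked numerically. Below is an illustrative sketch (not the authors' code) that builds the teaching set of Theorem 1 for a given θ∗ via a QR decomposition, assuming the first coordinate of θ∗ is nonzero, and verifies the two properties the proof relies on: the first d points are orthogonal to θ∗, and the last point has a positive dot product with θ∗.

```python
import numpy as np

def teaching_set_linear(theta_star):
    """Teaching set of Theorem 1: d-1 orthogonal-basis points, their
    negated sum, and theta* itself, all labeled +1."""
    d = theta_star.shape[0]
    # Build an orthonormal basis of R^d whose first vector is along theta*;
    # the remaining columns of Q then span the orthogonal complement.
    # (Assumes theta_star[0] != 0 so M below is nonsingular.)
    M = np.eye(d)
    M[:, 0] = theta_star
    Q, _ = np.linalg.qr(M)
    V = Q[:, 1:].T                       # d-1 vectors orthogonal to theta*
    X = np.vstack([V, -V.sum(axis=0), theta_star])
    y = np.ones(d + 1)
    return X, y

theta_star = np.array([-3.0, 3.0, 5.0])  # the paper's R^3 example
X, y = teaching_set_linear(theta_star)
# First d points are orthogonal to theta*; the last has positive margin.
assert np.allclose(X[:-1] @ theta_star, 0.0)
assert X[-1] @ theta_star > 0
```

With θ∗ = (−3, 3, 5)ᵀ this reproduces a teaching set of size d + 1 = 4, matching the numerical example of §3.1 up to the choice of orthogonal basis.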
Before we establish the main result of this subsection, we state a mild assumption on the target models we consider for exact teaching:

Assumption 3.2.1 (Existence of orthogonal polynomials). For the target model θ∗ ∈ H_K, we assume that there exist (r−1) linearly independent polynomials on the orthogonal subspace of θ∗ in H_K of the form {Φ(zᵢ)}ᵢ₌₁^{r−1}, where zᵢ ∈ X for all i.

Similar to Theorem 1, the key insight behind Assumption 3.2.1 is to find independent polynomials on the orthogonal subspace defined by θ∗. We state the claim here, with the proof established in the supplemental materials.

Theorem 2. For all target models θ∗ ∈ H_K for which Assumption 3.2.1 holds, for solving Eq. (3), the exact teaching dimension for the decision boundary corresponding to θ∗ is O(\binom{d+k-1}{k}).

Numerical example. For constructing TS in the polynomial case, we follow a similar strategy in the higher-dimensional space that the original data is projected into. The only difference is that we need to ensure the teaching examples have pre-images in the original space. For that, we adopt a randomized algorithm that solves for r−1 boundary points in the original space (i.e., solves θ∗ · Φ(x) = 0), while checking that the images of these points are linearly independent. Also, instead of adding a vector in the opposite direction of these points combined, we simply repeat the r−1 points in the teaching set, assigning one copy of them positive labels and the other copy negative labels. Finally, we need one last vector (labeled positive) whose image has a positive component along θ∗, and we obtain a TS of size 2r−1.

Fig. 2b and Fig. 2c demonstrate the above constructive procedure on a numerical example with d = 2, a homogeneous polynomial kernel of degree 2, and θ∗ = (1, 4, 4)ᵀ. In Fig. 2b we show the decision boundary (red lines) and the level sets (polynomial contours) of this quadratic perceptron, as well as the teaching set identified via the above algorithmic procedure. In Fig. 2c, we visualize the decision boundary (grey plane) in the feature space (after applying the feature map). The blue surface corresponds to all the data points that have pre-images in the original space R².

3.3 Limitations in Exact Teaching of Polynomial Kernel Perceptron

In the previous section (§3.2), we imposed Assumption 3.2.1 on the target models θ∗. It turns out that we cannot do better than this. More concretely, we need to impose this assumption for exact teaching of a polynomial kernel perceptron learner. Further, there are pathological cases where violation of the assumption leads to models which cannot be approximately taught.

Intuitively, solving Eq. (3) in the paradigm of exact teaching reduces to nullifying the orthogonal subspace of θ∗, i.e., any component along the subspace is nullified. Since the information of the span of the subspace has to be encoded into the datapoints chosen for teaching, Assumption 3.2.1 is a natural step to make. Interestingly, we show that the step is not so stringent: in the realizable setting, in which all the teaching points are correctly classified, if we lift the assumption then exact teaching is not possible. We state the claim in the following lemma:

Lemma 1. Consider a target model θ∗ that does not satisfy Assumption 3.2.1. Then there does not exist a teaching set TS_{θ∗} which exactly teaches θ∗, i.e., for any TS_{θ∗} and any real t > 0,

    A_opt(TS_{θ∗}) ≠ {tθ∗}.

Lemma 1 shows that for exact teaching, θ∗ should satisfy Assumption 3.2.1. The natural question that arises is whether we can achieve arbitrarily ε-close approximate teaching of θ∗. In other words, we would like to find a θ̃∗ that satisfies Assumption 3.2.1 and is in an ε-neighbourhood of θ∗. We show a negative result for this when k is even. For this, we assume that the datapoints in the teaching set TS_{θ̃∗} have lower-bounded norm, call it δ > 0, i.e., if (xᵢ, yᵢ) ∈ TS_{θ̃∗} then ‖Φ(xᵢ)‖ ≥ δ. We require this additional assumption only for the purpose of analysis, and we show that it does not lead to any pathological cases where the constructed target model incorporates approximate teaching.

Lemma 2. Let X ⊆ R^d and let H_K be the reproducing kernel Hilbert space such that the kernel function K is of degree k. If k has even parity, then there exists a target model θ∗ which violates Assumption 3.2.1 and cannot be taught approximately.

The results are discussed in detail, with proofs, in the supplemental materials. Assumption 3.2.1 and the stated lemmas provide insights into understanding the problem of teaching non-linear perceptron kernels. In the next section, we study the Gaussian kernel, and the ideas generated here will be useful in devising a teaching set in the paradigm of approximate teaching.

3.4 Gaussian Kernelized Perceptron

In this subsection, we consider the Gaussian kernel. Under mild assumptions inspired by the analysis of teaching dimension for exact teaching of linear and polynomial kernel perceptrons, we would establish as
Figure 2: Numerical examples of exact teaching for linear and polynomial perceptrons: (a) Linear (TS); (b) Polynomial (TS); (c) Polynomial (feature space). Cyan plus marks and red dots correspond to positive and negative teaching examples, respectively.
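As a sanity check on the polynomial feature map used above, the sketch below (illustrative; the helper names are ours) enumerates the multi-indices λ with |λ| = k, builds Φ_λ(x) = √(k!/∏λᵢ!)·x^λ, and verifies both the kernel identity ⟨Φ(x), Φ(x′)⟩ = (⟨x, x′⟩)^k and the dimension \binom{d+k-1}{k}.

```python
import math
import numpy as np

def multi_indices(d, k):
    """All lambda in N_0^d with |lambda| = k (stars and bars enumeration)."""
    if d == 1:
        return [(k,)]
    return [(i,) + rest for i in range(k + 1) for rest in multi_indices(d - 1, k - i)]

def poly_feature_map(x, k):
    """Explicit feature map of the homogeneous polynomial kernel (<x,x'>)^k."""
    return np.array([
        math.sqrt(math.factorial(k) / math.prod(math.factorial(l) for l in lam))
        * math.prod(xi ** l for xi, l in zip(x, lam))
        for lam in multi_indices(len(x), k)
    ])

d, k = 2, 2
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
phi_x, phi_xp = poly_feature_map(x, k), poly_feature_map(xp, k)
assert len(phi_x) == math.comb(d + k - 1, k)      # r = (d+k-1 choose k) = 3
assert np.isclose(phi_x @ phi_xp, (x @ xp) ** k)  # <Phi(x), Phi(x')> = (<x,x'>)^k
```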
our main result an upper bound on the ε-approximate teaching dimension of Gaussian kernel perceptrons, using a construction of an ε-approximate teaching set.

Definitions and notations for approximate teaching. For any classifier f ∈ H_K, we define err(f) := E_{(x,y)∼P}[max(−y · f(x), 0)]. Our goal is to find a classifier f whose expected true loss err(f) is as small as possible. In the realizable setting, we assume that there exists an optimal separator f∗ such that for any data instances sampled from the data distribution the labels are consistent, i.e., P(y · f∗(x) ≤ 0) = 0. In addition, we also experiment with the non-realizable setting. In the rest of this subsection, we study the relationship between the teaching complexity for an optimal Gaussian kernel perceptron for Eq. (3) and |err(f∗) − err(f̂)|, where f∗ is the optimal separator and f̂ is the solution in A_opt(TS_{θ∗}) for the constructed teaching set TS_{θ∗}.

Preliminaries of the Gaussian kernel. A Gaussian kernel is a function K of the form

    K(x, x′) = e^{−‖x−x′‖²/(2σ²)}    (6)

for any x, x′ ∈ R^d and parameter σ. First, we try to understand the feature map before we find an approximation to it. Notice:

    e^{−‖x−x′‖²/(2σ²)} = e^{−‖x‖²/(2σ²)} · e^{−‖x′‖²/(2σ²)} · e^{⟨x,x′⟩/σ²}.

Consider the scalar term z = ⟨x, x′⟩/σ². We can expand the last factor of the product using the Taylor expansion of e^z near z = 0, as shown in Cotter et al. (2011), which amounts to e^{⟨x,x′⟩/σ²} = Σ_{k=0}^∞ (1/k!) (⟨x,x′⟩/σ²)^k. We can further expand the previous sum as

    e^{⟨x,x′⟩/σ²} = Σ_{k=0}^∞ 1/(k! σ^{2k}) · ( Σ_{l=1}^d x_l x′_l )^k = Σ_{k=0}^∞ 1/(k! σ^{2k}) · Σ_{|λ|=k} C^k_λ · x^λ · (x′)^λ    (7)

where C^k_λ = k! / ∏ᵢ₌₁^d λᵢ!. Thus, we use Eq. (7) to obtain an explicit feature representation of the Gaussian kernel in Eq. (6) as

    Φ_{k,λ}(x) = e^{−‖x‖²/(2σ²)} · (√(C^k_λ) / (√(k!) σ^k)) · x^λ.

This gives the explicit feature map Φ(·) for the Gaussian kernel, with coordinates as specified. Theorem 1 of Ha Quang (2010) characterizes the RKHS of the Gaussian kernel; it establishes that dim(H_K) = ∞. Thus, we note that exact teaching of an arbitrary target classifier f∗ in this setting has an infinite lower bound. This calls for analysing the teaching problem of a Gaussian kernel in the approximate teaching setting.

3.4.1 Gaussian Kernel Approximation

Now we describe a finite-dimensional polynomial approximation Φ̃ to the Gaussian feature map Φ via projection, as shown in Cotter et al. (2011). Consider

    Φ̃ : R^d → R^q,    K̃(x, x′) = Φ̃(x) · Φ̃(x′).

With these approximations, we consider classifiers of the form f̃(x) = θ̃ · Φ̃(x) such that θ̃ ∈ R^q. Now assume that there is a projection map P such that Φ̃ = PΦ. In Cotter et al. (2011), the authors used the following approximation to the Gaussian kernel:

    K̃(x, x′) = e^{−‖x‖²/(2σ²)} e^{−‖x′‖²/(2σ²)} Σ_{k=0}^s (1/k!) (⟨x,x′⟩/σ²)^k    (8)

This gives the following explicit feature representation for the approximated kernel:

    ∀k ≤ s,    Φ̃_{k,λ}(x) = Φ_{k,λ}(x) = e^{−‖x‖²/(2σ²)} · (√(C^k_λ) / (√(k!) σ^k)) · x^λ    (9)

where Φ_{k,λ}(x) is the corresponding coordinate of the Gaussian feature map. Note that the feature map Φ̃ defined by the
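The truncated expansion in Eq. (8) is straightforward to verify numerically. The sketch below (an illustration of the approximation of Cotter et al. (2011), not code from the paper) compares the degree-s truncation K̃ with the exact Gaussian kernel of Eq. (6) and checks that the approximation error shrinks as s grows.

```python
import math
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Exact Gaussian kernel, Eq. (6)."""
    return math.exp(-float(np.sum((x - xp) ** 2)) / (2 * sigma ** 2))

def truncated_kernel(x, xp, s, sigma=1.0):
    """Approximate kernel of Eq. (8): the factor e^{<x,x'>/sigma^2} is
    replaced by its degree-s Taylor polynomial."""
    z = float(np.dot(x, xp)) / sigma ** 2
    taylor = sum(z ** k / math.factorial(k) for k in range(s + 1))
    scale = math.exp(-(float(np.sum(x ** 2)) + float(np.sum(xp ** 2))) / (2 * sigma ** 2))
    return scale * taylor

x, xp = np.array([0.3, -0.2]), np.array([0.1, 0.4])
exact = gaussian_kernel(x, xp)
errors = [abs(truncated_kernel(x, xp, s) - exact) for s in (0, 1, 2, 4)]
assert all(e1 >= e2 for e1, e2 in zip(errors, errors[1:]))  # error decreases with s
assert errors[-1] < 1e-6
```

The rapid decay of the truncation error with s is what makes a finite-dimensional θ̃ a plausible vehicle for ε-approximate teaching.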