The Teaching Dimension of Kernel Perceptron

Akash Kumar (MPI-SWS), Hanqi Zhang (University of Chicago), Adish Singla (MPI-SWS), Yuxin Chen (University of Chicago)

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

Algorithmic machine teaching has been studied under the linear setting where exact teaching is possible. However, little is known for teaching nonlinear learners. Here, we establish the sample complexity of teaching, aka teaching dimension, for kernelized perceptrons for different families of feature maps. As a warm-up, we show that the teaching complexity is Θ(d) for the exact teaching of linear perceptrons in R^d, and Θ(d^k) for kernel perceptron with a polynomial kernel of order k. Furthermore, under certain smoothness assumptions on the data distribution, we establish a rigorous bound on the complexity for approximately teaching a Gaussian kernel perceptron. We provide numerical examples of the optimal (approximate) teaching set under several canonical settings for linear, polynomial and Gaussian kernel perceptrons.

1 Introduction

Machine teaching studies the problem of finding an optimal training sequence to steer a learner towards a target concept (Zhu et al., 2018). An important learning-theoretic complexity measure of machine teaching is the teaching dimension (Goldman & Kearns, 1995), which specifies the minimal number of training examples required in the worst case to teach a target concept. Over the past few decades, the notion of teaching dimension has been investigated under a variety of learner's models and teaching protocols (e.g., Cakmak & Lopes (2012); Singla et al. (2013; 2014); Liu et al. (2017); Haug et al. (2018); Tschiatschek et al. (2019); Liu et al. (2018); Kamalaruban et al. (2019); Hunziker et al. (2019); Devidze et al. (2020); Rakhsha et al. (2020)). One of the most studied scenarios is the case of teaching a version-space learner (Goldman & Kearns, 1995; Anthony et al., 1995; Zilles et al., 2008; Doliwa et al., 2014; Chen et al., 2018; Mansouri et al., 2019; Kirkpatrick et al., 2019). Upon receiving a sequence of training examples from the teacher, a version-space learner maintains a set of hypotheses that are consistent with the training examples, and outputs a random hypothesis from this set.

As a canonical example, consider teaching a 1-dimensional binary threshold function f_{θ*}(x) = 1{x − θ* ≥ 0} for x ∈ [0, 1]. For a learner with a finite (or countably infinite) version space, e.g., θ ∈ {i/n}_{i=0,...,n} where n ∈ Z_+ (see Fig. 1a), a smallest training set is {(i/n, 0), ((i+1)/n, 1)} where i/n ≤ θ* < (i+1)/n; thus the teaching dimension is 2. However, when the version space is continuous, the teaching dimension becomes ∞, because it is no longer possible for the learner to pick out a unique threshold θ* with a finite training set. This is due to two key (limiting) modeling assumptions of the version-space learner: (1) all (consistent) hypotheses in the version space are treated equally, and (2) there exists a hypothesis in the version space that is consistent with all training examples. As one can see, these assumptions fail to capture the behavior of many modern learning algorithms, where the best hypotheses are often selected via optimizing certain loss functions, and the data is not perfectly separable (i.e., not realizable w.r.t. the hypothesis/model class).

To lift these modeling assumptions, a more realistic teaching scenario is to consider the learner as an empirical risk minimizer (ERM). In fact, under the realizable setting, the version-space learner could be viewed as an ERM that optimizes the 0-1 loss, namely one that finds all hypotheses with zero training error. Recently, Liu & Zhu (2016) studied the teaching dimension of linear ERM learners, and established the teaching dimension for several classes of linear (regularized) ERM learners, including the support vector machine (SVM) and ridge regression. As illustrated in Fig. 1b, for the previous example it suffices to use {(θ* − ε, 0), (θ* + ε, 1)} with any ε ≤ min(1 − θ*, θ*) as the training set to teach θ* as an optimizer of the SVM objective (i.e., the ℓ2-regularized hinge loss); hence the teaching dimension is 2.



Figure 1: Teaching a 1D threshold function to an ERM learner. Training instances are marked in grey. (a) Version-space learner (0/1 loss) with a finite hypothesis set. (b) SVM (hinge loss) and training set {(θ* − ε, 0), (θ* + ε, 1)}. (c) ERM learner with perceptron loss and training set {(θ*, 1), (θ*, −1)}.

Table 1: Teaching dimension for kernel perceptron

                        linear      polynomial                Gaussian
  TD (exact)            Θ(d)        Θ(\binom{d+k-1}{k})       ∞
  ε-approximate TD      -           -                         d^{O(log²(1/ε))}
  Assumption            -           3.2.1                     3.4.1, 3.4.2

In Fig. 1c, we consider teaching an ERM learner with the perceptron loss, i.e., ℓ(f_θ(x), y) = max(−y·(x − θ), 0) (where y ∈ {−1, 1}). If the teacher is allowed to construct any training example with any labeling¹, then it is easy to verify that the minimal training set is {(θ*, 1), (θ*, −1)}.

¹ If the teacher is restricted to only provide consistent labels (i.e., the realizable setting), then the ERM with perceptron loss reduces to the version-space learner, where the teaching dimension is ∞.

While these results show promise at understanding optimal teaching for ERM learners, existing work (Liu & Zhu, 2016) has focused exclusively on the linear setting with the goal of teaching the exact hypothesis (e.g., teaching the exact model parameters or the exact decision boundary for classification tasks). Aligned with these results, we establish an upper bound as shown in §3.1. It remains a fundamental challenge to rigorously characterize the teaching complexity for nonlinear learners. Furthermore, in the cases where exact teaching is not possible with a finite training set, the classical teaching dimension no longer captures the fine-grained complexity of the teaching tasks, and hence one needs to relax the teaching goals and investigate new notions of teaching complexity.

In this paper, we aim to address the above challenges. We focus on the kernel perceptron, a specific type of ERM learner that is less understood even under the linear setting. Following the convention in teaching ERM learners, we consider the constructive setting, where the teacher can construct arbitrary teaching examples in the support of the data distribution. Our contributions are highlighted below, with the main theoretical results summarized in Table 1.

• We formally define approximate teaching of kernel perceptrons, and propose a novel measure of teaching complexity, namely the ε-approximate teaching dimension (ε-TD), which captures the complexity of teaching a "relaxed" target that is close to the target hypothesis in terms of the expected risk. Our relaxed notion of teaching dimension strictly generalizes the teaching dimension of Liu & Zhu (2016); it trades off the teaching complexity against the risk of the taught hypothesis, and hence is more practical in characterizing the complexity of a teaching task (§2).

• We show that exact teaching is feasible for kernel perceptrons with finite-dimensional feature maps, such as the linear kernel and the polynomial kernel. Specifically, for data points in R^d, we establish a Θ(d) bound on the teaching dimension of the linear perceptron. Under a mild condition on the data distribution, we provide a tight bound of Θ(\binom{d+k-1}{k}) for the polynomial perceptron of order k. We also exhibit optimal training sets that match these teaching dimensions (§3.1 and §3.2).

• We further show that for the Gaussian kernelized perceptron, exact teaching is not possible with a finite teaching set, and we then establish a d^{O(log²(1/ε))} bound on the ε-approximate teaching dimension (§3.4). To the best of our knowledge, these results constitute the first known bounds on (approximately) teaching a non-linear ERM learner (§3).

2 Problem Statement

Basic definitions. We denote by X the input space and Y := {−1, 1} the output space. A hypothesis is a function h : X → Y. In this paper, we identify a hypothesis h_θ with its model parameter θ. The hypothesis space H is a set of hypotheses. By a training point we mean a pair (x, y) ∈ X × Y. We assume that the training points are drawn from an unknown distribution P over X × Y. A training set is a multiset D = {(x_1, y_1), ..., (x_n, y_n)} where repeated pairs are allowed. Let 𝔻 denote the set of all training sets of all sizes. A learning algorithm A : 𝔻 → 2^H takes in a training set D ∈ 𝔻 and outputs a subset of the hypothesis space H. That is, A doesn't necessarily return a unique hypothesis.

Kernel perceptron. Consider a set of training points D := {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d and hypothesis θ ∈ R^d.

A linear perceptron is defined as f_θ(x) := sign(θ·x) in the homogeneous setting. We consider the algorithm A_opt that learns an optimal perceptron as defined below:

    A_opt(D) := argmin_{θ ∈ R^d} Σ_{i=1}^n ℓ(f_θ(x_i), y_i)    (1)

where the loss function is ℓ(f_θ(x), y) := max(−y·f_θ(x), 0). Similarly, we consider the non-linear setting via kernel-based hypotheses for perceptrons, defined with respect to a kernel operator K : X × X → R which adheres to Mercer's positive definite conditions (Vapnik, 1998). A kernel-based hypothesis has the form

    f(x) = Σ_{i=1}^k α_i · K(x_i, x)    (2)

where x_i ∈ X for all i and the α_i are reals. In order to simplify the derivation of the algorithms and their analysis, we associate a reproducing kernel Hilbert space (RKHS) with K in the standard way common to all kernel methods. Formally, let H_K be the closure of the set of all hypotheses of the form given in Eq. (2). A non-linear kernel perceptron corresponding to K optimizes Eq. (1) as follows:

    A_opt(D) := argmin_{θ ∈ H_K} Σ_{i=1}^n ℓ(f_θ(x_i), y_i)    (3)

where f_θ(·) = Σ_{i=1}^l α_i · K(a_i, ·) for some {a_i}_{i=1}^l ⊂ X and real α_i. Alternatively, we also write f_θ(·) = θ·Φ(·), where Φ : X → H_K is the feature map of the kernel function K. The reproducing kernel Hilbert space of K can be decomposed as K(x, x′) = ⟨Φ(x), Φ(x′)⟩_K (Scholkopf & Smola, 2001) for any x, x′ ∈ X. Thus, we also identify f_θ with Σ_{i=1}^l α_i · Φ(a_i).

The teaching problem. We are interested in the problem of teaching a target hypothesis θ*, where a helpful teacher provides labelled data points TS ⊆ X × Y, also referred to as a teaching set. Assuming the constructive setting (Liu & Zhu, 2016), to teach a kernel perceptron learner the teacher can construct a training set with any items in R^d, i.e., for any (x′, y′) ∈ TS we have x′ ∈ R^d and y′ ∈ {−1, 1}. Importantly, for the purpose of teaching we do not assume that TS is drawn i.i.d. from a distribution. We define the teaching dimension for the exact parameter of θ* corresponding to a kernel perceptron as TD(θ*, A_opt), which is the size of the smallest teaching set TS such that A_opt(TS) = {θ*}. We call teaching the exact parameters of a target hypothesis θ* exact teaching. Since a perceptron is agnostic to norms, we also study the problem of teaching a target decision boundary, where A_opt(TS) = {tθ*} for some real t > 0. Thus,

    TD({tθ*}_{t > 0}, A_opt) = min_{p > 0} TD(pθ*, A_opt).

Since it can be stringent to construct a teaching set for the decision boundary (see §3.4), exact teaching is not always feasible. We introduce and study approximate teaching, which is formally defined as follows:

Definition 1 (ε-approximate teaching set). Consider a kernel perceptron learner with a kernel K : X × X → R and the corresponding RKHS feature map Φ(·). For a target model θ* ∈ H_K and ε > 0, we say TS ⊆ X × Y is an ε-approximate teaching set wrt P if the kernel perceptron θ̂ ∈ A_opt(TS) satisfies

    | E[max(−y·f*(x), 0)] − E[max(−y·f̂(x), 0)] | ≤ ε    (4)

where the expectations are over (x, y) ∼ P, f*(x) = θ*·Φ(x) and f̂(x) = θ̂·Φ(x).

Naturally, we define the approximate teaching dimension as:

Definition 2 (ε-approximate teaching dimension). Consider a kernel perceptron learner with a kernel K : X × X → R and the corresponding RKHS feature map Φ(·). For a target model θ* ∈ H_K and ε > 0, we define ε-TD(θ*, A_opt) as the size of the smallest teaching set for ε-approximate teaching of θ* wrt P.

According to Definition 2, exact teaching corresponds to constructing a 0-approximate teaching set for a target classifier (e.g., the decision boundary of a kernel perceptron). We study linear and polynomial kernelized perceptrons in the exact teaching setting. Under some mild assumptions on the smoothness of the data distribution, we establish a bound on the approximate teaching dimension for the Gaussian kernelized perceptron.
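To make the objects above concrete, the following sketch (not from the paper; the subgradient solver, step sizes, and Monte Carlo evaluation are illustrative choices) instantiates the kernel perceptron objective of Eq. (1)/(3) and the risk-gap criterion of Definition 1 in plain NumPy.

```python
# Minimal sketch of a kernel perceptron ERM learner and of the epsilon-approximate
# teaching criterion (Definition 1). A_opt is defined only through the argmin in
# Eq. (3); the subgradient-descent routine below is just one way to pick an element.
import numpy as np

def perceptron_loss(margins):
    # l(f(x), y) = max(-y * f(x), 0), written in terms of the margins y * f(x)
    return np.maximum(-margins, 0.0)

def fit_kernel_perceptron(K, y, steps=2000, lr=0.1):
    """Minimize sum_i max(-y_i f(x_i), 0) over f = sum_j alpha_j K(x_j, .)
    by subgradient descent on the coefficients alpha (K is the Gram matrix)."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(steps):
        f = K @ alpha                      # f(x_i) on the teaching points
        viol = (y * f) <= 0                # points contributing a subgradient
        if viol.any():
            grad = -(K[:, viol] * y[viol]).sum(axis=1)
            alpha -= lr * grad / n
    return alpha

def risk_gap(f_star, f_hat, X, y):
    """Monte Carlo estimate of |E[max(-y f*(x),0)] - E[max(-y f_hat(x),0)]|, Eq. (4)."""
    return abs(perceptron_loss(y * f_star(X)).mean()
               - perceptron_loss(y * f_hat(X)).mean())
```

Note that the argmin in Eq. (3) is a set, so A_opt need not return a unique hypothesis; the teaching sets studied below are designed precisely to pin down the direction of θ* within that set.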
3 Teaching Dimension for Kernel Perceptron

In this section, we study the generic problem of teaching kernel perceptrons in three different settings: 1) linear (in §3.1); 2) polynomial (in §3.2); and 3) Gaussian (in §3.4). Before establishing our main result for Gaussian kernelized perceptrons, we first introduce two important results for linear and polynomial perceptrons that are inherently connected to the Gaussian perceptron. Our proofs are inspired by ideas from linear algebra and projective geometry, as detailed in the supplemental materials.

3.1 Homogeneous Linear Perceptron

In this subsection, we study the problem of teaching a linear perceptron. First, we consider an optimization problem similar to Eq. (1), as shown in Liu & Zhu (2016):

    A_opt := argmin_{θ ∈ R^d} Σ_{i=1}^n ℓ(θ·x_i, y_i) + (λ/2)·||θ||_A²    (5)

where ℓ(·,·) is a convex loss function, A is a positive semi-definite matrix, ||θ||_A is defined as √(θᵀAθ), and λ > 0. For a convex loss function ℓ(·,·), Theorem 1 of Liu & Zhu (2016) established a degree-of-freedom lower bound on the number of training items needed to obtain a unique solution θ*. Since the loss function for the linear perceptron is convex, we immediately obtain a lower bound on the teaching dimension as follows:

Corollary 1. If A = 0 and λ = 1, then Eq. (1) can be solved as Eq. (5). Moreover, the teaching dimension for the decision boundary corresponding to a target model θ* is lower-bounded by Ω(d).

Now we establish an upper bound on TD(A_opt, θ*) for exact teaching of the decision boundary of a target model θ*. The key idea is to find a set of points which span the orthogonal subspace of θ*, which we use to force a solution θ̂ ∈ A_opt to have a component only along θ*. Formally, we state the claim of the result with proof as follows:

Theorem 1. Given any target model θ*, for solving Eq. (1) the teaching dimension for the decision boundary corresponding to θ* is Θ(d). The following is a teaching set:

    x_i = v_i, y_i = 1  for all i ∈ [d−1];
    x_d = −Σ_{i=1}^{d−1} v_i, y_d = 1;    x_{d+1} = θ*, y_{d+1} = 1

where {v_i}_{i=1}^d is an orthogonal basis for R^d with v_d = θ*.

Proof. Using Corollary 1, the lower bound for solving Eq. (1) is immediate. Thus, if we show that the labeled set of training points above forms a teaching set, we obtain an upper bound, which implies a tight bound of Θ(d) on the teaching dimension for finding the decision boundary. Denote the set of labeled data points by D, and let p(θ) := Σ_{i=1}^{d+1} max(−y_i·θ·x_i, 0). Since {v_i}_{i=1}^d is an orthogonal basis, we have v_i·θ* = 0 for all i ∈ [d−1], and it is not difficult to show that p(tθ*) = 0 for some positive scalar t. Note that if θ̂ is a solution to Eq. (1), then

    θ̂ ∈ argmin_{θ ∈ R^d} Σ_{i=1}^{d+1} max(−y_i·θ·x_i, 0).

Also, p(θ̂) = 0 implies x_i·θ̂ ≥ 0 for all i ∈ [d]; but then x_d = −Σ_{i=1}^{d−1} x_i implies x_i·θ̂ = 0 for all i ∈ [d]. Note that θ̂·θ* ≥ 0 forces θ̂ = tθ* for some positive constant t. Thus, D is a teaching set for the decision boundary of θ*. This establishes the upper bound, and hence the theorem follows.

Numerical example. To illustrate Theorem 1, we provide a numerical example for teaching a linear perceptron in R³, with θ* = (−3, 3, 5)ᵀ (illustrated in Fig. 2a). To construct the teaching set, we first obtain an orthogonal basis {(0.46, 0.86, −0.24)ᵀ, (0.76, −0.24, 0.6)ᵀ} for the subspace orthogonal to θ*, and add a vector (−1.22, −0.62, −0.36)ᵀ which is in the exact opposite direction of the first two combined. Finally, we add to TS an arbitrary vector which has a positive dot product with the normal vector, e.g., (−0.46, 0.46, 0.76)ᵀ. Labeling all examples positive, we obtain a TS of size 4.
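The construction of Theorem 1 is easy to reproduce numerically. The sketch below (illustrative; the QR-based choice of the orthogonal basis is an assumption, any orthogonal basis extending θ* works) builds the (d+1)-point teaching set for the R³ example above and sanity-checks the zero-loss property used in the proof.

```python
# Sketch of the Theorem 1 teaching set for theta* = (-3, 3, 5): an orthogonal basis
# of the subspace orthogonal to theta*, their negated sum, and one point along
# theta*, all labeled +1. (Assumed construction for illustration only.)
import numpy as np

theta_star = np.array([-3.0, 3.0, 5.0])
d = theta_star.size

# Complete theta* to a basis and orthogonalize it (QR performs the Gram-Schmidt);
# columns 1..d-1 of Q then span the subspace orthogonal to theta*.
M = np.column_stack([theta_star, np.eye(d)[:, :d - 1]])
Q, _ = np.linalg.qr(M)
v = [Q[:, i] for i in range(1, d)]

X = np.array(v + [-sum(v), theta_star])      # x_1, ..., x_{d-1}, x_d, x_{d+1}
y = np.ones(len(X))                          # every teaching label is +1

# Sanity check: a positive multiple of theta* attains zero total perceptron loss ...
total_loss = lambda w: np.maximum(-(y * (X @ w)), 0.0).sum()
print(total_loss(2.5 * theta_star))          # ~ 0.0
# ... while any parameter with a component in the orthogonal subspace does not.
print(total_loss(theta_star + 0.3 * v[0]) > 0)   # True
```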
3.2 Homogeneous Polynomial Kernelized Perceptron

In this subsection, we study the problem of teaching a polynomial kernelized perceptron in the realizable setting. Similar to §3.1, we establish an exact teaching bound on the teaching dimension under a mild condition on the data distribution. We consider the homogeneous polynomial kernel K of degree k, in which for any x, x′ ∈ R^d

    K(x, x′) = (⟨x, x′⟩)^k.

If Φ(·) denotes the feature map for the corresponding RKHS, then the dimension of the map is \binom{d+k-1}{k}, where each component of the map can be represented by Φ_λ(x) = √(k!/∏_{i=1}^d λ_i!) · x^λ with λ ∈ (N ∪ {0})^d and Σ_i λ_i = k. Denote by H_K the RKHS corresponding to the polynomial kernel K. We use H_k := H_k(R^d) to represent the linear space of homogeneous polynomials of degree k over R^d. We mention an important result which shows that the RKHS for polynomial kernels is isomorphic to the space of homogeneous polynomials of degree k in d variables.

Proposition 1 (Chapter III.2, Proposition 6 (Cucker & Smale, 2001)). H_k = H_K as function spaces and inner product spaces.

The dimension dim(H_k(R^d)) of the linear space of homogeneous polynomials of degree k over R^d is \binom{d+k-1}{k}. Denote r := \binom{d+k-1}{k}. Since H_K is a vector space for the polynomial kernel K, for exact teaching there is an obvious lower bound of Ω(\binom{d+k-1}{k}) on the teaching dimension.
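The explicit feature map Φ_λ can be checked directly. The following sketch (not from the paper; the dimensions and random inputs are illustrative) enumerates the multi-indices |λ| = k, builds Φ(x), and verifies both the kernel identity ⟨Φ(x), Φ(x′)⟩ = (⟨x, x′⟩)^k and the dimension \binom{d+k-1}{k}.

```python
# Sketch of the explicit feature map of the homogeneous polynomial kernel:
# Phi_lambda(x) = sqrt(k! / prod_i lambda_i!) * x^lambda over multi-indices |lambda| = k.
from itertools import combinations_with_replacement
from math import comb, factorial
import numpy as np

def multi_indices(d, k):
    # All lambda in (N u {0})^d with sum_i lambda_i = k.
    for combo in combinations_with_replacement(range(d), k):
        lam = [0] * d
        for i in combo:
            lam[i] += 1
        yield tuple(lam)

def poly_feature_map(x, k):
    d = len(x)
    feats = []
    for lam in multi_indices(d, k):
        coef = np.sqrt(factorial(k) / np.prod([factorial(l) for l in lam]))
        feats.append(coef * np.prod([x[i] ** lam[i] for i in range(d)]))
    return np.array(feats)

d, k = 3, 2
rng = np.random.default_rng(0)
x, xp = rng.normal(size=d), rng.normal(size=d)
Phi_x, Phi_xp = poly_feature_map(x, k), poly_feature_map(xp, k)
print(len(Phi_x) == comb(d + k - 1, k))            # True: r = binom(d+k-1, k)
print(np.isclose(Phi_x @ Phi_xp, (x @ xp) ** k))   # True: <Phi(x), Phi(x')> = (<x, x'>)^k
```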

Before we establish the main result of this subsection, we state a mild assumption on the target models we consider for exact teaching:

Assumption 3.2.1 (Existence of orthogonal polynomials). For the target model θ* ∈ H_K, we assume that there exist (r−1) linearly independent polynomials on the orthogonal subspace of θ* in H_K of the form {Φ(z_i)}_{i=1}^{r−1}, where z_i ∈ X for all i.

Similar to Theorem 1, the key insight behind Assumption 3.2.1 is to find independent polynomials on the orthogonal subspace defined by θ*. We state the claim here, with the proof established in the supplemental materials.

Theorem 2. For all target models θ* ∈ H_K for which Assumption 3.2.1 holds, for solving Eq. (3), the exact teaching dimension for the decision boundary corresponding to θ* is Θ(\binom{d+k-1}{k}).

Numerical example. For constructing TS in the polynomial case, we follow a similar strategy in the higher-dimensional space into which the original data is projected. The only difference is that we need to ensure the teaching examples have pre-images in the original space. For that, we adopt a randomized algorithm that solves for r−1 boundary points in the original space (i.e., solves θ*·Φ(x) = 0), while checking that the images of these points are linearly independent. Also, instead of adding a vector in the opposite direction of these points combined, we simply repeat the r−1 points in the teaching set, assigning one copy of them positive labels and the other copy negative labels. Finally, we need one last vector (labeled positive) whose image has a positive component along θ*, and we obtain a TS of size 2r−1.

Fig. 2b and Fig. 2c demonstrate the above constructive procedure on a numerical example with d = 2, a homogeneous polynomial kernel of degree 2, and θ* = (1, 4, 4)ᵀ. In Fig. 2b we show the decision boundary (red lines) and the level sets (polynomial contours) of this quadratic perceptron, as well as the teaching set identified via the above algorithmic procedure. In Fig. 2c, we visualize the decision boundary (grey plane) in the feature space (after applying the feature map). The blue surface corresponds to all the data points that have pre-images in the original space R².
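A minimal sketch of the randomized construction described above, under stated assumptions: it reuses poly_feature_map and multi_indices from the previous sketch, takes θ* expressed in those feature coordinates, and finds boundary directions by bisection along random one-dimensional slices; the paper's actual procedure may differ in how candidate points are sampled.

```python
# Hedged sketch of the polynomial teaching-set construction: r-1 boundary points
# z_i with linearly independent images Phi(z_i), each taught twice with labels
# +1 and -1, plus one positively labeled point a with theta* . Phi(a) > 0.
import numpy as np
from math import comb

d, k = 2, 2
r = comb(d + k - 1, k)                        # dimension of the feature space (r = 3)
theta_star = np.array([1.0, 4.0, 4.0])        # target model in Phi_lambda coordinates (assumed)

def g(t):
    # value of theta* . Phi(x) along the unit circle x = (cos t, sin t)
    return theta_star @ poly_feature_map(np.array([np.cos(t), np.sin(t)]), k)

rng = np.random.default_rng(1)
boundary, images = [], []
while len(boundary) < r - 1:
    lo, hi = np.sort(rng.uniform(0, np.pi, size=2))       # random 1-D slice
    if g(lo) * g(hi) < 0:                                 # sign change => boundary point
        for _ in range(60):                               # bisection for the root
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(lo) * g(mid) > 0 else (lo, mid)
        z = np.array([np.cos(mid), np.sin(mid)])
        cand = images + [poly_feature_map(z, k)]
        if np.linalg.matrix_rank(np.array(cand)) == len(cand):   # keep only independent images
            boundary.append(z)
            images = cand

a = next(x for x in rng.normal(size=(100, d))
         if theta_star @ poly_feature_map(x, k) > 0)      # one point with positive margin
TS = [(z, +1) for z in boundary] + [(z, -1) for z in boundary] + [(a, +1)]
print(len(TS) == 2 * r - 1)                               # True: |TS| = 2r - 1
```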

3.3 Limitations in Exact Teaching of Polynomial Kernel Perceptron

In the previous section §3.2, we imposed Assumption 3.2.1 on the target models θ*. It turns out that we cannot do better than this. More concretely, we need to impose this assumption for exact teaching of a polynomial kernel perceptron learner. Further, there are pathological cases where violation of the assumption leads to models which cannot be approximately taught.

Intuitively, solving Eq. (3) in the paradigm of exact teaching reduces to nullifying the orthogonal subspace of θ*, i.e., forcing any component along that subspace to vanish. Since the information about the span of the subspace has to be encoded in the datapoints chosen for teaching, Assumption 3.2.1 is a natural step to make. Interestingly, we show that this step is not so stringent. In the realizable setting, in which all the teaching points are correctly classified, if we lift the assumption then exact teaching is not possible. We state the claim in the following lemma:

Lemma 1. Consider a target model θ* that doesn't satisfy Assumption 3.2.1. Then there doesn't exist a teaching set TS_θ* which exactly teaches θ*, i.e., for any TS_θ* and any real t > 0,

    A_opt(TS_θ*) ≠ {tθ*}.

Lemma 1 shows that for exact teaching, θ* should satisfy Assumption 3.2.1. The natural question that arises is whether we can achieve arbitrarily ε-close approximate teaching of θ*. In other words, we would like to find θ̃* that satisfies Assumption 3.2.1 and is in an ε-neighbourhood of θ*. We show a negative result for this when k is even. For this we assume that the datapoints in the teaching set TS_θ̃* have lower-bounded norm, call it δ > 0, i.e., if (x_i, y_i) ∈ TS_θ̃* then ||Φ(x_i)|| ≥ δ. We require this additional assumption only for the purpose of analysis. We show that it does not lead to any pathological cases where the constructed target model θ* admits approximate teaching.

Lemma 2. Let X ⊆ R^d and let H_K be the reproducing kernel Hilbert space such that the kernel function K is of degree k. If k has even parity, then there exists a target model θ* which violates Assumption 3.2.1 and can't be taught approximately.

The results are discussed in detail, with proofs, in the supplemental materials. Assumption 3.2.1 and the stated lemmas provide insights into understanding the problem of teaching for non-linear perceptron kernels. In the next section, we study the Gaussian kernel; the ideas generated here will be useful in devising a teaching set in the paradigm of approximate teaching.

3.4 Gaussian Kernelized Perceptron

In this subsection, we consider the Gaussian kernel. Under mild assumptions inspired by the analysis of the teaching dimension for exact teaching of linear and polynomial kernel perceptrons, we establish as our main result an upper bound on the ε-approximate teaching dimension of Gaussian kernel perceptrons, using a construction of an ε-approximate teaching set.

[Figure 2 panels: (a) Linear (TS); (b) Polynomial (TS); (c) Polynomial (feature space)]

Figure 2: Numerical examples of exact teaching for linear and polynomial perceptrons. Cyan plus marks and red dots correspond to positive and negative teaching examples respectively.

Preliminaries of the Gaussian kernel. A Gaussian kernel K is a function of the form

    K(x, x′) = e^{−||x − x′||²/(2σ²)}    (6)

for any x, x′ ∈ R^d and parameter σ. First, we try to understand the feature map before we find an approximation to it. Notice that

    e^{−||x − x′||²/(2σ²)} = e^{−||x||²/(2σ²)} · e^{−||x′||²/(2σ²)} · e^{⟨x,x′⟩/σ²}.

Consider the scalar term z = ⟨x, x′⟩/σ². We can expand this factor using the Taylor expansion of e^z near z = 0, as shown in Cotter et al. (2011), which amounts to e^{⟨x,x′⟩/σ²} = Σ_{k=0}^∞ (1/k!)·(⟨x,x′⟩/σ²)^k. We can further expand the previous sum as

    e^{⟨x,x′⟩/σ²} = Σ_{k=0}^∞ (1/k!)·(⟨x,x′⟩/σ²)^k
                 = Σ_{k=0}^∞ 1/(k!·σ^{2k}) · (Σ_{l=1}^d x_l x′_l)^k
                 = Σ_{k=0}^∞ 1/(k!·σ^{2k}) · Σ_{|λ|=k} C^k_λ · x^λ · (x′)^λ    (7)

where C^k_λ = k!/∏_{i=1}^d λ_i!. Thus, we use Eq. (7) to obtain an explicit feature representation for the Gaussian kernel in Eq. (6) as

    Φ_{k,λ}(x) = e^{−||x||²/(2σ²)} · (√(C^k_λ)/(√(k!)·σ^k)) · x^λ.

This gives the explicit feature map Φ(·) for the Gaussian kernel with the coordinates specified above. Theorem 1 of Ha Quang (2010) characterizes the RKHS of the Gaussian kernel; it establishes that dim(H_K) = ∞. Thus, exact teaching of an arbitrary target classifier f* in this setting has an infinite lower bound. This calls for analysing the teaching problem of a Gaussian kernel in the approximate teaching setting.

Definitions and notations for approximate teaching. For any classifier f ∈ H_K, we define err(f) := E_{(x,y)∼P}[max(−y·f(x), 0)]. Our goal is to find a classifier f with the property that its expected true loss err(f) is as small as possible. In the realizable setting, we assume that there exists an optimal separator f* such that for any data instance sampled from the data distribution the labels are consistent, i.e., P(y·f*(x) ≤ 0) = 0. In addition, we also experiment with the non-realizable setting. In the rest of this subsection, we study the relationship between the teaching complexity for an optimal Gaussian kernel perceptron for Eq. (3) and |err(f*) − err(f̂)|, where f* is the optimal separator and f̂ is the solution to A_opt(TS_θ*) for the constructed teaching set TS_θ*.

3.4.1 Gaussian Kernel Approximation

We now consider a finite-dimensional polynomial approximation Φ̃ to the Gaussian feature map Φ via projection, as shown in Cotter et al. (2011). Consider

    Φ̃ : R^d → R^q,    K̃(x, x′) = Φ̃(x) · Φ̃(x′).

With these approximations, we consider classifiers of the form f̃(x) = θ̃·Φ̃(x) such that θ̃ ∈ R^q. Now assume that there is a projection map P such that Φ̃ = PΦ. In Cotter et al. (2011), the authors used the following approximation to the Gaussian kernel:

    K̃(x, x′) = e^{−||x||²/(2σ²)} · e^{−||x′||²/(2σ²)} · Σ_{k=0}^s (1/k!)·(⟨x,x′⟩/σ²)^k    (8)

This gives the following explicit feature representation for the approximated kernel:

    for all k ≤ s,   Φ̃_{k,λ}(x) = Φ_{k,λ}(x) = e^{−||x||²/(2σ²)} · (√(C^k_λ)/(√(k!)·σ^k)) · x^λ    (9)

where Φ_{k,λ}(x) is the corresponding coordinate of the Gaussian feature map.
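The truncation in Eq. (8) is easy to probe numerically. The sketch below (not the authors' code; the sample points and σ are illustrative) compares the truncated kernel against the exact Gaussian kernel of Eq. (6) and checks the Taylor-remainder bound (1/(s+1)!)·(||x||·||x′||/σ²)^{s+1} stated as Lemma 3 below.

```python
# Sketch: exact Gaussian kernel vs. the s-term truncated kernel of Eq. (8), with an
# empirical check of the Taylor-remainder error bound.
import numpy as np
from math import factorial

def gaussian_k(x, xp, sigma):
    # exact Gaussian kernel of Eq. (6)
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def truncated_k(x, xp, sigma, s):
    # Eq. (8): keep only the first s+1 Taylor terms of e^{<x, x'>/sigma^2}
    pref = np.exp(-(x @ x) / (2 * sigma ** 2)) * np.exp(-(xp @ xp) / (2 * sigma ** 2))
    return pref * sum((x @ xp / sigma ** 2) ** k / factorial(k) for k in range(s + 1))

rng = np.random.default_rng(2)
x, xp, sigma = rng.normal(size=3), rng.normal(size=3), 1.5
for s in (1, 3, 5, 7):
    gap = abs(gaussian_k(x, xp, sigma) - truncated_k(x, xp, sigma, s))
    bound = (np.linalg.norm(x) * np.linalg.norm(xp) / sigma ** 2) ** (s + 1) / factorial(s + 1)
    print(s, gap <= bound)    # the empirical gap respects the remainder bound
```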

Note that the feature map Φ̃ defined by the explicit features in Eq. (9) has dimension \binom{d+s}{d}. Thus, PΦ = Φ̃, where the first \binom{d+s}{d} coordinates are retained. We denote the RKHS corresponding to K̃ by H_K̃. A simple property of the approximated kernel map is stated in the following lemma, which was proven in Cotter et al. (2011).

Lemma 3 (Cotter et al. (2011)). For the approximated map K̃, we obtain the following upper bound:

    |K(x, x′) − K̃(x, x′)| ≤ (1/(s+1)!) · (||x||·||x′||/σ²)^{s+1}    (10)

Note that if s is chosen large enough and the points x, x′ are bounded wrt σ², then the RHS of Eq. (10) can be bounded by any ε > 0. Since K(x, x) − K̃(x, x) = ||P^⊥Φ(x)||², for a Gaussian kernel, information-theoretically, the first \binom{d+s}{s} coordinates are the highly sensitive ones. We analyze this observation under some mild assumptions on the data distribution to construct an ε-approximate teaching set. As discussed in the supplemental materials, we find the value of s assuming the datapoints come from a ball of radius R := max{log²(1/ε)/(e²σ²), d} in R^d, i.e., ||x||² ≤ R. Thus, we wish to solve for the value of s such that (1/(s+1)!)·R^{s+1} ≤ ε.

To approximate s we use Stirling's approximation, which states that for all positive integers n, we have √(2π)·n^{n+1/2}·e^{−n} ≤ n! ≤ e·n^{n+1/2}·e^{−n}. Using the bound stated in Lemma 3, we fix the value of s as e²·R. We assume that R = log²(1/ε)/e², since we wish to achieve an arbitrarily small ε-approximate teaching set². We define r := r(θ*, ε) = \binom{d+s}{s}.

² When R = d, all the key results follow the same analysis.

3.4.2 Bounding the Error

In this subsection, we discuss our key results on approximate teaching of a Gaussian kernel perceptron learner under some mild assumptions on the target model θ*. In order to show |err(f*) − err(f̂)| ≤ ε for a solution θ̂ to Eq. (3), we establish a pointwise ε-closeness between f* and f̂. Specifically, we show that |f*(x) − f̂(x)| ≤ ε universally, which is similar in spirit to universal approximation theorems (Liang & Srikant, 2017; Lu & Lu, 2020; Yarotsky, 2017) for neural networks. We prove that this universal approximation can be achieved with a teaching set of size d^{O(log²(1/ε))}.

We assume that the input space X is bounded such that for all x ∈ X, ⟨x, x⟩/σ² ≤ 2√R. Since the motivation is to find classifiers which are close to the optimal one pointwise, we assume that the target model θ* has unit norm. As mentioned in Eq. (2), we can write the target model θ* ∈ H_K as θ* = Σ_{i=1}^l α_i·K(a_i, ·) for some {a_i}_{i=1}^l ⊂ X and α_i ∈ R. The classifier corresponding to θ* is denoted f*. Eq. (3) can be rewritten, for a teaching set D := {(x_i, y_i)}_{i=1}^n, as:

    A_opt := argmin_{β ∈ R^l} Σ_{i=1}^n max(−y_i · Σ_{j=1}^l β_j·K(a_j, x_i), 0)    (11)

Similar to Assumption 3.2.1 (cf. §3.2), to construct an approximate teaching set we assume that the target model θ* has the property that, for some truncated polynomial space H_K̃ defined by the feature map Φ̃, there are linearly independent projections in the orthogonal complement of Pθ* in H_K̃. More formally, we state the property as an assumption, which is discussed in detail in the supplemental materials.

Assumption 3.4.1 (Existence of orthogonal classifiers). For the target model θ* and some ε > 0, we assume that there exists r = r(θ*, ε) such that Pθ* has r−1 linearly independent projections on the orthogonal subspace of Pθ* in H_K̃ of the form {Φ̃(z_i)}_{i=1}^{r−1}, such that z_i ∈ X for all i.

For the analysis of the key results, we impose a smoothness condition on the linearly independent projections {Φ̃(z_i)}_{i=1}^{r−1}: they are oriented away from each other by a factor of 1/(r−1). Concretely, for any i, j, Φ̃(z_i)·Φ̃(z_j) ≤ 1/(2(r−1)). This smoothness condition is discussed in the supplemental materials. Now, we consider the following reformulation of the optimization problem in Eq. (11):

    A_opt := argmin_{β_0 ∈ R, γ ∈ R^{r−1}} Σ_{i=1}^{2r−1} max(ℓ(β_0, γ, x_i, y_i), 0)    (12)

where for any i ∈ [2r−1]

    ℓ(β_0, γ, x_i, y_i) = −y_i · (β_0·K(a, x_i) + Σ_{j=1}^{r−1} γ_j·K(z_j, x_i))

and with respect to the teaching set

    TS_θ* := {(z_i, 1), (z_i, −1)}_{i=1}^{r−1} ∪ {(a, 1)}    (13)

where a is chosen such that Pθ*·PΦ(a) > 0³ and Φ(a)·Φ(z_i) ≤ Q·ε (where Q is a constant). The point a can be chosen from a spherical ball B(√(2√R·σ²), 0) in R^d. We index the set TS_θ* as {(x_i, y_i)}_{i=1}^{2r−1}. Eq. (12) is optimized over θ̂ = β_0·K(a, ·) + Σ_{j=1}^{r−1} γ_j·K(z_j, ·) such that θ̂·Φ(a) > 0 and {Φ(z_i)}_{i=1}^{r−1} satisfy Assumption 3.4.1, where ||θ̂||_{H_K} = O(1).

³ We assume θ* is non-degenerate in K̃ (as for polynomial kernels in §3.2), i.e., there are points a ∈ X such that Pθ*·PΦ(a) > 0 (classified with label 1).
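To make Eqs. (12)-(13) concrete, here is a schematic sketch (not the paper's construction: the boundary points z_i, the anchor a, and σ are placeholders, whereas in the paper they must satisfy Assumption 3.4.1 and the ball constraint) that assembles TS_θ* and evaluates the reduced objective over (β_0, γ).

```python
# Schematic sketch of the teaching set of Eq. (13) and the reduced objective of Eq. (12).
import numpy as np

def gauss_kernel(X, Z, sigma=0.9):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def make_teaching_set(z_points, anchor):
    # TS_theta* = {(z_i, +1), (z_i, -1)}_{i=1}^{r-1}  u  {(a, +1)}     -- Eq. (13)
    X = np.vstack([z_points, z_points, anchor[None, :]])
    y = np.concatenate([np.ones(len(z_points)), -np.ones(len(z_points)), [1.0]])
    return X, y

def objective(beta0, gamma, z_points, anchor, X, y, sigma=0.9):
    # sum_i max(-y_i (beta0 K(a, x_i) + sum_j gamma_j K(z_j, x_i)), 0)  -- Eq. (12)
    f = beta0 * gauss_kernel(X, anchor[None, :], sigma)[:, 0] \
        + gauss_kernel(X, z_points, sigma) @ gamma
    return np.maximum(-y * f, 0.0).sum()

z = np.array([[0.8, -0.2], [-0.4, 0.7]])   # placeholder boundary points (r - 1 = 2)
a = np.array([0.1, 0.1])                   # placeholder anchor with positive margin
X, y = make_teaching_set(z, a)
print(X.shape, y)                          # (2r-1, d) teaching points and their labels
print(objective(1.0, np.zeros(len(z)), z, a, X, y))
```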

[Figure 3 panels: (a) Optimal Gaussian boundary; (b) Polynomial approximation; (c) Taught Gaussian boundary]

Figure 3: Approximate teaching for a Gaussian kernel perceptron. (a) Teacher "receives" θ* by training on the complete data set; (b) Teacher identifies a polynomial approximation of the Gaussian decision boundary and generates the teaching set TS_θ* (marked by red dots and cyan crosses); (c) Learner learns a Gaussian kernel perceptron from TS_θ*.

Note that any solution to Eq. (12) can have unbounded norm and can extend in arbitrary directions; thus we make an assumption on the learner which is essential for bounding the error of the optimal separator of Eq. (12).

Assumption 3.4.2 (Bounded Cone). For the target model θ* = Σ_{i=1}^l α_i·K(a_i, ·), the learner optimizes to a solution θ̂ of Eq. (12) with bounded coefficients. Alternatively, the sums Σ_{i=1}^l |α_i| and |β_0| + Σ_{j=1}^{r−1} |γ_j| are bounded, where θ̂ ∈ H_K has the form θ̂ = β_0·K(a, ·) + Σ_{j=1}^{r−1} γ_j·K(z_j, ·).

This assumption is fairly mild and natural, in the sense that if a classifier θ̂ ∈ A_opt approximates θ* pointwise, then it shouldn't be highly (or unboundedly) sensitive to the datapoints involved in the classifiers. It is discussed in greater detail in the supplemental materials. We denote C := Σ_{i=1}^l |α_i| and D := |β_0| + Σ_{j=1}^{r−1} |γ_j|. In the supplemental materials, we show that there exists a unique solution (up to a positive scaling) to Eq. (12) which satisfies Assumption 3.4.2. We show that TS_θ* is an ε-approximate teaching set with r = d^{O(log²(1/ε))}, which bounds the ε-approximate teaching dimension. To achieve this, we first establish the ε-closeness of f̂ (the classifier f̂(x) := θ̂·Φ(x), where θ̂ ∈ A_opt) to f*. Formally, we state the result as follows:

Theorem 3. For any target θ* ∈ H_K that satisfies Assumptions 3.4.1-3.4.2 and any ε > 0, the teaching set TS_θ* constructed for Eq. (12) satisfies |f*(x) − f̂(x)| ≤ ε for any x ∈ X and any f̂ ∈ A_opt(TS_θ*).

Using Theorem 3, we obtain the main result of this subsection, which gives a d^{O(log²(1/ε))} bound on the ε-approximate teaching dimension. We detail the proofs in the supplemental materials:

Theorem 4. For any target θ* ∈ H_K that satisfies Assumptions 3.4.1-3.4.2 and any ε > 0, the teaching set TS_θ* constructed for Eq. (12) is an ε-approximate teaching set with ε-TD(θ*, A_opt) = d^{O(log²(1/ε))}, i.e., for any f̂ ∈ A_opt(TS_θ*),

    |err(f*) − err(f̂)| ≤ ε.

Numerical example. Fig. 3 demonstrates the approximate teaching process for a Gaussian learner. We aim to teach the optimal model θ* (infinite-dimensional) trained on a pre-collected dataset with Gaussian parameter σ = 0.9, whose corresponding boundary is shown in Fig. 3a. For approximate teaching, the teacher computes θ̃ using the polynomially approximated kernel K̃ (in this case with k = 5) in Eq. (8) and the corresponding feature map in Eq. (9). To ensure Assumption 3.4.1 is met while generating teaching examples for θ̃, we employ the randomized algorithm (as used in §3.2), with the key idea of ensuring that the teaching examples on the boundary are linearly independent in the approximated polynomial feature space, i.e., K̃(z_i, z_j) = 0. Finally, the Gaussian learner receives TS_θ* and learns the boundary shown in Fig. 3c. Note the slight difference between the boundaries in Fig. 3b and in Fig. 3c, as the learner learns with a Gaussian kernel.
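For intuition, the following toy sketch mimics the Fig. 3 pipeline under stated assumptions: it reuses fit_kernel_perceptron and risk_gap from the sketch in §2 and gauss_kernel from the sketch after Eq. (13); the dataset, σ = 0.9, and the margin-based subsampling of teaching points are illustrative stand-ins for the paper's randomized boundary construction. It reports a Monte Carlo estimate of the risk gap appearing in Theorem 4.

```python
# Toy end-to-end sketch: teacher trains theta* on a full dataset, keeps a small
# teaching set, learner retrains on it, and we estimate |err(f*) - err(f_hat)|.
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.9
X_full = rng.uniform(-1, 1, size=(400, 2))
y_full = np.sign(np.sin(3 * X_full[:, 0]) - X_full[:, 1])      # nonlinear ground truth

K_full = gauss_kernel(X_full, X_full, sigma)
alpha_star = fit_kernel_perceptron(K_full, y_full)              # teacher's theta*
f_star = lambda X: gauss_kernel(X, X_full, sigma) @ alpha_star

# Teacher keeps a few points near the boundary plus one confidently labeled anchor.
margin = np.abs(f_star(X_full))
idx = np.concatenate([np.argsort(margin)[:10], [np.argmax(margin)]])
X_ts, y_ts = X_full[idx], np.sign(f_star(X_full[idx]))

alpha_hat = fit_kernel_perceptron(gauss_kernel(X_ts, X_ts, sigma), y_ts)
f_hat = lambda X: gauss_kernel(X, X_ts, sigma) @ alpha_hat

X_test = rng.uniform(-1, 1, size=(2000, 2))
y_test = np.sign(np.sin(3 * X_test[:, 0]) - X_test[:, 1])
print(risk_gap(f_star, f_hat, X_test, y_test))                  # small if teaching succeeds
```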

4 Conclusion

We have studied and extended the notion of teaching dimension for optimization-based perceptron learners. We also studied a more general notion of approximate teaching which encompasses the notion of exact teaching. To the best of our knowledge, our exact teaching dimension for linear and polynomial perceptron learners is new; so is the upper bound on the approximate teaching dimension of the Gaussian kernel perceptron learner, and our analysis technique in general. There are many possible extensions to the present work. For example, one may extend our analysis by relaxing the assumptions imposed on the data distribution for polynomial and Gaussian kernel perceptrons. This can potentially be achieved by analysing the linear perceptron and finding ways to nullify subspaces other than orthogonal vectors. This could extend the exact teaching results for the polynomial perceptron learner to a more general case, and yield a tighter bound on the approximate teaching dimension of the Gaussian kernel perceptron learner. On the other hand, a natural extension of our work is to understand the approximate teaching complexity for other types of ERM learners, e.g., kernel SVM, kernel ridge regression, and kernel logistic regression. We believe the current work and its extensions will enrich our understanding of optimal and approximate teaching and enable novel applications.

5 Acknowledgements

Yuxin Chen is supported by NSF 2040989 and a C3.ai DTI Research Award 049755.

References

Anthony, M., Brightwell, G., and Shawe-Taylor, J. On specifying boolean functions by labelled examples. Discrete Applied Mathematics, 61:1–25, 1995. doi: 10.1016/0166-218X(94)00007-Z.

Cakmak, M. and Lopes, M. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. Understanding the role of adaptivity in machine teaching: The case of version space learners. In Advances in Neural Information Processing Systems, pp. 1476–1486, 2018.

Cotter, A., Keshet, J., and Srebro, N. Explicit approximations of the gaussian kernel. CoRR, abs/1109.4603, 2011.

Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39, 2001. doi: 10.1090/S0273-0979-01-00923-5.

Devidze, R., Mansouri, F., Haug, L., Chen, Y., and Singla, A. Understanding the power and limitations of teaching with imperfect knowledge. In IJCAI, 2020.

Doliwa, T., Fan, G., Simon, H. U., and Zilles, S. Recursive teaching dimension, VC-dimension and sample compression. JMLR, 15(1):3107–3131, 2014.

Goldman, S. A. and Kearns, M. J. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

Ha Quang, M. Some properties of gaussian reproducing kernel hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32:307–338, 2010. doi: 10.1007/s00365-009-9080-0.

Haug, L., Tschiatschek, S., and Singla, A. Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pp. 8464–8473, 2018.

Hunziker, A., Chen, Y., Aodha, O. M., Rodriguez, M. G., Krause, A., Perona, P., Yue, Y., and Singla, A. Teaching multiple concepts to a forgetful learner. In Advances in Neural Information Processing Systems, pp. 4050–4060, 2019.

Kamalaruban, P., Devidze, R., Cevher, V., and Singla, A. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, pp. 2692–2700, 2019.

Kirkpatrick, D., Simon, H. U., and Zilles, S. Optimal collusion-free teaching. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98, pp. 506–528, 2019.

Liang, S. and Srikant, R. Why deep neural networks for function approximation? In ICLR, 2017.

Liu, J. and Zhu, X. The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25, 2016.

Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. Iterative machine teaching. In ICML, pp. 2149–2158, 2017.

Liu, W., Dai, B., Li, X., Liu, Z., Rehg, J. M., and Song, L. Towards black-box iterative machine teaching. In ICML, pp. 3147–3155, 2018.

Lu, Y. and Lu, J. A universal approximation theorem of deep neural networks for expressing distributions. 2020.

Mansouri, F., Chen, Y., Vartanian, A., Zhu, J., and Singla, A. Preference-based batch and sequential teaching: Towards a unified view of models. In Advances in Neural Information Processing Systems, pp. 9195–9205, 2019.

Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In ICML, volume 119, pp. 7974–7984, 2020.

Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759.

Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education, 2013.

Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. Near-optimally teaching the crowd to classify. In ICML, pp. 154–162, 2014.

Tschiatschek, S., Ghosh, A., Haug, L., Devidze, R., and Singla, A. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems, 2019.

Vapnik, V. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, 1998. ISBN 9788126528929.

Yarotsky, D. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.

Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. CoRR, abs/1801.05927, 2018.

Zilles, S., Lange, S., Holte, R., and Zinkevich, M. Teaching dimensions based on cooperative learning. In COLT, pp. 135–146, 2008.