Implicit Kernel Learning

Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, Barnabás Póczos
Carnegie Mellon University and IBM Research
{chunlial, wchang2, yiming, bapoczos}@cs.cmu.edu, [email protected]

Abstract

Kernels are powerful and versatile tools in machine learning and statistics. Although the notion of universal kernels and characteristic kernels has been studied, kernel selection still greatly influences the empirical performance. While learning the kernel in a data-driven way has been investigated, in this paper we explore learning the spectral distribution of kernels via implicit generative models parametrized by deep neural networks. We call our method Implicit Kernel Learning (IKL). The proposed framework is simple to train, and inference is performed via sampling random Fourier features. We investigate two applications of the proposed IKL as examples, including generative adversarial networks with MMD (MMD GAN) and standard supervised learning. Empirically, MMD GAN with IKL outperforms vanilla predefined kernels on both image and text generation benchmarks; using IKL with Random Kitchen Sinks also leads to substantial improvement over existing state-of-the-art kernel learning algorithms on popular supervised learning benchmarks. Theory and conditions for using IKL in both applications are also studied, as well as connections to previous state-of-the-art methods.

1 Introduction

Kernel methods are among the essential foundations in machine learning and have been extensively studied in the past decades. In supervised learning, kernel methods allow us to learn non-linear hypotheses. They also play a crucial role in statistics. Kernel maximum mean discrepancy (MMD) (Gretton et al., 2012) is a powerful two-sample test, which is based on a statistic computed via kernel functions. Even though there has been a surge of deep learning in the past years, several successes have been shown by kernel methods and deep feature extraction. Wilson et al. (2016) demonstrate state-of-the-art performance by incorporating deep learning, kernels and Gaussian processes. Li et al. (2015); Dziugaite et al. (2015) use MMD to train deep generative models for complex datasets.

In practice, however, kernel selection is always an important step. Instead of choosing by a heuristic, several works have studied kernel learning. Multiple kernel learning (MKL) (Bach et al., 2004; Lanckriet et al., 2004; Bach, 2009; Gönen and Alpaydın, 2011; Duvenaud et al., 2013) is one of the pioneering frameworks to combine predefined kernels. One recent kernel learning development is to learn kernels via learning spectral distributions (the Fourier transform of the kernel). Wilson and Adams (2013) model spectral distributions via a mixture of Gaussians, which can also be treated as an extension of linear combinations of kernels (Bach et al., 2004). Oliva et al. (2016) extend it to Bayesian nonparametric models. In addition to modeling the spectral distribution with the explicit density models mentioned above, many works optimize the sampled random features or their weights (e.g. Băzăvan et al. (2012); Yang et al. (2015); Sinha and Duchi (2016); Chang et al. (2017); Bullins et al. (2018)). An orthogonal approach to modeling spectral distributions is learning feature maps for standard kernels (e.g. Gaussian). Feature maps learned by deep learning lead to state-of-the-art performance on different tasks (Hinton and Salakhutdinov, 2008; Wilson et al., 2016; Li et al., 2017).

In addition to learning effective features, implicit generative models via deep learning also lead to promising performance in learning distributions of complex data (Goodfellow et al., 2014).
Inspired by this recent success, we propose to model kernel spectral distributions with implicit generative models in a data-driven fashion, which we call Implicit Kernel Learning (IKL). IKL provides a new route to modeling spectral distributions by learning sampling processes of the spectral densities, which is underexplored by the previous works mentioned above.

In this paper, we start from studying the generic problem formulation of IKL, and propose an easily implemented, trained and evaluated neural network parameterization which satisfies Bochner's theorem (Section 2). We then demonstrate two example applications of the proposed IKL. First, we explore MMD GAN (Li et al., 2017) with IKL on learning to generate images and text (Section 3). Second, we consider a standard two-stage supervised learning task with Random Kitchen Sinks (Sinha and Duchi, 2016) (Section 4). The conditions required for training IKL and its theoretical guarantees in both tasks are also studied. In both tasks, we show that IKL leads to competitive or better performance than heuristic kernel selection and existing approaches that model kernel spectral densities. This demonstrates the potential of learning more powerful kernels via deep generative models. Finally, we discuss connections with existing works in Section 5.

2 Kernel Learning

Kernels have been used with success in several applications, including supervised learning, unsupervised learning, and hypothesis testing. They have also been combined with deep learning in different applications (Mairal et al., 2014; Li et al., 2015; Dziugaite et al., 2015; Wilson et al., 2016; Mairal, 2016).
Given data $x \in \mathbb{R}^d$, kernel methods compute the inner product of the feature transformation $\phi(x)$ in a high-dimensional Hilbert space $\mathcal{H}$ via a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which is defined as $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, where $\phi(x)$ is usually high or even infinitely dimensional. If $k$ is shift invariant (i.e. $k(x, y) = k(x - y)$), we can represent $k$ as an expectation with respect to a spectral distribution $\mathbb{P}_k(\omega)$.

Bochner's theorem (Rudin, 2011). A continuous, real valued, symmetric and shift-invariant function $k$ on $\mathbb{R}^d$ is a positive definite kernel if and only if there is a positive finite measure $\mathbb{P}_k(\omega)$ such that

$$k(x - x') = \int_{\mathbb{R}^d} e^{i\omega^\top(x - x')}\, d\mathbb{P}_k(\omega) = \mathbb{E}_{\omega \sim \mathbb{P}_k}\big[e^{i\omega^\top(x - x')}\big].$$

2.1 Implicit Kernel Learning

We restrict ourselves to learning shift invariant kernels. Accordingly, learning a kernel is equivalent to learning a spectral distribution by optimizing

$$\arg\max_{k \in \mathcal{K}} \sum_i \mathbb{E}_{x \sim \mathbb{P}_i,\, x' \sim \mathbb{Q}_i}\big[F_i(x, x')\, k(x, x')\big] = \arg\max_{k \in \mathcal{K}} \sum_i \mathbb{E}_{x \sim \mathbb{P}_i,\, x' \sim \mathbb{Q}_i}\Big[F_i(x, x')\, \mathbb{E}_{\omega \sim \mathbb{P}_k}\big[e^{i\omega^\top(x - x')}\big]\Big], \tag{1}$$

where $F_i$ is a task-specific objective function and $\mathcal{K}$ is a set of kernels. (1) covers many popular objectives, such as kernel alignment (Gönen and Alpaydın, 2011) and the MMD distance (Gretton et al., 2012). Existing works (Wilson and Adams, 2013; Oliva et al., 2016) learn the spectral density $\mathbb{P}_k(\omega)$ with explicit forms via parametric or non-parametric models. When we learn kernels via (1), it may not be necessary to model the density of $\mathbb{P}_k(\omega)$, as long as we are able to estimate kernel evaluations $k(x - x') = \mathbb{E}_\omega\big[e^{i\omega^\top(x - x')}\big]$ via sampling from $\mathbb{P}_k(\omega)$ (Rahimi and Recht, 2007). Alternatively, implicit probabilistic (generative) models define a stochastic procedure that can generate (sample) data from $\mathbb{P}_k(\omega)$ without modeling $\mathbb{P}_k(\omega)$. Recently, neural implicit generative models (MacKay, 1995) regained attention with promising results (Goodfellow et al., 2014) and simple sampling procedures. We first sample $\nu$ from a known base distribution $\mathbb{P}(\nu)$ (e.g. a Gaussian distribution), then use a deterministic function $h_\psi$, parametrized by $\psi$, to transform $\nu$ into $\omega = h_\psi(\nu)$, where $\omega$ follows the complex target distribution $\mathbb{P}_k(\omega)$. Inspired by the success of deep implicit generative models (Goodfellow et al., 2014), we propose an Implicit Kernel Learning (IKL) method by modeling $\mathbb{P}_k(\omega)$ via an implicit generative model $h_\psi(\nu)$, where $\nu \sim \mathbb{P}(\nu)$, which results in

$$k_\psi(x, x') = \mathbb{E}_\nu\big[e^{i h_\psi(\nu)^\top (x - x')}\big] \tag{2}$$

and reduces (1) to solving

$$\arg\max_\psi \sum_i \mathbb{E}_{x \sim \mathbb{P}_i,\, x' \sim \mathbb{Q}_i}\Big[F_i(x, x')\, \mathbb{E}_\nu\big(e^{i h_\psi(\nu)^\top(x - x')}\big)\Big]. \tag{3}$$

The gradient of (3) can be represented as

$$\sum_i \mathbb{E}_{x \sim \mathbb{P}_i,\, x' \sim \mathbb{Q}_i}\Big[\mathbb{E}_\nu\big[F_i(x, x')\, \nabla_\psi e^{i h_\psi(\nu)^\top(x - x')}\big]\Big].$$

Thus, (3) can be optimized via sampling $x, x'$ from data and $\nu$ from the base distribution to estimate the gradient as shown above (SGD) in every iteration. Next, we discuss the parametrization of $h_\psi$ to satisfy Bochner's theorem, and describe how to evaluate the IKL kernel in practice.

Symmetric $\mathbb{P}_k(\omega)$. To result in real valued kernels, the spectral density has to be symmetric, i.e. $\mathbb{P}_k(\omega) = \mathbb{P}_k(-\omega)$. Thus, we parametrize $h_\psi(\nu) = \mathrm{sign}(\nu) \odot \tilde{h}_\psi(\mathrm{abs}(\nu))$, where $\odot$ is the Hadamard product and $\tilde{h}_\psi$ can be any unconstrained function if the base distribution $\mathbb{P}(\nu)$ is symmetric (i.e. $\mathbb{P}(\nu) = \mathbb{P}(-\nu)$), such as a standard normal distribution.
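As a concrete illustration, the following is a minimal PyTorch sketch of this symmetric parametrization (it is not the authors' released code; the 3-layer MLP and hidden width mirror the experimental setup described later, and all names are placeholders):

```python
import torch
import torch.nn as nn

class SpectralSampler(nn.Module):
    """Implicit spectral sampler: h_psi(nu) = sign(nu) * h_tilde_psi(|nu|)."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        # h_tilde_psi can be any unconstrained network; a small MLP is used here.
        self.h_tilde = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, nu):
        # The sign/abs trick keeps the learned spectral density symmetric,
        # P_k(w) = P_k(-w), so the resulting kernel is real valued.
        return torch.sign(nu) * self.h_tilde(torch.abs(nu))

# nu ~ N(0, I) (a symmetric base distribution), omega = h_psi(nu).
sampler = SpectralSampler(dim=16)
omega = sampler(torch.randn(1024, 16))   # 1024 sampled frequencies
```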

Kernel Evaluation. Although there is usually no closed form for the kernel evaluation $k_\psi(x, x')$ in (2) with a fairly complicated $h_\psi$, we can evaluate (approximate) $k_\psi(x, x')$ via sampling a finite number of random Fourier features, $\hat{k}_\psi(x, x') = \hat{\phi}_\psi(x)^\top \hat{\phi}_\psi(x')$, where $\hat{\phi}_\psi(x)^\top = [\phi(x; h_\psi(\nu_1)), \ldots, \phi(x; h_\psi(\nu_m))]$ and $\phi(x; \omega)$ is the evaluation at $\omega$ of the Fourier transformation $\phi(x)$ (Rahimi and Recht, 2007).

Next, we demonstrate two example applications covered by (3) to which we can apply IKL, including kernel alignment and maximum mean discrepancy (MMD).

3 MMD GAN with IKL

Given $\{x_i\}_{i=1}^n \sim \mathbb{P}_\mathcal{X}$, instead of estimating the density $\mathbb{P}_\mathcal{X}$, a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is an implicit generative model which learns a generative network $g_\theta$ (generator). The generator $g_\theta$ transforms a base distribution $\mathbb{P}_\mathcal{Z}$ over $\mathcal{Z}$ into $\mathbb{P}_\theta$ to approximate $\mathbb{P}_\mathcal{X}$, where $\mathbb{P}_\theta$ is the distribution of $g_\theta(z)$ and $z \sim \mathbb{P}_\mathcal{Z}$. During training, GAN alternately estimates a distance $D(\mathbb{P}_\mathcal{X} \,\|\, \mathbb{P}_\theta)$ between $\mathbb{P}_\mathcal{X}$ and $\mathbb{P}_\theta$, and updates $g_\theta$ to minimize $D(\mathbb{P}_\mathcal{X} \,\|\, \mathbb{P}_\theta)$. Different probability metrics have been studied for training GANs (Goodfellow et al., 2014; Li et al., 2015; Dziugaite et al., 2015; Nowozin et al., 2016; Arjovsky et al., 2017; Mroueh et al., 2017; Li et al., 2017; Mroueh and Sercu, 2017; Gulrajani et al., 2017; Mroueh et al., 2018; Arbel et al., 2018).

Kernel maximum mean discrepancy (MMD) is a probability metric, commonly used in two-sample tests to distinguish two distributions with finite samples (Gretton et al., 2012). Given a kernel $k$, the MMD between $\mathbb{P}$ and $\mathbb{Q}$ is defined as

$$M_k(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{\mathbb{P},\mathbb{P}}[k(x, x')] - 2\,\mathbb{E}_{\mathbb{P},\mathbb{Q}}[k(x, y)] + \mathbb{E}_{\mathbb{Q},\mathbb{Q}}[k(y, y')]. \tag{4}$$

For characteristic kernels, $M_k(\mathbb{P}, \mathbb{Q}) = 0$ iff $\mathbb{P} = \mathbb{Q}$. Li et al. (2015); Dziugaite et al. (2015) train the generator $g_\theta$ by optimizing $\min_\theta M_k(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$ with a Gaussian kernel $k$. Li et al. (2017) propose MMD GAN, which trains $g_\theta$ via $\min_\theta \max_{k \in \mathcal{K}} M_k(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$, where $\mathcal{K}$ is a pre-defined set of kernels. The intuition is to learn a kernel $\arg\max_{k \in \mathcal{K}} M_k(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$ that gives a stronger signal (i.e. a larger distance when $\mathbb{P}_\mathcal{X} \neq \mathbb{P}_\theta$) to train $g_\theta$. Specifically, Li et al. (2017) consider a composition kernel $k_\phi$ which combines a Gaussian kernel $k$ and a neural network $f_\phi$ as $k_\phi = k \circ f_\phi$, where

$$k_\phi(x, x') = \exp\big(-\|f_\phi(x) - f_\phi(x')\|^2\big). \tag{5}$$

The MMD GAN objective then becomes $\min_\theta \max_\phi M_\phi(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$.

3.1 Training MMD GAN with IKL

Although the composition kernel with a learned feature embedding $f_\phi$ is powerful, choosing a good base kernel $k$ is still crucial in practice (Bińkowski et al., 2018). Different base kernels for MMD GAN, such as the rational quadratic kernel (Bińkowski et al., 2018) and the distance kernel (Bellemare et al., 2017), have been studied. Instead of choosing it by hand, we propose to learn the base kernel by IKL, which extends (5) to $k_{\psi,\phi} = k_\psi \circ f_\phi$ with the form

$$k_{\psi,\phi}(x, x') = \mathbb{E}_\nu\big[e^{i h_\psi(\nu)^\top (f_\phi(x) - f_\phi(x'))}\big]. \tag{6}$$

We then extend the MMD GAN objective to

$$\min_\theta\ \max_{\psi,\phi}\ M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta), \tag{7}$$

where $M_{\psi,\phi}$ is the MMD distance (4) with the IKL kernel (6). Clearly, for a given $\phi$, the maximization over $\psi$ in (7) can be represented as (1) by letting $F_1(x, x') = 1$, $F_2(x, y) = -2$ and $F_3(y, y') = 1$. In what follows, for convenience we use $k_{\psi,\phi}$, $k_\psi$ and $k_\phi$ to denote the kernels defined in (6), (2) and (5), respectively.
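A hedged sketch of how the MMD in (4) with the IKL kernel (6) can be estimated from sampled random Fourier features is given below; `embed` stands in for the feature network f_phi and `sampler` for h_psi from the earlier sketch (illustrative code under these assumptions, not the authors' implementation):

```python
import torch

def rff(z, omega):
    """Random Fourier features [cos(z w^T), sin(z w^T)] / sqrt(m) for m frequencies omega."""
    proj = z @ omega.t()                                      # (batch, m)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=1) / omega.shape[0] ** 0.5

def mmd_ikl(x, y, embed, sampler, m=1024):
    """Biased estimate of M_{psi,phi}(P, Q) in (4) with the IKL kernel k_psi o f_phi in (6)."""
    ex, ey = embed(x), embed(y)                               # f_phi embeddings
    nu = torch.randn(m, ex.shape[1], device=ex.device)
    omega = sampler(nu)                                       # frequencies from the learned h_psi
    fx, fy = rff(ex, omega), rff(ey, omega)
    # k(a, b) is approximated by rff(a) @ rff(b)^T, so each MMD term is a mean of kernel entries.
    return (fx @ fx.t()).mean() - 2 * (fx @ fy.t()).mean() + (fy @ fy.t()).mean()
```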
3.2 Property of MMD GAN with IKL

As proven by Arjovsky and Bottou (2017), some probability distances adopted by existing works (e.g. Goodfellow et al. (2014)) are not weak (i.e. it does not hold that $\mathbb{P}_n \xrightarrow{D} \mathbb{P}$ implies $D(\mathbb{P}_n \,\|\, \mathbb{P}) \to 0$), and therefore cannot provide a better signal to train $g_\theta$. Also, they usually suffer from discontinuity, hence they cannot be trained via gradient descent at certain points. We prove that $\max_{\psi,\phi} M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$ is a continuous and differentiable objective in $\theta$ and is weak under mild assumptions, as used in Arjovsky et al. (2017); Li et al. (2017).

Assumption 1. $g_\theta(z)$ is locally Lipschitz and differentiable in $\theta$; $f_\phi(x)$ is Lipschitz in $x$ and $\phi \in \Phi$, where $\Phi$ is compact. $f_\phi \circ g_\theta(z)$ is differentiable in $\theta$ and there are local Lipschitz constants $L(\theta, z)$, independent of $\phi$, such that $\mathbb{E}_{z \sim \mathbb{P}_\mathcal{Z}}[L(\theta, z)] < +\infty$. The above assumptions are adopted by Arjovsky et al. (2017). Lastly, assume that for any $\psi \in \Psi$, where $\Psi$ is compact, $k_\psi(x, x') = \mathbb{E}_\nu\big[e^{i h_\psi(\nu)^\top(x - x')}\big]$ with $|k_\psi(x, x')| < \infty$ is differentiable and Lipschitz in $(x, x')$, with an upper bound $L_k$ on the Lipschitz constant in $(x, x')$ over different $\psi$.

Theorem 2. Assume the function $g_\theta$ and the kernel $k_{\psi,\phi}$ satisfy Assumption 1. Then $\max_{\psi,\phi} M_{\psi,\phi}$ is weak, that is, $\max_{\psi,\phi} M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_n) \to 0 \iff \mathbb{P}_n \xrightarrow{D} \mathbb{P}_\mathcal{X}$. Also, $\max_{\psi,\phi} M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)$ is continuous everywhere and differentiable almost everywhere in $\theta$.

Lemma 3. Assume $\mathcal{X}$ is bounded. For $x, x' \in \mathcal{X}$, $k_\psi(x, x') = \mathbb{E}_\nu\big[e^{i h_\psi(\nu)^\top(x - x')}\big]$ is Lipschitz in $(x, x')$ if $\mathbb{E}_\nu\big[\|h_\psi(\nu)\|^2\big] < \infty$, which is the variance since $\mathbb{E}_\nu[h_\psi(\nu)] = 0$.

In practice, we penalize $\lambda_h\big(\mathbb{E}_\nu[\|h_\psi(\nu)\|^2] - u\big)^2$ as an approximation of Lemma 3 to ensure that the assumptions of Theorem 2 are satisfied. The algorithm with IKL and the gradient penalty (Bińkowski et al., 2018) is shown in Algorithm 1.

Algorithm 1: MMD GAN with IKL
Input: learning rate $\eta$; batch size $B$; number of $f_\phi, h_\psi$ updates per $g_\theta$ update $n_c$; number of basis $m$; gradient penalty coefficient $\lambda_{GP}$; variance constraint coefficient $\lambda_h$.
Initialize parameters $\theta$ for $g$, $\phi$ for $f$, $\psi$ for $h$.
Define $\mathcal{L}(\psi, \phi) = M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta) - \lambda_{GP}\big(\|\nabla_{\hat{x}} f_\phi(\hat{x})\|_2 - 1\big)^2 - \lambda_h\big(\mathbb{E}_\nu[\|h_\psi(\nu)\|^2] - u\big)^2$.
while $\theta$ has not converged do
    for $t = 1, \dots, n_c$ do
        Sample $\{x_i\}_{i=1}^B \sim \mathbb{P}_\mathcal{X}$, $\{z_j\}_{j=1}^B \sim \mathbb{P}_\mathcal{Z}$, $\{\nu_k\}_{k=1}^m \sim \mathbb{P}(\nu)$
        $(\psi, \phi) \leftarrow (\psi, \phi) + \eta\,\mathrm{Adam}\big((\psi, \phi), \nabla_{\psi,\phi}\,\mathcal{L}(\psi, \phi)\big)$
    end for
    Sample $\{x_i\}_{i=1}^B \sim \mathbb{P}_\mathcal{X}$, $\{z_j\}_{j=1}^B \sim \mathbb{P}_\mathcal{Z}$, $\{\nu_k\}_{k=1}^m \sim \mathbb{P}(\nu)$
    $\theta \leftarrow \theta - \eta\,\mathrm{Adam}\big(\theta, \nabla_\theta\, M_{\psi,\phi}(\mathbb{P}_\mathcal{X}, \mathbb{P}_\theta)\big)$
end while
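For concreteness, a simplified PyTorch-style sketch of one iteration of Algorithm 1 follows, reusing the `SpectralSampler` and `mmd_ikl` sketches from Sections 2.1 and 3.1. The gradient penalty is applied directly to the embedding f_phi at interpolated points as a stand-in for the penalty of Bińkowski et al. (2018), and all hyperparameter values (lambda_gp, lambda_h, u, n_c, m, z_dim) are placeholders rather than the settings of Appendix F.1:

```python
import torch

def mmd_gan_ikl_step(x_real, g, f, h, opt_kernel, opt_gen,
                     lambda_gp=10.0, lambda_h=10.0, u=1.0, n_c=5, m=1024, z_dim=128):
    """One outer iteration of Algorithm 1: n_c kernel updates, then one generator update."""
    B = x_real.shape[0]
    for _ in range(n_c):
        x_fake = g(torch.randn(B, z_dim)).detach()
        # Maximize the MMD over (psi, phi): minimize its negation plus the penalties.
        loss = -mmd_ikl(x_real, x_fake, f, h, m)

        # Simplified gradient penalty on the embedding at interpolated points x_hat.
        eps = torch.rand(B, *([1] * (x_real.dim() - 1)))
        x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
        grad = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)[0]
        loss = loss + lambda_gp * ((grad.flatten(1).norm(dim=1) - 1) ** 2).mean()

        # Variance constraint lambda_h * (E ||h_psi(nu)||^2 - u)^2 motivated by Lemma 3.
        nu = torch.randn(m, f(x_real).shape[1])
        loss = loss + lambda_h * (h(nu).pow(2).sum(dim=1).mean() - u) ** 2

        opt_kernel.zero_grad()
        loss.backward()
        opt_kernel.step()

    # Generator update: minimize the MMD under the current kernel.
    gen_loss = mmd_ikl(x_real, g(torch.randn(B, z_dim)), f, h, m)
    opt_gen.zero_grad()
    gen_loss.backward()
    opt_gen.step()
```

Here `opt_kernel` would be an optimizer over the parameters of f and h, and `opt_gen` an optimizer over the parameters of g.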
3.3 Empirical Study

We consider image and text generation tasks for quantitative evaluation. For image generation, we evaluate the inception score (Salimans et al., 2016) and the FID score (Heusel et al., 2017) on CIFAR-10 (Krizhevsky and Hinton, 2009). We use DCGAN (Radford et al., 2016) and expand the output of $f_\phi$ to be 16-dimensional, as in Bińkowski et al. (2018). For text generation, we consider a length-32 character-level generation task on the Google Billion Words dataset. The evaluation is based on the Jensen-Shannon divergence between the empirical 4-gram probabilities (JS-4) of the generated sequences and the validation data, as used by Gulrajani et al. (2017); Heusel et al. (2017); Mroueh et al. (2018). The model architecture follows Gulrajani et al. (2017) in using a ResNet with 1D convolutions. We train every algorithm for 10,000 iterations for comparison.

For MMD GAN with fixed base kernels, we consider the mixture of Gaussian kernels $k(x, x') = \sum_q \exp\big(-\frac{\|x - x'\|^2}{2\sigma_q^2}\big)$ (Li et al., 2017) and the mixture of RQ kernels $k(x, x') = \sum_q \big(1 + \frac{\|x - x'\|^2}{2\alpha_q}\big)^{-\alpha_q}$. We tuned the hyperparameters $\sigma_q$ and $\alpha_q$ for each kernel as reported in Appendix F.1.

Lastly, for learning base kernels, we compare IKL with the SM kernel (Wilson and Adams, 2013), which learns a mixture of Gaussians to model the kernel spectral density. It can also be treated as the explicit generative model counterpart of the proposed IKL.

In both tasks, $\mathbb{P}(\nu)$, the base distribution of IKL, is a standard normal distribution and $h_\psi$ is a 3-layer MLP with 32 hidden units in each layer. Similar to the aforementioned mixture kernels, we consider a mixture of IKL kernels with the variance constraints $\mathbb{E}[\|h_\psi(\nu)\|^2] = 1/\sigma_q$, where $\sigma_q$ are the bandwidths of the mixture of Gaussian kernels. Note that if $h_\psi$ is an identity map, we recover the mixture of Gaussian kernels. We fix $\lambda_h$ to be 10 and resample $m = 1024$ random features for IKL in every iteration. For other settings, we follow Bińkowski et al. (2018); the hyperparameters can be found in Appendix F.1.
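For reference, a small NumPy sketch of the two fixed base-kernel mixtures described above; the bandwidths and RQ parameters below are illustrative placeholders, not necessarily the values of Appendix F.1:

```python
import numpy as np

def mix_rbf(d2, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Mixture of Gaussian kernels: sum_q exp(-d2 / (2 sigma_q^2)), with d2 = ||x - x'||^2."""
    return sum(np.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mix_rq(d2, alphas=(0.2, 0.5, 1.0, 2.0, 5.0)):
    """Mixture of rational quadratic kernels: sum_q (1 + d2 / (2 alpha_q))^(-alpha_q)."""
    return sum((1.0 + d2 / (2.0 * a)) ** (-a) for a in alphas)

x, y = np.random.randn(8, 16), np.random.randn(8, 16)
d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K_rbf, K_rq = mix_rbf(d2), mix_rq(d2)
```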

3.3.1 Results and Discussion

We compare MMD GAN with the proposed IKL and with different fixed kernels. We repeat the experiments 10 times and report the average results with standard errors in Table 1. Note that for the inception score, larger is better, while for JS-4, smaller is better. We also report WGAN-GP results as a reference. Since the FID score results (Heusel et al., 2017) are consistent with the inception score and do not change our discussion, we put them in Appendix C.1 due to the space limit. Sampled images on larger datasets are shown in Figure 1.

Table 1: Inception scores and JS-4 divergence results.

Method     Inception Score (↑)   JS-4 (↓)
Gaussian   6.726 ± 0.021         0.381 ± 0.003
RQ         6.785 ± 0.031         0.463 ± 0.005
SM         6.746 ± 0.031         0.378 ± 0.003
IKL        6.876 ± 0.018         0.372 ± 0.002
WGAN-GP    6.539 ± 0.034         0.379 ± 0.002

Figure 1: Samples generated by MMD GAN (IKL) on the CIFAR-10, CelebA and LSUN datasets.

Pre-defined Kernels. Bińkowski et al. (2018) show that RQ kernels outperform Gaussian and energy distance kernels on image generation. Our empirical results agree with this finding: RQ kernels achieve a 6.785 inception score while the Gaussian kernel achieves 6.726, as shown in the left column of Table 1. In text generation, nonetheless, RQ kernels only achieve a 0.463 JS-4 score and are not on par with the 0.381 acquired by Gaussian kernels, even though the latter is still slightly worse than WGAN-GP. (For RQ kernels, we searched 10 possible hyperparameter settings and report the best one in the Appendix, to ensure the unsatisfactory performance is not caused by improper parameters.) These results imply that kernel selection is task-specific. On the other hand, the proposed IKL learns kernels in a data-driven way, which results in the best performance on both tasks. On CIFAR-10, although the Gaussian kernel is worse than RQ, IKL is still able to transform $\mathbb{P}(\nu)$, which is Gaussian, into a powerful kernel, and outperforms RQ on inception scores (6.876 vs. 6.785). For text generation, from Table 1 and Figure 2, we observe that IKL can further boost the Gaussian kernel into a better kernel with substantial improvement. Also, we note that the difference between IKL and the pre-defined kernels in Table 1 is significant based on a t-test at the 95% confidence level.

Learned Kernels. The SM kernel (Wilson and Adams, 2013), which learns the spectral density via a mixture of Gaussians, does not significantly outperform the Gaussian kernel, as shown in Table 1, since Li et al. (2017) already use an equal-weighted mixture-of-Gaussians formulation. This suggests that the proposed IKL can learn more complicated and effective spectral distributions than simple mixture models.

Study of Variance Constraints. In Lemma 3, we prove that bounding the variance $\mathbb{E}[\|h_\psi(\nu)\|^2]$ guarantees $k_\psi$ to be Lipschitz, as required in Theorem 2. We investigate the importance of this constraint. In Figure 3, we show the training objective (MMD), $\mathbb{E}[\|h_\psi(\nu)\|^2]$ and the JS-4 divergence for training MMD GAN (IKL) without the variance constraint, i.e. $\lambda_h = 0$. We observe that the variance keeps going up without the constraint, which leads to exploding MMD values. Also, when the explosion becomes severe, the JS-4 divergence starts increasing, which implies MMD cannot provide a meaningful signal to $g_\theta$. This study justifies the validity of Theorem 2 and Lemma 3.

Figure 3: Learning MMD GAN (IKL) without the variance constraint on the Google Billion Words dataset for text generation.

Figure 2: Convergence of MMD GANs with different kernels on text generation.

Other Studies. One concern with the proposed IKL is the computational overhead introduced by sampling random features as well as by using more parameters to model $h_\psi$. Since we only use a small network to model $h_\psi$, the increased computational overhead is almost negligible under GPU parallel computation. A detailed comparison can be found in Appendix C.2. We also compare IKL with Bullins et al. (2018), which can be seen as a variant of IKL without $h_\psi$, and study the variance constraint. Those additional discussions can be found in Appendix C.1.

4 Random Kitchen Sinks with IKL

Rahimi and Recht (2009) propose Random Kitchen Sinks (RKS) as follows. We sample $\omega_i \sim \mathbb{P}_k(\omega)$ and transform $x \in \mathbb{R}^d$ into $\hat{\phi}(x) = [\phi(x; \omega_1), \ldots, \phi(x; \omega_M)]$, where $\sup_{x,\omega} |\phi(x; \omega)| \le 1$. We then learn a classifier on the transformed features $\hat{\phi}(x)$. Kernel methods with random features (Rahimi and Recht, 2007) are an example of RKS, where $\mathbb{P}_k(\omega)$ is the spectral distribution of the kernel and $\phi(x; \omega) = \big[\cos(\omega^\top x), \sin(\omega^\top x)\big]$. We usually learn a model $w$ by solving

$$\arg\min_w\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell\big(w^\top \hat{\phi}(x_i)\big). \tag{8}$$

If $\ell$ is a convex loss function, the objective (8) can be solved efficiently to the global optimum.

Spectral distributions $\mathbb{P}_k$ are usually set to a parameterized form, such as Gaussian distributions, but the selection of $\mathbb{P}_k$ is important in practice. If we consider RKS as kernel methods with random features, then selecting $\mathbb{P}_k$ is equivalent to the well-known kernel selection (learning) problem for supervised learning (Gönen and Alpaydın, 2011).
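A hedged sketch of plain RKS, i.e. Gaussian random Fourier features followed by a convex linear classifier as in (8); scikit-learn's LogisticRegression is used purely for illustration and the data here are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_fourier_features(X, omega):
    """phi(x; w) = [cos(w^T x), sin(w^T x)], stacked over the M sampled frequencies."""
    proj = X @ omega.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omega.shape[0])

rng = np.random.default_rng(0)
d, M = 20, 256
X_train = rng.standard_normal((1000, d))              # placeholder data
y_train = rng.integers(0, 2, size=1000)               # placeholder labels
omega = rng.standard_normal((M, d))                   # P_k(w) = N(0, I): Gaussian-kernel spectrum
Z_train = random_fourier_features(X_train, omega)
clf = LogisticRegression(C=1.0, max_iter=1000).fit(Z_train, y_train)   # convex problem (8)
```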

Two-Stage Approach. We follow Sinha and Duchi (2016) in considering kernel learning for RKS with a two-stage approach. In stage 1, we consider kernel alignment (Cristianini et al., 2002) of the form $\arg\max_{k \in \mathcal{K}} \mathbb{E}_{(x,y),(x',y')}\big[y y'\, k(x, x')\big]$. By parameterizing $k$ via the implicit generative model $h_\psi$ as in Section 2, we have the following problem:

$$\arg\max_\psi\ \mathbb{E}_{(x,y),(x',y')}\Big[y y'\, \mathbb{E}_\nu\big(e^{i h_\psi(\nu)^\top(x - x')}\big)\Big], \tag{9}$$

which can be treated as (1) with $F_1(x, x') = yy'$. After solving (9), we obtain a sampler $h_\psi$ from which we can easily sample. Thus, in stage 2, we have the advantage of solving a convex problem (8) in RKS with IKL. The algorithm is shown in Algorithm 2.

Algorithm 2: Random Kitchen Sinks with IKL
Stage 1: Kernel Learning
Input: $X = \{(x_i, y_i)\}_{i=1}^n$; the batch size $B$ for data and $m$ for random features; learning rate $\eta$.
Initialize parameter $\psi$ for $h_\psi$.
while $\psi$ has not converged and the maximum number of iterations is not reached do
    Sample $\{(x_i, y_i)\}_{i=1}^B \subset X$; freshly sample $\{\nu_j\}_{j=1}^m \sim \mathbb{P}(\nu)$
    $g_\psi \leftarrow \nabla_\psi\, \frac{1}{B(B-1)} \sum_{i \neq i'} y_i y_{i'}\, \frac{1}{m}\sum_{j=1}^m e^{i h_\psi(\nu_j)^\top (x_i - x_{i'})}$
    $\psi \leftarrow \psi + \eta\,\mathrm{Adam}(\psi, g_\psi)$
end while
Stage 2: Random Kitchen Sinks
Sample $\{\nu_i\}_{i=1}^M \sim \mathbb{P}(\nu)$; note that $M$ is not necessarily equal to $m$.
Transform $X$ into $\hat{\phi}(X)$ via $h_\psi$ and $\{\nu_i\}_{i=1}^M$.
Learn a linear classifier on $(\hat{\phi}(X), Y)$.

Note that in stage 1, we resample $\{\nu_j\}_{j=1}^m$ in every iteration to train the implicit generative model $h_\psi$. The advantage of Algorithm 2 is that the random features used in kernel learning and in RKS can be different, which allows us to use fewer random features in kernel learning (stage 1) and to sample more features for RKS (stage 2).
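A hedged PyTorch sketch of one stage-1 update of Algorithm 2 (maximizing the empirical kernel alignment with freshly resampled nu), reusing the `SpectralSampler` sketch from Section 2.1; the optimizer and batch handling are illustrative assumptions:

```python
import torch

def alignment_step(x, y, sampler, opt, m=64):
    """One stage-1 update of Algorithm 2: ascend the empirical alignment of the IKL kernel."""
    B, d = x.shape
    nu = torch.randn(m, d)                        # fresh base samples every iteration
    omega = sampler(nu)                           # omega = h_psi(nu)
    proj = x @ omega.t()
    feat = torch.cat([torch.cos(proj), torch.sin(proj)], dim=1) / m ** 0.5
    K = feat @ feat.t()                           # Monte Carlo estimate of k_psi(x_i, x_j)
    S = y[:, None] * y[None, :]                   # label similarity y_i * y_j
    off_diag = ~torch.eye(B, dtype=torch.bool)
    alignment = (S * K)[off_diag].mean()          # (1 / (B(B-1))) sum_{i != j} y_i y_j k(x_i, x_j)
    opt.zero_grad()
    (-alignment).backward()                       # gradient ascent on objective (9)
    opt.step()
    return float(alignment)
```

Here `opt` would be an optimizer over `sampler.parameters()` (e.g. Adam), `x` a float tensor of shape (B, d), and `y` a tensor of ±1 labels.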

h i mentioned works will be more costly if we want to use a which can be treated as (1) with F1(x, x0)=yy0.After large number of random features for training classifiers. solving (9),welearnasampler h where we can easily In contrast to implicit generative models, Oliva et al. sample. Thus, in stage 2, we thus have the advantage (2016)learnanexplicitBayesiannonparametricgener- of solving a convex problem (8) in RKS with IKL. The ative model for spectral distributions, which requires algorithm is shown in Algorithm 2. specifically designed inference algorithms. Learning m kernels for (8) in dual form without random features Note that in stage 1, we resample ⌫j in every { }j=1 has also been proposed. It usually require costly steps, iteration to train an implicit generative model h . The such as eigendecomposition of the Gram matrix (Gönen advantage of Algorithm 2 is the random features used and Alpaydın, 2011). in kernel learning and RKS can be different, which allows us to use less random features in kernel learning (stage 1), and sample more features for RKS (stage 2). 4.1 Empirical Study One can also jointly train both feature mapping ! We evaluate the proposed IKL on both synthetic and and the model parameters w,suchasneuralnetworks. benchmark binary classification tasks. For IKL, P(⌫) We remark that our intent is not to show state-of- is standard Normal and h is a 3-layer MLP for all ex- the-art results on supervised learning, on which deep periments. The number of random features m to train neural networks dominate (Krizhevsky et al., 2012; h in Algorithm 2 is fixed to be 64.Otherexperiment He et al., 2016). We use RKS as a protocol to study details are described in Appendix F.2. kernel learning and the proposed IKL, which still has competitive performance with neural networks on some Kernel learning with a poor choice of Pk(!) We tasks (Rahimi and Recht, 2009; Sinha and Duchi, 2016). generate x n (0,I ) with y = sign( x pd), { i}i=1 ⇠N d i k k2 Also, the simple procedure of RKL with IKL allows where d is the data dimension. A two dimensional us to provide some theoretical guarantees of the per- example is shown in Figure 4. Competitive baselines formance, which is sill challenging of deep learning include random features (RFF) (Rahimi and Recht, models. 2007) as well as OPT-KL (Sinha and Duchi, 2016). Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, Barnabás Póczos

In the experiments, we fix $M = 256$ in RKS for all algorithms. Since Gaussian kernels with bandwidth $\sigma = 1$ are known to be ill-suited for this task (Sinha and Duchi, 2016), we directly use random features from them for RFF and OPT-KL. Similarly, we set $\mathbb{P}(\nu)$ to be a standard normal distribution as well.

Figure 4: The left figure shows training examples when $d = 2$. The right figure shows classification error vs. data dimension.

The test error for different data dimensions $d = \{2, 4, \ldots, 18, 20\}$ is shown in Figure 4. Note that RFF is competitive with OPT-KL and IKL when $d$ is small ($d \le 6$), while its performance degrades rapidly as $d$ increases, which is consistent with the observation in Sinha and Duchi (2016). More discussion of the reason for this failure can be found in Sinha and Duchi (2016). On the other hand, although using the standard normal as the spectral distribution is ill-suited for this task, both OPT-KL and IKL can adapt to the data and learn to transform it into effective kernels, resulting in slower degradation with $d$.

Note that OPT-KL learns sparse weights on the sampled random features ($M = 256$). However, the sampled random features can fail to contain informative ones, especially in high dimensions (Bullins et al., 2018). Thus, when using a limited number of random features, OPT-KL may result in worse performance than IKL in the high-dimensional regime in Figure 4.

Performance on benchmark datasets. Next, we evaluate our IKL framework on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky and Hinton, 2009) datasets; each task consists of 10,000 training and 2,000 test examples. For all datasets, raw pixels are used as the feature representation. We set the bandwidth of the RBF kernel by the median heuristic. We also compare with Wilson and Adams (2013), the spectral mixture (SM) kernel, which uses a Gaussian mixture to learn the spectral density and can be seen as the explicit generative model counterpart of IKL. Also, the SM kernel is an MKL variant with linear combination (Gönen and Alpaydın, 2011). In addition, we consider the joint training of random features and model parameters, which can be treated as a two-layer neural network (NN) and serves as a lower bound on the error for comparing different kernel learning algorithms.

The test error versus different $M = \{2^6, 2^7, \ldots, 2^{13}\}$ in the second stage is shown in Figure 5. First, in light of computational efficiency, SM and the proposed IKL only sample $m = 64$ random features in each iteration of the first stage, and draw different numbers of basis $M$ from the learned $h_\psi(\nu)$ for the second stage. For OPT-KL, on the contrary, the random features used in training and testing have to be the same. Therefore, OPT-KL needs to deal with $M$ random features in training, which raises computational concerns when $M$ is large. In addition, IKL demonstrates improvement over the representative kernel learning method OPT-KL, with especially significant gains on challenging datasets such as CIFAR-10. In some cases, such as MNIST, IKL almost reaches the performance of NN, while OPT-KL degrades to RFF except for a small number of basis ($M = 2^6$). This illustrates the effectiveness of learning the kernel spectral distribution via the implicit generative model $h_\psi$. Also, IKL outperforms SM, which is consistent with the finding in Section 3 that IKL can learn more complicated spectral distributions than simple mixture models (SM).

4.2 Consistency and Generalization

The simple two-stage approach, IKL with RKS, allows us to provide consistency and generalization guarantees. For consistency, it guarantees that the solution of the finite-sample approximation of (9) approaches the optimum of (9) (the population optimum) as we increase the number of training data and the number of random features. We first define the necessary symbols and state the theorem.

Let $s(x_i, x_j) = y_i y_j$ be a label similarity function, where $|y_i| \le 1$. We use $s_{ij}$ to denote $s(x_i, x_j)$ interchangeably. Given a kernel $k$, we define the true and empirical alignment functions as

$$T(k) = \mathbb{E}\big[s(x, x')\, k(x, x')\big], \qquad \hat{T}(k) = \frac{1}{n(n-1)}\sum_{i \neq j} s_{ij}\, k(x_i, x_j).$$

In the following, we abuse the notation $k_\psi$ to be $k_h$ for ease of illustration. Recall the definitions $k_h(x, x') = \langle \phi_h(x), \phi_h(x') \rangle$ and $\hat{k}_h(x, x') = \hat{\phi}_h(x)^\top \hat{\phi}_h(x')$. We define two hypothesis sets

$$\mathcal{F}_\mathcal{H} = \big\{ f(x) = \langle w, \phi_h(x) \rangle \mid h \in \mathcal{H},\ \langle w, w \rangle \le 1 \big\}, \qquad \hat{\mathcal{F}}^m_\mathcal{H} = \big\{ f(x) = w^\top \hat{\phi}_h(x) \mid h \in \mathcal{H},\ \|w\| \le 1,\ w \in \mathbb{R}^m \big\}.$$

Definition 4 (Rademacher Complexity). Given a hypothesis set $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$ and a fixed sample $X = \{x_1, \ldots, x_n\}$, the empirical Rademacher complexity of $\mathcal{F}$ is defined as

$$R_X(\mathcal{F}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i f(x_i)\Big],$$

where $\sigma$ are $n$ i.i.d. Rademacher random variables.

(a) MNIST (4-9) (b) MNIST (5-6) (c) CIFAR-10 (auto-truck) (d) CIFAR-10 (plane-bird)

Figure 5: Test error rate versus the number of basis in the second stage on benchmark binary classification tasks. We report the mean and standard deviation over five runs. Our method (IKL) is compared with RFF (Rahimi and Recht, 2009), OPT-KL (Sinha and Duchi, 2016), SM (Wilson and Adams, 2013) and the end-to-end trained MLP (NN).

We then have the following theorem, showing that the consistency guarantee depends on the complexity of the function class induced by IKL as well as on the number of random features. The proof can be found in Appendix D.

Theorem 5 (Consistency). Let $\hat{h} = \arg\max_{h \in \mathcal{H}} \hat{T}(\hat{k}_h)$, with i.i.d. samples $\{\nu_i\}_{i=1}^m$ drawn from $\mathbb{P}(\nu)$. With probability at least $1 - 3\delta$, we have

$$\Big|T(k_{\hat{h}}) - \sup_{h \in \mathcal{H}} T(k_h)\Big| \le 2\,\mathbb{E}_X\Big[R_X(\mathcal{F}_\mathcal{H}) + R_X(\hat{\mathcal{F}}^m_\mathcal{H})\Big] + \sqrt{\frac{8\log\frac{1}{\delta}}{n-1}} + \sqrt{\frac{2\log\frac{4}{\delta}}{m}}.$$

Applying Cortes et al. (2010), we also have a generalization bound, which depends on the number of training data $n$, the number of random features $m$ and the Rademacher complexity of the IKL kernel, as shown in Appendix E. The Rademacher complexity $R_X(\mathcal{F}_\mathcal{H})$, for example, can be $1/\sqrt{n}$ or even $1/n$ for kernels with different bounding conditions (Cortes et al., 2013). We would expect worse rates for more powerful kernels. This suggests a trade-off between consistency/generalization and using powerful kernels parametrized by neural networks.

5 Discussion

We propose a generic kernel learning algorithm, IKL, which learns sampling processes of kernel spectral distributions by transforming samples from a base distribution $\mathbb{P}(\nu)$ into samples for the target kernel (spectral density). We compare IKL with other algorithms for learning MMD GAN and for supervised learning with Random Kitchen Sinks (RKS). For these two tasks, the conditions and guarantees of IKL are studied. Empirical studies show IKL is better than or competitive with state-of-the-art kernel learning algorithms. This shows that IKL can learn to transform $\mathbb{P}(\nu)$ into effective kernels even if $\mathbb{P}(\nu)$ is less favorable to the task.

We note that the preliminary idea of IKL is mentioned in Băzăvan et al. (2012), but they ended up with an algorithm that directly optimizes sampled random features (RF), which has many follow-up works (e.g. Sinha and Duchi (2016); Bullins et al. (2018)). The major difference is that, by learning the transformation function $h_\psi$, the RF used in training and evaluation can be different. This flexibility allows a simple training algorithm (SGD) and does not require keeping the learned features. In our studies on GAN and RKS, we show that using a simple MLP can already achieve better or competitive performance with those works, which suggests IKL can be a new direction for kernel learning and is worth further study.

We highlight that IKL does not conflict with existing works but can be combined with them. In Section 3, we show the combination of IKL with kernel learning via embeddings (Wilson et al., 2016) and with a mixture of spectral distributions (Wilson and Adams, 2013). Therefore, in addition to the examples shown in Section 3 and Section 4, IKL is directly applicable to many existing works on kernel learning via embeddings (e.g. Dai et al. (2014); Li and Póczos (2016); Wilson et al. (2016); Al-Shedivat et al. (2016); Arbel et al. (2018); Jean et al. (2018); Chang et al. (2019)). A possible extension is combining IKL with Bayesian inference (Oliva et al., 2016) under a framework similar to Saatchi and Wilson (2017). The learned sampler from IKL can possibly provide an easier way to do Bayesian inference via sampling.

References

Al-Shedivat, M., Wilson, A. G., Saatchi, Y., Hu, Z., and Xing, E. P. (2016). Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936.

Alber, M., Kindermans, P.-J., Schütt, K., Müller, K.-R., and Sha, F. (2017). An empirical study on the properties of random bases for kernel methods. In NIPS.

Arbel, M., Sutherland, D. J., Bińkowski, M., and Gretton, A. (2018). On gradient regularizers for MMD GANs. In NIPS.

Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In ICLR.

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. In ICML.

Bach, F. R. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS.

Bach, F. R., Lanckriet, G. R., and Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In ICML.

Băzăvan, E. G., Li, F., and Sminchisescu, C. (2012). Fourier kernel learning. In ECCV.

Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. In ICLR.

Borisenko, O. and Minchenko, L. (1992). Directional derivatives of the maximum function. Cybernetics and Systems Analysis, 28(2):309–312.

Bullins, B., Zhang, C., and Zhang, Y. (2018). Not-so-random features. In ICLR.

Chang, W.-C., Li, C.-L., Yang, Y., and Poczos, B. (2017). Data-driven random Fourier features using Stein effect. In IJCAI.

Chang, W.-C., Li, C.-L., Yang, Y., and Póczos, B. (2019). Kernel change-point detection with auxiliary deep generative models. arXiv preprint arXiv:1901.06077.

Cortes, C., Kloft, M., and Mohri, M. (2013). Learning kernels using local Rademacher complexity. In NIPS.

Cortes, C., Mohri, M., and Rostamizadeh, A. (2010). Generalization bounds for learning kernels. In ICML.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. S. (2002). On kernel-target alignment. In ICML.

Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M.-F. F., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In NIPS.

Dudley, R. M. (2018). Real Analysis and Probability. Chapman and Hall/CRC.

Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. In ICML.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. In UAI.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. JMLR.

Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. JMLR.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. (2014). Generative adversarial nets. In NIPS.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. JMLR.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of Wasserstein GANs. In NIPS.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.

Hinton, G. E. and Salakhutdinov, R. R. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. In NIPS.

Jean, N., Xie, S. M., and Ermon, S. (2018). Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance. In Advances in Neural Information Processing Systems, pages 5327–5338.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.

Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. JMLR.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In NIPS.

Li, C.-L. and Póczos, B. (2016). Utilize old coordinates: Faster doubly stochastic gradients for kernel methods. In UAI.

Li, Y., Swersky, K., and Zemel, R. (2015). Generative moment matching networks. In ICML.

MacKay, D. J. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment.

Mairal, J. (2016). End-to-end kernel learning with supervised convolutional kernel networks. In NIPS.

Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. (2014). Convolutional kernel networks. In NIPS.

Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. (2018). Sobolev GAN. In ICLR.

Mroueh, Y. and Sercu, T. (2017). Fisher GAN. In NIPS.

Mroueh, Y., Sercu, T., and Goel, V. (2017). McGAN: Mean and covariance feature matching GAN. In ICML.

Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS.

Oliva, J. B., Dubey, A., Wilson, A. G., Póczos, B., Schneider, J., and Xing, E. P. (2016). Bayesian nonparametric kernel-learning. In AISTATS.

Radford, A., Metz, L., and Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.

Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

Rahimi, A. and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS.

Rudin, W. (2011). Fourier Analysis on Groups. John Wiley & Sons.

Saatchi, Y. and Wilson, A. G. (2017). Bayesian GAN. In NIPS, pages 3625–3634.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. In NIPS.

Sinha, A. and Duchi, J. C. (2016). Learning kernels with random features. In NIPS.

Wilson, A. and Adams, R. (2013). Gaussian process kernels for pattern discovery and extrapolation. In ICML.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). Deep kernel learning. In AISTATS.

Yang, Z., Wilson, A., Smola, A., and Song, L. (2015). A la carte – learning fast kernels. In AISTATS.

Zhang, Y., Liang, P., and Charikar, M. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In COLT.