Implicit Kernel Learning

Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, Barnabás Póczos
{chunlial, wchang2, yiming, bapoczos}@cs.cmu.edu [email protected]
Carnegie Mellon University and IBM Research

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

Abstract

Kernels are powerful and versatile tools in machine learning and statistics. Although the notion of universal kernels and characteristic kernels has been studied, kernel selection still greatly influences the empirical performance. While learning the kernel in a data-driven way has been investigated, in this paper we explore learning the spectral distribution of the kernel via implicit generative models parametrized by deep neural networks. We call our method Implicit Kernel Learning (IKL). The proposed framework is simple to train, and inference is performed via sampling random Fourier features. We investigate two applications of the proposed IKL as examples, including generative adversarial networks with MMD (MMD GAN) and standard supervised learning. Empirically, MMD GAN with IKL outperforms vanilla predefined kernels on both image and text generation benchmarks; using IKL with Random Kitchen Sinks also leads to substantial improvement over existing state-of-the-art kernel learning algorithms on popular supervised learning benchmarks. Theory and conditions for using IKL in both applications are also studied, as well as connections to previous state-of-the-art methods.

1 Introduction

Kernel methods are among the essential foundations in machine learning and have been extensively studied in the past decades. In supervised learning, kernel methods allow us to learn non-linear hypotheses. They also play a crucial role in statistics. Kernel maximum mean discrepancy (MMD) (Gretton et al., 2012) is a powerful two-sample test, which is based on a statistic computed via kernel functions. Even though there has been a surge of deep learning in the past years, several successes have been shown by combining kernel methods and deep feature extraction. Wilson et al. (2016) demonstrate state-of-the-art performance by incorporating deep learning, kernels and Gaussian processes. Li et al. (2015) and Dziugaite et al. (2015) use MMD to train deep generative models for complex datasets.

In practice, however, kernel selection is always an important step. Instead of choosing by a heuristic, several works have studied kernel learning. Multiple kernel learning (MKL) (Bach et al., 2004; Lanckriet et al., 2004; Bach, 2009; Gönen and Alpaydın, 2011; Duvenaud et al., 2013) is one of the pioneering frameworks to combine predefined kernels. One recent kernel learning development is to learn kernels via learning spectral distributions (the Fourier transform of the kernel). Wilson and Adams (2013) model spectral distributions via a mixture of Gaussians, which can also be treated as an extension of linear combinations of kernels (Bach et al., 2004). Oliva et al. (2016) extend it to Bayesian non-parametric models. In addition to modeling the spectral distribution with the explicit density models mentioned above, many works optimize the sampled random features or their weights (e.g. Băzăvan et al. (2012); Yang et al. (2015); Sinha and Duchi (2016); Chang et al. (2017); Bullins et al. (2018)). The other approach, orthogonal to modeling spectral distributions, is learning feature maps for standard kernels (e.g. Gaussian). Feature maps learned by deep learning lead to state-of-the-art performance on different tasks (Hinton and Salakhutdinov, 2008; Wilson et al., 2016; Li et al., 2017).

In addition to learning effective features, implicit generative models via deep learning also lead to promising performance in learning distributions of complex data (Goodfellow et al., 2014). Inspired by this recent success, we propose to model kernel spectral distributions with implicit generative models in a data-driven fashion, which we call Implicit Kernel Learning (IKL). IKL provides a new route to modeling spectral distributions by learning sampling processes of the spectral densities, which is underexplored by the previous works mentioned above.

In this paper, we start from studying the generic problem formulation of IKL, and propose an easily implemented, trained and evaluated neural network parameterization which satisfies Bochner's theorem (Section 2). We then demonstrate two example applications of the proposed IKL. Firstly, we explore MMD GAN (Li et al., 2017) with IKL on learning to generate images and text (Section 3). Secondly, we consider a standard two-staged supervised learning task with Random Kitchen Sinks (Sinha and Duchi, 2016) (Section 4). The conditions required for training IKL and its theoretical guarantees in both tasks are also studied. In both tasks, we show that IKL leads to competitive or better performance than heuristic kernel selections and existing approaches modeling kernel spectral densities. It demonstrates the potential of learning more powerful kernels via deep generative models. Finally, we discuss the connection with existing works in Section 5.

2 Kernel Learning

Kernels have been used in several applications with success, including supervised learning, unsupervised learning, and hypothesis testing. They have also been combined with deep learning in different applications (Mairal et al., 2014; Li et al., 2015; Dziugaite et al., 2015; Wilson et al., 2016; Mairal, 2016). Given data $x \in \mathbb{R}^d$, kernel methods compute the inner product of the feature transformation $\phi(x)$ in a high-dimensional Hilbert space $H$ via a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which is defined as $k(x, x') = \langle \phi(x), \phi(x') \rangle_H$, where $\phi(x)$ is usually high or even infinitely dimensional. If $k$ is shift invariant (i.e. $k(x, y) = k(x - y)$), we can represent $k$ as an expectation with respect to a spectral distribution $\mathbb{P}_k(\omega)$.

Bochner's theorem (Rudin, 2011) A continuous, real valued, symmetric and shift-invariant function $k$ on $\mathbb{R}^d$ is a positive definite kernel if and only if there is a positive finite measure $\mathbb{P}_k(\omega)$ such that

$$k(x - x') = \int_{\mathbb{R}^d} e^{i\omega^\top (x - x')} \, d\mathbb{P}_k(\omega) = \mathbb{E}_{\omega \sim \mathbb{P}_k}\left[ e^{i\omega^\top (x - x')} \right].$$

2.1 Implicit Kernel Learning

We restrict ourselves to learning shift invariant kernels. By Bochner's theorem, learning kernels is then equivalent to learning a spectral distribution by optimizing

$$\arg\max_{k \in \mathcal{K}} \sum_{i=1} \mathbb{E}_{x \sim \mathbb{P}_i, x' \sim \mathbb{Q}_i}\left[ F_i(x, x') \, k(x, x') \right] = \arg\max_{k \in \mathcal{K}} \sum_{i=1} \mathbb{E}_{x \sim \mathbb{P}_i, x' \sim \mathbb{Q}_i}\left[ F_i(x, x') \, \mathbb{E}_{\omega \sim \mathbb{P}_k}\left[ e^{i\omega^\top (x - x')} \right] \right], \qquad (1)$$

where $F$ is a task-specific objective function and $\mathcal{K}$ is a set of kernels. (1) covers many popular objectives, such as kernel alignment (Gönen and Alpaydın, 2011) and MMD distance (Gretton et al., 2012). Existing works (Wilson and Adams, 2013; Oliva et al., 2016) learn the spectral density $\mathbb{P}_k(\omega)$ with explicit forms via parametric or non-parametric models. When we learn kernels via (1), it may not be necessary to model the density of $\mathbb{P}_k(\omega)$, as long as we are able to estimate kernel evaluations $k(x - x') = \mathbb{E}_{\omega}[e^{i\omega^\top (x - x')}]$ via sampling from $\mathbb{P}_k(\omega)$ (Rahimi and Recht, 2007). Alternatively, implicit probabilistic (generative) models define a stochastic procedure that can generate (sample) data from $\mathbb{P}_k(\omega)$ without modeling $\mathbb{P}_k(\omega)$. Recently, neural implicit generative models (MacKay, 1995) have regained attention with promising results (Goodfellow et al., 2014) and simple sampling procedures. We first sample $\nu$ from a known base distribution $\mathbb{P}(\nu)$ (e.g. a Gaussian distribution), then use a deterministic function $h_\psi$, parametrized by $\psi$, to transform $\nu$ into $\omega = h_\psi(\nu)$, where $\omega$ follows the complex target distribution $\mathbb{P}_k(\omega)$. Inspired by the success of deep implicit generative models (Goodfellow et al., 2014), we propose an Implicit Kernel Learning (IKL) method by modeling $\mathbb{P}_k(\omega)$ via an implicit generative model $h_\psi(\nu)$, where $\nu \sim \mathbb{P}(\nu)$, which results in

$$k_\psi(x, x') = \mathbb{E}_{\nu}\left[ e^{i h_\psi(\nu)^\top (x - x')} \right] \qquad (2)$$

and reduces (1) to solving

$$\arg\max_{\psi} \sum_{i=1} \mathbb{E}_{x \sim \mathbb{P}_i, x' \sim \mathbb{Q}_i}\left[ F_i(x, x') \, \mathbb{E}_{\nu}\left( e^{i h_\psi(\nu)^\top (x - x')} \right) \right]. \qquad (3)$$

The gradient of (3) can be represented as

$$\sum_{i=1} \mathbb{E}_{x \sim \mathbb{P}_i, x' \sim \mathbb{Q}_i} \mathbb{E}_{\nu}\left[ \nabla_\psi F_i(x, x') \, e^{i h_\psi(\nu)^\top (x - x')} \right].$$

Thus, (3) can be optimized with SGD by sampling $x, x'$ from data and $\nu$ from the base distribution to estimate the gradient shown above in every iteration. Next, we discuss the parametrization of $h_\psi$ to satisfy Bochner's theorem, and describe how to evaluate the IKL kernel in practice.

Symmetric $\mathbb{P}_k(\omega)$ To result in real valued kernels, the spectral density has to be symmetric, i.e. $\mathbb{P}_k(\omega) = \mathbb{P}_k(-\omega)$. Thus, we parametrize $h_\psi(\nu) = \mathrm{sign}(\nu) \circ \tilde{h}_\psi(\mathrm{abs}(\nu))$, where $\circ$ is the Hadamard product and $\tilde{h}_\psi$ can be any unconstrained function if the base distribution $\mathbb{P}(\nu)$ is symmetric (i.e. $\mathbb{P}(\nu) = \mathbb{P}(-\nu)$), such as a standard normal distribution.

Kernel Evaluation Although there is usually no closed form for the kernel evaluation $k_\psi(x, x')$ in (2) with a fairly complicated $h_\psi$, we can evaluate (approximate) $k_\psi(x, x')$ via sampling a finite number of random Fourier features, $\hat{k}_\psi(x, x') = \hat{\phi}_{h_\psi}(x)^\top \hat{\phi}_{h_\psi}(x')$, where $\hat{\phi}_{h_\psi}(x)^\top = [\phi(x; h_\psi(\nu_1)), \ldots, \phi(x; h_\psi(\nu_m))]$, and $\phi(x; \omega)$ is the evaluation on $\omega$ of the Fourier transformation $\phi(x)$ (Rahimi and Recht, 2007).

Next, we demonstrate two example applications covered by (3), where we can apply IKL, including kernel

[...]

The MMD GAN objective then becomes $\min_\theta \max_\varphi M_\varphi(\mathbb{P}_X, \mathbb{P}_\theta)$.

3.1 Training MMD GAN with IKL

Although the composition kernel with a learned feature embedding $f_\varphi$ is powerful, choosing a good base kernel $k$ is still crucial in practice (Bińkowski et al., 2018). Different base kernels for MMD GAN, such as the rational quadratic kernel (Bińkowski et al., 2018) and the distance kernel (Bellemare et al., 2017), have been studied. Instead of choosing it by hand, we propose to learn the base kernel by IKL, which extends (5) to be $k_{\psi,\varphi} = k_\psi \circ f_\varphi$ with the form

$$k_{\psi,\varphi}(x, x') = \mathbb{E}_{\nu}\left[ e^{i h_\psi(\nu)^\top (f_\varphi(x) - f_\varphi(x'))} \right].$$