Learning to Learn Kernels with Variational Random Features
Xiantong Zhen*1,2  Haoliang Sun*3  Yingjun Du*2  Jun Xu4  Yilong Yin3  Ling Shao5,1  Cees Snoek2
Abstract

We introduce kernels with random Fourier features in the meta-learning framework for few-shot learning. We propose meta variational random features (MetaVRF) to learn adaptive kernels for the base-learner, which is developed in a latent variable model by treating the random feature basis as the latent variable. We formulate the optimization of MetaVRF as a variational inference problem by deriving an evidence lower bound under the meta-learning framework. To incorporate shared knowledge from related tasks, we propose a context inference of the posterior, which is established by an LSTM architecture. The LSTM-based inference network effectively integrates the context information of previous tasks with task-specific information, generating informative and adaptive features. The learned MetaVRF is able to produce kernels of high representational power with a relatively low spectral sampling rate and also enables fast adaptation to new tasks. Experimental results on a variety of few-shot regression and classification tasks demonstrate that MetaVRF can deliver much better, or at least competitive, performance compared to existing meta-learning alternatives.

*Equal contribution. 1Inception Institute of Artificial Intelligence, UAE. 2Informatics Institute, University of Amsterdam, The Netherlands. 3School of Software, Shandong University, China. 4College of Computer Science, Nankai University, China. 5Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE. Correspondence to: X. Zhen.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

arXiv:2006.06707v2 [cs.LG] 13 Aug 2020

1. Introduction

Learning to learn, or meta-learning (Schmidhuber, 1992; Thrun & Pratt, 2012), offers a promising tool for few-shot learning (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Finn et al., 2017) and has recently gained increasing popularity in machine learning. The crux of meta-learning for few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. Generally speaking, existing meta-learning algorithms (Ravi & Larochelle, 2017; Bertinetto et al., 2019) design the meta-learner to extract meta-knowledge that improves the performance of the base-learner on individual tasks. Meta-knowledge, such as a good parameter initialization (Finn et al., 2017) or an efficient optimization update rule shared across tasks (Andrychowicz et al., 2016; Ravi & Larochelle, 2017), has been extensively explored in the general learning framework, but how to define and use it in few-shot learning remains an open question.

An effective base-learner should be powerful enough to solve individual tasks and able to absorb information provided by the meta-learner to improve its own performance. While potentially strong base-learners, kernels (Hofmann et al., 2008) have not yet been studied in the meta-learning scenario for few-shot learning. Learning adaptive kernels (Bach et al., 2004) in a data-driven way via random features (Rahimi & Recht, 2007) has demonstrated great success in regular learning tasks and remains of broad interest in machine learning (Sinha & Duchi, 2016; Hensman et al., 2017; Carratino et al., 2018; Bullins et al., 2018; Li et al., 2019). However, due to the limited availability of data, it is challenging for few-shot learning to establish informative and discriminative kernels. We thus explore the relatedness among distinctive but relevant tasks to generate rich random features to build strong kernels for base-learners, while still maintaining their ability to adapt quickly to individual tasks.

In this paper, we make three important contributions. First, we propose meta variational random features (MetaVRF), integrating, for the first time, kernel learning with random features and variational inference into the meta-learning framework for few-shot learning. We develop MetaVRF in a latent variable model by treating the random Fourier basis of translation-invariant kernels as the latent variable. Second, we formulate the optimization of MetaVRF as a variational inference problem by deriving a new evidence lower bound (ELBO) in the meta-learning setting, where the posterior over the random feature basis corresponds to the spectral distribution associated with the kernel. This provides a principled way of learning data-driven kernels with random Fourier features and, more importantly, fits well in the meta-learning framework for few-shot learning, allowing us to
Figure 1. The learning framework of our meta variational random features (MetaVRF). The meta-learner employs an LSTM-based inference network φ(·) to infer the spectral distribution over ω^t of the kernel from the support set S^t of the current task t, together with the outputs h^{t-1} and c^{t-1} of the previous task. During the learning process, the cell state of the LSTM is deployed to accumulate the shared knowledge through experiencing a set of prior tasks. The remember and forget gates of the LSTM episodically refine the cell state by absorbing information from each experienced task. For each individual task, the task-specific information extracted from its support set is combined with the distilled information from previous tasks to infer the adaptive spectral distribution of the kernels.
flexibly customize the variational posterior to leverage the meta-knowledge for inference. As the third contribution, we propose a context inference which puts the inference of the random feature basis of the current task into the context of all previous, related tasks. The context inference provides a generalized way to integrate context information of related tasks with task-specific information for the inference of random feature bases. To establish the context inference, we introduce a recurrent LSTM architecture (Hochreiter & Schmidhuber, 1997), leveraging its innate capability of learning long-term dependencies, which can be adopted to explore shared meta-knowledge from a large set of previous tasks. The LSTM-based inference connects knowledge from previous tasks to the current task, gradually collecting and refreshing the knowledge across the course of learning. The learning process with an LSTM-based inference network is illustrated in Figure 1. Once learning ceases, the ultimate LSTM state gains meta-knowledge from the experienced related tasks, which enables fast adaptation to new tasks.

We demonstrate the effectiveness of the proposed MetaVRF by extensive experiments on a variety of few-shot regression and classification tasks. Results show that our MetaVRF achieves better, or at least competitive, performance compared to previous methods. Moreover, we conduct further analysis to demonstrate the ability of MetaVRF to be integrated with deeper architectures and its efficiency at relatively low sampling rates. We also apply MetaVRF to versatile and challenging settings with inconsistent training and test conditions, where it still delivers promising results, further demonstrating its strong learning ability.

2. Method

We first describe the base-learner based on kernel ridge regression in meta-learning for few-shot learning, and then introduce kernel learning with random features, based on which our meta variational random features are developed.

2.1. Meta-Learning with Kernels

We adopt the episodic training strategy commonly used for few-shot classification in meta-learning (Ravi & Larochelle, 2017), which involves meta-training and meta-testing stages. In the meta-training stage, a meta-learner is trained to enhance the performance of a base-learner on a meta-training set comprising a batch of few-shot learning tasks, where a task is usually referred to as an episode (Ravi & Larochelle, 2017). In the meta-testing stage, the base-learner is evaluated on a meta-testing set with classes of data samples different from those in the meta-training set.

For the few-shot classification problem, we sample C-way k-shot classification tasks from the meta-training set, where k is the number of labelled examples for each of the C classes. Given the t-th task with a support set S^t = {(x_i, y_i)}_{i=1}^{C×k} and query set Q^t = {(x̃_i, ỹ_i)}_{i=1}^{m} (S^t, Q^t ⊆ X), we learn the parameters α^t of the predictor f_{α^t} using a standard learning algorithm with the kernel trick, α^t = Λ(Φ(X), Y), where S^t = {X, Y}. Here, Λ is the base-learner and Φ is a mapping function from the input space X to a dot product space H. The similarity measure k(x, x′) = ⟨Φ(x), Φ(x′)⟩ is usually called a kernel (Hofmann et al., 2008). Rather than using pre-defined kernels, we consider learning adaptive kernels with random Fourier features in a data-driven way. Moreover, we leverage the shared knowledge by exploring dependencies among related tasks to learn rich features for building up informative kernels.
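To make the episode construction concrete, a C-way k-shot task with a disjoint query set can be sampled roughly as follows. This is an illustrative NumPy sketch with hypothetical names (`sample_episode`, `data_by_class`); the paper does not prescribe this code.

```python
import numpy as np

def sample_episode(data_by_class, C=5, k=1, q=15, rng=None):
    """Sample one C-way k-shot episode: for each of C randomly chosen
    classes, draw k support examples and q query examples per class.
    `data_by_class` maps class index -> array of feature vectors.
    (Hypothetical helper for illustration, not the authors' code.)"""
    rng = rng or np.random.default_rng(0)
    classes = rng.choice(len(data_by_class), size=C, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):  # relabel classes 0..C-1 per episode
        idx = rng.permutation(len(data_by_class[c]))
        support += [(data_by_class[c][i], label) for i in idx[:k]]
        query += [(data_by_class[c][i], label) for i in idx[k:k + q]]
    return support, query
```

Relabelling the C sampled classes as 0, ..., C-1 within each episode is a common convention in episodic training, so the base-learner never relies on global class identities.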
As in traditional supervised learning problems, the base-learner for the t-th single task can use a pre-defined kernel, e.g., a radial basis function, to map the input into a dot product space for efficient learning. Once the base-learner is obtained on the support set, its performance is evaluated on the query set by the following loss function:

    Σ_{(x̃,ỹ)∈Q^t} L(f_{α^t}(Φ(x̃)), ỹ),    (1)

where L(·) can be any differentiable function, e.g., the cross-entropy loss. In the meta-learning setting for few-shot learning, we usually consider a batch of tasks. Thus, the meta-learner is trained by optimizing the following objective function w.r.t. the empirical loss on T tasks:

    Σ_t Σ_{(x̃,ỹ)∈Q^t} L(f_{α^t}(Φ^t(x̃)), ỹ),  s.t.  α^t = Λ(Φ^t(X), Y),    (2)

where Φ^t is the feature mapping function, which can be obtained by learning a task-specific kernel k^t for each task t with data-driven random Fourier features.

In this work, we employ kernel ridge regression, which has an efficient closed-form solution, as the base-learner Λ for few-shot learning. The kernel value in the Gram matrix K ∈ R^{Ck×Ck} is computed as k(x, x′) = Φ(x)Φ(x′)^⊤, where "⊤" denotes the transpose operation. The base-learner Λ for a single task is obtained by solving the following objective w.r.t. the support set of this task:

    Λ = argmin_α Tr[(Y − αK)(Y − αK)^⊤] + λ Tr[αKα^⊤],    (3)

which admits a closed-form solution:

    α = Y(λI + K)^{−1}.    (4)

The learned predictor is then applied to samples in the query set X̃:

    Ŷ = f_α(X̃) = αK̃,    (5)

where K̃ = Φ(X)Φ(X̃)^⊤ ∈ R^{Ck×m}, with each element being k(x, x̃) between samples from the support and query sets. Note that we also treat λ in (3) as a trainable parameter by leveraging the meta-learning setting, and all these parameters are learned by the meta-learner.

2.2. Random Fourier Features

Random Fourier features (RFFs) were proposed to construct approximate translation-invariant kernels using explicit feature maps (Rahimi & Recht, 2007), based on Bochner's theorem (Rudin, 1962).

Theorem 1 (Bochner's theorem) (Rudin, 1962) A continuous, real-valued, symmetric and shift-invariant function k(x, x′) = k(x − x′) on R^d is a positive definite kernel if and only if it is the Fourier transform of a positive finite measure p(ω), such that

    k(x, x′) = ∫_{R^d} e^{iω^⊤(x−x′)} dp(ω) = E_ω[ζ_ω(x) ζ_ω(x′)*],  where ζ_ω(x) = e^{iω^⊤x}.    (6)

It is guaranteed that ζ_ω(x)ζ_ω(x′)* is an unbiased estimate of k(x, x′) given sufficient RFF bases {ω} drawn from p(ω) (Rahimi & Recht, 2007).

For a pre-defined kernel, e.g., the radial basis function (RBF) kernel, we use Monte Carlo sampling to draw bases from the spectral distribution, which gives rise to the explicit feature map:

    z(x) = 1/√D [cos(ω_1^⊤x + b_1), ..., cos(ω_D^⊤x + b_D)],    (7)

where {ω_1, ..., ω_D} are the random bases sampled from p(ω), and [b_1, ..., b_D] are D biases sampled from a uniform distribution over [0, 2π]. Finally, the kernel values k(x, x′) = z(x)z(x′)^⊤ in K are computed as the dot product of the random feature maps built with the same bases.

3. Meta Variational Random Features

We introduce our MetaVRF using a latent variable model in which we treat random Fourier bases as latent variables inferred from data. Learning kernels with random Fourier features is tantamount to finding the posterior distribution over random bases in a data-driven way. It is naturally cast into a variational inference problem, where the optimization objective is derived from an evidence lower bound (ELBO) under the meta-learning framework.
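The kernel machinery this formulation builds on, Eqs. (3)-(7), can be sketched concretely. The following NumPy illustration (not the authors' code) builds random Fourier features for a fixed RBF kernel and runs the kernel ridge regression base-learner on a hypothetical 2-way 5-shot support set. We use the sqrt(2/D) scaling of Rahimi & Recht (2007) so that z(x)z(x′)^⊤ matches the RBF kernel numerically; that the spectral distribution of the unit-bandwidth RBF kernel is a standard Gaussian is a textbook fact, not a claim of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, omega, b):
    # Explicit random Fourier feature map in the spirit of Eq. (7),
    # with the sqrt(2/D) scaling of Rahimi & Recht (2007).
    D = omega.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Bases for the RBF kernel k(x, x') = exp(-||x - x'||^2 / 2), whose
# spectral distribution p(w) is a standard Gaussian.
d, D = 5, 5000
omega = rng.standard_normal((d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Monte Carlo kernel approximation: z(x) z(x')^T ~= k(x, x').
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
approx = float(rff_map(x1[None], omega, b) @ rff_map(x2[None], omega, b).T)
exact = float(np.exp(-0.5 * np.linalg.norm(x1 - x2) ** 2))

# Kernel ridge regression base-learner, Eqs. (3)-(5), on toy data:
# two well-separated classes, 5 support examples each, one-hot labels.
Xs = rng.standard_normal((10, d)) + np.repeat([[2.0], [-2.0]], 5, axis=0)
Ys = np.repeat(np.eye(2), 5, axis=1)        # one-hot labels, shape (2, 10)
Z = rff_map(Xs, omega, b)                   # support features
K = Z @ Z.T                                 # Gram matrix over the support set
alpha = Ys @ np.linalg.inv(0.1 * np.eye(10) + K)   # Eq. (4), lambda = 0.1
x_query = Xs[:1] + 0.05                     # a query point near class 0
K_tilde = Z @ rff_map(x_query, omega, b).T  # cross-kernel, shape (10, 1)
Y_hat = alpha @ K_tilde                     # Eq. (5): per-class scores
```

The closed form of Eq. (4) is what makes kernel ridge regression attractive here: the base-learner adapts to a support set in a single matrix inversion, with no inner-loop gradient steps.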
Instead of simply approximating a given kernel accurately, our goal is to learn an adaptive kernel specific to each task, with the random Fourier bases treated as latent variables inferred from the support set.

Figure 2. Illustration of MetaVRF as a directed graphical model. (x, y) is a sample in the query set Q^t. The basis ω^t of the t-th task is inferred by conditioning on both the basis ω^{t-1} from the previous task and the support set S^t of the current task.

3.1. Evidence Lower Bound

From a probabilistic perspective, under the meta-learning setting for few-shot learning, the goal is to maximize the conditional predictive log-likelihood of samples from the query set Q^t. We treat the random Fourier basis ω^t of the kernel as a latent variable, inferred from the support set S^t of the t-th task.
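Spelling out this latent-variable view: under our reading of Figure 2, where the basis ω^t conditions on the current support set S^t and, through the context inference, on the previous basis ω^{t-1}, the conditional predictive likelihood of a query sample (x, y) marginalizes over the basis. This factorization is our illustrative reconstruction, not an equation quoted from the paper:

```latex
\log p\left(y \mid x, \mathcal{S}^{t}\right)
  = \log \int p\left(y \mid x, \boldsymbol{\omega}^{t}\right)\,
      p\left(\boldsymbol{\omega}^{t} \mid \mathcal{S}^{t}, \boldsymbol{\omega}^{t-1}\right)
      \mathrm{d}\boldsymbol{\omega}^{t}
```

The integral over ω^t is intractable in general, which is what motivates introducing a variational posterior and optimizing the evidence lower bound of this subsection.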