LMKL-Net: A Fast Localized Multiple Kernel Learning Solver via Deep Neural Networks
Ziming Zhang
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA 02139-1955
[email protected]

Abstract

In this paper we propose solving localized multiple kernel learning (LMKL) using LMKL-Net, a feedforward deep neural network. In contrast to previous works, as a learning principle we propose parameterizing both the gating function for learning kernel combination weights and the multiclass classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively. In this way we can learn the (nonlinear) decision function in LMKL (approximately) by sequential applications of AN and MLP. Empirically, on benchmark datasets we demonstrate that overall LMKL-Net can not only outperform state-of-the-art MKL solvers in terms of accuracy, but also be trained about two orders of magnitude faster with a much smaller memory footprint for large-scale learning.

1 Introduction

Multiple kernel learning (MKL) is a classic yet powerful machine learning technique for integrating heterogeneous features in a reproducing kernel Hilbert space (RKHS) by linear or nonlinear combination of base kernels. It has been applied successfully in many applications, such as object detection in computer vision [46], where it was part of one of the joint winning entries in VOC2009 [8].

In the literature many MKL formulations have been proposed. For instance, [3] proposed a block-$\ell_1$ regularized formulation for binary classification, and later [20] proposed using the $\ell_p$ norm as regularization in MKL. As a classic example, we show the objective of SimpleMKL [39]:

$$\min_{\{w_m\},\, \beta \in \Delta^{M-1},\, \zeta,\, b} \; \frac{1}{2}\sum_{m=1}^{M} \frac{\|w_m\|_2^2}{\beta_m} + C\sum_{i=1}^{N} \zeta_i, \quad \text{s.t. } y_i\Big(\sum_{m=1}^{M} w_m^T \Phi_m(x_i) + b\Big) \ge 1 - \zeta_i,\; \zeta_i \ge 0,\; \forall i, \qquad (1)$$

where the set $\{(x_i, y_i)\}$ denotes the training data, $w_m, \forall m,$ and $b$ denote the classifier parameters, $\beta = [\beta_m] \in \Delta^{M-1}$ denotes the weights for the $M$ kernels lying on the $(M-1)$-simplex, $\zeta = [\zeta_i]$ denotes the slack variables for the hinge loss, $C \ge 0$ is a predefined regularization constant, $(\cdot)^T$ denotes the matrix transpose operator, and $\Phi_m(\cdot)$ denotes the feature map in the RKHS for the $m$-th kernel $K_m$, so that $K_m(x_i, x_j) = \Phi_m(x_i)^T \Phi_m(x_j), \forall x_i, x_j$. Based on the dual, we can rewrite Eq. 1 as follows:

$$\min_{\beta} J(\beta) \quad \text{s.t. } \sum_m \beta_m = 1,\; \beta_m \ge 0,\; \forall m, \qquad (2)$$

where

$$J(\beta) = \left\{\begin{array}{l} \max_{\alpha} \; -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_m \beta_m K_m(x_i, x_j) + \sum_i \alpha_i \\ \text{s.t. } \sum_i \alpha_i y_i = 0,\; 0 \le \alpha_i \le C,\; \forall i. \end{array}\right. \qquad (3)$$

Here $\alpha_i, \forall i,$ denote the Lagrangian variables in the dual. To optimize Eq. 2, alternating optimization algorithms are typically developed, that is, learning $\beta$ in Eq. 2 while fixing $\alpha$, and learning $\alpha$ in Eq. 3 by solving a kernel support vector machine (SVM) problem while fixing $\beta$. This family of MKL algorithms is known as wrapper methods. From this perspective, Eq. 1 essentially learns a kernel mixture as a linear combination of base kernels. Accordingly, we can write the decision function, $f$, in SimpleMKL for a new sample $z \in \mathcal{X}$ as follows:

$$f_{\text{Simple}}(z) = \sum_i \alpha_i y_i \Big[\sum_m \beta_m K_m(z, x_i)\Big]. \qquad (4)$$
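To make Eqs. 1-4 concrete, below is a minimal NumPy sketch (not part of the original paper) that builds Gaussian base kernels, forms the linear kernel mixture $\sum_m \beta_m K_m$, and evaluates the SimpleMKL decision function of Eq. 4. The bandwidths `gammas` and the learned quantities `alpha` and `betas` are placeholders; in practice the alternating solver described above would produce them.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) base kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def mixed_kernel(A, B, betas, gammas):
    """Linear kernel mixture sum_m beta_m * K_m(A, B), as in Eq. 1."""
    return sum(b * rbf_kernel(A, B, g) for b, g in zip(betas, gammas))

def f_simple(z, X, y, alpha, betas, gammas):
    """SimpleMKL decision values for test points z (Eq. 4), given learned alpha and betas."""
    K = mixed_kernel(z, X, betas, gammas)   # (n_test, n_train)
    return K @ (alpha * y)                  # sum_i alpha_i y_i sum_m beta_m K_m(z, x_i)

# toy usage with made-up quantities (alpha and betas would come from the wrapper solver)
X = np.random.randn(100, 5); y = np.sign(np.random.randn(100))
alpha = np.random.rand(100); gammas = [0.1, 1.0, 10.0]
betas = np.array([0.2, 0.5, 0.3])           # lies on the 2-simplex
print(f_simple(np.random.randn(3, 5), X, y, alpha, betas, gammas).shape)  # (3,)
```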
Localized MKL (LMKL) [10; 22; 30] and its variants such as [24] form another family of MKL algorithms, which learn localized (i.e., data-dependent) kernel weights. For instance, [10] proposes replacing $\beta$ in Eq. 1 with a function $\eta$, leading to the following objective:

$$\min_{\{w_m\},\, \eta_m(x) \ge 0,\, \zeta,\, b} \; \frac{1}{2}\sum_m \|w_m\|_2^2 + C\sum_i \zeta_i, \quad \text{s.t. } y_i\Big(\sum_m \eta_m(x_i)\, w_m^T \Phi_m(x_i) + b\Big) \ge 1 - \zeta_i,\; \zeta_i \ge 0,\; \forall i, \qquad (5)$$

where $\eta_m(x_i)$ denotes the gating function that takes a data point $x_i$ as input and outputs the weight for the $m$-th kernel. The decision function is

$$f_{\text{Loc}}(z) = \sum_i \alpha_i y_i \Big[\sum_m \eta_m(z)\, K_m(z, x_i)\, \eta_m(x_i)\Big] + b. \qquad (6)$$

Comparing Eq. 5 with Eq. 1, we can see that LMKL essentially relaxes conventional MKL by introducing the data-dependent function $\eta$, which may better capture data properties such as distributions when learning kernel weights. Since the learning capability of $\eta$ is high, previous works usually regularize it using explicit expressions, such as the Gaussian similarity in [10].
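As an illustration of Eq. 6 (again, not from the paper), the sketch below evaluates the localized decision function, reusing `rbf_kernel` and the toy data from the previous sketch. The softmax gating over per-kernel linear scores is purely illustrative, and the parameters `V`, `v0`, `alpha`, and `b` are hypothetical stand-ins for trained values.

```python
def softmax_gating(X, V, v0):
    """Illustrative gating eta_m(x): softmax over per-kernel linear scores (one common choice)."""
    scores = X @ V + v0                          # (n, M)
    e = np.exp(scores - scores.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)           # each row lies on the (M-1)-simplex

def f_loc(z, X, y, alpha, b, V, v0, gammas):
    """Localized decision (Eq. 6): sum_i alpha_i y_i sum_m eta_m(z) K_m(z, x_i) eta_m(x_i) + b."""
    eta_z, eta_X = softmax_gating(z, V, v0), softmax_gating(X, V, v0)
    out = np.zeros(len(z))
    for m, g in enumerate(gammas):
        Km = rbf_kernel(z, X, g)                                  # (n_test, n_train)
        out += (eta_z[:, m:m+1] * Km * eta_X[:, m]) @ (alpha * y) # gate both arguments of K_m
    return out + b

# toy usage with hypothetical gating parameters (X, y, alpha, gammas as in the previous sketch)
V, v0 = np.random.randn(5, 3) * 0.1, np.zeros(3)
print(f_loc(np.random.randn(3, 5), X, y, alpha, b=0.0, V=V, v0=v0, gammas=gammas).shape)  # (3,)
```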
Beyond learning linear mixtures of kernels, some works focus on learning nonlinear kernel combinations, for instance involving products between kernels [7; 2; 17; 27] or deep MKL [43; 4], which embeds kernels into kernels. In these works the primal formulation may be nontrivial to write down explicitly. Instead, some works prefer learning the decision functions directly based on empirical risk minimization (ERM). For instance, in [7; 43] the nonlinear kernel mixtures are fed into the dual of an SVM to learn the parameters of both the kernel mixtures and the classifiers.

MKL, including multiclass MKL [50] and multilabel MKL [18; 44], can be further generalized to multi-task learning (MTMKL) [16; 12; 32]. However, these research topics are outside the scope of this paper.

Motivation: From the dual perspective, solving kernel SVMs such as Eq. 3 involves a constrained quadratic programming (QP) problem for computing the Lagrangian variables that determine linear decision boundaries in the RKHS. This is the key procedure that consumes most of the computation. Moreover, the LMKL literature lacks a principle for learning the gating functions; instead, the functions are tuned manually, e.g., by enforcing Gaussian distributions as a prior. In addition, for the sake of accuracy the optimal decision functions may not necessarily be linear in the RKHS. Therefore, it is highly desirable to have principles for learning both the gating and decision functions in LMKL.

Deep neural networks (DNNs) have been proven to be universal approximators of arbitrary functions [28]. In fact, researchers have recently started to apply DNNs as efficient solvers for some classic numerical problems, such as partial differential equations (PDEs) and backward stochastic differential equations (BSDEs) in high-dimensional spaces [48; 40; 25]. Conventionally, solving these numerical problems with large amounts of high-dimensional data is very time-consuming. In contrast, DNNs are able to efficiently output approximate solutions with sufficient precision.

Contributions: Based on the considerations above, we propose a simple neural network, namely LMKL-Net, as an efficient solver for LMKL. As a learning principle, we parameterize the gating function as well as the classifier in LMKL using an attentional network (AN) and a multilayer perceptron (MLP), respectively, without enforcing any (strong) prior explicitly (a schematic sketch of this parameterization is given at the end of Section 2). We expect that by fitting the data, the network can approximate both underlying optimal functions properly. The localized weights learned by the AN guarantee that the kernel mixtures are indeed valid kernels. Empirically, we demonstrate the superiority of LMKL-Net over state-of-the-art MKL solvers in terms of accuracy, memory, and running speed.

2 Related Work

Localized MKL (LMKL): The success of LMKL depends heavily on the gating functions. Depending on the design choices, previous works can be categorized as either data/sample related [10; 13; 24] or group/cluster related [49; 31; 22]. For instance, in [10] the gating function is defined explicitly as a Gaussian similarity function, while in [22] the gating function is represented by the likelihoods of the data in clusters that are predefined by the likelihood function. Recently, [30] proposed viewing the gating function in LMKL as an explicit feature map that can generate an additional kernel on the data. In all of these works, the classifier parameters are shared among all the data samples. In contrast to previous works, we propose using an attentional network to approximate the unknown optimal gating function. This parameterization avoids the need for any prior on the function.

Large-Scale MKL: Several MKL solvers have addressed the large-scale learning problem from the perspective of either computational complexity or memory footprint, such as SILP-MKL [42], SimpleMKL [39], $\ell_p$-norm-MKL [20], GMKL [45], SPG-GMKL [15], UFO-MKL [37], OBSCURE [36], and MWUMKL [29].
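To make the AN + MLP parameterization referenced in the Contributions paragraph (and revisited above) concrete, here is a rough, hypothetical PyTorch sketch. The excerpt does not specify the exact architecture, so the module name `LMKLNetSketch`, the choice of stacked kernel rows against a fixed set of anchor points as input, and all layer sizes are assumptions for illustration only, not the authors' design.

```python
import torch
import torch.nn as nn

class LMKLNetSketch(nn.Module):
    """Hypothetical sketch of an AN + MLP parameterization for LMKL (not the paper's exact model).

    Input: per-sample kernel features k = [K_1(x, .), ..., K_M(x, .)] stacked as (batch, M, N_anchors).
    The attention network (AN) outputs data-dependent kernel weights eta(x) on the simplex;
    the MLP maps the gated, combined representation to multiclass scores.
    """
    def __init__(self, num_kernels, num_anchors, num_classes, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(              # AN: approximates the gating function eta(x)
            nn.Linear(num_kernels * num_anchors, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kernels), nn.Softmax(dim=-1))
        self.classifier = nn.Sequential(             # MLP: (nonlinear) decision function
            nn.Linear(num_anchors, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, k):                            # k: (batch, M, N_anchors)
        eta = self.attention(k.flatten(1))           # (batch, M), rows on the simplex
        mixed = (eta.unsqueeze(-1) * k).sum(dim=1)   # data-dependent kernel mixture, (batch, N_anchors)
        return self.classifier(mixed)                # multiclass logits

# toy usage: 3 base kernels, 100 anchor/training points, 5 classes
net = LMKLNetSketch(num_kernels=3, num_anchors=100, num_classes=5)
logits = net(torch.randn(8, 3, 100))
print(logits.shape)  # torch.Size([8, 5])
```

The softmax in the attention branch keeps the per-sample weights nonnegative and on the simplex, mirroring the property noted in the Contributions paragraph that localized nonnegative weights keep the resulting kernel mixture a valid kernel.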