Random Feature Maps for the Itemset Kernel

Kyohei Atarashi, Hokkaido University ([email protected])
Subhransu Maji, University of Massachusetts, Amherst ([email protected])
Satoshi Oyama, Hokkaido University / RIKEN AIP ([email protected])

Abstract

Although kernel methods efficiently use feature combinations without computing them directly, they do not scale well with the size of the training dataset. Factorization machines (FMs) and related models, on the other hand, enable feature combinations efficiently, but their optimization generally requires solving a non-convex problem. We present random feature maps for the itemset kernel, which uses feature combinations and includes the ANOVA kernel, the all-subsets kernel, and the standard dot product. Linear models using one of our proposed maps can be used as an alternative to kernel methods and FMs, resulting in better scalability during both training and evaluation. We also present theoretical results for a proposed map, discuss the relationship between factorization machines and linear models using a proposed map for the ANOVA kernel, and relate the proposed feature maps to prior work. Furthermore, we show that the maps can be calculated more efficiently by using a signed circulant matrix projection technique. Finally, we demonstrate the effectiveness of using the proposed maps on real-world datasets.

1 Introduction

Kernel methods enable learning in high, possibly infinite, dimensional feature spaces without explicitly expressing them. In particular, kernels that model feature combinations, such as polynomial kernels, the ANOVA kernel, and the all-subsets kernel (Blondel et al. 2016a; Shawe-Taylor and Cristianini 2004), have been shown to be effective for a number of tasks in computer vision and natural language understanding (Lin, RoyChowdhury, and Maji 2015; Fukui et al. 2016). However, their scalability remains a challenge: support vector machines (SVMs) with non-linear kernels require O(n^2) time and O(n^2) memory for training and O(n) time and memory for evaluation, where n is the number of training instances (Chang and Lin 2011).

To address this issue, several researchers have proposed randomized feature maps Z(·): R^d → R^D for kernels K(·, ·): R^d × R^d → R that satisfy

    E[⟨Z(x), Z(y)⟩] = K(x, y).    (1)

The idea is to perform classification, regression, or clustering on the corresponding high-dimensional feature space approximately but efficiently, using linear models in a low-dimensional space obtained by mapping the data points with Z(·). Examples include random Fourier feature maps, which approximate shift-invariant kernels K(x, y) = k(x − y) (Rahimi and Recht 2008); random Maclaurin feature maps, which approximate dot product kernels K(x, y) = k(⟨x, y⟩) (Kar and Karnick 2012); and tensor sketching for polynomial kernels K_P^m(x, y; c) := (c + ⟨x, y⟩)^m (Pham and Pagh 2013). Although polynomial kernels are dot product kernels and can be approximated by random Maclaurin feature maps, tensor sketching can be more efficient.

Factorization machines (FMs) (Rendle 2010; 2012) and their variants (Blondel et al. 2016a; 2016b; Novikov, Trofimov, and Oseledets 2016) also model feature combinations without explicitly computing them, similar to kernel methods, but have better scalability during evaluation. These methods can be thought of as two-layer neural networks with polynomial activations and a fixed number of learnable parameters (see Equation (5)). However, unlike kernel methods, their optimization problem is generally non-convex and difficult to solve. Still, due to their efficiency during evaluation, FMs are attractive for large-scale problems and have been successfully applied to applications such as link prediction and recommender systems. This work analyzes the relationship between polynomial kernel models and factorization machines in more detail.

Our contributions. We present a random feature map for the itemset kernel, which takes into account all feature combinations within a family of itemsets S ⊆ 2^[d]. To the best of our knowledge, a random feature map for the itemset kernel is novel.
The itemset kernel includes the ANOVA kernel, the all-subsets kernel, and the standard dot product, so linear models using this map are an alternative to ANOVA or all-subsets kernel SVMs, FMs, and the all-subsets model. They scale well with the size of the training dataset, unlike kernel methods, and their optimization problem is convex and easy to solve, unlike that of FMs. We also present theoretical analyses of the proposed random feature map and discuss the relationship between linear models trained on these features and factorization machines. Furthermore, we present a faster and more memory-efficient random feature map for the ANOVA kernel based on the signed circulant matrix technique (Feng, Hu, and Liao 2015). Finally, we evaluate the effectiveness of the feature maps on several datasets.

2 Background and Related Work

2.1 Kernels Using Feature Combinations

First, we present the ANOVA kernel and the all-subsets kernel, which use feature combinations (Blondel et al. 2016a; Shawe-Taylor and Cristianini 2004), and define the itemset kernel. We also describe several models that use feature combinations.

The ANOVA kernel is similar to the polynomial kernel. The definition of an m-order ANOVA kernel between x, y ∈ R^d is

    K_A^m(x, y) := Σ_{j_1 < ··· < j_m} x_{j_1} ··· x_{j_m} y_{j_1} ··· y_{j_m},    (2)

where the sum runs over all m-combinations of indices in [d] := {1, ..., d}. The all-subsets kernel uses all combinations of distinct features:

    K_all(x, y) := Π_{j=1}^{d} (1 + x_j y_j).    (3)

We define the itemset kernel for a family of itemsets S ⊆ 2^[d] as

    K_S(x, y) := Σ_{V ∈ S} Π_{j ∈ V} x_j y_j.    (4)

The itemset kernel clearly uses the feature combinations in the family of itemsets S. It can be regarded as an extension of the ANOVA kernel, the all-subsets kernel, and the standard dot product. For example, when S = 2^[d], K_{2^[d]} clearly uses all feature combinations and hence is equivalent to the all-subsets kernel K_all in Equation (3). When S = \binom{[d]}{m} := {V ⊆ [d] | |V| = m}, the itemset kernel K_S is equivalent to the m-order ANOVA kernel K_A^m. Furthermore, when S = {{1}, ..., {d}}, the itemset kernel K_S clearly represents the standard dot product.
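To make the definitions concrete, here is a small NumPy sketch (our own illustrative code, not from the paper; all function names are ours) that evaluates Equations (2)-(4) naively and checks the special cases discussed above.

```python
import numpy as np
from itertools import chain, combinations

def anova_kernel_naive(x, y, m):
    """Eq. (2): sum over index sets j_1 < ... < j_m of the products x_j y_j."""
    return sum(np.prod([x[j] * y[j] for j in V])
               for V in combinations(range(len(x)), m))

def all_subsets_kernel(x, y):
    """Eq. (3): prod_j (1 + x_j y_j)."""
    return np.prod(1.0 + x * y)

def itemset_kernel(x, y, S):
    """Eq. (4): sum over itemsets V in S of prod_{j in V} x_j y_j (empty product = 1)."""
    return sum(np.prod([x[j] * y[j] for j in V]) for V in S)

rng = np.random.default_rng(0)
d, m = 6, 3
x, y = rng.normal(size=d), rng.normal(size=d)

power_set = list(chain.from_iterable(combinations(range(d), r) for r in range(d + 1)))
m_subsets = list(combinations(range(d), m))
singletons = [(j,) for j in range(d)]

print(np.isclose(itemset_kernel(x, y, power_set), all_subsets_kernel(x, y)))    # True
print(np.isclose(itemset_kernel(x, y, m_subsets), anova_kernel_naive(x, y, m))) # True
print(np.isclose(itemset_kernel(x, y, singletons), np.dot(x, y)))               # True
```

The three checks mirror the reductions stated above. Note that S = 2^[d] includes the empty itemset, whose empty product contributes the constant 1, matching the 1 in each factor of Equation (3).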
2.2 Factorization Machines

Rendle proposed using a factorization machine (FM) as an ANOVA-kernel-based model (Rendle 2010; 2012). The FM model equation is

    f_FM(x; w, P, λ) := ⟨w, x⟩ + Σ_{s=1}^{k} λ_s K_A^2(p_s, x),    (5)

where w ∈ R^d, P = (p_1, ..., p_k) ∈ R^{d×k}, and λ ∈ R^k are learnable parameters, and k ∈ N is a rank hyperparameter. The computational cost of evaluating FMs is O(dkm) and does not depend on the amount of training data. However, the FM optimization problem is non-convex and hence challenging. Fortunately, it can be solved relatively efficiently using a coordinate descent method because it is multi-convex w.r.t. w_1, ..., w_d, p_{1,1}, ..., p_{d,k}.

2.3 Random Feature Maps

The random Maclaurin (RM) feature map (Kar and Karnick 2012) approximates dot product kernels K(x, y) = k(⟨x, y⟩) by randomizing the Maclaurin expansion of k. It uses two distributions: a distribution over the order of each randomized feature, p_order(N = n) = 1/p^{n+1} with p > 1, and the Rademacher distribution (a fair coin distribution). Its computational cost is O(Σ_{s=1}^{D} N_s d) time and memory, where N_s (s ∈ [D]) is the order of the s-th randomized feature; in particular, it is O(Ddm) time and memory when the objective kernel is the homogeneous polynomial kernel K_HP^m(x, y) = ⟨x, y⟩^m (in this case one can fix n = m, that is, take p_order(N = m) = 1 and 0 otherwise, so n need not be sampled).

Tensor sketching (TS) (Pham and Pagh 2013) is a random feature map for homogeneous polynomial kernels K_HP^m(x, y) = ⟨x, y⟩^m. Because a polynomial kernel K_P^m(x, y; c) = (c + ⟨x, y⟩)^m can be written as K_HP^m by concatenating √c to each vector, a TS can also approximate K_P^m. Although an RM feature map can also approximate polynomial kernels, a TS can approximate them more efficiently. It uses count sketch, a method for estimating the frequency of all items in a stream, as a specific random projection that approximates the dot product in the original feature space. Although the standard dot product in the original feature space can be approximated by using only count sketch, Pham and Pagh proposed combining count sketch with a fast algorithm, due to Pagh (Pagh 2013), for computing the count sketch of the outer product without computing the outer product directly. Their tensor sketch algorithm takes O(m(d + D log D)) time and O(md log D) memory and is thus more efficient than the random Maclaurin algorithm.

Linear models using the TS or RM feature map are a good alternative to polynomial kernel SVMs and polynomial networks (PNs) (Kar and Karnick 2012; Pham and Pagh 2013). Similarly, although linear models using a random feature map that approximates the itemset kernel would be a good alternative to ANOVA or all-subsets kernel SVMs, FMs, and all-subsets models, such a map has not yet been reported.

3 Random Feature Map for the Itemset Kernel

We propose a random feature map for the itemset kernel. As shown in Algorithm 1, the proposed random kernel (RK) feature map is simple: (1) generate D Rademacher vectors from a Rademacher distribution and (2) compute D itemset kernels between the original feature vector and each Rademacher vector. We next present some theoretical results for the RK feature map.

Algorithm 1 Random Kernel Feature Map
  Input: x ∈ R^d, S ⊆ 2^[d]
  1: Generate D Rademacher vectors ω_1, ..., ω_D ∈ {−1, +1}^d
  2: Compute the D itemset kernels K_S(x, ω_s) for all s ∈ [D]
  Output: Z(x) = (1/√D) (K_S(x, ω_1), ..., K_S(x, ω_D))^⊤

Proposition 1. Let Z_RK: R^d → R^D be the random kernel (RK) feature map in Algorithm 1. Then, for all x, y ∈ R^d and S ⊆ 2^[d],

    E_{ω_1,...,ω_D}[⟨Z_RK(x), Z_RK(y)⟩] = K_S(x, y).    (7)

Proposition 1 says that the proposed RK feature map approximates the itemset kernel. Hence, linear models using the proposed RK feature map can use feature combinations efficiently and are a good alternative to FMs and all-subsets models.
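The following NumPy sketch illustrates Algorithm 1 and Proposition 1 (our own illustration, not the authors' implementation; the family S below is arbitrary). Averaging ⟨Z_RK(x), Z_RK(y)⟩ over many Rademacher vectors should approach K_S(x, y).

```python
import numpy as np

def itemset_kernel(x, y, S):
    """K_S(x, y) = sum over itemsets V in S of prod_{j in V} x_j y_j (Eq. (4))."""
    return sum(np.prod([x[j] * y[j] for j in V]) for V in S)

def rk_feature_map(x, omegas, S):
    """Algorithm 1: Z(x) = (K_S(x, w_1), ..., K_S(x, w_D)) / sqrt(D)."""
    D = omegas.shape[0]
    return np.array([itemset_kernel(x, w, S) for w in omegas]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 6, 50_000
S = [(0, 1), (1, 2, 3), (4,), (0, 2, 5)]          # an arbitrary family of itemsets
x, y = rng.normal(size=d), rng.normal(size=d)
omegas = rng.choice([-1.0, 1.0], size=(D, d))     # D Rademacher vectors
zx = rk_feature_map(x, omegas, S)
zy = rk_feature_map(y, omegas, S)
print(np.dot(zx, zy))          # close to K_S(x, y) by Proposition 1 (up to sampling error)
print(itemset_kernel(x, y, S)) # exact value for comparison
```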
We next analyze the precision of the RK feature map. Let E(x, y) be the approximation error: E(x, y) := ⟨Z_RK(x), Z_RK(y)⟩ − K_S(x, y). We assume that the L1 norm of the feature vector is bounded: ‖x‖_1 ≤ R, where R ∈ R_++. This assumption is the same as the one used in previous research (Kar and Karnick 2012; Pham and Pagh 2013; Rahimi and Recht 2008). For convenience, we use the same notation as Kar and Karnick (Kar and Karnick 2012): B_p(0, R) = {x | ‖x‖_p ≤ R}. With this notation, the assumption above is written as x ∈ B_1(0, R). Then, we have the following useful absolute error bound.

Lemma 1. For all x, y ∈ B_1(0, R) ⊂ R^d and S ⊆ 2^[d],

    p(|E(x, y)| ≥ ε) ≤ 2 exp(−D ε² / (2 e^{4R})).    (8)

This upper bound does not depend on the family of itemsets S or on the dimension of the original feature vectors d. This result comes from the assumption that the data points are restricted to B_1(0, R).

Next, we consider the uniform bound on the absolute error of the RK feature map. Kar and Karnick (Kar and Karnick 2012) derived the uniform bound on the absolute error of the RM feature map, and we follow their approach. Let the domain of feature vectors B ⊂ B_1(0, R) be a compact subset of R^d. Then, B can be covered by a finite number of balls (Cucker and Smale 2002), and one can obtain the following uniform bound.

Lemma 2. Let B ⊂ B_1(0, R) be a compact subset of R^d. Then, for all S ⊆ 2^[d],

    p(sup_{x,y ∈ B} |E(x, y)| ≥ ε) ≤ 2 (32 R √d e^{2R} / ε)^{2d} exp(−D ε² / (8 e^{4R})).    (9)

This uniform bound says that, by taking D = Ω((d e^{4R} / ε²) log(R √d e^{2R} / (ε δ))), the absolute error is uniformly lower than ε with a probability of at least 1 − δ. This uniform bound also does not depend on the family of itemsets S; it depends only on ε, the dimension D of the random feature map, the dimension d of the original feature vectors, and the upper bound R on the L1 norm of the original feature vectors. The behavior of this uniform bound w.r.t. d, ε, and δ is of the form D = Ω((d/ε²) log(√d/(ε δ))), which is the same as for the RM feature map (Kar and Karnick 2012).

We have discussed upper bounds for the RK feature map for a general itemset kernel. Next, we consider the absolute error bound for K_S = K_A^m (that is, S = \binom{[d]}{m}). Here, we again assume that x ∈ B_1(0, R).

Lemma 3. Let S = \binom{[d]}{m}. Then, for all x, y ∈ B_1(0, R) ⊂ R^d,

    p(|E(x, y)| ≥ ε) ≤ 2 exp(−D ε² / (2 R^{4m})).    (10)

The absolute error bound of Lemma 3 is the same as the absolute error bound of tensor sketching (Pham and Pagh 2013).

As described above, the algorithm of the proposed RK feature map uses the Rademacher distribution for the random vectors. Here, we discuss the generalized RK feature map, which allows the use of other distributions.

Proposition 2. If the distribution of ω_s for all s ∈ [D] in Algorithm 1 has (i) a mean of 0 and (ii) a variance of 1, the RK feature map approximates the itemset kernel.

There are many distributions with a mean of 0 and a variance of 1: the standard Gaussian distribution N(0, 1), the uniform distribution U(−√3, √3), the Laplace distribution Laplace(0, 1/√2) with density (1/√2) exp(−√2 |ω|), and so on. Which distribution should be used? The next lemma says that the Rademacher distribution should be used.

Lemma 4. Let P_{0,1} be the set of all distributions with a mean of 0 and a variance of 1, and let p* ∈ P_{0,1} be the Rademacher distribution. Then, for all p ∈ P_{0,1} and S ⊆ 2^[d],

    sup_{x,y ∈ B_∞(0,R)} V_{ω_1,...,ω_D ∼ p*}[⟨Z_RK(x), Z_RK(y)⟩] ≤ sup_{x,y ∈ B_∞(0,R)} V_{ω_1,...,ω_D ∼ p}[⟨Z_RK(x), Z_RK(y)⟩].    (11)

That is, the Rademacher distribution achieves the minimax optimal variance for the RK feature map among the valid distributions.

Finally, we discuss the computational complexity of the RK feature map in two special cases. When K_S(·, ·) = K_A^m(·, ·), a D-dimensional RK feature map takes O(Ddm) time and O(Dd) memory because an m-order ANOVA kernel can be computed in O(dm) time and O(m) memory by using dynamic programming (Blondel et al. 2016a; Shawe-Taylor and Cristianini 2004). This is the same as the computational cost of an RM feature map for an m-order polynomial kernel. When K_S(·, ·) = K_all(·, ·), a D-dimensional RK feature map can be computed in O(Dd) time and O(Dd) memory.
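The O(dm)-time, O(m)-memory evaluation of the ANOVA kernel mentioned above is the standard dynamic program over elementary symmetric polynomials of the products x_j y_j. A minimal sketch (ours, following the cited dynamic-programming idea rather than the authors' code):

```python
import numpy as np
from itertools import combinations

def anova_kernel_dp(x, y, m):
    """m-order ANOVA kernel K_A^m(x, y) in O(dm) time and O(m) memory.

    a[t] is the t-order ANOVA kernel of the features processed so far;
    updating t in decreasing order lets one length-(m+1) array be reused.
    """
    a = np.zeros(m + 1)
    a[0] = 1.0
    for z in np.asarray(x) * np.asarray(y):
        for t in range(m, 0, -1):
            a[t] += z * a[t - 1]
    return a[m]

def anova_kernel_naive(x, y, m):
    """Direct evaluation of Eq. (2), used only to check the DP."""
    return sum(np.prod([x[j] * y[j] for j in V])
               for V in combinations(range(len(x)), m))

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
print(np.isclose(anova_kernel_dp(x, y, 3), anova_kernel_naive(x, y, 3)))  # True
```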
4 Loglinear Time RK Feature Map for the ANOVA Kernel

As described above, the computational cost of the proposed RK feature map in Algorithm 1 clearly depends on the computational cost of the itemset kernel K_S. This is a drawback of the RK feature map. The computational cost of the RK feature map for an m-order ANOVA kernel is O(Ddm) time. This cost is the same as that of the RM feature map for an m-order polynomial kernel and larger than that of the TS, O(m(d + D log D)). The number of parameters of the proposed method for an m-order ANOVA kernel is O(Dd), which is also larger than that of the TS, O(md log D), because m << d < D in most cases.

While the random Fourier (RF) feature map, which does not have an order parameter m (Z_RF(x) = √(2/D) cos(Πx + b), where Π ∈ R^{D×d} and b ∈ R^D), also takes O(Dd) time and O(Dd) memory, methods have recently been proposed that take O(D log d) time and O(D) memory (Feng, Hu, and Liao 2015; Le, Sarlós, and Smola 2013). In this section, we present a faster and more memory-efficient RK feature map for the ANOVA kernel based on these recently proposed methods, especially that of Feng, Hu, and Liao, which takes O(mD log d) time and O(D) memory.

First, we explain the signed circulant random feature (SCRF) map (Feng, Hu, and Liao 2015). The O(Dd) time complexity of the RF feature map is caused by the computation of Πx. The SCRF reduces it to O(D log d) time without losing the key property of the RF feature map, namely approximating the shift-invariant kernel. In the SCRF, it is assumed without loss of generality that D is divisible by d (T := D/d), and Π is replaced by the concatenation of T projection matrices: Π̃ = (P^(1); P^(2); ···; P^(T)). Each P^(t) ∈ R^{d×d}, t ∈ [T], is called a signed circulant random matrix, a variant of the circulant matrix: P^(t) = diag(σ_t) circ(ω_t), where σ_t ∈ {−1, +1}^d is a Rademacher vector, ω_t ∈ R^d is a random vector generated from an appropriate distribution (e.g., the Gaussian distribution for the radial basis function kernel), and circ(ω_t) ∈ R^{d×d} is the circulant matrix whose first column is ω_t. This formulation clearly reduces the memory required for the RF feature map from O(Dd) to O(2Td) = O(2D). Moreover, the product of Π̃ and x can, perhaps surprisingly, be computed with a fast Fourier transform, an inverse fast Fourier transform, and element-wise products of vectors, which means that the time complexity can be reduced from O(Dd) to O(D log d).

Unfortunately, it is difficult to apply the SCRF technique to the RK feature map directly because computing the itemset kernel does not, in general, require the product of a random projection matrix and a feature vector. Fortunately, the ANOVA kernel, which is a special case of the itemset kernel, can be computed efficiently (Blondel et al. 2016b) by using the recursion

    K_A^m(ω, x) = (1/m) Σ_{t=1}^{m} (−1)^{t+1} K_A^{m−t}(ω, x) ⟨ω^{∘t}, x^{∘t}⟩,    (12)

where x^{∘p} represents the p-times element-wise product of x. Hence, the RK feature map for the ANOVA kernel can be written in matrix form:

    Z_RK(x) = (1 / (m√D)) Σ_{t=1}^{m} (−1)^{t+1} a^{m−t} ∘ (Ω^{∘t} x^{∘t}),    (13)

where Ω := (ω_1^⊤; ···; ω_D^⊤) ∈ R^{D×d} is the matrix whose rows are the random vectors of the RK map, and a^t := (K_A^t(ω_1, x), ..., K_A^t(ω_D, x))^⊤ ∈ R^D is the vector of t-order ANOVA kernels (clearly, a^t can be regarded as an RK feature of the t-order ANOVA kernel). Although computing Ω^{∘t} in Equation (13) seems costly, it is actually trivial when each random vector ω_s, s ∈ [D], is generated from a Rademacher distribution: in this case, Ω^{∘t} = Ω if t is odd; otherwise, it is the all-ones matrix. Therefore, the SCRF technique can be applied to the RK feature map for the ANOVA kernel. Doing this reduces the computational cost of Ω^{∘t} x^{∘t} from O(Dd) to O(D log d) and thus that of the RK feature map for an m-order ANOVA kernel from O(mDd) time and O(Dd) memory to O(mD log d) time and O(D) memory. We call a random kernel feature map with the signed circulant random feature a signed circulant random kernel (SCRK) feature map.

Although the original SCRF for the RF feature map introduces σ to obtain a low-variance estimator for the shift-invariant kernel, σ is unfortunately meaningless in the proposed RK feature map for the m-order ANOVA kernel when m is even because K_A^m(−ω, x) = K_A^m(ω, x). Therefore, the SCRK feature map for an even-order ANOVA kernel may not be effective.
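Below is a minimal NumPy/SciPy sketch of the SCRK computation for the ANOVA kernel (our own illustration under the Rademacher assumption, not the authors' implementation). The D × d projection matrix is never materialized: each circulant block is applied with the FFT, and the recursion (12) assembles the ANOVA-kernel features. The explicit construction at the end only serves to confirm the equivalence on a small example.

```python
import numpy as np
from scipy.linalg import circulant

def scrk_anova_features(x, omegas, sigmas, m):
    """SCRK features for the m-order ANOVA kernel, following Eqs. (12)-(13).

    omegas, sigmas: (T, d) Rademacher arrays; the implicit projection matrix has
    D = T * d rows, where row (t*d + i) is sigmas[t, i] times a cyclic permutation
    of omegas[t]. Circulant products are computed with the FFT.
    """
    T, d = omegas.shape
    D = T * d
    f_omegas = np.fft.fft(omegas, axis=1)            # FFTs of the circulant generators

    def projected_power(t):
        """Omega^{.t} x^{.t} in R^D; for even t every entry of Omega^{.t} is 1."""
        if t % 2 == 0:
            return np.full(D, np.sum(x ** t))
        conv = np.fft.ifft(f_omegas * np.fft.fft(x ** t), axis=1).real  # circ(w_t) @ x^{.t}
        return (sigmas * conv).ravel()

    powers = [None] + [projected_power(t) for t in range(1, m + 1)]
    a = [np.ones(D)]                                  # a[t][s] = K_A^t(omega_s, x), via Eq. (12)
    for t in range(1, m + 1):
        a.append(sum((-1) ** (r + 1) * a[t - r] * powers[r] for r in range(1, t + 1)) / t)
    return a[m] / np.sqrt(D)

def anova_dp(w, x, m):                                # reference O(dm) ANOVA kernel, for checking
    a = np.zeros(m + 1); a[0] = 1.0
    for z in w * x:
        for t in range(m, 0, -1):
            a[t] += z * a[t - 1]
    return a[m]

rng = np.random.default_rng(0)
d, T, m = 8, 4, 3
x = rng.normal(size=d)
omegas = rng.choice([-1.0, 1.0], size=(T, d))
sigmas = rng.choice([-1.0, 1.0], size=(T, d))
z = scrk_anova_features(x, omegas, sigmas, m)

# explicit check: materialize the signed circulant rows and compare
rows = np.vstack([np.diag(s) @ circulant(w) for w, s in zip(omegas, sigmas)])
z_ref = np.array([anova_dp(w, x, m) for w in rows]) / np.sqrt(T * d)
print(np.allclose(z, z_ref))                          # True
```

Note how, in the Rademacher case, the even-order projected powers collapse to the constant Σ_j x_j^t, which is exactly the simplification that makes the signed circulant trick applicable here.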

Table 1: Absolute errors of RK feature maps for second-order ANOVA kernel, third-order ANOVA kernel, and all-subsets kernel using different distributions for Movielens 100K dataset.

(a) Second-order ANOVA kernel
Method            D = 2d              D = 4d              D = 8d              D = 16d
RK (Rademacher)   6.53e-4 ± 3.86e-5   4.62e-4 ± 2.19e-5   3.29e-4 ± 1.26e-5   2.33e-4 ± 1.02e-5
RK (Gaussian)     7.31e-4 ± 6.82e-5   5.22e-4 ± 3.71e-5   3.73e-4 ± 1.83e-5   2.62e-4 ± 1.06e-5
RK (Uniform)      6.85e-4 ± 4.96e-5   4.92e-4 ± 2.90e-5   3.50e-4 ± 1.68e-5   2.47e-4 ± 1.05e-5
RK (Laplace)      8.29e-4 ± 1.36e-4   6.16e-4 ± 8.30e-5   4.39e-4 ± 4.03e-5   3.11e-4 ± 2.00e-5
SCRK              7.22e-4 ± 2.13e-4   5.01e-4 ± 9.74e-5   3.60e-4 ± 8.46e-5   2.54e-4 ± 4.34e-5

(b) Third-order ANOVA kernel
Method            D = 2d              D = 4d              D = 8d              D = 16d
RK (Rademacher)   2.26e-5 ± 1.74e-6   1.64e-5 ± 8.69e-7   1.17e-5 ± 4.70e-7   8.35e-6 ± 2.29e-7
RK (Gaussian)     2.67e-5 ± 3.89e-6   1.97e-5 ± 2.35e-6   1.45e-5 ± 1.17e-6   1.05e-5 ± 6.06e-7
RK (Uniform)      2.40e-5 ± 2.58e-6   1.77e-5 ± 1.46e-6   1.30e-5 ± 8.25e-7   9.27e-6 ± 3.93e-7
RK (Laplace)      3.09e-5 ± 8.56e-6   2.44e-5 ± 5.08e-6   1.80e-5 ± 3.01e-6   1.31e-5 ± 1.54e-6
SCRK              2.29e-5 ± 4.93e-6   1.65e-5 ± 2.28e-6   1.19e-5 ± 1.50e-6   8.40e-6 ± 6.73e-7

(c) All-subsets kernel
Method            D = 2d              D = 4d              D = 8d              D = 16d
RK (Rademacher)   4.24e-2 ± 1.14e-2   2.94e-2 ± 7.07e-3   2.01e-2 ± 4.79e-3   1.49e-2 ± 4.94e-3
RK (Gaussian)     4.25e-2 ± 1.23e-2   3.07e-2 ± 8.23e-3   2.12e-2 ± 5.29e-3   1.54e-2 ± 4.79e-3
RK (Uniform)      4.32e-2 ± 1.11e-2   2.96e-2 ± 7.61e-3   1.99e-2 ± 5.05e-3   1.45e-2 ± 3.93e-3
RK (Laplace)      4.15e-2 ± 1.04e-2   2.89e-2 ± 7.34e-3   2.00e-2 ± 5.12e-3   1.49e-2 ± 4.17e-3

5 Relationship between FMs and RK Feature Map for the ANOVA Kernel

The equation for linear models using the RK feature map for the second-order ANOVA kernel, Z_RK(x), is

    f_LM(Z_RK(x); w̃) = (1/√D) Σ_{s=1}^{D} w̃_s K_A^2(ω_s, x),    (14)

where w̃ ∈ R^D is the weight vector for the RK feature map Z_RK(x). Hence, linear models using the RK feature map can be regarded as FMs with λ = w̃/√D, with λ as the only learnable parameter and without the linear term. Therefore, theoretical results that guarantee the generalization error of linear models using the RK map can be applied to the theoretical analysis of that of FMs. We leave this to future work. The same relationship holds between linear models using the RK feature map for the all-subsets kernel and the all-subsets model. Interestingly, it also holds between linear models using the RM feature map for the polynomial kernel and multi-convex PNs, which are multi-convex formulations of PNs (Blondel et al. 2016b).
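To spell out the identification claimed above (a short check using only Equations (5) and (14); the restricted FM is our illustration): take an FM of rank k = D, drop the linear term (w = 0), freeze the factors at the Rademacher vectors (p_s = ω_s, that is, P = Ω^⊤), and tie λ_s = w̃_s/√D. Then

    f_FM(x; 0, Ω^⊤, w̃/√D) = Σ_{s=1}^{D} (w̃_s/√D) K_A^2(ω_s, x) = f_LM(Z_RK(x); w̃),

so fitting the linear model optimizes only w̃ while the factors ω_s stay fixed and random, which is exactly the restricted FM described in the paragraph above.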

6 Evaluation

We first evaluated the accuracy of our proposed RK feature map on the Movielens 100K dataset (Harper and Konstan 2016), which is a dataset for recommender systems. The age, living area, gender, and occupation of users and the genre and release year of items were used as features, in the same way as Blondel et al. (Blondel et al. 2016a). The dimension of the feature vectors was 78. We calculated the absolute error of the approximation of the ANOVA kernels (m = 2 or 3) and the all-subsets kernel on the training datasets. Each feature vector was normalized by its L1 norm. Only 10,000 instances were used. We calculated the mean absolute errors for these instances over 100 trials using Rademacher, Gaussian, uniform, and Laplace distributions in the RK feature maps and compared the results. For the ANOVA kernels, we also compared them with the SCRK feature map. We varied the dimension of the random features: 2, 4, 8, and 16 times that of the original feature vectors. We used SciPy (Jones, Oliphant, and Peterson 2001) implementations of the FFT and IFFT (scipy.fftpack) in the SCRK and TS feature maps.

As shown in Table 1, the RK feature map with the Rademacher distribution had the lowest absolute error and variance for the second- and third-order ANOVA kernels. In contrast, the differences in absolute error between the distributions were small for the all-subsets kernel. The variances were large even for D = 16d, so the RK feature map for the all-subsets kernel requires a larger D. For the third-order ANOVA kernel, the performance of the SCRK feature map was as good as that of the RK feature map with the Rademacher distribution. However, for the second-order ANOVA kernel, the SCRK feature map did not perform as well. As described above, the SCRK feature map is not effective when the order m is even because σ is meaningless.

We next evaluated the effectiveness of the SCRK feature map, which is more time and memory efficient than the RK one w.r.t. the dimension of the original feature vector. We created synthetic data with various dimensions of the original features and compared the mapping times of the SCRK and RK feature maps for the second-order ANOVA kernel. We used N(0, 1) as the distribution of the original features and changed the dimension of the original features: d = 512, 1024, 2048, and 4096. We set D = 8092 for all d. As shown in Figure 1, when the dimension of the original feature vector d was large, the SCRK feature map was more efficient. Although the running time of the RK feature map increased linearly w.r.t. d, that of the SCRK feature map increased logarithmically. However, when d = 512, the RK feature map was faster than the SCRK feature map. This may be for the following reasons. First, the difference between d and log d is small when d is small. Furthermore, the SCRK feature map requires the FFT and IFFT, and hence its dropped constants in big-O notation are larger than those of the RK feature map.

Figure 1: Comparison of mapping times of RK and SCRK feature maps for the second-order ANOVA kernel with different dimensions of the original feature vector on a synthetic dataset (d is shown in log scale).

We next evaluated the performance of linear models using our proposed RK/SCRK feature maps on the Movielens 100K dataset. We converted the recommender system problem to a binary classification problem by binarizing the original ratings (from 1 to 5) using 5 as a threshold. There were 21,200 training, 1,000 validation, and 20,202 testing examples. We normalized each feature vector and varied the dimension of the random features in a manner similar to that used in the first evaluation. We compared the accuracies and the learning and testing times of linear SVMs using the proposed RK feature map for the ANOVA/all-subsets kernel, linear SVMs using the SCRK feature map for the ANOVA kernel, non-linear SVMs with the ANOVA/all-subsets kernel, m-order FMs, and the all-subsets model. Although there was a linear term in the original FMs, we ignored it because using it or not had little effect on accuracy. All the methods have a regularization hyperparameter, which we set on the basis of the validation accuracy of the non-linear SVMs. For the linear SVMs using random feature maps, we ran ten trials with a different random seed for each trial and calculated the mean of the values. We used a Rademacher distribution for the random vectors. For the FMs and the all-subsets model, we also ran ten trials and calculated the mean of the values. We used coordinate descent (Blondel et al. 2016a) as the optimization method. Because this optimization requires many iterations and much time, we ran the optimization process for the same length of time used for the non-linear SVMs. For the rank hyperparameter, we followed Blondel et al. (Blondel et al. 2016a) and set it to 30. We used LinearSVC and SVC in scikit-learn (Pedregosa et al. 2011) as the implementations of linear SVMs and non-linear SVMs; LinearSVC used liblinear (Fan et al. 2008), and SVC used libsvm (Chang and Lin 2011). For the implementation of FMs, we used FactorizationMachineClassifier in polylearn (Niculae 2016).

As shown in Figure 2, when the number of random features was D = 1,248 = 16d, the accuracies of the linear SVMs using the proposed RK feature map were as good as those of the non-linear SVMs, FMs, and all-subsets model. Furthermore, even though D = 1,248, their training and testing times were 2–5 times less than those of the non-linear SVMs, FMs, and all-subsets model. Because the dimension of the original feature vector was small, the running times of the linear SVMs using the SCRK feature map were longer than those of the linear SVMs using the RK feature map when m = 3. The accuracies of the linear SVMs using the SCRK feature map were as good as those of the linear SVMs using the RK feature map, and the SCRK feature map required only O(D log d) time.

We also compared the accuracies and the learning and testing times among random-feature-based methods for polynomial-like kernels: linear SVMs using the proposed RK/SCRK feature map for the ANOVA kernel, and linear SVMs using the TS and RM feature maps for the polynomial kernel. Similar to the evaluation above, we set the regularization parameter on the basis of the validation accuracy of the non-linear SVMs (we also ran the polynomial kernel SVMs). We again ran ten trials with a different random seed for each trial and calculated the mean of the values.

As shown in Figure 3, when the number of random features D was small, the accuracies of the linear SVMs using the TS/RM feature map were better than those of the linear SVMs using the RK feature map. However, for larger D, the accuracies of the linear SVMs using the RK feature map were as good as those of the linear SVMs using the TS feature map. The linear SVMs using the RM feature map achieved the best accuracy, but their running times were clearly longer than those of the other methods. Moreover, the RM feature map is not memory efficient: it requires O(Ddm) memory for the m-order polynomial kernel, while the proposed RK/SCRK feature map for an m-order ANOVA kernel requires only O(Dd)/O(D) memory. The training and testing times of the linear SVMs using the RK feature map were the lowest among all methods.
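For reference, a rough sketch of this kind of pipeline (our own reconstruction on synthetic stand-in data, not the authors' experiment code; the paper's runs used the Movielens features, tuned the regularization parameter on validation accuracy, and averaged over ten seeds):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def rk_features_anova2(X, omegas):
    """RK features for the second-order ANOVA kernel, vectorized over samples.

    For any omega, K_A^2(omega, x) = (<omega, x>^2 - <omega^2, x^2>) / 2,
    which is the m = 2 case of recursion (12).
    """
    D = omegas.shape[0]
    P = X @ omegas.T                        # dot products, shape (n, D)
    Q = (X ** 2) @ (omegas ** 2).T          # <omega^2, x^2>; equals ||x||^2 for Rademacher omega
    return (P ** 2 - Q) / (2.0 * np.sqrt(D))

# Synthetic stand-in data (the real experiment used Movielens 100K features, d = 78).
rng = np.random.default_rng(0)
n, d, D = 2000, 78, 16 * 78
X = rng.normal(size=(n, d))
X /= np.abs(X).sum(axis=1, keepdims=True)   # L1-normalize each feature vector, as in the experiments
y = (X[:, :10].sum(axis=1) + 0.1 * rng.normal(size=n) > 0).astype(int)  # arbitrary labels

omegas = rng.choice([-1.0, 1.0], size=(D, d))
Z = rk_features_anova2(X, omegas)
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)
clf = LinearSVC(C=1.0, max_iter=5000).fit(Z_tr, y_tr)   # C would be tuned on a validation split
print("test accuracy:", clf.score(Z_te, y_te))
```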

0.64 0.64 0.64

0.62 0.62 0.62

0.60 0.60 0.60 Linear SVM with RK Linear SVM with RK Test accuracy Test accuracy Test accuracy 0.58 Linear SVM with SCRK 0.58 Linear SVM with SCRK 0.58 Linear SVM with RK ANOVA Kernel SVM (m=2) ANOVA Kernel SVM (m=3) All-subsets Kernel SVM 0.56 FM (m=2) 0.56 FM (m=3) 0.56 All-subsets

200 400 600 800 1000 1200 200 400 600 800 1000 1200 200 400 600 800 1000 1200 Number of random features (D) Number of random features (D) Number of random features (D)

102 102 102

101 101 101 Time (s) Time (s) Time (s)

100 100 100

10−1 10−1 10−1 200 400 600 800 1000 1200 200 400 600 800 1000 1200 200 400 600 800 1000 1200 Number of random features (D) Number of random features (D) Number of random features (D)

(a) Second-order ANOVA kernel (b) Third-order ANOVA kernel (c) All-subsets kernel

Figure 2: Test accuracies and times for linear SVM using RK feature map approximating (a) second-order ANOVA kernel, (b) third-order ANOVA kernel, and (c) all-subsets kernel and for two existing methods for Movielens 100K dataset. Upper graphs show test accuracies; lower ones show training and test times (time is shown in log scale).

Figure 3: Test accuracies and times for linear SVM using RK/SCRK feature map approximating second-order ANOVA kernel and linear SVM using RM/TS feature map approximating second-order polynomial kernel for Movielens 100K dataset. Left graph shows test accuracies; right one shows training and test times (time is shown in log scale).

We also evaluated the performance of the linear models using the RK/SCRK feature maps and the existing models on the phishing and IJCNN datasets (Mohammad, Thabtah, and McCluskey 2012; Prokhorov 2001). The experimental results were similar to those for the Movielens 100K dataset.

7 Conclusion

We presented a random feature map that approximates the itemset kernel. Although the itemset kernel depends on S, the error bound we presented does not depend on it or on the original dimension d. Moreover, we showed that the proposed random kernel feature map can be used not only with the Rademacher distribution but also with other distributions with zero mean and unit variance. Furthermore, we showed that the Rademacher distribution achieves the minimax optimal variance both theoretically and empirically. We also showed how to efficiently compute the random kernel feature map for the ANOVA kernel by using a signed circulant matrix projection technique. Our evaluation showed that linear models using the proposed random kernel feature map are good alternatives to factorization machines and kernel methods for several classification tasks.

Acknowledgement

This work was supported by Global Station for Big Data and Cybersecurity, a project of Global Institution for Collaborative Research and Education at Hokkaido University. Kyohei Atarashi acknowledges support from JSPS KAKENHI Grant Number JP15H05711. Subhransu Maji acknowledges support from NSF via the CAREER grant (IIS 1749833).

References

Blondel, M.; Fujino, A.; Ueda, N.; and Ishihata, M. 2016a. Higher-order factorization machines. In Advances in Neural Information Processing Systems (NIPS), 3351-3359.
Blondel, M.; Ishihata, M.; Fujino, A.; and Ueda, N. 2016b. Polynomial networks and factorization machines: New insights and efficient training algorithms. In International Conference on Machine Learning (ICML).
Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3):27.
Cucker, F., and Smale, S. 2002. On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39(1):1-49.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871-1874.
Feng, C.; Hu, Q.; and Liao, S. 2015. Random feature mapping with signed circulant matrix projection. In International Joint Conference on Artificial Intelligence (IJCAI), 3490-3496.
Fukui, A.; Park, D. H.; Yang, D.; Rohrbach, A.; Darrell, T.; and Rohrbach, M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding.
Harper, F. M., and Konstan, J. A. 2016. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5(4):19.
Jones, E.; Oliphant, T.; and Peterson, P. 2001. SciPy: Open source scientific tools for Python.
Kar, P., and Karnick, H. 2012. Random feature maps for dot product kernels. In Artificial Intelligence and Statistics (AISTATS), 583-591.
Le, Q.; Sarlós, T.; and Smola, A. 2013. Fastfood: Approximating kernel expansions in loglinear time. In International Conference on Machine Learning (ICML).
Lin, T.-Y.; RoyChowdhury, A.; and Maji, S. 2015. Bilinear CNN models for fine-grained visual recognition. In International Conference on Computer Vision (ICCV).
Livni, R.; Shalev-Shwartz, S.; and Shamir, O. 2014. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems (NIPS), 855-863.
Mohammad, R. M.; Thabtah, F.; and McCluskey, L. 2012. An assessment of features related to phishing websites using an automated technique. In International Conference for Internet Technology and Secured Transactions (ICITST), 492-497.
Niculae, V. 2016. A library for factorization machines and polynomial networks for classification and regression in Python. https://github.com/scikit-learn-contrib/polylearn/.
Novikov, A.; Trofimov, M.; and Oseledets, I. 2016. Exponential machines. International Conference on Learning Representations Workshop.
Pagh, R. 2013. Compressed matrix multiplication. ACM Transactions on Computation Theory 5(3):9.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825-2830.
Pham, N., and Pagh, R. 2013. Fast and scalable polynomial kernels via explicit feature maps. In International Conference on Knowledge Discovery and Data Mining (KDD), 239-247.
Prokhorov, D. 2001. IJCNN 2001 neural network competition. Slide presentation in IJCNN.
Rahimi, A., and Recht, B. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 1177-1184.
Rendle, S. 2010. Factorization machines. In International Conference on Data Mining (ICDM), 995-1000.
Rendle, S. 2012. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology 3(3):57.
Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge, UK: Cambridge University Press.