Online Compact Convexified Factorization Machine

Wenpeng Zhang†, Xiao Lin†, Peilin Zhao‡

†Tsinghua University, ‡South China University of Technology [email protected],[email protected],[email protected]

ABSTRACT

Factorization Machine (FM) is a supervised learning approach with a powerful capability of feature engineering. It yields state-of-the-art performance in various batch learning tasks where all the training data is made available prior to the training. However, in real-world applications where the data arrives sequentially in a streaming manner, the high cost of re-training with batch learning algorithms has posed formidable challenges in the online learning scenario. The initial challenge is that no prior formulations of FM could fulfill the requirements in Online Convex Optimization (OCO) – the paramount framework for online learning algorithm design. To address this challenge, we invent a new convexification scheme leading to a Compact Convexified FM (CCFM) that seamlessly meets the requirements in OCO. However, for learning Compact Convexified FM (CCFM) in the online learning setting, most existing algorithms suffer from expensive projection operations. To address this subsequent challenge, we follow the general projection-free algorithmic framework of Online Conditional Gradient and propose an Online Compact Convex Factorization Machine (OCCFM) algorithm that eschews the projection operation with efficient linear optimization steps. In support of the proposed OCCFM in terms of its theoretical foundation, we prove that the developed algorithm achieves a sub-linear regret bound. To evaluate the empirical performance of OCCFM, we conduct extensive experiments on 6 real-world datasets for online recommendation and binary classification tasks. The experimental results show that OCCFM outperforms the state-of-the-art online learning algorithms.

1 INTRODUCTION

Factorization Machine (FM) [23] is a generic approach for supervised learning. It provides an efficient mechanism for feature engineering, capturing first-order information of each input feature as well as second-order pairwise feature interactions with a low-rank matrix in a factorized form. FM achieves state-of-the-art performances in various applications, including recommendation [22, 24], computational advertising [16], search ranking [20] and toxicogenomics prediction [33], and so on. As such, FM has recently regained significant attention from researchers [4, 5, 17, 31] due to its increasing popularity in industrial applications and competitions [16, 38].

Despite the overwhelming research on Factorization Machine, the majority of existing studies are conducted in the batch learning setting where all the training data is available before training. However, in many real-world scenarios, like online recommendation and online advertising [21, 29], the training data arrives sequentially in a streaming fashion. If batch learning algorithms are applied in accordance with the streams in such scenarios, the models have to be re-trained each time new data arrives. Since these data streams usually arrive at large scale and are changing constantly in a real-time manner, the incurred high re-training cost makes batch learning algorithms impractical in such settings. This creates an impending need for an efficient online learning algorithm for Factorization Machine. Moreover, it is expected that the online learning algorithm has a theoretical guarantee for its performance.

In the current paper, we aim to develop such an algorithm based on the paramount framework – Online Convex Optimization (OCO) [12, 26] – in the online learning setting. Unfortunately, extant formulations are unable to fulfill the two fundamental requirements demanded in OCO: (i) any instance of all the parameters should be represented as a single point from a convex compact decision set; and (ii) the loss incurred by the prediction should be formulated as a convex function over the decision set. Indeed, in most existing formulations for FM [4, 8, 16, 20, 23], the loss functions are non-convex with respect to the factorized feature interaction matrix, thus violating requirement (ii). Further, although some studies have proposed formulations for convex FM which rectify the non-convexity problem [3, 33], they still treat the feature weight vector and the feature interaction matrix as separated parameters, thus violating requirement (i) in OCO.

To address these problems, we propose a new convexification scheme for FM. Specifically, we rewrite the global bias, the feature weight vector and the feature interaction matrix into a compact augmented symmetric matrix and restrict the augmented matrix with a nuclear norm bound, which is a convex surrogate of the low-rank constraint [6]. Therefore the augmented matrices form a convex compact decision set which is essentially a symmetric bounded nuclear norm ball. Then we rewrite the prediction of FM into a convex linear function with respect to the augmented matrix, thus the loss incurred by the prediction is convex. Based on the convexification scheme, the resulting formulation of Compact Convexified FM (CCFM) can seamlessly meet the aforementioned requirements of the OCO framework.

Yet, when we investigate various online learning algorithms for Compact Convexified Factorization Machine within the OCO framework, we find that most existing online learning algorithms involve a projection step in every iteration. When the decision set is a bounded nuclear norm ball, the projection amounts to the computationally expensive Singular Value Decomposition (SVD), which consequently limits the applicability of most online learning algorithms. Notably, one exceptional algorithm is Online Conditional Gradient (OCG), which eschews the projection operation by a linear optimization step. When OCG is applied to the nuclear norm ball, the linear optimization amounts to the computation of the maximal singular vectors of a matrix, which is much simpler. However, it remains theoretically unknown whether an algorithm similar to OCG still exists for the specific subset of a bounded nuclear norm ball.

In response, we propose an online learning algorithm for CCFM, which is named as Online Compact Convexified Factorization Machine (OCCFM). We prove that when the decision set is a symmetric nuclear norm ball, the linear optimization needed for an OCG-alike algorithm still exists, i.e. it amounts to the computation of the maximal singular vectors of a specific symmetric matrix. Based on this finding, we propose an OCG-alike online learning algorithm for OCCFM. As OCCFM is a variant of OCG, we prove that the theoretical analysis of OCG still fits for OCCFM, which achieves a sub-linear regret bound in the order of O(T^{3/4}). Further, we conduct extensive experiments on real-world datasets to evaluate the empirical performance of OCCFM. As shown in the experimental results on both online recommendation and online binary classification tasks, OCCFM outperforms the state-of-the-art online learning algorithms in terms of both efficiency and prediction accuracy.

The contributions of this work are summarized as follows:

(1) To the best of our knowledge, we are the first to propose the Online Compact Convexified Factorization Machine with theoretical guarantees, which is a new variant of FM in the online learning setting.
(2) The proposed formulation of CCFM is superior to prior research in that it seamlessly fits into the online learning framework. Moreover, this formulation can be used in not only the online setting, but also batch and stochastic settings.
(3) We propose a routine for the linear optimization on the decision set of CCFM based on the Online Conditional Gradient algorithm, leading to an OCG-alike online learning algorithm for CCFM. Moreover, this finding also applies to any other online learning task whose decision set is the symmetric bounded nuclear norm ball.
(4) We evaluate the performance of the proposed OCCFM on both online recommendation and online binary classification tasks, showing that OCCFM outperforms state-of-the-art online learning algorithms.

We believe our work sheds light on the Factorization Machine research, especially for online Factorization Machine. Due to the wide application of FM, our work is of both theoretical and practical significance.

2 PRELIMINARIES

In this section, we first review more details of factorization machine, and then provide an introduction to the online convex optimization framework and the online conditional gradient algorithm, both of which we utilize to design our online learning algorithm for factorization machine.

2.1 Factorization Machine

Factorization Machine (FM) [23] is a general supervised learning approach working with any real-valued feature vector. The most important characteristic of FM is the capability of powerful feature engineering. In addition to the usual first-order information of each input feature, it captures the second-order pairwise feature interaction in modeling, which leads to stronger expressiveness than linear models.

Given an input feature vector x ∈ R^d, vanilla FM makes the prediction ŷ ∈ R with the following formula:

    ŷ(x, ω_0, ω, V) = ω_0 + ω^T x + Σ_{i=1}^{d} Σ_{j=i+1}^{d} (VV^T)_{ij} x_i x_j,

where ω_0 ∈ R is the bias term, ω ∈ R^d is the first-order feature weight vector and V ∈ R^{d×k} is the second-order factorized feature interaction matrix; (VV^T)_{ij} is the entry in the i-th row and j-th column of the matrix VV^T, and k ≪ d is the hyper-parameter determining the rank of V. As shown in previous studies [3, 33], ŷ is non-convex with respect to V.

The non-convexity in vanilla FM will result in some problems in practice, such as local minima and instability of convergence. In order to overcome these problems, two formulations for convex FM [3, 33] have been proposed, where the feature interaction is directly modeled as Z ∈ R^{d×d} rather than VV^T in the factorized form which induces non-convexity. The matrix Z is then imposed upon a bounded nuclear norm constraint to maintain the low-rank property. In general, convex FM allows for more general modeling of feature interaction and is quite effective in practice. The difference between the two formulations is whether the diagonal entries of Z are utilized in prediction. For clarity, we refer to the formulation in [3] as Convex FM (1) and that in [33] as Convex FM (2).

Note that we propose a new convexification scheme for FM in this work, which is inherently different from the above two formulations for convex FM. The details of our convexification scheme and its comparison with convex FMs will be presented in Section 3.1.
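To make the vanilla FM prediction above concrete, here is a minimal NumPy sketch of our own (not the authors' code); the dimensions d and k and the random parameters are placeholders. Rendle [23] shows the double sum can also be computed in O(kd) time, but the direct form below stays closest to the formula.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Vanilla FM: y = w0 + w^T x + sum_{i<j} (V V^T)_{ij} x_i x_j."""
    W = V @ V.T                       # factorized interaction matrix, d x d
    d = len(x)
    pairwise = 0.0
    for i in range(d):
        for j in range(i + 1, d):     # only distinct feature pairs i < j
            pairwise += W[i, j] * x[i] * x[j]
    return w0 + w @ x + pairwise

# toy example with placeholder sizes
d, k = 5, 2
rng = np.random.default_rng(0)
x = rng.normal(size=d)
w0, w, V = 0.1, rng.normal(size=d), rng.normal(size=(d, k))
print(fm_predict(x, w0, w, V))
```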
2.2 Online Convex Optimization

Online Convex Optimization (OCO) [12, 26] is the paramount framework for designing online learning algorithms. It can be seen as a structured repeated game between a learner and an adversary. At each round t ∈ {1, 2, ··· , T}, the learner is required to generate a decision point x_t from a convex compact set Q ⊆ R^n. Then the adversary replies to the learner's decision with a convex loss function f_t : Q → R and the learner suffers the loss f_t(x_t). The goal of the learner is to generate a sequence of decisions {x_t | t = 1, 2, ··· , T} so that the regret with respect to the best fixed decision in hindsight,

    regret_T = Σ_{t=1}^{T} f_t(x_t) − min_{x*∈Q} Σ_{t=1}^{T} f_t(x*),

is sub-linear in T, i.e. lim_{T→∞} (1/T) regret_T = 0. The sub-linearity implies that when T is large enough, the learner can perform as well as the best fixed decision in hindsight.

Based on the OCO framework, many online learning algorithms have been proposed and successfully applied in various applications [10, 19, 36]. The two most popular representatives of them are Online Gradient Descent (OGD) [39] and Follow-The-Regularized-Leader (FTRL) [21].
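As a concrete (illustrative, not from the paper) instance of this protocol, the sketch below runs Online Gradient Descent on squared losses over a simple Euclidean-ball decision set; the projection step it relies on is exactly the operation that becomes expensive when Q is a nuclear norm ball, which motivates the projection-free alternative introduced next.

```python
import numpy as np

def project_to_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}; cheap for this simple Q."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(loss_and_grad, n, T, radius=1.0):
    """Online Gradient Descent: x_{t+1} = Proj_Q(x_t - eta_t * grad f_t(x_t))."""
    x = np.zeros(n)
    total_loss = 0.0
    for t in range(1, T + 1):
        f_t, g_t = loss_and_grad(t, x)     # adversary reveals f_t after we play x_t
        total_loss += f_t
        eta_t = 1.0 / np.sqrt(t)           # standard O(1/sqrt(t)) step size
        x = project_to_ball(x - eta_t * g_t, radius)
    return x, total_loss

# toy adversary: squared loss against a fixed hidden target
rng = np.random.default_rng(1)
target = rng.normal(size=3)
def loss_and_grad(t, x):
    return 0.5 * np.sum((x - target) ** 2), (x - target)

print(ogd(loss_and_grad, n=3, T=100))
```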
2.3 Online Conditional Gradient

Online Conditional Gradient (OCG) [12] is a projection-free online learning algorithm that eschews the possibly computationally expensive projection operation needed in its counterparts, including OGD and FTRL. It enjoys a great computational advantage over other online learning algorithms when the decision set is a bounded nuclear norm ball. As will be shown in Section 3, we utilize a similar decision set in our proposed convexification scheme, thus we use OCG as the cornerstone in our algorithm design. In the following, we introduce more details about it.

In practice, to ensure that the newly generated decision points lie inside the decision set of interest, most online learning algorithms invoke a projection operation in every iteration. For example, in OGD, when the gradient descent step generates an infeasible iterate that lies out of the decision set, we have to project it back to regain feasibility. In general, this kind of projection amounts to solving a quadratic convex program over the decision set and will not cause much problem. However, when the decision sets are of specific types, such as the bounded nuclear norm ball, the set of all semi-definite matrices and polytopes [12, 13], it turns out to amount to very expensive algebraic operations. To avoid these expensive operations, the projection-free online learning algorithm OCG has been proposed. It is much more efficient since it eschews the projection operation by using a linear optimization step instead in every iteration. For example, when the decision set is a bounded nuclear norm ball, the projection operation needed in other online learning algorithms amounts to computing a full singular value decomposition (SVD) of a matrix, while the linear optimization step in OCG amounts to only computing the maximal singular vectors, which is at least one order of magnitude simpler. Recently, a decentralized distributed variant of the OCG algorithm has been proposed in [35], which allows for high efficiency in handling large-scale machine learning problems.

Algorithm 1 Online Conditional Gradient (OCG)
Input: convex set Q, maximum round number T, parameters η and {γ_t} (usually γ_t is set to 1/t^{1/2})
1: Initialize x_1 ∈ Q
2: for t = 1, 2, . . . , T do
3:   Play x_t and observe f_t
4:   Let F_t(x) = η Σ_{τ=1}^{t−1} ⟨∇f_τ(x_τ), x⟩ + ∥x − x_1∥²_2
5:   Compute v_t = argmin_{x∈Q} ⟨x, ∇F_t(x_t)⟩
6:   Set x_{t+1} = (1 − γ_t) x_t + γ_t v_t
7: end for

3 ONLINE COMPACT CONVEXIFIED FACTORIZATION MACHINE

In this section, we turn to the development of our Online Compact Convexified Factorization Machine. As introduced before, OCO is the paramount framework for designing online learning algorithms. There are two fundamental requirements in it: (i) any instance of all the model parameters should be represented as a single point from a convex compact decision set; (ii) the loss incurred by the prediction should be formulated as a convex function over the decision set.

Vanilla FM can not directly fit into the OCO framework due to the following two reasons. First, vanilla FM contains both the unconstrained feature weight vector ω (ω_0 can be written into ω) and the factorized feature interaction matrix V; they are separated components of parameters which can not be formulated as a single point from a decision set. Second, as the prediction ŷ in vanilla FM is non-convex with respect to V [3, 32], when we plug it into the loss function f at each round t, the resulting loss incurred by the prediction is also non-convex with respect to V. Although some previous studies have proposed two formulations for convex FM, they cannot fit into the OCO framework either. Similar to vanilla FM, the two convex formulations treat the unconstrained feature weight vector ω and the feature interaction matrix Z as separated components of parameters, which violates requirement (i).

In order to meet the requirements of the OCO framework, we first invent a new convexification scheme for FM which results in a formulation named as Compact Convexified FM (CCFM). Then based on the new formulation, we design an online learning algorithm – Online Compact Convexified Factorization Machine (OCCFM), which is a new variant of the OCG algorithm tailored to our setting.
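To illustrate why the linear step in Algorithm 1 is cheap on a nuclear norm ball, the following sketch (our own illustration, under the assumption that Q = {X : ∥X∥_tr ≤ δ}) solves argmin_{X∈Q} ⟨X, G⟩ using only the top singular vector pair of −G; here a full SVD is used for brevity, whereas a power or Lanczos method would compute only that top pair.

```python
import numpy as np

def linear_step_nuclear_ball(G, delta):
    """argmin_{||X||_tr <= delta} <X, G> = -delta * u1 v1^T,
    where (u1, v1) are the top singular vectors of G."""
    U, s, Vt = np.linalg.svd(-G)        # only the leading pair is actually needed
    u1, v1 = U[:, 0], Vt[0, :]
    return delta * np.outer(u1, v1)     # rank-one vertex of the nuclear norm ball

# toy check: the minimizer has nuclear norm delta and a negative inner product with G
rng = np.random.default_rng(2)
G = rng.normal(size=(4, 4))
X = linear_step_nuclear_ball(G, delta=2.0)
print(np.linalg.norm(X, ord='nuc'), np.sum(X * G))
```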
3.1 Compact Convexified Factorization Machine

The main idea of our proposed convexification scheme is to put all the parameters, including the bias term, the first-order feature weight vector and the second-order feature interaction matrix, into a single augmented matrix which is then enforced to be low-rank. As the low-rank property is not a convex constraint, we approximate it with a bounded nuclear norm constraint, which is a common practice in the machine learning community [3, 13, 26]. The details of the scheme are given in the following.

Recall that in vanilla FM, ω_0, ω and V refer to the bias term, the linear feature weight vector and the factorized feature interaction matrix respectively. Due to the non-convexity of the prediction ŷ with respect to V, we adopt a symmetric matrix Z ∈ R^{d×d} to model the pairwise feature interaction, which is a popular practice in designing convex variants of FM [3, 33]. Then we rewrite the bias term ω_0, the first-order feature weight vector ω and the second-order feature interaction matrix Z into a single compact augmented matrix C:

    C = [ Z     ω    ]
        [ ω^T   2ω_0 ],

where Z = Z^T ∈ R^{d×d}, ω ∈ R^d, ω_0 ∈ R.

In order to achieve a model with low complexity, we restrict the augmented matrix C to be low-rank. However, the constraint over the rank of a matrix is non-convex and does not fit into the OCO framework. A typical approximation of the rank of a matrix C is its nuclear norm ∥C∥_tr [6]: ∥C∥_tr = tr(√(C^T C)) = Σ_{i=1}^{d} σ_i, where σ_i is the i-th singular value of C. As the singular values are non-negative, the nuclear norm is essentially a convex surrogate of the rank of the matrix [3, 13, 26]. Therefore it is standard to consider a relaxation that replaces the rank constraint by the bounded nuclear norm ∥C∥_tr ≤ δ.

Denote S^{d×d} as the set of symmetric matrices: S^{d×d} = {X | X ∈ R^{d×d}, X = X^T}. The resulting decision set of the augmented matrices C can be written as the following formulation:

    K = { C | C = [ Z, ω ; ω^T, 2ω_0 ], ∥C∥_tr ≤ δ, Z ∈ S^{d×d}, ω ∈ R^d, ω_0 ∈ R }.

As the set K is bounded and closed, it is also compact. Next we prove that the decision set K is convex in Lemma 1.
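The construction above is easy to state in code. The following sketch (illustrative only; the concrete numbers are placeholders) builds the augmented matrix C from (ω_0, ω, Z), forms x_hat = [x^T, 1]^T, and checks that 0.5 * x_hat^T C x_hat recovers the bias, linear and pairwise terms, together with the nuclear norm that the decision set K bounds.

```python
import numpy as np

def build_augmented(Z, w, w0):
    """Stack (Z, w, 2*w0) into the compact symmetric matrix C of CCFM."""
    d = len(w)
    C = np.zeros((d + 1, d + 1))
    C[:d, :d] = Z                 # symmetric feature-interaction block
    C[:d, d] = w                  # first-order weights in the last column/row
    C[d, :d] = w
    C[d, d] = 2.0 * w0            # bias stored as 2*w0 so 0.5*x_hat^T C x_hat recovers w0
    return C

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
Z = (A + A.T) / 2                 # any symmetric interaction matrix
w, w0, x = rng.normal(size=d), 0.3, rng.normal(size=d)

C = build_augmented(Z, w, w0)
x_hat = np.append(x, 1.0)
pred_compact = 0.5 * x_hat @ C @ x_hat
pred_direct = w0 + w @ x + 0.5 * x @ Z @ x
print(np.allclose(pred_compact, pred_direct))   # True
print(np.linalg.norm(C, ord='nuc'))             # the quantity constrained by delta in K
```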
The Therefore, yˆ(αC1 + (1 − α)C2) = yˆ(αC1 + (1 − α)C2), by definition, bounded nuclear norm ball is a typical convex set of matrices [6], yˆ(C) is a convex linear function with respect with C ∈ K. □ which is widely adopted in the machine learning community [12, 26]. The detailed proof can be found in [6]. Next, we show that given the above prediction function yˆ(C) : Following the convexity preserving property under the intersec- K → R and any convex loss function f (y) : R → R, the nested tion of convex sets, we have K = K˜ ∩ B is also convex. □ function f (yˆ(C)) = f (C) is also convex, as shown in Lemma 4: 1 T T T d As the decision set K consists of augmented matrices with lemma 4. Let yˆ(C) = 2xˆ Cxˆ, C ∈ K, xˆ = [x , 1], x ∈ R , bounded nuclear norm, by the properties of block matrix, we show and f (y) : R → R be arbitrary convex function, the nested function that the decision set K is equivalent to a symmetric nuclear norm f (yˆ(C)) = f (C) : K → R is also convex with respect to C ∈ K. ball in Lemma 2. Proof. Let C1 ∈ K and C2 ∈ K be any two points in the set  Z ω  K, and f (y) be an arbitrary convex function, by definition of yˆ, lemma 2. The sets K = {C|C = , ∥C∥ ≤ δ, Z ∈ ωT 2ω tr α ∈ [0, 1], we obtain 0 ∀ d×d d ′ (d+1)×(d+1) S , ω ∈ R } and K = {C|C ∈ S , ∥C∥tr ≤ δ } are f (yˆ(αC1 + (1 − α)C2)) = f (αyˆ(C1) + (1 − α)yˆ(C2)). ′ equivalent: K = K . By the convexity of f (y), we obtain The proof is straight-forward by writing an arbitrary symmetric f (αyˆ(C1) + (1 − α)yˆ(C2)) ≤ α f (yˆ(C1)) + (1 − α)f (yˆ(C2)). matrix into a block form, the details are omitted here. Therefore f (αC1+(1−α)C2) ≤ α f (C1)+(1−α)f (C2). By definition, By choosing a point C from the compact decision set K, the the nested function f (yˆ(C)) = f (C) is also convex. □ prediction yˆ(C) of an instance x is formulated as: 1 In summary, by introducing the new convexification scheme, yˆ(C ) = xˆT Cxˆ, 2 we obtain a new formulation for FM, in which the decision set is  x  convex compact and the nested loss function is convex. We refer to where C ∈ K, xˆ = . Although the prediction function yˆ has the resulting formulation as . 1 Compact Convexified FM (CCFM) The comparison between vanilla FM, convex FM and the pro- a similar form with the feature interaction component 1xT Zx of 2 posed CCFM is illustrated in Table 1. vanilla FM, they are intrinsically different. Plugging the formulation The differences  Z ω   x  Comparison between vanilla FM and CCFM. of C = and xˆ = into the prediction function, between vanilla FM and CCFM are three-folded: (i) the primal differ- ωT 2ω 1 0 ence between vanilla FM and CCFM is the convexity of prediction we have 1 1 function: the prediction function of vanilla FM is non-convex while yˆ(C ) = xˆT C xˆ = ω + ωT x + x T Z x . 2 0 2 it is convex in CCFM. (ii) vanilla FM factorizes the feature inter- Therefore the prediction function yˆ(C) contains all the three com- action matrix Z into VV T , thus restricting Z to be symmetric and ponents in vanilla FM, including the global bias, the first-order positive semi-definite; while in CCFM, there is no specific restriction feature weight and the second-order feature interactions. Most on Z except symmetry. (iii) vanilla FM only considers the inter- importantly, yˆ(C) is a convex function in C, as shown in Lemma. 3. 
actions between distinct features: xixj , i, j ∈ {1,...,d}, i , j, ∀ 4 2 Table 1: Comparison between different FM formulations Algorithm 2 Online Compact Convexified Factorization Ma- chine (OCCFM) Properties Convex Compact All-Pairs Online (d+1)×(d+1) Input: Convex set K = {C | ∥C ∥tr ≤ δ, C ∈ S }, Maximum FM \ \ \ \ 3 √ √ round number T , parameters η and {γt } . CFM [3] \ \ √ 1: Initialize C ∈ K CFM [33] \ \ \ 1 √ √ √ √ 2: t = 1, 2,..., T CCFM for do 3: Get input (xt , yt ) and compute ft (Ct ) Ít−1 2 4: Let Ft (C ) = η τ =1 ⟨∇fτ (Cτ ), C ⟩ + ∥C − C1 ∥2 −∇ ( ) while CCFM models the interactions between all possible feature 5: Compute the maximal singular vectors of Ft Ct : u and v ˆ ⟨ ∇ ( )⟩ ˆ T ∈ { } 6: Solve Ct = argminC ∈K C, Ft Ct with Ct = δuv pairs: xixj , i, j 1,...,d . By rewriting the prediction function ˆ ∀ 7: Set Ct+1 = (1 − γt )Ct + γt Ct y = ⟨C, xˆxˆT −diaд(xˆ ◦ xˆ)⟩ ◦ as ˆ where is the element-wise product, 8: end for we can easily leave out the interactions between same features without changing the theoretical guarantees in CCFM. Combining (ii) and (iii), we find that CCFM allows for general modeling of ball, we first propose the online learning algorithm for CCFM (OC- feature interactions, which improves its expressiveness. CFM), which is essentially an OCG variant. Then we provide the Comparison between convex FM and CCFM. Although con- theoretical analysis for OCCFM in details. vex FM involves convex prediction functions, they are still inher- 3.2.1 Algorithm. As introduced in the Preliminaries, it is a typi- ently different from CCFM: (i) in convex FM, the first-order feature cal practice to apply OCG algorithm on the bounded nuclear norm weight vector and second-order feature interaction matrix are sep- ball in the OCO framework. This typical example is related to CCFM, arated, resulting in a non-compact formulation. As pointed out but is also inherently different. On one hand, the decision setof in [3, 32], the separated formulation makes it difficult to jointly CCFM is a subset of the bounded nuclear norm ball. Thus it is learn the two components in the training process, which can cause tempting to apply the subroutine of OCG over the bounded nuclear inconsistent convergence rates; But in CCFM, all the parameters norm ball on the decision set of CCFM; on the other hand, the are written into a compact augmented matrix. (ii) in CCFM, we decision set of CCFM is also a subset of symmetric matrices, and restrict that the compact augmented matrix C is low-rank; while it remains theoretically unknown whether the subroutine can be convex FM formulations only require the second-order matrix Z applied to the symmetric bounded nuclear norm ball. Therefore, it to be low-rank, leaving the first-order feature weight vector ω un- is a non-trivial problem to design an OCG-alike algorithm for the bounded. (iii) convex FM can not fit into the OCO framework easily CCFM due to its specific decision set. due to its non-compact decision set while CCFM seamlessly fits the To avoid the expensive projection onto the decision set of CCFM, OCO framework. we follow the core idea of OCG to replace the projection step with 3.2 Online Learning Algorithm for Compact a linear optimization problem over the decision set. What remains to be done is to design an efficient subroutine that solves the linear Convexified Factorization Machine optimization over this set in low computational complexity. 
Recall With the convexification scheme mentioned above, we have gota the procedure of OCG in Algorithm 1, the validity of this subroutine convex compact set and a convex loss function in the new formula- depends on the following two important requirements: tion of Compact Convexified Factorization Machine (CCFM), which • First, the subroutine should solve the linear optimization allows us to design the online learning algorithm following the over the decision set with low computational complexity; in OCO framework [26]. As shown in the Preliminaries, the decision our case, the subroutine should generate the optimal solution set is an important issue that affects the computational complexity Cˆt to the problem Cˆt = argmin ∈K ⟨C, ∇Ft (Ct )⟩ with linear of the projection step onto the set. In CCFM, the decision set is C or lower computational complexity, where Ct is the iterate a subset of the bounded nuclear norm ball, where the projection of C at round t, K is the decision set of CCFM, ∇Ft (Ct ) is the step involves singular value decomposition and takes super-linear sum of gradients of the loss function incurred until round t. time via our best known methods [12]. Although Online Gradient • Second, the subroutine should be closed over the convex Descent (OGD) and Follow-The-Regularized-Leader (FTRL) are the decision set; in our case, the augmented matrix Ct should be two most classical online learning algorithms, they suffer from the inside the decision set K throughout the algorithm: Ct+1 = expensive SVD operation in the projection step, thus do not work (1−γt )Ct +γtCˆt ∈ K, t = 1,...,T , where Cˆt is the output well on this specific decision set. ∀ of the subroutine and γt is the step-size in round t. Meanwhile, the nuclear norm ball is a typical decision set where Considering these two requirements, we propose a subroutine of the expensive projection step can be replaced by the efficient linear the linear optimization over the symmetric bounded nuclear norm optimization subroutine in the projection-free OCG algorithm. As ball, based on which we build the online learning algorithm for the decision set of CCFM is a subset of the bounded nuclear norm the Compact Convexified Factorization Machine. Specifically, we 2 In Table 1, FM, CFM, CCFM refer to vanilla FM , Convex FM and Compact Convex prove that in each round, the linear optimization over the decision FM respectively. The term "Convex" indicates whether the prediction function is convex; the term "Compact" indicates whether the feasible set of the formulation is set in CCFM is also equivalent to the computation of maximal compact; the term "All-Pairs" indicates whether all the pair-wise feature interactions singular vectors of a specific symmetric matrix. This subroutine are involved in the formulation; the term "Online" indicates whether the formulation 3 1 fits the OCO framework for Online learning. Similarly, γt is set to t 1/2 5 can be solved in linear time via the power method, which validates As V is an orthogonal matrix, V ΣΣT V T is an eigenvalue decom- its efficiency. position of D, where ΣΣT is the diagonal matrix of eigenvalues of The procedure is detailed in Algorithm 2 where u and v are the D, and vi , i ∈ {1,...,d} is the eigenvector of D. ∀ T 2 left and right maximal singular vectors of −∇Ft (Ct ) respectively. Since C is symmetric, D = C C = C . 
For any eigenvalue λ and The projection step is replaced with the subroutine at line 5 − 6 eigenvector q of C: Cq = λq, we have in the algorithm and the detailed analysis of the algorithm will be Dq = CT Cq = CT (Cq) = λCT q = λCq = λ2q. presented later. Thus the square of every eigenvalue λ of C is also the eigenvalue 3.2.2 Theoretical Analysis. The Online Compact Convexified λ2 of D and every eigenvector q of C is also the eigenvector v of D. Factorization Machine (OCCFM) algorithm is essentially a variant By rearranging the eigenvalues of C so that |λˆi |, i = 1,...,d of OCG algorithm, which preserves the similar process. First, we is non-increasing, we have σi = |λˆi |. By rewriting the eigenvalue show that OCCFM satisfies the aforementioned two requirements, Íd ˆ T Íd ˆ T Íd T decomposition C as i=1 λiqˆiqˆi = i=1 |λi |pˆiqˆi = i=1 σiuivi , then we prove the regret bound of OCCFM following the OCO where pˆi = sдn(λˆi )qˆi , we have pˆi = ui , qˆi = vi , i = 1,...,d. framework. Therefore u vT = sдn(λ )qˆ qˆT , i = {1,...,d∀} is a symmetric To prove that OCCFM satisfies the aforementioned two require- i i i i i matrix. ∀ ments, we present the theoretical guarantee in Theorem 1 and □ Theorem 2 respectively. With these preparations, we prove Theorem 1. theorem 1. In Algorithm 2, the subroutine of linear optimization amounts to the computation of maximal singular vectors: Given Ct ∈ . First, recall that for anyC ∈ S(d+1)×(d+1), ( )×( ) Proof of Theorem 1 K = {C| ∥C∥ ≤ δ, C ∈ S d+1 d+1 },Cˆ = argmin⟨C, ∇F (C )⟩ = Íd+1 T tr t t t the SVD of C is C = U ΣV = σiuiv . Following the conclu- C ∈K i=1 i T sion in Lemma 5, we have |ui | = |vi |, i ∈ {1,...,d + 1}, we δu1v , whereu1 andv1 are the maximal singular vectors of −∇Ft (Ct ). 1 rewrite the decision set K with the SVD of∀ C: theorem 2. In Algorithm 2, the subroutine is closed over the de- d+1 cision set K, i.e. the augmented matrix Ct generated at each itera- Õ K ={C |C = U ΣV T = σ u vT , |u | = |v |, i, tion t is inside the decision set K: Ct ∈ K, t = 1,...,T , where i i i i i i=1 ∀ K = {C|∥C∥ ≤ δ, C ∈ S(d+1)×(d+1)} ∀ tr . d+1 Õ C ∈ S(d+1)×(d+1), σ ≤ δ, U , V ∈ R(d+1)×(d+1) }. Before we proceed with the proof of Theorem 1 and Theorem i ∀ i=1 2, we provide a brief introduction to the Singular Value Decom- Recall that in Algorithm 2, the points are generated from the position (SVD) of a matrix, which is frequently used in the proofs decision set K : C , C ∈ K, we have of the theorems. For any matrix C ∈ Rm×n, the Singular Value t 1 T t−1 Decomposition is a factorization of the form C = U ΣV , where η Õ ( )×( ) ∇F (C ) = f (C )xˆ xˆT + 2(C − C ) ∈ S d+1 d+1 . m×K n×K t t 2 τ τ t t t 1 U ∈ R and V ∈ R , (K = min{m, n}) are the orthogonal τ =1 ∈ RK×K matrices, and Σ is a diagonal matrix. The diagonal entries (d+1)×(d+1) Denote −∇Ft (Ct ) as Ht ∈ S and the SVD of Ht as σi of Σ are non-negative real numbers and known as singular val- Íd+1 T , ∈ { ,..., } ues of C. Conventionally, the singular values are in a permutation Ht = i=1 θi µiνi , where θi i 1 d + 1 are the singular , ∀ of non-increasing order: σ ≥ σ ≥,..., ≥ σ . values of Ht and µi νi are the left and right singular vectors 1 2 K respectively, the subroutine in Algorithm 2 solves the following Next, we prove Lemma 5, which is an important part for the proof linear optimization problem: of both theorems. In this lemma, we show that the outer product of the left and right maximal singular vectors of a symmetric matrix min . 
⟨C, ∇Ft (Ct )⟩ (d+1)×(d+1) (as part of the subroutine in Algorithm 2) is also a symmetric matrix. s .t . C ∈ S , ∥C ∥tr ≤ δ . × ⟨ ∇ ( )⟩ ⟨ − ⟩ lemma 5. Let C ∈ Sd d be an arbitrary real symmetric matrix, The objective function can be rewritten as C, Ft Ct = C, Ht . T and the linear optimization problem can be rewritten as: and C = U ΣV be the singular value decomposition of C, u1, v1 ∈ d d+1 R be the left and right maximal singular vectors of C respectively, Õ argmin⟨C, ∇F (C )⟩ = argmax⟨ σ u vT , H ⟩. then the matrix u vT is also symmetric: u vT ∈ Sd×d . t t i i i t 1 1 1 1 C ∈K C ∈K i=1 Using the invariance of trace of scalar, we have Proof. The proof is based on the connection between the sin- d+1 d+1 gular value decomposition and eigenvalue decomposition of the Õ T Õ T argmax⟨ σi ui vi , Ht ⟩ = argmax σi ui Ht vi real symmetric matrix. Denote the singular value decomposition C ∈K i=1 C ∈K i=1 d×d T Íd T of C ∈ S as: C = U ΣV = σiuiv and its eigenvalue d+1 i=1 i Õ T Íd T ≤ argmax σ ξ θ , ξ ∈ {−1, 1}. decomposition as: C = QΛQ = j=1 λjqjqj . i i π (i) i C ∈K i=1 ∀ First, we show that the eigenvalues of CCT are the square of the singular values of C: where θπ (i), i is a permutation of the singular values of matrix Ht , and π is the arbitrary∀ permutation over [d + 1]. The last inequality D = CT C = V ΣU T U ΣT V T = V ΣΣT V T . can be attained following Lemma 6. 6 Based on the non-negativity of singular values and the rearrange- framework, we prove that the regret of Algorithm 2 after T rounds ment inequality, we have: is sub-linear in T , as shown in Theorem 3. d+1 d+1 d+1 d+1 Õ Õ Õ Õ theorem 3. The Online Compact Convexified Factorization Ma- σ ξ θ ≤ σ θ ≤ σ θ ≤ σ · θ = ∥C ∥ θ . i i π (i) i π (i) i i i 1 tr 1 η = D γ = 1 i=1 i=1 i=1 i=1 chine with parameters 4GT 3/4 , t t 1/2 , attains the following 4 where θ1 is the maximal singular value of Ht = ∇Ft (Ct ). guarantee : (d+1)×(d+1) Recall that Ht = −∇Ft (Ct ) ∈ S , following Lemma T T Õ Õ ∗ 3/4 1/4 T (d+1)×(d+1) reдretT = ft (Ct ) − min ft (C ) ≤ 12DGT + DGT , 5, we have µiνi ∈ S , i. Therefore the equality can C ∗∈K ∀ t =1 t=1 be attained when ui = µi , vi = νi , i ∈ {1,...,d + 1}, and ∀ D G K σ1 = δ, σi = 0, i = 2, ··· ,d + 1. where , represent an upper bound on the diameter of and an ∀ Íd+1 T T f (C) K In this case, Cˆt = ui σiv = δµ1ν . Following Lemma 5, upper bound on the norm of the sub-gradients of t over , i.e. i=1 i 1 ∥∇ ( )∥ ≤ , ∈ K ˆ T ∈ S(d+1)×(d+1) ∥ ˆ ∥ Íd f C G C . Ct = δµ1ν1 and Ct tr = i=1 σi = δ. ∀ Íd+1 T ∈ K ˆ T As a result, i=1 ui σivi and thus we have C = δµ1ν1 = The proof of this theorem largely follows from that in [12] with argmin⟨C, ∇Ft (Ct )⟩, which indicates that the linear optimization some variations. Due to the page limit, we omit it here. C ∈K K over the decision set amounts to the computation of the maximal 4 EXPERIMENTS singular vectors of the symmetric matrix Ht . □ In this section, we evaluate the performance of the proposed online lemma 6. Let A ∈ Rn×n and д : Rn → R be a twice-differentiable compact convexified factorization machine on two popular machine convex function, and σi (A), i be the singular values of matrix A. For learning tasks: online rating prediction for recommendation and ∀ n any two orthonormal bases {u1,...,un } and {v1,...,vn } of R , online binary classification. 
there exists a permutation π over [n] and ξ1,..., ξn ∈ {−1, 1} such that: T T T 4.1 Experimental Setup д(u1 Av1, u2 Av2,..., un Avn ) ≤ Compared Algorithms We compare the empirical performance д(ξ1σπ ( )(A), ξ2σπ ( )(A),..., ξn σπ (n)(A)). 1 2 of OCCFM with state-of-the-art variants of FM in the online learn- The proof can be found in [2] and the details are omitted here. ing setting. As the previous studies on FM focus on batch learning Now we are ready to prove the Theorem 2. settings, we construct the baselines by applying online learning approaches to the existing formulations of FM, which is a common Proof of Theorem 2. The proof is conducted with the math- experimental methodology in online learning research [19, 37]. ematical induction method. In the experiments, the comparison between OCCFM and other ∈ K When t = 1, recall the initialization of Algorithm 2, C1 , algorithms is focused on the aspects of formulation and online thus the proposition stands when t = 1. learning algorithms respectively. In existing research, two formula- Now assuming the proposition stands for t > 1: C ∈ K, t ∈ t tions of FM are related to OCCFM, i.e. the non-convex formulation {2,...,T }, we prove that C ∈ K. According to the assumption,∀ T +1 of vanilla FM [23] and the non-compact formulation of Convex FM the augmented matrix is inside the decision set: CT ∈ K, thus (d+1)×(d+1) [33]. Meanwhile the most popular online learning algorithms are ∥CT ∥tr ≤ δ and CT ∈ S . Following the formulation of CCFM and Algorithm 2, we have Online Gradient Descent (OGD) [39], Passive-Aggressive (PA) [9] and Follow-The-Regularized-Leader (FTRL) [21] which achieves the T η Õ ∇F (C ) = f (C )xˆ xˆT + 2(C − C ) ∈ S(d+1)×(d+1), best empirical performance among them in most cases. To illustrate T T 2 t t t t T 1 t=1 the comparison between different formulations, we apply the state- Based on Lemma 5 and Theorem 1, we have of-art FTRL algorithm on vanilla FM, CFM and the proposed CCFM, which are denoted as FM-FTRL, CFM-FTRL and CCFM-FTRL re- ˆ ⟨ , ∇ ( )⟩ vT ∈ S(d+1)×(d+1), CT = argmin C FT CT = δu1 1 spectively. To illustrate the comparison between different online C ∈K learning algorithms, we apply OGD, PA and FTRL on the proposed where u and v are the maximal singular vectors of −∇F (C ). 1 1 T T CCFM, which are denoted as CCFM-OGD, CCFM-PA, CCFM-FTRL Moreover ∥Cˆ ∥ = ∥δu vT ∥ = δ. Therefore Cˆ ∈ K. By the T tr 1 1 tr T respectively. To summarize, the compared algorithms are: convexity of decision set K, we obtain • FM-FTRL: vanilla FM with FTRL algorithm (a similar approach ˆ CT +1 = γ CT + (1 − γ )CT ∈ K. is proposed in [28] for batch learning); Therefore the induction stands when t = T + 1. • CFM-FTRL: Convex FM [33] with FTRL algorithm; In summary, the induction stands for both t = 1 and t ∈ 2,...,T , • CCFM-OGD: Compact Convexified FM with OGD algorithm; + T > 1, which indicates Ct ∈ K, t ∈ Z . • CCFM-PA: Compact Convexified FM with PA algorithm; ∀ ∀ □ • CCFM-FTRL: Compact Convexified FM with FTRL algorithm; • OCCFM: Online Compact Convexified FM algorithm. With Theorem 1 and Theorem 2, we prove that the aforemen- Datasets We select different datasets for the tasks respectively. tioned two requirements are satisfied. Thus the subroutine in Online For the Online Recommendation tasks, we use the typical Movielens Compact Convexified Factorization Machine is a valid conditional gradient step in OCG algorithm, making OCCFM a valid OCG 4We have reivised the minor mistakes made in the original proof and thus give a variant. 
Following the theoretical analysis of OCG in the OCO slightly different bound here. 7 datasets, including Movielens-100K, Movielens-1M and Movielens- 10M 5 ; for the Online Binary Classification tasks, we select datasets from LibSVM 6, including IJCNN1, Spam and Epsilon . The statistics of the datasets are summarized in Table 2. Table 2: Statistical Details of the Datasets

Datasets #Features #Instances Label Movielens-100K 2,625 100,000 Numerical Movielens-1M 9,940 1,000,209 Numerical Movielens-10M 82,248 10,000,000 Numerical Figure 1: Comparison of the efficiency of CCFM-OGD, IJCNN1 22 141,691 Binary CCFM-PA, CCFM-FTRL and OCCFM on Movielens 100K and Spam 252 350,000 Binary Movielens 1M Epsilon 2,000 100,000 Binary Table 3: RMSE on Movielens-100K, Movielens-1M and For each dataset, we conduct the experiments with five-fold cross- Movielens-10M datasets in Online Rating Prediction of Rec- validation. In our experiments, the training instances are randomly ommendation tasks 7 permutated and fed one by one to the model sequentially. Upon the arrival of each instance, the model makes the prediction and RMSE on Datasets is updated after the label is revealed. The experiment is conducted Algorithms Movielens100K Movielens1M Movielens10M with 20 runs of different random permutations for the training data. The results are reported with the averaging performance over these FM-FTRL 1.2781 1.1074 1.0237 runs. CFM-FTRL 1.1036 1.0552 1.0078 Evaluation Metrics To evaluate the performances on both tasks CCFM-OGD 1.2721 1.0645 1.0291 properly, we select different metrics respectively: the Root Mean CCFM-PA 1.2164 1.0570 1.0142 † † † Square Error (RMSE) for the rating prediction tasks; and the Error CCFM-FTRL 1.0873 1.0441 0.9725 Rate and AUC (Area Under Curve) for the binary classification OCCFM 1.0359* 0.9702* 0.9441* tasks, since AUC is known to be immune to the class imbalance problems. We list the RMSE of OCCFM and other compared algorithms in 4.2 Online Recommendation Table. 3. From our observation, OCCFM achieves higher prediction In the online Recommendation task, at each round, the model re- accuracy than the other online learning baselines. Since FM-FTRL, ceives a pair of user ID and item ID sequentially and then predicts CFM-FTRL, CCFM-FTRL use the same online learning algorithm the value of the incoming rating correspondingly. Denote the in- with different formulations of FM, the comparison between them stance arriving at round t as (ut , it , yt ), where ut , it and yt repre- illustrates the advantage of Compact Convexified FM. Meanwhile, sent the user ID, item ID and the rating given by user ut to item it , CCFM-OGD, CCFM-PA, CCFM-FTRL and OCCFM adopt the same T the input feature vector xt is constructed like this: formulation of Compact Convexified FM, the comparison between them shows the effectiveness of the conditional gradient step in OCCFM algorithm. We also measure the efficiency of CCFM-OGD, CCFM-PA, CCFM- where |U | and |I | refer to the number of users and the number of FTRL and OCCFM algorithms on different datasets and see how fast items respectively. Upon the arrival of each instance, the model the average losses decrease with the running time, which is shown 1 T T T in Fig. 1. From the results we can clearly observe that OCCFM runs predicts the rating with yˆt = 2xˆt Ct xˆt , where xˆt = [xt , 1] . The rating prediction task is essentially a regression problem, thus the significantly faster than CCFM-OGD, CCFM-PA and CCFM-FTRL, convex loss function incurred at each round is the squared loss: which illustrates the necessity and efficiency of using the linear 2 optimization instead of the projection step. ft (Ct ) = ∥yˆt (Ct ) − yt ∥2 . 4.3 Online Binary Classification The nuclear norm bounds in the CCFM and CFM are set to 10, 10 and 20 for Movielens-100K, 1M and 10M respectively. 
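As a concrete illustration of the squared-loss round described in the rating-prediction task above (a sketch of our own; the dimensions and values are placeholders, and a full OCCFM run would feed this gradient into Algorithm 2), the following computes the prediction ŷ_t, the loss f_t(C_t) and its gradient with respect to the augmented matrix C_t:

```python
import numpy as np

def squared_loss_and_grad(C, x, y):
    """One rating-prediction round: predict 0.5 * x_hat^T C x_hat,
    then return the squared loss and its gradient w.r.t. C."""
    x_hat = np.append(x, 1.0)
    y_pred = 0.5 * x_hat @ C @ x_hat
    loss = (y_pred - y) ** 2
    # d/dC (0.5 * x_hat^T C x_hat) = 0.5 * x_hat x_hat^T, so by the chain rule:
    grad = (y_pred - y) * np.outer(x_hat, x_hat)
    return loss, grad

# toy rating instance with placeholder dimensions
rng = np.random.default_rng(4)
d = 6
C = np.zeros((d + 1, d + 1))
x, y = rng.normal(size=d), 4.0
print(squared_loss_and_grad(C, x, y))
```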
For FM, the rank In the online binary classification tasks, the instances are denoted as (xt , yt ) t, where xt is the input feature vector and yt ∈ {−1, +1} parameters are set to 10 on all the datasets. These hyper-parameters ∀ are selected with a grid search. For OGD, PA and FTRL algorithms, is the class label. At round t, the model predicts the label with siдn(y ) = siдn( 1xˆT C xˆ ) xˆ = [xT , ]T the at round t is set to √1 as what their corresponding ˆt 2 t t t , where t t 1 . The loss function t is a logistic loss function with respect to Ct : theories suggest [9, 26, 39]. In experiments on ML-1M and ML-10M, we randomly sample 1000 users and 1000 items for simplicity. 7 The results with † mark have passed the significance test with p < 0.01 compared 5 https://grouplens.org/datasets/movielens/ with FM-FTRL and CFM-FTRL; the results with ∗ mark have passed the significance 6 https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/binary.html test with p < 0.01 compared with the other algorithms 8 variants of FM. However, the formulations of Convex FM are not well organized into a compact form which we provide for Compact Convexified FM to meet the requirements of online learning setting. The most significant difference between OCCFM and previous research is that OCCFM is an online machine learning model while most existing variants are batch learning models. Some studies attempt to apply FM in online applications [18][28]. But these studies do not fit in the online learning framework with theoretical guarantees while OCCFM provides a theoretically provable regret bound. Figure 2: Comparison of the efficiency of CCFM-OGD, CCFM-PA, CCFM-FTRL and OCCFM on Spam and Epsilon 5.2 Online Learning Online learning stands for a family of efficient and scalable machine 1 learning algorithms [7, 9, 14, 25, 30]. Unlike conventional batch ft (Ct ) = loд(1 + ). exp(−yt · yˆt (Ct )) learning algorithms, the online learning algorithms are built upon the assumption that the training instances arrive sequentially rather The nuclear norm bounds for CCFM and CFM in this task are set than being available prior to the learning task. to 300, 1000, and 200 for IJCNN1, Spam and Epsilon respectively; Many algorithms have been proposed for online learning, includ- while the ranks for FM are set to 20, 50 and 50 accordingly. All ing the classical Perception algorithm and Passive-Aggressive (PA) the hyper-parameters are selected with a grid search and set to algorithm [9]. In recent years, the design of many efficient online the values with the best performances. For OGD, PA and all FTRL learning algorithms has been influenced by convex optimization algorithms, the learning rate at round t is set to √1 as the theories tools [11, 30]. Some typical algorithms include Online Gradient t suggest [26]. Descent (OGD) and Follow-The-Regularized-Leader (FTRL) [1][27]. However, their further applicability is limited by the expensive pro- Table 4: Error Rate and AUC on IJCNN1, Spam and Epsilon jection operation required when additional norm constraints are datasets in Online Binary Classification Tasks added. Recently, the Online Conditional Gradient (OCG) algorithm [15] has regained a surge of research interest. It eschews the com-

Error Rate on Datasets AUC on Datasets putational expensive projection operation thus is highly efficient Algorithms IJCNN Spam Epsilon IJCNN Spam Epsilon in handling large-scale learning problems. Our proposed OCCFM FM-FTRL 0.0663 0.0867 0.1773 0.9647 0.9070 0.8213 model builds upon the OCG algorithm. For further details, please CFM-FTRL 0.0544 0.0598 0.1654 0.9736 0.9390 0.8319 refer to [15] and [12]. CCFM-OGD 0.0911 0.1076 0.1742 0.9192 0.9239 0.8277 CCFM-PA 0.0711 0.0976 0.1412 0.9312 0.9201 0.8575 CCFM-FTRL 0.0520† 0.0588† 0.1380† 0.9757 0.9398 0.8668† 6 CONCLUSION OCCFM 0.0243* 0.0567* 0.1202* 0.9859* 0.9414* 0.8737* In this paper, we propose an online variant of FM which works in online learning settings. The proposed OCCFM meets the re- The comparison of the prediction accuracy between OCCFM and quirements in OCO and is also more cost effective and accurate other baseline algorithms is presented in Table 4. As shown in the compared to extant state-of-the-art approaches. In the study, we table, OCCFM achieves the lowest error rates and the highest AUC first invent a convexification Compact Convexified FM(CCFM) values among all the online learning approaches, which reveals the based on OCO such that it fulfills the two fundamental require- advantage of OCCFM in prediction accuracy. ments within the OCO framework. Then, we follow the general We also compare the running time of OCCFM and other base- projection-free algorithmic framework of Online Conditional Gradi- line algorithms and present the results in Fig 2. Similar with the ent and propose an Online Compact Convex Factorization Machine observation in recommendation tasks, our proposed OCCFM runs (OCCFM) algorithm that eschews the projection operation with significantly faster than CCFM-OGD and CCFM-FTRL. linear optimization step. In terms of theoretical support, we prove that the algorithm preserves a sub-linear regret, which indicates 5 RELATED WORK that our algorithm can perform as well as the best fixed algorithm in hindsight. Regarding the empirical performance of OCCFM, we 5.1 Factorization Machine conduct extensive experiments on real-world datasets. The experi- The core of FM is to leverage feature interactions for feature aug- mental results on recommendation and binary classification tasks mentation. According to the order of feature interactions, the exist- indicate that OCCFM outperforms the state-of-art online learning ing research can be categorized into two lines: the first category algorithms. [8, 16, 34] focuses on second-order FM which models pair-wise feature interactions for feature engineering; while the second cat- REFERENCES egory [4, 5] attempts to model the interactions between arbitrary [1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. 2008. Competing in the number of features. One line of related research in the first category dark: An efficient algorithm for bandit linear optimization. In In Proceedings of is Convex Factorization Machine [3, 32, 33], which looks for convex the 21st Annual Conference on Learning Theory (COLT). 9 [2] Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, and Yuanzhi Li. 2017. Linear Con- [26] Shai Shalev-Shwartz. 2012. Online Learning and Online Convex Optimization. vergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls. CoRR Found. Trends Mach. Learn. 4, 2 (Feb. 2012), 107–194. abs/1708.02105 (2017). [27] Shai Shalev-Shwartz et al. 2012. Online learning and online convex optimization. [3] Mathieu Blondel, Akinori Fujino, and Naonori Ueda. 2015. 
Convex factorization Foundations and Trends® in Machine Learning 4, 2 (2012), 107–194. machines. In Joint European Conference on Machine Learning and Knowledge [28] Anh-Phuong Ta. 2015. Factorization machines with follow-the-regularized- Discovery in Databases. Springer, 19–35. leader for CTR prediction in display advertising. In Big Data (Big Data), 2015 [4] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. 2016. IEEE International Conference on. IEEE, 2889–2891. Higher-order factorization machines. In Advances in Neural Information Process- [29] Jialei Wang, Steven C.H. Hoi, Peilin Zhao, and Zhi-Yong Liu. 2013. Online Multi- ing Systems. 3351–3359. task Collaborative Filtering for On-the-fly Recommender Systems. In Proceedings [5] Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. 2016. of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, Polynomial networks and factorization machines: new insights and efficient NY, USA, 237–244. https://doi.org/10.1145/2507157.2507176 training algorithms. international conference on machine learning (2016), 850– [30] Jialei Wang, Peilin Zhao, and Steven C. H. Hoi. 2016. Soft Confidence-Weighted 858. Learning. ACM Trans. Intell. Syst. Technol. 8, 1, Article 15 (Sept. 2016), 32 pages. [6] Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge [31] Jianpeng Xu, Kaixiang Lin, Pang Ning Tan, and Jiayu Zhou. 2016. Synergies that university press. Matter: Efficient Interaction Selection via Sparse Factorization Machine. In Siam [7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. 2004. On the Generalization Ability International Conference on . 108–116. of On-Line Learning Algorithms. Information Theory IEEE Transactions on 50, 9 [32] Makoto Yamada, Wenzhao Lian, Amit Goyal, Jianhui Chen, Kishan Wimalawarne, (2004), 2050–2057. Suleiman A Khan, Samuel Kaski, Hiroshi Mamitsuka, and Yi Chang. 2015. Convex [8] Chen Cheng, Fen Xia, Tong Zhang, Irwin King, and Michael R Lyu. 2014. Gradient Factorization Machine for Regression. arXiv preprint arXiv:1507.01073 (2015). boosting factorization machines. In Proceedings of the 8th ACM Conference on [33] Makoto Yamada, Wenzhao Lian, Amit Goyal, Jianhui Chen, Kishan Wimalawarne, Recommender systems. ACM, 265–272. Suleiman A. Khan, Samuel Kaski, Hiroshi Mamitsuka, and Yi Chang. 2017. Convex [9] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Factorization Machine for Toxicogenomics Prediction. In Proceedings of the 23rd Singer. 2006. Online Passive-Aggressive Algorithms. J. Mach. Learn. Res. 7 (Dec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2006), 551–585. (KDD ’17). ACM, New York, NY, USA, 1215–1224. [10] Peter DeMarzo, Ilan Kremer, and Yishay Mansour. 2006. Online Trading Algo- [34] Fajie Yuan, Guibing Guo, Joemon M. Jose, Long Chen, Haitao Yu, and Weinan rithms and Robust Option Pricing. In Proceedings of the Thirty-eighth Annual Zhang. 2017. BoostFM: Boosted Factorization Machines for Top-N Feature- ACM Symposium on Theory of Computing (STOC ’06). ACM, New York, NY, USA, based Recommendation. In Proceedings of the 22Nd International Conference on 477–486. Intelligent User Interfaces (IUI ’17). ACM, New York, NY, USA, 45–54. [11] Mark Dredze, Koby Crammer, and Fernando Pereira. 2008. Confidence-weighted [35] Wenpeng Zhang, Peilin Zhao, Wenwu Zhu, Steven C. H. Hoi, and Tong Zhang. Linear Classification. In Proceedings of the 25th International Conference on Ma- 2017. 
Projection-free Distributed Online Learning in Networks. In Proceedings chine Learning (ICML ’08). ACM, New York, NY, USA, 264–271. of the 34th International Conference on Machine Learning (Proceedings of Ma- [12] Elad Hazan et al. 2016. Introduction to online convex optimization. Foundations chine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, and Trends® in Optimization 2, 3-4 (2016), 157–325. International Convention Centre, Sydney, Australia, 4054–4062. [13] Elad Hazan and Satyen Kale. 2012. Projection-free Online Learning. In Proceedings [36] Peilin Zhao and Steven C.H. Hoi. 2013. Cost-sensitive Online Active Learning of the 29th International Coference on International Conference on Machine Learning with Application to Malicious URL Detection. In Proceedings of the 19th ACM (ICML’12). Omnipress, USA, 1843–1850. SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD [14] Steven C. H. Hoi, Jialei Wang, and Peilin Zhao. 2014. LIBOL: a library for online ’13). ACM, New York, NY, USA, 919–927. learning algorithms. JMLR. 495–499 pages. [37] Peilin Zhao, Steven C. H. Hoi, Rong Jin, and Tianbao Yang. 2011. Online AUC [15] Martin Jaggi. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Opti- Maximization. In Proceedings of the 28th International Conference on International mization. In Proceedings of the 30th International Conference on Machine Learn- Conference on Machine Learning (ICML’11). Omnipress, USA, 233–240. ing (Proceedings of Machine Learning Research), Sanjoy Dasgupta and David [38] Erheng Zhong, Yue Shi, Nathan Liu, and Suju Rajan. 2016. Scaling Factorization McAllester (Eds.), Vol. 28. PMLR, Atlanta, Georgia, USA, 427–435. Machines with Parameter Server. In Proceedings of the 25th ACM International on [16] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. 2017. Field-aware factor- Conference on Information and Knowledge Management. ACM, 1583–1592. ization machines in a real-world online advertising system. In Proceedings of [39] Martin Zinkevich. 2003. Online Convex Programming and Generalized Infinites- the 26th International Conference on World Wide Web Companion. International imal Gradient Ascent. In Proceedings of the Twentieth International Conference on World Wide Web Conferences Steering Committee, 680–688. International Conference on Machine Learning (ICML’03). AAAI Press, 928–935. [17] Xiangnan He Hanwang Zhang Fei Wu Tat-Seng Chua Jun Xiao, Hao Ye. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on , IJCAI-17. 3119–3125. [18] Takuya Kitazawa. 2016. Incremental Factorization Machines for Persistently Cold-starting Online Item Recommendation. arXiv preprint arXiv:1607.02858 (2016). [19] Bin Li and Steven C H Hoi. 2014. Online portfolio selection: A survey. Comput. Surveys 46, 3 (2014), 35. [20] Chun-Ta Lu, Lifang He, Weixiang Shao, Bokai Cao, and Philip S. Yu. 2017. Multi- linear Factorization Machines for Multi-Task Multi-View Learning. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, USA, 701–709. [21] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad Click Prediction: a View from the Trenches. 
In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). [22] Trung V. Nguyen, Alexandros Karatzoglou, and Linas Baltrunas. 2014. Gaussian Process Factorization Machines for Context-aware Recommendations. In Proceed- ings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’14). ACM, New York, NY, USA, 63–72. [23] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000. [24] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. Fast context-aware recommendations with factorization machines. In Pro- ceedings of the 34th international ACM SIGIR conference on Research and develop- ment in Information Retrieval. ACM, 635–644. [25] F Rosenblatt. 1958. The : a probabilistic model for information storage and organization in the brain. Psychological Review 65, 6 (1958), 386.
