
A Unified Algorithm for One-class Structured Matrix Factorization with Side Information

Hsiang-Fu Yu (University of Texas at Austin), Hsin-Yuan Huang (National Taiwan University), Inderjit S. Dhillon (University of Texas at Austin), Chih-Jen Lin (National Taiwan University)

Abstract

In many applications such as recommender systems and multi-label learning, the task is to complete a partially observed binary matrix. Such PU (positive-unlabeled) learning problems can be solved by one-class matrix factorization (MF). In practice, side information such as user or item features in recommender systems is often available besides the observed positive user-item connections. In this work we consider a generalization of one-class MF so that two types of side information are incorporated and a general convex loss function can be used. The resulting optimization problem is very challenging, but we derive an efficient and effective alternating minimization procedure. Experiments on large-scale multi-label learning and one-class recommender systems demonstrate the effectiveness of our proposed approach.

1 Introduction

Many practical applications can be modeled in a positive and unlabeled data learning (PU learning) framework. Between two types of objects, some connections between object $i$ of the first type and object $j$ of the second type are observed, but for most remaining situations, either $i$ and $j$ are not connected or the connection is not observed. A typical example is collaborative filtering with one-class information (Pan et al. 2008; Hu, Koren, and Volinsky 2008; Pan and Scholz 2009; Li et al. 2010; Paquet and Koenigstein 2013). Given $m$ users and $n$ items, we observe part of a 0/1 matrix $Y \in \mathbb{R}^{m \times n}$ with $Y_{ij} = 1$, where $(i,j) \in \Omega^+$. An observed entry indicates that user $i$ likes item $j$. The goal is to know, for any unobserved pair $(i,j) \notin \Omega^+$, whether $i$ likes $j$ or not. Thus, only part of the positive entries are observed, while there are many unknown negative entries. This setting is very different from traditional collaborative filtering, where a real-valued rating (observed or not) is associated with every $Y_{ij}$. Note that the one-class setting may be the more common scenario; for example, most people watch the videos that interest them without leaving any rating information.

Another important application that falls into the PU learning framework is multi-label classification (Kong et al. 2014; Hsieh, Natarajan, and Dhillon 2015). The two types of objects are instances and labels. An entry $Y_{ij} = 1$ indicates that instance $i$ is associated with label $j$. In practice, only part of instance $i$'s labels have been observed.

PU learning is essentially a matrix completion problem. Given $Y_{ij} = 1,\ \forall (i,j) \in \Omega^+$, we would like to predict whether other entries are zero or one. One common approach for matrix completion is matrix factorization (MF), which finds two low-rank latent matrices

$$W = [\ldots, w_i, \ldots]^\top \in \mathbb{R}^{m \times k} \quad \text{and} \quad H = [\ldots, h_j, \ldots]^\top \in \mathbb{R}^{n \times k}$$

such that $Y_{ij} = 1 \approx w_i^\top h_j,\ \forall (i,j) \in \Omega^+$. Note that $k$ is a pre-specified number satisfying $k \ll m$ and $k \ll n$. Unlike traditional matrix completion problems where $Y_{ij},\ \forall (i,j) \in \Omega^+$, have different values, here approximating all the observed positive entries will result in the all-one prediction for all $(i,j) \notin \Omega^+$. For such a one-class scenario, in the approximation process one must treat some $(i,j) \notin \Omega^+$ as negative entries so that $Y_{ij} = 0 \approx w_i^\top h_j$. Therefore, existing one-class MF works solve the following optimization problem:

$$\min_{W,H} \ \sum_{(i,j) \in \Omega} C_{ij}\,(Y_{ij} - w_i^\top h_j)^2 + \sum_i \lambda_i \|w_i\|^2 + \sum_j \bar{\lambda}_j \|h_j\|^2, \qquad (1)$$

where $C_{ij}$ is a cost associated with the loss, and $\lambda_i, \bar{\lambda}_j$ are regularization parameters. The set $\Omega = \Omega^+ \cup \Omega^-$ includes both observed positive and selected negative entries. The rationale of selecting some $(i,j) \notin \Omega^+$ and treating their $Y_{ij} = 0$ is that in general each user likes only a small set of items (or each instance is associated with a small set of labels).
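To make formulation (1) concrete, below is a minimal NumPy sketch (our own illustration, not code from the paper) that evaluates the objective over a set $\Omega$ of observed positives plus a few selected negatives. The function and variable names (one_class_mf_objective, pos, neg, lam_w, lam_h) are hypothetical, and uniform regularization $\lambda_i = \lambda_w$, $\bar{\lambda}_j = \lambda_h$ is assumed for simplicity.

```python
import numpy as np

def one_class_mf_objective(Y, W, H, omega, C, lam_w, lam_h):
    """Evaluate objective (1) over the entries in omega = Omega+ ∪ Omega-.

    Y: m x n 0/1 matrix (only entries listed in omega are used)
    W: m x k, H: n x k latent factor matrices
    omega: list of (i, j) index pairs
    C: m x n matrix of per-entry costs C_ij
    lam_w, lam_h: regularization parameters (uniform over rows here)
    """
    loss = sum(C[i, j] * (Y[i, j] - W[i] @ H[j]) ** 2 for i, j in omega)
    reg = lam_w * np.sum(W ** 2) + lam_h * np.sum(H ** 2)
    return loss + reg

# Toy usage: 4 users, 5 items, rank 2, a few observed positives and an
# equal number of subsampled unobserved entries treated as negatives.
rng = np.random.default_rng(0)
m, n, k = 4, 5, 2
Y = np.zeros((m, n))
Y[0, 1] = Y[2, 3] = Y[3, 0] = 1
pos = [(0, 1), (2, 3), (3, 0)]
neg = [(1, 2), (0, 4), (2, 0)]   # subsampled negatives with Y_ij = 0
C = np.ones((m, n))              # C_ij = 1 for simplicity
W = rng.normal(size=(m, k))
H = rng.normal(size=(n, k))
print(one_class_mf_objective(Y, W, H, pos + neg, C, lam_w=0.1, lam_h=0.1))
```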
The selection of negative entries in $\Omega^-$ is an important issue; the sketch above uses the first of the two rough approaches:

• Subsampled: we subsample some unobserved entries to form $\Omega^-$ with $|\Omega^-| = O(|\Omega^+|)$.
• Full: $\Omega^- = [m] \times [n] \setminus \Omega^+$. That is, every unobserved entry is considered as negative. Unfortunately, the huge size of $\Omega^-$ makes this approach computationally expensive.

Recently, Yu, Bilenko, and Lin (2017) successfully developed efficient optimization methods to solve (1) for the Full approach. They show that the Full approach gives significantly better results than the Subsampled approach. A shortcoming is that their methods work only with the squared loss $(Y_{ij} - w_i^\top h_j)^2$, which is a real-valued regression loss. However, $Y$ is a 0/1 matrix in the one-class setting, so a 0/1 classification loss might be more suitable. Yet, developing efficient methods for the Full approach with a classification loss remains a challenging problem.

In problem (1), $Y_{ij},\ (i,j) \in \Omega^+$, are the only given information. However, for most applications, some "side information" is also available. For example, besides the preference of user $i$ on item $j$, user or item features may also be known. For multi-label learning, a data instance always comes with a feature vector. Further, relationships among users (items) may be available, which can be represented as a graph. How to effectively incorporate side information into the PU learning framework is thus a crucial research issue.

In this paper, we consider a formulation which unifies and generalizes many existing structured matrix-factorization formulations for PU learning with side information. In Section 2, we introduce the formulation and review related works. Our main contribution in Section 3 is to develop an efficient alternating minimization framework for the Full approach with any convex loss function. Experiments in Section 4 consider multi-label classification and recommender systems to illustrate the effectiveness of our approach. Results show a clear performance improvement using a classification loss. A summary showing how we generalize existing works is in Table 1, indicating that for the Full approach of most formulations in this family, our proposed algorithms are the first efficient ones.

Table 1: Various one-class MF formulations supported by our proposed algorithms. SQ-SQ: the square loss is applied to entries in both $\Omega^+$ and $\Omega^-$. SQ-wSQ: the square loss is applied to entries in $\Omega^+$ and the weighted square loss to $\Omega^-$. General-wSQ: a general loss is applied to entries in $\Omega^+$ and the weighted square loss to $\Omega^-$. For the Full approach of most formulations in this family, our proposed algorithms are the first efficient approach with time complexity linear in $O(|\Omega^+|)$. [1]: (Pan and Scholz 2009), [2]: (Yu, Bilenko, and Lin 2017), [3]: (Yu et al. 2014), [4]: (Rao et al. 2015).

                         SQ-SQ        SQ-wSQ           General-wSQ
loss ℓ_ij on Ω+          square       square           general
loss ℓ_ij on Ω−          square       weighted square  weighted square
Standard                 [1],[2]      [1],[2]          this paper
Feature-aware            LEML [3]     this paper       this paper
Graph-structured         GRALS [4]    this paper       this paper
Feature+Graph            this paper   this paper       this paper

A more general version of (1) is the following optimization problem:

$$\min_{W,H} \ f(W,H), \quad \text{where} \quad f(W,H) = \sum_{(i,j) \in \Omega} C_{ij}\,\ell(Y_{ij},\, x_i^\top W h_j) + \lambda_w R(W) + \lambda_h R(H). \qquad (2)$$

In (2), $\lambda_w$, $\lambda_h$, and $\lambda_g$ are regularization parameters;

$$R(W) = \operatorname{tr}\!\left(W^\top W + \lambda_g W^\top X^\top L X W\right), \qquad R(H) = \operatorname{tr}\!\left(H^\top H\right)$$

are regularizers; $L$ is a positive definite matrix; $X = [x_1, \ldots, x_m]^\top \in \mathbb{R}^{m \times d}$ includes feature vectors corresponding to users; and $\ell(a,b)$ is a loss function convex in $b$. We also use

$$\ell_{ij}(a,b) = C_{ij}\,\ell(a,b)$$

to denote the loss term for the $(i,j)$ entry. Typically $L$ is in the form of a graph Laplacian matrix $L = D - S$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{t=1}^{m} S_{it}$ and $S \in \mathbb{R}^{m \times m}$ is a similarity matrix among users. Then in $R(W)$ we have

$$\operatorname{tr}\!\left(W^\top X^\top L X W\right) = \frac{1}{2} \sum_{i_1, i_2} S_{i_1 i_2} \left\| W^\top x_{i_1} - W^\top x_{i_2} \right\|^2.$$

If the relationship between users $i_1$ and $i_2$ is strong, the larger $S_{i_1 i_2}$ will make $W^\top x_{i_1}$ closer to $W^\top x_{i_2}$.
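As a quick numerical check of the identity above, the following sketch (ours, not from the paper) builds $L = D - S$ from a small symmetric similarity matrix and compares $\operatorname{tr}(W^\top X^\top L X W)$ with the pairwise form; all sizes and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 6, 4, 3

# Symmetric nonnegative similarity matrix S and graph Laplacian L = D - S.
A = rng.random((m, m))
S = (A + A.T) / 2
np.fill_diagonal(S, 0.0)
L = np.diag(S.sum(axis=1)) - S

X = rng.normal(size=(m, d))   # user feature vectors x_i as rows
W = rng.normal(size=(d, k))   # feature-to-latent mapping

# Left side: tr(W^T X^T L X W).
lhs = np.trace(W.T @ X.T @ L @ X @ W)

# Right side: (1/2) * sum_{i1,i2} S_{i1 i2} ||W^T x_{i1} - W^T x_{i2}||^2.
Z = X @ W                              # row i is (W^T x_i)^T
diff = Z[:, None, :] - Z[None, :, :]   # pairwise differences
rhs = 0.5 * np.sum(S * np.sum(diff ** 2, axis=2))

print(np.isclose(lhs, rhs))   # True
```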
This use of the graph information in the regularization term has been considered in various learning methods (Smola and Kondor 2003; Li and Yeung 2009; Zhou et al. 2012; Zhao et al. 2015; Rao et al. 2015; Natarajan, Rao, and Dhillon 2015), where Natarajan, Rao, and Dhillon (2015) focus on PU learning.

We further note that in the optimization problem (2), $W^\top x_i$ becomes the latent representation of the $i$-th user (or instance). Thus, in contrast to $W \in \mathbb{R}^{m \times k}$ in the standard MF formulation, now we have $W \in \mathbb{R}^{d \times k}$. Then the prediction on unseen instances, such as in multi-label classification, can be done naturally by using $W^\top x$ as feature vector $x$'s latent representation. There are other approaches to incorporate side information into MF, such as Singh and Gordon (2008), but we focus on the formulation (2) in this paper.

Some past works have considered optimization problems related to (2). In (Rao et al. 2015), for rating-based MF, they consider the same regularization term $R(W)$ by assuming that pairwise relationships are available via a graph.
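To illustrate the feature-aware prediction described above, here is a short hypothetical sketch: an unseen instance with feature vector $x$ is mapped to the latent representation $W^\top x$ and scored against every label factor $h_j$. The helper name predict_top_labels and the top-$t$ ranking rule are our assumptions, not part of the paper.

```python
import numpy as np

def predict_top_labels(x, W, H, top=5):
    """Score all n labels for a new feature vector x via h_j^T (W^T x)
    and return the indices of the highest-scoring labels."""
    z = W.T @ x       # latent representation of the new instance, shape (k,)
    scores = H @ z    # score for label j is h_j^T z, shape (n,)
    return np.argsort(-scores)[:top]

# Toy usage with random factors: d features, k latent dimensions, n labels.
rng = np.random.default_rng(0)
d, k, n = 8, 3, 20
W = rng.normal(size=(d, k))
H = rng.normal(size=(n, k))
x_new = rng.normal(size=d)
print(predict_top_labels(x_new, W, H, top=3))
```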