A Unified Algorithm for One-class Structured Matrix Factorization with Side Information

Hsiang-Fu Yu (University of Texas at Austin), Hsin-Yuan Huang (National Taiwan University), Inderjit S. Dhillon (University of Texas at Austin), Chih-Jen Lin (National Taiwan University)

Abstract

In many applications, such as recommender systems and multi-label learning, the task is to complete a partially observed binary matrix. Such PU (positive-unlabeled) learning problems can be solved by one-class matrix factorization (MF). In practice, side information such as user or item features in recommender systems is often available besides the observed positive user-item connections. In this work we consider a generalization of one-class MF so that two types of side information are incorporated and a general convex loss function can be used. The resulting optimization problem is very challenging, but we derive an efficient and effective alternating minimization procedure. Experiments on large-scale multi-label learning and one-class recommender systems demonstrate the effectiveness of our proposed approach.

1 Introduction

Many practical applications can be modeled in a positive and unlabeled data learning (PU learning) framework. Between two types of objects, some connections between object i of the first type and object j of the second type are observed, but for most remaining situations, either i and j are not connected or the connection is not observed. A typical example is collaborative filtering with one-class information (Pan et al. 2008; Hu, Koren, and Volinsky 2008; Pan and Scholz 2009; Li et al. 2010; Paquet and Koenigstein 2013). Given m users and n items, we observe part of a 0/1 matrix Y ∈ R^{m×n} with Y_ij = 1, where (i,j) ∈ Ω^+. An observed entry indicates that user i likes item j. The goal is to know, for any unobserved pair (i,j) ∉ Ω^+, whether i likes j or not. Thus, only part of the positive-labeled entries are observed, while there are many unknown negative entries. This setting is very different from traditional collaborative filtering, where a real-valued rating (observed or not) is associated with every Y_ij. Note that the one-class setting may be the more common scenario; for example, most people watch videos that interest them without leaving any rating information.

Another important application that falls into the PU learning framework is multi-label classification (Kong et al. 2014; Hsieh, Natarajan, and Dhillon 2015). The two types of objects are instances and labels. An entry Y_ij = 1 indicates that instance i is associated with label j. In practice, only part of instance i's labels have been observed.

PU learning is essentially a matrix completion problem. Given Y_ij = 1, ∀(i,j) ∈ Ω^+, we would like to predict whether the other entries are zero or one. One common approach for matrix completion is matrix factorization (MF), which finds two low-rank latent matrices

  W = [..., w_i, ...]^T ∈ R^{m×k}  and  H = [..., h_j, ...]^T ∈ R^{n×k}

such that Y_ij = 1 ≈ w_i^T h_j, ∀(i,j) ∈ Ω^+. Note that k is a pre-specified number satisfying k ≪ m and k ≪ n. Unlike traditional matrix completion problems, where the Y_ij, ∀(i,j) ∈ Ω^+ have different values, here approximating only the observed positive entries would result in the all-one prediction for every (i,j) ∉ Ω^+. For such a one-class scenario, the approximation process must treat some (i,j) ∉ Ω^+ as negative entries so that Y_ij = 0 ≈ w_i^T h_j. Therefore, existing one-class MF works solve the following optimization problem:

  min_{W,H}  Σ_{(i,j)∈Ω} C_ij (Y_ij − w_i^T h_j)^2 + Σ_i λ_i ||w_i||^2 + Σ_j λ̄_j ||h_j||^2,   (1)

where C_ij is a cost associated with the loss, and λ_i, λ̄_j are regularization parameters. The set Ω = Ω^+ ∪ Ω^− includes both the observed positive and the selected negative entries. The rationale for selecting some (i,j) ∉ Ω^+ and treating their Y_ij = 0 is that, in general, each user likes only a small set of items (or each instance is associated with a small set of labels).
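To make the notation in (1) concrete, the following is a minimal NumPy sketch (our illustration, not code from the paper or the cited systems) that evaluates the objective for a given pair (W, H). The function name one_class_mf_objective, the container omega, and the scalar regularizers lam_w and lam_h are hypothetical simplifications of the per-index parameters λ_i and λ̄_j.

```python
import numpy as np

def one_class_mf_objective(W, H, omega, C, lam_w, lam_h):
    """Evaluate objective (1) for a one-class MF model.

    W:      (m, k) array whose rows are the user/instance factors w_i.
    H:      (n, k) array whose rows are the item/label factors h_j.
    omega:  iterable of (i, j, y) with y = 1 for observed positives and
            y = 0 for the selected negative entries (the union of Omega+
            and the selected Omega-).
    C:      (m, n) array of per-entry costs C_ij.
    lam_w, lam_h: regularization weights (scalars here; the formulation
            allows a separate value for every row of W and H).
    """
    loss = sum(C[i, j] * (y - W[i] @ H[j]) ** 2 for i, j, y in omega)
    reg = lam_w * np.sum(W ** 2) + lam_h * np.sum(H ** 2)
    return loss + reg

# Tiny usage example with random factors: two positives and two sampled negatives.
rng = np.random.default_rng(0)
m, n, k = 5, 4, 2
W, H = rng.normal(size=(m, k)), rng.normal(size=(n, k))
C = np.ones((m, n))
omega = [(0, 1, 1), (2, 3, 1), (0, 0, 0), (4, 2, 0)]
print(one_class_mf_objective(W, H, omega, C, lam_w=0.1, lam_h=0.1))
```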
The selection of negative entries in Ω^− is an important issue. Roughly, there are two approaches:
• Subsampled: we subsample some unobserved entries to obtain Ω^− with |Ω^−| = O(|Ω^+|).
• Full: Ω^− = ([m] × [n]) \ Ω^+. That is, every unobserved entry is considered as negative. Unfortunately, the huge size of Ω^− makes this approach computationally expensive.

Recently, Yu, Bilenko, and Lin (2017) successfully developed efficient optimization methods to solve (1) under the Full approach. They show that the Full approach gives significantly better results than the Subsampled approach. A shortcoming is that their methods work only with the squared loss (Y_ij − w_i^T h_j)^2, which is a real-valued regression loss. However, Y is a 0/1 matrix in the one-class setting, so a 0/1 classification loss might be more suitable. Yet, developing efficient methods for the Full approach with a classification loss remains a challenging problem.
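The practicality of the Full approach with the squared loss hinges on never materializing the sum over all m × n entries. One standard identity that makes this possible is Σ_{i,j} (w_i^T h_j)^2 = tr((W^T W)(H^T H)), so the full objective can be assembled from the |Ω^+| observed terms plus small k × k Gram matrices. The sketch below verifies this decomposition on toy data; it only illustrates the flavor of such reformulations and is not claimed to be the exact algorithm of Yu, Bilenko, and Lin (2017).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 4
W, H = rng.normal(size=(m, k)), rng.normal(size=(n, k))
pos = {(0, 1), (2, 3), (10, 7), (29, 19)}          # toy Omega+ (observed positives)

# Full-approach objective with unit costs, Y_ij = 1 on Omega+ and 0 elsewhere:
#   sum over Omega+ of (1 - w_i^T h_j)^2  +  sum over the rest of (w_i^T h_j)^2.
P = W @ H.T                                        # formed only for verification: O(mnk)
naive = sum((1 - P[i, j]) ** 2 if (i, j) in pos else P[i, j] ** 2
            for i in range(m) for j in range(n))

# Decomposed evaluation: correct the |Omega+| observed terms, then add the sum
# over ALL entries via Gram matrices, since sum_{i,j}(w_i^T h_j)^2 = tr(W^T W H^T H).
obs_correction = sum((1 - W[i] @ H[j]) ** 2 - (W[i] @ H[j]) ** 2 for i, j in pos)
full_quadratic = np.trace((W.T @ W) @ (H.T @ H))   # O((m + n) k^2 + k^3) work
decomposed = obs_correction + full_quadratic

print(np.isclose(naive, decomposed))               # True
```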
In problem (1), Y_ij, (i,j) ∈ Ω^+ are the only given information. However, for most applications, some "side information" is also available. For example, besides the preference of user i for item j, user or item features may also be known. For multi-label learning, a data instance always comes with a feature vector. Further, relationships among users (or items) may be available, which can be represented as a graph. How to effectively incorporate side information into the PU learning framework is thus a crucial research issue.

In this paper, we consider a formulation which unifies and generalizes many existing structured matrix-factorization formulations for PU learning with side information. In Section 2, we introduce the formulation and review related works. Our main contribution, in Section 3, is to develop an efficient alternating minimization framework for the Full approach with any convex loss function. Experiments in Section 4 consider multi-label classification and recommender systems to illustrate the effectiveness of our approach. Results show a clear performance improvement from using a classification loss. A summary showing how we generalize existing works is in Table 1, indicating that our proposed algorithms cover several settings for which no efficient Full-approach method was previously available.

Table 1: Various one-class MF formulations supported by our proposed algorithms. SQ-SQ: we apply the square loss on entries in both Ω^+ and Ω^−. SQ-wSQ: we apply the square loss on entries in Ω^+ and the weighted square loss on Ω^−. General-wSQ: we apply a general loss on entries in Ω^+ and the weighted square loss on Ω^−. For the Full approach of most formulations in this family, our proposed algorithms are the first efficient approach with time complexity linear in O(|Ω^+|). [1]: (Pan and Scholz 2009), [2]: (Yu, Bilenko, and Lin 2017), [3]: (Yu et al. 2014), [4]: (Rao et al. 2015).

                        SQ-SQ        SQ-wSQ           General-wSQ
  loss ℓ_ij on Ω^+      square       square           general
  loss ℓ_ij on Ω^−      square       weighted square  weighted square
  Standard              [1], [2]     [1], [2]         this paper
  Feature-aware         LEML [3]     this paper       this paper
  Graph-structured      GRALS [4]    this paper       this paper
  Feature+Graph         this paper   this paper       this paper

An extension of (1) is the following optimization problem:

  min_{W,H} f(W,H), where
  f(W,H) = Σ_{(i,j)∈Ω} C_ij ℓ(Y_ij, x_i^T W h_j) + λ_w R(W) + λ_h R(H).   (2)

In (2), λ_w, λ_h, and λ_g are regularization parameters;

  R(W) = tr(W^T W + λ_g W^T X^T L X W)  and  R(H) = tr(H^T H)

are regularizers; L is a positive definite matrix; X = [x_1, ..., x_m]^T ∈ R^{m×d} includes the feature vectors corresponding to users; and ℓ(a,b) is a loss function convex in b. We also use ℓ_ij(a,b) = C_ij ℓ(a,b) to denote the loss term for the (i,j) entry. Typically L takes the form of a graph Laplacian matrix L = D − S, where D is a diagonal matrix with D_ii = Σ_{t=1}^{m} S_it and S ∈ R^{m×m} is a similarity matrix among users. Then in R(W) we have

  tr(W^T X^T L X W) = (1/2) Σ_{i1,i2} S_{i1,i2} ||W^T x_{i1} − W^T x_{i2}||^2.

If the relationship between users i1 and i2 is strong, the larger S_{i1,i2} will pull W^T x_{i1} closer to W^T x_{i2}. This use of graph information in the regularization term has been considered in various learning methods (Smola and Kondor 2003; Li and Yeung 2009; Zhou et al. 2012; Zhao et al. 2015; Rao et al. 2015; Natarajan, Rao, and Dhillon 2015), of which Natarajan, Rao, and Dhillon (2015) focus on PU learning.

We further note that in the optimization problem (2), W^T x_i becomes the latent representation of the i-th user (or instance). Thus, in contrast to W ∈ R^{m×k} in the standard MF formulation, we now have W ∈ R^{d×k}. The prediction on unseen instances, such as in multi-label classification, can then be done naturally using W^T x as feature vector x's latent representation. There are other approaches to incorporate side information into MF, such as Singh and Gordon (2008), but we focus on formulation (2) in this paper.

Some past works have considered optimization problems related to (2). In (Rao et al. 2015), for rating-based MF, they consider the same regularization term R(W) by assuming that pairwise relationships are available via a graph.
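To see concretely why tr(W^T X^T L X W) encourages users that are connected in the graph to share similar latent representations, the following toy sketch (ours, with arbitrary random data) builds L = D − S from a symmetric similarity matrix S and numerically checks the identity stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 6, 4, 3
X = rng.normal(size=(m, d))                  # rows are user feature vectors x_i
W = rng.normal(size=(d, k))                  # W is d x k in formulation (2)

S = rng.random((m, m))
S = (S + S.T) / 2                            # symmetric similarity matrix among users
np.fill_diagonal(S, 0.0)
L = np.diag(S.sum(axis=1)) - S               # graph Laplacian L = D - S

lhs = np.trace(W.T @ X.T @ L @ X @ W)        # regularizer term appearing in R(W)
Z = X @ W                                    # row i is the latent representation W^T x_i
rhs = 0.5 * sum(S[i1, i2] * np.linalg.norm(Z[i1] - Z[i2]) ** 2
                for i1 in range(m) for i2 in range(m))
print(np.isclose(lhs, rhs))                  # True: the two expressions agree
```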

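Finally, to illustrate the remark that W^T x acts as the latent representation of an unseen feature vector x under formulation (2), here is a minimal scoring sketch. The dimensions and variable names are arbitrary; in practice the scores would be ranked or thresholded according to the chosen loss ℓ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 8, 3, 5                       # feature dimension, rank, number of labels/items
W = rng.normal(size=(d, k))             # learned feature-to-latent map (d x k)
H = rng.normal(size=(n, k))             # learned label/item factors (n x k)

x_new = rng.normal(size=d)              # feature vector of an instance not seen in training
z = W.T @ x_new                         # its latent representation W^T x
scores = H @ z                          # predicted affinities x^T W h_j for every j
ranking = np.argsort(-scores)           # labels/items ordered by predicted score
print(scores, ranking)
```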