
Collaborative Fashion Recommendation: A Functional Factorization Approach

Yang Hu†, Xi Yi‡, Larry S. Davis§ Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 [email protected]†, [email protected]‡, [email protected]§

ABSTRACT

With the rapid expansion of online shopping for fashion products, effective fashion recommendation has become an increasingly important problem. In this work, we study the problem of personalized outfit recommendation, i.e. automatically suggesting outfits to users that fit their personal fashion preferences. Unlike existing recommendation systems that usually recommend individual items, we suggest sets of items, which interact with each other, to users. We propose a functional tensor factorization method to model the interactions between users and fashion items. To effectively utilize the multi-modal features of the fashion items, we use a gradient boosting based method to learn nonlinear functions to map the feature vectors from the feature space into some low dimensional latent space. The effectiveness of the proposed algorithm is validated through extensive experiments on real world user data from a popular fashion-focused social network.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval models; I.2.6 [Learning]: Knowledge acquisition

Keywords

Recommendation systems; Collaborative filtering; Tensor factorization; Learning to rank; Gradient boosting

[Figure 1 panels: (a) User 1, (b) User 2, (c) User 3]

Figure 1: Examples of fashion sets created by three Polyvore users. Different users have different style preferences. Our task is to automatically recommend outfits to users that fit their personal taste.

1. INTRODUCTION

With the proliferation of social networks, people share almost everything in their daily life online nowadays. They share the dinners they had, the movies they watched, the music they listened to, the places they visited, and also, the outfits they wore. There are numerous fashion-focused online communities, such as Polyvore¹, Chictopia², Lookbook³, where people showcase their personal styles and connect to others that share similar fashion taste. With this rising trend, a larger share of the fashion industry has moved online, which triggers a strong demand for intelligent fashion analysis techniques.

Many recent works have begun to study fashion related problems, e.g. clothing parsing [29, 28, 30, 7], clothing recognition [4, 13], clothing retrieval [26, 20], and clothing recommendation [11, 18, 12]. In this work, we are interested in the problem of fashion recommendation, which is a key problem for promoting people’s interest and participation in online shopping. There are two kinds of fashion recommendation problems. One is recommending whole outfits that people may be interested in. The other is recommending fashion items that make good matches with some given items. We mainly focus on the first problem. However the model we learned can also be applied to solve the second task.

¹ http://www.polyvore.com/
² http://www.chictopia.com/
³ http://lookbook.nu/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

MM’15, October 26–30, 2015, Brisbane, Australia. © 2015 ACM. ISBN 978-1-4503-3459-4/15/10 ...$15.00. DOI: http://dx.doi.org/10.1145/2733373.2806239.

One crucial point that has not been addressed in previous work is that recommendation should be personal. Figure 1 shows examples of fashion sets created by three users on Polyvore. It is obvious that different people have different preferences of styles, which is a reflection of their ages, occupations, cultural background, place of living, etc. Therefore, for different users, an effective recommender system should recommend different outfits.

To generate good recommendations, we need to learn from people’s past behavior. The outfits a person has created or worn reveal his or her fashion taste. However, the number of outfits we observe for a single person may be small. On the other hand, there are a large number of outfits shared by other people who have similar fashion taste. These outfits could give a person a broader glimpse of the clothing in stock. Therefore, for effective recommendation, we should also learn from the behavior of others. This is the basic idea of collaborative filtering, which is a common strategy for recommendation problems.

Collaborative filtering analyzes relationships between users and interdependencies among items to identify new user-item associations. However, previous work mostly restricted attention to recommending individual items like movies and music. We, on the other hand, want to recommend sets of items to users. In addition to matching user preferences, the fashion items in an outfit should also match with each other.

Compared with movies and music, the number of fashion items is extremely large. Each item may have only been chosen by very few or even no users. Therefore collaborative filtering methods that only use user-item associations to characterize items are inapplicable. It is important to use auxiliary information about the fashion items to capture the relationships amongst items. There is rich information about fashion items on the web. For example, visual features related to style can be extracted from the images of the items. Sellers often provide properties of the fashion products through text descriptions. Other information, such as the price and the popularity of the items, may also be available. It is important to design a model that can effectively handle these heterogeneous features.

With these considerations in mind, we propose a functional tensor factorization model for fashion recommendation. We decompose the high order interactions between users and the fashion items into a set of pairwise interactions in some latent space. We use the idea of gradient boosting to learn nonlinear functions to map the multi-modal feature vectors of the fashion items from the feature space into the latent space. A learning to rank formulation is used to learn the model. We collect real world user data from a popular fashion-focused social network. Comprehensive experiments have been conducted to verify the effectiveness of the proposed algorithm.

Moving from individual items to item sets opens a new area for the study of recommendation systems. Besides fashion, it has potential applications to other problems. For example, on Polyvore, users are also interested in creating collages of furniture. Using this data, the proposed method may also be used to help people design their rooms.

2. RELATED WORK

Fashion analysis techniques have drawn much attention in the computer vision and multimedia communities [19]. Recent papers have studied problems including clothing parsing, recognition, retrieval, as well as fashion recommendation.

Clothing parsing predicts pixel-wise labeling for garment items, which provides a foundation for other tasks. Several solutions have been proposed in the literature [29, 28, 7, 30]. For clothing recognition, Chen et al. [4] described clothing appearance by semantic attributes using a CRF based approach. Kiapour et al. [13] designed an online game to crowdsource human judgements of fashion styles and used the data collected to train models for between-class and within-class classifications of styles. Clothing retrieval attempts to find similar clothing to a given query. Wang et al. [26] designed a retrieval system by using color and attribute information. Liu et al. [20] considered the cross-scenario retrieval problem of finding similar clothing in online stores given a daily human photo captured in the wild.

There are only a handful of attempts to solve the problem of fashion recommendation. For item recommendation, Iwata et al. [11] proposed a probabilistic topic model for learning information about fashion coordinates. In [12], Jagadeesh et al. proposed two classes of recommenders, namely deterministic and stochastic fashion recommenders. They mainly focused on color modeling for recommendation. Liu et al. [18] studied both outfit and item recommendation problems. They designed a latent SVM based model for occasion-oriented clothing recommendation, i.e. given a user-input occasion, suggest the most suitable clothing, or recommend items to pair with the reference clothing. For outfit recommendation, they only considered outfits with two clothing items. None of these works considered personalization, as we do.

Besides recommendation, there is increasing interest in building personalized models for other learning problems. Personalized tags are annotated for photos [24] and songs [23]. In [27] Weston et al. studied the collaborative retrieval problem. Latent models were learned for recommending items to a user with respect to a given query. Yue et al. [31] presented a clustering analogue to collaborative filtering, called personalized collaborative clustering.

Factorization techniques have shown great success in recommender systems. Various models have been proposed to factorize the user-item rating matrix [15]. To further enhance the performance of the models, the use of auxiliary information in matrix factorization has been studied. In [1, 22, 27], linear functions have been used to map the feature vectors in the feature space into a latent space. Recently, Chen et al. [5] proposed using gradient boosting to automatically construct feature functions in matrix factorization. We extend Chen et al.’s work to tensor factorization. Also, instead of using regression, a learning to rank formulation is used in our work.

3. COLLABORATIVE FASHION RECOMMENDATION

3.1 Problem Formulation

In personalized outfit recommendation, we recommend a list of outfits, each of which consists of a set of fashion items, to a user. Assume the fashion items can be grouped into $N$ categories. For example, the three most essential fashion categories are tops, bottoms and shoes. Let $I^{(n)} = \{x_1^{(n)}, \dots, x_{L^{(n)}}^{(n)}\}$, $n = 1, \dots, N$, be the set containing
items in the $n$-th category. $L^{(n)}$ is the total number of items in $I^{(n)}$. $x_l^{(n)} \in \mathbb{R}^{M^{(n)}}$, $l = 1, \dots, L^{(n)}$, denotes the feature vector of the $l$-th item in the $n$-th category. $M^{(n)}$ is the dimension of $x_l^{(n)}$. An outfit with $N$ items can be represented as $O_t = \{x_{t_1}^{(1)}, \dots, x_{t_N}^{(N)}\}$, where $t = \{t_1, \dots, t_N\}$ denotes the indexes of the items in $O_t$. Assume there are $U$ users in total. The level of affection a user $u$, $u = 1, \dots, U$, has for $O_t$ is denoted by a rating score $r_{u,O_t}$. Our problem is to learn a model so that the score for any pair of user $u$ and outfit $O_t$ can be predicted. Then we recommend those with highest $r_{u,O_t}$ to user $u$.

Table 1: Notation

Notation              Explanation
$x_l^{(n)}$           Feature vector of the $l$-th item in the $n$-th fashion category
$v_u^{(n)}$           Latent space representation of user $u$'s preference for items from the $n$-th category
$h^{(n)}(\cdot)$      Functions to map the $n$-th category items to the latent space for matching with users
$h^{(nm)}(\cdot)$     Functions to map the $n$-th category items to the latent space for matching with the $m$-th category items
$N$                   Number of fashion categories, i.e. number of items in an outfit
$M^{(n)}$             Number of features for the $n$-th category items
$L^{(n)}$             Number of $n$-th category items
$U$                   Number of users
$D$                   Dimension of latent space
$t$                   Indexes of the items in an outfit
$O_t$                 Outfit with items indexed by $t$
$r_{u,O_t}$           Rating score of outfit $O_t$ given by user $u$
$\mathcal{P}$         The set of preference pairs

3.2 Functional Pairwise Interaction Tensor Factorization

$r_{u,O_t}$ can be viewed as an entry in an $(N+1)$-th order tensor, with $N$ dimensions corresponding to the $N$ fashion categories and the remaining one corresponding to the users. We need to decompose this tensor so that unobserved entries in it can be predicted.

Tensor decomposition has been intensively studied in the literature [14]. Higher order interactions between the components are considered in many works. In our problem, the tensor we observed is extremely sparse. Among all possible outfits, each user can only rate a very small number of them. It would be difficult to learn effective high order interactions from such a limited number of observations. We therefore model $r_{u,O_t}$ by factoring it into a set of pairwise interactions in some latent space:

$$
r_{u,O_t} = \sum_{n=1}^{N} v_u^{(n)} \cdot h^{(n)}(x_{t_n}^{(n)}) + \sum_{n=1}^{N} \sum_{m=n+1}^{N} h^{(nm)}(x_{t_n}^{(n)}) \cdot h^{(mn)}(x_{t_m}^{(m)}). \tag{1}
$$

In the first summation, we model users’ affection for individual fashion items. $v_u^{(n)} \in \mathbb{R}^D$ characterizes user $u$’s preference for items from the $n$-th category. $D$ is the dimensionality of the latent space. $h^{(n)}(x_{t_n}^{(n)}) \in \mathbb{R}^D$ contains outputs from $D$ functions, i.e.

$$
h^{(n)}(x_{t_n}^{(n)}) = [h_1^{(n)}(x_{t_n}^{(n)}), \dots, h_D^{(n)}(x_{t_n}^{(n)})]^T. \tag{2}
$$

The functions $h_d^{(n)}(x^{(n)})$, $d = 1, \dots, D$, map the features of the $n$-th category items to the latent space for matching with users.

In the second summation, the matches between items from two different categories are measured. $h^{(nm)}(x_{t_n}^{(n)}) \in \mathbb{R}^D$ also contains outputs from $D$ functions:

$$
h^{(nm)}(x_{t_n}^{(n)}) = [h_1^{(nm)}(x_{t_n}^{(n)}), \dots, h_D^{(nm)}(x_{t_n}^{(n)})]^T. \tag{3}
$$

$h_d^{(nm)}(x^{(n)})$, $d = 1, \dots, D$, map the features of the $n$-th category items to a latent space in which their matches with items from the $m$-th category can be evaluated. To make the discussion concise, $h^{(n)}(x^{(n)})$ will also be referred to as $h^{(nn)}(x^{(n)})$ in the following.

The use of this simple factorization model in Eq.(1) effectively avoids overfitting. It also leads to a more efficient learning algorithm. A similar pairwise factorization method has been successfully used for personalized tag recommendation [23] where a third-order tensor was decomposed. However, in that application, they solved a conventional tensor factorization problem where no auxiliary information of the items was used. Due to the introduction of the functions $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$, we refer to Eq.(1) as Functional Pairwise Interaction Tensor Factorization (FPITF).

Note that, if available, we may also utilize auxiliary information about the users, such as their ages and occupations, to help predict users’ preferences, i.e., instead of using $\{v_u^{(n)}\}_{n=1}^{N}$, a set of functions are used to map these features into the latent space just like the use of the functions $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ for the fashion items. Here, we adopt conventional representation for the users and keep our focus on the analysis of the fashion items.

3.3 The Optimization Problem

To learn the model in Eq.(1), we assume that, for each user, the preference orderings between some pairs of outfits are given as training data, i.e.

$$
\mathcal{P} \equiv \{(i, j) \mid u_i = u_j,\ r_{u_i,O_{t_i}} > r_{u_j,O_{t_j}}\}, \quad i, j = 1, \dots, Z. \tag{4}
$$

$Z$ is the total number of user outfit tuples $(u, O_t)$ involved in the training set. The $i$-th and $j$-th tuples make a pair if they concern the same user and the user likes outfit $O_{t_i}$ more than he or she likes outfit $O_{t_j}$. The actual values of the ratings $r_{u_i,O_{t_i}}$ and $r_{u_j,O_{t_j}}$ are not needed. In fact, for different users the absolute values of the ratings may be highly biased. The use of preference pairs as training data avoids the need to calibrate the ratings and has become a popular approach for learning to rank.

With the above training data, we use the following objective function for our learning problem:

$$
\frac{1}{2}\Omega + C \sum_{(i,j) \in \mathcal{P}} \log\left(1 + \exp\left(-(r_{u_i,O_{t_i}} - r_{u_j,O_{t_j}})\right)\right). \tag{5}
$$

In Eq.(5), $r_{u_i,O_{t_i}}$ and $r_{u_j,O_{t_j}}$ are computed by Eq.(1). $\Omega$ denotes the regularization term and is defined as

$$
\Omega = \sum_{u=1}^{U} \sum_{n=1}^{N} \|v_u^{(n)}\|^2 + \sum_{n=1}^{N} \sum_{m=1}^{N} \sum_{l=1}^{L^{(n)}} \|h^{(nm)}(x_l^{(n)})\|^2. \tag{6}
$$

Eq.(5) applies a regularized logistic regression model to pairwise classification, i.e. we are learning to classify a pair $(i, j)$ into two classes, where class label $y_{(i,j)} = 1$ means $r_{u_i,O_{t_i}} > r_{u_j,O_{t_j}}$ and $y_{(i,j)} = -1$ corresponds to the contrary. The loss term in Eq.(5) is the negative log-likelihood of observing the training data, using a logistic function as the probability model, i.e.

$$
P(y_{(i,j)} \mid r_{u_i,O_{t_i}}, r_{u_j,O_{t_j}}) = \left(1 + \exp\left(-y_{(i,j)}(r_{u_i,O_{t_i}} - r_{u_j,O_{t_j}})\right)\right)^{-1}. \tag{7}
$$

To obtain good generalization ability, the regularization term controls the complexity of the model by penalizing the magnitudes of $v_u^{(n)}$ and the functions $h^{(nm)}(x^{(n)})$. A detailed Bayesian analysis of a similar optimization criterion for individual item recommendation can be found in [21].

We use coordinate descent to minimize the objective function in Eq.(5). Assuming all other terms are fixed for all users, we first minimize Eq.(5) with respect to the functions $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$. Then, assuming $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ are fixed, we minimize Eq.(5) with respect to $\{v_u^{(n)}\}_{n=1}^{N}$ for all $u = 1, \dots, U$. The general procedure is shown in Algorithm 1. In the following two sections, we explain these two steps in detail.

Algorithm 1: Coordinate descent for solving Eq.(5)
    Randomly initialize $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ and $\{v_u^{(n)}\}_{n=1}^{N}$;
    while not converged do
        for n = 1 to N do
            for m = 1 to N do
                for d = 1 to D do
                    Update $h_d^{(nm)}(x^{(n)})$ with functional gradient descent
                end
            end
        end
        for u = 1 to U do
            Update $\{v_u^{(n)}\}_{n=1}^{N}$ by solving the optimization problem in Eq.(25)
        end
    end

3.4 Learning $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ with Functional Gradient Descent

In general, we can choose a parameterized model for each function $h_d^{(nm)}(x^{(n)})$. For example, $h_d^{(nm)}(x^{(n)})$ can be a linear function of $x^{(n)}$ so that $h_d^{(nm)}(x^{(n)}) = w_d^{(nm)} \cdot x^{(n)}$. However, the features in $x^{(n)}$ may come from multiple modalities. We can extract low level visual features from the images of the fashion items. Products for online shopping also have an abundance of other information on the web. Figure 2 shows an example web page for a fashion item on Polyvore. It provides the category, name and price of the item. Moreover, a paragraph of text providing detailed description of the item is also available. A linear model cannot handle such heterogeneous information effectively. We therefore use a functional gradient descent method [9] to learn non-linear models that can inherently integrate the multi-modal information in $x^{(n)}$.

Figure 2: Example web page for a fashion item.

3.4.1 Functional Gradient Descent

Let $\Phi(h(x))$ be a general loss function, where $h \in \mathcal{H}$ is the function we seek to find, and $\mathcal{H}$ is a pre-defined function space. The gradient boosting method [9] applies gradient descent in $\mathcal{H}$ to minimize $\Phi(h(x))$, i.e., compute the gradient of $\Phi(h(x))$ with respect to $h(x)$ at the current estimate $h_k(x)$ and form the next estimate as

$$
h_{k+1}(x) = h_k(x) - \alpha_k \nabla\Phi(h(x)). \tag{8}
$$

However, we cannot compute $\nabla\Phi(h(x))$ at all $x$, but rather we can only compute it at a finite data sample $\{\nabla\Phi(h(x_i))\}$. So the gradient boosting method finds a function that interpolates/approximates these sample values and thus obtains an approximation of the negative gradient $-\nabla\Phi(h(x))$ to form the next iterate.

In the following, we consider the learning of $h_d^{(n)}(x^{(n)})$ for some $n$ and $d$ as an example to show how functional gradient descent can be applied to our problem. All other functions in $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ are learned the same way.

3.4.2 Learning $h_d^{(n)}(x^{(n)})$

Let $\{\hat{v}_u^{(n)}\}_{n=1}^{N}$ and $\{\hat{h}^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ be the current estimates of $\{v_u^{(n)}\}_{n=1}^{N}$ and $\{h^{(nm)}(x^{(n)})\}_{n,m=1}^{N}$ respectively. The corresponding score for tuple $(u_i, O_{t_i})$ with these estimates is $\hat{r}_{u_i,O_{t_i}}$. Let

$$
\hat{s}_i = \hat{r}_{u_i,O_{t_i}} - \hat{v}_{u_i,d}^{(n)} \hat{h}_d^{(n)}(x_{t_{i,n}}^{(n)}). \tag{9}
$$

Optimizing Eq.(5) with respect to $h_d^{(n)}(x^{(n)})$ is equivalent to minimizing

$$
\Phi(h_d^{(n)}(x^{(n)})) = \frac{1}{2} \sum_{l=1}^{L^{(n)}} \left(h_d^{(n)}(x_l^{(n)})\right)^2 + C \sum_{(i,j) \in \mathcal{P}} \log\left(1 + \exp\left(-\left(\hat{v}_{u_i,d}^{(n)} h_d^{(n)}(x_{t_{i,n}}^{(n)}) - \hat{v}_{u_j,d}^{(n)} h_d^{(n)}(x_{t_{j,n}}^{(n)}) + \hat{s}_i - \hat{s}_j\right)\right)\right). \tag{10}
$$
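The generic update of Eq.(8) can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the procedure used for Eq.(10): the loss is a plain squared error $\Phi(h) = \frac{1}{2}\sum_i (h(x_i) - y_i)^2$ on one-dimensional inputs, the weak learner is a depth-one regression tree (a stump) fit by exhaustive search, and the step size `alpha` is fixed. The names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs.

    Returns (threshold, left_value, right_value).
    """
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right) if right else lmean
        err = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    return best[1:]

def boost(xs, ys, steps=50, alpha=0.5):
    """Functional gradient descent on Phi(h) = 1/2 * sum_i (h(x_i) - y_i)^2."""
    stumps = []

    def h(x):
        # Current estimate: the sum of all scaled weak learners fit so far.
        return sum(alpha * (l if x <= thr else r) for thr, l, r in stumps)

    for _ in range(steps):
        # Gradient of Phi with respect to h(x_i), available only at the samples.
        grads = [h(x) - y for x, y in zip(xs, ys)]
        # Fit a weak learner to the negative gradient, then step as in Eq.(8).
        stumps.append(fit_stump(xs, [-g for g in grads]))
    return h
```

Because each stump is fit to the negative gradient at the training points only, the returned function is defined for any input, which is the property that lets the learned mappings be applied to items outside the training sample.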

The gradient with respect to $h_d^{(n)}(x_l^{(n)})$ is

$$
\begin{aligned}
g_l^{(n)} &= \left[ \frac{\partial \Phi(h_d^{(n)}(x^{(n)}))}{\partial h_d^{(n)}(x_l^{(n)})} \right]_{h_d^{(n)}(x_l^{(n)}) = \hat{h}_d^{(n)}(x_l^{(n)})} \\
&= C \sum_{(i,j) \in \mathcal{P}} (\sigma_{i,j} - 1) \left( \hat{v}_{u_i,d}^{(n)} \mathbf{1}(t_{i,n} = l) - \hat{v}_{u_j,d}^{(n)} \mathbf{1}(t_{j,n} = l) \right) + \hat{h}_d^{(n)}(x_l^{(n)}), \quad l = 1, \dots, L^{(n)},
\end{aligned} \tag{11}
$$

where

$$
\sigma_{i,j} = \left(1 + \exp\left(-(\hat{r}_{u_i,O_{t_i}} - \hat{r}_{u_j,O_{t_j}})\right)\right)^{-1}. \tag{12}
$$

$\mathbf{1}(\cdot)$ is the indicator function that has value 1 if its argument is true, and zero otherwise.

We then construct a training set with $\{(x_l^{(n)}, -g_l^{(n)})\}_{l=1}^{L^{(n)}}$ as the samples and learn a regression function from it. Specifically, a small decision tree regressor is learned here [8]. Compared to other learning methods, the decision tree has the advantage of inherently being able to handle the heterogeneous features in $x^{(n)}$. Let $\{R_q\}_{q=1}^{Q}$ be the terminal nodes of the tree learned, and let $\{e_q\}_{q=1}^{Q}$ be the corresponding coefficients for the nodes. According to Eq.(8) the update of $h_d^{(n)}(x^{(n)})$ becomes

$$
h_d^{(n)}(x^{(n)}) = \hat{h}_d^{(n)}(x^{(n)}) + \alpha_k \sum_{q=1}^{Q} e_q \mathbf{1}(x^{(n)} \in R_q). \tag{13}
$$

Note that since the negative gradients $\{-g_l^{(n)}\}_{l=1}^{L^{(n)}}$ are used to learn the regressor, the operator before $\alpha_k$ is changed to plus. Eq.(13) can be alternatively expressed as

$$
h_d^{(n)}(x^{(n)}) = \hat{h}_d^{(n)}(x^{(n)}) + \sum_{q=1}^{Q} \gamma_q \mathbf{1}(x^{(n)} \in R_q) \tag{14}
$$

with $\gamma_q = \alpha_k e_q$. Eq.(14) can be viewed as adding $Q$ separate basis functions $\{\mathbf{1}(x^{(n)} \in R_q)\}_{q=1}^{Q}$ at each step. We can further improve the quality of the fitting by using the optimal coefficients for each of these basis functions. These optimal coefficients are the solution to

$$
\begin{aligned}
\arg\min_{\gamma = [\gamma_1, \dots, \gamma_Q]^T} \ & \frac{1}{2} \sum_{l=1}^{L^{(n)}} \sum_{q=1}^{Q} \mathbf{1}(x_l^{(n)} \in R_q) \left(\hat{h}_d^{(n)}(x_l^{(n)}) + \gamma_q\right)^2 \\
& + C \sum_{(i,j) \in \mathcal{P}} \log\Big(1 + \exp\Big(-\Big(\hat{v}_{u_i,d}^{(n)} \sum_{q=1}^{Q} \gamma_q \mathbf{1}(x_{t_{i,n}}^{(n)} \in R_q) - \hat{v}_{u_j,d}^{(n)} \sum_{q=1}^{Q} \gamma_q \mathbf{1}(x_{t_{j,n}}^{(n)} \in R_q) + \hat{r}_{u_i,O_{t_i}} - \hat{r}_{u_j,O_{t_j}}\Big)\Big)\Big),
\end{aligned} \tag{15}
$$

which substitutes $h_d^{(n)}(x^{(n)})$ in Eq.(10) with Eq.(14). Following [9], we obtain the approximated solution to Eq.(15) by a single Newton-Raphson step, using a diagonal approximation to the Hessian. The result is

$$
\gamma_q = -\Psi_{\gamma_q} / \Psi_{\gamma_q \gamma_q}, \quad q = 1, \dots, Q, \tag{16}
$$

where $\Psi_{\gamma_q}$ and $\Psi_{\gamma_q \gamma_q}$ are the first and the second order derivatives of the objective function in Eq.(15) (denoted as $\Psi(\{\gamma_q\}_{q=1}^{Q})$) with respect to $\gamma_q$ evaluated at $\gamma = 0$ respectively, i.e.

$$
\begin{aligned}
\Psi_{\gamma_q} &= \left[ \frac{\partial \Psi(\{\gamma_q\}_{q=1}^{Q})}{\partial \gamma_q} \right]_{\gamma = 0} \\
&= \sum_{l=1}^{L^{(n)}} \hat{h}_d^{(n)}(x_l^{(n)}) \mathbf{1}(x_l^{(n)} \in R_q) + C \sum_{(i,j) \in \mathcal{P}} (\sigma_{i,j} - 1) \left( \hat{v}_{u_i,d}^{(n)} \mathbf{1}(x_{t_{i,n}}^{(n)} \in R_q) - \hat{v}_{u_j,d}^{(n)} \mathbf{1}(x_{t_{j,n}}^{(n)} \in R_q) \right),
\end{aligned} \tag{17}
$$

$$
\begin{aligned}
\Psi_{\gamma_q \gamma_q} &= \left[ \frac{\partial^2 \Psi(\{\gamma_q\}_{q=1}^{Q})}{\partial \gamma_q^2} \right]_{\gamma = 0} \\
&= \sum_{l=1}^{L^{(n)}} \mathbf{1}(x_l^{(n)} \in R_q) + C \sum_{(i,j) \in \mathcal{P}} \sigma_{i,j} (1 - \sigma_{i,j}) \left( \hat{v}_{u_i,d}^{(n)} \mathbf{1}(x_{t_{i,n}}^{(n)} \in R_q) - \hat{v}_{u_j,d}^{(n)} \mathbf{1}(x_{t_{j,n}}^{(n)} \in R_q) \right)^2.
\end{aligned} \tag{18}
$$

$\sigma_{i,j}$ is computed by Eq.(12).

Algorithm 2: Update $h_d^{(n)}(x^{(n)})$ with functional gradient descent
    1. For $l = 1, \dots, L^{(n)}$, compute the gradient $g_l^{(n)}$ with Eq.(11);
    2. Construct a training set with $L^{(n)}$ samples
           $\{(x_1^{(n)}, -g_1^{(n)}), \dots, (x_{L^{(n)}}^{(n)}, -g_{L^{(n)}}^{(n)})\}$,    (19)
       and learn a decision tree regressor from it;
    3. Update the terminal node coefficients of the tree with Eq.(16);
    4. Update $h_d^{(n)}(x^{(n)})$ with
           $h_d^{(n)}(x^{(n)}) \leftarrow \hat{h}_d^{(n)}(x^{(n)}) + \nu \sum_{q=1}^{Q} \gamma_q \mathbf{1}(x^{(n)} \in R_q)$,    (20)
       where $0 < \nu \le 1$ is a shrinkage parameter.

We summarize the procedure for updating $h_d^{(n)}(x^{(n)})$ in Algorithm 2. The time complexities for calculating the gradient $g$ with Eq.(11) and the terminal node coefficients $\gamma$ with Eq.(16) are $O(P + L^{(n)})$ and $O(P + L^{(n)} + Q)$ respectively, where $P = |\mathcal{P}|$ is the total number of preference pairs in $\mathcal{P}$. The cost to construct a decision tree is $O(L^{(n)} M^{(n)} \log L^{(n)})$ [8].

3.5 Learning $\{v_u^{(n)}\}_{n=1}^{N}$

To learn $\{v_u^{(n)}\}_{n=1}^{N}$ for user $u$, we concatenate them into a single vector $w_u$:

$$
w_u = [v_u^{(1)T}, \dots, v_u^{(N)T}]^T. \tag{21}
$$

Assume $\mathcal{P}_u \equiv \{(i, j) \mid u_i = u_j = u,\ r_{u_i,O_{t_i}} > r_{u_j,O_{t_j}}\}$ is the subset of $\mathcal{P}$ that only contains preference pairs concerning user $u$. Let $(u_i, O_{t_i})$ be a tuple involved in $\mathcal{P}_u$, i.e. $u_i = u$.

We define

$$
x_i = [\hat{h}^{(1)}(x_{t_{i,1}}^{(1)})^T, \dots, \hat{h}^{(N)}(x_{t_{i,N}}^{(N)})^T]^T \tag{22}
$$

to be the concatenation of the mappings of the fashion items involved in $(u_i, O_{t_i})$ using the currently learned functions $\{\hat{h}^{(n)}(x^{(n)})\}_{n=1}^{N}$. Also let

$$
b_i = \sum_{n=1}^{N} \sum_{m=n+1}^{N} \hat{h}^{(nm)}(x_{t_{i,n}}^{(n)}) \cdot \hat{h}^{(mn)}(x_{t_{i,m}}^{(m)}). \tag{23}
$$

The rating score $r_{u_i,O_{t_i}}$ can be rewritten as

$$
r_{u_i,O_{t_i}} = w_u^T x_i + b_i. \tag{24}
$$
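The rewrite in Eq.(24) says that, once the item mappings are fixed, the score is linear in the user's parameters. A minimal Python sketch of Eqs.(21)-(24) follows, assuming the latent vectors are given as plain lists; the helper names `concat`, `linear_form` and `score` are hypothetical and only serve the illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def concat(vectors):
    """Stack a list of vectors into one flat vector."""
    return [value for vec in vectors for value in vec]

def linear_form(v_u, h_user, h_pair):
    """Return (w_u, x_i, b_i) of Eqs.(21)-(23) for one user and one outfit.

    v_u[n]       : the user's latent vector for category n
    h_user[n]    : current mapping of the outfit's category-n item for user matching
    h_pair[n][m] : current mapping of the category-n item for matching category m
    """
    n_cat = len(v_u)
    w_u = concat(v_u)        # Eq.(21)
    x_i = concat(h_user)     # Eq.(22)
    b_i = sum(dot(h_pair[n][m], h_pair[m][n])    # Eq.(23)
              for n in range(n_cat) for m in range(n + 1, n_cat))
    return w_u, x_i, b_i

def score(v_u, h_user, h_pair):
    """Eq.(24): r = w_u^T x_i + b_i."""
    w_u, x_i, b_i = linear_form(v_u, h_user, h_pair)
    return dot(w_u, x_i) + b_i
```

Since $x_i$ and $b_i$ do not depend on $w_u$, they can be computed once per tuple before the per-user optimization starts.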

Then optimizing Eq.(5) with respect to $\{v_u^{(n)}\}_{n=1}^{N}$ reduces to minimizing the following objective function

$$
f(w_u) = \frac{1}{2} w_u^T w_u + C \sum_{(i,j) \in \mathcal{P}_u} \log\left(1 + \exp\left(-(w_u^T x_i - w_u^T x_j + b_i - b_j)\right)\right). \tag{25}
$$

We modify the trust region Newton method for logistic regression in [17] to solve this optimization problem.

Figure 3: Objective function value in Eq.(5) (black solid line) and training NDCG (blue dashed line) along iteration in one run of the training algorithm.

4. EXPERIMENTS

4.1 Dataset and features

We use data from the popular fashion website Polyvore⁴ to evaluate our recommendation algorithm. We collect a dataset consisting of image collages (called “Sets” on Polyvore) created by 150 users. In this evaluation, we assume an outfit contains three items, i.e. one top, one bottom and a pair of shoes. There are 38,800 tops, 21,833 bottoms and 24,619 pairs of shoes in total in our datasets.

⁴ http://www.polyvore.com/

On Polyvore, product images, which have clean background and only contain individual items, are available for most of the fashion items. This is a big advantage since we do not need to extract them from street snaps by clothes parsing, which is still a challenging computer vision problem. We therefore can more clearly study the performance of the recommendation algorithms.

We extract low level visual features from the item images. We first use the salient region detection method in [6] to precisely locate the items in the images (region of interest (ROI)). Then features are extracted from the ROIs. The features we used include color histograms in RGB and LAB spaces, dense multi-scale SIFT descriptors (PHOW) [25] and Pyramid of HOG (PHOG) [3]. The total number of features is 7262. We use PCA to reduce the dimensionality so that 95% of the energy is preserved. The resulting dimensions are around 100 for the three item categories.

The text information we use includes the categories, names and descriptions of the fashion items on the web pages (Figure 2). For each type of information, after removing the stop words and performing stemming, we discard the words that occur fewer than 5 times in the dataset. Then word counts are used to construct the feature vectors. Finally, latent Dirichlet allocation (LDA) [2] is applied to reduce the dimensionality of the features. In particular, the 20 most informative topics for each type of text information are selected by LDA, and the posterior Dirichlet parameters associated with each item are used as the low dimensional features.

4.2 Baselines

To the best of our knowledge, this is the first work to study personalized outfit recommendation. The methods that can be compared against are very limited. We use the following methods as baselines for comparison, which include both personalized and non-personalized methods.

In the first method, outfits that are similar to the outfits a user liked in the past are recommended, which is the basic idea of content-based filtering. The similarity between two outfits is measured by the sum of similarities between individual items in the outfits, i.e.

$$
\mathrm{Sim}(O_{t_i}, O_{t_j}) = \sum_{n=1}^{N} \exp\left(-\|x_{t_{i,n}}^{(n)} - x_{t_{j,n}}^{(n)}\|\right). \tag{26}
$$

Letting $S_u$ contain all the outfits user $u$ has created and therefore liked, the score for any new outfit $O_t$ is defined as the average similarity between $O_t$ and those in $S_u$, i.e.

$$
(r_{u,O_t})_{\mathrm{Mean}} = \frac{1}{|S_u|} \sum_{O_{t_j} \in S_u} \mathrm{Sim}(O_t, O_{t_j}). \tag{27}
$$

We refer to this method as CF Mean.

The idea of the second method is similar to the first one, except that the similarity between a new outfit and the outfits in $S_u$ is measured in a slightly different way. For each item in $O_t$, it first finds the item most similar to it in $S_u$, then the score is computed as the sum of item similarities

$$
(r_{u,O_t})_{\mathrm{NN}} = \sum_{n=1}^{N} \max_{O_{t_j} \in S_u} \exp\left(-\|x_{t_n}^{(n)} - x_{t_{j,n}}^{(n)}\|\right). \tag{28}
$$

We refer to this method as CF NN since it is based on the nearest neighbors of the items.

The recommendation of outfits involves finding the outfits whose constituent items match well with each other. In the third method, the multi-view extension of canonical correlation analysis (CCA) is used to learn a set of matrices $\{W_n\}_{n=1}^{N}$ that project the feature vectors of the items from each fashion category into a low-dimensional common space
such that, for the user created outfits, the sums of the correlations between pairs of items in the common space are maximized. We use the multi-view CCA implementation in [10] to learn the models. During training, outfits created by all users are used together to learn a global model for creating outfits. We also tried the method of learning a personalized model for each user only using his/her own outfits. The performance was quite poor due to the very limited number of training samples for each user.

4.3 Outfit recommendation

For training our model, we consider two rating levels for the outfits. The outfits created by a user are taken as positive outfits for that user. We assume that outfits with randomly mixed top, bottom and shoes are neutral outfits for any user. (We do not use the word “negative” here since it is more appropriate for outfits that are disliked by a user, which is not necessarily the case for the random outfits.) It is possible that a neutral outfit may happen to match a user’s taste. However, considering the huge number of outfits that can be obtained by random mixing, this possibility is relatively small. Therefore, it is reasonable to assume that a user likes a positive outfit more than a neutral one, i.e. they form a preference pair for training. Our training set contains 180 positive outfits and 900 neutral outfits for each user. For testing, 45 positive outfits and 450 neutral outfits not included in the training set are used per user.

During evaluation, for each user, the positive and neutral outfits in the testing set are ranked in descending order of their scores $r_{u,O_t}$. The performance is measured by mean NDCG, a widely used criterion for comparing ranked lists. We follow the NDCG definition used in [16]. Letting $\pi_0$ be the ordering being evaluated, the NDCG at the $m$-th position is

$$
\mathrm{NDCG@}m = (N_m)^{-1} \sum_{i=1}^{m} \frac{2^{y_{\pi_0(i)}} - 1}{\log_2(\max(2, i))}, \tag{29}
$$

where $N_m$ is the score of an ideal ordering, and $y_{\pi_0(i)}$ is 1 for positive outfits and 0 for neutral ones. Mean NDCG is the mean of NDCG@$m$ for $m = 1, \dots, M$ with $M$ being the length of the ordering. We report the average of mean NDCG over all users and refer to it as NDCG for short in the following.

We first verify experimentally the convergence of our learning algorithm. We show the changes of the objective function in Eq.(5) and the training NDCG during one run of the training algorithm in Figure 3. We can see that both values change smoothly as a function of iteration. They also change very quickly in early iterations.

In Figure 4, we compare the performance of the four methods for outfit recommendation. Results using different kinds of features for the fashion items, i.e. only using visual features, only using text features and using both visual and text features, are shown. We set the latent space dimension $D = 30$ for FPITF in this experiment. Among the two methods that only use a user’s own training outfits for calculating $r_{u,O_t}$, CF NN performs much better than CF Mean. CF NN also outperforms the multi-view CCA method that trains a universal model for all users. Our method, which considers both personalization and utilizing the outfits created by other users, achieves the best results. For FPITF, the NDCGs with visual and text features only are very close. By combining them together, we obtain significant performance gain. This shows the effectiveness of our method for fusing different kinds of features. On the other hand, the CF NN method, which ranks second, does not show much performance improvement with the addition of text features to visual features.

Figure 4: Comparison of recommendation methods. Different feature settings are tested, i.e. only using visual features, only using text features, and using both visual and text features (Combined).

We show the performance of our method with different dimensions of the latent space in Table 2. This dimension parameter affects test performance, training and evaluation time, as well as storage requirements. We can see that FPITF performs better than the other methods even with $D = 6$. Higher dimension is needed for more complex features. So FPITF with the combined features obtains larger performance gains when moving from $D = 6$ to $D = 30$ than those with visual or text features only.

Table 2: Changing the dimensionality of the latent space for the FPITF method.

Features    D = 6    D = 12   D = 18   D = 24   D = 30
Visual      .5309    .5460    .5518    .5550    .5576
Text        .5246    .5432    .5439    .5497    .5578
Combined    .5550    .5785    .5822    .5902    .5941

Figure 5: Comparison of recommendation methods with different numbers of training outfits. The number of positive outfits for each user is set to 90, 135 and 180 respectively. The neutral outfits are five times the positive ones for each user.

Figure 5 compares the performance of the four methods with different numbers of training outfits. Both visual and
text features are used for all methods. We set the number of positive outfits for each user to 90, 135 and 180 respectively. The number of neutral outfits for each user is five times the number of positive outfits in each setting. FPITF outperforms the other methods in all cases. The improvements of CF Mean and multi-view CCA are relatively small with the doubling of the training outfits, but both FPITF and CF NN enjoy large performance improvements.

We show in Figure 7 some examples of the top recommendations. For each user, the top 10 outfits obtained by each of the four methods using combined features are shown. We can see that for different users, FPITF has recommended outfits with quite different styles. It has also recommended more user created sets than the other methods. As a non-personalized method, CCA produces recommendations that exhibit much weaker personalization than those of the other three methods. Although CF Mean and CF NN are able to generate personalized recommendations, since they only consider outfits previously created by the same user, their results tend to lack diversity (note how similar jeans dominate in their recommendations for User 1 and User 2). By also borrowing ideas from other users, FPITF is able to generate diversified recommendations that are consistent with users’ taste. Among its recommendations, even those not marked by red boxes are very likely to also be liked by the users and be taken as good references for their future attire.

4.4 Personalized item retrieval

Besides recommending whole outfits to users, the model learned by FPITF can also be used to recommend individual fashion items to a user with respect to some given items. This is the problem of personalized retrieval. Here we show the performance of FPITF for this task.

For comparison, we also adapt the baseline methods in Section 4.2 for this retrieval problem. Assume we need to find an appropriate item from the n-th category to match with the given items for user u. We can still use $r_{u,O_t}$ as the score of item $x^{(n)}_{t_n}$. Note that except for $x^{(n)}_{t_n}$, the other items in $O_t$ are given as the query. Direct use of CF Mean results in a method that computes the score as the average of the similarities between $x^{(n)}_{t_n}$ and the n-th category items involved in $S_u$. The query is actually not considered. We therefore modify CF Mean to use weighted average similarities, where the weight is determined by how similar the items in the query are to the corresponding items in $O_{t_j} \in S_u$, i.e.

$$(r_{u,O_t})_{\mathrm{WMean}} = \sum_{O_{t_j} \in S_u} w_j \exp\left(-\|x^{(n)}_{t_n} - x^{(n)}_{t_{j,n}}\|\right), \qquad (30)$$

where $w_j \sim \sum_{l \neq n} \exp\left(-\|x^{(l)}_{t_l} - x^{(l)}_{t_{j,l}}\|\right)$ and the sum of $\{w_j\}_{j=1}^{|S_u|}$ is normalized to 1. For CF NN, the score becomes

$$(r_{u,O_t})_{\mathrm{NN}} = \max_{O_{t_j} \in S_u} \exp\left(-\|x^{(n)}_{t_n} - x^{(n)}_{t_{j,n}}\|\right). \qquad (31)$$

For multi-view CCA and FPITF, we just need to omit the pairwise interactions in $r_{u,O_t}$ that do not involve $x^{(n)}_{t_n}$.

For each user, all the positive outfits in the testing set are used to form the queries. For each outfit, N − 1 items in it are used as the query. The remaining item in that outfit is taken as the ground truth. Assume we are searching for an item from the n-th category. We compute scores for all the n-th category items in the testing set and rank them in descending order of the scores. The position of the ground truth item in the ordering is used as the performance measurement. We divide the position by the total number of items in the ordering to normalize it to the range [0, 1]. In an ideal ordering, the ground truth item is ranked first and has a position measurement of 0. Therefore, the smaller the measurement value, the better the ordering.

Table 3 compares the four methods for item retrieval. Each column corresponds to the search of one item. For example, the column “Top” contains results of recommending a top to a user with respect to the given bottom and shoes. The combined features are used in the experiments. The latent space dimensionality of FPITF is set to 30. Note that we directly apply the FPITF model learned for outfit recommendation to the retrieval task here. Again, FPITF obtains the best results among the four methods. Multi-view CCA, which explicitly optimizes the matching between pairs of items, outperforms the two CF methods.

Method           Top      Bottom   Shoe
CF WMean         0.4336   0.3876   0.4246
CF NN            0.3946   0.3560   0.3770
Multi-view CCA   0.3327   0.3359   0.3105
FPITF            0.2605   0.2377   0.2548

Table 3: Comparison of item retrieval results. Each column corresponds to the search of the item in the column title given the other two items. The smaller the values, the better the results.

Figure 6: Examples of top 10 recommended items with respect to the given query using the FPITF model. The two items in purple boxes are the query. The items in red boxes are the ground truth items. (a) Finding tops to match with given bottom and shoes. (b) Finding bottoms to match with given top and shoes. (c) Finding shoes to match with given top and bottom.

Note that although the results of FPITF are around 0.25, which is a bit far from the optimal value 0, this measurement
is an underestimate of the true quality of the results. Items that are similar to the ground truth item may rank high in the ordering. Although treated as negative items during evaluation, they are actually good alternatives to the ground truth item. This can be seen in Figure 6, which shows the top recommended items for some example queries.

5. CONCLUSION

In this work, we have studied the problem of personalized fashion recommendation, where outfits catering to users’ personal fashion taste are automatically recommended. We propose a functional tensor factorization method to model the user-item and item-item interactions. To handle the multi-modal features of the fashion items, a gradient boosting based method was used to learn the nonlinear functions that map the feature vectors from the feature space to a latent space. The effectiveness of the proposed method has been shown through extensive experiments. As a first attempt to approach a new recommendation problem, there are some remaining issues worth exploring, such as the cold start problem and incorporating social information from social networks to enhance performance. We will investigate these problems in the future.

6. ACKNOWLEDGMENTS

This work was supported by the NSF Grant (12621215) EAGER: Video Analytics in Large Heterogeneous Repositories.

7. REFERENCES

[1] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In KDD, 2009.
[2] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.
[3] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In CIVR, 2007.
[4] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In ECCV, 2012.
[5] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In ICML, 2013.
[6] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook. Efficient salient region detection with soft image abstraction. In ICCV, 2013.
[7] J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan. Towards unified human parsing and pose estimation. In CVPR, 2014.
[8] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[9] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
[10] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106:210–233, 2014.
[11] T. Iwata, S. Watanabe, and H. Sawada. Fashion coordinates recommender system using photographs from fashion magazines. In IJCAI, 2011.
[12] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan. Large scale visual recommendations from street fashion images. In KDD, 2014.
[13] H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In ECCV, 2014.
[14] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[15] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42:30–37, 2009.
[16] C.-P. Lee and C.-J. Lin. Large-scale linear ranksvm. Neural Computation, 26:781–817, 2014.
[17] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region newton method for large-scale logistic regression. JMLR, 9:627–650, 2008.
[18] S. Liu, J. Feng, Z. Song, T. Zhang, H. Lu, C. Xu, and S. Yan. “Hi, magic closet, tell me what to wear!”. In ACM Multimedia, 2012.
[19] S. Liu, L. Liu, and S. Yan. Fashion analysis: Current techniques and future directions. IEEE MultiMedia, 21(2):72–79, 2014.
[20] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In CVPR, 2012.
[21] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
[22] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In SIGIR, 2011.
[23] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, 2010.
[24] B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In WWW, 2008.
[25] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[26] X. Wang, T. Zhang, D. Tretter, and Q. Lin. Personal clothing retrieval on photo collections by color and attributes. IEEE Trans. on Multimedia, 15(8):2035–2045, 2013.
[27] J. Weston, C. Wang, R. Weiss, and A. Berenzweig. Latent collaborative retrieval. In ICML, 2012.
[28] K. Yamaguchi, H. Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing. In ICCV, 2013.
[29] K. Yamaguchi, H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
[30] W. Yang, P. Luo, and L. Lin. Clothing co-parsing by joint image segmentation and labeling. In CVPR, 2014.
[31] Y. Yue, C. Wang, K. El-Arini, and C. Guestrin. Personalized collaborative clustering. In WWW, 2014.
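
As a supplementary illustration of the evaluation protocol in Section 4.3, the mean NDCG of Eq. (29) for a single user can be sketched as follows. This is a minimal sketch rather than the evaluation code used in the experiments: the function names `ndcg_at` and `mean_ndcg` are hypothetical, labels are assumed to be 1 for positive and 0 for neutral outfits, and returning 0 when the list contains no positive outfit is an assumption that the text does not specify.

```python
import math


def ndcg_at(ranked_labels, m):
    """NDCG@m following Eq. (29); ranked_labels are in ranked order (1 = positive, 0 = neutral)."""
    def dcg(labels):
        # Gain 2^y - 1, discounted by log2(max(2, i)) with 1-based position i.
        return sum((2 ** y - 1) / math.log2(max(2, i))
                   for i, y in enumerate(labels[:m], start=1))

    ideal = dcg(sorted(ranked_labels, reverse=True))  # N_m: score of an ideal ordering
    # Assumption: define NDCG@m as 0 when there is no positive outfit at all.
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def mean_ndcg(scores, labels):
    """Mean of NDCG@m over m = 1..M for one user, ranking outfits by descending score."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    ranked = [labels[k] for k in order]
    M = len(ranked)
    return sum(ndcg_at(ranked, m) for m in range(1, M + 1)) / M
```

The figure reported as NDCG in the experiments would then correspond to averaging `mean_ndcg` over all users.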

Figure 7: Examples of top 10 recommended outfits, shown for (a) User 1, (b) User 2, (c) User 3 and (d) User 4. For each user, the results from top to bottom are obtained by CF Mean (purple), CCA (green), CF NN (blue) and FPITF (grey) respectively. The outfits marked with red boxes were created by the users.
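
As a supplementary illustration of the retrieval baselines in Section 4.4, the scores of Eqs. (30) and (31) and the normalized position measurement can be sketched as follows. This is a sketch under stated assumptions, not the code used in the experiments: all function names are hypothetical, the norm is assumed to be Euclidean, each outfit is assumed to be a list of feature vectors indexed by category, the query is assumed to be a mapping from category index to feature vector, and the position is assumed to be counted from zero so that an ideal ordering yields 0.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors (assumed choice of norm)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cf_wmean_score(candidate, query, user_outfits, n):
    """Eq. (30): weighted mean similarity of a category-n candidate to the user's category-n items.

    query maps each given category l != n to its feature vector;
    user_outfits holds the user's training outfits S_u.
    """
    # w_j grows with the similarity between the query items and outfit j.
    weights = [sum(math.exp(-dist(q, outfit[l])) for l, q in query.items())
               for outfit in user_outfits]
    total = sum(weights)  # normalize the weights to sum to 1
    sims = [math.exp(-dist(candidate, outfit[n])) for outfit in user_outfits]
    return sum(w / total * s for w, s in zip(weights, sims))


def cf_nn_score(candidate, user_outfits, n):
    """Eq. (31): similarity to the nearest category-n item among the user's outfits."""
    return max(math.exp(-dist(candidate, outfit[n])) for outfit in user_outfits)


def normalized_position(scores, truth_index):
    """Zero-based rank of the ground truth item divided by the number of ranked items."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order.index(truth_index) / len(scores)
```

Averaging `normalized_position` over all queries would give values comparable in kind to those in Table 3, where smaller is better.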
