The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Multi-Feature Discrete Collaborative Filtering for Fast Cold-Start Recommendation

Yang Xu,¹ Lei Zhu,¹* Zhiyong Cheng,² Jingjing Li,³ Jiande Sun¹
¹Shandong Normal University
²Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences)
³University of Electronic Science and Technology of China
[email protected]
*The corresponding author

Abstract

Hashing is an effective technique to address the large-scale recommendation problem, due to its high computation and storage efficiency on calculating the user preferences on items. However, existing hashing-based recommendation methods still suffer from two important problems: 1) Their recommendation process mainly relies on the user-item interactions and a single specific content feature. When the interaction history or the content feature is unavailable (the cold-start problem), their performance will be seriously deteriorated. 2) Existing methods learn the hash codes with relaxed optimization or adopt discrete coordinate descent to directly solve binary hash codes, which results in significant quantization loss or consumes considerable computation time. In this paper, we propose a fast cold-start recommendation method, called Multi-Feature Discrete Collaborative Filtering (MFDCF), to solve these problems. Specifically, a low-rank self-weighted multi-feature fusion module is designed to adaptively project the multiple content features into binary yet informative hash codes by fully exploiting their complementarity. Additionally, we develop a fast discrete optimization algorithm to directly compute the binary hash codes with simple operations. Experiments on two public recommendation datasets demonstrate that MFDCF outperforms the state-of-the-arts on various aspects.

Introduction

With the development of online applications, recommender systems have been widely adopted by many online services for helping their users find desirable items. However, it is still challenging to accurately and efficiently match items to their potential users, particularly with the ever-growing scales of items and users (Batmaz et al. 2019).

In the past, Collaborative Filtering (CF), as exemplified by Matrix Factorization (MF) algorithms (Koren, Bell, and Volinsky 2009), has demonstrated great successes in both academia and industry. MF factorizes an $n \times m$ user-item rating matrix to project both users and items into an $r$-dimensional latent feature space, where the user's preference scores for items are predicted by the inner product between their latent features. However, the time complexity of generating the top-$k$ item recommendations for all users is $O(nmr + nm\log k)$ (Zhang, Lian, and Yang 2017). Therefore, MF-based methods are often computationally expensive and inefficient when handling large-scale recommendation applications (Cheng et al. 2018; 2019).

Recent studies show that hashing-based recommendation algorithms, which encode both users and items into binary codes in Hamming space, are promising for tackling the efficiency challenge (Zhang et al. 2016; 2014). In these methods, the preference score can be efficiently computed by Hamming distance. However, learning binary codes is generally NP-hard (Håstad 2001) due to the discrete constraints. To tackle this problem, researchers resort to a two-stage hash learning procedure (Liu et al. 2014; Zhang et al. 2014): relaxed optimization and binary quantization. Continuous representations are first computed by the relaxed optimization, and the hash codes are subsequently generated by binary quantization. This learning strategy indeed simplifies the optimization challenge, but it inevitably suffers from significant quantization loss (Zhang et al. 2016). Hence, several solutions have been developed to directly optimize the binary hash codes from the matrix factorization with discrete constraints. Despite much progress, they still suffer from two problems: 1) Their recommendation process mainly relies on the user-item interactions and a single specific content feature. Under such circumstances, they cannot provide meaningful recommendations for new users (e.g. users who have no interaction history with the items). 2) They learn the hash codes with Discrete Coordinate Descent (DCD), which solves the hash codes bit-by-bit and thus results in significant quantization loss or consumes considerable computation time.

In this paper, we propose a fast cold-start recommendation method, called Multi-Feature Discrete Collaborative Filtering (MFDCF), to alleviate these problems. Specifically, we propose a low-rank self-weighted multi-feature fusion module to adaptively preserve the multiple content features of users into compact yet informative hash codes by sufficiently exploiting their complementarity. Our method is inspired by the success of multiple feature fusion in other relevant areas (Wang et al. 2018; 2012; Lu et al. 2019a; Zhu et al. 2017a).

Further, we develop an efficient discrete optimization approach to directly solve the binary hash codes by simple efficient operations without quantization errors. Finally, we evaluate the proposed method on two public recommendation datasets, and demonstrate its superior performance over state-of-the-art competing baselines.

The main contributions of this paper are summarized as follows:

• We propose a Multi-Feature Discrete Collaborative Filtering (MFDCF) method to alleviate the cold-start recommendation problem. MFDCF directly and adaptively projects the multiple content features of users into binary hash codes by sufficiently exploiting their complementarity. To the best of our knowledge, there is still no similar work.

• We develop an efficient discrete optimization strategy to directly learn the binary hash codes without relaxed quantization. This strategy avoids the performance penalties of the widely adopted discrete coordinate descent and the storage cost of the huge interaction matrix.

• We design a feature-adaptive hash code generation strategy to generate user hash codes that accurately capture the dynamic variations of cold-start user features. Experiments on the public recommendation datasets demonstrate the superior performance of the proposed method over the state-of-the-arts.

Related Work

In this paper, we investigate hashing-based collaborative filtering in the presence of multiple content features for fast cold-start recommendation. Hence, in this section, we mainly review recent advanced hashing-based recommendation and cold-start recommendation methods.

A pioneering work (Das et al. 2007) exploits Locality-Sensitive Hashing (LSH) (Gionis, Indyk, and Motwani 1999) to generate hash codes for Google News readers based on their item-sharing history similarity. Based on this, (Karatzoglou, Smola, and Weimer 2010; Zhou and Zha 2012) followed the idea of Iterative Quantization (Gong et al. 2013) to project real latent representations into hash codes. To enhance the discriminative capability of hash codes, a de-correlation constraint (Liu et al. 2014) and a Constant Feature Norm (CFN) constraint (Zhang et al. 2014) are imposed when learning user/item latent representations. The above works basically follow a two-step learning strategy: relaxed optimization and binary quantization. As indicated by (Zhang et al. 2016), this two-step approach suffers from significant quantization loss.

To alleviate the quantization loss, direct binary code learning by discrete optimization is proposed (Shen et al. 2015). In the recommendation area, Discrete Collaborative Filtering (DCF) (Zhang et al. 2016) is the first binarized collaborative filtering method and demonstrates superior performance over the aforementioned two-stage recommendation methods. However, it is not applicable to cold-start recommendation scenarios. To address the cold-start problem, on the basis of DCF, Discrete Deep Learning (DDL) (Zhang et al. 2018) applies a Deep Belief Network (DBN) to extract item representations from item content information, and combines the DBN with DCF. Discrete content-aware matrix factorization methods (Lian et al. 2017; Lian, Xie, and Chen 2019) develop discrete optimization algorithms to learn binary codes for users and items in the presence of their respective content information. Discrete Factorization Machines (DFM) (Liu et al. 2018) learns hash codes for any side feature and models the pair-wise interactions between feature codes. Besides, since the above binary cold-start recommendation frameworks solve the hash codes with bit-by-bit discrete optimization, they still consume considerable computation time.

The Proposed Method

Notations. Throughout this paper, we utilize bold lowercase letters to represent vectors and bold uppercase letters to represent matrices. All vectors in this paper denote column vectors. Non-bold letters represent scalars. We denote $tr(\cdot)$ as the trace of a matrix and $\|\cdot\|_F$ as the Frobenius norm of a matrix. We denote $sgn(\cdot): \mathbb{R} \to \{\pm 1\}$ as the round-off function.

Low-rank Self-weighted Multi-Feature Fusion

Given a training dataset $O = \{o_i\}_{i=1}^{n}$ that contains $n$ users' information represented by $M$ different content features (e.g. demographic information such as age, gender, occupation, and interaction preference extracted from item side information), the $m$-th content feature is $X^{(m)} = [x_1^{(m)}, ..., x_n^{(m)}] \in \mathbb{R}^{d_m \times n}$, where $d_m$ is the dimensionality of the $m$-th content feature. Since the user's multiple content features are quite diverse and heterogeneous, in this paper, we aim at adaptively mapping the multiple content features $\{X^{(m)}\}_{m=1}^{M}$ into a consensus multi-feature representation $H \in \mathbb{R}^{r \times n}$ ($r$ is the hash code length) in a shared homogeneous space. Specifically, it is important to consider the complementarity of the multiple content features and the generalization ability of the fusion module. Motivated by these considerations, we introduce a self-weighted fusion strategy and formulate the multi-feature fusion part as

$$\min_{W^{(m)}, H} \sum_{m=1}^{M} \|H - W^{(m)}X^{(m)}\|_F \quad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm of the matrix, $W^{(m)} \in \mathbb{R}^{r \times d_m}, m = 1, ..., M$ is the mapping matrix of the $m$-th content feature, and $H \in \mathbb{R}^{r \times n}$ is the consensus multi-feature representation. According to (Lu et al. 2019b), Eq.(1) is equivalent to

$$\min_{\mu \in \Delta_M, W^{(m)}, H} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 \quad (2)$$

where $\mu^{(m)}$ is the weight of the $m$-th content feature and measures the importance of the current content feature, and $\Delta_M \stackrel{def}{=} \{x \in \mathbb{R}^M \,|\, x_i \ge 0, \mathbf{1}^\top x = 1\}$ is the probabilistic simplex.
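To make the fusion module concrete, the following minimal sketch (ours, not from the paper; the toy sizes and random inputs are assumptions) builds the per-feature residuals $h^{(m)} = \|H - W^{(m)}X^{(m)}\|_F$ and evaluates the weighted objective of Eq.(2) for a fixed weight vector on the simplex:

```python
# A sketch of the multi-feature fusion data flow in Eq.(2); all sizes are toy.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 16                                   # users, hash code length
dims = [32, 64]                                  # d_m for M = 2 content features

X = [rng.standard_normal((d, n)) for d in dims]  # X^(m): d_m x n feature matrices
W = [rng.standard_normal((r, d)) for d in dims]  # W^(m): r x d_m mapping matrices
H = rng.standard_normal((r, n))                  # consensus representation, r x n

# Per-feature residuals h^(m) = ||H - W^(m) X^(m)||_F and the weighted
# objective of Eq.(2) for a given mu on the probabilistic simplex.
h = np.array([np.linalg.norm(H - Wm @ Xm) for Wm, Xm in zip(W, X)])
mu = np.full(len(X), 1.0 / len(X))               # uniform weights as a start
objective = np.sum(h ** 2 / mu)
print(h, objective)
```

As the optimization below shows, the weights need not be tuned by hand: the self-weighting scheme sets each $\mu^{(m)}$ from the residuals themselves.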

In real-world recommender systems, such as Taobao¹ and Amazon², there are many different kinds of users and items, which have rich and diverse characteristics. However, a specific user only has a small number of interactions in the system with limited items. Consequently, the side information of users and items would be pretty sparse, and we need to handle a very high-dimensional and sparse feature matrix. To avoid spurious correlations caused by the mapping matrix, we impose a low-rank constraint on $W^{(m)}$:

$$\min_{\mu \in \Delta_M, W^{(m)}, H} \sum_{m=1}^{M} \Big(\frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \gamma\, rank(W^{(m)})\Big) \quad (3)$$

where $\gamma$ is a penalty parameter and $rank(\cdot)$ is the rank operator of a matrix. The low-rank constraint on $W^{(m)}$ helps highlight the latent shared features across different users and handles the extremely sparse observations. Meanwhile, the low-rank constraint on $W^{(m)}$ makes the optimization more difficult. To tackle this problem, we adopt an explicit form of the low-rank constraint as follows:

$$\min_{\mu \in \Delta_M, W^{(m)}, H} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \gamma \sum_{m=1}^{M} \sum_{i=l+1}^{r} (\sigma_i(W^{(m)}))^2 \quad (4)$$

where $\sigma_i(W^{(m)})$ represents the $i$-th largest singular value of $W^{(m)}$, so the penalty covers the $(r-l)$ smallest singular values. Note that

$$\sum_{i=l+1}^{r} (\sigma_i(W^{(m)}))^2 = tr(V^{(m)\top} W^{(m)} W^{(m)\top} V^{(m)}) \quad (5)$$

where $V^{(m)}$ consists of the singular vectors corresponding to the $(r-l)$-smallest singular values of $W^{(m)}W^{(m)\top}$. Thus, the multiple content features fusion module can be rewritten as

$$\min_{\mu \in \Delta_M, W^{(m)}, H} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \gamma \sum_{m=1}^{M} tr(V^{(m)\top} W^{(m)} W^{(m)\top} V^{(m)}) \quad (6)$$

Multi-Feature Discrete Collaborative Filtering

In this paper, we fuse multiple content features into binary hash codes with matrix factorization, which has been proved to be accurate and scalable in addressing collaborative filtering problems. Discrete collaborative filtering generally maps both users and items into a joint low-dimensional Hamming space where the user-item preference is measured by the Hamming similarity between the binary hash codes.

Given a user-item rating matrix $S$ of size $n \times m$, where $n$ and $m$ are the numbers of users and items respectively, each entry $s_{ij}$ indicates the rating of a user $i$ for an item $j$. Let $b_i \in \{\pm 1\}^r$ denote the binary hash codes of the $i$-th user, and $d_j \in \{\pm 1\}^r$ denote the binary hash codes of the $j$-th item; the rating of user $i$ for item $j$ is approximated by the Hamming similarity $(\frac{1}{2} + \frac{1}{2r} b_i^\top d_j)$. Thus, the goal is to learn the user binary matrix $B = [b_1, ..., b_n] \in \{\pm 1\}^{r \times n}$ and the item binary matrix $D = [d_1, ..., d_m] \in \{\pm 1\}^{r \times m}$, where $r \ll \min(n, m)$ is the hash code length. Similar to the problem of conventional collaborative filtering, the basic discrete collaborative filtering can be formulated as:

$$\min_{B, D} \|S - B^\top D\|_F^2 \quad s.t.\; B \in \{\pm 1\}^{r \times n},\; D \in \{\pm 1\}^{r \times m} \quad (7)$$

To address the sparsity and cold-start problems, we integrate multiple content features into the above model by substituting the user binary feature matrix $B$ with the rotated multi-feature representation $RH$ ($R \in \mathbb{R}^{r \times r}$ is a rotation matrix) and keeping their consistency during the optimization process. The formula is given as follows:

$$\min_{B, D, R} \|S - H^\top R^\top D\|_F^2 + \beta \|B - RH\|_F^2 \quad s.t.\; B \in \{\pm 1\}^{r \times n},\; D \in \{\pm 1\}^{r \times m},\; R^\top R = I_r \quad (8)$$

This formulation has three advantages: 1) Only one of the decomposed variables is imposed with the discrete constraint. As shown in the optimization part, the hash codes can be learned with a simple $sgn(\cdot)$ operation instead of the bit-by-bit discrete optimization used by existing discrete recommendation methods, and the second regularization term guarantees acceptable information loss. 2) The learned hash codes can reflect the user's multiple content features via $H$ and involve the latent interactive features in $S$ simultaneously. 3) We extract the user's interactive preference from the side information of their rated items as content features. This design not only avoids the approximation of the item binary matrix $D$ and reduces the complexity of the proposed model, but also effectively captures the content features of items.

Overall Objective Formulation

By integrating the above two parts into a unified learning framework, we derive the overall objective formulation of Multi-Feature Discrete Collaborative Filtering (MFDCF) as:

$$\min_{\mu^{(m)}, B, D, H, R, V^{(m)}, W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \alpha\|S - H^\top R^\top D\|_F^2 + \beta\|B - RH\|_F^2 + \gamma \sum_{m=1}^{M} tr(V^{(m)\top} W^{(m)} W^{(m)\top} V^{(m)})$$
$$s.t.\; B \in \{\pm 1\}^{r \times n},\; D \in \{\pm 1\}^{r \times m},\; R^\top R = I_r \quad (9)$$

where $\alpha, \beta, \gamma$ are balance parameters. The first term projects multiple content features of users into a shared homogeneous space. The second and third terms minimize the information loss during the process of integrating the multiple content features with the basic discrete CF. The last term is a low-rank constraint for $W^{(m)}$, which can highlight the latent shared features across different users.

¹www.taobao.com
²www.amazon.com
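Before turning to the optimization, a minimal sketch (ours; toy sizes assumed) of why the factorization in Eq.(7) makes online scoring cheap: once the codes are learned, every preference score is just the Hamming similarity $\frac{1}{2} + \frac{1}{2r}b_i^\top d_j$, computed for all user-item pairs by one matrix product:

```python
# A sketch of preference scoring with binary codes, following Eq.(7).
import numpy as np

rng = np.random.default_rng(1)
r, n, m = 16, 5, 8
B = rng.choice([-1, 1], size=(r, n))         # user codes, one column per user
D = rng.choice([-1, 1], size=(r, m))         # item codes, one column per item

scores = 0.5 + (B.T @ D) / (2 * r)           # n x m preference matrix in [0, 1]
top_k = np.argsort(-scores, axis=1)[:, :3]   # top-3 item indices per user
print(top_k)
```

In a deployed system the ±1 codes would be packed into machine words so that $b_i^\top d_j$ reduces to XOR plus popcount; the dense dot product above is only for illustration.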

Fast Discrete Optimization

Solving the hash codes in Eq.(9) is essentially an NP-hard problem due to the discrete constraint on the binary feature matrix. Existing discrete recommendation methods always learn the hash codes bit-by-bit with DCD (Shen et al. 2015). Although this strategy alleviates the quantization loss problem caused by the conventional two-step relaxing-rounding optimization strategy, it is still time-consuming.

In this paper, with the favorable support of the objective formulation, we propose to directly learn the discrete hash codes with fast optimization. Specifically, different from existing discrete recommendation methods (Zhang et al. 2016; 2018; Lian et al. 2017; Liu et al. 2018), we avoid explicitly computing the user-item rating matrix $S$, and achieve linear computation and storage efficiency. We propose an effective optimization algorithm based on the Augmented Lagrangian Multiplier (ALM) method (Lin, Chen, and Ma 2010; Murty 2007). In particular, we introduce an auxiliary variable $Z_R$ to separate the constraint on $R$, and transform the objective function Eq.(9) into an equivalent one that can be tackled more easily:

$$\min_{\Theta} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \alpha\|S - H^\top R^\top D\|_F^2 + \beta\|B - RH\|_F^2 + \gamma \sum_{m=1}^{M} tr(V^{(m)\top} W^{(m)} W^{(m)\top} V^{(m)}) + \frac{\lambda}{2}\Big\|R - Z_R + \frac{G_R}{\lambda}\Big\|_F^2$$
$$s.t.\; B \in \{\pm 1\}^{r \times n},\; D \in \{\pm 1\}^{r \times m},\; R^\top R = I_r,\; Z_R^\top Z_R = I_r \quad (10)$$

where $\Theta$ denotes the variables that need to be solved in the objective function, $G_R \in \mathbb{R}^{r \times r}$ measures the difference between the target and auxiliary variables, and $\lambda > 0$ is a balance parameter. With this transformation, we follow the alternative optimization process by updating each of $\mu^{(m)}, B, D, H, R, V^{(m)}, W^{(m)}, Z_R$ and $G_R$, given the others fixed.

Step 1: learning $\mu^{(m)}$. For convenience, we denote $\|H - W^{(m)}X^{(m)}\|_F$ as $h^{(m)}$. By fixing the other variables, we ignore the terms that are irrelevant to $\mu^{(m)}$. The original problem can be rewritten as:

$$\min_{\mu^{(m)} \ge 0,\, \mathbf{1}^\top \mu = 1} \sum_{m=1}^{M} \frac{(h^{(m)})^2}{\mu^{(m)}} \quad (11)$$

With the Cauchy-Schwarz inequality, we derive that

$$\sum_{m=1}^{M} \frac{(h^{(m)})^2}{\mu^{(m)}} \stackrel{(a)}{=} \Big(\sum_{m=1}^{M} \frac{(h^{(m)})^2}{\mu^{(m)}}\Big)\Big(\sum_{m=1}^{M} \mu^{(m)}\Big) \stackrel{(b)}{\ge} \Big(\sum_{m=1}^{M} h^{(m)}\Big)^2$$

where (a) holds since $\mathbf{1}^\top \mu = 1$ and the equality in (b) holds when $\mu^{(m)} \propto h^{(m)}$. Since $(\sum_{m=1}^{M} h^{(m)})^2 = const$, we can obtain the optimal $\mu^{(m)}$ in Eq.(11) by

$$\mu^{(m)} = \frac{h^{(m)}}{\sum_{m=1}^{M} h^{(m)}} \quad (12)$$

Step 2: learning $W^{(m)}$. Removing the terms that are irrelevant to $W^{(m)}$, the optimization formula is rewritten as

$$\min_{W^{(m)}} \sum_{m=1}^{M} \frac{1}{\mu^{(m)}} \|H - W^{(m)}X^{(m)}\|_F^2 + \gamma \sum_{m=1}^{M} tr(V^{(m)\top} W^{(m)} W^{(m)\top} V^{(m)}) \quad (13)$$

We calculate the derivative of Eq.(13) with respect to $W^{(m)}$ and set it to zero:

$$\gamma V^{(m)}V^{(m)\top} W^{(m)} + \frac{1}{\mu^{(m)}} W^{(m)}X^{(m)}X^{(m)\top} = \frac{1}{\mu^{(m)}} HX^{(m)\top} \quad (14)$$

By using the following substitutions,

$$A = \gamma V^{(m)}V^{(m)\top}, \quad B = \frac{1}{\mu^{(m)}} X^{(m)}X^{(m)\top}, \quad C = \frac{1}{\mu^{(m)}} HX^{(m)\top} \quad (15)$$

Eq.(14) can be rewritten as $AW + WB = C$, which can be efficiently solved by the Sylvester operation in Matlab.

Step 3: learning $R$. Similarly, the optimization formula for updating $R$ can be represented as

$$\min_{R^\top R = I_r} tr\Big(-2\alpha R^\top DS^\top H^\top + \alpha R^\top DD^\top R H H^\top - 2\beta R^\top B H^\top - \lambda R^\top \big(Z_R - \frac{G_R}{\lambda}\big)\Big) \quad (16)$$

We substitute $R^\top DD^\top R H H^\top$ with $R^\top DD^\top Z_R H H^\top$ via the auxiliary variable $Z_R$, and then Eq.(16) can be transformed into the following form

$$\max_{R^\top R = I_r} tr(R^\top C), \quad C = 2\alpha DS^\top H^\top - \alpha DD^\top Z_R H H^\top + 2\beta B H^\top + \lambda Z_R - G_R \quad (17)$$

The optimal $R$ is defined as $R = PQ^\top$, where $P$ and $Q$ are comprised of the left-singular and right-singular vectors of $C$ respectively (Zhu et al. 2017b).
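A minimal sketch of Steps 1-3 (ours; the variable shapes follow the paper, and SciPy's `solve_sylvester` plays the role of the Matlab Sylvester solve mentioned above):

```python
# Sketches of the closed-form updates in Steps 1-3.
import numpy as np
from scipy.linalg import solve_sylvester

def update_mu(H, W, X):
    """Step 1, Eq.(12): weights proportional to the residuals h^(m)."""
    h = np.array([np.linalg.norm(H - Wm @ Xm) for Wm, Xm in zip(W, X)])
    return h / h.sum()

def update_W(H, X, V, mu, gamma):
    """Step 2, Eq.(14)-(15): one Sylvester equation A W + W B = C per feature."""
    W = []
    for Xm, Vm, mum in zip(X, V, mu):
        A = gamma * Vm @ Vm.T          # low-rank surrogate term, r x r
        Bm = (Xm @ Xm.T) / mum         # data term, d_m x d_m
        C = (H @ Xm.T) / mum           # right-hand side, r x d_m
        W.append(solve_sylvester(A, Bm, C))
    return W

def update_R(C):
    """Step 3: optimal rotation for max tr(R^T C) is R = P Q^T, C = P S Q^T."""
    P, _, Qt = np.linalg.svd(C)
    return P @ Qt
```

The same Procrustes-style SVD update also serves Step 7 below, where $Z_R$ maximizes $tr(Z_R^\top C_{zr})$ under the same orthogonality constraint.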

Note that the user-item rating matrix $S \in \mathbb{R}^{n \times m}$ is included in the term $DS^\top H^\top$ when updating $R$. In real-world retail giants, such as Taobao and Amazon, there are hundreds of millions of users and even more items. In consequence, the user-item rating matrix $S$ would be pretty enormous and sparse. If we compute $S$ directly, the computational complexity will be $O(mn)$, and it is extremely expensive to calculate and store $S$. In this paper, we apply the singular value decomposition to obtain the left singular and right singular vectors as well as the corresponding singular values of $S$. We utilize a diagonal matrix $\Sigma_o^S$ to store the $o$-largest ($o \ll \min\{m, n\}$) singular values, and employ an $n \times o$ matrix $P_o^S$ and an $o \times m$ matrix $Q_o^S$ to store the corresponding left singular and right singular vectors respectively. We substitute $S$ with $P_o^S \Sigma_o^S Q_o^S$ and the computational complexity can be reduced to $O(\max\{m, n\})$. Thus, the calculation of $C$ can be transformed as

$$C = 2\alpha D Q_o^{S\top} \Sigma_o^S P_o^{S\top} H^\top - \alpha DD^\top Z_R H H^\top + 2\beta B H^\top + \lambda Z_R - G_R \quad (18)$$

With Eq.(18), both the computation and storage costs can be decreased with the guarantee of accuracy.

Step 4: learning $H$. We calculate the derivative of the objective function with respect to $H$ and set it to zero, then we get

$$H = \Big(\sum_{m=1}^{M} \frac{1}{\mu^{(m)}} I_r + \alpha R^\top D D^\top R + \beta I_r\Big)^{-1} \Big(\sum_{m=1}^{M} \frac{1}{\mu^{(m)}} W^{(m)}X^{(m)} + \alpha R^\top D S^\top + \beta R^\top B\Big) \quad (19)$$

where $S$ is substituted with $P_o^S \Sigma_o^S Q_o^S$, and then we have

$$\alpha R^\top D S^\top = \alpha R^\top D Q_o^{S\top} \Sigma_o^S P_o^{S\top} \quad (20)$$

The time complexity of computing $R^\top D S^\top$ is reduced to $O(\max\{m, n\})$.

Step 5: learning $B$, $D$. We calculate the derivatives of the objective function with respect to $B$ and $D$ respectively, and set them to zero; then we can obtain the closed solutions of $B$, $D$ as

$$D = sgn((H^\top R^\top)^\top S), \quad B = sgn(RH) \quad (21)$$

where $S$ is also substituted with $P_o^S \Sigma_o^S Q_o^S$, and the update rule of $D$ is transformed as

$$D = sgn((H^\top R^\top)^\top P_o^S \Sigma_o^S Q_o^S) \quad (22)$$

Step 6: learning $V^{(m)}$. As described in Eq.(5), $V^{(m)}$ is stacked by the singular vectors which correspond to the $(r-l)$-smallest singular values of $W^{(m)}W^{(m)\top}$. Thus we can solve the eigen-decomposition problem to get $V^{(m)}$:

$$V^{(m)} \leftarrow svd(W^{(m)}W^{(m)\top}) \quad (23)$$

Step 7: learning $Z_R$. The objective function with respect to $Z_R$ can be represented as

$$\max_{Z_R^\top Z_R = I_r} tr(Z_R^\top C_{zr}) \quad (24)$$

where $C_{zr} = -\alpha DD^\top R H H^\top + \lambda R + G_R$. The optimal $Z_R$ is defined as $Z_R = P_{zr}Q_{zr}^\top$, where $P_{zr}$ and $Q_{zr}$ are comprised of the left-singular and right-singular vectors of $C_{zr}$ respectively.

Step 8: learning $G_R$. By fixing the other variables, the update rule of $G_R$ is

$$G_R = G_R + \lambda(R - Z_R) \quad (25)$$
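A sketch of this scalability device together with the closed-form updates of Steps 4-5 (ours; toy sizes and random stand-ins are assumptions). The sparse $S$ is never densified: a truncated SVD supplies the factors $P_o^S, \Sigma_o^S, Q_o^S$, and every term involving $S$ touches only these factors:

```python
# Sketch: factored rating matrix plus the H, B, D updates of Steps 4-5.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(2)
n, m, r, o = 200, 300, 16, 10            # users, items, code length, SVD rank
alpha, beta = 1.0, 1.0
mu = np.array([0.5, 0.5])                # feature weights from Step 1

S = sp.random(n, m, density=0.02, random_state=2, format="csr")
u, s, vt = svds(S, k=o)                  # S ~= u @ diag(s) @ vt, o << min(n, m)

D = rng.choice([-1.0, 1.0], size=(r, m))
B = rng.choice([-1.0, 1.0], size=(r, n))
R, _ = np.linalg.qr(rng.standard_normal((r, r)))   # an orthogonal rotation
WX = rng.standard_normal((r, n))         # stand-in for sum_m W^(m) X^(m) / mu^(m)

# Step 4, Eq.(19)-(20): closed-form H; R^T D S^T uses only the SVD factors.
RtDSt = R.T @ D @ (vt.T * s) @ u.T       # r x n, at O(max(m, n)) cost
lhs = (1.0 / mu).sum() * np.eye(r) + alpha * (R.T @ D @ D.T @ R) + beta * np.eye(r)
H = np.linalg.solve(lhs, WX + alpha * RtDSt + beta * R.T @ B)

# Step 5, Eq.(21)-(22): one-shot sign updates for the binary matrices
# (np.sign maps exact zeros to 0; a real implementation would tie-break).
B = np.sign(R @ H)
D = np.sign(R @ H @ (u * s) @ vt)        # sgn(R H S) with S in factored form
```

Note how the $sgn(\cdot)$ updates replace the bit-by-bit DCD loop of prior discrete methods with two dense matrix products.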
Feature-adaptive Hash Code Generation for Cold-start Users

In the process of online recommendation, we aim to map the multiple content features of the target users into binary hash codes with the learned hash projection matrices $\{W^{(m)}\}_{m=1}^{M}$. When cold-start users have no rating history in the training set and are only associated with initial demographic information, the fixed feature weights obtained from offline hash code learning cannot address the feature-missing problem. In this paper, with the support of offline hash learning, we propose to generate hash codes for cold-start users with a self-weighting scheme. The objective function is formulated as

$$\min_{B_u \in \{\pm 1\}^{r \times n_u}} \sum_{m=1}^{M} \|B_u - W^{(m)}X_u^{(m)}\|_F \quad (26)$$

where $W^{(m)} \in \mathbb{R}^{r \times d_m}$ is the linear projection matrix from Eq.(9), $X_u^{(m)}$ is the content feature of the target users, and $n_u$ is the number of target users. As proved by (Lu et al. 2019b), Eq.(26) can be shown to be equivalent to

$$\min_{B_u \in \{\pm 1\}^{r \times n_u},\, \mu_u \in \Delta_M} \sum_{m=1}^{M} \frac{1}{\mu_u^{(m)}} \|B_u - W^{(m)}X_u^{(m)}\|_F^2 \quad (27)$$

We employ alternating optimization to update $\mu_u^{(m)}$ and $B_u$. The update rules are

$$\mu_u^{(m)} = \frac{h_u^{(m)}}{\sum_{m=1}^{M} h_u^{(m)}}, \quad h_u^{(m)} = \|B_u - W^{(m)}X_u^{(m)}\|_F, \quad B_u = sgn\Big(\sum_{m=1}^{M} \frac{1}{\mu_u^{(m)}} W^{(m)}X_u^{(m)}\Big) \quad (28)$$
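A minimal sketch of this alternating scheme (ours; it assumes the projection matrices `W` come from the offline stage and `Xu` holds the target users' available content features):

```python
# Sketch of feature-adaptive cold-start code generation, Eqs.(26)-(28).
import numpy as np

def cold_start_codes(W, Xu, n_iter=10):
    """Alternate the mu_u and B_u updates of Eq.(28) for new users.

    W  : list of offline projection matrices W^(m), each r x d_m
    Xu : list of content feature matrices X_u^(m), each d_m x n_u
    """
    M = len(W)
    mu = np.full(M, 1.0 / M)                   # start from uniform weights
    proj = [Wm @ Xm for Wm, Xm in zip(W, Xu)]  # fixed per-feature projections
    for _ in range(n_iter):
        Bu = np.sign(sum(p / w for p, w in zip(proj, mu)))  # B_u-step
        h = np.array([np.linalg.norm(Bu - p) for p in proj])
        mu = h / h.sum()                       # self-weighting step
    return Bu
```

Because the weights are recomputed per query from the residuals, a feature that fits a user poorly (large $h_u^{(m)}$) receives a large $\mu_u^{(m)}$ and is down-weighted by the $1/\mu_u^{(m)}$ factor in the $B_u$ update, which is how the scheme adapts to missing or unreliable features.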

Figure 1: Cold-start recommendation performance on MovieLens-1M.

Figure 2: Cold-start recommendation performance on BookCrossing.

Experiments

Evaluation Datasets

We evaluate the proposed method on two public recommendation datasets: MovieLens-1M³ and BookCrossing⁴. In these two datasets, each user has only one rating for an item.

³https://grouplens.org/datasets/movielens/
⁴https://grouplens.org/datasets/book-crossing/

• MovieLens-1M: This dataset is collected from the MovieLens website by GroupLens Research. It originally includes 1,000,000 ratings from 6,040 users for 3,952 movies. The rating score is from 1 to 5 with a granularity of 1. The users in this dataset are associated with demographic information (e.g. gender, age, and occupation), and the movies are related to 3-5 labels from a dictionary of 18 genre labels.

• BookCrossing: This dataset is collected by Cai-Nicolas Ziegler from the Book-Crossing community. It contains 278,858 users providing 1,149,780 ratings (containing implicit and explicit feedback) about 271,379 books. The rating score is from 1 to 10 with an interval of 1 for explicit feedback, or expressed by 0 for implicit feedback. Most users in this dataset are associated with demographic information (e.g. age and location).

Considering the extreme sparsity of the original BookCrossing dataset, we remove the users with less than 20 ratings and the items rated by less than 20 users. After the filtering, there are 2,151 users, 6,830 items, and 180,595 ratings left in the BookCrossing dataset. For the MovieLens-1M dataset, we keep all users and items without any filtering. The statistics of the datasets are summarized in Table 1. The bag-of-words encoding method is used to extract the side information of the items, and the one-hot encoding approach is adopted to generate the feature representation of the users' demographic information. To accelerate the running speed, we follow (Wang et al. 2017) and perform PCA to reduce the interactive preference feature dimension to 128. In our experiments, we randomly select 20% of users as cold-start users, and their ratings are removed. We repeat the experiments with 5 random splits and report the average values as the experimental results.

Table 1: Statistics of experimental datasets.

Dataset        #User   #Item   #Rating     Sparsity
MovieLens-1M   6,040   3,952   1,000,209   95.81%
BookCrossing   2,151   6,830   180,595     98.77%

Evaluation Metrics

The goal of our proposed method is to find the top-k items that a user may be interested in. In our experiments, we adopt the evaluation metric Accuracy@k (Zhang et al. 2018; Du et al. 2018) to test whether the target user's favorite items appear in the top-k recommendation list. Given the value of k, similar to (Zhang et al. 2018; Du et al. 2018), we calculate the Accuracy@k value as:

$$Accuracy@k = \frac{\#Hit@k}{|D_{test}|} \quad (29)$$

where $|D_{test}|$ is the number of test cases, and $\#Hit@k$ is the total number of hits in the test set.
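A minimal sketch of this protocol (ours; it assumes one held-out favorite item per test case, which matches how hits are counted here):

```python
# Sketch of the Accuracy@k computation in Eq.(29).
import numpy as np

def accuracy_at_k(scores, held_out_items, k):
    """scores: (num_users, num_items); held_out_items: one item id per user."""
    top_k = np.argsort(-scores, axis=1)[:, :k]   # top-k item ids per user
    hits = sum(item in row for row, item in zip(top_k, held_out_items))
    return hits / len(held_out_items)
```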

Figure 3: Performance variations with the key parameters on MovieLens-1M. (Three Accuracy@20 surfaces: one panel varies β and γ with α fixed to 1000, one varies α and γ with β fixed to 1000, and one varies α and β with γ fixed to 1.)

Figure 4: (a) Convergence curve (objective value vs. number of iterations); (b)-(c) efficiency vs. hash code length and data size (training time in seconds per iteration) on MovieLens and BookCrossing.

Evaluation Baselines

In this paper, we compare our approach with two state-of-the-art continuous value based recommendation methods and two hashing based binary recommendation methods.

• Content Based Filtering KNN (CBFKNN) (Gantner et al. 2010) is a straightforward cold-start recommendation approach based on user similarity. Specifically, it adopts the low-dimensional projection of user attributes to calculate the user similarity.

• Discrete Factorization Machines (DFM) (Liu et al. 2018) is the first binarized factorization machines method that learns the hash codes for any side feature and models the pair-wise interactions between feature codes.

• Discrete Deep Learning (DDL) (Zhang et al. 2018) is a binary deep recommendation approach. It adopts a Deep Belief Network to extract item representations from item side information, and combines the DBN with DCF to solve the cold-start recommendation problem.

• Zero-Shot Recommendation (ZSR) (Li et al. 2019b) considers the cold-start recommendation problem as a zero-shot learning problem (Li et al. 2019a). It extracts the user preference for each item from user attributes.

In the experiments, we adopt 5-fold cross validation on random splits of the training data to tune the optimal hyper-parameters of all compared approaches. All the best hyper-parameters are found by grid search.

Accuracy Comparison

In this subsection, we evaluate the recommendation accuracy of MFDCF and the baselines in the cold-start recommendation scenario. Figures 1 and 2 demonstrate the Accuracy@k of the compared approaches on two real-world recommendation datasets for the cold-start recommendation task. Compared with existing hashing-based recommendation approaches, the proposed MFDCF consistently outperforms the compared baselines. DFM exploits the factorization machine to model the potential interactions between user characteristics and product features. However, it ignores the collaborative interaction. DDL is based on discrete collaborative filtering. It adopts a DBN to generate item feature representations from their side information. Nevertheless, the structure of the DBN is independent of the overall optimization process, which limits the learning capability of DDL. Additionally, these experimental results show that the proposed MFDCF outperforms the compared continuous value based hybrid recommendation methods under the same cold-start settings. The better performance of MFDCF than CBFKNN and ZSR validates the effects of the proposed multiple feature fusion strategy.

Table 2: Training time (in seconds) comparison with two hashing-based recommendation approaches on MovieLens-1M.

Method/#Bits   8         16        32        64        128
DDL            3148.86   3206.09   3289.81   3372.19   3855.81
DFM            166.55    170.86    172.75    196.45    246.5
Ours           58.87     60.49     62.55     68.41     90.17

Model Analysis

Parameter and convergence sensitivity analysis. We conduct experiments to observe the performance variations with the involved parameters α, β, γ. We fix the hash code length as 128 bits and report results on MovieLens-1M. Similar results can be found on the other dataset and hash code lengths. Since α, β, and γ are equipped in the same objective function, we change their values from the range of $\{10^{-6}, 10^{-4}, 10^{-1}, 1, 10, 10^{3}, 10^{6}\}$ while fixing the other parameters. Detailed experimental results are presented in Figure 3. From it, we can observe that the performance is relatively better when α is in the range of $\{10^{6}, 10^{3}, 10\}$, β is in the range of $\{10^{3}, 10\}$, and γ is in the range of $\{1, 10\}$. The performance variations with γ show that the low-rank constraint performs well in highlighting the latent shared features across different users. The convergence curves recording the objective function value of the MFDCF method with the number of iterations are shown in Figure 4(a). This experimental result indicates that our proposed method converges very fast.

Efficiency v.s. hash code length and data size. We conduct experiments to investigate the efficiency variations of MFDCF with the increase of hash code length and training data size on two datasets. The average time cost per training iteration is shown in Figure 4(b-c). When the hash code length is fixed as 32, each round of training iteration costs several seconds and scales linearly with the increase of data size. When running MFDCF on 100% training data, each round of iteration scales quadratically with the increase of code length, since the time complexity of the optimization process is $O(\max\{mr^2, nr^2\})$.

Run time comparison. In this experiment, we compare the computation efficiency of our approach with two state-of-the-art hashing-based recommendation methods, DFM and DDL. Table 2 reports the training time of these methods on MovieLens-1M using a 3.4GHz Intel Core(TM) i7-6700 CPU. Compared with DDL and DFM, our MFDCF is about 50 and 3 times faster, respectively. The superior efficiency of the proposed method is attributed to the fact that both DDL and DFM iteratively learn the hash codes bit-by-bit with discrete coordinate descent. Additionally, DDL requires updating the parameters of the DBN iteratively, which consumes more time.

Conclusion

In this paper, we design a unified multi-feature discrete collaborative filtering method that projects multiple content features of users into binary hash codes to support fast cold-start recommendation. Our model has four advantages: 1) it handles the data sparsity problem with a low-rank constraint; 2) it enhances the discriminative capability of the hash codes with multi-feature binary embedding; 3) it generates feature-adaptive hash codes for varied cold-start users; and 4) it achieves computation- and storage-efficient discrete binary optimization. Experiments on two public recommendation datasets demonstrate the state-of-the-art performance of the proposed method.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive and helpful suggestions. The work is partially supported by the National Natural Science Foundation of China (61802236, 61902223, U1836216, U1736122), in part by the Natural Science Foundation of Shandong, China (No. ZR2019QF002), in part by the Youth Innovation Project of Shandong Universities, China (No. 2019KJN040), in part by the Natural Science Foundation for Distinguished Young Scholars of Shandong Province (JQ201718), and in part by the Taishan Scholar Project of Shandong, China.

References

Batmaz, Z.; Yurekli, A.; Bilge, A.; and Kaleli, C. 2019. A review on deep learning for recommender systems: challenges and remedies. Artif. Intell. Rev. 52(1):1–37.

Cheng, Z.; Ding, Y.; He, X.; Zhu, L.; Song, X.; and Kankanhalli, M. S. 2018. A^3NCF: An adaptive aspect attention model for rating prediction. In IJCAI, 3748–3754.

Cheng, Z.; Chang, X.; Zhu, L.; Catherine Kanjirathinkal, R.; and Kankanhalli, M. S. 2019. MMALFM: Explainable recommendation by leveraging reviews and images. TOIS 37(2):1–28.

Das, A.; Datar, M.; Garg, A.; and Rajaram, S. 2007. Google news personalization: Scalable online collaborative filtering. In WWW, 271–280.

Du, X.; Yin, H.; Chen, L.; Wang, Y.; and Zhou, X. 2018. Personalized video recommendation using rich contents from videos. TKDE.

Gantner, Z.; Drumond, L.; Freudenthaler, C.; Rendle, S.; and Schmidt-Thieme, L. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In ICDM, 176–185.

Gionis, A.; Indyk, P.; and Motwani, R. 1999. Similarity search in high dimensions via hashing. In VLDB, 518–529.

Gong, Y.; Lazebnik, S.; Gordo, A.; and Perronnin, F. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI 35(12):2916–2929.

Håstad, J. 2001. Some optimal inapproximability results. J. ACM 48(4):798–859.

Karatzoglou, A.; Smola, A. J.; and Weimer, M. 2010. Collaborative filtering on a budget. In AISTATS, 389–396.

Koren, Y.; Bell, R. M.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. IEEE Computer 42(8):30–37.

Li, J.; Jing, M.; Lu, K.; Ding, Z.; Zhu, L.; and Huang, Z. 2019a. Leveraging the invariant side of generative zero-shot learning. In CVPR, 7402–7411.

Li, J.; Jing, M.; Lu, K.; Zhu, L.; Yang, Y.; and Huang, Z. 2019b. From zero-shot learning to cold-start recommendation. In AAAI, 4189–4196.

Lian, D.; Liu, R.; Ge, Y.; Zheng, K.; Xie, X.; and Cao, L. 2017. Discrete content-aware matrix factorization. In KDD, 325–334.

Lian, D.; Xie, X.; and Chen, E. 2019. Discrete matrix factorization and extension for fast item recommendation. TKDE, DOI: 10.1109/TKDE.2019.2951386.

Lin, Z.; Chen, M.; and Ma, Y. 2010. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. CoRR abs/1009.5055.

Liu, X.; He, J.; Deng, C.; and Lang, B. 2014. Collaborative hashing. In CVPR, 2147–2154.

Liu, H.; He, X.; Feng, F.; Nie, L.; Liu, R.; and Zhang, H. 2018. Discrete factorization machines for fast feature-based recommendation. In IJCAI, 3449–3455.

Lu, X.; Zhu, L.; Cheng, Z.; Li, J.; Nie, X.; and Zhang, H. 2019a. Flexible online multi-modal hashing for large-scale multimedia retrieval. In MM, 1129–1137.

Lu, X.; Zhu, L.; Cheng, Z.; Nie, L.; and Zhang, H. 2019b. Online multi-modal hashing with dynamic query-adaption. In SIGIR, 715–724.

Murty, K. G. 2007. Nonlinear programming theory and algorithms. Technometrics 49(1):105.

Shen, F.; Shen, C.; Liu, W.; and Shen, H. T. 2015. Supervised discrete hashing. In CVPR, 37–45.

Wang, M.; Li, H.; Tao, D.; Lu, K.; and Wu, X. 2012. Multimodal graph-based reranking for web image search. TIP 21(11):4649–4661.

Wang, M.; Fu, W.; Hao, S.; Liu, H.; and Wu, X. 2017. Learning on big graph: Label inference and regularization with anchor hierarchy. TKDE 29(5):1101–1114.

Wang, M.; Luo, C.; Ni, B.; Yuan, J.; Wang, J.; and Yan, S. 2018. First-person daily activity recognition with manipulated object proposals and non-linear feature fusion. TCSVT 28(10):2946–2955.

Zhang, Z.; Wang, Q.; Ruan, L.; and Si, L. 2014. Preference preserving hashing for efficient recommendation. In SIGIR, 183–192.

Zhang, H.; Shen, F.; Liu, W.; He, X.; Luan, H.; and Chua, T. 2016. Discrete collaborative filtering. In SIGIR, 325–334.

Zhang, Y.; Lian, D.; and Yang, G. 2017. Discrete personalized ranking for fast collaborative filtering from implicit feedback. In AAAI, 1669–1675.

Zhang, Y.; Yin, H.; Huang, Z.; Du, X.; Yang, G.; and Lian, D. 2018. Discrete deep learning for fast content-aware recommendation. In WSDM, 717–726.

Zhou, K., and Zha, H. 2012. Learning binary codes for collaborative filtering. In KDD, 498–506.

Zhu, L.; Huang, Z.; Liu, X.; He, X.; Song, J.; and Zhou, X. 2017a. Discrete multi-modal hashing with canonical views for robust mobile landmark search. TMM 19(9):2066–2079.

Zhu, L.; Shen, J.; Xie, L.; and Cheng, Z. 2017b. Unsupervised visual hashing with semantic assistant for content-based image retrieval. TKDE 29(2):472–486.
