Item-based Collaborative Filtering with BERT

Yuyangzi Fu (eBay Inc.) [email protected]
Tian Wang (eBay Inc.) [email protected]

Abstract

In e-commerce, recommender systems have become an indispensable part of helping users explore the available inventory. In this work, we present a novel approach for item-based collaborative filtering, by leveraging BERT to understand items and score relevancy between different items. Our proposed method can address problems that plague traditional recommender systems, such as cold start and "more of the same" recommended content. We conducted experiments on a large-scale real-world dataset in a full cold-start scenario, and the proposed approach significantly outperforms the popular Bi-LSTM model.

1 Introduction

Recommender systems are an integral part of e-commerce platforms, helping users pick out items of interest from large inventories at scale. Traditional recommendation algorithms can be divided into two types: collaborative filtering-based (Schafer et al., 2007; Linden et al., 2003) and content-based (Lops et al., 2011; Pazzani and Billsus, 2007). However, both have limitations when applied directly to real-world e-commerce platforms. For example, traditional user-based collaborative filtering algorithms (see, e.g., Schafer et al., 2007) find the users most similar to the seed user based on their rated items, and then recommend new items which those users rated highly. For item-based collaborative filtering (see, e.g., Linden et al., 2003), given a seed item, recommended items are chosen to have the most similar user feedback. However, for highly active e-commerce platforms with large and constantly changing inventory, both approaches are severely impacted by data sparsity in the user-item interaction matrix.

Content-based recommendation algorithms calculate similarities in content between candidate items and seed items that the user has provided feedback for (implicit, e.g. clicking, or explicit, e.g. rating), and then select the most similar items to recommend. Although less impacted by data sparsity, due to their reliance on content rather than behavior, they can struggle to provide novel recommendations that may activate the user's latent interests, a highly desirable quality for recommender systems (Castells et al., 2011).

Due to the recent success of neural networks in multiple AI domains (LeCun et al., 2015) and their superior modeling capacity, a number of research efforts have explored new recommendation algorithms based on neural networks (see, e.g., Barkan and Koenigstein, 2016; He et al., 2017; Hidasi et al., 2015; Covington et al., 2016).

In this paper, we propose a novel approach for item-based collaborative filtering, by leveraging the BERT model (Devlin et al., 2018) to understand item titles and model relationships between different items. We adapt the masked language modelling and next sentence prediction tasks from the natural language context to the e-commerce context. The contributions of this work are summarized as follows:

• Instead of relying on unique item identifiers to aggregate history information, we use only the item's title as content, along with token embeddings, to solve the cold-start problem, which is the main shortcoming of traditional recommendation algorithms.

• By training the model with user behavior data, our model learns the user's latent interests rather than mere item similarities, whereas traditional recommendation algorithms and some pair-wise deep learning algorithms only provide items similar to those the user has already bought.

• We conduct experiments on a large-scale e-commerce dataset, demonstrating the effectiveness of our approach and producing recommendation results of higher quality.

2 Item-based Collaborative Filtering with BERT

As mentioned earlier, for a dynamic e-commerce platform, items enter and leave the market continuously, resulting in an extremely sparse user-item interaction matrix. In addition to the challenge of long-tail recommendations, this also requires the recommender system to be continuously retrained and redeployed in order to accommodate newly listed items. To address these issues, in our proposed approach, instead of representing each item with a unique identifier, we choose to represent each item with its title tokens, which are further mapped to a continuous vector representation. By doing so, two items with the same title are essentially treated as the same item, and user behaviors can be aggregated accordingly. For a newly listed item in the cold-start setting, the model can exploit the similarity of its title to previously observed titles in order to find relevant recommended items.

The goal of item-based collaborative filtering is to score the relevance between two items; for a seed item, the top-scored items are then recommended. Our model is based on BERT (Devlin et al., 2018). Rather than the traditional RNN/CNN structure, BERT adopts a Transformer encoder as a language model, and uses an attention mechanism to calculate the relationship between input and output. The training of a BERT model can be divided into two parts: Masked Language Model and Next Sentence Prediction. We re-purpose these tasks for the e-commerce context as Masked Language Model on Item Titles and Next Purchase Prediction. Since the distribution of item title tokens differs drastically from the natural-language words on which the original BERT model is trained, retraining the masked language model allows a better understanding of item information for our use case. Next Purchase Prediction can directly be used as the relevance scoring function for our item collaborative filtering task.

2.1 Model

Our model is based on the architecture of BERTbase (Devlin et al., 2018). The encoder of BERTbase contains 12 Transformer layers, with 768 hidden units and 12 self-attention heads.

2.1.1 Next Purchase Prediction

The goal of this task is to predict the next item a user is going to purchase given the seed item he/she has just bought. We start with a pre-trained BERTbase model, and fine-tune it for our next purchase prediction task. We feed the seed item as sentence A and the target item as sentence B. Both item titles are concatenated and truncated to at most 128 tokens, including one [CLS] and two [SEP] tokens. For a seed item, its positive items are generated by collecting items purchased within the same user session, and the negative ones are randomly sampled. Given the positive item set I_p and the negative item set I_n, the cross-entropy loss for next purchase prediction is computed as

  L_{np} = - \sum_{i_j \in I_p} \log p(i_j) - \sum_{i_j \in I_n} \log(1 - p(i_j)).    (1)

2.1.2 Masked Language Model

As the distribution of item title tokens is different from the natural language corpus used to train BERTbase, we further fine-tune the model for the masked language model (MLM) task as well. In the masked language model task, we follow the training schema outlined in Devlin et al. (2018), wherein 15% of the tokens in the title are chosen to be replaced by [MASK], replaced by a random token, or left unchanged, with probabilities of 80%, 10% and 10% respectively. Given the set of chosen tokens M, the corresponding loss for the masked language model is

  L_{lm} = - \sum_{m_i \in M} \log p(m_i).    (2)

The whole model is optimized against the joint loss L_{lm} + L_{np}.
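The description above maps directly onto BERT's original pre-training heads. The following is a rough sketch only (not the authors' released code) of how the title pairing, masking, and joint loss could be wired up with the Hugging Face transformers library; the checkpoint name, the joint_loss helper, and the co_purchased flag are illustrative assumptions.

```python
import torch
from transformers import (BertTokenizerFast, BertForPreTraining,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")
# BERT-style masking: 15% of tokens chosen; 80% [MASK], 10% random, 10% unchanged.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def joint_loss(seed_title: str, cand_title: str, co_purchased: bool) -> torch.Tensor:
    # Seed title as "sentence A", candidate title as "sentence B",
    # truncated to at most 128 tokens including [CLS] and two [SEP]s.
    enc = tokenizer(seed_title, cand_title, truncation=True,
                    max_length=128, return_tensors="pt")
    masked = collator([{"input_ids": enc["input_ids"][0]}])
    # In transformers, next_sentence_label = 0 means "B follows A"; here it is
    # repurposed to mean "B was purchased in the same session as A".
    label = torch.tensor([0 if co_purchased else 1])
    out = model(input_ids=masked["input_ids"],
                token_type_ids=enc["token_type_ids"],
                attention_mask=enc["attention_mask"],
                labels=masked["labels"],
                next_sentence_label=label)
    return out.loss  # sum of the MLM and next-purchase losses, i.e. L_lm + L_np
```

In actual fine-tuning, pairs would of course be batched and the returned loss fed to an optimizer; the sketch only shows how a single (seed, candidate) pair contributes to the joint objective of Eqs. (1) and (2).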
2.1.3 Bi-LSTM Model (baseline)

As the evaluation is conducted on a dataset with a complete cold-start setting, for the sake of comparison we build a baseline model consisting of a title-token embedding layer with 768 dimensions, a bidirectional LSTM layer with 64 hidden units, and a 2-layer MLP with 128 and 32 hidden units respectively. For every pair of items, the two titles are concatenated into a single sequence. After passing through the embedding layer, the bidirectional LSTM reads the entire sequence and generates a representation at the last timestep. The MLP with a logistic output produces the estimated probability score. The baseline model is trained using the same cross-entropy loss shown in Eq. (1).
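A plausible PyTorch rendering of this baseline is sketched below. It is an assumption based on the description above (embedding size, LSTM width, and MLP sizes as stated); details such as how the two LSTM directions are pooled and the vocabulary handling are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMBaseline(nn.Module):
    def __init__(self, vocab_size: int, pad_id: int = 0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 768, padding_idx=pad_id)
        self.lstm = nn.LSTM(input_size=768, hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * 64, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -- the two item titles concatenated.
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)                       # h_n: (2, batch, 64)
        last = torch.cat([h_n[0], h_n[1]], dim=-1)       # final-step representation
        return torch.sigmoid(self.mlp(last)).squeeze(-1)  # co-purchase probability
```

The resulting score can then be trained with the same binary cross-entropy objective as Eq. (1), e.g. torch.nn.BCELoss.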

2.2 Dataset

We train our models on data from an e-commerce website. We collected 8,001,577 pairs of items, of which 33% were co-purchased (BIN event) within the same user session, while the rest were randomly sampled as negative samples. 99.9999% of the entries of the item-item interaction matrix are empty. This sparsity forces the model to focus on generalization rather than memorization, as the dataset statistics make clear. Another 250,799 pairs of items were sampled in the same manner for use as a validation set, used for early stopping during training.

For testing, in order to mimic the cold-start scenario in the production system, wherein traditional item-item collaborative filtering fails completely, we sampled 10,000 pairs of co-purchased items whose seed items are not present in the training set. For each positive sample, containing a seed item and a ground-truth co-purchased item, we paired the seed item with 999 random negative samples, and at test time the trained model ranks the resulting 1,000 items for each seed item.
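For concreteness, the ranking metrics reported in the next section can be computed as sketched below. This assumes, consistently with Table 1 (where precision@10 is roughly recall@10 divided by 10), that each seed item has exactly one ground-truth co-purchased item among its 1,000 candidates; rank_of_positive is a hypothetical list holding the 1-based rank of that item for each test seed.

```python
import math
from typing import Dict, List

def ranking_metrics(rank_of_positive: List[int], k: int = 10) -> Dict[str, float]:
    n = len(rank_of_positive)
    prec_at_1 = sum(r == 1 for r in rank_of_positive) / n
    hits_at_k = sum(r <= k for r in rank_of_positive)
    prec_at_k = hits_at_k / (n * k)        # one relevant item per query
    recall_at_k = hits_at_k / n
    # With a single relevant item, ideal DCG is 1, so NDCG reduces to 1/log2(rank+1).
    ndcg_at_k = sum(1.0 / math.log2(r + 1) for r in rank_of_positive if r <= k) / n
    return {"prec@1": prec_at_1, f"prec@{k}": prec_at_k,
            f"recall@{k}": recall_at_k, f"ndcg@{k}": ndcg_at_k}
```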

3 Results

The results of our evaluation are presented in Table 1. We do not consider the traditional item-to-item collaborative filtering model (Linden et al., 2003) here, since the evaluation is conducted assuming a complete cold-start setting, with all seed items unobserved in the training set, which causes such a model to fail completely. For the same reason, other approaches relying on a unique item identifier (e.g., itemId) could not be considered in our experiment either. We believe this is a practical experimental setting: on a large-scale e-commerce platform, a massive number of new items is created every moment, and excluding those items from the recommender system would be costly and inefficient.

Method              Prec@1   Prec@10   Recall@10   NDCG@10
Bi-LSTM             0.064    0.029     0.295       0.163
BERTbase w/o MLM    0.263    0.057     0.572       0.408
BERTbase            0.555    0.079     0.791       0.669

Table 1: Results on ranking the ground-truth co-purchased item against 999 random negatives.

We observe that the proposed BERT model greatly outperforms the LSTM-based model. When only fine-tuned for the Next Purchase Prediction task, our model exceeds the baseline by 310.9%, 96.6%, 93.9%, and 150.3% in precision@1, precision@10, recall@10, and NDCG@10 respectively. When fine-tuning for the masked language model task is added, these metrics improve further by another 111.0%, 38.6%, 38.3%, and 64.0%. The experiment makes clear the superiority of the proposed BERT model for item-based collaborative filtering. It is also clear that adapting the token distribution to the e-commerce context with the masked language model within BERT is essential for achieving the best performance.

In order to visually examine the quality of recommendations, we present the recommended items for two different seed items in Table 2. For the first seed item, 'Marvel Spiderman T-shirt Small Black Tee Superhero Comic Book Character', most of the recommended items are T-shirts, alongside clothing accessories and tableware decorations, all with Marvel as the theme. For the second seed item, 'Microsoft Surface Pro 4 12.3" Multi-Touch Tablet (Intel i5, 128GB) + Keyboard', the recommended items span a wide range of categories, including tablets, digital memberships, electronic accessories, and computer hardware. From these two examples, we see that the proposed model appears to automatically find relevant selection criteria without manual specification, as well as decide between focusing on a specific category and catering to a wide range of inventory, by learning from the data.

4 Summary

In this paper, we adapt the BERT model for the task of item-based recommendation. Instead of directly representing an item with a unique identifier, we use the item's title tokens as content, along with token embeddings, to address the cold-start problem. We demonstrate the superiority of our model over a traditional neural network model in understanding item titles and learning relationships between items across a vast inventory.

Acknowledgments

The authors would like to thank Sriganesh Madhvanath, Hua Yang, Xiaoyuan Wu, Alan Lu, Timothy Heath, and Kyunghyun Cho for their support and discussion, as well as anonymous reviewers for their helpful comments.


References

Oren Barkan and Noam Koenigstein. 2016. Item2vec: Neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE.

P. Castells, S. Vargas, and J. Wang. 2011. Novelty and diversity metrics for recommender systems: Choice, discovery and relevance. In International Workshop on Diversity in Document Retrieval (DDR 2011) at the 33rd European Conference on Information Retrieval (ECIR 2011).

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee.

Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, (1):76–80.

Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, pages 73–105. Springer.

Michael J. Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web, pages 325–341. Springer.

J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The Adaptive Web, pages 291–324. Springer.

A Appendix

Seed: Marvel Spiderman T-shirt Small Black Tee Superhero Comic Book Character

Title                                                                               Score
Marvel Black Panther Character Men’s Superhero Socks Black                          0.9999
Marvel SPIDERMAN HOMECOMING Birthday Party Range Tableware Supplies Decorations     0.9999
Marvel Compression Leggings Superhero T-shirts Long Pants Spiderman Sportswear      0.9999
Venom T-Shirt Funny Baby Venom Spiderman Black T-Shirt for Men and Women            0.9998
SPIDERMAN HOMECOMING LOGO T-SHIRT                                                   0.9998
Marvel Comics Books Avengers The Incredible Hulk Transforming Tee Shirt Black       0.9998
Mens Short Sleeve T Shirt Slim Fit Casual Tops Tee Batman Iron Man Spider-Man       0.9998
MEMORIES Stan Lee Marvel Superhero T-Shirt Artwork Spiderman Avengers Mens Top      0.9998
Marvel Spider-Man Superior Spider-Man Mens Black T-Shirt                            0.9998
Marvel Spider-Man: Far From Home Red Glow Mens Graphic T Shirt                      0.9998

Seed: Microsoft Surface Pro 4 12.3” Multi-Touch Tablet (Intel i5, 128GB) + Keyboard

Title                                                                               Score
Microsoft Surface 2 1573 ARM Cortex-A15 1.7GHz 2GB 64GB SSD 10.8” Touch             0.9998
Microsoft Xbox Live 3 Month Gold Membership Instant Delivery, Retail $24.99         0.9998
Genuine OEM 1625 AC Power Charger Adapter For Microsoft Surface Pro 3               0.9997
For Sale Microsoft Surface Pro LCD display Touch Screen Digitizer Glass Assembly    0.9997
65W AC Adapter For Microsoft Surface Book/ Pro 4 Q4Q-00001 1706 15V 4A              0.9997
4GB 8GB 16GB DDR3 1333 MHz PC3-10600 DIMM 240Pin Desktop Memory Ram                 0.9995
Samsung Galaxy Tab A 10.1” Tablet 16GB Android OS - Black (SM-T580NZKAXAR)          0.9994
For Microsoft Surface Pro 4 1724 Touch Screen LCD Flex Cable Ribbon X937072-001     0.9993
RSIM 12+ 2018 R-SIM Nano Unlock Card Fits for iPhone XS MAX/XR/X/8/                 0.9993
Intel Core i5-4570S Quad-Core Socket LGA1150 CPU Processor SR14J 2.90GHz            0.9993

Table 2: Sample recommended items with seeds
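The scores in Table 2 are relevance probabilities for each candidate title given the seed title, presumably produced by the next-purchase prediction head described in Section 2. Below is an illustrative sketch (not the production code) of how such scores could be computed and sorted at inference time; model and tokenizer stand for an assumed fine-tuned BertForPreTraining checkpoint and its tokenizer, and the batching is simplified.

```python
import torch

@torch.no_grad()
def score_candidates(model, tokenizer, seed_title, candidate_titles, device="cpu"):
    # Encode the seed title paired with every candidate title (sentence A / B).
    enc = tokenizer([seed_title] * len(candidate_titles), candidate_titles,
                    truncation=True, max_length=128, padding=True,
                    return_tensors="pt").to(device)
    out = model(**enc)
    # seq_relationship_logits has shape (batch, 2); class 0 is "B follows A",
    # repurposed here as "B is co-purchased with A".
    probs = out.seq_relationship_logits.softmax(dim=-1)[:, 0]
    # Return candidates sorted by descending relevance probability.
    return sorted(zip(candidate_titles, probs.tolist()), key=lambda x: -x[1])
```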
