Scalable Realistic Recommendation Datasets Through Fractal Expansions

Francois Belletti, Karthik Lakshmanan, Walid Krichene, Yi-Fan Chen, John Anderson
Google AI, Google

arXiv:1901.08910v3 [cs.IR] 20 Feb 2019

Abstract: Recommender System research currently suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap we propose to generate more massive user/item interaction data sets by expanding pre-existing public data sets.

User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our technique expands such matrices to larger numbers of rows (users), columns (items) and non zero values (interactions) while preserving key higher order statistical properties. We adapt the Kronecker Graph Theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagements, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark Recommender Systems and the systems employed to train them.

We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 864K items and 2 million users in its smaller version and can be scaled up or down. A larger version features 655 billion ratings, 7 million items and 17 million users.

Index Terms: Machine Learning, Deep Learning, Recommender Systems, Graph Theory, Simulation

I. INTRODUCTION

Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups, which implies that publicly shared data sets need to be available.

Unfortunately, while Recommendation Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [6] and the MovieLens data set [19] are publicly available, they are orders of magnitude smaller than proprietary data [3], [12], [55].

TABLE I: Size of MovieLens 20M [19] vs industrial dataset in [55].

                 MovieLens 20M    Industrial
#users           138K             Hundreds of Millions
#items           27K              2M
#topics          19               600K
#observations    20M              Hundreds of Billions

Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a non-trivial problem to preserve user privacy while still maintaining useful characteristics of the dataset. For instance, [37] shows a privacy breach of the Netflix prize dataset. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.

Therefore, we decide not to make user data more broadly available to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurable with that of our production problems while only consuming already publicly available data.

Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in Recommender Systems; [1], [8], [20], [22], [26], [32], [34], [38], [44], [48], [49], [54], [56], [58] are only a few of the many recent research articles relying on MovieLens, whose latest version [19] has accrued more than 800 citations according to Google Scholar. Unfortunately, the data set comprises only a few observed interactions and, more importantly, a very small catalogue of users and items when compared to industrial proprietary recommendation data.

In order to provide a new data set, more aligned with the needs of production scale Recommender Systems, we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem similar and at least as hard an ML problem as the original one for matrix factorization approaches to recommendations [20], [25]:
• orders of magnitude more users and items are present in the synthetic dataset;
• the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.

Key first and second order statistics of interest we aim to preserve are summarized in Figure 1; the details of their computation are given in Section IV.

[Figure 1: three rows of panels, each shown in linear and log/log scale. Item-wise rating sums (total rating vs. item rank); user-wise rating sums (total rating vs. user rank); user/item rating spectrum (singular value magnitude vs. singular value rank).]

Fig. 1: Key first and second order properties of the original MovieLens 20m user/item rating matrix (after centering and re-scaling into [-1, 1]) we aim to preserve while synthetically expanding the data set. Top: item popularity distribution (total ratings of each item). Middle: user engagement distribution (total ratings of each user). Bottom: dominant singular values of the rating matrix (core to the difficulty of matrix factorization tasks). In all log/log plots the small fraction of non-positive row-wise and column-wise sums are removed.

Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [28] to achieve a suitable fractal expansion of recommendation data to benchmark linear and non-linear user/item factorization approaches for recommendations [20], [25]. Consider a recommendation problem comprising m users and n items. Let $(R_{i,j})_{i=1 \ldots m,\, j=1 \ldots n}$ be the sparse matrix of recorded interactions (e.g. the rating left by the user i for item j if any and 0 otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of R can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.

Many different transforms can be applied to the matrix R, which can be considered a standard sparse 2 dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [11], [46]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [11], [46], we choose a non-parametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [28]. We employ a non-parametric analytically tractable simulation of the evolution of the user/item bi-partite graph to create a large synthetic data set. Our choice is to trade off realism for analytic tractability. We emphasize the latter.

While Kronecker Graphs Theory is developed in [28], [29] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets, which was already noted in [29] but not developed extensively. The Kronecker Graph generation paradigm has to be changed with the present data set in other aspects however: we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [29].

In order to reliably employ Kronecker based fractal expansions on recommender system data we devise the following contributions:
• we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;
• we demonstrate that key recommendation system specific properties of the original dataset are preserved by our technique;
• we also show that the resulting algorithm we develop is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;
• we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up in computational benchmark for model training.

The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally we employ the resulting algorithm experimentally to MovieLens 20m data and validate its statistical properties.
