Scalable Realistic Recommendation Datasets through Fractal Expansions

Francois Belletti, Karthik Lakshmanan, Walid Krichene, Yi-Fan Chen, John Anderson
Google AI, Google
[email protected] [email protected] [email protected] [email protected] [email protected]

Abstract—Recommender System research currently suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap we propose to generate more massive user/item interaction data sets by expanding pre-existing public data sets.

User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our technique expands such matrices to larger numbers of rows (users), columns (items) and non-zero values (interactions) while preserving key higher order statistical properties. We adapt Kronecker Graph Theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagements, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark Recommender Systems and the systems employed to train them.

We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 864K items and 2 million users in its smaller version and can be scaled up or down. A larger version features 655 billion ratings, 7 million items and 17 million users.

Index Terms—Machine Learning, Deep Learning, Recommender Systems, Graph Theory, Simulation

TABLE I: Size of MovieLens 20M [19] vs the industrial dataset in [55].

                 MovieLens 20M   Industrial
  #users         138K            Hundreds of Millions
  #items         27K             2M
  #topics        19              600K
  #observations  20M             Hundreds of Billions

I. INTRODUCTION

Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups, which implies that publicly shared data sets need to be available.

Unfortunately, while Recommender Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [6] and the MovieLens data set [19] are publicly available, they are orders of magnitude smaller than proprietary data [3], [12], [55] (see Table I).

Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a non-trivial problem to preserve user privacy while still maintaining useful characteristics of the dataset. For instance, [37] shows a privacy breach of the Netflix prize dataset. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.

Therefore, we decide not to make user data more broadly available in order to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurable with that of our production problems while only consuming already publicly available data.

Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset, which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in Recommender Systems; [1], [8], [20], [22], [26], [32], [34], [38], [44], [48], [49], [54], [56], [58] are only a few of the many recent research articles relying on MovieLens, whose latest version [19] has accrued more than 800 citations according to Google Scholar. Unfortunately, the data set comprises only few observed interactions and, more importantly, a very small catalogue of users and items — when compared to industrial proprietary recommendation data.

In order to provide a new data set — more aligned with the needs of production scale Recommender Systems — we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem similar to, and at least as hard an ML problem as, the original one for matrix factorization approaches to recommendations [20], [25]:
• orders of magnitude more users and items are present in the synthetic dataset;
• the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.

Key first and second order statistics of interest we aim to preserve are summarized in Figure 1 — the details of their computation are given in Section IV.

Fig. 1: Key first and second order properties of the original MovieLens 20m user/item rating matrix (after centering and re-scaling into [−1, 1]) we aim to preserve while synthetically expanding the data set. Top: item popularity distribution (total ratings of each item). Middle: user engagement distribution (total ratings of each user). Bottom: dominant singular values of the rating matrix (core to the difficulty of matrix factorization tasks). In all log/log plots the small fraction of non-positive row-wise and column-wise sums is removed.

Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [28] to achieve a suitable fractal expansion of recommendation data to benchmark linear and non-linear user/item factorization approaches for recommendations [20], [25]. Consider a recommendation problem comprising m users and n items. Let (R_{i,j})_{i=1...m, j=1...n} be the sparse matrix of recorded interactions (e.g. the rating left by user i for item j if any, and 0 otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of R can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.

Many different transforms can be applied to the matrix R, which can be considered a standard sparse 2-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [11], [46]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [11], [46], we choose a non-parametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [28]. We employ a non-parametric, analytically tractable simulation of the evolution of the user/item bi-partite graph to create a large synthetic data set. Our choice is to trade off realism for analytic tractability; we emphasize the latter.

While Kronecker Graph Theory is developed in [28], [29] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets — which was already noted in [29] but not developed extensively. The Kronecker Graph generation paradigm has to be changed with the present data set in other aspects however: we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [29].

In order to reliably employ Kronecker based fractal expansions on recommendation data we devise the following contributions:
• we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;
• we demonstrate that key recommendation-system-specific properties of the original dataset are preserved by our technique;
• we also show that the resulting algorithm we develop is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;
• we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up in computational benchmarks for model training.

The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally we apply the resulting algorithm to MovieLens 20m data and validate the statistical properties of the result.

II. RELATED WORK

Recommender Systems constitute the workhorse of many e-commerce, social networking and entertainment platforms.
In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches are very popular, such as content based recommendations [43] or social recommendations [10], collaborative filtering remains a prevalent approach to the recommendation problem [33], [42], [45].

Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and will recommend items to a given user that have been consumed by neighbors [43]. Latent factor methods such as matrix factorization [25] try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item. Although other latent models have been developed and could be used to construct synthetic recommendation data sets (e.g. Principal Component Analysis [23] or Latent Dirichlet Allocation [9]), we focus on insights derived from matrix factorization.

The matrix factorization approach represents the affinity a_{i,j} between a user i and an item j with an inner product x_i^T y_j where x_i and y_j are two vectors in R^d representing the user and the item respectively. Given a sparse matrix of user/item interactions R = (r_{i,j})_{i=1...m, j=1...n}, user and item factors can therefore be learned by approximating R with a low rank matrix X Y^T where X ∈ R^{m,k} entails the user factors and Y ∈ R^{n,k} contains the item factors. The data set R represents ratings as in the MovieLens dataset [19] or item consumption (r_{i,j} = 1 if and only if user i has consumed item j [6]). The matrix factorization approach is an example of a solution to the rating matrix completion problem, which aims at predicting the rating of an item j by a user i that has not been observed yet and corresponds to a value of 0 in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix R. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper adopts a similar approach to extend collaborative filtering data sets: besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.
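For concreteness, the completion objective just described can be sketched in a few lines of Python. The following minimal stochastic gradient implementation (ours, for illustration only — the hyper-parameters and the triplet format are assumptions, not part of the original study) learns the factors X and Y from observed (user, item, rating) triplets:

  import numpy as np

  def factorize(triplets, m, n, d=32, lr=0.05, reg=0.01, epochs=10, seed=0):
      # Learn X (m, d) and Y (n, d) so that x_i^T y_j approximates r_ij
      # on the observed entries only; unobserved entries stay unmodeled.
      rng = np.random.default_rng(seed)
      X = rng.normal(scale=0.1, size=(m, d))
      Y = rng.normal(scale=0.1, size=(n, d))
      for _ in range(epochs):
          for i, j, r in triplets:
              err = r - X[i] @ Y[j]
              x_i = X[i].copy()  # keep pre-update value for Y's gradient
              X[i] += lr * (err * Y[j] - reg * X[i])
              Y[j] += lr * (err * x_i - reg * Y[j])
      return X, Y

Here triplets is any iterable of (user index, item index, rating) tuples, e.g. the non-zero entries of R.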
Deep Learning for Recommender Systems: Collaborative filtering has known many recent developments which motivate our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both non-linear matrix factorization tasks [12], [20], [51] and sequential recommendations [18], [47], [57]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.

DNNs consume large quantities of data and are computationally expensive to train; therefore they give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method is employed [27], which requires forward model computation and back-propagation to be run on many mini-batches of (user, item, score) examples. The matrix completion task still consists in predicting a rating for the interaction of user i and item j although (i, j) has not been observed in the original data set. The model is typically run on billions of examples as the training procedure iterates over the training data set.

Model freshness is generally critical to industrial recommendations [12], which implies that only limited time is available to re-train the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance, and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns an embedding matrix of size (N, d), where d ∼ 10–10^3 and N ∼ 10^6–10^9 are typical values. As a consequence, the memory footprint of this matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worth addressing in benchmarks. A higher throughput enables training models with more examples, which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.

Our aim is therefore to enable a comparison of modeling approaches, software engineering frameworks and hardware accelerators for ML in the context of industry scale recommendations. In the present paper we focus on enabling a better evaluation of examples/sec throughput and training-time-to-accuracy for Neural Approaches [20] and Matrix Factorization Approaches [25]. A major issue with this approach is, as we mentioned, the size of publicly available collaborative filtering data sets, which is orders of magnitude smaller than production grade data (see Table I) and has mis-representative orders of magnitude in terms of numbers of distinct users and items. The present paper offers a first solution to this problem by providing a simple and tractable non-parametric fractal approach to scaling up public recommendation data sets by several orders of magnitude. Such data sets will help build a first set of benchmarks for model training accelerators based on publicly available data.

We plan to publish the expanded data. However, MovieLens 20m is publicly available and our method can already be applied to recreate such an expanded data set. Meta-data is also common in industrial applications [3], [7], [12] but we consider its expansion outside the scope of this first development.

III. FRACTAL EXPANSIONS OF USER/ITEM INTERACTION DATA SETS

The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.

1) Self-similarity in user/item interactions: Interactions between users and items follow a natural hierarchy in data sets where items can be organized in topics, genres, categories etc. [55]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, sub-genres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [43]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar, i.e. patterns that occur at more granular scales resemble those affecting coarser scales [36].

Fig. 2: Typical user/item interaction patterns in recommendation data sets (schematic of user groups by item groups with repeated block-wise user/item interaction patterns). Self-similarity appears as a natural key feature of the hierarchical organization of users and items into groups of various granularity.

One can therefore build a user-group/item-category incidence matrix with user-groups as rows and item-categories as columns — a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual-level user/item interaction matrix may be considered as an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: each original item is treated as a synthetic topic in the expanded data set and each actual user is considered a fictional user group.

A key advantage of this fractal procedure is that it may be entirely non-parametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion re-introduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.

2) Fractal expansion through Kronecker products: The Kronecker product — denoted ⊗ — is a non-standard matrix operator with an intrinsic self-similar structure:

    A ⊗ B = [ a_{11}B  ...  a_{1n}B
              ...      ...  ...
              a_{m1}B  ...  a_{mn}B ]        (1)

where A ∈ R^{m,n}, B ∈ R^{p,q} and A ⊗ B ∈ R^{mp,nq}.

In the original presentation of Kronecker Graph Theory [28], as well as the stochastic extension [35] and the extended theory [29], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [29] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [29] for an illustration) fits the self-similar statistical properties of real world graphs such as the internet, the web or social networks [28]. If A is the adjacency matrix of the original graph, fractal expansions are created in [28] by chaining Kronecker products as follows:

    A ⊗ A ⊗ ... ⊗ A.

As adjacency matrices are square, Kronecker Graphs are not employed on rectangular matrices in pre-existing work, although the operation is well defined. Another slight divergence between present and pre-existing work is that — in their stochastic version — Kronecker Graphs carry Bernoulli probability distribution parameters in [0, 1] while MovieLens ratings are in {0.5, 1.0, ..., 5.0} originally and in [−1, 1] after we center and rescale them. We show that these differences do not prevent Kronecker products from preserving core properties of rating matrices.
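For illustration, a minimal NumPy sketch (ours) of the self-similar growth induced by Eq (1); each level multiplies the numbers of rows and columns of the previous level by those of A:

  import numpy as np

  # A small "coarse" interaction matrix (user groups x item categories).
  A = np.array([[ 1.0,  0.0, 0.5],
                [ 0.0, -0.5, 1.0]])

  # Each entry a_ij of the left factor is replaced by the block a_ij * B,
  # so the patterns of the right factor reappear inside every block.
  level2 = np.kron(A, A)       # shape (4, 9)
  level3 = np.kron(level2, A)  # shape (8, 27)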
A more important challenge is the size of the original matrix we deal with: R ∈ R^{138×10^3, 27×10^3}. A naive Kronecker expansion R ⊗ R would therefore synthesize a rating matrix with 19 billion users, which is too large. Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the self-similar synthesis of a larger recommendation dataset, some modifications are needed to the algorithms developed in [29].

3) Reduced Kronecker expansions: We choose to synthesize a user/item rating matrix

    R̃ = R̂ ⊗ R

where R̂ is a matrix derived from R but much smaller (for instance R̂ ∈ R^{128,256}). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix R̂ that shares similarities with R. In particular, we seek an R̂ with a similar row-wise sum distribution (user engagement distribution), column-wise sum distribution (item popularity distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).

4) Implementation at scale and algorithmic extensions: Computing a Kronecker product between two matrices A and B is an inherently parallel operation. It is sufficient to broadcast B to each element (i, j) of A and then multiply B by a_{i,j}. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output data set by sequentially iterating on the values of A. Only storage space commensurable with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.

A first generalization defines a binary operator ⊗_F with F : R × R^{p,q} × N → R^{p,q} as follows:

    A ⊗_F B = [ F(a_{11}, B, ω_{11})  ...  F(a_{1n}, B, ω_{1n})
                ...                   ...  ...
                F(a_{m1}, B, ω_{m1})  ...  F(a_{mn}, B, ω_{mn}) ]        (2)

where ω_{11}, ..., ω_{mn} is a sequence of pseudo-random numbers. Including randomization and non-linearity in F appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1; the implementation is trivially parallelizable. We only create a list of Kronecker blocks in order to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in R̂ (provided pseudo-random numbers are generated in parallel in an appropriate manner).

Algorithm 1 Kronecker fractal expansion
  for i = 1 to m do
    kBlocks ← empty list
    for j = 1 to n do
      ω ← next pseudo-random number
      kBlock ← F(R̂(i, j), R, ω)
      kBlocks append kBlock
    end for
    outputToFile(kBlocks)
  end for

The only reason why we need a reduced version R̂ of R is to control the size of the expansion. Also, A ⊗ B and B ⊗ A are equal after a row-wise and a column-wise permutation. Therefore, another family of appropriate extensions may be obtained by considering

    B ⊗_G A = [ b_{11}G(A, ω_{11})  ...  b_{1n}G(A, ω_{1n})
                ...                 ...  ...
                b_{m1}G(A, ω_{m1})  ...  b_{mn}G(A, ω_{mn}) ]        (3)

where G : R^{m,n} × N → R^{m',n'} is a randomized sketching operation on the matrix A which reduces its size by several orders of magnitude. A trivial scheme consists in sampling a small number of rows and columns from A at random. Other random projections [2], [15], [31] may of course be used. The randomized procedures above produce a user/item interaction matrix where there is no longer a block-wise repetitive structure. Less obvious statistical patterns can give rise to more challenging synthetic large-scale collaborative filtering problems.
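A minimal Python sketch of Algorithm 1 (ours; the binary output format is an illustrative assumption) that streams one block-row of the expansion to disk at a time, so memory usage stays commensurable with the size of R:

  import numpy as np

  def expand_to_file(R_hat, R, path, rng=None):
      # Stream the rows of R_hat ⊗ R (or of a randomized ⊗_F variant)
      # one block-row at a time, as in Algorithm 1.
      rng = rng if rng is not None else np.random.default_rng(0)
      with open(path, "wb") as f:
          for i in range(R_hat.shape[0]):
              k_blocks = []
              for j in range(R_hat.shape[1]):
                  omega = rng.random()  # unused by the plain product
                  # Plain Kronecker block; a randomized F would use omega,
                  # e.g. to shuffle the rows and columns of R.
                  k_blocks.append(R_hat[i, j] * R)
              np.hstack(k_blocks).tofile(f)  # one block-row of the output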
IV. STATISTICAL PROPERTIES OF KRONECKER FRACTAL EXPANSIONS

After having introduced Kronecker products to self-similarly expand a recommendation dataset into a much larger one, we now demonstrate how the resulting synthetic user/item interaction matrix shares crucial common properties with the original.

1) Salient empirical facts in MovieLens data: First, we introduce the critical properties we want to preserve. Note that throughout the paper we present results on a centered version of MovieLens 20m: the average rating of 3.53 is subtracted from all ratings (so that the 0 elements of the sparse rating matrix match un-observed scores and not bad movie ratings). Furthermore, we re-scale the centered ratings so that they all lie in the interval [−1, 1]. As a user/item interaction dataset on an online platform, one expects MovieLens to feature common properties of recommendation data sets such as "power-law" or fat-tailed distributions [55].

First important statistical properties for recommendations concern the distribution of interactions across users and across items. It is generally observed that such distributions exhibit a "power-law" behavior [1], [14], [16], [30], [39], [40], [52], [55]. To characterize such a behavior in the MovieLens data set, we take a look at the distribution of the total ratings along the item axis and the user axis. In other words, we compute row-wise and column-wise sums of the rating matrix R and observe their distributions. The corresponding ranked distributions are exposed in Figure 1 and do exhibit a clear "power-law" behavior for rather popular items. However, we observe that tail items have a higher popularity decay rate. Similarly, the engagement decay rate increases for the group of less engaged users.
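These ranked distributions are straightforward to extract; a short sketch (ours), where R is the centered and re-scaled sparse rating matrix:

  import numpy as np

  def ranked_sums(R):
      # Total rating per user (row sums) and per item (column sums),
      # sorted in decreasing order for the log/log rank plots of Figure 1;
      # the few non-positive sums are dropped before plotting.
      user_sums = np.asarray(R.sum(axis=1)).ravel()
      item_sums = np.asarray(R.sum(axis=0)).ravel()
      return np.sort(user_sums)[::-1], np.sort(item_sums)[::-1]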
The other approximate "power-law" we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top k singular values [21] of the MovieLens rating matrix R by approximate iterative methods (e.g. power iteration), which can scale to its large (138K, 27K) dimension. The method yields the dominant singular values of R and the corresponding singular vectors, so that one can classically approximate R by R ≃ UΣV where Σ is diagonal of dimension (k, k), U is column-orthogonal of dimension (m, k) and V is row-orthogonal of dimension (k, n) — which yields the rank k matrix closest to R in Frobenius norm.

Examining the distribution of the 2048 top singular values of R in the MovieLens dataset (which has at most 27K non-zero singular values) in Figure 1 highlights a clear "power-law" behavior in the highest magnitude part of the spectrum of R. We observe in the spectral distribution an inflection for smaller singular values, whose magnitude decays at a higher rate than that of the larger singular values. Such a spectral distribution is a key feature of the original dataset, in particular in that it conditions the difficulty of low-rank approximation approaches to the matrix completion problem. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.

In all the high level statistics we present, we want to preserve the approximate "power-law" decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to R are therefore threefold: we want to preserve the distributions of row-wise sums of R, column-wise sums of R and the singular value distribution of R. Additional requirements, beyond first and second order high level statistics, would further increase the confidence in the realism of the expanded synthetic dataset. Nevertheless, we consider that focusing on these first three properties is a good starting point.

It is noteworthy that we do not consider here the temporal structure of the MovieLens data set. We leave the study of sequential user behavior — often found to be Long Range Dependent [5], [13], [41] — and the extension of synthetic data generation to sequential recommendations [3], [48], [53] for further work.

2) Preserving MovieLens data properties while expanding it: We now expose the fractal transform design we rely on to preserve the key statistical properties of the previous section.
Definition 1: Consider A = (a_{i,j})_{i=1...m, j=1...n} ∈ R^{m,n}. We denote the set {∑_{j=1...n} a_{i,j}, i = 1...m} of row-wise sums of A by R(A), the set {∑_{i=1...m} a_{i,j}, j = 1...n} of column-wise sums of A by C(A), and the set of non-zero singular values of A by S(A).

Definition 2: Consider an integer i and a non-zero positive integer p. We denote the integer part of i − 1 in base p by ⌊i − 1⌋_p = ⌊(i − 1)/p⌋ and the fractional part by {i − 1}_p = i − 1 − p⌊i − 1⌋_p.

First we focus on conservation properties in terms of row-wise and column-wise sums, which correspond respectively to marginalized user engagement and item popularity distributions. In the following, × denotes the Minkowski product of two sets, i.e. A × B = {a × b | a ∈ A, b ∈ B}.

Proposition 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

    R(K) = R(A) × R(B) and C(K) = C(A) × C(B).

Proof 1: Consider the i-th row of K. By definition of K, the corresponding sum can be rewritten as follows:

    ∑_{j=1...nq} k_{i,j} = ∑_{j=1...nq} a_{⌊i−1⌋_p+1, ⌊j−1⌋_q+1} b_{{i−1}_p+1, {j−1}_q+1}

which in turn equals

    ∑_{j=1...n} ∑_{j'=1...q} a_{⌊i−1⌋_p+1, j} b_{{i−1}_p+1, j'}.

Refactoring the two sums concludes the proof for the row-wise sum properties. The proof for column-wise properties is identical. □

Theorem 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

    S(K) = S(A) × S(B).

Proof 2: One can easily check that (XY) ⊗ (VW) = (X ⊗ V)(Y ⊗ W) for any quadruple of matrices X, Y, V, W for which the notation makes sense, and that (X ⊗ Y)^T = X^T ⊗ Y^T. Let A = U_A Σ_A V_A be the SVD of A and B = U_B Σ_B V_B the SVD of B. Then A ⊗ B = (U_A ⊗ U_B)(Σ_A ⊗ Σ_B)(V_A ⊗ V_B). Now, (U_A ⊗ U_B)^T (U_A ⊗ U_B) = (U_A^T ⊗ U_B^T)(U_A ⊗ U_B) = (U_A^T U_A) ⊗ (U_B^T U_B). Writing the same decomposition for (V_A ⊗ V_B)(V_A ⊗ V_B)^T and considering that U_A, U_B are column-orthogonal while V_A, V_B are row-orthogonal concludes the proof. □

The properties above imply that knowing the row-wise sums, column-wise sums and singular value spectrum of the reduced rating matrix R̂ and of the original rating matrix R is enough to deduce the corresponding properties of the expanded rating matrix R̃ — analytically. As in [29], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.
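Both conservation laws are easy to sanity-check numerically on small dense matrices; a sketch (ours):

  import numpy as np

  rng = np.random.default_rng(0)
  A, B = rng.normal(size=(4, 6)), rng.normal(size=(3, 5))
  K = np.kron(A, B)

  # Proposition 1: row sums of K = Minkowski product of the row sums.
  row_K = np.sort(K.sum(axis=1))
  row_AB = np.sort(np.outer(A.sum(axis=1), B.sum(axis=1)).ravel())
  assert np.allclose(row_K, row_AB)

  # Theorem 1: singular values of K = products of singular values.
  s_K = np.sort(np.linalg.svd(K, compute_uv=False))
  s_AB = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                          np.linalg.svd(B, compute_uv=False)).ravel())
  assert np.allclose(s_K, s_AB)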
Such properties are important: one of the billion interactions between 2 million users and 864K items. key advantages of employing Kronecker products in [29] is In appendix, we present the results obtained for a reduced the preservation of the network values, i.e. the distributions of rating matrix R of size (128, 256) and a synthetic data set singular vector components of a Graph’s adjacency matrix. b 0 consisting of 655 billion interactions between 17 million users To obtain a matrix Ue ∈ n ,k with fewer rows than U but R and 7 million items. column-orthogonal and similar to U in the distribution of its Such a high number of interactions and items enable the values we use the following procedure. We re-size U down training of deep neural collaborative models such as the to n0 rows with n0 < n through an averaging-based down- Neural Collaborative Filtering model [20] with a scale which scaling method that can classically be found in standard im- is now more representative of industrial settings. Moreover, age processing libraries (e.g. skimage.transform.resize in the 0 the increased data set size helps construct benchmarks for scikit-image library [50]). Let U¯ ∈ n ,k be the corresponding R deep learning software packages and ML accelerators that resized version of U. We then construct Ue as the column 0 employ the same orders of magnitude than production settings orthogonal matrix in n ,k closest in Frobenius norm to U¯. R in terms of user base size, item vocabulary size and number Therefore as in [17] we compute of observations. T −1/2 Ue = U¯ U¯ U¯ . (5) 2) Empirical properties of reduced Rb matrix: The con- struction technique of Rb had for objective to produce, just We apply a similar procedure to V to reduce its number of like in [29], a matrix sharing the properties of R ⊗ R though k,m0 columns which yields a row orthogonal matrix Ve ∈ R smaller in size. To that end, we aimed at constructing a matrix 0 with m < m. The orthogonality of Ue (column-wise) and Ve Rb of dimension (16, 32) with properties close to those of R in (row-wise) guarantees that the singular value spectrum of terms of column-wise sum, row-wise sum and singular value spectrum distributions. Rb = UeΣVe (6) We now check that the construction procedure we devised consists exactly of the k = min(m0, n0) leading components does produce a Rb with the properties we expected. As the of the singular value spectrum of R. Like R, Rb is re-scaled impact of the re-sizing step is unclear from an analytic stand- to take values in [−1, 1]. The whole procedure to reduce R point, we had to resort to numerical experiments to validate down to Rb is summarized in Algorithm 2. our method. We verify empirically that the distributions of values of the In Figure 3, one can assess that the first and second order reduced singular vectors in Ue and Ve are similar to those of properties of R and Rb match with high enough fidelity. In U and V respectively to preserve first order properties of R particular, the higher magnitude column-wise and row-wise and value distributions of its singular vectors. Such properties sum distributions follow a “power-law” behavior similar to are demonstrated through numerical experiments in the next that of the original matrix. Similar observations can be made section. about the singular value spectra of Rb and R. 
V. EXPERIMENTATION ON MOVIELENS 20 MILLION DATA

The MovieLens 20m data comprises 20m ratings given by 138 thousand users to 27 thousand items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised and presented helps scale up this dataset to orders of magnitude more users, items and interactions — all in a parallelizable and analytically tractable manner.

1) Size of expanded data set: In the present experiments we construct a reduced rating matrix R̂ of size (16, 32), which implies the resulting expanded data set will comprise 10 billion interactions between 2 million users and 864K items. In the appendix, we present the results obtained for a reduced rating matrix R̂ of size (128, 256) and a synthetic data set consisting of 655 billion interactions between 17 million users and 7 million items.

Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [20] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that employ the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.

2) Empirical properties of the reduced R̂ matrix: The construction technique of R̂ had for objective to produce, just like in [29], a matrix sharing the properties of R ⊗ R though smaller in size. To that end, we aimed at constructing a matrix R̂ of dimension (16, 32) with properties close to those of R in terms of column-wise sum, row-wise sum and singular value spectrum distributions.

We now check that the construction procedure we devised does produce an R̂ with the properties we expected. As the impact of the re-sizing step is unclear from an analytic standpoint, we had to resort to numerical experiments to validate our method.

In Figure 3, one can assess that the first and second order properties of R and R̂ match with high enough fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a "power-law" behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of R̂ and R.

Fig. 3: Properties of the reduced dataset R̂ ∈ R^{16,32} built according to Eqs (4), (5) and (6). We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂. Note here that as R is large we only compute its leading singular values. As we want to preserve statistical "power-laws", we focus on the preservation of the relative distribution of values and not of their absolute magnitudes.

There is therefore now a reasonable likelihood that our adapted Kronecker expansion — although somewhat differing from the method originally presented in [29] — will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.

3) Empirical properties of the expanded data set R̃: We now verify empirically that the expanded rating matrix R̃ = R̂ ⊗ R does share common first and second order properties with the original rating matrix R. The new data size is 2 orders of magnitude larger in terms of number of rows and columns and 4 orders of magnitude larger in terms of number of non-zero terms. Notice here that, as R̂ is in general a dense matrix, the level of sparsity of the expanded data set is the same as that of the original.

Another benefit of using a fractal expansion method with analytic tractability is that we can deduce high order statistics of the expanded data set beforehand, without having to instantiate it. In particular, Proposition 1 implies that knowing the column-wise and row-wise sum distributions of R̂ and R is sufficient to determine the corresponding marginals for the expanded data set R̃ = R̂ ⊗ R. Similarly, the leading singular values of the Kronecker product can be computed with Theorem 1 just based on the leading singular values of R and the singular values of R̂.

Fig. 4: High order statistical properties of the expanded dataset R̂ ⊗ R. We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂ ⊗ R. Here we leverage the tractability of Kronecker products as they impact column-wise and row-wise sum distributions as well as singular value spectra. The plots corresponding to the extended dataset are derived analytically based on the corresponding properties of the reduced matrix R̂ and the original matrix R. Note here that as we only computed the leading singular values of R, we only show the leading singular values of R̂ ⊗ R. In all plots we can observe the preservation of the linear log-log correspondence for the higher values in the distributions of interest (row-wise sums, column-wise sums and singular values) as well as the accelerated decay of the smaller values in those distributions.

In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (row-wise sums) and item popularity (column-wise sums) are similar to those of the original data set. Such observations indicate that the resulting data set is representative — in its fat-tailed data distribution and quasi "power-law" singular value spectrum — of problems encountered in ML for collaborative filtering. Furthermore, the expanded data set reproduces some irregularities of the original data, in particular the accelerating decay of values in ranked row-wise and column-wise sums as well as in the singular value spectrum.
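Concretely, the analytically derived curves of Figure 4 reduce to Minkowski products of the corresponding ranked statistics; a sketch (ours) that never materializes R̂ ⊗ R:

  import numpy as np

  def expanded_stats(stat_R_hat, stat_R):
      # Valid for row-wise sums, column-wise sums (Proposition 1) and
      # leading singular values (Theorem 1): every such statistic of the
      # expansion is a product of one value from R_hat and one from R.
      return np.sort(np.outer(stat_R_hat, stat_R).ravel())[::-1]

  # e.g. the expanded user engagement distribution:
  # engagement_expanded = expanded_stats(row_sums_R_hat, row_sums_R)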
4) Limitations: Although it does not condition the difficulty of rating matrix factorization problems — which depends primarily on the interaction distribution, the number of users, the number of items and the sparsity of the rating matrix — the distribution of ratings is still an important statistical property of the MovieLens 20m dataset. Figure 5 shows that there is a certain degree of divergence between the rating scales of the original and synthetic data sets. In particular, many more rating values are present in the expanded data set as a result of the multiplication of terms from R̂ and R. The transformations turning R into R̂ create a smoother scale of values, leading to a Kronecker product with many different possible ratings. Although the non-zero value distribution is a divergence point between the two data sets, ratings in the synthetic data set are dominated by values which are close to the average, as in the original MovieLens 20m. Therefore the synthetic ratings do have a certain degree of realism, as they represent user/item interactions where strong reactions (positive or negative) from users are much less likely than neutral interactions.

Fig. 5: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Multiplications inherent to Kronecker products create a finer granularity rating scale which somewhat differs from the original rating scale. However, we can check that the new rating scale is still representative of common recommendation problems in that most rating values are close to the average and few user/item interactions are very positive or negative.

Another limitation of the synthetic data set is the block-wise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices — because its singular values are still distributed similarly to those of the original data set — it is now easy to factorize with a Kronecker SVD [24] which takes advantage of the block-wise repetitions in Eq (1). The randomized fractal expansions presented in Eq (2) and Eq (3) address this issue. A simple example of such a randomized variation around the Kronecker product consists in shuffling the rows and columns of each block in Eq (1) independently at random. The shuffles break the block-wise repetitive structure and prevent Kronecker SVD from producing a trivial solution to the factorization problem. As a result, the expansion technique we present appears as a reliable first candidate to train linear matrix factorization models [43] and non-linear user/item similarity scoring models [20].
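A minimal sketch (ours) of such a shuffled variant, phrased as the block function F of the operator ⊗_F in Eq (2):

  import numpy as np

  def shuffled_block(a_ij, R, rng):
      # F(a_ij, R, ω): scale R and permute its rows and columns at
      # random, breaking the repetitions that Kronecker SVD exploits.
      block = a_ij * R
      return block[rng.permutation(R.shape[0])][:, rng.permutation(R.shape[1])]

  # Drop-in replacement for the plain block in Algorithm 1:
  # kBlock = shuffled_block(R_hat[i, j], R, np.random.default_rng(seed))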

VI. CONCLUSION

In conclusion, this paper presents a first attempt at synthesizing realistic large-scale recommendation data sets without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.

We leverage Kronecker products as self-similar operators on user/item rating matrices that impact key properties of row-wise and column-wise sums as well as singular value spectra in an analytically tractable manner. We modify the original Kronecker Graph generation method to enable an expansion of the original data by orders of magnitude that yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate that the data set we create has key first and second order properties similar to those of the original MovieLens 20m rating matrix.

Our next steps consist in making large synthetic data sets publicly available, although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring meta-data (e.g. timestamps, topics, device information). The use of meta-data is indeed critical to solve the "cold-start" problem of users and items having no interaction history with the platform. We also plan to benchmark the performance of well-established baselines on the new large scale realistic synthetic data we produce.
REFERENCES

[1] Abdollahpouri, H., Burke, R., and Mobasher, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42–46.
[2] Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671–687.
[3] Belletti, F., Beutel, A., Jain, S., and Chi, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522–1530.
[4] Belletti, F., Lakshmanan, K., Krichene, W., Chen, Y.-F., and Anderson, J. Scalable realistic recommendation datasets through fractal expansions. arXiv preprint arXiv:1901.08910 (2019).
[5] Belletti, F., Sparks, E., Bayen, A., and Gonzalez, J. Random projection design for scalable implicit smoothing of randomly observed stochastic processes. In Artificial Intelligence and Statistics (2017), pp. 700–708.
[6] Bennett, J., Lanning, S., et al. The Netflix prize. In Proceedings of KDD Cup and Workshop (2007), vol. 2007, New York, NY, USA, p. 35.
[7] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. Latent cross: Making use of context in recurrent recommender systems. In International Conference on Web Search and Data Mining (2018), ACM, pp. 46–54.
[8] Bhargava, A., Ganti, R., and Nowak, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349–1357.
[9] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[10] Chaney, A. J., Blei, D. M., and Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43–50.
[11] Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).
[12] Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191–198.
[13] Crane, R., and Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105, 41 (2008), 15649–15653.
[14] Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (2010), ACM, pp. 39–46.
[15] Fradkin, D., and Madigan, D. Experiments with random projections for machine learning. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), ACM, pp. 517–522.
[16] Goel, S., Broder, A., Gabrilovich, E., and Pang, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010), ACM, pp. 201–210.
[17] Golub, G. H., and Van Loan, C. F. Matrix Computations, vol. 3. JHU Press, 2012.
[18] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the Sixth ACM Conference on Recommender Systems (2012), ACM, pp. 131–138.
[19] Harper, F. M., and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.
[20] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173–182.
[21] Horn, R. A., Horn, R. A., and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1990.
[22] Jawanpuria, P., and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259–2268.
[23] Jolliffe, I. Principal component analysis. In International Encyclopedia of Statistical Science. Springer, 2011, pp. 1094–1096.
[24] Kamm, J., and Nagy, J. G. Optimal Kronecker product approximation of block Toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155–172.
[25] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30–37.
[26] Krichene, W., Mayoraz, N., Rendle, S., Zhang, L., Yi, X., Hong, L., Chi, E., and Anderson, J. Efficient training on very large corpora via Gramian estimation. arXiv preprint arXiv:1807.07187 (2018).
[27] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436.
[28] Leskovec, J., Chakrabarti, D., Kleinberg, J., and Faloutsos, C. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133–145.
[29] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
[30] Levy, M., and Bosteels, K. Music recommendation and the long tail. In 1st Workshop on Music Recommendation and Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.
[31] Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), ACM, pp. 287–296.
[32] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 59–66.
[33] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 1 (2003), 76–80.
[34] Lu, J., Liang, G., Sun, J., and Bi, J. A sparse interactive model for matrix completion with side information. In Advances in Neural Information Processing Systems (2016), pp. 4071–4079.
[35] Mahdian, M., and Xu, Y. Stochastic Kronecker graphs. In International Workshop on Algorithms and Models for the Web-Graph (2007), Springer, pp. 179–186.
[36] Mandelbrot, B. B. The Fractal Geometry of Nature, vol. 1. WH Freeman, New York, 1982.
[37] Narayanan, A., and Shmatikov, V. How to break anonymity of the Netflix prize dataset. arXiv preprint cs/0610105 (2006).
[38] Nimishakavi, M., Jawanpuria, P. K., and Mishra, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489–5500.
[39] Oestreicher-Singer, G., and Sundararajan, A. Recommendation networks and the long tail of electronic commerce. MIS Quarterly (2012), 65–83.
[40] Park, Y.-J., and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM Conference on Recommender Systems (2008), ACM, pp. 11–18.
[41] Pipiras, V., and Taqqu, M. S. Long-Range Dependence and Self-Similarity, vol. 45. Cambridge University Press, 2017.
[42] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (1994), ACM, pp. 175–186.
[43] Ricci, F., Rokach, L., and Shapira, B. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer, 2015, pp. 1–34.
[44] Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478–486.
[45] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (2001), ACM, pp. 285–295.
[46] Schmit, S., and Riquelme, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).
[47] Shani, G., Heckerman, D., and Brafman, R. I. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
[48] Tang, J., and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565–573.
[49] Tu, K., Cui, P., Wang, X., Wang, F., and Zhu, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
[50] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in Python. PeerJ 2 (2014), e453.
[51] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.
[52] Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.
[53] Yu, F., Liu, Q., Wu, S., Wang, L., and Tan, T. A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), ACM, pp. 729–732.
[54] Zhang, J.-D., Chow, C.-Y., and Xu, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798–812.
[55] Zhao, Q., Chen, J., Chen, M., Jain, S., Beutel, A., Belletti, F., and Chi, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320–328.
[56] Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
[57] Zhou, B., Hui, S. C., and Chang, K. An intelligent recommender system using sequential web access patterns. In IEEE Conference on Cybernetics and Intelligent Systems (2004), vol. 1, IEEE Singapore, pp. 393–398.
[58] Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059–1068.

APPENDIX
MOVIELENS 655 BILLION

In this section, we present the properties of a larger expansion for MovieLens. The reduced rating matrix R̂ is now of size (128, 256) and the synthetic data consists of 655 billion interactions between 17 million users and 7 million items.

Empirical properties of the reduced matrix R̂: We assess the scalability of the approach we present to synthesize R̂. In particular, we check that with a size of (128, 256) instead of (16, 32), R̂ still shares common statistical properties with the original matrix R. Figure 6 demonstrates that the construction method we devised for R̂ still preserves key statistical properties of R.

Fig. 6: Properties of the reduced dataset R̂ ∈ R^{128,256}. Once more, we validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂.

Numerical validation for the expanded data set: We now verify that even with different extension factors and a much larger size, the synthetic data set we generate is similar to the original MovieLens 20m. We focus on the distributions of column-wise and row-wise sums in R̃ = R̂ ⊗ R as well as the singular value distribution of the expanded matrix. In Figure 7, we find again that the "power-law" statistical behaviors and their inflections are preserved by the expansion procedure we designed.

Fig. 7: We again validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̂ ⊗ R. Here as well, we can observe the similarity between key features of the original rating matrix and its synthetic expansion.

Limitations: The previous observations demonstrate the scalability and robustness of our expansion method, even with an expansion factor of (128, 256). However, the same limitations are present as in the smaller case: Figure 8 shows that a similar divergence in rating scales exists between the original data set and its expanded synthetic version. As before, the synthetic ratings remain realistic in that their majority is near the average. The present section therefore demonstrates that our method scales up and is able to synthesize very large realistic recommendation data sets.

Fig. 8: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Again we observe a certain divergence between the synthetic data set and the original. There is some realism though, in that most interactions are near neutral.