Sparse Factorization Machines for Click-Through Rate Prediction
2016 IEEE 16th International Conference on Data Mining

Sparse Factorization Machines for Click-through Rate Prediction

Zhen Pan1, Enhong Chen1,∗, Qi Liu1, Tong Xu1, Haiping Ma2, Hongjie Lin1
1 School of Computer Science and Technology, University of Science and Technology of China
{pzhen, hjlin}@mail.ustc.edu.cn, {cheneh, qiliuql, tongxu}@ustc.edu.cn
2 IFLYTEK Co., Ltd., hpma@iflytek.com
∗Corresponding Author.

Abstract—With the rapid development of E-commerce, recent years have witnessed the booming of the online advertising industry, which raises extensive concern in both academic and business circles. Among all the issues, the task of click-through rate (CTR) prediction plays a central role, as it may influence the ranking and pricing of online ads. To deal with this task, the Factorization Machines (FM) model is designed to better reveal proper combinations of basic features. However, the sparsity of ad transaction data, i.e., a large proportion of zero elements, may severely disturb the performance of FM models. To address this problem, in this paper, we propose a novel Sparse Factorization Machines (SFM) model, in which the Laplace distribution is introduced instead of the traditional Gaussian distribution to model the parameters, as the Laplace distribution can better fit sparse data with a higher ratio of zero elements. Along this line, the most important features or conjunctions can be selected with the proposed SFM model. Furthermore, we develop a distributed implementation of our SFM model on the Spark platform to support the prediction task on massive datasets in practice. Comprehensive experiments on two large-scale real-world datasets clearly validate both the effectiveness and efficiency of our SFM model compared with several state-of-the-art baselines, which also supports our assumption that the Laplace distribution is more suitable for describing online ad transaction data.

I. INTRODUCTION

In recent years, with the rapid development of E-commerce, the online advertising industry has become the most popular and effective approach to promotion and marketing. As a multi-billion-dollar business model, online ads can build novel connections between customers and merchants with lower cost and better coverage. In 2015, spending on online ads reached $59.6 billion in the US, up 20.4% over 2014. While producing opportunities, such an enormous market raises new challenges in the real-time bidding (RTB) scheme [32], [33], [25], which targets the design of ad inventories, including allocation, ranking and pricing [28], [31], [19]. Thus, techniques in computational advertising are urgently required to support decision-making and ensure profit.

In computational advertising, the most crucial task is the price setting for each ad, as it has a direct bearing on profit. Usually, several compensation methods are utilized, such as Cost per Click (CPC), Cost per Action (CPA) or Cost per Mille (CPM). Among them, CPC, in which advertisers pay only each time a user clicks on the ad, is the most popular method, as 66% of online advertising transactions are counted on a CPC basis1. Intuitively, if a user clicks the ad and then visits the profile of the advertiser, a deal is more probable than if the user merely views the ad. Thus, CPC may reveal the conversion rate more accurately than the other metrics. However, in order to arrange the pricing and ranking of each ad, the expectation of clicking should be precisely estimated, i.e., the click-through rate (CTR) prediction task, which attracts extensive concern in both academic and business circles [16], [24].

In the literature, traditionally, as a simple but effective tool, Logistic Regression (LR) has been widely studied and utilized (e.g., [24], [4], [7], [10], [15]) to predict the CTR. However, this linear model heavily depends on manually selected features, while the latent correlations between features can hardly be revealed. This is a fatal flaw, as combining features based on human expertise is costly, especially for large-scale datasets with high-dimensional features, which may severely limit its application. Recently, some other prior arts based on neural networks (NN) have been proposed, like [3] and [13], in which various elements can be well summarized and complex interactions can be obtained automatically. However, NN methods suffer a heavy computational burden, and it is always difficult to achieve global optimization. For CTR prediction, the Factorization Machines (FM) [21] model has also been adopted, based on feature engineering and matrix design. As a successful solution [20], [27], FM can comprehensively model the correlations between variables under the basic assumption that the first- and second-order parameters follow a Gaussian distribution. Furthermore, to better summarize the features, Bayesian Factorization Machines are designed by imposing Gaussian hyper-priors on the mean of each parameter, and Gamma hyper-priors on the prior precision [6], [22].

However, there are some unique characteristics of ad transaction data which impair the performance of FM models. First, ad transaction data can be extremely sparse due to the commonly used one-hot encoding, i.e., if N possible statuses exist for one feature, there will be an N-dimensional vector with only one non-zero element indicating the current status. Second, ad transaction data in real-world applications are usually large-scale. For instance, Table I lists statistics of 4 advertisers in the iPinYou dataset [11], in which each impression record is represented by features of 15,804 dimensions. We see that all the advertisers hold millions of impression records and billions of feature values.
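As a concrete illustration of the one-hot sparsity just described (a hypothetical sketch; the field sizes below are illustrative and not the iPinYou schema): each categorical field with N possible statuses becomes an N-dimensional indicator vector, and concatenating several fields yields a long, mostly-zero feature vector.

```python
import numpy as np

def one_hot(index, n):
    """Encode a categorical value with n possible statuses as an
    n-dimensional indicator vector with a single non-zero element."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

# Concatenating one-hot blocks for several categorical fields yields a
# long, mostly-zero vector -- e.g. three hypothetical fields of sizes
# 5,000 / 10,000 / 804 give 15,804 dimensions with only 3 non-zeros.
record = np.concatenate([one_hot(7, 5000), one_hot(42, 10000), one_hot(3, 804)])
nonzero_rate = np.count_nonzero(record) / record.size  # about 0.019%
```

With more fields per impression, the non-zero rate stays of the same order as the sub-0.15% figures reported for the real datasets.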
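To make the FM model discussed above concrete, the following is a minimal sketch (not the authors' code) of the standard second-order FM prediction rule from [21]; the names `w0`, `w`, `V` are illustrative, and the pairwise term uses the usual O(kn) reformulation rather than the naive double sum.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score: y(x) = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j.

    x: feature vector (n,); w0: global bias; w: linear weights (n,);
    V: factor matrix (n, k), one k-dimensional latent vector per feature.
    """
    linear = w0 + float(w @ x)
    # Reformulation of the pairwise sum:
    # 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]
    s = V.T @ x  # (k,)
    pairwise = 0.5 * float(np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise
```

Because the pairwise term only involves the non-zero entries of x, this form scales well to the sparse one-hot inputs described above.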
1 https://en.wikipedia.org/wiki/Online_advertising

2374-8486/16 $31.00 © 2016 IEEE, DOI 10.1109/ICDM.2016.53

TABLE I
SOME CHARACTERISTICS OF THE IPINYOU TRAINING DATASETS, WITH FEATURES OF 15,804 DIMENSIONS.

Advertiser Key | Impressions | Non-zero Values | Non-zero Rate (%)
1458           | 3,083,056   | 69,550,137      | 0.1427
3358           | 1,742,104   | 37,915,970      | 0.1377
2821           | 1,322,561   | 15,106,292      | 0.1463
2997           | 312,437     | 5,383,633       | 0.1090

Fig. 1. The probability density functions of the Laplace distribution (μ = 0, ρ = 1) and the Gaussian distribution (μ = 0, σ = 1). The horizontal axis shows the possible values of the variable x, while the vertical axis records the corresponding density f(x).

Unfortunately, the proportion of non-zero elements is even lower than 0.15%. Clearly, massive data and extreme sparsity may significantly disturb the modeling, as the Gaussian distribution can hardly fit such sparse data.

To address these problems, in this paper, we propose a novel Sparse Factorization Machines (SFM) model, in which the Laplace distribution is introduced instead of the traditional Gaussian distribution to model the parameters. As shown in Figure 1, the Laplace distribution has a higher peak than the Gaussian with similar parameters, which results in a higher probability of zero elements [18]. Thus, the Laplace distribution can better describe sparse data with few non-zero elements and, further, distinguish the relevant features or measure correlations between feature pairs. Along this line, to be specific, since the Laplace distribution is non-smooth, the Bayesian inference of SFM is analytically intractable. Thus, we formulate it as a scale mixture of a Gaussian distribution and an exponential density, and then employ the Markov chain Monte Carlo method to perform the Bayesian inference. Furthermore, we develop a distributed implementation of our SFM model on Spark to support the prediction task on large-scale data in practice.

research interests since its birth, and the major research issues include bidding behavior analysis and strategy optimization, market segmentation and ad performance analysis, and so on [32]. Actually, one of the main goals of analyzing and improving ad performance is to maximize the user response rate for advertising campaigns, measured for instance by the click-through rate (CTR). Thus, a number of different models have been proposed for CTR prediction in both academia and industry [24], [21], [20], [27], [7], [30], [13]. Among them, Logistic Regression (LR) is the most intuitive and most widely analyzed and extended model [24], where the probability that a user clicks on an ad is modeled as a logistic function of a linear combination of the features. Since the LR model is highly interpretable and can be trained quickly on large-scale data, it is widely studied. For instance, [8] proposed to combine the LR model with a decision tree model. Specifically, this method took the output of decision trees as the input of the LR model, and studied how the timeliness of training data impacts CTR accuracy. However, as a linear model, LR pays little attention to the different importance and interactions of basic features. In contrast, the Factorization Machines (FM) model is a type of method that is able to model all interactions between variables (features) automatically and is a general predictor working with any real-valued feature vector [21]. Therefore, both the basic FM and extended versions of FM (e.g., Bayesian FM) have been adopted for CTR prediction [6], [20]. Recently, the authors in [27] also introduced an online learning algorithm for CTR prediction based on Factorization Machines.

Besides LR and FM, some other methods have also been proposed recently. For instance, [7] presented a type of probabilistic regression model with Bayesian priors. [30] introduced a method that assigns user features and ad features different parameters; thus, this method can ignore the useless features of the users and ads, and the authors also built a distributed learning framework for training the model. [5] presented a combination of strategies, including a method for transfer learning and an update rule for the learning rate.

As the volume of data increases, more complex methods based on neural networks have been proposed to predict CTR [3], [35], [13]. For instance, in [3], Artificial Neural Networks (ANN) were used for CTR prediction and performed better than LR
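As a reminder of the LR formulation discussed in the related work (a generic sketch, not code from [24] or [8]): the click probability is a logistic function of a linear combination of the features.

```python
import math

def lr_ctr(x, w, b=0.0):
    """LR models P(click | x) as sigma(w . x + b), the logistic
    function of a linear combination of the features."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

Each feature contributes only through its own weight, which is why LR cannot capture feature interactions without manual feature conjunction.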
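The density comparison in Fig. 1 is easy to reproduce: with matching location and unit scale, the Laplace density at its mode is 1/(2ρ) = 0.5, versus 1/√(2π) ≈ 0.399 for the Gaussian. A standalone sketch:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density N(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def laplace_pdf(x, mu=0.0, rho=1.0):
    """Laplace density with location mu and scale rho."""
    return math.exp(-abs(x - mu) / rho) / (2 * rho)

# The Laplace density is higher at zero (0.5 vs ~0.399) and also has
# heavier tails, so it assigns more mass to exactly-sparse parameters.
```

This sharper peak at zero is precisely the property that makes the Laplace prior a better fit for parameters that should mostly be zero.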
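The scale-mixture formulation mentioned above can be checked numerically: a zero-mean Gaussian whose variance is drawn from an exponential density yields Laplace-distributed samples. This is a sketch of the standard identity with illustrative parameters, not the authors' MCMC sampler, which builds on this representation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 500_000, 1.0

# Laplace(0, rho) as a Gaussian scale mixture: draw the variance tau
# from an exponential density with mean 2 * rho**2, then x ~ N(0, tau).
tau = rng.exponential(scale=2 * rho ** 2, size=n)
x = rng.normal(0.0, np.sqrt(tau))

# Moment check against Laplace(0, rho): Var(x) = 2 * rho^2, E|x| = rho.
var_hat = x.var()            # should be close to 2.0
abs_mean = np.abs(x).mean()  # should be close to 1.0
```

Because the conditional distribution of each parameter given its mixing variance is Gaussian, this representation restores conjugacy and makes Gibbs-style MCMC updates tractable despite the non-smooth Laplace prior.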