Product-Based Neural Networks for User Response Prediction Over Multi-Field Categorical Data

Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data YANRU QU, BOHUI FANG, and WEINAN ZHANG, Shanghai Jiao Tong University, China RUIMING TANG, Noah’s Ark Lab, Huawei, China MINZHE NIU, Shanghai Jiao Tong University, China HUIFENG GUO∗, Shenzhen Graduate School, Harbin Institute of Technology, China YONG YU, Shanghai Jiao Tong University, China XIUQIANG HE†, Data service center, MIG, Tencent, China User response prediction is a crucial component for personalized information retrieval and filtering scenarios, such as recommender system and web search. The data in user response prediction is mostly in a multi-field categorical format and transformed into sparse representations via one-hot encoding. Due to the sparsity problems in representation and optimization, most research focuses on feature engineering and shallow modeling. Recently, deep neural networks have attracted research attention on such a problem for their high capacity and end-to-end training scheme. In this paper, we study user response prediction in the scenario of click prediction. We first analyze a coupled gradient issue in latent vector-based models and propose kernel product to learn field-aware feature interactions. Then we discuss an insensitive gradient issue in DNN-based models and propose Product-based Neural Network (PNN) which adopts a feature extractor to explore feature interactions. Generalizing the kernel product to a net-in-net architecture, we further propose Product-network In Network (PIN) which can generalize previous models. Extensive experiments on 4 industrial datasets and 1 contest dataset demonstrate that our models consistently outperform 8 baselines on both AUC and log loss. Besides, PIN makes great CTR improvement (relatively 34.67%) in online A/B test. CCS Concepts: • Information systems → Information retrieval; • Computing methodologies → Su- pervised learning; • Computer systems organization → Neural networks; Additional Key Words and Phrases: Deep Learning, Recommender System, Product-based Neural Network ACM Reference Format: Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2017. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Transactions on Information Systems 9, 4, Article 39 (October 2017), 35 pages. https://doi.org/ ∗ arXiv:1807.00311v1 [cs.IR] 1 Jul 2018 This work is done when Huifeng Guo was an intern at Noah’s Ark Lab, Huawei †This work is done when Xiuqiang He worked at Noah’s Ark Lab, Huawei Authors’ addresses: Yanru Qu; Bohui Fang; Weinan Zhang, Shanghai Jiao Tong University, China, [email protected], [email protected], [email protected]; Ruiming Tang, Noah’s Ark Lab, Huawei, China, [email protected]; Minzhe Niu, Shanghai Jiao Tong University, China, [email protected]; Huifeng Guo, Shenzhen Graduate School, Harbin Institute of Technology, China, [email protected]; Yong Yu, Shanghai Jiao Tong University, China, yyu@apex. sjtu.edu.cn; Xiuqiang He, Data service center, MIG, Tencent, China, [email protected]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice andthe 39 full citation on the first page. Copyrights for components of this work owned by others than the author(s) must behonored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2009 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. 1046-8188/2017/10-ART39 $15.00 https://doi.org/ ACM Transactions on Information Systems, Vol. 9, No. 4, Article 39. Publication date: October 2017. 39:2 Y. Qu et al. 1 INTRODUCTION Predicting a user’s response to some item (e.g., movie, news article, advertising post) under certain context (e.g., website) is a crucial component for personalized information retrieval (IR) and filtering scenarios, such as online advertising [30, 35], recommender system [24, 36], and web search [2, 5]. The core of personalized service is to estimate the probability that a user will “like”, “click”, or “purchase” an item, given features about the user, the item, and the context [31]. This probability indicates the user’s interest in the specific item and influences the subsequent decision-making such as item ranking [48] and ad bidding [52]. Taking online advertising as an example, the estimated Click-Through Rate (CTR) will be utilized to calculate a bid price in an ad auction to improve the advertisers’ budget efficiency and the platform revenue [32, 35, 52]. Hence, it is much desirable to gain accurate prediction to not only improve the user experience, but also boost the volume and profit for the service providers. Table 1. An example of multi-field categorical data. TARGET WEEKDAY GENDER CITY 1 Tuesday Male London 0 Monday Female New York 1 Tuesday Female Hong Kong 0 Tuesday Male Tokyo Number 7 2 1000 The data collection in these tasks is mostly in a multi-field categorical form. And each data instance is normally transformed into a high-dimensional sparse (binary) vector via one-hot encoding [18]. Taking Table1 as an example, the 3 fields are one-hot encoded and concatenated as »0; 1; 0; 0; 0; 0; 0¼ »1; 0¼ »0; 0; 1; 0;:::; 0; 0¼ : | {z } |{z} | {z } WEEKDAY=Tuesday GENDER=Male CITY=London Each field is represented as a binary vector, of which only 1 entry corresponding to the inputisset as 1 while others are 0. The dimension of a vector is determined by its field size, i.e., the number of unique categories1 in that field. The one-hot vectors of these fields are then concatenated together in a predefined order. Without loss of generality, user response prediction can be regarded as a binary classification problem, and 1/0 are used to denote positive/negative responses respectively [14, 37]. A main challenge of such a problem is sparsity. For parametric models, they usually convert the sparse (binary) input into dense representations (e.g., weights, latent vectors), and then search for a separable hyperplane. Fig.1 shows the model decomposition. In this paper, we mainly focus on modeling and training, thus we exclude preliminary feature engineering [9]. Dense Sparse Feature Repres- Classifier Prediction Input Extractor entation Fig. 1. Model decomposition. 1 For clarity, we use “category” instead of “feature” to represent a certain value in a categorical field. For consistency with previous literature, we preserve “feature” in some terminologies, e.g., feature combination, feature interaction, and feature representation. ACM Transactions on Information Systems, Vol. 9, No. 4, Article 39. Publication date: October 2017. PNNs for User Response Prediction 39:3 Many machine learning models are leveraged or proposed to work on such a problem, including linear models, latent vector-based models, tree models, and DNN-based models. Linear models, such as Logistic Regression (LR) [25] and Bayesian Probit Regression [14], are easy to implement and with high efficiency. A typical latent vector-based model is Factorization Machine (FM)[36]. FM uses weights and latent vectors to represent categories. According to their parametric representations, LR has a linear feature extractor, and FM has a bi-linear2 feature extractor. The prediction of LR and FM are simply based on the sum over weights, thus their classifiers are linear. FM works well on sparse data, and inspires a lot of extensions, including Field-aware FM (FFM) [21]. FFM introduces field-aware latent vectors, which gain FFM higher capacity and better performance. However, FFM is restricted by space complexity. Inspired by FFM, we find a coupled gradient issue of latent vector-based models and refine feature interactions3 as field-aware feature interactions. To solve this issue as well as saving memory, we propose kernel product methods and derive Kernel FM (KFM) and Network in FM (NIFM). Trees and DNNs are potent function approximators. Tree models, such as Gradient Boosting Decision Tree (GBDT) [6], are popular in various data science contests as well as industrial ap- plications. GBDT explores very high order feature combinations in a non-parametric way, yet its exploration ability is restricted when feature space becomes extremely high-dimensional and sparse. DNN has also been preliminarily studied in information system literature [8, 33, 40, 51]. In [51], FM supported Neural Network (FNN) is proposed. FNN has a pre-trained embedding4 layer and several fully connected layers. Since the embedding layer indeed performs a linear transformation, FNN mainly extracts linear information from the input. Inspired by [39], we find an insensitive gradient issue that fully connected layers cannot fit such target functions perfectly. From the model decomposition perspective, the above models are restricted by inferior feature extractors or weak classifiers. Incorporating product operations in DNN, we propose Product-based Neural Network (PNN). PNN consists of an embedding layer, a product layer, and a DNN classifier. The product layer serves as the feature extractor which can make up for the deficiency of DNN in modeling feature interactions. We take FM, KFM, and NIFM as feature extractors in PNN, leading to Inner Product-based Neural Network (IPNN), Kernel Product-based Neural Network (KPNN), and Product-network In Network (PIN). CTR estimation is a fundamental task in personalized advertising and recommender systems, and we take CTR estimation as the working example to evaluate our models. Extensive experiments on 4 large-scale real-world datasets and 1 contest dataset demonstrate the consistent superiority of our models over 8 baselines [6, 15, 21, 25, 27, 36, 46, 51] on both AUC and log loss. Besides, PIN makes great CTR improvement (34.67%) in online A/B test.

Load more