Model Ensemble for Click Prediction in Bing Search Ads
Xiaoliang Ling, Weiwei Deng, Chen Gu (Microsoft Bing, No. 5 Dan Ling Street, Beijing, China)
Hucheng Zhou, Cui Li* (Microsoft Research, No. 5 Dan Ling Street, Beijing, China)
Feng Sun (Microsoft Bing, No. 5 Dan Ling Street, Beijing, China)

*This work was done during her internship in Microsoft Research.

ABSTRACT
Accurate estimation of the click-through rate (CTR) of sponsored ads significantly impacts the user search experience and businesses' revenue; even a 0.1% accuracy improvement would yield additional earnings in the hundreds of millions of dollars. CTR prediction is generally formulated as a supervised classification problem. In this paper, we share our experience and learnings on model ensemble design and our innovations. Specifically, we present 8 ensemble methods and evaluate them on our production data. Boosting neural networks with gradient boosting decision trees turns out to be the best. With larger training data, there is a nearly 0.9% AUC improvement in offline testing and significant click yield gains in online traffic. In addition, we share our experience and learnings on improving the quality of training.

Keywords
click prediction; DNN; GBDT; model ensemble

1. INTRODUCTION
Search engine advertising has become a significant element of the web browsing experience. Choosing the right ads for a query and the order in which they are displayed greatly affects the probability that a user will see and click on each ad. Accurately estimating the click-through rate (CTR) of ads [10, 16, 12] has a vital impact on the revenue of search businesses; even a 0.1% accuracy improvement in our production would yield hundreds of millions of dollars in additional earnings. An ad's CTR is usually modeled as a classification problem, and thus can be estimated by machine learning models. The training data is collected from historical ad impressions and the corresponding clicks. Because of its simplicity, scalability and online learning capability, logistic regression (LR) is the most widely used model and has been studied by Google [21], Facebook [14] and Yahoo! [3]. Recently, factorization machines (FMs) [24, 5, 18, 17], gradient boosting decision trees (GBDTs) [25] and deep neural networks (DNNs) [29] have also been evaluated and gradually adopted in industry.

A single model would lead to suboptimal accuracy, and the above-mentioned models all have different advantages and disadvantages. They are therefore usually ensembled together in industry settings (and even in machine learning competitions such as Kaggle [15]) to achieve better prediction accuracy. For instance, apps recommendation in Google adopts Wide&Deep [7], which co-trains LR (wide) and DNN (deep) together; ad CTR prediction in Facebook [14] uses GBDT for non-linear feature transformation and feeds the transformed features to LR for the final prediction; Yandex [25] boosts LR with GBDT for CTR prediction; and there also exists work [29] on ads CTR that feeds an FM embedding learned from sparse features to a DNN. Simply replicating these designs does not yield the best possible accuracy. In this paper, we share our experience and learnings on designing and optimizing model ensembles to improve CTR prediction in Microsoft Bing Ads.

The challenge lies in the large design space: which models should be ensembled together, which ensemble techniques should be used, and which ensemble design achieves the best accuracy? In this paper, we present 8 ensemble variants and evaluate them in our system. The ensemble that boosts the NN with GBDT, i.e., initializes the sample target for GBDT with the prediction score of the NN, is the best in our setting. With larger training data, it shows a nearly 0.9% AUC improvement in offline testing and significant click yield gains in online traffic. Pushing this new ensemble design into the system also brings system challenges around a fast and accurate trainer, considering that multiple models are trained and each trainer must have good scalability and accuracy. We share our experience with identifying accuracy-critical factors in training.

The rest of the paper is organized as follows. We first provide a brief primer on the ad system in Microsoft Bing Ads in Section 2. We then present several model ensemble designs in detail in Section 3, followed by the corresponding evaluation against production data. The means of improving model accuracy and system performance is described in Section 5. Related work is listed in Section 6 and we conclude in Section 7.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. WWW 2017, April 3–7, 2017, Perth, Australia. ACM 978-1-4503-4914-7/17/04. http://dx.doi.org/10.1145/3041021.3054192

2. ADS CTR OVERVIEW
In this section, we give an overview of the ad system in Microsoft Bing Ads and the basic models and features we use.

2.1 Ads System Overview
Sponsored search typically uses keyword-based auctions. Advertisers bid on a list of keywords for their ad campaigns. When a user searches with a query, the search engine matches the user query with the bidding keywords, and then selects and shows proper ads to the user. When a user clicks any of the ads, the advertiser is charged a fee based on the generalized second price [2, 1].

Figure 1: Graphical illustration of basic models: LR, DNN and GBDT.

A typical system involves several steps including selection,
relevance filtration, CTR prediction, ranking and allocation. The input query from the user is first used to retrieve a list of candidate ads (selection). Specifically, the selection system parses the query, expands it to relevant ad keywords and then retrieves the ads from advertisers' campaigns according to their bidding keywords. For each selected ad candidate, a relevance model estimates the relevance score between the query and the ad, and filters out the least relevant ones (relevance filtration). The remaining ads are scored by the click model to predict the click probability (pClick) given the query and context information (click prediction). In addition, a ranking score is calculated for each ad candidate as bid * pClick, where bid is the corresponding bidding price. These candidates are then sorted by their ranking score (ranking). Finally, the top ads with a ranking score larger than a given threshold are allocated for impression (allocation), such that the number of impressions is limited by the total available slots.

Deep Neural Network. DNN generalizes to previously unseen query-ad feature pairs by learning a low-dimensional dense embedding vector for both query and ad features, with less burden of feature engineering. The middle model in Figure 1 depicts four layers of the DNN structure, including two hidden layers each with u neuron units, one input layer with m features and one output layer with a single output. In a top-down description, the output unit is a real number p ∈ (0, 1), the predicted CTR, with p = σ(w2 · x2 + b2), where σ(a) = 1/(1 + exp(−a)) is the logistic activation function, w2 ∈ R^{1×u} is the parameter matrix between the output layer and the connected hidden layer, and b2 ∈ R is the bias. x2 ∈ R^u is the activation output of the last hidden layer, computed as x2 = σ(w1 · x1 + b1), where w1 ∈ R^{u×u}, b1 ∈ R^u, x1 ∈ R^u.
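As a concrete illustration of this top-down computation (completed by the analogous input-layer transform defined next), the forward pass can be sketched in NumPy. The dimensions m and u and the random parameters below are hypothetical placeholders standing in for trained weights, not values from our system:

```python
import numpy as np

def sigmoid(a):
    # logistic activation: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

m, u = 8, 4                       # hypothetical: m input features, u units per hidden layer
rng = np.random.default_rng(0)    # random weights stand in for trained parameters

w0, b0 = rng.normal(size=(u, m)), np.zeros(u)  # input layer -> hidden layer 1
w1, b1 = rng.normal(size=(u, u)), np.zeros(u)  # hidden layer 1 -> hidden layer 2
w2, b2 = rng.normal(size=(1, u)), np.zeros(1)  # hidden layer 2 -> output

x0 = rng.normal(size=m)           # one input sample, x0 in R^m
x1 = sigmoid(w0 @ x0 + b0)        # x1 = sigma(w0 . x0 + b0)
x2 = sigmoid(w1 @ x1 + b1)        # x2 = sigma(w1 . x1 + b1)
p = float(sigmoid(w2 @ x2 + b2)[0])  # predicted CTR, a scalar in (0, 1)
print(p)
```

Because the output unit is itself a logistic unit, p always lands strictly inside (0, 1), which is what lets it be read directly as a click probability.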
Similarly, x1 = σ(w0 · x0 + b0), where w0 ∈ R^{u×m}, b0 ∈ R^u, and x0 ∈ R^m is the input sample. Different hidden layers can be regarded as different internal functions capturing different forms of representation of a data instance. Compared with a linear model, DNN thus has a better capability for catching intrinsic data patterns and leads to better generalization. The sigmoid activation can be replaced with a tanh or ReLU [19] function.

The click probability is thus a key factor used to rank the ads in an appropriate order, place the ads in different locations on the page, and even determine the price that will be charged to the advertiser if a click occurs. Therefore, ad click prediction is a core component of the sponsored search system.

2.2 Models
Consider a training data set D = {(xi, yi)} with n examples (i.e., |D| = n), where each sample has m features xi ∈ R^m with observed label yi ∈ {0, 1}. We formulate click prediction as a supervised learning problem, and binary classification models are often used for click probability estimation p(click = 1 | user, query, ad). Given the observed label y ∈ {0, 1}, the prediction p incurs the LogLoss (logistic loss), given as:

ℓ(p) = −y · log p − (1 − y) · log(1 − p),   (1)

2.3 Training data
The training data is collected from an ad impressions log, where each sample (xi, yi) represents whether or not an impressed ad allocated by the ad system has been clicked by a user. The output variable yi is 1 if the ad has been clicked, and 0 otherwise. The input features xi come from different sources that describe different domains of an impression. 1). query features that include query term, query classification, query length, etc.
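The behavior of the LogLoss in Eq. (1) is easy to check numerically; a minimal sketch, assuming NumPy (the probe values 0.9 and 0.5 are illustrative, not from our data):

```python
import numpy as np

def logloss(p, y):
    # Eq. (1): l(p) = -y * log(p) - (1 - y) * log(1 - p)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# A confident correct prediction incurs a small loss,
# a confident wrong prediction a large one.
print(logloss(0.9, 1))  # -ln(0.9) ~ 0.105
print(logloss(0.9, 0))  # -ln(0.1) ~ 2.303
print(logloss(0.5, 1))  # -ln(0.5) ~ 0.693, same for either label
```

The asymmetry between the two confident cases is what drives the model toward well-calibrated probabilities rather than hard 0/1 scores.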