
arXiv:2001.11358v2 [cs.IR] 12 May 2020

Correcting for Selection Bias in Learning-to-rank Systems

Zohreh Ovaisi (University of Illinois at Chicago) [email protected]
Ragib Ahsan (University of Illinois at Chicago) [email protected]
Yifan Zhang (Sun Yat-sen University) [email protected]
Kathryn Vasilaky (California Polytechnic State University) [email protected]
Elena Zheleva (University of Illinois at Chicago) [email protected]

ABSTRACT
Click data collected by modern recommendation systems are an important source of observational data that can be utilized to train learning-to-rank (LTR) systems. However, these data suffer from a number of biases that can result in poor performance for LTR systems. Recent methods for bias correction in such systems focus on position bias, the fact that higher ranked results (e.g., top search engine results) are more likely to be clicked even if they are not the most relevant results given a user's query. Less attention has been paid to correcting for selection bias, which occurs because clicked documents are reflective of what documents have been shown to the user in the first place. Here, we propose new counterfactual approaches which adapt Heckman's two-stage method and account for selection and position bias in LTR systems. Our empirical evaluation shows that our proposed methods are more robust to noise and have better accuracy compared to existing unbiased LTR algorithms, especially when there is moderate or no position bias.

KEYWORDS
recommender systems, learning-to-rank, position bias, selection bias

1 INTRODUCTION
The abundance of data found online has inspired new lines of inquiry about human behavior and the development of machine-learning algorithms that learn individual preferences from such data. Patterns in such data are often driven by the underlying algorithms supporting online platforms, rather than by naturally-occurring user behavior [32]. For example, user interaction data from social media news feeds, such as user clicks and comments on posts, reflect not only latent user interests but also what the underlying news feed algorithms chose to show to users in the first place. Such data are in turn used to train the new personalization and news feed algorithms, further propagating the bias [9]. This can lead to phenomena such as filter bubbles and echo chambers and can challenge the validity of social science research that relies on found data [26, 30].

One of the places where these biases surface is in personalized recommender systems, whose goal is to learn user preferences from available user interaction data. These systems typically rely on procedures to estimate the parameters of new ranking algorithms that are capable of ranking items based on inferred user preferences, in a process known as learning-to-rank (LTR).

Much of the work on unbiasing parameter estimation in learning-to-rank systems has focused on position bias [29], where the position of a result causes higher ranked results displayed to a user (e.g., top search engine results) to be more likely clicked even if they are not the most relevant given the user's query. Methods that correct for position bias focus on boosting the relevance of results observed (and thus clicked) by the user, and they typically assume that all relevant results have a non-zero probability of being observed [29]. However, users rarely have the chance to observe all relevant results, either because the system chose to show a truncated list of top k results or because users do not spend the time to peruse through tens to hundreds of ranked results. In this case, lower ranked, relevant results have zero probability of being observed (and clicked) and never get the chance to be boosted in LTR systems. This leads to selection bias in clicked results, which is the focus of our work.

Here, we frame the problem of learning to rank as a counterfactual problem of predicting whether a document would have been clicked had it been observed. In order to recover from selection bias in clicked documents, we focus on identifying the relevant documents that were never shown to users. Our formulation is different from previous counterfactual formulations, which correct for position bias and study the likelihood of a document being clicked given that it was placed in a higher position versus a lower position [29].

Here, we propose a general framework for recovering from selection bias that stems from both the limited choices given to users and position bias. First, we propose Heckman-rank, an algorithm for addressing selection bias in the context of learning-to-rank. By adapting Heckman's two-stage method, an econometric tool for addressing selection bias, we account for the limited choices given to users and the fact that some items are more likely than others to be shown to a user. Because this correction method is very general, it is applicable to any type of selection bias in which the system's decision of which documents to show can be learned from the features. Because Heckman-rank treats selection as a binary variable, we also propose two bias-correcting ensembles that account for the nuanced probability of being selected due to position bias and combine Heckman-rank with existing position-bias correction methods.

Our experimental evaluations demonstrate the utility of our proposed method when compared to state-of-the-art unbiased learning-to-rank algorithms. Our ensemble methods have better accuracy compared to existing unbiased LTR algorithms under realistic selection bias assumptions, especially when the position bias is not severe. Moreover, Heckman-rank is more robust to noise than both the ensemble methods and position-bias correcting methods across different position bias assumptions. The experiments also show that selection bias affects the performance of LTR systems even in the absence of position bias, and Heckman-rank is able to correct for it.

2 RELATED WORK
Here, we provide the context for our work and present the three areas that best encompass our problem: bias in recommender systems, selection bias correction, and unbiased learning-to-rank.

Bias in recommender systems. Many technological platforms, such as recommendation systems, tailor items to users by filtering and ranking information according to user history. This process influences the way users interact with the system and how the data collected from users is fed back to the system, and it can lead to several types of biases. Chaney et al. [9] explore a closely related problem called algorithmic confounding bias, where live systems are retrained to incorporate data that was influenced by the recommendation algorithm itself. Their study highlights the fact that training recommendation platforms with naive data that are not debiased can cause a severe decrease in the utility of such systems. For example, "echo chambers" are a consequence of this problem [17, 20], where users are limited to an increasingly narrower choice set over time, which can lead to a phenomenon called polarization [18]. Popularity bias is another bias affecting recommender systems and is studied by Celma and Cano [8]. Popularity bias refers to the idea that a recommender system will display the most popular items to a user, even if they are not the most relevant to the user's query. Recommender systems can also affect users' decision making process, known as decision bias, and Chen et al. [13] show how understanding this bias can improve recommender systems. Position bias is yet another type of bias that is studied in the context of learning-to-rank systems and refers to the notion that higher ranked documents will be more likely to be selected regardless of the documents' relevancy. Joachims et al. [29] focus on this bias and we compare our results to theirs throughout.

Selection bias correction. Selection bias occurs when a data sample is not representative of the underlying data distribution. Selection bias can have various underlying causes, such as participants self-selecting into a study based on certain criteria, or subjects choosing over a choice set that is restricted in a non-random way. Selection bias could also encompass the biases listed above. Various studies attempt to correct for selection bias in different contexts. Heckman's correction, and more generally, bivariate selection models, control for the probability of being selected into the sample when predicting outcomes [21]. Smith and Elkan [40] study Heckman correction for different types of selection bias through Bayesian networks, but not in the context of learning-to-rank systems. Zadrozny [45] studies selection bias in the context of well-known classifiers, where the outcome is binary rather than continuous as with ranking algorithms. Selection bias has also been studied in the context of causal graphs [4-6, 14, 15]. For example, if an underlying data generation model is assumed, Bareinboim and Pearl [3] show that selection bias can be removed even in the presence of confounding bias, i.e., when a variable can affect both treatment and control. We leverage this work in our discussion of identifiability under selection bias.

The most related work to our context are the studies by Hernández-Lobato et al. [22], Schnabel et al. [37], and Wang et al. [43]. Both Schnabel et al. [37] and Hernández-Lobato et al. [22] use a matrix factorization model to represent data (ratings by users) that are missing not-at-random, where Schnabel et al. [37] outperform Hernández-Lobato et al. [22]. More recently, Joachims et al. [29] propose a position-bias correction approach in the context of learning-to-rank systems as a more general approach compared to Schnabel et al. [37]. Throughout, we compare our results to Joachims et al. [29], although it should be noted that the latter deals with a more specific bias - position bias - than what we address here. Finally, Wang et al. [43] address selection bias due to confounding, whereas we address selection bias that is treatment-dependent only.

Unbiased learning-to-rank. The problem we study here investigates debiasing data in learning-to-rank systems. There are two approaches to LTR systems, offline and online, and the work we propose here falls in the category of offline LTR systems.

Offline LTR systems learn a ranking model from historical click data and interpret clicks as absolute relevance indicators [2, 7, 12, 16, 24, 27-29, 36, 41, 42]. Offline approaches must contend with the many biases that found data are subject to, including position and selection bias, among others. For example, Wang et al. [41] use a propensity weighting approach to overcome position bias. Similarly, Joachims et al. [29] propose a method to correct for position bias by augmenting SVM-rank learning with an Inverse Propensity Score defined for clicks rather than queries. They demonstrate that Propensity-Weighted SVM-rank outperforms a standard Ranking SVM by accounting for position bias. More recently, Agarwal et al. [1] proposed nDCG SVM-rank, which outperforms Propensity-Weighted SVM-rank [29], but only when position bias is severe. We show that our proposed algorithm outperforms [29] when position bias is not severe; thus, we do not compare our results to [1]. Other studies aim to improve on Joachims et al. [29], such as Wang et al. [42] and Ai et al. [2], but only in the ease of their methodology. Wang et al. [42] propose a regression-based Expectation Maximization method for estimating the click position bias, and its main advantage over Joachims et al. [29] is that it does not require randomized tests to estimate the propensity model. Similarly, the Dual Learning Algorithm (DLA) proposed by Ai et al. [2] jointly learns the propensity model and ranking model without randomization tests. Hu et al. [24] introduce a method that jointly estimates position bias and trains a ranker using a pairwise loss function. The focus of these latter studies is position bias and not selection bias, namely the fact that some relevant documents may not be exposed to users at all, which is what we study here.

In contrast to offline LTR systems, online LTR algorithms intervene during click collection by interactively updating a ranking model after each user interaction [11, 23, 25, 33, 35, 38, 39, 44]. This can be costly, as it requires intervening with users' experience of the system. The main study in this context is Jagerman et al. [25], who compare the online learning approach by Oosterhuis and de Rijke [33] with the offline LTR approach proposed by Joachims et al. [29] under selection bias. The study shows that the method by Oosterhuis and de Rijke [33] outperforms [29] when selection bias and moderate position bias exist, and when no selection bias and severe position bias exist. One advantage of our offline algorithms over online LTR ones is that they do not have a negative impact on user experience while learning.

3 PROBLEM DESCRIPTION
In this section, we review the definition of learning-to-rank systems and of position and selection bias in recommender systems, as well as our framing of bias-corrected ranking with counterfactuals.

3.1 Learning-to-Rank Systems
We first describe learning-to-rank systems assuming knowledge of true relevances (full information setting), following [29]. Given a sample x of i.i.d. queries (x_i ∼ P(x)) and relevancy score rel(x,y) for all documents y, we denote ∆(y|x_i) to be the loss of any ranking y for query x_i. The risk of a ranking system S that returns ranking S(x) for queries x is given by:

    R(S) = ∫ ∆(S(x)|x) dP(x).    (1)

Since the distribution of queries is not known in practice, R(S) cannot be computed directly; it is often estimated empirically as follows:

    R̂(S) = (1/|X|) Σ_{x_i ∈ X} ∆(S(x_i)|x_i).    (2)

The goal of learning-to-rank systems is to find a ranking function S ∈ S that minimizes the risk R̂(S). Learning-to-rank systems are a special case of a recommender system where an appropriate ranking is learned.

The relevancy score rel(x_i,y) denotes the true relevancy of document y for a specific query x_i. It is typically obtained via human annotation and is necessary for the full information setting. Despite being reliable, true relevance assignments are frequently impossible or expensive to obtain because they require a manual evaluation of every possible document given a query.

Due to the cost of annotation, recommender system training often relies on implicit feedback from users in what is known as the partial information setting. Click logs collected from users are easily observable in any recommender system and can serve as a proxy for the relevancy of a document. For this reason, clicks are frequently used to train new recommendation algorithms. Unfortunately, there is a cost for using click log data because of noise (e.g., people can click on items that are not relevant) and various biases that the data are subject to, including position bias and selection bias, which we discuss next.

3.2 Position bias
Implicit feedback (clicks) in LTR systems is inherently biased. Position bias refers to the notion that higher ranked results are more likely to be clicked by a user even if they are not the most relevant results given the user's query.

Previous work [29, 41] has focused on tempering the effects of position bias via inverse propensity weighting (IPW). IPW re-weights the relevance of documents using a factor inversely related to the documents' position on a page. For a given query instance x_i, the relevance of document y to query x_i is r_i(y) ∈ {0, 1}, and o_i ∈ {0, 1} is a set of vectors indicating whether a document y is observed. Suppose the performance metric of interest is the sum of the ranks of relevant documents:

    ∆(y|x_i, r_i) = Σ_{y ∈ y} rank(y|y) · r_i(y).    (3)

Due to position bias, given a presented ranking ȳ_i, clicks are more likely to occur for top-ranked documents. Therefore, the goal is to obtain an unbiased estimate of ∆(y|x_i, r_i) for a new ranking y.

There are existing approaches that address position bias in LTR systems. For example, Propensity SVM-rank, proposed by Joachims et al. [29], is one such algorithm. It uses inverse propensity weights (IPW) to counteract the effects of position bias:

    ∆_IPW(y|x_i, ȳ_i, o_i) = Σ_{y: o_i(y)=1 ∧ r_i(y)=1} rank(y|y) / Q(o_i(y)=1 | x_i, ȳ_i, r_i)    (4)

where the propensity weight Q(o_i(y)=1 | x_i, ȳ_i, r_i) denotes the marginal probability of observing the relevance r_i(y) of result y for query x_i, when the user is presented with ranking ȳ_i. Joachims et al. [29] estimated the IPW to be:

    Q(o_i(y)=1 | x_i, ȳ_i, r_i) = (1 / rank(y|ȳ_i))^η    (5)

where η reflects the severity of position bias. The IPW has two main properties. First, it is computed only for documents that are observed and clicked. Therefore, documents that are never clicked do not contribute to the IPW calculation. Second, as shown by Joachims et al. [29], a ranking model trained with clicks and the IPW method will converge to a model trained with true relevance labels, rendering a LTR framework robust to position bias.

3.3 Selection bias
LTR systems rely on implicit feedback (clicks) to improve their performance. However, a sample of relevant documents from click data does not reflect the true distribution of all possible relevant documents, because a user observes a limited choice of documents. This can occur because i) a recommender system ranks relevant documents too low for a user to feasibly see, or ii) a user can examine only a truncated list of top k recommended items. As a result, clicked documents are not randomly selected for LTR systems to be trained on, and therefore cannot reveal the relevancy of documents that were excluded from the ranking ȳ. This leads to selection bias.

Selection bias and position bias are closely related. Besides selection bias due to unobserved relevant documents, selection bias can also arise due to position bias: lower ranked results are less likely to be observed, and thus selected less frequently than higher-ranked ones. Previous work on LTR algorithms that corrects for position bias assigns a non-zero observation probability to all documents, and proofs of debiasing are based on this assumption [29]. However, in practice it is rarely realistic to assume that all documents can be observed by a user. When there is a large list of potentially relevant documents, the system may choose to show only the top k results, and a user can only act on these results. Therefore, lower-ranked results are never observed, which leads to selection bias.
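To make Equations (3)-(5) concrete, the following sketch (our own illustration, not the authors' code) computes the inverse-propensity-weighted metric for a toy query. Note how documents that were never shown, and hence never clicked, simply drop out of the sum, which is exactly the selection-bias problem described above.

```python
def propensity(shown_rank, eta):
    """Eq. (5): estimated probability that a user observes the result
    displayed at position `shown_rank`; eta controls position-bias severity."""
    return (1.0 / shown_rank) ** eta

def delta_ipw(new_ranks, shown_ranks, clicked, eta):
    """Eq. (4): inverse-propensity-weighted estimate of the sum-of-ranks
    metric for a new ranking. Only observed-and-clicked documents
    contribute; each is re-weighted by the inverse of the observation
    propensity of the position it was originally shown at. Documents below
    the display cutoff can never be clicked, so they never enter this sum
    -- the selection bias discussed in Section 3.3."""
    total = 0.0
    for new_rank, shown_rank, click in zip(new_ranks, shown_ranks, clicked):
        if click:
            total += new_rank / propensity(shown_rank, eta)
    return total

# Toy query: three documents shown at positions 1..3, with positions 1 and 3
# clicked. Evaluate a candidate ranking that swaps the top two documents.
value = delta_ipw(new_ranks=[2, 1, 3], shown_ranks=[1, 2, 3],
                  clicked=[1, 0, 1], eta=1.0)  # 2/1 + 3/(1/3), i.e. about 11
```

The re-weighting gives the click at position 3 a large weight (its observation propensity is only 1/3), compensating on average for the clicks that position bias suppressed.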

Here, we consider the selection bias that arises when some docu- 4 BIAS-CORRECTED RANKING WITH ments have a zero probability of being observed if they are ranked COUNTERFACTUALS below a certain cutoff k. The objective of this paper is to propose In this section we adapt a well-known sample selection correction a ranking algorithm that corrects for both selection and position method, known as Heckman’s two-stage correction, to the context bias, and therefore is a better tool for training future LTR systems of LTR systems. Integrating the latter framework requires a de- (see Section 4.1). tailed understanding of how LTR systems generally process and cut interaction data to train new recommendation algorithms, and 3.4 Ranking with counterfactuals at what stages in that process selection biases are introduced. Thus, We define the problem of ranking documents as a counterfactual while the methodology we introduce is a well established tool in problem [34]. Let O(x,y) ∈ {0, 1} denote a treatment variable in- the causal inference literature, integrating it within the multiple dicating whether a user observed document y given query x. Let stages of training a machine learning algorithm is a complex trans- CO=1(x,y) ∈ {0, 1} represent the click counterfactual indicating lational problem. We then introduce two aggregation methods to whether a document y would have been clicked had y been ob- combine our proposed Heckmanrank, correcting for selection bias, served under query x. The goal of ranking with counterfactuals is with existing methods for position bias to further improve the ac- to reliably estimate the probability of click counterfactuals for all curacy in ranking prediction. documents: 4.1 Selection bias correction with Heckmanrank P(CO=1 =1|X = x,Y = y) (6) Econometrics, or the application of a statistical methods to eco- and then rank the documents according to this probability. 
Solving nomic problems, has long been concerned with confounded or held- the ranking with counterfactuals problem would allow us to find out data in the context of consumer choices. Economists are inter- a ranking system S that returns ranking S(x) for query x that is ested in estimating models of consumer choice to both learn con- robust to selection bias. sumers’ preferences and to predict their outcomes. Frequently, the Current techniques that correct for position bias aim to provide data used to estimate these models are observational, not experi- reliable estimates of this probability by taking into consideration mental. As such, the outcomes observed in the data are based on a the rank-dependent probability of being observed. However, this limited and self-selected sample. A quintessential example of this approach is only effective for documents that have a non-zero prob- problem is estimating the impact of education on worker’s wages ability of being observed: based on only those workers who are employed [21]. However, those who are employed are a self-selected sample, and estimates P(CO=1 =1|O =1,rank = i, X = x,Y = y). (7) of education’s effect on wages will be biased. A broad class of models in econometrics that deal with such The challenge with selection bias is to estimate this probability for selection biases are known as bivariate sample selection models. documents that have neither been observed nor clicked in the first A well-known method for correcting these biases in economics is place: known as Heckman correction or two-step Heckman. In the first stage the probability of self selection is estimated, and in the sec- ond stage the latter probability is accounted for. As Heckman [21] P(CO=1 =1|O =0,C =0, X = x, Y = y) (8) pointed out self selection bias can occur for two reasons. “First, = P(CO=1 =1|O =0, X = x, Y = y). (9) there may be self selection by the individuals or data units being investigated (as in the labor example). 
Second, sample selection To address this challenge, in the following Section 4 we turn to decisions by analysts or data processors operate in much the same econometric methods, which have a long history of addressing se- fashion as self selection (by individuals).” lection bias. Adapting a sample selection model, such as Heckman’s, to LTR Note that in order to recover from selection bias we must ad- systems requires an understanding of when and how data are pro- dress the concept of identifiability and whether causal estimates gressively truncated when training a recommender algorithm. We can even be obtained in the context of our setup. A bias is identi- introduce notation and a framework to outline this problem here. fied to be recoverable if the treatment is known [6]. In our context Let cx,y denote whether a document y is selected (e.g., clicked) the treatment is whether a document enters into the data training under query x for each < query,document > pair; Fx,y represents pool (clicked). While it is difficult to guarantee that a user observed the features of the < query,document >, and ϵx,y is a normally a document that was shown to them (i.e. we cannot know whether distributed error term. The same query can produce multiple < an absence of a click is due non-observance or to non-relevance), query,document > pairs, where the documents are then ordered it is easier to guarantee that a document was not observed by a by a LTR algorithm. However, it is important to note that a LTR user if it was not shown to them in the first place (e.g., it is below a algorithm will not rank every single document in the data given a cutoff for top k results or the user never scrolled down to that doc- query. Unranked documents are typically discarded when training ument in a ranked list). Our proposed solution, therefore, identifies future algorithms. Herein lies the selection bias. 
Documents that the treatment first as a binary variable (whether the document is are not shown to the user can then never be predicted as a potential shown versus not shown) and then as a continuous variable that choice. Moreover, documents far down in the rankings may still takes position bias into account. be kept in future training data, but will appear infrequently. Both 4 Correcting for Selection Bias in Learning-to-rank Systems these points will contribute to generating increasingly restrictive depends upon whether a document is observed by users and, there- data that new algorithms are trained on. fore, has the potential for being selected. If we fail to account for the repercussions of these selection bi- The conditional expectation of clicking on a document condi- ases, then modeling whether a document is selected will be based tional on the document being shown is given by: only upon the features of documents that were ranked and shown E[c |F,o =1]= αF + E(ϵ |F,o =1)= αF + σλ to the user, which can be written as: x,y x,y x,y x,y x,y x,y x,y (14) biased c , = α F , + ϵ , . (10) x y x y x y We can see that if the error terms in (10) and (11) are correlated In this setup we only consider a simple linear model; however, then E(ϵx,y |F, ox,y = 1) > 0), and estimating (14) without account- future research will incorporate nonlinear models. In estimating ing for this correlation will lead to biased estimates of α. Thus, in (10), we refer to the feature weights estimator, αbiased , as being the second stage, we correct for selection bias to obtain an unbi- ˆ biased, because the feature design matrix will only reflect docu- ased estimate of α by controlling for λx,y : ments that were shown to the user. But documents that were not unbiased ˆ ˆ shown to the user could also have been selected. Thus, (10) reflects cx,y = α Fx,y + σλx,y (θZx,y ) + ϵx,y (15) the limitation outlined in (7). 
When we discard unseen documents Estimation of (15) allows us to predict click probabilities, cˆ, where then we can only predict clicks for documents that were shown, unbiased cˆ = αˆ F , +σˆλˆ , (θZˆ , ). This click probability refers to while our objective is to predict the unconditional probability that x y x y x y our ability to estimate (9), the unconditional click probability, using a document is clicked regardless of whether it was shown. Heckmanrank. We then compute document rankings for a given To address this point, we will first explicitly model an algorithm’s queryby sorting documentsaccording to their predicted click prob- document selection process. Let o , denote a binary variable that x y abilities. Note that our main equation (15) has a bivariate outcome. indicates whether a document y is shown and observed (o , = 1) x y Thus, in this selection correction setup we are following a Heckpro- or not shown and not observed (o , = 0). For now, we assume that x y bit model, as opposed to the initial model that Heckman proposed if a document is shown to the user that user also sees the document. in Heckman [21] where the main outcome is a continuous variable. We relax this assumption in Section 4.2. Z , is a set of explanatory x y Our setup helps account for the inherent selection bias that can variables that determine whether a document is shown, which in- occur in any LTR system, as all LTR systems must make a choice cludes the features in F , , but can also include external features, x y in what documents they show to a user. What is unique to our including characteristics of the algorithm that first generated the formulation of the problem is our use of a two stage estimation data: process to account for the two stage document selection process: ox,y = 1 (1) . (11) namely, whether the document is shown, and whether the docu- {θZx,y +ϵ , ≥0} x y ment is then selected. 
Accounting for the truncation of the data is In the first stage of Heckmanrank, we estimate the probability critical for training a LTR system, and previously has not been con- of a document being observed using a Probit model: sidered. In order to improve a system’s ranking accuracy it must be able to predict document selection for both unseen as well as seen P(o =1|Z ) = P(θZ + ϵ > 0|Z ) = Φ(θZ ) (12) x,y x,y x,y x,y x,y x,y documents. If not, the choice set of documents that are available where Φ() denotes the standard normal CDF. Note that a crucial as- to a user can only become progressively smaller. Our correction is sumption here is that we will use both seen and unseen documents a simple method to counteract such a trend in found data. for a given query in estimating (12). Therefore, the dimensions of our data will be far larger than if we had discarded unseen doc- 4.2 Bias-correcting ensembles uments, as most LTR systems typically do. After estimating (12) Biased data limits the ability to accurately train an LTR algorithm we can compute what is known as an Inverse Mills ratio for every on click logs. In this section, we present methods for addressing < query,document > pair: two types of selection bias, one stemming from truncated recom- ϕ(θZx,y ) mendations and the other one from position bias. One of the de- λx,y = (13) ficiencies of using Heckmanrank to deal with biased data is that Φ(θZx,y ) it assumes that all documents that are shown to a user are also where ϕ() is the standard normal distribution. λx,y reflects the observed by the user. However, due to position bias that is not nec- severity of selection bias and corresponds to our desire to condi- essarily the case, and lower-ranked shown documents have lower tion on O = 0 versus O = 1, as described in Equations 7 and 9, but probability of being observed. Therefore, it is natural to consider using a continuous variable reflecting the probability of selection. 
combining Heckmanrank, which focuses on recovering from selec- rank In the second stage of Heckman , we estimate the probabil- tion bias due to unobserved documents, with a ranking algorithm ity of whether a user will click on a document. Heckman’s seminal that accounts for the nuanced observation probability of shown work showed that if we condition our estimates on the λx,y our documents due to position bias. estimated feature weights will be statistically unbiased in expecta- Algorithms that rely on IPW [1, 29, 41] consider the propen- tion. This can improve our predictions if we believe that including sity of observation for any document given a ranking for a cer- λx,y is relevant in predicting clicks. We assume joint normality tain query and it is exponentially dependent on the rank of the of the errors, and our setup naturally implies that the error terms document in the given ranking. This is clearly different from our (1) ϵx,y and ϵx,y are correlated, namely that clicking on a document approach for recovering from selection bias where we model the 5 Zohreh Ovaisi, Ragib Ahsan, Yifan Zhang, Kathryn Vasilaky, Elena Zheleva observation probability to be either 0 or 1 depending on its position of a set of objects y1,y2,...,yn given a query x. The objective is relative in the ranking. to find a single ranking y that corroborates with all other existing Ensemble ranking objective. In order to harness the power rankings. Many aggregation methods have been proposed [31].A of correcting for these biases in a collective manner, we propose to commonly used approach is the Borda count, which scores docu- use ensembles that can combine the results produced by Heckmanrank ments based on their relative position in a ranking, and then totals and any position bias correcting method. We refer to the algorithm all scores across all rankings for a given query [19]. 
We refer to the algorithm correcting for selection bias as As and the one correcting for position bias as Ap, while ys and yp are the rankings generated for a certain query x by the respective algorithms. Our goal is to produce an ensemble ranking ye based on ys and yp for all queries that is more accurate than either ranking alone.

There is a wide scope for designing an appropriate ensemble method to serve our objective. We propose two simple but powerful approaches, as our experimental evaluation shows. The two approaches differ in their fundamental intuition. The intuition behind the first approach is to model the value of individual ranking algorithms through a linear combination of the rankings they produce. We can learn the coefficients of that linear combination using linear models on the training data. We call this method Linear Combination. The second approach is a standard approach for combining ranking algorithms using Borda counts [19]. It works as a post-processing step after the candidate algorithms As and Ap produce their respective rankings ys and yp. We apply a certain Rank Aggregation algorithm over ys and yp to produce ye for a given query for evaluation. Next, we discuss each of the approaches in the context of our problem.

4.2.1 Linear Combination. A simple aggregation method for combining As and Ap is to estimate the value of each algorithm in predicting a click. After training the algorithms As and Ap, we use the same training data to learn the weights of a linear model that considers the rank of each document produced by As and Ap. For any given query x, the rank of document y produced by As is given by rank(y|ys, x). Similarly, rank(y|yp, x) represents the rank given by Ap. We also consider the relevance of document y, rel(x, y), which is either 0 for not relevant or 1 for relevant, modeled through clicks.

We train a binary classifier to predict the relevance (click) of documents, which incorporates the estimated value of the individual algorithms. We select logistic regression as the binary classifier in our implementation, but any other standard classification method should work as well. We model the relevance of a document y, given two rankings ys and yp, as the following logistic function:

    rel(x, y) = 1 / (1 + e^{−(w_0 + w_1·rank(y|ys, x) + w_2·rank(y|yp, x))})

Upon training the logistic regression model, we learn the parameters w0, w1, w2, where w1 and w2 represent the estimated impact of As and Ap respectively. During evaluation, we predict the click probability for each <query, document> pair using the trained classifier. Then we can sort the documents for each query according to these probability values to generate the final ensemble ranking ye.

4.2.2 Rank Aggregation. Rank aggregation aims to combine rankings generated by multiple ranking algorithms. In a typical rank aggregation problem, we are given a set of rankings y1, y2, ..., ym of a set of objects y1, y2, ..., yn given a query x. The objective is to find a single ranking y that corroborates with all other existing rankings. Many aggregation methods have been proposed [31]. A commonly used approach is the Borda count, which scores documents based on their relative position in a ranking, and then totals all scores across all rankings for a given query [19].

In our scenario, we have two rankings ys and yp. For a given query, there are n documents to rank. Consider B(ys, yi) as the score for document yi (i ∈ {1, 2, ..., n}) given by ys. Similarly, B(yp, yi) refers to the score for document yi given by yp. The score given to a document yi in a specific ranking ys (or yp) is simply the number of documents it beats in the respective ranking. The total score for document yi would then be B(ys, yi) + B(yp, yi). Based on these total scores, we sort the documents in non-ascending order to produce the ensemble ranking ye. For example, given a certain query, if a document is ranked 1st in ys and 3rd in yp, then its total score would be (n − 1) + (n − 3). This very simple scheme reflects the power of the combined method to recover from different biases in LTR systems.

5 EXPERIMENTS
In this section, we evaluate our proposed approach for addressing selection bias under several conditions:
• Varying the number of observed documents given a fixed position bias (Section 5.2)
• Varying position bias with no noise (Section 5.2.1)
• Varying position bias with noisy clicks (Section 5.2.2)
• Varying the noise level in clicks (Section 5.2.3)
The parameter values are summarized in Table 1.

5.1 Experimental setup
Next, we describe the dataset we use, the process for click data generation, and the evaluation framework.

5.1.1 Base dataset. In order to explore selection bias in LTR systems, we conduct several experiments using semi-synthetic datasets based on set 1 and set 2 from the Yahoo! Learning to Rank Challenge (C14B)^1, denoted as D_YLTR. Set 1 contains 19,944 train and 6,983 test queries, including 473,134 train and 165,660 test documents. Set 2 contains 1,266 train queries and 34,815 train documents, with 20 documents per query on average [10]. Each query is represented by an id, and each <query, document> pair is represented by a 700-dimensional feature vector with normalized feature values ∈ [0, 1]. The dataset contains true relevance of rankings based on an expert-annotated relevance score ∈ [0, 4] associated with each <query, document> pair, with 0 meaning least relevant and 4 most relevant. We binarized the relevance score following Joachims et al. [29], such that 0 denotes irrelevant (a relevance score of 0, 1 or 2), and 1 relevant (a score of 3 or 4).

We first conduct extensive experiments on the train portion of the smaller set 2, where we randomly sample 70% of the queries as training data and 30% as test data, with which LTR algorithms can be trained and evaluated respectively (Section 5.2). To confirm the performance of our proposed method with out-of-sample test data, we conduct experiments on the larger set 1, where we train LTR algorithms on set 1 train data and evaluate them on set 1 test data (Section 5.3).

^1 https://webscope.sandbox.yahoo.com/catalog.php?datatype=c
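The Linear Combination ensemble of Section 4.2.1 amounts to logistic regression over the two rank features. Below is a minimal sketch on synthetic ranks; the toy ground truth and all names are illustrative, and the plain gradient-descent fitter stands in for any off-the-shelf logistic regression.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Plain gradient-descent logistic regression (illustrative stand-in)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
    return w

rng = np.random.default_rng(1)
n = 200
rank_s = rng.permutation(n) + 1   # rank(y | y_s, x): selection-bias algorithm A_s
rank_p = rng.permutation(n) + 1   # rank(y | y_p, x): position-bias algorithm A_p
# toy ground truth: documents ranked well by both algorithms tend to be clicked
rel = ((rank_s + rank_p) < n).astype(float)

# normalize ranks to [0, 1] so gradient descent is well behaved
X = np.column_stack([np.ones(n), rank_s / n, rank_p / n])
w = fit_logistic(X, rel)                     # learns w0, w1, w2
p_click = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted click probability
ensemble = np.argsort(-p_click)              # final ensemble ranking y_e
```

Higher (worse) ranks should depress the click probability, so the learned w1 and w2 come out negative, and sorting by p_click yields the ensemble order.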

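The Borda scheme of Section 4.2.2 reduces to a few lines. Here `borda_aggregate` is an illustrative helper, with each ranking represented as a list of document ids ordered from best to worst.

```python
def borda_aggregate(y_s, y_p):
    """Combine two rankings by Borda count: the document at 0-based position
    `pos` of an n-document ranking beats n - 1 - pos others, so it scores
    n - 1 - pos; total scores are summed across the two rankings."""
    n = len(y_s)
    score = {doc: 0 for doc in y_s}
    for ranking in (y_s, y_p):
        for pos, doc in enumerate(ranking):
            score[doc] += n - 1 - pos
    # sort by total score, non-ascending, to obtain the ensemble ranking y_e
    return sorted(score, key=lambda doc: -score[doc])

# the paper's example: a document ranked 1st in y_s and 3rd in y_p scores
# (n - 1) + (n - 3); with n = 4 documents that is 3 + 1 = 4.
y_s = ["d1", "d2", "d3", "d4"]   # d1 ranked 1st by the selection-bias model
y_p = ["d2", "d3", "d1", "d4"]   # d1 ranked 3rd by the position-bias model
y_e = borda_aggregate(y_s, y_p)  # y_e == ["d2", "d1", "d3", "d4"]
```

In this toy case d2 wins (score 2 + 3 = 5) over d1 (score 3 + 1 = 4), illustrating how agreement between the two rankings outweighs a single first place.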
Table 1: Experimental Parameters

  parameter | value/category     | description                              | section
  k         | 1-30               | number of observed docs (selection bias) | 5.2
  η         | 0, 0.5, 1, 1.5, 2  | position bias severity                   | 5.2.1
  noise     | 0%, 10%, 20%, 30%  | clicks on irrelevant docs                | 5.2.2, 5.2.3

5.1.2 Semi-synthetic data generation. We use the real-world base dataset, D_YLTR, to generate semi-synthetic datasets that contain document clicks for <query, document> rankings. The main motivation behind using the Yahoo! Learning To Rank dataset is that it provides unbiased ground truth for relevant results, thus enabling unbiased evaluation of ranking algorithms. In real-world scenarios, unbiased ground truth is hard to come by, and LTR algorithms are typically trained on biased click data, which does not allow for unbiased evaluation. To mimic real-world scenarios for LTR, the synthetic data generation creates such biased click data.

We follow the data-generation process of Joachims et al. [29]. We train a base ranker, in our case SVMrank, with 1% of the training dataset that contains true relevances, and then use the trained model to generate rankings for the remaining 99% of the queries in the training dataset. The second step of the data-generation process generates clicks on the ranked documents in the training dataset. The click probability of document y for a given query x is calculated as

    P(c_{x,y} = 1) = r_i(y) / rank(y|ȳ)^η

where c_{x,y} and r_i(y) represent whether a document y is clicked and relevant respectively, rank(y|ȳ) denotes the rank of document y for query x if the user was presented the ranking ȳ, and η indicates the severity of position bias. Note that clicks are not generated for documents below a certain rank cutoff k, in order to incorporate the selection bias.

In a single pass over the entire training data, we generate clicks following the above click probability. We refer to this as one sampling pass. For the smaller set 2, we generate clicks over 15 sampling passes, while for the larger set 1, we generate clicks over 5 sampling passes. This click-generation process reflects a common user behavior where some relevant documents do not receive any clicks, while other relevant documents receive multiple clicks. This process captures the generation of noiseless clicks, where users only click on relevant documents. We also consider a click-generation process with noisy clicks, in which a small percentage of clicks (10–30%) occur on irrelevant documents.

5.1.3 Evaluation. We explore the performance of the LTR algorithms Naive SVMrank, Propensity SVMrank, and Heckmanrank, along with the two ensemble methods Linear Combination (CombinedW) and Rank Aggregation (RankAgg), with two different metrics: the Average Rank of Relevant Results,

    ARRR = ( Σ_{y : o_i=1 ∧ r_i=1} rank(y|ȳ) ) / |X|,

and the Normalized Discounted Cumulative Gain, nDCG@p = DCG@p / IDCG@p, where p is the rank position up to which we are interested to evaluate, DCG@p represents the discounted cumulative gain of the given ranking, and IDCG@p refers to the ideal discounted cumulative gain. We can compute DCG@p using the formula

    DCG@p = Σ_{i=1}^{p} (2^{rel(x,y)} − 1) / log_2(i + 1).

Similarly,

    IDCG@p = Σ_{i=1}^{|REL@p|} (2^{rel(x,y)} − 1) / log_2(i + 1),

where REL@p represents the list of relevant documents (ordered by their relevance) in the ranking up to position p for a given query. In our evaluation, we chose p = 10 for the nDCG metric, and we refer to it as nDCG@10.

Each figure in the experiments depicts how the ARRR or nDCG@10 (y axis) changes when the user only observes the first k ∈ [1, 30] documents (x axis). Note that k reflects the severity of selection bias, as we model selection bias by assigning a zero observation probability to documents below cutoff k. In contrast, position bias is modeled by assigning a non-zero probability to every single document, where η represents the severity of the position bias. We vary the severity of both selection bias and position bias, with or without the existence of noise in click generation.

When training Propensity SVMrank, we apply an Inverse Propensity Score for clicked documents, Q(o(y)=1 | x, ȳ, r) = (1 / rank(y|ȳ))^η, where o and r represent whether a document is observed and relevant respectively, following Joachims et al. [29]. Q(o(y)=1 | x, ȳ, r) is the propensity score denoting the marginal probability of observing the relevance of result y for query x if the user was presented the ranking ȳ, and η indicates the severity of position bias.

Heckmanrank is implemented following the steps described in Section 4. In step 1, the documents that appear among the n shown results for each query are considered observed (o_{x,y} = 1), and the remainder as not observed (o_{x,y} = 0). It is important to note that other LTR algorithms throw away the documents with o_{x,y} = 0 in training, while we do not. In our implementation, Z only includes the feature set common to F_{x,y}. For the ensemble methods, the selection-bias recovery algorithm As is Heckmanrank and the position-bias recovery algorithm Ap is Propensity SVMrank. Given the model learned during training, each algorithm ranks the documents in the test set.

5.2 Experimental results on set 2
Here, we evaluate the performance of each algorithm under different levels of position bias (η = 0, 0.5, 1, 1.5, 2) and when clicks are noisy or noiseless (0%, 10%, 20% and 30% noise). In each case, we use ARRR and nDCG@10. For evaluation, the (noiseless) clicked documents in the test set are considered to be relevant documents, and the average rank of relevant results (ARRR) across queries, along with nDCG@10, is our metric to evaluate each algorithm's performance.

5.2.1 Effect of position bias. Figure 1 illustrates the performance of all LTR algorithms and ensembles for varying degrees of position bias (η ∈ {0, 0.5, 1, 1.5}). Figures 1a, 1b, 1c and 1d show the performance as ARRR. Figures 1e, 1f, 1g and 1h show nDCG@10. Due to space, we omit the figures for η = 2, since η = 1.5 already captures the trend of Propensity SVMrank starting to work better than the other methods. Heckmanrank suffers when there is severe position bias.

Figures 1a–1d illustrate that Heckmanrank outperforms Propensity SVMrank in the absence of position bias (η = 0), or when position bias is low (η = 0.5) or moderate (η = 1). The better performance of Heckmanrank over Propensity SVMrank vanishes with increased position bias, such that at a high position bias level (η = 1.5), Heckmanrank falls behind Propensity SVMrank, but still outperforms Naive SVMrank. The reason for this is that a high position bias results in a high click frequency for top-ranked documents, leaving low-ranked documents with a very small chance of being clicked.
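The click-generation protocol of Section 5.1.2 and the inverse-propensity weights used by Propensity SVMrank can be sketched as follows. The function and parameter names are illustrative, and the way noise is injected (flipping the relevance used for the click draw with probability `noise`) is one plausible reading of the 10–30% noisy-click setup, not the paper's exact procedure.

```python
import numpy as np

def simulate_clicks(rel, eta=1.0, k=10, noise=0.0, rng=None):
    """One sampling pass over a ranked list.

    rel[i] in {0, 1} is the true relevance of the document shown at
    position i + 1.  P(click at position i + 1) = rel / (i + 1) ** eta;
    positions beyond the cutoff k are never observed (selection bias);
    with probability `noise` the relevance used for the draw is flipped,
    so some clicks land on irrelevant documents."""
    rng = rng or np.random.default_rng()
    clicks = np.zeros(len(rel), dtype=int)
    for i, r in enumerate(rel):
        if i >= k:                       # below the cutoff: zero observation prob.
            continue
        r_used = r if rng.random() >= noise else 1 - r
        clicks[i] = rng.random() < r_used / (i + 1) ** eta
    return clicks

def ips_weights(positions, eta=1.0):
    """Propensity SVMrank reweights each clicked document by
    1 / Q(o = 1 | ...) = rank ** eta."""
    return np.asarray(positions, dtype=float) ** eta

rng = np.random.default_rng(7)
rel = rng.integers(0, 2, size=30)            # binarized relevance of 30 ranked docs
clicks = simulate_clicks(rel, eta=1.0, k=10, noise=0.0, rng=rng)
w = ips_weights(np.nonzero(clicks)[0] + 1)   # IPS weights for the clicked positions
```

Repeating `simulate_clicks` over several sampling passes reproduces the behavior described above: some relevant documents are never clicked, others are clicked multiple times, and nothing below position k is ever clicked.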

[Figure 1: eight panels plotting performance against the number of observed documents k for NaiveSVM, PropSVM, Heckman, CombinedW, and RankAgg. Panels (a)–(d): ARRR for η = 0, 0.5, 1, 1.5; panels (e)–(h): nDCG@10 for η = 0, 0.5, 1, 1.5. Caption: The performance of LTR algorithms on set 2. Lower is better for ARRR, higher is better for nDCG@10.]

Heckmanrank is designed to control for the probability of a document being observed. If top-ranked documents have a disproportionately higher density in the click data relative to low-ranked documents, then the predicted probabilities in Heckmanrank will also reflect this imbalance. In terms of algorithms that address both position bias and selection bias, Figures 1a and 1b show that for η = 0 and 0.5, CombinedW and RankAgg outperform both Propensity SVMrank and Heckmanrank for k ≲ 7. Moreover, Figures 1c and 1d show that for η = 1 and η = 1.5, RankAgg outperforms its component algorithms for almost all values of k.

When ARRR is the metric of interest, Figures 1a, 1b, 1c and 1d illustrate that Heckmanrank outperforms Propensity SVMrank in the absence of position bias (η = 0) and when position bias is low to moderate (η = {0.5, 1}), while it falls behind Propensity SVMrank when position bias increases (η = 1.5). Compared to the results for nDCG@10 illustrated in Figures 1e, 1f, 1g and 1h, Heckmanrank appears to start lagging behind in performance at η = 1.5. For the ensemble methods, Figures 1e and 1f illustrate that when η = {0, 0.5}, CombinedW and RankAgg outperform their component algorithms for k ≲ 10. Figure 1g demonstrates the better performance of RankAgg over its component algorithms for all values of k when η = 1. However, for a severe position bias of η = 1.5, CombinedW and RankAgg do not outperform their component algorithms for any value of k, but RankAgg becomes the second-best algorithm. Among the ensemble methods, RankAgg is more robust to position bias than CombinedW.

Our main takeaways from this are:
• Under small to no position bias (η = 0, 0.5), Heckmanrank outperforms Propensity SVMrank for both metrics.
• Under moderate position bias (η = 1), while Heckmanrank outperforms Propensity SVMrank for ARRR, it lags behind Propensity SVMrank for nDCG@10.
• Under severe position bias (η = 1.5), Heckmanrank falls behind Propensity SVMrank for both ARRR and nDCG@10.
• RankAgg performs better than Heckmanrank for all selection bias levels, and it is more robust to position bias than CombinedW. CombinedW surpasses Heckmanrank under severe selection bias (k ≲ 10).

5.2.2 Effect of click noise. Thus far, we have considered noiseless clicks that are generated only over relevant documents. However, this is not a realistic assumption, as users may also click on irrelevant documents.
We now relax this assumption and allow 10% of the clicked documents to be irrelevant.

When ARRR is the preferred metric, Figures 2a, 2b, 2c and 2d illustrate that Heckmanrank outperforms Propensity SVMrank for η = {0, 0.5, 1}, while under a higher position bias level (η = 1.5), Heckmanrank falls behind Propensity SVMrank. Comparing the noisy-click performance to the noiseless one (Figures 1a, 1b, 1c), one can conclude that for η = {0, 0.5, 1}, Propensity SVMrank is highly affected by noise, while Heckmanrank is much less affected. For example, Figure 2c illustrates that for η = 1, the better performance of Heckmanrank over Propensity SVMrank is much more noticeable compared to Figure 1c, where clicks were noiseless. Interestingly, neither of the ensembles, CombinedW nor RankAgg, outperforms the most successful algorithm in the presence of noisy clicks.

When nDCG@10 is the preferred metric, one can draw the same conclusions: while Heckmanrank is more robust to noise and outperforms Propensity SVMrank for η = {0, 0.5, 1}, it fails to beat Propensity SVMrank for η = 1.5.
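The two evaluation metrics of Section 5.1.3 are straightforward to compute per query from the binary relevance labels of a ranking; helper names are illustrative, and the paper's ARRR averages this per-query quantity across all queries.

```python
import math

def average_rank_of_relevant(rel):
    """ARRR for one query: mean 1-based rank of the relevant documents."""
    ranks = [i + 1 for i, r in enumerate(rel) if r == 1]
    return sum(ranks) / len(ranks) if ranks else 0.0

def dcg_at(rel, p):
    """DCG@p = sum_{i=1..p} (2^rel_i - 1) / log2(i + 1), 1-based positions."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rel[:p]))

def ndcg_at(rel, p=10):
    """nDCG@p = DCG@p / IDCG@p; the ideal ranking sorts labels descending."""
    ideal = dcg_at(sorted(rel, reverse=True), p)
    return dcg_at(rel, p) / ideal if ideal > 0 else 0.0

rel = [0, 1, 0, 1, 0]                   # relevance labels in ranked order
arrr = average_rank_of_relevant(rel)    # relevant docs at ranks 2 and 4 -> 3.0
ndcg = ndcg_at(rel, p=10)
```

Lower ARRR is better (relevant documents sit nearer the top), while higher nDCG@10 is better, which is why the two metrics run in opposite directions in the figures.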

[Figure 2: same layout as Figure 1 — panels (a)–(d): ARRR and panels (e)–(h): nDCG@10 for η = 0, 0.5, 1, 1.5, each versus the number of observed documents k. Caption: The performance of LTR algorithms on set 2 under 10% noisy clicks.]

Another interesting point is that Propensity SVMrank is severely affected by noise when selection bias is high (low values of k), such that it even falls behind Naive SVMrank. This exemplifies how much selection bias can degrade the performance of LTR systems if they do not correct for it.

[Figure 3: panels (a) ARRR and (b) nDCG@10 versus noise level. Caption: Effect of noisy clicks for high selection bias (k = 10) and moderate position bias (η = 1).]

[Figure 4: panels (a) ARRR and (b) nDCG@10 versus noise level. Caption: Effect of noisy clicks for high selection bias (k = 10) and high position bias (η = 2).]

In the presence of 10% noisy clicks, the main takeaways are:
• Under severe to moderate selection bias (k ≲ 15), Propensity SVMrank suffers heavily from the noise, and it even falls behind Naive SVMrank for both ARRR and nDCG@10.
• Heckmanrank outperforms Propensity SVMrank when position bias is not severe (η = {0, 0.5, 1}) for both metrics.
• Just like in the noiseless case, Heckmanrank cannot surpass Propensity SVMrank under severe position bias (η = 1.5).
• CombinedW and RankAgg surpass Heckmanrank for a severe selection bias (k ≲ 5) when η = {0, 0.5}, for both ARRR and nDCG@10. However, RankAgg and CombinedW cannot beat Propensity SVMrank under high position bias.

5.2.3 Effect of varying noise for η = 1 and η = 2. In this section, we investigate whether our proposed models are robust to noise. Toward this goal, we varied the noise level from 0% to 30%. Figures 3a and 3b show the performance of the LTR algorithms for different levels of noise, where k = 10 and η = 1. Under increasing noise, the performance of Heckmanrank is relatively stable and even improves, while the performance of all other LTR algorithms degrades. Even Naive SVMrank is more robust to noise compared to Propensity SVMrank, which differs from the results of Joachims et al. [29], where no selection bias was considered. The reason could be that their evaluation is based on the assumption that all documents have a non-zero probability of being observed, while Figures 3a and 3b are under the condition that documents ranked below a certain cut-off (k = 10) have a zero probability of being observed.

[Figure 5: panels (a)–(d) plot ARRR for η = 0.5, 1, 1.5, 2 versus the number of observed documents k.]

Figure 5: The performance (ARRR) of LTR algorithms on set 1.

We also investigate the performance of the LTR algorithms with respect to noise when position bias is severe (η = 2). As shown in Figure 4, irrespective of the metric of interest, Heckmanrank is robust to varying noise, while the performance of all other algorithms degrades when the noise level increases. Propensity SVMrank falls behind all other algorithms at high levels of noise. This implies that even though Heckmanrank cannot surpass Propensity SVMrank when position bias is severe (η = 1.5, 2) in noiseless environments, it clearly outperforms Propensity SVMrank in the presence of selection bias with noise. This is an extremely useful property, since in real-world applications we cannot assume a noiseless environment.

5.3 Experimental results on set 1
To confirm the performance of our proposed methods on the larger set 1 with out-of-sample test data, we ran experiments varying position bias (η = {0.5, 1, 1.5, 2}) under noiseless clicks. The results on this dataset were even more promising, especially for high position bias. Figure 5 illustrates the ARRR performance of all algorithms. Heckmanrank outperforms Propensity SVMrank for all position bias levels, though its strong performance decreases with increasing η. This is unlike set 2, where Heckmanrank did not outperform Propensity SVMrank under high position bias. The ensemble RankAgg outperforms both Heckmanrank and Propensity SVMrank for all position and selection bias levels, while CombinedW outperforms Propensity SVMrank but does not surpass Heckmanrank. Moreover, the stronger performance of Heckmanrank and RankAgg over Propensity SVMrank is much more pronounced compared to set 2.

6 CONCLUSION
In this work, we formalized the problem of selection bias in learning-to-rank systems and proposed Heckmanrank as an approach for correcting for selection bias. We also presented two ensemble methods that correct for both selection and position bias by combining the rankings of Heckmanrank and Propensity SVMrank. Our extensive experiments on semi-synthetic datasets show that selection bias affects the performance of LTR systems and that Heckmanrank performs better than existing approaches that correct for position bias but do not address selection bias. Nonetheless, this performance decreases as the position bias increases. At the same time, Heckmanrank is more robust to noisy clicks even with severe position bias, while Propensity SVMrank is adversely affected by noisy clicks in the presence of selection bias and even falls behind Naive SVMrank. The ensemble methods, CombinedW and RankAgg, outperform Heckmanrank for severe selection bias and zero to small position bias.

Our initial study of selection bias suggests a number of promising future avenues for research. For example, our initial work considers only linear models, but a Heckman-based solution to selection bias can be adapted to non-linear algorithms as well, including extensions that consider bias-correction mechanisms specific to each learning-to-rank algorithm. Our experiments suggest that studying correction methods that jointly account for position bias and selection bias can potentially address the limitations of methods that only account for one. Finally, even though we specifically studied selection bias in the context of learning-to-rank systems, we expect that our methodology will have broader applications beyond LTR systems.

ACKNOWLEDGEMENTS
We thank John Fike and Chris Kanich for providing invaluable server support. We acknowledge Amazon Web Services for their support through AWS Cloud Credits for Research. This material is based on research sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR00111990114 and the National Science Foundation under Grant No. 1801644. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

REFERENCES
[1] Aman Agarwal, Kenta Takatsu, Ivan Zaitsev, and Thorsten Joachims. 2019. A General Framework for Counterfactual Learning-to-Rank. In ACM Conference on Research and Development in Information Retrieval (SIGIR).
[2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Unbiased Learning to Rank with Unbiased Propensity Estimation. In SIGIR.

[3] Elias Bareinboim and Judea Pearl. 2012. Controlling selection bias in causal inference. In AISTATS. 100–108.
[4] Elias Bareinboim and Judea Pearl. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113, 27 (2016), 7345–7352.
[5] Elias Bareinboim and Jin Tian. 2015. Recovering causal effects from selection bias. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[6] Elias Bareinboim, Jin Tian, and Judea Pearl. 2014. Recovering from Selection Bias in Causal and Statistical Inference. In AAAI. 2410–2416.
[7] Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A neural click model for web search. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 531–541.
[8] Òscar Celma and Pedro Cano. 2008. From hits to niches?: or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition. ACM, 5.
[9] Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. 2018. How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility. In RecSys.
[10] Olivier Chapelle and Yi Chang. 2011. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge. 1–24.
[11] Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS) 30, 1 (2012), 6.
[12] Olivier Chapelle and Ya Zhang. 2009. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th International Conference on World Wide Web. ACM, 1–10.
[13] Li Chen, Marco De Gemmis, Alexander Felfernig, Pasquale Lops, Francesco Ricci, and Giovanni Semeraro. 2013. Human decision making and recommender systems. ACM Transactions on Interactive Intelligent Systems (TiiS) 3, 3 (2013), 1–7.
[14] Juan D Correa and Elias Bareinboim. 2017. Causal effect identification by adjustment under confounding and selection biases. In Thirty-First AAAI Conference on Artificial Intelligence.
[15] Juan D Correa, Jin Tian, and Elias Bareinboim. 2018. Generalized adjustment under confounding and selection biases. In AAAI.
[16] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In WSDM. ACM, 87–94.
[17] Zhao Dan-Dan, Zeng An, Shang Ming-Sheng, and Gao Jian. 2013. Long-term effects of recommendation on the evolution of online systems. Chinese Physics Letters 30, 11 (2013), 118901.
[18] Pranav Dandekar, Ashish Goel, and David T Lee. 2013. Biased assimilation, homophily, and the dynamics of polarization. Proceedings of the National Academy of Sciences 110, 15 (2013), 5791–5796.
[19] Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. 2001. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web. ACM, 613–622.
[20] Daniel M Fleder and Kartik Hosanagar. 2007. Recommender systems and their impact on sales diversity. In Proceedings of the 8th ACM Conference on Electronic Commerce. ACM, 192–199.
[21] James Heckman. 1979. Sample Selection Bias as a Specification Error. Econometrica 47, 1 (1979), 153–161.
[22] José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. 2014. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning. 1512–1520.
[23] Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. 2013. Reusing historical interaction data for faster online learning to rank for IR. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 183–192.
[24] Ziniu Hu, Yang Wang, Qu Peng, and Hang Li. 2019. Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm. (2019).
[25] Rolf Jagerman, Harrie Oosterhuis, and Maarten de Rijke. 2019. To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions. (2019).
[26] Lilli Japec, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O'Neil, and Abe Usher. 2015. Big data in survey research: AAPOR task force report. Public Opinion Quarterly 79, 4 (2015), 839–880. https://doi.org/10.1093/poq/nfv039
[27] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 133–142.
[28] Thorsten Joachims, Laura A Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, Vol. 5. 154–161.
[29] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In WSDM. ACM, 781–789.
[30] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The parable of Google Flu: traps in big data analysis. Science 343, 6176 (2014), 1203–1205.
[31] Shili Lin. 2010. Rank aggregation methods. Wiley Interdisciplinary Reviews: Computational Statistics 2, 5 (2010), 555–570.
[32] Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer Science & Business Media.
[33] Harrie Oosterhuis and Maarten de Rijke. 2018. Differentiable unbiased online learning to rank. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 1293–1302.
[34] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. 2016. Causal inference in statistics: A primer. John Wiley & Sons.
[35] Karthik Raman and Thorsten Joachims. 2013. Learning socially optimal information systems from egoistic users. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 128–144.
[36] Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: estimating the click-through rate for new ads. In WWW. ACM, 521–530.
[37] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352 (2016).
[38] Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave gradient descent for fast online learning to rank. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 457–466.
[39] Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 71–80.
[40] Andrew Smith and Charles Elkan. 2004. A Bayesian network framework for reject inference. In KDD.
[41] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to rank with selection bias in personal search. In SIGIR. ACM, 115–124.
[42] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position bias estimation for unbiased learning to rank in personal search. In WSDM. ACM, 610–618.
[43] Yixin Wang, Dawen Liang, Laurent Charlin, and David M Blei. 2018. The deconfounded recommender: A causal inference approach to recommendation. arXiv preprint arXiv:1808.06581 (2018).
[44] Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 1201–1208.
[45] Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 114.
