On the Calibration and Uncertainty of Neural Learning to Rank Models for Conversational Search
Gustavo Penha and Claudia Hauff
Delft University of Technology, Delft, Netherlands
[email protected], [email protected]

Abstract

According to the Probability Ranking Principle (PRP), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval. The PRP holds when two conditions are met: [C1] the models are well calibrated, and [C2] the probabilities of relevance are reported with certainty. We know, however, that deep neural networks (DNNs) are often not well calibrated and have several sources of uncertainty, and thus [C1] and [C2] might not be satisfied by neural rankers. Given the success of neural Learning to Rank (L2R) approaches, and here especially BERT-based approaches, we first analyze under which circumstances deterministic neural rankers are calibrated for conversational search problems. Then, motivated by our findings, we use two techniques to model the uncertainty of neural rankers, leading to the proposed stochastic rankers, which output a predictive distribution of relevance as opposed to point estimates. Our experimental results on the ad-hoc retrieval task of conversation response ranking[1] reveal that (i) BERT-based rankers are not robustly calibrated and that stochastic BERT-based rankers yield better calibration; and (ii) uncertainty estimation is beneficial both for risk-aware neural ranking, i.e. taking into account the uncertainty when ranking documents, and for predicting unanswerable conversational contexts.

[1] The source code and data are available at https://github.com/Guzpenha/transformer_rankers/tree/uncertainty_estimation.

Figure 1: While deterministic neural rankers output a point estimate probability (magenta values) of relevance for a combination of query (blue bars) and document (grey bars), stochastic neural rankers output a predictive distribution (orange curves). The dispersion of the predictive distribution provides an estimation of the model uncertainty.

1 Introduction

According to the Probability Ranking Principle (PRP) (Robertson, 1977), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval[2]. Gordon and Lenk (1991) discussed that for the PRP to hold, ranking models must at least meet the following conditions: [C1] assign well calibrated probabilities of relevance, i.e. if we gather all documents for which the model predicts relevance with a probability of e.g. 30%, the fraction of relevant documents should be 30%; and [C2] report certain predictions, i.e. only point estimates such as, for example, 80% probability of relevance.

DNNs have been shown to outperform classic Information Retrieval (IR) ranking models over the past few years in setups where considerable training data is available. It has been shown that DNNs are not well calibrated in the context of computer vision (Guo et al., 2017). If the same is true for neural L2R models for IR, e.g. transformer-based models for ranking (Nogueira and Cho, 2019), [C1] is not met. Additionally, there are a number of sources of uncertainty in the training process of neural networks (Gal, 2016) that make it unreasonable to assume that neural ranking models fulfill [C2]: parameter uncertainty (different combinations of weights that explain the data equally well), structural uncertainty (which neural architecture to use for neural ranking), and aleatoric uncertainty (noisy data). Given these sources of uncertainty, using point estimate predictions and ranking according to the PRP might not achieve the optimal ranking for retrieval. While the effectiveness benefits of risk-aware models (Wang, 2009; Wang and Zhu, 2009), which take into account the risk[3], i.e. the uncertainty of the document's prediction scores, have been shown for non-neural IR approaches, this has not yet been explored for neural L2R models.

[2] Standard retrieval task where the user specifies his information need through a query which initiates a search by the system for documents that are likely relevant (Baeza-Yates et al., 1999).
[3] In this paper we use risk and uncertainty interchangeably.
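To make the risk-aware ranking idea concrete, the sketch below re-ranks candidates by penalizing the mean of a ranker's sampled relevance scores with their standard deviation, in the spirit of mean-variance approaches such as Wang and Zhu (2009). It is a minimal illustration only: the scoring rule mean - b * std, the risk weight b, and the toy scores are assumptions of this sketch, not the formulation proposed later in this paper.

```python
import numpy as np

def risk_aware_rank(score_samples, b=1.0):
    """Rank candidates by expected relevance minus a risk penalty.

    score_samples: array of shape (n_candidates, n_samples) holding relevance
        scores sampled from a stochastic ranker's predictive distribution.
    b: risk weight; b = 0 recovers PRP-style ranking by the point estimate.
    Returns candidate indices ordered from best to worst.
    """
    mean = score_samples.mean(axis=1)   # expected relevance per candidate
    std = score_samples.std(axis=1)     # uncertainty of each prediction
    return np.argsort(-(mean - b * std))

# Toy example: candidate 0 has the higher mean score but is far more uncertain.
samples = np.array([[0.90, 0.30, 0.80, 0.20],    # mean 0.55, std ~0.30
                    [0.50, 0.45, 0.55, 0.50]])   # mean 0.50, std ~0.04
print(risk_aware_rank(samples, b=0.0))  # [0 1]: ranks by the mean alone
print(risk_aware_rank(samples, b=1.0))  # [1 0]: prefers the safer candidate
```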
In this paper we first analyze the calibration of neural rankers, specifically BERT-based rankers for IR tasks, to evaluate how calibrated they are. Then, to model the uncertainty of BERT-based rankers, we propose stochastic neural ranking models (see Figure 1), by applying different techniques to model the uncertainty of DNNs, namely MC Dropout (Gal and Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017), which are agnostic to the particular DNN.

In our experiments, we test models under distributional shift, i.e. the test data distribution is different from the training data, also referred to as out-of-distribution (OOD) examples (Lee et al., 2018). In real-world settings, there are often inputs that are shifted due to factors such as non-stationarity and sample bias. Additionally, this experimental setup provides a way of measuring whether the DNN "knows what it knows" (Ovadia et al., 2019), e.g. by outputting high uncertainty for OOD examples.

We find that BERT-based rankers are not robustly calibrated. Stochastic BERT-based rankers have 14% less calibration error on average than BERT-based rankers. Uncertainty estimation from stochastic BERT-based rankers is advantageous for downstream applications, as shown by our experiments for risk-aware neural ranking (2% more effective on average relative to a model without risk-awareness) and for predicting unanswerable conversational contexts (improving classification by 33% on average over all conditions).

2 Related Work

Calibration and Uncertainty in IR

Even though optimally ranking documents according to the PRP (Robertson, 1977) requires the model to be calibrated (Gordon and Lenk, 1991) ([C1]), the calibration of ranking models has received little attention in IR. In contrast, in the machine learning community there have been a number of studies about calibration (Ovadia et al., 2019; Maddox et al., 2019), due to the larger decision-making pipelines DNNs are often part of and the importance of calibration for model interpretability (Thiagarajan et al., 2020). For instance, in the automated medical domain it is important to provide a calibrated confidence measure besides the prediction of a disease diagnosis, to provide clinicians with sufficient information (Jiang et al., 2012). Guo et al. (2017) have shown that DNNs are not well calibrated in the context of computer vision, motivating our study of the calibration of neural L2R models.

The second condition ([C2]) for optimal retrieval when ranking according to the PRP (Gordon and Lenk, 1991) is that models report predictions with certainty. While this (un)certainty has not been studied in neural L2R models, there are classic approaches in IR that model uncertainty. Such approaches have been mostly inspired by economic theory, treating variance as a measure of uncertainty (Varian, 1999). Following such ideas, non-neural ranking models that take uncertainty into account (i.e. risk-aware models), and thus do not follow the PRP (Robertson, 1977), have been proposed (Zhu et al., 2009; Wang and Zhu, 2009), showing significant effectiveness improvements compared to models that do not model uncertainty. Uncertainty estimation is a difficult task that has other applications in IR besides improving ranking effectiveness: it can be employed to decide between asking clarifying questions and providing a potential answer in conversational search (Aliannejadi et al., 2019); to perform dynamic query reformulation (Lin et al., 2020) for queries where the intent is uncertain; and to predict questions with no correct answers (Feng et al., 2020).

Bayesian Neural Networks

Unlike standard algorithms to train neural networks, e.g. SGD, that fit point estimate weights given the observed data, Bayesian Neural Networks (BNNs) infer a distribution over the weights given the observed data. Denker et al. (1987) contains one of the earliest mentions of placing probability distributions over the weights of a model. An advantage of the Bayesian treatment of neural networks (MacKay, 1992; Neal, 2012; Blundell et al., 2015) is that they better represent the uncertainties present in the training procedure. One limitation of BNNs is that they are computationally expensive compared to DNNs. This has led to the development of techniques that scale well and do not require modifications of the neural network architecture and training procedure. Gal and Ghahramani (2016) proposed a way to approximate Bayesian inference by relying on dropout (Srivastava et al., 2014). While dropout is a regularization technique that ignores units with probability p during every training iteration and is disabled at test time, MC Dropout (Gal and Ghahramani, 2016) employs dropout at both train and test time and generates a predictive distribution after a number of forward passes. Lakshminarayanan et al. (2017) proposed an alternative: they employ ensembles of models (Ensemble) to obtain a predictive distribution. Ovadia et al. (2019) showed that Ensemble is able to produce well calibrated uncertainty estimates.
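As a concrete reference for these two estimators, the sketch below derives a predictive distribution from an existing PyTorch-style ranker. The model object and its call signature (one relevance score per forward pass) are placeholder assumptions for illustration; the paper's actual BERT-based implementation lives in the transformer_rankers repository linked above.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, inputs, n_samples=10):
    """MC Dropout: keep dropout layers active at test time and aggregate
    several stochastic forward passes into a predictive distribution."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()  # re-enable only the dropout layers
    scores = torch.stack([model(inputs) for _ in range(n_samples)])
    return scores.mean(dim=0), scores.std(dim=0)  # point estimate + uncertainty

@torch.no_grad()
def ensemble_predict(models, inputs):
    """Deep Ensembles: aggregate predictions of independently trained
    copies of the same architecture (different random initializations)."""
    for m in models:
        m.eval()
    scores = torch.stack([m(inputs) for m in models])
    return scores.mean(dim=0), scores.std(dim=0)
```

Both estimators leave the underlying network and training procedure untouched, which is what makes them attractive compared to full Bayesian inference over the weights.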
RQ2 Are the uncertainty estimates from stochastic BERT-based rankers useful for risk-aware ranking? RQ3 Are the uncertainty estimates obtained from stochastic BERT-based rankers useful for identifying unanswerable queries? We first describe how to measure the calibration of neural rankers ([C1]), followed by our approach for modeling and ranking under uncertainty ([C2]), and then we describe how we evaluate their robustness to distributional shift.

3.1 Measuring Calibration

To evaluate the calibration of neural rankers (RQ1) we resort to the Empirical Calibration Error (ECE) (Naeini et al., 2015).
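For reference, the sketch below implements the common binned formulation of this metric (following Naeini et al., 2015; Guo et al., 2017): predictions are grouped into equally spaced confidence bins, and the gaps between the average predicted probability and the observed fraction of relevant documents are averaged, weighted by bin size. The bin count and the binary-relevance framing are illustrative assumptions rather than the paper's exact experimental setup.

```python
import numpy as np

def empirical_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error: weighted mean |accuracy - confidence| over bins.

    probs:  predicted probabilities of relevance, shape (n,)
    labels: binary relevance labels (0/1), shape (n,)
    """
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:
            in_bin = (probs >= lo) & (probs <= hi)
        else:
            in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()  # average predicted probability
            accuracy = labels[in_bin].mean()   # observed fraction of relevant docs
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

# A model that predicts 0.9 for documents of which only 60% are relevant
# contributes a 0.3 gap for that bin; a perfectly calibrated model scores 0.
```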