Utility is in the Eye of the User: A Critique of NLP Leaderboards

Kawin Ethayarajh and Dan Jurafsky
Stanford University
[email protected]  [email protected]

Abstract

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards – in their current form – can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).

1 Introduction

The past few years have seen significant progress on a variety of NLP tasks, from question answering to machine translation. These advances have been driven in part by benchmarks such as GLUE (Wang et al., 2018), whose leaderboards rank models by how well they perform on these diverse tasks. Performance-based evaluation on a shared task is not a recent idea either; this sort of shared challenge has been an important driver of progress since MUC (Sundheim, 1995). While this paradigm has been successful at driving the creation of more accurate models, the historical focus on performance-based evaluation has been at the expense of other attributes valued by the NLP community, such as fairness and energy efficiency (Bender and Friedman, 2018; Strubell et al., 2019). For example, a highly inefficient model would have limited use in practical applications, but this would not preclude it from reaching the top of most leaderboards. Similarly, models can reach the top while containing racial and gender biases – and indeed, some have (Bordia and Bowman, 2019; Manzini et al., 2019; Rudinger et al., 2018; Blodgett et al., 2020).

Microeconomics provides a useful lens through which to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. We can frame both the leaderboard and NLP practitioners as consumers of models and the benefit they receive from a model as its utility to them. Although leaderboards are inanimate, this framing allows us to make an apples-to-apples comparison: if the priorities of leaderboards and practitioners are perfectly aligned, their utility functions should be identical; the less aligned they are, the greater the differences. For example, the utility of both groups is monotonic non-decreasing in accuracy, so a more accurate model is no less preferable to a less accurate one, holding all else constant. However, while the utility of practitioners is also sensitive to the size and efficiency of a model, the utility of leaderboards is not. By studying such differences, we formalize some of the limitations in contemporary leaderboard design:

1. Non-Smooth Utility: For a leaderboard, an improvement in model accuracy on a given task only increases utility when it also increases rank. For practitioners, any improvement in accuracy can increase utility.

2. Prediction Cost: Leaderboards treat the cost of making predictions (e.g., model size, energy efficiency, latency) as being zero, which does not hold in practice.

3. Robustness: Practitioners receive higher utility from a model that is more robust to adversarial perturbations, generalizes better to out-of-distribution data, and is equally fair to all demographics. However, these benefits would leave leaderboard utility unchanged.

We contextualize these limitations with examples from the ML fairness (Barocas et al., 2017; Hardt et al., 2016), Green AI (Strubell et al., 2019; Schwartz et al., 2019), and robustness literature (Jia and Liang, 2017). These three limitations are not comprehensive – other problems can also arise, which we leave to be discussed in future work.

What changes can we make to leaderboards so that their utility functions better reflect that of the NLP community at large? Given that each practitioner has their own preferences, there is no way to rank models so that everyone is satisfied. Instead, we suggest that leaderboards demand transparency, requiring the reporting of statistics that are of practical concern (e.g., model size, energy efficiency). This is akin to the use of data statements for mitigating bias in NLP systems (Gebru et al., 2018; Mitchell et al., 2019; Bender and Friedman, 2018). This way, practitioners can determine the utility they receive from a given model with relatively little effort. Dodge et al. (2019) have suggested that model creators take it upon themselves to report these statistics, but without leaderboards requiring it, there is little incentive to do so.

2 Utility Functions

In economics, the utility of a good denotes the benefit that a consumer receives from it (Mankiw, 2020). We specifically discuss the theory of cardinal utility, in which the amount of the good consumed can be mapped to a numerical value that quantifies its utility in utils (Mankiw, 2020). For example, a consumer might assign a value of 10 utils to two apples and 8 utils to one orange; we can infer both the direction and magnitude of the preference.

Leaderboards. We use the term leaderboard to refer to any ranking of models or systems using performance-based evaluation on a shared benchmark. In NLP, this includes both longstanding benchmarks such as GLUE (Wang et al., 2018) and one-off challenges such as the annual SemEval STS tasks (Agirre et al., 2013, 2014, 2015). This is not a recent idea either; this paradigm has been a driver of progress since MUC (Sundheim, 1995). All we assume is that all models are evaluated on the same held-out test data.

In our framework, leaderboards are consumers whose utility is solely derived from the rank of a model. Framing leaderboards as consumers is unorthodox, given that they are inanimate – in fact, it might seem more intuitive to say that leaderboards are another kind of product that is also consumed by practitioners. While that perspective is valid, what we ultimately care about is how good a proxy leaderboards are for practitioner preferences. Framing both leaderboards and practitioners as consumers permits an apples-to-apples comparison using their utility functions. If a leaderboard were only thought of as a product, it would not have such a function, precluding such a comparison.

Unlike most kinds of consumers, a leaderboard is a consumer whose preferences are perfectly revealed through its rankings: the state-of-the-art (SOTA) model is preferred to all others, the second-ranked model is preferred to all those below it, and so on. Put more formally, leaderboard utility is monotonic non-decreasing in rank. Still, because each consumer is unique, we cannot know the exact shape of a leaderboard utility function – only that it possesses this monotonicity.

NLP Practitioners. Practitioners are also consumers, but they derive utility from multiple properties of the model being consumed (e.g., accuracy, energy efficiency, latency). Each input into their utility function is some desideratum, but since each practitioner applies the model differently, the functions can be different. For example, someone may assign higher utility to BERT-Large (Devlin et al., 2019) and its 95% accuracy on some task, while another may assign higher utility to the smaller BERT-Base and its 90% accuracy. As with leaderboards, although the exact shapes of practitioner utility functions are unknown, we can infer that they are monotonic non-decreasing in each desideratum. For example, more compact models are more desirable, so increasing compactness while holding all else constant will never decrease utility.
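To state the two monotonicity properties compactly, one illustrative formalization is as follows; the notation is our own shorthand for the discussion above, not a standard formalism:

\[
u_{\text{lb}} = f(r), \qquad r < r' \;\Rightarrow\; f(r) \geq f(r'),
\]

where \(r \in \{1, 2, \dots\}\) is the model's rank (1 being SOTA), and

\[
u_{\text{pr}} = g(d_1, \dots, d_k), \qquad d_i \leq d_i' \;\Rightarrow\; g(\dots, d_i, \dots) \leq g(\dots, d_i', \dots),
\]

where each \(d_i\) is a desideratum such as accuracy or compactness, and the implication holds with all other arguments fixed.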

3 Utilitarian Critiques

Our criticisms apply regardless of the shape taken by practitioner utility functions – a necessity, given that the exact shapes are unknown. However, not every criticism applies to every leaderboard. StereoSet (Nadeem et al., 2020) is a leaderboard that ranks language models by how unbiased they are, so fairness-related criticisms would not apply as much to StereoSet. Similarly, the SNLI leaderboard (Bowman et al., 2015) reports the model size – a cost of making predictions – even if it does not factor this cost into the model ranking. Still, most of our criticisms apply to most leaderboards in NLP, and we provide examples of well-known leaderboards that embody each limitation.

3.1 Non-Smoothness of Utility

Leaderboards only gain utility from an increase in accuracy when it improves the model's rank. This is because, by definition, the leaderboard's preferences are perfectly revealed through its ranking – there is no model that can be preferred to another while having a lower rank. Since an increase in accuracy does not necessarily trigger an increase in rank, it does not necessarily trigger an increase in leaderboard utility either.

Put another way, the utility of leaderboards takes the form of a step function, meaning that it is not smooth with respect to accuracy. In contrast, the utility of practitioners is smooth with respect to accuracy. Holding all else constant, an increase in accuracy will yield some increase in utility, however small. Why this difference? The utility of leaderboards is a function of rank – it is only indirectly related to accuracy. On the other hand, the utility of practitioners is a direct function of accuracy, among other desiderata. To illustrate this difference, we contrast possible practitioner and leaderboard utility functions in Figure 1.

[Figure 1: A contrast of possible utility functions for leaderboards and NLP practitioners. For leaderboards, utility is not smooth with respect to accuracy; there is only an increase if the model is SOTA / #2 / #3 on the leaderboard. For practitioners, utility is smooth – any improvement in accuracy yields more utility. The utility functions can take on a variety of shapes but are always monotonic non-decreasing.]

This difference means that practitioners who are content with a less-than-SOTA model – as long as it is lightweight, perhaps – are under-served, while practitioners who want competitive-with-SOTA models are over-served. For example, on a given task, say that an n-gram baseline obtains an accuracy of 78%, an LSTM baseline obtains 81%, and a BERT-based SOTA obtains 92%. The leaderboard does not incentivize the creation of lightweight models that are ∼85% accurate, and as a result, few such models will be created. This is indeed the case with the SNLI leaderboard (Bowman et al., 2015), where most submitted models are highly parameterized and over 85% accurate. This incentive structure leaves those looking for a lightweight model with limited options. This lack of smaller, more energy-efficient models has been an impediment to the adoption of Green AI (Schwartz et al., 2019; Strubell et al., 2019). Although there are increasingly more lightweight and faster-to-train models – such as ELECTRA, on which accuracy and training time can be easily traded off (Clark et al., 2020) – their creation was not incentivized by a leaderboard, despite there being a demand for such models in the NLP community. A similar problem exists with incentivizing the creation of fair models, though the introduction of leaderboards such as StereoSet (Nadeem et al., 2020) is helping to bridge this divide.
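The step-versus-smooth contrast can also be seen in a toy computation; the accuracies and function shapes below are invented, constrained only to be monotonic non-decreasing as assumed above:

```python
# Toy illustration (ours): the numbers and function shapes are invented,
# constrained only by the monotonicity assumptions in the text.

def leaderboard_utility(rank: int) -> float:
    """A step function of rank alone (1 = SOTA); flat outside the top 3."""
    return {1: 3.0, 2: 2.0, 3: 1.0}.get(rank, 0.0)

def practitioner_utility(accuracy: float) -> float:
    """Smooth in accuracy: any gain yields some utility, however small."""
    return 10.0 * accuracy  # linear is just one admissible shape

def rank_of(accuracy: float, others: list) -> int:
    """Rank within a fixed pool of competitor accuracies (1 = best)."""
    return 1 + sum(o > accuracy for o in others)

pool = [0.92, 0.90, 0.88]  # hypothetical incumbent models
# Improving a lightweight baseline from 0.81 to 0.85 leaves it at rank 4,
# so leaderboard utility is unchanged while practitioner utility rises:
assert rank_of(0.81, pool) == rank_of(0.85, pool) == 4
before = leaderboard_utility(rank_of(0.81, pool))
after = leaderboard_utility(rank_of(0.85, pool))
assert after == before                                          # no rank change, no gain
assert practitioner_utility(0.85) > practitioner_utility(0.81)  # smooth gain
```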

3.2 Prediction Cost

Leaderboards of NLP benchmarks rank models by taking the average accuracy, F1 score, or exact match rate (Wang et al., 2018, 2019; McCann et al., 2018). In other words, they rank models purely by the value of their predictions; no consideration is given to the cost of making those predictions. We define 'cost' here chiefly as model size, energy efficiency, training time, and inference latency – essentially any sacrifice that needs to be made in order to use the model. In reality, no model is costless, yet leaderboards are cost-ignorant.
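As a toy illustration of this cost-ignorance, with invented numbers: a ranking computed from accuracy alone will always prefer the first model below, however lopsided the costs:

```python
# Invented numbers for illustration; only the ranking logic reflects the
# behaviour described above (value of predictions only, never their cost).
models = {
    "giant-sota": {"accuracy": 0.92, "params_millions": 175_000, "latency_ms": 900},
    "small-bert": {"accuracy": 0.88, "params_millions": 110,     "latency_ms": 15},
}

# A cost-ignorant leaderboard sorts by accuracy alone:
ranking = sorted(models, key=lambda name: -models[name]["accuracy"])
assert ranking[0] == "giant-sota"  # model size and latency never enter
```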

This means that a SOTA model can simultaneously provide high utility to a leaderboard and zero utility to a practitioner, by virtue of being too impractical to use. For some time, this was true of the 175 billion parameter GPT-3 (Brown et al., 2020), which achieved SOTA on several few-shot tasks, but whose sheer size precludes it from being fully reproduced by researchers. Even today, practitioners can only use GPT-3 through an API, access to which is restricted. The cost-ignorance of leaderboards disproportionately affects practitioners with fewer resources (e.g., independent researchers) (Rogers, 2019), since the resource demands would dwarf any utility from the model.

It should be noted that this limitation of leaderboards has not precluded the creation of cheaper models, given the real-world benefit of lower costs. For example, ELECTRA (Clark et al., 2020) can be trained up to several hundred times faster than traditional BERT-based models while performing comparably on GLUE. Similarly, DistilBERT is a distilled variant of BERT that is 40% smaller and 60% faster while retaining 97% of the language understanding (Sanh et al., 2019). There are many others like it as well (Zadeh and Moshovos, 2020; Hou et al., 2020; Mao et al., 2020). More efficiency and fewer parameters translate to lower costs.

Our point is not that there is no incentive at all to build cheaper models, but rather that this incentive is not baked into leaderboards, which are an important artefact of the NLP community. Because lower prediction costs improve practitioner utility, practitioners build cheaper models despite the lack of incentive from leaderboards. If lower prediction costs also improved leaderboard utility, then there would be more interest in creating them (Linzen, 2020; Rogers, 2019; Dodge et al., 2019). At the very least, making prediction costs publicly available would allow users to better estimate the utility that they will get from a model, given that the leaderboard's cost-ignorant ranking may be a poor proxy for their preferences.

3.3 Robustness

Leaderboard utility only depends on model rank, which in turn only depends on the model's performance on the test data. A typical leaderboard would gain no additional utility from a model that was robust to adversarial examples, generalized well to out-of-distribution data (Linzen, 2020), or was fair in a Rawlsian sense (i.e., by maximizing the welfare of the worst-off group) (Rawls, 2001; Hashimoto et al., 2018). In contrast, these are all attributes that NLP practitioners care about, particularly those who deploy systems in real-world applications. In fact, the literature on the lack of robustness in many SOTA models is extensive (Jia and Liang, 2017; Zhang et al., 2020).

There are many examples of state-of-the-art NLP models that were found to be brittle or biased. The question-answering dataset SQuAD 2.0 was created in response to the observation that existing systems could not reliably demur when presented with an unanswerable question (Rajpurkar et al., 2016, 2018). The perplexity of language models rises when given out-of-domain text (Oren et al., 2019). Many types of bias have also been found in NLP systems, with models performing better on gender-stereotypical inputs (Rudinger et al., 2018; Ethayarajh, 2020) and racial stereotypes being captured in embedding space (Manzini et al., 2019; Ethayarajh et al., 2019a,b; Ethayarajh, 2019). Moreover, repeated resubmissions allow a model's hyperparameters to be tuned to maximize performance, even on a private test set (Hardt, 2017).

Note that leaderboards do not necessarily incentivize the creation of brittle and biased models; rather, because leaderboard utility is so parochial, these unintended consequences are relatively common. Some recent work has addressed the problem of brittleness by offering certificates of performance against adversarial examples (Raghunathan et al., 2018a,b; Jia et al., 2019). To tackle gender bias, the SuperGLUE leaderboard considers accuracy on the WinoBias task (Wang et al., 2019; Zhao et al., 2018). Other work has proposed changes to prevent over-fitting via multiple resubmissions (Hardt and Blum, 2015; Hardt, 2017), while some have argued that this issue is overblown, given that question-answering systems submitted to the SQuAD leaderboard do not over-fit to the original test set (Miller et al., 2020). A novel approach even proposes using a dynamic benchmark instead of a static one, creating a moving target that is harder for models to overfit to (Nie et al., 2019).
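As one minimal, illustrative reading of the Rawlsian criterion above (our sketch, not a metric any of the leaderboards in question report), a model could be scored by the accuracy of its worst-off group:

```python
from collections import defaultdict

def worst_group_accuracy(predictions):
    """predictions: iterable of (group, is_correct) pairs, is_correct in {0, 1}.
    Returns the accuracy of the worst-off demographic group."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, is_correct in predictions:
        totals[group] += 1
        correct[group] += is_correct
    return min(correct[g] / totals[g] for g in totals)

# e.g., perfect accuracy on one group cannot mask 50% accuracy on another:
preds = [("g1", 1), ("g1", 1), ("g2", 1), ("g2", 0)]
assert worst_group_accuracy(preds) == 0.5
```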
4 The Future of Leaderboards

4.1 A Leaderboard for Every User

Given that each practitioner has their own utility function, models cannot be ranked in a way that satisfies everyone. Drawing inspiration from data statements (Bender and Friedman, 2018; Mitchell et al., 2019; Gebru et al., 2018), we instead recommend that leaderboards demand transparency and require the reporting of metrics that are relevant to practitioners, such as training time, model size, inference latency, and energy efficiency. Dodge et al. (2019) have suggested that model creators submit these statistics of their own accord, but without leaderboards requiring it, there would be no explicit incentive to do so. Although NLP workshops and conferences could also require this, their purview is limited: (1) models are often submitted to leaderboards before conferences; (2) leaderboards make these statistics easily accessible in one place.

Giving practitioners easy access to these statistics would permit them to estimate each model's utility to them and then re-rank accordingly. This could be made even easier by offering an interface that allows the user to change the weighting on each metric and then using the chosen weights to dynamically re-rank the models. In effect, every user would have their own leaderboard. Ideally, users would even have the option of filtering out models that do not meet their criteria (e.g., those above a certain parameter count).
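A minimal sketch of such an interface follows; the reported statistics, their scaling, and the weights are hypothetical placeholders for whatever a leaderboard actually reports:

```python
# Sketch (ours) of per-user re-ranking over reported statistics. All field
# names and numbers are hypothetical; each statistic is assumed to be
# rescaled so that higher is better (costs such as latency inverted first).
models = [
    {"name": "big-sota",   "accuracy": 0.92, "speed": 0.2, "compactness": 0.1},
    {"name": "light-lstm", "accuracy": 0.85, "speed": 0.9, "compactness": 0.8},
]

def rerank(models, weights, keep=lambda m: True):
    """Filter with the user's hard criteria, then sort by weighted score."""
    score = lambda m: sum(w * m[stat] for stat, w in weights.items())
    return sorted((m for m in models if keep(m)), key=score, reverse=True)

# A latency-sensitive user weighs speed heavily, yielding their own leaderboard:
mine = rerank(models, {"accuracy": 1.0, "speed": 2.0, "compactness": 1.0})
assert mine[0]["name"] == "light-lstm"
```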
Such transparency would have beneficial second-order effects as well. For example, reporting the costs of making predictions would put large institutions and poorly-resourced model creators on more equal footing (Rogers, 2019). This might motivate the creation of simpler methods whose ease-of-use makes up for weaker performance, such as weighted-average sentence embeddings (Arora et al., 2019; Ethayarajh, 2018). Even if a poorly-resourced creator could not afford to train the SOTA model du jour, they could at least compete on the basis of efficiency or create a minimally viable system that meets some desired threshold (Dodge et al., 2019; Dorr, 2011). Reporting the performance on the worst-off group, in the spirit of Rawlsian fairness (Rawls, 2001; Hashimoto et al., 2018), would also incentivize creators to improve worst-case performance.

4.2 A Leaderboard for Every Type of User

While each practitioner may have their own utility function, groups of practitioners – characterized by a shared goal – can be modelled with a single function. For example, programmers working on low-latency applications (e.g., multiplayer games) will place more value on latency than others. In contrast, researchers submitting their work to a conference may place more value on accuracy, given that a potential reviewer may reject a model that is not SOTA (Rogers, 2020). Although there is variance within any group, this approach is tractable when there are many points of consensus.

How might we go about creating a leaderboard for a specific type of user? As proposed in the previous subsection, one option is to offer an interface that allows the user to change the utility function dynamically. If we wanted to create a static leaderboard for a group of users, however, we would need to estimate their utility function. This could be done explicitly or implicitly. The explicit way would be to ask questions that use the dollar value as a proxy for cardinal utility: e.g., Given a 100M-parameter sentiment classifier with 200ms latency, how much would you pay per 1000 API calls? One could also try to estimate the derivative of the utility function with questions such as: How much would you pay to improve the latency from 200ms to 100ms, holding all else constant? When there are multiple metrics to consider, some assumptions would be needed to tractably estimate the function.

The implicit alternative to estimating this function is to record which models practitioners actually use and then fit a utility function that maximizes the utility of the observed models. This approach is rooted in revealed preference theory (Samuelson, 1948) – we assume that what practitioners use reveals their latent preferences. Exploiting revealed preferences may be difficult in practice, however, given that usage statistics for models are not often made public and the decision to use a model might not be made with complete information.
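To illustrate the implicit route, the sketch below fits a linear utility so that the models practitioners actually adopted outscore those they passed over; the linear form, the perceptron-style estimator, and the usage data are all illustrative assumptions, since no particular estimator is prescribed here:

```python
import numpy as np

def fit_linear_utility(chosen, rejected, lr=0.1, epochs=200):
    """Learn weights w so that w @ chosen_i > w @ rejected_i for each
    observed (adopted, passed-over) pair of model feature vectors."""
    w = np.zeros(chosen.shape[1])
    for _ in range(epochs):
        for c, r in zip(chosen, rejected):
            if w @ c <= w @ r:      # pair is misranked under current w
                w += lr * (c - r)   # nudge w toward the adopted model
    return w

# Feature columns: [accuracy, speed]; usage data invented for illustration.
chosen = np.array([[0.85, 0.9], [0.88, 0.8]])    # models actually deployed
rejected = np.array([[0.92, 0.1], [0.90, 0.2]])  # SOTA models passed over
w = fit_linear_utility(chosen, rejected)
# The fitted weights put far more value on speed than on raw accuracy,
# revealing the latent preferences of this (hypothetical) user group.
assert w[1] > abs(w[0])
```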

5 Conclusion

In this work, we offered several criticisms of leaderboard design in NLP. While the leaderboard paradigm has helped create more accurate models, we argued that this has been at the expense of fairness, efficiency, and robustness, among other desiderata. We were not the first to criticize NLP leaderboards (Rogers, 2019; Crane, 2018; Linzen, 2020), but we were the first to do so under a framework of utility, which we used to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. Given the diversity of NLP practitioners, there is no one-size-fits-all solution; rather, leaderboards should demand transparency, requiring the reporting of statistics that may be of practical concern. Equipped with these statistics, each user could then estimate the utility that each model provides to them and then re-rank accordingly, effectively creating a custom leaderboard for everyone.

Acknowledgments

Many thanks to Eugenia Rho, Dallas Card, Robin Jia, Urvashi Khandelwal, Nelson Liu, and Sidd Karamcheti for feedback. KE is supported by an NSERC PGS-D.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of SemEval@NAACL-HLT, pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of SemEval@COLING, pages 81–91.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2019. A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2017. Fairness in machine learning. NIPS Tutorial.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP.

Shikha Bordia and Samuel Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Matt Crane. 2018. Questionable answers in question answering research: Reproducibility and variability of published results. Transactions of the Association for Computational Linguistics, 6:241–252.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194.

Bonnie Dorr. 2011. Part 5: Machine Translation Evaluation. Springer Science & Business Media.

Kawin Ethayarajh. 2018. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 91–100.

Kawin Ethayarajh. 2019. Rotate king to get queen: Word relationships as orthogonal transformations in embedding space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3494–3499.

Kawin Ethayarajh. 2020. Is your classifier actually biased? Measuring fairness under uncertainty with Bernstein bounds. arXiv preprint arXiv:2004.12332.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019a. Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3253–3262, Florence, Italy. Association for Computational Linguistics.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019b. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Moritz Hardt. 2017. Climbing a shaky ladder: Better adaptive risk estimation. arXiv preprint arXiv:1706.02733.

Moritz Hardt and Avrim Blum. 2015. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 1006–1014.

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938.

Lu Hou, Lifeng Shang, Xin Jiang, and Qun Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. arXiv preprint arXiv:2004.04037.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4120–4133.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

N. Gregory Mankiw. 2020. Principles of Economics. Cengage Learning.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621.

Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Yaming Yang, Quanlu Zhang, Yunhai Tong, and Jing Bai. 2020. LadaBERT: Lightweight adaptation of BERT through hybrid model compression. arXiv preprint arXiv:2004.04124.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. 2020. The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

Yonatan Oren, Shiori Sagawa, Tatsunori Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4218–4228.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. 2018a. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.

Aditi Raghunathan, Jacob Steinhardt, and Percy S. Liang. 2018b. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877–10887.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

John Rawls. 2001. Justice as Fairness: A Restatement. Harvard University Press.

Anna Rogers. 2019. How the transformers broke NLP leaderboards. https://hackingsemantics.xyz/2019/leaderboards/. Accessed: 2020-05-20.

Anna Rogers. 2020. Peer review in NLP: reject-if-not-SOTA. https://hackingsemantics.xyz/2020/reviewing-models/. Accessed: 2020-05-20.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14.

Paul A. Samuelson. 1948. Consumption theory in terms of revealed preference. Economica, 15(60):243–253.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. CoRR, abs/1907.10597.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

Beth Sundheim. 1995. Overview of results of the MUC-6 evaluation. In MUC-6, pages 13–31, Columbia, MD.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, page 353.

Ali Hadi Zadeh and Andreas Moshovos. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. arXiv preprint arXiv:2005.03842.

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20.