Utility is in the Eye of the User: A Critique of NLP Leaderboards

Kawin Ethayarajh and Dan Jurafsky
Stanford University
[email protected]  [email protected]

Abstract

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards – in their current form – can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model's utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).

1 Introduction

The past few years have seen significant progress on a variety of NLP tasks, from question answering to machine translation. These advances have been driven in part by benchmarks such as GLUE (Wang et al., 2018), whose leaderboards rank models by how well they perform on these diverse tasks. Performance-based evaluation on a shared task is not a recent idea either; this sort of shared challenge has been an important driver of progress since MUC (Sundheim, 1995). While this paradigm has been successful at driving the creation of more accurate models, the historical focus on performance-based evaluation has been at the expense of other attributes valued by the NLP community, such as fairness and energy efficiency (Bender and Friedman, 2018; Strubell et al., 2019). For example, a highly inefficient model would have limited use in practical applications, but this would not preclude it from reaching the top of most leaderboards. Similarly, models can reach the top while containing racial and gender biases – and indeed, some have (Bordia and Bowman, 2019; Manzini et al., 2019; Rudinger et al., 2018; Blodgett et al., 2020).

Microeconomics provides a useful lens through which to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. We can frame both the leaderboard and NLP practitioners as consumers of models and the benefit they receive from a model as its utility to them. Although leaderboards are inanimate, this framing allows us to make an apples-to-apples comparison: if the priorities of leaderboards and practitioners are perfectly aligned, their utility functions should be identical; the less aligned they are, the greater the differences. For example, the utility of both groups is monotonic non-decreasing in accuracy, so a more accurate model is no less preferable to a less accurate one, holding all else constant. However, while the utility of practitioners is also sensitive to the size and efficiency of a model, the utility of leaderboards is not. By studying such differences, we formalize some of the limitations in contemporary leaderboard design:

1. Non-Smooth Utility: For a leaderboard, an improvement in model accuracy on a given task only increases utility when it also increases rank. For practitioners, any improvement in accuracy can increase utility.

2. Prediction Cost: Leaderboards treat the cost of making predictions (e.g., model size, energy efficiency, latency) as being zero, which does not hold in practice.

3. Robustness: Practitioners receive higher utility from a model that is more robust to adversarial perturbations, generalizes better to out-of-distribution data, and is equally fair to all demographics. However, these benefits would leave leaderboard utility unchanged.

We contextualize these limitations with examples from the ML fairness (Barocas et al., 2017; Hardt et al., 2016), Green AI (Strubell et al., 2019; Schwartz et al., 2019), and robustness literature (Jia and Liang, 2017). These three limitations are not comprehensive – other problems can also arise, which we leave to be discussed in future work.

What changes can we make to leaderboards so that their utility functions better reflect that of the NLP community at large? Given that each practitioner has their own preferences, there is no way to rank models so that everyone is satisfied. Instead, we suggest that leaderboards demand transparency, requiring the reporting of statistics that are of practical concern (e.g., model size, energy efficiency). This is akin to the use of data statements for mitigating bias in NLP systems (Gebru et al., 2018; Mitchell et al., 2019; Bender and Friedman, 2018). This way, practitioners can determine the utility they receive from a given model with relatively little effort. Dodge et al. (2019) have suggested that model creators take it upon themselves to report these statistics, but without leaderboards requiring it, there is little incentive to do so.

2 Utility Functions

In economics, the utility of a good denotes the benefit that a consumer receives from it (Mankiw, 2020). We specifically discuss the theory of cardinal utility, in which the amount of the good consumed can be mapped to a numerical value that quantifies its utility in utils (Mankiw, 2020). For example, a consumer might assign a value of 10 utils to two apples and 8 utils to one orange; we can infer both the direction and magnitude of the preference.

Leaderboards. We use the term leaderboard to refer to any ranking of models or systems using performance-based evaluation on a shared benchmark. In NLP, this includes both longstanding benchmarks such as GLUE (Wang et al., 2018) and one-off challenges such as the annual SemEval STS tasks (Agirre et al., 2013, 2014, 2015). This is not a recent idea either; this paradigm has been a driver of progress since MUC (Sundheim, 1995). All we assume is that all models are evaluated on the same held-out test data.

In our framework, leaderboards are consumers whose utility is solely derived from the rank of a model. Framing leaderboards as consumers is unorthodox, given that they are inanimate – in fact, it might seem more intuitive to say that leaderboards are another kind of product that is also consumed by practitioners. While that perspective is valid, what we ultimately care about is how good a proxy leaderboards are for practitioner preferences. Framing both leaderboards and practitioners as consumers permits an apples-to-apples comparison using their utility functions. If a leaderboard were only thought of as a product, it would not have such a function, precluding such a comparison.

Unlike most kinds of consumers, a leaderboard is a consumer whose preferences are perfectly revealed through its rankings: the state-of-the-art (SOTA) model is preferred to all others, the second-ranked model is preferred to all those below it, and so on. Put more formally, leaderboard utility is monotonic non-decreasing in rank. Still, because each consumer is unique, we cannot know the exact shape of a leaderboard utility function – only that it possesses this monotonicity.

NLP Practitioners. Practitioners are also consumers, but they derive utility from multiple properties of the model being consumed (e.g., accuracy, energy efficiency, latency). Each input into their utility function is some desideratum, but since each practitioner applies the model differently, the functions can be different. For example, someone may assign higher utility to BERT-Large (Devlin et al., 2019) and its 95% accuracy on some task, while another may assign higher utility to the smaller BERT-Base and its 90% accuracy. As with leaderboards, although the exact shapes of practitioner utility functions are unknown, we can infer that they are monotonic non-decreasing in each desideratum. For example, more compact models are more desirable, so increasing compactness while holding all else constant will never decrease utility.
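To state the two monotonicity properties compactly, one illustrative formalization is as follows; the notation is our own shorthand for the discussion above, not a standard formalism:

\[
u_{\text{lb}} = f(r), \qquad r < r' \;\Rightarrow\; f(r) \geq f(r'),
\]

where \(r \in \{1, 2, \dots\}\) is the model's rank (1 being SOTA), and

\[
u_{\text{pr}} = g(d_1, \dots, d_k), \qquad d_i \leq d_i' \;\Rightarrow\; g(\dots, d_i, \dots) \leq g(\dots, d_i', \dots),
\]

where each \(d_i\) is a desideratum such as accuracy or compactness, and the implication holds with all other arguments fixed.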

3 Utilitarian Critiques

Our criticisms apply regardless of the shape taken by practitioner utility functions – a necessity, given that the exact shapes are unknown. However, not every criticism applies to every leaderboard. StereoSet (Nadeem et al., 2020) is a leaderboard that ranks language models by how unbiased they are, so fairness-related criticisms would not apply as much to StereoSet. Similarly, the SNLI leaderboard (Bowman et al., 2015) reports the model size – a cost of making predictions – even if it does not factor this cost into the model ranking. Still, most of our criticisms apply to most leaderboards in NLP, and we provide examples of well-known leaderboards that embody each limitation.

3.1 Non-Smoothness of Utility

Leaderboards only gain utility from an increase in accuracy when it improves the model's rank. This is because, by definition, the leaderboard's preferences are perfectly revealed through its ranking – there is no model that can be preferred to another while having a lower rank. Since an increase in accuracy does not necessarily trigger an increase in rank, it does not necessarily trigger an increase in leaderboard utility either.

Put another way, the utility of leaderboards takes the form of a step function, meaning that it is not smooth with respect to accuracy. In contrast, the utility of practitioners is smooth with respect to accuracy. Holding all else constant, an increase in accuracy will yield some increase in utility, however small. Why this difference? The utility of leaderboards is a function of rank – it is only indirectly related to accuracy. On the other hand, the utility of practitioners is a direct function of accuracy, among other desiderata. To illustrate this difference, we contrast possible practitioner and leaderboard utility functions in Figure 1.

[Figure 1: A contrast of possible utility functions for leaderboards and NLP practitioners. For leaderboards, utility is not smooth with respect to accuracy; there is only an increase if the model is SOTA / #2 / #3 on the leaderboard. For practitioners, utility is smooth – any improvement in accuracy yields more utility. The utility functions can take on a variety of shapes but are always monotonic non-decreasing.]

This difference means that practitioners who are content with a less-than-SOTA model – as long as it is lightweight, perhaps – are under-served, while practitioners who want competitive-with-SOTA models are over-served. For example, on a given task, say that an n-gram baseline obtains an accuracy of 78%, an LSTM baseline obtains 81%, and a BERT-based SOTA obtains 92%. The leaderboard does not incentivize the creation of lightweight models that are ∼85% accurate, and as a result, few such models will be created. This is indeed the case with the SNLI leaderboard (Bowman et al., 2015), where most submitted models are highly parameterized and over 85% accurate. This incentive structure leaves those looking for a lightweight model with limited options. This lack of smaller, more energy-efficient models has been an impediment to the adoption of Green AI (Schwartz et al., 2019; Strubell et al., 2019). Although there are increasingly more lightweight and faster-to-train models – such as ELECTRA, on which accuracy and training time can be easily traded off (Clark et al., 2020) – their creation was not incentivized by a leaderboard, despite there being a demand for such models in the NLP community. A similar problem exists with incentivizing the creation of fair models, though the introduction of leaderboards such as StereoSet (Nadeem et al., 2020) is helping to bridge this divide.
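The step-versus-smooth contrast can also be seen in a toy computation; the accuracies and function shapes below are invented, constrained only to be monotonic non-decreasing as assumed above:

```python
# Toy illustration (ours): the numbers and function shapes are invented,
# constrained only by the monotonicity assumptions in the text.

def leaderboard_utility(rank: int) -> float:
    """A step function of rank alone (1 = SOTA); flat outside the top 3."""
    return {1: 3.0, 2: 2.0, 3: 1.0}.get(rank, 0.0)

def practitioner_utility(accuracy: float) -> float:
    """Smooth in accuracy: any gain yields some utility, however small."""
    return 10.0 * accuracy  # linear is just one admissible shape

def rank_of(accuracy: float, others: list) -> int:
    """Rank within a fixed pool of competitor accuracies (1 = best)."""
    return 1 + sum(o > accuracy for o in others)

pool = [0.92, 0.90, 0.88]  # hypothetical incumbent models
# Improving a lightweight baseline from 0.81 to 0.85 leaves it at rank 4,
# so leaderboard utility is unchanged while practitioner utility rises:
assert rank_of(0.81, pool) == rank_of(0.85, pool) == 4
before = leaderboard_utility(rank_of(0.81, pool))
after = leaderboard_utility(rank_of(0.85, pool))
assert after == before                                          # no rank change, no gain
assert practitioner_utility(0.85) > practitioner_utility(0.81)  # smooth gain
```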

3.2 Prediction Cost

Leaderboards of NLP benchmarks rank models by taking the average accuracy, F1 score, or exact match rate (Wang et al., 2018, 2019; McCann et al., 2018). In other words, they rank models purely by the value of their predictions; no consideration is given to the cost of making those predictions. We define 'cost' here chiefly as model size, energy efficiency, training time, and inference latency – essentially any sacrifice that needs to be made in order to use the model. In reality, no model is costless, yet leaderboards are cost-ignorant.
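As a toy illustration of this cost-ignorance, with invented numbers: a ranking computed from accuracy alone will always prefer the first model below, however lopsided the costs:

```python
# Invented numbers for illustration; only the ranking logic reflects the
# behaviour described above (value of predictions only, never their cost).
models = {
    "giant-sota": {"accuracy": 0.92, "params_millions": 175_000, "latency_ms": 900},
    "small-bert": {"accuracy": 0.88, "params_millions": 110,     "latency_ms": 15},
}

# A cost-ignorant leaderboard sorts by accuracy alone:
ranking = sorted(models, key=lambda name: -models[name]["accuracy"])
assert ranking[0] == "giant-sota"  # model size and latency never enter
```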

This means that a SOTA model can simultaneously provide high utility to a leaderboard and zero utility to a practitioner, by virtue of being too impractical to use. For some time, this was true of the 175 billion parameter GPT-3 (Brown et al., 2020), which achieved SOTA on several few-shot tasks, but whose sheer size precludes it from being fully reproduced by researchers. Even today, practitioners can only use GPT-3 through an API, access to which is restricted. The cost-ignorance of leaderboards disproportionately affects practitioners with fewer resources (e.g., independent researchers) (Rogers, 2019), since the resource demands would dwarf any utility from the model.

It should be noted that this limitation of leaderboards has not precluded the creation of cheaper models, given the real-world benefit of lower costs. For example, ELECTRA (Clark et al., 2020) can be trained up to several hundred times faster than traditional BERT-based models while performing comparably on GLUE. Similarly, DistilBERT is a distilled variant of BERT that is 40% smaller and 60% faster while retaining 97% of the language understanding (Sanh et al., 2019). There are many others like it as well (Zadeh and Moshovos, 2020; Hou et al., 2020; Mao et al., 2020). More efficiency and fewer parameters translate to lower costs.

Our point is not that there is no incentive at all to build cheaper models, but rather that this incentive is not baked into leaderboards, which are an important artefact of the NLP community. Because lower prediction costs improve practitioner utility, practitioners build cheaper models despite the lack of incentive from leaderboards. If lower prediction costs also improved leaderboard utility, then there would be more interest in creating them (Linzen, 2020; Rogers, 2019; Dodge et al., 2019). At the very least, making prediction costs publicly available would allow users to better estimate the utility that they will get from a model, given that the leaderboard's cost-ignorant ranking may be a poor proxy for their preferences.

3.3 Robustness

Leaderboard utility only depends on model rank, which in turn only depends on the model's performance on the test data. A typical leaderboard would gain no additional utility from a model that was robust to adversarial examples, generalized well to out-of-distribution data (Linzen, 2020), or was fair in a Rawlsian sense (i.e., by maximizing the welfare of the worst-off group) (Rawls, 2001; Hashimoto et al., 2018). In contrast, these are all attributes that NLP practitioners care about, particularly those who deploy systems in real-world applications. In fact, the literature on the lack of robustness in many SOTA models is extensive (Jia and Liang, 2017; Zhang et al., 2020).

There are many examples of state-of-the-art NLP models that were found to be brittle or biased. The question-answering dataset SQuAD 2.0 was created in response to the observation that existing systems could not reliably demur when presented with an unanswerable question (Rajpurkar et al., 2016, 2018). The perplexity of language models rises when given out-of-domain text (Oren et al., 2019). Many types of bias have also been found in NLP systems, with models performing better on gender-stereotypical inputs (Rudinger et al., 2018; Ethayarajh, 2020) and racial stereotypes being captured in embedding space (Manzini et al., 2019; Ethayarajh et al., 2019a,b; Ethayarajh, 2019). Moreover, repeated resubmissions allow a model's hyperparameters to be tuned to maximize performance, even on a private test set (Hardt, 2017).

Note that leaderboards do not necessarily incentivize the creation of brittle and biased models; rather, because leaderboard utility is so parochial, these unintended consequences are relatively common. Some recent work has addressed the problem of brittleness by offering certificates of performance against adversarial examples (Raghunathan et al., 2018a,b; Jia et al., 2019). To tackle gender bias, the SuperGLUE leaderboard considers accuracy on the WinoBias task (Wang et al., 2019; Zhao et al., 2018). Other work has proposed changes to prevent over-fitting via multiple resubmissions (Hardt and Blum, 2015; Hardt, 2017), while some have argued that this issue is overblown, given that question-answering systems submitted to the SQuAD leaderboard do not over-fit to the original test set (Miller et al., 2020). A novel approach even proposes using a dynamic benchmark instead of a static one, creating a moving target that is harder for models to overfit to (Nie et al., 2019).
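As one minimal, illustrative reading of the Rawlsian criterion above (our sketch, not a metric any of the leaderboards in question report), a model could be scored by the accuracy of its worst-off group:

```python
from collections import defaultdict

def worst_group_accuracy(predictions):
    """predictions: iterable of (group, is_correct) pairs, is_correct in {0, 1}.
    Returns the accuracy of the worst-off demographic group."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, is_correct in predictions:
        totals[group] += 1
        correct[group] += is_correct
    return min(correct[g] / totals[g] for g in totals)

# e.g., perfect accuracy on one group cannot mask 50% accuracy on another:
preds = [("g1", 1), ("g1", 1), ("g2", 1), ("g2", 0)]
assert worst_group_accuracy(preds) == 0.5
```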
4 The Future of Leaderboards

4.1 A Leaderboard for Every User

Given that each practitioner has their own utility function, models cannot be ranked in a way that satisfies everyone. Drawing inspiration from data statements (Bender and Friedman, 2018; Mitchell et al., 2019; Gebru et al., 2018), we instead recommend that leaderboards demand transparency and require the reporting of metrics that are relevant to practitioners, such as training time, model size, inference latency, and energy efficiency. Dodge et al. (2019) have suggested that model creators submit these statistics of their own accord, but without leaderboards requiring it, there would be no explicit incentive to do so. Although NLP workshops and conferences could also require this, their purview is limited: (1) models are often submitted to leaderboards before conferences; (2) leaderboards make these statistics easily accessible in one place.

Giving practitioners easy access to these statistics would permit them to estimate each model's utility to them and then re-rank accordingly. This could be made even easier by offering an interface that allows the user to change the weighting on each metric and then using the chosen weights to dynamically re-rank the models. In effect, every user would have their own leaderboard. Ideally, users would even have the option of filtering out models that do not meet their criteria (e.g., those above a certain parameter count).
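A minimal sketch of such an interface follows; the reported statistics, their scaling, and the weights are hypothetical placeholders for whatever a leaderboard actually reports:

```python
# Sketch (ours) of per-user re-ranking over reported statistics. All field
# names and numbers are hypothetical; each statistic is assumed to be
# rescaled so that higher is better (costs such as latency inverted first).
models = [
    {"name": "big-sota",   "accuracy": 0.92, "speed": 0.2, "compactness": 0.1},
    {"name": "light-lstm", "accuracy": 0.85, "speed": 0.9, "compactness": 0.8},
]

def rerank(models, weights, keep=lambda m: True):
    """Filter with the user's hard criteria, then sort by weighted score."""
    score = lambda m: sum(w * m[stat] for stat, w in weights.items())
    return sorted((m for m in models if keep(m)), key=score, reverse=True)

# A latency-sensitive user weighs speed heavily, yielding their own leaderboard:
mine = rerank(models, {"accuracy": 1.0, "speed": 2.0, "compactness": 1.0})
assert mine[0]["name"] == "light-lstm"
```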
Such transparency would have beneficial second-order effects as well. For example, reporting the costs of making predictions would put large institutions and poorly-resourced model creators on more equal footing (Rogers, 2019). This might motivate the creation of simpler methods whose ease-of-use makes up for weaker performance, such as weighted-average sentence embeddings (Arora et al., 2019; Ethayarajh, 2018). Even if a poorly-resourced creator could not afford to train the SOTA model du jour, they could at least compete on the basis of efficiency or create a minimally viable system that meets some desired threshold (Dodge et al., 2019; Dorr, 2011). Reporting the performance on the worst-off group, in the spirit of Rawlsian fairness (Rawls, 2001; Hashimoto et al., 2018), would also incentivize creators to improve worst-case performance.

4.2 A Leaderboard for Every Type of User

While each practitioner may have their own utility function, groups of practitioners – characterized by a shared goal – can be modelled with a single function. For example, programmers working on low-latency applications (e.g., multiplayer games) will place more value on latency than others. In contrast, researchers submitting their work to a conference may place more value on accuracy, given that a potential reviewer may reject a model that is not SOTA (Rogers, 2020). Although there is variance within any group, this approach is tractable when there are many points of consensus.

How might we go about creating a leaderboard for a specific type of user? As proposed in the previous subsection, one option is to offer an interface that allows the user to change the utility function dynamically. If we wanted to create a static leaderboard for a group of users, however, we would need to estimate their utility function. This could be done explicitly or implicitly. The explicit way would be to ask questions that use the dollar value as a proxy for cardinal utility: e.g., Given a 100M-parameter sentiment classifier with 200ms latency, how much would you pay per 1000 API calls? One could also try to estimate the derivative of the utility function with questions such as: How much would you pay to improve the latency from 200ms to 100ms, holding all else constant? When there are multiple metrics to consider, some assumptions would be needed to tractably estimate the function.

The implicit alternative to estimating this function is to record which models practitioners actually use and then fit a utility function that maximizes the utility of the observed models. This approach is rooted in revealed preference theory (Samuelson, 1948) – we assume that what practitioners use reveals their latent preferences. Exploiting revealed preferences may be difficult in practice, however, given that usage statistics for models are not often made public and the decision to use a model might not be made with complete information.
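To illustrate the implicit route, the sketch below fits a linear utility so that the models practitioners actually adopted outscore those they passed over; the linear form, the perceptron-style estimator, and the usage data are all illustrative assumptions, since no particular estimator is prescribed here:

```python
import numpy as np

def fit_linear_utility(chosen, rejected, lr=0.1, epochs=200):
    """Learn weights w so that w @ chosen_i > w @ rejected_i for each
    observed (adopted, passed-over) pair of model feature vectors."""
    w = np.zeros(chosen.shape[1])
    for _ in range(epochs):
        for c, r in zip(chosen, rejected):
            if w @ c <= w @ r:      # pair is misranked under current w
                w += lr * (c - r)   # nudge w toward the adopted model
    return w

# Feature columns: [accuracy, speed]; usage data invented for illustration.
chosen = np.array([[0.85, 0.9], [0.88, 0.8]])    # models actually deployed
rejected = np.array([[0.92, 0.1], [0.90, 0.2]])  # SOTA models passed over
w = fit_linear_utility(chosen, rejected)
# The fitted weights put far more value on speed than on raw accuracy,
# revealing the latent preferences of this (hypothetical) user group.
assert w[1] > abs(w[0])
```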

5 Conclusion

In this work, we offered several criticisms of leaderboard design in NLP. While the leaderboard paradigm has helped create more accurate models, we argued that this has been at the expense of fairness, efficiency, and robustness, among other desiderata. We were not the first to criticize NLP leaderboards (Rogers, 2019; Crane, 2018; Linzen, 2020), but we were the first to do so under a framework of utility, which we used to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. Given the diversity of NLP practitioners, there is no one-size-fits-all solution; rather, leaderboards should demand transparency, requiring the reporting of statistics that may be of practical concern. Equipped with these statistics, each user could then estimate the utility that each model provides to them and then re-rank accordingly, effectively creating a custom leaderboard for everyone.

Acknowledgments

Many thanks to Eugenia Rho, Dallas Card, Robin Jia, Urvashi Khandelwal, Nelson Liu, and Sidd Karamcheti for feedback. KE is supported by an NSERC PGS-D.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of SemEval@NAACL-HLT, pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of SemEval@COLING, pages 81–91.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2019. A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017.

Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2017. Fairness in machine learning. NIPS Tutorial.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP.

Shikha Bordia and Samuel Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Matt Crane. 2018. Questionable answers in question answering research: Reproducibility and variability of published results. Transactions of the Association for Computational Linguistics, 6:241–252.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194.

Bonnie Dorr. 2011. Part 5: Machine Translation Evaluation. Springer Science & Business Media.

Kawin Ethayarajh. 2018. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 91–100.

Kawin Ethayarajh. 2019. Rotate king to get queen: Word relationships as orthogonal transformations in embedding space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3494–3499.

Kawin Ethayarajh. 2020. Is your classifier actually biased? Measuring fairness under uncertainty with Bernstein bounds. arXiv preprint arXiv:2004.12332.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019a. Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3253–3262, Florence, Italy. Association for Computational Linguistics.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019b. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Moritz Hardt. 2017. Climbing a shaky ladder: Better adaptive risk estimation. arXiv preprint arXiv:1706.02733.

Moritz Hardt and Avrim Blum. 2015. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 1006–1014.

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938.

Lu Hou, Lifeng Shang, Xin Jiang, and Qun Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. arXiv preprint arXiv:2004.04037.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4120–4133.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

N. Gregory Mankiw. 2020. Principles of Economics. Cengage Learning.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621.

Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Yaming Yang, Quanlu Zhang, Yunhai Tong, and Jing Bai. 2020. LadaBERT: Lightweight adaptation of BERT through hybrid model compression. arXiv preprint arXiv:2004.04124.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. 2020. The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

Yonatan Oren, Shiori Sagawa, Tatsunori Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4218–4228.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. 2018a. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.

Aditi Raghunathan, Jacob Steinhardt, and Percy S. Liang. 2018b. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877–10887.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

John Rawls. 2001. Justice as Fairness: A Restatement. Harvard University Press.

Anna Rogers. 2019. How the transformers broke NLP leaderboards. https://hackingsemantics.xyz/2019/leaderboards/. Accessed: 2020-05-20.

Anna Rogers. 2020. Peer review in NLP: reject-if-not-SOTA. https://hackingsemantics.xyz/2020/reviewing-models/. Accessed: 2020-05-20.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14.

Paul A. Samuelson. 1948. Consumption theory in terms of revealed preference. Economica, 15(60):243–253.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. CoRR, abs/1907.10597.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

Beth Sundheim. 1995. Overview of results of the MUC-6 evaluation. In MUC-6, pages 13–31, Columbia, MD.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, page 353.

Ali Hadi Zadeh and Andreas Moshovos. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. arXiv preprint arXiv:2005.03842.

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20.