arXiv:2004.04123v2 [cs.CL] 13 Jan 2021

Entity-Switched Datasets: An Approach to Auditing the In-Domain Robustness of Named Entity Recognition Models

Oshin Agarwal, University of Pennsylvania, [email protected]
Yinfei Yang, Google AI, [email protected]
Byron C. Wallace, Northeastern University, [email protected]
Ani Nenkova, University of Pennsylvania, [email protected]

Abstract

Named entity recognition (NER) systems perform well on standard datasets comprising English news. But given the paucity of data, it is difficult to draw conclusions about the robustness of systems with respect to recognizing a diverse set of entities. We propose a method for auditing the in-domain robustness of systems, focusing specifically on differences in performance due to the national origin of entities. We create entity-switched datasets,[1] in which named entities in the original texts are replaced by plausible named entities of the same type but of different national origin. We find that the performance of state-of-the-art systems varies widely even in-domain: In the same context, entities from certain origins are more reliably recognized than entities from elsewhere. This auditing approach can facilitate the development of more robust NER systems, and will allow research in this area to consider fairness criteria that have received heightened attention in other predictive tasks.

[1] The datasets are available at https://github.com/oagarwal/entity-switched-ner. Code for the same will be released as well.

1 Introduction

Named Entity Recognition (NER) systems work well on domains such as English news, achieving high performance on standard datasets such as MUC-6 (Grishman and Sundheim, 1996), CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Pradhan and Xue, 2009). Research in other areas of predictive technology has revealed, however, that ostensibly strong predictive performance may obscure wide variation in performance for certain types of data. For example, gender recognition systems attain high accuracy on what used to be considered standard datasets for this task, but have large error rates on people with dark skin tone, particularly on women with dark skin (Buolamwini and Gebru, 2018). Language identification is also highly accurate on standard datasets (Zhang et al., 2018) but may fail to recognize dialects, e.g., failing to identify African American English as English (Blodgett et al., 2016).

Here, we set out to develop methods for testing two intertwined properties of NER models: (i) their robustness on a variety of entities, and (ii) their relative performance across groups, which here correspond to national origin. It is known that the robustness of NER methods depends on entities being represented in the training data (Augenstein et al., 2017). Probing their performance through the lens of national origin is then one way to question the choice of training data, and what a system can learn from it. This has important fairness implications: Our auditing approach lets one examine the systems by entity group, e.g., ethnic groups within a country of origin, in line with guidelines for model card reporting of system strengths and weaknesses (Mitchell et al., 2019).

We propose a diagnostic to evaluate the in-domain robustness of NER models by expanding existing datasets programmatically. These generated datasets have not only a diverse set of entities but also the same entity in a variety of contexts. This will allow for a more robust and large scale evaluation of NER. We evaluate state-of-the-art systems on these entity-switched datasets and find that they have highest performance (F1) on American and Indian entities, and lowest performance on Vietnamese and Indonesian entities.

2 Entity-Switched Datasets

To create entity-switched datasets, we replace entities in existing datasets with entities from various countries while retaining the rest of the text and maintaining its coherence. An example is shown in Table 1, where the original entities are replaced by Indian ones.

Example 1
  Original sentence: Defender Hassan Abbas [PER] rose to intercept a ...
  New sentence:      Defender Ritwika Tomar [PER] rose to intercept a ...

Example 2
  Original sentence: The Democratic Convention signed an agreement on government and parliamentary support with its coalition partners the Social Democratic Union [ORG] and the Hungarian Democratic Union [ORG].
  New sentence:      The Democratic Convention signed an agreement on government and parliamentary support with its coalition partners the Jharkhand Mukti Morcha [ORG] and the Mizo National Front [ORG].

Table 1: Example of switching entities. Here the original entities are replaced by Indian ones.

While some types of entities (such as persons) can be readily replaced, other entities (such as organizations) require more care to maintain the coherence of the text. For this reason, we use two versions of the datasets. In one, we replace all entities; in the other, we replace only PER.

Perturbation techniques have also been used by Prabhakaran et al. (2019) to study the impact of person names on toxicity in online comments by substituting in names of controversial personalities, by Emami et al. (2019) to make coreference resolution robust to gender and number cues by making both antecedent and candidate of the same type, by Subramanian and Roth (2019) for adversarial training for coreference resolution of person and location entities, by Raiman and Miller (2017) to augment question-answering datasets, and by Kaushik et al. (2019) to create examples manually perturbed by experts for sentiment analysis and natural language inference. However, there is no large scale dataset available for the evaluation of NER on ethnically diverse entities.

2.1 Replacing PER entities

In the first group of datasets, we change only names of people. We replace all PER entities in the test set with the same string. For example, all sequences of PER might be replaced with the name ‘John Smith’, or the name ‘Marijuana Pepsi’.[2] Names for replacement are drawn from lists of popular names in the countries with the largest populations. This allows us to examine system performance with respect to country of origin, and also in terms of the number of people whose names would be potentially affected by recognition failures. Specifically, we selected the 15 most populous countries and found 20 common first names and 10 common family names for each.[3] We used two sets of Chinese names, from mainland China and Taiwan, respectively. We also use two sets of U.S. names: the first comprising common names (e.g., John Smith) and the second composed of Native American and African American names, and names that could be locations (e.g., Madison) or regular words (e.g., Brown).

The first names used in the experiment are a mix of male, female, and unisex names. We create full names by matching each first name to a random family name. For some countries, names have additional constraints that we account for. In Indonesia,[4] names might include only a single name or multiple first names (without a family name). In Pakistan,[5] some female names have a first name followed by the father/husband’s most called name.

We replace all PER spans in the test set with a single name.[6] We attempt to be consistent in the replacement, i.e., full names are replaced by full names, first names by first names, and last names by last names. We treat space-separated multi-word names as full names. We take a Western-centric view and consider the first word to be a first name and the remaining words to be the last name, and determine if other occurrences are first or last names by string matching. If a multi-word name is part of a longer name, we do not break it down and replace it based on the longer name it matches.

We use the English CoNLL’03 dataset and calculate F1 for each country, replacing every PER entity in the test set with each of the 20 names in turn and taking the average over the 20 versions of the dataset. We evaluate the word-based biLSTM-CRF (Huang et al., 2015), the word and character-based biLSTM-CRF (Lample et al., 2016) and BERT (Devlin et al., 2019). For the first two, we used 300-d cased GloVe (Pennington et al., 2014) vectors trained on Common Crawl.[7] For BERT, we use the public large uncased model[8] and apply the default fine-tuning strategy. We use IO labeling and evaluate systems using micro-F1 at the token level.

We present the results in Table 2. All the models achieve higher F1 on typical American, Russian, Indian and Mexican names than on the original dataset (Super recall rows). Precision remains the same but recall improves to almost perfect.

For the GloVe models, performance drops by up to ∼10 points F1 for certain countries (Poor recall rows). Names from Indonesia and Vietnam fare the worst, along with the difficult US names and names from Taiwan, with small degradation of precision and a precipitous drop of recall. BERT exhibits a similar pattern, with stable precision and varying recall, which remains above 84% for all name origins. The names with the highest and lowest F1 with the model of Huang et al. (2015) are shown in Table 4.

Notably, BERT has the lowest F1 of the three systems on names from Ethiopia, Nigeria, and the Philippines (see BERT not best rows). In light of these findings, one might wonder if accepting current architectures trained on standard corpora as state-of-the-art is the NER equivalent of developments in photography, which was optimized for perfect exposure of white skin, and which is the assumed technical reason for many failures of computer vision applications when applied to dark skinned people (Benjamin, 2019). At the very least, practitioners should be cognizant of these systematic performance differences.

BERT and character LSTM-CRF results are higher and more stable, but it is nevertheless clear that one need not perform a completely out-of-domain test to observe deteriorating performance; changing the name strings is sufficient. We perform similar tests on the newswire section of OntoNotes. We use the original train and test splits, where we have switched PER in the latter. We use names from India and Vietnam because these were in the top and bottom performing entity-switched sets, respectively, for CoNLL data. We observe a similar drop in performance.

[2] https://tinyurl.com/yxdhupr9
[3] Sources include: Wikipedia, websites with baby names, and websites listing popular names.
[4] https://en.wikipedia.org/wiki/Indonesian_names
[5] https://en.wikipedia.org/wiki/Pakistani_name
[6] Training data is unchanged, as our goal is to quantify the robustness of identifying various names.
[7] http://nlp.stanford.edu/data/glove.840B.zip
[8] Uncased performed better than cased.

                   Huang et al. (2015)    Lample et al. (2016)   Devlin et al. (2019)
                   GloVe words            GloVe words+chars      BERT subwords
                   P      R      F1       P      R      F1       P      R      F1
Original testset   96.9   96.5   96.7     97.1   98.1   97.6     98.3   98.1   98.2
Super recall
  US               96.9   99.6   98.2     96.9   99.6   98.3     98.4   99.7   99.1
  Russia           96.8   99.5   98.1     97.1   99.8   98.4     98.4   99.3   98.9
  India            96.5   99.5   98.0     97.1   99.3   98.2     98.4   98.8   98.6
  Mexico           96.7   98.9   97.8     97.1   98.9   98.0     98.4   99.2   98.8
Poor recall
  China-Taiwan     95.4   93.2   93.9     97.0   94.9   95.6     98.3   92.0   94.8
  US (Difficult)   95.9   87.4   90.2     96.6   87.9   90.7     98.1   88.5   92.3
  Indonesia        95.3   84.6   88.7     96.5   91.0   93.3     97.8   85.8   92.0
  Vietnam          94.6   78.2   84.2     96.0   78.5   84.5     98.0   84.2   89.8
BERT not best
  Ethiopia         96.5   96.8   96.6     96.6   98.6   97.9     98.3   90.6   94.1
  Nigeria          96.3   92.2   94.1     97.1   96.6   96.8     98.2   90.2   93.8
  Philippines      97.3   97.9   97.5     97.5   98.9   98.2     98.6   94.7   96.4
Other
  Bangladesh       96.7   97.5   97.1     97.1   97.6   97.3     98.4   97.8   98.0
  Brazil           96.6   96.8   96.6     97.1   96.2   96.5     98.4   96.7   97.5
  China-Mainland   95.7   97.9   96.7     97.0   97.4   97.2     98.4   96.7   97.5
  Egypt            96.6   99.2   97.8     97.0   98.2   97.6     98.4   97.4   97.9
  Japan            96.7   97.2   96.8     97.0   98.7   97.8     98.5   99.0   98.7
  Pakistan         96.2   92.6   94.1     97.0   96.5   96.6     98.3   95.3   96.7

Table 2: Performance of systems on PER entity of CoNLL 03 test data. Original refers to the unchanged data. The rest of the rows are averages over 20 names typical for each country.

                   Huang et al. (2015)    Lample et al. (2016)   Devlin et al. (2019)
                   GloVe words            GloVe words+chars      BERT subwords
                   P      R      F1       P      R      F1       P      R      F1
Original           94.7   95.6   95.2     97.5   95.0   96.2     97.0   96.8   96.9
India              94.2   95.5   94.8     97.0   95.7   96.2     96.3   96.9   96.6
Vietnam            93.1   82.3   85.8     96.3   82.3   86.9     96.5   85.2   90.5

Table 3: Performance of systems on PER entity of OntoNotes newswire test data. Original refers to the unchanged data. The rest of the rows report averages over 20 names typical for each country.

Highest performance
  Name                 Country            F1
  Jose Mari Andrada    Philippines        98.8
  Chris Collins        U.S.               98.4
  Alex Mikhailov       Russia             98.4
  Alejandro Garcia     Mexico             98.4

Lowest performance
  Name                 Country            F1
  Trinity Washington   U.S. (Difficult)   37.8
  My On                Vietnam            37.9
  Thien Thai           Vietnam            62.5
  Elaf Zahaar          Pakistan           69.9

Table 4: Names on which (Huang et al., 2015) achieved highest and lowest F1 scores.
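The consistent replacement of PER mentions described in Section 2.1 can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function names `per_spans` and `switch_per` are hypothetical, sentences are assumed to be pre-tokenized with IO labels, and the fallback of treating an unmatched single-word mention as a last name is an assumption the text does not specify. The sketch also omits the rule for multi-word names that are part of a longer name.

```python
from typing import List, Tuple


def per_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs of maximal runs of PER tokens (IO labels)."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if label == "PER" and start is None:
            start = i
        elif label != "PER" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def switch_per(tokens: List[str], labels: List[str],
               new_first: str, new_last: str) -> Tuple[List[str], List[str]]:
    """Replace every PER span with the given name, keeping full/first/last roles."""
    spans = per_spans(labels)

    # Western-centric view: the first word of a multi-word mention is a first name.
    old_firsts = {tokens[start] for start, end in spans if end - start > 1}

    out_tokens, out_labels, prev = [], [], 0
    for start, end in spans:
        out_tokens += tokens[prev:start]
        out_labels += labels[prev:start]
        if end - start > 1:                      # full name -> full name
            replacement = new_first.split() + new_last.split()
        elif tokens[start] in old_firsts:        # string match: first name
            replacement = new_first.split()
        else:                                    # assumption: otherwise a last name
            replacement = new_last.split()
        out_tokens += replacement
        out_labels += ["PER"] * len(replacement)
        prev = end
    out_tokens += tokens[prev:]
    out_labels += labels[prev:]
    return out_tokens, out_labels


if __name__ == "__main__":
    toks = ["Defender", "Hassan", "Abbas", "rose", ".", "Abbas", "scored", "."]
    labs = ["O", "PER", "PER", "O", "O", "PER", "O", "O"]
    print(switch_per(toks, labs, "Ritwika", "Tomar"))
```

Running the same function once per candidate name yields the 20 test-set versions per country that the averages in Table 2 are computed over.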

                   Huang et al. (2015)    Lample et al. (2016)   Devlin et al. (2019)
                   GloVe words            GloVe words+chars      BERT subwords
                   P      R      F1       P      R      F1       P      R      F1
Original           90.9   91.4   91.2     90.9   92.6   91.7     95.5   93.3   94.4
India              84.3   77.3   80.7     83.8   82.9   83.3     95.6   87.8   91.5
Vietnam            74.3   73.0   73.6     77.9   76.4   77.2     96.2   81.5   88.2

Table 5: Performance of systems on all entity types in the CoNLL 03 test data.
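The precision, recall and F1 columns in these tables are token-level micro scores under IO labeling. A minimal sketch of one way to compute such scores is given below. The function names and the exact counting convention (a token counts as a true positive when gold and predicted labels are the same non-O label) are assumptions for illustration; the paper does not spell out its scoring script. Scores are returned as fractions rather than percentages.

```python
from statistics import mean
from typing import Iterable, List, Optional, Set, Tuple

Sentences = List[List[str]]  # one list of IO labels per sentence


def token_micro_prf(gold: Sentences, pred: Sentences,
                    types: Optional[Set[str]] = None) -> Tuple[float, float, float]:
    """Token-level micro precision, recall and F1.

    If `types` is given (e.g. {"PER"}), only those labels count as positives.
    """
    tp = n_gold = n_pred = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for g, p in zip(gold_sent, pred_sent):
            g_pos = g != "O" and (types is None or g in types)
            p_pos = p != "O" and (types is None or p in types)
            n_gold += int(g_pos)
            n_pred += int(p_pos)
            if g_pos and p_pos and g == p:
                tp += 1
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_f1(gold_versions: Iterable[Sentences],
               pred_versions: Iterable[Sentences],
               types: Optional[Set[str]] = None) -> float:
    """Mean F1 over several entity-switched versions of the same test set."""
    return mean(token_micro_prf(g, p, types)[2]
                for g, p in zip(gold_versions, pred_versions))
```

For the PER-only experiments one would call `token_micro_prf(gold, pred, types={"PER"})` on each of the 20 versions and average, while the all-entity results correspond to `types=None`.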

We hypothesize that this drop in performance (Table 3) is due to the names that systems have seen during pre-training and training. Based on a similar hypothesis, Derczynski et al. (2017) collected a dataset with no overlap in train and test set entities.

2.2 Replacing other entity types

We construct two more datasets derived from the original English CoNLL test set. In these we replace all PER, LOC and ORG instances with entities of the same type and from a particular country. We do not replace MISC entities, because these are not usually country-specific. In one dataset, we replace all entities with corresponding entities of Indian origin; in the other, with entities of Vietnamese origin. Unlike in the above dataset, where we replaced every entity of type PER with the same name throughout, we perform a stochastic (though type-constrained) mapping from the original set of entities to the new entity set. That is, when replacing a target entity e of type t, we sample the entity with which to replace it i.i.d. at random from the entities of type t in the set of country-specific entities.

We select PER names as in the previous section. For each document, we then generate a list of possible names by which an entity was referred to, using string matching, and replace each entity with a new name as consistently as possible, as in the previous section. For LOC, we select names of villages, cities, and provinces and again select a location of the same type from the country-specific list in a document.

Consistently replacing ORG entities is more complicated than replacing PER and LOC entities, because not every organization would be suitable for every context. We cannot, for instance, reasonably replace ‘Bank of America’ in ‘We withdrew money from the Bank of America’ with ‘New York Times’ or ‘Mayo Clinic’. For this reason, we divided organizations into sub-categories, selected candidates for each category from similar websites as above, labelled test ORG with the sub-category manually, and then replaced them with a country-specific ORG of the same sub-category, again being consistent within a document based on string matching. The sub-categories we used are: Airline, Bank, Corporation, Newspaper, Political Party, Restaurant, Sports Team, Sports Union, University, Others. Others included international or intergovernmental organizations such as United Nations and we did not replace these.

We observe a drop in performance on both datasets (Table 5). Both precision and recall drop for the GloVe systems. However, for BERT, the precision remains the same and only the recall drops.

2.3 Error Analysis

We show some examples of the replacement and predictions by the model of Huang et al. (2015) in Table 6. They show that systems likely ignore predictive contexts and assign labels based mostly on the word identity. In row 1, both entities should be of the same type (LOC or PER); however, for Indian entities, one is predicted as PER and one as LOC; for Vietnamese, one is predicted as PER and the other is not even recognized as an entity. Row 2 is indeed hard, and a known word identity would be easier to detect in this case. Row 3 is a common pattern ‘LOC DATE’ from the training data, but the systems do not recognize these for the switched datasets. In fact, for Vietnamese it even causes the date to be predicted as a LOC.

Row 1
  True labels: Japan [LOC] began the defence of their Asian Cup [MISC] title with a lucky 2-1 win against Syria [LOC] in a Group C championship match on Friday.
  Original:    Japan [LOC] began the defence of their Asian Cup [MISC] title with a lucky 2-1 win against Syria [LOC] in a Group C championship match on Friday.
  Indian:      Dhanbad [LOC] began the defence of their Asian Cup [MISC] title with a lucky 2-1 win against Thungapuram [PER] in a Group C championship match on Friday.
  Vietnamese:  Long An [O] began the defence of their Asian Cup [MISC] title with a lucky 2-1 win against Bac Lieu [PER] in a Group C championship match on Friday.

Row 2
  True labels: Nader Jokhadar [PER] had given Syria [LOC] the lead with a well-struck header in the seventh minute.
  Original:    Nader Jokhadar [PER] had given Syria [LOC] the lead with a well-struck header in the seventh minute.
  Indian:      Priya Khemka [LOC] had given Thungapuram [O] the lead with a well-struck header in the seventh minute.
  Vietnamese:  Thien Hue [LOC] had given Bac Lieu [PER] the lead with a well-struck header in the seventh minute.

Row 3
  True labels: ROME [LOC] 1996-12-06
  Original:    ROME [LOC] 1996-12-06
  Indian:      Thevaiyur [O] 1996-12-06
  Vietnamese:  Ha Long Bay 1996-12-06 [LOC]

Row 4
  True labels: SOCCER - FIFA [ORG] BOSS HAVELANGE [PER] STANDS BY WEAH [PER].
  Original:    SOCCER - FIFA [ORG] BOSS HAVELANGE [PER] STANDS BY WEAH [O].
  Indian:      SOCCER - Judo Federation [ORG] of [O] India [LOC] BOSS Dheeraj Uniyal [PER] STANDS BY Anjali Lal [PER].
  Vietnamese:  SOCCER - Vietnam Football Federation [LOC] BOSS Thu [O] Thai [MISC] STANDS BY Duc Tan [PER].

Row 5
  True labels: In an interview with the Italian [MISC] newspaper Gazzetta dello Sport [ORG], he was quoted as saying Weah [PER] had been provoked into the assault which left Costa [PER] with a broken nose.
  Original:    In an interview with the Italian [MISC] newspaper Gazzetta dello Sport [ORG], he was quoted as saying Weah [PER] had been provoked into the assault which left Costa [LOC] with a broken nose.
  Indian:      In an interview with the Italian [MISC] newspaper Hari Bhoomi [PER], he was quoted as saying Masih [PER] had been provoked into the assault which left Damerla [O] with a broken nose.
  Vietnamese:  In an interview with the Italian [MISC] newspaper Tien Phong [PER], he was quoted as saying Hue [PER] had been provoked into the assault which left Giang [LOC] with a broken nose.

Row 6
  True labels: Cagliari [ORG] (16) v Reggiana [ORG] (18) 1530
  Original:    Cagliari [ORG] (16) v Reggiana [ORG] (18) 1530
  Indian:      Thiruvananthapuram [LOC] (16) v Hyderabad [LOC] (18) 1530
  Vietnamese:  Sanna Khanh Hoa Futsal Club [PER] (16) v Ha Long Bay [ORG] (18) 1530

Row 7
  True labels: Bottom team Reggiana [ORG] are also without a suspended defender, German [MISC] Dietmar Beiersdorfer [PER].
  Original:    Bottom team Reggiana [ORG] are also without a suspended defender, German [MISC] Dietmar Beiersdorfer [PER].
  Indian:      Bottom team Hyderabad [LOC] are also without a suspended defender, German [MISC] Navneet Nedungadi [PER].
  Vietnamese:  Bottom team Ha Long Bay [PER] are also without a suspended defender, German [MISC] Linh On [PER].

Table 6: Predictions of (Huang et al., 2015) on the original, Indian and Vietnamese test sets. Many errors are made on the entity-switched datasets, even in the presence of strong contextual clues, including patterns common in the training data.
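The stochastic, type-constrained mapping of Section 2.2 that produced these switched test sets can be sketched as below. This is only an illustration under stated assumptions: `build_mapping` is a hypothetical helper, mentions are assumed to be already grouped by surface string within one document, and ORG sub-categories are encoded as keys such as "ORG:Newspaper", which is a convention chosen here rather than one taken from the paper. The candidate names in the usage lines are taken from Table 6.

```python
import random
from typing import Dict, List, Tuple


def build_mapping(mentions: List[Tuple[str, str]],
                  candidates: Dict[str, List[str]],
                  rng: random.Random) -> Dict[Tuple[str, str], str]:
    """Map each (surface string, type) in one document to a sampled replacement.

    A replacement is drawn uniformly at random, with replacement, from the
    country-specific candidates of the same type. Repeated mentions of the same
    string reuse the first draw, so the document stays consistent. Types with
    no candidate list (e.g. MISC) are left unmapped and therefore unchanged.
    """
    mapping: Dict[Tuple[str, str], str] = {}
    for surface, etype in mentions:
        if etype not in candidates:
            continue
        key = (surface, etype)
        if key not in mapping:
            mapping[key] = rng.choice(candidates[etype])
    return mapping


if __name__ == "__main__":
    indian = {
        "LOC": ["Dhanbad", "Hyderabad", "Thungapuram"],
        "ORG:Newspaper": ["Hari Bhoomi"],
    }
    doc = [("Japan", "LOC"), ("Syria", "LOC"), ("Japan", "LOC"),
           ("Asian Cup", "MISC"), ("Gazzetta dello Sport", "ORG:Newspaper")]
    print(build_mapping(doc, indian, random.Random(0)))
```

Because draws are independent, two different original entities may receive the same replacement; a variant that samples without replacement would avoid this but would no longer be i.i.d.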

Row 6 is similarly a common pattern in the training data, with sports scores, but the system fails to recognize it. In rows 5 and 7, there are clear contextual clues for the entity type but the system fails to consider them for prediction. In row 5, anything following ‘newspaper’ is the newspaper’s name and should be ORG. Similarly in row 7, anything following ‘team’ is the team’s name and should be ORG. While it may appear that ‘Italian newspaper’ followed by a name from a different origin may be the cause of the error, in the last row all the names following ‘German’ are recognized correctly irrespective of origin, so it does not seem that systems learn this level of knowledge.

3 Conclusion

Standard NER datasets such as English news contain a limited number of unique entities, with many of them occurring in both the train and the test sets. State-of-the-art models may thus memorize observed names, rather than learning to identify entities on the basis of context. As a result, models perform less well on ‘foreign’ entity instances.

To measure this, we introduced a practical approach of entity-switching to create datasets to test the in-domain robustness of systems. We selected entities from different countries and showed that models perform extremely well on entities from certain countries, but not as well on others. This finding has fairness implications when NER is used in practice. We hope that these datasets will facilitate research in developing more robust systems.

References

Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language, 44:61–83.

Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity.

Su Lin Blodgett, Lisa Green, and Brendan T. O’Connor. 2016. Demographic dialectal variation in social media: A case study of african-american english. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1119–1130.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, pages 77–91.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. 2019. The KnowRef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3952–3961, Florence, Italy. Association for Computational Linguistics.

Ralph Grishman and Beth Sundheim. 1996. Message understanding conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Divyansh Kaushik, Eduard Hovy, and Zachary C. Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pages 220–229.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745, Hong Kong, China. Association for Computational Linguistics.

Sameer S. Pradhan and Nianwen Xue. 2009. OntoNotes: The 90% solution. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, pages 11–12, Boulder, Colorado. Association for Computational Linguistics.

Jonathan Raiman and John Miller. 2017. Globally normalized reader. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1059–1069, Copenhagen, Denmark. Association for Computational Linguistics.

Sanjay Subramanian and Dan Roth. 2019. Improving generalization in coreference resolution via adversarial training. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 192–197.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Yuan Zhang, Jason Riesa, Daniel Gillick, Anton Bakalov, Jason Baldridge, and David Weiss. 2018. A fast, compact, accurate model for language identification of codemixed text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 328–337.