AI Matters, Volume 5, Issue 2, 2019

Annotated Table of Contents

Welcome to AI Matters 5(2)
Amy McGovern, co-editor & Iolanda Leite, co-editor
Full article: http://doi.acm.org/10.1145/3340470.3340471
Welcome and summary

AI Profiles: An Interview with Leslie Kaelbling
Marion Neumann
Full article: http://doi.acm.org/10.1145/3340470.3340472
Interview with Leslie Kaelbling

Events
Michael Rovatsos
Full article: http://doi.acm.org/10.1145/3340470.3340473
Upcoming AI events

AI Education Matters: Data Science and Machine Learning with Magic: The Gathering
Todd W. Neller
Full article: http://doi.acm.org/10.1145/3340470.3340474
Data Science and Machine Learning with Magic: The Gathering

AI Policy Matters
Larry Medsker
Full article: http://doi.acm.org/10.1145/3340470.3340475
Policy issues relevant to SIGAI

Beyond Transparency: A Proposed Framework for Accountability in Decision-Making AI Systems
Janelle Berscheid & Francois Roewer-Despres
Full article: http://doi.acm.org/10.1145/3340470.3340476
Winning essay from the 2018 ACM SIGAI Student Essay Contest

Context-conscious fairness in using machine learning to make decisions
Michelle Seng Ah Lee
Full article: http://doi.acm.org/10.1145/3340470.3340477
Winning essay from the 2018 ACM SIGAI Student Essay Contest

The Scales of (Algorithmic) Justice: Tradeoffs and Remedies
Matthew Sun & Marissa Gerchick
Full article: http://doi.acm.org/10.1145/3340470.3340478
Winning essay from the 2018 ACM SIGAI Student Essay Contest

What Metrics Should We Use To Measure Commercial AI?
Cameron Hughes & Tracey Hughes
Full article: http://doi.acm.org/10.1145/3340470.3340479
Metrics for commercial AI

Crosswords
Adi Botea
Full article: http://doi.acm.org/10.1145/3340470.3340480
AI generated Crosswords

Links
SIGAI website: http://sigai.acm.org/
Newsletter: http://sigai.acm.org/aimatters/
Blog: http://sigai.acm.org/ai-matters/
Twitter: http://twitter.com/acm_sigai/
Edition DOI: 10.1145/3340470

Join SIGAI

Students $11, others $25. For details, see http://sigai.acm.org/. Benefits: regular and student. Also consider joining ACM. Our mailing list is open to all.

Notice to Contributing Authors to SIG Newsletters

By submitting your article for distribution in this Special Interest Group publication, you hereby grant to ACM the following non-exclusive, perpetual, worldwide rights:

• to publish in print on condition of acceptance by the editor
• to digitize and post your article in the electronic version of this publication
• to include the article in the ACM Digital Library and in any Digital Library related services
• to allow users to make a personal copy of the article for noncommercial, educational or research purposes

However, as a contributing author, you retain copyright to your article and ACM will refer requests for republication directly to you.

Submit to AI Matters!

We're accepting articles and announcements now for the next issue. Details on the submission process are available at http://sigai.acm.org/aimatters.

AI Matters Editorial Board

Amy McGovern, Co-Editor, U. Oklahoma
Iolanda Leite, Co-Editor, KTH
Sanmay Das, Washington Univ. in Saint Louis
Alexei Efros, Univ. of CA Berkeley
Susan L. Epstein, The City Univ. of NY
Yolanda Gil, ISI/Univ. of Southern California
Doug Lange, U.S. Navy
Kiri Wagstaff, JPL/Caltech
Xiaojin (Jerry) Zhu, Univ. of WI Madison

Contact us: [email protected]

Contents Legend

Book Announcement
Ph.D. Dissertation Briefing
AI Education
Event Report
Hot Topics
Humor
AI Impact
AI News
Opinion
Paper Précis
Spotlight
Video or Image

Details at http://sigai.acm.org/aimatters


Welcome to AI Matters 5(2)

Amy McGovern, co-editor (University of Oklahoma; [email protected])
Iolanda Leite, co-editor (Royal Institute of Technology (KTH); [email protected])
DOI: 10.1145/3340470.3340471

Issue overview

Welcome to the second issue of the fifth volume of the AI Matters Newsletter. We have exciting news from SIGAI Vice-Chair Sanmay Das: "We are delighted to announce that the first ever ACM SIGAI Industry Award for Excellence in Artificial Intelligence will be awarded to the Decision Service created by the Real World Reinforcement Learning Team from Microsoft! The award will be presented at IJCAI 2019. For more on the award and the team that received it, please see https://sigai.acm.org/awards/industry_award.html."

This issue opens with our interview series, where Marion Neumann interviews Leslie Pack Kaelbling, a Professor of Computer Science and Engineering at MIT and founder of the Journal of Machine Learning Research. In our regular columns, we have a summary of upcoming AI conferences and events from Michael Rovatsos. Todd Neller's educational column is dedicated to a Magic: The Gathering dataset that provides interesting opportunities for exploring research questions on data science and ML. In the policy column, Larry Medsker summarizes recent AI-related initiatives and discusses new jobs in the AI future.

This issue features the first set of winning essays from the 2018 ACM SIGAI Student Essay Contest, with the second set of winning essays to appear in the next issue. In addition to having their essay appear in AI Matters, the contest winners received either monetary prizes or one-on-one Skype sessions with leading AI researchers.

In the regular contributed papers, our editors Cameron Hughes and Tracey Hughes propose potential metrics for commercial AI. We close with our (now regular) entertainment column, an AI-generated crossword puzzle by Adi Botea. You can also find the solution to the puzzle from the previous issue.

Submit to AI Matters!

Thanks for reading! Don't forget to send your ideas and future submissions to AI Matters! We're accepting articles and announcements now for the next issue. Details on the submission process are available at http://sigai.acm.org/aimatters.

Amy McGovern is co-editor of AI Matters. She is a Professor of computer science at the University of Oklahoma and an adjunct Professor of meteorology. She directs the Interaction, Discovery, Exploration and Adaptation (IDEA) lab. Her research focuses on machine learning and data mining with applications to high-impact weather.

Iolanda Leite is co-editor of AI Matters. She is an Assistant Professor at the School of Electrical Engineering and Computer Science at the KTH Royal Institute of Technology in Sweden. Her research interests are in the areas of Human-Robot Interaction and Artificial Intelligence. She aims to develop autonomous socially intelligent robots that can assist people over long periods of time.

Copyright © 2019 by the author(s).


AI Profiles: An Interview with Leslie Kaelbling

Marion Neumann (Washington University in St. Louis; [email protected])
DOI: 10.1145/3340470.3340472

Introduction

Welcome to the eighth interview profiling a senior AI researcher. This time we will hear from Leslie Kaelbling, Panasonic Professor of Computer Science and Engineering in the Department of Electrical Engineering and Computer Science at MIT.

Figure 1: Leslie Kaelbling

Biography

Leslie is a Professor at MIT. She has an undergraduate degree in Philosophy and a PhD in Computer Science from Stanford, and was previously on the faculty at Brown University. She was the founding editor-in-chief of the Journal of Machine Learning Research. Her research agenda is to make intelligent robots using methods including estimation, learning, planning, and reasoning. She is not a robot.

Getting to Know Leslie Kaelbling

When and how did you become interested in CS and AI?

I went to high school in rural California, but the summer before my senior year I went to an NSF summer program in math. We actually ended up studying computer science. My crowning achievement was writing quicksort in Basic! I also discovered Scientific American, and started reading Martin Gardner's columns (the only part of the magazine I could even sort of understand) and learned about Gödel, Escher, Bach by Douglas Hofstadter. I managed to get a copy of it, and that's what made me really get interested in AI.

What professional achievement are you most proud of?

Starting JMLR, I guess. I think it's been very helpful for the community, and was actually not very hard to do.

What would you have chosen as your career if you hadn't gone into CS?

No idea! I'm pretty flexible. But almost sure something technical.

What is the most interesting project you are currently involved with?

I'm doing the same thing I've always been doing, which is trying to figure out how to make really intelligent robots. I do this mostly out of curiosity: I want to understand what the necessary and sufficient computational methods are for making an agent that behaves in a way we'd all be happy to call intelligent. I think human intelligence is probably a point in a big space of computational mechanisms that achieve intelligent behavior. I'm interested in understanding that whole space!

AI is grown up – it's time to make use of it for good. Which real-world problem would you like to see solved by AI in the future?

I'm not so focused on solving actual problems, but I'm fairly sure that methods that are developed on the way to understanding computational approaches to intelligent behavior will end up being useful in a variety of ways that I don't anticipate.

Copyright © 2019 by the author(s).


How can we make AI more diverse? Do you have a concrete idea on what we as (PhD) students, researchers, and educators in AI can do to increase diversity in our field?

Unfortunately, I don't, really. The answer for AI is probably not substantially different from the answer for CS or even engineering more broadly.

Help us determine who should be in the AI Matters spotlight! If you have suggestions for who we should profile next, please feel free to contact us via email at [email protected].

How do you balance being involved in so many different aspects of the AI community?

I'm a good juggler! But it's suddenly much harder than it was, just because of the enormous growth of enthusiasm about AI, and machine learning in particular. Everything I do, from teaching undergraduates to graduate admissions to hiring to writing tenure letters to reviewing papers to organizing conferences, has just gotten an order of magnitude bigger and more complex. I was really affected by this for a while, but now I'm honing my "no"-saying skills so I can protect time to actually do research (which is why I'm in this business, after all).

What do you wish you had known as a Ph.D. student or early researcher?

I don't know. Things worked out pretty well for me, but completely by accident. I think there are many ways in which it's actually good to not know much. You have a greater chance of doing something really novel or really hard just because you don't know it's novel or hard.

What is your favorite AI-related movie or book and why?

Well, Gödel, Escher, Bach was formative for me. Its focus on primitives and systems of combination, and themes of recursion, semantics, quotation, reflection really resonated with me and I'm sure that the ways in which I think and formulate problems still show its influences. I haven't re-read it since I was 17, though, so I don't know what it would feel like now.


Events

Michael Rovatsos (University of Edinburgh; [email protected])
DOI: 10.1145/3340470.3340473
Copyright © 2019 by the author(s).

This section features information about upcoming events relevant to the readers of AI Matters, including those supported by SIGAI. We would love to hear from you if you are organizing an event and would be interested in cooperating with SIGAI, or if you have announcements relevant to SIGAI. For more information about conference support visit sigai.acm.org/activities/requesting_sponsorship.html.

17th International Conference on Artificial Intelligence and Law (ICAIL'19)
Montreal, Canada, June 17-20, 2019
https://icail2019-cyberjustice.com
The 2019 edition of the International Conference on Artificial Intelligence and Law (ICAIL) will be held at the Cyberjustice Laboratory, University of Montreal. The conference is held biennially under the auspices of the International Association for Artificial Intelligence and Law (IAAIL) and in cooperation with the Association for the Advancement of Artificial Intelligence (AAAI) and the ACM Special Interest Group on Artificial Intelligence (ACM SIGAI). The conference proceedings are published by ACM.

14th International Conference on the Foundations of Digital Games (FDG'19)
San Luis Obispo, CA, August 26-30, 2019
fdg2019.org
Foundations of Digital Games is a major international "big tent" academic conference dedicated to exploring the latest research in all aspects of digital games. FDG is usually held once a year, in Europe or elsewhere; FDG 2018 was held in Malmö, Sweden. FDG 2019 is held in cooperation with ACM and ACM SIGAI, SIGGRAPH and SIGCHI.

11th International Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K'19)
Vienna, Austria, September 17-19, 2019
www.ic3k.org
The purpose of IC3K is to bring together researchers, engineers and practitioners on the areas of Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K is composed of three co-located conferences, each specialized in at least one of the aforementioned main knowledge areas. IC3K 2019 will be held in conjunction with WEBIST 2019 and IJCCI 2019.
Submission deadline: June 12, 2019

11th International Joint Conference on Computational Intelligence (IJCCI'19)
Vienna, Austria, September 17-19, 2019
http://www.ijcci.org
The purpose of IJCCI is to bring together researchers, engineers and practitioners on the areas of Fuzzy Computation, Evolutionary Computation and Neural Computation. IJCCI is composed of three co-located conferences, each specialized in at least one of the aforementioned main knowledge areas. IJCCI 2019 will be held in conjunction with WEBIST 2019 and IC3K 2019.
Submission deadline: June 12, 2019

International Conference on Web Intelligence (WI'19)
Thessaloniki, Greece, October 14-17, 2019
https://webintelligence2019.com
Web Intelligence (WI) aims to achieve a multi-disciplinary balance between research advances in the fields of collective intelligence, data science, human-centric computing, knowledge management, and network science. It is committed to addressing research that deepens the understanding of computational, logical, cognitive, physical as well as business and social foundations of the future Web, and enables the development and

application of intelligent technologies. WI '19 features high-quality, original research papers and real-world applications in all theoretical and technological areas that make up the field of WI.

25th International Conference on Intelligent User Interfaces (IUI'20)
Cagliari, Italy, March 17-20, 2020
https://iui.acm.org/2020/
ACM IUI 2020 is the 25th annual meeting of the intelligent interfaces community and serves as a premier international forum for reporting outstanding research and development on intelligent user interfaces. ACM IUI is where the Human-Computer Interaction (HCI) community meets the Artificial Intelligence (AI) community. We are also very interested in contributions from related fields, such as psychology, behavioral science, cognitive science, computer graphics, design, and the arts.
Submission deadline: 8th October 2019

Michael Rovatsos is the Conference Coordination Officer for ACM SIGAI, and a faculty member at the University of Edinburgh. His research is in multiagent systems and human-friendly AI. Contact him at [email protected].


AI Education Matters: Data Science and Machine Learning with Magic: The Gathering

Todd W. Neller (Gettysburg College; [email protected])
DOI: 10.1145/3340470.3340474
Copyright © 2019 by the author(s).

Introduction

In this column, we briefly describe a rich dataset with many opportunities for interesting data science and machine learning assignments and research projects, we take up a simple question, and we offer code illustrating use of the dataset in pursuit of answers to the question.

Magic: The Gathering (MTG) is a collectible card game featuring imperfect information, chance, complex hand management, and fascinating metagame economics in the online market for cards. One can study the in-game economy that concerns different colored/uncolored mana, or one can study the time-series dollar costs of cards on the open market as they are introduced and then fluctuate in supply and demand, e.g. when rotating out of standard play legality (generally after less than two years), or when found to have powerful uses in newer play formats like Pauper.

As of 25 February 2019, a freely-available, rich JSON card dataset for 19386 MTG cards may be downloaded from the MTGJSON website. For this column, we select a single focus question: How much more is the in-game cost for a creature with "flying" ability?

The Cost of Flying

Flying is an evasive ability of creatures that generally comes with a higher converted mana cost (CMC), i.e. the total number of in-game currency units needed to bring the card into play regardless of the specific mana colors required.

In related work, Henning Hasemann, a Berlin-based researcher and software engineer, has done preliminary data science and machine learning investigation on the cost of flying and other keyword attributes of MTG cards (http://leetless.de/tag-MTG.html). He concluded that, for the same power and toughness, one pays approximately 0.79 in CMC for flying. We will take a slightly different approach in this column.

Having downloaded AllCards.json, one can easily load the card data in Python using Python's native json library:

import json
with open('AllCards.json', 'r', encoding='utf8') as read_file:
    data = json.load(read_file)
print(len(data), 'cards read.')

One can then iterate through cards and their respective data dictionaries:

for card_name in data:
    card_data = data[card_name]

For each card dictionary card_data, we can access rule text, power, toughness, and CMC through dictionary keys 'text', 'power', 'toughness', and 'convertedManaCost'. The structure of the JSON data is well-documented.

More difficult to ascertain is which cards are creatures which are inherently "flying". Some non-creature cards contain rule text affecting flying creatures. Other creature cards describe defense against flying creatures. A simple approach to eliminate false positives and false negatives is to filter for only creatures that have either no rule text or rule text consisting of "Flying" only. Looking over the data filtered thus, we noted that cards with "transform" layout should be excluded, as their true cost of flying is the mana cost of the initial side plus the game condition that must be achieved to flip the card to its transformed side.

Instructors and students can jumpstart their exploration of the following overview by downloading all data and code, including the aforementioned AllCards.json data, Python code, and a corresponding Jupyter notebook.
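The companion code performs the creature filtering described above; a rough sketch of that step is shown below. The 'text', 'power', 'toughness', and 'convertedManaCost' keys are those named in this column, while 'types' and 'layout' are assumed key names that should be checked against the MTGJSON documentation (the published notebook may organize this step differently).

import json

# Load the card dictionary keyed by card name.
with open('AllCards.json', 'r', encoding='utf8') as read_file:
    data = json.load(read_file)

flying, non_flying = [], []
for card_name, card in data.items():
    # Keep only creatures; 'types' and 'layout' are assumed MTGJSON keys.
    if 'Creature' not in card.get('types', []):
        continue
    if card.get('layout') == 'transform':
        continue  # the flip condition obscures the true cost of flying
    try:
        row = (float(card['power']), float(card['toughness']),
               float(card['convertedManaCost']))
    except (KeyError, ValueError):
        continue  # skip cards with missing or non-numeric stats (e.g. '*')
    text = card.get('text', '').strip()
    if text == 'Flying':
        flying.append(row)
    elif text == '':
        non_flying.append(row)

print(len(flying), 'flying and', len(non_flying), 'non-flying creatures kept')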

In both the code and the Jupyter notebook, one can trace a simple exploration of this question where, having filtered the 9844 creatures conservatively to include only 81 flying and 312 non-flying creatures, we first view the data as a jittered scatterplot where flying and non-flying creatures are represented as blue triangles and red circles, respectively.

Starting simply, we perform and visualize a linear regression predicting CMC from power, toughness, and a binary "flying" attribute, yielding a linear model with an R-squared value of 0.82:

CMC = 0.6111629366158998 * power
    + 0.41265041398361274 * toughness
    + 0.60820253825056 * flying
    + 0.2782416745722158

Thus, this linear model would predict a flying premium of about 0.61 CMC. Here we visualize the predictions for flying and non-flying creatures as blue and red planes, respectively. Looking along the planes, we see non-linearity in the data. Applying gradient boosted decision tree learning (sklearn's GradientBoostingRegressor), we see a more nuanced prediction.

A sample-weighted mean of the flying premium with this model is about 0.725 CMC, and one can observe that the flying premium increases to about 1 for toughness values 4 and greater, which lead to more difficult creature removal for the opponent.

However, the greater takeaway here is that this free dataset is as rich and complex as the game, inviting simple inquiries like this and offering opportunities for much more complex analyses. The example code offered here is an invitation for educators and their students to take up a wide variety of fun and interesting questions concerning this popular 25-year-old game.
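For readers who want to reproduce the two model fits discussed above, a minimal sketch follows. It assumes the flying and non_flying lists of (power, toughness, CMC) rows from the earlier filtering sketch, and uses scikit-learn's LinearRegression together with the GradientBoostingRegressor named in the text; the premium computation is one reasonable reading of "sample-weighted mean", and exact numbers will depend on the dataset snapshot and filtering choices.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# Features are [power, toughness, flying]; the target is CMC.
rows = [(p, t, 1.0, c) for (p, t, c) in flying] + \
       [(p, t, 0.0, c) for (p, t, c) in non_flying]
arr = np.array(rows)
X, y = arr[:, :3], arr[:, 3]

# Linear fit: the coefficient on the binary flying feature is a constant premium.
linear = LinearRegression().fit(X, y)
print('R-squared:', linear.score(X, y))
print('flying premium (linear):', linear.coef_[2])

# Gradient boosted trees let the premium vary with power and toughness.
gbt = GradientBoostingRegressor(random_state=0).fit(X, y)
X_fly, X_ground = X.copy(), X.copy()
X_fly[:, 2], X_ground[:, 2] = 1.0, 0.0
premium = gbt.predict(X_fly) - gbt.predict(X_ground)
print('mean flying premium over the sample (GBT):', premium.mean())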

Past and Future MTG Research

At present, published AI research on MTG is limited to little more than a few papers on Monte Carlo Tree Search application to a simplified MTG card set (Ward & Cowling, 2009; Cowling, Ward, & Powley, 2012) and procedural MTG card generation (Summerville & Mateas, 2016). Hearthstone, a very similar digital collectible card game, has received much more research attention.

MTG should be of great future interest as a general game play challenge for a variety of reasons beyond its popularity and 25-year history. First and foremost is the greater strategic complexity arising from the rules of over nineteen thousand MTG cards. The software engineering challenge of modeling MTG is likely the reason more attention has been given to the newer and simpler game Hearthstone. An important building-block challenge here would be to design AI systems that could learn to play two fixed decks optimally against one another.


A second reason for interest is the fascinating metagame choices where players build decks for competition at time of play (e.g. through drafting) or a priori (e.g. through constructed formats) that seek to find beneficial synergy between a constrained set of allowable cards. A clever recent construction showed MTG to be Turing complete (Churchill, Biderman, & Herrick, 2019).

A third reason could be termed the metametagame, where the MTG consumer negotiates the supply and demand cost dynamics controlled by Wizards of the Coast LLC (WotC) by creating new play formats (e.g. Pauper, Cube) to constrain costs. In his 2016 GDC talk, MTG head designer Mark Rosewater's lesson 18 was that "restrictions breed creativity". Over the history of MTG, players have shown that cost restrictions breed affordable creativity. For example, Cube designers seek to select sets of 360 or more unique MTG cards that allow interesting gameplay for players that repeatedly draft from the Cube.

Whereas deck construction could be said to invite players to become designers of the games they play, play format design could be said to invite players to become metagame designers. When AI systems of the future are capable of such metametalevel decisions in this imperfect information game of chance, we will see an unprecedented leap forward in game design itself.

In our modest present time, we hope that you and your students find interesting simple research questions to explore in the MTGJSON dataset and sow seeds of curiosity for research advances to come.

References

Churchill, A., Biderman, S., & Herrick, A. (2019). Magic: The Gathering is Turing complete. CoRR, abs/1904.09828. Retrieved from http://arxiv.org/abs/1904.09828

Cowling, P., Ward, C., & Powley, E. (2012). Ensemble determinization in Monte Carlo tree search for the imperfect information card game "Magic: The Gathering". IEEE Transactions on Computational Intelligence and AI in Games, 4(4), 241–257. doi: 10.1109/TCIAIG.2012.2204883

Summerville, A., & Mateas, M. (2016). Mystical tutor: A Magic: The Gathering design assistant via denoising sequence-to-sequence learning. In Proc. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 2016.

Ward, C., & Cowling, P. (2009). Monte Carlo search applied to card selection in Magic: The Gathering. In Proc. 2009 IEEE Symposium on Computational Intelligence and Games (pp. 9–16). IEEE. doi: 10.1109/CIG.2009.5286501

Todd W. Neller is a Professor of Computer Science at Gettysburg College. A game enthusiast, Neller researches game AI techniques and their uses in undergraduate education.


AI Policy Matters

Larry Medsker (The George Washington University; [email protected])
DOI: 10.1145/3340470.3340475
Copyright © 2019 by the author(s).

Abstract

AI Policy is a regular column in AI Matters featuring summaries and commentary based on postings that appear twice a month in the AI Matters blog (https://sigai.acm.org/aimatters/blog/). We welcome everyone to make blog comments so we can develop a rich knowledge base of information and ideas representing the SIGAI members.

News and Announcements

AAAI Policy Initiative

AAAI has established a new mailing list on US Policy that will focus exclusively on the discussion of US policy matters related to artificial intelligence. All members and affiliates are invited to join the list at this link. Participants will have the opportunity to subscribe or unsubscribe at any time. The mailing list will be moderated, and all posts will be approved before dissemination. This is a great opportunity for another productive partnership between AAAI and SIGAI for policy work.

EPIC Panel on June 5th

A panel on AI, Human Rights, and US policy was hosted by the Electronic Privacy Information Center (EPIC) at their annual meeting (and celebration of its 25th anniversary) on June 5, 2019, at the National Press Club in DC. Lorraine Kisselburgh (Purdue) joined Harry Lewis (Harvard), Sherry Turkle (MIT), Lynne Parker (UTenn and White House OSTP director for AI), Sarah Box (OECD), and Bilyana Petkova (EPIC and Maastricht) to discuss AI policy directions for the US.

AI Research Roadmap

The Computing Community Consortium (CCC) received comments on the draft of A 20-Year Community Roadmap for AI Research in the US. The draft was the result of a community process involving more than one hundred AI professionals. The CCC initiative to create the Roadmap for Artificial Intelligence was started in Fall 2018, under the leadership of Yolanda Gil (University of Southern California and President of AAAI) and Bart Selman (Cornell University and President Elect of AAAI). Follow this link to the whole report.

New Jobs in the AI Future

As employers increasingly adopt automation technology, many workforce analysts look to jobs and career paths in new disciplines, especially data science and applications of AI, to absorb workers who are displaced by automation. By some accounts, data science is in first place for technology career opportunities. Estimating current and near-term numbers of data scientists and AI professionals is difficult because of different job titles and position descriptions used by organizations and job recruiters. Likewise, many employees in positions with traditional titles have transitioned to data science and AI work. Better estimates, and at least upper limits, are necessary for evidence-based predictions of unemployment rates due to automation over the next decade.

McKinsey & Company estimates 375 million jobs will be lost globally due to AI and other automation technologies by 2030, and one school of thought in today's public discourse is that at least that number of new jobs will be created. An issue for the AI community and policy makers is the nature, quality, and number of the new jobs – and how many data science and AI technology jobs will contribute to meeting the shortfall.

An article in KDnuggets by Gregory Piatetsky points out that a "Search for data scientist (without quotes) finds about 30,000 jobs, but we are not sure how many of those jobs are for scientists in other areas – persons employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business

in its decision-making. Titles include Data Scientist, Data Analyst, Statistician, Bioinformatician, Neuroscientist, Marketing executive, Computer scientist, etc."

Data on this issue could clarify the net number of future jobs in AI, data science, and related areas. Computer science had a similar history with the boom in the new field followed by migration of computing into many other disciplines. Another factor is that "long-term, however, automation will be replacing many jobs in the industry, and Data Scientist jobs will not be an exception. Already today companies like DataRobot and H2O offer automated solutions to Data Science problems. Respondents to a KDnuggets 2015 Poll expected that most expert-level Predictive Analytics and Data Science tasks will be automated by 2025. To stay employed, Data Scientists should focus on developing skills that are harder to automate, like business understanding, explanation, and storytelling." This issue is important in estimating the number of new jobs by 2030 for displaced workers.

Kiran Garimella in his Forbes article "Job Loss From AI? There's More To Fear!" examines the scenario of not enough new jobs to replace ones lost through automation. His interesting perspective turns to economists, sociologists, and insightful policymakers "to re-examine and re-formulate their models of human interaction and organization and ... re-think incentives and agency relationships".

How Open Source?

A recent controversy erupted over OpenAI's new version of their language model for generating well-written next words of text based on unsupervised analysis of large samples of writing. Their announcement and decision not to follow open-source practices raise interesting policy issues about regulation and self-regulation of AI products. OpenAI, a non-profit AI research company founded by Elon Musk and others, announced on February 14, 2019, that "We've trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization – all without task-specific training".

The reactions to the announcement followed from the decision behind the following statement in the release: "Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper".

The Electronic Frontier Foundation has an analysis of the manner of the release (letting journalists know first) and concludes, "when an otherwise respected research entity like OpenAI makes a unilateral decision to go against the trend of full release, it endangers the open publication norms that currently prevail in language understanding research".

This issue is an example of previous ideas in our Public Policy blog about who, if anyone, should regulate AI developments and products that have potential negative impacts on society. Do we rely on self-regulation or require governmental regulations? What if the U.S. has regulations and other countries do not? Would a clearinghouse approach put profit-based pressure on developers and corporations? Can the open source movement be successful without regulatory assistance?

Please join our discussions at the SIGAI Policy Blog.

Larry Medsker is Research Professor of Physics and was founding director of the Data Science graduate program at The George Washington University. He is a faculty member in the GW Human-Technology Collaboration Lab and Ph.D. program. His research in AI includes work on artificial neural networks, hybrid intelligent systems, and the impacts of AI on society and policy. He is the Public Policy Officer for the ACM SIGAI.


Beyond Transparency: A Proposed Framework for Accountability in Decision-Making AI Systems

Janelle Berscheid (University of Saskatchewan; [email protected])
Francois Roewer-Despres (University of Toronto; [email protected])
DOI: 10.1145/3340470.3340476
Copyright © 2019 by the author(s).

Abstract

Transparency in decision-making AI systems can only become actionable in practice when all stakeholders share responsibility for validating outcomes. We propose a three-party regulatory framework that incentivizes collaborative development in the AI ecosystem and guarantees fairness and accountability are not merely afterthoughts in high-impact domains.

Introduction

Decision-making AI systems are becoming commonplace due to recent and rapid advances in computer hardware, machine learning algorithms, and an explosion in data availability. Increasingly, AI is being deployed to make life-altering decisions, such as those involving health, freedom, security, finance, and livelihood. However, the public is apprehensive about automated decision-making: the majority of Americans in a Pew Research Center survey were opposed to current applications of decision-making systems (Smith, 2018). Chief among their worries were concerns regarding the potential bias and unfairness of these systems, as well as skepticism regarding the validity of the decisions being made. Research on the social impact of AI, such as from New York University's AI Now Institute, corroborates these concerns, arguing that automated decision-making systems are often deployed untested and without accountability measures or processes for appeal (Whittaker et al., 2018).

Such technologies often create a gap between private and social cost. The motivating example for this essay was the COMPAS recidivism algorithm, developed by a company named Northpointe, which came under fire after a ProPublica investigation found that its results were racially biased and that jurisdictions were misapplying the results during trials, resulting in defendants incorrectly being labeled as potentially violent re-offenders (Angwin & Larson, 2016).

Stifling innovation is undesirable, but unchecked deployment of high-impact decision-making systems is damaging to the long-term health of the AI ecosystem. The public backlash will likely only intensify as the inevitable failures of unvetted systems come to light. Properly vetted systems have the potential to save time, money, and even lives, so finding remedies to the negative aspects of AI is critical to ensuring the public is a stakeholder in the AI ecosystem rather than an adversary.

Properly-designed AI systems present a unique opportunity to make transparent decisions by circumventing human fallibility. In fact, the framework we are proposing holds AI decision-makers to far more stringent standards than could ever be applied to human decision-makers. For example, psychological studies demonstrate human tendencies to make decisions subconsciously and then consciously rationalize them post-hoc (Soon, Brass, Heinze, & Haynes, 2008), or to prefer one decision over another based on framing of the situation (Tversky & Kahneman, 1981).

Ensuring accountability when mistakes are made is a step towards addressing the concerns surrounding bias, fairness, and validity. Transparent decision-making—where the reasoning behind an AI's decision is made clear, interpretable, and auditable—is often proposed as a solution to the problem of biased or invalid AI systems. Indeed, transparency addresses many of these concerns; however, the gap between AI developers and other parties, such as the public and policymakers, cannot be closed by transparency alone. For example, the recent testimony of Facebook CEO Mark Zuckerberg before


Congress regarding the exploitation of Facebook users' data increased awareness surrounding this data misuse, but also revealed that many Congresspeople lack even a basic understanding of the technology (Kang, Kaplan, & Fandos, 2018). As a result, many pertinent questions were not asked, leading many to view it as a missed opportunity (Freedland, 2018).

Thus, we argue that transparency alone is not a sufficient requirement to produce accountability and reassure stakeholders: additional elements are needed to make transparency actionable in practice. For the public to become comfortable with autonomous decision-making systems, there needs to be a sense that a person can take remedial actions against an AI system if wronged. We identify two requirements to meeting such demands: disclosure of the existence of an AI in a decision-making process to the public, coupled with an appeal process for the AI's decisions, and a validated scope for the AI, including the use cases for which it has been tested and its limitations.

These components suggest a need for a closer relationship between those who develop AI systems and those who create policies for their deployment. We propose a regulatory framework for AI systems that captures this need for communication into the development and deployment process itself. We envision establishing a pair of documents that ensure these disclosure and scope requirements are fulfilled prior to deployment. The developer's document publicly declares the scope of their system and how it has been validated, while the client document explicates the disclosure and appeal policies surrounding the deployment of a developer's system into a particular domain. In addition, a formally-established relationship between the developers and clients helps reduce cases where these two parties try to pin a system's failure on the other. In turn, this would facilitate the open sharing of knowledge and the cooperative development of fair procedures.

Definitions

For the purposes of this article, we want to make a careful distinction between the parties we have termed "developer", "client", and "subject".

Developer
An entity or person that has created an AI system for the purpose of real-world deployment. This may include, but is not limited to, those who designed the system, those who trained and tested the model at the core of the system, and those who developed the data pipeline.

Client
An entity or person that intends to deploy a developer's AI system as part of a decision-making process. This process may include other human or automated elements, as well as integrate AI systems from multiple developers. Typically, the client will be an institution (such as a charity, hospital, or government body) or a private company.

Subject
An entity or person that is the target of a client's deployed AI system. Subjects include patients, inmates, and consumers, or even private companies.

This distinction reflects a separation of concerns: developers are primarily concerned with the technical components of a system; clients are primarily concerned with the use of a system in relation to the subjects for a specific context. Further, this separation allows a developer to license their system to more than one client. It remains possible for the developer and client to be the same entity.

In addition, we want to clarify the types of AI systems being considered.

AI System
A system or process that incorporates AI in some form or another for the purpose of decision-making.

In this article, we will be considering requirements only for decision-making AI systems. We do not consider any AI systems that interact with, or perhaps mimic, humans, as these may have a different set of requirements. We also focus our discussion on systems in high-impact areas, which may include, but are not limited to, health services, financial institutions, justice systems, aid programs, and civic initiatives. AI systems in lower-impact areas, such as recommender systems for consumer

platforms, are less urgently in need of regulation, though these systems could benefit from the framework we describe as well. We also exempt AI research from our discussion, instead choosing to focus on systems intended for real-world deployment.

Transparency

We define transparency as making a decision that is in some way explainable, that has its inner workings available for review, and is not proprietary. This differs from disclosure, which is 'transparency in the use of a model' (rather than 'transparency in the prediction of a model'). In other words, the use of a model can be disclosed, but the model itself might not be interpretable.

Implemented properly, transparency can help us examine erroneous assumptions made about the data, and even let human operators correct bad features in the model (Ribeiro, Singh, & Guestrin, 2016). Transparency also allows us to examine current and possible points of failure in a system, as well as monitor system performance to ensure the quality of decision-making does not degrade over time. In addition, transparency can help us verify that public stakeholders' criteria, such as fairness or the removal of demographic bias, are being met.

Currently, many decision-making systems are neither transparent nor easily interpretable. Deep learning techniques, in particular, spur this concern, since neural networks are often characterized as a black box whose decision-making process is completely opaque. Explainable AI (XAI) research seeks to develop techniques to shine light on these systems, yet many fundamental problems remain open. Even the definitions of terms such as "transparency", "interpretability", and "explainability" are difficult to establish and even harder to unify across different input domains.

In addition, the mechanisms by which humans interpret these explanations, especially in relation to natural human subjectivity and bias, have yet to be understood. A recent survey of submissions from the International Joint Conferences on Artificial Intelligence workshop in XAI found that knowledge from fields such as behavioural research and social science were rarely referenced, and the evaluations of any explanations that were produced did not generally include behavioural experiments (Miller, Howe, & Sonenberg, 2017). Behavioural analyses may be necessary to address subtler issues related to transparency. For example, the adequacy of a particular type of explanation may be dependent on the human interpreter. Indeed, doctors and patients use different explanatory models when interpreting a medical decision (Good, 1993).

Furthermore, hidden feedback loops, where multiple automated decision-making systems influence each other's inputs over time, may also interfere with the system's decision-making in a way that is difficult for a transparency algorithm to identify (Sculley et al., 2015). Transparency in this context may reveal a change over time, but cannot identify the influence of the other systems on the data collected, and thus on the model's decision-making ability.

Despite these unsolved issues, transparency remains an integral component for accountability in AI. Exposing the inner workings of a model to external review helps foster trust in decision-making systems. However, transparency is not directly actionable in deployed systems unless an additional framework is in place to ensure subjects can use transparent explanations to hold developers and clients accountable when wronged.

Disclosure

For AI systems to be transparent, their use must be made known, rather than remaining a hidden part of a decision-making process. Applications that fail to disclose the use of AI systems create a power imbalance, where those unaware of its use are not in a position to question or challenge the figures making crucial decisions on their behalf. This leads to hidden biases and a lack of accountability, as well as creating confusion when problems do arise with these automated systems (Diallo, 2018).

Therefore, we argue that a person significantly impacted by a decision-making system should have the presence of this system clearly disclosed to them. This disclosure could be verbal or—in formal cases—form-based, requiring

acknowledgement on the part of the subject. Many application domains of decision-making systems already require paperwork, such as health or legal systems, so viable disclosure channels already exist. A distinction can be made in disclosure between AI-made decisions, where the system makes a ruling without any human oversight, and AI-assisted decisions, where the system makes recommendations that are reviewed by a human as part of the final decision. Making this distinction could help diagnose points of failure in a decision-making process. First, whenever an AI-assisted process leads to biased or unfair decisions, measures such as retraining the human reviewers may be part of the solution. Second, an AI-assisted decision-making process in which the human reviewer always defers to the AI system's recommendation may be treated as an effectively AI-made decision, which may violate the disclosure agreement.

In addition, disclosure will help developers identify stakeholders in the decision-making process who may otherwise be overlooked. For example, developers of a recidivism algorithm may think to consult prisoners in their requirements gathering, but may neglect to consult at-risk communities, which ought to be consulted to create a fair and unbiased development process. Mandatory disclosure of AI use in their software would create an opportunity for these forgotten stakeholders to make their voices heard through an appeal process, and would encourage developers to expand their definition of impacted communities and stakeholders in future developments, creating a more community-conscious AI development process.

As alluded to, disclosure goes hand-in-hand with the additional requirement that autonomous decisions are appealable. An appeal process that is obscured or hidden cannot be described as fair, as it creates barriers limiting the participation of wronged subjects. When subjected to autonomous decisions, people ought to know not only that AI is employed, but also how its decisions can be examined or appealed. Specifying the details of the appeal process is outside the purview of this article, but we suggest that the process should be inclusive to all affected subjects, and have a minimal burden of entry so that members of marginalized communities could reasonably be expected to go through with the process.

Undoubtedly, this disclosure process would cause friction in the adoption of AI applications by the public. Affected individuals would likely challenge decisions, ask questions, and make appeals more frequently when the presence of AI in a decision-making process is publicized. This should be viewed as a positive long-term effect, rather than as a barrier to innovation and progress: overlooked stakeholders in these systems will be able to engage in a dialogue with the developers, clients, and institutions. Because the long-term health of AI-enabled technologies will ultimately depend on the public's trust and acceptance of these systems, AI developers should be responsible for winning the confidence of both regulating bodies and the general public regarding the efficacy, safety, and fairness of their systems. This will enable a more iterative and democratic process in AI development and deployment.

Validated Scope

Clearly defining the scope of autonomous decision-making systems protects subjects against spurious claims made by developers and clients, and protects developers against misapplication or misuse of their systems by clients. In addition, clearly-defined scope would help narrow the discovery phase, as well as reduce the decision-making burden, of legal cases, audits, and appeals of systems.

In life-altering application domains, care must be taken to ensure that automated systems do not introduce systematic biases and harmful side effects. Medical devices and drugs undergo heavy regulation to prevent unverified claims; why should life-altering autonomous decision-making systems not bear a similar burden? People should not be guinea pigs for algorithms deployed untested in real-world settings, no matter how promising the application, or how well the technology has worked elsewhere. While AI systems should not be limited to a single scope, nor prevented from being applied outside of their originally-intended context, a system's efficacy ought to be re-validated in each new deployment context.


Requiring a validated scope for autonomous Without a clearly-defined scope, clients with- decision-making systems would place the bur- out AI expertise may misunderstand how a de- den of proof of their system’s efficacy on de- veloper’s system is intended to be used, or velopers wishing to enter the market, and on may misinterpret the AI’s decision. For exam- clients wishing to transfer an existing system ple, the Northpointe COMPAS recidivism algo- to a new market, rather than on wronged sub- rithm was originally designed to inform treat- jects trying to prove out-of-scope use. Given ment, but, unbeknownst to the developer, was the power that decision-making systems can being used for sentencing: in Wisconsin, a have over livelihoods, providing evidence that judge overturned a plea deal after viewing a the system is functioning as intended seems defendant’s COMPAS risk score (Angwin & obvious, yet stories continue to emerge of Larson, 2016). Whether due to fragility, un- developers and clients rushing to deploy bi- stated limitations, or the nature of the data, ased or largely-untested decision-making sys- decision-making systems may work in one sit- tems. ProPublica’s investigation into North- uation but not another, yet clients who fail to pointe’s COMPAS algorithm found that it was understand these issues may take systems only about 20% accurate at identifying vio- which are fair, unbiased, and valid in one con- lent reoffenders; the figure for non-violent reof- text, and unwittingly deploy them in another fenders was just 61%. Worse, the system was context which violates one of these conditions. twice as likely to rule unfavourably for black of- A clearly-defined scope can also help protect fenders than white offenders (Angwin & Lar- developers from charges of misuse of per- son, 2016). sonal data. Regulation surrounding the use of these data are tightening, as seen with the Additionally, some recent systems are de- European Union’s (EU’s) new General Data ployed based on spurious, pseudoscien- Protection Regulation (GDPR, 2016). A devel- tific claims, such as Predictim’s rating oper that collected data under a certain pre- potential babysitters for “disrespectful at- tense to build a model with a specific purpose titude” (Merchant, 2018), or companies may be at least partially liable to violations of claiming to determine personality traits and that pretense if a client uses the model for even “criminality” from facial features (Storm, another purpose. However, if developers ex- 2016)(y Arcas, Mitchell, & Todorov, 2017). plicate the purpose of their system ahead of Requiring systems to validate the scope of time, the liability for misapplication rests en- their application before deployment causes tirely with the clients. such premises to fall apart. Attempting to de- fine a metric by which to evaluate disrespect- fulness may reveal inherent biases, which the Proposed Framework AI will subsequently learn. Such applications, which are designed to prey on fear rather than To bring these requirements together in a way provide a truly beneficial service, will be forced that places the onus on developers and clients to either move towards evidence-based vali- to jointly create fair, unbiased, accountable dation and metrics, or to explicitly state that systems, we propose a formal regulation sys- their system is an elaborate placebo with no tem that requires a pair of documents to be real predictive power, which will lessen their filed for any autonomous decision-making sys- appeal. 
tem to be deployed in a high-impact applica- tion domain. Evaluating the use cases under which an AI The first document, referred to as an AI Valida- system performs well will also bring to light tion Document (AIVD), is developer-filed and its limitations. Though popular culture pushes concerns transparency and validated scope: the narrative that AI is rapidly approaching it defines one or more contexts for which the general intelligence, the current reality is that system was developed, outlines the claims AI achieves high performance only at very nar- made with respect to the system in each con- row tasks, and does not yet generalize to other text, demonstrate how these claims were val- tasks. This leads to fragile, brittle systems, idated, and explains how the system’s deci- as demonstrated by the effectiveness of ad- sions can be interpreted in the case of an au- versarial attacks (Szegedy et al., 2013). dit. The second document, referred to as a

17 AI MATTERS, VOLUME 5, ISSUE 2 5(2) 2019

The second document, referred to as a Deployment Disclosure Document (DDD), is client-filed and concerns disclosure and auditability: it identifies the process for deploying a developer's AI in validated contexts and the specific terms surrounding disclosure and appeal of the AI's decisions. If this deployment is legally challenged (e.g. through the appeal of a decision that is alleged to be biased or unfair), these documents can help determine fault, if any. Though they would be filed separately, requiring both documents incentivizes developers and clients to communicate carefully regarding the accountability surrounding any particular deployment, and to understand both the technical workings of the AI, as well as the domain-specific policies and challenges related to the deployment context.

Specifically, an AIVD consists of defining:

• the intended purpose of the system, and
• the conditions under which it can be safely used for the intended purpose
• how those use cases have been tested and what metrics have been used to validate them, including how the results have been checked for bias
• known limitations of those use cases
• a description of the data used to train the system, including when, where, and how it has been sourced
• measures undertaken to track the model's development, including all versions of the model that may be deployed, and the specifics of how the data was used to train and validate each version
• how results of the system may be interpreted

This avoids problems such as Northpointe's COMPAS having an unvetted bias against black offenders, as such bias would need to be identified in COMPAS's AIVD, which would result in an undeployable system.

This document would have to be filed first, to establish the validity of the application in at least one context. In addition, an AIVD could be amended over time to accommodate new use cases whenever the developer can provide sufficient evidence supporting an extension of scope.

Once an AIVD has been approved, clients wishing to use one or more developer applications as part of a larger process or system would need to file a DDD containing:

• the purpose of the decision-making process to be created
• the reference number of the AIVD(s) being deployed
• the specific context under which each referenced system will be deployed within the larger application
• the process for disclosing the use of AI to subjects of the decision-making process
• the process by which a subject can investigate or appeal a decision
• the process by which the system's decisions will be traced and linked to the individual models involved in the decision-making process, including the assignment of responsibility between interacting models, or models making recommendations to a human reviewer

This avoids problems such as Northpointe's COMPAS being used by courts in sentencing, which was never intended by Northpointe and thus would not have been in COMPAS's AIVD. Any court trying to misapply COMPAS in this way would have its DDD application denied.

One AIVD could be associated with many different DDDs, across a wide variety of countries and contexts, provided the processes described in the DDD fall within the validated scope outlined in the AIVD. This incentivizes developers to create large, reliable, and robust AI systems, since the marginal cost of an additional deployment is small compared to the initial cost of developing the system. This incentive also suggests that an international treaty or body setting the standards for evaluating these systems and their deployment is favourable, as it affords economic benefits to filing in member nations, akin to the system of international agreements surrounding intellectual property.
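To make the pairing of the two filings concrete, below is a minimal sketch of how the AIVD and DDD contents listed above might be captured as structured records, together with the scope check the framework implies. The field names (e.g. validated_contexts, aivd_refs) and the Python representation are illustrative assumptions, not a schema prescribed by the framework.

# Illustrative sketch only: one possible way to represent AIVD and DDD filings
# as structured records. All field names are assumptions, not a prescribed schema.
from dataclasses import dataclass
from typing import List

@dataclass
class AIVD:
    """Developer-filed AI Validation Document (simplified)."""
    reference_number: str
    intended_purpose: str
    validated_contexts: List[str]   # contexts in which the claims were validated
    known_limitations: List[str]
    training_data_description: str
    model_versions: List[str]
    interpretation_guidance: str

@dataclass
class DDD:
    """Client-filed Deployment Disclosure Document (simplified)."""
    process_purpose: str
    aivd_refs: List[str]            # reference numbers of the AIVD(s) being deployed
    deployment_context: str
    disclosure_process: str
    appeal_process: str
    tracing_process: str

def within_validated_scope(ddd: DDD, approved_aivds: List[AIVD]) -> bool:
    """A DDD is only approvable if every referenced AIVD exists and the
    deployment context falls inside its validated scope (cf. the COMPAS example)."""
    referenced = [a for a in approved_aivds if a.reference_number in ddd.aivd_refs]
    if len(referenced) != len(ddd.aivd_refs):
        return False  # references an AIVD that was never filed or approved
    return all(ddd.deployment_context in a.validated_contexts for a in referenced)

Under a sketch like this, a court filing a DDD to use a pretrial risk score at sentencing would be rejected simply because sentencing does not appear among the referenced AIVD's validated contexts.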

However, international cooperation is not integral to the framework. Instead, the AIVD and DDD can be filed on a national basis, with a national (or regional, as in the EU) body of experts reviewing AIVD and DDD applications. Approving AIVDs requires significant expertise in various domains, including AI, and AIVDs will be subject to more scrutiny and longer evaluation times. In contrast, smaller groups of legal and domain experts are sufficient to evaluate DDDs in a timely fashion.

In the spirit of promoting transparency and accountability throughout, both documents should contain highly technical, legal sections for expert review and simple, nontechnical sections for subject review. In other words, the documents themselves must be transparent. This helps prevent the common scenario in which users blindly agree to certain documents, such as software end-user license agreements, because they are too complicated and lengthy to be read and understood.

The introduction of these documents raises an apparent concern about stifling innovation. As such, we propose that decision-making systems deployed in lower- or questionable-impact domains would not be required to file these documents, though developers and clients using AI in these areas would still be incentivized to file for the additional legal protection. This raises an issue regarding the definition of the impact level of a system. After all, seemingly benign systems, such as recommender systems on social media, can potentially have a huge impact on the general public's views and perception of reality (Mozur, 2018), even though this impact may be difficult to quantify, especially on the individual level. Still, we believe requiring AIVDs and DDDs only for high-impact domains achieves the best tradeoff between innovation and responsible development. In particular, it helps mitigate any potential stifling of innovation in the startup sphere—where a small team may not have the necessary data accessibility or legal expertise required to file AIVDs and DDDs—unless the startup is innovating in a clearly high-impact domain, in which case provisions for dealing with accountability ought to factor into their business model from the start. AI researchers would also be exempt from filing these documents, regardless of the impact level of their research, as the use of AI in this context is accounted for through the proper informed consent of research subjects, as well as the research ethics approval process.

Outside of a research setting, certain clients may need to engage in pilot projects before formally deploying a decision-making system. In this case, a precursor to a DDD is to be filed, which describes the attempted purpose, a timeline for the completion of the pilot phase of the project (after which a full DDD must be filed), and the process for disclosure. Disclosure in pilot projects must additionally include provisions for soliciting and incorporating feedback and concerns from subjects. This ensures that experimental or fringe ideas involve a significant amount of shared decision-making between clients and subjects.

Benefits of a Two-Phase Framework

We argue that this proposal encourages developers to share knowledge with clients, and to engage with the consequences of their systems, while also ensuring clients have a framework with which to evaluate and question developers regarding any planned deployment. AI expertise is scarce compared to demand (Perry, 2018), and many clients outside of the information technology domain—which may include institutions such as hospitals, justice systems, and civic organizations—may not be able to acquire or retain such expertise. Regulation in the form of a DDD will push clients to choose vetted developers, who have filed AIVDs, instead of those making baseless claims. Hype around the power of big data and machine learning may engender blind trust in non-expert clients; this document system may raise awareness regarding the efficacy of AI systems, as clients must evaluate the AIVDs of developers to file their own DDD. Both clients and policymakers would be encouraged to see AI systems as procedures that may have flaws and should therefore be examined and challenged. As AI failures become more pronounced in the public eye, developers pushing AI without an AIVD, even in a lower-impact domain, may be seen as a liability, encouraging more thoughtfully-developed and carefully-tested AI.

Under this document system, profit-driven developers have an incentive to help non-expert clients define their disclosure, tracing and appeal processes for each of their AIVDs, in order to quickly file DDDs for as many clients as possible.


In addition, developers wanting to work with clients outside the scope of their AIVD would have to devise mechanisms for efficiently validating these new use cases, leading to much-needed innovation in this area, while simultaneously promoting accountability. This also incentivizes developers to concern themselves with the real-world implications of their systems on subjects. Essentially, both developers and clients are incentivized to proactively establish fair and accountable applications of AI systems, instead of considering it an afterthought until bad press comes to light.

Formally-defined accountability documents that are widely recognized also have the potential to raise public awareness around the technical, legal and ethical issues of decision-making AI. Laypeople, who will likely be the subjects of decision-making systems, need to be informed about how these systems may be behaving (or misbehaving) and about their rights to appeal automated decisions. Much in the same way that common knowledge of copyright and patents brings awareness to issues of intellectual property, AIVDs and DDDs being commonly recognized as a crucial step of AI deployment will help subjects be aware of the requirement that these systems disclose the use of AI, are transparent, and are validated. This, in turn, will allow subjects to openly question institutions deploying AI, to be more involved in the development of automated processes, and to hold these institutions accountable.

Conclusion

As decision-making AI systems are becoming ubiquitous, concerns surrounding bias, fairness, and accountability are mounting. Transparency in these systems is critical in reducing their potential harm. However, transparency alone is not actionable without additional requirements to close the gap between developers, clients, policymakers, and the public, since developing and deploying fair and accountable AI systems requires increased awareness of the strengths and limitations of automated decision-making. For instance, clients may be unaware of the limitations of these systems, or of what constitutes fair and ethical AI. In contrast, developers may not understand how to create effective policy surrounding their systems' use, or how to deploy them safely in novel and untested contexts.

Introducing disclosure and scope into the AI system's development and deployment process not only encourages proactive collaboration between both parties, but also ensures subjects are made aware of, and can appeal, the system's decisions. Unlike transparency, where many open problems remain, disclosure and scope are primarily policy-based, and can thus be feasibly implemented in all AI systems. These requirements serve not only to render transparency algorithms actionable once properly developed, but also to educate all stakeholders about the potential and the pitfalls surrounding AI systems.

We proposed a regulatory framework for AI systems that formalizes the disclosure and scope requirements in the form of two documents, AIVDs and DDDs. The double-document structure of this framework incentivizes a more collaborative AI ecosystem, as well as ensuring that fairness and validity are not afterthoughts in high-impact AI systems. Fostering this dialogue between all stakeholders in a decision-making process spurs innovation, and will help developers, clients, policymakers, and the public realize the potential that effective, fair, and accountable automated systems have.

References

Angwin, J., & Larson, J. (2016, May). Machine Bias. Retrieved 2019-01-09, from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Diallo, I. (2018, June). The Machine Fired Me. Retrieved 2019-01-02, from https://idiallo.com/blog/when-a-machine-fired-me

Freedland, J. (2018, April). Zuckerberg got off lightly. Why are politicians so bad at asking questions? The Guardian. Retrieved 2019-02-16, from https://www.theguardian.com/commentisfree/2018/apr/11/mark-zuckerberg-facebook-congress-senate


GDPR. (2016, May). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC. Official Journal of the European Union, L119, 1–88.

Good, B. J. (1993). Medicine, rationality and experience: An anthropological perspective. Cambridge University Press.

Kang, C., Kaplan, T., & Fandos, N. (2018, April). Knowledge Gap Hinders Ability of Congress to Regulate Silicon Valley. The New York Times. Retrieved 2019-02-16, from https://www.nytimes.com/2018/04/12/business/congress-facebook-regulation.html

Merchant, B. (2018, December). Predictim Claims Its AI Can Flag 'Risky' Babysitters. So I Tried It on the People Who Watch My Kids. Retrieved 2018-12-27, from https://gizmodo.com/predictim-claims-its-ai-can-flag-risky-babysitters-so-1830913997

Miller, T., Howe, P., & Sonenberg, L. (2017, December). Explainable AI: Beware of Inmates Running the Asylum Or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences. arXiv:1712.00547 [cs]. Retrieved 2019-01-08, from http://arxiv.org/abs/1712.00547

Mozur, P. (2018, October). A Genocide Incited on Facebook, With Posts From Myanmar's Military. The New York Times. Retrieved 2019-01-11, from https://www.nytimes.com/2018/10/15/technology/myanmar-facebook-genocide.html

Perry, T. S. (2018, September). Intel Execs Address the AI Talent Shortage, AI Education, and the "Cool" Factor. Retrieved 2019-01-10, from https://spectrum.ieee.org/view-from-the-valley/robotics/artificial-intelligence/intel-execs-address-the-ai-talent-shortage-ai-education-and-the-cool-factor

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16 (pp. 1135–1144). California, USA: ACM Press. Retrieved 2019-01-08, from http://dl.acm.org/citation.cfm?doid=2939672.2939778 doi: 10.1145/2939672.2939778

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503–2511).

Smith, A. (2018, November). Public Attitudes Toward Computer Algorithms. Retrieved 2019-01-05, from http://www.pewinternet.org/2018/11/16/attitudes-toward-algorithmic-decision-making/

Soon, C. S., Brass, M., Heinze, H.-J., & Haynes, J.-D. (2008, May). Unconscious determinants of free decisions in the human brain. Nature Neuroscience, 11(5), 543–545. Retrieved 2019-02-16, from http://www.nature.com/articles/nn.2112 doi: 10.1038/nn.2112

Storm, D. (2016, May). Faception can allegedly tell if you're a terrorist just by analyzing your face. Retrieved 2019-01-01, from https://www.computerworld.com/article/3075339/security/faception-can-allegedly-tell-if-youre-a-terrorist-just-by-analyzing-your-face.html

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv:1312.6199 [cs]. Retrieved 2019-01-08, from http://arxiv.org/abs/1312.6199

Tversky, A., & Kahneman, D. (1981, January). The framing of decisions and the psychology of choice. Science, 211(4481), 453–458. Retrieved 2019-02-16, from http://www.sciencemag.org/cgi/doi/10.1126/science.7455683 doi: 10.1126/science.7455683


Whittaker, M., Crawford, K., Dobbe, R., Fried, G., Mathur, V., West, S. M., ... Schwartz, O. (2018, December). AI Now Report 2018 (Tech. Rep.). AI Now Institute, New York University. Retrieved from https://ainowinstitute.org/AI_Now_2018_Report.pdf

y Arcas, B. A., Mitchell, M., & Todorov, A. (2017, May). Physiognomy's New Clothes. Retrieved 2019-01-07, from https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a

Janelle Berscheid is a Master's student in the Computational Epidemiology and Public Health Informatics Lab (CEPHIL) at the University of Saskatchewan. Her current research focuses on health-related applications of data science and machine learning, and how these technologies can impact healthcare delivery.

Francois Roewer-Despres is a Master's student at the University of Toronto and a member of the Vector Institute for Artificial Intelligence. His research interests broadly cover Human-AI Interaction, including accountability and safety, autonomous social intelligence, dialogue systems, and reinforcement learning.


Context-conscious fairness in using machine learning to make decisions
Michelle Seng Ah Lee (University of Oxford; [email protected])
DOI: 10.1145/3340470.3340477

Abstract

The increasing adoption of machine learning to inform decisions in employment, pricing, and criminal justice has raised concerns that algorithms may perpetuate historical and societal discrimination. Academics have responded by introducing numerous definitions of "fairness" with corresponding mathematical formalisations, proposed as one-size-fits-all, universal conditions. This paper will explore three of the definitions and demonstrate their embedded ethical values and contextual limitations, using credit risk evaluation as an example use case. I will propose a new approach - context-conscious fairness - that takes into account two main trade-offs: between aggregate benefit and inequity, and between accuracy and interpretability. Fairness is not a notion with absolute and binary measurement; the target outcomes and their trade-offs must be specified with respect to the relevant domain context.

Introduction

Machine learning (ML) is increasingly used to make important decisions, from hiring to sentencing, from insurance pricing to providing loans. There has been an explosion of publications of new guidelines and principles of Artificial Intelligence (AI) from governments (Holden & Smith, 2016), international organisations (European Commission AI HLEG, 2019), professional associations (IEEE Global Initiative, 2018), and academic institutions (Grosz, 2015). Each of these documents references the imperative for AI to treat people fairly by avoiding discrimination based on legally protected characteristics, such as race, gender, and sexual orientation. However, none of them specifies a framework or methodology to detect and correct unfair outcomes, especially given the history of discrimination in our systems and societies that predates the introduction of algorithmic decision-making. Given a bias, people-based processes may arrive at different decisions. AI, by contrast, can replicate an identical bias at scale, crystallising the bias and removing the outcome ambiguity associated with human decision-making. This is especially concerning in domain areas with documented historical discrimination, as AI can exacerbate any underlying societal problems and inequalities. For example, discriminatory lending has been a contentious problem. In the United States, 'redlining,' the risk inflation of minority-occupied neighbourhoods, impeded African Americans from obtaining mortgages (Nelson, Winling, Marciano, & Connolly, 2016). Analysis of 2001-2009 UK consumer credit data on 58,642 households found that non-white households are less likely to have financing (Deku, Kara, & Molyneux, 2016). To counteract this, new techniques have been introduced for pre-processing (purging the data of bias prior to training the algorithm) and for post-processing (bias correction in predictions after the algorithm is built). In both cases, solutions are based on a one-size-fits-all condition of fairness, regardless of the context.

There is an overall consensus that AI systems and technologies should be required to make fair decisions, yet the binary categorisation of an algorithm as either fair or unfair belies the underlying complexities of each use case. In the next section, I will use discriminatory lending as an example to demonstrate the shortcomings in existing attempts at formalising fairness in machine learning predictions. Different issues may arise in other domains, such as employment, pricing, or criminal justice; however, the focus on one use case will help bring to light the real-life considerations of each fairness definition. In Section 3, I will propose a new approach focusing on quantifying the contextual trade-offs of each algorithm so that decision-makers can select a model that best captures the risk profile of each specific case.

Copyright © 2019 by the author(s).


Ethics of fairness definitions and their limitations

If the outcome disparity in loan decisions is simply a factor of significant features (e.g., difference in income), arguably the variance in outcome may be consistent with the underlying distributions. Given that minority borrowers are indeed riskier, from an expected consequentialist perspective, overall societal welfare would increase with a more accurate prediction of default risk. If we subscribe to this, the machine learning model is fair insofar as it most accurately predicts risk and gives each individual the loan and the interest rate that he or she deserves, by foreseeing the consequences of a loan approval. However, the outcome disparity could be a reflection of inequality in other markets (e.g. in labour markets), which makes minority incomes more volatile. Given the evidence in past literature of historical discrimination based on race and gender in mortgage lending decisions in the US (Kendig, 1973), a machine learning algorithm can self-perpetuate a reproduction of past inequalities, inaccurately inflating the risk of minority borrowers. The extent of this inaccuracy is difficult to determine; it is impossible to know whether those who were denied a loan would have defaulted. The scarcity and potential bias of historical data on previously financially marginalised groups hinders the statistical modelling of their credit-worthiness. If historical evaluations of minority credit risk are inaccurate, a machine learning model trained on a biased data set necessarily produces a sub-optimal result. I will discuss three of the main approaches that seek to define whether or not an outcome is unfair.

Demographic parity

A strictly egalitarian approach mandates an equal outcome for each racial group. A related statistical definition is demographic parity, a population-level metric that requires the outcome to be independent of the protected attribute. Formally, with Ŷ as the predicted outcome and A as a binary protected attribute, we have:

P(Ŷ | A = 0) = P(Ŷ | A = 1)    (1)

While this metric would ensure an equally proportional outcome independent of race, it is ineffective where disproportionality in outcomes can be justified by non-protected, non-proxy attributes, as this can lead to reverse discrimination and inaccurate predictions (Gajane, 2017). Fuster et al. have shown that there is a difference in income distributions between racial groups (Fuster, Goldsmith-Pinkham, Ramadorai, & Walther, 2017). As income has a reasonably inferential relationship to credit risk, it cannot be considered as a simple proxy attribute of protected characteristics. In short, implementing this measure would unfairly give credit to minorities with low income.

Equalised opportunity

According to a Rawlsian approach to distributive justice, regardless of whether policymakers have the moral imperative to correct past wrongs of systematic discrimination, it is important to prioritise the protection of the most vulnerable of the population. The Max-Min social welfare function maximises the welfare of those who are worst-off (Rawls, 1971). In contrast to equal outcome, Rawls' Difference Principle asserts the need for equality in opportunity.

Hardt, Srebro and Price proposed a statistical metric of "equalised opportunity" that focuses on the true positives: given a positive outcome, the prediction is independent of the protected attribute. Formally, again with Ŷ as the predicted outcome, A as a binary protected attribute, and Y as the actual outcome, we have:

P(Ŷ = 1 | A = 0, Y = 1) = P(Ŷ = 1 | A = 1, Y = 1)    (2)

Unfortunately, the equalised opportunity metric is also problematic because it fails to address discrimination that may already be embedded in the data (Gajane, 2017). Gender and race are outside of one's control, and following the logic of Dworkin's theory of Resource Egalitarianism, no one should end up worse off due to bad luck; rather, people should be given differentiated economic benefits as a result of their own choices (Dworkin, 1981). The credit risk market does not exist in a vacuum; while people can affect their credit scores and income to a certain extent, e.g. by building their credit history or improving employable skills, it is impossible to isolate such variables from the impact of their upbringing, discrimination in other markets, and historical inequalities entrenched in the data.
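As a concrete illustration of how conditions (1) and (2) can be checked against a model's outputs, the short sketch below estimates both gaps from binary loan-approval predictions. The NumPy-based helper functions, variable names, and toy numbers are assumptions for illustration only, not part of the paper's formal proposal.

# Minimal sketch: empirical checks of demographic parity (Eq. 1) and
# equalised opportunity (Eq. 2) for a binary protected attribute A.
import numpy as np

def demographic_parity_gap(y_pred, a):
    """|P(Yhat = 1 | A = 0) - P(Yhat = 1 | A = 1)| for binary predictions."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return abs(y_pred[a == 0].mean() - y_pred[a == 1].mean())

def equalised_opportunity_gap(y_pred, y_true, a):
    """|P(Yhat = 1 | A = 0, Y = 1) - P(Yhat = 1 | A = 1, Y = 1)|,
    i.e. the gap in true positive rates between the two groups."""
    y_pred, y_true, a = map(np.asarray, (y_pred, y_true, a))
    tpr = lambda g: y_pred[(a == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

# Toy example (illustrative numbers only):
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = loan approved by the model
y_true = [1, 0, 1, 0, 1, 1, 0, 1]   # 1 = borrower actually repaid
a      = [0, 0, 0, 0, 1, 1, 1, 1]   # binary protected attribute
print(demographic_parity_gap(y_pred, a))             # 0.5
print(equalised_opportunity_gap(y_pred, y_true, a))  # about 0.67

A gap of zero corresponds to the equalities in equations (1) and (2); as the discussion above argues, driving either gap to zero is itself a choice with embedded ethical assumptions rather than a neutral correction.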

Counterfactual fairness

A challenge to the previous two fairness metrics is that they do not take seriously the notion of causality. Statistical relationships are derived from correlations rather than an inferential and directed model. David Lewis introduced the counterfactual approach of theorising on cause and effect (Lewis, 1973). Counterfactual fairness, one of the most recently introduced definitions, attempts to apply this theory to assert that a decision is fair if it is the same in the actual world as it would be in a counterfactual world where the individual belonged to a different demographic group (Kusner, Loftus, Russell, & Silva, 2017). Given a causal model with latent background variables U, non-proxy non-protected features X, the protected attribute A, Ŷ as the predicted outcome, and Y as the actual outcome, we have:

P(Ŷ_{A←a}(U) = y | X = x, A = a) = P(Ŷ_{A←a′}(U) = y | X = x, A = a)    (3)

However, this approach requires a formalisation of causal models, which are useful abstractions but often infeasible without strong assumptions, simplifications, and robust constraints. In reality, relationships between variables are intertwined, complex, or not fully understood. For example, how would we isolate the impact of one's race on credit-worthiness from the impact of one's education, job, neighbourhood, and other features of one's upbringing? There may also be challenges of reverse causality: race may impact one of the non-protected predictors, e.g. if employers are discriminatory, so that an applicant's low income ends up hindering both the probability of loan approval and his or her ability to repay a loan.

Implementing a "fair" credit risk algorithm is difficult not just because there are competing theoretical approaches to fairness, but because they all assume that there is a one-size-fits-all definition of fairness. What seems to be required is a new approach that builds on the previous ones but makes the most of the context-dependency of the data available and hence of the relational nature of fairness: (a) is fair - not absolutely - but in relation to (b). I introduce a possible proposal in the next section.

Proposed solutions

While the implications of unequal outcome in mortgage lending are troubling, the revelation is also an opportunity for policymakers to redefine anti-discrimination policies. Consider the alternatives to a machine learning model. Human-driven decision-making is mired in cognitive biases, as shown by a field experiment on discrimination in the labour market, in which white-sounding names received 50% more callbacks than black-sounding names (Bertrand & Mullainathan, 2003). Traditional credit risk models, e.g. a scorecard, are also not exempt from the selection and counterfactual biases of a machine-learning model. The accusation of discrimination in mortgage lending long predates the introduction of machine learning; bias can be embedded in a process (Kendig, 1973).

In reality, the alternative to a machine learning model may be a worse model. Therefore, I propose a benchmarking exercise to map the change in the two trade-offs while increasing the model complexity. While no model, pre-processed or post-processed, may eliminate all the discriminatory impact on minority groups, quantifying the benefits and risks would give actionable insights to the decision-maker on which model best reflects his or her values and risk profile. Algorithmic processing of credit worthiness would not translate into automatic decision making but would empower a better evaluation of each case under scrutiny.

The shift of focus from discriminatory intent to discriminatory impact in scholarship is mirrored in legal rulings. The "Fairness through Unawareness" approach - not using the protected characteristic - is invalidated with ML models that can triangulate protected information.¹ Even if a lender is not explicitly considering race in calculating credit risk, a machine learning model may nonetheless incorporate racial information in its prediction.

¹ See (Dwork, Hardt, Pitassi, Reingold, & Zemel, 2011) for an extensive critique.

Given legal decisions in the UK (Lowenthal, 2017) and in the US (Baum & Stute, 2015), discriminatory impact is illegal regardless of intent. Both courts emphasise a contextual exception: disparate impact is allowed if it is crucial to a legitimate business requirement. One such requirement may be the need to accurately predict default to allow for greater financial inclusion, which - given discriminatory history - may be at the expense of some minority groups.

Trade-off 1: Aggregate benefit vs. individual inequity

Credit market welfare can be measured by the consumer surplus generated. Automated underwriting improves the overall accuracy of default predictions (Hacker & Wiedemann, 2017). This increases the market information available to the lenders and lowers their overall risk, empowering them to give credit to those who were previously unable to access it. The resulting increase in financial inclusion adds to the overall consumer surplus.

On an individual level, however, there are winners and losers: the borrowers who would not have received a loan under traditional technology who now successfully secure one; and the potential borrowers who would have received a loan before but are deemed too high risk under the new technology. A troubling finding of Fuster et al. is that the losers are disproportionately from minority racial groups, likely as a reflection of historical discrimination that is inaccurately inflating the risk of minority borrowers (Fuster et al., 2017). Meanwhile, the various attempts at an active correction of this "unfairness", synthesised in Section 2, have only obfuscated the criteria for fairness, as they do not provide a satisfactory framework to determine which definition is most appropriate for a given use case. Mapping this trade-off between aggregate benefit (e.g. increased financial inclusion) and inequity (e.g. negative impact on minority borrowers) would allow the lender to identify the most appropriate algorithm to model default risk.

Figure 1 is an indicative visualisation of this trade-off. There are two assumptions that should be tested: 1) aggregate benefit improves with the complexity of the model, and 2) the increase in benefit is at the expense of the welfare of the protected group. The first assumption is reasonable for most complex models in which an implementation of a machine learning algorithm is being considered. If the true relationship is linear, then the out-of-sample accuracy would decrease with additional complexity, but this will be rare in many real-life modelling scenarios. The second assumption would depend on the joint distribution of the protected characteristic and the outcome being predicted. In many cases, the risk of discrimination exists because of the statistical disparity in outcomes between groups.

Figure 1: Benchmarking models for trade-off in aggregate group benefit vs. the scale of negative impact on protected groups

Rather than enforcing a single definition of fairness, this allows a context-conscious analysis of which model best serves the customers or society. This exercise should give a decision-maker an insight into the model and the trade-off with which he or she is the most comfortable. Once these trade-offs are recognised and quantified, it is important to consider whether the inter-group outcome disparity is fair and justified.
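One way to operationalise the benchmarking exercise and test the two assumptions above is sketched below: fit candidate models of increasing complexity, then record an aggregate-benefit proxy next to a group-disparity proxy for each. The scikit-learn models, the synthetic data, and the specific proxies (out-of-sample accuracy and the approval-rate gap) are assumptions chosen for illustration; the paper does not prescribe these particular measures.

# Illustrative sketch of the benchmarking exercise: map aggregate benefit vs.
# inter-group disparity while increasing model complexity (assumed proxies).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

def benchmark(X, y, a, models):
    X_tr, X_te, y_tr, y_te, a_tr, a_te = train_test_split(
        X, y, a, test_size=0.3, random_state=0)
    rows = []
    for name, model in models:
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        benefit = (pred == y_te).mean()                  # proxy: out-of-sample accuracy
        disparity = abs(pred[a_te == 0].mean()           # proxy: approval-rate gap
                        - pred[a_te == 1].mean())
        rows.append((name, benefit, disparity))
    return rows

# Toy synthetic data; in practice X, y, a would come from the lender's portfolio.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
a = rng.integers(0, 2, size=1000)
y = (X[:, 0] + 0.5 * a + rng.normal(scale=0.5, size=1000) > 0).astype(int)

models = [("logistic", LogisticRegression()),
          ("random_forest", RandomForestClassifier(n_estimators=200, random_state=0)),
          ("boosting", GradientBoostingClassifier(random_state=0))]
for name, benefit, disparity in benchmark(X, y, a, models):
    print(f"{name:14s} benefit={benefit:.3f} disparity={disparity:.3f}")

The resulting table is a tabular analogue of Figure 1: the decision-maker can see how much aggregate benefit each step up in model complexity buys, and at what cost in inter-group disparity.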


Trade-off 2: Accuracy vs. interpretability

Historically, lenders have focused on data that theoretically relate to the outcome; for example, the debt-to-income ratio and past payments are indicators of default risk. With Big Data analytics, firms are beginning to incorporate non-traditional data types into their algorithms as proxies of risk. An extreme case is the use of Chinese citizens' Internet browsing history, location, and payment data to calculate credit risk (Koren, 2016). While the justification is to open up access to credit to those without credit history, this data use risks unfairly marginalising people based on their background, in uninterpretable and opaque ways. Previously, a prospective borrower could build his or her credit history by paying bills on time and expect a positive impact on the probability of loan approval. With Big Data and machine learning, this expectation may be unfounded.

A decision-maker must justify whether the outcome disparity is due to a legitimate and non-discriminatory difference in the outcome distributions. Figure 2 visualizes a possible decision boundary for whether or not an input variable should be used in a model, based on its role as a potential proxy for a protected characteristic. Given Supreme Court decisions in the UK (Lowenthal, 2017) and the US (Baum & Stute, 2015), even if a variable is correlated to a protected feature, there may be reasonable grounds to use it if the differences are crucial to a legitimate business requirement.

Figure 2: Decision boundary for acceptable use of a feature

This decision boundary may shift depending on the context. The drivers of decision-making in providing essential products, such as a current account, car insurance, or a mortgage, may be subject to higher scrutiny than the rationale for offering premium credit cards. This ensures that the decision to include features correlated to a protected characteristic is carefully considered within the context of the regulated domain and the potential impact on the customers.

Policymakers should consider this trade-off between accuracy and interpretability to limit what data can be used and what models can be built. For the provision of essential products, e.g. current accounts, criminal justice decisions, and car insurance, that have a significant impact on people's lives, policymakers may need to ensure that all features included in the model have a strong inferential relationship with the outcome rather than simply a predictive correlation.

Conclusion

The introduction of ML into lending models presents a new opportunity for greater accuracy in predicting default and, therefore, greater financial inclusion. However, the prevalence of ML models in crucial decision-making processes has brought to light the intricate ways in which discrimination can be perpetuated through these technologies. Fairness is a complex feature of the world, and no single definition can mitigate the risk of unfair treatment.

At the same time, ML models are auditable for fairness. The opportunity lies in the computational and systematic decision-making of ML. A more context-conscious approach of benchmarking the trade-offs between aggregate benefit and individual inequity and between accuracy and interpretability would force the decision-maker to identify the model that is most suitable to each case. The risk of discrimination in ML is undeniable; however, it presents an opportunity to consider more closely what we value as a society and to pursue fair treatments and decisions that can be enforced in our markets.


References

Baum, D., & Stute, D. (2015). Supreme Court affirms FHA disparate impact claims.

Bertrand, M., & Mullainathan, S. (2003, July). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination (Working Paper No. 9873). National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w9873 doi: 10.3386/w9873

Deku, S. Y., Kara, A., & Molyneux, P. (2016). Access to consumer credit in the UK. The European Journal of Finance, 22(10), 941-964. Retrieved from https://doi.org/10.1080/1351847X.2015.1019641 doi: 10.1080/1351847X.2015.1019641

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. S. (2011). Fairness through awareness. CoRR, abs/1104.3913. Retrieved from http://arxiv.org/abs/1104.3913

Dworkin, R. (1981). What is equality? Part 1: Equality of welfare. Philosophy and Public Affairs, 10(3), 185–246.

European Commission AI HLEG. (2019). Draft AI ethics guidelines for trustworthy AI (Tech. Rep.). European Commission High-Level Expert Group on Artificial Intelligence (AI HLEG). Retrieved from https://ec.europa.eu/futurium/en/node/6044

Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., & Walther, A. (2017). Predictably unequal? The effects of machine learning on credit markets. SSRN Electronic Journal. doi: 10.2139/ssrn.3072038

Gajane, P. (2017). On formalizing fairness in prediction with machine learning. CoRR, abs/1710.03184. Retrieved from http://arxiv.org/abs/1710.03184

Grosz, B., et al. (2015). Artificial intelligence and life in 2030 (Tech. Rep.). Stanford University. Retrieved from https://ai100.stanford.edu/sites/default/files/ai_100_report_0831fnl.pdf

Hacker, P., & Wiedemann, E. (2017). A continuous framework for fairness. CoRR, abs/1712.07924. Retrieved from http://arxiv.org/abs/1712.07924

Holden, J., & Smith, M. (2016). Preparing for the future of AI (Tech. Rep.). U.S. Executive Office of the President, National Science and Technology Council Committee on Technology. Retrieved from https://obamawhitehouse.archives.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf

IEEE Global Initiative. (2018). Ethically aligned design: A vision for prioritizing human well-being with autonomous and intelligent systems. Retrieved from https://standards.ieee.org/content/dam/ieee-standards/standards/web/documents/other/ead_v2.pdf

Kendig, D. (1973). Discrimination against women in home mortgage financing. Yale Review of Law and Social Action, 3(2).

Koren, J. R. (2016, July). What does that web search say about your credit? Retrieved from https://www.latimes.com/business/la-fi-zestfinance-baidu-20160715-snap-story.html

Kusner, M. J., Loftus, J. R., Russell, C., & Silva, R. (2017, March). Counterfactual Fairness. arXiv e-prints, arXiv:1703.06856.

Lewis, D. (1973). Causation. Journal of Philosophy, 70(17), 556–567.

Lowenthal, T. (2017). Essop v Home Office: Proving indirect discrimination.

Nelson, R. K., Winling, L., Marciano, R., & Connolly, N. (2016). Mapping inequality. American Panorama. Retrieved from https://dsl.richmond.edu/panorama/redlining/#loc=5/36.721/-96.943&opacity=0.8&text=intro

Rawls, J. (1971). A theory of justice. Harvard University Press.


Michelle Seng Ah Lee ([email protected]) is an MSc candidate in Social Data Science at the Oxford Internet Institute and a Research Assistant at the Digital Ethics Lab, supervised by Professor Luciano Floridi. Her research focuses on fairness in machine learning algorithms and their trade-offs on aggregate and individual levels. She is also interested in the regulatory and ethical considerations of bias and discrimination in artificial intelligence (AI).


The Scales of (Algorithmic) Justice: Tradeoffs and Remedies
Matthew Sun (Stanford University; [email protected])
Marissa Gerchick (Stanford University; [email protected])
DOI: 10.1145/3340470.3340478

Introduction

Every day, governmental and federally funded agencies — including criminal courts, welfare agencies, and educational institutions — make decisions about resource allocation using automated decision-making tools (Lecher, 2018; Fishel, Flack, & DeMatteo, 2018). Important factors surrounding the use of these tools are embedded both in their design and in the policies and practices of the various agencies that implement them. As the use of such tools is becoming more common, a number of questions have arisen about whether using these tools is fair, or in some cases, even legal (K.W. v. Armstrong, 2015; ACLU, Outten & Golden LLP, and the Communications Workers of America, 2019).

In this paper, we explore the viability of potential legal challenges to the use of algorithmic decision-making tools by the government or federally funded agencies. First, we explore the use of risk assessments at the pre-trial stage in the American criminal justice system through the lens of equal protection law. Next, we explore the various requirements to mount a valid discrimination claim — and the ways in which the use of an algorithm might complicate those requirements — under Title VI of the Civil Rights Act of 1964. Finally, we suggest the adoption of policies and guidelines that may help these governmental and federally funded agencies mitigate the legal (and related social) concerns associated with using algorithms to aid decision-making. These policies draw on recent lawsuits relating to algorithms and policies enacted in the EU by the General Data Protection Regulation (GDPR) (2016).

Algorithms and Equal Protection

One case of algorithmic decision-making in the public domain that has been subjected to increased scrutiny in recent years is the use of risk assessments in the criminal justice system. Here, we focus on the use of criminal risk assessment at the pre-trial stage. The goal of risk assessment tools (RATs) at the pre-trial stage is typically to estimate a defendant's likelihood of engaging in a particular future action (for example, committing a new crime or failing to appear in court) based on their similarity to defendants who have committed those actions in the past (Summers & Willis, 2010). This similarity is typically determined using factors regarding a defendant's criminal history but may also include information about a defendant's personal and social history such as their age, housing and employment status, and in some cases, their gender (Summers & Willis, 2010; State v. Loomis, 2016). Risk assessments are not themselves decision-makers regarding detention; rather, they are tools used by a human decision-maker - typically a judge or magistrate (Desmarais & Lowder, 2019).

In this section, we explore legal challenges pertaining to risk assessments on the basis that their use, under some circumstances, may violate constitutional protections. In particular, the Fifth Amendment guarantees equal protection under due process of law and applies to the federal government (U.S. Const., amend. V.), while the Fourteenth Amendment guarantees equal protection and due process of law and applies to the states (U.S. Const., amend. XIV.). Our analysis focuses on the application of equal protection law to the use of algorithmic risk assessments. Specifically, we discuss policies around the use of gender and proxies for race in risk assessments and how each might interact with equal protection of the law.

When an individual or entity believes that their right to equal protection has been violated by a governmental policy - such as the use of a risk assessment algorithm at the pretrial stage - they may challenge such a policy by, first, proving that the policy does indeed discriminate in a way that is or was harmful to the individual (Legal Information Institute, 2018a).

Copyright © 2019 by the author(s).

The court evaluating the matter would then analyze the policy in question through one of four possible lenses - strict scrutiny, intermediate scrutiny, rational basis scrutiny, or a combination of the prior three, depending on the characteristic (race, national origin, gender, etc.) in question (Legal Information Institute, 2018a).

One such notable challenge, which we reference in the subsequent discussion, was State of Wisconsin v. Loomis (2016), in which Eric Loomis challenged the use of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment to inform a judge's decision about how long his prison sentence would be. Loomis challenged the use of COMPAS on the grounds that it violated his constitutional right to due process because the tool itself was proprietary (in particular, Loomis knew the factors used on the assessment but did not know how each of those factors was weighted and translated into a score, and thus could not challenge its scientific validity), and because the tool used gender as a factor in the assessment (State v. Loomis, 2016).

Factor 1: Use of gender

Though many risk assessments used at the pretrial stage in the United States do not include gender as a factor in the calculation of risk scores (Latessa, Smith, Lemke, Makarios, & Lowenkamp, 2009; VanNostrand et al., 2009), some pretrial risk assessments do consider gender, like COMPAS did in the case of Eric Loomis (State v. Loomis, 2016). Moreover, evidence indicates that risk assessments may not be equally predictive across genders, and may overestimate the recidivism risk of women compared to men (Skeem, Monahan, & Lowenkamp, 2016). Such evidence suggests the counterintuitive idea that including gender in the calculation of risk scores may be more equitable than excluding it. To illustrate the complexities of this point, we consider two hypothetical scenarios regarding risk assessments and gender.

Consider a hypothetical risk assessment X that includes gender in its calculation of risk scores; assume X has been challenged on the basis that its use of gender violates equal protection. Equal protection claims involving gender classifications are subject to intermediate scrutiny, a test established by the Supreme Court in Craig v. Boren (1976). To pass intermediate scrutiny, the policy in question must "advance an important government interest" by means that are "substantially related to that interest" (Legal Information Institute, 2018b; Craig v. Boren, 1976). The defendant (the jurisdiction that uses X to inform pretrial release decisions) might argue that, because judges rely on the accuracy of risk scores when making decisions about who to release and because these risk scores are meant to inform their decision-making, the use of gender in X advances an important government interest - ensuring public safety through release determinations. The defendant might also argue that, given the evidence on differential predictive power by gender, the use of gender is indeed a means that is "substantially related" to public safety.

In the case of Loomis, the court determined the use of gender was permissible because it improved accuracy, a non-discriminatory purpose (State v. Loomis, 2016). Yet some argue that such evidence regarding the differential predictive power by gender is too general. Legal scholar Sonja Starr has argued that because the Supreme Court has rejected the use of broad statistical generalizations about groups to justify discriminatory classifications, the use of gender in risk assessment (specifically at sentencing) is unconstitutional (Starr, 2014). In the case of X, the court would have to consider, given the relevant evidence, if it is actually the case that using gender as a factor is substantially related to public safety, weighing the tension between the group classifications in X and the principle of individualized decision-making in the criminal justice system.

Now consider risk assessment Y, a risk assessment that doesn't include gender in its calculation of risk scores, and suppose that a jurisdiction that uses Y has analyzed its own data and found that Y is better at predicting recidivism for men than it is at predicting recidivism for women. In this case, the policy in question is facially neutral (the use of Y doesn't appear to be discriminatory towards women and doesn't specifically include gender in its calculations), but nonetheless has a disparate impact because it rates women as higher risk than they actually are.

If the use of Y were challenged under equal protection, the challenger would have to show intent - in particular, that the governmental body using Y intended to discriminate against women by using Y. In Personnel Administrator of Massachusetts v. Feeney (1979), the Supreme Court was faced with the question of whether a facially neutral policy that had a disparate impact on women was a violation of equal protection. A key question was whether the "foreseeability" of the policy's disparate impact was sufficient proof of discriminatory intent; the court held that it was not (Weinzweig, 1983). Thus, if the ruling from Feeney were applied to the hypothetical case regarding Y, awareness of Y's differential predictive power for men and women may not necessarily qualify as proof of intent to discriminate, and the equal protection claim against Y may fall short.

Factor 2: Use of proxies for race

Now consider a hypothetical risk assessment Z that uses factors such as the stability of a defendant's housing or their employment status - in practice, many risk assessments do consider these factors, as they are correlated with recidivism risk (Summers & Willis, 2010). However, these factors may serve as proxies for race (Barocas & Selbst, 2016; Corbett-Davies & Goel, 2018). Though classifications involving race or national origin are typically subject to strict scrutiny, absent an explicit discriminatory classification, both disparate impact and discriminatory intent are required to even trigger a scrutiny test (as they would be in the hypothetical case of Y, described above) (Arlington Heights v. Metropolitan Housing Dev. Corp., 1977). Thus, for Z's use to be challenged because of its use of proxies for race, one would need to show both that Z has a disparate impact (for example, that though scores inform decision-making for all people, Z is less accurate for minorities than for white people, which may or may not be true in the case of this hypothetical) and that Z was designed or used to be discriminatory against the minority group(s) in question. Demonstrating this intent may prove challenging because of the correlation between these socioeconomic factors and recidivism risk; nonetheless, the tension between statistical generalizations about groups of people and the right to an individualized decision for each defendant is ever present.

More broadly, legal challenges to the use of RATs under constitutional law speak to an underlying theme of the use of algorithms more generally: the use of these tools does not fit neatly into established legal standards (Barocas & Selbst, 2016), and tradeoffs will be present, whether mathematical, social, both, or otherwise (Corbett-Davies & Goel, 2018). Moreover, in the presence of facially neutral RATs, understanding intent is crucial to understanding if the law has been violated. In the remedies section, we propose inquiries around RAT implementation that may help clarify the intent of policymakers and agencies who adopt these tools and inform the public about the agencies' decision-making rationale in the presence of tradeoffs.

Algorithms and Civil Rights Law

Beyond the constitutional arena, disparate impact theory has another, distinct form in civil rights law. Famously, Title VII of the Civil Rights Act of 1964 explicitly bars employment practices that would generate a disparate impact, defined by the following conditions: 1) the policy creates an adverse effect that falls disproportionately upon a particular protected class, 2) the specific policy in place is not a "business necessity," and 3) there exists an alternative policy that would not result in disproportionate harms (42 U.S.C. § 2000e et seq.). In Griggs v. Duke Power Co. (1971), the Supreme Court found that Duke Power's requirement of a high school diploma for its higher paid jobs was illegal under Title VII of the Civil Rights Act of 1964 because it disproportionately barred minority groups from those positions and did not have any demonstrable relation to performance on the job.

Beyond Title VII and employment practices, the Court has ruled in multiple cases involving federal statutes with disparate impact provisions, such as Lau v. Nichols (1974) and Alexander v. Choate (1985), that policies which create adverse disparate impact are in violation of the law, regardless of the intent of those policies or whether the policies are applied equally to all groups. Such policies that create a disparate impact constitute a violation of Title VI of the Civil Rights Act of 1964, which was enacted at the same time as Title VII (42 U.S.C. § 2000d).

We choose to now shift focus to Title VI because Title VI stipulates that all programs or activities that receive federal funding may not perpetrate or perpetuate discrimination on the grounds of race, color, or national origin, while Title VII only concerns employment (42 U.S.C. § 2000d). However, we note that the U.S. Department of Justice has recently stated that Title VI "follows...generally...the Title VII standard of proof for disparate impact"; thus, cases that concern Title VII "may shed light on the Title VI analysis in a given situation" (U.S. Department of Justice, 2019).

Twenty-six federal agencies have Title VI regulations that address the disparate impact standard, including USDA, the Department of Health and Human Services, and the Department of Education (U.S. Department of Justice, 2019). These federal agencies provide funding to a massive array of public programs and the social safety net, including public schools, Medicaid, and Medicare. In Lau v. Nichols (1974), for example, the Court found that the San Francisco Unified School District was in violation of Title VI because it received federal funding yet imposed a disparate impact on non English-speaking students, many of whom were not offered supplemental language instruction or placed into special education classes.

This regulatory and legal landscape sets the stage for the application of disparate impact theory under civil rights law as an important possible remedy for discrimination in algorithmic decision-making. As state and local governments increasingly turn towards automated tools to lower costs, ease administrative burdens, and deliver benefits, we are likely to observe cases where algorithms, especially when deployed without comprehensive oversight and auditing processes in place, create unequal outcomes. In her book Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, Professor Virginia Eubanks examines a statistical tool used by the Allegheny County Office of Youth, Children, and Families that processes data from public programs to predict the likelihood that child abuse is taking place in individual households across the county (Eubanks, 2018). Because the frequency of calls previously made on a family is an input to the algorithm, Eubanks argues that the tool may systematically discriminate against Black families, since Black families are far more likely to be called on by mandatory reporters or anonymous callers (Misra, 2018). The Office of Youth, Children, and Families is overseen by the Allegheny County Department of Human Services, which receives federal funding and as a result may be subject to regulation under Title VI (Allegheny County, 2019).

In these cases and many others, there is often no obvious evidence of discriminatory intent; to the contrary, algorithms are commonly deployed in the hopes of mitigating human biases (Lewis, 2018). In Allegheny County, officials stressed that the predictive risk-modeling tool would guide, not replace, human decision-making (Hurley, 2018; Giammarise, 2017). Yet, we often see that algorithms may still produce significant adverse impact on populations when analyzed on the basis of race or gender. As a result, groups or individuals may naturally seek to challenge the use of such algorithms in programs receiving federal funding under Title VI. According to a Justice Department legal manual on Title VI, three conditions are required to constitute a violation of Title VI: 1) statistical evidence of disparate adverse impact on a race, color, or national origin group, 2) the lack of a substantial legitimate justification for the policy, and 3) the presence of a less discriminatory alternative that would achieve the same objective but with less of a discriminatory effect (42 U.S.C. § 2000d).

In the following sections, we explore how disparate impact claims against the usage of algorithms might fail to succeed in court for three separate reasons. These challenges can be summarized as the lack of presence of a less discriminatory alternative, the use of predictive accuracy as "substantial legitimate justification" for the policy, and the possibility that the only way to ameliorate disparate impact would be to treat different groups differently, thus triggering a disparate treatment legal challenge. We explore the current standard for how a complainant (i.e., plaintiff) must prove disparate impact under Title VI, and how a recipient (i.e., defendant) might ultimately circumvent their claims.

Challenge 1: Proving the presence of a less discriminatory alternative

The phrase "less discriminatory alternative" implies that there exists a way to compare a set of policies and determine which is the least discriminatory. However, when it comes to algorithmic decision-making, the definition of "fairness" (in other words, the absence of discrimination) is hotly debated (Gajane & Pechenizkiy, 2017). For example, the notion of "classification parity" is defined as the requirement that certain measures of predictive performance, such as the false positive rate, precision, and the proportion of decisions that are positive, be equal across protected groups (Corbett-Davies & Goel, 2018). For example, in order to satisfy false positive classification parity, the Allegheny County child neglect prediction algorithm must make an incorrect positive prediction (i.e., predict the presence of child abuse in a family where none is occurring) at the same rate for both White and Black families. Another commonly referenced notion of fairness is "calibration," which requires that outcomes be independent of protected class status after controlling for estimated risk (Corbett-Davies & Goel, 2018). If the aforementioned algorithm were to satisfy calibration, child abuse must be found to actually occur at similar rates in White and Black families predicted to have a 10% risk of child neglect. (A brief code sketch contrasting these two notions appears at the end of this subsection.)

These definitions may sound like they measure roughly similar phenomena, but recent research on algorithmic fairness shows that they are often in competition, producing provable mathematical tradeoffs among each other (Corbett-Davies & Goel, 2018). Optimizing calibration, for example, may result in reductions in classification parity. ProPublica's analysis of the use of COMPAS at the pretrial stage in Broward County, Florida revealed that the algorithm yielded much higher false positive rates for Black defendants than it did for White ones (Angwin, Larson, Mattu, & Kirchner, 2016), but at the same time, individuals given the same COMPAS risk score recidivated at the same rate (Corbett-Davies, Pierson, Feller, Goel, & Huq, 2017). In other words, the algorithm was calibrated, but was more likely to incorrectly classify Black defendants as "high risk" for recidivism than White defendants. To further complicate the notion of discrimination, the algorithm used in Allegheny County to predict risk of child neglect was miscalibrated in a way that disfavored White children: White children who received the same risk score for neglect as Black children were actually less likely to be experiencing maltreatment (Chouldechova, Benavides-Prado, Fialko, & Vaithianathan, 2018). In this case, Eubanks' critiques of the algorithm's inputs and other researchers' empirically measured calibration result in directly opposing views of which racial group is experiencing discrimination.

Without a single, legally codified definition of fairness, we see the first obstacle to a successful disparate impact claim: a recipient can argue that no less discriminatory alternative exists, since any alternative will likely involve tradeoffs across different measures of fairness. Moreover, we suggest that it is insufficient to choose one measure of fairness as the priority in all cases, since the societal costs associated with different fairness measures vary across specific applications (Corbett-Davies & Goel, 2018). For example, one might argue that the societal and/or moral cost of incorrectly detaining a Black individual who will not recidivate is far greater than the cost of incorrectly flagging a Black household for child abuse. Another person might take the opposite position, but in either case, blindly prioritizing false positive parity across both tasks would fail to recognize the unique costs associated with each one.

There also exist practical legal challenges and ambiguity regarding the existence of a less discriminatory alternative. In the realm of Title VII, scholars disagree about whether "refusal" to adopt a less discriminatory procedure means that the employer cannot be held liable until it has actively investigated such an alternative and subsequently rejected it (Barocas & Selbst, 2016). This debate raises the question of whether employers should be held responsible for performing a costly, exhaustive search of all potential alternatives, or whether the cost of doing such a search would functionally mean that less discriminatory alternatives do not exist. According to the U.S. Department of Justice's guidance regarding Title VI, the burden is on the complainant to identify less discriminatory alternatives (U.S. Department of Justice, 2019). This may pose a significant challenge to complainants, as they may not have access to the documents and data needed to show which alternatives would be equally effective in practice.
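To make the two competing notions just discussed concrete, the following is a minimal sketch, not taken from the essay itself, of how false positive rate parity and calibration might each be checked from a table of decisions. The record layout and group labels are assumptions for illustration; a real audit would also need confidence intervals and small-sample handling.

    # Illustrative sketch only: two competing fairness checks over hypothetical
    # records of the form (group, predicted_high_risk, actual_outcome).
    from collections import defaultdict

    def false_positive_rate_by_group(records):
        """Classification parity check: P(flagged high risk | no actual outcome), per group."""
        fp = defaultdict(int)    # false positives per group
        neg = defaultdict(int)   # actual negatives per group
        for group, predicted, actual in records:
            if not actual:
                neg[group] += 1
                if predicted:
                    fp[group] += 1
        return {g: fp[g] / neg[g] for g in neg if neg[g] > 0}

    def outcome_rate_by_group_at_score(scored, score):
        """Calibration check: P(actual outcome | assigned risk score), per group."""
        pos = defaultdict(int)
        total = defaultdict(int)
        for group, risk_score, actual in scored:
            if risk_score == score:
                total[group] += 1
                pos[group] += int(actual)
        return {g: pos[g] / total[g] for g in total if total[g] > 0}

Equalizing the first dictionary across groups does not guarantee equal values in the second, and vice versa; that is the mathematical tension the tradeoff results cited above formalize.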

Challenge 2: Substantial legitimate justification

The second failure mode for a disparate impact claim is that the recipient has articulated a "substantial legitimate justification" for the challenged policy (42 U.S.C. § 2000d). As the Justice Department discloses in its Title VI legal manual, "the precise nature of the justification inquiry in Title VI cases is somewhat less clear in application" (U.S. Department of Justice, 2019). For example, the EPA stated in its 2000 Draft Guidance for Investigating Title VI Administrative Complaints that the "provision of public health or environmental benefits...to the affected population" was an "acceptable justification" (Draft Title VI Guidance, 2000). This document was compiled after a 60-day period of seven public listening sessions held at the request of state and local officials seeking clarification in an effort to avoid Title VI violations (Mank, 2000). In contrast, Title VII substitutes the "legitimate justification" requirement with a "business necessity" stipulation (42 U.S.C. § 2000d). Because Title VI covers a broad scope of federally funded programs, "legitimate justification" must be defined on a case-by-case basis, whereas "business necessity" has a narrower meaning in case law due to Title VII's specific focus on hiring practices (U.S. Department of Justice, 2019).

In the case of programmatic decision-making, discrimination may occur when practitioners do not properly audit their algorithm before and while it is deployed. Such an audit could take many forms, such as running a randomized controlled trial before permanently implementing an algorithm or releasing public reports every year regarding how well the algorithm is performing. (For the purposes of the following discussion, we assume that the task at hand is one of binary/multiclass classification, also known as a "screening procedure.") In the field of machine learning, algorithms are commonly trained by iteratively improving performance on a given dataset, as measured by average classification accuracy (Alpaydin, 2009). If average classification accuracy is not disaggregated across protected groups present in the dataset, disparities in the algorithm's performance may only be discovered once the algorithm is already deployed for real-world use (Buolamwini & Gebru, 2018), which could result in a subsequent disparate impact claim. In this sequence of events, the potentially offending entity was optimizing for overall accuracy and failed to take the possibility of disparate impact into account.

This scenario raises the question of whether the desire to optimize raw predictive accuracy counts as a "substantial legitimate justification" for an algorithm whose outputs are biased. It seems plausible that any recipient could argue that predictive accuracy is a legitimate justification: after all, optimizing accuracy maximizes the total number of decisions made correctly, provided that the demographic makeup of the dataset resembles that of the real-world population. Optimizing for any other metric, such as an arbitrary fairness measure, may lead to an algorithm with lower overall predictive accuracy (Zliobaite, 2015; Kleinberg, Mullainathan, & Raghavan, 2016). A recipient of a disparate impact claim could argue that maximizing accuracy leads to higher efficiency and lower costs for cash-strapped government agencies. In the Allegheny County example, having an algorithm accurately flag families for risk of child neglect reduced the time required to manually screen applications, saving time and labor. Because "substantial legitimate justification" is relatively ambiguous and case-specific, it may be difficult for a complainant to prove that maximizing classification accuracy is not a legitimate justification.

Challenge 3: A disparate impact and disparate treatment Catch-22

It is important to note that optimizing accuracy and fairness measures is not always a zero-sum game. In the aforementioned research about gender in criminal risk assessment, including gender as a variable in the dataset improved calibration and predictive accuracy because women with criminal histories similar to men's recidivate at lower rates (Skeem et al., 2016). (Notably, gender is not a protected attribute under disparate impact clauses in civil rights law.) Similarly, in other cases, we may be able to improve predictive accuracy and produce gains in fairness measure(s) if some predictive latent variable is identified and included in the dataset (Jung, Corbett-Davies, Shroff, & Goel, 2018). A toy simulation of this effect appears below.
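The following toy simulation, which assumes scikit-learn and NumPy are available and uses entirely invented numbers, illustrates this point and previews the race/criminal-history hypothetical discussed next: a model given only a protected attribute penalizes everyone in the higher-base-rate group, while adding the genuinely predictive feature raises accuracy and lowers the scores assigned to group members without priors.

    # Toy simulation (all numbers invented). Requires NumPy and scikit-learn.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 20_000
    group = rng.integers(0, 2, n)                          # protected attribute (0/1)
    # Group 1 is assumed more likely to have a prior record, e.g. due to policing patterns.
    prior = rng.random(n) < np.where(group == 1, 0.5, 0.2)
    # In this toy world, reoffending depends only on the prior record, not on group.
    reoffend = rng.random(n) < np.where(prior, 0.6, 0.2)

    X_coarse = group.reshape(-1, 1)                        # group only
    X_rich = np.column_stack([group, prior])               # group + prior record
    coarse = LogisticRegression().fit(X_coarse, reoffend)
    rich = LogisticRegression().fit(X_rich, reoffend)

    mask = (group == 1) & (~prior)                         # group-1 members with no priors
    print("accuracy:", round(coarse.score(X_coarse, reoffend), 3),
          "->", round(rich.score(X_rich, reoffend), 3))
    print("mean predicted risk for group 1, no priors:",
          round(coarse.predict_proba(X_coarse)[mask, 1].mean(), 3),
          "->", round(rich.predict_proba(X_rich)[mask, 1].mean(), 3))

Under these made-up assumptions, the richer model is both more accurate overall and assigns lower risk to group-1 individuals without prior records, which is exactly the pattern discussed in the hypothetical that follows.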


Consider the case of a hypothetical algorithm that estimates recidivism risk and takes race as an input, but does not take criminal history as an input. Assume in this scenario that criminal history is more predictive of recidivism than race. If Black people are disproportionately likely to have prior convictions, perhaps due to disparate policing practices, then the algorithm will "penalize" all Black people by giving them higher risk scores, even ones without prior convictions. If criminal history is added to the dataset and the algorithm is retrained, the algorithm's accuracy will increase due to the addition of a predictive variable. In addition, the algorithm's performance on fairness measures may increase as well, since Black people without criminal histories will no longer receive a penalty for their racial status. It may be the case, however, that the latent variable whose inclusion would improve fairness and accuracy is the protected attribute itself (Jung et al., 2018). Including gender as an input to the algorithm would resolve the unequal outcomes in which women are unfairly penalized, but at the same time, explicitly altering decisions based on an individual's gender is a clear example of disparate treatment (42 U.S.C. § 2000d). The same would be true with regard to protected attributes under Title VI such as race, national origin, and religion. Disparate treatment, in which policies explicitly treat members of different protected groups differently, is prohibited by Title VI, as well as many other civil rights laws (U.S. Department of Justice, 2019). Disparate treatment cases are arguably easier to prove, since discrimination is explicitly codified in a recipient's policies, while disparate impact cases rely on measures of a policy's outcomes de facto (Selmi, 2005). The fact that both disparate treatment and disparate impact violate civil rights statutes may create a Catch-22 for entities seeking to resolve disparate impact in algorithmic decision-making.

Indeed, Kroll et al. (2016) note this tension as manifested in the Supreme Court's decision in a 2009 case involving Title VII, Ricci v. DeStefano (2009). In the case, the New Haven Civil Service Board (CSB) refused to certify the results of a facially neutral test for firefighter promotions out of disparate impact concerns, noting that the pass rate for minorities was half that for whites. As Kroll et al. (2016) note, the Court's decision to rule against the CSB "demonstrates the tension between disparate treatment and disparate impact," since a neutral policy can create disparate outcomes, but mitigating the disparate impact would require discriminatory treatment of different groups.

Remedies

As we have seen from the above analysis, there is reason to believe that today's concerns regarding algorithmic bias will not be resolved in the courts alone, despite the high number of pending court cases regarding the use of algorithms. In the Constitutional realm, absent a suspect classification, both disparate impact and discriminatory intent are needed to prove a violation of the law. In addition, the current requirements to make a successful claim of disparate impact under civil rights law are vague with regard to defining what a discriminatory outcome is, which may allow recipients of complaints to leverage whichever mathematical constructs of fairness best support the use of their algorithm.

If we cannot expect to find remedies from the judiciary, where should citizens turn for relief? To address the above concerns, we propose a remedy in the form of a unified, collaborative effort between the agencies and legislatures, both at the federal and state levels. We detail what such an effort would look like below, using an international regulation to inform our proposals.

The European Union's General Data Protection Regulation (GDPR) offers a compelling case for broad legal regulations coupled with significant enforcement power. The GDPR provides strong protections for individual privacy by allowing governmental agencies to pursue fines and investigations into private companies for data mismanagement and privacy breaches (Steinhardt, 2018). With regard to automated decision-making, the GDPR (2016) makes mention of a "right to explanation" for users who seek explanation for decisions made about them (e.g., loan denials) (Goodman & Flaxman, 2017).

One of the European Commission's senior advisory bodies on data protection released a set of guidelines regarding automated decision-making, which included requirements for companies to provide explanations for how users' personal data was used by the algorithm (Casey, Farhangi, & Vogl, 2018). The same body even included a recommendation for companies to introduce "procedures and measures to prevent...discrimination" and to perform "frequent assessments...to check for any bias" (17/EN. WP 251, 2017).

The fact that the mandates behind the GDPR have been enforced in practice leads us to suggest an approach in the U.S. that similarly combines comprehensive legislation with new enforcement powers for government agencies (Lawson, 2019). Of course, attitudes and policies regarding the regulation of private companies differ in the U.S. and the EU (Hawkins, 2019). Thus, our proposal would not seek to impose regulations on all private companies across the US, but rather on public entities that are already subject to significant government oversight, such as federal agencies or federally funded programs. Indirectly, this implicates private companies that such agencies may contract with to provide tools or services in their use of algorithmic technology.

The remedies we suggest apply to both of the main use cases we previously described; for federally funded agencies, these remedies may be enacted through legislation or executive rule-making. Similarly, these remedies could also be applied at the state and local level. In both cases, we recommend the creation or significant expansion of agencies focused specifically on the technical oversight and evaluation of algorithmic tools. For example, one existing agency that might take up this burden is the newly created Science, Technology Assessment, and Analytics team at the U.S. Government Accountability Office (U.S. Government Accountability Office, 2019). While courts have been reluctant to conduct a "searching analysis of alternatives," federal agencies are "subject matter experts charged with Title VI enforcement duties" and "are well-equipped to...evaluate carefully potential less discriminatory alternatives" (U.S. Department of Justice, 2019).

Our remedy additionally attempts to recognize and address the significant gap in current civil rights legislation with regard to definitions of discriminatory intent and disparate impact, which can generate a Catch-22 of sorts, even for well-meaning actors. Existing civil rights legislation largely focuses on barring discriminatory intent that results in differential treatment on the basis of protected attributes, such as race. Today, however, we see that in order to remedy unintended discrimination in algorithmic decision-making, we may have to take into account such protected attributes: essentially, using differential treatment to ameliorate disparate outcomes. Federal and state legislation must acknowledge this nuance, allowing practitioners to use protected attribute data to promote the fairest outcomes, where the relevance of such data and a suitable notion of fairness are determined on a case-by-case basis. For example, under a bail reform law in New Jersey, agencies may collect information about a defendant's race and gender for potential use in a risk assessment calculation, subject to the condition that decisions are not discriminatory along race or gender lines (NJ Rev Stat § 2A:162-25, 2014).

Legislation (or other regulation) should stipulate that public agencies that are going to adopt algorithms to help make decisions must submit the following information to a relevant oversight agency (at the federal level, the office described earlier, and at the state level, some state or local agency with relevant expertise) prior to the algorithm's adoption; a machine-readable sketch of such a submission follows the list:

• What decision will the algorithm be used to make or help make? How was that decision or type of decision made before the use of the algorithm?
• What are the reasons to implement such an algorithm? Is the algorithm less expensive, or will it increase efficiency? Is the intent to make the decision-making process more objective?
• What are the particular use cases and use context of the algorithm? How will the algorithm's outputs be interpreted? Will a human decision-maker be involved? Who is the population that the algorithm may be used on? Are there any exceptions to this policy?
• How will the algorithm be evaluated and, if necessary, revised? Has funding been allocated for regular oversight? Who will be performing the evaluations and how? Is the text (or source code) or training data of the algorithm publicly available?
• Were alternatives considered? What other options were considered, and why was this one chosen? What were the tradeoffs between the different choices?
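As a purely illustrative sketch, the disclosure above could be captured as a machine-readable record. The field names below simply paraphrase the questions in the list; they are not part of the authors' proposal or of any existing regulation, and the example values are invented.

    # Sketch only: one way a public agency might structure the pre-adoption
    # disclosure described above. Field names paraphrase the list; nothing here
    # is a proposed standard.
    from dataclasses import dataclass, field, asdict
    from typing import List
    import json

    @dataclass
    class AlgorithmDisclosure:
        decision_supported: str          # what decision the algorithm helps make
        prior_process: str               # how the decision was made before
        rationale: str                   # cost, efficiency, objectivity, etc.
        use_contexts: List[str]          # concrete use cases and exceptions
        human_in_the_loop: bool          # is a human decision-maker involved?
        affected_population: str
        evaluation_plan: str             # who evaluates, how, and how often
        oversight_funded: bool
        source_or_data_public: bool
        alternatives_considered: List[str] = field(default_factory=list)
        tradeoffs: str = ""

    # Hypothetical example: serialize a submission for an oversight agency's docket.
    submission = AlgorithmDisclosure(
        decision_supported="Screen child-welfare hotline calls",
        prior_process="Manual review by intake caseworkers",
        rationale="Reduce screening time",
        use_contexts=["Initial call screening only"],
        human_in_the_loop=True,
        affected_population="Families referred to the county hotline",
        evaluation_plan="Annual disaggregated accuracy and calibration report",
        oversight_funded=True,
        source_or_data_public=False,
        alternatives_considered=["Status quo manual screening"],
        tradeoffs="Speed versus transparency of the scoring model",
    )
    print(json.dumps(asdict(submission), indent=2))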


The submission of this information to a governmental body and the public before an algorithm is employed in practice could provide greater clarity to both the public and regulators regarding discriminatory intent and the potential for discriminatory outcomes. Furthermore, by actively requiring actors to come up with a plan to monitor the algorithm, consider alternatives, and think critically about the algorithm in the context of human systems, this policy may decrease the likelihood of algorithms producing unintended negative consequences in practice.

Acknowledgments

We would like to thank Keith Schwarz for helpful feedback.

References

17/EN. WP 251. (2017). Guidelines on Automated individual decision-making and Profiling for the purposes of Regulation 2016/679.
ACLU, Outten & Golden LLP, and the Communications Workers of America. (2019). Facebook EEOC complaints. https://www.aclu.org/cases/facebook-eeoc-complaints. (Online; accessed June 1, 2019)
Alexander v. Choate, 469 U.S. 287 (1985)
Allegheny County. (2019). DHS funding. https://www.county.allegheny.pa.us/Human-Services/About/Funding-Sources.aspx. (Online; accessed May 18, 2019)
Alpaydin, E. (2009). Introduction to machine learning. MIT Press.
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica, May, 23.
Arlington Heights v. Metropolitan Housing Dev. Corp., 429 U.S. 252 (1977)
Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. Calif. L. Rev., 104, 671.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).
Casey, B., Farhangi, A., & Vogl, R. (2018). Rethinking explainable machines: The GDPR's "right to explanation" debate and the rise of algorithmic audits in enterprise.
Chouldechova, A., Benavides-Prado, D., Fialko, O., & Vaithianathan, R. (2018). A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency (pp. 134–148).
Civil Rights Act of 1964 Title VI, 78 Stat. 252, 42 U.S.C. § 2000d.
Civil Rights Act of 1964 Title VII, 42 U.S.C. § 2000e et seq.
Corbett-Davies, S., & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 797–806).
Craig v. Boren, 429 U.S. 190 (1976)
Desmarais, S. L., & Lowder, E. M. (2019). Pretrial risk assessment tools: A primer for judges, prosecutors, and defense attorneys. MacArthur Foundation Safety and Justice Challenge.
Draft Title VI Guidance for EPA Assistance Recipients Administering Environmental Permitting Programs (Draft Recipient Guidance) and Draft Revised Guidance for Investigating Title VI Administrative Complaints Challenging Permits (Draft Revised Investigation Guidance); Notice, 65 Fed. Reg. 124 (June 27, 2000). Federal Register: The Daily Journal of the United States.
Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin's Press.
Fishel, S., Flack, D., & DeMatteo, D. (2018). Computer risk algorithms and judicial decision-making. Monitor on Psychology.
Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184.


Giammarise, K. (2017). Allegheny County DHS using algorithm to assist in child welfare screening. https://www.post-gazette.com/local/region/2017/04/09/Allegheny-County-using-algorithm-to-assist-in-child-welfare-screening/stories/201701290002. (Online; accessed May 18, 2019)
Goodman, B., & Flaxman, S. (2017). European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3), 50–57.
Griggs v. Duke Power Co., 401 U.S. 424 (1971)
Hawkins, D. (2019). The cybersecurity 202: Why a privacy law like GDPR would be a tough sell in the U.S. https://www.washingtonpost.com/news/powerpost/paloma/the-cybersecurity-202/2018/05/25/the-cybersecurity-202-why-a-privacy-law-like-gdpr-would-be-a-tough-sell-in-the-u-s/5b07038b1b326b492dd07e83/?utm_term=.1cc41e57f9cf. (Online; accessed May 18, 2019)
Hurley, D. (2018). Can an algorithm tell when kids are in danger. New York Times, 2.
Jung, J., Corbett-Davies, S., Shroff, R., & Goel, S. (2018). Omitted and included variable bias in tests for disparate impact. arXiv preprint arXiv:1809.05651.
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.
Kroll, J. A., Barocas, S., Felten, E. W., Reidenberg, J. R., Robinson, D. G., & Yu, H. (2016). Accountable algorithms. U. Pa. L. Rev., 165, 633.
K.W. v. Armstrong, No. 14-35296 (9th Cir. 2015)
Latessa, E., Smith, P., Lemke, R., Makarios, M., & Lowenkamp, C. (2009). Creation and validation of the Ohio risk assessment system: Final report. Cincinnati, OH: University of Cincinnati.
Lau v. Nichols, 414 U.S. 563 (1974)
Lawson, R. P. (2019). GDPR enforcement actions, fines pile up. https://www.manatt.com/Insights/Newsletters/Advertising-Law/GDPR-Enforcement-Actions-Fines-Pile-Up. (Online; accessed May 18, 2019)
Lecher, C. (2018). What happens when an algorithm cuts your health care. The Verge.
Legal Information Institute. (2018a). Equal protection. https://www.law.cornell.edu/wex/equal_protection. (Online; accessed May 18, 2019)
Legal Information Institute. (2018b). Intermediate scrutiny. https://www.law.cornell.edu/wex/intermediate_scrutiny. (Online; accessed June 1, 2019)
Lewis, N. (2018). Will AI remove hiring bias? https://www.shrm.org/resourcesandtools/hr-topics/talent-acquisition/pages/will-ai-remove-hiring-bias-hr-technology.aspx. (Online; accessed May 18, 2019)
Mank, B. C. (2000). The draft recipient guidance and the draft revised investigation guidance: Too much discretion for EPA and a more difficult standard for complainants? Environmental Law Reporter, 30.
Misra, T. (2018). When criminalizing the poor goes high-tech. https://www.citylab.com/equity/2018/02/the-rise-of-digital-poorhouses/552161/?platform=hootsuite. (Online; accessed May 18, 2019)
NJ Rev Stat § 2A:162-25 (2014)
O.J. (L 119) (2016). Reg (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Dir 95/46/EC (General Data Protection Regulation)
Personnel Adm'r of Massachusetts v. Feeney, 442 U.S. 256 (1979)
Reg (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Dir 95/46/EC (General Data Protection Regulation). (2016).


Ricci v. DeStefano, 557 U.S. 557 (2009)
Selmi, M. (2005). Was the disparate impact theory a mistake. UCLA L. Rev., 53, 701.
Skeem, J., Monahan, J., & Lowenkamp, C. (2016). Gender, risk assessment, and sanctioning: The cost of treating women like men. Law and Human Behavior, 40(5), 580.
Starr, S. B. (2014). Evidence-based sentencing and the scientific rationalization of discrimination. Stan. L. Rev., 66, 803.
State v. Loomis, 881 N.W.2d 749 (2016)
Steinhardt, E. (2018). European regulators are intensifying GDPR enforcement. https://www.insideprivacy.com/eu-data-protection/european-regulators-are-intensifying-gdpr-enforcement/. (Online; accessed May 18, 2019)
Summers, C., & Willis, T. (2010). Pretrial risk assessment research summary. Washington, DC: Bureau of Justice Assistance.
United States Department of Justice. (2019). Title VI Legal Manual (Updated)
U.S. Government Accountability Office. (2019). Our new science, technology assessment, and analytics team. https://blog.gao.gov/2019/01/29/our-new-science-technology-assessment-and-analytics-team/. (Online; accessed May 18, 2019)
U.S. Const. amend. V.
U.S. Const. amend. XIV.
VanNostrand, M., et al. (2009). Pretrial risk assessment in Virginia.
Weinzweig, M. J. (1983). Discriminatory impact and intent under the equal protection clause: The Supreme Court and the mind-body problem. Law & Ineq., 1, 277.
Zliobaite, I. (2015). On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723.

Marissa Gerchick is a rising senior at Stanford University studying Mathematical and Computational Science. She is interested in using data-driven tools to improve the American criminal justice system.

Matthew Sun is a rising senior at Stanford double majoring in Computer Science and Public Policy. He leads a student group called CS+Social Good and is interested in applied AI research for socially relevant issues.


What Metrics Should We Use To Measure Commercial AI? Cameron Hughes (Northeast Ohio ACM Chair; [email protected]) Tracey Hughes (Northeast Ohio ACM Secretary; [email protected]) DOI: 10.1145/3340470.3340479

In AI Matters Volume 4, Issue 2, and Issue 4, we raised the notion of the possibility of an AI Cosmology, in part in response to the "AI Hype Cycle" that we are currently experiencing. We posited that our current machine learning and big data era represents but one peak among several previous peaks in AI research, in which each peak had accompanying "Hype Cycles". We associated each peak with an epoch in a possible AI Cosmology. We briefly explored the logic machines, cybernetics, and expert system epochs. One of the objectives of identifying these epochs was to help establish that we have been here before. In particular, we've been in the territory where some application of AI research finds substantial commercial success, which is then closely followed by AI fever and hype. The public's expectations are heightened, only to end in disillusionment when the applications fall short. Whereas it is sometimes somewhat of a challenge even for AI researchers, educators, and practitioners to know where the reality ends and the hype begins, the layperson is often in an impossible position and at the mercy of pop culture, marketing, and advertising campaigns. We suggested that an AI Cosmology might help us identify a single standard model for AI that could be the foundation for a common shared understanding of what AI is and what it is not. A tool to help the layperson understand where AI has been, where it's going, and where it can't go. Something that could provide a basic road map to help the general public navigate the pitfalls of AI Hype.

Here, we want to follow that suggestion with a few questions. Once we define and agree on what is meant by the moniker artificial intelligence and we are able to classify some application as actually having artificial intelligence, another set of questions immediately present themselves:

• How intelligent is any given AI application?
• How much intelligence does any given AI application have?
• How much intelligence does an application need to be classified as an AI application?
• How reliable is the process that produced the intelligence for any given AI application?
• How transparent is the intelligence in any given AI application?

The answers to these questions require some kind of qualitative and quantitative metrics. Namely, how much intelligence does any given AI application have, and what is the quality of that intelligence? Further, how could we congeal the answers to these questions so that they can be used to capture (in label form) the "AI Ingredients" of any technology aimed at the general public?

The amount of education is often used as one metric for intelligence. We refer to individuals as having a grade school, high school, or college-level education. Could we employ a similar metric for AI applications? Would it be feasible to classify AI applications in terms of grade levels? For instance, an AI application with grade level 5 would be considered to have more intelligence than a 4th grade AI application. Any application that didn't meet a 1st grade level would not be considered an AI application, and applications that achieved better than 12th grade would be considered advanced AI applications. But how could we determine the grade level of any given AI application?

A consensus set of metrics that could be passed on to the general public has not yet prevailed. In this AI hype cycle, if an application uses any artifact from any AI technique, a vendor is quick to advertise it as an AI application. It would be useful to have a metric that would indicate exactly how much AI is in the purported application.

Copyright © 2019 by the author(s).

We have this kind of information for other products, e.g., how much chocolate is actually in the chocolate bar or how much real fruit is actually in the fruit juice being sold to the consumer. Exactly how much AI does that drone have? Or how much AI is actually in that new social media application? The "how much" question requires a quantitative metric of some sort, and the grade of intelligence involved requires the qualitative metric. What if applications that claimed to be AI capable were required to state the metrics on the label or in the advertising? For example, a vendor might state: "Our new social media application is 2% level 3 AI!" This kind of simple metric scheme would help to mitigate AI Hype cycles.

In addition to characterizing the quantity and quality of the embedded AI, requiring a reliability metric like MTBAIF (Mean Time Between AI Failure) is also desirable. Stating how much intelligence is in an application and what grade level of intelligence is in an application provides a good start. However, the reliability of the AI (i.e., its limits, tolerances, certainty, etc.) and a transparency metric that indicates the ontology, inference predisposition/bias, and type and quality of knowledge would give the user some real indication of the utility of the application.

Knowledge Ingredients

What if we could provide "Knowledge Ingredient" labels for our AI-based hardware/software technologies like those shown in Figure 1?

Figure 1: Pie charts for a Knowledge Indicator that reflects the Knowledge Ingredients of an AI system.

In Figure 1, the percentage of each type of knowledge the system is comprised of is represented in a pie chart. The color indicates the grade levels, from lower to higher. In this case, 41% of the system's knowledge is "Common Sense", at a relatively low grade level, as compared to the high grade level of "Instinctive Knowledge" at 23%. The expiration date indicates the time frame for the viability of the knowledge, from 2000 to 2025. Outside of the indicated time frame, the knowledge would be considered obsolete. There are two indicators, one for researchers and practitioners and the other for the layperson. The difference between the indicators is the terminology used to describe the types of knowledge. Here is our list of metrics (a code sketch of one way to package such a label follows the list):

1. Percentage of software/hardware dedicated to the implementation of AI techniques
2. Grade Level
3. Reliability (limits, tolerances, certainty, MTBAIF)
4. Transparency (predisposition/bias)
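As a sketch of how these four metrics might be packaged, the structure below bundles them into a single label object. The field names are hypothetical, and the example values simply echo the figures used in this article (2% level 3 AI, 41% "Common Sense", 23% "Instinctive Knowledge", 2000-2025); nothing here is a proposed standard.

    # Illustrative sketch of a 'Knowledge Ingredients' label; all names invented.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class KnowledgeIngredientsLabel:
        ai_fraction: float                 # share of software/hardware implementing AI techniques
        grade_level: int                   # by analogy with years of schooling
        mtbaif_hours: float                # Mean Time Between AI Failure (reliability)
        knowledge_types: Dict[str, float]  # e.g. {"Common Sense": 0.41, ...}
        bias_statement: str                # transparency: known predispositions/biases
        valid_from: int                    # start of the knowledge viability window
        valid_until: int                   # expiration of the knowledge

        def advertising_line(self) -> str:
            return (f"This application is {self.ai_fraction:.0%} "
                    f"grade-{self.grade_level} AI (MTBAIF {self.mtbaif_hours:.0f} h).")

    label = KnowledgeIngredientsLabel(
        ai_fraction=0.02, grade_level=3, mtbaif_hours=120.0,
        knowledge_types={"Common Sense": 0.41, "Instinctive Knowledge": 0.23},
        bias_statement="Trained mainly on English-language social media text",
        valid_from=2000, valid_until=2025,
    )
    print(label.advertising_line())  # "This application is 2% grade-3 AI (MTBAIF 120 h)."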

AI Metrics

The notion of metrics to measure AI performance has been under active investigation since the very beginnings of AI research, and a standard, widely used and accepted set of metrics remains elusive. The PerMIS (Performance Metrics for Intelligent Systems) workshops were started in 2000. The PerMIS workshops are dedicated to:

"...defining measures and methodologies of evaluating performance of intelligent systems. Started in 2000, the PerMIS series focuses on applications of performance measures to applied problems in commercial, industrial, homeland security, and military applications."

The PerMIS workshops were originally co-sponsored by SIGART (now SIGAI), NIST, IEEE, ACM, DARPA, and NSF. These workshops endeavored to identify performance intelligence metrics in many areas such as: ontologies, mobile robots, intelligent interfaces, agents, intelligent test beds, intelligent performance assessment, planning models, autonomous systems, learning approaches, and embedded intelligent components.

ALFUS (Autonomous Levels For Unmanned Systems) defines a framework for characterizing human interaction and the mission and environmental complexity of a "system of systems". The purpose of ALFUS was to determine the level of autonomy, a necessary but not sufficient component of an intelligent system. According to Ramsbotham [1]:

"Intelligence implies an ability to perceive and adapt to external environments in real-time, to acquire and store knowledge regarding problem solutions, and to incorporate that knowledge into system memory for future use."

ALFUS describes autonomous intelligent systems of systems based on the 4D/RCS Reference Model Architecture for Learning, developed by the Intelligent Systems Division of NIST since the 1980s. In order to attempt to develop some type of metric, the functions that comprised the intelligent behavior had to be decomposed into specific hardware and software characteristics. This was called SPACE (Sense, Perceive, Attend, Apprehend, Comprehend, Effect action); a minimal code sketch of these stages appears after the category list below:

• Sense: To generate a measurable signal (usually electrical) from external stimuli. A sensor will often employ techniques (for example, bandpass filtering or thresholding) such that only part of the theoretical response of the transducer is perceived.
• Perceive: To capture the raw sensor data in a form (analog or digital) that allows further processing to extract information. In this narrow construct, perception is characterized by a 1:1 correspondence between the sensor signal and the output data.
• Attend: To select data from what is perceived by the sensor. To a crude approximation, analogous to feature extraction.
• Apprehend: To characterize the information content of the extracted features. Analogous to pattern recognition.
• Comprehend: To understand the significance of the information apprehended in the context of existing knowledge; in the case of automata, typically other information stored in electronic memory.
• Effect action: To interact with the external environment or modify the internal state (e.g., the stored information comprising the "knowledge base" of the system) based on what is comprehended.

The purpose or "collective mission performance" of a system of systems was also categorized based on functional and architectural complexity. The mission and environmental complexity and the degree of human interaction determining the level of autonomy could be applied to these categories [1]:

1. Leader-Follower: Intelligent behavior exhibited by a single node and replicated (sometimes with minor adaptation) by other nodes.
2. Swarming (simple): Loosely structured collection of interacting agents, capable of moving collectively.
3. Swarming (complex): Loosely structured collection of interacting agents, capable of individuated behavior to effect common goals.
4. Homogeneous intelligent systems: A relatively structured collection of identical (or at least similar) agents, wherein collective system performance is optimized by optimizing the performance of individual agents.
5. Heterogeneous intelligent systems: A relatively structured heterogeneous collection of specialized agents, wherein the functions of intelligence are distributed among the diverse agents to optimize performance of a defined task or set of tasks.
6. Ad hoc intelligent adaptive systems: A relatively unstructured and undefined heterogeneous collection of agents, wherein the functions comprising intelligence are dynamically distributed across the system to adapt to changing tasks.
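One informal way to read the SPACE decomposition is as a sequential processing pipeline. The sketch below is only an illustration of that reading; the stage functions are placeholders, and nothing here comes from the ALFUS or 4D/RCS specifications themselves.

    # Sketch only: the SPACE stages treated as a sequential pipeline.
    from typing import Any, Callable, List

    SPACE_STAGES: List[str] = [
        "Sense", "Perceive", "Attend", "Apprehend", "Comprehend", "Effect action",
    ]

    def run_space_pipeline(stimulus: Any,
                           stages: List[Callable[[Any], Any]]) -> Any:
        """Pass an external stimulus through each stage in order."""
        signal = stimulus
        for stage in stages:
            signal = stage(signal)
        return signal

    # Toy usage with placeholder stages that only tag the payload:
    pipeline = [lambda x, name=name: {"stage": name, "payload": x}
                for name in SPACE_STAGES]
    result = run_space_pipeline("raw stimulus", pipeline)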

From less complex groups with low mission and environmental complexity and high human interaction (like Leader-Follower) to the most sophisticated, with high mission and environmental complexity and no human interaction (like Ad hoc intelligent adaptive systems), these categories of systems fall into a rough order. Figure 2 shows where the extremes of these categories could be graphed.

Figure 2: Leader-Follower and Ad Hoc Intelligent Adaptive Systems are located on the 3D space of Complexity and Human Interaction.

Ultimately, Ramsbotham [1] concluded:

"Even given a comprehensive framework, as we begin to build more complex intelligent systems of systems, we will need to acquire knowledge and improve analytic tools and metrics. Among the more important will be: ... Better models and metrics for characterizing limits of information assurance based on these effects. This will be both a critical need and a major challenge."

The best-known metric used for researchers and commercial AI applications has been the Turing Test, developed by Alan Turing in 1950. The purpose of the test was to evaluate a machine's ability to demonstrate intelligent behavior. That behavior was to be indistinguishable from a human being exhibiting the same behavior. A human judge was to evaluate a natural language conversation between a human and the potential "intelligent machine", as shown in Figure 3.

Figure 3: The human evaluator was to determine which participant was not human.

Is this test actually a metric for "intelligence"? This has been debated for many years. If an NLG agent can produce responses that simulate human responses, does that mean that the system is intelligent? Probably not, but that has not stopped the use of the Turing Test, or AI software developers heralding that their systems have passed or almost passed the Turing Test. Now passing the test includes simulating the human voice, as with the Google Duplex AI Assistant.

In "Moving Beyond the Turing Test with the Allen AI Science Challenge" [2], the authors describe the Allen AI Science Challenge "Turing Olympics", a series of tests that explore many capabilities that are considered associated with intelligence. These capabilities include the language understanding, reasoning, and commonsense knowledge needed to perform smart or intelligent activities. This would replace the Alan Turing pass/fail model. The idea of the Challenge was a four-month-long competition in which researchers were to build an AI agent that could answer eighth-grade multiple-choice science questions. The AI would demonstrate its ability to utilize state-of-the-art natural language understanding and knowledge-based reasoning. The following summarizes the nature of the competition:

Total (4-choice) questions: 5,083
Training set questions: 2,500
Validation set confirming model performance: 8,132
Legitimate questions: 800
Final test + validation set for final score: 21,298
Final legitimate questions: 2,583
Baseline score (random guessing): 25%


In the final results of the test, the top three competitors had scores within a spread of 1.05%, with the highest score at 59.31%, which is considered a failing grade. Each of the winning models utilized standard information-retrieval-based methods, which were not able to pass the eighth-grade science exams. What is required is:

"...to go beyond surface text to a deeper understanding of the meaning underlying each question, then use reasoning to find the appropriate answer."

In this case, such a system would have a 1st-grade-level Knowledge Quality based on our quasi Knowledge Ingredient Indicator. Based on the article [2]:

"All three winners said it was clear that applying a deeper, semantic level of reasoning with scientific knowledge to the questions and answers would be the key to achieving scores of 80% and higher and demonstrating what might be considered true artificial intelligence."

Here we've provided a very cursory look at a very limited set of possibilities for measuring commercial AI. In the next issue, we will go a little further and dig a little deeper into the question of how we communicate basic AI metrics and ingredients for commercial AI to the layperson.

Cameron Hughes is a computer and robot programmer. He is a Software Epistemologist at Ctest Laboratories where he is currently working on A.I.M. (Alternative Intelligence for Machines) and A.I.R. (Alternative Intelligence for Robots) technologies. Cameron is the lead AI Engineer for the Knowledge Group at Advanced Software Construction Inc. He is a member of the advisory board for the NREF (National Robotics Education Foundation) and the Oak Hill Robotics Makerspace. He is the project leader of the technical team for the NEOACM CSI/CLUE Robotics Challenge and regularly organizes and directs robot programming workshops for varying robot platforms. Cameron Hughes is the co-author of many books and blogs on software development and Artificial Intelligence.

Tracey Hughes is a software and epistemic visualization engineer at Ctest Laboratories. She is the lead designer for the MIND, TAMI, and NOFAQS projects that utilize epistemic visualization. Tracey is also a member of the advisory board for the NREF (National Robotics Education Foundation) and the Oak Hill Robotics Makerspace. She is the lead researcher of the technical team for the NEOACM CSI/CLUE Robotics Challenge. Tracey Hughes is the co-author with Cameron Hughes of many books on software development and Artificial Intelligence.

References

[1] Ramsbotham, A. J. (2009). Collective Intelligence: Toward Classifying Systems of Systems. PerMIS'09.

[2] Schoenick, C., Clark, P., et al. (2017). Moving Beyond the Turing Test with the Allen AI Science Challenge. Communications of the ACM, 60(9), 60-64.


Crosswords Adi Botea (IBM Research, Ireland; [email protected]) DOI: 10.1145/3340470.3340480

[Crossword grid not reproducible in this text version.]

Across: 1) French city by the Strait of Dover 6) Go through a printed paper 12) A point in time 13) Protected with a concrete defense 14) A.C. ___, from Saved by the Bell 16) Seas at a raised level 17) Braxton, American singer 18) Undesired spot 20) Gear tooth 21) Be indebted 22) Mine in France 23) Ireland in the local language 24) Buckingham guard attire (2 wds.) 26) Summation circuit 27) Crying out loud 29) Follow up actively (2 wds.) 32) Said more formally 36) Nominated as a fellow 37) Continuous pain 38) Output of a mining procedure 39) Verb invoked with ability 40) Legal argument 41) Old tourist attraction 42) Unpleasant experience 44) Clothes area 46) Wonder from the world of music 47) Person hired to help 48) Unexcitingly 49) Present on the list of requirements

Down: 1) ___ and Pollux, the Gemini 2) Given away temporarily 3) Against 4) Curling surface 5) Croatian neighbor 6) Lose consciousness 7) Diplomatic skill 8) Pale at a party 9) Tranquil 10) Poem by Edgar Allan Poe 11) Dijkstra, the famous computer scientist 15) Prussian lion 19) Salad ingredient 22) React with no joy 23) ___ computing in IoT 25) Farm production 26) Before... the prefix 28) Resident near Cornell University 29) Approach with no good promise 30) Medicine on paper 31) Binary string? 33) Returned from a dream experience 34) Great Lake 35) Marked from an impact 37) Park way 40) Bucket 41) Entered a bike race 43) First lady 45) Metaphoric curtain in front of reality

Previous puzzle solution: ASNEER - ALISTS - SCARCE - BERTIE - LAMEST - BAREST - ORE - RAINIEST - PEI - SANE - TRUE - ESTONIA - MASER - HANGMAN - CRAMP - RAGTOPS - HARM - BARI - CRY - INTERIMS - TIS - ROUTER - BEZANT - PUREED - ATONCE - STORKS - REDEEM

Acknowledgment: I thank Karen McKenna for her feedback. The grid is created with the AI system Combus (Botea, 2007).

References

Botea, A. (2007). Crossword Grid Composition with a Hierarchical CSP Encoding. In Proceedings of the 6th CP Workshop on Constraint Modelling and Reformulation (ModRef-07).

Adi Botea is a research scientist at IBM Research, Ireland. His interests include AI planning, heuristic search, automated dialogue systems, pathfinding, journey planning, and multi-agent path planning. Adi has co-authored more than 70 peer-reviewed publications. He has published crossword puzzles in Romanian, in national-level publications, for almost three decades.

Copyright © 2019 by the author(s).
