
ARTIFICIAL INTELLIGENCE

POLICY FORUM

Data, privacy, and the greater good

Eric Horvitz1* and Deirdre Mulligan2*

1Microsoft Research, Redmond, WA 98052, USA. 2University of California, Berkeley, Berkeley, CA 94720, USA. *Corresponding author. E-mail: [email protected] (E.H.); [email protected] (D.M.)

Large-scale aggregate analyses of anonymized data can yield valuable results and insights that address public health challenges and provide new avenues for scientific discovery. These methods can extend our knowledge and provide new tools for enhancing health and wellbeing. However, they raise questions about how to best address potential threats to privacy while reaping benefits for individuals and for society as a whole. The use of machine learning to make leaps across informational and social contexts to infer health conditions and risks from nonmedical data provides representative scenarios for reflections on directions for balancing innovation and regulation.

What if analyzing Twitter tweets or Facebook posts could identify new mothers at risk for postpartum depression (PPD)? Despite PPD's serious consequences, early identification and prevention remain difficult. Absent a history of depression, detection is largely dependent on new mothers' self-reports. But researchers found that shifts in sets of activities and language usage on Facebook are predictors of PPD (1) (see the photo). This is but one example of promising research that uses machine learning to derive and leverage health-related inferences from the massive flows of data about individuals and populations generated through social media and other digital data streams. At the same time, machine learning presents new challenges for protecting individual privacy and ensuring fair use of data. We need to strike a new balance between controls on collecting information and controls on how it is used, as well as pursue auditable and accountable technologies and systems that facilitate greater use-based privacy protections.

[Photo: Machine learning can make "category-jumping" inferences about health. New mothers' activities and language usage on social media are predictors of postpartum depression. Credit: iStock/Christine Glade]

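To make the kind of inference at issue concrete, the sketch below trains a toy risk classifier on simple activity and language features. It is a minimal sketch, not the model of (1): the three features, the synthetic labels, and the choice of logistic regression are all assumptions made for illustration.

```python
# Minimal sketch: a PPD risk classifier over simple social media
# features. Hypothetical features, synthetic labels; NOT the model
# of ref (1), just an illustration of the pipeline's shape.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Assumed per-user features: weekly posting volume, rate of
# first-person pronouns, mean sentiment of recent posts.
X = np.column_stack([
    rng.poisson(20, n),        # posts_per_week
    rng.uniform(0.0, 0.3, n),  # first_person_rate
    rng.normal(0.0, 1.0, n),   # mean_sentiment
])
# Synthetic labels loosely tying risk to low activity, high
# self-focus, and low sentiment (a made-up generative story).
score = -0.05 * X[:, 0] + 4.0 * X[:, 1] - 0.8 * X[:, 2]
y = (score + rng.normal(0, 1, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

The point is how little this requires: a few behavioral signals, observable from public posts, are enough to produce a health-related risk score.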
Researchers have coined terms, such as digital disease detection (2) and infodemiology (3), to define the new science of harnessing diverse streams of digital information to inform public health and policy, e.g., earlier identification of epidemics (4), modeling communicability and flow of illness (5), and stratifying individuals at risk for illness (6). This new form of health research can also inform and extend understandings drawn from traditional health records and human subjects research. For example, the detection of adverse drug reactions could be improved by jointly leveraging data from the U.S. Food and Drug Administration's Adverse Event Reporting System and anonymized search logs (7). Search logs can serve as a large-scale sensing system that can be used for drug safety surveillance—pharmacovigilance.
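A rough intuition for how search logs can support pharmacovigilance is a disproportionality signal: a drug name and a symptom co-occurring in search sessions more often than their individual frequencies would predict. The sketch below computes such a lift score over invented toy sessions; the session format and the terms are assumptions, and the analysis in (7) is substantially more careful.

```python
# Sketch of a disproportionality ("lift") signal for drug-safety
# surveillance over search sessions. Toy data; the method of
# ref (7) is far more careful than this.
sessions = [
    {"drugX", "headache"}, {"drugX", "nausea"}, {"drugX"},
    {"nausea"}, {"headache"}, {"drugX", "nausea"}, {"weather"},
]
N = len(sessions)

def lift(a: str, b: str) -> float:
    """Observed co-occurrence rate of terms a and b, divided by the
    rate expected if the two terms appeared independently."""
    pa = sum(a in s for s in sessions) / N
    pb = sum(b in s for s in sessions) / N
    pab = sum(a in s and b in s for s in sessions) / N
    return pab / (pa * pb) if pa and pb else 0.0

# A lift well above 1 flags a candidate adverse-reaction signal
# for human review; it is evidence of association, not causation.
print("lift(drugX, nausea) =", round(lift("drugX", "nausea"), 2))
```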
Infodemiology studies are typically large-scale aggregate analyses of anonymized data—publicly disclosed or privately held—that yield results and insights on public health questions across populations. However, some methods and models can be aimed at making inferences about unique individuals that could drive actions, such as alerting or providing digital nudges, to improve individual or public health outcomes.

Although digital nudging shows promise, a recent flare-up in the United Kingdom highlights the privacy concerns it can ignite. A Twitter suicide-prevention application called Good Samaritan monitored individuals' tweets for words and phrases indicating a potential mental health crisis. The app notified the person's followers so they could intervene to avert a potential suicide. But the app was shuttered after public outcry drew regulator concern (8). Critics worried the app would encourage online stalkers and bullies to target vulnerable individuals and collected 1200 signatures on a petition arguing that the app breached users' privacy by collecting, processing, and sharing sensitive information. Despite the developers' laudable goal of preventing suicide, the nonprofit was chastised for playing fast and loose with the privacy and mental health of those it was seeking to save (9).

Machine learning can facilitate leaps across informational and social contexts, making "category-jumping" inferences about health conditions or propensities from nonmedical data generated far outside the medical context. The implications for privacy are profound. Category-jumping inferences may reveal attributes or conditions an individual has specifically withheld from others. To protect against such violations, the United States heavily regulates health care privacy. But, although information about health conditions garnered from health care treatment and payment must be handled in a manner that respects patient privacy, machine learning and inference can sidestep many of the existing protections.

Even when not category-jumping, machine learning can be used to draw powerful and compromising inferences from self-disclosed, seemingly benign data or readily observed behavior. These inferences can undermine a basic goal of many privacy laws—to allow individuals to control who knows what about them. Machine learning and inference make it increasingly difficult for individuals to understand what others know about them based on what they have explicitly or implicitly shared. And these computer-generated channels of information about health conditions join other technically created fissures in existing legal protections for health privacy. In particular, it is difficult to reliably deidentify publicly shared data sets, given the enormous amount and variety of ancillary data that can be used to reidentify individuals.
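The reidentification problem described above is mechanically simple, which is part of why it is hard to regulate away. A toy linkage attack needs only a join on quasi-identifiers shared between a "deidentified" table and a public roster; every record below is fabricated for illustration.

```python
# Toy linkage attack: joining a "deidentified" health table to a
# public roster on shared quasi-identifiers. All records fabricated.
deidentified = [
    {"zip": "98052", "birth_year": 1984, "sex": "F", "condition": "PPD"},
    {"zip": "94720", "birth_year": 1970, "sex": "M", "condition": "diabetes"},
]
public_roster = [
    {"name": "Alice", "zip": "98052", "birth_year": 1984, "sex": "F"},
    {"name": "Bob", "zip": "94720", "birth_year": 1970, "sex": "M"},
]

QUASI_IDS = ("zip", "birth_year", "sex")

def key(row: dict) -> tuple:
    return tuple(row[k] for k in QUASI_IDS)

index = {key(r): r["name"] for r in public_roster}
for record in deidentified:
    name = index.get(key(record))
    if name is not None:  # a unique quasi-identifier combination reidentifies the row
        print(name, "->", record["condition"])
```

The well-known finding that ZIP code, sex, and date of birth uniquely identify a large share of the U.S. population is this same join carried out at national scale.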

The capacities of machine learning expose the fundamental limitations of existing U.S. privacy rules that tie the privacy protection of an individual's health status to specific contexts or specific types of information a priori identified as health information. Health privacy regulations and privacy laws in the United States generally are based on the assumption that the semantics of data are relatively fixed and knowable and reside in isolated contexts. Machine learning techniques can instead be used to infer new meaning within and across contexts and are generally unencumbered by privacy rules in the United States. Using publicly available Twitter posts to infer risk of PPD, for example, does not run afoul of existing privacy law. This might be unsurprising, and seem unproblematic, given that the posts were publicly shared, but there are troubling consequences.

Current privacy laws often do double duty. At a basic level, they limit who has access to information about a person. This implicitly limits the extent to which that information influences decision-making and thus doubles as a limit on the opportunities for information to fuel discrimination. Because of the heightened privacy sensitivities and concerns with health-related discrimination, we have additional laws that regulate the use of health information outside the health care context. U.S. laws specifically limit the use of some health information in ways considered unfair. For example, credit-reporting agencies are generally prohibited from providing medical information to make decisions about employment, credit, or housing. The Americans with Disabilities Act (ADA) prohibits discrimination on the basis of substantial physical or mental disabilities—or even a mistaken belief that an individual suffers from such a disability. If machine learning is used to infer that an individual suffers from a physical or mental impairment, an employer who bases a hiring decision on it, even if the inference is wrong, would violate the law.

But the ADA does not prohibit discrimination based on predispositions for such disabilities (10). Machine learning might discover those, too. In theory, the Genetic Information Non-Discrimination Act (GINA) should fill this gap by protecting people genetically predisposed to a disease. But again, machine learning exposes cracks in this protection. Although GINA prohibits discrimination based on information derived from genetic tests or a family history of a disease (11), it does not limit the use of information about such a disposition—even if it is grounded in genetics—inferred through machine learning techniques that mine other sorts of data. In other words, machine learning that predicts future health status from nongenetic information—including health status due to genetic predisposition—would circumvent existing legal protections (12).


Just as machine learning can expose secrets, it facilitates social sorting—placing individuals into categories for differential treatment—with good or bad intent and positive or negative outcomes. The methods used to classify individuals as part of beneficial public health programs and nudges can just as easily be used for more nefarious purposes, such as discrimination to protect organizational profits.

Policy-makers in the United States and elsewhere are just beginning to address the challenges that machine learning and inference pose to commitments to privacy and equal treatment. Although not specifically focused on health information, reports issued by the White House—discussing the potential for large-scale data analyses to result in discrimination (13)—and the Federal Trade Commission (FTC) have suggested new efforts to protect privacy, regulate harmful uses of information, and increase transparency.

The FTC is the key agency policing unfair and deceptive practices in the commercial marketplace, including those that touch on the privacy and security of personal information. Its proposed privacy framework encourages companies to combine technical and policy mechanisms to protect against reidentification. The FTC's proposed rules would work to ensure that data are both "not reasonably identifiable" and accompanied by public company commitments not to reidentify them. The same privacy rules should apply to downstream users of the data (14). This approach is promising for machine learning and other areas of artificial intelligence that rely on data-centric analyses. It allows learning from large data sets—and sharing them—by encouraging companies to reduce the risks that data pools and data sharing pose for individual privacy.
The FTC proposal grows, in part, from recent agency actions focused on inferences that we have deemed "context-jumping." In one high-profile case, Netflix publicly released data sets to support a competition to improve its recommendation algorithm. When outside researchers used ancillary data to reidentify and infer sensitive attributes about individuals from the Netflix data sets, the FTC worked with the company to limit future public disclosures—setting out the limits discussed above. In a similar vein, the FTC objected to a change in Facebook's defaults that exposed individuals' group affiliations, from which sensitive information, such as political views and sexual orientation, could be inferred (15).

Additionally, the FTC has made efforts to ensure that individuals can control tracking in the online and mobile environments. These are in part due to the nonobvious inferences that can be drawn from vast collections of data (16–18) and the subsequent risks to consumers, who may be placed in classifications that single them out for specific treatment in the marketplace (19, 20). In a related context, the FTC recommended that Congress require data brokers—companies that collect consumers' personal information and resell or share that information with others—to clearly disclose to consumers information about the data they collect, as well as the fact that they derive inferences from it (21). Here, too, the FTC appears concerned with not just the raw data, but inferences from its analysis.

The Obama Administration's Big Data Initiative has also considered the risks to privacy posed by machine learning and the potential downsides of using machine inferences in the commercial marketplace (22, 23), concluding that we need to update our privacy rules, increase technical expertise in consumer protection and civil rights agencies to address novel discrimination issues arising from big data, provide individuals with privacy-preserving tools that allow them to control the collection and manage the use of personal information, and increase transparency into how companies use and trade data. The Administration is also concerned with the use of machine learning in policing and national security. The White House report called for increased technical expertise to help civil rights and consumer protection agencies identify, investigate, and resolve uses of big data analytics that have a discriminatory impact on protected classes (24).
Note that reports and proposals from the Administration distinctly emphasize policies and regulations focused on data use rather than collection. While acknowledging the need for tools that allow consumers to control when and how their data are collected, the Administration's recommendations focus on empowering individuals to participate in decisions about future uses and disclosures of collected data (25). A separate report by the President's Council of Advisors on Science and Technology (PCAST) concluded that this was a more fruitful direction for technical protections. Both reports suggest that use-based protections better address the latent meaning of data—inferences drawn from data using machine learning—and can adapt to the scale of the data-rich and connected environment of the future (26). The Administration called for collaborative efforts to ensure that regulations in the health context will allow society to reap the benefits and mitigate the risks posed by machine learning and inferences. Use-based approaches are often favored by industry as well, which tends to view data as akin to a natural resource to be mined for commercial and public benefit; industry is resistant to efforts to constrain data collection.

Although incomplete and unlikely to be acted upon by the current gridlocked Congress, adoption of these recommendations would increase transparency about data's collection, use, and consequences. Along with efforts to identify and constrain discriminatory or unfair uses of data and inferences, they are promising steps. They also align with aspects of existing European privacy laws concerned with the transparency and fairness of data processing, particularly the risks to individuals of purely automated decision-making.

Current European Union (EU) law requires entities to provide individuals with access to the data on which decisions are rendered, as well as information about decision criteria [see Articles 12 and 15 of (18)]. Although currently governed by a Europe-wide directive, both provisions are a matter of national law. What exactly individuals receive when they request access to their data and to processing logic varies by country, as does the implementation of the limitation on "purely automated" processing. The EU is expected to adopt a data privacy regulation that will supplant local law with a single standard. Although the current draft includes parallel provisions, their final form is not yet known, nor is how they will ultimately be interpreted (27). In theory, a new EU requirement to disclose the logic of processing could apply quite broadly, with implications for public access to data analytics and algorithms. In the interim, a decision expected this summer in a case before the European Court of Justice may provide some detail as to what level of access to both data and the logic of processing is currently required under the EU Directive (28).

Improving the transparency of data processing to data subjects is both important and challenging. Although the goal may be to promote actual understanding of the workings or likely outputs of machine learning and reasoning methods, the workflows and dynamism of algorithms and decision criteria may be difficult to characterize and explain. For example, popular convolutional neural-network learning procedures (commonly referred to as "deep learning") automatically induce rich, multilayered representations that their developers themselves may not understand with clarity. Although high-level descriptions of procedures and representations might be provided, even an accomplished programmer with access to the source code would be unable to describe the precise operation of such a system or predict its output for a given set of inputs.
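The opacity described here is easy to demonstrate. The sketch below builds a small convolutional network and counts its free parameters; PyTorch is assumed as the framework purely for illustration, since the article names no toolkit. The learned behavior lives in a few thousand induced weights rather than in rules a programmer could narrate.

```python
# A small convolutional network, to illustrate why even its author
# cannot narrate its precise decision logic: the learned behavior
# lives in thousands of induced parameters, not in readable rules.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 2),  # e.g., a two-class output
)

n_params = sum(p.numel() for p in model.parameters())
print("trainable parameters:", n_params)

# The intermediate activations are the "rich, multilayered
# representations"; they carry no human-assigned semantics.
x = torch.randn(1, 1, 28, 28)  # one 28x28 grayscale input
print(model(x))                # two unitless scores
```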
Data's meaning has become a moving target. Data sets can easily be combined to reidentify data sets thought deidentified, and sensitive knowledge can be inferred from benign data that are routinely and promiscuously shared. These realities pose difficulties for current U.S. legal approaches to privacy protection, which regulate data on the basis of its identifiability and express meaning.

Use-based approaches are driven, in part, by the realization that focusing solely on limiting data collection is inadequate. In a way, this presupposes that data are an unalloyed good that should be collected on principle, whenever and wherever possible. Whereas we are not ready to abandon limits on data collection, we agree that use-based regulations, although challenging to implement, are an important part of the future legal landscape—and will help to advance privacy, equality, and the public good. To advance transparency and to balance the constraints they impose, use-based approaches would need to emphasize access, accuracy, and correction rights for individuals.
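One concrete reading of use-based regulation, and of the auditable, accountable systems called for earlier, is a machine-checkable use policy: each access declares a purpose, the policy decides, and every decision is logged for later review. The sketch below is our own minimal rendering of that idea, not a design taken from the cited reports; the policy table, purposes, and requesters are invented.

```python
# Minimal use-based access check with an audit trail. The policy
# table, purposes, and requesters are invented for illustration.
from datetime import datetime, timezone

POLICY = {
    # data category -> purposes for which use is permitted
    "search_logs": {"public_health_research"},
    "health_records": {"treatment", "payment"},
}
AUDIT_LOG = []

def request_use(category: str, purpose: str, requester: str) -> bool:
    allowed = purpose in POLICY.get(category, set())
    AUDIT_LOG.append({  # every decision is recorded, allowed or denied
        "time": datetime.now(timezone.utc).isoformat(),
        "requester": requester,
        "category": category,
        "purpose": purpose,
        "allowed": allowed,
    })
    return allowed

print(request_use("search_logs", "public_health_research", "lab-42"))  # True
print(request_use("search_logs", "ad_targeting", "broker-7"))          # False
print(len(AUDIT_LOG), "decisions logged for later review")
```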

The evolution of regulations for health information, although incomplete, provides a useful map for thinking about the challenges and opportunities we face today and frames potential solutions. In health care, privacy rules were joined by nondiscrimination rules and were always accompanied by special provisions to support research. Today, they are being joined by collective governance models designed to encourage pooling of data in biobanks that support research on health conditions while protecting collective interests in privacy.

Despite practical challenges, we are hopeful that informed discussions among policy-makers and the public about data and the capabilities of machine learning will lead to insightful designs of programs and policies that can balance the goals of protecting privacy and ensuring fairness with those of reaping the benefits to scientific research and to individual and public health. Our commitments to privacy and fairness are evergreen, but our policy choices must adapt to advance them and to support new techniques for deepening our knowledge.

REFERENCES AND NOTES

1. M. De Choudhury, S. Counts, E. Horvitz, A. Hoff, in Proceedings of the International Conference on Weblogs and Social Media [Association for the Advancement of Artificial Intelligence (AAAI), Palo Alto, CA, 2014].
2. J. S. Brownstein, C. C. Freifeld, L. C. Madoff, N. Engl. J. Med. 360, 2153–2155 (2009).
3. G. Eysenbach, J. Med. Internet Res. 11, e11 (2009).
4. D. A. Broniatowski, M. J. Paul, M. Dredze, PLOS ONE 8, e83672 (2013).
5. A. Sadilek, H. Kautz, V. Silenzio, in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI, Palo Alto, CA, 2012).
6. M. De Choudhury, S. Counts, E. Horvitz, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, 2013), pp. 3267–3276.
7. R. W. White, R. Harpaz, N. H. Shah, W. DuMouchel, E. Horvitz, Clin. Pharmacol. Ther. 96, 239–246 (2014).
8. Samaritans Radar; www.samaritans.org/how-we-can-help-you/supporting-someone-online/samaritans-radar.
9. Shut down Samaritans Radar; http://bit.ly/Samaritans-after.
10. U.S. Equal Employment Opportunity Commission (EEOC), 29 Code of Federal Regulations (C.F.R.) 1630.2(g) (2013).
11. EEOC, 29 C.F.R. 1635.3(c) (2013).
12. M. A. Rothstein, J. Law Med. Ethics 36, 837–840 (2008).
13. Executive Office of the President, Big Data: Seizing Opportunities, Preserving Values (White House, Washington, DC, 2014); http://1.usa.gov/1TSOhiG.
14. Letter from Maneesha Mithal, FTC, to Reed Freeman, Morrison & Foerster LLP, Counsel for Netflix, 2 [closing letter] (2010); http://1.usa.gov/1GCFyXR.
15. In re Facebook, Complaint, FTC File No. 092 3184 (2012).
16. FTC Staff Report, Mobile Privacy Disclosures: Building Trust Through Transparency (FTC, Washington, DC, 2013); http://1.usa.gov/1eNz8zr.
17. FTC, Protecting Consumer Privacy in an Era of Rapid Change: Recommendations for Businesses and Policymakers (FTC, Washington, DC, 2012).
18. Directive 95/46/EC of the European Parliament and of the Council, 24 October 1995.
19. L. Sweeney, Online ads roll the dice [blog]; http://1.usa.gov/1KgEcYg.
20. FTC, "Big data: A tool for inclusion or exclusion?" (workshop, FTC, Washington, DC, 2014); http://1.usa.gov/1SR65cv.
21. FTC, Data Brokers: A Call for Transparency and Accountability (FTC, Washington, DC, 2014); http://1.usa.gov/1GCFoj5.
22. J. Podesta, "Big data and privacy: 1 year out" [blog]; http://bit.ly/WHsePrivacy.
23. White House Council of Economic Advisers, Big Data and Differential Pricing (White House, Washington, DC, 2015).
24. Executive Office of the President, Big Data and Differential Processing (White House, Washington, DC, 2015); http://1.usa.gov/1eNy7qR.
25. Executive Office of the President, Big Data: Seizing Opportunities, Preserving Values (White House, Washington, DC, 2014); http://1.usa.gov/1TSOhiG.
26. President's Council of Advisors on Science and Technology (PCAST), Big Data and Privacy: A Technological Perspective (White House, Washington, DC, 2014); http://1.usa.gov/1C5ewNv.
27. European Commission, Proposal for a Regulation of the European Parliament and of the Council on the Protection of Individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation), COM(2012) 11 final (2012); http://bit.ly/1Lu5POv.
28. M. Schrems v. Facebook Ireland Limited, §J. Unlawful data transmission to the U.S.A. ("PRISM"), ¶166 and 167 (2013); www.europe-v-facebook.org/sk/sk_en.pdf.

10.1126/science.aac4520

REVIEW

Machine learning: Trends, perspectives, and prospects

M. I. Jordan1* and T. M. Mitchell2*

1Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley, CA, USA. 2Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA. *Corresponding author. E-mail: [email protected] (M.I.J.); [email protected] (T.M.M.)

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today's most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology, and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.

Machine learning is a discipline focused on two interrelated questions: How can one construct computer systems that automatically improve through experience? And what are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? The study of machine learning is important both for addressing these fundamental scientific and engineering questions and for the highly practical computer software it has produced and fielded across many applications.

Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. Within artificial intelligence (AI), machine learning has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs. The effect of machine learning has also been felt broadly across computer science and across a range of industries concerned with data-intensive issues, such as consumer services, the diagnosis of faults in complex systems, and the control of logistics chains. There has been a similarly broad range of effects across empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways. See Fig. 1 for a depiction of some recent areas of application of machine learning.

A learning problem can be defined as the problem of improving some measure of performance when executing some task, through some type of training experience. For example, in learning to detect credit-card fraud, the task is to assign a label of "fraud" or "not fraud" to any given credit-card transaction. The performance metric to be improved might be the accuracy of this fraud classifier, and the training experience might consist of a collection of historical credit-card transactions, each labeled in retrospect as fraudulent or not. Alternatively, one might define a different performance metric that assigns a higher penalty when "fraud" is labeled "not fraud" than when "not fraud" is incorrectly labeled "fraud." One might also define a different type of training experience—for example, by including unlabeled credit-card transactions along with labeled examples.
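The credit-card fraud example above translates directly into code. The sketch below trains the same kind of classifier under two performance metrics: plain accuracy, and a cost-sensitive variant in which a missed fraud is ten times as costly as a false alarm. The data are synthetic, and scikit-learn's class_weight option is one standard way, assumed here, to encode the asymmetric penalty the text describes.

```python
# The fraud learning problem from the text: one data set, two
# performance metrics. Synthetic transactions; class_weight encodes
# the higher penalty for labeling "fraud" as "not fraud".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))  # synthetic transaction features
# Infrequent positive ("fraud") class, roughly 15% of examples.
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, n) > 2.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

plain = LogisticRegression().fit(X_tr, y_tr)
# Missed fraud is treated as 10x as costly as a false alarm.
costly = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("cost-sensitive", costly)]:
    pred = clf.predict(X_te)
    print(f"{name:>15}  accuracy={accuracy_score(y_te, pred):.3f}"
          f"  fraud recall={recall_score(y_te, pred):.3f}")
```

As the text suggests, the cost-sensitive metric will typically trade some overall accuracy for substantially higher recall on the fraud class.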
A diverse array of machine-learning algorithms has been developed to cover the wide variety of data and problem types exhibited across different machine-learning problems (1, 2). Conceptually, machine-learning algorithms can be viewed as searching through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric. Machine-learning algorithms vary greatly, in part by the way in which they represent candidate programs (e.g., decision trees, mathematical functions, and general programming languages) and in part by the way in which they search through this space of programs (e.g., optimization algorithms with well-understood convergence guarantees and evolutionary search methods that evaluate successive generations of randomly mutated programs). Here, we focus on approaches that have been particularly successful to date.

Many algorithms focus on function approximation problems, where the task is embodied in a function (e.g., given an input transaction, output a "fraud" or "not fraud" label), and the learning problem is to improve the accuracy of that function, with experience consisting of a sample of known input-output pairs of the function. In some cases, the function is represented explicitly as a parameterized functional form; in other cases, the function is implicit and obtained via a search process, a factorization, an optimization
