IEEE SYMPOSIUM ON SECURITY AND PRIVACY


Enhancing Selectivity in Big Data

Mathias Lecuyer, Riley Spahn, and Roxana Geambasu | Columbia University
Tzu-Kuo Huang | Uber Advanced Technologies Group
Siddhartha Sen | Microsoft Research

Today’s companies collect immense amounts of personal data, exposing it to external hackers and privacy-transgressing employees. This study shows that only a fraction of the data is needed to approach state-of-the-art accuracy. We propose selective data systems designed to pinpoint the data that is valuable for a company’s workloads.

Driven by the immense perceived potential of "big data," Internet companies, advertisers, and governments are accumulating vast quantities of personal data: clicks, locations, social interactions, and more. While data offers unique opportunities to improve personal and business effectiveness, its aggressive collection and long-term archival pose significant risks for organizations. Hacking and exploiting sensitive corporate and governmental information have become commonplace. Privacy-transgressing employees have been discovered snooping into data stores to spy on friends and family. Although organizations strive to restrict access to particularly sensitive data (such as passwords, SSNs, emails, and banking data), properly managing access controls for diverse and potentially sensitive information is an open problem.

We hypothesize that not all data that is collected and archived by today's organizations is—or may ever be—actually needed to satisfy their workloads. We ask whether it is possible to architect data-driven systems, such as machine learning–based targeting and personalization systems, to permit a clean separation of data that is truly needed by an organization's current and evolving workload from data that is collected for potential future needs. The former, called in-use data, should be minimized in size, timespan, and sensitivity. The latter, called unused data, should be set aside and tapped only in exceptional circumstances (see Figure 1). The separation should permit day-to-day evolutions of an organization's workload by accessing just the in-use data, without the need to tap into the unused data. A system that achieves these goals without damaging functional properties, such as scalability, performance, and accuracy, is called a selective data system.

Selective Data Systems

Selective data systems can be used to improve data protection (see Figure 1). The ability to distinguish data needed now or in the likely future from data collected "just in case" can help organizations restrict the latter's exposure to attacks. For example, one could ship unused data to a tightly controlled store, whose read accesses are carefully audited and mediated. Intuitively, data that is accessed day-to-day is less amenable to certain kinds of protection (such as auditing or case-by-case access control decisions) than data accessed only for exceptional situations (such as launching a new application).

Turning selective data systems into a reality requires achieving two conflicting goals: (1) minimizing the in-use data while (2) avoiding the need to access the unused data to meet both current and evolving workload needs. This tension is traditional in operating systems, where many algorithms (for instance, caching) rely on processes having a working set of limited size that captures their data needs for a period of time. However, the context of modern, data-driven ecosystems brings new challenges that likely make traditional working set algorithms ineffective. For example, many of today's big data applications involve machine learning (ML) workloads that are periodically retrained to incorporate new data, by accessing all of the data. How can we determine a minimal training set, the "working set" for emerging ML workloads? And how can we ensure this training set is sufficient even when workloads evolve?

Figure 1. Selectivity concept. In-use data: protect as possible; minimize in size, time, and sensitivity. Unused data: protect vigorously; avoid access.

Approach Highlights

We observe that for ML workloads, significant research is devoted to limiting the amount of data required for training. The reasons are many but typically do not involve data protection. Rather, they include increasing performance, dealing with sparsity, and limiting labeling effort. Techniques such as dimensionality reduction, feature hashing,1 vector quantization,2 and count featurization3 are routinely applied in practice to reduce data dimensionality so models can be trained on manageable training sets. Active learning4 reduces the amount of labeled data needed for training when labeling requires manual effort. Can such mechanisms also be used to limit exposure of the data being collected? How can an organization that already uses these methods architect a selective data system around them? What kinds of protection guarantees can this system provide?

As a first step to answering these questions, we present Pyramid,5 a selective data system built around a specific training set minimization method called count featurization.3 (Pyramid was first introduced at the 2017 IEEE Symposium on Security and Privacy.) Count featurization is a widely used technique for reducing training sets by feeding ML algorithms with a small subset of the collected data combined (or featurized) with historical aggregates from much larger amounts of data. Pyramid builds upon count featurization to: keep a small, rolling window of accessible in-use data (the hot window); summarize the history with privacy-preserving aggregates (called counts); and train application models using hot window data featurized with counts. The counts are infused with differentially private noise6 to protect individual observations that are no longer in the hot window. Counts can support a variety of models that fall within the important class of supervised classification tasks. Historical raw data, which may be needed for workloads not supported by count featurization, such as unsupervised learning or regression tasks, is kept in an encrypted store whose decryption requires special access.

Our evaluation with two representative workloads—targeted advertising on the Criteo dataset and movie recommendation on the MovieLens dataset—reveals that: (1) historical counts let ML models approach state-of-the-art accuracy by training on under 1 percent of the data, (2) protecting historical counts with differential privacy has only 2 percent impact on accuracy, and (3) Pyramid works well for an important class of ML algorithms—supervised classification tasks—and can support workload evolution within that class.

Example Use Case

MediaCo, a media conglomerate, collects observations of user behavior from its hundreds of affiliate news and entertainment sites. Observations include the articles users read and share, the ads they click, and so on. MediaCo integrates all of this data into one repository and uses it to optimize many processes, such as article recommendation and ad targeting.
Because the data is needed by all of its engineering teams, MediaCo wants to provide them with wide access to the repository, but it worries about the risks of doing so given recent external hacking and insider attacks affecting other companies.

MediaCo decides to use Pyramid to limit the exposure of historical observations in anticipation of an attack. For MediaCo's main workloads—targeting and personalization—the company already uses count featurization to address sparsity challenges; hence, Pyramid is directly applicable. It configures Pyramid by keeping its hot window of raw observations and its noise-infused historical counts in the widely accessible repository, allowing all engineers to train their models, tune them, and explore new algorithms. Pyramid absorbs current and evolving workload needs as long as the algorithms draw on the same user data to predict the same outcome (for instance, whether a user will click on an ad). In addition, MediaCo stores all raw observations in an encrypted store whose read accesses are disabled by default. Access to this store is granted temporarily and on a case-by-case basis to engineers who demonstrate the need for observations beyond those that Pyramid maintains. With this configuration, MediaCo minimizes data access (and hence exposure) to a needs basis.

Figure 2. Threat model. A timeline marks T_attack − ΔHot, T_attack, and T_attack^stop. The hot data store and the historical counts store permit unrestricted access (can be compromised); the historical raw data store permits restricted access (assumed not compromisable). Data exposure to the attack over time: unexposed, exposed, unexposed.

Threat Model

Figure 2 illustrates Pyramid's threat model and guarantees. Pyramid ensures that a one-time compromise will not allow an adversary to access past data. Attacks are assumed to have a well-defined start time, T_attack, when the adversary gains access to the machines charged with running Pyramid, and a well-defined end time, T_attack^stop, when administrators discover and stop the intrusion. Adversaries are assumed to not have had access to the system before T_attack, nor to have performed any action in anticipation of their attack (for example, monitoring external predictions, the hot window, or the models' state), nor to have continued access after T_attack^stop. The attacker aims to exfiltrate individual observations of user activities (for instance, to know if a user clicked on a specific ad). Historical raw data is assumed to be protected through independent means and not compromised in this attack. Pyramid aims to limit the hot data and protect the historical counts, both of which are accessible to the attacker.

After compromising Pyramid's internal state, the attacker gains access to data in three representations: the hot data store containing raw observations, the historical counts, and the trained models. The raw observations are not protected in any way. The historical counts consist of differentially private count tables of the recent past. The attacker learns some aggregate information from the count tables, but individual records are protected with differential privacy. Pyramid forces models to be retrained when observations are removed from the hot raw data store to avoid past information leaking through the models. We assume that no out-of-bound copies of the hot data exist.

Selectivity Requirements

Four requirements define selective data systems:

■ R1: Reduce in-use, exposed data. The hot data window is exposed to attackers; hence, Pyramid must limit its size and timespan subject to application-level functional requirements, such as the accuracy of models trained with it.
■ R2: Protect unused data from in-use data structures. Any state reflecting past, unused data and retained by Pyramid for prolonged periods of time (such as count tables) must be protected with strong, differential privacy guarantees.
■ R3: Limit impact on accuracy and performance. Pyramid must preserve the functional properties of applications, such as model accuracy. This requirement is at odds with the preceding two; hence, Pyramid must find a balance between functionality and protection.
■ R4: Allow workload evolution. The in-use data must support as many current and evolving workload needs as possible to limit access to, and therefore exposure of, the historical raw data.

The Pyramid Architecture

Pyramid combines and augments two known methods—count featurization from ML and differential privacy from privacy-preserving data analysis—to meet the preceding selectivity requirements for a specific class of workloads: classification tasks such as targeting and personalization. Figure 3 shows Pyramid's architecture.

Pyramid is a data management component to be deployed alongside a model management system. It acts as an interface between the model manager and the organization's datasets. Pyramid controls what data is exposed to the models and the format of this data.

Figure 3. The Pyramid architecture. The model manager hosts the application models (Models 1–3 and future models) and calls Pyramid's featurize(x) → x' and getTrainSet() APIs, retraining on the count-featurized hot data (⟨x', l⟩ pairs). Pyramid applies count featurization and differential privacy over its in-use count tables and hot raw data store (⟨x, l⟩ pairs), ingests new observations via observe(x, l), and archives them to the unused historical raw data store.

Pyramid maintains two data structures—the hot raw data store and the historical count tables—and leverages two functional building blocks: count featurization (CF) and differential privacy (DP). The hot data store is a cache of recent data containing raw observations (⟨feature vector, label⟩ pairs) that are transformed using CF and used for training. It is a sliding window cache that we expect to contain on the order of days or weeks of data. The historical count tables (or count tables for short) store the number of times each feature value has been observed with each label. We leverage DP to protect the past observations used to construct the count tables. We expect the count tables to be populated with months' or years' worth of data, and erase observations from count tables past a given retention period.

Pyramid exposes a small API, which the model manager uses to interact with the organization's data: featurize() and getTrainSet(). On prediction requests, the model manager calls featurize(x) to obtain a count-featurized version of the raw feature vector x, denoted x'. Pyramid also manages new observations added to the system. Given a new observation, consisting of a label (l) and feature vector (x), Pyramid updates the appropriate count tables, adds the observation to the hot data store, and submits it to the historical raw data store. To train its models, the model manager calls getTrainSet() to obtain a count-featurized version of all observations in the hot data store.

The two building blocks—CF and DP—address the first two selectivity requirements from the previous section. CF, a known training set minimization method, reduces the amount of in-use data needed to train application-level models (selectivity requirement R1). It works by augmenting recent hot data with historical counts. DP protects the unused, phased-out observations from the historical count tables and other in-use data structures (selectivity requirement R2). Unfortunately, these methods raise substantial accuracy, scalability, and evolution challenges. We address the challenges by augmenting CF and DP with a set of novel mechanisms to meet the remaining selectivity requirements R3 and R4. Following are four technical sections that describe our base CF and DP processes, plus two of our augmentation mechanisms (for others, see our symposium paper5). These sections provide a blueprint for how to build a selective data system based on training set minimization.
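To make the data flow concrete before describing each mechanism, here is a minimal sketch of how the featurize()/getTrainSet()/observe() flow of Figure 3 fits together. It is an illustration of the architecture described above, not Pyramid's actual code: everything beyond the API names (the class, the plain-dictionary count tables, the archive hook) is our own simplification, and it omits sketches, windowing, DP noise, and feature-group counts, which the following sections add.

```python
from collections import defaultdict, deque

class PyramidSketch:
    """Illustrative skeleton of Pyramid's data flow (not the real implementation)."""

    def __init__(self, hot_window_size, labels=("click", "nonclick")):
        self.labels = labels
        self.hot_store = deque(maxlen=hot_window_size)  # in-use: recent (x, label) pairs
        # count_tables[feature_index][feature_value][label] -> count
        self.count_tables = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def observe(self, x, label):
        """Ingest one observation: update counts, the hot window, and the archive."""
        for i, value in enumerate(x):
            self.count_tables[i][value][label] += 1
        self.hot_store.append((x, label))   # oldest entries fall out automatically
        self._archive(x, label)             # unused: restricted, encrypted store

    def featurize(self, x):
        """Replace each feature value with P(first label | value) from the counts."""
        x_prime = []
        for i, value in enumerate(x):
            counts = self.count_tables[i][value]
            total = sum(counts[l] for l in self.labels)
            x_prime.append(counts[self.labels[0]] / total if total
                           else 1.0 / len(self.labels))
        return x_prime

    def get_train_set(self):
        """Count-featurized hot window: the only data models ever train on."""
        return [(self.featurize(x), label) for x, label in self.hot_store]

    def _archive(self, x, label):
        """Placeholder for writing to the historical raw data store."""
```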
Count Featurization (R1: Reduce In-Use Data)

CF is a popular technique for handling high-cardinality categorical variables, such as unique identifiers, when training classification models.3 CF replaces each feature value with the number of times that feature value has been observed with each label and the conditional probability of each label given that feature. This leads to a dramatic dimensionality reduction over standard one-hot encoding. In a dataset where each observation contains d features with average feature cardinality K and labels of cardinality L, where K » L, a one-hot encoded feature vector would have size dK; CF results in a feature vector of size dL. In many cases this will be a dramatic reduction. For example, in click prediction, we expect L to be 2—click or nonclick—while K is very large, for instance, millions of user identifiers. Although no theoretical analysis exists for CF, intuitively, this dimensionality reduction should allow more efficient learning and reduce training data size without a loss in predictive power.

In Pyramid, we use CF to reduce the in-use data needed to train classification models (selectivity requirement R1). Figure 4 shows an example of count tables constructed by Pyramid to count-featurize observations as well as a sample count-featurized observation.

In more detail, an observation's feature vector x might consist of user features (for instance, ID, gender, age, preferences) and contextual information (for instance, the URL of the article or the ad shown to the user). The label l might indicate whether the user clicked on the article or ad.

Once an observation stream of the preceding type is registered with Pyramid, the system creates a number of count tables. Each count table is stored as a count sketch data structure (described shortly). For example, the userId table encodes the user's propensity to click on ads by maintaining for each user the total number of clicks and nonclicks on any ad shown.

To count-featurize a feature vector x = ⟨x1, x2, ..., xd⟩, Pyramid first replaces each of its features with the conditional probabilities computed from the count tables, for example, x' = ⟨P(click|x1), P(click|x2), ..., P(click|xd)⟩, where P(click|xi) = clicks/(clicks + nonclicks), taken from the row matching the value of xi in the table corresponding to xi. Pyramid also appends to x' the conditional probabilities for any feature combinations it maintains. Figure 4b shows an example of a feature vector x and its count-featurized version x'.

Figure 4. Count featurization example: (a) count tables constructed by Pyramid to count-featurize observations (one table per feature or feature combination, per time window) and (b) a sample count-featurized observation.

(a)
  userId   clicks  nonclk        urlHash  clicks  nonclk        adId     clicks  nonclk
  0x1111   50      950           0x7777   15,000  85,000        0xAAAA   20,000  180,000

(b)
  x  = <0x1111, 0x7777, 0xAAAA, ...>
  x' = <0.05, 0.15, 0.1, ...>

Suppose a boosted-tree model is trained on a count-featurized dataset (⟨x', l⟩ pairs). It might find that for users with a click propensity over 0.04, the chances of a click are high for ads whose clickability exceeds 0.05 placed on websites with ad-clickability over 0.1. Then, for the example feature vector in Figure 4, the model would predict a "click" label.
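As a sanity check, the arithmetic behind Figure 4 is easy to reproduce. The snippet below uses the figure's own numbers; the helper name p_click is ours.

```python
# Count tables from Figure 4a: feature value -> (clicks, nonclicks).
tables = {
    "userId":  {"0x1111": (50, 950)},
    "urlHash": {"0x7777": (15_000, 85_000)},
    "adId":    {"0xAAAA": (20_000, 180_000)},
}

def p_click(clicks, nonclicks):
    # P(click | feature value) = clicks / (clicks + nonclicks)
    return clicks / (clicks + nonclicks)

x = [("userId", "0x1111"), ("urlHash", "0x7777"), ("adId", "0xAAAA")]
x_prime = [p_click(*tables[feature][value]) for feature, value in x]
print(x_prime)  # [0.05, 0.15, 0.1], matching x' in Figure 4b
```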
Differential Privacy (R2: Protect Unused Data from In-Use Structures)

DP comprises a family of techniques that randomize function outputs to protect function inputs. We use DP to protect unused, past observations from exposure through count tables and other in-use data structures (selectivity requirement R2). Let D1 be the database of past observations, D2 be a database that differs from D1 by exactly one observation (that is, D2 adds or removes one observation), and S the range of all possible count tables that can result from a randomized query Q() that builds a count table from a window of observations. The count table query Q() is ε-differentially private if P[Q(D1) ∈ S] ≤ e^ε × P[Q(D2) ∈ S].6 Adding or removing an observation in D1 thus does not significantly change the probability distribution of possible count tables; ε is the query's privacy budget.

In Pyramid, we apply DP as shown in Figure 5. We split time into windows and maintain count tables separately for each window of time. Upon rolling over the hot window, we seal the window's count tables and create new count tables for the new hot window. To create a count table, we initialize each cell in that table with a random draw from a Laplace distribution.6 As observations arrive in the hot window, the count tables are updated by incrementing the appropriate cells. To count-featurize a feature xi, we sum the corresponding entries in the feature's count tables across the past time windows, excluding the under-construction count tables for the current hot window. The sum constitutes a noisy version of xi's count over the data retention period and is used to compute the conditional probabilities. With this mechanism, Pyramid ensures that past observations, which have been phased out of the in-use hot window, are protected from the count tables. To protect past observations from the trained models, Pyramid additionally forces retraining of all application models when rolling over the hot window.

Figure 5. Windowed DP count tables. Raw observations (⟨x, label⟩ pairs) from the present hot window feed in-progress DP count tables; on rollover these are sealed, and count featurization (x → x') sums (Σ) the sealed DP count tables from past windows.
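The windowed scheme of Figure 5 can be sketched as follows, assuming for simplicity that each count table is a plain array rather than a sketch. Every cell starts from a Laplace draw with scale 1/ε (the table-building query has sensitivity 1, since adding or removing one observation changes one cell of the table by 1); sealed tables are summed for featurization, and the in-progress table is excluded. The class and method names are ours, not Pyramid's.

```python
import numpy as np

class WindowedCountTable:
    """One feature's DP count tables, one per hot window (compare Figure 5)."""

    def __init__(self, num_values, num_labels, epsilon=0.2, seed=None):
        self.shape = (num_values, num_labels)
        self.scale = 1.0 / epsilon          # Laplace scale for sensitivity-1 counts
        self.rng = np.random.default_rng(seed)
        self.sealed = []                    # DP tables of past windows
        self.current = self._fresh_table()  # in-progress table for the hot window

    def _fresh_table(self):
        # Noise-initialize every cell so the table is DP once sealed.
        return self.rng.laplace(0.0, self.scale, size=self.shape)

    def observe(self, value_idx, label_idx):
        self.current[value_idx, label_idx] += 1

    def roll_over(self):
        """Seal the hot window's table and start a new one (models are retrained)."""
        self.sealed.append(self.current)
        self.current = self._fresh_table()

    def noisy_counts(self, value_idx):
        # Sum over sealed windows only; the in-progress table is never read.
        return sum(table[value_idx] for table in self.sealed)
```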

Count-Median Sketch (R3: Limit Impact on Accuracy and Performance)

Although CF and DP are known mechanisms, their integration raises substantial accuracy challenges, which we have addressed by designing a number of mechanisms (selectivity requirement R3). As an example, we find that DP interacts poorly with count-min sketches, which are routinely used in CF implementations to keep the count table storage overhead practical. For a categorical variable of cardinality K and a label of cardinality L, the count table is of size O(LK), an impractical prospect if one were to store the exact table. Count-min sketches store approximate counts in sublinear space by using a 2D array with an independent hash function for each row. Adding an entry to the sketch involves using the hash function associated with each row to map the value to a column in the row and incrementing that cell; reading an entry from the sketch involves taking the minimum of those cells. Without noise, taking the minimum reduces the chance of overcounting from collisions and results in tight error bounds for the counts. With Laplacian noise, which is centered around zero, taking the minimum across multiple draws of the Laplacian introduces substantial negative bias in the counts, breaking the sketch's error bounds.

Our solution is to instead use count-median sketches to store count tables compactly.7 These differ from count-min sketches in two ways: (1) each row i has a second hash function s_i that maps the key to a random sign s_i(key) ∈ {−1, +1}, and the cell selected by h_i(key) is updated with s_i(key) rather than incremented; and (2) the estimate is the median of all the row counts, each multiplied by its sign. Without noise, count-median sketches offer worse error bounds than count-min sketches for typical distributions. With Laplacian noise, the signed update of the count-median sketch means the expected impact of collisions is zero, because they have an equal chance of being negative or positive. This removes the bias and improves the quality of the count estimator. Our evaluation shows that for small ε, it is worth trading the count-min sketch's better guarantees for the count-median sketch's reduced noise impact.

Count Table Selection (R4: Allow Workload Evolution)

Two aspects of Pyramid's design enable workload evolution without tapping into the historical raw data store (selectivity requirement R4). First, CF is a model-independent preprocessing step, allowing Pyramid to support small model changes, such as hyperparameter tuning or learning algorithm changes, without accessing the historical raw data store. Second, Pyramid includes an automatic process of count table selection that inspects the data to identify feature groups worth counting together. This is important because CF does not capture relationships between features. For example, a userId and adId together may be more predictive than either of the individual features. That information could be inferred by a learning algorithm from the raw data, but it is lost through CF unless we explicitly maintain a count table for the (userId, adId) group.

Our goal in count table selection is to identify feature groups that provide more information about the label than individual features. For each feature xi, we find all other features xj such that xi and xj together exhibit higher mutual information (MI)—a general measure of dependence between two random variables—with the label than xi alone. From these groups, we select a configurable number with the highest MIs. To find promising groups of larger sizes, we apply this process greedily, trying out new features with existing groups. For each selected group, Pyramid creates and maintains a count table. This exploration of promising groups operates periodically on the hot window. Count table selection must be performed differentially privately to ensure that the groups selected for a particular hot window do not leak information about that window in the future.
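The greedy MI search over feature pairs can be sketched as below, using scikit-learn's discrete mutual information estimator over the hot window; larger groups would be found by repeating the search with already-selected groups treated as single features. In Pyramid this selection must itself be made differentially private, which this toy version (our own simplification) omits. Here X is assumed to be a mapping from feature name to that feature's column of values, and y the label column.

```python
from sklearn.metrics import mutual_info_score

def select_feature_groups(X, y, feature_names, num_groups=5):
    """Pick feature pairs whose joint MI with the label beats each member's MI."""
    base_mi = {f: mutual_info_score(y, X[f]) for f in feature_names}
    candidates = []
    for i, fi in enumerate(feature_names):
        for fj in feature_names[i + 1:]:
            # Treat the pair as one categorical feature by joining its values.
            joint = [f"{a}|{b}" for a, b in zip(X[fi], X[fj])]
            mi = mutual_info_score(y, joint)
            if mi > max(base_mi[fi], base_mi[fj]):   # the group adds information
                candidates.append((mi, (fi, fj)))
    # Keep a configurable number of groups with the highest MI.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [group for _, group in candidates[:num_groups]]
```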

Evaluation

We evaluate how Pyramid meets the four selectivity requirements. We use two public datasets: (1) Criteo, which consists of 45M points of ad-click data that was part of a Kaggle competition, and (2) MovieLens,8 which consists of 22M ratings in the range 0–5 on 34K movies from 240K users. As baselines, we use a feed-forward neural network for Criteo and collaborative filtering for MovieLens, both trained using Vowpal Wabbit on 80/20 percent train/test splits. We highlight four results:

■ R1: CF reduces in-use data by two orders of magnitude while incurring less than 3 percent loss in accuracy. Figure 6a shows that for MovieLens, the CF boosted-tree algorithms perform within 4 percent of the collaborative filtering baseline with only 0.8 percent of the training set. This is on par with the raw model. Figure 6b shows similar results with the Criteo dataset. The CF neural network performs within 3 percent of the baseline raw neural network when trained on 0.4 percent of the training set. CF fulfills its mission of reducing training set exposure and provides a good basis for selectivity. However, Pyramid must also protect past data using DP, which may have a negative impact on accuracy.

■ R2: DP provides meaningful protection of unused data for up to 2 percent additional loss in accuracy. Figure 6c shows the Criteo models' accuracy when count tables are protected with DP using various privacy budgets. A lower value of ε corresponds to higher levels of noise and increased protection. All of the Criteo models perform within 5 percent of the baseline when using ε = 0.2 as a privacy budget, which is considered in the DP literature to give high-quality protection. MovieLens is less resilient to noise but still performs within 5 percent of the baseline when the privacy budget is at ε = 1, a value still considered to offer meaningful protection. Thus, in both cases, Pyramid provides meaningful protection through data reduction with CF and past-data protection with DP. These results are obtained with the complete Pyramid implementation, which includes several mechanisms that augment CF and DP to address substantial accuracy challenges.5 We next show the importance of one mechanism, the count-median sketch.

■ R3: The count-median sketch helps mitigate the negative impact on accuracy. The count-median sketch resolves a tension arising when applying DP to standard implementations of CF, which rely on count-min sketches to compactly store count tables for high-dimensional features. Count-min sketches exhibit a strong downward bias when initialized with DP noise because they take the minimum, whereas count-median sketches take the (unbiased) median. For our workloads (MovieLens and Criteo), count-median sketches bring us 0.5 percentage points closer to the baseline losses than count-min sketches when training with 0.8 percent of the data and a DP parameter of ε = 1. By contrast, without noise, count-min sketches perform better than count-median sketches, particularly for MovieLens, where the loss difference is about 2 percentage points. This result provides a broader lesson for anyone aiming to protect a sketch with DP: one should choose an unbiased sketch implementation even if it is suboptimal in no-noise scenarios.

■ R4: Supported workloads and evolution. Our choice of CF as the core building block to address the first selectivity requirement—minimizing in-use data—exhibits both benefits and limitations. As a model-independent featurization layer, CF allows some common model changes without accessing historical raw data. Developers can fine-tune model hyperparameters, try different learning algorithms on the count-featurized data, change their learning framework (for example, TensorFlow versus Vowpal Wabbit), and add/remove features that are already counted to/from their models. Indeed, our evaluation applied various algorithms and hyperparameters using the same count tables. Augmented with count table selection, CF can additionally allow developers to incorporate more complex count features into their models. For example, for MovieLens, the boosted-tree algorithm using Pyramid-selected feature groups gets within 3 percent of the baseline loss, compared to within 4 percent without the groups, training on the same 0.8 percent of the raw data.

Figure 6. Normalized logistic loss for raw and count algorithms versus the fraction of the training set (log scale): (a) MovieLens algorithms, (b) Criteo algorithms, and (c) Criteo protection (performance at different levels of protection, ε ∈ {0.1, 0.2, 1}). B indicates baseline: neural network for Criteo, collaborative filtering (SVD) for MovieLens.

However, CF restricts the workloads we can support. CF is designed for supervised classification tasks, which are commonly used in industry for targeting and personalization. The preceding evaluations exercised such tasks. Regression and unsupervised classification tasks, also popular in industry, are not easily modeled with CF. In regression, the predicted label is continuous, while CF requires a discrete label to count its occurrences with various feature values. Regression tasks can be handled by binning the label value, but the outcome is poor. For example, we performed a regression for MovieLens that predicted a continuous 0–5 movie rating. Using as baseline a model running on the full raw data, count models get at best within 16 percent of the baseline loss using linear regression and within 13 percent using gradient boosted-tree regression. Our vision for supporting a wider range of workloads is to incorporate into Pyramid other training set minimization and modeling mechanisms that support complementary workloads. The key challenge will be to preserve Pyramid's protection semantics and selectivity requirements.

Related Work

Multiple data minimization methods exist in the ML and non-ML literature. Some are sketches, which compute compact representations of the data that support queries of summary statistics;9 streaming/online algorithms10 process data using bounded memory, retaining only the information relevant to the problem at hand; hash featurization1 condenses high-cardinality categorical variables; autoencoders attempt to learn a compressed identity function for specific workloads;11 and active learning4 reduces the amount of labeled data needed for training. These methods have been shown to improve performance, robustness, or labeling cost, but have never been used for protection. This article gives a blueprint for how to retrofit one training set minimization method—CF—for protection. Applying this blueprint to other methods is a fruitful avenue for constructing selective data systems.

Pyramid applies differential privacy6 to its count tables to protect individual observations. However, selectivity differs from the standard threat model used by systems that enforce differential privacy:12–14 these systems guarantee privacy of released results, but assume that there is a trusted party to mediate access to the entire raw dataset. Pan privacy15 has a threat model similar to ours, allowing a one-time compromise of the system, but to our knowledge no such system has been implemented or evaluated in practice. A major difference between Pyramid and all of this work is the exposure of raw data through the hot window. This brief exposure is what enables ML models to be trained with state-of-the-art performance.

Today's indiscriminate data collection and archival practices are risky and unsustainable. It is time for a more selective approach to big data collection and protection. Our vision involves architecting data systems to allow a clean separation of data needed by current and evolving workloads from data collected and archived for possible future needs. The former should be minimized; the latter should be set aside, protected vigorously, and only tapped under exceptional circumstances. This article has formulated the requirements for selective data systems and presented Pyramid, an initial selective data system that combines and augments two known methods—count featurization and differential privacy—to meet these requirements for supervised classification workloads. Experiments with Pyramid show that data can be trimmed by two orders of magnitude without disrupting performance.

References
1. Q. Shi et al., "Hash Kernels for Structured Data," The Journal of Machine Learning Research, vol. 10, 2009, pp. 2615–2637.
2. A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, vol. 159, Springer Science & Business Media, 2012.
3. A. Srivastava, A.C. Konig, and M. Bilenko, "Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams," ACM SIGMOD Conference, 2016.
4. B. Settles, "Active Learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, 2012, pp. 1–114.
5. M. Lecuyer et al., "Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization," Proc. IEEE Symposium on Security and Privacy (SP), 2017.
6. C. Dwork et al., "Calibrating Noise to Sensitivity in Private Data Analysis," Proceedings of the Third Conference on Theory of Cryptography (TCC 06), Springer-Verlag, 2006, pp. 265–284.

7. M. Charikar, K. Chen, and M. Farach-Colton, "Finding Frequent Items in Data Streams," Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP 02), Springer-Verlag, 2002, pp. 693–703.
8. F.M. Harper and J.A. Konstan, "The MovieLens Datasets: History and Context," ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, 2015, pp. 19:1–19:19.
9. G. Cormode and S. Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications," Journal of Algorithms, vol. 55, no. 1, 2005, pp. 58–75.
10. S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, 2011, pp. 107–194.
11. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Cambridge, Massachusetts: The MIT Press, 2016.
12. F.D. McSherry, "Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis," Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD 09), 2009, pp. 19–30.
13. I. Roy et al., "Airavat: Security and Privacy for MapReduce," NSDI, vol. 10, 2010, pp. 297–312.
14. F. McSherry and I. Mironov, "Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders," Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 627–636.
15. C. Dwork et al., "Pan-Private Streaming Algorithms," ICS, 2010, pp. 66–80.

Mathias Lecuyer is a PhD candidate at Columbia University. Before joining Columbia's PhD program, he received a BS and an MS from Ecole Polytechnique in France. In his research, Lecuyer builds tools and designs mechanisms that leverage statistics and machine learning to increase the current web's transparency by revealing how personal data is being used, and to enable a more rigorous and selective approach to big data collection, access, and protection, to reap its benefits without imposing undue risks.

Riley Spahn is a PhD candidate in computer science at Columbia University. He is interested in building systems and tools that enable machine learning workloads to be run more securely and privately. Spahn received his MS in computer science from Columbia University and his BS in software engineering from Auburn University. He received the 2015 Google PhD Fellowship in privacy.

Roxana Geambasu is an associate professor of computer science at Columbia University and a member of Columbia's Data Sciences Institute. She joined Columbia in fall 2011 after finishing her PhD at the University of Washington. For her work in cloud and mobile data privacy, she received an Alfred P. Sloan Faculty Fellowship, a Microsoft Research Faculty Fellowship, an NSF CAREER award, a "Brilliant 10" Popular Science nomination, an Early Career Award in Cybersecurity from the University of Washington Center for Academic Excellence, the Honorable Mention for the 2013 inaugural Dennis M. Ritchie Doctoral Dissertation Award, a William Chan Dissertation Award, two best paper awards at top systems conferences, and the first Google PhD Fellowship in cloud computing.

Tzu-Kuo (TK) Huang is a senior engineer at Uber Advanced Technologies Group working on self-driving technology. He has worked on various topics in machine learning, including active learning, dynamic model learning, and ranking. He completed his PhD in machine learning at Carnegie Mellon University, followed by postdoc research at Microsoft Research, New York City.

Siddhartha Sen is a researcher at Microsoft Research in New York City. He creates distributed systems that use novel data structures and algorithms to deliver unprecedented functionality or performance. Recently, he has generalized this approach by using contextual machine learning to optimize various decisions in Microsoft's cloud infrastructure. Sen received his BS degrees in computer science and mathematics and his MEng degree in computer science from MIT. From 2004 to 2007 he worked as a developer at Microsoft and built a network load balancer for Windows Server. He returned to academia and completed his PhD at Princeton University in 2013. Sen received the first Google Fellowship in Fault-Tolerant Computing in 2009, the best student paper award at PODC 2012, and the best paper award at ASPLOS 2017.
