Enhancing Cloud Security Using Data Anonymization

IT@Intel White Paper
Intel IT
IT Best Practices
Cloud Computing and Information Security
June 2012

Executive Overview

Intel IT is exploring data anonymization (the process of obscuring published data to prevent the identification of key information) in support of our vision of a hybrid cloud computing model and our need to protect the privacy of our employees and customers. We believe data anonymization is a viable technique for enhancing the security of cloud computing.

Although we realize that a 100-percent secure cloud infrastructure is impossible, we are exploring the possibility of anonymizing data to augment our cloud security infrastructure. Data anonymization makes data worthless to others, while still allowing Intel IT to process it in a useful way.

We conducted a proof of concept (PoC), in which we used data anonymization to protect event logging data stored in a public cloud. The PoC was successful in demonstrating that data anonymization can work and that obscured data is still useful for analysis. We were able to perform both performance analysis and security analysis on the anonymized data.
• In our performance analysis, we discovered performance issues; for example, one web site performed two redirects before the user accessed the actual content, thereby increasing the access time.
• In our security analysis, we were able to detect real security events, such as probes occurring on the web server.

Although more research is necessary before it is ready for production use, data anonymization can ease some security concerns, allowing for simpler demilitarized zone and security provisioning and enabling more secure cloud computing.

We plan to explore data anonymization further, including conducting a more extensive PoC, developing further use cases for data anonymization, educating potential enterprise cloud users about the potential benefits and pitfalls of data anonymization, and documenting existing open source data anonymization applications.

Jeff Sedayao, Enterprise Architect, Intel IT

Contents
Executive Overview
Background
Data Anonymization Concepts and Techniques
    Achieving Privacy Using Anonymization
    Anonymization Techniques
Proof of Concept
    PoC Implementation
    Results
    Key Learnings
Next Steps
Conclusion
Acronyms

IT@Intel
The IT@Intel program connects IT professionals around the world with their peers inside our organization, sharing lessons learned, methods and strategies. Our goal is simple: Share Intel IT best practices that create business value and make IT a competitive advantage. Visit us today at www.intel.com/IT or contact your local Intel representative if you’d like to learn more.

Background

Cloud computing can help reduce costs, increase business agility, and enable IT to focus on projects with a high return on investment. Intel is already experiencing the benefits of implementing an enterprise private cloud, and Intel IT is making progress toward the vision of a hybrid cloud usage model. As we pursue our vision of using hybrid clouds, we also realize the need to protect data according to Intel information security policies.

Security and privacy concerns are a significant obstacle that is preventing the extensive adoption of the public cloud, not only in Intel IT, but also across the industry. IT organizations are typically reluctant to place sensitive and valuable data in infrastructure that they do not control. This is especially true in Europe, which has strict laws concerning the use and storage of personally identifiable information.

Furthermore, as the proliferation of devices making up the compute continuum continues, devices are increasingly implementing Global Positioning System capabilities, and client-aware, location-oriented social media applications are becoming more common. In these types of situations, location tracking becomes an issue: location information can be very useful for providing customized and localized services, but the storage and mining of location data is associated with privacy and regulatory issues. Multi-tenancy, where multiple tenants share cloud infrastructure, poses an additional concern about the deliberate or accidental exposure of data.

We are exploring ways to make data safer for the cloud by preparing for potential security breaches. One method is to store cloud data in a way that makes it worthless to anyone except the owner of the data.

All forms of data protection involve a trade-off between security and ease of implementation, with more secure systems requiring the greatest effort to implement.

For example, homomorphic encryption, which is a strong encryption method that supports computations without decrypting the input, is a possible solution. But while theoretically possible, homomorphic encryption is too computationally intensive for practical use. In general, making progress in privacy issues comes at the expense of having to do more work in processing the published data. We needed to find a practical approach to cloud data storage that both protects privacy and is not too difficult to implement.

Data Anonymization Concepts and Techniques

Anonymization is a technique that enterprises can use to increase the security of data in the public cloud while still allowing the data to be analyzed and used. Data anonymization is the process of changing data that will be used or published in a way that prevents the identification of key information.

Using data anonymization, key pieces of confidential data are obscured in a way that maintains data privacy. The data can still be processed to gain useful information. Anonymized data can be stored in a cloud and processed without concern that other individuals may capture the data. Later, the results can be collected and mapped to the original data in a secure area.


Achieving Privacy Using Anonymization

Figure 1 shows a simple theoretical example of using data anonymization to protect sensitive data. In the example, the goal is to calculate total revenue for several companies, without exposing the names of those companies. To accomplish this goal, company names are changed. For example, the name of Company A in the original data is changed to “Bob” in the cloud-based data. Also, fictitious records are added to the cloud-based data to further anonymize the data. A translation table stored in a secure enclave, which is typically a secured area on the enterprise network, maps the translation of company names and identifies which data is fictitious. Using the anonymized data, total revenue can be calculated in the cloud without exposing sensitive information. The result can then be corrected internally by subtracting fictitious company amounts, using the translation table.

This type of anonymization can block some forms of data mining attacks, because by adding fictitious data, it is impossible to determine the number of companies, or the companies with the most revenue or the least revenue.

Original Data
Company    Revenue
A          USD 100
B          USD 200
C          USD 300

Secure Enclave (translation table stored in secure enclave)
Company         New Name
A               Bob
B               Alice
C               Dave
(fictitious)    Eve
(fictitious)    Carol

Computing Cloud (data as it appears in an external cloud or demilitarized zone)
Company    Revenue
Bob        USD 100
Alice      USD 200
Dave       USD 300
Eve        USD 50
Carol      USD 400

Figure 1. Data anonymization helps enable safer computing in the cloud.

Anonymization can have severe consequences if done incorrectly. For example, a popular online movie supplier released a database of users and movie selections as part of a contest. To protect the identity of customers, the supplier replaced the customer names with random numbers and removed personal details. Security researchers showed that by correlating the data with publicly available Internet Movie Database information, they could reveal the identities of many of the individual users. [1] A key learning from this situation is that data anonymization involves more than just deleting fields in a database.

Several formal models of security can help improve data anonymization, including k-anonymity and l-diversity. The next two sections provide further explanation and examples of these security models.

k-anonymity

k-Anonymity is a formal model of privacy created by L. Sweeney. [2] The goal is to make each record indistinguishable from a defined number (k) of other records if attempts are made to identify the data.

A set of data is k-anonymized if, for any data record with a given set of attributes, there are at least k-1 other records that match those attributes. For example, consider a data set that contains two attributes: gender and birthday. The data set is k-anonymized if, for any record, k-1 other records have the same gender and birthday. In general, the higher the value of k, the more privacy is achieved.

k-Anonymity assigns properties to data attributes and requires that they be handled in specific ways, as shown in Table 1.

[1] Schneier, Bruce, “Why ‘Anonymous’ Data Sometimes Isn’t.” Wired magazine (2007). www.wired.com/politics/security/commentary/securitymatters/2007/12/securitymatters_1213
[2] Sweeney, L., “k-anonymity: A Model for Protecting Privacy.” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10 (2002): 557–570. www.epic.org/privacy/reidentification/Sweeney_Article.pdf
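Returning to the Figure 1 example from the start of this section, the workflow can be illustrated with a minimal Python sketch. The company names, pseudonyms, and amounts are taken from Figure 1; the function names (`anonymize`, `cloud_total`, `correct_total`) are hypothetical and chosen only for this illustration, which is not the implementation used by Intel IT.

```python
# Original data from Figure 1 (company -> revenue in USD).
original = {"A": 100, "B": 200, "C": 300}

# Translation table kept in the secure enclave; pseudonyms follow Figure 1.
translation = {"A": "Bob", "B": "Alice", "C": "Dave"}

# Fictitious records that hide the number of companies and their ranking.
fictitious = {"Eve": 50, "Carol": 400}


def anonymize(records, table, decoys):
    """Build the data set that is published to the cloud."""
    cloud = {table[name]: revenue for name, revenue in records.items()}
    cloud.update(decoys)
    return cloud


def cloud_total(cloud_records):
    """Computation done in the cloud; it sees only pseudonyms."""
    return sum(cloud_records.values())


def correct_total(total, decoys):
    """Correction done in the secure enclave: remove the decoy amounts."""
    return total - sum(decoys.values())


cloud_data = anonymize(original, translation, fictitious)
raw_total = cloud_total(cloud_data)                # 1050, includes decoys
true_total = correct_total(raw_total, fictitious)  # 600
print(raw_total, true_total)
```

Only `cloud_data` and `raw_total` would ever exist outside the enterprise network in this sketch; the translation table and the list of fictitious records stay in the secure enclave.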

Table 1. k-Anonymity Attributes

Attribute Type     Property                                                           Example                         Action Required
Key                Can identify an individual directly                                Name, social security number    Remove or obscure
Quasi-identifier   Can be linked with external information to identify an individual  Zip code, birthday, gender      Suppress or generalize
Sensitive          Data that an individual is sensitive about revealing               Income, type of illness         Needs to be de-linked from the individual
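As a rough illustration of the "suppress or generalize" action for quasi-identifiers, the following Python sketch generalizes zip codes and ages and then measures k as the size of the smallest group of records that share the same quasi-identifiers. The records are hypothetical (they are not data from this paper), key attributes such as names are assumed to have been removed already, and the helper names `generalize` and `k_of` are invented for this example.

```python
from collections import Counter

# Hypothetical records: (zip code, age, disease). Not data from the paper.
records = [
    ("13053", 28, "Heart disease"),
    ("13068", 29, "Heart disease"),
    ("13068", 21, "Viral infection"),
    ("13053", 23, "Viral infection"),
    ("13053", 37, "Cancer"),
    ("13067", 36, "Cancer"),
]


def generalize(record):
    """Generalize the quasi-identifiers; keep the sensitive value as is."""
    zip_code, age, disease = record
    return (zip_code[:3] + "*", f"{age // 10}*", disease)


def k_of(rows):
    """Return k: the size of the smallest group sharing quasi-identifiers."""
    groups = Counter((zip_code, age) for zip_code, age, _ in rows)
    return min(groups.values())


generalized = [generalize(r) for r in records]
print(k_of(records))      # 1: every raw record is unique
print(k_of(generalized))  # 2: each record now matches at least one other
```

Generalization trades precision for privacy: the coarser the zip code and age buckets, the larger k becomes, and the less detail remains for analysis.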


Studies have shown that 87 percent of people in the United States can be identified by the combination of zip code, birthday, and gender. [3] Sensitive attributes, such as the type of illness or disease someone may have, should be de-linked from the individual. To guarantee k-anonymity, there should be k identical sequences of quasi-identifiers in the data.

Table 2 contains sample records from a theoretical hospital: Zip and Age are quasi-identifiers and Disease is a sensitive attribute. Both the zip code and the patient age have been suppressed, with patient age listing only what decade of life the patients are in. The table has at least two copies of each quasi-identifier, making this table k-anonymous, where k=2.

Table 2. Sample k-Anonymized Patient Records

Zip     Age    Disease
130•    2•     Heart disease
130•    2•     Heart disease
130•    2•     Heart disease
130•    2•     Viral infection
130•    3•     Cancer
130•    3•     Cancer
• denotes a suppressed value.

k-Anonymity guarantees that an individual cannot be identified from a given number of people in a set of size k. If there are not k identical sequences of quasi-identifiers, fictitious records could be added to the data, with the understanding that the effects of those fictitious records on processing must be removed at some point.

l-diversity

k-Anonymity is vulnerable to a number of attacks. For example, using the data in Table 2, at least two attacks are theoretically possible.
• Homogeneity attack. If an attacker knows that Bob has an entry in the data and that Bob is in his 30s, the attacker will know that Bob has cancer.
• Background knowledge attack. If an attacker knows that 21-year-old Yuko has an entry in the data and that as a young woman she is unlikely to have heart disease, the attacker then knows Yuko has a viral infection.

Use of l-diversity, another privacy model, can counter both of these attacks. [4] l-Diversity improves anonymization beyond what k-anonymity provides. The difference between the two is that while k-anonymity requires each combination of quasi-identifiers to have k entries, l-diversity requires that there are l different sensitive values for each combination of quasi-identifiers. Table 3 adds l-diversity to the data from Table 2. This data has k-anonymity, where k=4, as well as l-diversity, where l=2. Homogeneity and background knowledge attacks described above are both impossible with the anonymized data in Table 3.

Table 3. Sample Patient Records Created with l-Diversity, Where l=2

Zip     Age    Disease
130•    2•     Heart disease
130•    2•     Heart disease
130•    2•     Heart disease
130•    2•     Cancer
130•    2•     Cancer
130•    2•     Viral infection
130•    2•     Viral infection
130•    3•     Viral infection
130•    3•     Viral infection
130•    3•     Cancer
130•    3•     Cancer
• denotes a suppressed value.

While l-diversity adds more privacy than k-anonymity alone, the natural occurrence of sensitive attributes may not provide enough variety to achieve l-diversity. Fictitious data can be inserted into the data to increase occurrences, but these must be compensated for when doing analysis. Also, probabilistic inferences are still possible. For example, from the data in Table 3, it is possible to determine that Yuko has either cancer or a viral infection, with a 50-percent chance of either one.

Anonymization Techniques

Both k-anonymity and l-diversity require that data be obscured. Several techniques can be used to obscure data, as shown in Table 4. In the table, sample data is modified to show how each anonymization technique works.

[3] Samarati, P., and L. Sweeney, “Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression.” Technical report SRI-CSL-98-04. SRI computer science laboratory. Palo Alto, CA (1998). www.epic.org/privacy/reidentification/Samarati_Sweeney_paper.pdf
[4] Machanavajjhala, Ashwin, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, “l-Diversity: Privacy Beyond k-Anonymity.” ACM Trans. Knowl. Discov. Data 1, 1, Article 3 (2007). http://www.cs.cornell.edu/~vmuthu/research/ldiversity-TKDD.pdf
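Referring back to Tables 2 and 3, the k and l values quoted in the text can be checked with a short Python sketch. The rows are transcribed from the two tables, with `*` standing in for the suppressed-value marker; the function name `k_and_l` is hypothetical and the code is only an illustration of the definitions, not a tool used in the PoC.

```python
from collections import defaultdict

# Rows transcribed from Table 2 and Table 3: (zip, age, disease).
table2 = (
    [("130*", "2*", "Heart disease")] * 3
    + [("130*", "2*", "Viral infection")]
    + [("130*", "3*", "Cancer")] * 2
)
table3 = (
    [("130*", "2*", "Heart disease")] * 3
    + [("130*", "2*", "Cancer")] * 2
    + [("130*", "2*", "Viral infection")] * 2
    + [("130*", "3*", "Viral infection")] * 2
    + [("130*", "3*", "Cancer")] * 2
)


def k_and_l(rows):
    """Return (k, l) for rows grouped by their quasi-identifiers."""
    groups = defaultdict(list)
    for zip_code, age, disease in rows:
        groups[(zip_code, age)].append(disease)
    k_value = min(len(values) for values in groups.values())
    l_value = min(len(set(values)) for values in groups.values())
    return k_value, l_value


print(k_and_l(table2))  # (2, 1)
print(k_and_l(table3))  # (4, 2)
```

An l value of 1 for Table 2 is what makes the homogeneity attack possible: every patient in the 3• age group has the same disease. In Table 3 every group contains at least two different diseases.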


Table 4. Anonymization Techniques to Obscure Data

Example columns: IP Address, First Name, Last Name, Employee Number, Years of Service, Phone Number, Monthly Salary

Sample Data
    143.183.23.3     Bob      Smith    325211    13    408-555-2935    USD 5000
    143.183.23.10    Alice    Jones    452893    3     408-555-2931    USD 4000

Hiding
• A value is replaced with a constant value (typically 0); sometimes called a “black marker.”
• Useful for suppressing sensitive attributes that may not be needed for processing. For example, to publish data for a phone directory, salary information is not needed.
Example: The salary field is hidden.
    143.183.23.3     Bob      Smith    325211    13    408-555-2935    0
    143.183.23.10    Alice    Jones    452893    3     408-555-2931    0

Hashing
• Maps each value to a new (not necessarily unique) value.
• Useful for mapping a large, variable amount of data into a number of a certain length.
Example: The first name, last name, and employee number are hashed into a single fixed-length number.
    143.183.23.3     6834523439811    13    408-555-2935    USD 5000
    143.183.23.10    2349510342932    3     408-555-2931    USD 4000

Permutation
• Maps each original value to a unique new value.
• Allows the translation of the new value back to the original value, given a translation table that is stored in a secure area.
Example: The first and last names are mapped to new values.
    143.183.23.3     Rob    Clemente    325211    13    408-555-2935    USD 5000
    143.183.23.10    Eva    Gonzales    452893    3     408-555-2931    USD 4000

Shift
• Adds a fixed offset to the numerical values.
• Useful for concealing data while allowing computations in areas such as the cloud.
Example: USD 10,000 is added to the salary values.
    143.183.23.3     Bob      Smith    325211    13    408-555-2935    USD 15000
    143.183.23.10    Alice    Jones    452893    3     408-555-2931    USD 14000

Enumeration
• Maps each original value to a new value to preserve ordering.
• Allows analysis of data that requires ordering, such as ranking people by salary.
Example: Salary data is changed, but the relative order of the salary is preserved.
    143.183.23.3     Bob      Smith    325211    13    408-555-2935    USD 25000
    143.183.23.10    Alice    Jones    452893    3     408-555-2931    USD 20000

Truncation
• A field is shortened, losing data at the end.
• Useful for hiding data while still preserving information for the data.
Example: The phone number is truncated after the area code. This hides the phone number but preserves the information about where the employee is located.
    143.183.23.3     Bob      Smith    325211    13    408    USD 5000
    143.183.23.10    Alice    Jones    452893    3     408    USD 4000

Prefix-preserving
• Retains n-bit prefix on IP addresses.
Example: The IP address is scrambled, but the 16-bit prefix is preserved.
    143.183.79.169    Bob      Smith    325211    13    408-555-2935    USD 5000
    143.183.3.25      Alice    Jones    452893    3     408-555-2931    USD 4000
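The techniques in Table 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the tooling used in the PoC: all function names are hypothetical, `SECRET_KEY` is a placeholder, hashing uses keyed SHA-256 (HMAC) as one possible choice, enumeration is shown as a simple rank mapping, and `keep_prefix` is a simplified stand-in that keeps the first n bits and scrambles the rest rather than a full prefix-preserving scheme (two different hosts could map to the same scrambled address).

```python
import hashlib
import hmac
import ipaddress

SECRET_KEY = b"hypothetical-key-kept-in-the-secure-enclave"


def hide(_value, marker=0):
    """Hiding: replace the value with a constant ("black marker")."""
    return marker


def hash_value(*fields):
    """Hashing: map the fields to one fixed-length digest (keyed SHA-256)."""
    data = "|".join(str(f) for f in fields).encode()
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def permute(value, table):
    """Permutation: look up the unique replacement in a translation table."""
    return table[value]


def shift(value, offset):
    """Shift: add a fixed offset to a numeric value."""
    return value + offset


def enumerate_values(values):
    """Enumeration: replace values with ranks, preserving their order."""
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]


def truncate(value, keep):
    """Truncation: keep only the first `keep` characters."""
    return value[:keep]


def keep_prefix(ip, prefix_bits=16):
    """Retain the first `prefix_bits` of an IPv4 address, scramble the rest."""
    addr = int(ipaddress.IPv4Address(ip))
    host_bits = 32 - prefix_bits
    prefix = addr >> host_bits << host_bits
    digest = hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).digest()
    scrambled = int.from_bytes(digest[:4], "big") & ((1 << host_bits) - 1)
    return str(ipaddress.IPv4Address(prefix | scrambled))


# Shifted salaries can still be averaged in the cloud; the enclave
# subtracts the confidential offset to recover the true average.
salaries = [5000, 4000]
shifted = [shift(s, 10000) for s in salaries]   # [15000, 14000]
cloud_mean = sum(shifted) / len(shifted)        # 14500.0
true_mean = cloud_mean - 10000                  # 4500.0
print(shifted, cloud_mean, true_mean)
```

The last lines show why shift suits cloud computation: an average computed on shifted values differs from the true average by exactly the offset, so the correction is a single subtraction performed where the offset is known.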


Example

Table 5 shows an example of a file where the goal is to calculate the average salary by location and years of service, while obscuring the identity of the individual.

This data is obscured using the following techniques:
• Phone numbers are truncated, leaving only the area code.
• Names and employee numbers are hidden.
• Salary and years of service are offset, using shift.
• The prefix of the IP addresses is preserved.

This approach protects the privacy of individuals’ salaries, so the calculation of salary averages could be done in a cloud without having to worry about data and results being exposed. The results could be adjusted in a secure enclave using knowledge of the shift values associated with salary and years of service. As long as the shift values remain confidential, the true average salary is protected. Phone numbers have been truncated to prevent using those as quasi-identifiers to reveal the identity of the individuals associated with data entries.

Table 5. An Example of Calculating the Average Salary While Obscuring Identity

IP Address        First Name    Last Name    Employee Number    Years of Service    Phone Number    Monthly Salary
143.183.79.169    x             y            0                  11                  408             USD 15000
143.183.3.25      x             y            0                  1                   408             USD 14000

The Difference between Data Anonymization and Encryption

Although data anonymization and encryption are related topics and are both useful techniques for securing cloud-based data from privacy and security breaches, they are not the same thing.
• Data anonymization is the process of transforming data so that it can be processed in a useful way, while preventing that data from being linked to individual identities of people, objects, or organizations.
• Encryption involves transforming data to render it unreadable to those who don’t have the key to decrypt it.

Encryption can be a useful tool for doing anonymization, particularly when hiding identifying information in a set of data. However, encryption, while useful, is neither necessary nor sufficient for doing anonymization. Data can be successfully anonymized without encryption, and encrypted data is not necessarily anonymized.

Proof of Concept

We conducted a proof of concept (PoC) showing that data anonymization is a viable technique to use for secure cloud computing. The PoC used data anonymization to protect and process event log data stored in a public cloud.

Collaboration with external companies and the need for greater awareness of what is occurring in our computing environment are leading us to do more and more data logging. In the future, much of this log data will be generated outside of Intel’s network perimeters or in network demilitarized zones (DMZs). We need the ability to store and later analyze these logs to make them useful. We wanted to determine if we could use a software-as-a-service (SaaS) log management supplier in the public cloud but still maintain data security. Using a SaaS log management supplier, instead of simply storing log data on enterprise servers, provides a fast, searchable, and easy-to-use log archive that can reduce the number of log entries needed for other processing and that is in a Hadoop*-ready format for Big Data mining. [5]

[5] Hadoop* is a project that provides an open source implementation of frameworks for reliable, scalable, distributed computing, and data storage.

PoC Implementation

The PoC, which we conducted over several weeks, had the following overall structure:
• Evaluate anonymization tools for storing anonymized data.
• Store anonymized logs in a public cloud-based SaaS log management application.
• Analyze logs using Hadoop on the stored logs to gather security and performance data.
• Document results and report them.

We used an existing external application to generate security and performance data that we could send to the SaaS log management supplier for analysis. The application that we chose runs on an academic cloud and monitors the performance of a number of web sites. Performance data, such as the time it takes to look up the web server’s domain name, the time it takes to set up a TCP connection, and the page download rate, are made available on a web site that runs on each virtual machine (VM). Figure 2 illustrates the relationship between the SaaS log management supplier, the VMs, and the secure enclave that stores the data anonymization translation table.

Figure 2. Our proof of concept analyzed anonymized log data in the cloud. (Figure content: virtual machine monitors, the log generation sources, send anonymized logs to the software-as-a-service log management supplier for security business intelligence analysis and performance analysis; in the secure enclave, processed data is mapped to real values.)

Anonymization takes place on the VMs sending the data, and data is de-anonymized within a secure enclave. For the PoC, 47 VMs sent anonymized log information to the SaaS log management supplier at the rate of 2,800 events (individual log entries) per hour.

We decided to anonymize the IP addresses and the URLs in the web server access log files.
• We used the IP::Anonymous Perl library to mask IP addresses.
• We used the MD5 hash function to anonymize URLs. Although most of the security industry views the MD5 hash function as too weak to be effective, we chose it as an easily implementable example. An actual implementation would use another hash function.
• The anonymization software that we used to conceal IP addresses uses Advanced Encryption Standard (AES) encryption to scramble IP addresses.

Security Analysis

We wanted to determine whether the data distribution web servers were being probed. We planned to probe some of the ports on the web servers and then see if we could detect that activity in the anonymized logs. One way to detect a security anomaly is to look for a spike in log activity. Because this method of detection can be subverted through a slow attack, we also conducted a probe in which we performed only one port probe on one of the data distribution web servers.

Performance Analysis

We also wanted to use the SaaS log management supplier to study the performance of the web sites we were analyzing. For example, we wanted to calculate summary statistics such as average TCP connection setup times and the standard deviation of those times.

Results

We found that we could glean useful security and performance information from the anonymized data that we stored and analyzed in the cloud. Our simulated attacks could be detected in the anonymized logs, and we found performance issues in anonymized performance logs.
• During the security analysis testing, we didn’t detect any active probing of the monitoring VMs. However, we searched older logs and found that there had been probes on the web server. This confirmed our theory that the approach we took in looking for security business intelligence events could detect real events.
• Although the SaaS log management supplier we used during the PoC didn’t support number-crunching analytics, such as calculating averages, we were able to pinpoint other performance issues. For example, we discovered that one web site performed two redirects before the user accessed the actual content, thereby increasing the access time.

Key Learnings

Our PoC demonstrated that anonymization for cloud processing can work, although more research is necessary before it is ready for production use. The following points summarize our key learnings from our PoC:
• Anonymized logs can be used for both performance analysis and security analysis.
• Anonymization can ease some security concerns, allowing for simpler DMZ and security provisioning and making it possible to do a number of operations in public clouds.
• When using encryption to anonymize the data, it is important to manage the encryption keys that control the real-to-anonymized value mapping.

• Anonymization for the cloud could be a new and major use case for Intel® Advanced Encryption Standard – New Instructions (Intel® AES-NI), as this technology speeds up AES encryption used in some anonymization techniques, such as hashing.

Next Steps

Intel’s IT security architecture and cloud engineering groups deemed our PoC useful enough to warrant further work concerning external logging and anonymization. We plan to refine anonymization use cases and conduct another PoC with Intel data. As we collaborate with more companies and pursue our hybrid cloud vision, our work with anonymization has the potential to be a useful technique for enabling enterprise usage of public clouds.

Products already exist today that take advantage of anonymization techniques. These products make it possible for companies to use publicly hosted SaaS offerings without revealing sensitive information. However, it is clear that more work needs to be done with anonymization.
• We need to educate potential enterprise cloud users about anonymization and its potential benefits and pitfalls. While data anonymization isn’t foolproof, it is appropriate for certain use cases, and there are ways to determine if vulnerabilities exist in the anonymized data.
• Available open source tools seem capable but are not well documented, and some have been abandoned by their creators. We intend to document some of these open source tools.
• Our PoC used AES encryption to do anonymization. Intel AES-NI instructions could be used to speed up the anonymization process. Therefore, secure use of public clouds could be a potential use case for Intel AES-NI. We will adapt the open source anonymization tools to use Intel AES-NI and investigate if that adds value to the anonymization process.

Conclusion

Intel has achieved great savings and success with our enterprise private cloud, and we are moving toward a hybrid cloud usage model and employing techniques that protect the privacy of Intel’s employees and customers. But for that model to work, we need to protect data according to Intel information security policies. We conducted a PoC that showed that data anonymization (the process of obscuring published data to prevent the identification of key information) is a viable technique for enhancing the security of cloud computing.

Collaboration with external companies, and a desire for greater awareness of what is happening in our computing environment, is leading Intel IT to do more and more data logging. In the future, much of this log data will be generated outside of Intel’s network perimeters, or in DMZs. Instead of insisting that cloud infrastructures be totally secure, we can prepare for potential security breaches in the cloud by anonymizing the data, making it worthless to others while still allowing Intel IT to process it in a useful way.

Although data anonymization is not foolproof, it is one important tool in our continuing pursuit of secure cloud computing. We intend to further research data anonymization, including developing further use cases, educating potential enterprise cloud users about the potential benefits and pitfalls of data anonymization, documenting existing open source data anonymization applications, and conducting another, more extensive, PoC.

For more information on Intel IT best practices, visit www.intel.com/it.

Acronyms

Intel® AES-NI    Intel® Advanced Encryption Standard – New Instructions
DMZ              demilitarized zone
PoC              proof of concept
SaaS             software as a service
VM               virtual machine
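To make the PoC's anonymization step more concrete, the following Python sketch shows one way log entries could be anonymized on a VM before being sent to a SaaS log supplier, analyzed in anonymized form, and then mapped back to real values in a secure enclave. It is an assumption-laden illustration, not the PoC code: the PoC used the IP::Anonymous Perl library and MD5, whereas this sketch uses a keyed SHA-256 digest for both fields, and the key, log entries, field names, and function names are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"hypothetical-secret-held-in-the-secure-enclave"


def pseudonym(value):
    """Keyed SHA-256 digest used as a stable stand-in for the real value."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()


def anonymize_entry(entry, table):
    """Replace IP and URL with pseudonyms and record the mapping."""
    out = dict(entry)
    for field in ("ip", "url"):
        token = pseudonym(entry[field])
        table[token] = entry[field]  # the table never leaves the enclave
        out[field] = token
    return out


def deanonymize(token, table):
    """Map a pseudonym back to its real value inside the secure enclave."""
    return table.get(token, token)


# Hypothetical access-log entries (documentation IP ranges).
log = [
    {"ip": "192.0.2.10", "url": "/index.html", "status": 302},
    {"ip": "192.0.2.10", "url": "/home", "status": 200},
    {"ip": "198.51.100.7", "url": "/index.html", "status": 302},
]

translation_table = {}
cloud_log = [anonymize_entry(e, translation_table) for e in log]

# Analysis on anonymized data: which (hidden) URL redirects most often?
redirects = Counter(e["url"] for e in cloud_log if e["status"] in (301, 302))
top_token, count = redirects.most_common(1)[0]

# Back in the enclave, the result is mapped to the real URL.
print(deanonymize(top_token, translation_table), count)  # /index.html 2
```

Because the same input always yields the same pseudonym, counts and correlations computed on the anonymized log remain valid, which is the property the performance and security analyses in the PoC relied on.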

This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any patent, copyright, or other intellectual property rights, relating to use of information in this specification. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others.