
+ Mobile Storage Users 20 + Containers in Cloud Architecture 81

SEPTEMBER 2014 www.computer.org/cloudcomputing


Call for Papers Legal Clouds: How to Balance Privacy with Legitimate Surveillance and Lawful Data Access

Submission deadline: 1 Mar 2015 • Publication date: July-Aug 2015

With the increasing popularity of cloud services and their potential to either be the target of or the tool used in a cybercrime activity, organizational cloud service users need to ensure that their (cloud) data is secure, and in the event of a compromise, they must have the capability to collect evidential data.

Surveillance of citizens by their governments is not new. The relatively recent revelations of Edward Snowden (former NSA contractor) of the extensive surveillance (including of cloud service providers and users) by NSA, however, reminded us of the need to balance a secure system with the rights of individuals to privacy. This is further complicated by the need to protect the community from serious and organized crimes, terrorism, cyber-crime, and other threats to national security interests. This presents serious implications for the ability of governments to protect their citizens against cyber security threats. It remains an under-researched area due to the interdisciplinary challenges specific to this field.

This special issue will focus on cutting edge research from both academia and industry on the topic of balancing cloud user privacy with legitimate surveillance and lawful data access, with a particular focus on cross-disciplinary research. For example, how can we design technologies that will enhance “guardianship” and the “deterrent” effect in cloud security at the same time as reducing the “motivations” of cybercriminals?

Topics of interest include but are not limited to:
• Advanced cloud security
• Cloud forensics and anti-forensics
• Cloud incident response
• Cloud information leakage detection and prevention
• Enhancing and/or preserving cloud privacy
• Cloud surveillance
• Crime prevention strategies
• Legal issues relating to surveillance
• Enhancing privacy technology for cloud-based apps
• High quality survey papers on the above topics are welcome.

Special Issue Guest Editors
Kim-Kwang Raymond Choo, University of South Australia
Rick Sarre, University of South Australia

Submission Information
Submissions will be subject to IEEE Cloud Computing magazine’s peer-review process. Articles should be at most 6,000 words, with a maximum of 15 references, and should be understandable to a broad audience of people interested in cloud computing, big data, and related application areas. The writing style should be down to earth, practical, and original. All accepted articles will be edited according to the IEEE Computer Society style guide. Submit your papers through Manuscript Central at https://mc.manuscriptcentral.com/ccm-cs. Contact the guest editors at ccm4-2015@computer.org.

www.computer.org/cloudcomputing


EDITOR IN CHIEF
Mazin Yousif, T-Systems International, [email protected]

EDITORIAL BOARD
Zahir Tari, RMIT University
Omer Rana, Cardiff University
Rajiv Ranjan, CSIRO Computational Informatics
Beniamino Di Martino, Second University of Naples
Eli Collins, Cloudera
Samee Khan, North Dakota State University
Kim-Kwang Raymond Choo, University of South Australia
J.P. Martin-Flatin, EPFL
Ivona Brandic, Vienna University of Technology
Pascal Bouvry, University of Luxembourg
David Bernstein, Cloud Strategy Partners
Laura Taylor, Relevant Technologies
Alan Sill, Texas Tech University

STEERING COMMITTEE
Manish Parashar, Rutgers, the State University of New Jersey
V.O.K. Li, University of Hong Kong (Communications Society liaison)
Steve Gorshe, PMC-Sierra (Communications Society liaison; EIC Emeritus IEEE Communications)
Rolf Oppliger, eSecurity Technologies
Carl Landwehr, NSF, IARPA (EIC Emeritus IEEE S&P)
Hui Lei, IBM
Dennis Gannon
Kirsten Ferguson-Boucher, Aberystwyth University

EDITORIAL STAFF
Brian Kirk • Lead Editor • [email protected]
Joan Taylor • Content Editor
Lee Garber, Keri Schreiner, Jenny Stout • Contributing Editors
Carmen Garvey, Jennie Zhu-Mai • Production & Design
Robin Baldwin • Senior Manager, Editorial Services
Evan Butterfield • Products and Services Director
Sandy Brown • Senior Business Development Manager
Marian Anderson • Senior Advertising Coordinator

CS MAGAZINE OPERATIONS COMMITTEE
Paolo Montuschi (chair), Erik R. Altman, Maria Ebling, Miguel Encarnação, Lars Heide, Cecilia Metra, San Murugesan, Shari Lawrence Pfleeger, Michael Rabinovich, Yong Rui, Forrest Shull, George K. Thiruvathukal, Ron Vetter, Daniel Zeng

CS PUBLICATIONS BOARD
Jean-Luc Gaudiot (VP for Publications), Alain April, Laxmi N. Bhuyan, Angela R. Burgess, Greg Byrd, Robert Dupuis, David S. Ebert, Frank Ferrante, Paolo Montuschi, Linda I. Shafer, H.J. Siegel, Per Stenström

IEEE Cloud Computing (ISSN 2325-6095) is published quarterly by the IEEE Computer Society. IEEE headquarters: Three Park Ave., 17th Floor, New York, NY 10016-5997. IEEE Computer Society Publications Office: 10662 Los Vaqueros Cir., Los Alamitos, CA 90720; +1 714 821 8380; fax +1 714 821 4010. IEEE Computer Society headquarters: 2001 L St., Ste. 700, Washington, DC 20036.

Subscription rates: IEEE Computer Society members get the lowest rate of US$39 per year. Go to www.computer.org/subscribe to order and for more information on other subscription prices.


What will the future of cloud computing look like? What are some of the issues professionals, practitioners, and researchers need to address when utilizing cloud services? IEEE Cloud Computing magazine serves as a forum for the constantly shifting cloud landscape, bringing you original research, best practices, in-depth analysis, and timely columns from luminaries in the field.

FEATURED ARTICLES

24 Guest Editors’ Introduction: Securing Big Data Applications in the Cloud
Bharat Bhargava, Ibrahim Khalil, and Ravi Sandhu

27 Enhancing Big Data Security with Collaborative Intrusion Detection
Zhiyuan Tan, Upasana T. Nagar, Xiangjian He, Priyadarsi Nanda, Ren Ping Liu, Song Wang, and Jiankun Hu

34 Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters
Abdulrahman A. Almutairi and Arif Ghafoor

46 Efficient and Secure Transfer, Synchronization, and Sharing of Big Data
Kyle Chard, Steven Tuecke, and Ian Foster

56 Location-Based Security Framework for Cloud Perimeters
Chetan Jaiswal, Mahesh Nath, and Vijay Kumar

65 Multilabels-Based Scalable Access Control for Big Data Applications
Hongsong Chen, Bharat Bhargava, and Fu Zhongchuan


September 2014 Volume 1, Issue 3 www.computer.org/cloudcomputing

COLUMNS

4 News
In Brief
Lee Garber

8 From the Editor in Chief
A Focus on Security and Privacy in the Cloud
Mazin Yousif

10 Cloud and the Government
FedRAMP: History and Future Direction
Laura Taylor

15 Standards Now
Cloud Standards and the Spectrum of Development
Alan Sill

20 Cloud and the Law
Mobile Users
Kim-Kwang Raymond Choo

72 What’s Trending?
Bringing Big Data Systems to the Cloud
Amandeep Khurana

76 Blue Skies
Application Security through Federated Clouds
Paul Watson

81 Cloud Tidbits
Containers and Cloud: From LXC to Docker to Kubernetes
David Bernstein

23 Advertising Index
45 IEEE CS Information

Reuse Rights and Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of their IEEE-copyrighted material on their own Web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate review suggestions, but not the published version with copyediting, proofreading and formatting added by IEEE. For more information, please go to: http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. Copyright © 2014 IEEE. All rights reserved.

Abstracting and Library Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html.


CLOUD NEWS

In Brief

Lee Garber
IEEE Computer Society, [email protected]

AS MORE ORGANIZATIONS ADOPT THE CLOUD, NEW ISSUES WILL CONTINUE TO EMERGE. Each issue, IEEE Cloud Computing’s news briefs look at recent happenings and trends in the cloud world.

Support Grows for New Software Approach that Could Boost Cloud Computing

Major technology companies such as Amazon and […] are supporting Docker (www.docker.com), a new open source platform that could make it easier to run applications on multiple machines.

Developers use Docker to place applications in software containers, which users can download across the […] or on any private network and use on any Linux machine or cloud platform.

This would be a huge benefit for cloud computing, which is often used to make applications that are kept online available to all types of computing devices. In fact, proponents note, this is one of cloud computing’s purposes.

They add that Docker will make developers’ lives much easier by letting them focus on designing programs without worrying about the machine or platform on which they’ll run.

Containers aren’t new, but Docker claims its technology makes packaging applications and moving them among various types of machines easier.

The system consists of the Docker engine, a lightweight runtime and packaging tool; and the Docker hub, a cloud service for sharing applications and handling workflows.

According to Docker, about 14,000 applications are now using its containers. eBay is using the system to test new software in its datacenters. And Google, which is trying to challenge Amazon’s dominance in the cloud computing market, is also working with Docker.

The technology isn’t without its concerns. For example, machines must download software enabling them to use the containers. The software is supposed to run the same way on all Linux versions, but this isn’t always the case. Some containers therefore might not run on all operating systems. Docker and its supporters say they are working on this.

In addition, some cloud service providers are working on their own proprietary application-portability technologies and thus might not adopt or might even oppose Docker.

Service Offers New Approach to Cloud Security

A vendor has released a new open source program designed to let users securely store data in the cloud for future access without also having to place their private cryptographic keys there.

CloudFlare’s Keyless SSL lets users store the private keys on an internal, rather than a public-facing, server. The ability to better protect keys could overcome concerns that businesses that handle sensitive data (such as financial and healthcare companies) have about keeping data in the cloud.

Typically, firms using the cloud store private keys on the same public-facing server that handles Web traffic. However, this could let hackers access the key and compromise the security of data stored online.

In some cases, businesses use third parties to handle their SSL systems, including their keys. However, this places those keys out of the businesses’ control.

With CloudFlare’s new system, private SSL keys are maintained on customers’ internal servers, which can sit behind firewalls or be secured in other ways. Users install an agent on their servers to handle data-access requests. To protect the communications involved in the process, the system transmits and processes key-signing requests via an encrypted tunnel to the user’s server.

CloudFlare says it got the idea for the new product after being approached by financial institutions that had suffered cyberattacks.

The company plans to bundle Keyless SSL with its enterprise security service.

4 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


Will Cloud Computing Close the Tech Industry’s Gender Gap?

Intel and other companies are expressing hope that the rise of cloud computing could attract more women to technology-related jobs.

The US Department of Labor predicts that the increasing use of cloud computing technologies and services will create 1.4 million jobs domestically by 2020. However, US universities will provide enough graduates to fill only an estimated 29 percent of them. Intel says the need to make up the difference could provide a way to get more women interested in technology careers.

Currently, women hold only an estimated one-fourth of US computing and technical jobs. However, cloud computing is a relatively new technology requiring different types of skills. Intel says this could attract women who might not have been interested in traditional computer technologies and could force companies to change their traditional hiring approaches.

To encourage this process, Intel recently paid half of the registration fee (which ranges from $1,395 to $1,595) for women attending the first IT Cloud Computing Conference in late October in […]. The company also paid the entire fee for 50 female college students majoring in science, technology, engineering, or mathematics (STEM).

This effort exposes women to technology and gives them an opportunity to network and to meet professionals in the field, according to Intel, whose president, Renée James, is a woman.

Support for these types of efforts has come from the nonprofit Girls Who Code organization (http://girlswhocode.com), whose members include Adobe Systems, Amazon, AT&T, eBay, Facebook, Google, Intel, Microsoft, and Twitter.

Study: European Companies Aren’t Taking Advantage of Cloud Technology

Large corporations are having trouble finding enough IT workers with the expertise necessary to meet their cloud computing goals.

Many companies, therefore, haven’t been able to fully adopt cloud technologies. And their IT departments aren’t confident of their readiness to do so, according to a recent study by market research firm International Data Corp. (IDC).

IDC initially surveyed European firms and found that 56 percent of responding IT departments can’t find qualified workers to support their cloud-related efforts. About 60 percent are having trouble improving the skills of current employees so that they can help with tasks such as evaluating cloud service providers.

Only about 30 percent of European IT departments told IDC that they can determine the costs and benefits of their cloud projects well enough to justify them to management. And just 40 percent of companies say they use cloud technology extensively and effectively enough to gain a marketplace advantage.

All this is occurring as European enterprise spending on cloud services and technology has grown 25 percent during the past year. To determine if these problems are limited to Europe, IDC interviewed high-ranking staff at about 1,100 companies worldwide and found similar issues.

IBM Uses the Cloud to Take Analytics to the Masses

Typically, only large companies with the money to buy powerful computers and expensive software and hire specially trained personnel have been able to perform complex analytics on the huge amounts of data they collect. This has limited the adoption of sophisticated data analytics products.

Now, however, IBM is using its Watson supercomputer technology and the cloud to deliver such services to smaller organizations. Scientists and developers from IBM’s data analysis and Watson units worked on the Watson Analytics project for about a year before announcing it recently.

The system combines IBM’s data analytics approaches with Watson’s computing power and machine-learning capabilities, as well as its ability to work with natural-language input. The latter enables company employees who aren’t data scientists to query databases to recognize useful patterns or derive helpful predictions from large amounts of corporate information.

The system can display results in formats such as text, charts, or graphs. It can also incorporate data about external factors to help with the process.


Industry observers cite a need for services that can do what Watson Analytics promises to do. However, they add, the offering’s success will ultimately depend on factors such as reliability, ease of use, and the value of its results.

Security Experts: Hackers Stole Nude Photos of Celebrities from Apple’s iCloud

The unprecedented series of high-profile cybercrimes that began late last year may have moved into the cloud.

A possible attack on Apple’s iCloud cloud storage and cloud computing service has joined an ongoing series of hacks on major retailers such as the Target department stores; JPMorgan Chase, the US’s biggest bank; the huge Home Depot home-improvement store chain; and grocery store groups across the United States.

In many of these cases, the attackers stole customers’ sensitive personal data, including Social Security numbers and payment card information.

Recently, security researchers say, hackers broke into Apple’s iCloud service and stole nude photos, explicit videos, and other personal material that 101 celebrities had loaded onto their iPhones and then stored in iCloud. The material was subsequently posted for sale on black market websites.

Security experts contend that the cybercriminals breached the iCloud accounts by exploiting a flaw in Apple’s Find My iPhone API. They say the API didn’t lock out people making more than a set number of failed attempts to log into accounts, as many applications do for security purposes. This let the hackers keep trying possible passwords, based on knowledge of the celebrities, until they hit the right ones. They then connected to iCloud and retrieved various people’s iPhone backups.

Apple acknowledges security issues with Find My iPhone and says it’s fixing them. But the company claims it isn’t responsible for the theft of celebrities’ personal material.

Instead, it contends, hackers either guessed celebrities’ passwords based on public information about them, or used phishing to send fake but legitimate-appearing emails that convinced celebrities to provide login information.

iOS 8 Bug Deletes iCloud Documents

Users of iPhones and iPads running iOS 8 say an operating system flaw deletes iWork documents from the iCloud Drive when they reset their devices.

On the MacRumors user-support discussion website, users reported that performing the “reset all settings” operation removed word processing, spreadsheet, and presentation documents from the new iCloud Drive, which iOS 8 can use for storage and synchronization.

They complained that the dialog at the start of the reset specifically said that the process would restore factory settings, as a last resort to solve system problems, but not delete data.

Some users stated that even the Apple Time Machine restoration application couldn’t recover the missing files, although one said it could.

Several people complained that Apple technical support representatives told them that, for example, the problem was temporary or that the data was still on the device.

Now, however, some users say, it appears they will never recover the documents.

Apple introduced the iCloud Drive this year, saying it was an alternative to third-party cloud storage and synchronization services.

Call for Articles

IEEE Pervasive Computing seeks accessible, useful papers on the latest peer-reviewed developments in pervasive, mobile, and ubiquitous computing. Topics include hardware technology, software infrastructure, real-world sensing and interaction, human-computer interaction, and systems considerations, including deployment, scalability, security, and privacy.

Author guidelines: www.computer.org/mc/pervasive/author.htm
Further details: pervasive@computer.org
www.computer.org/pervasive


Call for Papers IEEE Cloud Computing Magazine Special Issue on Autonomic Clouds

Submission deadline: 15 Jan 2015 • Publication date: Mar-Apr 2015

Cloud computing continues to increase in complexity due to both the increasing availability of configuration options from public cloud providers and the increasing variability and types of application instances that can be deployed on such platforms: for example, tuning options in hypervisors that enable different virtual machine instances to be associated with physical machines; storage, compute, and I/O preferences that offer different power and price; and operating system configurations that provide differing degrees of performance or security. This complexity can also be seen in enterprise scale datacenters that dominate computing infrastructures in industry, which are growing in size and complexity, leading to complex business applications and workflows.

Autonomic computing enables self-management of systems and applications. The underlying concepts and mechanisms of autonomics can be applied to each component within a cloud system (resource manager/scheduler, power manager, etc.) as well as to the cloud system as a whole, or they can be applied within an application that makes use of such a cloud system. Autonomics can also play a critical role as applications explore dynamic federations of cloud infrastructure and services.

We invite contributions that address a number of topics related to the use of autonomic computing approaches for managing cloud infrastructure, creating and managing federations of cloud infrastructure and services, and managing scientific applications hosted on a cloud infrastructure. Topics of interest include (but are not limited to):
• Auto-scaling strategies
• Adaptive deployment, configuration, and management approaches
• Use of feedback and adaptive control strategies for cloud management
• Adaptive applications development
• Quality of service management
• Adaptive data management and processing on clouds
• Intrusion estimation and detection systems
• Autonomic federation of cloud infrastructure and services
• Platforms for autonomic applications

The guest editors invite original and high-quality research submissions addressing all aspects of this field, as long as the connection to the focus topic is clear and emphasized. Experience reports, surveys, critical evaluations of the state of the art, and insightful analysis of established and up-coming technologies are also welcome.

Special Issue Guest Editors
Manish Parashar, Rutgers University, USA
Javier Diaz-Montes, Rutgers University, USA
Omer Rana, Cardiff University, UK
Ioan Petri, Cardiff University, UK

Submission Information
Submissions should be 4,000 to 6,000 words long and should follow the magazine’s guidelines on style and presentation. All submissions will be single-blind anonymously reviewed in accordance with normal practice for scientific publications. For more information, contact the guest editors at ccm2-2015@computer.org.

Authors should not assume that the audience will have specialized experience in a particular subfield. All accepted articles will be edited according to the IEEE Computer Society style guide. Submit your papers to Manuscript Central at https://mc.manuscriptcentral.com/ccm-cs.

www.computer.org/cloudcomputing


FROM THE EDITOR IN CHIEF

ed attacks aimed at privacy violations, and denial-of- A Focus on service attacks also make it increasingly important to raise the security level and increase the focus on privacy protection in cloud settings for big data applications. Security and Given all this, cloud platforms need to be for- tified with robust security and privacy mechanisms to deliver reliable services. Specifically, this special issue aimes to address topics such as access control, Privacy in the encryption, collaborative threat detection using big data analytics, obfuscation, secure storage/retrieval for big data, watermarking of big data, and secure and efficient transmission of big data. These issues Cloud are discussed as the main focus of this effort to dis- seminate recent advances and stimulate future re- search directions in the specialized area of security and privacy within the context of big data applica- tions in a cloud environment. The columns in this issue address a diverse WELCOME TO THE THIRD ISSUE OF IEEE range of topics. In “Standards Now,” Alan Sill pres- CLOUD COMPUTING, DEDICATED TO “SE- ents a general overview of APIs, protocols, pro- CURE CLOUD COMPUTING TECHNIQUES gramming languages, and tools and how they relate FOR BIG DATA.” Bharat Bhargava, Ibrahim to cloud standardization. In “Cloud and the Law,” Khalil, and Ravi Sandhu are the guest editors for Raymond Choo looks at issues pertinent to mobile this special issue. cloud storage users. Paul Watson of Newcastle Uni- Cloud architectures are well suited to big data versity guest authors the “Blue Skies” department, deployments. To date, the bulk of the focus on this in which he explores ways to achieve application se- topic has been development of infrastructures, ana- curity through hybrid and federated clouds. David lytics, and visualization. 
Although other concerns Bernstein, in the “Cloud Tidbits” column, covers the such as security and privacy have received less at- role of containers in cloud architecture from LXC to tention, they are rising in importance. Many com- Dockers to Kubernetes. Finally, “What’s Trending?” mercially important big data applications need to with guest author Amandeep Khurana of Cloudera, share and process privacy-sensitive data. Increasing highlights issues for big data in public clouds. incidents of data misuse and data breach, distribut- In this issue, you will also see the first “Cloud and the Government,” in which column editor Laura Taylor provides a good overview of the Fed- eral Risk and Authorization Management Program (FedRAMP). It is my pleasure to introduce two new columns to IEEE Cloud Computing’s roster. Both will first appear in the magazine’s next issue. The first col- umn, “Cloud Economics,” will be led by Joe Wein- man (see Weinman’s bio in the sidebar), and will MAZIN YOUSIF cover cloud-economics related topics such as value chain, revenue models, and pricing models. T-Systems International The second column is the “Cloud Community

[email protected]______Corner,” which will cover various topics important to the cloud community, such as recent results from

8 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE

qM qMqM Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM

THE WORLD’S NEWSSTAND® qM qMqM Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM

THE WORLD’S NEWSSTAND®

INTRODUCING COLUMNIST JOE WEINMAN

oe Weinman is the author of boards of several technology companies. Previous- Cloudonomics: The Business ly, he has held executive positions at Bell Labs, AT&T, Value of Cloud Computing (Wiley, HP, and Telx. Among other accolades, he has been 2012, English, and PTPress, 2014, recognized as a “Top 10 Cloud Computing Leader.” Chinese), which examines private and Weinman has BS and MS degrees in computer public cloud cost and performance science from Cornell University and the Univer- optimization from a quantitative perspective. sity of Wisconsin-Madison, respectively, and has Weinman is also the author of the forthcoming completed executive education at the International book, Digital Disciplines (Wiley CIO), which focuses Institute for Management Development in Laus- on how IT can invigorate business strategy through anne. He has been awarded 21 patents in areas such better processes, products, customer relationships, as homomorphic encryption, pseudoternary line and innovation. coding, adaptive bandwidth schemes, Web search, Weinman is currently the chair of the IEEE and distributed storage and computing. Intercloud Testbed executive committee, an analyst We look forward to his contribution to IEEE for GigaOm Research, and serves on the advisory Cloud Computing!

cloud-themed conferences, book reviews, and in- teresting observations from thought leaders in the NEWSLETTERS cloud community. The magazine’s entire editorial Stay Informed on Hot Topics board will contribute to this column, and we invite your submissions of items that can be brought to the attention of the community. Stay tuned for our next issue, a special issue on “Cloud-Based Smart Evacuation Systems for Emer- gency Management,” which will be available in late December.

MAZIN YOUSIF is the editor in chief of IEEE Cloud Computing. He’s the chief technology officer and vice president of architecture for the Royal Dutch Shell Global account at T-Systems, International. Yousif has a PhD in computer engineering from Pennsylva- nia State University.

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


CLOUD AND THE GOVERNMENT

FedRAMP: History and Future Direction

Laura Taylor, Relevant Technologies
[email protected]

THE FEDERAL RISK AND AUTHORIZATION MANAGEMENT PROGRAM (FEDRAMP), DEVELOPED BY THE US GENERAL SERVICES ADMINISTRATION (GSA) IN CONJUNCTION WITH THE US OFFICE OF MANAGEMENT AND BUDGET (OMB) AND THE FEDERAL CIO COUNCIL, WAS LAUNCHED 6 JUNE 2012. Figure 1 provides a timeline of FedRAMP-related events starting from the announcement of the initial working group through the two-year launch anniversary.

FedRAMP is the US government program to apply the Federal Information Security Management Act (FISMA) to cloud computing. Initially, skeptics warned that the program wouldn’t gain acceptance and would become another government IT casualty. Yet FedRAMP has been so successful that many governments in East Asia, Northern Europe, and the Americas are using it as a model for their own cloud security programs.

Cloud computing promises lower cost and the ability to quickly scale resources up or down as workloads demand, leading organizations in both the public and private sectors to seriously consider moving their applications and data to the cloud. Concern about cloud security has been the number one obstacle to adoption, particularly in the public sector. FedRAMP provides a comprehensive set of cloud security requirements and an independent assessment program backed by the chief information officers (CIOs) of the Department of Defense (DoD), the Department of Homeland Security (DHS), and the GSA. Cloud service providers (CSPs) that implement the required security controls and meet independent assessment requirements can be authorized for use by the federal government. There’s no shortage of CSPs jockeying for what has become the most coveted and prestigious qualifier of cloud security. So far, more than 50 CSPs have either been authorized, or are far enough into the process that the FedRAMP website lists them as “in process.”

Standardizing the Authorization Process

Since its 2002 launch, FISMA has required that all systems hosting US government data be authorized prior to being put into production. The authorization process is extremely comprehensive, and until FedRAMP came along, system owners had to go through the entire authorization process for each agency using their system, even if the system was exactly the same from one agency to another. FedRAMP standardized the process such that authorizations can be performed once and reused by multiple agencies. It saves both government and private sector CSPs a lot of time and money and enables fast adoption of new systems and services.

According to the FedRAMP program management office (PMO), Amazon estimates that its FedRAMP authorization saves approximately $250,000 per assessment. The FedRAMP PMO estimates that assessments cost the US government approximately $250,000. With the launch of FedRAMP, CSPs now pay for the assessments instead of the US government. The authorized cloud systems cover at least 160 known FISMA implementations across the

10 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


[Figure 1: timeline chart of cloud computing initiative and FedRAMP milestones by quarter, from Q1 2009 through Q1 2013.]

FIGURE 1. Federal cloud computing initiative and FedRAMP timeline. (Source: FedRAMP)

government, giving current FedRAMP cost savings a conservative estimate of $40 million.

When he became the first federal CIO, Vivek Kundra championed cloud adoption as a way for agencies to save resources and improve service. However, without a way to secure the cloud and enable FISMA authorizations, cloud adoption would not come easily. Kundra launched the Federal Cloud Computing Working Group under the Federal CIO Council, a group of government CIOs that meets regularly to discuss government IT initiatives. At one of the council meetings, GSA CIO Casey Coleman volunteered GSA to take the lead in addressing federal agencies’ adoption of cloud computing. Coleman in turn appointed her chief of staff, Katie Lewin, to manage the effort. In April 2009, GSA established the Cloud Computing PMO, and Lewin became the Federal Cloud Computing Initiative Director.

In addition to FedRAMP, Lewin was charged with heading up the Federal Data Center Consolidation Initiative (FDCCI). According to Lewin, “The FDCCI project was really the camel’s nose under the tent for launching government cloud and ultimately FedRAMP.” Lewin ensured that the brain trust at the National Institute of Standards and Technology (NIST) was involved with FedRAMP from the start.

The Federal Cloud Computing Working Group was initially chaired by Peter Mell. Mell was part of the NIST Information Technology Laboratory in the Computer Security Division and became involved in the working group after writing the technical definition of cloud computing adopted by the government cloud program.1 In fall 2009, the group identified cloud authorization as the largest security hurdle to government cloud adoption. To address this, Mell conceived of the notion of government-wide authorization and worked out a formal process with his NIST colleagues (such as Ron Ross). He presented “A Notional Process on Security Assessment and Authorization for Cloud Computing Systems” to the working group and the cloud PMO. It was well received and, in early 2010, Lewin worked with Mell to present the idea to Kundra and then to the CIO council.

Forming a Cloud Policy

To pitch the idea, they needed a name. The acronym FedRAMP appeared on a paper plate next to Mell’s sandwich


Table 1. FedRAMP preparation requirements.

Checklist  Description
1          You have the ability to process electronic discovery and litigation holds
2          You have the ability to clearly define and describe your system boundaries
3          You can identify customer responsibilities and what they must do to implement controls
4          System provides identification & 2-factor authentication for network access to privileged accounts
5          System provides identification & 2-factor authentication for network access to non-privileged accounts
6          System provides identification & 2-factor authentication for local access to privileged accounts
7          You can perform code analysis scans for code written in-house (non-COTS products)
8          You have boundary protections with logical and physical isolation of assets
9          You have the ability to remediate high risk issues within 30 days, medium risk within 90 days
10         You can provide an inventory and configuration build standards for all devices
11         System has safeguards to prevent unauthorized information transfer via shared resources
12         Cryptographic safeguards preserve confidentiality and integrity of data during transmission

Source: Guide to Understanding FedRAMP5
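Item 9 of the checklist can be read as a simple deadline rule. The following Python sketch is only an illustration of that rule, not anything FedRAMP publishes: the function names are hypothetical, and it assumes the remediation window is counted in calendar days from the date a finding was identified, which the checklist does not specify.

```python
from datetime import date, timedelta

# Remediation windows in days, taken from checklist item 9.
# Counting from the discovery date is an assumption made for this sketch.
REMEDIATION_DAYS = {"high": 30, "medium": 90}


def remediation_due(found_on, severity):
    """Return the date by which a finding of this severity should be remediated."""
    return found_on + timedelta(days=REMEDIATION_DAYS[severity])


def is_overdue(found_on, severity, today):
    """Return True if an open finding has passed its remediation window."""
    return today > remediation_due(found_on, severity)


if __name__ == "__main__":
    scan_date = date(2014, 9, 1)
    for severity in ("high", "medium"):
        print(severity, "due by", remediation_due(scan_date, severity))
```

A CSP preparing for assessment might run a check like this against its open findings to see whether it could meet the windows in practice.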

one day as he listed descriptive words for government-wide authorization programs. The logo (similar to the one used today) was a result of an internal security working group competition. The idea was adopted and the admittedly slow process of creating the first government-wide authorization process began.

In December 2010, Kundra published the 25 Point Implementation Plan to Reform Federal Information Technology Management.2 The plan announced the cloud first policy, which stated, “When evaluating options for new IT deployments, OMB will require that agencies default to cloud-based solutions whenever a secure, reliable, cost-effective cloud option exists.”2 In February 2011, Kundra released the Federal Cloud Computing Strategy, which stated that government agencies must focus on managing services rather than assets.3 In this paper, Kundra estimated that $20 billion of the then $80 billion in IT spending could be migrated to the cloud. Kundra forecast that by moving to the cloud, government agencies could improve server utilization by 60 to 70 percent and could increase responsiveness to urgent agency needs. The stage was set and government agencies would have to start migrating to cloud, like it or not (see Figure 1 for a chronology of events).

In August 2011, former Microsoft executive Steven VanRoekel succeeded Kundra as federal CIO. VanRoekel established FedRAMP via the “Security Authorization of Information Systems in Cloud Computing Environments” memorandum issued on 8 December 2011 (see https://cio.gov/wp-content/uploads/2012/09/fedrampmemo.pdf), which provided a cost-effective, risk-based approach for the adoption and use of cloud services. Under Lewin’s leadership, the FedRAMP PMO ramped up quickly on resources when an OMB examiner transferred resources from the GSA Federal Acquisition Service (FAS) office to Lewin. Lewin hired Matthew Goodrich as deputy program manager. Lewin retired from the government in 2013, and Goodrich is the current acting FedRAMP director.

Although FedRAMP has attracted Amazon, Microsoft, HP, IBM, AT&T, and other big players, the first CSP to be authorized was Autonomic Resources, a government-only CSP headquartered in Research Triangle Park, North Carolina. Autonomic Resources predicted early that FedRAMP and DoD authorizations would be a boon to business and built a government cloud specifically for FedRAMP authorization. According to James Bowman, Autonomic Resources’ government compliance director, “The ARC-P IaaS government community cloud solution was designed and built from the ground up to meet the stringent FedRAMP and DoD security control requirements. Our value lies in our cost savings, custom-built cloud services for government, and our high level of security and compliance.”


Achieving FedRAMP Authorization

FedRAMP doesn’t certify or authorize products of any kind. Rather, it aims to verify public and private cloud systems’ security through FISMA. All US government clouds, private and public, must comply with FedRAMP. A system must already be built for its security to be verified. FedRAMP doesn’t care what products you use to build your cloud as long as the system is secure, and as long as it meets the FedRAMP security control baseline. The Joint Authorization Board (JAB, which consists of the DoD, DHS, and GSA) selected the controls from NIST SP 800-53, Security and Privacy Controls for Federal Information Systems and Organizations.4 The Guide to Understanding FedRAMP, V2.0, 6 June 2014, includes a preparation checklist (see Table 1).5 If a CSP system can’t at the minimum meet these requirements, it isn’t a suitable candidate for FedRAMP.

Before FedRAMP, the authorization process inherently had many redundancies that duplicated authorization work from one agency to another. One agency didn’t necessarily trust another agency’s authorization process because it used different controls and security templates, and the independent assessment process differed from agency to agency. Even if an agency had authorized a cloud platform, each time a new agency wanted to use that platform, the CSP had to go through the authorization process all over again, as Figure 2a illustrates.

With the advent of FedRAMP, agencies now use the same security control baseline, the same security templates, and the same independent assessment process, as illustrated in Figure 2b. The new process ensures consistency across all government agencies and instills a reciprocity of trust between agencies. Once a CSP has been authorized, any agency can leverage that authorization without repeating the process.

[Figure 2: two diagrams, each showing cloud providers, federal agencies, and risk management.]

FIGURE 2. Authorization process for federal clouds: (a) old way and (b) new way. (Source: FedRAMP)

This new approach speeds up an agency’s ability to roll out cloud services while reducing the cost of the authorization. The Department of Health and Human Services (HHS), the Department of Transportation (DOT), the Department of Agriculture (USDA), and the Department of Housing and Urban Development are at the forefront of cloud adoptions.

CSPs can use three different avenues to become authorized under FedRAMP. They can be authorized by the JAB or by an agency directly, or a CSP can self-submit a security package as a candidate for authorization. The FedRAMP website lists the three security package types. As of this writing, no CSP self-submitted packages are listed, although multiple CSPs are currently putting together packages in that category.

Once an agency decides to authorize a candidate package, the package moves to the “agency authorization” category on the FedRAMP website. The primary difference between an agency-authorized package and a JAB-authorized package is the level of review it undergoes. Agency-authorized packages are reviewed by one agency,


whereas JAB-authorized packages are reviewed by the DHS, DoD, and GSA CIOs and their technical teams. The JAB’s technical review teams consist of up to a dozen people from DoD, DHS, and GSA, all looking at the Security Assessment Report from different angles. Because of the number of people that review packages slated for JAB authorization, it can take considerably longer to get through the FedRAMP process if going through the JAB. Once a security package is listed in the FedRAMP repository, federal agencies can review it to determine if they want to use the system described in the package.5 Figure 3 summarizes the three FedRAMP security package types described above.

Category  Description
CSP       CSP supplied, not yet reviewed (candidate for authorization)
Agency    Reviewed and authorized by agency
JAB       Reviewed by FedRAMP ISSO and JAB, authorized by JAB
(The level of review increases from CSP to JAB.)

FIGURE 3. Summary of FedRAMP packages.

CSPs should not presume that their work is done after their system has been authorized. Continuous monitoring is required. According to Goodrich, “What we’ve seen at FedRAMP is that the hard part of security is people and processes, not the technology. The alignment of business processes like configuration management and patch management with vulnerability scanning is critical to a successful implementation of security on all systems.” Authorized CSPs must perform monthly scans and send the scan results to their government authorization point of contact. High vulnerabilities must be mitigated within 30 days and moderate vulnerabilities must be mitigated within 90 days. Failure to mitigate vulnerabilities according to these requirements could lead to a CSP having its authorization suspended or revoked. FedRAMP’s Continuous Monitoring Strategy Guide is available on the FedRAMP website.6

FEDRAMP WILL CONTINUE TO EVOLVE ITS PROGRAM AND PROCESSES OVER TIME. Check in at www.fedramp.gov for the latest updates.

References

1. P. Mell and T. Grance, The NIST Definition of Cloud Computing, NIST Special Publication 800-145, Nat’l Inst. of Standards and Technology, Sep. 2011; http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf.
2. V. Kundra, 25 Point Implementation Plan to Reform Federal Information Technology Management, The White House, 9 Dec. 2010; https://www.dhs.gov/sites/default/files/publications/digital-strategy/25-point-implementation-plan-to-reform-federal-it.pdf.
3. V. Kundra, Federal Cloud Computing Strategy, The White House, 8 Feb. 2011; https://www.dhs.gov/sites/default/files/publications/digital-strategy/federal-cloud-computing-strategy.pdf.
4. Joint Task Force Transformation Initiative, Security and Privacy Controls for Federal Information Systems and Organizations, NIST Special Publication 800-53, rev. 4, Nat’l Inst. of Standards and Technology, Apr. 2013; http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r4.pdf.
5. Guide to Understanding FedRAMP, Federal Risk and Authorization Management Program (FedRAMP) V2.0, 6 June 2014; http://cloud.cio.gov/document/guide-understanding-fedramp.
6. Continuous Monitoring Strategy Guide, Federal Risk and Authorization Management Program (FedRAMP) V2.0, 6 June 2014; http://cloud.cio.gov/document/continuous-monitoring-strategy-guide.

LAURA TAYLOR is the founder of Relevant Technologies and the chair of the FISMA Center’s Advisory Board. She specializes in security compliance and security audits of government agencies and financial institutions. Taylor has provided information security consulting services to some of the largest financial institutions in the world including the IRS, the US Treasury, the US government-wide accounting system, and various regional banks. She has also served as director of security research at TEC, chief information officer of Schafer Corporation, director of information security at Navisite, director of certification and accreditation at COACT, and director of security compliance at USfalcon.


STANDARDS NOW

Cloud Standards and the Spectrum of Development

Alan Sill, Texas Tech University
[email protected]

IN MODERN DEVELOPMENT ENVIRONMENTS, INNOVATION MUST BE BUILT IN EXPLICITLY AND NOT EXTERNALLY IMPOSED. Successful cloud computing methods are characterized by their intrinsic utility, capacity for highly scaled implementation, and ability to adapt to rapid change. These methods can be classified for practical purposes in terms of APIs, protocols, languages, and tools.

Classic approaches to producing standards are as archaic when applied to cloud computing as last century’s lighting, transportation, and communication systems are to the rest of our society’s infrastructure. For standards to work and be suitable in this new setting, we need an approach that promotes rapid feedback and simultaneous or near-simultaneous development and implementation.

In earlier columns, I’ve explored the role of communities in developing cloud standards and laid out the landscape of the organizations operating in this space. I’ve also argued that standards are part of a continuous spectrum of development that ranges from purely practical to purely theoretical end points, and that a standard can be defined lightly as “anything agreed to by more than one party.” More formal definitions are certainly possible, and I’ve also discussed the various types of standards organizations and the importance of defining our terminology precisely to understand this spectrum of development.

This time, I’ll compare and contrast the different types of cloud software components, and discuss the pros and cons of taking a combined development plus operations (“DevOps”) approach to accelerate progress on software and standards. I’ll focus on practical ways in which standards fit into familiar categories used by programmers on a day-to-day basis, and on how rapid feedback can improve them for use in these settings.

Cloud Development Categories

For convenience, we can organize the components of cloud software and associated methods into broad categories. I present one such classification here. It isn’t the only possible scheme, and might not satisfy architecture purists, but I’ve simplified the discussion to focus on current cloud computing trends and needs. The point of this classification is to expose features that relate directly to current innovation opportunities and to discuss the consequent need for standards development methods that can keep pace with rapid software progress.

Application Programming Interfaces

APIs have emerged as a key feature of the new cloud ecosystem. They’ve become so popular that they’re sometimes the only components of cloud software design that beginning programmers encounter, and such beginners can be forgiven for thinking that these are the only components of


cloud software that matter. APIs dominate current discussion to the point that they’ve developed their own conferences, trends, and place in the economic landscape. Nonetheless, they aren’t the entire picture and can’t stand on their own. We’ll have to delve deeper to understand their history and relation to other important cloud functionality and features.

APIs represent boundary-level conditions needed to transfer information into and out of cloud software environments. Classically, these environments were executable applications that were incapable of exposing their internal processes or parameters to the outside world for alteration or external consumption, hence the need to pass input, output, and control features through a defined interface. The other major category of boundary-level interfaces used in computing in general is often referred to as application binary interfaces (ABIs). I’ll reserve discussion of ABIs to a later column.

In general, an interface defines features such as syntax, semantics, and optional versus required components of the information to be presented. In general use, APIs and ABIs also describe characteristics of the programming call sequences, such as classes and details of the methods to be used and objects or information to be exchanged.

As the popularity of cloud computing grew, the API paradigm became so useful that almost all cloud software developed APIs, even if they weren’t interfaces to “applications” in the classical sense. As this occurred, an evolution took place in the design of APIs and their use. Historically, APIs were often specific to the actual programming languages used and weren’t generally interchangeable between different language calls to the methods. In cloud computing, the dominance of Web-based models and of their formal service-oriented architecture underpinnings allowed cloud APIs to be used across several different language implementations.

APIs based on the Representational State Transfer (REST) design pattern now dominate designs currently employed in new cloud software. It’s worth noting, however, that earlier progress in decoupling APIs from dependence on specific language call interfaces and methods was driven by the previously dominant method in service-oriented architecture design, which is that of the pattern introduced in the late 1990s as the Simple Object Access Protocol (SOAP).

Web services based on SOAP and other closely related methods used XML to define interfaces so formally that code could actually be generated entirely from a compact description of the interface, the Web Services Description Language (WSDL), without reference to any other knowledge of the interface’s characteristics.

This important development played a crucial role in getting programmers to think of APIs as potentially language-independent constructs that could be useful by themselves. Despite the obvious value of the language independence afforded by WSDL and SOAP, programmers eventually rebelled at the XML-only basis and prescriptiveness of these methods. Although they’re still in use in a variety of Web services and enjoy a strong following for certain types of programming, many of the new features of cloud methods have transitioned to the REST paradigm.

This style change has been driven partly by the desire to be able to refactor services in different ways to span smaller or larger portions of the problems to be solved, and partly by the need for control to define the boundaries of the portion of the system exposed through an API. One of the defining characteristics of cloud computing is flexibility to draw this control boundary in ways that sometimes cross the conventional norms of service-oriented architecture.

Modern API design for cloud computing often uses design principles in ways that are beginning to resemble formal guidelines that lead in the direction of standards, or even to be expressed formally as standards. Associated tools are emerging to allow APIs to express discoverability of functional features and to build in self-description of their characteristics and methods of use.

Examples of API format description and definition tools that include open formal specifications in addition to related implementation software include Swagger (http://swagger.io), API Blueprint (http://apiblueprint.org), and the RESTful API Modeling Language (RAML; http://raml.org). These approaches can be used to encourage or enforce API self-documentation and structure. In addition, a wide variety of related open source and commercial software has emerged to provide manageability, analytics, and other features, sometimes provided externally by third parties as filters or add-ons to existing APIs.

Protocols

Protocols are another important component of cloud methods. No matter how well described, the interface (API or ABI) can’t exercise all of the functionality needed to control and interact with running processes alone. A complete approach to online communication also needs protocols to define and describe the sequence of operations, format, and sequence of bits “on the wire” and characteristics such as timing, content, or other design principles that govern the information to be passed through it.

Protocols can be distinguished from APIs by the degree to which they specify interrelationships between different aspects of the information to be presented, and often the time sequence, content, and/or ordering of data and operations. Protocols also cover address and data formats, and the mappings that are needed to interrelate them. They can express subtle characteristics of sequence and flow in ways that are difficult to express purely within the context of an API. TCP/IP, which governs most operations on the Internet, is a good example.

Unlike APIs, protocols were originally designed to be as independent of language implementations as possible. Because APIs have recently become increasingly language independent, protocols and APIs are now often designed in tandem, and many current cloud standards are written with both protocol and API components. The Open Cloud Computing Interface (OCCI), for example, defines a boundary-level API and protocol for RESTful control of cloud computing components existing within the boundary of the system to be controlled. Other standards sometimes concentrate on one or the other of these aspects, or on specific details of control and communications.

REST-based standards and models that use HTTP as their transport protocol can be further distinguished from each other by their use of hypermedia, which is an essential feature of modern cloud API usage that depends on the detailed nature of HTTP.

Protocols are generally used in organized versions, so are best developed as standards. This aspect of cloud development is easy to miss, because it’s essentially taken for granted that good protocols will be used to organize the communications handled by our APIs. Organizations that develop protocols include all of the major standardization bodies, such as the World-Wide Web Consortium (W3C) and Internet Engineering Task Force (IETF), and essentially all of the organizations covered in previous columns.

Languages, Tools, and the Overall Development Environment

Cloud implementations are written in a variety of programming languages with methods that are supplemented by an even wider variety of tools. It’s easy to ignore that these languages and tools are themselves often organized and defined by standards and divided into releases and versions. Use of languages and tools follows a pattern with wide variation in terms of size and type of supporting organization, and solutions with a single person underlying the approach aren’t unusual. There is only space in this column to touch lightly on this topic.

The concepts of interoperability and scalability have made the distinction between different types of languages and tools largely irrelevant by design as a deliberately targeted feature of cloud solution deployment. It’s taken for granted that a successful cloud infrastructure won’t depend unduly on features of the programming methods used to create and implement it. This aspect is almost a design requirement for modern cloud development.

Formally standardized external analysis methods can also be applied usefully in cloud computing. One example is the TLA+ language, specification, and tools developed by Leslie Lamport (see http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html). Using mathematical set theory and predicates, the TLA+ approach describes the legal behaviors of a system. Amazon has recently used TLA+ along with similar methods to find and eliminate problems due to time sequencing, dependencies, and design flaws within its software and infrastructure.1

Because of the importance of networking to the successful deployment of cloud-based systems, a wide variety of work is ongoing to develop new standards and languages that can be used to express the features of networks in cloud settings. I will explore this topic further in a future column.

The environment in which cloud methods are developed and used is as important to their success as the interface, protocols, languages and tools used to implement them. Because the cloud environment emphasizes advantages in scalability, on-demand de- [...] machine and prebuilt custom container images. These are clearly not the only methods of software development and distribution, but they’ve gained importance over the last several years.

A DevOps Approach for Standards

Cloud computing emerged specifically to deliver methods for providing services that are easy to factorize and deploy, and can be implemented rapidly at greatly variable scales. Such a setting requires procedures that allow quick cycling and continuous integration between the development and operational deployment of cloud services. The industry has therefore adopted the widely popular DevOps strategy, which combines aspects of development and operations to speed implementation and testing of new solutions and application of new methods.

Although earlier computing models [...] improvement in any area, and is especially needed in cloud computing. The quicker the feedback, the quicker we can expect progress.

Unfortunately, earlier standards development practices were based on slower and more formal communication patterns that don’t lend themselves to today’s rapid progress and rapid cycling between conceiving new ideas and testing them in the field. To alleviate this shortcoming, we need to take a DevOps approach to bridge the gaps more quickly between formal ideas and practical implementation. In doing so, we also need to scale the communication patterns horizontally to involve more opinions and feedback for the betterment of the field.

One reason that OpenStack (http://specs..org) is making such progress now, for example, is that it has exposed its specification-writing process to community input and formalized the process of pulling resulting improvements into the project’s core development, selection, and verification procedures. Other similar software projects, such as CloudStack (https://cwiki.apache.org/confluence/display/cloudstack/design) and OpenNebula (http://community..org/interoperability), are also providing such information. This approach will be strongest if mutual engagement occurs between standardization communities and software developers in each project.
ployment, flexibility in interoperation, could have used this approach, cloud Engagement of this type is begin- and bridging between multiple levels computing has several characteristics, ning to happen, and open source im- of information, people working in this such as ease of simultaneous side-by-side plementations of OCCI, Topology and area prefer tools with the same flexible comparisons of performance and factor- Orchestration Specification for Cloud characteristics. ization of services, that make the DevO- Applications (TOSCA), Cloud Data Open project and code reposito- ps approach particularly attractive. Management Interface (CDMI), Cloud ries and software distributions laid the This approach can apply equally Infrastructure Management Interface groundwork for the cloud environment, well to standards. Identifying methods (CIMI), and other emerging cloud stan- and lately this approach has been ex- to feed input from real-world experience dards are now available in each of the tended to include methods for public back to standards developing bodies is above software efforts, as well as in gen- sharing of libraries of entire virtual an important and necessary step toward eral-purpose software libraries suitable

18 IEEE CLOUD COMPUTING WWW.COMPUTER.ORG/CLOUDCOMPUTING______

qM qMqM Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM

THE WORLD’S NEWSSTAND® qM qMqM Previous Page | Contents | Zoom in | Zoom out | Front Cover | Search Issue | Next Page qMqM

THE WORLD’S NEWSSTAND®

for use in other settings. A quick search endorsed its joint statement of affirma- news you think the community should in the GitHub repositories will yield tion (see http://open-stand.org/about-us/ know. You can reach me at ______alan.sill@ several relevant projects. ______affirmation). standards-now.org. Where closed-source development Despite this strong endorsement, still occurs, it also needs to be pursued other standardization organizations References in a way that encourages rapid cycling have departed from or haven’t yet en- 1. C. Newcombe et al., “Use of Formal between ideas and implementations for dorsed the OpenStand principles. Methods at ,” standards to be effective in these set- Among these is the Web Hypertext Ap- online publication, 29 Sept. 2011; tings. Some degree of interoperability plication Technology Working Group http://research.microsoft.com/en can be extended to otherwise nonstan- (WHATWG), an organization that -us/um/people/lamport/tla/formal______dardized commercial products through formed a decade ago to pursue evolu- ______-methods-amazon.pdf. associated open source projects. The tion of hypertext-related specifications Eutester project (https://github.com/eu-______and explicitly includes a non-consensus- ______calyptus/eutester), for example, can be based membership steering group partly ALAN SILL directs the US National Sci- used to automate tests of a justified by the professed need for speed ence Foundation Center for Cloud and or Amazon cloud. Although no formal in development. Consensus unfortu- Autonomic Computing at Texas Tech open consensus-based standards exist nately takes time and can produce slow University, where he is also a senior sci- to provide the community underpin- and variable results. 
entist at the High Performance Com- nings for Amazon-compatible products, One way to mitigate these problems puting Center and adjunct professor of projects such as Eutester can partially is to encourage rapid testing against physics. He serves as vice president of fill the gap between product features major implementations, which can standards for the Open Grid Forum and and their user communities. sometimes squeeze out opinions not co-chairs the US National Institute of backed by large-scale organizational Standards and Technology’s “Standards Consensus Versus Speed, and participants. The divergence between Acceleration to Jumpstart Adoption of Rapid Testing as a Cure WHATWG and W3C specifications for Cloud Computing” working group. Sill The downside of taking a rapid-cycling HTML is an example of the potential holds a PhD in particle physics from approach aimed only at functionality is pitfalls in this area. Cloud computing American University. He’s an active that it can place a great deal of pressure needs processes to create open active member of IEEE, the Distributed Man- on the methods commonly used to de- communication between development agement Task Force, TM Forum, and velop consensus within open standards of software and standards without en- other cloud standards working groups, communities. Standards work best if countering such difficulties. and has served either directly or as liai- they can be used to bridge the differ- son for the Open Grid Forum on several ences between projects to provide the national and international standards basis for interoperation. Developing FUTURE INSTANCES OF THIS COL- roadmap committees. 
For further details, tools that can be adopted effectively in UMN WILL LOOK AT INDIVIDUAL visit http://cac.ttu.edu or contact him at multiple software projects and in com- STANDARDS IN TERMS OF CON- [email protected].______mercial products taxes our collective CEPTUAL FUNCTIONS THEY CAN ability to coordinate and test new fea- BE USED TO PERFORM, SUCH AS tures in different settings and build the IMAGE PORTABILITY, JOB PROVI- consensus needed to create effective SIONING, AND TASK ORCHESTRA- open standards. TION. Meanwhile, the information Such consensus is one of the five presented in this column should help core principles recently enumerated by illustrate the use of standards in neces- the OpenStand effort . Several major sary components of day-to-day software standards developing bodies, includ- development. ing the Internet Society, the IETF, the Please respond with your opinions Selected CS articles and columns Internet Architecture Board, the W3C, on this or previous columns, especially are also available for free at ____http:// ______ComputingNow.computer.org. IEEE, and the Open Grid Forum, have if you disagree with me, and include any
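As a footnote to the column's remark that TLA+ describes the legal behaviors of a system with sets and predicates, the idea can be sketched in a few lines of Python. This is not TLA+ and not the method Amazon used; it is a toy explicit-state check under stated assumptions. The two-worker lock system, the state layout, and the names `INITIAL`, `next_states`, `mutual_exclusion`, and `check` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy system (an assumption, not from the column):
# two workers share one lock. A state is (pc0, pc1, lock_holder),
# where each pc is "idle", "waiting", or "critical".
INITIAL = ("idle", "idle", None)


def _with(state, i, pc, holder):
    """Return a copy of state with worker i moved to pc and a new lock holder."""
    pcs = list(state[:2])
    pcs[i] = pc
    return (pcs[0], pcs[1], holder)


def next_states(state):
    """Yield every state reachable in one step (the next-state relation)."""
    holder = state[2]
    for i in (0, 1):
        if state[i] == "idle":
            yield _with(state, i, "waiting", holder)
        elif state[i] == "waiting" and holder is None:
            yield _with(state, i, "critical", i)   # acquire the lock
        elif state[i] == "critical":
            yield _with(state, i, "idle", None)    # release the lock


def mutual_exclusion(state):
    """Predicate over states: the two workers are never both critical."""
    return not (state[0] == "critical" and state[1] == "critical")


def check(initial, step, invariant):
    """Breadth-first search of all reachable states.

    Returns (ok, number_of_states_seen, counterexample_or_None).
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        if not invariant(current):
            return False, len(seen), current
        for nxt in step(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, len(seen), None


if __name__ == "__main__":
    print(check(INITIAL, next_states, mutual_exclusion))  # (True, 8, None)
```

The set of reachable states plays the role of the legal behaviors, and the invariant is an ordinary predicate evaluated on each one. Real specification tools handle far larger state spaces and temporal properties; the sketch only shows why time-sequencing flaws surface once every interleaving is enumerated.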


CLOUD AND THE LAW

Mobile Cloud Storage Users

Kim-Kwang Raymond Choo, University of South Australia
[email protected]

Published by the IEEE Computer Society. 2325-6095/14/$31.00 © 2014 IEEE

Mobile devices (such as Android, iOS, Windows, and BlackBerry devices) and mobile apps are rapidly becoming part of everyday life for individual and organizational users in both developed and developing countries.

One popular app category is apps that provide cloud-based storage services compatible with a range of devices, including PCs, laptops, and mobile devices. For example, Netskope reports that cloud storage apps such as , Amazon CloudDrive, OneDrive, and iCloud were among the top 20 most popular cloud apps during the first half of 2014.1 , another popular cloud storage app, had more than 100 million downloads on the Google Play store at the time of writing.

As with most popular consumer technologies, criminals can exploit vulnerabilities in mobile devices and operating systems or mobile apps to target mobile device and app users. Because of their capability to store vast amounts of user data, cloud storage apps are a potential and attractive target for criminals.

Threats to Mobile Device and Cloud Storage App Users

Gartner predicts that

[T]hrough 2017, 75% of mobile security breaches will be the result of mobile application misconfigurations. By 2017, the focus of mobile breaches will shift to tablets and smartphones from workstations. Through 2015, more than 75% of mobile applications will fail basic security tests.2

In May 2014, for example, a significant number of Australian iOS devices were reportedly hijacked and locked for ransom. Subsequent analysis determined that affected users’ iCloud accounts had been compromised.3 According to various media articles, affected users who didn’t set a passcode prior to the hack had to reset their devices to factory settings, resulting in the erasure of all user data stored on the affected devices.

Mazin Yousif, editor in chief of this magazine, also questioned whether the recent incident in which iTunes customers in 119 countries received U2’s “Songs of Innocence” without their consent4 suggests that criminals could potentially target iOS mobile device management (MDM). In principle, it isn’t impossible that iOS MDM servers could be compromised, say by a malicious insider, to push malicious or potentially unwanted applications to iOS devices managed by the affected servers. For example, in recent work, Samuel O’Malley and I presented a method that a corrupt insider could use to facilitate (inaudible) data exfiltration from an air-gapped system without using any modified hardware.5 Such techniques could easily be used to exfiltrate data from cloud servers.

Christoph Stach and Bernhard Mitschang highlighted the implications of poor privacy management approaches.6 They also pointed out that a vast majority of current mobile apps request access to highly sensitive data and personally identifiable information (PII), such as geographical location and contact data.

In other work, Christian D’Orazio and I proposed a generic process for identifying vulnerabilities and design weaknesses in iOS apps. Using this process, we revealed a previously unknown/unpublished vulnerability in a widely used Australian Government healthcare app that consequently exposes the user’s sensitive data and PII stored on the device.7

This is, perhaps, not surprising because many mobile apps weren’t designed with user security and privacy in mind, owing to the rush to attract new consumers and accelerate the product’s time to market. Such a situation is somewhat similar to two or three decades ago when published cryptographic protocols were subsequently found to be insecure.8

Suffice to note that threats to mobile device and cloud storage app users are real and increasingly important because of the increasing amount of user sensitive data and PII stored on and transmitted from mobile devices and cloud storage and other apps (for example, using browsers and apps to upload and download corporate and personal data from mobile devices to cloud storage servers).

The risk is not just to the mobile device and cloud storage app users, but also to the organizations they work for.

Routine Activity Theory Approach

The routine activity theory (RAT), often used to explain criminal events, proposes that crime occurs when a suitable target is in the presence of a motivated offender and is without a capable guardian.9

Offender motivation is a crucial element of RAT, which assumes that offenders are rational and appropriately resourced actors operating in the context of high-value and poorly protected targets.10 The interaction between potential victims (in our context, mobile device and cloud storage app users), offenders, and situational conditions (for example, opportunities such as devices connecting to free Wi-Fi, and weak guardianship such as poor security hygiene) influences the risk and impact of victimization.

FIGURE 1. Routine activity theory. RAT proposes that crime occurs when a suitable target is in the presence of a motivated offender and is without a capable guardian. (Diagram labels: Opportunity, Guardian, Motivation, Crime.)

I don’t think many of us want to wake up tomorrow and discover that the data we stored in the cloud was leaked and photos we assumed were private are no longer so. In September 2014, for example, a number of celebrities’ iCloud accounts were reportedly compromised, resulting in the theft of (intimate) photos from these compromised accounts.11–13 Apple subsequently confirmed the incident14:

After more than 40 hours of investigation, we have discovered that certain celebrity accounts were compromised by a very targeted attack on user names, passwords and security questions, a practice that has become all too common on the Internet.

Individual mobile cloud users must therefore be vigilant and take measures to protect the data stored on their mobile devices and in the cloud. Such measures should target one or more of the following areas (see Table 1):

• Reducing opportunity (for example, increasing the effort required to offend);
• Enhancing guardianship (for example, increasing the risk of getting caught); and
• Reducing motivation (for example, reducing the rewards of offending).

In summary, security measures shouldn’t lag behind new technology trends. Fortunately, the private sector has enormous incentives for contributing to mobile device/app and cloud security. Now is certainly a good time to get into the business of mobile device/app and cloud security.

We welcome your contributions and encourage you to be part of the mobile and cloud security landscape.16


Table 1. Suggested areas for improving data protection in mobile devices and cloud storage apps.

Security measure | Reduce opportunity | Enhance guardianship | Reduce motivation
Target hardening such as prompt installation of software and hardware patches and antivirus software | Yes | Yes | No
Report lost or stolen devices and cybervictimization to appropriate authorities | No | Yes | No
Delete data stored on mobile device before disposing of the mobile device and deactivating the account. | Yes | No | Yes
Delete data from cloud accounts before deactivating the account or before the contract expires for corporate cloud users (note that data anonymization and data deletion are not the same). One could also encrypt the data stored in the cloud before deleting the encryption key and the encrypted data from the account before deactivating the account or before the contract expires. | Yes | No | Yes
Avoid visiting websites of dubious repute or downloading unknown apps from third-party app stores | Yes | No | No
Use device encryption and alphanumeric and nonguessable password for cloud and other accounts | Yes | Yes | Yes
Use a two-step verification feature offered by cloud services such as Apple15 | Yes | Yes | No
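The table's advice to use an alphanumeric and nonguessable password can be illustrated with a short Python sketch. It assumes Python's standard `secrets` module as the source of randomness; the function name `generate_password`, the default length of 16, and the rule requiring at least one lowercase letter, one uppercase letter, and one digit are illustrative choices, not requirements stated in the article.

```python
import secrets
import string

# Alphanumeric characters only, matching the wording of Table 1.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length=16):
    """Return a random alphanumeric password.

    The result contains at least one lowercase letter, one uppercase
    letter, and one digit (an illustrative policy, assumed here).
    """
    if length < 3:
        raise ValueError("length must be at least 3 to satisfy the policy")
    while True:
        # secrets.choice draws from the operating system's secure
        # random source rather than a predictable generator.
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)):
            return candidate


if __name__ == "__main__":
    print(generate_password(20))
```

Generating the password from a secure random source, rather than from a memorable word or date, is what makes it nonguessable in practice. A password manager would normally store the result, because a 20-character random string isn't meant to be memorized.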

From a legal perspective, for example, what are the implications of user data and PII leakage from mobile devices? Should a cloud service provider be responsible for pure economic loss to cloud service users due to its negligent acts?

Other areas of interest include the potential surveillance risks faced by mobile cloud storage users, particularly in the aftermath of the revelations by Edward Snowden that the National Security Agency has been conducting wide-scale government surveillance, including those targeting mobile device and cloud users. Therefore, another key question that needs to be examined is, “How do we balance the need for a secure cloud computing ecosystem and the rights of individuals to privacy against the need to protect the society from serious and organized crimes, terrorism, and cyber and national security interests?”

References
1. Netskope, Netskope Cloud Report, report 7/14 RS-33-1, 2014; www.netskope.com/wp-content/uploads/2014/07/NS-Cloud-Report-Jul14-RS-00.pdf.
2. Gartner, “Gartner Says Worldwide PC, Tablet and Mobile Phone Combined Shipments to Reach 2.4 Billion Units in 2013,” press release, 4 April 2013; www.gartner.com/newsroom/id/2408515.
3. AppleInsider staff, “Hackers Use ‘Find My iPhone’ to Lockout, Ransom Mac and iOS Device Owners in Australia,” AppleInsider, 26 May 2014; http://appleinsider.com/articles/14/05/27/hackers-break-into-lock-macs-and-ios-devices-for-ransom-in-australia.
4. M. Williams, “Half a Billion iTunes Customers Receive Latest U2 Album for Free,” The Guardian, 10 Sept. 2014; www.theguardian.com/music/2014/sep/09/u2-songs-of-innocence-itunes-customers-free-album.
5. S. O’Malley and K.-K.R. Choo, “Bridging the Air Gap: Inaudible Data Exfiltration by Insiders,” Proc. 20th Americas Conf. Information Systems (AMCIS 14), 2014; http://aisel.aisnet.org/amcis2014/ISSecurity/GeneralPresentations/12.
6. C. Stach and B. Mitschang, “Privacy Management for Mobile Platforms - A Review of Concepts and Approaches,” Proc. 14th IEEE Int’l Conf. Mobile Data Management (MDM 13), 2013, pp. 305–313.
7. C. D’Orazio and K.-K.R. Choo, “A Generic Process to Identify Vulnerabilities and Design Weaknesses in iOS Healthcare Apps,” Proc. 48th Ann. Hawaii Int’l Conf. System Sciences (HICSS 15), to be published in 2015.
8. K.-K.R. Choo, Secure Key Establishment, Springer, 2009.
9. L.E. Cohen and M. Felson, “Social Change and Crime Rate Trends: A Routine Activity Approach,” Am. Sociological Rev., vol. 44, no. 4, 1979, pp. 588–608.
10. M. Felson, Crime and Everyday Life, Pine Forge Press, 1998.
11. L. Kelion, “Apple Toughens iCloud Security after Celebrity Breach,” BBC News, 17 Sept. 2014; www.bbc.com/news/technology-29237469.
12. D. Lewis, “iCloud Data Breach: Hacking and Celebrity Photos,” Forbes, 2 Sept. 2014; www.forbes.com/sites/davelewis/2014/09/02/icloud-data-breach-hacking-and-nude-celebrity-photos.
13. D. Wakabayashi and D. Yadron, “Apple Denies iCloud Breach,” Wall Street J., 2 Sept. 2014; http://online.wsj.com/articles/apple-celebrity-accounts-compromised-by-very-targeted-attack-1409683803.
14. Apple, “Update to Celebrity Photo Investigation,” Apple media advisory, 2 Sept. 2014; www.apple.com/pr/library/2014/09/02Apple-Media-Advisory.html.
15. Apple, “Frequently Asked Questions about Two-Step Verification for Apple ID,” 2014; http://support.apple.com/kb/ht5570.
16. K.-K.R. Choo, “Legal Issues in the Cloud,” IEEE Cloud Computing, vol. 1, no. 1, 2014, pp. 94–96.

KIM-KWANG RAYMOND CHOO is a senior lecturer in the School of Information Technology and Mathematical Science at the University of South Australia. His research interests include cyber and information security and digital forensics. He has published two books, six refereed monographs, nine refereed book chapters, and 101 refereed journal and conference articles. He is the recipient of various awards including a 2010 Australian Capital Territory Pearcey Award, 2009 Fulbright Scholarship, 2008 Australia Day Achievement Medallion and the British Computer Society’s Wilkes Award in 2007. Choo has a PhD in information security from Queensland University of Technology, Australia. Contact him at raymond[email protected] or https://sites.google.com/site/raymondchooau.


SECURE BIG DATA IN THE CLOUD

Guest Editors’ Introduction: Securing Big Data Applications in the Cloud

Bharat Bhargava, Purdue University
Ibrahim Khalil, RMIT University, Australia
Ravi Sandhu, University of Texas, San Antonio

Traditional security mechanisms are tailored for small-scale data, so they don’t meet the needs of big data analytics and storage applications. This special issue aims to stimulate discussion and research toward the innovation of security and privacy mechanisms for big data applications in a cloud environment.

Published by the IEEE Computer Society. 2325-6095/14/$31.00 © 2014 IEEE


Cloud-based platforms are playing an increasingly important role in the context of big data analytics and storage applications. The velocity, volume, and variety of big data for large-scale cloud infrastructures can’t be enhanced without security and privacy. Because traditional security mechanisms are tailored to securing small-scale data, they can’t meet the needs of big data. Moreover, the inherent vulnerabilities of a cloud-based environment require significant focus on both privacy and security together with risk management procedures. To stimulate discussion and invigorate research interest toward the innovation of security and privacy mechanisms for big data applications in a cloud environment, this special issue discusses topics such as intrusion detection and attack prevention, risk awareness, secure and efficient data sharing, and access control.

The call for papers was well timed given the dynamic ongoing research on security and privacy for cloud-based big data applications. We received numerous submissions, and, after a rigorous peer review process, we selected five articles for this special issue.

The Articles

In “Enhancing Big Data Security with Collaborative Intrusion Detection,” Zhiyuan Tan and his colleagues introduce a collaborative intrusion detection framework that focuses on efficiency, scalability, and self-adaption for big data applications in cloud computing. The system performs intrusion detection at both the host and network levels in a collaborative manner, using a model for parallel network summarization that utilizes cloud computing features.

The article, “Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters,” by Abdulrahman A. Almutairi and Arif Ghafoor, presents efficient risk-aware virtual resource management procedures that avoid information leakage in cloud-based multitenant sharing environments. The authors propose a sharing-based heuristic that reduces overall risk, and a partition-based heuristic that is scalable for large datacenters. They use sensitivity characterization to address the virtual resource assignment problem in environments with role-based access control (RBAC).

In “Efficient and Secure Transfer, Synchronization, and Sharing of Big Data,” Kyle Chard, Steven Tuecke, and Ian Foster propose secure and efficient data access, transfer, and sharing functions for large datasets across multiple types of local and cloud storage, which they achieve through the Globus software-as-a-service (SaaS) platform for data transfer and synchronization. Their secure framework supports resiliency and integrity while spanning a variety of heterogeneous data storage systems.

A fourth article, “Location-Based Security Framework for Cloud Perimeters,” by Chetan Jaiswal, Mahesh Nath, and Vijay Kumar, proposes a cost-effective model for location-based firewall filtering of attacks for mobile and static cloud environments. The authors introduce two schemes for identifying and filtering out static and mobile security attackers using a logic-based framework that’s coupled with the dynamic revision of firewall policies. These functions are performed in a distributed manner, keeping the local and global policies in sync.

Finally, in “Multilabels-Based Scalable Access Control for Big Data Applications,” Chen Hongsong, Bharat Bhargava, and Fu Zhongchuan propose a multilabel-based access control approach for Hadoop-based big data applications in clouds that is both efficient and scalable. The work combines active bundle, RBAC, discretionary access control (DAC), and mandatory access control (MAC), and includes a security degree, lifetime, and access policy among the multilabels. The authors evaluate the approach using a rigorous case study of a personal health record (PHR) data storage application. As both coauthor and guest editor, Bharat Bhargava did not take part in the peer review of this article.

We thank all of the authors who submitted manuscripts to this special issue. We also wish to thank the reviewers who helped to review the papers in a very short time period, as well as Editor in Chief Mazin Yousif for his encouragement and support in organizing this special issue. Finally, we thank the publication staff for their continuous support. We close this editorial by noting that several more feature topics on scalable and secure big data analytics are due to appear in the magazine in the near future.

BHARAT BHARGAVA is a professor of computer science at Purdue University. His research interests include security and privacy issues in distributed systems and sensor networks. This involves identity management, secure routing and dealing with malicious hosts, adaptability to attacks, and experimental studies. His recent work involves attack graphs for collaborative attacks. Bhargava has a PhD in computer science from Rutgers University. Contact him at bbs-[email protected].

IBRAHIM KHALIL is a senior lecturer in the School of Computer Science and IT, RMIT University, Melbourne, Australia. His research interests include data clustering, network security, scalable computing in distributed systems, m-health, e-health, wireless and body sensor networks, biomedical signal processing, and remote healthcare. Khalil has a PhD in computer science from the University of Berne, Switzerland. Contact him at [email protected].

RAVI SANDHU is the executive director of the Institute for Cyber Security at the University of Texas, San Antonio, where he holds the Lutcher Brown Endowed Chair in Cyber Security. His research interests include cybersecurity practice and education. Sandhu has a PhD in computer science from Rutgers University. He is an IEEE, ACM, and AAAS Fellow. Contact him at [email protected].


SECURE BIG DATA IN THE CLOUD

Enhancing Big Data Security with Collaborative Intrusion Detection

Zhiyuan Tan, University of Twente
Upasana T. Nagar, Xiangjian He, and Priyadarsi Nanda, University of Technology Sydney
Ren Ping Liu, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Song Wang, La Trobe University
Jiankun Hu, University of New South Wales

A collaborative intrusion detection system (CIDS) plays an important role in providing comprehensive security for data residing on cloud networks, from attack prevention to attack detection.

Cloud computing delivers a flexible network computing model that allows organizations to adjust their IT capabilities on the fly with minimal investment in IT infrastructure and maintenance. Because an organization need only pay for the services it uses, it can focus on its core business instead of handling technical issues.

In the cloud computing context, network-accessible resources are defined as services. These services are typically delivered via one of three cloud computing service models:

• Infrastructure as a service (IaaS) offers storage, computation, and network capabilities to service subscribers through virtual machines (VMs).
• Platform as a service (PaaS) provides an environment for software application development and hosts a client's applications in a PaaS provider's computing infrastructure.
• Software as a service (SaaS) delivers on-demand software services via a computer network, eliminating the cost of purchasing and maintaining software.


These technical and business advantages, however, don't come without cost. The security vulnerabilities inherited from the underlying technologies (that is, virtualization, IP, APIs, and datacenter) prevent organizations from adopting the cloud in many critical business applications.1 Generally speaking, cloud computing is a service-oriented architecture (SOA). Earlier work gives a comprehensive dependability and security taxonomy framework revealing the complex security cause-implication relations in this architecture.2 We summarize cloud computing vulnerabilities by underlying technology in the sidebar.

These vulnerabilities leave loopholes, allowing cyberintruders to exploit cloud computing services and threatening the security and privacy of big data. Various security schemes, such as encryption, authentication, access control, firewalls, intrusion detection systems (IDSs), and data leak prevention systems (DLPSs), address these security issues. In this complex computing environment, however, no single scheme fits all cases. These schemes should thus be integrated and cooperate to provide a comprehensive line of defense.

Intrusion Detection for Securing Cloud Computing

IDSs aim to provide a layer of defense against malicious uses of computing systems by sensing attacks and alerting users. Because it's impossible to prevent cyberattacks, IDSs have become essential to securing cloud computing environments.

IDSs are commonly categorized by the type of data source involved in detection. Host-based IDSs (HIDSs) detect malicious events on host machines. They handle insider attacks (which attempt to gain unauthorized privileges) and user-to-root attacks (which attempt to gain root privileges to VMs or the host). Network-based IDSs (NIDSs) monitor and flag traffic carrying malicious contents or presenting malicious patterns. This type of IDS can detect direct and indirect flooding attacks, port-scanning attacks, and so on.

Although DLPSs can to some extent be considered a type of IDS, they're more tailored to data security. However, it's difficult to completely guarantee data security using DLPSs alone. Attackers who gain control of the host machines can modify the DLPS settings, thereby completely disclosing data to those attackers. Moreover, even though firewalls can block unwanted network traffic packets according to a predefined rule set, they can't detect sophisticated intrusive attempts such as flooding and insider attacks. IDSs, DLPSs, and firewalls are therefore not interchangeable security schemes but collaborative ones.

Conventional IDSs

Conventional IDSs are mostly standalone systems residing on computer networks or host machines. They can be categorized as misuse-based or anomaly-based IDSs, depending on the detection mechanism applied.

Misuse-based IDSs enjoy high detection accuracy but are vulnerable to all zero-day intrusions.3 This is due to the underlying detection mechanism that checks for a match with existing attack signatures. Obviously, an IDS can't generate signatures for an unknown attack. Anomaly-based IDSs show promise for detecting zero-day intrusions,4,5 but are prone to high false positives.

Current enterprise networks (such as cloud computing environments) typically have multiple entry points. This topology is intended to enhance a network's accessibility and availability, but it leaves security vulnerabilities that sophisticated attackers can exploit using advanced techniques, such as cooperative intrusions.

Unlike traditional attack mechanisms, cooperative attack mechanisms are launched simultaneously by slave machines within a botnet. Attackers organize instances of this attack type to penetrate an enterprise network through all its entry points. By evenly distributing the attack traffic volume to all the different entry points, these cooperative intrusions can evade detection by traditional standalone IDSs set in front of the entry points. This is because network traffic behavior at each entry point doesn't significantly deviate from normal behavior. After traveling through the entry points, the attack instances are directed to a single targeted service within the enterprise network.

Moreover, many of the existing intrusions can occur collaboratively and simultaneously on nodes throughout a network. Attackers can initiate automated attacks targeting all vulnerable services within a network simultaneously,6 rather than focusing on a specific service.
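The evasion described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the two thresholds, the service names (`svc-a`, `svc-b`), and the traffic figures are all hypothetical, and a fixed threshold stands in for whatever detector a real IDS would use.

```python
from collections import Counter

# Assumed alert thresholds (requests per second toward one destination).
PER_ENTRY_LIMIT = 150   # applied by each standalone IDS at its own entry point
NETWORK_LIMIT = 300     # applied to traffic pooled across all entry points

def standalone_alerts(observations):
    """Destinations that a single entry-point IDS would flag by itself."""
    return {dst for dst, rate in observations.items() if rate > PER_ENTRY_LIMIT}

def pooled_alerts(all_observations):
    """Destinations flagged after summing rates across every entry point."""
    total = Counter()
    for observations in all_observations:
        total.update(observations)   # adds the per-destination rates
    return {dst for dst, rate in total.items() if rate > NETWORK_LIMIT}

# Hypothetical attack: 400 req/s aimed at svc-a, split evenly over 4 entry points.
entry_points = [
    {"svc-a": 100, "svc-b": 40},
    {"svc-a": 100, "svc-b": 35},
    {"svc-a": 100, "svc-b": 50},
    {"svc-a": 100, "svc-b": 45},
]

local = [standalone_alerts(obs) for obs in entry_points]
pooled = pooled_alerts(entry_points)
print(local)    # [set(), set(), set(), set()]
print(pooled)   # {'svc-a'}
```

No single entry point sees enough traffic toward `svc-a` to raise an alert, yet the pooled view does, which is the gap that correlating evidence across IDSs is meant to close.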


VULNERABILITIES IN UNDERLYING TECHNOLOGIES

Vulnerabilities in the cloud's underlying technologies allow cyberintruders to exploit cloud computing services and threaten the security and privacy of big data.

Virtualization

Virtualization facilitates multitenancy and resource sharing (such as physical machines and networks) and enables maximum utilization of available resources. Categories include full, OS-layer, and hardware-layer virtualizations. Virtual machines (VMs) can gain full access to a host's resources if isolation between the host and the VMs isn't properly configured and maintained. (In this case, the VMs escape to the host and seize root privileges.) In addition, a VM's security can't be guaranteed if its host is compromised. Hosts and their VMs share networks via a virtual switch, which VMs could use as a channel to capture the packets transiting over the networks or to launch Address Resolution Protocol (ARP) poisoning attacks. Finally, because a host shares computing resources with its VMs, a guest could launch a denial-of-service (DoS) attack via a VM by taking up all the host's resources.

IP Suite

The IP suite, the core component of the Internet, ensures the functioning of internetworking systems and allows access to remote computing resources. Defects in the implementation of the TCP/IP protocol suite can lead to a variety of attacks, including IP spoofing, ARP spoofing, DNS poisoning, Routing Information Protocol (RIP) attacks, flooding, HTTP session riding, and session hijacking.

Application Programming Interfaces

APIs provide interfaces for managing cloud services, including service provisioning, orchestration, and monitoring. Areas of vulnerability include weak credentials, authorization checks, and input-data validation, which could allow an attacker to seize root privileges. Developers might introduce defects during the design and implementation of cloud APIs or introduce new security vulnerabilities when fixing bugs.

Datacenter

Datacenter technologies allow administrators to manage and store data. Data is often stored, processed, and transferred in plaintext, which can be compromised, leading to the loss of confidentiality. Attackers might also find residual data from data that's been deleted. Finally, in a datacenter, data from different users (both legitimate users and intruders) is mixed together with weak separation, providing opportunities for an intruder to access the data of the legitimate users.

Need for Collaborative Intrusion Detection

Conventional standalone IDSs are susceptible to cooperative attacks, so they're unsuitable for collaborative environments (such as a cloud computing environment). To defend against this type of attack, collaborative intrusion detection systems (CIDSs) correlate suspicious evidence between different IDSs to improve the intrusion detection efficiency. Unlike conventional standalone IDSs, a CIDS shares traffic information with the IDSs located at a local network's entry points.

In practice, we can organize IDSs within a CIDS in a decentralized7 or hierarchical8 manner over a large network. These IDSs communicate directly with each other or with a central coordinator, according to the applied mode of organization.

In a decentralized CIDS, each IDS can generate a complete attack diagram of the network by


aggregating network information received from other IDSs in the CIDS. Detection of malicious attempts is undertaken locally at each IDS. In a hierarchical CIDS, a coordinator is a central point responsible for information aggregation. The central coordinator, which analyzes the aggregated information, generates a complete attack diagram of the network.

Limitations of Current Collaborative IDSs

Collaborative IDSs seem promising for detecting cooperative intrusions. However, existing system architectures aren't without criticism. In CIDSs, network data summarization is an important precursor to reliable intrusion detection.9 However, traditionally, network information is collected and processed by IDS software built on a single network device that only deals with the traffic flowing in and out of that device. It therefore has limited traffic information. In addition, the computation of network data summarization is proportional to the amount of traffic flow that single device experiences. Such an approach has drawbacks in terms of both accuracy and efficiency.

In terms of accuracy, without knowledge of the network data from other nodes, any summarization is specific to a partial and insignificant portion of all available data over the entire network. Exchanging and combining these summarizations later, without the actual data, provides a minimal information gain.

In terms of efficiency, nodes with denser traffic require additional computation to process summarization. Because summarization is a pure overhead operation, in an ideal environment, a node will have less traffic to process when performing summarization tasks.

Security is another concern for existing CIDSs. If a CIDS is compromised, the entire cloud computing environment is at risk. Conventional IDS software, installed on a single network device, analyzes and maintains network information on the device but doesn't include security properties that ensure confidentiality, authentication, and integrity. Thus, CIDSs that are designed simply by integrating conventional IDS software without proper security enhancements are vulnerable to attacks.

Collaborative Intrusion Detection Framework

Given the defects of existing CIDSs, a new sophisticated CIDS framework could strengthen the security of cloud computing systems. However, cloud computing presents unique issues. With a large, dense network of nodes forming a cloud environment, cloud computing offers us unprecedented opportunities for making available network data from all nodes. At the same time, it requires that we perform summarization and combine the results in a distributed and parallel manner. In addition, because we're now dealing with all the network data in the entire cloud, where an unknown number of categories can exist, the summarization algorithms will need to expand their categories on demand to automatically create new clusters when they discover new types of traffic emerging.

Given the characteristics of cloud computing, we must consider several desirable properties when designing a new CIDS framework. These properties include fast detection of various attacks with minimal false positive rates, scalability with the expansion of the cloud computing system, self-adaptation to changes in the cloud computing environment, and resistance to compromise.10 Figure 1 shows the framework of our proposed CIDS, which meets these requirements.

As Figure 1 shows, HIDSs and NIDSs cooperate to perform intrusion detection at the host and network levels, and each IDS in the network is equipped with signature- and anomaly-based detectors.11 This tactic ensures better detection accuracy for both known and unknown attacks.

There are two categories of nodes in this framework: cooperative agents and central coordinators. These nodes form a collaborative system whose security is assured through the implementation of various security mechanisms.

Cooperative Agents

Cooperative agents stand at the front lines and detect misuses on host machines or malicious behavior on networks. These agents are equipped with HIDSs or NIDSs depending on their location: agents installed on a host machine to detect suspicious events are equipped with HIDSs, whereas agents monitoring traffic on a network are equipped with NIDSs.

In our framework, the cooperative agents located on host machines are a new type of HIDS, requiring no instrumentation within VMs and modeling processes at the VM granularity level (that is, treating VMs as individual processes and modeling VM behaviors accordingly). This scheme ensures that our detection system complies with service-level agreements (SLAs) and legal restrictions, which might not allow an IaaS provider to make amendments or perform intensive monitoring and surveillance on client VMs. It also alleviates the ineffectiveness of NIDSs on encrypted traffic. The host-based cooperative agents inform a central coordinator when they detect an intrusive behavior or activity.

Cooperative agents residing at the network level conduct first-tier detection, defending against


[Figure 1 diagram labels: Internet; firewalls; NIDSs; gateway; central coordinator; backup central coordinator; host machines with HIDSs; cloud computing environment.]

FIGURE 1. Framework of a collaborative intrusion detection system (CIDS). The figure illustrates how the different types of fellow IDSs are deployed in a cloud computing environment, and how they cooperate with each other and central coordinators in detecting intrusions. (HIDS: host-based IDS, NIDS: network-based IDS)
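Each IDS in the framework is described as carrying both signature- and anomaly-based detectors. The following is a minimal sketch of how such a hybrid check might be layered, not the authors' implementation: the signature strings, the z-score test, and the `z_limit` value are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical byte patterns standing in for a misuse-detection signature base.
SIGNATURES = [b"/etc/passwd", b"' OR '1'='1"]

def misuse_match(payload: bytes) -> bool:
    """Signature (misuse) detection: exact match against known patterns."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_score(value: float, history: list) -> float:
    """Distance of a measurement from its history, in standard deviations."""
    mu, sigma = mean(history), pstdev(history)
    return 0.0 if sigma == 0 else abs(value - mu) / sigma

def hybrid_detect(payload: bytes, packet_rate: float, rate_history: list,
                  z_limit: float = 3.0) -> str:
    """Try the cheap signature check first, then fall back to the anomaly test."""
    if misuse_match(payload):
        return "known-attack"
    if anomaly_score(packet_rate, rate_history) > z_limit:
        return "anomaly"
    return "benign"

history = [100, 102, 98, 101, 99]   # assumed normal packet rates
print(hybrid_detect(b"GET /index.html", 100, history))        # benign
print(hybrid_detect(b"GET /../../etc/passwd", 100, history))  # known-attack
print(hybrid_detect(b"GET /index.html", 500, history))        # anomaly
```

Running the signature check first keeps known attacks on the fast, high-accuracy path, while the anomaly test covers traffic that matches no signature.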

generic attacks that present abnormality within the network traffic and don't involve sophisticated cooperation. The network-based cooperative agents alert a central coordinator to any suspicious packets detected. Meanwhile, these agents summarize network traffic flowing through the network in a distributed and parallel manner. In network data summarization, the nonparametric Bayes could be a suitable machine learning approach for solving the challenges of cloud computing.12 Network summarization is particularly important for detecting cooperative intrusions, such as distributed denial-of-service (DDoS) attacks. These summarizations are periodically sent to a central coordinator, as we discuss next.

This parallel summarization is empowered by cloud computing through the MapReduce framework.13 The MapReduce framework provides seamless and effortless integration of our CIDS framework into a distributed and parallel architecture by treating the network-based cooperative agents as slave nodes and the central coordinator as a master node. The MapReduce framework manages all details, ranging from scheduling to information aggregation.

Central Coordinator

Finally, the network traffic aggregation is performed on the central coordinator, which generates a complete attack diagram of the entire network (that is, the cloud computing system). Based on this aggregation, the central coordinator is capable of capturing sophisticated cooperative intrusions that the individual network-based cooperative agents miss. When intrusive behaviors (including those identified by the cooperative agents and the central coordinator) are detected, the central coordinator raises an alert to a system administrator.

It's worth noting that a hybrid detector combining misuse-based and anomaly-based detection mechanisms can help reduce the time needed to detect and enhance the detection accuracy of known and unknown attacks.
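The division of labor between agents and coordinator can be sketched in the map/reduce style the text describes. This is a plain-Python illustration rather than code for an actual MapReduce runtime, and it swaps simple per-destination packet counts in for the nonparametric Bayesian summarization the article suggests; the agent names and packet tuples are hypothetical.

```python
from collections import Counter
from functools import reduce

def summarize(packets):
    """Map step, run at each network-based agent: count packets per destination."""
    return Counter(dst for _src, dst in packets)

def aggregate(summaries):
    """Reduce step, run at the central coordinator: merge per-agent summaries."""
    return reduce(lambda a, b: a + b, summaries, Counter())

# Hypothetical (source, destination) packet records seen by two agents.
agent_traffic = {
    "nids-1": [("10.0.0.5", "web"), ("10.0.0.6", "web"), ("10.0.0.7", "db")],
    "nids-2": [("10.0.1.5", "web"), ("10.0.1.9", "mail")],
}

summaries = [summarize(packets) for packets in agent_traffic.values()]
network_view = aggregate(summaries)
print(network_view.most_common(1))   # [('web', 3)]
```

Only the compact summaries travel to the coordinator, so each agent's cost scales with its own traffic while the coordinator still obtains a network-wide picture.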


Security Mechanisms

To ensure that the CIDS is resistant to compromise, we use authentication and encryption as well as an integrity check. Because the CIDS works 24/7, energy-efficient group key distribution schemes are preferable for secure key distribution and node authentication.14,15 These schemes provide a strong, secure mechanism for updating group keys when nodes join or leave the network or when a node is compromised. They're also resilient to collusion attacks, in which multiple nodes are compromised and coordinated for attack. Finally, a backup central coordinator runs alongside the main coordinator to prevent a single point of failure. The coordinators' roles can be exchanged depending on actual requirements and network conditions.

Future studies will explore the framework's implementation and application on different cloud computing systems. Our future studies will focus on algorithms for distributed and parallel data summarization on cloud computing, and their implementation on the MapReduce framework, as well as new detection approaches for HIDSs.

Acknowledgments

The work described here was performed when Zhiyuan Tan was a research associate with the School of Computing and Communications at the University of Technology, Sydney.

References

1. C. Modi et al., “A Survey on Security Issues and Solutions at Different Layers of Cloud Computing,” J. Supercomputing, vol. 63, no. 2, 2013, pp. 561–592.
2. J. Hu et al., “Seamless Integration of Dependability and Security Concepts in SOA: A Feedback Control System Based Framework and Taxonomy,” J. Network and Computer Applications, vol. 34, no. 4, 2011, pp. 1150–1159.
3. Y. Meng, W. Li, and L.-F. Kwok, “Towards Adaptive Character Frequency-Based Exclusive Signature Matching Scheme and Its Applications in Distributed Intrusion Detection,” Computer Networks, vol. 57, no. 17, 2013, pp. 3630–3640.
4. G. Creech and J. Hu, “A Semantic Approach to Host-Based Intrusion Detection Systems Using Contiguous and Discontiguous System Call Patterns,” IEEE Trans. Computers, vol. 63, no. 4, 2014, pp. 807–819.
5. Z. Tan et al., “A System for Denial-of-Service Attack Detection Based on Multivariate Correlation Analysis,” IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 2, 2014, pp. 447–456.
6. S. Savage, “Internet Outbreaks: Epidemiology and Defenses,” keynote address, Internet Soc. Symp. Network and Distributed System Security (NDSS 05), 2005; http://cseweb.ucsd.edu/~savage/papers/InternetOutbreak.NDSS05.pdf.
7. S. Ram, “Secure Cloud Computing Based on Mutual Intrusion Detection System,” Int’l J. Computer Application, vol. 2, no. 1, 2012, pp. 57–67.
8. S.N. Dhage and B. Meshram, “Intrusion Detection System in Cloud Computing Environment,” Int’l J. Cloud Computing, vol. 1, no. 2, 2012, pp. 261–282.
9. D. Hoplaros, Z. Tari, and I. Khalil, “Data Summarization for Network Traffic Monitoring,” J. Network and Computer Applications, vol. 37, Jan. 2014, pp. 194–205.
10. A. Patel et al., “An Intrusion Detection and Prevention System in Cloud Computing: A Systematic Review,” J. Network and Computer Applications, vol. 36, no. 1, 2013, pp. 25–41.
11. A.K. Jones and R.S. Sielken, Computer System Intrusion Detection: A Survey, tech. report, Dept. of Computer Science, Univ. of Virginia, 2000; http://atlas.cs.virginia.edu/~jones/IDS-research/Documents/jones-sielken-survey-v11.pdf.
12. N.L. Hjort et al., eds., Bayesian Nonparametrics, vol. 28, Cambridge Univ., 2010.
13. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
14. B. Tian et al., “A Mutual-Healing Key Distribution Scheme in Wireless Sensor Networks,” J. Network and Computer Applications, vol. 34, no. 1, 2011, pp. 80–88.
15. B. Tian et al., “Self-Healing Key Distribution Schemes for Wireless Networks: A Survey,” Computer J., vol. 54, no. 4, 2011, pp. 549–569.

ZHIYUAN TAN is a postdoctoral research fellow in the Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, Netherlands. His research interests include network security, pattern recognition, machine learning, and distributed systems. Tan received a PhD from the University of Technology Sydney (UTS), Australia. He’s an IEEE member. Contact him at [email protected].

UPASANA T. NAGAR is a PhD student in the School of Computing and Communications at the University of Technology, Sydney (UTS), Australia,


and a student member of the Research Centre for Innovation in IT Services and Applications (iNEXT) at UTS. Her research interests include network security, pattern recognition, and cloud computing. Nagar received a bachelor’s degree in electronics from the National Institute of Technology, Surat. Contact her at [email protected].

XIANGJIAN HE is a professor of computer science in the School of Computing and Communications at the University of Technology, Sydney (UTS). He’s also director of the Computer Vision and Recognition Laboratory, leader of the Network Security Research group, and a deputy director of the Research Centre for Innovation in IT Services and Applications (iNEXT) at UTS. His research interests include network security, image processing, pattern recognition, and computer vision. He received a PhD in computer science from the University of Technology Sydney (UTS), Australia. He’s an IEEE senior member. Contact him at [email protected].

PRIYADARSI NANDA is a senior lecturer in the School of Computing and Communications at the University of Technology, Sydney (UTS), Australia. He’s also a core research member at the Centre for Innovation in IT Services Applications (iNEXT) at UTS. His research interests include network security, network QoS, sensor networks, and wireless networks. Nanda received a PhD in computer science from the University of Technology Sydney (UTS), Australia. He’s an IEEE senior member. Contact him at Priyadarsi [email protected].

REN PING LIU is a principal scientist of networking technology at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and an adjunct professor at Macquarie University and the University of Technology, Sydney (UTS), Australia. His research interests include MAC protocol design, Markov analysis, quality-of-service scheduling, TCP/IP internetworking, and network security. Liu received a PhD in electrical and computer engineering from University of Newcastle, Australia. He’s an IEEE senior member. Contact him at [email protected].

SONG WANG is a senior lecturer with the Department of Electronic Engineering, La Trobe University, Melbourne, Australia. Her research interests include biometric security, blind system identification, and wireless communication. Wang received a PhD in electrical and electronic engineering from the University of Melbourne. Contact her at song.wang@latrobe.edu.au.

JIANKUN HU is a full professor and research director of the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, Australia. His research interests are in the field of cybersecurity including biometrics security. Hu received a PhD in control engineering from Harbin Institute of Technology, China. He’s an IEEE member. Contact him at [email protected].


Risk-Aware Virtual Resource Management for Multitenant Cloud Datacenters

Abdulrahman A. Almutairi and Arif Ghafoor, Purdue University

Efficient risk-aware virtual resource assignment mechanisms for the cloud's multitenant environment can help to minimize the risk of information leakage due to cloud virtual resource vulnerability.

Published by the IEEE Computer Society. 2325-6095/14/$31.00 © 2014 IEEE

The cloud computing platform-as-a-service (PaaS) paradigm allows application developers to deploy big data applications in the cloud. These applications can be found in the areas of healthcare, e-government, science, and business.1 PaaS cloud providers can host customer data stores on premise and outsource the computation to virtual resources from multiple infrastructure-as-a-service (IaaS) cloud providers. These virtual resources can be hosted by multitenant public cloud providers such as the Amazon Elastic Compute Cloud (EC2). The sheer size of big data poses serious security challenges for these applications. The backend data store can use an access control mechanism to isolate and enforce controlled data sharing.2 However, when the data is transferred from the backend data store to application logic, it can be leaked through virtual resource vulnerabilities. In a multitenant environment, untrusted tenants can exploit these vulnerabilities, increasing the data leakage risk.

This article focuses on virtual resource vulnerabilities that can cause data leakage, resulting in side-channel attacks and virtual machine (VM) escape.3,4 Proposed solutions to this problem, such as trusted virtual domain,5 secure hypervisor,6 and Chinese wall policies,7 offer secure virtual resource isolation among tenants. However, achieving this isolation lowers resource utilization.

We propose intelligent virtual resource allocation techniques that assign resources to a cloud customer's data-centric applications. These techniques have low complexity and minimize the imposed risk. We assume a role-based access control (RBAC)8 mechanism for multitenant datacenter protection. However, the approach is generic and can be applied to any security policy, including discretionary or mandatory access control.

Virtual Resource Vulnerability

Elsewhere, we proposed a distributed access control architecture featuring a virtual resource manager (VRM).9 The VRM allocates virtual resources to cloud customers based on an access control policy enforced by an access control module (ACM), as Figure 1a shows. In general, these resources are allocated to satisfy some service-level agreement (SLA) requirements for each cloud customer and to minimize the cost of provisioning for PaaS cloud providers. As Figure 1b shows, the VRM includes workload estimation, resource vulnerability estimation, and resource assignment components. The workload estimation component estimates the sharing of data among different roles of the RBAC policy. The VRM's resource vulnerability estimation component uses security analysis tools to estimate virtual resource vulnerability.10 Subsequently, this component can be used to characterize virtual resources' vulnerability to different security measurements (for example, highly secured or unsecured virtual resources). The resource assignment component uses the workload and vulnerability estimations to assign the virtual resource to cloud customers' applications with the goal of minimizing the total risk of data leakage.

The risk of data leakage depends on the access control policy and virtual resource vulnerabilities. ISO 27005 defines this risk as “the potential that a given threat will exploit vulnerability of an asset or group of assets and thereby cause harm to the organization.”11 Using this definition, we formulate the risk due to data leakage for an application in a datacenter as:

Risk = Assets × Vulnerability × Threat (1)

Here, we assume that a role's (the cloud customer's) assets are the number of data objects (such as tuples or files) it has access to and that are stored in PaaS. Vulnerability is the probability of data leakage as a result of virtual resource vulnerabilities. To capture the worst-case scenario for risk assessment, we assume that the threat is equal to 1 for all roles (cloud customers). In other words, because of resource vulnerabilities, each cloud customer poses a threat in terms of accessing other customers' data objects, and vice versa.

We propose a workload approximation model based on a given RBAC policy and characterization of a cloud datacenter. Using this model, we present a risk-aware assignment problem as well as assignment heuristics for virtual resource allocation. Because of page limitations, we present our proofs elsewhere.12

RBAC Policy Model for Access Control Modules

A datacenter RBAC policy defines permissions for roles to access data objects.8 We formally define this assignment as follows.

Definition 1: Given an RBAC policy $P$ for a big datacenter where $R$ is the set of roles and $O$ is the set of data objects, we can represent the permission-to-role assignment $PA$ as a directed bipartite graph $G(V, E)$, where $V = R \cup O$ such that $R \cap O = \emptyset$. The edges $e_{ij} \in E$ in $G$ represent the existence of role-to-permission assignment $(r_i \times o_j) \in PA$ in the RBAC policy $P$, where $r_i \in R$ and $o_j \in O$.

A role vertex's out-degree represents the role's cardinality, and a data object vertex's in-degree represents the degree of sharing of that object among roles. Figure 2a represents an RBAC policy with $|R| = 4$ and $|O| = 20$ as a bipartite graph model. As the figure shows, the cardinality of role $r_1$ is out-degree($r_1$) = 11. Also, the degree of sharing of data object $o_{20}$ is in-degree($o_{20}$) = 4.

The VRM's resource assignment component requires the cardinality of shared data objects among roles. For big datacenters, computing these cardinalities from the bipartite graph is a daunting task. We propose an alternative representation of RBAC by clustering all data objects that are accessed by the different roles into a set of nonoverlapping partitions. The set $W$, consisting of the cardinalities of these partitions, is the spectral model for the RBAC policy. We define this model as follows.

SEPTEMBER 2014 IEEE CLOUD COMPUTING 35


[Figure 1 diagram omitted. Labels in panel (a): user's resource request, PaaS, VRM, ACM, IaaS. Labels in panel (b): PaaS, ACM, policy base, big data store, VRM (resource assignment, workload estimation, resource vulnerability estimation), and IaaS 1 through IaaS m, each with VM 1 through VM K on a virtualization layer over a physical layer.]

FIGURE 1. The virtual resource management (VRM) architecture: (a) virtual resource design and (b) virtual resource management mechanism.
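As a rough illustration of Definition 1 and Equation 1, the following Python sketch counts role cardinality (out-degree) and object sharing (in-degree) from a list of permission pairs and then evaluates the risk product. The function names `degrees` and `risk` and the toy policy are assumptions made for this example; they do not come from the article.

```python
from collections import Counter

def degrees(pa):
    """Return (role out-degrees, object in-degrees) for permission pairs (role, obj)."""
    edges = set(pa)  # ignore duplicate assignments
    role_cardinality = Counter(role for role, _ in edges)
    object_sharing = Counter(obj for _, obj in edges)
    return role_cardinality, object_sharing

def risk(assets, vulnerability, threat=1.0):
    """Equation 1: Risk = Assets x Vulnerability x Threat (threat defaults to the worst case)."""
    if not 0.0 <= vulnerability <= 1.0:
        raise ValueError("vulnerability is a probability in [0, 1]")
    return assets * vulnerability * threat

# Hypothetical toy policy: two roles, three data objects.
toy_pa = [("r1", "o1"), ("r1", "o2"), ("r2", "o2"), ("r2", "o3")]
role_cardinality, object_sharing = degrees(toy_pa)
toy_risk = risk(assets=role_cardinality["r1"], vulnerability=0.2)
```

Here `role_cardinality["r1"]` plays the part of the role's assets, and the vulnerability value 0.2 is only a placeholder for a leakage probability produced by the vulnerability estimation component.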

Definition 2 (RBAC spectral model): Given a bipartite graph representation of RBAC policy G(V, E), let P(R) be the power set of R excluding the null set ∅. The spectral representation of RBAC is the set W, with its elements indexed by P(R) and lexicographically ordered. Formally, let $p \in P(R)$. Then, we define $w_p \in W$ as:

$$w_p = \left|\{\, o_i : o_i \in O,\ \forall r_k \in p,\ \exists\, e_{ki} \in E \,\}\right|$$

Here $w_p$ counts the data objects accessed by exactly the roles in p, which is what keeps the partitions nonoverlapping (compare Figure 2b). Note that $|W| = 2^n - 1$.

This model has two advantages over the bipartite graph model. First, we can use it to characterize an RBAC policy in terms of a datacenter's sensitivity using a single parameter. Based on the degree of sharing among roles, the datacenter sensitivity can be high, medium, or low, as elaborated later. In addition, the spectral model allows resource assignment based on a given percentage of the datacenter. Varying this percentage can lead to variable complexity of an assignment algorithm.

The set W can be generated from the bipartite graph model of RBAC. The members $w_p \in W$ are nonoverlapping and can be viewed as vertices of a lattice (that is, binary n-cube) with n levels, where n is the number of roles. For example, nodes at level 1 of the lattice represent the cardinalities of partitions corresponding to unshared data objects belonging to


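[Figure 2 diagram omitted. Panel (a) is a bipartite graph of the permission assignment PA between roles R = {r1, r2, r3, r4} and data objects O = {o1, ..., o20}. Panel (b) is the spectral lattice with the following cardinalities:

Level 1: w{1} = 3, w{2} = 6, w{3} = 2, w{4} = 0
Level 2: w{1,2} = 0, w{1,3} = 0, w{1,4} = 2, w{2,3} = 0, w{2,4} = 0, w{3,4} = 1
Level 3: w{1,2,3} = 2, w{1,2,4} = 2, w{1,3,4} = 0, w{2,3,4} = 0
Level 4: w{1,2,3,4} = 2]

FIGURE 2. RBAC policy representation: (a) example of RBAC permission assignment and (b) spectral lattice representation of RBAC.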
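The spectral model in Figure 2b can be derived from a permission assignment by grouping each data object under the exact set of roles that access it. The Python sketch below does this for a hypothetical object layout: only the partition cardinalities, the membership of o9 and o10 in partition {1, 4}, and the sharing of o20 by all four roles are taken from the article, while the remaining object-to-role mapping and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def spectral_model(pa):
    """Map each role set p to w_p: the number of objects accessed by exactly the roles in p."""
    roles_of = defaultdict(set)
    for role, obj in pa:
        roles_of[obj].add(role)
    return Counter(frozenset(roles) for roles in roles_of.values())

def average_sharing(w):
    """Average number of roles that access a data object."""
    return sum(len(p) * count for p, count in w.items()) / sum(w.values())

# Hypothetical object layout that matches the cardinalities in Figure 2b.
layout = [((1,), 3), ((3,), 2), ((1, 2, 3), 2), ((3, 4), 1), ((1, 4), 2),
          ((2,), 6), ((1, 2, 4), 2), ((1, 2, 3, 4), 2)]
pa, next_id = [], 1
for roles, count in layout:
    for _ in range(count):
        pa.extend((r, f"o{next_id}") for r in roles)
        next_id += 1

w = spectral_model(pa)
r1_cardinality = sum(count for p, count in w.items() if 1 in p)
```

Partitions with no objects are simply absent from the counter, so it holds fewer than $2^n - 1$ entries. Summing the counts over the partitions that contain role 1 recovers that role's cardinality, and `average_sharing(w)` gives the average degree of sharing the article uses to characterize datacenter sensitivity.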


individual roles. The nodes at level 2 represent the cardinalities of data partitions that are accessed by two roles. Similarly, the nodes at level n contain data objects shared by all roles. The nodes of W can be indexed using the role IDs associated with the partition, as Figure 2b shows. As mentioned above, the indices of W are elements of P(R), and its elements are the cardinalities of partitions that can be accessed by all the roles in these subsets. Note that the total size of the datacenter is given as:

$$\sum_{\forall w_p \in W} w_p$$

The following example illustrates the spectral model.

Example 1: Figure 2a shows an access control policy with |R| = 4 and |O| = 20 as a bipartite graph. The spectral model is shown as a lattice in Figure 2b. Notice that $w_{\{1,4\}} = |\{o_9, o_{10}\}| = 2$ because $o_9$ and $o_{10}$ are accessed by both the roles $r_1$ and $r_4$.

Datacenter Workload Estimation for RBAC Policy

For a big datacenter, specifying the exact value of $w_p$ is a challenge. One practical approach is to use cardinality estimation techniques [13]. For example, for a transactional workload, we can use the selectivity estimation of query processing for a big datacenter to estimate a given query's size [13], whereby that query (or a collection of queries) can correspond to a role. If we use role mining to design an RBAC policy, we can use role mining techniques such as multi-assignment clustering [14] to estimate the cardinalities of the set W. Here, we assume the access of data objects in a datacenter follows a Zipfian distribution, an assumption supported by the Yahoo Cloud Serving Benchmark (YCSB) [15]. Because in this distribution some objects are shared by a large number of roles (queries) while most are shared among a smaller number of roles (queries), it can provide a heterogeneous workload for RBAC. The Zipfian distribution is given as follows:

$$f(\alpha, s, N) = \frac{\alpha^{-s}}{\sum_{i=1}^{N} i^{-s}} \qquad (2)$$

where N is the maximum rank, α is the selected rank, and s is the parameter to control the distribution shape.

According to this distribution, if parameter s = 1, then the probability that a data object is assigned to a single role (which corresponds to setting the rank α = 1) is double the probability of assigning that data object to two roles, the case for which rank α = 2. As the value of s increases, the number of data objects assigned only to individual roles becomes larger, as Figure 3 shows. Note that the value of s should be greater than or equal to 1.

As Algorithm 1 (Figure 4) proposes, we can use the Zipfian distribution to generate a heterogeneous RBAC-based workload in two steps. In the first step, we classify data objects into n buckets, where each bucket represents the number of total data objects assigned to a lattice level in Figure 2b. For example, data objects in bucket 1 are exclusively accessed by one role, whereas all roles share data objects in bucket n. The number of data objects in each bucket follows a Zipfian distribution. In the second step, we assign data objects from bucket i to randomly selected partitions at level i of the lattice. Note that the number of partitions at level i is $\binom{n}{i}$.

Characterizing Datacenter Sensitivity Using a Spectral Model

Based on the statistical property of the access control workload, we propose a data sensitivity-based classification of cloud datacenters. The sensitivity classification depends on the level of data object sharing among roles. In particular, we define a datacenter's sensitivity as the average degree of sharing among its data objects. For example, in Figure 2, the datacenter average degree of sharing is (11 × 1 + 3 × 2 + 4 × 3 + 2 × 4)/20 = 1.85. If the degree of sharing on average is low, we say the datacenter has high sensitivity. On the contrary, if there is extensive sharing of data objects among roles, we say this datacenter has low sensitivity. The medium sensitivity class falls in the middle.

We can also model the data object sharing and datacenter classification using the Zipfian distribution. The key parameter to characterize datacenter sensitivity is the scalar parameter s of the Zipfian density function shown in Equation 2. As Figure 3 shows, the smaller the value of s, the more uniformly data objects are distributed in the set W of the RBAC spectral model. In the following example, we illustrate how we can use Zipfian parameters to classify datacenter sensitivity.

Example 2: For a datacenter with $0.5 \times 10^6$ data objects, suppose we have three RBAC policies ($P_1$, $P_2$, $P_3$), each with n = 30 roles. Figure 3 shows a histogram of objects across the spectral lattice. Depending on the Zipfian distribution, we can identify three classes of datacenters with respect to policies $P_1$, $P_2$, and $P_3$: high sensitive (HSD), medium sensitive (MSD), and low sensitive (LSD). For example, HSD has a large value of s (s ≥ 2) because the sharing of data objects among $P_1$ roles is very small. On the other hand, the LSD has a small value of s (1.5 > s ≥ 1), depicting extensive sharing of data objects among


[Figure 3 histogram omitted. Horizontal axis: spectrum lattice level (1 to 30). Vertical axis: number of data objects (0 to 350,000). Series: low sensitive (s = 1.1), medium sensitive (s = 1.5), and high sensitive (s = 2.0).]

FIGURE 3. A statistical characterization of sensitivity of cloud datacenters.
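To make the role of the shape parameter s concrete, the Python sketch below evaluates the Zipfian probability of Equation 2 and follows the two steps of the workload generation procedure (Algorithm 1, Figure 4) for a small number of roles. It is a minimal sketch under stated assumptions: the function names are hypothetical, the rank-to-partition mapping is taken to be a random permutation of each level's partitions, and the partitions are enumerated explicitly, which is only practical for small n.

```python
import random
from collections import Counter
from itertools import combinations

def zipf_pmf(rank, s, n_ranks):
    """Equation 2: probability of `rank` among ranks 1..n_ranks with shape s."""
    return rank ** -s / sum(i ** -s for i in range(1, n_ranks + 1))

def generate_workload(num_objects, n_roles, s, seed=0):
    """Sketch of Algorithm 1; enumerates partitions, so keep n_roles small."""
    rng = random.Random(seed)
    levels = range(1, n_roles + 1)
    # Step 1: draw a lattice level for every data object (the buckets B).
    level_weights = [level ** -s for level in levels]
    buckets = Counter(rng.choices(levels, weights=level_weights, k=num_objects))
    # Step 2: spread each bucket over the partitions of its level.
    w = Counter()
    for level in levels:
        partitions = list(combinations(range(1, n_roles + 1), level))
        rng.shuffle(partitions)  # assumption: rank-to-partition map is a random permutation
        rank_weights = [rank ** -s for rank in range(1, len(partitions) + 1)]
        picks = rng.choices(range(len(partitions)), weights=rank_weights, k=buckets[level])
        for index in picks:
            w[frozenset(partitions[index])] += 1
    return w

workload = generate_workload(num_objects=2000, n_roles=6, s=1.5)
```

With s = 1 the ratio `zipf_pmf(1, 1, N) / zipf_pmf(2, 1, N)` is 2, which is the doubling the text describes. For the 30-level lattice of Example 2, comparing `zipf_pmf(1, 2.0, 30)` with `zipf_pmf(1, 1.1, 30)` gives a feel for how much more of the datacenter sits at level 1 in the high sensitive case than in the low sensitive case.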

roles of policy $P_3$. The MSD has a value of s that falls in the middle (2 > s ≥ 1.5). Note that the number of data objects at level 1 in HSD is double the number of data objects at level 1 in LSD.

Input: Number of data objects |O|, number of roles n, constant s.
Output: Spectral representation of RBAC W.

Let B = {B_1, ..., B_n} be the bucket array;
foreach i = 1, ..., |O| do
    α = zipf(n, s);
    B_α = B_α + 1;
foreach i = 1, ..., n do
    foreach j = 1, ..., B_i do
        α = zipf($\binom{n}{i}$, s);
        map α to a random partition in level i, call it p̂;
        w_p̂ = w_p̂ + 1;
        add w_p̂ to W;
return W

FIGURE 4. Algorithm 1: Workload generation algorithm.

Heterogeneous Virtual Resource Vulnerability Characterization

In addition to workload characterization with respect to RBAC policy, the VRM also estimates the software vulnerability for a virtual resource (a VM in our case). The estimator uses VMs' security vulnerabilities to qualitatively classify them into multiple classes. The classification is based on the common vulnerability scoring system (CVSS) metric scores. CVSS uses an interval scale of 0–10 to measure vulnerability severity [16]. To represent the probability of data leakage, we convert the 0–10 scale to a 0–1 scale. Based on CVSS scores, we assume four discrete classes of VMs: highly secured, medium secured, low secured, and unsecured VMs. Although we select four classes to model the vulnerability with respect to heterogeneous virtual resources, our solution is generalizable to an arbitrary number of heterogeneous classes.

In addition to the probability of leakage within a virtual resource, the VRM needs to consider leakage across virtual resources (VMs) within the same IaaS. VRM estimates the vulnerability of each IaaS cloud provider. Different remote cloud providers can deploy different security configurations and virtualization software (such as a hypervisor) with varying levels of vulnerabilities. Similar to VM classification, the VRM estimator also classifies the vulnerabilities of remote cloud providers into multiple classes. The vulnerability measurements within VMs are independent from the vulnerability measurements across VMs. For


example, a highly secure cloud provider can host unsecure VMs. We assume that the probability of leakage between two VMs belonging to different remote clouds is negligible. Subsequently, we convert these qualitative measures into probability of data leakage. We assume that the probability of leakage within any VM is higher than the probability of leakage across any two VMs. This is because the size of the trusted code base across VMs is generally smaller than commercial operating systems used in a given VM. The trusted code base represents the software stack shared in the multitenant environment, whereas within a VM, the shared software stack (for example, the operating system and middleware) is larger than shared software across VMs (for example, the hypervisor).

Definition 3 (the cloud virtual resource model): Given $VM_1, VM_2, \ldots, VM_m$ as the suite of m virtual machines available to a VRM, let $d_{i,j}$ represent the probability of data leakage between $VM_i$ and $VM_j$ estimated using the vulnerability estimation component, as Figure 1 shows. Accordingly, the cloud virtual resource can be modeled as a fully connected undirected graph H(V, E), where vertices V represent the set of VMs and the weights on the edges represent $d_{i,j}$.

Risk-Aware Resource Assignment

Based on the spectral model of a given RBAC policy and the aforementioned virtual resource vulnerability model, we formally define the risk-aware assignment problem (RAP).

Definition 4: Given the spectral representation W of the RBAC policy and the adjacency matrix (D) representing the probabilities of leakage among virtual resources, based on H(V, E), the RAP is to minimize the total risk of data leakage by assigning access control roles to the virtual resource.

The cost function of total risk is defined as

$$\min R_t = \min \sum_{w_p \in W} \sum_{i=1}^{n} w_p \times \mathrm{Threat}(p, r_i) \times \max_{\substack{1 \le l, q \le m \\ \forall j \in p}} \{\, d_{l,q} \times I_{iq} \times I_{jl} \,\} \qquad (3)$$

where

$$p \in P(R)$$

$$\sum_{q=1}^{m} I_{iq} = 1 \quad \forall i \in R$$

$$I_{iq} = \begin{cases} 1 & \text{if role } i \text{ is assigned to } VM_q \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Threat}(p, r_i) = \begin{cases} 1 & \text{if } i \notin p \\ 0 & \text{otherwise} \end{cases}$$

Theorem 1: The RAP problem is NP-complete [12].

Assignment Heuristics

We propose two heuristics for solving RAP, which the resource assignment component in VRM (Figure 1b) can deploy. The first, the sharing-based heuristic (SBH), uses a best-fit strategy. In SBH, each role is assigned to the best available VM, in terms of probability of leakage, such that any increase in the total risk is kept to a minimum. SBH has high complexity because it finds the local optimal assignment at each step. Alternatively, we propose a low-complexity scalable heuristic, the partition-based heuristic (PBH), that uses a top-down clustering-based approach. In each step, this heuristic divides the roles based on the highest risk partition.

Because $|W| = 2^n - 1$, to reduce the complexity of SBH and PBH, we propose an approximation strategy for workload characterization. The strategy is based on considering a smaller percentage of the datacenter's total size. Let such a percentage be denoted as D. In particular, D identifies the cutoff level k in the lattice of P(R), which SBH and PBH can use. We can define such a cutoff as $k = \min\{\, k : 0 \le k \le n,\ H_{k,s} \ge (D/100) \times H_{n,s} \,\}$ [12], where $H_{k,s}$ is the kth generalized harmonic number. Accordingly, for a given value D of a datacenter, the spectral vector W needs to be truncated using cutoff level k. The truncated lexicographically ordered set, denoted $W'$, consists of all the partitions of W starting from level 1 up to and including the cutoff level k in the lattice of Figure 2b. In other words, $W' = \{\, w_p \mid (w_p \in W) \wedge |p| \le k \,\}$, with $|W'| \le (n + 1)^k$ [12].

Note that different datacenter sensitivity classes yield different cutoff levels for the same percentage D. Accordingly, the size of $W'$ varies. The following example illustrates how D and the sensitivity classes can affect the value of k.

Example 3: For W of Example 2, when D = 70 percent, the cutoff (k) is 2 for HSD and 8 for LSD. For D = 95 percent, the cutoff is 18 for MSD, 9 for HSD, and 24 for LSD.

Sharing-Based Heuristic: Best-Fit Approach

Following a best-fit approach, SBH initially selects the role with the most data objects. It assigns this role to the VM with the least probability of leakage. Next, it selects the role that has the highest data sharing with the previously assigned roles and allocates the role to a VM such that any increment in the total risk is kept to a minimum. This step is repeated until all roles are assigned. Notice that the first m role assignments are made to distinct VMs because the probability of leakage across VMs is always less than the probability of leakage within a VM. Therefore, the performance of SBH depends on the initial m role assignments. The remaining n – m


roles are co-allocated with the already assigned roles such that we keep the total leakage increase low.

Algorithm 2 (Figure 5) formally represents SBH. In Lines 1–6, SBH assigns an initial role to a VM, storing assigned roles in list A and keeping unassigned roles in list F. Cost matrix C stores the cost of assigning each unassigned role to each VM. In Lines 7–13, each iteration of the outer loop assigns a role from list F with the minimum value in C to a VM and updates the assignment matrix I, which includes assignment of all partitions so far. The inner loop updates C by removing the last assigned role's entry and updating the entry of C resulting from the new assignment. At the end of its execution, the algorithm returns the final assignment matrix I.

Lemma 1: The complexity of SBH is $O(n^2 \times m \times |W'|)$ [12].

Input: A spectral representation of RBAC W′, vulnerability matrix D.
Output: An assignment matrix I of roles to VMs.

1. Find initial VM vm_j with minimum d_jj;
2. Find initial role r_i with largest attackability;
3. I(r_i, vm_j) = 1;
4. Let A = {r_i} be the set of assigned roles;
5. Let F = R – {r_i} be the set of free roles;
6. Let C be the cost matrix with element C_i,j representing the risk of assigning r_i to vm_j;
7. foreach r_i ∈ F do
8.     foreach j = 1, ..., m do
9.         Compute C_i,j;
10.    Let C_k,l be minimum C_i,j;
11.    I(r_k, vm_l) = 1;
12.    A = A ∪ {r_k};
13.    F = F – {r_k};
14. return I

FIGURE 5. Algorithm 2: Sharing-based heuristic (SBH).

Partition-Based Heuristic: A Scalable Approach

PBH uses a scalable top-down clustering approach. Initially, we assume that all roles are in one cluster. We then find the highest attackable partition in $W'$. A partition's attackability is defined as the size of the partition multiplied by the number of threats for that partition. The number of threats for a partition equals the total number of roles minus the partition's level in the lattice of P(R). Note that as the size and the number of threats increase, the partition becomes more attackable. We begin with division of the root cluster and split it into two clusters. The first cluster contains the roles associated with highly attackable partitions. The remaining roles are stored in the second cluster. By splitting the roles into two clusters based on the highly attackable partition, we eliminate the possibility of co-allocating the roles associated with highly attackable partitions with other roles that pose threats to them. We repeat the last step until the number of clusters equals the number of VMs. Subsequently, we assign the cluster with the highest policy risk to the least vulnerable VM. In essence, this greedy approach of dividing the roles based on the highest attackable partition favors the m top attackable partitions over the others.

Algorithm 3 (Figure 6) formally represents PBH. In Lines 1–2, PBH sorts and saves $W'$ in temporary set P. The initial cluster list C has only one cluster, which is the set of all roles. In Lines 4–11, the iteration loops over all the partitions in P and divides a cluster into two new clusters if a partition intersects with any cluster in C. The loop in Lines 4–11 continues until the number of clusters equals the number of VMs or until each cluster has only one role. Lines 12–13 sort the VM indexes in list L and compute the policy-based risk for each cluster. The policy-based risk is computed according to Equation 1 by setting vulnerability and threat parameters to 1. Line 14 sorts the clusters and Lines 15–17 assign the roles of a high risk cluster to a VM with low probability of leakage. The final assignment is returned by the algorithm through matrix I.

Input: A spectral representation of RBAC W′, vulnerability matrix D.
Output: An assignment matrix I of roles to VMs.

1. Sort W′ starting with highest attackable w_i to the smallest attackable w_j;
2. Let P hold the indices of sorted W′;
3. Let C = {{r_1, r_2, ..., r_n}} be the initial cluster;
4. foreach p_i ∈ P do
5.     foreach c_j ∈ C do
6.         if p_i ∩ c_j ≠ ∅ then
7.             C = C – c_j;
8.             C = C ∪ (p_i ∩ c_j);
9.             C = C ∪ (c_j – p_i);
10.    if |C| ≥ m then
11.        break;
12. Let L = {l_1, ..., l_m} be the sorted VMs based on d_ii from smallest to largest, where i ∈ {1, ..., m};
13. Compute the intra risk for each c_i ∈ C;
14. Let Ĉ be the sorted C based on cluster risk;
15. foreach i ∈ {1, ..., m} do
16.     foreach r_k ∈ c_i do
17.         I(r_k, vm_l_i) = 1;
18. return I;

FIGURE 6. Algorithm 3: Partition-based heuristic (PBH).

Lemma 2: The complexity of PBH is $O(|W'| \times \log|W'| + n \times |W'|)$ [12].

Note that because of its subquadratic complexity in terms of number of roles, PBH is scalable.
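The partition-based heuristic can be sketched in Python as follows. This is one possible reading of Algorithm 3, not the authors' implementation: the function names are hypothetical, splits that would leave an empty cluster are skipped, splitting stops as soon as m clusters exist, and `cluster_risk` is an assumed interpretation of the intra-cluster risk (objects of every partition that touches the cluster, counted once per co-located role without access, with vulnerability and threat set to 1).

```python
def attackability(p, w_p, n_roles):
    """Partition size times the number of roles outside the partition."""
    return w_p * (n_roles - len(p))

def split_clusters(ordered_partitions, roles, m):
    """Top-down split of the role set until m clusters exist (or no split applies)."""
    clusters = [set(roles)]
    for p in ordered_partitions:
        i = 0
        while i < len(clusters) and len(clusters) < m:
            inside, outside = clusters[i] & p, clusters[i] - p
            if inside and outside:          # only split when both halves are non-empty
                clusters[i:i + 1] = [inside, outside]
                i += 2
            else:
                i += 1
        if len(clusters) >= m:
            break
    return clusters

def cluster_risk(cluster, w):
    """Assumed intra-cluster risk with vulnerability = threat = 1."""
    return sum(w_p * len(cluster - p) for p, w_p in w.items() if cluster & p)

def pbh(w, roles, vm_leak):
    """Return {role: vm_index}; vm_leak[q] is the within-VM leakage probability d_qq."""
    n, m = len(roles), len(vm_leak)
    ordered = sorted(w, key=lambda p: (-attackability(p, w[p], n), sorted(p)))
    clusters = split_clusters(ordered, roles, m)
    clusters.sort(key=lambda c: -cluster_risk(c, w))   # riskiest cluster first
    vms = sorted(range(m), key=lambda q: vm_leak[q])    # least vulnerable VM first
    return {role: vm for cluster, vm in zip(clusters, vms) for role in cluster}

# Spectral model of Figure 2b (zero-size partitions omitted).
W = {frozenset(p): c for p, c in [((1,), 3), ((2,), 6), ((3,), 2), ((1, 4), 2),
     ((3, 4), 1), ((1, 2, 3), 2), ((1, 2, 4), 2), ((1, 2, 3, 4), 2)]}
assignment = pbh(W, roles={1, 2, 3, 4}, vm_leak=[0.6, 0.2])
```

In this toy run with two VMs, the most attackable partition is {2} (six objects, three outside roles), so role 2 is separated from roles 1, 3, and 4. The larger cluster carries the higher assumed risk and is placed on the VM with the lower leakage probability.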


Table 1. Total risk (Rt) for the sharing-based and partition-based heuristics (SBH and PBH).

| No. of roles (n) | No. of VMs (m) | Percentage of the datacenter (D) | Low sensitive, SBH | Low sensitive, PBH | Medium sensitive, SBH | Medium sensitive, PBH | High sensitive, SBH | High sensitive, PBH |
|---|---|---|---|---|---|---|---|---|
| 120 | 40 | 70 | 3,360,231 | 3,606,591 | 2,445,374 | 2,452,659 | 1,518,554 | 2,139,564 |
| 120 | 40 | 80 | 3,291,886 | 3,522,308 | 2,244,157 | 2,401,973 | 1,465,960 | 1,512,723 |
| 120 | 40 | 90 | 3,215,064 | 3,470,285 | 2,154,764 | 2,309,507 | 1,231,830 | 1,405,526 |
| 120 | 40 | 95 | 3,344,730 | 3,470,285 | 2,217,724 | 2,302,854 | 1,224,026 | 1,406,141 |
| 150 | 50 | 70 | 4,121,669 | 4,225,904 | 2,932,017 | 3,263,288 | 1,767,732 | 2,658,996 |
| 150 | 50 | 80 | 4,078,425 | 4,196,071 | 2,785,004 | 2,828,237 | 1,756,312 | 2,234,971 |
| 150 | 50 | 90 | 4,068,345 | 4,196,071 | 2,663,637 | 2,782,716 | 1,558,556 | 1,761,015 |
| 150 | 50 | 95 | 4,038,736 | 4,201,984 | 2,647,521 | 2,782,716 | 1,468,684 | 1,692,846 |
| 200 | 70 | 70 | 5,685,296 | 5,402,101 | 3,865,387 | 4,362,990 | 2,228,901 | 3,627,837 |
| 200 | 70 | 80 | 5,562,226 | 5,352,938 | 3,686,288 | 3,613,758 | 1,973,325 | 2,837,346 |
| 200 | 70 | 90 | 5,580,674 | 5,351,282 | 3,356,819 | 3,446,952 | 1,692,089 | 2,355,639 |
| 200 | 70 | 95 | 5,475,651 | 5,350,779 | 3,355,768 | 3,446,952 | 1,754,023 | 1,970,093 |
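The percentage D in Table 1 corresponds to a cutoff level k in the spectral lattice, defined through generalized harmonic numbers. The Python sketch below computes that cutoff and truncates a spectral model to $W'$. The function names are hypothetical, and the shape parameters (s = 2.0, 1.5, and 1.1) are assumed to be the ones shown in the Figure 3 legend for the high, medium, and low sensitive datacenters.

```python
def cutoff_level(n_roles, s, percent):
    """Smallest k with H_{k,s} >= (percent / 100) * H_{n,s}."""
    target = percent / 100 * sum(i ** -s for i in range(1, n_roles + 1))
    if target <= 0:
        return 0
    partial = 0.0
    for k in range(1, n_roles + 1):
        partial += k ** -s
        if partial >= target:
            return k
    return n_roles  # guards against floating-point round-off at percent = 100

def truncate(w, k):
    """W': keep only partitions at lattice levels 1..k."""
    return {p: count for p, count in w.items() if len(p) <= k}

# Shape parameters taken from the Figure 3 legend (an assumption of this sketch).
S_HIGH, S_MEDIUM, S_LOW = 2.0, 1.5, 1.1
cutoffs_70 = {name: cutoff_level(30, s, 70)
              for name, s in [("HSD", S_HIGH), ("MSD", S_MEDIUM), ("LSD", S_LOW)]}
```

With n = 30 and these shape parameters, the sketch can be checked against the cutoffs quoted in Example 3, for instance `cutoffs_70["HSD"]` and `cutoff_level(30, S_LOW, 95)`. A small k for the high sensitive case means that `truncate` keeps only the first few lattice levels, which is what makes the heuristics cheaper to run.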

Performance Evaluation tions of RBAC and data sensitivity classifications. We compare the performance of SBH and PBH Our workload generation algorithm uses the same using two metrics. The data leakage risk is the main statistical model used in YCSB,15 which bases the metric that needs to be minimized. The risk metric selectivity of data objects on the Zipfian distribu- has two submetrics that we use to evaluate the tion. We determine datacenter sensitivity using the proposed heuristics. We compute the first risk metric high, medium, and low settings. We consider dif- from the total workload representing all partitions ferent data percentages (D) of a datacenter, varying in W. This risk metric represents the total risk and from 70 to 95 percent of the datacenter’s total size.

is denoted as Rt. In other words, Rt represents the We simulate RBAC policies with 120, 150, and 200 potential risk for the whole datacenter. We base the roles. To manage simulation time, we assume a data- second submetric of risk on the partition of W′ and center of 500,000 data objects. However, we can ap- use it to study the performance of the heuristics and ply the proposed heuristics to any size datacenter. compare their effectiveness. This submetric—partial Furthermore, in our experiment, we implement

risk (denoted Rp)—corresponds to the risk resulting four classes of VM vulnerability: highly secured ′ from the workload approximation W . Formally, we (with low probability of leakage di,i = 0.2), medium write the risk metrics as follows: secured (di,i = 0.45), low secured (di,i = 0.6), and unsecured (with very high probability of leakage = 10 RRiskwtp∑ () (4) di,i = 0.8). Based on an earlier survey, we assume wWp∈ that 1 percent of VMs are unsecured, 22 percent are highly secured, 44 percent are medium secured, and

RRiskwpp= ∑ () (5) 33 percent are low secured. Accordingly, we classify wWp∈ ' IaaS security into three categories: Note that the difference between total risk and partial risk represents the relative risk error intro- • highly secure IaaS cloud providers, where we duced by the heuristic. This error is a result of the assume the probability of data leakage among

heuristic’s lack of knowledge of all the partitions due VMs is extremely small (di,j = 0.001); to workload approximation. We define this error as • moderately secure IaaS cloud providers, where d = 0.045; and ⎛ ⎞ i,j ()RRtp− ⎜ ⎟ = =−⎜ Rt ⎟ • the least secure IaaS cloud providers, where di,j E ⎜ 1⎟ Rp ⎝ Rp ⎠ = 0.1.

We evaluate our heuristics and study their perfor- Table 1 shows the total risk Rt resulting from mance for different statistical workload approxima- SBH and PBH assignment for various number of


[Figure 7 plots omitted. All panels plot against the percentage of datacenter (70, 80, 90, 95) for |T| = 500,000, with one series per sensitivity class (high, medium, low). Panels (a) and (b) show risk for SBH and PBH and are labeled n = 150, m = 70 and n = 200, m = 70. Panels (c) and (d) show relative error for n = 120, m = 40; panels (e) and (f) for n = 150, m = 50; and panels (g) and (h) for n = 200, m = 70.]

FIGURE 7. SBH and PBH performance (risk and relative error): (a) Rp for SBH and PBH for n = 150, (b) Rp for SBH and PBH for n = 200, (c) SBH relative error, (d) PBH relative error, (e) SBH relative error, (f) PBH relative error, (g) SBH relative error, and (h) PBH relative error.
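The quantities plotted in Figure 7 can be computed for any fixed role-to-VM assignment. The Python sketch below evaluates the cost function of Equation 3 under one reading of its max term (for a partition p and an outside role i, take the largest leakage probability between the VM hosting i and any VM hosting a role in p), and then derives $R_t$, $R_p$, and the relative error E. The function names, the two-VM leakage matrix, and the assignment are assumptions for illustration; the leakage values are borrowed from the classes listed in the evaluation setup.

```python
def partition_risk(p, w_p, assignment, d):
    """One partition's term of Equation 3 under a fixed role-to-VM assignment."""
    host_vms = {assignment[j] for j in p}      # VMs hosting roles that can access p
    risk = 0.0
    for role, vm in assignment.items():
        if role in p:                           # Threat(p, r_i) = 0 for roles in p
            continue
        risk += w_p * max(d[l][vm] for l in host_vms)
    return risk

def total_risk(w, assignment, d):
    """Sum of partition risks over a spectral model (Equation 4 or 5)."""
    return sum(partition_risk(p, w_p, assignment, d) for p, w_p in w.items())

def relative_error(r_t, r_p):
    """E = (R_t - R_p) / R_p."""
    return (r_t - r_p) / r_p

# Spectral model of Figure 2b (zero-size partitions omitted).
W = {frozenset(p): c for p, c in [((1,), 3), ((2,), 6), ((3,), 2), ((1, 4), 2),
     ((3, 4), 1), ((1, 2, 3), 2), ((1, 2, 4), 2), ((1, 2, 3, 4), 2)]}
# Hypothetical setup: VM 0 is low secured, VM 1 is highly secured, same highly secure IaaS.
d = [[0.6, 0.001], [0.001, 0.2]]
assignment = {1: 1, 2: 0, 3: 1, 4: 1}
W_prime = {p: c for p, c in W.items() if len(p) <= 1}   # cutoff level k = 1

R_t = total_risk(W, assignment, d)
R_p = total_risk(W_prime, assignment, d)
E = relative_error(R_t, R_p)
```

Because $W'$ drops every partition above the cutoff level, $R_p$ can never exceed $R_t$ for the same assignment, and E shrinks toward 0 as the cutoff approaches n.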

roles (n) and number of VMs (m), and various sensitivity and data percentage settings. The $R_t$ of HSD is higher than the total risk for MSD and LSD. This is because most data objects in HSD are highly attackable, which increases the number of potential threats compared to other types of datacenters. In addition, $R_t$ decreases as the value of D increases from 70 to 90 percent. This decrease in $R_t$ occurs because for large values of D, VRM has more knowledge about W. In other words, as $(W - W')$ gets smaller, the uncertainty in assignment decision decreases. However, in some cases, $R_t$ increases when the data percentage increases from 90 to 95 percent (for example, the low sensitivity column for SBH and PBH). The reason is that both heuristics try to minimize the partial risk $R_p$ associated with $W'$, and in some cases such minimization can result in an increase of $R_t$. In addition, we notice that varying n and m doesn't change the heuristics' behavior. However, as per Equation 3, the total risk $R_t$ increases with n.

Figures 7a and 7b give the performance of SBH and PBH in terms of partial risk $R_p$ for a datacenter's different sensitivity levels (high, medium, low) while varying the percentage D of the datacenter's overall size. As the figures show, SBH outperforms PBH slightly for n = 150 and n = 200 but at a higher


computation complexity. Figures 7c–7h compare the performance of SBH and PBH in terms of their relative error E. The general behavior is that E decreases as D increases because the cutoff level k increases proportionally with D, resulting in a reduction of E. Further, E is higher for HSD than for LSD. This is because the cutoff level k for HSD is small, so the partitions that aren't considered in the assignment decision are highly attackable. Consequently, these partitions impose higher risk than in LSD, and both the total risk and E increase. For PBH, E is high for HSD when D = 70 percent. This is because few of the partitions are highly attackable (that is, those at a low level of the spectral model). Accordingly, PBH might not be an attractive choice for HSD when considering low values of D. However, as Table 1 shows, the relative error for PBH is within 10 percent of the error produced by SBH for both LSD and MSD cases. As a scalable algorithm, PBH offers a viable choice for these cases.

With the growing use of software as a service (SaaS) for cloud datacenters, the complex interplay of software and virtual machines exacerbates the security challenges addressed in this article. Our future work will consider the impact of services on the risk of data leakage resulting from joint vulnerabilities of services and virtual machines.

Acknowledgments

This work was supported by US National Science Foundation grant IIS-0964639.

References

1. G.-H. Kim, S. Trimi, and J.-H. Chung, "Big-Data Applications in the Government Sector," Comm. ACM, vol. 57, no. 3, 2014, pp. 78–85.
2. J. Alcaraz Calero et al., "Toward a Multi-Tenancy Authorization System for Cloud Services," IEEE Security & Privacy, vol. 8, no. 6, 2010, pp. 48–55.
3. T. Ristenpart et al., "Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds," Proc. 16th ACM Conf. Computer and Comm. Security, 2009, pp. 199–212.
4. M. Pearce, S. Zeadally, and R. Hunt, "Virtualization: Issues, Security Threats, and Solutions," ACM Computing Surveys (CSUR), vol. 45, no. 2, 2013, p. 17.
5. L. Catuogno et al., "Trusted Virtual Domains–Design, Implementation and Lessons Learned," Trusted Systems, Springer, 2010, pp. 156–179.
6. U. Steinberg and B. Kauer, "Nova: A Microhypervisor-Based Secure Virtualization Architecture," Proc. 5th European Conf. Computer Systems, 2010, pp. 209–222.
7. S. Berger et al., "Security for the Cloud Infrastructure: Trusted Virtual Data Center Implementation," IBM J. Research and Development, vol. 53, no. 4, 2009, pp. 560–1.
8. D.F. Ferraiolo et al., "Proposed NIST Standard for Role-Based Access Control," ACM Trans. Information and System Security (TISSEC), vol. 4, no. 3, 2001, pp. 224–274.
9. A. Almutairi et al., "A Distributed Access Control Architecture for Cloud Computing," IEEE Software, vol. 29, no. 2, 2012, pp. 36–44.
10. M. Balduzzi et al., "A Security Analysis of Amazon's Elastic Compute Cloud Service," Proc. 27th Ann. ACM Symp. Applied Computing, 2012, pp. 1427–1434.
11. ISO/IEC Std. 27005, Information Security Risk Management, ISO, 2011; https://www.iso.org/obp/ui/#iso:std:iso-iec:27005:edu-2:v1:en.
12. A. Almutairi and A. Ghafoor, Risk-Aware Virtual Resource Management for Access Control-Based Cloud Datacenters, CERIAS tech. report, Purdue Univ., 2014.
13. H. Zhang, I.F. Ilyas, and K. Salem, "Psalm: Cardinality Estimation in the Presence of Fine-Grained Access Controls," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE 09), 2009, pp. 505–516.
14. M. Frank et al., "Multi-Assignment Clustering for Boolean Data," J. Machine Learning Research, vol. 13, no. 1, 2012, pp. 459–489.
15. B.F. Cooper et al., "Benchmarking Cloud Serving Systems with YCSB," Proc. 1st ACM Symp. Cloud Computing, 2010, pp. 143–154.
16. P. Mell, K. Scarfone, and S. Romanosky, "A Complete Guide to the Common Vulnerability Scoring System Version 2.0," FIRST-Forum of Incident Response and Security Teams, 2007, pp. 1–23.

ABDULRAHMAN ALMUTAIRI is a PhD student in the School of Electrical and Computer Engineering at Purdue University. His research interests include information security and privacy and cloud computing systems. Almutairi received an MS in electrical and computer engineering from Purdue. He is a student member of IEEE. Contact him at [email protected].

ARIF GHAFOOR is a professor in the School of Electrical and Computer Engineering at Purdue University. His research interests include information security and distributed multimedia systems. Ghafoor received a PhD in electrical engineering from Columbia University. He is a fellow of IEEE. Contact him at [email protected].

44 IEEE CLOUD COMPUTING WWW.COMPUTER.ORG/CLOUDCOMPUTING


PURPOSE: The IEEE Computer Society is the world’s largest association of computing professionals and is the leading provider of technical information in the field.
MEMBERSHIP: Members receive the monthly magazine Computer, discounts, and opportunities to serve (all activities are led by volunteer members). Membership is open to all IEEE members, affiliate society members, and others interested in the computer field.
COMPUTER SOCIETY WEBSITE: www.computer.org
OMBUDSMAN: To check membership status or report a change of address, call the IEEE Member Services toll-free number, +1 800 678 4333 (US) or +1 732 981 0060 (international). Direct all other Computer Society-related questions—magazine delivery or unresolved complaints—to [email protected].
CHAPTERS: Regular and student chapters worldwide provide the opportunity to interact with colleagues, hear technical experts, and serve the local professional community.
AVAILABLE INFORMATION: To obtain more information on any of the following, contact Customer Service at +1 714 821 8380 or +1 800 272 6657:
• Membership applications
• Publications catalog
• Draft standards and order forms
• Technical committee list
• Technical committee application
• Chapter start-up procedures
• Student scholarship information
• Volunteer leaders/staff directory
• IEEE senior member grade application (requires 10 years practice and significant performance in five of those 10)

PUBLICATIONS AND ACTIVITIES
Computer: The flagship publication of the IEEE Computer Society, Computer, publishes peer-reviewed technical content that covers all aspects of computer science, computer engineering, technology, and applications.
Periodicals: The society publishes 13 magazines, 16 transactions, and one letters. Refer to membership application or request information as noted above.
Conference Proceedings & Books: Conference Publishing Services publishes more than 175 titles every year.
Standards Working Groups: More than 150 groups produce IEEE standards used throughout the world.
Technical Committees: TCs provide professional interaction in more than 45 technical areas and directly influence computer engineering conferences and publications.
Conferences/Education: The society holds about 200 conferences each year and sponsors many educational activities, including computing science accreditation.
Certifications: The society offers two software developer credentials. For more information, visit www.computer.org/certification.

NEXT BOARD MEETING
26–30 January 2015, Long Beach, CA, USA

EXECUTIVE COMMITTEE
President: Dejan S. Milojicic
President-Elect: Thomas M. Conte
Past President: David Alan Grier
Secretary: David S. Ebert
Treasurer: Charlene (“Chuck”) J. Walrad
VP, Educational Activities: Phillip Laplante
VP, Member & Geographic Activities: Elizabeth L. Burd
VP, Publications: Jean-Luc Gaudiot
VP, Professional Activities: Donald F. Shafer
VP, Standards Activities: James W. Moore
VP, Technical & Conference Activities: Cecilia Metra
2014 IEEE Director & Delegate Division VIII: Roger U. Fujii
2014 IEEE Director & Delegate Division V: Susan K. (Kathy) Land
2014 IEEE Director-Elect & Delegate Division VIII: John W. Walz

BOARD OF GOVERNORS
Term Expiring 2014: Jose Ignacio Castillo Velazquez, David S. Ebert, Hakan Erdogmus, Gargi Keeni, Fabrizio Lombardi, Hironori Kasahara, Arnold N. Pears
Term Expiring 2015: Ann DeMarle, Cecilia Metra, Nita Patel, Diomidis Spinellis, Phillip Laplante, Jean-Luc Gaudiot, Stefano Zanero
Term Expiring 2016: David A. Bader, Pierre Bourque, Dennis Frailey, Jill I. Gostin, Atsuhiro Goto, Rob Reilly, Christina M. Schober

EXECUTIVE STAFF
Executive Director: Angela R. Burgess
Associate Executive Director & Director, Governance: Anne Marie Kelly
Director, Finance & Accounting: John Miller
Director, Information Technology & Services: Ray Kahn
Director, Membership Development: Eric Berkowitz
Director, Products & Services: Evan Butterfield
Director, Sales & Marketing: Chris Jensen

COMPUTER SOCIETY OFFICES
Washington, D.C.: 2001 L St., Ste. 700, Washington, D.C. 20036-4928
Phone: +1 202 371 0101 • Fax: +1 202 728 9614
Email: [email protected]
Los Alamitos: 10662 Los Vaqueros Circle, Los Alamitos, CA 90720
Phone: +1 714 821 8380
Email: [email protected]
MEMBERSHIP & PUBLICATION ORDERS
Phone: +1 800 272 6657 • Fax: +1 714 821 4641 • Email: [email protected]
Asia/Pacific: Watanabe Building, 1-4-2 Minami-Aoyama, Minato-ku, Tokyo 107-0062, Japan
Phone: +81 3 3408 3118 • Fax: +81 3 3408 3553
Email: [email protected]

IEEE BOARD OF DIRECTORS
President: J. Roberto de Marca
President-Elect: Howard E. Michel
Past President: Peter W. Staecker
Secretary: Marko Delimar
Treasurer: John T. Barr
Director & President, IEEE-USA: Gary L. Blank
Director & President, Standards Association: Karen Bartleson
Director & VP, Educational Activities: Saurabh Sinha
Director & VP, Membership and Geographic Activities: Ralph M. Ford
Director & VP, Publication Services and Products: Gianluca Setti
Director & VP, Technical Activities: Jacek M. Zurada
Director & Delegate Division V: Susan K. (Kathy) Land
Director & Delegate Division VIII: Roger U. Fujii

revised 6 Nov. 2014


SECURE BIG DATA IN THE CLOUD

Efficient and Secure Transfer, Synchronization, and Sharing of Big Data

Kyle Chard, Steven Tuecke, and Ian Foster, University of Chicago and Argonne National Laboratory

Globus supports standard data interfaces and common security models for securely accessing, transferring, synchronizing, and sharing large quantities of data.

Cloud computing’s unprecedented adoption by commercial and scientific communities is due in part to its elastic computing capability, pay-as-you-go usage model, and inherent scalability. Cloud platforms are proving to be viable alternatives to in-house resources for scholarly applications, with researchers in areas spanning physical and natural sciences through the arts regularly using them.1 As we enter the era of big data and data-driven research—the “fourth paradigm of science”2—researchers face challenges related to hosting, organizing, transferring, sharing, and analyzing large quantities of data. Many believe that cloud models provide an ideal platform for supporting big data.

46 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


[Figure 1 diagram. Storage resources: supercomputers and campus clusters, personal resources, object storage, block/drive storage, instance storage. Access layer: Globus Connect. Security layer: Globus Nexus, with InCommon/CILogon, OpenID, and MyProxy OAuth identities. Capabilities: access, transfer, synchronize, share.]

FIGURE 1. Globus provides transfer, synchronization, and sharing of data across a wide variety of storage resources. Globus Nexus provides a security layer through which users can authenticate using a number of linked identities. Globus Connect provides a standard API for accessing storage resources.
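The identity-hub idea in the caption (several linked identities resolving to one account, so that any of them can be used to sign on) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class and method names (`Identity`, `IdentityHub`, `create`, `link`, `resolve`) and the example identifiers are hypothetical and are not the Globus Nexus API, and a real service would verify each external identity cryptographically rather than trust a string.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# Hypothetical model, not the Globus Nexus API.
# An external identity is a (provider, subject) pair,
# for example ("openid", "https://example.org/alice").
ExternalId = Tuple[str, str]


@dataclass
class Identity:
    """One hub account and the external identities linked to it."""
    username: str
    linked: Set[ExternalId] = field(default_factory=set)


class IdentityHub:
    """Maps many external identities onto one account each."""

    def __init__(self) -> None:
        self._identities: Dict[str, Identity] = {}
        self._owner_of: Dict[ExternalId, str] = {}

    def create(self, username: str) -> Identity:
        if username in self._identities:
            raise ValueError(f"identity {username!r} already exists")
        identity = Identity(username)
        self._identities[username] = identity
        return identity

    def link(self, username: str, provider: str, subject: str) -> None:
        """Attach an external identity to an existing account."""
        identity = self._identities[username]  # KeyError if unknown account
        external = (provider, subject)
        owner = self._owner_of.get(external)
        if owner is not None and owner != username:
            raise ValueError("external identity is linked to another account")
        identity.linked.add(external)
        self._owner_of[external] = username

    def resolve(self, provider: str, subject: str) -> Optional[str]:
        """Return the account that owns this external identity, if any."""
        return self._owner_of.get((provider, subject))


hub = IdentityHub()
hub.create("alice")
hub.link("alice", "openid", "https://example.org/alice")
hub.link("alice", "ssh", "example-key-fingerprint")
print(hub.resolve("ssh", "example-key-fingerprint"))
```

Keeping a reverse index from external identity to account makes sign-on a single lookup and lets the hub refuse to link one external identity to two accounts.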

Large scientific datasets are increasingly hosted on both public and private clouds. For example, public datasets hosted by Amazon Web Services (AWS) include 20 Tbytes of NASA Earth science data, 500 Tbytes of Web-crawled data, and 200 Tbytes of genomic data from the 1000 Genomes project. Open clouds such as the Open Science Data Cloud (OSDC)3 host many of the same research datasets in their collection of more than 1 Pbyte of open data. Thus, it’s frequently convenient, efficient, and cost-effective to work with these datasets on the cloud.

In addition to these high-profile public datasets, many researchers store and work with large datasets distributed across a plethora of cloud and local storage systems. For example, researchers might use datasets stored in object stores such as Amazon Simple Storage Service (S3), large mountable block stores such as Amazon Elastic Block Store (EBS), instance storage attached to running cloud virtual machine (VM) instances, and other data stored on their institutional clusters, personal computers, and in supercomputing centers.

Given the distribution and diversity of storage as well as increasingly huge data sizes, we need standardized, secure, and efficient methods to access data, move it to other systems for analysis, synchronize changing datasets across systems without copying the entire dataset, and share data with collaborators and others for extension and verification. Although high-performance methods are clearly required as data sizes grow, secure methods are equally important, given that these datasets might include medical, personal, financial, government, and intellectual property data. Thus, we need models that provide a standard interface through which users can perform these actions and methods that leverage proven security models to provide a common interface and single-sign-on. These approaches must also be easy to use, scalable, efficient, and independent of storage type.

Globus is a hosted provider of high-performance, reliable, and secure data transfer, synchronization, and sharing.4 In essence, it establishes a huge distributed data cloud through a vast network of


Globus-accessible endpoints—storage resources that implement Globus’s data access APIs. Through this cloud, users can access, move, and share large amounts of data remotely, without worrying about performance, reliability, or data integrity.

Globus: Large-Scale Research Data Management as a Service
Figure 1 gives a high-level view of the Globus ecosystem. Core Globus capabilities are split into two services: Globus Nexus manages user identities and groups,5 whereas the Globus transfer service manages transfer, synchronization, and sharing tasks on the user’s behalf.6 Both services offer programmatic APIs and clients to access their functionality remotely. They’re also accessible via the Globus Web interface (www.globus.org).

Globus Nexus provides the high-level security fabric that supports authentication and authorization. Its identity management function lets users create and manage a Globus identity; users can create a profile associated with their identity, which they can then use to make authorization decisions. It also acts as an identity hub, where users can link external identities to their Globus identity. Users can authenticate with Globus through these linked external identities using a single-sign-on model. Supported identities include campus identities using InCommon/CILogon via OAuth, Google accounts via OpenID, XSEDE accounts via MyProxy OAuth, an Interoperable Global Trust Federation (IGTF)-certified X.509 certificate authority, and Secure Shell (SSH) key pairs. To support collective authorization decisions (such as when sharing data with collaborators), Globus Nexus also supports the creation and management of user-defined groups.

The Globus transfer service provides core data management capabilities and implements an associated data access security fabric. Globus uses the GridFTP protocol7 to transfer data between logical endpoints—a Globus representation of an accessible GridFTP server. GridFTP extends FTP to improve performance, enable third-party transfers, and support enhanced security models. The basic Globus model for accessing and moving data requires deploying a GridFTP server on a computer and registering a corresponding logical endpoint in Globus. The GridFTP server must be configured with an authentication provider that handles the mapping of credentials to user accounts. Often, authentication is provided by a co-located MyProxy credential management system,8 which lets users obtain short-term X.509 certificate-based proxy credentials by authenticating with a plug-in authentication module (for example, local user accounts, Lightweight Directory Access Protocol [LDAP], or InCommon/CILogon).

Globus uses two separate communication channels. The control channel is established between Globus and the endpoint to start and manage transfers, retrieve directory listings, and establish the data channel. The data channel is established directly between two Globus endpoints (GridFTP servers) and is used for data flowing between systems. The data channel is inaccessible to the Globus service, so no data passes through the service.

Several capabilities differentiate Globus from its competitors:

• High performance. Globus tunes performance based on heuristics to maximize throughput using techniques such as pipelining and parallel datastreams.
• Reliable. Globus manages every stage of data transfer, periodically checks transfer performance, recovers from errors by retrying transfers, and notifies users of various events (such as errors and success). At the conclusion of a transfer, Globus compares checksums to ensure data integrity.
• Secure. Globus implements best-practice security approaches with respect to user authentication and authorization, securely manages the storage and transmission of credentials to endpoints for authentication, and supports optional data encryption.
• Third-party transfer. Unlike most transfer mechanisms (such as SCP [secure copy]), Globus facilitates third-party transfers between two remote endpoints. That is, rather than maintain a persistent connection to an endpoint, users can start a transfer and then let Globus manage it for the duration of the transfer.
• High availability. Globus is hosted using a distributed, replicated, and redundant hosting model deployed across several AWS availability zones. In the past year, Globus and its constituent services have achieved 99.96 percent availability.
• Accessible. Because Globus is a software-as-a-service (SaaS) provider, users can access its capabilities without installing client software locally, so they can start and manage transfers through their Web browsers, or using the Globus command-line interface or REST API.

In three and a half years of operation, Globus has attracted more than 18,000 registered users, of which approximately 200 to 250 are active every


day, and has conducted nearly 1 million transfers, collectively containing more than 2 billion files and 52 Pbytes of data. Figure 2 summarizes the Globus transfers over this period. The graphs include only transfer tasks (that is, they don’t include mkdir, delete, and so on) in which data has been transferred (for example, they don’t include sync jobs that don’t transfer files) between nontesting endpoints (that is, they ignore Globus test endpoints go#ep1 and go#ep2). Figure 2a shows the frequency of the total number of bytes transferred in a single transfer task (note log bins), and Figure 2b shows the frequency of the total number of files and directories transferred in a single transfer task. As Figure 2a shows, the most common transfers are between 100 Mbytes and 1 Gbyte (81,624 total transfers), whereas more than 700 transfers have moved tens of Tbytes of data and 39 have moved hundreds of Tbytes (max 500.415 Tbytes). The most common number of files and directories transferred is less than 10; however, more than 400 transfers have moved more than 1 million files each (max 39,643,018), and 120 transfers have moved more than 100,000 directories (max 7,675,096). Figure 2 highlights the huge scale at which Globus operates in terms of data sizes transferred, number of files and directories moved, and number of transfers conducted.

[Figure 2 plots. (a) Data transferred: frequency of transfer tasks per log-scale size bin, from 1 byte to 100 Tbytes–1 Pbyte. (b) Number of files/directories: frequency of transfer tasks per log-scale count bin, with separate series for files and directories.]

FIGURE 2. Frequency of transfers with given transfer size and number of files and directories. Transfer task frequency for (a) total transfer size, and (b) number of files and directories.

Extending the Globus Data Cloud
Globus currently supports a network of more than 8,000 active (used within the last year) endpoints distributed across the world and hosted at a variety of locations, from PCs to supercomputers. Users can already access and transfer data from many locations via Globus—supercomputing centers such as the National Center for Supercomputing Applications (NCSA) and the San Diego Supercomputer Center (SDSC); university research computing centers such as those at the University of Chicago; cloud platforms such as Amazon Web Services and the Open Science Data Cloud (OSDC); large user facilities such as CERN and Argonne National Laboratory’s Advanced Photon Source; and commercial data providers such as PerkinElmer. This vast collection of accessible endpoints ensures that new Globus users have access to large quantities of data immediately.

As new users join Globus, they often require access to new storage resources (including their own PCs). Thus, an important goal is to provide trivial methods for making resources accessible via Globus. To allow data access via Globus, storage systems must be configured with a GridFTP server and some authentication method. To ease this process, we developed Globus Connect, a software package that can be deployed quickly and easily to make resources accessible to Globus. We developed two versions of Globus Connect for different deployment scenarios.

Globus Connect Personal is a lightweight single-user agent that operates in the background much like other SaaS agents (such as Google Drive and Dropbox). A unique key is created for each installation and is used to peer Globus Connect to the user’s Globus account, ensuring that the endpoint is only accessible to its owner. Because we designed Globus Connect Personal for installation on PCs, it supports operation on networks behind firewalls and network address translation (NAT) through its use of outbound connections and relay servers (similar to other user agents such as ). Because it can run in user space, it doesn’t require administrator privileges. Globus Connect Personal is available for Linux, Windows, and MacOS.

Globus Connect Server is a multiuser server installation that supports advanced configuration


options. It includes a full GridFTP server and an optional colocated MyProxy server for authentication. Alternatively, users can configure existing authentication sources upon installation. The installation process requires a one-command setup and completion of a configuration file that defines aspects such as the endpoint name, file system restrictions, network interface, and authentication method. Globus Connect Server also supports multiserver data transfer node configurations to provide increased throughput. Globus Connect Server is available as native Debian and RedHat packages.

With Globus Connect, users can quickly expose any type of storage resource to the Globus cloud. They can use lightweight Globus Connect Personal endpoints on PCs and even short-lived cloud instances. They can even script the download and configuration of these endpoints for programmatic execution. For more frequently used resources with multiple users (such as data transfer nodes, clusters, storage providers, long-term and high-performance storage such as High Performance Storage System [HPSS]), they can deploy Globus Connect Server and leverage institutional identity providers. They can then scale deployments over time by adding Globus Connect Server nodes to load balance transfers. Both versions support all Globus features including access, transfer, synchronization, and sharing.

Supporting Cloud Object Stores
To allow users to access a variety of cloud storage systems, Globus supports the creation of endpoints directly on object storage. Users can thus access, transfer, and share data between S3 and existing Globus endpoints as they do between any other Globus endpoints. To access S3, users must create an S3-backed endpoint that maps to a specific S3 bucket to which they have access. With this model, users can expose access to large datasets stored in S3 and benefit from Globus’s advanced features, including high performance and reliable transfer, rather than relying on standard HTTP support (which doesn’t scale to large datasets and doesn’t ensure data integrity). Users can also leverage Globus’s synchronization and sharing capabilities directly from S3 endpoints.

Globus S3 endpoints support transfers directly from existing endpoints, so don’t require data staging via a Globus Connect deployment hosted on Amazon’s cloud. This approach differs from GreenButton WarpDrive (www.greenbutton.com), which, although it also uses GridFTP, relies on a pool of GridFTP servers hosted on cloud instances. Globus’s S3 support builds upon extensions to GridFTP to support communication directly between S3 and GridFTP servers. Globus enables user-controlled registration of logical S3 endpoints requiring only details identifying the storage location (that is, the S3 bucket) and appropriate information required to connect to the S3 endpoint. To provide secure access to data stored in S3, while also enabling user-controlled sharing via Globus, we leverage Amazon’s Identity and Access Management (IAM) service to delegate control of an S3 bucket to a trusted Globus user. We peer this Globus IAM user with the Globus transfer service via trusted credentials. Thus, when delegating access of an S3 bucket, Globus can base authorization decisions on internal policies (such as sharing permissions) to allow transfers between other Globus endpoints and the S3 endpoint.

Providing Scalable In-Place Data Sharing
One of the most common requirements associated with big data (and scientific data in general) is the ability to share data with collaborators. Current models for data sharing are limited in many ways, especially as data sizes increase. For example, cloud-based mechanisms such as Dropbox require that users first move (replicate) their data to the cloud, which is both costly and time consuming. Ad hoc models, such as directly sharing from institutional storage, require manual configuration, creation, and management of remote user accounts, making them difficult to manage and audit. These difficulties become insurmountable when data is large and when dynamic sharing changes are required. Rather than implement yet another storage service, we focus on enabling in-place data sharing. That is, shared data does not reside on Globus; rather, Globus lets users control who can access their data directly on their existing endpoints.

To share data in Globus, a user selects a file system location and creates a shared endpoint—that is, a virtual endpoint rooted at the shared location on his or her file system. The user can then select


other users, or groups of users, who can access the shared endpoint—or parts thereof—by specifying fine-grained read and write permissions. One advantage of this model is that permission changes are reflected immediately, so users can revoke access to a shared dataset instantly.

Globus’s sharing capabilities are extensions built onto the GridFTP server, which, when enabled, let the GridFTP server delegate authorization decisions to Globus. Specifically, two new GridFTP site commands let Globus check that sharing is enabled on an endpoint and create a new shared endpoint. We also extended the GridFTP access protocol to allow access by a predefined trusted Globus user. The access request includes additional parameters such as the shared owner, shared user, and access control list (ACL) for the shared endpoint, which Globus maintains. When accessing the endpoint, this information is passed to the GridFTP server to enable delegated authorization decisions from the requesting user to the local user account of the shared endpoint owner. Using this approach, the GridFTP server can perform an authorization check to ensure that the shared user can access the requested path before following the normal access protocol, which requires changing to the shared endpoint owner’s local user account and performing the requested action.

Secure Data Access, Transfer, and Sharing
There is a wide range of potential security implications when accessing distributed data, hosted by different providers, across security domains, and using different security protocols. Globus’s multilayered architecture leverages standard security protocols to manage authentication and authorization, and avoid unnecessary storage of (or access to) users’ credentials and data. Most importantly, data does not pass through Globus; rather, it acts as a mediator, allowing endpoints to establish secure connections between one another.

Authentication and Authorization
At the heart of the Globus security model is Globus Nexus, which facilitates the complex security protocols required to access the Globus service and endpoints using Globus identities as well as linked external identities.

Globus stores identities (and groups) in a connected graph. For Globus identities, it stores hashed and salted passwords for comparison when authenticating. For the linked identities (SSH public keys, X.509 certificates, OpenID identities, InCommon/CILogon OAuth, and so on) used to provide single-sign-on, it stores only public information, such as SSH public keys, X.509 certificates, OpenID identity URLs and usernames, and OAuth provider servers, certificates, and usernames. Thus, when authenticating, Globus can validate a user’s identity by following the private authentication process using cryptographic techniques rather than comparing passwords. Consider, for example, authenticating using a campus identity. Here, Globus leverages the InCommon/CILogon system and the OAuth protocol to let users enter their username/password via a trusted campus website. Globus passes a request token with the user authentication and receives an OAuth token and signature in return, which it exchanges for an OAuth access token (and later a certificate) from the campus identity provider.

Linked identities, such as XSEDE identities, are also used for single-sign-on access to endpoints. Rather than require users to authenticate multiple times for every action and to allow Globus to manage transfers on a user’s behalf, Globus stores short-term proxy credentials. This allows Globus to perform important transfer-management tasks such as restarting transfers upon error. Here, Globus stores an active proxy credential that can be used to impersonate the user, albeit for a short period of time. To do so securely, Globus only caches the active credential and encrypts it using a private key owned by Globus Nexus. When the active credential is required (for example, to compute a file checksum on an endpoint), the credential is decrypted and passed to the specific GridFTP server over the encrypted control channel.

Endpoint Data Access and Transfer
GridFTP uses the Grid Security Infrastructure (GSI), a specification that allows secure and delegated communication between services in distributed computing environments. GridFTP relies on external services to authenticate users and provide trusted signed certificates (typically from a MyProxy


server) used to access the server. These certificates are often hidden from users by the use of an online certificate authority (CA), such as MyProxy. The GridFTP service has a certificate containing the hostname and host information that it uses to identify itself. (This certificate is created automatically when users install Globus Connect, or it can be issued by a CA.) In Globus Connect, the MyProxy server can be optionally installed to issue short-term certificates on demand. Globus Connect can also be configured to use external MyProxy servers. Globus, GridFTP, and MyProxy servers are configured to trust the certificates exchanged between each other.

MyProxy servers let users obtain short-term credentials that a GridFTP server uses to assert user access to the file system. Administrators can configure MyProxy servers to use various mechanisms for authentication through pluggable authentication modules (PAMs). Usually, these PAMs support local system credentials or institutional LDAP credentials. There are two basic models in which Globus uses a MyProxy server to obtain a credential. In the first, Globus passes the user’s username and password to the MyProxy server and receives a credential in response. Thus, users must trust Globus not to store their passwords and to transfer them securely. In the second and preferred model, Globus uses the OAuth protocol to redirect the user to the MyProxy server to authenticate directly (that is, Globus doesn’t see the username and password), and the server returns a credential in the OAuth redirection workflow.

When accessing data on an endpoint, Globus uses SSL/TLS to authenticate with the registered GridFTP server using the user’s certificate. The GridFTP server validates the user’s certificate, retrieves a mapping to a local user account from a predefined mechanism (such as a GridMap file), and changes the local user account (used to access the file system) to the requesting user’s local account. Subsequent file system access occurs as the authenticated user’s local account. To provide an additional layer of security, endpoint administrators can configure path restrictions (restrict_paths) that limit GridFTP access to particular parts of the file system. For instance, administrators might allow access only to users’ home directories or to specialized locations on the file system.

The flow of data between endpoints (including S3-backed endpoints and shared endpoints) is another potential area of vulnerability because data can travel on the general Internet. To provide secure data transfer, Globus supports data encryption based on secure sockets layer (SSL) connections between endpoints. In the case of S3 endpoints, the connection uses HTTPS. To avoid unnecessary overhead for less sensitive data, encryption is not a default setting and must be explicitly selected for individual transfers. The control channel used to start and manage transfers is always encrypted to avoid potential visibility of credential, transfer, and file system information.

Secure Sharing

Globus sharing creates several new security considerations, such as requiring secure peering of shared endpoints and Globus, authorizing access to shared data, and ensuring that file system information is not disclosed outside of the restricted shared endpoint.

The Globus sharing model requires the GridFTP server to be explicitly configured to allow sharing. As part of this process, the GridFTP server is configured to allow a trusted Globus user to access the server (and to later change the local user account to the shared endpoint owner’s local user account). A unique distinguished name (DN) obtained from a Globus CA operated for this purpose identifies the user. The GridFTP server is configured to trust both this special Globus user and the Globus CA via the registered DN. During configuration, administrators can set restrictions (sharing_rp) defining what files and paths may be shared on the file system and which users may create shared endpoints. For example, administrators could limit sharing to a particular path (analogous to a public_html directory) and a subset of administrative users.

As part of shared endpoint creation, a unique token is created on the GridFTP server for each shared endpoint. This token is used to safeguard against redirection and man-in-the-middle attacks. For instance, an attacker who gains control of a compromised Globus account might change the physical GridFTP server associated with a trusted endpoint (for example, an XSEDE endpoint) to a malicious endpoint under the attacker’s control. In this case, the attacker can create a shared endpoint and can then change the physical server back to the trusted server. Because the unique token is created on the malicious server, it won’t be present on the trusted (XSEDE) server, so the attacker won’t be able to exploit the shared endpoint to access the trusted server.

Accessing data on a shared endpoint using the extended GridFTP protocol lets Globus access the GridFTP server (as the trusted Globus account). The extended access request specifies data location, shared endpoint owner, the user accessing the shared endpoint, and current ACLs for that shared endpoint. To ensure that data is accessed only within the boundaries of what has been shared and within restrictions placed by the server administrator, the GridFTP server checks restricted paths, shared paths, and Globus ACLs (in that order). Assuming nothing negates the access, the GridFTP server changes the local user account, with which it accesses the file system, to the shared endpoint owner’s local user account and satisfies the request.

Finally, because potentially sensitive path information could be included in a shared file path, Globus hides the root path from users accessing the shared endpoint. For example, if a user shares the directory “/kyle/secret/,” it will appear simply as “/~/” through the shared endpoint. Globus translates paths before sending requests to the GridFTP server.

Hosting and Security Policies

All Globus services are hosted on AWS. Although this environment has many advantages, such as high availability and elastic scalability, as with all hosting options, it also has inherent risks. We mitigate these risks by following best practices with respect to deployment and management of instances. These practices include storing all sensitive state encrypted, isolating data stores from the general Internet so they’re only accessible to Globus service nodes (by AWS security groups), performing active intrusion detection and log monitoring to discover threats, auditing public-facing services and using strict firewalls to restrict access to predefined ports, and establishing backup processes to ensure that all data is encrypted before it’s put in cloud storage. To ensure that these practices are followed, we conducted an external security review,9 and resolved the identified issues.

One important security aspect relates to policies for responding to security breaches and vulnerabilities. The recent Heartbleed bug is an example of a security vulnerability that affected a huge number of websites across the world. Although Globus uses custom data transfer protocols that are unlikely targets of such an attack, exploits via the website, endpoints, and linked identity providers are still possible. In this particular case, we followed predefined internal security policies to determine if the vulnerability impacted our services, patched the issue for all Globus services and Globus-managed endpoints, and generated new private keys. We then followed internal processes for responding to potentially compromised user access by revoking user access tokens (invalidating all user sessions) and analyzing access logs. Finally, because of the exploit’s nature, we analyzed all user endpoints to identify potentially vulnerable endpoints. We then contacted administrators of these endpoints and recommended that they take specific measures to patch the systems.

As data sizes increase, researchers must look toward more efficient ways of storing, organizing, accessing, sharing, and analyzing data. Although Globus’s capabilities make it easy to access, transfer, and share large amounts of data across an ever-increasing ecosystem of active data endpoints, it also provides a framework on which new approaches for efficiently managing and interacting with big data can be explored.

The predominant use of file-based data is often inefficient because the data required for analysis doesn’t always match the model used to store it. Researchers typically slice climate data in different ways depending on the analysis: for example, geographically, temporally, or based on a specific type of data such as rainfall or temperature. Accessing entire datasets when only small subsets are of interest is both impractical and inefficient. Although some data protocols, such as the Open-source Project for a Network Data Access Protocol (OpenDAP), provide methods for accessing data subsets within files, no standard model for accessing a wide range of data formats currently exists. Recently, researchers have proposed more sophisticated data access models within GridFTP that use dynamic query and subsetting operations to retrieve (or transfer) data


subsets.10 Although this work presents a potential model for providing such capabilities, further work is needed to generalize the approach across data types and to develop a flexible and usable language to express such restrictions.

Files typically contain valuable metadata that can be used for organization, browsing, and discovery. However, accessing this metadata is often difficult because it’s stored in various science-specific formats, often encoded in proprietary binary formats, and typically unstructured (or at least doesn’t follow standard conventions). Moreover, even when the metadata is accessible, few high-level methods exist for browsing it across many files or across storage systems. Often, the line between metadata and data is blurred, and, whereas metadata might be unnecessary for some analyses, it can be valuable for others. Thus, we need methods that enable structured access to both data and metadata using common formats. Given that metadata can describe data or contain other sensitive information (for example, patient names), it’s equally important to provide secure access methods. We therefore need models that expose such metadata to users and let them query over it to find relevant data for analysis or share it in a scalable and secure manner.

Often, data sharing occurs for the purpose of publishing to the wider community or as part of a publication. Considerable research has explored current data publishing practices.11,12 In many cases, researchers found that data wasn’t published with papers and that original datasets couldn’t be located. This affects one of the core principles of scientific discovery: that research is reproducible and verifiable. In response, funding agencies and publishers are increasingly placing strict requirements on data availability associated with grants and publications, although these requirements are often disregarded.12 Even when researchers do publish data, they often do so poorly, in an ad hoc manner that makes the data difficult to find and understand (due to a lack of metadata), and with little guarantee that the data is unchanged or complete. We need new systems that let researchers publish data, easily associate persistent identifiers (such as DOIs) with that data, provide guarantees that the data is immutable and consistent with what was published, provide common interfaces for discovering and accessing published data, and do so at scales that correspond to the growth of big data.

Although these three areas represent different research endeavors, they all require a framework that supports efficient and secure data access. Globus provides a model on which we can continue to innovate in these areas to provide enhanced capabilities directly through the existing network of Globus endpoints. We benefit from using Globus’s transfer and sharing capabilities and from leveraging the same structured approaches toward authentication and authorization.

We intend to continue to develop support for other cloud storage and cloud providers, such as persistent long-term storage like and storage models used by other cloud providers (Microsoft Azure Storage, for example), with the goal of developing an increasingly broad data cloud.

Acknowledgments

We thank the Globus team for implementing and operating Globus services. This work was supported in part by the US National Institutes of Health through NIGMS grant U24 GM104203, the Bio-Informatics Research Network Coordinating Center (BIRN-CC), the US Department of Energy through grant DE-AC02-06CH11357, and the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by US National Science Foundation grant number ACI-1053575.

References

1. D. Lifka et al., XSEDE Cloud Survey Report, tech. report 20130919-XSEDE-Reports-CloudSurvey-v1.0, XSEDE, 2013.
2. T. Hey, S. Tansley, and K. Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery, , 2009.
3. R.L. Grossman et al., “The Design of a Community Science Cloud: The Open Science Data Cloud Perspective,” Proc. 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC 12), 2012, pp. 1051–1057.
4. I. Foster, “Globus Online: Accelerating and Democratizing Science through Cloud-Based Services,” IEEE Internet Computing, vol. 15, no. 3, 2011, pp. 70–73.
5. R. Ananthakrishnan et al., “Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications,” Proc. IEEE Int’l Conf. Cluster Computing (CLUSTER), 2013, pp. 1–3.
6. B. Allen et al., “Software as a Service for Data Scientists,” Comm. ACM, vol. 55, no. 2, 2012, pp. 81–88.
7. W. Allcock et al., “The Globus Striped GridFTP Framework and Server,” Proc. 2005 ACM/IEEE Conf. Supercomputing (SC 05), pp. 54–64.
8. J. Novotny, S. Tuecke, and V. Welch, “An Online Credential Repository for the Grid: MyProxy,”


Proc. 10th IEEE Int’l Symp. High Performance Distributed Computing, 2001, pp. 104–111.
9. V. Welch, Globus Online Security Review, tech. report, Indiana Univ., 2012; https://scholarworks.iu.edu/dspace/handle/2022/14147.
10. Y. Su et al., “SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol,” Proc. Int’l Conf. High Performance Computing, Networking, Storage and Analysis (SC 13), 2013, article 47.
11. T.H. Vines et al., “The Availability of Research Data Declines Rapidly with Article Age,” Current Biology, vol. 24, no. 1, 2014, pp. 94–97.
12. A.A. Alsheikh-Ali et al., “Public Availability of Published Research Data in High-Impact Journals,” PLoS ONE, vol. 6, no. 9, 2011, e24357.

KYLE CHARD is a senior researcher at the Computation Institute, a joint venture between the University of Chicago and Argonne National Laboratory. His research interests include distributed meta-scheduling, grid and cloud computing, economic resource allocation, social computing, and services computing. Chard received a PhD in computer science from Victoria University of Wellington, New Zealand. Contact him at [email protected].

STEVEN TUECKE is deputy director at the University of Chicago’s Computation Institute, where he’s responsible for leading and contributing to projects in computational science, high-performance and distributed computing, and biomedical informatics. Tuecke received a BA in mathematics and computer science from St Olaf College. Contact him at tuecke@uchicago.edu.

IAN FOSTER is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne senior scientist and distinguished fellow, and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research interests include distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.

IEEE Computer Society | Software Engineering Institute Watts S. Humphrey Software Process Achievement Award

Nomination Deadline: January 15, 2015

Do you know a person or team that deserves recognition for their process improvement activities?

The IEEE Computer Society/Software Engineering Institute Watts S. Humphrey Software Process Achievement Award is presented to recognize outstanding achievements in improving the ability of a target organization to create and evolve software.

The award may be presented to an individual or a group, and the achievements can be the result of any type of process improvement activity.

To nominate an individual or group for a Humphrey SPA Award, please visit http://www.computer.org/portal/web/awards/spa


Location-Based Security Framework for Cloud Perimeters

Chetan Jaiswal, Mahesh Nath, and Vijay Kumar, University of Missouri, Kansas City

A new approach to compose firewall policies to protect mobile and static cloud perimeters uses location to filter out attacks from unsafe locations.

Many enterprise and government organizations use a variety of mobile gadgets to connect to the cloud to manage their data processing requirements. Although this platform improves availability and performance, it also increases security risk, as it can allow unwanted malicious network traffic into the organization. Firewall filtering is often inadequate for stopping these attacks. The problem becomes more complex when multiple firewalls are deployed because coordination among them becomes extremely difficult if not impossible.

Current firewalls use static filtering policies. Although simple, a static policy has many disadvantages. First, because border routers enforce a static policy, they can’t react to changes in the external environment. Second, because of physical limitations and differences in trust relationships between an enterprise and its immediate neighbors, some firewalls might require preferential treatment over others in admitting different kinds of traffic streams. Therefore, providing perimeter protection policies that react to dynamic changes and respect organizational objectives such as preferential treatment while enforcing organizations’ overall security objectives requires dynamic and flexible policies at each border gateway that are also part of a global policy such that they enforce common security objectives in mobile clouds.

56 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


The protection issue becomes more complex when we consider attacks from mobile sources. Unlike threats from stationary attackers, mobile attackers disappear from the attack location and resurface elsewhere. We introduce location-attack protection, in which the firewall can block messages from high cybercrime locations (country, state, and so on) completely. To enable this protection, we use constrained logic programming by appropriately altering the flexible authorization framework (FAF)1 and its extensions, and strand spaces and multiset rewriting strategies for protocol analysis.2–4 These two formalisms control multiple streams of data exchanged between two participants, which is relevant to our framework because it requires fine-grained and protocol-specific perimeter protection policies. They can also easily incorporate new spatial and temporal parameters that are unique and crucial to mobile clouds.

Our scheme supports consistency and completeness of local policies: they’re important because an individual gateway or firewall on the perimeter needs to know whether to allow or deny a stream’s progress. The scheme makes sure that the composition of local policies is logically correct to obtain an enterprise-wide perimeter protection policy. In addition, our scheme makes sure that the effect of the propagation of change in policies to others is correct. If the global policy changes, then all local policies have to accommodate that change. Conversely, if a local policy changes, the related global policy may change, which in turn may trigger changes in other dependent local policies.

A mobile unit frequently changes location, which can introduce inconsistency at a policy level. For example, a mobile user might be subject to a different set of constraints in Kansas City, Kansas, than in Kansas City, Missouri, which will affect the data access pattern. Firewall filtering schemes must handle such policy changes, due to mobility and other necessary revisions in the policy, in real time to eliminate false denial. This becomes tricky because a mobile unit becomes unreachable when it’s switched off or slips into doze mode:5 updates or changes can only be installed when the unit becomes active. To address this problem, we use a twofold optimization strategy. In the first phase, we apply fold/unfold6 transformations to optimize policy rules. In the second phase, we partially materialize static parts of individual policies, excluding dynamic variables that share information between local and global policies. We then translate such optimized policies to rule sets used by today’s firewalls.7

Mobile Cloud

Mobile clouds support personal and terminal mobility.5 A mobile unit can mount attacks from any location at any time. Attack packets pass through several gateways before reaching the cloud. Each gateway has its own dynamic firewall, and the cloud is protected by its own firewall. Whenever a firewall policy change is incorporated on any of the firewalls, the change is propagated to all other firewalls for updating.

Firewalls are typically configured using a rule base specifying which inbound or outbound packets (or sessions) are to be allowed or blocked. A Cisco rule set7 is as follows: pass tcp 20.9.17.8 0.0.0.0 121.11.127.20 0.0.0.0 range 23 27, which says that TCP packets from IP address 20.9.17.8 to IP address 121.11.127.20 are to be accepted if the destination port range is from 23 to 27. The 0.0.0.0 segments mean that address masking isn’t used. Generally, such rules are listed in some order in access lists.7 When a firewall receives a packet, it goes through the list and matches the first rule that applies to the packet and follows the specified action. Firewalls use a closed policy that drops packets not explicitly permitted by any rule. This procedure leads to several problems:

• Because the rules are written at the lower protocol level, a misconfiguration can make the whole intranet unreachable.
• The rule base might have many redundant rules.
• The semantics depend not only on the rules, but also on the order in which they’re listed, an undesirable feature.

Earlier research on this issue provided solutions with limited success. For example, Yair Bartal and his colleagues proposed Firmato, a firewall management toolkit.8 Although it models the firewall security policy and network topologies, it doesn’t permit fine-grained admission control of streams, doesn’t cover intranets with multiple external gateways that enforce different policies, and can’t be used to obtain


(a)
Rule 1: procNxtPkt([], post, Si, +) ← permToOpen(Si)
Rule 2: procNxtPkt(pre*car(post), post, Si, +) ← ¬blocked(Si)
    PROVISION(self): updtLocalStat(car(post), Si, Li), localPktAcpPolicy(car(post), pre, Si, +),
    PROVISION(global): gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +)
    OBLIGATION(global): updtGlobalStat(car(post), Si, [P1, . . ., Pn], +)
Rule 3: blocked(x) ← Li = maxLi

(b)
Rule 4: procNxtPkt([], post, Si, +, LOi) ← permToOpen(Si, LOi)
Rule 5: procNxtPkt(pre*car(post), post, Si, +) ← ¬blocked(Si, LOi)
    PROVISION(self): updtLocalStat(car(post), Si, Li), localPktAcpPolicy(car(post), pre, Si, +, LOi),
    PROVISION(global): gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +, LOi)
    OBLIGATION(global): updtGlobalStat(car(post), Si, [P1, . . ., Pn], +, LOi)
Rule 6: blocked(x) ← Li = maxLi
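The mobile rules in panel (b) can also be read procedurally. The following Python sketch is one possible reading, not the authors’ implementation: it assumes that a stream with no admitted packets must first pass permToOpen for its originating location, and that a later packet is admitted only while the stream is not blocked and both the local and global policies accept it. The class, function, and location names (Stream, proc_next_packet, "location-A", "location-X") are hypothetical, and the global policy is passed in as an opaque callable because the local policy knows only its truth value, not its definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stream:
    stream_id: str                 # S_i
    location: str                  # LO_i: where the stream originates
    max_used: int                  # maxL_i: local budget for this stream
    used: int = 0                  # L_i: budget consumed so far
    admitted: List[str] = field(default_factory=list)  # "pre": packets accepted so far


def blocked(s: Stream) -> bool:
    """Rule 6: the stream is blocked once L_i has reached maxL_i."""
    return s.used >= s.max_used


def proc_next_packet(
    s: Stream,
    packet: str,
    perm_to_open: Callable[[Stream], bool],
    local_policy: Callable[[str, Stream], bool],
    global_policy: Callable[[str, Stream], bool],
    update_global_stat: Callable[[str, Stream], None],
) -> bool:
    """Return True if the packet is admitted to the stream."""
    # Rule 4: with an empty prefix, the stream must first be permitted to
    # open from its originating location.
    if not s.admitted and not perm_to_open(s):
        return False
    # Rule 5: not blocked, local policy accepts, and the global provision holds.
    if blocked(s) or not local_policy(packet, s) or not global_policy(packet, s):
        return False
    s.used += 1                      # updtLocalStat
    update_global_stat(packet, s)    # obligation owed to the global policy base
    s.admitted.append(packet)
    return True


# Hypothetical policies, used only for this demonstration.
UNSAFE = {"location-X"}
global_count = {"packets": 0}


def perm_to_open(s: Stream) -> bool:
    return s.location not in UNSAFE


def local_policy(packet: str, s: Stream) -> bool:
    return True


def global_policy(packet: str, s: Stream) -> bool:
    return s.location not in UNSAFE


def update_global_stat(packet: str, s: Stream) -> None:
    global_count["packets"] += 1


safe = Stream("S1", "location-A", max_used=3)
results = [
    proc_next_packet(safe, f"p{i}", perm_to_open, local_policy,
                     global_policy, update_global_stat)
    for i in range(4)
]
unsafe = Stream("S2", "location-X", max_used=3)
unsafe_result = proc_next_packet(unsafe, "p0", perm_to_open, local_policy,
                                 global_policy, update_global_stat)
```

In this sketch the stream from the permitted location has three packets admitted and the fourth refused once its local budget is used up, while the stream from the unsafe location is never opened. Using `>=` rather than strict equality in `blocked` is a defensive choice of the sketch, not something the rules state.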

FIGURE 1. Packet acceptance and rejection rules written in Flexible Parameter Protection Specification Language (FPPL) for: (a) local and enterprise-wide policies, and (b) mobile policies.

the global health of the traffic streams entering the Internet. Alain Mayer and his colleagues9 present Fang, a firewall analysis engine that has the same deficiencies as Firmato.8 Our scheme resolves these deficiencies. A cryptography-based scheme relies on decentralized trust management.10 Our solution distributes network perimeter protection without relinquishing centralized control and thereby circumvents the performance bottlenecks of a centralized perimeter protection policy.

Unlike wired systems, a mobile node can issue a request from any location, connect to many service providers that might have different security requirements, slip into doze mode, power off, or fail. Mobile nodes are also vulnerable to attacks. A mobile client’s valid request from one location can be denied at another location. Several good schemes have been proposed for protecting mobile systems through firewalls11; however, they provide engineering solutions to firewall protection and appear highly system dependent.

We conclude the following: we need real-time synchronization of firewalls and subsequent updates; a multilayer verification is the way to go; and the system must implement geographical location-based verification. Our logic-based framework meets these requirements.

Flexible Perimeter Protection Framework

Because a common framework can protect mobile and wired traffic, we built a unified flexible framework, the Flexible Perimeter Protection Framework (FPPF), to monitor and dynamically adjust the entering datastreams. FPPF is based on having rules built with predicates to express policies for accepting packets in an ongoing stream. We wrote the FPPF filter or protection rules in the Flexible Parameter Protection Specification Language (FPPL). We briefly introduce salient features of this language here; details can be found elsewhere.1

Example Policies Written in FPPL

FPPL, similar to other logical languages, consists of constant symbols, variables, function symbols, and terms. It uses a set of predicates to define packet acceptance and rejection rules for local and enterprise-wide policies.

For example, the rules in Figure 1a define a local policy. Rule 1 says that stream Si can be opened if it has been permitted to do so, where permToOpen(Si) holds when the latter is true. Rule 2 says that the next packet of Si is admitted as long as Si isn’t blocked, the local packet acceptance and the global approval policies allow it, and the corresponding local and global statistics are updated. Rule 3 defines the condition for Si being blocked: namely, that the local variable Li (say, buffer capacity allocated to this stream) has been used up by the stream up to now.

The local policy needs to know that the predicate gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +) (a part of enterprise-wide policy) is true for the next packet to be admitted according to the agreement it has with the enterprise-wide security policy. Conversely, it’s obligated to update the global statistics updtGlobalStat(car(post), Si, [P1, . . ., Pn], +) that are a part of the enterprise-wide security policy. Note that other variables appearing in these two predicates, namely [P1, . . ., Pn], are unknown to the local policy,


so they can’t be used in the rule as a normal predicate. The enterprise-wide policies are composed accordingly.

Enhancing Provisions, Obligations, and Delegations

Provisions and obligations play a key role in the FPPF architecture. As the previous example demonstrates, local policies depend on having provisions approved by the global policy base, and, in turn, local policies are obliged to update their local statistics with the global policy base. This two-way exchange of data allows the global policy to respond to perimeter-wide changes in an accurate and timely manner.

In our case, the provision granted by the global policy base to the local policy is specified in the predicate gPktAcPolicy(car(post), Si, [P1, . . ., Pn], +). Therefore, we model a provision as a predicate exported by the grantor and imported by the grantee. The main characteristic of the provision is that the grantee doesn’t know its definition, but would know if it’s evaluated to be true or false.

The obligation used in the example is the predicate updtGlobalStat(car(post), Si, [P1, . . ., Pn], +). Note that this is also a predicate that’s exported to the local policy base, which must instantiate the proper instances of variables that would make the predicate instance true. This obligation can be fulfilled when the function call is made.

Policy Updates in Mobile Systems

Geographical location plays an important role in managing mobile activities (such as an attack). We include geographical locations in the predicates used to specify local and enterprise-wide policies. The firewall policy in the cloud will depend on where Si originates (location-specific attacks). Thus, the predicates for the local policy will include the attacker’s location, and the predicates for the enterprise-wide policy will depend on a set of locations where a mobile unit is permitted to roam.

We illustrate mobile policy composition with a modified rule and example (Figure 1b). In this example, rule 4 says that Si can be opened if it originated at location LOi (longitude), and if it has been permitted to do so, where permToOpen(Si) holds when the latter is true. Rule 5 says that the next packet of Si is admitted provided that Si isn’t blocked, local packet acceptance and the global approval policies allow it, and the corresponding local and global statistics are updated. Rule 6 defines the condition for Si being blocked: namely, that the local variable Li (say, buffer capacity allocated to this stream) has been used up by the stream up to now.

In mobile attacks, the attack location can change frequently. As a result, firewall policies can change frequently, leading to increased update traffic; the system might not be able to handle such frequent updates or to keep the local and global policies in sync. Our scheme addresses this issue by keeping the policy warehouse at all base stations. Because a base station serves a specific location, policy relevant to that location is loaded there. A mobile unit will cache the policy from the base station of the cell it’s visiting. A base station will broadcast policy changes as they occur; all mobile units in that cell will capture them, and visitors to that cell will acquire them when they register.

Unsafe Locations

In our experience, more attacks (serious or less serious) come from some locations than others. The predicates coded in the firewalls in our system include a location parameter to identify an attack’s origin. If it originates from an “unsafe” location, it’s blocked. We define three location categories.

A hard location is one from which numerous serious attacks originate with high frequency. Any of these attacks can severely affect the cloud’s performance and integrity. The firewall must stop these attacks. If the firewall detects that an attack (for example, a Trojan) is mounted from a hard location, it immediately eliminates this attack.

A soft location is one from which relatively fewer serious attacks originate. These types of attacks don’t significantly affect the cloud’s performance and integrity, and the cloud system can continue to function while the firewall handles the attack. For example, a music sharing virus can scare people without harming the computer. The firewall might let it enter the cloud.

Finally, a clean location is one from which no attacks originate. The firewall might apply minimal security checks to messages coming from these locations.

Unsafe Location Identification in Mobile Communication

When an attacker moves around in a location while attacking the cloud, it will have the same IP address at different points inside the location. For example, if an attacker moves from point li to point lj inside a location L, the IP address will not change; that is, points li and lj will have the same IP address even though their geographical address (point li to point lj) inside L will change. Thus, to hide their identity and avoid being caught in such movement, attackers


generally use a proxy. The tracert (trace route) command, which shows the path an IP packet travelled to reach a destination, isn’t helpful because it can’t go beyond the proxy.
In a mobile network, IP address allocation is dynamic, so it’s easy for an attacker to spoof an IP address and mount an attack through a proxy. To reach the attacker behind the proxy, our approach identifies a mobile phone’s location in a cellular network through its cell global identity (Figure 2).

FIGURE 2. Cell global identification (CGI). The CGI consists of four fields: MCC, MNC, LAC, and CI. The MCC, MNC, and LAC together form the location area identification. CI: cell identity; LAC: location area code; MCC: mobile country code (3-digit); MNC: mobile network code (2 or 3 digit for GSM/UMTS application).

At present, cell global identity information is not available in IP packets coming from a mobile device. Our scheme (Figure 2) extends the structure of an IP address and includes cell global identity information in IP packets. This helps us to identify the location of the mobile unit mounting the attack, directly or through a proxy, and to program the firewall accordingly to block the attack. For example, a cell global identity with mobile country code (MCC) = 310 (USA), mobile network code (MNC) = 410 (AT&T network), location area code (LAC) = 3450, and cell identity (CI) = 118541125 represents a cell in Kansas City, Missouri.
As a security measure, the system maintains the location area codes of unsafe locations and discards incoming packets from these locations without even analyzing them. To determine if a location is safe or unsafe, the system records the number of attacks from each known location. If the number of attacks from a particular location reaches a threshold, it marks the location as unsafe. Figure 3 illustrates our approach for determining unsafe locations. It’s an ongoing process of finding unsafe locations based on the number of attacks originating from a specific location. We maintain a database of unsafe locations.

FIGURE 3. Determining unsafe locations. Flowchart: if the packet’s LAC is in the HUL, discard the packet; otherwise, check the packet and allow it if it isn’t malicious. For a malicious packet, add the LAC to the MUL if it isn’t already listed, or else increment its MUL counter; when the MUL counter exceeds T, remove the LAC from the MUL and add it to the HUL. The malicious packet is discarded. HUL: hard unsafe location list (the final list of unsafe locations); MUL: mild unsafe location list; T: attack threshold for a particular location.

The serving GPRS support node (SGSN) is the main component of the General Packet Radio Service network. The SGSN can make this location information available in IP packets coming from mobile stations because it has access to the location information (CGI) of a mobile station in its area and is also responsible for delivering data packets to and from the mobile stations in its area. The IP packet header, which contains an optional field, can be used to store the cell global identity.
Figure 4 shows the flow of IP packets in the Global System for Mobile Communication (GSM) and Universal Mobile Telecommunications System (UMTS) architectures. We’ve included relevant elements of GSM in our scheme for identifying unsafe locations. On the receiving end, the firewall extracts the CGI information from the enhanced IP packet, searches the hard unsafe location (HUL) and mild unsafe location (MUL) lists, and decides whether to reject or allow the packet.

FIGURE 4. Partial network architecture: GSM + GPRS + UMTS. Elements shown: BSC, RNC, SGSN, GGSN, Internet, firewall (with the HUL and MUL lists), and cloud; the IP packet becomes an IP packet + CGI (global identity) on its way to the receiving end. BSC: base station controller; GGSN: gateway GPRS support node; RNC: radio network controller; SGSN: serving GPRS support node.

Because our approach can identify the attacker’s location, it compromises the attacker’s privacy. Although this is not an issue in the case of an attacker, our scheme should be able to protect the privacy of a typical user whose actions look like an attack, because exposing that user’s location could lead to a security breach. We’re investigating a solution to make sure that the location isn’t accessible to anybody or that only its encrypted version is accessible.

Unsafe Locations Identification in Wired Networks
Because our scheme for identifying unsafe locations in a cellular platform won’t work for stationary attackers, we developed a different solution for these attackers. We use known landmarks (that is, trusted computers on the Internet whose geographic coordinates are known a priori) to probe an attacker’s machine (at an unknown location) and measure the response delay to compute the attacker’s coordinates. We repeat this process until we identify the location as accurately as possible.
The packet transfer delay is directly proportional to the distance. Line congestion, queuing delays, and so on can affect this relationship, but by consolidating several observations, we can identify the location with reasonable accuracy. We probe the destination from several landmarks to get a delay vector, which we then use to get an overlap area as the destination. When we have many landmarks, we can triangulate the results to determine the attack’s geoposition. In our approach, we first create a large dataset of real-world measurements by measuring latency from each landmark to every other landmark. We used PlanetLab (www.planet-lab.org) and a carefully selected, geographically diverse set of landmarks across the globe. We used triangulation to implement our algorithm, which uses three out of 50 landmarks as the starting point. It iterates with three different landmarks until we obtain consistent results. Our observation concurs with other work.12
Our algorithm (see the sidebar) starts with a set of continental landmarks (CLMs) measuring the delay in reaching the attacker machine (AMX). We considered the landmark at the University of California, Berkeley, as the west CLM (CLM2) and the landmark at Michigan Technological University as the north CLM (CLM3). The algorithm first randomly selects three landmarks, each from a different CLM dataset (CLM1, CLM2, CLM3, or CLM4). In the next step, each LM-CLMi (where 1 ≤ i ≤ 3) individually measures the delay to AMX. Using AvgLDelayi (distance-to-delay measurements based on the average lowest delay between any two landmarks), each LM-CLMi estimates the distance to AMX.
We create AvgLDelay for each LM-CLM, which provides the average ratio of distance to delay (DDR). After estimating the distance of AMX from the three LM-CLMs, our algorithm ascertains the geolocation (AMLA, AMLO) of AMX. In step 5, the algorithm considers the area (Zonal_Region) surrounding (AMLA, AMLO), called the initial zone. After identifying the initial zone, it creates AvgLowestNodetoZoneDelay (a dataset of distance-to-delay measurements based on the average lowest delay between a particular node and the zone with AM as the given latitude and longitude) for each selected LM-CLMi on the fly. In the next step, each LM-CLMi individually measures the delay to AMX. Using AvgLowestNodetoZoneDelayi, each LM-CLMi estimates the distance to AMX. After estimating the distance of AMX from the three LM-CLMs, the algorithm ascertains the new geolocation (AMLA1, AMLO1) of AMX. In step 8, it finds the sets of zonal landmarks (ZLMs) in the zonal region (initial value ±4º) around (AMLA1, AMLO1), which we call the final zone.
It’s important to find landmarks that are diverse with respect to each other as well as to (AMLA1, AMLO1). Once the final zone is identified, the algorithm creates AvgLowestZonalDelay on the fly by considering the prerecorded minimum delays from each LM-ZLMi to LM-ZLMj (∀ i, j: i ≠ j, 1 ≤ i ≤ n, 1 ≤ j ≤ n, each LM-ZLM ∈ Final_Zone, where n is the total number of landmarks in the final zone). This dataset provides the final zone’s DDR. In the next step, each LM-ZLMi measures the delay to AMX.
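To make the distance-to-delay ratio (DDR) idea concrete, the following minimal sketch averages distance/delay over prerecorded landmark pairs and scales a newly measured delay by that ratio. The function names, the sample values, and the use of a plain arithmetic mean are assumptions made for illustration; the article doesn't specify how the datasets are stored or averaged.

```python
from statistics import mean


def distance_to_delay_ratio(samples):
    """Average distance-to-delay ratio (DDR) over prerecorded landmark pairs.

    samples: iterable of (distance_miles, lowest_delay_ms) tuples.
    Pairs with a non-positive delay are ignored.
    """
    ratios = [distance / delay for distance, delay in samples if delay > 0]
    if not ratios:
        raise ValueError("need at least one sample with a positive delay")
    return mean(ratios)


def estimate_distance(delay_ms, ddr):
    """Estimate the distance to a probed machine from one measured delay."""
    return delay_ms * ddr


# Hypothetical landmark-to-landmark measurements (miles, milliseconds).
samples = [(1000.0, 20.0), (500.0, 10.0), (1500.0, 30.0)]
ddr = distance_to_delay_ratio(samples)        # 50 miles per millisecond
distance = estimate_distance(12.0, ddr)       # 600 miles for a 12 ms delay
```

Recomputing the ratio from landmarks inside a smaller zone, as the later steps do, only changes which samples are passed in.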


ALGORITHM STEPS

Here, we describe the steps in our algorithm. We consider two cases: the general case, in which more than three landmarks are required to ascertain the geolocation; and the best case, in which only three landmarks are required to ascertain the geolocation.

Total number of landmarks (LMs) = N

1. Select any three LMs (LM-CLM1, LM-CLM2, LM-CLM3) from the CLM sets (Figure A).
2. Calculate AverageLDelay at each LM-CLMi to attacker machine AMX.
3. Estimate distance (CLM-Disti) from each LM-CLMi to AMX based on AvgLDelay.
4. Ascertain the location of AMX using trilateration as (AMLA, AMLO).

We used the great circle and aviation formulas (see the reference at the end of this sidebar) to perform our calculations. We know the lengths of the sides of triangle ABC (Figure A1), which are the distances between the known landmarks. Using the lengths (AB, BC, CA), angles, and the estimated lengths of AD, BD, and CD, we apply triangulation to get the coordinates of point D (AMLA, AMLO).

5. Find all LMs in ±4º (Zonal_Region) of (AMLA, AMLO). We refer to this as the Initial_Zone.

Because we know the geolocations of all landmarks, our algorithm finds the landmarks in the Initial_Zone (Figure A2).

6. Create AvgLowestNodeToZoneDelay.
7. Estimate distance (CLM-Disti) again from each LM-CLMi to AMX based on AvgLowestNodeToZoneDelay.
8. Ascertain AMX location using triangulation as (AMLA1, AMLO1).

The algorithm calculates the new geolocation of AMX based on the current zonal DDR (Figure A3). D1 is the new geolocation of AM (AMLA1, AMLO1).

9. Find all LMs in the Zonal_Region of (AMLA1, AMLO1). Call this Final_Zone. If Total_LMs in Zonal_Region < 3, then exit with result as (AMLA1, AMLO1).
10. Select any three LMs (LM-ZLM1, LM-ZLM2, LM-ZLM3) from Final_Zone based on the top values of Diversej = ABS(LM-ZLMj^LA – AMLA1, LM-ZLMj^LO – AMLO1) and ABS(LM-ZLMj^LA – LM-ZLMi^LA, LM-ZLMj^LO – LM-ZLMi^LO), where the superscripts LA and LO denote a landmark’s latitude and longitude.

Figure A4 shows the zonal landmarks (ZLMs) of the Final_Zone. These are the LMs in the Zonal_Region of (AMLA1, AMLO1). Because we only need three LMs in each iteration, our algorithm computes the diversity parameter for all the LMs in the zone and, based on this parameter, selects the three LMs with the highest values (Figure A5).

11. Create AvgLowestZonalDelay.
12. Calculate AverageOfLowest delay at each LM-ZLMi to AMX and estimate distance from each LM-ZLMi to AMX based on Dataset_AvgLowestZonalDelay.
13. Ascertain AMX location using triangulation as (AMLA2, AMLO2) (Figure A6).

After step 13, there are two locations of AMX: (AMLA1, AMLO1) and (AMLA2, AMLO2).

14. Set Zonal_Region = Zonal_Region – Closing_Factor.
15. Compare (AMLA1, AMLO1) and (AMLA2, AMLO2). If the result = satisfactory or Zonal_Region = 0º, then exit with result as (AMLA2, AMLO2).

If the result is satisfactory in the first iteration, the algorithm terminates. Thus, steps 9 through 15 validate the results. This is the best case because only three LMs can identify the AMX location accurately in just one iteration. If the result is unsatisfactory, the algorithm iterates steps 9 through 16, as follows:

16. Goto step 9 with (AMLA1, AMLO1) = (AMLA2, AMLO2).

Reference
1. E. Williams, Aviation Formulary V1.46, http://williams.best.vwh.net/avform.htm.
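The refinement loop in steps 9 through 16 can be sketched as follows. This is only an illustration under stated assumptions: the zone is treated as a simple latitude/longitude box (longitude wraparound is ignored), the error distance uses the haversine great-circle formula, "satisfactory" is taken to mean an error distance under 10 miles as stated in the main text, the Closing_Factor default of 1º is made up, and `locate` is a hypothetical caller-supplied function standing in for steps 10 through 13.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def great_circle_miles(p, q):
    """Haversine distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def landmarks_in_zone(landmarks, center, zonal_region):
    """Landmarks within +/- zonal_region degrees of center (box test only)."""
    return [lm for lm in landmarks
            if abs(lm[0] - center[0]) <= zonal_region
            and abs(lm[1] - center[1]) <= zonal_region]


def refine(estimate, landmarks, locate, zonal_region=4.0,
           closing_factor=1.0, satisfactory_miles=10.0):
    """Iterate steps 9-16, starting from the estimate (AMLA1, AMLO1).

    locate(zone_landmarks, estimate) is a hypothetical stand-in for
    steps 10-13 and must return the next estimate as (lat, lon).
    """
    if closing_factor <= 0:
        raise ValueError("closing_factor must be positive")
    while True:
        zone = landmarks_in_zone(landmarks, estimate, zonal_region)  # step 9
        if len(zone) < 3:
            return estimate
        new_estimate = locate(zone, estimate)                # steps 10-13
        zonal_region -= closing_factor                       # step 14
        error = great_circle_miles(estimate, new_estimate)   # step 15
        if error < satisfactory_miles or zonal_region <= 0:
            return new_estimate
        estimate = new_estimate                              # step 16


def centroid(zone, _estimate):
    """Toy locate function used only to exercise the loop."""
    return (sum(lm[0] for lm in zone) / len(zone),
            sum(lm[1] for lm in zone) / len(zone))


landmarks = [(39.0, -94.0), (40.0, -95.0), (38.0, -93.0), (45.0, -80.0)]
result = refine((39.5, -94.5), landmarks, centroid)
```

Because the zone shrinks by a positive amount on every pass, the loop always terminates, either on a satisfactory error distance, on an exhausted zone, or when fewer than three landmarks remain.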


FIGURE A. Landmark selection and attacker location estimation: (1) calculate lengths of the sides of triangle ABC; (2) find the landmarks in the Initial_Zone; (3) calculate the new positions of the attacker machine AMX; (4) determine zonal landmarks (ZLMs) of the Final_Zone; (5) select the three landmarks (LMs) with the highest values; and (6) ascertain the attacker machine location using triangulation.
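The position fix from three landmarks and three estimated distances can be illustrated with a flat-plane simplification. The sidebar states that the authors used great circle and aviation formulas; the sketch below instead assumes the landmarks have already been projected to local planar coordinates in miles, and it solves the two linear equations obtained by subtracting one circle equation from the other two. The function name and sample coordinates are hypothetical.

```python
import math


def trilaterate(a, b, c, ra, rb, rc):
    """Planar trilateration (a simplification of the spherical case).

    a, b, c: landmark positions as (x, y) in miles.
    ra, rb, rc: estimated distances from each landmark to the target.
    Returns the (x, y) point consistent with all three distances.
    """
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    # Subtracting circle 1 from circles 2 and 3 removes the squared unknowns.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = ra ** 2 - rb ** 2 + x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2
    b2 = ra ** 2 - rc ** 2 + x3 ** 2 - x1 ** 2 + y3 ** 2 - y1 ** 2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("landmarks are collinear; no unique solution")
    # Cramer's rule for the 2x2 linear system.
    x = (b1 * a22 - b2 * a12) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Hypothetical landmarks A, B, C and a target D at (3, 4).
A, B, C = (0.0, 0.0), (10.0, 0.0), (0.0, 10.0)
D = trilaterate(A, B, C, 5.0, math.sqrt(65.0), math.sqrt(45.0))
```

With noisy delay-based distances the three circles rarely meet in one point, which is one reason the algorithm repeats the fix with closer, more diverse landmarks.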


Using AvgLowestZonalDelayi, each LM-ZLMi estimates the distance to AMX. After estimating the distance of AMX from the three LM-ZLMs, the algorithm ascertains the new geolocation (AMLA2, AMLO2) of AMX. It then compares the two geolocations (AMLA2, AMLO2) and (AMLA1, AMLO1) for error distance.
If the error distance is less than 10 miles, we consider the result satisfactory, and the algorithm terminates with (AMLA2, AMLO2) as the final geolocation of AMX. It can also terminate when the zonal region reaches zero or the total number of landmarks in the Zonal_Region is less than 3. In other cases, the algorithm continues to iterate until it reaches the desired location accuracy.

Our current work will provide us with a platform for dealing with the security of the global cloud structure (linking all customers who rent cloud services). At present, banks are reluctant to use cloud services because they don’t know the whereabouts of datacenters. Our system will identify a datacenter’s geographical location and dynamically manage the firewall protecting it. This approach will largely relieve cloud service providers from the responsibility of securing their datacenters.

Acknowledgments
We thank Sushil Jajodia for his highly useful suggestions, which helped us improve the algorithm for firewall composition. US National Science Foundation grant CNS-1347958 supported this research.

References
1. S. Jajodia et al., “Flexible Support for Multiple Access Control Policies,” ACM Trans. Database Systems (TODS), vol. 26, no. 2, 2001, pp. 214–260.
2. I. Cervesato et al., “Relating Strands and Multiset Rewriting for Security Protocol Analysis,” Proc. 13th Computer Security Foundations Workshop (PCSFW 00), 2000, pp. 35–51.
3. F.J. Fabrega, J.C. Herzog, and J. Guttman, “Strand Spaces: Why Is a Security Protocol Correct?” Proc. IEEE Symp. Security and Privacy, 1998, pp. 160–171.
4. J. Loeckx and K. Sieber, The Foundations of Program Verification, John Wiley & Sons, 1987.
5. V. Kumar, Mobile Database Systems, John Wiley & Sons, 2006.
6. H. Seki, “Unfold/Fold Transformation of Stratified Programs,” Theoretical Computer Science, vol. 86, no. 1, 1991, pp. 107–139.
7. Cisco ISO Lock and Key Security, white paper, Cisco Systems, 1996.
8. Y. Bartal et al., “Firmato: A Novel Firewall Management Toolkit,” Proc. IEEE Symp. Security and Privacy, 1999, pp. 17–31.
9. A. Mayer, A. Wool, and E. Ziskind, “Fang: A Firewall Analysis Engine,” Proc. IEEE Symp. Security and Privacy, 2000, pp. 177–187.
10. S. Ioannidis et al., “Implementing a Distributed Firewall,” Proc. ACM Conf. Computer and Comm. Security, 2000, pp. 190–199.
11. E. Goren and O. Duskin, “Mobile Firewall,” internal report, Check Point Software Technologies, Hebrew Univ.
12. M. Gondree and Z.N.J. Peterson, “Geolocation of Data in the Cloud,” Proc. 3rd ACM Conf. Data and Application Security and Privacy, 2013.

CHETAN JAISWAL is a PhD scholar at the University of Missouri, Kansas City. His research interests include cloud computing; mobile, wireless sensor networks, and cloud security; and cloud-based database transaction systems. He is also passionate about programming, learning new concepts, and teaching. Contact him at [email protected].

MAHESH NATH is a PhD scholar at the University of Missouri, Kansas City. His research interests include network and information security and privacy, with an emphasis in next-generation firewall frameworks. Contact him at [email protected].

VIJAY KUMAR is the Curator’s Professor in the computer science department at the University of Missouri, Kansas City. His research interests include information security, wireless and mobile computing, and database systems, with particular emphasis in cybersecurity and wireless data dissemination. Kumar has a PhD in computer science from Southampton University, England. Contact him at kumarv@umkc.edu.

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.

SECURE BIG DATA IN THE CLOUD

Multilabels-Based Scalable Access Control for Big Data Applications

Hongsong Chen, University of Science and Technology Beijing
Bharat Bhargava, Purdue University
Fu Zhongchuan, Harbin Institute of Technology

A multilabels-based access control model uses different labels to provide scalable granularity access protection to big data applications.

Big data refers to datasets that are too large for typical database software tools to capture, store, manage, and analyze.1 Forrester Research defines big data as “a set of skills, techniques, and technologies for handling data on an extreme scale with agility and affordability.”2 In 2012, Gartner defined big data as “high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”3 Big data is used in many critical areas, such as online social networks and mining, healthcare information systems, physics, e-commerce, sensors and remote sensing, and


FIGURE 1. Classification of the Cloud Security Alliance top 10 big data security and privacy challenges.7 The four categories provide security and privacy protection at different levels. (©2013 Cloud Security Alliance. Used with permission.)
Infrastructure security: secure computations in distributed programming frameworks; security best practices for nonrelational data stores.
Data privacy: privacy-preserving data mining and analytics; cryptographically enforced data-centric security; granular access control.
Data management: secure data storage and transaction logs; granular audits; data provenance.
Integrity and reactive security: end-point validation and filtering; real-time security monitoring.

the life sciences. In 2012, the Obama administration announced the “Big Data Research and Development Initiative,” which explored how big data could be used to address important problems faced by the government.4 Big data extends traditional data processing techniques in three dimensions: volume, velocity, and variety. Every day, all kinds of data sources generate 2.5 quintillion bytes of data. Time-sensitive data processes, such as online financial transaction systems, require real-time response. Data types and structures are variable in big data, which can include unstructured and uncertain data.5
Some big data applications could improve economic benefits by correctly and efficiently applying big data and data processing techniques. For example, Forrester Research estimates that the potential value for the US healthcare information system of using big data effectively could be more than $300 billion per year.2 However, at the same time, research challenges are emerging from issues such as data heterogeneity, inconsistency, incompleteness, timeliness, security and privacy, visualization, and collaboration.6
Regardless of whether the concern is security threats to big data or security protections for big data, security and privacy issues must be solved efficiently and in a timely manner. To address these security and privacy challenges, we propose a multilabels-based access control model that provides flexible security protection to big data. Our scalable access control model uses labels to provide scalable granularity access protection to a big data application in the healthcare area.

Security and Privacy Challenges in Big Data Applications
Multiple data sources, multiple data formats, and multiple user types introduce new security challenges to access control models for big data applications. Sensitive data faces many threats, such as information leakage, unauthorized access, and tampering. Various methods are used to provide security and privacy protection of big data.
The Cloud Security Alliance (CSA) highlights the top 10 big data security and privacy challenges7: secure computation in distributed programming frameworks, security best practices for nonrelational data stores, privacy-preserving data mining and analytics, cryptographically enforced data-centric security, granular access control, secure data storage and transaction logs, granular audits, data provenance, end-point validation/filtering, and real-time security monitoring. We classify these challenges into four categories (see Figure 1): infrastructure security, data privacy, data management, and integrity and reactive security.
CSA explains each challenge from four viewpoints: use case, modeling, analysis, and implementation. The challenges differ from those of traditional data security because of the characteristics of big data applications. The top 10 challenges are interrelated and involve security and privacy problems in big data collection, transfer, storage, and processing. Granular access control to different data sources and entities is a foundational problem in these security challenges. Thus, we must reconsider the data access control model and redesign it to adapt to the variable access control requirements of big data applications.


FIGURE 2. The Hadoop Distributed File System architecture.10 HDFS has a fault-tolerant storage policy and discretionary access control (DAC). (©2013 Apache Hadoop. Used with permission.)
Elements shown: clients send metadata operations to the NameNode, which holds the metadata (name, replicas, ...; for example, /home/foo/data, 3, ...); clients read from and write to DataNodes; the NameNode issues block operations to the DataNodes; and blocks are replicated across DataNodes in Rack 1 and Rack 2.

Big data applications should be compliant with relevant rules and regulations,7 such as the Health Insurance Portability and Accountability Act (HIPAA), Sarbanes-Oxley Act (SOX), Payment Card Industry Data Security Standard (PCI-DSS), ISO/IEC 27002,8 Federal Information Security Management Act (FISMA), and the EU Data Privacy Directive.
Consider, for example, Google Flu Trends (GFT), a big data application that’s based on large numbers of Google search queries.9 GFT uses the IP address associated with each search query to determine where the query originated. Using this method, the application could use search queries to detect influenza epidemics. Because GFT handles many people’s health status and Google recognizes the importance of people’s privacy, none of the queries in the GFT database can be associated with a particular user. The GFT database doesn’t retain information about user identity, IP address, or physical location. All of the project’s data is used in accordance with Google’s Privacy Policy. The GFT project can not only predict flu trends, but it can do so without violating people’s privacy. To protect user privacy and its own business secrets, Google doesn’t make its query data public. Therefore, other researchers can’t use the data for research or for predicting flu trends.
Given this analysis, the access control model is a basic and critical security protection that can affect big data integration, integrity, confidentiality, and availability. The problem of access control in big data is deciding how to select and control data access granularity according to the application’s security requirements. See the sidebar for a discussion of related access control models.

Multilabels Structure in an HDFS Application
HIPAA requires personal health records (PHRs) to be stored securely. PHRs contain important and sensitive data for patients, doctors, health insurance companies, and healthcare institutions. As PHRs increase in volume and data type, they will aggregate to become an important source of big data. To protect patients’ privacy and control access granularity to this data, we propose a multilabels-based scalable access control framework that protects sensitive data stored in the Hadoop Distributed File System.
HDFS, which was originally built for the Apache Nutch Web search engine, is designed to store big data in a cloud computing environment. As Figure 2 shows, HDFS has a fault-tolerant storage policy and discretionary access control (DAC).10 However, DAC is insufficient for variable big data applications, especially for security- and privacy-sensitive applications. In our model, access control granularity varies with the number of multilabels and their content. The data owner can use the labels selectively, and the system administrator and designer can add, delete, or revise the labels according to the application’s security requirements.
As Figure 2 shows, an HDFS cluster consists of a NameNode, which manages the file system’s metadata, and DataNodes, which store the datasets. Clients


ACCESS CONTROL MODELS FOR BIG DATA APPLICATIONS

Hadoop is an open source framework for big data storage and processing that is widely used at Facebook, Yahoo, eBay, LinkedIn, and in other scientific and big data applications. Several researchers have proposed access control models for these applications.
Chunming Rong and his colleagues propose an access control scheme with secure sharing storage based on Hadoop.1 Their scheme includes four phases: creating the access token, distributing the access token, gaining the access token, and accessing blocks. To access a data file, a user sends a request to the Hadoop client inquiring about the file’s owner. The Hadoop client then sends a response to this user as well as the cloud storage provider according to the secure sharing storage rules over the cloud. The user downloads the re-encrypted token file and decrypts the access token data to obtain the access token. The user can access the data stored in the Hadoop data nodes using the decryption metadata information. This access control scheme depends on secure sharing storage and metadata encryption/decryption. Because the metadata is encrypted, servicing a huge number of cloud users will affect the efficiency of the metadata encryption/decryption and key management.
Patricia Ortiz and Oscar Lázaro present a multidomain access control method that combines Extensible Access Control Markup Language (XACML) role-based access control models with Simple Protocol and RDF Query Language (SPARQL) query rewriting capabilities.2 They describe a mobile application scenario in which a user from domain A wants to access a resource from domain B. In their access control architecture, each domain contains a domain access server (DAS). The DAS serves the access control systems. Because the architecture includes many domains, layers, and access control modules, it might be difficult to adapt to big data applications.
Kan Yang and his colleagues propose enabling access control using dynamic policy updating for big data in the cloud. They developed an outsourced policy-updating model based on an adapted ciphertext-policy attribute-based encryption (CP-ABE) method.3 Their cloud storage system has multiple authorities. The system model consists of four entities: authorities (AA), a cloud server (server), data owners (owners), and data consumers (users). The security scheme includes five phases: system initialization, key generation, data encryption, data decryption, and policy updating. The authors introduce two types of access structures that are used in constructing attribute-based encryption (ABE) schemes: the linear secret-sharing scheme (LSSS) structure and the access tree structure. Although they propose a method for updating the policy, the overhead for this updating should be precisely evaluated when the data volume is huge.
Online social networks (OSNs) have become important sources of big data. In January 2013, Facebook released Graph Search,4 which lets users control their access policies using their relationships in traditional security models. Jun Pang and Yang Zhang claim that users can exploit public information to adjust access permissions in OSNs.4 They use a novel OSN model that includes both a user graph and a public information graph. They developed a hybrid logic that can express fine-grained access control policies based on user and public information. Nevertheless, allowing users to trust and accept public information remains an important problem.

References
1. C. Rong, Z. Quan, and A. Chakravorty, “On Access Control Schemes for Hadoop Data Storage,” Proc. IEEE Int’l Conf. Cloud Computing and Big Data (CloudCom-Asia 13), 2013, pp. 641–645.
2. P. Ortiz et al., “Enhanced Multi-domain Access Control for Secure Mobile Collaboration through Linked Data Cloud in Manufacturing,” Proc. IEEE Symp. and Workshop World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2013, pp. 1–9.
3. K. Yang et al., “Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud,” Proc. Infocom, 2014, pp. 2013–2021.
4. J. Pang and Y. Zhang, “A New Access Control Scheme for Facebook-style Social Networks,” arXiv preprint arXiv:1304.2504, 2013.


[Figure 3 fields: Metadata | Data type | Security degree | Lifetime | Replication number | Access policy | Hash value]

FIGURE 3. Structure of multilabels. Multilabels can include Hadoop original metadata, data type, security level, lifetime, number of replications, access policy, and hash value.
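The multilabel structure in Figure 3 can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the authors’ implementation (which the article says would build on the Hadoop API, QueryIO interface, and Java API). The class name `MultiLabel`, the helper names, the integer encoding of security degrees, and the JSON canonicalization used before hashing are all hypothetical. The sketch uses SHA-256, one of the two hash algorithms the article mentions, and a simple “no read up” comparison in the style of BLP.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class MultiLabel:
    """Hypothetical in-memory form of the Figure 3 labels (hash value excluded)."""
    hadoop_metadata: dict            # original Hadoop metadata (must be JSON-serializable)
    data_type: str                   # e.g. "PDF", "XML", "image"
    security_degree: int             # assumption: higher number = more sensitive
    lifetime: Optional[timedelta]    # None stands for "permanent"
    replication_number: int
    access_policy: str               # e.g. "DAC", "BLP", "Biba", "RBAC", "ABAC"

    def digest(self) -> str:
        """SHA-256 over a canonical JSON form of the labels (integrity check)."""
        fields = asdict(self)
        fields["lifetime"] = (
            None if self.lifetime is None else self.lifetime.total_seconds()
        )
        canonical = json.dumps(fields, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()


def verify(label: MultiLabel, stored_hash: str) -> bool:
    """True if the labels still match the hash stored alongside them."""
    return hmac.compare_digest(label.digest(), stored_hash)


def is_expired(label: MultiLabel, created_at: datetime, now: datetime) -> bool:
    """A permanent label never expires; otherwise compare against the lifetime."""
    return label.lifetime is not None and now >= created_at + label.lifetime


def may_read(privilege_level: int, label: MultiLabel) -> bool:
    """BLP-style 'no read up': the reader's level must dominate the data's level."""
    return privilege_level >= label.security_degree
```

In this sketch the hash value is computed from the other labels rather than stored inside them, so any change to a label (for example, lowering the security degree) makes `verify` fail.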

connect to the NameNode to access file metadata and execute actual I/O operations on the DataNodes. Because Hadoop is open source, we can extend the metadata in the NameNodes to store labels. QueryIO, a Hadoop-based big data query and analytics tool, provides manual and automated data tagging functions that let users define properties for files when the data is written to HDFS (see http://queryio.com/product/big-data-analysis.html). Data owners can therefore manage their big data easily. QueryIO provides data tagging and metadata extension services to structured and unstructured big data. It enables users to define additional metadata (data tags) to extend the metadata layer. Both HDFS and QueryIO are implemented in Java, so we can implement the multilabels access control framework using the Hadoop API, QueryIO interface, and Java API.

The PHR data storage system includes two main types of operations: write and read. When data owners want to write data to HDFS, they create multilabels and write the data into the HDFS metadata associated with the PHR data. These multilabels adjust to the PHR’s security needs, and can include Hadoop original metadata, data type, security level, lifetime, number of replications, access policy, and hash value. Figure 3 shows the multilabel data structure.

In Figure 3, metadata represents the original Hadoop metadata; data type refers to the basic data type, such as PDF, Word, OpenOffice, XML, HTML, or image; security degree is the security and protection level; lifetime refers to how long the data exists in HDFS; replication number is related to data dependability; and access policy refers to the access control rules for the data. Access control rules can be DAC, Bell and LaPadula (BLP), Biba, RBAC, or attribute-based access control (ABAC). DAC is HDFS’s default access control policy. BLP and Biba are mandatory access control (MAC) policies. BLP policy ensures data confidentiality and is characterized by the phrase, “no read up, no write down”; whereas Biba policy ensures data integrity, and is characterized by the phrase, “no write up, no read down.” RBAC is widely used in organizations and enterprises.

The access policy can be unstructured data, such as a tree-based access policy. In our multilabel-based access control model, every label can be used to control data access granularity. Security degree and access policy can vary with data type. For example, we can assign an image from a health examination a high security degree, while setting hospital and doctor introductory information to a low security degree. We can set the lifetime label to one hour, one day, one month, or “permanent.” Security and cost are related to the data lifetime. If the lifetime has expired, the data is deleted from the HDFS. This model is similar to temporal attribute-based access control. We use the hash value to protect the multilabels’ integrity.

After the data owner sets the multilabels, the security access control scheme will protect them by preventing attackers from tampering with them or the data. The multilabel access control concept is similar to the active bundle concept.11 However, the multilabels-based access control model differs from active bundles, because the model doesn’t include virtual machines. Therefore, this model will provide greater security, while extending the active bundles approach’s metadata through multilabels, and applying it in the Hadoop big data application. For big data applications that don’t deal with PHR storage, the multilabel number and content can differ. For example, a security administrator can add a risk label to set the security risk threshold. When a user tries to access the data, the security agent evaluates the access risk using a security evaluation algorithm. If the agent determines that the risk value is less than the risk threshold, it permits the access; otherwise, it denies access. In this way, the multilabels access control model is scalable and configurable to different big data applications.

Personal Health Record Big Data Storage Application

Because the PHR data storage application requires privacy protection, we use it to demonstrate the multilabels-based scalable access control model. Depending on the security requirements, we can tag PHRs with different data type labels (such as medical examination image, medical record PDF document, or patient information XML) and set the security degree label to high secret, middle secret, low secret, confidential, or unclassified. This concept is similar to service-level agreements (SLAs), which can be used in cloud computing and big data. Because healthcare information uses different data types, every data type has a different security requirement and can be related to the security degree. For example, we could set an XML file that includes a patient’s social security number (SSN) to high secret; PDF files containing medical examination, symptom, and prescription information to secret; an HTML file containing an introduction to the doctor to unclassified; and the Word file with the patient’s medical reports to middle secret. The lifetime should be set for a different effective period because of the different data types and security requirements. For example, we can set a medical examination report’s effective period to six months, a patient’s name to five years, a doctor’s name to 10 years, and an SSN to 20 years. The number of replications can range from 1 to 5. As this number increases, data dependability improves, but the user’s cost also increases, because the user will need additional storage space.

We compute the PHR hash value using the SHA-1 hash algorithm.

Every entity in a healthcare information system can be granted a security privilege level (such as high, medium, common, or low), depending on the security degree label in the HDFS metadata. The entity’s security privilege level is relative to its role in the healthcare information system. For example, a government health institute can be set to high privilege; patients and doctors can be set to medium privilege; a health insurance company can be set to low privilege; and a pharmacy can be set to common privilege. In a MAC model, different privilege entities have different access permissions to the same data.

When creating PHR data, the data owner can select and set labels, such as data type, security degree, lifetime, access control policy, and number of replications. The security agent then computes the hash value using the SHA-256 algorithm, and all the labels are attached in the PHR data’s HDFS metadata. Thus, the multilabels access control approach combines active bundle, RBAC, ABAC, DAC, and MAC, providing flexible and scalable access control policies to different data and entities. Before the PHR data is written into the HDFS system, the PHR data owner should be authenticated by the Kerberos protocol.

When an entity wants to read or write data in HDFS, it should be authenticated by the Kerberos protocol. The entity is then granted a security privilege level according to its role in the healthcare information system. The Hadoop client will check the multilabels in the HDFS metadata that it accesses. If the entity’s attributes and the PHR multilabels meet access control rules, the entity is granted access to the PHR data; otherwise, it’s denied access. QueryIO is especially well suited to enabling users to process unstructured big data. It gives structure and supports querying of big data applications. QueryIO makes the metadata extension in HDFS feasible. The output writer and input reader class should be reconstructed to realize multilabels’ writing and reading function. QueryIO can help to extract structured information from unstructured data and construct a data type label.

In the near future, we will use a software engineering method to realize the multilabels scalable access control model for the PHR healthcare system using the Hadoop open source software and QueryIO software tool.

Acknowledgments

The Beijing Natural Science Foundation (no. 4142034), Beijing Science and Technique Plan Project (no. D141100003414002), China Scholarship Council Foundation, Beijing Higher Education Young Elite Teacher Project (YETP0380), Fundamental Research Funds for the Central Universities (FRF-TP-14-042A2), and Chinese National 863 research project (no. 2013AA01A209) supported this work.

References
1. J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity, report, McKinsey Global Inst., 2011.
2. J. Kobielus et al., Enterprise Hadoop: The Emerging Core of Big Data, technology report, Forrester Research, Oct. 2011.
3. A. Beyer and D. Laney, The Importance of “Big Data”: A Definition, report, Gartner, 2012.
4. T. Kalil, “Big Data is a Big Deal,” blog, 29 Mar. 2012; www.whitehouse.gov/blog/2012/03/29/big-data-big-deal.
5. Top Tips for Securing Big Data Environments, e-book, IBM, 2012.
6. H.V. Jagadish et al., “Big Data and Its Technical Challenges,” Comm. ACM, vol. 57, no. 7, 2014, pp. 86–94.
7. Expanded Top Ten Big Data Security and Privacy Challenges, tech. report, Big Data Working Group, CSA Research, Apr. 2013; https://cloudsecurityalliance.org/download/expanded-top-ten-big-data-security-and-privacy-challenges.
8. ISO/IEC 27002, Information Technology - Security Techniques - Code of Practice for Information Security Management, Int’l Organization for Standardization (ISO) and Int’l Electrotechnical Commission (IEC), 2013.
9. J. Ginsberg et al., “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature, vol. 457, no. 7232, 2008, pp. 1012–1014.
10. D. Borthakur, HDFS Architecture Guide, Hadoop, 2013; http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
11. R.M. Salih, L. Lilien, and L.B. Othmane, “Protecting Patients’ Electronic Health Records Using Enhanced Active Bundles,” Proc. 6th Int’l Conf. Pervasive Computing Technologies for Healthcare, 2012, pp. 1–4.

HONGSONG CHEN is an associate professor of computer science at the University of Science and Technology Beijing (USTB) and a visiting scholar in the Department of Computer Science at Purdue University. His research interests include cloud computing, cloud security, big data application, wireless network security, and trust computing. Hongsong received a PhD in computer science from the Harbin Institute of Technology. He is a member of the China Computer Federation. Contact him at chenhs@ustb.edu.cn or [email protected].

BHARAT BHARGAVA is a professor of computer science at Purdue University. His research interests include mobile wireless networks, secure routing and dealing with malicious hosts, providing security in service-oriented architectures (SOA), adapting to attacks, and experimental studies. Bhargava received a PhD in electrical engineering from Purdue University. He is a fellow of IEEE Computer Society. Contact him at [email protected].

FU ZHONGCHUAN is an associate professor of computer science and technology at the Harbin Institute of Technology. His research interests include trust computing, information security, multicore computing, cloud computing, and fault-tolerant computing. Zhongchuan received a PhD in computer science and technology from the Harbin Institute of Technology. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


WHAT’S TRENDING?

Bringing Big Data Systems to the Cloud

Amandeep Khurana, Cloudera

EDITOR: ELI COLLINS, Cloudera, [email protected]

Over the last decade, two technology trends have been changing how enterprises think about infrastructure, data, and analytics: big data and cloud computing. Here, we’ll look at the intersection of these two trends.

Big Data

Big data represents a new paradigm of data management (collection, processing, querying, data types, and scale) that isn’t well served by traditional data management systems. Two distinct paradigms are emerging in the big data space: working with data at rest, and working with streams of data in flight. We’ll focus on data at rest for now.

The big data ecosystem has evolved quickly. Most big data systems today incorporate Hadoop-based architectures (http://hadoop.apache.org), which are quickly becoming the center of the enterprise technology stack for data management. These architectures usually consist of several components: Hadoop Distributed File System (HDFS), MapReduce, YARN, and HBase, to name a few. For the purpose of this article, we’ll collectively refer to these as Hadoop. Terms like data lake and data hub refer to HDFS being the central storage system due to the scale and economics it has to offer, enabling storage of data in full fidelity for long periods of time.

Cloud

Cloud computing refers to a paradigm for infrastructure, platforms, and software consumption in which users consume from a shared pool of resources that someone else manages. Users pay for what they use. There are public cloud environments, such as Amazon Web Services (AWS), Google, and Microsoft Azure, as well as software offerings, such as OpenStack and VMware, that you can use to build your own private cloud. We’ll limit the discussion to public cloud for now.

We can divide cloud computing technologies into three levels: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). These service levels aren’t new, but technology has evolved to make the consumption patterns look different from how they looked in the past. AWS extended the paradigm of end users interacting with and consuming a service programmatically without any human involvement in the early 2000s.1 Other vendors, such as Microsoft, Google, and IBM, have since forayed into this business as well.

Intersection of the Two Worlds

The worlds of big data and cloud computing (mostly IaaS) share some characteristics that make the intersection intuitive in some ways and counterintuitive in others.

Motivation and Considerations

There are several motivations for using cloud environments for big data deployments as well as some considerations.

72 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


Cost. Total cost of ownership of infrastructure includes hardware, power, racks, hosting space, and the people managing the infrastructure. Public cloud benefits from economies of scale and vendors often pass these benefits to the customers, who can simply consume the infrastructure without worrying about the operational costs.

Ease of use. Cloud computing is all about accessing resources programmatically and automating systems as much as possible. That’s not possible with bare-metal hardware, and ease of use is a big factor when considering deploying in cloud environments.

Elasticity. Big data workloads are often spiky in nature. Users onboard new data sources and need to perform ad hoc processing to explore the datasets. This requires the ability to scale up the environment and perhaps scale down later on. With bare-metal infrastructure, you’ll have to provision for that burst requirement or you’ll have to wait for the IT team to provision new hardware. In cloud environments, you can scale up and down programmatically in a matter of minutes.

Operations. In public cloud environments, operations is the cloud provider’s responsibility. Users don’t have to worry about operating the infrastructure. If the system fails, they can recover by provisioning more resources.

Reliability. Some might argue that public cloud infrastructures are less reliable than bare-metal ones because virtual machines have a higher chance of going down than physical servers. The flip side of that is that you can provision a new virtual machine much faster than you can procure and provision a new server. With that, reliability comes down to how you architect your system for fault tolerance.

Flexibility. Clouds offer different kinds of infrastructure configurations with minimal customization options. With bare-metal infrastructure, you can customize at the time of procurement. Having said that, most enterprises have standard infrastructure configurations that they use and customization is uncommon.

Performance. Virtualization incurs a performance hit, especially for I/O intensive workloads. This has rapidly decreased in recent times. For certain workloads, this hit might not be acceptable. For others where a slight variation and possibly lower performance is acceptable, cloud environments might be sufficient.

Security and compliance. Security and compliance are important considerations for enterprise deployments. We can probably write several dedicated articles to cover all aspects. The key is that both cloud environments and Hadoop have been rapidly developing and have come a long way to cater to the various requirements.

Location. Often, users want to keep their data close to where it’s generated. This could be because of the volume of data, where it’s accessed from, or restrictions on where it can be moved. For example, certain kinds of data generated in China can’t be transferred outside the country. Public cloud environments offer the flexibility of having deployments in multiple locations without needing your own datacenters.

Intersection in Practice

Let’s look at how the intersection of the two paradigms exists today and where future opportunities exist.

Consumption paradigms. Two kinds of consumption paradigms exist for big data systems in public cloud environments.

In a hosted system, the vendor is hosting the infrastructure on which big data software is deployed. Examples of this are enterprises deploying their own software in AWS, Azure, and Google.

In a managed and hosted system, the vendor hosts, operates, and manages the big data deployment and infrastructure for you. This could entail anything from provisioning to debugging the environment when things fail. Examples include Amazon Elastic MapReduce, Qubole Data Service, and Altiscale.

Architectural considerations. The key architectural consideration in this intersection is the choice of persistent storage.


Public cloud environments offer storage substrates such as AWS Simple Storage Service (S3) and Azure blob store. These are stores where you can store binary large objects (blobs) of data. They have a simplistic API: get, put, and delete, for the most part. Public cloud environments are built on the premise that these stores are where data is stored for high durability and reliability. Virtual machines come with local storage but they’re ephemeral and live only for the lifespan of the virtual machine. Other storage options such as Elastic Block Store (EBS) and database services such as DynamoDB and Redshift treat S3 as their backup store, which enables them to guarantee durability. The public cloud world revolves around the blob stores.

For Hadoop, the world revolves around HDFS. All the processing and serving frameworks are tightly integrated with HDFS and leverage the semantics that HDFS has to offer. For example, MapReduce leverages data locality information to optimize task scheduling for minimal network usage. HBase has made architectural decisions that enable it to leverage HDFS replication for fault tolerance as well as leverage the I/O characteristics of HDFS.

This fundamental difference in storage approach defines how the two worlds integrate today and the opportunities going forward. Storing in HDFS allows for optimizations that can leverage data locality information for things such as task scheduling. Storing in a substrate like S3 allows for storage to be independent of compute, offering flexibility in resource management by using virtual machines to do the computations when the workload demands. This difference in approach leads to different deployment options.

Deployment paradigms. There are two kinds of big data deployment paradigms in the public cloud.

The first includes clusters that use a blob store as their primary storage substrate. These can be transient in nature, where clusters are spun up for executing a workflow and die once the workflow has completed. You could also have clusters that stay on beyond a workflow and you can use them for running more workflows later. The key here is that the workflow’s source and destination is a blob store such as S3 or Azure blob storage. In this deployment paradigm, data locality is traded for flexibility in resource management. Virtual machines can be thought of as execution containers for tasks to run in.

The second deployment paradigm includes clusters that use HDFS as their primary storage substrate. These are usually persistent clusters where data is stored in HDFS. Blob stores can be used for periodic backups or as staging areas from which datasets are brought into HDFS for further usage and long-term storage. The workloads run against data in HDFS and not the blob store. Clusters are usually long running and virtual machines are considered to be persistent entities that store data as well as perform computation. They are the equivalent of servers instead of just containers where computation is performed.

Use Cases

The different deployment paradigms are suitable for different kinds of workloads.

Ad Hoc Batch Workloads

In an ad hoc batch workload, you have datasets stored somewhere or you’ve brought in a new dataset and want to perform some processing on it to cleanse, enrich, or transform it, or perhaps perform some aggregations across the dataset to explore it. Tools of choice for expressing the processing include MapReduce, Hive, Pig, Crunch, Cascading, and Spark. These frameworks can read and write to the blob store or can work with datasets persisting in HDFS.

This type of workload can be modeled in a transient cluster or a persistent cluster with storage being the blob store or HDFS, which makes it a good match for public cloud environments.

Batch Workloads with SLAs

Batch workloads with strict SLAs are usually extract, transform, load (ETL) jobs or report generation that would be triggered based on schedule or data availability and have an SLA attached to them. This kind of workload is automated and needs a higher predictability in performance and execution time.

This type of workload can also be modeled in a transient or a persistent cluster with storage being the blob store or HDFS, which makes it a good match for public cloud environments.
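The pairing of workloads with deployment paradigms can be summarized in a few lines of Python. This is a minimal sketch rather than anything prescribed by the article: the names `Workload`, `Cluster`, `Storage`, `deployment_options`, and `needs_dedicated_cluster` are hypothetical, and the rules simply restate the article’s guidance, including the interactive cases covered in the remainder of this section (persistent, HDFS-backed clusters, and, when an SLA applies, often no colocation with other workloads).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Cluster(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"


class Storage(Enum):
    BLOB_STORE = "blob store"
    HDFS = "HDFS"


@dataclass(frozen=True)
class Workload:
    interactive: bool   # False means batch
    has_sla: bool


def deployment_options(w: Workload) -> List[Tuple[Cluster, Storage]]:
    """Return (cluster lifetime, primary storage) pairs suited to a workload."""
    if w.interactive:
        # Interactive queries rely on data locality, and cluster spin-up time
        # would outweigh query time, so only persistent HDFS clusters qualify.
        return [(Cluster.PERSISTENT, Storage.HDFS)]
    # Batch work can use a blob-store cluster (transient or kept running)
    # or a persistent cluster with HDFS as primary storage.
    return [
        (Cluster.TRANSIENT, Storage.BLOB_STORE),
        (Cluster.PERSISTENT, Storage.BLOB_STORE),
        (Cluster.PERSISTENT, Storage.HDFS),
    ]


def needs_dedicated_cluster(w: Workload) -> bool:
    """Interactive workloads with SLAs are often not colocated with others."""
    return w.interactive and w.has_sla
```

For example, `deployment_options(Workload(interactive=False, has_sla=True))` returns all three batch-friendly combinations, whereas any interactive workload is limited to a persistent HDFS cluster.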


Ad Hoc Interactive Workloads

Ad hoc interactive workloads consist of interactive, fast querying using tools such as Impala, Presto, and Spark to some extent. This usually involves a user querying the dataset with response times on the order of seconds.

This type of workload is better modeled with persistent clusters with HDFS as the primary storage substrate. Tools such as Impala, Presto, and Spark integrate with HDFS and leverage data locality (and high throughput from local storage) to provide fast response times during queries. Transient clusters aren’t a good idea here because the time taken to spin up a cluster will outweigh the time taken to run the query. These can be deployed in public clouds and would treat virtual machines as servers and blob stores for backups.

Interactive Workloads with SLAs

Interactive workloads with SLAs consist of using frameworks like HBase, Solr, and Impala, which are driving applications that users interact with and the response times have SLAs.

This type of workload must be deployed on persistent clusters with HDFS as the primary storage substrate and often not colocated with any other workloads. These can be deployed in public clouds and would treat virtual machines as servers and blob stores for backups.

As you can see, big data systems can use different deployment paradigms based on the workloads and access patterns they cater to. Opportunities for tighter integration will enable big data systems to leverage public cloud environments more effectively. As both public cloud and big data systems see more adoption and new usage paradigms evolve, we’ll see features and enhancements on both sides to make the intersection of the two worlds broader and more mature.

Future articles will dive deeper into some of the topics that this article touched on.

Reference
1. R.S. Huckman, G.P. Pisano, and L. Kind, “Case Study: Amazon Web Services,” Harvard Business Rev., 20 Oct. 2008; http://hbr.org/product/amazon-web-services/an/609048-PDF-ENG.

AMANDEEP KHURANA is a principal solutions architect at Cloudera. His research interests include large-scale distributed systems, storage systems, and data-oriented products. Khurana has an MS in computer science from the University of California, Santa Cruz. Contact him at [email protected].


BLUE SKIES

Application Security through Federated Clouds

Paul Watson Newcastle University

Editor: Rajiv Ranjan, Commonwealth Scientific and Industrial Research Organization, Australia

The transformation of IT through cloud computing is accelerating as a wide range of organizations are adopting this new approach for deploying a variety of applications. However, security concerns prevent many organizations from deploying certain types of applications in the cloud. They worry both about attacks on data being sent over the Internet to and from the cloud, and about whether their applications and data are more vulnerable to attack in a cloud than in their own internal computing resources.

In many cases, the use of good security engineering techniques can reduce both the risks and the fears.1,2 Another good approach is to exploit robust, high-level platform-as-a-service (PaaS) components that are already deployed and managed by a cloud provider, rather than to design and build applications from the lower base offered by infrastructure as a service (IaaS). Examples include the SQL database server offered as a service on Microsoft Azure (http://azure.microsoft.com), Amazon’s Simple Storage Service (S3, http://aws.amazon.com/s3), and domain-specific high-level platforms such as Force.com (http://force.com).

However, organizations still refuse to move some applications, such as those that deal with company-sensitive and medical data, to the cloud because of the perceived risk.

Rise of the Private Cloud

The obvious solution is to deploy these security-sensitive applications on the organization’s internal computer infrastructure, which is often (rightly or wrongly) perceived to be more secure than the public cloud. However, these private clouds have some important limitations. Although scalability in a public cloud is effectively infinite for most applications because of the sheer magnitude of the resources available,3 private clouds are limited by the size of the organization’s internal IT. Further, not all the scalable, platform-level services that are offered on public clouds are available in the private cloud. For example, Amazon offers a range of scalable data storage solutions for structured and unstructured data, not all of which are available for deployment on private clouds.

To choose where to deploy an application, many system managers therefore ask, “Does my organization consider any part of this application too risky to deploy on a public cloud?” If they answer yes, they must deploy the application on a private cloud, with

76 IEEE CLOUD COMPUTING PUBLISHED BY THE IEEE COMPUTER SOCIETY 2325-6095/14/$31.00 © 2014 IEEE


its inherent restrictions. If not, the public cloud is an option.

Unfortunately, this is a wasted opportunity for many applications that could benefit from the scalability and agility of public clouds, but include some sensitive data. Consider the typical, though simplified, healthcare sensor analysis workflow shown in Figure 1. This application takes in data about a patient: a mixture of sensitive medical data that identifies the patient and heart-rate sensor data. It analyzes the sensor data and generates a summary that can then be stored with the rest of the patient data. Because there's some sensitive data in the workflow, an organization might feel that it has no choice but to store and analyze the data in a secure private cloud. This is unfortunate, because the analysis is often computationally intensive, which makes it ideal for exploiting the scalability of the public cloud, particularly if, at peak times, data from many patients is arriving for analysis. Not only healthcare applications suffer from this problem. We see equivalent security issues limiting the uptake of the cloud for applications in domains such as financial, human resources, and government.

FIGURE 1. A healthcare sensor analysis workflow. To protect sensitive data, an organization might feel compelled to perform analysis of the data on a private cloud, although it could realize significant benefit from exploiting the computational resources available in the public cloud. (Workflow stages: Read patient data (s0) → Anonymize (s1) → Analyze (s2) → Write results (s3).)

Federated Clouds

An alternative is to exploit the best features of each: public clouds' scalability and agility and private clouds' security. In the case of the healthcare workflow in Figure 1, a federated cloud (also known as hybrid cloud) approach would store the confidential medical data on a private cloud and send the sensor data (tagged with an anonymized ID) to the public cloud for analysis, with the results returned to the private cloud to be combined with the confidential data.

For a simple workflow such as that in Figure 1, it isn't too difficult to work out a way to partition the software to meet security requirements. Larger, more complex applications might have many more components, and even different security levels for different types of data. Making manual decisions on how to partition the application is error prone and potentially fraught with danger as it could result in sensitive data being stored and processed in the public cloud.

As a result, some researchers have devised methods that can automatically determine ways to partition applications over federated clouds to meet security requirements. One approach [4, 5], inspired by traditional multilevel security approaches [6], models an application as a set of communicating distributed components, and uses rules to generate a set of inequalities that represent the security requirements.

This model uses the notation in Table 1 to define the rules in Figure 2. Applying these rules to the workflow in Figure 1 gives us the resulting lattice of inequalities shown in Figure 3 (an arrow from a to b indicates that a ≥ b).

We can then substitute any known values for the variables and simplify the set of inequalities. For the running example, if we use only two security levels, 0 (low) and 1 (high), we can set the level of the patient's medical data to 1 and that of the rest to 0. Similarly, the security level of the service that reads the medical data needs to be 1, whereas the rest can be 0. Substituting these values into the inequalities and simplifying produces:

$$l(p_0) \geq 1 \;\wedge\; l(p_1) \geq 1 \;\wedge\; l(p_2) \geq 0 \;\wedge\; l(p_3) \geq 0 \;\wedge\; l(n_{0\text{-}1}) \geq 1 \;\wedge\; l(n_{1\text{-}2}) \geq 0 \;\wedge\; l(n_{2\text{-}3}) \geq 0.$$

If we have a private cloud at security level 1 and a public cloud at level 0, and we assume that the networks within clouds have security level 1, the method can automatically determine that there are four possible ways to partition the workflow over public and private clouds to fulfill these security conditions. Figure 4 shows the four valid options. A well-defined method such as this lets us build tools to automate option generation and to deploy the application.

Researchers are also exploring other approaches for modeling and analyzing the security of applications partitioned over federated clouds. Proposals include using Petri nets to model information flows [7] and optimizing the placement of partitions on clouds to meet quality-of-service (QoS) requirements [8]. It's also possible to extend the model to systems with external components, such as mobile devices and the Internet of Things [5].
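The substitution and enumeration steps for the healthcare example can be tried out in a few lines of Python. The following is only a minimal sketch, not the published tool: it hard-codes the platform and network rules for a linear four-service workflow, and the names `clouds`, `is_valid`, and `INTER_CLOUD_NET` are hypothetical. The level of links between clouds is not stated in the article, so it is assumed here to be 0.

```python
from itertools import product

# Known levels from the running example (0 = low, 1 = high).
# Services s0..s3: only s0, which reads the patient data, is level 1.
service_level = [1, 0, 0, 0]
# Data on connections d0-1, d1-2, d2-3: only the patient data (d0-1) is level 1.
data_level = [1, 0, 0]

# Cloud name -> security level of the platform it offers.
clouds = {"private": 1, "public": 0}
INTRA_CLOUD_NET = 1  # networks within a cloud (stated in the text)
INTER_CLOUD_NET = 0  # assumption: links between clouds are level 0


def is_valid(placement):
    """Check the platform and network inequalities for one placement."""
    levels = [clouds[c] for c in placement]
    # l(p_i) >= l(s_i) for every service
    if any(lp < ls for lp, ls in zip(levels, service_level)):
        return False
    for i, ld in enumerate(data_level):
        # l(p_i) >= l(d) and l(p_j) >= l(d) for the sender and the receiver
        if levels[i] < ld or levels[i + 1] < ld:
            return False
        # l(n_i-j) >= l(d) for the network carrying the data
        same_cloud = placement[i] == placement[i + 1]
        net = INTRA_CLOUD_NET if same_cloud else INTER_CLOUD_NET
        if net < ld:
            return False
    return True


# Try every assignment of the four services to the two clouds.
valid = [p for p in product(clouds, repeat=4) if is_valid(p)]
for p in valid:
    print(p)
```

Under these assumptions the first two services are pinned to the private cloud and the last two are free, which gives the four options the text describes.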


BLUE SKIES

Table 1. Lexical conventions for the rules in Figure 2.

| Notation | Meaning |
| --- | --- |
| $s_i$ | Service i |
| $p_i$ | Platform i |
| $n_{i\text{-}j}$ | Network connecting platform i to platform j |
| $d_{i.x\text{-}j.y}$ | Data sent from service i port x to service j port y |
| $l(z)$ | Security location of z |
| $c(z)$ | Clearance of z (the maximum location at which z may operate) |

For each service $s_i$ add the inequality: $l(p_i) \geq l(s_i)$
(the security level of the platform on which the service is deployed must be greater than or equal to that of the service).

For each data connection $d_{i.x\text{-}j.y}$ add: $l(p_i) \geq l(d_{i.x\text{-}j.y})$ and $l(p_j) \geq l(d_{i.x\text{-}j.y})$
(the security level of the platforms on which the services transmitting and receiving the data are deployed must be greater than or equal to that of the data) and $l(n_{i\text{-}j}) \geq l(d_{i.x\text{-}j.y})$
(the security level of the network across which the data is transmitted must be greater than or equal to that of the data).

To add data security as in Bell-LaPadula:
For each service add: $c(s_i) \geq l(s_i)$.
For each data connection add: $c(s_j) \geq l(d_{i.x\text{-}j.y})$ and $l(d_{i.x\text{-}j.y}) \geq l(s_i)$
(the Bell-LaPadula "no read up" and "no write down" rules).

FIGURE 2. Security rules to create a security lattice representing a distributed application.

FIGURE 3. The security lattice created by applying the rules of Figure 2 to the healthcare workflow of Figure 1. (Lattice nodes, top row: $l(p_0)$, $l(n_{0\text{-}1})$, $c(s_1)$, $l(p_1)$, $l(n_{1\text{-}2})$, $c(s_2)$, $l(p_2)$, $l(n_{2\text{-}3})$, $c(s_3)$, $l(p_3)$; middle row: $c(s_0)$, $l(d_{0.0\text{-}1.0})$, $l(d_{1.0\text{-}2.0})$, $l(d_{2.0\text{-}3.0})$, $l(s_3)$; bottom row: $l(s_0)$, $l(s_1)$, $l(s_2)$.)

Policy-Based Partitioning of Applications across Clouds

Security is only one of a set of criteria that might be used to make decisions about partitioning an application running over a set of clouds. Others include performance, price, and reliability. This encourages the adoption of an architecture such as that in Figure 5, in which a policy manager takes a high-level description of an application and partitions it based on a user-specified policy. It could, for example, include deadlines and price limits as well as security requirements. The policy manager could be simple and static, or built on a more sophisticated system of dynamic, managed service agreements between the application owner and the cloud providers, such as Arjuna Technologies' Agility framework (www.arjuna.com/agility).

Employing a policy manager to deploy the components of a distributed system wouldn't have been possible 10 years ago, when software deployment was largely manual and therefore static. However, with the rise of virtualization, we can structure distributed applications as a set of virtual machines (VMs) or other containers (for example, www.docker.com) that can be dynamically deployed. Figure 5 shows another viable approach: running a portable, domain-specific platform (in this case, e-Science Central, which supports scientific workflows [9]) on each cloud to enable the dynamic deployment of application components.

When multiple options for meeting the requirements exist (as with the healthcare example), there is the problem of how to choose between them. A solution is to introduce a cost model that can be applied to each solution, allowing them to be ranked. A simple pricing-based cost model, for example, could allow users to choose between all options that meet their security requirements [4]. This model requires other


inputs, such as estimates of the cost of executing services on a cloud, which is an interesting area of performance modeling research [10, 11].

FIGURE 4. Four valid partitioning options. In a federated cloud, workflows can be partitioned over private (outlined in red) and public clouds (outlined in green) to meet an application's security requirements. (Each option, numbered 1 to 4, assigns the services s0 to s3 of the Figure 1 workflow to the private or the public cloud.)

Making Dynamic Decisions

The description thus far suggests a one-way flow from application description, through the policy manager, to deployment. But today's IT is highly dynamic. Consider, for example, a smartphone app that periodically sends sensitive data to a cloud for storage and analysis. The data owner might feel less concerned when the phone is connected over his or her company's corporate Wi-Fi than when it's communicating over a coffee shop's Wi-Fi.

For this reason, the diagram in Figure 5 also shows information flowing back to the policy manager from the clouds. Before a mobile client roams to a new network, it could send information to the policy manager about that network's security (an unknown network could default to the lowest security level). The policy manager could then rerun the security analysis method described earlier. If the results show that the new network's security level doesn't satisfy the set of inequalities, the client could shut down or switch to a local-only mode of operation in which it caches data on the phone until it can reach a more secure network.

FIGURE 5. Policy-based partitioning of an application over federated clouds. In this example, a domain-specific platform (e-Science Central) runs on each cloud, enabling dynamic deployment of application components. (Diagram: the application and its security, dependability, performance, and cost requirements feed a policy manager, which deploys to e-Science Central instances running on Azure, OpenShift, and a private cloud.)

Similarly, an exception could be raised if a cloud failure occurs, triggering the policy manager to generate another deployment option for the application, assuming one exists. For the application workflow in Figure 1, for example, if the public cloud fails, the workflow can still be executed entirely on the private cloud. An optimization is not to restart a computation from the beginning, but to reuse any intermediate results that have already been computed and are still accessible [12].

Federated clouds offer a solution to the security problems preventing some applications from exploiting the cloud's benefits. The major research challenge is to find rigorous, auditable methods for dynamically partitioning applications over public and private clouds to meet security and other nonfunctional requirements. This avoids the need for manual, ad hoc methods that are prone to errors that could have serious consequences.
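The two dynamic reactions described under Making Dynamic Decisions, re-checking the network inequality when a client roams and choosing another deployment option when a cloud fails, can be illustrated with a small Python sketch. Everything here is hypothetical: the `Option` class, the cost figures, and the function names are illustrative assumptions, not part of any real policy manager.

```python
from dataclasses import dataclass

LOWEST_LEVEL = 0  # default level for networks the policy manager does not know


@dataclass
class Option:
    name: str
    cost: float  # hypothetical price estimate produced by a cost model
    available: bool = True


def network_ok(network_level, data_level):
    """The network rule from Figure 2: l(n) >= l(d)."""
    return network_level >= data_level


def on_roam(known_networks, ssid, data_level):
    """Decide what a mobile client should do when it joins network `ssid`."""
    level = known_networks.get(ssid, LOWEST_LEVEL)
    return "send" if network_ok(level, data_level) else "cache-locally"


def cheapest_available(options):
    """Pick the lowest-cost deployment option that is still usable, if any."""
    usable = [o for o in options if o.available]
    return min(usable, key=lambda o: o.cost) if usable else None


networks = {"corporate-wifi": 1}
print(on_roam(networks, "corporate-wifi", data_level=1))
print(on_roam(networks, "coffee-shop", data_level=1))

options = [Option("all-private", cost=10.0), Option("analysis-on-public", cost=4.0)]
print(cheapest_available(options).name)
options[1].available = False  # simulate a public cloud failure
print(cheapest_available(options).name)
```

The sketch keeps the security check and the cost ranking separate, so a security failure can never be traded off against a lower price.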


Acknowledgments

The Research Councils UK "Social Inclusion through the Digital Economy" project EP/G066019/1 funded this research.

References

1. R. Anderson, Security Engineering, John Wiley & Sons, 2008.
2. E.G. Amoroso, "Practical Methods for Securing the Cloud," IEEE Cloud Computing, vol. 1, no. 1, 2014, pp. 28–38.
3. M. Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing, tech. report UCB/EECS-2009-28, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, Feb 2009.
4. P. Watson, "A Multi-Level Security Model for Partitioning Workflows over Federated Clouds," J. Cloud Computing, vol. 1, no. 1, 2012, pp. 1–15.
5. P. Watson and M. Little, "Multilevel Security for Deploying Distributed Applications on Clouds, Devices and Things," to be presented at the IEEE Int'l Conf. Cloud Computing Technology and Science (CloudCom 14), 2014.
6. D.E. Bell and L.J. LaPadula, Secure Computer Systems: Mathematical Foundations, tech. report, Mitre, 1973.
7. W. Zeng et al., "A Flow Sensitive Security Model for Cloud Computing Systems," CoRR abs/1404.7760, Computing Research Repository, 2014.
8. E. Goettelmann, W. Fdhila, and C. Godart, "Partitioning and Cloud Deployment of Composite Web Services under Security Constraints," Proc. IEEE Int'l Conf. Cloud Eng. (IC2E 13), 2013, pp. 193–200.
9. P. Watson, H. Hiden, and S. Woodman, "e-Science Central for CARMEN: Science as a Service," Concurrency and Computation: Practice and Experience, vol. 22, no. 17, 2010, pp. 2369–2380.
10. J. Taheri et al., "Pareto Frontier for Job Execution and Data Transfer Time in Hybrid Clouds," Future Generation Computer Systems, vol. 37, July 2014, pp. 321–334.
11. H. Hiden, S. Woodman, and P. Watson, "A Framework for Dynamically Generating Predictive Models of Workflow Execution," Proc. 8th Workshop Workflows in Support of Large-Scale Science, 2013, pp. 77–87.
12. Z. Wen and P. Watson, "Dynamic Exception Handling for Partitioned Workflow on Federated Clouds," Proc. IEEE 5th Int'l Conf. Cloud Computing Technology and Science (CloudCom 13), vol. 1, 2013, pp. 198–205.

PAUL WATSON is a professor of computer science and director of the Digital Institute at Newcastle University, UK. He also directs the UK's Social Inclusion through the Digital Economy Hub. His research interests include scalable information management with a current focus on cloud computing. Watson received a PhD in parallel functional programming from Manchester University. He is a Chartered Engineer and a Fellow of the British Computer Society. He received the 2014 Jim Gray eScience Award for his work on Clouds for Science. Contact him at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.


CLOUD TIDBITS

Containers and Cloud: From LXC to Docker to Kubernetes

DAVID BERNSTEIN
Cloud Strategy Partners, [email protected]

WELCOME TO CLOUD TIDBITS! In each issue, I'll look at a different "tidbit" of technology that I consider unique or eye-catching, and of particular interest to the IEEE Cloud Computing readers. Today's tidbit focuses on container technology and how it's emerging as an important part of the cloud computing infrastructure.

Cloud Computing's Multiple OS Capability

Many formal definitions of cloud computing exist. The National Institute of Standards and Technology's internationally accepted definition calls for "resource pooling," where the "provider's computing resources are pooled to serve multiple consumers using a multitenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand." [1] It also calls for "rapid elasticity," where "capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand."

Most agree that the definition implies some kind of technology that provides an isolation and multitenancy layer, and where computing resources are split up and dynamically shared using an operating technique that implements the specified multitenant model. Two technologies are commonly used here: the hypervisor and the container. You might be familiar with how a hypervisor provides for virtual machines (VMs). You might be less familiar with containers, the most common of which rely on Linux kernel containment features, more commonly known as LXC (https://linuxcontainers.org). Both technologies support isolation and multitenancy.

Not all agree that a hypervisor or container is required to call a given system a cloud; several specialized service providers offer what is generally called a bare metal cloud, where they apply the referenced elasticity and automation to the rapid provisioning and assignment of physical servers, eliminating the overhead of a hypervisor or container altogether. Although interesting for the most demanding applications, the somewhat oxymoronic term "bare metal cloud" is something perhaps Tidbits will look at in more detail in a later column.

Thus, we're left with the working definition that cloud computing, at its core, has hypervisors or containers as a fundamental technology.

Cloud Systems with Hypervisors and Containers

Most commercial cloud computing systems, both services and cloud operating system software products, use hypervisors. Enterprise VMware installations, which can rightly be called early private clouds, use the ESXi Hypervisor (www.vmware.com/products/esxi-and-esx/overview). Some public clouds (Terremark, Savvis, and Bluelock, for example) use ESXi as well. Both Rackspace and Amazon Web Services (AWS) use the XEN Hypervisor (www.xenproject.org/developers/teams/hypervisor.html), which gained tremendous popularity because of its early open source inclusion with Linux. Because Linux has now shifted to support KVM (www.linux-kvm.org), another open source

2325-6095/14/$31.00 © 2014 IEEE


alternative, KVM has found its way into more recently constructed clouds (such as AT&T, HP, Comcast, and Orange). KVM is also a favorite hypervisor of the OpenStack project and is used in most OpenStack distributions (such as RedHat, Cloudscaling, Piston, and Nebula). Of course, Microsoft uses its Hyper-V hypervisor underneath both Microsoft Azure and Microsoft Private Cloud (www.microsoft.com/en-us/server-cloud/solutions/virtualization.aspx).

However, not all well-known public clouds use hypervisors. For example, Google and IBM/Softlayer are examples of extremely successful public cloud platforms using containers, not VMs.

Some trace inspiration for containers back to the Unix chroot command, which was introduced as part of Unix version 7 in 1979. In 1998, an extended version of chroot was implemented in FreeBSD and called jail. In 2004, the capability was improved and released with Solaris 10 as zones. By Solaris 11, a full-blown capability based on zones was completed and called containers. By that time, other proprietary Unix vendors offered similar capabilities, for example, HP-UX containers and IBM AIX workload partitions.

As Linux emerged as the dominant open platform, replacing these earlier variations, the technology found its way into the standard distribution in the form of LXC.

Figure 1 compares application deployment using a hypervisor and a container. As the figure shows, the hypervisor-based deployment is ideal when applications on the same cloud require different operating systems or OS versions (for example, RHEL Linux, Debian Linux, Ubuntu Linux, Windows 2000, Windows 2008, Windows 2012). The abstraction must be at the VM level to provide this capability of running different OS versions.

Figure 1. Comparison of (a) hypervisor and (b) container-based deployments. A hypervisor-based deployment is ideal when applications on the same cloud require different operating systems or different OS versions; in container-based systems, applications share an operating system, so these deployments can be significantly smaller in size. (Panel (a): applications A, B, and C, each with its own libraries and guest OS, on a hypervisor above the host OS and server. Panel (b): applications A, B, and C with their libraries on a container engine above the host OS and server.)

With containers, applications share an OS (and, where appropriate, binaries and libraries), and as a result these deployments will be significantly smaller in size than hypervisor deployments, making it possible to store hundreds of containers on a physical host (versus a strictly limited number of VMs). Because containers use the host OS, restarting a container doesn't mean restarting or rebooting the OS.

Those familiar with Linux implementations know that there's a great degree of binary application portability among Linux variants, with libraries occasionally required to complete the portability. Therefore, it's practical to have one container package that will run on almost all Linux-based clouds.

Docker Containers

Docker (www.docker.com) is an open source project providing a systematic way to automate the faster deployment of Linux applications inside portable containers. Basically, Docker extends LXC with a kernel- and application-level API that together run processes in isolation: CPU, memory, I/O, network, and so on. Docker also uses namespaces to completely isolate an application's view of the underlying operating environment, including process trees, network, user IDs, and file systems.

Docker containers are created using base images. A Docker image can include just the OS fundamentals, or it can consist of a sophisticated prebuilt application stack ready for launch. When building images with Docker, each action taken (that is, command executed, such as apt-get install) forms a new layer on top of the previous one. Commands can be executed manually or automatically using Dockerfiles.
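The idea that each build action forms a new layer on top of the previous one can be pictured with a toy model in Python. This is only an illustration of layering in general, not how Docker stores images: an image is modeled as a tuple of dictionaries, and the helper names `add_layer` and `view` as well as the file paths are made up for the example.

```python
from collections import ChainMap


def add_layer(image, changes):
    """Return a new image with `changes` stacked on top of `image`.

    An image is modeled as a tuple of layers (oldest first); each layer is a
    dict mapping file paths to contents. Existing layers are never modified.
    """
    return image + (dict(changes),)


def view(image):
    """Merged filesystem view in which newer layers shadow older ones."""
    return ChainMap(*reversed(image))


# A base image followed by two build steps, each producing one more layer.
base = ({"/etc/os-release": "debian"},)
with_nginx = add_layer(base, {"/usr/sbin/nginx": "binary"})
with_conf = add_layer(with_nginx, {"/etc/nginx/nginx.conf": "worker_processes 1;"})

print(len(with_conf))
print(sorted(view(with_conf)))
```

Because the lower layers are shared rather than copied, two images built from the same base only differ in the layers added on top, which is one reason layered images stay small.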


Figure 2. Possible layering combinations for application runtimes. (Stack layers shown, from top to bottom: application; PaaS or virtual appliance; container; VM; OS. Different columns of the diagram combine different subsets of these layers.)

Each Dockerfile is a script composed of various commands (instructions) and arguments listed successively to automatically perform actions on a base image to create (or form) a new image. They're used to organize deployment artifacts and simplify the deployment process from start to finish.

Containers can run on VMs too. If a cloud has the right native container runtime (such as some of the clouds mentioned) a container can run directly on the VM. If the cloud only supports hypervisor-based VMs, there's no problem: the entire application, container, and OS stack can be placed on a VM and run just like any other application to the OS stack.

Abstractions on Top of VMs and Containers

Both VMs and containers provide a rather low-level construct. Basically, both present an operating system interface to the developer. In the case of the VM, it's a complete implementation of the OS; you can run any OS that runs on the bare metal. The container gives you a "view" or a "slice" of an OS already running. You access OS constructs as if you were running an application directly on the OS. Developers often build on this level of abstraction to provide more application runtime constructs, so users don't feel like they're running on a bare machine or a bare OS, but on an application runtime of some kind.

Virtual appliances, such as VirtualBox (www.virtualbox.org), Rightscale Appliance [2], and Bitnami (https://bitnami.com), provide application runtime environments that shield the application from the bare OS by providing an interface for applications with higher-level, more portable constructs. Virtual appliances gained popularity with equipment manufacturers who wanted to provide a vehicle for distributing software versions of an appliance, for example, a network load balancer, WAN optimizer, or firewall. Virtual appliances can run on top of a VM or a container (native LXC-based or running on top of a VM).

For even more isolation from the OS, especially desired by application programmers, application runtimes can be reconfigured into total platform-as-a-service (PaaS) runtimes. Readers will remember that last issue I discussed PaaS, and mentioned that it uses container technology for deployment. It's for precisely this reason they do so: the distribution can be targeted precisely for the container engine and Linux OS on the cloud, and like the virtual appliance can also run on top of a VM.

As Figure 2 shows, there are many possible layering combinations, depending on the OS's capabilities, the deployment/portability strategy, and whether a PaaS is used.

How does one choose? As mentioned earlier, the virtual appliance approach is a favorite vehicle used by network equipment manufacturers to create a portable software appliance.

Those who want to deploy applications with the least infrastructure will choose the simple container-to-OS approach. This is why container-based cloud vendors can claim improved performance when compared to hypervisor-based clouds. A recent benchmark of a "fast data" NewSQL system claimed that in an apples-to-apples comparison, running on IBM Softlayer using containers resulted in a fivefold performance improvement over the same benchmark running on Amazon AWS using a hypervisor [3].

Software developers tend to prefer using PaaS, which will use a container if available for its runtime, to maximize performance as well as to manage application clustering. If not, the PaaS will run a container on a VM. Consequently, as PaaS gains in popularity, so do containers.

However, using containers for security isolation might not be a good idea. In an August 2013 blog [4], one of Docker's engineers expressed optimism that containers would eventually catch up to VMs from a security standpoint. But in a presentation given in January 2014 [5], the same engineer said that the only way to have real isolation with Docker is to either run one Docker per host, or one Docker per VM. If high security is needed, it might be worth sacrificing the performance of a pure-container deployment by introducing a VM to obtain more tried and true isolation. As with any other technology, you need to know the deployment's security requirements, and make appropriate decisions.


Open Source Cluster Manager for Docker Containers

As mentioned earlier, one of containers' nicest features is that they can be managed specifically for application clustering, especially when used in a PaaS environment. Answering this need, at the June 2014 Google Developer Forum, Google announced Kubernetes, an open source cluster manager for Docker containers [6]. According to Google, "Kubernetes is the decoupling of application containers from the details of the systems on which they run. Google Cloud Platform provides a homogenous set of raw resources . . . to Kubernetes, and in turn, Kubernetes schedules containers to use those resources. This decoupling simplifies application development since users only ask for abstract resources like cores and memory, and it also simplifies data center operations."

Google goes on to describe network-centric deployment improvements in Kubernetes: "While running individual containers is sufficient for some use cases, the real power of containers comes from implementing distributed systems, and to do this you need a network. However, you don't just need any network. Containers provide end users with an abstraction that makes each container a self-contained unit of computation. Traditionally, one place where this has broken down is networking, where containers are exposed on the network via the shared host machine's address. In Kubernetes, we've taken an alternative approach: that each group of containers (called a Pod) deserves its own, unique IP address that's reachable from any other Pod in the cluster, whether they're co-located on the same physical machine or not."

Industry Movement around Kubernetes

Shortly after Google's announcements, several players endorsed Kubernetes, and therefore Docker and containers, as a core cloud deployment technology [7]. In addition to a host of start-ups (such as CoreOS, MesoSphere, and SaltStack), Kubernetes supporters include:

• Google (for Google Compute Engine, GCE),
• Microsoft (for Microsoft Azure),
• VMware,
• IBM (for Softlayer and OpenStack), and
• Red Hat (its OpenStack distribution).

Although HP, Canonical, AWS, and Rackspace are "Docker friendly," they haven't explicitly endorsed Kubernetes. Industry speculation is that once a more neutral governance/collaboration structure is put together around Docker (a start-up company) and Kubernetes (still controlled by Google), organizations will agree on a common packaging and deployment approach, and here we have practically everyone already thinking about it. I'm not aware of any cloud project with this level of alignment on anything!

CONTAINERS, DOCKER, AND KUBERNETES SEEM TO HAVE SPARKED THE HOPE OF A UNIVERSAL CLOUD APPLICATION AND DEPLOYMENT TECHNOLOGY. And that, my friends, qualified it to be this issue's Cloud Tidbit. I hope you enjoyed it!

References

1. P. Mell and T. Grance, The NIST Definition of Cloud Computing: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145, 2011.
2. U. Thakrar, "Introducing RightScale Cloud Appliance for vSphere," blog, 10 Dec. 2013; www.rightscale.com/blog/enterprise-cloud-strategies/introducing--cloud-appliance-vsphere.
3. B. Kepes, "VoltDB Puts the Boot into Amazon Web Services, Claims IBM Is Five Times Faster," Forbes, 6 Aug. 2014; www.forbes.com/sites/benkepes/2014/08/06/voltdb-puts-the-boot-into-amazon-web-services-claims-ibm-5-faster.
4. J. Petazzoni, "Containers & Docker: How Secure Are They?" blog, 21 Aug. 2013; http://blog.docker.com/2013/08/containers-docker-how-secure-are-they.
5. J. Petazzoni, "Linux Containers (LXC), Docker, and Security," 31 Jan. 2014; www.slideshare.net/jpetazzo/linux-containers-lxc-docker-and-security.
6. C. Mcluckie, "Containers, VMs, Kubernetes and VMware," blog, 25 Aug. 2014; http://googlecloudplatform.blogspot.com/2014/08/containers-vms-kubernetes-and-vmware.html.
7. B. Butler, "Containers: Buzzword du Jour, or Game-Changing Technology?" NetworkWorld, 3 Sept. 2014; www.networkworld.com/article/2601925/cloud-computing/container-party-vmware-microsoft-cisco-and-red-hat-all-get-in-on-app-hoopla.html.

DAVID BERNSTEIN is the managing director of Cloud Strategy Partners, cofounder of the IEEE Cloud Computing Initiative, founding chair of the IEEE P2302 Working Group, and originator and chief architect of the IEEE Intercloud Testbed Project. His research interests include cloud computing, distributed systems, and converged communications. Bernstein was a University of California Regents Scholar with highest honors BS degrees in both mathematics and physics. Contact him at david@cloudstrategypartners.com.
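Google's description of users asking only for abstract resources such as cores and memory, with the cluster manager deciding where containers run, can be made concrete with a deliberately simple placement sketch in Python. This is a plain first-fit heuristic written for illustration and is not Kubernetes's scheduling algorithm; the function name `schedule`, the node names, and the resource figures are all hypothetical.

```python
def schedule(pods, nodes):
    """First-fit placement of pods onto nodes by requested cores and memory.

    `pods`:  list of (name, cores, memory_gb) requests.
    `nodes`: dict mapping node name -> [free_cores, free_memory_gb].
    Returns a dict mapping pod name -> node name, or None if nothing fits.
    """
    free = {n: list(cap) for n, cap in nodes.items()}  # leave the input untouched
    placement = {}
    for name, cores, mem in pods:
        placement[name] = None
        for node, (c, m) in free.items():
            if cores <= c and mem <= m:
                # Reserve the resources on the first node with enough room.
                free[node] = [c - cores, m - mem]
                placement[name] = node
                break
    return placement


nodes = {"node-a": [4, 8], "node-b": [2, 4]}
pods = [("web", 2, 2), ("db", 2, 6), ("cache", 2, 4), ("batch", 1, 1)]
result = schedule(pods, nodes)
for pod, node in result.items():
    print(pod, "->", node)
```

The caller never names a machine, only the resources it needs, which is the decoupling the quoted passage describes; a production scheduler would weigh many more factors than this sketch does.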

