SECURE BIG DATA IN THE CLOUD

Efficient and Secure Transfer, Synchronization, and Sharing of Big Data

Kyle Chard, Steven Tuecke, and Ian Foster, University of Chicago and Argonne National Laboratory

Globus supports standard security models and common data interfaces for securely accessing, transferring, synchronizing, and sharing large quantities of data.

IEEE CLOUD COMPUTING, PUBLISHED BY THE IEEE COMPUTER SOCIETY | 2325-6095/14/$31.00 © 2014 IEEE

Cloud computing's unprecedented adoption by commercial and scientific communities is due in part to its elastic computing capability, pay-as-you-go usage model, and inherent scalability. Cloud platforms are proving to be viable alternatives to in-house resources for scholarly applications, with researchers in areas spanning the physical and natural sciences through the arts regularly using them.1 As we enter the era of big data and data-driven research, the "fourth paradigm of science,"2 researchers face challenges related to hosting, organizing, transferring, sharing, and analyzing large quantities of data. Many believe that cloud models provide an ideal platform for supporting big data.


FIGURE 1. Globus provides transfer, synchronization, and sharing of data across a wide variety of storage resources. Globus Nexus provides a security layer through which users can authenticate using a number of linked identities. Globus Connect provides a standard API for accessing storage resources.

Large scientific datasets are increasingly hosted on both public and private clouds. For example, public datasets hosted by Amazon Web Services (AWS) include 20 Tbytes of NASA Earth science data, 500 Tbytes of Web-crawled data, and 200 Tbytes of genomic data from the 1000 Genomes project. Open clouds such as the Open Science Data Cloud (OSDC)3 host many of the same research datasets in their collection of more than 1 Pbyte of open data. Thus, it's frequently convenient, efficient, and cost-effective to work with these datasets on the cloud. In addition to these high-profile public datasets, many researchers store and work with large datasets distributed across a plethora of cloud and local storage systems. For example, researchers might use datasets stored in object stores such as Amazon Simple Storage Service (S3), large mountable block stores such as Amazon Elastic Block Store (EBS), instance storage attached to running cloud virtual machine (VM) instances, and other data stored on their institutional clusters, personal computers, and supercomputing centers.

Given the distribution and diversity of storage as well as increasingly huge data sizes, we need standardized, secure, and efficient methods to access data, move it to other systems for analysis, synchronize changing datasets across systems without copying the entire dataset, and share data with collaborators and others for extension and verification. Although high-performance methods are clearly required as data sizes grow, secure methods are equally important, given that these datasets might include medical, personal, financial, government, and intellectual property data. Thus, we need models that provide a standard interface through which users can perform these actions and methods that leverage proven security models to provide a common interface and single-sign-on. These approaches must also be easy to use, scalable, efficient, and independent of storage type.

Globus is a hosted provider of high-performance, reliable, and secure data transfer, synchronization, and sharing.4 In essence, it establishes a huge distributed data cloud through a vast network of Globus-accessible endpoints: storage resources that implement Globus's data access APIs.
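Synchronizing a changing dataset means re-sending only the files that differ, rather than copying everything. The sketch below illustrates that decision logic for locally accessible files, first comparing sizes (cheap), then checksums (strong); the helper names are illustrative and are not Globus's actual implementation.

```python
import hashlib
import os

def file_checksum(path, algo="md5"):
    """Compute a file checksum, reading in chunks to bound memory use."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def should_copy(src, dst):
    """Decide whether src must be re-transferred to dst.

    Mirrors the idea behind sync-style transfers: skip files that
    already match at the destination, so only changed files move.
    """
    if not os.path.exists(dst):
        return True
    if os.path.getsize(src) != os.path.getsize(dst):
        return True
    return file_checksum(src) != file_checksum(dst)
```

In a real deployment the destination checksum would be computed remotely by the endpoint, not by reading both files locally.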

Through this cloud, users can access, move, and share large amounts of data remotely, without worrying about performance, reliability, or data integrity.

Globus: Large-Scale Research Data Management
Figure 1 gives a high-level view of the Globus ecosystem. Core Globus capabilities are split into two services: Globus Nexus manages user identities and groups,5 whereas the Globus transfer service manages transfer, synchronization, and sharing tasks on the user's behalf.6 Both services offer programmatic APIs and clients to access their functionality remotely. They're also accessible via the Globus Web interface (www.globus.org).

Globus Nexus provides the high-level security fabric that supports authentication and authorization. Its identity management function lets users create and manage a Globus identity; users can create a profile associated with their identity, which they can then use to make authorization decisions. It also acts as an identity hub, where users can link external identities to their Globus identity. Users can authenticate with Globus through these linked external identities using a single-sign-on model. Supported identities include campus identities using InCommon/CILogon via OAuth, OpenID accounts, XSEDE accounts via MyProxy OAuth, an Interoperable Global Trust Federation (IGTF)-certified X.509 certificate authority, and Secure Socket Shell (SSH) key pairs. To support collective authorization decisions (such as when sharing data with collaborators), Globus Nexus also supports the creation and management of user-defined groups.

The Globus transfer service provides core data management capabilities and implements an associated data access security fabric. Globus uses the GridFTP protocol7 to transfer data between logical endpoints, a Globus representation of an accessible GridFTP server. GridFTP extends FTP to improve performance, enable third-party transfers, and support enhanced security models. The Globus model for accessing and moving data requires deploying a GridFTP server on a computer and registering a corresponding logical endpoint in Globus. The GridFTP server must be configured with an authentication provider that handles the mapping of credentials to user accounts. Often, authentication is provided by a co-located MyProxy credential management system,8 which lets users obtain short-term X.509 certificate-based proxy credentials by authenticating with a plug-in authentication module (for example, local user accounts, Lightweight Directory Access Protocol [LDAP], or InCommon/CILogon).

Globus uses two separate communication channels. The control channel is established between Globus and the endpoint to start and manage transfers, retrieve directory listings, and establish the data channel. The data channel is established directly between two Globus endpoints (GridFTP servers) and is used for data flowing between systems. The data channel is inaccessible to the Globus service, so no user data passes through Globus.

Several capabilities differentiate Globus from its competitors:

• High performance. Globus tunes performance based on heuristics to maximize throughput using techniques such as pipelining and parallel data streams.
• Reliable. Globus manages every stage of data transfer, periodically checks transfer performance, recovers from errors by retrying transfers, and notifies users of various events (such as errors and success). At the conclusion of a transfer, Globus compares checksums to ensure data integrity.
• Secure. Globus implements best-practice security approaches with respect to user authentication and authorization, securely manages the storage and transmission of credentials to endpoints for authentication, and supports optional data encryption.
• Third-party transfer. Unlike most transfer mechanisms (such as SCP [secure copy]), Globus facilitates third-party transfers between two remote endpoints. That is, rather than maintain a persistent connection to an endpoint, users can start a transfer and then let Globus manage it for the duration of the transfer.
• High availability. Globus is hosted using a distributed, replicated, and redundant hosting model deployed across several AWS availability zones. In the past year, Globus and its constituent services have achieved 99.96 percent availability.
• Accessible. Because Globus is a software-as-a-service (SaaS) provider, users can access its capabilities without installing client software locally, so they can start and manage transfers through their Web browsers, or using the Globus command-line interface or REST API.

In three and a half years of operation, Globus has attracted more than 18,000 registered users, of which approximately 200 to 250 are active every day, and it has conducted nearly 1 million transfers, collectively containing more than 2 billion files and 52 Pbytes of data. Figure 2 summarizes the Globus transfers over this period. The graphs include only transfer tasks (that is, they don't include mkdir, delete, and so on) in which data has been transferred (for example, they don't include sync jobs that don't transfer files) between nontesting endpoints (that is, they ignore the Globus test endpoints go#ep1 and go#ep2). Figure 2a shows the frequency of the total number of bytes transferred in a single transfer task (note the log bins), and Figure 2b shows the frequency of the total number of files and directories transferred in a single transfer task. As Figure 2a shows, the most common transfers are between 100 Mbytes and 1 Gbyte (81,624 total transfers), whereas more than 700 transfers have moved tens of Tbytes of data and 39 have moved hundreds of Tbytes (max 500.415 Tbytes). The most common number of files and directories transferred is less than 10; however, more than 400 transfers have moved more than 1 million files each (max 39,643,018), and 120 transfers have moved more than 100,000 directories (max 7,675,096). Figure 2 highlights the huge scale at which Globus operates in terms of data sizes transferred, number of files and directories moved, and number of transfers conducted.
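The log-scale binning used in Figure 2a is easy to reproduce. The sketch below maps transfer sizes to power-of-ten bins and counts tasks per bin; the sample values are invented, and only the binning scheme follows the figure.

```python
import math
from collections import Counter

def log_bin(nbytes):
    """Map a transfer size in bytes to a power-of-ten bin label,
    e.g. 500 Mbytes falls in the 10^8-10^9 byte bin."""
    if nbytes <= 0:
        return "0 bytes"
    exp = int(math.floor(math.log10(nbytes)))
    return "10^%d-10^%d bytes" % (exp, exp + 1)

def histogram(sizes):
    """Count transfer tasks per logarithmic size bin."""
    return Counter(log_bin(s) for s in sizes)
```

For example, `histogram([5 * 10**8, 7 * 10**8, 1500])` places two tasks in the 10^8-10^9 byte bin (the most common range reported above) and one in the 10^3-10^4 byte bin.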

FIGURE 2. Frequency of transfers with a given transfer size and number of files and directories. Transfer task frequency for (a) total transfer size and (b) number of files and directories.

Extending the Globus Data Cloud
Globus currently supports a network of more than 8,000 active (used within the last year) endpoints distributed across the world and hosted at a variety of locations, from PCs to supercomputers. Users can already access and transfer data from many locations via Globus: supercomputing centers such as the National Center for Supercomputing Applications (NCSA) and the San Diego Supercomputer Center (SDSC); university research computing centers such as those at the University of Chicago; cloud platforms such as Amazon Web Services and the Open Science Data Cloud (OSDC); large user facilities such as CERN and Argonne National Laboratory's Advanced Photon Source; and commercial data providers such as PerkinElmer. This vast collection of accessible endpoints ensures that new Globus users have access to large quantities of data immediately.

As new users join Globus, they often require access to new storage resources (including their own PCs). Thus, an important goal is to provide trivial methods for making resources accessible via Globus. To allow data access via Globus, storage systems must be configured with a GridFTP server and some authentication method. To ease this process, we developed Globus Connect, a software package that can be deployed quickly and easily to make resources accessible to Globus. We developed two versions of Globus Connect for different deployment scenarios.

Globus Connect Personal is a lightweight single-user agent that operates in the background much like other SaaS agents (such as Dropbox). A unique key is created for each installation and is used to peer Globus Connect to the user's Globus account, ensuring that the endpoint is only accessible to its owner. Because we designed Globus Connect Personal for installation on PCs, it supports operation on networks behind firewalls and network address translation (NAT) through its use of outbound connections and relay servers, as other such user agents do. Because it can run in user space, it doesn't require administrator privileges. Globus Connect Personal is available for Linux, Windows, and MacOS.

Globus Connect Server is a multiuser server installation that supports advanced configuration options.
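Because Globus Connect Server setup is driven by a short configuration file, endpoint deployment can be scripted. The sketch below generates such a file with Python's configparser; the section and option names are loosely modeled on the documented configuration (endpoint name, path restrictions, authentication method) but should be treated as illustrative placeholders rather than the exact schema.

```python
from configparser import ConfigParser
from io import StringIO

def build_endpoint_config(name, restrict_paths, auth="MyProxy"):
    """Emit an INI-style endpoint configuration string.

    NOTE: section/option names here are illustrative placeholders;
    consult the Globus Connect Server documentation for the real schema.
    """
    cfg = ConfigParser()
    cfg["Endpoint"] = {"Name": name}
    cfg["Security"] = {"Authentication": auth}
    cfg["GridFTP"] = {"RestrictPaths": ",".join(restrict_paths)}
    buf = StringIO()
    cfg.write(buf)
    return buf.getvalue()
```

A provisioning script could write this file and then run the one-command setup, making endpoint creation repeatable across many data transfer nodes.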

It includes a full GridFTP server and an optional colocated MyProxy server for authentication. Alternatively, users can configure existing authentication sources upon installation. The installation process requires a one-command setup and completion of a configuration file that defines aspects such as the endpoint name, file system restrictions, network interface, and authentication method. Globus Connect Server also supports multiserver data transfer node configurations to provide increased throughput. Globus Connect Server is available as native Debian and RedHat packages.

With Globus Connect, users can quickly expose any type of storage resource to the Globus cloud. They can use lightweight Globus Connect Personal endpoints on PCs and even short-lived cloud instances. They can even script the download and configuration of these endpoints for programmatic execution. For more frequently used resources with multiple users (such as data transfer nodes, clusters, storage providers, and long-term and high-performance storage such as the High Performance Storage System [HPSS]), they can deploy Globus Connect Server and leverage institutional identity providers. They can then scale deployments over time by adding Globus Connect Server nodes to load balance transfers. Both versions support all Globus features, including access, transfer, synchronization, and sharing.

Supporting Cloud Object Stores
To allow users to access a variety of storage systems, Globus supports the creation of endpoints directly on object storage. Users can thus access, transfer, and share data between S3 and existing Globus endpoints as they do between any other Globus endpoints. To access S3, users must create an S3-backed endpoint that maps to a specific S3 bucket to which they have access. With this model, users can expose access to large datasets stored in S3 and benefit from Globus's advanced features, including high-performance and reliable transfer, rather than relying on standard HTTP support (which doesn't scale to large datasets and doesn't ensure data integrity). Users can also leverage Globus's synchronization and sharing capabilities directly from S3 endpoints.

Globus S3 endpoints support transfers directly from existing endpoints, so they don't require data staging via a Globus Connect deployment hosted on Amazon's cloud. This approach differs from GreenButton WarpDrive (www.greenbutton.com), which, although it also uses GridFTP, relies on a pool of GridFTP servers hosted on cloud instances. Globus's S3 support builds upon extensions to GridFTP that support communication directly between S3 and GridFTP servers. Globus enables user-controlled registration of logical S3 endpoints, requiring only details identifying the storage location (that is, the S3 bucket) and the information required to connect to the S3 endpoint. To provide secure access to data stored in S3, while also enabling user-controlled sharing via Globus, we leverage Amazon's Identity and Access Management (IAM) service to delegate control of an S3 bucket to a trusted Globus IAM user. We peer this IAM user with the Globus transfer service via trusted credentials. Thus, when delegating access to an S3 bucket, Globus can base authorization decisions on internal policies (such as sharing permissions) to allow transfers between other Globus endpoints and the S3 endpoint.

Providing Scalable In-Place Data Sharing
One of the most common requirements associated with big data (and scientific data in general) is the ability to share data with collaborators. Current models for data sharing are limited in many ways, especially as data sizes increase. For example, cloud-based mechanisms such as Dropbox require that users first move (replicate) their data to the cloud, which is both costly and time consuming. Ad hoc models, such as directly sharing from institutional storage, require manual configuration, creation, and management of remote user accounts, making them difficult to manage and audit. These difficulties become insurmountable when data is large and when dynamic sharing changes are required. Rather than implement yet another storage service, we focus on enabling in-place data sharing. That is, shared data does not reside on Globus; rather, Globus lets users control who can access their data directly on their existing endpoints.

To share data in Globus, a user selects a file system location and creates a shared endpoint, that is, a virtual endpoint rooted at the shared location on his or her file system. The user can then select other users, or groups of users, who can access the shared endpoint, or parts thereof, by specifying fine-grained read and write permissions.

One advantage of this model is that permission changes are reflected immediately, so users can revoke access to a shared dataset instantly.

Globus's sharing capabilities are extensions built onto the GridFTP server, which, when enabled, let the GridFTP server delegate authorization decisions to Globus. Specifically, two new GridFTP site commands let Globus check that sharing is enabled on an endpoint and create a new shared endpoint. We also extended the GridFTP access protocol to allow access by a predefined trusted Globus user. The access request includes additional parameters such as the shared endpoint owner, the shared user, and the access control list (ACL) for the shared endpoint, which Globus maintains. When accessing the endpoint, this information is passed to the GridFTP server to enable delegated authorization decisions from the requesting user to the local user account of the shared endpoint owner. Using this approach, the GridFTP server can perform an authorization check to ensure that the shared user can access the requested path before following the normal access protocol, which requires changing to the shared endpoint owner's local user account and performing the requested action.

Secure Data Access, Transfer, and Sharing
There are a wide range of potential security implications when accessing distributed data, hosted by different providers, across security domains, and using different security protocols. Globus's multilayered architecture leverages standard security protocols to manage authentication and authorization, and avoids unnecessary storage of (or access to) users' credentials and data. Most importantly, data does not pass through Globus; rather, it acts as a mediator, allowing endpoints to establish secure connections between one another.

Authentication and Authorization
At the heart of the Globus security model is Globus Nexus, which facilitates the complex security protocols required to access the Globus service and endpoints using Globus identities as well as linked external identities.

Globus stores identities (and groups) in a connected graph. For Globus identities, it stores hashed and salted passwords for comparison when authenticating. For the linked identities (SSH public keys, X.509 certificates, OpenID identities, InCommon/CILogon OAuth, and so on) used to provide single-sign-on, it stores only public information, such as SSH public keys, X.509 certificates, OpenID identity URLs and usernames, and OAuth provider servers, certificates, and usernames. Thus, when authenticating, Globus can validate a user's identity by following the external provider's authentication process using cryptographic techniques rather than comparing passwords. Consider, for example, authenticating using a campus identity. Here, Globus leverages the InCommon/CILogon system and the OAuth protocol to let users enter their username and password via a trusted campus website. Globus passes a request token with the user authentication request and receives an OAuth token and signature in return, which it exchanges for an OAuth access token (and later a certificate) from the campus identity provider. Linked identities, such as XSEDE identities, are also used for single-sign-on access to endpoints.

Rather than require users to authenticate multiple times for every action, and to allow Globus to manage transfers on a user's behalf, Globus stores short-term proxy credentials. This allows Globus to perform important transfer-management tasks such as restarting transfers upon error. Here, Globus stores an active proxy credential that can be used to impersonate the user, albeit for a short period of time. To do so securely, Globus caches only the active credential and encrypts it using a private key owned by Globus Nexus. When the active credential is required (for example, to compute a file checksum on an endpoint), the credential is decrypted and passed to the specific GridFTP server over the encrypted control channel.

Endpoint Data Access and Transfer
GridFTP uses the Grid Security Infrastructure (GSI), a specification that allows secure and delegated communication between services in distributed computing environments. GridFTP relies on external services to authenticate users and provide trusted signed certificates (typically from a MyProxy server) used to access the server.
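Storing only hashed and salted passwords, as Globus Nexus does for Globus identities, follows a standard pattern: derive a salted digest at registration, then re-derive and compare at login. The sketch below uses PBKDF2 from Python's hashlib to illustrate the pattern; it is not Nexus's actual implementation.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, rounds=100000):
    """Derive a salted hash; store (salt, digest), never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per account
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    return salt, digest

def verify_password(password, salt, stored_digest, rounds=100000):
    """Re-derive with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    return hmac.compare_digest(candidate, stored_digest)
```

The constant-time comparison (hmac.compare_digest) avoids leaking information through timing, and the per-account salt ensures identical passwords produce different digests.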

These certificates are often hidden from users by the use of an online certificate authority (CA), such as MyProxy. The GridFTP service has a certificate containing the hostname and host information that it uses to identify itself. (This certificate is created automatically when users install Globus Connect, or it can be issued by a CA.) In Globus Connect, the MyProxy server can optionally be installed to issue short-term certificates on demand. Globus Connect can also be configured to use external MyProxy servers. The Globus, GridFTP, and MyProxy servers are configured to trust the certificates exchanged between each other.

MyProxy servers let users obtain short-term credentials that a GridFTP server uses to assert user access to the file system. Administrators can configure MyProxy servers to use various mechanisms for authentication through pluggable authentication modules (PAMs). Usually, these PAMs support local system credentials or institutional LDAP credentials. There are two basic models in which Globus uses a MyProxy server to obtain a credential. In the first, Globus passes the user's username and password to the MyProxy server and receives a credential in response. Thus, users must trust Globus not to store their passwords and to transfer them securely. In the second and preferred model, Globus uses the OAuth protocol to redirect the user to the MyProxy server to authenticate directly (that is, Globus doesn't see the username and password), and the server returns a credential in the OAuth redirection workflow.

When accessing data on an endpoint, Globus uses SSL/TLS to authenticate with the registered GridFTP server using the user's certificate. The GridFTP server validates the user's certificate, retrieves a mapping to a local user account from a predefined mechanism (such as a GridMap file), and changes the local user account (used to access the file system) to the requesting user's local account. Subsequent file system access occurs as the authenticated user's local account. To provide an additional layer of security, endpoint administrators can configure path restrictions (restrict_paths) that limit GridFTP access to particular parts of the file system. For instance, administrators might allow access only to users' home directories or to specialized locations on the file system.

The flow of data between endpoints (including S3-backed endpoints and shared endpoints) is another potential area of vulnerability because data can travel on the general Internet. To provide secure data transfer, Globus supports data encryption based on secure sockets layer (SSL) connections between endpoints. In the case of S3 endpoints, the connection uses HTTPS. To avoid unnecessary overhead for less sensitive data, encryption is not a default setting and must be explicitly selected for individual transfers. The control channel used to start and manage transfers is always encrypted to avoid potential visibility of credential, transfer, and file system information.

Secure Sharing
Globus sharing creates several new security considerations, such as requiring secure peering of shared endpoints and Globus, authorizing access to shared data, and ensuring that file system information is not disclosed outside of the restricted shared endpoint. The Globus sharing model requires the GridFTP server to be explicitly configured to allow sharing. As part of this process, the GridFTP server is configured to allow a trusted Globus user to access the server (and to later change the local user account to the shared endpoint owner's local user account). A unique distinguished name (DN), obtained from a Globus CA operated for this purpose, identifies the user. The GridFTP server is configured to trust both this special Globus user and the Globus CA via the registered DN. During configuration, administrators can set restrictions (sharing_rp) defining what files and paths may be shared on the file system and which users may create shared endpoints. For example, administrators could limit sharing to a particular path (analogous to a public_html directory) and a subset of administrative users.

As part of shared endpoint creation, a unique token is created on the GridFTP server for each shared endpoint. This token is used to safeguard against redirection and man-in-the-middle attacks. For instance, an attacker who gains control of a compromised Globus account might change the physical GridFTP server associated with a trusted endpoint (for example, an XSEDE endpoint) to a malicious endpoint under the attacker's control. In this case, the attacker can create a shared endpoint and can then change the physical server back to the trusted server. Because the unique token is created on the malicious server, it won't be present on the trusted (XSEDE) server, so the attacker won't be able to exploit the shared endpoint to access the trusted server.

Accessing data on a shared endpoint using the extended GridFTP protocol lets Globus access the GridFTP server (as the trusted Globus account). The extended access request specifies the data location, the shared endpoint owner, the user accessing the shared endpoint, and the current ACLs for that shared endpoint. To ensure that data is accessed only within the boundaries of what has been shared and within restrictions placed by the server administrator, the GridFTP server checks restricted paths, shared paths, and Globus ACLs (in that order). Assuming nothing negates the access, the GridFTP server changes the local user account, with which it accesses the file system, to the shared endpoint owner's local user account and satisfies the request.

Finally, because potentially sensitive path information could be included in a shared file path, Globus hides the root path from users accessing the shared endpoint. For example, if a user shares the directory "/kyle/secret/," it will appear simply as "/~/" through the shared endpoint. Globus translates paths before sending requests to the GridFTP server.

Hosting and Security Policies
All Globus services are hosted on AWS. Although this environment has many advantages, such as high availability and elastic scalability, as with all hosting options, it also has inherent risks. We mitigate these risks by following best practices with respect to deployment and management of instances. These practices include storing all sensitive state encrypted, isolating data stores from the general Internet so they're only accessible to Globus service nodes (by AWS security groups), performing active intrusion detection and log monitoring to discover threats, auditing public-facing services and using strict firewalls to restrict access to predefined ports, and establishing backup processes to ensure that all data is encrypted before it's put in cloud storage. To ensure that these practices are followed, we conducted an external security review9 and resolved the identified issues.

One important security aspect relates to policies for responding to security breaches and vulnerabilities. The recent Heartbleed bug is an example of a security vulnerability that affected a huge number of websites across the world. Although Globus uses custom data transfer protocols that are unlikely targets of such an attack, exploits via the website, endpoints, and linked identity providers are still possible. In this particular case, we followed predefined internal security policies to determine whether the vulnerability impacted our services, patched the issue for all Globus services and Globus-managed endpoints, and generated new private keys. We then followed internal processes for responding to potentially compromised user access by revoking user access tokens (invalidating all user sessions) and analyzing access logs. Finally, because of the exploit's nature, we analyzed all user endpoints to identify potentially vulnerable endpoints. We then contacted the administrators of these endpoints and recommended that they take specific measures to patch their systems.

As data sizes increase, researchers must look toward more efficient ways of storing, organizing, accessing, sharing, and analyzing data. Although Globus's capabilities make it easy to access, transfer, and share large amounts of data across an ever-increasing ecosystem of active data endpoints, it also provides a framework on which new approaches for efficiently managing and interacting with big data can be explored.

The predominant use of file-based data is often inefficient because the data required for analysis doesn't always match the model used to store it. Researchers typically slice climate data in different ways depending on the analysis: for example, geographically, temporally, or by a specific type of data such as rainfall or temperature. Accessing entire datasets when only small subsets are of interest is both impractical and inefficient. Although some data protocols, such as the Open-source Project for a Network Data Access Protocol (OPeNDAP), provide methods for accessing data subsets within files, no standard model for accessing a wide range of data formats currently exists. Recently, researchers have proposed more sophisticated data access models within GridFTP that use dynamic query and subsetting operations to retrieve (or transfer) data subsets.10

Recently, researchers have proposed more sophisticated data access models within GridFTP that use dynamic query and subsetting operations to retrieve (or transfer) data subsets.10 Although this work presents a potential model for providing such capabilities, further work is needed to generalize the approach across data types and to develop a flexible and usable language to express such restrictions.

Files typically contain valuable metadata that can be used for organization, browsing, and discovery. However, accessing this metadata is often difficult because it's stored in various science-specific formats, often encoded in proprietary binary formats, and typically unstructured (or at least doesn't follow standard conventions). Moreover, even when the metadata is accessible, few high-level methods exist for browsing it across many files or across storage systems. Often, the line between metadata and data is blurred; whereas metadata might be unnecessary for some analyses, it can be valuable for others. Thus, we need methods that enable structured access to both data and metadata using common formats. Given that metadata can describe data or contain other sensitive information (for example, patient names), it's equally important to provide secure access methods. We therefore need models that expose such metadata to users and let them query over it to find relevant data for analysis, or share it, in a scalable and secure manner.

Often, data sharing occurs for the purpose of publishing to the wider community or as part of a publication. Considerable research has explored current data publishing practices.11,12 In many cases, researchers found that data wasn't published with papers and that original datasets couldn't be located. This undermines one of the core principles of scientific discovery: that research is reproducible and verifiable. In response, funding agencies and publishers are increasingly placing strict requirements on data availability associated with grants and publications, although these requirements are often disregarded.12 Even when researchers do publish data, they often do so poorly: in an ad hoc manner that makes the data difficult to find and understand (due to a lack of metadata), and with little guarantee that the data is unchanged or complete. We need new systems that let researchers publish data, easily associate persistent identifiers (such as DOIs) with that data, guarantee that the data is immutable and consistent with what was published, provide common interfaces for discovering and accessing published data, and do all this at scales that correspond to the growth of big data.

Although these three areas represent different research endeavors, they all require a framework that supports efficient and secure data access. Globus provides a model on which we can continue to innovate in these areas to provide enhanced capabilities directly through the existing network of Globus endpoints. We benefit from using Globus's transfer and sharing capabilities and from leveraging the same structured approaches toward authentication and authorization.

We intend to continue developing support for other cloud storage systems and providers, such as persistent long-term storage like Amazon Glacier and the storage models used by other cloud providers (Microsoft Azure Storage, for example), with the goal of developing an increasingly broad data cloud.

Acknowledgments
We thank the Globus team for implementing and operating the Globus services. This work was supported in part by the US National Institutes of Health through NIGMS grant U24 GM104203, the Bio-Informatics Research Network Coordinating Center (BIRN-CC), the US Department of Energy through grant DE-AC02-06CH11357, and the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by US National Science Foundation grant ACI-1053575.

References
1. D. Lifka et al., XSEDE Cloud Survey Report, tech. report 20130919-XSEDE-Reports-CloudSurvey-v1.0, XSEDE, 2013.
2. T. Hey, S. Tansley, and K. Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.
3. R.L. Grossman et al., "The Design of a Community Science Cloud: The Open Science Data Cloud Perspective," Proc. 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC 12), 2012, pp. 1051–1057.
4. I. Foster, "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services," IEEE Internet Computing, vol. 15, no. 3, 2011, pp. 70–73.
5. R. Ananthakrishnan et al., "Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications," Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER), 2013, pp. 1–3.
6. B. Allen et al., "Software as a Service for Data Scientists," Comm. ACM, vol. 55, no. 2, 2012, pp. 81–88.
7. W. Allcock et al., "The Globus Striped GridFTP Framework and Server," Proc. 2005 ACM/IEEE Conf. Supercomputing (SC 05), 2005, pp. 54–64.
8. J. Novotny, S. Tuecke, and V. Welch, "An Online Credential Repository for the Grid: MyProxy," Proc. 10th IEEE Int'l Symp. High Performance Distributed Computing, 2001, pp. 104–111.
9. V. Welch, Globus Online Security Review, tech. report, Indiana Univ., 2012; https://scholarworks.iu.edu/dspace/handle/2022/14147.
10. Y. Su et al., "SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 13), 2013, article 47.
11. T.H. Vines et al., "The Availability of Research Data Declines Rapidly with Article Age," Current Biology, vol. 24, no. 1, 2014, pp. 94–97.
12. A.A. Alsheikh-Ali et al., "Public Availability of Published Research Data in High-Impact Journals," PLoS ONE, vol. 6, no. 9, 2011, e24357.

KYLE CHARD is a senior researcher at the Computation Institute, a joint venture between the University of Chicago and Argonne National Laboratory. His research interests include distributed meta-scheduling, grid and cloud computing, economic resource allocation, social computing, and services computing. Chard received a PhD in computer science from Victoria University of Wellington, New Zealand.

STEVEN TUECKE is deputy director at the University of Chicago's Computation Institute, where he's responsible for leading and contributing to projects in computational science, high-performance and distributed computing, and biomedical informatics. Tuecke received a BA in mathematics and computer science from St Olaf College. Contact him at [email protected].

IAN FOSTER is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne senior scientist and distinguished fellow, and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research interests include distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom. Contact him at [email protected].
