Magellan Final Report

The Magellan Report on Cloud Computing for Science

U.S. Department of Energy
Office of Science
Office of Advanced Scientific Computing Research (ASCR)
December, 2011

CSO 23179

Magellan Leads
Katherine Yelick, Susan Coghlan, Brent Draney, Richard Shane Canon

Magellan Staff
Lavanya Ramakrishnan, Adam Scovel, Iwona Sakrejda, Anping Liu, Scott Campbell, Piotr T. Zbiegiel, Tina Declerck, Paul Rich

Collaborators
Nicholas J. Wright; Jeff Broughton; Rollin Thomas; Brian Toonen; Richard Bradshaw; Michael A. Guantonio; Karan Bhatia; Alex Sim; Shreyas Cholia; Zacharia Fadikra; Henrik Nordberg; Kalyan Kumaran; Linda Winkler; Levi J. Lester; Wei Lu; Ananth Kalyanraman; John Shalf; Devarshi Ghoshal; Eric R. Pershey; Michael Kocher; Jared Wilkening; Ed Holohann; Vitali Morozov; Doug Olson; Harvey Wasserman; Elif Dede; Dennis Gannon; Jan Balewski; Narayan Desai; Tisha Stacey; CITRIS/University of California, Berkeley; STAR Collaboration; Krishna Muriki; Madhusudhan Govindaraju; Linda Vu; Victor Markowitz; Gabriel A. West; Greg Bell; Yushu Yao; Shucai Xiao; Daniel Gunter; Nicholas Dale Trebon; Margie Wylie; Keith Jackson; William E. Allcock; K. John Wu; John Hules; Nathan M. Mitchell; David Skinner; Brian Tierney; Jon Bashor

This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357, funded through the American Recovery and Reinvestment Act of 2009. Resources and research at NERSC at Lawrence Berkeley National Laboratory were funded by the Department of Energy from the American Recovery and Reinvestment Act of 2009 under contract number DE-AC02-05CH11231.

Executive Summary

The goal of Magellan, a project funded through the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR), was to investigate the potential role of cloud computing in addressing the computing needs for the DOE Office of Science (SC), particularly related to serving the needs of mid-range computing and future data-intensive computing workloads. A set of research questions was formed to probe various aspects of cloud computing from performance, usability, and cost. To address these questions, a distributed testbed infrastructure was deployed at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC). The testbed was designed to be flexible and capable enough to explore a variety of computing models and hardware design points in order to understand the impact for various scientific applications. During the project, the testbed also served as a valuable resource to application scientists. Applications from a diverse set of projects such as MG-RAST (a metagenomics analysis server), the Joint Genome Institute, the STAR experiment at the Relativistic Heavy Ion Collider, and the Laser Interferometer Gravitational Wave Observatory (LIGO), were used by the Magellan project for benchmarking within the cloud, but the project teams were also able to accomplish important production science utilizing the Magellan cloud resources.

Cloud computing has garnered significant attention from both industry and research scientists as it has emerged as a potential model to address a broad array of computing needs and requirements such as custom software environments and increased utilization among others. Cloud services, both private and public, have demonstrated the ability to provide a scalable set of services that can be easily and cost-effectively utilized to tackle various enterprise and web workloads.
These benefits are a direct result of the definition of cloud computing: on-demand self-service resources that are pooled, can be accessed via a network, and can be elastically adjusted by the user. The pooling of resources across a large user base enables economies of scale, while the ability to easily provision and elastically expand the resources provides flexible capabilities.

Following the Executive Summary we summarize the key findings and recommendations of the project. Greater detail is provided in the body of the report. Here we briefly summarize some of the high-level findings from the study.

• Cloud approaches provide many advantages, including customized environments that enable users to bring their own software stack and try out new computing environments without significant administration overhead, the ability to quickly surge resources to address larger problems, and the advantages that come from increased economies of scale. Virtualization is the primary strategy of providing these capabilities. Our experience working with application scientists using the cloud demonstrated the power of virtualization to enable fully customized environments and flexible resource management, and their potential value to scientists.

• Cloud computing can require significant initial effort and skills in order to port applications to these new models. This is also true for some of the emerging programming models used in cloud computing. Scientists should consider this upfront investment in any economic analysis when deciding whether to move to the cloud.

• Significant gaps and challenges exist in the areas of managing virtual environments, workflows, data, cyber-security, and others. Further research and development is needed to ensure that scientists can easily and effectively harness the capabilities exposed with these new computing models.
This would include tools to simplify using cloud environments, improvements to open-source cloud software stacks, providing base images that help bootstrap users while allowing them flexibility to customize these stacks, investigation of new security techniques and approaches, and enhancements to MapReduce models to better fit scientific data and workflows. In addition, there are opportunities in exploring ways to enable these capabilities in traditional HPC platforms, thus combining the flexibility of cloud models with the performance of HPC systems.

• The key economic benefit of clouds comes from the consolidation of resources across a broad community, which results in higher utilization, economies of scale, and operational efficiency. Existing DOE centers already achieve many of the benefits of cloud computing since these centers consolidate computing across multiple program offices, deploy at large scales, and continuously refine and improve operational efficiency. Cost analysis shows that DOE centers are cost competitive, typically 3-7x less expensive, when compared to commercial cloud providers. Because the commercial sector constantly innovates, DOE labs and centers should continue to benchmark their computing cost against public clouds to ensure they are providing a competitive service.

Cloud computing is ultimately a business model, but cloud models often provide additional capabilities and flexibility that are helpful to certain workloads. DOE labs and centers should consider adopting and integrating these features of cloud computing into their operations in order to support more diverse workloads and further enable scientific discovery, without sacrificing the productivity and effectiveness of computing platforms that have been optimized for science over decades of development and refinement.
If cases emerge where this approach is not sufficient to meet the needs of the scientists, a private cloud computing strategy should be considered first, since it can provide many of the benefits of commercial clouds while avoiding many of the open challenges concerning security, data management, and performance of public clouds.

Key Findings

The goal of the Magellan project is to determine the appropriate role of cloud computing in addressing the computing needs of scientists funded by the DOE Office of Science. During the course of the Magellan project, we have evaluated various aspects of cloud computing infrastructure and technologies for use by scientific applications from various domains. Our evaluation methodology covered various dimensions: cloud models such as Infrastructure as a Service (IaaS) and Platform as a Service (PaaS), virtual software stacks, MapReduce and its open source implementation (Hadoop), and resource provider and user perspectives. Specifically, Magellan was charged with answering the following research questions:

• Are the open source cloud software stacks ready for DOE HPC science?
• Can DOE cyber security requirements be met within a cloud?
• Are the new cloud programming models useful for scientific computing?
• Can DOE HPC applications run efficiently in the cloud? What applications are suitable for clouds?
• How usable are cloud environments for scientific applications?
• When is it cost effective to run DOE HPC science in a cloud?

We summarize our findings here:

Finding 1. Scientific applications have special requirements that require solutions that are tailored to these needs.

Cloud computing has developed in the context of enterprise and web applications that have vastly different requirements compared to scientific applications. Scientific applications often rely on access to large legacy data sets and pre-tuned application software libraries.
These applications today run in HPC centers with low-latency interconnects and rely on parallel file systems. While these applications could benefit from cloud features such as customized environments and rapid elasticity, these need to be in concert with other capabilities
