SUSTAINED PETASCALE IN ACTION: ENABLING TRANSFORMATIVE RESEARCH 2014 ANNUAL REPORT SUSTAINED PETASCALE IN ACTION: ENABLING TRANSFORMATIVE RESEARCH 2014 ANNUAL REPORT

Editor Nicole Gaynor

Art Director Paula Popowski

Designers Alexandra Dye Steve Duensing

Editorial Board William Kramer Cristina Beldica

The research highlighted in this book is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of . Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

Visit https://bluewaters.ncsa.illinois.edu/science-teams for the latest on Blue Waters- enabled science and to watch the 2014 Blue Waters Symposium presentations.

ISBN 978-0-9908385-1-7 A MESSAGE FROM BILL KRAMER

TABLE OF CONTENTS

3 A MESSAGE FROM BILL KRAMER Welcome to the Blue Waters Annual Report for how Blue Waters serves as a bridge to even more 2014! powerful computers in the future. 4 WHAT IS BLUE WATERS? This book captures the first year of Blue Waters continues its commitment to full production on Blue Waters since the building the next generation of our workforce by 5 BLUE WATERS SYMPOSIUM 2014 started full service on April recruiting dozens of graduate and undergraduate 2, 2013. We’ve had a great year, with many students into our education programs. For many 6 COMMUNITY ENGAGEMENT & EDUCATION researchers transforming knowledge in their of these students this is their first exposure to respective fields. supercomputing, but some, such as our Blue 8 MEASURING BLUE WATERS As of this writing, we have 124 science teams Waters Fellows, have decided to base their entire from well over 50 institutions and organizations careers on advanced modeling and simulation 14 SYMPOSIUM WORKING GROUP REPORTS using NSF’s most powerful system. Many of these or data analytics. teams have worked with our Blue Waters staff to As we compiled this report, the scale of 22 EXTENDED ABSTRACTS ensure that their research runs as productively achievement that this project enabled became as possible, whether that means optimizing the apparent. We are proud to have been a part of 22 space science code or implementing new resource or storage it and look forward to continuing our services management methods. for more and bigger science and engineering for 46 geoscience Blue Waters has provided exceptional service years to come. 66 physics & engineering to the nation’s science, engineering, and research communities. With a balanced and integrated 94 computer science & engineering system with very high sustained computational performance, extraordinary analytical 104 biology & chemistry capabilities, very large memory, world-leading storage capacity and performance, leadership- William C. Kramer 144 social science, economics, & humanities level networking, and an advanced service Project Director & Principal Investigator 152 REFERENCES architecture, the Blue Waters system and staff are KB = kilobytes empowering teams across all NSF directorates to 160 INDEX TB = terabytes do breakthrough science that would otherwise PB = petabytes be impossible. I/O = input/output In May 2014, science and engineering partners, Mnh = million node hours staff, and others associated with the Blue Waters project met face-to-face at the 2014 Blue Waters Allocations denoted as type/ Symposium. Not only did the researchers talk size in extended abstracts. about their accomplishments on the already- existing massive machine, but we also discussed

3 WHAT IS BLUE BLUE WATERS WATERS? SYMPOSIUM 2014

Blue Waters is one of the most powerful Compare this to a typical laptop, which has On May 12, 2014, Blue Waters supercomputer Laboratory at Virginia Tech, said he simulated BACKGROUND in the world and the fastest one processor—1/16 of an XE node—with 4 GB users and many of the NCSA staff who support one scenario of disease propagation for the entire IMAGE: supercomputer at a university. It can complete of memory and half a terabyte of storage. their work converged in Champaign, Illinois, for U.S. population for four months in just 12 seconds Paul Woodward more than 1 quadrillion calculations per second To backup or store data from the file systems the second annual Blue Waters Symposium. The using 352,000 cores. He estimated that the world gave a talk on on a sustained basis and more than 13 times that for longer periods, a nearline tape environment ensuing three days were filled with what many of population would take 6-10 minutes per scenario, his work related at peak speed. The peak speed is almost 3 million was built using Spectra Logic T-Finity tape them would later refer to as a wonderful variety though he emphasized that a realistic assessment to stellar times faster than the average laptop. libraries, a DDN disk cache, and IBM's HPSS. of science talks and opportunities for networking of disease threat would require many such runs. hydrodynamics. The machine architecture balances processing This system provides over 300 PB of usable and collaboration. speed with data storage, memory, and storage (380 PB raw). communication within itself and to the outside The supercomputer lives in a 20,000-square- SHARED EVENINGS, COMMON world in order to cater to the widest variety foot machine room, nearly a quarter of the EFFICIENT DISCOVERY THROUGH GOALS SUPERCOMPUTING possible of research endeavors. Many of the floor space in the 88,000-square-foot National The most popular speaker of the symposium projects that use Blue Waters would be difficult Petascale Computing Facility (NPCF) on the The science talks ranged from high-energy was Irene Qualters, the director of the NSF or impossible to do elsewhere. western edge of the University of Illinois at physics to molecular dynamics to climate science Division of Advanced Cyberinfrastructure. She Blue Waters is supported by the National Urbana-Champaign campus. and even political science. Blue Waters enables spoke Thursday morning about the future of Science Foundation (NSF) and the University NPCF achieved Gold Certification in the U.S. more efficient progress in science, summarized supercomputing at NSF and encouraged users of Illinois at Urbana-Champaign; the National Green Building Council's Leadership in Energy Paul Woodward, professor of astronomy at the to work with NSF to ensure that the future of QUICK FACTS: Center for Supercomputing Applications (NCSA) and Environmental Design (LEED) rating system, University of Minnesota. Researchers can run supercomputing met their needs. The symposium Blue Waters manages the Blue Waters project and provides which is the recognized standard for measuring simulations quickly and then have more time to and NCSA’s Private Sector Program (PSP) annual users include: expertise to help scientists and engineers take sustainability in construction. The facility uses draw meaning from the results while someone meeting met for dinner Tuesday at Allerton Park 124 teams and full advantage of the system. three on-site cooling towers to provide water else runs their simulations. 
Ed Seidel, director of and Wednesday at Memorial Stadium, combining 719 researchers. Cray Inc. supplied the hardware: 22,640 Cray chilled by Mother Nature a large part of the NCSA, added that big computing and big data the most advanced computational science and XE6 nodes and 4,224 Cray XK7 nodes that year, reducing the amount of energy needed to will revolutionize science, whether physical industry teams in the country, according to Seidel. Symposium include NVIDIA graphics processor acceleration. provide cooling. The facility also reduces power or social, by making possible the formerly Seidel remarked after Wednesday’s dinner that attendees: 187. The XE6 nodes boast 64 GB of memory per node conversion losses by running 480 volt AC power impossible. Many problems are too complex to he heard a common need from PSP and Blue and the XK7s have 32 GB of memory. to compute systems, and operates continually at solve without such resources. Waters partners: an all-around system that not Blue Waters’ three file systems (home, project, the high end of the American Society of Heating, A few talks touched on social sciences that only can run simulations, but also analyze and and scratch) provide room for over 26 PB of Refrigerating, and Air-Conditioning Engineers initially seem incongruous with supercomputing. visualize data. online storage with a combined 1 TB/s read/write standards for efficiency. For example, Shaowen Wang, director of the new Science talks throughout the symposium rate for quick access while jobs are running. The CyberGIS Center at the University of Illinois at bespoke the advances that Blue Waters enabled. three file systems are assembled around Cray's Urbana-Champaign, is leading an exploration Additionally, researchers envisaged what they Sonexion Lustre appliances. The scratch file into minimizing bias in voting districts. Later in could achieve with the next generation of system is the largest and fastest file systems Cray the same session, Keith Bisset, research scientist supercomputers, looking toward the future of has ever provided. at the Network Dynamics & Simulations Science large-scale computing. 4 5 WHATCOMMUNITY IS BLUE at institutions across the country to participate. BLUE WATERS STUDENT Each course includes a syllabus with learning INTERNSHIP PROGRAM outcomes, 40 hours of instruction, reading The Blue Waters Student Internship Program assignments, homework and exercises, and is designed to immerse undergraduate and WATERS?ENGAGEMENT assessment of learning progress. graduate students in research projects associated Other courses include Parallel Algorithm with Blue Waters and/or the Extreme Science and Techniques in fall 2014 (Wen-Mei Hwu, Engineering Discovery Environment (XSEDE) University of Illinois at Urbana-Champaign) & EDUCATION efforts. Twenty one students were selected for and High Performance Visualization for Large- 2014-2015. The students attended a two-week Scale Scientific Data Analytics in spring 2015 institute in late spring 2014 to ensure they were (Han-Wei Shen, The Ohio State University). Blue familiar with parallel and distributed computing Waters welcomes faculty across the nation who tools, techniques, and technologies and with are interested in offering courses in this manner. computing environments like Blue Waters and XSEDE prior to the start of their year-long BLUE WATERS GRADUATE internships. A faculty mentor will guide each FELLOWSHIP PROGRAM intern in their use of HPC resources to solve science and engineering problems. 
Fellowships are awarded annually to students Applications are accepted during the spring. in multidisciplinary research projects including The selection criteria emphasize the likelihood The Blue Waters Community Engagement on subjects that facilitate effective use of system computer science, applied mathematics, and program enlists researchers, educators, HPC resources. Cray, NVIDIA, Allinea, HDF Group, of success, creating a diverse workforce, and computational sciences. Fellows receive a promoting excellence. center staff, campus staff, and undergraduate and INRIA have participated in past workshops. generous stipend, tuition allowance, and an and graduate students across all fields of study An ongoing monthly teleconference/webinar allocation on Blue Waters. A call for applications to participate as consumers and providers informs users of recent changes to the system goes out during late fall for fellowships starting REPOSITORY OF EDUCATION AND of knowledge and expertise. The program such as software, policy, or significant events as in the next academic year. TRAINING MATERIALS proactively promotes the involvement of well as upcoming training opportunities. Every Blue Waters selected ten PhD students as Blue Waters provides access to education and under-represented groups. Community needs other month a guest presenter adds relevant Blue Waters Fellows for 2014-2015: Kenza training materials developed by practitioners drive these activities and we welcome your topical content to the monthly user group Arraki (New Mexico State University), Matthew to foster the development of a broader, recommendations and suggestions (please meeting. Topics have included Globus Online Bedford (University of Alabama), Jon Calhoun well-educated community able to conduct contact the Blue Waters Project Office at bwpo@ data movement, parallel HDF5, and Lustre best (University of Illinois at Urbana-Champaign), computational science and engineering research ncsa.illinois.edu). practices. Alexandra Jones (University of Illinois at Urbana- using petascale technologies, resources, and Up to 1.8 million node-hours per year are Champaign), Sara Kokkila (Stanford University), methods. The materials include education and dedicated to educational use, which includes Edwin Mathews (University of Notre Dame), VIRTUAL SCHOOL OF course modules related to petascale computing, support of formal courses, workshops, summer Ariana Minot (Harvard University), George COMPUTATIONAL SCIENCE AND materials from training and workshop events, schools, and other training designed to prepare Slota (Penn State University), Derek Vigil- ENGINEERING and other resources contributed by the HPC the petascale workforce. To date, Blue Waters has Fowler (University of California, Berkeley), and community. Included in the repository are 30 supported more than a dozen events. The Blue Waters team conducted a pilot web- Varvara Zemskova (University of North Carolina undergraduate course modules developed with based college course in collaboration with at Chapel Hill). They attended the 2014 Blue Blue Waters support that have been viewed professor Wen-mei Hwu from the Department Waters Symposium, began their research in the TRAINING AND WORKSHOPS more than 28,000 times, and have each been of Electrical and Computer Engineering at the fall of 2014, and will present their findings at the downloaded approximately 6,000 times. Experts in the field provide training opportunities University of Illinois at Urbana-Champaign 2015 Blue Waters Symposium. 
Nine additional that bring petascale resources, tools, and methods during the spring of 2013. Two collaborating students were named Blue Waters Scholars and to the science and engineering community. faculty at Ohio State University and the granted allocations on Blue Waters. Training is offered in person and through video University of Minnesota hosted the course so Over three years, this fellowship program will conferences, webinars, and self-paced online students at their home campuses could receive award more than $1 million and more than 72 tutorials. Recent and upcoming topics include credit. Professor Hwu recorded lectures, and million integer-core equivalent hours to support OpenACC, advanced MPI capabilities, GPGPU participants on each campus watched the videos graduate research. programming, data-intensive computing, and on their own schedule and then discussed them Training materials can be accessed at the following web addresses: scientific visualization. with local faculty. • undergraduate - https://bluewaters.ncsa.illinois.edu/undergraduate Specifically for its partners, the Blue Waters Because of the success of this pilot program, • graduate - https://bluewaters.ncsa.illinois.edu/graduate project provides hands-on workshops by Blue Blue Waters is working with faculty to offer more • user community - https://bluewaters.ncsa.illinois.edu/training Waters staff and invited speakers twice a year semester-long online courses to allow students

6 7 2014

MEASURING return to service. Partial or degraded service is counted as part of the total outage. Scheduled availability gives an indication of how reliable the system is for science partners BLUE WATERS during the time the system is planned to be available (table 1). In all quarters, we have exceeded our required overall goal of 90-92% availability, sometimes substantially. On a monthly basis, the availability looks excellent, falling into a range of 95-98% (fig. 1). Mean time between system-wide failure (MTBF) is computed by dividing the number of hours the system was available in a month by the number of system-wide interrupts (or one if there are no interrupts) and then converting to days. Since full service began, the goal for MTBF FIGURE 2: System-wide mean time between has been greater than or equal to five days. Once failures for the entire Blue Waters system. again, we exceeded our goals on a quarterly basis. Taking a monthly view, the measured MTBF To assure excellent service, the Blue Waters message. Rather than metrics that measure was above the target for 10 of the 12 months in project tracks multiple metrics of success from activity amounts (e.g., number of users, number Blue Waters’ first year of service (fig. 2), which is usage to downtime to service requests. These of jobs) or rates, the Blue Waters project team remarkable for a system that is 50% larger than metrics aim to ensure that we provide a reliable, works hard to measure quality in addition to any other system Cray has delivered. Overall the high-performance system that accelerates activity. For example, the number of service largest impact to science teams is unscheduled discovery across a variety of disciplines. requests submitted by Blue Waters partners— outages and thus reducing that type of outage Target values for the control metrics have been all the institutions, organizations, and companies remains a key focus of the NCSA and Cray teams. tightened up after six months of operations as that support and use the supercomputer—may A node interrupt is defined as a node failure we gained experience with the system. Overall indicate quality issues with the system, or it may that results in a partner job failure or serious for the first year of operations, the Blue Waters indicate an open and proactive relationship with impact to a job. The interrupt rates are relatively project met or exceeded the expectations for the an increased numbers of partners. Such data stable and generally below three node interrupts vast majority of our stringent control metrics. often has to be analyzed in detail to understand per day. Given the node count of the Blue Waters In reading this report, one must keep in whether an effort is meeting its mission and system, this value is well below projected mind that the data can be very complex and whether the quality of service is at the expected interrupt rates and translates to served decades FIGURE 3: Breakdown of partner service requests can change over the course of the project, so level. In the following, we report on the status of of MTBF per individual node. by type. single data points often do not provide a clear a few of the Blue Waters control metrics.

SERVICE REQUESTS SYSTEM AVAILABILITY Helping our partners effectively use a very System availability has real value to our partners. complex system is a key role of Blue Waters staff. When evaluating system availability we use Obviously, correctly resolving an issue in a short criteria that are more stringent than typical, so time indicates a good quality of service and, most one should take care when comparing. importantly, a higher degree of productivity for For example, for Blue Waters a service science partners. Table 1 shows measures of our interruption is any event or failure (hardware, response time to partner service requests (which software, human, environment) that disrupts the are “trouble tickets” plus requests for advanced specified service level to the partner base for a support, special processing, etc.) for the first specified time period. We call it unscheduled if we quarter of 2014 (other quarters are similar). In give less than 24 hours’ notice of an interruption, all areas except giving at least seven days’ notice though we aim for at least seven days’ notice. The for major upgrades and planned system changes, duration of an outage is calculated as the time we consistently met or exceeded our goals. For during which full functionality of the system is all the announcements that missed the seven FIGURE 4: Accumulated usage of XE nodes by job FIGURE 1: Scheduled system availability. size. Red line indicates that 50% of the actual unavailable to users, from first discovery to full days’ notice mark, only one had less than six usage comes from jobs using ≥2,048 nodes (65,536 integer-core equivalents). 8 9 BLUE WATERS ANNUAL REPORT 2014

days’ notice. That lone event was a security- have more than 20% the number of nodes on related update performed the same day it was Blue Waters, and almost all of them have slower announced and was transparent to those using processors. the system. It was deemed important enough As a percentage of their respective totals, XE not to wait for seven days. We treated it as less very large jobs accounted for 15.4% and XK very than 24-hour notice. large jobs accounted for 57.7% of the node hours Service requests do much more than just used. report trouble. They also include requests for Expansion factor is an indication of the advanced support, expanded assistance, added responsiveness of the resource management to services, and suggestions. Fig. 3 shows the relative work submitted by the science and engineering frequencies of different types of partner service teams. Expansion factor is defined as the time requests in the first quarter of 2014. The total the job waits in the queue plus the requested number of service request during the quarter was wall time of the job divided by the requested 388. Accounts & Allocations and Applications, wall time of the job. On many systems, larger the two largest categories, each made up about jobs are typically more difficult for the system to FIGURE 5: Accumulated usage of XK nodes by job a quarter of the requests. All accounts were schedule. However, on Blue Waters the emphasis FIGURE 8: Distribution of annual actual usage per size. Red line indicates that 50% of the actual discipline area across all allocated projects. XK usage comes from jobs using ≥1,024 nodes. installed within one business day following the is on offering an environment where the partners receipt of all required information. Eighty percent can take full advantage of the size and unique of all other service requests were resolved in less capabilities of the system. For example, the than three business days. Some requests will scheduler has been configured to prioritize large always take longer than three days to resolve, and very large jobs, thus making it easy for the such as requests for additional software or for partners to run their applications at scale. Not help with code optimization; the average time large jobs wait in the queue for less time than the to resolution for these more time-consuming requested wall time, on average, independent of requests was 8.2 business days. the node type. Large jobs take about one to two times the requested wall time to start running, with jobs on XE nodes starting sooner than those PROCESSOR USAGE on XK nodes. Very large jobs wait in the queue From April 1, 2013, through March 31, 2014, for four to six times the requested wall time while partners used more than 135 million node hours Blue Waters collects the resources needed for on Blue Waters (more than 4.3 billion integer such a massive job. All in all, Blue Waters is very core equivalent hours). responsive and provides exceptional turnaround to the teams for all job sizes. FIGURE 6: Usage per job size category in terms of The job size corresponding to 50% of the actual FIGURE 9: Scratch file system daily activity. absolute millions of actual node hours. Orange usage is 2,048 nodes (65,536 integer cores) for the As might be expected, the most common run Blue is read, red is write activity. is XE node hours, blue is XK node hours. XE portion and 1,024 nodes for the XK portion of time is the current queue maximum of 24 hours the Blue Waters system, marked using horizontal (fig. 7). 
XE jobs have a larger distribution of run lines in fig. 4 and fig. 5, respectively. Note that times, likely due in part to their much larger node the horizontal scale on both of these figures is counts. Long run times are generally beneficial logarithmic. Overall the XK nodes delivered to partners since it reduces the overhead cost of 15.9% of the node-hours, which is only slightly job startup and teardown. higher than their relative fraction of the overall Comparing the breakdown of Blue Waters compute node count. node-hours usage by science discipline, Biology Fig. 6 presents another view of the usage per & Biophysics and Particle Physics each consume job size, where jobs have been categorized by slightly more than a quarter of the node hours their size (hereafter referred to as "not large", (fig. 8). Astronomy and Astrophysics is next in "large", and "very large"). Large jobs are defined line with 17% of the node hours, followed by as those using from 1-2% up to 20% of the system Atmospheric & Climate Science, Fluid Systems, size for the respective node types, with not large and Chemistry, each of which use 6-7% of system and very large below and above those cutoffs. time. Note that a not large job on Blue Waters is FIGURE 7: Distribution of actual usage by job actually larger than a full system job on many FIGURE 10: User data storage growth in the Blue duration. Orange is XE node hours, blue is XK Waters nearline storage. node hours. other systems. Very few systems in the world

10 11 BLUE WATERS ANNUAL REPORT

STORAGE USAGE much lower activity levels on the order of tens of terabytes in the home directories and hundreds The Blue Waters system has three separate file of terabytes in the project directories, as expected. systems totaling 35 raw PB (~26 usable PB): At the time of this writing, there are /home, /project, and /scratch. Home directories 55 partners in 26 projects actively storing data default to 1 TB and project directories default to 5 in the Blue Waters nearline tape subsystem for a TB for each project. Both are managed with user/ total of 6.5 PB of data and more than 44 million group quotas and neither is purged. Partners/ files by the end of the first year of production (fig. Projects can request more space in their project 10). Two very large projects have stored more directory as needed. Additionally the partners than 1.2 PB each. have access to the world’s largest nearline tape storage system. The /scratch file system consists of 21 PB of WRAP UP useable space. Files are purged if they have not The metrics presented for the first year of service been accessed within 30 days. This allows for very are high level and very brief. Blue Waters may be large quotas for many teams. The default quota the most instrumented system in the world, as for the scratch directory is 500 TB per project; we collect more than 7.7 billion monitoring data many teams and partners are granted increases points every day to help us understand, assess, to the default limits for specified time periods and improve how well Blue Waters is serving our by special request. partners. As discussed above, Blue Waters users, Fig. 9 shows an example of /scratch file system stakeholders, and reviewers believe our quality activity January-March of 2014. The /project and of service is exceeding expectations. /home file systems show similar variability with

METRIC TARGET DATA FOR 1/1/2014-3/31/2014 Service requests 95% of partner service requests are acknowledged 96% of partner service tickets had a human-generated are recorded and by a human-generated response within four working response within four business hours acknowledged in a hours of submission timely manner EXCEEDS EXPECTATIONS

Most problems 80% of partner service requests are addressed 80% of partner service requests were resolved within are solved within a within three working days, either by three business days reasonable time - resolving them to the partner’s satisfaction within three working days, or, MEETS EXPECTATIONS - for problems that will take longer, by informing the partner how the problem will be handled within three working days (and providing periodic updates on the expected resolution)

Accounts are installed 95% of accounts requests are installed within one 100% of account requests were resolved within one in a timely manner working day of receiving all necessary information business day of receiving all the required information from the partner/project leader. from the partner.

EXCEEDS EXPECTATIONS

Providing timely and All planned system outages announced at least 24 100% of planned system outages were announced at accurate information hours in advance. least 24 hours in advance.

MEETS EXPECTATIONS

All major upgrades and planned system changes 50% - One security related update was performed the announced at least seven days in advance. same day as the announcement.

BELOW EXPECTATIONS

Two planned outages had less than seven days’ notice, one six days 20 hours, the other six days and 8 hours.

TABLE 1: Metrics for service request response and resolution time.

12 2014

SYMPOSIUM WORKING processing still represent typical scenarios while • Use machine learning to extract data out of there is an increasing trend toward enabling in large generated datasets situ analytics and using visualization as a key • Support data compression for efficient storage input for steering simulations. and transfer GROUP REPORTS • Extend access to nearline storage for analysis Specific requirements for data @scale capabilities • Provide software-as-a-service support for data include management of not only data but also analytics @scale metadata. Both data and metadata are expected • Build fault tolerance capabilities into to grow explosively in nearly every domain due applications to continuing improvements of observational technologies and data-intensive scientific Education and workforce development: practices as well as the anticipated increase • Improve education of application scientists in computing power in the foreseeable future. regarding the capabilities for the state-of-art A major challenge in scalable management of data management, analysis, and visualization data and metadata is validation and verification, • Foster synergistic education efforts on data especially considering the related challenges of science and HPC capabilities fault tolerance on the computing side. A number of science scenarios were discussed Another major requirement addressed was data to elucidate these recommendations. DATA @SCALE • Are today's software and tools adequate for archiving, sharing, and movement. Generally your data movement needs? If not, what are speaking, data archiving, sharing, and movement Donald J. Wuebbles, a climate scientist at the Two group discussions were organized your recommendations for addressing the facilitate scientific data analysis that sometimes University of Illinois at Urbana-Champaign, at the 2014 Blue Waters Symposium by inadequacies? takes longer than the length of a project allocation described a scenario in which petabytes are taking an application-driven approach to on Blue Waters. Meeting the requirements for easily generated on Blue Waters by running 30 addressing application characteristics tied to Data Sharing: scientific analysis of massive simulation datasets different climate models. How can we manage technical requirements for current and future • What are your requirements for sharing your by pertinent communities therefore demands and quickly sift through such data @scale while scenarios in data @scale. The discussions data within your community? What about innovative mechanisms and services for data enabling pertinent scientific communities to focused on addressing full life cycles of data publicly? archiving, sharing, and movement. access related data and metadata? Furthermore, @scale innovation; data archiving and sharing; • What obstacles do you face that complicate data size and complexity will continue to algorithms, software and tools; education and your data sharing? Recommendations increase significantly as climate models pursue workforce development; and challenges and • How could today's software and tools be The following key recommendations were high spatiotemporal resolutions with improved opportunities. improved to advance data sharing capabilities? synthesized based on the group discussions. assimilation of observation data. Several • What is missing from today's capabilities? 
scientists also mentioned that their current Discussion questions Address the full life cycle of data: simulations easily generate terabytes or even The following discussion questions were posed Analysis, Software, and Tools: • Avoid the need to move data for analysis and petabytes of data. These datasets are often to the group participants. • What are major limitations of current software visualization too big to be moved anywhere else. Generally, and tools for your data handling? • Support data access beyond allocations to scientific communities need and would benefit General: • How do these limitations affect your projects? maximize scientific analysis and impact from have long-term access to examine such • What are the major challenges of data handling • Do you have any suggestions for eliminating • Enable analytics where data are located massive simulation datasets, which naturally for your applications? these shortcomings? • Provide dedicated resources for data analysis leads to increased data searching, publishing, • What new architecture, software, and tools • Do you need any software and tools for data sharing, and movement requirements. The will likely improve your data @scale practices? handling that are important to your projects Data archiving and sharing: National Data Service initiative led by NCSA • What should the National Science Foundation, but currently missing? • Provide data repository with efficient access was brought up as a fundamental solution to the University of Illinois at Urbana-Champaign, • Enable easy and secure data sharing meet such requirements. and the National Center for Supercomputing With the wide range of domains represented, data • Minimize impact on computational work (i.e., Applications be doing to help your projects handling requirements are significantly different decouple file systems from compute nodes While computational simulation represents a achieve desirable data handling? with regard to data I/O patterns (e.g., from one file such that post-processing does not impact major source of big data on Blue Waters, the per process to single shared file per application), simulations and vice-versa) ability to handle other sources of big data has Data Movement: file sizes (e.g., from a few kilobytes to a terabyte or become increasingly important. Scott Althaus, a • How easy and practical is it to move your more), software, and tools (e.g., MPI-IO, NetCDF, Algorithms, analysis, software, and tools: political scientist at Illinois, described a scenario datasets today? HDF, BoxLib). Furthermore, data analytics • Provide common libraries and utilities for data in which his project needs to move multiple • Is it sufficiently fast and simple? is diverse across domains. For example, for manipulation @scale terabytes of text data onto Blue Waters to develop simulation-centric applications, pre- and post- scalable data analytics and suggested there should 14 15 BLUE WATERS ANNUAL REPORT 2014

be opportunities to implement fast, easy to use, Sisneros, Robert Stein, Ilias Tagkopoulos, Rizwan Uddin, can target specifically X86, NVIDIA, or another great all-purpose tool that is portable, but its and secure data transfer services for long-tail Virgil Varvel, Jorge Vinals, Peter Walters, Liqiang (Eric) core architecture effectively. Two software learning curve is not trivial. Wang, David Wheeler, Don Wuebbles scientists who might not be familiar with related technologies that are portable between different high-performance computer tools. architectures today—OpenCL and the Thrust 4. Are you planning algorithm changes that would C++ Cuda library—revealed a ray of hope. A lead to better use of accelerators? MANY-CORE COMPUTING Multiple discussions emphasized that data few teams are using those now to generate code Teams that have not done so need to resolve the analysis workflows typically require interactive Science teams that use accelerators code close across all architectures. lower memory-to-core ratio they would have access to computational resources, for which the to the hardware for the most part (CUDA, or available on many-core devices. One brave job queue management approach does not work custom code generators). A couple of teams opt 2. What issues prevent you from porting your work person asked the question we all consider when well. The allocation of data @scale resources also for a portable approach to accelerators so that to many-core accelerators and what would make starting a move to accelerators: Will the time I needs to consider both computing and storage they can leverage PHI and NVIDIA architecture it more viable? spend working on algorithm improvement be requirements, coupled with software capabilities with a single code base using OpenCL or a Answers to this question reflected some of the worth it if I realize just a 1.5 or 2.0 x speedup? and customized to application characteristics in a portable library like Thrust. The cost of porting same themes from those for question one but cloud fashion. It is important to understand how to accelerators is seen as high (approx. one year’s with a few twists. Hardware limitations and Many-core in the future to support data-centric computational resources effort for a good programmer) and that’s been perceived performance gains are big factors. On the second day of the working group, we such as those based on Apache Hadoop for a barrier to uptake by smaller teams. Going There were several performance horror stories discussed Intel’s (Arnold) and NVIDIA’s (Hwu) enabling data-intensive analysis workflows that forward, there is hope that the next generation about MPI on Xeon PHI; OpenMP threads or hardware and software roadmaps. need to be integrated with Blue Waters. of accelerators will be on the motherboard, which Intel libraries are currently the ways forward on is anticipated to improve memory performance Xeon PHI. On the software side, getting started • The PCI bus is a limiting factor for both brands The participants of the working group were asked issues experience with the current PCI-based in CUDA is still perceived as a bit of a barrier, of accelerators and the future seems likely to to envision grand science drivers for data @scale approach. There is a strong desire for a portable and one team requested more sample codes and bring the accelerators to the motherboard. innovation. 
Larry Di Girolamo, an atmospheric language (OpenMP 4 or OpenACC), but at this how-to style programming guides to get up and Intel’s Knight’s Landing version of the Xeon scientist at Illinois, posed the question: How do time it is not clear which of those will endure. running. PHI is reported to support that capability, and we fuse petascale (or beyond) data from multiple NVIDIA is moving ARM processors (running geographically distributed sites to generate new The state of many-core The porting process is seen as a cost to the Linux) closer to the accelerator (see the scientific data products? Patrick Reed, a professor Hwu opened the first day of discussion with a teams. In some cases, the low memory-to-core NVIDIA Jetson board). We can expect more of civil and environmental engineering at Cornell review of the hardware differences between Blue ratio compared to general purpose CPU cores progress from NVIDIA in closing the physical University, asked the question: How do we Waters and Titan. Several of the known challenges requires significant algorithm changes. Science gap between processors and accelerators as perform interactive data analytics @scale for were listed (small DRAM, programming, teams may not consider adding a computer we have seen with the PHI. steering simulations? These questions suggest experience in production, etc.) and he described science staffer to their project as furthering that the convergence of computational and early success stories like NAMD, Chroma, and their science, especially if the payoff cannot be • Memory bandwidth and size are both data sciences is both desirable and synergistic. QMCPACK. While accelerator usage is high, the quantified up front. They perceive a significant increasing in next-generation hardware. As Such convergence is expected to fuel innovative number of teams using them and the diversity risk that performance gains may not be realized nodes become more compute capable (more integration of computing, data, and visualization of applications on the XK nodes is less than we even if they invest time and resources in porting threads and cores via many-core), the network capabilities. A great example of this is a typical hoped. their code to the accelerators. bandwidth is not expected to keep pace and workflow in CyberGIS (geographic information system balance is probably going to suffer. We science and systems based on advanced We then covered a handful of broad questions 3. Which tools do you find most (or least) useful may have to all learn to program like Paul cyberinfrastructure), where geospatial scientists about accelerator usage to get feedback from with accelerators (profiling, counters, debuggers, Woodward. from many domains focus on scientific problem the teams. etc.)? solving with seamlessly integrated compute-, There was universal agreement that vendors • Teams greatly desire access to device memory, data-, and visualization-driven capabilities 1. How portable is accelerator code and what is should focus on Linux and HPC as well as similar to GPUDirect from NVIDIA and provided through CyberGIS software and tools. being done to address issues of portability? Windows (or perhaps instead of Windows). Infiniband available through Intel and others. Most teams believe this is a big issue and a Many times, a tool waits an entire release cycle Lowering the latency by copying data only Moderators: Shaowen Wang (group leader), Jason Alt, challenge they need to address as they look into or more before it is ready for Linux. 
Science once (or not at all for upcoming motherboard Kalyana Chadalavada, Mark Klein adapting code for use on accelerators. Beyond teams are less interested than HPC center or socket-based accelerators) is a big performance Participants: Scott Althaus, Lauren Anderson, Hans- that, the response of the science teams varies. vendor staff in vendor tools. Most teams that are boon. Peter Bischof, Michelle Butler, Tom Cortese, Christopher Daley, Larry Di Girolamo, Joshi Fullop, Sharon Broude Some teams are still waiting for a winning serious about performance are timing their own Geva, Steven Gordon, Harriett Green, Falk Herwig, standard (perhaps OpenACC?) or cannot codes and are proud of that work. If this is the • We discussed the topic of weak vs. strong Barry Isralewitz, Nobuyasu Ito, Athol Kemball, Sara justify the porting expense at this time. With preferred approach, a possible path forward is to scaling. Not all codes behave the same way Kokkila, Quincey Koziol, Stuart Levy, Rui Liu, Lijun Liu, CUDA, you are locked into a vendor (NVIDIA). focus on more timing APIs and libraries that are and science teams require widely varying Edwin Mathews, Jeffrey McDonald, Ariana Minot, Fizza OpenMP 4 has “Open” in the name, but so far open, performant (high resolution), and portable. algorithms for their science. It is difficult to Mughal, Joe Muggli, Brian O'Shea, Leigh Orf, Ludwig Oser, Joseph Paris, Patrick Reed, Dave Semeraro, Rob only Intel supports it well. Other groups have TAU (Sameer Shende, University of Oregon), for build one system that handles both types of engineered code generators into their build and example, is widely regarded and respected as a codes equally well.

16 17 BLUE WATERS ANNUAL REPORT 2014

• Whatever the next generation brings, it should a path forward beyond the useful life of the first in situ with simulations or after the simulation the development of more complex models come with hardware performance counters Track-1 system. or analysis. and simulations with more attributes, more and support for tools that report the single- c. Malleable/elastic resource management for physical sub-processes, and higher degrees core performance in a straightforward way. While all teams indicated the need for more application load balancing and resiliency. of precision. Examples include the use of full- Scaling starts with core 0 (well, 1 for Fortran). computational and analytical resources and more d. Automation through workflows to support cloud models in weather and climate rather There is no motivation to do better if you networking, the needs of S&E communities are repeatability of computational/analytical than parameterizations, direct calculation of cannot tell how you are doing right now. diversifying as large instruments generate huge solutions. turbulence in fluids, and complete treatment quantities of data that must also be processed e. Use of data model programming methods, of fluids, magnetic fields, nuclear equations • The teams on Blue Waters get their network and analyzed, and as multiple communities often combined with more traditional math of state, and radiation transport for multiple performance metrics by direct observation. will need to work together to address modern model programming methods in a single particle species in relativistic astrophysics. Many teams have detected torus issues via grand challenges with multiple computing application or workflow. d. Increased number of ensemble† trials. variations in the wall clock time per simulation modalities. Hence, a Track-1 level system—the Ensembles provide statistical or other time step. The next machine should have more most powerful class of computation and analysis 2. Increased integration with data sources and information for uncertainty quantification bandwidth and a better scheduler that can system available—will need to be embedded more increased use of simulation data products. and probability analysis. For example, weather optimize geographic placement (topology) of deeply in a diverse ecosystem of instruments, a. Using data from multiple experiments and predictions may have up to 50 to 75 ensemble jobs to fully benefit from the communication data archives, smaller Track-2 systems, clouds, observations to set up the initial conditions members for a single prediction. Materials, architecture. and digital services to support the diverse needs for simulation is common in many fields. structures, biophysics, and astrophysics also of the communities NSF serves. For example, in computational biology the use ensembles. Note that this does not imply Moderators: Wen-mei Hwu (group leader), Galen interpretation of multiple experimental inputs smaller-scale runs, but rather more runs at Arnold, Gengbin Zheng Key trends requires computation of atomic-level models scale (32,000 to 320,000 core equivalents). Participants: Victor Anisimov, Kenza Arraki, Cristina Beldica, James Botts, Emil Briggs, Robert Brunner, The trends derived from the working group of very large macromolecular systems, like the Yongyang Cai, David Ceperley, James Coyle, Tomas at the 2014 Blue Waters Symposium and capsid of the AIDS virus, that are consistent 4. 
Longer simulated time periods are often Dytrych, Robert Edwards, Elif Ertekin, Yu Feng, John other sources may be evident first in best-of- with all experimental data types. required to accurately simulate the system of Grime, Dahai Guo, So Hirata, Alexandra L. Jones, breed implementations (aka, breakthrough, b. Assimilating observation data, steering during interest. Sometimes long simulated time periods Kenneth Judd, Andriy Kot, Jason Larkin, John L. Larson, hero, grand challenge) calculations. Best-of- simulations, and/or extensive post analysis and are the result of increases in fidelity. However, Chris Malone, Philipp Moesta, Thomas Quinn, Simon Scheidegger, Ashish Sharma, Karen Tomko, Frank Tsung, breed applications typically address a scientific validation (e.g., weather, climate, solar physics). simulations of larger systems often require Junshik Um, Derek Vigil-Fowler, Hiroshi Watanabe, discipline’s very important yet previously c. Many traditional best-of-breed modeling longer periods of time to stabilize, and in many Michael Wilde, Paul Woodward, Jejoong Yoo infeasible problems; Track-1 systems enable and simulation teams are realizing that using problems the time scales of natural processes solutions. Community standard practices Track-1 systems enables them to produce are longer than current simulations (e.g., in the advance as other science teams adopt the best- community datasets that are then useful for magnetosphere, global effects can occur on scales TRACK-1 SYSTEMS: of-breed techniques to solve different, new others to analyze in different ways. of days whereas kinetic simulations currently can TRENDS AND NEEDS problems. These are the trends that emerged as only simulate several hours). j After just one year of service, the current NSF we move to the second generation of Track-1 3. The need to dramatically increase fidelity in Track-1 supercomputing system, Blue Waters, systems: models and simulations to improve insights and 5. A long-range investment program for supports over 120 science teams and almost 600 address new problems. Fidelity increases tend to computational and analytical resources is of the country’s leading scientists. It has enabled 1. Changing workflow methods to accommodate be domain specific, but lead to more accurate required to go beyond the S&E of today transformative and wide-ranging impacts across computational methodologies. When size predictions and increases in the scope of the and be competitive in the world. To make a broad range of science and engineering (S&E) prohibits saving entire large datasets, the use problems that can be simulated. major improvements in the capability of S&E disciplines. of in situ visualization and analysis to reduce data a. Increasing use of multi-scale and multi- applications often requires development, testing, movement and speed time to solution becomes physics. These are needed to accurately explore and optimization at scale before production S&E Though Blue Waters has years of grand challenge necessary. This trend also involves the integration simulated phenomena. investigations can be performed. The time line science exploration remaining, now is the critical of some high-throughput work to analyze and b. Increasing resolution. Many areas require for NSF-related facilities, such as LIGO and time to identify the needs and desires of the reduce large-scale simulation results. orders-of-magnitude increases in resolution LSST, extend well beyond any planned funding nation’s science, engineering, and research a. 
Support of data streaming pipelines for to provide better insights. This is realized by for Track-1 and Track-2 resources. Until there is communities for large-scale systems to follow deadline-driven analysis for experimental and finer grids, more elements or atoms, more a sustainable investment program that is as long Blue Waters, wherever they are deployed. As one observational systems such as LSST, LIGO, and particles, etc., and by increased resolution as or longer than the lifecycles of NSF-related S&E team PI stated, the “first Track-1 system genomic sequencing. These could be primary in observations. A key example is modeling facilities, the facilities will continue to create available to the community has set in motion a support for experimental projects or back-end turbulence of complex flows and chemistries. stand-alone and redundant cyberinfrastructure, significant rethinking by NSF investigations of expansion and will require more integration of c. Increasing complexity. Increased which in the long run is more expensive for the what is possible and what is practical. It would workflow and resource management methods. understanding in physical models and community. be a very bad idea to nip this flowering of very b. Use of visualization to interpret and understand simulation studies, combined with increased large-scale computation in the bud” by not having the simulation and analysis results, whether detail in experiments and observations, drives

18 19 BLUE WATERS ANNUAL REPORT 2014

6. Increased number of problems to address. Future systems DEPLOYMENT SYSTEM TYPE PEAK NODES AGGREGATE ESTIMATED BW SPP ONLINE As it becomes possible for new best-of-breed The working group estimated possible Track-1 TIME FRAME (PF) MEMORY RUNNING ESTIMATE STORAGE simulations to study complex systems, the system capabilities and characteristics for three (PB) POWER (MW) (PF) CAPACITY (PB) solution of many other important problems time frames approximately four years apart to also becomes possible, thereby quickly elevating show the feasibility of systems that would be able 2012/2013 Reference: ~13 27,648 1.66 9-11 1.3 36 this level of simulation to community standard to support the evolving science requirements for current Track-1 – Blue Waters practice. For instance, the first 100 million all- the equivalent Track-1 investment level (see table atom simulations were completed in 2013. By at right). Because of the complexities, costs, and 2016/2017 X86 General ~55 ~19,200 ~3.5 9-10 7-9 ~200 2020 there will be tens to hundreds of teams challenges to S&E team productivity involved Purpose CPU doing hundreds to thousands of 100 million atom in managing and moving @scale datasets, it is (Based on Intel Skylake simulations to solve outstanding problems in most effective to have a single system that has Processor) biology. Similar situations exist in aircraft and a single, integrated communication fabric that engine design, drug discovery, weather and can support multiple workflows and modes of 2016/2017 Intel Many Core ~100 ~26,500 ~2.3 8-10 6-10 ~200 climate prediction, and many other fields. computing and analysis. For simplicity of the (Based on Intel’s Knight Landing examples, the characteristics of the alternatives Processor) 7. Changing algorithmic methods. S&E teams discussed below are based on a single system will substantially improve their algorithmic with homogeneous computational node types. 2016/2017 X86 CPU with ~100 ~21,000 ~1.2 9-11 7-11 ~200 methods to reach new research goals over the The best value choice for an actual Track-1 NVIDIA GPU (Based on next five to ten years—not just to address new system could be a combination of node choices Intel Skylake computer architectures, but also to improve connected with a common interconnect sub- Processor with the time to solution independent of hardware system, a uniform storage name space, and a NVIDIA GPU) changes and to develop the algorithms needed common, secure software environment, where 2020/2021 General Purpose ~200 ~30,000 ~10 12-14 40-50 ~400 for multi-physics and multi-scale simulations. multiple node types can efficiently run different CPU System This is a continuing re-engineering practice workloads in a single overall system. that is typically motivated by trying to use new 2020/2021 Accelerated, ~500 ~30,000 ~4.0-10 8-10 40-50 ~400 technologies or trying to get better results in the The alternatives presented here are developed Many Core System same or less time. based on interpreting vendor and technology a. Going forward, most S&E teams will change roadmaps. Consistent with the NSF Track-1 2024/2025 Accelerated, ~1,200 ~30,000 ~15-30 15-20 100-200 ~1,000 their algorithms to adjust to new system program goals, a key metric for the Track-1 Many Core architectures that require more concurrency follow-on is sustained performance for a wide System within and across nodes, less I/O and range of S&E applications. 
Here, sustained communication bandwidth, and less memory performance is defined as time to solution for per core. Additionally, teams will upgrade their real science, engineering, and research problems. j In this document fidelity means “accuracy in details” of the science problem. algorithms and work methods to improve the The optimization target that represents sustained quality and efficiency of their science output. performance in a meaningful manner for † In this document ensemble means running the same application and basic problem but with different initial conditions and/or system parameters in order to obtain high-confidence results and provide new An example is replacing particle mesh Ewald evaluation is the Sustained Petascale Performance ‡ insights. It may also mean running the same problem using different applications. It does not mean (PME) calculation with multi-level summation (SPP) Metric developed as part of the current running different problems. and higher order PME interpolation in all- Track-1 acceptance process. atom simulations. ‡ Kramer, W., How to Measure Useful, Sustained Performance. Supercomputing 2011 Conference (SC11), b. Use of adaptive gridding and malleable/ Moderators: William Kramer (group leader), Jeremy Seattle, Wash., November 12-18, 2011. elastic resource management will expand for Enos, Greg Bauer Participants: Ray Bair, David Barkai, Matthew Bedford, applications load balancing and resiliency. Timothy Bouvet, Jon Calhoun, Danielle Chandler, Improving load balancing is critical to Thomas Cheatham, Avisek Das, Lizanne DeStefano, overcoming both Amdahl’s law limits and Manisha Gajbe, Steven Gottlieb, Brian Jewett, Michael the increasing variation in system component Knox, Jing Li, Philip Maechling, Chris Maffeo, Celso performance, while resiliency is needed to Mendes, Omar Padron, James Phillips, Nikolai Pogorelov, Lorna Rivera, Harshad Sahasrabudhe, Mike Showerman, address the number of single-point failures Adam Slagell, Craig Steffen, Mark Straka, Lucas K. in systems with millions to billions of discrete Wagner, Paul Wefel components.

SPACE SCIENCE

ASTRONOMY
ASTROPHYSICS
COSMOLOGY
HELIOPHYSICS

24 The Deflagration Phase of Chandrasekhar-Mass Models of Type Ia Supernovae
26 Extreme Scale Astronomical Image Composition and Analysis
28 Petascale Simulation of Turbulent Stellar Hydrodynamics
30 Ab Initio Models of Solar Activity
32 Evolution of the Small Galaxy Population from High Redshift to the Present
34 Modeling Heliophysics and Astrophysics Phenomena with a Multi-Scale Fluid-Kinetic Simulation Suite
36 Formation of the First Galaxies: Predictions for the Next Generation of Observatories
38 Understanding Galaxy Formation with the Help of Petascale Computing
40 From Binary Systems and Stellar Core Collapse to Gamma Ray Bursts
42 Simulating the First Galaxies and Quasars: The BlueTide Cosmological Simulation
44 The Influence of Strong Field Spacetime Dynamics and MHD on Circumbinary Disk Physics

THE DEFLAGRATION PHASE OF CHANDRASEKHAR-MASS MODELS OF TYPE IA SUPERNOVAE

Allocation: NSF/8.44 Mnh
PI: Stan Woosley1
Collaborators: C. M. Malone1; S. Dong1; A. Nonaka2; A. S. Almgren2; J. B. Bell2; M. Zingale3

1 University of California, Santa Cruz
2 Lawrence Berkeley National Laboratory
3 State University of New York at Stony Brook

EXECUTIVE SUMMARY:
The unique resources of Blue Waters are used to address an important unsolved problem in computational astrophysics: the propagation of turbulent nuclear combustion inside a Type Ia supernova. Adaptive mesh refinement allows the burning front to be modeled with an unprecedented effective resolution of 36,864 zones (~136 m/zone; compare to the typical 1 km/zone found in the literature). The initial rise and expansion of the deflagration front are tracked until burning reaches the star's edge (~0.8 seconds). Pre-existing turbulence affects the propagation only at the earliest times and, even then, only for nearly central ignition. Even central ignition—in the presence of a background convective flow field—is rapidly carried off center as the flame is carried by the flow field. Overall, very little mass is burned in our models, resulting in very little expansion of the star; any subsequent detonation will therefore produce an exceptionally bright supernova.

INTRODUCTION
Type Ia supernovae (SNe Ia) are thermonuclear explosions of white dwarf stars made unstable by accretion of matter from a binary companion. Once the burning is underway, a subsonic flame burns through the star in about a second, turning carbon and oxygen fuel into mostly radioactive 56Ni ash. The amount of fuel that burns, and hence the energy and brightness of the supernova, is very sensitive to details of the flame propagation. This is a complicated problem because prior convection made the fuel turbulent and the burning itself creates more turbulence. Understanding exactly how SNe Ia explode has been a challenge of astrophysics for several decades. However, SNe Ia have been used as standardizable candles—objects of inferred intrinsic brightness that can be used to measure cosmological distances. Using SNe Ia in this way led to the discovery of the accelerated expansion of the universe.

In the Chandrasekhar-mass (MCh) model, a single white dwarf accretes material from its companion, slowly compressing and heating the core of the white dwarf until it is hot enough to enable carbon fusion and drive convection. This simmering phase lasts a century before the burning triggers a thermonuclear runaway near the center of the star, producing a flame (deflagration) in the turbulent interior of the white dwarf.

At its birth, the flame is less than a millimeter in thickness. Full-star calculations attempting to resolve the initial flame while simulating the entire star would need to cover length scales spanning more than ten orders of magnitude. This is not feasible with current supercomputers without approximations. However, modeling the effects of the turbulence is very important for flame propagation, as the small-scale turbulent eddies can wrinkle the otherwise smooth flame, subsequently increasing its propagation speed.

METHODS AND RESULTS
There have been few highly resolved studies of nuclear combustion inside stars and none at this resolution. Several novel results emerged. One was a better understanding of how the burning spreads. Five distinctive phases were observed: (1) nearly isotropic spread by laminar burning; (2) early floatation and the emergence of a single "vortex ring"; (3) fracturing of that ring and the development of more isotropic turbulence; (4) nearly constant-angle floatation with burning by entrainment; and (5) spreading by a lateral pressure gradient as the flame neared the surface. Each phase was successfully compared to analytic approximations.

Another aspect was exploring the effect on the burning of the turbulence initially present on the grid from prior simmering. We mapped the results of our previous simmering calculations (done elsewhere) directly into Castro for further evolution. For a typical ignition location (about 40 km off center) the background turbulent flow field had only a minor effect on flame propagation and nucleosynthesis. If one artificially chooses a more centrally located ignition point where the flame floats slowly, then the flame has more time to interact with the turbulent field, leading to more distortion of the flame and larger changes in nucleosynthesis. In the limiting case of central ignition, the presence of a turbulent field causes the flame to be pushed to one side and possibly entirely off center; the MCh model for SNe Ia produces asymmetric explosions, even in the fortuitous case of igniting a sphere at the white dwarf's center.

Our initial conditions used a single, 2 km radius ignition point. Previous studies in the literature have used ignition "points" 50 to 200 km in radius. Others have (unrealistically) used tens to hundreds of smaller (10 or 20 km in radius) ignition kernels distributed about or near the center of the white dwarf. One consequence of using our small ignition point is that less material burns and the white dwarf expands less than in other studies. This implies that if the flame turns into a detonation, the detonation will burn through relatively high-density material, producing copious amounts of 56Ni and an extremely bright supernova. On the other hand, if a detonation does not occur, the amount of 56Ni produced solely during the deflagration will yield an extremely faint supernova. The results of our simulations thus have important implications for the viability of the MCh model to explain the typical SN Ia.

WHY BLUE WATERS
Our prior simulations of the simmering phase in the MCh model showed that ignition occurred in a single zone of roughly (4 km)^3 in size. The requirement to resolve this hot spot with several tens of zones led to an initial fine-level resolution of about 100 m/zone, using adaptive mesh refinement. This configuration required a couple billion zones. As the flame grew, so too did the number of fine zones required to resolve the flame. This is definitely a large-scale, Blue Waters-class problem. The zones were spread over 4,096 MPI tasks, each with 16 threads, for a total of 65,536 cores. Checkpoint files made every couple of wallclock hours were between 150 GB and 250 GB in size and made use of Blue Waters' high performance I/O. Being able to quickly dump this data to archival storage via the Globus Online interface made managing the several tens of terabytes of data extremely pleasant.

PUBLICATIONS
Malone, C. M., A. Nonaka, S. E. Woosley, A. S. Almgren, J. B. Bell, S. Dong, and M. Zingale, The Deflagration Stage of Chandrasekhar Mass Models for Type Ia Supernovae. I. Early Evolution. Astrophys. J., 782:11 (2014), doi:10.1088/0004-637X/782/1/11.

FIGURE 1 (BACKGROUND): The color map shows the magnitude of vorticity, with white/yellow being regions of large vorticity and, therefore, relatively strong turbulence. The burning flame initially has a shape similar to a torus or smoke ring. As the burning bubble makes its way towards the surface of the star, the smoke ring shape breaks apart due to the turbulent flow, which pushes strong vortex tubes to the flame's surface. Unlike a smoke ring, however, our flame is continuously powered by thermonuclear reactions and does not dissipate within the star. Eventually, the vortex tubes penetrate the whole of the flame and the bulk flow inside the flame becomes turbulent. This leads to an accelerated entrainment of fresh fuel and increased burning.
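A quick back-of-the-envelope check of the run configuration described in the "Why Blue Waters" section above; the domain width used here is an assumed value chosen only to illustrate how the ~136 m/zone effective resolution and the checkpoint traffic come about, not a number taken from the simulation itself.

```python
# Illustrative arithmetic only; domain width is an assumption.
mpi_tasks, threads_per_task = 4096, 16
cores = mpi_tasks * threads_per_task                 # 65,536 cores total

effective_zones = 36_864                             # finest-level zones across the domain
domain_width_km = 5_000                              # assumed full-star domain width
zone_size_m = domain_width_km * 1e3 / effective_zones
print(f"cores = {cores}, zone size ≈ {zone_size_m:.0f} m")   # ≈ 136 m/zone

checkpoint_gb = 250                                  # upper end of checkpoint size
interval_hours = 2                                   # "every couple of wallclock hours"
print(f"checkpoint traffic ≈ {checkpoint_gb / interval_hours:.0f} GB/hour to archive")
```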


EXTREME SCALE ASTRONOMICAL IMAGE COMPOSITION AND ANALYSIS

Allocation: Illinois/0.025 Mnh
PI: Robert J. Brunner1,2
Collaborators: Britton Jeter2; Harshil Kamdar2; Eric Shaffer2; John C. Hart2

1 National Center for Supercomputing Applications
2 University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
We are using Blue Waters to create large-scale images from ground-based sky surveys. Projects like the recently completed Sloan Digital Sky Survey and the ongoing Dark Energy Survey obtain images by using charge-coupled devices (CCDs). These projects built large mosaic cameras by combining many CCDs to tile the focal plane of a telescope. By leveraging an open-source image composition tool, we are taking the individual calibrated images from each CCD and combining them to make large-area mosaic images. In addition, we can use other open source tools to combine large-area mosaics that were taken through different filters to make pseudo-color images of the universe. These images will be used for both public outreach and scientific exploration of the nearby universe.

INTRODUCTION
A standard cosmological model, the so-called Lambda Cold Dark Matter Cosmology (LCDM) [1], posits that we live in a spatially flat universe that is dominated by two controlling parameters: dark energy, which drives the expansion of space, and dark matter, which drives the formation of structure in the universe. Together, dark matter and dark energy account for ~96% of the total matter-energy content of the universe; the other 4% is normal baryonic matter (e.g., protons and neutrons). While quantifying the relative contributions of these dark components is a significant scientific achievement—it was Science magazine's "Scientific Breakthrough of the Year" in 2003 and resulted in the 2011 Nobel Prize in Physics—it also highlights the fact that we know very little about either of them.

A number of major projects are underway to acquire data that can expand our knowledge, like the Sloan Digital Sky Survey (SDSS) [2] and Dark Energy Survey (DES). The petascale Large Synoptic Survey Telescope (LSST) will begin later this decade and was recently deemed the most important ground-based astronomy project by a National Research Council committee in 2010. All of these projects (and others not listed) try to detect stars and galaxies in small image subsets in order to rapidly process the large data volumes. These search techniques find fewer nearby galaxies than the standard cosmological model predicts [3].

To improve detection, we are stitching together the small, calibrated images from the SDSS (and soon from the DES). By creating large, science-grade image mosaics, we can look for new galaxies or coherent streams of stars that are difficult or impossible to identify in the standard image frames.

METHODS AND RESULTS
The SDSS survey has archived over one million images 10 arcminutes by 13 arcminutes, covering 14,000 square degrees across five wavelength bands, and calibrated both astrometrically and photometrically. For our research, we are porting and optimizing the open-source Montage software to Blue Waters [4]. Currently we are using MPI, although we hope to explore adding OpenMP or OpenACC to capitalize on Blue Waters' capabilities more fully. Initially we used the XE6 nodes, which more closely match our proof-of-concept. Our ultimate goal is to make a single image of the entire SDSS survey in each band. We also plan to incorporate DES imagery.

A one-degree square image routinely takes around an hour to create for each of the five SDSS bands (u, g, r, i, and z). Each resulting mosaic is approximately 400 MB, assuming 4 bytes per pixel, and consists of over 100 individual field images (figs. 2-3 show RGB composites of three different bands). However, intermediate storage needs reach 10 GB per one-degree block because Montage aligns, registers, and stitches the images iteratively for each filter, which requires closer to 2 GB per filter for projection, background correction, and intermediate storage.

Once we finish developing our image-stitching pipeline on Blue Waters, we will extend this software stack to develop hierarchically larger images. (This approach underlies Google Earth and Google Sky.) We keep all data live on the disks to speed up subsequent reprocessing. By building these images, we will support object detection at a variety of levels as well as publish outreach tools and images that allow rapid panning and zooming. We also can use a separate tool called STIFF [5] to convert images taken through different filters into a color composite image for public viewing (as in figs. 2-3). Finally, we will use SExtractor [6], a standard source detection and extraction program in astronomy, to explore SDSS and DES data to look for previously unknown nearby galaxies and tidal streams.

WHY BLUE WATERS
We plan to complete our goal to create a single image of the entire SDSS in each band, each of which would exceed one terapixel, by the end of summer in 2014. The creation and subsequent analysis of each of these images is an extremely large computational challenge that is ideally suited to Blue Waters' large disk system, large memory nodes, and on-board GPU processors.

FIGURE 1: A mosaic of the Virgo cluster of galaxies made by using the SDSS imaging data.

FIGURES 2 + 3 (RIGHT): One-degree square images made by combining nearly one thousand calibrated images from the Sloan Digital Sky Survey: (top) M51, the Whirlpool Galaxy, and (bottom) the Coma Cluster of galaxies. Each of these images is four times the size of the full moon.
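The storage figures quoted in "Methods and Results" imply simple pixel arithmetic, sketched below. The numbers are illustrative only and ignore FITS headers, tile overlap, and compression.

```python
# Rough storage arithmetic for the mosaics described above (illustrative only).
bytes_per_pixel = 4
one_degree_mosaic_bytes = 400e6                    # ~400 MB per one-degree tile, single band
pixels_per_tile = one_degree_mosaic_bytes / bytes_per_pixel
print(f"~{pixels_per_tile / 1e6:.0f} Mpixel per one-degree tile")

survey_pixels_per_band = 1.0e12                    # "exceed one terapixel" per band
bands = 5                                          # u, g, r, i, z
full_survey_tb = survey_pixels_per_band * bytes_per_pixel * bands / 1e12
print(f"full-survey mosaics ≈ {full_survey_tb:.0f} TB across {bands} bands")
```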


PETASCALE SIMULATION OF TURBULENT STELLAR HYDRODYNAMICS

Allocation: NSF/3.88 Mnh
PI: Paul R. Woodward1
Co-PI: Pen-Chung Yew1
Collaborators: Falk Herwig2; Chris Fryer3; William Dai3; Michael Knox1; Jagan Jayaraj4; Pei-Hung Lin5; Marco Pignatari6; Ted Wetherbee7

1 University of Minnesota
2 University of Victoria, B.C.
3 Los Alamos National Laboratory
4 Sandia National Laboratory
5 Lawrence Livermore National Laboratory
6 University of Basel
7 Fond du Lac Tribal and Community College

EXECUTIVE SUMMARY:
We are exploiting the scale and speed of Blue Waters to enable 3D simulations of brief events in the evolution of stars that can profoundly impact their production of heavy elements. We are focusing on hydrogen ingestion flash events because these are potential sites for the origin of observed, strongly anomalous abundance signatures in stars that formed in the early universe. Hydrogen that is pulled down into a 12C-rich helium-burning convection zone would produce 13C, which is, via the 13C(α,n)16O reaction, a very strong neutron source for the production of heavy elements. These flash events, as well as many properties of how pre-supernova stars evolve, critically depend on convective boundary mixing, which demands a 3D treatment. We simulated H ingestion in a very late thermal pulse star, Sakurai's object, for which detailed observational data is available to validate our simulation methodology.

INTRODUCTION
We are interested in understanding the origin of the elements in the developing universe. Elements heavier than hydrogen and helium were manufactured within stars and later expelled into the interstellar gas to become incorporated in later generations of stars and planets. The first generations of stars played a particularly important role. The late stages of evolution of these stars can be strongly affected by hydrogen ingestion events. The products of nucleosynthesis are later expelled along with the outer envelopes of these stars, contributing to the gradual build-up of the chemical inventory that we find now in our solar system.

The H-ingestion events occur, for example, when a convection zone above a helium-burning shell reaches unprocessed hydrogen-helium gas above it in the asymptotic giant branch (AGB) stage of evolution of such stars. In order to understand the H-ingestion flashes, as well as the evolution of many other types of stars such as the pre-supernova evolution of stars that eventually explode, it is critical to be able to quantitatively simulate convective boundary mixing between the hydrogen-helium gas and the helium-carbon mixture below it.

METHODS AND RESULTS
The simulation of this process, if it is to yield accurate estimates of the elements that are produced, must be carried out in 3D. The entrainment of hydrogen-rich gas at the top of the convection zone is the result of complex, nonlinear shear instabilities which act against the stable stratification of the more buoyant hydrogen-rich gas. To accurately simulate this process we must resolve these unstable waves and also the thin layer in which the composition of the gas changes from the helium-carbon mixture of the convection zone to the hydrogen-helium mixture above. We require a fine grid and a numerical method capable of producing accurate results for modes that are only several cells in wavelength.

When the growing convection zone encounters the hydrogen-rich layers, it is deep in the sense that the ratio of the radii of its top and bottom boundaries is significant (i.e., about two or more). The depth of the convection zone implies that the convection cells that develop within it will be very large, so that only a few of the largest convection cells will fill the entire convection zone volume. Thus, we also require that our problem domain contain the entire convection zone, not just a small sector of it. Finally, we must carry the simulation through many turn-over times of the largest eddies in the convection zone so that entrainment of hydrogen-rich gas can react back on the flow to accelerate entrainment through burning of ingested hydrogen. This process is slow because the initial entrainment is small.

The above challenges to computation are met in this work by the combination of our PPMstar simulation code and the Blue Waters system. Our studies show that grids of 1,536^3 cells are sufficient to deliver accurate simulation of the entrainment of hydrogen-rich gas at the top of the helium-shell flash-convection zone, using the piecewise-parabolic method (PPM) for gas dynamics and the piecewise-parabolic Boltzmann advection scheme to follow the multi-fluid volume fraction. The result of this work-in-progress is that we have discovered a previously unknown global oscillation of shell hydrogen ingestion (GOSH). The GOSH is shown in fig. 1.

WHY BLUE WATERS
We need to carry our simulations forward for about 6 million time steps, so it is fortunate that our code runs at 10% to 11% of the 64-bit peak performance on Blue Waters when we run such a problem on 443,232 CPU cores in parallel. We see 0.42 to 0.44 Pflop/s sustained performance running in this mode, depending upon the mapping of our job to the machine's toroidal communications fabric. At this rate, it takes about three minutes (~26 time steps per second) to simulate one minute for the star. The hydrogen ingestion flash lasts for about one day, and the simulation shown in fig. 1 simulated 20 hours. It was carried out on the machine in a single four-day interval. Blue Waters is unique in enabling such a large and detailed simulation to be performed in so short a time.

PUBLICATIONS
Woodward, P. R., F. Herwig, and P. H. Lin, Hydrodynamic Simulations of H Entrainment at the Top of He-Shell Flash Convection. Astrophys. J., (submitted) arXiv:1307.3821.
Herwig, F., P. R. Woodward, P.-H. Lin, M. Knox, and C. L. Fryer, Global Non-Spherical Oscillations in 3D 4π Simulations of the H-Ingestion Flash. Astrophys. J. Lett., (accepted) arXiv:1310.4584.

FIGURE 1 (BACKGROUND): Entrainment of hydrogen-rich gas into the helium-shell flash-convection zone of the very late thermal pulse star called Sakurai's object and the subsequent development of the Global Oscillation of Shell Hydrogen ingestion (GOSH) at problem times 251, 626, 963, 970, 976, and 982 minutes (left to right, top to bottom) since the beginning of the simulation. Concentrations of entrained helium gas from large to small range in color from red (1.6×10^-2) to yellow (1×10^-3), white (1.6×10^-4), aqua (2.5×10^-5), and finally dark blue (3×10^-6). Superposed on this image of the entrained gas concentration is the rate of energy release from burning the entrained hydrogen (slowest to fastest combustion: dark blue, red, yellow, white). As combustion causes the entrainment rate to increase, hydrogen ingestion ultimately leads to global oscillations of shell hydrogen ingestion and burning, or GOSH. A movie can be found at www.lcse.umn.edu/MOVIES.
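The throughput figures in "Why Blue Waters" can be cross-checked with a few lines of arithmetic, as sketched below; the input numbers are taken directly from the text above.

```python
# Quick consistency check of the quoted throughput (illustrative arithmetic).
steps_per_second = 26
wall_seconds_per_star_minute = 3 * 60           # "about three minutes" of wall time

star_hours_simulated = 20
wall_hours = star_hours_simulated * 60 * wall_seconds_per_star_minute / 3600
print(f"20 star-hours ≈ {wall_hours:.0f} wall-clock hours "
      f"({wall_hours / 24:.1f} days)")          # consistent with a four-day window

total_steps = 6e6                               # full target of the study
print(f"6M time steps ≈ {total_steps / steps_per_second / 3600:.0f} wall-clock hours")
```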


AB INITIO MODELS OF SOLAR ACTIVITY

Allocation: NSF/6.5 Mnh
PI: Robert Stein1
Collaborators: Aake Nordlund2; Mats Carlsson3; William Abbett4; Bart De Pontieu5; Viggo Hansteen3; Jesper Schou6

1 Michigan State University
2 University of Copenhagen
3 University of Oslo
4 University of California, Berkeley
5 Lockheed-Martin Solar & Astrophysics Laboratory
6 Göttingen University

EXECUTIVE SUMMARY:
The objective of this project is to understand how solar magneto-convection powers the Sun's activity, heats the chromosphere and corona, and accelerates charged particles. To achieve this we have begun modeling the generation of magnetic fields by dynamo action and their emergence through the solar surface. The first step is to relax a model of solar magneto-convection from the top of the photosphere to a depth of 30 Mm in order to use results from deep convection zone dynamo and flux emergence calculations as boundary conditions for the surface magneto-convection simulations. Simultaneously, we investigate the role of surface dynamo action in the development of active regions.

INTRODUCTION
Earth's weather and space weather are controlled by our Sun—by its radiation, by coronal mass ejections into the solar wind, and by energetic particles originating from solar active regions. These, in turn, are controlled by the interaction of magnetic fields, convection, and radiation. How magnetic fields emerge through the photosphere and are shuffled around by the convective motions governs chromospheric and coronal heating and determines the generation of flares and coronal mass ejections. The behavior of magnetic fields at the solar surface is controlled by the solar convective dynamo. New magnetic flux emerging in active regions interacts with existing fields to release huge amounts of energy when the fields reconnect. This heats the local coronal environment to many millions of Kelvin and can produce flares and coronal mass ejections. Our project attempts to model this chain of events.

METHODS AND RESULTS
The formation and emergence of solar active regions is modeled by coupling simulations of dynamo-produced magnetic flux rising through the deep solar convection zone using the anelastic code ANMHD with the compressible radiation-dynamics code STAGGER and a chromosphere and corona simulation code called BIFROST. Each calculation provides boundary conditions for the model above it. To couple ANMHD and STAGGER we are extending the surface magneto-convection model downward to a depth of 30 Mm. The role of dynamo action near the solar surface in emerging magnetic flux and atmospheric heating is also being modeled using STAGGER.

Radiation transfer using STAGGER still takes a significant portion of the time (25%), mostly due to communication. In its current version, which is the result of several months of work by Nordlund, STAGGER runs half as fast on Blue Waters as on NASA's Pleiades supercomputer, but we are able to use 4-8 times as many processors, resulting in a significant speedup. We are working on schemes to run STAGGER on different processors, simultaneously with the magnetohydrodynamics, in order to further speed up the simulation.

The 96 Mm wide by 20 Mm deep extension to 30 Mm depth has completed 5.5 solar hours of simulation so far. The entropy in the extended region is nearly constant and equal to that in the original 20 Mm simulation (fig. 1). The convective and kinetic energy fluxes, however, are several times too large and are decreasing slowly with time. The convective cell structure and temperature-velocity correlations are still in the process of relaxing to their statistically steady-state form. Once this extension is relaxed, ANMHD and BIFROST are ready to couple to it and complete the chain.

Surface Dynamo
The dynamo calculation has a domain 96 Mm wide by 20 Mm deep. Thus far, it has run for 59 solar hours (about 1.25 turnover times), with the most recent 10 hours on Blue Waters at a resolution of 48 km horizontally and 12-78 km vertically on a grid of 4032x4032x500. The total magnetic energy has reached 2.7% of the kinetic energy and is increasing slightly slower than exponentially. The magnetic energy is roughly the same fraction of the kinetic energy at all depths (except in the photosphere) and both are power laws in the density.

Subsurface Structure of Sunspots
In our simulations, mini active regions are spontaneously produced by the magneto-convection and downflows without being inserted arbitrarily as initial conditions. As a result we can analyze the subsurface structure of pores (small sunspots but without penumbra because of the upper boundary condition of a potential magnetic field that we impose at the top of the photosphere). Fig. 2 shows a case of magnetic field lines twisted around a magnetic concentration that penetrates nearly vertically through the surface in one of the pores. Other cases of subsurface braiding of magnetic field lines are found.

Data for Analyzing Solar Observations
As an additional benefit, the datasets produced in the dynamo simulation have been useful for understanding solar observations [e.g., 1]. This data is also being used for analyzing local helioseismic inversion procedures. The data from the flux emergence calculations is being used to analyze Stokes spectra inversion procedures.

WHY BLUE WATERS
To model the emergence of magnetic flux through the solar surface requires simulating the time evolution of magneto-convection for many hours of solar time. To obtain results in a reasonable time requires as many processors as possible. Currently, only Blue Waters provides a substantial number of useable processors. The relaxation of magneto-convection down to a depth of 30 Mm is running on 64,000 processors on Blue Waters. The modeling of dynamo action in the top two-thirds of convective scale heights is running on 32,000 processors.

PUBLICATIONS
Nagashima, K., et al., Interpreting the Helioseismic and Magnetic Imager (HMI) Multi-Height Velocity Measurements. Solar Physics, (2014), doi:10.1007/s11207-014-0543-5.

FIGURE 1 (TOP RIGHT): Horizontally averaged (solid) and minimum and maximum entropy for the calculation relaxing the 30 Mm deep model, as a function of the depth grid number. The entropy at depth has become nearly constant already.

FIGURE 2 (BOTTOM RIGHT): Vertical magnetic field volume visualization with superimposed fluid streamlines looking downward from above. This magnetic concentration has a vortical fluid flow around its surface.
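For a rough sense of the data volume behind the 4032x4032x500 dynamo grid mentioned above, the sketch below estimates a single-snapshot memory footprint. The variable count per cell and the cores-per-node figure are assumptions made only for illustration; they are not the actual STAGGER data layout.

```python
# Illustrative memory estimate for the dynamo grid; per-cell variable count
# and cores-per-node are assumptions, not STAGGER internals.
nx, ny, nz = 4032, 4032, 500
cells = nx * ny * nz                              # ~8.1e9 grid cells

vars_per_cell = 8                                 # assumed: density, energy, 3 velocity, 3 B
bytes_per_value = 8                               # double precision
snapshot_tb = cells * vars_per_cell * bytes_per_value / 1e12
print(f"{cells:.2e} cells, ≈ {snapshot_tb:.1f} TB for one 8-variable snapshot")

nodes = 64_000 // 16                              # assume 16 cores per node
print(f"≈ {snapshot_tb * 1e3 / nodes:.2f} GB of field data per node over {nodes} nodes")
```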


EVOLUTION OF THE SMALL GALAXY POPULATION FROM HIGH REDSHIFT TO THE PRESENT

Allocation: NSF/9.38 Mnh
PI: Thomas Quinn1
Collaborators: Fabio Governato1; Lauren Anderson1; Michael Tremmel1; Charlotte Christensen2; Sanjay Kale3; Lucasz Weslowski3; Harshitha Menon3

1 University of Washington
2 University of Arizona
3 University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Creating robust models of the formation and evolution of galaxies requires the simulation of a cosmologically significant volume with sufficient resolution and subgrid physics to model individual star-forming regions within galaxies. This project aims to do this with the specific goal of interpreting Hubble Space Telescope observations of high redshift galaxies. We are using the highly scalable N-body/Smooth Particle Hydrodynamics modeling code ChaNGa, based on the Charm++ runtime system, on Blue Waters to simulate a 25 Mpc cubed volume of the universe with a physically motivated star formation/supernovae feedback model. This past year's accomplishments include running a pathfinding simulation at one-tenth the needed resolution, which we will use to study overall star formation histories and luminosity functions. We also significantly improved the parallel scaling of the ChaNGa simulation code and demonstrated strong scaling out to 524,000 cores.

INTRODUCTION
Understanding galaxy formation and morphology within a cosmologically significant survey requires the incorporation of parsec-scale physics in simulations that cover a gigaparsec or more. Recent work by our group has shown that with physically motivated subgrid models at roughly the 100 parsec scale, many of the morphological properties of galaxies based on star formation and feedback can be reproduced.

The Cold Dark Matter (CDM) paradigm for structure formation has had many successes over a large range of scales, from Cosmic Microwave Background fluctuations on the scale of the horizon to the formation and clustering of individual galaxies. However, at the low end of the galaxy luminosity function, the CDM theory and observations are somewhat at odds. In particular, the existence of bulgeless, cored small galaxies is not a natural prediction of CDM.

METHODS AND RESULTS
In our previous cosmology work, we used a "zoom-in" technique in which we selected halos from a large dark matter simulation and resimulated those halos with high resolution and gas dynamics. However, this technique has the serious shortcoming that the conclusions can be extremely biased by our selection of halos. Hence a simulation is needed in which all halos within a representative volume of the universe are simulated with high-resolution gas dynamics. Such a simulation is computationally challenging, but it will allow us to answer a number of outstanding questions that are difficult or impossible to answer with simulations of individual galaxies.

For this project, our simulations will model recent Hubble Space Telescope (HST) survey volumes from high redshift to the present with sufficient resolution to make robust predictions of the luminosity function, star formation rate, and morphologies appropriate for these surveys. These results can be directly compared with results from observational programs. We can therefore address some basic issues of the CDM model:
• How are dark matter dynamics and galaxy morphology connected?
• Does the standard ΛCDM model produce the correct number densities of galaxies as a function of mass or luminosity?
• What is the overall star formation history of the universe?

We have improved the scaling of our simulation code by identifying and addressing a number of bottlenecks that only became apparent when scaling beyond 16,000 processors. In particular, we addressed bottlenecks in the load balancing, the domain decomposition, and the tree building phase of our computation. We have moved to a hierarchical load balancer where decisions are distributed among subvolumes. Domain decomposition was optimized both in its serial performance and in its communication requirements by reusing information. Treebuilding originally required an "all-to-all" to identify where off-processor tree nodes were located, which has been replaced by a more dynamic algorithm.

We have run a "pathfinding" simulation at one-tenth the mass resolution of our proposed simulation. While this simulation does not have sufficient resolution to study the detailed morphology of individual galaxies, it can tell us about gross morphology and the star formation and merger history of these galaxies. This simulation is now being analyzed to make predictions about the luminosity function and star formation history of high redshift galaxies.

WHY BLUE WATERS
The mass and spatial resolution requirements for reliably modeling galaxy morphology have been set by our published resolution tests. Therefore the size of the simulation we will perform is set by the subvolume of the universe we wish to model. The HST observing program to survey star-forming galaxies in the Hubble ultra-deep field, "Did Galaxies Reionize the Universe?", was awarded hundreds of observation hours to determine the number, nature, and evolution of star-forming galaxies in the Hubble ultra-deep field. The approximate volume of this survey is equivalent to our proposed simulation volume. This volume will not only allow us to make direct comparisons with the survey, but also enhance its scientific return by understanding how those surveyed galaxies will evolve to the present.

This volume size and our required resolution give a total of 12 billion particles of gas and an equal number of dark matter particles. This is just over an order of magnitude larger than simulations that we could run on other resources to which we have access. Hence a sustained petascale facility like Blue Waters is essential for this simulation.

PUBLICATIONS
Menon, H., L. Wesolowski, G. Zheng, P. Jetley, L. Kale, and T. Quinn, ChaNGa: Adaptive Techniques for Scalable Uniform and Non-Uniform N-Body Cosmology Simulation. Supercomputing 2014, New Orleans, La., November 16-21, 2014. (submitted)
Quinn, T., Pathways to Exascale N-body Simulations. Exascale Comput. Astrophys., Ascona, Switzerland, September 8-13, 2013.
Quinn, T., ChaNGa: a Charm++ N-body Treecode. 11th Ann. Charm++ Wksp., Urbana, Ill., April 15-16, 2013.

FIGURE 1 (RIGHT): Gas distribution for our pathfinder simulation. This simulation represents a uniform cosmological volume that is 80 million light years on a side, contains ~2 billion particles, and is capable of resolving scales down to ~1000 light years. This resolves the morphologies of galaxies down to very small masses and gives us a large statistical sample of interesting objects. We evolved the simulation for ~1.5 billion years, creating a dataset of ~5 TB, which we will use to understand the formation and evolution of galaxies in the early universe.
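The hierarchical load balancer described above distributes decisions among subvolumes instead of balancing globally. The sketch below illustrates that two-level idea with a simple greedy heuristic; it is not the ChaNGa/Charm++ implementation, and the piece costs are synthetic.

```python
# Two-level load balancing sketch: balance across subvolumes first, then
# each subvolume balances its own pieces independently (no global all-to-all).
from heapq import heappush, heappop

def greedy_balance(costs, n_bins):
    """Assign weighted pieces to n_bins, always filling the lightest bin."""
    heap = [(0.0, b, []) for b in range(n_bins)]
    for piece, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, b, items = heappop(heap)
        items.append(piece)
        heappush(heap, (load + cost, b, items))
    return {b: items for _, b, items in heap}

# Synthetic tree-piece costs; a real run would use measured per-piece timings.
piece_costs = {f"piece{i:03d}": float(i % 7 + 1) for i in range(64)}

# Level 1: distribute tree pieces among subvolumes (groups of nodes).
by_subvolume = greedy_balance(piece_costs, n_bins=4)

# Level 2: each subvolume balances only its own pieces across its processors.
for sv, pieces in by_subvolume.items():
    local = {p: piece_costs[p] for p in pieces}
    per_proc = greedy_balance(local, n_bins=8)
    loads = [sum(local[p] for p in ps) for ps in per_proc.values()]
    print(f"subvolume {sv}: total {sum(loads):.0f}, max per-proc load {max(loads):.0f}")
```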


MODELING HELIOPHYSICS AND ASTROPHYSICS PHENOMENA WITH A MULTI-SCALE FLUID-KINETIC SIMULATION SUITE

Allocation: NSF/0.78 Mnh
PI: Nikolai V. Pogorelov1
Co-PIs: Sergey Borovikov1; Jacob Heerikhuisen1

1 University of Alabama in Huntsville

EXECUTIVE SUMMARY:
Plasma flows in space and astrophysical environments are usually characterized by a substantial disparity of scales, which can only be addressed with adaptive mesh refinement techniques in numerical simulations and efficient dynamic load balancing among computing cores. The Multi-Scale Fluid-Kinetic Simulation Suite is a collection of codes developed by our team to solve self-consistently the magnetohydrodynamics, gas dynamics Euler, and kinetic Boltzmann equations. This suite is built on the Chombo framework and allows us to perform simulations on both Cartesian and spherical grids. We have implemented a hybrid parallelization strategy and performed simulations with excellent scalability up to 160,000 cores. We present the results of our newest simulations of the heliopause and its stability, which explain the Voyager 1 "early" penetration into the local interstellar medium and help constrain its properties.

INTRODUCTION
Flows of partially ionized plasma are frequently characterized by the presence of both thermal and nonthermal populations of ions and neutral atoms. This occurs, for example, in the outer heliosphere—the part of interstellar space beyond the solar system whose properties are determined by the solar wind interaction with the local interstellar medium (LISM). Understanding the behavior of such flows requires that we investigate a variety of physical phenomena: charge exchange processes between neutral and charged particles, the birth of pick-up ions, the origin of energetic neutral atoms (ENAs), and solar wind turbulence, among others. Collisions between atoms and ions in the heliospheric plasma are so rare that they should be modeled kinetically. As a result, one needs a tool for self-consistent numerical solution of the magnetohydrodynamics (MHD), gas dynamics Euler, and kinetic Boltzmann equations.

The behavior of plasma and magnetic fields in the vicinity of the heliospheric termination shock and the heliopause is of major importance for the interpretation of data from the Voyager 1 and 2 spacecraft, the only in situ space missions intended to investigate the boundary of the solar system. Our team proposed a quantitative explanation to the sky-spanning "ribbon" of unexpectedly intense flux of ENAs detected by the Interstellar Boundary Explorer. Our physical model allowed us to constrain the direction and strength of the ISMF in the near vicinity of the global heliosphere [6,7]. With realistic boundary conditions in the LISM, we simulated the solar wind-LISM interaction and explained the sunward solar wind flow near Voyager 1, penetration of the LISM plasma into the heliosphere, and other phenomena [8,9].

METHODS AND RESULTS
Our Multi-Scale Fluid-Kinetic Simulation Suite (MS-FLUKSS) solves these equations using an adaptive-mesh refinement (AMR) technology [1]. The grid generation and dynamic load balancing are ensured by the Chombo package, which also helps preserve conservation laws at the boundaries of grid patches. To analyze the stability of the heliopause and investigate the flow in the heliotail, the local resolution of our simulations must be five to six orders of magnitude smaller than our typical computational region.

We focus on the two latest numerical results obtained on Blue Waters: (1) MHD-kinetic simulations of the plasma flow in a long heliotail, and (2) the heliopause instability as an explanation of the deep penetration of the LISM plasma into the heliosphere [2]. The interstellar magnetic field (ISMF) is draped around the heliotail in fig. 1. Fig. 2 shows the plasma density distribution in our solar wind-LISM interaction simulations, which revealed that solar cycle effects help destabilize the surface of the heliopause and allow deep penetration of the LISM plasma into the heliosphere; this agrees with Voyager 1 observations as it crossed the heliopause in mid-2012 at 121 AU from the Sun. Plasma and magnetic field distributions were used to model ENA fluxes [3] and cosmic ray transport [4]. Our kinetic neutral atom model turned out to be uniquely suited to investigate the structure of the heliospheric bow shock modified by charge exchange [5].

WHY BLUE WATERS
We used new possibilities provided by the NSF PRAC award on Blue Waters to model challenging space physics and astrophysics problems. We ported MS-FLUKSS to Blue Waters and used it to solve the Boltzmann equation for ENAs and ideal MHD equations for plasma using a global iteration approach. We implemented the following improvements: (1) plasma data and arrays storing the source terms for the MHD code are now shared among the cores of a single node (this was done by using a hybrid MPI+OpenMP parallelization); (2) load balancing is now a two-level algorithm that guarantees even workload between nodes and threads within a single node; (3) we use parallel HDF5 for in/out operations; (4) full 64-bit support was implemented to allow the code to handle more than 2 billion particles. As a result of these improvements, the code scales well to 160,000 cores [10]. A 650 GB data file containing 10 billion particles can be written in 32 seconds.

PUBLICATIONS
Borovikov, S. N., and N. V. Pogorelov, Voyager 1 near the Heliopause. Astrophys. J. Lett., 783 (2014), L16.
Luo, X., M. Zhang, H. K. Rassoul, N. V. Pogorelov, and J. Heerikhuisen, Galactic Cosmic-Ray Modulation in a Realistic Global Magnetohydrodynamic Heliosphere. Astrophys. J., 764 (2013), 85.
Zank, G. P., J. Heerikhuisen, B. E. Wood, N. V. Pogorelov, E. Zirnstein, and D. J. McComas, Heliospheric Structure: The Bow Wave and the Hydrogen Wall. Astrophys. J., 763 (2013), 20.

FIGURE 1 (TOP RIGHT): Interstellar magnetic field lines draped around the heliopause exhibit violent Kelvin-Helmholtz instability at distances above 1,000 AU in the tailward direction. The heliopause is shaped by the interstellar magnetic field. Also shown is the distribution of plasma density on the solar equatorial plane. The computational grid is a cube with 6,000 AU per side. It extends 5,000 AU into the tail. It is shown that the flow becomes superfast magnetosonic at about 4,000 AU.

FIGURE 2 (BOTTOM): Plasma density distribution in the meridional plane as defined by the solar rotation axis (vertical in this figure, with the Sun located at the point (1000, 1000) in this plane) and the vector of the LISM velocity at large distances from the heliosphere. The straight black line shows the current Voyager 1 trajectory. The two points on the line show positions of Voyager 1 with an interval of 2 years. One can see the termination shock and the heliopause, which is unstable near Voyager 1 due to a Rayleigh-Taylor instability caused by charge exchange.
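The parallel I/O figure quoted above implies the following simple arithmetic; the per-particle layout inferred at the end is an assumption for illustration, not the actual MS-FLUKSS file format.

```python
# Simple check of the quoted parallel write performance (illustrative only).
file_gb = 650
write_seconds = 32
particles = 10e9

print(f"aggregate write bandwidth ≈ {file_gb / write_seconds:.1f} GB/s")
print(f"≈ {file_gb * 1e9 / particles:.0f} bytes stored per particle")
# With 8-byte doubles, ~65 bytes/particle is on the order of eight quantities
# (e.g., position, velocity, weight) -- an assumption, not the real layout.
```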


FORMATION OF THE FIRST GALAXIES: PREDICTIONS FOR THE NEXT GENERATION OF OBSERVATORIES

Allocation: NSF/7.8 Mnh
PIs: Brian O'Shea1; Michael Norman2,3
Collaborators: James Bordner3; Dan Reynolds4; Devin Silvia5; Sam Skillman6,7; Britton Smith8; Matthew Turk9; John Wise10; Hao Xu2; Pengfei Chen2

1 Michigan State University
2 University of California, San Diego
3 San Diego Supercomputing Center
4 Southern Methodist University
5 Michigan State University
6 Stanford University
7 SLAC National Accelerator Laboratory
8 University of Edinburgh
9 Columbia University
10 Georgia Institute of Technology

EXECUTIVE SUMMARY:
We are investigating the earliest stages of cosmological structure formation—namely, the transition of the universe from a dark, empty place to one filled with stars, galaxies, and the cosmic web. In investigating the "cosmic Dark Ages," we focus on three specific topics: (1) the transition between metal-free and metal-enriched star formation, which marks a fundamental milestone in the early epochs of galaxy formation; (2) the evolution of the populations of high-redshift galaxies that will form Milky Way-like objects by the present day; and (3) the reionization of the universe by these populations of galaxies. Using Blue Waters, we have successfully modeled the formation of the first generation of metal-enriched stars in the universe.

INTRODUCTION
The mechanisms that control the formation and evolution of galaxies are poorly understood. This is doubly true for the earliest and most distant galaxies, where observations are limited and often indirect. It is critical to understand the properties of the first generations of galaxies because they reionized the universe, dispensed large quantities of metal into the low-density intergalactic medium, and served as the sites of formation and early growth for the supermassive black holes that are found at the center of every present-day massive galaxy.

METHODS AND RESULTS
In this PRAC project, we are investigating the earliest stages of cosmological structure formation—namely, the transition of the universe from a dark, empty place to one filled with stars, galaxies, and the cosmic web. In investigating the "cosmic Dark Ages," we focus on three specific topics: (1) the transition between metal-free and metal-enriched star formation, which marks a fundamental milestone in the early epochs of galaxy formation; (2) the evolution of the populations of high-redshift galaxies that will form Milky Way-like objects by the present day; and (3) the reionization of the universe by these populations of galaxies. All of these problems require simulations with extremely high dynamic range in space and time, complex physics that include radiation transport and non-equilibrium gas chemistry, and large simulation volumes. We are using the Enzo code [1], which has been modified to scale to a large number of cores on Blue Waters, the only machine that can satisfy the heavy data and communication needs.

Using Blue Waters, we have successfully modeled the formation of the first generation of metal-enriched stars in the universe and have shown that the strength of the primordial supernova (and the total quantity of metal produced) does not directly correlate to the properties of these first metal-enriched stars. In addition, the presence of dust (which may form in the ejecta of the first supernovae) can have a critical effect on metal-enriched star formation, directly resulting in additional cooling and the formation of additional molecular hydrogen that further increases cooling rates. This may cause additional fragmentation and lower mass stars. We also find that if these Population III stars form massive black hole/stellar binary systems, they are likely to be prodigious emitters of X-ray radiation. This radiation both heats and ionizes the intergalactic medium, in some cases to 10^4 Kelvin! This may be important for predicting the topology of the 21 cm neutral hydrogen signal, which low-wavelength radio arrays will detect in the coming years.

WHY BLUE WATERS
The simulations required to properly model the earliest galaxies require extremely high spatial and temporal dynamic range, complex physics, and, most importantly, radiation transport and non-equilibrium gas chemistry. Furthermore, large simulation volumes, and thus many resolution elements, are needed in order to model enough galaxies to be able to adequately compare theory to observations in a statistically meaningful way. Taken together, these require a supercomputer with large memory and disk space to accommodate huge datasets, large computational resources, and an extremely high-bandwidth, low-latency communication network to enable significant scaling of the radiation transport code. Blue Waters is the only machine available to the academic community that fits all of these requirements.

PUBLICATIONS
Xu, H., K. Ahn, J. Wise, M. L. Norman, and B. W. O'Shea, Heating the IGM by X-rays from Population III binaries in high redshift galaxies. Astrophys. J., (submitted) arXiv:1404.6555v2.
Wise, J. H., B. D. Smith, B. W. O'Shea, and M. L. Norman, The formation of metal-enriched stars from Population III supernovae: differences in enrichment history due to explosion energy. (submitted)

FIGURE 1 (BACKGROUND): Volume rendering of the matter density field from the central region in our "rare peak" simulation, which explores the formation of what will at the present day become a galaxy cluster. At early times, this is simply a large overdensity of small dwarf-like galaxies, but exploring the properties of such objects is critical to understanding how the first stages of structure formation take place.


UNDERSTANDING GALAXY FORMATION WITH THE HELP OF PETASCALE COMPUTING

Allocation: NSF/3.13 Mnh
PI: Kentaro Nagamine1
Collaborators: Ludwig Oser2; Jeremiah P. Ostriker2,3; Greg Bryan2; Renyue Cen3; Thorsten Naab4; Manisha Gajbe5

1 University of Nevada, Las Vegas
2 Columbia University
3 Princeton University
4 Max-Planck-Institut für Astrophysik
5 National Center for Supercomputing Applications

EXECUTIVE SUMMARY:
Understanding the formation of the present-day galaxy population is an outstanding theoretical challenge that will only be mastered with the help of high-performance computer simulations. This requires a dynamic range of more than ten orders of magnitude and a multitude of resolution elements when computing galaxy populations in huge cosmological volumes. Observational surveys encompass ~10^6 galaxies, whereas contemporary work on high-resolution cosmological zoom-in simulations usually covers less than a hundred objects. We developed a scalable approach to model full galaxy populations with a parallel ensemble of high-resolution simulations. We are also working toward updating the code to utilize the hybrid nature of modern supercomputers and to implement the Fault Tolerance Interface on Blue Waters to reduce checkpointing overhead and to recover jobs from partial node failures.

INTRODUCTION
Understanding the formation of the present-day galaxy population is an outstanding theoretical challenge that will only be mastered with the help of high-performance computer simulations. Embedded in the cosmic web, the structural properties of galaxies can only be predicted when the galactic domains of star formation—molecular clouds—are numerically resolved. This requires a dynamic range of more than ten orders of magnitude and an enormous number of resolution elements when computing galaxy populations in huge cosmological volumes.

Observed galaxies follow surprisingly tight fundamental scaling relations that any successful model of galaxy formation has to reproduce. Since no two galaxies look alike, we need to recover the statistical properties of the overall galaxy population instead of explaining the formation of a single galaxy like the Milky Way. Observational surveys encompass ~10^6 galaxies, whereas contemporary work on high-resolution cosmological zoom-in simulations usually covers less than a hundred objects. We are trying to mitigate this huge discrepancy in the number of galaxies seen in observed versus theoretical results. Larger scale full-box cosmological simulations can simulate thousands of galaxies at the same time; however, they lack the resolution to resolve the internal structure of galaxies.

METHODS AND RESULTS

HECA
Simply using ever larger computer clusters will not alleviate the problem. Once gas dynamic processes become important, smoothed particle hydrodynamics (SPH) as well as grid codes scale poorly to arbitrarily large problem sizes. With an increasing number of compute nodes, the resources lost due to communication overhead and load balancing grow, thereby limiting the problem sizes and/or resolution fineness that can be computed in a reasonable amount of time.

We developed a scalable approach to model full galaxy populations with a parallel ensemble of high-resolution simulations. Instead of simulating the full box at high resolution, we split it into tens of thousands of independent "zoom" calculations, and for each zoom run, only the region of interest is fully resolved while the rest of the cosmological volume is kept at lower resolution to provide proper gravitational tidal forces. The initial conditions of each zoom run are selected such that contaminated boundary regions do not impact the target of interest. We call this method the Hierarchical Ensemble Computing Algorithm (HECA). Initial scaling tests demonstrate that running large numbers of individual zoom simulations in parallel outperforms the traditional full-box approach, providing a way to efficiently use supercomputers like Blue Waters with ~10^5 compute nodes to study cosmological galaxy evolution.

Code improvements
We updated the TreeSPH code "GADGET-3" to account for physical processes that are important for forming galaxies with stellar properties that compare well to observed galaxies. The formation and evolution of galaxies from cosmological initial conditions up to the present day is simulated, considering the effects of radiative cooling from primordial gas, as well as gas enriched with metals and star formation. We included recently developed prescriptions for kinetic stellar feedback originating from asymptotic giant branch stars and supernovae of Type I and II [1,2]. Additionally we implemented kinetic feedback from active galactic nuclei [3]. On top of the physical modules, we used a novel implementation of SPH that is able to deal with the known shortcomings of SPH—mainly the inability to sufficiently capture hydrodynamical instabilities. This includes a pressure-entropy formulation of SPH with a Wendland kernel, a higher-order estimate of velocity gradients, a modified artificial viscosity switch with a strong limiter, and artificial conduction of thermal energy [4].

Each of the above-mentioned improvements overcame known problems in individually simulated galaxies (like insufficient angular momentum, overcooling, late quenching of star formation, metal enrichment of the IGM, and hydrodynamical instabilities). We would like to use the full set of improvements to simulate the statistical nature of a significant number of galaxies.

Work in progress
In addition to our algorithmic approach successfully making better use of the available resources, we are currently updating the code to utilize the hybrid nature of modern supercomputers (i.e., to use a shared-memory approach for parallelization on a compute node and distributed memory parallelization for communication between nodes). Furthermore, we are trying to implement the Fault Tolerance Interface on Blue Waters to reduce checkpointing overhead and to recover jobs from partial node failures.

FIGURE 1: Stellar distribution of a simulated Milky Way-size halo at the present day. The young stars (white) form an extended disk (white bar represents a scale of 30 kpc) similar in size to our galaxy, but the simulated galaxy still forms a larger fraction of bulge stars.
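HECA treats each zoom run as an independent job, so the ensemble is embarrassingly parallel. The sketch below illustrates that structure with a local process pool standing in for the batch system and job bundling actually used on Blue Waters; the function body is a placeholder, not the project's workflow code.

```python
# Minimal sketch of an ensemble of independent zoom runs (illustration only).
from concurrent.futures import ProcessPoolExecutor

def run_zoom(region_id: int) -> str:
    """Placeholder for one high-resolution zoom simulation of one region."""
    # A production run would build zoom initial conditions for this region and
    # launch the simulation code (e.g., GADGET-3) on its own node allocation.
    return f"region {region_id:05d}: done"

if __name__ == "__main__":
    regions = range(32)          # tens of thousands of regions in production
    with ProcessPoolExecutor(max_workers=8) as pool:
        for result in pool.map(run_zoom, regions):
            print(result)
```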


FROM BINARY SYSTEMS AND STELLAR CORE COLLAPSE TO GAMMA RAY BURSTS

Allocation: NSF/3.81 Mnh
PI: Peter Diener1
Collaborators: Philipp Moesta2; Sherwood Richers2; Christian D. Ott3; Roland Haas2; Anthony L. Piro2; Kristen Boydstun2; Ernazar Abdikamalov2; Christian Reisswig2; Erik Schnetter1,4

1 Louisiana State University
2 California Institute of Technology
3 University of Tokyo
4 University of Guelph

EXECUTIVE SUMMARY:
We present results of new 3D general-relativistic magnetohydrodynamic simulations of rapidly rotating, strongly magnetized core collapse. These simulations are the first of their kind and include a microphysical finite-temperature equation of state and a leakage neutrino approximation scheme. Our results show that the 3D dynamics of magnetorotational core-collapse supernovae are fundamentally different from what was anticipated based on previous simulations in 2D. A strong bipolar jet that develops in a 2D simulation is crippled by a spiral instability and fizzles in full 3D. Our analysis suggests that the jet is disrupted by an m=1 kink instability of the ultra-strong toroidal field near the rotation axis. Instead of an axially symmetric jet, a completely new flow structure develops. Highly magnetized spiral plasma funnels expelled from the core push out the shock in polar regions, creating wide secularly expanding lobes.

INTRODUCTION
Stellar collapse liberates gravitational energy of order 10^53 erg (100 B). About 99% of that energy is emitted in neutrinos, and the remainder powers a core-collapse supernova (CCSN) explosion, driving it outward. However, the neutrino mechanism appears to lack the efficiency needed to drive hyperenergetic explosions. One possible alternative is the magnetorotational mechanism. In its canonical form, rapid rotation of the collapsed core and magnetar-strength magnetic field with a dominant toroidal component drive a strong bipolar jet-like explosion that could result in a hypernova. A number of recent 2D magnetohydrodynamic (MHD) simulations have found robust and strong jet-driven explosions, but only a handful of 3D studies have been carried out with varying degrees of microphysical realism [3-6] and none have compared 2D and 3D dynamics directly.

METHODS AND RESULTS
We have carried out new full, unconstrained 3D dynamical spacetime general-relativistic MHD (GRMHD) simulations of rapidly rotating magnetized CCSN explosions. These are the first to employ a microphysical finite-temperature equation of state, a realistic progenitor model, and an approximate neutrino treatment for collapse and postbounce evolution. We compared the 3D simulations to 2D simulations that used identical initial conditions.

[...] postbounce magnetic field configuration is prone. The subsequent CCSN evolution leads to two large asymmetric shocked lobes at high latitudes (fig. 1), a completely different flow pattern from 2D. Highly magnetized tubes tangle, twist, and drive the global shock front steadily but not dynamically outward. A runaway explosion does not occur during the ~185 ms of postbounce time covered.

The high precollapse field strength of 10^12 G yields ~10^16 G in toroidal field and β = P_gas/P_mag < 1 within only ~10-15 ms of bounce, creating conditions favorable for jet formation. Yet, the growth time of the kink instability is shorter than the time it takes for the jet to develop. In a short test simulation with an even more unrealistic, ten times stronger initial field, a successful jet is launched promptly after bounce but subsequently also experiences a spiral displacement. Realistic precollapse iron cores are not expected to have magnetic fields in excess of ~10^8-10^9 G, which may be amplified to no more than ~10^12 G during collapse. The 10^15-10^16 G of large-scale toroidal field required to drive a magnetorotational jet must be built up after bounce. This will likely require tens to hundreds of dynamical times, even if the magnetorotational instability operates in conjunction with a dynamo.

The results of the present and previous full 3D rotating CCSN simulations suggest that MHD and also a variety of non-axisymmetric hydrodynamic instabilities will grow to non-linear regimes on shorter timescales, disrupting any possibly developing axial outflow. This is why we believe that the dynamics and flow structures seen in our full 3D simulation may be generic to the postbounce evolution of rapidly rotating magnetized core collapse that starts from realistic initial conditions.

If the polar lobes eventually accelerate, the resulting explosion will be asymmetric, though probably less so than a jet-driven explosion. The lobes carry neutron-rich (Ye ~0.1-0.2) material of moderate entropy (s ~10-15 k_b baryon^-1), which could lead to interesting r-process yields [...] is shut off. Unless their material has reached positive total energy, the lobes will fall back onto the black hole, which will subsequently hyperaccrete until material becomes centrifugally supported in an accretion disk. This would set the stage for a subsequent long gamma ray burst and an associated Type Ic-BL CCSN that would be driven by a collapsar central engine [7] rather than by a protomagnetar [8].

The results of our study highlight the importance of studying magnetorotational CCSN explosions in 3D. Future work will be necessary to explore later postbounce dynamics, the sensitivity to initial conditions and numerical resolution, and possible nucleosynthetic yields. Animations and further details on our simulations are available at http://stellarcollapse.org/cc3dgrmhd.

WHY BLUE WATERS
The 3D simulations require fast per-core performance in combination with an efficient communication network. A typical simulation employs nine levels of adaptive mesh refinement (AMR) to increase resolution where needed in the collapsing star and requires 4 TB of memory while producing 500 TB of simulation output. As the shockwave, launched by the sudden halt of the collapse due to the formation of the protoneutron star, expands to greater radii, the entire postshock region needs to be kept at a constant high resolution (~1.5 km) to guarantee stable MHD evolution. The AMR box covering the shocked region alone contains 200 million points and pushed even Blue Waters' computational infrastructure to its current limit. A typical simulation requires 20,000 cores on Blue Waters, and only on Blue Waters was our code able to scale efficiently to such numbers due to the availability of the outstanding interconnect infrastructure.

FIGURE 1: Volume rendering of entropy from the simulation in [9] at t − t_b = 161 ms. The z-axis (vertical) is the spin axis of the protoneutron star and we show 1,600 km on a side. The colormap for entropy is chosen such that blue corresponds to s = 3.7 k_b baryon^-1, cyan to s = 4.8 k_b baryon^-1 indicating the shock surface, green to s = 5.8 k_b baryon^-1, yellow to s = 7.4 k_b baryon^-1, and red to higher entropy material at s = 10 k_b baryon^-1. The outflows from the protoneutron [...]
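A few lines of arithmetic give a feel for the AMR configuration quoted in "Why Blue Waters"; the cubic shape assumed for the shocked region is only an illustration, not the actual grid geometry.

```python
# Illustrative arithmetic for the quoted AMR configuration (assumes a cubic region).
points_shocked_region = 200e6
dx_km = 1.5
side_points = points_shocked_region ** (1 / 3)
print(f"~{side_points:.0f}^3 points -> region ~{side_points * dx_km:.0f} km across")

total_memory_tb = 4
cores = 20_000
print(f"memory per core ≈ {total_memory_tb * 1e6 / cores:.0f} MB")

output_tb = 500
print(f"total output ≈ {output_tb / total_memory_tb:.0f}x the in-memory problem size")
```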
This will likely of magnetorotational core-collapse supernovae and an approximate neutrino treatment for The results of our study highlight the star and we show require tens to hundreds of dynamical times, are fundamentally different from what was collapse and postbounce evolution. We compared importance of studying magnetorotational 1600 km on a side. even if the magnetorotational instability operates anticipated based on previous simulations in the 3D simulations to 2D simulations that used CCSN explosion in 3D. Future work will be The colormap in conjunction with a dynamo. 2D. A strong bipolar jet that develops in a 2D identical initial conditions. necessary to explore later postbounce dynamics, for entropy is The results of the present and previous full simulation is crippled by a spiral instability and The 3D simulations require fast per-core the sensitivity to initial conditions and numerical chosen such that 3D rotating CCSN simulations suggest that fizzles in full 3D. Our analysis suggests that the performance in combination with an efficient resolution, and possible nucleosynthetic yields. blue corresponds MHD and also a variety of non-axisymmetric to s=3.7k jet is disrupted by an m=1 kink instability of the communication network. A typical simulation Animations and further details on our b hydrodynamic instabilities will grow to non- −1 ultra-strong toroidal field near the rotation axis. employs nine levels of adaptive mesh refinement simulations are available at http://stellarcollapse. baryon , cyan to linear regimes on shorter timescales, disrupting s=4.8k baryon−1 Instead of an axially symmetric jet, a completely (AMR) to increase resolution where needed in org/cc3dgrmhd. b new flow structure develops. Highly magnetized the collapsing star and requires 4 TB of memory any possibly developing axial outflow. This is why indicating the spiral plasma funnels expelled from the core push while producing 500 TB of simulation output. we believe that the dynamics and flow structures shock surface, seen in our full 3D simulation may be generic green to s=5.8k out the shock in polar regions, creating wide As the shockwave, launched by the sudden WHY BLUE WATERS b secularly expanding lobes. halt of the collapse due to the formation of to the postbounce evolution of rapidly rotating baryon−1, yellow A typical simulation requires 20,000 cores on magnetized core collapse that starts from to s=7.4k the protoneutron star, expands to greater b Blue Waters, and only on Blue Waters was our radii, the entire postshock region needs to be realistic initial conditions. baryon−1, and red code able to scale efficiently to such numbers due INTRODUCTION kept at a constant high resolution (~1.5 km) If the polar lobes eventually accelerate, the to higher entropy to the availability of the outstanding interconnect to guarantee stable MHD evolution. The AMR resulting explosion will be asymmetric, though material at Stellar collapse liberates gravitational energy infrastructure. 53 probably less so than a jet-driven explosion. The s=10k baryon−1. of order 10 erg/s (100 B). About 99% of that box covering the shocked region alone contains b lobes carry neutron-rich (Ye ~0.1-0.2) material energy is emitted in neutrinos, and the remainder 200 million points and pushed even Blue Waters’ The outflows from of moderate entropy (s ~10-15 k baryon−1), powers a core-collapse supernova (CCSN) computational infrastructure to its current limit. b the protoneutron which could lead to interesting r-process yields, explosion. 
A small fraction of CCSN explosions Our results for a model with an initial star (in the G means Gauss, 12 similar to what previous studies have found for are hyperenergetic (~10 B) and involve relativistic poloidal B-field of 10 G indicate that 2D and center) get a measure of their prompt jet-driven explosion. Even if the outflows [e.g., 1,2]. Importantly, all supernova 3D magnetorotational CCSN explosions are severely twisted, magnetic field B lobes continue to move outward, accretion in explosions connected with long gamma ray fundamentally different. In 2D, a strong jet- resulting in two strength. equatorial regions may continue, eventually (after bursts are of Type Ic-BL. Typical explosions may driven explosion occurs. In unconstrained 3D the giant polar lobes. 2-3 s) leading to the collapse of the protoneutron be driven by the neutrino mechanism, in which developing jet is destroyed by non-axisymmetric star and black hole formation. In this case, the neutrinos emitted from the collapsed core deposit dynamics, most likely caused by an m=1 MHD engine supplying the lobes with low-β plasma energy behind the stalled shock, eventually kink instability to which the toroidally dominated
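For readers unfamiliar with the plasma-β diagnostic quoted above (β = P_gas/P_mag), the following minimal Python sketch evaluates it for the ~10^16 G toroidal field mentioned in the text; the gas pressure is a made-up placeholder, so only the form of the calculation, not the specific number, should be taken from it.

```python
# minimal sketch of the plasma-beta diagnostic in Gaussian units;
# the gas pressure value below is a hypothetical placeholder.
import math

B = 1.0e16                          # toroidal magnetic field strength in Gauss
P_mag = B**2 / (8.0 * math.pi)      # magnetic pressure in erg/cm^3

P_gas = 1.0e30                      # assumed gas pressure in erg/cm^3 (illustrative only)
beta = P_gas / P_mag
print(f"P_mag = {P_mag:.2e} erg/cm^3, beta = {beta:.2f}")  # beta < 1: magnetically dominated
```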


SIMULATING THE FIRST GALAXIES AND QUASARS: THE BLUETIDE COSMOLOGICAL SIMULATION

Allocation: NSF/2.63 Mnh
PI: Tiziana Di Matteo1
Collaborators: Rupert Croft1; Yu Feng1; Nishikanta Khandai2; Nicholas Battaglia1

1Carnegie Mellon University
2Brookhaven National Laboratory

EXECUTIVE SUMMARY

Computational cosmology (simulating the entire universe) represents one of the most challenging applications for petascale computing. We need simulations that cover a vast dynamic range of space and time scales and include the effect of gravitational fields that are generated by (dark matter in) superclusters of galaxies upon the formation of galaxies. These galaxies, in turn, harbor gas that cools and makes stars and is being funneled into supermassive black holes that are of the size of the solar system.

We have started a full-machine run on Blue Waters, the BlueTide cosmological simulation. It is run with an improved version of the cosmological code P-Gadget. The simulation aims to understand the formation of the first quasars and galaxies from the smallest to the rarest and most luminous, and this process's role in the reionization of the universe. The simulation will be used to make predictions of what the upcoming WFIRST and James Webb Space Telescope (JWST; successor of Hubble, launch planned for 2018) will see.

FIGURE 1: (LEFT) The number of particles in cosmological, hydrodynamical simulations of galaxy formation as a function of year. The BlueTides simulation on Blue Waters is currently the largest simulation ever attempted. The green and blue colored images are the projection of the density field in the initial conditions and at the first snapshot (z=150) in the simulation. (RIGHT) Improved performance of the threading in P-Gadget. Significant improvement in wall clock time versus number of threads in our new version (bottom panels) and a schematic view of the non-blocking threading synchronization scheme (top panels).

METHODS AND RESULTS

The largest telescopes currently planned aim to study the "end of Dark Ages" epoch in the early universe, when the first galaxies and quasars form and reionization of the universe takes place. Our main production run, BlueTides, will use the whole Blue Waters machine. It will follow the evolution of 0.6 trillion particles in a large volume of the universe (400 co-moving Mpc on a side) over the first billion years of the universe's evolution with a dynamic range of 6 (12) orders of magnitude in space (mass). This makes BlueTides by far the largest cosmological hydrodynamic simulation with the "full physics" of galaxy formation (left panel of fig. 1).

BlueTides includes a complicated blend of different physics that is nonlinearly coupled on a wide range of scales, leading to extremely complex dynamics. Significant effort was invested in code development and model validation, with several improvements to the physical models in P-Gadget:

• A relatively new pressure-entropy smoothed particle hydrodynamics (SPH) formulation [1] replaces the old density-entropy formulation, which suppresses phase mixing in a non-physical way. We also improved the effective SPH resolution with a higher-order quintic kernel that reduces the shot noise level by a factor of two without additional memory usage.

• In the regime of the simulation, star formation is supply limited, so it is important to consider the abundance of H2 molecules, the direct supply of star-forming interstellar gas. We implemented a molecular H2 gas model based on work by Gnedin et al. [2].

• The simulation regime also overlaps the interesting Epoch of Reionization, when the entire universe turns from opaque to transparent. Traditionally a uniform UV field is introduced at the same time across space in hydrodynamical simulations. This is no longer a good approximation in a simulation of such large volume. We incorporate a patchy reionization model from Battaglia et al. [3]; the model introduces a UV field based on a predicted time of reionization at different spatial locations in the simulation.

We also improved the code infrastructure in several ways:

• Memory efficiency: We detached the black hole particle data from the main particle type, reducing the memory usage by one quarter for a problem of the same size. This allows us to model 600 billion particles using all of Blue Waters, while leaving some room for potential node failures.

• Maintainability: The redundant code in all major physical modules was rewritten based on a new tree walk module. This formed the basis for our improvement in the threading efficiency.

• Threading efficiency: We replaced global critical sections with per-particle (per-node) spin locks. Because the boundaries of thread subdomains are very small, the spin locks hugely improved threading efficiency. Even though the domain decomposition and Fourier transform remain sequential, the wall time improved by about a factor of two at 32 threads. The improved threading efficiency allows us to use fewer domains, which in turn further reduces the complexity of domain decomposition and inter-domain communication, improving the overall efficiency of the code (right side of fig. 1).

• I/O: We enabled HDF5 compression in the snapshot files. The compression reduces the size of a snapshot by ~30%-40%.

We have generated the initial conditions for the BlueTides simulation (left side of fig. 1). This corresponds to a random realization of the density field as measured by the WMAP satellite. The full-machine simulation is underway and has produced the first snapshot (z=150 in fig. 1). Further optimization of the PM solver is currently underway. The simulation is expected to complete in early summer 2014.

WHY BLUE WATERS

Simulation predictions are of prime interest to the community as the instrumental capabilities are just reaching the point that we can explore the young universe over the next couple of decades. The challenge in understanding this epoch of the universe is that extremely large volumes need to be simulated as the first objects are rare, while at the same time extremely high resolution is required as the first galaxies and quasars are expected to be small. Blue Waters makes this possible.

PUBLICATIONS

Hopkins, P. F., A General Class of Lagrangian Smoothed Particle Hydrodynamics Methods and Implications for Fluid Mixing Problems. Mon. Not. R. Astron. Soc., 428 (2013), pp. 2840-2856.

Gnedin, N. Y., K. Tassis, and A. V. Kravtsov, Modeling Molecular Hydrogen and Star Formation in Cosmological Simulations. Astrophys. J., 697:1 (2009), 55.

Battaglia, N., H. Trac, R. Cen, and A. Loeb, Reionization on Large Scales. I. A Parametric Model Constructed from Radiation-hydrodynamic Simulations. Astrophys. J., 776:2 (2013), 81.
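The HDF5 snapshot compression mentioned in the I/O bullet can be illustrated with a few lines of h5py; the dataset name, array size, chunk shape, and compression level below are assumptions for illustration, not the actual P-Gadget snapshot layout.

```python
# a minimal h5py sketch of per-chunk gzip compression for a snapshot-style dataset;
# names and sizes are placeholders, not the production snapshot format.
import numpy as np
import h5py

positions = np.random.default_rng(0).random((1_000_000, 3), dtype=np.float64)

with h5py.File("snapshot_000.hdf5", "w") as f:
    # gzip is applied chunk by chunk, so chunking is required for compressed datasets
    f.create_dataset(
        "PartType1/Coordinates",
        data=positions,
        chunks=(65_536, 3),
        compression="gzip",
        compression_opts=4,   # moderate level; trades CPU time for smaller files
    )
```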


THE INFLUENCE OF STRONG FIELD SPACETIME DYNAMICS AND MHD ON CIRCUMBINARY DISK PHYSICS

Allocation: NSF/1.09 Mnh
PI: Manuela Campanelli1
Collaborators: Scott C. Noble1; Miguel Zilhão1; Yosef Zlochower1

1Rochester Institute of Technology

EXECUTIVE SUMMARY

The observation of supermassive black holes on the verge of merger has numerous exciting consequences for our understanding of galactic evolution, black hole demographics, plasmas in strong-field gravity, and general relativity. Our project aims to provide the astronomy community with predictions of the electromagnetic signatures of supermassive black holes using realistic initial conditions, enough model cells to resolve the essential magnetohydrodynamic turbulence that drives accretion, and sufficient duration to understand their secular trends, all of which are now possible because of Blue Waters. Over the past year we have determined the regime where high-order post-Newtonian terms in our novel, analytic spacetime solution significantly affect the circumbinary disk's dynamics and predicted luminosity, and found that nearly all the key predictions survive using a less accurate post-Newtonian model. In addition, we have begun simulations in which the black holes reside within the numerical domain to explore the development and evolution of the mini disks around each black hole.

FIGURE 1: Gas accreting onto two orbiting black holes from a circumbinary disk. Colors represent the gas's mass density over eleven orders of magnitude on a logarithmic color scale, from dark blue (low) to dark red (high). Each black hole is represented by a black circle. Inset images show close-ups of the background image, which spans the simulation's entire spatial extent. GM/c^2 is the gravitational radius.

INTRODUCTION

Supermassive black hole (SMBH) mergers are believed to happen frequently at the core of most active galaxies. Such mergers would reveal important information on the birth and growth of their host galaxies, as well as explain how highly relativistic matter behaves in the surrounding accretion disks and in the associated jets. Additionally, detection would provide a concrete example of one of general relativity's most spectacular predictions and possibly allow us to test the validity of general relativity in a truly strong-field regime. Our aim is to provide the field of astronomy with the first accurate electromagnetic (EM) predictions of these circumbinary disk environments using state-of-the-art general relativistic magnetohydrodynamics (GRMHD) simulations and general relativistic radiative transfer calculations of the simulated data.

METHODS AND RESULTS

Intensive, high-cadence astronomical surveys (e.g., Pan-STARRS, LSST) make detecting these rare mergers of SMBH binaries more likely, yet accurate theoretical predictions of the SMBH binary's EM counterpart remain to be done. The goal of this project is to develop accurate theoretical estimates of the EM signature of SMBH binaries through GRMHD simulations using the state-of-the-art HARM3d code. High-performance computation is essential to reach our science goals because the equations are evaluated on vast domains involving tens of millions of model cells at tens of millions of time steps. Such large cell counts are required to resolve the small spatial scales of the black holes (BHs), which span tens of cells, and still cover a sufficiently large domain to include a circumbinary disk.

It is impossible with current resources and techniques to accurately simulate the formation of SMBH binaries from the galaxy scale all the way to merger, particularly accounting for the interplay of gravitational forces between two BHs. Since our goal is to predict EM signatures when the BHs are close, we must choose initial conditions as realistic as our computational budget allows. A key goal is to understand better how the binary separation and the spacetime's accuracy using the post-Newtonian (PN) approximation affect the disk's EM signatures.

In our first project [2] we exploited the ability of our new PN code [3] to exclude the highest-order terms from the metric's evaluation to see how the PN order of accuracy affected the dynamics of non-magnetized and magnetized gas. The non-magnetized runs were constrained to two spatial dimensions so that a larger parameter space could be explored. The non-magnetized runs informed the magnetized cases, which required three dimensions and the entirety of Blue Waters.

PN-order effects were obvious in the non-magnetized case starting from a separation of 20 GM/c^2. In a 3D magnetized case, many of the key qualitative conclusions were similar whether using a first- or second-order PN-accurate spacetime. In both magnetized runs we discovered a unique and exciting periodic EM signature that could be used to identify SMBH binaries in the time domain and measure their mass ratio. This implies that the EM signal may be robust down to small binary separations. It remains to be seen if the quantitative differences between magnetic and non-magnetic runs are larger than systematic errors from the initial conditions. Our investigation is the first to demonstrate how the level of PN accuracy affects a circumbinary disk's evolution. It tells us the range of separation within which to trust the PN approximation and addresses the influence the initial conditions and binary separation have on simulated predictions.

Our second project focused on verifying our new code for simulations with the BHs in our domain. A non-uniform spherical-like coordinate system was constructed to focus cells near each BH and move with the BH, while asymptotically approaching a more uniform spherical grid far away from the BHs to better conserve the disk's angular momentum [1].

WHY BLUE WATERS

Blue Waters is an essential system for our project because of its immense size. For the calculations that do not include the BHs on the domain, we often require about 10^7 cells evaluated at approximately 10^7 time steps using thousands of cores. During the past year, we finished developing and testing a novel distorted coordinate system that will allow us to resolve the BHs well and efficiently enough to include them in our simulation [1]. These simulations require about ten times more resources than their predecessors. Performing such large runs in a reasonable amount of time is only possible on the largest supercomputers like Blue Waters.

PUBLICATIONS

Zilhão, M., and S. C. Noble, Dynamic fisheye grids for binary black hole simulations. Classical Quant. Grav., 31:6 (2014), 065013.

Mundim, B. C., H. Nakano, N. Yunes, M. Campanelli, S. C. Noble, and Y. Zlochower, Approximate Black Hole Binary Spacetime via Asymptotic Matching. Phys. Rev. D, 89 (2014), 084008.
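The idea behind a non-uniform grid that concentrates cells near each (moving) black hole and relaxes toward uniform spacing far away can be sketched in one dimension. The Python cartoon below is only an illustration of that concept; the function name, Gaussian weighting, and parameters are invented and this is not the HARM3d or fisheye-grid implementation.

```python
# illustrative 1D construction of a warped grid that clusters cells near two centers;
# a cumulative "cell density" is built and then inverted to place the grid points.
import numpy as np

def warped_grid(n_cells, x_min, x_max, centers, width=2.0, focus=8.0):
    """Return coordinates whose local spacing shrinks near each center (assumed parameters)."""
    xi = np.linspace(x_min, x_max, 4 * n_cells)      # fine uniform reference grid
    density = np.ones_like(xi)                       # desired cells per unit length
    for c in centers:
        density += focus * np.exp(-((xi - c) / width) ** 2)
    cdf = np.cumsum(density)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])         # normalize to [0, 1]
    # equally spaced fractions of the CDF give the warped coordinates
    return np.interp(np.linspace(0.0, 1.0, n_cells), cdf, xi)

x = warped_grid(n_cells=256, x_min=-60.0, x_max=60.0, centers=[-10.0, 10.0])
print("smallest cell:", np.diff(x).min(), "largest cell:", np.diff(x).max())
```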

GEOSCIENCE

WEATHER | CLIMATE | GEOLOGY | ENVIRONMENT

48 Influences of Orientation Averaging Scheme on the Scattering Properties of Atmospheric Ice Crystals: Applications to Atmospheric Halo Formation
50 Exploring the Physics of Geological Sequestration of Carbon Dioxide using High-Resolution Pore-Scale Simulation
52 Enabling Breakthrough Kinetic Simulations of the Magnetosphere via Petascale Computing
54 Blue Waters Applications of 3D Monte Carlo Atmospheric Radiative Transfer
56 High-Resolution Climate Simulations using Blue Waters
58 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Tomographic Problems: A Case Study in Seismic Tomography
60 Solving Prediction Problems in Earthquake System Science on Blue Waters
62 Collaborative Research: Petascale Design and Management of Satellite Assets to Advance Space-Based Earth Science
64 Towards Petascale Simulation and Visualization of Devastating Tornadic Supercell Thunderstorms

INFLUENCES OF ORIENTATION AVERAGING SCHEME ON THE SCATTERING PROPERTIES OF ATMOSPHERIC ICE CRYSTALS: APPLICATIONS TO ATMOSPHERIC HALO FORMATION

Allocation: Illinois/0.698 Mnh
PI: Greg M. McFarquhar1
Collaborators: Junshik Um1

1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY

The Amsterdam discrete dipole approximation was used to determine the optimal orientation averaging scheme (regular lattice grid scheme or quasi Monte Carlo (QMC) method), the minimum number of orientations, and the corresponding computing time required to calculate average single-scattering properties within predefined accuracy levels for four nonspherical atmospheric ice crystal models. The QMC requires fewer orientations and less computing time than the lattice grid. The minimum number of orientations and the corresponding computing time for scattering calculations decrease with increasing wavelength, whereas they increase with particle nonsphericity. Variations in the sizes and aspect ratios (ratio between crystal length and width) of ice crystals, which affect the size parameter at which halos first emerge, are also investigated. The threshold size at which 22° and 46° halos form is determined.

FIGURE 1 (TOP RIGHT): Scattering phase function P11 with D=16 μm: (a) L=1.6 μm, W=16.0 μm; (b) L=4.0 μm, W=16.0 μm; (c) L=8.0 μm, W=16.0 μm; (d) L=16.0 μm, W=16.0 μm. The black line in each panel indicates the orientation-averaged P11; color lines are P11 for each orientation. Scattering angles of 22° and 46° are indicated with vertical dotted lines, and a distinct peak indicates a halo.

FIGURE 2 (BOTTOM): Simulated halos using different sizes and AR of hexagonal ice crystals. Left and middle columns used ADDA; right column used GOM. All panels have the same scale.

INTRODUCTION

Cirrus clouds consist almost exclusively of nonspherical ice crystals with various shapes and sizes. The representation of cirrus in small- and large-scale models has large uncertainties mainly due to the wide variety of shapes and sizes of nonspherical ice crystals. Thus, the role of cirrus in modulating the Earth-radiation balance is poorly understood. To determine the influence of cirrus on solar and infrared radiation for general circulation models (GCMs) and remote sensing studies, knowledge of single-scattering properties of ice crystals is required.

In this study, the single-scattering properties of realistically shaped atmospheric ice crystals were calculated using Blue Waters. The optimal orientation averaging scheme was used to calculate the single-scattering properties of ice crystals within a predefined accuracy level (i.e., 1.0%), and the impacts of size and aspect ratio (ratio between crystal length L and width W) on atmospheric halo formation were quantified.

METHODS AND RESULTS

Compute time

The optimal orientation averaging scheme (regular lattice grid scheme or quasi Monte Carlo (QMC) method), the minimum number of orientations, and the corresponding computing time required to calculate the average single-scattering properties within a predefined accuracy level were determined for four different nonspherical atmospheric ice crystal models (Gaussian random sphere, droxtal, budding Bucky ball, and column).

The QMC required fewer orientations and less computing time than the lattice grid. For example, the use of QMC saved 55.4 (60.1, 46.3), 3,065 (117, 110), 3,933 (65.8, 104), and 381 (22.8, 16.0) hours of computing time for calculating the single-scattering properties within 1.0% accuracy for the 3B, droxtal, Gaussian random sphere, and column models, respectively, at λ=0.55 (3.78, 11.0) μm using 300 processors. The calculations of the scattering phase function P11 required the most orientations, and the asymmetry parameter g and single-scattering albedo ω0 the fewest. The minimum number of orientations and the computing time for single-scattering calculations decreased with increasing wavelength and increased with the surface area ratio that defines particle nonsphericity.

Atmospheric Halos

Hexagonal crystals (columns and plates) are the building blocks of the most common ice crystal habits. Previous studies have shown that atmospheric halos begin to form when the size parameter (the ratio between particle size and the wavelength of incident light) of ice crystals reaches 80 to 100. The threshold size at which atmospheric halos emerge is important because it determines the applicability of the conventional geometrical optics method (GOM) to the calculations of scattering properties of small particles.

High-resolution images of ice crystals from aircraft probes were used to define a plausible range of aspect ratios (crystal length divided by width; AR) for different sized ice crystals. Then the single-scattering properties of hexagonal crystals were calculated using the Amsterdam discrete dipole approximation (ADDA). Hexagonal crystals with AR of 0.25, 0.50, and 1.0 (fig. 1b-d) show a distinct 22° halo, whereas only hexagonal crystals with AR of 0.50 and 1.0 produce the 46° halo. Further, both hexagonal columns and plates have more difficulty than compact crystals (AR~1.0) in generating halos. This suggests that the AR of crystals plays an important role in the formation of atmospheric halos. Hexagonal column crystals produce neither halo.

Fig. 2 shows simulated halos using different sizes and ARs of hexagonal crystals. For comparison, simulations of large crystals using GOM are also shown in the right columns. Small crystals produce neither the 22° nor the 46° halo regardless of AR (left column). Large crystals produce halos with the appearance of the 22° and/or 46° halo depending on AR (middle column). A compact hexagonal crystal produces both the 22° and 46° halos.

WHY BLUE WATERS

When an ice particle's size parameter was large, GOM was used to approximate the single-scattering properties. Exact methods, such as the discrete dipole method, T-matrix method, and finite-difference time domain method, were used for particles with much smaller size parameters. Although the exact methods provide more accurate results, they require more computing time and memory that rapidly increase with particle size. The accuracy of radiative transfer models and satellite retrieval algorithms depends heavily on accurate calculations of single-scattering properties of ice crystals. Blue Waters is an important resource in completing these calculations.

PUBLICATIONS

Um, J., and G. M. McFarquhar, Efficient numerical orientation average for calculations of single-scattering properties of small atmospheric ice crystals. Proc. ASR Science Team Meeting, Potomac, Md., March 18-21, 2013.

Um, J., and G. M. McFarquhar, Optimal numerical methods for determining the orientation averages of single-scattering properties of atmospheric ice crystals. J. Quant. Spectrosc. Radiat. Transfer, 127 (2013), pp. 207-223.
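As a rough illustration of the two orientation-sampling strategies compared above, the Python sketch below averages a stand-in orientation-dependent quantity over either a regular lattice of angles or a Halton quasi-random point set; the toy function and point counts are arbitrary and merely stand in for the actual ADDA calculations.

```python
# toy comparison of lattice-grid versus quasi-random orientation sampling;
# the "scattering property" is a placeholder function, not a real optics calculation.
import numpy as np

def halton(n, base):
    """First n terms of the Halton low-discrepancy sequence in the given base."""
    out = np.zeros(n)
    for i in range(1, n + 1):
        f, x, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        out[i - 1] = x
    return out

def toy_property(alpha, beta):
    """Placeholder orientation-dependent quantity (e.g., a phase-function value)."""
    return 1.0 + 0.3 * np.cos(6 * alpha) * np.sin(beta) ** 2

def lattice_average(n_per_axis):
    a = np.linspace(0, 2 * np.pi, n_per_axis, endpoint=False)   # azimuth angles
    cb = np.linspace(-1, 1, n_per_axis)                         # uniform in cos(beta)
    A, CB = np.meshgrid(a, cb)
    return toy_property(A, np.arccos(CB)).mean(), n_per_axis**2

def qmc_average(n):
    a = 2 * np.pi * halton(n, 2)
    cb = 2 * halton(n, 3) - 1
    return toy_property(a, np.arccos(cb)).mean(), n

print("lattice grid (mean, orientations):", lattice_average(64))
print("quasi-random (mean, orientations):", qmc_average(512))
```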


EXPLORING THE PHYSICS OF GEOLOGICAL SEQUESTRATION OF CARBON DIOXIDE USING HIGH-RESOLUTION PORE-SCALE SIMULATION

Allocation: Illinois/0.05 Mnh
PI: Albert J. Valocchi1
Collaborators: Qinjun Kang2

1University of Illinois at Urbana-Champaign
2Los Alamos National Laboratory

EXECUTIVE SUMMARY

A better understanding of the physics of geological sequestration of CO2 is critical to engineering the injection process to maximize storage capacity and immobilization of CO2, developing monitoring programs, and performing risk assessments. With the rapid advancement of computers and computational methods, numerical modeling has become an important tool for this problem. We developed a 3D, fully parallel code based on the color-fluid multiphase lattice Boltzmann model, which scales well on Blue Waters. We applied the code to simulate liquid CO2 displacement of water in various homogeneous and heterogeneous pore networks and have obtained quantitative agreement with micro-model experiments fabricated and conducted using the Department of Energy Environmental Molecular Science Laboratory's new micro-fabrication capability. By comparing the 2D and 3D simulations of the experiments, the 3D effect is being investigated for the first time.

FIGURE 1: Final fluid distributions in the dual-permeability microfluidic system (1408x1400x24) at log Ca=-4.06. Red and green denote LCO2 and water, respectively. (Top left) 2D simulation; (others) 3D simulations at Z=2, 3, 4, 6, 8, 10, 12 (across, then down). Total number of lattices in the Z (vertical) direction is 24.

INTRODUCTION

Understanding the movement of multiple fluids within pore spaces in subsurface geological formations is critical for addressing important problems such as enhanced oil recovery, groundwater pollution from leaking tanks or pipelines, geothermal energy production, and geological sequestration of CO2. Several large-scale simulations are being used to investigate the impact of industrial-scale CO2 injection into large geological basins. However, there is considerable scientific uncertainty about the proper way to represent pore-scale physics in these models, which leads to uncertain estimates of storage capacity and leakage probability. Experiments at the Department of Energy's Environmental Molecular Science Laboratory (EMSL) provide validation data for high-resolution numerical simulation models. The validated models can then serve as a tool to extrapolate beyond conditions investigated in the laboratory and to develop upscaled representations of the physical processes that can be used in reservoir-scale simulations to investigate the impact of industrial-scale CO2 injection into large geological basins.

METHODS AND RESULTS

We have developed a 3D, fully parallel code based on the color-fluid multiphase lattice Boltzmann model (CFLBM). The lattice Boltzmann method is an effective computational strategy to simulate flow of immiscible fluids in complex porous medium geometries. The CFLBM code has been shown to scale almost ideally on Blue Waters up to 32,768 cores. We have applied the code to simulate liquid CO2 displacement of water in various homogeneous and heterogeneous pore networks, matching conditions for microfluidics experiments conducted at the state-of-the-art facilities at EMSL. We used a previous 2D version of the CFLBM code to match the experimental conditions and found that there were some discrepancies between experimental and simulated results.

We first carried out a 3D simulation with 182 million grid points at a relatively high capillary number. The code ran 60,000 steps in one hour on 12,800 cores (400 nodes) on Blue Waters, and the simulation reached steady state after 38,400 node-hours (76.8% of our allocation). We then performed a 3D simulation with 47 million grid points at a lower capillary number, using 3,584 processor cores.

Although the 3D domain used in the simulation is considerably smaller than that used in the microfluidic experiments, thus far we have found significant differences in fluid distribution between the 2D and 3D simulations and among different cross sections in the vertical (Z) direction of the 3D simulation. At Z=2, LCO2 (liquid CO2) did not penetrate into the pore spaces due to the strong friction at the bottom wall. Although LCO2 penetrates into the pore spaces at all other cross sections, it is clear that the LCO2 fingers in the cross sections closer to the wall (e.g., Z=3) are thinner than those in the cross sections near the center (e.g., Z=12). However, this difference quickly diminishes for cross sections away from the top/bottom walls. There are no apparent differences between Z=8 and Z=12.

Comparing the 2D and 3D results at this capillary number, the LCO2 has not penetrated into the pore spaces of the low-permeability domain in the 2D simulation but has started breaking through that domain in the 3D simulation. In addition, in the 3D simulation, the LCO2 saturation is higher in cross sections away from the top/bottom walls than in the 2D simulation. More simulations with other pore networks and vertical heterogeneity are being carried out. The results and detailed analyses of fluid saturation and interfacial area will be reported in a future publication.

WHY BLUE WATERS

All existing simulations of microfluidic multiphase flow experiments are based on 2D models, ignoring the effect of the top and bottom walls, as well as that of heterogeneity in the depth direction incurred during manufacturing of the micro-models. Although 3D effects have long been speculated, a full 3D simulation has not been possible due to the formidable computational costs. Moreover, it takes a very long time for the simulations to reach the steady state needed for analysis of fluid distribution, phase saturation, interfacial area, etc. The computational power of Blue Waters is allowing us to conduct 3D simulations of these experiments for the first time and to rigorously study 3D effects.

PUBLICATIONS

Liu, H., A. J. Valocchi, C. Werth, Q. Kang, and M. Oostrom, Pore-scale simulation of liquid CO2 displacement of water using a two-phase lattice Boltzmann model. Adv. Water Resour. (in revision).
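A simple way to picture the parallel structure of a fully parallel lattice code is a slab decomposition with halo exchange between neighboring ranks. The mpi4py sketch below is only a schematic of that pattern (one decomposition axis, a single ghost layer, a scalar field), not the CFLBM implementation; the lattice dimensions are taken from the figure caption.

```python
# schematic slab decomposition and halo exchange for a 3D lattice, using mpi4py;
# this illustrates the communication pattern only, not the lattice Boltzmann kernel.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

NX, NY, NZ = 1408, 1400, 24                       # lattice size from the figure caption
nx_local = NX // nranks + (1 if rank < NX % nranks else 0)

# local slab with two ghost layers in X
f = np.zeros((nx_local + 2, NY, NZ))

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < nranks - 1 else MPI.PROC_NULL

# exchange one ghost layer with each neighbor (deadlock-free combined send/receive)
recv_lo, recv_hi = np.empty((NY, NZ)), np.empty((NY, NZ))
comm.Sendrecv(f[-2], dest=right, recvbuf=recv_lo, source=left)
comm.Sendrecv(f[1], dest=left, recvbuf=recv_hi, source=right)
if left != MPI.PROC_NULL:
    f[0] = recv_lo
if right != MPI.PROC_NULL:
    f[-1] = recv_hi
```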


ENABLING BREAKTHROUGH KINETIC SIMULATIONS OF THE MAGNETOSPHERE VIA PETASCALE COMPUTING

Allocation: PRAC/11.1 Mnh
PI: Homayoun Karimabadi1,2
Collaborators: Vadim Roytershteyn2; Yuri Omelchenko1; William Daughton3; Mahidhar Tatineni1; Amit Majumdar1; Kai Germaschewski4

1University of California, San Diego
2SciberQuest
3Los Alamos National Laboratory
4University of New Hampshire

EXECUTIVE SUMMARY

Over 90% of the visible universe is in the plasma state. Some of the key processes in plasmas include shocks, turbulence, and magnetic reconnection. The multiscale nature of these processes and the associated vast separation of scales make their studies difficult in the laboratory. In tandem with in situ measurements in space, simulations have been critical to making progress. We highlight our latest progress in the areas of magnetic reconnection, plasma turbulence, and fast shocks as captured in global simulations of the magnetosphere. One of the outcomes of these studies is the dynamic interplay between these seemingly distinct processes. This synergy is demonstrated using specific examples from our work on Blue Waters.

FIGURE 1: Forming magnetosphere in a global 3D hybrid simulation. (V. Roytershteyn, H. Karimabadi)

INTRODUCTION

Fundamental plasma processes, such as magnetic reconnection, turbulence, and shocks, play a key role in the dynamics of many systems in the universe across a vast range of scales. Examples of such systems are laboratory fusion experiments, the Earth's magnetosphere, the solar wind, the solar corona and chromosphere, the heliosphere, the interstellar medium, and many astrophysical objects. In most of these settings, the plasma is weakly collisional, which implies that microscopic processes on the scale of the ion gyroradius and below play an important role, especially inside various boundary layers. This presents tremendous challenges for computational modeling due to the extreme separation of scales associated with the relevant systems.

The problems considered in this research are motivated by the need to better understand Earth's space environment and its interaction with the Sun, the major source of energy in the solar system. Collectively known as "space weather," this area of research is increasing in socio-economic significance due to our society's reliance on space-based communication and navigation technologies and the potential risk of catastrophic disruptions of the power grid caused by major solar events.

For example, turbulence simulations help us understand energy balance in the solar wind, the major driver of the Earth's magnetospheric activity. Similarly, global fully kinetic and hybrid simulations help us understand the response of the magnetosphere to external perturbations. In addition to their significance for space weather, the processes and their interactions considered here are thought to play a role in many space and astrophysical settings. Consequently, the insights from this study are of great interest to a variety of fields.

METHODS AND RESULTS

Here we demonstrate in a series of examples how modern petascale simulations help address some of these challenges.

First, we discuss 3D fully kinetic simulations of decaying plasma turbulence, where the initial perturbation imposed on a system at large scales seeds a turbulent cascade. The cascade transports energy from the injection scale down to the electron kinetic scales, where it is ultimately dissipated by kinetic processes. A unique feature of these simulations is that they describe for the first time in a self-consistent manner a number of distinct dissipation mechanisms, thus helping assess their relative efficiency.

In a second example, we discuss global 2D fully kinetic simulations of the interaction between the solar wind and the magnetosphere. This work focuses on understanding how magnetic reconnection couples microscopic electron kinetic physics and macroscopic global dynamics driven by the solar wind.

Finally, we discuss 2D and 3D global hybrid simulations of the interaction between the solar wind and the magnetosphere (3D shown in fig. 1). These simulations focus on how ion kinetic processes at the bow shock, a standing shock wave in front of the magnetosphere, drive turbulence in a narrow region behind the shock called the magnetosheath. In this hybrid simulation model, the ions are treated kinetically, while electrons are treated as a massless fluid. Consequently, the electron kinetic scales are not resolved, reducing the range of scales to be resolved by more than an order of magnitude and allowing simulation of larger scales compared to the fully kinetic case.

Our simulations are at the forefront of research in their respective communities. The simulations of decaying turbulence revealed the formation of current sheets with thicknesses from electron kinetic to ion kinetic scales. The possibility that such current sheets provide an efficient dissipation mechanism that is necessary to terminate the turbulent cascade has recently attracted considerable attention. The significance of this work is that formation of the current sheets was demonstrated in essentially first-principles simulations, which include all of the possible dissipation mechanisms. Moreover, we demonstrated that the overall partition of the dissipated energy, the question of primary importance for solar wind studies, is not consistent with the previous assumption that resonant damping of wave-like fluctuations is the dominant dissipation mechanism.

The global hybrid and global fully kinetic simulations demonstrated the complex dynamics of the magnetosphere, where shock physics, turbulence, and reconnection interact in a complex manner not seen before in simulations. These simulations provide invaluable input into efforts dedicated to better understanding of the Earth's magnetospheric dynamics.

WHY BLUE WATERS

As we emphasized above, the topics considered in this work are characterized by extreme separation of spatial and temporal scales. Consequently, the relevant simulations require extreme computational resources and produce a large amount of data (in excess of 100 TB from a single run). Blue Waters is currently the most powerful tool available to conduct this work, enabling simulations of unprecedented scale and fidelity.

PUBLICATIONS

Karimabadi, H., et al., The link between shocks, turbulence and magnetic reconnection in collisionless plasmas. Physics of Plasmas, 21 (2014), 062308.


BLUE WATERS APPLICATIONS OF 3D MONTE CARLO ATMOSPHERIC RADIATIVE TRANSFER

Allocation: Illinois/0.05 Mnh, BW Prof./0.24 Mnh
PI: Larry Di Girolamo1
Collaborators: Alexandra L. Jones1; Daeven Jackson1; Brian Jewett1; Bill Chapman1

1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY

Our goal is to improve numerical weather prediction models and remote sensing algorithms for Earth's atmosphere through better treatment of radiative transfer. We produced a highly accurate 3D Monte Carlo radiative transfer code that can handle the atmosphere's complexity better than prior 1D codes. The code is built to handle scattering, absorption, and emission by the Earth's atmosphere and surface, includes spectral integration, and outputs radiance, irradiance, and heating rates. Blue Waters allowed us to reach lower noise levels than previously achieved by running the highly scalable code on more processing units than attempted with comparable codes. First, we are applying our code to study biases in cloud products derived from NASA satellite instruments. Future directions include coupling our model with the Weather Research and Forecasting model.

FIGURE 1: The two simulated satellite views above, with viewing angles of 0° and 5°, enable 3D viewing of the cloud scene. While staring between the side-by-side images, cross your eyes until the white dots atop each image merge in the center to form a third dot above a third, centered image. The central image will emerge as a 3D view.

INTRODUCTION

One of the least understood aspects of the weather and climate system is clouds, particularly as they interact with solar radiation. Clouds redistribute radiative energy from the sun as well as from the Earth and atmosphere. As the model representation of other physical processes has advanced with increasing spatial resolution and computing power (e.g., cloud microphysics), radiative transfer (RT) remained crude and one dimensional (up/down). When allowed to feed back on cloud dynamics, modeled inaccuracies from this representation of RT may impact things like rainfall, surface and atmospheric heating, and photolysis rates. In remote sensing, data processing errors in satellite-derived cloud properties limit the scientific utility of the datasets, leading to a serious problem in climate research. The role of clouds is the leading source of uncertainty in predicting anthropogenic climate change. Only accurate satellite datasets can provide retrievals of global cloud properties that will improve RT in climate and weather studies. The starting point for more accurate datasets is a 3D RT model.

METHODS AND RESULTS

Thus far we have used Blue Waters to reach new benchmarking milestones with our open-source 3D Monte Carlo RT model, used the model to evaluate remote sensing algorithms, and developed new features in preparation for the model's eventual coupling to a cloud dynamics model.

We evaluated cloud retrieval algorithms used by NASA's Multi-angle Imaging SpectroRadiometer (MISR) and Moderate Resolution Imaging Spectroradiometer (MODIS). Three-dimensional RT simulations for a wide range of simulated cloud fields were generated to simulate radiance data for these instruments.

For MISR, which derives the cloud-top height via a stereoscopic technique, we demonstrated a ~30 m to ~400 m negative bias in cloud-top height, which strongly depended on the 3D distribution of the cloud liquid and ice water content, and sun and view angles. This range dwarfs the 40 m/decade trend toward lower cloud-top heights in MISR that has recently grabbed the attention of climate scientists due to its implication of a negative feedback on a warming climate via greater radiative cooling to space. These results suggest that we cannot decouple actual changes in cloud-top height from changes in cloud texture. Any trend remains hypothetical.

We also have examined error in the effective radius of the cloud drop size distribution derived from MODIS, one of the most important variables in understanding the role of clouds and aerosols in our climate system. These errors are invariant to sub-pixel cloud fraction. Our simulations targeted boundary layer clouds, which are at the heart of the uncertainty in cloud feedbacks plaguing climate projections [1]. Simulated observations (example shown in fig. 1) showed that in the current algorithm used to retrieve cloud optical thickness and effective radius, spectrally varying photon leakage from cloud sides can explain the invariance. However, only the derived effective radius is affected, not optical thickness. This is an extremely useful finding as it allows us to better interpret the properties of clouds in relation to their environment.

Efforts to couple our model to the Weather Research and Forecasting (WRF) model have so far involved increasing RT model functionality by incorporating terrestrial emission (fully tested and benchmarked) and spectral integration (under development and testing) to provide radiative heating rates to WRF. Once the models are coupled, the 3D RT package will consume an estimated 99.9% of computing time because each batch of photons involves millions of scattering, emission, and absorption events. The Monte Carlo nature of our model allows work to be spread across a large number of processors with no communication other than the initial task assignment and summing the results.

WHY BLUE WATERS

A large increase in available processing units allowed us to extend our simulations to finer precision than ever before. Benchmark simulations ran to <0.1% Monte Carlo noise, which revealed the insufficiency of some algorithmic choices that were previously masked by the noise. Addressing those insufficiencies has led to a more robust version of our 3D Monte Carlo model. The computationally heavy nature of our simulations requires the Blue Waters infrastructure for successful and timely completion.

Our new tools on Blue Waters will also benefit the wider community by providing a solid foundation for testing future hypotheses involving the two-way interaction between cloud systems and radiation, becoming a test bed for developing practical RT parameterizations that capture the most important 3D effects at lower computational cost, and addressing issues related to the interpretation of satellite observations of clouds.

PUBLICATIONS

Jones, A. L., and L. Di Girolamo, A 3D Monte Carlo Radiative Transfer Model for Future Model Parameterizations and Satellite Retrieval Algorithms. Gordon Research Conf. on Radiation and Climate, New London, N.H., July 7-12, 2013.

Jones, A. L., and L. Di Girolamo, A New Spectrally Integrating 3D Monte Carlo Radiative Transfer Model. 14th Conf. on Atmospheric Radiation, Boston, Mass., July 7-11, 2014.
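Because photon batches are independent, the parallel pattern described above (independent batches, results summed at the end, noise falling as 1/sqrt(N)) can be sketched in a few lines of Python. The "scattering" step below is a placeholder probability rather than a real radiative transfer calculation, and the batch sizes are arbitrary.

```python
# toy illustration of embarrassingly parallel Monte Carlo photon batches;
# not the group's RT code: the reflection probability is a stand-in value.
import numpy as np
from multiprocessing import Pool

def trace_batch(args):
    seed, n_photons = args
    rng = np.random.default_rng(seed)
    # a real model would follow each photon through scattering/absorption events;
    # here each photon is simply "reflected" with a fixed placeholder probability
    reflected = rng.random(n_photons) < 0.42
    return int(reflected.sum())

if __name__ == "__main__":
    n_workers, photons_per_batch, n_batches = 8, 100_000, 64
    with Pool(n_workers) as pool:
        counts = pool.map(trace_batch, [(s, photons_per_batch) for s in range(n_batches)])
    total = n_batches * photons_per_batch
    estimate = sum(counts) / total
    # Monte Carlo noise falls as 1/sqrt(N): quadrupling the photon count halves the error
    stderr = np.sqrt(estimate * (1 - estimate) / total)
    print(f"reflectance ~ {estimate:.4f} +/- {stderr:.4f}")
```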


HIGH-RESOLUTION CLIMATE SIMULATIONS USING BLUE WATERS

Allocation: NSF/5.11 Mnh
PI: Donald J. Wuebbles1
Collaborators: Arezoo Khodayari1; Warren Washington2; Thomas Bettge2; Julio Bacmeister2; Xin-Zhong Liang3

1University of Illinois at Urbana-Champaign
2National Center for Atmospheric Research
3University of Maryland, College Park

EXECUTIVE SUMMARY

The objective of this phase has been to deliver a well-tuned global high-resolution version of the Community Atmosphere Model (CAM) version 5, the atmosphere component of the Community Earth System Model. High resolution challenges climate models since many physical processes that are clearly sub-grid at coarser resolutions become marginally resolved. High-resolution experimentation with the CAM produced a mix of changes. When the portion of convection that was parameterized increased, the seasonal mean precipitation distribution generally improved, but the number of tropical cyclones per year dropped from ~100 to <10. Smoothed terrain proved problematic for precipitation in CAM's new spectral element dynamical core. Cloud-aerosol-radiation effects also diverged substantially from observations. All of these are factors in tuning the model to work well at high resolution.

FIGURE 1: Seasonal mean precipitation for December-January-February 1980-2005 for (a) CAM5 FV at a resolution of 0.23°x0.31°; (b) CAM5 SE ne120; and (c) TRMM 1999-2005.

FIGURE 2: The CAR member frequency distributions (thick black curves) in the top-of-atmosphere cloud radiative forcings (CRF, W m-2) averaged 60°S-60°N in July 2004 for SW (left) and LW (right), using a subset of 448 members. Red curves are results from the subset with the observational constraints on the top net SW and LW separately (dashed) and their sum (solid). The CAR ensemble mean (black) is compared with the observational data from ISCCP (red), SRB (green), various versions of CERES (purple shading), and CERES_EBAF (orange).

INTRODUCTION

This collaborative research is using Blue Waters to address key uncertainties in the numerical modeling of the Earth's climate system and the ability to accurately analyze past and (projected) future changes in climate. The objective of this phase has been to deliver a well-tuned global high-resolution version of the Community Atmosphere Model (CAM) version 5, the atmosphere component of the Community Earth System Model. In the context of global climate modeling, "high-resolution" means simulations with horizontal grid spacing near 25 km. This resolution challenges climate models since many physical processes that are clearly sub-grid at coarser resolutions become marginally resolved. In addition, a new dynamical core was introduced in CAM, the spectral element dynamical core (SE) [1]. This is a highly scalable code that allows full exploitation of massively parallel architectures such as Blue Waters. This new core revealed surprisingly large sensitivities to topographic forcing and internal dissipation.

METHODS AND RESULTS

Deep convection, seasonal precipitation, and tropical cyclones

High-resolution experimentation with the Community Atmosphere Model version 5 (CAM5) revealed a mix of changes in the simulated climate [2]. Topographically influenced circulations and the global climatology of tropical cyclones were improved. On the other hand, seasonal mean tropical ocean precipitation degraded over large areas compared to simulations with ~100 km resolution.

The time required for a convective cloud to reach maturity (τ_cnv) is known to be important in controlling the activity of CAM5's deep convection parameterization. As τ_cnv decreases, parameterized convection becomes stronger, and we suspect that problems with the heating profile associated with large-scale precipitation may contribute to biases at high resolution. To examine the effect of parameterized convection, τ_cnv was decreased from 3,600 seconds to 300 seconds. The seasonal mean precipitation distribution generally improved when compared to GPCP observed precipitation, but the typical number of tropical cyclones per year dropped from ~100 to <10.

Effects of Topography

When switching to CAM5's spectral element core, we found that heavily smoothing topography to reduce numerical artifacts removed many of the resolution-based improvements in topographically forced flows that were apparent in the finite volume (FV) dynamical core. Extensive development of topographic smoothing algorithms and enhanced internal divergence damping for CAM SE was conducted. Topography is suspected to play an important role in the simulation of wintertime precipitation over the southeastern U.S. Fig. 1 shows seasonal mean precipitation for December-January-February for CAM5 FV, CAM5 SE, and TRMM observational estimates. FV (fig. 1a) produces generally heavier precipitation than SE (fig. 1b) in this region. The SE simulation appears to diverge more from TRMM over much of the domain. We suspect low-level topographic effects are still underrepresented in SE, leading to weaker steering of moist Gulf of Mexico air into the southeastern U.S.

Cloud-Aerosol-Radiation Effects

Cloud-aerosol-radiation effects dominate the global climate model (GCM) climate sensitivity. The Cloud-Aerosol-Radiation (CAR) ensemble modeling system represents the comprehensive range of the mainstream parameterizations used in current GCMs [3]. Fig. 2 compares the frequency distributions for top-of-atmosphere (TOA) shortwave (SW) and longwave (LW) cloud radiative forcing averaged over 60°S-60°N in July 2004 for a fraction of CAR. The spread among the members is 30-60 W m-2, compared to 5-15 W m-2 for the best observational estimates. The frequency peaks for all fluxes closely match ISCCP or SRB radiative flux estimates, suggesting that most schemes used in the leading GCMs may have been tuned to those data. Yet recent satellite retrievals from CERES instruments (except the EBAF version) fall in the tails of the model distributions.

In a subset of GCMs that produce TOA radiative balance within the observed range, the cloud radiative forcing ranges are still over three times larger than the respective observational uncertainties. Current GCMs may be tuned to reproduce the observed radiative balance by creating compensating errors among different components rather than producing the correct physics at the individual process level.

We plan to examine model uncertainties by using the regional Climate-Weather Research and Forecasting model (CWRF) [4], with the built-in CAR, to estimate the range of regional climate change projections, which we anticipate will be substantial. The outcome of the CWRF will be compared with the new CESM high-resolution simulations to evaluate global versus regional model (dis)advantages in projecting future climate change.

WHY BLUE WATERS

A new dynamical core was introduced in CAM, the spectral element dynamical core (CAM-SE) [1]. This is a highly scalable code that allows full exploitation of massively parallel architectures such as Blue Waters to run a large, high-resolution climate model.

PUBLICATIONS

Dennis, J., et al., CAM-SE: A scalable spectral element dynamical core for the Community Atmosphere Model. Int. J. High Perform. Comput. Appl., 26 (2012), pp. 74-89.

Bacmeister, J. T., et al., Exploratory High-Resolution Climate Simulations using the Community Atmosphere Model (CAM). J. Climate, 27:9 (2014), pp. 3073-3099.

Liang, X.-Z., and F. Zhang, The Cloud-Aerosol-Radiation (CAR) ensemble modeling system. Atmos. Chem. Phys., 13 (2013), pp. 8335-8364.

Liang, X.-Z., et al., Regional Climate-Weather Research and Forecasting Model (CWRF). Bull. Am. Meteorol. Soc., 93 (2012), pp. 1363-1387.
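The ensemble-spread comparison in fig. 2 amounts to histogramming a member-mean quantity and comparing its range to an observational band. The numpy sketch below uses synthetic numbers throughout; it only illustrates the bookkeeping, not the CAR system or the observational datasets themselves.

```python
# illustrative ensemble-spread diagnostic with synthetic placeholder values;
# the member values and the "observational" band are not real data.
import numpy as np

rng = np.random.default_rng(7)
n_members = 448
sw_crf = rng.normal(loc=-50.0, scale=12.0, size=n_members)   # fake member-mean SW CRF (W m^-2)

obs_range = (-55.0, -45.0)   # placeholder band standing in for ISCCP/SRB/CERES estimates

spread = sw_crf.max() - sw_crf.min()
within_obs = int(np.sum((sw_crf >= obs_range[0]) & (sw_crf <= obs_range[1])))

hist, edges = np.histogram(sw_crf, bins=np.arange(-90, -10, 5))
peak_bin = edges[np.argmax(hist)]
print(f"ensemble spread: {spread:.1f} W m^-2 "
      f"(vs. ~{obs_range[1] - obs_range[0]:.0f} W m^-2 observational range)")
print(f"frequency peak near {peak_bin:.0f} W m^-2; "
      f"{within_obs} of {n_members} members fall inside the observational band")
```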


A SCALABLE PARALLEL LSQR ALGORITHM FOR SOLVING LARGE-SCALE LINEAR SYSTEM FOR TOMOGRAPHIC PROBLEMS: A CASE STUDY IN SEISMIC TOMOGRAPHY

Allocation: NSF/0.003 Mnh
PI: Liqiang Wang1
Collaborators: He Huang1; Po Chen1; John M. Dennis2

1University of Wyoming
2National Center for Atmospheric Research

EXECUTIVE SUMMARY

The Least Squares with QR factorization (LSQR) method is a widely used Krylov subspace algorithm to solve sparse rectangular linear systems for tomographic problems. Traditional parallel implementations of LSQR have the potential, depending on the non-zero structure of the matrix, for significant communication cost. The communication cost can dramatically limit the scalability of the algorithm at large core counts.

We describe a scalable parallel LSQR algorithm that utilizes the particular non-zero structure of matrices that occurs in tomographic problems. In particular, we specially treat the kernel component of the matrix, which is relatively dense with a random structure, and the damping component, which is very sparse and highly structured, separately. The resulting algorithm has a scalable communication volume with a bounded number of communication neighbors regardless of core count. We present scaling studies from real seismic tomography datasets that illustrate good scalability up to O(10,000) cores on a Cray system.

FIGURE 1 (TOP): GPU speedup and scalability on the CN2008ker dataset.

FIGURE 2 (BOTTOM): Speedup and scalability of SPLSQR without GPU on the Los Angeles Basin dataset from 360 to 19,200 cores.

INTRODUCTION

Least Squares with QR factorization (LSQR) is a member of the Conjugate Gradients family of iterative Krylov algorithms and is typically reliable when a matrix is ill conditioned. The LSQR algorithm, which uses a Lanczos iteration to construct orthonormal basis vectors in both the model and data spaces, has been shown to converge faster than other algorithms in synthetic tomographic experiments.

Unfortunately, it can be computationally very challenging to apply LSQR to a tomographic matrix with a relatively dense kernel component appended by a highly sparse damping component, because it is simultaneously compute, memory, and communication intensive. The coefficient matrix is typically very large and sparse. For example, a modest-sized dataset of the Los Angeles Basin for structural seismology has a physical domain of 496x768x50 grid points. The corresponding coefficient matrix has 261 million rows, 38 million columns, and 5 billion non-zero values. Nearly 90% of the non-zeros are in the kernel, while the damping accounts for approximately 10%.

METHODS AND RESULTS

To address the above challenges, we designed and implemented a parallel LSQR implementation (SPLSQR) using MPI and CUDA. Our major contributions include:

• To make SPLSQR scalable, we designed a partitioning strategy and a computational algorithm based on the special structure of the matrix. SPLSQR contains a novel data decomposition strategy that treats different components of the matrix separately. The SPLSQR algorithm provides scalable communication volume between a fixed and modest number of communication neighbors. The algorithm enables scalability to O(10,000) cores for the Los Angeles Basin dataset in seismic tomography.

• We use CUBLAS to accelerate vector operations and CUSPARSE to accelerate sparse matrix-vector multiplication (SpMV), which is the most compute-intensive part of LSQR. However, CUSPARSE is efficient only in handling regular SpMV in compressed sparse row format and is inefficient in SpMV with matrix transpose. We designed two approaches to handle transpose SpMV, trading memory for better performance. The first approach utilizes a different matrix format (compressed sparse column) for transpose SpMV. Although its performance is much better than using CUSPARSE directly, it requires storing two copies of the matrix. As an alternative, we designed a second approach to support both regular and transpose SpMV and avoid storing an additional matrix transpose. It has almost the same performance as the first approach on the NVIDIA C2050 GPU, but is slower on the NVIDIA M2070.

• To optimize memory copy between host memory and GPU device memory, we utilize a "register-copy" technique to speed up copying between them by 20%. In addition, we minimize CPU operations by porting all matrix- and vector-based operations onto the GPU. During computation, the intermediate results reside in device memory, and only a small amount of data is copied between host and device memories for MPI communication.

• To increase parallelism, we decompose both the matrix and the vectors. To obtain good load balance, we decompose the matrix in row-wise order and distribute rows according to the number of non-zero elements. We use MPI-IO to allow multiple MPI tasks to load data simultaneously.

We demonstrated that the SPLSQR algorithm has scalable communication and significantly reduces communication cost compared with existing algorithms. We also demonstrated that, on a small seismic tomography dataset, the SPLSQR algorithm is 9.9 times faster than the PETSc algorithm on 2,400 cores of a Cray XT5. The current implementation of the SPLSQR algorithm on 19,200 cores of a Cray XT5 is 33 times faster than the fastest PETSc configuration on the modest Los Angeles Basin dataset.

In our experiments on GPUs, the single-GPU code achieves up to a factor of 17.6 speedup with 15.7 GFlop/s in single precision and a factor of 15.2 speedup with 12.0 GFlop/s in double precision, compared with the original serial CPU code. The MPI-GPU code achieves up to a factor of 3.7 speedup with 268 GFlop/s in single precision and a factor of 3.8 speedup with 223 GFlop/s in double precision on 135 MPI tasks compared with the corresponding MPI-CPU code. The MPI-GPU code scales well in both strong and weak scaling tests.
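The matrix structure described above (a nearly dense kernel block stacked on a very sparse, highly structured damping block) can be reproduced at toy scale and handed to SciPy's reference LSQR solver. The sketch below is for orientation only; its sizes, values, and identity damping are placeholders and it has nothing like the scale or the custom decomposition of SPLSQR.

```python
# toy kernel-plus-damping system solved with SciPy's reference LSQR;
# all sizes and values are synthetic placeholders.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n_data, n_model = 200, 120

# "kernel" block: nearly dense tomographic sensitivity matrix
kernel = sp.csr_matrix(rng.standard_normal((n_data, n_model)))

# "damping" block: very sparse, highly structured regularization (identity here)
damping = sp.identity(n_model, format="csr")

A = sp.vstack([kernel, damping]).tocsr()
b = np.concatenate([rng.standard_normal(n_data), np.zeros(n_model)])

x, istop, itn, r1norm = lsqr(A, b, atol=1e-8, btol=1e-8)[:4]
print(f"stopped with code {istop} after {itn} iterations, residual norm {r1norm:.3e}")
```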


SOLVING PREDICTION PROBLEMS IN EARTHQUAKE SYSTEM SCIENCE ON BLUE WATERS

Allocation: NSF/3.4 Mnh
PI: Thomas H. Jordan1
Collaborators: Scott Callaghan1; Robert Graves2; Kim Olsen3; Yifeng Cui4; Jun Zhou4; Efecan Poyraz4; Philip J. Maechling1; David Gill1; Kevin Milner1; Omar Padron, Jr.5; Gregory H. Bauer5; Timothy Bouvet5; William T. Kramer5; Gideon Juve6; Karan Vahi6; Ewa Deelman6; Feng Wang7

1 Southern California Earthquake Center
2 U.S. Geological Survey
3 San Diego State University
4 San Diego Supercomputer Center
5 National Center for Supercomputing Applications
6 Information Sciences Institute
7 AIR Worldwide

EXECUTIVE SUMMARY:

A major goal of earthquake system science is to predict the peak shaking at surface sites on the scale of a few days to many years. The deep uncertainties in these predictions are expressed through two types of probability: an aleatory variability that describes the randomness of the earthquake system, and an epistemic uncertainty that characterizes our lack of knowledge about the system. Standard models use empirical prediction equations that have high aleatory variability, primarily because they do not model crustal heterogeneities. We show how this variance can be lowered by simulating seismic wave propagation through 3D crustal models derived from waveform tomography. SCEC has developed a software platform, CyberShake, that combines seismic reciprocity with highly optimized anelastic wave propagation codes to reduce the time of simulation-based hazard calculations to manageable levels. CyberShake hazard models for the Los Angeles region, each comprising over 240 million synthetic seismograms, have been computed on Blue Waters. A variance-decomposition analysis indicates that more accurate earthquake simulations may reduce the aleatory variance of the strong-motion predictions by at least a factor of two, which would lower exceedance probabilities at high hazard levels by an order of magnitude. The practical ramifications of this probability gain for the formulation of risk reduction strategies are substantial.

INTRODUCTION

The Southern California Earthquake Center (SCEC) coordinates basic research in earthquake science using Southern California as its principal natural laboratory. The CyberShake method produces site-specific probabilistic seismic hazard curves, comparable to Probabilistic Seismic Hazard Analysis (PSHA) hazard curves produced by the U.S. Geological Survey (USGS) that are used in national seismic hazard maps.

If the CyberShake method can be shown to improve on current PSHA methods, it may impact PSHA users including scientific, commercial, and governmental agencies like the USGS. For seismologists, CyberShake provides new information about the physics of earthquake ground motions, the interaction of fault geometry, 3D earth structure, ground motion attenuation, and rupture directivity. For governmental agencies responsible for reporting seismic hazard information to the public, CyberShake represents a new source of information that may contribute to their understanding of seismic hazards, which they may use to improve the information they report to the public. For building engineers, CyberShake represents an extension of existing seismic hazard information that may reduce some of the uncertainties in current methods, which are based on empirical ground motion attenuation models.

METHODS AND RESULTS

SCEC has used Blue Waters to perform CyberShake computational research, a physics-based, computationally intensive method for improving PSHA. We have calculated several new PSHA seismic hazard models for Southern California, exploring the variability in CyberShake seismic hazard estimates produced by alternative 3D earth structure models and earthquake source models.

SCEC's CyberShake workflow system produced a repeatable and reliable method for performing large-scale research calculations in record time. The tools used work within the shared-computer resource environment of open science HPC resources including Blue Waters. These tools have helped our team increase the scale of the calculations by two orders of magnitude over the last five years without increasing personnel.

Our scientific contributions to PSHA have the potential to change standard practices in the field. Models used in PSHA contain two types of uncertainty: aleatory variability that describes the intrinsic randomness of the earthquake-generating system, and epistemic uncertainty that characterizes our lack of knowledge about the system. SCEC's physics-based system science approach can improve our understanding of earthquake processes, so it can reduce epistemic uncertainties over time. As an example of the potential impact, we used the averaging-based factorization (ABF) technique to compare CyberShake models and assess their consistency with Next Generation Attenuation (NGA) models. ABF uses a hierarchical averaging scheme to separate the shaking intensities for large ensembles of earthquakes into relative (dimensionless) excitation fields representing site, path, directivity, and source-complexity effects, and it provides quantitative, map-based comparisons between models. CyberShake directivity effects are generally larger than predicted by the NGA directivity factor [1,2], and basin effects are generally larger than those from the three NGA models that provide basin effect factors. However, the basin excitation calculated from CVM-H is smaller than from CVM-S, and shows stronger frequency dependence primarily because the horizontal dimensions of the basins are much larger in CVM-H. The NGA model of Abrahamson & Silva [3] is the most consistent with the CyberShake CVM-H calculations, with a basin effect correlation factor >0.9 across the frequency band 0.1-0.3 Hz.

WHY BLUE WATERS

SCEC uses Blue Waters to perform large-scale, complex scientific computations involving thousands of large CPU and GPU parallel jobs, hundreds of millions of short-running serial CPU jobs, and hundreds of terabytes of temporary files. These calculations are beyond the scale of available academic HPC systems, and, in the past, they required multiple months of time to complete using NSF Track-2 systems. Using the well-balanced system capabilities of Blue Waters (CPUs, GPUs, disks, and system software), together with scientific workflow tools, SCEC's research staff can now complete CyberShake calculations in weeks rather than months. This enables SCEC scientists to improve methodology more rapidly as we work towards CyberShake calculations at the scale and resolution required by engineering users of seismic hazard information.

PUBLICATIONS

Cui, Y., et al., Development and optimizations of a SCEC community anelastic wave propagation platform for multicore systems and GPU-based accelerators. Seismol. Res. Lett., 83:2 (2012), 396.
Taborda, R., and J. Bielak, Ground-Motion Simulation and Validation of the 2008 Chino Hills, California, Earthquake. Bull. Seismol. Soc. Am., 103:1 (2013), pp. 131-156.
Wang, F., and T. H. Jordan, Comparison of probabilistic seismic hazard models using averaging-based factorization. Bull. Seismol. Soc. Am., (2014), doi: 10.1785/0120130263.

FIGURE 1: Two CyberShake hazard models for the Los Angeles region calculated on Blue Waters using a simple 1D earth model (left) and a more realistic 3D earth model (right). Seismic hazard estimates produced using the 3D earth model show lower near-fault intensities due to 3D scattering, much higher intensities in near-fault basins, higher intensities in the Los Angeles basins, and lower intensities in hard-rock areas.
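To see why halving the aleatory variance can lower exceedance probabilities at high hazard levels by roughly an order of magnitude, consider a simple lognormal ground-motion model. This is an illustration only, under an assumed lognormal distribution and an assumed hazard level; CyberShake's hazard integration over rupture scenarios is more involved.

    P(X > x_0) = 1 - \Phi\!\left(\frac{\ln x_0 - \mu}{\sigma}\right)

    \text{At } \frac{\ln x_0 - \mu}{\sigma} = 2:\quad 1 - \Phi(2) \approx 2.3\times10^{-2};
    \qquad
    \text{with } \sigma^2 \to \sigma^2/2:\quad 1 - \Phi\!\left(2\sqrt{2}\right) \approx 2.3\times10^{-3},

i.e., about a factor of ten reduction in the probability of exceeding that shaking level.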


COLLABORATIVE RESEARCH: PETASCALE DESIGN AND MANAGEMENT OF SATELLITE ASSETS TO ADVANCE SPACE-BASED EARTH SCIENCE

Allocation: NSF/4.53 Mnh
PI: Patrick Reed1
Collaborators: Eric F. Wood2; Matthew Ferringer3

1 Cornell University
2 Princeton University
3 The Aerospace Corporation

EXECUTIVE SUMMARY:

This project is a multi-institutional collaboration between Cornell University, The Aerospace Corporation, and Princeton University advancing a petascale planning framework that is broadly applicable across space-based Earth observation systems design. We have made substantial progress towards three transformative contributions: (1) we are the first team to formally link high-resolution astrodynamics design and coordination of space assets with their Earth science impacts within a petascale "many-objective" global optimization framework; (2) we have successfully completed the largest Monte Carlo simulation experiment for evaluating the required satellite frequencies and coverage to maintain acceptable global forecasts of terrestrial hydrology (especially in poorer countries); and (3) we are initiating an evaluation of the limitations and vulnerabilities of the full suite of current satellite precipitation missions including the recently approved Global Precipitation Measurement (GPM) mission. This work will illustrate the tradeoffs and consequences of the GPM mission's current design and its recent budget reductions.

INTRODUCTION

Our satellite constellation design optimization framework is broadly applicable to the full array of National Research Council-recommended space-based Earth science missions. Our research proffers a critical step toward realizing the integrated global water cycle observatory long sought by the World Climate Research Programme, which has to date eluded the world's space agencies. Our research is critical for the scientific and space agency communities to overcome current computational barriers to transform the optimization of future satellite constellation architectures for delivering high fidelity data to a broad array of applications. Similarly, we envision that there is a broad array of scientists and users whose future activities will draw upon the project's scientific findings and generated data. As examples, the water-centric stakeholder community desperately requires improved monitoring and assessment of the water cycle for improved decision making related to flooding and droughts, as well as food and energy security.

METHODS AND RESULTS

Our team is exploiting access to the Blue Waters machine to radically advance our ability to discover and visualize optimal "many-objective" tradeoffs (i.e., conflicts for 4-10 objectives) encountered when designing satellite systems to observe global precipitation. Our design of satellite-based precipitation systems will explore the use of perturbing astrodynamics forces for passive control, the sensitivity of hyper-resolution global water cycle predictions on attainable satellite data frequencies, and advancing new technologies for highly scalable many-objective design optimization.

Our hypotheses related to passive control require high fidelity astrodynamics simulations that account for orbital perturbations, which dramatically increase serial design simulation times from minutes to potentially weeks. This project will be the first attempt to develop a 10,000 member Monte Carlo global hydrologic simulation at one degree resolution that characterizes the uncertain effects of changing the available frequencies of satellite precipitation on drought and flood forecasts. The simulation optimization components of the work will set a theoretical baseline for the best possible frequencies and coverages for global precipitation given unlimited investment, broad international coordination in reconfiguring existing assets, and new satellite constellation design objectives informed directly by key global hydrologic forecasting requirements.

We can categorize our project accomplishments to date within three foci: (1) scalable many-objective design optimization benchmarks, (2) advances in the use of high-fidelity astrodynamics simulation to permit passive control (i.e., minimum energy satellite constellations), and (3) benchmarks of the effects of reduced frequencies of satellite-based precipitation observations on global drought and flood forecasting.

1. With respect to many-objective design evaluation, we have completed the largest and best benchmark in terms of search quality and scalability for our team's underlying optimization algorithms. The results were made possible by the Blue Waters Friendly User period access. At 524,288 cores, our search approaches theoretically ideal performance. These results are the best benchmark ever attained for the challenge problem of focus and provide a strong foundation for our future tradeoff analyses.

2. In the context of passive control, our preliminary results focus on the patented four-satellite "Draim" constellation. Our Draim results reveal that carefully optimizing an initial orbital geometry to exploit natural perturbations (e.g., effects of sun, moon, etc.) can maintain continuous global coverage performance as a function of elevation angle. This minimizes propellant and station keeping requirements to dramatically reduce mission costs while increasing mission duration. The Draim constellation represents a stepping stone to the more complex suite of global precipitation missions that will require the analysis of more than ten satellites.

3. We are one of the first teams to show how limits in satellite-based precipitation observations propagate to uncertainties in surface runoff, evaporation, and soil moisture at distinctly different locations globally. Our results are based on the Variable Infiltration Capacity (VIC) global macroscale land surface model at 1.0° spatial resolution. For each realization of the VIC ensemble, each model grid cell's satellite precipitation is resampled at different temporal resolutions and then run through the VIC land surface model (a toy version of this step is sketched below). Our results suggest differing effects of spatial and temporal precipitation sampling on each water cycle component. For example, convection plays a dominant role in the tropics and sampling will highly impact the measured precipitation. However, plant transpiration is impacted less by the intensity and frequency of storms than by the sufficiency of the total precipitation. These insights have direct relevance to water security concerns in terms of floods and droughts.

WHY BLUE WATERS

In simple terms, the scale and ambition of our computational experiments require that we have the ability to compress years of computational work into minutes of wall clock time to be feasible. Additionally, our applications are extremely data intensive, so Blue Waters' high core count and high memory are fundamental requirements to realizing our goals.

The global hydrologic ensemble will require approximately 30 million core hours that will yield up to 2 PB of model output. This output represents a new benchmark dataset that will be of broad interest in a variety of Earth science and engineering applications. Our satellite design trade-off analysis will expend approximately 120 million core hours to discover how quickly we deviate from the "best case" observation frequencies, with limits on spending, limits in international coordination, neglect of hydrologic objectives, and the simplified astrodynamics simulations currently employed in practice.

PUBLICATIONS

Woodruff, M., P. Reed, T. Simpson, and D. Hadka, Many-Objective Visual Analytics: Using Optimization Tools to Enhance Problem Framing. Struct. Multidiscip. Optimiz., (submitted).
Woodruff, M., T. Simpson, and P. Reed, Many-Objective Visual Analytics: Diagnosing Multi-Objective Evolutionary Algorithms' Robustness to Changing Problem Conception. 15th AIAA/ISSMO Multidiscip. Anal. and Optimizat. Conf., Atlanta, Ga., June 16-20, 2014.
Ferringer, M., M. DiPrinzio, T. Thompson, K. Hanifen, and P. Reed, A Framework for the Discovery of Passive-Control, Minimum Energy Satellite Constellations. AIAA/AAS Astrodyn. Specialist Conf., San Diego, Calif., August 4-7, 2014.
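The temporal-resampling step referenced above can be sketched as follows. This is illustrative only; the project's actual resampling scheme, units, and sampling windows are not specified here, and the window length and gamma-distributed synthetic rainfall below are assumptions.

    import numpy as np

    def resample_precip(hourly, window):
        """Average hourly precipitation over `window`-hour blocks and spread it back uniformly,
        conserving the total accumulation that forces the land surface model."""
        n = len(hourly) - len(hourly) % window      # drop a ragged tail for simplicity
        p = np.asarray(hourly[:n], dtype=float)
        block_means = p.reshape(-1, window).mean(axis=1)
        return np.repeat(block_means, window)       # same length and same total as the input

    rng = np.random.default_rng(1)
    hourly = rng.gamma(shape=0.3, scale=2.0, size=240)   # synthetic 10-day hourly series
    coarse = resample_precip(hourly, window=6)           # e.g., a 6-hourly satellite revisit
    print(round(hourly.sum(), 3), round(coarse.sum(), 3))  # totals match

Coarsening the sampling interval preserves the total precipitation while smoothing storm intensity and frequency, which is exactly the contrast the team highlights between its effects on transpiration versus runoff.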


TOWARDS PETASCALE SIMULATION AND VISUALIZATION OF DEVASTATING TORNADIC SUPERCELL THUNDERSTORMS

Allocation: Illinois/0.76 Mnh
PI: Robert Wilhelmson1,2
Collaborators: Leigh Orf3; Roberto Sisneros2; Louis Wicker4

1 University of Illinois at Urbana-Champaign
2 National Center for Supercomputing Applications
3 Central Michigan University
4 National Severe Storms Laboratory

EXECUTIVE SUMMARY:

Utilizing the CM1 model, simulations of supercell thunderstorms were conducted. Simulations varied in resolution, physics options, forcing, and the environment in which the storm formed. Concurrently, development on I/O and visualization code was completed. Two major objectives were realized:
• A VisIt plugin was developed that enables researchers to visualize data both at full model scale and in any arbitrary subdomain. This plugin works with the new CM1 HDF5 output option that was developed and tuned on Blue Waters.
• A supercell simulation with an embedded long-track EF5 tornado was conducted and visualized with volume-rendering techniques in the VisIt plugin. To the best of our knowledge, this simulation is the first of its kind, capturing the genesis, maintenance, and decay of the strongest class of tornado. The simulated tornado traveled for 65 miles and bears a strong resemblance to an observed storm that occurred in a similar environment.

INTRODUCTION

Severe thunderstorms cause billions of dollars of damage annually to property and agriculture as well as loss of life due to flooding, lightning, and the severe winds associated with tornadoes. In the United States, prediction of severe storms continues to be a challenge, even with a large amount of dedicated and publicly supported human and technological infrastructure for protecting the public against severe weather threats.

In order to improve the accuracy of severe weather forecasts, we must improve our understanding of the severe weather phenomena being forecast. The thrust of the work conducted by our research team on Blue Waters is to understand better the inner workings of supercell thunderstorms and their most devastating product: the tornado. Specifically, we aim to capture the entire life cycle of the most devastating type of tornado: the long-track EF5, which exhibits the strongest sustained winds and the longest life cycle of all tornado types.

METHODS AND RESULTS

Long-track EF5 tornadoes are the least common of all tornadoes; in some years, none are observed in the United States even though hundreds of tornadoes occur. As expected, this infrequent observation is mirrored numerically, and it is a challenge to get a long-track EF5 to occur in a simulation. Our experience on Blue Waters indicates that, much like the real atmosphere, the likelihood that a given supercell simulation will produce a long-track EF5 is very low. In addition, very large computational resources like Blue Waters are required to simulate the entire thunderstorm, its surrounding environment, and the comparatively small-scale flow associated with tornado formation and the tornado's entire life cycle. The amount of data produced by these simulations is also very large (O(100 TB) per simulation). This amount of output presents a challenge for meaningful 3D visualization and simulation analysis.

During our time on Blue Waters we created a new output format for CM1 that dramatically reduced the wall-clock time required to do large amounts of I/O compared to existing CM1 options. HDF version 5 was chosen as the underlying data format for individual output files. In order to exploit the large amounts of memory available on Blue Waters while reducing the latency associated with frequently writing tens or hundreds of thousands of files to disk, a new approach was developed in which, for each write cycle, one rank per node collects and buffers data to memory dozens of times before flushing to disk. This approach reduced the number of files and disk operations and resulted in fewer (but larger) files being written to disk less frequently, resulting in better performance compared to other approaches.

A set of code was built around this output format (a "software plugin") that enables 3D analysis utilizing the VisIt visualization software that is supported on Blue Waters. Additional code was developed on Blue Waters that facilitates conversion from the model's raw output to other popular data formats such as netCDF and Vis5d.

We successfully simulated a long-track EF5 tornado that develops within a supercell and stays on the ground for 65 miles. To the best of our knowledge, this is the first time this has ever been accomplished. Utilizing the VisIt plugin, volume-rendered visualizations were created at very high temporal resolution, showing the development and maintenance of the EF5 tornado and the supercell that produced the tornado (example in fig. 1). In order to avoid memory exhaustion with high-quality ray casting settings, each frame was rendered on a single node and parallelization was achieved by rendering hundreds of frames concurrently. Animations produced from these frames have revealed very complex, sometimes highly turbulent, flow regimes involving dozens of constructive and destructive vortex interactions throughout the tornado's life cycle.

WHY BLUE WATERS

Blue Waters provides an infrastructure that is able to support the huge computational, communication, and storage loads inherent to our specific application. Furthermore, Blue Waters provides a robust environment that enables the rapid development of code optimizations for more efficient model performance, as well as the creation of new software in order to enable analysis and visualization of tremendous amounts of model output.

PUBLICATIONS

Orf, L., R. Wilhelmson, and L. Wicker, A Numerical Simulation of a Long-Track EF5 Tornado Embedded Within a Supercell. 94th Am. Meteorol. Soc. Annual Meeting, Atlanta, Ga., February 2-6, 2014.

FIGURE 1: A volume-rendered view from the south of the cloud mixing ratio field of the simulated tornadic supercell. Visible features include a tail cloud, wall cloud, and EF5 tornado beneath the mesocyclone of the simulated storm.
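The buffer-then-flush I/O pattern described above can be sketched as follows. This is a simplified, serial illustration using h5py; the class name, file naming, and cycle count are assumptions, and the actual CM1 implementation aggregates data across MPI ranks on each node, which is omitted here.

    import numpy as np
    import h5py

    class BufferedWriter:
        """Accumulate several write cycles in memory, then flush them to one HDF5 file."""
        def __init__(self, prefix, cycles_per_flush):
            self.prefix = prefix
            self.cycles_per_flush = cycles_per_flush
            self.buffer = []          # list of (name, array) pairs held in memory
            self.flush_count = 0

        def write(self, name, array):
            self.buffer.append((name, np.asarray(array)))
            if len(self.buffer) >= self.cycles_per_flush:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            # One larger file per flush instead of one small file per write cycle.
            fname = f"{self.prefix}_{self.flush_count:04d}.h5"
            with h5py.File(fname, "w") as f:
                for name, array in self.buffer:
                    f.create_dataset(name, data=array)
            self.buffer.clear()
            self.flush_count += 1

    # Toy usage: buffer 10 "write cycles" of a small field before touching the disk.
    w = BufferedWriter("cm1_demo", cycles_per_flush=10)
    for step in range(30):
        w.write(f"theta_{step:05d}", np.random.rand(16, 16, 16))
    w.flush()   # flush any remainder

Trading memory for fewer, larger writes is the essence of the approach: disk operations drop by roughly the buffering factor, at the cost of holding several output cycles in node memory.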

PHYSICS & ENGINEERING

MATERIALS
68 Quantum Electron Dynamics Simulation of Materials on High-Performance Computers
QUANTUM
70 Lattice QCD on Blue Waters
FLUIDS
73 Direct Simulation of Dispersed Liquid Droplets in Isotropic Turbulence
NANOTECHNOLOGY
74 Accelerating Nanoscale Transistor Innovation with NEMO5 on Blue Waters
76 Next-Generation Ab Initio Symmetry-Adapted No-Core Shell Model and its Impact on Nucleosynthesis
78 Space-Time Simulation of Jet Crackle
80 Petascale Quantum Simulations of Nano Systems and Biomolecules
82 Breakthrough Petascale Quantum Monte Carlo Calculations
84 Quantum Monte Carlo Calculations of Water-Graphene Interfaces
86 Scaling up of a Highly Parallel LBM-based Simulation Tool (PRATHAM) for Meso- and Large-Scale Laminar and Turbulent Flow and Heat Transfer
87 Petascale Particle-in-Cell Simulations of Kinetic Effects in Plasmas
88 Preliminary Evaluation of ABAQUS, FLUENT, and CUFLOW Performance on Blue Waters
90 Petascale Simulation of High Reynolds Number Turbulence
92 Computational Exploration of Unconventional Superconductors using Quantum Monte Carlo

QUANTUM ELECTRON DYNAMICS SIMULATION OF MATERIALS ON HIGH-PERFORMANCE COMPUTERS

Allocation: BW Prof./0.245 Mnh
PI: Andre Schleife1,2
Collaborators: Erik W. Draeger1; Victor Anisimov3; Alfredo A. Correa1; Yosuke Kanai1,4

1 Lawrence Livermore National Laboratory
2 University of Illinois at Urbana-Champaign
3 National Center for Supercomputing Applications
4 University of North Carolina at Chapel Hill

EXECUTIVE SUMMARY:

Rapidly advancing high-performance computers such as Blue Waters allow for calculating properties of increasingly complex materials with unprecedented accuracy. However, in order to take full advantage of leadership-class machines, modern codes need to scale well on hundreds of thousands of processors. Here we demonstrate high scalability of our recently developed implementation of Ehrenfest non-adiabatic electron-ion dynamics that overcomes the limits of the Born–Oppenheimer approximation. We find excellent scaling of the new code up to one million compute core floating-point units. As a representative example of material properties that derive from quantum dynamics of electrons, we demonstrate the accurate calculation of electronic stopping power, which characterizes the rate of energy transfer from a high-energy particle to electrons in materials. We use the example of a highly energetic hydrogen particle moving through crystalline gold to illustrate how scientific insights can be obtained from the quantum dynamics simulation.

INTRODUCTION

In order for computational materials design to succeed, oftentimes it is crucial to develop a thorough understanding of the interaction of ions and electrons. The photon absorption of solar cells, radiation damage in materials, and defect formation are a few of many examples of properties or phenomena that have their roots in the physics of interacting electrons and ions.

The physical laws that govern this regime are well known, but directly solving the Schrödinger equation (which describes the behavior of electrons) is intractable even on modern computers. Instead scientists must rely on approximations that limit the accuracy of quantum mechanics calculations. In particular, computational cost is reduced by disregarding the quantum dynamics of electrons in many first-principles molecular dynamics approaches, which makes various interesting material properties inaccessible when using such an oversimplified method. Therefore, accurate description of electron dynamics through time-dependent quantum mechanical theory is an important challenge in computational materials physics and chemistry today.

The massively parallel and hybrid architecture of modern high-performance computers constitutes an additional challenge for numerical simulations. It is necessary to develop theoretical and algorithmic methods that are capable of fully exploiting current and future high-performance computers in electronic structure calculations while continuing to use less restrictive approximations.

METHODS AND RESULTS

We recently developed and implemented a first-principles computational methodology to simulate non-adiabatic electron-ion dynamics on massively parallel computers. The scheme is based on the time-dependent extension of density functional theory and the underlying Kohn–Sham equations. Using an explicit fourth-order Runge–Kutta integration scheme in the context of a plane-wave code, we are now able to integrate the time-dependent Kohn–Sham equations in time, which allows us to explicitly study electron dynamics. We compute Hellmann–Feynman forces from the time-dependent (non-adiabatic) Kohn–Sham states and integrate ion motion using the Ehrenfest scheme. We showed that our implementation of this approach in the Qbox/qb@ll code is accurate, stable, and efficient. Using the computational power of Blue Waters as well as the BlueGene-based Sequoia high-performance computer at Lawrence Livermore National Laboratory, we showed excellent scaling of our implementation.

Owing to this new implementation, we are now able to pursue two important directions: (1) explore the scalability and applicability of the code in the context of high-performance computing, and (2) apply the code to elucidate the physics of electronic stopping in a material under particle-radiation conditions, which is a highly non-adiabatic process and, hence, crucially relies on overcoming the limitations of the Born–Oppenheimer approximation.

In addition, by studying the scientific problem of computing the electronic stopping of a hydrogen projectile in crystalline gold, we were able to unravel the influence of the stopping geometry and to understand contributions of semi-core electrons of the gold atoms, especially for highly energetic hydrogen projectiles. Good agreement with experiments demonstrates that this approach indeed captures the key physics and even promises the predictive accuracy that will be beneficial for yet unexplored systems. Using the example of hydrogen projectiles in gold we showed that we can achieve this challenging goal. In particular, the influence of the stopping geometry (i.e., the path on which the projectile atom travels through the crystal) is a crucial aspect of the problem that is often difficult to access in experiments. This application may enable first-principles design and understanding of radiation hard materials as well as the processes that underlie scintillators and radiation shielding.

WHY BLUE WATERS

In this context, it is crucial to use machines such as Blue Waters in order to validate computational parameters such as the plane-wave basis set, check the super cell size, and study long enough trajectories in order to eliminate computational artifacts. Leadership-class machines such as Blue Waters are essential as they pave the way toward exascale computing. Using as many as 251,200 compute cores on Blue Waters is an important test that allows us to explore the limits of our parallel implementation. At the same time, since we found the scaling to be excellent, machines such as Blue Waters will allow us to tackle exciting large-scale scientific problems in the future.

PUBLICATIONS

Schleife, A., E. W. Draeger, V. Anisimov, A. A. Correa, and Y. Kanai, Quantum Dynamics Simulation of Electrons in Materials on High-Performance Computers. Comput. Sci. Eng., (in press).

FIGURE 1: Visualization of the excited state electron density as the fast hydrogen projectile moves through bulk gold material.
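The explicit fourth-order Runge–Kutta propagation mentioned above can be illustrated on a generic time-dependent Schrödinger-type equation. This is a minimal toy sketch, not the plane-wave Qbox/qb@ll implementation; the two-level Hamiltonian, time step, and units (hbar = 1) are assumptions chosen only to make the example self-contained.

    import numpy as np

    def rk4_step(psi, t, dt, hamiltonian):
        """One explicit RK4 step of i d(psi)/dt = H(t) psi, with hbar = 1."""
        def rhs(p, tau):
            return -1j * hamiltonian(tau) @ p
        k1 = rhs(psi, t)
        k2 = rhs(psi + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = rhs(psi + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = rhs(psi + dt * k3, t + dt)
        return psi + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    def hamiltonian(t):
        # Toy two-level system with a weak time-dependent coupling (illustrative only).
        h0 = np.array([[0.0, 0.0], [0.0, 1.0]], dtype=complex)
        v = 0.1 * np.sin(2.0 * t) * np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)
        return h0 + v

    psi = np.array([1.0, 0.0], dtype=complex)   # start in the lower state
    t, dt = 0.0, 0.01
    for _ in range(1000):
        psi = rk4_step(psi, t, dt, hamiltonian)
        t += dt
    print("norm =", np.vdot(psi, psi).real)      # stays close to 1 for small dt

In the production code the state vector is replaced by plane-wave Kohn–Sham orbitals and the Hamiltonian by the time-dependent Kohn–Sham operator, but the time-stepping structure is the same explicit fourth-order scheme.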


LATTICE QCD ON BLUE WATERS

Allocation: NSF/60.1 Mnh
PI: Robert L. Sugar1
Collaborators: Alexei Bazavov2; Mike Clark3; Carleton DeTar4; Daping Du5; Robert Edwards6; Justin Foley3; Steven Gottlieb7; Balint Joo6; Kostas Orginos8; Thomas Primer9; David Richards6; Doug Toussaint9; Mathias Wagner7; Frank Winter6

1 University of California, Santa Barbara
2 University of Iowa
3 NVIDIA
4 University of Utah
5 University of Illinois at Urbana-Champaign
6 Thomas Jefferson National Accelerator Facility
7 Indiana University
8 College of William & Mary
9 University of Arizona

EXECUTIVE SUMMARY:

• We have developed highly optimized code for the study of quantum chromodynamics (QCD) on Blue Waters and used it to carry out calculations of major importance in high energy and nuclear physics.
• We used Blue Waters to generate gauge configurations (samples of the QCD vacuum) with both the highly improved staggered quarks (HISQ) and Wilson–Clover actions. For the first time, the up and down quarks in these calculations are as light as in nature for the HISQs, leading to a major improvement in precision.
• With the HISQ configurations, we calculated a ratio of decay constants that enables a determination of a key Cabibbo–Kobayashi–Maskawa matrix element to a precision of 0.2% and also obtained world-leading precision for at least half a dozen additional quantities.
• The Wilson–Clover project explored the isovector meson spectrum utilizing both the GPU and CPU nodes. On the CPU, a multi-grid solver gave an order of magnitude improvement over previous code.

INTRODUCTION

The standard model of high-energy physics encompasses our current knowledge of the fundamental interactions of subatomic physics. It has been enormously successful in explaining a wealth of data produced in accelerator and cosmic ray experiments over the past forty years. However, our knowledge is incomplete because it has been difficult to extract many of the most interesting predictions of quantum chromodynamics (QCD), those that depend on the strong coupling regime of the theory. The only means of doing so from first principles and with controlled errors is through large-scale numerical simulations. These simulations are needed to obtain a quantitative understanding of the physical phenomena controlled by the strong interactions, determine a number of the fundamental parameters of the standard model, and make precise tests of the standard model.

Despite the many successes of the standard model, high-energy and nuclear physicists believe that a more general theory will be required to understand physics at the shortest distances. The standard model is expected to be a limiting case of this more general theory. A central objective of the experimental program in high-energy physics, and of lattice QCD simulations, is to determine the range of validity of the standard model and search for physical phenomena that will require new theoretical ideas for their understanding. Thus, QCD simulations play an important role in efforts to obtain a deeper understanding of the fundamental laws of physics.

METHODS AND RESULTS

Our long-term scientific objective is to perform calculations of QCD, the theory of the strong interactions of subatomic physics, to the precision needed to support large experimental programs in high-energy and nuclear physics. Under our PRAC grant we are using two formulations of lattice quarks. The highly improved staggered quarks (HISQ) formulation is being used to calculate fundamental parameters of the standard model of high-energy physics and our current set of theories of subatomic physics, and to make precise tests of the standard model. In particular, the HISQ formulation is being used to calculate the masses of quarks, which are the fundamental building blocks of strongly interacting matter, and determine elements of the Cabibbo–Kobayashi–Maskawa (CKM) matrix, which are the weak interaction transition couplings between quarks. The CKM matrix elements and the quark masses are fundamental parameters of the standard model and therefore of great interest in their own right. Furthermore, in recent years a major line of research within high-energy physics has been to determine the same CKM matrix element through different processes to look for inconsistencies that would signal a breakdown in the standard model. Until now, uncertainties in the lattice calculations have limited the precision of these tests. We aim to match the precision of our calculations to that of experiments.

Our first objective with the Clover formulation of lattice quarks is to perform a calculation of the mass spectrum of strongly interacting particles (hadrons). The determination of the excited state spectrum of hadrons within QCD is a major objective for several new generations of experiments worldwide and is a major focus of the $310 million upgrade of Jefferson Laboratory. In particular, the GlueX experiment within the new Hall D at Jefferson Laboratory will search for the presence of "exotic" mesons. The existence of these particles is a signature for new states of matter, specifically the presence of gluonic degrees of freedom, predicted by QCD but thus far not clearly observed. The spectroscopy effort is intended to determine whether the equations of QCD do, in fact, realize the existence of such exotic states of matter. Because these predictions will be made before the experiments are performed, these calculations will provide crucial information about the decay signatures of such exotic states that will inform and guide the experimental searches.

Lattice QCD calculations have two steps. First, one generates and saves gauge configurations, which are representative samples of the QCD ground state. In the second step the gauge configurations are used to measure a wide range of physical quantities. The generation of gauge configurations is the rate-limiting step in the calculations and requires the most capable supercomputers available. The most computationally expensive component of the second step, the measurement routines, is to calculate the Green's functions for the propagation of quarks in the gauge configurations. For the light quarks, this calculation also requires highly capable computers.

We have made major progress in our efforts to generate gauge configurations and quark propagators using Blue Waters. These have included the most challenging ensembles undertaken to date. The new HISQ configurations have already been used to make the most precise determination to date of the decay properties of a number of mesons containing strange and charm quarks [1-3], which in turn have led to the evaluation of several CKM matrix elements that are important for tests of the standard model. They also have produced the most precise ratios among the up, down, strange, and charm quark masses [2]. Important advances have been made in the development of code for the generation of gauge configurations and quark propagators with the Clover formulation of lattice quarks [4]. The quark propagators calculated on Blue Waters with this code will play a major role in the large hadron mass spectrum calculation described above.

WHY BLUE WATERS

Lattice QCD calculations have made major progress in the last few years, with a limited number of calculations reaching precision of a fraction of a percent and techniques in place to determine many more quantities to this level of accuracy. Such precision is needed to test the standard model mentioned above and for a detailed understanding of physical phenomena controlled by the strong interactions. The advent of petascale computers, such as Blue Waters, is playing a critical role in these advances because high-precision QCD calculations are enormous undertakings which require computers of the highest capability and capacity.

QCD is formulated in the four-dimensional space-time continuum. However, in order to carry out numerical calculations one must reformulate it on a four-dimensional lattice or grid. In order to obtain physical results, one must perform calculations for a range of small lattice spacings and extrapolate to the continuum (zero lattice spacing) limit while keeping fixed the physical size of the box within which the calculations are performed. The computational cost grows roughly as the fifth power of the inverse of the lattice spacing, and one must employ very fine grids to obtain high-precision results. Furthermore, the computational cost of the calculations rises as the masses of the quarks decrease. Until quite recently, it has been too expensive to carry out calculations with the masses of the two lightest quarks, the up and the down, set to their physical values. Instead, one had to perform calculations for a range of up and down quark masses, and extrapolate to their physical values. Blue Waters is enabling us, for the first time, to carry out calculations with small lattice spacings and the masses of the up and down quarks at their physical values. This development has already led to a number of calculations of unprecedented precision.

PUBLICATIONS

Bazavov, A., et al., Leptonic decay-constant ratio fK+/fπ+ from lattice QCD with physical light quarks. Phys. Rev. Lett., 110 (2013), 172003.
Bazavov, A., et al., Charmed and strange pseudoscalar meson decay constants from HISQ simulations. Proc. 31st Int. Symp. Lattice Field Theory (LATTICE2013), Mainz, Germany, July 29-August 3, 2013.
Bazavov, A., et al., Determination of |Vus| from a lattice-QCD calculation of the K → π l ν semileptonic form factor with physical quark masses. Phys. Rev. Lett., 112 (2014), 112001.
Winter, F. T., M. A. Clark, R. G. Edwards, and B. Joo, A Framework for Lattice QCD Calculations on GPUs. Proc. 28th IEEE Int. Parallel Distrib. Process. Symp., Phoenix, Ariz., May 19-23, 2014.
Bazavov, A., et al., Symanzik flow on HISQ ensembles. Proc. 31st Int. Symp. Lattice Field Theory (LATTICE2013), Mainz, Germany, July 29-August 3, 2013.

FIGURE 1: Comparison of the recent evaluation of the leptonic decay constants (x-axis) of two mesons containing a charm quark, the D and Ds mesons, by the Fermilab Lattice and MILC Collaborations (labeled "This work") with earlier work. Diamonds, octagons, and squares are calculations with two, three, and four sea quarks [3].
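As a rough illustration of the cost scaling quoted above (our own order-of-magnitude arithmetic, not a figure from the project), the quoted growth of the cost C with the inverse lattice spacing a implies

    C(a) \propto a^{-5}
    \quad\Rightarrow\quad
    \frac{C(a/2)}{C(a)} = 2^{5} = 32,

so halving the lattice spacing makes an ensemble roughly 32 times more expensive, before accounting for the additional cost of working at lighter quark masses.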


DIRECT SIMULATION OF DISPERSED LIQUID DROPLETS IN ISOTROPIC TURBULENCE

Allocation: NSF/1.24 Mnh
PI: Said Elghobashi1
Collaborators: Michele Rosso1

1 University of California, Irvine

EXECUTIVE SUMMARY:

The objective of our research is to enhance understanding of the interaction between liquid droplets and a turbulent flow by performing direct numerical simulations (DNS). The freely moving deformable liquid droplets are fully resolved in three spatial dimensions and time, and all the scales of the turbulent motion are simultaneously resolved down to the smallest relevant length and time scales. Our DNS solve the unsteady 3D Navier–Stokes and continuity equations throughout the whole computational domain, including the interior of the liquid droplet. The droplet surface motion and deformation are captured accurately by using the level set method. The discontinuous density and viscosity are smoothed out across the interface by means of the continuous surface force approach. A variable density projection method is used to impose the incompressibility constraint.

GOALS

This study aims to investigate the two-way coupling effects of finite-size deformable liquid droplets on decaying isotropic turbulence using direct numerical simulation (DNS). Turbulent liquid-gas flows are found in many natural phenomena and engineering devices. In particular, the study of liquid-gas interfaces is important in combustion problems with liquid and gas reagents. The main challenge in the numerical simulation of multi-phase flows is representing the interface between the phases involved. This requires a method for tracking the interface and a model to describe the discontinuity in density and viscosity. Furthermore, surface tension on the moving interface must be taken into account to be able to deal with capillary effects and other curvature-dependent phenomena.

We employ a level set method to capture implicitly the moving interface and a continuous surface force (CSF) approach to model density and viscosity in a continuous fashion. Finally, surface tension is included in the momentum equations as a volume force and modeled using the CSF method.

Our first simulation goal is a single water droplet falling under the effect of gravity in a fluid at rest. This features most of the physics we are interested in, namely large variations of material properties between phases, surface tension, and droplet deformation, without the added complexity of turbulence: the perfect experiment to validate our algorithm. The second objective is to repeat the experiment in a turbulent environment. Eventually we will simulate a large number of liquid droplets in decaying isotropic turbulence.

Currently we are in the final testing stage of our code. No turbulence has been considered yet since our primary focus at this time is to obtain an accurate time evolution of the droplet interface.

WHY BLUE WATERS

Performing DNS of turbulent flows is very demanding in terms of computational power and memory availability. The computational grids employed need to be fine enough to resolve the smallest flow structures accurately; this requirement becomes more and more stringent as the Reynolds number based on the Taylor micro-scale is increased. In addition, an accurate time history of the flow is sought in order to compute time-dependent statistics, thus limiting the time step interval one can use. The demand for computational power is even larger for a multi-phase flow because the standard projection method for incompressible flows must be replaced by a variable-density projection method. The latter results in a variable-coefficient Poisson equation that is not solvable by a fast Fourier transform, thus requiring an iterative solver. We use the multi-grid preconditioned conjugate gradient solver provided by the PETSc library. Given the requirements outlined above, Blue Waters is a necessary resource for our research.
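One common way to write the interface smoothing and the CSF surface-tension force referred to above uses a smeared Heaviside function of the level set φ over a half-width ε. The abstract does not give the team's exact expressions, so the following standard forms are shown only as an illustration of the approach:

    H_\varepsilon(\phi) =
    \begin{cases}
      0, & \phi < -\varepsilon,\\
      \tfrac{1}{2}\!\left[1 + \tfrac{\phi}{\varepsilon}
        + \tfrac{1}{\pi}\sin\!\left(\tfrac{\pi\phi}{\varepsilon}\right)\right], & |\phi| \le \varepsilon,\\
      1, & \phi > \varepsilon,
    \end{cases}
    \qquad
    \rho(\phi) = \rho_g + (\rho_l - \rho_g)\,H_\varepsilon(\phi),
    \qquad
    \mu(\phi) = \mu_g + (\mu_l - \mu_g)\,H_\varepsilon(\phi),

with the surface tension entering the momentum equation as the volume force

    \mathbf{f}_\sigma = \sigma\,\kappa(\phi)\,\delta_\varepsilon(\phi)\,\nabla\phi,
    \qquad \delta_\varepsilon = \frac{dH_\varepsilon}{d\phi},

where σ is the surface tension coefficient and κ(φ) the interface curvature. The spatially varying ρ(φ) is what turns the pressure solve into the variable-coefficient Poisson equation mentioned above.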


ACCELERATING NANOSCALE TRANSISTOR INNOVATION WITH NEMO5 ON BLUE WATERS

Allocation: NSF/1.24 Mnh; GLCPC/0.313 Mnh
PI: Gerhard Klimeck1
Collaborators: Jim Fonseca1; Tillmann Kubis1; Daniel Mejia1; Bozidar Novakovic1; Michael Povolotskyi1; Mehdi Salmani-Jelodar1; Harshad Sahasrabudhe1; Evan Wilson1; Kwok Ng2
Intel Corporation; PETSc Development Team, Argonne National Laboratory; NVIDIA Corporation

1 Purdue University
2 Semiconductor Research Corporation

EXECUTIVE SUMMARY:

Relentless downscaling of transistor size has continued according to Moore's law for the past 40 years. According to the International Technology Roadmap for Semiconductors (ITRS), transistor size will continue to decrease in the next 10 years, but foundational issues with currently unknown technology approaches must be pursued. The number of atoms in critical dimensions is now countable. As the materials and designs become more dependent on atomic details, the overall geometry constitutes a new material that cannot be found as such in nature. NEMO5 is a nanoelectronics modeling package designed to comprehend the critical multi-scale, multi-physics phenomena through efficient computational approaches and quantitatively model new generations of nanoelectronic devices including transistors and quantum dots, as well as predict novel device architectures and phenomena [1,2]. This technology paradigm comes full circle as the NEMO tool suite itself provides input to ITRS and is also used by leading semiconductor firms to design future devices.

INTRODUCTION

The U.S. semiconductor industry is one of the largest export industries. The global semiconductor device market is over $300 billion and the U.S. holds more than one third of this market. The U.S. is a market leader and produces a significant number of high-paying, high-technology jobs. At the same time, the end of Moore's law scaling as we know it will be reached in ten years, with device dimensions expected to be about 5 nm long and 1 nm, or about 5 atoms, in the critical active region width. Further improvements in these dimensions will come only through detailed and optimized device design and better integration.

Quantum effects such as tunneling, state quantization, and atomistic disorder dominate the characteristics of these nanoscale devices. Fundamental questions need to be answered to address the downscaling of the CMOS switch and its replacement. What is the influence of atomistic local disorder from alloy, line-edge roughness, dopant placement, fringe electric fields, and lattice distortions due to strain on the carrier transport in nanometer-scale semiconductor devices such as nanowires, finFETs, quantum dots, and impurity arrays? Can power consumption be reduced by inserting new materials and device concepts?

METHODS AND RESULTS

The NEMO software suite has been used on Blue Waters to calculate design parameters for future devices, and these results have been included in the 2013 International Technology Roadmap for Semiconductors. Simulations found important deviations in the characteristics of devices as they are scaled down and raise questions about future device designs.

NEMO5 takes advantage of Blue Waters' unique heterogeneous CPU/GPU capability through the use of the MAGMA [3], cuBLAS, and cuSPARSE libraries. A specific type of non-equilibrium Green's function (NEGF), the Recursive Green's Function (RGF), which is a computational approach to handling quantum transport in nanoelectronic devices, has been implemented using the Sancho-Rubio algorithm [4]. NEGF requires storage, inversion, and multiplication of matrices on the order of the number of electronic degrees of freedom, and it is known that dense matrix multiplication and matrix inversion achieve respectable Flop/s on GPUs. These developments have allowed NEMO5 to achieve efficient scalability past 100 nodes on Blue Waters and performance that shows a single NVIDIA Kepler K20x GPU can provide as much processing power as 40 AMD Bulldozer cores. Additionally, a PETSc–MAGMA interface has been developed in conjunction with PETSc developers for future release to the community [5].

WHY BLUE WATERS

A toy calculation of a 50 nm long wire with a 3 nm diameter requires around 1 TFlop/s for a single energy point using NEGF. Resolution of a device's characteristics requires about 1,000 energy points, and this calculation must be repeated perhaps a dozen times for a full current-voltage sweep. Even with RGF, the computational time scales with the cube of cross-sectional area (relative to the direction of electron flow) and linearly with the length of the device. The treatment of a currently relevant finFET device would require an atomistic resolution of a device with a cross section around (20x40) nm2, which includes the core semiconductor and the surrounding gate material.

Codes in the NEMO tool suite have been shown to scale well on leadership-class machines. A previous version of NEMO5, OMEN, demonstrated almost perfect scaling to 222,720 cores and 1.44 PFlop/s, the first engineering code to deliver a sustained 1.4 PFlop/s on over 220,000 cores on Jaguar [6]. NEMO5 is a more general code but implements the same underlying numerical approaches and framework as OMEN. Scalable algorithms that primarily use dense/sparse-dense matrix-matrix multiplication and matrix inversion have been used to take advantage of Blue Waters' GPU capabilities.

PUBLICATIONS

Salmani-Jelodar, M., S. Kim, K. Ng, and G. Klimeck, CMOS Roadmap Projection using Predictive Full-band Atomistic Modeling. (submitted).
Salmani-Jelodar, M., J. D. Bermeol, S. Kim, and G. Klimeck, ITRS Tool on NanoHUB. NanoHUB User Conf., Phoenix, Ariz., April 9-11, 2014.

FIGURE 1 (LEFT TOP): Potential along an InAs Ultra-Thin Body (UTB) transistor.
FIGURE 2 (LEFT BOTTOM): Atomistic representation of Si, represented as diamond structure crystals.
FIGURE 3 (RIGHT TOP): InAs-GaAs quantum dot strain displacement.
FIGURE 4 (RIGHT BOTTOM): InAs-GaAs quantum dot stationary wave functions.

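A back-of-the-envelope reading of the RGF scaling quoted in the NEMO5 abstract above (our own illustration, not a NEMO5 benchmark, and assuming a circular cross section of roughly 7 nm2 for the 3 nm diameter wire): with time t ∝ A^3 L, moving from that wire to the roughly 800 nm2 finFET cross section increases the cost per unit device length by about

    \left(\frac{800\ \mathrm{nm^2}}{7\ \mathrm{nm^2}}\right)^{3} \approx 1.5\times10^{6},

which is why atomistic treatment of technologically relevant finFET geometries pushes even a petascale system.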

NEXT-GENERATION AB INITIO SYMMETRY-ADAPTED NO-CORE SHELL MODEL AND ITS IMPACT ON NUCLEOSYNTHESIS

Allocation: GLCPC/0.5 Mnh
PI: J. P. Draayer1
Collaborators: T. Dytrych1; K. D. Launey1; D. Langr2; T. Oberhuber2; J. P. Vary3; P. Maris3; U. Catalyurek4; M. Sosonkina5

1 Louisiana State University
2 Czech Technical University in Prague
3 Iowa State University
4 The Ohio State University
5 Old Dominion University

EXECUTIVE SUMMARY:

We have developed a next-generation first-principle (ab initio) symmetry-adapted no-core shell model (SA-NCSM). The SA-NCSM capitalizes on exact as well as important approximate symmetries of nuclei; it also holds predictive capability by building upon first principles, or fundamentals of the nuclear particles. These theoretical advances coupled with the cutting-edge computational power of the Blue Waters system have opened up a new region for first investigations with ab initio methods, the intermediate-mass nuclei from Fluorine to Argon isotopes. For example, reliable descriptions of Neon isotopes are already available. Such solutions are feasible due to significant reductions in the symmetry-adapted model space sizes compared to those of equivalent ultra-large spaces of standard no-core shell models. This is essential for further understanding nucleosynthesis, as nuclear masses, energy spectra, and reaction rates for many short-lived nuclei involved in nucleosynthesis are not yet available by experiment or reliably measured for the astrophysically relevant energy regime.

In addition, one of the most challenging problems in nuclear physics today is to achieve an ab initio nuclear modeling of the Hoyle state in 12C, which affects, for example, results of core-collapse supernovae simulations and stellar evolution models, predictions regarding X-ray bursts, as well as estimates of carbon production in asymptotic giant branch stars.

INTRODUCTION

Theoretical advances of the ab initio symmetry-adapted no-core shell model (SA-NCSM) [1] coupled with the cutting-edge computational power of the Blue Waters system open up a new region of the periodic table, the "sd shell" (or intermediate-mass region), including O, F, Ne, Na, Mg, Al, Si, P, S, and Ar isotopes, for first investigations with ab initio methods that hold predictive capabilities. This is essential for further understanding nucleosynthesis, as nuclear energy spectra and reaction rates for many short-lived nuclei involved in nucleosynthesis are not yet accessible by experiment or reliably measured for the astrophysically relevant energy regime.

METHODS AND RESULTS

We advance an ab initio (i.e., from first principles) large-scale nuclear modeling initiative that proffers forefront predictive capabilities for determining the structure of nuclear systems, including rare isotopes up through medium-mass nuclei that are inaccessible experimentally and fall far beyond the reach of other ab initio methods. We aim to provide nuclear structure information of unprecedented quality and scope that can be used to gain further understanding of fundamental symmetries in nature that are lost in massive datasets or require petascale (or even exascale) architectures, and to extract essential information for astrophysics (e.g., nucleosynthesis and stellar explosions), neutrino physics, and energy-related applied physics problems.

Our NSF-sponsored OCI-PetaApps award resulted in a practical and publicly available, platform-independent and highly scalable computational realization of the SA-NCSM [2-5]. Targeted nuclei represent a considerable challenge requiring more than 100,000 cores. The following list describes the results and projected studies:
• We have provided the first ab initio description of 20Ne. This is an example for an open-shell nucleus in the intermediate-mass region, with complexity far beyond the reach of complementary ab initio methods. Following this success, we target ab initio modeling of Ne, Mg, and Si isotopes, especially those close to the limits of stability (at proton and neutron drip lines), providing important input to nuclear reaction studies. Such reactions in the intermediate-mass region are key to further understanding phenomena like X-ray burst nucleosynthesis or the Ne-Na and Mg-Al cycles.
• We have studied electron scattering off 6Li with wave functions calculated in the ab initio SA-NCSM. Results show the efficacy of the SA-NCSM model space selection, for the first time, toward reproducing the low- and high-momentum components of the 6Li wave function. This finding is crucial for planned studies of neutrino scattering off of 12C and 16O nuclei, the ingredient nuclei in the neutrino experiment detectors.
• A work-in-progress focuses on one of the most challenging problems in nuclear physics today: achieving an ab initio nuclear modeling of the first excited 0+ state (the so-called Hoyle state) in 12C, the resulting state of the essential stellar triple-alpha process. Knowing the structure of low-lying states of 12C is key to modeling nucleosynthesis and stellar explosions.

WHY BLUE WATERS

The SA-NCSM was specifically designed to efficiently handle complex data in large-scale applications, and its efficacy has been demonstrated already, with the largest production run successfully utilizing 363,616 processors on Blue Waters for a 100-terabyte nuclear Hamiltonian matrix. This was facilitated by our new hybrid MPI+OpenMP implementation of the SA-NCSM, developed and optimized with assistance from the Blue Waters technical team. The unique features of the SA-NCSM and the Blue Waters system are crucial to advancing ab initio methods. The success of this first demonstration means new regions of the chart of nuclides are open for investigation within the framework of ab initio methods, in the sd shell and beyond, as well as highly deformed states exemplified by the Hoyle state in 12C. And in return, targeted scientific achievements of the types performed as well as proposed here could help prove the value of current HPC resources, and maybe even serve to help shape future HPC facilities.

PUBLICATIONS

Dytrych, T., et al., Collective Modes in Light Nuclei from First Principles. Phys. Rev. Lett., 111:25 (2013), 252501.
Dreyfuss, A. C., K. D. Launey, T. Dytrych, J. P. Draayer, and C. Bahri, Hoyle state and rotational features in Carbon-12 within a no-core shell-model framework. Phys. Lett. B, 727:4-5 (2013), pp. 511-515.
Tobin, G. K., M. C. Ferriss, K. D. Launey, T. Dytrych, J. P. Draayer, A. C. Dreyfuss, and C. Bahri, Symplectic No-core Shell-model Approach to Intermediate-mass Nuclei. Phys. Rev. C, 89 (2014), 034312.

FIGURE 1: Density profile of the 20 particles in the ground state of Neon.


SPACE-TIME SIMULATION OF JET CRACKLE

Allocation: Illinois/0.075 Mnh
PI: David Buchta1; Jonathan B. Freund1

1 University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY

Exploratory simulations are proposed as part of our research studying the noise from high-specific thrust jet exhausts, such as on military jets. This noise, known as crackle, exhibits a distinctive rasping fricative sound that is harsher than a typical civilian transport. The sound is generated by the complex flow turbulence interactions in the jet. Blue Waters provides simulation capabilities that allow us to examine the complex root mechanisms with unprecedented detail.

We employ large-scale simulations to study the nonlinear turbulence interactions that generate crackle. Three simulations at increasing flow speeds of technological importance (Mach numbers 1.5, 2.5, and 3.5) have been executed with excellent parallel scaling on Blue Waters. They show a fundamental change of character in the generation of the noise. Taking advantage of Blue Waters' I/O capabilities, we ran small-scale tests, loading in large portions of solution fields to evaluate higher resolution space-time correlations of turbulence and sound. Results of this work will advance scientific understanding of root causes and guide engineering mitigation of crackle.

INTRODUCTION

Many of us have heard a jet aircraft and thought it sounded different than usual. Looking up, we discover it is a military jet. Though their noise can be more intense, what often attracts notice is its distinct character: a rasping fricative sound, harsher than a typical civilian transport. People near airbases experience it a lot, often unhappily, and those who work closely around such aircraft can even sustain aural injury.

Since the identification of this peculiar, intense sound from high-speed jet exhausts, the source mechanisms of crackle have been debated. It is thought that supersonically advecting turbulent eddies surrounding the jet can radiate Mach-wave-like sound, which leads to its peculiar and intense character, now called "crackle." The question remains: what gives these waves the steepened, skewed signature that correlates with crackle?

The extreme flow conditions and the space-time character of the turbulence and its sound challenge experimental diagnostics. Advanced large-scale simulations such as ours are offering a microscope for studying its root mechanisms. We are focusing on the nonlinear turbulence interactions that generate the sound and the nonlinear acoustics that seem to give crackle some of its features, which may reveal a new set of "knobs" to control it. Such a control space may be applied to aircraft and ship detection, environmental noise reduction, and perhaps fine-tuning bio-medical acoustic procedures.

METHODS AND RESULTS

We have designed the present simulations to provide a detailed description of the turbulence that generates acoustic fields associated with crackle. While jet noise is the target application, we chose to study a compressible temporally developing turbulent planar shear layer because it provides a clearer perspective of the root mechanisms of this sound generation. It can be considered as a model for the near-nozzle region of a high-Reynolds number jet, where the turbulence is concentrated in a weakly curved shear layer between the high-speed potential core flow and the surrounding flow. Because the model is focused on this small region, we can explicitly resolve a larger range of turbulence scales than could be represented in a full jet simulation.

We have executed three simulations at increasing flow speeds of technological importance (Mach numbers 1.5, 2.5, and 3.5) with excellent parallel scaling on the Blue Waters system. These show a fundamental change of character in the generation of the noise, and their noise exhibits crackle levels comparable to full-jet simulations and experiments. Taking advantage of Blue Waters' I/O capabilities, we have run small-scale tests that load in larger portions of solution fields than we have previously employed to evaluate higher resolution space-time correlations of the turbulence and sound.

To provide a realistic and detailed description of the flow, we solve the three-dimensional compressible Navier–Stokes equations without modeling approximations. The computational domain for ongoing simulations is discretized in Cartesian coordinates by ~1.6 billion points. High-order finite differences and fourth-order Runge–Kutta time advancement provide the high resolution needed for both the turbulence and its acoustic radiation.

WHY BLUE WATERS

With Blue Waters we increased the Reynolds number by an order of magnitude beyond prior runs, allowing us to address important questions regarding scale similarity of the crackle phenomenon. This has foundational implications for crackle asymptotics in the high-Reynolds number limit that is representative of the full-scale engineering applications.

The other, more substantial and Blue Waters-specific goal is to examine the possibility of rapid space-time analysis of the full three-dimensions-plus-time databases we generate. Sound generation is fundamentally unsteady and turbulence is fundamentally three-dimensional. Typical post-processing of massive DNS databases requires compromises in resolution. However, Blue Waters can hold the entire database in memory. It has the potential to calculate key quantities for the whole database, such as space-time correlations of theoretical noise sources, without the usual I/O limitations of most modern systems. This could introduce a new paradigm for the analysis of such databases.

PUBLICATIONS

Anderson, A., and J. Freund, Source Mechanisms of Jet Crackle. 33rd AIAA Aeroacoustics Conference, Colorado Springs, Colo., June 4-6, 2012.
Buchta, D., A. Anderson, and J. Freund, Near-Field Shocks Radiated by High-Speed Free-Shear-Flow Turbulence. 35th AIAA Aeroacoustics Conference, Atlanta, Ga., June 16-20, 2014.

FIGURE 1: A 2D plane of a 3D compressible turbulent temporal shear layer showing magnitude of vorticity (color) and dilatation (grayscale). Dark gray inclined waves are weak shocks with sharp peaked pressure compressions that lead to the 'crackling' character of sound.

79 BLUE WATERS ANNUAL REPORT 2014

The electrical conductivity of DNA is of To capture the thermal fluctuations, the systems PETASCALE QUANTUM fundamental interest in the life sciences. However, are first investigated using molecular mechanics SIMULATIONS OF NANO SYSTEMS experimental measurements of conductivity (MM) with full solvation. differ dramatically for this system, depending We found that conductivity changes of 6 to 12 AND BIOMOLECULES on the method of sample preparation and the orders of magnitude for 4 bp and 12 bp DNA, experiment being performed. We are thus respectively. The main factor causing these investigating several factors affecting conduction, changes is the structure of DNA. Further analysis NSF/3.2 Mnh Allocation: including DNA sequence, the presence of water, shows that the most relevant parameter is the PI: Jerzy Bernholc1 Co-PIs: Shirley Moore2; Stanimire Tomov3; Wenchang Lu1; Miroslav Hodak1 counterions, and linkers. area of overlap between successive guanines: The current work addresses two issues: (1) the the most conductive configurations have high 1North Carolina State University conversion of biologically active nitrogen back values of this parameter between all neighboring 2 University of Texas at El Paso into atmospheric nitrogen, and (2) the mechanism base pairs. For these configurations, highly 3University of Tennessee, Knoxville of charge flow in DNA, understanding of which delocalized conductive states spanning the EXECUTIVE SUMMARY: could lead to new DNA-based sensors and entire molecule exist (fig. 1), while a break in devices. Another goal of this project is to adapt overlap between neighboring guanines causes This project focuses on high-performance our main quantum simulation code, Real Space the conductive states to be more localized, electronic structure calculations and Multigrid (RMG), to petascale supercomputers decreasing conductivity. development of petascale methods for such and release it to the scientific community, Additionally, we have investigated the effect simulations. We describe two applications: (1) enabling many more petascale simulations of of different sequences by replacing one of the Mechanistic investigation of the action of copper- nano systems and biomolecules. GC pairs with AT, GT, and AC pairs. The first speed interconnect between the nodes (due to containing nitrite reductases, which catalyze the FIGURE 1: one is a well-matched case, whereas the other frequent exchanges of substantial amounts of reduction of nitrite to nitric oxide, a key step Isosurface of two pairs are mismatched. Our investigation data between nodes). Each project required many in the denitrification process. We identify its charge density METHODS AND RESULTS finds that in an ideal DNA structure, these runs to explore its various scientific issues, with mechanism of action and determine the activation of the most substitutions decrease conduction through the a substantial amount of analysis between the energies, transition states, and minimum energy We have investigated theory of the enzymatic conducting DNA by a factor of 5 for the well-matched case runs. High availability and quick turn around pathways. (2) We use first-principles techniques function of CuNiR and found that only a single HOMO state and a factor of 50 when a mismatch is present. on Blue Waters are very important for timely combined with molecular dynamics simulations mechanism is consistent with the structural data from a high- However, when considering dynamical effects progress in our research. 
to calculate transport properties of B-DNA available for key intermediates. We found that the by averaging over multiple snapshots from MM conductivity DNA connected to carbon nanotubes. We find that the key part of the catalytic cycle involves changes configuration. 98 simulations, we find that all cases have very DNA conformation and especially the overlaps of Asp configuration from “proximal” to similar conductances. PUBLICATIONS The HOMO state between sequential guanine bases play a critical “gatekeeper” to “proximal”, and we identified the Ab initio electronic structure calculations have is extended role in electron transport, which is governed by origins of the two protons needed for the reaction Li, Y., M. Hodak, and J. Bernholc, Enzymatic over most of the 98 255 been very successful in studies of a wide range of charge delocalization. We also describe recent as coming from Asp and His , respectively. We Mechanism of Copper-Containing Nitrite scientific problems ranging from semiconductors guanine bases. optimizations and enhancements to the Real have also found that the previously observed side- Reductase. (submitted). to biological systems. Such calculations are rather Space Multigrid (RMG) code suite developed on coordination of the NO intermediate does computationally expensive and adapting the at North Carolina State University. RMG reached not occur during the normal function of CuNiR. codes and algorithms to maximize performance 1.144 Pflop/s on Blue Waters while using 3,872 CuNiR has the potential for use in on new computer architectures is an ongoing XK (GPU-based) nodes. A portable library of environmental remediation and removal of effort. The RMG code, which uses a sequence routines suitable for inclusion in other HPC excess nitrogen from aquatic environments. We of grids of varying resolutions to perform codes has also been developed. find that nitrite reduction and attachment are the main rate limiting steps with energy barriers quantum mechanical calculations, is very well of 20.05 and 15.44 kcal/mol, respectively. The suited to highly parallel architectures. It avoids fast Fourier transforms, which require global INTRODUCTION former barrier may be reduced by optimizing the T2 Cu binding site, while the latter can be communications, and parallelizes easily via Denitrification has become a critical part of decreased by improving the substrate channel domain decomposition. It has been adapted to remediating human impact on our planet, as leading to the catalytic site. petascale architectures and GPUs during this human activity has dramatically increased We have also investigated charge transport in proposal period and run at 1.14 Pflop/s on Blue the amount of bio-available nitrogen. The DNA. While a lot of experimental studies have Waters using 3,872 of the XK nodes. considered enzyme, copper-containing nitrite been performed, the results are contradictory reductase (CuNiR), is a key enzyme catalyzing and the process is poorly understood. Our the committing step of this process. This class WHY BLUE WATERS calculations consider 4 and a 10 base-pair (bp) of enzymes has been extensively investigated poly (G) poly (C) DNA fragments connected to Both applications described above require a already, but many mechanistic aspects remain (5,5) carbon nanotube leads via alkane linkers. very large parallel supercomputer with a high- controversial.

80 81 BLUE WATERS ANNUAL REPORT 2014

body effects are critical to overcoming key recent experimental results and are a very demonstrating nearly linear scaling up to several BREAKTHROUGH PETASCALE barriers to predictive, accurate simulation. promising illustration of the capacity of QMC tens of thousands of cores. QUANTUM MONTE CARLO Two key challenges addressed by our team in historically very challenging systems. CALCULATIONS are: (1) the development of high-accuracy computational methods for predictive, Defect chemistry of the earth–abundant PUBLICATIONS quantitative analysis of interacting physics; and photovoltaic material CZTS Ma, F., S. Zhang, and H. Krakauer, Excited (2) the application of these methods across a Two relatively new semiconducting materials, Allocation: NSF/2.66 Mnh state calculations in solids by auxiliary-field 1 spectrum of problems, ranging from fundamental CuZnSnS and CuZnSnSe, have received PI: Shiwei Zhang quantum Monte Carlo. New J. Phys., 15 (2013) 2 2 2 physics to real engineering materials design. Co-PIs: David Ceperley ; Lucas Wagner ; Elif Ertekin substantial attention as potential earth- 093017. 3 4 5 1 6 Collaborators: J. Grossman ; R. Hennig ; P. Kent ; H. Krakauer ; L. Mitas ; abundant alternatives to conventional silicon Zhang, S., Auxiliary-Field Quantum Monte A. Srinivasan7; C. Umrigar4 photovoltaics, but the influence of defects on Carlo for Correlated Electron Systems. in METHODS AND RESULTS properties is unknown. We are studying the 1College of William & Mary Emergent Phenomena in Correlated Matter 2University of Illinois at Urbana-Champaign Chromium dimer formation of defect clustering reactions that Modeling and Simulation, E. Pavarini, E. Koch, 3Massachusetts Institute of Technology The chromium dimer has become a landmark can be detrimental to charge carrier transport. and U. Schollwock, Eds. (Forschungszentrum 4 Cornell University test for electronic structure computation (fig. 1a). Jülich, Jülich, Germany , 2013), vol. 3. 5Oak Ridge National Laboratory 6 The quest for a scalable method capable of its Purwanto, W., S. Zhang, and H. Krakauer, North Carolina State University WHY BLUE WATERS 7Florida State University accurate treatment is ongoing. We carried out Frozen-orbital and downfolding calculations phaseless auxiliary-field quantum Monte Carlo The Blue Waters computing framework provides with auxiliary-field quantum Monte Carlo.J. EXECUTIVE SUMMARY: (ph-AFQMC) calculations using large, realistic the exciting opportunity to apply direct Chem. Theory Comput., 9 (2013) pp. 4825-4833. The central challenge that our PRAC team aims basis sets. In parallel, we performed exact stochastic solution methods—namely, QMC to address is accurate, ab initio computations AFQMC calculations for smaller basis sets to methods—to highly ambitious problems. The of interacting many-body quantum mechanical systematically improve the ph-AFQMC accuracy. numerical methods are highly parallelizable, systems. The Blue Waters petascale computing The calculated spectroscopic properties are in framework has enabled highly ambitious good agreement with experimental results. calculations of a diverse set of many-body physics and engineering problems, spanning Adsorption of cobalt atoms on graphene from studies of model systems in condensed Cobalt atoms adsorbed on graphene is of matter (3D Hubbard model) to superconductivity intense research interest because of their possible in high-pressure hydrogen to the simulation of use in spintronics applications. 
We use auxiliary- real materials of current research interest for field quantum Monte Carlo (QMC) and a size- energy applications. Over the past year, we have correction embedding scheme to accurately carried out simulations aiming to address these calculate the binding energy of Co/graphene three general target areas. In our symposium for several high-symmetry adsorption sites presentation, we gave an overview of the team’s and benchmark a variety of different theoretical activities and highlighted the latest science methods. A theory to explain recent experimental results. We focused on several of our explorations, observations based on the calculations was including near-exact Hubbard model calculations, provided in our talk at the 2014 symposium. the dissociation of the chromium dimer, the adsorption of cobalt atoms on graphene, the Metal-insulator transition in VO2 metal-insulator transition in vanadium dioxide, We have been able to describe the metal FIGURE 1 (A): The quantum Monte Carlo FIGURE 1 (B): Using quantum Monte Carlo and calculations of defects in semiconductors insulator transition in the material VO2 using method enables unprecedented accuracy methods, we can answer fundamental for photovoltaic and other applications. QMC calculations. To our knowledge, this is in capturing the dissociation physics of questions about semiconductor defect the first time this material has been accurately the chromium molecule, a long-standing physics such as the nature of nitrogen simulated without using adjustable parameters. physics challenge. impurities in zinc oxide, a question that INTRODUCTION has historically posed challenges to the Nitrogen doping in zinc oxide defect community. Many-body interactions are at the heart of both Can nitrogen doping make zinc oxide a fundamental and applied physics problems. For p-type semiconductor? This question has long decades, direct numerical solution of interacting been debated, with several experimental results systems has been computationally intractable, indicating no, while several (less accurate) necessitating the use of approximate or effective computational methods suggest yes. Our results, models. Direct approaches accounting for many- the first ever QMC assessment, are summarized in fig. 1b; they are in very good agreement with 82 83 BLUE WATERS ANNUAL REPORT 2014

INTRODUCTION scales well with system size is needed to study QUANTUM MONTE CARLO graphene-water interaction. CALCULATIONS OF WATER- An accurate theoretical picture of the graphene- To achieve these goals, we perform a set water potential energy surface will not of highly accurate quantum Monte Carlo GRAPHENE INTERFACES only establish the basics of graphene-water calculations (QMC) on multiple water molecules interactions, but will also pave the way for interacting with a graphene surface. QMC is revolutionary carbon-based applications related a class of methods that directly approach the Allocation: Illinois/0.45 Mnh to energy, medicine, and water purification. The 1 many-body quantum problem. In this work, we PI: Narayana R. Aluru outcome of our calculations will have several Co-PI: Lucas Wagner1 plan to use two of the most prevalent flavors 1 1 important scientific benefits: Collaborators: Yanbin Wu ; Huihuo Zheng of QMC: variational Monte Carlo (VMC) and 1. A physical picture of the interaction of water fixed node diffusion Monte Carlo (FN-DMC or 1 University of Illinois at Urbana-Champaign with graphene. The interaction of water with DMC), both of which solve for the many-body surfaces is still very much an open problem, with EXECUTIVE SUMMARY: ground state of a system of particles. ramifications in biology, atmospheric science, Our DMC calculations show strong graphene- We are using the power of Blue Waters to and global warming research, in addition to water interaction, indicating the graphene surface perform highly accurate calculations of the technological applications like water purification. is more hydrophilic than previously believed. electronic structure of water adsorbed on Being able to examine the water-surface The graphene-water binding energy computed graphene surfaces. An accurate theoretical interaction with first-principle calculations has in our DMC calculations is in good agreement picture of the graphene-water potential energy the potential to be a game changer in the study with the heat of the adsorption energy measured surface will not only establish the basics of of these systems. using the gas chromatogram technique [1]. The graphene-water interactions, but will also 2. The development and testing of an accurate unusually strong interaction can be attributed to pave the way for revolutionary carbon-based water-graphene surface potential based on the weak bonds between graphene and water. Charge FIGURE 1: Electron applications related to energy, medicine, and highly accurate energies given by the ab initio transfer may also contribute to the interaction. redistribution water purification. Our diffusion Monte Carlo calculations. The water-graphene surface An isolated water molecule has one unoccupied when graphene (DMC) calculations show strong graphene-water potential will be utilized in molecular dynamics band lying below the Fermi level of the graphene. and water interaction, indicating the graphene surface is simulations. When graphene and water approach each other, approach each more hydrophilic than previously believed. The 3. Reference data for the development of charge transfers from graphene to water, which other. A positive unusually strong interaction can be attributed to density functional theory (DFT) in the weak can be shown from electron density analysis value means weak bonds between graphene and water. Our interaction limit. using DMC calculations as shown in fig. 1. 
accumulation of DMC calculations are of unprecedented accuracy, electron density based on the first-principles Hamiltonian, and and a negative can provide insight for experimentalists seeking METHODS AND RESULTS WHY BLUE WATERS value means to understand water-graphene interfaces and for We are using the power of Blue Waters to perform The DMC calculations are quite computationally depletion of theorists seeking to improve density functional highly accurate calculations of the electronic expensive. To compute one interaction energy electron density. theory for weakly bound systems. structure of water adsorbed on graphene surfaces. point between water and graphene, 200,000 Estimating graphene-water interaction energies core-hours are needed. It is ideal to run these based on ab initio calculations is challenging DMC calculations on the large-scale resources because graphene-water interaction is a of Blue Waters since QWalk, the QMC package weakly bounded system. As such, the electron we use, scales very well up to 64,000 cores (as correlation has to be properly described with tested on Blue Waters). Simulations that used to adequate precision. Conventional DFT fails to take weeks on other systems can now be done do this. within days or even hours on Blue Waters so For this reason, a correction to DFT focused that multiple configurations can be explored to on improving the description of the dispersion obtain unprecedented accuracy in interaction term. Recent reports of graphene-water binding energy profiles. energies using various electronic structure methods vary widely. The scatter of this data is mainly due to approximations in the electron structure methods that are necessary to be able to capture big graphene systems. A method that describes electron correlation accurately and

84 85 BLUE WATERS ANNUAL REPORT 2014

METHODS AND RESULTS front of a second bucket. As the electrons from SCALING UP OF A HIGHLY PETASCALE PARTICLE-IN-CELL the second bucket de-phase and move toward the PARALLEL LBM-BASED SIMULATION PRATHAM is being developed at Oak Ridge SIMULATIONS OF KINETIC EFFECTS front, they feel this defocusing force and begin National Laboratory to demonstrate the accuracy to spread out. A small focusing region between TOOL (PRATHAM) FOR MESO- and scalability of a lattice Boltzmann method for IN PLASMAS the first and second buckets captures some of AND LARGE-SCALE LAMINAR AND turbulent flow simulations that arise in nuclear these electrons in the second bucket, causing applications. The code is written in FORTRAN90 the mono-energetic ring to form. TURBULENT FLOW AND HEAT and made parallel using a message-passing Allocation: NSF/4.69 Mnh 1 Laser-driven inertial fusion energy (IFE) can interface. Silo library is used to write the data PI: Warren Mori TRANSFER Collaborators: Frank S. Tsung1 be detrimentally affected by the coupling of files in a compact form, and VisIt visualization laser light waves to the plasma through which software is used to post-process the simulation 1University of California, Los Angeles it propagates. In stimulated Raman scattering Allocation: Illinois/0.05 Mnh data in parallel. (SRS), the incident laser wave decays into an PI: Rizwan Uddin1 PRATHAM has a variety of models EXECUTIVE SUMMARY: 2 3 electron plasma wave (EPW) and a scattered Collaborators: Sudhakar V Pamidighantam ; Prashant Jain implemented in it. For example, the collision 4 3 In the past two years, the Blue Waters light wave, potentially resulting in a direct loss Associated Researchers: Kameswararao Anupindi ; Emilian Popov between lattice points can be modelled using supercomputer has allowed us to perform very of drive energy, inefficiency in drive symmetry, 1University of Illinois at Urbana-Champaign either a single or multiple relaxation time large-scale simulations which have provided and pre-heating of the fusion fuel due to hot 2National Center for Supercomputing Applications approximation. Lattice models such as D3Q19 qualitative and quantitative understanding electrons generated by the daughter EPW. 3 Oak Ridge National Laboratory and D3Q27 are implemented in a generic manner in many different topics in plasma physics, We have performed 2D PIC simulations 4University of Southampton so that new types can be incorporated easily including plasma-based accelerators and laser of SRS in inhomogeneous plasmas and with EXECUTIVE SUMMARY: into the code. Large eddy simulation (LES) is fusion. The simulation results (described below) speckled laser beams. The laser beams in used to simulate turbulent flows. In LES the have significantly impacted our understanding inertial confinement fusion experiments consist In the present work, we enhance, validate, and larger turbulent structures are directly resolved, in these areas. Furthermore, we will discuss the of a distribution of high-intensity hot spots, test the scalability of PRATHAM (PaRAllel whereas turbulent structures smaller than the grid need to move toward exascale computing and or speckles. While only a small percentage of Thermal Hydraulics Solver using Advanced are modelled using a sub-grid scale (SGS) model. our development plans for future architectures, these may be above the instability threshold Mesoscopic applications), a lattice Boltzmann The present work uses the simple and widely used including Intel Phis and GPUs. 
for SRS, waves and particles generated by SRS method-based code for solving incompressible, Smagorinsky SGS. This model modifies the fluid in one speckle can stream into a neighboring time-dependent laminar and turbulent fluid flow viscosity by an additional eddy viscosity that speckle and cause it to undergo SRS even if it’s in 3D domains. PRATHAM code is enhanced mimics energy dissipation in the sub-grid eddies, METHODS AND RESULTS FIGURE 1: (left) below the threshold. Simulations of two-speckle with a new lattice type (D3Q27), an immersed which is proportional to the resolved strain rate Comparison of We study the electron beam evolution in scenarios have allowed us to study the conditions boundary method to handle complex geometry, tensor. An immersed boundary method is also u component of an electron beam-driven plasma wakefield for which scattered light waves, EPWs, and hot and the capability to collect and report turbulent implemented so the code can simulate turbulent velocity in the accelerator when the accelerated beam has a electrons generated by SRS in above-threshold statistics. Using some of these new enhancements, flow in complex geometries that are typical in y direction on very small transverse emittance and a very small speckles can trigger SRS in neighboring, below- the code is first tested for turbulent flow in a lid- nuclear applications. A ghost-point immersed the central z matched spot size that can cause the plasma ions threshold speckles. Larger-scale simulations of driven cavity (LDC), turbulent flow in a circular boundary method is implemented which works plane. (right) to collapse toward the beam. The improved quasi- multi-speckle ensembles show that SRS cascades pipe. The results obtained for flow in a LDC at a starting from a geometry file in stereo lithography Comparison of static particle-in-cell code QuickPIC allows triggered by scattered light can lead to pump Reynolds number of 22,000 are compared with format. v component of us to use very high resolution and to model depletion, dominating the recurrence of SRS. the available DNS data in the literature and this In order to validate the solver we used flow velocity in the x asymmetric spot sizes. Simulation results show These complex interactions could not be step acts as a validation of the PRATHAM solver. in a lid-driven cavity (LDC) as a test case. A direction on the that the accelerated beam will reach a steady understood without simulations performed on PRATHAM is found to scale well as the processor LDC is a simple geometry that shows several central z plane. state after propagating several centimeters in Blue Waters. count is increased on Blue Waters. interesting flow features such as bifurcation and the plasma. We find that for round beams the corner eddies. Direct numerical simulation data (Li+) ion density is enhanced by a factor of 100, are available in the literature for comparison. but the emittance only grows by around 20%. Flow in a LDC at a Reynolds number of For asymmetric spot sizes, the ion collapse is 22,000 is simulated and mean and fluctuating less, and emittance growth is zero in the plane turbulent quantities are computed. Fig. 1 shows with the largest emittance and about 20% in the a comparison of mean velocities with the DNS other plane. data as the mesh is refined from 150 mesh points Recent experiments with the Callisto laser to 240 mesh points in each direction. 
As the in the Jupiter laser facility have demonstrated mesh is refined the results approach that of the the formation of a mono-energetic ring with DNS. A mesh size of 200 points in each direction an average energy of 150-250 MeV. OSIRIS 3D seems to be an optimum size for this problem. simulations have shown that injected electrons We found that PRATHAM scales well up to a 4 from a first bucket create a defocusing area in the processor count of O(10 ).

86 87 BLUE WATERS ANNUAL REPORT 2014

INTRODUCTION nodes on Blue Waters than on a single-core PRELIMINARY EVALUATION OF Dell Precision T7600 workstation. Further ABAQUS, FLUENT, AND CUFLOW Continuous casting is used to produce 95% of simulations on Blue Waters with a different steel in the world in the form of semi-finished number of computer nodes showed almost linear PERFORMANCE ON BLUE WATERS shapes such as slabs, blooms, billets, beam blanks, speed-up with more nodes (fig. 1). Results from and sheets. Even small improvements to this this model show that dithering (oscillation) of process can have a wide impact. Most defects the slide gate to lessen clogging also causes mold Allocation: Illinois/0.1 Mnh arise in the mold region of the casting process 1 flow oscillations, which may become unstable at PI: Brian G. Thomas due to the entrapment of inclusion particles into 1 1 1 1 1 certain frequencies [12]. Co-PIs: Lance Hibbeler ; Kai Jin ; Rui Liu ; Seid Koric ; Ahmed Taha the solidifying shell and crack formation in the For the in-house GPU code, CUFLOW, the 1University of Illinois at Urbana-Champaign newly solidified steel shell. These defects persist pressure Poisson equation (PPE) solver was into the final products and cannot be removed. tested on a single Blue Waters XK7 node by EXECUTIVE SUMMARY: Thus, the best method to improve steel products solving a heat conduction problem in a 3D cube. This project advances the state of the art in is to fully understand the mechanisms of defect The solver uses V-cycle multi-grid technique computationally intensive models of turbulent formation and find operation conditions that and a red/black successive over-relaxation flow, mechanical behavior, heat transfer, and avoid these problems. (SOR) method. Both CPU and GPU versions Owing to the high temperatures and harsh were developed and tested. Increasing grid size solidification in the continuous casting of steel. # THREADS DAYS REQUIRED These models provide practical insight into how commercial environment, it is difficult to from 0.26 million to 0.56 billion cells increased to improve this important manufacturing process. conduct comprehensive measurements in the speedup to a maximum of 25 (fig. 2). The model is 32 6.80 The performance of and preliminary results from manufacturing process. Accurate and efficient being applied to predict the entrapment locations 64 5.25 computational models are needed to optimize the commercial codes FLUENT and ABAQUS of inclusion particles in the solidified strand for 128 5.50 and an in-house code are presented. FLUENT various process variables (nozzle geometry, steel conditions where measurements were obtained 256 5.50 has been tested with a 3D, two-phase turbulent and gas flow rates, electromagnetics, taper of at an operating commercial caster [13]. the mold walls, etc.). In addition to improving flow simulation and demonstrates a speed-up PUBLICATIONS FIGURE 1 factor of about 100 with 256 cores. ABAQUS/ the manufacturing process, development and Hibbeler, L. C., B. G. Thomas, R. C. Schimmel, (LEFT): FLUENT Standard has limited speed-up capabilities validation of better computational methodologies WHY BLUE WATERS and H. H. Visser, Simulation and Online computational because of its direct solver and works best with is useful in the modeling of many other processes. All of the models used in this work are very Measurement of Narrow Face Mold Distortion in cost on Blue about 64 cores; the vast amount of memory on computationally demanding. For the stress Thin-Slab Casting. Proc. 
8th European Continuous Waters (per Blue Waters has improved the simulation time METHODS AND RESULTS analysis part, multi-scale thermal-mechanical Casting Conf. , Graz, Austria, June 23-26, 2014. iteration) of a thermo-mechanical model of the mold and simulations of mechanical behavior the with speed-up waterbox by a factor of about 40. A maximum For the stress analysis, several runs with a mesh of solidifying steel shell are being conducted using relative to a lab speed-up factor of 25 has been observed on a 375,200 elements, 754,554 nodes, and 2,263,662 ABAQUS/Explicit. By aiming to capture physical workstation. single Blue Water XK7 node for the in-house degrees of freedom (DOFs) were completed on phenomena involving detailed behavior on the FIGURE 2 (ABOVE): GPU code for flow simulations. Blue Waters with implicit ABAQUS, which small scale of the microstructure, this model PPE solver allows input and use of the phase field. The requires advanced computational resources performance on CPU time for one Newton iteration, consisting like Blue Waters. Blue Waters CPU of 0.57 Tflop/s, is presented in table 1 for different For the fluid flow simulation, the Navier– and GPU. numbers of threads. The optimum number of Stokes equations are solved using the finite threads is ~64. Efficiency appears to be limited volume method for large eddy simulations (LES) TABLE 1: ABAQUS by the FEM assembly process, coupling between incorporating the multi-phase flow via Eulerian– runs on Blue DOFs during solving, or communication across Lagrangian coupling [7-8]. Our previous LES Waters. CPU processors. Results from this model match with simulations with about 1.5 million cells requires time required plant inclinometer measurements on the cold- about four months to simulate only 30 seconds for 1 second face exterior [11]. of model time on high-end workstations [9-10]. simulation for To evaluate performance of the commercial Proper mesh resolution requires over 10 million different numbers package, FLUENT, for fluid flow modeling on cells, and resolving the main periodic frequencies of threads Blue Waters, argon-steel two-phase turbulent identified in plant experiments require over 60 flows in a continuous casting mold was modeled seconds of model time. Thus, we explore the on several computers. The test consisted of 0.66 feasibility of using FLUENT on Blue Waters, with million hexahedral mapped computational cells the help of ANSYS, Inc. and ~8.4 million DOFs. One hundred iterations of FLUENT ran about 108 times faster on 240

88 89 2014

METHODS AND RESULTS in the acceleration, which is very important in the modeling of turbulent dispersion. Our computational challenge is to simulate incompressible turbulence in a periodic 3 domain at a resolution of 8,192 , with more WHY BLUE WATERS than half a trillion grid points, at a Reynolds number that exceeds previous known work, The rapid increase in the range of scales in length while also resolving the small scales better and time with Reynolds number is such that, in general, a 8,1923 simulation is almost 16 times as than standard practice in the literature. The 3 presence of a wide range of scales is crucial to expensive as one at 4,096 resolution (which was many key properties such as the intermittency first reached in 2002, on the Earth Simulator in of extreme events, the spectral transfer from Japan). The cost also increases considerably (by the large scales to the small scales, the effective at least a factor of two) when we include mixing mixing of transported substances compared to and dispersion in addition to the basic flow field. PETASCALE SIMULATION OF HIGH molecular diffusion, and the relative dispersion Consequently, our intended science target can only be reached using a large allocation of time REYNOLDS NUMBER TURBULENCE of contaminants carried along highly convoluted fluid element trajectories. Although several well- on a multi-petaflop computer such as Blue known hypotheses of scale similarity provide an Waters. At the same time, our code performance approximate description of the flow physics, well- benefitted greatly from Cray personnel helping Allocation: NSF/9.03 Mnh us with remote memory addressing, and from the 1 resolved numerical simulation data at a Reynolds PI: P.K. Yeung Blue Waters project staff in arranging reserved Collaborators: A. Majumdar2; D. Pekurovsky2; R. D. Moser3; J.J. Riley4; K.R. number higher than achieved in the recent 5 6 partitions of up to 8,192 32-core Blue Waters Sreenivasan ; B.L. Sawford past are still necessary for theory and model development at the next level of physical realism. nodes, which largely overcomes the issue of 1 Georgia Institute of Technology As of April 2014 we obtained a 8,1923 velocity network contention with other jobs running 2San Diego Supercomputer Center INTRODUCTION on the system. 3University of Texas at Austin field which is statistically stationary and isotropic, 4University of Washington Turbulence at high Reynolds number arises at a Taylor-scale Reynolds number close to 1,300, 5New York University in numerous types of natural phenomena and with a grid spacing which resolves scale sizes 6 PUBLICATIONS Monash University, Australia engineering devices where disturbances in one down to about 1.5 times the Kolmogorov scale. EXECUTIVE SUMMARY: form or another are generated by some physical Many instantaneous snapshots of velocity fields Yeung, P. K., Early Results from High Reynolds mechanism, transported over some distance in spanning about 2.5 large-eddy turnover times Number DNS at 8,1923 resolution. Second We study the complexities of turbulent fluid flow space, or dissipated through molecular viscosity have been saved. When running under a reserved International Conference on Mathematical at high Reynolds number, where resolution of or diffusivity. 
The prediction of wind gusts in a partition of appropriate network topology the Theory of Turbulence via Harmonic Analysis fluctuations over a wide range of scales requires storm, the life of marine organisms in the ocean, wall time on 262,144 MPI tasks is consistently and Computational Fluid Dynamics, Nara, Japan, massively parallel petascale computation even in the propulsive thrust provided by jet aircraft under 10 seconds per time step, which gives March 3-5, 2014. simplified geometries. The power of Blue Waters, engines, and the dispersion of pollutants in an essentially perfect strong scaling if compared Iyer, K. P., Studies of turbulence structure and combined with the use of remote memory urban environment—all depend on the study with the same problem size using 65,536 cores turbulent mixing using Petascale computing. addressing and reserved partitions to minimize of turbulence and are of concern to our society. without a reserved partition. (PhD thesis, Georgia Institute of Technology, communication costs, has made feasible a Our project will help provide a much better Substantial effort is directed at studying the 2014). simulation at record resolution exceeding half understanding of the underlying flow physics that intensity of the local straining and rotation Iyer, K. P., and P. K. Yeung, Structure functions a trillion grid points. also apply to turbulent flows in more realistic of fluid elements subjected to the distorting and applicability of Yaglom's relation in passive- Early science results include new insights geometries. Improved resolution and higher effects of velocity fluctuations. Results on the scalar turbulent mixing at low Schmidt numbers on fine-scale intermittency, spectral transfer, Reynolds number will also help settle long- probability distributions of these variables with uniform mean gradient. Phys. Fluids, 26 FIGURE 1 and the connection between extreme events standing questions about local isotropy and provide strong validation of conclusions from (2014), 085107. (BACKGROUND): observed in fixed and moving reference frames. sub-grid scale modeling in turbulent mixing. We recent work on the likelihood of extreme events Slender vortex Phase 1 of this project has focused on algorithm will also be able to address a pressing concern approaching 10,000 times the mean value. filaments showing enhancement and the study of velocity field in turbulence simulations, namely how to Comparisons of the statistics of acceleration the complexity structure. Phases 2 and 3 will focus on mixing parameterize Reynolds number dependence well (which is also highly intermittent) in fixed and of fine-scale passive substances and slow-diffusing chemical enough so that important conclusions about the moving reference frames show a strong degree structure in high species, and dispersion of contaminant clouds at flow physics can be extrapolated safely toward of mutual cancellation between the effects of Reynolds number high Reynolds number. Collaborative work based Reynolds numbers that remain out of reach. unsteadiness at a fixed location and of turbulent turbulence at on the new data is expected to lead to significant transport in space. Conditional statistics also 3 8,192 resolution. advancements in theory and modeling. verify quantitatively the connection between intermittency in the energy dissipation rate and

91 BLUE WATERS ANNUAL REPORT 2014

high-efficiency catalysis, and many others. The the mechanism for high-temperature COMPUTATIONAL EXPLORATION strong correlation challenge is particularly superconductivity and potentially allow for OF UNCONVENTIONAL interesting in that the fundamental equation design of these unique materials. Since the FN- to be solved, the time-independent electronic DMC method is generally applicable to the many- SUPERCONDUCTORS USING Schrödinger equation, is well known. There is body Schrödinger equation, developments in QUANTUM MONTE CARLO no known general, efficient, and exact solution this arena carry over to other important issues to this problem. in materials physics, such as the prediction of doping behavior, magneto-electric coupling, and Allocation: Illinois/0.508 Mnh catalysis. PI: Lucas K. Wagner1 METHODS AND RESULTS

1 We are using Blue Waters to simulate correlated University of Illinois at Urbana-Champaign PUBLICATIONS electrons directly using the diffusion Monte EXECUTIVE SUMMARY: Carlo algorithm. This method is based on a Zheng, H., and L. K. Wagner, The mechanism FIGURE 1: A sample The interactions between electrons create many mapping from the Schrödinger equation to the of metal-insulator transition in vanadium dioxide of electronic unique quantum states and electronic devices, dynamics of stochastic particles. These particles from a first-principles quantum Monte Carlo positions such as superconductors. To date, efforts to diffuse, drift, and branch, and in their equilibrium perspective. (submitted). arxiv:1310.1066 pulled from a simulate these quantum states have had great configuration their density represents the Wagner, L. K., and P. Abbamonte, The effect of calculation on difficulty achieving enough fidelity to describe amplitude of the lowest energy state, the most electron correlation on the electronic structure the MgO solid. these states. Using Blue Waters and modern important state for condensed matter physics. and spin-lattice coupling of the high-Tc cuprates: Each sample algorithms, we performed state-of-the-art many- There is one major approximation in diffusion quantum Monte Carlo calculations. (submitted). corresponds to body simulations of strongly interacting quantum Monte Carlo. Since the stochastic particles are all arXiv:1402.4680 the positions of systems VO2 and several superconductor parent positive, they cannot easily represent a function all electrons. materials to unprecedented detail. These with positive and negative regions. Further, the Density of simulations have given insight into how these ground state of electrons is required to have both samples materials perform their function and offer positive and negative regions by fundamental represents the hope that simulations of this type will be able physics. We thus must approximate the zeroes amplitude of the to achieve the dream of computationally guided of the wave function to fix the positive/negative wave function. correlated material design. regions. This is called fixed-node diffusion Monte Carlo (FN-DMC). In practice, the fixed node approximation has been shown to be very accurate on realistic simulations of quantum systems. This method has been implemented FIGURE 2: 2D projection of the resultant spin and scaled up by the authors to run on a large densities for the high temperature superconducting material La CuO . Copper atoms are gold, oxygen fraction of Blue Waters [1]. 2 4 We have completed two pilot studies on atoms are red, and lanthanum atoms are green. correlated electronic systems. Both of these Shown is the response of the spin density to studies are on materials that have been known different phonon modes. for several decades, but to date have resisted description by either first-principles calculations (Schrödinger equation) or assumed models. The lack of a reliable first-principles description is particularly limiting for design of materials with strongly correlated effects. At the 2014 Blue INTRODUCTION Waters Symposium, we present our results on the

One of the grand challenges in condensed matter metal-insulator transition in VO2 and the spin- physics is to describe the behavior of strongly lattice interaction in the high Tc superconducting correlated electronic systems—materials in cuprate parent materials using Blue Waters and which the interactions between electrons are the FN-DMC method to describe their electronic critical to their behavior. Electron interactions structure accurately. are responsible for a number of unique properties The successful description of these of these materials including high-temperature correlated materials opens the door to many superconductivity, giant magnetoresistance, further calculations that may help uncover

92 93 COMPUTER SCIENCE & ENGINEERING

OPTIMIZATION 96 Hybrid Dataflow Programming on Blue Waters 100 System Software for Scalable Computing 98 Redesigning Communication and Work Distribution 102 Scalability Analysis of Massively Parallel Linear SCALABILITY in Scientific Applications for Extreme-scale Solvers on the Sustained Petascale Blue Waters Heterogeneous Systems Computing System PARALLEL FUNCTIONALITY BLUE WATERS ANNUAL REPORT 2014

INTRODUCTION of ideal throughput with 10,000-way concurrency HYBRID DATAFLOW using Swift+GeMTC. PROGRAMMING ON BLUE WATERS This work explores methods for, and potential Fig. 2 demonstrates an upper bound of benefits of, applying the increasingly abundant GeMTC by launching efficiency workloads on and economical general purpose graphics multiple GPU nodes with only a single active processing units (GPGPU) to a broader class Allocation: GLCPC/0.375 Mnh GeMTC worker per GPU. We next enable 168 PI: Michael Wilde1,3 of applications. It extends the utility of GPGPU GeMTC warp workers per GPU (the maximum) Collaborators: Scott J. Krieder2; Justin M. Wozniak1; Timothy Armstrong3; Daniel from the class of heavily vectorizable applications and evaluate the efficiency of workflows with 1,3 1,3 2,3 S. Katz ; Ian T. Foster ; Ioan Raicu to irregularly structured many-task applications. varied task granularities up to 86,000 individually Such applications are increasingly common, 1 operating GPU workers on Blue Waters. After Argonne National Laboratory stemming from both problem-solving approaches 2Illinois Institute of Technology adding 167 additional workers per GPU we 3University of Chicago (i.e., parameter sweeps, simulated annealing or require longer-lasting tasks to achieve high branch-and-bound optimizations, uncertainty efficiency. We attribute this drop in performance EXECUTIVE SUMMARY: quantification) and application domains (climate to greater worker contention on the device This work presents the analysis of hybrid modeling, rational materials design, molecular queues and the fact that Swift must now drive dataflow programming over XK7 nodes of dynamics, bioinformatics). 168 times the amount of work per node. Blue Waters using a novel CUDA framework In many-task computing (MTC) [1], tasks In fig. 3 we observe that tasks exceeding one called GeMTC. GeMTC is an execution model may be of short (even sub-second) duration or second achieve high efficiency up to scales and runtime system that enables accelerators highly variable (ranging from milliseconds to of 40,000 workers. Although we have not yet to be programmed with many concurrent minutes). Their dependency and data-passing identified the cause for this drop in performance, and independent tasks of potentially short or characteristics may range from many similar we expect that the performance degradation at variable duration. With GeMTC, a broad class of tasks to complex, and possibly dynamically extreme levels of concurrency comes from the such “many-task” applications can leverage the determined, dependency patterns. Tasks typically loading of shared libraries from the remote increasing number of accelerated and hybrid high- run to completion; they follow the simple input- parallel file system. In future work we will FIGURE 2 (TOP): end computing systems. GeMTC overcomes the process-output model of procedures, rather continue to improve system-wide performance GeMTC + Swift obstacles to using GPUs in a many-task manner than retaining state as in web services or MPI by reducing the reliance on dynamic loadable efficiency up to by scheduling and launching independent tasks processes. shared libraries and using larger scale evaluation 512 nodes, 1 GeMTC on hardware designed for SIMD-style vector on all ~4,000 XK7 nodes. worker. processing. 
We demonstrate the use of a high- While we observe a drop in performance METHODS AND RESULTS FIGURE 3 (BOTTOM): level many-task computing programming model moving from a single worker to 168 workers, Efficiency for (the Swift parallel dataflow language) to run tasks Fig. 1 shows a high-level diagram of GeMTC [2] we achieve 168 times the amount of work with workloads with on many accelerators and thus provide a high- driven by tasks generated by the Swift [3] parallel only a five-fold increase in time. These numbers varied task productivity programming model for the growing functional dataflow language. GeMTC launches improve even more when the time for computing granularities number of supercomputers that are accelerator a daemon on the GPU that enables independent versus data transfer increases. In future work up to 86,000 enabled. While still in an experimental stage, tasks to be multiplexed onto warp-level GPU we will continue to improve system-wide independent GeMTC can already support tasks of fine (sub- workers. A work queue in GPU memory is performance and evaluation at even larger scale. warps of Blue second) granularity and execute concurrent populated from calls to a C-based API, and Future work includes performance evaluation Waters. 168 FIGURE 1: Flow of heterogeneous tasks on 86,000 independent GPU workers pick up and execute these tasks. of diverse application kernels (e.g., data pipelining, active workers/ a task in GeMTC GPU warps spanning 2.7 million GPU threads After a worker has completed a computation, the detecting cancer-related genes, glass modeling, GPU. driven by Swift. on Blue Waters. results are placed on an outgoing result queue and protein structure simulation); analysis of and returned to the caller. the ability of such kernels to effectively utilize We first ran a multi-node scaling experiment concurrent warps; enabling of virtual warps where the number of simulations is set equal to which can both subdivide and span physical the number of workers. At each data point there warps; support for other accelerators such are two times as many workers as at the previous as the Xeon Phi; and continued performance data point, so we run twice as much work. In refinement. an ideal system without any overhead we would expect a flat line demonstrating the ability to conduct the same amount of work at each step. Even after eight nodes we achieve 96% utilization. Future work aims to evaluate our system at even larger scales on Blue Waters. We also obtain 70%

96 97 BLUE WATERS ANNUAL REPORT 2014

FIGURE 1 (C): METHODS AND RESULTS global communicator of the MPI job. Results checkpointing on MPI applications. We do not REDESIGNING COMMUNICATION indicate that the performance of both Broadcast have results to report on this activity at this time. Performance of AND WORK DISTRIBUTION IN We have evaluated the performance and scalability and Reduction operations are quite scalable the hybrid HPL of point-to-point and collective operations of with fairly short latencies considering the scale implementation SCIENTIFIC APPLICATIONS FOR Cray-SHMEM and Cray-UPC using the Ohio of the job. Reduction operations are especially WHY BLUE WATERS for different State University Micro-benchmark suite. For CPU-to-GPU node EXTREME-SCALE HETEROGENEOUS scalable owing to the use of dedicated hardware There are very few systems nationally that point-to-point experiments, we evaluate both ratios and with support in the Gemini interconnect. Other dense provide a test bed for scaling communications SYSTEMS intra- and inter-node cases for put, get, and differing MPI collective operations, such as MPI_Allgather to tens or hundreds of thousands of cores, yet atomic operations. UPC collectives are evaluated processes per GPU and MPI_Alltoall, are quite time consuming at communication runtimes and the applications for Broadcast, Scatter, Gather, Allgather, Alltoall, node. a process scale of 128,000 and indicate an area built upon them are expected to run effectively Allocation: GLCPC/0.319 Mnh and Barrier operations. Similarly, OpenSHMEM PI: Karen Tomko1 that needs attention at scale. at these scales and beyond. Blue Waters (D): Latency of 2 collective operations such as Broadcast, Co-PI: Dhabeleswar K. Panda MPI implementations typically provide provides this test bed. The system’s unique mix the MPI_Bcast Khaled Hamidouche2; Hari Subramoni2; Jithin Jose2; Raghunathan Barrier, Collect, and Reduce are evaluated. Collaborators: system-level fault-tolerance support by means of XE6 and XK7 nodes enable investigation of collective Raja Chandrasekar2; Rong Shi2; Akshay Venkatesh2; Jie Zhang2 Our evaluations indicate good point-to-point of transparent checkpoint restart. We are application-level designs for mixed CPU and operation by performance results for both OpenSHMEM and 1 developing an I/O kernel that mimics the GPU node systems. Additionally, the system’s message size for Ohio Supercomputer Center UPC. Further, many of the collective operations 2The Ohio State University I/O pattern of coordinated checkpointing high-bandwidth Lustre file system supports 16,000 to 128,000 (UPC-Scatter, OpenSHMEM-Broadcast) show protocols and are using this benchmark to evaluation of large-scale checkpoint/restart costs. processes. EXECUTIVE SUMMARY: good scalability characteristics. However, for study the performance impact of system-level some of the collectives the performance is FIGURE 1 (RIGHT): In this project we explore communication lower than the corresponding MPI collective Communication performance for modern programming operations. performance models at large scale. Specifically, we evaluate We have tuned a mixed-node version of high- and scalability the performance of point-to-point and performance LINPACK (HPL) to utilize both of modern collective communications for CraySHMEM XE6 and XK7 Blue Waters nodes in a single programming and UPC PGAS models. We tune a hybrid run. Our HPL tests with different versions of models on Blue high-performance LINPACK implementation the benchmark: standard HPL from Netlib, Waters. 
Clockwise to leverage both CPU and GPU resources for NVIDIA's HPL running on pure CPU nodes, from upper left: systems with mixed node types. We evaluate NVIDIA's HPL running on pure GPU nodes, and collective algorithm performance at very large (A): Latency of our hybrid HPL running across both CPU and scale, starting with Broadcast and extending to the OpenSHMEM GPU nodes. For the underlying math libraries, more complicated operations such as Reduce, fcollect we measure the performance among ACML, All-Reduce, and All-to-All. Additionally, we collective OpenBLAS, and LibSci. evaluate the cost of I/O for checkpoint operations communication OpenBLAS and ACML achieve better to understand the impact of system-level operation by performance than LibSci for standard HPL checkpointing on applications. message size for with the peak performance of a single CPU node 2,000 to 16,000 around 202 GFlop/s. When measuring NVIDIA’s processes. INTRODUCTION HPL on pure CPU nodes (modified version), (B): Latency of we measure the multi-thread computation the UPC all_ The field of computer architecture, capacities among different math libraries, and broadcast interconnection networks, and system design OpenBLAS performs better than ACML in this collective is undergoing rapid change that enables very case, with peak performance of a single CPU communication large supercomputers such as Blue Waters to node around 190 GFlop/s. We also measure the operation by be built. System advances have come in the peak performance efficiency achieved by our message size for form of increased parallelism from many-core hybrid HPL compared to the sum of pure CPU 2,000 to 32,000 accelerators and improved communication and GPU nodes. We get above 70% efficiency UPC threads. interfaces. To leverage these advances, with 16 GPU nodes and 64 CPU nodes. applications must be revamped to use new The performance of collectives on Blue capabilities of the interconnection networks Waters has been evaluated with OSU’s Micro- and more sophisticated programming models. benchmark suite. The experiments focus on Without these corresponding software advances the aspects of scalability of the collectives with the vision of science breakthroughs cannot be increasing message size as well as with increasing achieved. process count in the MPI job. All the collectives have been run on the MPI_COMM_WORLD


SYSTEM SOFTWARE FOR SCALABLE COMPUTING

Allocation: BW Prof./0.245 Mnh; NSF/0.613 Mnh
PI: William Gropp (1)
Collaborators: Pavan Balaji (2)
(1) University of Illinois at Urbana-Champaign
(2) Argonne National Laboratory

EXECUTIVE SUMMARY:
The goal of the System Software for Scalable Computing project was to study the performance of low-level communication systems such as MPI in various environments on the Blue Waters system and to propose optimization techniques to address performance shortcomings. Over the past year, we focused on two such areas: (1) MPI communication in multi-threaded environments, and (2) MPI communication in irregular communication environments such as graph algorithms.

INTRODUCTION
Because of power constraints and limitations in instruction-level parallelism, computer architects are unable to build faster processors by increasing the clock frequency or by architectural enhancements. Instead, they are building more and more processing cores on a single chip and leaving it up to the application programmer to exploit the parallelism provided by the increasing number of cores. MPI is the most widely used programming model on HPC systems, and many production scientific applications use an MPI-only model. Such a model, however, does not make the most efficient use of the shared resources within the node of an HPC system. For example, having several MPI processes on a multicore node forces node resources (such as memory and network FIFOs) to be partitioned among the processes. To overcome this limitation, application programmers are increasingly looking at hybrid programming models comprising a mixture of processes and threads, which allow resources on a node to be shared among the different threads of a process.

With hybrid programming models, several threads may concurrently call MPI functions, requiring the MPI implementation to be thread safe. In order to achieve thread safety, the implementation must serialize access to some parts of the code by using either locks or advanced lock-free methods. Using such techniques while at the same time achieving high concurrent multithreaded performance is a challenging task [2-4].

METHODS AND RESULTS
Our first focus area is MPI communication in multi-threaded environments. The Blue Waters system, while rich in the number of cores per node, is unfortunately not as well optimized for communication when multiple threads issue MPI operations simultaneously. In this work, we analyzed sources of thread contention in representative MPI+thread applications using several benchmarks, ranging from micro-benchmarks and stencil computation to graph traversal applications [1]. In our study, we found that one of the primary sources of lock contention is lock monopolization stemming from unfair mutex-based critical sections (fig. 1). When a thread spends only a short period between two lock acquisition attempts and no arbitration is performed, the thread holding the lock may reacquire it before other threads notice the lock was relinquished. If this happens repeatedly, it leads to lock monopolization.

Based on this analysis, we designed a way to mitigate this issue with FIFO (first-in, first-out) arbitration (ticket-based locking) as well as a prioritized locking scheme that favors threads doing useful work. Experimental results show that our new locking scheme significantly increases the throughput of MPI+thread applications.

Our second focus area involving Blue Waters is a large-scale graph application. While a number of users have demonstrated scaling regular applications on Cray XE platforms, it is more challenging to scale distributed graph algorithms to a large scale because of the load balance problem caused by their irregular communication. To solve the load balance problem, we are designing a parallel asynchronous breadth-first search (BFS) algorithm for distributed memory systems [5]. This work is in its early stages, but we showcased some initial performance numbers at the symposium.

Different from level-synchronous BFS, asynchronous BFS does not wait until all vertices in the same level have been visited before starting to traverse a new level. Instead, a vertex starts visiting its neighbors as soon as it receives a message from its parent, which removes the waiting time. However, because vertices do not synchronize at each level, some vertices may receive a delayed message with a smaller distance to the root, so the algorithm has to send messages to all of its children again to correct the distance, which introduces redundant communication. In order to minimize the redundant communication, we plan to use priority queues to give a partial order to the messages handled by each processor. We will also evaluate the tradeoff between computation and communication in graph algorithms at different problem scales.

WHY BLUE WATERS
Some problems are not observed when we run them at a small scale, but Blue Waters gives us a unique opportunity to evaluate and design our algorithm at a large scale with state-of-the-art network and computer systems.

FIGURE 1: Lock contention in an MPI+threads application.
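A minimal sketch of the FIFO (ticket-based) arbitration idea described above is shown below, written with C11 atomics. It illustrates the locking discipline only; it is not the MPI implementation's internal code, and a production lock would back off or yield instead of spinning.

/*
 * Sketch of a ticket lock: each thread draws a unique ticket and enters
 * the critical section in ticket order, which prevents one thread from
 * monopolizing the lock.
 */
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket dispenser                    */
    atomic_uint now_serving;   /* ticket currently allowed to enter   */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

static void ticket_lock_acquire(ticket_lock_t *l)
{
    /* Take a unique, monotonically increasing ticket ...             */
    unsigned int my_ticket = atomic_fetch_add(&l->next_ticket, 1u);

    /* ... and wait until that number is served (FIFO ordering).      */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* spin; a real implementation would back off or yield    */
}

static void ticket_lock_release(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1u);
}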


SCALABILITY ANALYSIS OF MASSIVELY PARALLEL LINEAR SOLVERS ON THE SUSTAINED PETASCALE BLUE WATERS COMPUTING SYSTEM

Allocation: Private sector/0.002 Mnh
PI: Seid Koric (1)
Collaborators: Anshul Gupta (1)
(1) University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Solving linear systems of equations lies at the heart of many problems in computational science and engineering and is responsible for 70-80% of the total computational time consumed by the most sophisticated multi-physics applications. Recent solver comparisons [1,2] have shown that the Watson sparse matrix package (WSMP) solver from IBM's "Watson" initiative [3] is unique: it is the only solver that has shown sufficient scalability and robustness to tackle problem sizes of many millions of equations on many thousands of processor cores. This project involves porting WSMP to Blue Waters and performing full-scale benchmarking tests using assembled global stiffness matrices and load vectors ranging from 1 million to 40 million unknowns extracted from commercial and academic implicit finite element analysis applications.

We have ported WSMP to the Blue Waters Cray Linux Environment (CLE) and adapted it to use PGI's compiler and AMD's math library ACML. We could not build the library with Cray's own compiler, since Cray's libsci math library had issues with p-threads in WSMP. The issue with thread safety of p-threads under libsci was reported in the JIRA ticket system and forwarded to Cray for review.

So far we have managed to solve and benchmark two large test systems ("M20" with 20 million degrees of freedom (DOFs) and "M40" with 40 million DOFs) using WSMP on Cray XE6 nodes. The M40 system, with over 40 million DOFs and 3.3 billion non-zeros, is the largest ever to be benchmarked with direct solvers, as best as we can discern from our review of the existing literature.

WHY BLUE WATERS
We have scaled this problem size to over 32,000 cores on Blue Waters while achieving 60 Tflop/s, which are unprecedented numbers for sparse linear solvers (figs. 1-2). We are most eager to test WSMP at an even wider scale on Blue Waters, as well as to build and test WSMP with Intel's compiler and MKL when they become available to users.

PUBLICATIONS
Vazquez, M., et al., Alya: Towards Exascale for Engineering Simulation Codes. SC 2014, New Orleans, La., November 16-21, 2014 (in review).

FIGURE 1: WSMP factorization wall clock time for the M40 system.
FIGURE 2: WSMP factorization performance for the M40 system.

BIOLOGY & CHEMISTRY

MOLECULAR | CELLULAR | MEDICINE | BIOPHYSICS

106 Petascale Multiscale Simulations of Biomolecular Systems
108 Mechanisms of Antibiotic Action on the Ribosome
109 Polyamine Mediates Sequence- and Methylation-Dependent Compaction of Chromatin
110 Simulation of the Molecular Machinery for Second-Generation Biofuel Production
112 Petascale Simulations of Complex Biological Behavior in Fluctuating Environments
114 Epistatic Interactions for Brain Expression GWAS in Alzheimer's Disease
116 Characterizing Structural Transitions of Membrane Transport Proteins at Atomic Details
118 The Bacterial Brain: All-Atom Description of Receptor Clustering and Cooperativity within a Bacterial Chemoreceptor Array
120 The Dynamics of Protein Disorder and its Evolution: Understanding Single Molecule FRET Experiments of Disordered Proteins
122 The Computational Microscope
124 Hierarchical Molecular Dynamics Sampling for Assessing Pathways and Free Energies of RNA Catalysis, Ligand Binding, and Conformational Change
126 Simulations of Biological Processes on the Whole-Cell Level
128 Predictive Computing of Advanced Materials and Ices
130 Non-Born–Oppenheimer Effects between Electrons and Protons
132 The Mechanism of the Sarco/Endoplasmic Reticulum ATP-Driven Calcium Pump
134 Advanced Computational Methods and Non-Newtonian Fluid Models for Cardiovascular Blood Flow Mechanics in Patient-Specific Geometries
136 Quantum-Classical Path Integral Simulation of Proton and Electron Transfer
138 Investigating Ligand Modulation of GPCR Conformational Landscapes
140 Sequence Similarity Networks for the Protein "Universe"
142 Benchmarking the Human Variation Calling Pipeline

PETASCALE MULTISCALE SIMULATIONS OF BIOMOLECULAR SYSTEMS

Allocation: NSF/5.07 Mnh
PI: Gregory A. Voth (1)
Collaborators: John Grime (1)
(1) University of Chicago

EXECUTIVE SUMMARY:
Computer simulations offer a powerful tool for high-resolution studies of molecular systems. Increases in the potential scope of computer simulations are driven not only by theoretical developments but also by the impressive (and growing) power of modern supercomputers. Very large-scale molecular systems can nonetheless present a serious challenge for simulations at atomic resolutions. In such cases, "coarse-grained" (CG) molecular models can significantly extend the accessible length and time scales via the careful generation of simpler representations. Although CG models are computationally efficient in principle, this advantage may be difficult to realize in practice. Conventional molecular dynamics (MD) software makes certain assumptions about the target system for computational efficiency, and these assumptions may be invalid for dynamic CG models. To address these issues, we developed the UCG-MD software. Our presentation at the 2014 symposium outlined key algorithms of the UCG-MD code and demonstrated their utility for representative CG models of biological systems.

INTRODUCTION
The application of "coarse-grained" (CG) molecular models can significantly extend the scope of computer simulations, particularly where the effects of any explicit solvent molecules are instead represented implicitly in the CG solute interaction potentials. Recent advances in the theory of CG model generation offer the concept of "ultra-coarse-grained" (UCG) molecular models [1], further increasing the accessible time and length scales for computer simulations. Although CG and UCG models are computationally efficient, parallel simulations of very large-scale CG and UCG systems typically do not realize the full potential of these models. The use of molecular models with atomic resolution has driven the design of traditional molecular dynamics (MD) software. Where molecular topologies and interaction potentials are fixed at runtime, the properties of atomic-resolution systems heavily influence the nature of the numerical algorithms used.

METHODS AND RESULTS
Three key challenges hinder the efficient use of advanced CG/UCG models on modern supercomputers: (1) the load balancing of extremely heterogeneous systems, (2) the memory requirements of very large simulations, and (3) the inability to easily represent systems with highly dynamic contents. Some (or all) of these problems can restrict the straightforward application of CG/UCG models to truly dynamic cell-scale biological processes using conventional MD software.

We developed an unorthodox MD code designed to alleviate these issues, with the hope of enabling entirely new classes of molecular simulation [2]. Major aspects of this software include the use of Hilbert space-filling curves for dynamic load balancing, the use of on-demand sparse data structures to reduce memory requirements, and the implementation of dynamic molecular descriptions to enable highly variable molecular topologies and interactions at runtime. These aspects of the software were described and motivated by "real-world" examples in our 2014 symposium presentation to illustrate where and how such functionality would prove useful in the context of large-scale biological systems.

The UCG-MD code was tested on example systems of relevance to CG/UCG biological models, and the resultant fundamental performance measurements were described in our presentation. In particular, superior performance was highlighted with reference to a conventional MD archetype for both load balancing and memory use, with the dynamic topological capabilities of the UCG-MD code introducing minimal runtime overhead. The UCG-MD code thus presents a versatile platform for efficient CG/UCG simulations of very large-scale molecular systems, even in cases where more traditional MD approaches can face significant difficulties.

WHY BLUE WATERS
Even with the efficient application of CG/UCG models, the ability to access molecular phenomena featuring very large numbers of molecules interacting in spatial volumes on the order of microns or larger still requires significant parallel computational power. The Blue Waters supercomputing resource thus proved to be critical for the implementation and deployment of the UCG-MD software, offering not only an extremely large number of parallel compute nodes but also an environment for close collaboration between external researchers and expert NCSA technical support staff. This latter aspect was particularly crucial in the design and implementation of advanced software functionality using the Blue Waters network hardware. The Blue Waters "point-of-contact" model is therefore considered to be an impressive model for any future supercomputing facilities.

PUBLICATIONS
Dama, J. F., A. V. Sinitskiy, M. McCullagh, J. Weare, B. Roux, A. R. Dinner, and G. A. Voth, The Theory of Ultra-Coarse-Graining. 1. General Principles. J. Chem. Theory Comput., 9:5 (2013), pp. 2466-2480.
Grime, J. M. A., and G. A. Voth, Highly Scalable and Memory Efficient Ultra-Coarse-Grained Molecular Dynamics Simulations. J. Chem. Theory Comput., 10:1 (2014), pp. 423-431.

FIGURE 1 (BACKGROUND): Example coarse-grained model of an immature HIV-1 viral particle enclosed by a Hilbert space-filling curve (as used in load balancing by the UCG-MD software).
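Because Hilbert space-filling curves are central to the load-balancing strategy described above, the sketch below shows the basic idea in C, reduced to 2D for brevity (a 3D curve would be used in practice). Cells sorted by Hilbert index can be cut into contiguous, roughly equal-weight chunks, one per rank. The function and grid size here are illustrative and are not taken from the UCG-MD source.

/*
 * Sketch: map 2D cell coordinates to a Hilbert-curve index so that
 * contiguous index ranges form compact, communication-friendly domains.
 */
#include <stdio.h>

/* Map cell (x, y) on an n-by-n grid (n a power of two) to its distance
   along the Hilbert curve; standard bit-twiddling formulation. */
static unsigned long hilbert_index(unsigned long n, unsigned long x, unsigned long y)
{
    unsigned long rx, ry, d = 0;
    for (unsigned long s = n / 2; s > 0; s /= 2) {
        rx = (x & s) > 0;
        ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        /* Rotate the quadrant so the curve stays continuous. */
        if (ry == 0) {
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            unsigned long t = x; x = y; y = t;
        }
    }
    return d;
}

int main(void)
{
    /* Cells that are close along the curve are close in space, so cutting
       the sorted index range into equal-weight pieces balances load while
       keeping each rank's cells spatially compact. */
    const unsigned long n = 8;
    for (unsigned long y = 0; y < n; y++) {
        for (unsigned long x = 0; x < n; x++)
            printf("%3lu ", hilbert_index(n, x, y));
        printf("\n");
    }
    return 0;
}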


MECHANISMS OF ANTIBIOTIC ACTION ON THE RIBOSOME

Allocation: GLCPC/0.33 Mnh
PI: Alexander Mankin (1)
Collaborators: Nora Vasquez-Laslop (1)
(1) University of Illinois at Chicago
(2) University of Illinois at Urbana-Champaign
(3) Beckman Institute for Advanced Science and Technology

EXECUTIVE SUMMARY:
The ribosome, one of the ubiquitous molecular machines in living cells, is responsible for the critical task of translating the genetic code into functional proteins. The antibiotic drug erythromycin (ERY) acts as a protein synthesis inhibitor. The molecular mechanisms underlying the effects of such drugs are unknown, and bacterial resistance to antibiotics is a growing problem. To promote novel designs of the next generation of antibiotics that are more effective, we investigated the molecular mechanisms underlying the antibiotic action of ERY on bacterial ribosomes. Our results showed that ERY exerts its antibiotic effect by altering the structure of the bacterial ribosome.

INTRODUCTION
The ribosome, one of the ubiquitous molecular machines in living cells, is responsible for the critical task of translating the genetic code into functional proteins. The bacterial ribosome is the target of over 50% of antibiotic drugs [1,2], including the widely prescribed erythromycin (ERY; a macrolide drug), which is on the WHO essential medicines list [3,4]. The antibiotic action of such drugs has been known for over 50 years; however, the molecular mechanisms underlying their effects are unknown [5]. Bacterial resistance against antibiotics is developing into a major global concern because no new antibiotic drugs have been developed for nearly 30 years, while new strains of bacteria have evolved to be resistant to existing drugs [6]. To promote novel designs of the next generation of antibiotics that are more effective and less prone to inducing resistance, we investigated the molecular mechanisms underlying the antibiotic action of ERY on bacterial ribosomes.

METHODS AND RESULTS
ERY acts as a protein synthesis inhibitor [7] and binds to the ribosomal exit tunnel of bacterial ribosomes (fig. 1a) [8,9]. Contrary to prior beliefs [10-14], we found that macrolide drugs may act on the ribosome directly, without the presence of the nascent protein. We modeled an ERY-bound empty ribosome (without nascent protein) and a drug-free empty ribosome based on the complete crystal structures of ribosomal complexes [8,15].

We found that ERY reproducibly induced conformational changes of the universally conserved ribosomal nucleotides U2585 and A2602 (fig. 1b), in agreement with experiments. Flipping of U2585 and A2602 from a looped-out orientation, which is required for aligning tRNA substrates to prepare the peptide-bond transfer [16,17], to a folded-in orientation was observed in the ERY-bound ribosome simulations. We note that A2602 in the looped-out orientation is also required to prevent premature nascent protein release [18]. By contrast, the two nucleotides predominantly assumed the looped-out orientation in drug-free ribosome simulations [19]. This finding unveils a new view of the antibiotic action of macrolides on bacterial ribosomes.

WHY BLUE WATERS
The Blue Waters supercomputer provided the computational efficiency needed to perform sub-microsecond time scale all-atom simulations of our complete ribosome systems.

PUBLICATIONS
Sothiselvam, S., et al., Macrolide antibiotics allosterically predispose the ribosome for translation arrest. Proc. Natl. Acad. Sci. USA, (2014), doi: 10.1073/pnas.1403586111.

FIGURE 1(A): The ribosome contains a large and a small subunit. The nascent protein elongates from the core of the ribosome and egresses through the ribosomal exit tunnel to the outside of the ribosome. ERY binds to the exit tunnel.
FIGURE 1(B): Molecular dynamics simulations show drug-induced nucleotide flipping in the ribosome.


POLYAMINE MEDIATES SEQUENCE- AND METHYLATION-DEPENDENT COMPACTION OF CHROMATIN

Allocation: Illinois/0.922 Mnh; BW Prof./0.24 Mnh
PI: Aleksei Aksimentiev (1)
Collaborators: Jejoong Yoo (1); Haijin Kim (1); Taekjip Ha (1)
(1) University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Most biological problems can be reduced to questions about gene regulation mechanisms. Experimental advances in the last decade have shown that spatial location and eukaryotic chromatin conformations are highly correlated with gene activities. Although it is clearly shown that DNA sequence and methylation patterns determine the chromatin conformations, the underlying driving force that dominates this phenomenon is unclear. Our research aims to investigate how polybasic histone tails or biogenic polyamine molecules control the sequence- and methylation-dependent inter-DNA and internucleosomal interactions.

BACKGROUND
Most biological problems can be reduced to questions about gene regulation mechanisms. For example, development and differentiation of cells are controlled by turning on and off specific sets of genes. Several factors can affect gene regulation: DNA wrapping around nucleosomes, histone tails and their chemical modifications, DNA modifications such as methylation of CpG dinucleotides, and transcription factors. Programmed gene regulation by chemical modifications of DNA and histone tails—epigenetics—is fundamentally important because it is the central mechanism of human development. All the cells of a human body share exactly the same genome sequence, but cells can play different roles depending on tissue type. Epigenetic markers, not DNA sequence, determine the tissue type. Many diseases, such as cancer, are caused by defective genes or failure in gene regulation. Cancer cells are a specialized cell type showing specific epigenetic marker patterns [1,2] and, presumably due to the epigenetic markers, show peculiar chromatin architecture [3]. Understanding how epigenetic markers control the chromatin architecture can lead us to understand how differentiation and cancer occur.

The conventional view of gene regulation says that controlling the binding of transcription factors to a specific gene, for example by modifying histone tails and DNA, achieves finely tuned gene regulation. However, experimental advances in the last decade have radically changed our view of eukaryotic chromatin structure: unlike in prokaryotes, spatial location and chromatin conformations are highly correlated with gene activities. In this new framework, gene locations are not random but highly controlled, as programmed.

For example, recent experiments [4-6] revealed chromosomal territories on an even larger scale than before. Fragments containing 0.1-1 million DNA bases co-localize according to their AT content into topologically associated domains (TADs). The inner surface of a nucleus attracts AT-rich TADs [7,8]. Moreover, highly methylated TADs are known to form compact clusters [9], which presumably enables reversible chromatin reorganization.

Although the correlation between chromatin conformations and gene activities is well founded and it is clearly shown that DNA sequence and methylation patterns determine the chromatin conformations [9], the underlying driving force that dominates this phenomenon is unclear.

GOALS
Our research aims to investigate how polybasic histone tails or biogenic polyamine molecules control the sequence- and methylation-dependent inter-DNA and internucleosomal interactions. This task requires that we consider various methylation/sequence patterns to extract their effects on DNA compaction. We will run multiple free-energy simulations in parallel using advanced sampling techniques of all-atom molecular dynamics simulations, which can be done efficiently only by using a powerful computer like Blue Waters.
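Nucleotide "flipping" of the kind reported in the ribosome simulations is typically quantified by tracking a (pseudo-)dihedral angle over the trajectory. The C sketch below shows such an angle calculation for four marker positions; the atom choice, example coordinates, and the looped-out/folded-in cutoff are illustrative assumptions, not the team's actual analysis protocol.

/* Hypothetical analysis step: classify a base orientation by a dihedral. */
#include <math.h>
#include <stdio.h>

typedef struct { double x, y, z; } vec3;

static vec3 sub(vec3 a, vec3 b) { return (vec3){a.x-b.x, a.y-b.y, a.z-b.z}; }
static vec3 cross(vec3 a, vec3 b) {
    return (vec3){a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static double dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static double norm(vec3 a) { return sqrt(dot(a, a)); }

/* Signed dihedral (degrees) defined by points p0-p1-p2-p3. */
static double dihedral(vec3 p0, vec3 p1, vec3 p2, vec3 p3)
{
    vec3 b1 = sub(p1, p0), b2 = sub(p2, p1), b3 = sub(p3, p2);
    vec3 n1 = cross(b1, b2), n2 = cross(b2, b3);
    double len = norm(b2);
    vec3 m1 = cross(n1, (vec3){b2.x/len, b2.y/len, b2.z/len});
    return atan2(dot(m1, n2), dot(n1, n2)) * 180.0 / acos(-1.0);
}

int main(void)
{
    /* Example frame: four marker atoms picked around the flipping base. */
    vec3 p0 = {0, 0, 0}, p1 = {1.5, 0, 0}, p2 = {2.0, 1.4, 0}, p3 = {3.2, 1.6, 1.1};
    double ang = dihedral(p0, p1, p2, p3);
    /* The 90-degree cutoff below is an assumed, illustrative threshold. */
    printf("dihedral = %6.1f deg -> %s\n", ang,
           fabs(ang) > 90.0 ? "looped-out" : "folded-in");
    return 0;
}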


SIMULATION OF THE MOLECULAR MACHINERY FOR SECOND-GENERATION BIOFUEL PRODUCTION

Allocation: Illinois/0.25 Mnh
PI: Isaac Cann (1)
Collaborators: Rafael C. Bernardi (1,2); Michael A. Nash (3); Hermann E. Gaub (3); Klaus Schulten (1,2)
(1) University of Illinois at Urbana-Champaign
(2) Beckman Institute for Advanced Science and Technology
(3) Ludwig-Maximilians-Universität München, Germany

EXECUTIVE SUMMARY:
Biofuels are a well-known alternative to fossil fuels. However, competition with food production raises ethical concerns. The production of so-called second-generation biofuels, made from agricultural waste, is more favorable but is not yet cost competitive. Our project aims to find a more cost-competitive strategy using bacteria. Some bacteria, especially from the genus Clostridium, employ several synergistic enzymes docked extra-cellularly on a highly modular and remarkably flexible molecular framework, the cellulosome. Our project employed molecular dynamics simulations that complement single-molecule experiments from our collaborators to characterize the protein modules docked together to form cellulosomes. The experiments showed that the docking complexes are extremely strong. Simulations revealed that pulling the complexes apart actually strengthens them before they rupture. The resulting strength is the largest ever seen in macromolecular complexes. Presently, we are running 13-million-atom simulations of cellulosomes on Blue Waters to further explore the properties and technical potential of cellulosomes.

INTRODUCTION
Deconstruction of plant cell walls to fermentable sugar using enzymatic hydrolysis is being pursued for the production of so-called second-generation biofuels. Driven by significant research efforts worldwide, a large number of enzymes and enzymatic complexes that may be used for biofuel production have been identified and biochemically characterized. Among the most intricate enzymatic complexes are the cellulosomes, found especially in anaerobic environments.

While keeping a commonplace biochemical affinity, the cellulosomes' building blocks can maintain their mechanical integrity under strong shear forces. The assembly and disassembly of these protein networks is mediated by highly specific cohesin/dockerin interactions, the main building blocks of the cellulosomes. It is believed that the cellulosomes' high activity is related to their extremely flexible scaffoldin, constituted of cohesin domains connected by very flexible linkers. In this work we aim to identify cellulosomal network components with maximal mechanical stability and to characterize the extreme flexibility of the cellulosomal complex.

METHODS AND RESULTS
With the intention of further studying this synergism, we employed Blue Waters to model the entire cellulosome complex (fig. 1). Presently, we carry out 13-million-atom simulations of cellulosomes on Blue Waters to further explore the properties and technical potential of cellulosomes.

To perform the calculations we utilized the molecular dynamics program NAMD, which employs the prioritized message-driven execution capabilities of the Charm++ parallel runtime system, allowing excellent parallel scaling. The CHARMM36 force field along with the TIP3 water model were used to describe all systems. To characterize the coupling between dockerin and cohesin, we performed steered molecular dynamics simulations of constant velocity stretching (SMD-CV protocol) employing three different pulling speeds: 1.25 Å/ns, 0.625 Å/ns, and 0.25 Å/ns. The stochastic generalized simulated annealing (GSA) method, implemented in the GSAFold plugin for NAMD, was employed to generate millions of different conformations for the cellulosome complex.

Cellulosome assemblages consist of a scaffoldin backbone onto which dockerin-containing catalytic modules and carbohydrate binding modules are appended. Analogous to a "Swiss Army knife," these cellulosomes contain a plethora of different catalytic and substrate binding activities that facilitate the degradation of plant cell wall material. The dockerin/cohesin interactions are the main building blocks of the cellulosomes, and their interaction is known to be stronger than common protein-protein interactions. The exceptionally high rupture forces we measured (600-800 pN) are hugely disproportionate to the dockerin/cohesin biochemical affinity, which at KD ≈ 20 nM is comparable to typical antibody-antigen interactions. Antibody-antigen interactions, however, will rupture at only ~60 pN at similar loading rates. To the best of our knowledge, the dockerin/cohesin complex exhibits the highest protein ligand-receptor rupture force ever reported, at more than half the rupture force of a covalent bond.

The simulation results reproduced the experimental force profile and identified key hydrogen-bonding contacts previously identified as important for the dockerin/cohesin interaction. Analysis of the binding interface and the associated contact surface area of the molecules in the mechanically loaded and unloaded states suggests a catch-bond mechanism may be responsible for the remarkable stability. Dockerin modules in the simulated binding interface seem to clamp down on the cohesin upon mechanical loading, resulting in increased stability and decreased accessibility of water into the hydrophobic core of the bound complex.

The scaffoldin with its cohesins and the enzymatic domains with dockerins are the main building blocks that characterize the macrostructure of the cellulosomes. It was reported that Clostridium thermocellum, the most studied cellulosomal organism, exhibits one of the highest rates of cellulose utilization known in nature, and the cellulosomal system of this bacterium is reported to display a specific activity against crystalline cellulose that is fifty-fold higher than the corresponding non-cellulosomal fungal system in Trichoderma reesei.

WHY BLUE WATERS
The size of the cellulosome had been out of reach of molecular dynamics simulations before the advent of Blue Waters, and even studying fragments of the cellulosome would be a challenge for any supercomputer except Blue Waters.

FIGURE 1: Cellulosome model built using partially available crystallographic structures combined with similarity-based molecular modeling and generalized simulated annealing.
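In the constant-velocity SMD protocol mentioned above, the pulled group is attached by a harmonic spring to a dummy point that moves at fixed speed, and the rupture force is read off as the peak of the applied-force trace. The sketch below illustrates only that bookkeeping with a synthetic extension trace; the spring constant, speed, and trace are assumed values, not the team's simulation parameters.

/* Sketch of constant-velocity SMD force bookkeeping (synthetic data). */
#include <stdio.h>

int main(void)
{
    const double k  = 0.28;     /* spring constant, kcal/(mol*A^2), assumed */
    const double v  = 0.25e-6;  /* pulling speed in A/fs (0.25 A/ns)        */
    const double dt = 2.0;      /* time step, fs                            */
    const long   nsteps = 2000000;

    double peak_force = 0.0;
    long   peak_step  = 0;

    for (long step = 0; step < nsteps; step++) {
        double target = v * dt * step;            /* dummy-atom position    */
        /* In a real analysis this would be the projected displacement of
           the pulled atom read from the trajectory; here it simply lags
           behind the target to mimic a resisting complex. */
        double extension = 0.8 * target;
        double force = k * (target - extension);  /* instantaneous SMD force */

        if (force > peak_force) {
            peak_force = force;
            peak_step  = step;
        }
    }
    /* 1 kcal/(mol*A) is roughly 69.5 pN. */
    printf("peak force %.1f pN at t = %.2f ns\n",
           peak_force * 69.5, peak_step * dt * 1e-6);
    return 0;
}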


PETASCALE SIMULATIONS OF COMPLEX BIOLOGICAL BEHAVIOR IN FLUCTUATING ENVIRONMENTS

Allocation: NSF/0.003 Mnh
PI: Ilias Tagkopoulos (1)
(1) University of California, Davis

EXECUTIVE SUMMARY:
One of the central challenges in computational biology is the development of predictive multi-scale models that can capture the diverse layers of cellular organization. Even scarcer are models that encode biophysical phenomena together with evolutionary forces in order to provide insight into the effect of adaptation at a systems level.

The goal of this project is to create a scalable model and simulation framework to (a) investigate the dynamics of microbial evolution in complex environments, and (b) assess its effect on microbial organization across the various biological layers. The simulation framework should be focused on the general principles governing evolution and microbial organization so it can be generalized.

Over the last five years, our lab has created a multi-scale abstract microbial evolution model that unifies various layers, from diverse molecular species and networks to organism- and population-level properties. With the help of Blue Waters and the NCSA team, we are able to scale up to hundreds of thousands of cells, an unprecedented scale of simulation. (It is, however, only a fraction of the billions of cells that are present in a bacterial colony.) Here, we present our scalability results, the methods that we employed to achieve them, and our current work on a data-driven, genome-scale, population-level model for Escherichia coli.

INTRODUCTION
Microbes are the most abundant and diverse forms of life on Earth. Their impact on the human race and our ecosystem as a whole is difficult to exaggerate. They have been used extensively in industrial applications, ranging from bioremediation to production of organic compounds, and they are relevant to human health as both probiotics and pathogens.

Over the past decades, we have studied microbial organisms extensively and gained valuable insights into their system-level properties, as well as the mechanistic underpinnings of their complex behavior. Less is known about their potential to acquire new traits and become resilient to adverse environmental conditions through evolutionary forces such as random mutations, horizontal gene transfer, and genetic drift. Elucidating the effect of such environments on their gene regulatory and biochemical networks is particularly interesting. In turn, it can lead to a better understanding of what is possible, likely, and potentially transformative in the environment they occupy. From antibiotic resistance to stress-resistant biotechnological strains for recombinant protein production, such knowledge will have a tremendous impact on various industrial, agricultural, and medical fields. While there have been many studies of adaptive laboratory evolution in the past couple of years, these are limited to a few thousand generations that can hardly capture the vast phenotypic space that microbes can explore. Hence, the development of computational modeling and simulation tools that can capture these phenomena across multiple scales can lead to transformative advances in this field.

METHODS AND RESULTS
There are a number of challenges we need to address to achieve our goals. First, a model of biological organization that is both biologically realistic and computationally feasible is paramount, incorporating the right level of biological abstraction. Second, the spatial and temporal scales of a model that encompasses genes, proteins, networks, cells, and populations are very diverse, which creates additional hurdles when applying numerical methods to solve them. Third, since evolution is based on random mutations and natural selection, it is inherently hard to predict and can lead to imbalances in the distribution of active cells and, by extension, computational tasks. Fourth, a typical microbial colony has billions of cells, while current simulations are at most in the thousands; this leads to size-specific artifacts (size does matter). Finally, storing and visualizing the fossil record of an evolutionary trajectory is not an easy task, especially since dozens of trajectories are needed to assess statistical significance for any hypothesis-testing experiment, and a single simulation can easily lead to terabytes of complex data that require analysis.

We have created the Evolution in Variable Environments (EVE) v3.0 synthetic ecology framework, which is currently the most sophisticated abstract simulator for microbial evolution, with the capacity to scale up to 8,000 MPI processes and 128,000 organisms. For comparison, our previous work (before the PRAC award) scaled up to 200 organisms with a less complex underlying model [1].

To cope with unforeseen computational load due to the emergence of complex phenotypes, we have developed both static and adaptive load balancers that can account for both fixed and non-fixed population sizes [2,3]. We developed intuitive visualization tools [4], HDF5 storage solutions, and novel analysis algorithms based on network flows [5] to efficiently project data and accelerate biological discovery. The EVE simulator has since been used to investigate the effect of horizontal gene transfer [6], the distribution of fitness effects, and the hypothesis of accelerated evolution through guided, step-wise adaptation [7], with interesting results that drive biological experimentation [8,9].

Future work includes pushing the limits of microbial simulations to break the one-million-cell barrier, parallelization of organism-specific, data-driven models that integrate omics layers, starting from our recent work in the model bacterium Escherichia coli [10], and integration with synthetic biology computer-aided design tools for targeted, chassis-aware genome engineering [11-14].

WHY BLUE WATERS
Over the last five years, our lab has created a multi-scale abstract microbial evolution model that unifies various layers, from diverse molecular species and networks to organism- and population-level properties. With the help of Blue Waters and the NCSA team, we are able to scale up to hundreds of thousands of cells, an unprecedented scale of simulation.

PUBLICATIONS
Pavlogiannis, A., V. Mozhayskiy, and I. Tagkopoulos, A flood-based information flow analysis and network minimization method for bacterial systems. BMC Bioinformatics, 14:137 (2013), doi:10.1186/1471-2105-14-137.
Mozhayskiy, V., and I. Tagkopoulos, Horizontal gene transfer dynamics and distribution of fitness effects during microbial in silico evolution. BMC Bioinformatics, 13:S13 (2012), doi:10.1186/1471-2105-13-S10-S13.
Mozhayskiy, V., and I. Tagkopoulos, Guided evolution of in silico microbial populations in complex environments accelerates evolutionary rates through a step-wise adaptation. BMC Bioinformatics, 13:S10 (2012), doi:10.1186/1471-2105-13-S10-S10.
Mozhayskiy, V., and I. Tagkopoulos, In silico Evolution of Multi-scale Microbial Systems in the Presence of Mobile Genetic Elements and Horizontal Gene Transfer. In Bioinformatics Research and Applications (Springer, Berlin, Heidelberg, 2011), pp. 262-273.
Mozhayskiy, V., R. Miller, K. L. Ma, and I. Tagkopoulos, A Scalable Multi-scale Framework for Parallel Simulation and Visualization of Microbial Evolution. Proc. 2011 TeraGrid Conf., Salt Lake City, Utah, July 18-21, 2011.
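The static balancing idea mentioned above can be illustrated with a simple greedy assignment: organisms with heterogeneous simulation costs are handed, one by one, to the currently least-loaded rank before the run starts. The sketch below is only that illustration; the cost model and sizes are made up, and this is not the EVE scheduler itself (which also re-balances adaptively during the run).

/* Sketch of a greedy static load balancer for heterogeneous organisms. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int nranks = 4, norg = 16;
    double cost[16], load[4] = {0};
    int    count[4] = {0};

    /* Fake per-organism costs (e.g., proportional to network complexity). */
    srand(42);
    for (int i = 0; i < norg; i++)
        cost[i] = 1.0 + (rand() % 100) / 25.0;

    /* Greedy assignment: each organism goes to the least-loaded rank. */
    for (int i = 0; i < norg; i++) {
        int best = 0;
        for (int r = 1; r < nranks; r++)
            if (load[r] < load[best]) best = r;
        load[best] += cost[i];
        count[best]++;
    }

    for (int r = 0; r < nranks; r++)
        printf("rank %d: %2d organisms, load %.2f\n", r, count[r], load[r]);
    return 0;
}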


EPISTATIC INTERACTIONS FOR BRAIN EXPRESSION GWAS IN ALZHEIMER'S DISEASE

Allocation: Private sector/0.031 Mnh
PI: Nilufer Ertekin-Taner (1)
Collaborators: Mariet Allen (1); Liudmila Mainzer (2,3); Curtis Younkin (1); Victor Jongeneel (2); Thierry Schüpbach (4); Gloria Rendon (2); Julia Crook (1); Julie Cunningham (5); Summit Middha (5); Chris Kolbert (5); Dennis Dickson (1); Steven Younkin (1)
(1) Mayo Clinic in Jacksonville
(2) University of Illinois at Urbana-Champaign
(3) National Center for Supercomputing Applications
(4) Swiss Institute of Bioinformatics
(5) Mayo Clinic in Rochester

EXECUTIVE SUMMARY:
Alzheimer's disease (AD) is likely influenced by the interaction of many genetic and environmental factors, some of which may act by influencing brain gene expression. The aim of this study is to test for pairwise genetic variant interactions (epistasis) that influence brain gene expression levels using existing data from our expression GWAS study of 359 temporal cortex samples (181 AD, 178 non-AD), 223,632 SNP genotypes, and ~24,000 transcripts measured using an expression array. The analysis of epistatic effects in large studies such as ours requires powerful computational resources and would not be possible without the unique computing capabilities of Blue Waters. The first of three planned analyses has been completed for 17,284 array probes detected in >75% of AD samples. Analyses of the non-AD and combined (AD + non-AD) groups will follow shortly.

Through our work with our collaborators at the University of Illinois and NCSA we were able to address many of the challenges that a project of this scope presented. Considering the thousands of phenotypes we proposed to analyze, it was imperative that we identify an efficient software package that could facilitate analysis of multiple phenotypes at one time. We had initially targeted the epistasis tools available through the genetic analysis software PLINK [1]; however, this did not allow for parallelization across multiple phenotypes and would have been time limiting, even with the capabilities of Blue Waters.

INTRODUCTION
The primary goal of this study is to identify novel genetic loci that influence gene expression in the brain in order to identify Alzheimer's disease (AD) candidate genes, although it is both feasible and likely that our findings will have broader implications. It is well established that AD has a significant genetic component. Therefore, identifying the genetic factors that influence AD risk can have a significant impact on the development of novel therapeutic targets, identification of potential premorbid biomarkers, and generation of in vivo disease models, much needed for pre-clinical development and testing of novel therapies.

METHODS AND RESULTS
We have previously conducted an expression genome-wide association study (eGWAS) for ~200 AD subjects and ~200 subjects with other non-AD pathologies using samples from the temporal cortex and cerebellum of post-mortem brain tissue to identify genetic variants that influence brain gene expression [1]. Using this single SNP/single phenotype approach we identified 2,089 significant SNP/probe associations that replicated across the two tissues we investigated and identified an enrichment of human disease-associated variants. Our findings demonstrate the utility of this study design and confirm that genetic variants that influence risk for human disease can also influence brain expression of genes in cis.

However, the single SNP/single phenotype approach employed in GWAS studies is simplistic and likely not an accurate reflection of the complex biological interactions that take place in an organism. Gene (or SNP) interactions, known as epistasis, allow for the study of interaction effects of pairs of SNPs on a given phenotype and can uncover additional genetic factors that influence gene expression and disease. In this study we leverage our existing brain eGWAS data to identify pairs of SNPs that associate with brain gene expression measures, with the goal of identifying additional genetic factors that might influence Alzheimer's disease risk.

We utilized the application FastEpistasis [2], which builds on the analysis paradigm used by PLINK but is multi-threaded software that runs up to 75 times faster by splitting the analysis into three phases. Using FastEpistasis we are able to analyze multiple phenotypes simultaneously. After conducting some optimization tests we found that running 32 phenotypes at one time was the most efficient way to run the analysis on Blue Waters. Due to the large computational requirements of epistasis analysis, incorporation of covariates in regression models is limiting and is not routinely executed. We were able to account for important covariates in our analysis, such as RNA Integrity Number (RIN), age, and gender, by first regressing our gene expression phenotypes against key covariates (in R) to generate residuals. We then used the residuals as our phenotypes for epistasis analysis. We have extensively tested smaller datasets and determined that epistasis analyses executed in FastEpistasis and PLINK generate identical results, and we have also demonstrated that single-variant analysis using residuals as the phenotype gives identical results to multivariable linear regression analysis using the full model.

Finally, in order to reduce the size of the output data and make it more manageable for transfer and entry into a database, we changed the p-value threshold for results that are output in the final phase of FastEpistasis from p < 10^-4 to p < 10^-7, which decreased the output for many phenotypes by more than 75% while retaining results that are well above the p-value threshold following correction for multiple testing. As of this writing, we have completed analysis of the initial dataset of AD subjects sampled from the temporal cortex: 181 AD subjects, 223,632 SNPs, and 24,526 probes, of which 17,284 are reliably measured in >75% of the subjects. Analysis of 178 non-AD subjects for the same set of SNPs and the same number of phenotypes is under way and will be followed by analysis of the complete dataset of 359 (AD + non-AD) subjects with RNA sampled from the temporal cortex.

WHY BLUE WATERS
The computation of epistatic interactions for hundreds of samples, hundreds of thousands of SNPs, and a single phenotype is computationally intensive; for thousands of phenotypes it is virtually impossible without the use of specialized applications and petascale computation. Furthermore, the storage architecture of Blue Waters is highly compatible with the data generated from the epistasis analysis software package implemented for this analysis, FastEpistasis. We have been able to address many of these challenges through our interactions with our collaborators at NCSA and the University of Illinois.

PUBLICATIONS
Zou, F., et al., Brain Expression Genome-Wide Association Study (eGWAS) Identifies Human Disease-Associated Variants. PLOS Genetics, 8:6 (2012), e1002707.
Schüpbach, T., I. Xenarios, S. Bergmann, and K. Kapur, FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics, 26:11 (2010), pp. 1468-1469.
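To make the scale of the pairwise scan concrete, the skeleton below splits SNP pairs across MPI ranks and writes out only interactions below a p-value threshold, mirroring the output-filtering step described above. The function interaction_pvalue() is a hypothetical stand-in for the regression with an interaction term performed by FastEpistasis; it is not part of any real library, and the SNP count is reduced so the sketch runs quickly.

/* Skeleton of a distributed SNP-pair interaction scan (illustrative). */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define N_SNP 1000     /* 223,632 SNPs in the real study; reduced here  */
#define P_CUT 1e-7     /* output threshold used in the final phase      */

/* Hypothetical stand-in for fitting phenotype ~ snp_i + snp_j + snp_i*snp_j
   on the residualized expression values; returns a dummy p-value so the
   skeleton compiles and runs. */
static double interaction_pvalue(long snp_i, long snp_j)
{
    return exp(-(double)((snp_i * 31 + snp_j) % 40));   /* placeholder only */
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char fname[64];
    snprintf(fname, sizeof fname, "epistasis.rank%04d.txt", rank);
    FILE *out = fopen(fname, "w");

    /* Round-robin split of the outer loop keeps ranks roughly balanced
       even though later rows contain fewer pairs. */
    for (long i = rank; i < N_SNP; i += nprocs)
        for (long j = i + 1; j < N_SNP; j++) {
            double p = interaction_pvalue(i, j);
            if (p < P_CUT)
                fprintf(out, "%ld\t%ld\t%.3e\n", i, j, p);
        }

    fclose(out);
    MPI_Finalize();
    return 0;
}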


CHARACTERIZING STRUCTURAL TRANSITIONS OF MEMBRANE TRANSPORT PROTEINS AT ATOMIC DETAILS

Allocation: Illinois/0.686 Mnh
PI: Emad Tajkhorshid (1,2)
Co-PIs: Mahmoud Moradi (1,2); Giray Enkavi (1,2); Po-Chao Wen (1,2); Jing Li (1,2)
(1) Beckman Institute for Advanced Science and Technology
(2) University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Membrane transporters are specialized molecular devices that couple active transport of materials across the membrane to various forms of cellular energy. Their fundamental role in diverse biological processes makes them key drug targets, furthering widespread interest in their mechanistic studies. Large-scale conformational changes (on the order of milliseconds to many seconds) are central to the mechanism of membrane transporters. Studying such conformational transitions requires sampling complex, high-dimensional free energy landscapes that are inaccessible to conventional sampling techniques such as regular molecular dynamics simulations. We have developed a novel approach combining extensive search and state-of-the-art sampling techniques and used it to study a number of membrane transporters. The method, which Blue Waters has made feasible only recently for large macromolecular systems, has greatly expanded the scope of our computational studies of this important family of membrane proteins.

INTRODUCTION
All living organisms rely on continuous exchange of diverse molecular species (e.g., nutrients, precursors, and reaction products) across cellular membranes for their normal function and survival. Membrane transporters are specialized molecular devices that provide the machinery for selective and efficient transport of materials across the membrane. They actively pump their substrates across the membrane by taking advantage of different forms of cellular energy. The biological and biomedical relevance of mechanistic studies of membrane transporter proteins cannot be overstated, given their central role in a myriad of key cellular processes and their involvement in a vast number of pharmaceuticals. Their importance is also evident in the major shift in the focus of experimental structural biology studies in recent years towards characterizing the function of these proteins.

Large-scale conformational changes are central to the mechanism of membrane transporters. A major goal in computational studies of membrane transporters, therefore, is to describe, at an atomic level, the pathways and energetics associated with the structural transitions involved in their function. Given the technical challenges involved in experimental characterization of these structural phenomena, simulation studies currently provide the only method to achieve the spatial and temporal resolutions required for a complete description of the transport cycle in membrane transporters.

METHODS AND RESULTS
We have recently developed a knowledge-based computational approach to describing large-scale conformational transitions using a combination of several distinct enhanced sampling techniques. In this approach we use non-equilibrium, driven simulations built on mechanistically relevant, system-specific reaction coordinates, whose usefulness and applicability to the transition of interest are examined using knowledge-based, qualitative assessments along with non-equilibrium work measurements; these provide an empirical framework for optimizing the biasing protocol in a series of short simulations.

In the second stage, we use the string method with swarms of trajectories in a high-dimensional collective variable space to further relax the most optimized non-equilibrium trajectory from the first stage. We use the relaxed trajectory to initiate bias-exchange umbrella sampling (BEUS) free energy calculations and characterize the transition quantitatively. Using a biasing protocol fine-tuned to a particular transition not only improves the accuracy of the resulting free energies but also speeds up the convergence. By assessing the efficiency of the sampling we are able to detect possible flaws and identify potential improvements in the design of the biasing protocol.

We studied the structural transition between the outward-facing and inward-facing states in several transporter systems from different classes, including the bacterial ABC transporter MsbA and two secondary transporters, GlpT and Mhp1, using NAMD, a highly scalable molecular dynamics code. These simulations resulted in the most detailed description to date of the transition process, and at atomic resolution for the first time. The simulations provided novel insight into the details of the energy coupling mechanisms in these proteins; they also hinted at the presence of previously uncharacterized intermediate states. We are in the process of experimentally verifying these intermediates through our collaborations with leading experimental groups. These intermediates largely expand our repertoire of structures that can be used for docking and drug design studies. As novel structural entities, they can provide new targets for better and more selective drugs.

WHY BLUE WATERS
Our research approach relies on multiple-copy algorithms (MCAs) that couple the dynamical evolution of a large set of replicas/copies of a system (e.g., to enhance sampling or refine transition pathways). In our simulations, we employ the BEUS scheme as well as a parallel variation of the string method, both of which are MCAs and are well suited to Blue Waters due to their use of distributed replicas that communicate with low overhead cost. Given that every copy of the simulation would require thousands of cores, simulating a large number of interacting replicas simultaneously can only be accomplished on massive computing resources such as Blue Waters. NAMD also has been extensively tested and optimized for Blue Waters, showing sustained petascale performance.

PUBLICATIONS
Moradi, M., and E. Tajkhorshid, Mechanistic picture for conformational transition of a membrane transporter at atomic resolution. Proc. Natl. Acad. Sci. USA, 110:47 (2013), pp. 18916-18921.
Moradi, M., and E. Tajkhorshid, Computational recipe for efficient description of large-scale conformational changes in biomolecular systems. J. Chem. Theory Comput., 10 (2014), pp. 2866-2880.

FIGURE 1: Conformational free energy landscape of opening and closing of the cytoplasmic end of the ABC transporter MsbA. Free energies were calculated using a bias-exchange umbrella sampling (BEUS) scheme and plotted in the (α,γ) space, which describes the relative orientations of different molecular domains as shown in the figure. The inward-facing-closed and inward-facing-open conformations shown in the figure are low-resolution (4.5 Å) crystal structures used to design the sampling protocol.
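The exchange step that makes BEUS a multiple-copy algorithm can be written down compactly: two neighboring replicas, biased by harmonic umbrellas on the same collective variable, attempt to swap configurations with a Metropolis criterion. The C sketch below shows only that acceptance test; the spring constants, window centers, and kT are illustrative numbers, not the study's actual protocol.

/* Minimal sketch of a bias-exchange (umbrella swap) acceptance test. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double umbrella(double x, double center, double k)
{
    return 0.5 * k * (x - center) * (x - center);   /* harmonic bias */
}

/* Metropolis acceptance for swapping configurations x_i and x_j between
   umbrellas i and j:
   min(1, exp(-[U_i(x_j) + U_j(x_i) - U_i(x_i) - U_j(x_j)] / kT))        */
static int try_exchange(double x_i, double x_j,
                        double c_i, double c_j, double k, double kT)
{
    double delta = umbrella(x_j, c_i, k) + umbrella(x_i, c_j, k)
                 - umbrella(x_i, c_i, k) - umbrella(x_j, c_j, k);
    if (delta <= 0.0)
        return 1;
    return ((double)rand() / RAND_MAX) < exp(-delta / kT);
}

int main(void)
{
    /* Two windows along a collective variable (e.g., a domain angle). */
    double kT = 0.6;   /* kcal/mol, roughly 300 K (assumed)            */
    double k  = 2.0;   /* kcal/(mol*unit^2), assumed                   */
    double x_i = 1.10, x_j = 1.25, c_i = 1.0, c_j = 1.3;

    srand(7);
    printf("exchange %s\n",
           try_exchange(x_i, x_j, c_i, c_j, k, kT) ? "accepted" : "rejected");
    return 0;
}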


THE BACTERIAL BRAIN: ALL-ATOM DESCRIPTION OF RECEPTOR CLUSTERING AND COOPERATIVITY WITHIN A BACTERIAL CHEMORECEPTOR ARRAY

Allocation: Illinois/0.15 Mnh
PI: Yann Chemla (1)
Collaborators: Klaus Schulten (1)
(1) University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
The ability of an organism to sense, interpret, and respond to environmental signals is central to its survival. Chemotaxis is a ubiquitous signaling system by which cells translate environmental chemical information into a motile response. Bacteria in particular have evolved sophisticated protein networks that survey chemicals in the environment and position cells optimally within their habitat. These networks are functionally analogous to the brains of higher organisms: an array of chemoreceptors senses chemical stimuli and transmits adaptive signals through an extended, multi-million-atom protein lattice, which evaluates these signals to appropriately affect the cell's swimming behavior. Here, we present an all-atom structure of the intact bacterial chemoreceptor array, based primarily on crystallographic and electron microscopy data. Molecular dynamics simulations on Blue Waters are being used to investigate the dynamical properties of the array and provide insight into its amazing information processing and control capabilities.

INTRODUCTION
A central problem in the chemotaxis field concerns the intermolecular cooperativity that emerges from the organized clustering of proteins within the chemoreceptor array. Indeed, experimental and quantitative modeling studies have shown receptor clustering to be an essential functional feature of bacterial chemotaxis, giving rise to many of the network's enhanced signaling properties. However, due to the sheer size (tens of thousands of individual proteins in an average-sized array) and irreducible nature of the array's multi-component machinery, the detailed molecular mechanisms by which these proteins cooperate to robustly transduce signals have remained elusive.

Similarly, another important signaling feature emerges from the collective nature of the chemoreceptor array, namely the ability to variably regulate responses to environmental signals. Through the reversible methylation of chemoreceptors at several key sites along their cytoplasmic domains, bacteria are able to adapt to background chemical concentrations over several orders of magnitude in order to efficiently forage their habitat. How exactly this remarkable adaptation affects the regulatory properties of the array at the molecular level, however, is still quite mysterious.

As the centerpiece of perhaps the most thoroughly studied sensory signal transduction system in all of biology, namely the chemotactic network of E. coli, the chemoreceptor array represents the next frontier towards a complete understanding of a basic, naturally evolved biological computer. The bacterial chemoreceptor array possesses the essential functional features of higher-level signaling assemblies arising in more complex eukaryotic cells such as neurons and lymphocytes. Hence, new insights into the collective function of the chemoreceptor array will help elucidate the fundamental mechanisms by which biological systems process information in general.

METHODS AND RESULTS
We constructed an all-atom model of the unit cell of the bacterial chemoreceptor array based on crystallographic structures of component proteins and a 16 Å resolution cryo-electron microscopy density of the chemoreceptor array from E. coli. The model involves over one million atoms per unit cell. This unit cell model was equilibrated and subsequently simulated for 250 ns on Blue Waters using NAMD, a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Superior to previous studies, the experimentally guided unit cell model explicitly couples each protein component to its proper native array neighbor so that the structure evolves under its native contacts. The resulting stable model and simulations mark the first all-atom description of the structure and dynamics of an intact chemoreceptor array.

In order to further characterize the collective dynamics of the chemoreceptor array, equilibrium molecular dynamics (MD) simulations on Blue Waters were used to extend the total sampling time of the unit cell system up to 3 μs. Principal component analysis was used to extract, from the thermal fluctuations present in these MD simulations, important structural information regarding the natural modes of motion of the individual array oligomers as well as the global motions arising within the array itself. The preliminary results reveal coupled excitations of the individual oligomer modes within the modes of the complete unit cell, potentially establishing routes of communication between individual proteins and hinting at a signaling mechanism. More refined collective dynamics analysis will hopefully provide further insights into the functional relationships between these proteins and the deep cooperativity underlying the chemoreceptor array's computing ability.

WHY BLUE WATERS
Until recently, the relatively immense spatial and temporal scales needed to describe collective phenomena in large, multi-protein complexes such as the chemoreceptor array rendered it impractical to address these problems with available computational techniques and facilities. With the intense parallel computing power of Blue Waters, it is now feasible to explore such necessarily large systems computationally. The unique atomistic perspective afforded by Blue Waters will provide a framework to explicitly test theoretical hypotheses, help elucidate the connections between diverse experimental results, and inform future experiments.

FIGURE 1 (BACKGROUND): Central to their chemotactic ability, bacteria possess a universally conserved, multi-million-atom protein lattice known as the chemoreceptor array. Scientists in Klaus Schulten's group at the University of Illinois at Urbana-Champaign have recently constructed the first all-atom model of the chemoreceptor array and are using Blue Waters to investigate its amazing information processing and control capabilities.
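The principal component analysis described above starts from the covariance matrix of atomic fluctuations accumulated over trajectory frames; diagonalizing that matrix (for example with LAPACK's dsyev) then yields the principal modes. The C sketch below shows only the covariance-building step, with a random placeholder "trajectory" standing in for the real, aligned coordinates; the sizes are illustrative.

/* Sketch: covariance matrix of coordinate fluctuations for PCA. */
#include <stdio.h>
#include <stdlib.h>

#define N_COORD 12    /* 3N Cartesian coordinates of the selected atoms */
#define N_FRAME 500   /* trajectory frames                              */

int main(void)
{
    static double traj[N_FRAME][N_COORD];
    static double cov[N_COORD][N_COORD];
    double mean[N_COORD] = {0};

    /* Placeholder trajectory; real input would be aligned MD frames. */
    srand(1);
    for (int t = 0; t < N_FRAME; t++)
        for (int a = 0; a < N_COORD; a++)
            traj[t][a] = (double)rand() / RAND_MAX;

    /* Mean structure. */
    for (int t = 0; t < N_FRAME; t++)
        for (int a = 0; a < N_COORD; a++)
            mean[a] += traj[t][a] / N_FRAME;

    /* Covariance of fluctuations: C_ab = <(x_a - <x_a>)(x_b - <x_b>)>.
       Its eigenvectors are the principal modes of motion. */
    for (int t = 0; t < N_FRAME; t++)
        for (int a = 0; a < N_COORD; a++)
            for (int b = 0; b < N_COORD; b++)
                cov[a][b] += (traj[t][a] - mean[a]) * (traj[t][b] - mean[b]) / N_FRAME;

    printf("C[0][0] = %g, C[0][1] = %g\n", cov[0][0], cov[0][1]);
    return 0;
}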


THE DYNAMICS OF PROTEIN DISORDER AND ITS EVOLUTION: UNDERSTANDING SINGLE MOLECULE FRET EXPERIMENTS OF DISORDERED PROTEINS

Allocation: Illinois/0.05 Mnh
PI: Gustavo Caetano-Anollés1
Co-PI: Frauke Gräter2
Collaborators: Cédric Debès2; Davide Mercadante2; Fizza Mughal1
1University of Illinois at Urbana-Champaign
2Heidelberg Institute for Theoretical Studies, Germany

EXECUTIVE SUMMARY:
Intrinsically disordered proteins (IDPs) play crucial roles in cells. They also introduce a source of conformational heterogeneity in 3D protein structure that can be experimentally explored using Förster Resonance Energy Transfer (FRET) methods. Here we study this heterogeneity with molecular dynamics simulations of nucleoporin Nup153 peptides using the GROMACS 4.6 platform, benchmarked against FRET experiments. Nup153 is a crucial nuclear pore IDP involved in membrane trafficking. Our analyses show that the non-structured molecules collapse quickly, reaching consistent values of gyration radius. As expected from FRET experiments, the ends of the protein chain exhibited higher dynamics. Remarkably, we found an unanticipated increase in the dynamics of intra-chain segments for short fragments compared to medium fragments. This difference is due to the reduced influence of chain termini when fragments are below a certain size.

INTRODUCTION
Intrinsically disordered proteins (IDPs), which comprise ~10% of all proteins and are especially common in eukaryotes [1], pose an enormous experimental challenge due to the heterogeneous conformational ensemble they sample. In this regard, Förster Resonance Energy Transfer (FRET) experiments have provided unprecedented insight into the dynamics of disordered proteins less than 100 amino acid residues long [2]. They have shown molecular fluctuations with reconfiguration time scales of ~100 ns for free termini of the semi-flexible chains [3,4]. In this study we use FRET experiments as benchmarks for molecular dynamics (MD) simulations of nucleoporin proteins (Nups) of three different lengths to assess the dependency of chain dynamics on chain length. Nups control transport of molecules across the nuclear membrane by inducing changes in their structure. The mechanisms responsible for these changes are yet to be established. Here we focus on MD simulations of Nup153, a nuclear pore protein crucial to membrane trafficking. Our long-term goal is to extend these kinds of analyses to non-structured loop and disordered regions of proteins that have been associated with the rise of molecular flexibility and genetics in molecular evolution [5,6].

METHODS AND RESULTS
We simulated trajectories of one microsecond and beyond on Nup153 fragments of three different lengths. Since the resulting conformational ensemble likely depends on the starting conformation, which is randomly chosen, we ran three simulations for each fragment. Trajectories were examined to compute reconfiguration time scales as a function of polymer length, yielding correlations of distance fluctuations between fixed points on the chain.

Single molecule FRET results on Nup153 show unexpectedly reduced dynamics in the 10-100 ns range compared to other unfolded or disordered proteins investigated previously [4]. While the Nup153 results from this experiment were obtained from the full-length 900-residue IDP, previous analogous experiments have all been performed on shorter peptides (<100 residues) that were labeled at the termini. An obvious assumption is that embedding a fragment under consideration, with labels at its termini, into a much larger disordered protein significantly reduces the inter-label dynamics.

We also monitor correlations of distance fluctuations between fixed points on the chain to calculate reconfiguration time scales as a function of chain length, fulfilling our expectation that the increase in chain length causes the characteristic ~100 ns dynamics [3,4]. Specifically, we find that the non-structured Nup153 chains, which lack secondary structure, collapse quickly in the trajectories, reaching consistent values of gyration radius (fig. 1). Calculation of reconfiguration time scales as a function of chain length revealed that the ends of the molecules (the termini of the coils) exhibited higher dynamics than the rest of the molecules. In contrast, we found that there was an unanticipated increase in the dynamics of intra-chain segments for short fragments compared to medium fragments, due to the reduction of the influence of chain termini when fragments are below a certain size. The increased dynamics of short intra-chain segments may enhance the sampling of the molecular conformational spectrum and facilitate the molecular function of the nucleoproteins.

Increased dynamics of small coils in non-structured loops and short intrinsically disordered segments that are abundant in proteins could facilitate a wide range of molecular functions by enhancing the molecular flexibility of these regions. In fact, a structural phylogenomic analysis of millions of proteins in hundreds of proteomes revealed that the rise of genetics was associated with these flexible regions [5].

WHY BLUE WATERS
The results of our proof-of-concept study now provide a foundation for a Blue Waters-enabled high-throughput MD simulation study of the dynamics of a massive number of intra-chain protein regions. Such a study could yield unprecedented atomistic details of non-bonded interactions, secondary structure propensities, and other properties of the random coils of proteins, which are linked to constraints imposed by billions of years of molecular evolution that are responsible for structuring both proteins and the genetic code.

FIGURE 1: Equilibrium MD simulations of Nup153 fragments of different lengths show the intra-chain dynamics of the protein coils. (A) A series of representative images, ranging from extended (left) to more collapsed states (right), depict the simulated trajectories of the long Nup153 fragment. The protein backbone is shown as a random coil while the FG-repeats along the sequence are represented by balls and sticks. (B) Autocorrelation values plotted against simulation time. The fluctuations of short (S) 29-residue, medium (M) 49-residue, and long (L) 79-residue fragments of Nup153 are in line with simpler polymer models.
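A reconfiguration time scale of the kind benchmarked against FRET above can be estimated, in outline, from the autocorrelation of the distance between two fixed points on the chain. The following sketch uses a synthetic distance series; the frame spacing, the 5% decay cutoff, and the AR(1) test signal are illustrative assumptions rather than details of the actual GROMACS analysis.

import numpy as np

def distance_autocorrelation(d, dt):
    """Autocorrelation of distance fluctuations and an integrated relaxation time.

    d  : 1D array of inter-residue distances, one value per frame
    dt : time between frames (ns)
    """
    x = d - d.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:]    # unnormalized autocorrelation
    acf /= acf[0]                                    # normalize so C(0) = 1
    below = np.where(acf < 0.05)[0]                  # first time the ACF decays below 5%
    cutoff = below[0] if below.size else len(acf)
    tau = np.trapz(acf[:cutoff], dx=dt)              # crude integrated correlation time
    return acf, tau

# Synthetic example: an exponentially correlated distance series (hypothetical).
rng = np.random.default_rng(1)
n_frames, dt = 5000, 0.01                            # 0.01 ns per frame (assumed)
noise = rng.normal(size=n_frames)
d = np.empty(n_frames)
d[0] = noise[0]
for i in range(1, n_frames):                         # AR(1) process mimicking ~1 ns memory
    d[i] = 0.99 * d[i - 1] + noise[i]
acf, tau = distance_autocorrelation(d, dt)
print(f"estimated reconfiguration time ~ {tau:.2f} ns")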


THE COMPUTATIONAL MICROSCOPE

Allocation: NSF/30 Mnh, BW Prof./0.24 Mnh
PI: Klaus Schulten1
Collaborators (HIV Project): Peijun Zhang2; Christopher Aiken3
Collaborators (Chromatophore Project): Neil Hunter4; Simon Scheuring5
1University of Illinois at Urbana-Champaign
2University of Pittsburgh
3Vanderbilt University
4University of Sheffield
5INSERM/Université Aix-Marseille

EXECUTIVE SUMMARY:
Cells are the building blocks of life, yet they are themselves constructed of proteins, nucleic acids, and other molecules, none of which are, in and of themselves, alive. How living things can arise from the "behavior" of molecules, which are simply obeying the laws of physics, is the essential question of modern biology. Molecular dynamics simulations can be used as a "computational microscope," offering the ability to explore living systems at the atomic level and providing a necessary complement to experimental techniques such as crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy.

With the rise of petascale computing, we take a critical step from inanimate toward animate matter by computationally resolving whole cellular organelles in atomic detail. Herein we discuss recent enhancements to the enabling programs NAnoscale Molecular Dynamics (NAMD) and Visual Molecular Dynamics (VMD) and successful research on several large-scale biomolecular systems being studied on Blue Waters, including the HIV capsid and a photosynthetic organelle.

INTRODUCTION
Bridging the gap between single protein simulations and organelle or cell-scale simulations is challenging on many levels. Because molecular dynamics simulations on the order of hundreds of millions of atoms require substantial computational power, efficient codes, as well as appropriate analysis and visualization techniques, must be developed.

Such simulations open up new possibilities to understand biological systems on a new level, such as evaluating the effects of pharmaceuticals on the stability of a virus capsid, or unraveling the complex interplay between the many processes that enable photosynthesis. Simulations interpret data, suggest new experiments, and do what experiments cannot, which is to give an atomic-level picture of what is going on inside living systems. The ability to explore living systems via the "computational microscope" of molecular dynamics simulations has a profound impact not only on the progress of basic science, but also in the treatment of disease, development of drugs, and development of new energy technologies.

METHODS AND RESULTS
The HIV capsid project produced the first ever atomic-level structure of a native, mature HIV capsid [1]. Since then, the capsid model has continued to be used to analyze the dynamics of motion of the HIV capsid subunits and has been used to explore the interactions of the capsid with drugs and host cell factors. We have explored the interactions of the full HIV capsid with small molecules, including the controversial Pfizer PF74 drug (which interferes with host cell binding to the capsid), the PF1385801 drug (which results in ultra-stable capsids), and compounds BI01/02 (which trigger premature disassembly). In the early stages of the replication cycle, the HIV capsid interacts with various host cell factors, such as cyclophilin A (which stabilizes and assists capsid assembly), and TRIM family factors, which disrupt the capsid and assist in nuclear import. Together with experimental collaborators, computational scientists were able to describe the action of cyclophilin A on the capsid and are presently working on the mechanism of restriction by TRIM-family proteins. Such studies may help scientists understand better how the HIV capsid infects the host cells and could lead to new HIV therapies.

The photosynthetic chromatophore project is creating the first all-atom model of a cellular organelle. Chromatophores are spherical organelles in photosynthetic bacteria which allow the bacteria to absorb sunlight and turn it into chemical fuel that drives many processes in the cell. The chromatophore is composed of about 200 proteins and carries out about 20 processes. By modeling the full organelle, it is possible to see how these processes interlock, much like the gears in a fine Swiss watch, and see how they allow the bacteria to make ATP fuel out of sunlight [2].

The chromatophore project is still underway, but a smaller chromatophore-membrane system has been simulated and recently published [3]. This smaller, 20-million-atom simulation of a flat membrane filled with photosynthetic light harvesting complexes (based on AFM images from a bacterium with flat chromatophore membranes) served as proof of concept for the chromatophore organelle simulation and explored the relationship between the organization of the light-harvesting proteins and the efficiency with which those proteins can transfer energy between them. A newly published model of the spherical chromatophore organelle recently revealed, for the first time, the locations not only of the light-harvesting complexes, but of the bc1 complex and ATP synthase, two other critical proteins whose locations were previously unknown. Deciphering the inner workings of this model photosynthetic system can guide the development of bio-hybrid green energy devices to help address mankind's energy needs. The chromatophore model is currently being prepared for simulation.

WHY BLUE WATERS
Without Blue Waters and other petascale computing resources, neither the HIV nor the chromatophore project would be possible. The HIV project involves simulations of about 65 million atoms, and the chromatophore project requires simulations of up to 100 million atoms; both simulations require thousands of nodes on Blue Waters to run effectively. These projects are examples of how Blue Waters enables bold, new projects that push the limits of what can be done with scientific computing. In our case, that means expanding molecular dynamics simulation capabilities from simulating just a few proteins to simulating full organelles.

PUBLICATIONS
Zhao, G., et al., Mature HIV-1 capsid structure by cryo-electron microscopy and all-atom molecular dynamics. Nature, 497:7451 (2013), pp. 643-646.
Chandler, D., J. Strumpfer, M. Sener, S. Scheuring, and K. Schulten, Light harvesting by lamellar chromatophores in Rhodospirillum photometricum. Biophys. J., 106:11 (2014), pp. 2503-2510.
Cartron, M. L., et al., Integration of energy and electron transfer processes in the photosynthetic membrane of Rhodobacter sphaeroides. BBA-Bioenergetics, (2014), in press.

FIGURE A: Klaus Schulten's group at the University of Illinois at Urbana-Champaign was able to construct and simulate the first atomic-resolution model of a mature HIV capsid. This capsid model is now being simulated on Blue Waters to test the interactions of HIV drugs and host cell factors with the capsid, which could help lead to the design of new HIV therapies.
FIGURE B: Scientists in Schulten's group have also constructed an atomic-resolution model of a whole photosynthetic organelle, called a chromatophore. With Blue Waters and other petascale computers, scientists are no longer limited to looking at one or two proteins at a time. It is becoming possible to look at all of the interlocking processes inside an organelle made of many proteins.


HIERARCHICAL MOLECULAR DYNAMICS SAMPLING FOR ASSESSING PATHWAYS AND FREE ENERGIES OF RNA CATALYSIS, LIGAND BINDING, AND CONFORMATIONAL CHANGE

Allocation: NSF/14 Mnh
PI: Thomas Cheatham, III1
Co-PIs: Adrian Roitberg2; Carlos Simmerling3; Darrin York4
1University of Utah
2University of Florida
3Stony Brook University
4Rutgers University

EXECUTIVE SUMMARY:
A collaborative team of AMBER developers has been focusing on the development and application of methods that couple together ensembles of highly GPU-optimized molecular dynamics engines to fully map out the conformational, energetic, and chemical landscape of biomolecules. Through applications of the recently released AMBER 14 codes on Blue Waters, the team has shown the ability to efficiently converge conformational distributions of RNA tetranucleotides, RNA tetraloops, and DNA helical structure, and has also shown reproducibility in the results under different initial conditions.

INTRODUCTION
A continually growing community of researchers is using atomistic biomolecular simulation methods in a variety of applications aimed at, for example, aiding in the refinement of experimental structures, probing ligand-receptor interactions, investigating protein and RNA folding, and generally understanding biomolecular structure and dynamics better. Yet very few of the force fields are well validated, and few studies show complete convergence and reproducibility. It is estimated that more than 40% of compute cycles on resources allocated through the NSF's Extreme Science and Engineering Discovery Environment (and a large fraction of Blue Waters cycles) are for the application of these types of simulation methods. If these force fields have limitations, this impacts the reliability of the simulation results and could alter interpretations of the data. Although it is very likely there are still issues with the force fields and sampling, as we show, the situation is not entirely dire since many groups have shown that with reliable starting structures (i.e., high-resolution structures from experiment) and good sampling near these experimental structures, excellent reproduction of experimental observables can be seen and new insight inferred.

METHODS AND RESULTS
Applying these simulation methods we can better understand the properties of biomolecules such as proteins and nucleic acids with such fine-grained detail that the methods accurately account for their native environments of solvent, salts, and other molecules, and complement experimental results. Two key challenges exist. The first centers on the means to effectively and completely sample the complex conformational landscape or, at the least, sufficiently sample the time scales of the processes of interest. This is challenging since biomolecular processes occur over a wide range of time scales, coupling fast and localized dynamics on the femtosecond to nanosecond timescales to larger collective motions over microseconds to milliseconds and beyond. The other key challenge centers on force field accuracy and the ability of the models to properly describe the energetic and dynamic landscape.

The initial focus was on simulations of the RNA tetranucleotide r(GACC) in explicit solvent, for which good NMR data is available for assessment. A number of papers demonstrate that not only can we converge the conformational ensemble of this tetranucleotide, but we can do it reproducibly under different initial conditions.

Unpublished recent work is on the UUCG tetraloop structure that has been a challenge for many current and available force fields, despite the fact that this is the most stable tetraloop structure observed in nature. Given the reliability of the experimental data, this makes UUCG a good test system for improvements to the force fields. Our primary challenge was to converge the distribution of loop structures for UUCG.

In fig. 1 we demonstrate qualitative convergence between two restrained-stem simulations. To demonstrate convergence, the principal components of the combined ensembles were calculated. The Kullback–Leibler divergence of the first five principal components decreases quickly, but still shows significant non-zero values (left panel). The histogram projection of the principal components shows us areas where the primary motions of the loop differ, specifically in the first two modes (right panel).

WHY BLUE WATERS
A key enabler of our work is the extensive effort in optimizing the AMBER molecular dynamics (MD) simulations on GPUs by the AMBER development team. AMBER is arguably one of the fastest, if not the fastest, MD engines on NVIDIA GPUs, and this high-performing code was released to the larger community in April 2014. To take advantage of this speed, we have developed and optimized methods that couple together ensembles of independent simulations that exchange information periodically to enhance sampling. We have developed a multi-dimensional replica exchange (M-REMD) framework and set of analysis tools that enable independent simulations to exchange information (temperature, pH, Hamiltonian or "force field") to enhance sampling. Applying these, we have learned which methods help with speed to solution and which do not.

PUBLICATIONS
Bergonzo, C., N. Henriksen, D. R. Roe, J. Swails, A. E. Roitberg, and T. E. Cheatham, III, Multi-dimensional replica exchange molecular dynamics yields a converged ensemble of an RNA tetranucleotide. J. Chem. Theory Comput., 10:1 (2014), pp. 492-499.
Thibault, J. C., D. R. Roe, J. C. Facelli, and T. E. Cheatham, III, Data model, dictionaries, and desiderata for biomolecular simulation data indexing and sharing. J. Cheminform., 6:1 (2014), doi:10.1186/1758-2946-6-4.
Roe, D. R., C. Bergonzo, and T. E. Cheatham, III, Evaluation of enhanced sampling provided by accelerated molecular dynamics with Hamiltonian replica exchange methods. J. Phys. Chem. B, 118:13 (2014), pp. 3543-3552.
Swails, J. M., D. M. York, and A. E. Roitberg, Constant pH replica exchange molecular dynamics in explicit solvent using discrete protonation states: Implementation, testing, and validation. J. Chem. Theory Comput., 10:3 (2014), pp. 1341-1352.

FIGURE 1 (BOTTOM LEFT): Convergence of principal components and overlap of projections between two independent simulations of the UUCG tetraloop with weak or tighter stem loop base pair restraints. (Left) Kullback–Leibler divergence between projections of individual simulation principal components calculated over the combined trajectories from both simulations as a function of time, sampled over 360 replicas in multidimensional replica exchange. (Right) Overlap of principal component projections from the two independent simulations for the first five principal components.
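The Kullback–Leibler divergence used in fig. 1 to compare independent ensembles can be sketched as follows: project both simulations onto a common principal component, histogram the projections, and evaluate D(P||Q) between the smoothed histograms. The bin count and pseudocount below are illustrative choices, not the values used by the team.

import numpy as np

def kl_divergence(p_samples, q_samples, bins=50, eps=1e-10):
    """KL divergence D(P || Q) between histograms of two sets of PC projections."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi), density=True)
    p = p + eps                                      # pseudocounts avoid log(0)
    q = q + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical projections of two independent simulations onto the same principal component.
rng = np.random.default_rng(2)
run1 = rng.normal(0.0, 1.0, size=100000)
run2 = rng.normal(0.2, 1.1, size=100000)
print("D_KL(run1 || run2) =", kl_divergence(run1, run2))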


SIMULATIONS OF BIOLOGICAL PROCESSES ON THE WHOLE-CELL LEVEL

Allocation: Illinois/0.592 Mnh
PI: Zaida Luthey-Schulten1,2
Collaborators: Michael J. Hallock1; Ke Chen1; Tyler M. Earnest1; Joseph R. Peterson1; John A. Cole1; Jonathan Lai1; John E. Stone2
1University of Illinois at Urbana-Champaign
2Beckman Institute for Advanced Science and Technology

EXECUTIVE SUMMARY:
Recent experiments are revealing details of fundamental cellular processes involved in protein synthesis and metabolism. However, a dynamical description of these processes at the whole-cell level is still missing. With Blue Waters we have been able to develop a kinetic model of ribosome biogenesis that reproduces in vivo experimental observations. Stochastic simulations with our GPU-accelerated Lattice Microbes software have extended the model to the entire bacterial cell by computationally linking the transcription and translation events with ribosome assembly on biologically relevant time scales.

INTRODUCTION
Translation is the universal process that synthesizes proteins in all living cells. The ribosome constitutes approximately one fourth of a bacterial cell's dry mass and is central to translation. Biogenesis of the ribosome, together with all cellular activities involved in translation, consumes a significant portion of the cell's total energy. However, how the cell enforces a balance between metabolism and macromolecular synthesis is yet to be recognized. We envision that a whole-cell model of ribosome biogenesis is crucial to understanding cell growth and how it is regulated in response to environmental perturbations.

In bacterial cells, ribosomal assembly requires the cooperation of many molecular components: approximately 55 r-proteins, translated in different regions of the cell, must find and bind the rRNA in the correct order of assembly, and approximately 20 assembly cofactors are engaged to facilitate the process at various assembly stages. Nomura et al. [1] originally mapped out the hierarchical dependency of the r-proteins binding to the E. coli 16S rRNA using equilibrium reconstitution experiments. Progress in biophysical approaches boosted our understanding of in vitro ribosomal self-assembly, mainly for the protein-assisted dynamics of RNA folding [2-4] and the kinetic cooperation of protein binding [5-7]. However, protein binding orders derived from thermodynamic and kinetic experiments do not always agree, which hampers our investigation of the assembly under an in vivo environment. A comprehensive model that captures the topology of the protein-RNA interaction network is needed to decipher the underlying rules governing the assembly of the ribosome.

METHODS AND RESULTS
To address the global complexity of in vivo ribosome biogenesis, we simulated a kinetic model of 30S ribosome assembly using the Lattice Microbes software (LM) on Blue Waters. LM [8-10] is a package of stochastic solvers for simulating the kinetics of biological systems. The reaction diffusion master equation (RDME) solver incorporates spatial information and only allows molecules to interact with others that are nearby. The RDME allows for a more realistic description of biological systems in vivo than the alternative, the chemical master equation (CME) solver. Molecular crowding and initial distributions of ribosomes within the cells are obtained from proteomics and cryo-electron tomography reconstructions, and this data can be used by the RDME to describe the cellular environment [11].

We construct a model that explicitly uses the dependencies of the Nomura map [1] to decrease the size of the network to a manageable level, taking effective r-protein binding rates published by the Williamson lab [7]. To further reduce the network size, we use well-stirred stochastic simulations to identify the intermediates through which the majority of the reaction flux flows. Intermediates which are underutilized by the reaction network are removed from the network along with their associated reactions. This analysis allowed us to reduce the assembly network from 1,633 to 62 species (42 assembly intermediates) and from 7,000 to 69 reactions.

To test the validity of this severely pruned network, we compared the protein binding curves, which show the fraction of r-protein bound to intermediates as a function of time, from the well-stirred simulations using the full network and simulations of the reduced network. We saw no greater than 0.1% root-mean-square error. After subsequent tuning with the competing folding conformations identified in our previous studies [4, 12-14], the model successfully reproduced the structural intermediates reported in the single particle electron microscopy experiments [15]. Furthermore, the model predicted new assembly intermediates that will guide further experimental discoveries.

WHY BLUE WATERS
Even the reduced network is considerably larger than any cellular network we have simulated to date. It is absolutely crucial to have high-performance GPUs to finish the simulation in a timely manner. Many simulations were run to test the sensitivity of the runs to changing parameters and produce an adequate pool of results to make statistical inferences. We could not have done this without Blue Waters. Our long-range goal is to unite the kinetic model of translation with other cellular networks extending over several cycles of cell division so that whole-cell simulations of bacteria responding to various stimuli and environmental factors can be achieved.

As development of LM progresses, Blue Waters will continue to be a prime resource for our simulations. Support for distributed simulations that span multiple nodes over MPI is in development, and features such as Blue Waters' high-speed interconnect, GPU-to-fabric DMA, and a highly parallel file system will be key components for a successful and scalable application.

FIGURE 1 (BACKGROUND): (Upper left) The full reaction network of ribosomal intermediates during the binding of ribosomal proteins to the 16S rRNA (green node) to form the 30S subunit (red node). (Upper right) The full network can be reduced significantly by the analysis of flux through each intermediate. By eliminating intermediates whose flux is less than 0.7% of the most active species, we can prune the network to 62 species and 69 reactions. (Lower right) The reduced small subunit assembly network is simulated within the environment of a living cell. Using our Lattice Microbes whole-cell simulation software [8-10], we are investigating the spatial-temporal effects of a complicated cellular environment on ribosomal biogenesis. The genes encoding the ribosomal proteins and rRNA are placed according to their location in the genome and allowed to diffuse throughout the nucleoid region. Ribosomes are placed within the cytoplasm according to their experimental distribution. In addition to the ribosomal protein binding network, transcription of r-proteins and rRNA and translation of r-protein mRNA are simulated. With the effect of transcription and translation, a realistic simulation of in vivo ribosome biogenesis can be performed.
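The flux-based pruning of the assembly network can be expressed generically: species whose time-averaged flux falls below a fraction of the most active species are removed, together with every reaction that touches them. The 0.7% threshold mirrors the text, but the species names, flux values, and data structures below are placeholders.

def prune_network(flux, reactions, threshold_fraction=0.007):
    """Remove intermediates whose flux is below a fraction of the most active species.

    flux      : dict mapping species -> time-averaged reaction flux through that species
    reactions : list of (reactants, products) tuples, each a tuple of species names
    """
    cutoff = threshold_fraction * max(flux.values())
    keep = {s for s, f in flux.items() if f >= cutoff}
    pruned_reactions = [
        (r, p) for r, p in reactions
        if set(r) <= keep and set(p) <= keep        # drop reactions touching removed species
    ]
    return keep, pruned_reactions

# Toy example with placeholder intermediate names and fluxes.
flux = {"16S": 100.0, "I_a": 40.0, "I_b": 0.3, "30S": 90.0}
reactions = [(("16S",), ("I_a",)), (("I_a",), ("30S",)),
             (("16S",), ("I_b",)), (("I_b",), ("30S",))]
species, rxns = prune_network(flux, reactions)
print(sorted(species), len(rxns))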


PREDICTIVE COMPUTING OF ADVANCED MATERIALS AND ICES

Allocation: BW Prof/0.12 Mnh
PI: So Hirata1
1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Two breakthroughs in the algorithms of ab initio electronic structure theory developed recently by the Hirata group will be deployed on Blue Waters to perform predictively accurate calculations for the optoelectronic properties of large conjugated molecules used in advanced materials and for the structures, spectra, and phase diagram of nature's most important crystals, such as ice and dry ice, or even molecular liquids such as water, all from first principles. These ab initio methods go beyond the usual workhorse of solid-state calculations (density-functional approximations) in the fidelity of simulations that can be achieved, and also use a novel stochastic algorithm, an embedded-fragment algorithm, or both to realize unprecedented scalability with respect to both molecular system and computer sizes.

INTRODUCTION
Computational chemists are facing an exciting prospect of being able to apply systematically accurate, and thus predictive, computational methods, the so-called ab initio methods, to large molecules, solids, and even liquids, which include nature's most important solids and liquids such as ice and liquid water as well as advanced materials used in optoelectronic devices. This is thanks to the combination of half a century of effort that numerous computational chemists put into refining theories and algorithms of molecular electronic structures, our recent breakthroughs (described below) allowing them to be applied to solids, and supercomputers reaching speeds that can make such calculations routine. Chemists, physicists, and materials scientists now anticipate a major transformation in the computational or quantitative aspects of solid-state physics and materials science via high-performance computing, that is, the same kind of transformation that occurred in molecular sciences but potentially with even greater impact on society. The theories and their accuracy go beyond the usual workhorse (density-functional approximations) of solid-state computation.

METHODS AND RESULTS
Our group has recently made two breakthroughs in computational chemistry for large molecules and solids (and liquids). One weds second-order and higher many-body perturbation theories (MP2, MP3, etc.) with quantum Monte Carlo (QMC) methods, enabling massively parallel, systematically accurate electronic structure calculations for larger molecules and solids [2,3]. It changes the usual matrix-algebra formulation of electronic structure theories, which is fundamentally non-scalable with respect to system or computer size, into the more scalable stochastic formulation.

The other breakthrough is a method that allows such high-level calculations to be applied to an infinitely extended molecular solid (either periodic or non-periodic) or molecular liquids by dividing them into fragments embedded in the electrostatic field of the solid. The fragments are then treated by well-developed molecular theories and software in a highly parallel algorithm. Our group used this method to study the structures, spectra, equation of state, thermodynamics (heat capacity, enthalpy, Gibbs free energy), Fermi resonance, phase transition, etc. of various phases of ice and dry ice, but at such high theoretical levels as MP2 and coupled-cluster theory [4-6].

With either or both of these, our group plans to predict a variety of properties of all known molecular phases of ice and dry ice to construct ab initio phase diagrams of these important solids. A successful outcome will greatly impact geochemistry, astrophysics, and planetary science, where probing high-pressure phases of the ices of atmospheric species on Earth or other planets is important but experimentally difficult and expensive.

We also plan to extend this to liquid water using ab initio Born-Oppenheimer molecular dynamics. We will also apply Monte Carlo MP2, MP3, and MP4 to predict the stacking interaction energies (important for morphology and thus functions) and optoelectronic parameters (ionization and electron attachment energies, band gaps) of conjugated organic molecular solids and supramolecular assemblies. They include solids that serve as bases of advanced materials such as bulk heterojunction organic solar cells, batteries, sensors, smart windows, field-effect transistors, and light-emitting diodes. The optoelectronic parameters are the quantities of prime importance in determining the solids' performance and functions, but the usual density-functional approximations are known to be poor for these properties. Here, our new method is uniquely useful and accurate. This portion of our research will broadly impact energy science and technology.

In only two months on Blue Waters, the Monte Carlo MP2 code has already been ported and tested using the small allocation provided to a graduate student (Matthew R. Hermes) as a prize of the student's ACS Graduate Student Award. Using this, we have run electron-correlated electron affinity calculations of C60, whose derivatives are used as an electron acceptor in many bulk heterojunction solar cells [1]. A postdoctoral researcher (Dr. Soohaeng Y. Willow) implemented the aforementioned embedded-fragment method at the MP2 level for direct ab initio molecular dynamics simulations of liquid water and a large water droplet with a halogen anion. This is being tested on Blue Waters and, if successful, will be a major breakthrough in computational chemistry. Dr. Willow has also implemented a massively parallel embedded-fragment program for solids on Blue Waters, and our group will commence ice phase diagram calculations at MP2 or higher levels.

WHY BLUE WATERS
Today's workhorse computational methods for solids (density-functional methods) are routine on a small computer cluster, but with limited accuracy. Therefore, the most meaningful use of Blue Waters in this area (high-pressure chemistry, materials science, geochemistry, etc.) is to fundamentally improve the accuracy rather than the system size (which is already formally infinite). In electronic structure theory, this means switching from density-functional methods to ab initio theories, which solve the fundamental equation of motion of chemistry rigorously using systematic approximations with controlled errors. As mentioned, conventional matrix algebra algorithms of ab initio theories are fundamentally non-scalable. The aforementioned Monte Carlo MP methods and embedded-fragment methods are among the few that may be realistically and usefully deployed on the large number of processors available on Blue Waters. Such calculations, in turn, may directly address or answer some outstanding scientific questions of solids or large optoelectronic materials, purely computationally and with sufficient accuracy.

PUBLICATIONS
Willow, S. Y., M. R. Hermes, K. S. Kim, and S. Hirata, Convergence acceleration of parallel Monte Carlo second-order many-body perturbation calculations using redundant walkers. J. Chem. Theory Comput., 9:10 (2013), pp. 4396-4402.
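The appeal of recasting a deterministic, matrix-algebra calculation as a stochastic one can be illustrated, loosely, with plain Monte Carlo estimation of a high-dimensional integral: samples are independent, so the work divides trivially across processors, and the statistical error falls off as one over the square root of the sample count. The toy integrand below is only an analogy for the Monte Carlo MP2 strategy, not its working equations.

import numpy as np

def mc_estimate(f, dim, n_samples, seed=0):
    """Monte Carlo estimate of the integral of f over the unit hypercube [0,1]^dim.

    Each sample is independent, so the work parallelizes trivially, and the
    statistical error decreases as 1/sqrt(n_samples).
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, dim))
    values = f(x)
    mean = values.mean()
    stderr = values.std(ddof=1) / np.sqrt(n_samples)
    return mean, stderr

# Toy integrand standing in for a correlation-energy-like sum (hypothetical).
f = lambda x: np.exp(-np.sum(x**2, axis=1))
for n in (10_000, 1_000_000):
    est, err = mc_estimate(f, dim=6, n_samples=n)
    print(f"n={n:>9d}  estimate={est:.5f}  statistical error={err:.5f}")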


NON-BORN–OPPENHEIMER EFFECTS BETWEEN ELECTRONS AND PROTONS

Allocation: BW Prof/0.24 Mnh
PI: Sharon Hammes-Schiffer1
1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
The quantum mechanical behavior of nuclei plays an important role in a wide range of chemical and biological processes. The inclusion of nuclear quantum effects and non-Born–Oppenheimer effects between nuclei and electrons in computer simulations is challenging. Our group has developed the nuclear-electronic orbital (NEO) method for treating electrons and select nuclei in a quantum mechanical manner on the same level using an orbital-based formalism. The NEO code uses a hybrid MPI/OpenMP protocol, but the calculations require a large number of processors and a substantial amount of memory. We have used Blue Waters to perform NEO calculations on systems in which all electrons and one proton are treated quantum mechanically and have tested approximate methods that enable the study of larger systems.

INTRODUCTION
The inclusion of nuclear quantum effects such as zero-point energy and tunneling in electronic structure calculations is important for the study of a variety of chemical systems, particularly those involving hydrogen transfer or hydrogen-bonding interactions. Moreover, non-adiabatic effects, also called non-Born–Oppenheimer effects, between electrons and certain nuclei are significant for many of these systems. In this case, the electrons cannot be assumed to respond instantaneously to the nuclear motions, and the concept of the nuclei moving on a single electronic potential energy surface is no longer valid. This type of non-adiabaticity has been shown to play a critical role in proton-coupled electron transfer (PCET) reactions, which are essential for a wide range of chemical and biological processes, including photosynthesis, respiration, enzyme reactions, and energy devices such as solar cells. The development of non-Born–Oppenheimer methods to enable accurate and efficient calculations of PCET reactions will impact many scientific endeavors, from drug design to the design of more effective catalysts for solar energy devices.

METHODS AND RESULTS
In the NEO approach, typically all electrons and one or a few protons are treated quantum mechanically, and a mixed nuclear-electronic time-independent Schrödinger equation is solved. To include the essential electron-proton correlation, we developed an explicitly correlated method, denoted NEO-XCHF. Although explicitly correlated methods have been shown to be highly accurate for model systems, they are computationally expensive and are currently intractable for larger systems of chemical interest. Recently, we proposed an alternative ansatz with the primary goal of improving computational tractability to enable the study of larger systems of chemical interest. In this approach, denoted NEO-RXCHF, only select electronic orbitals are explicitly correlated to the nuclear orbital(s) and certain exchange terms are approximated, thereby substantially decreasing the number of multi-particle integrals that must be calculated.

The computational bottleneck is the calculation of two-, three-, and four-particle integrals that arise from computing matrix elements of the explicitly correlated wave function over the mixed nuclear-electronic Hamiltonian. Since these integrals can be calculated completely independently from one another, we applied the OpenMP protocol, providing almost perfect scaling with respect to the number of threads. When considering calculations on larger proton-containing systems, two drawbacks with the shared-memory-based OpenMP model are of immediate concern: (1) the parallelization is restricted to the number of cores on a single machine, which is usually 32 at most, and (2) the calculations must be performed using the memory of a single machine. A hybrid MPI/OpenMP protocol obviates the need for all integrals to be stored simultaneously and allows the division of the calculation over different machines. This version of the code scales very well with respect to the number of MPI processes.

We performed initial NEO-RXCHF calculations on proton-containing systems on Blue Waters. We analyzed the nuclear densities of the protons and compared them to highly accurate grid-based densities. Our calculations illustrate that this approach can provide accurate descriptions of the protons that are treated quantum mechanically. We also have tested new approximate methods that will enable the study of larger proton-containing systems. Current work focuses on refining these approximate methods and investigating larger systems of chemical and biological interest. Our long-term objective is to use these non-Born–Oppenheimer methods to study PCET in molecular catalysts that are directly relevant to solar energy conversion.

WHY BLUE WATERS
Our in-house NEO code has been adapted to incorporate a hybrid MPI/OpenMP protocol, but the calculations require a large number of processors and a substantial amount of memory. The highly parallel computing system on Blue Waters is essential for the application of this approach to systems of interest, where the computational bottleneck is the embarrassingly parallelizable calculation of many integrals. Most importantly, the large memory requirements of storing these integrals render this problem impossible when a large number of nodes cannot be used simultaneously, as on other computer systems. As our code has demonstrated excellent scaling, we are able to directly benefit from using a large number of nodes simultaneously on Blue Waters with very little overhead.

FIGURE 1: The results of a NEO-RXCHF calculation performed on the hydrogen cyanide molecule. (Top) Correlated electron and proton molecular orbitals obtained from the NEO calculation. The electron orbital is shown in green and purple, indicating its two phases, and the proton orbital is shown in red. (Bottom) The proton density along the N-C-H axis, comparing the results of the NEO calculation (red dashed) to a numerically exact benchmark grid-based calculation (black solid).
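The parallel pattern described above, many independent multi-particle integrals distributed across ranks so that no single node must hold them all, can be outlined with mpi4py. The integral kernel below is a placeholder for the group's in-house code, and the round-robin partition is one simple way (among several) to split the work.

# Run with, e.g.:  mpiexec -n 8 python integrals_mpi.py   (sketch only)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def fake_integral(i):
    """Placeholder for one multi-particle integral; the real kernel is far more costly."""
    rng = np.random.default_rng(i)
    return float(np.sum(rng.random(1000) ** 2))

n_integrals = 100_000
# Static round-robin partition: each rank evaluates only its own share,
# so no rank needs to hold the full list of integrals in memory.
my_indices = range(rank, n_integrals, size)
my_partial = sum(fake_integral(i) for i in my_indices)

total = comm.reduce(my_partial, op=MPI.SUM, root=0)
if rank == 0:
    print("sum over all integrals:", total)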


THE MECHANISM OF THE SARCO/ENDOPLASMIC RETICULUM ATP-DRIVEN CALCIUM PUMP

Allocation: GLCPC/0.6 Mnh
PI: Benoît Roux1
Co-PI: Avisek Das1
1University of Chicago

EXECUTIVE SUMMARY:
Sarco/endoplasmic reticulum Ca2+-ATPase is an integral membrane protein that uses ATP hydrolysis as a source of free energy to pump two calcium ions per ATP molecule from the calcium-poor cytoplasm of the muscle cell to the calcium-rich lumen of the sarcoplasmic reticulum, thereby maintaining a ten-thousand-fold concentration gradient. Two major outstanding issues are the pathways of the ions to and from the transmembrane binding sites and a detailed understanding of the large-scale conformational changes among various functionally relevant states. We hope to shed some light on these important issues by simulating conformational transition pathways between experimentally known stable states. The optimal path is determined by the string method with swarms of trajectories, which involves running thousands of all-atom molecular dynamics trajectories that communicate at a regular interval. Our recent simulations on Blue Waters have revealed unprecedented molecular details of several key steps of the pumping cycle.

INTRODUCTION
Membrane proteins form an important class of biomolecules that are associated with the membrane dividing the inside of a cell (or a cellular compartment) and its environment. These proteins play essential roles in controlling the bi-directional flow of material and information. Our project aims to understand the function of an integral membrane protein called sarco/endoplasmic reticulum Ca2+-ATPase (SERCA) [1-3] that uses ATP hydrolysis as a source of free energy to pump two calcium ions per ATP molecule from the calcium-poor cytoplasm of the muscle cell to the calcium-rich lumen of the sarcoplasmic reticulum. This process is important for relaxation of skeletal muscle that is regulated by calcium ions, and a close analogue in the cardiac muscle is a therapeutic target.

Over the past few years a number of structural studies [1-6] have provided atomic resolution models for several important states along the pumping cycle. Two major outstanding issues are the pathways of ions from either side of the membrane to the transmembrane binding sites and a detailed description of the conformational changes that will elucidate how various parts of the protein communicate over fairly large distances in order to achieve coupled ATP hydrolysis and calcium transport. We intend to simulate the transition pathways between experimentally known end points to shed some light on these important issues.

METHODS AND RESULTS
The most important computational tool for studying the dynamics of large systems at biologically relevant temperatures is classical molecular dynamics (MD) simulation [7] with all-atom resolution. Conformational changes in large biomolecules are complex and slow, taking place on timescales that are far beyond the reach of brute-force MD simulations. For example, even a microsecond-long all-atom MD trajectory is not enough to connect two end points of a conformational transition. To overcome these problems we have employed a robust computational algorithm called the "string method with swarms of trajectories" [8,9]. For meaningful results, more than a thousand copies of the system and hundreds of iterations are required. Simulations were carried out using a modified version of NAMD 2.9 [10].

We have determined a transition pathway between the calcium-bound non-occluded (PDB ID: 1SU4) [1] and occluded (PDB ID: 1VFP) [4] states of SERCA. This transition is responsible for the occlusion process which prevents the escape of bound calcium ions in the transmembrane binding sites to the cytoplasmic side. The string was represented by 35 images, and for each image we used 32 trajectories to estimate the drifts (a total of 1,120 copies of the system). Each iteration involved 20 picoseconds of simulation; about 100 iterations were performed for the production calculation.

Before embarking on a full-scale production run, several scaling studies and trial runs were performed. The convergence of the string is determined by monitoring the image-wise distances from the initial string as well as all possible Frechet distances between strings corresponding to two different iterations. The Frechet distance between two strings is a global measure of similarity that takes into account the proper order of the images. The final converged string revealed the mechanism of the occlusion process in unprecedented molecular detail. The large-scale motions of the cytoplasmic domains induce small-scale motions of key hydrophobic side chains in the transmembrane helices. These side chain movements block the cytoplasmic ion pathway and lock the bound calcium ions inside the membrane.

WHY BLUE WATERS
To perform this calculation on SERCA (~290,000 atoms), a single job requires more than 6,000 nodes, which is more than the node count of an entire machine for many small- to medium-sized supercomputers. Therefore, the massively parallel architecture of Blue Waters played a crucial role in the successful implementation of our project.

FIGURE A: Crystal structures of the non-occluded (left, PDB ID: 1SU4) and occluded (right, PDB ID: 1VFP) states of sarco/endoplasmic reticulum Ca2+-ATPase. The protein has three large cytoplasmic domains: nucleotide binding domain N (green), phosphorylation domain P (blue), and actuator domain A (green). The ten transmembrane helices are color coded as follows: M1 (orange), M2 (pink), M3 (tan), M4 (ochre), M5 (violet), M6 (cyan), M7 (dark green), M8 (sky blue), M9 (yellow), M10 (silver).
FIGURE B: Convergence of string iterations monitored by image-wise distances. The image-wise distances from the initial string in the space of the collective variables (CV) are plotted for several iterations.
FIGURE C: Convergence of string iterations monitored by pairwise Frechet distances between strings at different iterations. The blue patch at the upper right corner indicates a converging string.
FIGURE D: Snapshots of selected images of the final string showing the occlusion of ions in the transmembrane binding sites by upward motions of M1 and M2 and bending of M1. Purple spheres are calcium ions and space fills show several key residues. Due to the motions of the above-mentioned side chains, the cytoplasmic ion pathway is blocked by hydrophobic side chains during the transition from the non-occluded state (image 1) to the occluded state (image 35).
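The Frechet distance used to monitor string convergence (figs. B and C) can be computed for discretized strings with the standard discrete Fréchet recursion, which, unlike an average of image-wise distances, respects the ordering of images along both strings. In the sketch below each image is a point in collective-variable space and plain Euclidean distance is used between images; the team's actual collective variables and metric are not specified here.

import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two strings of images.

    P, Q : arrays of shape (n_images, n_cv), each row an image in CV space.
    The recursion enforces the proper ordering of images along both strings.
    """
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise image distances
    ca = np.full((n, m), -1.0)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return ca[-1, -1]

# Two hypothetical 35-image strings in a 10-dimensional collective-variable space.
rng = np.random.default_rng(3)
string_a = rng.normal(size=(35, 10))
string_b = string_a + 0.1 * rng.normal(size=(35, 10))
print("Frechet distance:", discrete_frechet(string_a, string_b))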


ADVANCED COMPUTATIONAL METHODS AND NON-NEWTONIAN FLUID MODELS FOR CARDIOVASCULAR BLOOD FLOW MECHANICS IN PATIENT-SPECIFIC GEOMETRIES

Allocation: Illinois/0.042 Mnh
PI: Arif Masud1
Collaborators: JaeHyuk Kwack1
1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
We have developed novel numerical methods that are integrated with non-Newtonian constitutive models to simulate and analyze blood flow through patient-specific geometries in the cardiovascular system. The current Blue Waters project helped us to achieve two objectives: (1) Further explore the mathematical constructs of our hierarchical multiscale methods on the Blue Waters hardware architecture. Our new methods exploit the local resident memory on the processing nodes to make the macro elements "smart," thereby reducing the size of the global problem and minimizing data communication. (2) Extract flow physics in patient-specific models of the carotid artery via high-fidelity blood flow simulations.

METHODS AND RESULTS
Blood shows significant viscoelastic- and shear-rate-dependent response, and under arterial disease conditions such as stenosis and/or aneurysms of the artery, this viscoelastic feature becomes dominant. We have developed non-Newtonian models for blood that account for its viscoelastic response [1,2]. We have also developed hierarchical multiscale finite element methods with local and global (coarse and fine) description of the variational formulations that results in telescopic depth in scales [3,4]. This scale split leads to two coupled nonlinear systems: the coarse-scale and the fine-scale subsystems. The fine-scale solution is nonlinear and time dependent, and it is extracted from the fine-scale sub-problem via a direct application of the residual-free bubbles approach over element subdomains. The fine-scale solution is then variationally projected onto the coarse-scale space, and it leads to the hierarchical multiscale method with enhanced stabilization properties. The telescopic depth in scales helps reduce the size of the global problem, while increasing the "local-solves" that are cost effective on Blue Waters-type architectures. From computational and algorithmic perspectives, the hierarchical multiscale framework leads to substantially reduced global communication in favor of increased local computing. This feature is of tremendous benefit in massively parallel computing as it reduces communication costs across the partitioned subdomains.

We have verified the robustness of our method under diverse flow conditions that are found in the human vasculature, and have applied it to a patient-specific model of a carotid artery that suffers from stenosis and aneurysm. The geometric model of the carotid artery was constructed from MRI images. Figs. 1 and 2 show snapshots of velocity streamlines and time-varying viscosity of blood during a typical heart beat at the end of diastole and middle of systole, respectively. At the end of diastole (fig. 1), where the shear rate is at its minimum, substantial viscosity buildup can be seen. This effect is more pronounced in the aneurysm where, due to the ballooning effect, local velocity is lower than the mean flow velocity.

Through this study we found that in the middle of the diastole the Newtonian and the shear-rate dependent non-Newtonian models yield significantly different results, where the non-Newtonian model predicts higher shear stresses as compared to the Newtonian model. Consequently, using the Newtonian model in clinical applications can provide a non-conservative estimate that is not appropriate for patient care. This work was highlighted in the spring/summer 2014 issue of NCSA's Access magazine.

One of the intended applications of our effort is to employ these models for optimizing Ventricular Assist Devices (VADs) for patient-specific requirements, as well as possibly help in the design of these devices. We believe that the new methods and the associated computer programs that we have developed will help optimize the performance of VADs for patient-specific needs, and thus help in personalized medicine.

WHY BLUE WATERS
Blue Waters has a unique hardware architecture that provides substantial local memory at the level of the processing nodes. Our new computational methods take advantage of this resident memory and carry out high-intensity local calculations that make the elements "scale aware" or "smart." Accuracy on coarse meshes is the same as that obtained from highly refined meshes with standard finite element techniques. These new methods are mathematically robust and computationally economical, and these attributes have been verified on highly nonlinear and transient test problems.

PUBLICATIONS
Kwack, J., and A. Masud, A stabilized mixed finite-element method for shear-rate dependent non-Newtonian fluids: 3D benchmark problems and application to blood flow in bifurcating arteries. Comput. Mech., 53:4 (2014), pp. 751-776.

FIGURE 1: Flow field at end of diastole.
FIGURE 2: Flow field at middle of systole.
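The viscosity buildup at low shear rates noted above is the signature of a shear-thinning constitutive law. As one common example, and not necessarily the specific model used in this work, a Carreau–Yasuda law interpolates between a high zero-shear viscosity and a lower infinite-shear viscosity; the parameter values below are typical literature values for blood and are given only for illustration.

import numpy as np

def carreau_yasuda_viscosity(shear_rate, mu0=0.056, mu_inf=0.00345,
                             lam=3.313, a=2.0, n=0.3568):
    """Carreau-Yasuda shear-rate-dependent viscosity (Pa*s).

    mu0       : zero-shear-rate viscosity (dominates at low shear, e.g. end of diastole)
    mu_inf    : infinite-shear-rate viscosity
    lam, a, n : relaxation time (s) and shape parameters (illustrative values)
    """
    factor = (1.0 + (lam * shear_rate) ** a) ** ((n - 1.0) / a)
    return mu_inf + (mu0 - mu_inf) * factor

# Viscosity over shear rates spanning diastole (low) to systole (high).
for rate in (0.1, 1.0, 10.0, 100.0, 1000.0):       # 1/s
    print(f"shear rate {rate:7.1f} 1/s -> viscosity {carreau_yasuda_viscosity(rate):.5f} Pa*s")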

QUANTUM-CLASSICAL PATH INTEGRAL SIMULATION OF PROTON AND ELECTRON TRANSFER

Allocation: Illinois/0.035 Mnh
PI: Nancy Makri1
Co-PI: Thomas Allen1
1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Quantum mechanical effects play an essential role in chemical and biological processes and are important for understanding energy storage and designing novel materials. The simulation of quantum dynamical phenomena in condensed-phase and biological systems continues to present major challenges. We have been pursuing rigorous quantum-classical formulations based on Feynman's path integral formulation of quantum mechanics. Current work involves the first implementation of the quantum-classical path integral (QCPI) to the simulation of two paradigm chemical processes. Upon completion of this phase of the work, the QCPI calculations will offer quantitative results for the kinetics of the chosen proton and electron transfer processes, along with a detailed picture of the underlying mechanism, including the time scale of correlations and decoherence, the distinct roles of fast and sluggish solvent motions and associated quantum effects, as well as the importance of nonlinear solvent effects on the dynamics.

INTRODUCTION
Quantum mechanical effects play an essential role in chemical and biological processes and are important for understanding energy storage and designing novel materials. The major challenge in the development of quantum mechanical simulation algorithms stems from the non-local nature of quantum mechanics, which leads to exponential scaling of computational effort with the number of interacting particles.

For many processes of interest, quantum mechanical effects are vital in the treatment of a small number of degrees of freedom (e.g., those corresponding to a transferring electron or proton), while the remaining particles (solvent molecules or biological medium) could be adequately described via Newtonian dynamics. However, the traditional Schrödinger formulation of quantum mechanics (which is based on delocalized wave functions) does not lend itself to a combination with Newtonian trajectories (which are local in phase space) unless severe approximations are introduced. Thus, the simulation of quantum dynamical phenomena in condensed-phase and biological systems continues to present major challenges.

These processes are dominated by solvent effects, and some high-frequency vibrations of the solvent molecules are strongly coupled to the transferring particle. Thus, in addition to the quantum tunneling effects associated with the proton or electron, the QCPI calculations reveal substantial solvent-induced quantum mechanical effects on the dynamics of these reactions. Upon completion of the present phase of the work, the QCPI calculations will offer quantitative results for the kinetics of the chosen proton and electron transfer processes, along with a detailed picture of the underlying mechanism, including the time scale of correlations and decoherence, the distinct roles of fast and sluggish solvent motions and associated quantum effects, as well as the importance of nonlinear solvent effects on the dynamics.

METHODS AND RESULTS
We have been pursuing rigorous quantum-classical formulations based on Feynman's path integral formulation of quantum mechanics. The major appeal of this approach stems from the local, trajectory-like nature of the Feynman paths, which leads naturally to combined quantum-classical treatments that are free of approximations. Recent work has described a quantum-classical path integral (QCPI) methodology, which incorporates these ideas as well as several advances in the understanding of decoherence processes. QCPI treats a small subsystem using full quantum mechanics, while the effects of the environment are captured via standard molecular dynamics (MD) procedures. Since all quantum interference effects and their quenching by the solvent are accounted for at the most detailed (non-averaged) level, QCPI leads to correct branching ratios and product distributions, allowing simulation of important chemical and biological processes with unprecedented accuracy.

Current work involves the first implementation of QCPI to the simulation of two paradigm chemical processes. The first calculation (by Thomas Allen) is on the proton transfer reaction of the phenol-amine complex in methyl chloride. This system has been employed in many computational investigations using a variety of approximations. The accurate QCPI results will lead to an unambiguous picture of the proton transfer mechanism in this system and will serve as much-needed benchmarks. The second process (by Peter Walters and Tuseeta Banerjee) involves the ferrocene-ferrocenium charge transfer pair in benzene and hexane solvents. This system was chosen for its significance in electrochemistry.

Two widely used MD packages, NAMD and LAMMPS, have been combined with the QCPI software and adapted to yield trajectories subject to forces obtained using the proton or electron coordinates specified by the given quantum path. We have simulated the early dynamics of the transferring particles and obtained the time evolution of the state populations. A set of convergence tests is performed to determine the optimal parameters for longer time calculations.

WHY BLUE WATERS
Implementation of the QCPI methodology requires integration of a large number of classical trajectories, and accompanying Feynman paths of the quantum subsystem, from each initial condition sampled from the solvent density. Because the trajectories are independent and generally relatively short, it is possible to assign a single trajectory to each core within a given processor while maintaining computational efficiency. This multi-level approach has the benefit of minimizing communication time while maximizing concurrent processing, since related classical and quantum mechanical calculations are performed within the same node, where communication between processors should be much faster than if the information were more widely distributed. By exploiting the very mechanism of decoherence, we are able to circumvent the exponential proliferation of the number of trajectories with propagation time. The QCPI formulation is well suited to decomposition based on multi-level parallelism, and Blue Waters provides the ideal platform for its implementation.

FIGURE 1 (BACKGROUND): Phenol-amine proton transfer in methyl chloride.


INVESTIGATING LIGAND MODULATION OF GPCR CONFORMATIONAL LANDSCAPES

Allocation: NSF/3.13 Mnh
PI: Vijay S. Pande1
Collaborators: Morgan Lawrenz1; Diwakar Shukla1; Kai Kohlhoff1,2; Russ Altman1; Gregory Bowman1; Dan Belov2; David E. Konerding2

1Stanford University
2Google

EXECUTIVE SUMMARY:
We describe completed and ongoing research projects that utilize the extensive computing infrastructure of Blue Waters to study G-protein coupled receptors (GPCRs), key signaling proteins that are the targets of ~40% of commercially available drugs. Our workflow involves massively parallel biomolecular simulations that are aggregated by a statistical model that maps the connectivity and stability of receptor states. We have completed work on the GPCR β2 adrenergic receptor (β2AR) and used our approach to provide the first atomistic description of this receptor's ligand-modulated activation pathways [1]. We targeted intermediate states along these pathways with extensive small molecule virtual screens and demonstrated that these intermediates select unique ligand types that would be undiscovered without knowledge of the full activation pathway. These results show that our model of biomolecular simulations significantly contributes to understanding of both biological mechanisms and drug efficacy for GPCRs.

INTRODUCTION
G-protein coupled receptors (GPCRs) regulate a large variety of physiological processes by transmitting signals from extracellular binding of diverse ligands to intracellular signaling molecules, a property called functional selectivity. Central to this phenomenon is the well-accepted biophysical paradigm of conformational selection, in which a ligand selectively stabilizes a state from an ensemble. GPCRs provide rich examples of conformational selection. We hypothesize that there are general principles describing the dynamics of ligand-modulated GPCR activation. Detailed structural understanding of this mechanism for β2AR can inform studies of over 19 known subfamilies of class-A GPCRs, which share sequence and structural features. Because of their central role in cellular signaling, these proteins are also prominent drug targets. β2AR is implicated in type-2 diabetes, obesity, and asthma. Knowledge of ligand-modulated GPCR conformational dynamics can improve our understanding of drug efficacy at these receptors and allow development of more effective structure-based drug design approaches.

METHODS AND RESULTS
We generated an extensive dataset from molecular dynamics simulations of β2AR and identified kinetically stable states along ligand-modulated activation pathways. We used massively parallel small molecule docking to target these Markov state model (MSM) states and demonstrate that our approach incorporates our rich structural data into a drug discovery workflow that could lead to drugs that interact more closely with diverse receptor states, leading to overall increased efficacy and specificity.

MSM states from high flux activation pathways identified in this study were targeted with small molecule docking of a database of β2AR agonists, antagonists, and decoys with the program Surflex. For both agonists and antagonists, docking to the MSM states along activation pathways gives high values for area under the receiver operating characteristic (ROC) curve, which evaluates selection of true ligands from decoys. These results are a statistically significant improvement over results from docking to the active and inactive crystal structures and to randomly selected snapshots from long-time-scale, agonist-bound β2AR deactivation simulations.

Next, we show that docking to MSM states expands the chemical space of our docking results, an essential advantage in docking approaches. Top-scoring ligands for each MSM state were selected for 3D shape- and chemistry-overlap calculations using the program ROCS. These calculations give Tanimoto scores for overlap of ligands, which were used to cluster the results and revealed a diversity of chemotypes that are highly ranked, or enriched, differentially by MSM states along the activation pathways. We can show examples of chemotypes that would have been undiscovered by virtual screen docking to the crystal structures alone or without knowledge of the full activation pathway. These results highlight MSMs as a tool for picking functional intermediate GPCR states that have different estimated affinities for known ligand chemotypes. Information on this correspondence between ligand type and receptor conformation may be beneficial for future drug design efforts and can predict ligands that may preferentially bind and isolate rare intermediate conformations of receptors.

WHY BLUE WATERS
The extensive architecture of Blue Waters gives us two unique advantages: (1) the resource allows generation of many long-timescale, equilibrated molecular dynamics simulations, which can give useful information on receptor dynamics and also be used synchronously with distributed computing resources to generate exceptionally sampled conformational data for our biological targets; and (2) the resource allows us to rigorously test a new computational protocol for identifying new small molecules that requires massively parallel calculations. For β2AR, we have 140 unique, kinetically stable receptor states that we wanted to target with a rigorous small-molecule docking algorithm, which takes approximately two minutes per molecule. This would take ~80,000 hours to run all molecules on a single CPU core for a full initial screen of our data, but takes only a week using available Blue Waters cores. A similar savings in time applied to all the molecule overlap calculations for the clustering approach. This quick evaluation of our computational protocol was integral to the demonstration of a new computational approach.

FIGURE 1: Markov state models (MSMs) of receptor dynamics allow discovery of intermediate receptor states that aid discovery of new drug classes. A network representation of a ten-state MSM from molecular dynamics simulations shows examples of inactive (R), active (R*), and kinetically stable intermediate states (R') connected by arrows that are weighted by transition probability. We show that these intermediate states preferentially bind to different ligand types, which may be fruitful for future drug design efforts and can give testable predictions for ligands that may isolate rare intermediate conformations of receptors.

PUBLICATIONS
Kohlhoff, K. J., D. Shukla, M. Lawrenz, G. R. Bowman, D. E. Konerding, D. Belov, R. B. Altman, and V. S. Pande, Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nature Chemistry, 6:1 (2014), pp. 15-21.
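The ROC evaluation described above amounts to asking how well a state's docking scores rank true ligands ahead of decoys. The sketch below computes that area under the ROC curve with the rank-sum (Mann-Whitney) identity; the score lists are invented placeholders rather than Surflex output, so it illustrates only the metric, not the study's screening protocol.

# Minimal sketch (placeholder data): AUC of a docking score's ability to rank
# true ligands above decoys, via the Mann-Whitney rank-sum identity.
def roc_auc(ligand_scores, decoy_scores):
    """Probability that a random ligand outscores a random decoy (ties count 1/2)."""
    wins = 0.0
    for ls in ligand_scores:
        for ds in decoy_scores:
            if ls > ds:
                wins += 1.0
            elif ls == ds:
                wins += 0.5
    return wins / (len(ligand_scores) * len(decoy_scores))

# Hypothetical docking scores for one MSM state (higher = better predicted binding).
ligands = [9.1, 8.4, 7.9, 7.2, 6.8]
decoys = [7.5, 6.9, 6.1, 5.8, 5.5, 5.2, 4.9]

print(f"AUC = {roc_auc(ligands, decoys):.2f}")   # 1.0 = perfect separation, 0.5 = random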


SEQUENCE SIMILARITY NETWORKS FOR THE PROTEIN "UNIVERSE"

Allocation: Illinois/0.625 Mnh
PI: John A. Gerlt1
Collaborators: Daniel Davidson1; Boris Sadkhin1; David Slater1; Alex Bateman2; Matthew P. Jacobson3

1University of Illinois at Urbana-Champaign
2European Bioinformatics Institute
3University of California, San Francisco

EXECUTIVE SUMMARY:
The Enzyme Function Initiative (EFI) is a large-scale collaborative project supported by the National Institutes of General Medical Sciences (U54GM093342-04) [1]. The EFI is devising strategies and tools to facilitate prediction of the in vitro activities and in vivo metabolic functions of uncharacterized enzymes discovered in genome projects. The Blue Waters allocation enables generation of a library of pre-computed sequence similarity networks for all Pfam families and clans of proteins in the UniProtKB database that will be provided to the scientific community. We are using Blue Waters to calculate all-by-all BLAST sequence relationships as well as statistical analyses of the BLAST results.

INTRODUCTION
The UniProtKB database contains 56,555,610 sequences (release 2014_5; 14-May-2014). The majority of the entries are obtained from genome sequencing projects, with the rationale that knowledge of the complete set of proteins/enzymes encoded by an organism will allow its biological/physiological capabilities to be understood. However, if many of the proteins/enzymes have uncertain or unknown functions, researchers cannot capitalize on the investments in genome projects.

Bioinformatics tools are integral to the EFI's strategies. The EFI's goal is to provide to the biological community an "on-demand" library of sequence similarity networks [2] for the 14,831 families and 515 clans in the Pfam database (Release 27.0; March 2013) and to update this library on a minimum three-month refresh cycle. Sequence similarities are quantitated by the BLAST e-values between pairs of sequences. Our recent activities address the necessary experimentation to determine the most efficient pipeline for performing the BLASTs and the downstream statistical analyses so that the process can be automated and performed on a minimum three-month refresh cycle, keeping the library of SSNs current as the sequence databases are updated.

METHODS AND RESULTS
The EFI and the University of Illinois at Urbana-Champaign's Institute for Genomic Biology (IGB) collaborated on development of scripts that allow facile generation of sequence similarity networks (SSNs) for protein families using sequences from the Pfam sequence database. However, the BLAST calculations are computationally intensive, so the user must wait hours to days for these to complete. Also, instead of requiring users to initiate SSN generation, the EFI and IGB are using the petascale capabilities of Blue Waters to calculate the BLASTs as well as statistical analyses for all Pfam families and clans so that a complete library of pre-computed SSNs can be provided using an EFI-supported webserver.

The Blue Waters allocation was awarded in mid-October 2013. Since that time, we have been evaluating how to run and optimize two pieces of code, as well as the Perl scripts that control the flow of data and collect the results:

BLAST v2.x (blastall) is a widely used program developed by the National Center for Biotechnology Information (NCBI). Blastall is not efficiently multi-threaded, so we are running as many single-threaded processes per node as there are integer cores available.

CD-HIT is a sequence clustering algorithm [3] that we use both to generate merged datasets of input sequences and to post-process sequences flagged as being similar to each other by blastall.

The majority of the Pfam families and clans contain <200,000 sequences, so the BLAST and downstream statistics calculations are straightforward and efficient. However, for the largest families and clans (the largest contains ~3 million sequences) the computation time increases exponentially with the number of sequences, and the RAM requirements become more demanding because the number of sequences and BLAST results becomes large.

BLAST results totaling 33 TB of data have been obtained for virtually all of the 515 Pfam clans and 14,831 Pfam families. We now are addressing the problem of filtering the BLAST data to remove redundant pairs. With the BLAST results, we will perform the statistical analyses that are required for the user to choose parameters for generating the SSNs.

WHY BLUE WATERS
The project uses an embarrassingly parallel computing model to perform the BLAST analyses. Because of (1) the scale of the computations (the number and sizes of Pfam families and clans and the number of sequences in each family and clan) and (2) the time sensitivity of the production of the output relative to database updates, only a resource of the scale of Blue Waters can perform the job in a reasonable time frame.

FIGURE 1: Sequence similarity network (SSN) for PF08794, the proline racemase family, displayed with a BLAST e-value threshold of 10^-110 (50% sequence identity). The colors are used to distinguish predicted isofunctional clusters.
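To make the SSN construction step concrete, the sketch below filters pairwise BLAST hits by an e-value threshold and accumulates the surviving pairs as undirected edges. The input file name, the assumption of the standard 12-column tabular BLAST output (e-value in the 11th column), and the cutoff are placeholders, not details of the EFI pipeline.

# Minimal sketch (assumed inputs): turn pairwise BLAST hits into an SSN edge list.
# Assumes the standard 12-column tabular BLAST output with the e-value in column 11.
from collections import defaultdict

E_VALUE_CUTOFF = 1e-110          # placeholder threshold, cf. Figure 1

def build_ssn(blast_tabular_path):
    edges = set()
    with open(blast_tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            query, subject, evalue = fields[0], fields[1], float(fields[10])
            if query != subject and evalue <= E_VALUE_CUTOFF:
                edges.add(tuple(sorted((query, subject))))   # undirected edge
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

# Example usage (hypothetical file name):
# ssn = build_ssn("PF08794.blast.tab")
# print(len(ssn), "nodes")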


BENCHMARKING THE HUMAN VARIATION CALLING PIPELINE

Allocation: Illinois/0.05 Mnh
PI: Christopher J. Fields1
Collaborators: Liudmila S. Yafremava1; Gloria Rendon1; C. Victor Jongeneel1

1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
Researchers increasingly use genome sequencing for medical purposes. The standard pipeline used for this purpose presently carries a high computational cost, and if hospitals begin to sequence every arriving patient, the total ongoing storage requirement would reach the petabyte scale. This is a major hurdle for the routine implementation of genome-based individualized medicine.

We scaled the pipeline to hundreds of genomes and documented the disk space and compute requirements for performing daily production runs as expected in a personalized medicine clinic. Our results are comparable to those published by other groups, but Blue Waters has enough nodes and disk space to actually demonstrate the scalability, as opposed to estimating it from a smaller number of genomes.

INTRODUCTION
Researchers increasingly use genome sequencing for medical purposes. A key application is to generate a genomic variation profile—the set of differences between a patient's genome and that of a population average, as well as between a tumor and normal tissue from a single person. This profile is then used to predict the patient's susceptibility to disease, their response to various therapies, or the underlying causes of extant pathologies.

The standard pipeline used for this purpose presently carries a high computational cost. This is a major hurdle for the routine implementation of genome-based individualized medicine. Still, it is believed that whole-genome sequencing and analysis will become the standard of care in medicine within the next few years, requiring that this pipeline be run for every patient who comes through a hospital's doors on any given day. Our project seeks to explore the computational bottlenecks, performance limitations, and tradeoffs between speed and accuracy that arise at scale.

METHODS AND RESULTS
Accuracy vs. speed
In collaboration with members of the University of Illinois CompGen initiative, we created a comprehensive suite of synthetic sequence data, which sweeps across a number of parameters that can affect the accuracy and robustness of the pipeline. Specifically, we generated over 200 datasets, including synthetic whole exomes and genomes, by simulating a range of sequencing error rates, base substitution transition probabilities, and read lengths. Known synthetic variants were introduced into each dataset, which we compared with the workflow output. We used these synthetic data to probe the limits of accuracy of the workflow and determine the conditions under which the variants could no longer be reliably detected. The accuracy can sometimes be rescued by using alternative software tools, which are more comprehensive at the expense of longer compute time. We documented this continuum of tradeoffs between accuracy and speed and determined the extent to which specificity and sensitivity are affected by noise in the input data. As a reality check, we compared the results with pipeline runs on the freely available human exomes from the 1000 Genomes Project.

Using an embedded launcher
The standard bioinformatics software is not parallelized across cluster nodes, is sometimes single threaded, and is intended to run on embarrassingly parallel data. At the same time, the procedure itself is fairly complex, with multiple split/merge points, numerous conditional forks, and several data entry points into the workflow. When running on sets of several hundred genomes at once, the pipeline can generate over 1,000 separate jobs, which can result in low job priority and break the per-user limit on the number of jobs in the queue. We collaborated with the Blue Waters support team to use an embedded job launcher, which submits a single PBS job for each of the three major embarrassingly parallel blocks of the workflow, reserves the required number of nodes, and launches the pipeline jobs within it.

Blue Waters enables production-grade scaling
Using the embedded job launcher described above, we scaled the pipeline to hundreds of genomes and documented the disk space and compute requirements for performing daily production runs as expected in a personalized medicine clinic. Running the variant calling on one whole human genome uses up to 20 nodes concurrently for 2-4 days, depending on depth of sequencing. If hospitals begin to sequence every arriving patient, a compute facility can expect to deal with hundreds of datasets daily. Because each run takes several days, the runs on data from different days will overlap, tying up several thousand nodes every day. During the course of this production, ~250 TB of storage has to be dedicated to the input data alone. The output data, including intermediary files, uses 10 times as much disk space, provided that the data are deleted immediately upon production and delivery.

We took advantage of the opportunity to perform comprehensive profiling of those tools while running the above accuracy benchmarking on the synthetic data files. When using the pipeline at the expected production rate, we believe that the bandwidth of the file system will be a big contributor to performance degradation when scaling up.

Emerging collaborations
In the effort to gain as much speed as possible, we collaborated with Novocraft, the Malaysian company that developed Novoalign, the widely used and currently most accurate short read aligner. An MPI version of Novoalign was developed on Blue Waters and yields a four-fold speedup in the first phase of the pipeline.

We collaborated with scientists from the Department of Electrical and Computer Engineering, NCSA, and the CompGen initiative at the University of Illinois at Urbana-Champaign to perform an in-depth study of performance, robustness, and scalability of the pipeline. With their involvement, we have identified the following exciting new directions that should help harden and speed up the pipeline even further:
• Investigate the possibilities of running the entire pipeline in RAM
• Investigate the benefits of converting the data streaming parts of the pipeline into a map/reduce framework
• Eliminate the bottleneck resulting from merge steps along the workflow by rewriting that step with better, multi-threaded code

WHY BLUE WATERS
Our results are comparable to those published by other groups [1], but Blue Waters has enough nodes and disk space to actually demonstrate the scalability, as opposed to estimating it from a smaller number of genomes. If hospitals begin to sequence every arriving patient, the total ongoing storage requirement would reach the petabyte scale, provided that the data are deleted immediately upon production and delivery. No other supercomputing center thus far has directly documented its ability to sustain such work.
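The embedded-launcher idea above (one batch allocation that fans many independent per-sample tasks out across its reserved nodes) can be sketched generically as follows. The pipeline command, sample list, and concurrency limit are placeholders, and the node placement of each task (on Blue Waters, typically via the Cray aprun launcher) is deliberately left out because it is site-specific.

# Minimal sketch (hypothetical commands): run many independent per-sample
# pipeline tasks concurrently inside a single batch allocation.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT = 20                                  # placeholder concurrency limit
samples = [f"genome_{i:03d}" for i in range(200)]    # placeholder sample IDs

def run_one(sample):
    # Placeholder command; on a Cray system each call would normally be wrapped
    # in the site launcher so that the task lands on its own compute node.
    cmd = ["bash", "run_pipeline.sh", sample]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return sample, result.returncode

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = [pool.submit(run_one, s) for s in samples]
        for future in as_completed(futures):
            sample, rc = future.result()
            status = "ok" if rc == 0 else f"failed ({rc})"
            print(f"{sample}: {status}")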

SOCIAL SCIENCE, ECONOMICS, & HUMANITIES

146 Benchmarking Computational Strategies for Applying Quality Scoring and Error Modeling Strategies to Extreme-Scale Text Archives

148 An Extreme-Scale Computational Approach to Redistricting Optimization

150 Policy Responses to Climate Change in a Dynamic Stochastic Economy

BENCHMARKING COMPUTATIONAL STRATEGIES FOR APPLYING QUALITY SCORING AND ERROR MODELING STRATEGIES TO EXTREME-SCALE TEXT ARCHIVES

Allocation: Illinois/0.05 Mnh
PI: Scott Althaus1
Collaborators: Loretta Auvil1; Boris Capitanu1; David Tcheng1; Ted Underwood1

1University of Illinois at Urbana-Champaign

EXECUTIVE SUMMARY:
At present, the most important barrier to extreme-scale analysis of unstructured data within digitized text archives is the uncertain Optical Character Recognition (OCR) accuracy of scanned page images. We used Blue Waters to evaluate OCR errors on the HathiTrust Public Use Dataset, which is the world's largest corpus of digitized library volumes in the public domain, consisting of 3.2 million zipped files totaling nearly 3 TB. The primary aim is to develop error quality scoring strategies that can enhance the volume-level metadata managed by the HathiTrust Research Center with probabilistic quality metrics. A secondary aim is to develop a JVM distributed computing solution based on Scala and Akka for Blue Waters. Our task was to calculate a metric that indicates the quality of text on each page of a given volume. An initial run took about 8 hours to score the entire dataset at a processing rate of approximately 110 volumes per second.

INTRODUCTION
Researchers in the humanities and social sciences often analyze unstructured data in the form of images and text that have been scanned and digitized from non-digital sources. For this type of research, the most important barrier to conducting extreme-scale analysis of unstructured data is the uncertain quality of the textual representations of scanned images derived from Optical Character Recognition (OCR) techniques. Only when OCR quality is high can automated analysis safely rely on natural language processing and machine learning methods to correctly extract information from unstructured data. Assessing the quality of unstructured data and developing strategies for correcting errors in such data are therefore among the most important research tasks for unlocking the potential for extreme-scale analysis of unstructured data in historical archives. Our project is using Blue Waters to detect, score, and correct OCR errors in the HathiTrust Public Use Dataset, which is the world's largest corpus of digitized library volumes in the public domain, consisting of over 1.2 billion scanned pages of OCR text.

METHODS AND RESULTS
The scale of our research problem and its embarrassingly parallel nature demanded a distributed solution. We chose to leverage Akka since it claimed to provide a high-performance JVM-based framework featuring simple concurrency and distribution. When engineering an Akka-based solution for running in a distributed environment, the Akka cluster extension is often used because it "provides a fault-tolerant decentralized peer-to-peer based cluster membership service with no single point of failure or single point of bottleneck… using gossip protocols and an automatic failure detector" [1]. The cluster extension provides simpler coordination between distributed actors and makes it easier to adopt a "let it crash" [2] philosophy for enhancing system robustness.

Our first research prototype using this framework evaluated the OCR quality of documents from the HathiTrust corpus by tokenizing the text and examining the tokens against a set of rules defining characteristics expected of regular words (e.g., whether the tokens consist of exclusively alphabetic characters, or exhibit a "normal" character frequency distribution). Each page of the ~3.2 million volumes was scored based on two quality metrics.

Fig. 1 shows the architecture of this initial prototype framework. In this prototype we used a single Work Producer whose job was to traverse the directory structure where the documents were stored and send each document path as a work item to the Work Coordinator. The worker Executor was programmed to accept as input the document path and run the necessary analysis summarized above to produce the OCR quality score measures. Given the size of our documents and the complexity of the analysis, an initial performance evaluation of the prototype on a subset of the data showed that optimal performance was reached when using 192 to 224 workers. Beyond that, performance plateaued, likely due to an I/O bottleneck. We are planning several optimizations to alleviate this bottleneck.

The final run on the entire dataset was completed in 8 hours and 5 minutes and used a total of eight compute nodes. The average processing speed was about 110 documents per second.

WHY BLUE WATERS
With a corpus of this size, the computational demands of using natural language processing, machine learning, or rule-based scoring strategies would severely tax the capabilities of other HPC platforms. Only Blue Waters offers the computational scale required to carry out the necessary quality scoring and error correction strategies in a timely fashion.

FIGURE 1: Akka framework for running share-nothing, highly parallel, distributed processing on Blue Waters. Each circle represents an actor. Actors exchange messages to collaborate on a task, which in our case represents some work that needs to be performed. In general, the Work Producers take input data and package it into independent units of work to be sent to the Work Coordinator, which in turn stores those in a queue. The Work Coordinator assigns one work unit at a time to a Worker. Executing the work inside the Worker actors is done via an internal asynchronous Executor actor that allows the worker itself to remain responsive to external queries (e.g., status updates). This allows the Worker to detect and handle Executor failures that may arise from the execution of the work while remaining available to execute future work (spawning a new Executor, if necessary).
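The rule-based scoring idea (tokenize a page, then measure how word-like the tokens look) can be illustrated with the toy scorer below. The two rules shown, the fraction of purely alphabetic tokens and a coarse comparison against typical English letter frequencies, are stand-ins for the project's actual rule set, and the sample strings are invented.

# Minimal sketch (illustrative rules only): score a page of OCR text by how
# "word-like" its tokens are.
import re
from collections import Counter

# Rough English letter frequencies (percent) for a coarse plausibility check.
ENGLISH_FREQ = {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
                "s": 6.3, "h": 6.1, "r": 6.0}

def score_page(text):
    tokens = re.findall(r"\S+", text.lower())
    if not tokens:
        return 0.0, 0.0
    # Rule 1: fraction of tokens made up of alphabetic characters only.
    alpha_fraction = sum(t.isalpha() for t in tokens) / len(tokens)
    # Rule 2: deviation of the page's letter distribution from typical English.
    letters = Counter(c for t in tokens for c in t if c.isalpha())
    total = sum(letters.values()) or 1
    deviation = sum(abs(100.0 * letters[c] / total - f) for c, f in ENGLISH_FREQ.items())
    freq_score = max(0.0, 1.0 - deviation / 100.0)
    return alpha_fraction, freq_score

clean = "the quick brown fox jumps over the lazy dog"
noisy = "t#e qu1ck br0wn f0x jum9s 0v3r th3 l@zy d0g"
print(score_page(clean))   # clean page scores higher on both rules
print(score_page(noisy))   # garbled OCR scores lower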


AN EXTREME-SCALE COMPUTATIONAL APPROACH TO REDISTRICTING OPTIMIZATION

Allocation: Illinois/0.6 Mnh
PI: Shaowen Wang1,2
Collaborators: Wendy K. Tam Cho1,2; Yan Liu1,2

1University of Illinois at Urbana-Champaign
2National Center for Supercomputing Applications

EXECUTIVE SUMMARY:
The redistricting problem, or drawing electoral maps, amounts to arranging a finite number of indivisible geographic units into a smaller number of larger areas (i.e., districts). The primary goal of our work is to develop and implement computational capabilities that can generate and objectively evaluate alternative redistricting schemes and compare them to one another based on compliance with voting laws as well as with various notions of "fairness" and adherence to democratic principles. Our approach is implemented by enhancing a scalable parallel genetic algorithm (PGA) library. The improved library has scaled well in tests on Blue Waters with minimal impact on the numerical performance of our PGA. Using massive computing power on Blue Waters, deeper understanding of the problem space and solution search strategies applied to large redistricting problems will lead to a series of innovations in the development of scalable heuristic strategies and algorithmic operations.

INTRODUCTION
The redistricting problem [1], or drawing electoral maps, amounts to arranging a finite number of indivisible geographic units into a smaller number of larger areas (i.e., districts). Redistricting [2] has attracted significant interest in political science, geographic information science, and operations research. Due to the limited computational capability of existing solutions, the study of redistricting problems at fine spatial scales (e.g., census block) has been difficult, if not impossible. Our research proposes a scalable computational approach to address this fundamental challenge and provides an open research tool for solving fine-scale redistricting problems. The primary goal is to develop and implement computational capabilities that can generate and objectively evaluate alternative redistricting schemes and compare them to one another based on compliance with voting laws as well as with various notions of "fairness" and adherence to democratic principles. Political and geographical constraints include:
• Competitiveness: districts with similar proportions of different partisans, resulting in competitive elections;
• Contiguity: all parts of every district must be physically connected;
• Compactness: no overly irregular shaped districts;
• Equi-population: to make sure votes are as equally weighted as possible;
• Preservation of communities of interest and local political subdivisions: keep identifiable communities together;
• Incumbent protection: ensure that incumbents are not pitted against one another, which, if excessive, may disrupt the political process; and
• Minority districts: comply with the Voting Rights Act.

METHODS AND RESULTS
Finding an exact optimal solution to redistricting is computationally intractable. In our research, we develop a heuristic algorithm by combining attention to the idiosyncrasies of the specific redistricting problem with a genetic algorithm (GA) [3] to produce nearly optimal redistricting maps. Our computational approach considers the development of both redistricting-specific strategies to reduce the number of iterations needed to identify an optimal solution and scalable algorithms for exploiting the massive computational power provided by high-performance computing resources such as Blue Waters.

Our approach is implemented by enhancing a scalable parallel genetic algorithm (PGA) library [4] developed by the authors. The PGA library runs a large number of independent PGA processes simultaneously with a migration strategy that exchanges solutions between any two directly connected PGA processes. This library eliminates the global synchronization cost at the migration step, which would increase dramatically when using a large number of processors. The library scaled up to 16,384 processors on the retired Ranger supercomputer at the Texas Advanced Computing Center.

On Blue Waters, we further improved scalability by resolving a major bottleneck in our PGA library. The original asynchronous migration strategy developed for this library scaled well on 16,000 processor cores with fine-tuned PGA parameters, but scaled poorly on a larger number of faster processors and caused MPI communication layer failure. The outgoing message buffer controlled by MPI experienced buffer overflow. We extended our library to manage the sending buffer better at the application level and explicitly specified the degree of overlap between GA iterations and message sending. The improved library has scaled well in tests on Blue Waters with minimal impact on the numerical performance of our PGA. The communication cost stayed consistently around 0.17% when the number of cores increased from 8,192 to 65,536.

Solving the redistricting problem using GA requires a series of spatial GA operators to consider the spatial distribution of political units and districts in order to satisfy explicit or implicit spatial constraints. These spatial operations are much more computationally intensive than the conventional GA operators. As a result, a new set of spatial GA operators, including initial solution generation, crossover, and mutation, is developed for redistricting. The slow-down from higher computational intensity increases as the number of processor cores involved increases. More efficient heuristic strategies for solution space searching, such as path relinking, are being considered to allow more intensified local search and diversified global search.

WHY BLUE WATERS
Exploiting high-performance computing resources such as Blue Waters is crucial for achieving our goal of gaining a better understanding of the redistricting process and its impact on democratic rule. For instance, dividing 55 blocks into six districts, the number of possibilities is 8.7x10^39, a formidable number, and the magnitude of the problem rises exponentially with the number of geographic units. Given the large number of census blocks in each state (e.g., 259,777 in Minnesota; 710,145 in California), our goal would be inconceivable given the prohibitively large sizes of these problem instances. Using massive computing power on Blue Waters, deeper understanding of the problem space and solution search strategies applied to large redistricting problems will lead to a series of innovations in the development of scalable heuristic strategies and algorithmic operations.

FIGURE 1: Spatial genetic algorithm (GA) operators for redistricting problem solving.

PUBLICATIONS
Liu, Y. Y., and S. Wang, A Scalable Parallel Genetic Algorithm for the Generalized Assignment Problem. Parallel Comput., (in press).
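The 8.7x10^39 figure quoted above matches the number of ways to partition 55 labeled units into six nonempty groups, a Stirling number of the second kind, before contiguity, equi-population, or any of the other constraints listed earlier are imposed. The short sketch below reproduces that count from the standard recurrence; it is an illustration, not part of the authors' PGA library.

# Minimal sketch: count the ways to partition n geographic units into k
# nonempty districts (Stirling number of the second kind), ignoring
# contiguity, equi-population, and the other constraints listed above.
def stirling2(n, k):
    # Recurrence: S(n, k) = k * S(n-1, k) + S(n-1, k-1), with S(0, 0) = 1.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

print(f"S(55, 6) = {stirling2(55, 6):.2e}")   # ~8.7e+39, the figure cited above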


POLICY RESPONSES TO CLIMATE CHANGE IN A DYNAMIC STOCHASTIC ECONOMY

Allocation: GLCPC/0.385 Mnh
PI: Lars Hansen1
Collaborators: Kenneth Judd2; Yongyang Cai2; Simon Scheidegger3

1University of Chicago
2Hoover Institution
3University of Zürich

EXECUTIVE SUMMARY:
We developed tools for evaluating alternative policy responses to challenges posed by future climate change in models that merge uncertainties related to both economic factors and their interaction with the climate. We extended past work on computational methods for solving dynamic programming problems, dynamic games, and empirical estimation of economic models.

We have combined numerical methods for quadrature, approximation, and optimization problems to develop efficient approximations of the Bellman operator for up to 20-dimensional discrete-time dynamic programming. We have also developed methods for computing all Nash equilibria of dynamic games. Our code scales nicely up to 160,000 processes for realistic problems.

One initial substantive result is that the optimal carbon tax is three to four times larger than usually estimated when we incorporate empirically justified specifications for the social desire to reduce risk [1].

METHODS AND RESULTS
This project aims to develop tools that determine optimal policies and outcomes of competitive processes for use in economics research. Dynamic optimization problems reduce to solving Bellman equations arising from dynamic programming problems. In general, solving Bellman equations grows rapidly in complexity as the dimension rises. However, we identified mathematical properties of a wide range of economics problems (e.g., portfolio allocation and optimal greenhouse gas mitigation) which allow us to solve high-dimensional problems. These properties concern the shape and smoothness of the solution to the Bellman equation. Smoothness allows for efficient approximation of multi-dimensional functions and efficient quadrature methods for evaluating integrals contained in the Bellman operator; concavity is a strong qualitative property that we use to stabilize the numerical procedures [2,3]. Many problems we solved have nine continuous dimensions. Only a machine with the scale of Blue Waters can solve such complex problems in reasonable time.

Solving for all Nash equilibria of dynamic games requires solving a fixed-point mapping on finite lists of convex polytopes in n-dimensional Euclidean space, where n is the number of players. Again, such problems can only be solved on Blue Waters-like machines.

These tools open up the possibility to explore dynamic optimization and dynamic strategic problems in a quantitative manner never before feasible. (Economists generally confine their games to have few moves and a small number of players, or focus on finding only one equilibrium even though they know there are many equilibria.) Our dynamic programming tools have shown that the social cost of greenhouse gases is a stochastic process with great variation, and that the appropriate policy needs to be flexible to deal with unexpected events. We have shown that the chance of needing aggressive policies is much larger than usually thought.

This is the first time that economics problems of this nature and scale have been solved. It is clear that economics problems need and can efficiently use high-power computing.

WHY BLUE WATERS
Modeling realistic dynamic stochastic economics problems requires massive computational power. Each value function iteration (or more generally, the application of a contraction mapping in a Banach space) consists of millions of nonlinear programming problems that can be solved in parallel. Our code allocates the tasks so that all cores are busy at almost all times. The Blue Waters architecture is an excellent fit for this problem.
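For reference, the discrete-time Bellman equation referred to above has the generic textbook form below, where u is a per-period payoff, β ∈ (0,1) a discount factor, and x' the next state; the notation is standard dynamic programming, not the specific climate-economy model of this project.

\[
(TV)(x) \;=\; \max_{a \in \mathcal{A}(x)} \Big\{\, u(x,a) \;+\; \beta\, \mathbb{E}\big[\, V(x') \mid x, a \,\big] \Big\},
\qquad V_{k+1} \;=\; T V_k .
\]

Because T is a contraction with modulus β < 1 in the sup norm, value function iteration converges to the unique fixed point from any starting guess, and the maximization at each approximation node within an iteration is independent of the others, which is the source of the millions of parallel nonlinear programming problems described under Why Blue Waters.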


[7] Heerikhuisen, J., E. J. Zirnstein, H. O. [6] Winteler, C., et al., Magnetorotationally Wuebbles Funsten, N. V. Pogorelov, and G. P. Zank, Driven Supernovae as the Origin of Early [1] Dennis, J., et al., CAM-SE: A scalable The Effect of New Interstellar Medium -Pa Galaxy r-process Elements?. Astrophys. J. spectral element dynamical core for the REFERENCES rameters on the Heliosphere and Energetic Lett., 750:1 (2012), L22. Community Atmosphere Model. Int. J. Neutral Atoms from the Interstellar Bound- [7] Woosley, S. E., Gamma-ray bursts from High Perform. Comput. Appl., 26 (2012), pp. ary. Astrophys. J., 784 (2014), 73. stellar mass accretion disks around black 74-89. [8] Pogorelov, N. V., et al., Modeling Solar holes. Astrophys. J., 405:1 (1993), pp. 273- [2] Bacmeister, J. T., et al., Exploratory high- SPACE SCIENCE Wind Flow with the Multi-Scale Fluid- 277. Resolution Climate Simulations using the Brunner Kinetic Simulation Suite. in Numerical [8] Metzger, B. D., D. Giannios, T. A. Thomp- Community Atmosphere Model (CAM). J. [1] Spergel, D. N., et al., Three-Year Wilkinson Modeling of Space Plasma Flows: ASTRO- son, N. Bucciantini, and E. Quataert, The Climate, 27:9 (2014), pp. 3073-3099. Microwave Anisotropy Probe (WMAP) NUM-2012, N.V. Pogorelov, E. Audit, and protomagnetar model for gamma-ray [3] Liang, X.-Z., and F. Zhang, The Cloud- Observations: Implications for Cosmology. G.P. Zank, Eds., (Astronomical Society of bursts. Mon. Not. R. Astron. Soc., 413:3 Aerosol-Radiation (CAR) ensemble model- Astrophys. J. Suppl., 170:377 (2007), 91 pp. the Pacific Conf. Ser. 474, San Francisco, (2011), pp. 2031-2056. ing system. Atmos. Chem. Phys., 13 (2013), [2] www.sdss.org 2013), pp. 165-170. [9] Mösta, P., et al., Magnetorotational core- pp. 8335-8364. [3] Bullock, J. S., Notes on the Missing Satel- [9] Pogorelov, N. V., S. T. Suess, S. N. Boro- collapse supernovae in three dimensions. [4] Liang, X.-Z., et al., Regional Climate- lites Problem. in Canary Islands Winter vikov, R. W. Ebert, D. J. McComas, and Astrophys. J., 785 (2014), L29. Weather Research and Forecasting Model School of Astrophysics on Local Group G. P. Zank, Three-dimensional Features (CWRF). Bull. Am. Meteorol. Soc., 93 Cosmology, D. Martinez-Delgado, Ed. of the Outer Heliosphere due to Coupling Di Matteo (2012), pp. 1363-1387. [1] Hopkins, P. F., A General Class of Lagrang- (Cambridge Univ. Press, Cambridge, 2010), between the Interstellar and Interplanetary ian Smoothed Particle Hydrodynamics 38 pp. Magnetic Fields. IV. Solar Cycle Model Jordan Methods and Implications for Fluid Mixing [4] montage.ipac.caltech.edu Based on Ulysses Observations. Astrophys. [1] Spudich, P., and B. S. J. Chiou, Directivity in Problems. Mon. Not. R. Astron. Soc., 428 [5] www.astromatic.net/software/stiff J., 772 (2013), 2. NGA earthquake ground motions: Analysis (2013), pp. 2840-2856. [6] www.astromatic.net/software/sextractor [10] Borovikov, S. N., J. Heerikhuisen, and N. V. using isochrones theory. Earthquake Spec- Pogorelov, Hybrid Parallelization of Adap- [2] Gnedin, N. Y., K. Tassis, and A. V. Kravtsov, tra, 24 (2008), pp. 279-298. Stein tive MHD-Kinetic Module in Multi-Scale Modeling Molecular Hydrogen and Star [2] Wang, F., and T. H. Jordan, Comparison of [1] Nagashima, K. et al., Interpreting the Fluid-Kinetic Simulation Suite. in Numeri- Formation in Cosmological Simulations. probabilistic seismic hazard models using Helioseismic and Magnetic Imager (HMI) cal Modeling of Space Plasma Flows: AS- Astrophys. J., 697:1 (2009), 55. 
averaging-based factorization. Bull. Seismol. Multi-Height Velocity Measurements. Solar TRONUM-2012, N.V. Pogorelov, E. Audit, [3] Battaglia, N., H. Trac, R. Cen, and A. Loeb, Soc. Am., (2014), doi: 10.1785/0120130263. Physics, (2014), doi: 10.1007/s11207-014- & G.P. Zank, Eds., (Astronomical Society Reionization on Large Scales. I. A Para- [3] Abrahamson, N. A., and W. J. Silva, Sum- 0543-5. of the Pacific Conf. Ser. 474, San Francisco, metric Model Constructed from Radiation- mary of the Abrahamson & Silva NGA 2013), pp. 219-224. hydrodynamic Simulations. Astrophys. J., Ground-Motion Relations. Earthquake Pogorelov 776:2 (2013), 81. Spectra, 24 (2008), pp. 67-97. [1] Pogorelov, N. V., et al., Unsteady processes O'Shea/Norman in the vicinity of the heliopause: Are we [1] enzo-project.org Campanelli in the LISM yet?. AIP Conf. Proc., 1539 [1] Zilhão, M., and S. C. Noble, Dynamic fish- (2013a), pp. 352-355. Nagamine eye grids for binary black hole simulations. PHYSICS & [2] Borovikov, S. N., and N. V. Pogorelov, [1] Aumer et al. 2013 Classical Quant. Grav., 31:6 (2014), 065013. ENGINEERING Voyager 1 near the Heliopause. Astrophys. J. [2] Nunez et al. in prep [2] Mundim, B. C., H. Nakano, N. Yunes, M. Lett., 783 (2014), L16. [3] Choi et al. 2014 Campanelli, S. C. Noble, and Y. Zlochower, Sugar [3] Zirnstein, E. J., J. Heerikhuisen, G. P. Zank, [4] Hu et al. 2014 Approximate Black Hole Binary Spacetime [1] Bazavov, A., et al., Leptonic decay-constant via Asymptotic Matching. Phys. Rev. D, 89 N. V. Pogorelov, D. J. McComas, and M. I. ratio fK+/fpi+ from lattice QCD with physical Desai, Charge-exchange Coupling between Diener (2014), 084008. light quarks. Phys. Rev. Lett., 110 (2013), [1] Soderberg, A. M., et al., Relativistic ejecta Pickup Ions across the Heliopause and its [3] Zilhão, M., S. C. Noble, M. Campanelli, and 172003. from X-ray flash XRF 060218 and the rate Effect on Energietic Neutral Hydrogen Flux. Y. Zlochower, Resolving the Relative Influ- [2] Bazavov, A., et al., Charmed and strange of cosmic explosions. Nature, 442:7106 Astrophys. J., 783 (2014), 129. ence of Strong Field Spacetime Dynamics pseudoscalar meson decay constants from (2006), pp. 1014-1017. [4] Luo, X., M. Zhang, H. K. Rassoul, N. V. and MHD on Circumbinary Disk Physics. HISQ simulations. Proc. 31st Int. Symp. Lat- [2] Drout, M. R., et al., The First Systematic Pogorelov, and J. Heerikhuisen, Galactic Phys. Rev. D, (in preparation). tice Field Theory (LATTICE2013), Mainz, Study of Type Ibc Supernova Multi-band Cosmic-Ray Modulation in a Realistic Germany, July 29-August 3, 2013. Light Curves. Astrophys. J., 741:2 (2011), 97. Global Magnetohydrodynamic Heliosphere. [3] Bazavov, A., et al., Determination of |V | [3] Mikami, H., Y. Sato, T. Matsumoto, and T. us Astrophys. J., 764 (2013), 85. from a lattice-QCD calculation of the K  Hanawa, Three-dimensional Magnetohy- GEOSCIENCE [5] Zank, G. P., J. Heerikhuisen, B. E. Wood, π ν semileptonic form factor with physical drodynamical Simulations of a Core-Col- l N. V. Pogorelov, E. Zirnstein, and D. J. Mc- Valocchi quark masses. Phys. Rev. Lett., 112 (2014), lapse Supernova. Astrophys. J., 683:1 (2008), Comas, Heliospheric Structure: The Bow [1] Zhang, C., et al., Liquid CO Displacement 112001. pp. 357-374. 2 Wave and the Hydrogen Wall. Astrophys. J., of Water in a Dual-Permeability Pore Net- [4] Winter, F. T., M. A. Clark, R. G. Edwards, [4] Kuroda, T., and H. Umeda, Three-dimen- 763 (2013), 20. work Micromodel. Environ. Sci. Technol., and B. 
Joo, A Framework for Lattice QCD sional Magnetohydrodynamical Simula- [6] Heerikhuisen, J., N. Pogorelov, and G. 45:17 (2011), p. 7581-7588. Calculations on GPUs. Proc. 28th IEEE Int. tions of Gravitational Collapse of a 15 M Zank, Simulating the Heliosphere with sun Parallel Distrib. Process. Symp., Phoenix, Star. Astrophys. J. Supp., 191:2 (2010), pp. Kinetic Hydrogen and Dynamic MHD Di Girolamo Ariz., May 19-23, 2014. 439-466. Source Terms. in Numerical Modeling of [1] Di Girolamo, L., et al. An evaluation of the [5] Scheidegger, S., R. Käppeli, S. C. White- Space Plasma Flows: ASTRONUM-2012, MODIS liquid cloud drop effective radius Klimeck house, T. Fischer, and M. Liebendörfer, The N.V. Pogorelov, E. Audit, & G.P. Zank, Eds., product for trade wind cumulus clouds: [1] Steiger, S., M. Povolotskyi, H. Park, T. influence of model parameters on the pre- (Astronomical Society of the Pacific Conf. implications for data interpretation and Kubis, and G. Klimeck, NEMO5: A Parallel diction of gravitational wave signals from Ser. 474, San Francisco, 2013), pp. 195-200. building climatologies. (in preparation). Multiscale Nanoelectronics Modeling Tool. stellar core collapse. Astron. Astrophys., 514 (2010), A51. 152 153 BLUE WATERS ANNUAL REPORT 2014

IEEE Trans. Nanotechnol., 10 (2011), pp. ics. Comput. Math. Appl., 58:5 (2009), pp. Distortion in Thin-Slab Casting.Proc. 8th Koric 1464-1474. 975-986. European Continuous Casting Conf., Graz, [1] Gupta A., S. Koric S, and T. George, Sparse [2] Fonseca, J., et al., Efficient and realistic [4] Joshi, A. S., P. K. Jain, J. A. Mudrich, and Austria, June 23-26, 2014. Linear Solvers on Massively Parallel device modeling from atomic detail to the E. L. Popov, Pratham: Parallel Thermal [12] Liu, R., and B. G. Thomas, Model of Machines. Proc. ACM/IEEE Conf. High nanoscale. J. Comput. Electron., 12:4 (2013), Hydraulics Simulations using Advanced Transient Multiphase Turbulent Flow with Perform. Comput. SC 2009, Portland, Ore., pp. 592-600. Mesoscopic Methods. Am. Nuclear Soc. Surface Level Fluctuations and Application November 14-20, 2009. [3] MAGMA: Matrix Algebra on GPU and Winter Meeting, San Diego, Calif., Novem- to Slide-Gate Dithering in Steel Continu- [2] Koric S., Q. Lu, and E. Guleryuz. Evaluation Multicore Architectures, available at http:// ber 11-15, 2012. ous Casting. (in preparation). of Massively Parallel Linear Sparse Solvers icl.cs.utk.edu/magma/index.html [5] Jain, P. K., A. Tentner, and R. Uddin, A [13] Jun, K., and B. G. Thomas, Prediction on Unstructured Finite Element Meshes. [4] Lopez Sancho, M. P., J. M. Lopez Sancho, Lattice Boltzmann Framework for the and Measurement of Turbulent Flow and Comput. Struct., 141 (2014), pp. 19-25. and J. Rubio, Quick iterative scheme for the Simulation of Boiling Hydrodynamics in Particle Entrapment in Continuous Slab [3] Gupta, A., “WSMP: Watson Sparse matrix calculation of transfer matrices: Application BWRs. Amer. Nuclear Soc. Annual Meeting, Casting with Electromagnetic-Braking package (Part-I: Direct solution of sym- to Mo (100). J. Phys. F Met. Phys., 14 (1984), Anaheim, Calif., June 8-12, 2008. Flow-Control Mold. (in preparation). metric sparse systems)” (Tech. Rep. RC pp. 1205-1215. 21866, IBM, T. J. Watson Research Center, [5] http://www.mcs.anl.gov/petsc (Balay, S., et Thomas Wagner Yorkton Heights, N. Y.; 2013). al., 2013) [1] https://wiki.engr.illinois. [1] QWalk: http://qwalk.org [6] Luisier, M., T. Boykin, G. Klimeck, and W. edu/display/cs519sp11/ Fichtner, Atomistic nanoelectronic device Lance+C.+Hibbeler+Project+Page BIOLOGY & CHEMISTRY engineering with sustained performances [2] Li, C., and B. G. Thomas, Thermo-Mechan- COMPUTER SCIENCE up to 1.44 PFlop/s. Proc. ACM/IEEE Conf. ical Finite-Element Model of Shell Behavior & ENGINEERING Voth High Perform. Comput. SC 2011, Seattle, in Continuous Casting of Steel. Metall. Ma- [1] Dama, J. F., A. V. Sinitskiy, M. McCullagh, Wash., November 12-18, 2011. ter. Trans. B, 35B:6 (2004), pp. 1151-1172. Wilde J. Weare, B. Roux, A. R. Dinner, and G. A. [3] Koric, S., and B. G. Thomas, Efficient [1] Raicu, I., Many-task computing: bridging Voth, The Theory of Ultra-Coarse-Graining. Draayer thermo-mechanical model for solidification the gap between high throughput comput- 1. General Principles. J. Chem. Theory Com- [1] Dytrych, T., et al., Collective Modes in processes. Int. J. Numer. Meth. Eng., 66:12 ing and high-performance computing. put., 9:5 (2013), pp. 2466-2480. Light Nuclei from First Principles. Phys. Rev. (2006), pp. 1955-1989. ProQuest, (2009 dissertation). [2] Grime, J. M. A., and G. A. Voth, Highly Lett., 111 (2013), 252501. [4] Koric, S., L. C. Hibbeler, R. Liu, and B. G. [2] Kreider, S. J., et al., Design and Evalua- Scalable and Memory Efficient Ultra- [2] Dytrych, T., K. D. 
Sviratcheva, C. Bahri, Thomas, Multiphysics Modeling of Metal tion of the GeMTC Framework for GPU Coarse-Grained Molecular Dynamics J. P. Draayer, and J. P. Vary, Evidence for Solidification on the Continuum Level. enabled Many-Task Computing. Proc. 23rd Simulations. J. Chem. Theory Comput., 10:1 Symplectic Symmetry in Ab Initio No-Core Numer. Heat Tr. B-Fund., 58:6 (2010), pp. ACM Int. Symp. High-Perform. Parallel and (2014), pp. 423-431. Shell Model Results for Light Nuclei. Phys. 371-392. Distr. Comput., Vancouver, BC, Canada, Mankin Rev. Lett., 98 (2007), 162503. [5] Koric, S., B. G. Thomas, and V. R. Voller, June 23-27, 2014. [3] Dytrych, T., K. D. Sviratcheva, J. P. Draayer, Enhanced Latent Heat Method to Incor- [1] Poehlsgaard, J., and S. Douthwaite, The [3] Wilde, M., M. Hategan, J. M. Wozniak, B. bacterial ribosome as a target for antibiot- C. Bahri, and J. P. Vary, Ab initio symplectic porate Superheat Effects Into Fixed-Grid Clifford, D. S. Katz, and I. Foster, Swift: A no-core shell model. J. Phys. G, 35:12 (2008), Multiphysics Simulations. Numer. Heat Tr. ics. Nat. Rev. Microbiol., 3:11 (2005), pp. language for distributed parallel scripting. 870-881. 123101. B-Fund., 57:6 (2010), pp. 396-413. Parallel Comput., 37:9 (2011), pp. 633–652. [4] Dreyfuss, A. C., K. D. Launey, T. Dytrych, [6] Koric, S., L. C. Hibbeler, and B. G. Thomas, [2] Yonath, A., Antibiotics targeting ribosomes: J. P. Draayer, and C. Bahri, Hoyle state and Explicit coupled thermo-mechanical finite Gropp resistance, selectivity, synergism and cel- rotational features in Carbon-12 within a element model of steel solidification.Int. J. [1] Amer, A., H. Lu, S. Matsuoka, and P. Balaji, lular regulation. Annu. Rev. Biochem., 74 no-core shell-model framework. Phys. Lett. Numer. Meth. Eng., 78:1 (2009), pp. 1-31. Characterizing Lock Contention in Mul- (2005), pp. 649-679. B, 727:4-5 (2013), pp. 511-515. [7] Liu, R., B. G. Thomas, L. Kalra, T. Bhat- tithreaded MPI Implementations. IEEE/ [3] Laing, R., B. Waning, A. Gray, N. Ford, and [5] Dytrych, T., K. D. Launey, and J. P. Draayer, tacharya, and A. Dasgupta, Slidegate ACM SC 2014, New Orleans, La., Novem- E. ’t Hoen, 25 years of the WHO essential Symmetry-adapted no-core shell model. Dithering Effects on Transient Flow and ber 16-21, 2014 (in preparation). medicines lists: progress and challenges. (McGraw-Hill Education – Research Re- Mold Level Fluctuations. 2013 AISTech [2] Balaji, P., D. Buntinas, D. Goodell, W. Lancet, 361:9370 (2003), pp. 1723-1729. view). Conf. Proc., Pittsburgh, Pa., May 6-9, 2013. Gropp, and R. Thakur, Fine-grained mul- [4] Lindenberg, M., S. Kopp, and J. Dressman. [8] Thomas, B. G., Q. Yuan, S. Mahmood, R. tithreading support for hybrid threaded Classification of orally administered drugs Aluru Liu, and R. Chaudhary, Transport and MPI programming. Int. J. High Perform. on the World Health Organization Model [1] Kiselev, A. V., and Y. I. Yashin, Analytical Use Entrapment of Particles in Steel Continu- Comput., 24:1 (2010), pp. 49-57. list of Essential Medicines according to the of Gas-Adsorption Chromatography. in ous Casting. Metall. Mater. Trans. B, 45:1 [3] Gropp, W., and R. Thakur, Thread safety in biopharmaceutics classification system.Eur. Gas-Adsorption Chromatography (Springer, (2014), pp. 22-35. an MPI implementation: Requirements and J. Pharm. Biopharm., 58:2 (2004), pp. 265- New York, 1969), pp. 146-228. [9] Chaudhary, R., B. G. Thomas, and S. P. analysis. Parallel Comput., 33:9 (2007), pp. 278. 
Vanka, Effect of electromagnetic ruler 595–604. [5] Mankin, A., Macrolide myths. Curr. Opin. Uddin braking (EMBr) on transient turbulent flow [4] Thakur, R., and W. Gropp, Test suite for Microbiol., 11:5 (2008), pp. 414-421. [1] Jain, P., and R. Uddin, Artificial Interface [6] http://www.who.int/drugresistance/docu- Lattice Boltzmann (AILB) Model for Simu- in continuous slab casting using large eddy evaluating performance of multithreaded simulations. Metall. Mater. Trans. B, 43:3 MPI communication. Parallel Comput., ments/surveillancereport/en/ lation of Two-Phase Dynamics, Transac- [7] Vazquez-Laslop, N., C. Thum, and A. S. tions of the ANS, 103 (2010). (2012), pp. 532-553. 35:12 (2009), pp. 608–617. [10] Singh, R., B. G. Thomas and S. P. Vanka, [5] Zhu, X., H. Lu, and P. Balaji, Asychronous Mankin, Molecular mechanism of drug de- [2] Jain, P., E. Popov, G. L. Yoder, and R. Uddin, pendent ribosome stalling. Mol. Cell, 30:2 Parallel Simulation of 2D/3D Flows Using Effects of a magnetic field on turbulent flow Graph500 Breadth First Search on Dis- in the mold region of a steel caster. Metall. tributed Memory Systems. IEEE Int. Conf. (2008), pp. 190-202. Lattice Boltzmann Models (LBM), Transac- [8] Tu, D., G. Blaha, P. B. Moore, and T. A. tions of the ANS, 103 (2010). Mater. Trans. B, 44:5 (2013), pp. 1-21. Parallel Distr. Systems (ICPADS), Hsinchu, [11] Hibbeler, L. C., B. G. Thomas, R. C. Schim- Taiwan, December 16-19, 2014 (in prepara- Steitz, Structures of MLSBK antibiotics [3] Jain, P. K., A. Tentner, and R. Uddin, A bound to mutated large ribosomal subunits Lattice Boltzmann Framework to Simulate mel, and H. H. Visser, Simulation and tion). Online Measurement of Narrow Face Mold provide a structural explanation for resis- Boiling Water Reactor Core Hydrodynam- tance. Cell, 121:2 (2005), pp. 257-270.


[9] Dunkle, J. A., L. Xiong, A. S. Mankin, and [3] Fudenberg, G., G. Getz, M. Meyerson, and L. ing microbial in silico evolution. BMC [2] Rezaei-Ghaleh, N., M. Blackledge, and J. H. D. Cate, Structures of the Escherichia A. Mirny, High order chromatin architec- Bioinformatics, 13:S13 (2012), doi: M. Zweckstetter, Intrinsically disordered coli ribosome with antibiotics bound near ture shapes the landscape of chromosomal 10.1186/1471-2105-13-S10-S13. proteins: from sequence and conforma- the peptidyl transferase center explain alterations in cancer. Nat. Biotechnol., 29:12 [7] Mozhayskiy, V., and I. Tagkopoulos, Guided tional properties toward drug discovery. spectra of drug action. Proc. Natl. Acad. Sci. (2011), pp. 1109-13. evolution of in silico microbial popula- Chembiochem., 13:7 (2012), pp. 930-950. USA, 107:40 (2010), pp. 17152-17157. [4] Dixon, J. R., S. Selvaraj, F. Yue, A. Kim, Y. tions in complex environments accelerates [3] Bullerjahn, J. T., S. Sturm, L. Wolff, and K. [10] Schlunzen, F., et al., Structural basis for the Li, Y. Shen, M. Hu, J. S. Liu, and B. Ren, evolutionary rates through a step-wise Kroy, Monomer dynamics of a wormlike interaction of antibiotics with the peptidyl Topological domains in mammalian genomes adaptation. BMC Bioinformatics, 13:S10 chain. Europhys. Lett., 96:4 (2011), 48005. transferase centre in eubacteria. Nature, identified by analysis of chromatin interac- (2012), doi:10.1186/1471-2105-13-S10-S10. [4] Soranno, A., et al., Quantifying internal 413:6858 (2001), pp. 814–821. tions. Nature, 485:7398 (2012), pp. 376-80. [8] Mozhayskiy, V., and I. Tagkopoulos, Micro- friction in unfolded and intrinsically [11] Hansen, J. L., J. A. Ippolito, N. Ban, P. Nis- [5] Nora, E. P., et al., Spatial partitioning of the bial evolution in vivo and in silico: methods disordered proteins with single-molecule sen, P. B. Moore, and T. A. Steitz, The struc- regulatory landscape of the X-inactivation and applications. Integr. Biol., 5:2 (2012), pp. spectroscopy. Proc. Natl. Acad. Sci. USA, tures of four macrolide antibiotics bound to centre. Nature, 485:7398 (2012), pp. 381-5. 262-77. 109 (2012), pp. 17800-17806. the large ribosomal subunit. Mol. Cell, 10:1 [6] Sexton, T., et al., Three-dimensional folding [9] Dragosits, M., V. Mozhayskiy, S. Quinones- [5] Debes, C., M. Wang, G. Caetano-Anollés, (2002), pp. 117-128. and functional organization principles of the Soto, J. Park, and I. Tagkopoulos, Evolu- and F. Gräter, Evolutionary optimization [12] Otaka, T., and A. Kaji. Release of (oligo) Drosophila genome. Cell, 148:3 (2012), pp. tionary potential, cross-stress behavior, and of protein folding. PLoS Comput. Biol., 9:1 peptidyl-transfer-RNA from ribosomes by 458-72. the genetic basis of acquired stress resis- (2013), e1002861. erythromycin-A. Proc. Natl. Acad. Sci. USA, [7] Goldman, R. D., Y. Gruenbaum, R. D. Moir, tance in Escherichia coli. Mol. Syst. Biol., [6] Caetano-Anollés, G., M. Wang, and D. 72:7 (1975), pp. 2649-2652. D. K. Shumaker, and T. P. Spann, Nuclear 9:643 (2013), doi:10.1038/msb.2012.76. Caetano-Anollés, Structural phylogenom- [13] Tenson, T., M. Lovmar, and M. Ehrenberg, lamins: building blocks of nuclear architec- [10] Carrera, J., R. E. Curado, J. Luo, N. Rai, ics retrodicts the origin of the genetic code The mechanism of action of macrolides, lin- ture. Gene Dev., 16:5 (2002), pp. 533-47. A. Tsoukalas, and I. Tagkopoulos, An and uncovers the evolutionary impact of cosamides and streptogramin b reveals the [8] Zullo, J. 

INDEX

A
Aksimentiev, Aleksei 109. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.922 Mnh; 0.24 Mnh. Allocation Type: Illinois; BW Prof.
Althaus, Scott 146. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.05 Mnh. Allocation Type: Illinois
Aluru, Narayana R. 84. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.45 Mnh. Allocation Type: Illinois

B
Bernholc, Jerzy 80. Affiliation: North Carolina State University. Allocation Size: 3.2 Mnh. Allocation Type: NSF
Brunner, Robert J. 26. Affiliation: National Center for Supercomputing Applications. Allocation Size: 0.025 Mnh. Allocation Type: Illinois
Buchta, David 78. Affiliation: University of Illinois at Urbana-Champaign (w/Freund). Allocation Size: 0.075 Mnh. Allocation Type: Illinois

C
Caetano-Anollés, Gustavo 120. Affiliation: Heidelberg Institute for Theoretical Studies, Germany. Allocation Size: 0.05 Mnh. Allocation Type: Illinois
Campanelli, Manuela 44. Affiliation: Rochester Institute of Technology. Allocation Size: 1.09 Mnh. Allocation Type: NSF
Cann, Isaac 110. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.25 Mnh. Allocation Type: Illinois
Cheatham, Thomas 124. Affiliation: University of Utah. Allocation Size: 14 Mnh. Allocation Type: NSF
Chemla, Yann 118. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.15 Mnh. Allocation Type: Illinois

D
Diener, Peter 40. Affiliation: Louisiana State University. Allocation Size: 3.81 Mnh. Allocation Type: NSF
Di Girolamo, Larry 54. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.5 Mnh; 0.24 Mnh. Allocation Type: Illinois; BW Prof.
Di Matteo, Tiziana 42. Affiliation: Carnegie Mellon University. Allocation Size: 2.63 Mnh. Allocation Type: NSF
Draayer, J. P. 76. Affiliation: Louisiana State University. Allocation Size: 0.5 Mnh. Allocation Type: GLCPC

E
Elghobashi, Said 73. Affiliation: University of California, Irvine. Allocation Size: 1.24 Mnh. Allocation Type: NSF
Ertekin-Taner, Nilufer 114. Affiliation: Mayo Clinic in Jacksonville. Allocation Size: 0.031 Mnh. Allocation Type: Private sector

F
Fields, Christopher J. 142. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.05 Mnh. Allocation Type: Illinois
Freund, Jonathan B. 78. Affiliation: University of Illinois at Urbana-Champaign (w/Buchta). Allocation Size: 0.075 Mnh. Allocation Type: Illinois

G
Gerlt, John A. 140. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.625 Mnh. Allocation Type: Illinois
Gropp, William 100. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.245 Mnh; 0.613 Mnh. Allocation Type: BW Prof.; NSF

H
Hammes-Schiffer, Sharon 130. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.24 Mnh. Allocation Type: BW Prof.
Hansen, Lars 150. Affiliation: University of Chicago. Allocation Size: 0.385 Mnh. Allocation Type: GLCPC
Hirata, So 128. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.12 Mnh. Allocation Type: BW Prof.

J
Jordan, Thomas H. 60. Affiliation: Southern California Earthquake Center. Allocation Size: 3.4 Mnh. Allocation Type: NSF

K
Karimabadi, Homayoun 52. Affiliation: University of California, San Diego; SciberQuest. Allocation Size: 11.1 Mnh. Allocation Type: PRAC
Klimeck, Gerhard 74. Affiliation: Purdue University. Allocation Size: 1.24 Mnh; 0.313 Mnh. Allocation Type: NSF; GLCPC
Koric, Seid 102. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.002 Mnh. Allocation Type: Private sector

L
Luthey-Schulten, Zaida 126. Affiliation: University of Illinois at Urbana-Champaign; Beckman Institute. Allocation Size: 0.592 Mnh. Allocation Type: Illinois

M
Makri, Nancy 136. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.035 Mnh. Allocation Type: Illinois
Mankin, Alexander 108. Affiliation: University of Illinois at Chicago. Allocation Size: 0.33 Mnh. Allocation Type: GLCPC
Masud, Arif 134. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.042 Mnh. Allocation Type: Illinois
McFarquhar, Greg M. 48. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.698 Mnh. Allocation Type: Illinois

N
Nagamine, Kentaro 38. Affiliation: University of Nevada, Las Vegas. Allocation Size: 3.13 Mnh. Allocation Type: NSF
Norman, Michael 36. Affiliation: University of California, San Diego; San Diego Supercomputer Center (w/O'Shea). Allocation Size: 7.8 Mnh. Allocation Type: NSF

O
O'Shea, Brian 36. Affiliation: Michigan State University (w/Norman). Allocation Size: 7.8 Mnh. Allocation Type: NSF

P
Pande, Vijay S. 138. Affiliation: Stanford University. Allocation Size: 3.13 Mnh. Allocation Type: NSF


Pogorelov, Nikolai V. 34. Affiliation: University of Alabama in Huntsville. Allocation Size: 0.78 Mnh. Allocation Type: NSF

Q
Quinn, Thomas 32. Affiliation: University of Washington. Allocation Size: 9.38 Mnh. Allocation Type: NSF

R
Reed, Patrick 62. Affiliation: Cornell University. Allocation Size: 4.53 Mnh. Allocation Type: NSF
Roux, Benoît 132. Affiliation: University of Chicago. Allocation Size: 0.6 Mnh. Allocation Type: GLCPC

S
Schleife, Andre 68. Affiliation: Lawrence Livermore National Laboratory; University of Illinois at Urbana-Champaign. Allocation Size: 0.245 Mnh. Allocation Type: BW Prof.
Schulten, Klaus 122. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 30 Mnh; 0.24 Mnh. Allocation Type: NSF; BW Prof.
Stein, Robert 30. Affiliation: Michigan State University. Allocation Size: 6.5 Mnh. Allocation Type: NSF
Sugar, Robert L. 70. Affiliation: University of California, Santa Barbara. Allocation Size: 60.1 Mnh. Allocation Type: NSF

T
Tagkopoulos, Ilias 112. Affiliation: University of California, Davis. Allocation Size: 0.003 Mnh. Allocation Type: NSF
Tajkhorshid, Emad 116. Affiliation: Beckman Institute for Advanced Science and Technology; University of Illinois at Urbana-Champaign. Allocation Size: 0.686 Mnh. Allocation Type: Illinois
Thomas, Brian G. 88. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.1 Mnh. Allocation Type: Illinois
Tomko, Karen 98. Affiliation: Ohio Supercomputer Center. Allocation Size: 0.319 Mnh. Allocation Type: GLCPC

U
Uddin, Rizwan 86. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.05 Mnh. Allocation Type: Illinois

V
Valocchi, Albert J. 50. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.05 Mnh. Allocation Type: Illinois
Voth, Gregory A. 106. Affiliation: University of Chicago. Allocation Size: 5.07 Mnh. Allocation Type: NSF

W
Wagner, Lucas K. 92. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 0.508 Mnh. Allocation Type: Illinois
Wang, Liqiang 58. Affiliation: University of Wyoming. Allocation Size: 0.003 Mnh. Allocation Type: NSF
Wang, Shaowen 148. Affiliation: University of Illinois at Urbana-Champaign; National Center for Supercomputing Applications. Allocation Size: 0.6 Mnh. Allocation Type: Illinois
Wilde, Michael 96. Affiliation: Argonne National Laboratory; University of Chicago. Allocation Size: 0.375 Mnh. Allocation Type: GLCPC
Wilhelmson, Robert 64. Affiliation: University of Illinois at Urbana-Champaign; National Center for Supercomputing Applications. Allocation Size: 0.76 Mnh. Allocation Type: Illinois
Woodward, Paul R. 28. Affiliation: University of Minnesota. Allocation Size: 3.88 Mnh. Allocation Type: NSF
Woosley, Stan 24. Affiliation: University of California, Santa Cruz. Allocation Size: 8.44 Mnh. Allocation Type: NSF
Wuebbles, Donald J. 56. Affiliation: University of Illinois at Urbana-Champaign. Allocation Size: 5.11 Mnh. Allocation Type: NSF

Y
Yeung, P.K. 90. Affiliation: Georgia Institute of Technology. Allocation Size: 9.03 Mnh. Allocation Type: NSF

Z
Zhang, Shiwei 82. Affiliation: College of William and Mary. Allocation Size: 2.66 Mnh. Allocation Type: NSF


Blue Waters is supported by the National Science Foundation

ISBN 978-0-9908385-1-7