Lawrence Livermore National Laboratory

Sequoia and the Petascale Era

SCICOMP 15, May 20, 2009

Thomas Spelce, Development Environment Group

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
LLNL-PRES-411030

The Advanced Simulation and Computing (ASC) Program delivers high confidence prediction of weapons behavior

• Integrated Codes – codes to predict safety and reliability
• Physics and Engineering Models
• Verification and Validation
• Experiments (NNSA Science Campaigns, legacy UGTs) provide critical validation data and understanding of models

ASC integrates all of the science and engineering that makes stewardship successful

ASC pursued three classes of systems to cost effectively meet current (and anticipate future) compute requirements

. Capability systems ==> the most challenging integrated design calculations
• More costly but proven
• Production workload
. Capacity systems ==> day-to-day work
• Less costly, somewhat less reliable
• Throughput for less demanding problems
. Advanced Architectures ==> higher performance and lower power consumption, etc.
• Targeted but demanding workload
• Tomorrow’s mainstream solutions?

[Chart: performance vs. time (FY01–FY05) for the three curves, labeled “Original concept: develop capability”, “Mainframes (RIP)”, and “Low-cost capacity”, plotting Sequoia, Roadrunner, BlueGene/L, Purple, TLCC (Juno), Thunder, White, MCR, Q, Blue, and Red]

The “three curves” (Capability, Capacity and Advanced Architectures) approach has been successful in delivering good cost performance across the spectrum of need…


Sequoia represents largest increase in computational power ever delivered for NNSA

Sequoia timeline (planned five-year lifetime, through CY17), 1/06 to 12/12:

• 1/06–12/06: Market survey; CD0 approved; write RFP
• 1/07–12/07: Vendor response; CD1 approved; contract package; selection; Sequoia plan review
• 1/08–12/08: Dawn LA; CD2/3 approved; Dawn early science; transition to classified; Dawn GA
• 1/09–12/09: Dawn Phase 1; Dawn Phase 2; Sequoia build decision
• 1/10–12/10: Phased system deliveries; Sequoia parts commit & option; Sequoia parts build; Sequoia demo
• 1/11–12/11: Sequoia early science; transition to classified; CD4 approved; Sequoia operational readiness

Key milestones flagged on the timeline: Sequoia contract award, Dawn system acceptance, and Sequoia final system acceptance.

“Dawn speeds a man on his journey, and speeds him too in his work” ...Hesiod (~700 B.C.E.)

Dawn Specifications
• IBM BG/P architecture
• 36,864 compute nodes (500 TF)
• 147,456 PPC 450 cores
• 4 GB memory per node (147.5 TB)
• 128-to-1 compute to I/O node ratio
• 288 10GbE links to file system
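The node, core, memory, and I/O-link counts above follow directly from the 36-rack configuration; a minimal C sketch of that arithmetic, assuming the standard BG/P packaging of 1,024 quad-core nodes per rack:

    #include <stdio.h>

    /* Back-of-the-envelope check of the Dawn figures above, assuming the
     * standard BG/P packaging of 1,024 quad-core nodes per rack. */
    int main(void) {
        const int racks          = 36;
        const int nodes_per_rack = 1024;   /* assumed BG/P packaging          */
        const int cores_per_node = 4;      /* PPC 450 quad-core               */
        const int gb_per_node    = 4;      /* 4 GB memory per node            */
        const int compute_per_io = 128;    /* 128-to-1 compute-to-I/O ratio   */

        int nodes     = racks * nodes_per_rack;   /* 36,864 compute nodes     */
        int cores     = nodes * cores_per_node;   /* 147,456 cores            */
        int memory_gb = nodes * gb_per_node;      /* 147,456 GB ~= 147.5 TB   */
        int io_nodes  = nodes / compute_per_io;   /* 288 I/O nodes, one 10GbE link each */

        printf("nodes=%d cores=%d memory=%d GB io_nodes=%d\n",
               nodes, cores, memory_gb, io_nodes);
        return 0;
    }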

Dawn Installation
• Feb 27th – final rack delivery
• March 5th – 36-rack integration complete
• March 15–24th – Synthetic WorkLoad start
• End of March – acceptance (planned)

ibm.com/systems/deepcomputing/bluegene/


DAWN: Sequoia Initial Delivery

• Chip: 850 MHz PPC 450, 4 cores / 4 threads, 13.6 GF/s peak, 8 MB EDRAM
• Compute Card: 13.6 GF/s, 4.0 GB DDR2, 13.6 GB/s memory BW, 0.75 GB/s 3D torus BW
• Node Card: 435 GF/s, 128 GB
• Rack: 14 TF/s, 4 TB, 36 kW
• System: 36 racks, 0.5 PF/s, 144 TB, 1.3 MW, >8 day MTBF

DAWN Initial Delivery Infrastructure

[Diagram: Dawn initial-delivery infrastructure – the Dawn core (9 x 4 BG/P racks) with 288 10GbE links to the file system and 8 x 4 10GbE links to LLNL networks; an HTC partition; core and management Ethernet over 1GbE links; primary and backup login and service nodes; an HMC; 2 x FC4 connections per service node to local disk; and additional 10GbE uplinks]


Sequoia Target Architecture and Infrastructure

Production Operation FY12–FY17
• 20 PF/s, 1.6 PB memory
• 96 racks, 98,304 nodes
• 1.6 M cores (1 GB/core)
• 50 PB file system
• 6.0 MW power (160 times more efficient than Purple)
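A minimal C sketch of the arithmetic behind these targets; the per-rack and per-node breakdowns are derived from the totals above rather than stated on the slide:

    #include <stdio.h>

    /* Derived breakdown of the Sequoia targets above; only the totals are
     * given on the slide, the per-rack and per-node figures follow from them. */
    int main(void) {
        const int    racks   = 96;
        const int    nodes   = 98304;
        const long   cores   = 1572864;   /* "1.6 M cores"             */
        const double peak_pf = 20.0;      /* peak, PF/s                */

        printf("nodes per rack : %d\n",   nodes / racks);                 /* 1,024                 */
        printf("cores per node : %ld\n",  cores / nodes);                 /* 16                    */
        printf("memory (PB)    : %.2f\n", cores / 1.0e6);                 /* 1 GB/core -> ~1.6 PB  */
        printf("peak per core  : %.1f GF/s\n", peak_pf * 1.0e6 / cores);  /* ~12.7                 */
        return 0;
    }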

. Will be used as a 2D ultra-res and 3D high-res Uncertainty Quantification (UQ) engine

. Will be used for 3D science capability runs exploring key materials science problems

High performance material science simulations will contribute directly to ASC programmatic success

Six physics/materials science applications targeted for early implementation on Sequoia infrastructure:
• Qbox – quantum molecular dynamics for determination of material equation of state
• DDCMD – molecular dynamics for material dynamics
• Miranda – 3D continuum fluid dynamics for interfacial mixing
• ALE3D – 3D continuum mechanics for ignition and detonation propagation of explosives
• LAMMPS – molecular dynamics for shock initiation in high explosives
• ParaDiS – dislocation dynamics for high pressure strength in materials


Single Sequoia Platform Mandatory Requirement is P ≥ 20

. P is “peak” of the machine measured in petaFLOP/s
. Target requirement is P + S ≥ 40 (see the sketch below)
• S is weighted average of five “marquee” benchmark codes
• Four code package benchmarks
− UMT, IRS, AMG, and SPhot
− Program goal is 24x the Purple capability throughput
• One “science workload” benchmark from SNL
− LAMMPS (molecular dynamics)
− Program goal is 20x–50x BG/L for science capability

Purple – 100 TF/s; BlueGene/L – 367 TF/s
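A minimal C sketch of how the P + S ≥ 40 target composes; the weights and per-code speedups below are hypothetical placeholders, since the slide only states that S is a weighted average over the five marquee codes:

    #include <stdio.h>

    /* Hypothetical weights and speedups for the five marquee benchmarks; the
     * real weights and baselines are defined by the Sequoia benchmark suite,
     * not by this slide. */
    int main(void) {
        const char  *code[]    = { "UMT", "IRS", "AMG", "SPhot", "LAMMPS" };
        const double weight[]  = { 0.2, 0.2, 0.2, 0.2, 0.2 };        /* assumed equal */
        const double speedup[] = { 24.0, 25.0, 23.0, 24.0, 30.0 };   /* hypothetical  */

        double S = 0.0, wsum = 0.0;
        for (int i = 0; i < 5; ++i) {
            printf("%-7s weight %.2f  speedup %.1fx\n", code[i], weight[i], speedup[i]);
            S    += weight[i] * speedup[i];
            wsum += weight[i];
        }
        S /= wsum;                     /* weighted average over the five codes */

        const double P = 20.0;         /* mandatory peak requirement, PF/s     */
        printf("S = %.1f, P + S = %.1f (target: P + S >= 40)\n", S, P + S);
        return 0;
    }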

Sequoia Operating System Perspective

Light weight kernel on compute nodes
. Optimized for scalability and reliability
. As simple as possible
. Extremely low OS noise
. Direct access to interconnect hardware
. OS features
• Linux/Unix syscall compatible w/ I/O syscalls
• Support for dynamic lib runtime loading
• Shared memory regions
. Open source
[Diagram: compute-node stack – applications 1–N over glibc with dynamic loading, NPTL Posix threads / OpenMP / SE/TM, MPI (ADI), shared memory, futex, RAS, and function-shipped syscalls on the hardware transport, all on the Sequoia CN and interconnect]
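As a generic POSIX illustration of the “dynamic lib runtime loading” feature noted above (this is not CNK code; the library and symbol names are hypothetical, and the program links with -ldl on Linux):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Generic POSIX dlopen/dlsym sketch; "libphysics.so" and "eos_init" are
     * hypothetical names used only for illustration. */
    int main(void) {
        void *handle = dlopen("libphysics.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        void (*eos_init)(void) = (void (*)(void)) dlsym(handle, "eos_init");
        if (eos_init)
            eos_init();              /* call into the runtime-loaded library */

        dlclose(handle);
        return 0;
    }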

Linux/Unix OS on I/O nodes
. Leverage large Linux/Unix base & community
. Enhance TCP offload, PCIe, I/O
. Standard file systems - Lustre, NFSv4, etc.
. Aggregates N CN for I/O & admin
. Open source
[Diagram: I/O-node stack – FSD, SLURMD, perf tools, TotalView, Lustre client, NFSv4, LNet, UDP/TCP/IP, and function-shipped syscalls on Linux/Unix over the Sequoia ION and interconnect]


Sequoia Software Stack – Applications Perspective

[Diagram: application view of the Sequoia software stack, spanning user space (application; code development tools – C/C++/Fortran compilers, Python; parallel and optimized math libs; OpenMP, threads, SE/TM; Clib/F03 runtime; MPI2 with ADI; sockets; SLURM/Moab and code dev tools infrastructure) and kernel space (LWK with function-shipped syscalls, RAS, and control system; Linux/Unix with Lustre client, LNet, TCP/UDP/IP), over the interconnect interface and the external network]

The tools that users know and love will be available on Sequoia with improvements and additions as needed

[Chart: existing, new, and new focus tools positioned by operational scale (1 to ~10^7) and by category – debugging, performance, infrastructure, features: STAT, MRNet, TotalView, memlight, LaunchMON, memP, lightweight PMPI tools, mpiP, O|SS, OTF, TAU, PAPI, OpenMP and SE/TM analyzers, DPCL, MemCheck, OpenMP profiling interface, SE/TM monitor, Valgrind ThreadCheck, gprof, StackWalker, SE/TM debugger, Dyninst]
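Lightweight tools such as mpiP and memP rest on the standard MPI profiling (PMPI) interface; a minimal sketch of that interposition technique, counting MPI_Send calls, is shown below. It is a generic illustration, not code from any of the tools named above.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal PMPI illustration: the tool provides MPI_Send and forwards to
     * PMPI_Send, the real implementation underneath. */
    static long send_count = 0;

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        ++send_count;                                   /* record the call */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d issued %ld MPI_Send calls\n", rank, send_count);
        return PMPI_Finalize();                         /* then really finalize */
    }

Because every MPI function has a PMPI_ twin, wrappers like this can be interposed at link time without recompiling the application.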


Application programming requirements and challenges

. Availability of 1.6 M cores pushes all-MPI codes to extreme concurrency
. Availability of many threads on many SMP cores encourages low-level parallelism for higher performance
. Mixed MPI/SMP programming environment and possibility of heterogeneous compute distribution brings load imbalance to the fore
. I/O and visualization requirements encourage innovative strategies to minimize memory and bandwidth bottlenecks
[Diagram: layered concerns – MPI scaling, SMP threads, hybrid models, I/O & visualization]

The RFP asked interested vendors to address a “Unified Nested Node Concurrency” model

[Diagram: one MPI task (process) with four OS threads, Thread0–Thread3, between MPI_INIT and MPI_FINALIZE; Thread0 runs MAIN and its nested calls (MAIN, Funct1, Funct2), while Threads 1–3 act as workers (W) that switch among Pthread, OpenMP, and TM/SE roles]
1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN

. MPI tasks on a node are processes (one is shown) with multiple OS threads (Thread0–3 shown)
. Thread0 is the “main thread”; Thread1–3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler-generated threads via runtime support
. Hardware support to significantly reduce overheads for thread repurposing and for OpenMP loops and locks
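A minimal hybrid MPI + OpenMP sketch in the spirit of this model, assuming the common funneled style in which only the main thread makes MPI calls; the loop and reduction are placeholders, not taken from the slide:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Minimal hybrid sketch: MPI between tasks, OpenMP threads within a node,
     * with all MPI calls funneled through the main thread (Thread0). */
    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   /* worker threads compute */
        for (int i = 0; i < 1000000; ++i)
            local += 1.0e-6;

        double global = 0.0;                          /* only the main thread calls MPI */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }

MPI_THREAD_FUNNELED matches the slide's picture: helper threads do the OpenMP (or TM/SE) work while communication stays on Thread0.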


Previous systems have prepared the way for Sequoia

. BG/L experience informs Dawn/Sequoia scalability
. OpenMP & Posix threads experience on Linux/AIX
. Integrated codes regularly run at Purple capability
. Dawn will be used for code development
• SMP parallelism
• Python
• Larger memory per core than BG/L
• Some critical UQ analysis as well
. Sequoia will be a Tri-Lab ASC resource
• Video conferences for coordination
[Photo: DAWN Initial Delivery]

A diverse team and a new Scalable Application Preparation Project ensure success on Sequoia

. LC Hotline, user training, and documentation address routine issues
. ADEPT team provides expertise in compilers, debuggers, performance tools
. Access to IBM experts, including an on-site IBM applications analyst

. Staff to work closely with the application teams
. Ongoing ANL/IBM/LLNL BlueGene collaboration
. Engaging third-party vendors, university research partners, and the open source community


New Enabling Technologies (PCET) LDRD is addressing key barriers to predictive simulation

[Chart: barriers to predictive simulation at increasing scale, from 10^3 to 10^8 cores – debugging, load balance, fault tolerance, multicore, vector FP units/accelerators?, and power? – with Purple, BG/L, petascale, and exascale marked along the axis]

PCET creates essential capabilities for exascale core counts

PCET strategy mitigates risk to assure immediate impact on application drivers and longer term success

. Current capabilities: MPI large-grain parallelism; basic checkpoint/restart; ill-defined load imbalances; debugging < 4,096 cores
. Shorter-term payoff: cache-oblivious data layouts; checkpoint compression; load balance analysis; behavioral and performance equivalence
. Terascale capabilities: multicore-adapted algorithms; faster checkpoint/restart; understood load imbalances; targeted debugging classes
. Petascale capable & exascale prepared: multicore-aware algorithms; application-level fault tolerance; well-balanced application load; automated error analysis


Take-away: Computational science on Sequoia at full scale will be the culmination of many years of hard work

[Diagram: the path to computational science on Sequoia – innovative or evolutionary architecture ideas, rigorous review, R&D contracts, R&D and milestone progress, initial delivery & integration (Dawn ID: “we’re here”), periodic reviews, and flexible contracts with targets as requirements]
