Lawrence Livermore National Laboratory

Sequoia and the Petascale Era

SCICOMP 15, May 20, 2009

Thomas Spelce, Development Environment Group

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
LLNL-PRES-411030

The Advanced Simulation and Computing (ASC) Program delivers high confidence prediction of weapons behavior

• Integrated Codes – codes to predict safety and reliability
• Physics and Engineering Models
• Verification and Validation
• Experiments (NNSA Science Campaigns, legacy UGTs) provide critical validation data and understanding of models

ASC integrates all of the science and engineering that makes stewardship successful

ASC pursued three classes of systems to cost effectively meet current (and anticipate future) compute requirements

. Capability systems ==> the most challenging integrated design calculations
• More costly but proven
• Production workload
. Capacity systems ==> day-to-day work
• Less costly, somewhat less reliable
• Throughput for less demanding problems
. Advanced Architectures ==> higher performance and lower power consumption, etc.
• Targeted but demanding workload
• Tomorrow’s mainstream solutions?

[Chart: performance vs. time (FY01–FY05) for the three curves, labeled “Original concept: develop capability”, “Mainframes (RIP)”, and “Low-cost capacity”, plotting Sequoia, Roadrunner, BlueGene/L, Purple, TLCC (Juno), Thunder, White, MCR, Q, Blue, and Red]

The “three curves” (Capability, Capacity and Advanced Architectures) approach has been successful in delivering good cost performance across the spectrum of need…


Sequoia represents largest increase in computational power ever delivered for NNSA

Sequoia timeline (planned five-year lifetime, through CY17), 1/06 to 12/12:

• 1/06–12/06: Market survey; CD0 approved; write RFP
• 1/07–12/07: Vendor response; CD1 approved; contract package; selection; Sequoia plan review
• 1/08–12/08: Dawn LA; CD2/3 approved; Dawn early science; transition to classified; Dawn GA
• 1/09–12/09: Dawn Phase 1; Dawn Phase 2; Sequoia build decision
• 1/10–12/10: Phased system deliveries; Sequoia parts commit & option; Sequoia parts build; Sequoia demo
• 1/11–12/11: Sequoia early science; transition to classified; CD4 approved; Sequoia operational readiness

Key milestones flagged on the timeline: Sequoia contract award, Dawn system acceptance, and Sequoia final system acceptance.

“Dawn speeds a man on his journey, and speeds him too in his work” ...Hesiod (~700 B.C.E.)

Dawn Specifications
• IBM BG/P architecture
• 36,864 compute nodes (500 TF)
• 147,456 PPC 450 cores
• 4 GB memory per node (147.5 TB)
• 128-to-1 compute to I/O node ratio
• 288 10GbE links to file system
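The node, core, memory, and I/O-link counts above follow directly from the 36-rack configuration; a minimal C sketch of that arithmetic, assuming the standard BG/P packaging of 1,024 quad-core nodes per rack:

    #include <stdio.h>

    /* Back-of-the-envelope check of the Dawn figures above, assuming the
     * standard BG/P packaging of 1,024 quad-core nodes per rack. */
    int main(void) {
        const int racks          = 36;
        const int nodes_per_rack = 1024;   /* assumed BG/P packaging          */
        const int cores_per_node = 4;      /* PPC 450 quad-core               */
        const int gb_per_node    = 4;      /* 4 GB memory per node            */
        const int compute_per_io = 128;    /* 128-to-1 compute-to-I/O ratio   */

        int nodes     = racks * nodes_per_rack;   /* 36,864 compute nodes     */
        int cores     = nodes * cores_per_node;   /* 147,456 cores            */
        int memory_gb = nodes * gb_per_node;      /* 147,456 GB ~= 147.5 TB   */
        int io_nodes  = nodes / compute_per_io;   /* 288 I/O nodes, one 10GbE link each */

        printf("nodes=%d cores=%d memory=%d GB io_nodes=%d\n",
               nodes, cores, memory_gb, io_nodes);
        return 0;
    }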

Dawn Installation
• Feb 27th – final rack delivery
• March 5th – 36-rack integration complete
• March 15–24th – Synthetic WorkLoad start
• End of March – acceptance (planned)

ibm.com/systems/deepcomputing/bluegene/


DAWN: Sequoia Initial Delivery

• Chip: 850 MHz PPC 450, 4 cores / 4 threads, 13.6 GF/s peak, 8 MB EDRAM
• Compute Card: 13.6 GF/s, 4.0 GB DDR2, 13.6 GB/s memory BW, 0.75 GB/s 3D torus BW
• Node Card: 435 GF/s, 128 GB
• Rack: 14 TF/s, 4 TB, 36 kW
• System: 36 racks, 0.5 PF/s, 144 TB, 1.3 MW, >8 day MTBF

DAWN Initial Delivery Infrastructure

[Diagram: Dawn initial-delivery infrastructure – the Dawn core (9 x 4 BG/P racks) with 288 10GbE links to the file system and 8 x 4 10GbE links to LLNL networks; an HTC partition; core and management Ethernet over 1GbE links; primary and backup login and service nodes; an HMC; 2 x FC4 connections per service node to local disk; and additional 10GbE uplinks]


Sequoia Target Architecture and Infrastructure

Production Operation FY12–FY17
• 20 PF/s, 1.6 PB memory
• 96 racks, 98,304 nodes
• 1.6 M cores (1 GB/core)
• 50 PB file system
• 6.0 MW power (160 times more efficient than Purple)
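A minimal C sketch of the arithmetic behind these targets; the per-rack and per-node breakdowns are derived from the totals above rather than stated on the slide:

    #include <stdio.h>

    /* Derived breakdown of the Sequoia targets above; only the totals are
     * given on the slide, the per-rack and per-node figures follow from them. */
    int main(void) {
        const int    racks   = 96;
        const int    nodes   = 98304;
        const long   cores   = 1572864;   /* "1.6 M cores"             */
        const double peak_pf = 20.0;      /* peak, PF/s                */

        printf("nodes per rack : %d\n",   nodes / racks);                 /* 1,024                 */
        printf("cores per node : %ld\n",  cores / nodes);                 /* 16                    */
        printf("memory (PB)    : %.2f\n", cores / 1.0e6);                 /* 1 GB/core -> ~1.6 PB  */
        printf("peak per core  : %.1f GF/s\n", peak_pf * 1.0e6 / cores);  /* ~12.7                 */
        return 0;
    }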

. Will be used as a 2D ultra-res and 3D high-res Uncertainty Quantification (UQ) engine

. Will be used for 3D science capability runs exploring key materials science problems

High performance material science simulations will contribute directly to ASC programmatic success

Six physics/materials science applications targeted for early implementation on Sequoia infrastructure:
• Qbox – quantum molecular dynamics for determination of material equation of state
• DDCMD – molecular dynamics for material dynamics
• Miranda – 3D continuum fluid dynamics for interfacial mixing
• ALE3D – 3D continuum mechanics for ignition and detonation propagation of explosives
• LAMMPS – molecular dynamics for shock initiation in high explosives
• ParaDiS – dislocation dynamics for high pressure strength in materials


Single Sequoia Platform Mandatory Requirement is P ≥ 20

. P is “peak” of the machine measured in petaFLOP/s
. Target requirement is P + S ≥ 40 (see the sketch below)
• S is weighted average of five “marquee” benchmark codes
• Four code package benchmarks
− UMT, IRS, AMG, and SPhot
− Program goal is 24x the Purple capability throughput
• One “science workload” benchmark from SNL
− LAMMPS (molecular dynamics)
− Program goal is 20x–50x BG/L for science capability

Purple – 100 TF/s; BlueGene/L – 367 TF/s
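A minimal C sketch of how the P + S ≥ 40 target composes; the weights and per-code speedups below are hypothetical placeholders, since the slide only states that S is a weighted average over the five marquee codes:

    #include <stdio.h>

    /* Hypothetical weights and speedups for the five marquee benchmarks; the
     * real weights and baselines are defined by the Sequoia benchmark suite,
     * not by this slide. */
    int main(void) {
        const char  *code[]    = { "UMT", "IRS", "AMG", "SPhot", "LAMMPS" };
        const double weight[]  = { 0.2, 0.2, 0.2, 0.2, 0.2 };        /* assumed equal */
        const double speedup[] = { 24.0, 25.0, 23.0, 24.0, 30.0 };   /* hypothetical  */

        double S = 0.0, wsum = 0.0;
        for (int i = 0; i < 5; ++i) {
            printf("%-7s weight %.2f  speedup %.1fx\n", code[i], weight[i], speedup[i]);
            S    += weight[i] * speedup[i];
            wsum += weight[i];
        }
        S /= wsum;                     /* weighted average over the five codes */

        const double P = 20.0;         /* mandatory peak requirement, PF/s     */
        printf("S = %.1f, P + S = %.1f (target: P + S >= 40)\n", S, P + S);
        return 0;
    }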

Sequoia Operating System Perspective

Light weight kernel on compute nodes
. Optimized for scalability and reliability
. As simple as possible
. Extremely low OS noise
. Direct access to interconnect hardware
. OS features
• Linux/Unix syscall compatible w/ I/O syscalls
• Support for dynamic lib runtime loading
• Shared memory regions
. Open source
[Diagram: compute-node stack – applications 1–N over glibc with dynamic loading, NPTL Posix threads / OpenMP / SE/TM, MPI (ADI), shared memory, futex, RAS, and function-shipped syscalls on the hardware transport, all on the Sequoia CN and interconnect]
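As a generic POSIX illustration of the “dynamic lib runtime loading” feature noted above (this is not CNK code; the library and symbol names are hypothetical, and the program links with -ldl on Linux):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Generic POSIX dlopen/dlsym sketch; "libphysics.so" and "eos_init" are
     * hypothetical names used only for illustration. */
    int main(void) {
        void *handle = dlopen("libphysics.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        void (*eos_init)(void) = (void (*)(void)) dlsym(handle, "eos_init");
        if (eos_init)
            eos_init();              /* call into the runtime-loaded library */

        dlclose(handle);
        return 0;
    }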

Linux/Unix OS on I/O nodes
. Leverage large Linux/Unix base & community
. Enhance TCP offload, PCIe, I/O
. Standard file systems - Lustre, NFSv4, etc.
. Aggregates N CN for I/O & admin
. Open source
[Diagram: I/O-node stack – FSD, SLURMD, perf tools, TotalView, Lustre client, NFSv4, LNet, UDP/TCP/IP, and function-shipped syscalls on Linux/Unix over the Sequoia ION and interconnect]


Sequoia Software Stack – Applications Perspective

[Diagram: application view of the Sequoia software stack, spanning user space (application; code development tools – C/C++/Fortran compilers, Python; parallel and optimized math libs; OpenMP, threads, SE/TM; Clib/F03 runtime; MPI2 with ADI; sockets; SLURM/Moab and code dev tools infrastructure) and kernel space (LWK with function-shipped syscalls, RAS, and control system; Linux/Unix with Lustre client, LNet, TCP/UDP/IP), over the interconnect interface and the external network]

The tools that users know and love will be available on Sequoia with improvements and additions as needed

[Chart: existing, new, and new focus tools positioned by operational scale (1 to ~10^7) and by category – debugging, performance, infrastructure, features: STAT, MRNet, TotalView, memlight, LaunchMON, memP, lightweight PMPI tools, mpiP, O|SS, OTF, TAU, PAPI, OpenMP and SE/TM analyzers, DPCL, MemCheck, OpenMP profiling interface, SE/TM monitor, Valgrind ThreadCheck, gprof, StackWalker, SE/TM debugger, Dyninst]
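Lightweight tools such as mpiP and memP rest on the standard MPI profiling (PMPI) interface; a minimal sketch of that interposition technique, counting MPI_Send calls, is shown below. It is a generic illustration, not code from any of the tools named above.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal PMPI illustration: the tool provides MPI_Send and forwards to
     * PMPI_Send, the real implementation underneath. */
    static long send_count = 0;

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        ++send_count;                                   /* record the call */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d issued %ld MPI_Send calls\n", rank, send_count);
        return PMPI_Finalize();                         /* then really finalize */
    }

Because every MPI function has a PMPI_ twin, wrappers like this can be interposed at link time without recompiling the application.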


Application programming requirements and challenges

. Availability of 1.6 M cores pushes all-MPI codes to extreme concurrency
. Availability of many threads on many SMP cores encourages low-level parallelism for higher performance
. Mixed MPI/SMP programming environment and possibility of heterogeneous compute distribution brings load imbalance to the fore
. I/O and visualization requirements encourage innovative strategies to minimize memory and bandwidth bottlenecks
[Diagram: layered concerns – MPI scaling, SMP threads, hybrid models, I/O & visualization]

The RFP asked interested vendors to address a “Unified Nested Node Concurrency” model

[Diagram: one MPI task (process) with four OS threads, Thread0–Thread3, between MPI_INIT and MPI_FINALIZE; Thread0 runs MAIN and its nested calls (MAIN, Funct1, Funct2), while Threads 1–3 act as workers (W) that switch among Pthread, OpenMP, and TM/SE roles]
1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN

. MPI tasks on a node are processes (one is shown) with multiple OS threads (Thread0–3 shown)
. Thread0 is the “main thread”; Thread1–3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler-generated threads via runtime support
. Hardware support to significantly reduce overheads for thread repurposing and for OpenMP loops and locks
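A minimal hybrid MPI + OpenMP sketch in the spirit of this model, assuming the common funneled style in which only the main thread makes MPI calls; the loop and reduction are placeholders, not taken from the slide:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Minimal hybrid sketch: MPI between tasks, OpenMP threads within a node,
     * with all MPI calls funneled through the main thread (Thread0). */
    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   /* worker threads compute */
        for (int i = 0; i < 1000000; ++i)
            local += 1.0e-6;

        double global = 0.0;                          /* only the main thread calls MPI */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }

MPI_THREAD_FUNNELED matches the slide's picture: helper threads do the OpenMP (or TM/SE) work while communication stays on Thread0.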


Previous systems have prepared the way for Sequoia

. BG/L experience informs Dawn/Sequoia scalability
. OpenMP & Posix threads experience on Linux/AIX
. Integrated codes regularly run at Purple capability
. Dawn will be used for code development
• SMP parallelism
• Python
• Larger memory per core than BG/L
• Some critical UQ analysis as well
. Sequoia will be a Tri-Lab ASC resource
• Video conferences for coordination
[Photo: DAWN Initial Delivery]

A diverse team and a new Scalable Application Preparation Project ensure success on Sequoia

. LC Hotline, user training, and documentation address routine issues
. ADEPT team provides expertise in compilers, debuggers, performance tools
. Access to IBM experts, including an on-site IBM applications analyst

. Staff to work closely with the application teams
. Ongoing ANL/IBM/LLNL BlueGene collaboration
. Engaging third-party vendors, university research partners, and the open source community


New Enabling Technologies (PCET) LDRD is addressing key barriers to predictive simulation

[Chart: barriers to predictive simulation at increasing scale, from 10^3 to 10^8 cores – debugging, load balance, fault tolerance, multicore, vector FP units/accelerators?, and power? – with Purple, BG/L, petascale, and exascale marked along the axis]

PCET creates essential capabilities for exascale core counts

PCET strategy mitigates risk to assure immediate impact on application drivers and longer term success

. Current capabilities: MPI large-grain parallelism; basic checkpoint/restart; ill-defined load imbalances; debugging < 4,096 cores
. Shorter-term payoff: cache-oblivious data layouts; checkpoint compression; load balance analysis; behavioral and performance equivalence
. Terascale capabilities: multicore-adapted algorithms; faster checkpoint/restart; understood load imbalances; targeted debugging classes
. Petascale capable & exascale prepared: multicore-aware algorithms; application-level fault tolerance; well-balanced application load; automated error analysis


Take-away: Computational science on Sequoia at full scale will be the culmination of many years of hard work

[Diagram: the path to computational science on Sequoia – innovative or evolutionary architecture ideas, rigorous review, R&D contracts, R&D and milestone progress, initial delivery & integration (Dawn ID: “we’re here”), periodic reviews, and flexible contracts with targets as requirements]
