Lawrence Livermore National Laboratory
Sequoia and the Petascale Era
SCICOMP 15 May 20, 2009
Thomas Spelce Development Environment Group
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
LLNL-PRES-411030
The Advanced Simulation and Computing (ASC) Program delivers high confidence prediction of weapons behavior
[Diagram: Integrated Codes ("codes to predict safety and reliability"), Physics and Engineering Models, and Verification and Validation form a cycle; Experiments (NNSA Science Campaigns, Legacy UGTs) provide critical validation data, and models provide understanding.]
ASC integrates all of the science and engineering that makes stewardship successful
Lawrence Livermore National Laboratory · LLNL-PRES-411030 · 10th International LCI Conference

ASC pursued three classes of systems to cost-effectively meet current (and anticipate future) compute requirements
. Capability systems ==> the most challenging integrated calculations
• More costly but proven design
• Production workload
. Capacity systems ==> day-to-day work
• Less costly, somewhat less reliable
• Throughput for less demanding problems
. Advanced Architectures ==> performance, power consumption, etc.
• Targeted but demanding workload
• Tomorrow's mainstream solutions?

[Chart: Performance vs. Time, FY01-FY05. Machine labels include Red, Blue, White, Purple, and Sequoia on the capability curve ("original concept: develop capability"; "Mainframes (RIP)"); Q, MCR, Thunder, Peloton, and TLCC (Juno) on the low-cost capacity curve; and BlueGene/L and Roadrunner on the advanced-architectures curve ("higher performance and lower power consumption").]
The “three curves” (Capability, Capacity and Advanced Architectures) approach has been successful in delivering good cost performance across the spectrum of need…
Sequoia represents the largest increase in computational power ever delivered for NNSA Stockpile Stewardship
Sequoia: five-year planned lifetime, through CY17

Timeline (1/06 - 12/12):
• 2006: Market survey; CD0 approved; write RFP
• 2007: Vendor responses; CD1 approved; contract package; selection; Sequoia plan review
• 2008: Dawn LA; CD2/3 approved; Dawn early science; transition to classified; Dawn GA
• 2009: Dawn Phase 1; Dawn Phase 2; Sequoia build decision
• 2010: Phased system deliveries; Sequoia parts commit & option; Sequoia parts build; Sequoia demo
• 2011: Sequoia early science; transition to classified; CD4 approved; Sequoia operational readiness

Key milestones: Sequoia contract award; Dawn system acceptance; Sequoia final system acceptance
“Dawn speeds a man on his journey, and speeds him too in his work” ...Hesiod (~700 B.C.E.)
Dawn Specifications
• IBM BG/P architecture
• 36,864 compute nodes (500 TF)
• 147,456 PPC 450 cores
• 4 GB memory per node (147.5 TB)
• 128-to-1 compute to I/O node ratio
• 288 10GE links to file system

Dawn Installation
• Feb 27th - final rack delivery
• March 5th - 36-rack integration complete
• March 15-24th - Synthetic WorkLoad start
• End of March - acceptance (planned)
ibm.com/systems/deepcomputing/bluegene/
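The spec sheet above is internally consistent; a quick sketch checks it, assuming the standard BG/P figures of 1,024 nodes per rack and 4 cores per node (which are not stated on the slide):

```python
# Sanity-check the Dawn specifications from the bullet list above.
RACKS = 36
NODES_PER_RACK = 1024     # assumed standard BG/P rack population
CORES_PER_NODE = 4
GB_PER_NODE = 4
CN_PER_ION = 128          # "128-to-1 compute to I/O node ratio"

nodes = RACKS * NODES_PER_RACK
cores = nodes * CORES_PER_NODE
memory_tb = nodes * GB_PER_NODE / 1000   # decimal TB, as on the slide
io_nodes = nodes // CN_PER_ION

print(nodes)      # 36864 compute nodes
print(cores)      # 147456 PPC 450 cores
print(memory_tb)  # ~147.5 TB aggregate memory
print(io_nodes)   # 288 I/O nodes, one per 10GE link to the file system
```

The 288 I/O nodes implied by the 128:1 ratio match the 288 10GE links to the file system.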
DAWN - SEQUOIA Initial Delivery
• System: 36 racks, 0.5 PF/s, 144 TB memory, 1.3 MW, >8 day MTBF
• Rack: 14 TF/s, 4 TB, 36 KW
• Node card: 435 GF/s peak, 128 GB
• Compute card: 13.6 GF/s, 4.0 GB DDR2, 13.6 GB/s memory BW, 0.75 GB/s 3D torus BW
• Chip: 850 MHz PPC 450, 4 cores/4 threads, 13.6 GF/s, 8 MB EDRAM
DAWN Initial Delivery Infrastructure
[Diagram: the Dawn core (9 x 4 BG/P racks) connects via 288 10GbE links to the file system, 8 x 4 10GbE links to LLNL networks, and a 10GbE HTC link; a 1GbE management Ethernet (E-net core, 144 1GbE links to the racks) connects the login node, primary and backup service nodes, and the HMC; the login and service nodes attach to local disk over 2 x FC4 and 10GbE links.]
Sequoia Target Architecture and Infrastructure
Production Operation FY12-FY17
• 20 PF/s, 1.6 PB memory
• 96 racks, 98,304 nodes
• 1.6 M cores (1 GB/core)
• 50 PB Lustre file system
• 6.0 MW power (160 times more efficient than Purple)

. Will be used as a 2D ultra-res and 3D high-res Uncertainty Quantification (UQ) engine
. Will be used for 3D science capability runs exploring key materials science problems
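The Sequoia target figures also hang together arithmetically; the sketch below assumes 1,024 nodes per rack and 16 cores per node, which are not stated on the slide but are consistent with its totals:

```python
# Sanity-check the Sequoia target architecture figures.
RACKS = 96
NODES_PER_RACK = 1024     # assumed, consistent with 98,304 nodes
CORES_PER_NODE = 16       # assumed, consistent with 1.6 M cores
GB_PER_CORE = 1           # "1.6 M cores (1 GB/core)"

nodes = RACKS * NODES_PER_RACK
cores = nodes * CORES_PER_NODE
memory_pb = cores * GB_PER_CORE / 1e6    # decimal PB

print(nodes)                 # 98304 nodes
print(cores)                 # 1572864 cores, i.e. ~1.6 M
print(round(memory_pb, 2))   # ~1.6 PB aggregate memory
```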
High performance material science simulations will contribute directly to ASC programmatic success
Six physics/materials science applications targeted for early implementation on Sequoia infrastructure
• Qbox – quantum molecular dynamics for determination of material equation of state
• DDCMD – molecular dynamics for material dynamics
• Miranda – 3D continuum fluid dynamics for interfacial mixing
• ALE3D – 3D continuum mechanics for ignition and detonation propagation of explosives
• LAMMPS – molecular dynamics for shock initiation in high explosives
• ParaDiS – dislocation dynamics for high-pressure strength in materials
Single Sequoia Platform Mandatory Requirement is P ≥ 20
. P is "peak" of the machine measured in petaFLOP/s
. Target requirement is P + S ≥ 40
• S is weighted average of five "marquee" benchmark codes
• Four code package benchmarks
− UMT, IRS, AMG, and SPhot
− Program goal is 24x the Purple capability throughput
• One "science workload" benchmark from SNL
− LAMMPS (molecular dynamics)
− Program goal is 20x-50x BGL for science capability
Purple – 100 TF/s; BlueGene/L – 367 TF/s
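As a sketch of how the target requirement combines peak and sustained performance, the following computes S as a weighted average of marquee-benchmark speedups. The weights and speedup values are illustrative placeholders, not the actual RFP figures:

```python
# Illustrative S figure of merit: weighted average of speedups on the
# five "marquee" benchmarks. Weights and speedups are made up here.
marquee = {          # benchmark: (weight, speedup vs. reference system)
    "UMT":    (0.2, 42.0),
    "IRS":    (0.2, 38.0),
    "AMG":    (0.2, 35.0),
    "SPhot":  (0.2, 40.0),
    "LAMMPS": (0.2, 30.0),
}
S = sum(w * s for w, s in marquee.values())
P = 20.0                       # peak in PF/s; mandatory requirement P >= 20

print(S)                       # weighted-average figure of merit
print(P + S >= 40)             # target requirement: P + S >= 40
```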
Sequoia Operating System Perspective
. Light weight kernel on compute nodes
• Optimized for scalability and reliability
• As simple as possible
• Extremely low OS noise
• Direct access to interconnect hardware
• OS features:
− NPTL Posix threads, glibc dynamic loading
− Linux/Unix syscall compatible w/ function-shipped I/O syscalls
− Support for dynamic lib runtime loading
− Shared memory regions
• Open source

[Diagram: applications 1-N per compute node atop GLIBC, Posix threads, OpenMP, SE/TM, MPI/ADI, shared memory, futex, and RAS, with function-shipped syscalls over the Sequoia CN hardware transport and interconnect (SMP).]

. Linux/Unix OS on I/O nodes
• Leverage large Linux/Unix base & community
• Enhance TCP offload, PCIe, I/O
• Standard file systems - Lustre, NFSv4, etc.
• Aggregates N CNs for I/O & admin
• Open source

[Diagram: FSD, SLURMD, perf tools, and TotalView atop the Lustre client, NFSv4, function-shipped syscalls, LNet, UDP, and TCP/IP on Linux/Unix, over the Sequoia ION and interconnect.]
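The function-shipped syscalls mentioned above capture a key idea: the lightweight compute-node kernel does not implement I/O itself, but forwards each syscall to an I/O node running full Linux. A minimal conceptual sketch follows; the real CN/ION wire protocol is not described on these slides, and the marshaling scheme below is our invention:

```python
# Conceptual sketch of function-shipped I/O syscalls. The compute-node
# side marshals the request; the I/O-node side performs the real syscall
# and returns the result. A BytesIO stands in for a parallel file system.
import io, pickle

def cn_write(data, ship):
    """Compute-node side: marshal the write() and ship it to the ION."""
    request = pickle.dumps(("write", data))
    return ship(request)

def ion_serve(fd_obj):
    """I/O-node side: unpack shipped requests and run the real syscall."""
    def handle(request):
        op, data = pickle.loads(request)
        if op == "write":
            return fd_obj.write(data)   # the actual Linux I/O happens here
        raise ValueError(op)
    return handle

buf = io.BytesIO()                      # stands in for a Lustre file
handler = ion_serve(buf)
n = cn_write(b"checkpoint block", handler)
print(n, buf.getvalue())                # bytes written, file contents
```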
Sequoia Software Stack – Applications Perspective
[Diagram, application-perspective software stack: in user space, the application sits on code development tools (C/C++/Fortran compilers, Python), parallel and optimized math libraries, OpenMP/threads/SE/TM, the Clib/F03 runtime, MPI2 over ADI, and sockets; in kernel space, function-shipped syscalls, the Lustre client, LNet, TCP/UDP/IP, SLURM/Moab, and the Linux/Unix or lightweight kernel (LWK) with RAS and control system sit over the interconnect interface and external network; the code dev tools infrastructure spans both.]
The tools that users know and love will be available on Sequoia with improvements and additions as needed
[Chart: tools arranged by operational scale (1 to 10^7) against debugging, performance, and infrastructure features. Existing tools: TotalView, Valgrind MemCheck and ThreadCheck, gprof, Stack Walker, Dyninst, PAPI, mpiP, PMPI, TAU, O|SS, OTF, DPCL, MRNet, LaunchMON, STAT. New tools: TV memlight, memP, new lightweight focus tools, APAI, OpenMP Analyzer, SE/TM Analyzer, OpenMP Profiling Interface, SE/TM Monitor, SE/TM Debugger.]
Application programming requirements and challenges
. Availability of 1.6M cores pushes all-MPI codes to extreme concurrency
. Availability of many threads on many SMP cores encourages low-level parallelism for higher performance
. Mixed MPI/SMP programming environment and the possibility of heterogeneous compute distribution brings load imbalance to the fore
. I/O and visualization requirements encourage innovative strategies to minimize memory and bandwidth bottlenecks

[Diagram pyramid: MPI scaling -> SMP threads -> hybrid models -> I/O & visualization]
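A hedged sketch of the load-imbalance concern: with heterogeneous work per MPI task, the slowest task gates each step, so one simple metric is the maximum over the mean work per task. The metric choice and the numbers below are illustrative, not from the slide:

```python
# Simple load-imbalance metric: max(work) / mean(work) across MPI tasks.
# A value of 1.0 is perfect balance; 1.5 means the slowest task does 50%
# more than average, and other tasks idle while waiting for it.
def imbalance(work_per_task):
    mean = sum(work_per_task) / len(work_per_task)
    return max(work_per_task) / mean

balanced   = [100, 101, 99, 100]
imbalanced = [100, 100, 100, 180]   # one overloaded task

print(round(imbalance(balanced), 2))    # close to 1.0
print(round(imbalance(imbalanced), 2))  # 1.5
```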
The RFP asked interested vendors to address a "Unified Nested Node Concurrency" model

[Diagram: one MPI task (a process) with threads Thread0-Thread3 between MPI_INIT and MPI_FINALIZE; the threads pass through Pthreads, OpenMP, and TM/SE phases, and only Thread0 makes MPI calls.]
1) Pthreads born with MAIN
2) Only Thread0 calls functions to nest parallelism
3) Pthreads-based MAIN calls OpenMP-based Funct1
4) OpenMP Funct1 calls TM/SE-based Funct2
5) Funct2 returns to OpenMP-based Funct1
6) Funct1 returns to Pthreads-based MAIN

. MPI tasks on a node are processes (one is shown) with multiple OS threads (Thread0-3 shown)
. Thread0 is the "main thread" and Thread1-3 are helper threads that morph from Pthread to OpenMP worker to TM/SE compiler-generated threads via runtime support
. Hardware support significantly reduces overheads for thread repurposing and for OpenMP loops and locks
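A loose Python analogue of the thread-repurposing idea, assuming only the concept from the bullets above: one pool of helper threads serves successive parallel phases (standing in for OpenMP and TM/SE regions) instead of each runtime spawning its own. This is our illustration, not the vendors' runtime design:

```python
# One "main thread" (Thread0) repurposes the same helper threads for
# successive nested parallel phases, rather than creating new threads
# per programming model.
from concurrent.futures import ThreadPoolExecutor

def funct1(x):          # stands in for an OpenMP-parallel region
    return x * x

def funct2(x):          # stands in for a TM/SE-parallel region
    return x + 1

with ThreadPoolExecutor(max_workers=3) as helpers:   # Thread1-3
    # MAIN (Thread0) nests parallelism by handing work to the helpers:
    phase1 = list(helpers.map(funct1, range(4)))     # helpers act as OpenMP workers
    phase2 = list(helpers.map(funct2, phase1))       # same helpers, repurposed

print(phase1)  # [0, 1, 4, 9]
print(phase2)  # [1, 2, 5, 10]
```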
Previous systems have prepared the way for Sequoia
. BG/L experience informs Dawn/Sequoia scalability
. OpenMP & Posix threads experience on Linux/AIX
. Integrated codes regularly run at Purple capability
. Dawn will be used for code development
• SMP parallelism
• Python
• Larger memory per core than BG/L
• Some critical UQ analysis as well
. Sequoia will be a Tri-Lab ASC resource
• Video conferences for coordination
A diverse team and a new Scalable Application Preparation Project ensure success on Sequoia
. LC Hotline, User Training and Documentation address routine issues
. ADEPT team provides expertise in compilers, debuggers, performance tools
. Access to IBM experts, including an on-site IBM applications analyst
. Staff to work closely with the application teams
. Ongoing ANL/IBM/LLNL BlueGene collaboration
. Engaging third-party vendors, university research partners, and the open source community
New Petascale Computing Enabling Technologies (PCET) LDRD is addressing key barriers to predictive simulation

[Chart: core counts from 10^3 to 10^8 (Purple ~10^5 and BG/L ~10^6 cores; petascale ~10^7, exascale ~10^8) against rising barriers: debugging, load balance, fault tolerance, multicore, vector FP units/accelerators?, power?]
PCET creates essential capabilities for exascale core counts
PCET strategy mitigates risk to assure immediate impact on application drivers and longer term success
Current capabilities:
• MPI large-grain parallelism
• Basic checkpoint/restart
• Ill-defined load imbalances
• Debugging < 4096 cores

Shorter-term payoff:
• Cache-oblivious data layouts
• Checkpoint compression
• Load balance analysis
• Behavioral and performance equivalence classes

Terascale capabilities:
• Multicore-adapted algorithms
• Faster checkpoint/restart
• Understood load imbalances
• Targeted debugging

Petascale capable & exascale prepared:
• Multicore-aware algorithms
• Application-level fault tolerance
• Well-balanced application load
• Automated error analysis
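Checkpoint compression, listed above as a shorter-term payoff, can be sketched as compressing serialized state before it is written to cut I/O volume. zlib is our stand-in here; the slide does not name the project's actual compression scheme:

```python
# Sketch of checkpoint compression: serialize application state, compress
# it losslessly, and verify the round trip before relying on it.
import zlib, pickle

state = {"step": 4096, "field": [0.0] * 10000}   # toy checkpoint state
raw = pickle.dumps(state)
compressed = zlib.compress(raw, level=6)

# Lossless round trip: the restart must see exactly the saved state.
restored = pickle.loads(zlib.decompress(compressed))
assert restored == state

print(len(raw), len(compressed))   # repetitive field data compresses well
```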
Take-away: Computational science on Sequoia at full scale will be the culmination of many years of hard work
[Diagram: rigorous R&D contracts and innovative or evolutionary architecture ideas feed review; milestone progress drives computational science R&D; initial delivery & integration (Dawn ID, "we're here") proceeds with periodic reviews and flexible contracts with targets as requirements.]