Operating Systems – the Code We Love to Hate
U.S. Department of Energy, Office of Science
Operating Systems – the Code We Love to Hate
(Operating and Runtime Systems: Status, Challenges and Futures)
Fred Johnson
Senior Technical Manager for Computer Science
Office of Advanced Scientific Computing Research, DOE Office of Science
Salishan – April 2005

The Office of Science
• Supports basic research that underpins DOE missions.
• Constructs and operates large scientific facilities for the U.S. scientific community: accelerators, synchrotron light sources, neutron sources, etc.
• Six offices: Basic Energy Sciences; Biological and Environmental Research; Fusion Energy Sciences; High Energy Physics; Nuclear Physics; Advanced Scientific Computing Research.

Simulation Capability Needs – FY2005 Timeframe
• Climate Science – simulation need: calculate chemical balances in the atmosphere, including clouds, rivers, and vegetation; sustained capability needed: > 50 Tflops; significance: provides U.S. policymakers with leadership data to support policy decisions, and properly represents and predicts extreme weather conditions in a changing climate.
• Magnetic Fusion Energy – simulation need: optimize the balance between self-heating of plasma and heat leakage caused by electromagnetic turbulence; sustained capability needed: > 50 Tflops; significance: underpins U.S. decisions about future international fusion collaborations; integrated simulations of burning plasma are crucial for quantifying prospects for commercial fusion.
• Combustion Science – simulation need: understand interactions between combustion and turbulent fluctuations in burning fluid; sustained capability needed: > 50 Tflops; significance: understand detonation dynamics (e.g., engine knock) in combustion systems and solve the "soot" problem in diesel engines.
• Environmental Molecular Science – simulation need: reliably predict chemical and physical properties of radioactive substances; sustained capability needed: > 100 Tflops; significance: develop innovative technologies to remediate contaminated soils and groundwater.
• Astrophysics – simulation need: realistically simulate the explosion of a supernova for the first time; sustained capability needed: >> 100 Tflops; significance: measure the size, age, and rate of expansion of the Universe, and gain insight into inertial fusion processes.

CS research program
• Operating Systems: FAST-OS; Scalable System Software ISIC; Science Appliance
• Programming Models: MPI/ROMIO/PVFS II
• Runtime libraries: MPI, Global Arrays (ARMCI), UPC (GASNet)
• Performance tools and analysis
• Program development
• Data management and visualization

Outline
• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

OS Noise/Jitter
• The latest two sets of measurements are consistent (~70% longer than the model predicts).
• Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q", in Proc. SuperComputing, Phoenix, November 2003.
[Figure: SAGE on QB, 1-rail (timing.input) – cycle time (s) vs. number of PEs (1 to 10,000) for the model and the 21-Sep and 25-Nov measurements; lower is better. There is a difference – why?]
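The noise that the Petrini/Kerbyson/Pakin study tracks down can be seen even on a single node with a fixed-work-quantum test: time many identical chunks of computation and look at the spread of the timings. The sketch below is only an illustration of that idea, not the instrumentation used in the study; it assumes a POSIX system with clock_gettime, and the constants ITERS and WORK are arbitrary.

```c
/* noise.c -- minimal fixed-work-quantum sketch for spotting OS noise.
 * Illustration only: time many identical work quanta; on a quiet node
 * the max is close to the min, while daemons, interrupts, and other
 * kernel activity show up as large outliers.
 * Build: cc -O2 noise.c -o noise   (add -lrt on older glibc) */
#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define WORK  10000              /* size of one fixed work quantum */

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    volatile double x = 0.0;     /* volatile keeps the loop from being optimized away */
    double min = 1e30, max = 0.0, total = 0.0;

    for (long i = 0; i < ITERS; i++) {
        double t0 = now_us();
        for (long j = 0; j < WORK; j++)   /* constant amount of work */
            x += j * 0.5;
        double dt = now_us() - t0;
        total += dt;
        if (dt < min) min = dt;
        if (dt > max) max = dt;
    }
    printf("quantum time (us): min %.2f  avg %.2f  max %.2f\n",
           min, total / ITERS, max);
    return 0;
}
```

In a bulk-synchronous code such as SAGE, the slowest quantum on any of the thousands of processors sets the pace for all of them, which is how small per-node outliers become the large gap between model and measurement in the figure above.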
OS/Runtime Thread Placement
G. Jost et al., "What Multilevel Parallel Programs Do When You Are Not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP", WOMPAT 2004. http://www.tlc2.uh.edu/wompat2004/Presentations/jost.pdf
• Multi-zone NAS Parallel Benchmarks
• Coarse-grain parallelism between zones
• Fine-grain loop-level parallelism in solver routines
• MLP is 30%–50% faster in some cases. Why?

MPI/MLP Differences
MPI:
• Initially not designed for NUMA architectures or for mixing threads and processes
• The API does not provide support for memory/thread placement
• Thread and memory placement is left to vendor-specific APIs
MLP:
• Designed for NUMA architectures and hybrid programming
• The API includes a system call to bind threads to CPUs
• Binds threads to CPUs ("pinning") to improve the performance of hybrid codes

Results
Thread binding and data placement:
• Bad: the initial process assignment to CPUs is sequential, so subsequent thread placement separates threads from their data.
• Good: the initial process assignment to CPUs leaves room for the multiple threads/data of a logical process to be "close to each other".
"The use of detailed analysis techniques helped to determine that initially observed performance differences were not due to the programming models themselves but rather to other factors."

NWChem Architecture
• Object-oriented design – abstraction, data hiding, handles, APIs
• Parallel programming model – non-uniform memory access, Global Arrays, MPI
• Infrastructure – GA, parallel I/O, RTDB, MA, ...
• Program modules – communication only through the run-time database; persistence for easy restart
[Diagram: layered architecture from generic tasks (energy, structure, …) through molecular calculation modules (SCF energy/gradient, DFT energy/gradient, MD, NMR, solvation, optimize, dynamics, …) and the molecular modeling toolkit (run-time database, basis set object, integral API, geometry object) down to the molecular software development toolkit (PeIGS, parallel I/O, memory allocator, Global Arrays).]

Outline
• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

The 100,000 foot view
Full-featured OS (Unix, Linux):
• Full set of shared services
• Complexity is a barrier to understanding new hardware and software
• Subject to "rogue/unexpected" effects
Lightweight OS:
• Small set of shared services
• Puma/Cougar/Catamount – Red, Red Storm
• CNK – BlueGene/L
Open source vs. proprietary.
Interaction of system architecture and OS:
• Clusters, ccNUMA, MPP, distributed/shared memory, bandwidth, …
• The taxonomy is still unsettled: "High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions", Dongarra, Sterling, Simon, Strohmaier, CISE, March/April 2005.

Bird's-eye view of a typical software stack on a compute node
Kernel characteristics:
• Monolithic
• Privileged operations for protection
• Software tunable
• General-purpose algorithms
• Parallel knowledge in resource management and runtime
• OS bypass for performance
• Single system image?
[Diagram: HPC system software elements – user space holds scientific applications, parallel software development/compilers, resource management/scheduler, and runtime support (performance tools, debuggers, programming-model runtime, checkpoint/restart, I/O, OS bypass); the OS kernel holds the CPU scheduler, virtual memory manager, IPC, file systems, sockets, TCP/UDP, and checkpoint/restart; a hypervisor may sit below; all on top of the node hardware architecture: CPU, cache, memory, interconnect adaptor, …]
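MLP's advantage in the thread-placement study above came from explicitly binding threads to CPUs so that a logical process and its data stay close together. On a Linux-based stack like the one just sketched, the same effect is commonly obtained with the CPU-affinity interfaces; the sketch below is a minimal illustration assuming glibc's GNU extensions (_GNU_SOURCE, CPU_SET, pthread_setaffinity_np) and a simple one-thread-per-CPU mapping, not MLP's or any vendor's actual binding API.

```c
/* pin.c -- minimal Linux sketch of the kind of CPU pinning MLP performs
 * for hybrid codes.  Assumes glibc/GNU extensions; NTHREADS and the
 * thread-to-CPU mapping are arbitrary choices for illustration.
 * Build: cc -O2 -pthread pin.c -o pin */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long cpu = (long)arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);            /* restrict this thread to one CPU */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    printf("thread pinned to CPU %ld (running on CPU %d)\n",
           cpu, sched_getcpu());
    /* ... loop-level work on data placed near this CPU ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    /* Keep the threads of this process on adjacent CPUs so their data
     * stays "close", as in the good placement case described above. */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```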
A High-End Cluster
[Diagram: SMP compute servers, O(10–10^3 TF), on a compute switch fabric; a secure control switch fabric connecting metadata servers, a global name space, and visualization servers; user access, NFS, and DFS agents; HPSS archival storage with RAITs; SGPFS; and disk/memory servers providing O(1–10^2 PB) of global storage for user data, scratch storage, and out-of-core memory – disks, RAIDs, locks.]

Clustering software
Classic:
• Lots of local disks, with a full root file system and a full OS on every node
• All the nodes are really, really independent (and visible)
Clustermatic:
• Clean, reliable BIOS that boots in seconds (LinuxBIOS)
• Boot directly into Linux so the HPC network can be used to boot
• See the entire cluster's process state from one node (bproc)
• Fast, low-overhead monitoring (Supermon)
• No NFS root – root is a local ramdisk

LANL SCIENCE APPLIANCE – Scalable Cluster Computing using Clustermatic
[Diagram: a master node running compilers, debuggers, applications, BJS, Supermon, Beoboot, MPI, and BProc; slave nodes each running v9fs, monitoring, and parallel I/O clients; every node running Linux on LinuxBIOS, connected by high-speed interconnect(s); a v9fs environment fileserver and a high-performance parallel fileserver.]
Thanks to Ron Minnich.

LWK approach
• Separate policy decision from policy enforcement
• Move resource management as close to the application as possible
• Protect applications from each other
• Get out of the way

History of Sandia/UNM Lightweight Kernels
Thanks to Barney Maccabe and Ron Brightwell.

LWK Structure
[Diagram: the PCT and applications (App. 1, App. 2, App. 3), each linked against libc.a plus libmpi.a, libnx.a, or libvertex.a, running on the Q-Kernel, which provides message passing and memory protection.]

LWK Ingredients
Quintessential Kernel (Qk):
• Policy enforcer
• Initializes hardware
• Handles interrupts and exceptions
• Maintains hardware virtual addressing
Process Control Thread (PCT):
• Runs in user space
• More privileges than user applications
• Policy maker
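The Qk/PCT split above is the classic separation of policy from mechanism: the PCT decides which process should run next, and the Qk only carries out that decision. The toy sketch below illustrates the shape of that split as an ordinary host program; the names (pct_pick_next, qk_dispatch) and the round-robin policy are made up for illustration and are not Catamount code.

```c
/* lwk_toy.c -- toy illustration of separating policy from enforcement.
 * A PCT-like policy maker chooses the next runnable process; a Qk-like
 * enforcer only dispatches whatever it is handed.  Host-program sketch,
 * not an actual lightweight kernel. */
#include <stdio.h>

#define NPROC 3

struct proc { int id; int runnable; };

/* "PCT": policy maker -- picks the next runnable process (round robin). */
static int pct_pick_next(struct proc p[], int last)
{
    for (int i = 1; i <= NPROC; i++) {
        int cand = (last + i) % NPROC;
        if (p[cand].runnable)
            return cand;
    }
    return -1;                       /* nothing to run */
}

/* "Qk": policy enforcer -- dispatches the chosen process, no decisions. */
static void qk_dispatch(struct proc *p)
{
    printf("Qk: switching to process %d\n", p->id);
    /* a real kernel would set up the address space, restore registers, ... */
}

int main(void)
{
    struct proc p[NPROC] = { {0, 1}, {1, 1}, {2, 1} };
    int cur = -1;

    for (int tick = 0; tick < 6; tick++) {   /* a few scheduling "ticks" */
        cur = pct_pick_next(p, cur);
        if (cur >= 0)
            qk_dispatch(&p[cur]);
    }
    return 0;
}
```

Because the policy lives outside the kernel, a different PCT (batch, gang-scheduled, space-shared) can be substituted without touching the enforcement path, which is the point of the LWK approach listed above.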