U.S. Department of Energy

Office of Science

Operating Systems – the Code We Love to Hate

(Operating and Runtime Systems: Status, Challenges and Futures)

Fred Johnson

Senior Technical Manager for Science, Office of Advanced Scientific Computing Research, DOE Office of Science

Salishan – April 2005

The Office of Science

• Supports basic research that underpins DOE missions.
• Constructs and operates large scientific facilities for the U.S. scientific community.
  – Accelerators, synchrotron light sources, neutron sources, etc.
• Six Offices:
  – Basic Energy Sciences
  – Biological and Environmental Research
  – Fusion Energy Sciences
  – High Energy Physics
  – Nuclear Physics
  – Advanced Scientific Computing Research

Simulation Capability Needs – FY2005 Timeframe

• Climate Science
  – Simulation need: Calculate chemical balances in the atmosphere, including clouds, rivers, and vegetation.
  – Sustained capability needed: > 50 Tflops
  – Significance: Provides U.S. policymakers with leadership data to support policy decisions. Properly represent and predict extreme weather conditions in a changing climate.
• Magnetic Fusion Energy
  – Simulation need: Optimize the balance between self-heating of the plasma and heat leakage caused by electromagnetic turbulence.
  – Sustained capability needed: > 50 Tflops
  – Significance: Underpins U.S. decisions about future international fusion collaborations. Integrated simulations of burning plasma are crucial for quantifying the prospects for commercial fusion.
• Combustion Science
  – Simulation need: Understand interactions between combustion and turbulent fluctuations in burning fluid.
  – Sustained capability needed: > 50 Tflops
  – Significance: Understand detonation dynamics (e.g. engine knock) in combustion systems. Solve the "soot" problem in diesel engines.
• Environmental Molecular Science
  – Simulation need: Reliably predict chemical and physical properties of radioactive substances.
  – Sustained capability needed: > 100 Tflops
  – Significance: Develop innovative technologies to remediate contaminated soils and groundwater.
• Astrophysics
  – Simulation need: Realistically simulate the explosion of a supernova for the first time.
  – Sustained capability needed: >> 100 Tflops
  – Significance: Measure the size, age, and rate of expansion of the Universe. Gain insight into inertial fusion processes.

CS research program

• Operating Systems
  – FAST-OS
  – Scalable System Software ISIC
  – Science Appliance
• Programming Models
  – MPI/ROMIO/PVFS II
• Runtime libraries
  – MPI, Global Arrays (ARMCI), UPC (GASNet)
• Performance tools and analysis
• Program Development
• Data management and visualization

Outline

• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

OS Noise/Jitter

• Latest two sets of measurements are consistent (~70% longer than the model); a sketch of this kind of noise measurement appears below the figure.

• Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q", in Proc. SC2003, Phoenix, November 2003.

[Figure: SAGE on QB, 1-rail (timing.input) – cycle time (s) versus number of PEs (1 to 10000) for the performance model and the 21-Sep and 25-Nov measurements. There is a difference – why? Lower is better.]
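A measurement of this kind can be approximated with a simple probe: time many identical work quanta back to back and see how far the slow iterations stray from the fastest one, which is essentially free of interference. A minimal sketch, not the SAGE or ASCI Q harness; the iteration counts and work size are arbitrary assumptions:

```c
/* noise.c -- crude OS noise/jitter probe: time many identical work quanta
 * and report how far the slow outliers stray from the minimum (assumed
 * interference-free) time.  ITERS and WORK are arbitrary choices. */
#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define WORK  20000

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static double t[ITERS];
    volatile double x = 0.0;

    for (int i = 0; i < ITERS; i++) {
        double t0 = now();
        for (int j = 0; j < WORK; j++)      /* fixed work quantum */
            x += j * 0.5;
        t[i] = now() - t0;
    }

    double min = t[0], max = t[0], sum = 0.0;
    for (int i = 0; i < ITERS; i++) {
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
        sum += t[i];
    }
    printf("min %.3e s  mean %.3e s  max %.3e s  (max/min = %.1f)\n",
           min, sum / ITERS, max, max / min);
    return 0;
}
```

On a noisy node the max/min ratio is large, and across thousands of PEs some rank is almost always stuck in a slow quantum; that is where the gap between model and measurement comes from.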

OS/Runtime Placement

• G. Jost et al., "What Multilevel Parallel Programs Do When You Are Not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP", WOMPAT 2004.
  – http://www.tlc2.uh.edu/wompat2004/Presentations/jost.pdf
• Multi-zone NAS Parallel Benchmarks
  – Coarse-grain parallelism between zones
  – Fine-grain loop-level parallelism in solver routines
• MLP is 30%–50% faster in some cases. Why?

MPI/MLP Differences

• MPI:
  – Initially not designed for NUMA architectures or for mixing threads and processes
  – API does not provide support for memory/thread placement
  – Vendor-specific extensions needed to control thread and memory placement
• MLP:
  – Designed for NUMA architectures and hybrid programming
  – API includes support for binding threads to CPUs
  – Binds threads to CPUs (pinning, sketched below) to improve performance of hybrid codes
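On Linux, pinning amounts to something like the following sketch: each thread restricts itself to one CPU so its cache state and locally allocated memory stay put. This is not MLP's own binding API, and the simple thread-number-to-CPU mapping is an assumption that a real code would replace with the node's actual topology:

```c
/* pin.c -- bind each OpenMP thread of a process to its own CPU.
 * Assumes CPUs 0..N-1 map one-to-one onto the cores this process may use;
 * a hybrid MPI+OpenMP code would offset by the rank's CPU allocation. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);                     /* one CPU per thread */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");
        #pragma omp critical
        printf("thread %d pinned to cpu %d (running on cpu %d)\n",
               tid, tid, sched_getcpu());
    }
    return 0;
}
```

Build with gcc -fopenmp pin.c; the same effect can be had from vendor-specific MPI launch options or environment variables.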

Results

• Thread binding and data placement:
  – Bad: the initial assignment to CPUs is sequential, so subsequent thread placement separates threads from their data.
  – Good: the initial process assignment to CPUs leaves room for the multiple threads/data of a logical process to be "close to each other". (First-touch allocation, sketched below, is how the data side of this is usually achieved.)
• "The use of detailed analysis techniques helped to determine that initially observed performance differences were not due to the programming models themselves but rather to other factors."

[Diagram: CPU layouts for processes P1 and P2 under the bad and good placements.]
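The data half of the "good" placement usually relies on first-touch allocation: on most NUMA systems a page is placed on the memory of the CPU that first writes it, so initializing an array with the same loop schedule and the same (already pinned) threads that will later compute on it keeps data next to its threads. A short illustrative sketch under that assumption:

```c
/* firsttouch.c -- parallel first-touch initialization so each page of the
 * array lands on the NUMA node of the thread that will later use it. */
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);       /* pages not yet placed */

    /* First touch: same static schedule as the compute loop below, so each
     * thread faults in exactly the pages it will later read and write. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```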

NWChem Architecture

• Object-oriented design
  – Abstraction, data hiding, handles, APIs
• Parallel programming model
  – Non-uniform memory access, Global Arrays, MPI
• Infrastructure
  – GA, Parallel I/O, RTDB, MA, ...
• Program modules
  – Communication only through the run-time database; persistence for easy restart

[Diagram: generic tasks (energy, structure, ...); molecular calculation modules (SCF energy/gradient, DFT energy/gradient, MD, NMR, solvation, optimization, dynamics, ...); the run-time database; and the molecular modeling and molecular software development toolkits (PeIGS, basis set object, integral API, geometry object, parallel I/O, memory allocator, Global Arrays).]

Outline

• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

The 100,000 foot view

• Full-featured OS
  – Full set of shared services
  – Complexity is a barrier to understanding new hardware and software
  – Subject to "rogue/unexpected" effects
• Lightweight OS
  – Small set of shared services
  – Puma/Cougar/Catamount – Red, Red Storm
  – CNK – BlueGene/L
• Open source / proprietary
• Interaction of system architecture and OS
  – Clusters, ccNUMA, MPP, distributed/..., bandwidth, ...
  – The taxonomy is still unsettled
  – "High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions", Dongarra, Sterling, Simon, Strohmaier, CISE, March/April 2005.

Birds eye view of a typical software stack on a compute node

[Diagram: HPC system software elements on a compute node – scientific applications; parallel software development/compilers; resource management/scheduler; runtime support (programming model runtime, performance tools, debuggers, checkpoint/restart, OS bypass); OS kernel (CPU scheduler, virtual memory manager, IPC, I/O and file systems, sockets, TCP/UDP); hypervisor; node hardware (CPU, cache, memory, interconnect adaptor, ...).]

Kernel characteristics:
• Monolithic
• Privileged operations for protection
• Software tunable
• General-purpose algorithms
• Parallel knowledge in resource management and runtime
• OS bypass for performance
• Hypervisor?

A High-End Cluster

[Diagram: compute switch fabric connecting SMP compute servers, O(10–10^3) TF; a secure control switch fabric; SGPFS metadata servers, global name space storage, and visualization servers; HPSS archival storage (metadata agents, user access agents, NFS/DFS, RAITs) on its own switch fabric; and disk memory servers, O(1–10^2) PB, holding user data, scratch storage, out-of-core memory, disks, RAIDs, and locks.]

Clustering software

• Classic
  – Lots of local disks with full root file systems and a full OS on every node
  – All the nodes are really, really independent (and visible)
• Clustermatic
  – Clean, reliable BIOS that boots in seconds (LinuxBIOS)
  – Boot directly into Linux so you can use the HPC network to boot
  – See the entire cluster process state from one node (bproc)
  – Fast, low-overhead monitoring (Supermon)
  – No NFS root – root is a local ramdisk

LANL Science Appliance – Scalable Cluster Computing using Clustermatic

[Diagram: Clustermatic stack – compilers, debuggers, and applications over BJS, Supermon, Beoboot, MPI, and BProc; each master/slave node runs v9fs, mon, and parallel I/O clients on Linux booted by LinuxBIOS, connected by high-speed interconnect(s); a v9fs server provides the environment fileserver and a parallel I/O server provides the high-performance parallel fileserver.]

Thanks to Ron Minnich

LWK approach

• Separate policy decisions from policy enforcement
• Move resource management as close to the application as possible
• Protect applications from each other
• Get out of the way

History of Sandia/UNM Lightweight Kernels

Thanks to Barney Maccabe and Ron Brightwell

LWK Structure

[Diagram: applications (App. 1, App. 2, App. 3), each linked against libc.a plus libmpi.a, libnx.a, or libvertex.a; a PCT (process control thread); and the Q-Kernel underneath, providing message passing, ...]

LWK Ingredients

• Quintessential Kernel (Qk)
  – Policy enforcer
  – Initializes hardware
  – Handles interrupts and exceptions
  – Maintains hardware virtual addressing
  – No virtual memory paging
  – Static size
  – Small size
  – Non-blocking
  – Few, well-defined entry points
• Process Control Thread (PCT)
  – Runs in user space
  – More privileges than user applications
  – Policy maker: process loading, process scheduling, virtual address space management
  – Changes the behavior of the OS without changing the kernel
• Portals
  – Basic building blocks for any high-level message passing system
  – All structures are in user space
  – Avoids costly memory copies
  – Avoids costly context switches to user mode (up-call)

Key Ideas

• Kernel is small and reliable
  – Protection
• Kernel has a static size
  – No structures depend on how many processes are running
• All message passing structures are in user space
• Resource management pushed out of the kernel to the process and the PCT
• Services pushed out of the kernel to the PCT and the runtime system

BGL/RS LWK System Call Comparison

             BGL - ship   BGL - lwk   BGL - not   Total RS
RS - ship        26            4           6          36
RS - lwk          3            6           7          16
RS - not          5            0          45          50
RS - ???         11            6          81          98
Total BGL        45           16         139         200

Source: Mary Zosel, LLNL

Sidebar: resource management software

[Diagram: the same compute-node HPC system software stack as before (applications, compilers, resource management/scheduler, user-space runtime support, OS kernel, hypervisor, node hardware), here with the resource management/scheduler highlighted; kernel characteristics as listed earlier, plus "single system image?".]

Current status of system/resource management software for large machines

• Both proprietary and open-source systems
  – Machine-specific, PBS, LSF, POE, SLURM, COOAE (Collections Of Odds And Ends), ...
• Many are monolithic "resource management systems," combining multiple functions
  – Job queuing, scheduling, process management, node monitoring, job monitoring, accounting, configuration management, etc.
• A few established separate components exist
  – Maui scheduler
  – Qbank accounting system
• Many home-grown, local pieces of software
• Scalability is often a weak point

System Software Architecture

[Diagram: system software components – access control and security manager; meta scheduler, meta monitor, and meta manager (interacting with all components); node configuration & build manager; accounting; scheduler; system and job monitors; allocation management; job queue manager; user DB; data migration; usage reports; user utilities; high-performance communication; checkpoint/restart; file system & I/O; testing & validation; and the application environment.]

Outline

• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

Who's doing what – Top500 11/2004

• Full Unix
  – NEC – Earth Simulator (3)
  – IBM – White/Seaborg (20, 21)
  – HP/Compaq – Q (6)
• Lightweight Kernel
  – IBM – BlueGene/L (1)
  – Intel – ASCI Red (78)
• Linux
  – SGI – Columbia (2)
  – HP – PNNL (16)
  – Misc. in Top 25 – LLNL, LANL, NCSA, ARL, UCSD
  – (Lots and lots of Linuxes)

Top500 Trends

• Estimated fraction of Linux in the Top500: 60%

The Death of High-end OS Research

• Large effort required
  – 100's of person-years
• Unix, Linux pervasive
  – Picking a winner too early
• Services and standards
  – Users want a rich set of services
  – "To be a viable computer system, one must honor a huge list of large, and often changing standards: TCP/IP, HTTP, HTML, XML, CORBA, Unicode, POSIX, NFS, SMB, MIME, POP, IMAP, X, … A huge amount of work, but if you don't honor the standards, you're marginalized."*

* Rob Pike, "Systems Software Research is Irrelevant", August 2000. http://www.cs.bell-labs.com/cm/cs/who/rob/utah2000.pdf

Death (continued)

• Hardware access
  – OS developers rarely get access to large systems (they want to break them)
  – OSF only had access to 32-node systems
• Research vs. production quality
  – OS development focuses on features, not implementations
  – The OS becomes more complex due to poor implementations
• Linux
  – Structure: 1,000's of lines of code know the socket structure
  – Acceptance metric: performance on servers and desktops
  – Culture: Linux hackers rarely acknowledge OS research

Sidebar: OS Statistics

• Windows 3.1 (1990) ~ 3M lines of code
• Windows NT (1995) ~ 4M LOC
• Windows NT 5.0 (2000) ~ 20M LOC
• Windows XP (2002) ~ 40M LOC
• Red Hat 7.1 (2001) ~ 30M LOC
  – ~8000 person-years and >$1B cost if developed by proprietary means (COCOMO model; see the sketch below)
• Linux kernel (2.4.2) ~ 2.4M LOC
  – 1.4M lines of the kernel (~60%) are drivers

Source: More than a Gigabuck: Estimating GNU/Linux’s Size, www.dwheeler.com/sloc
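The person-year figure comes from the basic COCOMO effort formula applied to the source line counts. A rough sketch of that arithmetic, using the classic organic-mode coefficients; Wheeler applies the formula per package and sums the results, so feeding in the grand total, as below, only approximates his published number:

```c
/* cocomo.c -- basic COCOMO, organic mode:
 * effort (person-months) = 2.4 * KSLOC^1.05 */
#include <math.h>
#include <stdio.h>

static double person_years(double sloc)
{
    double ksloc = sloc / 1000.0;
    return 2.4 * pow(ksloc, 1.05) / 12.0;   /* person-months -> person-years */
}

int main(void)
{
    printf("Red Hat 7.1  (~30M SLOC):  ~%.0f person-years (single-block estimate)\n",
           person_years(30e6));
    printf("Linux 2.4.2  (~2.4M SLOC): ~%.0f person-years\n",
           person_years(2.4e6));
    return 0;
}
```

Build with gcc cocomo.c -lm; the dollar figure then follows from an assumed fully loaded cost per person-year.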

Outline

• Motivation – so what?
• OSes and Architectures
• Current State of Affairs
• Research directions
• Final points

OS Research Challenges

• Architecture
  – Support for architectural innovation is essential: multiprocessor cores, PIM systems, smart caches
  – Current OSes can stifle architecture research; the Linux page table abstraction is one example
• Enable multiple management strategies
  – Resource-constrained applications
  – OS/application management mismatch
  – Applications re-invent resource management
• Specialized HEC needs
  – New programming models
  – I/O services
  – Scalability
  – Fault tolerance

Details

• OS structure
  – Global/local OS, OS/runtime split
  – Adaptability, composability
  – Extending lightweight approaches
  – Protection boundaries and virtualization
• Scalability – what needs to scale?
• Fault tolerance
  – Checkpoint/restart (system, application, compiler support)
  – Run through failure, other forms of application fault tolerance
  – Migration
• APIs/Interfaces
  – Application/runtime, runtime/OS, OS/compiler, architecture
  – Tool interfaces
  – Environment information
• Hardware support for OS/runtime
  – Protection, network reliability, collective operations, atomic memory operations, transactional memory
• Application requirements
• Metrics
• Testbeds

Forum to Address Scalable Technology for runtime and Operating Systems (FAST-OS)

• Drivers:
  – Challenges of petascale systems
  – No effective high-end OS research
  – Several advanced architecture programs and advanced programming model activities (HPCS: Chapel, Fortress, X10) are doomed without OS/runtime innovation
  – Scaling efficiency is essential for success at the petascale
  – Must be addressed at all levels: architecture, OS, runtime system, application and algorithms
• Address both operating and runtime systems:
  – OS is primarily about protection of shared resources (memory, CPU, ...)
  – RT is primarily about the application support environment
• Long term:
  – Build a high-end OS/runtime community, including vendors, academics, and labs
  – 2010 timeframe: petaflops and beyond

FAST-OS Activities

• February 2002: (Wimps) Bodega Bay
• July 2002: (FAST-OS) Chicago
• March 2003: (SOS) Durango
• June 2003: (HECRTF)
• June 2003: (SCaLeS) DC
• July 2003: (FAST-OS) DC
• March 2004: Research Announcement
• 2nd half of 2004: Awards

• 10 teams, about $21M invested over 3 years
• Part of HEC-URA, with additional funding support from DARPA and NSA

FAST-OS Award Summary

• Colony – Lead: LLNL (Terry Jones); Partners: UIUC; Industry: IBM; Effort: Virtualization on minimal Linux with SSI services
• Config – Lead: SNL (Ron Brightwell); Partners: UNM, Caltech; Effort: Combine micro services to build app-specific OSes
• DAiSES – Lead: UTEP (Pat Teller); Partners: Wisconsin; Industry: IBM; Effort: Adaptation of OS based on Kperfmon & Kerninst
• K42 – Lead: LBNL (Paul Hargrove); Partners: Toronto, UNM; Industry: IBM; Effort: Enhance applicability of K42 for HEC OS research
• MOLAR – Lead: ORNL (Stephen Scott); Partners: LaTech, OSU, NCSU, UNM; Industry: Cray; Effort: Modules to configure and adapt Linux + RAS & fSM
• Peta-Scale SSI – Lead: ORNL (Scott Studham); Partners: Rice; Industry: HP, CFS, OpenSSI, SGI, Intel; Effort: Intersection of big (SMP) and small (node) kernels
• Right-Weight – Lead: LANL (Ron Minnich); Effort: Build application-specific Linux/Plan 9 kernels
• Scalable FT – Lead: PNNL (Jarek Nieplocha); Partners: LANL, UIUC; Industry: Quadrics, Intel; Effort: Implicit, explicit, incremental checkpointing & resilience
• SmartApps – Lead: Texas A&M (Lawrence Rauchwerger); Partners: LLNL; Industry: IBM; Effort: Vertical integration between SmartApps and K42
• ZeptoOS – Lead: ANL (Pete Beckman); Partners: Oregon; Effort: Ultralight Linux, collective runtime, measurement & FT

Areas of Emphasis

• Virtualization
  – Lightweight mechanisms for virtual resources
  – Better balance for a large set of small entities
• Adaptability
  – Apps go through different phases
  – Configurability versus adaptability
• Usage model & system management
  – Dedicated, space shared, time shared
  – A single system may have multiple OSes
• Metrics & measurement
  – Adaptation requires measurement
  – What and how to measure
• HPC Challenge: OS noise
  – Controlling asynchronous activities undertaken by the OS/runtime
• Fault handling
  – Checkpoint/restart – implicit (system); explicit (application; see the sketch below)
  – Run through failure
  – Prediction and migration
• Common API
  – Defining/supporting/using
• Single System Image
  – Managing complexity at scale
  – System view, application view
• Collective Runtime
  – Defining collective operations to support the runtime
• I/O
  – Compute node – I/O system interface
  – I/O offload
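Of these, explicit (application-level) checkpoint/restart is the one piece an application can do today without any OS help: the code periodically writes the minimal state it needs and knows how to resume from it. A bare-bones sketch; the file name, state layout, and checkpoint interval are invented:

```c
/* ckpt.c -- skeleton of explicit, application-level checkpoint/restart:
 * the application decides what state matters and when to save it. */
#include <stdio.h>

#define N 1000000
static double state[N];

static void checkpoint(int step)
{
    FILE *f = fopen("app.ckpt.tmp", "wb");
    if (!f) return;
    fwrite(&step, sizeof step, 1, f);
    fwrite(state, sizeof state, 1, f);
    fclose(f);
    rename("app.ckpt.tmp", "app.ckpt");   /* atomically replace the old checkpoint */
}

static int restart(void)
{
    int step = 0;
    FILE *f = fopen("app.ckpt", "rb");
    if (f) {
        if (fread(&step, sizeof step, 1, f) != 1)
            step = 0;
        if (fread(state, sizeof state, 1, f) != 1 && step != 0)
            step = 0;                     /* incomplete checkpoint: start over */
        fclose(f);
    }
    return step;                          /* 0 means start from scratch */
}

int main(void)
{
    for (int step = restart(); step < 10000; step++) {
        /* ... one timestep of real work on state[] ... */
        if (step % 500 == 0)
            checkpoint(step);
    }
    return 0;
}
```

Implicit (system-level) checkpointing moves the same responsibility into the OS/runtime, which is where the FAST-OS projects above come in.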

[Matrix: each FAST-OS project (Colony, Config, DAiSES, K42, MOLAR, Peta-Scale SSI, Rightweight, Scalable FT, SmartApps, ZeptoOS) rated High or Medium against the areas of emphasis – virtualization, adaptability, usage model, metrics, OS noise, fault handling, common API, single system image, collective runtime, and I/O.]

FAST-OS Statistics

• OSes (4): Linux (6.5), K-42 (2), custom (1), Plan 9 (0.5)
• Labs (7): ANL, LANL, ORNL, LBNL, LLNL, PNNL, SNL
• Universities (13): Caltech, Louisiana Tech, NCSU, Rice, Ohio State, Texas A&M, Toronto, UIUC, UTEP, UNM, U. of Chicago, U. of Oregon, U. of Wisconsin
• Industry (8): Bell Labs, Cray, HP, IBM, Intel, CFS (Lustre), Quadrics, SGI

Outline

• Final points
  – Virtualization/hypervisor – status, future
  – Research frameworks – K-42, Plan 9
  – I/O, file systems
  – Engaging the open source community

Birds Eye View of the Software Stack

[Diagram: the same compute-node HPC system software stack shown earlier – applications, compilers, resource management/scheduler, user-space runtime support, OS kernel, hypervisor, and node hardware – with the same kernel characteristics.]

Virtualization – hypervisors/virtual machines

• Full virtualization: VMware, VirtualPC
  – Run multiple unmodified guest OSes
  – Hard to efficiently virtualize x86
• Para-virtualization: UML, Xen
  – Run multiple guest OSes ported to a special architecture
  – The Xen/x86 architecture is very close to normal x86

Xen is being used by the FAST-OS PetaSSI project to simulate SSI on larger clusters.

[Diagram: single system image with process migration – OpenSSI over XenLinux (Linux 2.6.10) with Lustre in each virtual machine, all running on the Xen virtual machine monitor over the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]

Thanks to Scott Studham

Virtualization and HPC

• For latency-tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.
• Xen has little overhead when guest OSes are not contending for resources. This may provide a way to support multiple OSes on a single HPC system.

Overhead to build a Linux kernel on a guest OS: Xen 3%, VMware 27%, User Mode Linux 103%.

[Chart: latency (µs) for ring & random, ring, and ping-pong communication patterns versus the number of virtual machines on a node (2–6).]

Hypervisor status:
• HW support – IBM Power, Intel, AMD
• SW support – Xen support in the Linux 2.6 kernel

K-42

• Framework for OS research
• Full Linux compatibility – API and ABI
• Most OS functionality in a user-level library
• Object-oriented design at all levels
  – Policies/implementations of every physical/virtual resource instance can be customized to individual application needs
  – Enables dynamic adaptation to changing application needs (hot swapping)

K-42 (cont.)

• Research activity at Watson Lab led by Orran Krieger
• Research OS for PERCS, the IBM HPCS project
• K(itchawan)-42

Plan 9 (or Outer Space)

• Not like anything you've seen
• Not a mini-, micro-, or nano-kernel
• Core OS is a fixed-configuration set of "devices"
  – Means "anything that has to be is in the OS"
  – E.g. memory, TCP/IP stack, network hardware, etc.
• Everything else is a "server"
  – File systems, windowing systems, etc.

[Diagram: applications (App1, App2) and services, including net services, above a fixed-configuration kernel on fixed-configuration hardware.]

Thanks to Ron Minnich

Features

• What has to be in the OS goes in the OS
• Everything else is optional
  – If you need something, you pay for it; if not, not
• Options are configured per process group
  – The name for this is "private name spaces" (see the sketch below)
• There is no root user
  – There are no integer UIDs
  – There need not be a central "UID store"
• 38 system calls
  – Linux is at 300 and counting
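"Private name spaces" are ordinary, unprivileged operations in Plan 9: a process forks its name space group and then binds whatever it wants into it, and only that group sees the result. A minimal sketch in Plan 9 C, assuming the standard rfork(2) and bind(2) interfaces; the directory being bound is made up:

```c
/* ns.c -- Plan 9: give this process and its children a private name space,
 * then customize it with bind.  No root user or privilege is involved. */
#include <u.h>
#include <libc.h>

void
main(void)
{
	char *argv[] = { "rc", nil };

	/* Copy the current name space; later changes are private to us. */
	if (rfork(RFNAMEG) < 0)
		sysfatal("rfork: %r");

	/* Put our own tool directory in front of /bin -- only we see this. */
	if (bind("/usr/glenda/bin/rc", "/bin", MBEFORE) < 0)
		sysfatal("bind: %r");

	exec("/bin/rc", argv);		/* the shell inherits the private view */
	sysfatal("exec: %r");
}
```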

I/O software stacks

• I/O components are layered to provide the needed functionality
  – Layers insulate apps from eccentric low-level details
  – High-level libraries (HLLs) provide interfaces and models appropriate for their domains
  – Middleware provides a common interface and model for accessing underlying storage
  – The parallel file system (PFS) manages the hardware and provides raw bandwidth
• Maintain (most of) the I/O performance
  – Some high-level library features do cost performance
  – Opportunities for optimizations at all levels
• Parallel file system challenges:
  – Scaling the effective I/O rate
  – Scaling the metadata/management operation rate
  – Providing fault tolerance in large-scale systems

[Diagram: I/O stack – application, high-level I/O library, I/O middleware (MPI-IO), parallel file system, I/O hardware.]

Thanks to Rob Ross

Interfaces again

• POSIX I/O APIs aren't descriptive enough
  – They don't enable the description of noncontiguous regions in both memory and file (see the MPI-IO sketch below)
• POSIX consistency semantics are too great a burden
  – They require much additional communication and synchronization that is not really required by many HPC applications
• We will never reach peak I/O with POSIX at scale; it only penalizes the stubborn apps
• Alternative: use more relaxed semantics at the FS layer as the default, and build on top of that

[Chart: Tile Reader benchmark read bandwidth (MB/sec) through the I/O stack (application, high-level I/O library, MPI-IO middleware, parallel file system, I/O hardware) for POSIX, List I/O, and Structured I/O.]
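The contrast in the chart is essentially between issuing one POSIX read per small contiguous piece and describing the whole noncontiguous tile to the middleware once. A hedged MPI-IO sketch of the latter; the array dimensions, file name, and four-rank decomposition are invented, and this shows the style of access MPI-IO enables rather than the actual tile-reader code:

```c
/* tile.c -- read one 2-D tile of a larger array with a single collective
 * MPI-IO call by describing the noncontiguous file region as a datatype. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Global 4096x4096 array of doubles; assume 4 ranks, each reading a
     * 4096x1024 column tile. */
    int sizes[2]    = {4096, 4096};
    int subsizes[2] = {4096, 1024};
    int starts[2]   = {0, 1024 * (rank % 4)};

    MPI_Datatype tile;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &tile);
    MPI_Type_commit(&tile);

    double *buf = malloc((size_t)subsizes[0] * subsizes[1] * sizeof *buf);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "frame.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, tile, "native", MPI_INFO_NULL);

    /* One collective call instead of thousands of small POSIX reads. */
    MPI_File_read_all(fh, buf, subsizes[0] * subsizes[1],
                      MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&tile);
    free(buf);
    MPI_Finalize();
    return 0;
}
```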

Breaking down cultural barriers

• ASC PathForward project goals
  – Accelerate IB support for HPC needs: scalability, bandwidth, latency, robustness, ...
  – Effective community engagement
• OpenIB Alliance
  – Achieved a major milestone with the OpenIB driver and stack being accepted into the 2.6 Linux kernel at kernel.org

Acknowledgements

• Thanks to all of the participants in FAST-OS and to the many others who have contributed to research in high-end operating systems, programming models and system software (and to this talk!)

http://www.cs.unm.edu/~fastos

FAST-OS PI Meeting/Workshop June 8-10, Washington, DC
