Thoughts on HPC Facilities Strategies for DOE Office of Science: Grids to Petaflops

Rick Stevens, Argonne National Laboratory

Outline
• Update on NSF's Distributed Terascale Facility
• What Grid and facilities strategy is appropriate for DOE?
• Limits to cluster-based architectures
• New paths to petaflops computing capability
• Grid implications of affordable petaflops
• Summary and recommendations

NSF TeraGrid Approach [$53M in FY01-FY03]
• The DTF project goal is deployment of a production Grid environment
  – staged deployment based on service priorities
• The first priority is a linked set of working IA-64-based clusters
  – immediately useful to the current NSF PACI user base
  – supporting current high-end applications
  – standard cluster and data management software
• Grid software deployed in phases
  – basic, core, and advanced services

• DTF technology choices based on application community trends
  – more than 50% of the top 20 PACI users compute on Linux clusters (development and production runs)
  – the majority of NSF MRE projects plan Data Grid environments

Major DTF and TeraGrid Tasks
• Create management structure – Harder than we thought!
• Engage major application teams – Starting with ITRs and MREs
• Construct high-bandwidth national network – On track
• Integrate terascale hardware and software – Planning underway
• Establish distributed TeraGrid operations – New concepts needed
• Deploy and harden Grid software – Need Grid testbeds
• Expand visualization resources – Development needed
• Implement outreach and training program – PACI leverage
• Assess scientific impact – Need metrics and process

NSF PACI 13.6 TF Linux TeraGrid

[Architecture diagram: four sites linked by the TeraGrid 40 Gbit/s DWDM wide-area network through Cisco 65xx Catalyst and Juniper M40/M160 switch/routers at the Chicago (StarLight) and Los Angeles hubs, with external peering to Abilene, ESnet, vBNS, MREN, Calren, NTON, and HSCC. New IA-64 clusters: Caltech 32 nodes (0.5 TF, 0.4 TB memory, 86 TB disk), Argonne 64 nodes (1 TF, 0.25 TB memory, 25 TB disk), SDSC 256 nodes (4.1 TF, 2 TB memory, 225 TB disk), NCSA 500 nodes (8 TF, 4 TB memory, 240 TB disk). Each cluster is built from quad-processor McKinley servers (e.g. 32 servers, 128p @ 4 GF, 8-12 GB memory per server, plus IA-32 nodes) behind Myrinet Clos spine, Gigabit Ethernet, and Fibre Channel fabrics and a Cisco 6509 Catalyst switch/router. Existing site resources shown include Argonne's 574p IA-32 Chiba City and HR display/VR facilities, Caltech's 256p HP X-Class, 128p HP V2500, and HPSS, SDSC's 1176p IBM SP Blue Horizon and HPSS, NCSA's 1024p IA-32, 320p IA-64, 1500p Origin, and UniTree, and Sun E10K/Starcat servers.]

[Network map: the DTF backbone (multiple 10 GbE over Qwest) connects hubs in Los Angeles, San Diego, Chicago, Indianapolis (Abilene NOC), and Urbana, with OC-48 (2.5 Gb/s) Abilene links and the StarLight international optical peering point (see www.startap.net) in Chicago. I-WIRE dark fiber (multiple 10 GbE) links Starlight/NW Univ, UIC, Ill Inst of Tech, Univ of Chicago, ANL, and NCSA/UIUC. Solid lines are in place and/or available by October 2001; dashed I-WIRE lines are planned for summer 2002.]

TeraGrid Middleware Definition Levels
• Basic Grid services [little new capability]
  – deployment ready
  – in current use
  – immediate deployment planned
• Core Grid services [essential Grid]
  – largely ready
  – selected hardening and enhancement
  – planned deployment in year one
• Advanced Grid services [True Grid]
  – ongoing development
  – expect to deploy in year two and later

Grid Middleware [toolkits for building Grids]
• PKI-Based Security Infrastructure
• Distributed Directory Services
• Reservation Services
• Meta-Scheduling and Co-Scheduling
• Quality-of-Service Interfaces
• Grid Policy and Brokering Services
• Common I/O and Data Transport Services
• Meta-Accounting and Allocation Services
(a minimal interface sketch follows below)
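
As a concrete illustration of how several of these services fit together, here is a minimal sketch in Python of a broker that composes directory, reservation, co-scheduling, and accounting functions. The class and method names are hypothetical and do not correspond to Globus or any TeraGrid API.

```python
# Hypothetical sketch of Grid middleware service composition; names are
# illustrative only and do not correspond to Globus or TeraGrid APIs.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Reservation:
    resource: str        # e.g. "ncsa-ia64-cluster"
    cpus: int
    start: float         # epoch seconds
    duration_s: float
    user_dn: str         # PKI distinguished name from the security layer


@dataclass
class GridBroker:
    """Combines directory, reservation, co-scheduling, and accounting roles."""
    capacity: Dict[str, int]                              # directory: resource -> free CPUs
    ledger: List[Reservation] = field(default_factory=list)  # meta-accounting record

    def reserve(self, res: Reservation) -> bool:
        """Grant a reservation if the resource still has capacity."""
        if self.capacity.get(res.resource, 0) < res.cpus:
            return False
        self.capacity[res.resource] -= res.cpus
        self.ledger.append(res)
        return True

    def co_schedule(self, requests: List[Reservation]) -> bool:
        """All-or-nothing co-scheduling across several resources."""
        granted = []
        for req in requests:
            if not self.reserve(req):
                # roll back partial grants so no resource is left half-booked
                for g in granted:
                    self.capacity[g.resource] += g.cpus
                    self.ledger.remove(g)
                return False
            granted.append(req)
        return True


if __name__ == "__main__":
    broker = GridBroker(capacity={"ncsa-ia64": 1000, "sdsc-ia64": 512})
    job = [Reservation("ncsa-ia64", 256, 0.0, 3600, "/O=Grid/CN=Alice"),
           Reservation("sdsc-ia64", 256, 0.0, 3600, "/O=Grid/CN=Alice")]
    print("co-scheduled:", broker.co_schedule(job))
```

The all-or-nothing rollback in co_schedule illustrates the kind of guarantee that co-scheduling across administrative domains has to provide.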

Expected NSF TeraGrid Scientific Impact
• Multiple classes of user support, each with differing implementation complexity
  – from minimal change from current practice to new models, software, and applications
• Benefit to three user communities
  – existing users: new capability [FLOPS, memory, and storage]
  – data-intensive and remote-instrument users: linked archives, instruments, visualization, and computation
    – several communities already embracing this approach: GriPhyN, BIRN, Sloan DSS/NVO, BIMA, …
  – future users of MRE and similar facilities: DTF is a prototype for ALMA, NEESGrid, LIGO, and others

Strategies for Building Computational Grids
• Three current approaches to developing and deploying Grid infrastructure
  – Top down: NASA IPG, NSF DTF, UK e-Science
  – Bottom up: life sciences' web-based computing
  – User-community based: GriPhyN, iVDGL, PPDG, etc.
• Current Grid software R&D is mostly focused on the top-down and user-community models
• Major Grid-building activities
  – Grid software infrastructure and toolkit development
  – Grid hardware resources [systems, networks, data, instruments]
  – Grid applications development and deployment
  – Grid resource allocation and policy development

DOE Programs and Facilities
Technical Challenges: Distributed Resources, Distributed Expertise

[Map: DOE-SC institutions across the country, including Pacific Northwest National Laboratory, Idaho National Engineering Laboratory, Ames Laboratory, Argonne National Laboratory, Brookhaven National Laboratory, Fermi National Accelerator Laboratory, Lawrence Berkeley National Laboratory, Stanford Linear Accelerator Center, Princeton Plasma Physics Laboratory, Lawrence Livermore National Laboratory, Thomas Jefferson National Accelerator Facility, Oak Ridge National Laboratory, Los Alamos National Laboratory, Sandia National Laboratories, and the National Renewable Energy Laboratory. The legend distinguishes major SC user facilities, institutions supported by SC, DOE specific-mission laboratories, DOE program-dedicated laboratories, and DOE multiprogram laboratories.]

Vision for a DOE Science Grid
Large-scale science and engineering is typically done through the interaction of collaborators, resources, multiple information systems, and experimental science and engineering facilities, all of which are geographically and organizationally dispersed. Scientific applications use workflow frameworks to coordinate resources and solve complex, multi-disciplinary problems.

The overall motivation for "Grids" is to enable routine interactions of networked combinations of these resources to facilitate large-scale science and engineering. Grid services provide uniform access to many diverse resources, and high-speed network services and infrastructure provide the substrate for Grids.

[Diagram: Grid services linking collaboration resources (e.g. shared visualization), data resources, computational resources, and experimental science resources over a high-speed network.]

Two Primary Goals
• Build a DOE Science Grid that ultimately incorporates computing, data, and instrument resources at most, if not all, of the DOE Labs and their partners.
• Advance the state of the art in high-performance, widely distributed computing so that the Grid can be used as a single, very large scale computing, data handling, and collaboration facility.

Grid Strategies Appropriate for DOE-SC
• Possible three-layered structure of DOE-SC Grid resources (a rough resource-inventory sketch follows below)
  – Large-Scale Grid Power Plants, O[10]
    [multiprogram labs: LBNL-NERSC, ORNL, ANL, PNNL, etc.]
  – Data and Instrument Interface Servers, O[100]
    [major DOE facilities: LHC, APS, ALS, RHIC, SLAC, etc.]
  – PI and small "l" laboratory-based resources, O[1000-10,000]
    [workstations and small clusters, laboratory data systems, databases, etc.]
• Need tool development, applications development, and support appropriate to each layer and user community
• Need resource allocation policies appropriate for each class of resource and user community
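
To make the three layers concrete as a rough inventory, here is a minimal sketch in Python. The tier names and site counts follow this slide; the per-site peak figures are assumptions chosen to match the 2006-2010 facility tiers later in the talk.

```python
# Order-of-magnitude model of the proposed three-layer DOE-SC Grid;
# tier names and counts follow the slide, per-site peak figures are assumptions.
from dataclasses import dataclass


@dataclass
class GridTier:
    name: str
    site_count: int          # order-of-magnitude number of sites
    peak_tflops_each: float  # assumed per-site capability, illustration only


tiers = [
    GridTier("Grid Power Plants", 10, 1000.0),                   # O[10] petaflop-class centers
    GridTier("Data/Instrument Interface Servers", 100, 100.0),   # O[100] facility servers
    GridTier("PI and lab-scale resources", 10_000, 1.0),          # O[10^3-10^4] teraflop systems
]

for t in tiers:
    total_pf = t.site_count * t.peak_tflops_each / 1000.0
    print(f"{t.name:40s} {t.site_count:6d} sites  ~{total_pf:8.1f} PF aggregate")
```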

Near-Term Directions for "Clusters"
• High-Density Web Server Farms [IA-32, AMD, Transmeta]
  – Blade-based servers optimized for dense web serving
  – Scalable, but not aimed at high-performance numerical computing
• Passive-Backplane-Based Clusters [IA-32, InfiniBand]
  – Reasonably dense packaging possible
  – High scalability not a design goal
• IA-64, x86-64, and Power4 "Server"-Based Compute Nodes
  – Good price/performance, poor packaging density
  – Designed for commercial I/O-intensive configurations
• Sony PlayStation 2 [Emotion Engine, IBM Project]
  – Excellent pure price/performance: ~$50K/teraflop (see the estimate below)
  – Not a balanced system; difficult microarchitecture
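
The ~$50K/teraflop figure is roughly reproducible: assuming ~6.2 GFLOPS of peak per Emotion Engine and ~$300 per console (both outside figures, treated here as assumptions rather than numbers from the slide), about 160 consoles add up to a peak teraflop.

```python
# Back-of-envelope check of the ~$50K/teraflop claim for PlayStation 2 hardware.
# Both inputs are assumptions (approximate public figures), not from the slide.
ee_peak_gflops = 6.2       # assumed Emotion Engine peak
console_cost_usd = 300.0   # assumed retail price

consoles_per_tflop = 1000.0 / ee_peak_gflops
cost_per_tflop = consoles_per_tflop * console_cost_usd
print(f"{consoles_per_tflop:.0f} consoles, ~${cost_per_tflop / 1000:.0f}K per peak TFLOP")
```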

Limits to Cluster-Based Systems for HPC
• Memory Bandwidth (a rough balance estimate follows below)
  – Commodity memory interfaces [SDRAM, RDRAM, DDR SDRAM]
  – Separation of memory and CPU implementations limits performance
• Communications Fabric/CPU/Memory Integration
  – Current networks are attached via I/O devices
  – Limits bandwidth, latency, and communication semantics
• Node and System Packaging Density
  – Commodity components and cooling technologies limit densities
  – Blade-based servers are moving in the right direction but are not high performance
• Ad Hoc Large-Scale Systems Architecture
  – Little functionality for RAS
  – Commodity design points don't scale
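
The memory-bandwidth limit can be made concrete with a simple balance argument: sustained floating-point rate is bounded by memory bandwidth times the arithmetic intensity of the kernel. The node parameters below are illustrative assumptions for a circa-2001 commodity node, not measurements.

```python
# Bandwidth-bound performance estimate for a commodity cluster node.
# All inputs are illustrative assumptions, not measured values.
peak_gflops = 4.0          # e.g. a multi-issue ~1 GHz CPU (assumed)
mem_bw_gb_s = 1.6          # commodity SDRAM/DDR bandwidth (assumed)
flops_per_byte = 0.25      # arithmetic intensity of a stream-like kernel (assumed)

bw_bound_gflops = mem_bw_gb_s * flops_per_byte
sustained = min(peak_gflops, bw_bound_gflops)
print(f"bandwidth-bound: {bw_bound_gflops:.2f} GFLOPS "
      f"({100 * sustained / peak_gflops:.0f}% of {peak_gflops} GFLOPS peak)")
```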

Each of the Emotion Engine's vector units has enough parallelism to complete a vertex operation [19 mul-adds + 1 divide] every seven cycles.
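
Taking that figure at face value, and counting each mul-add as two floating-point operations, the implied rate is about 5.6 FLOPs per cycle per vector unit; at an assumed ~300 MHz clock (an assumption, not stated here) that is roughly 1.7 GFLOPS for this operation mix.

```python
# FLOP rate implied by "19 mul-adds + 1 divide every seven cycles" per vector unit.
# The 300 MHz clock is an assumption; the operation counts come from the slide.
muladds, divides, cycles = 19, 1, 7
clock_hz = 300e6                           # assumed vector-unit clock

flops_per_vertex = 2 * muladds + divides   # count a mul-add as two FLOPs
flops_per_cycle = flops_per_vertex / cycles
gflops = flops_per_cycle * clock_hz / 1e9
print(f"{flops_per_cycle:.2f} FLOPs/cycle  ->  ~{gflops:.1f} GFLOPS per vector unit")
```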

The PS2's Emotion Engine provides ten floating-point multiplier-accumulators, four floating-point dividers, and an MPEG-2 decoder.

IBM, Sony and the "Cell" Project
• $400M investment toward a teraflop processor
• Targeted at PS3 and broadband applications
• Each company will produce products based on the core technology
• 100 nm feature size ⇒ 2006 [based on the SIA roadmap]
• Design center in Austin, TX opening later this year
• Sony describes PS3 as 1000x the performance of PS2
• Planned use across all of Sony's product lines
  – Video, audio, computer games, PCs, etc.

Cluster Technology Will Not Scale to Petaflops
• Affordable and usable petaflops will require improvements in a number of areas:
  – Improved CPU-Memory Bandwidth [e.g. PIM, IRAM]
  – Hardware-Based Latency Management [e.g. multithreaded architectures] (see the concurrency estimate below)
  – Integrated Communications Infrastructure [e.g. on-chip networking]
  – Increased Level of System Integration and Packaging [e.g. system-on-a-chip, cluster-on-a-chip]
  – New Large-Scale Systems Architectures
  – Aggressive Fault Management and Reliability Features
  – Scalable Systems Management and Serviceability Features
  – Dramatic Improvements in Scalable Systems Software
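
The hardware-based latency management item can be quantified with Little's law: the number of outstanding memory references (for example, one per hardware thread) needed to keep a memory system busy is roughly latency x bandwidth / transfer size. The figures below are illustrative assumptions for a high-bandwidth, PIM/IRAM-class memory system, not measurements.

```python
# How many concurrent memory references (or hardware threads) are needed to
# hide memory latency, via Little's law. Inputs are illustrative assumptions.
latency_ns = 150.0        # assumed memory latency
bandwidth_gb_s = 25.6     # assumed sustained bandwidth of a high-bandwidth memory system
line_bytes = 64.0         # assumed transfer (cache line) size

outstanding = (latency_ns * 1e-9) * (bandwidth_gb_s * 1e9) / line_bytes
print(f"~{outstanding:.0f} outstanding references needed to saturate the memory system")
```

The point is that as bandwidth grows, the required concurrency grows with it, which is exactly what multithreaded or otherwise latency-tolerant architectures provide in hardware.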

UCB VIRAM-1 Integrated Processor/Memory
• Microprocessor: 256-bit media (vector) processor plus 14 MBytes of on-chip DRAM
• 2.5-3.2 billion operations per second
• 2 W at 170-200 MHz
• Industrial-strength compiler
• 280 mm2 die area (18.72 x 15 mm); ~200 mm2 for memory/logic (DRAM ~140 mm2, vector lanes ~50 mm2)
• Technology: IBM SA-27E, 0.18 µm CMOS, 6 metal layers [copper]; transistor count >100M
• Implemented by 6 Berkeley graduate students
• Thanks to DARPA (funding), IBM (donated masks and fab), Avanti (donated CAD tools), MIPS (donated MIPS core), Cray (compilers), and MIT (FPU)

Petaflops from V-IRAM?
• In 2005: V-IRAM at 6 GFLOPS / 64 MB per chip
• 50 TF per 19" rack [~10K CPU/disk assemblies per rack]
  – 100 drawers of 100 processors [like library cards]
  – crossbar interconnected at N x 100 MB/s
• 20 optically interconnected racks
• 10^15 FLOPS ⇒ ~$20M
• Power: ~20 kW x 20 racks = 400 kW
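
The rack and system numbers can be checked directly from the figures on this slide: 100 drawers of 100 processors at ~6 GFLOPS each is about 60 TF of peak per rack (the quoted 50 TF presumably allows some derating), and 20 such racks reach roughly 10^15 FLOPS at 20 kW per rack, i.e. 400 kW total.

```python
# Arithmetic check of the V-IRAM petaflops configuration from the slide.
gflops_per_chip = 6.0        # V-IRAM projection cited on the slide
chips_per_drawer = 100
drawers_per_rack = 100
racks = 20
watts_per_rack = 20_000      # from the slide

peak_tf_per_rack = gflops_per_chip * chips_per_drawer * drawers_per_rack / 1000
system_pf = peak_tf_per_rack * racks / 1000
print(f"{peak_tf_per_rack:.0f} TF/rack peak, {system_pf:.1f} PF system, "
      f"{racks * watts_per_rack / 1000:.0f} kW total")
```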

A Possible Path Towards Petaflops User Facilities

• Community-Based Approach to Petaflops Systems Development
  – Laboratories, Universities, and Applications Communities
  – User Requirements, Software, and Systems Design
• Exploiting New Design Ideas and Technology for Scalability
  – Cluster-on-a-Chip Level Integration
  – Hardware/Software Co-design
• Affordable Petaflops Enable Personal Teraflops
  – $50M PFLOPS System ⇒ $50K TFLOPS
  – Enable Broad Deployment and Scientific Impact
• Advanced Networking and Middleware
  – Embed Petaflops Capability in the Grid

Grid Implications of Affordable Petaflops

• A $50M petaflops system ⇒ $50K teraflops systems
• DOE computing facilities circa 2006-2010 (a quick tabulation follows below)
  – Power Plant level ⇒ O[1-10] PF-class computers
  – Data Server level ⇒ O[20-100] 100 TF data servers
  – PI level ⇒ O[1000-10,000] 1 TF lab systems
• Data storage capability
  – Power Plant level ⇒ 10-20 PB secondary, 100-1000 PB tertiary
  – Data Server level ⇒ 100-500 PB secondary, 1000+ PB tertiary
  – PI level ⇒ ~1 PB secondary
• Networking capability
  – Ideal BW ⇒ 10% of per-system bisection bandwidth available for peer-to-peer use
  – Terabit WAN backbones and backplanes will be needed
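
The three facility tiers imply roughly comparable aggregate capability at each layer; the quick tabulation below uses midpoints of the ranges on this slide (the midpoint choice is illustrative, not a stated plan).

```python
# Aggregate compute implied by the 2006-2010 facility tiers, using midpoints
# of the ranges on the slide (midpoint choice is illustrative).
tiers = {
    "Power Plant level": (5, 1000.0),   # O[1-10] petaflop-class machines
    "Data Server level": (60, 100.0),   # O[20-100] 100 TF data servers
    "PI level":          (5000, 1.0),   # O[1000-10,000] 1 TF lab systems
}

for name, (sites, tflops_each) in tiers.items():
    print(f"{name:18s} ~{sites * tflops_each / 1000:6.1f} PF aggregate "
          f"({sites} systems x {tflops_each:g} TF)")
```

Under these assumptions each tier contributes a few petaflops in aggregate, which is why allocation and management policy matters at every layer, not just at the largest centers.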

Summary
• The Grid-based computing concept is well matched to DOE's distributed facilities and mission needs
• Grids do not replace the need for large-scale computers
  – They increase high-end demand via portals
  – They increase data-intensive computing and high-performance networking
  – Software environments link desktops to high-end platforms [petaflops]
• Grids require new ways to allocate and manage computing and data resources
  – Need a broader view of resources and resource allocations
• Grids and technology for petaflops facilities make sense together
  – Technologies for petaflops will power future Grids at all levels

Recommendations
• DOE should aggressively pursue development of Grid technologies and deployment of Grid-based infrastructure
• DOE should facilitate Grid applications communities relevant to mission areas: security, energy, climate, HEP, etc.
• DOE should participate in national and international coordination of Grid development and deployment
• DOE should support development of computing platform technologies that will enable future Grid engines, including new approaches to petaflops and the associated affordable teraflops
