High-Performance Computing at SGI and the Status of Climate and Weather Codes on the SGI Gerardo Cisneros, Ph.D. Scientist C2004 , Inc. All rights reserved. Silicon Graphics, SGI, IRIX, Origin, Onyx, Onyx2, IRIS, Altix, InfiniteReality, Challenge, Reality Center, Geometry Engine, ImageVision Library, OpenGL, XFS, the SGI logo and the SGI cube are registered trademarks and CXFS, Onyx4, InfinitePerformance, IRIS GL, Power Series, Personal IRIS, Power Challenge, NUMAflex, REACT, , OpenGL Performer, OpenGL, Optimizer, OpenGL Volumizer, OpenGL , OpenGL Multipipe, OpenGL Vizserver, SkyWriter, RealityEngine, SGI ProPack, Performance Co- Pilot, SGI Advanced , UltimateVision and The Source of Innovation and Discovery are trademarks of Silicon Graphics, Inc., in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries, used with permission by Silicon Graphics, Inc. MIPS is a registered trademark of MIPS Technologies, Inc., used under license by Silicon Graphics, Inc. and are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. in the United States and other countries. Linux penguin logo created by Larry Ewing. All other trademarks mentioned herein are the property of their respective owners. (04/04)

9/9/2004 Slide 2

AA Overview

• Company focus • SGI Altix: present and future • Performance of NWS codes • Conclusions

9/9/2004 Slide 3 Silicon Graphics Providing the Industry’s Highest-Performing Compute, Storage and Products •Exclusively focused on the technical computing market •Technology is designed to enable the most significant scientific and creative breakthroughs of the 21st century •Products and services are mission critical to government and defense, science and research, manufacturing, energy and media industries

9/9/2004 Slide 4 Images courtesy of SCI Institute, University of Utah, Navy Rehearsal TOPSCENE Program, Magic Earth LLC, and WETA Digital

AA Strategic Focus Areas

High Performance Advanced Storage Computing Visualization

High-performance CXFS™ shared file system Onyx4™ allows users to combine NUMAflex™ architecture allows transparent, multiple industry-standard delivers unprecedented heterogeneous file access graphics cards in a high- flexibility and performance. everywhere, without copying bandwidth, low-latency data. architecture, for cost-effective high performance visualization.

9/9/2004 Slide 5

AA Architecture Designed for HPC; Choice of Deployments

NUMAflex™ Global Shared-Memory Architecture Balanced, scalable performance Operating environment optimized for HPC Low-latency memory access Easily Deployable

MIPS® and IRIX® Intel® Itanium® 2 and Linux®

SGI® Origin® Family SGI® Altix® Family

9/9/2004 Slide 6

AA Modular SGI® NUMAflex™ Architecture

System PPeerrffoorrmmaannccee:: HHiigghh--bbaannddwwiiddtthh Interconnect iinntteerrccoonnnneecctt wwiitthh vveerryy llooww llaatteennccyy

FFlleexxiibbiilliittyy:: TTaaiilloorreedd ccoonnffiigguurraattiioonnss ffoorr CPU & ddiiffffeerreenntt ddiimmeennssiioonnss ooff ssccaallaabbiilliittyy Memory IInnvveessttmmeenntt pprrootteeccttiioonn:: AAdddd nneeww tteecchhnnoollooggiieess aass tthheeyy eevvoollvvee System I/O SSccaallaabbiilliittyy:: NNoo cceennttrraall bbuuss oorr sswwiittcchh;; jjuusstt Standard I/O mmoodduulleess aanndd NNUUMMAAlliinnkk™™ ccaabblleess Expansion

High-Bandwidth I/O Expansion

Graphics Expansion

Storage Expansion

9/9/2004 Slide 7 SGI Family of Scalable Linux® Solutions

SGI® Altix® 350 SGI® Altix® 3000 Servers and Clusters Servers and Superclusters

Image courtesy: NASA Ames

Mid-range Departmental Capability

9/9/2004 Slide 8

AA SGI® Altix® 3000 Scaling Roadmap 2048p y

® t

Worlds Most Scalable Linux i l i b a p a c

r o

1024p s s e c o r p

4

512p 512p 512p 512p 8 3 , 6 1 256p 256p 256p 128p 64p 64p 64p 64p 64p

Jan 2003 Apr 2003 Sept 2003 Oct 2003 Mar 2004 mid-2004 late 2004 2005 SSI Scalability Shared Memory Scalability

9/9/2004 Slide 9 This slide contains forward-looking statements. The results and forecasts as stated may vary. Other risks and uncertainties relating to this slide may be found in the "Safe-Harbor" statement at the beginning of this presentation. SGI® Altix® 3000 C-Brick Detail

16 x PC2100 or PC2700 DDR SDRAM 8 to 16GB of memory per node 8.51–10.2GB/sec memory b/w

• 4x Intel® Itanium® 2 processors • 2 processor per 6.4GB/sec frontside bus • 4–64GB memory C-brick • SHUB 8.51–10.2GB/sec memory SHUB SHUB bandwidth (varies with memory speed 133 MHz vs. 166 MHz) • 6.4GB/sec aggregate interconnect bandwidth • 4.8GB/sec aggregate I/O bandwidth

9/9/2004 Slide 10 Other Altix Bricks

IX-brick Base I/O module

D-brick2 Disk expansion

R-brick2 Router interconnect

PX-brick PCI-X expansion

M-brick Memory expansion

9/9/2004 Slide 11 Altix 3700 — 128 processors

9/9/2004 Slide 12 Altix 3700 — 512 processors

9/9/2004 Slide 13 Next generation Altix

(8) Fewer External Routers (1) Less Power Bay (1) Less Rack (20) Fewer Cables

Altix 3700 “Tornado”

9/9/2004 Slide 14 Next generation Altix

9/9/2004 Slide 15 The Altix® 350: “Expand on Demand” Growth Path

1-16P/2-192GB

CPU Expansion Module Right-size systems 1-12P/ 2-144 GB Memory Expansion for the ultimate Module price/performance 1-8P/ 2-96GB I/O Expansion (4-32) Module

1- 4P/ 2- 48GB • Independently scale CPU, memory, I/O • One Linux® instance to manage Base Unit: 1 or 2P/2-24GB • Investment protection & leverage current assets • Allocate budget and resources to ongoing needs

9/9/2004 Slide 16 Customers with Altix systems running climate and weather applications

• NASA – 10240p – ECCO, CAM, CCSM, etc. • NCSA – 1024p – WRF • GFDL – 608p (2x256, 1x96) – MOM4, CM2.1 • NRL – 384p (1x128, 1x256) – various • ORNL – 256p – CCSM, CAM, POP • LANL – 256p – POP, HYCOM • BAMS – 20p – MM5 and MAQSIP • CMMACS – 12p Altix350 – MOM4 • APAT – 8p Altix350 – MM5 • Romanian Nat’l Met Admin – 2p Altix350 – HRM

9/9/2004 Slide 17 IFS - Performance on SGI Altix 3000

T511 performance in Forecast days per day 700 600 500 32 Cpu 64 Cpu 400 128 Cpu 300 256 Cpu 200 384 Cpu 512 Cpu 100 0 IBM Power4 SGI Altix 3000 SGI 3000 1.3GHz 1.3GHz 1.5GHz

Altix 3000 @1.5GHz is 2.22 x faster IBM Power4 @1.3GHz

9/9/2004 Slide 18 Source: Roland Richter, SGI, May 2004 IFS - Scalability on SGI Altix 3000

Itanium2 @1.5GHz is 1.3x faster than Itanium @1.3GHz because of the larger and higher

9/9/2004 Slide 19 Source: Roland Richter, SGI, May 2004 HRM Performance on the Altix 350 Simulation speed

350 Origin 350/600 MHz Altix 350/1.4 GHz 300

250

200

150

100

50

0 1 2 4 8 10 12 16 Nproc

78h forecast in 2340 time steps over a 181x217 horizontal grid with 26 vertical levels

9/9/2004 Slide 20 LM-RAPS 2.1 (Optimization using shared Memory) Scalability on Altix using Intel Compiler and SGI’s MPT library

9/9/2004 Slide 21 Benchmark case used by Emy (Greece) in 2003 Grid 363x263x45 MM5 3.6.3 - Standard benchmark, 2004

Higher is Better

Altix version was compiled with the Intel 8.1.007 (beta) ifort Fortran compiler and the Intel 8.1.010 (beta) icc C compiler, and linked with SGI's MPI from MPT 1.10. Source: http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20040304a.html 9/9/2004 Slide 22 WRF 1.3 - Scalability & Performance on Altix 3000

Altix 3000 is 1.28x faster than HP cluster at the same clock frequency.

Source: http://www.mmm.ucar.edu/wrf/bench and Gerardo Cisneros, SGI 9/9/2004 Slide 23 WRF 2.0.2 on a Large Problem

WRF SI 5km CONUS 48h forecast (980x720x37, 5km, 30s)

50000 s

40000 d

e 30000 s

p 20000 a l

E 10000 0 32 64 128 256 512 NCPUs

1.5GHz/6MB L3 1.3GHz/3MB L3

9/9/2004 Slide 24 WRF 2.0.2 on a Large Problem

WRF SI 5km CONUS 48h forecast (980x720x37, 5km, 30s) d

e 40 e p

s 30

n o

i 20 t a l

u 10 m i 0 S 32 64 128 256 512 NCPUs

1.5GHz/6MB L3 1.3GHz/3MB L3

9/9/2004 Slide 25 GFS Performance on the Altix 3700 (1.5GHz)

GFS T240L30 (720x360x30) 120h forecast

400.00 343.91 350.00 d

e 300.00 e 255.20 p

s 250.00

n 190.01

o 200.00 i t a l 150.00 u 98.34 m

i 100.00

S 51.15 50.00 23.68 2.23 4.21 9.58 0.00 1 2 4 8 16 32 64 96 128 NCPUs

9/9/2004 Slide 26 MOM4 - Performance on SGI Altix 3700

0.25° Ocean Model (1440x600x24) 18,000 120 15,476 16,000 100 14,000

12,000 80 p d 10,185 e u 10e ,000 s 7,813 d m 60 p e i a 8,000 e T l p

E 6,000 4,640 40 3,822 S 4,000 2,308 2,397 20 2,000 0 0 16 32 64 128 Number of processors Altix 3000 @1.5GHz w/ Dynamic memory compilation Altix 3000 @1.5GHz w/ Static memory compilation Speedup

Source: Gerardo Cisneros, SGI, May 2004 9/9/2004 Slide 27 POP 1.4.3 (Optimization by reducing the # of synchronizations) Scalability on ALTIX using Intel Compiler and SGI’s MPT library

9/9/2004 Slide 28 Increasing Message length and reducing synchronization frequency improved the performance by 20% at 256 CPU POP 1.4.3 - Performance on SGI Altix 3000 SSI

POP 1.4.3 - Performance on 1 Degree “X1” Problem Higher 100.0 is

Better

Altix 3000 @1.5 GHz (SGI) ) y a

D Altix 3000 @ 1.3 GHz (SGI)

80.0

k Configuration: c o

l Altix 3000 @1.5GHz (NASA)

c Altix 3000 l

l 96

a 60.0 88

w @1.3GHz, 512P Ideal / 80

s 72 Altix 3000

r SSI, 64 a p e 40.0 u 56

d Intel compiler Y

e 48 e

d

p 40 e 7.1.035 MPT 1.9 S t 32 a l 24

u 20.0 16 m i 8 S 0 0 32 64 96 128 160 192 224 256 288 320 352 384 0.0 0 64 128 192 256 320 384 Nr. of Processors

Source: Altix 3000 (SGI) - version optimized by Roland Richter using MPT Altix 3000 (NASA) - Jim Taft’s version using MLP

9/9/2004 Slide 29 CCSM 3.0 (beta08) Performance on SGI Altix 3000

Atmosphere-Land-Ice-Ocean Coupled Model Speedup from 26P to 208P 14 8

7 Ideal speedup y 12 a Higher d

SGI Altix 3000 @1.3GHz

k is 6 5.6 c 10

o Better l c

l 5 l p a

8 u d w / e 3.8

e 4 s p r S a 6 3.1 e

y 3

t 2.1 s

a 4

c 2 e r

o 1.0

F 2 1 6 2 3 8 6 0 0 6 9 6 . . 9 3 3 0 8 3 0 . . 3 . . . . 1 8 8 1 7 5 4 0 2 0 26 52 64 78 96 104 128 208 26 52 78 104 208 Number of Processors Number of Processors

Run on 512P Altix 3000 @1.3GHz with efc 7.1.035 compiler Source: Roland Richter, SGI 9/9/2004 Slide 30 Conclusions

The Altix is proving to be an excellent system for weather and climate codes • Shared memory allows very large models to run • The compilers are getting better • The libraries are excellent and still improving • SGI’s software layers on top of Linux help jobs attain their best performance • SGI engineers have the know-how to make weather and climate codes run well on its customers’ SGI systems

9/9/2004 Slide 31 9/9/2004 Slide 32