Performance Comparison of X1 and Cray Cluster with Other Leading Platforms Using HPCC and IMB Benchmarks

Subhash Saini1, Rolf Rabenseifner3, Brian T. N. Gunney2, Thomas E. Spelce2, Alice Koniges2, Don Dossa2, Panagiotis Adamidis3, Robert Ciotti1, Sunil R. Tiyyagura3, Matthias Müller4, and Rod Fatoohi5

1 NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, Moffett Field, California; 2 Lawrence Livermore National Laboratory; 3 High Performance Computing Center Stuttgart (HLRS); 4 ZIH, TU Dresden; 5 San Jose State University. CUG 2006, May 2006

Outline

• Computing platforms
  • Columbia System (NASA, USA)
  • Cray Opteron Cluster (NASA, USA)
  • Dell PowerEdge (NCSA, USA)
  • NEC SX-8 (HLRS, Germany)
  • Cray X1 (NASA, USA)
  • IBM Blue Gene/L
• Benchmarks
  • HPCC benchmark suite (measurements on the first four platforms)
  • IMB benchmarks (measurements on the first five platforms)
• Balance analysis based on publicly available HPCC data
• Summary

2 / 47

Columbia 2048 System

• Four SGI Altix BX2 boxes with 512 processors each, connected with NUMALINK4 using a fat-tree topology
• Itanium 2 processor at 1.6 GHz with 9 MB of L3 cache
• An SGI Altix BX2 compute brick has eight Itanium 2 processors with 16 GB of local memory and four ASICs called SHUB
• In addition to NUMALINK4, InfiniBand (IB) and 10 Gbit Ethernet networks are also available
• Processor peak performance is 6.4 Gflop/s; system peak of the 2048-processor system is 13 Tflop/s
• Measured latency and bandwidth of IB are 10.5 microseconds and 855 MB/s
3 / 47

Columbia System

4 / 47

— skipped — SGI Altix 3700

• CC-NUMA in hardware
• Itanium 2 @ 1.5 GHz (peak 6 GF/s)
• 128 FP registers, 32 KB L1, 256 KB L2, 6 MB L3
• 64-bit with single system image: looks like a single Linux machine but with many processors
5 / 47

— skipped — Columbia Configuration

[Diagram: Columbia configuration]
• Front end: 128-processor Altix 3700 (RTF)
• Networking: 32-port 10GigE switch with 10GigE cards (1 per 512p); 288-port InfiniBand switch with InfiniBand cards (6 per 512p); Altix 3700 BX2 2048 NUMAlink kits
• Compute nodes (single system image): Altix 3700 (A) 12 x 512p capacity system; Altix 3700 BX2 (T) 8 x 512p capability system; 48 TF / 12 TF
• FC switch 128p; storage area network: 2 x 128-port Brocade switches
• Storage (440 TB): FC RAID 8 x 20 TB (8 racks); SATA RAID 8 x 35 TB (8 racks)
6 / 47

Cray X1 CPU: Multistreaming Processor (MSP)
• New Cray Vector Instruction Set Architecture (ISA)
• 64- and 32-bit operations, IEEE floating-point

Each MSP combines four single-streaming processors (SSPs #1-#4) behind a shared 2 MB E-cache (4 P-chips and 4 E-chips).
Each SSP: • 2 vector pipes (32 vector registers of 64 elements each) • 64 A and S registers • instruction and data cache
Bandwidth per CPU: • up to 76.8 GB/s read/write to cache • up to 34.1 GB/s read/write to memory

7 / 47

— skipped — Cray X1 Processor Node Module

[Figure: Cray X1 node module with four MSPs at 12.8 GF (64-bit) each, 16 or 32 GB memory, 100 GB/s and 200 GB/s data paths, and four I/O channels; a 16-node Cray X1 delivers 819 GFLOPS.]
An X1 node board has performance roughly comparable to a 128-PE system or a 16-32 CPU system.

8 / 47

Cray X1 Node - 51.2 Gflop/s

[Figure: Cray X1 node: four MSP CPUs at 12.8 GF (64-bit) each, each with a 2.0 MB cache; 2 x 2 x 1.6 GB/s per M-chip (16 M-chips, M0-M15) → 102.4 GB/s; local memory of up to 64 GB in 16 sections.]
Local memory: peak bandwidth = 16 sections x 12.8 GB/s/section = 204.8 GB/s; capacity = 16, 32 or 64 GB.
Interconnect network: 2 ports per M-chip, 1.6 GB/s/port peak in each direction = 102.4 GB/s to the network.
9 / 47

Cray X1 at NAS

• Architecture
  • 4 nodes, 16 MSPs (64 SSPs)
  • 1 node reserved for the system; 3 nodes usable for user codes
  • 1 MSP: 4 SSPs at 800 MHz, 2 MB E-cache, 12.8 Gflop/s peak
  • 64 GB main memory; 4 TB FC RAID
• Operating environment
  • UNICOS/mp 2.4.3.4
  • Cray Fortran and C compilers 5.2
  • PBS Pro job scheduler
10 / 47

— skipped — Cray X1 at NAS

11 / 47

Intel Xeon Cluster at NCSA (“Tungsten”)

• 2 CPUs/node
• 1280 nodes
• 3.6 GHz Intel Xeon
• InfiniBand
12 / 47

Cray Opteron Cluster

• 2 CPUs/node
• 64 nodes
• AMD 2.0 GHz Opteron
• Myrinet

13 / 47

NEC SX-8 System

14 / 47

SX-8 System Architecture

15 / 47

SX-8 Technology

• Hardware dedicated to scientific and engineering applications
• CPU: 2 GHz frequency, 90 nm Cu technology
• 8,000 I/Os per CPU chip
• Hardware vector square root
• Serial signalling technology to memory; about 2,000 transmitters work in parallel
• 64 GB/s memory bandwidth per CPU
• Multilayer, low-loss PCB replaces 20,000 cables
• Optical cabling used for inter-node connections
• Very compact packaging

16 / 47

SX-8 Specifications

• 16 GF/CPU (vector)
• 64 GB/s memory bandwidth per CPU
• 8 CPUs/node
• 512 GB/s memory bandwidth per node
• Maximum 512 nodes
• Maximum 4,096 CPUs, max 65 TFLOPS
• Inter-node crossbar switch (IXS)
• 16 GB/s (bidirectional) interconnect bandwidth per node
• At HLRS: 72 nodes = 576 CPUs = 9 Tflop/s (vector), 12 Tflop/s total peak

17 / 47

High End Computing Platforms

Table 2: System characteristics of the computing platforms.

Platform             | Type   | CPUs/node | Clock (GHz) | Peak/node (Gflop/s) | Network     | Network topology     | Operating system | Processor vendor | System vendor | Location
SGI Altix BX2        | Scalar | 2         | 1.6         | 12.8                | NUMALINK 4  | Fat-tree             | Linux (SuSE)     | Intel            | SGI           | NASA (USA)
Cray X1              | Vector | 4         | 0.8         | 12.8                | Proprietary | 4D-hypercube         | UNICOS           | Cray             | Cray          | NASA (USA)
Cray Opteron Cluster | Scalar | 2         | 2.0         | 8.0                 | Myrinet     | Fat-tree             | Linux (Red Hat)  | AMD              | Cray          | NASA (USA)
Dell Xeon Cluster    | Scalar | 2         | 3.6         | 14.4                | InfiniBand  | Fat-tree             | Linux (Red Hat)  | Intel            | Dell          | NCSA (USA)
NEC SX-8             | Vector | 8         | 2.0         | 16.0                | IXS         | Multi-stage crossbar | Super-UX         | NEC              | NEC           | HLRS (Germany)

18 / 47

HPC Challenge Benchmarks

• The suite consists of seven benchmarks:
  • HPL: floating-point execution rate for solving a linear system of equations
  • DGEMM: floating-point execution rate of double-precision real matrix-matrix multiplication
  • STREAM: sustainable memory bandwidth (see the copy-kernel sketch below)
  • PTRANS: transfer rate for large data arrays from memory (total network communication capacity)
  • RandomAccess: rate of random memory integer updates (GUPS)
  • FFTE: floating-point execution rate of a double-precision complex 1D discrete FFT
  • Bandwidth/Latency: random and natural ring, ping-pong
19 / 47
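As a rough illustration of what the STREAM component measures, the sketch below times a simple copy kernel and reports bytes moved per second. The array size, single untimed repetition, and output format are assumptions for illustration only; this is not the official STREAM or HPCC source.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)   /* 32M doubles per array (~256 MB), assumed to exceed all caches */

int main(void)
{
    double *a = malloc((size_t)N * sizeof(double));
    double *c = malloc((size_t)N * sizeof(double));
    if (!a || !c) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;   /* initialize the source array */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) c[i] = a[i];  /* the Copy kernel: c = a */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* Copy moves 2 * N * 8 bytes: one read and one write per element. */
    printf("Copy bandwidth: %.2f GB/s (c[0]=%.1f)\n",
           2.0 * N * sizeof(double) / secs / 1e9, c[0]);
    free(a); free(c);
    return 0;
}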

HPC Challenge Benchmarks & Computational Resources

[Diagram: computational resources measured by the HPC Challenge suite. CPU computational speed: HPL (Jack Dongarra). Node memory bandwidth: STREAM (John McCalpin). Interconnect bandwidth and latency: random & natural ring bandwidth and latency (the presenter's part of the HPCC benchmark suite).]

20 / 47

— skipped — HPC Challenge Benchmarks: Corresponding Memory Hierarchy

• Top500/HPL: solves a system Ax = b
• STREAM: vector operations A = B + s x C
• FFT: 1D Fast Fourier Transform Z = FFT(X)
• RandomAccess: random updates T(i) = XOR(T(i), r) (see the update-loop sketch below)
[Diagram: memory hierarchy - registers, cache, local memory, remote memory, disk - with bandwidth and latency between levels]

• The HPCS program has developed a new suite of benchmarks (HPC Challenge)
• Each benchmark focuses on a different part of the memory hierarchy
• HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier
21 / 47
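The RandomAccess (GUPS) kernel mentioned above boils down to XOR-updating randomly chosen table locations. Below is a minimal serial sketch of that update loop; the table size, update count, and simplified pseudo-random recurrence are assumptions, not the official HPCC RandomAccess code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_BITS 24
#define TABLE_SIZE (1UL << TABLE_BITS)   /* 16M 64-bit words, an assumed size */

int main(void)
{
    uint64_t *T = malloc(TABLE_SIZE * sizeof(uint64_t));
    if (!T) return 1;
    for (uint64_t i = 0; i < TABLE_SIZE; i++) T[i] = i;

    uint64_t r = 1;                       /* pseudo-random stream state */
    for (uint64_t n = 0; n < 4 * TABLE_SIZE; n++) {
        /* step the recurrence (a stand-in for the HPCC random-number stream) */
        r = (r << 1) ^ ((int64_t)r < 0 ? 7 : 0);
        /* the characteristic GUPS update: T(i) = XOR(T(i), r), index from random bits */
        T[r & (TABLE_SIZE - 1)] ^= r;
    }
    printf("T[1] = %llu\n", (unsigned long long)T[1]);  /* keep the work observable */
    free(T);
    return 0;
}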

— skipped — Spatial and Temporal Locality

[Diagram: a processor issuing memory references Op1/Op2 with stride = 3 and reuse = 2]
• Programs can be decomposed into memory reference patterns
• Stride is the distance between memory references
• Programs with small strides have high "spatial locality"
• Reuse is the number of operations performed on each reference
• Programs with large reuse have high "temporal locality"
• These can be measured in real programs and correlated with HPC Challenge

22 / 47

— skipped — Spatial/Temporal Locality Results

[Figure: temporal vs. spatial locality scores; MetaSim data from Snavely et al. (SDSC). The HPC Challenge benchmarks (HPL/Top500, FFT, STREAM, RandomAccess/GUPS) bound the scores of scientific applications (Avus Large, cth7 Large, Gamess Large, lammps Large, overflow2, wrf) and intelligence/surveillance/reconnaissance applications.]

• HPC Challenge bounds real applications
• Allows us to map between applications and benchmarks
23 / 47

Intel MPI Benchmarks Used
1. Barrier: MPI_Barrier is used to synchronize all processes.
2. Reduction: each process provides A numbers; the global result, stored at the root process, is also A numbers, where result element A[i] combines the A[i] contributions from all N processes.
3. Allreduce: MPI_Allreduce is similar to MPI_Reduce except that all members of the communicator group receive the reduced result.
4. Reduce_scatter: the outcome is the same as an MPI_Reduce followed by an MPI_Scatter.
5. Allgather: all processes in the communicator receive the result, not only the root.
24 / 47

Intel MPI Benchmarks Used

1. Allgatherv: the vector variant of MPI_Allgather.
2. Alltoall: every process sends A*N bytes and receives A*N bytes (A bytes to/from each process), where N is the number of processes.
3. Sendrecv: each process sends a message to its right neighbor and receives from its left neighbor in a chain.
4. Exchange: each process exchanges data with both its left and right neighbors in the chain.
5. Broadcast: one process broadcasts to all members of the communicator.
(An illustrative timing sketch for one of these collectives follows below.)

25 / 47
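To make the measurement methodology concrete, here is a minimal MPI sketch that times a 1 MB MPI_Allreduce the way a simple IMB-style benchmark might: repeat the collective, average, and report the slowest rank. The message size, repetition count, and reporting scheme are simplifying assumptions; the real Intel MPI Benchmarks additionally handle warm-up, cache effects, and a range of message sizes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 17;                /* 2^17 doubles x 8 bytes = 1 MB per buffer */
    const int reps = 100;
    double *snd = malloc(n * sizeof(double));
    double *rcv = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) snd[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);          /* synchronize all ranks before timing */
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++)
        MPI_Allreduce(snd, rcv, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / reps; /* average time per call on this rank */

    double tmax;                          /* report the slowest rank, similar to IMB's t_max */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("1 MB Allreduce: %.2f us per call\n", tmax * 1e6);

    free(snd); free(rcv);
    MPI_Finalize();
    return 0;
}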

Accumulated Random Ring Bandwidth vs. HPL Performance

[Figure: accumulated random ring bandwidth (GByte/s) vs. Linpack (HPL) TFlop/s, log-log scale. Systems: NEC SX-8 (32-576 CPUs), Cray Opteron (4-64 CPUs), SGI Altix NUMALINK3 (64-440 CPUs), SGI Altix NUMALINK4 (64-2024 CPUs).]

26 / 47

Accumulated Random Ring Bandwidth vs. HPL Performance: Balance Ratio

[Figure: accumulated random ring bandwidth divided by Linpack HPL (GByte/s per TFlop/s) vs. Linpack (HPL) TFlop/s, log-log scale. Systems: NEC SX-8 (32-576 CPUs), Cray Opteron (4-64 CPUs), SGI Altix NUMALINK3 (64-440 CPUs), SGI Altix NUMALINK4 (64-2024 CPUs).]

27 / 47

Accumulated EP STREAM Copy vs. HPL Performance

[Figure: accumulated embarrassingly parallel STREAM copy bandwidth (TByte/s) vs. Linpack (HPL) TFlop/s, log-log scale. Systems: NEC SX-8 (32-576 CPUs), Cray Opteron (4-64 CPUs), SGI Altix NUMALINK3 (64-440 CPUs), SGI Altix NUMALINK4 (64-2024 CPUs).]

28 / 47

Accumulated EP STREAM Copy vs. HPL Performance: Balance Ratio

[Figure: accumulated EP STREAM copy divided by Linpack HPL (TByte/s per TFlop/s) vs. Linpack (HPL) TFlop/s, log-log scale. Same four systems as above.]

29 / 47

Normalized Values of HPCC Benchmark

Ratio                    | Maximum value
G-HPL                    | 8.729 TF/s
G-EP DGEMM / G-HPL       | 1.925
G-FFTE / G-HPL           | 0.020
G-PTRANS / G-HPL         | 0.039 B/F
G-StreamCopy / G-HPL     | 2.893 B/F
RandRingBW / PP-HPL      | 0.094 B/F
1 / RandRingLatency      | 0.197 1/µs
G-RandomAccess / G-HPL   | 4.9e-5 update/F

30 / 47

HPCC Benchmarks Normalized with HPL Value

[Figure: bar chart of normalized values (0 to 1) of the HPCC metrics listed in the table above - G-HPL, G-EP DGEMM/G-HPL, G-FFTE/G-HPL, G-PTRANS/G-HPL, G-StreamCopy/G-HPL, RandRingBW/PP-HPL, 1/RandRingLatency, G-RandomAccess/G-HPL - for NEC SX-8, Cray Opteron, SGI Altix NUMALINK3, and SGI Altix NUMALINK4.]

31 / 47

IMB Barrier Benchmark

[Figure: MPI_Barrier execution time per call (µs) vs. number of processors, log-log scale. Systems: SGI Altix BX2, Cray Opteron Cluster, Cray X1 (SSP), Cray X1 (MSP), NEC SX-8, Xeon Cluster.]

32 / 47

1 MB Reduction

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Reduce; SGI Altix BX2, Cray Opteron Cluster, Cray X1 (SSP), Cray X1 (MSP), NEC SX-8, Xeon Cluster.]
Best … worst: SX-8 → Cray X1 → SGI & clusters

33 / 47

1 MB Allreduce

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Allreduce; same six systems.]
Best … worst: SX-8 → Cray X1 → SGI & clusters

34 / 47

1 MB Reduce_scatter

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Reduce_scatter; same six systems.]
Best … worst: SX-8 → Cray X1 → SGI & clusters

35 / 47

1 MB Allgather

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Allgather; same six systems.]
Best … worst: SX-8 → Cray X1 → SGI & clusters

36 / 47

1 MB Allgatherv

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Allgatherv; same six systems.]
Best … worst: SX-8 → Cray X1 → SGI & clusters

37 / 47

1 MB Alltoall

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Alltoall; same six systems.]
Best … worst: SX-8 → SGI & Cray X1 → clusters

38 / 47

1 MB Sendrecv

[Figure: bandwidth (MByte/s) vs. number of processors for a 1 MB Sendrecv chain; same six systems.]
Best … worst: SX-8 → SGI → Cray X1 → clusters

39 / 47

1 MB Exchange

[Figure: bandwidth (MByte/s) vs. number of processors for a 1 MB Exchange; same six systems.]
Best … worst: SX-8 → Xeon → Cray X1 & SGI → Cray Opteron

40 / 47

1 MB Broadcast

[Figure: execution time per MPI call (µs) vs. number of processors for a 1 MB MPI_Bcast; same six systems.]
Best … worst: SX-8 → SGI → Cray X1 → clusters

41 / 47

Balance between Random Ring Bandwidth (network bandwidth) and HPL Benchmark (CPU speed)

• The ratio measures the balance between inter-node communication and computational speed (Linpack).
• It varies between systems by a factor of ~20. (HPCC public data, status Feb. 6, 2006)

Measurements with smaller #CPUs on XT3 and SX-8 courtesy of Nathan Wichmann, Cray, CUG 2005, and Sunil Tiyyagura, HLRS.
42 / 47
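For reference, the random/natural ring bandwidth on which this balance ratio is built can be approximated with a simple ring exchange. The sketch below measures a natural-ring bandwidth per process with MPI_Sendrecv; the message size, repetition count, and the use of blocking Sendrecv only (HPCC also measures a non-blocking variant and a randomized ring) are simplifying assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int bytes = 2000000;            /* assumed message size of roughly 2 MB */
    const int reps = 50;
    char *sbuf = malloc(bytes), *rbuf = malloc(bytes);
    memset(sbuf, 1, bytes);

    int right = (rank + 1) % np;          /* natural-ring neighbors */
    int left  = (rank - 1 + np) % np;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++) {
        /* send to the right neighbor, receive from the left */
        MPI_Sendrecv(sbuf, bytes, MPI_CHAR, right, 0,
                     rbuf, bytes, MPI_CHAR, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* and the reverse direction, so each link is used both ways */
        MPI_Sendrecv(sbuf, bytes, MPI_CHAR, left, 1,
                     rbuf, bytes, MPI_CHAR, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t = (MPI_Wtime() - t0) / reps; /* time per iteration on this rank */

    if (rank == 0)                        /* rank 0's local value; a real benchmark aggregates over all ranks */
        printf("per-process ring bandwidth: %.1f MB/s\n",
               2.0 * bytes / t / 1e6);    /* this process sends 2 messages per iteration */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}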

Balance between Random Ring Bandwidth and CPU speed (same as on the previous slide, but with linear scales)

43 / 47

Balance between memory and CPU speed
• High memory bandwidth ratio on vector-type systems (NEC SX-8, Cray X1 & X1E), but also on the Cray XT3.

• Balance: variation between systems is only about a factor of 10.

44 / 47

Balance between Fast Fourier Transform (FFTE) and CPU

• The ratio varies between systems by a factor of ~20.

45 / 47

Balance between Matrix Transpose (PTRANS) and CPU

• Balance: Variation between systems larger than 30.

46 / 47

Summary

[HPCC and IMB measurements]
• Performance of the vector systems is consistently better than that of all the scalar systems
• Performance of the NEC SX-8 is better than that of the Cray X1
• Performance of the SGI Altix BX2 is better than that of the Dell Xeon cluster and the Cray Opteron cluster
• Network ranking: IXS (SX-8) > Cray X1 network > SGI Altix BX2 (NUMALINK4) > Dell Xeon cluster (InfiniBand) > Cray Opteron cluster (Myrinet)
[Publicly available HPCC data]
• The Cray XT3 has a strongly balanced network, similar to the NEC SX-8

47 / 47
