“Future Technologies” (WP8) Prototypes

Iris Christadler, Dr. Herbert Huber
Leibniz Supercomputing Centre, Germany
SC09, “Future Technologies” (WP8) Prototypes, Outlook Deliverable D8.3.2

Prototype Overview (1/2)

• CEA, “GPU/CAPS”
  – Prototype: 1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes
  – Goals: Take advantage of accelerators more easily. Compare CAPS HMPP with other approaches to programming accelerators.
• CINECA
  – Prototype: I/O subsystem (SSD, Lustre, pNFS)
  – Goals: Assess the applicability of new file system and storage technologies.
• CINES-LRZ, “LRB/CS”
  – Prototype: Hybrid SGI ICE/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee
  – Goals: Evaluate a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system.
• CSCS, “UPC/CAF”
  – Prototype: PGAS language compilers (CAF + UPC for Cray XT systems)
  – Goals: Understand the usability and programmability of PGAS languages.
• EPCC, “FPGA”
  – Prototype: Maxwell FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))
  – Goals: Assess the potential of high-level languages for using FPGAs in HPC. Compare energy efficiency with other solutions.

Prototype Overview (2/2)

• FZJ, “Cell & FPGA interconnect”
  – Prototype: eQPACE (PowerXCell 8i cluster with special network processor)
  – Goals: Gain deep expertise in communication issues. Extend the application domain of the QPACE system.
• LRZ, “RapidMind”
  – Prototype: RapidMind Multi-Core Development Platform (automatic code generation for x86, GPUs and Cell)
  – Goals: Assess the potential of data stream languages. Compare RapidMind with other approaches for programming accelerators or multi-core systems.
• NCF, “ClearSpeed”
  – Prototype: ClearSpeed CATS 700 units
  – Goals: Evaluate ClearSpeed accelerator hardware for large-scale applications.
• SNIC-KTH
  – Prototype: Air-cooled blade system from Supermicro with AMD Istanbul processors & QDR IB
  – Goals: Evaluate and optimize energy efficiency and packing density of commodity hardware.

Experiences with the prototypes will be reported (subject to EC approval) in Deliverable D8.3.2 [http://www.prace-project.eu/documents/public-deliverables-1/]

The teaser
A SELECTION OF RESULTS

Rinf

Euroben results - accelerator languages

[Figure: Accelerator Languages (absolute performance). Log-scale y-axis: Mflops. Categories: mod2am, mod2as, mod2f, peak perf. Series: MKL (8 Nehalem cores), CUDA (1 C1060), CellSs (1 PowerXCell8i), Cn (1 CSX700). Data labels vs. peak: 94%, 81%, 79%, 78%.]
[Figure: Accelerator Languages (% of peak performance). Log-scale y-axis: % of peak performance. Categories: mod2am, mod2as, mod2f. Series: MKL, CUDA, CellSs, Cn. Data labels: 94, 81, 79, 78.]
Note on both charts: mod2f/MKL: single-threaded only.

Euroben results - GPGPU languages

[Figure: Performance Comparison (dense matrix-matrix mul.) on Nvidia C1060. y-axis: Gflops. x-axis: matrix size (m). Series: CUDA, CAPS, CUDA+MPI 4x4, RapidMind, OpenCL, MKL (8 cores Nehalem).]

Euroben results - productivity

[Figure: Development Time versus Performance (dense matrix-matrix mul.). Left y-axis: Performance in Mflops (log scale). Right y-axis: Development Time in Days. Series: Performance, total time, first version.]
* OpenCL and CUDA+MPI port based on existing CUDA port
** RapidMind developer included time for benchmarking

First IO Results

A glimpse of what you will find in Deliverable D8.3.2
PROTOTYPES

eQPACE
Extend communication capabilities of eQPACE to make it suitable for a wider range of applications. Reach a top position in the Green500 list (FZJ).
• Hardware: PowerXCell 8i processor nodes with custom 3D-torus interconnect.
• Benchmarks: HPL, Euroben kernels, torus network benchmark, applications & iterative solvers.
• Programming environments: Cell SDK & CellSs

RapidMind
Evaluation of the RapidMind programming model (LRZ).
• Hardware:
  – CPUs (Nehalem EP, AMD Opteron)
  – GPUs (Nvidia Tesla and Quadro FX)
  – Cell (QS22-blade cluster)
• Software: RapidMind lets developers write code that can run on x86 cores as well as on accelerators like GPUs and Cell.
  – Evaluate ease-of-use & portability
  – Assess RapidMind performance on different architectures
  – Compare RapidMind with other accelerator languages
[Figure: RapidMind mod2am. y-axis: Gflops. x-axis: matrix size (m). Series: x86-dp (8 cores Nehalem), cuda-dp (C1060), glsl-sp (FX 5800).]

LRZ-CINES
Evaluation of a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system (CINES, LRZ).
• Hardware:
  – SGI ICE (Nehalem EP)
  – SGI UV (Nehalem EX)
  – ClearSpeed CSX700
• Benchmarks:
  – Euroben kernels
  – Synthetic BMs: HPL, Rinf, Intel MPI Benchmark, Apex-MAP
  – Application BMs: Gadget, RAxML, SPECFEM3D_GLOBE

Hybrid technology demonstrator
Evaluating GPGPU with CAPS HMPP (CEA).
• Hardware: Tesla servers connected to Bull servers via PCI-E.
• Software: CAPS HMPP makes it possible to exploit the potential of GPGPUs by simply adding preprocessor directives to legacy Fortran and C codes.
[Figure: CAPS hmpp mod2am. y-axis: Gflops. x-axis: matrix size (m).]
[Figure: CUDA mod2am. y-axis: Gflops. x-axis: matrix size (m).]

Maxwell FPGA
Evaluate the performance and usability of the HARWEST Compiling Environment (EPCC).
• Hardware: FPGA prototype “Maxwell” (32 FPGAs) from both Alpha Data Ltd and Nallatech Ltd, using Virtex-4 FPGAs supplied by Xilinx Corp.
• Benchmarks: 4 Euroben kernels
• Languages:
  – VHDL
  – HCE

PGAS languages
Evaluate ease of use of the PGAS programming model (CSCS).
• Hardware: Cray XT5
• Compiler: Cray Compiler Environment (CCE)
• Evaluation of the compiler:
  – Functional correctness
  – Conformance with language standards
  – Usability for existing CAF and UPC benchmarks/applications
• Benchmarks from Rice University, George Washington University and the Lawrence Berkeley National Laboratory

ClearSpeed/PetaPath
Evaluate the ClearSpeed-Petapath system (NCF).
• Hardware: 114 ClearSpeed CSX700 cards
• Language: Cn
• Benchmarks:
  – 4 Euroben kernels
  – 4 applications: astronomy, geophysics, numerical mathematics, medical tomography

XC4-IO
• Compare storage infrastructure access performance across different hardware configurations and file system architectures (CINECA).

SNIC-KTH
Evaluate energy efficiency of high-density commodity parts (SNIC-KTH).
• Hardware: AMD Istanbul
• Benchmarks: Euroben, STREAM, IMB, Gromacs, CFD
• Measure power consumption per component
• Adjust fan speed and fan power
• Assess energy management features of AMD Istanbul (control of voltage and frequency of components)
[Figure: Preliminary Results (Gromacs).]

Results will be reported in Deliverable D8.3.2.
RESEARCH ACTIVITIES

Parallel GPU
Evaluation of GPGPU programming languages (CSC).
• Languages
  – CUDA+MPI
  – OpenCL
• Benchmarks:
  – GPU-HMMER
  – Euroben kernels
• Hardware
  – Tesla
  – AMD Firestream
  – CEA WP8 Prototype

Advanced PGAS Programming
Evaluate usability of the PGAS programming model (CSC).
• Languages
  – Coarray Fortran (CAF)
  – Unified Parallel C (UPC)
• Benchmarks
  – Euroben mod2am/as/f
• Environments
  – Cray XT5 (cce)
  – SGI Altix (g95,

UPC code excerpt shown on the slide:

    upc_barrier;
    upc_forall (sc=0; sc<totblks; sc++; sc) {
      // Square matrix multiply C = A * B with aid of DGEMM
      double beta = 0;
      double *clocal = (double *)c[sc].x; // Local C-block for this UPC-thread
      int ib = sc / numblks, jb = sc % numblks, kb, i, j, k;
      for (kb=0; kb<numblks; kb++) {
        int sa = ib * numblks + kb; // The owner of A-block is sa % THREADS
        int sb = kb * numblks + jb; // The owner of B-block is sb % THREADS
        double *al = (sa%THREADS == MYTHREAD) ? // Get the A-block
          (double *)a[sa].x : (upc_memget(alocal, a[sa].x, ns), alocal);
        double *bl = (sb%THREADS == MYTHREAD) ? // Get the B-block
          (double *)b[sb].x : (upc_memget(blocal, b[sb].x, ns), blocal);
        double *cl = clocal; // The local C-block owned by this UPC-thread
        // Call BLAS3-library DGEMM
        dgemm_("N","N", &blksize, &blksize, &blksize, &alpha,
               al, &blksize, bl, &blksize, &beta, cl, &blksize, 1, 1);
        beta = 1;
      } /* for (kb=0; kb<numblks; kb++) */
    } /* upc_forall (sc=0; sc<totblks; sc++; sc) */
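The block indexing in the UPC excerpt can be illustrated with a serial sketch in plain C. This is not the code from the slides: it assumes the blocks are stored contiguously one after another, it replaces the BLAS DGEMM call with a naive hypothetical helper (block_gemm), and it omits UPC and the upc_memget transfers entirely. The names block_gemm, block_owner, blocked_matmul and blk_elems are illustrative assumptions. Only the arithmetic for sc, ib, jb, sa, sb and the beta = 0 then beta = 1 accumulation follow the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for DGEMM on one n x n column-major block:
   C = A * B + beta * C. As in BLAS, C is not read when beta is zero. */
void block_gemm(int n, const double *a, const double *b,
                double beta, double *c)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i + k * n] * b[k + j * n];
            c[i + j * n] = (beta == 0.0) ? sum : sum + beta * c[i + j * n];
        }
    }
}

/* Cyclic block ownership, mirroring "s % THREADS" in the slide excerpt. */
int block_owner(int s, int nthreads)
{
    assert(s >= 0 && nthreads > 0);
    return s % nthreads;
}

/* Serial blocked multiply C = A * B. Each matrix holds numblks x numblks
   blocks; block s (assumed layout) starts at offset s * blksize * blksize. */
void blocked_matmul(int numblks, int blksize,
                    const double *a, const double *b, double *c)
{
    assert(numblks > 0 && blksize > 0);
    size_t blk_elems = (size_t)blksize * (size_t)blksize;
    int totblks = numblks * numblks;

    for (int sc = 0; sc < totblks; sc++) {
        int ib = sc / numblks;   /* block row of the C-block    */
        int jb = sc % numblks;   /* block column of the C-block */
        double beta = 0.0;       /* first product overwrites C  */
        for (int kb = 0; kb < numblks; kb++) {
            int sa = ib * numblks + kb;  /* index of the A-block */
            int sb = kb * numblks + jb;  /* index of the B-block */
            block_gemm(blksize, a + sa * blk_elems, b + sb * blk_elems,
                       beta, c + sc * blk_elems);
            beta = 1.0;          /* later products accumulate   */
        }
    }
}
```

Calling blocked_matmul(2, 1, a, b, c) with a = {1, 2, 3, 4} and b = {5, 6, 7, 8} treats each scalar as a 1 x 1 block and computes the ordinary 2 x 2 product. In the UPC version, the outer loop iterations would be distributed across threads and a block would be fetched remotely whenever block_owner differs from the executing thread.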
