Many Integrated Core Prototype
G. Erbacci – CINECA
PRACE Autumn School 2012 on Massively Parallel Architectures and Molecular Simulations
Sofia, 24-28 September 2012

Outline
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

HPC Evolution

HPC at CINECA
CINECA is the National Supercomputing Centre in Italy:
• manage the HPC infrastructure
• provide support to Italian and European researchers
• promote technology transfer initiatives for industry
• CINECA is a Hosting Member in PRACE
  – PLX: Linux cluster with GPUs (Tier-1 in PRACE)
  – FERMI: IBM BG/Q (Tier-0 in PRACE)

PLX@CINECA
IBM Linux cluster
Processor type: 2 six-core Intel Xeon X5645 (Westmere) @ 2.4 GHz, 12 MB cache
N. of nodes / cores: 274 / 3288
RAM: 48 GB per compute node (14 TB in total)
Internal network: Infiniband with 4x QDR switches (40 Gbps)
Accelerators: 2 nVIDIA M2070 GPUs per node, 548 GPUs in total
Peak performance: 32 TFlops (CPUs), 565 TFlops SP / 283 TFlops DP (GPUs)

FERMI@CINECA
Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2 @ 1.6 GHz
Computing cores: 163840
Computing nodes: 10240
RAM: 1 GByte per core (163 TByte in total)
Internal network: 5D Torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
N. 7 in the Top 500 rank (June 2012)
National and PRACE Tier-0 calls

CINECA HPC Infrastructure

Computational Sciences
Computational science (together with theory and experimentation) is the "third pillar" of scientific inquiry, enabling researchers to build and test models of complex phenomena.
Quick evolution of innovation:
- Instantaneous communication
- Geographically distributed work
- Increased productivity
- More data everywhere
- Increasing problem complexity
- Innovation happens worldwide

Technology Evolution
More data everywhere: radar, satellites, CAT scans, sensors, micro-arrays, weather models, the human genome. The size and resolution of the problems scientists address today are limited only by the size of the data they can reasonably work with. There is a constantly increasing demand for faster processing on bigger data.
Increasing problem complexity: partly driven by the ability to handle bigger data, but also by the requirements and opportunities brought by new technologies. For example, new kinds of medical scans create new computational challenges.
HPC evolution: as technology allows scientists to handle bigger datasets and faster computations, they push to solve harder problems. In turn, the new class of problems drives the next cycle of technology innovation.

Top 500: some facts
1976  Cray 1 installed at Los Alamos: peak performance 160 MegaFlop/s (10^6 flop/s)
1993  (1st edition of the Top 500) N. 1 at 59.7 GFlop/s (10^9 flop/s)
1997  TeraFlop/s barrier (10^12 flop/s)
2008  PetaFlop/s (10^15 flop/s): Roadrunner (LANL), Rmax 1026 TFlop/s, Rpeak 1375 TFlop/s; hybrid system: 6562 dual-core AMD Opteron processors accelerated with 12240 IBM Cell processors (98 TByte of RAM)
2012  (June) 16.3 PetaFlop/s: Lawrence Livermore's Sequoia supercomputer, BlueGene/Q (1,572,864 cores)
- 4 European systems in the Top 10
- Total combined performance of all 500 systems has grown to 123.02 PFlop/s, compared to 74.2 PFlop/s six months ago
- 57 systems use accelerators
Toward Exascale

Dennard Scaling Law (MOSFET)
Classical Dennard scaling (per technology generation):
• L' = L / 2
• V' = V / 2
• F' = F * 2
• D' = 1 / L^2 = 4 D
• P' = P
Dennard scaling does not hold anymore: the core frequency and performance no longer grow following Moore's law.
Current scaling:
• L' = L / 2
• V' = ~V
• F' = ~F * 2
• D' = 1 / L^2 = 4 * D
• P' = 4 * P
The power crisis! CPU + Accelerator architectures to maintain the evolution of Moore's law. The programming crisis!
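The jump from P' = P to P' = 4P can be made explicit with the usual CMOS dynamic-power relation. The derivation below is a sketch added for clarity, not part of the original slides; it interprets P as power per unit die area and assumes the device capacitance C scales with the feature size L.

```latex
% Sketch (assumptions: P = power per unit die area, C proportional to L)
% Per-device dynamic power:  P_dev \propto C V^2 F
% Power per unit area:       W = D * P_dev
%
% Classical Dennard scaling (L'=L/2, V'=V/2, F'=2F, D'=4D):
%   W' = 4D * (C/2)(V/2)^2 (2F) = D * C V^2 F = W        =>  P' = P
%
% Post-Dennard scaling (L'=L/2, V'~V, F'~2F, D'=4D):
%   W' = 4D * (C/2) V^2 (2F)    = 4 * D * C V^2 F = 4W   =>  P' = 4P
\[
  W = D\,P_{\mathrm{dev}},\qquad
  P_{\mathrm{dev}} \propto C V^{2} F,\qquad
  C \propto L .
\]
```

Keeping the supply voltage roughly constant is what turns a constant-power-density scaling rule into a fourfold power-density increase per generation, which is the power crisis the slide refers to.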
• V’ = V / 2 The core frequency and performance do not • F’ = F * 2 grow following the • D’ = 1 / L2 = 4D Moore’s law any longer • P’ = P L’ = L / 2 CPU + Accelerator V’ = ~V to maintain the F’ = ~F * 2 architectures evolution In the Moore’s law D’ = 1 / L2 = 4 * D P’ = 4 * P Programming crisis! The power crisis! 11 Roadmap to Exascale(architectural trends) 12 Heterogeneous Multi-core Architecture • Combines different types of processors – Each optimized for a different operational modality • Performance – Synthesis favors superior performance • For complex computation exhibiting distinct modalities • Purpose-designed accelerators – Integrated to significantly speedup some critical aspect of one or more important classes of computation – IBM Cell architecture, ClearSpeed SIMD attached array processor, • Conventional co-processors – Graphical processing units (GPU) – Network controllers (NIC) – Many Integrated Cores (MIC ) – Efforts underway to apply existing special purpose components to general applications 13 Accelerators A set (one or more) of very simple execution units that can perform few operations (with respect to standard CPU) with very high efficiency. When combined with full featured CPU (CISC or RISC) can accelerate the “nominal” speed of a system. CPU ACC. Single thread perf. throughput ACCCPU. Physical integration CPU & ACC Architectural integration 14 nVIDIA GPU Fermi implementation packs 512 processor cores 15 ATI FireStream, AMD GPU 2012 New Graphics Core Next “GCN” With new instruction set and new SIMD design 16 Intel MIC (Knight Ferry) 17 Real HPC Crisis is with Software A supercomputer application and software are usually much more long-lived than a hardware - Hardware life typically four-five years at most. - Fortran and C are still the main programming models Programming is stuck - Arguably hasn’t changed so much since the 70’s Software is a major cost component of modern technologies - The tradition in HPC system procurement is to assume that the software is free. It’s time for a change - Complexity is rising dramatically - Challenges for the applications on Petaflop systems - Improvement of existing codes will become complex and partly impossible - The use of O(100K) cores implies dramatic optimization effort - New paradigm as the support of a hundred threads in one node implies new parallelization strategies - Implementation of new parallel programming methods in existing large applications has not always a promising perspective There is the need for new community codes 18 What about parallel App? • In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law). 
Trends
Application trends: scalar, vector, distributed memory, shared memory, hybrid codes.
System and programming-model trends:
• MPP systems, message passing: MPI
• Multi-core nodes: OpenMP
• Accelerators (GPGPU, FPGA): CUDA, OpenCL
• Hybrid codes

The Eurora Prototype

EURORA Prototype
• Evolution of the AURORA architecture by Eurotech (http://www.eurotech.com/)
  – Aurora rack: 256 nodes, 512 CPUs
  – 101 TFlops @ 100 kW
  – liquid cooled
• CPU: Xeon Sandy Bridge (SB)
  – Up to one full cabinet (128 nodes + 256 accelerators)
• Accelerator: Intel Many Integrated Core (MIC)
• Network architecture: IB and Torus interconnect
  – Low-latency / high-bandwidth interconnect
• Cooling: hot water

EURORA chassis
1 rack = 16 chassis; each chassis holds a 16-node card, or an 8-node card + 16 accelerators.
Eurora rack physical dimensions: 2133 mm (48U) h, 1095 mm w, 1500 mm d
Weight (full rack with cooling, fully loaded with water): 2000 kg
Power/cooling typical requirements: 120-130 kW @ 48 Vdc

EURORA node
• 2 Intel Xeon E5
• 2 Intel MIC or 2 nVidia Kepler
• 16 GByte DDR3 1.6 GHz
• SSD disk

Node card mockup
• Presented at ISC12
• Can host MIC and K20 cards
• Thermal analysis and validation performed

EURORA Network
• 3D Torus custom network: FPGA (Altera Stratix V), EXTOLL, APEnet, ad-hoc MPI subset
• InfiniBand FDR: Mellanox ConnectX-3, MPI + filesystem, synchronization

Cooling
• Hot water 50-80 °C
• Temperature gap 3-5 °C
• No rotating fans
• Cold plates – direct on-component liquid cooling
• Dry chillers
• Free cooling
• Quick disconnect
• Temperature sensors – downgrade performance when required
• System isolation

EURORA prototype (Node Accelerator)
EURORA: EURopean many integrated cORe Architecture
Goal: evaluate a new architecture for the next generation Tier-0 system.
Partners:
- CINECA, Italy
- GRNET, Greece
- IPB, Serbia
- NCSA, Bulgaria
Vendor: Eurotech, Italy

EURORA Installation Plan

HW Procurement
• Contract with EUROTECH signed in July
  – 64 compute cards
  – 128 Xeon SandyBridge 3.1 GHz
  – 16 GByte DDR3 1600 MHz per node
  – 160 GByte SSD per node
  – 1 FPGA (Altera Stratix V) per node
  – IB FDR
  – 128 accelerator cards: Intel KNC (or NVIDIA K20)
  – Thermal sensors network

HW Procurement and Facility
• Contract with EUROTECH signed in July
• Integration in the facility
  – First assessment of the location with EUROTECH in May
  – First integration project completed
    • Estimated cost higher than budgeted
  – Second assessment with EUROTECH in September (before the end)
  – Procurement of the technology: dry coolers, pipes and pumps, exchanger, tanks, filters

Some Applications
• Quantum ESPRESSO: www.quantum-espresso.org
• GROMACS: www.gromacs.org

EURORA Programming Models
• Message passing (MPI)
• Shared memory (OpenMP, TBB)
• MIC offload (pragmas) / native
• Hybrid: MPI + OpenMP + MIC extensions / OpenCL
(A minimal offload sketch is given at the end of this document.)

ACCELERATORS
• First K20 and KNC (dense form factor) samples in September
• KNC standard expansion module already available to start the work on software

Software
• Installation of the KNC software kit
• Test of the compiler and of the node card HW
• First simple (MPI + OpenMP) application test
• First MIC-to-MIC MPI communication test
  – Intel MPI
  – within the same node
• Test of the affinity

ACCESS
• Access will be granted upon request to the partners of the prototype project.
• Other requests will be evaluated case by case.
• We are working
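To illustrate the "MIC offload (pragmas)" and hybrid MPI + OpenMP items listed under EURORA Programming Models, here is a minimal sketch. It is not taken from the original slides; it assumes the Intel C compiler with offload support (e.g. mpiicc -openmp), and the array, problem size, and reduction are illustrative placeholders.

```c
/*
 * Hedged sketch of the hybrid model: MPI across nodes, with an
 * OpenMP region offloaded to the Intel MIC inside each rank.
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* illustrative problem size per rank */

int main(int argc, char *argv[])
{
    int rank, nranks;
    double *a, local_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    a = (double *) malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        a[i] = (double) rank + 1.0;     /* dummy per-rank data */

    /* Offload the compute-intensive loop to the MIC coprocessor:
     * "in" copies the array to the device, "inout" returns the sum. */
    #pragma offload target(mic) in(a : length(N)) inout(local_sum)
    {
        #pragma omp parallel for reduction(+ : local_sum)
        for (int i = 0; i < N; i++)
            local_sum += a[i];
    }

    /* Combine the per-rank partial sums on the host side with MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, global sum = %f\n", nranks, global_sum);

    free(a);
    MPI_Finalize();
    return 0;
}
```

The alternative "native" mode mentioned on the same slide would instead cross-compile the whole MPI + OpenMP binary for the coprocessor (the Intel compiler's -mmic flag) and run it directly on the MIC.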