K-Computer
Xinya (Leah) Zhao, Abdulahi Abu

Outline

• History of Supercomputing
• K-Computer Architecture
• Programming Environment
• Performance Evaluation
• What's next?

Timeline of Supercomputing

Control Data Corporation (1960s)
• CDC 1604 (1960): First solid-state computer
• CDC 6600 (1964): 100 computers were sold at $8 million each
• Gained speed by "farming out" work to peripheral computing elements, freeing the CPU to process actual data
• STAR-100: First to use vector processing

The Cray Era (mid-1970s - 1980s)
• 80 MHz Cray-1 (1976): The most successful supercomputer in history
• Introduced chaining, in which scalar and vector registers generate interim results
• The Cray-2 (1985): No chaining, and high memory latency with deep pipelining

Massive Processing (1990s)
• SR2201 (1996): Used 2048 processors connected via a fast three-dimensional crossbar network
• ASCI Red: Mesh-based MIMD massively parallel system with over 9,000 compute nodes and well over 12 terabytes of disk storage
• ASCI Red was the first ever to break through the 1 teraflop barrier

Petaflop Computing (21st century)
• Realization: The power of a large number of small processors can be harnessed to achieve high performance
• IBM Blue Gene architecture: Trades processor speed for low power consumption, so a large number of processors can be used at air-cooled temperatures
• (2011): K Computer is #1, the fastest in the world!

Why K Computer?

Purpose:
• Perform extremely complex mathematical or scientific calculations (e.g., modelling the changes in climate over millions of years)
• Areas of application: manufacturing, aerospace, biotechnology, and disaster prevention

K Computer Initiative:
• Part of the High-Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT)
• Developed by RIKEN and Fujitsu
• Completed in August 2011

Specs

CPU Specs

• SPARC64 VIIIfx (Venus)
  o Reuses part of the SPARC64 VII architecture
• 8 cores at 2.0 GHz
• L1 cache = 32 KB, L2 cache = 6 MB (shared)
  o Software- or hardware-controlled cache through the HPC extension
• High performance with low power
  o 128 GFLOPS
  o 64 GB/s memory throughput
  o 58 W (typical, 30°C)
• Hardware barrier for fast inter-core sync (a software analogue is sketched after the HPC Extension list below)

HPC Extension

• HPC-ACE: Fujitsu's unique ISA extension
• Large register sets
  o 160 -> 192 INT registers, 32 -> 256 FP registers (DP)
  o Extract more parallelism
  o Reduce spill/fill overhead
• SIMD enabled
• Software-controlled cache
  o Goal is to optimize performance while keeping cache coherency
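The hardware barrier above is what makes synchronizing all eight cores cheap enough to treat them as one fast CPU (as VISIMPACT does, covered later). As a rough software analogue, not Fujitsu's actual mechanism, here is a minimal C sketch with POSIX threads in which eight workers rendezvous at a barrier between two loop phases:

```c
#include <pthread.h>
#include <stdio.h>

#define NCORES 8          /* one worker per SPARC64 VIIIfx core */
#define N      1024

static pthread_barrier_t barrier;
static double data[NCORES][N];

/* Each worker fills its slice, then waits at the barrier so that
 * phase 2 never starts before every core has finished phase 1. */
static void *worker(void *arg)
{
    long id = (long)arg;

    for (int i = 0; i < N; i++)          /* phase 1: produce */
        data[id][i] = id + i * 0.5;

    pthread_barrier_wait(&barrier);      /* inter-core sync point */

    double sum = 0.0;                    /* phase 2: read a neighbour's slice */
    for (int i = 0; i < N; i++)
        sum += data[(id + 1) % NCORES][i];
    printf("core %ld: neighbour sum = %f\n", id, sum);
    return NULL;
}

int main(void)
{
    pthread_t t[NCORES];
    pthread_barrier_init(&barrier, NULL, NCORES);
    for (long i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```

On K, the dedicated hardware makes this rendezvous far cheaper than a memory-based barrier like the one shown.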

K Computer Packaging

[Packaging diagram labels: IO system board, service processor board, system board, power supply, cooling tubes, system disk]

"Slanted Implementation" Supports Cooling and High Density ƒ 24 system boards ƒ 6 IO boards System Configuration Network Architecture Tofu: 6D mesh/torus Interconnect Architecutre • High communication performance • High system scalability • High fault-tolerance

Main Architecture Composition
• Node Construction
• Network Construction
• Routing function

Why 6 dimensions?

Node Construction

• Single CPU and single interconnect controller
  o Compute nodes: 80,000
  o Number of cores: 640,000
• Memory: 1 PB (16 GB/node)
• 10 links for inter-node connection (see the count sketched below)
• 10 GB/s per link
• Routing chip structure
• No outside switch
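The 10 links per node fall out of the 6D shape. In the published Tofu design the x, y, and z dimensions are large tori (two neighbours each), while a, b, and c are small rings of sizes 2, 3, and 2; in a ring of size 2 the +1 and -1 neighbours coincide, so a and c contribute one link each. A minimal C sketch of this count (the x/y/z extents are illustrative placeholders, not the production dimensions):

```c
#include <stdio.h>

/* Tofu coordinate: x,y,z form large tori; a,b,c are small rings
 * of size 2, 3 and 2 (per the published Tofu design). */
#define DIMS 6
static const int extent[DIMS] = {17, 17, 17, 2, 3, 2}; /* x,y,z are placeholders */
static const char *name[DIMS] = {"x", "y", "z", "a", "b", "c"};

int main(void)
{
    int links = 0;
    for (int d = 0; d < DIMS; d++) {
        /* In a ring of size 2, the +1 and -1 hops reach the same
         * node, so the dimension contributes one link, not two. */
        int n = (extent[d] == 2) ? 1 : 2;
        printf("%s (size %d): %d neighbour(s)\n", name[d], extent[d], n);
        links += n;
    }
    printf("links per node: %d\n", links);  /* prints 10 */
    return 0;
}
```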

Network Construction

Routing Algorithm

Extended dimension-order routing
• abc to xyz to abc
• The first abc traversal is path selection

Example routing
• Routing from (x=0, y=0, z=0, a=0, b=0, c=0) to (3, 2, 1, 1, 1, 1)
• Traverses: +b, +a, +3x, +2y, +z, +c (computed in the sketch below)
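A minimal C sketch of the extended dimension-order traversal, with the hop order (b, a for path selection, then x, y, z in dimension order, then c) inferred from the example above; treat it as an illustration of the idea rather than the exact Tofu routing logic, and note that torus wrap-around is omitted:

```c
#include <stdio.h>

/* Coordinate array layout: x,y,z (large tori), then a,b,c (small rings). */
static const char *axis[6] = {"x", "y", "z", "a", "b", "c"};

/* Emit the hops from src to dst one dimension at a time.
 * Order inferred from the slide's example: b, a (path selection),
 * then x, y, z in dimension order, then c. */
static void route(const int src[6], const int dst[6])
{
    /* indices into the coordinate arrays: b=4, a=3, x=0, y=1, z=2, c=5 */
    static const int order[6] = {4, 3, 0, 1, 2, 5};
    for (int i = 0; i < 6; i++) {
        int d = order[i];
        int delta = dst[d] - src[d];   /* mesh-style offset; no wrap-around */
        for (int s = 0; s < (delta > 0 ? delta : -delta); s++)
            printf("%c%s ", delta > 0 ? '+' : '-', axis[d]);
    }
    printf("\n");
}

int main(void)
{
    int src[6] = {0, 0, 0, 0, 0, 0};
    int dst[6] = {3, 2, 1, 1, 1, 1};
    route(src, dst);   /* prints: +b +a +x +x +x +y +y +z +c */
    return 0;
}
```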

Where is the K Computer?

Cooling System

Computer Floor

Software Structure

Programming Environment

• Major problems
  o Too many processors to manipulate
  o Very little opportunity for coarse-grain parallelism
  o Scheduling on processors
• Solutions on the K computer
  o VISIMPACT
  o Automatic parallelization facility makes multiple cores into one high-speed core
  o Exploit fine-grain parallelism
• Open Petascale Libraries
  o Development platform for applications on petascale-class systems

VISIMPACT

• What
  o Hybrid execution model (MPI + threading between cores; sketched below)
  o Treats the 8-core CPU as one high-speed CPU
• Why
  o Improve parallel efficiency and reduce memory impact
  o Make it easier to program on many cores
• How
  o Hardware barriers between cores
  o Shared L2 cache
  o Automatic parallelizing compiler

MPI

• Open MPI based
  o Added extension for the "Tofu" interconnect
• Goals for MPI on the K computer
  o High performance
    - High bandwidth
  o High scalability
    - The K computer will grow
• Dimension specific for each rank
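VISIMPACT's hybrid execution model pairs MPI between nodes with threading across the eight cores inside each node. On K the intra-node threading comes from the automatic parallelizing compiler; the hand-written MPI + OpenMP sketch below shows the same shape of program (array size and values are illustrative):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* One MPI rank per node, one thread per core: the VISIMPACT-style
 * hybrid model. On K the threading is produced by the automatic
 * parallelizing compiler rather than written by hand. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double a[N];
    double local = 0.0;

    /* Threading between cores: 8 threads share this node's slice. */
    #pragma omp parallel for reduction(+:local) num_threads(8)
    for (int i = 0; i < N; i++) {
        a[i] = (rank + 1) * 0.001 * i;
        local += a[i];
    }

    /* MPI between nodes: combine the per-node partial sums. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total over %d ranks = %e\n", nranks, total);

    MPI_Finalize();
    return 0;
}
```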

Performance (Nov 2010)

• Measured results
  o Max (LINPACK) = 48.03 TFLOPS
  o Peak = 52.22 TFLOPS
  o Efficiency = 92.0%
  o Power consumption = 57.96 kW
  o Performance per power = 828.7 MFLOPS/W
• Ranking
  o TOP500 = 170th
  o Green500 = 4th

Performance (June 2011)

• Measured results
  o Max (LINPACK) = 10.51 PFLOPS
  o Peak = 11.28 PFLOPS
  o Efficiency = 93.2%
  o Power consumption = 9.89 MW
  o Performance per power = 824.6 MFLOPS/W
• Ranking
  o TOP500 = 1st
  o Green500 = 2nd
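As a sanity check, the efficiency figures are simply measured LINPACK performance over theoretical peak; for the June 2011 run:

```latex
\text{Efficiency} = \frac{R_{\text{max}}}{R_{\text{peak}}}
                  = \frac{10.51~\text{PFLOPS}}{11.28~\text{PFLOPS}}
                  \approx 0.932 = 93.2\%
```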

CPU Performance

HPC Challenge 2011

K Computer Comparison

What's next?

• SPARC64 IXfx
  o 16 cores, 12 MB shared L2 cache, runs at 1.848 GHz
  o Peak performance of 236.5 GFLOPS and a power efficiency of more than 2 GFLOPS per watt (115 W per chip)
• PRIMEHPC FX10
  o Follow-up to the K computer
  o 20 petaflops, 6-dimensional torus interconnect, one SPARC processor per node
• RIKEN and Fujitsu will focus on developing and assessing system software for the large-scale K computer, aiming for system completion in June 2012

Questions?