Technical Challenges for High Performance Computing in NEC

T. Watanabe

NEC Corp., Tokyo, Japan

A high performance computer (HPC), sometimes called a supercomputer, is a high-speed, large-scale computer system designed for scientific and technical computations. In this talk, key architectural and system points, as well as the basic hardware technologies needed to realize high performance computer systems for grand challenge problems, are introduced along with NEC's HPC product lines.

1. Introduction

A high performance computer (HPC), sometimes called a supercomputer, is a high-speed, large-scale computer system designed for scientific and technical computations in application areas such as meteorology, molecular and bio science, energy development, aircraft and automotive design, electronic device development, and space science. These applications are called "Grand Challenge Problems", in which the major target is to run the application programs within limited time periods.

In this talk, key architectural and system points, as well as the basic hardware technologies needed to realize high performance computer systems for grand challenge problems, are introduced along with NEC's HPC product lines.

Fig.1. History of High Performance Computers (plot: single-CPU performance and aggregate system performance versus year, 1980-2010, for machines from the S-810/20, VP-200, X-MP and SX-2 through the Earth Simulator, the ASCI machines and the SX-6).

Fig.1 shows the performance trend of high performance computers, for both single CPUs and whole systems, since the early 80's. It clearly shows that the increase of CPU speed follows the so-called Moore's law, that is, a doubling every 1.5 to 2 years, but that the aggregate system speed has been increasing much more rapidly than that of the CPU. This is due to the increase of parallelism at the system level. The transition of system configurations toward increasing parallelism is shown in Fig.2.

Fig.2. Transition of System Architecture (from single scalar and vector processors, through shared memory vector and scalar multi-processors, to distributed memory parallel processors and clustered vector/SMP machines, with the main bottleneck at each stage: memory throughput, vectorizing compilers, scalability, and parallel programming).
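To make the two growth rates concrete, here is a minimal sketch comparing a CPU speed that doubles every two years, as quoted above, with an aggregate system speed that is also multiplied by the growth in parallelism. The starting values and the parallelism figure are illustrative assumptions, not numbers taken from Fig.1.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative growth model: CPU speed doubles every 1.5-2 years
 * (Moore's law, as cited in the text); aggregate system speed grows
 * faster because the number of CPUs per system also increases.
 * The time span and parallelism growth are assumptions. */
int main(void) {
    const double years = 25.0;            /* roughly 1980 -> 2005      */
    const double doubling = 2.0;          /* CPU doubling period (yrs) */
    double cpu_growth = pow(2.0, years / doubling);

    /* Assume parallelism grew from 1 CPU to ~5120 CPUs
     * (Earth Simulator class) over the same period. */
    double parallelism_growth = 5120.0;
    double system_growth = cpu_growth * parallelism_growth;

    printf("CPU speed growth over %.0f years: ~%.0fx\n", years, cpu_growth);
    printf("Aggregate system growth:          ~%.0fx\n", system_growth);
    return 0;
}
```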

The current main stream of system configuration for parallel processing is the cluster, or multi-node, configuration, in which a large number of SMP (Shared Memory Processor) nodes are connected by a high speed network such as a crossbar, fat-tree, or hypercube network.
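As a rough illustration of why the choice of network matters, the sketch below compares worst-case hop counts for the three topologies just named. The formulas are standard textbook properties of these networks, not figures from this talk, and the node count is an arbitrary example.

```c
#include <stdio.h>
#include <math.h>

/* Worst-case hop count (network diameter) between two of n nodes
 * for the interconnect topologies named in the text.
 * Standard textbook formulas; not specific to any NEC product. */
int main(void) {
    int n = 512;                        /* example node count          */
    double crossbar = 1;                /* any-to-any, a single hop    */
    double hypercube = log2(n);         /* one hop per dimension       */
    double fat_tree = 2 * log2(n);      /* up and back down a binary
                                           fat-tree, roughly           */
    printf("n = %d nodes\n", n);
    printf("crossbar : %.0f hop\n", crossbar);
    printf("hypercube: %.0f hops\n", hypercube);
    printf("fat-tree : up to %.0f hops\n", fat_tree);
    return 0;
}
```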

2. Capacity computing and capability computing

There are two types of computing in the usage of large scale computer systems, called capacity computing and capability computing, as shown in Fig.3.

Fig.3. Capacity Computing and Capability Computing (capacity computing goals: workload and throughput, many jobs per unit time, many small problems, on parallel or cluster machines based on microprocessors; capability computing goals: high speed execution of a single large and critical, grand challenge, problem, using powerful processors and a high bandwidth network).

The major goal of capacity computing is to obtain higher system throughput, with many relatively small jobs executed simultaneously. Capability computing, on the other hand, aims at the high speed execution of an extremely large job, sometimes within a limited time period, as in weather forecasting. COTS (commercial-off-the-shelf) based cluster systems are the best fit for capacity computing, but a powerful processor such as a vector processor, and a high bandwidth network for the cluster configuration, are the critical key technologies for capability computing.
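One way to see why commodity clusters serve capacity computing well while capability computing demands faster processors and networks is the following sketch, in the spirit of Amdahl's law. The serial-plus-communication fraction and node counts are illustrative assumptions, not measurements from the talk.

```c
#include <stdio.h>

/* Capacity computing: n independent jobs scale trivially, so
 * throughput grows linearly with node count. Capability computing:
 * one large job is limited by its non-parallelizable (serial plus
 * communication) fraction s, per Amdahl's law. s is an assumption. */
int main(void) {
    double s = 0.02;   /* assumed serial + communication fraction */
    for (int n = 1; n <= 1024; n *= 4) {
        double throughput = (double)n;              /* jobs per unit time */
        double speedup = 1.0 / (s + (1.0 - s) / n); /* one large job      */
        printf("%5d nodes: throughput x%-6.0f single-job speedup x%.1f\n",
               n, throughput, speedup);
    }
    return 0;
}
```

With a 2% non-parallel fraction, 1024 commodity nodes deliver 1024x the throughput but under 50x on a single large job, which is why capability computing pushes toward more powerful individual processors and lower-overhead networks.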

3. Divergence problem

Particularly in capability computing, there is an issue called the "divergence problem" (the increasing gap between the theoretical peak performance and the sustained system performance), as illustrated in Fig.4 for a major high-end computing center.

Fig.4. Divergence Problem.

This performance gap has been driven mainly by the imbalance between processor speed and memory bandwidth, and it is critical because it is the sustained system performance, not the peak performance, that is actually usable by applications. The keys to realizing highly efficient capability computing that closes the gap between peak and sustained performance are to set appropriately the system parameters that limit system performance, such as CPU performance, memory bandwidth, and network speed, and thereby to realize a well balanced system, as shown in Fig.5.

Fig.5. Highly Efficient Capability Computing (a powerful CPU, a high bandwidth memory interface, and a low latency, high throughput network connecting the nodes).
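A minimal sketch of the balance argument, using the SX-8 node figures quoted later in this text (512 GB/s of memory bandwidth against 128 GFLOPS of peak speed) and a simple memory-bound kernel; the kernel choice and its traffic model are illustrative assumptions, not part of the talk.

```c
#include <stdio.h>

/* Why sustained performance diverges from peak: a memory-bound kernel
 * is limited by the bytes/flop the memory system can deliver, not by
 * the CPU's peak rate. Node figures are the SX-8 numbers quoted in
 * the text; the DAXPY traffic model is a standard illustration. */
int main(void) {
    double peak_gflops = 128.0;   /* SX-8 node peak (from the text) */
    double mem_gbs     = 512.0;   /* SX-8 node memory bandwidth     */
    double machine_balance = mem_gbs / peak_gflops;   /* bytes/flop */

    /* DAXPY y[i] += a*x[i]: 2 flops per 24 bytes moved
     * (load x, load y, store y, 8 bytes each). */
    double kernel_bytes_per_flop = 24.0 / 2.0;
    double sustained = mem_gbs / kernel_bytes_per_flop;

    printf("machine balance : %.1f bytes/flop\n", machine_balance);
    printf("DAXPY sustained : %.1f GFLOPS (%.0f%% of peak)\n",
           sustained, 100.0 * sustained / peak_gflops);
    return 0;
}
```

Even a machine with 4 bytes/flop sustains only about a third of peak on this kernel; a node with, say, a few GB/s of memory bandwidth would sustain well under 1 GFLOPS on it regardless of CPU peak, which is exactly the divergence the figure illustrates.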

4. NEC's HPC product line and SX-8 for capability computing

NEC now supports a variety of computing systems, from PC clusters to high-end vector supercomputers, as the application areas for high performance computing have recently diversified and expanded from traditional applications such as structural analysis and fluid dynamics to bio science and even social systems such as financial applications. These computing systems include IA-32 based PC clusters and IPF blade servers for capacity computing, and powerful scalable SMP servers based on IA-64 together with the large scale vector supercomputer SX series for capability computing, meeting this wide variety of application and performance needs, as shown in Fig.6.

Fig.6. NEC's HPC Products (vector supercomputers: the SX series and the compact vector server SX-6i; scalar servers: the IPF-based TX7 series and Express5800/1020Ba IPF blade servers; PC clusters: IA server cluster / IA blade server; IA workstations: the Express5800/50 series; and the GFS, Global File System, with iStorage, EMC and Hitachi storage connected through FC switches).

The supercomputer SX-8, announced in 2004, is the latest vector supercomputer, with a system peak speed of 65 TFLOPS, positioned as the high-end system for capability computing, as shown in Fig.7.

Fig.7. Multi Node System of SX-8 (single node performance: max 128 GFLOPS, 2x SX-6; maximum number of nodes: 512, 4x SX-6; peak inter-node data transfer rate: max 8 TB/s, 8x SX-6; high speed inter-node crossbar switch, IXS, at 16 GB/s x 2 per node over optical interconnections).

One of the design objectives of SX-8 was to achieve a highly efficient system in performance and cost at all points, including floor space and electrical power as well as the highest performance, by using the latest semiconductor technology and high density packaging. A single node of SX-8 consists of up to 8 CPUs, each of which has a peak speed of 16 GFLOPS with four logical vector pipelines clocked at 2 GHz; a full node reaches a maximum speed of 128 GFLOPS, with 128 GBytes of shared memory and 512 GBytes/sec of memory bandwidth, to achieve the highest efficiency in performance. The multi-node system of SX-8 can be configured with up to 512 nodes connected by the high speed, non-blocking crossbar switch called IXS, which uses optical fibers to support ultra large configurations and grand challenge problems. Highly efficient parallel processing is realized by the IXS's high speed data transfer of two 16 GBytes/sec links per node. The SX-8 currently achieves almost twice the speed of the former model SX-6 and of the Earth Simulator on actual applications, in one quarter of the floor space and with half the electricity of the SX-6.
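The headline figures above are mutually consistent, as this small check shows. The rate of 2 floating-point operations per cycle per pipeline is inferred from the quoted 16 GFLOPS at 2 GHz with four pipelines, not stated in the text, and which direction of the per-node links the quoted 8 TB/s aggregates is likewise my inference.

```c
#include <stdio.h>

/* Consistency check of the SX-8 figures quoted in the text.
 * The 2 flops/cycle/pipeline rate is inferred, not stated. */
int main(void) {
    double ghz = 2.0, pipes = 4.0, flops_per_cycle = 2.0; /* inferred */
    double cpu_gflops  = ghz * pipes * flops_per_cycle;   /* 16       */
    double node_gflops = 8 * cpu_gflops;                  /* 128      */
    double sys_tflops  = 512 * node_gflops / 1000.0;      /* ~65.5    */
    double ixs_tbs     = 512 * 16.0 / 1000.0;  /* ~8.2 TB/s: one
        16 GB/s link direction per node aggregated over 512 nodes */

    printf("CPU peak     : %.0f GFLOPS\n", cpu_gflops);
    printf("Node peak    : %.0f GFLOPS\n", node_gflops);
    printf("System peak  : %.1f TFLOPS\n", sys_tflops);
    printf("IXS aggregate: %.1f TB/s\n", ixs_tbs);
    return 0;
}
```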

5. Challenges in hardware

The basic hardware technologies for highly efficient and ultra high speed capability computing systems are semiconductor and interconnection technologies.

Fig.8. Road Map of Semiconductors (ITRS 2003: technology node as DRAM half pitch in nm, millions of transistors per chip, and Gbits per memory chip, for 2004-2016).

Fig.8 is the road map of semiconductors from ITRS (the International Technology Roadmap for Semiconductors) 2003, which shows that the number of transistors on a chip will exceed five hundred million, and that memory capacity will reach several gigabits per chip, at the 45 nm technology node in 2010. There are, however, many technological challenges on the way to realizing this road map. One of the most serious is the increase of chip power, caused by increases in both operating power and leakage current, together with the problem of cooling the device. To overcome this, several technologies are now being developed, such as new materials and power control techniques at the system level. Interconnection technology is a key to achieving the high memory bandwidth and high speed network needed to configure cluster systems, but electrical interfaces are reaching the limit of high speed signal transmission. Optical technology is at present considered the most promising replacement for electrical interconnections.
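The power problem can be made concrete with the standard CMOS dynamic power relation P = aCV^2f. This equation is textbook background rather than something stated in the talk, and the component values below are illustrative assumptions; the point is that if voltage must rise with frequency, power grows far faster than performance, which is what motivates the system-level power control techniques mentioned above.

```c
#include <stdio.h>

/* Standard CMOS dynamic power model P = a*C*V^2*f (textbook
 * background, not from the talk). Illustrative values show why pure
 * frequency scaling became unattractive: when voltage scales with
 * frequency, power grows roughly with the cube of the speedup. */
int main(void) {
    double a = 0.2, c = 1.0e-9;           /* activity, capacitance (F) */
    double base_v = 1.0, base_f = 1.0e9;  /* 1.0 V at 1 GHz            */
    double base_p = a * c * base_v * base_v * base_f;
    for (double scale = 1.0; scale <= 2.0; scale += 0.5) {
        double v = base_v * scale;        /* assume V scales with f    */
        double f = base_f * scale;
        double p = a * c * v * v * f;     /* dynamic power (W)         */
        printf("%.1f GHz at %.1f V: %.2f W (%.1fx speed, %.1fx power)\n",
               f / 1e9, v, p, scale, p / base_p);
    }
    return 0;
}
```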

6. Challenges in software

There are many challenges in software as well for future highly parallel processing systems. Resource management and scheduling are the main functions for using system resources efficiently and achieving high scalability in an environment of thousands of processors and terabytes of main memory. Data management capability is also a key component of the operating system, needed to handle the input and output of huge amounts of data efficiently. Reliability features must be considered carefully to sustain system operation on ultra large scale configurations, particularly for mission critical systems such as weather forecasting. Support for the development environment, including compilers and debugging and tuning tools for highly parallel processing systems, is extremely important for reducing time-to-solution, that is, for minimizing the TCO from application development through to obtaining the final results. Grid technology is another of the software challenges: it is not a technology for increasing the processing power of computer systems, but one for promoting the accessibility and usability of the various computing systems and distributed data connected through high speed networks.

7. Conclusion

As CMOS semiconductor technologies approach their technological limits, we face many challenges to overcome. But high performance computing has always been, and will continue to be, a driving force for the advancement of information technologies. If we make these technological breakthroughs, we will be able to open up a new era of IT as well as of high performance computing.