Parallelization: Hardware Architecture
Delft University of Technology, Challenge the future

Contents
• Introduction
• Classification of systems
• Topology
• Clusters and Grid
• Fun Hardware
(Figure: classification of processor designs: scalar, superscalar, vector, parallel, hybrid)

Why Parallel Computing
Primary reasons:
• Save time
• Solve larger problems
• Provide concurrency (do multiple things at the same time)

Classification of HPC Hardware
• Architecture
• Memory organization
1st Classification: Architecture
• There are several different methods used to classify computers
• No single taxonomy fits all designs
• Flynn's taxonomy uses the relationship of program instructions to program data:
  • SISD - Single Instruction, Single Data Stream
  • SIMD - Single Instruction, Multiple Data Stream
  • MISD - Multiple Instruction, Single Data Stream
  • MIMD - Multiple Instruction, Multiple Data Stream
Flynn’s Taxonomy
• SISD: single instruction and single data stream: uniprocessor
• SIMD: vector architectures: lower flexibility
• MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
• MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility
SISD
• One instruction stream
• One data stream
• One instruction issued on each clock cycle
• One instruction executed on a single element (or elements) of data (scalar) at a time
• Traditional ‘von Neumann’ architecture (remember from the introduction)
SIMD
• Also von Neumann architectures but more powerful instructions
• Each instruction may operate on more than one data element
• Usually intermediate host executes program logic and broadcasts instructions to other processors
• Synchronous (lockstep)
• Rating how fast these machines can issue instructions is not a good measure of their performance
• Two major types:
  • Vector SIMD
  • Parallel SIMD
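The SIMD execution model (one instruction applied to every element of the data) can be sketched in plain Python. This is a conceptual model only: `simd_apply` is an invented name, and real hardware performs the whole update in a single instruction rather than a loop.

```python
# Conceptual sketch of SIMD execution: one "instruction" (an operation)
# is broadcast and applied to every element of the data in lockstep.
def simd_apply(op, xs, ys):
    """Apply a single binary operation across whole vectors at once."""
    return [op(x, y) for x, y in zip(xs, ys)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# One logical instruction, four element-wise results.
result = simd_apply(lambda x, y: x + y, a, b)
print(result)  # [11.0, 22.0, 33.0, 44.0]
```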
Vector SIMD
• A single instruction results in multiple operands being updated
• Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time
• Examples: SSE instructions, NEC SX-9, Fujitsu VP, Hitachi S820
Parallel SIMD
• Several processors execute the same instruction in lockstep
• Each processor modifies a different element of data
• Drawback: idle processors
• Advantage: no explicit synchronization required
• Examples: GPGPUs, the Cell processor
MIMD
• Several processors executing different instructions on different data
• Advantages:
  • different jobs can be performed at the same time
  • better utilization can be achieved
• Drawbacks:
  • explicit synchronization needed
  • difficult to program
• Examples:
  • MIMD via parallel SISD machines: all clusters, Cray XE6, IBM Blue Gene, SGI Altix
  • MIMD via parallel SIMD machines: NEC SX-8, Convex (old), Cray X2
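The "explicit synchronization" drawback listed above can be made concrete with a small shared-memory sketch. Python threads stand in for MIMD processors here, and `worker` is an invented helper: without the lock, the read-modify-write on the shared counter could interleave across threads and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        # Without the lock, the read-modify-write below could interleave
        # with another thread's update and lose increments.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: correct only because the updates were synchronized
```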
2nd Classification: Memory Organization
• Shared memory (SMP)
  • UMA
  • NUMA
  • CC-NUMA
• Distributed memory
Memory Organization
Symmetric shared-memory multiprocessor (SMP)
• Implementations:
• Multiple processors connected to a single centralized memory; since all processors see the same memory organization, access is uniform: uniform memory access (UMA)
• Memory physically distributed but still shared: all processors can access the entire memory address space through a tight interconnect between compute/memory nodes, with varying access times: non-uniform memory access (NUMA)
UMA (Uniform Memory Access)
(Diagram: processors P1 … Pn connected through an interconnect to memories M1 … Mn)

NUMA (Non-Uniform Memory Access)
(Diagram: each processing element PEi pairs a processor Pi with a local memory Mi; the PEs communicate over the interconnect)

CC-NUMA (Cache-Coherent NUMA)
(Diagram: as NUMA, but each processor Pi has a cache Ci between it and its local memory Mi)

Distributed Memory
(Diagram: each node couples a processor Pi with a private memory Mi; nodes communicate only via the interconnect)

Hybrid Distributed Memory
(Diagram: nodes in which several processors share a cache and memory, joined by a global interconnect)

The largest and fastest computers in the world today employ both shared and distributed memory architectures.

Memory Architecture: Logical View vs. Physical View
(Table: memory architecture by logical and physical view)

                            Logical: shared    Logical: distributed
Physical: shared            UMA                -
Physical: distributed       NUMA               Distributed memory

Scalability increases toward the distributed designs; "easy" programming favors the shared ones.
Shared-Memory vs. Distributed-Memory
Shared-memory:
• Well-understood programming model
• Communication is implicit, and the hardware handles protection
• Hardware-controlled caching
• OpenMP and MPI
Distributed-memory:
• No cache coherence → simpler hardware
• Explicit communication → easier for the programmer to restructure code
• The sender can initiate data transfer
• MPI, PGAS
Hardware Implementation (MIMD)
• Shared memory
• Distributed memory
A Shared Memory Computer

(Diagram: CPUs connected through an interconnect to a single memory; data movement is transparent to the programmer)

SMP = Symmetric Multi-Processor. Note that the CPU can be a multi-core chip.
Issues for MIMD Shared Memory
• Memory access
  • Can reads be simultaneous?
  • How to control multiple writes?
• Synchronization mechanism needed
  • semaphores
  • monitors
• Local caches need to be coordinated
  • cache coherency protocols

Cache coherency ensures that one always gets the right value, regardless of where the data is.

(Diagram: the same variable is present in multiple places: in memory and in one or more CPU caches)
Write policies:
• Write-through: simple, but wastes memory bandwidth
• Write-back: minimizes bandwidth, but takes extra logic
Cache Coherence Protocols
• Directory-based: A single location (directory) keeps track of the sharing status of a block of memory
• Snooping: Every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
• Write-invalidate: a processor gains exclusive access to a block before writing, by invalidating all other copies
• Write-update: when a processor writes, it updates the other shared copies of that block
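The two write policies can be sketched with a toy bookkeeping model. `ToyCaches` is an invented class that only tracks which caches hold a valid copy of one shared block; it models the protocol's effect on the copies, not real bus hardware.

```python
class ToyCaches:
    """Each of n caches holds a private copy of one shared block."""
    def __init__(self, n, value=0):
        self.copies = [value] * n  # copies[i] is cache i's view (None = invalid)

    def write_invalidate(self, writer, value):
        # Writer gains exclusive access: all other copies are invalidated.
        for i in range(len(self.copies)):
            self.copies[i] = None
        self.copies[writer] = value

    def write_update(self, writer, value):
        # Writer broadcasts the new value: all valid copies are refreshed.
        for i in range(len(self.copies)):
            if self.copies[i] is not None:
                self.copies[i] = value

inv = ToyCaches(4)
inv.write_invalidate(0, 42)
print(inv.copies)   # [42, None, None, None]

upd = ToyCaches(4)
upd.write_update(0, 42)
print(upd.copies)   # [42, 42, 42, 42]
```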
Cache Coherence - Snooping
(Diagram: each CPU's cache sits between the CPU and the shared data and address buses to memory)

With a snooping protocol, ALL address traffic on the bus is monitored by ALL processors.
MIMD - Distributed Memory
• Connection network
  • fast
  • high bandwidth
  • scalable
• Communications
  • explicit message passing
  • parallel languages: Unified Parallel C, Co-Array Fortran, HPF
  • libraries for sequential languages: MPI, PVM, Java with CSP

A Distributed Memory Computer

(Diagram: CPUs with private memories, connected by an interconnect; the system is programmed using message passing: pack, send, move, receive, unpack)
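The pack/send/receive/unpack style can be mimicked with operating-system processes and a pipe. Python's `multiprocessing` stands in for a real message-passing library such as MPI here, and the POSIX `fork` start method is assumed; `worker` and `main` are invented names for this sketch.

```python
from multiprocessing import Pipe, get_context

def worker(conn):
    data = conn.recv()        # receive + unpack the incoming message
    conn.send(sum(data))      # pack + send the result back
    conn.close()

def main():
    ctx = get_context("fork")  # POSIX-only; one OS process per "node"
    parent, child = Pipe()
    p = ctx.Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])  # explicit message passing: no shared memory
    result = parent.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(main())              # 10
```

The two processes share no variables; the only way data moves is through explicit send/receive calls, which is exactly the discipline a distributed-memory machine imposes.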
Hybrid: MIMD with Shared Memory Nodes
(Diagram: several shared-memory nodes, each with its own memory and local interconnect, joined by a global interconnect over which nodes send and receive messages)

And now imagine a multi-core chip at the lowest level.
Interconnection Network
• Speed and Bandwidth are critical
• Low-cost networks
  • local area network (Ethernet, Token Ring)
• High-speed networks
  • the heart of a MIMD-DM parallel machine

Issues for Networks
• Total bandwidth: the amount of data which can be moved from somewhere to somewhere per unit time
• Link bandwidth: the amount of data which can be moved along one link per unit time
• Message latency: the time from the start of sending a message until it is received
• Bisection bandwidth: the amount of data which can move from one half of the network to the other per unit time, for the worst-case split of the network

Design Characteristics of a Network
• Topology (how things are connected): crossbar, ring, 2-D and 3-D meshes or tori, hypercube, tree, butterfly, ...
• Routing algorithm (path used): example in a 2D torus: all east-west, then all north-south
• Switching strategy:
  • Circuit switching: full path reserved for the entire message, like the telephone
  • Packet switching: message broken into separately-routed packets, like the post office
• Flow control (what if there is congestion): stall, store data in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, ...

Performance Properties of a Network: Latency
• Latency: the delay between send and receive times
• Latency tends to vary widely across architectures
• Vendors often report hardware latencies (wire time)
• Application programmers care about software latencies (user program to user program)
• Latency is important for programs with many small messages

Performance Properties of a Network: Bandwidth

• The bandwidth of a link = w * 1/t, where w is the number of wires and t is the time per bit
• Bandwidth is typically measured in GigaBytes (GB), i.e., 8 * 2^30 bits
• Effective bandwidth is usually lower than the physical link bandwidth, due to packet overhead
• Bandwidth is important for applications with mostly large payload messages

(Diagram: a packet consists of a header with routing and control information, the data payload, and a trailer carrying an error code)
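Latency and bandwidth are often combined in the first-order cost model T(n) = latency + n/bandwidth for an n-byte message. A small sketch; the numbers are made-up illustrative values, not measurements of any real network.

```python
def message_time(nbytes, latency_s, bandwidth_Bps):
    """First-order model: fixed startup cost plus transfer time."""
    return latency_s + nbytes / bandwidth_Bps

# Made-up example network: 1 microsecond latency, 10 GB/s links.
lat, bw = 1e-6, 10e9

# Small messages are dominated by latency ...
t_small = message_time(8, lat, bw)
# ... large messages by bandwidth.
t_large = message_time(100_000_000, lat, bw)

print(t_small > 0.99e-6)   # True: roughly the 1 us startup latency
print(t_large > 9e-3)      # True: roughly 10 ms of pure transfer time
```

This is why many small messages hurt latency-bound codes, while bulk transfers are limited by link bandwidth.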
Common Network Topologies
Linear and Ring Topologies
• Linear array
  • Diameter = n-1; average distance ~ n/3
  • Bisection bandwidth = 1 (in units of link bandwidth)
• Torus or ring
  • Diameter = n/2; average distance ~ n/4
  • Bisection bandwidth = 2
  • Natural for algorithms that work with 1D arrays

Meshes and Tori

• Two-dimensional mesh
  • Diameter = 2 * (sqrt(n) - 1)
  • Bisection bandwidth = sqrt(n)
• Two-dimensional torus
  • Diameter = sqrt(n)
  • Bisection bandwidth = 2 * sqrt(n)
• Generalises to higher dimensions (the Cray XT5 used a 3D torus)
• Natural for algorithms that work with 2D and/or 3D arrays

Hypercube
(Diagram: hypercubes of dimension one through four)

• A d-dimensional hypercube has n = 2^d nodes
• Diameter = d = log n; bisection bandwidth = n/2

Trees

• Diameter = log n
• Bisection bandwidth = 1
• Easy layout as a planar graph
• Many tree algorithms (e.g., summation)
• Fat trees avoid the bisection bandwidth problem: more (or wider) links near the top

Butterflies, with n = (k+1) * 2^k nodes

• Diameter = 2k
• Bisection bandwidth = 2^k
• Cost: lots of wires
• Used in the BBN Butterfly
• Natural for FFT
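The diameter and bisection figures quoted in the preceding slides are easy to tabulate as functions of the node count n. This sketch simply encodes the slides' formulas (assuming n is a perfect square for the 2D networks and a power of two for the hypercube) and adds the standard hypercube hop-count rule: the distance between two nodes is the Hamming distance of their d-bit labels.

```python
import math

# Diameter and bisection bandwidth (in units of link bandwidth)
# for a network of n nodes, per the formulas on the slides.
def linear_array(n):
    return {"diameter": n - 1, "bisection": 1}

def ring(n):
    return {"diameter": n // 2, "bisection": 2}

def mesh_2d(n):                      # assumes n is a perfect square
    s = math.isqrt(n)
    return {"diameter": 2 * (s - 1), "bisection": s}

def torus_2d(n):                     # assumes n is a perfect square
    s = math.isqrt(n)
    return {"diameter": s, "bisection": 2 * s}

def hypercube(n):                    # assumes n is a power of two
    d = n.bit_length() - 1
    return {"diameter": d, "bisection": n // 2}

# Hypercube routing: hop count = Hamming distance of the node labels.
def hypercube_hops(a, b):
    return bin(a ^ b).count("1")

for name, f in [("linear", linear_array), ("ring", ring),
                ("2D mesh", mesh_2d), ("2D torus", torus_2d),
                ("hypercube", hypercube)]:
    print(name, f(64))

print(hypercube_hops(0b000, 0b111))  # 3 hops across a 3-D hypercube
```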
(Diagram: a butterfly switch with inputs and outputs labeled 0 and 1, and a multistage butterfly network built from such switches)

Topologies in Real Machines
(Newer to older:)
• Cray XC30: Dragonfly
• Cray XE series: 3D torus
• Blue Gene/L and /P: 3D torus
• SGI Altix: fat tree
• Cray X1: 4D hypercube*
• Myricom (Millennium): arbitrary
• Quadrics: fat tree
• IBM SP: fat tree (approx.)
• SGI Origin: hypercube
• Intel Paragon (old): 2D mesh

(* Many of these are approximations: e.g., the X1 is really a "quad-bristled hypercube", and some of the fat trees are not as fat as they should be at the top.)

MIMD - Clusters
Cluster
• A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone (complete) computers that co-operatively work together as a single, integrated computing resource.
(Diagram: CPUs, each with a local memory (LM), connected by a network to form a cluster)
Construction of a Beowulf Cluster

(Diagram: computing nodes Node 1 … Node 16 attached to a Fast Ethernet switch; a service node connects the cluster to the LAN; link speeds of 1 Gb/s and 10 Gb/s)
Beyond a Cluster: Grid
Computational Grids
• A network of geographically distributed resources including computers, peripherals, switches, instruments, and data.
• Each user should have a single login account to access all resources.
• Resources may be owned by diverse organisations.
GRID vs. Cluster
• Cluster: a computer network typically dedicated 100% to executing a specific task
• GRID: computer networks distributed planet-wide that can be shared by means of resource-management software
Cluster Computing vs. Others
Distance between nodes:
• SM parallel computing: a chip, a rack
• Cluster computing: a room
• Distributed computing: a building
• Grid computing: the world
Some Specialized HPC Hardware
General Design Philosophy (Cray XD1)

• Compute: 12 AMD Opteron 32/64-bit x86 processors; High Performance Linux
• RapidArray interconnect: 12 communications processors; 1 Tb/s switch fabric; processors directly connected via the integrated switch fabric
• Active management: dedicated processor
• Application acceleration: 6 co-processors

Balanced Interconnect

(Figure: memory, processor, and I/O bandwidths in GB/s for a Xeon server (5.3 GB/s DDR 333 memory; 1 GB/s PCI-X; 0.25 GB/s GigE), the Cray XD1 (6.4 GB/s DDR 400 memory; 8 GB/s RapidArray), and the Cray XT4 (6.4 GB/s DDR 400 memory; 31 GB/s SeaStar interconnect))
Cray XT5 Node

• 8-way SMP, >70 Gflops per node
• 2 - 32 GB of shared memory per node
• OpenMP support
• 6.4 GB/s direct-connect HyperTransport
• 25.6 GB/s direct-connect memory
• Cray SeaStar2+ interconnect, 9.6 GB/s links
(Figure: liquid-cooled cabinet with R134a piping, inlet and exit evaporators, and high- and low-velocity airflow paths)
Cray XE6 Node

(Diagram: node with HT3 links and interconnect torus links in X, Y, and Z)

Cray XK Node (GPU-accelerated)

(Diagram: NVIDIA Tesla GPU with 6 GB GDDR5, 665 GF double-precision floating point, and 138 GB/s memory bandwidth, attached via PCIe Gen2; host processor with HT3 links, 1600 MHz DDR3 memory (16, 32 or 64 GB), and the Cray Gemini high-speed interconnect)
Cray XC30 Interconnect

• Chassis / Rank-1: backplane, 15 links at 14 Gbps
• 6-chassis group / Rank-2: copper cables, 15 links at 14 Gbps
• Intra-group / Rank-3: optic cables, 10 links at 12.5 Gbps
• Aries: 48-port router; 4 NICs, 2 router tiles each; 40 router tiles for the interconnect
• Host interface: PCIe-3, 16 bits at 8.0 GT/s per direction; dual QPI SMP links; 4 channels of DDR3

Cray XC30 Rank-1 Network
• Chassis with 16 compute blades
• 128 sockets
• Inter-Aries communication over the backplane
• Per-packet adaptive routing
Cray XC30 Rank-2 Network

• 2-cabinet group: 768 sockets
• 6 backplanes connected with copper cables in a 2-cabinet group: the "Black Network"
• 16 Aries connected by the backplane: the "Green Network"
• Active optical cables interconnect groups: the "Blue Network"
• 4 nodes connect to a single Aries

Cray XC30 Routing
• Minimal routes between any two nodes in a group are just two hops
• Non-minimal routes require up to four hops
• With adaptive routing, the network selects between minimal and non-minimal paths based on load
• The Cray XC30 class-2 group has sufficient bandwidth to support the full injection rate for all 384 nodes, even with non-minimal routing
Cray XC30 Network Overview: Rank-3 Network

(Diagram: Group 0 through Group 3, fully connected by optical bundles)

• Example: a 4-group system is interconnected with 6 optical "bundles"; the bundles can be configured between 20 and 80 cables wide
Integration of HPC and Big Data
HLRS Stuttgart
• video of building a computer
NEC SX-8

(Excerpt from the NEC SX-8 brochure:) The vector unit is the most important part of the SX-8 CPU. It consists of a set of vector registers and arithmetic execution pipes for addition, multiplication, division, and square root. The hardware square-root vector pipe is the latest addition to the SX-8 CPU architecture and is only available on the SX-8. Each of the arithmetic pipes is four-way parallel, i.e., it can produce four results per cycle. A high-speed load/store pipe connects the vector unit to the main memory. For the first time in the SX series, processors and main memory unit are integrated on a single printed circuit board; combined with low-loss substrates enabling high-efficiency transfers and high-density packaging, this substantially reduces space requirements and power consumption.

NEC SX-ACE

(Excerpt from the NEC SX-ACE brochure:) The SX-ACE targets high sustained performance and efficiency to address the increasing needs of science and engineering. To overcome the "memory wall" and "power wall" of modern computer systems, NEC sticks to high single-core performance and leading technology for high memory bandwidth. The 4-core vector processor is fully integrated with the memory and network interface controllers on a single LSI, implemented in 28nm CMOS with 11-layer copper wiring, achieving 256 GFLOPS of peak performance and 256 GB/s of memory bandwidth per processor (64 GFLOPS and 64 GB/s per core). Each core has a large vector cache (ADB, Assignable Data Buffer) designed for supercomputing usage. Nodes are connected by a highly scalable proprietary interconnect, the IXS: a full non-blocking fat tree with a link bandwidth of 8 GB/s per direction. Compared to the SX-9, an SX-ACE configuration with the same performance uses a factor of 100 fewer LSIs, a factor of 10 less electricity, and a factor of 5 less floor space. RAS features include ECC memory, error-detection functions on the processor LSI, and built-in diagnosis (BID) with automatic logging to the NEC service center.

IBM Blue Gene/P

• System: up to 72 racks, cabled 8x8x16; 1 PF/s; 144 or 288 TB
• Rack: 32 node cards; 14 TF/s; 2 or 4 TB
• Node card: 32 chips (4x4x2), 32 compute plus up to 4 I/O cards; 435 GF/s; 64 or 128 GB
• Compute card: 1 chip (1x1x1); 13.6 GF/s; 2 or 4 GB DDR
• Chip: 4 processors; 13.6 GF/s; 8 MB EDRAM

IBM Blue Gene/Q

Interconnect Architecture