Parallelization: Hardware Architecture

Delft University of Technology - Challenge the future

Contents

• Introduction • Classification of systems • Topology • Clusters and Grid • Fun Hardware

2 [Figure: classes of architectures: scalar, superscalar, vector, parallel, hybrid]

3 Why Parallel Computing

Primary reasons:

• Save time

• Solve larger problems

• Provide concurrency (do multiple things at the same time)

Classification of HPC hardware

• Architecture

• Memory organization

5 1st Classification: Architecture

• There are several different methods used to classify computers

• No single taxonomy fits all designs

• Flynn's taxonomy uses the relationship of program instructions to program data:
  • SISD - Single Instruction, Single Data stream
  • SIMD - Single Instruction, Multiple Data streams
  • MISD - Multiple Instruction, Single Data stream
  • MIMD - Multiple Instruction, Multiple Data streams

6 Flynn’s Taxonomy

• SISD: single instruction and single data stream: uniprocessor

• SIMD: vector architectures: lower flexibility

• MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines

• MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility

7 SISD

• One instruction stream
• One data stream
• One instruction issued on each clock cycle
• One instruction executed on a single element (or elements) of data (scalar) at a time
• Traditional 'von Neumann' architecture (remember from the introduction)

8 SIMD

• Also von Neumann architectures but more powerful instructions

• Each instruction may operate on more than one data element

• Usually intermediate host executes program logic and broadcasts instructions to other processors

• Synchronous (lockstep)

• Rating how fast these machines can issue instructions is not a good measure of their performance

• Two major types:
  • Vector SIMD
  • Parallel SIMD

9 Vector SIMD

• A single instruction results in multiple operands being updated
• Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time

• Examples:
  • SSE instructions
  • NEC SX-9
  • VP
  • Hitachi S820
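To make the "one instruction, many operands" idea concrete, here is a minimal C sketch contrasting a scalar loop with its SSE equivalent, where a single ADDPS instruction updates four floats at once. It assumes an x86 CPU with SSE and compilation with something like gcc -msse; it is an illustration, not code from the lecture.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics (x86 only) */

/* Scalar version: one addition per instruction. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: one instruction produces four results. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 operands   */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   /* 4 adds at once    */
    }
    for (; i < n; i++)                              /* scalar remainder  */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    add_sse(a, b, c, 8);
    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);   /* prints 9 eight times */
    printf("\n");
    return 0;
}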

10 Vector SIMD

11 Parallel SIMD

• Several processors execute the same instruction in lockstep

• Each modifies a different element of data

• Drawback: idle processors

• Advantage: no explicit synchronization required

• Examples

  • GPGPUs
  • Cell

12 Parallel SIMD

13 MIMD

• Several processors executing different instructions on different data

• Advantages:
  • Different jobs can be performed at the same time
  • Better utilization can be achieved

• Drawbacks:
  • Explicit synchronization needed
  • Difficult to program

• Examples:
  • MIMD accomplished via parallel SISD machines: all clusters, XE6, IBM Blue Gene, SGI Altix
  • MIMD accomplished via parallel SIMD machines: NEC SX-8, Convex (old), Cray X2

16 2nd Classification: Memory organization

• Shared memory (SMP)
  • UMA
  • NUMA
  • CC-NUMA

• Distributed memory

17 Memory Organization

Symmetric shared-memory multiprocessor (SMP)

• Implementations:

• Multiple processors connected to a single centralized memory – since all processors see the same memory organization -> uniform memory access (UMA)

• Shared memory because all processors can access the entire memory address space through a tight interconnect between compute/memory nodes; access times depend on where the data lives, hence non-uniform memory access (NUMA)

18 UMA (Uniform Memory Access)

[Diagram: processing elements PE1..PEn, each with a processor P1..Pn, all reaching memories M1..Mn through a single interconnect, so every memory is equally distant from every processor]

19 NUMA (Non Uniform Memory Access)

[Diagram: each processing element pairs a processor Pi with a local memory Mi; remote memories are reached over the interconnect, so access time depends on where the data lives]

20 CC-NUMA (Cache Coherent NUMA)

[Diagram: as NUMA, but each processor Pi also has a cache Ci in front of its local memory Mi; the caches are kept coherent across the interconnect]

21 Distributed memory

[Diagram: independent nodes, each with its own processor Pi and private memory Mi, connected only through the interconnect]

22 Hybrid Distributed memory

[Diagram: each node is itself a small shared-memory multiprocessor (several processors behind a cache Ci and memory Mi); the nodes are connected by an interconnect, giving distributed memory between nodes]

The largest and fastest computers in the world today employ both shared and distributed memory architectures.

23 Memory architecture

• Logically shared, physically shared: UMA
• Logically shared, physically distributed: NUMA
• Logically distributed, physically distributed: distributed memory (message passing)
• Scalability increases toward the physically distributed designs; "easy" programming lies with the logically shared ones

24 Shared-Memory vs. Distributed-Memory

Shared-memory:
• Well-understood programming model
• Communication is implicit and hardware handles protection
• Hardware-controlled caching
• OpenMP and MPI

Distributed-memory:
• No cache coherence → simpler hardware
• Explicit communication → easier for the programmer to restructure code
• Sender can initiate data transfer
• MPI, PGAS
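As a small sketch of the shared-memory style (OpenMP, compile with e.g. gcc -fopenmp): every thread reads the same arrays directly, communication is implicit, and the only coordination the programmer writes is the reduction clause. This example is illustrative and not taken from the slides.

#include <stdio.h>

/* All threads read the shared arrays; the runtime combines the partial sums. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {1, 1, 1, 1};
    printf("%f\n", dot(a, b, 4));   /* prints 10.000000 */
    return 0;
}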

25 Hardware implementation (MIMD)

• Shared memory

• Distributed memory

26 A Shared Memory Computer

• Data movement is transparent to the programmer
• [Diagram: several CPUs connected through an interconnect to one memory; the interconnect moves the data]
• SMP = Symmetric Multi-Processor; note that each CPU can be a multi-core chip

27 Issues for MIMD distributed Shared Memory

• Memory Access • Can reads be simultaneous? • How to control multiple writes?

• Synchronization mechanism needed • semaphores • monitors

• Local caches need to be coordinated
  • cache coherency protocols: cache coherency ensures that one always gets the right value, regardless of where the data is

[Diagram: the same variable is present in multiple places at once: in memory and in several CPUs' caches]

29 Cache write policies

• Write-through: simple, but wastes memory bandwidth
• Write-back: minimizes bandwidth, takes extra logic

30 Cache Coherence Protocols

• Directory-based: A single location (directory) keeps track of the sharing status of a block of memory

• Snooping: Every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary

‣ Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
‣ Write-update: when a processor writes, it updates all other shared copies of that block
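One way to feel the cost of coherence traffic is false sharing: two threads updating independent counters that happen to sit in the same cache line force the line to bounce between caches under a write-invalidate protocol. The OpenMP sketch below is an illustration only; it assumes 64-byte cache lines, 8-byte longs, and compilation with gcc -fopenmp, and uses volatile so the compiler does not collapse the loops.

#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L

/* Two counters in one cache line: each write invalidates the other cache's copy. */
volatile long shared_line[2];

/* Padding pushes the counters into different cache lines (assumed 64 bytes). */
struct { volatile long v; char pad[56]; } padded[2];

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) shared_line[id]++;   /* line ping-pongs */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) padded[id].v++;      /* no sharing      */
    }
    double t2 = omp_get_wtime();
    printf("same cache line: %.2f s, padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}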

31 Cache Coherence - Snooping

[Diagram: CPUs with private caches attached to shared address and data buses in front of memory]

With a snooping protocol, ALL address traffic on the bus is monitored by ALL processors

32 MIMD - Distributed Memory

• Connection Network • fast • high bandwidth • scalable

• Communications
  • explicit message passing
  • parallel languages: Unified Parallel C, Co-Array Fortran, HPF
  • libraries for sequential languages: MPI, PVM, Java with CSP

A Distributed Memory Computer

• The system is programmed using message passing: the sender packs data into a message, the interconnect moves it, and the receiver unpacks it
• [Diagram: CPUs with private memories exchange data via send/receive over the interconnect]
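A minimal message-passing sketch in C with MPI (compile with mpicc, run with e.g. mpirun -np 2). Unlike the shared-memory case, nothing is shared implicitly: data only moves when a send is matched by a receive. This example is illustrative, not from the slides.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Pack-and-send: explicit transfer to rank 1, message tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The matching receive: without it, the data never arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}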

34 Hybrid: MIMD with shared memory nodes

[Diagram: several shared-memory nodes, each with its own memory and internal interconnect, exchange messages via send/receive over a global interconnect]

And now imagine a multi-core chip at the lowest level.
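Such hybrid machines are typically programmed with MPI between the distributed-memory nodes and OpenMP threads within each shared-memory node. A minimal sketch, assuming an MPI library with at least MPI_THREAD_FUNNELED support (compile with mpicc -fopenmp); it only prints the rank/thread layout.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* One MPI rank per node (distributed memory), threads inside it (shared memory). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}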

35 Interconnection Network

• Speed and Bandwidth are critical

• Low cost networks • local area network (ethernet, token ring)

• High speed networks
  • The heart of a MIMD-DM parallel machine

Issues for Networks

• Total Bandwidth • amount of data which can be moved from somewhere to somewhere per unit time

• Link Bandwidth • amount of data which can be moved along one link per unit time

• Message Latency • time from start of sending a message until it is received

• Bisection Bandwidth
  • amount of data which can move from one half of the network to the other per unit time, for the worst-case split of the network

38 Design Characteristics of a Network

• Topology (how things are connected): • Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree, butterfly, ....

• Routing algorithm (path used): • Example in 2D torus: all east-west then all north-south

• Switching strategy: • Circuit switching: full path reserved for entire message, like the telephone. • Packet switching: message broken into separately-routed packets, like the post office.

• Flow control (what if there is congestion):
  • Stall, store data in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, ...

Performance Properties of a Network: Latency

• Latency: delay between send and receive times
• Latency tends to vary widely across architectures
• Vendors often report hardware latencies (wire time)
• Application programmers care about end-to-end latencies (user program to user program)

• Latency is important for programs with many small messages

Performance Properties of a Network: Bandwidth

• The bandwidth of a link = w * 1/t
  • w is the number of wires
  • t is the time per bit
• Bandwidth is typically quoted in GigaBytes per second (GB/s)
• Effective bandwidth is usually lower than the physical link bandwidth due to packet overhead (routing and control header, error code and trailer wrapped around the data payload)
• Bandwidth is important for applications with mostly large payload messages
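To see why small messages are latency-bound and large ones bandwidth-bound, a first-order cost model is enough: time = latency + size / bandwidth. The C sketch below uses illustrative, assumed numbers (1 microsecond latency, 10 GB/s link), not measurements of any particular network.

#include <stdio.h>

/* First-order message cost model: time = latency + bytes / bandwidth. */
double message_time(double latency_s, double bandwidth_Bps, double bytes)
{
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    double lat = 1e-6;    /* assumed 1 microsecond latency  */
    double bw  = 10e9;    /* assumed 10 GB/s link bandwidth */

    /* An 8-byte message is almost pure latency ...            */
    printf("8 B   : %.3g s\n", message_time(lat, bw, 8));
    /* ... while a 100 MB message is almost pure bandwidth.    */
    printf("100 MB: %.3g s\n", message_time(lat, bw, 100e6));
    return 0;
}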

Common Network Topologies

42 Linear and Ring Topologies

• Linear array

• Diameter = n-1; average distance ~n/3. • Bisection bandwidth = 1 (in units of link bandwidth).

• Torus or Ring

• Diameter = n/2; average distance ~ n/4
• Bisection bandwidth = 2
• Natural for algorithms that work with 1D arrays

Meshes and Tori

• Two-dimensional mesh: diameter = 2 * (sqrt(n) - 1); bisection bandwidth = sqrt(n)
• Two-dimensional torus: diameter = sqrt(n); bisection bandwidth = 2 * sqrt(n)

• Generalises to higher dimensions (the Cray XT5 used a 3D torus)
• Natural for algorithms that work with 2D and/or 3D arrays

Hypercube
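The hypercube has a particularly clean addressing scheme, which the short C sketch below makes explicit: label the n = 2^d nodes with d-bit numbers, and each node's d neighbors are obtained by flipping one bit, so any two nodes are at most d = log2(n) hops apart. This is a standard property of hypercube networks, added here as an illustration.

#include <stdio.h>

/* Neighbors of node i in a d-dimensional hypercube: flip each of the d bits. */
void print_neighbors(unsigned node, int d)
{
    printf("node %u:", node);
    for (int k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));
    printf("\n");
}

int main(void)
{
    int d = 3;                               /* 3-D hypercube, n = 8 nodes */
    for (unsigned i = 0; i < (1u << d); i++)
        print_neighbors(i, d);
    return 0;
}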

[Figure: one-, two-, three- and four-dimensional hypercubes]

Trees

• Diameter = log n
• Bisection bandwidth = 1
• Easy layout as a planar graph
• Many tree algorithms (e.g., summation)
• Fat trees avoid the bisection bandwidth problem: more (or wider) links near the top

Butterflies (with n = (k+1)*2^k nodes)

• Diameter = 2k
• Bisection bandwidth = 2^k
• Cost: lots of wires
• Used in the BBN Butterfly
• Natural for FFT

[Figure: a 2x2 butterfly switch and a multistage butterfly network built from such switches]

Topologies in Real Machines

• Cray XC30: Dragonfly
• Cray XE series: 3D Torus
• Blue Gene/L, /P: 3D Torus
• SGI Altix: Fat tree
• Cray X1: 4D Hypercube*
• Myricom (Millennium): Arbitrary
• Quadrics: Fat tree
• IBM SP: Fat tree (approx.)
• SGI Origin: Hypercube
• Intel Paragon (old): 2D Mesh

(listed roughly from newer to older)
* Many of these are approximations: e.g., the X1 is really a "quad-bristled hypercube", and some of the fat trees are not as fat as they should be at the top.

MIMD - clusters

52 Cluster

• A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone or complete computers. These computers co-operatively work together as a single, integrated computing resource.

[Diagram: a cluster: CPUs, each with its own local memory (LM), connected by a network]

53 Construction of a Beowulf Cluster

[Diagram: computing nodes (Node 1 .. Node 16) connected through a fast Ethernet switch, plus a service node linking the cluster to the LAN; link speeds of 1 Gb/s and 10 Gb/s]

54 Beyond a Cluster: Grid

55 Computational Grids

• A network of geographically distributed resources including computers, peripherals, switches, instruments, and data.

• Each user should have a single login account to access all resources.

• Resources may be owned by diverse organisations.

56 GRID vs. Cluster

• Cluster: typically dedicated 100 % to execute a specific task

• GRID: computer networks distributed planet-wide that can be shared by means of resource-management software

57 Cluster computing vs. others

Distance between nodes

• Shared-memory parallel computing: a chip / a rack
• Cluster computing: a room
• Distributed computing: a building
• Grid computing: the world

58 Some HPC specialized hardware

59 Cray XD1: general design philosophy

• Compute: 12 AMD Opteron processors (32/64-bit), High Performance Linux
• RapidArray (RA) interconnect: 12 communications processors, 1 Tb/s switch fabric
• Active management: dedicated processor
• Application acceleration: 6 co-processors
• Processors directly connected via the integrated switch fabric

60 Balanced interconnect

[Figure: comparison of memory, processor, I/O and interconnect bandwidths (GB/s) for a GigE/PCI-X Xeon server (DDR 333), the Cray XD1 (DDR 400, RapidArray) and the Cray XT4 (DDR 400, SeaStar)]

61 Cray XT5 node

• 8-way SMP, >70 Gflops per node
• 2-32 GB of memory; up to 32 GB of shared memory per node
• 6.4 GB/s direct-connect HyperTransport; 25.6 GB/s direct-connect memory
• Cray SeaStar2+ interconnect with 9.6 GB/s links
• OpenMP support

62 [Figure: cabinet cooling with R134a piping, inlet and exit evaporators]

63 [Figure: airflow through the cabinet, alternating high-velocity and low-velocity airflow regions]

64 Cray XE6 node

[Diagram: compute node with HT3 (HyperTransport 3) links and torus connections in the X, Y and Z directions]

65 Node with NVIDIA Tesla GPU

• NVIDIA Tesla GPU: 665 GF double precision, 6 GB GDDR5 at 138 GB/s
• GPU attached via PCIe Gen2; HT3 links on the host side
• 1600 MHz DDR3 host memory: 16, 32 or 64 GB
• Cray Gemini high-speed interconnect

66 Cray XC30 Aries interconnect

• Chassis / Rank-1: backplane, 15 links at 14 Gbps
• 6-chassis group / Rank-2: copper cables, 15 links at 14 Gbps
• Inter-group / Rank-3: optical cables, 10 links at 12.5 Gbps
• Node connections: PCIe-3, 16 bits at 8.0 GT/s per direction; dual QPI SMP links; 4 channels of DDR3
• Aries 48-port router: 4 NICs with 2 router tiles each, 40 router tiles for the interconnect

67 Cray XC30 Rank-1 Network


o Chassis with 16 compute blades
o 128 sockets
o Inter-Aries communication over the backplane
o Per-packet adaptive routing

68 Cray XC30 Rank-2 Network: 2-cabinet group, 768 sockets


o 16 Aries connected by a backplane: "Green Network"
o 6 backplanes connected with copper cables in a 2-cabinet group: "Black Network"
o Active optical cables interconnect groups: "Blue Network"
o 4 nodes connect to a single Aries

69 Cray XC30 Routing

• Minimal routes between any two nodes in a group are just two hops
• Non-minimal routes require up to four hops
• With adaptive routing we select between minimal and non-minimal paths based on load
• The Cray XC30 class-2 group has sufficient bandwidth to support the full injection rate for all 384 nodes with non-minimal routing

70 Cray XC30 Network Overview – Rank-3 Network

[Diagram: Groups 0-3 connected all-to-all by optical bundles]

Example: a 4-group system is interconnected with 6 optical "bundles". The "bundles" can be configured between 20 and 80 cables wide.

71 Integration HPC and Big Data

73 HLRS Stuttgart

• video of building a computer

74 NEC SX-8: ultra high-speed, high-density vector and scalar unit packaging technology

The vector unit is the most important part of the SX-8 CPU. It consists of a set of vector registers and arithmetic execution pipes for addition, multiplication, division and square root. The hardware square-root vector pipe is the latest addition to the SX-8 CPU architecture and is only available on the SX-8. Each of the arithmetic pipes is four-way parallel, i.e. it can produce four results per cycle. A high-speed load/store pipe connects the vector unit to the main memory.

Technological leadership: for the first time in the SX series, processors and main memory unit are integrated on a single printed circuit board. The combined effect of low-loss substrates enabling high-efficiency transfers and high-density packaging technology substantially reduces space requirements and power consumption, leading to a notable reduction of installation costs and complexity.

[Diagrams: the SX-8 CPU (vector unit with mask register, vector registers and pipes for logical operations, multiplication, addition/shifting and division, plus a scalar unit with cache memory and scalar registers) and an SX-8 node (CPUs, main memory unit (MMU), I/O features, and the Inter-Node Crossbar Switch, IXS)]

75 NEC SX-ACE (2014)

• Targets high sustained performance and efficiency in all respects, to address the increasing needs of science and engineering
• The SX design philosophy is to provide high performance and usability continuously: NEC sticks to high single-core performance and leading technology for high memory bandwidth
• Addresses the "memory wall" and "power wall" of modern computer systems, caused by a lack of memory bandwidth and by power limitations
• Fully integrated processor: a 4-core vector processor with the memory and network interface controllers on a single LSI, implemented in 28 nm CMOS with 11-layer copper wiring
• Strong single-core performance of 64 GFLOPS and 64 GB/s of memory bandwidth per core; 256 GFLOPS of peak performance and 256 GB/s of memory bandwidth per processor
• Large per-core vector cache (ADB, Assignable Data Buffer) specifically designed for supercomputing
• Power efficiency via 10 Gbps SerDes, multi-Vth transistors, clock gating per register/CPU core unit, and a power-voltage optimizer
• Well-balanced I/O, highly integrated with the CPU: up to 8 GB/s of I/O performance per processor LSI
• Highly scalable multi-node system: nodes connected by the proprietary IXS interconnect, a full non-blocking fat tree with 8 GB/s of link bandwidth per direction
• Advanced packaging: approximately 3,000 wires at 70 micron pitch (thinner than a hair) and more than 4,000 fine solder joints per CPU package
• RAS features: ECC memory, hardware error detection on the processor LSI, built-in diagnosis (BID), automatic logging and reporting to the NEC service center
• Compared to the SX-9 at the same 131 TF peak performance (80 racks vs. 8 racks): roughly 100x fewer LSIs, power consumption 1/10, footprint 1/5

76 Multi-PF Solutions: IBM Blue Gene/P

• System: up to 72 racks (cabled 8x8x16), 1 PF/s, 144 or 288 TB of memory
• Rack: 32 node cards, 14 TF/s, 2 or 4 TB
• Node card: 32 chips (4x4x2; 32 compute, 0-4 I/O cards), 435 GF/s, 64 or 128 GB
• Compute card: 1 chip (1x1x1), 13.6 GF/s, 2 or 4 GB DDR
• Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM

77 IBM Blue Gene/Q

78 IBM PowerEN: interconnect architecture

• "All peers" architecture: accelerators and I/O are first-class citizens
• Proven PowerBus architecture: independent command network (one per cycle), two north and two south 16-byte data busses, ECC-protected data paths
• 64-byte cache lines with cache injection: packets flow to/from the caches; new PBus commands
• 1.75 GHz operation; asynchronous connection to nodes and accelerators via PBICs, synchronous connection to the DRAM controllers
• Three 4-byte 2.5 GHz EI3 external links (1-, 2- or 4-chip systems)
• [Diagram: four clusters of 4 A2 cores with L2 caches, DRAM controllers, PCIe, a 4 x 10GbE MAC and packet-processing, crypto, RegEx, compression and XML accelerator engines around the PowerBus]

IBM Power 7 based

A data center in a rack:

• BPA (power): 200 to 480 Vac or 370 to 575 Vdc, redundant power, direct site power feed, PDU elimination
• Rack: 990.6 w x 1828.8 d x 2108.2 mm (39" w x 72" d x 83" h), ~2948 kg (~6500 lbs)
• Storage unit: 4U, 0-6 per rack, up to 384 SFF DASD per unit, file system storage
• CECs: 2U, 1-12 per rack; 256 cores and 128 SN DIMM slots per CEC (8, 16 or (32) GB DIMMs), 17 PCI-e slots, embedded switch, redundant DCA, NW fabric; up to 3072 cores and 24.6 TB (49.2 TB) per rack
• WCU (cooling): facility water input, 100% heat to water, redundant cooling, CRAH eliminated, 100% of cooling PDUs eliminated
• Rack totals: input 8 water lines and 4 power cords; output ~100 TFLOPs, 24.6 TB memory, 153.5 TB storage, 192 PCI-e 16x / 12 PCI-e 8x slots

43 81 Bull

82 Solution for Peta Scalability

[Diagram: the bullx hardware and software stack: bullx R, B and S servers with water cooling; a system environment (installation/configuration with Ksis, monitoring/control/diagnostics with Nscontrol, Nagios and Ganglia, cluster database) and an application environment (execution environment with resource management and job scheduling, file systems such as Lustre and NFSv3/NFSv4, development libraries and tools, MPIBull2); all on a Linux OS and kernel over XPF platforms, InfiniBand/GigE interconnects, GigE network switches and disk arrays, tied together by an administration network and the HPC interconnect]

SGI

The SGI Altix UltraViolet (UV) System

• Evolution: from ccNUMA shared memory (SGI Origin), to partitioned globally addressable shared memory (SGI Altix 4700), to the UV accelerated partitioned globally addressable system (SGI Altix UV)

Globally Shared Memory System

[Diagram: Altix UV blades connected through NUMAlink routers; each blade carries UV hubs with CPUs and DIMMs, together forming one globally addressable shared memory]

86 Fujitsu

87 Road to exascale computing

[Roadmap figure, 2011-2019: PRIMEHPC FX10 (1.85x CPU performance, easier installation), followed by the post-FX10 system (enhanced CPU and network performance, high-density packaging and low power consumption); in parallel, national projects: the HPCI strategic applications program, application feasibility-study review projects, and the exa-system development project]

88 Interconnect of K computer: Tofu (torus fusion)

• Technology: 65 nm, ~200 M transistors
• DMA engine: 4 send + 4 receive
• Link bandwidth: 5 + 5 GB/s, 10 ports; PCIe 16-lane Gen2
• 6D mesh/torus direct network: K is configured as (24 x 18 x 17) x (2 x 3 x 2)
• Low average hop count and high bisectional bandwidth
• Virtual 3D torus topology for applications
• Hardware collective communication support; congestion control by inserting GAPs

89 CPU chip of K computer: SPARC64 VIIIfx

• Technology: 45 nm, 760 M transistors
• Performance: 128 GFLOPS; memory bandwidth: 64 GB/s; power consumption: 58 W
• Eight-core out-of-order superscalar CPU
• HPC-ACE instruction set extension; VISIMPACT hybrid execution model support
• Low power consumption design
• Highly reliable design inherited from mainframes: error detection by hardware and automatic recovery, with no effect on system operation

90 Feature and Configuration of Post-FX10

• SPARC64 XIfx CPU: 32 + 2 cores, 1 CPU per node, ~1 TF (DP) / ~2 TF (SP), HPC-ACE2 support, Tofu2 integrated
• Fujitsu-designed chassis: 12 nodes per 2U chassis, water cooled; CPU memory board with three CPUs, 3 x 8 Micron HMCs, and optional EXCU
• Cabinet: ~200 nodes per cabinet, high density, 100% water cooled
• Tofu Interconnect 2: 12.5 GB/s x 2 (in/out) per link, 10 links per node; optical technology with 8 Finisar optical modules (BOA) for inter-chassis connections

91 Flexible SIMD operations

• New 256-bit wide SIMD functions enable versatile operations: four double-precision calculations per instruction
• Stride load/store, indirect (list) load/store, permutation, concatenation
• [Figure: stride load (elements at a specified stride in memory gathered into a register), indirect load (elements selected through an index register), and permutation (arbitrary shuffle of register elements)]
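The loops that these stride and indirect (gather) loads target look like the plain C sketch below; without such instructions a vector unit would fall back to scalar loads for the non-contiguous accesses. The example is illustrative and not tied to the SPARC64 instruction set.

#include <stdio.h>

/* Strided access: every stride-th element (e.g., one field of an array of structs). */
void scale_strided(double *x, long n, long stride, double a)
{
    for (long i = 0; i < n; i += stride)
        x[i] *= a;
}

/* Indirect (list) access: elements selected through an index vector,
 * typical of sparse-matrix and unstructured-mesh codes. */
void gather_add(double *y, const double *x, const int *idx, long n)
{
    for (long i = 0; i < n; i++)
        y[i] += x[idx[i]];
}

int main(void)
{
    double x[8] = {1,2,3,4,5,6,7,8}, y[3] = {0,0,0};
    int idx[3] = {7, 0, 3};
    scale_strided(x, 8, 4, 10.0);      /* scales x[0] and x[4]      */
    gather_add(y, x, idx, 3);          /* y becomes {8, 10, 4}      */
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}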


92 InfiniBand

[Diagram: two applications with buffers communicate NIC-to-NIC, bypassing the OS on both sides]

• Direct access to communication resources: provides a messaging service
• No need to request the OS; applications communicate with each other directly through the devices
• Defines an API as a set of behaviours ("verbs")
• OpenFabrics Alliance (OFA): the stack of software for IB; the complete set of software components provided by the OpenFabrics Alliance is known as the OpenFabrics Enterprise Distribution (OFED)

93 Designing with InfiniBand

• NFS-RDMA: Network File System over RDMA. NFS is a well-known and widely-deployed file system providing file-level I/O (as opposed to block-level I/O) over a conventional TCP/IP network, enabling easy file sharing. NFS-RDMA extends the protocol and enables it to take full advantage of the high bandwidth and parallelism provided naturally by InfiniBand.
• Lustre support: Lustre is a parallel file system enabling, for example, a set of clients hosted on a number of servers to access the data store in parallel. It does this by taking advantage of InfiniBand's Channel I/O architecture, allowing each client to establish an independent, protected channel between itself and the Lustre Metadata Servers (MDS) and associated Object Storage Servers and Targets (OSS, OST).
• RDS: Reliable Datagram Sockets offers a Berkeley sockets API allowing messages to be sent to multiple destinations from a single socket. This ULP, originally developed by Oracle, is ideally designed to allow database systems to take full advantage of the parallelism and low-latency characteristics of InfiniBand.
• MPI: the MPI ULP for HPC clusters provides full support for MPI function calls.

IB software stack

[Figure 9: the IB software stack: applications (IP-based apps, sockets, clustered DB access, block storage, file systems) sit on upper layer protocols (iPoIB, VNIC, SDP, SRP, iSER, RDS, NFS-RDMA, MPI), which use the InfiniBand verbs interface and RDMA message transport service over the InfiniBand fabric interconnect]

There is one more important corollary to this. As described above, a traditional API provides access to a particular kind of service, such as a network service or a storage service. These services are usually provided over a specific type of network or interconnect. The sockets API, for example, gives an application access to a network service and is closely coupled with an underlying TCP/IP network. Similarly, a block-level file system is closely coupled with a [...]

Benchmarks

• A way to compare the speed of different computer architectures.

95 Benchmarks

• Performance is best determined by running a real application
  – Use programs typical of the expected workload
  – Or, typical of the expected class of applications, e.g. compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
  – nice for architects and designers
  – easy to standardize
  – can be abused
• SPEC (System Performance Evaluation Cooperative)
  – companies have agreed on a set of real programs and inputs
  – valuable indicator of performance (and compiler technology)
  – can still be abused

96 Types of Benchmarks

• Actual target workload
  – Pros: representative
  – Cons: very specific, non-portable, complex (difficult to run or measure)
• Full application benchmarks
  – Pros: portable, widely used, measurements useful in reality
  – Cons: less representative than the actual workload
• Small "kernel" benchmarks
  – Pros: easy to run, early in the design cycle
  – Cons: easy to "fool" by designing hardware to run them well
• Microbenchmarks
  – Pros: identify peak performance and potential bottlenecks
  – Cons: peak performance results may be a long way from real application performance

97 SPEC: System Performance Evaluation Cooperative

• The most popular and industry-standard set of CPU benchmarks

• SPEC CPU2006, combined performance of CPU, memory and compiler:

‣ CINT2006 ("SPECint"), testing integer arithmetic, with programs such as compilers, interpreters, word processors, chess programs, etc.
‣ CFP2006 ("SPECfp"), testing floating-point performance, with physical simulations, 3D graphics, image processing, computational chemistry, etc.

• http://www.spec.org/cpu/
• http://www.cpubenchmark.net/

LINPACK N*N

• Customers use TOP500 list as one of the criteria to purchase machines

• TOP500 is based on LINPACK performance

• See http://www.top500.org/
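For intuition, here is a toy kernel benchmark in the LINPACK spirit: time a block of dense floating-point work and report a FLOP rate. It is a sketch, not the actual HPL code, and it also illustrates why kernel numbers can be far from both peak and real-application performance (this naive triple loop is memory-bound and single-threaded).

#include <stdio.h>
#include <time.h>

#define N 512   /* small illustrative size; real LINPACK runs use much larger N */

/* Naive dense matrix multiply: 2*N^3 floating-point operations. */
static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double flops = 2.0 * N * N * N;
    printf("%.2f s, %.2f GFLOP/s\n", secs, flops / secs / 1e9);
    return 0;
}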

99 Linpack compute resources

[Slides 100-104: TOP500 list snapshots from November 2009, November 2010, November 2011, June 2013 and November 2014]