Advanced Architectural Features Advanced Architectural Features Introduction

ƒ Recap ƒ Multi-core hardware is becoming prevalent, and is tightly Theme: Towards Reconfigurable High-Performance Computing couple d w ith the so ftware whi c h dr ives it Lecture 3 ƒ Objectives: Platforms I: Advanced Architectural Features ƒ Explain key architectural concepts ƒ Discuss architectural extensions ƒ Discover interesting multi-core designs and interconnects Andrzej Nowak CERN openlab (Geneva, Switzerland) ƒ Contents: ƒ Systems architecture basics ƒ Instruction set extensions ƒ Compilers and parallelism Inverted CERN School of Computing, 3-5 March 2008 ƒ Advanced multi-core architecture discussion

1 iCSC2008, Andrzej Nowak, CERN openlab 2 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Von Neumann architecture

Input

Arithmetic Control Unit Logic Unit

CPU COMPUTER ARCHITECTURES And their extensions Output Memory

3 iCSC2008, Andrzej Nowak, CERN openlab 4 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 1 Advanced Architectural Features Advanced Architectural Features Flynn’s taxonomy (1) Flynn’s taxonomy (2)

ƒ SISD – Single Instruction, Single Data ƒ MISD – Multiple Instruction, Single Data ƒ Classical Von Neumann’s model ƒ Redundant systems, pipeline systems (disputable)

ƒ SIMD – Single Instruction, Multiple Data ƒ MIMD – Multiple Instruction, Multiple Data ƒ A GPU ƒ Distributed systems

5 iCSC2008, Andrzej Nowak, CERN openlab 6 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features X86 architectural extensions MMX

ƒ Extensions in CPUs: ƒ Intel’s first attempt at adding SIMD capabilities to their CPUs; ƒ FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4 (SSE4.1 + introduced in 1997 SSE4. 2), SSE5, EM64T ƒ Packet data type concept ƒ Extensions in AMD CPUs: ƒ 64 bits = 2x 32bits = 4x 16bits = 8x 8bits ƒ AMD (pre K6), MMX, SSE, SSE2, SSSE3, SSE4, 3DNow!, ƒ 8 “new” 64bit integer registers – MM0 … MM7 (mapped onto 3DNow!+, 3DNow! Professional (SSE + 3DNow!) x87 the stack)

ƒ Understanding SIMD extension history is helpful in ƒ Major flaws: understanding modern vector instructions ƒ floatinggp point and SIMD could not be used at the same time ƒ integer operations only ƒ Embedded XScale CPUs (ARM family) use iwMMXt – Intel Wireless MMX Technology ƒ 64 bit packed data type ƒ 16 data regs, 8 control regs

7 iCSC2008, Andrzej Nowak, CERN openlab 8 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 2 Advanced Architectural Features Advanced Architectural Features SSE 3DNow!, AltiVec

ƒ Introduced in 1999 with the Pentium III, 70 new instructions ƒ 3DNow! stands for AMD extensions to MMX, introduced in 1998 ƒ Fixed the 2 main MMX deficiencies ƒ 32-bit FP support ƒ Some instructions from this family were added to the Pentium III as SSE ƒ 8 truly new 128-bit registers – XMM0 … XMM7 – 4x 32-bit float ƒ Later upgraded to 3DNow!+

ƒ Later on, another 8 registers added ƒ AltiVec – Apple and IBMs vector extensions for PowerPC ƒ FP Instructions: ƒ Developed between 1996 and 1998 ƒ Data movement (M->R, R->M, R->R) ƒ Also known as “Velocity Engine” (Apple) and “VMX” (IBM) ƒ Arithmetic, bitwise, comparison ƒ Widely used by Apple in their flagship applications, as well as 3rd party ƒ Data shuffling, data unpacking, simple data type conversion developers such as Adobe ƒ Technical details ƒ INT instructions: simple arithmetic and movement ƒ 32 128-bit vector registers (can be split up into 8, 16 or 32 bit pieces) ƒ Flaws: ƒ Three register operands ƒ Register states had to be saved “manually” by the OS ƒ Support for a special RGB data type, which does not map onto 64-bit floats ƒ Execution resources shared with the FPU easily ƒ The IBM CELL supports AltiVec, as well as the IBM Power6 ƒ AMD introduced SSE in AthlonXPs (Palomino – 2001)

9 iCSC2008, Andrzej Nowak, CERN openlab 10 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features SSE2 SSE3, SSSE3

ƒ Introduced with the Pentium 4 in 2001, 144 new instructions ƒ SSE3 introduced in 2004 in the Pentium 4 (“Prescott” – hence the a.k.a. name “PNI”) ƒ Technical details: ƒ 8 registers ƒ SSE3 Technical details: ƒ 64-bit floating point ƒ Horizontal operations portfolio expanded, i.e. add/subtract ƒ Minimized cache pollution elements in a single vector ƒ More sophisticated format conversions ƒ Improved misaligned data loading ƒ Extended MMX instructions allow operation on XMM registers ƒ FP -> Int conversion simplified ƒ SSSE3 is really a new iteration, introduced in Intel Core ƒ Flaws: chips ƒ Accessing misaligned data introduces a penalty ƒ 16 new instructions – some packed and horizontal operations ƒ Unimpressive throughput compared to MMX ƒ No new registers ƒ AMD introduced SSE2 in 2003 in the Athlon64 and Opteron ƒ Operates on MMX or XMM registers families ƒ Unsupported in AMD chips ƒ 8 additional registers

11 iCSC2008, Andrzej Nowak, CERN openlab 12 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 3 Advanced Architectural Features Advanced Architectural Features SSE4 SSE5

ƒ 54 new instructions introduced publicly in 2007 ƒ AMD specific – a 128-bit extension of a 64-bit extension to ƒ SSE4.1: 47, SSE4.2: 7 (only in Nehalem) the 32-bit original x86 instruction set; targeted for 2009

ƒ Technical details: ƒ 170 new instructions, targeting: ƒ No new data types, no new registers ƒ HPC ƒ Compiler vectorization improved ƒ Multimedia ƒ Significant packed dword computation improvement ƒ Security applications ƒ Some instructions are not multimedia related ƒ Features: ƒ Some instructions take an imppplicit third operand ƒ 3 operand ops ƒ You can use SSE4 with Intel compilers from version 10.0 ƒ Fused instructions onwards ƒ MADD instructions

ƒ SSE5 software simulator

Image: AMD 13 iCSC2008, Andrzej Nowak, CERN openlab 14 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features AMD x86-64 (a.k.a. EM64T or AMD Lightweight Profiling Intel64) ƒ Roles reversed – Intel had to follow AMD’s lead ƒ Only 2 new instructions ƒ Enable/disable profiling ƒ 64-bit operations fully supported ƒ Retrieve results ƒ Arithmetic ƒ Registers ƒ No interrupts needed (current situation is the opposite) ƒ Virtual addresses ƒ Profiling on the fly supported ƒ Expanded virtual and physical address space ƒ Drawbacks: ƒ New silicon needed ƒ SSE, SSE2 and SSE3 (Intel) instructions included ƒ Profiling on the fly might not be that easy due to OS designs ƒ Cleanups ƒ Introduced no sooner than late 2008-2009 ƒ There are some differences between AMD’s and Intel’s ƒ We already know that upcoming Performance Monitoring implementations Units will not differ greatly from what we have today

15 iCSC2008, Andrzej Nowak, CERN openlab 16 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 4 Advanced Architectural Features Advanced Architectural Features AMD Extensions for Software X86 extensions summary Parallelism ƒ No details yet, apart from the fact that this extension will ƒ During the last 10 years: upgrade the existing x86 instruction set ƒ We’ve moved from simple 32-bit integer operations to complex 64-bit packed and floating point instructions ƒ The instruction set and surrounding optimizations will be ƒ We’ve received some dedicated hardware for the extensions “broad”, AMD says in question ƒ We’ve moved from 32-bit to 64-bit – more throughput, but ƒ Analysts say that this feature might have a profound impact more memory used as well on the processor industry ƒ The future: ƒ Non x86 architectures ƒ The LRB ins truc tion se t ƒ X86 derived ƒ Mostly multimedia ƒ As always, manuals from Intel or AMD will come in handy when programming using extensions

17 iCSC2008, Andrzej Nowak, CERN openlab 18 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features The Core 2 issue ports

Port 0 Port 1 Port 2 Port 3 Port 4 Port 5

Integer Integer Integer Store Store Integer ALU ALU Load Address Data ALU Int. SIMD Int. SIMD FP Int. SIMD ALU MUL Load ALU SSE FSS Move FP ADD FP MUL & Logic 80 bit Shuffle PARALLEL PROGRAMMING FP MUL And the missing golden bullet for the gun of multi-core FSS Move FSS Move Jump & Logic & Logic FP – Floating Point exec unit FSS – FP, SIMD, SSE 64 bit 64 bit MUL - Multiply shuffle shuffle

Image: based on Sverre Jarp’s work 19 iCSC2008, Andrzej Nowak, CERN openlab 20 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 5 Advanced Architectural Features Advanced Architectural Features

Instruction layout (Core 2) Common parallel programming libraries (1)

ƒ Pthreads, Windows threads ƒ Fine grained control ƒ Lightweight Cycle Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 ƒ Shared memory only 1 load point[0] ƒ OS dependent 2 load origin[0] ƒ Often painful to debug 3 4 ƒ OpenMP 5 ƒ A simple set of #pragma extensions 6 subsd load float-packet ƒ Several languages supported: C , C++, Fortran 7 ƒ Several implementations exist – compiler dependent 8 load xhalfsz ƒ Gcc 4.2 and ICC support OpenMP 9 10 andpd ƒ Several data scopes and scheduling models available 11 ƒ Can be used in a hybrid model with MPI 12 comisd ƒ Shared memory only

13 jbe Image: Sverre Jarp 21 iCSC2008, Andrzej Nowak, CERN openlab 22 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Common parallel programming libraries Common parallel programming libraries (3) (2) ƒ Intel TBB – Threading Building Blocks ƒ MPI – Message Passing Interface ƒ An extension to C++ ƒ A language independent communications protocol ƒ A set of algorithms and data types to facilitate parallel ƒ Point to point message passing and global operations programming ƒ Parallel sort, while, for, reduce ƒ Numerous implementations exist ƒ Container types: queue, vector, hash map ƒ No shared memory concept in MPI-1 (v 1.2) ƒ Scalable memory allocators ƒ MPI-2 (v. 2.1) introduces numerous enhancements ƒ Mutexes, atomic operations ƒ Limited shared memory concept ƒ Automatic scaling to utilize all available processing units ƒ Parallel I/O ƒ Licensed on the GPLv2 ƒ Dyygnamic management ƒ Future features: ƒ Remote memory support ƒ I/O tasks ƒ Thread pinning (affinity) ƒ PVM – Parallel Virtual Machine ƒ New container classes ƒ A network of machines is used as a single entity ƒ Improved interoperability with Intel Threading Tools ƒ Diminishing popularity

23 iCSC2008, Andrzej Nowak, CERN openlab 24 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 6 Advanced Architectural Features Advanced Architectural Features New and experimental compilers

ƒ Intel STM () ƒ A prototype version of the ICC C/C++ compiler ƒ Added transactional programming constructs ƒ Also works with OpenMP ƒ Basic construct: __tm_atomic { statements; } ƒ Very interesting development, worth following

ƒ Intel Ct (parallel programming language) ƒ An exppppggerimental data parallel programming environment ƒ Designed to facilitate multi-core programming and increase ADVANCED ARCHITECTURES portability ƒ Best with vectors, sparse matrices, trees, linked lists ƒ Mostly graphics-oriented so far

25 iCSC2008, Andrzej Nowak, CERN openlab 26 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Multi-core architectures – high level Multi-core architectures – Intel Pentium overview D

ƒ Modern consumer and mainstream architectures following the general trend ƒ Intel Pentium D , Intel Core, Intel Core2, Intel Itanium 2 Execution Execution ƒ AMD Athlon X2, AMD Phenom core core ƒ Upcoming consumer and mainstream architectures ƒ Intel “Nehalem” (Core 3), Intel “Tukwila” (Itanium 3) ƒ AMD “Fusion” L2 cache L2 cache

ƒ Less well known designs Bus interface Bus interface ƒ Sun “Niagara”, “Niagara 2” (UltraSPARC T1 and T2) ƒ IBM CELL, Power6 ƒ Intel “Larrabee” ƒ NVIDIA G80 ƒ Intel “Polaris” ƒ SiCorTex System bus

27 iCSC2008, Andrzej Nowak, CERN openlab 28 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 7 Advanced Architectural Features Advanced Architectural Features Multi-core architectures – Intel Multi-core architectures – Intel Core 2 “Nehalem” ƒ Release: YE 2008

ƒ 4-8 cores, 2 SMT threads SMT SMT SMT SMT per core Execution Execution Execution Execution Execution Execution Execution Execution core core core core core core core core ƒ Next generation interconnect (QPI) 512kB 512kB 512kB 512kB L2 cache L2 cache L2 cache L2 cache L2 cache L2 cache ƒ Advanced cache management 8MB Bus interface Bus interface ƒ Exclusive L2 and shared L3 cache L3 caches QPI interface

System bus System bus (QPI / CSI)

Based on undisclosed data, might vary from actual product 29 iCSC2008, Andrzej Nowak, CERN openlab 30 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Multi-core architectures – Intel Itanium 2 Multi-core architectures – Intel Itanium (“Montecito”) 3 (“Tukwila”)

ƒ Release: ~2008

ƒ Estimated 40GFlops / socket

ƒ 24MB L2 cache

ƒ Next generation interconnect (QPI) SMT SMT SMT SMT Execution Execution Execution Execution ƒ 30% improvement over core core core core “Montecito” (Itanium 2) 6MB 6MB 6MB 6MB ƒ Socket compatible with Xeon L2 cache L2 cache L2 cache L2 cache

QPI router

System interconnect (QPI)

31 iCSC2008, Andrzej Nowak, CERN openlab 32 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 8 Advanced Architectural Features Advanced Architectural Features Multi-core architectures – UltraSPARC Multi-core architectures – UltraSPARC T1 T2

TMT TMT TMT TMT TMT TMT TMT TMT Execution Execution ... Execution Execution Execution Execution Execution Execution core 1 core 2 core 7 core 8 core 1 core 2 ... core 7 core 8 + L1 + L1 + L1 + L1 + L1 + L1 + L1 + L1

Crossbar Crossbar

Shared L2 cache FPU 4 banks L2 cache 8 banks

JBus interface

PCIe, 10GBE System bus

33 iCSC2008, Andrzej Nowak, CERN openlab 34 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Multi-core architectures – IBM Multi-core architectures – Power5+ Power6 interconnects ƒ 4.7GHz top frequency

ƒ 500GB/s of bandwidth

ƒ 32MB off-die L3 cache on a 80GB/s Execution Execution bus core core

4MB 4MB L2 cache L2 cache

32MB L3 controller & directory L3 cache

Bus interface

System bus

Image: Real World Tech 35 iCSC2008, Andrzej Nowak, CERN openlab 36 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 9 Advanced Architectural Features Advanced Architectural Features Multi-core architectures – Power6 Interesting architectures – IBM interconnects CELL

PowerPC execution core L2 cache (2 threads) + L1

SPE SPE

SPE SPE Element Interface Bus SPE SPE

SPE SPE

I/O subsystem

Image: Real World Tech 37 iCSC2008, Andrzej Nowak, CERN openlab 38 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Multi-core architectures – Intel Multi-core architectures – NVIDIA G80 Larrabee ƒ Moved away from traditional GPU design

ƒ 128 stream processors In-order In-order In-order In-order execution core execution core execution core execution core 4 threads 4 threads 4 threads 4 threads ƒ 330 GFLOPS peak

Memory Texture 2MB 2MB Texture Memory Controller Sampler L2 cache …. L2 cache Sampler Controller ƒ Second generation: G92

In-order In-order In-order In-order execution core execution core execution core execution core 4 threads 4 threads 4 threads 4 threads

Graphics: NVIDIA SPECULATIVE INFORMATION. Source: ArsTechnica

39 iCSC2008, Andrzej Nowak, CERN openlab 40 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 10 Advanced Architectural Features Advanced Architectural Features Multi-core architectures – Intel Polaris Multi-core architectures – Intel Polaris (1) (2)

ƒ 80 cores ƒ Core data: North neighbor ƒ 2kB data memory ƒ Tiled (mesh) architecture ƒ 3kB instruction memory ƒ Array area: 13mm x 28mm ƒ 32GBps interconnect ƒ Single core: 2mm x 1.5mm ƒ Tile area: 3mm2 ƒ Versatile, scalable design Computing ƒ Modular, scalable design Element

ƒ Fine grained power management

ƒ Approximate performance: Router ƒ 1TFLOP@501 TFLOP @ 50-60W ƒ 1.5 TFLOPS @ ~100W ƒ 2 TFLOPS @ ~200-250W

South neighbor

Data source: computerbase.de Data source: computerbase.de 41 iCSC2008, Andrzej Nowak, CERN openlab 42 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features Advanced Architectural Features Interesting architectures – SiCorTex (1) Interesting architectures – SiCorTex (2)

1 GFLOP 600mW ƒ 27 6-core nodes make up one blade power ƒ SC5832 ƒ 36 blades 1.5MB L2 ƒ 5832 cores ƒ 5.8 TFLOPS 10W ƒ 8 TB of DDR2 memory ƒ The only computing system on the Top500 list with a single backplane ƒ 18 kW ƒ ~$2. 5 M

ƒ SC648 ƒ 0.648 TFLOP ƒ 2kW ƒ ~$200 k

Image: linuxdevices.com 43 iCSC2008, Andrzej Nowak, CERN openlab 44 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 11 Advanced Architectural Features Advanced Architectural Features Interesting architectures – FPGAs Summary ƒ Programmable hardware ƒ Programmed using a low level hardware description language ƒ Emerging trend – hybrid, (commonly VHDL or Verilog) heterogeneous solutions ƒ Some higher level languages and methods are being developed ƒ The future ƒ Large core designs? ƒ Heavily used in the industry, becoming popular in HPC ƒ Hybrid designs? ƒ Well suited for data streaming ƒ Small core designs? ƒ Common method: moving inner loops into very fast custom instructions ƒ Advantages ƒ Very fast ƒ Can execute all implemented operations in parallel ƒ More later

45 iCSC2008, Andrzej Nowak, CERN openlab 46 iCSC2008, Andrzej Nowak, CERN openlab

Advanced Architectural Features

Q&A

47 iCSC2008, Andrzej Nowak, CERN openlab

Towards Reconfigurable High-Performance Computing Lecture 3 iCSC 2008 3-5 March 2008, CERN Platforms I: Advanced Architectural Features 12