
COSC 6385 Computer Architecture - Multi-Processors (II)
The IBM Cell, Intel Larrabee and Nvidia G80 processors
Edgar Gabriel
Fall 2008


References
• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., vol. 27, no. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519. http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, vol. 26, no. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
• Nvidia G80:
  [4] Scott Wasson: “Nvidia GeForce 8800 graphics processor”. http://techreport.com/articles.x/11211/1

COSC 6385 – Computer Architecture Edgar Gabriel

Larrabee Motivation

• Comparison of two architectures with the same number of transistors
  – The simplified in-order core delivers half the single-stream performance
  – 20x increase in vector throughput for multi-stream execution (160 vs. 8 per clock)

                      2 out-of-order cores   10 in-order cores
Instruction issue     4                      2
VPU per core          4-wide SSE             16-wide
L2 cache size         4 MB                   4 MB
Single stream         4 per clock            2 per clock
Vector throughput     8 per clock            160 per clock
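
The vector throughput row of the table can be reproduced from the core count and the VPU width. The following is a minimal sketch, assuming each core completes one full-width vector operation per clock; the dictionary layout and function name are illustrative only.

```python
# Peak per-clock figures copied from the comparison table.
designs = {
    "2 out-of-order cores": {"cores": 2, "vpu_width": 4, "single_stream": 4},
    "10 in-order cores": {"cores": 10, "vpu_width": 16, "single_stream": 2},
}

def vector_throughput(design):
    """Peak vector results per clock: cores times VPU lanes (assumed model)."""
    return design["cores"] * design["vpu_width"]

ooo = designs["2 out-of-order cores"]
ino = designs["10 in-order cores"]

# Ratios of the in-order design relative to the out-of-order design.
vector_ratio = vector_throughput(ino) / vector_throughput(ooo)
single_stream_ratio = ino["single_stream"] / ooo["single_stream"]

print(vector_throughput(ooo), vector_throughput(ino))  # 8 160
print(vector_ratio, single_stream_ratio)               # 20.0 0.5
```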

Larrabee Overview

• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• Number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphics operations such as rasterization or post-shader blending are done in software


Larrabee Overview (II)

Image Source: [1]

Overview of a Larrabee Core (I)

Image Source: [1]


Overview of a Larrabee Core (II)

• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for pre-fetching data into L1 and L2 cache
  – Support for 4 simultaneous threads, with separate registers for each
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – Coherent L2 cache across all cores

Vector Processing Unit in Larrabee

• 16-wide VPU executing integer, single-precision and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from or stored to up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)
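
Gather-scatter and mask-based predication can be sketched in plain Python. This only models the behaviour described above, not Larrabee's actual instructions; the function names `gather`, `scatter` and `predicated`, the toy memory, and the address pattern are assumptions made for illustration.

```python
VPU_WIDTH = 16  # lanes per vector, as on the slide

def gather(memory, addresses):
    """Load one element per lane; the lanes may use unrelated addresses."""
    return [memory[a] for a in addresses]

def scatter(memory, addresses, values, mask):
    """Store one element per lane, skipping lanes whose mask bit is off."""
    for a, v, m in zip(addresses, values, mask):
        if m:
            memory[a] = v

def predicated(mask, then_vals, else_vals):
    """Per-lane if-then-else: both sides are computed, the mask selects."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

memory = list(range(100, 164))                        # 64 words of toy memory
addresses = [(7 * i) % 64 for i in range(VPU_WIDTH)]  # 16 distinct addresses
x = gather(memory, addresses)

# if x is even: y = x // 2, else: y = 3 * x + 1 (evaluated lane by lane)
mask = [v % 2 == 0 for v in x]
y = predicated(mask, [v // 2 for v in x], [3 * v + 1 for v in x])

# Write back only the lanes that took the "then" branch.
scatter(memory, addresses, y, mask)
```

Both branches are evaluated for all 16 lanes and the mask decides which result each lane keeps, which is how an if-then-else maps onto a vector unit without branching.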


Inter-Processor Ring Network

• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network
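
Deciding the route before injection means the sender picks a direction once and the message then simply travels along that ring. The sketch below assumes the sender chooses the shorter direction and breaks ties clockwise; the node count of 8 and the tie rule are hypothetical and not taken from [1].

```python
def route(src, dst, num_nodes):
    """Return (direction, hops) for the shorter way around a bidirectional ring.

    Assumption for illustration: ties go clockwise.
    """
    clockwise = (dst - src) % num_nodes
    counter = (src - dst) % num_nodes
    if clockwise <= counter:
        return "clockwise", clockwise
    return "counter-clockwise", counter

def path(src, dst, num_nodes):
    """List the nodes a message visits; the choice is made once, at the source."""
    direction, hops = route(src, dst, num_nodes)
    step = 1 if direction == "clockwise" else -1
    return [(src + step * i) % num_nodes for i in range(hops + 1)]

print(route(1, 6, 8))  # ('counter-clockwise', 3)
print(path(1, 6, 8))   # [1, 0, 7, 6]
```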


Larrabee Programming Models

• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives

Larrabee Performance

Image Source: [1]

IBM Cell Overview (I)

• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multi-media industry
  – E.g. Playstation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous multi-core processor consisting of
  – one (or more) general purpose processor element (PPE) and
  – (one or) more synergistic processor elements (SPEs)

Cell Architecture block diagram

Image Source: [2]

• Two generations available so far:
  – Cell BE:
    • 204.8 GFLOPS single precision peak performance
    • 14.6 GFLOPS double precision peak performance
  – PowerXCell 8i (2008):
    • 204.8 GFLOPS single precision peak performance
    • 102.4 GFLOPS double precision peak performance
  – Both have 1 PPE and 8 SPEs
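
The peak figures above can be reproduced with a short calculation. This is a sketch under assumptions that are not stated on the slides: a 3.2 GHz clock, one fused multiply-add (counted as two operations) per SIMD lane, 128-bit SPE registers, and, for the original Cell BE, one double-precision SIMD instruction issued only every 7 cycles. Only the SPEs are counted.

```python
CLOCK_GHZ = 3.2       # assumed clock rate (not stated on the slide)
NUM_SPES = 8          # from the slide
REGISTER_BITS = 128   # SPE register width
FLOPS_PER_FMA = 2     # assumed: a fused multiply-add counts as two operations

def peak_gflops(element_bits, issue_interval_cycles=1):
    """Peak GFLOPS of all SPEs for a given element size and issue interval."""
    lanes = REGISTER_BITS // element_bits
    flops_per_cycle = lanes * FLOPS_PER_FMA / issue_interval_cycles
    return NUM_SPES * CLOCK_GHZ * flops_per_cycle

single = peak_gflops(32)                                   # both generations
double_8i = peak_gflops(64)                                # PowerXCell 8i
double_cell_be = peak_gflops(64, issue_interval_cycles=7)  # original Cell BE

print(round(single, 1), round(double_8i, 1), round(double_cell_be, 1))
```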

General Purpose Processor (PPE)

• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time and an instance of a non-real-time operating system
• Performs management and application control functions


Synergistic Processor Element (SPE)

• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from the local storage
  – Current versions of the Cell processors: 256 KB local storage
• The local storage is connected to the main memory through a Memory Flow Controller (MFC)
  – The MFC moves data between main memory and local storage, or between two SPEs
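
Because an SPE can only operate on data in its local storage, a computation over a large array has to be split into tiles that are copied in, processed, and copied back. The sketch below simulates that pattern; `dma_get`, `dma_put` and `spe_scale` are hypothetical stand-ins for MFC transfers and an SPE kernel, not the real Cell API, and the 16 KB buffer size and 4-byte elements are assumptions.

```python
LOCAL_STORE_BYTES = 256 * 1024  # from the slide
ELEMENT_BYTES = 4               # assumed: 32-bit elements

def dma_get(main_memory, start, count):
    """Hypothetical stand-in for an MFC transfer: main memory -> local store."""
    return main_memory[start:start + count]

def dma_put(main_memory, start, local_buffer):
    """Hypothetical stand-in for an MFC transfer: local store -> main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer

def spe_scale(main_memory, factor, buffer_bytes=16 * 1024):
    """Scale an array in place, touching it only through local-store tiles."""
    assert buffer_bytes <= LOCAL_STORE_BYTES, "tile must fit in local store"
    tile = buffer_bytes // ELEMENT_BYTES
    for start in range(0, len(main_memory), tile):
        local = dma_get(main_memory, start, tile)  # explicit copy in
        local = [factor * v for v in local]        # compute on local data only
        dma_put(main_memory, start, local)         # explicit copy out

data = list(range(10_000))
spe_scale(data, 3)
```

The buffer is kept well below 256 KB because the local storage also has to hold the SPE program itself.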

MFC commands

Image Source: [2]

Synergistic Processor Element (SPE) (II)

• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers or
  – Eight 16-bit integers or
  – Four 32-bit integers or single precision floating-point numbers or
  – Two 64-bit integers or double precision floating-point numbers
• Most instructions supported by the synergistic processor unit utilize all elements in a register -> SIMD
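
The different interpretations of one 128-bit register can be illustrated with Python's `struct` module, which reinterprets the same 16 bytes under several layouts. This is only a model of the register contents; the helper `simd_add_u32` is a hypothetical example of a lane-wise operation, and big-endian byte order is used to match the PowerPC-based Cell.

```python
import struct

# One 128-bit register image: 16 raw bytes.
register = bytes(range(16))

# The same bytes viewed as each of the element layouts (big-endian).
views = {
    "16 x 8-bit integers":  struct.unpack(">16B", register),
    "8 x 16-bit integers":  struct.unpack(">8H", register),
    "4 x 32-bit integers":  struct.unpack(">4I", register),
    "4 x single precision": struct.unpack(">4f", register),
    "2 x 64-bit integers":  struct.unpack(">2Q", register),
    "2 x double precision": struct.unpack(">2d", register),
}

def simd_add_u32(reg_a, reg_b):
    """Add four 32-bit lanes independently, wrapping on overflow."""
    a = struct.unpack(">4I", reg_a)
    b = struct.unpack(">4I", reg_b)
    return struct.pack(">4I", *((x + y) & 0xFFFFFFFF for x, y in zip(a, b)))

total = simd_add_u32(struct.pack(">4I", 1, 2, 3, 0xFFFFFFFF),
                     struct.pack(">4I", 10, 20, 30, 1))
print(struct.unpack(">4I", total))  # (11, 22, 33, 0)
```

A single instruction produces all four sums, and an overflow in one lane does not carry into its neighbour.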

Simplified representation of a current Cell processor

Image Source: [3]

Element Interconnect

• PPE and SPEs communicate through the Element Interconnect Bus
  – Contains a shared command bus
    • Sets up end-to-end transactions
    • Used for coherence protocols
  – Point-to-point data interconnect
    • Four 16-byte-wide rings, two used for clockwise data transfers, two for counter-clockwise data transfers
    • Each ring transfers 128-byte packets (= cache block size of an SPE)
    • Communication costs between two SPEs can vary between 1 hop and 6 hops
  – Overall bandwidth: 204.8 GB/s
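
The hop range and the overall bandwidth can both be checked with a small calculation. The sketch assumes 12 elements on the ring (PPE, 8 SPEs, memory controller and two I/O interfaces), that data always travels the shorter way around, and that the command bus can start one 128-byte transfer per bus cycle at 1.6 GHz; none of these numbers appear on the slide itself.

```python
NUM_ELEMENTS = 12  # assumed: PPE, 8 SPEs, memory controller, 2 I/O interfaces

def hops(src, dst, n=NUM_ELEMENTS):
    """Hops along the shorter direction of the ring."""
    d = (dst - src) % n
    return min(d, n - d)

# Hop counts over all pairs of distinct ring positions.
all_hops = {hops(a, b) for a in range(NUM_ELEMENTS)
            for b in range(NUM_ELEMENTS) if a != b}
print(min(all_hops), max(all_hops))  # 1 6

# Peak bandwidth under the assumed command rate: one packet per bus cycle.
BUS_CLOCK_GHZ = 1.6   # assumed: half of an assumed 3.2 GHz core clock
PACKET_BYTES = 128    # from the slide
peak_gb_per_s = BUS_CLOCK_GHZ * PACKET_BYTES
print(peak_gb_per_s)
```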

Comparison IBM Cell and Intel Larrabee

• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
  – Programs for the Cell have to be written taking the limited amount of memory available to an SPE into account

Nvidia G80

• Parallel Stream Processor
  – Each green block is a stream processor (SP)
  – 16 stream processors are grouped and connected by an L1 cache
  – Each G80 has 8 groups with 16 SPs = 128 SPs total
  – Each SP is a general-purpose processor running at 1.35 GHz
  – Each SP operates on a single element (scalar)
  – Groups are connected by a crossbar-style switch that also connects them to six ROPs
  – Each ROP has its own L2 cache and a 64-bit-wide interface to graphics memory (frame buffer)
  – 6 * 64 bits = 384-bit path to memory
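
The numbers above combine into a few derived figures. The following sketch takes the group count, SP clock and interface width from the slide; the assumption that each SP retires one multiply-add (two operations) per clock and the memory data rate passed to `memory_bandwidth_gb_s` are illustrative inputs, not values from the slide.

```python
GROUPS = 8               # from the slide
SPS_PER_GROUP = 16       # from the slide
SP_CLOCK_GHZ = 1.35      # from the slide
ROP_PARTITIONS = 6       # from the slide
BITS_PER_PARTITION = 64  # from the slide

total_sps = GROUPS * SPS_PER_GROUP                     # 128
memory_bus_bits = ROP_PARTITIONS * BITS_PER_PARTITION  # 384

def peak_gflops(flops_per_sp_per_clock=2):
    """Assumed: each SP retires one multiply-add (2 flops) per clock."""
    return total_sps * SP_CLOCK_GHZ * flops_per_sp_per_clock

def memory_bandwidth_gb_s(data_rate_gt_s):
    """Bandwidth in GB/s for a caller-supplied per-pin data rate in GT/s."""
    return memory_bus_bits / 8 * data_rate_gt_s

print(total_sps, memory_bus_bits)
print(round(peak_gflops(), 1))
print(round(memory_bandwidth_gb_s(1.8), 1))  # 1.8 GT/s is a hypothetical input
```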

Nvidia G80 (I)

Performance comparison G80 to IBM Cell

• Ray Tracing Application

Source: http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/