COSC 6385 Computer Architecture - Multi-Processors (II) the IBM Cell, Intel Larrabee and Nvidia G80 Processors Edgar Gabriel Fall 2008

References
• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, Vol. 51, No. 5, pp. 503-519. http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, Vol. 26, No. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
• Nvidia G80:
  [4] Scott Wasson: “Nvidia GeForce 8800 graphics processor”. http://techreport.com/articles.x/11211/1

Larrabee Motivation
• Comparison of two architectures with the same number of transistors
  – Half the single-stream performance for the simplified core
  – 20x higher peak vector throughput for multi-stream execution (160 vs. 8 per clock)

                       2 out-of-order cores    10 in-order cores
  Instruction issue    4                       2
  VPU per core         4-wide SSE              16-wide
  L2 cache size        4 MB                    4 MB
  Single stream        4 per clock             2 per clock
  Vector throughput    8 per clock             160 per clock

Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• Number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphics operations such as rasterization or post-shader blending are done in software
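
The vector-throughput row of the Larrabee Motivation comparison can be reproduced as cores × VPU lanes. The following is a minimal Python sketch of that arithmetic. The function and variable names are illustrative and not taken from [1], and the sketch assumes that each VPU lane retires one operation per clock.

```python
def peak_vector_ops_per_clock(cores: int, vpu_lanes: int) -> int:
    """Peak vector operations per clock, assuming one op per lane per clock."""
    return cores * vpu_lanes


# Values from the comparison (same transistor budget for both designs).
out_of_order = peak_vector_ops_per_clock(cores=2, vpu_lanes=4)   # 4-wide SSE
in_order = peak_vector_ops_per_clock(cores=10, vpu_lanes=16)     # 16-wide VPU

# Ratio of peak vector throughput, in-order design vs. out-of-order design.
vector_ratio = in_order / out_of_order

# Single-stream issue rate per clock, taken directly from the comparison.
single_stream_ratio = 2 / 4

print(out_of_order, in_order, vector_ratio, single_stream_ratio)
# prints: 8 160 20.0 0.5
```

The simplified design gives up half of its single-stream issue rate in exchange for a much larger number of vector lanes in the same transistor budget.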

Larrabee Overview (II)
Image Source: [1]

Overview of a Larrabee Core (I)
Image Source: [1]

Overview of a Larrabee Core (II)
• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for prefetching data into the L1 and L2 caches
  – Support for 4 simultaneous threads, with separate registers for each thread
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – L2 cache is coherent across all cores

Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single-precision and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from or stored to up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)

Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network

Larrabee Programming Models
• Most applications can be executed without modification thanks to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives

Larrabee Performance
Image Source: [1]

IBM Cell Overview (I)
• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
  – E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
  – one (or more) general-purpose processor elements (PPE) and
  – one (or more) synergistic processor elements (SPEs)

Cell Architecture block diagram
Image Source: [2]

• Two generations available so far:
  – Cell BE:
    • 204.8 GFLOPS single-precision peak performance
    • 14.6 GFLOPS double-precision peak performance
  – PowerXCell 8i (2008):
    • 204.8 GFLOPS single-precision peak performance
    • 102.4 GFLOPS double-precision peak performance
  – Both have 1 PPE and 8 SPEs

General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
• Performs management and application control functions

Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from that local storage
  – Current versions of the Cell processor: 256 KB of local storage
• The local storage is connected to main memory through a Memory Flow Controller (MFC)
  – The MFC moves data between main memory and local storage, or between two SPEs
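
Because an SPE can only access data in its own local storage, a larger data set has to be streamed through it in pieces. The sketch below plans such a split in Python. It is only an illustration: `plan_chunks`, the 64 KB reserved for code and stack, and the two-buffer scheme are assumptions of this example, not part of the Cell SDK or of [2].

```python
LOCAL_STORE_BYTES = 256 * 1024  # local storage per SPE, as stated on the slide


def plan_chunks(total_bytes, reserved_bytes=64 * 1024, buffers=2):
    """Split a data set into (offset, size) pieces that fit in local storage.

    reserved_bytes: space assumed to be taken by code and stack (hypothetical).
    buffers: number of equally sized data buffers, e.g. 2 for double buffering.
    """
    budget = (LOCAL_STORE_BYTES - reserved_bytes) // buffers
    if budget <= 0:
        raise ValueError("no local storage left for data buffers")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(budget, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


chunks = plan_chunks(1024 * 1024)  # stream 1 MB of main-memory data
print(len(chunks), chunks[0], chunks[-1])
# prints: 11 (0, 98304) (983040, 65536)
```

In a real program each planned chunk would be moved by the MFC. With two buffers, one chunk could be processed while the next one is being transferred.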

MFC commands
Image Source: [2]

Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers, or
  – Eight 16-bit integers, or
  – Four 32-bit integers or single-precision floating-point numbers, or
  – Two 64-bit integers or double-precision floating-point numbers
• Most instructions supported by the synergistic processor unit operate on all elements of a register -> SIMD

Simplified representation of a current Cell processor
Image Source: [3]

Element Interconnect Bus
• PPE and SPEs communicate through the Element Interconnect Bus
  – Contains a shared command bus
    • Sets up end-to-end transactions
    • Used for coherence protocols
  – Point-to-point data interconnect
    • Four 16-byte-wide rings: two for clockwise data transfers, two for counter-clockwise data transfers
    • Each ring transfers 128-byte packets (= cache block size of an SPE)
    • Communication cost between two SPEs varies between 1 hop and 6 hops
  – Overall bandwidth: 204.8 GB/s

Comparison IBM Cell and Intel Larrabee
• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus for communication between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
  – Programs for the Cell have to be written with the limited amount of memory available to an SPE in mind

Nvidia G80
• Parallel Stream Processor
  – Each green block is a stream processor (SP)
  – 16 stream processors are grouped and connected by an L1 cache
  – Each G80 has 8 groups with 16 SPs = 128 SPs total
  – Each SP is a generalized processor running at 1.35 GHz
  – Each SP operates on a single element (scalar)
  – The groups are connected by a crossbar-style switch that links them to six ROPs
  – Each ROP has its own L2 cache and a 64-bit-wide interface to graphics memory (frame buffer)
  – 6 * 64 bits = 384-bit path to memory

Nvidia G80 (I)

Performance comparison G80 to IBM Cell
• Ray Tracing Application
Source: http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/
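
The G80 totals quoted on the Nvidia G80 slide follow from simple multiplication. A small Python check of those two figures, with constant names chosen for this example only:

```python
GROUPS = 8            # SP groups per G80
SPS_PER_GROUP = 16    # stream processors per group
ROPS = 6              # ROP partitions attached to the crossbar
BITS_PER_ROP = 64     # width of each ROP's interface to graphics memory

total_sps = GROUPS * SPS_PER_GROUP
memory_path_bits = ROPS * BITS_PER_ROP
memory_path_bytes = memory_path_bits // 8   # same width expressed in bytes

print(total_sps, memory_path_bits, memory_path_bytes)
# prints: 128 384 48
```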
