Broadband Processor Architecture

A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor

Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, Takeshi Yamazaki

© 2005 IBM Corporation Broadband Processor Architecture

Acknowledgements

. Cell is the result of a partnership between SCEI/Sony, Toshiba, and IBM
. Cell represents the work of more than 400 people starting in 2000 and a design investment of about $400M


Cell Design Goals

. Provide the platform for the future of computing
  – 10× performance of desktop systems shipping in 2005
  – Ability to reach 1 TF with a 4-node Cell system
. Computing density as main challenge
  – Dramatically increase performance per X
    • X = Area, Power, Volume, Cost, …
. Single core designs offer diminishing returns on investment
  – In power, area, design complexity and verification cost
. Exploit application parallelism to provide a quantum leap in performance


Shifting the Balance of Power with Cell

. Today’s architectures are built on a 40 year old data model
  – Efficiency as defined in 1964
  – Big overhead per data operation
  – Data parallelism added as an after-thought
. Cell provides parallelism at all levels of system abstraction
  – Thread-level parallelism → multi-core design approach
  – Instruction-level parallelism → statically scheduled & power aware
  – Data parallelism → data-parallel instructions
. Data processor instead of control system


Cell Features

. Heterogeneous multi-core system architecture
  – Power Processor Element for control tasks
  – Synergistic Processor Elements for data-intensive processing
. Synergistic Processor Element (SPE) consists of
  – Synergistic Processor Unit (SPU)
  – Synergistic Memory Flow Control (SMF)
    • Data movement and synchronization
    • Interface to high-performance Element Interconnect Bus

[Figure: Cell block diagram. Eight SPEs, each with an SPU (SXU, LS) and an SMF; a PPE with PPU (PXU, L1) and L2, labeled "64-bit Power Architecture with VMX"; EIB (up to 96B/cycle); MIC to Dual XDR™; BIC to FlexIO™. Link labels: 16B/cycle, 16B/cycle (2x), 32B/cycle.]


Powering Cell – the Synergistic Processor Unit

[Figure: SPU organization. Labels: 64B fetch; ILB; Local Store (single port SRAM), 128B; Issue / Branch, 2 instructions; 16B; VRF; 16B x 2; 16B x 3 x 2; execution units VFPU, VFXU, PERM, LSU.]


Density Computing in SPEs

. Today, execution units only fraction of core area and power
  – Bigger fraction goes to other functions
    • Address translation and privilege levels
    • Instruction reordering
    • Register renaming
    • Cache hierarchy
. Cell changes this ratio to increase performance per area and power
  – Architectural focus on data processing
    • Wide datapaths
    • More and wide architectural registers
    • Data privatization and single level processor-local store
    • All code executes in a single (user) privilege level
    • Static scheduling


Streamlined Architecture

. Architectural focus on simplicity
  – Aids achievable operating frequency
  – Optimize circuits for common performance case
  – Compiler aids in layering traditional hardware functions
  – Leverage 20 years of architecture research
. Focus on statically scheduled data parallelism
  – Focus on data parallel instructions
    • No separate scalar execution units
    • Scalar operations mapped onto data parallel dataflow
  – Exploit wide data paths
    • data processing
    • instruction fetch
  – Address impediments to static scheduling
    • Large register set
    • Reduce latencies by eliminating non-essential functionality

Synergistic Processing

[Figure: diagram of mutually reinforcing design choices. Node labels: large register file, DLP, shorter pipeline, scalar layering, instruction bundling, data parallel select, static scheduling, static prediction, high frequency, simpler µarch, wide data paths, ILP opt., determ. latency, large basic blocks, local store, compute density, single port, sequential fetch.]


Architected Registers have Lagged Behind Latency

[Figure: chart of register pressure (2 unique operands per FMA), scale 0 to 50, by core generation: 1989, 1993, 1997, 2001, and an extrapolated high-frequency core; the x-axis labels also carry the notes (1FPU) and (2FPU). Series: FMA latency * FMAs, L1 latency * FMAs, L2 latency * FMAs, mem latency * FMAs. Annotation: 32 registers since RISC introduced.]


Restoring the Architectural Register Balance

. Unified 128 entry, 128b wide register file
  – Used as 128x1, 64x2, 32x4, 16x8, 8x16, 1x128
  – Emphasis on 32x4 integer and FP arithmetic
  – Register file also stores addresses and condition values
. Unified register file gives programmers more flexibility in resource allocation
  – Stores all data types
    • address, integer, SP/DP FP, Boolean
  – Stores scalar and vector SIMD data
. Large register file facilitates static scheduling and loop optimization for latency hiding
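The "used as 64x2, 32x4, 16x8, 8x16" idea can be sketched in portable C as one 128-bit value viewed at several element widths. This is only an illustrative model: the `reg128` type and the `add_w` / `add_h` helpers are hypothetical names, not part of any SPU toolchain, and lane order follows the host byte order rather than the SPE's.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit register; names are illustrative,
   not an SPU API. Lane order in the union follows the host byte order. */
typedef union {
    uint64_t d[2];   /* 64b x 2  */
    uint32_t w[4];   /* 32b x 4  */
    uint16_t h[8];   /* 16b x 8  */
    uint8_t  b[16];  /* 8b x 16  */
} reg128;

/* 32b x 4 view: four independent modulo-2^32 additions. */
reg128 add_w(reg128 x, reg128 y) {
    reg128 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = x.w[i] + y.w[i];
    return r;
}

/* 16b x 8 view: eight independent modulo-2^16 additions on the same bits.
   A carry out of one halfword never reaches its neighbor. */
reg128 add_h(reg128 x, reg128 y) {
    reg128 r;
    for (int i = 0; i < 8; i++)
        r.h[i] = (uint16_t)(x.h[i] + y.h[i]);
    return r;
}
```

The same storage serves every data type; only the instruction (here, the helper) decides how the 128 bits are partitioned.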


Pervasive Data Parallel Computing (PDPC)

. Scalar processing supported on data-parallel substrate
  – All instructions are data parallel and operate on vectors of elements
  – Scalar operation defined by instruction use, not opcode
    • Vector instruction form used to perform operation
. Preferred slot paradigm
  – Scalar arguments to instructions found in “preferred slot”
  – Computation can be performed in any slot


Register Scalar Data Layout

. Preferred slot in bytes 0-3
  – By convention for procedure interfaces
  – Used by instructions expecting scalar data
    • Addresses, branch conditions, generate controls for insert
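As a rough illustration of scalar layering on a data-parallel substrate, the sketch below models a register as four 32-bit slots and performs a scalar add with the vector form, reading the result from slot 0 (standing in for bytes 0-3). The `vreg` type and all function names are hypothetical and chosen only for this example.

```c
#include <stdint.h>

enum { SLOTS = 4, PREFERRED_SLOT = 0 };   /* slot 0 models bytes 0-3 */

/* Hypothetical 4 x 32-bit register model (not an SPU API). */
typedef struct { uint32_t slot[SLOTS]; } vreg;

/* The only add that exists: a vector add across all slots. */
vreg vadd(vreg a, vreg b) {
    vreg r;
    for (int i = 0; i < SLOTS; i++)
        r.slot[i] = a.slot[i] + b.slot[i];
    return r;
}

/* Place a scalar in the preferred slot; other slots are don't-care (0 here). */
vreg from_scalar(uint32_t x) {
    vreg r = {{0, 0, 0, 0}};
    r.slot[PREFERRED_SLOT] = x;
    return r;
}

/* A consumer that expects scalar data reads only the preferred slot. */
uint32_t to_scalar(vreg r) {
    return r.slot[PREFERRED_SLOT];
}

/* "Scalar" add: defined by how the result is used, not by a scalar opcode. */
uint32_t scalar_add(uint32_t x, uint32_t y) {
    return to_scalar(vadd(from_scalar(x), from_scalar(y)));
}
```

For example, `scalar_add(40, 2)` goes through the vector add and yields 42 from the preferred slot; whatever the other three slots hold is simply ignored.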


PDPC Memory Access Architecture

. Data memory interface optimized for quadword access
  – Simplified alignment network
  – Always access 128 bits of data
. Processing scalar data
  – Extracted using rotate
  – Stored with read-modify-write sequence
. Four address formats
  – register + displacement, register-indexed
  – PC-relative, absolute address
. Local Store Limit Register (LSLR)
  – Address mask register to define maximum address range
  – Addresses wrap at LSLR value
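The quadword-only access model can be approximated in plain C. The sketch below is a simulation under stated assumptions, not SPE code: `local_store`, `lslr`, and every function name are hypothetical, scalars are assumed to be word-aligned, and bytes are interpreted big-endian so that the rotated word lands in bytes 0-3.

```c
#include <stdint.h>
#include <string.h>

enum { LS_SIZE = 256 * 1024, QW = 16 };    /* 256KB store, 16-byte quadword */

typedef struct { uint8_t b[QW]; } qword;

static uint8_t local_store[LS_SIZE];       /* simulated local store        */
uint32_t lslr = LS_SIZE - 1;               /* address mask (assumed value) */

/* Every access moves one aligned quadword; the mask makes addresses wrap. */
qword load_qw(uint32_t addr) {
    qword q;
    uint32_t base = (addr & lslr) & ~(uint32_t)(QW - 1);
    memcpy(q.b, &local_store[base], QW);
    return q;
}

void store_qw(uint32_t addr, qword q) {
    uint32_t base = (addr & lslr) & ~(uint32_t)(QW - 1);
    memcpy(&local_store[base], q.b, QW);
}

/* Rotate the quadword left by n bytes. */
qword rotate_left_bytes(qword q, unsigned n) {
    qword r;
    for (unsigned i = 0; i < QW; i++)
        r.b[i] = q.b[(i + n) % QW];
    return r;
}

/* Scalar load: fetch the quadword, rotate the word into bytes 0-3. */
uint32_t load_word(uint32_t addr) {
    unsigned off = addr & (QW - 1) & ~3u;  /* word-aligned offset */
    qword q = rotate_left_bytes(load_qw(addr), off);
    return ((uint32_t)q.b[0] << 24) | ((uint32_t)q.b[1] << 16) |
           ((uint32_t)q.b[2] << 8)  |  (uint32_t)q.b[3];
}

/* Scalar store: read the quadword, replace one word, write it back. */
void store_word(uint32_t addr, uint32_t value) {
    unsigned off = addr & (QW - 1) & ~3u;
    qword q = load_qw(addr);
    q.b[off]     = (uint8_t)(value >> 24);
    q.b[off + 1] = (uint8_t)(value >> 16);
    q.b[off + 2] = (uint8_t)(value >> 8);
    q.b[off + 3] = (uint8_t)value;
    store_qw(addr, q);
}
```

Because the memory port only ever sees aligned 128-bit transfers, the alignment work moves into a rotate on the load side and a read-modify-write on the store side, which is what keeps the alignment network simple.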


PDPC Branch Architecture

. Register-indirect target address in preferred slot of general purpose register
  – Indirect branch (function return)
  – Indirect prepare-to-branch
. Link address deposited in preferred slot of specified general purpose register
  – Register 0 ($ra) used as return register by software convention


A Statically Scheduled PDPC Architecture

. Bundle concept
  – Sequential semantics
  – Static scheduling key ingredient to great performance
  – Strong performance bias to properly aligned instructions
    • Align instructions to map on simplified issue routing logic
    • Maximize fetch group utilization
. Branch and control architecture
  – Static branch prediction with Prepare-to-branch instruction
  – Simplify implementations, improve utility of sequential fetch
. Explicit inline instruction fetch
  – To avoid fetch starvation during bursts of high-priority load/store traffic
  – Compiler manages 256KB unified, single ported memory


Data Selection Architecture

. Branch is scalar, select is data parallel
  – Branch inherently inefficient
. No exception → safe to compute both paths through conditional assignment
. Use data parallel compute flow
  – Data parallel if-conversion
  – Select instruction independently selects result for each slot
  – Compare instructions generate type-specific mask fields


Data Parallel Select Operation

    for (i = 0; i < VL; i++)
      if (a[i] > b[i])
        m[i] = a[i]*2;
      else
        m[i] = b[i]*3;

Branch-based execution (long latency): each element is tested and then takes one of two paths, one element after another.

    a[0]>b[0]   m[0]=a[0]*2;   m[0]=b[0]*3;
    a[1]>b[1]   m[1]=a[1]*2;   m[1]=b[1]*3;
    a[2]>b[2]   m[2]=a[2]*2;   m[2]=b[2]*3;
    a[3]>b[3]   m[3]=a[3]*2;   m[3]=b[3]*3;

Exploit data parallelism: compare, compute both paths, and select across all four slots at once.

    s[0]=a[0]>b[0]          s[1]=a[1]>b[1]          s[2]=a[2]>b[2]          s[3]=a[3]>b[3]
    a’[0]=a[0]*2;           a’[1]=a[1]*2;           a’[2]=a[2]*2;           a’[3]=a[3]*2;
    b’[0]=b[0]*3;           b’[1]=b[1]*3;           b’[2]=b[2]*3;           b’[3]=b[3]*3;
    m[0]=s[0]?a’[0]:b’[0]   m[1]=s[1]?a’[1]:b’[1]   m[2]=s[2]?a’[2]:b’[2]   m[3]=s[3]?a’[3]:b’[3]
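The if-conversion on this slide can be written out in portable C as a sketch. The branch version is the loop from the slide; the select version computes both paths and merges them with a bitwise mask. The function names are hypothetical, `VL` is fixed at 4 for the example, and the bitwise merge is a stand-in for the select instruction, not its exact definition.

```c
#include <stdint.h>

enum { VL = 4 };   /* vector length used in the slide's example */

/* Reference: the branch-based loop from the slide. */
void scale_branch(const int32_t *a, const int32_t *b, int32_t *m) {
    for (int i = 0; i < VL; i++) {
        if (a[i] > b[i])
            m[i] = a[i] * 2;
        else
            m[i] = b[i] * 3;
    }
}

/* If-converted form: compare, compute both paths, select per slot. */
void scale_select(const int32_t *a, const int32_t *b, int32_t *m) {
    uint32_t s[VL], ta[VL], tb[VL];

    /* Compare produces an all-ones or all-zeros mask per 32-bit slot. */
    for (int i = 0; i < VL; i++)
        s[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;

    /* Both paths are computed unconditionally (modulo arithmetic). */
    for (int i = 0; i < VL; i++) {
        ta[i] = (uint32_t)a[i] * 2u;
        tb[i] = (uint32_t)b[i] * 3u;
    }

    /* Select merges the two results bit by bit under the mask.
       The cast back to int32_t assumes two's-complement conversion. */
    for (int i = 0; i < VL; i++)
        m[i] = (int32_t)((ta[i] & s[i]) | (tb[i] & ~s[i]));
}
```

Neither loop in `scale_select` contains a data-dependent branch, so each one maps naturally onto a single data-parallel operation over four slots.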


Compare and Data Selection Architecture

. Only limited number of compares
  – Compare for equality
    • CEQ
  – Compare for ordering
    • CGT, CGTL
  – Other conditions can be derived
    • operand order, inverted condition
. Generates mask of data type width for use with select
  – Data-parallel conditional operation
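A small C sketch shows how the remaining conditions fall out of three primitive compares by swapping operands or inverting the mask. The lowercase functions model the slide's CEQ, CGT, and CGTL on a single 32-bit slot; treating CGTL as an unsigned ("logical") greater-than is an assumption made here, and all helper names are hypothetical.

```c
#include <stdint.h>

typedef uint32_t mask32;   /* all ones = true, all zeros = false */

/* Primitive compares, modeled on one 32-bit slot. */
mask32 ceq(int32_t a, int32_t b)    { return a == b ? 0xFFFFFFFFu : 0u; }
mask32 cgt(int32_t a, int32_t b)    { return a >  b ? 0xFFFFFFFFu : 0u; }
/* Assumption: CGTL compares the operands as unsigned values. */
mask32 cgtl(uint32_t a, uint32_t b) { return a >  b ? 0xFFFFFFFFu : 0u; }

/* Derived by operand order. */
mask32 clt(int32_t a, int32_t b) { return cgt(b, a); }

/* Derived by inverting the condition. */
mask32 cne(int32_t a, int32_t b) { return ~ceq(a, b); }
mask32 cle(int32_t a, int32_t b) { return ~cgt(a, b); }
mask32 cge(int32_t a, int32_t b) { return ~cgt(b, a); }

/* Mask-driven select: bits from if_true where the mask is set. */
uint32_t select32(mask32 m, uint32_t if_true, uint32_t if_false) {
    return (if_true & m) | (if_false & ~m);
}

/* Example use: branch-free minimum of two signed values. */
int32_t min32(int32_t a, int32_t b) {
    return (int32_t)select32(clt(a, b), (uint32_t)a, (uint32_t)b);
}
```

An inverted condition does not even need the explicit `~`: swapping the two data operands of the select has the same effect, which is one reason a small compare set is sufficient.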


Integer Data Processing for Advanced Media Computing

. Ensure data fidelity
  – Avoid accumulated saturation error in content creation
  – Emphasis on data quality first, quantity second
  – Concentrate on 16b and 32b integer data
  – Concentrate on modulo arithmetic
    • Consistent with common C/C++/Java semantics
. Selected byte operations to maximize performance on key operations
  – Encryption/decryption performance
    • Security in electronic transactions will be key to enable future business models
  – Motion estimation
    • Video encoding


Reconciling Data Fidelity and Compact Data Representation

. Use wide data types for intermediate results in computation
  – Avoid saturation of intermediate results
  – Prevent loss of dynamic range
. Pack-and-saturate operation to store data in dense memory format
  – Keep data footprint in memory compact
  – Pack and saturate once for bulk data storage in memory
. Saturating computation only important for low-cost content rendering devices
  – No/less accumulation of error across rendering steps
  – e.g., MPEG synchronize with I-frame
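The difference between saturating at every step and saturating once can be seen in a few lines of C. This is a minimal sketch with hypothetical names: `mix_saturating` clamps an 8-bit value after each update, while `mix_wide` keeps a 32-bit intermediate and packs to 8 bits only at the end.

```c
#include <stdint.h>

/* Pack-and-saturate: clamp a wide value into the 0..255 range. */
uint8_t sat_u8(int32_t x) {
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Saturate after every step: the clamped excess is lost for good. */
uint8_t mix_saturating(uint8_t base, const int16_t *delta, int n) {
    uint8_t acc = base;
    for (int i = 0; i < n; i++)
        acc = sat_u8((int32_t)acc + delta[i]);
    return acc;
}

/* Wide intermediate: accumulate in 32 bits, pack and saturate once. */
uint8_t mix_wide(uint8_t base, const int16_t *delta, int n) {
    int32_t acc = base;
    for (int i = 0; i < n; i++)
        acc += delta[i];
    return sat_u8(acc);
}
```

Starting from 200 and applying +100 then -100, the step-wise version clamps 300 to 255 and ends at 155, whereas the wide version returns to 200. That lost 45 is the accumulated saturation error the slide warns about.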


Floating Point Processing in the SPE

. Support standard IEEE FP and graphics-optimized arithmetic
  – Floating-point naturally saturating
  – Always use IEEE data layout
. Graphics oriented SP-FP using IEEE compatible data layout
  – Extended range / Simplified format
    • Improved estimate instructions to reduce “banding”
  – Media workloads impose realtime processing requirements
    • No floating point exceptions
  – Graphics oriented SP-FP mode added to Cell Power Processor Element
. DP-FP follows IEEE conventions


Cell: a Synergistic System Architecture

. Cell is not a collection of different processors, but a synergistic whole
  – Operation paradigms, data formats and semantics consistent
  – Share address translation and memory protection model
. SPE optimized for efficient data processing
  – SPEs share Cell system functions provided by Power Architecture
  – SMF implements interface to memory
    • Copy in/copy out to local storage
. Power Architecture provides system functions
  – Virtualization
  – Address translation and protection
  – External exception handling
. EIB (Element Interconnect Bus) integrates system as data transport hub
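The copy in/copy out pattern can be sketched as a chunked loop: bring a block into a local buffer, compute on it, and write it back. The code below is a host-side simulation only. `dma_get` and `dma_put` are hypothetical stand-ins implemented with `memcpy`, not the SMF interface, and the chunk size is an arbitrary choice for the example.

```c
#include <stddef.h>
#include <string.h>

enum { CHUNK = 1024 };   /* elements per local buffer (arbitrary) */

/* Hypothetical stand-ins for memory-flow transfers; real transfers
   would be asynchronous and issued through the SMF. */
void dma_get(void *local, const void *system, size_t bytes) {
    memcpy(local, system, bytes);
}

void dma_put(void *system, const void *local, size_t bytes) {
    memcpy(system, local, bytes);
}

/* Scale n floats in system memory by k, one local-store chunk at a time. */
void scale_in_place(float *sys_mem, size_t n, float k) {
    float ls_buf[CHUNK];                           /* models local storage */
    for (size_t done = 0; done < n; done += CHUNK) {
        size_t cnt = (n - done < CHUNK) ? (n - done) : CHUNK;
        dma_get(ls_buf, sys_mem + done, cnt * sizeof(float));  /* copy in  */
        for (size_t i = 0; i < cnt; i++)                       /* compute  */
            ls_buf[i] *= k;
        dma_put(sys_mem + done, ls_buf, cnt * sizeof(float));  /* copy out */
    }
}
```

The compute loop touches only the local buffer, which is the data privatization idea: all loads and stores hit processor-local storage, and traffic to system memory happens in bulk.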


Compiling for Cell

. The lesson of “RISC computing”
  – Architecture provides fast, streamlined primitives to compiler
  – Compiler uses primitives to implement higher-level idioms
  – If the compiler can’t target it → do not include in architecture
. Compiler focus throughout project
  – Prototype compiler soon after first proposal
  – Cell compiler team has made significant advances in
    • Automatic SIMD code generation
    • Automatic parallelization
    • Data privatization


SPE Block Diagram

[Figure: SPE block diagram. Blocks: Floating-Point Unit, Fixed-Point Unit, Permute Unit, Load-Store Unit, Branch Unit, Channel Unit, Result Forwarding and Staging, Register File, Local Store (256kB) Single Port SRAM, Instruction Issue Unit / Instruction Line Buffer, DMA Unit, On-Chip Coherent Bus. Local store ports: 128B Read, 128B Write. Path width legend: 8 Byte/Cycle, 16 Byte/Cycle, 64 Byte/Cycle, 128 Byte/Cycle.]


SPE Pipeline

SPE PIPELINE FRONT END
    IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2

SPE PIPELINE BACK END (stage labels as shown in the figure)
    Branch Instruction            RF1 RF2
    Permute Instruction           EX1 EX2 EX3 EX4 WB
    Load/Store Instruction        EX1 EX2 EX3 EX4 EX5 EX6 WB
    Fixed Point Instruction       EX1 EX2 WB
    Floating Point Instruction    EX1 EX2 EX3 EX4 EX5 EX6 WB

Legend
    IF  Instruction Fetch
    IB  Instruction Buffer
    ID  Instruction Decode
    IS  Instruction Issue
    RF  Register File Access
    EX  Execution
    WB  Write Back


Cell Broadband Engine


Cell Implementation Characteristics

. 241M transistors
. 235 mm²
. Design operates across wide frequency range
  – Optimize for power & yield
. > 200 GFlops (SP) @ 3.2GHz
. > 20 GFlops (DP) @ 3.2GHz
. Up to 25.6 GB/s memory bandwidth
. Up to 75 GB/s I/O bandwidth
. 100+ simultaneous bus transactions
  – 16+8 entry DMA queue per SPE

[Figure: Frequency Increase vs. Power Consumption. Axes: Voltage (horizontal), Relative (vertical).]
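One plausible accounting for the "> 200 GFlops (SP)" figure is a simple product of counts taken from elsewhere in the deck. The sketch assumes eight SPEs, four 32-bit lanes per 128-bit register, one fused multiply-add per lane per cycle counted as two floating-point operations, and no contribution from the PPE. These are assumptions for illustration, not a statement of how the authors derived the number, and `peak_sp_gflops` is a hypothetical helper.

```c
/* Peak single-precision rate under the stated assumptions:
   units x lanes x flops per lane per cycle x clock in GHz. */
double peak_sp_gflops(int spes, int lanes, int flops_per_lane, double ghz) {
    return spes * lanes * flops_per_lane * ghz;
}
```

With 8 SPEs, 4 lanes, 2 flops per lane, and 3.2 GHz, the product is about 204.8, which is consistent with the slide's "> 200 GFlops" claim.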


Conclusion

. Single chip heterogeneous multicore system takes chip performance to a new level
  – By exploiting thread level parallelism
  – By exploiting instruction level parallelism
  – By exploiting data level parallelism
. Novel pervasive vector-centric architecture
  – Data parallelism at the center
  – Redefines high-performance architecture
. SPE compute engine offers high compute density
  – Power-efficient
  – Area-efficient
. Cell delivers unprecedented supercomputer power for consumer applications


Additional Information

. Additional details on the SPE architecture and the Cell implementation can be found at
  – http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell
  – http://www.research.ibm.com/cell


© Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States August 2005.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture.

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.
