
XBOX 360 SYSTEM ARCHITECTURE

THIS ARTICLE COVERS THE XBOX 360'S HIGH-LEVEL TECHNICAL REQUIREMENTS, A SHORT SYSTEM OVERVIEW, AND DETAILS OF THE CPU AND THE GPU. THE AUTHORS DESCRIBE THEIR ARCHITECTURAL TRADE-OFFS AND SUMMARIZE THE SYSTEM'S SOFTWARE PROGRAMMING SUPPORT.

Jeff Andrews
Nick Baker
Microsoft Corp.

Microsoft's Xbox 360 game console is the first of the latest generation of game consoles. Historically, game console architecture and design implementations have provided large discrete jumps in system performance, approximately at five-year intervals. Over the last several generations, game console systems have increasingly become graphics supercomputers in their own right, particularly at the launch of a given game console generation.

The Xbox 360, pictured in Figure 1, contains an aggressive hardware architecture and implementation targeted at game console workloads. The core silicon implements the product designers' goal of providing game developers a hardware platform to implement their next-generation game ambitions. The core chips include the conceptual blocks of CPU, graphics processing unit (GPU), memory, and I/O. Each of these components and their interconnections are customized to provide a user-friendly game console product.

Design principles

One of the Xbox 360's main design principles is the next-generation gaming principle; that is, a new game console must provide value to customers for five to seven years. Thus, as for any true next-generation game console hardware, the Xbox 360 delivers a huge discrete jump in hardware performance for gaming.

The Xbox 360 hardware design team had to translate the next-generation gaming principle into useful feature requirements and next-generation game workloads. For the game workloads, the designers' direction came from interaction with game developers, including game engine developers, middleware developers, tool developers, API and driver developers, and game performance experts, both inside and outside Microsoft.

One key next-generation game feature requirement was that the Xbox 360 system must implement a pervasive high-definition (HD), progressive-scan, 16:9 aspect ratio screen in all Xbox 360 games. This feature's architectural implication was that the Xbox 360 required a huge, reliable fill rate.

Another design principle of the Xbox 360 architecture was that it must be flexible to suit the dynamic range of game engines and game developers. The Xbox 360 has a balanced hardware architecture for the software game pipeline, with homogeneous, reallocatable hardware resources that adapt to different game genres, different developer emphases, and even to varying workloads within a frame of a game. In contrast, heterogeneous hardware resources lock software game pipeline performance in each stage and are not reallocatable. Flexibility helps make the design "futureproof." The Xbox 360's three CPU cores, 48 unified shaders, and 512-Mbyte DRAM main memory will enable developers to create innovative games for the next five to seven years.


A third design principle was programmability; that is, the Xbox 360 architecture must be easy to program and develop software for. The silicon development team spent much time listening to software developers (we are hardware folks at a software company, after all). There was constant interaction and iteration with software developers at the very beginning of the project and all along the architecture and implementation phases.

This interaction had an interesting dynamic. The software developers weren't shy about their hardware likes and dislikes. Likewise, the hardware team wasn't shy about where next-generation hardware architecture and design were going as a result of changes in silicon processes, hardware architecture, and system design. What followed was further iteration on planned and potential workloads.

An important part of Xbox 360 programmability is that the hardware must present the simplest programming models to let game developers use hardware resources effectively. We extended programming models that developers liked. Because software developers liked the first Xbox, using it as a working model was natural for the teams. In listening to developers, we did not repackage or include hardware features that developers did not like, even though that may have simplified the hardware implementation. We considered the software tool chain from the very beginning of the project.

Another major design principle was that the Xbox 360 hardware be optimized for achievable performance. To that end, we designed a scalable architecture that provides the greatest usable performance per square millimeter while remaining within the console's system power envelope. As we continued to work with game developers, we scaled chip implementations to result in balanced hardware for the software game pipeline. Examples of higher-level implementation scalability include the number of CPU cores, the number of GPU shaders, CPU L2 size, bus bandwidths, and main memory size. Other scalable items represented smaller optimizations in each chip.

Figure 1. Xbox 360 game console and wireless controller.

Hardware designed for games

Figure 2 shows a top-level diagram of the Xbox 360 system's core silicon components. The three identical CPU cores share an 8-way set-associative, 1-Mbyte L2 cache and run at 3.2 GHz. Each core contains a complement of four-way single-instruction, multiple-data (SIMD) vector units.1 The CPU L2 cache, cores, and vector units are customized for Xbox 360 game and 3D graphics workloads.

The front-side bus (FSB) runs at 5.4 Gbit/pin/s, with 16 logical pins in each direction, giving a 10.8-Gbyte/s read and a 10.8-Gbyte/s write bandwidth. The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache.

As Figure 2 shows, the I/O chip supports abundant I/O components. The Xbox media audio (XMA) decoder, custom-designed by Microsoft, provides on-the-fly decoding of a large number of compressed audio streams in hardware.

[Figure 2 block diagram: the CPU (three cores, each with L1 instruction and data caches, sharing a 1-Mbyte L2) connects over the FSB to the GPU (bus interface, memory controllers MC0 and MC1 to the 512-Mbyte GDDR3 DRAM, 3D core, 10-Mbyte EDRAM, and video out through an analog chip). The GPU connects to the I/O chip, which hosts the XMA decoder, SATA DVD and HDD ports, front and rear USB ports for controllers and memory units, IR receiver, flash, audio out, and the SMC. BIU = bus interface unit; MC = memory controller; MU = memory unit; IR = infrared receiver; SMC = system management controller; XMA = Xbox media audio.]

Figure 2. Xbox 360 system block diagram.

Other custom I/O features include the NAND flash controller and the system management controller (SMC).

The GPU 3D core has 48 parallel, unified shaders. The GPU also includes 10 Mbytes of embedded DRAM (EDRAM), which runs at 256 Gbytes/s for reliable frame and z-buffer bandwidth. The GPU includes interfaces between the CPU, I/O chip, and the GPU internals.

The 512-Mbyte unified main memory controlled by the GPU is a 700-MHz graphics-double-data-rate-3 (GDDR3) memory, which operates at 1.4 Gbit/pin/s and provides a total main memory bandwidth of 22.4 Gbytes/s.

The DVD and HDD ports are serial ATA (SATA) interfaces. The analog chip drives the HD video out.
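As a quick sanity check of the bus arithmetic quoted above, the short stand-alone C snippet below recomputes the FSB and main-memory figures. Note that the 128-pin GDDR3 data width is an inference from the quoted 22.4-Gbyte/s total (1.4 Gbit/pin/s x 128 pins / 8), not a number stated in the text.

    /* Sanity check of the quoted bus bandwidths (illustrative sketch,
       not Xbox 360 code). */
    #include <stdio.h>

    int main(void)
    {
        /* FSB: 5.4 Gbit/pin/s, 16 logical pins in each direction. */
        double fsb = 5.4 * 16 / 8.0;    /* 10.8 Gbytes/s each way */

        /* GDDR3: 1.4 Gbit/pin/s; 128 data pins inferred from the
           22.4-Gbyte/s total the article quotes. */
        double mem = 1.4 * 128 / 8.0;   /* 22.4 Gbytes/s total */

        printf("FSB read %.1f, FSB write %.1f, main memory %.1f Gbytes/s\n",
               fsb, fsb, mem);
        return 0;
    }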

CPU chip

Figure 3 shows the CPU chip in greater detail. Microsoft's partner for the Xbox 360 CPU is IBM. The CPU implements the PowerPC instruction set architecture,2-4 with the VMX SIMD vector instruction set (VMX128) customized for graphics workloads.

The shared L2 allows fine-grained, dynamic allocation of cache lines between the six threads. Commonly, game workloads vary significantly in working-set size. For example, scene management requires walking larger, random-miss-dominated data structures, similar to database searches. At the same time, audio, Xbox procedural synthesis (described later), and many other game processes that require smaller working sets can run concurrently. The shared L2 allows workloads needing larger working sets to allocate significantly more of the L2 than would be available if the system used private L2s (of the same total L2 size) instead.

The CPU core has two-per-cycle, in-order instruction issuance. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. There are two symmetric multithreading (SMT),5 fine-grained hardware threads per core. The L1 caches include a two-way set-associative, 32-Kbyte L1 instruction cache and a four-way set-associative, 32-Kbyte L1 data cache. The write-through data cache does not allocate cache lines on writes.

[Figure 3 block diagram: three identical cores (Core 0, Core 1, Core 2), each with an instruction unit, 32-Kbyte L1 instruction and data caches, branch, integer, and load/store units, a VIQ, an FPU, an MMU, and a VSU containing the VMX FP, VMX permute, and VMX simple units. The cores connect through a node crossbar/queuing block to the shared L2 (data arrays and directories), per-core uncached units, the PIC, and the bus interface to the FSB, along with test, debug, clocks, and temperature-sensor logic. VSU = vector/scalar unit; Perm = permute; Simp = simple; MMU = memory management unit; Int = integer; PIC = programmable interrupt controller; FPU = floating-point unit; VIQ = vector/scalar issue queue.]

Figure 3. Xbox 360 CPU block diagram.

The integer execution pipelines include branch, integer, and load/store units. In addition, each core contains an IEEE-754-compliant scalar floating-point unit (FPU), which includes single- and double-precision support at full hardware throughput of one operation per cycle for most operations. Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.

The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.

Another addition we made to the VMX128 was direct 3D (D3D) compressed data formats,6-8 the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.
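The article does not enumerate the D3D compressed formats themselves, but the bandwidth claim is easy to see with any 16-bit encoding. The hypothetical sketch below packs 32-bit floats in [-1, 1] into 16-bit signed-normalized integers on the producer side and expands them on the consumer side, halving footprint at the cost of precision; this is the kind of bits/range/precision trade-off described later.

    /* Hypothetical illustration of a 2:1 compressed vertex format:
       float (4 bytes) -> signed-normalized 16-bit integer (2 bytes). */
    #include <stdint.h>

    static int16_t pack_snorm16(float x)
    {
        if (x >  1.0f) x =  1.0f;          /* clamp to representable range */
        if (x < -1.0f) x = -1.0f;
        return (int16_t)(x * 32767.0f);    /* quantize to 16 bits */
    }

    static float unpack_snorm16(int16_t v)
    {
        return (float)v / 32767.0f;        /* consumer-side expansion */
    }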

CPU data streaming

In the Xbox 360, we paid considerable attention to enabling data-streaming workloads, which are not typical PC or server workloads. We added features that allow a given CPU core to execute a high-bandwidth workload (both read and write, but particularly write) while avoiding thrashing its own cache and the shared L2.

First, some features shared among the CPU cores help data streaming. One of these is the 128-byte cache line size in all the CPU L1 and L2 caches. Larger cache line sizes increase FSB and memory efficiency. The L2 includes a cache-set-locking functionality, common in embedded systems but not in PCs.

Specific features that improve streaming bandwidth for writes and reduce thrashing include the write-through L1 data caches. Also, there is no write allocation of L1 data cache lines when writes miss in the L1 data cache. This is important for write streaming because it keeps the L1 data cache from being thrashed by high-bandwidth transient write-only data streams.

We significantly upgraded write gathering in the L2. The shared L2 has an uncached unit for each CPU core. Each uncached unit has four noncached write-gathering buffers that allow multiple streams to concurrently gather and dump their gathered payloads to the FSB yet maintain very high uncached write-streaming bandwidth.

The cacheable write streams are gathered by eight nonsequential gathering buffers per CPU core. This allows programming flexibility in the write patterns of cacheable, very high bandwidth write streams into the L2. The write streams can randomly write within a window of a few cache lines without the writes backing up and causing stalls. The cacheable write-gathering buffers effectively act as a bandwidth compression scheme for writes, because the L2 data arrays see a much lower bandwidth than the raw bandwidth required by a program's store pattern, which would otherwise make poor use of the L2 cache arrays. Data transformation workloads commonly don't generate data in a way that allows sequential write behavior. If the write-gathering buffers were not present, software would have to gather write data in the register set before storing. This would put a large amount of pressure on the number of registers and increase the latency (and thus reduce the throughput) of inner loops of computation kernels.

We applied similar customization to read streaming. For each CPU core, there are eight outstanding loads/prefetches. A custom prefetch instruction, extended data cache block touch (xDCBT), prefetches data but delivers it to the requesting CPU core's L1 data cache and never puts data in the L2 cache as regular prefetch instructions do. This modification seems minor, but it is very important because it allows high-bandwidth read-streaming workloads to run on as many threads as desired without thrashing the L2 cache. Another option we considered for read streaming was to lock a set of the L2 per thread. In that case, if a user wanted to run four threads concurrently, half the L2 cache would be locked down, hurting workloads requiring a large L2 working-set size. Instead, read streaming occurs through the L1 data cache of the CPU core on which the given thread is operating, effectively giving a private read-streaming first-in, first-out (FIFO) area per thread.

A system feature planned early in the Xbox 360 project was to allow the GPU to directly read data produced by the CPU, with the data never going through the CPU cache's backing store of main memory. In a specific case of this data streaming, called Xbox procedural synthesis (XPS), the CPU is effectively a data decompressor, procedurally generating geometry on the fly for consumption by the GPU 3D core. For 3D games, XPS allows a far greater amount of differentiated geometry than simple traditional instancing allows, which is very important for filling large HD screen worlds with highly detailed geometry. We added two features specifically to support XPS.
The first was support in the GPU and the FSB for a 128-byte GPU read from the CPU. The other was to directly lower communication latency from the GPU back to the CPU by extending the GPU's tail pointer write-back feature.


[Figure 4 block diagram: the CPU of Figure 3, annotated with the data-streaming example's paths: an xDCBT 128-byte prefetch around the L2 into Core 0's L1 data cache; D3D compressed data stores gathered nonsequentially into a locked set in the L2; and a GPU 128-byte read from the L2 over the FSB, with data arriving from memory and leaving to the GPU.]

Figure 4. CPU cached data-streaming example.

Tail pointer write-back is a method of controlling communication from the GPU to the CPU by having the CPU poll on a cacheable location, which is updated when a GPU instruction writes an update to the pointer. The system coherency scheme then updates the polling read with the GPU's updated pointer value. Tail write-backs reduce communication latency compared to using interrupts. We lowered GPU-to-CPU communication latency even further by implementing the tail pointer's backing-store target on the CPU die. This avoids the round trip from CPU to memory when the GPU pointer update causes a probe and castout of the CPU cache data, which would require the CPU to refetch the data all the way from memory. Instead, the refetch never leaves the CPU die. This lower latency translates into smaller streaming FIFOs in the L2's locked set.

A previously mentioned feature very important to XPS is the addition of the D3D compressed formats that we implemented in both the CPU and the GPU. To get an idea of this feature's usefulness, consider this: Given a typical average of 2:1 compression and an XPS-targeted 9-Gbyte/s FSB bandwidth, the CPU cores can generate up to 18 Gbytes/s of effective geometry and other graphics data and ship it to the GPU 3D core. Main memory sees none of this data traffic (or footprint).
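A minimal sketch of the tail-pointer handshake described above, with hypothetical names: the CPU spins on a cacheable location whose backing store sits on the CPU die, and the coherency scheme delivers the GPU's pointer updates to that location.

    /* Hypothetical polling loop for tail pointer write-back. */
    #include <stdint.h>

    extern volatile uint32_t gpu_tail;  /* advanced by GPU write-backs */

    static void wait_until_consumed(uint32_t bytes_needed)
    {
        /* The polling load usually hits in cache; only a GPU pointer
           update changes the line, so the loop itself generates little
           bus traffic while it waits. */
        while (gpu_tail < bytes_needed)
            ;  /* spin */
    }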

CPU cached data-streaming example

Figure 4 illustrates an example of the Xbox 360 using its data-streaming features for an XPS workload. Consider the XPS workload acting as a decompression kernel running on one or more CPU SMT hardware threads. First, the XPS kernel must fetch new, unique data from memory to enable generation of the given piece of geometry. This likely includes world-space coordinate data and specific data to make each geometry instance unique. The XPS kernel prefetches this read data during a previous geometry-generation iteration to cover the fetch's memory latency. Because none of the per-instance read data is typically reused between threads, the XPS kernel fetches it using the xDCBT prefetch instruction around the L2, which puts it directly into the requesting CPU core's L1 data cache. Prefetching around the L2 separates the read data stream from the write data stream, avoiding L2 cache thrashing. Figure 4 shows this step as a solid-line arc from memory to Core 0's L1 data cache.

The XPS kernel then crunches the data, primarily using the VMX128 computation ability to generate far more geometry data than the amount read from memory. Before the data is written out, the XPS kernel compresses it, using the D3D compressed data formats, which offer simple trade-offs between number of bits, range, and precision. The XPS kernel stores these results as generated to the locked set in the L2, with only minimal attention to the write access pattern's randomness (for example, the kernel places write accesses within a few cache lines of each other for efficient gathering). Furthermore, because of the write-through and no-write-allocate nature of the L1 data caches, none of the write data will thrash the L1 data cache of the CPU core. The diagram shows this step as a dashed-line arc from load/store in Core 0 to the locked set in L2.

Once the CPU core has issued the stores, the store data sits in the gathering buffers waiting for more data until timed out or forced out by incoming write data demanding new 64-byte ranges. The XPS output data is written to software-managed FIFOs in the L2 data arrays in a locked set (the unshaded box in Figure 4). There are multiple FIFOs in one locked set, so multiple threads can share one L2 set. This is possible within the 128 Kbytes of one set because tail pointer write-back communication frees completed FIFO area with lowered latency. Using the locked set is important; otherwise, high-bandwidth write streams would thrash the L2 working set.

Next, when more data is available to the GPU, the CPU notifies the GPU that the GPU can advance within the FIFO, and the GPU performs 128-byte reads to the FSB. This step is shown in the diagram as the dotted-line arc starting in the L2 and going to the GPU. The GPU design incorporates special features allowing it to read from the FSB, in contrast with the normal GPU read from main memory. The GPU also has an added 128-byte fetch, which enables maximum FSB and L2 data array utilization.

The two final steps are not shown in the diagram. First, the GPU uses the corresponding D3D compressed data support to expand the compressed D3D formats into single-precision floating-point formats native to the 3D core. Then, the GPU commands tail pointer write-backs to the CPU to indicate that the GPU has finished reading data. This tells the streaming FIFOs' CPU software control that the given FIFO space is free to be written with new geometry or index data.

Figure 5. Xbox 360 CPU die photo (courtesy of IBM).

Figure 5 shows a photo of the CPU die, which contains 165 million transistors in an IBM second-generation 90-nm silicon-on-insulator (SOI) enhanced transistor process.
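Putting the streaming example's steps together, the skeleton below follows the structure described above. The prefetch wrapper, FIFO layout, and packing helper are hypothetical stand-ins; a real kernel would use the xDCBT instruction and VMX128 vector code rather than these scalar placeholders.

    /* Skeleton of one XPS decompression iteration (hypothetical names). */
    #include <stdint.h>

    extern void prefetch_around_L2(const void *p); /* stand-in for xDCBT  */
    extern int16_t pack_snorm16(float x);          /* from earlier sketch */

    void xps_iteration(const float *inst, const float *next_inst,
                       int16_t *l2_fifo, int n)
    {
        /* Step 1: prefetch the next iteration's per-instance data around
           the L2, directly into this core's L1 data cache. */
        prefetch_around_L2(next_inst);

        /* Steps 2-3: generate and compress geometry (scalar stand-in for
           VMX128 work), storing to the software-managed FIFO that lives
           in the locked L2 set; neighboring addresses gather well. */
        for (int i = 0; i < n; i++)
            l2_fifo[i] = pack_snorm16(inst[i]);

        /* Step 4 happens on the GPU: 128-byte FSB reads of the FIFO,
           format expansion, then a tail pointer write-back that frees
           the FIFO space for reuse. */
    }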


Graphics processing unit

The GPU is the latest-generation graphics processor from ATI. It runs at 500 MHz and consists of 48 parallel, combined vector and scalar shader ALUs. Unlike earlier graphics engines, the shaders are dynamically allocated, meaning that there are no distinct vertex or pixel shader engines; the hardware automatically adjusts to the load on a fine-grained basis. The hardware is fully compatible with D3D 9.0 and High-Level Shader Language (HLSL) 3.0,9,10 with extensions.

The ALUs are 32-bit IEEE 754 floating-point ALUs, with relatively common graphics simplifications of rounding modes, denormalized numbers (flush to zero on reads), NaN handling, and exception handling. They are capable of vector (including dot product) and scalar operations with single-cycle throughput; that is, all operations issue every cycle. The superscalar instructions encode vector, scalar, texture load, and vertex fetch within one instruction. This allows peak processing of 96 shader calculations per cycle while fetching textures and vertices.

Feeding the shaders are 16 texture fetch engines, each capable of producing a filtered result in each cycle. In addition, there are 16 programmable vertex fetch engines with built-in tessellation that the system can use instead of CPU geometry generation. Finally, there are 16 interpolators in dedicated hardware.

The render back end can sustain eight pixels per cycle, or 16 pixels per cycle for depth- and stencil-only rendering (used in z-prepass or shadow buffers). The dedicated z or blend logic and the EDRAM guarantee that eight pixels per cycle can be maintained even with 4× antialiasing and transparency. The z-prepass is a technique that performs a first-pass rendering of a command list, with no rendering features applied except occlusion determination. The z-prepass initializes the z-buffer so that on a subsequent rendering pass with full texturing and shaders applied, the hardware won't spend shader and texturing resources on occluded pixels. With modern scene depth complexity, this technique significantly improves rendering performance, especially with complex shader programs.

As an example benchmark, the GPU can render each pixel with 4× antialiasing, a z-buffer, six shader operations, and two texture fetches, and it can sustain this at eight pixels per cycle. This blazing fill rate enables the Xbox 360 to deliver HD-resolution rendering simultaneously with many state-of-the-art effects that traditionally would be mutually exclusive because of fill rate limitations. For example, games can combine particle effects, high-dynamic-range (HDR) lighting, fur, depth of field, motion blur, and other complex effects.

For next-generation geometric detail, shading, and fill rate, the pipeline's front end can process one triangle or vertex per cycle. These are essentially full-featured vertices (rather than a single parameter), with the practical limitation of required memory bandwidth and storage. To overcome this limitation, several compressed formats are available for each data type. In addition, XPS can transiently generate data on the fly within the CPU and pass it efficiently to the GPU without a main-memory pass.

The EDRAM removes the render-target and z-buffer fill rate from the bandwidth equation. The EDRAM resides on a separate die from the main portion of GPU logic. The EDRAM die also contains dedicated alpha-blend, z-test, and antialiasing logic. The interface to the EDRAM macro runs at 256 Gbytes/s: (8 pixels/cycle + 8 z-compares/cycle) × (read + write) × 32 bits/sample × 4 samples/pixel × 500 MHz.

The GPU supports several pixel depths; 32 bits per pixel (bpp) and 64 bpp are the most common, but there is support for up to 128 bpp for multiple-render-target (MRT) or floating-point output. MRT is a graphics technique of outputting more than one piece of data per sample to the effective frame buffer, interleaved efficiently to minimize the performance impact of having more data. The data is used later for a variety of advanced graphics effects. To optimize space, the GPU supports 32-bpp and 64-bpp HDR lighting formats. The EDRAM supports only rendering operations to the render target and z-buffer. For render-to-texture, the GPU must "flush" the appropriate buffer to main memory before using the buffer as a texture.

Unlike a fine-grained tiler architecture, the GPU can achieve common HD resolutions and bit depths within a couple of EDRAM tiles. This simplifies the problem substantially.
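The 256-Gbyte/s EDRAM figure follows directly from the factors in the equation above; the snippet below just multiplies them out.

    /* Recomputing the EDRAM interface bandwidth quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double bytes_per_cycle = (8 + 8)   /* pixels + z-compares/cycle */
                               * 2         /* read + write              */
                               * (32 / 8)  /* 32 bits/sample = 4 bytes  */
                               * 4;        /* 4 samples/pixel (4x AA)   */
        double bytes_per_s = bytes_per_cycle * 500e6;  /* 500-MHz clock */
        printf("%.0f Gbytes/s\n", bytes_per_s / 1e9);  /* prints 256    */
        return 0;
    }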

[Figure 6 block diagram: the main die contains the bus interface unit (to the FSB), command processor, vertex assembly/tessellator, sequencer, texture and vertex caches, Hi-Z, interpolators, the shader complex with shader export, the blending interface, an I/O controller (two-lane PCI-E to the I/O chip), memory controllers MC0 and MC1 (to GDDR3 memory), the memory interface, and the display/video output. A high-speed I/O bus connects the graphics core to the separate DRAM die, which holds the 10-Mbyte EDRAM and the AA+AZ logic.]

Figure 6. GPU block diagram.

Traditional tiling architectures typically include a whole process inserted in the traditional graphics pipeline for binning the geometry into a large number of bins. Handling the bins in a high-performance manner is complicated (for example, overflow cases, memory footprint, and bandwidth). Because the GPU's EDRAM usually requires only a couple of bins, bin handling is greatly simplified, allowing more-optimal hardware-software partitioning.

With a binning architecture, the full command list must be presented before rendering. The hardware uses a few tricks to speed this process up. Rendering increasingly relies on a z-prepass to prepare the z-buffer before executing complex pixel shader algorithms. We take advantage of this by collecting object extent information during this pass, as well as priming a full-resolution hierarchical z-buffer. We use the extent information to set flags to skip command-list sections not needed within a tile. The full-resolution hi-z buffer retains its state between tiles.

In another interesting extension to normal D3D, the GPU supports a shader export feature that allows data to be output directly from the shader to a buffer in memory. This lets the GPU serve as a vector math engine if needed, as well as allowing multipass shaders. The latter can be useful for subdivision surfaces. In addition, the display pipeline includes an in-line scaler that resizes the frame buffer on the fly as it is output. This feature allows games to pick a rendering resolution to work with and then lets the display hardware make the best match to the display resolution.

As Figure 6 shows, the GPU consists of the following blocks:

• Bus interface unit. This interface to the FSB handles CPU-initiated transactions, as well as GPU-initiated transactions such as snoops and L2 cache reads.
• I/O controller. Handles all internal memory-mapped I/O accesses, as well as transactions to and from the I/O chip via the two-lane PCI-Express bus (PCI-E).
• Memory controllers (MC0, MC1). These 128-byte interleaved GDDR3 memory controllers contain aggressive address tiling for graphics and a fast path to minimize CPU latency.
• Memory interface. Memory crossbar and buffering for non-CPU initiators (such as graphics, I/O, and display).


• Graphics. This block, the largest on the chip, contains the rendering engine.
• High-speed I/O bus. This bus between the graphics core and the EDRAM die is a chip-to-chip bus (via substrate) operating at 1.8 GHz and 28.8 Gbytes/s. When multisample antialiasing is used, only pixel center data and coverage information is transferred and then expanded on the EDRAM die.
• Antialiasing and alpha/z (AA+AZ). Handles pixel-to-sample expansion, as well as z-test and alpha blend.
• Display.

Figure 7. Xbox 360 GPU "parent" die (courtesy of Taiwan Semiconductor Manufacturing Co.).

Figure 8. Xbox 360 GPU EDRAM ("daughter") die (courtesy of NEC Electronics).

Figures 7 and 8 show photos of the GPU "parent" and EDRAM ("daughter") dies. The parent die contains 232 million transistors in a TSMC 90-nm GT process. The EDRAM die contains 100 million transistors in an NEC 90-nm process.

Architectural choices

The major choices we made in designing the Xbox 360 architecture were to use chip multiprocessing (CMP), in-order issuance cores, and EDRAM.

Chip multiprocessing

Our reasons for using multiple CPU cores on one chip in the Xbox 360 were relatively straightforward. The combination of power consumption and diminishing returns from instruction-level parallelism (ILP) is driving the industry in general to multicore. CMP is a natural twist on traditional symmetric multiprocessing (SMP), in which all the CPU cores are symmetric and have a common view of main memory but are on the same die versus separate chips. Modern process geometries afford hardware designers the flexibility of CMP, which was usually too costly in die area previously. Having multiple cores on one chip is more cost-effective. It enables a shared L2 implementation and minimizes communication latency between cores, resulting in higher overall performance for the same die area and power consumption.

In addition, we wanted to optimize the architecture for the workload, optimize in-game utilization of silicon area, and keep the system easy to program. These goals made CMP a good choice for several reasons:

First, for the game workload, both integer and floating-point performance are important. The high-level game code is generally a database management problem, with plenty of object-oriented code and pointer manipulation. Such a workload needs a large L2 and high integer performance. The CMP shared L2, with its fine-grained, dynamic allocation, means this workload can use a large working set in the L2 while running. In addition, several sections of the application lend themselves well to vector floating-point acceleration.

Second, to optimize silicon area, we can take advantage of two factors. To start with, we are presenting a stable platform for the product's lifetime. This means tools and programming expertise will mature significantly, so we can rely more on generating code than optimizing performance at runtime. Moreover, all Xbox 360 games (as opposed to Xbox games from Microsoft's first game console, which are emulated on Xbox 360) are compiled from scratch and optimized for the current microarchitecture. We don't have the problem of running legacy, but compatible, instruction set architecture executables that were compiled and optimized for a completely different microarchitecture. This problem has significant implications for CPU microarchitectures in PC and server markets.

Third, although we knew multicore was the way to go, the tools and programming expertise for multithreaded programming are certainly not mature, presenting a problem for our goal of keeping programming easy. For the types of workloads present in a game engine, we could justify at most six to eight threads in the system. The solution was to adapt the "more-but-simpler" philosophy to the CPU core topology. The key was keeping the number of hardware threads limited, thus increasing the chance that they would be used effectively. We decided the best approach was to tightly couple dedicated vector math engines to integer cores rather than making them autonomous. This keeps the number of threads low and allows vector math routines to be optimized and run on separate threads if necessary.

In-order issuance cores

The Xbox 360 CPU contains three two-issue, in-order instruction issuance cores. Each core has two SMT hardware threads, which support fine-grained instruction issuance. The cores allow out-of-order execution in the common cases of loads and vector/floating-point versus integer instructions. Loads, which are treated as prefetches, don't stall until a load dependency is present. Vector and floating-point operations have their own, decoupled vector/float issue queue (VIQ), which decouples vector/floating-point versus integer issuance in many cases.

We had several reasons for choosing in-order issuance. First, the die area required by in-order-issuance cores is less than that of out-of-order-issuance cores. In-order cores simplify issue logic considerably. Although not directly a big area user, out-of-order issue logic can consume extra area because it requires additional pipeline stages to meet clock-period timing. Further, common implementations of out-of-order issuance and completion use rename registers and completion queues, which take significant die area.

Second, in-order implementation is more power efficient than out-of-order implementation. Keeping power levels manageable was a major issue for the design team. All the additional die area required for out-of-order issuance consumes power. Out-of-order cores commonly increase performance because their issuance, tracking, and completion enable deeper speculative instruction execution. This deeper speculation means wasted power, since whole execution strings are often thrown away. Xbox 360 execution does speculate, but to a lesser degree.

Third, the Xbox 360's two SMT hardware threads per core keep the execution pipelines more fully utilized than they are in traditional in-order designs. This helps keep the execution pipelines busy without out-of-order issuance.

Finally, in-order design is simpler, aiding design and implementation. Simplicity also makes performance more predictable, simplifying programming and tool optimizations.
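The load-handling behavior described above rewards simple instruction scheduling. In this hypothetical fragment, the two loads issue back to back and a stall, if any, occurs only at the first dependent use, so independent work placed between a load and its use hides memory latency without out-of-order hardware.

    /* Scheduling for an in-order core whose loads act as prefetches. */
    void scale_pair(const float *a, const float *b, float *out, float k)
    {
        float va = a[0];       /* load issues; no stall here            */
        float vb = b[0];       /* second load issues in its shadow      */
        float t  = k * 2.0f;   /* independent work overlaps the loads   */
        out[0] = va * t;       /* first dependent use; stalls only here */
        out[1] = vb * t;
    }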
EDRAM

HD, alpha blending, z-buffering, antialiasing, and HDR pixels take a heavy toll on memory bandwidth. Although more effects are being achieved in the shaders, postprocessing effects still require a large pixel-depth complexity. Also, as texture filtering improves, texel fetches can consume large amounts of memory bandwidth, even with complex shaders.

One approach to solving this problem is to use a wide external memory interface. This limits the ability to use higher-density memory technology as it becomes available, as well as requiring compression.


Unfortunately, any compression technique must be lossless, and therefore unpredictable, which is generally not good for game optimization. In addition, the required bandwidth would most likely require using a second memory controller in the CPU itself, rather than having a unified memory architecture, further reducing system flexibility.

EDRAM was the logical alternative. It has the advantage of completely removing the render-target and z-buffer bandwidth from the main-memory bandwidth equation. In addition, alpha blending and z-buffering are read-modify-write processes, which further reduce the efficiency of memory bandwidth consumption. Keeping these processes on-chip means that the remaining high-bandwidth consumers, namely geometry and texture, are now primarily read processes. Changing the majority of main-memory bandwidth to read requests increases main-memory efficiency by reducing wasted memory bus cycles caused by turning around the bidirectional memory buses.

Software

By adopting SMP and SMT, we're using standard parallel models, which keep things simple. Also, the unified memory architecture allows flexible use of memory resources. Our OS opens all three cores to game developers to program as they wish. For this, we provide standard APIs, including Win32 and OpenMP, as well as D3D and HLSL. Developers can also bypass these and write their own CPU assembly and shader microcode, referred to in the game industry as "to the metal" programming.

We provide standard tools, including the XNA-based tools Performance Investigator (PIX) and Xbox Audio Creation Tool (XACT). XNA is Microsoft's game development platform, which developers of PC and Xbox 360 games (as well as other platforms) can use to minimize cross-platform development costs.11 PIX is the graphics profiler and debugger.12 It uses performance counters embedded in the CPU and GPU and architectural simulators to provide performance feedback.

The Xbox 360 development environment is familiar to most programmers. Figure 9 shows a screen shot from the XNA Studio Integrated Development Environment (IDE), a version of Visual Studio with additional features for game developer teams. Programmers use the IDE for building projects and debugging, including debugging of multiple threads. When stepping through code, programmers find the instruction set's low-level details completely hidden, but when they open the disassembly window, they can see that PowerPC code is running.

Figure 9. Multithreaded debugging in the Xbox 360 development environment.

Other powerful tools that help Xbox 360 developers maximize productivity and performance include CPU profilers, the Visual C++ 8.0 compiler, and audio libraries. These tools and libraries let programmers quickly exploit the power of the Xbox 360 chips and then help them code to the metal when necessary.
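As one example of the standard parallel models mentioned above, an OpenMP loop can spread per-object game work across the six hardware threads; update_object is a hypothetical game function, not an Xbox 360 API.

    /* Minimal OpenMP sketch: three cores x two SMT threads = six threads. */
    #include <omp.h>

    extern void update_object(int i);

    void update_world(int num_objects)
    {
        #pragma omp parallel for num_threads(6) schedule(dynamic)
        for (int i = 0; i < num_objects; i++)
            update_object(i);
    }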

Xbox 360 was launched to customers in the US, Canada, and Puerto Rico on 22 November 2005. Between then and the end of 2005, it launched in Europe and Japan. During the first quarter of 2006, Xbox 360 was launched in Central and South America, Southeast Asia, Australia, and New Zealand.

Xbox 360 implemented a number of firsts and/or raised the bar for performance for users of PC and game console machines for gaming. These include:

• the first CMP implementation with more than 2 cores (3);
• the highest-frequency and highest-bandwidth CPU front-side bus (5.4 Gbps and 21.6 Gbytes/s);
• the first CMP chip with a shared L2;
• the first game console with SMT;
• the first game console with a GPU-unified shader architecture; and
• the first game console with an MCM GPU/EDRAM die implementation.

The Xbox 360 core silicon contains approximately 500 million transistors. It is the most complex, highest-performance consumer-electronic product shipping today and presents a large discrete jump in 3D graphics and gaming performance.

References
1. PowerPC Microprocessor Family: AltiVec Technology Programming Environments Manual, version 2.0, IBM Corp., 2003.
2. E. Silha et al., PowerPC User Instruction Set Architecture, Book I, version 2.02, IBM Corp., 2005.
3. E. Silha et al., PowerPC Virtual Environment Architecture, Book II, version 2.02, IBM Corp., 2005.
4. E. Silha et al., PowerPC Operating Environment Architecture, Book III, version 2.02, IBM Corp., 2005.
5. J. Hennessey and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2002.
6. K. Gray, The Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
7. F. Luna, Introduction to 3D Game Programming with DirectX 9.0, 1st ed., Wordware, 2003.
8. MSDN DX9 SDK Documentation, Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/dx9_graphics.asp.
9. S. St.-Laurent, The Complete Effect and HLSL Guide, Paradoxal Press, 2005.
10. MSDN DX9 SDK Documentation, HLSL Shaders Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/HLSL_Workshop.asp.
11. Microsoft XNA, 2006, http://www.microsoft.com/xna.
12. MSDN DX9 SDK Documentation, PIX Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/PIX.asp.

Jeff Andrews is a CPU architect and project leader in Microsoft's Xbox Console Architecture Group, focusing on the CPU. Andrews has a BS in computer engineering from the University of Illinois Urbana-Champaign.

Nick Baker is the director of Xbox console architecture at Microsoft. His responsibilities include managing the Xbox Console Architecture, System Verification, and Test Software teams. Baker has an MS in electrical engineering from Imperial College London.

Direct questions and comments about this article to Jeff Andrews, Microsoft, 1065 La Avenida St., Mountain View, CA 94043; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
