CONFIDENTIAL

ATI Radeon™ HD 2000 Series Technology Overview

Richard Huddy
Worldwide DevRel Manager, AMD Graphics Products Group

Introducing the ATI Radeon™ HD 2000 Series

ATI Radeon™ HD 2900 Series – Enthusiast

ATI Radeon™ HD 2600 Series – Mainstream

ATI Radeon™ HD 2400 Series – Value

ATI Radeon™ HD 2000 Series Highlights

Technology leadership
• Highest clock speeds – up to 800 MHz
• Highest transistor density – up to 700 million transistors
• Lowest power for mobile

Cutting-edge image quality features
• Advanced anti-aliasing and texture filtering capabilities
• Fast High Dynamic Range rendering
• Programmable Tessellation Unit

2nd generation unified architecture
• Superscalar design with up to 320 stream processing units
• Optimized for Dynamic Game Computing and Accelerated […]

ATI Avivo™ HD technology
• Delivering the ultimate HD video experience
• HD display and audio connectivity

DirectX® 10
• Massive […] and geometry processing performance
• Enabling the next generation of visual effects

Native CrossFire™ technology
• Superior multi-GPU support

The March to Reality

[Figure: timeline of Radeon generations: Radeon, Radeon 8500, Radeon 9700, Radeon 9800, Radeon X800, Radeon X1800, Radeon X1950, Radeon HD 2900]

2nd Generation Unified Shader Architecture

• Development from a proven and successful design (Xbox 360 graphics)
  – New dispatch processor handling thousands of simultaneous threads
• Up to 320 discrete, independent stream processing units
  – In comparison, ATI Radeon X19xx family had 48 vector + 48 scalar processing units
• Superscalar ALU implementation
  – Dedicated branch execution units and texture address processors
• Full support for DirectX 10.0, Shader Model 4.0
  – Dynamic shader load balancing between vertex, geometry and pixel shader operations
  – Handled automatically by hardware scheduler

[Diagram: Command Processor; Setup Engine (Vertex Assembler, Programmable Tessellator, Geometry Assembler, Scan Converter / Rasterizer, Interpolators, Hierarchical Z); Ultra-Threaded Dispatch Processor; Stream Processing Units; Texture Units with L1 and L2 Texture Caches; Vertex Cache; Shader Instruction and Constant Caches; Memory Read/Write Cache; Stream Out Buffer; Shader Export; Render Back-Ends with Z/Stencil Cache and Color Cache]

Shader Processor Progression

GPU family                           Design            Instructions/clk   Components
Radeon and earlier                   Vector            1                  4
Radeon 9600, 9700, 9800, X series    Vector + Scalar   2                  3+1 or 4+1
Radeon HD 2000 series                Superscalar       5                  5

Shader Processors

[Diagram: 5-way superscalar shader processor with stream processing units, branch execution unit and general purpose registers, shown within the chip block diagram]

Pushing a TeraFLOP

475 GigaFLOPS per GPU
• 237 billion single precision floating point multiply-add operations per second
• Real, measurable performance – not just theoretical

950 GigaFLOPS in a CrossFire configuration
• Tera-scale computing is possible today on your desktop

Unprecedented compute density
• More than 1 GigaFLOP per mm²
• Less than $1 per GigaFLOP
• Over 3.4 GigaFLOPS per Watt

Texture Unit Features

Full speed floating point texture filtering
• 64-bit HDR textures bilinear filtered at full speed (~7x faster than Radeon X1000 series)¹
• 128-bit floating point textures filtered at half speed
• New compact 32-bit HDR shared exponent texture format (RGBE 9:9:9:5)
• Trilinear and anisotropic filtering supported for all formats

Improved high quality anisotropic filtering

Percentage Closer Filtering (PCF) for enhanced shadow rendering

High resolution texture support
• Up to 64 MTexels (8192 x 8192, about 67 million texels)

Full texturing capability accessible to vertex, geometry, and pixel shader programs

Bandwidth Drives Performance

[Chart: memory bandwidth (GB/sec) and performance (3DMark03 score) by memory type and bus width: DDR, DDR2, GDDR3, GDDR4; 128, 256 and 512 bits]

Memory Controller Progression

GPU family                                               Memory controller design
ATI Radeon X850 & earlier, all competing GPUs to date    Centralized Crossbar
ATI Radeon X1000 Series                                  Partially Distributed Hybrid Ring Bus
ATI Radeon HD 2000 Series                                Fully Distributed Ring Bus

Massive Bandwidth

ATI Radeon HD 2900 memory controller
• World’s first 512-bit memory interface
• Designed for full performance HDR rendering

Highlights
• Over 100 GB/sec memory bandwidth
• Eight 64-bit memory channels
• Kilobit ring bus
• Fully distributed design - no central hub
• Simplified layout, highly scalable

[Diagram: 1024-bit ring bus (512-bit read + 512-bit write) linking ring stops; each memory ring stop has sequencers and arbiters serving 64-bit memory channels to GDDR3/4 DRAM; a separate PCI Express ring stop; crossbar mux to the read and write memory client interfaces]

High Dynamic Range Performance

[Chart: ATI Radeon HD 2000 Series vs. ATI Radeon X1000 Series, High Dynamic Range Performance. Radeon HD 2900 XT relative to Radeon X1950 XTX, vertical axis 80% to 240%. Benchmarks: Far Cry HDR (16x12, 25x16), 3DMark06 HDR1 and HDR2² (12x10), Serious Sam 2 HDR (16x12, 25x16), Elder Scrolls IV: Oblivion HDR (16x12, 25x16)]

Geometry Performance

Large vertex cache
• 8x larger than Radeon X1950 for improved vertex fetch performance

Fast, full-featured Vertex Texture Fetch
• Uses same texture units as pixel shaders

All shader processors can perform vertex and/or geometry processing if necessary
• Up to 10x the vertex processing power of Radeon X1950 available on demand
• Up to 50x the geometry processing power of the fastest competing DirectX 10 GPUs³

Tessellation

All ATI Radeon HD 2000 series GPUs feature a new programmable tessellation unit
• Based on Xbox 360 technology
• Provides highly effective geometry data compression
• Orders of magnitude faster than CPU-based or geometry shader-based tessellation

Enables:
• More detailed animation
• More realistic characters
• Complex terrain
• More sophisticated shader effects

CrossFire

All ATI Radeon HD 2000 series GPUs feature native CrossFire support

Simplified CrossFire experience
• Easy plug-and-play setup
• No special master cards required
• Integrated compositing engine
• New AFR detect algorithm - intelligent mode selection for best scalability

Most immersive and most responsive gaming experience

High bandwidth dual-link GPU interconnect
• Supports display resolutions up to 2560x2048 @ 60Hz
• Built for future scalability (>2 GPUs)

[Diagram: CrossFire Rendering Modes: Alternate Frame Rendering (GPU_0 and GPU_1 render Frame_0 and Frame_1 for the display), SuperTile, Scissor, Super AA]

CrossFire Performance

[Chart: ATI Radeon HD 2900 XT CrossFire Scaling. Radeon HD 2900 XT CrossFire relative to a single Radeon HD 2900 XT, vertical axis 80% to 200%. Benchmarks at 25x16, mostly with 4xAA 8xAF (some with 16xAF): 3DMark05, 3DMark06, Company of Heroes, Call of Duty 2, Doom 3, Far Cry, FEAR, Half Life 2, Half Life 2: EP1, Half Life 2: LC, Prey, Serious Sam 2, Splinter Cell: CT, Stalker]

ATI Avivo™ HD Technology

Dedicated silicon for accelerated HD video decode and processing
• UVD –
• AVP – Advanced Video Processor

Leading video playback quality
• Up to 128 out of 130 on HQV video quality test

Dual-link outputs with HDMI & HDCP
• First products to support high resolution HDMI displays⁴

On-chip HD audio controller
• AC3 5.1 surround-sound output over HDMI

Cutting Edge Process Technology

ATI Radeon HD 2900
• 700 million transistors at 750 MHz
• Uses unique TSMC 80nm process (80HS)
• Optimized for high clock speeds

ATI Radeon HD 2600 & HD 2400
• Use unique TSMC 65nm process (65G+)
• Optimized for power efficiency

Both processes target aggressive transistor density

[Charts: Transistor Speed (90GT vs. 80HS); Gates per mm² (90GT, 80GT, 80HS, 65G+); Leakage power per mm² (80GT vs. 65G+)]

Unified Architecture (Painful) Detail

• Command Processor
• Setup Engine
• Ultra-Threaded Dispatch Processor
• Stream Processing Units
• Texture Units & Caches
• Memory Read/Write Cache & Stream Out Buffer
• Shader Export
• Render Back-Ends
• Memory controller

ATI Radeon HD 2900
• 320 Stream Processing Units
• 4 SIMDs
• 4 Texture Units
• 4 Render Back-Ends

[Diagram: ATI Radeon HD 2900 block diagram with Command Processor, Setup Engine (Vertex Assembler, Programmable Tessellator, Geometry Assembler, Scan Converter / Rasterizer, Interpolators, Hierarchical Z), Ultra-Threaded Dispatch Processor, Stream Processing Units, Texture Units, L1 and L2 Texture Caches, Vertex Cache, Shader Instruction and Constant Caches, Memory Read/Write Cache, Stream Out Buffer, Z/Stencil Cache, Shader Export, Render Back-Ends, Color Cache]

ATI Radeon HD 2600 & HD 2400

ATI Radeon HD 2400
• 40 Stream Processing Units
• 2 SIMDs
• 1 Texture Unit
• 1 Render Back-End
• Shared vertex/texture cache

ATI Radeon HD 2600
• 120 Stream Processing Units
• 3 SIMDs
• 2 Texture Units
• 1 Render Back-End

[Diagrams: block diagrams of both chips, each with Command Processor, Programmable Tessellator, Vertex Assembler, Geometry Assembler, Scan Converter / Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Shader Instruction and Constant Caches, texture and vertex caches, Memory R/W Cache, Stream Out Buffer, Z/Stencil Cache, Shader Export, Color Cache]

Unified Shader Architecture

Making better use of compute resources to render more complex and realistic images

Unified Shader Definitions

Shader – Program consisting of a set of instructions that determine the characteristics of a vertex, primitive, or pixel

Thread – Set of vertices, polygons, or pixels that form a unit of work on which a single instance of a shader program is executed

Flow Control – Special instructions in a shader program that control the execution of instructions based on specified conditions

GPR – General Purpose Register to store shader initial, intermediate, and output data during shader execution

Stream Processing Unit – Hardware block that executes mathematical operations on a stream of input data

SIMD – A group of stream processing units executing a single instruction on multiple data items (vertices, primitives, pixels).

Command Processor

• Processes command stream from graphics driver
  – Executes microcode with memory access
• Performs some state validation
  – Offloads this task from CPU
  – Small batch improvements
  – Up to 30% reduction in driver CPU overhead
  – Applies to both DirectX 9 and DirectX 10 applications

Setup Engine

Prepares data for processing by the stream processing units

Consists of three different functions:
• Vertex assembly and tessellation (for vertex shaders)
• Geometry assembly (for geometry shaders)
• Scan conversion and interpolation (for pixel shaders)

Each function can submit threads to the dispatch processor

Ultra-Threaded Dispatch Processor

Separate command queues for each shader type
• Fill with threads waiting for execution
• Each consists of a number of instructions that will operate on a block of input data

Arbiter units determine next thread to process, based on a variety of tracked parameters
• Initial arbiter to select which thread to submit
• Two arbiter units per SIMD array
• Allows each SIMD to be pipelined, with two operations at a time in process
• Dedicated arbiter units for texture and vertex fetches
• Can be scheduled independently from math operations
• Executing threads can be bumped at any time if a higher priority thread is pulled from the command queues
• Temporary data saved so thread can resume later

[Diagram: Setup Engine (Vertex Assembler, Geometry Assembler, Interpolators) feeding the Vertex Shader, Geometry Shader and Pixel Shader Command Queues, which hold VS, GS and PS threads; arbiters and sequencers for four SIMD arrays of 80 stream processing units each; separate Texture Fetch and Vertex Fetch arbiters and sequencers; Shader Instruction and Constant Caches; Vertex/Texture Cache and Vertex/Texture Units]

Ultra-Threaded Dispatch Processor

Dedicated shader caches
• Instruction cache allows unlimited shader length
• Constant cache allows unlimited number of constants
• Both caches take advantage of data re-use to improve state change overhead and efficiency

Latency hiding
• Vertex, texture and shader I/O fetches that result in a cache miss may require hundreds of cycles to return data from graphics memory
• As soon as a thread is forced to wait for data, it is suspended and a new thread begins executing immediately
• Suspended threads remain in the command queues until their requested data arrives
• Ultra-threaded dispatch processor can queue up hundreds of threads to make sure the SIMD arrays are never sitting idle

[Diagram: dispatch processor with Setup Engine, per-shader-type command queues, arbiters and sequencers, four SIMD arrays of 80 stream processing units each, and Texture Fetch and Vertex Fetch units]
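The latency-hiding argument above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the hardware's scheduling algorithm: the function name `simd_utilization`, the 4 cycles of math per fetch, and the 196-cycle fetch latency are all hypothetical values chosen to stand in for "hundreds of cycles".

```python
def simd_utilization(num_threads, math_cycles, fetch_latency, total_cycles=10_000):
    """Fraction of cycles in which a single SIMD has a thread to execute.

    Toy model (an assumption, not the real arbiter): every thread repeats
    `math_cycles` cycles of ALU work followed by a memory fetch that takes
    `fetch_latency` cycles. A waiting thread is suspended and the first
    ready thread runs in its place.
    """
    ready_at = [0] * num_threads             # cycle when each thread may run again
    math_left = [math_cycles] * num_threads  # ALU cycles left before the next fetch
    busy_cycles = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:          # thread t has its data
                busy_cycles += 1
                math_left[t] -= 1
                if math_left[t] == 0:         # issue a fetch and suspend the thread
                    math_left[t] = math_cycles
                    ready_at[t] = cycle + 1 + fetch_latency
                break                         # at most one thread runs per cycle
    return busy_cycles / total_cycles


for n in (1, 10, 50):
    print(n, "threads ->", simd_utilization(n, math_cycles=4, fetch_latency=196))
```

In this model the SIMD stays fully busy once there are enough queued threads to cover one fetch latency with math from other threads, which is the reason given for queueing hundreds of threads.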

SIMD Arrays

Single Instruction, Multiple Data
• Execute the same instruction thread on multiple data elements in parallel
• Very Large Instruction Word (VLIW) design
  – Each instruction word can include up to 6 independent, co-issued operations (5 math + 1 flow control)
  – All operations are performed in parallel on each data element in the current thread
• Texture fetch and vertex fetch instructions are issued and executed separately
  – Allows fetches to begin executing before the requested data is required by the shader
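To make the co-issue idea concrete, here is a minimal sketch of how independent scalar operations could be grouped into instruction words with 5 math slots. The greedy packer, the function name `pack_vliw`, and the example operation names are assumptions for illustration; the slides do not describe how the shader compiler actually schedules instructions, and the flow control slot is not modeled.

```python
def pack_vliw(ops, deps, width=5):
    """Greedily pack operations into bundles of up to `width` math slots.

    ops  : operation names in program order
    deps : dict mapping an operation to the set of operations it depends on
    An operation may join a bundle only if everything it depends on was
    completed in an earlier bundle, so all operations in a bundle are
    independent of each other.
    """
    done = set()
    pending = list(ops)
    bundles = []
    while pending:
        bundle = []
        for op in pending:
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(bundle)
        pending = [op for op in pending if op not in done]
        bundles.append(bundle)
    return bundles


# Hypothetical example: a 4-component dot product written as scalar ops.
DOT4_OPS = ["m0", "m1", "m2", "m3", "a0", "a1", "a2"]
DOT4_DEPS = {"a0": {"m0", "m1"}, "a1": {"m2", "m3"}, "a2": {"a0", "a1"}}

for bundle in pack_vliw(DOT4_OPS, DOT4_DEPS):
    print(bundle)
```

The example shows why slot utilization depends on the shader: the four multiplies share one instruction word, but the dependent adds leave most slots empty in the words that follow.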

Stream Processing Units

Arranged as 5-way superscalar shader processors
• Co-issue up to 5 scalar FP MAD (Multiply-Add) instructions per clock
• One of the 5 stream processing units handles transcendental instructions as well (SIN, COS, LOG, EXP, etc.)
• 32-bit floating point precision
• Integer and bitwise operation support

Branch execution units handle flow control and conditional operations
• Free stream processing units from handling this task
• Practically eliminate flow control performance overhead

General Purpose Registers store input data, temporary values, and output data

[Diagram: shader processor with Branch Execution Unit and General Purpose Registers]

Counting FLOPS

FLoating point OPerations per Second

What should count as a FLOP?
• Basic operations only (add, multiply)?
• Transcendental operations (sine, cosine, log, power, etc.)?
• Conditional operations (if, compare)?
• Specialized functions (e.g. texture address instructions)?
• Fixed functions (e.g. interpolation)?

Peak theoretical vs. measured

Counting FLOPS

How we count them for GPUs:
• 32-bit floating point MUL & ADD only
• Shader programmable operations only
• General purpose ops only

Rationale
• Practically all math ops can be represented in terms of MUL & ADD
• Fixed function ops don’t allow for fair comparisons with general purpose processors (such as CPUs)
• Specialized ops are typically only usable in limited circumstances, so including them can result in figures that are impossible to achieve in practice

FLOPS Comparison

Processor                 Processing Units   FLOPs per Unit   Clock Speed   Processing Rate
ATI Radeon HD 2900 XT     320                2                742 MHz       475 GigaFLOPS
ATI Radeon HD 2600 XT     120                2                800 MHz       192 GigaFLOPS
ATI Radeon HD 2400 XT     40                 2                700 MHz       56 GigaFLOPS
High-end dual core CPU    8 (4 per core)     2                3000 MHz      48 GigaFLOPS
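The processing rates in the table follow from multiplying the other three columns (one MUL plus one ADD per unit per clock, per the counting rules above). A short sketch of that arithmetic, with `peak_gigaflops` and `ROWS` as illustrative names that are not part of the original material:

```python
def peak_gigaflops(processing_units, flops_per_unit, clock_mhz):
    """Peak rate = units x FLOPs per unit per clock x clock, in GigaFLOPS."""
    return processing_units * flops_per_unit * clock_mhz / 1000.0


# Rows copied from the comparison table: (name, units, FLOPs per unit, MHz).
ROWS = [
    ("ATI Radeon HD 2900 XT", 320, 2, 742),
    ("ATI Radeon HD 2600 XT", 120, 2, 800),
    ("ATI Radeon HD 2400 XT", 40, 2, 700),
    ("High-end dual core CPU", 8, 2, 3000),
]

for name, units, flops, mhz in ROWS:
    print(f"{name}: {peak_gigaflops(units, flops, mhz):.1f} GigaFLOPS")
```

The HD 2900 XT figure works out to 474.88, which the table rounds to 475 GigaFLOPS; the same product without the factor of 2 gives the 237 billion multiply-add operations per second quoted earlier.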

Texture Unit Design

Radeon HD 2900 has 4 Texture Units
• 8 Texture Address Processors each (32 total)
  – Execute shader instructions to control address for texture lookups
• 20 Texture Samplers each (80 total)
  – Can fetch a single data value per clock
• 4 FP Texture Filter Units each (16 total)
  – Can bilinear filter one 64-bit color value per clock, or one 128-bit color value per 2 clocks

HD 2600 and HD 2400 texture units have identical functionality

[Diagram: texture unit with Texture Address Processors, FP32 Texture Samplers and Texture Filter Units, connected through decompress stages to the Vertex Cache, L1 Texture Cache and L2 Texture Cache]
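The per-chip totals and filtering rates above are simple products of the per-unit figures. A small sketch of that arithmetic, where the names `chip_totals` and `filtered_texels_per_clock` are made up for illustration:

```python
# Per-texture-unit counts stated for the Radeon HD 2900.
TEXTURE_UNITS = 4
PER_UNIT = {"address_processors": 8, "samplers": 20, "filter_units": 4}


def chip_totals(units=TEXTURE_UNITS, per_unit=PER_UNIT):
    """Multiply each per-unit count by the number of texture units."""
    return {name: count * units for name, count in per_unit.items()}


def filtered_texels_per_clock(filter_units, bits_per_texel):
    """64-bit values filter at one per clock per filter unit, 128-bit at half that."""
    clocks_per_texel = {64: 1, 128: 2}[bits_per_texel]
    return filter_units / clocks_per_texel


totals = chip_totals()
print(totals)
print(filtered_texels_per_clock(totals["filter_units"], 64))
print(filtered_texels_per_clock(totals["filter_units"], 128))
```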

Texture Unit Design

Multi-level texture cache design
• Large, shared L2 cache stores data retrieved on L1 cache misses
• 256 kB for HD 2900
• 128 kB for HD 2600
• HD 2400 uses single level vertex/texture cache

All texture units can access both vertex cache and L1 texture cache
• Can provide increased throughput for unfiltered texture reads

[Diagram: texture unit with Texture Address Processors, FP32 Texture Samplers and Texture Filter Units, connected through decompress stages to the Vertex Cache, L1 Texture Cache and L2 Texture Cache]

Texture Unit Features

• Full Speed Floating Point Texture Filtering
  – 64-bit HDR textures bilinear filtered at full speed (~7x faster than Radeon X1000 series)
  – 128-bit floating point textures filtered at half speed
  – Trilinear and anisotropic filtering supported for all formats
• Improved high quality anisotropic filtering
  – High quality mode from Radeon X1000 series is now the default setting
  – Improved handling of problematic texture filtering cases
• Depth Stencil Textures (DST) with Percentage Closer Filtering (PCF)
  – High performance soft shadow rendering

Texture Unit Features

• New shared exponent texture format (RGBE 9:9:9:5) for 32-bit HDR
• High resolution texture support
  – Up to 67 megatexels (8192 x 8192)
• Up to two texture fetches per clock per texture unit (1 filtered + 1 unfiltered)
  – Option to get 4 unfiltered fetches in place of 1 filtered fetch (Fetch4)
• Full texturing capability accessible to vertex, geometry, and pixel shader programs
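As an illustration of how a 9:9:9:5 shared exponent format fits HDR color into 32 bits, the sketch below packs three 9-bit mantissas and one 5-bit exponent into a single word. It follows the conventions of the publicly documented RGB9_E5 shared exponent formats (exponent bias of 15, red in the low bits); that this matches the hardware format bit for bit is an assumption, and the function names are hypothetical.

```python
import math

MANTISSA_BITS = 9          # bits per color channel
EXPONENT_BIAS = 15         # assumed bias, as in public RGB9_E5 formats
MAX_BIASED_EXPONENT = 31   # largest value of the 5-bit exponent field
MAX_VALUE = ((2**MANTISSA_BITS - 1) / 2**MANTISSA_BITS
             * 2.0 ** (MAX_BIASED_EXPONENT - EXPONENT_BIAS))


def encode_rgb9e5(r, g, b):
    """Pack three non-negative floats into one 32-bit shared exponent word."""
    rc, gc, bc = (min(max(c, 0.0), MAX_VALUE) for c in (r, g, b))
    max_c = max(rc, gc, bc)
    floor_log2 = math.floor(math.log2(max_c)) if max_c > 0.0 else -EXPONENT_BIAS - 1
    exp = max(-EXPONENT_BIAS - 1, floor_log2) + 1 + EXPONENT_BIAS
    scale = 2.0 ** (exp - EXPONENT_BIAS - MANTISSA_BITS)
    # If rounding pushes the largest mantissa to 512, move up one exponent.
    if math.floor(max_c / scale + 0.5) == 2**MANTISSA_BITS:
        exp += 1
        scale *= 2.0
    rs, gs, bs = (int(math.floor(c / scale + 0.5)) for c in (rc, gc, bc))
    return rs | (gs << 9) | (bs << 18) | (exp << 27)


def decode_rgb9e5(packed):
    """Unpack a 32-bit shared exponent word back into three floats."""
    exp = packed >> 27
    scale = 2.0 ** (exp - EXPONENT_BIAS - MANTISSA_BITS)
    return tuple(((packed >> shift) & 0x1FF) * scale for shift in (0, 9, 18))


print(decode_rgb9e5(encode_rgb9e5(1.0, 0.5, 0.25)))
print(decode_rgb9e5(encode_rgb9e5(1000.0, 1.0, 0.0)))
```

Because all three channels share the exponent of the brightest one, dim channels lose precision next to a very bright channel, which is the trade-off for halving the storage of a 64-bit HDR texel.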

Geometry Performance

Practical polygon throughput is determined by a number of factors
• Vertex fetch rate
• Vertex cache size & efficiency
• Vertex shader performance
• Geometry shader performance
• Geometry amplification capabilities
• Triangle setup rate

HD 2000 series features major improvements in all of these areas
• Fetch up to 16 vertices per clock
• Up to 8x increase in vertex cache size vs. X1000 series
• Unified architecture can increase available vertex shader processing power by up to 10x vs. X1000 series
• Geometry shader performance shown to be up to 50x faster than competing implementations
• Programmable tessellation unit for accelerated geometry amplification
• Setup 1 triangle per clock cycle

Memory Read/Write Cache

• Virtualizes register space
  – Allows overflow to graphics memory
  – Can be read from or written to by any SIMD (texture & vertex caches are read-only)
  – Can export data to stream out buffer
• Stream Out
  – Allows shader output to bypass render back-ends and color buffer
  – Outputs sequential stream of data instead of bitmaps
• Uses include:
  – Interthread communication
  – Render to vertex buffer
  – Overflow storage for GS data (since it can generate widely variable amounts of output data)

Render Back-Ends

Double rate depth/stencil test
• 32 pixels per clock for HD 2900
• 8 pixels per clock for HD 2600 & HD 2400

Fast Post-Processing Effects
• Render-to-texture much more efficient than previous chips

MSAA resolve functionality is programmable
• Makes Custom Filter AA possible

New blendable surface formats
• Allows new DX10 formats to be displayable
  – 128-bit floating point format
  – 11:11:10 floating point format

MRT (Multiple Render Target) support
• Up to 8 MRTs (double Radeon X1000 series) with MSAA support

[Diagram: render back-end with Alpha/Fog, Depth/Stencil, Programmable Blend and MSAA Resolve stages; Z/Stencil Cache and Color Cache, each with Decompress and Compress blocks]

Depth, Stencil, and Compression Improvements

• Improved Z & Stencil Compression
  – Up to 16:1 in standard mode (vs. 8:1 in previous chips), up to 128:1 with 8x MSAA
  – Z & Stencil now compressed separately from each other for better efficiency
  – Compression information stored in graphics memory and cached on-chip – allows compression to be used at very high resolutions (previous generation was limited to 5 megapixels)
• Z Range optimization
  – Limit depth test operations to a programmable depth range (useful for speeding up stencil shadowing)
• Re-Z
  – Can check Z buffer twice – once before pixel shader, and again after
  – Allows early Z before shading in all cases
• Improved Hierarchical Z Buffer
  – Adds Hierarchical Stencil (HiS) for better stencil shadow performance
  – Handles most situations where it had to be disabled in the past
• 32-bit Floating Point Z-Buffer support
  – Provides greater precision than previous 24-bit format

Stencil Shadow Performance

[Chart: ATI Radeon HD 2000 Series vs. ATI Radeon X1000 Series, Stencil Shadow Cases. Radeon HD 2900 XT relative to Radeon X1950 XTX, vertical axis 80% to 200%, with per-bar data labels. Benchmarks: 3DMark03 GT2 (10x7), 3DMark03 GT3 (10x7), FEAR Soft Shadows (16x12, 25x16), Doom 3 (16x12, 25x16)]

Anti-Aliasing

Recap of existing ATI Radeon AA technologies:
• Multisampling
• Programmable sample patterns
• Gamma correct resolve
• Temporal AA
• Adaptive Supersampling/Multisampling
• Super AA (CrossFire)

All of these features are still available with the HD 2000 series, plus a new one – Custom Filter Anti-Aliasing

Building a Better Filter

Anti-aliasing is effectively a post-processing filter applied to each rendered frame
• Designed to remove high frequencies from the video “signal”
• These appear as jagged edges and shimmering

Most hardware relies on simple box filter
• Restricted to pixel boundaries, fixed weights
• Diminishing returns from adding more samples
• The best anti-aliasing filters are not subject to these constraints

How we can do better
• Sampling from outside pixel boundaries
• Non-uniform sample weights
• Filter kernels that adapt to the characteristics of each pixel
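The difference between a box filter and a filter with non-uniform weights that reaches past the pixel boundary can be sketched as a weighted average over samples. The tent shape, its radius, and the sample positions below are hypothetical and chosen only to show the idea; the actual CFAA kernels are not specified in these slides.

```python
import math


def box_weight(dx, dy):
    """Equal weight inside the pixel, zero outside (offsets in pixel units)."""
    return 1.0 if abs(dx) <= 0.5 and abs(dy) <= 0.5 else 0.0


def tent_weight(radius):
    """Weight falls off linearly with distance and reaches zero at `radius`."""
    def weight(dx, dy):
        return max(0.0, 1.0 - math.hypot(dx, dy) / radius)
    return weight


def resolve(samples, weight_fn):
    """Weighted average of ((dx, dy), value) samples around a pixel center."""
    total = sum(weight_fn(dx, dy) for (dx, dy), _ in samples)
    return sum(weight_fn(dx, dy) * value for (dx, dy), value in samples) / total


# Hypothetical samples: two covered samples inside the pixel and two
# uncovered samples just outside it (a nearby edge the box filter ignores).
SAMPLES = [((0.25, 0.25), 1.0), ((-0.25, -0.25), 1.0),
           ((0.75, 0.0), 0.0), ((-0.75, 0.0), 0.0)]

print(resolve(SAMPLES, box_weight))        # uses only the two inner samples
print(resolve(SAMPLES, tent_weight(1.0)))  # blends in the outer samples too
```

With the box filter the pixel resolves to full coverage, while the tent filter produces an intermediate value because it takes the neighboring samples into account with smaller weights.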

Custom Filter Anti-Aliasing (CFAA)

[Image: Standard 8x MSAA (Box Filter)]

Custom Filter Anti-Aliasing (CFAA)

[Image: 12x CFAA Narrow Tent Filter]

Custom Filter Anti-Aliasing (CFAA)

[Image: 16x CFAA Wide Tent Filter]

Custom Filter Anti-Aliasing (CFAA)

Adaptive Edge Detect Filter
• Performs edge detection pass on rendered image
• Edge pixels resolved using more samples along direction of edge with high quality filter
• Other pixels resolved using fewer samples and box filter

Benefits
• Provides excellent edge smoothing where it’s needed the most
• Reduces texture shimmering
• Avoids blurring of fine detail (e.g. small text)
• Provides better quality per sample than supersampling, with better performance

Nice Properties of CFAA

• Software upgradeable
• Can be used to enhance in-game AA settings for most DirectX 9 titles
• Works together with all other ATI Radeon AA features
• Works with HDR
• Works with stencil shadows
• More samples per pixel than MSAA without increasing memory footprint
• Potential image quality limited only by performance

Anti-Aliasing Quality Comparison

Images captured from Half-Life 2 by Valve Software

[Images: No AA, 2x MSAA, 4x CFAA, 4x MSAA, 6x CFAA, 8x MSAA, 12x CFAA, 16x CFAA]

Anti-Aliasing Quality Comparison

Custom Filter AA (ATI Radeon HD 2000 series) vs. Coverage Sample AA (Nvidia GeForce 8 series)

[Images: 8x CFAA vs. 8x CSAA]

• CFAA provides better edge quality per sample
• CFAA can scale to more than 16 samples per pixel
• CFAA works better on silhouette edges (where there are many small, intersecting polygons per pixel)
• CFAA works on stencil shadows
• CFAA is not limited to polygon edges

Anti-Aliasing Quality Comparison

Custom Filter AA (ATI Radeon HD 2000 series) Advantages vs. Coverage Sample AA (Nvidia GeForce 8 series)

[Images: 16x CFAA vs. 16x CSAA]

Memory Interface

Why is it so hard to just widen the memory interface?

• Requires more I/O pads around the edge of the chip
• I/O pads tend to scale poorly with process shrinks
• If interface is too wide, chip area becomes pad limited
• Makes GPU too large to produce at a reasonable cost

How did we get to 512 bits on the HD 2900?
• New, compact, stacked I/O pad design
• Double the I/O density of previous designs
• Let us pack 512 bits into the same area as 256 bits using previous I/O design

[Diagram: ring bus with memory ring stops and a PCI Express ring stop]

Memory Interface

Benefits of a 512-bit interface
• More bandwidth with existing memory technology
• Lower memory clock required to achieve target bandwidth
• Improved cost:bandwidth ratio

Benefits of the ring bus
• Simplifies routing to improve scalability
• Reduces wire delay
• Reduces number of repeaters required

[Diagram: ring bus with memory ring stops and a PCI Express ring stop]
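The claim that a wider interface lowers the memory clock needed for a given bandwidth is a one-line calculation: bandwidth equals bus width in bytes times the per-pin transfer rate. The sketch below uses the 100 GB/sec figure quoted earlier as the target; the function names are illustrative and the resulting rates are derived numbers, not specifications from the slides.

```python
def bandwidth_gb_per_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/sec for a bus of the given width and transfer rate."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


def required_transfer_rate(bus_width_bits, target_gb_per_s):
    """Transfers per second per pin needed to reach a bandwidth target."""
    return target_gb_per_s * 1e9 / (bus_width_bits / 8)


TARGET_GB_PER_S = 100  # "Over 100 GB/sec memory bandwidth"

for width in (256, 512):
    rate = required_transfer_rate(width, TARGET_GB_PER_S)
    print(f"{width}-bit bus: {rate / 1e9:.4f} billion transfers per second per pin")
```

Doubling the width from 256 to 512 bits halves the required transfer rate, which is why the wider bus can reach the target with existing memory technology.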

Memory Controller

Fully Distributed vs. Crossbar Designs
• Crossbars have to be redesigned for each new product
• Crossbars become exponentially more complex as more channels are added
• Fully distributed design allows memory channels to be added or removed as required with minimal effort

Questions

[email protected]
