CONFIDENTIAL
ATI Radeon™ HD 2000 Series Technology Overview
Richard Huddy, Worldwide DevRel Manager, AMD Graphics Products Group
Introducing the ATI Radeon™ HD 2000 Series
ATI Radeon™ HD 2900 Series – Enthusiast
ATI Radeon™ HD 2600 Series – Mainstream
ATI Radeon™ HD 2400 Series – Value
ATI Radeon HD™ 2000 Series Highlights
Technology leadership
• Highest clock speeds – up to 800 MHz
• Highest transistor density – up to 700 million transistors
• Lowest power for mobile
Cutting-edge image quality features
• Advanced anti-aliasing and texture filtering capabilities
• Fast High Dynamic Range rendering
• Programmable Tessellation Unit
2nd generation unified architecture
• Superscalar design with up to 320 stream processing units
• Optimized for Dynamic Game Computing and Accelerated Stream Processing
ATI Avivo™ HD technology
• Delivering the ultimate HD video experience
• HD display and audio connectivity
DirectX® 10
• Massive shader and geometry processing performance
• Enabling the next generation of visual effects
Native CrossFire™ technology
• Superior multi-GPU support
The March to Reality
[Figure: product timeline showing Radeon, Radeon 8500, Radeon 9700, Radeon 9800, Radeon X800, Radeon X1800, Radeon X1950, Radeon HD 2900]
2nd Generation Unified Shader Architecture
• Development from proven and successful “Xenos” design (XBOX 360 graphics)
  – New dispatch processor handling thousands of simultaneous threads
• Up to 320 discrete, independent stream processing units
  – In comparison, ATI Radeon X19xx family had 48 vector + 48 scalar processing units
• Superscalar ALU implementation
  – Dedicated branch execution units and texture address processors
• Full support for DirectX 10.0, Shader Model 4.0
  – Dynamic shader load balancing between vertex, geometry and pixel shader operations
  – Handled automatically by hardware scheduler
[Figure: GPU block diagram. Labels: Command Processor; Setup Engine (Programmable Tessellator, Vertex Assembler, Geometry Assembler, Scan Converter / Rasterizer, Interpolators, Hierarchical Z); Vertex Index Fetch; Ultra-Threaded Dispatch Processor; Shader Instruction Cache; Shader Constant Cache; Stream Processing Units; Texture Units; L1 Texture Cache; L2 Texture Cache; Vertex Cache; Memory Read/Write Cache; Stream Out Buffer; Z/Stencil Cache; Shader Export; Render Back-Ends; Color Cache]
Shader Processor Progression
Design            Products                                   Throughput             Width
Vector            Radeon and earlier                         1 instruction/clk      4 components
Vector + Scalar   Radeon 9600, 9700, 9800, X series          2 instructions/clk     3+1 or 4+1 components
Superscalar       Radeon HD 2000 series                      5 instructions/clk     5 components
Shader Processors
[Figure: 5-way superscalar shader processor shown within the GPU block diagram. Labels: Stream Processing Units; Branch Execution Unit; General Purpose Registers]
Pushing a TeraFLOP
475 GigaFLOPS per GPU
• 237 billion single precision floating point multiply-add operations per second
• Real, measurable performance – not just theoretical
950 GigaFLOPS in a CrossFire configuration
• Tera-scale computing is possible today on your desktop
Unprecedented compute density
• More than 1 GigaFLOP per mm²
• Less than $1 per GigaFLOP
• Over 3.4 GigaFLOPS per Watt
Texture Unit Features
Full speed floating point texture filtering
• 64-bit HDR textures bilinear filtered at full speed (~7x faster than Radeon X1000 series) [1]
• 128-bit floating point textures filtered at half speed
• New compact 32-bit HDR shared exponent texture format (RGBE 9:9:9:5)
• Trilinear and anisotropic filtering supported for all formats
Improved high quality anisotropic filtering
Percentage Closer Filtering (PCF) for enhanced shadow rendering
High resolution texture support
• Up to 64 MTexels (8192 x 8192)
Full texturing capability accessible to vertex, geometry, and pixel shader programs
Bandwidth Drives Performance
[Figure: chart of Performance (3DMark03 score) against Memory Bandwidth (GB/sec), covering DDR, DDR2, GDDR3 and GDDR4 memory on 128-bit, 256-bit and 512-bit interfaces]
Memory Controller Progression
Design                                   Products
Centralized Crossbar                     ATI Radeon X850 & earlier + all competing GPUs to date
Partially Distributed Hybrid Ring Bus    ATI Radeon X1000 Series
Fully Distributed Ring Bus               ATI Radeon HD 2000 Series
Massive Bandwidth
ATI Radeon HD 2900 memory controller
• World’s first 512-bit memory interface
• Designed for full performance HDR rendering
• 1024-bit ring bus (512-bit read + 512-bit write)
Highlights
• Over 100 GB/sec memory bandwidth
• Eight 64-bit memory channels
• Kilobit ring bus
• Fully distributed design - no central hub
• Simplified layout, highly scalable
[Figure: memory controller diagram. Labels: GDDR3/4 DRAM; 64-bit memory channel; Sequencer; Arbiter; Ring Stop; PCI Express Ring Stop; Crossbar Mux; Read and Write memory client interfaces]
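The bandwidth figure above follows from the bus width and the memory data rate. The sketch below shows the arithmetic; the slides do not state the memory data rate, so `ASSUMED_TRANSFERS_PER_SEC` is a purely illustrative value, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(channels, channel_bits, transfers_per_sec):
    """Peak bandwidth = total bus width in bytes x transfers per second."""
    bus_bytes = channels * channel_bits // 8
    return bus_bytes * transfers_per_sec / 1e9

# ASSUMPTION: 1.65 billion transfers/s is an illustrative data rate,
# not a number taken from the slides.
ASSUMED_TRANSFERS_PER_SEC = 1.65e9

# Eight 64-bit channels form the 512-bit interface described above.
hd2900_bandwidth = peak_bandwidth_gb_s(8, 64, ASSUMED_TRANSFERS_PER_SEC)

# Data rate a 512-bit bus would need to reach exactly 100 GB/sec.
rate_for_100_gb_s = 100e9 / (8 * 64 // 8)
```

With a 64-byte-wide bus, any data rate above roughly 1.56 billion transfers per second is enough to pass 100 GB/sec, which is why a wide interface can reach the target with a lower memory clock.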
High Dynamic Range Performance
[Figure: bar chart "ATI Radeon HD 2000 Series vs. ATI Radeon X1000 Series High Dynamic Range Performance", comparing Radeon X1950 XTX and Radeon HD 2900 XT on a relative scale from 80% to 240%. Tests: Far Cry HDR (16x12, 25x16); 3DMark06 HDR1 (12x10); 3DMark06 HDR2 (12x10); Serious Sam 2 HDR (16x12, 25x16); Elder Scrolls IV: Oblivion (16x12, 25x16)]
Geometry Performance
Large vertex cache
• 8x larger than Radeon X1950 for improved vertex fetch performance
Fast, full-featured Vertex Texture Fetch
• Uses same texture units as pixel shaders
All shader processors can perform vertex and/or geometry processing if necessary
• Up to 10x the vertex processing power of Radeon X1950 available on demand
• Up to 50x the geometry processing power of the fastest competing DirectX 10 GPUs [3]
Tessellation
All ATI Radeon HD 2000 series GPUs feature a new programmable tessellation unit
• Based on Xbox 360 technology
• Provides highly effective geometry data compression
• Orders of magnitude faster than CPU-based or geometry shader-based tessellation
Enables:
• More detailed animation
• More realistic characters
• Complex terrain
• More sophisticated shader effects
CrossFire
All ATI Radeon HD 2000 series GPUs feature native CrossFire support
Simplified CrossFire experience
• Easy plug-and-play setup
• No special master cards required
• Integrated compositing engine
• New AFR detect algorithm - intelligent mode selection for best scalability
Most immersive and most responsive gaming experience
High bandwidth dual-link GPU interconnect
• Supports display resolutions up to 2560x2048 @ 60Hz
• Built for future scalability (>2 GPUs)
CrossFire Rendering Modes
• Alternate Frame Rendering
• SuperTile
• Scissor
• Super AA
[Figure: rendering mode diagram. Labels: GPU_0; GPU_1; Frame_0; Frame_1; Display]
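The first three rendering modes differ in how work is divided between the GPUs. The following is a minimal sketch, assuming two GPUs, a hypothetical 32-pixel tile size for SuperTile, and an even horizontal split for Scissor; none of these parameters or function names come from the slides. Super AA, where the GPUs render the same frame and the results are combined, is not shown.

```python
def afr_gpu(frame_index, num_gpus=2):
    """Alternate Frame Rendering: whole frames alternate between GPUs."""
    return frame_index % num_gpus

def supertile_gpu(x, y, tile=32, num_gpus=2):
    """SuperTile: the frame is cut into square tiles handed out in a
    checkerboard pattern. The tile size of 32 pixels is an assumption."""
    return ((x // tile) + (y // tile)) % num_gpus

def scissor_gpu(y, height, num_gpus=2):
    """Scissor: the frame is split into horizontal bands, one per GPU.
    An even split is assumed here."""
    return min(y * num_gpus // height, num_gpus - 1)
```

For example, `afr_gpu(3)` sends frame 3 to GPU 1, while `scissor_gpu(100, 1600)` keeps a pixel on row 100 of a 1600-row frame on GPU 0.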
CrossFire Performance
[Figure: bar chart "ATI Radeon HD 2900 XT CrossFire Scaling", comparing Radeon HD 2900 XT and Radeon HD 2900 XT CrossFire on a relative scale from 80% to 200%. Tests at 25x16, most with 4xAA and 8xAF: 3DMark05; 3DMark06; Company of Heroes; Call of Duty 2; Doom 3; Far Cry; FEAR; Half Life 2; Half Life 2: EP1; Half Life 2: LC; Prey; Serious Sam 2; Splinter Cell: CT; Stalker]
ATI Avivo™ HD Technology
Dedicated silicon for accelerated HD video decode and processing
• UVD – Unified Video Decoder
• AVP – Advanced Video Processor
Leading video playback quality
• Up to 128 out of 130 on HQV video quality test
Dual-link outputs with HDMI & HDCP
• First products to support high resolution HDMI displays [4]
On-chip HD audio controller
• AC3 5.1 surround-sound output over HDMI
Cutting Edge Process Technology
ATI Radeon HD 2900
• 700 million transistors at 750 MHz
• Uses unique TSMC 80nm process (80HS)
• Optimized for high clock speeds
ATI Radeon HD 2600 & HD 2400
• Use unique TSMC 65nm process (65G+)
• Optimized for power efficiency
Both processes target aggressive transistor density
[Figure: three relative bar charts. Transistor Speed (90GT, 80HS); Gates per mm² (90GT, 80GT, 80HS, 65G+); Leakage power per mm² (80GT, 65G+)]
Unified Architecture (Painful) Detail
• Command Processor
• Setup Engine
• Ultra-Threaded Dispatch Processor
• Stream Processing Units
• Texture Units & Caches
• Memory Read/Write Cache & Stream Out Buffer
• Shader Export
• Render Back-Ends
• Memory controller
ATI Radeon HD 2900
• 320 Stream Processing Units
• 4 SIMDs
• 4 Texture Units
• 4 Render Back-Ends
[Figure: ATI Radeon HD 2900 block diagram]
ATI Radeon HD 2600 & HD 2400
ATI Radeon HD 2600
• 120 Stream Processing Units
• 3 SIMDs
• 2 Texture Units
• 1 Render Back-End
ATI Radeon HD 2400
• 40 Stream Processing Units
• 2 SIMDs
• 1 Texture Unit
• 1 Render Back-End
• Shared vertex/texture cache
[Figure: ATI Radeon HD 2600 and HD 2400 block diagrams]
Unified Shader Architecture
Making better use of compute resources to render more complex and realistic images
Unified Shader Definitions
Shader – Program consisting of a set of instructions that determine the characteristics of a vertex, primitive, or pixel
Thread – Set of vertices, polygons, or pixels that form a unit of work on which a single instance of a shader program is executed
Flow Control – Special instructions in a shader program that control the execution of instructions based on specified conditions
GPR – General Purpose Register to store shader initial, intermediate, and output data during shader execution
Stream Processing Unit – Hardware block that executes mathematical operations on a stream of input data
SIMD – A group of stream processing units executing a single instruction on multiple data items (vertices, primitives, pixels).
Command Processor
• Processes command stream from graphics driver
  – Executes microcode with memory access
• Performs some state validation
  – Offloads this task from CPU
  – Small batch improvements
  – Up to 30% reduction in driver CPU overhead
  – Applies to both DirectX 9 and DirectX 10 applications
Setup Engine
Prepares data for processing by the stream processing units
Consists of three different functions:
• Vertex assembly and tessellation (for vertex shaders)
• Geometry assembly (for geometry shaders)
• Scan conversion and interpolation (for pixel shaders)
Each function can submit threads to the dispatch processor
Ultra-Threaded Dispatch Processor
Separate command queues for each shader type
• Fill with threads waiting for execution
• Each thread consists of a number of instructions that will operate on a block of input data
Arbiter units determine next thread to process, based on a variety of tracked parameters
• Initial arbiter to select which thread to submit
• Two arbiter units per SIMD array
  – Allows each SIMD to be pipelined, with two operations at a time in process
• Dedicated arbiter units for texture and vertex fetches
  – Can be scheduled independently from math operations
• Executing threads can be bumped at any time if a higher priority thread is pulled from the command queues
  – Temporary data saved so thread can resume later
[Figure: dispatch processor diagram. Labels: Setup Engine (Vertex Assembler, Geometry Assembler, Interpolators); Vertex Shader, Geometry Shader and Pixel Shader Command Queues holding VS, GS and PS threads; Arbiters and Sequencers per SIMD array; Texture Fetch and Vertex Fetch Arbiters and Sequencers; four SIMD Arrays of 80 Stream Processing Units each; Shader Instruction Cache; Shader Constant Cache; Vertex/Texture Cache; Texture Units]
Ultra-Threaded Dispatch Processor
Dedicated shader caches
• Instruction cache allows unlimited shader length
• Constant cache allows unlimited number of constants
• Both caches take advantage of data re-use to reduce state change overhead and improve efficiency
Latency hiding
• Vertex, texture and shader I/O fetches that result in a cache miss may require hundreds of cycles to return data from graphics memory
• As soon as a thread is forced to wait for data, it is suspended and a new thread begins executing immediately
• Suspended threads remain in the command queues until their requested data arrives
• Ultra-threaded dispatch processor can queue up hundreds of threads to make sure the SIMD arrays are never sitting idle
[Figure: dispatch processor diagram with command queues, arbiters, sequencers and four SIMD arrays of 80 Stream Processing Units each]
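The latency-hiding behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of the actual scheduler: one SIMD runs one thread per cycle, every thread does a fixed number of math cycles and then waits a fixed number of cycles for a fetch, and switching threads is free. The function name and the cycle counts in the example are hypothetical.

```python
def simd_utilization(num_threads, alu_cycles, fetch_latency, total_cycles=10_000):
    """Fraction of cycles in which the SIMD has a runnable thread."""
    ready_at = [0] * num_threads             # cycle at which each thread's data arrives
    remaining = [alu_cycles] * num_threads   # math cycles left before the next fetch
    busy = 0
    for cycle in range(total_cycles):
        # Pick the first thread that is not waiting on a fetch.
        runnable = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not runnable:
            continue                         # every thread is waiting: the SIMD idles
        t = runnable[0]
        busy += 1
        remaining[t] -= 1
        if remaining[t] == 0:
            # Thread issues a fetch and is suspended until the data returns.
            remaining[t] = alu_cycles
            ready_at[t] = cycle + 1 + fetch_latency
    return busy / total_cycles

# ASSUMPTION: 4 math cycles per fetch and a 196-cycle miss latency are
# illustrative values only.
one_thread = simd_utilization(1, 4, 196)
fifty_threads = simd_utilization(50, 4, 196)
```

In this model a single thread keeps the SIMD busy only a small fraction of the time, while enough queued threads cover the whole fetch latency, which is the motivation for queueing hundreds of threads.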
SIMD Arrays
Single Instruction, Multiple Data
• Execute the same instruction thread on multiple data elements in parallel
• Very Large Instruction Word (VLIW) design
  – Each instruction word can include up to 6 independent, co-issued operations (5 math + 1 flow control)
  – All operations are performed in parallel on each data element in the current thread
• Texture fetch and vertex fetch instructions are issued and executed separately
  – Allows fetches to begin executing before the requested data is required by the shader
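To illustrate what co-issuing up to 5 independent math operations means, here is a minimal greedy packer. It is a sketch only: the slides do not describe how the shader compiler or hardware actually forms instruction words, and the `(dest, sources)` representation is a hypothetical one. Operands are assumed to be read before any result of the same instruction word is written.

```python
def pack_vliw(ops, width=5):
    """Pack scalar ops, given in program order as (dest, sources) pairs,
    into instruction words of at most `width` independent math ops."""
    bundles, current, written = [], [], set()
    for dest, sources in ops:
        # An op cannot share a word with the op that produces one of its inputs,
        # or with another op writing the same destination.
        depends = dest in written or any(s in written for s in sources)
        if depends or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

# 4-component dot product: four independent multiplies, then a tree of adds.
dot4 = [
    ("m0", ("a.x", "b.x")), ("m1", ("a.y", "b.y")),
    ("m2", ("a.z", "b.z")), ("m3", ("a.w", "b.w")),
    ("s0", ("m0", "m1")), ("s1", ("m2", "m3")),
    ("r", ("s0", "s1")),
]
dot4_words = pack_vliw(dot4)
```

The seven scalar operations fit into three instruction words here; dependent operations, not the 5-wide limit, are what prevent further packing.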
Stream Processing Units
Arranged as 5-way superscalar shader processors
• Co-issue up to 5 scalar FP MAD (Multiply-Add) instructions per clock
• One of the 5 stream processing units handles transcendental instructions as well (SIN, COS, LOG, EXP, etc.)
• 32-bit floating point precision
• Integer and bitwise operation support
Branch execution units handle flow control and conditional operations
• Free stream processing units from handling this task
• Practically eliminate flow control performance overhead
General Purpose Registers store input data, temporary values, and output data
[Figure: shader processor diagram. Labels: Stream Processing Units; Branch Execution Unit; General Purpose Registers]
Counting FLOPS
FLoating point OPerations per Second
What should count as a FLOP?
• Basic operations only (add, multiply)?
• Transcendental operations (sine, cosine, log, power, etc.)?
• Conditional operations (if, compare)?
• Specialized functions (e.g. texture address instructions)?
• Fixed functions (e.g. interpolation)?
Peak theoretical vs. measured
Counting FLOPS
How we count them for GPUs:
• 32-bit floating point MUL & ADD only
• Shader programmable operations only
• General purpose ops only
Rationale
• Practically all basic math ops can be represented in terms of MUL & ADD
• Fixed function ops don’t allow for fair comparisons with general purpose processors (such as CPUs)
• Specialized ops are typically only usable in limited circumstances, so including them can result in figures that are impossible to achieve in practice
FLOPS Comparison
Processor                  Processing Units   FLOPs per Unit   Clock Speed   Processing Rate
ATI Radeon HD 2900 XT      320                2                742 MHz       475 GigaFLOPS
ATI Radeon HD 2600 XT      120                2                800 MHz       192 GigaFLOPS
ATI Radeon HD 2400 XT      40                 2                700 MHz       56 GigaFLOPS
High-end dual core CPU     8 (4 per core)     2                3000 MHz      48 GigaFLOPS
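Each processing rate in the table is the product of the other three columns. A short sketch of that arithmetic, using only the table's own numbers (the function and variable names are hypothetical):

```python
def peak_gflops(units, flops_per_unit, clock_mhz):
    """Peak rate = units x FLOPs per unit per clock x clock, in GigaFLOPS."""
    return units * flops_per_unit * clock_mhz / 1000.0

# (processing units, FLOPs per unit, clock in MHz) taken from the table above.
table = {
    "ATI Radeon HD 2900 XT": (320, 2, 742),
    "ATI Radeon HD 2600 XT": (120, 2, 800),
    "ATI Radeon HD 2400 XT": (40, 2, 700),
    "High-end dual core CPU": (8, 2, 3000),
}
rates = {name: peak_gflops(*row) for name, row in table.items()}
```

The factor of 2 FLOPs per unit reflects counting a multiply-add as one MUL plus one ADD, in line with the counting rules stated earlier.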
Texture Unit Design
Radeon HD 2900 has 4 Texture Units
• 8 Texture Address Processors each (32 total)
  – Execute shader instructions to control address for texture lookups
• 20 Texture Samplers each (80 total)
  – Can fetch a single data value per clock
• 4 FP Texture Filter Units each (16 total)
  – Can bilinear filter one 64-bit color value per clock, or one 128-bit color value per 2 clocks
HD 2600 and HD 2400 texture units have identical functionality
[Figure: texture unit diagram. Labels: Texture Address Processors; FP32 Texture Samplers; Texture Filter Units; Decompress; L1 Texture Cache; L2 Texture Cache; Vertex Cache]
Texture Unit Design
Multi-level texture cache design
• Large, shared L2 cache stores data retrieved on L1 cache misses
  – 256 kB for HD2900
  – 128 kB for HD2600
  – HD2400 uses single level vertex/texture cache
All texture units can access both vertex cache and L1 texture cache
• Can provide increased throughput for unfiltered texture reads
[Figure: texture unit diagram. Labels: Texture Address Processors; FP32 Texture Samplers; Texture Filter Units; Decompress; L1 Texture Cache; L2 Texture Cache; Vertex Cache]
Texture Unit Features
• Full Speed Floating Point Texture Filtering
  – 64-bit HDR textures bilinear filtered at full speed (~7x faster than Radeon X1000 series)
  – 128-bit floating point textures filtered at half speed
  – Trilinear and anisotropic filtering supported for all formats
• Improved high quality anisotropic filtering
  – High quality mode from Radeon X1000 series is now the default setting
  – Improved handling of problematic texture filtering cases
• Depth Stencil Textures (DST) with Percentage Closer Filtering (PCF)
  – High performance soft shadow rendering
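Percentage Closer Filtering compares each neighbouring shadow-map texel against the reference depth first and then filters the pass/fail results, rather than filtering depths. The sketch below is the textbook 2x2 bilinear form, shown only to illustrate the idea; the slides do not describe the hardware implementation, and the function name and coordinate convention (texel centres at integer coordinates) are assumptions.

```python
import math

def pcf_bilinear(shadow_map, u, v, ref_depth):
    """Return the lit fraction (0.0 to 1.0) at texel coordinates (u, v).
    shadow_map is a 2-D list of depths indexed as shadow_map[y][x]."""
    h, w = len(shadow_map), len(shadow_map[0])
    u = min(max(u, 0.0), w - 1)
    v = min(max(v, 0.0), h - 1)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0

    def lit(x, y):
        # Depth comparison happens per texel, before any filtering.
        return 1.0 if ref_depth <= shadow_map[y][x] else 0.0

    top = lit(x0, y0) * (1 - fx) + lit(x1, y0) * fx
    bottom = lit(x0, y1) * (1 - fx) + lit(x1, y1) * fx
    return top * (1 - fy) + bottom * fy

# Left column of the map holds a nearer occluder (depth 0.5).
example_map = [[0.5, 1.0],
               [0.5, 1.0]]
edge_value = pcf_bilinear(example_map, 0.25, 0.0, 0.8)
```

Filtering the comparison results gives a smooth transition across the shadow edge, which is what produces the soft shadow appearance.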
Texture Unit Features
• New shared exponent texture format (RGBE 9:9:9:5) for 32-bit HDR
• High resolution texture support
  – Up to 67 megatexels (8192 x 8192)
• Up to two texture fetches per clock per texture unit (1 filtered + 1 unfiltered)
  – Option to get 4 unfiltered fetches in place of 1 filtered fetch (Fetch4)
• Full texturing capability accessible to vertex, geometry, and pixel shader programs
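A shared exponent format stores three 9-bit mantissas and one 5-bit exponent common to all three channels in 32 bits. The sketch below assumes the layout and rounding rules of the OpenGL `EXT_texture_shared_exponent` extension (red in the low bits, exponent in the top 5 bits, bias 15); the slides only give the 9:9:9:5 split, so treat those details and the function names as assumptions.

```python
import math

N, B, E_MAX = 9, 15, 31                            # mantissa bits, exponent bias, max exponent
MAX_VALUE = (2**N - 1) / 2**N * 2**(E_MAX - B)     # largest representable value, 65408.0

def encode_rgb9e5(r, g, b):
    """Pack three non-negative floats into one 32-bit shared exponent word."""
    rc, gc, bc = (min(max(c, 0.0), MAX_VALUE) for c in (r, g, b))
    max_c = max(rc, gc, bc)
    # floor(log2(max_c)) via frexp, which avoids rounding problems in log2.
    floor_log2 = math.frexp(max_c)[1] - 1 if max_c > 0 else -B - 1
    exp = max(-B - 1, floor_log2) + 1 + B
    scale = 2.0 ** (exp - B - N)
    if math.floor(max_c / scale + 0.5) == 2**N:    # mantissa would overflow 9 bits
        exp += 1
        scale *= 2.0
    rs, gs, bs = (int(math.floor(c / scale + 0.5)) for c in (rc, gc, bc))
    return (exp << 27) | (bs << 18) | (gs << 9) | rs

def decode_rgb9e5(packed):
    """Unpack a 32-bit shared exponent word into (r, g, b) floats."""
    scale = 2.0 ** ((packed >> 27) - B - N)
    return tuple(((packed >> shift) & 0x1FF) * scale for shift in (0, 9, 18))

round_trip = decode_rgb9e5(encode_rgb9e5(1.0, 0.5, 0.25))
```

Because the exponent is chosen from the largest channel, channels much darker than the brightest one lose precision, which is the trade-off for fitting HDR colour into 32 bits.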
Geometry Performance
Practical polygon throughput is determined by a number of factors
• Vertex fetch rate
• Vertex cache size & efficiency
• Vertex shader performance
• Geometry shader performance
• Geometry amplification capabilities
• Triangle setup rate
HD 2000 series feature major improvements in all of these areas
• Fetch up to 16 vertices per clock
• Up to 8x increase in vertex cache size vs. X1000 series
• Unified architecture can increase available vertex shader processing power by up to 10x X1000 series
• Geometry shader performance shown to be up to 50x faster than competing implementations
• Programmable tessellation unit for accelerated geometry amplification
• Setup 1 triangle per clock cycle
Memory Read/Write Cache
• Virtualizes register space
  – Allows overflow to graphics memory
  – Can be read from or written to by any SIMD (texture & vertex caches are read-only)
  – Can export data to stream out buffer
• Stream Out
  – Allows shader output to bypass render back-ends and color buffer
  – Outputs sequential stream of data instead of bitmaps
• Uses include:
  – Interthread communication
  – Render to vertex buffer
  – Overflow storage for GS data (since it can generate widely variable amounts of output data)
Render Back-Ends
Double rate depth/stencil test
• 32 pixels per clock for HD 2900
• 8 pixels per clock for HD 2600 & HD 2400
Fast Post-Processing Effects
• Render-to-texture much more efficient than previous chips
MSAA resolve functionality is programmable
• Makes Custom Filter AA possible
New blendable surface formats
• Allows new DX10 formats to be displayable
  – 128-bit floating point format
  – 11:11:10 floating point format
MRT (Multiple Render Target) support
• Up to 8 MRTs (double Radeon X1000 series) with MSAA support
[Figure: render back-end diagram. Labels: Alpha/Fog; Depth/Stencil; Decompress; Compress; Z/Stencil Cache; Programmable Blend; MSAA Resolve; Color Cache]
Depth, Stencil, and Compression Improvements
• Improved Z & Stencil Compression
  – Up to 16:1 in standard mode (vs. 8:1 in previous chips), up to 128:1 with 8x MSAA
  – Z & Stencil now compressed separately from each other for better efficiency
  – Compression information stored in graphics memory and cached on-chip – allows compression to be used at very high resolutions (previous generation was limited to 5 megapixels)
• Z Range optimization
  – Limit depth test operations to a programmable depth range (useful for speeding up stencil shadowing)
• Re-Z
  – Can check Z buffer twice – once before pixel shader, and again after
  – Allows early Z before shading in all cases
• Improved Hierarchical Z Buffer
  – Adds Hierarchical Stencil (HiS) for better stencil shadow performance
  – Handles most situations where it had to be disabled in the past
• 32-bit Floating Point Z-Buffer support
  – Provides greater precision than previous 24-bit format
Stencil Shadow Performance
[Figure: bar chart "ATI Radeon HD 2000 Series vs. ATI Radeon X1000 Series Stencil Shadow Cases", comparing Radeon X1950 XTX and Radeon HD 2900 XT on a relative scale from 80% to 200%. Tests: 3DMark03 GT2 (10x7); 3DMark03 GT3 (10x7); FEAR Soft Shadows (16x12, 25x16); Doom 3 (16x12, 25x16)]
Anti-Aliasing
Recap of existing ATI Radeon AA technologies:
• Multisampling
• Programmable sample patterns
• Gamma correct resolve
• Temporal AA
• Adaptive Supersampling/Multisampling
• Super AA (CrossFire)
All of these features are still available with the HD 2000 series, plus a new one – Custom Filter Anti-Aliasing
Building a Better Filter
Anti-aliasing is effectively a post-processing filter applied to each rendered frame
• Designed to remove high frequencies from the video “signal”
• These appear as jagged edges and shimmering
Most hardware relies on simple box filter
• Restricted to pixel boundaries, fixed weights
• Diminishing returns from adding more samples
• The best anti-aliasing filters are not subject to these constraints
How we can do better
• Sampling from outside pixel boundaries
• Non-uniform sample weights
• Filter kernels that adapt to the characteristics of each pixel
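The difference between a box filter and a filter with non-uniform weights that reaches outside the pixel can be shown in one dimension. This is an illustrative sketch only: the actual CFAA kernels are not specified in the slides, and the tent radius of one pixel, the sample offsets and the function names are assumptions.

```python
def box_weight(d, radius=0.5):
    """Box filter: equal weight inside the pixel, nothing outside it."""
    return 1.0 if abs(d) <= radius else 0.0

def tent_weight(d, radius=1.0):
    """Tent filter: weight falls off linearly with distance from the
    pixel centre and extends past the pixel boundary (assumed radius)."""
    return max(0.0, 1.0 - abs(d) / radius)

def resolve(samples, weight_fn):
    """samples: list of (offset_from_pixel_centre, colour) pairs.
    Returns the weighted average colour."""
    weights = [weight_fn(d) for d, _ in samples]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * c for w, (_, c) in zip(weights, samples)) / total

# Two samples inside the pixel (dark) and two borrowed from the neighbours,
# one of which has caught a bright edge.
samples = [(-0.75, 1.0), (-0.25, 0.0), (0.25, 0.0), (0.75, 0.0)]
box_result = resolve(samples, box_weight)
tent_result = resolve(samples, tent_weight)
```

The box filter ignores the bright sample just outside the pixel, while the tent filter lets it contribute with a small weight, giving a more gradual transition across the edge.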
Custom Filter Anti-Aliasing (CFAA)
[Figure: Standard 8x MSAA (Box Filter)]
[Figure: 12x CFAA Narrow Tent Filter]
[Figure: 16x CFAA Wide Tent Filter]
Adaptive Edge Detect Filter
• Performs edge detection pass on rendered image
• Edge pixels resolved using more samples along direction of edge with high quality filter
• Other pixels resolved using fewer samples and box filter
Benefits
• Provides excellent edge smoothing where it’s needed the most
• Reduces texture shimmering
• Avoids blurring of fine detail (e.g. small text)
• Provides better quality per sample than supersampling, with better performance
Nice Properties of CFAA
• Software upgradeable
• Can be used to enhance in-game AA settings for most DirectX 9 titles
• Works together with all other ATI Radeon AA features
• Works with HDR
• Works with stencil shadows
• More samples per pixel than MSAA without increasing memory footprint
• Potential image quality limited only by performance
Anti-Aliasing Quality Comparison
Images captured from Half-Life 2 by Valve Software
[Figure: image comparison of No AA, 2x MSAA, 4x CFAA, 4x MSAA, 6x CFAA, 8x MSAA, 12x CFAA, 16x CFAA]
Anti-Aliasing Quality Comparison
Custom Filter AA Advantages vs. Coverage Sample AA (Nvidia GeForce 8 series)
[Figure: ATI Radeon HD 2000 series 8x CFAA vs. Nvidia GeForce 8 series 8x CSAA]
[Figure: ATI Radeon HD 2000 series 16x CFAA vs. Nvidia GeForce 8 series 16x CSAA]
• CFAA provides better edge quality per sample
• CFAA can scale to more than 16 samples per pixel
• CFAA works better on silhouette edges (where there are many small, intersecting polygons per pixel)
• CFAA works on stencil shadows
• CFAA is not limited to polygon edges
Memory Interface
Why is it so hard to just widen the memory interface?
• Requires more I/O pads around the edge of the chip
• I/O pads tend to scale poorly with process shrinks
• If interface is too wide, chip area becomes pad limited
• Makes GPU too large to produce at a reasonable cost
How did we get to 512 bits on the HD 2900?
• New, compact, stacked I/O pad design
• Double the I/O density of previous designs
• Let us pack 512 bits into the same area as 256 bits using previous I/O design
[Figure: ring bus diagram with Ring Stops and a PCI Express Ring Stop]
Memory Interface
Benefits of a 512-bit interface
• More bandwidth with existing memory technology
• Lower memory clock required to achieve target bandwidth
• Improved cost:bandwidth ratio
Benefits of the ring bus
• Simplifies routing to improve scalability
• Reduces wire delay
• Reduces number of repeaters required
[Figure: ring bus diagram with Ring Stops and a PCI Express Ring Stop]
Memory Controller
Fully Distributed vs. Crossbar Designs
• Crossbars have to be redesigned for each new product
• Crossbars become exponentially more complex as more channels are added
• Fully distributed design allows memory channels to be added or removed as required with minimal effort
Questions