Valencia, Spain September 7, 2010
The Future Is Heterogeneous Computing
Mike Houston
Principal Architect, Accelerated Parallel Processing, Advanced Micro Devices
October 27th, 2010
Page 1 | The Future Is Heterogeneous Computing | Oct 27, 2010

Workload Example: Changing Consumer Behavior
• Approximately 20 hours of video are uploaded to YouTube every minute
• 9 billion video files owned are high-definition
• 50 million+ digital media files are added to personal content libraries every day
• 1,000 images are uploaded to Facebook every second
Challenges for Next Generation Systems
• The Power Wall
  . Even more broadly constraining in the future!
• Complexity Management – HW and SW
  . Principles for managing exponential growth
• Parallelism, Programmability and Efficiency
  . Optimized SW for system-level solutions
• System Balance
  . Memory technologies and system design
  . Interconnect design
The Power Wall
• Easy prediction: Power will continue to be the #1 design constraint for computer systems.
• Why?
  . Vmin will not continue tracking Moore's Law
  . Integration of system-level components consumes chip power
    – A well-utilized 100 GB/sec DDR memory interface consumes ~15 W for the I/O alone!
  . 2nd-order effects of power
    – Thermal, packaging & cooling (node-level & datacenter-level)
    – Electrical stability in the face of rising variability
  . Thermal Design Points (TDPs) in all market segments continue to drop
  . Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation
  . The percentage of total world energy consumed by computing devices continues to grow year-on-year
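The I/O power claim above can be turned into a quick energy-per-bit estimate. This is a back-of-envelope sketch using only the slide's two numbers; the pJ/bit framing is an added illustration, not from the slide:

```python
# Energy cost of the slide's "100 GB/sec DDR interface ~ 15 W of I/O" claim,
# expressed per bit moved.
io_power_w = 15.0        # W for the I/O alone (from the slide)
bandwidth_bytes = 100e9  # 100 GB/sec (from the slide)

energy_per_bit_j = io_power_w / (bandwidth_bytes * 8)
print(f"{energy_per_bit_j * 1e12:.2f} pJ per bit")  # 18.75 pJ per bit
```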
Optimized SW for System-level Solutions
. Long history of SW optimizations for HW "characteristics"
  • Optimizing compilers
  • Cache / TLB blocking
  • Multi-processor coordination: communication & synchronization
  • Non-uniform memory characteristics: process and memory affinity
. The scarcity/abundance principle favors increased use of abstractions
  • Abstraction leads to increased productivity but costs performance
  • Still allows experts to burrow down into lower-level "on the metal" details
. The system-level integration era will demand even more
  • Many core: user-mode and/or managed-runtime scheduling?
  • Heterogeneous many core: capability-aware scheduling?
. SW productivity versus optimization dichotomy
  • Exposed HW leads to better performance but requires a "platform characteristics aware programming model"
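As one concrete instance of the cache/TLB blocking mentioned above, here is a sketch of a tiled matrix multiply. Pure Python is used only for clarity; in practice this is written in C with the tile size `B` tuned to the cache, and CPython itself will not show the speed-up:

```python
def matmul_tiled(A, X, n, B=32):
    """Multiply two n x n matrices in B x B tiles so each tile of A and X
    stays cache-resident while it is reused (illustrates the blocking idea,
    not a tuned implementation)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for kk in range(0, n, B):
            for jj in range(0, n, B):
                # Inner loops touch only a BxB working set of each operand.
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + B, n)):
                            C[i][j] += a_ik * X[k][j]
    return C
```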
The Memory Wall – getting thicker
There has always been a Critical Balance between Data Availability and Processing
| Situation | When? | Implication | Industry Solutions |
| DRAM vs CPU cycle-time gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; O-o-O machines |
| SW productivity crisis: object-oriented languages; managed runtime environments | Mid 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch |
| Single-thread CMP focus; virtual machines! | 2005 and beyond | Multiple working sets! More memory accesses | Huge caches; multiple memory controllers; extreme PHYs |
| New & emerging abstractions: browser-based runtimes; image/video as basic data types | 2009 and beyond | Even larger working sets; larger data types | Accelerated Parallel Processing; chip stacking; throughput-based designs; TBD |
Interconnect Challenges
• Coherence domain – knowing when to stop
  . Interesting implications for on-chip interconnect networks
• Industry mantra: "Never bet against Ethernet"
  . But current Ethernet is not well suited for lossless transmission
  . Troublesome for storage, messaging and more
• The subtler, trickier problems
  . Adaptive routing, congestion management, QoS, end-to-end characteristics, and more
• Data centers of tomorrow are going to take great interest in this area
Single-thread Performance
[Figure: six trend sketches, each marked "we are here" –
  Moore's Law: integration (log scale) vs. time;
  The Power Wall: power (TDP) budget vs. time (server: power = $$; desktop: eliminate fans; mobile: battery);
  The Frequency Wall: frequency vs. time (constrained by DFM, variability, reliability, wire delay);
  The IPC Complexity Wall: IPC vs. issue width;
  Locality: performance vs. cache size;
  Single-thread performance vs. time, flattening ("?").]
Parallel Programs and Amdahl's Law
Speed-up = 1 / (SW + (1 – SW) / N), where SW = fraction of serial work and N = number of processors
Assume a 100 W TDP socket:
  . 10 W for global clocking
  . 20 W for on-chip network/caches
  . 15 W for I/O (memory, PCIe, etc.)
This leaves 55 W for all the cores: ~850 mW per core!

[Chart: speed-up vs. number of CPU cores (1, 2, 4, …, 128) for 0%, 10%, 35%, and 100% serial work.]
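The curves and the per-core budget on this slide can be reproduced directly from the formula. This is a small sketch; dividing the 55 W core budget by 64 cores to recover the slide's ~850 mW figure is my reading of the chart:

```python
def speedup(serial_fraction, n_cores):
    # Amdahl's law as given on the slide: 1 / (SW + (1 - SW) / N)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# The four serial fractions plotted on the slide's chart.
for sw in (0.0, 0.10, 0.35, 1.0):
    curve = [round(speedup(sw, n), 1) for n in (1, 2, 4, 8, 16, 32, 64, 128)]
    print(f"{sw:5.0%} serial: {curve}")

# Power budget: 100 W TDP - 10 W clocking - 20 W network/caches - 15 W I/O
watts_for_cores = 100 - 10 - 20 - 15
print(f"{watts_for_cores} W left: {watts_for_cores / 64 * 1000:.0f} mW per core at 64 cores")
```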
35 Years of Microprocessor Trend Data
[Chart: 35-year trends of transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (Watts), and number of cores.]
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten Dotted line extrapolations by C. Moore
The Power Wall – Again!
• Escalating multi-core designs will crash into the power wall just like single cores did due to escalating frequency
• Why?
  . To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, …)
    – Spiral-upwards effect on power
  . The use of multiple cores forces each core to actually slow down
    – At some point, the power limits will not even allow you to activate all of the cores at the same time
  . Small, low-power cores tend to be very weak on single-threaded general-purpose workloads
    – The customer value proposition will continue to demand excellent performance on general-purpose workloads
    – The transition to compelling general-purpose parallel workloads will not be a fast one
What about Throughput Computing?
• Works around Amdahl's law by focusing on throughput of multiple independent tasks
  . Servers: transaction processing; web clicks; search queries
  . Clients: graphics; multimedia; sensory inputs (future)
  . HPC: data-level parallelism
• New bottlenecks start to appear
  . At some point, the OS itself becomes the "serial component"
    – User-mode scheduling and task-stealing runtimes
  . Memory BW – the goal is to saturate the pipeline to memory
    – Large number of outstanding references
    – Large number of active and/or standby threads
  . Power – overall utilization goes up, so does power consumption
    – Still the #1 constraint in modern computer design
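The "multiple independent tasks" pattern can be sketched with a user-mode thread pool. The names here are illustrative; a real server would use a task-stealing runtime as the slide suggests, but the key property is the same: no shared state, hence no serial component between tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(n):
    # Stand-in for one independent unit of work (a web click, a search query).
    return sum(i * i for i in range(n))

requests = list(range(100, 164))
with ThreadPoolExecutor(max_workers=8) as pool:
    # Tasks are scheduled independently; throughput scales with workers
    # until memory BW or power becomes the bottleneck.
    results = list(pool.map(handle_request, requests))

print(f"completed {len(results)} independent tasks")
```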
Three Eras of Processor Performance
| | Single-Core Era | Multi-Core Era | Heterogeneous Systems Era |
| Enabled by | Moore's Law; voltage scaling; microarchitecture | Moore's Law; desire for throughput; 20 years of SMP arch | Moore's Law; abundant data parallelism; power-efficient GPUs |
| Constrained by | Power; complexity | Power; parallel SW availability; scalability | (currently) Programming models; communication overheads |
[Figure: three performance curves, each marked "we are here" – single-thread performance vs. time (flattening, "?"); throughput performance vs. time (# of processors); targeted application performance vs. time (data-parallel exploitation).]
AMD x86 64-bit CMP Evolution
| | 2003 AMD Opteron™ | 2005 Dual-Core AMD Opteron | 2007 Quad-Core AMD Opteron | 2008 45nm Quad-Core AMD Opteron | 2009 Six-Core AMD Opteron | 2010 AMD Opteron 6100 Series |
| Mfg. process | 90nm SOI | 90nm SOI | 65nm SOI | 45nm SOI | 45nm SOI | 45nm SOI |
| CPU core | K8 | K8 | Greyhound | Greyhound+ | Greyhound+ | Greyhound+ |
| L2/L3 | 1MB/0 | 1MB/0 | 512kB/2MB | 512kB/6MB | 512kB/6MB | 512kB/12MB |
| HyperTransport™ | 3x 1.6GT/s | 3x 1.6GT/s | 3x 2GT/s | 3x 4.0GT/s | 3x 4.8GT/s | 4x 6.4GT/s |
| Memory | 2x DDR1 300 | 2x DDR1 400 | 2x DDR2 667 | 2x DDR2 800 | 2x DDR2 1066 | 4x DDR3 1333 |
Max Power Budget Remains Consistent
AMD Opteron™ 6100 Series Silicon and Package
[Die photo: two six-core dies, each with its own L3 cache.]

. 12 AMD64 x86 cores
. 18 MB on-chip cache
. 4 memory channels @ 1333 MHz
. 4 HyperTransport links @ 6.4 GT/sec
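The four 1333 MHz channels translate into an aggregate peak bandwidth as follows. This is a rough calculation under the standard assumption of 8 bytes (a 64-bit channel) per DDR3 transfer:

```python
channels = 4
transfer_rate = 1333e6   # DDR3-1333: 1333 million transfers/sec per channel
bytes_per_transfer = 8   # 64-bit channel width

peak_bw_gb_s = channels * transfer_rate * bytes_per_transfer / 1e9
print(f"~{peak_bw_gb_s:.1f} GB/s peak memory bandwidth")  # ~42.7 GB/s
```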
AMD Radeon HD5870 GPU Architecture
GPU Processing Performance Trend
[Chart: peak single-precision GFLOPS*, Sep-05 through Jul-09, rising from the R520 (ATI Radeon™ X1800) and R580(+) (ATI Radeon X19xx) through the R600 (ATI Radeon HD 2900), RV670 (ATI Radeon HD 3800), and RV770 (ATI Radeon HD 4800) to roughly 2700 GFLOPS on Cypress (ATI Radeon HD 5870), alongside ATI FireGL™/FirePro™ and AMD FireStream™ counterparts. Milestones along the curve: GPGPU via CTM; unified shaders; Stream SDK with CAL+IL/Brook+; double-precision floating point; 2.5x ALU increase; DirectX 11 and OpenCL 1.1+ with 2.25x performance.
*Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance.]
GPU Efficiency
[Chart: GFLOPS/W and GFLOPS/mm² for ATI Radeon™ X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08), and HD 5870 (Oct-09). GFLOPS/W rises from 0.42 to 14.47; GFLOPS/mm² from roughly 1 to 7.90.]
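The 14.47 GFLOPS/W endpoint of the chart can be cross-checked from first principles. The ALU count, clock, and 188 W board power used below are public Radeon HD 5870 specs, assumptions added here rather than values from the slide:

```python
# Cross-check of the HD 5870 efficiency bar (specs are assumptions: public
# Radeon HD 5870 figures, not taken from the slide itself).
stream_processors = 1600
flops_per_sp_per_clock = 2   # a multiply-add counts as two FLOPs
clock_hz = 850e6
board_power_w = 188.0

peak_gflops = stream_processors * flops_per_sp_per_clock * clock_hz / 1e9
print(f"{peak_gflops:.0f} GFLOPS peak single precision")  # 2720 GFLOPS
print(f"{peak_gflops / board_power_w:.2f} GFLOPS/W")      # 14.47 GFLOPS/W
```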
AMD Accelerated Parallel Processing (APP) Technology is…
• Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience
• High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
• Industry standards: OpenCL™ enables cross-platform development
Application areas: Gaming, Digital Content Creation, Productivity, Sciences, Government, Engineering

Moving Past Proprietary Solutions for Ease of Cross-Platform Programming
[Stack diagram:]
• Open and custom tools: high-level tools, high-level language compilers, application-specific libraries
• Industry-standard interfaces: DirectX®, OpenCL™, OpenGL®
• Hardware: AMD GPUs, AMD CPUs, other CPUs/GPUs
OpenCL™
• Cross-platform development
• Interoperability with OpenGL and DX
• CPU/GPU backends enable a balanced platform approach
Heterogeneous Computing: Next-Generation Software Ecosystem
Goal: increase ease of application development

[Stack diagram:]
• End-user applications
• High-level frameworks
• Tools: HLL compilers, debuggers, profilers
• Middleware/libraries: video, imaging, math/sciences, physics
• Hardware & drivers: AMD Fusion™, discrete CPUs/GPUs

Alongside the stack: load balance across CPUs and GPUs, leveraging AMD Fusion™ performance advantages; advanced optimizations & load balancing; drive new features into industry standards (OpenCL & DirectCompute)
AMD Balanced Platform Advantage
CPU is excellent for running some algorithms
. Ideal place to process if the GPU is fully loaded
. Great use for additional CPU cores

GPU is ideal for data-parallel algorithms like image processing, CAE, etc.
. Great use for AMD Accelerated Parallel Processing (APP) technology
. Great use for additional GPUs

[Diagram: workload spectrum from serial/task-parallel workloads (CPU) through graphics workloads to other highly parallel workloads (GPU).]
Delivers advanced performance for a wide range of platform configurations
Challenges: Extracting Parallelism
Fine-grain data-parallel code (loop 16 times for 16 pieces of data):
  i=0; loop: load x(i); fmul; store; i++; cmp i (16); bc
  Maps very well to integrated SIMD dataflow (i.e., SSE).

Coarse-grain data-parallel code (loop 1M times for 1M pieces of data):
  i=0; loop: load x(i); fmul; store; i++; cmp i (1000000); bc
  Maps very well to throughput-oriented data-parallel engines.

Nested data-parallel code (2D array representing a very large dataset):
  i,j=0; loop: load x(i,j); fmul; store; j++; cmp j (100000); bc; i++; cmp i (100000); bc
  Lots of conditional data parallelism; benefits from closer coupling between CPU & GPU.
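The three loop shapes above can be mimicked in a few lines. This is a pure-Python stand-in to show the data shapes, not the performance; the grid dimensions and the evenness predicate are illustrative choices:

```python
def scale(xs, factor):
    # One "fmul" per element: load x(i), multiply, store.
    return [x * factor for x in xs]

fine = scale(list(range(16)), 2.0)           # 16 elements: SIMD-sized (e.g. SSE)
coarse = scale(list(range(1_000_000)), 2.0)  # 1M elements: throughput engine

# Nested/conditional data parallelism over a 2D dataset: only some elements
# qualify, which is where closer CPU-GPU coupling helps.
grid = [[float(i + j) for j in range(100)] for i in range(100)]
nested = [x * 2.0 for row in grid for x in row if x % 2.0 == 0.0]

print(len(fine), len(coarse), len(nested))
```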
A New Era of Processor Performance
[Diagram: programmability vs. throughput performance. Microprocessor advancement (CPU: single-core era → multi-core era → heterogeneous systems era, i.e. homogeneous computing) converges with graphics advancement (GPU: driver-based programs → OpenCL/DX driver-based programs → system-level programmable) in heterogeneous computing.]
Now the AMD Fusion Era of Computing Begins
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
This presentation contains forward-looking statements concerning AMD and technology partner product offerings which are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "strategy," "roadmaps," "projects" and other terms with similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. Other names are for informational purposes only and may be trademarks of their respective owners.