The Future Is Heterogeneous Computing
Valencia, Spain, September 7, 2010

Mike Houston
Principal Architect, Accelerated Parallel Processing
Advanced Micro Devices
October 27th, 2010

Workload Example: Changing Consumer Behavior
• Approximately 20 hours of video are uploaded to YouTube every minute
• 9 billion of the video files owned are high-definition
• 50 million+ digital media files are added to personal content libraries every day
• 1000 images are uploaded to Facebook every second

Challenges for Next Generation Systems
• The Power Wall
  . Even more broadly constraining in the future!
• Complexity Management – HW and SW
  . Principles for managing exponential growth
• Parallelism, Programmability and Efficiency
  . Optimized SW for system-level solutions
• System Balance
  . Memory technologies and system design
  . Interconnect design

The Power Wall
• Easy prediction: power will continue to be the #1 constraint in computer systems design.
• Why?
  . Vmin will not continue tracking Moore's Law
  . Integration of system-level components consumes chip power
    – A well-utilized 100 GB/sec DDR memory interface consumes ~15W for the I/O alone!
  . Second-order effects of power
    – Thermal, packaging & cooling (node-level & datacenter-level)
    – Electrical stability in the face of rising variability
  . Thermal Design Points (TDPs) in all market segments continue to drop
  . Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation
  . The percentage of total world energy consumed by computing devices continues to grow year-on-year

Optimized SW for System-level Solutions
• Long history of SW optimizations for HW "characteristics"
  . Optimizing compilers
  . Cache / TLB blocking
  . Multi-processor coordination: communication & synchronization
  . Non-uniform memory characteristics: process and memory affinity
• The scarcity/abundance principle favors increased use of abstractions
  . Abstraction increases productivity but costs performance
  . Still allows experts to burrow down into lower-level "on the metal" details
• The system-level integration era will demand even more
  . Many-core: user-mode and/or managed-runtime scheduling?
  . Heterogeneous many-core: capability-aware scheduling?
• SW productivity versus optimization dichotomy
  . Exposed HW leads to better performance but requires a "platform characteristics aware" programming model
(A small loop-blocking sketch appears after the memory-wall table below.)

The Memory Wall – Getting Thicker
There has always been a critical balance between data availability and processing.

Situation | When? | Implication | Industry Solutions
DRAM vs. CPU cycle-time gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; out-of-order machines
SW productivity crisis: object-oriented languages; managed runtime environments | Mid 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch
Single-thread CMP focus; virtual machines | 2005 and beyond | Multiple working sets! More memory accesses | Huge caches; multiple memory controllers; extreme PHYs
New & emerging abstractions: browser-based runtimes; image/video as basic data types | 2009 and beyond | Even larger working sets; larger data types | Accelerated Parallel Processing; chip stacking; throughput-based designs; TBD
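The "cache / TLB blocking" item on the Optimized SW slide, and the growing working sets in the table above, are the classic motivation for loop tiling. The following is a minimal sketch, not taken from the presentation: a matrix multiply restructured over tiles so each inner loop nest reuses data while it is still cache-resident. The function name and the BS tile size are illustrative choices, not anything from the deck.

```c
/* Minimal cache-blocking (loop tiling) sketch. BS is a hypothetical tile
 * edge chosen so a few BSxBS tiles fit in the L1 data cache; real values
 * depend on the target core's cache sizes. */
#include <stddef.h>

#define BS 64  /* hypothetical tile size, tune per cache */

/* C += A * B for square NxN row-major matrices; caller must zero C first.
 * Iterating over BSxBS tiles keeps each inner loop's working set small. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The only change from a naive triple loop is the iteration order; the arithmetic is identical, which is why blocking is a pure "HW characteristics" optimization.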
Interconnect Challenges
• Coherence domain – knowing when to stop
  . Interesting implications for on-chip interconnect networks
• Industry mantra: "Never bet against Ethernet"
  . But current Ethernet is not well suited for lossless transmission
  . Troublesome for storage, messaging and more
• The more subtle and trickier problems
  . Adaptive routing, congestion management, QoS, end-to-end characteristics, and more
• The data centers of tomorrow are going to take great interest in this area

Single-thread Performance
[Figure: six trend charts, each marked "we are here": Moore's Law (integration vs. time, log scale); The Power Wall (power/TDP budget vs. time; server: power = $$, DT: eliminate fans, mobile: battery); The Frequency Wall (frequency vs. time, limited by DFM, variability, reliability, wire delay); The IPC Complexity Wall (IPC vs. issue width); Locality (performance vs. cache size); single-thread performance vs. time]

Parallel Programs and Amdahl's Law
Speed-up = 1 / (SW + (1 – SW) / N), where SW is the fraction of serial work and N is the number of processors.
[Figure: speed-up vs. number of CPU cores (1 to 128) for 0%, 10%, 35% and 100% serial work]
• Assume a 100W TDP socket: 10W for global clocking, 20W for the on-chip network/caches, 15W for I/O (memory, PCIe, etc.)
• This leaves 55W for all the cores – 850mW per core!
(A short numeric sketch of this formula and power budget appears below, after the throughput-computing slide.)

35 Years of Microprocessor Trend Data
[Figure: transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts) and number of cores vs. time. Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted-line extrapolations by C. Moore]

The Power Wall – Again!
• Escalating multi-core designs will crash into the power wall just as single cores did with escalating frequency
• Why?
  . To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, …)
    – Spiral-upwards effect on power
  . The use of multiple cores forces each core to actually slow down
    – At some point, the power limits will not even allow you to activate all of the cores at the same time
  . Small, low-power cores tend to be very weak on single-threaded general-purpose workloads
    – The customer value proposition will continue to demand excellent performance on general-purpose workloads
    – The transition to compelling general-purpose parallel workloads will not be a fast one

What about Throughput Computing?
• Works around Amdahl's law by focusing on the throughput of multiple independent tasks
  . Servers: transaction processing; web clicks; search queries
  . Clients: graphics; multimedia; sensory inputs (future)
  . HPC: data-level parallelism
• New bottlenecks start to appear
  . At some point, the OS itself becomes the "serial component"
    – User-mode scheduling and task-stealing runtimes
  . Memory BW – the goal is to saturate the pipeline to memory
    – Large number of outstanding references
    – Large number of active and/or standby threads
  . Power – overall utilization goes up, and so does power consumption
    – Still the #1 constraint in modern computer design
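The Amdahl's-law slide above gives both the speed-up formula and a back-of-the-envelope power budget. Here is a minimal sketch, not part of the deck, that evaluates both; dividing the remaining 55W across 64 cores is an assumption used to recover the slide's ~850mW-per-core figure (it actually works out to about 860mW).

```c
/* Evaluate speed-up = 1 / (SW + (1 - SW) / N) for the serial fractions on
 * the chart, and the per-core power budget from the same slide. */
#include <stdio.h>

static double amdahl(double sw, int n) { return 1.0 / (sw + (1.0 - sw) / n); }

int main(void)
{
    const double serial[] = { 0.0, 0.10, 0.35, 1.0 };  /* curves on the slide */
    for (int n = 1; n <= 128; n *= 2) {
        printf("%3d cores:", n);
        for (int s = 0; s < 4; s++)
            printf("  %6.1fx @ %3.0f%% serial", amdahl(serial[s], n), 100 * serial[s]);
        printf("\n");
    }

    /* Power budget: 100W TDP - 10W clocking - 20W network/caches - 15W I/O
     * leaves 55W for the cores; 64 cores is an assumed split. */
    double cores_watts = 100.0 - 10.0 - 20.0 - 15.0;
    printf("%.0f W across 64 cores = %.0f mW per core\n",
           cores_watts, cores_watts / 64 * 1000);
    return 0;
}
```

Note how quickly the 10% and 35% serial curves flatten: the speed-up limit is 1/SW regardless of how many cores the power budget can light up, which is exactly the motivation for the throughput-computing slide above.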
Three Eras of Processor Performance

Single-Core Era
  Enabled by: Moore's Law, voltage scaling, microarchitecture
  Constrained by: power, complexity
  [Chart: single-thread performance vs. time, marked "we are here"]

Multi-Core Era
  Enabled by: Moore's Law, the desire for throughput, 20 years of SMP architecture
  Constrained by: power, parallel SW availability, scalability
  [Chart: throughput performance vs. time (# of processors), marked "we are here"]

Heterogeneous Systems Era
  Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
  Currently constrained by: programming models, communication overheads
  [Chart: targeted application performance vs. time (data-parallel exploitation), marked "we are here"]

AMD x86 64-bit CMP Evolution

Year | Product | Mfg. Process | CPU Core | L2/L3 | HyperTransport™ Technology | Memory
2003 | AMD Opteron™ | 90nm SOI | K8 | 1MB / none | 3x 1.6 GT/s | 2x DDR1-300
2005 | Dual-Core AMD Opteron | 90nm SOI | K8 | 1MB / none | 3x 1.6 GT/s | 2x DDR1-400
2007 | Quad-Core AMD Opteron | 65nm SOI | Greyhound | 512kB / 2MB | 3x 2 GT/s | 2x DDR2-667
2008 | 45nm Quad-Core AMD Opteron | 45nm SOI | Greyhound+ | 512kB / 6MB | 3x 4.0 GT/s | 2x DDR2-800
2009 | Six-Core AMD Opteron | 45nm SOI | Greyhound+ | 512kB / 6MB | 3x 4.8 GT/s | 2x DDR2-1066
2010 | AMD Opteron 6100 Series | 45nm SOI | Greyhound+ | 512kB / 12MB | 4x 6.4 GT/s | 4x DDR3-1333

Max power budget remains consistent across generations.

AMD Opteron™ 6100 Series Silicon and Package
[Figure: package photo showing cores 1–6 and L3 cache on each die]
• 12 AMD64 x86 cores
• 18 MB on-chip cache
• 4 memory channels @ 1333 MHz
• 4 HT links @ 6.4 GT/sec

AMD Radeon HD 5870 GPU Architecture
[Figure: GPU architecture block diagram]

GPU Processing Performance Trend
[Figure: peak single-precision GigaFLOPS vs. time, Sep-05 through Jul-09, climbing from R520 (ATI Radeon X1800) through R580(+) (ATI Radeon X19xx), R600 (ATI Radeon HD 2900), RV670 (ATI Radeon HD 3800) and RV770 (ATI Radeon HD 4800) to Cypress (ATI Radeon HD 5870) approaching 3000 GigaFLOPS, with corresponding ATI FireGL™ (V7200/V7300/V7350, V7600, V7700, V8600/V8650), ATI/AMD FireStream™ (9170, 9250, 9270) and ATI FirePro™ (V8700) parts. Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance. Annotated milestones: GPGPU via CTM, unified shaders, Stream SDK (CAL+IL/Brook+), double-precision floating point, DirectX 11 and OpenCL 1.1+, plus a ~2.5x ALU increase and ~2.25x performance step between the latest generations]

GPU Efficiency
[Figure, truncated in this extraction: GFLOPS/W and GFLOPS/mm2 bars; values of 14.47 and 7.50 GFLOPS/W are visible]
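A rough sketch of the arithmetic behind the last two charts. The ALU count, engine clock and board power used below are commonly quoted figures for the ATI Radeon HD 5870 (Cypress) and do not appear on the slides, so treat them as assumptions; the "divide by 5" double-precision rule is taken from the chart annotation.

```c
/* Back-of-the-envelope peak-rate arithmetic for Cypress (ATI Radeon HD 5870).
 * All three inputs are assumptions, not values shown on the slides. */
#include <stdio.h>

int main(void)
{
    double alus  = 1600;   /* assumed stream processors                  */
    double clock = 0.850;  /* assumed engine clock in GHz                */
    double power = 188.0;  /* assumed board power in watts               */

    double sp_gflops = alus * clock * 2.0;  /* multiply-add = 2 FLOPs/cycle */
    double dp_gflops = sp_gflops / 5.0;     /* chart's divide-by-5 DP rule  */

    printf("Peak SP:    %.0f GFLOPS\n", sp_gflops);            /* ~2720   */
    printf("Peak DP:    %.0f GFLOPS\n", dp_gflops);            /* ~544    */
    printf("Efficiency: %.2f GFLOPS/W\n", sp_gflops / power);  /* ~14.5   */
    return 0;
}
```

Under these assumptions the result lands close to the 14.47 GFLOPS/W bar visible in the truncated efficiency chart, which is simply peak single-precision throughput divided by board power.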