SLIDES [email protected]

6/24/2013

ANALYZING OPTIMIZATION TECHNIQUES FOR POWER EFFICIENCY ON HETEROGENEOUS PLATFORMS
Yash Ukidave and David Kaeli
Department of Electrical and Computer Engineering
Northeastern University, Boston, USA
AsHES 2013 Boston, MA 20th May, 2013

1 | AsHES 2013 | May 2013

WHAT IS THIS TALK ABOUT ?
• Study of the power and energy profile of different optimization techniques used in heterogeneous applications
• Evaluation of the power/performance of such optimization techniques on heterogeneous applications such as FFT

TOPICS
• Applications using OpenCL and power consumption of heterogeneous devices
• Fast Fourier Transforms (FFT) & evaluation methodology
• Optimization techniques used for analysis
• Results for power-performance of FFT implementations
• Analysis of power-performance of different optimization techniques
• Energy profile of different optimization techniques
• Conclusion
• Future work

MOTIVATION
• Increasing use of the CPU-GPU environment to accelerate data-parallel heterogeneous applications
• Thermal Design Power (TDP) of the latest generation of GPUs used for heterogeneous compute
• Understanding the effects of software design methods contributing to power consumption
• Power and energy aspects of different optimization techniques for heterogeneous platforms

FAST FOURIER TRANSFORM (FFT)
• FFT is an algorithm to compute Discrete Fourier Transforms (DFT)
• Reduces time complexity to O(n log n) from O(n^2)
• FFT is classified as Decimation in Time (DIT) or Decimation in Frequency (DIF)
• DIT : Operates on odd and even components of the signal
• DIF : Operates on two halves of the signal
• The butterfly structure is a major component of an FFT of a given radix
• Each FFT work-item computes a butterfly on the given data points
[Figure: butterfly diagram with inputs x(0), x(1), outputs X(0), X(1) and a -1 multiplier]

FFT IMPLEMENTATIONS
• MR-SC FFT : Multi-Radix, single kernel call, based on the FFT implementation in AMD SDK
• MR-MC FFT : Based on the Cooley-Tukey algorithm, uses multiple kernel calls and multiple radix combinations for compute
• Stockham FFT : Based on the Stockham algorithm for FFT. Single radix and single kernel call computation
• Apple FFT : A multiple kernel call based FFT provided by Apple Inc. using OpenCL

FFT Implementation       Kernel Calls   Memory Access Patterns   Twiddle Factor Computation
MR-SC, Stockham FFT      Single         Global & Local Memory    Kernel Compile-Time
Apple FFT, MR-MC FFT     Multiple       Global Memory            Run-Time

PLATFORMS FOR EVALUATION

Shared Memory APUs
Device Features                 Intel Core i7 3300   AMD Fusion A8 APU
Device Generation               Ivy Bridge           Evergreen
Compute Units (CU)              16                   5
Processing Elements (PE)/CU     4                    16
TDP (Watts)                     70                   100
Memory Bandwidth (GB/s)         26                   17
Register File Per CU (KB)       64                   64

Discrete GPUs
Device Features                 Nvidia GTX 680       AMD Radeon HD 7770
Device Generation               Kepler               Southern Islands
Compute Units (CU)              8                    10
Processing Elements (PE)/CU     192                  64
TDP (Watts)                     195                  80
Memory Bandwidth (GB/s)         192                  72
Register File Per CU (KB)       256                  256

OPTIMIZATION TECHNIQUES CLASSIFIED IN SETS
• Set S0 : No modifications ("out-of-box" performance)
• Set S1 : Coalesced global memory accesses, loop unrolling
• Set S2 : Data transformation (float → float2, float → float4, float → float8)
• Set S3 : Local memory usage, stage overlapping for stage-based compute

EXECUTION PERFORMANCE OF FFT
• Performance is compared to the baseline on each platform
• Evaluation is done for three data set sizes: 64K, 1M and 2M data points
[Figure: Execution Performance on GPUs]
[Figure: Execution Performance on APUs]

POWER PERFORMANCE OF NON-OPTIMIZED FFT
• The high power consumption of Nvidia GPUs lowers their power efficiency relative to AMD GPUs
• The low compute capability of APUs results in lower performance than on GPUs
[Figure: Power Performance on GPUs]
[Figure: Power Performance on APUs]

ANALYSIS OF MEMORY BASED OPTIMIZATIONS
• S1 & S2 optimizations use coalesced memory accesses in FFT kernels
• FFTs are modified to perform coalesced memory accesses and loop structures are unrolled
• Improved throughput on devices leads to an increase in performance
• Average improvement in throughput is 2X on discrete GPUs

ANALYSIS OF MEMORY BASED OPTIMIZATIONS
• Improved throughput causes an increase in execution performance
• Coalesced accesses are handled using specialized hardware paths
• Coalesced accesses increase power consumption
• Average increase in power is 32%

ANALYSIS OF MEMORY BASED OPTIMIZATIONS
• S2 optimizations use data transformation
• Data transforms in OpenCL allow contiguous memory access
• This increases coalesced accesses for the kernel
• Power consumption increases due to coalesced access
• Per-work-item compute increases due to the increase in input data access
[Figure: S2 data transform evaluation on Nvidia GPU]

EFFECTS OF LOCAL MEMORY USAGE AND STAGE-OVERLAPPING
• S3 optimizations use local memory and stage-overlapping
• Stages of a multi-kernel implementation are overlapped in a single kernel call
• MR-MC and Apple FFT are modified for stage overlap
• Kernel call overhead is avoided and coalesced accesses are also used
• Large performance gains are observed on GPUs and APUs

ARCHITECTURAL FACTORS CONTRIBUTING TO POWER
• S3 optimizations are evaluated for this analysis
• Memory stalls and ALU utilization are evaluated for power consumption
• Memory unit stalls directly affect the power consumption of GPUs and APUs
• The number of in-flight memory accesses can cause memory-unit stalls
• ALU utilization does not directly affect power consumption

ENERGY PROFILE OF OPTIMIZATION TECHNIQUES
• Energy relates power consumption and execution performance directly
• Power consumption can increase due to an increase in performance
• Energy consumption exposes this trade-off
• Variation in energy across optimizations is 11% on average
• S2 optimizations show a 13% increase in energy over other optimization sets

RESULTS SUMMARY
Legend: High = Highly Efficient (more than 40% improvement), Moderate = Moderately Efficient (10-40% improvement), Low = Less Efficient (less than 10% improvement)

Optimization Techniques    Power Efficiency      Performance Efficiency   Power Performance (Gflops/Watts)
                           GPUs      APUs        GPUs      APUs           GPUs      APUs
Coalesced Memory Access    Low       Low         High      High           Moderate  Moderate
Loop Unrolling             Low       Low         High      High           Moderate  Moderate
Data Transformation        Low       Low         High      High           Low       Low
Local Memory Usage         High      High        High      High           High      High
Stage Overlapping          High      Moderate    High      High           High      Moderate

RESULTS SUMMARY
GPUs
• S1 & S2 improve performance efficiency at the cost of power consumption
• Local memory in S3 improves power-performance
• S3 optimizations are compute and power efficient
APUs
• S1 & S2 are not power efficient
• Local memory improves compute efficiency and power efficiency
• Stage overlapping increases load on resources and increases power consumption

CONCLUSION
• Analyzed different optimization techniques for their power-performance on one class of applications
• The study helps developers identify the potential of power-aware kernel development
• Optimizations related to coalesced memory accesses exhibit an increase in power consumption on GPUs and APUs
• Local memory utilization is observed to be the most power efficient optimization technique
• The power increment due to performance improvement is captured effectively in the energy profile

FUTURE WORK
• Analyze power consumption on SoC (System-on-Chip) devices with GPUs, such as TI OMAP4, Samsung Exynos, Qualcomm Snapdragon
• Extend power-performance analysis to multi-GPU environments such as clusters
• Study microarchitectural features responsible for power consumption on GPUs
• Extend the study to different heterogeneous multi-core devices such as the Adapteva Epiphany processor and Tilera TilePro processors

THANK YOU ! QUESTIONS ? COMMENTS ?
Yash Ukidave
[email protected]

EXTRA SLIDES

TITLE PAGE PICTURE
Source: http://commons.wikimedia.org/wiki/File:Mona_Lisa_bw_square.jpeg

FFT IMPLEMENTATIONS WITH SINGLE KERNEL CALL
• MR-SC FFT : FFT implementation based on AMD SDK. Multiple radix sizes used
• Stockham FFT : Stockham algorithm for FFT computation. Single radix computation

FFT IMPLEMENTATIONS WITH MULTIPLE KERNEL CALLS
• MR-MC FFT : Multiple kernel calls and multiple radix combinations for compute
• Apple FFT : Multiple kernel call based FFT by Apple Inc. using OpenCL

POWER MEASUREMENT SETUP
Power measurement on GPU:
• Discrete GPUs were isolated from the system for power measurement
• A dedicated external power supply is used for discrete GPUs
• Power consumption of the GPU is measured by profiling the external PSU
[Figure: Power measurement on GPUs]
Power measurement for APU:
• Power supply for APUs is measured using a similar power meter
• System power is measured to record power consumption on APUs
• The graphics processor of an APU cannot be completely isolated from the host CPU device
[Figure: Power measurement on APUs]
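
The energy-profile and results-summary slides relate power draw, execution time and Gflops/Watt. A minimal sketch of how those three quantities fit together is given here, under two assumptions that are not from the talk: average power is treated as constant over the run (so energy is power times time), and the FFT operation count uses the common 5·n·log2(n) approximation for a complex FFT. The function names and all numbers in the example are hypothetical, not measurements from the study.

```python
import math


def energy_joules(avg_power_watts, exec_time_s):
    """Energy under the assumption of constant average power: E = P * t."""
    return avg_power_watts * exec_time_s


def fft_flops(n):
    """Approximate flop count of a complex n-point FFT (assumed 5*n*log2(n) convention)."""
    return 5 * n * math.log2(n)


def gflops_per_watt(flop_count, exec_time_s, avg_power_watts):
    """Power-performance metric: achieved Gflops divided by average power."""
    gflops = flop_count / exec_time_s / 1e9
    return gflops / avg_power_watts


if __name__ == "__main__":
    n = 2 ** 20  # 1M data points, one of the data set sizes named in the slides

    # Hypothetical baseline and optimized runs (illustrative values only).
    base_power, base_time = 100.0, 2.0
    opt_power, opt_time = 132.0, 1.0  # more power drawn, but the run is shorter

    for label, p, t in (("baseline", base_power, base_time),
                        ("optimized", opt_power, opt_time)):
        print(label,
              "energy (J):", energy_joules(p, t),
              "Gflops/W:", gflops_per_watt(fft_flops(n), t, p))
```

In this made-up case the optimized run draws more power yet consumes less energy, because the shorter execution time outweighs the higher draw. That is the kind of trade-off the energy profile is meant to expose; whether a real optimization lands on the favorable side depends on the measured values.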
