Experiences Using Tegra K1 and X1 for Highly Energy-Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell

Research School of Computer Science, Australian National University, Canberra, Australia

April 07, 2016

Overview

1 Introduction & Background

2 Power Measurement Environment

3 Experimental Platforms

4 Approach

5 Results & Analysis

6 Conclusion

Use of low-powered SoCs for HPC

Nvidia Jetson TK1: ARM + GPU SoC
Nvidia Jetson TX1: ARM + GPU SoC
TI Keystone II: ARM + DSP SoC
Parallella: ARM + 64-core NoC
TI BeagleBoard: ARM + DSP SoC
Terasic DE1: ARM + FPGA SoC
Firefly: ARM + GPU SoC
Freescale Wandboard: ARM + GPU SoC
Cubieboard4: ARM + GPU SoC

http://cs.anu.edu.au/systems


For SoC processors to be considered viable exascale building blocks, important factors to explore include:

Absolute performance
Balancing the use of different on-chip devices
Understanding the performance-energy trade-off

Contributions

Environment for monitoring and collecting high-resolution power measurements for SoC systems
Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
Performance and energy comparisons with conventional HPC systems: Xeon CPUs and NVIDIA K20 and K80 GPUs

Measurement Requirements

SoC systems generally consume very low power (a few Watts)
Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
Changes in the DC current supplied to SoC system boards must be reliably measured
Current draw ranges from µAmps to a few Amps, so a very high-precision ammeter must be used to measure subtle changes

Measurement Apparatus

µCurrent Gold: a high-precision ammeter for measuring low currents (https://www.eevblog.com/projects/ucurrent/)
An LPC1768 micro-controller with a 12-bit ADC (0-3.3V) is used to measure the analog output signals from the µCurrent Gold
The ADC has a resolution of 0.81±0.40mV, which corresponds to 0.81mA; this is 9.7±4.8mW at 12V (see the conversion sketch below)

https://developer.mbed.org/
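The resolution figures above follow directly from the hardware parameters. The sketch below is a minimal illustration of that conversion, assuming the µCurrent Gold is on its mA range (1 mV of output per 1 mA measured), a 3.3 V ADC reference over 4096 counts, and a 12 V supply to the board; the constants and function names are ours, not part of the measurement firmware.

```c
#include <stdio.h>

/* Convert a raw 12-bit ADC sample from the LPC1768 into supply current and
 * power. Assumed settings (not stated verbatim on the slide): µCurrent Gold
 * on its mA range (1 mV of output per 1 mA measured), 3.3 V ADC reference
 * over 4096 counts, and a 12 V DC supply to the SoC board. One ADC count is
 * then 3300/4096 ≈ 0.81 mV ≈ 0.81 mA ≈ 9.7 mW, matching the slide. */
#define ADC_VREF_MV  3300.0   /* ADC full scale in millivolts   */
#define ADC_COUNTS   4096.0   /* 12-bit converter               */
#define MV_PER_MA    1.0      /* µCurrent Gold gain, mA range   */
#define SUPPLY_VOLTS 12.0     /* DC supply voltage to the board */

static double adc_to_milliamps(unsigned raw)
{
    return (raw * ADC_VREF_MV / ADC_COUNTS) / MV_PER_MA;
}

static double adc_to_milliwatts(unsigned raw)
{
    return adc_to_milliamps(raw) * SUPPLY_VOLTS;   /* mA * V = mW */
}

int main(void)
{
    /* Smallest current/power step the logger can resolve (one ADC count). */
    printf("resolution: %.2f mA, %.2f mW\n",
           adc_to_milliamps(1), adc_to_milliwatts(1));
    return 0;
}
```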

Power Measurement Environment

[Figure: the power measurement environment setup.]

Experimental Platforms

            TK1              TX1              SANDY             HASWELL
CPU         ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665      Xeon E5-2670 v3
CPU Cores   4                4                2×8               2×12
CPU Freq.   2.3 GHz          2.2 GHz          2.4 GHz           2.3 GHz
RAM         2GB LPDDR3       3GB LPDDR4       128GB DDR3        128GB DDR3
GPU         GK20A            GM20B            K20m (GK110)      K80 (GK210)
GPU Cores   192              256              2496              2496
GPU Freq.   852 MHz          998 MHz          706 MHz           875 MHz
GPU RAM     Shared           Shared           5GB               12GB
CUDA        v6.5             v7.0             v7.0              v7.5
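As a sanity check of the GPU columns in this table, a short CUDA query (sketch below) prints the SM count, clock, and memory reported by the driver on each platform. It uses only standard cudaDeviceProp fields and is our own illustration, not part of the talk.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the GPU details behind the table above (name, SM count, clock,
 * memory, and whether the GPU shares physical memory with the host) for
 * every CUDA device visible on the platform. Build with nvcc. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s, %d SMs @ %.0f MHz, %.1f GiB %s\n",
               d, p.name, p.multiProcessorCount, p.clockRate / 1000.0,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               p.integrated ? "(shared with host)" : "(dedicated)");
    }
    return 0;
}
```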

Evaluation Kernel

The evaluation kernel is matrix multiplication, C = A × B, with the B and C matrices partitioned column-wise between the CPU and the GPU (sketched in code below):

[ C1 | C2 ] = A × [ B1 | B2 ]

C1 = A × B1  (CPU)        C2 = A × B2  (GPU)
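A minimal sketch of this column split, assuming column-major storage, CUDA managed (shared) memory on the TK1/TX1, CBLAS for the host half and cuBLAS for the GPU half; the function name and the split column s are illustrative, not the authors' implementation.

```c
#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* C = A * B with B and C split column-wise at column s (column-major):
 *   columns [0, s) of C are computed on the CPU:  C1 = A * B1
 *   columns [s, N) of C are computed on the GPU:  C2 = A * B2
 * A, B, C are assumed to be allocated with cudaMallocManaged so both
 * devices address the same physical memory on the TK1/TX1 and no copies
 * are needed. Note: the CUDA driver protects managed memory while a
 * kernel is in flight, so the concurrent CPU work below relies on the
 * mprotect() workaround described later in the deck. */
static void split_dgemm(int M, int N, int K, int s,
                        const double *A, const double *B, double *C,
                        cublasHandle_t handle)
{
    const double alpha = 1.0, beta = 0.0;

    /* GPU half, launched asynchronously: C2 = A * B2 */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N - s, K, &alpha,
                A, M, B + (size_t)s * K, K,
                &beta, C + (size_t)s * M, M);

    /* CPU half overlaps with the GPU kernel: C1 = A * B1 */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                M, s, K, 1.0, A, M, B, K, 0.0, C, M);

    cudaDeviceSynchronize();    /* wait for the GPU columns to finish */
}
```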

Approaches

Traditional methods: assign all work to the GPU or the CPU

Static Partitioning: partition work between GPU and CPU based on a priori information
Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Dynamic Partitioning:
Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures

→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency

Our approach

Static partitioning:
Guess a partition based on experimentally measured peak performances of the CPU and GPU
Use the achieved peaks to refine the partition
Repeat until convergence
Suitable for repeated calculations of the same size

Dynamic partitioning:
CPU and GPU remove chunks of matrix columns from a workqueue
Chunk size must be sufficient to occupy the CPU and GPU fully
On traditional discrete GPU systems, copies have to be carefully scheduled
Implemented using OpenMP: two threads, one each for CPU and GPU, taking work off a master queue (see the sketch below)
The GPU thread executes at the expense of doing productive work on the CPU cores

Use of shared physical memory on SoC systems:
The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
We circumvent this by immediately unprotecting the memory with mprotect() after initiating a kernel execution
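The sketch below illustrates the dynamic scheme and the mprotect() workaround described above, under several assumptions: dgemm_cpu_chunk()/dgemm_gpu_chunk() are hypothetical stand-ins for the CBLAS/cuBLAS calls on a column range, the CUDA-allocated buffer is page-aligned, and error handling is omitted.

```c
#include <omp.h>
#include <sys/mman.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Hypothetical stand-ins for the CBLAS / cuBLAS GEMM calls that process
 * the column range [col, col + cols) of B and C. */
extern void dgemm_cpu_chunk(int col, int cols);
extern void dgemm_gpu_chunk(int col, int cols, cublasHandle_t handle);

/* Dynamic partitioning: two OpenMP threads, one driving the CPU cores and
 * one driving the GPU, repeatedly take chunks of matrix columns off a
 * shared work queue until none remain. */
void dynamic_split(int N, int chunk, void *cuda_buf, size_t bytes,
                   cublasHandle_t handle)
{
    int next_col = 0;                       /* master work queue */

    omp_set_num_threads(2);                 /* thread 0 -> CPU, thread 1 -> GPU */
    #pragma omp parallel
    {
        int gpu = (omp_get_thread_num() == 1);
        for (;;) {
            int col;
            #pragma omp critical            /* pop a chunk of columns */
            { col = next_col; next_col += chunk; }
            if (col >= N) break;
            int cols = (col + chunk > N) ? N - col : chunk;

            if (gpu) {
                dgemm_gpu_chunk(col, cols, handle);  /* initiates a kernel */
                /* The CUDA driver write-protects the CUDA-allocated buffer
                 * while the kernel is in flight; immediately unprotect it
                 * (the mprotect() trick from this slide) so the CPU thread
                 * can keep computing on its own columns. cuda_buf is
                 * assumed to be page-aligned. */
                mprotect(cuda_buf, bytes, PROT_READ | PROT_WRITE);
                cudaDeviceSynchronize();             /* wait for this chunk */
            } else {
                dgemm_cpu_chunk(col, cols);
            }
        }
    }
}
```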

Results: Best split performance

DGEMM
Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
TK1        4096          14           12           2176             26
TX1        4096          18           9            2608             25
SANDY      8192          311          836          2128             1099
HASWELL    16384         804          1124         6912             1870

SGEMM
Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
TK1        4096          34           205          448              227
TX1        4096          38           391          128              399
SANDY      16384         643          2318         3392             2887
HASWELL    16384         1753         2526         6896             4109
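For reference, the GFLOPS figures above (and the Joules/FLOP figures on the following slides) follow from the standard 2·M·N·K operation count for GEMM. The helper below shows the arithmetic with purely illustrative inputs; it is not the authors' measurement code.

```c
#include <stdio.h>

/* A GEMM of size M x N x K performs 2*M*N*K floating point operations
 * (one multiply and one add per inner-product term). Given a measured run
 * time and energy, that count yields the GFLOPS and Joules/FLOP metrics
 * used in the results tables and the energy plots. */
static void gemm_metrics(long long M, long long N, long long K,
                         double seconds, double joules)
{
    double flops = 2.0 * (double)M * (double)N * (double)K;
    printf("GFLOPS      : %.1f\n", flops / seconds / 1e9);
    printf("Joules/FLOP : %.3g\n", joules / flops);
}

int main(void)
{
    /* Illustrative inputs only: a 4096^3 SGEMM taking 0.6 s and 5 J comes
     * out near 229 GFLOPS and roughly 36 pJ/FLOP. */
    gemm_metrics(4096, 4096, 4096, 0.6, 5.0);
    return 0;
}
```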

Best Split Search - Tegra K1/X1

[Figure: DGEMM and SGEMM GFLOPS and energy (Joules) versus split size given to the CPU, for the TK1 and TX1.]

Best Split Search - Intel + NVIDIA GPUs

[Figure: DGEMM and SGEMM GFLOPS and energy (Joules) versus split size given to the CPU, for SANDY and HASWELL.]

Performance Scaling - TK1

[Figure: DGEMM and SGEMM GFLOPS on the TK1 versus matrix dimension M=N=K (16 to 4096), comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU).]

Performance Scaling - TX1

[Figure: DGEMM and SGEMM GFLOPS on the TX1 versus matrix dimension M=N=K (16 to 4096), comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU).]

Energy Efficiency - TX1 - SGEMM

[Figure: SGEMM energy efficiency on the TX1 in Joules/FLOP (single precision) versus matrix dimension M=N=K (128 to 4096), comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated extremes 4.22·10⁻¹⁰ and 3.75·10⁻¹¹ Joules/FLOP.]

Energy Efficiency - Haswell - SGEMM

[Figure: SGEMM energy efficiency on HASWELL in Joules/FLOP (single precision) versus matrix dimension M=N=K (512 to 16384), comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated extremes 1.76·10⁻¹⁰ and 8.24·10⁻¹¹ Joules/FLOP.]

Conclusion

The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. It would allow libraries like ATLAS to produce both best-performance and best-energy optimized builds. How might a running application use information on energy usage to dynamically change its behaviour?

Use of shared physical memory on SoC systems eliminates transfer overhead.

In some cases (e.g. DGEMM on the TX1) an energy benefit was observed from exploiting both the CPU and GPU together.

The best energy efficiency observed on an SoC system was 37.5 pJ/FLOP (SGEMM on the TX1); on conventional systems the best observed was 82.4 pJ/FLOP (SGEMM on the K80).

Contact: [email protected] https://www.linkedin.com/in/alistair-rendell-6230b72
