Experiences Using Tegra K1 and X1 for Highly Energy-Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell

Research School of Computer Science, Australian National University, Canberra, Australia

April 07, 2016

Overview

1 Introduction & Background

2 Power Measurement Environment

3 Experimental Platforms

4 Approach

5 Results & Analysis

6 Conclusion

Use of low-powered SoCs for HPC

Nvidia Jetson TK1: ARM + GPU SoC
Nvidia Jetson TX1: ARM + GPU SoC
TI Keystone II: ARM + DSP SoC
Parallella: ARM + 64-core NoC
TI BeagleBoard: ARM + DSP SoC
Terasic DE1: ARM + FPGA SoC
Firefly: ARM + GPU SoC
Freescale Wandboard: ARM + GPU SoC
Cubieboard4: ARM + GPU SoC

http://cs.anu.edu.au/systems


For SoC processors to be considered viable exascale building blocks, important factors to explore include:

Absolute performance
Balancing the use of different on-chip devices
Understanding the performance-energy trade-off

Contributions

Environment for monitoring and collecting high-resolution power measurements for SoC systems
Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
Performance and energy comparisons with conventional HPC systems: Xeon CPUs and NVIDIA K20 and K80 GPUs

Measurement Requirements

SoC systems generally consume very low power (a few Watts)
Subtle differences in energy consumption are triggered by different factors, such as the use of CPU or on-chip GPU cores
Changes in the DC current supplied to SoC system boards must be reliably measured
Current draw ranges from µAmps to a few Amps, so a very high-precision ammeter must be used to measure subtle changes

Measurement Apparatus

µCurrent Gold: a high-precision ammeter for measuring low currents (https://www.eevblog.com/projects/ucurrent/)
An LPC1768 micro-controller with a 12-bit ADC (0-3.3V) is used to measure the analog output signals from the µCurrent Gold
The ADC has a resolution of 0.81±0.40mV, which corresponds to 0.81mA; this is 9.7±4.8mW at 12V (see the conversion sketch below)

https://developer.mbed.org/
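The resolution figures above follow directly from the hardware parameters. The sketch below is a minimal illustration of that conversion, assuming the µCurrent Gold is on its mA range (1 mV of output per 1 mA measured), a 3.3 V ADC reference over 4096 counts, and a 12 V supply to the board; the constants and function names are ours, not part of the measurement firmware.

```c
#include <stdio.h>

/* Convert a raw 12-bit ADC sample from the LPC1768 into supply current and
 * power. Assumed settings (not stated verbatim on the slide): µCurrent Gold
 * on its mA range (1 mV of output per 1 mA measured), 3.3 V ADC reference
 * over 4096 counts, and a 12 V DC supply to the SoC board. One ADC count is
 * then 3300/4096 ≈ 0.81 mV ≈ 0.81 mA ≈ 9.7 mW, matching the slide. */
#define ADC_VREF_MV  3300.0   /* ADC full scale in millivolts   */
#define ADC_COUNTS   4096.0   /* 12-bit converter               */
#define MV_PER_MA    1.0      /* µCurrent Gold gain, mA range   */
#define SUPPLY_VOLTS 12.0     /* DC supply voltage to the board */

static double adc_to_milliamps(unsigned raw)
{
    return (raw * ADC_VREF_MV / ADC_COUNTS) / MV_PER_MA;
}

static double adc_to_milliwatts(unsigned raw)
{
    return adc_to_milliamps(raw) * SUPPLY_VOLTS;   /* mA * V = mW */
}

int main(void)
{
    /* Smallest current/power step the logger can resolve (one ADC count). */
    printf("resolution: %.2f mA, %.2f mW\n",
           adc_to_milliamps(1), adc_to_milliwatts(1));
    return 0;
}
```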

Power Measurement Environment

[Figure: the power measurement environment setup.]

Experimental Platforms

            TK1              TX1              SANDY             HASWELL
CPU         ARM Cortex-A15   ARM Cortex-A57   Xeon E5-2665      Xeon E5-2670 v3
CPU Cores   4                4                2×8               2×12
CPU Freq.   2.3 GHz          2.2 GHz          2.4 GHz           2.3 GHz
RAM         2GB LPDDR3       3GB LPDDR4       128GB DDR3        128GB DDR3
GPU         GK20A            GM20B            K20m (GK110)      K80 (GK210)
GPU Cores   192              256              2496              2496
GPU Freq.   852 MHz          998 MHz          706 MHz           875 MHz
GPU RAM     Shared           Shared           5GB               12GB
CUDA        v6.5             v7.0             v7.0              v7.5
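As a sanity check of the GPU columns in this table, a short CUDA query (sketch below) prints the SM count, clock, and memory reported by the driver on each platform. It uses only standard cudaDeviceProp fields and is our own illustration, not part of the talk.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the GPU details behind the table above (name, SM count, clock,
 * memory, and whether the GPU shares physical memory with the host) for
 * every CUDA device visible on the platform. Build with nvcc. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s, %d SMs @ %.0f MHz, %.1f GiB %s\n",
               d, p.name, p.multiProcessorCount, p.clockRate / 1000.0,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               p.integrated ? "(shared with host)" : "(dedicated)");
    }
    return 0;
}
```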

Evaluation Kernel

The evaluation kernel is matrix multiplication, C = A × B, with the B and C matrices partitioned column-wise between the CPU and the GPU (sketched in code below):

[ C1 | C2 ] = A × [ B1 | B2 ]

C1 = A × B1  (CPU)        C2 = A × B2  (GPU)
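A minimal sketch of this column split, assuming column-major storage, CUDA managed (shared) memory on the TK1/TX1, CBLAS for the host half and cuBLAS for the GPU half; the function name and the split column s are illustrative, not the authors' implementation.

```c
#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* C = A * B with B and C split column-wise at column s (column-major):
 *   columns [0, s) of C are computed on the CPU:  C1 = A * B1
 *   columns [s, N) of C are computed on the GPU:  C2 = A * B2
 * A, B, C are assumed to be allocated with cudaMallocManaged so both
 * devices address the same physical memory on the TK1/TX1 and no copies
 * are needed. Note: the CUDA driver protects managed memory while a
 * kernel is in flight, so the concurrent CPU work below relies on the
 * mprotect() workaround described later in the deck. */
static void split_dgemm(int M, int N, int K, int s,
                        const double *A, const double *B, double *C,
                        cublasHandle_t handle)
{
    const double alpha = 1.0, beta = 0.0;

    /* GPU half, launched asynchronously: C2 = A * B2 */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N - s, K, &alpha,
                A, M, B + (size_t)s * K, K,
                &beta, C + (size_t)s * M, M);

    /* CPU half overlaps with the GPU kernel: C1 = A * B1 */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                M, s, K, 1.0, A, M, B, K, 0.0, C, M);

    cudaDeviceSynchronize();    /* wait for the GPU columns to finish */
}
```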

Approaches

Traditional methods: assign all work to the GPU or the CPU

Static Partitioning: partition work between GPU and CPU based on a priori information
Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

Dynamic Partitioning:
Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures

→ Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency

Our approach

Static partitioning:
Guess a partition based on experimentally measured peak performances of the CPU and GPU
Use the achieved peaks to refine the partition
Repeat until convergence
Suitable for repeated calculations of the same size

Dynamic partitioning:
CPU and GPU remove chunks of matrix columns from a workqueue
Chunk size must be sufficient to occupy the CPU and GPU fully
On traditional discrete GPU systems, copies have to be carefully scheduled
Implemented using OpenMP: two threads, one each for CPU and GPU, taking work off a master queue (see the sketch below)
The GPU thread executes at the expense of doing productive work on the CPU cores

Use of shared physical memory on SoC systems:
The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
We circumvent this by immediately unprotecting the memory with mprotect() after initiating a kernel execution
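The sketch below illustrates the dynamic scheme and the mprotect() workaround described above, under several assumptions: dgemm_cpu_chunk()/dgemm_gpu_chunk() are hypothetical stand-ins for the CBLAS/cuBLAS calls on a column range, the CUDA-allocated buffer is page-aligned, and error handling is omitted.

```c
#include <omp.h>
#include <sys/mman.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Hypothetical stand-ins for the CBLAS / cuBLAS GEMM calls that process
 * the column range [col, col + cols) of B and C. */
extern void dgemm_cpu_chunk(int col, int cols);
extern void dgemm_gpu_chunk(int col, int cols, cublasHandle_t handle);

/* Dynamic partitioning: two OpenMP threads, one driving the CPU cores and
 * one driving the GPU, repeatedly take chunks of matrix columns off a
 * shared work queue until none remain. */
void dynamic_split(int N, int chunk, void *cuda_buf, size_t bytes,
                   cublasHandle_t handle)
{
    int next_col = 0;                       /* master work queue */

    omp_set_num_threads(2);                 /* thread 0 -> CPU, thread 1 -> GPU */
    #pragma omp parallel
    {
        int gpu = (omp_get_thread_num() == 1);
        for (;;) {
            int col;
            #pragma omp critical            /* pop a chunk of columns */
            { col = next_col; next_col += chunk; }
            if (col >= N) break;
            int cols = (col + chunk > N) ? N - col : chunk;

            if (gpu) {
                dgemm_gpu_chunk(col, cols, handle);  /* initiates a kernel */
                /* The CUDA driver write-protects the CUDA-allocated buffer
                 * while the kernel is in flight; immediately unprotect it
                 * (the mprotect() trick from this slide) so the CPU thread
                 * can keep computing on its own columns. cuda_buf is
                 * assumed to be page-aligned. */
                mprotect(cuda_buf, bytes, PROT_READ | PROT_WRITE);
                cudaDeviceSynchronize();             /* wait for this chunk */
            } else {
                dgemm_cpu_chunk(col, cols);
            }
        }
    }
}
```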

Results: Best split performance

DGEMM
Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
TK1        4096          14           12           2176             26
TX1        4096          18           9            2608             25
SANDY      8192          311          836          2128             1099
HASWELL    16384         804          1124         6912             1870

SGEMM
Platform   Matrix Size   CPU GFLOPS   GPU GFLOPS   CPU SPLIT COLS   SPLIT GFLOPS
TK1        4096          34           205          448              227
TX1        4096          38           391          128              399
SANDY      16384         643          2318         3392             2887
HASWELL    16384         1753         2526         6896             4109
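For reference, the GFLOPS figures above (and the Joules/FLOP figures on the following slides) follow from the standard 2·M·N·K operation count for GEMM. The helper below shows the arithmetic with purely illustrative inputs; it is not the authors' measurement code.

```c
#include <stdio.h>

/* A GEMM of size M x N x K performs 2*M*N*K floating point operations
 * (one multiply and one add per inner-product term). Given a measured run
 * time and energy, that count yields the GFLOPS and Joules/FLOP metrics
 * used in the results tables and the energy plots. */
static void gemm_metrics(long long M, long long N, long long K,
                         double seconds, double joules)
{
    double flops = 2.0 * (double)M * (double)N * (double)K;
    printf("GFLOPS      : %.1f\n", flops / seconds / 1e9);
    printf("Joules/FLOP : %.3g\n", joules / flops);
}

int main(void)
{
    /* Illustrative inputs only: a 4096^3 SGEMM taking 0.6 s and 5 J comes
     * out near 229 GFLOPS and roughly 36 pJ/FLOP. */
    gemm_metrics(4096, 4096, 4096, 0.6, 5.0);
    return 0;
}
```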

Best Split Search - Tegra K1/X1

[Figure: DGEMM and SGEMM GFLOPS and energy (Joules) versus split size given to the CPU, for the TK1 and TX1.]

Best Split Search - Intel + NVIDIA GPUs

[Figure: DGEMM and SGEMM GFLOPS and energy (Joules) versus split size given to the CPU, for SANDY and HASWELL.]

Performance Scaling - TK1

[Figure: DGEMM and SGEMM GFLOPS on the TK1 versus matrix dimension M=N=K (16 to 4096), comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU).]

Performance Scaling - TX1

[Figure: DGEMM and SGEMM GFLOPS on the TX1 versus matrix dimension M=N=K (16 to 4096), comparing CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU).]

Energy Efficiency - TX1 - SGEMM

[Figure: SGEMM energy efficiency on the TX1 in Joules/FLOP (single precision) versus matrix dimension M=N=K (128 to 4096), comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated extremes 4.22·10⁻¹⁰ and 3.75·10⁻¹¹ Joules/FLOP.]

Energy Efficiency - Haswell - SGEMM

[Figure: SGEMM energy efficiency on HASWELL in Joules/FLOP (single precision) versus matrix dimension M=N=K (512 to 16384), comparing CPU, GPU, SPLIT, TBALANCE, and DYNAMIC; annotated extremes 1.76·10⁻¹⁰ and 8.24·10⁻¹¹ Joules/FLOP.]

Conclusion

The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. It would allow libraries like ATLAS to produce both best-performance and best-energy optimized builds. How might a running application use information on energy usage to dynamically change its behaviour?

Use of shared physical memory on SoC systems eliminates transfer overhead.

In some cases (e.g. DGEMM on the TX1) an energy benefit was observed from exploiting both the CPU and GPU together.

The best energy efficiency observed on an SoC system was 37.5 pJ/FLOP (SGEMM on the TX1); on conventional systems the best observed was 82.4 pJ/FLOP (SGEMM on the K80).

Contact: [email protected] https://www.linkedin.com/in/alistair-rendell-6230b72
