Open Unified Architecture for Deep Learning Applications on EPYC™, Radeon™ and ROCm Platforms

Pavel Stanavov

Machine Learning Technologies’2018, Moscow, Russia

Introduction

If you are old enough, you may know this: AMD defined the architecture used in ~99% of current servers, the de facto x86-64 standard, the 64-bit extensions to the x86 architecture, first delivered with the AMD Opteron™ in 2003.

Now AMD is back in the data center business in a big way with the EPYC™ line of server chips, including the 12 nm EPYC and the 64-core, 7 nm EPYC™ "Rome".

A timeline of AMD firsts:
• 1970: First proprietary device, the Am2501 logic counter
• 2000: First to break the historic 1 GHz barrier, with the AMD Athlon™
• 2003: World's first x86 64-bit architecture
• 2004: World's first x86 dual-core processor
• 2006: First to break the teraflop performance barrier
• 2009: Breaks the 1 GHz GPU barrier with the Radeon™ HD 4890
• 2011: Brings the world's first APUs to market
• 2012: Industry's first quad-core x86 SoCs
• 2013: Inside every major next-generation gaming console
• 2014: Powered the world's most efficient supercomputer
• 2015: First to market with 6th-generation SoC technology
• 2016: First public demo of the "Zen" microarchitecture
• 2017: Introduction of Ryzen, Ryzen PRO, Vega, Instinct and Epyc

WHO WE ARE IN TECHNOLOGY:
• First to break the 1 GHz barrier
• First to build a 64-bit x86 processor
• First dual-core and quad-core processors in history
• Industry's first 64-bit ARM socket
• First server APU

AREAS OF PROCESSOR APPLICATION:
• Cloud technologies
• Gaming
• Medicine
• Personal computers
• Security
• Virtual reality
• Automotive
• Artificial intelligence
• Industry
• Banking

2018: THE PRODUCTS

• Processors, AMD Ryzen™: the newest core, the highest performance.
• Graphics, AMD Radeon™: fast cards for gamers and professionals; GPUs for machine learning.
• Server, AMD "Naples": the server market; we are back.
• Semi-custom (SCBU): our global leadership in gaming.

HOW DO PEOPLE LOOK AT A PROCESSOR? The designer's view, the supplier's view, the engineer's view, and the user's view: ARCHITECTED TOGETHER TO DELIVER BREAKTHROUGH PERFORMANCE

AI | Cloud | Machine Intelligence | Deep Learning | High-Performance Computing

Radeon™ "Vega" GPU Architecture

64 Next-Gen Compute Units (nCUs): 4,096 stream processors

12.3 TFLOPS Peak Single Precision Compute (FP32)

24.6 TFLOPS Peak Half Precision Compute (FP16)

16 GB HBM2 High Bandwidth Cache (Radeon Instinct™ MI25)

484 GB/s Memory Bandwidth
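Those peak figures follow from the shader count and clock: 4,096 stream processors × 2 FLOPS per clock (fused multiply-add) × ~1.5 GHz ≈ 12.3 TFLOPS at FP32, and packed FP16 executes two half-precision operations per lane, doubling that to 24.6 TFLOPS.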

INTRODUCING EPYC

Designed for the cloud: conceived, designed, and built for the workloads of today and tomorrow.
• Innovation: an SoC designed from scratch around all modern requirements
• Functionality: a complete feature set for real-world workloads
• Unprecedented security: the industry's first x86 processor with hardware security support

Accelerate business processes. Increase efficiency (TCO). Protect customer data.

ARM + x86: UNPRECEDENTED SECURITY

Threats know no borders; the server is "locked down" from boot to shutdown.

• Secure Root-of-Trust technology: boot to a secure coprocessor
• Secure Run technology: data and software at work
• Secure Move technology: private, hybrid, and public clouds

New attack vectors: more and more data comes under attack. Secure Memory Encryption (SME) protects data against memory hacks and scrapes; Secure Encrypted Virtualization (SEV) encrypts and isolates virtual machines.

$7M: the price of security.* 40% of threats come from outside.** Security mechanisms that significantly reduce server attacks let you deploy "on the edge" with confidence and offer differentiated service levels to customers.

The industry's first hardware-based security solution.

*, ** See endnotes.

EPYC IN NUMBERS

Outstanding performance, rapidly adaptable to the task at hand.

• 32, 24, 16, or 8 cores per socket: a broad model range without sacrifices
• 8 memory channels: more bandwidth
• 2 TB of RAM per socket: more available capacity
• 128 PCIe Gen 3 lanes per socket: more capability for peripherals
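The 2 TB figure is simple arithmetic over the memory layout (assuming 128 GB LRDIMMs, the largest commonly supported modules): 8 channels × 2 DIMMs per channel × 128 GB = 2,048 GB per socket.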

See endnotes.

AN UNCOMPROMISING SINGLE-SOCKET SOLUTION

One EPYC 7551P vs. two Intel Xeon 2650 v4:
• 21% higher integer performance
• 17% lower power consumption
• 33% more cores
• 11% more memory
• 60% more I/O
• 128 high-speed I/O lanes

See endnotes.

THE RIGHT SIZE WITH TWO-SOCKET EPYC

Two EPYC 7601 processors, linked by the socket-to-socket fabric:
• 47% higher integer performance*
• 45% more cores
• 122% more memory channels
• 60% more I/O

See endnotes.

EPYC: APPLICATION AREAS

HPC:
• CPU parallelism optimized for GPU accelerators
• Many ports for building clusters
• Lots of memory for large data volumes

MACHINE LEARNING:
• A processor optimized to work alongside graphics accelerators
• Many fast I/O lanes for fast data loading
• Raw compute power

DATA ANALYTICS:
• Parallel processing of a large number of low-latency threads
• Real-time data processing
• Many cores for fast analysis of incoming data
• Memory encryption

CLOUD:
• A high degree of parallelism for meeting SLAs without interruption
• Lots of memory for VMs
• Encryption for VM isolation
• I/O scalability

SOFTWARE-DEFINED STORAGE:
• Direct SATA & NVMe support
• Many ports for NVMe drives
• A large memory pool for data caching
• Protection for business-critical data

EPYC IN DETAIL

Lower TCO thanks to an optimal balance of compute power, memory, I/O, and security features.

COMPUTE
• 8 to 32 AMD "Zen" x86 cores (16 to 64 threads)
• 512 KB L2 cache per core (16 MB total L2 cache)
• 64 MB shared L3 cache (8 MB per 4 cores)
• TDP range: 120W-180W

MEMORY
• 8-channel DDR4 with ECC, up to 2666 MHz
• RDIMM, LRDIMM, 3DS, NVDIMM, Flash
• 2 DIMMs/channel; capacity of 2 TB/socket

INTEGRATED I/O (NO CHIPSET)
• 128 lanes of PCIe Gen3, used for PCIe, SATA, and Coherent Interconnect
• Up to 32 SATA or NVMe devices
• Server Controller Hub (USB, UART, SPI, LPC, I2C, etc.)

SECURITY
• Dedicated Security Subsystem
• Hardware Root-of-Trust
• Hardware Memory Encryption

EPYC™ + MI25: Optimized for Massive System Scalability

• 128 PCIe® lanes per CPU: removes PCIe switches
• Full PCIe P2P support
• 32 cores per CPU for I/O and compute balance
• Platform topology: 16 DIMMs of memory, 8 drives, NIC
• Provides strong I/O connectivity and bandwidth with a single high-performance CPU

Radeon Instinct on the EPYC Platform: Direct Device Connect

Optimized for VDI, ML & HPC Computing

• Lowest system cost
• Lower-latency architecture
• Peer-to-peer communication
• High-density footprint

AMD EPYC™ AND RADEON INSTINCT™: A DEEP LEARNING SOLUTION

EPYC provides:
• Up to 128 high-bandwidth PCIe® Gen 3.0 lanes for unmatched connectivity
• 8 memory channels for high bandwidth and capacity
• 32 high-performance cores

Radeon Instinct MI25 delivers:
• Hyperscale and HPC-class heterogeneous compute for machine intelligence and deep learning, along with HPC workloads
• 4,096 stream processors for 24.6 TFLOPS (FP16), 12.3 TFLOPS (FP32), 768 GFLOPS (FP64)
• 16 GB of the latest HBM2 ECC GPU memory with 484 GB/s of memory bandwidth*
• Radeon Instinct's open-ecosystem approach to datacenter design, with the ROCm open software platform and MIOpen libraries

*ECC support is limited to the HBM2 memory; ECC protection is not provided for internal GPU structures.

Combined, they deliver a world-class machine learning platform.

A PLATFORM OPTION

Inventec P47 cluster node:
• AMD EPYC™ 7601 processor: 32 cores (64 threads), 2.2 GHz base / 3.2 GHz boost
• 512 GB physical memory
• 4x Radeon Instinct™ MI25 accelerators: "Vega10" GPU architecture, 16 GB HBM2, 12.3 TFLOPS peak single precision each
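From the figures above, one P47 node peaks at roughly 4 × 12.3 ≈ 49 TFLOPS of FP32 compute, or about 98 TFLOPS at FP16.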

AMD "VEGA10" SOC

• 14 nm FinFET GPU; die size 19 mm × 25.6 mm (486 mm²); 12.5 billion transistors
• 2-stack HBM2: 4, 8, or 16 GB capacity; up to 484 GB/s with ECC; 2x the HBM1 rate in half the footprint
• 16x PCIe® Gen 3.0; 2nd-generation SR-IOV GPU virtualization
• Package: 47.5 mm × 47.5 mm, 3.42 mm z-height
• Power envelope: 150W-300W; idle: <2W

HSA: WHAT IS IT?

An architecture (with hUMA, heterogeneous uniform memory access) that lets the CPU, GPU, and other processors work together harmoniously on a single piece of silicon, handing parallel and serial workloads to the elements best suited to them.

APU: Accelerated Processing Unit

• The HSA Foundation includes AMD, Qualcomm, ARM, MediaTek, Samsung, Oracle, and others: http://hsafoundation.com/

Approach to Deep Learning

Where does Deep Learning fit in the AI ecosystem?

• AI
  • Perceptual understanding
  • Data analytics
  • Machine Learning
    • Deep Learning: CNN, RNN, LSTM, …

Deep Learning Development Plan

• Vision (images & video): Convolutional Neural Networks (CNNs); convolutions over 4D/5D tensors. This is the first focus area. Example: recognizing cats from YouTube, and road scenes.
• Natural Language Processing (speech, audio, text): Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Recurrent Weighted Average (RWA), Gated Recurrent Unit (GRU). Example: captioning "Girl in pink dress is jumping in air."
• Creative / Higher Reasoning: Generative Adversarial Networks (GANs), Reinforcement Learning. Example: beating the world champion in Go.

Open Source AI Frameworks, ranked by popularity (based on the number of stars on GitHub and Stack Overflow activity):
• Keras: a high-level ML API (a metaframework) that uses other frameworks as backends (TensorFlow, MXNet, CNTK, originally Theano). One of the fastest-growing libraries for DL.
• TensorFlow: the ML library by Google; the most popular AI library at the moment.
• Caffe: a Python DL library from UC Berkeley for supervised computer-vision problems. It used to be the most popular deep learning library in use, and is still the most popular in embedded.
• PyTorch: one of the newest DL frameworks, noted for simplicity and ease of use. Very popular for its dynamic computational graph and efficient memory usage.
• CNTK: the Microsoft Cognitive Toolkit, supported by Microsoft. Not hugely popular compared with TensorFlow/PyTorch/Caffe.
• MXNet: promoted by Amazon and supported by the Apache Foundation. Very popular in the R-language community, with APIs for multiple languages.
• Caffe2: an ML framework supported by Facebook and built on the original Caffe.
• Some others: Torch, Theano, DeepLearning4j, and the ONNX exchange converter.

Example: Step-by-Step Guide

1. Define the problem and collect data. Example: object detection, with 1M labeled images for training.
2. Choose a framework. Examples: Caffe, Caffe2, TensorFlow, MXNet, etc.
3. Choose a model, or build your own. Examples: AlexNet, VGGNet, Inception, etc.
4. Train and test. Some HW choices: Project 47; server: EPYC™ + Vega (MI25); AMD 32-core Threadripper™ + MI6, MI10, or MI25.
5. Deploy. Some HW choices: server: EPYC™ + Vega; desktop: Ryzen™ + MI6, MI10, MI25; mobile: AMD APUs; or any NN-capable platform.

AMD SOFTWARE STACK

Applications: machine learning apps (ResNet-50 training, CANDLE, etc.)

Frameworks: Caffe, Caffe2, TensorFlow, PyTorch, Keras, MXNet

Middleware & libraries: MIOpen; BLAS, FFT, RNG; RCCL; Eigen

Programming models: HCC, HIP, OpenCL™, Python

Platform: ROCm

Italics = Under Dev by AMD

OPEN AND SHUT: THE CASE FOR AMD'S OPEN SOURCE MACHINE INTELLIGENCE SOFTWARE STACK

Open-source MI software, AMD/ROCm vs. NVIDIA/CUDA:
• CAFFE and CAFFE-2: open-source | open-source
• TensorFlow: open-source | open-source
• Programming language: open-source (HIP C++) | proprietary (CUDA)
• Accelerated MI library: open-source (MIOpen) | proprietary (cuDNN)
• Accelerated math libraries: open-source (rocBLAS, rocRAND, rocFFT, rocSPARSE) | proprietary (cuBLAS, cuRAND, cuFFT, cuSPARSE)
• Communication library: open-source (RCCL) | NCCL
• Runtime: open-source (ROCr) | proprietary
• Driver: open-source (AMDGPU) | proprietary
• Documented ISA: open (GCN) | proprietary

Italics = under development.
Software: https://rocm.github.io
Blog: https://gpuopen.com/open-shut-case-amds-open-source-machine-intelligence-software-stack/

MIOpen: high-performance DL primitives with functionality similar to cuDNN

Key features:
• Convolutions for inference and training
• "In place" Winograd solver
• Optimized GEMM for deep learning
• Pooling, forward & backward
• Softmax
• Activation
• Batch normalization
• RNN support

Architecture:
• HIP and OpenCL top-level APIs
• Kernels in high-level source and GCN assembly
• Documented ISA with open-source tools
• Benefits from "Vega10": packed FP16 (>25 TFLOPS), cross-lane "DPP" instructions, LDS scratchpad memory (>13 TB/s)
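To show how these primitives are driven from application code, here is a minimal, hypothetical sketch of a ReLU activation forward pass through MIOpen's HIP API (buffer initialization and error checking omitted; the convolution path additionally needs find/workspace calls):

#include <hip/hip_runtime.h>
#include <miopen/miopen.h>

int main() {
    const int n = 1, c = 8, h = 32, w = 32;          // NCHW tensor shape
    float *x = nullptr, *y = nullptr;                // device buffers
    hipMalloc(&x, n * c * h * w * sizeof(float));
    hipMalloc(&y, n * c * h * w * sizeof(float));

    miopenHandle_t handle;
    miopenCreate(&handle);

    // One descriptor serves as both the input and output shape here
    miopenTensorDescriptor_t desc;
    miopenCreateTensorDescriptor(&desc);
    miopenSet4dTensorDescriptor(desc, miopenFloat, n, c, h, w);

    // Configure a ReLU activation
    miopenActivationDescriptor_t act;
    miopenCreateActivationDescriptor(&act);
    miopenSetActivationDescriptor(act, miopenActivationRELU, 1.0, 1.0, 1.0);

    const float alpha = 1.0f, beta = 0.0f;
    miopenActivationForward(handle, act, &alpha, desc, x, &beta, desc, y);

    miopenDestroyActivationDescriptor(act);
    miopenDestroyTensorDescriptor(desc);
    miopenDestroy(handle);
    hipFree(x);
    hipFree(y);
    return 0;
}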

ROCm Programming Model Options

CUDA → HIP: convert CUDA to portable HIP C++
• Single-source host + kernel
• C++ kernel language; C runtime
• Platforms: AMD GPU and NVIDIA (same performance as native CUDA)
• When to use it: porting existing CUDA code; developers familiar with CUDA; new projects that need portability to AMD and NVIDIA

HIP: a true single-source C++ accelerator language
• Single-source host + kernel
• C++ kernel language; HIP runtime
• Platforms: AMD GPU
• When to use it: new projects where a true C++ language is preferred; when you want features from the latest ISO C++ standards

OpenCL™: the Khronos industry-standard accelerator language
• Split host/kernel
• C99-based kernel language; OpenCL runtime
• Platforms: CPU, GPU, FPGA
• When to use it: porting existing OpenCL code; new projects that need portability to CPU, GPU, and FPGA
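As a concrete illustration of the single-source HIP model described above, a minimal vector-add sketch (hypothetical example; error checking omitted):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    hipMalloc(&da, bytes);
    hipMalloc(&db, bytes);
    hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    // Host and kernel code live in one C++ source file; one launch call
    hipLaunchKernelGGL(vadd, dim3((n + 255) / 256), dim3(256), 0, 0,
                       da, db, dc, n);

    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}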

https://github.com/ROCm-Developer-Tools/hcc2-hip

HIP: Key Features
• Strong support for the most commonly used parts of the CUDA API: streams, events, memory allocation/deallocation, profiling
• HIP includes driver-API support (modules and contexts)
• The HIP Clang language path uses the CUDA offload Clang driver and some CUDA features
• Full C++ support, including templates, namespaces, classes, and lambdas
• AMD's open-source GPU compiler is based on near-tip Clang/LLVM and supports C++11, C++14, and some C++17 features
• HIPified code (HIP code) is portable to AMD/ROCm and NVIDIA/CUDA:
  • On CUDA, developers can use native CUDA tools (nvcc, nvprof, etc.)
  • On ROCm, developers can use native ROCm tools (hcc, rocm-prof, …)
• HIP is a new C-style language like CUDA: https://github.com/ROCm-Developer-Tools/hcc2-hip
• The HIP ecosystem includes hipBLAS, hipFFT, hipRNG, MIOpen, TensorFlow, PyTorch, etc.
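The commonly used runtime pieces named above (streams, events, asynchronous copies) compose as in this minimal, hypothetical sketch that times a host-to-device transfer:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 24;
    char *h, *d;
    hipHostMalloc(reinterpret_cast<void**>(&h), bytes);  // pinned host memory for async copies
    hipMalloc(reinterpret_cast<void**>(&d), bytes);

    hipStream_t stream;
    hipStreamCreate(&stream);
    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Time an asynchronous host-to-device copy on the stream
    hipEventRecord(start, stream);
    hipMemcpyAsync(d, h, bytes, hipMemcpyHostToDevice, stream);
    hipEventRecord(stop, stream);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    printf("H2D copy: %.3f ms\n", ms);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipStreamDestroy(stream);
    hipFree(d);
    hipHostFree(h);
    return 0;
}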

HIPify tools automate the translation from CUDA to HIP; developers should expect some final cleanup and performance tuning.

Hipification of a CUDA Kernel (CAFFE)

CUDA (before):

namespace caffe {
template <typename Dtype>
__global__ void BNLLForward(const int n, const Dtype* in, Dtype* out) {
  for (int index = blockIdx.x * blockDim.x + threadIdx.x;
       index < (n);
       index += blockDim.x * gridDim.x) {
    out[index] = in[index] > 0 ?
        in[index] + log(1. + exp(-in[index])) :
        log(1. + exp(in[index]));
  }
}
}  // namespace caffe

HIP (after automated HIPIFY):

namespace caffe {
template <typename Dtype>
__global__ void BNLLForward(const int n, const Dtype* in, Dtype* out) {
  for (int index = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
       index < (n);
       index += hipBlockDim_x * hipGridDim_x) {
    out[index] = in[index] > 0 ?
        in[index] + log(1. + exp(-in[index])) :
        log(1. + exp(in[index]));
  }
}
}  // namespace caffe

C++ features and math library calls are unchanged; only the thread-indexing builtins are rewritten.

Hipification of CUDA Runtime APIs (CAFFE)

CUDA (before):

void SyncedMemory::async_gpu_push(const cudaStream_t& stream) {
  CHECK(head_ == HEAD_AT_CPU);
  if (gpu_ptr_ == NULL) {
    cudaGetDevice(&gpu_device_);
    cudaMalloc(&gpu_ptr_, size_);
    own_gpu_data_ = true;
  }
  const cudaMemcpyKind put = cudaMemcpyHostToDevice;
  cudaMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
  // Assume caller will sync on the stream
  head_ = SYNCED;
}

HIP (after automated HIPIFY):

void SyncedMemory::async_gpu_push(const hipStream_t& stream) {
  CHECK(head_ == HEAD_AT_CPU);
  if (gpu_ptr_ == NULL) {
    hipGetDevice(&gpu_device_);
    hipMalloc(&gpu_ptr_, size_);
    own_gpu_data_ = true;
  }
  const hipMemcpyKind put = hipMemcpyHostToDevice;
  hipMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
  // Assume caller will sync on the stream
  head_ = SYNCED;
}

Porting with the hipify Tool

CUDA source
→ hipify: ~99%+ automatic conversion
→ developer cleanup and tuning
→ developer maintains the portable HIP port

The HIP C++ language is official for LLVM (HIP Clang); the resulting C++ code runs on NVIDIA or AMD GPUs.

HIP Compilation Process

Portable HIP C++ (kernels + HIP API) can be compiled for either vendor:

NVIDIA path:
• HIP API implemented as inlined calls to CUDA via the HIP CUDA runtime header
• Compute kernels mostly unchanged
• Code compiled with NVCC, the same as CUDA, producing a CUDA executable
• Tools: nvprof, the CUDA debugger, and other native CUDA tools

AMD path:
• HIP and HCC merged into the HIP language
• HIP API implemented with a lightweight HIP runtime header that uses HCC's hc::accelerator, hc::accelerator_view, and hc::completion_future; some calls go directly into the ROCr runtime
• Compute kernels mostly unchanged
• Code compiled with HIPCC, producing an ELF executable
• Performance tools: ROCm profiling library, PAPI, and TAU; debugger: ROCm GDB

The result is source-portable, not binary-portable.
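A small sketch of that source portability (hypothetical example): the same HIP program compiles with NVCC on the NVIDIA path and HIPCC on the AMD path, and reports whichever devices it lands on:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int id = 0; id < count; ++id) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, id);
        // Prints e.g. an MI25 under ROCm or a GeForce/Tesla under CUDA
        printf("device %d: %s, %d multiprocessors\n",
               id, prop.name, prop.multiProcessorCount);
    }
    return 0;
}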

ROCm: Deep Learning Gets HIP, a faster path for bringing deep learning applications to AMD GPUs

The challenge:
• CAFFE is a popular machine-learning framework; the tip version on GitHub has 55,000+ lines of code
• It is GPU-accelerated with CUDA

The results:
• 99.6% of the code was unmodified or automatically converted
• The port required less than one week of developer time
• The port supports all CAFFE features (multi-GPU, P2P, FFT filters)
• HIPCAFFE is the fastest CAFFE on AMD hardware: 1.8x faster than CAFFE/OpenCL

Complexity of porting CAFFE (lines of code changed):
• OpenCL port: 32,227 manual
• HIP port: 688 automatic + 219 manual

AMD Internal Data

End-to-End Open-Source Compute Software Stack

Single-source C/C++ goes through the HIP C/C++ compiler, built from fully open components:

Lightning LLVM compiler:
• Offline compilation, direct to ISA, with full GCN ISA documentation
• CLANG/LLVM optimization passes with GCN and x86 targets
• GCN assembler
• Fully open source

ROCm runtime (ROCr):
• Headless runtime with low-latency dispatch and low-latency DMA
• Peer-to-peer & RDMA support; user-mode queues
• Fully open source

AMDGPU kernel driver:
• In the box, upstream, and fully open source

CPU code runs on the system CPU; GPU code is dispatched to the GPU.

HIP Automated Conversion

Examples of other (non-AI) software converted from CUDA to HIP; all convert between 96% and 100% automatically:

FinanceBench cunn barracuda cutorch libgeodecomp gpubiotools nvbio arrayfire magma-1.7.0 stella hoomd-v1.1.1 shoc

A New Path Forward

rocm.github.io

ROCm Source Code

Source-code repositories for the kernel driver, thunk, and runtime:
• ROCk: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
• ROCt: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface
• ROCr: https://github.com/RadeonOpenCompute/ROCR-Runtime

HIP C++ compiler:
• HCC compiler source: https://github.com/RadeonOpenCompute/hcc
• HCC Clang source: https://github.com/RadeonOpenCompute/hcc-clang

HIP converter:
• HIP runtime and tools for porting CUDA applications with HCC: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
• HIP examples: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP-Examples

LLVM source for the GCN ISA compiler:
• LLVM source code for native GCN ISA code generation: https://github.com/RadeonOpenCompute/llvm
• ROCm device-library compiler intrinsics, with the Open Compute Math Library and Open Compute kernel language: https://github.com/RadeonOpenCompute/ROCm-Device-Libs

Documents

• ROCm.github.io: https://rocm.github.io/documentation.html
• HSA Foundation runtime spec used by ROCr: http://www.hsafoundation.com/html_spec11/HSA_Library.htm
• ROCm error codes: https://rocm.github.io/ROCmRTec.html
• ROCm SMI: https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md
• Use of PCIe atomics and large-BAR support: https://github.com/RadeonOpenCompute/RadeonOpenCompute.github.io/blob/master/ROCmPCIeFeatures.md
• ROCmRDMA: https://github.com/RadeonOpenCompute/ROCmRDMA

ROCm Documentation: Compiler

• HIP compiler overview: https://github.com/RadeonOpenCompute/hcc/wiki
• Documentation: http://scchan.github.io/hcc/
• Porting guide: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_porting_guide.md
• Runtime API: http://gpuopen-professionalcompute-tools.github.io/HIP/
• ABI: https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md
• LLVM documentation: http://llvm.org/docs/AMDGPUUsage.html
• Compiler device intrinsics: https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/master/doc/OCML.md

DISCLAIMERS AND ATTRIBUTIONS

DISCLAIMER: The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps, and/or product release dates shown in these slides are plans only and subject to change. "Vega" is a codename for an AMD architecture and is not a product name.


The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

SPEC® and the benchmark names SPECint® and SPEC CPU® are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPECint and SPEC CPU, see www.spec.org.

©2018 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Epyc, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft and DirectX are trademarks or registered trademarks of Microsoft Corporation in the US and other jurisdictions. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
