GPU COMPUTING WITH TEGRA K1

COMPUTE WITH TEGRA K1

Toru Baji (馬路徹), Senior Solution Architect, NVIDIA

AGENDA

. Introducing Tegra K1
. Tegra K1 Compute Software Capabilities
- OpenGL GLSL
- OpenCL
- CUDA/Unified Memory
- Google Renderscript

[Figure: Kepler across NVIDIA product lines: Tesla in supercomputers, Quadro in workstations, GeForce in PCs, Mobile Kepler in Tegra. Tegra K1 is shown with four A15 cores plus one low-power (LP) A15 core and 192 CUDA cores.]

• Kepler Architecture
• ISA Compatible to GeForce, Quadro, Tesla
• 64kB L1 Cache and Shared Memory
• 128kB L2 Cache

TEGRA K1 DEVELOPMENT PLATFORMS

Coming to Android K1 Devices soon…

JETSON TK1: GigE, USB3.0, HDMI, running Linux4Tegra

JETSON X3 (TK1 PRO): GigE, USB3.0, HDMI, 8 x Cameras, CANBUS, running Vibrante Linux, AUTOMOTIVE GRADE

SOFTWARE FOR COMPUTE

Tegra K1 can accelerate:

Renderscript
OpenGL / OpenGL ES with Compute Shaders
OpenCV
NPP
OpenCL full profile
cuFFT, cuBLAS, cuSPARSE
CUDA

and a whole list of libraries enabling compute on the GPU

TEGRA K1 FOR OPENGL/GLSL

Kepler Architecture 192 CUDA Cores

Cortex-A15 4-Plus-1

Shared Physical Memory

2D Engine / ISP

COMPUTE SHADERS

. Standard OpenGL API
. Execute algorithmically general-purpose GLSL shaders
- Operate on buffers, images and textures
. Process graphics data in the context of the graphics pipeline
- Easier than interoperating with a compute API for graphics apps
. Standard part of all OpenGL 4.3+ implementations
- And now OpenGL ES 3.1!

Use cases: Image Processing, AI Simulation, Ray Tracing, Wave Simulation, Global Illumination

OPENGL COMPUTE SHADERS

[Figure: OpenGL pipeline including the compute path. Arrows indicate data flow.]

Graphics path (from application): Vertex Puller, Vertex Shader, Tessellation Control Shader, Tessellation Primitive Gen., Tessellation Eval. Shader, Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Per-Fragment Operations, Framebuffer

Compute path (from application): Dispatch, Compute Shader

Pixel path (from application): Pixel Assembly, Pixel Operations, Pixel Pack

Bound resources: Element Array Buffer (b), Draw Indirect Buffer (b), Vertex Buffer Object (b), Dispatch Indirect Buffer (b), Image Load / Store (t/b), Atomic Counter (b), Shader Storage (b), Texture Fetch (t/b), Uniform Block (b), Transform Feedback Buffer (b), Pixel Unpack Buffer (b), Texture Image (t), Pixel Pack Buffer (b)

Legend: Fixed Function Stage, Programmable Stage; b = Buffer Binding, t = Texture Binding

TEGRA K1 FOR COMPUTE

Kepler Architecture 192 CUDA Cores

Cortex-A15 4-Plus-1

Shared Physical Memory

2D Engine / ISP

TEGRA K1 FOR OPENCL

. OpenCL 1.2 Full profile support (OpenCL and OpenCL Embedded)
- True portability from desktop
- Higher precision, higher limits
. Awesome performance
. Related Session:

GTC2014 @ US S4808 - Real-Time Facial Motion Capture and Animation on Mobile, Emiliano Gambaretto

TEGRA K1 CUDA 6 AND SHARED PHYSICAL MEMORY

Kepler Architecture 192 CUDA Cores

Cortex-A15 4-Plus-1

Shared Physical Memory

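2D Engine / ISP

CUDA REQUIRES MEMORY COPY

Programmers are forced to do additional work: allocate memory on both the host and the device, then copy data from host to device before the kernel runs and from device to host afterwards.

[Figure: Conventional Discrete GPU. Host Memory (CPU Memory) and Device Memory (GPU Memory) are connected over PCIe; (1) marks the host-to-device copies, (2) marks the device-to-host copy.]

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;
    // x, y: host arrays of N floats; d_x, d_y: device arrays of N floats
    // (1) Copy the inputs from host memory to device memory (sizes are in bytes)
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    // Perform SAXPY on 1M elements
    saxpy<<<4096,256>>>(N, 2.0f, d_x, d_y);
    // (2) Copy the result from device memory back to host memory
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);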

CUDA UNIFIED MEMORY (FROM CUDA 6)

Developer View Today: System Memory + GPU Memory

Developer View With Unified Memory: Unified Memory
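To make the "Developer View With Unified Memory" concrete, the SAXPY example from the memory-copy slide can be sketched with `cudaMallocManaged`, which CUDA 6 introduced. This is a minimal sketch, not code from the slides: the helper name `saxpy_unified`, the test values (x = 1, y = 2), and the launch configuration are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Same SAXPY kernel as on the memory-copy slide: y = a*x + y
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not from the slides): runs SAXPY on managed memory.
// Returns the largest absolute error, or -1.0f if a CUDA call fails.
float saxpy_unified(int n, float a)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *x = nullptr, *y = nullptr;

    // One allocation per array, addressable from both CPU and GPU
    if (cudaMallocManaged(&x, bytes) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, bytes) != cudaSuccess) { cudaFree(x); return -1.0f; }

    // Initialize on the CPU through the managed pointers
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, x, y);  // no cudaMemcpy before the launch

    // The CPU must wait for the GPU before touching managed data again
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    float max_err = -1.0f;
    if (err == cudaSuccess) {
        max_err = 0.0f;
        for (int i = 0; i < n; ++i)
            max_err = fmaxf(max_err, fabsf(y[i] - (a * 1.0f + 2.0f)));
    }

    cudaFree(x);
    cudaFree(y);
    return max_err;
}

int main()
{
    float max_err = saxpy_unified(1 << 20, 2.0f);
    printf("max error: %f\n", max_err);
    return max_err == 0.0f ? 0 : 1;
}
```

Compared with the explicit-copy version, the separate device pointers and all three `cudaMemcpy` calls disappear; the only added obligation is the `cudaDeviceSynchronize()` before the CPU reads the result.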

Dramatically Lower Developer Effort

Faster performance on SoC

WHAT IS HAPPENING IN THE BACKGROUND
TEGRA K1 PHYSICALLY SHARED MEMORY

[Figure: Tegra K1 block diagram.]

CPU: Quad Cortex-A15 Processor + Shadow LP C-A15 CPU
GPU: Kepler, 192 CUDA Cores
Other blocks: HD Video, Image Processor, Audio Processor, ARM7, Security Engine, Display (HDMI, eDP/LVDS x2)
Interfaces: SATA2 x1, USB 2.0 x3, USB 3.0 x2, PCIe G2 x4 + x1, UART x4, SPI x4, I2C x5, SDIO/MMC x4, CSI x4 + x4, DAP x5 (I2S/TDM), NOR Flash, DDR3 Ctlr 64b

Physically Unified Memory without the need of Data Migration
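Because the CPU and GPU on Tegra K1 sit behind the same DRAM controller, a buffer does not have to be migrated to be visible to both. One way to exploit this, which the slides do not show and which is offered here only as an assumption-laden sketch, is CUDA's mapped pinned ("zero-copy") host memory: the CPU and GPU each get a pointer to the same physical allocation. The helper name `saxpy_zero_copy` and the test values are hypothetical.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// SAXPY kernel: y = a*x + y
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not from the slides): SAXPY on mapped pinned memory.
// Returns the largest absolute error, or -1.0f if a CUDA call fails.
float saxpy_zero_copy(int n, float a)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *x = nullptr, *y = nullptr;      // CPU-side pointers
    float *d_x = nullptr, *d_y = nullptr;  // GPU-side aliases of the same memory

    // Request host-mapped allocations. The call may report an error if a
    // context is already active, so any such error is cleared right away.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void **)&x, bytes, cudaHostAllocMapped) != cudaSuccess)
        return -1.0f;
    if (cudaHostAlloc((void **)&y, bytes, cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(x);
        return -1.0f;
    }

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Obtain device pointers that refer to the same physical buffers
    cudaError_t err = cudaHostGetDevicePointer((void **)&d_x, x, 0);
    if (err == cudaSuccess) err = cudaHostGetDevicePointer((void **)&d_y, y, 0);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, d_x, d_y);  // GPU works in place
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }

    float max_err = -1.0f;
    if (err == cudaSuccess) {
        max_err = 0.0f;
        for (int i = 0; i < n; ++i)
            max_err = fmaxf(max_err, fabsf(y[i] - (a * 1.0f + 2.0f)));
    }

    cudaFreeHost(x);
    cudaFreeHost(y);
    return max_err;
}

int main()
{
    float max_err = saxpy_zero_copy(1 << 20, 2.0f);
    printf("max error: %f\n", max_err);
    return max_err == 0.0f ? 0 : 1;
}
```

On a discrete GPU the same code would send every kernel access across PCIe, so whether this pattern pays off depends on the platform and on the access pattern; it should be measured rather than assumed.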

TEGRA K1 FOR RENDERSCRIPT

Kepler Architecture 192 CUDA Cores

Cortex-A15 4-Plus-1

Shared Physical Memory

2D Engine / ISP

RENDERSCRIPT

• C99 based kernel language

• Easy programmability with host and device portability

• Portable across wide range of devices, fastest on Tegra K1

• GPU (+ CPU) and more

• Renderscript API 19 Support

RENDERSCRIPT ON THE SOC

. Acceleration of Renderscript Scripts over GPU
- ScriptC, not just ScriptIntrinsics
- Huge gains in performance and performance/watt
. Runtime capable of scheduling work across units
- CPU
- GPU
- 2D Engine/ISP
. Related Session:

GTC2014 @ US S4885 - Efficient Parallel Computation on Android, Jason Sams, Tim Murray

INTRODUCTION (REF S4885)

CONSTRAINTS #1, #2 (REF S4885)

CONSTRAINTS #3 (REF S4885)

GPU OR CPU? (REF S4885)

DESKTOP PERFORMANCE TODAY (REF S4885)

MOBILE PERFORMANCE TODAY (SHIPPING) (REF S4885)

ARCHITECTURAL DIVERSITY? (REF S4885)

GOAL OF RENDERSCRIPT (REF S4885)

WHAT IS RENDERSCRIPT? (REF S4885)

RENDERSCRIPT INTRINSICS (REF S4885)

TEGRA K1? (REF S4885)

TEGRA K1? (REF S4885)

SUMMARY

. The Kepler GPU built into Tegra K1 is a scalable GPU that shares its architecture with Tesla/Quadro/GeForce

. As a result, the software assets and development environments for CUDA, OpenCL, and OpenGL Shading Language that have matured on Tesla/Quadro/GeForce can be used

. In addition, even for Renderscript, which targets mobile devices at the opposite end of the spectrum from HPC, workstations, and PCs, using the GPU delivers excellent performance

THANK YOU