GPU COMPUTING WITH TEGRA K1
COMPUTE WITH TEGRA K1
Toru Baji (馬路 徹), Senior Solution Architect, NVIDIA
AGENDA
. Introducing Tegra K1
. Tegra K1 Compute Software Capabilities
— OpenGL GLSL
— OpenCL
— CUDA/Unified Memory
— Google Renderscript
• Kepler Architecture
• ISA Compatible: Tegra K1 to GeForce, Quadro, Tesla
• 64kB L1 Cache and Shared Memory
• 128kB L2 Cache
• 192 CUDA Cores
[Figure: Kepler across product lines. Tesla in supercomputers, Quadro in workstations, GeForce in PCs, mobile Kepler in Tegra next to the Cortex-A15 cores and the LP core]
TEGRA K1 DEVELOPMENT PLATFORMS
JETSON TK1: GigE, USB3.0, HDMI; running Linux4Tegra
JETSON X3 (TK1 PRO): GigE, USB3.0, HDMI, 8 x Cameras, CANBUS; running Vibrante Linux; AUTOMOTIVE GRADE
Coming to Android K1 Devices soon…
SOFTWARE FOR COMPUTE
Tegra K1 can accelerate
. Renderscript
. OpenGL / OpenGL ES with Compute Shaders
. OpenCV
. NPP
. OpenCL full profile
. cuFFT, cuBLAS, cuSPARSE
. CUDA
and a whole list of libraries enabling compute on the GPU
TEGRA K1 FOR OPENGL/GLSL
Kepler Architecture 192 CUDA Cores
Cortex-A15 4-Plus-1
Shared Physical Memory
2D Engine / ISP
COMPUTE SHADERS
. Standard OpenGL API
. Execute algorithmically general purpose GLSL shaders
— Operate on buffers, images and textures
. Process graphics data in the context of the graphics pipeline
— Easier than interoperating with a compute API for graphics apps
. Standard part of all OpenGL 4.3+ implementations
— And now OpenGL ES 3.1!
[Figure: example compute shader workloads. Image processing, AI Simulation, Ray Tracing, Wave Simulation, Global Illumination]
OPENGL COMPUTE SHADERS
[Figure: OpenGL pipeline data flow.
Graphics path: From Application, Vertex Puller, Vertex Shader, Tessellation Control Shader, Tessellation Primitive Gen., Tessellation Eval. Shader, Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Per-Fragment Operations, Framebuffer.
Compute path: From Application, Dispatch, Compute Shader.
Pixel path: From Application, Pixel Assembly, Pixel Operations, Texture Image / Pixel Pack.
Bindings: Element Array Buffer (b), Draw Indirect Buffer (b), Vertex Buffer Object (b), Dispatch Indirect Buffer (b), Image Load / Store (t/b), Atomic Counter (b), Shader Storage (b), Texture Fetch (t/b), Uniform Block (b), Transform Feedback Buffer (b), Pixel Unpack Buffer (b), Texture Image (t), Pixel Pack Buffer (b).
Legend: b – Buffer Binding, t – Texture Binding; stages are either Fixed Function or Programmable; arrows indicate data flow]
TEGRA K1 FOR COMPUTE
TEGRA K1 FOR OPENCL
. OpenCL 1.2 Full profile support (OpenCL and OpenCL Embedded)
— True portability from desktop
— Higher precision, higher limits
. Awesome performance
. Related Session:
GTC2014 @ US S4808 - Real-Time Facial Motion Capture and Animation on Mobile, Emiliano Gambaretto
TEGRA K1 CUDA 6 AND SHARED PHYSICAL MEMORY
CUDA REQUIRES MEMORY COPY
Programmers are forced to perform additional work: allocating memory on both the host and the device, and copying data between them in both directions.
Conventional Discrete GPU
    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;
    // x, y are host arrays; d_x, d_y are device arrays from cudaMalloc (allocation not shown)
    // (1) Copy inputs from host memory to device memory; the size argument is in bytes
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    // Perform SAXPY on 1M elements
    saxpy<<<4096,256>>>(N, 2.0f, d_x, d_y);
    // (2) Copy the result from device memory back to host memory
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
[Figure: Host Memory (CPU Memory) and Device Memory (GPU Memory) connected by PCIe; (1) host-to-device copies, (2) device-to-host copy]
CUDA UNIFIED MEMORY (FROM CUDA 6)
Developer View Today: System Memory, GPU Memory
Developer View With Unified Memory: Unified Memory
. Dramatically Lower Developer Effort
. Faster performance on SoC
WHAT IS HAPPENING IN THE BACKGROUND
TEGRA K1 PHYSICALLY SHARED MEMORY
[Figure: Tegra K1 block diagram.
CPU: Quad Cortex-A15 Processor + Shadow LP C-A15 CPU. GPU: Kepler, 192 CUDA Cores.
Other units: HD Video, Image Processor, Audio Processor, ARM7, Display x2 (HDMI, eDP/LVDS), Security Engine, DDR3 Ctlr 64b.
I/O: SATA2 x1, USB 2.0 x3, USB 3.0 x2, PCIe G2 x4 + x1, UART x4, SPI x4, I2C x5, SDIO/MMC x4, CSI x4 + x4, NOR Flash, DAP x5 (I2S/TDM).
Callout: Physically Unified Memory without the need of Data Migration]
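With unified memory, the SAXPY example no longer needs explicit copies. The following is a minimal sketch, assuming the CUDA 6 runtime call cudaMallocManaged and a device that supports managed memory. The helper name saxpy_managed_max_error and the fill values 1.0f and 2.0f are illustrative assumptions, not taken from the slides.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Same SAXPY kernel as in the explicit-copy version: y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not from the slides): runs SAXPY on n elements in
// managed memory and returns the largest deviation from the expected result,
// or -1.0f if any CUDA call fails.
float saxpy_managed_max_error(int n, float a)
{
    float *x = nullptr;
    float *y = nullptr;

    // One allocation per array; the same pointer is valid on CPU and GPU.
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1.0f;
    }

    // The CPU fills the arrays in place; there is no host-to-device cudaMemcpy.
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    saxpy<<<blocks, threads>>>(n, a, x, y);

    // The CPU must not read y until the kernel has finished.
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t sync_err = cudaDeviceSynchronize();

    float max_error = -1.0f;
    if (launch_err == cudaSuccess && sync_err == cudaSuccess) {
        max_error = 0.0f;
        const float expected = a * 1.0f + 2.0f;
        // The CPU reads the result in place; there is no device-to-host cudaMemcpy.
        for (int i = 0; i < n; ++i) {
            max_error = std::fmax(max_error, std::fabs(y[i] - expected));
        }
    }

    cudaFree(x);
    cudaFree(y);
    return max_error;
}
```

Calling saxpy_managed_max_error(1 << 20, 2.0f) from main covers the same 1M elements as the explicit-copy version. On a discrete GPU the runtime migrates managed data behind the scenes. On Tegra K1 the CPU and GPU share the same physical memory, so no migration is needed.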
TEGRA K1 FOR RENDERSCRIPT
RENDERSCRIPT
• C99 based kernel language
• Easy programmability with host and device portability.
• Portable across wide range of devices, fastest on Tegra K1
• GPU (+ CPU) and more
• Renderscript API 19 Support
RENDERSCRIPT ON THE SOC
. Acceleration of Renderscript Scripts over GPU
— ScriptC, not just ScriptIntrinsics
— Huge gains in performance and performance/watt
. Runtime capable of scheduling work across units
— CPU
— GPU
— 2D Engine/ISP
. Related Session:
GTC2014 @ US S4885 - Efficient Parallel Computation on Android, Jason Sams, Tim Murray
INTRODUCTION (REF S4885)
CONSTRAINTS #1, #2 (REF S4885)
CONSTRAINTS #3 (REF S4885)
GPU OR CPU? (REF S4885)
DESKTOP PERFORMANCE TODAY (REF S4885)
MOBILE PERFORMANCE TODAY (SHIPPING) (REF S4885)
ARCHITECTURAL DIVERSITY? (REF S4885)
GOAL OF RENDERSCRIPT (REF S4885)
WHAT IS RENDERSCRIPT? (REF S4885)
RENDERSCRIPT INTRINSICS (REF S4885)
TEGRA K1? (REF S4885)
TEGRA K1? (REF S4885)
SUMMARY
. The Kepler GPU built into Tegra K1 is a scalable GPU that shares its architecture with Tesla/Quadro/GeForce
. As a result, the software assets and development environments for CUDA, OpenCL, and OpenGL Shading Language that matured on Tesla/Quadro/GeForce can be used on Tegra K1
. Furthermore, even for Renderscript, which targets mobile devices at the opposite end of the spectrum from HPC, workstations, and PCs, using the GPU delivers excellent performance
THANK YOU