April 4-7, 2016 | Silicon Valley
CUDA ON MOBILE
Yogesh Kini, GTC 2016

ABSTRACT
• Typical pipeline
• CUDA Interop APIs
• Unified Memory on Tegra
TYPICAL USE CASES
• Automobiles: Autonomous Cars
• Mobile Devices: Consoles, Tablets
• Embedded: Drones, Robots, Smart-Surveillance
TYPICAL PIPELINE
[Figure: pipeline in three stages, CAPTURE, PROCESS and DISPLAY, with blocks labeled Camera, Sensor, ISP/DSP, CUDA, Graphics, Display and Actuators]
4/1/2016

CUDA OPENGL(ES) INTEROP
CUDA–OPENGL(ES)
• Provides access to OpenGL-ES resources in CUDA
• Support for EGL
• Supported on Android, L4T, Vibrante Linux, QNX
• Implicit synchronization support
• Useful for graphics applications and games
EGL IMAGE INTEROP
EGL IMAGE
Sources for EGL image:
• GStreamer
• OpenGL ES
• OpenMAX
• Android GraphicBuffer
Khronos EGL_image_base: https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_image_base.txt
EGL IMAGE
EGLImage is registered with cuGraphicsEGLRegisterImage() and then accessed in CUDA as:
• cudaDevicePointer: cuGraphicsResourceGetMappedPointer()
• cudaArray: cuGraphicsResourceGetMappedArray()

Order of use:
• Other API: begin resource usage, other API code, end resource usage
• synchronize
• CUDA: begin resource usage, CUDA code, end resource usage
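The registration step on this slide can be sketched with the CUDA driver API from cudaEGL.h. This is a minimal sketch under stated assumptions, not the presenter's code: the helper name inspectEglImage is hypothetical, the EGLImage is assumed to come from another API (GStreamer, OpenGL ES, and so on), and a CUDA context is assumed to be current. It uses cuGraphicsResourceGetMappedEglFrame(), the accessor that current CUDA toolkits document for EGL resources, in place of the two getters named on the slide.

```cuda
#include <cuda.h>
#include <cudaEGL.h>  // driver-API EGL interop; also brings in the EGL types

// Hypothetical helper: expose an existing EGLImage to CUDA and report its size.
// Assumes cuInit() has been called and a CUDA context is current on this thread.
// Returns 0 on success and -1 on any driver error.
int inspectEglImage(EGLImageKHR image, unsigned int *width, unsigned int *height) {
    CUgraphicsResource resource = nullptr;
    if (cuGraphicsEGLRegisterImage(&resource, image,
                                   CU_GRAPHICS_REGISTER_FLAGS_NONE) != CUDA_SUCCESS) {
        return -1;
    }

    // Ask CUDA how the image is laid out: pitch-linear pointer or CUDA array.
    CUeglFrame frame;
    CUresult status = cuGraphicsResourceGetMappedEglFrame(&frame, resource, 0, 0);
    if (status == CUDA_SUCCESS) {
        *width = frame.width;
        *height = frame.height;
        if (frame.frameType == CU_EGL_FRAME_TYPE_PITCH) {
            // frame.frame.pPitch[0] is a device pointer that kernels can use.
        } else {
            // frame.frame.pArray[0] is a CUarray (texture/surface access).
        }
    }

    // "synchronize" step: finish all CUDA work before the other API resumes.
    cuCtxSynchronize();
    cuGraphicsUnregisterResource(resource);
    return status == CUDA_SUCCESS ? 0 : -1;
}
```

Unlike the EGL stream path, this path leaves synchronization between the other API and CUDA to the application, which is why the sketch ends with an explicit cuCtxSynchronize().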
EGL STREAMS INTEROP

EGL STREAMS
• Producer-Consumer architecture
• EGL streams spec: https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_stream.txt
• Implicit Synchronization
• Cross Process support
• Supports YUV formats

[Figure: ISP (producer) to EGL stream to CUDA (consumer, with cuDNN, cuBLAS and VisionWorks); CUDA (producer) to EGL stream to OpenGL (consumer)]
EGL STREAMS
CUDA producer:
• cuEGLStreamProducerConnect()
• cuEGLStreamProducerPresentFrame(frame)
• cuEGLStreamProducerReturnFrame(frame)

CUDA consumer:
• cuEGLStreamConsumerConnect()
• cuEGLStreamConsumerAcquireFrame(frame)
• Use frame in CUDA
• cuEGLStreamConsumerReleaseFrame(frame)

[Figure: frames 1 to 4 circulating through the EGL stream between producer and consumer]
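The consumer side of this call sequence can be sketched as follows. This is a hedged illustration: consumeFrames and maxFrames are hypothetical names, the EGLStreamKHR is assumed to be created elsewhere with a producer attached, and a CUDA context is assumed to be current. The 16000 microsecond timeout is an arbitrary choice for the sketch.

```cuda
#include <cuda.h>
#include <cudaEGL.h>  // driver-API EGL interop; also brings in the EGL types

// Hypothetical consumer loop: acquire up to maxFrames frames from an EGL
// stream, inspect each one in CUDA, and hand it back to the producer.
// Returns the number of frames processed, or -1 if the connection fails.
int consumeFrames(EGLStreamKHR eglStream, int maxFrames) {
    CUeglStreamConnection conn;
    if (cuEGLStreamConsumerConnect(&conn, eglStream) != CUDA_SUCCESS) {
        return -1;
    }

    CUstream stream = nullptr;  // default stream
    int processed = 0;
    for (int i = 0; i < maxFrames; ++i) {
        CUgraphicsResource resource = nullptr;
        // Wait up to 16000 microseconds for the producer to present a frame.
        if (cuEGLStreamConsumerAcquireFrame(&conn, &resource, &stream, 16000)
                != CUDA_SUCCESS) {
            break;
        }

        CUeglFrame frame;
        if (cuGraphicsResourceGetMappedEglFrame(&frame, resource, 0, 0)
                == CUDA_SUCCESS) {
            // "Use frame in CUDA": launch kernels on frame.frame.pPitch[...]
            // or frame.frame.pArray[...], depending on frame.frameType.
            ++processed;
        }

        // Return the frame so the producer can reuse its buffer.
        cuEGLStreamConsumerReleaseFrame(&conn, resource, &stream);
    }

    cuEGLStreamConsumerDisconnect(&conn);
    return processed;
}
```

A producer would mirror this loop with cuEGLStreamProducerConnect(), cuEGLStreamProducerPresentFrame() and cuEGLStreamProducerReturnFrame().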
INTEROP SUMMARY

EGL STREAMS
• Producer-Consumer
• Implicit Synchronization
• Cross-Process support
• YUV Planar Image support

EGL IMAGE
• Easy setup
• Works with several EGL client APIs
• YUV Planar Image support

CUDA-OPENGL
• EGL support
• OpenGL-ES support
• Portable across Tegra and discrete GPU
CUDA UNIFIED MEMORY ON TEGRA
• Helps take advantage of unified DRAM on Tegra
• Easier to program, unified allocator: cudaMallocManaged
• Programming model enforced through memory access protection
• Memcpy not needed, migration managed by CUDA driver
• Reduces memory consumption and power
• Attach API will help achieve optimal performance

[Figure: Tegra block diagram, CPU and GPU sharing one memory (DRAM)]
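A minimal sketch of the managed-memory flow described here, with no explicit copies. Assumptions: the "Attach API" is taken to mean cudaStreamAttachMemAsync() with the cudaMemAttachHost and cudaMemAttachGlobal flags, and the names scale and runManagedDemo are hypothetical.

```cuda
#include <cuda_runtime.h>

// Multiply every element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical demo: one managed buffer shared by CPU and GPU, no cudaMemcpy.
// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int runManagedDemo(int n) {
    cudaStream_t stream;
    float *data = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    if (cudaMallocManaged((void **)&data, n * sizeof(float),
                          cudaMemAttachGlobal) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1;
    }

    // Optional hint: the CPU is about to use the buffer.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachHost);
    cudaStreamSynchronize(stream);
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Optional hint: make the buffer visible to GPU work before the launch.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n, 2.0f);

    // Hand the buffer back to the CPU and wait for the kernel to finish.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachHost);
    cudaStreamSynchronize(stream);

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] != 2.0f) ++wrong;
    }

    cudaFree(data);
    cudaStreamDestroy(stream);
    return wrong;
}
```

The attach calls are optional for correctness in this sketch; the program still works if they are removed, as long as the CPU touches the buffer only after synchronizing.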
CUDA MEMORY TYPES

Step          Traditional              Zero Copy                Unified Memory
Allocate      malloc(), cudaMalloc()   cudaMallocHost()         cudaMallocManaged()
CPU use       CPU use                  CPU use                  CPU use
Migrate       cudaMemcpyHtoD()         NA                       cudaMemAttach [optional]
CUDA kernel   Kernel_launch<<<>>>()    Kernel_launch<<<>>>()    Kernel_launch<<<>>>()
Migrate       cudaMemcpyDtoH()         NA                       cudaMemAttach [optional]
CPU use       CPU use                  CPU use                  CPU use
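The traditional and zero-copy flows can be sketched side by side with the CUDA runtime API. Assumptions in this sketch: the slide's cudaMemcpyHtoD() and cudaMemcpyDtoH() are written as cudaMemcpy() with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost; the zero-copy buffer is allocated with cudaHostAlloc() and cudaHostAllocMapped rather than the cudaMallocHost() named on the slide, so that the device pointer can be queried explicitly; addOne, runTraditional and runZeroCopy are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Add 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Traditional flow: separate host and device buffers with explicit copies.
// Returns the number of wrong elements (0 on success), or -1 on error.
int runTraditional(int n) {
    size_t bytes = n * sizeof(int);
    int *host = (int *)malloc(bytes);
    int *dev = nullptr;
    if (host == nullptr) return -1;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) { free(host); return -1; }

    for (int i = 0; i < n; ++i) host[i] = i;                 // CPU use
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);    // migrate
    addOne<<<(n + 255) / 256, 256>>>(dev, n);                // CUDA kernel
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);    // migrate (blocks)

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (host[i] != i + 1); // CPU use
    cudaFree(dev);
    free(host);
    return wrong;
}

// Zero-copy flow: one pinned host buffer mapped into the GPU address space.
// Returns the number of wrong elements (0 on success), or -1 on error.
int runZeroCopy(int n) {
    size_t bytes = n * sizeof(int);
    int *host = nullptr;
    int *dev = nullptr;
    if (cudaHostAlloc((void **)&host, bytes, cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    if (cudaHostGetDevicePointer((void **)&dev, host, 0) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }

    for (int i = 0; i < n; ++i) host[i] = i;                 // CPU use
    addOne<<<(n + 255) / 256, 256>>>(dev, n);                // kernel, no copy
    cudaDeviceSynchronize();                                 // wait before CPU reads

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (host[i] != i + 1); // CPU use
    cudaFreeHost(host);
    return wrong;
}
```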
CUDA MEMORY TYPES

Traditional
• Easy portability from existing desktop programs
• Suitable for GPU intermediate buffers, tables, etc.

Zero Copy
• Cache is bypassed by both GPU and CPU while accessing these allocations
• Faster for some small allocations
• Suitable when memory access is not affected by caching

Managed
• Memory access by CPU and GPU is through cache
• Faster for larger allocations
• Suitable when memory is used on both host and GPU
Time taken (ms) by the Matrix Multiply CUDA kernel with different allocation types:

Size    TRADITIONAL   ZERO COPY   MANAGED MEMORY
16KB    0.617         0.544       0.644
1MB     9.723         11.119      7.093
4MB     59.37618      62.232      46.42551
16MB    377.9244      403.2382    344.926
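The slides do not say how these timings were collected. One common way to time a kernel is with CUDA events, sketched below for the managed-memory case. Everything here is an assumption for illustration: the naive matMul kernel, the timeManagedMatMul helper and the 16x16 block size are hypothetical and are not the benchmark behind the numbers above.

```cuda
#include <cuda_runtime.h>

// Naive square matrix multiply, C = A * B, one thread per output element.
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Hypothetical helper: time one n x n multiply on managed buffers.
// Returns the kernel time in milliseconds, or -1.0f on error or wrong result.
float timeManagedMatMul(int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = nullptr, *B = nullptr, *C = nullptr;
    if (cudaMallocManaged((void **)&A, bytes, cudaMemAttachGlobal) != cudaSuccess ||
        cudaMallocManaged((void **)&B, bytes, cudaMemAttachGlobal) != cudaSuccess ||
        cudaMallocManaged((void **)&C, bytes, cudaMemAttachGlobal) != cudaSuccess) {
        cudaFree(A); cudaFree(B); cudaFree(C);  // freeing a null pointer is a no-op
        return -1.0f;
    }
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    cudaEventRecord(start, 0);
    matMul<<<grid, block>>>(A, B, C, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait for the kernel and the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each output element is a sum of n products 1.0 * 2.0.
    bool ok = (C[0] == 2.0f * n);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return ok ? ms : -1.0f;
}
```

Swapping the three cudaMallocManaged() calls for cudaMalloc() plus cudaMemcpy(), or for mapped pinned memory, would give the other two columns of such a comparison.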
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join