IMAGE AND VISION PROCESSING ON TEGRA K1 Elif Albuz

IMAGE AND VISION USE CASES
Driven by using camera as a sensor

. Computational Photography and Videography
. Face, Body and Gesture Tracking
. 3D Scene/Object Reconstruction
. Augmented Reality

MOBILE VISION COMPUTING

[Figure: chart of Processing Demands against Time, showing the three stages below]

. Photography: Input = 2D Camera; Processors = ISP + CPU; Product = Static Images
. Computational Photography: Input = MEMS + 2D Camera; Processors = ISP + CPU + GPU; Result = Enhance Images and Videos
. Mobile Vision Computing: Input = MEMS + Depth Camera; Processors = ISP + CPU + GPU; Result = Data for advanced user interface and environment modeling

VISION COMPUTING APP CATEGORIES

Vision Computing
. 3D Reconstruction (constructs 3D geometry)
. Tracking (constructs positions and motions)

[Figure: tree of application categories under the two branches: Object Reconstruction; Scene Reconstruction; 3D Grid with SFM; Facial Modeling; Body Modeling; Face and gesture tracking; Environmental Feature Tracking; Indoor/Outdoor Positional Tracking; Body Tracking; Ped/car detection & tracking]

AUGMENTED REALITY HYPER-REALISM

Ray-tracing and light-field calculations running today on CUDA laptop PC – 50+ Watts

Ongoing research to use depth cameras to reconstruct global illumination model in real-time

Need on mobile devices at 100x less power = 0.5W

Source: "High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence", P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria

SIMULTANEOUS LOCALIZATION AND MAPPING

WHAT IS NEEDED?
. Accelerated image & vision processing — Tegra K1: CPU, GPU, ISP
. Camera flexibility and handling of new sensor types — Android HAL V3, V4L
. Low latency routing of image streams to GPU — EGLStreams
. Integrated handling of various sensors and camera — Global Time Stamps
. Effective image & vision programming frameworks — VisionWorks

TEGRA K1 PLATFORM Desktop GPGPU features and tools on mobile

GPGPU Compute

CUDA Tools and Libraries

Advanced Graphics

THREE GENERATIONS OF REFINEMENT

Tesla - 2006 Fermi - 2010 Kepler - 2012

TEGRA K1: A MAJOR LEAP FORWARD FOR MOBILE & EMBEDDED APPLICATIONS

. KEPLER GPU, 192 CORES
. CUDA
. 12GB/S BANDWIDTH
. VIDEO IMAGE COMPOSITOR (VIC)

[Figure: Tegra K1 block diagram. Labels: 28 nm HPM; 23x23mm, 0.7mm pitch HS-FCBGA; Quad Cortex-A15 Processor, 4x Cores (1+ GHz), NEON SIMD, 2 MB L2 (Shared); Shadow LP C-A15 CPU; ARM7; ARM Trust Zone; HD Video Processor, 1080p24/30 Video Decode, 1080p24/30 Video Encode, H.264 | MPEG4 | VC1 | MPEG2 | VP8; Kepler GeForce® GPU w/CUDA, OpenGL-ES nextgen, 192 Stream Processors; Image Processor, 25MP Sensor Support; ISP; 1080p60; Enhanced JPEG Engine; Audio Processor; 2D Graphics/Scaling; Display, HDMI, x2 eDP/LVDS; Security Engine; DDR3 Ctlr, 64b, 800+ MHz; SATA2 x1; USB 2.0 x3; USB 3.0* x2; PCIe* G2 x4 + x1; UART x4; SPI x4; I2C x5; SDIO/MMC x4; CSI x4 + x4; NOR Flash; DAP x5 (I2S/TDM)]

CPU

. 4 +1
. Quad-core A15, NEON
. Unified memory, access from CPU and GPU
. low-power Shadow core

[Figure: Tegra K1 block diagram]

IMAGE PROCESSOR

. 2 ISPs, independently programmable
. 25Mp camera support x2 (up to 250Mp)
. 1.2Gp throughput
. GPGPU interoperability

[Figure: Tegra K1 block diagram]

HD DEC/ENC PROCESSOR

. VIDEO ENCODE/DECODE 2X 1920X1080@30FPS
. CUVID/CUVENC VIDEO ENCODE/DECODE INTERFACE
. MOTION ESTIMATION ONLY MODE

[Figure: Tegra K1 block diagram]

VIDEO ENCODER
Dedicated hardware accelerator
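
The encoder is shown producing motion vectors and track info alongside compressed video, and a motion-estimation-only mode is listed. As a rough illustration of what a motion vector encodes, the sketch below runs an exhaustive block-matching search with a sum-of-absolute-differences (SAD) cost in plain Python. It is not the hardware encoder's algorithm, which the slides do not describe; the function names, the 4x4 block size and the +/-2 pixel search window are assumptions made for the example.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def motion_vector(cur, ref, bx, by, size=4, search=2):
    """Search ref around (bx, by) for the best match of the current block.

    Returns ((dx, dy), cost), where (dx, dy) points from the block's
    position in the current frame to its best match in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, bx, by, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidate blocks that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, bx, by, rx, ry, size)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost


def make_frame(width, height, sx, sy, size=4):
    """Synthetic frame: a bright square at (sx, sy) on a black background."""
    return [
        [255 if sx <= x < sx + size and sy <= y < sy + size else 0
         for x in range(width)]
        for y in range(height)
    ]


ref = make_frame(12, 12, 3, 4)   # reference frame: square at (3, 4)
cur = make_frame(12, 12, 5, 5)   # current frame: square moved by (+2, +1)
mv, cost = motion_vector(cur, ref, bx=5, by=5)
print(mv, cost)
```

A dedicated accelerator performs this kind of search for every block of every frame, which is why handing the vectors to a tracking algorithm can save GPU and CPU cycles.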

[Figure: the encoder takes the current frame and a reference frame and outputs compressed video (H.264, ...) and motion vectors/track info]

GPU

. KEPLER Architecture
. 192 CUDA Cores, SM3.2
. ISA compatible with GeForce, Quadro, Tesla
. 64 KB L1 Cache and Shared Memory
. 128 KB L2 Cache

[Figure: Tegra K1 block diagram]

WHAT IS NEEDED?
. Accelerated image & vision processing — Tegra K1: CPU, GPU, ISP
. Camera flexibility and handling of new sensor types — Android HAL V3, V4L, Camera API
. Low latency routing of image streams to GPU — EGLStreams
. Integrated handling of various sensors and camera — Global Time Stamps
. Effective image & vision programming frameworks — VisionWorks

CAMERA IMAGE PROCESSING
. Camera ISP (Image Signal Processor)
— Little or no programmability
— Data flows thru compact hardware pipe
— Scan-line-based - no global memory
— Best perf/watt
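
The figures quoted for the ISP pipe (~760 math ops, ~42K values = 670 Kb, 300 MHz, ~250 Gops) can be sanity-checked with simple arithmetic. The sketch below assumes that the ~760 operations are issued on every clock cycle and that each in-flight value is 16 bits wide. Neither assumption is stated on the slide, but both reproduce the quoted totals to within rounding.

```python
# Figures taken from the slide.
OPS_PER_CLOCK = 760         # "~760 math ops"
CLOCK_HZ = 300e6            # "300MHz"
VALUES_IN_FLIGHT = 42_000   # "~42K vals"

# ASSUMPTION (not on the slide): 16 bits per stored value.
BITS_PER_VALUE = 16


def isp_throughput_gops(ops_per_clock, clock_hz):
    """Operations per second, in units of 10^9, if all ops issue every clock."""
    return ops_per_clock * clock_hz / 1e9


def pipeline_storage_kbits(values, bits_per_value):
    """Total on-chip storage for the in-flight values, in kilobits."""
    return values * bits_per_value / 1000


gops = isp_throughput_gops(OPS_PER_CLOCK, CLOCK_HZ)
kbits = pipeline_storage_kbits(VALUES_IN_FLIGHT, BITS_PER_VALUE)
print(f"{gops:.0f} Gops, {kbits:.0f} Kb")
```

Under these assumptions the pipe delivers about 228 Gops from roughly 672 Kb of local storage, in line with the slide's ~250 Gops and 670 Kb.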

[Figure: ISP pipeline; ~760 math ops, ~42K vals = 670Kb, 300MHz -> ~250Gops]

VISUAL SENSOR REVOLUTION
. Single RGB sensors just the start of mobile visual revolution
— IR sensors – LEAP Motion, eye-trackers
— Active illumination depth sensors – TOF and structured light
. Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
— Stereo pair can enable object scaling and enhanced depth extraction
— Plenoptic Field processing needs FFTs and ray-casting
. Hybrid visual sensing solutions
— Different sensors mixed for different distances and lighting conditions

[Figure: Dual Camera (LG Electronics); Plenoptic Array (Pelican imaging); Capri Structured Light 3D Camera (PrimeSense)]

ANDROID CAMERA HAL V3
. Camera HAL v1 focused on simplifying basic camera apps
— Difficult or impossible to do much else
— New features require proprietary driver extensions
— Extensions not portable - restricted growth of third party app ecosystem
. Camera HAL v3 is a fundamentally different API
— Flexible primitives for building sophisticated use-cases
— Interface is clean and easily extensible
— Apps can have more control, and more responsibility
. Enables sophisticated camera applications
. Faster time to market and higher quality

KHRONOS CAMERA API
. Specification available 2015
. Also FCAM-Based
— Will be available for any OS
— FCAM running on NVIDIA Linux today
. No global state
— State travels with image requests
— Every stage in the pipeline may have different state
— Enables fast, deterministic state changes
. Synchronize devices
— Lens, flash, sound capture, gyro…
— Devices can schedule Actions
— E.g. to be triggered on exposure change

KHRONOS CAMERA API REQUIREMENTS
. Application control over ISP processing (including 3A)
— Including multiple, re-entrant ISPs
. Control multiple sensors with synch and alignment
— E.g. Stereo pairs, Plenoptic arrays, TOF/structured light depth cameras
. Enhanced per frame detailed control
— Format flexibility, Region of Interest (ROI) selection
. Global timing & synchronization
— E.g. Between cameras and MEMS sensors
. Flexible processing/streaming
— Multiple input and output streams
— RAW, Bayer or YUV Processing
— Streaming of rows (not just frames)

Enable new camera functionality not available on current platforms and align with future platform directions for easy adoption

TEGRA K1 CAMERA DATAFLOW

. Flexible routing of sensor data to ISPs and unified memory
. Enables sensor processing by any combination of GPUs, CPUs and ISPs

[Figure: dataflow diagram with Camera One, Camera Two, Camera Routing, ISP-A, ISP-B, four CPUs, Kepler GPGPU and Unified Memory]

COMPUTATIONAL PHOTOGRAPHY
. Camera ISP (Image Signal Processor)
— Little or no programmability
. CPU
— Single processor or Neon SIMD - running fast
— Makes heavy use of general memory
— Non-optimal performance and power
. GPU
— Programmable and flexible
— Many way parallelism - run at lower frequency
— Efficient image caching close to processors
— BUT cycles frames in and out of memory
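
The cost of cycling frames in and out of memory can be estimated against the 12GB/S bandwidth figure quoted for the platform. The sketch below is a back-of-the-envelope estimate only: the 1920x1080 frame size, 4 bytes per pixel, 60 frames per second and the one-read-plus-one-write model for each processing pass are all assumptions chosen for illustration, and real kernels will rarely reach peak bandwidth.

```python
PEAK_BANDWIDTH_BYTES_PER_S = 12e9   # "12GB/S BANDWIDTH" from the platform slide

# ASSUMPTIONS for illustration (not from the slides):
WIDTH, HEIGHT = 1920, 1080          # frame size in pixels
BYTES_PER_PIXEL = 4                 # e.g. 8-bit RGBA
FRAMES_PER_SECOND = 60


def pass_traffic_bytes_per_s(width, height, bytes_per_pixel, fps):
    """Memory traffic of one full-frame pass: read the frame, write the result."""
    frame_bytes = width * height * bytes_per_pixel
    return 2 * frame_bytes * fps


traffic = pass_traffic_bytes_per_s(WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND)
max_passes = int(PEAK_BANDWIDTH_BYTES_PER_S // traffic)
print(f"{traffic / 1e9:.2f} GB/s per pass, at most {max_passes} passes at peak bandwidth")
```

Each unfused pass costs roughly 1 GB/s under these assumptions, so a pipeline of a dozen full-frame stages would already saturate memory. This is the motivation for keeping intermediate data in cache or fusing stages where possible.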

[Figure: ISP pipeline; ~760 math ops, ~42K vals = 670Kb, 300MHz -> ~250Gops]

EGL 1.5 RELEASED
. EGL 1.5 brings functionality from multiple extensions into core
— Increased reliability and portability
. EGLImages
— Sharing textures and renderbuffers between Khronos APIs
. Context Robustness
— Defending against malicious code
. EGLSync objects
— Improved OpenGL/OpenCL interop
. Platform extensions
— Standardized interactions for multiple OS e.g. Android and 64-bit platforms
. sRGB colorspace rendering

[Figure: EGL sits between Applications and OS and Display Platforms. API Interop: EGL provides efficient transfer of data and events. Application Portability: EGL abstracts graphics context management, surface and buffer binding and rendering synchronization.]
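
EGLStreams, named in the requirements list as the low-latency route for image streams to the GPU, connect a frame producer to a frame consumer. The sketch below models two delivery behaviours in plain Python to show why a latest-frame ("mailbox") policy keeps latency low: the consumer always gets the newest frame and stale ones are discarded, whereas a FIFO policy delivers every frame in order. The class and method names are hypothetical and are not the EGL API; this only illustrates the buffering policy.

```python
from collections import deque


class MailboxStream:
    """Keeps only the newest frame; an unacquired older frame is replaced."""

    def __init__(self):
        self._slot = None
        self.dropped = 0

    def post(self, frame):
        if self._slot is not None:
            self.dropped += 1      # the stale frame is never delivered
        self._slot = frame

    def acquire(self):
        frame, self._slot = self._slot, None
        return frame


class FifoStream:
    """Delivers every frame in order; the producer is refused when full."""

    def __init__(self, depth):
        self._queue = deque()
        self._depth = depth

    def post(self, frame):
        if len(self._queue) >= self._depth:
            return False           # producer would have to wait
        self._queue.append(frame)
        return True

    def acquire(self):
        return self._queue.popleft() if self._queue else None


# The camera produces three frames before the GPU consumer gets to run.
mailbox, fifo = MailboxStream(), FifoStream(depth=3)
for frame_id in range(3):
    mailbox.post(frame_id)
    fifo.post(frame_id)

latest = mailbox.acquire()   # newest frame
oldest = fifo.acquire()      # oldest queued frame
print(latest, oldest, mailbox.dropped)
```

For vision tasks such as tracking, acting on the newest frame usually matters more than seeing every frame, which is the trade-off the mailbox policy makes.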

HOW MANY SENSORS ARE IN A SMARTPHONE?
. Light
. Proximity
. 2 cameras
. 3 microphones
. Touch
. Position
— GPS
— WiFi (fingerprint)
— Cellular (tri-lateration)
— NFC, Bluetooth (beacons)
. Accelerometer
. Magnetometer
. Gyroscope
. Pressure
. Temperature
. Humidity

Total: 19

STREAMINPUT SENSOR ABSTRACTION API

Apps request semantic sensor information
. StreamInput defines possible requests, e.g.
. Read Physical or Virtual Sensors e.g. “Game Quaternion”
. Context detection e.g. “Am I in an elevator?”

Apps Need Sophisticated Access to Sensor Data
. Without coding to specific sensor hardware

Advanced Sensors Everywhere
. Multi-axis motion/position, quaternions, context-awareness, gestures, activity monitoring, health and environmental sensors

Sensor Discoverability
Sensor Code Portability

. StreamInput processing graph provides optimized sensor data stream
. High-value, smart sensor fusion middleware can connect to apps in a portable way
. Apps can gain ‘magical’ situational awareness
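
Fusing camera frames with MEMS samples depends on a shared time base, which is what the "Global Time Stamps" item in the requirements list provides. The sketch below shows one simple use of such timestamps: pairing each camera frame with the closest gyroscope sample. The function names, the microsecond units, and the 200 Hz and ~30 fps rates are assumptions for illustration and do not come from StreamInput or any Tegra API.

```python
from bisect import bisect_left


def nearest_sample(timestamps, t):
    """Index of the entry in sorted `timestamps` that is closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    # Prefer the earlier sample when t is exactly halfway between the two.
    return i if after - t < t - before else i - 1


def pair_frames_with_imu(frame_ts, imu_ts):
    """Return (frame timestamp, matched IMU timestamp) for every frame."""
    return [(ft, imu_ts[nearest_sample(imu_ts, ft)]) for ft in frame_ts]


# ASSUMPTION: both streams are stamped in microseconds on one global clock.
imu_ts = list(range(0, 100_000, 5_000))   # gyroscope at 200 Hz
frame_ts = [0, 33_333, 66_667]            # camera at roughly 30 fps

pairs = pair_frames_with_imu(frame_ts, imu_ts)
print(pairs)
```

Without a common clock, the two timestamp lists would carry an unknown offset and the nearest-sample match would be meaningless, however precise each sensor is on its own.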

MOBILE & EMBEDDED DEVELOPERS NEED HELP!

. Control, coordinate and synchronize a diverse array of mobile sensors
. Handle a diverse selection of emerging depth camera technologies
. Write maintainable code for a heterogeneous mix of CPUs, GPUs and DSPs
. Write code that is deployable across multiple devices, platforms and OS
. Leverage dedicated vision hardware for minimized power
. Create fluid 60Hz experiences on battery-powered mobile devices

DESKTOP TO MOBILE DEVELOPMENT
. Image & Vision Processing code development starts at desktop PC
. Tegra K1 enables easy migration through libraries and tools available across platforms

TEGRA K1 CUDA DEVELOPMENT

CUDA-Aware Editor
. Automated CPU to GPU code refactoring
. Semantic highlighting of CUDA code
. Integrated code samples & docs

Nsight Debugger
. Simultaneous debugging of CPU and GPU
. Inspect variables across CUDA threads
. Use breakpoints & single-step debugging

Nsight Profiler
. Quickly identifies performance issues
. Integrated expert system
. Source line correlation

Cross platform development
Native memcheck, GDB, nvprof

CUDA LIBRARIES

. VisionWorks
. NPP
. OpenCV
. CUFFT
. CUBLAS
. CUDA Math Lib

NPP LIBRARY (CUDA)
. Data exchange & initialization
— Set, Convert, CopyConstBorder, Copy, Transpose, SwapChannels
. Arithmetic & Logical Ops
— Add, Sub, Mul, Div, AbsDiff
. Threshold & Compare
— Threshold, Compare
. Color Conversion
— RGB To YCbCr (& vice versa), ColorTwist, LUT_Linear
. JPEG
— DCTQuantInv/Fwd, QuantizationTable
. Filter Functions
— FilterBox, Row, Column, Max, Min, Median, Dilate, Erode, SumWindowColumn/Row
. Geometry Transforms
— Mirror, WarpAffine / Back / Quad, WarpPerspective / Back / Quad, Resize
. Statistics
— Mean, StdDev, NormDiff, MinMax, Histogram, SqrIntegral, RectStdDev
. Computer Vision
— ApplyHaarClassifier
— GraphCuts

OPENCV LIBRARY

. Initially developed by Intel for single-core CPUs
. OpenCV
— Version 2.4.5 >900 functions (x the datatypes)
— OpenCV4Tegra - Accelerated CUDA+NEON+GLSL+TBB multithreading

[Figure: OpenCV function categories. Image processing: General Image Processing; Segmentation; Machine Learning, Detection; Image Pyramids; Transforms; Fitting. Video, Stereo, and 3D: Camera Calibration; Features; Depth Maps; Optical Flow; Inpainting; Tracking.]

OPENCV-GPU VALUE ADD ON LOGAN

[Figure: bar chart of Speedup (vertical axis, 0 to 7) for Jetson Kepler GPU / Quadcore A15 relative to Public OCV for Mobile/Embedded, by function category: core, filter, imgproc, objdetect. Caption: Average speedup for different function categories with Logan GPU compared to public source code.]

VISIONWORKS MOTIVATION
Advanced Silicon + VisionWorks = Widespread vision processing in embedded, mobile and automotive devices and applications
. Simplify vision programming
. Fully optimized and accelerated
. Modular and Extensible

VISIONWORKS
Power Efficient Computer Vision Powered with CUDA

Supported on Tegra K1 Linux and Android

ACCELERATING
• Advanced Driver Assistance
• Computational Photography
• Augmented Reality
• Robotics
• Deep Learning and more…

[Images: ADAS – Advanced Driver Assistance Systems; Computational Photography; Augmented Reality; Robotics]

Version 0.10 is available for registered partners!

VISIONWORKS SOFTWARE STACK

. Applications use combination of direct primitives, the OpenVX framework and supplied pipelines
. NVIDIA provides sample pipelines for common use cases
. Customers and developers can create their own primitives e.g. using CUDA
. NVIDIA supplied vision primitives using CUDA and Tegra processing resources
. OpenVX enables power efficient and flexible chaining of primitives

[Figure: software stack, top to bottom. Application Code; Sample Pipelines (Object Detection, SfM, …) and 3rd Party Pipelines; VisionWorks Primitives (Corner Detection, Classifier, …) and 3rd Party Primitives; Framework; CUDA; Tegra K1]

OPENVX – POWER EFFICIENT VISION ACCELERATION
. Khronos, open, cross-vendor vision API
— Focus on mobile and embedded systems
. Foundational API for vision acceleration
— Useful for middleware or by applications
— Enables diverse efficient implementations
. Complementary to OpenCV
— Which is great for prototyping

[Figure: labels include Application; OpenCV open source library; VisionWorks Sample Pipelines; Open source sample implementation; Hardware vendor implementations]

OPENVX GRAPHS – THE KEY TO EFFICIENCY
. Directed graphs for processing power and efficiency
— Each Node can be implemented in software or accelerated hardware
— Nodes may be fused to eliminate memory transfers
— Processing can be tiled to keep data entirely in local memory/cache
. EGLStreams route data from camera and to application
. Can extend with “VisionWorks” nodes using CUDA
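
The graph idea can be pictured with a toy executor: nodes declare their inputs, the framework decides the execution order, and two per-pixel nodes can be fused so that the intermediate image is never written out. This is a plain-Python sketch of the concept only, not the OpenVX API; the `Graph` class, `fuse`, `per_pixel` and the node names are all hypothetical.

```python
class Graph:
    """Toy directed graph: each node is a function plus its named inputs."""

    def __init__(self):
        self._nodes = {}   # node name -> (function, tuple of input names)

    def add_node(self, name, func, inputs=()):
        self._nodes[name] = (func, tuple(inputs))

    def _order(self):
        """Dependency-first ordering (assumes the graph has no cycles)."""
        order, seen = [], set()

        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for dep in self._nodes[name][1]:
                if dep in self._nodes:   # external sources are not nodes
                    visit(dep)
            order.append(name)

        for name in self._nodes:
            visit(name)
        return order

    def run(self, **sources):
        """Execute every node once; returns all produced values by name."""
        values = dict(sources)
        for name in self._order():
            func, inputs = self._nodes[name]
            values[name] = func(*(values[i] for i in inputs))
        return values


def per_pixel(op):
    """Lift a per-pixel operation to a whole-image node function."""
    return lambda image: [[op(p) for p in row] for row in image]


def fuse(first, second):
    """Fuse two per-pixel operations so no intermediate image is stored."""
    return lambda p: second(first(p))


def brighten(p):
    return min(p + 40, 255)


def threshold(p):
    return 255 if p >= 128 else 0


image = [[10, 100, 200], [90, 120, 250]]

unfused = Graph()
unfused.add_node("mask", per_pixel(threshold), ["bright"])   # declared first
unfused.add_node("bright", per_pixel(brighten), ["camera"])  # but runs first

fused = Graph()
fused.add_node("mask", per_pixel(fuse(brighten, threshold)), ["camera"])

unfused_out = unfused.run(camera=image)
fused_out = fused.run(camera=image)
print(fused_out["mask"])
```

Both graphs produce the same mask, but the fused one never materialises the "bright" image. Because the application only describes the graph, an implementation is free to make that substitution, or to tile the work, without any change to application code.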

[Figure: Example OpenVX Graph. Labels include Native Camera Control, OpenVX Node, VisionWorks Node and Application]

OPENVX AND OPENCV ARE COMPLEMENTARY

Governance
  OpenCV: Community driven open source with no formal specification
  OpenVX: Defined and implemented by Khronos
Portability
  OpenCV: APIs can vary depending on processor
  OpenVX: Tegra K1 mobile platforms
Scope
  OpenCV: Very wide; 1000s of imaging and vision functions; multiple camera APIs/interfaces
  OpenVX: Tight focus on hardware accelerated functions for mobile vision; use external camera API
Efficiency
  OpenCV: Memory-based architecture; each operation reads and writes memory
  OpenVX: Graph-based execution; optimizable computation, data transfer
Use Case
  OpenCV: Rapid experimentation
  OpenVX: Production development & deployment

OPENVX 1.0 FUNCTION OVERVIEW

. Core data structures
— Images and Image Pyramids
— Processing Graphs, Kernels, Parameters
. Image Processing
— Arithmetic, Logical, and statistical operations
— Multichannel Color and BitDepth Extraction and Conversion
— 2D Filtering and Morphological operations
— Image Resizing and Warping
. Core Computer Vision
— Pyramid computation
— Integral Image computation
. Feature Extraction and Tracking
— Histogram Computation and Equalization
— Canny Edge Detection
— Harris and FAST Corner detection
— Sparse Optical Flow

OpenVX Specification Evolution
. OpenVX 1.0 defines framework for creating, managing and executing graphs
. Focused set of widely used functions that are readily accelerated
. Widely used extensions adopted into future versions of the core
. Implementers can add functions as extensions
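
Integral image computation, listed above under Core Computer Vision, is a good example of a function that is "readily accelerated": once the table is built, the sum over any rectangle costs four lookups regardless of its size. The sketch below is a plain-Python reference version for illustration, not the OpenVX kernel; the function names and the zero-padded table layout are choices made for this example.

```python
def integral_image(image):
    """Summed-area table of size (h+1) x (w+1) with a zero top row and left column.

    table[y][x] holds the sum of all pixels above and to the left of (x, y).
    """
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, x0, y0, x1, y1):
    """Sum of pixels in the half-open rectangle [x0, x1) x [y0, y1)."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
total = box_sum(table, 0, 0, 3, 3)          # whole image
lower_right = box_sum(table, 1, 1, 3, 3)    # pixels 5, 6, 8, 9
print(total, lower_right)
```

Box filters and Haar-style classifiers rely on this constant-time rectangle sum, which is why integral images recur in the NPP and VisionWorks primitive lists as well.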

VISIONWORKS PRIMITIVES – JAN 2014

. Sobel
. Convolve
. Bilateral Filter
. Integral Image
. Integral Histogram
. Corner Harris
. Corner FAST
. Image Pyramid
. Optical Flow PyrLK
. Optical Flow Farneback
. Warp Perspective
. Hough Lines
. Fast NLM Denoising
. Stereo Block Matching
. IME (Iterative Motion Estimation)
. HOG (Histogram of Oriented Gradients)
. Soft Cascade Detector
. Object Tracker
. TLD Object Tracker
. SLAM
. Path Estimator
. MedianFlow Estimator

OPENVX AND CUDA ARE COMPLEMENTARY

Use Case
  CUDA: GPGPU Programming
  OpenVX: Domain targeted vision processing
Architecture
  CUDA: Language-based
  OpenVX: Library-based - no separate compiler required
Target Hardware
  CUDA: ‘Exposed’ architected memory model – programmer manages memory
  OpenVX: Abstracted node and memory model - diverse implementations can be optimized for power and performance
Precision
  CUDA: Full IEEE floating point mandated
  OpenVX: Minimal floating point requirements – optimized for vision operators
Ease of Use
  CUDA: General-purpose math and other libraries
  OpenVX: Fully implemented vision operators and framework ‘out of the box’

Use CUDA to build new VisionWorks OpenVX Nodes

VISIONWORKS SAMPLE PIPELINES (V0.10)

. Structure From Motion/SLAM
. Pedestrian Detection
. Vehicle detection
. Object tracking
. Dense optical flow
. Active Shape Model
. Denoising

VISIONWORKS – LOOKING FORWARD

. Enable multi-camera applications

. 3D sensors

. Conformance with OpenVX once specification finalized

TEGRA K1 DEVELOPMENT PLATFORMS

Coming to Android K1 & other Linux devices soon.

JETSON X3 (TK1 PRO)
. GigE, USB 3.0, HDMI, CANBUS
. Running Vibrante Linux
. AUTOMOTIVE GRADE

QUESTIONS?

BACKUP