How Mobile Devices are Revolutionizing User Interaction
HCI Korea 2015 | Seoul
Neil Trevett | Khronos President | NVIDIA Vice President, Mobile Ecosystem

© Copyright Khronos Group 2014 - Page 1

Mobile and Advanced User Interaction
• Mobile devices are evolving significant sensing capabilities
- Sensors to gather information about the user and environment
- Processing power to analyze and process the sensor data

• What sensors are coming to mobile devices?
• What are the standards and APIs for developers to access new sensor capabilities?
• What mobile acceleration capability is developing to process sensor data?
• Early examples of enabled devices and use cases


Mobile Computing Revolution
• Mobile devices are a new platform for computer-human interaction innovation
- High market volume -> investment $ -> lower cost and increasing functionality

Announcement of new Pope in St. Peter's Square

How Many Sensors in a Mobile Device Today?
• Ambient Light
• RGB Light
• Proximity
• IR Gestures
• 2 Cameras
• IR Autofocus Laser
• 3 Microphones
• Touch
• Position
- GPS
- WiFi (fingerprint)
- Cellular (tri-lateration)
- NFC, Bluetooth (beacons)
• Pressure
• Temperature
• Humidity
• Accelerometer
• Magnetometer
• Gyroscope
(the last three are Micro-Electro-Mechanical Systems, 'MEMS')

Mobile Camera – The Most Interesting Sensor?
• Single-sensor RGB cameras are just the start of the mobile visual revolution
- Main focus has been on capturing accurate photographs – not vision processing
- Vision processing lets us capture DATA, not just PICTURES
• New camera types: stereo pairs -> plenoptic arrays -> depth cameras
- Stereo disparity processing enables object scaling and depth extraction
- Plenoptic arrays use FFTs and ray-casting to capture a light field
- Structured light sensors use image processing to extract depth from the distortion of an IR pattern projected onto the scene
• Advanced sensor processing needs significant compute power
- Vision processing can be effectively accelerated on GPUs today
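Stereo disparity depth extraction, mentioned above, follows the classic relation Z = f * B / d. A minimal sketch in Python; the focal length, baseline and disparity values below are illustrative, not from any particular device:

```python
# Depth from stereo disparity: Z = f * B / d
# f = focal length in pixels, B = camera baseline in meters,
# d = disparity in pixels between the left and right views.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    if disparity_px <= 0:
        return float("inf")  # zero disparity: point is at infinity
    return focal_px * baseline_m / disparity_px

# A point seen 14 px apart by a 700 px focal-length pair 10 cm apart:
z = depth_from_disparity(700.0, 0.10, 14.0)  # 5.0 m
```

Note the inverse relationship: small disparities map to large, increasingly uncertain depths, which is why stereo baselines limit useful range.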

Examples: Dual Camera (LG Electronics), Plenoptic Array (Pelican Imaging), Capri Structured Light 3D Camera (PrimeSense)

Mobile Photography -> Visual Computing

A progression over time, with rising processing demands:

Photography
- Input = 2D Camera; Processors = ISP + CPU; Product = Static Images

Computational Photography
- Input = MEMS + 2D Camera; Processors = ISP + CPU + GPU; Result = Enhanced images and videos, e.g. panoramas

Mobile Visual Computing
- Input = MEMS + Depth Camera; Processors = ISP + CPU + GPU; Result = Data for advanced user interaction and environment modeling

ISP = Image Signal Processor: dedicated hardware processor for processing camera imagery

Visual Computing = Graphics AND Vision

Graphics Processing turns data into imagery; Vision Processing turns imagery into data
• New mobile visual sensors for MORE DATA
• Advanced mobile hardware for MORE PROCESSING
• Enables closer intertwining of real and virtual worlds

Real-time demo on CUDA-enabled laptop: "High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence", P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria
https://www.youtube.com/watch?v=i2MEwVZzDaA

Mobile Vision Acceleration = New Experiences

Need for advanced sensors and the acceleration to process them

• Computational Photography and Videography
• Face, Body and Gesture Tracking
• 3D Scene/Object Reconstruction
• Augmented Reality

Mobile SOC Performance Increases

[Chart: aggregate CPU/GPU performance vs. device shipping date, 2011-2015: a 100x perf increase in four years]
- Tegra 2 (dual Cortex-A9)
- Tegra 3 (quad Cortex-A9 with power-saver 5th core) - e.g. HTC One X+
- Tegra 4 (quad Cortex-A15) - e.g. Shield Portable
- Tegra K1 (quad Cortex-A15 + Kepler GPU) - e.g. Xiaomi MiPad, Shield Tablet, Google Tango tablet
- Erista (Maxwell GPU)
SOC = 'System On Chip'

Mobile Thermal Design Point

Displays should ideally remain cool to the touch and operate all day. Typical max system power levels before thermal failure:
- Wearable AR: 0.5W or less!
- 4-5" screen (takes 250-500mW): 2-4W
- 7" screen (takes 1W): 4-7W
- 10" screen (takes 1-2W): 6-10W; resolution makes a difference - the iPad 3 screen takes up to 8W!
- Larger systems: 30-90W
Even as battery technology improves, these thermal limits remain.

Power is the New Design Limit
• The Process Fairy keeps bringing more transistors...
• ...but the 'End of Voltage Scaling' means power is much more of an issue than in the past

In the Good Old Days: leakage was not important, and voltage scaled with feature size
The New Reality: leakage has limited threshold voltage, largely ending voltage scaling

Halving feature size L:
                Good Old Days       New Reality
Feature size    L' = L/2            L' = L/2
Density         D' = 1/L'^2 = 4D    D' = 1/L'^2 = 4D
Frequency       f' = 2f             f' = ~2f
Voltage         V' = V/2            V' = ~V
Energy/op       E' = C'V'^2 = E/8   E' = C'V'^2 = E/2
Power           P' = P              P' = 4P

Good Old Days: halve L and get 4x the transistors and 8x the capability for the same power
New Reality: halve L and get 4x the transistors and 8x the capability for 4x the power!!
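The scaling arithmetic above can be checked numerically. A sketch in Python, using the slide's symbols (N transistors, f frequency, E = C*V^2 switching energy); total power scales as the product of the three ratios:

```python
# Check the slide's scaling arithmetic. Halving feature size L gives
# 4x transistors (N) and ~2x frequency (f); total power scales as
# P'/P = (N'/N) * (f'/f) * (E'/E), with E = C * V^2 per switch.
def power_scaling(n_ratio, f_ratio, e_ratio):
    return n_ratio * f_ratio * e_ratio

# Good old days: V' = V/2 and C' = C/2, so E' = (C/2)(V/2)^2 = E/8
old_regime = power_scaling(4, 2, 1 / 8)   # 1.0: same total power
# New reality: V' = ~V and C' = C/2, so E' = (C/2)V^2 = E/2
new_regime = power_scaling(4, 2, 1 / 2)   # 4.0: 4x the power
```

The only term that changed between the two columns is the per-switch energy, and it alone turns "free" scaling into a 4x power bill.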

How to Save Power?
• Much more expensive to MOVE data than to COMPUTE on data
• Process improvements WIDEN the gap
- A 10nm process will increase the ratio another 4x
• Energy efficiency must be a key metric during silicon AND app design
- Awareness of where data lives, where computation happens, and how it is scheduled

Energy per operation (40nm, 1V process):
- Write 32 bits to LP-DDR2: 600pJ
- Send 32 bits off-chip: 50pJ
- Send 32 bits 2mm on-chip: 24pJ
- 32-bit float operation: 7pJ

- 32-bit integer add: 1pJ
- 32-bit register write: 0.5pJ

Hardware Saves Power - e.g. Camera Sensor ISP
• CPU
- Single processor or NEON SIMD - running fast
- Makes heavy use of general memory
- Non-optimal performance and power
• GPU
- Programmable and flexible
- Many-way parallelism - runs at lower frequency
- Efficient image caching close to processors
- BUT cycles frames in and out of memory
• Camera ISP (Image Signal Processor)
- Little or no programmability
- Data flows through a compact hardware pipe (~760 math ops over ~42K vals = 670Kb; 300MHz -> ~250 Gops)
- Scan-line-based - no global memory
- Best perf/watt
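The per-operation energy figures quoted above make the move-vs-compute gap concrete. A small Python sketch; the table keys are invented names for the slide's six entries:

```python
# Per-operation energy figures from the slide (40nm, 1V process), in picojoules.
ENERGY_PJ = {
    "lpddr2_write_32b": 600.0,
    "offchip_send_32b": 50.0,
    "onchip_2mm_send_32b": 24.0,
    "float_op_32b": 7.0,
    "int_add_32b": 1.0,
    "register_write_32b": 0.5,
}

def cost_ratio(op_a: str, op_b: str) -> float:
    """How many times more energy op_a costs than op_b."""
    return ENERGY_PJ[op_a] / ENERGY_PJ[op_b]

# Moving a word to DRAM costs ~86x a float operation on it, and 1200x a
# register write - hence moving data, not computing, dominates the budget.
dram_vs_flop = cost_ratio("lpddr2_write_32b", "float_op_32b")
dram_vs_reg = cost_ratio("lpddr2_write_32b", "register_write_32b")
```

With a 10nm process widening the ratio another ~4x, the DRAM-vs-float gap heads toward several hundred to one, which is why scheduling data locality is an app-level design concern.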

Vision Processing Power Efficiency
• Wearables will need 'always-on' vision
- With smaller thermal limit / battery than phones!
• GPUs have ~10x imaging power efficiency over CPUs
- GPUs are architected for efficient handling
• Dedicated hardware / DSPs can be even more efficient
- With some loss of generality
• Mobile SOCs have space for more transistors
- But can't turn them all on at the same time = 'Dark Silicon'
- Can integrate more gates 'for free' if careful how and when they are used

[Chart: power/compute efficiency vs. computation flexibility - multi-core CPU = 1x, GPU = ~10x, dedicated hardware = ~100x]
Potential for dedicated sensor/vision silicon to be integrated into mobile processors

Power Efficiency will Need Holistic App Design

• Ultra-low power camera use cases will need smart use of all sensors in a device

Three tiers of processing:
1. Minimum possible power for a small repertoire of visual events
- A 1 MIP sensor hub and accelerometers can detect the device being used
2. Often/always-on camera processing to detect visual triggering events
- Low-power activation of the camera and processing to detect visual triggers
3. High-performance vision application processing
- Computational videography
- Face, body and gesture tracking
- Object and scene reconstruction
- Feature tracking, pose estimation e.g. for AR
- High-quality vision processing in vision-based applications
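A minimal sketch of this tiered strategy as a state machine; the state names, events and transitions below are invented for illustration and are not from any shipping sensor-hub API:

```python
# Tiered power strategy: only wake the camera on motion, and only run
# full vision processing on a visual trigger. States/events illustrative.
TRANSITIONS = {
    ("sensor_hub_only", "motion_detected"): "camera_trigger_watch",
    ("camera_trigger_watch", "visual_trigger"): "full_vision",
    ("camera_trigger_watch", "idle_timeout"): "sensor_hub_only",
    ("full_vision", "app_finished"): "camera_trigger_watch",
}

def next_state(state: str, event: str) -> str:
    # unknown events leave the state (and power draw) unchanged
    return TRANSITIONS.get((state, event), state)

state = "sensor_hub_only"
for event in ["motion_detected", "visual_trigger", "app_finished"]:
    state = next_state(state, event)
# state == "camera_trigger_watch"
```

The key property is that the expensive tier is only reachable through the cheap ones, so average power tracks how often triggers actually fire rather than peak vision load.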

Mobile Developers Need Help!

• Control, coordinate and synchronize a diverse array of mobile sensors
• Handle a diverse selection of emerging depth camera technologies
• Write maintainable code for a heterogeneous mix of CPUs, GPUs and DSPs
• Write code that is deployable across multiple devices, platforms and OSes
• Leverage dedicated vision hardware for minimized power
• Create fluid 60Hz experiences on battery-powered mobile devices

Khronos Connects Software to Silicon

Open Consortium creating ROYALTY-FREE, OPEN STANDARD APIs for hardware acceleration

Defining the roadmap for low-level silicon interfaces needed on every platform

Graphics, compute, rich media, vision, sensor and camera processing

Rigorous specifications AND conformance tests for cross- vendor portability

Acceleration APIs BY the Industry FOR the Industry
Well over a BILLION people use Khronos APIs Every Day…

http://accelerateyourworld.org/

Access to 3D on Over 2 BILLION Devices

1.9B Mobiles / year

300M Desktops / year Windows, Mac, Linux

1B Browsers / year

Source: Gartner (December 2013)

OpenGL ES Momentum
• OpenGL ES 3.1 is the latest version and is standard in Android
- Announced at Google I/O, June 2014
• Google has defined the Android Extension Pack (AEP) for premium Android gaming
- Optional set of extensions for OpenGL ES 3.1, accessible through a single query
- Functionality to support AAA games: tessellation, geometry shaders, ASTC texture compression
• First OpenGL ES 3.1 drivers are shipping
- Just a few months after specification

Epic's Rivalry demo using the full Unreal Engine 4, running in real-time on NVIDIA Tegra K1 with OpenGL ES 3.1 + AEP
https://www.youtube.com/watch?v=jRr-G95GdaM

VR Will Influence Graphics APIs
• The ability to generate 'Presence' is becoming achievable at reasonable cost
- Using visual input to generate subconscious belief in a virtual situation
• PC-based AND mobile systems (e.g. Samsung Gear VR)
- Beginning to enable developer experimentation
• VR requirements will affect how graphics APIs generate visual imagery
- Control over generation of stereo pairs – a slightly different view for each eye
- Optical system geometric correction in the rendering path
- Reduced latency through elimination of in-driver buffering
- Asynchronously warp the framebuffer for instantaneous response to head movement
• If achieved, Presence will need to be used very carefully
- What happens if you BELIEVE you are in a first-person shooter game?

Heterogeneous Computing and Mobile
• Mobile SOCs now beginning to need more than just 'GPU Compute'
- Multi-core CPUs, GPUs, DSPs
• OpenCL can provide a single programming framework for all processors on a SOC
- Even ISPs and specialized hardware blocks, via Built-in Kernels for custom HW

Image Courtesy Qualcomm

OpenCL – Portable Heterogeneous Computing
• Portable heterogeneous programming of diverse compute resources
- Targeting supercomputers -> embedded systems -> mobile devices
• One code tree can be executed on CPUs, GPUs, DSPs, FPGAs and dedicated hardware
- Dynamically interrogate system load and balance work across available processors
• OpenCL = two APIs and a C-based kernel language
- Platform Layer API to query, select and initialize compute devices
- Kernel language: a subset of ISO C99 plus language extensions
- C Runtime API to build and execute kernels across multiple devices

[Diagram: OpenCL kernel code dispatched to CPU, GPU, DSP, FPGA and dedicated HW]
OpenCL is shipping today on multiple mobile processors and cores

OpenCL as Parallel Language Backend

Languages and frameworks using OpenCL as a backend:
- River Trail: array extensions to JavaScript for parallelism
- JavaScript binding for initiation of OpenCL C kernels
- Embedded language for image processing and computational photography
- Java language extensions
- Language directives for Fortran, C and C++
- PyOpenCL: Python wrapper around OpenCL
- Harlan: high-level language for GPU programming
- Haskell bindings
- MulticoreWare open source compiler project on Bitbucket

OpenCL provides vendor-optimized, cross-platform, cross-vendor access to heterogeneous compute resources
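All of these layers ultimately target OpenCL's data-parallel kernel model: a kernel body runs once per work-item across an index range. The sketch below emulates that NDRange dispatch semantics in plain Python; the function names are invented for illustration and this is a conceptual model, not the real OpenCL API:

```python
# Pure-Python sketch of OpenCL's data-parallel model: a kernel runs once
# per work-item, identified by its global id. A real device executes the
# work-items in parallel; here we just loop.
def enqueue_nd_range(kernel, global_size, *buffers):
    for gid in range(global_size):
        kernel(gid, *buffers)

def vec_add(gid, a, b, out):
    # the body an OpenCL C kernel would express for one work-item
    out[gid] = a[gid] + b[gid]

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
out = [0] * 4
enqueue_nd_range(vec_add, len(out), a, b, out)
# out == [11, 22, 33, 44]
```

Because each work-item touches only its own index, the same kernel can be scheduled across a CPU, GPU or DSP without changing the source, which is the portability claim the slide makes.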

Widening OpenCL Ecosystem

[Diagram: many routes to kernels feed a SPIR generator - OpenCL C kernel source, alternative kernel languages, high-level frameworks for applications, single-source files, and diverse domain-specific languages, frameworks and tools]

SPIR Generator (e.g. patched Clang)

https://github.com/KhronosGroup/SPIR

SPIR (Standard Portable Intermediate Representation)
- Easier compiler target than C
- First portable IR that includes support for parallel computation
- Created in close cooperation with the LLVM community
- An OpenCL run-time can consume SPIR
- SPIR 2.0 Provisional released August 2014 (uses LLVM 3.4)

SYCL
- Programming abstraction that combines the portability and efficiency of OpenCL with the ease of use and flexibility of C++
- Single-source-file programming
- SYCL 1.2 Provisional updated November 2014

Mixamo - Avatar Videoconferencing
• Real-time facial analysis and animation capture on a mobile device
- OpenCL GPU acceleration enables 30 frames-per-second processing
• Animate an avatar while video conferencing or playing a game online
- Teleconference as your gaming character!

OpenCL-accelerated face tracking running on NVIDIA Tegra K1 development system

Vision Pipeline Challenges and Opportunities

Growing Camera Diversity Diverse Vision Processors Sensor Proliferation


• Flexible sensor and camera control to GENERATE an image stream
• Use efficient acceleration to PROCESS the image stream
• Combine vision output with other sensor data on device

Need for Camera Control API - OpenKCAM
• Advanced control of ISP and camera subsystem – with cross-platform portability
- Generate sophisticated image streams for advanced imaging & vision apps
• No platform API currently fulfills all developer requirements
- Portable access to growing sensor diversity: e.g. depth sensors and sensor arrays
- Cross-sensor synch: e.g. synch of camera and MEMS sensors
- Advanced, high-frequency per-frame burst control of camera/sensor: e.g. ROI
- Multiple input/output re-circulating streams with RAW, Bayer or YUV processing

[Diagram: OpenKCAM defines control of sensor, color filter array, lens, flash, focus and aperture; auto exposure (AE), auto white balance (AWB) and auto focus (AF); the Image Signal Processor (ISP) produces a stream of images for vision processing]
The OpenKCAM standard is still in development

OpenVX – Power Efficient Vision Acceleration
• Out-of-the-box vision acceleration framework
- Enables low-power, real-time applications
- Targeted at mobile and embedded platforms
• Portability across diverse vision HW
- ISPs, dedicated hardware, DSPs and DSP arrays, GPUs, multi-core CPUs …
• OpenVX 1.0 and conformance tests available
- Released in October 2014, with an open source sample implementation coming soon

[Diagram: multiple applications running on top of diverse vision accelerators]

OpenVX Graphs – The Key to Efficiency
• OpenVX enables a developer to express a graph of image operations ('Nodes')
- Enables execution optimizations for power and performance efficiency
• Implementers can use diverse optimization methods
- Each node can be implemented in software or accelerated hardware
- Nodes may be fused by the implementation to eliminate memory transfers
- Processing can be tiled to keep data entirely in local memory/cache

Example OpenVX graph - stereo machine vision:
- Camera 1 -> Stereo Rectify with Remap -> Compute Depth Map (User Node) -> Detect and Track Objects (User Node) -> Object coordinates
- Camera 2 -> Stereo Rectify with Remap -> Image Pyramid -> Compute Optical Flow (with Delay)
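The node-fusion optimization that makes graphs efficient can be sketched in a few lines of Python. The per-pixel operations and function names below are illustrative stand-ins, not OpenVX calls; the point is that fusing stages avoids materializing intermediate images in memory, which the energy figures earlier showed is the expensive part:

```python
# Sketch of why graph execution saves power: fusing per-pixel nodes
# avoids writing intermediate images to memory.
def halve(px):
    return px // 2                       # stand-in for a filter node

def binarize(px):
    return 1 if px > 10 else 0           # stand-in for a threshold node

def run_unfused(image):
    tmp = [halve(p) for p in image]      # intermediate buffer hits memory
    return [binarize(p) for p in tmp]

def fuse(*nodes):
    def fused(px):                       # one pass; data stays "in registers"
        for node in nodes:
            px = node(px)
        return px
    return fused

def run_fused(image):
    node = fuse(halve, binarize)
    return [node(p) for p in image]      # no intermediate buffer

image = [4, 30, 22]
assert run_unfused(image) == run_fused(image) == [0, 1, 1]
```

An OpenVX implementation can apply this transformation automatically because the whole graph is declared up front, before execution begins.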

OpenVX 1.0 Functional Overview
• Core data structures
- Images and image pyramids
- Processing graphs, kernels, parameters
• Image processing
- Arithmetic, logical, and statistical operations
- Multichannel color and bit-depth extraction and conversion
- 2D filtering and morphological operations
- Image resizing and warping
• Core computer vision
- Pyramid computation
- Integral image computation
• Feature extraction and tracking
- Histogram computation and equalization
- Canny edge detection
- Harris and FAST corner detection
- Sparse optical flow

The OpenVX specification is extensible:
- OpenVX 1.0 defines the framework for creating, managing and executing graphs, plus a focused set of widely used functions (Nodes) to be accelerated
- Implementers can add functions as extensions; Khronos maintains an extension registry
- Widely used extensions are adopted into future versions of the core
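Integral image computation, one of the core functions listed above, is easy to sketch. This is a plain-Python reference of the standard summed-area table, not OpenVX API usage:

```python
# Integral image (summed-area table): ii[y][x] holds the sum of all
# pixels at or above-left of (x, y), so any rectangular region sum can
# later be read with just four lookups - useful for fast box filters
# and Haar-like features.
def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

ii = integral_image([[1, 2], [3, 4]])
# ii == [[1, 3], [4, 10]]  (bottom-right is the whole-image sum: 1+2+3+4)
```

The single pass over the image and purely local access pattern is exactly the kind of node a tiled, scan-line-oriented accelerator handles well.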

OpenVX and OpenCV are Complementary

• Governance: OpenCV is community-driven open source with no formal specification; OpenVX is a formal specification defined and implemented by hardware vendors
• Conformance: OpenCV has no conformance tests for consistency, so every vendor implements a different subset; OpenVX's full conformance test suite/process creates a reliable acceleration platform
• Portability: OpenCV APIs can vary depending on the processor; OpenVX abstracts the hardware for portability
• Scope: OpenCV is very wide (1000s of imaging and vision functions, multiple camera APIs/interfaces); OpenVX is tightly focused on hardware-accelerated functions for mobile vision and uses an external camera API
• Efficiency: OpenCV is memory-based (each operation reads and writes memory); OpenVX is graph-based (optimizable computation and data transfer)
• Use case: OpenCV for rapid experimentation; OpenVX for production development & deployment

Khronos APIs for Vision Processing
• Any compute API can be used for vision acceleration
- OpenCL, OpenGL Compute Shaders …
• OpenVX is the only vision API that does not NEED a powerful CPU/GPU complex
- Can use any processor – from high-end GPUs, through DSPs, to hardware blocks
• The higher abstraction level of OpenVX protects the app from hardware differences
- Enables low-power, always-on acceleration – with application portability

Many implementers may choose to use OpenCL or OpenGL Compute Shaders to implement OpenVX nodes, and OpenVX to enable a developer to connect those nodes into a graph that can run on programmable vision processors or dedicated vision hardware

Sensor Industry Fragmentation …

Low-level Sensor Abstraction API

Advanced sensors everywhere: multi-axis motion/position, quaternions, context awareness, gestures, activity monitoring, health and environmental sensors

Apps need sophisticated access to sensor data - without coding to specific sensor hardware - plus sensor discoverability and sensor code portability

StreamInput:
• Apps request semantic sensor information; StreamInput defines the possible requests, e.g.
- Read physical or virtual sensors, e.g. “Game Quaternion”
- Context detection, e.g. “Am I in an elevator?”
• The StreamInput processing graph provides an optimized sensor data stream
- High-value, smart sensor fusion middleware can connect to apps in a portable way
- Apps can gain 'magical' situational awareness

Google Project Tango
• A high-spec Android tablet made by Google ATAP
- 4GB RAM, 128GB SSD, NVIDIA Tegra K1
• Three core capabilities: Motion Tracking, Depth Perception, Area Learning
• Goal: to inspire and explore the use of new-generation mobile sensors
https://www.google.com/atap/projecttango/#project

Tango Sensors and Processing Power

Sensors:
- 4 Mpixel RGB/IR camera (2µm sensor) + flash
- Wide-angle odometry camera
- IR structured light projector
- 120-degree front camera
- Sensor hub: MEMS sensor control and sensor time-stamping
Processing:
- Tegra K1 processor: 350 GFLOPS GPU compute, desktop-class graphics
The Tango tablet needs no additional image processing hardware other than the Tegra K1 GPU

Motion Tracking
• Generates accurate pose
- Position (x, y, z)
- Pointing direction (i, j, k, rotation)
• Uses odometry camera
- Wide-angle lens, global shutter
- Tracking even through fast movements - avoids rolling shutter artifacts
- Fuses inertial and visual odometry
• GPU processing
- Fisheye geometric correction
- Feature tracking and SLAM
• Accuracy: <1% session error
- Significantly more accurate than using inertial sensors alone
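Why fusing inertial and visual odometry beats inertial sensors alone can be illustrated with a 1-D complementary filter. The gain, rates and headings below are invented for illustration; this is a toy model, not Tango's actual algorithm:

```python
# 1-D heading fusion: integrating gyro rate drifts over time; a small
# correction toward the visual heading removes the drift.
def fuse_heading(gyro_rates, visual_headings, dt, gain=0.1):
    est = 0.0
    for rate, visual in zip(gyro_rates, visual_headings):
        est += rate * dt                 # inertial prediction (drifts)
        est += gain * (visual - est)     # pull estimate toward vision fix
    return est

# Stationary device with a biased gyro reporting 0.5 rad/s: inertial-only
# integration drifts by 1.0 rad over 2 s, while the fused estimate is
# held near the true heading of 0.0 by the visual fixes.
drift_only = sum(0.5 * 0.01 for _ in range(200))
fused = fuse_heading([0.5] * 200, [0.0] * 200, dt=0.01)
```

The fused estimate settles at gain-dependent equilibrium (here 0.045 rad) instead of growing without bound, which mirrors the slide's "<1% session error" claim in spirit.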

Depth Perception
• Real-time point cloud
- Depth information for many points on a 2D picture
• Mantis Vision structured light projector
- Strobes a timed IR grid pattern into the scene
- IR image captured by the 4MP camera
• GPU-accelerated processing
- Depth cloud computed from pattern deformation
• Enables diverse applications
- Body position and movement
- 3D mesh generation
- Object recognition

Area Learning
• Remembers visual features of the scene
- Corrects positional drift when it detects and locks to a recognized position in an Area
• Can save and reload Area descriptions
- Enables localization of position against a previously stored scene
- E.g. to start a new session at a known point in space
• Share Area descriptions in the cloud
- Localize multiple users to the same saved Area for spatial cooperation
- Match the current Area to stored descriptions to recognize an environment

Video courtesy Mantis Vision

Tango Tablet Hardware Architecture
• No external image processors
- All vision processing is on the GPU
- Reduces cost, power and latency; increases flexibility
• Odometry (VGA B&W wide-angle camera)
- 30 FPS -> GPU load: ~15%
- ~3-5ms GPU processing/frame
• Depth decoding (4MP RGB/IR camera + IR projector strobes)
- 3-5 FPS -> GPU load: ~8%
- ~13-15ms GPU processing/frame
• Sensor hub processor time-stamps asynchronous MEMS sensor samples
- The application can reconstruct the sample timeline for processing
• Plenty of GPU bandwidth left for real-time graphics

[Diagram: 120-degree front, 4MP RGB/IR and VGA wide-angle cameras connect over CSI; MEMS sensors connect over SPI via the sensor hub; IR projector strobes]
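Reconstructing a single sample timeline from the hub's time-stamped streams, as described above, amounts to a k-way merge of already-ordered streams. A Python sketch; the sample tuples and sensor names are invented for illustration:

```python
import heapq

# Each stream is already in time order (the hub stamps samples as they
# arrive), so merging k streams by timestamp yields one global timeline
# without a full sort.
def merge_timeline(*streams):
    # sample = (timestamp_us, sensor_name, value)
    return list(heapq.merge(*streams, key=lambda sample: sample[0]))

accel = [(100, "accel", 0.98), (300, "accel", 1.02)]
gyro = [(150, "gyro", 0.01), (250, "gyro", -0.02)]
timeline = merge_timeline(accel, gyro)
# timestamps come out in order: 100, 150, 250, 300
```

Hardware time-stamping at the hub matters because it removes OS scheduling jitter from the timestamps; the merge itself is then trivial and cheap.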

Tango Application APIs
• GPU-processed sensor data made available to Android applications
- Pose data and depth cloud available through Java or native APIs
- Applications can receive events when sensor data is available
- Sensor data can be fed to 3D engines
• Downstream processing needed for many applications
- Depth map and mesh generation, RGB depth maps, object recognition, etc.
- Open source libraries such as PCL (Point Cloud Library) can be accelerated with OpenVX

[Diagram: camera control and sensor hub SW use standard Android APIs (Camera HAL V3, core Android inertial sensor stack, laser control); GPU sensor processing in CUDA covers time-stamping, feature tracking, odometry (pose client) and depth decoding (depth client)]
Pose and depth cloud information is made available to applications through both Java and C interfaces

Tango Early Learnings
• Motion tracking technology is very robust
- Fusion of motion sensors and visual odometry can be very accurate
• Depth sensing in mobile devices is not quite consumer-ready
- Devices can be taken into unknown and hostile environments
- IR interference outdoors in direct sunlight
- Current Tango active depth sensor constrained to 1m-4m range
- Point cloud resolution is limited
• Will need fused depth solutions using multiple sensor types
- Time-of-flight sensors can benefit from Moore's law and provide range flexibility
- Stereo can provide range data for complementary ranges and lighting conditions
- Structure from motion can be useful in mobile devices - but is compute intensive
• Tango will help us understand and overcome sensor challenges

First Wave Tango Application Areas
• Enhanced user interaction
• Gaming
• Environment scanning
• Object scanning
• Augmented reality
(Examples from Limbic, Styku, SmartPicture, Sensopia)

Metaio - IKEA Furniture Catalog
• Select a catalog item and display it in your own home
- 1 million users in Europe
• Depth camera enhances the experience
- Absolute measurements and ground plane detection with no marker
- Environmental lighting
- Occlusion

Matterport - Interior Space Capture
• Real estate, training, historical preservation
• Currently needs a dedicated hardware camera
- Expensive and difficult to use
• Depth and odometry sensors will enable use of standard mobile devices
- For environment capture and viewing

DotProduct - 3D Model Capture
• Integrated depth sensors enable real-time 3D scanning
- Without the need to upload data to the cloud for processing
- Real-time feedback on completeness of scan data

Khronos APIs for Augmented Reality

AR needs not just advanced sensor processing, vision acceleration, computation and rendering - but also all of these subsystems working efficiently together

[Diagram: MEMS sensors feed sensor fusion, with precision timestamps on all sensor samples; advanced camera control and stream generation feed vision processing; EGLStream streams data between APIs; the application runs on CPUs, GPUs and DSPs, with 3D rendering and video composition on the GPU, plus audio rendering]

Summary
• Mobile devices are acquiring powerful new user interaction capabilities
- Sensors to gather information about the user and the environment
- Processing power to use and understand the sensor data
- Open standard APIs are enabling applications to access these capabilities
• A rich area for research and innovation
- Enabled by devices such as the Google Tango Tablet
• More information
- www.khronos.org
- [email protected]
- @neilt3d
