SOC Programming Tutorial Hot Chips 2012 Neil Trevett Khronos President
Total Page:16
File Type:pdf, Size:1020Kb
SOC Programming Tutorial Hot Chips 2012 Neil Trevett Khronos President © Copyright Khronos Group 2012 | Page 1 Welcome! • An exploration of SOC capabilities from the programmer’s perspective - How is mobile silicon interfacing to mobile apps? • Overview of acceleration APIs on today’s mobile OS - And how they can be used to optimize performance and power • Focus on mobile innovation hotspots - Vision and gesture processing, Augmented Reality, Sensor Fusion, Computational Photography, 3D Graphics • Highlight silicon-level opportunities and challenges still be to solved - While exploring the state of the art in mobile programming SOC = ‘System On Chip’ Minus memory and some peripherals © Copyright Khronos Group 2012 | Page 2 Speakers Session Speaker Company Title Connecting Mobile SOC Hardware to Apps Neil Trevett Khronos Khronos President and NVIDIA VP of Mobile Content Break Break Camera and Video Sean Mao ArcSoft VP Marketing, Advanced Imaging Technologies Vision and Gesture Processing Itay Katz Eyesight Co-Founder & CTO Augmented Reality Ben Blachnitzky Metaio Director of R&D Sensor Fusion Jim Steele Sensor Platforms VP Engineering 3D Gaming Daniel Wexler the11ers CXO Panel Session All Speakers © Copyright Khronos Group 2012 | Page 3 Khronos Connects Software to Silicon • Khronos APIs define processor acceleration capabilities - Graphics, video, audio, compute, vision and sensor processing APIs developed today define the functionality of platforms and devices tomorrow © Copyright Khronos Group 2012 | Page 4 APIs BY the Industry FOR the Industry • Khronos defines APIs at the software silicon interface - Low-level “Foundation” functionality needed on every platform • Khronos standards have strong industry momentum - 100s of man years invested by industry experts - Shipping on billions of devices across multiple operating systems - Rigorous conformance tests for cross-vendor consistency • Khronos is OPEN for any company to join and participate - Standards are cooperative – one company, one vote - Proven legal and IP framework for industry cooperation - Khronos membership fees to cover expenses • Khronos standards are FREE to use - Members agree to not request royalties Silicon Software © Copyright Khronos Group 2012 | Page 5 Apple Over 100 members – any company worldwide is welcome to join Board of Promoters © Copyright Khronos Group 2012 | Page 6 API Standards Evolution WEB INTEROP, VISION MOBILE AND SENSORS DESKTOP OpenVL New API technology first evolves on high- Mobile is the new platform for Apps embrace mobility’s end platforms apps innovation. Mobile unique strengths and need Diverse platforms – mobile, TV, APIs unlock hardware and complex, interoperating APIs embedded – means HTML5 will conserve battery life with rich sensory inputs become increasingly important e.g. Augmented Reality as a universal app platform © Copyright Khronos Group 2012 | Page 7 A New Era in Computing Mobile PC Internet Computing 1990’s 2000’s 2010’s © Copyright Khronos Group 2012 | Page 8 20 Years Faster to 100M Per Year 140 Cumulative Shipments 120 iOS & Android MacOS & Windows 100 80 60 40 Units in Millions Millions in Units 20 0 Year 1 Year 2 Year 3 Year 4 Year 5 Year 6 Year 7 Year 8 Year 9 Source: Gartner, Apple, NVIDIA © Copyright Khronos Group 2012 | Page 9 The Largest Device Market Ever • IDC - 1.8 billion mobile phones will ship in 2012 - By the end of 2016, 2.3 billion mobile phones will ship per year Smart phones account for approximately half of total phone market © Copyright Khronos Group 2012 | Page 10 Global Smartphone Market Share Windows Mobile/Phone © Copyright Khronos Group 2012 | Page 11 ARM is Licensable and Pervasive 14 12 ARM 10 x86 8 Units in Billions 6 Annual Shipments (Billions) 4 2 0 2005 2006 2007 2008 2009 2010 2011 2012e 2013e 2014e 2015e Source: ARM, Mercury Research © Copyright Khronos Group 2012 | Page 12 Mobile Performance Increases 100 75x STARK 50x LOGAN WAYNE 10 10x Core i5 5x Tegra 3 (KAL-EL) Core 2 Duo CPU/GPU AGGREGATE PERFORMANCE AGGREGATE CPU/GPU TEGRA 2 1 2011 2012 2013 2014 2010 © Copyright Khronos Group 2012 | Page 13 Power is the New Design Limit • The Process Fairy keeps bringing more transistors - Transistors are getting cheaper • The End of Voltage Scaling - The Process Fairy isn’t helping as much on power as in the past In the Good Old Days The New Reality Leakage was not important, and voltage scaled Leakage has limited threshold voltage, largely with feature size ending voltage scaling L’ = L/ L’ = L/ V’ = V/ V’ = ~V E’ = CV2 = E/8 E’ = CV2 = E/2 f’ = f f’ = ~2f D’ = /L2 = 4D D’ = /L2 = 4D P’ = P P’ = 4P Halve L and get 4x the transistors and 8x Halve L and get 4x the transistors and 8x the capability for the capability for the same power 4x the power!! © Copyright Khronos Group 2012 | Page 14 Mobile Thermal Design Point 10” Screen takes 1-2W Resolution makes a difference! 7” Screen The iPad3 screen takes up to 8W 4-5” Screen takes takes 1W 250-500mW 2-4W 4-7W 6-10W 30-90W Max system power before thermal failure Even as battery technology improves - these thermal limits remain © Copyright Khronos Group 2012 | Page 15 Apps and Power Write 32-bits to LP-DDR2 600pJ • Much more expensive to MOVE data than COMPUTE data Send 32-bits Off-chip 50pJ • Process improvements WIDEN the gap - 10nm process will increase ratio another 4X • Energy efficiency must be key metric during silicon AND app design Send 32-bits 2mm - Awareness of where data lives, where 24pJ computation happens, how is it scheduled 32-bit Float Operation For 40nm, 7pJ 1V process 32-bit Integer Add 1pJ 32-bit Register Write 0.5pJ © Copyright Khronos Group 2012 | Page 16 Energy Optimization Opportunities • Dark Silicon - Lots of space for transistors – just can’t turn them all on at same time - Multiple specialized hardware units that are only turned on when needed - Increase locality and parallelism of computation to save power compared to programmable processors • Dynamic and feedback-driven software power optimization - Instrumentation for energy-aware compilers and profilers - Most compilers just look at one thread, take a more global view - Power optimizing compiler back-end / installers • Smart, holistic use of sensors and peripherals - Wireless modems and networks - Motion sensors, cameras, networking, GPS © Copyright Khronos Group 2012 | Page 17 Camera Sensor Processing • CPU - Single processor or Neon SIMD - Makes heavy use of general memory - Non-optimal performance and power • GPU - Many way parallelism - Efficient image caching into general memory - Programmable and flexible - Still significant use of cache/memory • Camera ISP = Image Signal Processor - Scan-line-based - Data flows through compact hardware pipe - No global memory used to minimize power - Little or no programmability © Copyright Khronos Group 2012 | Page 18 Typical Camera ISP • ~760 math Ops • Computational photography apps beginning to mix non-programmable ISP processing with • ~42K vals = 670Kb more flexible GPU or CPU processing • 300MHz ~250Gops • ISP pipelines could provide tap/insertion points to/from CPU/GPU at critical pipeline points 2 MADs 1 Mul 9 MADs 50 MADs 6 MADs 2 I/O reg 2 I/O reg 2 I/O reg 300 vals 6 I/O reg 256 LUT 4 vals 9 vals 3*256 LUT 1 adder 12 MADs 9 MADs 9 MADs 100 MADs 40M 75 MADs 9 MADs 2 I/O reg 2 I/O reg 2 I/O reg 6 I/O reg 400 vals vals 300 vals 2 I/O reg 4 vals 400 vals 9 vals 9 vals 9 vals 100 MADs 100 vals Black While Raw Level Bal. Offset Gain Bayer Sharp Luma + NR -1 YUV Contr. Yuv Yuv 12 Luma (Y) 12 12 12 Gamma Linearize Flatfield Early Demosaicer 3x3 Color Color Color LUT 12 Signal Image x De- Color Matrix Matrix 12 12 Matrix - S14 Non-Linear 12 S12 S12 S14 noising 12 Matrix 12 Chroma Linear Non-linear Linear Sat. sRGB sRGB Yuv 12 12 NR 12 12 camera sRGB Chroma (uv) RGB n-Rows line buffer © Copyright Khronos Group 2012 | Page 19 Programmers View of Typical SOC c. 2012 Unified memory is CPU Complex Automatic scaling of critical departure from 1-5 Cortex A9/A15 Cores frequency/voltage and # cores to meet PC architecture L1 Cache Software current software load configurability Common virtual L2 Cache HD Audio address space for all units – but typically Engine / IO not coherent Peripheral Sensor Unified Busses Array 2D/3D GPU Graphics Memory Cache Complex Controller for 1-4 Display 0.5-2GB DDR3L/LP-DDR3 Controllers Significant programmable acceleration. OpenGL ES today. OpenCL soon. 10-50+ cores Image Video Vision Signal Encoder Processor Processor Decoder (Future) Programmable? Or dedicated Software Typically NO or very limited hardware for power efficiency? programmable functionality 1-3 Cameras configurability © Copyright Khronos Group 2012 | Page 20 HSA Feature Roadmap Heterogeneous System Architecture Foundation: AMD, ARM, Imagination, TI, MediaTek Physical Optimized Architectural System Integration Platforms Integration Integration Unified Address Integrate CPU & GPU GPU Compute C++ GPU compute Space for CPU and in silicon support context switch GPU GPU uses pageable Unified Memory User mode GPU graphics system memory via Controller scheduling pre-emption CPU pointers Common Bi-Directional Power Fully coherent Manufacturing Mgmt between CPU memory between Quality of Service Technology and GPU CPU & GPU Time Current Innovation © Copyright Khronos Group 2012 | Page 21 Mobile Innovation Hot Spots • New platform capabilities being driven by SILICON and APIs Vision - Camera as sensor Computational Photography Console-Class 3D Gesture Processing Performance, Quality, Augmented Reality Controllers and TV connectivity • Media