Mali-400 MP: A Scalable GPU for Mobile Devices

Tom Olson, Director, Graphics Research, ARM

Outline
• ARM and Mobile Graphics
• Design Constraints for Mobile GPUs
• Mali Architecture Overview
• Multicore Scaling in Mali-400 MP
• Results

About ARM
• World's leading supplier of semiconductor IP
  - Processor architectures and implementations
  - Related IP: buses, caches, debug & trace, physical IP
  - Software tools and infrastructure
• Business model: license fees and per-chip royalties
• Graphics at ARM
  - Acquired Falanx in 2006
  - ARM Mali is now the world's most widely licensed GPU family

Challenges for Mobile GPUs
• Size
• Power
• Memory bandwidth

More Challenges
• Graphics is going into "anything that has a screen": mobile, navigation, set-top box/DTV, automotive, video telephony, cameras, printers
• Huge range of form factors, screen sizes, power budgets, and performance requirements
• In some applications, a huge difference between peak and average performance requirements

Solution: Scalability
• Address a wide variety of performance points and applications with a single IP and a single software stack
• Static scalability is needed to adapt to different peak requirements in different platforms and markets
• Dynamic scalability is needed to reduce power when peak performance isn't required

Options for Scalability
• Fine-grained: multiple pipes, wide SIMD, etc.
  - Proven approach, efficient and effective
  - But adding pipes or lanes is invasive, and hard for IP licensees to do on their own
  - Also hard to partition for dynamic scalability
• Coarse-grained: multicore
  - Easy for licensees to select the desired performance point
  - Putting cores on separate power islands allows dynamic scaling

Mali-400 MP Top-Level Architecture
[Block diagram: one geometry processor and pixel processors #1-#4, each behind a Mali MMU, sharing a Mali L2 cache on an AXI bus; control and status via an asynchronous APB interface with clock, reset, IRQ, and idle signals]
• Scalable pixel performance
• 1-4 rasterizer cores
• 32 KB-128 KB L2 cache

Mali Tile-Based Rendering
[Figure: the screen divided into a grid of small tiles, each rendered independently]
• Reduces off-chip framebuffer bandwidth
• Rasterize into an on-chip 16x16 tile buffer
• Z, stencil, and MSAA samples never go off-chip
• Tradeoff against increased geometry bandwidth
• Details: see Real-Time Rendering, 3rd ed.

Mali-400 MP Geometry Processor
[Block diagram: system bus interface feeding a vertex loader, vertex shader unit, polygon list builder, and vertex storer, plus control registers / performance counters]
• Vertex shader: single-threaded, deeply pipelined

Mali-400 MP Pixel Processor
[Block diagram: system bus interface, polygon list reader, triangle setup, rasterization, fragment shader, blending, color/depth/stencil buffers, thread control, and control registers / performance counters]
• Fragment shader
  - 128-thread barrel processor
  - Fully general control flow
  - VLIW ISA, tuned for graphics
  - One texture sample per clock
• Key features
  - 16x16 on-chip tile buffer
  - Renders one pixel per clock: 275 Mpix/s @ 275 MHz
  - No penalty for 4x MSAA
  - No penalty for blending
  - 4x or 16x MSAA resolve on output
  - OpenGL ES 2.0 states handled without state-dependent shaders

Going Multi-Core
[Block diagram: the same top-level layout, with the vertex processor and fragment processors #1-#4 sharing the Mali L2 cache over AXI]
• All cores work in parallel on separate tasks
• Each core processes one tile at a time until completion – no communication between cores
• Tiles are assigned statically to cores in a swizzled order (one possible assignment scheme is sketched below)
• The tile processing order maximizes the L2 hit rate for polygon descriptors and textures
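The deck says that tiles are assigned statically in a swizzled order but does not spell out the swizzle, so the sketch below is only an illustration of the idea, not ARM's actual assignment logic. It takes the 16x16 tile size, the 1920x1080 benchmark resolution, and the four-core configuration from the slides; the use of a Morton (Z-order) code of the tile coordinates, and the modulo mapping from that code to a core index, are assumptions made for the example.

```c
/* Illustrative sketch only: a swizzled, static tile-to-core assignment.
 * Each tile's (x, y) grid coordinates are interleaved into a Morton (Z-order)
 * code, and the low bits of that code pick the core, so every 2x2 block of
 * neighbouring tiles is spread across all four cores. Cores therefore work on
 * nearby tiles at the same time, which keeps the polygon lists and textures
 * for that screen region hot in the shared L2 cache. */
#include <stdint.h>
#include <stdio.h>

/* Interleave the low 16 bits of x and y into a 32-bit Morton code. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t code = 0;
    for (int b = 0; b < 16; b++) {
        code |= (uint32_t)((x >> b) & 1u) << (2 * b);      /* x bits -> even positions */
        code |= (uint32_t)((y >> b) & 1u) << (2 * b + 1);  /* y bits -> odd positions  */
    }
    return code;
}

int main(void)
{
    const int width = 1920, height = 1080, tile = 16, cores = 4;
    const int tiles_x = (width  + tile - 1) / tile;   /* 120 tiles across */
    const int tiles_y = (height + tile - 1) / tile;   /*  68 tiles down   */

    /* Static assignment: the swizzle code of each tile selects its core. */
    for (int ty = 0; ty < tiles_y; ty++) {
        for (int tx = 0; tx < tiles_x; tx++) {
            uint32_t code = morton2d((uint16_t)tx, (uint16_t)ty);
            int core = (int)(code % (uint32_t)cores);
            if (tx < 4 && ty < 2)    /* print a small corner of the mapping */
                printf("tile (%2d,%2d) -> core %d\n", tx, ty, core);
        }
    }
    return 0;
}
```

For the four-core case the low two Morton bits simply interleave the lowest x and y bits, so the pattern repeats every 2x2 tiles; the real ordering is fixed in the hardware's scheduling logic, and the point of the sketch is only that a swizzled traversal keeps concurrently processed tiles spatially close.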
Does it work?
• Test content: industry-standard benchmarks for OpenGL ES 1.1 and 2.0
• Thanks, Rightware!

Relative Performance
[Charts: normalized frame rate for OpenGL ES 1.1 and OpenGL ES 2.0 across the MC1-MC4 (1-4 core) configurations; single-core frame rate = 1.0]
• Single frames in RTL simulation
• Artificial memory model (fixed static latency)
• 1-4 cores, 32 KB L2 cache
• 1920x1080, 32 bpp, 4x MSAA

Impact on Bandwidth
[Chart: normalized memory bandwidth per frame for OpenGL ES 1.1 across the MC1-MC4 configurations; single-core bandwidth per frame = 1.0]
• Memory bandwidth per frame is stable as cores are added
• The L2 cache is working to prevent redundant memory accesses

Summary
• Scalability, both static and dynamic, is an essential feature for mobile GPUs
• Multi-core architectures are a good fit for the need
  - Easy to configure for different performance requirements
  - Easy to power-gate for dynamic scaling
• Mali-400 MP shows that the concept works
  - Pixel rate scalable from 275 Mpix/s to 1.1 Gpix/s (a short back-of-the-envelope check follows the deck)
  - Linear speedup demonstrated on common mobile benchmarks
  - Single driver for all configurations – transparent to applications
  - Power and bandwidth efficient

Thanks to…
• The HPG 2010 Hot3D organizers
• Remi Pedersen, Mashhuda Glencross, and the Mali team
• Our silicon partners
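As a quick check of the pixel-rate range quoted in the summary, the short calculation below reproduces the 275 Mpix/s (one core) to 1.1 Gpix/s (four cores) figures from the one-pixel-per-clock, 275 MHz numbers given for the pixel processor. The clock and per-clock rate are taken from the slides; the rest is plain arithmetic, not a vendor specification.

```c
/* Back-of-the-envelope check of the peak pixel rates quoted in the summary.
 * Each pixel processor retires one pixel per clock at 275 MHz, so n cores
 * give n * 275 Mpix/s: 275 Mpix/s for one core, 1100 Mpix/s (1.1 Gpix/s)
 * for four. */
#include <stdio.h>

int main(void)
{
    const double clock_mhz = 275.0;    /* per-core clock quoted in the slides     */
    const double pix_per_clock = 1.0;  /* one pixel per clock per pixel processor */

    for (int cores = 1; cores <= 4; cores++) {
        double mpix_per_s = clock_mhz * pix_per_clock * cores;
        printf("%d core(s): %4.0f Mpix/s peak\n", cores, mpix_per_s);
    }
    return 0;
}
```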