Mali-400 MP: A Scalable GPU for Mobile Devices

Tom Olson Director, Graphics Research, ARM Outline . ARM and Mobile Graphics

. Design Constraints for Mobile GPUs

. Mali Architecture Overview

. Multicore Scaling in Mali-400 MP

. Results

2 About ARM . World’s leading supplier of semiconductor IP . Architectures and Implementations . Related IP: buses, caches, debug & trace, physical IP . Software tools and infrastructure

. Business Model . License fees . Per-chip royalties

. Graphics at ARM . Acquired Falanx in 2006 . ARM Mali is now the world’s most widely licensed GPU family

3 Challenges for Mobile GPUs . Size

. Power

. Memory Bandwidth

4 More Challenges . Graphics is going into “anything that has a screen” . Mobile . Navigation . Set Top Box/DTV . Automotive . Video telephony . Cameras . Printers . Huge range of form factors, screen sizes, power budgets, and performance requirements . In some applications, a huge difference between peak and average performance requirements

5 Solution: Scalability . Address a wide variety of performance points and applications with a single IP and a single software stack.

. Need static scalability to adapt to different peak requirements in different platforms / markets . Need dynamic scalability to reduce power when peak performance isn’t needed

6 Options for Scalability

. Fine-grained: Multiple pipes, wide SIMD, etc . Proven approach, efficient and effective . But, adding pipes / lanes is invasive . Hard for IP licensees to do on their own . And, hard to partition to provide dynamic scalability

. Coarse-grained: Multicore . Easy for licensees to select desired performance . Putting cores on separate power islands allows dynamic scaling

7 Mali 400-MP Top Level Architecture

Asynch Mali-400 MP Top-Level APB

Geometry Processor Pixel Processor Pixel Processor Pixel Processor Processor #1 #2 #3 #4

CLKs MaliMMUs RESETs

IRQs

IDLEs MaliL2

AXI . Scalable pixel performance . 1-4 rasterizer cores . 32K-128K L2

8 Mali Tile-Based Rendering

1 1

0 00,1 00,1 0

0 00,1 00,1 0

0 00,1 00,1 0

. Reduces off-chip bandwidth . Rasterize into on-chip 16x16 tile buffer . Z, stencil, MSAA samples never go off-chip . Tradeoff against increased geometry bandwidth . Details: see Real-Time Rendering, 3rd ed.

9 Mali-400 MP Geometry Processor

System bus interface

Vertex Vertex Polygon List Vertex Storer Loader Builder Unit

Control Registers / Performance Counters

. Vertex Shader . Single-threaded, deeply pipelined

10 Mali-400 MP Pixel Processor

. Fragment Shader System bus interface . 128- barrel processor . Fully general control flow Polygon List Color / Depth / Reader . VLIW ISA, tuned for graphics Stencil Buffers Triangle setup . One texture sample per clock Fragment Shader Rasterization Blending Thread control . Key Features Control Registers / Performance Counters . 16x16 on-chip tile buffer . Renders one pixel per clock: 275M pix/sec @ 275 MHz . No penalty for 4x MSAA . No penalty for blending . 4x or 16x MSAA resolve on output . OpenGL ES 2.0 states handled without state dependent

11 Going Multi-Core . All cores work in parallel on separate tasks

. Each core processes one tile at a time until completion – no communication between cores

A s y n c h Mali-400 MP Top-Level A P B . Tiles assigned statically to cores F r a g m e n t F r a g m e n t F r a g m e n t F r a g m e n t V e r t e x P r o c e s s o r P r o c e s s o r P r o c e s s o r P r o c e s s o r P r o c e s s o r in a swizzled order # 1 # 2 # 3 # 4

C L K s M a l i M M U R E S E T s

I R Q s

I D L E s . Tile processing order maximizes M a l i L 2 L2 hit rate for polygon descriptors, A X I textures

12 Does it work? . Test content . Industry standard benchmarks for GL ES 1.1, 2.0 . Thanks Rightware!

13 Relative Performance

OpenGL ES 1.1 OpenGL ES 2.0 4.00 4.50 3.50 4.00 3.00 3.50 3.00 2.50 2.50 2.00 2.00 1.50 1.50 1.00 1.00

0.50 0.50 Normalized * Frame Normalized 0.00 Rate* Frame Normalized 0.00 MC1 MC2 MC3 MC4 MC1 MC2 MC3 MC4 . Single frames in RTL simulation . Artificial memory model (fixed static latency) . 1-4 cores, 32KB L2 cache . 1920x1080, 32bpp, 4xMSAA

*Single core frame rate = 1.0

14 Impact on Bandwidth

OpenGL ES 1.1

1.40

1.20

1.00

0.80

0.60

0.40

0.20

Normalized Bandwidth* Normalized 0.00 MC1 MC2 MC3 MC4

. Memory bandwidth per frame is stable . L2 cache is working to prevent redundant memory accesses

*Single core bandwidth per frame = 1.0

15 Summary . Scalability is an essential feature for mobile GPUs . Static and Dynamic

. Multi-core architectures are a good fit to the need . Easy to configure for different performance requirements . Easy to power-gate for dynamic scaling

. Mali-400 MP shows that the concept works . Pixel rate scalable from 275 Mpix/s to 1.1 Gpix/s . Linear speedup demonstrated on common mobile benchmarks . Single driver for all configurations – transparent to applications . Power and bandwidth efficient

16 Thanks to…

. The HPG 2010 Hot3D organizers . Remi Pedersen, Mashhuda Glencross, and the Mali team . Our Silicon Partners

17