Optimizing the Cocos2d-x Library: A DS-5 Streamline Case Study

彭晓波 / Bob Peng, Technical Marketing Manager, Strategic Software Alliances
November 2013

Agenda
• Streamline overview
• Getting started with Streamline
• Cocos2d-x case study

* Event-based sampling is available on kernels 3.0 or later

ARM DS-5™ Key Components

Compilation Tools
• ARM Compiler 5 – Bare-metal C/C++ and NEON vectorization
• Integrated GCC for ARM Linux

DS-5 Debugger
• Device bring-up and s/w development on single and multicore
• OS-aware debug on silicon, virtual platform and emulator

Streamline Analyzer
• CPU, GPU, interconnect performance and power analysis
• Time- and event-based profiling

DS-5 IDE
• Powerful editor based on industry-standard Eclipse CDT
• Hundreds of compatible plugins

Streamline Analyzer: Debug and optimize system performance and power

Advantages
• System-wide visibility into CPUs, GPUs, interconnect, power consumption and Linux/Android OS resources
• C/C++ source-code-level profiling based on time or PMU events
• Streaming data collection, allowing captures that run for hours
• Extensible data sources and customizable data visualization
• Trace hardware not required

Analysis Overview

Visualization of system performance, software profile and thread switching over time

Hierarchical profile table, aggregating samples per process, thread, and function call chain

Flat software profile table, listing shared libraries and function hotspots

Source- and instruction-level profile, with source code lines colour-coded to match samples

Dynamically created map of the functions in your application and their relationship

Dynamic analysis of the stack usage by your application

Chronological list of text and graphic annotations sent to gator

Timeline view: The Big Picture

Select from 40+ CPU counters, OS-level and custom metrics

Select one or more processes to visualize their instantaneous load on the CPU

Accumulate counters, measure time and find instant hotspots

Combined task switch trace and sampled profile for all threads

Performance Charts

• CPU-aware PMU registers
  • 40+ core-level metrics to choose from
• Mali graphics
  • 300+ hardware and software counters
• OS-level statistics
  • e.g. DVFS, interrupts, networking
• Custom counters
  • Easily add custom system counters
• Event-based sampling
  • Match PMU events to threads/source code

GPU Graphics Analysis

CPU, and GPU fragment and vertex processing activity

Hardware and software counters

Frame buffer filmstrip

Visualize application activity per processor or processor activity per application

SMP Analysis

Per core, per process activity

big.LITTLE™ Analysis
• Inspect tasks moving between clusters
• Cycle between aggregate, per cluster and per core
• Consistent colouring between threads and counter charts
• X-ray view
• Counters

Core / cluster colour key

X-ray mode augmented with intermediate cluster mode

Disclosure control

Cycle between combined values (right arrow), cluster values (as shown) and per-core values (down arrow)

Drilldown Software Profiling

Filter timeline data to generate focused software profile reports

Quickly identify instant hotspots

Click on the function name to go to source code level profile

Dynamic Call Graph Analysis
• Call Graph view maps relationships between functions
• Easy-to-navigate dynamic function-level map

Functions are colour coded according to CPU time or events

Easily navigate along call paths and identify caller/callee relationships

Function mapping can include system and uncalled functions

Power Measurement Interfaces

ARM Energy Probe
• 3-channel
• System-level analysis
• Easy to deploy
• Affordable

Good for trend spotting and application optimization

NI DAQ USB-62xx
• 40+ analog inputs
• Subcomponent sensitivity
• High fidelity
• Higher cost

Good for OS power management tuning and benchmarking

[Diagram labels: Data Acquisition, Streamline, Visual Analysis, Automated Tests]

Streamline Community vs. Basic/Pro

Feature                         Community                      Basic/Pro
Typical use case                Simple application profiling   System-wide, SMP analysis
Program images                  1                              Limited to host memory
Timeline View
  * Performance Charts          ✓                              ✓
  * Process Bars                ✓                              ✓
  * Mali GPU Analysis           ✓                              ✓
  * Quick Profile Summary       -                              ✓
  * Core Affinity Mode          -                              ✓
  * Energy Probe data capture   -                              ✓
  * Time Filtering              -                              ✓
  * Annotation                  -                              ✓
Call Paths View                 -                              ✓
Functions View                  ✓                              ✓
Code View                       -                              ✓
Call Graph                      -                              ✓
Stack View                      -                              ✓
Log View                        -                              ✓
Command Line                    -                              ✓
Event Based Sampling            -                              ✓

[Side panel: Which is the right Streamline for you? Audiences: BSP / distribution makers, OEMs / ODMs, application developers. Editions: Basic/Pro, CE.]


Target Device Setup

[Diagram labels: Target device; User space: applications and middleware, OpenGL ES, gator daemon; Linux kernel: Mali drivers, gator driver; ARM processor; TCP/IP]

• IP-based connection to target
• No ICE/trace units required
• Open source kernel module and daemon
• Support for Linux kernel 2.6.32+
• Kernel configuration
  • PROFILING + PERF_EVENTS
  • FTRACE + ENABLE_DEFAULT_TRACERS
  • HIGH_RES_TIMERS + HW_PERF_EVENTS
  • LOCAL_TIMERS, if SMP
• Reference blog:
  • "Setting up an Android phone for performance analysis with ARM Streamline, part 1" (设置Android手机以使用ARM Streamline进行性能分析一)
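The kernel configuration requirements above can be checked mechanically. The following is a minimal sketch, assuming the options appear in a kernel `.config` file with the usual `CONFIG_` prefix and that SMP is indicated by `CONFIG_SMP`. The function names and the sample text are hypothetical, and how you obtain the config text from your target is left open.

```python
# Options named on the slide, with the CONFIG_ prefix used in kernel .config files.
REQUIRED = [
    "CONFIG_PROFILING",
    "CONFIG_PERF_EVENTS",
    "CONFIG_FTRACE",
    "CONFIG_ENABLE_DEFAULT_TRACERS",
    "CONFIG_HIGH_RES_TIMERS",
    "CONFIG_HW_PERF_EVENTS",
]
# Needed only on SMP kernels, per the slide.
SMP_ONLY = ["CONFIG_LOCAL_TIMERS"]


def enabled_options(config_text):
    """Return the set of options set to 'y' in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options look like '# CONFIG_X is not set' and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


def missing_options(config_text):
    """List the required options that are not enabled, in slide order."""
    enabled = enabled_options(config_text)
    wanted = list(REQUIRED)
    if "CONFIG_SMP" in enabled:
        wanted += SMP_ONLY
    return [name for name in wanted if name not in enabled]


# Hypothetical config excerpt for illustration only.
SAMPLE_CONFIG = """\
CONFIG_SMP=y
CONFIG_PROFILING=y
CONFIG_PERF_EVENTS=y
# CONFIG_FTRACE is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_HW_PERF_EVENTS=y
"""

print(missing_options(SAMPLE_CONFIG))
```

For the sample text, the sketch reports the two tracing options and, because the kernel is SMP, `CONFIG_LOCAL_TIMERS` as missing.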

Some Streamline-enabled Targets…

White-box tablet (dual-core Cortex®-A9 + quad-core Mali-400)
• Purchase link: http://detail.tmall.com/item.htm?id=22414055832&
• Gator starts automatically on power-up

HDMI dongle (Cortex-A8 + Mali-400)
• Purchase link: http://www.aliexpress.com/store/product/New-arrival-Rikomagic-MK802-II-Mini-Android-4-0-PC-Android-TV-Box-A10-Cortex-A8/810525_651058884.html
• Tutorial book under \ARM-DS-5
• Blog: "How to use an Allwinner Android 4.0 HDMI dongle for ARM DS-5 Streamline performance analysis" (如何利用全志安卓4.0 HDMI Dongle 进行ARM DS-5 Streamline性能分析)

Pictured devices: Hardkernel Odroid, Pipo Smart-S1 Pro, Arndale board, Rikomagic MK802 II, BlueTechnix SoM

Streamline data view

[Screenshot callouts: Show help, View style, Delete, Change capture options, Start capture, Counter configuration, Streamline capture data, Streamline analysis report]

Setting Capture Options

• Target address: "localhost" or "127.0.0.1"
• Sample Rate: Normal = 1 kHz, Low = 100 Hz, or None
• Buffer Mode: Large = 16 MB; Medium = 4 MB; Small = 1 MB
• Capture Duration: format is minute:second (e.g. 1:05); leave empty to stop the capture manually
• Call Stack Unwinding: whether Streamline records call stacks
• Process Debug Information: whether Streamline processes DWARF debug information and line numbers
• High Resolution Timeline: Streamline processes more data, enabling you to zoom in three more levels in the Timeline view
• Save the capture options, or import previously saved ones
• Add ELF image, or add ELF image from workspace
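To make the Capture Duration format and the sample rates above concrete, here is a small sketch that converts a minute:second string into seconds and multiplies it by the nominal sample rate. The function names are hypothetical, the rejection of seconds values of 60 or more is an assumption rather than documented Streamline behaviour, and the tick count is only a nominal figure, not what a real capture will contain.

```python
# Nominal sample rates from the slide, in Hz.
SAMPLE_RATES_HZ = {"Normal": 1000, "Low": 100, "None": 0}


def parse_duration(text):
    """Convert 'minute:second' to seconds; an empty field means manual stop (None)."""
    text = text.strip()
    if not text:
        return None
    minutes_str, seconds_str = text.split(":")
    minutes, seconds = int(minutes_str), int(seconds_str)
    # Assumption: seconds must be in 0..59, as in ordinary clock notation.
    if minutes < 0 or not 0 <= seconds < 60:
        raise ValueError(f"invalid duration: {text!r}")
    return minutes * 60 + seconds


def nominal_ticks(duration_text, rate_name):
    """Nominal number of sampling ticks for a timed capture, or None if untimed."""
    seconds = parse_duration(duration_text)
    if seconds is None:
        return None
    return seconds * SAMPLE_RATES_HZ[rate_name]


print(parse_duration("1:05"))           # 65 seconds
print(nominal_ticks("1:05", "Normal"))  # 65 s * 1000 Hz
print(nominal_ticks("1:05", "Low"))     # 65 s * 100 Hz
```

The tenfold gap between Normal and Low is the trade-off to keep in mind: a lower rate means less data and less intrusion, at the cost of coarser profiles.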

Configure counters

Events to be Collected: each event listed here is available for display in the Timeline view

Toolbar: Delete, Import, Export

Available Events list:
• CPU events
• Linux events
• Mali GPU events (VP/FP)
• Energy Probe events

Start Capture

Capturing…

Stop and generate analysis report


Performance Bounds

• CPU bound
• GPU bound
  • Vertex
  • Fragment
• Bandwidth bound: limited bandwidth to external memory

[Diagram labels: CPU and GPU (each with a cache), external memory, frame buffer]

CPU Optimization

• Draw calls: as few as possible
• OpenCL
  • Offload some of the work to the GPU
  • Mali-T604 supports OpenCL Full Profile
• NEON optimization
  • NEON in open source
  • projectNe10.org
    • Math – vector/matrix
    • DSP – FFT/IFFT/FIR/IIR
    • Imgproc – image resize/rotate
• ARMv8 (64-bit)
• OpenCL
• Physics engine
• Your input …
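As an illustration of the first point above, keeping draw calls low, the sketch below counts draw calls under a simplified model in which consecutive sprites sharing a texture are submitted in one call. The model, the sprite data and the function names are assumptions made for illustration, and this is not the Cocos2d-x renderer.

```python
from itertools import groupby


def count_draw_calls(sprites):
    """Count draw calls when consecutive sprites with the same texture are batched."""
    return sum(1 for _ in groupby(sprites, key=lambda s: s["texture"]))


def batch_by_texture(sprites):
    """Reorder sprites so those sharing a texture are adjacent (stable sort)."""
    return sorted(sprites, key=lambda s: s["texture"])


# Hypothetical scene: two textures, interleaved in draw order.
sprites = [
    {"name": "fish1", "texture": "fish.png"},
    {"name": "coin1", "texture": "ui.png"},
    {"name": "fish2", "texture": "fish.png"},
    {"name": "coin2", "texture": "ui.png"},
]

print(count_draw_calls(sprites))                    # interleaved order
print(count_draw_calls(batch_by_texture(sprites)))  # grouped by texture
```

In this model the interleaved order costs four calls and the grouped order two. Reordering changes which sprite is drawn on top, so in practice grouping is only safe among sprites whose relative order does not matter.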

NEON™ in Open Source Today

• Google WebM – 11,000 lines of NEON assembler!
• Bluez – official Linux Bluetooth protocol stack
• Pixman (part of cairo 2D graphics library)
• ffmpeg (libav) – libavcodec
  • LGPL media player used in many Linux distros and products
  • Extensive NEON optimizations
• x264 – Google Summer Of Code 2009
  • GPL H.264 encoder – e.g. for video conferencing
• Android – NEON optimizations
  • Skia library, S32A_D565_Opaque 5x faster using NEON
  • Available in Google Skia tree from 03-Aug-2009
• LLVM – code generation backend used by Android RenderScript
• Eigen2 – C++ vector math / linear algebra template library
• TheorARM – libtheora NEON version (optimized by Google)
• libjpeg / libjpeg-turbo – optimized JPEG decode
• libpng – optimized PNG decode
• FFTW – NEON-enabled FFT library
• Liboil / liborc – runtime compiler for SIMD processing
• webkit – used by Chrome Browser

Vertex Optimization

• Use VBOs (vertex buffer objects)
  • Cache vertex data in GPU memory, so there is no need to copy it from the CPU every frame
• Use culling
  • Backface culling
  • View frustum culling
  • Occlusion culling
• Use LOD (levels of detail)
• Remove unnecessary vertices
  • It's mobile, not PC!
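Of the culling techniques listed above, backface culling is the simplest to show. The sketch below tests whether a triangle faces away from the camera, assuming counter-clockwise winding for front faces. The helper names are hypothetical, and on a real device the GPU performs this test itself once face culling is enabled, so this is purely illustrative.

```python
def sub(a, b):
    """Component-wise a - b for 3D vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def is_back_facing(v0, v1, v2, camera):
    """True if a counter-clockwise triangle faces away from the camera position."""
    normal = cross(sub(v1, v0), sub(v2, v0))
    to_camera = sub(camera, v0)
    # Edge-on triangles (dot == 0) contribute no visible area, so cull them too.
    return dot(normal, to_camera) <= 0


# Triangle in the XY plane, counter-clockwise when seen from +Z.
tri = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
print(is_back_facing(*tri, camera=(0, 0, 5)))   # camera in front of the triangle
print(is_back_facing(*tri, camera=(0, 0, -5)))  # camera behind the triangle
```

Seen from +Z the triangle is kept, and seen from -Z it is culled. For closed opaque meshes, back faces are never visible, so discarding them saves the fragment work they would otherwise generate.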

Fragment Optimization

• Reduce overdraw
  • Front to back: yes
  • Back to front: no
• Limit the amount of transparency in the scene
• Use ETC textures
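The front-to-back advice above can be illustrated by counting how many fragments get shaded at a single pixel. This sketch assumes opaque surfaces and a depth test that rejects hidden fragments before they are shaded. The function name and depth values are hypothetical, and real GPUs differ in how and when they reject fragments.

```python
def shaded_fragments(depths_in_draw_order):
    """Count fragments shaded at one pixel, given surface depths in draw order.

    Smaller depth means closer to the camera. A fragment is shaded only if
    it is closer than everything already drawn at this pixel.
    """
    closest = float("inf")
    shaded = 0
    for depth in depths_in_draw_order:
        if depth < closest:
            closest = depth
            shaded += 1
    return shaded


# Four opaque layers covering the same pixel (hypothetical depths).
depths = [3.0, 1.0, 4.0, 2.0]

front_to_back = shaded_fragments(sorted(depths))
back_to_front = shaded_fragments(sorted(depths, reverse=True))
print(front_to_back, back_to_front)
```

Drawn front to back, only the nearest layer is shaded. Drawn back to front, all four are shaded and three of those results are overwritten. Transparent surfaces still need back-to-front order to blend correctly, which is one reason to limit them.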

Bandwidth Optimization

• Bandwidth is a scarce resource
  • A typical embedded device can handle ≈ 5.0 gigabytes per second of bandwidth
  • A typical desktop GPU can do in excess of 100 gigabytes per second
• Use texture compression
  • The most popular format is ETC (Ericsson Texture Compression)
  • This can reduce a 32 bits-per-pixel texture to 4 bits per pixel
  • Mali Texture Compression Tool
• Use 16-bit textures instead of 32-bit
  • You often won't notice the difference
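The savings above are easy to put into numbers. This sketch computes the memory footprint of one texture at 32, 16 and 4 bits per pixel, plus a rough read bandwidth figure. The 1024×1024 size, the 60 frames per second and the assumption that the whole texture is read exactly once per frame are illustrative choices that ignore caching and mipmaps.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Size of an uncompressed or fixed-rate compressed texture in bytes."""
    return width * height * bits_per_pixel // 8


def read_bandwidth(width, height, bits_per_pixel, fps=60):
    """Bytes per second if the whole texture is read once per frame (assumption)."""
    return texture_bytes(width, height, bits_per_pixel) * fps


for label, bpp in (("32-bit RGBA", 32), ("16-bit", 16), ("ETC, 4 bpp", 4)):
    size = texture_bytes(1024, 1024, bpp)
    rate = read_bandwidth(1024, 1024, bpp)
    print(f"{label}: {size} bytes per texture, {rate} bytes/s at 60 fps")
```

Under these assumptions a single 32-bit 1024×1024 texture is 4 MiB and costs about 0.25 GB/s, roughly 5% of the ≈ 5.0 GB/s budget quoted above. The 16-bit version halves that, and the 4 bits-per-pixel version is eight times smaller than the original.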

Cocos2d-x Project: Introduction

• What's Cocos2d-x?
  • Cross-platform, open source (MIT) 2D game engine
  • Used by 25% of mobile games worldwide
  • 1.5+ billion downloads of cocos2d-based games
  • Supports C++, JavaScript and Lua
• Profiling SW
  • Cocos2d-x benchmark
  • Games rebuilt with symbol files (FishJoy, 忘仙)
• Profiling HW
  • Entry-level smartphone
  • Cortex-A5 + Mali-300
  • Android version: ICS

Profiling story 1: NodeChildren iterate test

Profiling story 2: Performance test, Sprite A

Profiling story 3: FishJoy2 (start game)

Profiling story 4: FishJoy2 (quick clicks to play the game)

Reference

• Blog post
  • @cocos2d-x.org: http://www.cocos2d-x.org/news/137
• Current status
  • Key Chinese mobile internet companies are now using Streamline themselves
    • Alibaba Inc.
    • Tencent Inc.
    • UCWeb Inc.
    • Cocos2d-x
    • Sohu Game