Performance Analysis with ARM Streamline, Part 1

Optimizing the Cocos2D-X Library: A DS-5 Streamline Case Study
彭晓波 / Bob Peng, Technical Marketing Manager, Strategic Software Alliances
November 2013

Agenda
• Streamline overview
• Getting started with Streamline
• Cocos2d-x case study
* Event-based sampling is available on kernels 3.0 or later

ARM DS-5™ Key Components
• Compilation Tools
  - ARM Compiler 5: bare-metal C/C++ and NEON vectorization
  - Integrated Linaro GCC for ARM Linux
• DS-5 Debugger
  - Device bring-up and software development on single and multicore
  - OS-aware debug on silicon, virtual platform and emulator
• Streamline Analyzer
  - CPU, GPU, interconnect performance and power analysis
  - Time- and event-based profiling
• DS-5 IDE
  - Powerful editor based on industry-standard Eclipse CDT
  - Hundreds of compatible plugins

Streamline Analyzer
Debug and optimize system performance and power
Advantages
• System-wide visibility into CPUs, GPUs, interconnect, power consumption and Linux/Android OS resources
• C/C++ source code level profiling based on time or PMU events
• Streaming data collection, allowing captures that run for hours
• Extensible data sources and customizable data visualization
• Trace hardware not required

Analysis Overview
• Visualization of system performance, software profile and thread switching over time
• Hierarchical profile table, aggregating samples per process, thread, and function call chain
• Flat software profile table, listing shared libraries and function hotspots
• Source and instruction level profile, with colour-coded source code lines matching samples
• Dynamically created map of the functions in your application and their relationship
• Dynamic analysis of the stack usage by your application
• Chronologic list of text and graphic annotations sent to gator

Timeline view: The Big Picture
• Select from 40+ CPU counters, OS level and custom metrics
• Select one or more processes to visualize their instant load on CPU
• Accumulate counters, measure time and find instant hotspots
• Combined task switch trace and sampled profile for all threads

Performance Charts
• CPU aware PMU registers
  - 40+ core-level metrics to choose from
• Mali graphics
  - 300+ hardware and software counters
• OS level statistics
  - e.g. DVFS, interrupts, networking
• Custom counters
  - Easily add custom system counters
• Event-based sampling
  - Match PMU events to threads/source code

GPU Graphics Analysis
• CPU, and GPU fragment and vertex processing activity
• Hardware and Software counters
• Frame buffer filmstrip
• Visualize application activity per processor or processor activity per application

SMP Analysis
• Per core, per process activity

big.LITTLE™ Analysis
• Inspect tasks moving between clusters
• Cycle between aggregate, per cluster and per core
• Consistent colouring between threads and counter charts
• X-ray view
• Counters
Screenshot callouts:
  - Core / cluster colour key
  - X-ray mode augmented with intermediate cluster mode
  - Disclosure control: cycle between combined values (right arrow), cluster values (as shown), per core (down arrow)

Drilldown Software Profiling
• Filter timeline data to generate focused software profile reports
• Quickly identify instant hotspots
• Click on the function name to go to source code level profile

Dynamic Call Graph Analysis
• Call Graph view maps relationships between functions
• Easy-to-navigate dynamic function-level map
• Functions are colour coded according to CPU time or events
• Easily navigate along call paths and identify caller/callee relationships
• Function mapping can include system and uncalled functions

Power Measurement Interfaces
(Diagram labels: Data Acquisition, Data, Streamline, Visual Analysis, Automated Tests)
• ARM Energy Probe
  - 3-channel
  - System-level analysis
  - Easy to deploy
  - Affordable
  - Good for trend spotting and application optimization
• NI DAQ USB-62xx
  - 40+ analog inputs
  - Subcomponent sensitivity
  - High fidelity
  - Higher cost
  - Good for OS power management tuning and benchmarking

Streamline Community vs. Basic/Pro
• Which is the right Streamline for you?
  - BSP / distribution makers, OEMs / ODMs: Basic/Pro Editions
  - Application developers: CE
• Typical use case
  - Community: simple application profiling
  - Basic/Pro: system-wide, SMP analysis
• Program images
  - Community: 1
  - Basic/Pro: limited to host memory
• Features compared
  - Timeline View: Performance Charts, Process Bars, Mali GPU Analysis, Quick Profile Summary, Core Affinity Mode, Energy Probe data capture, Time Filtering, Annotation
  - Call Paths View
  - Functions View
  - Code View
  - Call Graph
  - Stack View
  - Log View
  - Command Line
  - Event Based Sampling

Agenda
• Streamline overview
• Getting started with Streamline
• Cocos2d-x case study

Target Device Setup
(Diagram labels: Target Device, User Space, Applications & Middleware, OpenGL ES, gator Daemon, Mali Drivers, gator Driver, Linux Kernel, ARM Processor, TCP/IP)
• IP-based connection to target
  - No ICE/trace units required
• Open source kernel module and daemon
  - Support for Linux kernel 2.6.32+
• Kernel configuration
  - PROFILING + PERF_EVENTS
  - FTRACE + ENABLE_DEFAULT_TRACERS
  - HIGH_RES_TIMERS + HW_PERF_EVENTS
  - LOCAL_TIMERS, if SMP
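The kernel configuration list above can be checked mechanically against a kernel `.config` file. The following is a minimal sketch, not part of the original deck: it assumes the slide's option names carry the usual `CONFIG_` prefix and that `CONFIG_SMP=y` marks an SMP kernel. The helper names `enabled_options` and `missing_options` are hypothetical.

```python
import re

# Options named on the slide; the CONFIG_ prefix is an assumption.
REQUIRED = [
    "CONFIG_PROFILING",
    "CONFIG_PERF_EVENTS",
    "CONFIG_FTRACE",
    "CONFIG_ENABLE_DEFAULT_TRACERS",
    "CONFIG_HIGH_RES_TIMERS",
    "CONFIG_HW_PERF_EVENTS",
]
# Only needed on SMP kernels, per the slide ("LOCAL_TIMERS, if SMP").
SMP_ONLY = ["CONFIG_LOCAL_TIMERS"]


def enabled_options(config_text):
    """Return the set of options set to y or m in kernel .config text."""
    pattern = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(y|m)\s*$", re.MULTILINE)
    return {match.group(1) for match in pattern.finditer(config_text)}


def missing_options(config_text):
    """List the slide's options that are not enabled, in slide order."""
    enabled = enabled_options(config_text)
    wanted = list(REQUIRED)
    if "CONFIG_SMP" in enabled:
        wanted += SMP_ONLY
    return [option for option in wanted if option not in enabled]


if __name__ == "__main__":
    # Hypothetical .config excerpt, for illustration only.
    sample = (
        "CONFIG_SMP=y\n"
        "CONFIG_PROFILING=y\n"
        "CONFIG_PERF_EVENTS=y\n"
        "# CONFIG_FTRACE is not set\n"
        "CONFIG_HIGH_RES_TIMERS=y\n"
        "CONFIG_HW_PERF_EVENTS=y\n"
    )
    for option in missing_options(sample):
        print("missing:", option)
```

On a real target the text would come from the kernel build tree's `.config`, or from `/proc/config.gz` where the kernel exposes it.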
• Reference blog: "Setting Up an Android Phone for Performance Analysis with ARM Streamline, Part 1" (设置Android手机以使用ARM Streamline进行性能分析一)

Some Streamline-enabled Targets…
• White-box tablet (dual-core Cortex®-A9 + quad-core Mali-400)
  - Purchase link: http://detail.tmall.com/item.htm?id=22414055832&
  - Gator starts automatically at power-up
• HDMI dongle (Cortex-A8 + Mali-400)
  - Purchase link: http://www.aliexpress.com/store/product/NewRikomagic-arrival-Rikomagic MK802-MK802 II -II-Mini-Android-4-0-PC-Android-TV-Box-A10-Cortex-A8/810525_651058884.html
  - Tutorial book under \ARM-DS-5
  - Blog: "How to Use an Allwinner Android 4.0 HDMI Dongle for ARM DS-5 Streamline Performance Analysis" (如何利用全志安卓4.0 HDMI Dongle 进行ARM DS-5 Streamline性能分析)
• Other targets pictured: Hardkernel Odroid, Pipo Smart-S1 Pro, Arndale board, BlueTechnix SoM

Streamline Data View
(Screenshot callouts: show help, view style, delete, change capture options, start capture, counter configuration, Streamline capture data, Streamline analysis report)

Setting Capture Options
• Target address: "localhost" or "127.0.0.1"
• Sample Rate: Normal = 1 kHz, Low = 100 Hz, or None
• Buffer Mode: Large = 16 MB, Medium = 4 MB, Small = 1 MB
• Capture Duration: format is minutes:seconds (e.g. 1:05); leave empty to stop the capture manually
• Call Stack Unwinding: whether Streamline records call stacks
• Process Debug Information: whether Streamline processes DWARF debug information and line numbers
• High Resolution Timeline: Streamline processes more data, enabling you to zoom in three more levels in the Timeline view
• Save the capture options, or import previously saved ones
• Add ELF image
• Add ELF image from workspace

Configure Counters
• Events to be Collected: each event listed here is available for display in the Timeline view
  - Buttons: Delete, Import, Export
• Available Events list
  - CPU events
  - Linux events
  - Mali GPU events (VP/FP)
  - Energy Probe events

Start Capture
• Capturing…
• Stop and generate the analysis report

Agenda
• Streamline overview
• Getting started with Streamline
• Cocos2d-x case study

Performance Bounds
(Diagram labels: CPU, Cache, GPU, Cache, External Memory, Frame buffer)
• CPU bound
• GPU bound
  - Vertex
  - Fragment
• Bandwidth bound: limited bandwidth to external memory

CPU Optimization
• Draw calls: as few as possible
• OpenCL
  - Offload some of the work to the GPU
  - Mali-T604 supports OpenCL Full Profile
• NEON optimization
  - NEON in open source
  - projectNe10.org
    Math: vector/matrix
    DSP: FFT/IFFT/FIR/IIR
    Imgproc: image resize/rotate
• ARMv8 (64-bit)
• OpenCL
• Physics engine
• Your input …

NEON™ in Open Source Today
• Google WebM: 11,000 lines of NEON assembler!
• BlueZ: official Linux Bluetooth protocol stack
• Pixman (part of the cairo 2D graphics library)
• ffmpeg (libav): libavcodec
  - LGPL media player used in many Linux distros and products
  - Extensive NEON optimizations
• x264: Google Summer of Code 2009
  - GPL H.264 encoder, e.g. for video conferencing
• Android: NEON optimizations
  - Skia library, S32A_D565_Opaque 5x faster using NEON
  - Available in the Google Skia tree from 03-Aug-2009
• LLVM: code generation backend used by Android RenderScript
• Eigen2: C++ vector math / linear algebra template library
• TheorARM: libtheora NEON version (optimized by Google)
• libjpeg / libjpeg-turbo: optimized JPEG decode
• libpng: optimized PNG decode
• FFTW: NEON-enabled FFT library
• Liboil / liborc: runtime compiler for SIMD processing
• webkit: used by the Chrome browser

Vertex Optimization
• Use VBOs (vertex buffer objects)
  - Cache vertex data in GPU memory; no need to copy it from the CPU every frame
• Use culling
  - Backface culling
  - View frustum culling
  - Occlusion culling
• Use LOD (levels of detail)
  - Remove unnecessary vertices
  - It's mobile, not PC!

Fragment Optimization
• Reduce overdraw
  - Front to back: yes
  - Back to front: no
  - Limit the amount of transparency in the scene
• Use ETC textures

Bandwidth Optimization
• Bandwidth is a scarce resource
  - A typical embedded device can handle ≈ 5.0 gigabytes per second of bandwidth
  - A typical desktop GPU can do in excess of 100 gigabytes per second
• Use texture compression
  - The most popular format is ETC (Ericsson Texture Compression)
  - This can reduce a 32 bits-per-pixel texture to 4 bits per pixel
  - Mali Texture Compression Tool
• Use 16-bit textures instead of 32-bit
  - You won't often notice the difference

Cocos2d-x Project: Introduction
• What is Cocos2d-x?
  - Cross-platform, open source (MIT) 2D game engine
  - Used by 25% of worldwide mobile games
  - 1.5+ billion downloads of cocos2d-based games
  - Supports C++, JavaScript and Lua
• Profiling software
  - Cocos2d-x benchmark
  - Games rebuilt with symbol files (FishJoy, 忘仙)
• Profiling hardware
  - Entry-level smartphone
  - Cortex-A5 + Mali-300
  - Android version: ICS

Profiling Story 1: NodeChildren iterate test

Profiling Story 2: Performance test, Sprite A

Profiling Story 3: FishJoy2 (start game)

Profiling Story 4: FishJoy2 (quick clicks to play the game)

Reference
• Blog post @cocos2d-x.org: http://www.cocos2d-x.org/news/137
• Current status
  - Key Chinese mobile internet companies have started using Streamline themselves
  - Alibaba Inc.
  - Tencent Inc.
  - UCWeb Inc.
  - Cocos2d-x
  - Sohu Game
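The figures on the Bandwidth Optimization slide (32, 16 and 4 bits per pixel against a budget of roughly 5 GB/s) can be turned into a quick back-of-the-envelope estimate. The sketch below is illustrative only. The 1024x1024 texture, the 60 fps frame rate and the one-read-per-frame assumption are hypothetical, and real traffic depends on GPU caching, mipmaps and overdraw.

```python
# "≈ 5.0 Gigabytes a second" from the slide, taken as 5.0e9 bytes/s.
EMBEDDED_BUDGET_BYTES_PER_S = 5.0e9


def texture_bytes(width, height, bits_per_pixel):
    """Bytes occupied by a single mip level of a width x height texture."""
    return width * height * bits_per_pixel // 8


def budget_fraction(bytes_per_frame, fps, budget=EMBEDDED_BUDGET_BYTES_PER_S):
    """Fraction of the bandwidth budget used if bytes_per_frame is read every frame."""
    return bytes_per_frame * fps / budget


if __name__ == "__main__":
    # Hypothetical workload: one 1024x1024 texture read once per frame at 60 fps.
    for label, bpp in [("32-bit RGBA", 32), ("16-bit", 16), ("ETC, 4 bpp", 4)]:
        size = texture_bytes(1024, 1024, bpp)
        share = budget_fraction(size, fps=60)
        print(f"{label:12s} {size / 1024:8.0f} KiB  {share:.2%} of budget")
```

Under these assumptions the 4 bpp texture is one eighth the size of the 32 bpp one, which is the whole point of the slide: the saving applies every time the texture is fetched from external memory.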
