CycleCycle Accurate Accurate Profiling Profiling With With Perf Perf Paweł Moll
1 The plan
2 The plan
I Hardware I Linux perf I Let’s hack!
3 Hardware
4 ARM CoreSight
System Interconnect (AXI) Cross Trigger Matrix (CTM)
CTI CTI CTI
DAP AXI-AP STM TPIU
JTAG- Funnel SWJ-DP AP Cortex ETM Processor ETB APB-AP
Debug APB Interconnect (APBIC)
I Source I Bus I Sink
5 Processor trace
I Embedded Trace Macrocell I Instructions I Data I Program Trace Macrocell I Only branches I Bandwidth I From 10Mbps to many Gbps per core I Non- (or low-) intrusive debug
6 Sinks
I Embedded Trace Buffer I Dedicated, small SRAM I Flight recorder use case I Trace Port I External analyser I Large buffer I External (high speed) pins I Embedded Trace Router I Sinks data into main interconnect I Usually uses system DRAM I Consumes memory system bandwidth
7 ETM Protocol
I Highly compressed data I Generated against program memory I Based on E/N atoms I eg. b1NEEEE00 up to 16+1 instructions I Branches I Only if not evident (eg. eg. B
8 Issues
I Requires memory contents for decompression I Multitasking OS I JIT engines I Self-modifying code (kernel runtime patching, kprobes, dynamic trace events) I Parallel and out-of-order excecution
9 Additional features
I Filtering I Address I CONTEXTID I VMID I Trigerring I Address I DBG
10 Linux perf
11 Linux perf framework
I PMU drivers I Many use cases, eg. statistics I Sampling profiler I Periodic PC (IP) sampling I Timer or PMU counter overflow interrupt I Typical sampling rate 1kHz (every 1ms)
12 Sampling profile
I Statistical approximation of a process I Think analog/digital converter I Shannon’s theorem: “If a function x(t) contains no frequencies higher than B cps, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.”
13 CoreSight Linux framework
I Developed by Mathieu Poirier at Linaro I Based on 2012 code from Code Aurora I At v7 stage now (http://lwn.net/Articles/614232/) I Control trace components via sysfs I Enable sink, enable source, dump buffer contents I Separate decoders I Full series at http://git.linaro.org/kernel/coresight.git
14 Intel PT
I “an exciting new feature coming in future processors” (2013) I Integrating with perf I Auxiliary buffers I Decoder integrated with user space tool I Kernel portions at v4 stage now (http://lwn.net/Articles/609010/) I Full series at https://github.com/virtuoso/linux-perf/tree/intel_pt
15 Let’s hack!
16 Particularly pathologic example
I Rotate JPEG file
/ # time taskset 4 ./gm convert -rotate 90 in.jpg out.jpg real 0m 0.01s user 0m 0.01s sys 0m 0.00s
17 Can it go faster? / # time perf record -F 1000 -e cpu-clock \ taskset 4 ./gm convert -rotate 90 in.jpg out.jpg [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.001 MB perf.data (~50 samples) ] real 0m 0.27s user 0m 0.14s sys 0m 0.13s
I It was: real 0m 0.01s user 0m 0.01s sys 0m 0.00s
18 Report # Samples: 17 of event 'cpu-clock' # Event count (approx.): 17000000 # # Overhead Command Shared Object Symbol # ...... # 17.65% gm [kernel.kallsyms] [k] filemap_map_pages 5.88% taskset [kernel.kallsyms] [k] filemap_map_pages 5.88% gm gm [.] LocaleCompare 5.88% gm gm [.] forward_DCT_float 5.88% gm gm [.] encode_mcu_huff 5.88% gm gm [.] ycc_rgb_convert 5.88% gm gm [.] decode_mcu_DC_first 5.88% gm gm [.] decode_mcu_AC_refine 19 Report, cont.
5.88% gm gm [.] jpeg_fdct_16x16 5.88% gm gm [.] _IO_link_in 5.88% gm gm [.] malloc 5.88% gm gm [.] strncpy 5.88% gm [kernel.kallsyms] [k] lock_acquire 5.88% gm [kernel.kallsyms] [k] lock_release 5.88% gm [kernel.kallsyms] [k] unmap_single_vma
20 Let’s have a closer look…
I Cortex-A7 ETM 3.5 I Instructions only I Cycle accurate I Captured with DStream & ARM DS-5 I 10MB of binary trace data
21 What can we see there?
I Decoded into text format I 890MB file I What to look for:
ELF Header: Entry point address: 0x96dd
8360: 000096dd 0 FUNC GLOBAL DEFAULT 6 _start
8845: 0012ac44 0 FUNC GLOBAL DEFAULT 9 _fini
22 Here we go
2859224 S:0x8000E4CC E28DD00C 0 ADD sp,sp,#0xc ret_to_user_from_irq 2859225 S:0x8000E4D0 E1B0F00E 1 MOVS pc,lr ret_to_user_from_irq Return from exception Timestamp: 1878241989137 S:0x000096DC F04F0B00 29 MOV r11,#0
23 Here we go again 2878794 S:0x8000E4D0 E1B0F00E 1 MOVS pc,lr ret_to_user_from_irq Return from exception Timestamp: 1878241990817 2878795 S:0x000096DC F04F0B00 175 MOV r11,#0
24 Going down
9747820 S:0x0012AC44 E91D4008 153 PUSH r3,lr
25 The end of the process
9747990 S:0x000FD536 4C12 2 LDR r4,[pc,#72] ; [0xFD580] = 0x64C55B39 _Exit 9747991 S:0x000FD538 E004 0 B _Exit+24 ; 0xFD544 _Exit 9747992 S:0x000FD544 4618 1 MOV r0,r3 _Exit 9747993 S:0x000FD546 F04F0CF8 1 MOV r12,#0xf8 _Exit 9747994 S:0x000FD54A F7E7F9C9 0 BL __libc_do_syscall ; 0xE48E0 _Exit 9747995 S:0x000E48E0 B580 1 PUSH r7,lr __libc_do_syscall 9747996 S:0x000E48E2 4667 2 MOV r7,r12 __libc_do_syscall 9747997 S:0x000E48E4 DF00 1 SVC #0x0 __libc_do_syscall Exception: SUPERVISOR_CALL (10)
I #0xf8 is __NR_exit_group
26 Focus on the task
I It’s all interesting… I …but the goal is to see a cycle accurate profile of the process I Limit data scope I Filter out kernel addresses (or cheat using entry/exit points) I Filter out other contexts (or cheat by protecting CPU) I Collate migrated data (or cheat by setting affinity) I Generate memory map with DSOs (or cheat by linking statically) I Convert into perf data stream
27 perf.data
I Starts with header I Description of all events I Followed by data records … I selection of samples I by default: PERF_SAMPLE_IP, PERF_SAMPLE_TID, PERF_SAMPLE_TIME, PERF_SAMPLE_PERIOD I generated on every timer/counter interrupt I …interleaved with system information I eg. PERF_RECORD_MMAP, PERF_RECORD_COMM, PERF_RECORD_EXIT, PERF_RECORD_FORK
28 perf.data, cont. $ perf report -D [...] 0x1d0 [0x28]: event: 9 . . ... raw event: size 40 bytes . 0000: 09 00 00 00 01 00 28 00 54 fd 05 80 00 00 00 00 ...... (.T...... 0010: 3b 08 00 00 3b 08 00 00 46 fb 39 91 dd 46 00 00 ;...;...F.9..F.. . 0020: 40 42 0f 00 00 00 00 00 @B...... 77917438212934 0x1d0 [0x28]: PERF_RECORD_SAMPLE(IP, 1): \ 2107/2107: 0x8005fd54 period: 1000000 addr: 0 ... thread: gm:2107 ...... dso: [kernel.kallsyms]
29 The Hack
I Replace data records with trace-based ones I reducing number of samples, from 40 to 24 bytes per record (attribute modification needed) I generating multiple samples, one per cycle used (175 samples if instruction took 175 cycles to execute) I 192MB big perf.data
30 The Hack, cont. 0x1f0 [0x18]: event: 9 . ... raw event: size 24 bytes . 0000: 09 00 00 00 02 00 18 00 dc 96 00 00 00 00 00 00 ...... 0010: 3b 08 00 00 3b 08 00 00 ;...;... 0x1f0 [0x18]: PERF_RECORD_SAMPLE(IP, 2): \ 2107/2107: 0x96dc period: 1 addr: 0 ... thread: :2107:2107 ...... dso:
[...]
36 Result, cont 0.00% gm gm [.] ___fini_from_thumb 0.00% gm gm [.] jpeg_destroy_compress 0.00% gm gm [.] __feupdateenv 0.00% gm gm [.] IdentityAffine 0.00% gm gm [.] __dcgettext 0.00% gm gm [.] _setjmp 0.00% gm gm [.] DestroyMagickResources 0.00% gm gm [.] jpeg_free_small 0.00% gm gm [.] fprintf 0.00% gm gm [.] DestroySemaphore 0.00% gm gm [.] jpeg_mem_term 0.00% gm gm [.] start_pass_downsample 0.00% gm gm [.] malloc_info 0.00% gm gm [.] __stpcpy 37 Result, cont
I 616 lines of report I perf annotate works as well!
38 Summary
39 Summary
I Proof of concept I Can help with pathological cases I Scaling issues I Powerful but need to by “civilised” I Nearest future I Drivers in mainline I perf tool decoder integration
40 Thank You
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
41