CycleCycle Accurate Accurate Profiling Profiling With With Perf Perf Paweł Moll

1 The plan

2 The plan

I Hardware I perf I Let’s hack!

3 Hardware

4 ARM CoreSight

System Interconnect (AXI) Cross Trigger Matrix (CTM)

CTI CTI CTI

DAP AXI-AP STM TPIU

JTAG- Funnel SWJ-DP AP Cortex ETM Processor ETB APB-AP

Debug APB Interconnect (APBIC)

I Source I Bus I Sink

5 Processor trace

I Embedded Trace Macrocell I Instructions I Data I Program Trace Macrocell I Only branches I Bandwidth I From 10Mbps to many Gbps per core I Non- (or low-) intrusive debug

6 Sinks

I Embedded Trace Buffer I Dedicated, small SRAM I Flight recorder use case I Trace Port I External analyser I Large buffer I External (high speed) pins I Embedded Trace Router I Sinks data into main interconnect I Usually uses system DRAM I Consumes memory system bandwidth

7 ETM Protocol

I Highly compressed data I Generated against program memory I Based on E/N atoms I eg. b1NEEEE00 up to 16+1 instructions I Branches I Only if not evident (eg. eg. B ) I No address == previous address I Exceptions, instruction sets, processor state I Synchronisation packets I Data packets I PTM protocol I One bit per conditional branch

8 Issues

I Requires memory contents for decompression I Multitasking OS I JIT engines I Self-modifying code (kernel runtime patching, kprobes, dynamic trace events) I Parallel and out-of-order excecution

9 Additional features

I Filtering I Address I CONTEXTID I VMID I Trigerring I Address I DBG I Counter I Sequencer I Timestamping I Correlation (synchronisation)

10 Linux perf

11 Linux perf framework

I PMU drivers I Many use cases, eg. statistics I Sampling profiler I Periodic PC (IP) sampling I Timer or PMU counter overflow interrupt I Typical sampling rate 1kHz (every 1ms)

12 Sampling profile

I Statistical approximation of a process I Think analog/digital converter I Shannon’s theorem: “If a function x(t) contains no frequencies higher than B cps, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.”

13 CoreSight Linux framework

I Developed by Mathieu Poirier at I Based on 2012 code from Code Aurora I At v7 stage now (http://lwn.net/Articles/614232/) I Control trace components via I Enable sink, enable source, dump buffer contents I Separate decoders I Full series at http://git.linaro.org/kernel/coresight.git

14 PT

I “an exciting new feature coming in future processors” (2013) I Integrating with perf I Auxiliary buffers I Decoder integrated with tool I Kernel portions at v4 stage now (http://lwn.net/Articles/609010/) I Full series at https://github.com/virtuoso/linux-perf/tree/intel_pt

15 Let’s hack!

16 Particularly pathologic example

I Rotate JPEG file

/ # time taskset 4 ./gm convert -rotate 90 in.jpg out.jpg real 0m 0.01s user 0m 0.01s sys 0m 0.00s

17 Can it go faster? / # time perf record -F 1000 -e cpu-clock \ taskset 4 ./gm convert -rotate 90 in.jpg out.jpg [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.001 MB perf.data (~50 samples) ] real 0m 0.27s user 0m 0.14s sys 0m 0.13s

I It was: real 0m 0.01s user 0m 0.01s sys 0m 0.00s

18 Report # Samples: 17 of event 'cpu-clock' # Event count (approx.): 17000000 # # Overhead Command Shared Object Symbol # ...... # 17.65% gm [kernel.kallsyms] [k] filemap_map_pages 5.88% taskset [kernel.kallsyms] [k] filemap_map_pages 5.88% gm gm [.] LocaleCompare 5.88% gm gm [.] forward_DCT_float 5.88% gm gm [.] encode_mcu_huff 5.88% gm gm [.] ycc_rgb_convert 5.88% gm gm [.] decode_mcu_DC_first 5.88% gm gm [.] decode_mcu_AC_refine 19 Report, cont.

5.88% gm gm [.] jpeg_fdct_16x16 5.88% gm gm [.] _IO_link_in 5.88% gm gm [.] malloc 5.88% gm gm [.] strncpy 5.88% gm [kernel.kallsyms] [k] lock_acquire 5.88% gm [kernel.kallsyms] [k] lock_release 5.88% gm [kernel.kallsyms] [k] unmap_single_vma

20 Let’s have a closer look…

I Cortex-A7 ETM 3.5 I Instructions only I Cycle accurate I Captured with DStream & ARM DS-5 I 10MB of binary trace data

21 What can we see there?

I Decoded into text format I 890MB file I What to look for:

ELF Header: Entry point address: 0x96dd

8360: 000096dd 0 FUNC GLOBAL DEFAULT 6 _start

8845: 0012ac44 0 FUNC GLOBAL DEFAULT 9 _fini

22 Here we go

2859224 S:0x8000E4CC E28DD00C 0 ADD sp,sp,#0xc ret_to_user_from_irq 2859225 S:0x8000E4D0 E1B0F00E 1 MOVS pc,lr ret_to_user_from_irq Return from exception Timestamp: 1878241989137 S:0x000096DC F04F0B00 29 MOV r11,#0 Exception: PREFETCH_ABORT (11) 2859227 S:0xFFFF000C EA000443 21 B PRRR+16027512 ; 0xFFFF1120 Timestamp: 1878241989138 2859228 S:0xFFFF1120 E24EE004 18 SUB lr,lr,#4 2859229 S:0xFFFF1124 E88D4001 3 STM sp,r0,lr

23 Here we go again 2878794 S:0x8000E4D0 E1B0F00E 1 MOVS pc,lr ret_to_user_from_irq Return from exception Timestamp: 1878241990817 2878795 S:0x000096DC F04F0B00 175 MOV r11,#0 2878796 S:0x000096E0 F04F0E00 40 MOV lr,#0 2878797 S:0x000096E4 BC02 4 POP r1 2878798 S:0x000096E6 466A 1 MOV r2,sp [...] 2878804 S:0x000096F6 4B04 1 LDR r3,[pc,#16] ; [0x9708] = 0xB114687C 2878805 S:0x000096F8 F0DAFDF4 0 BL __libc_start_main ; 0xE42E4 S:0x000E42E4 E92D45F0 324 PUSH r4-r8,r10,lr __libc_start_main Exception: PREFETCH_ABORT (11) 2878807 S:0xFFFF000C EA000443 21 B PRRR+16027512 ; 0xFFFF1120

24 Going down

9747820 S:0x0012AC44 E91D4008 153 PUSH r3,lr 9747821 S:0x0012AC48 E8BD8008 2 POP r3,pc 9747822 S:0x000E9AA2 E7BF 9 B __run_exit_handlers+28 ; 0xE9A24 __run_exit_handlers 9747823 S:0x000E9A24 6873 3 LDR r3,[r6,#4] __run_exit_handlers 9747824 S:0x000E9A26 EB061403 3 ADD r4,r6,r3,LSL #4 __run_exit_handlers 9747825 S:0x000E9A2A B173 1 CBZ r3,__run_exit_handlers+66 ; 0xE9A4A __run_exit_handlers

25 The end of the process

9747990 S:0x000FD536 4C12 2 LDR r4,[pc,#72] ; [0xFD580] = 0x64C55B39 _Exit 9747991 S:0x000FD538 E004 0 B _Exit+24 ; 0xFD544 _Exit 9747992 S:0x000FD544 4618 1 MOV r0,r3 _Exit 9747993 S:0x000FD546 F04F0CF8 1 MOV r12,#0xf8 _Exit 9747994 S:0x000FD54A F7E7F9C9 0 BL __libc_do_syscall ; 0xE48E0 _Exit 9747995 S:0x000E48E0 B580 1 PUSH r7,lr __libc_do_syscall 9747996 S:0x000E48E2 4667 2 MOV r7,r12 __libc_do_syscall 9747997 S:0x000E48E4 DF00 1 SVC #0x0 __libc_do_syscall Exception: SUPERVISOR_CALL (10)

I #0xf8 is __NR_exit_group

26 Focus on the task

I It’s all interesting… I …but the goal is to see a cycle accurate profile of the process I Limit data scope I Filter out kernel addresses (or cheat using entry/exit points) I Filter out other contexts (or cheat by protecting CPU) I Collate migrated data (or cheat by setting affinity) I Generate memory map with DSOs (or cheat by linking statically) I Convert into perf data stream

27 perf.data

I Starts with header I Description of all events I Followed by data records … I selection of samples I by default: PERF_SAMPLE_IP, PERF_SAMPLE_TID, PERF_SAMPLE_TIME, PERF_SAMPLE_PERIOD I generated on every timer/counter interrupt I …interleaved with system information I eg. PERF_RECORD_MMAP, PERF_RECORD_COMM, PERF_RECORD_EXIT, PERF_RECORD_FORK

28 perf.data, cont. $ perf report -D [...] 0x1d0 [0x28]: event: 9 . . ... raw event: size 40 bytes . 0000: 09 00 00 00 01 00 28 00 54 fd 05 80 00 00 00 00 ...... (.T...... 0010: 3b 08 00 00 3b 08 00 00 46 fb 39 91 dd 46 00 00 ;...;...F.9..F.. . 0020: 40 42 0f 00 00 00 00 00 @B...... 77917438212934 0x1d0 [0x28]: PERF_RECORD_SAMPLE(IP, 1): \ 2107/2107: 0x8005fd54 period: 1000000 addr: 0 ... thread: gm:2107 ...... dso: [kernel.kallsyms]

29 The Hack

I Replace data records with trace-based ones I reducing number of samples, from 40 to 24 bytes per record (attribute modification needed) I generating multiple samples, one per cycle used (175 samples if instruction took 175 cycles to execute) I 192MB big perf.data

30 The Hack, cont. 0x1f0 [0x18]: event: 9 . ... raw event: size 24 bytes . 0000: 09 00 00 00 02 00 18 00 dc 96 00 00 00 00 00 00 ...... 0010: 3b 08 00 00 3b 08 00 00 ;...;... 0x1f0 [0x18]: PERF_RECORD_SAMPLE(IP, 2): \ 2107/2107: 0x96dc period: 1 addr: 0 ... thread: :2107:2107 ...... dso: 0x208 [0x18]: event: 9 . ... raw event: size 24 bytes . 0000: 09 00 00 00 02 00 18 00 dc 96 00 00 00 00 00 00 ...... 0010: 3b 08 00 00 3b 08 00 00 ;...;... 0x208 [0x18]: PERF_RECORD_SAMPLE(IP, 2): \ 2107/2107: 0x96dc period: 1 addr: 0 31 ... thread: :2107:2107 Result # Samples: 8M of event 'cycles' # Event count (approx.): 8012563 # # Overhead Command Shared Object Symbol # ...... # 8.98% gm gm [.] jpeg_idct_islow 7.25% gm gm [.] strncpy 6.60% gm gm [.] jpeg_fdct_16x16 4.40% gm gm [.] encode_mcu_huff 4.37% gm gm [.] decode_mcu_AC_refine 4.25% gm gm [.] LocaleCompare 3.73% gm gm [.] rgb_ycc_convert 3.25% gm gm [.] _int_free 32 Result, cont 3.17% gm gm [.] memset 3.15% gm gm [.] jpeg_gen_optimal_table 3.01% gm gm [.] malloc 2.63% gm gm [.] _int_malloc 2.63% gm gm [.] forward_DCT_float 2.59% gm gm [.] ycc_rgb_convert 2.43% gm gm [.] jpeg_fdct_float 2.29% gm gm [.] __pthread_mutex_unlock_usercnt 2.25% gm gm [.] __memcpy_neon 2.19% gm gm [.] pthread_mutex_lock 1.74% gm gm [.] ReadJPEGImage 1.72% gm gm [.] encode_mcu_gather 1.61% gm gm [.] WriteJPEGImage 1.39% gm gm [.] vfprintf 33 Result, cont 1.32% gm gm [.] consume_data 1.21% gm gm [.] forward_DCT 1.06% gm gm [.] RegisterMagickInfo 0.88% gm gm [.] UnregisterMagickInfo 0.87% gm gm [.] decode_mcu_AC_first 0.81% gm gm [.] strlen 0.73% gm gm [.] SyncCacheNexus 0.64% gm gm [.] strcpy 0.53% gm gm [.] jpeg_fill_bit_buffer 0.49% gm gm [.] decode_mcu_DC_first 0.40% gm gm [.] _IO_default_xsputn 0.35% gm gm [.] _init 0.34% gm gm [.] jpeg_make_d_derived_tbl 0.33% gm gm [.] DestroyMagickInfo 34 Result, cont 0.33% gm gm [.] QueryColorDatabase 0.32% gm gm [.] GetGeometry 0.31% gm gm [.] SetNexus 0.31% gm gm [.] free 0.29% gm gm [.] __strchrnul 0.27% gm gm [.] ____strtod_l_internal 0.27% gm gm [.] .divsi3_skip_div0_test 0.24% gm gm [.] SetCacheNexus 0.23% gm gm [.] access_virt_barray 0.23% gm gm [.] compress_output 0.22% gm gm [.] ____strtol_l_internal 0.21% gm gm [.] UnlockSemaphoreInfo 0.21% gm gm [.] AcquireCacheNexus 0.21% gm gm [.] format_message 35 Result, cont

[...]

36 Result, cont 0.00% gm gm [.] ___fini_from_thumb 0.00% gm gm [.] jpeg_destroy_compress 0.00% gm gm [.] __feupdateenv 0.00% gm gm [.] IdentityAffine 0.00% gm gm [.] __dcgettext 0.00% gm gm [.] _setjmp 0.00% gm gm [.] DestroyMagickResources 0.00% gm gm [.] jpeg_free_small 0.00% gm gm [.] fprintf 0.00% gm gm [.] DestroySemaphore 0.00% gm gm [.] jpeg_mem_term 0.00% gm gm [.] start_pass_downsample 0.00% gm gm [.] malloc_info 0.00% gm gm [.] __stpcpy 37 Result, cont

I 616 lines of report I perf annotate works as well!

38 Summary

39 Summary

I Proof of concept I Can help with pathological cases I Scaling issues I Powerful but need to by “civilised” I Nearest future I Drivers in mainline I perf tool decoder integration

40 Thank You

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

41