Ftrace How-to
version 0.4 - Latinoware
Daniel ’bristot’ de Oliveira
October 21, 2016

Who am I?

I am Daniel :-)
I am from BRAZIL! But I’m Italian too...
  It means 9 FIFA World Cup championships o/
My father did not allow me to become a truck driver...
So I started to study:
  Bs. Computer Science 2009
  Ms. Automation Engineering 2014
  Ph.D. Automation Engineering 2019
Before Red Hat: 5 years with embedded
At Red Hat: SEG on SBR-Kernel: Real-time and performance

What is trace?

Run-time information
  Hey... function foo called bar...
  Hey... function bar returned in 2 us
  Hey... the code crossed here, and the var X is 10
Can be enabled/disabled at runtime
Low overhead... mainly when disabled (it is really important)
Generates a *lot* of data!
  Dozens of lines of trace per microsecond, per CPU!

Trace techniques

Static trace - Compiled in the code
Trace of functions - In the function calls
Dynamic trace - Added dynamically

Kernel trace techniques

Static trace - tracepoints
Trace of functions - ftrace
Dynamic trace - kprobes
Ftrace provides an interface for these three techniques

Go!

Please, boot your RHEL7/Fedora VMs
Or run on your machine! It is safe :-)

Ftrace’s interface

Ftrace is built into the kernel
Accessible via debugfs: echo to set, cat to get
On Fedora and on RHEL7, debugfs is mounted by default at: /sys/kernel/debug/
On RHEL6: mount -t debugfs debugfs /sys/kernel/debug/
Ftrace’s interface is at /sys/kernel/debug/tracing/

Ftrace’s interface
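Since the mount point differs between systems, a script can probe for the interface instead of hard-coding it. This is a minimal sketch: the function name find_tracing_dir is made up for this example, and using current_tracer as the marker file is an assumption.

```shell
# Print the first candidate directory that looks like an ftrace interface,
# i.e. one that contains a current_tracer file. Returns 1 if none matches.
find_tracing_dir() {
    for dir in "$@"; do
        if [ -e "$dir/current_tracer" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

Possible usage: TRACING=$(find_tracing_dir /sys/kernel/debug/tracing /sys/kernel/tracing). The second path is where newer kernels expose tracefs; it is not mentioned in these slides.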

[root@btt-rhel7 ~]# cd /sys/kernel/debug/tracing/
[root@btt-rhel7 tracing]# ls
available_events            max_graph_depth      stack_trace_filter
available_filter_functions  options              trace
available_tracers           per_cpu              trace_clock
buffer_size_kb              printk_formats       trace_marker
buffer_total_size_kb        README               trace_options
current_tracer              saved_cmdlines       trace_pipe
dyn_ftrace_total_info       set_event            trace_stat
enabled_functions           set_ftrace_filter    tracing_cpumask
events                      set_ftrace_notrace   tracing_max_latency
free_buffer                 set_ftrace_pid       tracing_on
function_profile_enabled    set_graph_function   tracing_thresh
instances                   snapshot             uprobe_events
kprobe_events               stack_max_size       uprobe_profile
kprobe_profile              stack_trace

Starting from function tracer

Trace of kernel functions
Only kernel and only functions
  Only kernel functions - no user-space
  No macros and no inline functions
Basically: how does it work?
  gcc -pg adds a call to mcount at the beginning of each function
  mcount receives the address of the caller and of the caller’s caller
  mcount calls* the function tracer’s function, which saves the information in the trace buffer
Default question: WOW, so it means a lot of overhead?
  No: only a small overhead when enabled, and a ”nop” when disabled:
  When disabled, all mcount calls are turned into nops.
* This Steven’s lecture explains how it works: video.linux.com/videos/removing-stop-machine-from-the-tracing-infrastructure

Basic ftrace’s interface

available_tracers
  cat: show available tracers
current_tracer
  cat: show current tracer
  echo: set the current tracer
trace
  cat: print the trace buffer
  echo: clean the trace buffer
tracing_on
  echo 1: turn the trace on
  echo 0: turn the trace off

Basic ftrace’s interface
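The four files above are enough for a small start/stop wrapper. The sketch below is only illustrative: trace_start and trace_stop are hypothetical names, and the TRACING variable is an assumption added so that the path can be overridden.

```shell
# Directory of the ftrace interface; override TRACING to point elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# trace_start TRACER: select a tracer, clean the buffer and turn tracing on.
trace_start() {
    echo "$1" > "$TRACING/current_tracer" &&
    echo > "$TRACING/trace" &&
    echo 1 > "$TRACING/tracing_on"
}

# trace_stop: turn tracing off and go back to the nop tracer.
# The buffer is left alone so that it can still be read with cat.
trace_stop() {
    echo 0 > "$TRACING/tracing_on" &&
    echo nop > "$TRACING/current_tracer"
}
```

For example, as root: trace_start function; sleep 1; trace_stop; head -15 "$TRACING/trace".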

[root@btt-rhel7 tracing]# cat available_tracers
blk function_graph wakeup_rt wakeup function nop
[root@btt-rhel7 tracing]# cat current_tracer
nop
[root@btt-rhel7 tracing]# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

Using function tracer

[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -15 trace
# tracer: function
#
# entries-in-buffer/entries-written: 71715/71715   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-2274  [002] ....  2553.416814: mutex_unlock <-rb_simple_write
            bash-2274  [002] ....  2553.416816: __fsnotify_parent <-vfs_write
            bash-2274  [002] ....  2553.416817: fsnotify <-vfs_write
            bash-2274  [002] ....  2553.416817: __srcu_read_lock <-fsnotify

Stopping the trace

[root@btt-rhel7 tracing]# echo 0 > tracing_on
[root@btt-rhel7 tracing]# echo nop > current_tracer
[root@btt-rhel7 tracing]# echo > trace

Graph tracer

It traces the call of functions
But also the return of functions
So, can I get the execution time of a function?
  YES!
But it has a cost: it is more expensive than the function tracer
  But not that much

Function graph tracer
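Because function_graph prints a duration for leaf calls and for closing braces, its output can be post-processed to keep only the slow entries. A minimal sketch, assuming the default output layout where the duration is followed by the unit "us" and a "|" separator; the name slow_calls is made up for this example.

```shell
# slow_calls MIN_US: read function_graph output on stdin and print the
# duration and the text after the "|" for entries taking at least MIN_US.
slow_calls() {
    awk -v min_us="$1" '
        /^#/ { next }                      # skip the header lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "us") {          # the field before "us" is the duration
                    dur = $(i - 1)
                    if (dur + 0 >= min_us + 0) {
                        line = $0
                        sub(/^[^|]*\|[ \t]*/, "", line)
                        print dur, line
                    }
                    break
                }
            }
        }'
}
```

Possible usage: slow_calls 1 < trace. A closing brace in the result stands for the whole function that ends there; its name is on the matching opening line, which carries no duration.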

[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)               |  tick_do_update_jiffies64() {
 3)   0.045 us    |    _raw_spin_lock();
 3)               |    do_timer() {
 3)               |      update_wall_time() {
 3)   0.046 us    |        _raw_spin_lock_irqsave();
 3)   0.047 us    |        _raw_spin_unlock_irqrestore();
 3)   0.617 us    |      }
 3)   0.040 us    |      calc_global_load();
 3)   1.138 us    |    }
 3)   0.042 us    |    _raw_spin_unlock();
 3)   1.938 us    |  }

Jumping to Tracepoints

Points of trace in the kernel’s code
Low overhead, mainly when disabled
Runs a callback to write to ftrace’s buffer
It is also known as trace events (e.g. on )
Organized by subsystems
  subsystem:tracepoint_name

Basic tracepoint’s interface

available_events
  cat: show available events
set_event
  cat: show enabled events
  echo: enable/clean events

Basic tracepoint’s interface

[root@btt-rhel7 tracing]# cat available_events | grep irq_handler
irq:irq_handler_exit
irq:irq_handler_entry
[root@btt-rhel7 tracing]# cat available_events | wc -l
1200
[root@btt-rhel7 tracing]# echo irq:irq_handler_exit > set_event
[root@btt-rhel7 tracing]# cat set_event
irq:irq_handler_exit
[root@btt-rhel7 tracing]# echo irq:irq_handler_entry >> set_event
[root@btt-rhel7 tracing]# cat available_events | grep sched_wakeup >> set_event
[root@btt-rhel7 tracing]# cat set_event
irq:irq_handler_exit
irq:irq_handler_entry
sched:sched_wakeup_new
sched:sched_wakeup
[root@btt-rhel7 tracing]# echo > set_event
[root@btt-rhel7 tracing]# cat set_event

Tracepoints output

[root@btt-rhel7 tracing]# cat available_events | grep irq_handler > set_event
[root@btt-rhel7 tracing]# head -20 trace
# tracer: nop
#
# entries-in-buffer/entries-written: 150/150   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          <idle>-0     [001] d.h.  3623.817286: irq_handler_entry: irq=42 name=virtio0-input.0
          <idle>-0     [001] d.h.  3623.817290: irq_handler_exit: irq=42 ret=handled
          <idle>-0     [003] d.h.  3624.175584: irq_handler_entry: irq=14 name=ata_piix
          <idle>-0     [003] d.h.  3624.175681: irq_handler_exit: irq=14 ret=handled
          <idle>-0     [003] d.h.  3624.175689: irq_handler_entry: irq=14 name=ata_piix
          <idle>-0     [003] d.h.  3624.175706: irq_handler_exit: irq=14 ret=handled
          <idle>-0     [001] d.h.  3624.186418: irq_handler_entry: irq=42 name=virtio0-input.0
          <idle>-0     [001] d.h.  3624.186421: irq_handler_exit: irq=42 ret=handled
          <idle>-0     [001] d.h.  3625.264161: irq_handler_entry: irq=42 name=virtio0-input.0

Ftrace and tracepoints - together is better

          <idle>-0     [002] .N..   173.728450: schedule_preempt_disabled <-cpu_startup_entry
          <idle>-0     [002] .N..   173.728450: __schedule <-schedule_preempt_disabled
          <idle>-0     [002] .N..   173.728450: rcu_note_context_switch <-__schedule
          <idle>-0     [002] .N..   173.728450: _raw_spin_lock_irq <-__schedule
          <idle>-0     [002] dN..   173.728451: pre_schedule_idle <-__schedule
          <idle>-0     [002] dN..   173.728451: idle_exit_fair <-pre_schedule_idle
          <idle>-0     [002] dN..   173.728451: put_prev_task_idle <-__schedule
          <idle>-0     [002] dN..   173.728451: pick_next_task_fair <-__schedule
          <idle>-0     [002] dN..   173.728451: clear_buddies <-pick_next_task_fair
          <idle>-0     [002] dN..   173.728452: __dequeue_entity <-pick_next_task_fair
          <idle>-0     [002] d...   173.728452: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=virt-what next_pid=2325 next_prio=120
            grep-2325  [002] d...   173.728454: finish_task_switch <-__schedule
            grep-2325  [002] ....   173.728455: __mmdrop <-finish_task_switch
            grep-2325  [002] ....   173.728455: pgd_free <-__mmdrop
            grep-2325  [002] ....   173.728455: _raw_spin_lock <-pgd_free
            grep-2325  [002] ....   173.728455: _raw_spin_unlock <-pgd_free
            grep-2325  [002] ....   173.728455: free_pages <-pgd_free
            grep-2325  [002] ....   173.728456: free_pages.part.63 <-free_pages
            grep-2325  [002] ....   173.728456: __free_pages <-free_pages.part.63

But it is too much information!

All the functions are too much!
It is possible to filter the trace of functions
And it is also possible to filter tracepoints based on their data.
Let’s try it, starting with ftrace.

Ftrace’s filter interface

available_filter_functions
  cat: show the functions that can be filtered
set_ftrace_filter
  cat: show functions that will be traced
  echo: enable/clean functions that will be traced
set_ftrace_notrace
  cat: show functions that will NOT be traced
  echo: enable/clean functions that will NOT be traced
set_ftrace_pid
  cat: show the pid that will be traced
  echo: set/clean the pid that will be traced

Filtering the trace of functions
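A small helper can check names against available_filter_functions before touching set_ftrace_filter. This is only a sketch: set_function_filter is a hypothetical name, the TRACING variable is an assumption, and the whole-line match means it handles plain built-in function names only (no wildcards, no module functions).

```shell
# Directory of the ftrace interface; override TRACING to point elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# set_function_filter NAME...: replace the function filter with the given
# names, refusing any name not listed in available_filter_functions.
set_function_filter() {
    for fn in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: no output
        if ! grep -qxF -- "$fn" "$TRACING/available_filter_functions"; then
            echo "not traceable: $fn" >&2
            return 1
        fi
    done
    # Same form as: echo mutex_lock mutex_unlock > set_ftrace_filter
    echo "$*" > "$TRACING/set_ftrace_filter"
}
```

Possible usage: set_function_filter mutex_lock mutex_unlock. Nothing is written if any of the names is rejected.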

[root@btt-rhel7 tracing]# cat available_filter_functions | wc -l
29428
[root@btt-rhel7 tracing]# echo mutex_lock > set_ftrace_filter
[root@btt-rhel7 tracing]# echo mutex_unlock >> set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
mutex_unlock
mutex_lock
[root@btt-rhel7 tracing]# echo > set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
#### all functions enabled ####

Filtering the trace of functions

[root@btt-rhel7 tracing]# echo mutex_lock mutex_unlock > set_ftrace_filter
[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# echo 2294 > set_ftrace_pid
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function
#
# entries-in-buffer/entries-written: 490/490   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-2294  [001] .... 111801.975119: mutex_unlock <-rb_simple_write
            bash-2294  [001] .... 111801.975134: mutex_lock <-trace_array_put
            bash-2294  [001] .... 111801.975135: mutex_unlock <-trace_array_put
            bash-2294  [001] .... 111801.975360: mutex_lock <-n_tty_write
            bash-2294  [001] .... 111801.975368: mutex_unlock <-n_tty_write
            bash-2294  [001] .... 111801.975369: mutex_unlock <-tty_write_unlock
            bash-2294  [001] .... 111801.975462: mutex_lock <-tty_ioctl

Filter and function graph tracer

[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 2)   0.127 us    |  mutex_lock();
 2)   0.112 us    |  mutex_unlock();
 2)   0.055 us    |  mutex_unlock();
 2)   0.127 us    |  mutex_lock();
 2)   0.122 us    |  mutex_lock();
 2)   0.129 us    |  mutex_lock();
 2)   0.049 us    |  mutex_lock();
 2)   0.050 us    |  mutex_unlock();
 2)   0.206 us    |  mutex_lock();
 2)   0.063 us    |  mutex_lock();
 2)   0.081 us    |  mutex_unlock();
 2)   0.063 us    |  mutex_unlock();
 2)   0.054 us    |  mutex_lock();
 2)   0.062 us    |  mutex_unlock();
 2)   0.087 us    |  mutex_unlock();
 2)   0.066 us    |  mutex_unlock();

Function filtering: wildcards and modules

[root@btt-rhel7 tracing]# echo mutex_* > set_ftrace_filter
[root@btt-rhel7 tracing]# cat set_ftrace_filter
mutex_spin_on_owner
mutex_unlock
mutex_lock
mutex_trylock
mutex_lock_interruptible
mutex_lock_killable
[root@btt-rhel7 tracing]# echo :mod:dm_mirror:* > set_ftrace_filter
[root@btt-rhel7 tracing]# head -10 set_ftrace_filter
mirror_iterate_devices [dm_mirror]
mirror_postsuspend [dm_mirror]
mirror_status [dm_mirror]
mirror_resume [dm_mirror]
fail_mirror [dm_mirror]
wakeup_mirrord [dm_mirror]
delayed_wake_fn [dm_mirror]
free_context [dm_mirror]
mirror_dtr [dm_mirror]
trigger_event [dm_mirror]
...

Function filtering: graph function

set_graph_function: turns the trace on at the call of the function, and off on its return

[root@btt-rhel7 tracing]# echo ttwu_do_wakeup > set_graph_function
[root@btt-rhel7 tracing]# echo function_graph > current_tracer
[root@btt-rhel7 tracing]# echo 1 > tracing_on
[root@btt-rhel7 tracing]# head -20 trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3)               |  ttwu_do_wakeup() {
 3)               |    check_preempt_curr() {
 3)   0.077 us    |      resched_task();
 3)   0.619 us    |    }
 3)   1.066 us    |  }
 1)               |  ttwu_do_wakeup() {
 1)               |    check_preempt_curr() {
 1)               |      check_preempt_wakeup() {
 1)   0.078 us    |        update_curr();
 1)   0.076 us    |        wakeup_gran.isra.54();
 1)   1.175 us    |      }
 1)   1.679 us    |    }
 1)   2.159 us    |  }

Filtering tracepoints

There’s no need to filter which tracepoints to trace - you already filtered them by choosing :-)
But you can filter under which conditions a tracepoint is printed, based on its fields.
Tracepoints are more than just *printks*
  They are structured information

Basic tracepoint’s filtering interface

Do you recall that events are classified by subsystems?
Event options are in the dir: events/$SUBSYSTEM/$EVENT_NAME
  e.g.: events/irq/irq_handler_entry
Inside each dir there are these files:
  id: the ID of the event
  enable: echo 1 to enable, 0 to disable
  filter: get/set filter options
  format: information about the data gathered by this tracepoint

Filtering tracepoints: without filter

[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_
irq:irq_handler_exit
irq:irq_handler_entry
[root@btt-rhel7 tracing]# cat available_events | grep irq:irq_ > set_event
[root@btt-rhel7 tracing]# tail -10 trace
          <idle>-0     [001] d.h.  1543.014323: irq_handler_entry: irq=43 name=virtio0-input.0
          <idle>-0     [001] d.h.  1543.014328: irq_handler_exit: irq=43 ret=handled
          <idle>-0     [001] d.h.  1543.015088: irq_handler_entry: irq=43 name=virtio0-input.0
          <idle>-0     [001] d.h.  1543.015090: irq_handler_exit: irq=43 ret=handled
     kworker/3:0-2299  [003] d.h.  1543.232015: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  1543.232147: irq_handler_exit: irq=14 ret=handled
     kworker/3:0-2299  [003] d.h.  1543.232158: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  1543.232196: irq_handler_exit: irq=14 ret=handled
          <idle>-0     [001] d.h.  1543.534487: irq_handler_entry: irq=43 name=virtio0-input.0
          <idle>-0     [001] d.h.  1543.534492: irq_handler_exit: irq=43 ret=handled

Filtering tracepoints!
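Setting a filter and enabling the event always touches the same two files in the event dir, so it can be wrapped in a helper. A minimal sketch; event_filter is a hypothetical name and the TRACING variable is an assumption added to make the path overridable.

```shell
# Directory of the ftrace interface; override TRACING to point elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# event_filter SUBSYSTEM EVENT EXPRESSION
# Write EXPRESSION to the filter file of the event, then enable the event.
event_filter() {
    dir="$TRACING/events/$1/$2"
    if [ ! -d "$dir" ]; then
        echo "no such event: $1:$2" >&2
        return 1
    fi
    echo "$3" > "$dir/filter" &&
    echo 1 > "$dir/enable"
}
```

Possible usage: event_filter irq irq_handler_entry 'irq == 14'. The field names that can be used in the expression are the ones listed in the format file of the event.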

[root@btt-rhel7 tracing]# cd events/irq/irq_handler_entry/
[root@btt-rhel7 irq_handler_entry]# ls
enable  filter  format  id
[root@btt-rhel7 irq_handler_entry]# cat format
name: irq_handler_entry
ID: 114
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:int irq;  offset:8;       size:4; signed:1;
        field:__data_loc char[] name;   offset:12;      size:4; signed:1;

print fmt: "irq=%d name=%s", REC->irq, __get_str(name)

[root@btt-rhel7 irq_handler_entry]# echo 'irq == 14' > filter
[root@btt-rhel7 irq_handler_entry]# cd ../irq_handler_exit/
[root@btt-rhel7 irq_handler_exit]# echo 'irq == 14' > filter

Filtered tracepoints!
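With only the irq_handler_entry and irq_handler_exit events in the buffer, every entry/exit pair gives the time spent in the handler. The sketch below pairs them per CPU and IRQ; irq_durations is a made-up name, and it assumes the default trace layout and task names without spaces.

```shell
# irq_durations: read trace lines on stdin and print, for each
# irq_handler_entry/irq_handler_exit pair, the CPU, the IRQ and the
# elapsed time in microseconds.
irq_durations() {
    awk '
        { key = $2 " " $6 }                  # CPU column plus "irq=N"
        $5 == "irq_handler_entry:" {
            ts = $4; sub(/:$/, "", ts)       # drop the trailing colon
            start[key] = ts
        }
        $5 == "irq_handler_exit:" && (key in start) {
            ts = $4; sub(/:$/, "", ts)
            printf "%s %s %.0f us\n", $2, $6, (ts - start[key]) * 1000000
            delete start[key]
        }'
}
```

Possible usage: irq_durations < trace. An exit line without a preceding entry line, for instance at the very start of the buffer, is silently skipped.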

[root@btt-rhel7 irq_handler_exit]# cd ../../../
[root@btt-rhel7 tracing]# tail -10 trace
     kworker/3:0-2299  [003] d.h.  2305.087986: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  2305.088010: irq_handler_exit: irq=14 ret=handled
     kworker/3:0-2299  [003] d.h.  2307.135803: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  2307.135852: irq_handler_exit: irq=14 ret=handled
     kworker/3:0-2299  [003] d.h.  2307.135858: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  2307.135873: irq_handler_exit: irq=14 ret=handled
     kworker/3:0-2299  [003] d.h.  2309.183882: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  2309.183966: irq_handler_exit: irq=14 ret=handled
     kworker/3:0-2299  [003] d.h.  2309.183973: irq_handler_entry: irq=14 name=ata_piix
     kworker/3:0-2299  [003] d.h.  2309.183998: irq_handler_exit: irq=14 ret=handled

A more complex filter!

[root@btt-rhel7 tracing]# cd events/sched/sched_wakeup
[root@btt-rhel7 sched_wakeup]# cat format
name: sched_wakeup
ID: 311
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char comm[32];    offset:8;       size:16;        signed:1;
        field:pid_t pid;        offset:24;      size:4; signed:1;
        field:int prio; offset:28;      size:4; signed:1;
        field:int success;      offset:32;      size:4; signed:1;
        field:int target_cpu;   offset:36;      size:4; signed:1;

print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu

[root@btt-rhel7 sched_wakeup]# echo "prio < 100" > filter
[root@btt-rhel7 sched_wakeup]# echo 1 > enable

Let’s put more fun on it!

[root@btt-rhel7 sched_wakeup]# cd ../sched_switch/
[root@btt-rhel7 sched_switch]# cat format
name: sched_switch
ID: 309
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char prev_comm[32];       offset:8;       size:16;        signed:1;
        field:pid_t prev_pid;   offset:24;      size:4; signed:1;
        field:int prev_prio;    offset:28;      size:4; signed:1;
        field:long prev_state;  offset:32;      size:8; signed:1;
        field:char next_comm[32];       offset:40;      size:16;        signed:1;
        field:pid_t next_pid;   offset:56;      size:4; signed:1;
        field:int next_prio;    offset:60;      size:4; signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state & (1024-1) ? __print_flags(REC->prev_state & (1024-1), "|", { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", REC->prev_state & 1024 ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

[root@btt-rhel7 sched_switch]# echo "(prev_state == 1 && prev_prio < 100) || next_prio < 100 " > filter
[root@btt-rhel7 sched_switch]# echo 1 > enable

Let’s put fun on it!

[root@btt-rhel7 sched_switch]# cd ../../../
[root@btt-rhel7 tracing]# tail -9 trace
          <idle>-0     [001] dNh.  6155.077138: sched_wakeup: comm=watchdog/1 pid=19 prio=0 success=1 target_cpu=001
          <idle>-0     [001] d...  6155.077165: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=watchdog/1 next_pid=19 next_prio=0
      watchdog/1-19    [001] d...  6155.077181: sched_switch: prev_comm=watchdog/1 prev_pid=19 prev_prio=0 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [002] dNh.  6155.089144: sched_wakeup: comm=watchdog/2 pid=24 prio=0 success=1 target_cpu=002
          <idle>-0     [002] d...  6155.089166: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=watchdog/2 next_pid=24 next_prio=0
      watchdog/2-24    [002] d...  6155.089181: sched_switch: prev_comm=watchdog/2 prev_pid=24 prev_prio=0 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
          <idle>-0     [003] dNh.  6155.101158: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
          <idle>-0     [003] d...  6155.101176: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=watchdog/3 next_pid=29 next_prio=0
      watchdog/3-29    [003] d...  6155.101189: sched_switch: prev_comm=watchdog/3 prev_pid=29 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120

Ah! percpu trace! and trace pipe! and buffer size!

That is simple! And useful!
Each CPU has a dir in the per_cpu/ dir
  For example, for CPU 2: per_cpu/cpu2/
Each CPU has its own trace at: per_cpu/cpuX/trace
Trace pipe: run a cat per_cpu/cpuX/trace_pipe
  It is also available for all CPUs
The size of the trace buffer is defined per CPU in the file buffer_size_kb

Triggering

Ok, it is nice to filter, but sometimes we need more!
I want to start the trace after the occurrence of an event
  and I want to stop the trace after another event happens!
or I want to enable an event after the call of a function
or yet I want to get the stack trace at the occurrence of a tracepoint
Ok, let’s try it!

Triggering on function trace

The interface for triggering is the filter file: set_ftrace_filter
  echo 'function:action:times' > set_ftrace_filter to set
  echo '!function:action:times' > set_ftrace_filter to clear
Let’s start by turning the tracing on and off

Triggering trace on and off - from a function
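The function:action:times string can be assembled by a pair of helpers, one to set and one to clear. This is a sketch only: function_trigger and function_trigger_clear are hypothetical names, and the TRACING variable is an assumption. It appends (>>) instead of overwriting, so that function names already in the filter are left alone.

```shell
# Directory of the ftrace interface; override TRACING to point elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# function_trigger FUNCTION ACTION [TIMES]
function_trigger() {
    # ${3:+:$3} adds ":TIMES" only when a third argument is given
    printf '%s\n' "$1:$2${3:+:$3}" >> "$TRACING/set_ftrace_filter"
}

# function_trigger_clear FUNCTION ACTION [TIMES]
# Same string, prefixed with "!" to remove the trigger.
function_trigger_clear() {
    printf '!%s\n' "$1:$2${3:+:$3}" >> "$TRACING/set_ftrace_filter"
}
```

Possible usage: function_trigger irq_exit traceoff 5, and later function_trigger_clear irq_exit traceoff 5. The "!" is kept in a single-quoted printf format so that an interactive bash does not treat it as history expansion.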

[root@btt-rhel7 tracing]# echo 0 > tracing_on
[root@btt-rhel7 tracing]# echo irq_exit:traceoff:5 irq_enter:traceon:5 > set_ftrace_filter
[root@btt-rhel7 tracing]# echo function > current_tracer
[root@btt-rhel7 tracing]# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 70/70   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-2372  [001] d...  2344.896123: irq_enter <-smp_apic_timer_interrupt
            bash-2372  [001] d...  2344.896124: rcu_irq_enter <-irq_enter
            bash-2372  [001] d.h.  2344.896124: exit_idle <-smp_apic_timer_interrupt
            bash-2372  [001] d.h.  2344.896124: local_apic_timer_interrupt <-smp_apic_timer_interrupt
            bash-2372  [001] d.h.  2344.896125: hrtimer_interrupt <-local_apic_timer_interrupt
            bash-2372  [001] d.h.  2344.896125: _raw_spin_lock <-hrtimer_interrupt
[...]
            bash-2372  [001] d.h.  2344.896132: _raw_spin_unlock <-hrtimer_interrupt
            bash-2372  [001] d.h.  2344.896132: tick_program_event <-hrtimer_interrupt
            bash-2372  [001] d.h.  2344.896132: clockevents_program_event <-tick_program_event
            bash-2372  [001] d.h.  2344.896132: ktime_get <-clockevents_program_event
            bash-2372  [001] d.h.  2344.896132: lapic_next_deadline <-clockevents_program_event

Triggering events on and off - from a function

[root@btt-rhel7 tracing]# echo 'irq_exit:disable_event:sched:sched_wakeup' > set_ftrace_filter
[root@btt-rhel7 tracing]# echo 'irq_enter:enable_event:sched:sched_wakeup' > set_ftrace_filter
[root@btt-rhel7 tracing]# head -20 trace
# tracer: nop
#
# entries-in-buffer/entries-written: 467/15199   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          <idle>-0     [003] dNh.  5605.176671: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5605.996414: sched_wakeup: comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5609.176670: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5613.176671: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5613.890256: sched_wakeup: comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5615.996420: sched_wakeup: comm=rcu_sched pid=13 prio=120 success=1 target_cpu=003
          <idle>-0     [003] dNh.  5617.176668: sched_wakeup: comm=watchdog/3 pid=29 prio=0 success=1 target_cpu=003

Triggering on events

The interface is the ”trigger” file of the event dir
Format:
  # echo '[!]command[:count] [if filter]' > trigger
Commands:
  enable_event/disable_event
  traceon/traceoff
  snapshot
  stacktrace
Not available on RHEL7 :-( (yet?)

Triggering events on and off - from a function
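The command[:count] [if filter] format can be built from separate arguments. A minimal sketch: event_trigger is a hypothetical name, the TRACING variable is an assumption, and the check for the trigger file is there because, as noted above, not every kernel provides it. It appends (>>) so that whatever is already in the trigger file is left alone.

```shell
# Directory of the ftrace interface; override TRACING to point elsewhere.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# event_trigger SUBSYSTEM EVENT COMMAND [COUNT] [FILTER]
# Pass an empty COUNT ("") to give a FILTER without a count.
event_trigger() {
    file="$TRACING/events/$1/$2/trigger"
    if [ ! -e "$file" ]; then
        echo "no trigger file for $1:$2" >&2
        return 1
    fi
    # Builds "COMMAND", "COMMAND:COUNT" and/or "... if FILTER"
    printf '%s\n' "$3${4:+:$4}${5:+ if $5}" >> "$file"
}
```

Possible usage: event_trigger sched sched_wakeup stacktrace 10 'prio < 100', which writes the string stacktrace:10 if prio < 100 to the trigger file of the sched_wakeup event.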

[root@kiron bristot]# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
[root@kiron sched_wakeup]# ls
enable  filter  format  id  trigger
[root@kiron sched_wakeup]# echo 'stacktrace:10 if prio < 100' > trigger
[root@kiron sched_wakeup]# cat ../../../trace
          <idle>-0     [003] dNh.  7762.589836: <stack trace>
 => ftrace_raw_event_sched_wakeup_template
 => ttwu_do_wakeup
 => ttwu_do_activate.constprop.90
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __run_hrtimer
 => hrtimer_interrupt
 => local_apic_timer_interrupt
 => smp_apic_timer_interrupt
 => apic_timer_interrupt
 => cpuidle_enter
 => cpu_startup_entry
 => start_secondary
      pulseaudio-2972  [003] dN..  7762.590148: <stack trace>

trace-cmd

A command line tool for ftrace
It is useful to collect data from customers
If you know how to use ftrace, you know how to use trace-cmd

Tip: ftrace and vmcore

It is possible to extract the trace from a vmcore!
It helps to understand what happened before the crash

crash> extend /usr/lib64/crash/extensions/trace.so
crash> trace dump -t data.dat
crash> pwd
/cores/retrace/tasks/968181176/misc
crash> ls
bt-a  bt-filter  data.dat  dwysocha-automated-analysis.txt  dwysocha-rhst-search-rip-string.txt  retrace-log  run_crash  sys  sys-c

More info

LWN -> Kernel -> Kernel Index -> Kernel Tracing
Kernel Documentation: Documentation/trace/
ping bristot@sbr-kernel

Part I
Understanding the execution model

What books say it is:
IMHO: Netherlands’ Flag!

Before starting...

Let’s redefine hardware
Another point of view of Hardware
And we fit the kernel here:
And protect it

And the kernel runs...

How does the kernel run?

There are two ways to run the kernel’s code

Either by ‘calling the kernel’
Or by running a kernel thread

Calling the kernel

We can think of the kernel as a library of functions that are activated to serve an event
These events are generated either:
  by the Hardware, or
  by the Software.

How does the kernel receive these events?

Via interrupts

What is an interrupt?

Interrupts are events that indicate that a condition exists somewhere in the system, the processor, or within the currently executing program or task that requires the attention of a processor. They typically result in a forced transfer of execution from the currently running program or task to a special software routine or task called a handler - Intel 64 and IA-32 Architectures Software Developer’s Manual.

Types of interrupts

Hardware activated: Asynchronous
Software activated: Synchronous
  Exceptions:
    Faults: Correctable; offending instruction is retried
    Traps: Often for debugging; instruction is not retried
    Aborts: Not correctable; severe errors!
  Software: System calls!

A Hardware Interrupt
Hardware Interrupt: Another point of view

How about ?

What is a process?
  A process is a context
  Running on a protected ring
  Where the threads run

Process and Threads

A process is a ‘virtual’ environment where threads have their:
  code
  data
  stack
  resources: e.g. sockets, file descriptors, and so on.
And they run:
  Threads are scheduled to run on a processor
  There’s no ‘software layer between the thread and the processor’
  Ok, that flag, I mean, diagram fits Java :P
But sometimes a thread needs more resources...
  These resources are managed by the kernel
So: threads run with Operating System support!
  Not ON the Operating System.

Hey kernel! I need a resource!

How does a thread ask the kernel for a resource?

Hey kernel! I need a resource!

It runs the kernel :-)...

Threads running on kernel space

A thread can run kernel code in kernel-space
  And we say that the kernel runs on behalf of the thread
  Each thread has a stack in kernel context
How does a thread jump to ‘kernel context on ring 0’?
  By generating a software interrupt o/

Thread running on kernel:
or via exceptions

Another point of view: Kernel threads

They are threads that run in the kernel address space.
They are like regular threads - but they only run in kernel space.

Finally, the kernel threads:

So, we have the following ways to run the kernel’s code

IRQ - Hardware activated
Soft IRQ
Process threads:
  Via system call
  Via exceptions
Kernel threads:
  Run only in kernel-space

It explains how! But not when!

How does the system decide to run an IRQ or a thread?

Hardware IRQs

They are asynchronous
The kernel can’t control when they will run
  They start running:
  They are not scheduled to run!
But it can control whether they can be activated
  Only maskable interrupts
They run until they finish: the kernel can’t put them to sleep:
  But they can suffer interference from another IRQ
  And they can block on spinlocks

IRQ running

Threads

They are activated in the kernel context
  sched_wakeup
Because they go to sleep in the kernel context
  Mainly via system call
Most common states
  S - Interruptible
  D - Uninterruptible
R is the Runnable state
  But it does not mean that they are running!
  Another thread can be running at the time
  And the thread is waiting to be scheduled...

States of a thread
Thread sleeping/waking up

Schedulers

Real-Time dynamic priority: DEADLINE
  Each task has a:
    Period
    Execution time - or budget
    Deadline
  Closer deadline - higher priority
Real-Time fixed priority: FIFO/RR
  Each task has a fixed priority, out of 99 possible:
    User-space: 1 < 99
    Kernel-space: 0 > 98
  The higher priority thread runs
  Tasks with the same prio:
    FIFO: Each task will run until it finishes
    RR: Tasks will share CPU time in a round-robin fashion

Schedulers

Fair Scheduler: OTHER
  Will provide the same amount of CPU time to each runnable task in a period.
  A less nice task will receive more time in a period.
  The nice value is internally mapped to a priority
    Kernel-Space: 99 > 139.
IDLE: Waits on kernel

Is there a Deadline task ready?
  Get the one with the closest deadline
Is there an RT task ready?
  Get the one with the highest priority
Is there a Fair task ready?
  Get the next to run in the fair fashion
Enter the idle state.

Scheduling

Conclusion

Applications do not run on the OS - they run on the hardware
The OS is responsible for providing the environment and resources
The kernel is activated by interrupts.
  From hardware, and
  From software.
Threads run in kernel-space
  Threads sleep in kernel space
  and the kernel schedules the threads

The end.

Thanks for listening.