Walking the Linux Kernel
Stanislav Kozina Associate Manager April 2016 AGENDA
Linux Kernel in general
Debugging kernel issues without crash
Prepare the system for crashing
Crash it with systemtap
See what we can get from the crash dump
2 Linux Kernel in general – boring
● Just software…
● Written in C & asm
● List of expected features
● Boot and initialization process
● Memory and process management
● Files, directories, sockets, …
● Resources abstraction
● CPU, memory, …
● POSIX
3 Linux Kernel in general – BUT!
● Quite big (20mil LOC)
● No libc (many other standards functions instead)
● Special environment
● Preemptive, shared memory space
● Early boot code is tricky
● No dynamic allocator, AP, scheduler, even locks!
● Special security requirements
● Kernel should not just die and/or leak anything
4 Linux Kernel in general – development system
● Open source
● List of maintainers in MAINTAINERS
● Patches posted via email
● Documentation/SubmittingPatches
● LKML
● Usually companies care about support of their stuff
● Hardware vendors…
● “We don't break userland”
5 DIGGING IN Observing kernel is not trivial
● Hard to get a consistent picture
● If we stop it, how we observe it?
● Using Vms?
● printk()
● Statistics, /proc, perf, strace, ftrace, ...
7 8 Debugging kernel issues
● Oops messages
● Message buffer, registers, stack/backtrace
● Oops leaves the system running, but unstable!
● Current task is killed
● Printk()
● Systemtap, ftrace
● crash
● Sysrq triggers
9 Kernel oops
[[ 32.580355]32.580355] SysRqSysRq :: TriggerTrigger aa crashcrash [[ 32.581331]32.581331] BUG:BUG: unableunable toto handlehandle kernelkernel NULLNULL pointerpointer dereferencedereference atat [[ 32.582703]32.582703] IP:IP: [
10 Kernel oops 2
[[ 32.600875]32.600875] RIP:RIP: 0010:[
11 Kernel oops 3
[[ 32.613426]32.613426] CallCall Trace:Trace: [[ 32.613724]32.613724] [
12 Kernel oops
● Good enough when
● The problem is in a direct callpath
● Not sufficient when
● We need to see more context
● Other tasks
● Structure content
● Be aware of -fomit-frame-pointer
● Reliable address only if %rsp == %rbp + sizeof (long)
● CONFIG_FRAME_POINTER
13 Getting crash dump
● Need to configure kdump
● Failsafe crash kernel loaded in a memory
● Switch to the new kernel via kexec
● Run kdump to save to content of /proc/vmcore
## echoecho cc >> /proc/sysrq-trigger/proc/sysrq-trigger
14 Opening a crash dump
● Dump of all kernel memory
● Zero pages, caches, buffers and userland are not dumped by default
● Can be opened in gdb
● But we need DWARF debugging symbols
● Kernel-debuginfo packages in Fedora/CentOS/RHEL
● Crash(8) tool
● Can run on live system through /proc/kcore as well
15 DEMO Using crash(8)
● mod -S
● set hex
● sys
● log
● mount
● swap
● mach [-m | -c]
● net
17 Linux tasks
● include/linux/sched.h: struct task_struct {}
● clone(2)
● CLONE_THREAD:
● “Since Linux 2.4, calls to getpid(2) return the TGID of the caller.”
18 task_struct
structstruct task_structtask_struct {{ longlong state;state; voidvoid *stack;*stack; structstruct sched_entitysched_entity se;se; structstruct mm_structmm_struct *mm;*mm; intint exit_code;exit_code; pid_tpid_t pid,pid, tgid;tgid; structstruct task_structtask_struct *parent;*parent; constconst structstruct credcred __rcu__rcu *cred;*cred; charchar comm[TASK_COMM_LEN];comm[TASK_COMM_LEN]; structstruct files_structfiles_struct *files;*files; }}
19 Process states in Linux
● ps
● Walk task_struct in the kernel
● -u | -k | -G -- filter only user|kernel|thread group leader tasks
● ps -p | -c -- walk parent/child relationship
● ps -S -- summary of task states uninterruptable sleep
D woken signal T R exit interruptable sleep S Z woken / signal
https://idea.popcount.org/2012-12-11-linux-process-states/
20 Process tree
● ps -p
crash>crash> psps -p-p 68646864 PID:PID: 00 TASK:TASK: ffffffff81c124c0ffffffff81c124c0 CPU:CPU: 00 COMMAND:COMMAND: "swapper/0""swapper/0" PID:PID: 11 TASK:TASK: ffff88007c888000ffff88007c888000 CPU:CPU: 00 COMMAND:COMMAND: "systemd""systemd" PID:PID: 784784 TASK:TASK: ffff8800369fbc00ffff8800369fbc00 CPU:CPU: 00 COMMAND:COMMAND: "login""login" PID:PID: 28842884 TASK:TASK: ffff880036a19e00ffff880036a19e00 CPU:CPU: 00 COMMAND:COMMAND: "bash""bash" PID:PID: 68646864 TASK:TASK: ffff880036850000ffff880036850000 CPU:CPU: 00 COMMAND:COMMAND: "cat""cat"
21 DEMO Backtrace
● bt -s
#8#8 [ffff88007abefd88][ffff88007abefd88] ping_v4_seq_show+0x5ping_v4_seq_show+0x5 atat ffffffff816f6a15ffffffff816f6a15 #9#9 [ffff88007abefd98][ffff88007abefd98] seq_read+0xecseq_read+0xec atat ffffffff81246b7cffffffff81246b7c #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 #12#12 [ffff88007abefed0][ffff88007abefed0] vfs_read+0x83vfs_read+0x83 atat ffffffff81223e43ffffffff81223e43 #13#13 [ffff88007abeff08][ffff88007abeff08] sys_read+0x55sys_read+0x55 atat ffffffff81224b95ffffffff81224b95 #14#14 [ffff88007abeff50][ffff88007abeff50] entry_SYSCALL_64_fastpath+0x12entry_SYSCALL_64_fastpath+0x12 at at ffffffff81781e2effffffff81781e2e
23 Full stack dump
● bt -fs
ffff88007abefe10:ffff88007abefe10: 00000000000200000000000000020000 00000000000200000000000000020000 ffff88007abefe20:ffff88007abefe20: ffff88007abefe40ffff88007abefe40 ffffffff81290132ffffffff81290132 #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 ffff88007abefe30:ffff88007abefe30: ffff88007c015b00ffff88007c015b00 ffff88007abeff18ffff88007abeff18 ffff88007abefe40:ffff88007abefe40: ffff88007abefec8ffff88007abefec8 ffffffff81223387ffffffff81223387 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 ffff88007abefe50:ffff88007abefe50: 00000000000200000000000000020000 ffff88007c015b00ffff88007c015b00 ffff88007abefe60:ffff88007abefe60: ffff88007c015b10ffff88007c015b10 ffff880036575148ffff880036575148
24 Full stack dump
● bt -fs
ffff88007abefe10:ffff88007abefe10: 00000000000200000000000000020000 00000000000200000000000000020000 ffff88007abefe20:ffff88007abefe20:ffff88007abefe40 ffff88007abefe40 ffffffff81290132ffffffff81290132 #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 ffff88007abefe30:ffff88007abefe30: ffff88007c015b00ffff88007c015b00 ffff88007abeff18ffff88007abeff18 ffff88007abefe40:ffff88007abefe40:ffff88007abefec8 ffff88007abefec8 ffffffff81223387ffffffff81223387 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 ffff88007abefe50:ffff88007abefe50: 00000000000200000000000000020000 ffff88007c015b00ffff88007c015b00 ffff88007abefe60:ffff88007abefe60: ffff88007c015b10ffff88007c015b10 ffff880036575148ffff880036575148
25 Disassemble
crash>crash> disdis proc_reg_readproc_reg_read 0xffffffff812900f00xffffffff812900f0
26 DEMO mm_struct
● task_struct -> mm
structstruct mm_structmm_struct {{ structstruct vm_area_structvm_area_struct *mmap;*mmap; /*/* listlist ofof VMAsVMAs */*/ structstruct rb_rootrb_root mm_rb;mm_rb; pgd_tpgd_t ** pgd;pgd; unsignedunsigned longlong start_code,start_code, end_code,end_code, start_data,start_data, end_data;end_data; unsignedunsigned longlong start_brk,start_brk, brk,brk, start_stack;start_stack; }}
28 vm_area_struct
● mm_struct -> mmap
structstruct vm_area_structvm_area_struct {{ unsignedunsigned longlong vm_start;vm_start; unsignedunsigned longlong vm_end;vm_end; structstruct filefile ** vm_file;vm_file; structstruct anon_vmaanon_vma ** anon_vma_chain;anon_vma_chain; }}
29 Walk the memory map of a process
● pmap(1) ● task_struct→ mm → mmap → vm_start
$$ pmappmap $(pgrep$(pgrep bash)bash)
crash>crash> listlist -o-o vm_area_struct.vm_nextvm_area_struct.vm_next
30 DEMO Page tables
https://www.kernel.org/doc/gorman/html/understand/understand006.html
32 Page tables (2)
● mm_struct->pgd
crash>crash> rdrd -64-64 0xffff88007bed90000xffff88007bed9000 10241024 ffff88007bed9000:ffff88007bed9000: 00000000000000000000000000000000 00000000000000000000000000000000 ...... ffff88007bed9010:ffff88007bed9010: 00000000000000000000000000000000 00000000000000000000000000000000 ...... ffff88007bed9020:ffff88007bed9020: 00000000000000000000000000000000 00000000000000000000000000000000 ...... ffff88007bed9030:ffff88007bed9030: 00000000000000000000000000000000 00000000000000000000000000000000 ......
33 DEMO kmem
● [-i] -- overview
● [-p] -- Page info, walks mem_mam
● [-s] -- Slab statistic
crash>crash> kmemkmem -s-s ext4_inode_cacheext4_inode_cache CACHECACHE NAMENAME OBJSIZEOBJSIZE ALLOCATEDALLOCATED TOTALTOTAL SLABSSLABS SSIZESSIZE ffff880079fa1000ffff880079fa1000 ext4_inode_cacheext4_inode_cache 10161016 4872848728 4872848728 60916091 8k8k crash>crash> structstruct ext4_inode_infoext4_inode_info || tailtail -n-n 11 SIZE:SIZE: 0x3f80x3f8
35 Slab allocator
crash>crash> kmemkmem -s-s ext4_inode_cacheext4_inode_cache CACHECACHE NAMENAME OBJSIZEOBJSIZE ALLOCATEDALLOCATED TOTALTOTAL SLABSSLABS SSIZESSIZE ffff880079fa1000ffff880079fa1000 ext4_inode_cacheext4_inode_cache 10161016 4872848728 4872848728 60916091 8k8k
ext4_inode_cachepext4_inode_cachep == kmem_cache_create("ext4_inode_cache",kmem_cache_create("ext4_inode_cache", sizeof(structsizeof(struct ext4_inode_info),ext4_inode_info), 0,0, (SLAB_RECLAIM_ACCOUNT|(SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD),SLAB_MEM_SPREAD), init_once);init_once);init_once);init_once);
36 DEMO CPUs
● There's no “cpu_t” structure
● Instead:
● DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
● O(1) scheduler replaced with CFS
● Uses RB tree
38 runq
● runq
crash>crash> runqrunq CPUCPU 00 RUNQUEUE:RUNQUEUE: ffff88007fc16c80ffff88007fc16c80 CURRENT:CURRENT: PID:PID: 68646864 TASK:TASK: ffff880036850000ffff880036850000 COMMAND:COMMAND: "cat""cat" RTRT PRIO_ARRAY:PRIO_ARRAY: ffff88007fc16e30ffff88007fc16e30 [no[no taskstasks queued]queued] CFSCFS RB_ROOT:RB_ROOT: ffff88007fc16d20ffff88007fc16d20 [120][120] PID:PID: 64646464 TASK:TASK: ffff880036868000ffff880036868000 COMMAND:COMMAND: "kworker/0:1""kworker/0:1"
39 DEMO THANK YOU
plus.google.com/+RedHat facebook.com/redhatinc
linkedin.com/company/red-hat twitter.com/RedHatNews
youtube.com/user/RedHatVideos