Walking the Linux Kernel
Total Page:16
File Type:pdf, Size:1020Kb
Walking the Linux Kernel Stanislav Kozina Associate Manager April 2016 AGENDA Linux Kernel in general Debugging kernel issues without crash Prepare the system for crashing Crash it with systemtap See what we can get from the crash dump 2 Linux Kernel in general – boring ● Just software… ● Written in C & asm ● List of expected features ● Boot and initialization process ● Memory and process management ● Hardware abstraction ● Files, directories, sockets, … ● Resources abstraction ● CPU, memory, … ● POSIX 3 Linux Kernel in general – BUT! ● Quite big (20mil LOC) ● No libc (many other standards functions instead) ● Special environment ● Preemptive, shared memory space ● Early boot code is tricky ● No dynamic allocator, AP, scheduler, even locks! ● Special security requirements ● Kernel should not just die and/or leak anything 4 Linux Kernel in general – development system ● Open source ● List of maintainers in MAINTAINERS ● Patches posted via email ● Documentation/SubmittingPatches ● LKML ● Usually companies care about support of their stuff ● Hardware vendors… ● “We don't break userland” 5 DIGGING IN Observing kernel is not trivial ● Hard to get a consistent picture ● If we stop it, how we observe it? ● Using Vms? ● printk() ● Statistics, /proc, perf, strace, ftrace, ... 7 8 Debugging kernel issues ● Oops messages ● Message buffer, registers, stack/backtrace ● Oops leaves the system running, but unstable! ● Current task is killed ● Printk() ● Systemtap, ftrace ● crash ● Sysrq triggers 9 Kernel oops [[ 32.580355]32.580355] SysRqSysRq :: TriggerTrigger aa crashcrash [[ 32.581331]32.581331] BUG:BUG: unableunable toto handlehandle kernelkernel NULLNULL pointerpointer dereferencedereference atat [[ 32.582703]32.582703] IP:IP: [<ffffffff813b9716>][<ffffffff813b9716>] sysrq_handle_crash+0x16/0x20sysrq_handle_crash+0x16/0x20 [[ 32.583781]32.583781] PGDPGD 3b5030673b503067 PUDPUD 3b5020673b502067 PMDPMD 00 [[ 32.584609]32.584609] Oops:Oops: 00020002 [#1][#1] SMPSMP [[ 32.585210]32.585210] ModulesModules linkedlinked in:in: ip6t_rpfilterip6t_rpfilter (...(... )) [[ 32.598062]32.598062] CPU:CPU: 11 PID:PID: 23702370 Comm:Comm: bashbash NotNot taintedtainted 3.10.0-3.10.0- 327.el7.x86_64327.el7.x86_64 #1#1 [[ 32.598923]32.598923] HardwareHardware name:name: QEMUQEMU StandardStandard PCPC (i440FX(i440FX ++ PIIX,PIIX, 1996),1996), BIOSBIOS 1.8.2-20150714_191134-1.8.2-20150714_191134- 04/01/201404/01/2014 [[ 32.600020]32.600020] task:task: ffff88003b544500ffff88003b544500 ti:ti: ffff88003b518000ffff88003b518000 task.ti:task.ti: ffff88003b518000ffff88003b518000 10 Kernel oops 2 [[ 32.600875]32.600875] RIP:RIP: 0010:[<ffffffff813b9716>]0010:[<ffffffff813b9716>] [<ffffffff813b9716>][<ffffffff813b9716>] sysrq_handle_crash+0x16/0x20sysrq_handle_crash+0x16/0x20 [[ 32.601864]32.601864] RSP:RSP: 0018:ffff88003b51be800018:ffff88003b51be80 EFLAGS:EFLAGS: 0001004600010046 [[ 32.602469]32.602469] RAX:RAX: 000000000000000f000000000000000f RBX:RBX: ffffffff81a06220ffffffff81a06220 RCX:RCX: 00000000000000000000000000000000 [[ 32.603281]32.603281] RDX:RDX: 00000000000000000000000000000000 RSI:RSI: ffff88003fd0d6c8ffff88003fd0d6c8 RDI:RDI: 00000000000000630000000000000063 [[ 32.604092]32.604092] RBP:RBP: ffff88003b51be80ffff88003b51be80 R08:R08: 00000000000000920000000000000092 R09:R09: 00000000000002680000000000000268 [[ 32.606518]32.606518] FS:FS: 00007fa978f3b740(0000)00007fa978f3b740(0000) GS:ffff88003fd00000(0000)GS:ffff88003fd00000(0000) knlGS:0000000000000000knlGS:0000000000000000 [[ 32.607431]32.607431] CS:CS: 00100010 DS:DS: 00000000 ES:ES: 00000000 CR0:CR0: 00000000800500330000000080050033 11 Kernel oops 3 [[ 32.613426]32.613426] CallCall Trace:Trace: [[ 32.613724]32.613724] [<ffffffff813b9ed2>][<ffffffff813b9ed2>] __handle_sysrq+0xa2/0x170__handle_sysrq+0xa2/0x170 [[ 32.614359]32.614359] [<ffffffff813ba3af>][<ffffffff813ba3af>] write_sysrq_trigger+0x2f/0x40write_sysrq_trigger+0x2f/0x40 [[ 32.615049]32.615049] [<ffffffff812492ad>][<ffffffff812492ad>] proc_reg_write+0x3d/0x80proc_reg_write+0x3d/0x80 [[ 32.615685]32.615685] [<ffffffff811de5cd>][<ffffffff811de5cd>] vfs_write+0xbd/0x1e0vfs_write+0xbd/0x1e0 [[ 32.616286]32.616286] [<ffffffff816411b3>][<ffffffff816411b3>] ?? trace_do_page_fault+0x43/0x110trace_do_page_fault+0x43/0x110 [[ 32.616999]32.616999] [<ffffffff811df06f>][<ffffffff811df06f>] SyS_write+0x7f/0xe0SyS_write+0x7f/0xe0 [[ 32.617581]32.617581] [<ffffffff81645909>][<ffffffff81645909>] system_call_fastpath+0x16/0x1bsystem_call_fastpath+0x16/0x1b 12 Kernel oops ● Good enough when ● The problem is in a direct callpath ● Not sufficient when ● We need to see more context ● Other tasks ● Structure content ● Be aware of -fomit-frame-pointer ● Reliable address only if %rsp == %rbp + sizeof (long) ● CONFIG_FRAME_POINTER 13 Getting crash dump ● Need to configure kdump ● Failsafe crash kernel loaded in a memory ● Switch to the new kernel via kexec ● Run kdump to save to content of /proc/vmcore ## echoecho cc >> /proc/sysrq-trigger/proc/sysrq-trigger 14 Opening a crash dump ● Dump of all kernel memory ● Zero pages, caches, buffers and userland are not dumped by default ● Can be opened in gdb ● But we need DWARF debugging symbols ● Kernel-debuginfo packages in Fedora/CentOS/RHEL ● Crash(8) tool ● Can run on live system through /proc/kcore as well 15 DEMO Using crash(8) ● mod -S ● set hex ● sys ● log ● mount ● swap ● mach [-m | -c] ● net 17 Linux tasks ● include/linux/sched.h: struct task_struct {} ● clone(2) ● CLONE_THREAD: ● “Since Linux 2.4, calls to getpid(2) return the TGID of the caller.” 18 task_struct structstruct task_structtask_struct {{ longlong state;state; voidvoid *stack;*stack; structstruct sched_entitysched_entity se;se; structstruct mm_structmm_struct *mm;*mm; intint exit_code;exit_code; pid_tpid_t pid,pid, tgid;tgid; structstruct task_structtask_struct *parent;*parent; constconst structstruct credcred __rcu__rcu *cred;*cred; charchar comm[TASK_COMM_LEN];comm[TASK_COMM_LEN]; structstruct files_structfiles_struct *files;*files; }} 19 Process states in Linux ● ps ● Walk task_struct in the kernel ● -u | -k | -G -- filter only user|kernel|thread group leader tasks ● ps -p | -c -- walk parent/child relationship ● ps -S -- summary of task states uninterruptable sleep D woken signal T R exit interruptable sleep S Z woken / signal https://idea.popcount.org/2012-12-11-linux-process-states/ 20 Process tree ● ps -p crash>crash> psps -p-p 68646864 PID:PID: 00 TASK:TASK: ffffffff81c124c0ffffffff81c124c0 CPU:CPU: 00 COMMAND:COMMAND: "swapper/0""swapper/0" PID:PID: 11 TASK:TASK: ffff88007c888000ffff88007c888000 CPU:CPU: 00 COMMAND:COMMAND: "systemd""systemd" PID:PID: 784784 TASK:TASK: ffff8800369fbc00ffff8800369fbc00 CPU:CPU: 00 COMMAND:COMMAND: "login""login" PID:PID: 28842884 TASK:TASK: ffff880036a19e00ffff880036a19e00 CPU:CPU: 00 COMMAND:COMMAND: "bash""bash" PID:PID: 68646864 TASK:TASK: ffff880036850000ffff880036850000 CPU:CPU: 00 COMMAND:COMMAND: "cat""cat" 21 DEMO Backtrace ● bt -s #8#8 [ffff88007abefd88][ffff88007abefd88] ping_v4_seq_show+0x5ping_v4_seq_show+0x5 atat ffffffff816f6a15ffffffff816f6a15 #9#9 [ffff88007abefd98][ffff88007abefd98] seq_read+0xecseq_read+0xec atat ffffffff81246b7cffffffff81246b7c #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 #12#12 [ffff88007abefed0][ffff88007abefed0] vfs_read+0x83vfs_read+0x83 atat ffffffff81223e43ffffffff81223e43 #13#13 [ffff88007abeff08][ffff88007abeff08] sys_read+0x55sys_read+0x55 atat ffffffff81224b95ffffffff81224b95 #14#14 [ffff88007abeff50][ffff88007abeff50] entry_SYSCALL_64_fastpath+0x12entry_SYSCALL_64_fastpath+0x12 at at ffffffff81781e2effffffff81781e2e 23 Full stack dump ● bt -fs ffff88007abefe10:ffff88007abefe10: 00000000000200000000000000020000 00000000000200000000000000020000 ffff88007abefe20:ffff88007abefe20: ffff88007abefe40ffff88007abefe40 ffffffff81290132ffffffff81290132 #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 ffff88007abefe30:ffff88007abefe30: ffff88007c015b00ffff88007c015b00 ffff88007abeff18ffff88007abeff18 ffff88007abefe40:ffff88007abefe40: ffff88007abefec8ffff88007abefec8 ffffffff81223387ffffffff81223387 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 ffff88007abefe50:ffff88007abefe50: 00000000000200000000000000020000 ffff88007c015b00ffff88007c015b00 ffff88007abefe60:ffff88007abefe60: ffff88007c015b10ffff88007c015b10 ffff880036575148ffff880036575148 24 Full stack dump ● bt -fs ffff88007abefe10:ffff88007abefe10: 00000000000200000000000000020000 00000000000200000000000000020000 ffff88007abefe20:ffff88007abefe20:ffff88007abefe40 ffff88007abefe40 ffffffff81290132ffffffff81290132 #10#10 [ffff88007abefe28][ffff88007abefe28] proc_reg_read+0x42proc_reg_read+0x42 atat ffffffff81290132ffffffff81290132 ffff88007abefe30:ffff88007abefe30: ffff88007c015b00ffff88007c015b00 ffff88007abeff18ffff88007abeff18 ffff88007abefe40:ffff88007abefe40:ffff88007abefec8 ffff88007abefec8 ffffffff81223387ffffffff81223387 #11#11 [ffff88007abefe48][ffff88007abefe48] __vfs_read+0x37__vfs_read+0x37 atat ffffffff81223387ffffffff81223387 ffff88007abefe50:ffff88007abefe50: 00000000000200000000000000020000 ffff88007c015b00ffff88007c015b00 ffff88007abefe60:ffff88007abefe60: ffff88007c015b10ffff88007c015b10 ffff880036575148ffff880036575148 25 Disassemble crash>crash> disdis proc_reg_readproc_reg_read