Walking the Kernel

Stanislav Kozina, Associate Manager
April 2016

AGENDA

Linux Kernel in general

Debugging kernel issues without crash

Prepare the system for crashing

Crash it with

See what we can get from the crash dump

Linux Kernel in general – boring

● Just software…

● Written in C & asm

● List of expected features

● Boot and initialization

● Memory and process management

● Files, directories, sockets, …

● Resource abstraction

● CPU, memory, …

● POSIX

Linux Kernel in general – BUT!

● Quite big (20mil LOC)

● No libc (many other standard functions instead)

● Special environment

● Preemptive, shared memory space

● Early boot code is tricky

● No dynamic allocator, APs (secondary CPUs), scheduler, or even locks!

● Special security requirements

● Kernel should not just die and/or leak anything

Linux Kernel in general – development system

source

● List of maintainers in MAINTAINERS

● Patches posted via email

● Documentation/SubmittingPatches

● LKML

● Usually companies care about support of their stuff

● Hardware vendors…

● “We don't break userland”

DIGGING IN

Observing the kernel is not trivial

● Hard to get a consistent picture

● If we stop it, how do we observe it?

● Using VMs?

● printk()

● Statistics, /proc, ftrace, ...

Debugging kernel issues

● Oops messages

● Message buffer, registers, stack/backtrace

● Oops leaves the system running, but unstable!

● Current task is killed

● printk()

● Systemtap, ftrace

● crash

● Sysrq triggers

Kernel oops

[ 32.580355] SysRq : Trigger a crash
[ 32.581331] BUG: unable to handle kernel NULL pointer dereference at
[ 32.582703] IP: [] sysrq_handle_crash+0x16/0x20
[ 32.583781] PGD 3b503067 PUD 3b502067 PMD 0
[ 32.584609] Oops: 0002 [#1] SMP
[ 32.585210] Modules linked in: ip6t_rpfilter (...)
[ 32.598062] CPU: 1 PID: 2370 Comm: bash Not tainted 3.10.0-327.el7.x86_64 #1
[ 32.598923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 32.600020] task: ffff88003b544500 ti: ffff88003b518000 task.ti: ffff88003b518000
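The "Oops: 0002" field is the x86 page-fault error code in hex. A minimal sketch of decoding its low bits follows. The function name decode_oops_error_code is ours, not a kernel or crash API, and only the five architecturally defined low bits are covered.

```python
# Low bits of the x86 page-fault error code, as printed after "Oops:".
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_error_code(code_hex):
    """Return human-readable facts for an Oops error code given as hex text."""
    code = int(code_hex, 16)
    facts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            facts.append(if_set)
        elif if_clear is not None:
            facts.append(if_clear)
    return facts


if __name__ == "__main__":
    # The value from the oops above.
    print(decode_oops_error_code("0002"))
```

For 0002 this reports a kernel-mode write to a page that is not present, which matches the NULL pointer dereference message.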

Kernel oops 2

[ 32.600875] RIP: 0010:[] [] sysrq_handle_crash+0x16/0x20
[ 32.601864] RSP: 0018:ffff88003b51be80 EFLAGS: 00010046
[ 32.602469] RAX: 000000000000000f RBX: ffffffff81a06220 RCX: 0000000000000000
[ 32.603281] RDX: 0000000000000000 RSI: ffff88003fd0d6c8 RDI: 0000000000000063
[ 32.604092] RBP: ffff88003b51be80 R08: 0000000000000092 R09: 0000000000000268
[ 32.606518] FS: 00007fa978f3b740(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[ 32.607431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
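The register dump can be read bit by bit. As a small sketch, the helper below decodes the EFLAGS value from the dump. The table lists only a selection of the architectural flag bits, and the function names are ours.

```python
# Selected x86 EFLAGS bits (bit position -> mnemonic).
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF", 16: "RF",
}


def decode_eflags(value_hex):
    """Return the mnemonics of the flags set in an EFLAGS value (hex text)."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if (value >> bit) & 1]


def interrupts_enabled(value_hex):
    """IF (bit 9) set means maskable interrupts were enabled."""
    return "IF" in decode_eflags(value_hex)


if __name__ == "__main__":
    print(decode_eflags("00010046"), interrupts_enabled("00010046"))
```

For 00010046 the IF bit is clear, so maskable interrupts were disabled on that CPU at the time of the oops.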

Kernel oops 3

[ 32.613426] Call Trace:
[ 32.613724] [] __handle_sysrq+0xa2/0x170
[ 32.614359] [] write_sysrq_trigger+0x2f/0x40
[ 32.615049] [] proc_reg_write+0x3d/0x80
[ 32.615685] [] vfs_write+0xbd/0x1e0
[ 32.616286] [] ? trace_do_page_fault+0x43/0x110
[ 32.616999] [] SyS_write+0x7f/0xe0
[ 32.617581] [] system_call_fastpath+0x16/0x1b
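Each call trace entry has the form symbol+0xoffset/0xsize: the offset into the function and the function's total length. A leading "?" marks an address that was found on the stack but that the unwinder could not confirm as part of the call chain. The sketch below pulls those fields apart. The regular expression and the parse_frame name are our own, written against the lines shown here rather than any official format specification.

```python
import re

# "symbol+0xoffset/0xsize", optionally preceded by "? " for unreliable entries.
FRAME_RE = re.compile(
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_frame(line):
    """Parse one call trace line; return None if it holds no frame."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "reliable": m.group("unreliable") is None,
    }


if __name__ == "__main__":
    print(parse_frame("[ 32.613724] [] __handle_sysrq+0xa2/0x170"))
    print(parse_frame("[ 32.616286] [] ? trace_do_page_fault+0x43/0x110"))
```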

Kernel oops

● Good enough when

● The problem is in a direct callpath

● Not sufficient when

● We need to see more context

● Other tasks

● Structure content

● Be aware of -fomit-frame-pointer

● Reliable address only if %rsp == %rbp + sizeof (long)

● CONFIG_FRAME_POINTER

Getting a crash dump

● Need to configure

● Failsafe crash kernel loaded in memory

● Switch to the new kernel via

● Run kdump to save the content of /proc/vmcore

# echo c > /proc/sysrq-trigger
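Before triggering a crash on purpose, it is worth confirming that a crash kernel is actually loaded; otherwise there is nothing to switch to and no dump is saved. The kernel exposes this as a 0/1 value in /sys/kernel/kexec_crash_loaded. A minimal sketch of that check follows. The function name is ours, and the path parameter exists only so the function can be exercised against an ordinary file.

```python
def crash_kernel_loaded(path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the kernel reports that a crash kernel is loaded."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as "not loaded".
        return False


if __name__ == "__main__":
    if crash_kernel_loaded():
        print("crash kernel loaded; a sysrq 'c' should produce a dump")
    else:
        print("no crash kernel loaded; a crash would not be captured")
```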

Opening a crash dump

● Dump of all kernel memory

● Zero pages, caches, buffers and userland are not dumped by default

● Can be opened in gdb

● But we need DWARF debugging symbols

● kernel-debuginfo packages in Fedora/CentOS/RHEL

● crash(8) tool

● Can run on live system through /proc/kcore as well

DEMO

Using crash(8)

● mod -S

● set hex

● sys

● log

● mount

● swap

● mach [-m | -c]

● net

Linux tasks

● include/linux/sched.h: struct task_struct {}

● clone(2)

● CLONE_THREAD:

● “Since Linux 2.4, calls to getpid(2) return the TGID of the caller.”
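The pid/tgid split can be observed from userland. In the sketch below a second thread reports what it sees: getpid() is the same in every thread (the TGID), while the native thread id differs per thread (on Linux this is the kernel's per-task pid). The helper name is ours, and threading.get_native_id() needs Python 3.8 or later.

```python
import os
import threading


def ids_seen_by_thread():
    """Run a second thread and return (pid, native thread id) seen inside it."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()
        seen["tid"] = threading.get_native_id()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return seen["pid"], seen["tid"]


if __name__ == "__main__":
    pid, tid = ids_seen_by_thread()
    print("main:   pid", os.getpid(), "tid", threading.get_native_id())
    print("thread: pid", pid, "tid", tid)
```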

task_struct

struct task_struct {
    long state;
    void *stack;
    struct sched_entity se;
    struct mm_struct *mm;
    int exit_code;
    pid_t pid, tgid;
    struct task_struct *parent;
    const struct cred __rcu *cred;
    char comm[TASK_COMM_LEN];
    struct files_struct *files;
};

Process states in Linux

● ps

● Walk task_struct in the kernel

● ps -u | -k | -G -- show only user | kernel | thread group leader tasks

● ps -p | -c -- walk parent/child relationship

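● ps -S -- summary of task states

[Figure: process state diagram. States: R (running), S (interruptible sleep), D (uninterruptible sleep), T (stopped), Z (zombie). Transition labels: woken, signal, woken / signal, exit.]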

https://idea.popcount.org/2012-12-11-linux-process-states/
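The same state letters are visible from userland in /proc/<pid>/stat, where the state is the field right after the parenthesised command name. A minimal sketch of extracting it follows. The helper name and the sample line in the usage part are ours.

```python
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}


def parse_stat_state(stat_line):
    """Extract (comm, state letter) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut at the first '(' and the last ')' instead of splitting on spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return comm, state


if __name__ == "__main__":
    # Made-up sample line, shortened to the first few fields.
    comm, state = parse_stat_state("6864 (cat) R 2884 6864 2884")
    print(comm, state, STATE_NAMES.get(state, "other"))
```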

Process tree

● ps -p

crash> ps -p 6864
PID: 0 TASK: ffffffff81c124c0 CPU: 0 COMMAND: "swapper/0"
 PID: 1 TASK: ffff88007c888000 CPU: 0 COMMAND: "systemd"
  PID: 784 TASK: ffff8800369fbc00 CPU: 0 COMMAND: "login"
   PID: 2884 TASK: ffff880036a19e00 CPU: 0 COMMAND: "bash"
    PID: 6864 TASK: ffff880036850000 CPU: 0 COMMAND: "cat"
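What ps -p does is follow parent pointers from one task up to PID 0. The sketch below models that walk with a plain dictionary standing in for task_struct->parent. The PIDs and command names are transcribed from the ps -p output above, and PID 0 is modelled as its own parent so the walk has a place to stop.

```python
# pid -> (parent pid, command). A stand-in for following task_struct->parent.
TASKS = {
    0: (0, "swapper/0"),
    1: (0, "systemd"),
    784: (1, "login"),
    2884: (784, "bash"),
    6864: (2884, "cat"),
}


def ancestry(pid, tasks=TASKS):
    """Follow parent links from pid up to the root; return the chain root-first."""
    chain = []
    while True:
        parent, comm = tasks[pid]
        chain.append((pid, comm))
        if parent == pid:  # reached the task that is its own parent
            break
        pid = parent
    chain.reverse()
    return chain


if __name__ == "__main__":
    for depth, (pid, comm) in enumerate(ancestry(6864)):
        print(" " * depth + f"PID: {pid} COMMAND: {comm!r}")
```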

DEMO

Backtrace

● bt -s

#8 [ffff88007abefd88] ping_v4_seq_show+0x5 at ffffffff816f6a15
#9 [ffff88007abefd98] seq_read+0xec at ffffffff81246b7c
#10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132
#11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387
#12 [ffff88007abefed0] vfs_read+0x83 at ffffffff81223e43
#13 [ffff88007abeff08] sys_read+0x55 at ffffffff81224b95
#14 [ffff88007abeff50] entry_SYSCALL_64_fastpath+0x12 at ffffffff81781e2e

Full stack dump

● bt -fs

ffff88007abefe10: 0000000000020000 0000000000020000
ffff88007abefe20: ffff88007abefe40 ffffffff81290132
#10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132
ffff88007abefe30: ffff88007c015b00 ffff88007abeff18
ffff88007abefe40: ffff88007abefec8 ffffffff81223387
#11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387
ffff88007abefe50: 0000000000020000 ffff88007c015b00
ffff88007abefe60: ffff88007c015b10 ffff880036575148
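With frame pointers enabled, each frame stores the caller's saved %rbp, and the return address sits one word above it. Following that chain through the dumped words recovers the return addresses without any symbol information. The sketch below does this over the stack words transcribed from the dump above. The function name and the choice of starting %rbp are ours.

```python
# Stack words transcribed from the bt -fs output (address -> 64-bit value).
STACK = {
    0xffff88007abefe10: 0x0000000000020000,
    0xffff88007abefe18: 0x0000000000020000,
    0xffff88007abefe20: 0xffff88007abefe40,
    0xffff88007abefe28: 0xffffffff81290132,
    0xffff88007abefe30: 0xffff88007c015b00,
    0xffff88007abefe38: 0xffff88007abeff18,
    0xffff88007abefe40: 0xffff88007abefec8,
    0xffff88007abefe48: 0xffffffff81223387,
    0xffff88007abefe50: 0x0000000000020000,
    0xffff88007abefe58: 0xffff88007c015b00,
    0xffff88007abefe60: 0xffff88007c015b10,
    0xffff88007abefe68: 0xffff880036575148,
}


def walk_frame_pointers(rbp, stack, word=8):
    """Return the return addresses found by following saved %rbp values."""
    returns = []
    while rbp in stack and rbp + word in stack:
        returns.append(stack[rbp + word])  # return address above saved %rbp
        caller_rbp = stack[rbp]            # saved %rbp of the caller
        if caller_rbp <= rbp:              # callers live at higher addresses
            break
        rbp = caller_rbp
    return returns


if __name__ == "__main__":
    for addr in walk_frame_pointers(0xffff88007abefe20, STACK):
        print(hex(addr))
```

Starting at ffff88007abefe20, the walk yields the same two addresses that bt reports for frames #10 and #11, then stops because the next saved %rbp points outside the transcribed excerpt.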


Disassemble

crash> dis proc_reg_read
0xffffffff812900f0 : nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff812900f5 : push %rbp
0xffffffff812900f6 : xor %r8d,%r8d
0xffffffff812900f9 : mov %rsp,%rbp
0xffffffff812900fc : push %r12
0xffffffff812900fe : push %rbx

DEMO

mm_struct

● task_struct -> mm

struct mm_struct {
    struct vm_area_struct *mmap; /* list of VMAs */
    struct rb_root mm_rb;
    pgd_t *pgd;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
};

vm_area_struct

● mm_struct -> mmap

struct vm_area_struct {
    unsigned long vm_start;
    unsigned long vm_end;
    struct file *vm_file;
    struct anon_vma *anon_vma_chain;
};

Walk the memory map of a process

● pmap(1)

● task_struct → mm → mmap → vm_start

$ pmap $(pgrep bash)

crash> list -o vm_area_struct.vm_next
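The userland counterpart of this walk is /proc/<pid>/maps, where each line describes one mapping with its start and end address (vm_start, vm_end) and the backing file, if any (vm_file). A minimal parsing sketch follows. The function names and the sample lines in the usage part are ours.

```python
def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line into a dict."""
    # Fields: address range, perms, offset, dev, inode, optional path.
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "size_kib": (end - start) // 1024,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


def read_maps(pid="self"):
    """Return the parsed memory map of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f]


if __name__ == "__main__":
    # Made-up sample lines in the /proc/<pid>/maps format.
    print(parse_maps_line("00400000-004f4000 r-xp 00000000 fd:00 1234 /usr/bin/bash\n"))
    print(parse_maps_line("7ffd1000-7ffd3000 rw-p 00000000 00:00 0\n"))
```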

DEMO

Page tables

https://www.kernel.org/doc/gorman/html/understand/understand006.html

Page tables (2)

● mm_struct->pgd

crash> rd -64 0xffff88007bed9000 1024
ffff88007bed9000: 0000000000000000 0000000000000000 ...
ffff88007bed9010: 0000000000000000 0000000000000000 ...
ffff88007bed9020: 0000000000000000 0000000000000000 ...
ffff88007bed9030: 0000000000000000 0000000000000000 ...
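To find the entry for a given address in the table that mm_struct->pgd points to, the virtual address is cut into index fields. The sketch below assumes x86_64 with 4-level paging and 4 KiB pages (9 bits per level, 12 bits of page offset). The function names are ours. The sample address is the FS base from the oops earlier in the deck and the table base is the one passed to rd above; the two come from different demos and are paired here only to have numbers to work with.

```python
def split_vaddr(vaddr):
    """Split an x86_64 virtual address into 4-level page-table indices."""
    return {
        "pgd": (vaddr >> 39) & 0x1ff,
        "pud": (vaddr >> 30) & 0x1ff,
        "pmd": (vaddr >> 21) & 0x1ff,
        "pte": (vaddr >> 12) & 0x1ff,
        "offset": vaddr & 0xfff,
    }


def pgd_entry_address(pgd_base, vaddr, entry_size=8):
    """Address of the top-level entry covering vaddr, given the table base."""
    return pgd_base + split_vaddr(vaddr)["pgd"] * entry_size


if __name__ == "__main__":
    vaddr = 0x00007fa978f3b740
    print(split_vaddr(vaddr))
    print(hex(pgd_entry_address(0xffff88007bed9000, vaddr)))
```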

DEMO

kmem

● [-i] -- overview

● [-p] -- Page info, walks mem_map

● [-s] -- Slab statistics

crash> kmem -s ext4_inode_cache
CACHE            NAME              OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff880079fa1000 ext4_inode_cache     1016      48728  48728   6091     8k
crash> struct ext4_inode_info | tail -n 1
SIZE: 0x3f8
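The numbers in this output can be cross-checked with a little arithmetic: the struct size matches OBJSIZE, and SSIZE divided by OBJSIZE gives the objects per slab, which in turn matches TOTAL divided by SLABS. The sketch below works through it with the values from the kmem -s line. It ignores any per-slab metadata, which is an assumption on our part, and the helper name is ours.

```python
def objects_per_slab(slab_size, obj_size):
    """Whole objects that fit in one slab, ignoring any slab metadata."""
    return slab_size // obj_size


# Values transcribed from the kmem -s and struct output.
OBJSIZE = 1016
TOTAL = 48728
SLABS = 6091
SSIZE = 8 * 1024
STRUCT_SIZE = int("0x3f8", 16)

if __name__ == "__main__":
    per_slab = objects_per_slab(SSIZE, OBJSIZE)
    print("struct size == OBJSIZE:", STRUCT_SIZE == OBJSIZE)
    print("objects per slab:", per_slab)
    print("slabs * objects per slab:", SLABS * per_slab, "TOTAL:", TOTAL)
    print("unused bytes per slab:", SSIZE - per_slab * OBJSIZE)
```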

Slab allocator

crash> kmem -s ext4_inode_cache
CACHE            NAME              OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff880079fa1000 ext4_inode_cache     1016      48728  48728   6091     8k

ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
        sizeof(struct ext4_inode_info), 0,
        (SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD),
        init_once);

DEMO

CPUs

● There's no “cpu_t” structure

● Instead:

● DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

● O(1) scheduler replaced with CFS

● Uses RB tree
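The idea behind CFS is to keep runnable tasks ordered by virtual runtime and always run the one that has had the least, which is the leftmost node of the RB tree. The toy model below shows only that selection rule. It uses Python's heapq as a stand-in for the red-black tree and ignores task weights, so it is a sketch of the principle rather than the kernel's algorithm. All names in it are ours.

```python
import heapq


class ToyCfsRunqueue:
    """Toy runqueue: always hands out the task with the smallest vruntime."""

    def __init__(self):
        self._heap = []  # entries: (vruntime, insertion order, task name)
        self._seq = 0

    def enqueue(self, name, vruntime):
        heapq.heappush(self._heap, (vruntime, self._seq, name))
        self._seq += 1

    def pick_next(self):
        vruntime, _, name = heapq.heappop(self._heap)
        return name, vruntime


def simulate(tasks, slices, slice_ns=1_000_000):
    """Run `slices` time slices and return the order in which tasks ran."""
    rq = ToyCfsRunqueue()
    for name, vruntime in tasks:
        rq.enqueue(name, vruntime)
    order = []
    for _ in range(slices):
        name, vruntime = rq.pick_next()
        order.append(name)
        # The task ran for one slice, so its virtual runtime advances.
        rq.enqueue(name, vruntime + slice_ns)
    return order


if __name__ == "__main__":
    print(simulate([("a", 0), ("b", 500_000)], 4))
```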

runq

● runq

crash> runq
CPU 0 RUNQUEUE: ffff88007fc16c80
  CURRENT: PID: 6864 TASK: ffff880036850000 COMMAND: "cat"
  RT PRIO_ARRAY: ffff88007fc16e30
     [no tasks queued]
  CFS RB_ROOT: ffff88007fc16d20
     [120] PID: 6464 TASK: ffff880036868000 COMMAND: "kworker/0:1"

DEMO

THANK YOU
