Everything’s a File Descriptor

Josh Triplett [email protected]

Linux Plumbers Conference 2015 “Everything’s a file” I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0 I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT Everything has a filename? Everything has a filename? Everything//////////////////////////////// has a filename? I Pipes

I Sockets

I

I memfd

I KVM virtual machines and CPUs

I ... Everything’s a file descriptor I What is a file descriptor, really?

I What can you do with a file descriptor?

I What interesting file descriptors exist?

I How do you build a new type of file descriptors?

I What interesting file descriptors don’t exist? I What is a file descriptor, really?

I What can you do with a file descriptor?

I What interesting file descriptors exist?

I How do you build a new type of file descriptors?

I What interesting file descriptors don’t exist yet? What is a file descriptor, really? I struct fd, struct fdtable

I struct file (x, &c, 1); putchar(c); read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c);

struct fd versus struct file

testfile contains “0123456789”

x = ("testfile", O_RDONLY); xdup = (x); y = open("testfile", O_RDONLY); struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c); struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c); struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); /* Prints ’1’ */ read(y, &c, 1); putchar(c); struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); /* Prints ’1’ */ read(y, &c, 1); putchar(c); /* Prints ’0’ */ x f pos: f count:

f pos: 0 xdup f count: 1

y

userspace int kernel object driver-specific

struct fd struct file testfile

0

1

2

3

. . f pos: 0 xdup f count: 1

y

userspace int kernel object driver-specific

struct fd struct file testfile

f pos: 0 x 0 f count: 1

1

2

3

. . f pos: 0 f count: 1

y

userspace int kernel object driver-specific

struct fd struct file testfile

f pos: 0 x 0 f count:2

1 xdup 2

3

. . userspace int kernel object driver-specific

struct fd struct file testfile

f pos: 0 x 0 f count:2

1 f pos: 0 xdup f count: 1 2

y 3

. . userspace int kernel object driver-specific

struct fd struct file testfile

f pos:1 x 0 f count:2

1 f pos: 0 xdup f count: 1 2

y 3

. . userspace int kernel object driver-specific

struct fd struct file testfile

f pos:2 x 0 f count:2

1 f pos: 0 xdup f count: 1 2

y 3

. . userspace int kernel object driver-specific

struct fd struct file testfile

f pos:2 x 0 f count:2

1 f pos:1 xdup f count: 1 2

y 3

. . kernel object driver-specific

struct fd struct file testfile

f pos:2 x 0 f count:2

1 f pos:1 xdup f count: 1 2

y 3

. . userspace int driver-specific

struct fd struct file testfile

f pos:2 x 0 f count:2

1 f pos:1 xdup f count: 1 2

y 3

. . userspace int kernel object struct fd struct file testfile

f pos:2 x 0 f count:2

1 f pos:1 xdup f count: 1 2

y 3

. . userspace int kernel object driver-specific File descriptor: Userspace reference to kernel object What can you do with a file descriptor? I seek

I preadv, pwritev

I

I Blocking or non-blocking

I poll, , epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over

I

I sendfile, splice, tee

I openat

I ...

I

I read, I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2 I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec I sendfile, splice, tee

I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap I openat

I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee I ...

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ... I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I ...

I ioctl Use file descriptors! What interesting file descriptors exist? I write: Add value to counter I read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kernel and userspace

eventfd

I 64-bit counter used as an event queue I read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kernel and userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counter I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kernel and userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counter I read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1 I Several drivers use eventfd to signal events between kernel and userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counter I read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero eventfd

I 64-bit counter used as an event queue

I write: Add value to counter I read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kernel and userspace timerfd

I Allows handling timers as file descriptors

I Throw them in the poll loop with everything else

I Create with specified timeout

I read: Block until timeout; return number of times expired

I poll: Reading for reading if timeout passed Signals I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl) Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions Signed-off-by: <(;,;)@r’lyeh> I read: Block until signal, return struct signalfd_siginfo

I poll: Readable when signal received

signalfd

I File descriptor to receive a given set of signals

I Block “normal” signal delivery; receive via signalfd instead signalfd

I File descriptor to receive a given set of signals

I Block “normal” signal delivery; receive via signalfd instead

I read: Block until signal, return struct signalfd_siginfo

I poll: Readable when signal received How do you build a new type of file descriptor? I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and write

I Nothing I Raw data I Specific data structure I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and write

I Nothing I Raw data I Specific data structure I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and write

I Nothing I Raw data I Specific data structure I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing

I seek and file position I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and write

I Nothing I Raw data I Specific data structure I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing

I seek and file position

I mmap I For everything else: ioctl

Semantics

I read and write

I Nothing I Raw data I Specific data structure I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup? Semantics

I read and write

I Nothing I Raw data I Specific data structure I poll/select/epoll

I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_head I Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfd

I Doesn’t need a backing or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_head I Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfd

I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer I Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_head I Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfd

I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseek Implementation

I anon_inode_getfd

I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_head I Non-blocking: return -EAGAIN What interesting file descriptors don’t exist yet? Child processes I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I /clone I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems: Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops Signals

I Process-global; libraries can’t manage only their own processes I signalfd for SIGCHLD

I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries I Must block SIGCHLD; breaks code expecting SIGCHLD

Alternatives

I Set SIGCHLD handler, write to pipe or eventfd

I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries Signals Alternatives

I Set SIGCHLD handler, write to pipe or eventfd

I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries Signals I signalfd for SIGCHLD

I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries I Must block SIGCHLD; breaks code expecting SIGCHLD clonefd I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . . clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) and reparenting

I Work in progress

Complications

Need a new clone for the fd out parameter I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) I Work in progress

Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting Complications

Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture

I Avoided in the new syscall clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting

I Work in progress History and status

I Thiago Macieira originally proposed forkfd to simplify Qt

I Josh and Thiago started on clonefd earlier this year

I Some infrastructure merged into 4.2

I Syscall aimed for future kernel after resolving ptrace issues File descriptor: Userspace reference to kernel object What else can we do with a reference to task_struct? Process IDs I Unique within root container

I Container PID namespaces map a subset of these

I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process

Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process

Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc

I Unique within root container

I Container PID namespaces map a subset of these Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc

I Unique within root container

I Container PID namespaces map a subset of these

I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information Next steps

I Merge clonefd

I For each PID syscall, add an fd variant

I Add to obtain process information

I Add process enumeration (next, child, root) Warning: wild speculation and conjecture ahead

Other future file descriptors Other future file descriptors

Warning: wild speculation and conjecture ahead I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects? I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group” I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups) I Suppose mount returned a file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts

I Suppose mount returned a directory file descriptor I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces