Everything’s a File Descriptor
Josh Triplett [email protected]
Linux Plumbers Conference 2015 “Everything’s a file” I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0 I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT
I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf
I /etc/hostname
I /dev/null
I /dev/zero
I /dev/ttyS0
I /dev/dri/card0
I /dev/cpu/0/cpuid
I /tmp/.X11-unix/X0
I /proc/1/environ
I /proc/cmdline
I /sys/class/block/sda/queue/rotational
I /sys/firmware/acpi/tables/DSDT Everything has a filename? Everything has a filename? Everything//////////////////////////////// has a filename? I Pipes
I Sockets
I epoll
I memfd
I KVM virtual machines and CPUs
I ... Everything’s a file descriptor I What is a file descriptor, really?
I What can you do with a file descriptor?
I What interesting file descriptors exist?
I How do you build a new type of file descriptors?
I What interesting file descriptors don’t exist? I What is a file descriptor, really?
I What can you do with a file descriptor?
I What interesting file descriptors exist?
I How do you build a new type of file descriptors?
I What interesting file descriptors don’t exist yet? What is a file descriptor, really? I struct fd, struct fdtable
I struct file read(x, &c, 1); putchar(c); read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c);
struct fd versus struct file
testfile contains “0123456789”
x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); struct fd versus struct file
testfile contains “0123456789”
x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c); struct fd versus struct file
testfile contains “0123456789”
x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); read(y, &c, 1); putchar(c); struct fd versus struct file
testfile contains “0123456789”
x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); /* Prints ’1’ */ read(y, &c, 1); putchar(c); struct fd versus struct file
testfile contains “0123456789”
x = open("testfile", O_RDONLY); xdup = dup(x); y = open("testfile", O_RDONLY); read(x, &c, 1); putchar(c); /* Prints ’0’ */ read(xdup, &c, 1); putchar(c); /* Prints ’1’ */ read(y, &c, 1); putchar(c); /* Prints ’0’ */ x f pos: f count:
f pos: 0 xdup f count: 1
y
userspace int kernel object driver-specific
struct fd struct file testfile
0
1
2
3
. . f pos: 0 xdup f count: 1
y
userspace int kernel object driver-specific
struct fd struct file testfile
f pos: 0 x 0 f count: 1
1
2
3
. . f pos: 0 f count: 1
y
userspace int kernel object driver-specific
struct fd struct file testfile
f pos: 0 x 0 f count:2
1 xdup 2
3
. . userspace int kernel object driver-specific
struct fd struct file testfile
f pos: 0 x 0 f count:2
1 f pos: 0 xdup f count: 1 2
y 3
. . userspace int kernel object driver-specific
struct fd struct file testfile
f pos:1 x 0 f count:2
1 f pos: 0 xdup f count: 1 2
y 3
. . userspace int kernel object driver-specific
struct fd struct file testfile
f pos:2 x 0 f count:2
1 f pos: 0 xdup f count: 1 2
y 3
. . userspace int kernel object driver-specific
struct fd struct file testfile
f pos:2 x 0 f count:2
1 f pos:1 xdup f count: 1 2
y 3
. . kernel object driver-specific
struct fd struct file testfile
f pos:2 x 0 f count:2
1 f pos:1 xdup f count: 1 2
y 3
. . userspace int driver-specific
struct fd struct file testfile
f pos:2 x 0 f count:2
1 f pos:1 xdup f count: 1 2
y 3
. . userspace int kernel object struct fd struct file testfile
f pos:2 x 0 f count:2
1 f pos:1 xdup f count: 1 2
y 3
. . userspace int kernel object driver-specific File descriptor: Userspace reference to kernel object What can you do with a file descriptor? I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2 I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec I sendfile, splice, tee
I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap I openat
I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee I ...
I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat I ioctl
I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ... I read, write
I seek
I preadv, pwritev
I stat
I Blocking or non-blocking
I poll, select, epoll
I dup, dup2
I Send over a UNIX socket via SCM_RIGHTS
I Inherited over exec
I mmap
I sendfile, splice, tee
I openat
I ...
I ioctl Use file descriptors! What interesting file descriptors exist? I write: Add value to counter I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1
I poll: Ready for reading if non-zero
I Several drivers use eventfd to signal events between kernel and userspace
eventfd
I 64-bit counter used as an event queue I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1
I poll: Ready for reading if non-zero
I Several drivers use eventfd to signal events between kernel and userspace
eventfd
I 64-bit counter used as an event queue
I write: Add value to counter I poll: Ready for reading if non-zero
I Several drivers use eventfd to signal events between kernel and userspace
eventfd
I 64-bit counter used as an event queue
I write: Add value to counter I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1 I Several drivers use eventfd to signal events between kernel and userspace
eventfd
I 64-bit counter used as an event queue
I write: Add value to counter I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1
I poll: Ready for reading if non-zero eventfd
I 64-bit counter used as an event queue
I write: Add value to counter I read: Block until non-zero; read value and reset to 0
I “Semaphore mode”: Read 1 and decrement by 1
I poll: Ready for reading if non-zero
I Several drivers use eventfd to signal events between kernel and userspace timerfd
I Allows handling timers as file descriptors
I Throw them in the poll loop with everything else
I Create with specified timeout
I read: Block until timeout; return number of times expired
I poll: Reading for reading if timeout passed Signals I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption I “async-signal-safe” library functions
Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl) Signals
I Receive asynchronous events in a process
I Suspend execution, save registers, move execution to handler
I Restore registers and resume execution when handler done
I Assume a userspace stack to push and pop state
I sigaltstack sets an alternate stack to switch to
I Set up stack to return into call to sigreturn for cleanup
I Can receive signals while in a kernel syscall
I Some syscalls restart afterward
I Syscalls with timeouts adjust them (restart_syscall)
I Other syscalls return EINTR
I Can mask signals to avoid interruption
I Special syscalls that also set signal mask (ppoll, pselect, KVM_SET_SIGNAL_MASK ioctl)
I “async-signal-safe” library functions Signed-off-by: <(;,;)@r’lyeh> I read: Block until signal, return struct signalfd_siginfo
I poll: Readable when signal received
signalfd
I File descriptor to receive a given set of signals
I Block “normal” signal delivery; receive via signalfd instead signalfd
I File descriptor to receive a given set of signals
I Block “normal” signal delivery; receive via signalfd instead
I read: Block until signal, return struct signalfd_siginfo
I poll: Readable when signal received How do you build a new type of file descriptor? I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing
I seek and file position
I mmap
I What happens with multiple processes, or dup?
I For everything else: ioctl
Semantics
I read and write
I Nothing I Raw data I Specific data structure I seek and file position
I mmap
I What happens with multiple processes, or dup?
I For everything else: ioctl
Semantics
I read and write
I Nothing I Raw data I Specific data structure I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing I mmap
I What happens with multiple processes, or dup?
I For everything else: ioctl
Semantics
I read and write
I Nothing I Raw data I Specific data structure I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing
I seek and file position I What happens with multiple processes, or dup?
I For everything else: ioctl
Semantics
I read and write
I Nothing I Raw data I Specific data structure I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing
I seek and file position
I mmap I For everything else: ioctl
Semantics
I read and write
I Nothing I Raw data I Specific data structure I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing
I seek and file position
I mmap
I What happens with multiple processes, or dup? Semantics
I read and write
I Nothing I Raw data I Specific data structure I poll/select/epoll
I Must match read/write blocking behavior if any I Can have pollable fd even if read/write do nothing
I seek and file position
I mmap
I What happens with multiple processes, or dup?
I For everything else: ioctl I simple_read_from_buffer, simple_write_to_buffer
I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK
I Blocking: wait_queue_head I Non-blocking: return -EAGAIN
Implementation
I anon_inode_getfd
I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK
I Blocking: wait_queue_head I Non-blocking: return -EAGAIN
Implementation
I anon_inode_getfd
I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object
I simple_read_from_buffer, simple_write_to_buffer I Check file->f_flags & O_NONBLOCK
I Blocking: wait_queue_head I Non-blocking: return -EAGAIN
Implementation
I anon_inode_getfd
I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object
I simple_read_from_buffer, simple_write_to_buffer
I no_llseek, fixed_size_llseek Implementation
I anon_inode_getfd
I Doesn’t need a backing inode or filesystem I Provide an ops structure and private data pointer I Private data points to your kernel object
I simple_read_from_buffer, simple_write_to_buffer
I no_llseek, fixed_size_llseek I Check file->f_flags & O_NONBLOCK
I Blocking: wait_queue_head I Non-blocking: return -EAGAIN What interesting file descriptors don’t exist yet? Child processes I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems: Signals
I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops I Process-global; libraries can’t manage only their own processes
I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals I fork/clone
I Parent process gets the child PID
I Parent uses dedicated syscalls (waitpid) to wait for child exit
I When child exits, parent gets SIGCHLD signal
I Parent makes waitpid call to get exit status
Problems:
I Waiting not integrated with poll loops Signals
I Process-global; libraries can’t manage only their own processes I signalfd for SIGCHLD
I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries I Must block SIGCHLD; breaks code expecting SIGCHLD
Alternatives
I Set SIGCHLD handler, write to pipe or eventfd
I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries Signals Alternatives
I Set SIGCHLD handler, write to pipe or eventfd
I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries Signals I signalfd for SIGCHLD
I Still process-global; gets all child exit notifications I Requires coordinating global signal handling between libraries I Must block SIGCHLD; breaks code expecting SIGCHLD clonefd I read: block until child exits, return exit information
I poll: becomes readable when child exits
I Maintains a reference to the child’s task_struct
I Relatively simple, except. . .
clonefd
I New flag for clone
I Return a file descriptor for the child process I poll: becomes readable when child exits
I Maintains a reference to the child’s task_struct
I Relatively simple, except. . .
clonefd
I New flag for clone
I Return a file descriptor for the child process
I read: block until child exits, return exit information I Maintains a reference to the child’s task_struct
I Relatively simple, except. . .
clonefd
I New flag for clone
I Return a file descriptor for the child process
I read: block until child exits, return exit information
I poll: becomes readable when child exits I Relatively simple, except. . .
clonefd
I New flag for clone
I Return a file descriptor for the child process
I read: block until child exits, return exit information
I poll: becomes readable when child exits
I Maintains a reference to the child’s task_struct clonefd
I New flag for clone
I Return a file descriptor for the child process
I read: block until child exits, return exit information
I poll: becomes readable when child exits
I Maintains a reference to the child’s task_struct
I Relatively simple, except. . . clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it ptrace and reparenting
I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) I Work in progress
Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting Complications
Need a new clone system call for the fd out parameter clone syscall parameters vary by architecture
I Avoided in the new syscall clone is out of parameters (6) on some architectures
I Pass parameters via a struct and size Low-level copy_thread function grabbed tls parameter directly from syscall register arguments; couldn’t move it
I Pass parameter normally via C, fix assembly syscall entry I Fixed with copy_thread_tls (merged in 4.2) ptrace and reparenting
I Work in progress History and status
I Thiago Macieira originally proposed forkfd to simplify Qt
I Josh and Thiago started on clonefd earlier this year
I Some infrastructure merged into 4.2
I Syscall aimed for future kernel after resolving ptrace issues File descriptor: Userspace reference to kernel object What else can we do with a reference to task_struct? Process IDs I Unique within root container
I Container PID namespaces map a subset of these
I PIDs do not hold a reference; can be reused
I Race condition if used from non-parent process
Process IDs
I Small integers used to reference processes
I Used pervasively in process syscalls
I Enumerated as directories in /proc I PIDs do not hold a reference; can be reused
I Race condition if used from non-parent process
Process IDs
I Small integers used to reference processes
I Used pervasively in process syscalls
I Enumerated as directories in /proc
I Unique within root container
I Container PID namespaces map a subset of these Process IDs
I Small integers used to reference processes
I Used pervasively in process syscalls
I Enumerated as directories in /proc
I Unique within root container
I Container PID namespaces map a subset of these
I PIDs do not hold a reference; can be reused
I Race condition if used from non-parent process I Unique across the entire system
I Holds a reference to the process
I Race-free
I Can pass via exec, UNIX sockets
I Allows non-parent processes to obtain exit information
clonefd as process identifier I Holds a reference to the process
I Race-free
I Can pass via exec, UNIX sockets
I Allows non-parent processes to obtain exit information
clonefd as process identifier
I Unique across the entire system I Can pass via exec, UNIX sockets
I Allows non-parent processes to obtain exit information
clonefd as process identifier
I Unique across the entire system
I Holds a reference to the process
I Race-free I Allows non-parent processes to obtain exit information
clonefd as process identifier
I Unique across the entire system
I Holds a reference to the process
I Race-free
I Can pass via exec, UNIX sockets clonefd as process identifier
I Unique across the entire system
I Holds a reference to the process
I Race-free
I Can pass via exec, UNIX sockets
I Allows non-parent processes to obtain exit information Next steps
I Merge clonefd
I For each PID syscall, add an fd variant
I Add ioctls to obtain process information
I Add process enumeration (next, child, root) Warning: wild speculation and conjecture ahead
Other future file descriptors Other future file descriptors
Warning: wild speculation and conjecture ahead I Suppose users and groups were unique kernel objects?
I Unique across container user namespaces
I “Get unused user/group”
I Set up arbitrary mappings when mounting a filesystem
I Allow a process to hold multiple credentials (like setgroups)
User and group IDs I Unique across container user namespaces
I “Get unused user/group”
I Set up arbitrary mappings when mounting a filesystem
I Allow a process to hold multiple credentials (like setgroups)
User and group IDs
I Suppose users and groups were unique kernel objects? I Set up arbitrary mappings when mounting a filesystem
I Allow a process to hold multiple credentials (like setgroups)
User and group IDs
I Suppose users and groups were unique kernel objects?
I Unique across container user namespaces
I “Get unused user/group” I Allow a process to hold multiple credentials (like setgroups)
User and group IDs
I Suppose users and groups were unique kernel objects?
I Unique across container user namespaces
I “Get unused user/group”
I Set up arbitrary mappings when mounting a filesystem User and group IDs
I Suppose users and groups were unique kernel objects?
I Unique across container user namespaces
I “Get unused user/group”
I Set up arbitrary mappings when mounting a filesystem
I Allow a process to hold multiple credentials (like setgroups) I Suppose mount returned a directory file descriptor
I openat relative to the filesystem
I Separate call to bind into the filesystem namespace
I Bind existing dirfd for bind mounts
Filesystem mounts I openat relative to the filesystem
I Separate call to bind into the filesystem namespace
I Bind existing dirfd for bind mounts
Filesystem mounts
I Suppose mount returned a directory file descriptor I Separate call to bind into the filesystem namespace
I Bind existing dirfd for bind mounts
Filesystem mounts
I Suppose mount returned a directory file descriptor
I openat relative to the filesystem Filesystem mounts
I Suppose mount returned a directory file descriptor
I openat relative to the filesystem
I Separate call to bind into the filesystem namespace
I Bind existing dirfd for bind mounts I Reference-counted, race-free, unambiguous ID
I Well-defined semantics
I Extensive operations
I poll and blocking
I Use file descriptors in new APIs
I Don’t invent new identifier namespaces
Summary
I File descriptor: Userspace reference to kernel object I Well-defined semantics
I Extensive operations
I poll and blocking
I Use file descriptors in new APIs
I Don’t invent new identifier namespaces
Summary
I File descriptor: Userspace reference to kernel object
I Reference-counted, race-free, unambiguous ID I poll and blocking
I Use file descriptors in new APIs
I Don’t invent new identifier namespaces
Summary
I File descriptor: Userspace reference to kernel object
I Reference-counted, race-free, unambiguous ID
I Well-defined semantics
I Extensive operations I Use file descriptors in new APIs
I Don’t invent new identifier namespaces
Summary
I File descriptor: Userspace reference to kernel object
I Reference-counted, race-free, unambiguous ID
I Well-defined semantics
I Extensive operations
I poll and blocking Summary
I File descriptor: Userspace reference to kernel object
I Reference-counted, race-free, unambiguous ID
I Well-defined semantics
I Extensive operations
I poll and blocking
I Use file descriptors in new APIs
I Don’t invent new identifier namespaces