Resource Management: Linux Kernel Namespaces and Cgroups

Rami Rosen
Haifux, May 2013
www.haifux.org
http://ramirose.wix.com/ramirosen

TOC
● Network namespace
● PID namespaces
● UTS namespace
● Mount namespace
● User namespaces
● cgroups
● Mounting cgroups
● Links

Note: All code examples are from the for_3_10 branch of the cgroup git tree (3.9.0-rc1, April 2013).

General
The presentation deals with two Linux process resource management solutions: namespaces and cgroups.
We will look at:
● Kernel implementation details.
  ● What was added/changed, in brief.
● User space interface.
● Some working examples.
● Usage of namespaces and cgroups in other projects.
● Is process virtualization indeed lightweight compared to OS virtualization?
  ● Compared to VMware/QEMU/ScaleMP, or even to Xen/KVM.

Namespaces
● Namespaces: lightweight process virtualization.
  – Isolation: enable a process (or several processes) to have a different view of the system than other processes.
  – 1992: “The Use of Name Spaces in Plan 9”
  – http://www.cs.bell-labs.com/sys/doc/names.html
    ● Rob Pike et al., ACM SIGOPS European Workshop 1992.
  – Much like Zones in Solaris.
  – No hypervisor layer (as in OS virtualization like KVM, Xen).
  – Only one system call was added (setns()).
  – Used in Checkpoint/Restart.
● Developers: Eric W. Biederman, Pavel Emelyanov, Al Viro, Cyrill Gorcunov, and more.

Namespaces - contd
There are currently 6 namespaces:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)

Namespaces - contd
It was intended that there would be 10 namespaces; the following 4 namespaces are not implemented (yet):
● security namespace
● security keys namespace
● device namespace
● time namespace
  – There was a time namespace patch, but it was not applied.
  – See: PATCH 0/4 - Time virtualization:
  – http://lwn.net/Articles/179825/
● See OLS 2006, "Multiple Instances of the Global Linux Namespaces", Eric W. Biederman.

Namespaces - contd
● Mount namespaces were the first type of namespace to be implemented on Linux, by Al Viro, appearing in 2002.
  – Linux 2.4.19.
● The CLONE_NEWNS flag was added (it stands for “new namespace”; at that time no other namespace was planned, so it was not called "new mount"...).
● The user namespace was the last to be implemented. A number of Linux filesystems are not yet user-namespace aware.

Implementation details
● Implementation (partial):
  – 6 CLONE_NEW* flags were added (include/linux/sched.h).
● These flags (or a combination of them) can be used in the clone() or unshare() syscalls to create a namespace.
● In setns(), the flags are optional.

    Flag            Kernel version   Required capability
    CLONE_NEWNS     2.4.19           CAP_SYS_ADMIN
    CLONE_NEWUTS    2.6.19           CAP_SYS_ADMIN
    CLONE_NEWIPC    2.6.19           CAP_SYS_ADMIN
    CLONE_NEWPID    2.6.24           CAP_SYS_ADMIN
    CLONE_NEWNET    2.6.29           CAP_SYS_ADMIN
    CLONE_NEWUSER   3.8              No capability is required

Implementation - contd
● Three system calls are used for namespaces:
● clone(): creates a new process and a new namespace; the process is attached to the new namespace.
  – The process creation and process termination methods, fork() and exit(), were patched to handle the new namespace CLONE_NEW* flags.
● unshare(): does not create a new process; creates a new namespace and attaches the current process to it.
  – unshare() was added in 2005, not only for namespaces but also for security. See “new system call, unshare”: http://lwn.net/Articles/135266/
● setns(): a new system call, added for joining an existing namespace.

Nameless namespaces
From man 2 clone:
    ...
    int clone(int (*fn)(void *), void *child_stack,
              int flags, void *arg, ...
              /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
    ...
● flags holds the CLONE_* flags, including the namespace CLONE_NEW* flags. There are more than 20 flags in total.
  ● See include/uapi/linux/sched.h
● There is no parameter for a namespace name.
● How do we know if two processes are in the same namespace?
● Namespaces do not have names.
● Six entries (inodes) were added under /proc/<pid>/ns, one for each namespace (in kernel 3.8 and higher).
● Each namespace has a unique inode number.
● The inode number of each namespace is assigned when the namespace is created.

Nameless namespaces
● ls -al /proc/<pid>/ns
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
You can also use readlink.

Implementation - contd
● A member named nsproxy was added to the process descriptor, struct task_struct.
● A method named task_nsproxy(struct task_struct *tsk) was added to access the nsproxy of a specified process (include/linux/nsproxy.h).
● nsproxy includes 5 inner namespaces:
  ● uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns
  Notice that the user namespace is missing from this list:
  ● it is a member of the credentials object (struct cred), which is a member of the process descriptor, task_struct.
● There is an initial, default namespace for each namespace type.

Implementation - contd
● Kernel config items:
    CONFIG_NAMESPACES
    CONFIG_UTS_NS
    CONFIG_IPC_NS
    CONFIG_USER_NS
    CONFIG_PID_NS
    CONFIG_NET_NS
● User space additions:
  ● iproute package
    ● Some additions like ip netns add/ip netns del and more.
  ● util-linux package
    ● unshare util with support for all 6 namespaces.
    ● nsenter: a wrapper around setns().

UTS namespace
● uts (Unix timesharing)
  – Very simple to implement. A member named uts_ns (a uts_namespace object) was added to the nsproxy.

    process descriptor (task_struct)
      -> nsproxy
        -> uts_ns (uts_namespace object)
          -> name (new_utsname object, struct new_utsname):
             sysname, nodename, release, version, machine, domainname

UTS namespace - contd
The old implementation of gethostname():

    asmlinkage long sys_gethostname(char __user *name, int len)
    {
        ...
        if (copy_to_user(name, system_utsname.nodename, i))
        ...
            errno = -EFAULT;
    }

(system_utsname is a global)
kernel/sys.c, Kernel v2.6.11.5

UTS namespace - contd
A method called utsname() was added:

    static inline struct new_utsname *utsname(void)
    {
        return &current->nsproxy->uts_ns->name;
    }

The new implementation of gethostname():

    SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
    {
        struct new_utsname *u;
        ...
        u = utsname();
        if (copy_to_user(name, u->nodename, i))
            errno = -EFAULT;
        ...
    }

A similar approach is used in the uname() and sethostname() syscalls.

UTS namespace - Example
We have a machine where the hostname is myoldhostname.

    uname -n
    myoldhostname
    unshare -u /bin/bash

This creates a UTS namespace via the unshare() syscall and calls execvp() to invoke bash.
Then:

    hostname mynewhostname
    uname -n
    mynewhostname

Now, if we run uname -n from a different terminal, we will see myoldhostname.

UTS namespace - Example
nsexec
nsexec is a package by Serge Hallyn; it consists of a program called nsexec.c, which creates tasks in new namespaces by clone() or by unshare() with fork() (there are some more utils in the package).
https://launchpad.net/~serge-hallyn/+archive/nsexec
Again we have a machine where the hostname is myoldhostname.

    uname -n
    myoldhostname

IPC namespaces
The same principle as UTS; nothing special, just more code.
A member named ipc_ns (an ipc_namespace object) was added to the nsproxy.
● CONFIG_POSIX_MQUEUE or CONFIG_SYSVIPC must be set.

Network Namespaces
● A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.
● The network namespace is struct net (defined in include/net/net_namespace.h).
  struct net includes all network stack ingredients, like:
  – Loopback device.
  – SNMP stats (netns_mib).
  – All network tables: routing, neighboring, etc.
  – All sockets.
  – /procfs and /sysfs entries.

Implementation guidelines
● A network device belongs to exactly one network namespace.
  ● Added to struct net_device:
    ● struct net *nd_net; the network namespace this network device is inside.
  ● Added a method, dev_net(const struct net_device *dev), to access the nd_net namespace of a network device.
● A socket belongs to exactly one network namespace.
  ● Added sk_net to struct sock (also a pointer to struct net), for the network namespace this socket is inside.
  ● Added the sock_net() and sock_net_set() methods (get/set the network namespace of a socket).

Network Namespaces - contd
● Added a system-wide linked list of all namespaces, net_namespace_list, and a macro to traverse it (for_each_net()).
● The initial network namespace, init_net (an instance of struct net), includes the loopback device and all physical devices, the networking tables, etc.
● Each newly created network namespace includes only the loopback device.
● There are no sockets in a newly created network namespace.
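
The question raised on the "Nameless namespaces" slide (how do we know if two processes are in the same namespace?) can be answered from user space by comparing the /proc/<pid>/ns links with readlink. The following is a minimal sketch and is not taken from the original slides: the helper names ns_id and same_ns are hypothetical, and it assumes a kernel that exposes /proc/<pid>/ns and a caller allowed to read the links of both processes.

```shell
#!/bin/sh
# Sketch only: ns_id and same_ns are illustrative helper names,
# not part of util-linux, iproute or the kernel tree.

# Print the identifier of one namespace of a process, a string of the
# form "<type>:[<inode number>]".
#   $1 = pid (or "self"), $2 = ipc | mnt | net | pid | user | uts
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Exit status 0 if both processes are in the same namespace of type $3,
# 1 if they are in different ones, 2 if a link could not be read.
#   $1, $2 = pids, $3 = namespace type
same_ns() {
    _a=$(ns_id "$1" "$3") || return 2
    _b=$(ns_id "$2" "$3") || return 2
    [ "$_a" = "$_b" ]
}

# Compare this shell with its parent for each of the six namespace types.
for ns in ipc mnt net pid user uts; do
    rc=0
    same_ns "$$" "$PPID" "$ns" || rc=$?
    case $rc in
        0) echo "$ns: same namespace as parent" ;;
        1) echo "$ns: different namespace than parent" ;;
        *) echo "$ns: could not read the namespace link" ;;
    esac
done
```

Run from a shell started with unshare -u /bin/bash, a comparison against a process outside that shell should report a different uts namespace while the other five are still shared.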
