Multiple Instances of the Global Namespaces

Eric W. Biederman Linux Networx [email protected]

Abstract to create a clean implementation that can be merged into the . The discussion has begun on the linux-kernel list and things are Currently Linux has the filesystem namespace slowly progressing. for mounts which is beginning to prove use- ful. By adding additional namespaces for pro- cess ids, SYS V IPC, the network stack, user ids, and probably others we can, at a trivial 1 Introduction cost, extend the UNIX concept and make novel uses of Linux possible. Multiple instances of a namespace simply means that you can have two 1.1 High Performance Computing things with the same name.

For servers the power of computers is growing, and it has become possible for a single server I have been working with high performance to easily fulfill the tasks of what previously re- clusters for several years and the situation is quired multiple servers. solutions painful. Each Linux box in the cluster is ref- like are nice but they impose a perfor- ered to as a node, and applications running or mance penalty and they do not easily allow re- queued to run on a cluster are jobs. sources to be shared between multiple servers. Jobs are run by a batch scheduler and, once For clusters application migration and preemp- launched, each job runs to completion typically tion are interesting cases but almost impossibly consuming 99% of the resources on the nodes hard because you cannot restart the application it is running on. once you have moved it to a new machine, as usually there are resource name conflicts. In practice a job cannot be suspended, swapped, or even moved to a different set of For users certain desktop applications interface nodes once it is launched. This is the oldest with the outside world and are large and hard and most primitive way of running a computer. to secure. It would be nice if those applications Given the long runs and high computation over- could be run on their own little world to limit head of HPC jobs it isn’t a bad fit for HPC en- what a security breach could compromise. vironments, but it isn’t a really good fit either.

Several implementations of this basic idea have Linux has much more modern facilities. What been done succsessfully. Now the work is prevents us from just using them? 102 • Multiple Instances of the Global

HPC jobs currently can be suspended, but that uses. ids, SYS V IPC identifiers, file- just takes them off the cpu. If you have suffi- names, and the like. When you restore an appli- cient swap there is a chance the jobs can even cation on another machine there is no guarantee be pushed to swap but frequently these applica- that it can reuse the same global identifiers as tion lock pages in memory so they can be en- another process on that machine may be using sured of low latency communication. those identifiers.

The key problem is simply having multiple ma- There are two general approaches to solving chines and multiple kernels. In general, how this problem. Modifying things so these global to take an application running under one Linux machine identifiers are unique across all ma- kernel and move it completely to another kernel chines in a cluster, or modifying things so these is an unsolved problem. machine global identifiers can be repeated on a single machine. Many attempts have been The problem is unsolved not because it is fun- made to scale global identifiers cluster-wide— damentally hard, but simply because it has Mosix, OpenSSI, bproc, to name a few—and not be a problem in the UNIX environment. all of them have had to work hard to scale. So Most applications don’t need multiple ma- I choose to go with an implementation that will chines or even big machines to run on (espe- easily scale to the largest of clusters, with no cially with Moore’s law exponentially increas- communication needed to do so. ing the power of small machines). For many of the rest the large multiprocessor systems have This has the added advantage that in a cluster been large enough. it doesn’t change the programming model pre- sented to the user. Just some machines will What has changed is the economic observation now appear as multiple machines. As the rise that a cluster of small commodity machines is of the internet has shown building applications much cheaper and equally as fast as a large su- that utilize multiple machines is not foreign to percomputer. the rest of the computing world either. The other reason this problem has not been To make this happen I need to solve the solved (besides the fact that most people work- challenging problem of how to refactor the ing on it are researchers) is that it is not im- UNIX/Linux API so that we can have multi- mediately obvious what a general solution is. ple instances of the global Linux namespaces. Nothing quite like it has been done before so The Plan 9 inspired mount/filesystem name- you can’t look into a text book or into the space has already proved how this can be done archives of history and know a solution. Which and is slowly proving useful. in the broad strokes of theory is a rarity. 1.2 Jails The hard part of the problem also does not lie in the obvious place people first look— how to save all of the state of an application. Instead, Outside the realm of high perfomance comput- the problem is how do you restore a saved ap- ing people have been restricting their server ap- plication so it runs successfully on another ma- plication to jails for years. The problems chine. with chroot jails have become well understood and people have begun fixing them. First BSD The problem with restoring a saved applica- jails, and then are some of tion is all of the global resources an application the better known examples. 2006 Linux Symposium, Volume One • 103

Under Linux the open source community has cations when they log out. Vulnerable or un- not been idle. There is the linux-jail project, trusted applications like a web browser or an irc Vserver, Openvz, and related projects like client could be contained so that if they are at- SELinux, UML, and Xen. tacked all that is gained is the ability to browse the web and draw pictures in an X window. Jails are a powerful general purpose tool use- ful for a great variety of things. In resource There would finally be answer to the age old utilization jails are cheap, dynamically load- question: How do I preserve my ing glibc is likely to consume more memory while upgrading my kernel? than the additional kernel state needed to track a jail. Jails allow applications to be run in sim- All this takes is an eye to designing the inter- ple stripped down environments, increasing se- faces so they are general purpose and nestable. curity, and decreasing maintenance costs, while It should be possible to have a jail inside a jail leaving system administrators with the familiar inside a jail forever, or at least until there don’t UNIX environment. exist the resources to support it.

The only problem with the current general pur- pose implementation of jails under Linux is nothing has been merged into the mainline ker- 2 Namespaces nel, and the code from the various projects is not really mergable as it stands. The closest I have seen is the code from the linux-jail project, How much work is this? Looking at the ex- and that is simply because it is less general pur- isting patches it appears that 10,000 to 20,000 pose and implemented completely as a Linux lines of code will ultimately need to be touched. security module. The core of Linux is about 130,000 lines of code, so we will need to touch between 7% and Allowing multiple instances of global names 15% of the core kernel code. Which clearly which is needed to restore a migrated applica- indicates that one giant patch to do everything tion is a more constrained problem than that of even if it was perfect would be rejected simply simply implementing a jail. But a jail that en- because it is too large to be reviewed. sures you can have multiple instances of all of the global names is a powerful general purpose In Linux there are multiple classes of global jail that you can run just about anything in. So identifiers (i.e. process id, SYS V IPC keys, the two problems can share a common kernel user ids). Each class of identifier can be thought solution. of living in its own namespace.

1.3 Future Directions This gives us a natural decomposition to the problem, allowing each namespace to be mod- ified separately so we can support multiple in- A cheap and always available jail mecha- stances of that namespace. Unfortunately this nism is also potentially quite useful outside also increases the difficulty of the problem, as the realm of high performance computing we need to modify the kernel’s reporting and and server applications. A general purupose configuration interfaces to support multiple in- checkpoint/restart mechanism can allow desk- stances of a namespace instead of having them top users to preserve all of their running appli- tightly coupled. 104 • Multiple Instances of the Global Linux Namespaces

The plan then is simple. Maintain backwards int uname(struct utsname *buf); compatibility. Concentrate on one namespace struct utsname { at a time. Focus on implementing the abil- char sysname[]; ity to have multiple objects of a given type, char nodename[]; with the same name. Configure a namespace char release[]; from the inside using the existing interfaces. char version[]; Think of these things ultimately not as servers char machine[]; or virtual machines but as processes with pe- char domainname[]; culiar attributes. As far as possible implement }; the namespaces so that an application can be given the capability bit for that allows full con- trol over a namespace and still not be able to Figure 1: uname escape. Think in terms of a recursive imple- mentation so we always keep in mind what it CAP_SYS_ADMIN is currently required to takes to recurse indefinitely. modify the mount namespace, although there has been some discussion on how to relax the What the system call interface will be to cre- restrictions for bind mounts. ate a new instance of a namespace is still up for debate. The current contenders are a new CLONE_ flag or individual system calls. I per- 2.2 The UTS Namespace sonally think a flag to clone and unshare is all that is needed but if arguments are actually The UTS namespace characterizes and identi- needed a new system call makes sense. fies the system that applications are running on. It is characterizeed by the uname system call. Currently I have identified ten separate names- uname returns six strings describing the cur- paces in the kernel. The filesystem mount rent system. See Figure 1. namespace, uts namespace, the SYS V IPC namespace, network namespace, the pid name- The returned utsname structure has only two space, the uid namespace, the security name- members that vary at runtime nodename and space, the security keys namespace, the device domainname. nodename is the classic host- namespace, and the time namespace. name of a system. domainname is the NIS domainname. CAP_SYS_ADMIN is required 2.1 The Filesystem Mount Namespace to change these values, and when the system boots they start out as "(none)".

Multiple instances of the filesystem mount The pieces of the kernel that report and mod- namespace are already implemented in the sta- ify the utsname structure are not connected ble Linux kernels so there are few real issues to any of the big kernel subsystems, build even with implementing it. There are still outstand- when CONFIG_NET is disabled, and use CAP_ ing question on how to make this namespace SYS_ADMIN instead of one of the more spe- usable and useful to unpriviledged processes, as cific capabilities. This clearly shows that the well as some ongoing work to allow the filesys- code has no affiliation with one of the larger tem mount namespace to allow bind mounts to namespaces. have bind flags. For example, so the the bind mount can be restricted read only when other Allowing for multiple instances of the UTS mounts of the filesystem are still read/write. namespace is a simple matter of allocating a 2006 Linux Symposium, Volume One • 105 new copy of struct utsname in the kernel • kernel.sem An array of 4 control inte- for each different instance of this namespace, gers and modifying the system calls that modify this value to lookup the appropriate instance of – sc_semmsl The maximum number struct utsname by looking at current. of semaphores in a semaphore set. – sc_semmns The maximum number of semaphores in the system. 2.3 The IPC Namespace – sc_semopm The maximum number of semaphore operations in a single The SYS V interprocess communication name- system call. space controls access to a flavor of shared mem- – sc_semmni The maximum number ory, semaphores, message queues introduced in of semaphore sets in the system. SYS V UNIX. Each object has associated with it a key and an id, and all objects are glob- ally visible and exist until they are explicitly Operations in the ipc namespace are limited by destroyed. the following capabilities:

The id values are unique for every object of • CAP_IPC_OWNER Allows overriding the that type and assigned by the system when the standard ipc ownership checks, for the fol- object is created. lowing operations: shm_attach, shm_ The key is assigned by the application usually stat, shm_get, msgrcv, msgsnd, at compile time and is unique unless it is speci- msg_stat, msg_get, semtimedop, fied as IPC_PRIVATE. In which case the key sem_getall, sem_setall, sem_ is simply ignored. stat, sem_getval, sem_getpid, sem_getncnt, sem_getzcnt, sem_ The ipc namespace is currently limited by the setval, sem_get. following universally readable and uid 0 setable For filesystems namei. uses CAP_ sysctl values: DAC_OVERRIDE, and CAP_DAC_READ_ SEARCH to provide the same level of con- • kernel.shmmax The maximum shared trol. memory segment size. • CAP_IPC_LOCK Required to control • kernel.shmall The maximum com- locking of segments in bined size of all shared memory segments. memory. • kernel.shmni The maximum number • CAP_SYS_RESOURCE Allows setting of shared memory segments. the maximum number of bytes in a mes- sage queue to exceed kernel.msgmnb • kernel.msgmax The maximum mes- sage size. • CAP_SYS_ADMIN Allows changing the ownership and removing any ipc object. • kernel.msgmni The maximum num- ber of message queues. Allowing for multiple instances of the ipc • kernel.msgmnb The maximum num- namespace is a straightforward process of du- ber of bytes in a message queue. plicating the tables used for lookup by key and 106 • Multiple Instances of the Global Linux Namespaces id, and modifying the code to use current can the look at that device to find the network to select the appropriate table. In addition namespace and the rules to process that packet the sysctls need to be modified to look at by. current and act on the corresponding copy of the namespace. We generate a packet and feed it to the ker- nel through a socket. The kernel looks at the The ugly side of working with this namespace socket, finds the network namespace, and from is the capability situation. CAP_IPC_OWNER there the rules to process the packet by. trivially becomes restricted to the current ipc namespace. CAP_IPC_LOCK still remains dan- What this means is that most of the network gerous. CAP_DAC_OVERRIDE and CAP_DAC_ stack global variables need to be moved into READ_SEARCH might be ok, it really depends the network namespace data structure. Hope- on the state of the filesystem mount name- fully we can write this so the extra level of in- space. CAP_SYS_RESOURCE and CAP_SYS_ direction will not reduce the performance of the ADMIN are still unsafe to give to untrusted ap- network stack. plications. Looking up the network stack global variables 2.4 The Network Namespace through the network namespace is not quite enough. Each instance of the network name- space needs its own copy of the loopback de- By volume the code implementing the net- vice, an interface needs to be added to move work stack is the largest of the subsystems that network devices between network namespaces, needs its own namespace. Once you look at and a two headed tunnel device (a cousin of the the network subsystem from the proper slant it loopback device) needs to be added so we can is straightforward to allow user space to have send packets between different network names- what appears to be multiple instances of the paces. network stack, and thus you have a network namespace. With these in place it is safe to give processes a separate network namespace: CAP_NET_ The core abstractions used by the network stack BIND_SERVICE, CAP_NET_BROADCAST, are processes, sockets, network devices, and CAP_NET_ADMIN, and CAP_NET_RAW. All packets. The rest of the network stack is de- of the functionality will work and working fined in terms of these. through the existing network devices there won’t be an ability to escape the network name- In adding the network namespace I add a few space. Care should be given to giving an un- simple rules. trusted process access to real network devices, though, as hardware or software bugs in the im- • A network device belongs to exactly one plementation of that network device could be network namespace. reduce the security. • A socket belongs to exactly one network The future direction of the network stack is to- namespace. wards Van Jackobson network channels, where more of the work is pushed towards process A packet comes into the network stack from the context, and happening in sockets. That work outside world through a network device. We appears to be a win for network namespaces 2006 Linux Symposium, Volume One • 107 in two ways. More work happening in pro- In a lot of ways implementing a process id cess context and in well defined sockets means namespace is straightforward as it is clear how it is easier to lookup the network namespace, everything should look from the inside. There and thus cheaper. Having a lightweight packet should be a group of all of the processes in classifier in the network drivers should allow a the namespace that kill -1 sends signals to. single network device to appear to user space Either a new pid hash table needs to be allo- as multiple network devices each with a dif- cated or the key in the pid hash table needs ferent hardware address. Then based upon the to be modified to include the pid namespace. destination hardware address the packet can A new pid allocation bitmap needs to be al- be placed in one of several different network located. /proc/sys/pid_max needs to be namespaces. Today to get the same semantics modified to refer to the current pid allocation I need to configure the primary network device bitmap. When the pid namespace exits all of as a router, or configure ethernet bridging be- the processes in the pid namespace need to be tween the real network and the network inter- killed. Kernel threads need to be modified to face that sends packets to the secondary net- never start up in anything except the default pid work namespace. namespace. A process that has pid 1 must exist that will not receive any signals except for the ones it installs a signal handler for. 2.5 The Process Id Namespace How a pid namespace should look from the out- The venerable process id is usually limited to side is a much more delicate question. How 16bits so that the bitmap allocator is efficient should processes in a non default pid name- and so that people can read and remember the space be displayed in /proc? Should any of ids. The identifiers are allocated by the kernel, the process in a pid namespace show up in any and identifiers that are no longer in use are pe- other pid namespace? How much of the ex- riodically reused for new processes. A single isting infrastructure that takes pids should con- identifier value can refer to a process, a thread tinue to work? group, a process group and to a session, but ev- ery use starts life as a process identifier. This is one of the few areas where the discus- sion on the kernel list has come to a complete The only capability associated with process ids standstill, as an inexpensive technical solution is CAP_KILL which allows sending signals to to everyones requirements was not visible at the any process even if the normal security checks time of the conversation. would not allow it. A big piece of what makes the process id name- Process identifiers are used to identify the cur- space different is that processes are organized rent process, to identify the process that died, into a hierarchical tree. Maintaining the par- to specify a process or set of processes to send ent/child relationship between the process that a signal to, to specify a process or set of pro- initiates the pid namespace and the first process cesses to modify, to specify a process to de- in the new pid namespace requires first process bug, and in system monitoring tools to spec- in the new pid namespace have two pids. A ify which process the information pertains to. pid in the namespace of the parent and pid 1 Or in other words process identifiers are deeply in its own namespace. This results in names- entrenched in the user/kernel interface and are paces that are hierarchical unlike most names- used for just about everything. paces that are completely disjoint. 108 • Multiple Instances of the Global Linux Namespaces

Having one process with two pids looks like a entire tuple is equal. serious problem. It gets worse if we want that Which means if two uids are in different names- process to show up in other pid namespaces. paces the check will always fail. So the only permitted cases across uid namespaces will be After looking deeply at the underlying mecha- when everyone is allowed to perform the action nisms in the kernel I have started moving things the process is trying to perform or when the away from pid_t to pointers to struct pid. process has the appropriate capability to per- The immediate gain is that the kernel becomes form the action on any process. protected from pid wraparound issues. An alternative in some cases to modifying all Once all of the references that matter are of the checks to be against struct pid pointers inside the kernel a dif- tuples is to modify some of the checks to be ferent implementation becomes possible. We against user_struct pointers. can hang multiple tuples off struct pid allowing us to have a Since uid namespaces are not persistent, map- different name for the same pid in several pid ping of a uid namespace to filesystems requires namespaces. some new mechanisms. The primary mech- super_ With processes in subordinate pid namespaces anism is to associate with the each block at least potentially showing up when we need the uid namespace of the filesystem; them we can preserve the existing UNIX api probably moving that information into each struct for all functions that take pids and not need to in the kernel for speed and reinvent pid namespace specific solutions. flexibility.

The question yet to be answered in my mind To allow sharing of filesystem mounts between is do we always map a process’s struct pid different uid namespaces requires either using into all of its ancestor’s pid namespaces, or do acls to tag with non-default filesystem we provide a mechanism that performs those namespace information or using the key infras- mappings on demand? tructure to provide a mapping between different uid namespaces.

2.6 The User and Group ID Namespace Virtual filesystems require special care as fre- quently they allow access to all kinds of spe- In the kernel user ids are used for both account- cial kernel functionality without any capability ing and for for performing security checks. checks if the uid of a process equals 0. So vir- The per user accounting is connected to the tual filesystems like proc and must spec- user_struct. Security checks are done ify the default kernel uid namespace in their against uid, euid, suid, fsuid, gid, superblock or it will be trivial to violate the egid, sgid, fsgid, processes capabilities, kernel security checks. and variables maintained by a Linux security module. The kernel allows any of the uid/gid There is a question of whether the change in values to rotate between slots, or, if a process rules and mechanisms should take place in the has CAP_SETUID, arbitrary values to be set core kernel code, making it uid namespace into the filesystem uids. aware, or in a Linux security module. A key of that decision is the uid hash table and With a uid namespace the security checks for user_struct. From my reading of the ker- equality of uids become checks to ensure the nel code it appears that current Linux security 2006 Linux Symposium, Volume One • 109 modules can only further restrict the default can also provide for enhanced security enforce- kernel permissions checks and there is not a ment mechanisms in containers. In essence this hook that makes it possible to allocate a dif- second modification is the implementation of a ferent user_struct depending on security namespace for security mechanisms and policy. module policies.

Which means at least the allocation of user_ 2.8 The Security Keys Namespace struct, and quite possibly making all of the uid checks fail if the uid namespaces are not equal, should happen in the core of the ker- Not long ago someone added to the kernel what nel with security modules standing in the back- is the frustration of anyone attempting to imple- ground providing really advanced facilities. menting namespaces to allow for the migration of user space. Another obscure and little known With a uid namespace it becomes safe to give global namespace. untrusted users CAP_SETUID without reduc- ing security. In this case each key on a key ring is assigned a global key_serial_t value. CAP_SYS_ ADMIN is used to guard ownership and permis- 2.7 Security Modules and Namespaces sion changes.

I have yet to look in detail but at first glance There are two basic approaches that can be pur- this looks like one of the easy cases, where we sued to implement multiple instances of user can just simply implement another copy of the space. Objects in the kernel can be isolated by lookup table. It appears the key_serial_ policy and security checks with security mod- t values are just used for manipulation of the ules, or they can be isolated by making visible security keys, from user space. only the objects you are allowed to access by using namespaces. 2.9 The Device Namespace The Linux Jail module (http://sf.net/ projects/linuxjail) implemented by "Serge E. Hallyn" is a Not giving children in containers CAP_SYS_ good example of what can be done with just MKNOD and not mounting sysfs is sufficient a security module and isolating a group of pro- to prevent them from accessing any device cesses with permission checks and policy rather nodes that have not been audited for use by than simply making the inaccessible parts of that container. Getting a new instance of the the system disappear. uid/gid namespace is enough to remove access from magic sysfs entries controlling devices al- Following that general principle Linux secu- though there is some question on how to bring rity modules have two different roles they can them back. play when implementing multiple instances of user space. They can make up for any unim- For purposes of migration, unless all devices a plemented kernel namespace by isolating ob- set of processes has access to are purely virtual, jects with additional permission checks, which pretending the devices haven’t changed is non- is good as a short term solution. Linux secu- sense. Instead it makes much more sense to ex- rity modules modified to be container aware plicitly acknowledge the devices have changed 110 • Multiple Instances of the Global Linux Namespaces and send hotplug remove and add events to the REALTIME is interesting. In the context of set of processes. process migration all of these clocks become interesting. With the use of hotplug events the assumption that the global major and minor numbers that a The thread and process clocks simply need an device uses are constant is removed. offset field so the amount of time spent on the previous machine can be added in. So that we Equally as sensitive as CAP_SYS_MKNOD, and can prevent the clocks from going backwards. probably more important if mounting sysfs is allowed, is CAP_CHOWN. It allows changing The monotonic timer needs an offset field so the owner of a file. Since it would be required that we can guarantee that it never goes back- to change the sysfs owner before a sensitive file wards in the presence of process migration. could be accessed. The realtime clock matters the least but hav- So in practice managing the device namespace ing an additional offset field for clock adds ad- appears to be a user space problem with restric- ditional flexibility to the system and comes at tions on CAP_SYS_MKNOD and CAP_CHOWN practically no cost. being used to implement the filter policy of which devices a process has access to. All of the clocks except for CLOCK_ MONOTONIC support setting the clock with clock_settime so the existing control 2.10 The Time Namespace interfaces are sufficient. For the monotonic clock things are different, clock_settime is not allowed to set the clock, ensuring that The kernel provides access to several clocks the time will never run backwards, and there is CLOCK_REALTIME, CLOCK_MONOTONIC, a huge amount of sense in that logic. CLOCK_PROCESS_CPUTIME_ID, and CLOCK_THREAD_CPUTIME_ID being the The only reasonable course I can see is setting primaries. the monotonic clock when the time namespace CLOCK_REALTIME reports the current wall is created. Which probably means we will need clock time. a syscall (and not a clone flag) to create the clone flag and we should provide it with min- CLOCK_MONOTONIC is similar to CLOCK_ imum acceptable value of the monotonic clock. REALTIME except it cannot be set and thus never backs up.

CLOCK_PROCESS_CPUTIME_ID reports 3 Modifications of Kernel Inter- how much aggregate cpu time the threads in a faces process have consumed.

CLOCK_THREAD_CPUTIME_ID reports how One of the biggest challenges with implement- much cpu time an individual thread has con- ing multiple namespaces is how do we modify sumed. the existing kernel interfaces in a way that re- tains backwards compatibility for existing ap- If process migration is not a concern none plications while still allowing the reality of the of these clocks except possibly CLOCK_ new situation to be seen and worked with. 2006 Linux Symposium, Volume One • 111

Modifying existing system calls is proba- the query, getting the context information nec- bly the easiest case. Instead of reference essary to lookup up the appropriate global vari- global variables we reference variables through ables is a challenge. It can even be difficult to current. figure out which port namespace to reply to. As long as the only users of are part of the Modifying /proc is trickier. Ideally we would networking stack I have an implementation that introduce a new subdirectory for each class of solves the problems. However, as netlink has namespace, and in that folder list each instance been suggested for other things, I can’t count of that namespace, adding symbolic links from on just the network stack processing packets. the existing names where appropriate. Unfortu- nately it is not possible to obtain a list of names- Resource counters are one of the more chal- paces and the extra maintenance cost does not lenging interfaces to specify. Deciding if an yet seem to just the extra complexity and cost of interface should be per namespace or global is a linked list. So for proc what we are likely to a challenge, and answering the question how see is the information for each namespace listed does this work when we have recursive in- in /proc/pid with a symbolic link from the stances of a namespace. All of these con- existing location into the new location under cerns are exemplified when there are mul- /proc/self/. tiple untrusted users on the system. For starters we should be able to punt and im- Modifying sysctl is fairly straightforward plement something simple and require CAP_ but a little tricky to implement. The problem SYS_RESOURCE if there are any per name- is that sysctl assumes it is always dealing space resource limits. Which should leave with global variables. As we put those vari- us with a simple and correct implementation. ables into namespaces we can no longer store Then the additional concerns can be addressed the pointer into a global variable. So we need from there. to modify the implementation of sysctl to call a function which takes a task_struct argument to find where the variable is located. Once that is done we can move /proc/sys into /proc/pid/sys.

Modifying sysfs doesn’t come up for most of the namespaces but it is a serious issue for the network namespace. I haven’t a clue what the final outcome will be but assuming we want global visibility for monitor applications some- thing like /proc/pid and /proc/self needs to be added so we can list multiple in- stances and add symbolic links from their old location in sysfs.

The netlink interface is the most difficult kernel interface to work with. Because control and query packets are queued and not neces- sarily processed by the application that sends 112 • Multiple Instances of the Global Linux Namespaces Proceedings of the Linux Symposium

Volume One

July 19th–22nd, 2006 Ottawa, Ontario Canada Conference Organizers

Andrew J. Hutton, Steamballoon, Inc. C. Craig Ross, Linux Symposium

Review Committee

Jeff Garzik, Red Hat Software Gerrit Huizenga, IBM Dave Jones, Red Hat Software Ben LaHaise, Intel Corporation Matt Mackall, Selenic Consulting Patrick Mochel, Intel Corporation C. Craig Ross, Linux Symposium Andrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc. David M. Fellows, Fellows and Carr, Inc. Kyle McMartin

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.