LXC( Container)

Lightweight virtual system mechanism Gao feng [email protected]

1 Outline

Introduction Namespace System API Libvirt LXC Comparison Problems Future work

2 Introduction

 Container: Operation System Level virtualization method for Linux

Guest1 Guest2 Container Management P1 P2 Tools P1 P2

Namespace API/ABI Namespace Set 1 Set 2 Kernel

3 Introduction

Why Container Better Performance

Kvm Container App Guest-OS Emulator-Lay App Host-OS Host-OS

Easy to set up Multi-Tenancy environment Namespace

Namespace isolates the resources of system, currently there are 6 kinds of namespaces in linux kernel. Mount namespace UTS namespace IPC namespace Net namespace Pid namespace User namespace

5 Mount Namespace

Each mount namespace has its own filesystem layout. /proc//mounts /proc//mounts /proc//mounts

/ /dev/sda1 / /dev/sda1 / /dev/sda3 /home /dev/sda2 /home /dev/sda2 /boot /dev/sda4

P1 P2 P3

Mount Mount Namespace1 Namespace2

6 UTS Namespace

Every uts namespace has its own uts related information.

ostype: Linux ostype: Linux Unalterable osrelease: 3.8.6 osrelease: 3.8.6 version: … version: …

hostname: uts1 hostname: uts2 alterable domainname : uts1 domainname : uts2

UTS namespace1 UTS namespace2

7 IPC Namespace

IPC namespce isolates the interprocess communication resource(shared memory, semaphore, message queue)

P1 P2 P3 P4

IPC IPC namespace1 namespace2

8 Net Namespace

Net namespace isolates the networking related resources

Net devices: eth0 Net devices: eth1 IP address: 1.1.1.1/24 IP address: 2.2.2.2/24 Route Route Firewall rule Firewall rule Sockets Sockets Proc Proc sysfs sysfs … … Net Namespace1 Net Namespace2

9 PID Namespace

 PID namespace isolates the Process ID, implemented as a hierarchy.

PID namespace1 (Parent) ls /proc (Level 0) 1 2 3 4 pid:1 pid:4

P1 pid:2 pid:3 P4 pid:1 pid:1 ls /proc ls /proc 1 P2 P3 1

PID Namespace2 (Child) PID Namespace3 (Child) (Level 1) (Level 1)

10 User Namespace

kuid/kgid: Original uid/gid, Global uid/gid: user id in user namespace, will be translated to kuid/kgid finally Only parent User NS has rights to set map

kuid: kuid: 2000-2004 1000-1009 uid_map uid_map 10 2000 5 0 1000 10 uid: uid: 10-14 0-9

User namespace1 User namespace2 11 User Namespace

Create and stat file in User namesapce

root uid_map: root #touch 0 1000 10 #stat /file /file User namespace File : “/file” Access: uid (0/root)

Disk /file (kuid:1000)

12 System API/ABI

Proc /proc//ns/

System Call clone unshare setns

13 Proc

 /proc//ns/ipc: ipc namespace  /proc//ns/mnt: mount namespace  /proc//ns/net: net namespace  /proc//ns/pid: pid namespace  /proc//ns/uts: uts namespace  /proc//ns/user: user namespace

 If the proc file of two processes is the same, these two processes must be in the same namespace.

14 System Call

clone int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, …);

6 new flags: CLONE_NEWIPC,CLONE_NEWNET, CLONE_NEWNS,CLONE_NEWPID, CLONE_NEWUTS,CLONE_NEWUSER

15 System Call

clone create process2 and IPC namespace2

Mount1 Mount1

clone(,, CLONE_NEWIPC,) IPC2 IPC1 P1 P2 (new created)

Others1 Others1

16 System Call

unshare int unshare(int flags);

Namespace extends the system call unshare too. User space can use unshare to create new namespace and the caller will run in this new created namespace.

17 System Call

unshare create net namespace2

Mount1 Mount1

unshare(CLONE_NEWNET) Net2 Net1 P1 P1 (new created)

Others1 Others1

18 System Call

setns int setns(int fd, int nstype);

setns is a new added system call for namespace. Process can use setns to set which namespace the process will belong to.

@fd: file descriptor of namespace(/proc//ns/*) @nstype: type of namespace.

19 System Call

setns Change the PID namespace of P2

P1 PID1 PID1 P1

setns(open(/proc/p1/ns/pid,) , 0) P2 P2 PID2 PID2

20 Libvirt LXC

 Libvirt LXC: userspace container management tool, Implemented as one type of libvirt driver. Manage containers Create namespace Create private filesystem layout for container Create devices for container Resources controller by cgroup

21 Comparison

The feature that host share the same kernel with guest makes container different from other virtualization method

Container KVM

performance Great Normal OS support Linux Only No Limit Security Normal Great Completeness Low Great

22 Problems

/proc/meminfo, cpuinfo… Kernel space (relate to cgroup) User space (poor efficiency)

New namespace Audit (assign to user namespace?) Syslog (do we really need it?)

23 Problems

Bandwidth control TC Qdisc On host (How to handle setting nic to container?) On container (user can change it) Netfilter How to control Ingress bandwidth Disk quota Uid/Gid Quota (Many users ) Project Quota (xfs only)

24 Future Work

Improve Libvirt LXC Unchanged systemd in Libvirt LXC Use interface of systemd to set cgroup Libvirt LXC based Audit namespace

25 Thank you! Q&A

26