In the name of God, the Most Gracious, the Most Merciful
Sharif University of Technology, Data and Network Security Lab.
Lightweight Virtualization in Linux
Sadegh Dorri N., PhD Candidate
Data and Network Security Lab. Seminar, 4 Aban 1393
The Need for Virtualization
Hypervisors are the living proof of operating systems' incompetence!
Scheduling a Multi-process “application”
- Nice, priority, etc. are hard to manage dynamically
Kernel Memory Management
- Fork bombs
- $ while true; do mkdir x; cd x; done
Abuse should be the application's problem, rather than everyone's!
The failure of operating systems and how we can fix it: http://lwn.net/Articles/524952/
Agenda
Motivation
- Virtualization architectures
- OS-level virtualization in Linux
A demo
Under the hood
- LXC components
- Related kernel features: cgroups and namespaces
Security considerations
Conclusion
Various Virtualization Architectures
Hardware Virtualization
VMware, Parallels, QEMU, Bochs, Xen, KVM
Resources cannot be shared between VMs.
OS-Level Virtualization
Linux Containers (LXC), Linux-VServer, OpenVZ, Parallels Virtuozzo Containers
FreeBSD jails
Solaris Containers/Zones
IBM AIX6 WPARs (Workload Partitions)
OS-Level Virtualization in Linux
Linux Containers
- Allow a kernel to support more resource-isolation use cases
- Without the overhead and complexity of running multiple kernel and driver instances
Benefits
- Isolation
- Small footprint
- Speed
3) Speed
2) Footprint
On a typical physical server, with average compute resources, you can easily run:
- 10-100 virtual machines
- 100-1000 containers
On disk, containers can be very light.
- A few MB, even without fancy storage.
1) Isolation
Each container has:
Its own network interface (and IP address)
- can be bridged, routed... just like VMs
Its own filesystem
- Debian host can run Fedora container (& vice-versa)
Isolation (security)
- container A & B can't harm (or even see) each other
Isolation (resource usage)
- soft & hard quotas for RAM, CPU, I/O...
Possibility of process checkpoint/freeze and migration
- Isolation prevents resource name conflicts
Use-Cases: Developers
Continuous Integration
- After each commit, run 100 tests in 100 environments
Continuous Packaging
- Example: Project Builder
Escape dependency hell
- Build (and/or run) in a controlled environment
Put everything in a container
- Even the tiny things
Use-Cases: Hosting Providers
Cheaper Hosting (VPS providers)
Give away more free stuff
- "Pay for your production, get your staging for free!"
- Spin up/down on demand, in seconds
- Example: dotCloud
“Google has built their entire datacenter infrastructure around Linux containers, launching more than 2 billion containers per week.”
(Kubernetes: open source Google cloud platform)
Use-Cases: Everyone
Look inside your VMs
- You can see (and kill) individual processes
- You can browse (and change) the filesystem
Do (almost) whatever you did with VMs
- ... But faster
Migration
- Checkpoint then unfreeze: experimental (CRIU)
Solutions in Linux
OpenVZ
Modified Linux kernel
- Also works with unpatched Linux 3.x (reduced feature set)
Each container is a separate entity with its own:
- Files: System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
- Users and groups: its own root user, as well as other users and groups.
- Process tree: only sees its own processes (incl. init)
- Network: virtual network device with own IP addresses, iptables, and routing rules.
- Devices: can be granted access to real devices.
- IPC objects: shared memory, semaphores, messages.
LXC (LinuX Containers)
Container:
- Provides an environment like a standard Linux installation, but without the need for a separate kernel.
- Single kernel and drivers, multiple different user spaces
A group of processes in Linux in an isolated environment.
- From inside: looks like a VM
- From outside: looks like normal processes
- Something (conceptually) in the middle between a chroot on steroids and a full-fledged VM
LXC vs. OpenVZ
- OpenVZ: production-ready and stable; pushing to the upstream
- LXC: a work-in-progress; uses standard kernel features
LXC Lifecycle
lxc-create
- Set up a container (root filesystem and config)
lxc-start
- Boot the container (by default, you get a console)
lxc-console
- Attach a console (if you started in background)
lxc-stop
- Shut down the container
lxc-destroy
- Destroy the filesystem created with lxc-create
See also: LXC Web Panel - http://lxc-webpanel.github.io/
Demo...
Under the Hood
LXC Components
Components:
- The liblxc library
- Several language bindings for the API:
  ● Python, Lua, Go, Ruby, Haskell
- A set of standard tools to control the containers
- Container templates
Open source! https://linuxcontainers.org/
Features Making up LXC
Kernel features used in LXC:
- Isolation:
  ● Kernel namespaces (ipc, uts, mount, pid, network and user)
  ● Chroots (using pivot_root)
- Resource management:
  ● Control groups (cgroups)
- Security:
  ● AppArmor and SELinux profiles
  ● Seccomp policies
  ● Kernel capabilities
Pivot_root and Chroot
Change the root directory to a new path
- Pivot_root: switches the complete system and removes dependencies on the old root dir.
- Chroot: applied on a single process
Seccomp
seccomp (SECure COMPuting mode)
- A simple sandboxing mechanism (Linux 2.6.12+ (2005))
- Allows a process to make a one-way transition into a "secure" state
  ● Syscalls limited to exit(), sigreturn(), and read() and write() on already-open file descriptors.
- Any other system call results in SIGKILL.
seccomp-bpf
- An extension to seccomp that allows filtering of system calls using a configurable policy
- Used by OpenSSH and vsftpd as well as Google Chrome/Chromium on Chrome OS and Linux to sandbox Flash player and renderers.
Capabilities
In traditional UNIX, processes are:
- Privileged (EUID is 0): Bypass all kernel permission checks.
- Unprivileged: full permission checking (EUID, EGID, and supplementary group list).
Since Linux kernel 2.2:
- The superuser privileges are divided into distinct units (a.k.a. capabilities)
- Capabilities can be independently enabled and disabled (per-thread)
Examples:
- CAP_CHOWN: Make arbitrary changes to file UIDs and GIDs.
- CAP_KILL: Bypass permission checks for sending signals.
- CAP_NET_ADMIN: Perform various network-related operations.
- CAP_SYS_ADMIN
- CAP_SYS_BOOT: Use reboot and kexec_load
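Capability sets are exposed as hexadecimal bitmasks (for example in the Cap* lines of /proc/<pid>/status), where bit N stands for capability number N. The following is a minimal sketch of decoding such a mask. It covers only the five capabilities listed above, with their bit numbers taken from linux/capability.h; the function names are illustrative and not part of any library.

```python
# Bit numbers of the capabilities named on this slide (from linux/capability.h).
CAPABILITIES = {
    "CAP_CHOWN": 0,
    "CAP_KILL": 5,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_BOOT": 22,
}


def has_capability(mask_hex: str, name: str) -> bool:
    """Return True if the capability's bit is set in a hex mask such as CapBnd."""
    mask = int(mask_hex, 16)
    return bool((mask >> CAPABILITIES[name]) & 1)


def decode(mask_hex: str) -> list[str]:
    """List the known capabilities (from the small table above) present in the mask."""
    return [name for name in CAPABILITIES if has_capability(mask_hex, name)]


if __name__ == "__main__":
    # A full bounding set versus an empty effective set.
    print(decode("0000001fffffffff"))
    print(decode("0000000000000000"))
```

A mask of all zeros decodes to an empty list, which is what an unprivileged process shows for its effective set.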
Linux Security Modules (LSM)
A Linux kernel framework to support different security models
- Avoids favoritism toward any single implementation.
- Examples: AppArmor, SELinux, Smack and TOMOYO Linux
Used to implement different mandatory access control (MAC) schemes
Access Control
Control Groups
Introduction to CGroups
Cgroups (control groups):
- Allocate resources (CPU, memory, network, or their combinations) among user-defined groups of tasks (processes)
- Think ulimit, but for groups of processes ... and with fine-grained accounting.
- Initiated at Google (2006)
- Available in Fedora 18 kernel and Ubuntu 12.10 kernel (also some previous releases).
Commands:
- cgcreate: creates a new cgroup
- cgset: sets parameters for given cgroup(s)
- cgexec: runs a task in specified control groups.
CGroups: Implementation
Implemented as a special cgroup file system
- libcgroup is a library that abstracts the control group file system in Linux.
- CGroup services: allow persistence across reboots and ease of use.
A few simple hooks inserted into the kernel (not performance-critical):
- In the boot phase, process creation and destroy methods, task_struct
procfs entries:
- For each process: /proc/<pid>/cgroup
- System-wide: /proc/cgroups
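Each line of the per-process cgroup file has the form hierarchy-ID:controller-list:path. The sketch below parses that format; the sample text and the names CgroupEntry and parse_proc_cgroup are made up for illustration, and on a real system the input would come from reading /proc/self/cgroup.

```python
from typing import NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: list[str]
    path: str


def parse_proc_cgroup(text: str) -> list[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup into structured entries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, since the path may contain ':'.
        hierarchy, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append(CgroupEntry(int(hierarchy), names, path))
    return entries


# Hypothetical sample: a process in the "aloha" memory cgroup.
SAMPLE = "4:memory:/aloha\n3:cpu,cpuacct:/\n"

if __name__ == "__main__":
    for entry in parse_proc_cgroup(SAMPLE):
        print(entry)
```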
CGroup Subsystems
cpu
- controls the CPU scheduler
cpuacct
- generates automatic reports on CPU resources
cpuset
- assigns individual CPUs (cores) and memory nodes
memory
- limits memory use and generates automatic reports on memory resources
freezer
- suspends or resumes tasks in a cgroup.
CGroup Subsystems (cont'd)
blkio
- sets limits on block device I/O (disk, solid state, USB, etc.).
devices
- allows/denies access to devices
net_cls
- differentiates between packets of different cgroups.
net_prio
- dynamically sets the priority of network traffic per network interface.
Cgroups: Basics
Everything exposed through a virtual filesystem
- /cgroup, /sys/fs/cgroup... YourMountpointMayVary
Create a cgroup:
- mkdir /cgroup/aloha
- Automatically creates these files: tasks, cgroup.procs, etc.
Move process with PID 1234 to the cgroup:
- echo 1234 > /cgroup/aloha/tasks
Limit memory usage:
- echo 10000000 > /cgroup/aloha/memory.limit_in_bytes
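The same three steps can be scripted, since the interface is just directories and files. This is a rough sketch that assumes a cgroup v1 style layout as on the slide. The helper names are hypothetical, and the demo deliberately runs against a temporary directory so it can be tried without root. On a real host, the root argument would be the actual mount point of the memory controller, where the kernel creates the control files itself.

```python
import tempfile
from pathlib import Path


def create_cgroup(root: str, name: str) -> Path:
    """mkdir <root>/<name>; on a real cgroup mount the kernel adds the control files."""
    group = Path(root) / name
    group.mkdir(parents=True, exist_ok=True)
    return group


def add_task(group: Path, pid: int) -> None:
    """Equivalent of: echo <pid> > <group>/tasks"""
    (group / "tasks").write_text(f"{pid}\n")


def limit_memory(group: Path, limit_bytes: int) -> None:
    """Equivalent of: echo <limit> > <group>/memory.limit_in_bytes"""
    (group / "memory.limit_in_bytes").write_text(f"{limit_bytes}\n")


def dry_run() -> Path:
    """Replay the slide's commands in a scratch directory (no real cgroup is touched)."""
    root = tempfile.mkdtemp(prefix="fake-cgroup-")
    group = create_cgroup(root, "aloha")
    add_task(group, 1234)
    limit_memory(group, 10_000_000)
    return group


if __name__ == "__main__":
    print("wrote control files under", dry_run())
```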
CPUset Subsystem
Each subsystem adds specific control files for its own needs
- Prefixed by its name
cpuset.sched_relax_domain_level
cpuset.cpus
cpuset.memory_migrate
cpuset.mems
cpuset.memory_pressure
cpuset.cpu_exclusive
cpuset.memory_spread_page
cpuset.mem_exclusive
cpuset.memory_spread_slab
cpuset.mem_hardwall
cpuset.memory_pressure_enabled
cpuset.sched_load_balance
CGroup: CPU (and Friends)
Limiting
- Set cpu.shares (defines relative weights)
Accounting
- Check cpuacct.stat for user/system breakdown
Isolate
- Use cpuset.cpus (also for NUMA systems)
Can't really throttle a group of processes.
- But that's OK: context-switching << 1/HZ
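Because cpu.shares values are relative weights, a group's share of CPU time under contention is its weight divided by the sum of the weights of all competing groups. A small worked sketch follows; the group names and values are invented for illustration, and the weights only matter when the groups actually compete for the CPU.

```python
def cpu_fractions(shares: dict[str, int]) -> dict[str, float]:
    """Fraction of CPU time each group gets when all groups are busy."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # Hypothetical groups: "web" has twice the weight of "batch".
    for name, fraction in cpu_fractions({"web": 1024, "batch": 512}).items():
        print(f"{name}: {fraction:.1%}")
```

With weights 1024 and 512, the first group gets two thirds of the CPU and the second one third, but an idle group's unused time is available to the other.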
CGroup: Memory
Up to 25 control files
Limiting
- memory usage, swap usage
- soft limits and hard limits
- can be nested
Accounting
- cache vs. rss
- active vs. inactive
- file-backed pages vs. anonymous pages
- page-in/page-out
Isolation
- Reserve memory thanks to hard limits
CGroup: Block I/O
Limiting & Isolation
- blkio.throttle.{read,write}_{bps,iops}_device
- Drawback: only for sync I/O (i.e.: "classical" reads; not writes; not mapped files)
Accounting
- Number of IOs, bytes, service time...
- Drawback: same as previously
CGroups aren't perfect to limit I/O
- Limiting the amount of dirty memory helps a bit.
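A throttle rule is a single line of the form "major:minor limit" written to a file such as blkio.throttle.read_bps_device. The sketch below only builds that line and looks up a device's numbers. The helper names are illustrative, and the device path in the usage note is an assumption about the host.

```python
import os


def device_numbers(device_path: str) -> tuple[int, int]:
    """Return the (major, minor) numbers of a block device node."""
    rdev = os.stat(device_path).st_rdev
    return os.major(rdev), os.minor(rdev)


def throttle_rule(major: int, minor: int, limit: int) -> str:
    """Format one rule line: '<major>:<minor> <limit>'."""
    return f"{major}:{minor} {limit}"


if __name__ == "__main__":
    # Example: cap reads on device 8:0 at 1 MiB/s.
    print(throttle_rule(8, 0, 1024 * 1024))
```

On a real host one might call device_numbers("/dev/sda") and write the resulting rule into the group's throttle file as root.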
Namespaces
Linux Namespaces
Namespaces: Lightweight process virtualization
- Isolation: enable a process (or several processes) to have a different view of the system than other processes.
- Idea dates back to 1992 (Plan 9)
Introduced in Linux 2.4.19 (2002)
- The user namespace was the last one: a number of Linux filesystems are not yet user-namespace aware!
User space modification
- No modification is needed (in general)
- Some utilities are made namespace-aware (iproute, util-linux)
Different Kinds of Namespaces
There are currently 6 namespaces in Linux:
- pid (processes)
- net (network interfaces, routing...)
- ipc (System V IPC)
- mnt (mount points, filesystems)
- uts (hostname)
- user (UIDs)
4 other namespaces are not implemented (yet):
- Security, security keys, device, and time
All require CAP_SYS_ADMIN
- Except user namespaces (not privileged)
- All the other ones can be created in conjunction with a new user namespace.
Namespaces Implementation
Namespaces do not have names
- Each namespace has a unique inode number (Linux 3.8+)
- The inode number is assigned when the namespace is created.
- There is an initial, default namespace for each namespace type.
ls -al /proc/
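On kernels that expose namespace inodes, each entry under /proc/<pid>/ns is a symbolic link whose target looks like "uts:[4026531838]". Two processes are in the same namespace exactly when those inode numbers match. The following sketch reads and parses the links; the function names are illustrative, and the inode in the example is just a sample value.

```python
import os
import re

# Link targets look like "uts:[4026531838]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["type"], int(match["inode"])


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry: skip it
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(namespaces_of().items()):
        print(f"{name}: {inode}")
```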
Trivial Namespaces
UTS (hostname)
- gethostname(), sethostname()
- struct system_utsname per container
Sys V IPC: shmem, semaphores, msg queues
- Keys must be mutually agreed upon by both client and server processes
- ipc namespace: uniqueness context of keys
Namespaces: pid
Usually a PID is an arbitrary number
Special cases:
- Init (i.e. child reaper) has a PID of 1
- Can't change PID (process migration)
Namespaces: pid (cont'd)
PID is no longer unique in kernel
- A process has (can have) different PIDs in each ns
- /proc/$PID/* is virtualized
PID namespaces are nested
- Processes in a PID ns can't see/affect processes of the parent ns
- But all PIDs in the ns are visible to the parent ns.
PID 1
Each PID namespace has a PID #1
- Its first process
Behavior like the “init” process:
- When a process dies, all its orphaned children will now have the process with PID 1 as their parent.
- A SIGKILL sent from inside the namespace does not kill process 1; only a process in an ancestor PID namespace can forcibly kill it.
An important feature for containers
Namespaces: net
Logically another copy of the network stack, with its own separate...
- Network interfaces (and its own lo/127.0.0.1)
- IP address(es) and sockets
- Routing table(s), iptables rules
Communication between containers:
- UNIX domain sockets (= on the filesystem)
- By creating a pair of network devices (veth) and moving one to another namespace (like a pipe).
Namespaces: mnt
In a new mount namespace:
- All previous mounts will be visible
- But mounts/unmounts in that mount namespace are invisible to the rest of the system.
- Mounts/unmounts in the global namespace are visible in that namespace.
A mnt namespace can have its own rootfs
Special filesystems must be remounted, e.g.:
- procfs (to see the processes)
- devpts (to see pseudo-terminals)
Namespaces: user
A process will have a distinct set of UIDs, GIDs and capabilities.
- UID 42 in container X isn't UID 42 in container Y
UID Namespace (example)
Running from some user account
- id -u → 1000 (effective user ID)
- id -g → 1000 (effective group ID)
Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff
In order to create a user namespace and start a shell, we will run from that non-root account:
- unshare -U /bin/bash
Example (cont'd)
Now from the new shell run
- id -u → 65534
- id -g → 65534
- These are default values for the eUID and eGID in the new namespace.
- No difference if unshare is run by the root user
Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff
In fact:
- The namespace had full capabilities, but unshare removed them.
- User mapping can be specified in gid_map and uid_map of the created process.
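The 65534 result follows from how ID maps work: each line of uid_map (or gid_map) holds three numbers, the first ID inside the namespace, the first ID outside it, and the length of the range, and any ID not covered by a range shows up as the overflow ID. Here is a minimal sketch of that lookup. It assumes the default overflow ID of 65534, and the function names are illustrative.

```python
OVERFLOW_ID = 65534  # assumed default; the kernel's value is configurable


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) ranges."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        ranges.append((inside, outside, length))
    return ranges


def id_inside_namespace(outside_id: int, ranges: list[tuple[int, int, int]]) -> int:
    """Translate an ID from outside the namespace to the ID seen inside it."""
    for inside, outside, length in ranges:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return OVERFLOW_ID  # unmapped IDs appear as the overflow ID


if __name__ == "__main__":
    # No mapping written yet: user 1000 appears as the overflow ID.
    print(id_inside_namespace(1000, parse_id_map("")))
    # Map outside UID 1000 to root (UID 0) inside the namespace.
    print(id_inside_namespace(1000, parse_id_map("0 1000 1")))
```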
unshare util
Runs a program with some namespace(s) unshared from parent
Example
- ./unshare --net bash
- A new network namespace is created and the bash process is started inside it.
- Now ifconfig -a will show only the loopback device
After process termination,
- The namespace(s) will be freed.
Namespace problems / todos
Missing namespaces:
- tty, fuse, binfmt_misc
Identifying a namespace
- No namespace ID, just process(es)
- Partly solved by inode numbers
Entering existing namespaces
- fd=nsfd(NS, PID); setns(fd);
- Was not possible in older kernels
Security of Containers
Uncertainties, Fears and Doubts
“LXC is not secure. If I want real security I'll use KVM.”
- Dan Berrange, famous LXC hacker, 2011
Still quoted today
- Still true in some cases
- Things have changed a little since 2011
Responses
- Kernel exploits
- Default LXC settings
- Containers needing to do funky stuff
Kernel Exploits
Kernel exploits: Containers share the kernel
- Buggy kernel and syscalls → Game over!
- Unless the container is forbidden from those syscalls
- seccomp-bpf
Default LXC Settings
Default LXC settings
- AppArmor and SELinux are used to restrict some actions, by default (intention: stop accidental harm)
- Capabilities must be restricted, too.
Full capabilities and permissions inside = Sudoers access to the guest user!
Containers Needing Extra Priv.
Network interfaces for VPN or other
Multicast, broadcast, packet sniffing
Raw access to devices (disks, GPU, …)
Mounting stuff (even with FUSE)
More privileges = Greater attack surface
Some Solutions
Use LXC for packaging
- Containers are not used on the target host.
Use LXC for development and testing
- Insider code
- Prevents accidental harm to the systems
LXC for Web apps and databases
- Shouldn't require extra privileges
Use capabilities
Defense in depth!
- Use multiple security mechanisms
- Both in the container and in the host (not different from the usual!)
Other Solutions
One container per machine
- Containers for fast deployment: Docker
One VM per container
- Run untrusted code within a VM within a container
Conclusion
Containers seem to be the future of virtualization
- Already used in production settings
- E.g. at Google
A stack of open source solutions is there!
- Linux
- LXC
- Docker, ProjectBuilder, Puppet, etc.
- PaaS
Every technology has its own drawbacks
- Security is a strong concern!
Thank You!
Useful References