
In the name of God, the Most Gracious, the Most Merciful

Sharif University of Technology
Data and Network Security Lab.

Lightweight Virtualization in Linux

Sadegh Dorri N., PhD Candidate

Data and Network Security Lab. Seminar, 4 Aban 1393

The Need for Virtualization

Hypervisors are the living proof of operating system's incompetence!

• Scheduling a Multi-process "application"
  - Nice, priority, etc. are hard to manage dynamically
• Kernel Memory Management
  - Fork bombs
  - $ while true; do mkdir x; cd x; done
• Abuse should be the application's problem, rather than everyone's!

The failure of operating systems and how we can fix it: http://lwn.net/Articles/524952/

Agenda

• Motivation
  - Virtualization architectures
  - OS-level virtualization in Linux
• A demo
• Under the hood
  - LXC components
  - Related kernel features: cgroups and namespaces
• Security considerations
• Conclusion

Various Virtualization Architectures

Hardware Virtualization

• VMware, Parallels, QEMU, Bochs, Xen, KVM
• Resources cannot be shared between VMs.

OS-Level Virtualization

• Linux Containers (LXC), Linux-VServer, OpenVZ, Parallels Virtuozzo Containers
• FreeBSD jails
• Solaris Containers/Zones
• IBM AIX6 WPARs (Workload Partitions)

OS-Level Virtualization in Linux

• Linux Containers
  - Allow a kernel to support more resource-isolation use cases
  - Without the overhead and complexity of running multiple kernel and driver instances
• Benefits
  - Isolation
  - Small footprint
  - Speed

3) Speed

2) Footprint

• On a typical physical server, with average compute resources, you can easily run:
  - 10-100 virtual machines
  - 100-1000 containers
• On disk, containers can be very light.
  - A few MB, even without fancy storage.

1) Isolation

Each container has:
• Its own network interface (and IP address)
  - Can be bridged, routed... just like VMs
• Its own filesystem
  - A Debian host can run a Fedora container (and vice versa)
• Isolation (security)
  - Containers A and B can't harm (or even see) each other
• Isolation (resource usage)
  - Soft and hard quotas for RAM, CPU, I/O...
• Possibility of process checkpoint/freeze and migration
  - Isolation prevents resource name conflicts

Use-Cases: Developers

• Continuous Integration
  - After each commit, run 100 tests in 100 environments
• Continuous Packaging
  - Example: Project Builder
• Escape dependency hell
  - Build (and/or run) in a controlled environment
• Put everything in a container
  - Even the tiny things

Use-Cases: Hosting Providers

• Cheap, cheaper hosting (VPS providers)
• Give away more free stuff
  - "Pay for your production, get your staging for free!"
  - Spin up/down on demand, in seconds
  - Example: dotCloud

“Google has built their entire datacenter infrastructure around Linux containers, launching more than 2 billion containers per week.” (Kubernetes: open source Google cloud platform)

Use-Cases: Everyone

• Look inside your VMs
  - You can see (and kill) individual processes
  - You can browse (and change) the filesystem
• Do (almost) whatever you did with VMs
  - ... But faster
• Migration
  - Checkpoint then unfreeze: experimental (CRIU)

Solutions in Linux

OpenVZ

• Modified Linux kernel
  - Also works with unpatched Linux 3.x (reduced feature set)
• Each container is a separate entity with its own:
  - Files: system libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
  - Users and groups: its own root user, as well as other users and groups.
  - Process tree: only sees its own processes (incl. init)
  - Network: virtual network device with its own IP addresses, iptables, and routing rules.
  - Devices: can be granted access to real devices.
  - IPC objects: shared memory, semaphores, messages.

LXC (LinuX Containers)

• Container:
  - Provides an environment like a standard Linux installation, but without the need for a separate kernel.
  - Single kernel and drivers, multiple different user spaces
• A group of processes in Linux in an isolated environment.
  - From inside: looks like a VM
  - From outside: looks like normal processes
  - Something (conceptually) in the middle between a chroot on steroids and a full-fledged VM
• LXC vs. OpenVZ
  - OpenVZ: production ready and stable; being pushed upstream
  - LXC: a work in progress; uses standard kernel features

LXC Lifecycle

• lxc-create
  - Set up a container (root filesystem and config)
• lxc-start
  - Boot the container (by default, you get a console)
• lxc-console
  - Attach a console (if you started in background)
• lxc-stop
  - Shut down the container
• lxc-destroy
  - Destroy the filesystem created with lxc-create
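The lifecycle above can be sketched as an ordered list of command lines. This is only an illustration: the container name "demo" and the template name "ubuntu" are assumptions, the helper name is hypothetical, and the function merely builds argument lists without running anything.

```python
def lxc_lifecycle_commands(name, template):
    """Return the LXC lifecycle as argv lists, in the order they would be run.

    `name` and `template` are placeholders chosen by the caller; nothing is
    executed here, so this is safe to call on any machine.
    """
    return [
        ["lxc-create", "-n", name, "-t", template],  # set up root filesystem and config
        ["lxc-start", "-n", name, "-d"],             # boot the container in the background
        ["lxc-console", "-n", name],                 # attach a console to the running container
        ["lxc-stop", "-n", name],                    # shut the container down
        ["lxc-destroy", "-n", name],                 # remove what lxc-create set up
    ]


if __name__ == "__main__":
    for argv in lxc_lifecycle_commands("demo", "ubuntu"):
        print(" ".join(argv))
```

Each list could be handed to subprocess.run on a host where the LXC tools are installed and the caller has sufficient privileges.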

See also: LXC Web Panel - http://lxc-webpanel.github.io/

Demo...

Under the Hood

LXC Components

• Components:
  - The liblxc library
  - Several language bindings for the API:
    ● Python, Lua, Go, Ruby, Haskell
  - A set of standard tools to control the containers
  - Container templates
• Open source!
• https://linuxcontainers.org/

Features Making up LXC

• Kernel features used in LXC:
  - Isolation:
    ● Kernel namespaces (ipc, uts, mount, pid, network and user)
    ● Chroots (using pivot_root)
  - Resource management:
    ● Control groups (cgroups)
  - Security:
    ● AppArmor and SELinux profiles
    ● Seccomp policies
    ● Kernel capabilities

Pivot_root and Chroot

• Change the root directory to a new path
  - pivot_root: switches the root of the complete system and removes dependencies on the old root directory.
  - chroot: applies to a single process (and its children)

Seccomp

• seccomp (SECure COMPuting mode)
  - A simple sandboxing mechanism (Linux 2.6.12+, 2005)
  - Allows a process to make a one-way transition into a "secure" state
    ● Syscalls are limited to exit(), sigreturn(), and read() and write() on already-open file descriptors.
  - Any attempt to make another system call results in SIGKILL.
• seccomp-bpf
  - An extension to seccomp that allows filtering of system calls using a configurable policy
  - Used by OpenSSH and vsftpd, as well as Google Chrome/Chromium on Chrome OS and Linux to sandbox Flash player and renderers.

Capabilities

• In traditional UNIX, processes are:
  - Privileged (EUID is 0): bypass all kernel permission checks.
  - Unprivileged: full permission checking (EUID, EGID, and supplementary group list).
• Since Linux kernel 2.2:
  - The superuser privileges are divided into distinct units (a.k.a. capabilities)
  - Capabilities can be independently enabled and disabled (per-thread)
• Examples:
  - CAP_CHOWN: Make arbitrary changes to file UIDs and GIDs.
  - CAP_KILL: Bypass permission checks for sending signals.
  - CAP_NET_ADMIN: Perform various network-related operations.
  - CAP_SYS_ADMIN
  - CAP_SYS_BOOT: Use reboot and kexec_load
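Capability sets are reported by the kernel as hexadecimal bit masks (for instance the Cap* fields of /proc/self/status), with one bit per capability. The sketch below decodes such a mask for the five capabilities named above; the bit positions are taken from the Linux capability header, the table is deliberately partial, and the function name is hypothetical.

```python
# Bit positions of the capabilities listed on this slide (a partial table;
# the kernel defines many more).
CAP_BITS = {
    0: "CAP_CHOWN",
    5: "CAP_KILL",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
}


def decode_caps(hex_mask):
    """Return the names (from CAP_BITS) whose bits are set in a hex mask string."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    print(decode_caps("0000001fffffffff"))  # a full bounding set
    print(decode_caps("0000000000000000"))  # an unprivileged process
```

A mask of all zeros decodes to an empty list, which is what an ordinary unprivileged process shows for its effective set.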

Linux Security Modules (LSM)

• A Linux kernel framework to support different security models
  - Avoids favoritism toward any single implementation.
  - Examples include AppArmor and SELinux
• Used to implement different MACs (Mandatory Access Control)

Control Groups

Introduction to CGroups

• Cgroups (control groups):
  - Allocate resources (CPU, memory, network, or their combinations) among user-defined groups of tasks (processes)
  - Think ulimit, but for groups of processes ... and with fine-grained accounting.
  - Initiated at Google (2006)
  - Available in the Fedora 18 and Ubuntu 12.10 kernels (also some previous releases).
• Commands:
  - cgcreate: creates a new cgroup
  - cgset: sets parameters for given cgroup(s)
  - cgexec: runs a task in specified control groups.

CGroups: Implementation

• Implemented as a special cgroup file system
  - libcgroup is a library that abstracts the control group file system in Linux.
  - CGroup services: allow persistence across reboots and ease of use.
• A few simple hooks inserted into the kernel (not performance-critical):
  - In the boot phase, the process creation and destruction paths, and task_struct
• procfs entries:
  ● For each process: /proc/<pid>/cgroup
  ● System-wide: /proc/cgroups
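As a small illustration of the per-process procfs entry, the sketch below parses text in the format used by /proc/<pid>/cgroup, where each line has the form hierarchy-ID:subsystem-list:cgroup-path. The sample lines in the usage part are made up for illustration, and the function name is hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse the contents of a /proc/<pid>/cgroup file into a list of dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only: the path itself may contain ':'.
        hierarchy_id, subsystems, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "subsystems": [s for s in subsystems.split(",") if s],
            "path": path,
        })
    return entries


if __name__ == "__main__":
    sample = "4:memory:/aloha\n3:cpu,cpuacct:/\n"  # illustrative content only
    for entry in parse_proc_cgroup(sample):
        print(entry)
```

On a Linux host the same function can be fed the output of open("/proc/self/cgroup").read().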

CGroup Subsystems

• cpu
  - Controls the CPU scheduler
• cpuacct
  - Generates automatic reports on CPU resources
• cpuset
  - Assigns individual CPUs (cores) and memory nodes
• memory
  - Limits memory use and generates automatic reports on memory resources
• freezer
  - Suspends or resumes tasks in a cgroup.

CGroup Subsystems (cont'd)

• blkio
  - Limits I/O on block devices (disk, solid state, USB, etc.).
• devices
  - Allows/denies access to devices
• net_cls
  - Differentiates between packets of different cgroups.
• net_prio
  - Dynamically sets the priority of network traffic per network interface.

Cgroups: Basics

• Everything is exposed through a virtual filesystem
  - /cgroup, /sys/fs/cgroup... YourMountpointMayVary
• Create a cgroup:
  - mkdir /cgroup/aloha
  - Automatically creates these files: tasks, cgroup.procs, etc.
• Move the process with PID 1234 to the cgroup:
  - echo 1234 > /cgroup/aloha/tasks
• Limit memory usage:
  - echo 10000000 > /cgroup/aloha/memory.limit_in_bytes
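The same three steps can be written as plain file operations, which is all the cgroup filesystem interface requires. In this sketch the mount point is a parameter (an assumption, since it varies between systems), the helper names are hypothetical, and pointing it at a real cgroup mount needs root privileges. Against an ordinary directory it simply creates regular files, which is convenient for a dry run.

```python
import os


def create_cgroup(root, name):
    """Create a cgroup directory under the given mount point and return its path."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path


def write_control_file(cgroup_path, filename, value):
    """Write a value to one of the cgroup's control files."""
    with open(os.path.join(cgroup_path, filename), "w") as handle:
        handle.write(f"{value}\n")


def move_task(cgroup_path, pid):
    """Equivalent of: echo PID > <cgroup>/tasks"""
    write_control_file(cgroup_path, "tasks", pid)


def limit_memory(cgroup_path, limit_bytes):
    """Equivalent of: echo BYTES > <cgroup>/memory.limit_in_bytes"""
    write_control_file(cgroup_path, "memory.limit_in_bytes", limit_bytes)
```

For example, create_cgroup("/cgroup", "aloha") followed by move_task(path, 1234) and limit_memory(path, 10000000) mirrors the shell commands above.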

CPUset Subsystem

• Each subsystem adds specific control files for its own needs
  - Prefixed by its name:
    cpuset.cpus
    cpuset.mems
    cpuset.cpu_exclusive
    cpuset.mem_exclusive
    cpuset.mem_hardwall
    cpuset.memory_migrate
    cpuset.memory_pressure
    cpuset.memory_pressure_enabled
    cpuset.memory_spread_page
    cpuset.memory_spread_slab
    cpuset.sched_load_balance
    cpuset.sched_relax_domain_level

CGroup: CPU (and Friends)

• Limiting
  - Set cpu.shares (defines relative weights)
• Accounting
  - Check cpuacct.stat for the user/system breakdown
• Isolate
  - Use cpuset.cpus (also for NUMA systems)
• Can't really throttle a group of processes.
  - But that's OK: context-switching << 1/HZ
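Because cpu.shares are relative weights, the CPU fraction a group receives depends on the shares of the groups competing with it. The sketch below computes that fraction under a simplifying assumption: all listed groups are siblings and are all busy on the same CPUs. The group names and values are illustrative only.

```python
def cpu_fractions(shares):
    """Map each group name to its fraction of CPU time under full contention.

    `shares` maps a group name to its cpu.shares value. An idle group's share
    is redistributed in practice, so this only models the fully loaded case.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive cpu.shares value")
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    print(cpu_fractions({"web": 1024, "batch": 512, "backup": 512}))
```

Doubling one group's shares does not double its CPU time in absolute terms; it only changes its weight relative to its siblings.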

CGroup: Memory

• Up to 25 control files
• Limiting
  - Memory usage, swap usage
  - Soft limits and hard limits
  - Can be nested
• Accounting
  - Cache vs. RSS
  - Active vs. inactive
  - File-backed pages vs. anonymous pages
  - Page-in/page-out
• Isolation
  - Reserve memory thanks to hard limits

CGroup: Block I/O

• Limiting & Isolation
  - blkio.throttle.{read,write}.{iops,bps}.device
  - Drawback: only for sync I/O (i.e. "classical" reads; not writes; not mapped files)
• Accounting
  - Number of I/Os, bytes, service time...
  - Drawback: same as above
• CGroups aren't perfect for limiting I/O
  - Limiting the amount of dirty memory helps a bit.

Namespaces

Linux Namespaces

• Namespaces: Lightweight process virtualization
  - Isolation: enable a process (or several processes) to have different views of the system than other processes.
  - Idea dates back to 1992 (Plan 9)
• Introduced in Linux 2.4.19 (2002)
  - The user namespace was the last one: a number of Linux filesystems are not user-namespace aware yet!
• User space modification
  - No modification is needed (in general)
  - Some utilities have been made namespace-aware (iproute, util-linux)

Different Kinds of Namespaces

• There are currently 6 namespaces in Linux:
  - pid (processes)
  - net (network interfaces, routing...)
  - ipc (System V IPC)
  - mnt (mount points, filesystems)
  - uts (hostname)
  - user (UIDs)
• 4 other namespaces are not implemented (yet):
  - Security, security keys, device, and time
• All require CAP_SYS_ADMIN
  - Except user namespaces (not privileged)
  - All the other ones can be created in conjunction with a new user namespace.

Namespaces Implementation

• Namespaces do not have names
  - Each namespace has a unique inode number (Linux 3.8+)
  - The inode number is assigned when the namespace is created.
  - There is an initial, default namespace of each kind.
• ls -al /proc/<pid>/ns
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
    lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
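Since the inode number is the only identity a namespace has, two processes share a namespace exactly when their /proc/<pid>/ns links point to the same type and inode. The sketch below parses link targets of the form shown in the listing and compares them; the function names are hypothetical, and read_namespaces only works on a Linux host with a kernel that exposes these links.

```python
import os
import re

# Link targets look like "net:[4026531956]": namespace type, then inode number.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(target_a, target_b):
    """True if both link targets name the same namespace (same type and inode)."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)


def read_namespaces(pid="self"):
    """Read every namespace link of a process; requires Linux procfs."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        for name in os.listdir(ns_dir)
    }
```

Comparing read_namespaces() for a containerized process with that of the host's init shows at a glance which namespaces were unshared.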

Trivial Namespaces

• UTS (hostname)
  - gethostname(), sethostname()
  - struct system_utsname per container
• Sys V IPC: shmem, semaphores, msg queues
  - Keys must be mutually agreed upon by both client and server processes
  - ipc namespace: uniqueness context of keys

Namespaces: pid

• Usually a PID is an arbitrary number
• Special cases:
  - Init (i.e. child reaper) has a PID of 1
  - Can't change PID (process migration)

Namespaces: pid (cont'd)

• PID is no longer unique in the kernel
  - A process has (can have) different PIDs in each ns
  - /proc/$PID/* is virtualized
• PID namespaces are nested
  - Processes in a PID ns can't see/affect processes of the parent ns
  - But all PIDs in the ns are visible to the parent ns.

PID 1

• Each PID namespace has a PID #1
  - Its first process
• Behaves like the “init” process:
  - When a process dies, all its orphaned children are re-parented to the process with PID 1.
  - Sending SIGKILL to PID 1 from inside its own namespace does not kill it, whether that is the initial namespace or another PID namespace.
• An important feature for containers

Namespaces: net

• Logically another copy of the network stack, with its own separate...
  - Network interfaces (and its own lo/127.0.0.1)
  - IP address(es) and sockets
  - Routing table(s), iptables rules
• Communication between containers:
  - UNIX domain sockets (= on the filesystem)
  - Creating a pair of network devices (veth) and moving one into another namespace (like a pipe)

Namespaces: mnt

• In a new mount namespace:
  - All previous mounts will be visible
  - But mounts/unmounts in that mount namespace are invisible to the rest of the system.
  - Mounts/unmounts in the global namespace are visible in that namespace.
• A mnt namespace can have its own rootfs
• Special filesystems must be remounted, e.g.:
  - procfs (to see the processes)
  - devpts (to see pseudo-terminals)

Namespaces: user

• A process will have a distinct set of UIDs, GIDs and capabilities.
  - UID 42 in container X isn't UID 42 in container Y

UID Namespace (example)

• Running from some user account
  - id -u → 1000 (effective user ID)
  - id -g → 1000 (effective group ID)
• Capabilities: cat /proc/self/status | grep Cap
  - CapInh: 0000000000000000
  - CapPrm: 0000000000000000
  - CapEff: 0000000000000000
  - CapBnd: 0000001fffffffff
• To create a user namespace and start a shell, we run the following from that non-root account:
  - unshare -U /bin/bash

Example (cont'd)

• Now from the new shell run
  - id -u → 65534
  - id -g → 65534
  - These are the default values for the eUID and eGID in the new namespace.
  - No difference if unshare is run by the root user
• Capabilities: cat /proc/self/status | grep Cap
  - CapInh: 0000000000000000
  - CapPrm: 0000000000000000
  - CapEff: 0000000000000000
  - CapBnd: 0000001fffffffff
• In fact:
  - The process had full capabilities in the new namespace, but they were dropped when unshare executed the shell.
  - User mapping can be specified in the gid_map and uid_map files of the created process.
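The 65534 shown above is what an unmapped ID looks like from inside the namespace. Each line of uid_map (and gid_map) has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below models that translation; the function names are hypothetical, and 65534 is assumed to be the overflow ID (the usual default on Linux).

```python
OVERFLOW_ID = 65534  # assumed default overflow UID/GID reported for unmapped IDs


def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        inside, outside, length = (int(field) for field in fields)
        ranges.append((inside, outside, length))
    return ranges


def to_outside(id_map, inside_id):
    """Translate an ID seen inside the namespace to the parent's ID, or None if unmapped."""
    for inside, outside, length in id_map:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


def to_inside(id_map, outside_id):
    """Translate a parent-namespace ID to what the namespace sees; unmapped IDs overflow."""
    for inside, outside, length in id_map:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return OVERFLOW_ID
```

With an empty map, to_inside([], 1000) returns 65534, matching the id -u output in the example; with the single line "0 1000 1", user 1000 appears as root inside the namespace.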

unshare util

• Runs a program with some namespace(s) unshared from the parent
• Example
  - ./unshare --net bash
  - A new network namespace is created and the bash process is started inside that namespace.
  - Now ifconfig -a will show only the loopback device
• After process termination,
  - The namespace(s) will be freed.

Namespace problems / todos

• Missing namespaces:
  - tty, fuse, binfmt_misc
• Identifying a namespace
  - No namespace ID, just process(es)
  - Partly solved by inode numbers
• Entering existing namespaces
  - fd=nsfd(NS, PID); setns(fd);
  - Was not possible in older kernels

Security of Containers

Uncertainties, Fears and Doubts

• “LXC is not secure. If I want real security I'll use KVM.”
  - Dan Berrange, famous LXC hacker, 2011
• Still quoted today
  - Still true in some cases
  - Things have changed a little since 2011
• Responses
  - Kernel exploits
  - Default LXC settings
  - Containers needing to do funky stuff

Kernel Exploits

• Kernel exploits: containers share the kernel
  - Buggy kernel and syscalls → Game over!
  - Unless the container is forbidden from making those syscalls
  - seccomp-bpf

Default LXC Settings

• Default LXC settings
  - AppArmor and SELinux are used to restrict some actions by default (intention: stop accidental harm)
  - Capabilities must be restricted, too.

Full capabilities and permissions inside = Sudoers access to the guest user!

Containers Needing Extra Priv.

• Network interfaces for VPN or other
• Multicast, broadcast, packet sniffing
• Raw access to devices (disks, GPU, …)
• Mounting stuff (even with FUSE)

More privileges = Greater attack surface

Some Solutions

• Use LXC for packaging
  - Containers are not used on the target host.
• Use LXC for development and testing
  - Insider code
  - Prevents accidental harm to the systems
• LXC for Web apps and databases
  - Shouldn't require extra privileges
• Use capabilities
• Defense in depth!
  - Use multiple security mechanisms
  - Both in the container and in the host (not different from the usual!)

Other Solutions

• One container per machine
  - Containers for fast deployment
• One VM per container
  - Run untrusted code within a VM within a container

Conclusion

• Containers seem to be the future of virtualization
  - Already used in production settings
  - E.g. at Google
• A stack of open source solutions is there!
  - Linux
  - LXC
  - Docker, ProjectBuilder, Puppet, etc.
  - PaaS
• Every technology has its own drawbacks
  - Security is a strong concern!

Thank You!

Useful References