Virtual Filesystems: Why We Need Them and How They Work

Virtual Filesystems: Why We Need Them and How They Work

Virtual filesystems: why we need them and how they work Alison Chaiken Peloton Technology [email protected] March 9, 2019 My coworkers with our product We're hiring. Agenda ● Filesystems and VFS ● /proc and /sys ● Monitoring with eBPF and bcc ● About bind mounts and namespaces ● containers and ro-rootfs ● live-media boots 3 Does your system work now? Do you really want to mess with it? 4 What is a filesystem? 5 What is a filesystem? ● Robert Love: “A filesystem is a hierarchical storage of data adhering to a specific structure.” Does the image depict a filesystem? Linux's definition of a filesystem A filesystem must define the system calls: struct file_operations { … ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); int (*open) (struct inode *, struct file *); … } What are virtual filesystems? 8 How VFS are used userspace kernel hardware Storage-backed filesystems Pseudofilesystems 9 S. R. Kleiman and Sun Microsystems, ”Vnodes: An Architecture for Multiple File System Types”, in Proc. USENIX, Summer 1986. 10 https://commons.wikimedia.org/w/index.php?curid=64193508 VFS are an abstract interface that that interface abstract VFS an are specific FS's implement .link() =.link() simple_link() =.rmdir my_rmdir(); write() read(), open(), ext4, fuse .link = simple_link(); = .link simple_link(); =.rmdir simple_rmdir(); .write = NULL; = .read NULL; = NULL; .open VFS … ; … overrides optional mandatory stubs Typical file_operations struct file_operations .release = ext4_release_file, ext4_file_operations = { .fsync = ext4_sync_file, .llseek = ext4_llseek, .get_unmapped_area = .read_iter = ext4_file_read_iter, thp_get_unmapped_area, .write_iter = ext4_file_write_iter, .splice_read = .unlocked_ioctl = ext4_ioctl, .mmap = ext4_file_mmap, generic_file_splice_read, .mmap_supported_flags = .splice_write = MAP_SYNC, iter_file_splice_write, .open = ext4_file_open, .fallocate = ext4_fallocate, }; VFS Basics ● The VFS methods are defined in the kernel's fs/*c source files. ● Subdirectories of fs/ contain specific FS implementations. ● VFS resolve paths and permissions before calling into FS methods. ● A great example of code reuse! Unless … 13 “Resources limits were not respected, users could overwrite a setuid file without resetting the setuid bits, time stamps would not be updated . affected all filesystems offering those features and needed to be fixed at the VFS level.” Link to article 14 /proc and /sys 15 The observation that motivated the talk Try this: $ stat /proc/cpuinfo $ stat /sys/power/state $ file /proc/cpuinfo $ file /sys/power/state ? ? Why are the results so different? ? ? ? ? ? ? ? 16 Sysfs System boot Procfs Now 17 /procfs has tables; /sys has single params 18 state of kernel itself is visible via procfs ● /proc/<PID> directories contain per-process stats. ● The 'sysctl' interface manipulates /proc/sys: $ 'sysctl -a' lists system memory, network tunables ● procfs files are 'seq files' whose contents are generated dynamically. 19 /proc files: empty or no? 20 The contents of procfs appear when summoned when appear of procfs contents The Copyright 2006 Koen Kooi, all rights reserved 21 "It is a fundamental quantum doctrine that a measurement does not reveal a pre-existing value of the measured property." -- David Mermin 22 sysfs is how the kernel reacts to events ● sysfs: – publishes events to userspace about appearance and disappearance of devices, FS, power, modules … – allows these objects to be configured. – includes the kernel's famous stable ABI. ● In sysfs lies the userspace that one MUST NOT BREAK! 23 Watch USB stick insertion with eBPF and bcc git clone [email protected]:iovisor/bcc.git trace.py source Use tplist.py to discover kprobes and userspace probes that trace.py can watch. 24 Illustrating the full power of bcc-tools Watch the same sysfs_create_files() function, get more details. 25 The source code tells you what programs can do; eBPF/bcc-tools tell you what they actually do. Kernel User easy to Minimal space use performance hit ftrace X ? strace X X bcc/ X X X X eBPF 26 Bind mounts and mount namespaces 27 symlinks, chroots, binds and overlays ● Symlinking a file or directory provides no security, and is static. ● chroot / is dynamic, but provides no /proc, /sys, /dev. ● Bind-mounting a file or directory over another: – provides dynamic, secure, granular reference to dir/file at another path; – useful for containers and IoT devices. ● Overlaying a filesystem over another: – provides a union of the FS at one path with the FS at another; – useful for live media boots. 28 Bind mount tmpfs Demo /home/newstuff FS A /home/oldstuff storage media FS B /home/oldstuff /home/secretstuff /home/newstuff bound /home/oldstuff = view Files at path B inherit permissions from FS A. 29 Bind-mount flags control visibility of mount events, not files prevent loops Host Host Host Container Container Container Shared Slave Private (mirror) (my mounts are private) (default) From Documentation/filesystems/sharedsubtree.txt. 30 Namespaces are magic that enables containers ● chroot, the old 'container', had minimal security. ● Container security is implemented (in part) via namespaces. ● Each container can have a different view of the system's files. ● See an overview with mountinfo files. ● Info about fields is in Documentation/filesystems/proc.txt. 31 Example: containers 32 Start a simple container systemd-nspawn is a container manager akin to runc or lxc. 33 Watch container bind mounts with BCC Intentional hiding of kernel Private mounts: symbols invisible to parent 34 read-only root filesystems 35 ● ● ● ● ● Forces separation of application data and binaries. Device problems reported from the field reproduce. Malware cannot modify /usr/, /etc, keys . rootfs does not get full. Safely yank device power. Motivation: Read-only rootfs: a critical tool for embedded for tool a critical rootfs: Read-only https://tinyurl.com/y7t2k7ma 36 read-only rootfs challenges ● /var must be mounted separately from /. ● Programs that modify $HOME at runtime: gstreamer, openssh- client … ● rootfs builders must – pre-populate these files, or – bind- or overlay-mount them from other paths. Not a bug but a feature! 37 Overlayfs tmpfs upper /etc/passwd /home/newstuff storage media lower /etc/passwd /home/oldstuff overlaid /etc/passwd /home/oldstuff = view /home/newstuff 38 Replace /etc/passwd inside a container 39 Summary ● VFS are one of Linux' core components. ● /proc, /sys and most on-HW FS are based on VFS. ● Bind-mounts and mount NS enable containers and ro- rootfs. ● bcc-tools and eBPF are remarkably powerful and easy to use. 40 Acknowledgements Much thanks to Akkana Peck, Michael Eager and Sarah Newman for comments and corrections. Ballroom H at 6 PM: “Accidentally accessible” 41 References ● About kobjects, seq files and sysfs: Appendix C, Essential Device Drivers by S. Venkateswaran ● About “everything is a file”: chapters 2, 4, 13, Linux Kernel Development by Robert Love ● Excellent mount namespaces article by Michael Kerrisk ● Excellent “Object-oriented design patterns in the kernel” article series by Neil Brown ● “BPF in the Kernel” series by Matt Fleming 42 Example: Live CD 43 Prepopulated /run directory on Kali Linux LiveCD 44 Kali Linux relies on overlayfs 45 Info from /proc/<PID>/mountinfo about shared mounts 46 sysfs vs procfs sizes /sys files are 1 page of memory and contain 1 string/ number. /procfs files often 'contain' a table of data. 47 Overlayfs mounts ● Overlay mounts are like bind mounts, but changes in the upper directory obscure those in the lower directory. ● A file in /tmp/upper can appear to replace files in /home on storage media. 48 Bind mounts ● Bind mounts make an existing file or directory appear at a new path. – Changes to the directory appear in both places. – A file in /tmp can appear to be in $HOME in addition to files that are in $HOME on storage media. 49 Subtle but important win with ro-rootfs A ro-rootfs forces better application design via separation of data and binaries. 50 ● ● ● Fix by by Fix Is /tmp on /dev/hdx? on /dev/sdx? this: Try $ findmnt /tmp$findmnt Make sure that fstab ends with a newline! a with ends fstab that sure Make stick. USB bootable a on /etc/fstab of copy a Keep editing /etc/fstabediting A systems administration tip! systemsadministration A ! 51 https://tinyurl.com/ybomxyfo Turning off sysfs? Keyboard and mouse 52 A few oddities: /proc/kcore 53.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    53 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us