An Analysis of Linux Scalability to Many Cores

Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich
MIT CSAIL

What is scalability?

● An application is scalable if it does N times as much work on N cores as it could on 1 core
● Scalability may be limited by Amdahl's Law:
  ● Locks, shared data structures, ...
  ● Shared hardware (DRAM, NIC, ...)

Why look at the OS kernel?

● Many applications spend a lot of time in the kernel
  ● E.g., on a uniprocessor, the Exim mail server spends 70% of its time in the kernel
● These applications should scale with more cores
● If the OS kernel doesn't scale, these applications won't scale

Speculation about kernel scalability

● Several kernel scalability studies indicate that existing kernels don't scale well
● Speculation that fixing them is hard
● New OS kernel designs: Corey, Barrelfish, fos, Tessellation, ...
● How serious are the scaling problems? How hard is it to fix them?
● Hard to answer in general, but we shed some light on the answer by analyzing Linux scalability

Analyzing scalability of Linux

● Use an off-the-shelf 48-core x86 machine
● Run a recent version of Linux
  ● Widely used, with competitive baseline scalability
● Scale a set of applications
  ● Parallel implementation
  ● System intensive

Contributions

● Analysis of Linux scalability for 7 real applications
  ● Stock Linux limits scalability
  ● Analysis of the bottlenecks
● Fixes: 3002 lines of code, 16 patches
  ● Most fixes improve the scalability of multiple applications
  ● Remaining bottlenecks are in the hardware or the applications
● Result: no kernel problems up to 48 cores

Method

● Run the application
  ● Use an in-memory file system to avoid a disk bottleneck
● Find bottlenecks
● Fix bottlenecks, re-run the application
● Stop when a non-trivial application fix is required, or when the bottleneck is shared hardware (e.g. DRAM)
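
To make the Amdahl's Law point above concrete for a 48-core machine, here is a small illustrative C program (not from the talk; the parallel fractions are arbitrary example values) that evaluates the predicted speedup S(N) = 1 / ((1 - p) + p/N):

    /* Illustrative sketch (not from the talk): speedup predicted by
     * Amdahl's Law, S(N) = 1 / ((1 - p) + p / N), for a few example
     * parallel fractions p on 48 cores. */
    #include <stdio.h>

    static double amdahl_speedup(double p, int cores)
    {
        return 1.0 / ((1.0 - p) + p / cores);
    }

    int main(void)
    {
        const double fractions[] = { 1.00, 0.99, 0.95, 0.90 };
        for (int i = 0; i < 4; i++)
            printf("parallel fraction %.2f -> speedup on 48 cores: %.1f\n",
                   fractions[i], amdahl_speedup(fractions[i], 48));
        return 0;
    }

Even a 95% parallel fraction caps the 48-core speedup at roughly 14x, which is why the rest of the analysis hunts down serialization inside the kernel.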

Off-the-shelf 48-core server

● 6-core x 8-chip AMD
● (Slide diagram: each of the 8 chips has its own directly attached DRAM.)

Poor scaling on stock Linux kernel

(Bar chart: y-axis is (throughput with 48 cores) / (throughput with one core), from 0 to 48, with 48 marked "perfect scaling" and the low end marked "terrible scaling"; applications on the x-axis: memcached, PostgreSQL, Psearchy, Exim, Apache, gmake, Metis. None of the applications come close to perfect scaling.)

Exim on stock Linux: collapse

(Plot: throughput in messages/second, 0 to 12,000, versus cores, 1 to 48, with a second y-axis showing kernel CPU time in milliseconds/message, 0 to 15. Throughput peaks near 40 cores and collapses by 48 cores while kernel CPU time per message climbs.)

Oprofile shows an obvious problem

40 cores: 10000 msg/sec
  samples   %         app name   symbol name
  2616      7.3522    vmlinux    radix_tree_lookup_slot
  2329      6.5456    vmlinux    unmap_vmas
  2197      6.1746    vmlinux    filemap_fault
  1488      4.1820    vmlinux    __do_fault
  1348      3.7885    vmlinux    copy_page_c
  1182      3.3220    vmlinux    unlock_page
  966       2.7149    vmlinux    page_fault

48 cores: 4000 msg/sec
  samples   %         app name   symbol name
  13515     34.8657   vmlinux    lookup_mnt
  2002      5.1647    vmlinux    radix_tree_lookup_slot
  1661      4.2850    vmlinux    filemap_fault
  1497      3.8619    vmlinux    unmap_vmas
  1026      2.6469    vmlinux    __do_fault
  914       2.3579    vmlinux    atomic_dec
  896       2.3115    vmlinux    unlock_page

Bottleneck: reading mount table

● sys_open eventually calls:

    struct vfsmount *lookup_mnt(struct path *path)
    {
        struct vfsmount *mnt;
        spin_lock(&vfsmount_lock);
        mnt = hash_get(mnts, path);
        spin_unlock(&vfsmount_lock);
        return mnt;
    }

● The critical section is short. Why does it cause a scalability bottleneck?
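
One way to get intuition for this question is to reproduce the pattern outside the kernel. The sketch below is a user-space analogue of lookup_mnt, not kernel code: the names (table, table_lock, lookup) are made up, and it uses a POSIX spin lock rather than the kernel's ticket lock. Many threads repeat a cheap read under a single global lock, so total throughput should stop improving once the lock becomes contended.

    /* User-space analogue of the lookup_mnt pattern: a cheap lookup guarded
     * by one global spin lock (illustrative sketch; hypothetical names).
     * Build: cc -O2 -pthread lookup_bench.c -o lookup_bench */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define TABLE_SIZE 1024
    #define LOOKUPS_PER_THREAD 1000000
    #define MAX_THREADS 64

    static int table[TABLE_SIZE];          /* stand-in for the mount hash table */
    static pthread_spinlock_t table_lock;  /* stand-in for vfsmount_lock */

    /* Short critical section, like lookup_mnt(): lock, read, unlock. */
    static int lookup(unsigned key)
    {
        int value;
        pthread_spin_lock(&table_lock);
        value = table[key % TABLE_SIZE];
        pthread_spin_unlock(&table_lock);
        return value;
    }

    static void *worker(void *arg)
    {
        unsigned base = (unsigned)(size_t)arg;
        volatile int sink = 0;   /* keep the loop from being optimized away */
        for (int i = 0; i < LOOKUPS_PER_THREAD; i++)
            sink += lookup(base + i);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
        if (nthreads < 1) nthreads = 1;
        if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

        pthread_t threads[MAX_THREADS];
        struct timespec start, end;

        pthread_spin_init(&table_lock, PTHREAD_PROCESS_PRIVATE);
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < nthreads; i++)
            pthread_create(&threads[i], NULL, worker, (void *)(size_t)i);
        for (int i = 0; i < nthreads; i++)
            pthread_join(threads[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        double total = (double)nthreads * LOOKUPS_PER_THREAD;
        printf("%d threads: %.1f million lookups/sec\n", nthreads, total / secs / 1e6);
        return 0;
    }

Run it with increasing thread counts (for example 1, 8, 48) to see lookups/sec per added thread fall off; exact numbers depend on the machine, but the cheap lookup itself is never the limit, the shared lock is.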

● spin_lock and spin_unlock use many more cycles than the critical section itself

Linux spin lock implementation

    struct spinlock_t {
        int current_ticket;
        int next_ticket;
    };

    void spin_lock(spinlock_t *lock)
    {
        int t = atomic_inc(&lock->next_ticket);  /* allocate a ticket (fetch-and-increment) */
        while (t != lock->current_ticket)
            ;                                    /* spin until it's my turn */
    }

    void spin_unlock(spinlock_t *lock)
    {
        lock->current_ticket++;                  /* update the ticket value */
    }

● Acquiring the lock costs 120 – 420 cycles, far more than the short critical section it protects

Scalability collapse caused by non-scalable locks [Anderson 90]

● The slide repeats the ticket spin lock code above: every waiting core spins on current_ticket, so each release must move that cache line among all the waiters, and lock hand-off gets slower as more cores wait
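
For experimenting outside the kernel, the ticket lock can be approximated with C11 atomics. This is an illustrative sketch, not the kernel's implementation; it relies on the ticket discipline guaranteeing that only the current holder updates current_ticket:

    /* Illustrative user-space ticket lock built on C11 atomics (a sketch,
     * not the kernel's implementation). */
    #include <stdatomic.h>
    #include <stdio.h>

    struct ticket_lock {
        atomic_uint next_ticket;      /* next ticket to hand out */
        atomic_uint current_ticket;   /* ticket currently allowed in */
    };

    static void ticket_lock_init(struct ticket_lock *l)
    {
        atomic_init(&l->next_ticket, 0);
        atomic_init(&l->current_ticket, 0);
    }

    static void ticket_lock_acquire(struct ticket_lock *l)
    {
        /* Allocate a ticket: one atomic fetch-and-increment per acquirer. */
        unsigned t = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                               memory_order_relaxed);
        /* Spin until it's my turn. */
        while (atomic_load_explicit(&l->current_ticket, memory_order_acquire) != t)
            ;
    }

    static void ticket_lock_release(struct ticket_lock *l)
    {
        /* Update the ticket value; only the holder writes current_ticket. */
        unsigned cur = atomic_load_explicit(&l->current_ticket, memory_order_relaxed);
        atomic_store_explicit(&l->current_ticket, cur + 1, memory_order_release);
    }

    int main(void)
    {
        struct ticket_lock l;
        ticket_lock_init(&l);
        ticket_lock_acquire(&l);
        /* ... short critical section ... */
        ticket_lock_release(&l);
        printf("tickets handed out: %u\n", atomic_load(&l.next_ticket));
        return 0;
    }

The spin loop is what makes this lock non-scalable: all waiters poll current_ticket, so every release forces the cache line holding it to travel to each waiting core, and hand-off time grows with the number of waiters.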
