The multicore evolution and operating systems

Non-scalable locks are dangerous. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. In the Proceedings of the Linux Symposium, Ottawa, Canada, July 2012.

Frans Kaashoek

Joint work with: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, Robert Morris, and Nickolai Zeldovich

MIT

How well does Linux scale?

● Experiment:
  ● Linux 2.6.35-rc5 (relatively old, but the problems are representative of issues in recent kernels too)
  ● Select a few inherently parallel system applications
  ● Measure throughput on different # of cores
  ● Use tmpfs to avoid disk bottlenecks
● Insight 1: Short critical sections can lead to sharp performance collapse

Off-the-shelf 48-core server (AMD)

[Figure: eight six-core AMD chips, each with its own locally attached DRAM, joined by a cache-coherent interconnect]
● Cache-coherent and non-uniform access
● An approximation of a future 48-core chip

Poor scaling on stock Linux

[Figure: bar chart of per-application scalability for memcached, PostgreSQL, Psearchy, Exim, Apache, gmake, and Metis. Y-axis: (throughput with 48 cores) / (throughput with one core), from 0 to 48; perfect scaling would be 48, and several applications show terrible scaling.]

Exim on stock Linux: collapse

[Figure: Exim throughput (messages/sec, 0 to 12000) vs. number of cores, 1 to 48; throughput climbs, peaks, then collapses.]

Exim on stock Linux: collapse

[Figure: the same throughput curve with kernel time overlaid (scale 0 to 15); throughput peaks around 40 cores and collapses by 48 cores, while kernel time per message rises sharply.]

Oprofile shows an obvious problem

40 cores: 10000 msg/sec

samples   %        app name   symbol name
2616      7.3522   vmlinux    radix_tree_lookup_slot
2329      6.5456   vmlinux    unmap_vmas
2197      6.1746   vmlinux    filemap_fault
1488      4.1820   vmlinux    __do_fault
1348      3.7885   vmlinux    copy_page_c
1182      3.3220   vmlinux    unlock_page
966       2.7149   vmlinux    page_fault

48 cores: 4000 msg/sec

samples   %        app name   symbol name
13515     34.8657  vmlinux    lookup_mnt
2002      5.1647   vmlinux    radix_tree_lookup_slot
1661      4.2850   vmlinux    filemap_fault
1497      3.8619   vmlinux    unmap_vmas
1026      2.6469   vmlinux    __do_fault
914       2.3579   vmlinux    atomic_dec
896       2.3115   vmlinux    unlock_page

Bottleneck: reading mount table

● Delivering an email calls sys_open
● sys_open calls:

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;

    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

● Serial section is short. Why does it cause a scalability bottleneck?
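(The fix itself is a separate story, but for intuition: the standard remedy for a short, read-mostly section like this is to let readers avoid writing any shared cache line. Below is a minimal sketch using the Linux kernel's RCU read-side primitives; hash_get_rcu is a hypothetical stand-in for an RCU-safe hash walk. Mainline Linux later moved mount lookup in this direction.)

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;

    rcu_read_lock();                  /* read side: no stores to shared lines */
    mnt = hash_get_rcu(mnts, path);   /* hypothetical RCU-safe hash lookup */
    rcu_read_unlock();
    return mnt;
}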

What causes the sharp performance collapse?

● Linux uses ticket spin locks, which are non-scalable
● So we should expect collapse [Anderson 90]
● But why so sudden, and so sharp, for a short section?
● Is spin lock/unlock implemented incorrectly?
● Is the hardware cache-coherence protocol at fault?

Ticket Lock

struct spinlock_t {
    int current_ticket;
    int next_ticket;
};

void spin_lock(spinlock_t *lock)
{
    int t = atomic_inc(lock->next_ticket);
    while (t != lock->current_ticket)
        ; /* spin */
}

void spin_unlock(spinlock_t *lock)
{
    lock->current_ticket++;
}
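For concreteness, here is a minimal user-space rendering of the same algorithm in C11 atomics (a sketch: the names and reliance on default memory ordering are mine, and the kernel's implementation differs in detail):

#include <stdatomic.h>

/* Ticket lock sketch: FIFO and fair, but every waiter polls the same
 * current_ticket word, so each unlock forces a cache-line transfer to
 * every waiting core. */
typedef struct {
    atomic_int current_ticket;
    atomic_int next_ticket;
} ticket_lock_t;

void ticket_lock(ticket_lock_t *lock)
{
    int t = atomic_fetch_add(&lock->next_ticket, 1);   /* take a ticket */
    while (atomic_load(&lock->current_ticket) != t)
        ;   /* spin until our number comes up */
}

void ticket_unlock(ticket_lock_t *lock)
{
    atomic_fetch_add(&lock->current_ticket, 1);        /* pass the lock on */
}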

Scalability collapse caused by non-scalable locks [Anderson 90]

[Figure: animation stepping through a ticket-lock handoff under the machine's directory-based cache-coherence protocol; on each unlock, every waiting core must re-fetch the current_ticket cache line, and a single cache-line transfer costs on the order of 500 cycles.]

● Previous lock holder notifies the next lock holder only after the directory sends out N/2 replies (on average)

Why collapse with short sections?

● Arrival rate is proportional to # non-waiting cores

● Service time is proportional to # cores waiting (k)

● As k increases, waiting time goes up

● As waiting time goes up, k increases

● System gets stuck in states with many waiting cores (the model sketched below makes this feedback concrete)
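A rough way to see the feedback loop (a minimal birth-death sketch in my own notation; the paper works out a more careful Markov model): let the state $k$ be the number of waiting cores, with $n$ cores total, mean compute time $a$ between lock acquisitions, bare critical-section time $s$, and cache-line handoff cost $c$.

\[
\lambda_k = \frac{n-k}{a} \;\;\text{(arrival rate)}, \qquad
S_k \approx s + \frac{c\,k}{2} \;\;\text{(time to serve one holder)}
\]

Arrivals shrink only linearly as cores queue up, while the handoff term of $S_k$ grows linearly with $k$. Once $\lambda_k S_k > 1$ the queue tends to grow, which makes $S_k$ larger still, and the chain drifts into states where almost every core is waiting: the snowball.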

Short sections result in collapse

● Experiment: 2% of time spent in critical section
● Critical sections become “longer” with more cores
● Lesson: non-scalable locks are fine for long sections

Avoiding lock collapse

● Unscalable locks are fine for long sections
● Unscalable locks collapse for short sections
● Sudden, sharp collapse due to a “snowball” effect
● Scalable locks avoid collapse altogether
● But they require an interface change (see the sketch below)
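Scalable locks avoid collapse by giving each waiter its own spin location. Below is a minimal MCS-style queue lock in C11 atomics (MCS is one classic scalable lock; this user-space rendering and its names are mine, not the kernel's). The extra qnode parameter is exactly the interface change the slide refers to: callers must carry a queue node through lock and unlock.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;   /* successor in the waiter queue */
    atomic_bool locked;             /* our private spin flag */
};

typedef _Atomic(struct qnode *) mcs_lock_t;   /* tail pointer; NULL = free */

void mcs_lock(mcs_lock_t *lock, struct qnode *me)
{
    struct qnode *prev;

    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    prev = atomic_exchange(lock, me);          /* enqueue at the tail */
    if (prev != NULL) {
        atomic_store(&prev->next, me);         /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;   /* spin only on our own cache line */
    }
}

void mcs_unlock(mcs_lock_t *lock, struct qnode *me)
{
    struct qnode *next = atomic_load(&me->next);

    if (next == NULL) {
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;                            /* no successor: lock is free */
        while ((next = atomic_load(&me->next)) == NULL)
            ;   /* successor is mid-enqueue; wait for its link */
    }
    atomic_store(&next->locked, false);        /* handoff moves one cache line */
}

Each thread typically keeps its qnode on its stack and passes the same node to the matching unlock; a handoff now writes one flag read by one core, so its cost stays constant no matter how many cores wait.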

Scalable lock scalability

● It doesn't matter much which scalable lock you pick
● But all are slower in terms of latency

Avoiding lock collapse is not enough to scale

● “Scalable” locks don't make the kernel scalable
● Main benefit is avoiding collapse: total throughput will not be lower with more cores
● But we usually want throughput to keep increasing with more cores
