The multicore evolution and operating systems

Non-scalable locks are dangerous. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. In Proceedings of the Linux Symposium, Ottawa, Canada, July 2012.
Frans Kaashoek
Joint work with: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, Robert Morris, and Nickolai Zeldovich
MIT
How well does Linux scale?
● Experiment:
  ● Linux 2.6.35-rc5 (relatively old, but problems are representative of issues in recent kernels too)
  ● Select a few inherently parallel system applications
  ● Measure throughput on different # of cores
  ● Use tmpfs to avoid disk bottlenecks
● Insight 1: Short critical sections can lead to sharp performance collapse

Off-the-shelf 48-core server (AMD)
[Figure: multi-chip machine with DRAM attached to each chip]
● Cache-coherent and non-uniform access
● An approximation of a future 48-core chip
Poor scaling on stock Linux kernel
[Figure: bar chart of scaling for memcached, PostgreSQL, Psearchy, Exim, Apache, gmake, and Metis; perfect scaling would be 48, but several applications show terrible scaling]
Y-axis: (throughput with 48 cores) / (throughput with one core)

Exim on stock Linux: collapse
[Figure: Exim throughput (messages/sec) versus number of cores, 1 to 48; throughput peaks around 40 cores and then collapses]
Exim on stock Linux: collapse
[Figure: Exim throughput and kernel time versus number of cores, 1 to 48; as throughput collapses, time spent in the kernel rises sharply]
Oprofile shows an obvious problem

40 cores: 10000 msg/sec
samples  %        app name  symbol name
2616     7.3522   vmlinux   radix_tree_lookup_slot
2329     6.5456   vmlinux   unmap_vmas
2197     6.1746   vmlinux   filemap_fault
1488     4.1820   vmlinux   __do_fault
1348     3.7885   vmlinux   copy_page_c
1182     3.3220   vmlinux   unlock_page
966      2.7149   vmlinux   page_fault

48 cores: 4000 msg/sec
samples  %        app name  symbol name
13515    34.8657  vmlinux   lookup_mnt
2002     5.1647   vmlinux   radix_tree_lookup_slot
1661     4.2850   vmlinux   filemap_fault
1497     3.8619   vmlinux   unmap_vmas
1026     2.6469   vmlinux   __do_fault
914      2.3579   vmlinux   atomic_dec
896      2.3115   vmlinux   unlock_page
Bottleneck: reading mount table
● Delivering an email calls sys_open
● sys_open calls:

    struct vfsmount *lookup_mnt(struct path *path)
    {
        struct vfsmount *mnt;
        spin_lock(&vfsmount_lock);
        mnt = hash_get(mnts, path);
        spin_unlock(&vfsmount_lock);
        return mnt;
    }

● Serial section is short. Why does it cause a scalability bottleneck?
What causes the sharp performance collapse?
● Linux uses ticket spin locks, which are non-scalable
● So we should expect collapse [Anderson 90]
● But why so sudden, and so sharp, for a short section?
● Is spin lock/unlock implemented incorrectly?
● Is hardware cache-coherence protocol at fault?

Ticket lock

    struct spinlock_t {
        int current_ticket;
        int next_ticket;
    };

    void spin_lock(spinlock_t *lock)
    {
        int t = atomic_inc(lock->next_ticket);
        while (t != lock->current_ticket)
            ;  /* spin */
    }

    void spin_unlock(spinlock_t *lock)
    {
        lock->current_ticket++;
    }
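For readers who want to try this outside the kernel, here is a minimal compilable sketch of the same ticket lock using C11 atomics in place of the kernel's atomic_inc; the names ticket_lock_t, ticket_lock, and ticket_unlock are illustrative, not the kernel's.

    /* Minimal user-space sketch of the ticket lock above (illustrative;
     * not the kernel's implementation). */
    #include <stdatomic.h>

    typedef struct {
        atomic_int current_ticket;   /* ticket currently being served */
        atomic_int next_ticket;      /* next ticket to hand out */
    } ticket_lock_t;

    void ticket_lock(ticket_lock_t *lock)
    {
        /* Take a ticket; fetch_add returns the value before the increment. */
        int t = atomic_fetch_add(&lock->next_ticket, 1);
        while (atomic_load(&lock->current_ticket) != t)
            ;  /* spin until our ticket comes up */
    }

    void ticket_unlock(ticket_lock_t *lock)
    {
        /* Publish the next ticket number; every spinning core re-reads this
         * shared cache line, which is what makes the lock non-scalable. */
        atomic_fetch_add(&lock->current_ticket, 1);
    }

Initializing both fields to zero gives an unlocked lock.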
Scalability collapse caused by non-scalable locks [Anderson 90]
[Animation over the ticket-lock code above: every waiting core spins on the lock's cache line; each cache-line transfer costs roughly 500 cycles, and the previous lock holder notifies the next lock holder only after about N/2 replies have been sent to the other waiters]

Why collapse with short sections?
● Arrival rate is proportional to # non-waiting cores
● Service time is proportional to # cores waiting (k)
● As k increases, waiting time goes up
● As waiting time goes up, k increases
● System gets stuck in states with many waiting cores
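A toy birth-death simulation of this feedback loop shows the same sudden collapse. It is only an illustration (the paper's own analysis uses a Markov model); the constants are made up, chosen so that the critical section is about 2% of the per-core work and each hand-off message costs on the order of 500 cycles.

    /* Toy simulation of the feedback loop above.  State k = number of
     * cores waiting for (or holding) the lock.  New waiters arrive at
     * rate (n - k) / think; the lock is handed off at rate
     * 1 / (serial + handoff * k), i.e. hand-off time grows with the
     * number of waiters.  All constants are made up for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static double exp_rand(double rate)      /* exponential inter-event time */
    {
        return -log(1.0 - drand48()) / rate;
    }

    int main(void)
    {
        double think = 10000, serial = 200, handoff = 250;   /* cycles */
        double horizon = 2e8;                                /* cycles */

        for (int n = 4; n <= 48; n += 4) {
            srand48(1);
            double now = 0;
            long acquisitions = 0;
            int k = 0;                                       /* waiters */
            while (now < horizon) {
                double arrive = (n - k) / think;
                double serve  = k ? 1.0 / (serial + handoff * k) : 0.0;
                now += exp_rand(arrive + serve);
                if (drand48() * (arrive + serve) < arrive)
                    k++;                     /* another core starts waiting */
                else {
                    k--;                     /* lock handed to next waiter */
                    acquisitions++;
                }
            }
            printf("%2d cores: %.3f acquisitions per 1000 cycles\n",
                   n, 1000.0 * acquisitions / horizon);
        }
        return 0;
    }

Compiled with cc sim.c -lm, throughput rises for the first several cores and then drops sharply once the system settles into a state where most cores are waiting, mirroring the Exim collapse above.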
Short sections result in collapse
● Experiment: 2% of time spent in critical section
● Critical sections become “longer” with more cores
● Lesson: non-scalable locks fine for long sections

Avoiding lock collapse
● Unscalable locks are fine for long sections
● Unscalable locks collapse for short sections
● Sudden sharp collapse due to “snowball” effect
● Scalable locks avoid collapse altogether
  ● But requires interface change (see the MCS sketch below)
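As an illustration of that interface change, here is a minimal sketch of an MCS list-based queue lock, one of the scalable locks the paper considers: each waiter spins on its own queue node rather than on a shared cache line, and callers must now pass a qnode into lock and unlock. C11 atomics; the names qnode, mcs_lock_t, mcs_lock, and mcs_unlock are illustrative, not the kernel's.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    };

    typedef _Atomic(struct qnode *) mcs_lock_t;   /* tail of the waiter queue */

    void mcs_lock(mcs_lock_t *lock, struct qnode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* Atomically append ourselves; the old tail is our predecessor. */
        struct qnode *prev = atomic_exchange(lock, me);
        if (prev == NULL)
            return;                         /* lock was free */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))
            ;                               /* spin on our own qnode only */
    }

    void mcs_unlock(mcs_lock_t *lock, struct qnode *me)
    {
        struct qnode *next = atomic_load(&me->next);
        if (next == NULL) {
            /* No successor visible: try to make the lock free again. */
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;
            /* A successor is linking itself in; wait for it to appear. */
            while ((next = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&next->locked, false); /* hand off: touches only one waiter's line */
    }

The extra qnode argument is the kind of interface change referred to above: the lock itself is just a tail pointer, and the per-acquirer state lives in caller-supplied memory.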
Scalable lock scalability
● It doesn't matter much which scalable lock is chosen
● But all are slower in terms of latency

Avoiding lock collapse is not enough to scale
● “Scalable” locks don't make the kernel scalable
● Main benefit is avoiding collapse: total throughput will not be lower with more cores
● But we usually want throughput to keep increasing with more cores