The multicore evolution and operating systems

Non-scalable locks are dangerous. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. In the Proceedings of the Linux Symposium, Ottawa, Canada, July 2012.

Frans Kaashoek

Joint work with: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, Robert Morris, and Nickolai Zeldovich

MIT

How well does Linux scale?

● Experiment:
  ● Linux 2.6.35-rc5 (relatively old, but the problems are representative of issues in recent kernels too)
  ● Select a few inherently parallel system applications
  ● Measure throughput on different # of cores
  ● Use tmpfs to avoid disk bottlenecks
● Insight 1: Short critical sections can lead to sharp performance collapse

Off-the-shelf 48-core server (AMD)

[Figure: eight six-core AMD chips, each with its own locally attached DRAM, joined by a cache-coherent interconnect]
● Cache-coherent and non-uniform access
● An approximation of a future 48-core chip

Poor scaling on stock Linux

[Figure: bar chart of per-application scalability for memcached, PostgreSQL, Psearchy, Exim, Apache, gmake, and Metis. Y-axis: (throughput with 48 cores) / (throughput with one core), from 0 to 48; perfect scaling would be 48, and several applications show terrible scaling.]

Exim on stock Linux: collapse

[Figure: Exim throughput (messages/sec, 0 to 12000) vs. number of cores, 1 to 48; throughput climbs, peaks, then collapses.]

Exim on stock Linux: collapse

[Figure: the same throughput curve with kernel time overlaid (scale 0 to 15); throughput peaks around 40 cores and collapses by 48 cores, while kernel time per message rises sharply.]

Oprofile shows an obvious problem

40 cores: 10000 msg/sec

samples   %        app name   symbol name
2616      7.3522   vmlinux    radix_tree_lookup_slot
2329      6.5456   vmlinux    unmap_vmas
2197      6.1746   vmlinux    filemap_fault
1488      4.1820   vmlinux    __do_fault
1348      3.7885   vmlinux    copy_page_c
1182      3.3220   vmlinux    unlock_page
966       2.7149   vmlinux    page_fault

48 cores: 4000 msg/sec

samples   %        app name   symbol name
13515     34.8657  vmlinux    lookup_mnt
2002      5.1647   vmlinux    radix_tree_lookup_slot
1661      4.2850   vmlinux    filemap_fault
1497      3.8619   vmlinux    unmap_vmas
1026      2.6469   vmlinux    __do_fault
914       2.3579   vmlinux    atomic_dec
896       2.3115   vmlinux    unlock_page

Bottleneck: reading mount table

● Delivering an email calls sys_open
● sys_open calls:

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;

    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

● Serial section is short. Why does it cause a scalability bottleneck?
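(The fix itself is a separate story, but for intuition: the standard remedy for a short, read-mostly section like this is to let readers avoid writing any shared cache line. Below is a minimal sketch using the Linux kernel's RCU read-side primitives; hash_get_rcu is a hypothetical stand-in for an RCU-safe hash walk. Mainline Linux later moved mount lookup in this direction.)

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;

    rcu_read_lock();                  /* read side: no stores to shared lines */
    mnt = hash_get_rcu(mnts, path);   /* hypothetical RCU-safe hash lookup */
    rcu_read_unlock();
    return mnt;
}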

What causes the sharp performance collapse?

● Linux uses ticket spin locks, which are non-scalable
● So we should expect collapse [Anderson 90]
● But why so sudden, and so sharp, for a short section?
● Is spin lock/unlock implemented incorrectly?
● Is the hardware cache-coherence protocol at fault?

Ticket Lock

struct spinlock_t {
    int current_ticket;
    int next_ticket;
};

void spin_lock(spinlock_t *lock)
{
    int t = atomic_inc(lock->next_ticket);
    while (t != lock->current_ticket)
        ; /* spin */
}

void spin_unlock(spinlock_t *lock)
{
    lock->current_ticket++;
}
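For concreteness, here is a minimal user-space rendering of the same algorithm in C11 atomics (a sketch: the names and reliance on default memory ordering are mine, and the kernel's implementation differs in detail):

#include <stdatomic.h>

/* Ticket lock sketch: FIFO and fair, but every waiter polls the same
 * current_ticket word, so each unlock forces a cache-line transfer to
 * every waiting core. */
typedef struct {
    atomic_int current_ticket;
    atomic_int next_ticket;
} ticket_lock_t;

void ticket_lock(ticket_lock_t *lock)
{
    int t = atomic_fetch_add(&lock->next_ticket, 1);   /* take a ticket */
    while (atomic_load(&lock->current_ticket) != t)
        ;   /* spin until our number comes up */
}

void ticket_unlock(ticket_lock_t *lock)
{
    atomic_fetch_add(&lock->current_ticket, 1);        /* pass the lock on */
}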

Scalability collapse caused by non-scalable locks [Anderson 90]

[Figure: animation stepping through a ticket-lock handoff under the machine's directory-based cache-coherence protocol; on each unlock, every waiting core must re-fetch the current_ticket cache line, and a single cache-line transfer costs on the order of 500 cycles.]

● Previous lock holder notifies the next lock holder only after the directory sends out N/2 replies (on average)

Why collapse with short sections?

● Arrival rate is proportional to # non-waiting cores

● Service time is proportional to # cores waiting (k)

● As k increases, waiting time goes up

● As waiting time goes up, k increases

● System gets stuck in states with many waiting cores (the model sketched below makes this feedback concrete)
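A rough way to see the feedback loop (a minimal birth-death sketch in my own notation; the paper works out a more careful Markov model): let the state $k$ be the number of waiting cores, with $n$ cores total, mean compute time $a$ between lock acquisitions, bare critical-section time $s$, and cache-line handoff cost $c$.

\[
\lambda_k = \frac{n-k}{a} \;\;\text{(arrival rate)}, \qquad
S_k \approx s + \frac{c\,k}{2} \;\;\text{(time to serve one holder)}
\]

Arrivals shrink only linearly as cores queue up, while the handoff term of $S_k$ grows linearly with $k$. Once $\lambda_k S_k > 1$ the queue tends to grow, which makes $S_k$ larger still, and the chain drifts into states where almost every core is waiting: the snowball.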

Short sections result in collapse

● Experiment: 2% of time spent in critical section
● Critical sections become “longer” with more cores
● Lesson: non-scalable locks are fine for long sections

Avoiding lock collapse

● Unscalable locks are fine for long sections
● Unscalable locks collapse for short sections
● Sudden, sharp collapse due to a “snowball” effect
● Scalable locks avoid collapse altogether
● But they require an interface change (see the sketch below)
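Scalable locks avoid collapse by giving each waiter its own spin location. Below is a minimal MCS-style queue lock in C11 atomics (MCS is one classic scalable lock; this user-space rendering and its names are mine, not the kernel's). The extra qnode parameter is exactly the interface change the slide refers to: callers must carry a queue node through lock and unlock.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
    _Atomic(struct qnode *) next;   /* successor in the waiter queue */
    atomic_bool locked;             /* our private spin flag */
};

typedef _Atomic(struct qnode *) mcs_lock_t;   /* tail pointer; NULL = free */

void mcs_lock(mcs_lock_t *lock, struct qnode *me)
{
    struct qnode *prev;

    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    prev = atomic_exchange(lock, me);          /* enqueue at the tail */
    if (prev != NULL) {
        atomic_store(&prev->next, me);         /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;   /* spin only on our own cache line */
    }
}

void mcs_unlock(mcs_lock_t *lock, struct qnode *me)
{
    struct qnode *next = atomic_load(&me->next);

    if (next == NULL) {
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;                            /* no successor: lock is free */
        while ((next = atomic_load(&me->next)) == NULL)
            ;   /* successor is mid-enqueue; wait for its link */
    }
    atomic_store(&next->locked, false);        /* handoff moves one cache line */
}

Each thread typically keeps its qnode on its stack and passes the same node to the matching unlock; a handoff now writes one flag read by one core, so its cost stays constant no matter how many cores wait.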

Scalable lock scalability

● It doesn't matter much which scalable lock you pick
● But all are slower in terms of latency

Avoiding lock collapse is not enough to scale

● “Scalable” locks don't make the kernel scalable
● Main benefit is avoiding collapse: total throughput will not be lower with more cores
● But we usually want throughput to keep increasing with more cores
