The Multicore Evolution and Operating Systems

Non-scalable locks are dangerous. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. In Proceedings of the Linux Symposium, Ottawa, Canada, July 2012.

Presented by Frans Kaashoek (MIT). Joint work with: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, Robert Morris, and Nickolai Zeldovich.

How well does Linux scale?

● Hardware: an off-the-shelf 48-core AMD server; cache-coherent with non-uniform memory access (each socket has its own DRAM); an approximation of a future 48-core chip
● Experiment:
  ● Linux 2.6.35-rc5 (relatively old, but the problems are representative of issues in recent kernels too)
  ● Select a few inherently parallel system applications: Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and Metis
  ● Measure throughput at different numbers of cores
  ● Use tmpfs to avoid disk bottlenecks
● Insight 1: short critical sections can lead to sharp performance collapse

Poor scaling on stock Linux kernel

[Figure: per-application scalability; y-axis is (throughput with 48 cores) / (throughput with one core). All seven applications fall far short of perfect 48x scaling.]

Exim on stock Linux: collapse

[Figure: Exim throughput (messages/sec) versus number of cores, 1 to 48. Throughput climbs to roughly 10,000 msg/sec around 40 cores, then collapses to about 4,000 msg/sec at 48 cores. A companion plot shows kernel time rising sharply just as throughput falls.]

Oprofile shows an obvious problem

At 40 cores (10,000 msg/sec):

    samples  %        app name  symbol name
    2616     7.3522   vmlinux   radix_tree_lookup_slot
    2329     6.5456   vmlinux   unmap_vmas
    2197     6.1746   vmlinux   filemap_fault
    1488     4.1820   vmlinux   __do_fault
    1348     3.7885   vmlinux   copy_page_c
    1182     3.3220   vmlinux   unlock_page
    966      2.7149   vmlinux   page_fault

At 48 cores (4,000 msg/sec):

    samples  %        app name  symbol name
    13515    34.8657  vmlinux   lookup_mnt
    2002     5.1647   vmlinux   radix_tree_lookup_slot
    1661     4.2850   vmlinux   filemap_fault
    1497     3.8619   vmlinux   unmap_vmas
    1026     2.6469   vmlinux   __do_fault
    914      2.3579   vmlinux   atomic_dec
    896      2.3115   vmlinux   unlock_page

Going from 40 to 48 cores, lookup_mnt jumps from nowhere to a third of all samples.

Bottleneck: reading the mount table

● Delivering an email calls sys_open
● sys_open calls:

    struct vfsmount *lookup_mnt(struct path *path)
    {
        struct vfsmount *mnt;

        spin_lock(&vfsmount_lock);
        mnt = hash_get(mnts, path);
        spin_unlock(&vfsmount_lock);
        return mnt;
    }
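This contention is easy to provoke without Exim: run many threads that each hammer open() on a tmpfs path, so every call walks the mount table under the same lock. Below is a minimal sketch; the file name, thread count, and 5-second window are our own choices (not from the talk), and on kernels newer than 2.6.35, where this path was reworked to avoid the shared lock, the collapse may not reproduce.

    /* open_bench.c: toy stress test for the sys_open path.
       Build: cc -O2 -pthread open_bench.c -o open_bench
       Usage: ./open_bench <nthreads> */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static const char *path = "/tmp/open_bench_file";  /* ideally on tmpfs */
    static atomic_long ops;
    static atomic_int  stop;

    static void *worker(void *arg)
    {
        (void)arg;
        while (!atomic_load(&stop)) {
            int fd = open(path, O_RDONLY);  /* each open consults the mount table */
            if (fd >= 0)
                close(fd);
            atomic_fetch_add(&ops, 1);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int n = argc > 1 ? atoi(argv[1]) : 4;
        pthread_t *tid = malloc(n * sizeof *tid);

        close(open(path, O_CREAT | O_RDONLY, 0644));  /* ensure the file exists */
        for (int i = 0; i < n; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        sleep(5);
        atomic_store(&stop, 1);
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        printf("%d threads: %.0f opens/sec\n", n, atomic_load(&ops) / 5.0);
        free(tid);
        return 0;
    }

If per-thread throughput falls off sharply as the thread count approaches the core count, a profile should show the same lock-acquisition symbol dominating.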
The serial section inside lookup_mnt is short. Why does it cause a scalability bottleneck?

What causes the sharp performance collapse?

● Linux uses ticket spin locks, which are non-scalable
● So we should expect collapse [Anderson 90]
● But why so sudden, and so sharp, for such a short section?
● Is spin lock/unlock implemented incorrectly?
● Is the hardware cache-coherence protocol at fault?

Ticket lock

    typedef struct {
        int current_ticket;  /* ticket currently being served */
        int next_ticket;     /* next ticket to hand out */
    } spinlock_t;

    void spin_lock(spinlock_t *lock)
    {
        int t = atomic_inc(&lock->next_ticket);  /* fetch-and-add: draw a ticket */
        while (t != lock->current_ticket)
            ;  /* spin until our ticket comes up */
    }

    void spin_unlock(spinlock_t *lock)
    {
        lock->current_ticket++;  /* serve the next ticket */
    }

Scalability collapse caused by non-scalable locks [Anderson 90]

Every waiter spins on the cache line holding current_ticket, so the unlocking core's store must be propagated to all N waiters by the cache-coherence protocol. A cache-line transfer between chips costs on the order of 500 cycles on this machine, and with a directory-based protocol the previous lock holder's update reaches the next lock holder only after roughly N/2 replies have been sent out. The hand-off time therefore grows with the number of waiting cores.

Why collapse with short sections?

● Arrival rate is proportional to the number of cores not yet waiting
● Service time (the hand-off) is proportional to the number of waiting cores k
● As k increases, waiting time goes up
● As waiting time goes up, k increases
● The system gets stuck in states with many waiting cores; short sections result in collapse (a toy simulation of this snowball dynamic follows below)
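The feedback loop above can be made concrete with a small discrete-event simulation. This is our own toy model, not the Markov model from the paper: each core "thinks" for an exponentially distributed time, then queues for the lock, and the hand-off cost grows linearly with the number of waiters. All constants are illustrative.

    /* snowball.c: toy simulation of ticket-lock hand-off under load.
       Build: cc -O2 snowball.c -o snowball -lm */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define THINK   5000.0  /* mean cycles a core spends outside the lock */
    #define CS       100.0  /* cycles inside the critical section */
    #define HANDOFF   50.0  /* extra hand-off cycles per waiting core */

    static double exp_rand(double mean)
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);  /* u in (0,1) */
        return -mean * log(u);
    }

    static double run(int n)
    {
        double ready[64];  /* time at which core i next requests the lock */
        double now = 0.0;
        long   acquisitions = 0;

        for (int i = 0; i < n; i++)
            ready[i] = exp_rand(THINK);

        while (now < 1e8) {
            /* grant to the earliest requester (ticket locks are FIFO) */
            int next = 0;
            for (int i = 1; i < n; i++)
                if (ready[i] < ready[next])
                    next = i;
            if (ready[next] > now)
                now = ready[next];  /* lock sits idle until the next request */

            /* hand-off cost grows with the number of cores already waiting */
            int waiters = 0;
            for (int i = 0; i < n; i++)
                if (i != next && ready[i] <= now)
                    waiters++;

            now += CS + HANDOFF * waiters;
            acquisitions++;
            ready[next] = now + exp_rand(THINK);  /* back to thinking */
        }
        return acquisitions / now;
    }

    int main(void)
    {
        for (int n = 1; n <= 48; n++)
            printf("%2d cores: %.6f acquisitions/cycle\n", n, run(n));
        return 0;
    }

With these constants, simulated throughput rises at low core counts, then tops out and declines once the lock saturates (around 10 cores here): past that point each extra core only lengthens the hand-off. The paper's Markov model captures the same feedback analytically and explains why the real drop is so sudden.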
Avoiding lock collapse

● Non-scalable locks are fine for long critical sections
● Non-scalable locks collapse for short sections, and the collapse is sudden and sharp because of the "snowball" effect
● Scalable locks avoid collapse altogether, but they require an interface change (see the MCS sketch after this section)
● Experiment: with 2% of total time spent in the critical section, collapse still occurs, because critical sections effectively become "longer" with more cores as hand-off time grows
● Lesson: non-scalable locks are acceptable only for long sections

Scalable lock scalability

● For avoiding collapse, it doesn't matter much which scalable lock you choose
● But all of them are slower than ticket locks in terms of per-acquisition latency

Avoiding lock collapse is not enough to scale

● "Scalable" locks don't make the kernel scalable
● Their main benefit is avoiding collapse: total throughput will not be lower with more cores
● But usually we want throughput to keep increasing with more cores
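The interface change mentioned above is easiest to see in an MCS-style queue lock, where each waiter spins on a flag in its own queue node, so an unlock invalidates one waiter's cache line instead of all of them. Here is a minimal C11 sketch for illustration; it is a generic MCS lock of our own writing, not the kernel's implementation:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool                locked;
    };

    struct mcs_lock {
        _Atomic(struct mcs_node *) tail;  /* last node in the queue, or NULL */
    };

    /* Note the interface change: callers must supply a queue node. */
    void mcs_lock_acquire(struct mcs_lock *lk, struct mcs_node *self)
    {
        atomic_store(&self->next, NULL);
        atomic_store(&self->locked, true);

        struct mcs_node *prev = atomic_exchange(&lk->tail, self);
        if (prev != NULL) {
            atomic_store(&prev->next, self);  /* link in behind predecessor */
            while (atomic_load(&self->locked))
                ;  /* spin on our own node's flag: a local cache line */
        }
    }

    void mcs_lock_release(struct mcs_lock *lk, struct mcs_node *self)
    {
        struct mcs_node *succ = atomic_load(&self->next);
        if (succ == NULL) {
            struct mcs_node *expected = self;
            if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
                return;  /* no one was waiting */
            while ((succ = atomic_load(&self->next)) == NULL)
                ;  /* a successor grabbed the tail but hasn't linked in yet */
        }
        atomic_store(&succ->locked, false);  /* hand off to exactly one core */
    }

The cost of avoiding collapse shows up in the API: a caller must declare a struct mcs_node, pass it to both acquire and release, and keep it live for the whole critical section. That extra argument is precisely the interface change that makes retrofitting scalable locks into existing kernel code non-trivial.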
