
Asymmetry-aware Scalable Locking

Nian Liu†, Jinyu Gu†, Dahai Tang‡, Kenli Li‡, Binyu Zang†, Haibo Chen†
† Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
‡ Hunan University

arXiv:2108.03355v1 [cs.DC] 7 Aug 2021

Abstract

The pursuit of power-efficiency is popularizing asymmetric multicore processors (AMP) such as ARM big.LITTLE, Apple M1 and recent Intel Alder Lake with big and little cores. However, we find that existing scalable locks fail to scale on AMP and cause collapses in either throughput or latency, or both, because their implicit assumption of symmetric cores no longer holds. To address this issue, we propose the first asymmetry-aware scalable lock named LibASL. LibASL provides a new lock ordering guided by applications' latency requirements, which allows big cores to reorder with little cores for higher throughput under the condition of preserving applications' latency requirements. Using LibASL only requires linking the application with it and, if latency-critical, inserting a few lines of code to annotate the coarse-grained latency requirement. We evaluate LibASL in various benchmarks including five popular databases on Apple M1. Evaluation results show that LibASL can improve the throughput by up to 5 times while precisely preserving the tail latency designated by applications.

Figure 1. Performance collapses of the MCS lock and the TAS lock on Apple M1 when locks are heavily contended: (a) throughput collapse; (b) latency collapse. M1 has 4 big and 4 little cores. The first 4 threads are bound to different big cores; the others are bound to different little cores.

Keywords: synchronization primitives, lock, scalability, asymmetric multicore processor

1 Introduction

Single-ISA asymmetric multicore processor (AMP) combines cores of different computing capacities in one processor [47, 48] and has been widely used in mobile devices (e.g., ARM big.LITTLE [16]). By combining both faster big cores and slower little cores, AMP is more flexible in accommodating both performance-oriented and energy-efficiency-oriented scenarios, such as leveraging all cores to achieve peak performance and using little cores only when energy is preferred. There is also a recent trend to embrace such an architecture in more general CPU processors, including the desktop and the edge server [2, 10, 15].

As before, applications on AMP need to use locks for acquiring exclusive access to shared data. However, we observe that existing locks, including those scalable in the symmetric multicore processor (SMP) [24, 31] or non-uniform memory access system (NUMA) [26, 32–34, 45, 46, 51, 54], fail to scale on AMP and cause collapses in either throughput or latency, or both. After an in-depth analysis, we find the main reason is that those locks (implicitly) assume symmetric cores, which does not hold on AMP. On the one side, locks that preserve lock acquisition fairness (i.e., give all cores an equal chance to lock), either short-term (e.g., the MCS lock [52] passes the lock in FIFO order) or long-term (e.g., NUMA-aware locks [26, 32–34, 45, 46, 51, 54] ensure an equal chance over a period), assume symmetric computing capacity. Therefore, on AMP, they give the slower little cores the same chance as the big cores to hold the lock, which exposes the longer execution time of critical sections on little cores on the critical path and causes throughput collapse. On the other side, locks that do not preserve acquisition fairness rely on atomic operations to decide the lock holder (e.g., the test-and-set spinlock). They assume a symmetric success rate of the atomic operations when executing simultaneously, which is also asymmetric on AMP. Thus, those locks are likely to be passed only among one type of cores (i.e., either big cores or little cores), which causes latency collapse or even starvation for the others. Moreover, when the slower little cores have a higher chance to lock, the throughput also collapses due to the longer execution time of the critical sections on them. Figure 1 shows the performance collapses on Apple M1. Both the fair MCS lock and the unfair TAS (test-and-set) spinlock face throughput collapse when scaling to little cores. Besides, the TAS spinlock's latency also collapses and is 3.7x longer than the MCS lock's.

Figure 2. LibASL overview. Calls to pthread_mutex_lock in unmodified latency-critical code are transparently redirected to asl_mutex_lock, a reorderable lock built atop an existing FIFO lock; the only additions are the epoch_start(id) and epoch_end(id, latency_SLO) annotations.

Facing the asymmetry in AMP, it is non-trivial to decide the lock ordering (who locks first) for both high throughput and low latency. Binding threads only to big cores is an intuitive choice. However, using little cores can achieve higher throughput under a lower contention. Besides, it may violate the energy target, as the energy-aware scheduler [8] could schedule threads to little cores to save energy. Another intuitive approach is to give big cores a fixed higher chance to lock. However, the throughput and the latency are mutually competing in AMP. Thus, it is hard to find a static proportion that can meet the application's latency requirement and improve the throughput at the same time.

In this paper, we propose an asymmetry-aware lock named LibASL, as shown in Figure 2. Rather than ensuring the lock acquisition fairness that causes collapses on AMP, LibASL provides a new lock ordering directly guided by applications' latency requirements to achieve better throughput. Atop a FIFO waiting queue, LibASL allows big cores to reorder (lock first) with little cores for higher throughput, under the condition that the victim (the one being reordered) will not miss the application's latency target. To achieve such an ordering, we first design a reorderable lock, which exposes the reorder capability as a configurable time window; big cores can only reorder with little cores during that time window. Atop the reorderable lock, LibASL automatically chooses a suitable (fine-grained) reorder window on each lock acquisition according to applications' coarse-grained latency requirements through a feedback mechanism. LibASL provides intuitive interfaces for developers to specify the coarse-grained latency requirement (e.g., of a request handling procedure) in the form of a latency SLO (service level objective, e.g., 99% of requests should complete within 50ms), which is widely adopted by both academia [40, 49, 58, 62] and industry [9, 29, 30]. To use LibASL, annotating the SLO is the only required effort if the application is latency-critical, and such an SLO is already clearly defined by the application in most cases. Non-latency-critical applications can benefit from LibASL without modifications.

We evaluate LibASL in multiple benchmarks including five popular databases on Apple M1, the only off-the-shelf desktop AMP yet. Results show that LibASL improves the throughput of pthread_mutex_lock by up to 5x (3.8x over the MCS lock, 2.5x over the TAS spinlock) while precisely maintaining the tail latency even in highly variable workloads.

In summary, this paper makes the following contributions:

• The first in-depth analysis of the performance collapses of existing locks on AMP.
• An asymmetry-aware scalable lock LibASL, which provides a new latency-SLO-guided lock ordering to achieve the best throughput the SLO allows on AMP.
• A thorough evaluation on a real desktop AMP (Apple M1) and real-world applications that confirms the effectiveness of LibASL.

2 Scalable Locking is Non-scalable on AMP

2.1 Asymmetric Multicore Processor

In this section, we introduce several major features of AMP.

First, the asymmetry in AMP is inherent. Performance can also be asymmetric in SMP when using dynamic voltage and frequency scaling (DVFS). However, SMP can boost the frequency of the lagging core [21, 25, 57] while AMP cannot.

Second, asymmetric cores in recent AMPs [3, 14, 15] are placed in a single cluster and share the same Last Level Cache (LLC). Thus, communication among cores in AMP is similar to SMP rather than NUMA.

Third, the scheduler (e.g., the energy-aware scheduler in Linux [8]) can place different threads across asymmetric cores [36, 42]. Multi-threaded applications achieve better performance by leveraging all asymmetric cores [6] and need to use locks for synchronization among them as before.

2.2 A Study of Existing Scalable Locks in AMP

Scalable locking in SMP and NUMA has been extensively studied. However, existing scalable locks are non-scalable on AMP and encounter performance collapses. The main reason is that existing locks (implicitly) assume symmetric cores, which does not hold on AMP. There are two major differences between AMP and SMP that cause the collapses.

First, the computing capacity is asymmetric. Little cores spend a longer time executing the same critical section. Thus, when the lock preserves acquisition fairness, it gives little cores an equal chance to hold the lock, which exposes the longer execution time of critical sections on them on the critical path and causes a throughput collapse.

Figure 3. An example timeline (from left to right) when the lock is highly contended: (a) FIFO lock (MCS, ticket): throughput collapses; (b) TAS lock (little-core-affinity): throughput and latency collapse; (c) TAS lock (big-core-affinity): latency collapses. Core 0/1 are big cores; core 2/3 are little cores. More critical sections executed means higher throughput; longer waiting time leads to longer latency.

We explain this problem through an example in Figure 3(a). The system includes two big cores (core 0/1) and two little cores (core 2/3), and they are competing for the same lock intensively. We divide the program's execution into three parts: executing the critical section, executing the non-critical section, and waiting for the lock. As shown in Figure 3(a), when ensuring the short-term (i.e., FIFO) lock acquisition

fairness (e.g., the MCS lock [52]), threads hand over the lock in FIFO order. As a result, the longer execution time of the critical sections on core 2/3 is exposed on the critical path and hurts the throughput.

Besides the short-term acquisition fairness, previous work provides long-term acquisition fairness to improve the throughput on many-core processors or NUMA while keeping a relatively low latency. The Malthusian lock [31] reduces contention on many-core processors for better throughput by blocking all competitors in the waiting queue except the head and the tail. It achieves long-term fairness by periodically shifting threads between passive blocking and active acquiring. NUMA-aware locks [26, 32–34, 45, 46, 51, 54] batch competitors from the same NUMA node to reduce cross-node remote memory references. They achieve long-term fairness by periodically allowing different NUMA nodes to hold the lock. However, the long-term fairness also gives the little cores an equal chance to hold the lock within a period and thus hurts the throughput.

Implication 1: Lock ordering that respects acquisition fairness, either short-term or long-term, is no longer suitable on AMP. In SMP or NUMA, preserving acquisition fairness can prevent starvation and achieve relatively low latency without degrading the throughput, but it causes collapses on AMP. Thus, a new lock ordering should be proposed for AMP to meet the latency goal while bringing higher throughput.

Second, the success rate of atomic operations (e.g., test-and-set, TAS) is asymmetric. On some AMP systems (including ARM Kirin970 and Intel L16G7), we observe that big cores have a stable advantage over little cores in winning the atomic TAS. On other platforms (including Apple M1), the advantage shifts between asymmetric cores¹. Thus, locks that do not preserve acquisition fairness and rely on the atomic operation to decide the lock holder (e.g., the TAS spinlock) also have a scalability issue on AMP. Such locks are likely to be held only by one type of cores (i.e., either big cores or little cores), which causes a latency collapse or even starvation for the others. Moreover, when the little cores have a bigger chance to hold the lock, the throughput also collapses due to the longer execution time of the critical sections on them.

As shown in Figure 3(b), when little cores show an advantage in winning the atomic TAS, they have more chance to lock (we name this little-core-affinity). Thus, big cores can barely lock (starvation). Besides, the longer execution time of the critical sections on little cores is exposed and hurts the throughput. Similarly, when big cores show the advantage (big-core-affinity in Figure 3(c)), little cores will starve. Nevertheless, big-core-affinity allows big cores that arrive later to lock before (reorder with) earlier little cores. Thus more critical sections are executed on the faster big cores, which brings higher throughput.

Figure 4. When the TAS lock shows big-core-affinity, it can achieve higher throughput, but the latency still collapses: (a) throughput; (b) latency.

We validate our observations on Apple M1. In Figure 1 (Section 1), we present the case where the TAS lock shows little-core-affinity. In Figure 4, we present another scenario where the TAS lock shows big-core-affinity². In both scenarios, the fair MCS lock faces throughput collapses (over 50% degradation from 4 big cores to all cores), while the unfair TAS lock faces latency collapses. When the TAS lock shows little-core-affinity in Figure 1, its throughput also collapses and is 43% worse than the MCS lock when using all the cores. However, when the TAS lock shows big-core-affinity in Figure 4, more critical sections are executed on the faster big cores, which brings 53% higher throughput than the MCS lock. The latency of the MCS lock also increases when scaling to little cores due to the longer execution time of critical sections on little cores; however, it is much shorter than the TAS lock's. These observations also hold in real-world applications. When the TAS lock shows little-core-affinity in SQLite (detailed in Section 4.2), it has 49% worse throughput and 1.8x longer tail latency than the MCS lock. However, when the TAS lock shows big-core-affinity in UpscaleDB, it has 90% better throughput yet 2.5x longer tail latency than the MCS lock.

Implication 2: Reordering to prioritize faster cores is indispensable on AMP for higher throughput, but it must be bounded. When the TAS lock shows big-core-affinity, it reorders big cores with little cores unlimitedly and achieves higher throughput. However, the unlimited reordering causes a latency collapse. Thus, the reordering must be bounded to preserve applications' latency requirements.

¹ On Apple M1, when executing TAS back-to-back (higher contention), little cores show a stable advantage. With the distance between two TAS operations increased (lower contention), big cores show a stable advantage.
² We identify the affinity by comparing the number of executed critical sections on the different types of cores.

2.3 Strawman Solutions

The straightforward approach to address the AMP scalability issue is to use only big cores. However, little cores can
help achieve higher throughput with a lower contention, and finding the optimal number of cores to run an application is a long-existing problem [31, 39]. Moreover, the energy-aware scheduler may schedule threads to little cores to save energy. Binding threads only to big cores may violate the energy target.

Another intuitive solution is adopting proportional execution, which gives big cores a fixed higher chance to lock. Figure 5 shows the performance when setting different proportions. As shown in the figure, the throughput and the latency are mutually competing against each other in AMP: a larger proportion leads to higher throughput but a longer tail latency. However, there is no clear clue whether a specific application prefers throughput over latency (and to what extent) or the opposite. Moreover, since an application's load may change over time, the latency will be unstable and unpredictable when setting a fixed proportion. Therefore, it is almost impossible to choose a static and suitable proportion that meets applications' needs.

Figure 5. Latency and throughput when setting different proportions; the label on each point is the proportion (e.g., 10 means big cores have a 10x higher chance to lock).

3 Design of LibASL

3.1 Overview

To address the lock scalability problem on AMP, we propose an asymmetry-aware scalable lock LibASL. Rather than preserving the lock acquisition fairness, LibASL provides a new lock ordering guided directly by the application's latency SLO for better throughput and bounded latency (according to Implication 1). Atop a FIFO waiting queue, LibASL allows reordering under the condition that the victim (the one being reordered) will not miss the application's latency SLO. Thus, big cores can reorder as much as possible with little cores to achieve higher throughput, while little cores can still (barely) meet their latency SLO (according to Implication 2).

To achieve such an ordering, bounded reorder capability is needed. Thus, we first design a reorderable lock, which exposes the bounded reorder capability as a configurable reorder time window atop a FIFO waiting queue. Only during the time window can big cores reorder with (lock before) little cores. Once the window expires, no reorder will happen (reordering is bounded). However, it is non-trivial to set a suitable fine-grained reorder window for each lock acquisition based on the application's coarse-grained latency requirement. To this end, LibASL proposes a feedback mechanism that automatically chooses a suitable reorder window on each lock acquisition according to the application's coarse-grained latency requirement.

Usage model. Using LibASL is simple and straightforward. By leveraging weak-symbol replacement, LibASL redirects invocations of pthread_mutex_lock transparently. Thus, applications that use pthread_mutex_lock only need to link with LibASL and, if latency-critical, add a few lines of code to annotate the latency requirement (non-latency-critical applications can benefit from LibASL without modifications). LibASL provides two intuitive interfaces to annotate the latency SLO of a certain code block (named an epoch): epoch_start and epoch_end. Each epoch has a unique epoch id, which should be passed as an argument (e.g., 5 in Figure 6). The epoch id is statically given by programmers and can be further managed simply through a global counter. epoch_end takes another argument, which specifies the latency SLO of the epoch in nanoseconds (e.g., 1000 in Figure 6 means the epoch's latency SLO is 1us). LibASL restricts neither the number of locks nor how the locks are used in an epoch. Thus, programmers can mark a coarse-grained latency SLO (e.g., the request handler in Figure 6).

    void request_handler(void) {
    +   epoch_start(5);              // Epoch 5 starts
        if (flag == PATH_1) {
            lock(&lock_1);           // Critical Section 1
            lock & unlock (&lock_2); // Critical Section 2
            unlock(&lock_1);
        } else {
            lock & unlock (&lock_1); // Critical Section 3
        }
    +   epoch_end(5, 1000);          // Epoch 5 ends. SLO 1us
    }

Figure 6. Example of using LibASL. The lines marked with + are the only additions to the unmodified latency-critical code.

Human efforts. For latency-critical applications, annotating the application's existing coarse-grained SLOs is the only required effort to use LibASL. Such SLOs are defined according to the actual latency target and are commonly available in both practice (e.g., an interactive app may have an SLO of 16.6ms to satisfy the 60Hz frame rate requirement) and research (e.g., [40] attaches latency SLOs to syscalls, and [58] marks the latency SLO of storage system operations). For applications without clear SLOs, LibASL provides a profiling tool that generates a latency-throughput graph (see the variant-SLO figures in Section 4) and helps developers choose suitable SLOs. For non-latency-critical applications, LibASL can be transparently used with no SLO.
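The weak-symbol replacement mentioned in the usage model can be illustrated with a small, self-contained sketch. glibc's pthread symbols can be interposed by a strong definition linked into the application, so a library can take over pthread_mutex_lock without touching callers. The asl_mutex_lock stub and the redirected_calls counter below are ours, standing in for LibASL's real entry point; the stub performs no locking.

```c
#include <pthread.h>

/* Call counter so the interposition is observable; not part of LibASL. */
int redirected_calls = 0;

/* Stub standing in for LibASL's real entry point, which would pick
 * lock_immediately / lock_reorder based on core type and epoch. */
static int asl_mutex_lock(pthread_mutex_t *mutex) {
    (void)mutex;
    redirected_calls++;
    return 0; /* stub only: performs no actual locking */
}

/* Strong definition: overrides the interposable libc symbol, so every
 * existing pthread_mutex_lock call site lands here unmodified. */
int pthread_mutex_lock(pthread_mutex_t *mutex) {
    return asl_mutex_lock(mutex);
}
```

When such a definition is linked into the executable (or preloaded), unmodified callers are redirected at the cost of an extra function call, which is consistent with the low overhead this style of interposition is known for.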
Usage scenarios. Improving a system's throughput without violating the latency SLO is always preferred [49, 58, 59, 62]. By using LibASL to achieve higher throughput, service providers can consolidate more requests into fewer servers for cost and energy efficiency without breaking the service-level agreement; user-interactive applications can run faster without compromising user experience.

3.2 Reorderable Lock

The reorderable lock exposes the bounded reorder capability atop existing FIFO locks (e.g., the MCS lock). It provides three interfaces to acquire the lock: lock_immediately, lock_reorder and lock_eventually. Figure 7 presents the behavior of those interfaces. Competitors using lock_immediately are appended to the tail of the waiting queue immediately. Competitors using lock_reorder and lock_eventually are regarded as standby competitors. If the waiting queue is empty, standby competitors can enqueue and then become the lock holder. Otherwise, the standby competitors are blocked. Each standby competitor has a unique reorder time window. Other competitors can reorder with them and lock earlier during the reorder window. Thus, the reordering is bounded by that window. A standby competitor will enqueue once its reorder window expires. The reorder window is an argument of lock_reorder, while lock_eventually sets the maximum reorder window.

Figure 7. Reorderable lock blocks standby competitors and allows other competitors to reorder with them. (a) Competitors invoking lock_reorder/lock_eventually do not enqueue and become standby competitors. (b) Competitors invoking lock_immediately enqueue immediately and reorder with earlier standby competitors. (c) A standby competitor will enqueue when the queue is empty or its reorder time window expires.

Algorithm 1. Reorderable lock implementation.

     1 int lock_immediately(mutex_t *mutex) {
     2     return lock_fifo(mutex);
     3 }
     4
     5 int lock_reorder(mutex_t *mutex,
     6                  uint64_t window) {
     7     uint64_t window_end;
     8     uint64_t cnt = 0, next_check = 1;
     9     if (window < THRESHOLD ||
    10         is_lock_free(mutex))
    11         return lock_fifo(mutex);
    12     window_end = current() + window;
    13     while (current() < window_end) {
    14         if (cnt++ == next_check) {
    15             if (is_lock_free(mutex))
    16                 break;
    17             next_check <<= 1;
    18         }
    19     }
    20     return lock_fifo(mutex);
    21 }
    22
    23 int lock_eventually(mutex_t *mutex) {
    24     return lock_reorder(mutex,
    25                         MAX_REORDER_WINDOW);
    26 }

Algorithm 1 shows the implementation of the reorderable lock. When calling lock_immediately, the competitor directly enqueues (line 2) by using the lock interface (lock_fifo) of the underneath FIFO lock (e.g., MCS). lock_reorder takes an argument window that specifies the length of the reorder window in nanoseconds. When calling lock_reorder, the competitor first checks whether the lock is free (line 10) or the window is too small to do any meaningful job (line 9). If so, the competitor enqueues immediately (line 11). Otherwise, it becomes a standby competitor. During the reorder window, the standby competitor checks the lock status occasionally (line 15). We use a binary exponential back-off strategy to reduce the contention over the lock (line 17). It is an intuitive choice, since the more often the lock is found busy when checking, the heavier the contention is likely to be, and the less likely the lock will become free in the same short period. When the reorder time window expires, the competitor can finally enqueue (line 20). We do not use a secondary queue for the standby competitors because each competitor can have a different reorder window; thus, they may enqueue at different times once their reorder windows expire (not in FIFO order). As for lock_eventually, it sets the reorder window to the maximum and invokes lock_reorder. All three interfaces invoke lock_fifo within a limited time. Thus, the reorderable lock is starvation-free.

Since the reorderable lock does not modify the underneath lock, for unlocking, it directly invokes the unmodified unlock procedure of the underneath lock.

3.3 LibASL

Atop the reorderable lock, LibASL collects the application's latency SLO and maps it to a suitable reorder window to maximize the reordering without violating the SLO. The mapping is achieved by tracing all epochs' latency and adjusting the reorder window at every epoch's end. Since different epochs may have different latency SLOs, LibASL keeps an individual reorder window for each epoch. When calling pthread_mutex_lock in an epoch, LibASL redirects it to lock_immediately if the thread is running on a big core. Otherwise, it uses lock_reorder and sets the reorder window of that epoch.

Algorithm 2 shows the implementation of LibASL's epoch interfaces.
Algorithm 2. LibASL epoch implementation.

     1 typedef struct epoch {
     2     uint64_t window; /* Reorder Window */
     3     uint64_t start;  /* Timestamp */
     4     uint64_t unit;   /* Adjust Unit */
     5 } epoch_t;
     6 __thread epoch_t epoch[MAX_EPOCH];
     7 __thread int cur_epoch_id = -1;
     8 #define PCT 99 /* 99th Percentile Latency */
     9
    10 int epoch_start(int epoch_id) {
    11     /* Checks are omitted */
    12     cur_epoch_id = epoch_id;
    13     epoch[epoch_id].start = current();
    14     return 0;
    15 }
    16
    17 int epoch_end(int epoch_id,
    18               uint64_t required_latency) {
    19     /* Checks are omitted */
    20     uint64_t latency, window;
    21     if (is_big_core())
    22         goto out;
    23     latency = current() - epoch[epoch_id].start;
    24     window = epoch[epoch_id].window;
    25     if (latency > required_latency) {
    26         window >>= 1;
    27         epoch[epoch_id].unit =
    28             (window * (100 - PCT)) / 100 > MIN_UNIT ?
    29             (window * (100 - PCT)) / 100 : MIN_UNIT;
    30     } else {
    31         window += epoch[epoch_id].unit;
    32     }
    33     epoch[epoch_id].window = window;
    34 out:
    35     cur_epoch_id = -1;
    36     return 0;
    37 }

Each epoch has per-thread metadata, which keeps the length of the reorder window (window), the start timestamp (start) and the adjusting unit of the epoch's window (unit). When calling epoch_start, epoch_id specifies the unique id of the upcoming epoch, which is stored in the per-thread global variable cur_epoch_id (line 12). Then it records the start timestamp (line 13) using the lightweight clock_gettime.

When the epoch ends, epoch_end takes another argument required_latency, which specifies the latency SLO of the current epoch in nanoseconds. It calculates the current latency (line 23), compares it with the requirement (line 25) and updates the reorder window accordingly. We take a conservative strategy to adjust the reorder window, inspired by the TCP congestion control algorithm [22]: linear growth, combined with exponential reduction when the latency exceeds the SLO. We set the granularity of growth (unit) to (100−PCT)/100 of the reduced window³, where PCT represents the percentile the SLO specifies (defined in line 8).

Algorithm 3. LibASL internal interface.

     1 int asl_mutex_lock(mutex_t *mutex) {
     2     if (is_big_core())
     3         return lock_immediately(mutex);
     4     else if (cur_epoch_id < 0)
     5         return lock_eventually(mutex);
     6     else
     7         return lock_reorder(mutex,
     8             epoch[cur_epoch_id].window);
     9 }

By leveraging weak-symbol replacement, LibASL transparently redirects pthread_mutex_lock in applications to asl_mutex_lock in Algorithm 3 with negligible overhead (20+ cycles, similar to litl [39]). When calling asl_mutex_lock, competitors on big cores directly acquire the underneath lock using lock_immediately (line 3), while those on little cores use lock_reorder with the window length of the current epoch. If not in any epoch, it calls lock_eventually (line 5). Identifying the core type is achieved by getting the core id and looking it up in a pre-defined mapping. Since the reorderable lock is implemented atop existing locks, both trylock and nested locking are supported. Besides, the condition variable is also supported, by using the same technique as litl [39].

3.4 Analyses

Throughput. LibASL provides good scalability on AMP. We analyze the different situations applications may encounter and the corresponding behavior of LibASL as follows.

Big cores and little cores are not competing for the same lock. On big cores, LibASL behaves the same as the underneath FIFO lock (e.g., the MCS lock). On little cores, LibASL behaves similarly to the backoff spinlock. Both locks are scalable [24] when competitors are from one type of core.

Big cores and little cores are competing for the same lock. When the lock is not heavily contended, competitors from both big and little cores can immediately hold the lock if the lock is free (no additional overhead). Little cores can help achieve higher throughput in such cases (e.g., Figure 8c shows the corresponding experiment). With the contention level increased, big cores will reorder with little cores under the condition that the latency SLO is still met. When the big cores do not saturate the lock (i.e., the lock becomes free once in a while), little cores will lock once the queue is empty. Thus, LibASL can find the sweet spot where some additional little cores help saturate the lock for better throughput (and block the rest of the little cores). Otherwise, allowing any extra little core to join the competition would degrade the throughput. In those cases, little cores can get the lock only when the reorder window expires. Thus, LibASL can improve the throughput as much as the latency SLO allows. Theoretically, the throughput of LibASL and the latency SLO have a negative reciprocal relationship⁴. So, the growth of throughput slows down with a larger SLO (see Figure 8b).

³ So that after another 100/(100−PCT) executions, the latency will be the same as the one that barely exceeded the SLO and triggered the exponential reduction. The probability of not exceeding the SLO is thus (100/(100−PCT) − 1) / (100/(100−PCT)) = PCT/100.
⁴ Suppose the big core is a times faster than the little core and each critical section takes 1 second on a big core (a seconds on a little core). When x big cores execute before 1 little core, the throughput is (x+1)/(x+a) = 1 − (a−1)/(x+a) (i.e., the x+1 critical sections divided by the execution time x+a). Meanwhile, the SLO has a linear relationship with x (when the SLO increases by 1s, 1 more big core can reorder). So the latency SLO and the throughput have a negative reciprocal relationship.

Latency. LibASL can precisely maintain the latency under the SLO through a feedback mechanism. The size of the reorder
6 window has a monotonic relationship with the epoch’s la- Evaluation Setup. We evaluate LibASL on Apple M1, the tency (e.g., a smaller reorder window means a shorter wait- only available desktop AMP yet. LibASL also works well ing time and a shorter latency). It still holds when an epoch on mobile AMP processors (e.g., ARM big.LITTLE), since contains multiple locks since they share the same window the improvement of LibASL comes from considering the size. Thus, LibASL can find the suitable window size that the asymmetry in computing capacity and is not restricted to latency barely meets the SLO by adjusting the size according certain AMP processors. However, due to the space limit, to the latency (if the latency higher than SLO, shrink the we only present the results on M1 here. Apple M1 has 4 big window, and vice versa). cores and 4 little cores, where big cores are about 4.7x the Even if the epoch length (i.e., execution time) becomes performance of little cores5. In such a system, the theoretical heterogeneous (e.g., executing different code paths), LibASL throughput speedup upper bound of LibASL to FIFO lock can still maintain the tail latency because the reorder window (e.g., MCS) by allowing big cores to reorder is 1.8x6. We run shrinks exponentially once the violation happens and grows an unmodified Linux 5.11 on M1 [7]. gradually (linearly) in the following executions. Thus, it only In the following experiments, without explicit statement, gives some short epoch with a small reorder window (larger the reorderable lock is built atop of the MCS lock, and the window can still meet SLO) but will not violate the SLO. PCT is set to 99 to guarantee the P99 latency. We create 8 Notice that the latency SLO is not a strict deadline. LibASL threads and bind them to different cores (only for evalua- uses it as a hint to maximize throughput without violating tion, core-binding is not required by LibASL). We compare it. 
There are some cases where LibASL does not take effect. LibASL with pthread_mutex_lock (in glibc-2.32), TAS, ticket, We conclude as follows: MCS and ShflLock45 [ ]. ShflLock provides a lock reordering framework but can only take a static reorder policy. The SLO- 1. Inappropriate latency requirement. When the guided ordering (not static) in LibASL is hard to be integrated given SLO is impossible to achieve even without re- into it. Instead, we implement a proportional-based static ordering, LibASL falls back to a FIFO lock (best effort). policy, which gives the big core a fixed higher (10x) chance to 2. Non-lock-sensitive workloads. In such workloads, lock7. As shown in Figure 5, any proportion is a point lies on locks barely affect both the latency and the throughput the latency-throughput relationship curve. Thus, we choose since they are barely used on the critical path. Inher- the proportion (i.e., 10) that has obvious throughput improve- ently, LibASL will not influence the performance. ment without introducing extremely long latency to compare Energy. Energy-efficiency is one of the major targets ofthe with. We also present the speedup upper-bound of LibASL AMP system. To maintain the lowest energy consumption, by using the maximum reorder window8 (LibASL-MAX). Linux provides EAS (energy-aware scheduler [8]), which chooses the most suitable core for each thread. LibASL does 4.1 Micro-Benchmarks not require binding thread to cores. Thus threads can migrate Bench-1: A heavily contended benchmark. In this benchmark, between cores freely. The reorder window can quickly adjust all threads repeatedly execute the same epoch that contains itself once the migration happens. Therefore, LibASL will not critical sections in different lengths and acquiring multiple violate the scheduling decision, as well as the energy target. locks in a nested manner. 
We measure the latency of epoch Threads will only run on big cores when the scheduler de- and present the tail latency of little cores (Little P99), big cides so (not caused by LibASL). Moreover, when running on cores (Big P99) and overall (Overall P99) separately. As big cores, LibASL makes threads do meaningful jobs rather shown in Figure 8a, the TAS lock shows big-core-affinity here than busy waiting, which helps saving the energy [35]. (big cores have a much shorter tail latency). Thus, among Space overhead. The space overhead of LibASL comes from existing locks, TAS lock has the best throughput yet the the metadata of epochs, which is negligible because the per- worst P99 latency. When setting the latency SLO to 0us thread metadata of an epoch only takes 24 bytes (see Algo- (LibASL-0), LibASL has the same throughput and latency rithm 2), and is irrelevant to the number of locks. as the MCS lock since the latency target is impossible to achieve (best effort FIFO). When achieving similar through- 4 Evaluation put with TAS lock (LibASL-35), LibASL reduces the overall We evaluate LibASL to answer the following questions: tail latency by 37% (55% in little cores). When having a simi- lar overall tail latency as the TAS lock (LibASL-55), LibASL 1. How much throughput can LibASL improve when set- ting variant SLO? 2. Can LibASL precisely maintain epochs’ latency in var- 5According to the result of Sysbench [18]. The ratio may vary in different ious situations? workloads. 6 4.75+1 − 1 ≈ 1.8: Comparing the case where the big cores always run and LibASL 2 3. How does perform under variant contention the case where the big cores and little cores run one by one. levels? 7Batching at most 10 big cores before passing to 1 little core. 4. Can LibASL take effect in real-world applications? 8Maximum reorder window is set to 100ms in LibASL. 
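For reference, the TAS lock compared against here is the classic test-and-set spinlock: there is no queue and no fairness, and the next holder is simply whichever competitor's atomic operation wins. A minimal sketch (ours, not the evaluated implementation):

```c
#include <stdatomic.h>
#include <pthread.h>

/* Classic test-and-set spinlock: acquisition order is decided solely by
 * which core's atomic exchange wins the race, so it is unfair by design. */
typedef struct { atomic_flag held; } tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

static void tas_unlock(tas_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static tas_lock_t lock = { ATOMIC_FLAG_INIT };
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        tas_lock(&lock);
        counter++;          /* critical section */
        tas_unlock(&lock);
    }
    return NULL;
}
```

Which cores win this race is decided by the hardware: as the results show, it can favor big cores (as in this benchmark) or little cores (as in the linked-list benchmark later).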
Figure 8. Micro-benchmarks. Big P99 and Little P99 present the 99th percentile latency on big and little cores individually. (a) Latency and throughput comparison: LibASL-X means the SLO is set to X us; LibASL-MAX means enabling maximum reordering (upper bound). (b) Performance of LibASL when setting variant SLOs (x-axis). (c) Throughput speedup of LibASL under variant contention levels. (d) Self-adaptive reorder window: latency of each epoch during the first 350ms. (e) Performance of blocking locks and LibASL under core over-subscription. (f) LibASL with variant SLOs under core over-subscription.

Although the TAS lock also (implicitly) prioritizes the big cores under such a circumstance, its reordering depends on the hardware atomic operation and is unstable and uncontrollable. LibASL manages the reordering elaborately, which allows more critical sections to execute on the big cores and thus yields higher throughput. LibASL brings up to 86% throughput speedup over the TAS lock, and 130% over the MCS lock, when setting a larger SLO (LibASL-MAX). When giving big cores a 10x higher chance to lock (SHFL-PB10), ShflLock improves the throughput of MCS by 20% while having a 4x longer overall tail latency. LibASL outperforms SHFL-PB10 by 35% when having similar overall latency (LibASL-65). This is because LibASL gets better cache locality by batching more big cores before passing the lock to a little core whenever the SLO is not violated, while the proportional-based approach has to pass the lock to little cores periodically. The pthread_mutex_lock has the worst throughput and the longest tail latency; LibASL outperforms it by up to 3.5x. (Question 1)

Figure 8b shows the throughput and the latency of LibASL when setting variant latency SLOs (the epoch's length does not change). With a larger SLO (x-axis, from left to right), the throughput increases, and the tail latency of little cores sticks to the Y=X line (i.e., it barely meets the SLO). This validates that LibASL improves the lock's throughput under the condition that little cores can still just meet their latency SLO. Moreover, the growth speed of the throughput slows down as the SLO becomes larger, as discussed in Section 3.4. Meanwhile, big cores get more chances to lock; thus both the big cores' and the overall tail latencies are shorter than the little cores' and stay within the SLO. The only exception is when setting an SLO shorter than 15us (the tail latency of MCS): the required latency is then impossible to achieve even by passing the lock in FIFO order, so LibASL falls back (best effort) to the MCS lock.

Bench-2: A highly variable workload. To present the effectiveness of LibASL in managing the epoch's latency in highly variable workloads, we record each epoch's latency in the first 350ms when executing the same benchmark as in Bench-1. Figure 8d shows the latency of each epoch executed on big cores and little cores individually. The latency SLO is set to 100us throughout the experiment. During the 100-200ms period, we enlarge the epoch's length (execution time) by 128 times, and we shrink it back to the original length during the 200-250ms period. After that, we change the length of the epoch rapidly and randomly during the 250-300ms period. Finally, the epoch's length is set 1024 times larger than the original until the end. As shown in the figure, LibASL is fully capable of maintaining the latency in a highly variable workload (Question 2). Every time the latency exceeds the SLO, the reorder window shrinks to half and then increases slowly over the next 100 executions (PCT is set to 99). When the epoch's length enlarges at 100ms and 200ms, LibASL quickly adjusts the reorder window to a suitable size. Even when the epoch's length becomes highly heterogeneous during 250-300ms, LibASL can still keep the latency within the SLO.
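A minimal sketch of this shrink-and-regrow policy (the struct and helper names are ours, not the paper's Algorithm 2, which additionally tracks per-epoch metadata):

```c
#include <stdint.h>

/* Self-adaptive reorder window: shrink exponentially when an epoch's
 * latency violates the SLO, then grow back linearly.
 * Illustrative only; not the paper's actual implementation. */
struct reorder_window {
    uint64_t size;  /* current reorder window length (e.g., in ns) */
    uint64_t unit;  /* linear growth step */
    uint64_t max;   /* cap (100ms in the evaluation) */
};

void window_update(struct reorder_window *w, uint64_t latency,
                   uint64_t slo, unsigned pct)
{
    if (latency > slo) {
        w->size /= 2;  /* exponential reduction on a violation */
        /* Grow back in steps of (100-pct)% of the reduced window, so the
         * window regains its pre-violation size after 100/(100-pct)
         * SLO-respecting epochs (100 epochs for P99). */
        w->unit = w->size * (100 - pct) / 100;
    } else if (w->size + w->unit <= w->max) {
        w->size += w->unit;  /* linear growth while under the SLO */
    } else {
        w->size = w->max;
    }
}
```

With PCT = 99, the growth unit is 1% of the reduced window, so a window that was halved returns to its previous size after 100 SLO-respecting epochs, matching the regrowth behavior visible in Figure 8d.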

When the epoch's length enlarges 1024 times at 300ms, the target latency becomes impossible to achieve; thus, LibASL falls back to the underlying MCS lock, and both big and little cores have similar latencies.

Bench-3: A benchmark with variant contention levels. In this benchmark, we evaluate LibASL under variant contention levels by executing a different number of nop instructions between two lock acquisitions. Figure 8c shows the throughput speedup of LibASL over the locks in the legend (e.g., when x=0, LibASL outperforms MCS by 2x and the TAS lock by 45%). To allow as much reordering as possible, we do not set a latency SLO in LibASL. We also include the result of using only the big cores (MCS-4). When competitors from big cores already saturate the lock (x < 3), LibASL turns the competitors from little cores into standby competitors and achieves throughput similar to MCS-4 (significantly better than the others). As the contention level decreases, LibASL lets little cores join the competition and achieves better throughput than using only big cores (up to 68%). Among all contention levels, LibASL achieves good throughput (Question 3).

Bench-4: A benchmark with CPU core over-subscription. In this benchmark, we examine the effectiveness of LibASL under core over-subscription by creating 2 threads on each core and executing Bench-1. We replace the non-blocking MCS lock in LibASL with the pthread_mutex_lock and use nanosleep to block the standby competitor. Results are presented in Figures 8e and 8f. Since the MCS lock is passed in FIFO order, the wake-up latency is exposed on the critical path and leads to a significant throughput degradation (spin-then-park MCS is 96% worse than pthread_mutex_lock); therefore, LibASL uses pthread_mutex_lock rather than the MCS lock here. Although pthread_mutex_lock does not guarantee FIFO order and has unstable lock acquisition latency, LibASL can still preserve the SLO owing to its self-adaptive reorder window, and it improves the throughput of pthread_mutex_lock by up to 90%. Besides enabling reordering, LibASL also reduces contention by putting competitors into standby mode, which helps it go beyond the theoretical speedup upper bound relative to the MCS lock (1.8x, see Evaluation Setup).

Bench-5: A benchmark mixing epochs of significantly different lengths. In this benchmark, we randomly generate short and long (100x longer) epochs at different ratios. We compare LibASL with the ideal case that directly chooses (with no window adjustment) a suitable window for the different epochs (impossible in the real world). As shown in Figure 9, LibASL brings a significant and close-to-ideal throughput improvement over MCS (a maximum 20% gap from ideal, at the 50% ratio) while keeping the latency under the SLO.

Figure 9. Performance when mixing epochs of significantly different lengths at variant ratios. Throughput is normalized to MCS. The SLO is set to 100us (also the P99 tail latency of the MCS lock at the 100% ratio).

Bench-6: Data structure benchmarks. In the stack benchmark, threads randomly push or pop (fifty-fifty) 1 element in each epoch. Similarly, in the linked-list benchmark, threads randomly append or remove (fifty-fifty) 1 element in each epoch. As shown in Figure 10a, in the stack benchmark LibASL improves the throughput of the TAS lock by only 8% (18% over ShflLock) when having similar overall P99 latency (LibASL-6). This is because, with the TAS lock, the lock mostly passes among big cores only, which results in an extremely long tail latency on little cores (Little P99) but good throughput. When setting a larger SLO, LibASL provides up to 27% speedup over the TAS lock (LibASL-MAX). In the linked-list benchmark (Figure 10c), the TAS lock shows little-core-affinity (big cores have extremely long latency); thus the TAS lock faces a significant throughput degradation, since critical sections are mostly executed on the slower little cores. LibASL outperforms the TAS lock by 1.7x in throughput when having similar overall tail latency (LibASL-30), and by up to 3.2x with a larger SLO. Compared with MCS, ShflLock and pthread_mutex_lock, LibASL brings up to 2.5x, 1.5x and 2.3x throughput improvement, respectively. In both benchmarks, LibASL prevents the latencies from violating the SLO (Y=X), as shown in Figures 10b and 10d.

Figure 10. Data structures. Legends are explained in Figure 8. (a) Stack. (b) Variant SLOs. (c) Linked list (1e6 means 10⁶). (d) Variant SLOs.

4.2 Application Benchmarks

Databases. To examine the effectiveness of LibASL in real-world applications (Question 4), we evaluate 5 popular databases (detailed in Table 1). Databases benefit from using little cores to handle more requests with fewer machines, which can improve the cost and energy efficiency in edge computing [49, 58, 62]. However, when threads on asymmetric cores intensively compete for the same lock, using existing locks causes performance collapses. Integrating LibASL only requires inserting 3 lines of code: wrapping the operations with epoch_start and epoch_end, and adding the header file. As prior work [32, 45] does, we run each benchmark for a fixed period and calculate the average throughput. Moreover, to present the effectiveness of LibASL in cases where epochs' lengths are highly heterogeneous, we randomly choose to insert or find (fifty-fifty chance, referring to YCSB-A [20]) 1 item in each epoch. In most benchmarks, each epoch acquires multiple locks, as listed in the rightmost column of Table 1.

Table 1. Databases Considered

Application | Type | Version | Benchmark | Lock Usage in each Epoch
Kyoto Cabinet [11] | In-memory KV Store (CacheDB) | 1.2.78 | 50% Put, 50% Get | Slot-level Lock, Method Lock
upscaledb [19] | On-disk Persistent KV Store | 2.2.1 | 50% Put, 50% Get | Global DB Lock, Async Worker Pool Lock
LMDB [13, 28] | On-disk Persistent KV Store | 0.9.70 | 50% Put, 50% Get | Global DB Lock, Metadata Lock
LevelDB [12] | On-disk Persistent KV Store | 1.22 | db_bench Random Read | Metadata Lock
SQLite [17] | On-disk Persistent Database | 3.33.0 | 1/3 Insert, 1/3 Simple, 1/3 Complex Query | State Machine Lock, Metadata Locks

We first evaluate LibASL on several KV-stores, which play an important role in CDN or IoT edge servers as the storage service [1, 5]. In Kyoto Cabinet, we Put or Get (fifty-fifty) 1 item in an in-memory CacheDB in each epoch and measure its latency. As shown in Figure 11a, the TAS lock shows big-core-affinity and thus has the best throughput yet the worst tail latency among existing locks. LibASL reduces the tail latency by 90% when achieving throughput similar to the TAS lock (LibASL-70), and improves the throughput by up to 23% (96% over MCS and 89% over pthread_mutex_lock). When having a tail latency similar to ShflLock (LibASL-40), LibASL improves the throughput by 38%. Figure 11b shows the performance of LibASL when setting variant latency SLOs: although the execution time of Put or Get is heterogeneous, LibASL still precisely maintains the tail latency while improving the throughput.

Figure 11c presents the latency cumulative distribution function (CDF) of LibASL when setting the SLO to 70us. In the figure, Overall and Little represent the overall and the little cores' latency. A clear boundary can be seen in the overall latency, since most operations finish on big cores; due to the intensive contention, fewer than 20% of the operations are executed on little cores and have longer latency. There is also a clear boundary in the little cores' latency: about half of the operations have shorter latency (<35us) because of the shorter execution time of the Get operation. As for the longer Put operations (the other half), since the reorder window shrinks by half and grows back linearly once the latency exceeds the SLO, the probability grows linearly after half the SLO (35us).

Similar results can be found in upscaledb and LMDB (Figures 11d-11i). In upscaledb, among existing locks, the TAS lock has the highest throughput (90% higher than MCS) yet the longest tail latency (2.5x longer than MCS). When having a similar tail latency (LibASL-140) to the TAS lock, LibASL improves the throughput by 46%, which further goes to 1.6x (3.8x over MCS, 5x over pthread_mutex_lock and 50% over ShflLock) when setting a larger SLO. In LMDB, LibASL has 40% higher throughput than the TAS lock when having a similar latency (LibASL-600), which goes to 60% (86% over MCS, 126% over pthread_mutex_lock and 27% over ShflLock) with a larger SLO. In both benchmarks, the latency CDFs (Figures 11f and 11i) show a trend similar to Figure 11c: the clear boundary in the little cores' latency distinguishes the shorter Get from the longer Put.

LevelDB is another widely used KV-store. However, LevelDB implements its own blocking strategy rather than directly using pthread_mutex_lock for the Put operation; we therefore use the randomread test in the built-in db_bench to test only its Get operation. In LevelDB, each Get operation acquires a global lock to take a snapshot of internal database structures. As shown in Figure 12a, LibASL improves the throughput of the TAS lock by 50% when having a similar latency (LibASL-15), which goes to 2.5x (1.6x over MCS, 1.8x over pthread_mutex_lock and 1.3x over ShflLock) when setting a larger SLO. Since we only test the Get operation, most requests have a latency longer than half the SLO, as shown in Figure 12c.

Finally, we evaluate LibASL on SQLite, which is a relational database and has been used in the Azure IoT edge server [4]. We place 1/3 Insert, 1/3 simple query (a point query on an indexed column) and 1/3 complex query (a range query on an indexed column with a filter on a non-indexed column) in a DEFERRED transaction enclosed in an epoch. Moreover, we add an extremely long full-table scan over a 100k-row table every 1000 executions of the same epoch, to show that LibASL can survive occasionally appearing, extremely long requests. SQLite is configured to use the Multi-thread threading mode.
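The three-line integration described in Section 4.2 can be sketched as follows. The stub definitions below stand in for LibASL's real header (whose name the paper does not give); the epoch_start/epoch_end signatures follow the paper's Figure 2:

```c
#include <pthread.h>

/* Stubs standing in for LibASL itself, so the sketch is self-contained.
 * In a real build these come from LibASL's header, and linking against
 * the library redirects pthread_mutex_lock to asl_mutex_lock. */
#define REQ_EPOCH_ID 0
static void epoch_start(int id) { (void)id; }
static void epoch_end(int id, long latency_slo_us) { (void)id; (void)latency_slo_us; }

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static long handled = 0;

/* The only application changes are the two epoch_* lines (plus the
 * header include); 70us here is the P99 latency target for the epoch. */
void handle_request(long req)
{
    epoch_start(REQ_EPOCH_ID);
    pthread_mutex_lock(&db_lock);   /* transparently becomes asl_mutex_lock */
    handled += req;                 /* latency-critical work */
    pthread_mutex_unlock(&db_lock);
    epoch_end(REQ_EPOCH_ID, 70);
}
```

Everything else, i.e., redirecting pthread_mutex_lock to asl_mutex_lock, happens at link time via weak-symbol replacement, with no further source changes.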

Different from previous benchmarks, SQLite uses locks to protect its internal state machine, and a transaction can commit successfully only in a certain state. Thus the epoch's latency fluctuates greatly, which results in a non-linear growth of the little cores' latency, as shown in Figure 12f. It also amplifies the difference in transaction success rate between big and little cores when using ShflLock, and causes the latency collapse on little cores. Moreover, both the simple and the complex Select operations have a much shorter execution time than the Insert operation; thus, 2/3 of the requests have extremely short tail latency (the latency grows significantly after y > 2/3 in Figure 12f). However, LibASL is still able to precisely keep the tail latency under the SLO and improve the throughput, as shown in Figure 12e. An occasionally appearing, extremely long epoch only causes a window shrink, and LibASL quickly adjusts the window back to a suitable size in the following executions; hence it does not influence the performance. The TAS lock shows little-core-affinity here (big cores have longer latency), so it has both lower throughput and longer tail latency than the MCS lock. LibASL brings up to 2.1x throughput speedup over the TAS lock (55% over MCS, 2.1x over pthread_mutex_lock and 35% over ShflLock) without violating the SLO.

Figure 11. Databases. Legends are explained in Figure 8. (a) Kyoto Cabinet (1e6 means 10⁶; the chosen SLOs are only for easier comparison, other settings are detailed in Figure 11b). (b) Variant SLOs. (c) CDF (SLO: 70us). (d) upscaledb. (e) Variant SLOs. (f) CDF (SLO: 140us). (g) LMDB (5e5 means 5×10⁵). (h) Variant SLOs. (i) CDF (SLO: 1900us).

Figure 12. Database benchmarks. Legends are explained in Figure 8. (a) LevelDB (6e5 means 6×10⁵). (b) Variant SLOs. (c) CDF (SLO: 100us). (d) SQLite (LibASL-X means the SLO is set to X ms). (e) Variant SLOs. (f) CDF (SLO: 4ms).

PARSEC. We also evaluate LibASL on non-latency-critical applications, for which LibASL uses the maximum reorder window. We present the results of three benchmarks from PARSEC [23], which represent the three different behavior types observed in PARSEC. We also include the results when using only 4 big cores (Spin-4Big) or 2 big cores (Spin-2Big) to show the scalability of each application.

Dedup uses pipeline parallelism to compress data and is known to be lock-sensitive [39]. We configure the benchmark to use a single task queue to enable work balancing across the asymmetric cores. As shown in Figure 13a, LibASL improves the throughput of the MCS lock by 2x. However, LibASL and the TAS lock have similar throughput, which is close to using only big cores (Spin-4Big); this is because little cores can barely acquire the lock when using the TAS lock (big-core-affinity). Nevertheless, the TAS lock may encounter throughput collapse when it shows little-core-affinity in other scenarios (e.g., the linked-list benchmark in Figure 10c), while LibASL does not. Dedup represents lock-sensitive applications, which may encounter this scalability issue on AMP systems. LibASL provides the best and most stable performance for such applications without code modifications, and avoids the throughput collapse seen with the MCS and TAS locks.

Raytrace uses a per-thread task queue and enables work-stealing when the local task queue is empty. As shown in Figure 13b, all the locks behave the same, since locks are not heavily contended. Raytrace scales well on AMP. Little cores improve the overall throughput by 36% (from

Spin-4Big to TAS). Raytrace represents a class of applications that scale well on AMP, including the map-reduce applications in Phoenix [60]. Those applications' performance is not bounded by locks (given the limited number of cores in M1), and they either use work-stealing or assign tasks at runtime to achieve a good work balance among asymmetric cores.

In contrast, Ocean_cp fails to scale on AMP. It contains multiple phases, and each core is assigned an even share of work. Ocean_cp scales well when 2 big cores are added to the system (from Spin-2Big to Spin-4Big) but fails to go further when little cores are included, because the big cores have to wait for the slower little cores at the end of each phase. Besides Ocean_cp, streamcluster, lu_cb and barnes in PARSEC use a similar approach and therefore fail to scale on AMP. Those applications make a false assumption (symmetric computing capacity) about the underlying cores. LibASL cannot take effect here, since this is an orthogonal problem: those applications have to be modified, either by adding work balancing or by assigning tasks with the asymmetry in mind, to scale on AMP.

Figure 13. PARSEC benchmark. (a) Dedup. (b) Raytrace. (c) Ocean_cp. Execution time (s) of TAS, MCS, LibASL, Pthread, Spin-4Big and Spin-2Big.

4.3 Evaluation Highlights

Results confirm the effectiveness of LibASL on AMP. First, LibASL can precisely maintain the latency SLO even in highly variable workloads. Second, LibASL shows promising performance advantages over existing locks. Compared to fair locks (lowest latency but low throughput), LibASL can significantly improve the throughput (e.g., 3.8x over the MCS lock in upscaledb). Compared to unfair locks (highest latency but sometimes high throughput), LibASL has much lower tail latency when achieving similar throughput (e.g., 90% lower than the TAS lock in Kyoto Cabinet), and substantially higher throughput when ensuring similar tail latency (e.g., 46% higher than the TAS lock in LMDB); moreover, LibASL can further outperform the TAS lock by 2.5x when setting a larger SLO in LevelDB. Compared to the proportional-based approach, LibASL better meets applications' needs by improving the throughput as much as possible under the given latency SLO of an application, and it achieves better throughput when having a similar latency.

5 Related Work

Scalable locking has been extensively studied over decades. However, previous work [26, 27, 31-34, 37, 38, 41, 45, 50, 51, 53-55, 61] mainly targets the scalability problem in SMP or NUMA systems rather than AMP. There are already locks that reorder competitors to achieve better throughput [27, 31-34, 45, 51, 54] on NUMA or many-core processors. NUMA-aware locks [27, 32-34, 51, 54] reorder competitors from the local NUMA node before those from other nodes to reduce cross-node remote memory references, which do not exist in AMP. The Malthusian lock [31] reorders active competitors with passive blocking ones to reduce contention on many-core processors. However, as discussed in Section 2.2, these locks rely on long-term fairness to keep a relatively low latency, which brings throughput collapse on AMP. Besides, long-term fairness only forbids starvation, leaving the latency unpredictable. LibASL provides a new SLO-guided lock ordering for AMP and achieves higher throughput.

ShflLock [45] provides a lock reordering framework: it relies on a provided static policy to shuffle the waiting queue internally. However, the scalability issue on AMP cannot be easily solved by a static policy: preserving fairness brings throughput collapse, while reordering without limit causes latency collapse. It is also hard to find a suitable static proportion for prioritizing big cores, as discussed in Section 2.3. Thus, in LibASL, the reorderable lock exposes its reorder capability, so that LibASL can adjust the extent of reordering on the fly and achieve the best throughput the SLO allows.

Delegation locks [37, 38, 41, 50, 53, 55, 61] reduce data movement by executing all critical sections on one core (the lock server), which significantly improves the throughput in NUMA. Although placing the lock server on a big core can hide the weak computing capacity of little cores (similar to [43, 44, 56], where big cores are dedicated to acceleration), it requires the big core to busy-poll, which wastes a big core and violates the energy target when contention is low; LibASL instead achieves good performance at variant contention levels. A more severe obstacle to using delegation locks is that they require non-trivial code modifications to convert all critical sections into closures, which brings enormous engineering work due to the complexity of real-world applications. Instead, LibASL only requires linking and, if latency-critical, inserting a few lines of code to specify the latency SLO.

Computing capacity can also be asymmetric in SMP when using DVFS. Previous work [21, 25, 57] boosts the frequency of the lock holder to gain better throughput. However, unlike with DVFS, the asymmetry in AMP is inherent; thus, those techniques cannot be applied to AMP either.

Improving the system's throughput for better cost and energy efficiency without violating the latency SLO is a widely adopted technique [58, 59, 62]. WorkloadCompactor [62] and Cake [58] reduce a datacenter's cost by consolidating more load onto fewer servers without violating the latency SLO. Elfen Scheduling [59] leverages SMT to run latency-critical and other requests simultaneously, improving utilization without compromising the SLO. LibASL takes a similar approach to solve the lock scalability problem on AMP.

6 Conclusion

In this paper, we propose an asymmetry-aware scalable lock named LibASL. It provides a new SLO-guided lock ordering to maximize big cores' chance to lock for better throughput while carefully maintaining little cores' latencies. Evaluations on real-world applications show that LibASL brings better performance than its counterparts on AMP.

References

[1] [n.d.]. Akamai: IoT Edge Connect. https://www.akamai.com/cn/zh/products/performance/iot-edge-connect.jsp.
[2] [n.d.]. Apple M1 Chip. https://www.apple.com/mac/m1/.
[3] [n.d.]. ARM DynamIQ Shared Unit Technical Reference Manual. https://developer.arm.com/documentation/100453/0002/functional-description/introduction/about-the-dsu.
[4] [n.d.]. Azure IoT Edge SQLite Module. https://github.com/Azure/iot-edge-sqlite.
[5] [n.d.]. The best IoT Databases for the Edge – an overview and compact guide. https://objectbox.io/the-best-iot-databases-for-the-edge-an-overview-and-compact-guide/.
[6] [n.d.]. CFS wakeup path and Arm big.LITTLE/DynamIQ. https://lwn.net/Articles/793379/.
[7] [n.d.]. CORELLIUM: How We Port Linux to M1. https://corellium.com/blog/linux-m1.
[8] [n.d.]. Energy Aware Scheduling. https://www.kernel.org/doc/html/latest/scheduler/sched-energy.html.
[9] [n.d.]. Google Cloud: Defining SLOs. https://cloud.google.com/solutions/defining-SLOs.
[10] [n.d.]. Intel Alder Lake: Performance Hybrid with Golden Cove and Gracemont for 2021, Intel Architecture Day 2020. https://newsroom.intel.com/press-kits/architecture-day-2020/.
[11] [n.d.]. Kyoto Cabinet: a straightforward implementation of DBM. https://dbmx.net/kyotocabinet/.
[12] [n.d.]. LevelDB. https://github.com/google/leveldb.
[13] [n.d.]. LMDB Technical Information. https://symas.com/lmdb/technical/.
[14] [n.d.]. A Look at Intel Lakefield: A 3D-Stacked Single-ISA Heterogeneous Penta-Core SoC. https://fuse.wikichip.org/news/3417/a-look-at-intel-lakefield-a-3d-stacked-single-isa-heterogeneous-penta-core-soc/.
[15] [n.d.]. New Intel Core Processors with Intel Hybrid Technology. https://www.intel.com/content/www/us/en/products/docs/processors/core/core-processors-with-hybrid-technology-brief.html.
[16] [n.d.]. Processing Architecture for Power Efficiency and Performance. https://www.arm.com/why-arm/technologies/big-little.
[17] [n.d.]. SQLite. https://www.sqlite.org/index.html.
[18] [n.d.]. Sysbench: Scriptable database and system performance benchmark. https://github.com/akopytov/sysbench.
[19] [n.d.]. upscaledb: embedded database technology. https://upscaledb.com/.
[20] [n.d.]. YCSB Core Workloads. https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads.
[21] S. Akram, J. B. Sartor, and L. Eeckhout. 2016. DVFS Performance Prediction for Managed Multithreaded Applications. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 12–23. https://doi.org/10.1109/ISPASS.2016.7482070
[22] Mark Allman, Vern Paxson, Wright Stevens, et al. 1999. TCP Congestion Control. (1999).
[23] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (Toronto, Ontario, Canada) (PACT '08). Association for Computing Machinery, New York, NY, USA, 72–81. https://doi.org/10.1145/1454115.1454128
[24] Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2012. Non-scalable Locks Are Dangerous. In Proceedings of the Linux Symposium. 119–130.
[25] Juan M. Cebrian, Daniel Sánchez, Juan L. Aragón, and Stefanos Kaxiras. 2013. Efficient Inter-core Power and Thermal Balancing for Multicore Processors. Computing 95, 7 (2013), 537–566. https://doi.org/10.1007/s00607-012-0236-6
[26] Milind Chabbi, Michael Fagan, and John Mellor-Crummey. 2015. High Performance Locks for Multi-Level NUMA Systems. SIGPLAN Not. 50, 8 (Jan. 2015), 215–226. https://doi.org/10.1145/2858788.2688503
[27] Milind Chabbi and John Mellor-Crummey. 2016. Contention-Conscious, Locality-Preserving Locks. SIGPLAN Not. 51, 8, Article 22 (Feb. 2016), 14 pages. https://doi.org/10.1145/3016078.2851166
[28] Howard Chu. 2011. MDB: A Memory-Mapped Database and Backend for OpenLDAP. In Proceedings of the 3rd International Conference on LDAP, Heidelberg, Germany. 35.
[29] Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale. Commun. ACM 56 (2013), 74–80. http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
[30] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's Highly Available Key-Value Store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA) (SOSP '07). Association for Computing Machinery, New York, NY, USA.
[40] Mingzhe Hao et al. 2017. MittOS: Supporting Millisecond Tail Tolerance with Fast Rejecting SLO-Aware OS Interface. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China) (SOSP '17). Association for Computing Machinery, New York, NY, USA, 168–183. https://doi.org/10.1145/3132747.3132774
[41] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. 2010. Flat Combining and the Synchronization-Parallelism Tradeoff. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures (Thira, Santorini, Greece) (SPAA '10). Association for Computing Machinery, New York, NY, USA, 355–364. https://doi.org/10.1145/1810479.1810540
[42] Brian Jeff. 2013. big.LITTLE Technology Moves Towards Fully Heterogeneous Global Task Scheduling. ARM White Paper (2013).
[43] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. 2012. Bottleneck Identification and Scheduling in Multithreaded Applications. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3-7, 2012, Tim Harris and Michael L.
Scott York, NY, USA, 205–220. https://doi.org/10.1145/1294261.1294281 (Eds.). ACM, 223–234. https://doi.org/10.1145/2150976.2151001 [31] Dave Dice. 2017. Malthusian Locks. In Proceedings of the Twelfth Euro- [44] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. 2013. pean Conference on Computer Systems (Belgrade, Serbia) (EuroSys ’17). Utility-based acceleration of multithreaded applications on asymmet- Association for Computing Machinery, New York, NY, USA, 314–327. ric CMPs. In The 40th Annual International Symposium on Computer https://doi.org/10.1145/3064176.3064203 Architecture, ISCA’13, Tel-Aviv, Israel, June 23-27, 2013, Avi Mendelson [32] Dave Dice and Alex Kogan. 2019. Compact NUMA-Aware Locks. (Ed.). ACM, 154–165. https://doi.org/10.1145/2485922.2485936 In Proceedings of the Fourteenth EuroSys Conference 2019 (Dresden, [45] Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Germany) (EuroSys ’19). Association for Computing Machinery, New Taesoo Kim. 2019. Scalable and Practical Locking with Shuffling. In York, NY, USA, Article 12, 15 pages. https://doi.org/10.1145/3302424. Proceedings of the 27th ACM Symposium on Operating Systems Principles 3303984 (Huntsville, Ontario, Canada) (SOSP ’19). Association for Computing [33] Dave Dice, Virendra J. Marathe, and Nir Shavit. 2011. Flat-Combining Machinery, 586–599. https://doi.org/10.1145/3341301.3359629 NUMA Locks. In Proceedings of the Twenty-Third Annual ACM Sym- [46] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable posium on Parallelism in Algorithms and Architectures (San Jose, Cali- NUMA-aware Blocking Synchronization Primitives. In 2017 USENIX fornia, USA) (SPAA ’11). Association for Computing Machinery, New Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, York, NY, USA, 65–74. https://doi.org/10.1145/1989493.1989502 July 12-14, 2017, Dilma Da Silva and Bryan Ford (Eds.). USENIX Asso- [34] David Dice, Virendra J. Marathe, and Nir Shavit. 2012. 
Lock Cohorting: ciation, 603–615. https://www.usenix.org/conference/atc17/technical- A General Technique for Designing NUMA Locks. SIGPLAN Not. 47, 8 sessions/presentation/kashyap (Feb. 2012), 247–256. https://doi.org/10.1145/2370036.2145848 [47] Rakesh Kumar, Keith I Farkas, Norman P Jouppi, Parthasarathy Ran- [35] Babak Falsafi, Rachid Guerraoui, Javier Picorel, and Vasileios Trig- ganathan, and Dean M Tullsen. 2003. Single-ISA heterogeneous multi- onakis. 2016. Unlocking Energy. In 2016 USENIX Annual Technical core architectures: The potential for processor power reduction. In Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016, Ajay Proceedings. 36th Annual IEEE/ACM International Symposium on Mi- Gulati and Hakim Weatherspoon (Eds.). USENIX Association, 393– croarchitecture, 2003. MICRO-36. IEEE, 81–92. 406. https://www.usenix.org/conference/atc16/technical-sessions/ [48] Rakesh Kumar, Dean M Tullsen, Parthasarathy Ranganathan, Nor- presentation/falsafi man P Jouppi, and Keith I Farkas. 2004. Single-ISA heterogeneous [36] Songchun Fan and Benjamin C. Lee. 2016. Evaluating asymmetric multi-core architectures for multithreaded workload performance. In multiprocessing for mobile applications. In 2016 IEEE International Proceedings. 31st Annual International Symposium on Computer Archi- Symposium on Performance Analysis of Systems and Software, ISPASS tecture, 2004. IEEE, 64–75. 2016, Uppsala, Sweden, April 17-19, 2016. IEEE Computer Society, 235– [49] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, 244. https://doi.org/10.1109/ISPASS.2016.7482098 and Christos Kozyrakis. 2014. Towards energy proportionality for [37] Panagiota Fatourou and Nikolaos D. Kallimanis. 2011. A Highly- large-scale latency-critical workloads. In ACM/IEEE 41st International Efficient Wait-Free Universal Construction. 
In Proceedings of the Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, Twenty-Third Annual ACM Symposium on Parallelism in Algorithms USA, June 14-18, 2014. IEEE Computer Society, 301–312. https: and Architectures (San Jose, California, USA) (SPAA ’11). Association //doi.org/10.1109/ISCA.2014.6853237 for Computing Machinery, New York, NY, USA, 325–334. https: [50] Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia L. Lawall, and //doi.org/10.1145/1989493.1989549 Gilles Muller. 2012. Remote Core Locking: Migrating Critical-Section [38] Panagiota Fatourou and Nikolaos D. Kallimanis. 2012. Revisiting the Execution to Improve the Performance of Multithreaded Applications. Combining Synchronization Technique. SIGPLAN Not. 47, 8 (Feb. 2012), In 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 257–266. https://doi.org/10.1145/2370036.2145849 13-15, 2012, Gernot Heiser and Wilson C. Hsieh (Eds.). USENIX Asso- [39] Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, ciation, 65–76. https://www.usenix.org/conference/atc12/technical- and Vasileios Trigonakis. 2019. Lock–Unlock: Is That All? A Pragmatic sessions/presentation/lozi Analysis of Locking in Software Systems. ACM Trans. Comput. Syst. 36, [51] Victor Luchangco, Daniel Nussbaum, and Nir Shavit. 2006. A Hier- 1, Article 1 (March 2019), 149 pages. https://doi.org/10.1145/3301501 archical CLH Queue Lock. In Euro-Par 2006, Parallel Processing, 12th [40] Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O. International Euro-Par Conference, Dresden, Germany, August 28 - Sep- Suminto, Cesar A. Stuardo, Andrew A. Chien, and Haryadi S. Gunawi. tember 1, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4128), Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner (Eds.). Springer, 801–810. https://doi.org/10.1007/11823285_84

14 [52] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for Annual Technical Conference (USENIX ATC 14). USENIX Association, Scalable Synchronization on Shared-Memory Multiprocessors. ACM Philadelphia, PA, 193–204. https://www.usenix.org/conference/atc14/ Trans. Comput. Syst. 9, 1 (Feb. 1991), 21–65. https://doi.org/10.1145/ technical-sessions/presentation/wamhoff 103727.103729 [58] Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy H. [53] Yoshihiro Oyama, Kenjiro Taura, and Akinori Yonezawa. 1999. Exe- Katz, and Ion Stoica. 2012. Cake: enabling high-level SLOs on shared cuting parallel programs with synchronization bottlenecks efficiently. storage systems. In ACM Symposium on Cloud Computing, SOCC ’12, In Proceedings of the International Workshop on Parallel and Distributed San Jose, CA, USA, October 14-17, 2012, Michael J. Carey and Steven Computing for Symbolic and Irregular Applications, Vol. 16. Citeseer, Hand (Eds.). ACM, 14. https://doi.org/10.1145/2391229.2391243 95. [59] Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2016. Elfen [54] Zoran Radovic and Erik Hagersten. 2003. Hierarchical Backoff Locks Scheduling: Fine-Grain Principled Borrowing from Latency-Critical for Nonuniform Communication Architectures. In Proceedings of the Workloads Using Simultaneous Multithreading. In 2016 USENIX An- Ninth International Symposium on High-Performance Computer Ar- nual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June chitecture (HPCA’03), Anaheim, California, USA, February 8-12, 2003. 22-24, 2016, Ajay Gulati and Hakim Weatherspoon (Eds.). USENIX Asso- IEEE Computer Society, 241–252. https://doi.org/10.1109/HPCA.2003. ciation, 309–322. https://www.usenix.org/conference/atc16/technical- 1183542 sessions/presentation/yang [55] Sepideh Roghanchi, Jakob Eriksson, and Nilanjana Basu. 2017. Ffwd: [60] Richard M. Yoo, Anthony Romano, and Christos Kozyrakis. 2009. Delegation is (Much) Faster than You Think. 
In Proceedings of the Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory 26th Symposium on Operating Systems Principles (Shanghai, China) system. In Proceedings of the 2009 IEEE International Symposium on (SOSP ’17). Association for Computing Machinery, New York, NY, USA, Workload Characterization, IISWC 2009, October 4-6, 2009, Austin, TX, 342–358. https://doi.org/10.1145/3132747.3132771 USA. IEEE Computer Society, 198–207. https://doi.org/10.1109/IISWC. [56] M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. 2009.5306783 Patt. 2009. Accelerating critical section execution with asymmet- [61] Mingzhe Zhang, Haibo Chen, Luwei Cheng, Francis C. M. Lau, and ric multi-core architectures. In Proceedings of the 14th International Cho-Li Wang. 2017. Scalable Adaptive NUMA-Aware Lock. IEEE Trans. Conference on Architectural Support for Programming Languages and Parallel Distributed Syst. 28, 6 (2017), 1754–1769. https://doi.org/10. Operating Systems, ASPLOS 2009, Washington, DC, USA, March 7-11, 1109/TPDS.2016.2630695 2009, Mary Lou Soffa and Mary Jane Irwin (Eds.). ACM, 253–264. [62] Timothy Zhu, Michael A. Kozuch, and Mor Harchol-Balter. 2017. Work- https://doi.org/10.1145/1508244.1508274 loadCompactor: reducing datacenter cost while providing tail latency [57] Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer, Patrick SLO guarantees. In Proceedings of the 2017 Symposium on Cloud Com- Marlier, Pascal Felber, and Dave Dice. 2014. The TURBO Diaries: puting, SoCC 2017, Santa Clara, CA, USA, September 24-27, 2017. ACM, Application-controlled Frequency Scaling Explained. In 2014 USENIX 598–610. https://doi.org/10.1145/3127479.3132245

15