
Accelerating Critical Section Execution with Asymmetric Multicore Architectures

Contention for critical sections can reduce performance and scalability by causing serialization. The proposed accelerated critical sections mechanism reduces this limitation. ACS executes critical sections on the high-performance core of an asymmetric chip multiprocessor (ACMP), which can execute them faster than the smaller cores can.

M. Aater Suleman, University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Moinuddin K. Qureshi, IBM Research
Yale N. Patt, University of Texas at Austin

Extracting high performance from chip multiprocessors (CMPs) requires partitioning the application into threads that execute concurrently on multiple cores. Because threads cannot be allowed to update shared data concurrently, accesses to shared data are encapsulated inside critical sections. Only one thread executes a critical section at a given time; other threads wanting to execute the same critical section must wait. Critical sections can therefore serialize threads, reducing performance and scalability (that is, the number of threads at which performance saturates). Shortening the execution time inside critical sections can reduce this performance loss.

This article proposes the accelerated critical sections mechanism. ACS is based on the asymmetric chip multiprocessor, which consists of at least one large, high-performance core and many small, power-efficient cores (see the "Related work" sidebar for other work in this area). The ACMP was originally proposed to run Amdahl's serial bottleneck (where only a single thread exists) more quickly on the large core and the parallel program regions on the multiple small cores.1-4 In addition to Amdahl's bottleneck, ACS runs selected critical sections on the large core, which runs them faster than the smaller cores.

ACS dedicates the large core exclusively to running critical sections (and Amdahl's serial bottleneck). In conventional systems, when a core encounters a critical section, it acquires the lock for the critical section, executes the critical section, and releases the lock. In ACS, when a small core encounters a critical section, it sends a request to the large core for execution of that critical section and stalls. The large core executes the critical section and notifies the small core when it has completed the critical section. The small core then resumes execution. By accelerating critical section execution, ACS reduces serialization, lowering the likelihood of threads waiting for a critical section to finish.



Related work

The work most closely related to accelerated critical sections (ACS) is the set of proposals that optimize the implementation of lock acquire and release operations and the locality of shared data in critical sections using operating system and compiler techniques. We are not aware of any work that speeds up the execution of critical sections using more aggressive execution engines.

Improving locality of shared data and locks
Sridharan et al.1 propose a thread-scheduling algorithm to increase shared data locality. When a thread encounters a critical section, the operating system migrates the thread to the processor with the shared data. Although ACS also improves shared data locality, it does not require the operating system intervention and the thread migration overhead incurred by their scheme. Moreover, ACS accelerates critical section execution, a benefit unavailable in Sridharan et al.'s work.1 Trancoso and Torrellas2 and Ranganathan et al.3 improve locality in critical sections using software prefetching. We can combine these techniques with ACS to improve performance. Primitives (such as Test&Test&Set and Compare&Swap) implement lock acquire and release operations efficiently but do not increase the speed of critical section processing or the locality of shared data.4

Hiding the latency of critical sections
Several schemes hide critical section latency.5-7 We compared ACS with TLR, a mechanism that hides critical section latency by overlapping multiple instances of the same critical section as long as they access disjoint data. ACS largely outperforms TLR because, unlike TLR, ACS accelerates critical sections whether or not they have data conflicts.

Asymmetric CMPs
Several researchers have shown the potential to improve the performance of an application's serial part using an asymmetric chip multiprocessor (ACMP).8-11 We use the ACMP to accelerate critical sections as well as the serial part in multithreaded workloads. Ipek et al. state that multiple small cores can be combined (that is, fused) to form a powerful core to speed up the serial program portions.12 We can combine our technique with their scheme such that the powerful core accelerates critical sections. Kumar et al. use heterogeneous cores to reduce power and increase throughput for multiprogrammed workloads, not multithreaded programs.13

Remote procedure calls
The idea of executing critical sections remotely on a different processor resembles the RPC14 mechanism used in network programming. RPC is used to execute client subroutines on remote server computers. ACS differs from RPC in that ACS runs critical sections remotely on the same chip within the same address space, not on a remote computer.

References
1. S. Sridharan et al., "Thread Migration to Improve Synchronization Performance," Proc. Workshop on Operating System Interference in High Performance Applications (OSIHPA), 2006; http://www.kinementium.com/~rcm/doc/osihpa06.pdf.
2. P. Trancoso and J. Torrellas, "The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding," Proc. Int'l Conf. Parallel Processing (ICPP 96), IEEE CS Press, 1996, pp. 79-86.
3. P. Ranganathan et al., "The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 97), ACM Press, 1997, pp. 144-156.
4. D. Culler et al., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.
5. J.F. Martínez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 18-29.
6. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. Int'l Symp. Microarchitecture (MICRO 01), IEEE CS Press, 2001, pp. 294-305.
7. R. Rajwar and J.R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 5-17.
8. M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's Law through EPI Throttling," SIGARCH Computer Architecture News, vol. 33, no. 2, 2005, pp. 298-309.
9. M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
10. T.Y. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE Computer Architecture Letters, vol. 5, no. 1, Jan. 2006, pp. 14-17.
11. M.A. Suleman et al., ACMP: Balancing Hardware Efficiency and Programmer Efficiency, HPS Technical Report, 2007.
12. E. Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 186-197.
13. R. Kumar et al., "Heterogeneous Chip Multiprocessors," Computer, vol. 38, no. 11, 2005, pp. 32-38.
14. A.D. Birrell and B.J. Nelson, "Implementing Remote Procedure Calls," ACM Trans. Computer Systems, vol. 2, no. 1, 1984, pp. 39-59.





Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34 percent compared to an equal-area 32-core symmetric CMP and by 23 percent compared to an equal-area ACMP. Moreover, for seven of the 12 workloads, ACS also increases scalability.

The problem
A multithreaded application consists of two parts. The serial part is the classical Amdahl's bottleneck. The parallel part is where multiple threads execute concurrently. Threads operate on different portions of the same problem and communicate via shared memory. The mutual exclusion principle dictates that multiple threads may not update shared data concurrently. Thus, accesses to shared data are encapsulated in regions of code guarded by synchronization primitives (such as locks) called critical sections. Only one thread can execute a particular critical section at any given time. Critical sections differ from Amdahl's serial bottleneck. During a critical section's execution, other threads that do not need to execute that critical section can make progress. In contrast, no other thread exists in Amdahl's serial bottleneck.

We use a simple example to show critical sections' performance impact. Figure 1a shows the code for a multithreaded kernel in which each thread dequeues a work quantum from the priority queue (PQ) and attempts to process it. If the thread cannot solve the problem, it divides the problem into subproblems and inserts them into the priority queue. This is a common technique for parallelizing many branch-and-bound algorithms.6 In our benchmarks, we use this kernel to solve the 15-puzzle problem (see http://en.wikipedia.org/wiki/Fifteen_puzzle).

The kernel consists of three parts. The initial part A and the final part E are the serial parts of the program. They comprise Amdahl's serial bottleneck because only one thread exists in those sections. Part B is the parallel part, executed by multiple threads. It consists of code that is both inside the critical section (C1 and C2, both protected by lock X) and outside the critical section (D1 and D2). Only one thread can execute the critical section at a given time, which can cause serialization of the parallel part and reduce overall performance.

Figure 1b shows the execution time of the kernel in Figure 1a on a 4-core CMP. After the serial part A, four threads (T1, T2, T3, and T4) are spawned, one on each core. When part B is complete, the serial part E is executed on a single core. We analyze the serialization caused by the critical section in the steady state of part B. Between time t0 and t1, all threads execute in parallel. At time t1, T2 starts executing the critical section while T1, T3, and T4 continue to execute the code independent of the critical section. At time t2, T2 finishes the critical section, which allows three threads (T1, T3, and T4) to contend for the critical section; T3 wins and enters the critical section. Between time t2 and t3, T3 executes the critical section while T1 and T4 remain idle, waiting for T3 to finish. Between time t3 and t4, T4 executes the critical section while T1 continues to wait. T1 finally gets to execute the critical section between time t4 and t5.

This example shows that the time taken to execute a critical section significantly affects not only the thread executing it but also the threads waiting to enter the critical section. For example, between t2 and t3, two threads (T1 and T4) are waiting for T3 to exit the critical section, without performing any useful work. Therefore, accelerating the critical section's execution not only improves T3's performance, but also reduces the wasteful waiting time of T1 and T4. Figure 1c shows the execution of the same kernel, assuming that critical sections take half as long to execute. Halving the time taken to execute critical sections reduces thread serialization, which significantly reduces the time spent in the parallel portion. Thus, accelerating critical sections can significantly improve performance.

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on March 18,2010 at 11:10:47 EDT from IEEE Xplore. Restrictions apply. [3B2-9] mmi2010010060.3d 10/2/010 17:56 Page 63

    InitPriorityQueue(PQ);                       // A: Amdahl's serial part
    SpawnThreads();
    ForEach Thread:                              // B: parallel portion
        while (problem not solved)
            Lock(X)
            SubProblem = PQ.remove();            // C1: critical section
            Unlock(X)
            Solve(SubProblem);                   // D1: outside critical section
            If(problem solved) break;
            NewSubProblems = Partition(SubProblem);
            Lock(X)
            PQ.insert(NewSubProblems);           // C2: critical section
            Unlock(X)
            ...                                  // D2: outside critical section
    PrintSolution();                             // E: Amdahl's serial part

(a)

[Figures 1b and 1c (timelines): each thread's progress from tbegin to tend through the code outside the critical section (D1, D2), the critical sections (C1, C2), and idle waiting; in (c) the critical sections take half as long as in (b), shrinking the idle time.]

Figure 1. Amdahl's serial part, parallel part, and the critical section in a multithreaded 15-puzzle kernel: code example (a) and execution timelines on the baseline chip multiprocessor (CMP) (b) and accelerated critical sections (ACS) (c).

On average, the critical section in Figure 1a executes 1.5K instructions. During an insert, the critical section accesses multiple nodes of the priority queue (implemented as a heap) to find a suitable place for insertion. Because of its lengthy execution, this critical section incurs high contention. When the workload is executed with eight threads, on average four threads wait for this critical section at a given time. The average number of waiting threads increases to 16 when the workload is executed with 32 threads. In contrast, when we accelerate this critical section using ACS, the average number of waiting threads reduces to two and three for eight- and 32-threaded execution, respectively.

The task of shortening critical sections has traditionally been left to programmers. However, tuning code to minimize critical section execution requires substantial programmer time and effort.




A mechanism that can shorten critical sections' execution time transparently in hardware, without requiring programmer support, can significantly impact both the software and the hardware industries.

ACS design
The ACS mechanism is implemented on a homogeneous-ISA, heterogeneous-core CMP that provides hardware support for cache coherence. ACS is based on the ACMP architecture,3,4,7 which was proposed to handle Amdahl's serial bottleneck. ACS consists of one high-performance core and several small cores. The serial part of the program and the critical sections execute on the high-performance core, whereas the remaining parallel parts execute on the small cores.

Figure 2 shows an example ACS architecture implemented on an ACMP consisting of one large core (P0) and 12 small cores (P1 to P12). ACS executes the parallel threads on the small cores and dedicates P0 to the execution of critical sections (as well as serial program portions). A critical section request buffer (CSRB) in P0 buffers the critical section execution requests from the small cores. ACS introduces two new instructions, critical section execution call (CSCALL) and critical section return (CSRET), which are inserted at the beginning and end of a critical section, respectively.

Figure 2. An example ACS architecture implemented on an asymmetric chip multiprocessor (ACMP). ACS consists of one high-performance core (P0) and several smaller cores.

Figure 3 contrasts ACS with a conventional system. In conventional locking (Figure 3a), when a small core encounters a critical section, it acquires the lock protecting the critical section, executes the critical section, and releases the lock. In ACS (Figure 3b), when a small core executes a CSCALL, it sends the CSCALL to P0 and stalls until it receives a response. When P0 receives a CSCALL, it buffers it in the CSRB. P0 starts executing the requested critical section at the first opportunity and continues normal processing until it encounters a CSRET instruction, which signifies the end of the critical section. When P0 executes the CSRET instruction, it sends a critical section done (CSDONE) signal to the requesting small core. Upon receiving this signal, the small core resumes normal execution.

Our ASPLOS paper describes the ACS architecture in detail, focusing on hardware/ISA/compiler/library support, interconnect extensions, operating system support, ACS's handling of nested critical sections and interrupts/exceptions, and its accommodation of multiple large cores and multiple parallel applications.8


(a) Conventional locking, on the small core:

    A = compute();
    LOCK X
    result = CS(A);
    UNLOCK X
    print result

(b) ACS, on the small core:

    A = compute();
    PUSH A
    CSCALL X, TPC        // CSCALL request: sends X, TPC, STACK_PTR, CORE_ID
    POP result
    print result

(b) ACS, on the large core:

    TPC: POP A
         result = CS(A)
         PUSH result
         CSRET X         // CSDONE response to the requesting core

Figure 3. Source code and its execution: baseline (a) and ACS (b).
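The CSCALL/CSDONE handshake in Figure 3 has a natural software analogue, along the lines of the runtime-library-only variant mentioned under "Alternate ACS implementations" near the end of this article. The following is our illustrative sketch, not the article's hardware mechanism: a dedicated server thread stands in for the large core, a mutex-protected list stands in for the CSRB, and the names (cscall, large_core_server, CsRequest) are hypothetical.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One buffered request: which critical section to run, and on what data. */
    typedef struct CsRequest {
        void (*cs_func)(void *);    /* analogue of the target PC (TPC) */
        void *arg;                  /* analogue of stack-communicated values */
        bool done;                  /* set by the server: the CSDONE signal */
        struct CsRequest *next;
    } CsRequest;

    /* CSRB analogue: a list of pending requests drained by the server. */
    static CsRequest *csrb_head = NULL;
    static pthread_mutex_t csrb_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t csrb_nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t csrb_done = PTHREAD_COND_INITIALIZER;

    /* CSCALL analogue: ship the critical section, then stall until CSDONE. */
    void cscall(void (*cs_func)(void *), void *arg) {
        CsRequest req = { cs_func, arg, false, NULL };
        pthread_mutex_lock(&csrb_lock);
        req.next = csrb_head;        /* LIFO for brevity; order is a policy choice */
        csrb_head = &req;
        pthread_cond_signal(&csrb_nonempty);
        while (!req.done)            /* the "small core" stalls here */
            pthread_cond_wait(&csrb_done, &csrb_lock);
        pthread_mutex_unlock(&csrb_lock);
    }

    /* The dedicated thread standing in for P0: runs every critical section. */
    void *large_core_server(void *unused) {
        (void)unused;
        pthread_mutex_lock(&csrb_lock);
        for (;;) {
            while (csrb_head == NULL)
                pthread_cond_wait(&csrb_nonempty, &csrb_lock);
            CsRequest *req = csrb_head;
            csrb_head = req->next;
            pthread_mutex_unlock(&csrb_lock);
            req->cs_func(req->arg);             /* execute the critical section */
            pthread_mutex_lock(&csrb_lock);
            req->done = true;                   /* CSDONE */
            pthread_cond_broadcast(&csrb_done);
        }
        return NULL;
    }

Because one server thread runs every shipped critical section, the shared data those sections touch stays in a single cache, mirroring the locality benefit discussed below; for the same reason, the sketch also exhibits the false serialization described next.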

False serialization
ACS, as we have described it thus far, executes all critical sections on a single dedicated core. Consequently, ACS can serialize the execution of independent critical sections that could have executed concurrently in a conventional system. We call this effect false serialization. To reduce false serialization, ACS can dedicate multiple execution contexts for critical section acceleration. We can achieve this by making the large core a simultaneous multithreading (SMT) engine or by providing multiple large cores on the chip. With multiple contexts, different contexts operate on independent critical sections concurrently, thereby reducing false serialization.

Some workloads continue to experience false serialization even when more than one context is available for critical section execution. For such workloads, we propose selective acceleration of critical sections (SEL). SEL tracks false serialization experienced by each critical section and disables the accelerated execution of critical sections experiencing high false serialization.

To estimate false serialization, we augment the CSRB with a table of saturating counters (one per critical section) for tracking false serialization. We quantify false serialization by counting the number of critical sections in the CSRB for which the LOCK_ADDR differs from the incoming request's LOCK_ADDR. If this count is greater than 1 (that is, if the CSRB contains at least two independent critical sections), the estimation logic adds the count to the saturating counter corresponding to the incoming request's LOCK_ADDR. If the count is 1 (that is, if the CSRB contains exactly one critical section), the estimation logic decrements the corresponding saturating counter. If the counter reaches its maximum value, the estimation logic notifies all small cores that ACS has been disabled for the particular critical section. Future CSCALLs to that critical section are executed locally at the small core. (Our SEL implementation hashes lock addresses into 16 sets and uses 6-bit counters. The total storage overhead of SEL is 36 bytes: 16 counters of 6 bits each and 16 ACS_DISABLE bits for each of the 12 small cores. As our ASPLOS paper shows, SEL completely recovers the performance loss due to false serialization, and ACS with SEL outperforms ACS without SEL by 15 percent.8 A minimal software sketch of this estimation logic appears at the end of this section.)

Trade-off analysis: Why ACS works
ACS makes three performance trade-offs compared to conventional systems.

Faster critical sections versus fewer threads
First, ACS has fewer threads than conventional systems because it dedicates a large core, which could otherwise be used for more threads, to accelerating critical sections. ACS improves performance when the benefit of accelerating critical sections is greater than the loss due to the unavailability of more threads. ACS is more likely to improve performance when the number of cores on the chip increases, for two reasons. The marginal loss in parallel throughput due to the large core is smaller (for example, if the large core replaces four small cores, it eliminates 50 percent of the smaller cores in an 8-core system, but only 12.5 percent of cores in a 32-core system). In addition, more cores (threads) increase critical section contention, thereby increasing the benefit of faster critical section execution.

CSCALL/CSDONE signals versus lock acquire/release
Second, ACS requires the communication of CSCALL and CSDONE transactions between a small core and a large core, an overhead not present in conventional systems. However, ACS can compensate for the overhead of CSCALL and CSDONE by keeping the lock at the large core, thereby reducing the cache-to-cache transfers incurred by conventional systems during lock acquire operations.9 ACS actually has an advantage in that it can overlap the latency of CSCALL and CSDONE with the execution of another instance of the same critical section. A conventional locking mechanism can acquire a lock only after the preceding instance of the critical section has completed, which always adds a delay before critical section execution.

Cache misses due to private versus shared data
Finally, ACS transfers private data that is referenced in the critical section from the small core's cache to the large core's cache. Conventional locking does not incur this overhead. However, conventional systems incur overhead in transferring shared data. The shared data "ping-pongs" between caches as different threads execute the critical section. ACS eliminates transfers of shared data by keeping the data at the large core, which can offset the misses it causes to transfer private data into the large core.

By keeping shared data in the large core's cache, ACS provides less cache space for shared data than conventional locking (where shared data can reside in any on-chip cache). This can increase cache misses. However, we find that such cache misses are rare and do not degrade performance because the large core's private cache is usually large enough. In fact, ACS decreases cache misses if the critical section accesses more shared data than private data. ACS can improve performance even if there are equal or more accesses to private data than shared data because the large core can still improve the performance of other instructions and hide the latency of some cache misses using latency-tolerance techniques such as out-of-order execution.

In summary, ACS's performance benefits (faster critical section execution, improved lock locality, and improved shared data locality) are likely to outweigh its overhead (reduced parallel throughput, CSCALL and CSDONE overhead, and reduced private data locality). Our extensive experimental results on a wide variety of systems and quantitative analysis of ACS's performance trade-offs support this, as we demonstrate elsewhere.8
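Before moving to the results, here is the sketch promised above: a compact software rendering of the SEL estimation logic (16 sets, 6-bit saturating counters). It is our rendering of the counting rules just described, not the hardware itself; the hash function and all names are hypothetical.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SEL_SETS 16   /* lock addresses hash into 16 sets */
    #define SEL_MAX  63   /* 6-bit saturating counter */

    typedef struct {
        bool valid;
        uint64_t lock_addr;   /* LOCK_ADDR of a buffered CSCALL */
    } CsrbEntry;

    static uint8_t sel_counter[SEL_SETS];   /* one saturating counter per set */
    static bool acs_disabled[SEL_SETS];     /* mirrored as ACS_DISABLE bits */

    static unsigned sel_hash(uint64_t lock_addr) {
        return (unsigned)((lock_addr >> 3) % SEL_SETS);   /* hypothetical hash */
    }

    /* Run when a CSCALL for 'lock_addr' arrives at a CSRB of n entries. */
    void sel_update(const CsrbEntry *csrb, size_t n, uint64_t lock_addr) {
        unsigned set = sel_hash(lock_addr);
        unsigned differing = 0;
        for (size_t i = 0; i < n; i++)        /* count buffered requests whose */
            if (csrb[i].valid && csrb[i].lock_addr != lock_addr)
                differing++;                  /* LOCK_ADDR differs from this one */

        if (differing > 1) {                  /* two or more independent critical
                                                 sections: likely false serialization */
            unsigned v = sel_counter[set] + differing;
            sel_counter[set] = (v > SEL_MAX) ? SEL_MAX : (uint8_t)v;
        } else if (differing == 1 && sel_counter[set] > 0) {
            sel_counter[set]--;               /* little evidence: decay the counter */
        }
        if (sel_counter[set] == SEL_MAX)      /* saturated: stop accelerating */
            acs_disabled[set] = true;
    }

In the real design, the ACS_DISABLE bits would be broadcast to the small cores so that future CSCALLs to a disabled critical section execute locally.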




Results
We evaluate three CMP architectures: a symmetric CMP (SCMP) consisting of all small cores; an asymmetric CMP (ACMP) with one large core with two-way SMT and remaining small cores; and an ACMP augmented with support for ACS. All comparisons are at an equal-area budget. We use 12 critical-section-intensive parallel applications, including sorting, MySQL databases, data mining, IP packet routing, Web caching, and SPECjbb. Details of our methodology are available elsewhere.8

Figure 4. Execution time of symmetric CMP (SCMP) and ACS, normalized to ACMP, when the number of threads is equal to the optimal number of threads for each application (a) and when the number of threads is equal to the number of thread contexts (b).

Performance at optimal number of threads
We evaluate ACS when the number of threads is equal to the optimal number of threads for each application-configuration pair. Figure 4a shows the execution time of SCMP and ACS normalized to ACMP at area budgets of 8, 16, and 32. ACS improves performance compared to ACMP at all area budgets. At an area budget of 8, SCMP outperforms ACS because ACS cannot compensate for the loss in throughput due to fewer parallel threads. However, this loss diminishes as the area budget increases. At area budget 32, ACS improves performance by 34 percent compared to SCMP, and 23 percent compared to ACMP.

Performance when the number of threads equals the number of contexts
When an estimate of the optimal number of threads is unavailable, most systems spawn as many threads as there are available thread contexts. Having more threads increases critical section contention, thereby further increasing ACS's performance benefit. Figure 4b shows the execution time of SCMP and ACS normalized to ACMP when the number of threads is equal to the number of contexts for area budgets of 8, 16, and 32. ACS improves performance compared to ACMP for all area budgets. ACS improves performance over SCMP for all area budgets except 8, where accelerating critical sections does not make up for the loss in throughput. For an area budget of 32, ACS outperforms both SCMP and ACMP by 46 and 36 percent, respectively.


Application scalability
We evaluate ACS's impact on application scalability (the number of threads at which performance saturates). Figure 5 shows the speedup curves of ACMP, SCMP, and ACS over one small core as the area budget varies from 1 to 32. The curves for ACS and ACMP start at 4 because they require at least one large core, which is area-equivalent to four small cores. Table 1 summarizes Figure 5 by showing the number of threads required to minimize execution time for each application using ACMP, SCMP, and ACS. For seven of the 12 applications, ACS improves scalability. For the remaining applications, scalability is not affected. We conclude that if thread contexts are available on the chip, ACS uses them more effectively than either ACMP or SCMP.

Table 1. Best number of threads for each configuration.

Workload    Symmetric CMP    Asymmetric CMP    Accelerated critical sections
ep          4                4                 4
is          8                8                 12
pagemine    8                8                 12
puzzle      8                8                 32
qsort       16               16                32
sqlite      8                8                 32
tsp         32               32                32
iplookup    24               24                24
oltp-1      16               16                32
oltp-2      16               16                24
specjbb     32               32                32
webcache    32               32                32

ACS on symmetric CMP
Part of ACS's performance benefit is due to the improved locality of shared data and locks. This benefit can be realized even in the absence of a large core. We can implement a variant of ACS on a symmetric CMP, which we call symmACS. In symmACS, one of the small cores is dedicated to executing critical sections. On average, symmACS reduces execution time by only 4 percent, which is much lower than ACS's 34 percent performance benefit. We conclude that most of ACS's performance improvement comes from using the large core.

ACS versus techniques to hide critical section latency
Several proposals try to hide a critical section's latency, for example, transactional memory,10 speculative lock elision (SLE),11 transactional lock removal (TLR),9 and speculative synchronization.12 These proposals execute critical sections concurrently with other instances of the same critical section as long as no data conflicts occur. In contrast, ACS accelerates all critical sections regardless of their length and the data they are accessing. Furthermore, unlike transactional memory, ACS does not require code modification; it incurs a significantly lower hardware overhead; and it reduces the cache misses for shared data, a feature unavailable in all previous schemes.

We compare the performance of ACS and TLR for critical-section-intensive applications. We implemented TLR as described elsewhere,9 adding a 128-entry buffer to each small core to handle speculative memory updates. Figure 6 shows the execution time of an ACMP augmented with TLR and the execution time of ACS normalized to ACMP (the area budget is 32 and the number of threads is set to the optimal number for each system). ACS outperforms TLR on all benchmarks. TLR reduces average execution time by 6 percent whereas ACS reduces it by 23 percent. We conclude that ACS is a compelling, higher-performance alternative to state-of-the-art mechanisms that aim to transparently reduce critical section overhead.

Contributions and impact
Several concepts derived from our work on ACS can impact future research and development in parallel programming and CMP architectures.

Handling critical sections
Current systems treat critical sections as a part of the normal thread code and execute them in place, that is, on the same core where they are encountered. Instead of executing critical sections on the same core, ACS executes them on a remote high-performance core that executes them faster than the other, smaller cores. This idea of accelerating code remotely can be extended to other program portions.




[Figure 5 (12 panels, one per workload): speedup over a single small core for SCMP, ACMP, and ACS as the area budget, in small cores, grows from 8 to 32.]

Figure 5. Speedup of ACMP, SCMP, and ACS over a single small core: ep (a), is (b), pagemine (c), puzzle (d), qsort (e), sqlite (f), tsp (g), iplookup (h), oltp-1 (i), oltp-2 (j), specjbb (k), and webcache (l).



Accelerating bottlenecks
Current proposals, such as transactional memory, try to parallelize critical sections. ACS accelerates them. This concept is applicable to other critical paths in multithreaded programs. For example, we could improve the performance of pipeline-parallel (that is, streaming) workloads, which is dominated by the execution speed of the slowest pipeline stage, by accelerating the slowest stage using the large core.

[Figure 6 (bar chart): execution time of ACS and of ACMP with TLR, normalized to ACMP, for each benchmark and the geometric mean (gmean).]

Figure 6. ACS vs. transactional lock removal (TLR) performance.

Trade-off between shared and private data
Many believe that executing a part of the code at a remote core will increase the number of cache misses. ACS shows that this assumption is not always true. Remotely executing code can actually reduce the number of cache misses if more shared data than private data is accessed (as in the case of most critical sections). Compiler and runtime schedulers can use this trade-off to decide whether to run a code segment locally or remotely.

Impact on programmer productivity
Correctly inserting and shortening critical sections are among the most challenging tasks in parallel programming. By executing critical sections faster, ACS reduces the programmers' burden. Programmers can write code with longer critical sections (which is easy to get correct) and rely on hardware for critical section acceleration.

Impact on parallel algorithms
ACS could enable the use of more efficient parallel algorithms that are traditionally passed over in favor of algorithms with shorter critical sections. For example, ACS could make practical the use of memoization tables (data structures that cache computed values to avoid redundant computation), which are avoided in parallel programs because they introduce long critical sections with poor shared-data locality.

Impact on CMP architectures
Our previous work proposed ACMP with a single large core that runs the serial Amdahl's bottleneck.4 Thus, without ACS, ACMP is inapplicable to parallel programs with nonexistent (or small) serial portions (such as server applications). ACS provides a mechanism to leverage one or more large cores in the parallel program portions, which makes the ACMP applicable to programs with or without serial portions. This makes ACMP practical for multiple market segments.

Alternate ACS implementations
We proposed a combined hardware/software implementation of ACS. Future research can develop other implementations of ACS for systems with different trade-offs and requirements. For example, we can implement ACS solely in software as a part of the runtime library. The software-only ACS requires no hardware support and can be used in today's systems. However, it has a higher overhead for sending CSCALLs to the remote core. Other implementations of ACS can be developed for SMP and multisocket systems.

The ACS mechanism provides a high-performance and simple-to-implement substrate that allows more productive and easier development of parallel programs, a key problem facing multicore systems and computer architecture today. As such, our proposal allows the software and hardware industries to make an easier transition to many-core engines. We believe that ACS will not only impact future CMP designs but also make parallel programming more accessible to the average programmer. MICRO




References
1. M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's Law through EPI Throttling," SIGARCH Computer Architecture News, vol. 33, no. 2, 2005, pp. 298-309.
2. M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
3. T.Y. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE Computer Architecture Letters, vol. 5, no. 1, Jan. 2006, pp. 14-17.
4. M.A. Suleman et al., ACMP: Balancing Hardware Efficiency and Programmer Efficiency, HPS Technical Report, 2007.
5. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Am. Federation of Information Processing Societies Conf. Proc. (AFIPS), Thompson Books, 1967, pp. 483-485.
6. E.L. Lawler and D.E. Wood, "Branch-and-Bound Methods: A Survey," Operations Research, vol. 14, no. 4, 1966, pp. 699-719.
7. E. Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 186-197.
8. M.A. Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multicore Architectures," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 09), ACM Press, 2009, pp. 253-264.
9. R. Rajwar and J.R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 5-17.
10. M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 93), ACM Press, 1993, pp. 289-300.
11. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. Int'l Symp. Microarchitecture (MICRO 01), IEEE CS Press, 2001, pp. 294-305.
12. J.F. Martínez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 18-29.

M. Aater Suleman is a PhD candidate in electrical and computer engineering at the University of Texas at Austin. His research interests include chip multiprocessor architectures and parallel programming. Suleman has a master's degree in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE, ACM, and HKN.

Onur Mutlu is an assistant professor of electrical and computer engineering at Carnegie Mellon University. His research interests include computer architecture and systems. Mutlu has a PhD in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE and ACM.

Moinuddin Qureshi is a research staff member at IBM Research. His research interests include computer architecture, scalable memory system design, resilient computing, and analytical modeling of computer systems. Qureshi has a PhD in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE.

Yale N. Patt is the Ernest Cockrell, Jr., Centennial Chair in Engineering at the University of Texas at Austin. His research interests include harnessing the expected fruits of future process technology into more effective microarchitectures for future microprocessors. Patt has a PhD in electrical engineering from Stanford University. He is a Fellow of both the IEEE and ACM.

Direct questions and comments to Aater Suleman, 1 University Station, C0803, Austin, TX 78712; [email protected].

