
Accelerating Critical Section Execution with Asymmetric Multicore Architectures

Contention for critical sections can reduce performance and scalability by causing serialization. The proposed accelerated critical sections mechanism reduces this limitation. ACS executes critical sections on the high-performance core of an asymmetric chip multiprocessor (ACMP), which can execute them faster than the smaller cores can.

M. Aater Suleman, University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Moinuddin K. Qureshi, IBM Research
Yale N. Patt, University of Texas at Austin

Extracting high performance from chip multiprocessors (CMPs) requires partitioning the application into threads that execute concurrently on multiple cores. Because threads cannot be allowed to update shared data concurrently, accesses to shared data are encapsulated inside critical sections. Only one thread executes a critical section at a given time; other threads wanting to execute the same critical section must wait. Critical sections can therefore serialize threads, reducing performance and scalability (that is, the number of threads at which performance saturates). Shortening the execution time inside critical sections can reduce this performance loss.

This article proposes the accelerated critical sections mechanism. ACS is based on the asymmetric chip multiprocessor, which consists of at least one large, high-performance core and many small, power-efficient cores (see the "Related work" sidebar for other work in this area). The ACMP was originally proposed to run Amdahl's serial bottleneck (where only a single thread exists) more quickly on the large core and the parallel program regions on the multiple small cores.1-4 In addition to Amdahl's bottleneck, ACS runs selected critical sections on the large core, which runs them faster than the smaller cores.

ACS dedicates the large core exclusively to running critical sections (and Amdahl's serial bottleneck). In conventional systems, when a core encounters a critical section, it acquires the lock for the critical section, executes the critical section, and releases the lock. In ACS, when a small core encounters a critical section, it sends a request to the large core for execution of that critical section and stalls. The large core executes the critical section and notifies the small core when it has completed the critical section. The small core then resumes execution. By accelerating critical section execution, ACS reduces serialization, lowering the likelihood of threads waiting for a critical section to finish.



Related work

The work most closely related to accelerated critical sections (ACS) is the set of proposals that optimize the implementation of lock acquire and release operations and the locality of shared data in critical sections using operating system and compiler techniques. We are not aware of any work that speeds up the execution of critical sections using more aggressive execution engines.

Improving locality of shared data and locks
Sridharan et al.1 propose a thread-scheduling algorithm to increase shared data locality. When a thread encounters a critical section, the operating system migrates the thread to the processor with the shared data. Although ACS also improves shared data locality, it does not require the operating system intervention and the thread migration overhead incurred by their scheme. Moreover, ACS accelerates critical section execution, a benefit unavailable in Sridharan et al.'s work.1 Trancoso and Torrellas2 and Ranganathan et al.3 improve locality in critical sections using software prefetching. We can combine these techniques with ACS to improve performance. Primitives (such as Test&Test&Set and Compare&Swap) implement lock acquire and release operations efficiently but do not increase the speed of critical section processing or the locality of shared data.4

Hiding the latency of critical sections
Several schemes hide critical section latency.5-7 We compared ACS with TLR, a mechanism that hides critical section latency by overlapping multiple instances of the same critical section as long as they access disjoint data. ACS largely outperforms TLR because, unlike TLR, ACS accelerates critical sections whether or not they have data conflicts.

Asymmetric CMPs
Several researchers have shown the potential to improve the performance of an application's serial part using an asymmetric chip multiprocessor (ACMP).8-11 We use the ACMP to accelerate critical sections as well as the serial part in multithreaded workloads. Ipek et al. state that multiple small cores can be combined (that is, fused) to form a powerful core to speed up the serial program portions.12 We can combine our technique with their scheme such that the powerful core accelerates critical sections. Kumar et al. use heterogeneous cores to reduce power and increase throughput for multiprogrammed workloads, not multithreaded programs.13

Remote procedure calls
The idea of executing critical sections remotely on a different processor resembles the RPC14 mechanism used in network programming. RPC is used to execute client subroutines on remote server computers. ACS differs from RPC in that ACS runs critical sections remotely on the same chip within the same address space, not on a remote computer.

References
1. S. Sridharan et al., "Thread Migration to Improve Synchronization Performance," Proc. Workshop on Operating System Interference in High Performance Applications (OSIHPA), 2006; http://www.kinementium.com/~rcm/doc/osihpa06.pdf.
2. P. Trancoso and J. Torrellas, "The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding," Proc. Int'l Conf. Parallel Processing (ICPP 96), IEEE CS Press, 1996, pp. 79-86.
3. P. Ranganathan et al., "The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 97), ACM Press, 1997, pp. 144-156.
4. D. Culler et al., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.
5. J.F. Martínez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 18-29.
6. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. Int'l Symp. Microarchitecture (MICRO 01), IEEE CS Press, 2001, pp. 294-305.
7. R. Rajwar and J.R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 5-17.
8. M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's Law through EPI Throttling," SIGARCH Computer Architecture News, vol. 33, no. 2, 2005, pp. 298-309.
9. M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
10. T.Y. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE Computer Architecture Letters, vol. 5, no. 1, Jan. 2006, pp. 14-17.
11. M.A. Suleman et al., ACMP: Balancing Hardware Efficiency and Programmer Efficiency, HPS Technical Report, 2007.
12. E. Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 186-197.
13. R. Kumar et al., "Heterogeneous Chip Multiprocessors," Computer, vol. 38, no. 11, 2005, pp. 32-38.
14. A.D. Birrell and B.J. Nelson, "Implementing Remote Procedure Calls," ACM Trans. Computer Systems, vol. 2, no. 1, 1984, pp. 39-59.





Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34 percent compared to an equal-area 32-core symmetric CMP and by 23 percent compared to an equal-area ACMP. Moreover, for seven of the 12 workloads, ACS also increases scalability.

The problem
A multithreaded application consists of two parts. The serial part is the classical Amdahl's bottleneck. The parallel part is where multiple threads execute concurrently. Threads operate on different portions of the same problem and communicate via shared memory. The mutual exclusion principle dictates that multiple threads may not update shared data concurrently. Thus, accesses to shared data are encapsulated in regions of code guarded by synchronization primitives (such as locks) called critical sections. Only one thread can execute a particular critical section at any given time. Critical sections differ from Amdahl's serial bottleneck. During a critical section's execution, other threads that do not need to execute that critical section can make progress. In contrast, no other thread exists in Amdahl's serial bottleneck.

We use a simple example to show critical sections' performance impact. Figure 1a shows the code for a multithreaded kernel in which each thread dequeues a work quantum from the priority queue (PQ) and attempts to process it. If the thread cannot solve the problem, it divides the problem into subproblems and inserts them into the priority queue. This is a common technique for parallelizing many branch-and-bound algorithms.6 In our benchmarks, we use this kernel to solve the 15-puzzle problem (see http://en.wikipedia.org/wiki/Fifteen_puzzle).

The kernel consists of three parts. The initial part A and the final part E are the serial parts of the program. They comprise Amdahl's serial bottleneck because only one thread exists in those sections. Part B is the parallel part, executed by multiple threads. It consists of code that is both inside the critical section (C1 and C2, both protected by lock X) and outside the critical section (D1 and D2). Only one thread can execute the critical section at a given time, which can cause serialization of the parallel part and reduce overall performance.

Figure 1b shows the execution time of the kernel in Figure 1a on a 4-core CMP. After the serial part A, four threads (T1, T2, T3, and T4) are spawned, one on each core. When part B is complete, the serial part E is executed on a single core. We analyze the serialization caused by the critical section in the steady state of part B. Between time t0 and t1, all threads execute in parallel. At time t1, T2 starts executing the critical section while T1, T3, and T4 continue to execute the code independent of the critical section. At time t2, T2 finishes the critical section, which allows three threads (T1, T3, and T4) to contend for the critical section; T3 wins and enters the critical section. Between time t2 and t3, T3 executes the critical section while T1 and T4 remain idle, waiting for T3 to finish. Between time t3 and t4, T4 executes the critical section while T1 continues to wait. T1 finally gets to execute the critical section between time t4 and t5.

This example shows that the time taken to execute a critical section significantly affects not only the thread executing it but also the threads waiting to enter the critical section. For example, between t2 and t3, two threads (T1 and T4) are waiting for T3 to exit the critical section, without performing any useful work. Therefore, accelerating the critical section's execution not only improves T3's performance, but also reduces the wasteful waiting time of T1 and T4. Figure 1c shows the execution of the same kernel, assuming that critical sections take half as long to execute. Halving the time taken to execute critical sections reduces thread serialization, which significantly reduces the time spent in the parallel portion. Thus, accelerating critical sections can significantly improve performance.

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on March 18,2010 at 11:10:47 EDT from IEEE Xplore. Restrictions apply. [3B2-9] mmi2010010060.3d 10/2/010 17:56 Page 63

    InitPriorityQueue(PQ);                       // A: Amdahl's serial part
    SpawnThreads();
    ForEach Thread:                              // B: parallel portion
        while (problem not solved)
            Lock(X)
            SubProblem = PQ.remove();            // C1: critical section
            Unlock(X)
            Solve(SubProblem);                   // D1: outside critical section
            If(problem solved) break;
            NewSubProblems = Partition(SubProblem);
            Lock(X)
            PQ.insert(NewSubProblems);           // C2: critical section
            Unlock(X)
            ...                                  // D2: outside critical section
    PrintSolution();                             // E: Amdahl's serial part

(a)

[Figures 1b and 1c (timelines): each thread's progress from tbegin to tend through the code outside the critical section (D1, D2), the critical sections (C1, C2), and idle waiting; in (c) the critical sections take half as long as in (b), shrinking the idle time.]

Figure 1. Amdahl's serial part, parallel part, and the critical section in a multithreaded 15-puzzle kernel: code example (a) and execution timelines on the baseline chip multiprocessor (CMP) (b) and accelerated critical sections (ACS) (c).

On average, the critical section in Figure 1a executes 1.5K instructions. During an insert, the critical section accesses multiple nodes of the priority queue (implemented as a heap) to find a suitable place for insertion. Because of its lengthy execution, this critical section incurs high contention. When the workload is executed with eight threads, on average four threads wait for this critical section at a given time. The average number of waiting threads increases to 16 when the workload is executed with 32 threads. In contrast, when we accelerate this critical section using ACS, the average number of waiting threads reduces to two and three for eight- and 32-threaded execution, respectively.

The task of shortening critical sections has traditionally been left to programmers. However, tuning code to minimize critical section execution requires substantial programmer time and effort.




A mechanism that can shorten critical sections' execution time transparently in hardware, without requiring programmer support, can significantly impact both the software and the hardware industries.

ACS design
The ACS mechanism is implemented on a homogeneous-ISA, heterogeneous-core CMP that provides hardware support for cache coherence. ACS is based on the ACMP architecture,3,4,7 which was proposed to handle Amdahl's serial bottleneck. ACS consists of one high-performance core and several small cores. The serial part of the program and the critical sections execute on the high-performance core, whereas the remaining parallel parts execute on the small cores.

Figure 2 shows an example ACS architecture implemented on an ACMP consisting of one large core (P0) and 12 small cores (P1 to P12). ACS executes the parallel threads on the small cores and dedicates P0 to the execution of critical sections (as well as serial program portions). A critical section request buffer (CSRB) in P0 buffers the critical section execution requests from the small cores. ACS introduces two new instructions, critical section execution call (CSCALL) and critical section return (CSRET), which are inserted at the beginning and end of a critical section, respectively.

Figure 2. An example ACS architecture implemented on an asymmetric chip multiprocessor (ACMP). ACS consists of one high-performance core (P0) and several smaller cores.

Figure 3 contrasts ACS with a conventional system. In conventional locking (Figure 3a), when a small core encounters a critical section, it acquires the lock protecting the critical section, executes the critical section, and releases the lock. In ACS (Figure 3b), when a small core executes a CSCALL, it sends the CSCALL to P0 and stalls until it receives a response. When P0 receives a CSCALL, it buffers it in the CSRB. P0 starts executing the requested critical section at the first opportunity and continues normal processing until it encounters a CSRET instruction, which signifies the end of the critical section. When P0 executes the CSRET instruction, it sends a critical section done (CSDONE) signal to the requesting small core. Upon receiving this signal, the small core resumes normal execution.

Our ASPLOS paper describes the ACS architecture in detail, focusing on hardware/ISA/compiler/library support, interconnect extensions, operating system support, ACS's handling of nested critical sections and interrupts/exceptions, and its accommodation of multiple large cores and multiple parallel applications.8


(a) Conventional locking, on the small core:

    A = compute();
    LOCK X
    result = CS(A);
    UNLOCK X
    print result

(b) ACS, on the small core:

    A = compute();
    PUSH A
    CSCALL X, TPC        // CSCALL request: sends X, TPC, STACK_PTR, CORE_ID
    POP result
    print result

(b) ACS, on the large core:

    TPC: POP A
         result = CS(A)
         PUSH result
         CSRET X         // CSDONE response to the requesting core

Figure 3. Source code and its execution: baseline (a) and ACS (b).
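The CSCALL/CSDONE handshake in Figure 3 has a natural software analogue, along the lines of the runtime-library-only variant mentioned under "Alternate ACS implementations" near the end of this article. The following is our illustrative sketch, not the article's hardware mechanism: a dedicated server thread stands in for the large core, a mutex-protected list stands in for the CSRB, and the names (cscall, large_core_server, CsRequest) are hypothetical.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One buffered request: which critical section to run, and on what data. */
    typedef struct CsRequest {
        void (*cs_func)(void *);    /* analogue of the target PC (TPC) */
        void *arg;                  /* analogue of stack-communicated values */
        bool done;                  /* set by the server: the CSDONE signal */
        struct CsRequest *next;
    } CsRequest;

    /* CSRB analogue: a list of pending requests drained by the server. */
    static CsRequest *csrb_head = NULL;
    static pthread_mutex_t csrb_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t csrb_nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t csrb_done = PTHREAD_COND_INITIALIZER;

    /* CSCALL analogue: ship the critical section, then stall until CSDONE. */
    void cscall(void (*cs_func)(void *), void *arg) {
        CsRequest req = { cs_func, arg, false, NULL };
        pthread_mutex_lock(&csrb_lock);
        req.next = csrb_head;        /* LIFO for brevity; order is a policy choice */
        csrb_head = &req;
        pthread_cond_signal(&csrb_nonempty);
        while (!req.done)            /* the "small core" stalls here */
            pthread_cond_wait(&csrb_done, &csrb_lock);
        pthread_mutex_unlock(&csrb_lock);
    }

    /* The dedicated thread standing in for P0: runs every critical section. */
    void *large_core_server(void *unused) {
        (void)unused;
        pthread_mutex_lock(&csrb_lock);
        for (;;) {
            while (csrb_head == NULL)
                pthread_cond_wait(&csrb_nonempty, &csrb_lock);
            CsRequest *req = csrb_head;
            csrb_head = req->next;
            pthread_mutex_unlock(&csrb_lock);
            req->cs_func(req->arg);             /* execute the critical section */
            pthread_mutex_lock(&csrb_lock);
            req->done = true;                   /* CSDONE */
            pthread_cond_broadcast(&csrb_done);
        }
        return NULL;
    }

Because one server thread runs every shipped critical section, the shared data those sections touch stays in a single cache, mirroring the locality benefit discussed below; for the same reason, the sketch also exhibits the false serialization described next.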

False serialization
ACS, as we have described it thus far, executes all critical sections on a single dedicated core. Consequently, ACS can serialize the execution of independent critical sections that could have executed concurrently in a conventional system. We call this effect false serialization. To reduce false serialization, ACS can dedicate multiple execution contexts for critical section acceleration. We can achieve this by making the large core a simultaneous multithreading (SMT) engine or by providing multiple large cores on the chip. With multiple contexts, different contexts operate on independent critical sections concurrently, thereby reducing false serialization.

Some workloads continue to experience false serialization even when more than one context is available for critical section execution. For such workloads, we propose selective acceleration of critical sections (SEL). SEL tracks false serialization experienced by each critical section and disables the accelerated execution of critical sections experiencing high false serialization.

To estimate false serialization, we augment the CSRB with a table of saturating counters (one per critical section) for tracking false serialization. We quantify false serialization by counting the number of critical sections in the CSRB for which the LOCK_ADDR differs from the incoming request's LOCK_ADDR. If this count is greater than 1 (that is, if the CSRB contains at least two independent critical sections), the estimation logic adds the count to the saturating counter corresponding to the incoming request's LOCK_ADDR. If the count is 1 (that is, if the CSRB contains exactly one critical section), the estimation logic decrements the corresponding saturating counter. If the counter reaches its maximum value, the estimation logic notifies all small cores that ACS has been disabled for the particular critical section. Future CSCALLs to that critical section are executed locally at the small core. (Our SEL implementation hashes lock addresses into 16 sets and uses 6-bit counters. The total storage overhead of SEL is 36 bytes: 16 counters of 6 bits each and 16 ACS_DISABLE bits for each of the 12 small cores. As our ASPLOS paper shows, SEL completely recovers the performance loss due to false serialization, and ACS with SEL outperforms ACS without SEL by 15 percent.8 A minimal software sketch of this estimation logic appears at the end of this section.)

Trade-off analysis: Why ACS works
ACS makes three performance trade-offs compared to conventional systems.

Faster critical sections versus fewer threads
First, ACS has fewer threads than conventional systems because it dedicates a large core, which could otherwise be used for more threads, to accelerating critical sections. ACS improves performance when the benefit of accelerating critical sections is greater than the loss due to the unavailability of more threads. ACS is more likely to improve performance when the number of cores on the chip increases, for two reasons. The marginal loss in parallel throughput due to the large core is smaller (for example, if the large core replaces four small cores, it eliminates 50 percent of the smaller cores in an 8-core system, but only 12.5 percent of cores in a 32-core system). In addition, more cores (threads) increase critical section contention, thereby increasing the benefit of faster critical section execution.

CSCALL/CSDONE signals versus lock acquire/release
Second, ACS requires the communication of CSCALL and CSDONE transactions between a small core and a large core, an overhead not present in conventional systems. However, ACS can compensate for the overhead of CSCALL and CSDONE by keeping the lock at the large core, thereby reducing the cache-to-cache transfers incurred by conventional systems during lock acquire operations.9 ACS actually has an advantage in that it can overlap the latency of CSCALL and CSDONE with the execution of another instance of the same critical section. A conventional locking mechanism can acquire a lock only after the preceding instance of the critical section has completed, which always adds a delay before critical section execution.

Cache misses due to private versus shared data
Finally, ACS transfers private data that is referenced in the critical section from the small core's cache to the large core's cache. Conventional locking does not incur this overhead. However, conventional systems incur overhead in transferring shared data. The shared data "ping-pongs" between caches as different threads execute the critical section. ACS eliminates transfers of shared data by keeping the data at the large core, which can offset the misses it causes to transfer private data into the large core.

By keeping shared data in the large core's cache, ACS provides less cache space for shared data than conventional locking (where shared data can reside in any on-chip cache). This can increase cache misses. However, we find that such cache misses are rare and do not degrade performance because the large core's private cache is usually large enough. In fact, ACS decreases cache misses if the critical section accesses more shared data than private data. ACS can improve performance even if there are equal or more accesses to private data than shared data because the large core can still improve the performance of other instructions and hide the latency of some cache misses using latency-tolerance techniques such as out-of-order execution.

In summary, ACS's performance benefits (faster critical section execution, improved lock locality, and improved shared data locality) are likely to outweigh its overhead (reduced parallel throughput, CSCALL and CSDONE overhead, and reduced private data locality). Our extensive experimental results on a wide variety of systems and quantitative analysis of ACS's performance trade-offs support this, as we demonstrate elsewhere.8
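Before moving to the results, here is the sketch promised above: a compact software rendering of the SEL estimation logic (16 sets, 6-bit saturating counters). It is our rendering of the counting rules just described, not the hardware itself; the hash function and all names are hypothetical.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SEL_SETS 16   /* lock addresses hash into 16 sets */
    #define SEL_MAX  63   /* 6-bit saturating counter */

    typedef struct {
        bool valid;
        uint64_t lock_addr;   /* LOCK_ADDR of a buffered CSCALL */
    } CsrbEntry;

    static uint8_t sel_counter[SEL_SETS];   /* one saturating counter per set */
    static bool acs_disabled[SEL_SETS];     /* mirrored as ACS_DISABLE bits */

    static unsigned sel_hash(uint64_t lock_addr) {
        return (unsigned)((lock_addr >> 3) % SEL_SETS);   /* hypothetical hash */
    }

    /* Run when a CSCALL for 'lock_addr' arrives at a CSRB of n entries. */
    void sel_update(const CsrbEntry *csrb, size_t n, uint64_t lock_addr) {
        unsigned set = sel_hash(lock_addr);
        unsigned differing = 0;
        for (size_t i = 0; i < n; i++)        /* count buffered requests whose */
            if (csrb[i].valid && csrb[i].lock_addr != lock_addr)
                differing++;                  /* LOCK_ADDR differs from this one */

        if (differing > 1) {                  /* two or more independent critical
                                                 sections: likely false serialization */
            unsigned v = sel_counter[set] + differing;
            sel_counter[set] = (v > SEL_MAX) ? SEL_MAX : (uint8_t)v;
        } else if (differing == 1 && sel_counter[set] > 0) {
            sel_counter[set]--;               /* little evidence: decay the counter */
        }
        if (sel_counter[set] == SEL_MAX)      /* saturated: stop accelerating */
            acs_disabled[set] = true;
    }

In the real design, the ACS_DISABLE bits would be broadcast to the small cores so that future CSCALLs to a disabled critical section execute locally.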




Results
We evaluate three CMP architectures: a symmetric CMP (SCMP) consisting of all small cores; an asymmetric CMP (ACMP) with one large core with two-way SMT and remaining small cores; and an ACMP augmented with support for ACS. All comparisons are at an equal-area budget. We use 12 critical-section-intensive parallel applications, including sorting, MySQL databases, data mining, IP packet routing, Web caching, and SPECjbb. Details of our methodology are available elsewhere.8

Figure 4. Execution time of symmetric CMP (SCMP) and ACS, normalized to ACMP, when the number of threads is equal to the optimal number of threads for each application (a) and when the number of threads is equal to the number of thread contexts (b).

Performance at optimal number of threads
We evaluate ACS when the number of threads is equal to the optimal number of threads for each application-configuration pair. Figure 4a shows the execution time of SCMP and ACS normalized to ACMP at area budgets of 8, 16, and 32. ACS improves performance compared to ACMP at all area budgets. At an area budget of 8, SCMP outperforms ACS because ACS cannot compensate for the loss in throughput due to fewer parallel threads. However, this loss diminishes as the area budget increases. At area budget 32, ACS improves performance by 34 percent compared to SCMP, and 23 percent compared to ACMP.

Performance when the number of threads equals the number of contexts
When an estimate of the optimal number of threads is unavailable, most systems spawn as many threads as there are available thread contexts. Having more threads increases critical section contention, thereby further increasing ACS's performance benefit. Figure 4b shows the execution time of SCMP and ACS normalized to ACMP when the number of threads is equal to the number of contexts for area budgets of 8, 16, and 32. ACS improves performance compared to ACMP for all area budgets. ACS improves performance over SCMP for all area budgets except 8, where accelerating critical sections does not make up for the loss in throughput. For an area budget of 32, ACS outperforms both SCMP and ACMP by 46 and 36 percent, respectively.


Application scalability
We evaluate ACS's impact on application scalability (the number of threads at which performance saturates). Figure 5 shows the speedup curves of ACMP, SCMP, and ACS over one small core as the area budget varies from 1 to 32. The curves for ACS and ACMP start at 4 because they require at least one large core, which is area-equivalent to four small cores. Table 1 summarizes Figure 5 by showing the number of threads required to minimize execution time for each application using ACMP, SCMP, and ACS. For seven of the 12 applications, ACS improves scalability. For the remaining applications, scalability is not affected. We conclude that if thread contexts are available on the chip, ACS uses them more effectively than either ACMP or SCMP.

Table 1. Best number of threads for each configuration.

Workload    Symmetric CMP    Asymmetric CMP    Accelerated critical sections
ep          4                4                 4
is          8                8                 12
pagemine    8                8                 12
puzzle      8                8                 32
qsort       16               16                32
sqlite      8                8                 32
tsp         32               32                32
iplookup    24               24                24
oltp-1      16               16                32
oltp-2      16               16                24
specjbb     32               32                32
webcache    32               32                32

ACS on symmetric CMP
Part of ACS's performance benefit is due to the improved locality of shared data and locks. This benefit can be realized even in the absence of a large core. We can implement a variant of ACS on a symmetric CMP, which we call symmACS. In symmACS, one of the small cores is dedicated to executing critical sections. On average, symmACS reduces execution time by only 4 percent, which is much lower than ACS's 34 percent performance benefit. We conclude that most of ACS's performance improvement comes from using the large core.

ACS versus techniques to hide critical section latency
Several proposals try to hide a critical section's latency, for example, transactional memory,10 speculative lock elision (SLE),11 transactional lock removal (TLR),9 and speculative synchronization.12 These proposals execute critical sections concurrently with other instances of the same critical section as long as no data conflicts occur. In contrast, ACS accelerates all critical sections regardless of their length and the data they are accessing. Furthermore, unlike transactional memory, ACS does not require code modification; it incurs a significantly lower hardware overhead; and it reduces the cache misses for shared data, a feature unavailable in all previous schemes.

We compare the performance of ACS and TLR for critical-section-intensive applications. We implemented TLR as described elsewhere,9 adding a 128-entry buffer to each small core to handle speculative memory updates. Figure 6 shows the execution time of an ACMP augmented with TLR and the execution time of ACS normalized to ACMP (the area budget is 32 and the number of threads is set to the optimal number for each system). ACS outperforms TLR on all benchmarks. TLR reduces average execution time by 6 percent whereas ACS reduces it by 23 percent. We conclude that ACS is a compelling, higher-performance alternative to state-of-the-art mechanisms that aim to transparently reduce critical section overhead.

Contributions and impact
Several concepts derived from our work on ACS can impact future research and development in parallel programming and CMP architectures.

Handling critical sections
Current systems treat critical sections as a part of the normal thread code and execute them in place, that is, on the same core where they are encountered. Instead of executing critical sections on the same core, ACS executes them on a remote high-performance core that executes them faster than the other, smaller cores. This idea of accelerating code remotely can be extended to other program portions.




[Figure 5 (12 panels, one per workload): speedup over a single small core for SCMP, ACMP, and ACS as the area budget, in small cores, grows from 8 to 32.]

Figure 5. Speedup of ACMP, SCMP, and ACS over a single small core: ep (a), is (b), pagemine (c), puzzle (d), qsort (e), sqlite (f), tsp (g), iplookup (h), oltp-1 (i), oltp-2 (j), specjbb (k), and webcache (l).



Accelerating bottlenecks
Current proposals, such as transactional memory, try to parallelize critical sections. ACS accelerates them. This concept is applicable to other critical paths in multithreaded programs. For example, we could improve the performance of pipeline-parallel (that is, streaming) workloads, which is dominated by the execution speed of the slowest pipeline stage, by accelerating the slowest stage using the large core.

[Figure 6 (bar chart): execution time of ACS and of ACMP with TLR, normalized to ACMP, for each benchmark and the geometric mean (gmean).]

Figure 6. ACS vs. transactional lock removal (TLR) performance.

Trade-off between shared and private data
Many believe that executing a part of the code at a remote core will increase the number of cache misses. ACS shows that this assumption is not always true. Remotely executing code can actually reduce the number of cache misses if more shared data than private data is accessed (as in the case of most critical sections). Compiler and runtime schedulers can use this trade-off to decide whether to run a code segment locally or remotely.

Impact on programmer productivity
Correctly inserting and shortening critical sections are among the most challenging tasks in parallel programming. By executing critical sections faster, ACS reduces the programmers' burden. Programmers can write code with longer critical sections (which is easy to get correct) and rely on hardware for critical section acceleration.

Impact on parallel algorithms
ACS could enable the use of more efficient parallel algorithms that are traditionally passed over in favor of algorithms with shorter critical sections. For example, ACS could make practical the use of memoization tables (data structures that cache computed values to avoid redundant computation), which are avoided in parallel programs because they introduce long critical sections with poor shared-data locality.

Impact on CMP architectures
Our previous work proposed ACMP with a single large core that runs the serial Amdahl's bottleneck.4 Thus, without ACS, ACMP is inapplicable to parallel programs with nonexistent (or small) serial portions (such as server applications). ACS provides a mechanism to leverage one or more large cores in the parallel program portions, which makes the ACMP applicable to programs with or without serial portions. This makes ACMP practical for multiple market segments.

Alternate ACS implementations
We proposed a combined hardware/software implementation of ACS. Future research can develop other implementations of ACS for systems with different trade-offs and requirements. For example, we can implement ACS solely in software as a part of the runtime library. The software-only ACS requires no hardware support and can be used in today's systems. However, it has a higher overhead for sending CSCALLs to the remote core. Other implementations of ACS can be developed for SMP and multisocket systems.

The ACS mechanism provides a high-performance and simple-to-implement substrate that allows more productive and easier development of parallel programs, a key problem facing multicore systems and computer architecture today. As such, our proposal allows the software and hardware industries to make an easier transition to many-core engines. We believe that ACS will not only impact future CMP designs but also make parallel programming more accessible to the average programmer. MICRO




References
1. M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's Law through EPI Throttling," SIGARCH Computer Architecture News, vol. 33, no. 2, 2005, pp. 298-309.
2. M. Hill and M. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
3. T.Y. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE Computer Architecture Letters, vol. 5, no. 1, Jan. 2006, pp. 14-17.
4. M.A. Suleman et al., ACMP: Balancing Hardware Efficiency and Programmer Efficiency, HPS Technical Report, 2007.
5. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Am. Federation of Information Processing Societies Conf. Proc. (AFIPS), Thompson Books, 1967, pp. 483-485.
6. E.L. Lawler and D.E. Wood, "Branch-and-Bound Methods: A Survey," Operations Research, vol. 14, no. 4, 1966, pp. 699-719.
7. E. Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 186-197.
8. M.A. Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multicore Architectures," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 09), ACM Press, 2009, pp. 253-264.
9. R. Rajwar and J.R. Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 5-17.
10. M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 93), ACM Press, 1993, pp. 289-300.
11. R. Rajwar and J.R. Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution," Proc. Int'l Symp. Microarchitecture (MICRO 01), IEEE CS Press, 2001, pp. 294-305.
12. J.F. Martínez and J. Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 18-29.

M. Aater Suleman is a PhD candidate in electrical and computer engineering at the University of Texas at Austin. His research interests include chip multiprocessor architectures and parallel programming. Suleman has a master's degree in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE, ACM, and HKN.

Onur Mutlu is an assistant professor of electrical and computer engineering at Carnegie Mellon University. His research interests include computer architecture and systems. Mutlu has a PhD in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE and ACM.

Moinuddin Qureshi is a research staff member at IBM Research. His research interests include computer architecture, scalable memory system design, resilient computing, and analytical modeling of computer systems. Qureshi has a PhD in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE.

Yale N. Patt is the Ernest Cockrell, Jr., Centennial Chair in Engineering at the University of Texas at Austin. His research interests include harnessing the expected fruits of future process technology into more effective microarchitectures for future microprocessors. Patt has a PhD in electrical engineering from Stanford University. He is a Fellow of both the IEEE and ACM.

Direct questions and comments to Aater Suleman, 1 University Station, C0803, Austin, TX 78712; [email protected].

