Dynamic Helper Threaded Prefetching on the Sun UltraSPARC® CMP

Jiwei Lu, Abhinav Das, Wei-Chung Hsu
Department of Computer Science and Engineering, University of Minnesota, Twin Cities
{jiwei,adas,hsu}@cs.umn.edu

Khoa Nguyen, Santosh G. Abraham
Scalable Systems Group, Sun Microsystems Inc.
{khoa.nguyen,santosh.abraham}@sun.com

Abstract

Data prefetching via helper threading has been extensively investigated on Simultaneous Multi-Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce the context switch overhead between the main thread and the helper thread, as well as on static profile feedback to construct the helper thread code. This paper develops a new solution by exploiting helper threaded prefetching through dynamic optimization on the latest UltraSPARC Chip-Multiprocessing (CMP) processor. Our experiments show that by utilizing the otherwise idle processor core, a single user-level helper thread is sufficient to improve the runtime performance of the main thread without triggering multiple thread slices. Moreover, since the multiple cores are physically decoupled in the CMP, contention introduced by helper threading is minimal. This paper also discusses several key technical challenges of building a light-weight dynamic optimization/software scouting system on the UltraSPARC/Solaris platform.

1. Introduction

Modern processors spend a significant fraction of overall execution time waiting for the memory system to deliver cache lines. This observation has motivated copious research on hardware and software data prefetching schemes. Execution-based prefetching is a promising approach that aims to provide high prefetching coverage and accuracy. These schemes exploit the abundant execution resources that are severely underutilized following an L2 or L3 cache miss on contemporary processors supporting Simultaneous Multi-Threading (SMT) [8] or Virtual Multi-Threading (VMT) [24]. In hardware pre-execution or scouting [3], [15], [18], [22], [23], [25], [26], [28], the processor checkpoints the architectural state and continues execution, which prefetches subsequent misses in the shadow of the initial triggering missing load. When the initial load arrives, the processor resumes execution from the checkpointed state. In software pre-execution (also referred to as helper threads or software scouting) [2], [4], [7], [10], [14], [24], [29], [35], a distilled version of the forward slice starting from the missing load is executed, minimizing the utilization of execution resources. Helper threads utilizing run-time compilation techniques may also be effectively deployed on processors that do not have the necessary hardware support for hardware scouting (such as checkpointing and resuming regular execution).

Initial research on software helper threads developed the underlying run-time compiler algorithms or evaluated them using simulation. With the advent of processors supporting SMT and VMT, helper threading has been shown to be effective on the Pentium-4 SMT processor and the Itanium-2 processor [7], [13], [24], [25], [29]. In an SMT processor such as the Pentium-4, many of the processor core resources, such as the L1 caches and issue queues, are either partitioned or shared. Helper threads need to be constructed, deployed and monitored carefully so that the negative resource contention effects do not outweigh the gains due to prefetching. The VMT method used a novel combination of performance monitoring and debugging features of the Itanium-2 to toggle between main thread and helper thread execution. However, on Itanium-2, the large overhead in toggling between these modes limits the number of cycles available for actual helper code execution to a couple of hundred cycles. Only a few missing loads can be launched in this short time interval.

Almost all general-purpose processor chips are moving to Chip Multi-Processors (CMP) [19], [20], including the Gemini, Niagara and Panther chips from Sun, the Power4 and Power5 chips from IBM [16] and recent announcements from AMD/Intel.

The IBM and Sun CMPs have a single L2 cache shared by all the cores and private L1 caches for each of the cores. Since the cores do not share any execution or L1 resources, helper thread execution on one core has minimal negative impact on main thread execution on another core. However, since the L2 cache is shared, the helper thread may prefetch data into the L2 cache on behalf of the main thread. Since such CMPs are soon going to be almost universal in the general-purpose arena, and since many single-thread applications are dominated by off-chip stall cycles, helper thread prefetching on CMPs is an attractive proposition that needs further investigation.

The minimal sharing of resources between cores gives rise to unique issues that are not present in an SMT or VMT implementation. First, how does the main thread initiate helper thread execution for a particular L2 cache miss? In an SMT system, both threads are co-located on the same core, enabling fast synchronization. Second, how does the main thread communicate register values to the helper thread? In the Itanium-2 VMT system, the register file is effectively shared between the main and helper threads.

We have devised innovative mechanisms to address these issues, implemented a complete dynamic optimization system for helper thread based prefetching, and measured actual speedups on an existing Sun UltraSPARC IV+ CMP chip [27]. In our system, the main thread is bound to one core, and the runtime performance monitoring code, the runtime optimizer and the dynamically generated helper code execute on the other core. Runtime performance monitoring selects program regions that have delinquent loads. The helper code generated for these regions is optimized to prefetch for delinquent loads. The main thread uses a mailbox in shared memory to communicate with and initiate helper thread execution. The normal caching mechanism maintains this mailbox in the L2 cache and also in the L1 cache of the helper thread's core. We address many other implementation issues in this first evaluation of helper thread prefetching on a physical Chip-Multiprocessor and measure significant performance gains on several SPEC benchmarks and a real-world application.

The remainder of this paper is organized as follows. Section 2 provides the background and related work. Section 3 discusses the helper thread model in our optimization framework, including code selection, helper thread dispatching, communication and synchronization. Section 4 introduces the dynamic optimization framework on the UltraSPARC system. Section 5 evaluates the performance of dynamic helper threading and Section 6 draws the conclusion.

2. Background and Related Works

Many researchers have proposed to use one or more speculative threads to warm up the shared resources, such as the caches and the branch prediction tables, to reduce the penalty of cache misses and branch mis-predictions. These helper threads (also called scout threads) usually execute a code segment pre-constructed at compile time by identifying the instructions on the execution path leading to the performance bottleneck [5].

2.1. SMT/VMT vs. CMP

Pre-computation based helper threads have been evaluated on the Pentium-4 with hyper-threading and on an Itanium-2 system with special hardware support for VMT [7], [13], [15], [24], [29]. Other helper threading works such as Data-Driven Multi-Threading (DDMT) [3], Simultaneous Subordinate Micro-threading (SSMT) [28] and Transparent Threads [10] target effective helper threaded prefetching on SMT processors.

Unlike SMT and VMT, which share many critical resources, Chip Multi-processing (CMP) processors limit sharing, for example, to only the L2/L3 cache. While the restricted resource sharing moderates the benefit of helper threading to only L2/L3 cache prefetching, it also avoids the drawback of hard-to-control resource contention encountered by helper threading on SMT. The impact of different resource sharing levels on thread communication cost on CMP, as well as the corresponding performance margin for pre-execution, has been quantitatively assessed by Brown et al. on a research Itanium CMP [14].

2.2. Dynamic vs. Static Helper Threading

Software-based dynamic optimization [6], [17], [30], [31], [33] adapts to the runtime behavior of a program due to the change of input data and/or the underlying micro-architecture. Current dynamic optimizations include data prefetching, procedure inlining, partial dead code elimination, and code layout, which have been proven to be useful complements to static optimizations. Optimizations such as data cache prefetching and branch mis-prediction reduction are usually difficult to perform at compile time since cache miss and branch mis-prediction information may not be available. Furthermore, program hot spots may also change under different input data sets or

[Figure 1 flowchart nodes: start; new profile arrives?; prepare profile from the samples; check phase status from the samples; detected new phase?; optimize hot loops and prepare helper tasks; deploy optimized code; system busy?; dispatch new helper task, if any; short time sleep; main thread still alive?; end.]
Figure 1. Control flow of runtime optimization for helper threading on UltraSPARC CMP.

on processors with different micro-architectures. For example, statically pre-constructed helper threads may incur performance degradation if the generated binary is executed under different inputs or on a non-CMP processor. For this reason, it is more desirable to generate helper threads dynamically.

3. Helper Thread Model for CMP

This paper focuses on the aspects of helper-threaded prefetching through dynamic optimization on a real CMP processor, the UltraSPARC IV+ (code named "Panther"). Since each Panther processor encapsulates two identical cores, helper threads can run on the often under-utilized secondary core to prefetch data into the shared cache so as to improve the main thread performance. Unlike other helper threading schemes discussed for SMT [4], [7], [10], [13], [15], [29], thread communication for Panther has to be conducted through the L2 cache since the L2 cache is the closest level of shared caching. On the other hand, construction of helper threads on Panther is less constrained by resource contention, as each core is a complete processing unit with its own functional units, TLB, L1 caches and register files; execution of the helper thread has less negative impact on the main thread performance.

3.1. Piggyback on the Optimizer Thread

In Kim's helper threading work [7], the main thread activates the helper thread in a master-slave fashion. When there is no work for the helper thread, the SMT processor runs in Single-Threading (ST) mode to avoid performance degradation in the main thread. The overhead of switching from ST to MT, or vice versa, becomes a major concern if no hardware support is provided. To alleviate such cost, they prototyped several lightweight instructions for thread synchronization. Other research work on helper threading also faces the same issue.

Our dynamic optimization system is based on the ADORE dynamic optimization framework [17], which spawns a secondary thread to collect the runtime performance profile, detect phase changes, select hot regions and deploy optimizations to the original binary. This secondary thread is in sleep mode most of the time and consumes very little system resources. To employ software scouting in our runtime optimization system, we piggy-back the helper thread on the dynamic optimization thread, meaning that the dynamic optimization thread communicates with the main thread to prefetch data in the post-optimizing stage. This design minimizes the cost of dynamically triggering helper threads and maintains better control of helper threading at runtime.

3.2. Task Dispatching

Figure 1 shows the control flow of the merged dynamic optimization and helper thread. The control loop first waits for a new profile-window [17], upon which the phase detector is called to check for major changes in the behavior of the main program's execution. Once a new phase becomes stable (Section 4.3), the dynamic optimizer selects the hot regions as candidates for software scouting. The optimizer generates two code segments in the code cache, one for the main thread and the other for the helper thread. The code segment for the main thread communicates register values and synchronizes with the helper thread. It is patched to the binary of the main thread so that this code segment is executed prior to entering each hot region. The code segment for the helper thread prefetches data that is likely to be needed in the hot region by the main thread.
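To make the two generated code segments and the shared-memory mailbox concrete, the following C sketch shows one plausible shape of the data structures and of the main-thread stub. All names (helper_mailbox_t, announce_hot_region, MAX_TASKS, and so on) are hypothetical, and the real system emits SPARC code into the code cache rather than compiling C.

#include <stdint.h>

#define MAX_TASKS     64
#define MAX_LIVE_INS   8

/* One entry per optimized hot region, registered by the optimizer. */
typedef struct {
    void (*helper_func)(void);               /* helper code in the code cache  */
    volatile uint64_t live_in[MAX_LIVE_INS]; /* register values passed to the
                                                helper thread                   */
} dispatch_entry_t;

/* Mailbox in shared memory; kept in the shared L2 by normal caching. */
typedef struct {
    volatile uint32_t task_id;               /* which hot region wants scouting */
    dispatch_entry_t  entry[MAX_TASKS];      /* entry 0 is the idle function    */
} helper_mailbox_t;

extern helper_mailbox_t mailbox;             /* allocated in shared memory      */

/* Code segment patched in front of a hot loop in the main thread:
   publish the live-in values and the task ID, then fall through into
   the original loop. */
static inline void announce_hot_region(uint32_t id,
                                       const uint64_t *live_ins, int n)
{
    for (int i = 0; i < n; i++)
        mailbox.entry[id].live_in[i] = live_ins[i];
    mailbox.task_id = id;                    /* helper thread polls this field  */
}

In this sketch the main thread only performs stores to the mailbox, which matches the design goal that the main thread never waits for the helper thread.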

Once the helper thread code is generated, the merged optimization/helper thread enters a "scouting" mode. It checks a dispatching table (the mailbox) to see whether there are outstanding requests for scouting. Such requests are dynamically registered by the newly generated regions/loops. These requests carry the location of the code segment in the code cache that the helper thread should execute.

In Dorai et al.'s work [10], each scouting task is a function. The optimized loop executed in the main thread passes the corresponding function pointer to the helper thread. We use a different approach by passing only the function ID to the helper thread, while the function pointer is maintained in the corresponding entry of the dispatching table. This gives more freedom to the runtime optimizer: whenever the dynamic optimizer learns that the helper task for a certain region/loop is not helping (e.g., cache stalls are not reduced), it can simply change the address in the corresponding entry to point to a newly generated helper task.

Figure 2 illustrates how the dispatching table works. Each of the three optimized hot loops passes its ID to the helper thread when executed. The helper thread calls the corresponding Helper_Func to perform its helper task. The first task on the list is an idle function that will be called when there is no task to perform. Finally, in a throughput computing environment, if all processor cores are busy running useful jobs, the scouting job may be skipped so that all the cores are available for increasing throughput.

[Figure 2 diagram: the main thread's optimized loops Loop_#1 to Loop_#3 in the code cache pass their IDs; the dispatching table maps entry 0 to &Idle_Func and entries 1 to 3 to &Helper_Func_1 through &Helper_Func_3, which are executed by the helper thread.]
Figure 2. Dispatching Table.

3.3. Loop-based Region Selection

Most helper thread schemes for data cache prefetching focus on selecting hot loops as candidates because 1) for data cache prefetching, delinquent loads are likely found in frequently executed loops, and 2) loops can amortize the cost of communication and minimize the cost of synchronization. Consequently, our runtime optimization system uses the runtime profile to select hot loops and determine whether helper threading may be beneficial based on the information sampled from hardware performance counters. However, there are cases where loops may have been transformed into various shapes by compiler optimizations such as loop peeling and software pipelining; hence the loop structures observed at runtime can be obscure. To deal with this problem, our runtime loop selector searches in a broader context for loop structures to reduce the risk of missing important candidates¹.

¹ Some architectures support branch history information that simplifies this work.

3.3.1. Loop Processing

Once a stable phase is detected, the code selector scans a hot region for backward branches to delimit the rough boundary of a hot loop. The Sun UltraSPARC architecture provides a branch prediction bit for branch instructions. The compiler sets such prediction bits based on its best knowledge of the program structure or from profile feedback. Accordingly, our loop selector skips backward branches with a "not taken" hint to prevent the selected code from exiting the scouting loop prematurely, which would reduce the effectiveness of scouting. Each PC sample is a PC location associated with a sampled performance event. A loop is considered hot when there are a large number of PC samples occurring in it. To lower the cost incurred on optimizing less important regions, we select loops in sorted order, starting from the hottest PC, one by one until the accumulated PC samples of all selected loops exceed 90% of all samples.

3.3.2. Region Selection

Next, the code selector performs a quick profile-assisted control flow analysis on each loop to check whether the critical path in the loop covers the majority of the loop's PC samples. This is to assure the quality of the selected loops. Additionally, some loops can overlap with each other as they are essentially portions of a large loop. Such loops will be merged.
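As an illustration of the sample-driven loop selection in Section 3.3.1, the sketch below sorts candidate loops by their PC-sample counts and keeps picking until 90% of all samples are covered. The structure layout, the field names and the hard-coded 90% constant are simplifications; the real selector also merges overlapping loops and applies the nested-loop heuristic described next.

#include <stdlib.h>

typedef struct {
    unsigned long start_pc, end_pc;   /* rough loop boundary             */
    unsigned long samples;            /* PC samples falling in the loop  */
    int           selected;
} loop_info_t;

static int by_samples_desc(const void *a, const void *b)
{
    const loop_info_t *x = a, *y = b;
    return (x->samples < y->samples) - (x->samples > y->samples);
}

void select_hot_loops(loop_info_t *loops, int n, unsigned long total_samples)
{
    unsigned long covered = 0;

    qsort(loops, n, sizeof(loop_info_t), by_samples_desc);
    /* pick the hottest loops until 90% of all samples are covered */
    for (int i = 0; i < n && covered * 10 < total_samples * 9; i++) {
        loops[i].selected = 1;        /* candidate for software scouting */
        covered += loops[i].samples;
    }
}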

Selecting nested loops is another challenge. For most inner loops, fewer extra registers are required to perform inter-thread communication, whereas the cost of doing so might be unbearable if the inner loop iterates only a small number of times. Conversely, such overhead can be well amortized in outer loops, yet getting extra registers for outer loops is often more costly. To make an appropriate choice, our code selector estimates the relative execution time between the inner and the outer loop by computing the total number of PC samples and the minimum execution cycles of each loop. The minimum execution cycles is the un-stalled cycle count of a single iteration plus the cycle stalls from delinquent loads. We divide the PC sample count by the minimum execution cycles to roughly figure out the relative trip count of each loop (before that, the values of both metrics of the inner loop are subtracted from those of the outer loop). If the inner loop's trip count is more than T times (currently T is 10) that of the outer loop, the code selector favors the inner loop; otherwise, it selects the outer loop.

In the final step, the operations of the loops selected above are decoded and assigned initial values, such as structural and performance properties.

3.3.3. Delinquent Load Identification

Delinquent loads refer to load operations causing heavy cache miss stalls. The libcpc library in Solaris (similar to the pfmon [12] library on Linux/Itanium) enables the signal handler to collect cache miss events when the hardware performance counter overflows at the user-defined sampling rate. However, due to the delay in the pipeline, the event address acquired from the interrupt is often a few instructions past the actual event, which is called "interrupt skid". Fortunately, this lag rarely spans across taken branches. Thus a simple back-scan is sufficient to find the real delinquent load, except when the interrupt hits on a branch; in that case the instruction in the branch delay slot is also examined.

3.4. Helper-Thread Code Generation

The scouting code in each helper task is constructed by selecting the backward slices [5] of the delinquent loads plus instructions that change the control flow. The main thread needs to pass live-in variables to the helper thread and maintain proper synchronization with the helper thread.

3.4.1. Stores, Prefetches and Non-Faulting Loads

In the scouting code, memory stores are ignored as the helper thread is speculative in nature and should not modify the architectural state of the main thread. However, some computations may involve local variables on the main thread's stack. Such memory dependences can be easily detected from their "[%SP + off]" or "[%FP + off]" addressing patterns. These variables can be treated as registers and their corresponding references (load/store) can be selected as well. However, such references need to be converted into register references or private memory references since the stack pointer of the helper thread is different.

On Panther, a cache-missing load instruction blocks the pipeline whereas a prefetch instruction does not. Accordingly, delinquent loads in scout code should always be replaced by prefetches unless their result is used by subsequent computations. In the latter case, a bit in the IR (Intermediate Representation) of that load is set, indicating its result is used by other selected instructions, and this delinquent load will be converted into a non-faulting load. Quite often, load operations in the helper thread can be replaced by non-blocking prefetches. This conversion usually enables the helper thread to run ahead of the main thread even if the distilled slice is not much smaller than the original code in the main thread.

3.4.2. Inter-Thread Communication

Inter-thread communication refers to the operations of activating the helper thread and passing live-in variables. The latency of communication is important to the effectiveness of the prefetching because it takes time for the helper thread to run ahead of the main thread. The later the helper thread starts, the longer it takes to catch up with the main thread. It is difficult to schedule communication code far ahead of the loop entry due to the complexity of control and data dependences. Therefore, prior work proposes hardware support for light-weight thread switching to trigger the helper thread and shadow registers for passing live-in variables. Unfortunately, such support is not available on the Panther processor. In our system, we attach a shared memory buffer to each entry in the dispatching table to perform inter-thread communication. The live-in variables to be passed from the main thread to the helper thread include loop invariants and variants. Invariants only need to be passed once rather than at every synchronization point (see Section 3.4.3).

The Solaris operating system by default schedules user threads to any core/chip on a regular basis. To avoid losing the benefit of having the helper thread warm up the shared cache, the helper thread and main thread must be bound to different cores on the same chip. This is enforced by using the system call processor_bind at the beginning of each thread's execution.

The cost of inter-thread communication through memory can be minimized by the shared cache. On the Panther CMP, the one-way communication latency is ~60 cycles. If the cache is not shared, however, it can be as high as 500 cycles (shown in Table 1). This low cost enables a large window for effective helper threaded prefetching, as the helper thread can catch up with and run ahead of the main thread easily.

Machine Configuration                              Cost in cycles
Dual-core UltraSPARC IV+ with shared L2 cache      60 ~ 65
2-way single-core UltraSPARC III                   450 ~ 500

Table 1. The cost of inter-thread communication via memory operations.
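The core binding described in Section 3.4.2 uses the standard Solaris processor_bind call. The sketch below shows one way the two threads could pin themselves; the processor IDs 0 and 1 are an assumption, and a real implementation would first query which two processor IDs correspond to the cores sharing an L2 cache.

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

static void bind_self_to_core(processorid_t core)
{
    /* bind the calling LWP to the given processor */
    if (processor_bind(P_LWPID, P_MYID, core, NULL) != 0)
        perror("processor_bind");
}

/* called at the beginning of each thread's execution */
void main_thread_init(void)   { bind_self_to_core(0); }
void helper_thread_init(void) { bind_self_to_core(1); }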

3.4.3. Synchronization

Since the merged optimization/helper thread takes care of both dynamic optimization and helper thread prefetching, a well-managed scheduling scheme is needed. Once a helper task starts, it must finish within a certain time limit so that critical dynamic optimization tasks such as phase change detection can take place in a timely manner. Additionally, the dynamic optimizer must ensure that no incorrect control flow leads to run-away loops. Such run-away loops not only prevent the merged optimizer/helper thread from regaining control to perform its continuous profiling and optimization jobs, but also pollute the data cache with useless data. To address this issue, the helper thread loop synchronizes with the main thread loop after a certain number of iterations [7]. In our system, we use an asynchronous protocol between the two threads to avoid cross-checking each other's progress. In this protocol the helper thread maintains a single counter while every main thread loop maintains a private counter. All counters consist of three components: a task ID, a synchronization index and an iteration counter. The iteration counter is used in each loop to track the number of loop iterations. When it overflows a preset threshold, say 128, the main thread and helper thread take different actions. The helper thread will terminate the current scouting work, while the main thread passes updated values of the live-in variables and increments the synchronization index to indicate the start of a new synchronization interval before continuing normal execution. When the helper thread enters the helper function again, it compares the first two parts of its counter with the main thread loop's counter. Equality indicates this scouting task has already been done and hence is skipped. Inequality means this is either a new task requested from a different loop or the old loop has entered a new synchronization interval, so the helper thread should synchronize its counter and start scouting. This protocol minimizes the synchronization overhead in the main thread, as it never waits for the helper thread.

3.4.4. Qualification Test

Helper threaded prefetching is meant to be applied to all loops experiencing external cache misses. However, some loops are particularly difficult for effective prefetching since the results of the delinquent loads are needed on the loop's critical path. The helper thread hence may not run faster than the main thread, making the prefetching useless. As a result, we apply a qualification test to rule out such loops at the end of code generation lest these loops degrade the performance. In this test, a set of metrics is used to assess the risk of applying helper threading, such as the loop size, the PC sample coverage and the lengths of the misses' dependence chains.
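A minimal C sketch of the asynchronous counter protocol of Section 3.4.3 follows. The field and function names are hypothetical, the threshold of 128 follows the example in the text, and the GCC intrinsic __builtin_prefetch stands in for the SPARC prefetch instructions that the generated helper code would actually use.

#include <stdint.h>

#define SYNC_THRESHOLD 128            /* iterations per synchronization interval */

typedef struct {
    volatile uint32_t task_id;
    volatile uint32_t sync_index;
    uint32_t          iter_count;
} sync_counter_t;

extern sync_counter_t main_ctr;       /* counter of the announced main-thread loop */
extern sync_counter_t helper_ctr;     /* the helper thread's single counter        */

/* Main-thread side, executed once per loop iteration. */
static inline void main_loop_tick(void (*pass_live_ins)(void))
{
    if (++main_ctr.iter_count >= SYNC_THRESHOLD) {
        main_ctr.iter_count = 0;
        pass_live_ins();              /* refresh live-in values in the mailbox     */
        main_ctr.sync_index++;        /* announce a new synchronization interval   */
    }                                 /* never waits for the helper thread         */
}

/* Helper-thread side, on entering the helper function. */
void helper_task(const int **work_list)
{
    if (helper_ctr.task_id == main_ctr.task_id &&
        helper_ctr.sync_index == main_ctr.sync_index)
        return;                       /* this interval was already scouted         */

    helper_ctr = main_ctr;            /* adopt the task ID and sync index          */
    for (int i = 0; i < SYNC_THRESHOLD; i++)
        __builtin_prefetch(work_list[i], 0);  /* distilled prefetch slice          */
}

Because the main thread only increments its own counter and stores to the mailbox, it never blocks on the helper thread, matching the protocol's goal.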

4. Dynamic Optimization Framework

In this section, we briefly describe the effort of implementing a dynamic optimization system on the UltraSPARC architecture² and discuss the architectural features that are important to runtime optimization, in particular the new helper threaded prefetching.

² Many of the issues are also applicable to other architectures.

4.1. Startup of Runtime Optimizer

Although the way to start up a runtime optimization system varies across different implementations [6], [17], [30], [31], [33], it is commonly accepted that the runtime optimizer should be activated at the beginning of a program's execution. A straightforward solution is to make the optimizer a dynamically linked library loaded by the runtime loader. On Sun's Solaris operating system, an environment variable called LD_PRELOAD can be explicitly set to run library code right before program execution. We use this interface to start our dynamic optimizer as it is easy to control.

4.2. Hardware Performance Monitoring

Modern micro-processors provide hardware performance monitoring (HPM) capabilities and powerful counters that facilitate performance monitoring and profiling. For example, ADORE [17] takes advantage of the comprehensive Itanium PMU to achieve low-overhead runtime optimization and in turn sped up many programs, including large applications.

The libcpc library in Solaris is used to access the performance counters. Since the current UltraSPARC implements only two physical counters, monitoring multiple performance events must go through interleaving the counters. The drawback of interleaving, however, is that the profiler receives fewer samples for each event if the sampling rate remains unchanged. To maintain the quality of dynamic profiling, we choose to dedicate one physical counter to the most important event (e.g. CYCLE_CNT), and use the second counter to interleave the other important events. Additionally, a slight change to the Solaris kernel has been implemented to enable buffering the interrupts generated from event overflows. This buffering scheme significantly reduces the overhead of signal handling from 8-10% down to 1-2%. With the greatly reduced signal handling cost, the dynamic optimization system is able to afford a higher sampling rate for continuous performance monitoring.

4.3. Phase Detection

Running programs often exhibit time-varying behavior called phases, as the example in Figure 3 shows. A phase-sensitive optimizer that adapts to phase changes is able to identify more optimization opportunities. Several phase detection schemes have been proposed [1], [32], yet most of them tend to capture fine-grain phases that may incur excessive overhead for software optimizations. Consequently, we decided to use ADORE's low-cost PC-Centroid [17] approach to detect phase changes.

Figure 3. Three performance characteristics (L2-Stall, Average PC, and CPI) exhibit the runtime behavioral change of the SPECINT benchmark mcf. The vertical lines represent new phases detected by the PC-Centroid approach.

First, the algorithm calculates the average PC address of all samples collected in a fixed time interval, called the PC-Centroid, as a depiction of a phase that reveals the pivot of the code mass the main program has executed. In the actual computation, a few high bits in the PC address must be masked out to deal with situations where frequent switches from the user code to shared library code (shared library code often resides in a high address range) may generate misleading centroids.

Next, the expectation value E and the standard deviation D of the n (n is a fixed value, e.g. 7) most recent PC-Centroids are calculated to form a tolerance range [E-D, E+D]. If a newly arrived PC-Centroid shifts beyond the current tolerance range by some extent, a phase change signal is raised. To improve this scheme, a state machine has also been introduced to help capture important phases and ignore transient, insignificant changes.

4.4. Optimization

4.4.1 Code Cache and Reachability Issue

The code cache stores the optimized code of each running program; the execution of the program can be re-directed to it by patching the entry point of a code region with a branch. To be accessible within the same process, the code cache must be allocated in the same virtual address space. Direct branching on some platforms has a limited range. For example, the largest branch distance on UltraSPARC is only ±8MB. Therefore, a reachability issue arises if the code cache sits beyond the range of a single branch instruction.

00010000  320K r-x  /home/a123948/a.out
0006E000   24K rwx  /home/a123948/a.out
00074000  616K rwx  [ heap ]
FF100000  688K r-x  /usr/lib/libc.so.1
FF1BC000   32K rwx  /usr/lib/libc.so.1
FF340000   16K r-x  /usr/lib/libmp.so.2
FF354000    8K rwx  /usr/lib/libmp.so.2
FF370000    8K rwx  [ anon ]
FF3B0000  160K r-x  /usr/lib/ld.so.1
FF3E6000   16K rwx  /usr/lib/ld.so.1
FFBD2000  120K rw-  [ stack ]

Figure 4. An application's runtime virtual address mapping on Solaris 10.

Figure 4 shows an application's virtual address mapping on the UltraSPARC/Solaris system. Close investigation reveals many "holes" in this address space where the code cache can be safely allocated to maintain reachability. For instance, the space from address "00060000" to "0006E000" is a good candidate. There are other holes in the higher address space, but they are all beyond the reach of one direct branch from the text segment. If the code cache has to be positioned at those addresses, trampoline code that uses an indirect branch is needed to assure reachability, although the trampoline code requires some unused bytes in the text segment that are reachable by direct branches from the original code. Such a code sequence is costly and should be avoided as much as possible.
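For concreteness, the PC-Centroid test of Section 4.3 can be sketched as below. The mask width, the history length of 7, and the use of exactly one standard deviation as the tolerance follow the description above, while the function names are hypothetical and the state machine that filters transient changes is omitted.

#include <math.h>
#include <stdint.h>

#define N_CENTROIDS 7
#define PC_MASK     0x00FFFFFFUL      /* drop high bits (shared library code) */

static double history[N_CENTROIDS];
static int    hist_len, hist_pos;

/* average masked PC address of the samples in one profile window */
double pc_centroid(const uintptr_t *pc_samples, int n)
{
    double sum = 0.0;
    if (n == 0) return 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)(pc_samples[i] & PC_MASK);
    return sum / n;
}

/* Returns 1 when the new centroid falls outside [E - D, E + D]. */
int phase_changed(double centroid)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < hist_len; i++) mean += history[i];
    if (hist_len) mean /= hist_len;
    for (int i = 0; i < hist_len; i++)
        var += (history[i] - mean) * (history[i] - mean);
    double stddev = hist_len ? sqrt(var / hist_len) : 0.0;

    int changed = hist_len == N_CENTROIDS &&
                  (centroid < mean - stddev || centroid > mean + stddev);

    history[hist_pos] = centroid;     /* remember the most recent centroids */
    hist_pos = (hist_pos + 1) % N_CENTROIDS;
    if (hist_len < N_CENTROIDS) hist_len++;
    return changed;
}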

[Figure 5 bar charts: normalized execution time and percentage of execution time stalled on shared caches for the CPU2000 benchmarks.]
Figure 5. Helper threaded prefetching on CPU2000 benchmarks compiled with the base option and profile feedback. Six programs have an L2/L3 cache miss stall of no less than 20%.

4.4.2 Weak/Strong Prefetch

The Panther processor supports two variations of prefetch instructions: weak prefetch and strong prefetch. Although their semantics are implementation dependent, the key difference is that the weak prefetch can be dropped when the prefetch queue is full or when a TLB miss is encountered. In general, strong prefetch should be used if many prefetches are needed to bring in useful data, and weak prefetch should be used when the prefetch is more speculative.

Initially, we replaced delinquent loads with weak prefetches in the helper thread, especially when control dependent instructions are not included (as suggested in some earlier research). However, as the two cores do not share a TLB, a weak prefetch can get dropped in case of a TLB miss. Unlike in the main thread, where the regular loads will eventually bring in the missed TLB entries, the helper task may end up having all prefetches dropped due to TLB misses (since the regular loads are replaced by prefetches). We therefore decided to use strong prefetches in the helper thread code. Although this is generally the better code generation strategy, there are cases where weak prefetch may yield better performance (Section 5.6).

4.4.3 Register Allocation

Runtime binary optimization often requires extra registers to perform computations. For helper threading on CMP processors, the main thread only needs to free up two to three registers for communication and synchronization; there is no other computation needing additional registers. Extra registers can be obtained by the compiler through register pre-reservation or through annotations indicating which registers are not used in a procedure or a loop nest. In ADORE [17], free registers are obtained by using the new IA64 instruction alloc to dynamically allocate registers from the register stack. On UltraSPARC, we decided to collect register liveness information and spill unused or less-frequently used registers to memory. For instance, according to the SPARC ABI, some spaces are unused in the function's activation record where a small number of registers can be safely spilled. Spilling and restoring registers inevitably adds overhead. Yet such overhead is normally negligible since the optimization for helper threading is targeted at hot loops. Finally, register allocation is not an issue for the helper functions, as registers in the helper thread context are always available.

4.5. Atomic Patching and Cache Coherence

In a binary optimization system, patching refers to redirecting the original code execution to the optimized code in the code cache. This operation must be performed atomically in a multi-threading environment to ensure the execution will see either old code or new code, but never half-way patched instructions. In our runtime optimization system, patching involves using a 32-bit store to place a single branch instruction in the original code, which is atomic. In addition, the Panther CMP's I-cache snoops on stores issued from all processor cores, so the patched instruction will be brought into the I-cache, making it unnecessary to explicitly execute an iflush instruction. Other solutions to the cache coherence issue can be found in Brown et al.'s work [14].
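The following sketch ties together the reachability test of Section 4.4.1 and the atomic patch of Section 4.5. encode_branch() stands in for a SPARC instruction encoder and is not spelled out here; the essential points are the ±8MB displacement check and the single aligned 32-bit store, which the hardware performs atomically, so other threads observe either the old or the new instruction word.

#include <stdint.h>

#define BRANCH_REACH (8 * 1024 * 1024)        /* +/- 8MB direct-branch range */

extern uint32_t encode_branch(intptr_t disp); /* hypothetical encoder        */

int patch_entry_point(uint32_t *patch_site, const void *code_cache_target)
{
    intptr_t disp = (intptr_t)code_cache_target - (intptr_t)patch_site;

    if (disp < -BRANCH_REACH || disp >= BRANCH_REACH)
        return -1;                    /* unreachable: needs a trampoline      */

    /* single aligned 32-bit store; atomic with respect to other cores        */
    *(volatile uint32_t *)patch_site = encode_branch(disp);
    /* No explicit iflush is needed on Panther: the I-cache snoops stores.    */
    return 0;
}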

[Figure 6 bar charts: normalized execution time and percentage of execution time stalled on shared caches for the CPU2000 benchmarks.]
Figure 6. Helper threaded prefetching on CPU2000 benchmarks compiled with the peak option.

5. Performance Evaluation

5.1. Machine Configuration

We implemented our dynamic optimization system on a 4-way Sun UltraSPARC IV+ (code named Panther) machine [27] running Solaris 10. Each processor is a dual-core CMP sharing a 2MB on-chip L2 cache. The per-core L1 I/D caches are 64 KB, 4-way set associative with a 64B line size. Many other architectural enhancements have been added to this new chip, including a 32MB shared external cache, an enhanced branch prediction unit and support for weak/strong prefetches.

5.2. Benchmarks

To examine the effectiveness of helper threaded prefetching on programs encountering cache misses, we studied the CPU2000 benchmark suite (we believe CPU2006 would be more appropriate for studying cache performance, but it is not available yet), as well as one real-world application, Fluent (version 6.1.9), which is a large computational fluid dynamics software package. All CPU2000 programs are compiled by the Sun Forte compiler using two options: 1) base + profile feedback, and 2) peak. They all run the reference input (the first input set if there are multiple sets). For Fluent, we compiled it using the best option we could find and fed it with 9 different data input sets: 3 small (fl5s1-3), 3 medium (fl5m1-3), and 3 large (fl5l1-3), obtained from the Fluent website [9].

[Figure 7 bar chart: normalized execution time for the nine Fluent input sets fl5s1-3, fl5m1-3 and fl5l1-3.]
Figure 7. Helper threaded prefetching on fluent.

5.3. Performance Results

As shown in Figure 5 and Figure 6, stalls on the L2 and L3 caches still dominate the performance of a few CPU2000 binaries³. However, with the 32MB L3 cache, most CPU2000 programs suffer only from L2 cache misses. Our runtime helper threaded prefetching was able to speed up four base binaries by as much as 35% (mcf) and three peak binaries. Similarly, Figure 7 shows that a performance gain from 2% to 16% can be achieved on 4 out of the 9 input sets for the application fluent (with input data fl5s3, fl5m3, fl5l1, and fl5l2).

[Figure 8 bar chart: cycle breakdown in billions of cycles (Other, L2_Hit_Stall, L3_Hit_Stall, L3_Miss_Stall) for mgrid, gap, fma3d and mcf, with and without optimization.]
Figure 8. Cycle breakdown before and after prefetching. "no-opt" is for the original code and "opt" is for dynamic helper threaded prefetching.

Figure 8 shows the change in cycle breakdown before/after applying helper threaded prefetching. Helper threaded prefetching is particularly effective at hiding L3 cache miss latency, as shown in the figure. Since helper threaded prefetching does not decrease the L1 data cache miss penalty (shown as L2_hit_stall in Figure 8), we need to consider CMT-type helper threading or in-thread prefetching if L1 cache miss stalls dominate the performance. Note that the working set sizes of the CPU2000 programs may be less suitable for evaluating the effectiveness of the cache hierarchy of the latest processors. The yet-to-be-announced CPU2006 programs would have working set sizes more representative of current and future applications.

For those programs that do not benefit from helper threaded prefetching, there is only 1-2% extra overhead caused by the runtime optimizer.

³ On the CMP processor, the L1 cache is per-core, whose misses cannot be prefetched from other cores.

Figure 9. Effect of excluding control flow in helper tasks. 32, 128 and 512 represent the synchronization interval in loop iterations.

Figure 10. Effect of different synchronization intervals.

In Figure 5 and Figure 6, our dynamic helper thread optimizer did not speed up some programs with high cache miss stalls. As an example, our optimizer failed to obtain sufficient registers for the transformation for wupwise (to avoid excessive spilling). For other programs, the tight data dependences from linked-list chasing prevent us from performing effective scouting.

5.4. Effect of Control Flow Structures

To ensure the prefetching accuracy of the helper thread, the scout code for each loop includes instructions that change the control flow, as discussed in Section 3.4. However, it would be interesting to understand the impact of not including such instructions. Would the removal of these instructions make the helper thread run faster and hide miss latency more effectively for the main thread?

Figure 9 studies the effect measured on the four base binaries that we sped up using helper thread prefetching. Without the control flow instructions, one program completely lost its performance gain (gap) while the other three still benefit from the software scouting, although the speedups become smaller. This is because the helper thread misses some early exits of the loop by not computing the control flow, which causes a delay in reaching, or entirely missing, the next synchronization point. Additionally, without control flow computation the helper task runs the maximum number of iterations set as the synchronization interval (e.g. 32, 128 and 512). If the actual iteration count in the main thread loop is much smaller, the helper thread might severely pollute the L2 and L3 caches (as shown in Figure 9, w/o loop control + 512 always renders the worst performance). As a result, our runtime optimizer by default selects control flow computations into the helper thread code for prefetching.

5.5. Evaluation of Synchronization

Since helper threads do not include store instructions and may skip control dependent instructions, there is a risk that the helper thread loop runs out of control. Therefore, synchronization intervals are set between the main thread and the helper thread to keep run-away loops under control.

To find appropriate values, Figure 10 evaluates five intervals, at which the helper thread synchronizes with the main thread every 4, 16, 64, 256, or 1024 iterations. Since the instructions computing control flow are included and the external cache is large enough for the CPU2000 programs, there is not much performance degradation when large intervals like 1024 are used. However, it is quite interesting that helper threading is still very effective even at a small interval of 4. At such a small interval size, one might expect that the helper thread cannot run considerably ahead of the main thread and initiate prefetches sufficiently early to hide the miss latency. In fact, while the main thread is stalled on the first cache miss in the interval, the helper thread uses prefetch instructions in place of some regular loads to initiate multiple misses. Thus, the helper threading scheme overlaps multiple misses, achieving high Memory-Level Parallelism (MLP) even when it is unable to run sufficiently ahead of the main thread to hide miss latency. Therefore, significant speed-up is achievable even with a smaller interval size.

5.6. Weak vs. Strong Prefetches

The relative benefits of using strong/weak prefetches on the UltraSPARC CMP processor are evaluated here. As mentioned in Section 4.4.2, strong prefetch is generally preferable in helper threading since the TLB is private per core and weak prefetches do not ensure that data is brought into the shared cache.

[Figure 11 bar chart: speedup of mgrid, gap, fma3d and mcf with strong vs. weak prefetch.]
Figure 11. Strong vs. weak prefetch.

Figure 11 demonstrates that when weak prefetches replace strong prefetches, the performance gains of some programs (mgrid and fma3d) are diminished. Nevertheless, weak prefetch works better for some cases, such as mcf. There are two reasons for this. First, in mcf, the traversal of the linked lists must use non-faulting loads, not prefetch instructions. As a result, TLB misses can be resolved by the non-faulting loads as well, regardless of whether the weak prefetches are dropped or not. Second, the strong prefetch could cause delay when the prefetch queue is full. For this reason, when TLB misses are less of a concern and the prefetch is more speculative, using weak prefetches may have an edge over strong prefetches in the helper thread by allowing it to run faster.

6. Conclusion and Future Work

This paper presents the design and implementation of a dynamic optimization system capable of helper threaded prefetching on a state-of-the-art UltraSPARC CMP processor, where two on-chip processor cores share an on-chip L2 cache and an off-chip L3 cache. We have shown that effective helper threaded prefetches can be generated dynamically using runtime profiling based on hardware performance monitoring. By utilizing the otherwise idle processor core, the dynamic optimizer has a great potential to speed up single-threaded user applications, particularly those suffering significantly from L2 or external cache misses. For programs that helper threaded prefetching does not help, our system introduces negligible slowdown (< 2%) due to the lightweight runtime profiling mechanism. This paper also discusses the critical issues of implementing efficient helper threaded prefetching, which include efficient synchronization/communication, whether control flow should be kept in the helper thread, and the impact of using different prefetch instructions in the helper thread. We believe this dynamic optimizer would have a greater performance impact on the next generation of benchmarks, such as CPU2006, with larger working set sizes.

In the near future, we will focus on finding an arbitrator to decide whether to select helper threaded or in-thread cache prefetching optimizations based on profitability analysis. Specifically, since some programs sped up by helper threading can also be sped up by in-thread prefetching, the dynamic optimizer may favor in-thread optimization when the other on-chip cores could be used to run other jobs. In-thread optimization should also be considered when the performance is dominated by private L1 cache misses. Although it seems that helper threaded prefetching on CMP may be inadequate in a throughput oriented computing environment, a number of transaction processing benchmarks show that some threads (such as the log writer and DB writer) can be more time critical than others in contributing to the total performance. Hence helper threaded prefetching can be selectively applied to the time critical threads, even in a throughput computing environment.

The thread synchronization mechanism requires further evaluation on future CMP processors as well. The reason for this is that a faster synchronization mechanism (e.g. through a shared L1 cache or hardware assist) will help utilize the idle core more efficiently under our current scheme, particularly when multiple main threads are involved. Other aspects that need enhancement include register allocation, region selection and undoing ineffective optimization to maximize the performance gain of dynamic helper threaded prefetching.

References

[1] A. Dhodapkar and J. E. Smith. "Comparing Program Phase Detection Techniques", in Proc. of MICRO-36, Dec. 2003.
[2] A. Bhowmik and M. Franklin. "A General Compiler Framework for Speculative Multithreading," in Proc. of SPAA'02, Aug 2002.
[3] A. Roth and G. Sohi, "Speculative Data-Driven Multi-Threading," in Proc. of HPCA-7, Jan 2001.
[4] C.-K. Luk, "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors," in Proc. of ISCA-28, July 2001.
[5] C. Zilles and G. Sohi, "Understanding the Backward Slices of Performance Degrading Instructions," in Proc. of ISCA-27, 2000.
[6] D. Bruening, T. Garnett, and S. Amarasinghe. "An Infrastructure for Adaptive Dynamic Optimization", in Proc. of CGO'03, March 2003.

[7] D. Kim, S. S. Liao, P. H. Wang, J. d. Cuvillo, X. Tian, X. Zhou, H. Wang, D. Yeung, M. Girkar, J. P. Shen, "Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors," in Proc. of CGO'04, March 2004.
[8] D. M. Tullsen, S. J. Eggers, and H. M. Levy. "Simultaneous Multithreading: Maximizing On-Chip Parallelism," in Proc. of ISCA-22, June 1995.
[9] Fluent Benchmarks. http://www.fluent.com/software/fluent/fl5bench/intro.htm.
[10] G. K. Dorai, D. Yeung. "Transparent Threads: Resource Sharing in SMT Processors for High Single Thread Performance," in Proc. of PACT-2002, Sept 2002.
[11] G. Sohi, S. Breach, and T. N. Vijaykumar, "Multiscalar Processors," in Proc. of ISCA-22, Jun 1995.
[12] Hewlett-Packard Corp. Perfmon Project. http://www.hpl.hp.com/research/linux/perfmon.
[13] H. Wang, P. Wang, R. D. Weldon, S. Ettinger, H. Saito, M. Girkar, S. Liao, and J. Shen. "Speculative Precomputation: Exploring Use of Multithreading Technology for Latency," Intel Technology Journal, Feb 2002.
[14] J. A. Brown, H. Wang, G. Chrysos, P. H. Wang, and J. P. Shen. "Speculative Precomputation on Chip Multiprocessors," in Proc. of MTEAC-6, Nov 2002.
[15] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, J. Shen, "Speculative Precomputation: Long-range Prefetching of Delinquent Loads," in Proc. of ISCA-28, July 2001.
[16] J. Kahle. "The IBM Power4 Processor," in Microprocessor Report, Oct 1999.
[17] J. Lu, H. Chen, P.-C. Yew, W.-C. Hsu. "Design and Implementation of a Lightweight Dynamic Optimization System," in The Journal of Instruction-Level Parallelism, 2004.
[18] J. Dundas and T. Mudge. "Improving data cache performance by preexecuting instructions under a cache miss," in Proc. of ICS-11, 1997.
[19] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen and K. Olukotun. "The Stanford Hydra CMP," in IEEE Micro, Mar-Apr 2000.
[20] L. Spracklen, S. G. Abraham. "Chip Multithreading: Opportunities and Challenges," in Proc. of HPCA-11, Feb 2005.
[21] M. Tremblay. "The MAJC Architecture: A Synthesis of Parallelism and Scalability," in IEEE Micro, Vol. 20, No. 6, pp. 12-25, 2000.
[22] M. Dubois. "Fighting the Memory Wall with Assisted Execution," in 2004 ACM Computing Frontiers Conference, Ischia, Italy.
[23] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-Of-Order Processors," in Proc. of HPCA-9, 2003.
[24] P. H. Wang, J. D. Collins, H. Wang, D. Kim, B. Greene, K.-M. Chan, A. B. Yunus, T. Sych and J. P. Shen. "Helper Threads via Virtual Multithreading on an Experimental Itanium 2 Machine", in Proc. of ASPLOS-XI, Oct 2004.
[25] P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, R. M. Kling, and J. P. Shen. "Memory Latency-tolerance Approaches for Itanium Processors: Out-of-order Execution vs. Speculative Precomputation," in Proc. of HPCA-8, Feb 2002.
[26] P. Marcuello, A. Gonzalez, and J. Tubella. "Speculative Multi-threaded Processors," in Proc. of ICS-12, Jul 1998.
[27] Q. Jacobson, "UltraSPARC IV Processors," in Microprocessor Forum 2003, 2003.
[28] R. S. Chappell, S. P. Kim, S. K. Reinhardt, Y. N. Patt. "Simultaneous Subordinate Microthreading (SSMT)," in Proc. of ISCA-26, May 1999.
[29] S. S. W. Liao, P. H. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. P. Shen. "Post Pass Binary Adaptation for Software Based Speculative Precomputation," in Proc. of PLDI'02, Jun 2002.
[30] T. Kistler, M. Franz, "Continuous Program Optimization: Design and Evaluation", in IEEE Transactions on Computers, Vol. 50, No. 6, June 2001.
[31] T. M. Chilimbi and M. Hirzel. "Dynamic Hot Data Stream Prefetching for General-Purpose Programs," in Proc. of PLDI'02, June 2002.
[32] T. Sherwood, S. Sair, B. Calder. "Phase Tracking and Prediction", in Proc. of ISCA-30, June 2003.
[33] V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: A Transparent Dynamic Optimization System," in Proc. of PLDI'00, 2000.
[34] V. Krishnan and J. Torrellas, "A Chip-Multiprocessor Architecture with Speculative Multithreading," in IEEE Trans. on Computers, pp. 866-880, 1999.
[35] Y. Solihin, J. Lee, and J. Torrellas. "Using a User-Level Memory Thread for Correlation Prefetching," in Proc. of ISCA-29, May 2002.
