The Impact of Hyper Threading on Processor Resource Utilization in Production Applicaitons

The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications Subhash Saini, Haoqiang Jin, Robert Hood, David Barker, Piyush Mehrotra and Rupak Biswas NASA Advanced Supercomputing Division NASA Ames Research Center Moffett Field, CA 94035-1000, USA {subhash.saini, haoqiang.jin, robert.hood, david.p.barker, piyush.mehrotra, rupak.biswas}@nasa.gov Abstract—Intel provides Hyper-Threading (HT) in processors choice as, for example, in Intel’s Nehalem micro- based on its Pentium and Nehalem micro-architecture such as architecture, where it is called Hyper-Threading (HT). the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two As is the case with other forms of on-chip parallelism, threads can execute simultaneously, filling unused stages in the such as multiple cores and instruction-level parallelism, functional unit pipelines. To aid better understanding of HT- SMT uses resource sharing to make the parallel related issues, we collect Performance Monitoring Unit (PMU) implementation economical. With SMT, this sharing has the data (instructions retired; unhalted core cycles; L2 and L3 potential for improving utilization of resources such as that cache hits and misses; vector and scalar floating-point of the floating-point unit through the hiding of latency in the operations, etc.). We then use the PMU data to calculate a new memory hierarchy. When one thread is waiting for a load metric of efficiency in order to quantify processor resource instruction to complete, the core can execute instructions utilization and make comparisons of that utilization between from another thread without stalling. single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency of The purpose of this paper is to measure the impact of HT using vector units of the processor, and the impact of HT mode on processor utilization. We accomplish this by computing on various shared resources like L2 and L3 cache. Results processor efficiency and investigating how various shared using four full-scale, production-quality scientific applications resources affect performance of scientific applications in HT from computational fluid dynamics (CFD) used by NASA mode. Specifically, we present a new metric for processor scientists indicate that HT generally improves processor efficiency to characterize its utilization in single threading resource utilization efficiency, but does not necessarily (ST) and HT modes for the hex-core Westmere-EP processor translate into overall application performance gain. used in SGI Altix ICE 8400EX supercomputer. We also investigate the effect of memory hierarchy on the Keywords: Simultaneous Multi-Threading (SMT), Hyper- performance of scientific applications in both the modes. We Threading (HT), Intel’s Nehalem micro-architecture, Intel use four production computational fluid dynamics (CFD) Westmere-EP, Computational Fluid Dynamics (CFD), SGI Altix applications—OVERFLOW, USM3D, Cart3D, and NCC— ICE 8400EX, Performance Tools, Benchmarking, Performance that are used extensively by scientists and engineers at Evaluation NASA and throughout the aerospace industry. I. INTRODUCTION In order to better understand the performance characteristics of these codes, we collect Performance Current trends in microprocessor design have made high Monitoring Unit (PMU) data (instructions retired; L2 and L3 resource utilization a key requirement for achieving good cache hits and misses; vector and scalar floating-point performance. For example, while deeper pipelines have led operations, etc.) in both ST and HT modes. We analyze the to 3 GHz processors, each new generation of micro- results to understand the factors influencing the performance architecture technology comes with increased memory of codes in HT mode. latency and a decrease in relative memory speed. This results in the processor spending a significant amount of time The remainder of this paper is organized as follows. We waiting for the memory system to fetch data. This “memory present background and related work in the next section. wall” problem continues to remain a major bottleneck and as Section III discusses HT in the context of the Nehalem a result, sustained performance of most real-world micro-architecture and its Westmere-EP processor. In applications is less than 10% of peak. Section IV, we detail the architecture of the platform used in this study—the SGI Altix ICE 8400EX, based on the Over the years, a number of multithreading techniques Westmere-EP processor. Section V discusses the have been employed to hide this memory latency. One experimental setup, including the hardware performance approach is simultaneous multi-threading (SMT), which counters. In Section VI, we describe the benchmarks and exposes more parallelism to the processor by fetching and applications used in our study. In Section VII, we discuss retiring instructions from multiple instruction streams, metrics used to measure the effectiveness of HT and the thereby increasing processor utilization. SMT requires only utilization of processor resources in both ST and HT modes. some extra hardware instead of replicating the entire core. Section VIII presents and analyzes the performance results of Price and performance benefits make it a common design our experiments. We discuss other factors that influenced the execution units and memory hierarchy, when results of this study in Section IX, and end with some executing in HT mode. conclusions from this work in Section X. III. HYPER-THREADING IN NEHALEM MICRO- ARCHITECTURE II. BACKGROUND AND RELATED WORK Intel introduced SMT, called Hyper-Threading (HT), into Hyper-Threading (HT) allows instructions from multiple its product line in 2002 with new models of their Pentium 4 threads to run on the same core. When one thread stalls, a processors [1-3]. The advantage of HT is its ability to better second thread is allowed to proceed. To support HT, the utilize processor resources and to hide memory latency. Nehalem micro-architecture has several advantages over the There have been a few efforts studying the effectiveness of Pentium 4. First, the newer design has much more memory HT on application performance [4-6]. Boisseau et al. bandwidth and larger caches, giving it the ability to get data conducted a performance evaluation of HT on a Dell 2650 to the core faster. Second, Nehalem is a much wider dual processor-server based on Pentium 4 using matrix- architecture than Pentium 4. It supports two threads per core, matrix multiplication and a 256-particle molecular dynamics presenting the abstraction of two independent logical cores. benchmark written in OpenMP [4]. Haung et al. The physical core contains a mixture of resources, some of characterized the performance of Java applications using which are shared between threads [2]: Pentium 4 processors with HT [5]. Blackburn et al. studied the performance of garbage collection in HT mode by using • replicated resources for each thread, such as register some of the Pentium 4 performance counters [6]. A key state, return stack buffer (RSB), and the instruction finding of these investigations was that the Pentium 4’s queue; implementation of HT was not very advantageous, as the • partitioned resources tagged by the thread number, processor had very limited memory bandwidth and issued such as load buffer, store buffer, and reorder buffer; only two instructions per cycle. • shared resources, such as L1, L2, and L3 cache; and Recently, HT was extended to processors that use Intel’s • shared resources unaware of the presence of Nehalem micro-architecture [7]. In these processors, threads, such as execution units. memory bandwidth was enhanced significantly by overcoming the front-side bus memory bandwidth bottleneck The RSB is an improved branch target prediction and by increasing instruction issuance from two to four per mechanism. Each thread has a dedicated RSB to avoid any cycle. Saini et al. conducted a performance evaluation of HT cross-contamination. Such replicated resources should not on small numbers of Nehalem nodes using NPB [8]. Results have an impact on HT performance. Partitioned resources are showed that for one node, HT provided a slight advantage statically allocated between the threads and reduce the only for LU. BT, SP, MG, and LU achieved the greatest resources available to each thread. However there is no benefit from HT at 4 nodes: factors of 1.54, 1.43, 1.14, and competition for these resources. On the other hand, the two 1.14, respectively, while FT did not achieve any benefit threads do compete for shared resources and the performance independent of the number of nodes. Later on Saini et al. depends on the dynamic behavior of the threads. Some of the extended their work on HT to measure the relative efficiency shared resources are unaware of HT. For example, the E of the processor in terms of cycle per instruction using the scheduling of instructions to execution units is independent formula of threads, but there are limits on the number of instructions from each thread that can be queued. E = 100*(2*CPI / CPI ) – 100 ST HT Figure 1 is a schematic description of HT for the Nehalem micro-architecture. In the diagram, the rows depict where CPIST and CPIHT are cycle per instruction in ST and each of the Westmere-EP processor’s six execution units— HT modes respectively [9]. two floating-point units (FP0 and FP1), one load unit (LD0), In this study we focus on the Westmere-EP Xeon one store unit (ST0), one load address unit (LA0), and one processor, which is based on the Nehalem micro- branch unit (BR0). It is a sixteen-stage pipeline. Each box architecture. represents a single micro-operation running on an execution unit. The contributions of this paper are as follows: Figure 1(a) shows the ST mode (no HT) in a core where • We present efficiency, a new performance metric in the core is executing only one thread (Thread 0 shown in terms of instruction per cycle to quantify the green) and white space denotes unfilled stages in the utilization of the processor, by collecting PMU data pipeline.

Load more