The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications

Total Pages: 16

File Type: pdf, Size: 1020 KB

The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications

Subhash Saini, Haoqiang Jin, Robert Hood, David Barker, Piyush Mehrotra and Rupak Biswas
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035-1000, USA
{subhash.saini, haoqiang.jin, robert.hood, david.p.barker, piyush.mehrotra, rupak.biswas}@nasa.gov

Abstract—Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architecture such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired; unhalted core cycles; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new metric of efficiency in order to quantify processor resource utilization and make comparisons of that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency of using vector units of the processor, and the impact of HT mode on various shared resources like L2 and L3 cache. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.

Keywords: Simultaneous Multi-Threading (SMT), Hyper-Threading (HT), Intel's Nehalem micro-architecture, Intel Westmere-EP, Computational Fluid Dynamics (CFD), SGI Altix ICE 8400EX, Performance Tools, Benchmarking, Performance Evaluation
I. INTRODUCTION

Current trends in microprocessor design have made high resource utilization a key requirement for achieving good performance. For example, while deeper pipelines have led to 3 GHz processors, each new generation of micro-architecture technology comes with increased memory latency and a decrease in relative memory speed. This results in the processor spending a significant amount of time waiting for the memory system to fetch data. This "memory wall" problem continues to remain a major bottleneck and, as a result, sustained performance of most real-world applications is less than 10% of peak.

Over the years, a number of multithreading techniques have been employed to hide this memory latency. One approach is simultaneous multi-threading (SMT), which exposes more parallelism to the processor by fetching and retiring instructions from multiple instruction streams, thereby increasing processor utilization. SMT requires only some extra hardware instead of replicating the entire core. Price and performance benefits make it a common design choice as, for example, in Intel's Nehalem micro-architecture, where it is called Hyper-Threading (HT).

As is the case with other forms of on-chip parallelism, such as multiple cores and instruction-level parallelism, SMT uses resource sharing to make the parallel implementation economical. With SMT, this sharing has the potential for improving utilization of resources such as that of the floating-point unit through the hiding of latency in the memory hierarchy. When one thread is waiting for a load instruction to complete, the core can execute instructions from another thread without stalling.

The purpose of this paper is to measure the impact of HT on processor utilization. We accomplish this by computing processor efficiency and investigating how various shared resources affect the performance of scientific applications in HT mode. Specifically, we present a new metric for processor efficiency to characterize its utilization in single-threading (ST) and HT modes for the hex-core Westmere-EP processor used in the SGI Altix ICE 8400EX supercomputer. We also investigate the effect of the memory hierarchy on the performance of scientific applications in both modes. We use four production computational fluid dynamics (CFD) applications—OVERFLOW, USM3D, Cart3D, and NCC—that are used extensively by scientists and engineers at NASA and throughout the aerospace industry.

In order to better understand the performance characteristics of these codes, we collect Performance Monitoring Unit (PMU) data (instructions retired; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.) in both ST and HT modes. We analyze the results to understand the factors influencing the performance of the codes in HT mode.

The remainder of this paper is organized as follows. We present background and related work in the next section. Section III discusses HT in the context of the Nehalem micro-architecture and its Westmere-EP processor. In Section IV, we detail the architecture of the platform used in this study—the SGI Altix ICE 8400EX, based on the Westmere-EP processor. Section V discusses the experimental setup, including the hardware performance counters. In Section VI, we describe the benchmarks and applications used in our study. In Section VII, we discuss metrics used to measure the effectiveness of HT and the utilization of processor resources in both ST and HT modes. Section VIII presents and analyzes the performance results of our experiments. We discuss other factors that influenced the results of this study in Section IX, and end with some conclusions from this work in Section X.

II. BACKGROUND AND RELATED WORK

Intel introduced SMT, called Hyper-Threading (HT), into its product line in 2002 with new models of their Pentium 4 processors [1-3]. The advantage of HT is its ability to better utilize processor resources and to hide memory latency. There have been a few efforts studying the effectiveness of HT on application performance [4-6]. Boisseau et al. conducted a performance evaluation of HT on a Dell 2650 dual-processor server based on the Pentium 4 using matrix-matrix multiplication and a 256-particle molecular dynamics benchmark written in OpenMP [4]. Huang et al. characterized the performance of Java applications using Pentium 4 processors with HT [5]. Blackburn et al. studied the performance of garbage collection in HT mode by using some of the Pentium 4 performance counters [6]. A key finding of these investigations was that the Pentium 4's implementation of HT was not very advantageous, as the processor had very limited memory bandwidth and issued only two instructions per cycle.

Recently, HT was extended to processors that use Intel's Nehalem micro-architecture [7]. In these processors, memory bandwidth was enhanced significantly by overcoming the front-side bus memory bandwidth bottleneck and by increasing instruction issuance from two to four per cycle. Saini et al. conducted a performance evaluation of HT on small numbers of Nehalem nodes using NPB [8]. Results showed that for one node, HT provided a slight advantage only for LU. BT, SP, MG, and LU achieved the greatest benefit from HT at 4 nodes: factors of 1.54, 1.43, 1.14, and 1.14, respectively, while FT did not achieve any benefit independent of the number of nodes. Later on, Saini et al. extended their work on HT to measure the relative efficiency E of the processor in terms of cycles per instruction using the formula

E = 100*(2*CPI_ST / CPI_HT) – 100

where CPI_ST and CPI_HT are the cycles per instruction in ST and HT modes, respectively [9].
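To make the relative-efficiency formula above concrete, here is a small hypothetical worked example in C++; the CPI values are invented for illustration and are not measurements from this paper.

#include <iostream>

// Relative efficiency of HT versus ST, following the formula quoted above:
//   E = 100*(2*CPI_ST / CPI_HT) - 100
// E > 0 means the two hardware threads together retire more work per cycle
// than a single thread in ST mode; E = 0 means HT brought no net gain.
static double relative_efficiency(double cpi_st, double cpi_ht) {
    return 100.0 * (2.0 * cpi_st / cpi_ht) - 100.0;
}

int main() {
    const double cpi_st = 1.25; // assumed cycles per instruction in ST mode
    const double cpi_ht = 1.80; // assumed per-thread CPI with two threads sharing a core

    // 2*CPI_ST/CPI_HT compares the combined instruction rate of the two HT
    // threads against one ST thread on the same core.
    std::cout << "E = " << relative_efficiency(cpi_st, cpi_ht) << " %\n"; // about 38.9 %
    return 0;
}

With these sample numbers the core retires roughly 39% more instructions per cycle with both hardware threads active than in ST mode.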
The physical core contains a mixture of resources, some of characterized the performance of Java applications using which are shared between threads [2]: Pentium 4 processors with HT [5]. Blackburn et al. studied the performance of garbage collection in HT mode by using • replicated resources for each thread, such as register some of the Pentium 4 performance counters [6]. A key state, return stack buffer (RSB), and the instruction finding of these investigations was that the Pentium 4’s queue; implementation of HT was not very advantageous, as the • partitioned resources tagged by the thread number, processor had very limited memory bandwidth and issued such as load buffer, store buffer, and reorder buffer; only two instructions per cycle. • shared resources, such as L1, L2, and L3 cache; and Recently, HT was extended to processors that use Intel’s • shared resources unaware of the presence of Nehalem micro-architecture [7]. In these processors, threads, such as execution units. memory bandwidth was enhanced significantly by overcoming the front-side bus memory bandwidth bottleneck The RSB is an improved branch target prediction and by increasing instruction issuance from two to four per mechanism. Each thread has a dedicated RSB to avoid any cycle. Saini et al. conducted a performance evaluation of HT cross-contamination. Such replicated resources should not on small numbers of Nehalem nodes using NPB [8]. Results have an impact on HT performance. Partitioned resources are showed that for one node, HT provided a slight advantage statically allocated between the threads and reduce the only for LU. BT, SP, MG, and LU achieved the greatest resources available to each thread. However there is no benefit from HT at 4 nodes: factors of 1.54, 1.43, 1.14, and competition for these resources. On the other hand, the two 1.14, respectively, while FT did not achieve any benefit threads do compete for shared resources and the performance independent of the number of nodes. Later on Saini et al. depends on the dynamic behavior of the threads. Some of the extended their work on HT to measure the relative efficiency shared resources are unaware of HT. For example, the E of the processor in terms of cycle per instruction using the scheduling of instructions to execution units is independent formula of threads, but there are limits on the number of instructions from each thread that can be queued. E = 100*(2*CPI / CPI ) – 100 ST HT Figure 1 is a schematic description of HT for the Nehalem micro-architecture. In the diagram, the rows depict where CPIST and CPIHT are cycle per instruction in ST and each of the Westmere-EP processor’s six execution units— HT modes respectively [9]. two floating-point units (FP0 and FP1), one load unit (LD0), In this study we focus on the Westmere-EP Xeon one store unit (ST0), one load address unit (LA0), and one processor, which is based on the Nehalem micro- branch unit (BR0). It is a sixteen-stage pipeline. Each box architecture. represents a single micro-operation running on an execution unit. The contributions of this paper are as follows: Figure 1(a) shows the ST mode (no HT) in a core where • We present efficiency, a new performance metric in the core is executing only one thread (Thread 0 shown in terms of instruction per cycle to quantify the green) and white space denotes unfilled stages in the utilization of the processor, by collecting PMU data pipeline.
Recommended publications
  • 1 Introduction
    Cambridge University Press 978-0-521-76992-1 - Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors, Jean-Loup Baer. Excerpt.

    1 Introduction
    Modern computer systems built from the most sophisticated microprocessors and extensive memory hierarchies achieve their high performance through a combination of dramatic improvements in technology and advances in computer architecture. Advances in technology have resulted in exponential growth rates in raw speed (i.e., clock frequency) and in the amount of logic (number of transistors) that can be put on a chip. Computer architects have exploited these factors in order to further enhance performance using architectural techniques, which are the main subject of this book.
    Microprocessors are over 30 years old: the Intel 4004 was introduced in 1971. The functionality of the 4004 compared to that of the mainframes of that period (for example, the IBM System/370) was minuscule. Today, just over thirty years later, workstations powered by engines such as (in alphabetical order and without specific processor numbers) the AMD Athlon, IBM PowerPC, Intel Pentium, and Sun UltraSPARC can rival or surpass in both performance and functionality the few remaining mainframes and at a much lower cost. Servers and supercomputers are more often than not made up of collections of microprocessor systems.
    It would be wrong to assume, though, that the three tenets that computer architects have followed, namely pipelining, parallelism, and the principle of locality, were discovered with the birth of microprocessors. They were all at the basis of the design of previous (super)computers. The advances in technology made their implementations more practical and spurred further refinements.
  • Instruction Latencies and Throughput for AMD and Intel X86 Processors
    Instruction latencies and throughput for AMD and Intel x86 processors
    Torbjörn Granlund, 2019-08-02 09:05Z
    Copyright Torbjörn Granlund 2005–2019. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
    This report is work-in-progress. A newer version might be available here: https://gmplib.org/~tege/x86-timing.pdf

    In this short report we present latency and throughput data for various x86 processors. We only present data on integer operations. The data on integer MMX and SSE2 instructions is currently limited. We might present more complete data in the future, if there is enough interest.
    There are several reasons for presenting this report:
    1. Intel's published data were in the past incomplete and full of errors.
    2. Intel did not publish any data for 64-bit operations.
    3. To allow straightforward comparison of an important aspect of AMD and Intel pipelines.
    The here presented data is the result of extensive timing tests. While we have made an effort to make sure the data is accurate, the reader is cautioned that some errors might have crept in.

    1 Nomenclature and notation
    LNN means latency for NN-bit operation. TNN means throughput for NN-bit operation. The term throughput is used to mean number of instructions per cycle of this type that can be sustained. That implies that more throughput is better, which is consistent with how most people understand the term. Intel use that same term in the exact opposite meaning in their manuals. The notations "P6 0-E", "P4 F0", etc., are used to save table header space.
  • Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
    Introduction to multi-threading and vectorization
    Matti Kortelainen, LArSoft Workshop 2019, 25 June 2019

    Outline
    Broad introductory overview:
    • Why multithread?
    • What is a thread?
    • Some threading models
      – std::thread
      – OpenMP (fork-join)
      – Intel Threading Building Blocks (TBB) (tasks)
    • Race condition, critical region, mutual exclusion, deadlock
    • Vectorization (SIMD)

    Motivations for multithreading
    (Figure omitted: image courtesy of K. Rupp.)
    • One process on a node: speedups from parallelizing parts of the programs
      – Any problem can get speedup if the threads can cooperate on
        • same core (sharing L1 cache)
        • L2 cache (may be shared among small number of cores)
    • Fully loaded node: save memory and other resources
      – Threads can share objects -> N threads can use significantly less memory than N processes
    • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option?

    What is a (software) thread? (in POSIX/Linux)
    • "Smallest sequence of programmed instructions that can be managed independently by a scheduler" [Wikipedia]
    • A thread has its own
      – Program counter
      – Registers
      – Stack
      – Thread-local memory (better to avoid in general)
    • Threads of a process share everything else, e.g.
      – Program code, constants
      – Heap memory
      – Network connections
      – File handles
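To connect the outline's race condition, critical region, and mutual exclusion terms to code, here is a minimal C++11 sketch (not taken from the workshop slides): two std::thread workers update a shared counter, with a std::mutex guarding the critical region.

#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;          // shared state: a data race if updated without synchronization
std::mutex counter_mutex; // protects the critical region below

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex); // mutual exclusion
        ++counter;                                       // critical region
    }
}

int main() {
    std::thread t1(worker, 100000);
    std::thread t2(worker, 100000);
    t1.join();
    t2.join();
    // With the mutex the result is deterministic (200000); without it the two
    // threads race and the final value generally comes out lower and varies run to run.
    std::cout << counter << '\n';
    return 0;
}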
  • Parallel Programming
    Parallel Programming

    Parallel Computing Hardware
    Shared memory:
    • multiple CPUs are attached to the bus
    • all processors share the same primary memory
    • the same memory address on different CPUs refers to the same memory location
    • CPU-to-memory connection becomes a bottleneck: shared memory computers cannot scale very well

    Parallel Computing Hardware
    Distributed memory:
    • each processor has its own private memory
    • computational tasks can only operate on local data
    • infinite available memory through adding nodes
    • requires more difficult programming

    OpenMP versus MPI
    OpenMP (Open Multi-Processing):
    • easy to use; loop-level parallelism
    • non-loop-level parallelism is more difficult
    • limited to shared memory computers
    • cannot handle very large problems
    MPI (Message Passing Interface):
    • requires low-level programming; more difficult programming
    • scalable cost/size
    • can handle very large problems

    MPI
    Distributed memory: Each processor can access only the instructions/data stored in its own memory. The machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when the program begins. Every process executes the same program, though the flow of execution may depend on the processor's unique ID number (e.g. "if (my_id == 0) then ..."). Each process performs computations on its local variables, then communicates with other processes (repeat), to eventually achieve the computed result. In this model, processors pass messages both to send/receive information, and to synchronize with one another.

    Introduction to MPI
    Communicators and Groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
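A minimal MPI program in the style the notes describe (every process runs the same code and branches on its rank) could look like the following sketch; it uses the standard MPI C API and is an illustration, not code from the course material.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // unique ID of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size); // number of concurrent processes

    if (rank == 0) {
        // Rank 0 sends a message to every other process.
        for (int dest = 1; dest < size; ++dest) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, dest, /*tag=*/0, MPI_COMM_WORLD);
        }
    } else {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d of %d received %d\n", rank, size, payload);
    }

    MPI_Finalize();
    return 0;
}

Such a program would typically be launched with something like mpirun -np 4 ./a.out, one process per requested rank.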
  • Benchmarking the Intel FPGA SDK for Opencl Memory Interface
    The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
    Hamid Reza Zohouri*†1, Satoshi Matsuoka*‡
    *Tokyo Institute of Technology, †Edgecortix Inc. Japan, ‡RIKEN Center for Computational Science (R-CCS)
    {zohour.h.aa@m, matsu@is}.titech.ac.jp

    Abstract—Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's

    efficiency on Intel FPGAs with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of blocking.
    • We outline one performance bug in Intel's compiler, and multiple deficiencies in the memory controller, leading to significant loss of memory performance for typical applications. In some of these cases, we provide work-arounds to improve the memory performance.

    II. METHODOLOGY
    A. Memory Benchmark Suite
    For our evaluation, we develop an open-source benchmark suite called FPGAMemBench, available at https://github.com/zohourih/FPGAMemBench.
  • Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations
    A Concurrent Real-Time White Paper
    2881 Gateway Drive, Pompano Beach, FL 33069, (954) 974-1700, www.concurrent-rt.com

    Real-Time Performance During CUDA™: A Demonstration and Analysis of RedHawk™ CUDA RT Optimizations
    By: Concurrent Real-Time Linux® Development Team, November 2010

    Overview
    There are many challenges to creating a real-time Linux distribution that provides guaranteed low process-dispatch latencies and minimal process run-time jitter. Concurrent Real Time's RedHawk Linux distribution meets and exceeds these challenges, providing a hard real-time environment on many qualified hardware configurations, even in the presence of a heavy system load. However, there are additional challenges faced when guaranteeing real-time performance of processes while CUDA applications are simultaneously running on the system. The proprietary CUDA driver supplied by NVIDIA® frequently makes demands upon kernel resources that can dramatically impact real-time performance.
    This paper discusses a demonstration application developed by Concurrent to illustrate that RedHawk Linux kernel optimizations allow hard real-time performance guarantees to be preserved even while demanding CUDA applications are running. The test results will show how RedHawk performance compares to CentOS performance running the same application. The design and implementation details of the demonstration application are also discussed in this paper.

    Demonstration
    This demonstration features two selectable real-time test modes:
    1. Jitter Mode: measure and graph the run-time jitter of a real-time process
    2. PDL Mode: measure and graph the process-dispatch latency of a real-time process
    While the demonstration is running, it is possible to switch between these different modes at any time.
  • A Modern Primer on Processing in Memory
    A Modern Primer on Processing in Memory
    Onur Mutlu (a,b), Saugata Ghose (b,c), Juan Gómez-Luna (a), Rachata Ausavarungnirun (d)
    SAFARI Research Group
    (a) ETH Zürich, (b) Carnegie Mellon University, (c) University of Illinois at Urbana-Champaign, (d) King Mongkut's University of Technology North Bangkok

    Abstract
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend.
  • Unit: 4 Processes and Threads in Distributed Systems
    Unit 4: Processes and Threads in Distributed Systems

    Thread
    A program has one or more loci of execution. Each execution is called a thread of execution. In traditional operating systems, each process has an address space and a single thread of execution. A thread is the smallest unit of processing that can be scheduled by an operating system. A thread is a single sequence stream within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. In a process, threads allow multiple streams of execution.

    Thread Structure
    A process is used to group resources together; threads are the entities scheduled for execution on the CPU. A thread has a program counter that keeps track of which instruction to execute next. It has registers, which hold its current working variables. It has a stack, which contains the execution history, with one frame for each procedure called but not yet returned from. Although a thread must execute in some process, the thread and its process are different concepts and can be treated separately. What threads add to the process model is to allow multiple executions to take place in the same process environment, to a large degree independent of one another. Having multiple threads running in parallel in one process is similar to having multiple processes running in parallel in one computer.
    Figure: (a) Three processes each with one thread. (b) One process with three threads. In the latter case, the threads share an address space, open files, and other resources. In the former case, the processes share physical memory, disks, printers, and other resources.
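The contrast drawn above (each process has its own address space, while the threads of one process share theirs) can be demonstrated with a short illustrative C++/POSIX sketch that is not part of the unit text:

#include <iostream>
#include <thread>
#include <sys/wait.h>
#include <unistd.h>

int shared_value = 0; // lives in the process's address space

int main() {
    // A thread of the same process sees and modifies the same memory.
    std::thread t([] { shared_value = 1; });
    t.join();
    std::cout << "after thread:        shared_value = " << shared_value << '\n'; // 1

    // A child process gets a copy of the address space; its write is invisible here.
    pid_t pid = fork();
    if (pid == 0) {           // child
        shared_value = 99;    // modifies only the child's copy
        _exit(0);
    }
    waitpid(pid, nullptr, 0); // parent waits for the child
    std::cout << "after child process: shared_value = " << shared_value << '\n'; // still 1
    return 0;
}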
  • Gpu Concurrency
    GPU CONCURRENCY (Robert Searles, 5/26/2021)

    EXECUTION SCHEDULING & MANAGEMENT
    • Pre-emptive scheduling: processes share the GPU through time-slicing; scheduling is managed by the system.
    • Concurrent scheduling: processes run on the GPU simultaneously; the user creates and manages scheduling streams.
    (Figure omitted: timelines contrasting time-sliced execution of processes A, B, C with concurrent execution.)

    CUDA CONCURRENCY MECHANISMS
                              Streams          MPS              MIG
    Partition Type            Single process   Logical          Physical
    Max Partitions            Unlimited        48               7
    Performance Isolation     No               By percentage    Yes
    Memory Protection         No               Yes              Yes
    Memory Bandwidth QoS      No               No               Yes
    Error Isolation           No               No               Yes
    Cross-Partition Interop   Always           IPC              Limited IPC
    Reconfigure               Dynamic          Process launch   When idle
    MPS: Multi-Process Service; MIG: Multi-Instance GPU

    CUDA STREAMS

    STREAM SEMANTICS
    1. Two operations issued into the same stream will execute in issue-order. Operation B issued after Operation A will not begin to execute until Operation A has completed.
    2. Two operations issued into separate streams have no ordering prescribed by CUDA. Operation A issued into stream 1 may execute before, during, or after Operation B issued into stream 2.
    Operation: usually cudaMemcpyAsync or a kernel call. More generally, most CUDA API calls that take a stream parameter, as well as stream callbacks.

    STREAM EXAMPLES
    Host/Device execution concurrency:
        Kernel<<<b, t>>>(…);     // this kernel execution can overlap with
        cpuFunction(…);          // this host code
    Concurrent kernels:
        Kernel<<<b, t, 0, streamA>>>(…); // these kernels have the possibility
        Kernel<<<b, t, 0, streamB>>>(…); // to execute concurrently
    In practice, concurrent
  • Parallel Computing
    Parallel Computing

    Announcements
    ● Midterm has been graded; will be distributed after class along with solutions.
    ● SCPD students: Midterms have been sent to the SCPD office and should be sent back to you soon.

    Announcements
    ● Assignment 6 due right now.
    ● Assignment 7 (Pathfinder) out, due next Tuesday at 11:30AM.
      ● Play around with graphs and graph algorithms!
      ● Learn how to interface with library code.
    ● No late submissions will be considered. This is as late as we're allowed to have the assignment due.

    Why Algorithms and Data Structures Matter

    Making Things Faster
    ● Choose better algorithms and data structures.
      ● Dropping from O(n²) to O(n log n) for large data sets will make your programs faster.
    ● Optimize your code.
      ● Try to reduce the constant factor in the big-O notation.
      ● Not recommended unless all else fails.
    ● Get a better computer.
      ● Having more memory and processing power can improve performance.
    ● New option: Use parallelism.

    How Your Programs Run

    Threads of Execution
    ● When running a program, that program gets a thread of execution (or thread).
    ● Each thread runs through code as normal.
    ● A program can have multiple threads running at the same time, each of which performs different tasks.
    ● A program that uses multiple threads is called multithreaded; writing a multithreaded program or algorithm is called multithreading.

    Threads in C++
    ● The newest version of C++ (C++11) has libraries that support threading.
    ● To create a thread:
      ● Write the function that you want to execute.
      ● Construct an object of type thread to run that function.
        – Need header <thread> for this.
    ● That function will run in parallel alongside the original program.
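Following the "Threads in C++" steps above, a small illustrative example (not from the lecture slides): two std::thread objects compute partial sums of a vector in parallel and are then joined.

#include <cstddef>
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Each thread sums its own half into a separate output slot, so no locking is needed.
void partial_sum(const std::vector<int>& v, std::size_t begin, std::size_t end, long long* out) {
    *out = std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
}

int main() {
    std::vector<int> data(1000000, 1);
    long long left = 0, right = 0;
    std::size_t mid = data.size() / 2;

    std::thread t1(partial_sum, std::cref(data), std::size_t{0}, mid, &left);  // runs in parallel
    std::thread t2(partial_sum, std::cref(data), mid, data.size(), &right);    // with the main thread
    t1.join();   // wait for both workers before using their results
    t2.join();

    std::cout << "total = " << (left + right) << '\n'; // 1000000
    return 0;
}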
  • Advanced X86
    Advanced x86: BIOS and System Management Mode Internals
    Input/Output
    Xeno Kovah && Corey Kallenberg, LegbaCore, LLC
    All materials are licensed under a Creative Commons "Share Alike" license. http://creativecommons.org/licenses/by-sa/3.0/
    Attribution condition: You must indicate that derivative work "Is derived from John Butterworth & Xeno Kovah's 'Advanced Intel x86: BIOS and SMM' class posted at http://opensecuritytraining.info/IntroBIOS.html"

    Input/Output (I/O)
    I/O, I/O, it's off to work we go…

    2 Types of I/O
    1. Memory-Mapped I/O (MMIO)
    2. Port I/O (PIO) – also called Isolated I/O or port-mapped I/O (PMIO)
    • x86 systems employ both types of I/O
    • Both methods map peripheral devices
    • Address space of each is accessed using instructions
      – typically requires Ring 0 privileges
      – Real-Addressing mode has no implementation of rings, so no privilege escalation needed
    • I/O ports can be mapped so that they appear in the I/O address space or the physical-memory address space (memory-mapped I/O) or both
      – Example: PCI configuration space in a PCIe system – both memory-mapped and accessible via port I/O. We'll learn about that in the next section
    • The I/O Controller Hub contains the registers that are located in both the I/O Address Space and the Memory-Mapped address space

    Memory-Mapped I/O
    • Devices can also be mapped to the physical address space instead of (or in addition to) the I/O address space
    • Even though it is a hardware device on the other end of that access request, you can operate on it like it's memory:
      – Any of the processor's instructions
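As a hedged illustration of the port I/O access method described above (a sketch that assumes x86 Linux, glibc's <sys/io.h>, and root privileges; it is not taken from the training material), the following writes one byte to the legacy POST-code port 0x80:

#include <cstdio>
#include <sys/io.h>   // ioperm(), outb(), inb(): x86 Linux, glibc

int main() {
    const unsigned short port = 0x80; // legacy POST/diagnostic port, commonly used for harmless writes

    // Port I/O normally needs Ring 0 privilege; ioperm() grants a user-space
    // process access to a range of ports and itself requires root (CAP_SYS_RAWIO).
    if (ioperm(port, 1, 1) != 0) {
        std::perror("ioperm (run as root?)");
        return 1;
    }

    outb(0xAB, port); // port-mapped I/O: send one byte to the device behind port 0x80
    std::printf("wrote 0xAB to port 0x%X via port I/O\n", port);

    ioperm(port, 1, 0); // drop the access again
    return 0;
}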
  • Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2
    Overview
    ● (note: Tute 02 this Weds - handouts)
    ● Flynn's Taxonomy
    ● multicore architecture concepts
      ■ hardware threading
      ■ SIMD vs MIMD in the multicore context
    ● T2: design features for multicore
      ■ system on a chip
      ■ execution: (in-order) pipeline, instruction latency
      ■ thread scheduling
      ■ caches: associativity, coherence, prefetch
      ■ memory system: crossbar, memory controller
      ■ intermission
      ■ speculation; power savings
      ■ OpenSPARC
    ● T2 performance (why the T2 is designed as it is)
    ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009)

    SIMD and MIMD in the Multicore Context
    Flynn's taxonomy:
                      Single Instruction   Multiple Instruction
      Single Data     SISD                 MISD
      Multiple Data   SIMD                 MIMD
    ● for SIMD, the control unit and processor state (registers) can be shared
    ● however, SIMD is limited to data parallelism (through multiple ALUs)
      ■ algorithms need a regular structure, e.g. dense linear algebra, graphics
      ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add:
          Rx: x3 x2 x1 x0
           +
          Ry: y3 y2 y1 y0
           =
          Rz: z3 z2 z1 z0   (zi = xi + yi)
      ■ design requires massive effort; requires support from a commodity environment
      ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck
    ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD
      ■ higher hardware costs also includes larger shared resources (caches, TLBs) needed ⇒ less parallelism than for SIMD

    Hardware (Multi)threading
    ● recall concurrent execution on a single CPU: switch between threads (or processes) requires the saving (in memory) of thread state (register values)
      ■ motivation: utilize CPU better when thread stalled for I/O (6300 Lect O1, p9–10)
      ■ what are the costs? do the same for smaller stalls? (e.g.

    The UltraSPARC T2: System on a Chip
    ● OpenSparc Slide Cast Ch 5: p79–81,89
    ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual CPUs)
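The 4×32-bit SIMD add sketched with registers Rx, Ry, and Rz above maps directly onto the SSE2 intrinsic _mm_add_epi32; here is a small illustrative C++ fragment (not from the lecture notes):

#include <cstdio>
#include <emmintrin.h> // SSE2 intrinsics

int main() {
    alignas(16) int x[4] = {1, 2, 3, 4};      // Rx: x3 x2 x1 x0
    alignas(16) int y[4] = {10, 20, 30, 40};  // Ry: y3 y2 y1 y0
    alignas(16) int z[4];                     // Rz: z3 z2 z1 z0

    __m128i rx = _mm_load_si128(reinterpret_cast<const __m128i*>(x));
    __m128i ry = _mm_load_si128(reinterpret_cast<const __m128i*>(y));
    __m128i rz = _mm_add_epi32(rx, ry);       // four 32-bit adds at once (zi = xi + yi)
    _mm_store_si128(reinterpret_cast<__m128i*>(z), rz);

    std::printf("%d %d %d %d\n", z[0], z[1], z[2], z[3]); // 11 22 33 44
    return 0;
}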