
Analysis of Multithreaded Architectures for Parallel Computing

Rafael H. Saavedra-Barrera, David E. Culler, and Thorsten von Eicken
Computer Science Division
University of California
Berkeley, California 94720

Abstract

Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the cost of switching between threads. It is shown that a multithreaded processor exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulae for regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number of threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.

1. Introduction

Memory latency, the time required to initiate, service, and return the result of a memory request, has always been the bane of high-performance computers, and it is especially critical in large-scale multiprocessors. Many processor cycles may elapse while a request is communicated among physically remote modules, in addition to those lost during the memory access. For scalable architectures, communication latency will generally increase with the number of modules, due to the depth of the network, and may be quite unpredictable, due to network contention. Thus, the architectural challenge is to minimize the fraction of processor cycles wasted due to long-latency requests. There are two obvious alternatives: avoid latency or tolerate it. In other words, we may attempt to reduce the number of such requests or perform other useful work while requests are outstanding. In uniprocessors, the former is realized by caches [Smit82] and the latter by dynamic hazard resolution logic [Ande67, Russ78]. Unfortunately, neither of these approaches extends trivially to large-scale multiprocessors [Arvi87].

The latency avoidance strategy embodied in caches is attractive for multiprocessors because local copies of shared data are produced on demand. However, this replication leads to a thorny problem of maintaining a coherent image of the memory [Good83, Agar88]. Concrete solutions exist for small bus-based multiprocessors, but even these involve considerable complexity, and observed miss rates on shared data are fairly high [Egge88]. Approaches that do not provide automatic replication can be expected to generate a larger number of long-latency requests.

Several recent architectures adopt the alternative strategy of tolerating latency. These can be loosely classified as multithreaded architectures, because each processor supports several active threads of computation. While some threads are suspended, awaiting completion of a long-latency request, the processor can continue servicing the other threads. Dataflow architectures are an extreme example: the processor maintains a queue of enabled instructions and, in each cycle, removes an instruction from this queue and dispatches it into the instruction pipeline. A number of more conservative architectures have been proposed that maintain the state of several conventional instruction streams and switch among these based on some event, either every instruction [Hals88, Smit78, This88], every load instruction [Ianu88, Nikh89], or every cache miss [Webe89]. While the approach is conceptually appealing, the available evidence for its effectiveness is drawn from a small sample of programs under hypothetical implementations [Arvi88, Ianu88, Webe89]. No coherent framework for evaluating these architectures has emerged.

In this paper, we develop a simple analytical model of multithreaded architectures in order to understand the potential gains offered by the approach and its fundamental limitations. For example, what latency and remote reference rates justify multithreading, how many threads must be supported to achieve a significant improvement, and what are the trade-offs between increased tolerance and degraded cache performance due to increasing the number of threads? Our model is based on four parameters: the latency incurred on a remote reference (L), the number of threads that can be interleaved (N), the cycles lost in performing a switch (C), and the interval between switches triggered by remote references (R). L, N, and C are basic machine parameters, determined by the interconnection between modules, the amount of processor state devoted to maintaining active threads, and the implementation of the switch mechanism, respectively. R reflects a combination of program behavior and memory system design.

The basic architectural model is presented in Section 2. To develop a general understanding of the behavior of multithreaded architectures, Section 3 analyzes the model under the assumption that the remote reference interval is fixed. A simple calculation shows that, for a given latency L, the efficiency of the processor increases linearly with the number of threads up to a saturation point and is constant beyond that point. To provide a more realistic prediction of the processor behavior, in Section 4 we assume the remote reference interval has a geometric distribution and arrive at a Markov chain giving the efficiency as a function of N, C, L and the mean value of R. The size of this chain is extremely large, making direct numerical solutions infeasible; however, an exact solution is obtained. It gives a more accurate estimate of the processor efficiency and identifies the transition region between linear improvement and saturation. We can see how increasing the number of contexts enlarges the saturation region, with respect to L, and that the effects of switching overhead are most pronounced in saturation.

To make the model still more realistic, in Section 5 we embellish it to include the effects of per-processor cache degradation due to competing threads. In this case, the processor efficiency improves more slowly with the number of contexts, for a given latency, and reaches a peak, from which it may drop with further increase in N. Section 6 compares predictions of the model with published measurements obtained through trace-driven simulation, showing the correlation to be quite good. Section 7 employs the model to determine maximum improvement, net gain, and points of diminishing return under various operating assumptions in multithreaded architectures. To the extent that these predictions can be verified, this analytic approach deepens our understanding of the behavior of this new class of machines.

2. Multithreaded Processor Model

In this section, we present the architectural model and definitions used throughout the paper. We focus on one processor in a multiprocessor configuration and ignore issues of synchronization [Arvi87]. In addition, we assume the latency in processing a request is determined primarily by the machine structure and minimally affected by program behavior, so it is considered constant. We also assume that many outstanding requests can be issued and will be processed in a pipelined fashion. We say nothing about how multiple threads are actually supported by the processor and assume only that there are N of them per processor and that it takes a constant time C to switch from one to another. A thread is represented by a context, consisting of a program counter, a register set, and whatever context status words are required. The model is optimistic, in that it assumes sufficient parallelism exists throughout the computation.

A conventional single-threaded processor will wait during a remote reference, so we may say it is idle for a period of time L. A multithreaded processor, on the other hand, will suspend the current context and switch to another, so after some fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding. Only if all the contexts are suspended (blocked) will the processor be idle. Clearly, the objective is to maximize the fraction of time that the processor is busy, so we will use the efficiency of the processor as our performance index, given by

    ε = Busy / (Busy + Switching + Idle),    (1)

where Busy, Switching, and Idle represent the amount of time, measured over some large interval, that the processor is in the corresponding state. The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the value of Idle, but without overly increasing the magnitude of Switching.

The state of a processor is determined by the disposition of the various contexts on the processor. During its lifetime, a context cycles through the following states: ready, running, leaving, and blocked. There can be at most one context running or leaving. A processor is busy if there is a context in the running state; it is switching while making the transition from one context to another, i.e., when a context is leaving. Otherwise, all contexts are blocked and we say the processor is idle. A running context keeps the processor busy until it issues an operation that requires a context switch. The context then spends C cycles in the leaving state, then goes into the blocked state for L cycles, and finally re-enters the ready state. Eventually the processor will choose it and the cycle will start again.

This general model can be applied to several different multithreaded architectures, with varying switching policies. We briefly present three examples and give an interpretation of the run length and memory latency for each. First, suppose a context is preempted when it causes a cache miss. In this case, R is taken to be the average interval between misses (in cycles) and L the time to satisfy the miss. Here, the processor switches contexts only when it is certain that the current one will be delayed for a significant number of cycles. A more conservative approach is to switch on every load, independent of whether it will cause a miss or not. In this case, R represents the average interval between loads. Our general multithreading model assumes that a context is blocked for L cycles after every switch, but in the case of a "switch on load" processor this only happens if the load causes a cache miss. The general model can still be employed if we postulate that there are two sources of latency (L1 and L2), each having a particular probability (p1 and p2) of occurring on every switch. If L1 represents the latency on a cache miss, then p1 corresponds to what is normally referred to as the miss ratio. L2 is a zero-cycle memory latency with probability p2. Finally, consider a processor that switches on every instruction, independent of whether it is a load or not ("switch always"). As in the case of the "switch on load" processor, we create two sources of latency, with L1 representing the cache miss latency and p1 the probability of a cache miss per instruction. The analysis in this paper is in terms of the general model; however, some of the examples assume a "switch on miss" processor.

3. Concepts and Preliminary Analysis

In order to introduce the basic concepts and terminology, we first assume the context transition times, R and C, are constant. In later sections, we consider more realistic conditions. A single-threaded processor executes a context until a remote reference is issued (R cycles) and then is idle until the reference completes (L cycles), before continuing. There is no context switch and, obviously, no switch overhead. We can model this behavior as an alternating renewal process having a cycle of R + L. In terms of eq. (1), R and L correspond to the amount of time during a cycle that the processor is Busy and Idle, respectively. Thus, the efficiency of the single-threaded machine is given by

    ε1 = R / (R + L) = 1 / (1 + L/R).    (2)

This shows clearly the performance degradation of such a processor in a parallel system with large memory latency.
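To make the magnitude of this degradation concrete, eq. (2) can be evaluated directly. The following small Python sketch is ours, for illustration only; the parameter values are the ones used later in the paper (figure 2).

    # Single-threaded efficiency, eq. (2): eps_1 = R / (R + L) = 1 / (1 + L/R).
    def single_thread_efficiency(R, L):
        # R: run length between remote references (cycles); L: remote latency (cycles)
        return R / (R + L)

    # With R = 16 and L = 128 the processor does useful work only ~11% of the time.
    print(single_thread_efficiency(16, 128))   # 0.111...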

With multiple contexts, memory latency can be hidden by switching to a new context, but we assume the switch takes C cycles of overhead. Assuming the run length between switches is constant, with a sufficient number of contexts there is always a context ready to execute when a switch occurs, so the processor is never idle. In this case, we say the processor is saturated. The cycle of the renewal process in this case is R + C, and the efficiency is simply

    εsat = R / (R + C) = 1 / (1 + C/R).    (3)

Observe that the efficiency in saturation is independent of the latency and also does not change with a further increase in the number of contexts. Saturation is achieved when the time the processor spends servicing the other threads exceeds the time to process a request, i.e., when (N − 1)(R + C) > L. This gives the saturation point, under constant run length, as

    Nd = L / (R + C) + 1.    (4)

When the number of contexts is below the saturation point, there may be no ready contexts after a context switch, so the processor will experience idle cycles. The time to switch to a ready context, execute it until a remote reference is issued, and process the reference is equal to R + C + L. Assuming N is below the saturation point, during this time all the other contexts have a turn in the processor. Thus, the efficiency is given by

    εlin = NR / (R + C + L).

Observe that the efficiency increases linearly with the number of contexts until the saturation point is reached and beyond that remains constant. The equation for εsat gives the fundamental limit on the efficiency of a multithreaded processor and underlines the importance of the ratio C/R. Unless the context switch is extremely cheap, the remote reference rate must be kept low.

4. Analysis with Stochastic Run Lengths

In this section, we improve the basic model by assuming the run length of a context (R) is a random variable having a geometric distribution; i.e., the probability of executing an instruction that will trigger a context switch is p = 1/R. We present an exact solution to the Markov chain and compare it to the solution of the simple model above.

The behavior of a multithreaded processor is conveniently represented by a Petri net diagram as in Figure 1. The circles represent places, and the boxes transitions. Four of the five places correspond directly to the four context states; place A is used to enforce the constraint that only one context can be running or leaving. Note that transition E is immediate, while transitions C and L are deterministic (black rectangles). The stochastic transition R associated with the running state of a context is shown as a white rectangle.

Figure 1: A Petri net representing the three states of the multithreaded processor. In (a) the processor is busy; in (b) the processor is switching; and in (c) the processor is idle.

4.1. An Exact Solution to the Markov Chain

By computing the reachability set of the Petri net we obtain the Markov chain, which can be solved numerically to obtain the efficiency for specific values of N, R, C, and L. However, it is computationally unfeasible to obtain a solution when the number of contexts is not small; the number of states in the reachability set is too large. The exact number of states is given by

    States = 1 + L + C + \sum_{i=2}^{N} \binom{L - (i-2)C}{i}.

For example, with a latency of 128 cycles, a context switch time of 2 cycles, and 4 contexts, the Markov chain has more than 9 million states. Fortunately, we have found the exact solution to the limiting probabilities. The main steps in the derivation are sketched below; details can be found in [Saav90].

Although the number of equations in the transition matrix for the Markov chain is very large, there are several patterns present in this particular system that make it possible to reduce all equations to a new system having only one unknown. Let the limiting probability for this state be π1. We group all the limiting probabilities that represent the same processor state together (Busy, Switching, or Idle) and arbitrarily set π1 = 1. We then get the following particular solution to the set of equations given by the transition matrix:

    Π_Busy = 1 + \sum_{k=1}^{N-1} \sum_{j=1}^{C} \binom{L - (k-1)C - j}{k-1} \frac{p^k}{(1-p)^{(C+1)(k-1)+j}} + \sum_{k=1}^{N-1} \binom{L - kC}{k} \frac{p^k}{(1-p)^{(C+1)k}},    (5)

    Π_Switching = \sum_{j=1}^{C} \frac{p}{(1-p)^j} + C \binom{L - (N-1)C}{N-1} \frac{p^N}{(1-p)^{(C+1)(N-1)}} + \sum_{k=1}^{N-2} \sum_{j=1}^{C} \binom{L - kC - j}{k} \frac{p^{k+1}}{(1-p)^{(C+1)k+j}} + \sum_{j=1}^{C} \sum_{i=1}^{C-j} \binom{L - (N-2)C - i - j}{N-2} \frac{p^N}{(1-p)^{(C+1)(N-2)+i+j}} + \sum_{k=1}^{N-2} \sum_{j=1}^{C} \sum_{i=1}^{C} \binom{L - (k-1)C - i - j}{k-1} \frac{p^{k+1}}{(1-p)^{(C+1)k+i+j}},    (6)

    Π_Idle = \binom{L - (N-1)C}{N} \frac{p^N}{(1-p)^{(C+1)(N-1)}}.    (7)

By substituting eqs. (5)-(7) into (1) we effectively eliminate π1.
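Before turning to the stochastic solution, the deterministic model is simple enough to evaluate directly. The sketch below is our code, not part of the original derivation; it combines eqs. (2)-(4) with the linear-regime expression εlin = NR/(R + C + L).

    def deterministic_efficiency(N, R, C, L):
        # Constant-run-length model of Section 3.
        N_d = L / (R + C) + 1.0          # saturation point, eq. (4)
        if N >= N_d:
            return R / (R + C)           # saturated regime, eq. (3)
        return N * R / (R + C + L)       # linear regime

    # Parameters of figure 2: L = 128, R = 16, C = 2; saturation efficiency is 0.889.
    for N in (1, 4, 8, 13):
        print(N, round(deterministic_efficiency(N, 16, 2, 128), 3))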

This gives us the following expression for the efficiency:

    ε = Π_Busy / (Π_Busy + Π_Switching + Π_Idle).

These equations are only valid when L > C, but this is not a serious restriction, since it does not make sense to build a multithreaded machine that takes more time to switch contexts than the memory latency it attempts to hide¹. Using eqs. (5)-(7), it is straightforward to compute the efficiency for complex systems with many contexts, large run lengths, and long memory latencies.

¹ However, in a "switch on load" or "switch always" processor the condition L > C may not be true. In these processors L corresponds to the weighted average of the multiple sources of latency, and even when the remote latency is large the effective latency we obtain can be smaller than a context switch time.

4.2. A Better Approximation to the Saturation Point

It is easy to see that eq. (4) is a first-order approximation to the saturation point in the case of stochastic run lengths; it only considers the average run length of a context. However, as shown by figure 2, the efficiency we obtain using Nd may be well below that in saturation. We can get a much better approximation using information about the distribution of the run lengths. Essentially, we must determine the number of contexts required to reduce the probability that the processor is idle below some threshold. Knowing that a geometric random variable with mean R = 1/p has variance (1 − p)/p² = R(R − 1), and making some simplifications [Saav90], we obtain a quadratic equation for the new approximate saturation point (Ns). The solution is

    Ns = [ (αR + ((αR)² + 4(R + C)L)^{1/2}) / (2(R + C)) ]² + 1.    (8)

Variable α gives the level of 'confidence' (in the sense of the number of standard deviations considered) that the processor will not enter the idle state.

Equation (8) was obtained by attempting to minimize the probability of entering the Idle state. However, what really interests us is the fraction of the maximum efficiency that Ns contexts provide. We obtain this by using (8) in the formulae for the efficiency given by eqs. (5)-(7) and comparing it to the maximum efficiency given by (3). Substituting the expression for Ns with α = 1, the value obtained for the efficiency is more than 95% of the efficiency in saturation; if α is increased to 1.5, this efficiency is more than 99% of that in saturation. When α = 0, (8) reduces to (4).

4.3. Comparing the Approximate and Exact Solutions

Intuitively, we expect the solution of the deterministic model to match well with the stochastic solution far from the transition region, with a maximum separation in the transition itself. In figure 2, we plot both solutions as a function of N. The latency is 128 cycles; the probability of context switching is .0625 (R = 16); and the context switch time is 2 cycles. In this case, the maximum efficiency of the processor is .888. The saturation point (Nd) predicted by the deterministic model is located at 8 contexts. This corresponds to a real efficiency of .743, which is only 84% of the actual maximum. In contrast, Ns with α = 1.5 equals 13 contexts and its efficiency corresponds to 99% of the maximum.

Figure 2: Efficiency as a function of the number of contexts: L = 128 cycles, R = 16 cycles, C = 2 cycles. One curve shows the deterministic model; the rightmost curve is obtained from the exact solution to the Markov chain, with the points Nd and Ns marking the two saturation estimates.

4.4. Latency, Number of Contexts, and Context Switch Overhead

The utilization expression obtained with the stochastic model allows a detailed investigation of the behavior of a multithreaded processor as a function of N, R, L, and C. This is illustrated in figure 3, where we show utilization curves for a system in which memory latency is varied from zero to 200 cycles, context switch overhead is either 0, 1, 4, or 16 cycles, and the number of contexts is 2 and 6. However, the qualitative behavior we identified in the simple deterministic model is present here: (1) the maximum utilization is completely determined by the ratio C/R; (2) in the saturation region, latency is effectively eliminated; (3) the effect of the context switch overhead is greater in saturation than in the linear regime; (4) outside the saturation region, an increase in memory latency rapidly reduces the utilization of the processor.

5. Dependency of Contexts on Cache Miss Ratio

The next step in our analysis is to consider a more realistic model that takes into account that increasing the number of contexts running on a processor will have an adverse effect on the cache miss ratio. A higher number of cache misses will decrease the run length and possibly increase memory latency due to higher memory contention. In this section, we modify the efficiency equations obtained in both the deterministic and stochastic cases to take into account the possible effect of multiple contexts on the run length. We will assume that each context uses 1/N of the total cache, instead of sharing the whole cache between the contexts. Thus, the total cache size is constant and, as the number of contexts increases, each context will have a smaller cache for its own use. This is a 'middle ground' assumption. In reality, contexts may interfere constructively or destructively on shared data. If they do not interfere at all, they might fit in the cache with unequal partitions.
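Equation (8) is easy to evaluate numerically. The following sketch is ours; it reproduces the saturation estimates quoted above for the parameters of figure 2.

    import math

    def approx_saturation_point(R, C, L, alpha):
        # Eq. (8); alpha = 0 recovers the deterministic saturation point N_d of eq. (4).
        x = (alpha * R + math.sqrt((alpha * R) ** 2 + 4 * (R + C) * L)) / (2 * (R + C))
        return x * x + 1.0

    # L = 128, R = 16, C = 2: alpha = 0 gives ~8.1 contexts (N_d), alpha = 1 gives ~10.9,
    # and alpha = 1.5 gives ~12.7, i.e. about 13 contexts, as reported in Section 4.3.
    for alpha in (0.0, 1.0, 1.5):
        print(alpha, round(approx_saturation_point(16, 2, 128, alpha), 1))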

Figure 3: Efficiency as a function of latency and context switch overhead. For both graphs R = 16 cycles. The two panels plot efficiency against memory latency (0 to 200 cycles) for 2 contexts and for 6 contexts, with one curve for each context switch cost C = 0, 1, 4, and 16.

On uniprocessors, cache misses experienced by an application can be classified into four groups: start-up effects, nonstationary behavior, intrinsic interference, and extrinsic interference [Agar89]. The first is caused by initial execution of the program; the second by changes in the program's working set; the third by the size of the cache; and the last one by multiprogramming. Only the number of misses due to interference (intrinsic and extrinsic) increases with a larger number of contexts. By assigning independent caches to the contexts we eliminate the effect of extrinsic interference on the miss ratio. Hence, we only have to consider how the intrinsic interference is affected².

² This does not reflect that there may be some minimum miss rate, regardless of cache size, due to communication of data between processors, but could be easily extended to do so.

Many different techniques have been used to obtain the performance of caches: trace-driven simulation, hardware measurement, and analytical models [Smit82, Clar83, Agar89]. It has been observed, although not completely validated, that the miss ratio (m), as a function of the size of the cache, can be approximated by m = A S^(-K), where S is the cache size, and A and K are positive constants that depend on the workload³. This miss ratio formula appears to be valid for caches between 1K and 256K bytes under uniprocessor execution. We will use this miss ratio approximation to obtain an expression for the miss rate for N contexts in terms of the miss ratio of one context using all the cache. For example, if the total cache size is 256K bytes, the formula can be used to obtain approximate miss ratios for up to 128 contexts, each using a cache partition of 2K bytes, or 4 contexts, each using a cache partition of 64K bytes.

³ D. Thiebaut argues in [Thie89] that this relationship can be explained by considering the program's execution as a fractal random walk over the address space. In his model the constant K is related to the fractal's dimension, and this value depends on the particular intermiss gap distribution of the program.

Let m(N) be the miss ratio when there are N contexts, each using a cache of size S_N = S_1/N. An expression for m(N) can be obtained in the following way:

    m(N) = A S_N^(-K) = A (S_1/N)^(-K) = m(1) N^K.

This expression for m(N) is a monotonically increasing function and does not take into consideration that the miss ratio cannot be greater than one. Thus, a better expression for the miss ratio is

    m(N) = m(1) N^K,  if N ≤ m(1)^(-1/K);
    m(N) = 1,         if N > m(1)^(-1/K).

Now, m(1) equals p divided by the average number of cache queries per instruction (p_load), so we can easily express the run length with N contexts as

    R(N) = R(1) N^(-K),  if N ≤ (p_load R(1))^(1/K);
    R(N) = 1/p_load,     if N > (p_load R(1))^(1/K).    (9)

The smallest possible value for R(N) is the average run length between loads (1/p_load); this is the case when every load misses in the cache. An estimate of R(1) and K can be obtained from uniprocessor simulations under different cache sizes.
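The cache-degradation model translates directly into code. The sketch below is ours; the numerical values are the illustrative ones used with figure 4, not measurements, and the helper names are our own.

    def miss_ratio(m1, K, N):
        # m(N) = m(1) * N**K, capped at 1 once every cache query misses.
        return min(1.0, m1 * N ** K)

    def run_length(R1, K, N, p_load):
        # Eq. (9): R(N) = R(1) * N**(-K) until every load misses, then 1/p_load.
        return max(R1 * N ** (-K), 1.0 / p_load)

    # Example: m(1) = 0.01, K = 0.4, p_load = 0.25 (one cache query every 4 instructions),
    # so R(1) = 1 / (m(1) * p_load) = 400 cycles.
    m1, K, p_load = 0.01, 0.4, 0.25
    R1 = 1.0 / (m1 * p_load)
    for N in (1, 4, 16, 64):
        print(N, round(miss_ratio(m1, K, N), 4), round(run_length(R1, K, N, p_load), 1))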

Figure 4: Efficiency versus the number of contexts for different values of m(1) and K. The four panels correspond to K = 0.00 (cache degradation ignored), 0.20, 0.40, and 0.57 (cache degradation included); each panel shows curves for single-context miss ratios m(1) = .01, .04, .08, and .16, with markers Np (peak) and Ns (99% of saturation).

5.1. Results Including Cache Dependency

Equation (9) can be substituted into formulas (2)-(8) to obtain a new set of equations that reflect how the miss ratio depends on the number of contexts. Figure 4 compares the processor efficiency computed using the original equations and the new ones for various values of K. The four curves in each graph assume a single-context miss ratio of 0.01, 0.04, 0.08, and 0.16, a latency of 128 cycles, and a context switch cost of 4 cycles. The set of values for the cache degradation constant, K, are 0.0, 0.2, 0.4, and 0.57, and the size of the cache is, in the case of a single context, 256K bytes⁴. We also assume that p_load, the fraction of loads executed in the workload, equals 25 percent.

⁴ For most programs K lies between .2 and .5, but it is not unusual to find programs with a value as large as the one we have selected. The large value reflects a scenario where the miss ratio increases rapidly as the cache size is decreased.

The graphs in figure 4 indicate that utilization degrades considerably when the initial miss ratio or the cache degradation constant is large. When K > 0.0, we observe that the utilization increases up to some maximum (εpeak) and then drops as N increases. This maximum value diminishes as K becomes larger, and more contexts are needed in order to reach the maximum. The flat segment in the curve with m(1) = .16 and K = .57 reflects a regime where every query to the cache misses (m(N) = 1).

An estimate of the saturation point (N′s) in the presence of cache degradation is easily obtained by using (8) with R replaced by R(N), as given in (9), and computing the fixed point of the resulting non-linear equation (R(N) is monotonically non-increasing). As demonstrated in figure 4, N′s, computed with α = 1.5, is a good predictor of the number of contexts needed to obtain maximum efficiency (Npeak). Moreover, in the deterministic model with cache degradation, it can be proved that the point where the system saturates corresponds exactly to that of maximum efficiency, independent of the values of m(1) and K.

6. Comparison with Published Measurements

The analytic model developed above employs only a crude characterization of program behavior and the machine organization, so the "proof of the pudding" is the ability to predict actual behavior. We now compare predictions from our model against trace-driven simulation results for a multiprocessor currently being studied at Stanford [Webe89]. The simulated architecture consists of several nodes connected via a network.
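Computing the fixed point mentioned above requires nothing more than iterating eq. (8) with the run length of eq. (9). The sketch below is our code; it reuses the run_length and approx_saturation_point helpers sketched earlier, and the starting point and tolerance are arbitrary choices.

    def saturation_with_cache(R1, K, p_load, C, L, alpha=1.5, max_iters=100):
        # Fixed point of N = N_s(R(N)): eq. (8) evaluated with the run length of eq. (9).
        # Since R(N) is non-increasing and bounded below by 1/p_load, the iteration
        # settles quickly when started from N = 1.
        N = 1.0
        for _ in range(max_iters):
            N_next = approx_saturation_point(run_length(R1, K, N, p_load), C, L, alpha)
            if abs(N_next - N) < 1e-6:
                break
            N = N_next
        return N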

program        run length R(1)   latency L   cache degradation K   miss rate m(1)   loads per instruction

LOCUS_ROUTE    156               22          .5747                 .0055            .166
P_THOR          50               25          .1988                 .0160            .250
MP3D            16               32          .1315                 .0350            .786

program        switch cost   1 context          2 contexts         4 contexts
                             Weber    Model     Weber    Model     Weber    Model

LOCUS_ROUTE     1            0.890    0.876     0.946    0.975     0.946    0.986
                4            0.890    0.876     0.948    0.952     0.904    0.946
               16            0.890    0.876     0.864    0.867     0.780    0.815
P_THOR          1            0.690    0.666     0.846    0.894     0.882    0.972
                4            0.690    0.666     0.824    0.856     0.834    0.904
               16            0.690    0.666     0.728    0.723     0.660    0.703
MP3D            1            0.385    0.333     0.550    0.561     0.652    0.837
                4            0.385    0.333     0.536    0.527     0.612    0.744
               16            0.385    0.333     0.470    0.421     0.410    0.455

Table 1: Comparison between trace-driven simulation and the stochastic model modified to include the cache degradation due to the number of contexts. The first table gives the relevant parameters needed for our equations.

Each node consists of a processor supporting some number of contexts, a direct-mapped instruction and data cache of size 64K bytes, a fraction of the global memory, and a network interface. The caches are maintained consistent by a directory-based cache coherency protocol. A context executes until it requires information not available in its processor cache or attempts to write data marked "read-shared".

Restating measurements presented in [Webe89] in terms of our model, we obtain values for the relevant parameters shown in the top portion of table 1. These are used as input for our model, and the comparison with reported simulation results is presented in the remainder of table 1.

The predictions are within 10% of the reported simulation results, except on MP3D, for 4 contexts and a small context switch overhead. Observe that where a degradation is observed in going from 2 contexts to 4 contexts, the prediction shows a similar quality and magnitude. This behavior matches well with the efficiency curves presented in figure 4. The discrepancy in the case of program MP3D seems to be a result of using the same latency for different numbers of contexts. Weber and Gupta report that, in the case of MP3D, an increase in the number of contexts resulted in significantly larger network delay due to higher global traffic. Unfortunately, they do not report the new value for the latency. Using our model, we calculate that the reported efficiency would be obtained with an average latency of 55 cycles, rather than 32.

7. Designing a Multithreaded Machine

We now proceed to show how our analytical model can be used to explore the space of possible design choices in building an effective multithreaded machine. To make the exercise more realistic, we will consider the parameter values used in the last section.

We consider memory latencies of 32, 64, and 128 cycles. A latency of 128 cycles may seem large, but it will help us evaluate how much multithreading can hide latency and improve efficiency. With respect to context switching, we consider four possible design points; the most aggressive design attempts to switch contexts immediately, for a value of C = 0; three more conservative designs require 1, 4 and 16 cycles of overhead to switch, respectively. Although a context switch overhead of 16 cycles may appear large, such a system needs only to maintain one context on-chip and keep the others off-chip in fast memory (engineering such a system is still a challenge). This implementation is interesting because it supports a large number of contexts with only a small increase in processor cost. We consider three miss ratio values of approximately 1.1%, 2.2%, and 3.1% with 1.4 memory references per instruction to give average run lengths of 16, 32 and 64 cycles. We further assume a cache degradation constant (K) of .57, which is large compared to magnitudes normally found in other workloads.

For these parameters, table 2 gives the number of contexts (Npeak) needed to achieve maximum efficiency and the ratio between the maximum utilization and the utilization with only one context (εpeak/ε1). The results show large variability in the number of contexts needed to achieve saturation and relative improvements that vary from a mere four percent to a nine-fold increase.

In the real world, a machine designer never has an unlimited amount of resources; generally, the goal is to produce the largest improvement with the smallest increase in cost and complexity. Therefore, we might consider how many contexts are needed to improve utilization by some constant factor across a range of design values. As a possible answer to this question, the 'Improvement' columns of Table 2 show the number of contexts needed to increase utilization by 50 and 100 percent with respect to a single context. The results show that only three contexts are needed to achieve a 50 percent improvement, wherever this is possible. That only three contexts are sufficient to obtain a 50 percent improvement across a wide range of values for L, C, and R is not surprising, considering that with three contexts most of the configurations are in the linear regime. The improvement in utilization can be easily approximated using equation (4) as N·R(N)/R(1) ≈ 1.6. In the case of 100 percent improvement, a similar pattern emerges; only 5 contexts are required.

The last two columns of table 2 give the number of contexts needed to achieve fifty and seventy-five percent of the maximum utilization. Entries having a value of one indicate that multithreading does not provide a substantial improvement.
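The back-of-the-envelope estimate N·R(N)/R(1) quoted above is easy to check; a minimal sketch, assuming only the cache degradation constant K = 0.57 used in this section:

    # In the linear regime, the utilization relative to a single context is roughly
    # N * R(N) / R(1) = N**(1 - K).
    K = 0.57
    for N in (2, 3, 5):
        print(N, round(N ** (1 - K), 2))   # 2 -> 1.35, 3 -> 1.6, 5 -> 2.0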

          Parameters          Maximum             Improvement         Fraction of Maximum
    L     C     R         Npeak   εpeak/ε1     1.5·ε1   2.0·ε1      .50·εpeak   .75·εpeak

   32     0    16           16      3.00          2        4            2           6
   32     0    32            8      2.00          3        8            1           3
   32     0    64            5      1.50          5        -            1           2
   32     1    16           10      2.41          3        5            2           4
   32     1    32            7      1.82          3        -            1           2
   32     1    64            5      1.44          5        -            1           1
   32     4    16            5      1.78          3        -            1           2
   32     4    32            5      1.52          5        -            1           1
   32     4    64            4      1.30          -        -            1           1
   32    16    16            3      1.04          -        -            1           1
   32    16    32            2      1.06          -        -            1           1
   32    16    64            2      1.06          -        -            1           1
   64     0    16           40      5.00          3        4            7          17
   64     0    32           17      3.00          3        4            3           6
   64     0    64            8      2.00          3        8            1           3
   64     1    16           21      3.63          3        5            4           9
   64     1    32           13      2.62          3        6            2           4
   64     1    64            8      1.89          3        -            1           2
   64     4    16           10      2.54          3        5            2           4
   64     4    32            8      2.11          3        6            1           3
   64     4    64            6      1.69          3        -            1           2
   64    16    16            4      1.50          4        -            1           1
   64    16    32            4      1.42          -        -            1           1
   64    16    64            3      1.31          -        -            1           1
  128     0    16          128      9.00          3        5           27          58
  128     0    32           45      5.00          3        5            7          17
  128     0    64           18      3.00          3        4            3           6
  128     1    16           48      5.60          3        5           10          22
  128     1    32           30      4.07          3        5            5          11
  128     1    64           16      2.77          3        5            2           5
  128     4    16           21      3.65          3        5            4          10
  128     4    32           15      3.08          3        5            3           6
  128     4    64           10      2.39          3        5            2           4
  128    16    16            7      2.12          3        6            1           4
  128    16    32            6      1.98          3        6            1           3
  128    16    64            5      1.77          3        -            1           2

Table 2: The two columns under the label 'Maximum' give the number of contexts and utilization increase with respect to a single context processor. The columns under the label 'Improvement' give the number of contexts needed to improve utilization with respect to a single context by 50 and 100 percent. The last two columns give the number of contexts needed to achieve 50 and 75 percent of the maximum utilization.

The configurations where multithreading looks most promising are those where the last four columns have increasing values, meaning that it is possible to achieve a good improvement in utilization and that there is room for further improvement by increasing the number of contexts.

8. Conclusion

We have presented and analyzed a series of models for a multithreaded processor, each progressively more realistic than the previous one, and in each case derived efficiency equations in terms of the memory latency, context switch overhead, run length, and number of contexts. The simplest model assumes that all state transitions take a constant amount of time and ignores the effect of multithreading on the cache miss ratio. Next, we took into consideration the stochastic nature of the execution run length and found the exact solution for the corresponding Markov chain.

In the first two models the cache miss ratio was assumed to be independent of the number of contexts supported by the processor. This unrealistic assumption was later dropped by incorporating a new expression for the run length, in terms of the number of contexts, into the solution to the Markov chain. This more realistic model correlates well with the limited empirical results available to date, and the discrepancies show where further improvements in the model are possible.

Finally, we used our solution to explore the design space of a particular multithreaded machine. We showed how our equations can be used to compare various alternatives and the relative improvement in efficiency between a multithreaded processor and a conventional single context processor.

Apart from presenting and analyzing the different models, we identified the principal factors that need to be taken into account in the design of a multithreaded machine. Some of our conclusions can be summarized as follows:

- The most critical design parameter in a multithreaded architecture is the ratio C/R. Unless the context switch cost is essentially zero, the run length must be maximized, in which case the number of contexts required is small.

- Within the saturation region, the utilization of the processor is only a function of the run length and the context switch overhead, and substantial increases in latency can be tolerated.

- Caches are important in a multithreaded environment, as they allow large run lengths between misses and may perform well even when the number of contexts is increased. The reduction in the run lengths in a multithreaded machine due to a larger number of contexts can be modeled as reducing the size of the cache that each context can use.

- A processor supporting a small number of contexts, between 2 and 5, can expect an improvement in utilization between 50 and 75 percent over a large range of latencies, context switch overheads, and run lengths.

In this paper, we have ignored the issue of synchronization. At this point there is no general agreement on how much it will impact execution, nor on the level at which it should be handled. Whether a multithreaded machine will provide an acceptable solution to the synchronization problem remains an interesting question that will require considerable attention in the near future.

Acknowledgements

We would like to thank Jose A. Ambros-Ingerson, Herve Touati, Bob Boothe and David Cross for their comments on earlier drafts. The material presented here is based on research supported in part by NASA under consortium agreement NCA2-128 and cooperative agreement NCC2-550. Computing resources were provided in part by the National Science Foundation through the UCB Mammoth project under grant CDA-8722788.

References

[Agar88] Agarwal, A., Simoni, R., Hennessy, J., and Horowitz, M., "An Evaluation of Directory Schemes for Cache Coherency". Proc. of the 15th Annual Int. Symp. on Comp. Arch., Honolulu, Hawaii, June 1988, pp. 280-289.

[Agar89] Agarwal, A., Horowitz, M., and Hennessy, J., "An Analytical Cache Model". ACM Trans. on Comp. Sys., Vol. 7, No. 2, May 1989, pp. 184-215.

[Ande67] Anderson, D.W., Sparacio, F.J., and Tomasulo, R.M., "The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling". IBM Journal of Research and Development, Vol. 11, No. 1, January 1967, pp. 8-24.

[Arvi87] Arvind, and Ianucci, R.A., "Two Fundamental Issues in Multiprocessing". Proc. of DFVLR Conf. 1987 on Parallel Processing in Science and Engineering, West Germany, June 1987, pp. 61-88.

[Arvi88] Arvind, Culler, D.E., and Maa, G.K., "Assessing the Benefits of Fine-Grain Parallelism in Dataflow Programs". The International Journal of Supercomputer Applications, Vol. 2, No. 3, November 1988.

[Clar83] Clark, D.W., "Cache Performance in the VAX-11/780". ACM Trans. on Comp. Sys., Vol. 1, No. 1, February 1983, pp. 24-37.

[Egge88] Eggers, S.J., and Katz, R.H., "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation". Proc. of the 15th Annual Int. Symp. on Comp. Arch., Honolulu, Hawaii, June 1988, pp. 373-383.

[Good83] Goodman, J.R., "Using Cache Memory to Reduce Processor-Memory Traffic". Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, 1983.

[Hals88] Halstead, R.H., Jr., and Fujita, T., "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing". Proc. of the 15th Annual Int. Symp. on Comp. Arch., Honolulu, Hawaii, June 1988, pp. 443-451.

[Ianu88] Ianucci, R.A., "Toward a Dataflow / von Neumann Hybrid Architecture". Proc. of the 15th Annual Int. Symp. on Comp. Arch., Honolulu, Hawaii, June 1988, pp. 131-140.

[Nikh89] Nikhil, R.S., and Arvind, "Can Dataflow Subsume von Neumann Computing?". Proc. of the 16th Annual Int. Symp. on Comp. Arch., Jerusalem, Israel, June 1989.

[Russ78] Russell, R.M., "The CRAY-1 Computer System". Comm. of the ACM, Vol. 21, No. 1, January 1978, pp. 63-72.

[Saav90] Saavedra-Barrera, R.H., and Culler, D.E., "An Analytical Solution for a Markov Chain Modeling Multithreaded Execution". University of California, Berkeley, technical report in preparation.

[Smit78] Smith, B.J., "A Pipelined, Shared Resource MIMD Computer". Proc. of the 1978 Int. Conf. on Parallel Processing, 1978, pp. 6-8.

[Smit82] Smith, A.J., "Cache Memories". ACM Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.

[Thie89] Thiebaut, D., "On the Fractal Dimension of Computer Programs and its Application to the Prediction of the Cache Miss Ratio". IEEE Trans. on Computers, Vol. 38, No. 7, July 1989, pp. 1012-1026.

[This88] Thistle, M.R., and Smith, B.J., "A Processor Architecture for Horizon". Supercomputing '88, Florida, October 1988, pp. 35-40.

[Webe89] Weber, W., and Gupta, A., "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results". Proc. of the 16th Annual Int. Symp. on Comp. Arch., Jerusalem, Israel, June 1989, pp. 273-280.