Realizing High IPC Through a Scalable Memory-Latency Tolerant Multipath Microarchitecture

D. Morano, A. Khalafi, D.R. Kaeli (Northeastern University) and A.K. Uht (University of Rhode Island)
dmorano, akhalafi, [email protected]; [email protected]

Abstract

A microarchitecture is described that achieves high performance on conventional single-threaded program codes without compiler assistance. To obtain high instructions per clock (IPC) for inherently sequential codes (e.g., SpecInt-2000 programs), a large number of instructions must be in flight simultaneously. However, several problems are associated with such designs, including scalability, issues related to control flow, and memory latency.

Our design investigates how to utilize a large mesh of processing elements in order to execute a single-threaded program. We present a basic overview of our microarchitecture and discuss how it addresses scalability as we attempt to execute many instructions in parallel. The microarchitecture makes use of control and value speculative execution, multipath execution, and a high degree of out-of-order execution to help extract instruction level parallelism. Execution-time predication and time-tags for operands are used for maintaining program order. We provide simulation results for several geometries of our microarchitecture illustrating a range of design tradeoffs. Results are also presented that show the small performance impact over a range of memory system latencies.

1 Introduction

A number of studies into the limits of instruction level parallelism (ILP) have been promising in that they have shown that there is a significant amount of parallelism within typical sequentially oriented single-threaded programs (e.g., SpecInt-2000). The work of Lam and Wilson [8], Uht and Sindagi [18], and Gonzalez and Gonzalez [4] has shown that there exists a great amount of ILP that is not being exploited by any existing computer designs. Unfortunately, most of the fine-grained ILP inherent in integer sequential programs spans several basic blocks. Data and control independent instructions, which may exist far ahead in the program instruction stream, need to be speculatively executed to exploit all possible inherent ILP.

A large number of instructions need to be fetched each cycle and executed concurrently in order to achieve this. We need to find the available program ILP at runtime and to provide sufficient hardware to expose, schedule, and otherwise manage the out-of-order execution of control independent instructions. Of course, the microarchitecture also has to provide a means to maintain the architectural program order that is required for proper program execution.

We present a novel microarchitecture in this paper that can be applied to any existing ISA. Our microarchitecture is targeted at obtaining substantial program speedups on integer codes. It can speculatively execute hundreds of instructions ahead in the program instruction stream and thus expose large amounts of inherent ILP. We use multipath execution to cover latencies associated with branch mispredictions. We also take advantage of control and data independent instructions through our use of execution-time predication. Since we provide local buffering (a form of L0 caching) throughout our grid layout, we can satisfy a large number of accesses without accessing higher levels of the memory hierarchy. Further, since our design utilizes a form of value speculation on all operands, we consume the present value of a register even if an outstanding load that targets this register is pending.

1.1 Related Work

There have been several attempts at substantially increasing program IPC through the exploitation of ILP.
The Multiscalar architecture [15] attempted to realize substantial IPC speedups over conventional superscalar processors. However, its approach is quite different from ours in that Multiscalar relies on compiler participation where we do not. A notable attempt at realizing high IPC was made by Lipasti and Shen with their Superspeculative architecture [9]. They achieved an IPC of about 7 with realistic hardware assumptions. The Ultrascalar machine [6] achieves asymptotic scalability, but only realizes a small amount of IPC due to its conservative execution model. Nagarajan et al. proposed a Grid Architecture of ALUs connected by an operand network [11]. This has some similarities to our work. However, unlike our work, their microarchitecture relies on the coordinated use of the compiler along with a new ISA to obtain higher IPCs.

The rest of this paper is organized as follows. Section 2 reviews the critical elements of our proposed microarchitecture. Section 3 presents simulation results for a range of machine configurations, and shows the potential impact of multipath execution when applied. We also discuss how our microarchitecture reduces our dependence on the memory system by providing a large amount of local caching on our datapath. We summarize and conclude in Section 4.

2 The Microarchitecture

We have described most of the details of this microarchitecture in [19]. We review those details here at a level that allows the reader to grasp a general understanding as it applies to memory system performance. In addition, this paper presents a microarchitecture with a realistic memory subsystem, so we are better able to evaluate the performance impact of the memory system.

The microarchitecture is very aggressive in terms of the amount of speculative execution it performs. This is realized through a large amount of scalable execution resources. Resource scalability of the microarchitecture is achieved through its distributed nature, along with repeater-like components that limit the maximum bus spans. Contention for major centralized structures is avoided. Conventional centralized resources like a register file, reorder buffer, and centralized execution units are eliminated.

The microarchitecture also addresses several issues associated with conditional branches. Spawning alternative speculative paths when encountering conditional branches is done to avoid branch misprediction penalties. Exploitation of control and data independent instructions beyond the join of a hammock branch [3, 17] is also capitalized upon where possible. Choosing which paths in multipath execution should be given priority for machine resources is also addressed by the machine. The predicted program path is referred to as the mainline path. We give execution resource priority to this mainline path with respect to any possible alternative paths. Since alternative paths have lower priority with respect to the mainline path, they are referred to as disjoint paths. This strategy for the spawning of disjoint paths results in what is termed disjoint eager execution (DEE). We therefore refer to disjoint paths simply as DEE paths. These terms are taken from Uht [18].

Time-tags are the basic mechanism used in the microarchitecture to order all operands, including memory operands, while they are being used by instructions currently being executed. Time-tags are small values associated with operands that serve as both an identifying tag and as a means to order operands with respect to each other. The Warp Engine [2] also used time-tags to manage large amounts of speculative execution, but our use of them is much simpler than theirs. Time-tags are present on some recent machines (e.g., P6, Pentium 4), though they are used for purposes different than those employed in our microarchitecture.

2.1 Microarchitecture Components

Figure 1 provides a high-level view of our microarchitecture. Our microarchitecture shares many basic similarities with most conventional machines. The main memory block, the L2 cache (unified in the present case), and the L1 instruction cache are all rather similar to those in common use. Except for the fact that the main memory, L2 cache, and L1 data cache are all address-interleaved, there is nothing further unique about these components. Our L1 data cache is similar to most conventional data caches except that it also has the ability to track speculative memory writes using a store buffer. Our L1 d-cache shares a similar goal with the Speculative Versioning Cache [5] but is simpler in some respects. Since we allow speculative memory writes to propagate out to the L1 data cache, multiple copies of a speculative write may be present in the L1 data cache store buffer at any time. They are differentiated from each other through the use of time-tags.
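To make the store-buffer versioning concrete, here is a minimal C sketch of the idea (the types, names, and linear search are our illustrative assumptions, not the actual hardware design): several speculative versions of the same address coexist, distinguished by time-tag, and a reader selects the newest version that is still earlier in program order than itself.

```c
#include <stdint.h>
#include <stddef.h>

/* One speculative version of a memory word held in the L1 store buffer.
 * Several entries may share the same address; the time-tag tells them apart. */
typedef struct {
    uint32_t addr;      /* word address of the speculative write            */
    uint32_t value;     /* speculatively written value                      */
    int      time_tag;  /* program-ordered position of the producing store  */
    int      valid;
} SBEntry;

/* Return the value a load with time-tag load_tt should observe: the
 * matching-address entry with the largest time-tag that is still earlier
 * in program order than the load itself.  Returns 0 if no speculative
 * version applies (the load then falls through to the L1 data array). */
static int sb_lookup(const SBEntry *sb, size_t n, uint32_t addr,
                     int load_tt, uint32_t *value_out)
{
    int best_tt = -1;
    for (size_t i = 0; i < n; i++) {
        if (sb[i].valid && sb[i].addr == addr &&
            sb[i].time_tag < load_tt && sb[i].time_tag > best_tt) {
            best_tt = sb[i].time_tag;
            *value_out = sb[i].value;
        }
    }
    return best_tt >= 0;
}
```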

Figure 1: High-level View of the Distributed Microarchitecture. Shown are the major hardware components of the microarchitecture. With the exception of the execution window block, this is similar to most conventional microarchitectures.

Figure 2: The Execution Window. Shown is a layout of the Active Stations (AS) and Processing Elements (PE) along with some bus interconnections to implement a large, distributed microarchitecture. Groups of ASes share a PE; a group is called a sharing group.

The i-fetch unit first fetches instructions from the i-cache along one or more predicted program paths. Due to our relatively large instruction fetch bandwidth requirement, we allow for the fetching of up to two i-cache lines in a single clock. Instructions are immediately decoded after being fetched. All further handling of the instructions is done in their decoded form. Decoded instructions are then staged into an instruction dispatch buffer so that they are available to be dispatched into the execution window when needed. This instruction dispatch buffer is organized so that a large number of instructions can be broadside loaded into the execution window in a single clock. The multiple buses going from the i-fetch unit to the execution window in Figure 1 are meant to reflect this operation. The maximum number of instructions dispatched into the execution window at a time (in a single clock) is termed the column height of the machine.

Conditional branches are predicted just before they are entered into the instruction dispatch buffer. The prediction of a branch then accompanies the decoded instruction if and when it is dispatched.

2.2 The Execution Window

Figure 2 shows a more detailed view of the execution window with its subcomponents. The execution window is where our microarchitecture differs substantially from existing machines. We have extended the idea of the reservation station [16] to provide the basic building block for a distributed microarchitecture. In our microarchitecture, an output result is not looped back to the input of the same reservation station that provided the result, but rather is forwarded to different stations that are spatially separated, in silicon or circuit board space, from the first. This operation is termed operand forwarding. Our adaptation of the reservation station is termed an active station (AS). Like a reservation station (or an issue slot for machines that have an issue window), an AS can only hold a single instruction at a time. However, instructions may be issued from the AS to its associated execution unit multiple times rather than only once. Several ASes may share the use of one or more execution units. The execution units that are dispersed among the ASes are termed processing elements (PEs). Each PE may consist of a unified all-purpose execution unit capable of executing any of the possible machine instructions or, more likely, consist of several functionally clustered units for specific classes of instructions (integer ALU, FP, or other). Instructions execute speculatively without necessarily waiting for their correct operands to arrive. Re-executions of instructions occur as needed to guarantee proper program dependencies. An instruction remains in the AS, possibly executing many times, until it can be retired (either committed or squashed).

As part of the strategy to allow for a scalable microarchitecture, we lay the ASes out in silicon on a two-dimensional grid, whereby sequentially dispatched instructions go to sequential ASes down a column of the grid. The use of a two-dimensional grid simply allows for a design implementation in either a single silicon IC or through several suitable ICs on a multi-chip module. The number of ASes in the height dimension of the grid is the same as the column height of the machine, introduced previously. The example machine of Figure 2 has a column height of six (six instruction load buses shown feeding six ASes).
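As a rough illustration of broadside loading (a hypothetical sketch; the six-row column matches the Figure 2 example, while the structure names are our own simplification), one column's worth of decoded instructions is copied into a freed AS column in a single step:

```c
#define COL_HEIGHT 6   /* column height of the Figure 2 example machine */

typedef struct { int opcode, dst, src1, src2; } DecodedInst;  /* decoded form */

typedef struct {
    DecodedInst inst;
    int occupied;      /* an AS holds at most one instruction at a time */
} ActiveStation;

/* Broadside-load one column's worth of decoded instructions from the
 * instruction dispatch buffer into a freed AS column in a single clock. */
static void dispatch_column(ActiveStation col[COL_HEIGHT],
                            const DecodedInst buf[COL_HEIGHT])
{
    for (int row = 0; row < COL_HEIGHT; row++) {
        col[row].inst = buf[row];   /* sequential instructions go down the column */
        col[row].occupied = 1;
    }
}
```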
Also employed within the execution window is a scheme to dynamically predicate, at execution time, all instructions that have been dispatched into active stations. This predication scheme essentially provides an execution predicate for each dispatched instruction (now in an AS). These execution predicates are just a single bit, but are entirely maintained and manipulated within the microarchitecture itself, not being visible at the ISA level of abstraction.

Groups of active stations, along with their associated PE, are called a sharing group (SG), since they all share execution resources with the set of ASes in the group. The example machine of Figure 2 consists of two columns of SGs, each with two SG rows. Sharing groups somewhat resemble the relationship between the register file, reorder buffer, reservation stations, and function units of most conventional microarchitectures. They have a relatively high degree of bus interconnectivity amongst them, as conventional microarchitectures do. The ASes serve the role of both the reservation station and the reorder buffer of more conventional machines. The transfer of a decoded instruction, along with its associated operands, from an AS to its PE is isolated to within the SG they belong to. This execution resource sharing arrangement also allows for reduced interconnections between adjacent SGs. Basically, only operand results need to flow from one SG to subsequent ones.

In our present microarchitecture, we always have two columns of ASes within an SG. The first AS column is reserved for the mainline path of the program and is labeled ML in the figure. The second column of ASes is reserved for the possible execution of a DEE path and is labeled DEE in the figure. In this machine example, each SG contains three rows of ASes (for a total of six) and a single PE.

Many machine sizes have been explored so far, but only a subset of these sizes is further investigated in this paper. A particular machine is generally characterized using the 4-tuple:

• sharing group rows
• active station rows per sharing group
• sharing group columns
• number of DEE paths allowed

These four characteristic parameters of a given machine greatly influence its performance, as expected, and the 4-tuple is termed the geometry of the machine. These four numbers are usually concatenated, so the geometry of the machine in Figure 2 would be abbreviated 2-3-2-2. When an entire column of ASes is free to accept new instructions, generally an entire column's worth of instructions is dispatched in a single clock to the free AS column from the instruction dispatch buffer.
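The geometry 4-tuple and the resource counts it implies can be summarized in a short sketch (our own encoding; it assumes, per the text, two AS columns, ML plus DEE, per sharing group):

```c
typedef struct {
    int sg_rows;        /* sharing group rows                    */
    int as_per_sg;      /* active station rows per sharing group */
    int sg_cols;        /* sharing group columns                 */
    int max_dee_paths;  /* number of DEE paths allowed           */
} Geometry;

/* AS rows in the grid: SG rows times AS rows per SG.  For the 16-8-8-8
 * geometry this yields the 128 AS rows cited in Section 3.2. */
static int as_rows(Geometry g)    { return g.sg_rows * g.as_per_sg; }

/* Total ASes, assuming one ML and one DEE AS column per SG. */
static int total_ases(Geometry g) { return as_rows(g) * 2 * g.sg_cols; }

/* The example machine of Figure 2, geometry "2-3-2-2":
 * as_rows = 6 and total_ases = 24. */
static const Geometry fig2 = { 2, 3, 2, 2 };
```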
From actual VHDL implementation and synthesis of the described machine components, and using the technology design rules used in the EV8 microprocessor [12], an estimate of the size of a machine in silicon can be made. It is estimated that an 8-4-8-8 geometry could be implemented in about 600 million transistors. When just considering the execution window of the machine (Figure 2), most of the silicon space (as might be expected) is taken up by execution resources, labeled as PEs, with floating point execution being particularly large. Components such as the ASes are relatively small. The amount of cache in the MFUs is flexible and usually takes up the next most amount of space after the execution units. A variety of larger sized machines could be implemented in silicon (as transistor budget allows) or in multichip modules.

2.3 Operand Forwarding and Machine Scalability

An interconnect fabric is provided to forward result operands from earlier ASes (in program-ordered time) to later ASes. Result operands are one of three possible types: register, memory, and instruction execution predicates. The interconnect allows for arbitrary numbers of sharing groups to be used in a machine while still keeping all bus spans to a fixed (constant) length. All of the buses in Figure 2, with the exception of the instruction load buses, form the interconnection fabric. Several bus arrangements are possible, but we only further explore one such arrangement (that shown in the figure). In the general case, several buses are used in parallel to make up a single forwarding span. This is indicated by the use of bold lines for buses in the figure. More than one bus in parallel for each bus span is generally required to meet the operand forwarding bandwidth needs of the machine.

Active bus repeater components are used (and required) to allow for constant-length bus spans. A bus repeater component is generally termed a forwarding unit (FU) and is so labeled in the figure. These forwarding units do more than just repeat operand values from one span of a bus to the next. For registers and memory, operands are filtered so that redundant forwards of the same value (as compared with that last forwarded) are eliminated. These can also be termed silent forwards. This filtering provides a means to reduce the overall bandwidth requirements of the forwarding interconnection fabric. Each forwarding unit employed in the present work also has a small amount of storage for memory operands. This storage serves as a cache for memory operand values. We term this small cache storage an L0 data cache. In the present design, the L0 data cache is fully associative, containing 32 entries, and resides within the memory filtering unit. There is one memory filtering unit per column in the models evaluated in this paper. We also include data that is snarfed¹ off the bus on a bus snoop and L0 data cache hit. The entire bus structure serves as a local caching network.

For register and predicate operands, values that are generated by ASes contend for one of the outbound buses (labeled shared operand forwarding buses in the figure) to forward the value. Requests for bus use will be satisfied with any bus clock-slot that may be available on any of the buses in parallel belonging to a given span. All other ASes on the outbound bus span snoop operand values forwarded from previous (in program order) ASes. In addition, a forwarding unit (the bus repeater) also snoops the same operands and forwards the operand value to the next bus span if necessary (if the value differs from the previous value). Register and predicate operands are also looped around from the bottom of one column of SGs to the top of the next column of SGs. Operands from the bottom of the far-right column of SGs get looped around to the top of the far-left column. Memory operands also utilize the same loop structure. This behavior forms the characteristic ring pattern of operand flow, inherent in many microarchitectures [13]. Forming a closed loop with these buses, and essentially just renaming columns (identifying the one closest to retirement), is easier than physically transferring (shifting) the contents of one column to the next when a column of ASes retires.

¹Snarfing implies we snoop a bus, find a match on the current bus contents, and read the associated data value.
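The silent-forward filtering performed by a forwarding unit can be sketched as follows (a single-entry filter for illustration only; the actual FU tracks many operands and also keeps a 32-entry fully associative L0 data cache for memory values):

```c
#include <stdint.h>

/* Last-forwarded value remembered by a forwarding unit (FU) for one
 * operand address; a minimal single-entry sketch. */
typedef struct {
    uint32_t addr;        /* register number or memory address */
    uint32_t last_value;  /* value most recently forwarded      */
    int      seen;
} FilterEntry;

/* Forward an operand onto the next bus span only if its value differs
 * from the last one forwarded for that address; identical ("silent")
 * forwards are dropped to save bus bandwidth.  Returns 1 if forwarded. */
static int fu_forward(FilterEntry *e, uint32_t addr, uint32_t value)
{
    if (e->seen && e->addr == addr && e->last_value == value)
        return 0;          /* silent forward: filtered out            */
    e->addr = addr;
    e->last_value = value;
    e->seen = 1;
    return 1;              /* repeat the operand onto the next span   */
}
```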
For memory operands, a second operand forwarding strategy is used. When memory operands are generated by ASes, the AS contends for one of the outbound buses (labeled shared operand forwarding buses in Figure 2) in order to forward the operand value. However, unlike the register and predicate operand forwarding strategy, a memory load request (without data) travels backwards, in program-ordered time, and gets snooped by the forwarding units that are at the top of each SG column. This is done so that the operand can be transferred onto a memory operand transfer bus, shown at the top of Figure 2. These buses are address-interleaved and provide the connectivity to transfer memory operands (generally speculative) to the L1 data cache. Memory values are tentatively stored in a store buffer, along with their associated operand time-tags, until a committed value is determined. Similarly, operands returning from the L1 data cache to service requests from ASes are first put on one of the memory operand transfer buses (based on the interleave address of the operand). These operands then get snooped by all of the forwarding units at the top of each SG column, after which the operand is forwarded on a shared operand forwarding bus (shown vertically) to reach the requesting ASes.

Persistent register state, predicate state, and some persistent memory state are stored in the forwarding units. Persistent state is not stored indefinitely in any single forwarding unit, but is rather stored in different units as the machine executes column shift operations (columns of ASes get retired and committed). However, this is all quite invisible to the ISA.

This microarchitecture also implements precise exceptions [14] similarly to how they are handled in most speculative machines. A speculative exception (whether on the mainline path or a DEE path) is held pending (not signaled in the ISA) in the AS that contains the generating instruction until it would be committed. No action is needed for pending exceptions in ASes that eventually get squashed. When an AS with a pending exception does commit, the machine directs the architected control flow off to an exception handler through the defined exception behavior for the given ISA. This might include saving the precise instruction return address to either an ISA architected register or memory. Typically, the exception handler code will save the architected registers to memory using normal store instructions of the ISA. Interrupts can be handled in more flexible ways than exceptions. One way to handle interrupts is to allow all instructions currently being executed within the execution window to reach commitment; architected program flow can then vector off to a code handler, similarly to the case of instruction exceptions above.

2.4 Enforcing Program Order and Dependencies

Program dependencies (control, register, and memory) are maintained through the use of time-tags, which are associated with all operands within the machine. This has some resemblance to the register tags used in more conventional microarchitectures, but has been generalized for use in this distributed microarchitecture. Since instructions remain in the ASes until they retire, the whole set of ASes fulfills the role of the reorder buffer or register update unit of more conventional microarchitectures.

As a column of ASes gets retired, that column becomes available for newly decoded instructions to be dispatched to it. The time-tag associated with each column is decremented. Time-tags associated with operands can be decomposed into row and column parts. The column part of the operand time-tag is identically the column time-tag, so when a column has its time-tag decremented, it effectively renames the operands within that column. The next column in the machine (with the next higher time-tag) becomes the next column that will get retired. The operation of decrementing column time-tags in the execution window is termed a column shift.

The hardware used for the snooping of an input operand of an AS is shown in Figure 3. Basically, a new operand is snarfed when it has the same address and path identifier as the current AS, as well as a time-tag value that is less than that of the current AS itself but greater than or equal to that of the last snarfed operand. Simpler snooping hardware is used in forwarding units. A more detailed discussion of the mechanism used for enforcing program dependencies can be found in a report by Kaeli et al. [7].

Figure 3: Operand snoop logic within an AS. The logic used for snooping of input operands for ASes is shown.
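A minimal C sketch of the Figure 3 snoop condition (our own encoding of the comparison rules; the field names and widths are illustrative assumptions):

```c
#include <stdint.h>

typedef struct {
    uint32_t addr;   /* register number, memory address, or predicate id */
    int      path;   /* mainline / DEE path identifier                   */
    int      tt;     /* operand time-tag                                 */
    uint32_t value;
} Operand;

typedef struct {
    int     as_tt;     /* time-tag of the instruction in this AS         */
    int     as_path;   /* path identifier of this AS                     */
    Operand input;     /* last snarfed value for this input operand      */
    int     has_input;
} ASInputSlot;

/* The Figure 3 snoop condition: snarf a forwarded operand if it matches
 * this AS's operand address and path, is older (smaller time-tag) than
 * the AS itself, and is at least as young as the last operand already
 * snarfed.  A hit replaces the input value and triggers execution or
 * re-execution of the instruction held in the AS. */
static int snoop(ASInputSlot *s, uint32_t watched_addr, const Operand *fwd)
{
    if (fwd->addr != watched_addr)              return 0;
    if (fwd->path != s->as_path)                return 0;
    if (!(fwd->tt < s->as_tt))                  return 0;  /* must be earlier        */
    if (s->has_input && fwd->tt < s->input.tt)  return 0;  /* older than last snarf  */
    s->input = *fwd;                            /* snarf: capture the new value      */
    s->has_input = 1;
    return 1;                                   /* schedule execute / re-execute     */
}
```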

2.5 Multipath Execution

If a conditional backward branch is predicted taken, the i-fetch unit will speculatively follow it and continue dispatching instructions into the execution window for the mainline path from the target of the branch. This case allows for the capture of program loops within the execution window of the machine and can be thought of as hardware loop unrolling. For a backward branch that is predicted not-taken, we continue dispatching instructions following the not-taken output path.

If a forward branch has a near target, such that it and its originating branch instruction will both fit within the execution window at the same time, then we dispatch instructions following the not-taken output path of the branch, whether or not that is the predicted path. This represents the fetching of instructions in the memory (static) order rather than the dynamic program order. The fetching and dispatching of instructions following the not-taken output path (static program order) of a conditional branch is very advantageous for capturing hammock-styled branch constructs. Since simple single-sided hammock branches generally have near targets, they are captured within the execution window. Our mainline path continues along the predicted branch path, regardless of whether it was the taken or not-taken path. We spawn a DEE path for the opposite outcome of the branch.

For forward branches with a far target, if the branch is predicted taken, we dispatch instructions following the target of the branch. If the branch is predicted not-taken, we continue dispatching instructions for the mainline path following the not-taken outcome of the branch. In both of these cases, we do not spawn a DEE path for this branch.

DEE paths are created by dispatching instructions to a free column of ASes that is designated for holding DEE paths. The instructions dispatched as a DEE path will be the same instructions that were previously dispatched as the mainline path, where both the mainline and DEE paths share the same generating conditional branch. However, there are a limited number of AS columns available at any one time for DEE paths in the machine, so some strategy for spawning DEE paths is needed. Refer to [19] for a full description of our spawning algorithms.
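The branch-handling rules above can be condensed into a small decision sketch (our own simplification; it omits the resource checks of the actual spawning algorithms in [19], and it treats the backward-branch case as following the prediction without a DEE spawn, which the prose does not fully specify):

```c
typedef enum { DISPATCH_TAKEN_TARGET, DISPATCH_NOT_TAKEN } FetchDir;

typedef struct {
    int backward;        /* branch target precedes the branch instruction */
    int near_target;     /* branch and target both fit in the window      */
    int predicted_taken; /* branch predictor outcome                      */
} Branch;

/* Sketch of the Section 2.5 dispatch rules:
 * - near forward branches: follow the not-taken (static) path regardless
 *   of the prediction, and spawn a DEE path for the opposite outcome;
 * - far forward branches: follow the prediction, no DEE path;
 * - backward branches: follow the prediction (a taken backward branch
 *   captures loops, acting as hardware loop unrolling).                */
static FetchDir choose_path(Branch b, int *spawn_dee)
{
    if (!b.backward && b.near_target) {
        *spawn_dee = 1;                  /* opposite outcome gets a DEE column */
        return DISPATCH_NOT_TAKEN;       /* static program order               */
    }
    *spawn_dee = 0;
    return b.predicted_taken ? DISPATCH_TAKEN_TARGET : DISPATCH_NOT_TAKEN;
}
```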
Table 1: Characteristics of benchmark programs.

benchmark      bzip2   parser  go      gzip    gap
br pred acc    90.5%   92.6%   72.1%   85.4%   94.5%
L1-I hit rate  97.2%   96.6%   92.4%   94.7%   89.0%
L1-D hit rate  98.8%   99.0%   98.8%   99.8%   99.3%
L2 hit rate    90.1%   86.0%   96.8%   73.0%   88.5%
dyn cond brs   12.0%   11.0%   12.1%   13.4%    6.5%

Table 2: Machine geometries studied.

SG rows   ASes per SG   SG columns   max DEE paths
   8           4             8              8
   8           8             8              8
  16           8             8              8
  32           2            16             16
  32           4            16             16

3 Simulation Results

We first describe our simulation methodology. Then IPC data for multipath execution is given. Finally, results showing the sensitivity of our machine to varying the latencies of several components in the memory hierarchy are presented.

3.1 Methodology

The simulator is a recently built tool that shares some similarity with SimpleScalar [1] but was not based on it. We execute SpecInt-2000 and SpecInt-95 programs on a simulated machine that primarily features the MIPS-1 ISA, but with some MIPS-2 and MIPS-3 ISA instructions added. We are using the standard SGI Irix system libraries, so we needed to also support the execution of some MIPS-2 and MIPS-3 instructions (present in the libraries). All programs were compiled on an SGI machine under the Irix 6.4 OS using the standard SGI compiler and linker. Programs were compiled with standard optimization (-O) for primarily the MIPS-1 ISA (-mips1).

We chose five benchmark programs to work with, four from the SpecInt-2000 benchmark suite and one from the SpecInt-95 suite. These programs were chosen to get a range of different memory and looping behavior, while also presenting challenging conditional control flow behavior. The particular programs used, along with some statistics, are given in Table 1. The dynamic conditional branch figures in Table 1 are a percentage of total dynamic instructions. All programs were executed using the SpecInt reference inputs. All accumulated data was gathered over the simulated execution of 500 million instructions, after having skipped the first 100 million instructions. The first 100 million instructions were used to warm up the various simulator memory caches.
Table 3: General machine characteristics. These machine parameters are used as defaults for all simulations, except where one of these parameters is varied.

L1 I/D cache access latency              1 clock
L1 I/D cache size                        64 KBytes
L1 I/D block size                        32 bytes
L1 I/D organization                      2-way set associative
L2 cache access latency                  10 clocks
L2 cache size                            2 MBytes
L2 block size                            32 bytes
L2 organization                          direct mapped
main memory access latency               100 clocks
memory interleave factor                 4
forwarding unit minimum latency (all)    1 clock
forwarding-bus latency (all)             1 clock
number of forwarding buses in parallel   4
branch predictor                         PAg, 1024 PBHT entries, 4096 GPHT entries, saturating 2-bit counters

3.2 IPC Results

In this section we present IPC data for multipath execution on five machine geometries. The parameters of each of the major machine components, for each of the five simulated geometries, are given in Table 2. Although we have explored a large number of machine sizes, these particular geometries were chosen in order to get a range of IPC performance across a number of very different machine sizes and shapes. The common machine characteristics used in this section for obtaining IPC results are given in Table 3. The L1, L2, and main memory access latencies do not include the forwarding unit and forwarding bus delays within the execution window. These machine characteristics are fairly representative of typical existing values for a 2 GHz processor. They are similar to, or more conservative than, a recent Pentium-4 (0.13 um at 2.4 GHz) processor [10].

The results for multipath execution are presented in Table 4. The geometry labels (4-tuples) at the top of the table consist of the concatenated numbers of machine components for: SG rows, AS rows per SG, SG columns, and the number of DEE paths allowed for that execution. In addition to the individual benchmark IPC results, we also present the harmonic mean of the IPC across all benchmarks. From these results, it is observed that our DEE multipath execution mode provides between 39 and 50 percent IPC speedup over conventional singlepath execution. Our lowest performing machine geometry (8-4-8-8), when executing in singlepath mode, yielded a harmonic mean IPC of 3.2. However, the same geometry machine, when executing using the DEE multipath strategy, yielded a harmonic mean IPC of 4.8 (a 50% speedup). The largest machine geometry simulated (32-4-16-16) using singlepath execution yielded a harmonic mean IPC of 4.6. Our DEE multipath execution of the same geometry yielded a harmonic mean IPC of 6.5 (about a 41% speedup). The lowest IPC speedup occurred for the machine geometry 16-8-8-8 and was about 39%. This geometry had the lowest speedup from multipath execution because its ratio of AS rows (128) to the maximum possible DEE paths allowed (8) was the lowest of the geometries explored.

Table 4: IPC results for multipath execution.

geometry             8-4-8-8  8-8-8-8  16-8-8-8  32-2-16-16  32-4-16-16
bzip2                   4.2      5.0      5.8        5.4         5.7
parser                  4.3      4.6      5.3        5.0         5.4
go                      5.1      5.9      6.7        6.5         6.8
gzip                    5.0      6.3      7.0        6.7         7.2
gap                     6.0      7.5      7.5        8.9         7.9
HAR-MEAN                4.8      5.7      6.4        6.3         6.5
% speedup over SP        50       46       39         50          41
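As a worked check on Table 4, the HAR-MEAN row is the harmonic mean of the per-benchmark IPCs. A minimal sketch for the 8-4-8-8 multipath column (the 3.2 singlepath mean is taken from the text above):

```c
#include <stdio.h>

/* Harmonic mean of per-benchmark IPCs, as used for the HAR-MEAN row. */
static double harmonic_mean(const double *ipc, int n)
{
    double denom = 0.0;
    for (int i = 0; i < n; i++)
        denom += 1.0 / ipc[i];
    return n / denom;
}

int main(void)
{
    /* Table 4, geometry 8-4-8-8 (multipath): bzip2, parser, go, gzip, gap. */
    double ipc[] = { 4.2, 4.3, 5.1, 5.0, 6.0 };
    double hm = harmonic_mean(ipc, 5);   /* ~4.8                              */
    double sp = 3.2;                     /* singlepath harmonic mean (text)   */
    printf("HM = %.1f, speedup = %.0f%%\n", hm, (hm / sp - 1.0) * 100.0);
    return 0;                            /* prints roughly the quoted 50%     */
}
```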

3.3 Memory Sensitivity Results

In this section we present IPC data corresponding to varying some parameters associated with the memory subsystem. We show the degradation in IPC when varying the access latencies, in clocks, of the L1 D-cache, the L2 cache, and main memory. All of this data was gathered on a machine geometry of 16-8-8-8, with the other (non-varied) parameters listed in Table 3. All results are relative to the fastest hit latency for that level of the memory hierarchy.

Figure 4 presents IPC degradation results as the L1 D-cache hit latency is varied from one to eight clocks. IPC is lowered by as much as 46% when the L1 hit latency is increased from 1 to 8 cycles. But because of the introduction of L0 caches in our filter units, the number of L1 cache accesses is significantly reduced. Also, the good news is that a latency of 1 or 2 clocks for L1 caches is more likely to scale better with increasing processor clock rates than memory components further from the processor. We anticipate an impact of up to 10% if we move from 1 to 2 clocks.

Figure 4: Machine IPC speedup results for varying L1 D-cache hit delay in clocks.

Figure 5 presents the IPC degradation results as the L2 cache (unified I/D) hit latency is varied from 1 up to 16 clocks (our design choice was 10 clocks). Finally, Figure 6 presents the IPC degradation results as the main memory access latency is varied from 20 clocks up to 800 clocks. For the L2 cache and main memory sensitivity graphs, the impact on IPC is much less severe than what was experienced with L1. Although L2 latencies are likely to also scale somewhat with future increasing processor clock rates, they are not likely to scale as well as L1 is expected to do. Fortunately, our machine already obtains good IPC numbers for an L2 latency of 10 clocks.

Figure 5: Machine IPC speedup results for varying L2 cache hit delay in clocks.

With respect to main memory, the microarchitecture is quite insensitive to latencies up to 100 clocks, and only starts to degrade slightly after that. Since 100 clocks (as we count it, after our repeater and bus delays) is probably typical at the present time (assuming a 2 GHz CPU clock rate and the latest DDR-SDRAMs), our memory system arrangement is properly hiding most of the long main memory latency, as it should. Since our machine is still quite insensitive to main memory latency out to 800 clocks, we might expect to operate the current machine at up to about 10 GHz with similar performance. Our insensitivity to main memory latency is due both to the conventional use of L1 and L2 caches and to the width of our execution window.
When memory load requests are generated by instructions soon after they are loaded into the execution window, the width of the machine (in SG columns) provides substantial time for those memory load requests to be satisfied, even when they have to go back to L1, L2, or main memory.

Figure 6: Machine IPC speedup results for varying main memory access latency in clocks.

3.4 L0 Cache Results

In Table 5 we show L0 hit rates for two of the programs. We break down L0 accesses/hits into those due to loads that generated a backward-going request and those due to load values being forwarded to the AS without the L0 receiving a forwarding request. On each load, we send out a backward-going request to both L0 and L1. As we can see, L0 is servicing a large percentage (34% for parser) of all load requests. This will reduce the impact of memory latency on IPC. In future work we will look at the effects of increasing the amount of buffer memory in the filter units.

Table 5: L0 hit rates for bzip2 and parser. All percentages are relative to the total number of loads issued. Results are for a 16-8-8-8 machine geometry.

                                            bzip2   parser
% of all loads satisfied by L0
  due to a backwarding request               3.6%    5.2%
% of all loads satisfied by L0
  but w/o any backwarding request           18.2%   28.8%

We have also run experiments modeling a perfect i-cache (100% hit, 1 cycle) and measured the effects of using d-caches (L1/L2) with hit times of (1/10) cycles for an optimized design and (8/16) cycles for a slower memory system. As we scale the size of the microarchitecture geometry, the impact of the increased hit times diminishes with increased machine size. We are presently looking at the effect of L0 and L1 d-cache organizations and their impact on IPC for large machine geometries.

4 Conclusions

We have presented the overview of a large-scale distributed microarchitecture suitable for extracting high ILP from sequential programs. This microarchitecture is designed to also implement speculative multipath execution. We presented results for the machine executing in multipath mode versus singlepath mode. It was shown that multipath execution provides IPC performance speedups over singlepath execution of 39 to 50 percent. We also showed that our microarchitecture exhibits significant insensitivity to a wide range of memory system component latencies. This is due to the use of a large architecture, load value and address speculation, and the use of distributed L0 data caches within the microarchitecture.

References

[1] Austin T.M., Burger D. SimpleScalar Tutorial. In Proc. of MICRO-30, Nov 1997.
[2] Cleary J.G., Pearson M.W. and Kinawi H. The Architecture of an Optimistic CPU: The Warp Engine. In Proceedings of the Hawaii International Conference on System Science, pages 163–172, January 1995.
[3] Ferrante J., Ottenstein K., Warren J. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, July 1987.
[4] Gonzalez J. and Gonzalez A. Limits on Instruction-Level Parallelism with Data Speculation. Technical Report UPC-DAC-1997-34, UPC, Barcelona, Spain, 1997.
[5] Gopal S., Vijaykumar T.N., Smith J.E., Sohi G.S. Speculative versioning cache. In Proceedings of the 4th International Symposium on High Performance Computer Architecture. IEEE, Feb 1998.
[6] Henry D.S., Kuszmaul B.C., Loh G.H. and Sami R. Circuits for Wide-Window Superscalar Processors. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 236–247. ACM, June 2000.
[7] Kaeli D., Morano D.A., Uht A.K. Preserving Dependencies in a Large-Scale Distributed Microarchitecture. Technical Report 022002-001, Dept. of ECE, URI, Dec 2001.
[8] Lam M.S. and Wilson R.P. Limits of Control Flow on Parallelism. In Proc. of ISCA-19, pages 46–57. ACM, May 1992.
[9] Lipasti M.H. and Shen J.P. Superspeculative Microarchitecture for Beyond AD 2000. IEEE Computer Magazine, 30(9), September 1997.
[10] Ludloff C. IA-32 Implementation, Intel P4. http://www.sandpile.org/impl/p4.htm, Jan 2002.
[11] Nagarajan R., Sankaralingam K., Burger D., Keckler S.W. A design space evaluation of grid processor architectures. In Proceedings of the 34th International Symposium on Microarchitecture, New York, NY, Nov 2001. ACM Press.
[12] Preston R.P., Badeau R.W., Bailey D.W., Bell S.L., Biro L.L., Bowhill W.J., Dever D.E., Felix S., Gammack R., Germini V., Gowan M.K., Gronowski P., Jackson D.B., Mehta S., Morton S.V., Pickholtz J.D., Reilly N.H., Smith M.J. Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In Proceedings of the International Solid State Circuits Conference, Jan 2002.
[13] Ranganathan N., Franklin M. An empirical study of decentralized ILP execution models. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 272–281, New York, NY, October 1998. ACM Press.
[14] Smith J.E., Pleszkun A.R. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers, 37(5):562–573, Sep 1988.
[15] Sohi G., Breach S., Vijaykumar T. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 414–425. ACM, June 1995.
[16] Tomasulo R.M. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11(1):25–33, Jan 1967.
[17] Uht A.K. An Efficient Hardware Algorithm to Extract Concurrency From General-Purpose Code. In Proceedings of the 19th Annual Hawaii International Conference on System Sciences, pages 41–50, January 1986.
[18] Uht A.K. and Sindagi V. Disjoint Eager Execution: An Optimal Form of Speculative Execution. In Proc. MICRO-28, pages 313–325. ACM, Nov 1995.
[19] Uht A.K., Khalafi A., Morano D.A., de Alba M. and Kaeli D. Realizing High IPC Using Time-Tagged Resource-Flow Computing. In Proceedings of Europar 2002, to appear, Paderborn, Germany, 2002.