CS252 Spring 2017 Graduate Computer Architecture Lecture 9

Lecture 9: Vector Supercomputers
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17

Last Time in Lecture 8

Overcoming the worst hazards in OoO superscalars:
§ Branch prediction
- Bimodal
- Local/branch history table
- Global/gselect, gshare
- Tournament
- Branch address cache (predict multiple branches per cycle)
- Trace cache
- Return address predictors
§ Today - load/store queues, vector supercomputers

Load-Store Queue Design

§ After control hazards, data hazards through memory are probably the next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be issued speculatively

Speculative Store Buffer

§ Just like register updates, stores should not modify memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, a store buffer slot is allocated in program order
§ Stores are split into “store address” and “store data” micro-operations
§ “Store address” execution writes the tag
§ “Store data” execution writes the data
§ A store commits when it is the oldest instruction and both address and data are available:
- clear the speculative bit and eventually move the data to the cache
§ On store abort:
- clear the valid bit

[Figure: speculative store buffer entries, each holding V (valid) and S (speculative) bits plus an address tag and data, with a store commit path into the tag and data arrays of the L1 data cache]

Load Bypass from Speculative Store Buffer

[Figure: a load address probes the speculative store buffer and the L1 data cache in parallel; either one can supply the load data]

§ If the data is in both the store buffer and the cache, which should we use?
- The speculative store buffer
§ If the same address is in the store buffer twice, which should we use?
- The youngest store older than the load

Memory Dependencies

sd x1, (x2)
ld x3, (x4)

§ When can we execute the load?
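To make the bypass and commit rules concrete, here is a minimal C sketch of a speculative store buffer (the entry count, field names, and flat byte-addressed interface are assumptions for illustration, not the hardware design on the slides):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SB_ENTRIES 8 /* assumed buffer size */

typedef struct {
    bool     valid;       /* slot allocated at decode, in program order */
    bool     speculative; /* set until the store commits                */
    bool     addr_ready;  /* "store address" micro-op has executed      */
    bool     data_ready;  /* "store data" micro-op has executed         */
    uint64_t addr;
    uint64_t data;
    unsigned age;         /* program order; smaller = older             */
} SbEntry;

static SbEntry sb[SB_ENTRIES];

/* Load bypass: pick the YOUNGEST store that is OLDER than the load and
 * matches the address; it supersedes both older matching stores and the
 * stale copy in the L1 data cache. Returns false -> read the cache. */
bool load_bypass(uint64_t addr, unsigned load_age, uint64_t *out) {
    const SbEntry *best = NULL;
    for (int i = 0; i < SB_ENTRIES; i++) {
        const SbEntry *s = &sb[i];
        if (s->valid && s->addr_ready && s->data_ready &&
            s->addr == addr && s->age < load_age &&
            (best == NULL || s->age > best->age))
            best = s;
    }
    if (!best) return false;
    *out = best->data;
    return true;
}

int main(void) {
    /* Two speculative stores to address 0x100; a load younger than both
     * must see the second store's value (age 1), not the first (age 0). */
    sb[0] = (SbEntry){true, true, true, true, 0x100, 11, 0};
    sb[1] = (SbEntry){true, true, true, true, 0x100, 22, 1};
    uint64_t v;
    if (load_bypass(0x100, /*load_age=*/2, &v))
        printf("load bypasses store buffer, value %llu\n",
               (unsigned long long)v);
    return 0;
}
```

Note that this sketch quietly assumes every older store's address is already known; the slides that follow spell out the conservative, speculative, and predicted ways real machines handle the case where it is not.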
In-Order Memory Queue

§ Execute all loads and stores in program order
§ => A load or store cannot leave the ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out of order with respect to other instructions
§ Need a structure to handle memory ordering…

Conservative O-o-O Load Execution

sd x1, (x2)
ld x3, (x4)

§ Can execute the load before the store, if both addresses are known and x4 != x2
§ Each load address is compared with the addresses of all previous uncommitted stores
- can use a partial conservative check, e.g., the bottom 12 bits of the address, to save hardware
§ Don’t execute the load if any previous store address is not known
§ (MIPS R10K: 16-entry address queue)

Address Speculation

sd x1, (x2)
ld x3, (x4)

§ Guess that x4 != x2
§ Execute the load before the store address is known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If we subsequently find x4 == x2, squash the load and all following instructions
§ => Large penalty for inaccurate address speculation

Memory Dependence Prediction (Alpha 21264)

sd x1, (x2)
ld x3, (x4)

§ Guess that x4 != x2 and execute the load before the store
§ If we later find x4 == x2, squash the load and all following instructions, but mark the load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear the store-wait bits

Supercomputer Applications

§ Typical application areas
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)
- Bioinformatics
- Cryptography
§ All involve huge computations on large data sets
§ Supercomputers: CDC 6600, CDC 7600, Cray-1, …
§ In the 70s-80s, supercomputer ≡ vector machine

Vector Supercomputers

§ Epitomized by the Cray-1, 1976:
§ Scalar unit
- Load/store architecture
§ Vector extension
- Vector registers
- Vector instructions
§ Implementation
- Hardwired control
- Highly pipelined functional units
- Interleaved memory system
- No data caches
- No virtual memory
[© Cray Research, 1976]

Cray-1 Internals

[Photo: Cray-1 internals displayed at EPFL. Photograph by Rama, Wikimedia Commons, CC-BY-SA-2.0-FR]

Vector Programming Model

[Figure: scalar registers x0–x31 beside vector registers v0–v31, each vector register holding elements [0], [1], …, [VLRMAX-1], with a vector length register VLR; a vector arithmetic instruction vadd v3, v1, v2 applies an adder at every element position up to [VLR-1]; a vector load vld v1, x1, x2 fills v1 from memory using base x1 and stride x2]

Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar code
  li x4, 64
loop:
  fld f1, 0(x1)
  fld f2, 0(x2)
  fadd.d f3, f1, f2
  fsd f3, 0(x3)
  addi x1, 8
  addi x2, 8
  addi x3, 8
  subi x4, 1
  bnez x4, loop

# Vector code
  li x4, 64
  setvlr x4
  vfld v1, x1
  vfld v2, x2
  vfadd.d v3, v1, v2
  vfsd v3, x3
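The example above hard-codes a vector length of 64, the machine's maximum. For an arbitrary trip count, a compiler strip-mines the loop: it processes up to the maximum vector length per iteration and sets the vector length register for the final partial strip. A sketch of the idea in C, with the vector instruction modeled as a plain loop (`MVL` and the helper name are illustrative, not a real intrinsic API):

```c
#include <stddef.h>

#define MVL 64 /* maximum vector length (VLRMAX on the slides) */

/* Model of one vector instruction executing vl element operations.
 * On a real machine this would be a single vfadd.d under VLR = vl. */
static void vfadd_d(double *c, const double *a, const double *b, size_t vl) {
    for (size_t i = 0; i < vl; i++)   /* element ops are independent */
        c[i] = a[i] + b[i];
}

/* Strip-mined C[i] = A[i] + B[i] for arbitrary n, not just n = 64. */
void vadd_stripmine(double *C, const double *A, const double *B, size_t n) {
    for (size_t i = 0; i < n; i += MVL) {
        size_t vl = (n - i < MVL) ? (n - i) : MVL; /* setvlr vl          */
        vfadd_d(&C[i], &A[i], &B[i], vl);          /* vfld, vfadd.d, vfsd */
    }
}
```

Every strip except possibly the last runs at the full vector length, so the short-strip overhead is paid at most once per loop.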
Cray-1 (1976)

[Figure: Cray-1 block diagram. Eight 64-element vector registers V0–V7 with vector length and vector mask registers; eight scalar registers S0–S7 backed by 64 T registers; eight address registers A0–A7 backed by 64 B registers; four instruction buffers (16 × 64 bits each) feeding the instruction pipeline (NIP, CIP, LIP); functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply; single-ported memory of 16 banks of 64-bit words with 8-bit SECDED, sustaining 80 MW/sec data load/store and 320 MW/sec instruction buffer refill; memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)]

Vector Instruction Set Advantages

§ Compact
- one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)
§ Scalable
- can run the same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because the elements in a vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline computing v3 <- v1 * v2, with streams of v1 and v2 elements entering and v3 elements emerging]

Vector Instruction Execution

vfadd.d vc, va, vb

[Figure: execution with one pipelined functional unit retires one element C[i] per cycle; execution with four pipelined functional units retires four elements per cycle, with elements interleaved across the units]

Interleaved Vector Memory System

§ Bank busy time: time before the bank is ready to accept the next request
§ Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency

[Figure: base and stride feed an address generator that spreads element addresses across 16 memory banks (0–F), which fill the vector registers]
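Whether a strided access keeps all the banks busy depends on how the stride interacts with the bank count. The bank count and busy time below come from the Cray-1 slide above; the one-address-per-cycle issue model and the function names are illustrative assumptions. This sketch counts how long a 64-element strided access takes when elements collide on busy banks:

```c
#include <stdio.h>

#define NBANKS    16  /* Cray-1: 16 interleaved banks       */
#define BUSY_TIME  4  /* cycles before a bank can be reused */

/* Issue one element address per cycle; stall while the target bank
 * is still busy. Returns total cycles for a 64-element access. */
static int cycles_for_stride(int stride) {
    int ready[NBANKS] = {0}; /* cycle at which each bank frees up */
    int cycle = 0;
    for (int e = 0; e < 64; e++) {
        int bank = (e * stride) % NBANKS;  /* word address mod #banks */
        if (cycle < ready[bank]) cycle = ready[bank]; /* stall */
        ready[bank] = cycle + BUSY_TIME;
        cycle++; /* next element issues at best one cycle later */
    }
    return cycle;
}

int main(void) {
    /* stride 1 visits all 16 banks in turn: no stalls (64 cycles);
       stride 8 uses only banks 0 and 8: heavy stalls. */
    for (int s = 1; s <= 16; s *= 2)
        printf("stride %2d: %3d cycles\n", s, cycles_for_stride(s));
    return 0;
}
```

A stride that shares a factor with the number of banks concentrates the accesses on a few banks, so throughput degrades toward one element per bank busy time instead of one per cycle.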
Vector Unit Structure

[Figure: vector register elements interleaved across four lanes - lane 0 holds elements 0, 4, 8, …; lane 1 holds 1, 5, 9, …; lane 2 holds 2, 6, 10, …; lane 3 holds 3, 7, 11, … - with each lane containing a slice of every functional unit and its own port to the memory subsystem]

T0 Vector Microprocessor (UCB/ICSI, 1995)

[Figure: T0 die photo with vector register elements striped over eight lanes: lane 0 holds elements 0, 8, 16, 24; lane 1 holds 1, 9, 17, 25; and so on up to lane 7]

Vector Instruction Parallelism

§ Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes

[Figure: load, multiply, and add units each processing a different vector instruction in the same cycle]

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining

§ Vector version of register bypassing
- introduced with the Cray-1

vld v1
vfmul v3, v1, v2
vfadd v5, v3, v4

[Figure: the load unit's output is chained into the multiplier and the multiplier's output is chained into the adder, so all three instructions are in flight at once]

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: load/mul/add timelines - serialized without chaining, overlapped with chaining]

Vector Startup

§ Two components of vector startup penalty
- functional unit latency (time through the pipeline)
- dead time or recovery time (time before another vector instruction can start down the pipeline)

[Figure: per-element pipeline diagram (R X X X W) showing the functional unit latency of the first vector instruction, the dead time after its last element enters the pipe, and the second vector instruction starting only once the dead time has elapsed]

Dead Time
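To put rough numbers on these two penalties, here is a small C sketch. The 6-stage latency and 4-cycle dead time are assumed, illustrative values, not quoted Cray-1 or T0 figures:

```c
#include <stdio.h>

/* Vector startup/dead-time arithmetic for one lane, one functional unit.
 * L: functional unit latency (cycles through the pipeline)
 * D: dead time (recovery cycles before the next vector instruction)
 * N: elements per vector instruction
 * All numbers here are illustrative assumptions. */
int main(void) {
    int L = 6, D = 4;
    for (int N = 8; N <= 64; N *= 2) {
        int t_one  = L + (N - 1);   /* first result after L, then 1/cycle   */
        int t_next = N + D;         /* issue-to-issue separation, same unit */
        printf("N=%2d: one instr %2d cycles, repeat every %2d cycles, "
               "efficiency %.0f%%\n",
               N, t_one, t_next, 100.0 * N / (N + D));
    }
    return 0;
}
```

With these assumed numbers, a 64-element vector keeps the unit doing useful element work about 94% of the time (64 of every 68 cycles), while an 8-element vector drops to about 67%: dead time is what makes short vectors expensive.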