IA-64 Architecture Overview


A High-Performance Computing Architecture
[email protected], Internet Solutions Group EMEA Technical Marketing, July 2000
Copyright © 2000, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Agenda
Motivation and IA-64 feature overview
IA-64 features
• EPIC
• Data types, memory and registers
• Register stack
• Predication and parallel compares
• Software pipelining and register rotation
• Control & data speculation
• Branch architecture
• Integer architecture
• Floating point architecture
Itanium™ processor overview
Itanium™ processor based systems overview
Operating systems, tools and programming

IA-64: Extending the Intel® Architecture
Designed for high-performance computing:
• Scientific
• Technical & engineering
• Business
The IA-64 architecture uses the new EPIC technology; the Itanium™ processor is the first implementation of IA-64.

Performance Limiters
Parallelism not fully utilized
• Existing architectures cannot exploit sufficient parallelism in integer code to feed a wide in-order implementation
Branches
• Even with perfect branch prediction, small basic blocks of code do not fully utilize machine width
Procedure calls
• Software modularity is becoming standard, resulting in call/return overhead
Memory latency and address space
• Memory latency is increasing relative to processor cycle time (larger cache-miss penalties), and the address space is limited
IA-64 overcomes these limitations, and more.

IA-64 Architectural Features
Against these limiters (parallelism, branches, procedure calls, and memory latency), IA-64 sets:
• 64-bit-address flat memory model
• Explicit Parallel Instruction Computing (EPIC)
• Large register files with an automatic Register Stack Engine
• Predication
• Software pipelining support, register rotation, and loop control hardware
• Sophisticated branch architecture
• Control & data speculation
• Cache control
• Powerful integer architecture and advanced floating point architecture
• Multimedia support (MMX™ technology)
It's more than just 64 bits.

Next Generation Architecture: EPIC Design Philosophy
Maximize performance via EPIC hardware and software synergy:
• Advanced features (predication, speculation, …) enhance instruction-level parallelism
• Massive hardware resources for parallel execution
[Chart: performance over time for CISC, RISC, out-of-order/superscalar, VLIW, and EPIC designs; beyond traditional architectures]
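Before the details, a minimal sketch of what "explicit" parallelism means at the instruction level; the register numbers and operations are illustrative assumptions, not taken from the slides. Independent instructions are placed in one instruction group, delimited by a stop bit (;;), and the hardware may issue the whole group in parallel:

        // Hypothetical group: no RAW or WAW dependencies among the
        // first three instructions, so the compiler groups them.
        add   r16 = r14, r15      // independent
        sub   r17 = r18, r19      // independent
        ld8   r20 = [r21]         // independent 8-byte load
        ;;                        // stop bit: closes the group
        add   r22 = r16, r17      // next group consumes prior results

The key design point is that dependence checking moves from run-time hardware into the compiler, which is what distinguishes EPIC from out-of-order superscalar designs.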
EPIC Instruction Parallelism
• The compiler arranges code into instruction groups (each a series of bundles) with no RAW or WAW dependencies inside a group; a group is issued in parallel, depending on available resources
• A bundle holds 3 instructions plus a template: 3 x 41 bits + 5 bits = 128 bits
• Up to 6 instructions can be executed per clock

Compiler (HW) Data Types
• 64-bit integer
• 2x32-bit, 4x16-bit, and 8x8-bit SIMD integer
• 64-bit DP floating point
• 2x32-bit SIMD SP floating point
All common data types are supported.

64-bit Memory Access
• 18 billion gigabytes accessible: 2^64 = 18,446,744,073,709,551,616 bytes
• Byte-addressable access with 64-bit pointers; 64-bit virtual address space, with HW support for 32-bit pointers
• Access granularity and alignment: 1, 2, 4, 8, 10, or 16 bytes; alignment on naturally aligned boundaries is recommended; instructions are always 16-byte aligned
• Support for both big- and little-endian byte order
• Memory hierarchy control
• 2.1 GB/s front-side bus
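As a hedged illustration of the access granularities above (the addresses and register numbers are assumptions for the example), IA-64 loads name their size explicitly and can post-increment their address register:

        // r14 is assumed to hold a naturally aligned base address.
        ld1   r15 = [r14]         // 1-byte load, zero-extended to 64 bits
        adds  r16 = 8, r14        // form an 8-byte-offset address
        ;;
        ld4   r17 = [r16]         // 4-byte load from an aligned address
        ld8   r18 = [r14], 16     // 8-byte load; post-increment r14 by 16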
Memory Hierarchy Control
Software can explicitly control memory accesses:
• Specify the levels of the memory hierarchy affected by the access; allocation and flush resolution is at least 32 bytes
• Allocation (prefetch) brings the data close to the CPU; allocation hints indicate at which level allocation takes place, and are used in load, store, and explicit prefetch instructions
• De-allocation and flush invalidate the addressed line in all levels of the cache hierarchy, writing the data back to memory if necessary
Three levels of cache (full-speed L2 cache, 2/4 MB L3 cache) and atomic operation support give control over cache (de)allocation.

Memory Access Ordering
Explicit control:
• Memory fence (mf): ensures all prior memory operations are seen before all future memory operations
• Acquire load (ld.acq): ensures the load is seen before all future memory operations
• Release store (st.rel): ensures all prior memory operations are seen before the store
• Synchronize instruction caches (sync.i): ensures all instruction caches have seen all prior flush-cache instructions
Implicit ordering applies to the semaphore instructions:
• xchg: exchange memory and a general register (GR)
• cmpxchg: conditional exchange of memory and a GR
• fetchadd: add an immediate to memory
The strong ordering model is compatible with IA-32.

Large Register Set
• 128 integer registers (GR0 to GR127), each 64 bits plus a NaT bit: 32 static (GR0 to GR31) and 96 framed, rotating (GR32 to GR127)
• 128 floating-point registers (FR0 to FR127), each 82 bits (FR0 reads +0.0, FR1 reads +1.0): 32 static and 96 rotating
• 64 one-bit predicate registers (PR0 to PR63, with PR0 hardwired to 1): 16 static and 48 rotating
• 8 branch registers (BR0 to BR7), each 64 bits
This removes resource bottlenecks.

Context Switch
• "Normal": a full context switch saves all registers, GR and FR
• "Lazy": saves only the GR registers, or a specific range (GR0-31, GR32-GR127)
• "Fast": saves no registers and instead uses 16 separate banked "shadow" registers (OS only, GR16'-GR31'), e.g. for interrupt and exception handling

Register Stack
• GRs 0-31 are global to all procedures
• Stacked registers begin at GR32 and are local to each procedure; each procedure's register stack frame varies from 0 to 96 registers
• Only the GRs implement a register stack; the FRs, PRs, and BRs are global to all procedures
• Register Stack Engine (RSE): on stack overflow or underflow, registers are saved to or restored from a backing store transparently
The result is optimized CALL/RETURN and parameter passing.

Register Stack in Work
• A call changes the frame to contain only the caller's output registers
• The alloc instruction sets the frame to the desired size, with three architecture parameters: local, output, and rotating
• A return restores the stack frame of the caller
[Figure: PROC A calls PROC B; the call leaves only A's outputs visible at GR32, alloc grows B's frame to its locals (inputs) plus outputs, and the return restores A's frame]
This avoids register spill/fill among procedure calls.
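The call/alloc/return interplay can be sketched in assembly; this is a minimal, hypothetical procedure (the symbol names and register choices are assumptions, not from the slides). alloc declares 2 inputs, 3 locals, and 1 output; the call renames the output register to the callee's first input; the return restores the caller's frame:

proc_b:
        alloc r36 = ar.pfs, 2, 3, 1, 0   // frame: in r32-r33, local r34-r36, out r37
        mov   r35 = b0                   // save the return address in a local
        mov   r37 = r32                  // output for the callee (becomes its r32)
        br.call.sptk.many b0 = proc_c    // call: frame shrinks to the outputs
        ;;
        mov   ar.pfs = r36               // restore the caller's frame marker
        mov   b0 = r35                   // restore the return address
        ;;
        br.ret.sptk.many b0              // return: the caller's frame comes back

Because locals such as r35 and r36 survive the call through renaming rather than memory traffic, no explicit spill code is needed unless the RSE overflows.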
Register Stack Engine
[Figure: the logical view shows stack frames A through F being allocated and released as procedures are called and return; the physical view shows GR32-GR127, with the RSE saving and restoring frames to the backing store while GR0-GR31 stay global]
The register stack applies to the GR (integer) registers only.

Predication: Control Flow to Data Flow
A traditional architecture implements an if/else with a compare followed by a conditional jump around the alternatives (cmp a, b; jump.eq; X = 1; else: X = 2). IA-64 instead computes complementary predicates from the compare (cmp a, b sets p1, p2) and guards both assignments: (p2) X = 1 and (p1) X = 2.
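Filling out the slide's fragment as a hedged sketch (the operand registers are assumptions): the compare writes two complementary predicates, and both assignments issue with no branch; only the one whose predicate is true takes effect:

        cmp.eq p1, p2 = r8, r9    // p1 = (a == b), p2 = complement; a, b assumed in r8, r9
        ;;
(p2)    mov   r10 = 1             // X = 1 under p2, as on the slide
(p1)    mov   r10 = 2             // X = 2 under p1; the predicates are mutually
                                  // exclusive, so only one write to r10 retires

Both predicated moves can even sit in the same instruction group, since at most one of them executes; the branch and its misprediction penalty are gone.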