Rock: A High-Performance SPARC CMT Processor


Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Håkan Zeffer, and Marc Tremblay (Sun Microsystems)

Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.

Designing an aggressive chip-multithreaded (CMT) processor [1] involves many tradeoffs. To maximize throughput performance, each processor core must be highly area and power efficient, so that many cores can coexist on a single die. Similarly, if the processor is to perform well on a wide spectrum of applications, per-thread performance must be high so that serial code also executes efficiently, minimizing performance problems related to Amdahl's law [2, 3]. Rock, Sun's third-generation chip-multithreading processor, uses a novel resource-sharing hierarchical structure tuned to maximize performance per area and power [4].

Rock uses an innovative checkpoint-based architecture to implement highly aggressive speculation techniques. Each core has support for two threads, each with one checkpoint. The hardware support for threads can be used two different ways: a single core can run two application threads, giving each thread hardware support for execute ahead (EA) under cache misses; or all of a core's resources can adaptively be combined to run one application thread with even more aggressive simultaneous speculative threading (SST), which uses two checkpoints. EA is an area-efficient way of creating a large virtual issue window without the large associative structures. SST dynamically extracts parallelism, letting execution proceed in parallel at two different points in the program.

These schemes let Rock maximize throughput when the parallelism is available, and focus the hardware resources to boost per-thread performance when it is at a premium. Because these speculation techniques make Rock less sensitive to cache misses, we have been able to reduce on-chip cache sizes and instead add more cores to further increase performance. The high core-to-cache ratio requires high off-chip bandwidth. Rock-based systems use large caches in the off-chip memory controllers to maintain a relatively low access rate to DRAM; this maximizes performance and optimizes total system power.
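As a rough illustration of the Amdahl's-law constraint mentioned above, the C sketch below computes the best-case speedup for the 32 hardware threads of a single Rock chip. The serial fractions are arbitrary example values, not measurements from the article; the point is only that even a small serial component caps throughput, which is why per-thread performance on serial code still matters.

```c
#include <stdio.h>

/* Amdahl's law: with serial fraction s and N threads, the best possible
 * speedup over a single thread is 1 / (s + (1 - s) / N). */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    const int threads = 32;                        /* one Rock chip: 16 cores x 2 threads */
    const double serial[] = { 0.01, 0.05, 0.20 };  /* assumed example workloads */

    for (int i = 0; i < 3; i++)
        printf("serial fraction %2.0f%% -> at most %.1fx speedup on %d threads\n",
               100.0 * serial[i], amdahl_speedup(serial[i], threads), threads);
    return 0;
}
```

With only 5 percent serial code, the 32 threads can deliver at most about a 12.5x speedup, so fast execution of the serial portions remains important.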
To help the programmability of multiprocessor systems with hundreds of threads and to improve synchronization-related performance, Rock supports transactional memory [5] in hardware; it is the first commercial processor to do so. TM lets a Rock thread efficiently execute all instructions in a transaction as an atomic unit, or not at all. Rock leverages the checkpointing mechanism with which it supports EA to implement the TM semantics with low additional hardware overhead.

[Figure 1. Die photo of the Rock chip (a) and logical overview of a two-processor-chip, 64-thread Rock-based system (b).]

System overview

Figure 1 shows a die photo of the Rock processor and a logical overview of a Rock-based system using two processor chips. Each Rock processor has 16 cores, each configurable to run one or two threads; thus, each chip can run up to 32 threads. The 16 cores are divided into core clusters, with four cores per cluster. This design enables resource sharing among cores within a cluster, thus reducing the area requirements. All cores in a cluster share an instruction fetch unit (IFU) that includes the level-one (L1) instruction cache. We decided that all four cores in a cluster should share the IFU because it is relatively simple to fetch a large number of instructions per cycle. Thus, four cores can share one IFU in a round-robin fashion while maintaining full fetch bandwidth. Furthermore, the shared instruction cache enables constructive sharing of common code, as is encountered in shared libraries and operating system routines.

Each core cluster contains two L1 data caches (D cache) and two floating-point units (FGU), each of which is shared by a pair of cores. As Figure 1a shows, these structures are relatively large, so sharing them provides significant area savings. A crossbar connects the four core clusters with the four level-two (L2) cache banks, the second-level memory-management unit (MMU), and the system interface unit (SIU). Rock's L2 cache is 2 Mbytes, consisting of four 512-Kbyte 8-way set-associative banks. A relatively large fraction of the die area is devoted to core clusters as opposed to L2 cache. We selected this design point after simulating designs with different ratios of core clusters to L2 cache. It provides the highest throughput of the different options and yields good single-thread performance.
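As a quick sanity check of the L2 organization just described, the sketch below works out the set count per bank and decomposes a physical address into bank, set, and tag fields. The 64-byte line size and the choice of low-order line-address bits for bank selection are assumptions made for illustration; the article does not specify either.

```c
#include <stdint.h>
#include <stdio.h>

/* Rock L2 per the article: 2 Mbytes total, four banks, 512 Kbytes per bank,
 * 8-way set associative. Line size and bit assignment below are assumed. */
#define LINE_BYTES     64u
#define WAYS            8u
#define BANK_BYTES     (512u * 1024u)
#define NUM_BANKS       4u
#define SETS_PER_BANK  (BANK_BYTES / (WAYS * LINE_BYTES))   /* = 1024 sets */

static void l2_decompose(uint64_t paddr)
{
    uint64_t line = paddr / LINE_BYTES;
    uint64_t bank = line % NUM_BANKS;                 /* assumed bank interleave */
    uint64_t set  = (line / NUM_BANKS) % SETS_PER_BANK;
    uint64_t tag  = (line / NUM_BANKS) / SETS_PER_BANK;

    printf("addr 0x%llx -> bank %llu, set %llu, tag 0x%llx\n",
           (unsigned long long)paddr, (unsigned long long)bank,
           (unsigned long long)set, (unsigned long long)tag);
}

int main(void)
{
    printf("%u sets per bank, %u KB total\n",
           SETS_PER_BANK, NUM_BANKS * BANK_BYTES / 1024u);
    l2_decompose(0x12345678ull);
    return 0;
}
```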
Each Rock chip is directly connected to an I/O bridge ASIC and to multiple memory controller ASICs via high-bandwidth serializer/deserializer (SerDes) links. This system topology provides truly uniform access to all memory locations. The memory controllers implement a directory-based MESI coherence protocol (which uses the states modified, exclusive, shared, and invalid) supporting multiprocessor configurations. We decided not to support the owned cache state (O) because there are no direct links between processors, and so accesses to a cache line in the O state in another processor would require a four-hop transaction, which would increase the latency and the link bandwidth requirements.

To further reduce link bandwidth requirements, Rock systems use hardware to compress packets sent over the links between the processors and the memory controllers. Each memory controller has an 8-Mbyte level-three cache, with all the L3 caches in the system forming a single large, globally shared cache. This L3 cache reduces both the latency of memory requests and the DRAM bandwidth requirements. Each memory controller has 4 + 1 fully buffered dual in-line memory module (FB-DIMM) memory channels, with the fifth of these used for redundancy.

Pipeline overview

A strand consists of the hardware resources needed to execute a software thread. Thus, in a Rock system, eight strands share a fetch unit, which uses a round-robin scheme to determine which strand gets [...] exploiting the great spatial locality typical in instruction data. The fetch unit fetches a full cache line of 16 instructions and forwards it to a strand-specific fetch buffer. Rock's branch predictor is a 60-Kbit (30K two-bit saturating counters) gshare branch predictor, enhanced to take advantage of the software prediction bit provided in Sparc V9 branches. In each cycle, two program-counter-relative (PC-relative) branches are predicted. Indirect branches are predicted by either a 128-entry branch-target buffer or a 128-entry return-address predictor.

As Figure 2 shows, the fetch stages are connected to the decode pipeline through an instruction buffer. The decode unit, which serves one strand per cycle, has a helper ROM that it uses to decompose complex instructions into sequences of simpler instructions. Once decoded, instructions proceed to the appropriate strand-specific instruction queue. A round-robin arbiter decides which strand will have access to the issue pipeline the next cycle. On every cycle, up to four instructions can be steered from the instruction queue to the issue first-in, first-out buffers (FIFOs). There are five issue FIFOs (labeled A0, A1, BR, FP, and MEM in Figure 2); each is three entries deep and directly connected to its respective execution pipe. An issue FIFO will stall an instruction if there is a read-after-write (RAW) hazard. However, such a stall does not necessarily affect instructions in the other issue FIFOs. Independent younger instructions in the other issue FIFOs can proceed to the execution unit out of order. Instructions that can't complete their execution go into the deferred instruction queue (DQ) for [...]
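The fetch-stage description above names a specific, well-known prediction scheme: a gshare predictor indexes a table of two-bit saturating counters with the exclusive-OR of the branch address and a global history of recent outcomes. The sketch below shows that basic mechanism only; the power-of-two table size, 14-bit history length, and index hash are simplifying assumptions, and it omits Rock's use of the Sparc V9 software prediction bit and its ability to predict two PC-relative branches per cycle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal gshare sketch (illustrative; not Rock's actual 30K-counter design). */
#define HIST_BITS   14u
#define TABLE_SIZE  (1u << HIST_BITS)

static uint8_t  counters[TABLE_SIZE];   /* two-bit saturating counters, 0..3 */
static uint32_t ghr;                    /* global history register */

static uint32_t gshare_index(uint64_t pc)
{
    /* XOR the word-aligned branch address with the global history. */
    return ((uint32_t)(pc >> 2) ^ ghr) & (TABLE_SIZE - 1u);
}

bool gshare_predict(uint64_t pc)
{
    return counters[gshare_index(pc)] >= 2;          /* 2 or 3 means "taken" */
}

void gshare_update(uint64_t pc, bool taken)
{
    uint32_t i = gshare_index(pc);
    if (taken  && counters[i] < 3) counters[i]++;
    if (!taken && counters[i] > 0) counters[i]--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (TABLE_SIZE - 1u);
}
```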