Rock: A High-Performance SPARC CMT Processor
Total Pages: 16
File Type: PDF, Size: 1020 KB
Recommended publications
- Datasheet Fujitsu SPARC M10-4S
  Everything your mission-critical enterprise application needs in stability, scalability and asset protection. Only the best with Fujitsu SPARC Enterprise: based on robust SPARC architecture and running the leading Oracle Solaris 11, Fujitsu SPARC M10-4S servers are ideal for customers needing highly scalable, reliable servers that increase their system utilization and performance through virtualization. The combined leverage of Fujitsu's expertise in mission-critical computing technologies and high-performance processor design, with Oracle's expertise in open, scalable, partition-based network computing, provides the overall flexibility to meet any task. A SPARC of steel: the Fujitsu SPARC M10-4S server is the nearest thing you can get to an open mainframe, absolutely rock solid, dependable and sophisticated, but with the total Solaris binary compatibility necessary to both protect your investments and enhance your business. Its rich virtualization ecosystem of extended partitioning and Solaris Containers, coupled with dynamic reconfiguration, means non-stop operation and total resource utilization at no extra cost. Benchmark-leading performance with the world's best applications and outstanding processor scalability just add to the capabilities of this most expandable of system platforms.
- Oracle Solaris and Oracle SPARC Systems—Integrated and Optimized for Mission Critical Computing
  An Oracle White Paper, September 2010. Contents: Executive Overview; Introduction—Oracle Datacenter Integration; Overview (The Oracle Solaris Ecosystem; SPARC Processors); Architected for Reliability (Oracle Solaris Predictive Self Healing; Highly Reliable Memory Subsystems; Oracle Solaris ZFS for Reliable Data; Reliable Networking; Oracle Solaris Cluster); Scalable Performance (World Record Performance; Sun FlashFire Storage; Network Performance; …)
- Threading, SIMD and MIMD in the Multicore Context: The UltraSPARC T2
  COMP8320 Lecture 2: Multicore Architecture and the T2 (2011).
  Overview (note: Tute 02 this Weds; handouts):
  ● Flynn's Taxonomy
  ● multicore architecture concepts
    ■ hardware threading
    ■ SIMD vs MIMD in the multicore context
    ■ T2: design features for multicore
    ■ system on a chip
    ■ execution: (in-order) pipeline, instruction latency
    ■ thread scheduling
    ■ caches: associativity, coherence, prefetch
    ■ memory system: crossbar, memory controller
    ■ intermission
    ■ speculation; power savings
    ■ OpenSPARC
  ● T2 performance (why the T2 is designed as it is)
  ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009)

  SIMD and MIMD in the Multicore Context. Flynn's taxonomy:

                       Single Instruction   Multiple Instruction
        Single Data    SISD                 MISD
        Multiple Data  SIMD                 MIMD

  ● for SIMD, the control unit and processor state (registers) can be shared
  ● however, SIMD is limited to data parallelism (through multiple ALUs): algorithms need a regular structure, e.g. dense linear algebra, graphics
  ● SSE2, Altivec, Cell SPE (128-bit registers); e.g. a 4×32-bit add, Rx: x3 x2 x1 x0 plus Ry: y3 y2 y1 y0 giving Rz: z3 z2 z1 z0 (zi = xi + yi); see the C sketch after this excerpt
  ● design requires massive effort; requires support from a commodity environment
  ● massive parallelism (e.g. nVidia GPGPU), but memory is still a bottleneck
  ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD
  ● higher hardware costs: larger shared resources (caches, TLBs) are also needed ⇒ less parallelism than for SIMD

  Hardware (Multi)threading:
  ● recall concurrent execution on a single CPU: switching between threads (or processes) requires saving (in memory) the thread state (register values)
    ■ motivation: utilize the CPU better when a thread is stalled for I/O (6300 Lect O1, p9–10)
    ■ what are the costs? do the same for smaller stalls? (e.g. …)

  The UltraSPARC T2: System on a Chip:
  ● OpenSparc Slide Cast Ch 5: p79–81, 89
  ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual CPUs)
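  The 4×32-bit add on the slide maps directly onto the SSE2 intrinsics API; below is a minimal, self-contained C sketch (an illustration, not from the slides; the lane values are made up):

      /* Four 32-bit adds in one SSE2 instruction (x86). */
      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stdio.h>

      int main(void)
      {
          __m128i rx = _mm_set_epi32(3, 2, 1, 0);      /* Rx = x3 x2 x1 x0 */
          __m128i ry = _mm_set_epi32(40, 30, 20, 10);  /* Ry = y3 y2 y1 y0 */
          __m128i rz = _mm_add_epi32(rx, ry);          /* Rz: zi = xi + yi */

          int z[4];
          _mm_storeu_si128((__m128i *)z, rz);          /* store Rz to memory */
          printf("%d %d %d %d\n", z[3], z[2], z[1], z[0]);  /* 43 32 21 10 */
          return 0;
      }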
- Precise Runahead Execution
  2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). Ajeya Naithani (Ghent University, Belgium), Josué Feliu (Universitat Politècnica de València, Spain), Almutaz Adileh (Ghent University), Lieven Eeckhout (Ghent University).
  Abstract—Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency because they release processor state when entering runahead mode and then need to re-fill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load.
  … runahead mode, the higher the performance benefit of runahead execution. On the other hand, speculative code execution imposes overheads for saving and restoring state, and rolling the pipeline back to a proper state to resume normal operation after runahead execution. The lower the performance penalty of these overheads, the higher the performance gain. Consequently, maximizing the performance benefits of runahead execution requires (1) maximizing the number of useful prefetches per runahead interval, and (2) limiting the switching overhead between runahead mode and normal execution. We find that prior attempts to optimize the performance of runahead execution have shortcomings that impede them from adequately …
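  To make the "single chain of instructions that leads to the same long-latency load" concrete, here is an illustrative C fragment (a sketch, not code from the paper): in a pointer-chasing loop, the missing load's address is produced by one short dependence chain, and that chain is all a runahead buffer captures and replays.

      #include <stddef.h>

      struct node { struct node *next; long payload[7]; };

      /* The cache-missing load is p->payload[0]; the address-generating
       * chain is just "p = p->next". That one-load chain is what a
       * runahead buffer would store and replay during runahead mode to
       * prefetch future nodes. */
      long walk(const struct node *p)
      {
          long sum = 0;
          while (p != NULL) {
              sum += p->payload[0];
              p = p->next;
          }
          return sum;
      }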
- Computer Architectures: An Overview
  A compilation of Wikipedia articles (PDF generated using the open-source mwlib toolkit, 25 Feb 2012). Contents: Microarchitecture; x86; PowerPC; IBM POWER; MIPS architecture; SPARC; ARM architecture; DEC Alpha; AlphaStation; AlphaServer; Very long instruction word; Instruction-level parallelism; Explicitly parallel instruction computing.
  From the Microarchitecture article: In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations might vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design.
  Relation to instruction set architecture: The ISA is roughly the same as the programming model of a processor as seen by an assembly-language programmer or compiler writer. The ISA includes the execution model, processor registers, and address and data formats, among other things. (Figure: the Intel Core microarchitecture.) The microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers to complete arithmetic logic units (ALUs) and even larger elements.
- Sun Netra T5440 Server Architecture
  An Oracle White Paper, March 2010. Contents: Introduction; Managing Complexity (Introducing the Sun Netra T5440 Server); The UltraSPARC T2 Plus Processor and the Evolution of Throughput Computing (Diminishing Returns of Traditional Processor Designs; Chip Multiprocessing (CMP) with Multicore Processors; Chip Multithreading (CMT); UltraSPARC Processors with CoolThreads Technology; Sun Netra T5440 Servers); Sun Netra T5440 Server Architecture (System-Level Architecture; Sun Netra T5440 Server Overview and Subsystems; RAS Features); Carrier-Grade Software Support (System Management Technology; Scalability and Support for CoolThreads Technology; Cool Tools for SPARC: Performance and Rapid Time to Market; Java …)
- Runahead Execution
  18-447 Computer Architecture, Lecture 26: Runahead Execution. Prof. Onur Mutlu, Carnegie Mellon University, Spring 2014 (4/7/2014).
  Today: start memory latency tolerance; runahead execution; prefetching.
  Readings. Required: Mutlu et al., "Runahead execution," HPCA 2003; Srinath et al., "Feedback directed prefetching," HPCA 2007. Optional: Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," ISCA 2005 and IEEE Micro Top Picks 2006; Mutlu et al., "Address-Value Delta (AVD) Prediction," MICRO 2005; Armstrong et al., "Wrong Path Events," MICRO 2004.
  Latency tolerance: an out-of-order execution processor tolerates the latency of multi-cycle operations by executing independent instructions concurrently. It does so by buffering instructions in reservation stations and the reorder buffer. The instruction window is the hardware resource needed to buffer all decoded but not yet retired/committed instructions. What if an instruction takes 500 cycles? How large an instruction window do we need to continue decoding? How many cycles of latency can OoO tolerate?
  Stalls due to long-latency instructions: when a long-latency instruction is not complete, it blocks instruction retirement, because we need to maintain precise exceptions. Incoming instructions fill the instruction window (reorder buffer, reservation stations). Once the window is full, the processor cannot place new instructions into the window; this is called a full-window stall. A full-window stall prevents the processor from making progress in the execution of the program.
  Full-window stall example (8-entry instruction window):

      LOAD R1 <- mem[R5]    ; oldest: L2 miss! takes 100s of cycles
      BEQ  R1, R0, target
      ADD  R2 <- R2, 8
      LOAD R3 <- mem[R2]
      MUL  R4 <- R4, R3     ; independent of the L2 miss, executed out of
      ADD  R4 <- R4, R5     ; program order, but cannot be retired
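  The gain runahead execution extracts from this example can be mimicked in software. Below is a minimal C analogue (an illustration under assumptions, not the lecture's code: mem, r2 and r5 are hypothetical, element rather than byte offsets are used, and GCC/Clang's __builtin_prefetch stands in for the prefetch the runahead hardware would issue). The address of the later LOAD R3 <- mem[R2] does not depend on R1, so its line can be fetched while the first miss is outstanding instead of sitting behind a full-window stall:

      #include <stddef.h>

      int full_window_example(const int *mem, size_t r2, size_t r5)
      {
          /* Runahead pre-executes the independent chain
           * ADD R2 <- R2, 8; LOAD R3 <- mem[R2] far enough to know the
           * load's address, issues a prefetch for it, and then discards
           * the speculative results. */
          __builtin_prefetch(&mem[r2 + 8], 0 /* read */, 3 /* keep cached */);

          int r1 = mem[r5];      /* the demand load that misses in L2  */
          int r3 = mem[r2 + 8];  /* by now likely a cache hit          */
          return r1 * r3 + r1;   /* stands in for the dependent MUL/ADDs */
      }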
- Vector Runahead
  Ajeya Naithani (Ghent University), Sam Ainsworth (University of Edinburgh), Timothy M. Jones (University of Cambridge), Lieven Eeckhout (Ghent University).
  Abstract—The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive. Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain.
  … stride patterns, such techniques are endemic in today's cache systems [19]. For more complex indirection patterns, the inability at the cache-system level to identify complex load chains and generate their addresses limits existing techniques to simple array-indirect [88] and pointer-chasing [23] codes. To achieve the instruction-level visibility necessary to calculate the addresses of complex access patterns seen in today's workloads [3], we conclude that this ideal technique must operate within the core, instead of within the cache. Runahead execution [25, 32, 34, 57, 58, 64] is the most promising technique to date, where upon a memory stall at the head of the reorder buffer (ROB), execution enters …
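  For concreteness, a dependent indirect access pattern of the kind the abstract describes looks like this C loop (a sketch, not code from the paper): the address of the second load depends on the value of the first, so a stride prefetcher sees no pattern in B's accesses, and classic runahead stalls anew on the first missing load of every chain.

      #include <stddef.h>

      /* B[idx[i]]: load idx[i], then a dependent, effectively random load
       * into B: a "complex chain of dependent, indirect memory accesses"
       * that defeats stride prefetchers. */
      long indirect_sum(const int *idx, const long *B, size_t n)
      {
          long sum = 0;
          for (size_t i = 0; i < n; i++)
              sum += B[idx[i]];
          return sum;
      }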
- UltraSPARC T2 Processor
  A talk given in the seminar "Ausgewählte Themen in Hardwareentwurf und Optik," HWS07, Universität Mannheim, by Janusz Schinke. Contents: overview; core; crossbar; L2 cache; internal network; PCI-Express; power management; system status and deployment areas; summary.
  Overview: the second generation of a chip multi-threading (CMT) processor. 8 SPARC cores, 4 MB shared L2 cache; executes 64 threads (8 threads per core). More than double the UltraSPARC T1's compute performance and performance per watt, and more than ten times faster floating-point computation. Server-on-a-chip (SOC) components: two 10G Ethernet ports, an encryption unit, on-chip PCI-Express, FBDIMM memory. (Figures: block diagram; Niagara2 die micrograph.)
  Core: IFU, instruction fetch unit; EXU0/1, integer execution units; LSU, load/store unit; FGU, floating-point/graphics unit; SPU, security processing unit; TLU, trap logic unit; MMU, memory management unit.
  Core pipeline: 8-stage integer pipeline; 3-cycle load-use latency; memory bypass; writeback. 12-stage floating-point pipeline; 6-cycle latency for dependent FP operations; a longer pipeline stage for division/square root. …
- Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance
  Onur Mutlu, Hyesoon Kim, Yale N. Patt (University of Texas at Austin). Several simple techniques can make runahead execution more efficient by reducing the number of instructions executed, and thereby reducing the additional energy consumption typically associated with runahead execution.
  Today's high-performance processors face main-memory latencies on the order of hundreds of processor clock cycles. As a result, even the most aggressive processors spend a significant portion of their execution time stalling and waiting for main-memory accesses to return data to the execution core. Previous research has shown that runahead execution significantly increases a high-performance processor's ability to tolerate long main-memory latencies [1, 2]. Runahead execution improves a processor's performance by speculatively pre-executing the application program while the processor services a long-latency (L2) data cache miss, instead of stalling the processor for the duration of the L2 miss. Thus, runahead execution lets a processor execute instructions that it otherwise couldn't execute under an L2 cache miss.
  … speculatively processed (executed) instructions, sometimes without enhancing performance. For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce these extra instructions. Our solution to this problem includes both hardware and software mechanisms that are simple, implementable, and effective.
  Background on runahead execution: conventional out-of-order execution processors use instruction windows to buffer instructions so they can tolerate long latencies. Because a cache miss to main memory takes hundreds of processor cycles to service, a processor needs to buffer an unreasonably large …
- Onur-447-Spring12-Lecture23
  18-447 Computer Architecture, Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu, Carnegie Mellon University, Spring 2012 (4/18/2012).
  Reminder, Lab Assignment 6: implementing a more realistic memory hierarchy (an L2 cache model; DRAM and memory-controller models; MSHRs and multiple outstanding misses). Due April 23; extra credit: prefetching.
  Last lecture: memory latency tolerance/reduction; stalls; the four fundamental techniques; software and hardware prefetching; prefetcher throttling. Today: more prefetching; runahead execution.
  Readings: Srinath et al., "Feedback directed prefetching," HPCA 2007; Mutlu et al., "Runahead execution," HPCA 2003.
  How do we tolerate stalls due to memory? Two major approaches: reduce/eliminate stalls, or tolerate the effect of a stall when it happens. Four fundamental techniques achieve these: caching, prefetching, multithreading, and out-of-order execution. Many techniques have been developed to make these four more effective in tolerating memory latency.
  Review, prefetching: the four questions. What: what addresses to prefetch. When: when to initiate a prefetch request. Where: where to place the prefetched data. How: software, hardware, execution-based, or cooperative.
  Review, challenges in prefetching (how). Software prefetching: the ISA provides prefetch instructions, and the programmer or compiler inserts them (effort); it usually works well only for regular access patterns (see the sketch below). Hardware prefetching: hardware monitors processor …
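  As a sketch of the software-prefetching option named above (assuming GCC/Clang, whose __builtin_prefetch lowers to the ISA's prefetch instruction; the 16-element distance is an illustrative guess that would be tuned per machine):

      #include <stddef.h>

      double sum_with_sw_prefetch(const double *a, size_t n)
      {
          double s = 0.0;
          for (size_t i = 0; i < n; i++) {
              if (i + 16 < n)
                  /* programmer/compiler-inserted prefetch, 16 elements
                   * ahead: rw = 0 (read), locality = 1 (low reuse) */
                  __builtin_prefetch(&a[i + 16], 0, 1);
              s += a[i];  /* the regular access pattern SW prefetch suits */
          }
          return s;
      }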
- Oracle Solaris and Fujitsu SPARC Enterprise Systems—Integrated and Optimized for Enterprise Computing
  A Fujitsu White Paper, May 2011. Contents: Executive Overview; Introduction—Datacenter Integration; Overview (The Oracle Solaris Ecosystem; SPARC Processors); Architected for Reliability (Oracle Solaris Predictive Self Healing; Highly Reliable Memory Subsystems; Oracle Solaris ZFS for Reliable Data; Reliable Networking); Scalable Performance (World Record Performance; Network Performance); Security; Integrated with Fujitsu SPARC …