Average Memory Access Time: Reducing Misses

Total Page:16

File Type:pdf, Size:1020Kb

Average Memory Access Time: Reducing Misses Review: Cache performance 332 Miss-oriented Approach to Memory Access: Advanced Computer Architecture ⎛ MemAccess ⎞ CPUtime = IC × ⎜ CPI + × MissRate × MissPenalt y ⎟ × CycleTime Chapter 2 ⎝ Execution Inst ⎠ ⎛ MemMisses ⎞ CPUtime = IC × ⎜ CPI + × MissPenalt y ⎟ × CycleTime Caches and Memory Systems ⎝ Execution Inst ⎠ CPIExecution includes ALU and Memory instructions January 2007 Separating out Memory component entirely Paul H J Kelly AMAT = Average Memory Access Time CPIALUOps does not include memory instructions ⎛ AluOps MemAccess ⎞ These lecture notes are partly based on the course text, Hennessy CPUtime = IC × ⎜ × CPI + × AMAT ⎟ × CycleTime and Patterson’s Computer Architecture, a quantitative approach (3rd ⎝ Inst AluOps Inst ⎠ and 4th eds), and on the lecture slides of David Patterson and John AMAT = HitTime + MissRate × MissPenalt y Kubiatowicz’s Berkeley course = ()HitTime Inst + MissRate Inst × MissPenalt y Inst + ()HitTime Data + MissRate Data × MissPenalt y Data Advanced Computer Architecture Chapter 2.1 Advanced Computer Architecture Chapter 2.2 Average memory access time: Reducing Misses Classifying Misses: 3 Cs AMAT = HitTime + MissRate × MissPenalt y Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. There are three ways to improve cache (Misses in even an Infinite Cache) Capacity—If the cache cannot contain all the blocks needed during performance: execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache) 1. Reduce the miss rate, Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will 2. Reduce the miss penalty, or occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. 3. Reduce the time to hit in the cache. (Misses in N-way Associative, Size X Cache) More recent, 4th “C”: Coherence - Misses caused by cache coherence. Advanced Computer Architecture Chapter 2.3 Advanced Computer Architecture Chapter 2.4 Page 1 3Cs Absolute Miss Rate (SPEC92) 2:1 Cache Rule (of thumb!) miss rate 1-way associative cache size X = miss rate 2-way associative cache size X/2 0.14 0.14 1-way 1-way 0.12 Conflict 0.12 Conflict 2-way 2-way 0.1 0.1 4-way 4-way 0.08 0.08 8-way 8-way 0.06 0.06 Capacity Capacity 0.04 0.04 0.02 0.02 0 0 1 2 4 8 1 2 4 8 16 32 64 16 32 64 128 128 Compulsory vanishingly Compulsory Compulsory small Cache Size (KB) Cache Size (KB) Advanced Computer Architecture Chapter 2.5 Advanced Computer Architecture Chapter 2.6 3Cs Relative Miss Rate How We Can Reduce Misses? 100% 1-way 3 Cs: Compulsory, Capacity, Conflict 80% Conflict 2-way In all cases, assume total cache size not changed: 4-way 60% 8-way What happens if: 1) Change Block Size: 40% Which of 3Cs is obviously affected? Capacity 20% 2) Change Associativity: Which of 3Cs is obviously affected? 0% 1 2 4 8 3) Change Compiler: 16 32 64 Which of 3Cs is obviously affected? 128 Compulsory Cache Size (KB) Flaws: for fixed block size Good: insight => invention Advanced Computer Architecture Chapter 2.7 Advanced Computer Architecture Chapter 2.8 Page 2 1. Reduce Misses via Larger Block Size 2. Reduce Misses via Higher Associativity 2:1 Cache Rule of thumb: 25% The Miss Rate of a direct-mapped cache of size N Is the same as the Miss Rate of a 2-way set-associative 20% 1K cache size of size N/2 4K on average, over a large suite of benchmarks 15% Miss 16K Beware: Execution time is the only thing that Rate 10% 64K matters in the end Will Clock Cycle time increase? 5% 256K Hill [1988] suggested hit time for 2-way vs. 1-way 0% external cache +10%, internal + 2% 16 32 64 128 256 Block Size (bytes) Advanced Computer Architecture Chapter 2.9 Advanced Computer Architecture Chapter 2.10 Example: Avg. Memory Access Time vs. Miss Rate 3. Reducing Misses via a “Victim Cache” Example: assume clock cycle time (CCT) = 1.10 for 2- way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct How to combine fast hit time of mapped direct mapped TAGS DATA Cache Size Associativity yet still avoid conflict misses? (KB) 1-way 2-way 4-way 8-way Add buffer to place data 1 2.33 2.15 2.07 2.01 discarded from cache 2 1.98 1.86 1.76 1.68 Jouppi [1990]: 4-entry victim 4 1.72 1.67 1.61 1.53 cache removed 20% to 95% of 81.461.48 1.47 1.43 conflicts for a 4 KB direct Tag and Comparator One Cache line of Data 16 1.29 1.32 1.32 1.32 mapped data cache Tag and Comparator One Cache line of Data 32 1.20 1.24 1.25 1.27 Used in AMD Opteron, Alpha, HP Tag and Comparator One Cache line of Data 64 1.14 1.20 1.21 1.23 machines 128 1.10 1.17 1.18 1.20 Tag and Comparator One Cache line of Data To Next Lower Level In Hierarchy (Red means A.M.A.T. not improved by more associativity) Advanced Computer Architecture Chapter 2.11 Advanced Computer Architecture Chapter 2.12 Page 3 4. Reducing Misses via 5. Reducing Misses by Hardware “Pseudo-Associativity” Prefetching of Instructions & Data How to combine fast hit time of Direct Mapped and have E.g., Instruction Prefetching the lower conflict misses of 2-way SA cache? Alpha 21064 fetches 2 blocks on a miss Divide cache: on a miss, check other half of cache to see if Extra block placed in “stream buffer” there, if so have a pseudo-hit (slow hit) On miss check stream buffer Hit Time Works with data blocks too: Jouppi [1990] 1 data stream buffer got 25% misses from 4KB Pseudo Hit Time Miss Penalty cache; 4 streams got 43% Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from Time 2 64KB, 4-way set associative caches Prefetching relies on having extra memory Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles bandwidth that can be used without penalty Better for caches not tied directly to processor (L2) Used in MIPS R10000 L2 cache, similar in UltraSPARC Advanced Computer Architecture Chapter 2.13 Advanced Computer Architecture Chapter 2.14 6. Reducing Misses by 7. Reducing Misses by Compiler Software Prefetching Data Optimizations Data Prefetch McFarling [1989]* reduced instruction cache misses by 75% Load data into register (HP PA-RISC loads) on 8KB direct mapped cache, 4 byte blocks in software Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) Instructions Special prefetching instructions cannot cause faults; a form of speculative By choosing instruction memory layout based on callgraph, branch structure execution and profile data Prefetching comes in two flavors: Reorder procedures in memory so as to reduce conflict misses Binding prefetch: Requests load directly into register. Similar (but different) ideas work for data Must be correct address and register! Non-Binding prefetch: Load into cache. Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays Can be incorrect. Frees HW/SW to guess! Loop Interchange: change nesting of loops to access data in order stored in Issuing Prefetch Instructions takes time memory Is cost of prefetch issues < savings in reduced misses? Loop Fusion: Combine 2 independent loops that have same looping and some Higher superscalar reduces difficulty of issue bandwidth variables overlap Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows * “Program optimization for instruction caches”, ASPLOS89, http://doi.acm.org/10.1145/70082.68200 Advanced Computer Architecture Chapter 2.15 Advanced Computer Architecture Chapter 2.16 Page 4 Merging Arrays Example Loop Interchange Example /* Before: 2 sequential arrays */ /* Before */ int val[SIZE]; for (k = 0; k < 100; k = k+1) int key[SIZE]; for (j = 0; j < 100; j = j+1) for (i = 0; i < 5000; i = i+1) /* After: 1 array of stuctures */ x[i][j] = 2 * x[i][j]; struct merge { /* After */ int val; for (k = 0; k < 100; k = k+1) int key; for (i = 0; i < 5000; i = i+1) }; for (j = 0; j < 100; j = j+1) struct merge merged_array[SIZE]; x[i][j] = 2 * x[i][j]; Sequential accesses instead of striding through Reducing conflicts between val & key; memory every 100 words; improved spatial improve spatial locality locality Advanced Computer Architecture Chapter 2.17 Advanced Computer Architecture Chapter 2.18 Loop Fusion Example /* Before */ Blocking Example for (i = 0; i < N; i = i+1) /* Before */ for (j = 0; j < N; j = j+1) for (i = 0; i < N; i = i+1) a[i][j] = 1/b[i][j] * c[i][j]; for (j = 0; j < N; j = j+1) for (i = 0; i < N; i = i+1) for (j = 0; j < N; j = j+1) {r = 0; d[i][j] = a[i][j] + c[i][j]; for (k = 0; k < N; k = k+1){ /* After fusion */ /* After array contraction */ r = r + y[i][k]*z[k][j];}; for (i = 0; i < N; i = i+1) for (i = 0; i < N; i = i+1) x[i][j] = r; for (j = 0; j < N; j = j+1) for (j = 0; j < N; j = j+1) }; { a[i][j] = 1/b[i][j] * c[i][j]; { c = c[i][j]; Two Inner Loops: d[i][j] = a[i][j] + c[i][j];} a = 1/b[i][j] * c; Read all NxN elements of z[] d[i][j] = a + c;} Read N elements of 1 row of y[] repeatedly 2 misses per access to a & c vs.
Recommended publications
  • Mips 16 Bit Instruction Set
    Mips 16 bit instruction set Continue Instruction set architecture MIPSDesignerMIPS Technologies, Imagination TechnologiesBits64-bit (32 → 64)Introduced1985; 35 years ago (1985)VersionMIPS32/64 Issue 6 (2014)DesignRISCTypeRegister-RegisterEncodingFixedBranchingCompare and branchEndiannessBiPage size4 KBExtensionsMDMX, MIPS-3DOpenPartly. The R12000 has been on the market for more than 20 years and therefore cannot be subject to patent claims. Thus, the R12000 and old processors are completely open. RegistersGeneral Target32Floating Point32 MIPS (Microprocessor without interconnected pipeline stages) is a reduced setting of the Computer Set (RISC) Instruction Set Architecture (ISA):A-3:19, developed by MIPS Computer Systems, currently based in the United States. There are several versions of MIPS: including MIPS I, II, III, IV and V; and five MIPS32/64 releases (for 32- and 64-bit sales, respectively). The early MIPS architectures were only 32-bit; The 64-bit versions were developed later. As of April 2017, the current version of MIPS is MIPS32/64 Release 6. MiPS32/64 differs primarily from MIPS I-V, defining the system Control Coprocessor kernel preferred mode in addition to the user mode architecture. The MIPS architecture has several additional extensions. MIPS-3D, which is a simple set of floating-point SIMD instructions dedicated to common 3D tasks, MDMX (MaDMaX), which is a more extensive set of SIMD instructions using 64-bit floating current registers, MIPS16e, which adds compression to flow instructions to make programs that take up less space, and MIPS MT, which adds layered potential. Computer architecture courses in universities and technical schools often study MIPS architecture. Architecture has had a major impact on later RISC architectures such as Alpha.
    [Show full text]
  • A Programming Model and Processor Architecture for Heterogeneous Multicore Computers
    A PROGRAMMING MODEL AND PROCESSOR ARCHITECTURE FOR HETEROGENEOUS MULTICORE COMPUTERS A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Michael D. Linderman February 2009 c Copyright by Michael D. Linderman 2009 All Rights Reserved ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Professor Teresa H. Meng) Principal Adviser I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Professor Mark Horowitz) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Professor Krishna V. Shenoy) Approved for the University Committee on Graduate Studies. iii Abstract Heterogeneous multicore computers, those systems that integrate specialized accelerators into and alongside multicore general-purpose processors (GPPs), provide the scalable performance needed by computationally demanding information processing (informatics) applications. However, these systems often feature instruction sets and functionality that significantly differ from GPPs and for which there is often little or no sophisticated compiler support. Consequently developing applica- tions for these systems is difficult and developer productivity is low. This thesis presents Merge, a general-purpose programming model for heterogeneous multicore systems. The Merge programming model enables the programmer to leverage different processor- specific or application domain-specific toolchains to create software modules specialized for differ- ent hardware configurations; and provides language mechanisms to enable the automatic mapping of processor-agnostic applications to these processor-specific modules.
    [Show full text]
  • Production Cluster Visualization: Experiences and Challenges
    Workshop on Parallel Visualization and Graphics Production Cluster Visualization: Experiences and Challenges Randall Frank VIEWS Visualization Project Lead Lawrence Livermore National Laboratory UCRL-PRES-200237 Workshop on Parallel Visualization and Graphics Why COTS Distributed Visualization Clusters? The realities of extreme dataset sizes (10TB+) • Stored with the compute platform • Cannot afford to copy the data • Co-resident visualization Track compute platform trends • Distributed infrastructure • Commodity hardware trends • Cost-effective solutions Migration of graphics leadership • The PC (Gamers) Desktops • Display technologies • HDR, resolution, stereo, tiled, etc Workshop on Parallel Visualization and Graphics 2 Production Visualization Requirements Unique, aggresive I/O requirements • Access patterns/performance Generation of graphical primitives • Graphics computation: primitive extraction/computation • Dataset decomposition (e.g. slabs vs chunks) Rendering of primitives • Aggregation of multiple rendering engines Video displays • Routing of digital or video tiles to displays (over distance) Interactivity (not a render-farm!) • Real-time imagery • Interaction devices, human in the loop (latency, prediction issues) Scheduling • Systems and people Workshop on Parallel Visualization and Graphics 3 Visualization Environment Architecture Archive PowerWall Analog Simulation Video Data Switch GigE Switch Visualization Engine Data Manipulation Engine Offices • Raw data on platform disks/archive systems • Data manipulation
    [Show full text]
  • Introduction to Playstation®2 Architecture
    Introduction to PlayStation®2 Architecture James Russell Software Engineer SCEE Technology Group In this presentation ä Company overview ä PlayStation 2 architecture overview ä PS2 Game Development ä Differences between PS2 and PC. Technology Group 1) Sony Computer Entertainment Overview SCE Europe (includes Aus, NZ, Mid East, America Technology Group Japan Southern Africa) Sales ä 40 million sold world-wide since launch ä Since March 2000 in Japan ä Since Nov 2000 in Europe/US ä New markets: Middle East, India, Korea, China ä Long term aim: 100 million within 5 years of launch ä Production facilities can produce 2M/month. Technology Group Design considerations ä Over 5 years, we’ll make 100,000,000 PS2s ä Design is very important ä Must be inexpensive (or should become that way) ä Technology must be ahead of the curve ä Need high performance, low price. Technology Group How to achieve this? ä Processor yield ä High CPU clock speed means lower yields ä Solution? ä Low CPU clock speed, but high parallelism ä Nothing readily available ä SCE designs custom chips. Technology Group 2) Technical Aspects of PlayStation 2 ä 128-bit CPU core “Emotion Engine” ä + 2 independent Vector Units ä + Image Processing Unit (for MPEG) ä GS - “Graphics Synthesizer” GPU ä SPU2 - Sound Processing Unit ä I/O Processor (CD/DVD, USB, i.Link). Technology Group “Emotion Engine” - Specifications ä CPU Core 128 bit CPU ä System Clock 300MHz ä Bus Bandwidth 3.2GB/sec ä Main Memory 32MB (Direct Rambus) ä Floating Point Calculation 6.2 GFLOPS ä 3D Geometry Performance 66 Million polygons/sec.
    [Show full text]
  • Sony's Emotionally Charged Chip
    VOLUME 13, NUMBER 5 APRIL 19, 1999 MICROPROCESSOR REPORT THE INSIDERS’ GUIDE TO MICROPROCESSOR HARDWARE Sony’s Emotionally Charged Chip Killer Floating-Point “Emotion Engine” To Power PlayStation 2000 by Keith Diefendorff rate of two million units per month, making it the most suc- cessful single product (in units) Sony has ever built. While Intel and the PC industry stumble around in Although SCE has cornered more than 60% of the search of some need for the processing power they already $6 billion game-console market, it was beginning to feel the have, Sony has been busy trying to figure out how to get more heat from Sega’s Dreamcast (see MPR 6/1/98, p. 8), which has of it—lots more. The company has apparently succeeded: at sold over a million units since its debut last November. With the recent International Solid-State Circuits Conference (see a 200-MHz Hitachi SH-4 and NEC’s PowerVR graphics chip, MPR 4/19/99, p. 20), Sony Computer Entertainment (SCE) Dreamcast delivers 3 to 10 times as many 3D polygons as and Toshiba described a multimedia processor that will be the PlayStation’s 34-MHz MIPS processor (see MPR 7/11/94, heart of the next-generation PlayStation, which—lacking an p. 9). To maintain king-of-the-mountain status, SCE had to official name—we refer to as PlayStation 2000, or PSX2. do something spectacular. And it has: the PSX2 will deliver Called the Emotion Engine (EE), the new chip upsets more than 10 times the polygon throughput of Dreamcast, the traditional notion of a game processor.
    [Show full text]
  • Ilore: Discovering a Lineage of Microprocessors
    iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science & Applications Kirk Cameron, Chair Godmar Back Margaret Ellis May 24, 2021 Blacksburg, Virginia Keywords: Computer history, systems, computer architecture, microprocessors Copyright 2021, Samuel Lewis Furman iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman (ABSTRACT) Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing. We propose iLORE, a data model designed to represent intricate relationships between computer system benchmarks and computer components. We detail the methods we used to implement and populate the iLORE data model using data harvested from publicly available sources. Finally, we demonstrate the validity and utility of our iLORE implementation through an analysis of the characteristics and lineage of commercial microprocessors. We encourage the research community to interact with our data and visualizations at csgenome.org. iLORE: Discovering a Lineage of Microprocessors Samuel Lewis Furman (GENERAL AUDIENCE ABSTRACT) Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing.
    [Show full text]
  • Grids to Petaflops
    Thoughts on HPC Facilities Strategies for DOE Office of Science: Grids to Petaflops Rick Stevens Argonne National Laboratory University of Chicago Outline • Update on NSF’s distributed terascale facility • What grid and facilities strategy is appropriate for DOE? • Limits to cluster based architectures • New paths to petaflops computing capability • Grid implications of affordable petaflops • Summary and recommendations R. Stevens Argonne National Laboratory + University of Chicago NSF TeraGrid Approach [$53M in FY01-FY03] • DTF’s project goal is deployment of a production Grid environment • staged deployment based on service priorities • first priority is a linked set of working IA-64 based clusters • immediately useful by the current NSF PACI user base • supporting current high-end applications • standard cluster and data management software • Grid software deployed in phases • basic, core, and advanced services • DTF technology choices based on application community trends • > 50% of top 20 PACI users compute on Linux clusters • development and production runs • majority of NSF MRE projects plan Data Grid environments R. Stevens Argonne National Laboratory + University of Chicago Major DTF and TeraGrid Tasks • Create management structure – Harder than we thought! • Engage major application teams – Starting with ITRs and MREs • Construct high bandwidth national network – On track • Integrate terascale hardware and software – Planning underway • Establish distributed TeraGrid operations – New Concepts needed • Deploy and harden Grid software
    [Show full text]
  • SCPH-39000 Series Service Manual
    SCPH-39000 series Reproduction Prohibited G-chassis NOTICE ADVARSEL Copyright by Sony Computer Entertainment Inc. All rights reserved. No reproduced and transferred required except Eksplosjonsfare ved feilaktig skifte av batteri. Benytt samme batteritype eller en tilsvarende type anbefalt av permission by Sony Computer Entertainment Inc. No commercial use and rental required. apparatfabrikanten. Brukte batterier kasseres i henhold til fabrikantens instruksjoner. REVISIONS. VARNING This manual compiled by first edition of SCPH-39000 series design data. Explosionsfara vid felaktigt batteribyte. Använt samma batterityp eller en ekvivalent typ som rekommenderas av Design and Specification will be subject to changed without notice. The latest versions of service manual, Technical apparattillverkaren. Kassera använt batteri enligt fabrikantens instruktion. memo, Technical information should be used together. VAROITUS SAFETY-RELATED COMPONENT WARNING !! Paristo voi räjähtää, jos se on virheellisesti asennettu. Vaihda paristo ainoastaan valmistajan suosittelemaan tyyppiin. COMPONENTS IDENTIFIED BY MARK OR DOTTED LINE WITH MARK ON THE SCHEMATIC DIAGRAMS Hävitä käytetty paristo valmistajan ohjeiden mukaisesti. AND IN THE PARTS LIST ARE CRITICAL TO SAFE OPERATION. REPLACE THESE COMPONENTS WITH SONY PARTS WHOSE PART NUMBERS APPEAR AS SHOWN IN THIS MANUAL OR IN SUPPLEMENTS PUBLISHED BY SONY. SAFETY CHECK-OUT After correcting the original service problem, perform the following safety checks before releasing the set to the customer: ATTENTION AU COMPOSANT AYANT RAPPORT À LA SÉCURITÉ ! 1. Check the area of your repair for unsoldered or cold-soldered connections. Check the entire board surface for LES COMPOSANTS IDENTIFIÉS PAR UNE MARQUE SUR LES DIAGRAMMES SCHÉMATIQUES ET LA LISTE solder splashes and bridges. DES PIÈCES SONT CRITIQUES POUR LA SÈCURITÈ DE FONCTIONNEMENT.
    [Show full text]
  • Evaluation of the Playstation 2 As a Cluster Computing Node
    Rochester Institute of Technology RIT Scholar Works Theses 2004 Evaluation of the PlayStation 2 as a cluster computing node Christopher R. Nigro Follow this and additional works at: https://scholarworks.rit.edu/theses Recommended Citation Nigro, Christopher R., "Evaluation of the PlayStation 2 as a cluster computing node" (2004). Thesis. Rochester Institute of Technology. Accessed from This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected]. Evaluation of the PlayStation 2 as a Cluster Computing Node Christopher R. Nigro A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering Department of Computer Engineering Kate Gleason College of Engineering Rochester Institute of Technology May 2, 2004 M.Shaaban Primary Advisor: Dr. Muhammad Shaaban Roy S. Czernikowski Advisor: Dr. Roy Czemikowski Greg Semeraro Advisor: Dr. Greg Semeraro Release Permission Form Rochester Institute of Technology Evaluation of the PlayStation 2 as' a Cluster Computing Node I, Christopher R. Nigro, hereby grant permission to the Wallace Library of the Rochester Institute of Technology to reproduce this thesis, in whole or in part, for non-commercial and non-profit purposes only. Christopher R. Nigro Christopher R. Nigro Date Abstract Cluster computing is currently a popular, cost-effective solution to the increasing computational demands of many applications in scientific computing and image processing. A cluster computer is comprised of several networked computers known as nodes. Since the goal of cluster computing is to provide a cost-effective means to processing computationally demanding applications, nodes that can be obtained at a low price with minimal performance tradeoff are always attractive.
    [Show full text]
  • Playstation 2
    Playstation 2 By James Thesing & Michael Loper Overview ● Introduction ● Features ● Emotion Engine ○ Vector Processors ○ CPU ‐ MIPS III ○ Image Processor ● I/O Processor ● Graphics Synthesizer ● Main Memory ● Conclusion Introduction ● Console Gaming ○ Design methodology based on system lifetime ● Sony founded in 1946 ● Contracted by Nintendo to make a CD add on ○ development of a console to compliment this called the “Play Station” ○ Contract dispute led to independent development of the “Playstation” ● Sony Computer Entertainment formed in 1994 ● Released PS2 March 2000 (Japan), Oct 2000 (NA) Features ● DVD Player ○ 8 to 28 times storage capacity of Playstation one ● Backwards Compatibility ● Two Memory Card Capacity ○ 8 MB ● Two Controllers Specifications ● CPU: 128‐bit Playstatoin 2 CPU ● System Clock Frequency: 249.912 MHz ● Cache Memory ○ Instruction: 16KB ○ Data: 8KB ● Main Memory: Direct Rambus ( Direct RDRAM ) ● Memory Size: 32MB ● Memory Bus Bandwidth: 3.2GB per Second ● Vector Units: ○ VU0 ○ VU1 ● Floating Point Performance: 6.2 GFLOPS ● 3D CG Geometric Transformation: 66 Million Polygons per Second ● Graphics: Graphics Synthesizer ● Clock Frequency: 147.456MHz ● Embedded DRAM: 4MB ● DRAM Bus bandwidth: 48GB Per Second ● DRAM Bus width: 2560 Bits ● Emotion Engine Clock Frequency: 33.86MHz Schematic Emotion Engine ‐ CPU • MIPS III(R4000) Two issue superscalar core o 128‐bit multimedia extension o SCE redefined the 64‐bit SIMD instructions for 128‐bit • Split L1 Cache o 16K two‐way set associative instruction cache o 8K two‐way
    [Show full text]
  • Appendix M Historical Perspectives and References
    M.1 Introduction M-2 M.2 The Early Development of Computers (Chapter 1) M-2 M.3 The Development of Memory Hierarchy and Protection (Chapter 2 and Appendix B) M-9 M.4 The Evolution of Instruction Sets (Appendices A, J, and K) M-17 M.5 The Development of Pipelining and Instruction-Level Parallelism (Chapter 3 and Appendices C and H) M-27 M.6 The Development of SIMD Supercomputers, Vector Computers, Multimedia SIMD Instruction Extensions, and Graphical Processor Units (Chapter 4) M-45 M.7 The History of Multiprocessors and Parallel Processing (Chapter 5 and Appendices F, G, and I) M-55 M.8 The Development of Clusters (Chapter 6) M-74 M.9 Historical Perspectives and References M-79 M.10 The History of Magnetic Storage, RAID, and I/O Buses (Appendix D) M-84 M Historical Perspectives and References If … history … teaches us anything, it is that man in his quest for knowledge and progress is determined and cannot be deterred. John F. Kennedy Address at Rice University (1962) Those who cannot remember the past are condemned to repeat it. George Santayana The Life of Reason (1905), Vol. 2, Chapter 3 M-2 ■ Appendix M Historical Perspectives and References M.1 Introduction This appendix provides historical background on some of the key ideas presented in the chapters. We may trace the development of an idea through a series of machines or describe significant projects. If you are interested in examining the initial development of an idea or machine or are interested in further reading, references are provided at the end of each section.
    [Show full text]
  • SYSC5603 Paper Analysis: Multiprocessors for DSP
    SYSC5603 Paper Analysis: Multiprocessors for DSP Author: Trevor Burton Submitted to: Professor Miodrag Bolic Carleton Student ID: 100278854 Date: Monday, February 6, 2006 SYSC5603 Paper Analysis – Multiprocessors for DSP Trevor Burton 1. Introduction The performance of single-core microprocessors, or uniprocessors, has been driven over the years by the decreasing size and increasing speed of silicon transistors. This has resulted in faster and more powerful processors in accordance with Moore’s Law, which states that the number of transistors that can be fabricated on a give chip size doubles approximately every two years. However, as the number of transistors on a chip has increased, along with speed at which they switch, so has the amount of power they consume and heat they generate. These power consumption and heat dissipation concerns, along with other factors such as dwindling performance gain from exploiting instruction-level parallelism (ILP) [1] in these superscalar processors, have resulted in a slowdown in processor performance increases [2]. As a result, manufactures have adopted a multi-core architecture where instead of focusing on designing bigger, more complex, and power hungry uniprocessors, several smaller energy efficient processors are integrated onto a single chip. These new multi- core processors operate at a lower temperature, are more energy efficient, and achieve greater throughput than their uniprocessor counterparts. The greater amount of throughput that can be attained by these new multi-core processors comes as a result of exploiting parallelism within the system. Many major manufactures of microprocessors for the desktop, laptop, and server market now offer multi-core processors [3-6].
    [Show full text]