Runahead Threads


WARNING. By consulting this thesis you accept the following conditions of use: distribution of this thesis through the TDX service (www.tesisenxarxa.net) has been authorized by the holders of its intellectual property rights solely for private use within research and teaching activities. Reproduction for profit is not authorized, nor is distribution or availability of the thesis from a site other than the TDX service. Presenting its content in a window or frame outside TDX (framing) is not authorized. These restrictions apply both to the presentation summary of the thesis and to its contents. Any use or citation of parts of the thesis must credit the author. (The same conditions apply to the TDR service, www.tesisenred.net.)

Doctoral thesis

Runahead Threads
Tanausú Ramírez García

Universitat Politècnica de Catalunya
Departament d'Arquitectura de Computadors

© 2010 Tanausú Ramírez García. All rights reserved.

Advisors:
Mateo Valero, Universitat Politècnica de Catalunya
Alex Pajuelo, Universitat Politècnica de Catalunya
Oliverio J. Santana, Universidad de Las Palmas de Gran Canaria

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science), Doctor por la UPC.

To my parents, for their love, support, and encouragement.

Acknowledgments

This thesis is the result of a great adventure far from my homeland, Gran Canaria. Along this long road there have been many moments of joy and personal satisfaction, as well as hard and difficult ones. I have been fortunate, however, to count on the support of many people, to whom I dedicate these lines, since without them I would not have made it this far.

First of all, I want to thank my thesis advisor, Mateo Valero, for his confidence in me from the very beginning.
Before he even knew me, he invited me to a dinner at a national conference, and a single car ride from Lleida to Barcelona was enough for him to convince me to pursue the doctorate. Mateo has given me his help, his knowledge, and his support, as well as the opportunity to begin my research career in the Computer Architecture Department (DAC) at UPC. Likewise, I want to thank my other thesis advisors: Alex Pajuelo, for his confidence and unfailing help, both professional and personal, in carrying this work forward; and Oliver Santana, for his dedication and help at every moment, both here and from afar. This thesis would probably not be finished without your help. Quite simply, thank you.

To carry out this thesis I had to move from Las Palmas to Barcelona, leaving behind my family and many friends in the Canary Islands. On this journey I especially thank Carmelo Acosta, a friend and classmate since school, and Javier Verdú and Fran Cazorla, also colleagues at ULPGC. All of them, already doctors, had taken the same path before me, encouraging me to come, making my arrival easier, and helping me through every situation during these years. Thank you, Carmelo and Hema, for offering me your home, and especially your company in those first months.

Once here, I have also met wonderful new people who, besides being fellow doctoral students, have become very good friends. Among them I especially want to mention Josep Maria Codina, Ayose Falcón, Ana Bosque, Jaume Abella, and Rubén Gran. I thank them for their support, encouragement, friendship, and company through the different stages of this thesis, and above all for the moments of fun that helped me overcome obstacles with joy.

Thanks as well to the many other DAC colleagues I have come to know over these years, with whom I have shared work meetings, the grad-student office, lunch hours, coffees, and many other activities: Adrián Cristal, Marco Galluzi, Alejandro García, Beatriz Otero, Jordi Guitart, Germán Rodríguez, Llorenç Cruz, Ramon Canal, Fermín Sanchez, Javi Alonso, Miquel Moreto... and many more who I hope will not mind not being named here. Thanks also to the rest of the professors and members of the department, who make it one of the most important in the world in its field.

This work has been funded by FPU grant AP2003-3682 of the Ministerio de Educación y Ciencia, by research projects TIN-2004-07739 and TIN2007-60625 of the Comisión Interministerial de Ciencia y Tecnología (CICYT), and by the European network of excellence on High Performance Embedded Architectures and Compilers (HiPEAC).

On a personal level, this thesis has been not only a professional challenge but surely a great change in my life. Thanks to it, I have come to know another land and other people, traveled, and taken up many other activities, such as entering the world of salsa. Dancing salsa let me disconnect from stress and discouragement in the most difficult moments. Most importantly, I have met many salsa friends who are truly worth it, Oscar, Susana, Joaquín, Cristina Robles, Eva, Cristina Salinas, Luba, Patrick... and I want to thank them for the many moments of fun and friendship they have shared with me.
In a very special way, I want to thank Judit with all my heart, the most wonderful person I have ever met, whose way of being turned our dancing into love. I have no words to thank her for her loving company and for supporting and encouraging me again and again; thanks to her I found the strength to make it to the end. And, of course, thank you for all the love you surround me with, which I hope to enjoy for the rest of my life.

Finally, my deepest and most heartfelt thanks to my family, who with loving patience and unconditional help have also supported me through this stage of my life, enduring the distance that separated us. My mother Yolanda, my father Antonio, my sister Yurena, and my grandmother Tata: despite the distance, I could always feel your affection. I love you very much.

Abstract

The evolution of processors has been driven by both rapid technological advances and smart architectural designs. Until recently, processors focused mainly on exploiting instruction-level parallelism (ILP) to improve performance. However, the combination of the practical limits of superscalar implementations, the theoretical limits of instruction-level parallelism, and growing power constraints has forced changes in processor design, and the computer architecture community has turned its research toward other forms of parallelism. As a result of this trend, thread-level parallelism (TLP) has become a common strategy for improving processor performance.

Simultaneous Multithreading (SMT) is one of the recent paradigms for exploiting TLP. The defining feature of SMT processors is that they execute multiple threads simultaneously, increasing pipeline utilization by sharing many more resources than other types of processors do. However, the memory wall problem is still present in these processors. Memory-intensive threads tend to use processor and memory resources poorly, creating resource contention: such threads can clog shared resources during long-latency memory operations without making progress on the SMT processor, thereby hindering overall system performance.

In this dissertation we target these problems with a twofold objective: to improve the performance of these multithreaded processors, and to do so as efficiently as possible within current power constraints. To cover these goals, the dissertation first proposes Runahead Threads (RaT) to alleviate the memory wall problem, and then provides additional techniques to improve the efficiency of the initial proposal.

Contents

1 Introduction
  1.1 Multithreaded models
    1.1.1 Simultaneous Multithreaded Processors
    1.1.2 SMT trends
  1.2 Thesis overview
    1.2.1 The motivation: SMT problems
    1.2.2 The approach: Runahead Threads
  1.3 Thesis contributions
  1.4 List of publications
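To make the abstract's central idea concrete, here is a minimal sketch of a runahead thread's life cycle: when a thread's oldest instruction is a load that misses to memory, the thread switches to runahead mode and keeps issuing speculative loads (prefetches) instead of clogging shared resources, then resumes normal execution once the miss is serviced. The state machine, the fixed 300-cycle miss, and every name below are invented for illustration; this is not the thesis's implementation.

    #include <stdio.h>

    /* Illustrative sketch of the Runahead Threads idea: a blocked thread
     * turns into a runahead thread instead of holding shared resources. */

    enum mode { NORMAL, RUNAHEAD };

    struct thread {
        enum mode mode;
        int miss_cycles_left;   /* remaining latency of the blocking load */
        long prefetches;        /* speculative loads issued in runahead */
        long committed;         /* instructions retired in normal mode */
    };

    static void tick(struct thread *t) {
        if (t->mode == NORMAL) {
            if (t->miss_cycles_left > 0)
                t->mode = RUNAHEAD;    /* oldest load missed: checkpoint
                                          (cost omitted) and enter runahead */
            else
                t->committed++;        /* normal commit */
        } else {
            t->prefetches++;           /* speculative execution warms caches */
            if (--t->miss_cycles_left == 0)
                t->mode = NORMAL;      /* miss serviced: restore checkpoint */
        }
    }

    int main(void) {
        struct thread t = { NORMAL, 300, 0, 0 };  /* one 300-cycle miss */
        for (int cycle = 0; cycle < 1000; cycle++)
            tick(&t);
        printf("committed=%ld prefetches=%ld\n", t.committed, t.prefetches);
        return 0;
    }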
Recommended publications
  • Precise Runahead Execution
    2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout; Ghent University, Belgium, and Universitat Politècnica de València, Spain.

    Abstract: Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency, because they release processor state when entering runahead mode and then need to re-fill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load.

    … runahead mode, the higher the performance benefit of runahead execution. On the other hand, speculative code execution imposes overheads for saving and restoring state, and rolling the pipeline back to a proper state to resume normal operation after runahead execution. The lower the performance penalty of these overheads, the higher the performance gain. Consequently, maximizing the performance benefits of runahead execution requires (1) maximizing the number of useful prefetches per runahead interval, and (2) limiting the switching overhead between runahead mode and normal execution. We find that prior attempts to optimize the performance of runahead execution have shortcomings that impede them from adequately …
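    The excerpt's two optimization levers, useful prefetches per runahead interval and switching overhead, can be seen with back-of-the-envelope arithmetic. The function below and every cycle count in it are invented for illustration; this is not the paper's model.

        #include <stdio.h>

        /* Runahead pays off only when the latency hidden by useful prefetches
         * outweighs the cost of entering/exiting runahead (state save/restore
         * plus pipeline re-fill). All numbers are made up for illustration. */

        static long runahead_net_benefit(long useful_prefetches,
                                         long cycles_hidden_per_prefetch,
                                         long switch_overhead_cycles) {
            return useful_prefetches * cycles_hidden_per_prefetch
                 - switch_overhead_cycles;
        }

        int main(void) {
            /* 4 useful prefetches, each hiding ~200 cycles of DRAM latency,
             * against a 150-cycle combined checkpoint + re-fill cost */
            printf("net: %ld cycles\n", runahead_net_benefit(4, 200, 150));
            /* an interval with a single useful prefetch barely breaks even */
            printf("net: %ld cycles\n", runahead_net_benefit(1, 200, 150));
            return 0;
        }

    Under these assumed numbers, an interval that triggers several independent misses easily pays for the mode switch, while one that finds a single miss barely breaks even, which is why limiting the switching overhead matters.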
  • A Chip Multi-Cluster Architecture with Locality Aware Task Distribution
    A Chip Multi-Cluster Architecture with Locality Aware Task Distribution. A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences, 2007. Matthew James Horsnell, School of Computer Science.

    Contents: List of Tables; List of Figures; List of Algorithms; Abstract; Declaration; Copyright; Acknowledgements. 1 Introduction (Motivation; Microprocessor Design Challenges: Wire Delay, Memory Gap, Limits of Instruction Level Parallelism, Power Limits, Design Complexity; Design Solutions: Exploiting Parallelism, Partitioned Designs, Bridging the Memory Gap, Design Abstraction and Replication; Summary; Research Aims; Contributions; Thesis Structure; Publications). 2 Parallelism (Application Parallelism: Amdahl's Law, Implicit and Explicit Parallelism, Granularity of Parallelism; Architectural Parallelism: Bit Level Parallelism, Data Level Parallelism, Instruction Level Parallelism, Multithreading, Simultaneous Multithreading, Chip Multiprocessors; Summary). 3 Jamaica CMP and Software Environment (The Jamaica Chip Multiprocessor: Multithreading, Register Windows …)
  • Runahead Execution
    18-447 Computer Architecture, Lecture 26: Runahead Execution. Prof. Onur Mutlu, Carnegie Mellon University, Spring 2014 (4/7/2014).

    Today: memory latency tolerance, runahead execution, prefetching.

    Readings. Required: Mutlu et al., "Runahead Execution," HPCA 2003; Srinath et al., "Feedback Directed Prefetching," HPCA 2007. Optional: Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," ISCA 2005 and IEEE Micro Top Picks 2006; Mutlu et al., "Address-Value Delta (AVD) Prediction," MICRO 2005; Armstrong et al., "Wrong Path Events," MICRO 2004.

    Tolerating memory latency. An out-of-order execution processor tolerates the latency of multi-cycle operations by executing independent instructions concurrently. It does so by buffering instructions in reservation stations and the reorder buffer. The instruction window is the set of hardware resources needed to buffer all decoded but not yet retired/committed instructions. What if an instruction takes 500 cycles? How large an instruction window do we need to continue decoding? How many cycles of latency can out-of-order execution tolerate?

    Stalls due to long-latency instructions. When a long-latency instruction is not complete, it blocks instruction retirement, because we need to maintain precise exceptions. Incoming instructions fill the instruction window (reorder buffer, reservation stations). Once the window is full, the processor cannot place new instructions into the window; this is called a full-window stall. A full-window stall prevents the processor from making progress in the execution of the program.

    Full-window stall example, with an 8-entry instruction window (oldest first):

        LOAD R1 ← mem[R5]     ; L2 miss! Takes 100s of cycles.
        BEQ  R1, R0, target
        ADD  R2 ← R2, 8
        LOAD R3 ← mem[R2]
        MUL  R4 ← R4, R3      ; independent of the L2 miss, executed out of
        ADD  R4 ← R4, R5      ;   program order, but cannot be retired
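    The stall in the slide's example can be reproduced with a toy window model: the window fills with independent younger instructions in a handful of cycles, and then dispatch stops for the rest of the miss. The window size and miss latency below are the slide's numbers; everything else is an illustrative assumption, not a real core model.

        #include <stdio.h>

        /* Toy model of a full-window stall: an 8-entry window fills up
         * behind a 500-cycle load, then dispatch stalls. */

        #define WINDOW 8

        int main(void) {
            int occupied = 1;          /* the missing load holds the oldest slot */
            long stalls = 0, dispatched = 0;
            for (int cycle = 0; cycle < 500; cycle++) {   /* miss latency */
                if (occupied < WINDOW) {
                    occupied++;        /* a younger instruction enters the window */
                    dispatched++;
                } else {
                    stalls++;          /* full window: no retirement, no room */
                }
            }
            printf("dispatched %ld, then stalled %ld of 500 cycles\n",
                   dispatched, stalls);
            return 0;
        }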
  • Rock: A High-Performance SPARC CMT Processor
    Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.

    Shailender Chaudhry, Robert Cypher, Magnus Ekman.

    Designing an aggressive chip-multithreaded (CMT) processor involves many tradeoffs. To maximize throughput performance, each processor core must be highly area and power efficient, so that many cores can coexist on a single die. Similarly, if the processor is to perform well on a … even more aggressive simultaneous speculative threading (SST), which uses two checkpoints. EA is an area-efficient way of creating a large virtual issue window without the large associative structures. SST dynamically extracts parallelism, letting execution proceed in parallel at …
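    Hardware scouting rests on the same checkpoint-and-rollback idea runahead uses: snapshot architectural state, speculate past the miss for its prefetching side effects, then discard the results. A software analogy only, with an invented 8-register file and memcpy standing in for the hardware checkpoint; this is not Rock's mechanism.

        #include <stdio.h>
        #include <string.h>

        #define NREGS 8

        static int regs[NREGS];
        static int checkpoint[NREGS];

        static void take_checkpoint(void)    { memcpy(checkpoint, regs, sizeof regs); }
        static void restore_checkpoint(void) { memcpy(regs, checkpoint, sizeof regs); }

        int main(void) {
            regs[1] = 42;
            take_checkpoint();        /* load miss detected: checkpoint state */
            for (int i = 0; i < NREGS; i++)
                regs[i] += i;         /* scout ahead: results warm caches and
                                         predictors but are architecturally void */
            restore_checkpoint();     /* miss serviced: roll back, re-execute */
            printf("r1 after rollback = %d\n", regs[1]);   /* prints 42 */
            return 0;
        }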
  • Vector Runahead
    Vector Runahead. Ajeya Naithani (Ghent University), Sam Ainsworth (University of Edinburgh), Timothy M. Jones (University of Cambridge), Lieven Eeckhout (Ghent University).

    Abstract: The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive. Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain.

    … stride patterns; such techniques are endemic in today's cache systems [19]. For more complex indirection patterns, the inability at the cache-system level to identify complex load chains and generate their addresses limits existing techniques to simple array-indirect [88] and pointer-chasing [23] codes. To achieve the instruction-level visibility necessary to calculate the addresses of complex access patterns seen in today's workloads [3], we conclude that this ideal technique must operate within the core, instead of within the cache. Runahead execution [25, 32, 34, 57, 58, 64] is the most promising technique to date, where upon a memory stall at the head of the reorder buffer (ROB), execution enters …
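    An example of the dependent, indirect access pattern the excerpt describes: each load's address comes from the value of a previous load, so a stride prefetcher sees no pattern, and a standard runahead episode stalls again at the first miss of every chain. The array names and sizes are arbitrary choices for this sketch.

        #include <stdio.h>
        #include <stdlib.h>

        #define N (1 << 20)

        int main(void) {
            static int idx1[N], idx2[N];
            static double data[N];
            for (int i = 0; i < N; i++) {      /* build a random indirection */
                idx1[i] = rand() % N;
                idx2[i] = rand() % N;
                data[i] = i;
            }
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += data[idx2[idx1[i]]];    /* A[B[C[i]]]-style dependent chain */
            printf("%f\n", sum);
            return 0;
        }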
  • Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance
    Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. Onur Mutlu, Hyesoon Kim, Yale N. Patt, University of Texas at Austin.

    Several simple techniques can make runahead execution more efficient by reducing the number of instructions executed, and thereby the additional energy consumption typically associated with runahead execution.

    Today's high-performance processors face main-memory latencies on the order of hundreds of processor clock cycles. As a result, even the most aggressive processors spend a significant portion of their execution time stalling and waiting for main-memory accesses to return data to the execution core. Previous research has shown that runahead execution significantly increases a high-performance processor's ability to tolerate long main-memory latencies [1, 2]. Runahead execution improves a processor's performance by speculatively pre-executing the application program while the processor services a long-latency (L2) data cache miss, instead of stalling the processor for the duration of the L2 miss. Thus, runahead execution lets a processor execute instructions that it otherwise couldn't execute under an L2 cache miss.

    … speculatively processed (executed) instructions, sometimes without enhancing performance. For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce these extra instructions. Our solution to this problem includes both hardware and software mechanisms that are simple, implementable, and effective.

    Background on runahead execution: conventional out-of-order execution processors use instruction windows to buffer instructions so they can tolerate long latencies. Because a cache miss to main memory takes hundreds of processor cycles to service, a processor needs to buffer an unreasonably large …
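    One family of efficiency techniques in this line of work avoids runahead periods that are too short to generate useful prefetches, since they burn energy executing extra instructions for little gain. A minimal sketch of such a policy, assuming the core can query how long the blocking miss has been outstanding; the threshold, latency constant, and function name are this sketch's assumptions, not the article's exact mechanism.

        #include <stdio.h>
        #include <stdbool.h>

        #define FULL_MISS_LATENCY 400   /* assumed DRAM round trip, cycles */
        #define MIN_USEFUL_WINDOW 100   /* skip runahead below this remainder */

        static bool should_enter_runahead(int cycles_miss_outstanding) {
            int remaining = FULL_MISS_LATENCY - cycles_miss_outstanding;
            return remaining >= MIN_USEFUL_WINDOW;
        }

        int main(void) {
            printf("outstanding 50 cycles  -> enter? %d\n",
                   should_enter_runahead(50));    /* long runahead ahead: yes */
            printf("outstanding 350 cycles -> enter? %d\n",
                   should_enter_runahead(350));   /* miss almost back: no */
            return 0;
        }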
  • Onur-447-Spring12-Lecture23
    18-447 Computer Architecture, Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu, Carnegie Mellon University, Spring 2012 (4/18/2012).

    Reminder, Lab Assignment 6: implementing a more realistic memory hierarchy (L2 cache model; DRAM and memory controller models; MSHRs and multiple outstanding misses). Due April 23; extra credit: prefetching.

    Last lecture: memory latency tolerance/reduction; stalls; four fundamental techniques; software and hardware prefetching; prefetcher throttling. Today: more prefetching; runahead execution.

    Readings: Srinath et al., "Feedback Directed Prefetching," HPCA 2007; Mutlu et al., "Runahead Execution," HPCA 2003.

    How do we tolerate stalls due to memory? Two major approaches: reduce/eliminate stalls, or tolerate the effect of a stall when it happens. Four fundamental techniques achieve these: caching, prefetching, multithreading, and out-of-order execution. Many techniques have been developed to make these four fundamental techniques more effective in tolerating memory latency.

    Review of prefetching, the four questions: What (what addresses to prefetch), When (when to initiate a prefetch request), Where (where to place the prefetched data), and How (software, hardware, execution-based, cooperative).

    Review of challenges in prefetching (the How). Software prefetching: the ISA provides prefetch instructions; the programmer or compiler inserts them (effort); usually works well only for regular access patterns. Hardware prefetching: hardware monitors processor …
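    The software-prefetching point maps directly to code: GCC and Clang expose the ISA's prefetch instructions through __builtin_prefetch, and the programmer (or compiler) inserts them some distance ahead of use. As the slide notes, this works well mainly for regular access patterns; the prefetch distance of 16 elements below is a tuning assumption, not a universal constant.

        #include <stdio.h>

        #define N 4096
        #define DIST 16   /* prefetch this many iterations ahead */

        int main(void) {
            static double a[N];
            double sum = 0.0;
            for (int i = 0; i < N; i++) a[i] = i;
            for (int i = 0; i < N; i++) {
                if (i + DIST < N)   /* hint: read access, low temporal locality */
                    __builtin_prefetch(&a[i + DIST], 0, 1);
                sum += a[i];
            }
            printf("%f\n", sum);
            return 0;
        }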
  • Efficient Runahead Execution Processors (Onur Mutlu, Ph.D. dissertation, 2006)
    Copyright by Onur Mutlu, 2006. The Dissertation Committee for Onur Mutlu certifies that this is the approved version of the following dissertation: Efficient Runahead Execution Processors. Committee: Yale N. Patt (Supervisor), Craig M. Chase, Nur A. Touba, Derek Chiou, Michael C. Shebanow.

    Efficient Runahead Execution Processors, by Onur Mutlu, B.S., B.S.E., M.S.E. Dissertation presented to the Faculty of the Graduate School of The University of Texas at Austin in partial fulfillment of the requirements for the degree of Doctor of Philosophy. The University of Texas at Austin, August 2006. Dedicated to my loving parents, Hikmet and Nevzat Mutlu, and my sister Miray Mutlu.

    Acknowledgments. Many people and organizations have contributed to this dissertation, intellectually, motivationally, or otherwise financially. This is my attempt to acknowledge their contributions. First of all, I thank my advisor, Yale Patt, for providing me with the freedom and resources to do high-quality research, for being a caring teacher, and also for teaching me the fundamentals of computing in EECS 100 as well as valuable lessons in real-life areas beyond computing. My life as a graduate student would have been very short and unproductive had it not been for Hyesoon Kim. Her technical insights and creativity, analytical and questioning skills, high standards for research, and continuous encouragement made the contents of this dissertation much stronger and clearer. Her presence and support made even the Central Texas climate feel refreshing. Very special thanks to David Armstrong, who provided me with the inspiration to write and publish, both technically and otherwise, at a time when it was difficult for me to do so.
  • Runahead Execution
    15-740/18-740 Computer Architecture, Lecture 14: Runahead Execution. Prof. Onur Mutlu, Carnegie Mellon University, Fall 2011 (10/12/2011).

    Reviews due today: Chrysos and Emer, "Memory Dependence Prediction Using Store Sets," ISCA 1998.

    Announcements. Milestone I: due this Friday (Oct 14); format: 2 pages; include results from your initial evaluations; we need to see good progress, and of course the results need to make sense (i.e., you should be able to explain them). Midterm I: postponed to October 24. Milestone II: will be postponed; stay tuned.

    Course checkpoint. Homeworks: look at the solutions; these are not for grades, but for you to test your understanding of the concepts and prepare for exams; provide succinct and clear answers. Review sets: concise reviews. Projects: the most important part of the course; focus on this; if you are struggling, talk with the TAs. Course feedback: fill out the forms and return them.

    Last lecture: tag broadcast and the wakeup+select loop; Pentium Pro vs. Pentium 4 designs (buffer decoupling); consolidated physical register files in the Pentium 4 and Alpha 21264; centralized vs. distributed reservation stations; which instruction to select. Today: load-related instruction scheduling; runahead execution.

    Review, centralized vs. distributed reservation stations. Centralized (monolithic): + reservation stations are not statically partitioned (can adapt to changes in instruction mix, e.g., 100% adds); -- all entries need to have all fields even though some fields might not be needed for some instructions (e.g., branches, …
  • Configurable Simultaneously Single-Threaded (Multi-)Engine Processor
    Configurable Simultaneously Single-Threaded (Multi-)Engine Processor, by Anita Tino. Bachelor of Engineering (B.Eng), Ryerson University, 2009; Master of Applied Science (M.A.Sc), Ryerson University, 2011. A dissertation presented to Ryerson University in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the program of Electrical and Computer Engineering. Toronto, Ontario, Canada, 2017. © Anita Tino, 2017.

    Author's declaration for electronic submission of a dissertation: I hereby declare that I am the sole author of this thesis dissertation. This is a true copy of the dissertation, including any required final revisions, as accepted by my examiners. I authorize Ryerson University to lend this dissertation to other institutions or individuals for the purpose of scholarly research. I further authorize Ryerson University to reproduce this dissertation by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research. I understand that my dissertation may be made electronically available to the public. - Anita Tino

    Abstract: As the multi-core computing era continues to progress, the need to increase single-thread performance and throughput, and to seamlessly adapt to thread-level parallelism (TLP), remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core.
  • Runahead Execution: A Short Retrospective
    Runahead Execution: A Short Retrospective. Onur Mutlu, Jared Stark, Chris Wilkerson, Yale Patt. HPCA 2021 Test-of-Time Award talk, March 2021.

    Agenda: thanks; runahead execution; looking to the past; looking to the future.

    Runahead Execution [HPCA 2003]: Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA), Anaheim, CA, February 2003. One of the 15 computer architecture papers of 2003 selected as Top Picks by IEEE Micro.

    Small windows: full-window stalls. With an 8-entry instruction window (oldest first):

        LOAD R1 ← mem[R5]     ; L2 miss! Takes 100s of cycles.
        BEQ  R1, R0, target
        ADD  R2 ← R2, 8
        LOAD R3 ← mem[R2]
        MUL  R4 ← R4, R3      ; independent of the L2 miss, executed out of
        ADD  R4 ← R4, R5      ;   program order, but cannot be retired
        STOR mem[R2] ← R4
        ADD  R2 ← R2, 64
        LOAD R3 ← mem[R2]     ; younger instructions cannot be executed because
                              ;   there is no space in the instruction window

    The processor stalls until the L2 miss is serviced. Long-latency cache misses are responsible for most full-window stalls.

    [Figure: impact of long-latency cache misses. Normalized execution time, split into non-stall (compute) time and full-window stall time, for a 128-entry window and a 2048-entry window. Configuration: 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher; data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

    The problem: out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
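    The 2048-entry window in the slide's second data point follows from a rough sizing argument: to keep a W-wide core busy across an L-cycle miss, the window must hold about W x L instructions. A sketch of the arithmetic, with an assumed sustained width of 4 and the slide's 500-cycle DRAM latency:

        #include <stdio.h>

        /* Rough window sizing: issue width x miss latency instructions must
         * fit in flight to cover the miss. Width is an assumption here. */

        int main(void) {
            int width = 4;            /* assumed sustained instructions/cycle */
            int miss_latency = 500;   /* DRAM latency from the slide */
            printf("window needed: ~%d entries\n", width * miss_latency);
            return 0;
        }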
  • Energy-Efficient Latency Tolerance for 1000-Core Data Parallel Processors with Decoupled Strands
    © 2012 Neal Clayton Crago. Energy-Efficient Latency Tolerance for 1000-Core Data Parallel Processors with Decoupled Strands, by Neal Clayton Crago. Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2012. Urbana, Illinois. Doctoral committee: Professor Sanjay J. Patel (Chair), Professor Wen-mei W. Hwu, Associate Professor Steven S. Lumetta, Associate Professor Deming Chen.

    Abstract: This dissertation presents a novel decoupled latency tolerance technique for 1000-core data parallel processors. The approach focuses on developing instruction latency tolerance to improve performance for a single thread. The main idea is to leverage the compiler to split the original thread into separate memory-accessing and memory-consuming instruction streams. The goal is to provide latency tolerance similar to high-performance techniques such as out-of-order execution while leveraging low hardware complexity similar to an in-order execution core.

    The research in this dissertation supports the following thesis: pipeline stalls due to long exposed instruction latency are the main performance limiter for cached 1000-core data parallel processors; leveraging the natural decoupling of memory access and memory consumption, a serial thread of execution can be partitioned into strands, providing energy-efficient latency tolerance.

    This dissertation motivates the need for latency tolerance in 1000-core data parallel processors and presents decoupled core architectures as an alternative to currently used techniques. It discusses the limitations of prior decoupled architectures and proposes techniques to improve both latency tolerance and energy efficiency.
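    A software analogy of the strand partitioning the abstract describes: the loop below is split into a memory-accessing strand that runs ahead filling a bounded queue and a memory-consuming strand that drains it. Real decoupled hardware places queues between in-order pipelines; this single-threaded C version, with invented sizes, only illustrates the partitioning.

        #include <stdio.h>

        #define N 16
        #define QSIZE 4

        int main(void) {
            static int data[N], queue[QSIZE];
            int head = 0, tail = 0, sum = 0;
            for (int i = 0; i < N; i++) data[i] = i;

            int fetched = 0, consumed = 0;
            while (consumed < N) {
                /* access strand: run ahead while there is queue space */
                while (fetched < N && (tail - head) < QSIZE)
                    queue[tail++ % QSIZE] = data[fetched++];
                /* execute strand: consume one value per step */
                sum += queue[head++ % QSIZE];
                consumed++;
            }
            printf("sum=%d\n", sum);   /* 0 + 1 + ... + 15 = 120 */
            return 0;
        }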