Validation of the gem5 Simulator for x86 Architectures
Recommended publications
1 Introduction
Cambridge University Press 978-0-521-76992-1 - Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors, Jean-Loup Baer (excerpt)

Modern computer systems, built from the most sophisticated microprocessors and extensive memory hierarchies, achieve their high performance through a combination of dramatic improvements in technology and advances in computer architecture. Advances in technology have resulted in exponential growth rates in raw speed (i.e., clock frequency) and in the amount of logic (number of transistors) that can be put on a chip. Computer architects have exploited these factors in order to further enhance performance using architectural techniques, which are the main subject of this book.

Microprocessors are over 30 years old: the Intel 4004 was introduced in 1971. The functionality of the 4004 compared to that of the mainframes of that period (for example, the IBM System/370) was minuscule. Today, just over thirty years later, workstations powered by engines such as (in alphabetical order and without specific processor numbers) the AMD Athlon, IBM PowerPC, Intel Pentium, and Sun UltraSPARC can rival or surpass the few remaining mainframes in both performance and functionality, and at a much lower cost. Servers and supercomputers are more often than not made up of collections of microprocessor systems.

It would be wrong to assume, though, that the three tenets that computer architects have followed, namely pipelining, parallelism, and the principle of locality, were discovered with the birth of microprocessors. They were all at the basis of the design of previous (super)computers. The advances in technology made their implementations more practical and spurred further refinements.
Instruction Latencies and Throughput for AMD and Intel x86 Processors
Instruction latencies and throughput for AMD and Intel x86 processors. Torbjörn Granlund, 2019-08-02 09:05Z. Copyright Torbjörn Granlund 2005-2019. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved. This report is work in progress. A newer version might be available here: https://gmplib.org/~tege/x86-timing.pdf

In this short report we present latency and throughput data for various x86 processors. We only present data on integer operations; the data on integer MMX and SSE2 instructions is currently limited. We might present more complete data in the future, if there is enough interest. There are several reasons for presenting this report:
1. Intel's published data were in the past incomplete and full of errors.
2. Intel did not publish any data for 64-bit operations.
3. To allow straightforward comparison of an important aspect of AMD and Intel pipelines.

The data presented here is the result of extensive timing tests. While we have made an effort to make sure the data is accurate, the reader is cautioned that some errors might have crept in.

1 Nomenclature and notation

LNN means latency for an NN-bit operation; TNN means throughput for an NN-bit operation. The term throughput is used to mean the number of instructions of this type that can be sustained per cycle. That implies that more throughput is better, which is consistent with how most people understand the term. Intel uses the same term with the exact opposite meaning in its manuals. The notations "P6 0-E", "P4 F0", etc., are used to save table header space.
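To make the latency/throughput distinction concrete, here is a minimal model (my illustration, not part of the report): a serially dependent instruction chain is bound by latency, while a stream of independent instructions is bound by throughput. The example figures for 64-bit imul (L64 = 3, T64 = 1) are typical of recent x86 cores, but they are assumptions here and should be checked against the report's tables.

    def cycles_dependent(n_instrs: int, latency: int) -> int:
        """Each instruction consumes the previous result, so the core
        must wait out the full latency of every instruction in turn."""
        return n_instrs * latency

    def cycles_independent(n_instrs: int, throughput: float) -> float:
        """Independent instructions are limited only by how many of this
        type can be sustained per cycle (throughput in Granlund's sense)."""
        return n_instrs / throughput

    # Assumed figures for 64-bit imul on a recent x86 core: L64 = 3, T64 = 1.
    N = 1000
    print(cycles_dependent(N, latency=3))       # 3000 cycles, latency-bound
    print(cycles_independent(N, throughput=1))  # 1000 cycles, throughput-bound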
gem5, INTEROPERABILITY, AND IMPROVING SIMULATOR METHODOLOGY
Jason Lowe-Power, [email protected], @jlowepower

Outline: What is gem5? Recent gem5 features. gem5's future: my vision, big news, and features coming soon.

gem5 (originally M5) was created at Michigan by students of Steve Reinhardt, principally Nate Binkert, as "a tool for simulating systems". Its detailed memory system (GEMS) was created at Wisconsin by students of Mark Hill and David Wood.

gem5 today: the 2011 gem5 paper has been cited over 3,000 times; 15-20% of ISCA/MICRO/HPCA papers use gem5; it is a thriving community project with an average of about 70 commits per month and roughly 100 unique contributors over the last 2 years.

gem5 today: over 400 "models" and over 4,000 parameters; 3 timing-based CPU models (simple, in-order, out-of-order); 8 ISAs (ARM, RISC-V, x86, Alpha, Power, SPARC, MIPS, GCN3); 12 memory models (DDR3, DDR4, HBM, HMC, etc.); 42 devices (PCI, Arm platform, x86 platform, storage); an AMD GPGPU model; 12 cache coherence protocols; and a network-on-chip model (Garnet).

gem5 today: programmatic configuration (a sketch of what this looks like follows below); support for complex ML stacks (see https://github.com/KyleRoarty/gem5_docker/); support for ARM SVE instructions; integration with other simulators: DRAMSim2, SST, DSENT, SystemC, McPAT, GPGPU-Sim.

Simulator integration: choose a driver; choose an interface.

gem5 + SST: multiple implementations exist; SST is used as the driver for gem5, but there are two event queues; the interface is memory requests; each simulator requires a separate setup.

gem5 + SystemC
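The "programmatic configuration" point refers to gem5's Python-scripted setup: a simulated machine is assembled by instantiating and wiring Python objects. Below is a minimal SE-mode sketch in the style of the learning-gem5 examples, not taken from the slides; names such as cpu_side_ports, MemCtrl, and SEWorkload are from recent gem5 releases and differ in older versions, and the binary path is a placeholder.

    # Minimal gem5 SE-mode configuration sketch (learning-gem5 style).
    import m5
    from m5.objects import *

    system = System()
    system.clk_domain = SrcClockDomain(clock="1GHz",
                                       voltage_domain=VoltageDomain())
    system.mem_mode = "timing"
    system.mem_ranges = [AddrRange("512MB")]

    system.cpu = TimingSimpleCPU()   # one of the timing-based CPU models
    system.membus = SystemXBar()
    system.cpu.icache_port = system.membus.cpu_side_ports
    system.cpu.dcache_port = system.membus.cpu_side_ports

    # x86-specific interrupt-controller wiring.
    system.cpu.createInterruptController()
    system.cpu.interrupts[0].pio = system.membus.mem_side_ports
    system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
    system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
    system.system_port = system.membus.cpu_side_ports

    # A single memory channel backed by a DDR3 DRAM model.
    system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
    system.mem_ctrl.port = system.membus.mem_side_ports

    binary = "tests/test-progs/hello/bin/x86/linux/hello"  # placeholder
    system.workload = SEWorkload.init_compatible(binary)
    system.cpu.workload = Process(cmd=[binary])
    system.cpu.createThreads()

    root = Root(full_system=False, system=system)
    m5.instantiate()
    event = m5.simulate()
    print("Exited @ tick %d: %s" % (m5.curTick(), event.getCause()))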
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
2018 IEEE International Symposium on High Performance Computer Architecture

Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Anthony Gutierrez, Bradford M. Beckmann, Alexandru Dutu, Joseph Gross, John Kalamatianos, Onur Kayiran, Michael LeBeane, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain†, Timothy G. Rogers†
AMD Research, Advanced Micro Devices, Inc.; †School of Electrical and Computer Engineering, Purdue University
{anthony.gutierrez, brad.beckmann, alexandru.dutu, joe.gross, john.kalamatianos, onur.kayiran, michael.lebeane, matthew.poremba, brandon.potter, sooraj.puthoor, matthew.sinclair, mark.wyse, jieming.yin, xianwei.zhang}@amd.com; {jain156, timrogers}@purdue.edu

Abstract—Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA's intellectual property may not be publicly disclosed.

From the introduction: ...a handful of open-source simulators, such as GPGPU-Sim [11], gem5 [14], and Multi2Sim [35]. Simulating a GPU is especially challenging because of the two-phase compilation flow typically used to generate GPU kernel binaries. To maintain portability between different generations of GPU hardware, GPU kernels are first compiled to an intermediate language (IL). Prior to launching a kernel, low-level software, such as the GPU driver or runtime, finalizes the IL into the machine instruction set architecture (ISA) representation that targets the GPU...
Exam 1 Solutions
Midterm Exam, ECE 741 – Advanced Computer Architecture, Spring 2009
Instructor: Onur Mutlu. TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos. February 25, 2009.

EXAM 1 SOLUTIONS

Points per problem: 1: 40; 2: 20; 3: 15; 4: 20; 5: 25; 6: 20; 7 (bonus): 15; Total: 140+15.

• This is a closed-book midterm. You are allowed to have only two letter-sized cheat sheets.
• No electronic devices may be used.
• This exam lasts 1 hour 50 minutes.
• If you make a mess, clearly indicate your final answer.
• For questions requiring brief answers, please provide brief answers. Do not write an essay; you can be penalized for verbosity.
• Please show your work when needed. We cannot give you partial credit if you do not clearly show how you arrive at a numerical answer.
• Please write your name on every sheet.

Problem 1 (Short answers – 40 points)

i. (3 points) A cache has a block size equal to the word length. What property of program behavior, which usually contributes to higher performance if we use a cache, does not help performance if we use THIS cache?
Answer: Spatial locality.

ii. (3 points) Pipelining increases the performance of a processor if the pipeline can be kept full with useful instructions. Two reasons that often prevent the pipeline from staying full with useful instructions are (in two words each):
Answer: Data dependencies; control dependencies.

iii. (3 points) The reference bit (sometimes called the "access" bit) in a PTE (Page Table Entry) is used for what purpose?
Answer: Page replacement.
A similar function is performed by what bit or bits in a cache's tag store entry?
Answer: Replacement policy bits (e.g.
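Question i can be checked in a few lines of code. The sketch below (my illustration, not part of the exam) counts misses in a direct-mapped cache during a sequential scan: with a block size of one word, every access misses, so spatial locality buys nothing; with four words per block, three of every four accesses hit.

    def misses(addresses, block_words, num_sets=64):
        """Count misses in a direct-mapped cache; sizes are in words.
        Tracking tags alone is enough to count hits and misses."""
        tags = [None] * num_sets
        miss_count = 0
        for addr in addresses:
            block = addr // block_words
            index = block % num_sets
            tag = block // num_sets
            if tags[index] != tag:   # miss: fill the line
                tags[index] = tag
                miss_count += 1
        return miss_count

    sequential = list(range(4096))            # sequential word addresses
    print(misses(sequential, block_words=1))  # 4096 misses: no benefit
    print(misses(sequential, block_words=4))  # 1024 misses: locality helps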
Trends in Processor Architecture
Trends in Processor Architecture
Antonio González, Universitat Politècnica de Catalunya, Barcelona, Spain

1. Past Trends

Processors have undergone a tremendous evolution throughout their history. A key milestone in this evolution was the introduction of the microprocessor, a term that refers to a processor implemented in a single chip. The first microprocessor was introduced by Intel under the name Intel 4004 in 1971. It contained about 2,300 transistors, was clocked at 740 kHz, and delivered 92,000 instructions per second while dissipating around 0.5 watts.

Since then, practically every year we have witnessed the launch of a new microprocessor delivering significant performance improvements over previous ones. Some studies have estimated this growth to be exponential, on the order of about 50% per year, which results in a cumulative growth of over three orders of magnitude in a time span of two decades [12]. These improvements have been fueled by advances in the manufacturing process and innovations in processor architecture. According to several studies [4][6], both aspects contributed in a similar amount to the global gains.

The manufacturing process technology has tried to follow the scaling recipe laid down by Robert N. Dennard in the early 1970s [7]. The basics of this technology scaling consist of reducing transistor dimensions by a factor of 30% every generation (typically two years) while keeping electric fields constant. The 30% scaling in the dimensions results in doubling the transistor density (doubling transistor density every two years was predicted in 1975 by Gordon Moore and is normally referred to as Moore's Law [21][22]).
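A one-line check of that scaling arithmetic (my worked example, not from the chapter): shrinking both transistor dimensions to 70% of their previous values halves the area per transistor and therefore doubles the density.

    A' = (0.7\,W) \times (0.7\,L) = 0.49\,WL \approx \tfrac{1}{2}\,A
    \quad\Longrightarrow\quad
    \text{density} \;=\; \frac{1}{A'} \;\approx\; \frac{2}{A}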
Theoretical Peak FLOPS per Instruction Set on Modern Intel CPUs
Theoretical Peak FLOPS per instruction set on modern Intel CPUs
Romain Dolbeau, Bull – Center for Excellence in Parallel Programming. Email: [email protected]
Revision 1.41, 2016/10/04 08:49:16. Index Terms—FLOPS

Abstract—It used to be that evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyper-threading, and "turbo" mode. In this paper, we look into this theoretical peak for recent full-featured Intel CPUs, taking into account not only the simple absolute peak, but also the relevant instruction sets and encodings and the frequency scaling behavior of current Intel CPUs.

I. INTRODUCTION

High performance computing thrives on fast computations and high memory bandwidth. But before any code or even benchmark is run, the very first number used to evaluate a system is the theoretical peak: how many floating-point operations the system can theoretically execute in a given time. ...and potentially multiple threads per core. Vectors of varying sizes. And more sophisticated instructions. Equation 2 describes a more realistic view, which we will explain in detail in the rest of the paper, first in general in Section II and then for the specific cases of Intel CPUs: first a simple one from the Nehalem/Westmere era in Section III and then the full complexity of the Haswell family in Section IV. A complement to this paper, titled "Theoretical Peak FLOPS per instruction set on less conventional hardware" [1], covers other computing devices. The micro-architecture part of the peak-FLOPS expression is the product

\[ \frac{\text{flop}}{\text{operation}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{instructions}}{\text{cycle}} \quad (\text{micro-architecture}). \]
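The equation's structure translates directly into code. The sketch below is mine, not the paper's: it multiplies the micro-architecture terms by the machine terms. The Haswell-style figures in the example (2 flops per FMA operation, 4 double-precision lanes in a 256-bit AVX2 register, 2 FMA pipes per core) are widely published values, but they are assumptions here rather than numbers quoted from this paper.

    def peak_gflops(flops_per_op, ops_per_instr, instrs_per_cycle,
                    ghz, cores, sockets=1):
        """Peak GFLOPS = (micro-architecture terms) x (machine terms)."""
        per_core_per_cycle = flops_per_op * ops_per_instr * instrs_per_cycle
        return per_core_per_cycle * ghz * cores * sockets

    # Haswell-style double-precision example (assumed figures):
    # FMA = 2 flops/operation, AVX2 = 4 doubles/instruction, 2 FMA pipes.
    print(peak_gflops(2, 4, 2, ghz=2.5, cores=12))  # 480.0 GFLOPS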
Theoretical Peak FLOPS Per Instruction Set on Less Conventional Hardware
Theoretical Peak FLOPS per instruction set on less conventional hardware
Romain Dolbeau, Bull – Center for Excellence in Parallel Programming. Email: [email protected]
Revision 1.16, 2016/10/04 08:40:17. Index Terms—FLOPS

Abstract—This is a companion paper to "Theoretical Peak FLOPS per instruction set on modern Intel CPUs" [1]. In it, we survey some alternative hardware for which the peak FLOPS can be of interest. As in the main paper, we take into account and explain the peculiarities of the surveyed hardware.

I. INTRODUCTION

Many different kinds of hardware are in use to perform computations: not only conventional Central Processing Units (CPUs), but also Graphics Processing Units (GPUs) and other accelerators. In the main paper [1], we described how to compute the peak FLOPS for conventional Intel CPUs. In this extension, we take a look at the peculiarities of those alternative computation devices.

II. OTHER CPUS

A. AMD Family 15h

The AMD Family 15h (the name "15h" comes from the hexadecimal value returned by the CPUID instruction) was introduced in 2011 and is composed of the so-called "construction cores", code-named Bulldozer, Piledriver, Steamroller and Excavator. ...popular at the time [4][5][6]. Only the FPUs are shared in Bulldozer. We can take a look back at Equation 1, replicated from the main paper, and see how this affects the peak FLOPS:

\[ \frac{\text{flops}}{\text{node}} = \underbrace{\frac{\text{flop}}{\text{operation}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{instructions}}{\text{cycle}}}_{\text{micro-architecture}} \times \underbrace{\frac{\text{cycles}}{\text{second}} \times \frac{\text{cores}}{\text{socket}} \times \frac{\text{sockets}}{\text{node}}}_{\text{machine architecture}} \quad (1) \]

For the micro-architecture parts (flop/operation, operations/instruction, instructions/cycle)...
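As a worked instance of Equation 1's micro-architecture terms (my example, under the assumption that each Bulldozer module has two 128-bit FMA-capable FPU pipes shared by its two cores), the double-precision peak per module would be

    \frac{\text{flop}}{\text{operation}} \times \frac{\text{operations}}{\text{instruction}} \times \frac{\text{instructions}}{\text{cycle}}
    = 2 \times 2 \times 2 = 8 \;\frac{\text{flops}}{\text{cycle}\cdot\text{module}},

i.e., 4 flops per cycle per core when both cores of a module are active, which is how the shared FPU shows up in the peak-FLOPS accounting.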
EPYC: Designed for Effective Performance
EPYC: Designed for Effective Performance
By Linley Gwennap, Principal Analyst, The Linley Group. June 2017. www.linleygroup.com

Measuring server-processor performance using clock speed (GHz) or even the traditional SPEC_int test can be misleading. AMD's new EPYC processor is designed to deliver strong performance across a wide range of server applications, meeting the needs of modern data centers and enterprises. These design capabilities include advanced branch prediction, data prefetching, a coherent interconnect, and integrated high-bandwidth DRAM and I/O interfaces. AMD sponsored the creation of this white paper, but the opinions and analysis are those of the author. Trademark names are used in an editorial fashion and are the property of their respective owners.

Although many PC users can settle for "good enough" performance, data-center operators are always seeking more. Web searches demand more performance as the Internet continues to expand. Newer applications such as voice recognition (for services such as Alexa and Siri) and analyzing big data also require tremendous performance. Neural networks are gaining in popularity for everything from image recognition to self-driving cars, but training these networks can tie up hundreds of servers for days at a time. Processor designers must meet these greater performance demands while staying within acceptable electrical-power ratings.

Server processors are often characterized by core count and clock speed (GHz), but these characteristics provide only a rough approximation of application performance. As important as speed is, the amount of work that a processor can accomplish with each tick of the clock, a parameter known as instructions per cycle (IPC), is equally important.
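The identity behind that last sentence (textbook form, not from the white paper) is that per-core instruction throughput is the product of IPC and clock frequency, so a lower-clocked core with higher IPC can match a faster-clocked one:

    \frac{\text{instructions}}{\text{second}} = \text{IPC} \times f,
    \qquad\text{e.g.}\quad 1.25 \times 2.4\,\text{GHz} = 1.0 \times 3.0\,\text{GHz}.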
SST + gem5 = A Scalable Simulation Infrastructure for High Performance Computing
SST + gem5 = A Scalable Simulation Infrastructure for High Performance Computing Mingyu Hsieh Jie Meng Michael Levenhagen Sandia National Labs P.O.Box Boston University Sandia National Labs 5800 ECE Department P.O.Box 5800 Albuquerque, NM Boston, MA Albuquerque, NM [email protected] [email protected] [email protected] Kevin Pedretti Ayse Coskun Arun Rodrigues Sandia National Labs Boston University Sandia National Labs P.O.Box 5800 ECE Department P.O.Box 5800 Albuquerque, NM Boston, MA Albuquerque, NM [email protected] [email protected] [email protected] ABSTRACT 1. INTRODUCTION High Performance Computing (HPC) faces new challenges in As HPC continues to push the bounds of computation, it scalability, performance, reliability, and power consumption. faces new challenges in scalability, performance, reliability, Solving these challenges will require radically new hardware and power consumption. To address these challenges, we and software approaches. It is impractical to explore this need to design new architectures, operating systems, pro- vast design space without detailed system-level simulations. gramming models, and software. It is impractical to ex- However, most of the existing simulators are either not suffi- plore this vast design space without detailed system-level ciently detailed, not scalable, or cannot evaluate key system simulations. However, most simulators are either not suffi- characteristics such as energy consumption or reliability. ciently detailed (i.e., glossing over important details in pro- To address this problem, we integrate the highly detailed cessor, memory, or network implementations), not scalable gem5 performance simulator into the parallel Structural Sim- (i.e., simulating highly parallel architectures with serial sim- ulation Toolkit (SST). -
THE AMD gem5 APU SIMULATOR: MODELING GPUS USING THE MACHINE ISA
Tony Gutierrez, Sooraj Puthoor, Tuan Ta*, Matt Sinclair, and Brad Beckmann. AMD Research, *Cornell. June 2, 2018.

OBJECTIVES AND SCOPE
Objectives: introduce the Radeon Open Compute Platform (ROCm) and AMD's Graphics Core Next (GCN) architecture and GCN3 ISA; describe the gem5-based APU simulator.
Modeling scope: emphasis on the GPU side of the simulator; an APU (CPU+GPU) model of APU systems, not a discrete GPU; covers the GPU architecture, the GCN3 ISA, and HW-SW interfaces.
Why are we releasing our code? To encourage AMD-relevant research; modeling the machine ISA and the real system stack is important [1]; to enhance academic collaborations; to enable intern candidates to get experience before arriving; and to enable interns to take their experience back to school.
Acknowledgement: AMD Research's gem5 team.
[1] Gutierrez et al. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. HPCA, 2018.

QUICK SURVEY
Who is in our audience? Graduate students; faculty members; people working for government research labs; people working for industry. Have you written a GPU program (CUDA, OpenCL™, HIP, HC, C++ AMP, other languages)? Have you used these simulators: GPGPU-Sim, Multi2Sim, gem5, our HSAIL-based APU model? Are you familiar with our HPCA 2018 paper, "Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level"?

OUTLINE (topic, presenter, time)
Background – Tony – 8:00-8:15
ROCm, GCN3 ISA, and GPU Arch – Tony – 8:15-9:15
HSA Implementation in gem5 – Sooraj – 9:15-10:00
Break – 10:00-10:30
Ruby and GPU Protocol Tester – Tuan – 10:30-11:15
Demo and Workloads – Matt – 11:15-11:50
Summary and Questions – All – 11:50-12:00

BACKGROUND
Overview of gem5 (source tree); GPU terminology and system overview; the HSA standard and its building blocks: coherent shared virtual memory, user-level queues, signals, etc.
Design and Implementation of a Framework for Predicting Instruction Throughput
Professur für Höchstleistungsrechnen, Friedrich-Alexander-Universität Erlangen-Nürnberg
BACHELOR THESIS
Design and Implementation of a Framework for Predicting Instruction Throughput
Jan Laukemann. Erlangen, January 11, 2018. Examiner: Prof. Dr. Gerhard Wellein. Advisor: Julian Hammer.

Eidesstattliche Erklärung / Statutory Declaration (given in German and English in the original): I hereby declare formally that I have developed and written the enclosed thesis entirely by myself and have not used sources or means without declaration in the text. Any thoughts or quotations which were inferred from the sources are marked as such. This thesis was not submitted in the same or a substantially similar version to any other authority to achieve an academic grading. The Friedrich-Alexander-Universität, represented by the Professur für Höchstleistungsrechnen, is granted a simple, free right of use, unrestricted in time and place, to the results of this work, including any associated intellectual-property rights and copyrights, for purposes of research and teaching. [Translated from the German.]

Erlangen, January 11, 2018. Jan Laukemann

Abstract (translated from the German Zusammenfassung): For setting up a performance model, or for other kinds of optimization of (especially scientific) software, an analysis of the in-core execution time is usually indispensable. This time can differ depending on the processor microarchitecture in use. This thesis therefore implements an architecture-dependent open-source analysis tool, OSACA (Open-Source Architecture Analyzer), which supports the throughput analysis offered by Intel's own tool IACA (Intel Architecture Code Analyzer).
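IACA-style throughput analysis boils down to port-pressure accounting: each instruction occupies the CPU's execution ports for some number of cycles, and the throughput bound for a loop body is the pressure on the busiest port. The sketch below illustrates the technique in a simplified form; the instruction-to-port assignments are invented for the example and are not OSACA's actual data files or interface.

    from collections import defaultdict

    # Hypothetical port usage per instruction ({port: cycles}), with
    # uops split evenly across eligible ports (a common simplification).
    KERNEL = [
        ("vmulpd",  {"p0": 0.5, "p1": 0.5}),   # FP multiply, port 0 or 1
        ("vaddpd",  {"p0": 0.5, "p1": 0.5}),   # FP add, port 0 or 1
        ("vmovapd", {"p2": 0.5, "p3": 0.5}),   # load, port 2 or 3
    ]

    def block_throughput(kernel):
        """Cycles per iteration = pressure on the most heavily used port."""
        pressure = defaultdict(float)
        for _, ports in kernel:
            for port, cycles in ports.items():
                pressure[port] += cycles
        bottleneck = max(pressure, key=pressure.get)
        return pressure[bottleneck], bottleneck, dict(pressure)

    cycles, port, detail = block_throughput(KERNEL)
    print(f"~{cycles} cycles/iteration, bound by {port}: {detail}")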