Performance Analysis for Machine Learning Applications

Total Pages: 16

File Type: PDF, Size: 1020 KB

The Harvard community has made this article openly available. Please share how this access benefits you.

Citation: Wang, Yu. 2020. Performance Analysis for Machine Learning Applications. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365132

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

A dissertation presented by Yu Wang to the School of Engineering and Applied Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science. Harvard University, Cambridge, Massachusetts, September 2019. © 2019 Yu Wang, all rights reserved. Thesis advisors: Professors David Brooks and Gu-Yeon Wei.

Abstract

Performance analysis has been driving the advancements of software and hardware systems for decades. Proper analysis can reveal system and architectural bottlenecks, provide essential information for choosing frameworks and platforms, and further lead to performance optimizations. Systematic performance analysis is difficult because it involves various problem domains, algorithms, software stacks, systems, and hardware. Recently, the surge of machine learning applications and domain-specific architectures has stimulated rapid evolution along all of those dimensions, making performance analysis increasingly challenging.

To tackle this problem, this dissertation conducts deep and systematic performance analysis for a variety of workloads, software frameworks, and hardware systems, and demonstrates proper ways to apply several performance analysis methodologies. First, we study the performance analysis methodology for general-purpose processors, a traditional and mature industry, based on CPU benchmarks, and demonstrate the importance of using representative benchmarks when drawing insights about hardware systems. Next, with the lessons learned from traditional methods, we rigorously study deep learning frameworks by applying the appropriate analysis methods in the corresponding scenarios. We extract the performance implications of key design features of those frameworks, and the insights are distilled into a set of simple guidelines for tuning framework features. The proposed guidelines nearly close the performance gap between the state of the art and the global optimum. Further, we propose a systematic methodology to facilitate performance analysis for rapidly evolving deep learning models and platforms. The proposed methodology can reveal deeper insights that are difficult to discover with traditional approaches. We demonstrate its utility with deep learning by comparing two generations of specialized hardware (Google's Tensor Processing Unit v2 and v3), three heterogeneous platforms (TPU, GPU, and CPU), and different versions of specialized software (TensorFlow and CUDA). Finally, since machine learning techniques advance rapidly and architects need to be aware of emerging applications, we take the first step towards analyzing Bayesian inference, an important branch of machine learning.
Optimization mechanisms are proposed based on the analysis. With the methodologies and analysis presented in this dissertation, we hope to encourage researchers and engineers to apply our methodologies to new platforms, software, and applications for systematic performance analysis. We envision this to help resolve existing performance bottlenecks and design better software and hardware for current and future applications.

Contents

1 Introduction and Background
  1.1 The Role of Performance Analysis
  1.2 Challenges of Analysis for Deep Learning
  1.3 Bayesian Inference
  1.4 Thesis Overview
  1.5 Thesis Contributions
2 General-Purpose Platforms and Workloads
  2.1 Introduction
  2.2 Intel Processor Statistics
  2.3 Methodology
  2.4 Case Studies Overview
  2.5 Prediction Accuracy
  2.6 SKU Ranking Comparison
  2.7 Related Work
  2.8 Summary
3 Parallelism in Deep Learning Frameworks
  3.1 Introduction
  3.2 Framework Design Overview
  3.3 Experimental Setup
  3.4 Scheduling Mechanism
  3.5 Operator Design
  3.6 Library Choice
  3.7 Beyond One Socket
  3.8 Framework Design Tuning
  3.9 Summary
4 A Systematic Methodology for Performance Analysis
  4.1 Introduction
  4.2 Methodology
  4.3 Hardware Platforms
  4.4 TPU Performance Implications
  4.5 Cross-Platform Comparison
  4.6 Software Stack Advances
  4.7 Limitations
  4.8 Related Work
  4.9 Summary
5 Demystifying Bayesian Inference Workloads
  5.1 Introduction
  5.2 Bayesian Inference
  5.3 BayesSuite: Bayesian Inference Workloads
  5.4 Performance Analysis
  5.5 Bottleneck Resolution
  5.6 Algorithm Convergence
  5.7 Implications for Future Acceleration
  5.8 Related Work
  5.9 Summary
6 Future Directions
  6.1 Deep Learning
  6.2 Bayesian Inference

Listing of figures

1.1 Increasing Bayesian inference publications in top machine learning conferences from 2007 to 2016.
2.1 The number of SKUs released by Intel for 17 microarchitectures between 1993 and 2016. (Data from http://ark.intel.com.)
2.2 Means and standard deviations of SPEC relative run times.
2.3 SKU breakdown in the SPEC (top) and Geekbench (bottom) datasets.
2.4 Means and standard deviations of SPEC workloads' relative run times on the SKUs in our dataset.
2.5 Principal component analysis of SPEC workloads' performance scaling on 639 configurations of 352 SKUs. The green stars are centroids of two clusters identified by k-means. The outlier near the top is 481.wrf.
2.6 Results of predicting the performance of SPEC (top) and Geekbench (bottom) on new SKUs.
2.7 The results of case 2. The workloads are sorted by the MAE after adding 50 SKUs.
2.8 Correlation of SPEC workloads' outlierness (distance) with MAE after adding 50 SKUs. The distance of a workload is its distance from the centroid of its cluster. Centroids are in Figure 2.5.
2.9 Case 3 is about cross-prediction. SPEC workloads are predicted with the Geekbench dataset (top). Geekbench workloads are predicted with the SPEC dataset (bottom). Cross-prediction has higher errors than self-prediction (case 2).
2.10 SKU ranking results of case 1. Our DNN model has the highest top-3 accuracy.
2.11 Based on SPEC (top) and Geekbench (bottom), comparison of the SKU rankings by self-prediction (case 2), cross-prediction (case 3), and the best baseline (Frequency).
2.12 Comparison of SKU rankings based on SPEC and Geekbench averages.
3.1 Time breakdown for Inception v3 training.
3.2 An overview of the framework design features studied in this chapter.
3.3 Examples of (a) synchronous scheduling, (b) asynchronous scheduling, and (c) using one and four thread pools, with the same total hardware resources.
3.4 (Bar chart) The speedup of using asynchronous scheduling over synchronous. (Table) The maximum computational graph width and best numbers of thread pools. Workloads with more branches benefit from more pools.
3.5 Asynchronous scheduling benefits large-batch inference and small-batch training.
3.6 (a) Inception v2 architecture contains modules with (b) four and (c) three independent branches. Area 1 exhibits inter- and intra-op parallelism and area 2 only intra-op.
3.7 Performance of Inception v2 with different numbers of inter-op pools and MKL threads per pool. The best configuration balances intra- and inter-op parallelism.
3.8 Execution time breakdown of four cases.
3.9 Execution traces of three cases in Figure 3.8. Color-coded areas 1 and 2 correspond to
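The figure list above (Figures 3.3-3.7) refers to scheduling modes, inter-op thread pools, and per-pool MKL threads, the framework knobs studied in Chapter 3. As a purely illustrative sketch of the kind of setting being tuned, the snippet below sets the two threading parameters exposed by the TensorFlow 1.x API; the thread counts are placeholders, not the tuned values reported in the dissertation.

```python
# Minimal sketch of the threading knobs discussed in Chapter 3, using the
# TensorFlow 1.x API (tf.compat.v1). The thread counts below are illustrative
# placeholders, not the dissertation's tuned values.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # independent operators scheduled concurrently
    intra_op_parallelism_threads=16,  # threads available inside one operator (e.g., an MKL GEMM)
)

a = tf.random.normal([1024, 1024])
b = tf.random.normal([1024, 1024])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(c)
```

Intra-op threads feed a single operator's math library, while inter-op threads let independent branches of the graph (as in Figure 3.6) run concurrently; balancing the two is the point of Figure 3.7.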
Recommended publications
  • Security Analysis of Docker Containers in a Production Environment
Security analysis of Docker containers in a production environment. Jon-Anders Kabbe. Master of Science in Communication Technology, submission date: June 2017. Supervisor: Colin Alexander Boyd, IIK; co-supervisor: Audun Bjørkøy, TIND Technologies. Norwegian University of Science and Technology, Department of Information Security and Communication Technology.

Problem description: Deployment of Docker containers has achieved its popularity by providing an automatable and stable environment serving a particular application. Docker creates a separate image of a file system with everything the application requires during runtime. Containers run atop the regular file system as separate units containing only the libraries and tools that the particular application requires. Docker containers reduce the attack surface, improve process interaction, and simplify sandboxing. But containers also raise security concerns: reduced segregation between the operating system and the application, out-of-date libraries and tools or variations in them, and the fact that the technology is still unsettled. Docker containers provide a stable and static environment for applications; this is achieved by creating a static image of only the libraries and tools the specific application needs during runtime. Out-of-date tools and libraries are a major security risk when exposing applications over the Internet, but availability is essential in a competitive market. Does Docker raise security concerns compared to standard application deployment such as hypervisor-based virtual machines? Are Docker "best practices" sufficient to secure the container, and how does this compare to traditional virtual machine application deployment? Responsible professor: Colin Alexander Boyd, ITEM. Supervisor: Audun Bjørkøy, TIND Technologies.

Abstract: Container technology for hosting applications on the web is gaining traction as the preferred mode of deployment.
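As a side note on the hardening theme in this excerpt: a minimal sketch of launching a locked-down container through the Docker SDK for Python (docker-py), assuming a local Docker daemon and the `docker` package are available. The options shown are common hardening flags, not the specific configuration the thesis evaluates.

```python
# Sketch: launching a deliberately constrained container with the Docker SDK
# for Python (docker-py). Assumes a local Docker daemon and the `docker`
# package; the hardening options are common practice, not the thesis's setup.
import docker

client = docker.from_env()
output = client.containers.run(
    "alpine:3.19",
    "echo hello from a locked-down container",
    read_only=True,                           # immutable root filesystem
    cap_drop=["ALL"],                         # drop all Linux capabilities
    security_opt=["no-new-privileges:true"],  # block privilege escalation
    remove=True,                              # clean up after exit
)
print(output.decode())
```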
  • Memory Centric Characterization and Analysis of SPEC CPU2017 Suite
Memory Centric Characterization and Analysis of SPEC CPU2017 Suite. Sarabjeet Singh and Manu Awasthi, Ashoka University. ICPE '19, April 7-11, 2019, Mumbai, India.

Abstract: In this paper, we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters, and operating system based tools. We present a number of results including working set sizes, memory capacity consumption, and memory bandwidth utilization of various workloads. Our experiments reveal that, on the x86_64 ISA, SPEC CPU2017 workloads execute a significant number of memory-related instructions, with approximately 50% of all dynamic instructions requiring memory accesses. We also show that there is a large variation in the memory footprint and bandwidth utilization.

These benchmarks have become the standard for any researcher or commercial entity wishing to benchmark their architecture or for exploring new designs. The latest offering of the SPEC CPU suite, SPEC CPU2017, was released in June 2017 [8]. SPEC CPU2017 retains a number of benchmarks from previous iterations but has also added many new ones to reflect the changing nature of applications. Some recent studies [21, 24] have already started characterizing the behavior of SPEC CPU2017 applications, looking for potential optimizations to system architectures. In recent years the memory hierarchy, from the caches all the way to main memory, has become a first-class citizen of computer system design.
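A back-of-envelope sketch of the bandwidth-utilization arithmetic behind measurements like those in the excerpt above: converting last-level-cache miss counts (for example, from hardware counters such as perf's LLC-load-misses and LLC-store-misses events) into implied DRAM traffic, assuming 64-byte cache lines. The counts and runtime below are placeholders, not numbers from the paper.

```python
# Back-of-envelope sketch: turning last-level-cache miss counts (e.g., from
# perf's LLC-load-misses / LLC-store-misses events) into implied DRAM traffic.
# The counts and runtime are placeholders, not measurements from the paper.
CACHE_LINE_BYTES = 64

def dram_bandwidth_gbps(llc_misses: int, elapsed_seconds: float) -> float:
    """Approximate DRAM traffic implied by LLC misses, in GB/s."""
    bytes_moved = llc_misses * CACHE_LINE_BYTES
    return bytes_moved / elapsed_seconds / 1e9

# Example: 2.5e9 misses over a 10-second run imply ~16 GB/s of traffic.
print(f"{dram_bandwidth_gbps(2_500_000_000, 10.0):.1f} GB/s")
```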
  • A Practical Guide for Improving Transparency and Reproducibility in Neuroimaging Research
bioRxiv preprint, first posted online Feb. 12, 2016; doi: http://dx.doi.org/10.1101/039354. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

A practical guide for improving transparency and reproducibility in neuroimaging research. Krzysztof J. Gorgolewski and Russell A. Poldrack, Department of Psychology, Stanford University.

Abstract: Recent years have seen an increase in alarming signals regarding the lack of replicability in neuroscience, psychology, and other related fields. To avoid a widespread crisis in neuroimaging research and a consequent loss of credibility in the public eye, we need to improve how we do science. This article aims to be a practical guide for researchers at any stage of their careers that will help them make their research more reproducible and transparent while minimizing the additional effort that this might require. The guide covers three major topics in open science (data, code, and publications) and offers practical advice as well as highlighting the advantages of adopting more open research practices that go beyond improved transparency and reproducibility.

Introduction: The question of how the brain creates the mind has captivated humankind for thousands of years. With recent advances in human in vivo brain imaging, we now have effective tools to peek into the biological underpinnings of mind and behavior. Even though we are no longer constrained just to philosophical thought experiments and behavioral observations (which undoubtedly are extremely useful), the question at hand has not gotten any easier. These powerful new tools have largely demonstrated just how complex the biological bases of behavior actually are.
  • Overview of the SPEC Benchmarks
9 Overview of the SPEC Benchmarks. Kaivalya M. Dixit, IBM Corporation. "The reputation of current benchmarketing claims regarding system performance is on par with the promises made by politicians during elections."

The Standard Performance Evaluation Corporation (SPEC) was founded in October 1988 by Apollo, Hewlett-Packard, MIPS Computer Systems, and Sun Microsystems in cooperation with E.E. Times. SPEC is a nonprofit consortium of 22 major computer vendors whose common goals are "to provide the industry with a realistic yardstick to measure the performance of advanced computer systems" and to educate consumers about the performance of vendors' products. SPEC creates, maintains, distributes, and endorses a standardized set of application-oriented programs to be used as benchmarks.

9.1 Historical Perspective. Traditional benchmarks have failed to characterize the system performance of modern computer systems. Some of those benchmarks measure component-level performance, and some of the measurements are routinely published as system performance. Historically, vendors have characterized the performance of their systems in a variety of confusing metrics. In part, the confusion is due to a lack of credible performance information, agreement, and leadership among competing vendors. Many vendors characterize system performance in millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS). All instructions, however, are not equal. Since CISC machine instructions usually accomplish a lot more than those of RISC machines, comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek.

9.1.1 Simple CPU Benchmarks. Truth in benchmarking is an oxymoron because vendors use benchmarks for marketing purposes.
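The excerpt's point that "all instructions are not equal" is easy to see with a small worked example. The numbers below are hypothetical: the machine with the higher MIPS rating still finishes the same program later because its ISA needs more instructions.

```python
# Hypothetical numbers only: the machine with the higher MIPS rating can still
# be slower if its ISA needs more instructions for the same program.
def runtime_seconds(instruction_count: float, mips: float) -> float:
    return instruction_count / (mips * 1e6)

cisc_instructions = 1.0e9    # denser instructions, fewer of them
risc_instructions = 2.5e9    # simpler instructions, more of them

cisc_mips = 200.0
risc_mips = 400.0            # twice the MIPS rating...

print(runtime_seconds(cisc_instructions, cisc_mips))  # 5.0 seconds
print(runtime_seconds(risc_instructions, risc_mips))  # 6.25 seconds: slower despite higher MIPS
```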
  • Reproducible Reports with Knitr and R Markdown
Reproducible Reports with knitr and R Markdown. Yihui Xie, RStudio. 11/22/2014 @ UPenn, The Warren Center.

An appetizer: run the app below (your web browser may request access to your microphone): http://bit.ly/upenn-r-voice after install.packages("shiny"), or just use this: https://yihui.shinyapps.io/voice/

Overview and Introduction. I know you click, click, Ctrl+C and Ctrl+V. But imagine you hear these words after you finished a project: "Please do that again!" (sorry, we made a mistake in the data, want to change a parameter, and yada yada) http://nooooooooooooooo.com

Basic ideas of dynamic documents: code + narratives = report, i.e. computing languages + authoring languages. We built a linear regression model.
  • Fault Tolerance and Re-Training Analysis on Neural Networks
Fault Tolerance and Re-Training Analysis on Neural Networks, by Abhinav Kurian George, B.Tech Electronics and Communication Engineering, Amrita Vishwa Vidhyapeetham, Kerala, 2012. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Computer Engineering, College of Engineering and Applied Science, University of Cincinnati, Ohio, 2019. Thesis committee: chair: Wen-Ben Jone, Ph.D.; member: Carla Purdy, Ph.D.; member: Ranganadha Vemuri, Ph.D.

Abstract: In the current age of big data, artificial intelligence and machine learning technologies have gained much popularity. Due to the increasing demand for such applications, neural networks are being targeted toward hardware solutions. Owing to shrinking feature sizes, the number of physical defects is on the rise. This growing number of defects is preventing designers from realizing the full potential of on-chip designs. The challenge now is not only to find solutions that balance high performance and energy efficiency but also to achieve fault tolerance in a computational model. Neural computing, due to its inherent fault-tolerant capabilities, can provide promising solutions to this issue. The primary focus of this thesis is to gain a deeper understanding of fault tolerance in neural network hardware. As a part of this work, we present a comprehensive analysis of fault tolerance by exploring the effects of faults on popular neural models: the multi-layer perceptron and the convolutional neural network. We built the models based on the conventional 64-bit floating-point representation. In addition to this, we also explore the recent 8-bit integer quantized representation. A fault injector model is designed to inject stuck-at faults at random locations in the network.
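A minimal numpy sketch of the kind of stuck-at fault injection the excerpt describes, forcing one bit of one randomly chosen float32 weight to a fixed value. This is an illustrative model under simple assumptions, not the thesis's actual injector.

```python
# Illustrative numpy model of a stuck-at-bit fault on one randomly chosen
# float32 weight (not the thesis's actual fault injector).
import numpy as np

def inject_stuck_at(weights: np.ndarray, bit: int, stuck_value: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `weights` with one weight's `bit` forced to `stuck_value`."""
    flat = weights.astype(np.float32).copy().ravel()
    idx = rng.integers(flat.size)
    word = flat[idx:idx + 1].view(np.uint32)         # reinterpret the float's 32 bits
    if stuck_value:
        word |= np.uint32(1 << bit)                  # stuck-at-1
    else:
        word &= np.uint32(~(1 << bit) & 0xFFFFFFFF)  # stuck-at-0
    return flat.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_faulty = inject_stuck_at(w, bit=22, stuck_value=1, rng=rng)
print(np.count_nonzero(w_faulty != w))  # at most one element differs
```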
  • Linux Kernel II 9.1 Architecture
Linux Kernel II. 9.1 Architecture: The Linux kernel is a Unix-like operating system kernel used by a variety of operating systems based on it, which are usually in the form of Linux distributions. The Linux kernel is a prominent example of free and open source software.

Programming language: The Linux kernel is written in the version of the C programming language supported by GCC (which has introduced a number of extensions and changes to standard C), together with a number of short sections of code written in the assembly language (in GCC's "AT&T-style" syntax) of the target architecture. Because of the extensions to C it supports, GCC was for a long time the only compiler capable of correctly building the Linux kernel.

Compiler compatibility: GCC is the default compiler for the Linux kernel source. In 2004, Intel claimed to have modified the kernel so that its C compiler was also capable of compiling it. There was another such reported success in 2009 with a modified 2.6.22 version of the kernel. Since 2010, effort has been underway to build the Linux kernel with Clang, an alternative compiler for the C language; as of 12 April 2014, the official kernel could almost be compiled by Clang. The project dedicated to this effort is named LLVMLinux, after the LLVM compiler infrastructure upon which Clang is built. LLVMLinux does not aim to fork either the Linux kernel or LLVM; therefore it is a meta-project composed of patches that are eventually submitted to the upstream projects.
  • Firecracker: Lightweight Virtualization for Serverless Applications
Firecracker: Lightweight Virtualization for Serverless Applications. Alexandru Agache, Marc Brooker, Andreea Florescu, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, and Diana-Maria Popa, Amazon Web Services. https://www.usenix.org/conference/nsdi20/presentation/agache. This paper is included in the Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI '20), February 25-27, 2020, Santa Clara, CA, USA. ISBN 978-1-939133-13-7.

Abstract: Serverless containers and functions are widely used for deploying and managing software in the cloud. Their popularity is due to reduced cost of operations, improved utilization of hardware, and faster scaling than traditional deployment methods. The economics and scale of serverless applications demand that workloads from multiple customers run on the same hardware with minimal overhead, while preserving strong security.

Serverless offers an advantage over traditional server provisioning processes: multitenancy allows servers to be shared across a large number of workloads, and the ability to provision new functions and containers in milliseconds allows capacity to be switched between workloads quickly as demand changes. Serverless is also attracting the attention of the research community [21, 26, 27, 44, 47], including work on scaling out video encoding [13], linear algebra [20, 53] and parallel compilation [12].
  • In-Datacenter Performance Analysis of a Tensor Processing Unit
In-Datacenter Performance Analysis of a Tensor Processing Unit. Presented by Josh Fried.

Background: Machine Learning. Neural networks: multi-layer perceptrons, recurrent neural networks (mostly LSTMs), and convolutional neural networks. A synapse is an edge and has a weight; a neuron is a node that sums its weighted inputs and applies a non-linear activation function to the sum. Propagating inputs through a layer of the NN is a matrix multiplication followed by an activation.

Two phases: Training (offline): relaxed deadlines; large batches to amortize the cost of loading weights from DRAM; well suited to GPUs; usually uses floating point. Inference (online): strict deadlines (7-10 ms at Google for some workloads), which limits the possibility of batching; Facebook uses CPUs for inference (last class); can use lower-precision integers (faster, smaller, more efficient).

ML workloads at Google: 90% of ML workload time at Google is spent on MLPs and LSTMs, despite the broader focus on CNNs. Examples: RankBrain (search), Inception (image classification), Google Translate, AlphaGo (and others).

Background: Hardware Trends. End of Moore's Law and Dennard scaling: Moore: transistor density doubles every two years; Dennard: power stays proportional to chip area as transistors shrink. Machine learning is causing huge growth in demand for compute: in 2006, excess CPU capacity in datacenters was enough; by 2013, a projected 3 minutes per day per user of speech recognition would require doubling datacenter compute capacity.

Google's answer: a custom ASIC. Goal: build a chip that improves cost-performance for NN inference. What are the main costs? Capital costs and operational costs (the power bill).

TPU (v1) design goals: short design-deployment cycle (~15 months); plugs into a PCIe slot on existing servers; accelerates matrix multiplication operations; uses 8-bit integer operations instead of floating point. How does the TPU work? CISC instructions, issued by the host.
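The notes above reduce a layer to "matrix multiply plus activation" and mention 8-bit integer inference. Below is a small numpy sketch of both ideas, using a naive symmetric quantization scheme for illustration; the TPU's actual quantization pipeline is more involved.

```python
# Sketch: a fully connected layer as "matrix multiply + activation", in float32
# and with a naive symmetric int8 quantization (illustrative only; the TPU's
# real quantization pipeline is more involved).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256)).astype(np.float32)    # activations
w = rng.standard_normal((256, 128)).astype(np.float32)  # weights

# Float32 reference: y = relu(x @ w)
y_fp32 = np.maximum(x @ w, 0.0)

def quantize(t: np.ndarray):
    """Symmetric per-tensor quantization to int8, returning (values, scale)."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

xq, sx = quantize(x)
wq, sw = quantize(w)

# Integer matmul accumulates in int32, then rescales back to real values.
acc = xq.astype(np.int32) @ wq.astype(np.int32)
y_int8 = np.maximum(acc * (sx * sw), 0.0)

print(float(np.max(np.abs(y_fp32 - y_int8))))  # small quantization error relative to the outputs
```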
  • Abstractions for Programming Graphics Processors in High-Level Programming Languages
Abstracties voor het programmeren van grafische processoren in hoogniveau-programmeertalen (Abstractions for Programming Graphics Processors in High-Level Programming Languages). Tim Besard. Supervisor: Prof. Dr. Ir. B. De Sutter. Dissertation submitted to obtain the degree of Doctor of Computer Science Engineering. Department of Electronics and Information Systems, chaired by Prof. Dr. Ir. K. De Bosschere. Faculty of Engineering and Architecture, Ghent University, academic year 2018-2019. ISBN 978-94-6355-244-8, NUR 980, legal deposit D/2019/10.500/52.

Examination committee: Prof. Filip De Turck (chair), Department of Information Technology, Faculty of Engineering and Architecture, Ghent University; Prof. Koen De Bosschere (secretary), Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University; Prof. Bjorn De Sutter (supervisor), Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University; Prof. Jutho Haegeman, Department of Physics and Astronomy, Faculty of Sciences, Ghent University; Prof. Jan Lemeire, Department of Electronics and Informatics, Faculty of Engineering, Vrije Universiteit Brussel; Prof. Christophe Dubach, School of Informatics, College of Science & Engineering, The University of Edinburgh; Prof. Alan Edelman, Computer Science & Artificial Intelligence Laboratory, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.

Dankwoord (Acknowledgments): I did not really know what I was getting into when, back in 2012, I went to the catacombs of the Technicum for a conversation about a doctorate, and whether I had ever worked with LLVM before. Many years later, I work at a desk that does get daylight, and the end of this study is finally in sight. Which is only fitting, so I am told, after 7 years.
  • CMSC 411: Computer Architecture
CMSC 411: Computer Architecture, Spring 2019. Jason Tang.

About your friendly instructor: Jason Tang (just call me Jason!). UMBC adjunct faculty member since 2012. Taught CMSC 104, 202, 421, and 411. Works full-time at a nearby mega-corporation as a software engineer.

Contact information: email me at [email protected]. Office in ITE 201C. Tuesday / Thursday, 7:00 pm - 8:00 pm, right after class. Teaching assistant: TBA.

Am I in the right class? Prerequisites are CMSC 313, or CMPE 212 + CMPE 310. Must be able to read hexadecimal notation. Should already be familiar with C/C++ and some assembly code; this does not mean Java, Python, or other scripting languages.

Required programming knowledge. Know how (or research on Stack Overflow) to do these things: read the very fantastic man pages; call a function and pass values in and out; know the difference between an in parameter, an out parameter, and an in/out parameter; know what a C++ reference technically is; understand basic boolean logic.

Topics covered: instruction sets, performance measurements, machine arithmetic, processor design, memory systems, I/O design, and computer buses.

Course information: http://www.csee.umbc.edu/~jtang/cs411.s19. Grades will be posted on Blackboard. Discussion forums are also on Blackboard. All assignments are submitted via the submit system at linux.gl.umbc.edu. Ensure you have a way to transfer files between your development machine and the UMBC server (scp, PuTTY, Cyberduck, or equivalent). Using the clipboard to
  • 3 — Arithmetic for Computers: MIPS Arithmetic Logic Unit (ALU)
Chapter 3: Arithmetic for Computers (lecture slides, Rechnerstrukturen 182.092).

3.1 Introduction. Arithmetic for computers: operations on integers (addition and subtraction, multiplication and division, dealing with overflow) and floating-point real numbers (representation and operations).

MIPS Arithmetic Logic Unit (ALU). The ALU must support the arithmetic/logic operations of the ISA: add, addi, addiu, addu; sub, subu; mult, multu, div, divu; sqrt; and, andi, nor, or, ori, xor, xori; beq, bne, slt, slti, sltiu, sltu. It takes two 32-bit inputs A and B and a 4-bit operation select m, and produces a 32-bit result plus zero and overflow (ovf) flags. Special handling is needed for sign extension (addi, addiu, slti, sltiu), zero extension (andi, ori, xori), and overflow detection (add, addi, sub). [Figures: a simplified 1-bit MIPS ALU supporting AND, OR, ADD, SLT, and the final 32-bit ALU.]

Performance issues. The critical path of an n-bit ripple-carry adder is n * CP: the carry out of each 1-bit ALU stage feeds the carry in of the next, so the carry must ripple through all n stages. [Figure: four chained 1-bit ALU stages, CarryIn0 through CarryOut3.] Design trick: throw hardware at it (carry lookahead). [Figure: carry-lookahead logic for a 4-bit adder, LS 283.]

3.2 Addition and Subtraction. Integer addition, example: 7 + 6. Overflow occurs if the result is out of range. Adding a positive and a negative operand: no overflow. Adding two positive operands
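To make the slide excerpt's ripple-carry and overflow discussion concrete, here is a small Python model of an n-bit ripple-carry adder with the signed-overflow check (overflow when the carry into the sign bit differs from the carry out of it). It is an illustrative software model; the slides' point is that real hardware replaces the n-stage carry chain with carry-lookahead logic.

```python
# Illustrative software model of an n-bit ripple-carry adder with the
# signed-overflow check from the slides (overflow when the carry into the sign
# bit differs from the carry out of it). Real hardware shortens the n-stage
# carry chain with carry-lookahead logic.
def ripple_carry_add(a: int, b: int, n: int = 32):
    carry = 0
    result = 0
    carry_into_msb = 0
    for i in range(n):                       # one 1-bit ALU stage per iteration
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if i == n - 1:
            carry_into_msb = carry           # remember the carry into the sign bit
        result |= (ai ^ bi ^ carry) << i     # sum bit of this stage
        carry = (ai & bi) | (ai & carry) | (bi & carry)  # carry out of this stage
    overflow = carry_into_msb != carry
    return result, overflow

print(ripple_carry_add(7, 6, n=8))        # (13, False)
print(ripple_carry_add(0x7F, 0x01, n=8))  # (128, True): 127 + 1 overflows signed 8-bit
```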