The Abacus Processor Architecture
Total Pages: 16
File Type: pdf, Size: 1020 Kb
Recommended publications
-
Donald Knuth Fletcher Jones Professor of Computer Science, Emeritus Curriculum Vitae Available Online
Bio
Donald Ervin Knuth is an American computer scientist, mathematician, and Professor Emeritus at Stanford University. He is the author of the multi-volume work The Art of Computer Programming and has been called the "father" of the analysis of algorithms. He contributed to the development of the rigorous analysis of the computational complexity of algorithms and systematized formal mathematical techniques for it. In the process he also popularized the asymptotic notation. In addition to fundamental contributions in several branches of theoretical computer science, Knuth is the creator of the TeX computer typesetting system, the related METAFONT font definition language and rendering system, and the Computer Modern family of typefaces. As a writer and scholar, Knuth created the WEB and CWEB computer programming systems designed to encourage and facilitate literate programming, and designed the MIX/MMIX instruction set architectures. As a member of the academic and scientific community, Knuth is strongly opposed to the policy of granting software patents. He has expressed his disagreement directly to the patent offices of the United States and Europe. (via Wikipedia)
ACADEMIC APPOINTMENTS
• Professor Emeritus, Computer Science
HONORS AND AWARDS
• Grace Murray Hopper Award, ACM (1971)
• Member, American Academy of Arts and Sciences (1973)
• Turing Award, ACM (1974)
• Lester R Ford Award, Mathematical Association of America (1975)
• Member, National Academy of Sciences (1975)
PROFESSIONAL EDUCATION
• PhD, California Institute of Technology, Mathematics (1963)
PATENTS
• Donald Knuth, Stephen N Schiller. "United States Patent 5,305,118 Methods of controlling dot size in digital half toning with multi-cell threshold arrays", Adobe Systems, Apr 19, 1994
• Donald Knuth, LeRoy R Guck, Lawrence G Hanson.
-
Increasing Memory Miss Tolerance for SIMD Cores
David Tarjan, Jiayuan Meng and Kevin Skadron, Department of Computer Science, University of Virginia, Charlottesville, VA 22904, {dtarjan, jm6dg, skadron}@cs.virginia.edu
ABSTRACT
Manycore processors with wide SIMD cores are becoming a popular choice for the next generation of throughput oriented architectures. We introduce a hardware technique called “diverge on miss” that allows SIMD cores to better tolerate memory latency for workloads with non-contiguous memory access patterns. Individual threads within a SIMD “warp” are allowed to slip behind other threads in the same warp, letting the warp continue execution even if a subset of threads are waiting on memory. Diverge on miss can either increase the performance of a given design by up to a factor of 3.14 for a single warp per core, or reduce the number of warps per core needed to sustain a given level of performance from 16 to 2 warps, reducing the area per core by 35%.
... that use a single instruction multiple data (SIMD) organization can amortize the area and power overhead of a single frontend over a large number of execution backends. For example, we estimate that a 32-wide SIMD core requires about one fifth the area of 32 individual scalar cores. Note that this estimate does not include the area of any interconnection network among the MIMD cores, which often grows supra-linearly with the number of cores [18].
To better tolerate memory and pipeline latencies, manycore processors typically use fine-grained multi-threading, switching among multiple warps, so that active warps can mask stalls in other warps waiting on long-latency events. The drawback of this approach is that the size of the regis-
-
Tug2007-Slides-2X2.Pdf
Extending TeX and METAFONT with Floating-Point Arithmetic
Nelson H. F. Beebe, Department of Mathematics, University of Utah, Salt Lake City, UT 84112-0090 USA
TeX Users Group Conference 2007 talk.
Dedication
Professor Donald Knuth (Stanford) and Professor William Kahan (Berkeley)
Arithmetic in TeX and METAFONT
• Binary integer arithmetic with ≥ 32 bits (TeX \count registers)
• Fixed-point arithmetic with sign bit, overflow bit, ≥ 14 integer bits, and 16 fractional bits (TeX \dimen, \muskip, and \skip registers)
• Overflow detected on division and multiplication but not on addition (flaw (NHFB), feature (DEK))
• Gyrations sometimes needed in METAFONT to work with fixed-point numbers
Uh, oh. A little while ago one of the quantities that I was computing got too large, so I’m afraid your answers will be somewhat askew. You’ll probably have to adopt different tactics next time. But I shall try to carry on anyway.
Arithmetic in METAFONT
METAFONT restricts input numbers to 12 integer bits:
% mf expr
gimme an expr: 4095
>> 4095
gimme an expr: 4096
! Enormous number has been reduced.
>> 4095.99998
gimme an expr: infinity
>> 4095.99998
gimme an expr: epsilon
>> 0.00002
gimme an expr: 1/epsilon
! Arithmetic overflow.
>> 32767.99998
gimme an expr: 1/3
>> 0.33333
gimme an expr: 3*(1/3)
>> 0.99998
gimme an expr: 1.2 - 2.3
>> -1.1
gimme an expr: 1.2 - 2.4
>> -1.2
-
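The 1/3 and 3*(1/3) results in the METAFONT session of the Beebe slides follow directly from storing numbers as integer multiples of 2^-16. The sketch below is only an illustration of that arithmetic, not METAFONT's actual code: the helper names `to_fixed`, `fixed_div` and `show` are hypothetical, the division is assumed to round to the nearest unit, and printing with five decimals is an assumption that happens to agree with the session for these values.

```python
SCALE = 1 << 16  # 16 fractional bits: one unit is 2**-16


def to_fixed(x: float) -> int:
    """Convert a number to a scaled integer (hypothetical helper)."""
    return round(x * SCALE)


def fixed_div(a: int, b: int) -> int:
    """Divide two positive scaled values, rounding to the nearest unit."""
    return (2 * a * SCALE + b) // (2 * b)


def show(v: int) -> str:
    """Format a scaled value with five decimals (assumed display rule)."""
    return f"{v / SCALE:.5f}"


third = fixed_div(to_fixed(1), to_fixed(3))  # 21845 units, slightly below 1/3
print(show(1))          # epsilon, the smallest positive value: 0.00002
print(show(third))      # 0.33333
print(show(3 * third))  # 65535 units, one unit short of 1: 0.99998
```

The product falls one unit short of 1 because 65536 is not divisible by 3, so the rounding error of the division is multiplied back up rather than cancelled.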
Typeset MMIX Programs with TEX Udo Wermuth Abstract a TEX Macro
TUGboat, Volume 35 (2014), No. 3, p. 297
Typeset MMIX programs with TeX
Udo Wermuth
Abstract
A TeX macro package is presented as a literate program. It can be included in programs written in the languages MMIX or MMIXAL without affecting the assembler. Such an instrumented file can be processed by TeX to get nicely formatted output. Only a new first line and a new last line must be entered. And for each end-of-line comment a flag is set to indicate that the comment is written in TeX.
How to read the following program
The text that starts in the next chapter is a literate program [2, 1] written in a style similar to noweb [7].
...
Example: In section 9 the lines "See also section 10." and "This code is used in section 24." are given. No such line appears in section 10 as it only extends the replacement code of section 9. (Note that section 10 has in its headline the number 9.) In section 24 the reference to section 9 stands for all of the eight code lines stated in sections 9 and 10.
If a section is not used in any other section then it is a root and during the extraction of the code a file is created that has the name of the root. This file collects all the code in the sequence of the referenced sections from the code part. The collection process for all root sections is called tangle. A second process is called weave. It outputs the documentation and the code parts as a TeX document.
Example: The following program has only one root that is defined in section 4 with the headline
-
Tousimojarad, Ashkan (2016) GPRM: a High Performance Programming Framework for Manycore Processors. Phd Thesis
Tousimojarad, Ashkan (2016) GPRM: a high performance programming framework for manycore processors. PhD thesis. http://theses.gla.ac.uk/7312/
Copyright and moral rights for this thesis are retained by the author. A copy can be downloaded for personal non-commercial research or study. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the Author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the Author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given.
Glasgow Theses Service http://theses.gla.ac.uk/
GPRM: A HIGH PERFORMANCE PROGRAMMING FRAMEWORK FOR MANYCORE PROCESSORS
ASHKAN TOUSIMOJARAD
SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy
SCHOOL OF COMPUTING SCIENCE, COLLEGE OF SCIENCE AND ENGINEERING, UNIVERSITY OF GLASGOW
NOVEMBER 2015
© ASHKAN TOUSIMOJARAD
Abstract
Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM).
-
Multi-Core Processors and Systems: State-Of-The-Art and Study of Performance Increase
Abhilash Goyal, Computer Science Department, San Jose State University, San Jose, CA 95192, 408-924-1000
ABSTRACT
To achieve large processing power, we are moving towards parallel processing. In simple words, parallel processing can be defined as using two or more processors (cores, computers) in combination to solve a single problem. To achieve good results with parallel processing, many multi-core processors have been designed and fabricated in the industry. In this class-project paper, an overview of the state of the art of the multi-core processors designed by several companies including Intel, AMD, IBM and Sun (Oracle) is presented. In addition to the overview, the main advantage of using multi-core is demonstrated by experimental results. The focus of the experiment is to study the speed-up in the execution of the ‘program’ as the number of processors (cores) increases. For this experiment, an open source parallel program to count prime numbers is considered, and simulations are performed on a 3-node Raspberry cluster. The obtained results show that the execution time of the parallel program decreases as the number of cores increases.
... speedup. Some tasks are easily divided into parts that can be processed in parallel. In those scenarios, speedup will most likely follow the “common trajectory” shown in Figure 2. If an application has little or no inherent parallelism, then little or no speedup will be achieved, and because of overhead, speedup may follow the “occasional trajectory” shown in Figure 2.
-
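The prime-counting experiment in the Goyal abstract relies on a task that divides cleanly into independent parts. The sketch below is not the open source program used in that paper. It is a minimal illustration, with hypothetical function names, of how such a count can be split into sub-ranges and handed to worker processes through Python's standard `concurrent.futures` module.

```python
from concurrent.futures import ProcessPoolExecutor


def is_prime(n: int) -> bool:
    """Trial division; slow on purpose so each chunk carries real work."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def count_primes(lo: int, hi: int) -> int:
    """Count the primes in the half-open range [lo, hi)."""
    return sum(1 for n in range(lo, hi) if is_prime(n))


def split_range(lo: int, hi: int, parts: int) -> list[tuple[int, int]]:
    """Cut [lo, hi) into `parts` contiguous sub-ranges of near-equal length."""
    step, extra = divmod(hi - lo, parts)
    bounds, start = [], lo
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def parallel_count(limit: int, workers: int) -> int:
    """Count primes below `limit`, one sub-range per worker process."""
    los, his = zip(*split_range(2, limit, workers))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, los, his))
```

Calling `parallel_count(100_000, workers=4)` from a script's `if __name__ == "__main__":` block should give the same total as `count_primes(2, 100_000)`. Equal-length sub-ranges do not carry equal work here, since larger numbers take longer to test, which is one reason measured speedup tends to fall short of the core count.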
Understanding and Guiding the Computing Resource Management in a Runtime Stacking Context
THESIS PRESENTED AT THE UNIVERSITÉ DE BORDEAUX, DOCTORAL SCHOOL OF MATHEMATICS AND COMPUTER SCIENCE, by Arthur Loussert, TO OBTAIN THE DEGREE OF DOCTOR, SPECIALTY: COMPUTER SCIENCE
Understanding and Guiding the Computing Resource Management in a Runtime Stacking Context
Reviewed by: Allen D. Malony, Professor, University of Oregon; Jean-François Méhaut, Professor, Université Grenoble Alpes
Defense date: 18 December 2019
Before the examination committee composed of:
Raymond Namyst, Professor, Université de Bordeaux – Thesis advisor
Marc Pérache, Research Engineer, CEA – Thesis co-advisor
Emmanuel Jeannot, Research Director, Inria Bordeaux Sud-Ouest – President of the jury
Edgar Leon, Computer Scientist, Lawrence Livermore National Laboratory – Examiner
Patrick Carribault, Research Engineer, CEA – Examiner
Julien Jaeger, Research Engineer, CEA – Invited member
2019
Keywords
High-Performance Computing, Parallel Programming, MPI, OpenMP, Runtime Mixing, Runtime Stacking, Resource Allocation, Resource Management
Abstract
With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP). This leads to a better exploitation of shared-memory communications and reduces the overall memory footprint. However, this evolution has a large impact on the software stack as applications’ developers do typically mix several programming models to scale over a large number of multicore nodes while coping with their hierarchical depth. One side effect of this programming approach is runtime stacking: mixing multiple models involves various runtime libraries being alive at the same time. Dealing with different runtime systems may lead to a large number of execution flows that may not efficiently exploit the underlying resources.
-
Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor
Benoît Dupont de Dinechin, Kalray S.A.
Figure 1: Overview of the MPPA3 processor.
ABSTRACT
The requirement of high performance computing at low power can be met by the parallel execution of an application on a possibly large number of programmable cores. However, the lack of accurate timing properties may prevent parallel execution from being applicable to time-critical applications. This problem has been addressed by suitably designing the architecture, implementation, and programming models, of the Kalray MPPA (Multi-Purpose Processor Array) family of single-chip many-core processors. We introduce the third-generation MPPA processor, whose key features are motivated by the high-performance and high-integrity functions of automated vehicles. High-performance computing functions, represented by deep learning inference and by computer vision, need to execute under soft real-time constraints. High-integrity functions are developed under model-based design, and must meet hard real-time constraints. Finally, the third-generation MPPA processor integrates a hardware root of trust, and its security architecture is able to support a security kernel for implementing the trusted
CCS CONCEPTS
• Computer systems organization → Multicore architectures; Heterogeneous (hybrid) systems; System on a chip; Real-time languages.
KEYWORDS
manycore processor, cyber-physical system, dependable computing
ACM Reference Format:
Benoît Dupont de Dinechin. 2019. Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor. In The 56th Annual Design Automation Conference 2019 (DAC ’19), June 2–6, 2019, Las Vegas, NV, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3316781.3323473
1 INTRODUCTION
Cyber-physical systems are characterized by software that interacts with the physical world, often with timing-sensitive safety-critical physical sensing and actuation [10].
-
RISC-V Instruction Set
Porting HelenOS to RISC-V
http://d3s.mff.cuni.cz
Martin Děcký
CHARLES UNIVERSITY IN PRAGUE, Faculty of Mathematics and Physics
Martin Děcký, FOSDEM, January 30th 2016
Introduction
Two system-level projects
• RISC-V is an instruction set architecture, HelenOS is an operating system
• Both originally started in academia, but with real-world motivations and ambitions
• Both still in the process of maturing: some parts already fixed, other parts can be still affected
→ Mutual evaluation of fitness
Introduction
Martin Děcký
• Computer science researcher: operating systems, Charles University in Prague
• Co-author of HelenOS (since 2004)
• Original author of the PowerPC port
RISC-V in a Nutshell
• Free (libre) instruction set architecture
• BSD license, in development since 2014
• Goal: No royalties for
-
Implementation of a MIX Emulator: a Case Study of the Scala Programming Language Facilities
Applied Computer Systems, ISSN 2255-8691 (online), ISSN 2255-8683 (print), December 2017, vol. 22, pp. 47–53, doi: 10.1515/acss-2017-0017, https://www.degruyter.com/view/j/acss
Ruslan Batdalov, Oksana Ņikiforova, Riga Technical University, Latvia
Abstract – Implementation of an emulator of MIX, a mythical computer invented by Donald Knuth, is used as a case study of the features of the Scala programming language. The developed emulator provides rich opportunities for program debugging, such as tracking intermediate steps of program execution, an opportunity to run a program in the binary or the decimal mode of MIX, verification of correct synchronisation of input/output operations. Such Scala features as cross-compilation, family polymorphism and support for immutable data structures have proved to be useful for implementation of the emulator. The authors of the paper also propose some improvements to these features: flexible definition of family-polymorphic types, integration of family polymorphism with generics, establishing full equivalence between mutating operations on mutable data types and copy-and-modify operations on immutable data types.
... synchronous manner, possible errors in a program may remain unnoticed. In the authors’ opinion, these checks are useful in mastering how to write correct programs because similar errors often occur in a modern program despite all changes in hardware and software technologies. Therefore, it would be helpful if an emulator supported running programs in different modes and allowed checking that the execution result was the same in all cases.
The programming language chosen by the authors for the implementation of an emulator supporting these features is Scala. This choice is arbitrary to some extent and rather dictated by the authors’ interest in the features of this
-
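The binary and decimal modes mentioned in the MIX emulator abstract differ in byte size: in Knuth's definition a MIX byte holds 64 distinct values on a binary machine and 100 on a decimal one, and a word has five bytes plus a sign. The sketch below is not the authors' Scala emulator. It is a minimal illustration, with a hypothetical helper `to_bytes`, of why a program that inspects individual bytes can behave differently in the two modes even though the numeric value of the word is the same.

```python
BINARY_BYTE = 64    # byte size of a binary MIX
DECIMAL_BYTE = 100  # byte size of a decimal MIX


def to_bytes(value: int, byte_size: int, n_bytes: int = 5) -> list[int]:
    """Split a non-negative word magnitude into MIX bytes, most significant first."""
    if not 0 <= value < byte_size ** n_bytes:
        raise OverflowError("value does not fit in a MIX word")
    digits = []
    for _ in range(n_bytes):
        value, d = divmod(value, byte_size)
        digits.append(d)
    return digits[::-1]


print(to_bytes(1000, BINARY_BYTE))   # [0, 0, 0, 15, 40]
print(to_bytes(1000, DECIMAL_BYTE))  # [0, 0, 0, 10, 0]
```

A program that only uses whole-word arithmetic on small values sees no difference, while one that loads or stores partial fields sees different bytes, which is the kind of dependence that running in both modes and comparing results can expose.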
Parallel Processing with the MPPA Manycore Processor
Kalray MPPA® Massively Parallel Processor Array
Benoît Dupont de Dinechin, CTO
14 November 2018
©2018 – Kalray SA All Rights Reserved
Outline
Presentation; Manycore Processors; Manycore Programming; Symmetric Parallel Models; Untimed Dataflow Models; Kalray MPPA® Hardware; Kalray MPPA® Software; Model-Based Programming; Deep Learning Inference; Conclusions
KALRAY IN A NUTSHELL
• We design processors at the heart of new intelligent systems
• 4 offices: Grenoble, Sophia (France), Silicon Valley (Los Altos, USA), Yokohama (Japan)
• ~80 people, ~70 engineers
• A unique technology, result of 10 years of development
• Financial and industrial shareholders: Pengpai
KALRAY: PIONEER OF MANYCORE PROCESSORS
#1 Scalable Computing Power
#2 Data processing in real time
#3 Completion of dozens of critical tasks in parallel
#4 Low power consumption
#5 Programmable / Open system
#6 Security & Safety
OUTSOURCED PRODUCTION (A FABLESS BUSINESS MODEL)
PARTNERSHIP WITH THE WORLD LEADER IN PROCESSOR MANUFACTURING
• Sub-contracted production: signed framework agreement with GUC, subsidiary of TSMC (world top-3 in semiconductor manufacturing)
• Limited investment
• No expansion costs
• Production on the basis of purchase orders
INTELLIGENT DATA CENTER: KEY COMPETITIVE ADVANTAGES
• First “NVMe-oF all-in-one” certified solution *
• 8x more powerful than the latest products announced by our competitors **
-
Processor Architectures
CS143 Handout 18, Summer 2008, 30 July, 2008
Handout written by Maggie Johnson and revised by Julie Zelenski.
Architecture Vocabulary
Let’s review a few relevant hardware definitions:
register: a storage location directly on the CPU, used for temporary storage of small amounts of data during processing.
memory: an array of randomly accessible memory bytes each identified by a unique address. Flat memory models, segmented memory models, and hybrid models exist which are distinguished by the way locations are referenced and potentially divided into sections.
instruction set: the set of instructions that are interpreted directly in the hardware by the CPU. These instructions are encoded as bit strings in memory and are fetched and executed one by one by the processor. They perform primitive operations such as "add 2 to register i1", "store contents of o6 into memory location 0xFF32A228", etc. Instructions consist of an operation code (opcode) e.g., load, store, add, etc., and one or more operand addresses.
CISC: Complex instruction set computer. Older processors fit into the CISC family, which means they have a large and fancy instruction set. In addition to a set of common operations, the instruction set has special purpose instructions that are designed for limited situations. CISC processors tend to have a slower clock cycle, but accomplish more in each cycle because of the sophisticated instructions. In writing an effective compiler back-end for a CISC processor, many issues revolve around recognizing how to make effective use of the specialized instructions.
RISC: Reduced instruction set computer. Many modern processors are in the RISC family, which means they have a relatively lean instruction set, containing mostly simple, general-purpose instructions.
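The instruction set entry in the CS143 vocabulary describes instructions as bit strings made of an opcode and operands, fetched and executed one at a time. The sketch below illustrates that loop on a made-up machine. The 16-bit layout (4-bit opcode, 4-bit register number, 8-bit immediate or address) and the four opcodes are assumptions chosen for brevity and do not correspond to any real processor or to the handout's target architecture.

```python
# Hypothetical opcodes for a toy 16-bit instruction word.
LOADI, ADDI, STORE, HALT = 0, 1, 2, 3


def encode(op: int, reg: int = 0, operand: int = 0) -> int:
    """Pack opcode, register number and 8-bit operand into one word."""
    return (op << 12) | (reg << 8) | operand


def run(program: list[int], memory_size: int = 256):
    """Fetch, decode and execute words until HALT; return registers and memory."""
    regs = [0] * 16
    memory = [0] * memory_size
    pc = 0  # program counter: index of the next instruction word
    while True:
        word = program[pc]                                   # fetch
        pc += 1
        op, reg, operand = word >> 12, (word >> 8) & 0xF, word & 0xFF  # decode
        if op == LOADI:                                      # execute
            regs[reg] = operand
        elif op == ADDI:
            regs[reg] = (regs[reg] + operand) & 0xFFFF
        elif op == STORE:
            memory[operand] = regs[reg]
        elif op == HALT:
            return regs, memory
        else:
            raise ValueError(f"unknown opcode {op}")


program = [
    encode(LOADI, 1, 40),    # r1 = 40
    encode(ADDI, 1, 2),      # add 2 to register r1
    encode(STORE, 1, 0x10),  # store contents of r1 into memory location 0x10
    encode(HALT),
]
regs, memory = run(program)
print(memory[0x10])  # 42
```

For simplicity the program is kept in its own list rather than in the same memory as the data, and every instruction is a simple single-step operation in the RISC spirit. A CISC-style design would add opcodes that combine several such steps.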