Hardware Architecture Impact on Manycore Programming Model


UPTEC IT 21001
Degree Project in Computer and Information Engineering
March 2, 2021

Erik Stubbfält
Civilingenjörsprogrammet i informationsteknologi
Master Programme in Computer and Information Engineering

Department of Information Technology (Institutionen för informationsteknologi)
Visiting address: ITC, Polacksbacken, Lägerhyddsvägen 2
Postal address: Box 337, 751 05 Uppsala
Website: http://www.it.uu.se

Abstract

This work investigates how certain processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations, while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.

External supervisors: Lars Gelin & Anders Dahlberg, Ericsson
Subject reader: Stefanos Kaxiras
Examiner: Lars-Åke Nordén
ISSN 1401-5749, UPTEC IT 21001
Printed by: Ångströmlaboratoriet, Uppsala University

Sammanfattning

This project investigates how different processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is analyzed and compared with commercial multicore processors, and differences in their respective memory systems and processor cores are discussed.
A prototype CBB implementation for x86-64 is developed using MPI. Benchmark tests show how CBB together with EMCA handles compute-intensive and memory-intensive scenarios, compared with a modern x86-64 system running the developed prototype. EMCA shows its strengths in heavy computations, while x86-64 performs best when data is reused to a high degree. Both systems use locality in their respective memory systems in a way that greatly benefits performance.

Contents

1 Introduction
2 Background
  2.1 Multicore and manycore processors
  2.2 Parallel computing
    2.2.1 Different types of parallelism
    2.2.2 Parallel programming models
  2.3 Memory systems
    2.3.1 Cache and scratchpad memory
  2.4 Memory models
  2.5 SIMD
  2.6 Prefetching
  2.7 Performance analysis tools
  2.8 The actor model
  2.9 Concurrency Building Blocks
  2.10 The baseband domain
3 Purpose, aims, and motivation
  3.1 Delimitations
4 Methodology
  4.1 Literature study
  4.2 Development
  4.3 Testing
5 Literature study
  5.1 Comparison of architectures
    5.1.1 Memory system
    5.1.2 Processor cores
    5.1.3 SIMD operations
    5.1.4 Memory models
  5.2 Related academic work
    5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper
    5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64
    5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
    5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods
6 Selection of software framework
  6.1 MPI
    6.1.1 Why MPI?
    6.1.2 MPICH
    6.1.3 Open MPI
7 Selection of target platform
8 Evaluation methods
  8.1 Strong scaling and weak scaling
    8.1.1 Compute-intensive benchmark
    8.1.2 Memory-intensive benchmark without reuse
    8.1.3 Memory-intensive benchmark with reuse
    8.1.4 Benchmark tests in summary
  8.2 Collection of performance metrics
  8.3 Systems used for testing
9 Implementation of CBB actors using MPI
  9.1 Sending messages
  9.2 Receiving messages
10 Creating and running benchmark tests
  10.1 MPI for x86-64
  10.2 CBB for EMCA
11 Results and discussion
  11.1 Compute-intensive benchmark
    11.1.1 Was the test not compute-intensive enough for EMCA?
  11.2 Memory-intensive benchmark with no data reuse
  11.3 Memory-intensive benchmark with data reuse
  11.4 Discussion on software complexity and optimizations
12 Conclusions
13 Future work
  13.1 Implement a CBB transform with MPI for x86-64
  13.2 Expand benchmark tests to cover more scenarios
  13.3 Run benchmarks with hardware prefetching turned off
  13.4 Combine MPI processes with OpenMP threads
  13.5 Run the same code in an ARMv8 system

List of Figures

1. Memory hierarchy of a typical computer system [7].
2. Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].
3. Main artefacts of the CBB programming model.
4. Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.
5. Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.
6. Processor topology of the x86-64 system used for testing.
7. The CBB application used for implementation.
8. Normalized execution times for the compute-intensive benchmark test with weak scaling.
9. Normalized execution times for the compute-intensive benchmark test with strong scaling.
10. Speedup for the compute-intensive benchmark test with strong scaling.
11. Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.
12. Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.
13. Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.
14. Speedup for the memory-intensive benchmark with no data reuse and strong scaling.
15. Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.
16. Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.
17. Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.
18. Speedup for the memory-intensive benchmark with data reuse and strong scaling.
19. Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

List of Tables

1. Flag synchronization program to motivate why memory models are needed [20, Table 3.1].
2. One possible execution of the program in Table 1 [20, Table 3.2].

1 Introduction

This work is centered around the connections between two areas within computer science, namely hardware architecture and parallel programming. How can a programming model, developed specifically for a certain processor type, be expanded and adapted to run on a completely different hardware architecture? This question, which is a general problem found in many areas of industry and research, is what this thesis revolves around.

The project is conducted in collaboration with the Baseband Infrastructure (BBI) department at Ericsson. They develop low-level software platforms and tools used in baseband software within the Ericsson Radio System product portfolio. This includes the Concurrency Building Blocks (CBB) programming model, which is designed to take full advantage of the Ericsson Many-Core Architecture (EMCA) hardware. EMCA has a number of characteristics that set it apart from commercial off-the-shelf (COTS) designs like x86-64 and ARMv8.
EMCA uses scratchpad memories and simplistic DSP cores instead of the coherent cache systems and out-of-order cores with simultaneous multithreading found in general-purpose hardware. These differences, and more, are investigated in a literature study with a special focus on how they might affect run-time performance.

MPI is used as a tool for developing a working CBB prototype that can run on both x86-64 and ARMv8. This choice is motivated by the many similarities between concepts used in CBB and concepts seen in MPI. Finally, a series of benchmark tests is run with CBB on EMCA and with the CBB prototype on a high-end x86-64 machine. These tests aim to investigate some compute-intensive and memory-intensive scenarios, both of which are relevant for actual baseband software. Each test is run with a fixed problem size which is divided equally among the available workers, and also with a problem size that increases linearly with the number of workers. EMCA shows very good performance in the compute-intensive tests. The test (using 16-bit integer addition) is in fact deemed not compute-intensive enough to highlight the expected scaling behavior, so a modified benchmark (using 64-bit floating-point addition) is also tested. In the memory-intensive tests, it is shown that x86-64 performs at its best when the degree of data reuse is high and it can hold data in its L1D cache. In this scenario it shows better scaling behavior than EMCA. However, x86-64 takes a much larger performance hit than EMCA when the number of processes exceeds the number of available processor cores.

The rest of this report is structured as follows: Section 2 describes the necessary background theory on the problem at hand. Section 3 discusses the purpose, aims, and motivation behind the project, along with some delimitations.