Tousimojarad, Ashkan (2016) GPRM: a High Performance Programming Framework for Manycore Processors. Phd Thesis
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Benchmarking Implementations of Functional Languages with ‘Pseudoknot’, a Float-Intensive Benchmark
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 1996 Benchmarking implementations of functional languages with ‘Pseudoknot’, a float-intensive benchmark Hartel, Pieter H ; Feeley, Marc ; et al Abstract: Over 25 implementations of different functional languages are benchmarked using the same program, a floating-point intensive application taken from molecular biology. The principal aspects studied are compile time and execution time for the various implementations that were benchmarked. An important consideration is how the program can be modified and tuned to obtain maximal performance on each language implementation. With few exceptions, the compilers take a significant amount of time to compile this program, though most compilers were faster than the then current GNU C compiler (GCC version 2.5.8). Compilers that generate C or Lisp are often slower than those that generate native code directly: the cost of compiling the intermediate form is normally a large fraction of the total compilation time. There is no clear distinction between the runtime performance of eager and lazy implementations when appropriate annotations are used: lazy implementations have clearly come of age when it comes to implementing largely strict applications, such as the Pseudoknot program. The speed of C can be approached by some implementations, but to achieve this performance, special measures such as strictness annotations are required by non-strict implementations. The benchmark results have to be interpreted with care. Firstly, a benchmark based on a single program cannot cover a wide spectrum of ‘typical’ applications. Secondly, the compilers vary in the kind and level of optimisations offered, so the effort required to obtain an optimal version of the program is similarly varied. -
A Scheme Foreign Function Interface to Javascript Based on an Infix
A Scheme Foreign Function Interface to JavaScript Based on an Infix Extension Marc-André Bélanger Marc Feeley Université de Montréal Université de Montréal Montréal, Québec, Canada Montréal, Québec, Canada [email protected] [email protected] ABSTRACT FFIs are notoriously implementation-dependent and code This paper presents a JavaScript Foreign Function Inter- using a given FFI is usually not portable. Consequently, face for a Scheme implementation hosted on JavaScript and the nature of FFI’s reflects a particular set of choices made supporting threads. In order to be as convenient as possible by the language’s implementers. This makes FFIs usually the foreign code is expressed using infix syntax, the type more difficult to learn than the base language, imposing conversions between Scheme and JavaScript are mostly im- implementation constraints to the programmer. In effect, plicit, and calls can both be done from Scheme to JavaScript proficiency in a particular FFI is often not a transferable and the other way around. Our approach takes advantage of skill. JavaScript’s dynamic nature and its support for asynchronous In general FFIs tightly couple the underlying low level functions. This allows concurrent activities to be expressed data representation to the higher level interface provided to in a direct style in Scheme using threads. The paper goes the programmer. This is especially true of FFIs for statically over the design and implementation of our approach in the typed languages such as C, where to construct the proper Gambit Scheme system. Examples are given to illustrate its interface code the FFI must know the type of all data passed use. -
Increasing Memory Miss Tolerance for SIMD Cores
Increasing Memory Miss Tolerance for SIMD Cores ∗ David Tarjan, Jiayuan Meng and Kevin Skadron Department of Computer Science University of Virginia, Charlottesville, VA 22904 {dtarjan, jm6dg,skadron}@cs.virginia.edu ABSTRACT that use a single instruction multiple data (SIMD) organi- Manycore processors with wide SIMD cores are becoming a zation can amortize the area and power overhead of a single popular choice for the next generation of throughput ori- frontend over a large number of execution backends. For ented architectures. We introduce a hardware technique example, we estimate that a 32-wide SIMD core requires called “diverge on miss” that allows SIMD cores to better about one fifth the area of 32 individual scalar cores. Note tolerate memory latency for workloads with non-contiguous that this estimate does not include the area of any intercon- memory access patterns. Individual threads within a SIMD nection network among the MIMD cores, which often grows “warp” are allowed to slip behind other threads in the same supra-linearly with the number of cores [18]. warp, letting the warp continue execution even if a subset of To better tolerate memory and pipeline latencies, many- core processors typically use fine-grained multi-threading, threads are waiting on memory. Diverge on miss can either 1 increase the performance of a given design by up to a factor switching among multiple warps, so that active warps can of 3.14 for a single warp per core, or reduce the number of mask stalls in other warps waiting on long-latency events. warps per core needed to sustain a given level of performance The drawback of this approach is that the size of the regis- from 16 to 2 warps, reducing the area per core by 35%. -
The Evolution of Lisp
1 The Evolution of Lisp Guy L. Steele Jr. Richard P. Gabriel Thinking Machines Corporation Lucid, Inc. 245 First Street 707 Laurel Street Cambridge, Massachusetts 02142 Menlo Park, California 94025 Phone: (617) 234-2860 Phone: (415) 329-8400 FAX: (617) 243-4444 FAX: (415) 329-8480 E-mail: [email protected] E-mail: [email protected] Abstract Lisp is the world’s greatest programming language—or so its proponents think. The structure of Lisp makes it easy to extend the language or even to implement entirely new dialects without starting from scratch. Overall, the evolution of Lisp has been guided more by institutional rivalry, one-upsmanship, and the glee born of technical cleverness that is characteristic of the “hacker culture” than by sober assessments of technical requirements. Nevertheless this process has eventually produced both an industrial- strength programming language, messy but powerful, and a technically pure dialect, small but powerful, that is suitable for use by programming-language theoreticians. We pick up where McCarthy’s paper in the first HOPL conference left off. We trace the development chronologically from the era of the PDP-6, through the heyday of Interlisp and MacLisp, past the ascension and decline of special purpose Lisp machines, to the present era of standardization activities. We then examine the technical evolution of a few representative language features, including both some notable successes and some notable failures, that illuminate design issues that distinguish Lisp from other programming languages. We also discuss the use of Lisp as a laboratory for designing other programming languages. We conclude with some reflections on the forces that have driven the evolution of Lisp. -
Multi-Core Processors and Systems: State-Of-The-Art and Study of Performance Increase
Multi-Core Processors and Systems: State-of-the-Art and Study of Performance Increase Abhilash Goyal Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT speedup. Some tasks are easily divided into parts that can be To achieve the large processing power, we are moving towards processed in parallel. In those scenarios, speed up will most likely Parallel Processing. In the simple words, parallel processing can follow “common trajectory” as shown in Figure 2. If an be defined as using two or more processors (cores, computers) in application has little or no inherent parallelism, then little or no combination to solve a single problem. To achieve the good speedup will be achieved and because of overhead, speed up may results by parallel processing, in the industry many multi-core follow as show by “occasional trajectory” in Figure 2. processors has been designed and fabricated. In this class-project paper, the overview of the state-of-the-art of the multi-core processors designed by several companies including Intel, AMD, IBM and Sun (Oracle) is presented. In addition to the overview, the main advantage of using multi-core will demonstrated by the experimental results. The focus of the experiment is to study speed-up in the execution of the ‘program’ as the number of the processors (core) increases. For this experiment, open source parallel program to count the primes numbers is considered and simulation are performed on 3 nodes Raspberry cluster . Obtained results show that execution time of the parallel program decreases as number of core increases. -
Part: an Asynchronous Parallel Abstraction for Speculative Pipeline Computations Kiko Fernandez-Reyes, Dave Clarke, Daniel Mccain
ParT: An Asynchronous Parallel Abstraction for Speculative Pipeline Computations Kiko Fernandez-Reyes, Dave Clarke, Daniel Mccain To cite this version: Kiko Fernandez-Reyes, Dave Clarke, Daniel Mccain. ParT: An Asynchronous Parallel Abstraction for Speculative Pipeline Computations. 18th International Conference on Coordination Languages and Models (COORDINATION), Jun 2016, Heraklion, Greece. pp.101-120, 10.1007/978-3-319-39519- 7_7. hal-01631723 HAL Id: hal-01631723 https://hal.inria.fr/hal-01631723 Submitted on 9 Nov 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License ParT: An Asynchronous Parallel Abstraction for Speculative Pipeline Computations? Kiko Fernandez-Reyes, Dave Clarke, and Daniel S. McCain Department of Information Technology Uppsala University, Uppsala, Sweden Abstract. The ubiquity of multicore computers has forced program- ming language designers to rethink how languages express parallelism and concurrency. This has resulted in new language constructs and new com- binations or revisions of existing constructs. In this line, we extended the programming languages Encore (actor-based), and Clojure (functional) with an asynchronous parallel abstraction called ParT, a data structure that can dually be seen as a collection of asynchronous values (integrat- ing with futures) or a handle to a parallel computation, plus a collection of combinators for manipulating the data structure. -
The Butterfly(TM) Lisp System
From: AAAI-86 Proceedings. Copyright ©1986, AAAI (www.aaai.org). All rights reserved. The Butterfly TMLisp System Seth A. Steinberg Don Allen Laura B agnall Curtis Scott Bolt, Beranek and Newman, Inc. 10 Moulton Street Cambridge, MA 02238 ABSTRACT Under DARPA sponsorship, BBN is developing a parallel symbolic programming environment for the Butterfly, based This paper describes the Common Lisp system that BBN is on an extended version of the Common Lisp language. The developing for its ButterflyTM multiprocessor. The BBN implementation of Butterfly Lisp is derived from C Scheme, ButterflyTM is a shared memory multiprocessor which may written at MIT by members of the Scheme Team.4 The contain up to 256 processor nodes. The system provides a simplicity and power of Scheme make it particularly suitable shared heap, parallel garbage collector, and window based as a testbed for exploring the issues of parallel execution, as I/Osystem. The future constructis used to specify well as a good implementation language for Common Lisp. parallelism. The MIT Multilisp work of Professor Robert Halstead and THE BUTTERFLYTM LISP SYSTEM students has had a significant influence on our approach. For example, the future construct, Butterfly Lisp’s primary For several decades, driven by industrial, military and mechanism for obtaining concurrency, was devised and first experimental demands, numeric algorithms have required implemented by the Multilisp group. Our experience porting increasing quantities of computational power. Symbolic MultiLisp to the Butterfly illuminated many of the problems algorithms were laboratory curiosities; widespread demand of developing a Lisp system that runs efficiently on both for symbolic computing power lagged until recently. -
Understanding and Guiding the Computing Resource Management in a Runtime Stacking Context
THÈSE PRÉSENTÉE À L’UNIVERSITÉ DE BORDEAUX ÉCOLE DOCTORALE DE MATHÉMATIQUES ET D’INFORMATIQUE par Arthur Loussert POUR OBTENIR LE GRADE DE DOCTEUR SPÉCIALITÉ : INFORMATIQUE Understanding and Guiding the Computing Resource Management in a Runtime Stacking Context Rapportée par : Allen D. Malony, Professor, University of Oregon Jean-François Méhaut, Professeur, Université Grenoble Alpes Date de soutenance : 18 Décembre 2019 Devant la commission d’examen composée de : Raymond Namyst, Professeur, Université de Bordeaux – Directeur de thèse Marc Pérache, Ingénieur-Chercheur, CEA – Co-directeur de thèse Emmanuel Jeannot, Directeur de recherche, Inria Bordeaux Sud-Ouest – Président du jury Edgar Leon, Computer Scientist, Lawrence Livermore National Laboratory – Examinateur Patrick Carribault, Ingénieur-Chercheur, CEA – Examinateur Julien Jaeger, Ingénieur-Chercheur, CEA – Invité 2019 Keywords High-Performance Computing, Parallel Programming, MPI, OpenMP, Runtime Mixing, Runtime Stacking, Resource Allocation, Resource Manage- ment Abstract With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared- memory models (e.g., MPI+OpenMP). This leads to a better exploitation of shared-memory communications and reduces the overall memory footprint. However, this evolution has a large impact on the software stack as applications’ developers do typically mix several programming models to scale over a large number of multicore nodes while coping with their hiearchical depth. One side effect of this programming approach is runtime stacking: mixing multiple models involve various runtime libraries to be alive at the same time. Dealing with different runtime systems may lead to a large number of execution flows that may not efficiently exploit the underlying resources. -
Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor
Consolidating High-Integrity, High-Performance, and Cyber-Security Functions on a Manycore Processor Benoît Dupont de Dinechin Kalray S.A. [email protected] Figure 1: Overview of the MPPA3 processor. ABSTRACT CCS CONCEPTS The requirement of high performance computing at low power can • Computer systems organization → Multicore architectures; be met by the parallel execution of an application on a possibly Heterogeneous (hybrid) systems; System on a chip; Real-time large number of programmable cores. However, the lack of accurate languages. timing properties may prevent parallel execution from being appli- cable to time-critical applications. This problem has been addressed KEYWORDS by suitably designing the architecture, implementation, and pro- manycore processor, cyber-physical system, dependable computing gramming models, of the Kalray MPPA (Multi-Purpose Processor ACM Reference Format: Array) family of single-chip many-core processors. We introduce Benoît Dupont de Dinechin. 2019. Consolidating High-Integrity, High- the third-generation MPPA processor, whose key features are mo- Performance, and Cyber-Security Functions on a Manycore Processor. In tivated by the high-performance and high-integrity functions of The 56th Annual Design Automation Conference 2019 (DAC ’19), June 2– automated vehicles. High-performance computing functions, rep- 6, 2019, Las Vegas, NV, USA. ACM, New York, NY, USA, 4 pages. https: resented by deep learning inference and by computer vision, need //doi.org/10.1145/3316781.3323473 to execute under soft real-time constraints. High-integrity func- tions are developed under model-based design, and must meet hard 1 INTRODUCTION real-time constraints. Finally, the third-generation MPPA processor Cyber-physical systems are characterized by software that interacts integrates a hardware root of trust, and its security architecture with the physical world, often with timing-sensitive safety-critical is able to support a security kernel for implementing the trusted physical sensing and actuation [10]. -
Graph Reduction Without Pointers
Graph Reduction Without Pointers TR89-045 December, 1989 William Daniel Partain The University of North Carolina at Chapel Hill Department of Computer Science ! I CB#3175, Sitterson Hall Chapel Hill, NC 27599-3175 UNC is an Equal Opportunity/Aflirmative Action Institution. Graph Reduction Without Pointers by William Daniel Partain A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science. Chapel Hill, 1989 Approved by: Jfn F. Prins, reader ~ ~<---( CJ)~ ~ ;=tfJ\ Donald F. Stanat, reader @1989 William D. Partain ALL RIGHTS RESERVED II WILLIAM DANIEL PARTAIN. Graph Reduction Without Pointers (Under the direction of Gyula A. Mag6.) Abstract Graph reduction is one way to overcome the exponential space blow-ups that simple normal-order evaluation of the lambda-calculus is likely to suf fer. The lambda-calculus underlies lazy functional programming languages, which offer hope for improved programmer productivity based on stronger mathematical underpinnings. Because functional languages seem well-suited to highly-parallel machine implementations, graph reduction is often chosen as the basis for these machines' designs. Inherent to graph reduction is a commonly-accessible store holding nodes referenced through "pointers," unique global identifiers; graph operations cannot guarantee that nodes directly connected in the graph will be in nearby store locations. This absence of locality is inimical to parallel computers, which prefer isolated pieces of hardware working on self-contained parts of a program. In this dissertation, I develop an alternate reduction system using "sus pensions" (delayed substitutions), with terms represented as trees and vari ables by their binding indices (de Bruijn numbers). -
Parallel Processing with the MPPA Manycore Processor
Parallel Processing with the MPPA Manycore Processor Kalray MPPA® Massively Parallel Processor Array Benoît Dupont de Dinechin, CTO 14 Novembre 2018 Outline Presentation Manycore Processors Manycore Programming Symmetric Parallel Models Untimed Dataflow Models Kalray MPPA® Hardware Kalray MPPA® Software Model-Based Programming Deep Learning Inference Conclusions Page 2 ©2018 – Kalray SA All Rights Reserved KALRAY IN A NUTSHELL We design processors 4 ~80 people at the heart of new offices Grenoble, Sophia (France), intelligent systems Silicon Valley (Los Altos, USA), ~70 engineers Yokohama (Japan) A unique technology, Financial and industrial shareholders result of 10 years of development Pengpai Page 3 ©2018 – Kalray SA All Rights Reserved KALRAY: PIONEER OF MANYCORE PROCESSORS #1 Scalable Computing Power #2 Data processing in real time Completion of dozens #3 of critical tasks in parallel #4 Low power consumption #5 Programmable / Open system #6 Security & Safety Page 4 ©2018 – Kalray SA All Rights Reserved OUTSOURCED PRODUCTION (A FABLESS BUSINESS MODEL) PARTNERSHIP WITH THE WORLD LEADER IN PROCESSOR MANUFACTURING Sub-contracted production Signed framework agreement with GUC, subsidiary of TSMC (world top-3 in semiconductor manufacturing) Limited investment No expansion costs Production on the basis of purchase orders Page 5 ©2018 – Kalray SA All Rights Reserved INTELLIGENT DATA CENTER : KEY COMPETITIVE ADVANTAGES First “NVMe-oF all-in-one” certified solution * 8x more powerful than the latest products announced by our competitors** -
Parallelism in Lisp
Parallelism in Lisp Michael van Biema Columbia University Dept. of Computer Science New York, N.Y. 10027 Tel: (212)280-2736 [email protected] three attempts are very interesting, in that two arc very similar Abstract in their approach but very different in the level of their constructs, and the third takes a very different approach. We This paper examines Lisp from the point of view of parallel do not study the so called "pure Lisp" approaches to computation. It attempts to identify exactly where the potential parallelizing Lisp since these are applicative approaches and for parallel execution really exists in LISP and what constructs do not present many of the more complex problems presented are useful in realizing that potential. Case studies of three by a Lisp with side-effects [4, 3]. attempts at augmenting Lisp with parallel constructs are examined and critiqued. The first two attempts concentrate on what we call control parallelism. Control parallelism is viewed here as a medium- or course-grained parallelism on the order of a function call in 1. Parallelism in Lisp Lisp or a procedure call in a traditional, procedure-oriented There are two main approaches to executing Lisp in parallel. language. A good example of this type of parallelism is the One is to use existing code and clever compiling methods to parallel evaluation of all the arguments to a function in Lisp, parallelize the execution of the code [9, 14, 11]. This or the remote procedure call or fork of a process in some approach is very attractive because it allows the use of already procedural language.