Locality-Conscious Parallel Algorithms

Total Pages: 16

File Type: PDF, Size: 1020 KB

Locality-conscious parallel algorithms
Nodari Sitchinava
Karlsruhe Institute of Technology / University of Hawaii
October 17, 2013

Outline: Motivation, PEM Model, PEM Algorithms, GPGPU, Conclusions

Motivation

1990s: Faster Programs

Mid 2000s – Power Dissipation Is a Problem

Parallel computing, 1970s-90s: PRAM, BSP, hypercubes, ...
- Important results on the limits of parallelism
- Are they practical?

Example: Orthogonal Point Location
- Runtime Θ(n log n)
- [Plot: runtime per element in seconds vs. n (up to 1.2e+08), compared against a Θ(n log n) curve]

Θ notation
What does Θ notation represent?
- Number of comparisons
- Number of arithmetic operations
- Number of memory accesses
- All of the above in the RAM/PRAM models

Example: Orthogonal Point Location – Θ(n log n) memory accesses?
- [Plots: runtime per element (seconds) and DRAM accesses per element vs. n (up to 1.2e+08)]

Modern memory design: Caches!
CPU – L1 cache – L2 cache – RAM – disk
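The plots above report runtime per element as n grows. The following is a minimal sketch (not from the talk) of how such a per-element measurement could be set up, using std::sort as a hypothetical stand-in workload in place of orthogonal point location, with arbitrarily chosen input sizes. The comparison count per element grows only like log n, yet the measured time per element keeps climbing once the input outgrows the caches, which is exactly the gap the slides point out.

```cpp
// Illustrative only: per-element timing harness; std::sort stands in for the
// talk's actual workload (orthogonal point location).
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    for (std::size_t n = 1u << 20; n <= (1u << 26); n *= 2) {
        std::vector<std::uint32_t> a(n);
        for (auto& x : a) x = static_cast<std::uint32_t>(rng());

        auto t0 = std::chrono::steady_clock::now();
        std::sort(a.begin(), a.end());           // Θ(n log n) comparisons
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        // Per-element time keeps growing with n even though comparisons per
        // element grow only like log n: memory accesses dominate.
        std::printf("n = %zu  time/element = %.3e s\n", n, secs / n);
    }
    return 0;
}
```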
Caches in Multi-cores
[Figure: CPUs 1-4, each with a private L1 cache, sharing an L2 cache, RAM, and disk]

Dealing with Caches: the External Memory Model
[Figure: a CPU with a cache of size M, transferring blocks of size B to and from external memory]

From RAM and PRAM to external memory
- RAM: a single CPU with random access memory
- PRAM: processors PE 1, PE 2, ..., PE P sharing random access memory
- External Memory: a single CPU with a cache of size M and block transfers of size B to external memory
- Parallel External Memory: processors PE 1, ..., PE P, each with a private cache of size M, transferring blocks of size B to and from shared memory

Parallel External Memory (PEM) Model [SPAA '08]
PEM: a simple model of locality and parallelism
- P processors, each with a private cache of size M
- Explicit cache replacement
- Block transfers of size B; up to P block transfers = 1 parallel I/O
- Block-level CREW access to shared memory
- Complexity measure: parallel I/O complexity, i.e. the number of parallel block transfers

Solutions in PEM
- Sorting [SPAA '08]
- Orthogonal computational geometry [ESA '10, IPDPS '11]
- Problems on graphs [IPDPS '10]: independent set, list ranking, Euler tour, expression tree evaluation, connected components, MST, bi-connected components, batched LCA, ear decomposition
- PEM data structures [SPAA '12]: parallel buffer/range tree
- Experimental validation [ESA '13]: orthogonal computational geometry
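To get a feel for the parallel I/O bounds used in the next section, here is a small sketch (not from the talk) that evaluates the PEM sorting bound sort_P(n) = Θ((n/(PB)) log_{M/B}(n/B)) for assumed, illustrative machine parameters; the concrete values of P, M, and B below are hypothetical.

```cpp
// Illustrative only: evaluate the PEM sorting bound
//   sort_P(n) = (n / (P*B)) * log_{M/B}(n / B)   parallel I/Os
// for assumed machine parameters (the numbers below are hypothetical).
#include <cmath>
#include <cstdio>

int main() {
    const double P = 16;        // processors (assumed)
    const double M = 1 << 20;   // cache size per processor, in elements (assumed)
    const double B = 1 << 10;   // block size, in elements (assumed)

    for (double n = 1e7; n <= 1.2e8; n *= 2) {
        double depth = std::log(n / B) / std::log(M / B);   // log_{M/B}(n/B)
        double ios = (n / (P * B)) * std::ceil(depth);
        std::printf("n = %.1e  log_{M/B}(n/B) = %.2f  sort_P(n) ~ %.2e parallel I/Os\n",
                    n, depth, ios);
    }
    return 0;
}
```

With cache and block sizes in this ballpark, log_{M/B}(n/B) stays below 2 even for inputs of 10^8 elements, consistent with the observation later in the talk that log_{M/B}(n/B) ≈ 2.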
PEM Algorithms

Sorting [SPAA '08]
- Merge/distribution degree d = min{M/B, √(n/P)}
- d-way mergesort and d-way distribution sort
- [Figure: d-way recursion over the input, with O(log_{M/B}(N/B)) levels]
- sort_P(n) = Θ((n/(PB)) log_{M/B}(n/B)) I/Os and Θ(n log n) comparisons

Geometric Algorithms [ESA '10, IPDPS '11]
Solved problems:
- Line segment intersections
- Rectangle intersections
- Range query
- Orthogonal point location
All in sort_P(n) = Θ((n/(PB)) log_{M/B}(n/B)) I/Os.

Parallel Distribution Sweeping
- Branching degree d = min{M/B, √(n/P)}
- [Figure: the input is divided into d vertical slabs and the sweep recurses within each slab; each processor handles O(N/P) elements]

Example: Orthogonal Point Location
Complexity: O((n/(PB)) log_{M/B}(n/B)) I/Os and O(n log n) comparisons
[Plots: runtime per element (seconds) and DRAM accesses per element vs. n (up to 1.2e+08)]
Note: log_{M/B}(n/B) ≈ 2 for these inputs.

Graph Problems [IPDPS '10]
Independent set, list ranking, Euler tour, expression tree evaluation, connected components, MST, bi-connected components, batched LCA, ear decomposition

Pointers And Caches Don't Get Along
List ranking takes min{Θ(n/P + log n), Θ((n/(PB)) log_{M/B}(n/B))} I/Os.
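The sketch below (not from the talk) contrasts a sequential scan of an array with a traversal of a randomly permuted successor array, which behaves like a linked list; the array size and permutation are assumptions for illustration. Both loops do Θ(n) work, but the pointer chase visits memory in essentially random order and pays roughly one cache miss per element, which is the intuition behind the extra cost of list ranking compared to scanning or sorting.

```cpp
// Illustrative only: why "pointers and caches don't get along".
// Both loops do Θ(n) work, but the pointer chase accesses memory in a random
// order and pays roughly one cache miss per element.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1u << 25;  // ~32M elements (assumed size)

    // Sequential scan over a contiguous array.
    std::vector<std::uint32_t> a(n, 1);
    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i];
    auto t1 = std::chrono::steady_clock::now();

    // A random permutation used as successor pointers: next[i] is i's successor.
    std::vector<std::uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);
    std::shuffle(next.begin(), next.end(), std::mt19937_64(42));
    auto t2 = std::chrono::steady_clock::now();
    std::uint32_t cur = 0;
    std::uint64_t hops = 0;
    for (std::size_t i = 0; i < n; ++i) { cur = next[cur]; hops += cur; }
    auto t3 = std::chrono::steady_clock::now();

    std::printf("scan: %.3f s   pointer chase: %.3f s   (checksums %llu %llu)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t3 - t2).count(),
                (unsigned long long)sum, (unsigned long long)hops);
    return 0;
}
```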
If you bear with me for a moment...

GPGPU Computing

GPU computing
- GPGPU – graphics cards for computation
- Hundreds of cores
- Thousands of threads

GPUs
[Figure: streaming multiprocessors (SMs), each with its own shared memory, connected to global memory and to CPU memory]
- Global memory: n ≤ 3 GB
- Shared (local) memory per SM: M ≈ 48 kB
- Registers: ≈ 32 kB
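The GPU parameters above lead to the same design pattern the PEM model captures: stage data through a small local memory in blocks. The sketch below is a CPU-side analogue of that tiling pattern, not GPU code; the 48 kB budget mirrors the shared-memory size quoted above, while the input size and the reduction performed on each tile are assumptions for illustration.

```cpp
// Illustrative only: a CPU-side analogue of staging data through a small local
// memory, as one would do with ~48 kB of GPU shared memory per SM.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t kLocalBytes = 48 * 1024;              // "shared memory" budget
    constexpr std::size_t kTile = kLocalBytes / sizeof(float);  // elements per tile

    std::vector<float> data(1u << 24, 1.0f);   // input residing in "global memory"
    std::array<float, kTile> local{};          // the small fast buffer

    double total = 0.0;
    for (std::size_t base = 0; base < data.size(); base += kTile) {
        std::size_t len = std::min(kTile, data.size() - base);
        // Load one tile into the local buffer, then compute on it.
        std::copy_n(data.begin() + static_cast<std::ptrdiff_t>(base), len, local.begin());
        for (std::size_t i = 0; i < len; ++i) total += local[i];
    }
    std::printf("tile = %zu elements (%zu bytes), sum = %.1f\n",
                kTile, kTile * sizeof(float), total);
    return 0;
}
```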