SequenceL Optimization Results on IBM POWER8 with Linux


This report compiles the final performance results of TMT's work optimizing the auto-parallelizing SequenceL multicore programming tool set for the IBM POWER8 platform. SequenceL is a compact, powerful functional programming language and auto-parallelizing tool set that quickly and easily converts algorithms to robust, massively parallel code. SequenceL was designed to work in concert with existing programming languages and legacy code, making it a powerful tool for software engineers modernizing code for current platforms. Customers have nicknamed it “Matlab on steroids” because it enables scientists and engineers to easily explore different algorithms and innovations, then quickly convert them to robust, performant production code that runs on a multitude of modern hardware platforms.

This effort included:

  • Performance optimizations to use SIMD (AltiVec, VSX) vectorization and to perform cache optimization on POWER8 systems, in both the SequenceL runtime environment and the code generated by the SequenceL compiler.
  • Ensuring SequenceL runs with recent open-source library versions that have POWER8 optimizations, including gcc 4.9+ and glibc 2.20+.
  • Testing performance and correctness with the Advanced Toolchain for PowerLinux v8 and v9.
  • Running and testing SequenceL on multiple POWER8 Linux distributions, including:
    o Red Hat RHEL 7.2
    o CentOS 7
    o Ubuntu 14, 15.10, and 16.04
  • TMT performance-testing benchmarks on multiple IBM POWER8 and Intel x86 platforms.

Comparisons of POWER8 to Intel x86 Xeon v3

After completing the optimization work on IBM POWER8, TMT ran its suite of heatmap programs on two POWER8 configurations and three Intel x86 “Haswell” configurations for comparison and verification purposes.
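To make the programming model concrete, here is the kind of SequenceL code the benchmarks below exercise. This is an illustrative sketch based on SequenceL examples published by TMT, not code taken from this report: a dense matrix multiply is a single declarative rule, and the compiler and runtime parallelize the per-element work automatically.

```sequencel
// Multiply two matrices of doubles. A(2) and B(2) declare rank-2
// sequence (matrix) arguments; the free indices i and j range over
// the rows of A and the columns of B, and every (i, j) entry is an
// independent dot product the runtime can evaluate in parallel.
matmul(A(2), B(2))[i, j] := sum(A[i, all] * B[all, j]);
```

No explicit threads, locks, or vector intrinsics appear in the source; mapping the independent entries onto cores and SIMD units (AltiVec/VSX on POWER8) is the job of the compiler and runtime described above.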
TMT uses these programs, which are written in SequenceL, to stress the hardware platforms and to ensure the SequenceL tools operate optimally on them. An overview of the tested server configurations is below; results are on the subsequent pages.

Tested server configurations (all running RHEL 7.2 Linux):

  • IBM S822: POWER8E 3.6 GHz, 20 cores (160 threads), 256 GB memory, 1 TB disk
  • IBM S824: POWER8E 4.32 GHz, 16 cores (128 threads), 512 GB memory, 780 GB SSD
  • Dell R730: Intel Xeon E5-2687W v3 3.10 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD
  • Dell R730: Intel Xeon E5-2699 v3 2.3 GHz, 36 cores (72 threads), 256 GB memory, 200 GB SSD
  • Dell R730: Intel Xeon E5-2650 v3 2.3 GHz, 20 cores (40 threads), 256 GB memory, 200 GB SSD

June 2016 Final Results, IBM POWER8 vs. Intel x86

[Chart: Matrix Multiply, seconds vs. threads (1 to 160) for each platform] Matrix multiplication of two 2000 x 2000 matrices of double-precision floating-point numbers. Less time is better.

[Chart: Game of Life, seconds vs. threads] Game of Life calculation on a 2000 x 2000 board. Stresses the memory system and integer arithmetic. Less time is better.

[Chart: FFT, seconds vs. threads] Two-dimensional fast Fourier transform of a 1024 x 1024 matrix of double-precision floating-point numbers. Less time is better.

[Chart: Quick Sort, seconds vs. threads] Sorts a list of 350,000 double-precision floating-point numbers. Less time is better.
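For a sense of how the benchmark programs themselves read, the Quick Sort benchmark's algorithm can be written along these lines in SequenceL. This is an illustrative sketch using constructs from TMT's public SequenceL materials (`let ... in`, filtering with `when`, and `++` concatenation), not TMT's actual benchmark source, and the exact syntax may differ:

```sequencel
// Illustrative quicksort over a rank-1 sequence of doubles.
// The two recursive calls operate on disjoint data, so the
// runtime is free to evaluate them concurrently.
quicksort(x(1)) :=
    let
        pivot := x[1];
        rest  := x[2 ... size(x)];
    in
        x when size(x) <= 1
        else quicksort(rest when rest < pivot)
             ++ [pivot]
             ++ quicksort(rest when rest >= pivot);
```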
[Chart: Substring Search, seconds vs. threads] Search for a substring within a list of 124,000,000 characters. Less time is better.

[Chart: Barnes-Hut, seconds vs. threads] The Barnes-Hut calculation is an approximation of an N-body simulation. Less time is better. Barnes-Hut relies heavily on vector instructions, so Intel x86 does better due to its 256-bit SIMD width vs. 128-bit on POWER8.

[Chart: Matrix Inverse, seconds vs. threads] Performs a matrix inversion of a 1000 x 1000 matrix of double-precision floating-point numbers. Less time is better.

[Chart: Sparse Compression, seconds vs. threads] Converts a 7000 x 7000 matrix into a sparse matrix. Generally a memory-bound problem; less time is better.

[Chart: Sparse Multiplication, seconds vs. threads] Multiplies a 5000 x 5000 sparse matrix of double-precision floating-point values by a vector of 5000 double-precision floating-point values. This is a memory-bound problem; less time is better.

[Chart: Sparse Decompression, seconds vs. threads] Converts a 5000 x 5000 sparse matrix of double-precision floating-point values back into a dense matrix. This is a memory-bound problem; less time is better.
[Chart: Autodesk, seconds vs. threads] Customer problem. Tests threading capability and FPU performance. Less time is better.

[Chart: Find Pi, seconds vs. threads] Calculates the value of Pi. Tests floating-point arithmetic. Less time is better.

[Chart: Mandelbrot, seconds vs. threads] Uses Monte Carlo sampling to calculate the area of the Mandelbrot set. This problem is bound by CPU performance. Less time is better.

[Chart: Spectral Norm, seconds vs. threads] Numerical-analysis calculations. Bound by a combination of memory and CPU performance. Less time is better.

[Chart: Jacobi, cells/sec vs. threads] Jacobi iteration on a three-dimensional array of 250 x 250 x 250 double-precision floating-point numbers. This follows a pattern that is very common in HPC fields. It is generally memory bound, although there are also many floating-point calculations. More cells/sec is better.

[Chart: Discrete Cosine Transform, seconds vs. threads] Discrete cosine transform of a 4096 x 4096 matrix of double-precision floating-point numbers. This calculation is commonly used in compression algorithms. Less time is better.
[Chart: Semblance, seconds vs. threads, including reduced-SMT POWER8 runs] Customer problem from the oil and gas industry. A memory-bound problem that also includes many floating-point calculations and tests the threading capability of a machine. Includes results at reduced SMT settings on POWER8 (1@16 cores, 1@20 cores, 2@40 cores, 4@80 cores, 7@140 cores). Less time is better.

[Chart: WirelessHART, seconds vs. threads, including reduced-SMT POWER8 runs] Customer problem that creates a schedule for a network gateway.
Recommended publications
  • SwRI IR&D Program 2016
    Internal Research and Development 2016. The SwRI IR&D Program exists to broaden the Institute's technology base and to encourage staff professional growth. Internal funding of research enables the Institute to advance knowledge, increase its technical capabilities, and expand its reputation as a leader in science and technology. The program also allows Institute engineers and scientists to continually grow in their technical fields by providing freedom to explore innovative and unproven concepts without contractual restrictions and expectations. Program areas: Space Science; Materials Research & Structural Mechanics; Intelligent Systems, Advanced Computer & Electronic Technology, & Automation; Engines, Fuels, Lubricants, & Vehicle Systems; Geology & Nuclear Waste Management; Fluid & Machinery Dynamics; Electronic Systems & Instrumentation; Chemistry & Chemical Engineering. Copyright © 2017 by Southwest Research Institute. All rights reserved under U.S. Copyright Law and International Conventions. No part of this publication may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without permission in writing from the publisher. All inquiries should be addressed to Communications Department, Southwest Research Institute, P.O. Drawer 28510, San Antonio, Texas 78228-0510, [email protected], fax (210) 522-3547. 2016 IR&D | IR&D Home. SwRI IR&D 2016, Space Science: Capability Development and Demonstration for Next-Generation Suborbital Research, 15-R8115; Scaling Kinetic Inductance Detectors, 15-R8311; Capability Development of
  • A Functional Approach to Finding Answer Sets
    A Functional Approach to Finding Answer Sets. Bryant Nelson, Josh Archer, and Nelson Rushton, Texas Tech University Department of Computer Science, Box 43104, Lubbock, TX 79409-3104, (bryant.nelson | josh.archer | nelson.rushton) @ ttu.edu. Submitted to FCS 2013. Keywords: Functional Programming, Answer Set Prolog, Logic Programming. Abstract: A naive answer set solver was implemented in the functional programming language SequenceL, and its performance compared to mainstream solvers on a set of standard benchmark problems. Implementation was very rapid (25 person-hours) and the resulting program only 400 lines of code. Nonetheless, the algorithm was tractable and obtained parallel speedups. Performance, though not pathologically poor, was considerably slower (around 20x) than that of mainstream solvers (CLASP, Smodels, and DLV) on all but one benchmark problem. … we have been convinced of Hughes’ hypothesis. We set out to further test this hypothesis by encoding an answer set solver in SequenceL. We hypothesize that a functional language would allow the programmer to implement an answer set solver more quickly than a procedural language, and that the resulting program would be more readable and understandable. Finding answer sets is an NP-complete problem, and over the last few decades a few solvers have been developed that tackle this problem. The current juggernauts in this area are the DLV, CLASP, and SMODELS solvers, all of which are very efficient. The problem would seem to benefit from a parallelized solver, yet CLASP only has a branch (CLASPar) dedicated to this, and DLV has been “experimenting” with parallelization in their grounder [DLVPar].
  • Bryant Nelson
    Bryant K. Nelson, PhD. (940) 902-5460, [email protected], bethune-bryant, bryant-nelson.

    Education:
      • Doctor of Philosophy in Computer Science, Texas Tech University, 2016. “Automatic Distributed Programming Using SequenceL”
      • Bachelor of Science in Computer Science, Texas Tech University, 2012
      • Bachelor of Science in Mathematics, Texas Tech University, 2012

    Professional Experience:
      • Texas Multicore Technologies Inc., Software Development Engineer, Jul 2012 - Present, Austin, Texas
        o Develop and extend the programming language SequenceL, including work on the compiler, interpreter, debugger, and IDE. Develop and maintain the primary libraries shipped with SequenceL.
        o Work with companies porting existing projects to SequenceL to make use of multicore hardware.
        o Speak at technical conferences and briefings. Design and lead technical workshops and training.
        o Primary Languages: SequenceL, C++, C#, Haskell, Bash, make, Python
        o Primary Tools: Jenkins, Git, Jira, Cross-OS Development (Linux/Mac/Windows)
      • Tyler Technologies Inc., Software Developer, Mar 2011 - Aug 2012, Lubbock, Texas
        o As part of the Public Safety Team, worked on projects such as Computer Aided Dispatch, Unit Mapping and Control, Records Management, Mobile Unit Interfaces, and Statistical Crime Analysis.
        o Primary Languages: Visual Basic 6, C# .NET, SQL
      • Texas Tech University, Computer Science Undergraduate Research Assistant, Aug 2010 - Jul 2012, Lubbock, Texas
        o Worked closely with NASA improving and rewriting flight code in a more efficient new language, SequenceL. SequenceL is a declarative, automatic-parallelizing, high-level language.
      • Texas Tech University, Computer Science Supplemental Instructor, Aug 2009 - May 2010, Lubbock, Texas
        o Taught a supplemental computer science class multiple times a week, intended to supplement the information given by the professor and to assist students in developing good study habits.
  • Designing Interdisciplinary Approaches to Problem Solving Into Computer Languages
    Designing Interdisciplinary Approaches to Problem Solving Into Computer Languages by Daniel E. Cooke (1), Vladik Kreinovich (2), and Joseph E. Urban (3) (1) Computer Science Department Texas Tech University; (2) Computer Science Department University of Texas at El Paso; (3) Computer Science and Engineering Department Arizona State University Abstract Many interdisciplinary design efforts require the involvement of computer scientists because of the complexity of the problem solving tools available for the projects. This paper demonstrates how appropriate language design can place high level languages in the hands of scientists and engineers, thus providing a more automated approach to problem solving that may reduce the amount of computer scientist involvement. The language SequenceL serves as an example of this approach. 1.0 Introduction There is an ongoing discussion at the highest levels of the United States government concerning "data morgues." The concern has to do with the current hardware capabilities that permit the acquisition and storage of vast amounts of data and the inability of scientists armed with current software technology to process and analyze the data. The current problem actually demonstrates that advances in computer software have not kept pace with advances in computer hardware. If corresponding software advances can be made, the data may be found to be comatose, rather than dead. Among other root problems, the comatose data is symptomatic of the fact that currently available software technology is not based upon abstractions that appropriately simplify approaches to complex problems and data sets, especially when the data sets are distributed among multiple processing elements. If large data sets containing, e.g., telemetry data, are to be analyzed, then exploratory or data mining programs must be written.
  • Comparative Programming Languages CM20253
    The End. There are many languages out there, both general-purpose and specialist. We have briefly covered many aspects of language design, and there are many more factors we could talk about in making choices of language. Often a single project can use several languages, each suited to its part of the project, and then the interoperability of languages becomes important: for example, can you easily join together code written in Java and C?
  • Your Software Faster, Better, Sooner
    The Complete Solution. Your Software Faster, Better, Sooner. The Right Tools for the Multicore Programming Job. BETTER SOFTWARE FASTER: Express problems in engineering terms without worrying how it will compute. Focus on problem solving, innovation, and getting it right the first time. AUTO-PARALLELIZING: Applications self-parallelize, with no effort on the part of the user, and are provably race-free. AGILE & TRANSPORTABLE: Faster time to market and the ability to quickly explore/support new hardware architectures. Applications developed for today’s technology automatically perform on tomorrow’s. INDUSTRY STANDARDS: Built on open industry-standard C++ and OpenCL to work with existing design flows, tools, and training. Designed to work in concert with other programming languages, legacy code, and libraries. © 2016 Texas Multicore Technologies, Inc. TexasMulticore.com. Unleashing the Potential of Today’s Multicore Systems: The right tools for the multicore programming job. The days of programming in a one-size-fits-all language are quickly becoming a thing of the past, especially for multicore and many-core systems. SequenceL is a compact, powerful functional programming language and auto-parallelizing tool set that quickly and … SequenceL Gets Rid of Decades of Programming Baggage: With SequenceL, the programming paradigm no longer mimics old single-step CPU hardware. Similar in concept to SQL for databases, SequenceL is more like writing formulas on a white board. In SequenceL, you describe the desired output in terms of the input, as func… SequenceL not only automatically parallelizes the code, but is provably race free. Race conditions are the largest quality challenge a parallelized software development team faces. When combined with the many other defect sources, test and debug becomes a much larger effort than the code development itself.
  • Exploring the Current Landscape of Programming Paradigm and Models
    Cover Story: Venkatesh Prasad Ranganath, Researcher at Microsoft Research, India. Exploring the Current Landscape of Programming Paradigm and Models. Until a few years ago, imperative programming and object-oriented programming were the two prevalent programming paradigms in the industry. However, this has started to change over the past few years, with the community exploring and slowly adopting functional programming. Similarly, there has been a renewed interest in exploring existing programming languages or creating new programming languages that support either tried-and-forgotten paradigms or a combination of paradigms. Further, there has been a surge in languages and programming systems to support development of loosely coupled … The “Programming Paradigms[44]” page at Wikipedia states that “None of the main programming paradigms have a precise, globally unanimous definition, let alone an official international standard.” Hence, instead of chasing definitions of programming paradigms, let us try to understand programming paradigms via examples. We shall start with three prevalent programming paradigms: imperative, object-oriented, and functional. For this purpose, let us consider the following programs that are identical in functionality but written in imperative, … assignments (e.g. i = i + 1). This is evident by the explicit use and maintenance of a counter variable i, dependence on the length of nums, and direct access of elements of nums (via indexing syntax, i.e. nums[i]). In object-oriented style, the program does not depend on the internal details of the list type. Instead, the list type encapsulates its elements (implementation details) and provides an iterator abstraction (interface) to iterate over its elements. Consequently, the program depends only on the iterator abstraction.
  • Automatic Parallelization Tools: A Review
    ISSN XXXX XXXX © 2017 IJESC. Research Article, Volume 7, Issue No. 3. Automatic Parallelization Tools: A Review. Varsha K. R, Department of CSE, RV College of Engineering, Bangalore, India. Abstract: With the advent of multi-core architectures, the concept of parallel programming is not restricted anymore to supercomputers and enterprise-level applications alone. Programmers are always ready to take advantage of the available potential of programs to be parallelized. The objective of the paper is to review various tools that are available for performing automatic parallelization. Keywords: multi-core, DOALL, loop-level parallelism, concurrency, data dependence, profiling. I. INTRODUCTION: The rise of the concept of parallel or concurrent programming has led to the development of multi-core architectures on a wider range. Programmers find parallel programs executing on multiple cores or processors to be more efficient than serial ones. But it is a tedious task to write parallel programs directly or to convert all existing serial programs to parallel. This limitation can be eliminated by automating this procedure. Tools are developed which can automatically accept a serial code and generate a parallel code by inserting suitable parallel constructs. III. EXISTING MODELS: The general procedure for these tools is to take a sequential code as input and generate parallel code. Using the sequential code, a dependency matrix or graph is implemented which stores dependency information that exists between consecutive statements in a loop. The next step is to detect a code section which has potential parallelism. Finally the tools generate a parallel code by inserting parallel programming constructs/directives at potential DOALL loops. A loop can be considered
  • Automatic Concurrency in SequenceL
    Electronic Notes in Theoretical Computer Science (URL: http://www.elsevier.nl/locate/entcs). Automatic Concurrency in SequenceL. Daniel E. Cooke, Department of Computer Science, Texas Tech University, Lubbock, TX, USA. Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA. Abstract: This paper presents a programming language which we believe to be most appropriate for the automation of parallel data processing, especially data processing of concern to the oil industry and to the US Federal Agencies involved in the analysis of satellite telemetry data. Focus is placed upon major language issues facing the development of the information power grid. The paper presents an example of the type of parallelism desired in the Grid. To implement this parallelism in such a language as Java, we need to specify parallelism explicitly. We show that if we rewrite the same solution in the high-level language SequenceL, then parallelism becomes implicit. SequenceL therefore seems to be a good candidate for a Grid-Oriented Language, because its abstraction relieves the problem solver of much of the burden normally required in the development of parallel problem solutions. The Need for New Language Abstractions: Hardware improvements and the general spread of computing and computer applications have created opportunities for scientists and engineers to solve ever more complicated problems. However, there are concerns about whether scientists and engineers possess the software tools necessary to solve
  • AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL by Bryant K
    AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL by Bryant K. Nelson, B.S. A Dissertation in COMPUTER SCIENCE, submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy. Committee: Dr. Nelson Rushton (Chair of Committee), Dr. Richard Watson, Dr. Bradley Nemanich, Dr. Yong Chen. Dr. Mark Sheridan, Dean of the Graduate School. August 2016. © Bryant K. Nelson, 2016. Texas Tech University, Bryant Nelson, August 2016. For Dr. Dan Cooke, in memoriam. ACKNOWLEDGMENTS: There are many people that made getting to where I am today possible. While I can never truly express the extent of my gratitude for these people, here I would like to take some time to acknowledge them and their contributions. First, I would like to acknowledge Dr. Nelson Rushton, the chair of my committee, my academic advisor, and my friend. Four years ago Dr. Rushton took me on as a PhD student, fresh out of my bachelor's programs. Over these past four years he has been there to guide me and offer advice as both a mentor and a friend, from collaborating on papers and guiding me down research paths to helping me get my grouping tighter and draw faster. Without Dr. Rushton's guidance and insight, I doubt that I would have made it this far in my academic career. Second, I would like to acknowledge my friend and colleague, Josh Archer. Josh and I actually started our undergraduate degrees at the same time and went through the same program, dual majoring in Computer Science and Mathematics.
  • Mali-T880 Launch Slides
    ARM: Power-efficient Compute for HPC. CERN OCP HPC Conference. Darren Cepulis, [email protected]. Over 12B ARM-based chips shipped in 2014. ARM CPU core unit shipments (2013 to 2014): Mobile 5.1bn to 5.4bn (+300m); Embedded 2.9bn to 4.1bn (+1.2b); Enterprise 1.8bn to 1.9bn (+100m); Home 0.6bn to 0.6bn. ARM Architecture licensing overview: 1000+ overall licenses; 350+ customers; 1B smartphones sold in 2013; 50B ARM chips shipped; 121 new licenses sold in 2013; 200+ Cortex-M licenses across over 160 companies; 10B ARM chips shipped in 2013; 620 licenses sold in the past five years; 50 v8-A licenses; hundreds of optimized system-on-chip solutions. Extensible architecture for heterogeneous multi-core solutions: up to 4 cores per cluster and up to 12 coherent clusters (e.g. Cortex-A72 and Cortex-A53 CPUs on CHI/ACE master interfaces); up to 24 I/O coherent interfaces for accelerators and I/O (DSP, 10-40 GbE, PCIe, SATA, DPI, crypto, USB); virtualized interrupts via GIC-500; I/O virtualisation via CoreLink MMU-500; CoreLink CCN-512 Cache Coherent Network with 1-32MB integrated L3 cache and snoop filter; NIC-400 network interconnect; DMC-520 memory controllers with up to quad-channel x72 DDR3/4 (DDR4-3200); flash, SRAM, GPIO, and PCIe peripheral address space. ARM in Datacenter/HPC Compute.
  • A Parallel Communication Architecture for The
    A PARALLEL COMMUNICATION ARCHITECTURE FOR THE LANGUAGE SEQUENCEL by JULIAN B. RUSSBACH, B.S. A Thesis in COMPUTER SCIENCE, submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE. Approved: Chairperson of the Committee. Accepted: Dean of the Graduate School. December 2004. ACKNOWLEDGEMENTS: I would like to thank my committee chair Dr. Per Andersen for his wisdom, patience, and insight as a person and computer scientist; Dr. Nelson Rushton for his numerous ideas and contributions to SequenceL grammar and semantics, the inception of the token ring for dynamic load balancing SequenceL execution, and his contribution to the cluster; and Dr. Daniel E. Cooke for his enthusiasm, keen eye, cluster provisions, and the opportunity to work on a great research team. Thanks also to Chris G. Fielder for help with cluster assembly and troubleshooting; Chris McClimmans for cluster maintenance suggestions; and Dr. Philip Smith for use of the Texas Tech HPCC computers and valuable MPI lessons. A special thanks goes to my girlfriend Radoslava for her tolerance of me through a year of work. TABLE OF CONTENTS: Acknowledgements; Abstract; List of Figures. Chapter I. Introduction: 1.1 Document Overview; 1.2 Introduction to the Language SequenceL (1.2.1 What is SequenceL?; 1.2.2 Consume-Simplify-Produce; 1.2.3 Normalize-Transpose-Distribute); 1.3 Other Parallel Languages; 1.4 SequenceL Implementations. II. Literature Review: 2.1.1 Load Balancing; 2.2 Token Rings (2.2.1 IBM's Token Ring; 2.2.2 "token rings"). III.