Unstructured Computations on Emerging Architectures

Dissertation by
Mohammed A. Al Farhan

In Partial Fulfillment of the Requirements
For the Degree of Doctor of Philosophy

King Abdullah University of Science and Technology
Thuwal, Kingdom of Saudi Arabia
May 2019

EXAMINATION COMMITTEE PAGE

The dissertation of M. A. Al Farhan is approved by the examination committee.

Dissertation Committee:
David E. Keyes, Chair, Professor, King Abdullah University of Science and Technology
Edmond Chow, Associate Professor, Georgia Institute of Technology
Mikhail Moshkov, Professor, King Abdullah University of Science and Technology
Markus Hadwiger, Associate Professor, King Abdullah University of Science and Technology
Hakan Bagci, Associate Professor, King Abdullah University of Science and Technology

© May 2019 Mohammed A. Al Farhan
All Rights Reserved

ABSTRACT

Unstructured Computations on Emerging Architectures
Mohammed A. Al Farhan

This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing scalable architectures, which are expected to be the building blocks of energy-austere exascale systems, and on which algorithmic- and architecture-oriented optimizations are essential for achieving worthy performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate for a broad spectrum of emerging microprocessor architectures as representatives of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code.

While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic- and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node-level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms of data across different compute units (from the address space down to the registers) within every hardware core. We analyze the demonstrated aerodynamics application on specific computing architectures to develop certain performance metrics and models to bespeak the upper and lower bounds of the performance. We present significant full application speedup relative to the baseline code, on a succession of many-core processor architectures, i.e., Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, the performance of Knights Landing outperforms, at significantly lower power consumption, Intel Xeon Skylake with nearly twofold speedup. These optimizations are expected to be of value for many other unstructured mesh partial differential equation-based scientific applications as multi- and many-core architecture evolves.

To my family
My parents: Aminah and Ahmed
My fiancée: Wijdan
My siblings: Riyadh, Huda, Adi, Iman, Zahra, and Qassim
For their constant love and support
For enduring my absence while working on my PhD research
For always believing in me even when I do not
I LOVE YOU!

ACKNOWLEDGMENTS

"As we express our gratitude, we must never forget that the highest appreciation is not to utter words, but to live by them."
John F. Kennedy

Over the course of seven years of my PhD experience, so many people have been helping me at King Abdullah University of Science and Technology (KAUST) and elsewhere. I therefore would like to take this opportunity to express my deepest appreciation to those people who joined me throughout this wonderful and enjoyable journey. Your presence has enriched my life in many incredible ways. Thank you very much!

First and foremost, I would like to thank my advisor, Professor David E. Keyes, for the guidance and intellectual challenges he has tirelessly given me. I am especially fortunate to work with him and benefit from his perspectives. In addition, I am very grateful to my PhD committee members for their feedback, in particular, Professor Edmond Chow of Georgia Institute of Technology, for his insightful comments and suggestions that helped me to augment and improve the dissertation.

Many thanks and appreciation go to my friends and colleagues in the KAUST Extreme Computing Research Center (ECRC) and KAUST Supercomputing Laboratory (KSL), in particular, Mustafa Abduljabbar, for providing a spectacular environment for learning, sharing ideas, and conducting world-class research in high performance computing.

Support in the form of computing resources was provided by ECRC, KSL, KAUST Information Technology Research Division, Intel Parallel Computing Centers, Isambard Project at University of Bristol, CUDA Center of Excellence at KAUST, Blue Waters Supercomputer at University of Illinois at Urbana-Champaign, and Cray Center of Excellence at KAUST.

I am very thankful to my family for reaching out with encouragement and love. It really means more than you know. Thanks to my parents, Aminah and Ahmed, for being so patient and for bearing with me over the last couple of years.
Allah bless them for everything they have done for me. I am really honored to have parents like them. Furthermore, a lot of gratitude and appreciation go to my wonderful fiancée, Wijdan. No words can ever express how grateful I am to her, how I am honored to have her in my life, and how much I appreciate her consistent and unwavering kindness, love, and support. Also, many thanks to my siblings, Riyadh, Huda, Adi, Iman, Zahra, and Qassim, who have been extremely understanding and supportive of my studies. I feel very lucky to have a family that shares my enthusiasm for academic pursuits.

Last but not least, I want to thank all of my friends at KAUST and elsewhere for their support and kindness during this incredible journey.

Yours Sincerely,
Mohammed A. Al Farhan

TABLE OF CONTENTS

Examination Committee Page
Copyright
Abstract
Dedication
Acknowledgments
List of Figures
List of Tables

I Preliminaries

1 Introduction
  1.1 Dissertation Overview
    1.1.1 Statement of Contributions
    1.1.2 Summary of Results
    1.1.3 Dissertation Structure
  1.2 Other Research Projects
    1.2.1 Optimizing FMM Kernels on Emerging Architecture
    1.2.2 Extreme Scale FMM-accelerated BIE Solver for Wave Scattering

2 Background
  2.1 Unstructured Computations
    2.1.1 Fully Unstructured Navier-Stokes in 3 Dimensions
    2.1.2 Pseudo-transient Newton-Krylov-Schwarz (NKS)
    2.1.3 Indirect Addressing
  2.2 Emerging Architectures
    2.2.1 The Golden Age of Microprocessor Architecture
    2.2.2 Intel® Xeon® Phi™
    2.2.3 Intel® Xeon®

3 Related Work
  3.1 Unstructured Computations
    3.1.1 Porting PETSc-FUN3D to Shared-memory Parallelism
    3.1.2 Emerging Unstructured CFD Research Code
  3.2 Emerging Architectures
  3.3 Our Contributions to the State-of-the-art Many-core Optimizations

II Optimizing the Unstructured Grid Motif

4 Porting PETSc-FUN3D to Knights Corner
  4.1 Highlights of the Contributions
  4.2 Thread Affinity Control: Pinning and Binding
  4.3 Thread-level Parallelism
  4.4 Experimental Setup
    4.4.1 Platforms Used for Experiments
    4.4.2 Software Stacks
    4.4.3 Input Data Sets
  4.5 Performance Results and Analysis
    4.5.1 Offload Baseline Model
    4.5.2 Native Baseline Model
    4.5.3 Performance Results with the Coarse Mesh
    4.5.4 Performance Results with the Fine Mesh
    4.5.5 Comparison of Optimized KNC Performance to CPU Performance
    4.5.6 Large-scale Strong Scalability Study

5 Porting PETSc-FUN3D to Knights Landing
  5.1 Highlights of the Contributions
  5.2 PETSc-FUN3D Computational Routines
    5.2.1 Preprocessing and Setup Phase
    5.2.2 NKS Kernels Phase
  5.3 Data-level Parallelism
    5.3.1 Vectorizing Edge-based Loop Kernels
    5.3.2 Fine-grained Data Partitioning
  5.4 Experimental Setup
    5.4.1 Platforms Used for Experiments
    5.4.2 Software Stacks
    5.4.3 Input Data Sets
  5.5 Performance Results and Analysis
    5.5.1 Performance of the Flux Routine
    5.5.2 Performance of the Gradient Routine
    5.5.3 Vectorization Efficiency of the Edge-based Loop
    5.5.4 Performance of Explicit Vectorization
    5.5.5 Strong Thread Scalability Study
    5.5.6 Memory Bandwidth and Flop/s Performance
    5.5.7 Roofline Model
    5.5.8 Performance on Multi/Many-core Hardware

III Summary and Reflections

6 Concluding Remarks and Future Outlook
  6.1 Broader Impact
  6.2 Lessons Learned
  6.3 Future Directions

References

LIST OF FIGURES

2.1 The surface triangulation of the ONERA M6 wing. The wing surface triangulation is shown in green; the symmetry root plane in red; and the far-field boundary in blue.
2.2 Tetrahedral mesh edge-based loop kernel.
2.3 Baseline performance analysis of PETSc-FUN3D application code. The edge-based loops phase contains the flux evaluation kernel (≥ 45%), gradient kernel using weighted least squares that applies Gram-Schmidt (≥ 10%), and Jacobian matrix construction (≥ 7%), whereas the sparse recurrences phase includes the Incomplete LU factorization (≥ 16%) and the Sparse Triangular Solve (≥ 17%).
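
As a rough illustration of the edge-based loop pattern named in the abstract and in the captions of Figures 2.2 and 2.3, the following minimal sketch walks an edge list and accumulates a contribution into both endpoint vertices. It is not the PETSc-FUN3D code: the toy mesh, the unit weights, the placeholder "flux" formula, and the name edge_loop are all hypothetical and chosen only to show the indirect addressing through the edge list.

```python
def edge_loop(edges, weights, q):
    """Accumulate a toy flux along each edge into its two endpoint vertices.

    edges   -- list of (a, b) vertex index pairs (hypothetical toy mesh)
    weights -- one scalar per edge (stand-in for edge geometry)
    q       -- one scalar state value per vertex
    """
    residual = [0.0] * len(q)
    for (a, b), w in zip(edges, weights):
        # q and residual are indexed through the edge list, so consecutive
        # iterations touch vertices in an order dictated by the mesh.
        flux = w * (q[b] - q[a])  # placeholder flux, not a real flux function
        residual[a] += flux
        residual[b] -= flux
    return residual


# Toy mesh: 4 vertices connected by 5 edges.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
weights = [1.0] * len(edges)
q = [1.0, 2.0, 3.0, 4.0]

print(edge_loop(edges, weights, q))  # prints [3.0, 2.0, -2.0, -3.0]
```

Because each flux is added to one endpoint and subtracted from the other, the residual entries sum to zero in this toy case. Several edges update the same vertex, so a threaded version of such a loop has to avoid concurrent writes to a shared residual entry.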