18-447 Lecture 21: Parallelism – ILP to Multicores


Slide L21-1: 18-447 Lecture 21: Parallelism – ILP to Multicores
James C. Hoe, Dept. of ECE, CMU, April 7, 2010 (slides © 2010 J. C. Hoe and J. F. Martínez)
Announcements: Lab 4 due this week. Optional reading assignments below.
Handouts (on Blackboard):
- The Microarchitecture of Superscalar Processors, Smith and Sohi, Proceedings of the IEEE, 12/1995.
- The MIPS R10000 Superscalar Microprocessor, Yeager, IEEE Micro, 4/1996.
- Design Challenges of Technology Scaling, Shekhar Borkar, IEEE Micro, 1999.

Slide L21-2: Parallel Processing 101
Assume you have N units of work, and each unit of work takes 1 unit-time on 1 processing element (PE). With 1 PE, it takes N unit-time to complete the N units of work. With p PEs, how long does it take to complete the same N units of work? "Ideally", speedup is linear in p:
speedup = runtime_sequential / runtime_parallel = N / (N/p) = p
[Figure: runtime (N/p) falling and speedup rising linearly as p = 1, 2, 3, 4, 5, ...]

Slide L21-3: It may be linear, but...
[Figure: measured speedup S versus p, reaching only about S = 4 at p = 64]
How could this be?

Slide L21-4: Parallelization Overhead
The cheapest algorithm may not be the most scalable. If the parallel algorithm run on one PE satisfies runtime_parallel(p=1) = K * runtime_sequential with K > 1, then speedup = p/K; K is known facetiously as the "parallel slowdown". Communication between PEs is also not free:
- a PE may spend extra instructions/time in the act of sending or receiving data
- a PE may spend extra time waiting for data to arrive from another PE (a function of latency and bandwidth)
- a PE may spend extra time waiting for another PE to reach a particular point of the computation (a.k.a. synchronization)

Slide L21-5: It could be worse...
[Figure: speedup S versus p showing limited scalability; the curve flattens near S = 4 by p = 8]
An application may never get high speedup regardless of the number of PEs.

Slide L21-6: Parallelism Defined
T1 (call it "Work"): time to complete the work with 1 PE.
T∞ (call it "Critical Path"): time to complete the work given infinite PEs; T∞ is lower-bounded by dataflow dependencies.
Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
[Figure: dataflow graph in which inputs a and b feed a + node and a *2 node producing x and y; x and y feed a - node and a + node, whose results feed the final * node]
Average parallelism: Pavg = T1 / T∞
For a system with p PEs: Tp ≥ max{ T1/p, T∞ } and Sp ≤ min{ p, T1/T∞ }.
When Pavg >> p, Tp ≈ T1/p and Sp ≈ p.
An application has to have parallelism to get speedup from parallel PEs!

Slide L21-7: Amdahl's Law on Speedup
A program is rarely completely parallelizable. Say a fraction f is parallelizable by a factor of p and the rest is not.
[Figure: the sequential runtime split into a (1 - f) portion and an f portion; in the parallel version the f portion shrinks to f/p]
time_parallel = time_sequential * ( (1 - f) + f/p )
S_parallel = 1 / ( (1 - f) + f/p )
If f is small, p doesn't matter. An architect also cannot ignore the sequential performance of the (1 - f) portion.
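The speedup formula on the Amdahl slide is easy to explore numerically. The short Python sketch below is not part of the original slides, and the f and p values are picked purely for illustration; it evaluates S = 1 / ((1 - f) + f/p) and shows how a small parallel fraction caps the achievable speedup no matter how many PEs are added.

```python
# Amdahl's Law (slide L21-7): S = 1 / ((1 - f) + f / p)
# f = parallelizable fraction of the program, p = number of PEs.
def amdahl_speedup(f: float, p: int) -> float:
    return 1.0 / ((1.0 - f) + f / p)

# Illustrative values only: with f = 0.5, even 1024 PEs give a speedup of barely 2.
for f in (0.5, 0.9, 0.99):
    for p in (4, 64, 1024):
        print(f"f={f:.2f}  p={p:4d}  speedup={amdahl_speedup(f, p):6.2f}")
```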
Slide L21-8: ILP: the parallelism you already know
(Section divider.)

Slide L21-9: Going after IPC
Scalar pipeline with forwarding: operation latency = 1, peak IPC = 1, instruction-level parallelism = 1.
[Figure: an instruction stream flowing through the 5-stage IF/ID/EX/MEM/WB pipeline, one new instruction entering per cycle over cycles 0 through 10]

Slide L21-10: Superscalar Machines
Superscalar (pipelined) execution: operation latency = 1 baseline cycle, peak IPC = N per baseline cycle, ILP = N.
[Figure: the same IF/ID/EX/MEM/WB pipeline with N instructions from the stream entering together each cycle over cycles 0 through 10]
Achieving full performance requires finding N "independent" instructions per cycle.

Slide L21-11: ILP: Instruction-Level Parallelism
ILP is a measure of the amount of inter-dependencies between instructions:
Average ILP = T1 / T∞ = (number of instructions) / (number of cycles required)
code1 (ILP = 1, i.e., must execute serially):
r1 ← r2 + 1
r3 ← r1 / 17
r4 ← r0 - r3
code2 (ILP = 3, i.e., can all execute at the same time):
r1 ← r2 + 1
r3 ← r9 / 17
r4 ← r0 - r10

Slide L21-12: Removing False Dependencies
Anti and output dependencies are false dependencies:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Here the third instruction's write to r3 creates an anti (WAR) dependence with the second instruction's read of r3 and an output (WAW) dependence with the first instruction's write. The dependence is on the register name rather than on the data. Given an infinite number of registers, anti and output dependencies can always be eliminated.

Slide L21-13: Register Renaming: Example
Original (ILP = 1):    Renamed (ILP = 2):
r1 ← r2 / r3           r1 ← r2 / r3
r4 ← r1 * r5           r4 ← r1 * r5
r1 ← r3 + r6           r8 ← r3 + r6
r3 ← r1 - r5           r9 ← r8 - r5

Slide L21-14: Hardware Register Renaming
A rename table maintains bindings from ISA register names (e.g., r12) to rename registers in the rename register file (t0 ... t63, e.g., t56). When issuing an instruction that updates rd:
- allocate an unused rename register tx
- record the binding from rd to tx
When should a binding be removed? When can a rename register be de-allocated? Consider: r1 ← r2 / r3; r4 ← r1 * r5; r1 ← r3 + r6.

Slide L21-15: Out-of-Order Execution
Renaming eliminates WAW and WAR. In a RAW-dependent instruction pair, the reader must still wait for the result from the writer. How to get more ILP?
r1 ← r2 + 1
r3 ← r1 / 17
r4 ← r0 - r3      (this dependent chain alone: ILP = 1)
r11 ← r12 + 1
r13 ← r19 / 17
r14 ← r0 - r20    (all six instructions together, executing the independent ones alongside the chain: ILP = 2)

Slide L21-16: Dataflow Execution Ordering
Maintain a window of many pending instructions (a.k.a. the Issue Buffer) and dispatch instructions out of order:
- find instructions whose operands are available
- give preference to older instructions
- a completing instruction may enable other pending instructions (RAW)
We also need to remember how to put things back in order (the Reorder Buffer).

Slide L21-17: Instruction Reorder Buffer
At today's clock frequency, on a memory load:
- a cache hit (best case) takes 4 to 7 cycles
- an L1 cache miss takes a few tens of cycles
- an off-chip cache miss takes a few hundreds of cycles
The ROB is a program-order instruction bookkeeping structure:
- instructions must enter and leave in program order
- it holds tens to hundreds of "in-flight" instructions in various stages of execution
- it re-sorts all instructions on exit so they appear to complete in program order
- it supports precise exceptions for any in-flight instruction
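To tie together the renaming and dataflow-dispatch ideas on slides L21-12 through L21-16, here is a minimal Python sketch. It is an illustration only, not the lecture's or any real machine's algorithm: it assumes an unbounded pool of rename registers, a one-cycle latency for every operation, and no issue-width limit, and it reproduces the ILP = 2 result for the renamed sequence on slide L21-13.

```python
# Toy register renaming + dataflow dispatch (illustration only).
def average_ilp(instrs):
    # instrs: list of (dest, src1, src2) architectural register names
    rename, next_tag, renamed = {}, 0, []
    for dest, s1, s2 in instrs:
        srcs = tuple(rename.get(s, s) for s in (s1, s2))  # read current bindings
        tag = f"t{next_tag}"
        next_tag += 1
        rename[dest] = tag                                # bind dest to a fresh rename register
        renamed.append((tag, srcs))

    produced, cycles, pending = set(), 0, list(renamed)
    while pending:
        cycles += 1
        # dataflow rule: dispatch anything whose renamed sources are available;
        # sources that are not rename tags come from outside the window and are always ready
        issued = [(d, s) for d, s in pending
                  if all(src in produced or not src.startswith("t") for src in s)]
        produced.update(d for d, _ in issued)
        pending = [ins for ins in pending if ins not in issued]
    return len(instrs) / cycles

# The example from slide L21-13: after renaming, the 4 instructions finish in
# 2 cycles, i.e. average ILP = 2 (versus 1 without renaming).
code = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
        ("r1", "r3", "r6"), ("r3", "r1", "r5")]
print(average_ilp(code))   # -> 2.0
```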
Slide L21-18: Speculative Execution
Multiple levels of branch prediction are needed to fetch hundreds of instructions beyond the commit point. Instructions after a predicted branch are speculative; their effects must be undone in case of a misprediction. Maintain separate copies of:
- In-order state: a checkpoint of the state up to just before the first speculated instruction
- Speculative state: all state changes after the checkpoint, possibly spanning multiple predicted branches
Commit: admit known-to-be-good speculative state changes into the in-order state.
Rewind: discard all, or part of, the speculative state.

Slide L21-19: MIPS R10000
[Block diagram: a pre-decoded I-cache feeding a 4-instruction-wide decode; integer and FP map tables (16R4W); a 16-entry integer issue queue and a 16-entry FP issue queue (the reservation stations); an 8x4-entry Active List (the ROB); 64-entry integer (7R3W) and FP (5R3W) register files; and the function units ALU1, ALU2, and LD/ST on the integer side and ALU1 and ALU2 on the FP side]
Read [Yeager 1996, IEEE Micro] if you are really interested.

Slide L21-20: State of the Art

                           AMD        Intel      Intel      IBM       IBM        Fujitsu    Sun
                           Opteron    Xeon       Itanium    P5        P6         SPARC 7    T2
                           8360SE     X7460      9050
Cores x threads            4x1        6x1        2x2        2x2       2x2        4x2        8x8
Clock (GHz)                2.5        2.67       1.60       2.2       5          2.52       1.8
Issue rate                 3 (x86)    4 (rop)    6          5         7          4          2
Pipeline depth             12/17      14         8          15        13         15         8/12
Out-of-order window        72 (rop)   96 (rop)   in-order   200       limited    64         in-order
On-chip cache (MB)         2+2        9+16       1+12       1.92      8          6          4
Transistors (millions)     463        1900       1720       276       790        600        503
Power (W)                  105        130        104        100       >100       135        95
SPECint 2006 (core/total)  14.4/170   22/274     14.5/1534  10.5/197  15.8/1837  10.5/2088  --/142
SPECfp 2006 (core/total)   18.5/156   22/142     17.3/1671  12.9/229  20.1/1822  25.0/1861  --/111

Source: Microprocessor Report, Oct 2008.
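Returning to the speculative-execution slide (L21-18), the separation between in-order and speculative state can be made concrete with a small sketch. This is only an illustration of the commit/rewind idea under simplifying assumptions (a single checkpoint, register state only, invented register names and values), not how any real processor implements it.

```python
# Minimal sketch of the commit/rewind bookkeeping described on slide L21-18.
class SpeculativeState:
    def __init__(self, init_regs):
        self.in_order = dict(init_regs)  # checkpointed, known-good state
        self.spec = {}                   # changes made past the checkpoint

    def read(self, reg):
        # speculative updates shadow the in-order values
        return self.spec.get(reg, self.in_order[reg])

    def write(self, reg, value):
        # every write after a predicted branch lands in the speculative state
        self.spec[reg] = value

    def commit(self):
        # prediction confirmed: fold speculative changes into the in-order state
        self.in_order.update(self.spec)
        self.spec.clear()

    def rewind(self):
        # misprediction: discard speculative changes; the in-order state is intact
        self.spec.clear()

# Invented registers and values, purely for illustration.
s = SpeculativeState({"r1": 5, "r2": 7})
s.write("r1", s.read("r2") + 1)   # speculative r1 = 8
s.rewind()                        # the branch was mispredicted
print(s.read("r1"))               # -> 5, the pre-speculation value
```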