Scalability Study in Parallel Computing Mark Alan Fienup Iowa State University

Total Pages: 16

File Type: PDF, Size: 1020 KB

Scalability study in parallel computing
Mark Alan Fienup, Iowa State University
Retrospective Theses and Dissertations, Iowa State University Capstones, Theses and Dissertations, 1995

Follow this and additional works at: https://lib.dr.iastate.edu/rtd
Part of the Computer Sciences Commons

Recommended Citation
Fienup, Mark Alan, "Scalability study in parallel computing" (1995). Retrospective Theses and Dissertations. 10900. https://lib.dr.iastate.edu/rtd/10900

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Scalability study in parallel computing

by

Mark Alan Fienup

A Thesis Submitted to the Graduate Faculty in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY

Department: Computer Science
Major: Computer Science

Signature was redacted for privacy. In Charge of Major Work
Signature was redacted for privacy.
Signature was redacted for privacy. For the Graduate College

Iowa State University
Ames, Iowa
1995

UMI Microform 9531736. Copyright 1995 by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

To my wife, Virginia, and daughter, Ann.

TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION 1
CHAPTER 2. LITERATURE REVIEW 7
CHAPTER 3. PARALLEL ALGORITHMS 13
CHAPTER 4. ASYMPTOTIC ANALYSIS OF SCALABILITY 26
CHAPTER 5. MASPAR IMPLEMENTATIONS 51
CHAPTER 6. SCALABILITY EXPERIMENTS 62
CHAPTER 7. ACCURACY OF ASYMPTOTIC SCALABILITY METRICS IN PRACTICE 69
CHAPTER 8. CONCLUSIONS 88
REFERENCES 91
ACKNOWLEDGEMENTS 94
APPENDIX A. ADDITIONAL TABLES 95
APPENDIX B. MASPAR MPL CODES 111

CHAPTER 1. INTRODUCTION
1.1. Motivation

One of the major goals of parallel computing is to decrease the execution time of a computing task. For sequential programs, there are often several algorithms for solving a task, but usually a simple time-complexity analysis using "big-oh" notation suffices in determining the better algorithm. When trying to solve a task on a parallel computer, several complicating factors arise: (1) "How should the problem be decomposed on the processors?", (2) "How does the communication network's topology impact performance?", (3) "If I run my problem using more processors, how much improvement can I expect?", (4) "Given a choice of parallel computers to run my task, which should I choose?", and (5) "If I want to solve a bigger problem using more processors, what kind of performance can I expect?".

To help answer many of these questions, several scalability metrics have been proposed. These metrics differ in the scaling assumptions that they make. For example, Amdahl's law [2] assumes that the problem size is fixed as the number of processors is increased, the isoefficiency function [8] keeps the parallel computer's efficiency fixed by allowing the problem size to grow with the number of processors, and in [23] scaling to control the simulation error in scientific applications is discussed. In this thesis, a new scalability metric, called CMP-scalability (Constant-Memory-per-Processor) [5][6], is proposed that assumes that the memory used per processor is constant as the problem size grows.

As you might expect, the CMP-scalability metric is tailored to answering question (5) above: "If I want to solve a bigger problem using more processors, what kind of performance can I expect?". Specifically, the CMP-scalability function describes the asymptotic rate of growth of the speedup function under the constant-memory-per-processor scaling assumption. This seems like a natural question to ask, as parallel-computer users often want to run larger problems. Since the memory available on each processor is fixed, more processors are required to store the larger problem and execute it in a shorter amount of time. Thus, CMP-scalability seems to be based on a reasonable scaling assumption.

Parallel implementations of Cannon's matrix multiplication (MM) [4], Gauss-Jordan Elimination with partial pivoting (GJE), and Fast Fourier Transform (FFT) [6] on the MasPar architecture (a SIMD machine with a two-dimensional array of processors) were performed to evaluate the CMP-scalability metric. These algorithms were chosen because of their widely varying computation-to-communication ratios. The CMP-scalability metric predicted that the speedup of matrix multiplication would grow linearly, O(P), with the number of processors, the speedup of Gauss-Jordan Elimination would grow as O(√P) with the number of processors, and the speedup of Fast Fourier Transform would grow as O(√P log₂ P) with the number of processors. It was somewhat surprising that the communication-intensive Fast Fourier Transform was predicted to outperform the computationally more intensive Gauss-Jordan Elimination.

Experimental studies on a 16K-processor MasPar MP-1 and a 4K-processor MP-2 did not seem to behave as predicted by the CMP-scalability metric. The speedup plots, under constant-memory-per-processor scaling, for all three algorithms appeared to be roughly linear with slopes of 0.89, 0.76, and 0.62 for matrix multiplication, Gauss-Jordan Elimination, and Fast Fourier Transform, respectively.
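As a rough numerical illustration of those predicted growth rates (a sketch of my own, not taken from the thesis; constant factors and machine parameters are omitted), the three asymptotic speedup functions can be tabulated against the processor count P:

    import math

    # Predicted asymptotic speedup growth under constant-memory-per-processor
    # (CMP) scaling, up to constant factors, for the three algorithms.
    def predicted_speedup_growth(P):
        return {
            "Cannon matrix multiplication": P,                            # O(P)
            "Gauss-Jordan Elimination":     math.sqrt(P),                 # O(sqrt(P))
            "Fast Fourier Transform":       math.sqrt(P) * math.log2(P),  # O(sqrt(P) log P)
        }

    for P in (256, 1024, 4096, 16384):
        growth = predicted_speedup_growth(P)
        print(P, {name: round(value, 1) for name, value in growth.items()})

In this asymptotic view the FFT curve eventually overtakes the Gauss-Jordan Elimination curve, which is exactly the ordering the experiments did not reproduce.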
At this point, the formulas describing the execution time of each algorithm were scrutinized, refined, and experimentally verified, since these formulas were used in the CMP-scalability analysis. While some small errors in the formulas were discovered, the CMP-scalability analysis was not affected. To help explain the experimental results, other scalability metrics, especially the isoefficiency function, were used to analyze the chosen algorithms. The isoefficiency analysis of the three algorithms on a mesh architecture predicted that matrix multiplication would scale better than Gauss-Jordan Elimination, which would scale much better than Fast Fourier Transform. Surprisingly, the predicted relative order of the scalability of the three algorithms was different for the isoefficiency metric than for the CMP-scalability metric. While the relative scalability predicted by the isoefficiency function matched the experimental study, the magnitude of the difference between Gauss-Jordan Elimination and Fast Fourier Transform was not experimentally observed.

1.2. Questions to be Answered in the Thesis

The following questions will be answered by the thesis:
1) How can the CMP and isoefficiency-scalability metrics predict different relative scalability for the Gauss-Jordan Elimination and Fast Fourier Transform algorithms?
2) Why don't the experimental results on the MasPar computers agree with the CMP and isoefficiency-scalability analyses?
3) Would the predicted scalability analysis be observed on a different parallel computer with different machine parameters?

1.3. Organization of Thesis and Summary of Results

The thesis starts in Chapter 2 with a review of the relevant literature to provide necessary background information. Scaling alternatives and other scalability metrics are reviewed. In light of the questions to be answered by the thesis, particular emphasis will be placed on constant-memory-per-processor scaling and the isoefficiency-scalability metric. Chapter 3 discusses the parallel algorithms for Cannon's matrix multiplication, Gauss-Jordan Elimination, and Fast Fourier Transform on a two-dimensional mesh and on a hypercube parallel computer. For each algorithm, execution-time formulas are developed. These formulas are used in Chapter 4 to demonstrate how the asymptotic CMP and isoefficiency-scalability metrics can be calculated with a minimum amount of work. The apparently conflicting scalability results between the CMP and isoefficiency-scalability metrics for Gauss-Jordan Elimination and Fast Fourier Transform on a mesh are identified. At this point question …
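For reference, the standard definitions behind the isoefficiency analysis mentioned above (written in the usual Grama/Kumar-style notation, which may differ from the thesis's own symbols) are:

    % Efficiency of a parallel system with serial work W and overhead T_O(W,P):
    E = \frac{S}{P} = \frac{T_1}{P\,T_P} = \frac{W}{W + T_O(W,P)}
    % Holding E fixed as P grows gives the isoefficiency relation
    W = K\,T_O(W,P), \qquad K = \frac{E}{1-E}
    % The isoefficiency function is the rate at which W must grow with P to keep
    % efficiency constant: slower required growth means better scalability.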
Recommended publications
  • Parallel Prefix Sum (Scan) with CUDA
    Parallel Prefix Sum (Scan) with CUDA. Mark Harris, [email protected], April 2007. Document Change History: February 14, 2007 — Mark Harris — Initial release. Abstract: Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads. Table of Contents: Abstract; Introduction; Inclusive and Exclusive Scan; Sequential Scan; A Naïve Parallel Scan; A Work-Efficient … (a short illustrative sketch of scan follows this entry)
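    A minimal Python sketch of the operations described in that document's abstract (my own illustration, not the NVIDIA CUDA code): inclusive and exclusive prefix sums, plus the O(n log n) "naïve" doubling pattern that the work-efficient version improves on.

        def inclusive_scan(a):
            out, total = [], 0
            for x in a:
                total += x
                out.append(total)
            return out

        def exclusive_scan(a):
            out, total = [], 0
            for x in a:
                out.append(total)
                total += x
            return out

        def naive_scan(a):
            # Hillis-Steele doubling pattern: log2(n) sweeps; on a GPU each sweep
            # would be performed by all threads of a block in parallel.
            x, n, d = list(a), len(a), 1
            while d < n:
                x = [x[i] + (x[i - d] if i >= d else 0) for i in range(n)]
                d *= 2
            return x  # inclusive result

        assert inclusive_scan([3, 1, 7, 0]) == [3, 4, 11, 11]
        assert exclusive_scan([3, 1, 7, 0]) == [0, 3, 4, 11]
        assert naive_scan([3, 1, 7, 0]) == [3, 4, 11, 11]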
  • Distributed Algorithms with Theoretic Scalability Analysis of Radial and Looped Load flows for Power Distribution Systems
    Electric Power Systems Research 65 (2003) 169–177, www.elsevier.com/locate/epsr. Distributed algorithms with theoretic scalability analysis of radial and looped load flows for power distribution systems. Fangxing Li*, Robert P. Broadwater, ECE Department, Virginia Tech, Blacksburg, VA 24060, USA. Received 15 April 2002; received in revised form 14 December 2002; accepted 16 December 2002. Abstract: This paper presents distributed algorithms for both radial and looped load flows for unbalanced, multi-phase power distribution systems. The distributed algorithms are developed from tree-based sequential algorithms. Formulas of scalability for the distributed algorithms are presented. It is shown that computation time dominates communication time in the distributed computing model. This provides benefits to real-time load flow calculations, network reconfigurations, and optimization studies that rely on load flow calculations. Also, test results match the predictions of derived formulas. This shows the formulas can be used to predict the computation time when additional processors are involved. © 2003 Elsevier Science B.V. All rights reserved. Keywords: Distributed computing; Scalability analysis; Radial load flow; Looped load flow; Power distribution systems. 1. Introduction: Parallel and distributed computing has been applied to many scientific and engineering computations such as weather forecasting and nuclear simulations [1,2]. It also has been applied to power system analysis calculations [3–14]. … Also, the method presented in Ref. [10] was tested in radial distribution systems with no more than 528 buses. More recent works [11–14] presented distributed implementations for power flows or power-flow-based algorithms like optimizations and contingency analysis. These works also targeted power transmission systems. (A small illustrative sketch of such a scalability formula follows this entry.)
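    The paper's actual scalability formulas are not reproduced in this excerpt; the following is a generic, hypothetical cost model (my assumption, not the authors') showing why speedup stays near-linear when computation time dominates communication time:

        # Hypothetical model: work divides evenly across p processors, plus a
        # small per-processor coordination/communication cost.
        def total_time(p, t_comp=100.0, t_comm_per_proc=0.05):
            return t_comp / p + t_comm_per_proc * p

        for p in (1, 2, 4, 8, 16):
            t = total_time(p)
            print(f"p={p:2d}  time={t:7.3f}  speedup={total_time(1) / t:5.2f}")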
  • Scalability and Performance Management of Internet Applications in the Cloud
    Hasso-Plattner-Institut, University of Potsdam, Internet Technology and Systems Group. Scalability and Performance Management of Internet Applications in the Cloud. A thesis submitted for the degree of "Doktors der Ingenieurwissenschaften" (Dr.-Ing.) in IT Systems Engineering, Faculty of Mathematics and Natural Sciences, University of Potsdam. By: Wesam Dawoud. Supervisor: Prof. Dr. Christoph Meinel. Potsdam, Germany, March 2013. This work is licensed under a Creative Commons License: Attribution – Noncommercial – No Derivative Works 3.0 Germany. To view a copy of this license visit http://creativecommons.org/licenses/by-nc-nd/3.0/de/ Published online at the Institutional Repository of the University of Potsdam: URL http://opus.kobv.de/ubp/volltexte/2013/6818/ URN urn:nbn:de:kobv:517-opus-68187 http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-68187. To my lovely parents. To my lovely wife Safaa. To my lovely kids Shatha and Yazan. Acknowledgements: At Hasso Plattner Institute (HPI), I had the opportunity to meet many wonderful people. It is my pleasure to thank those who supported me to make this thesis possible. First and foremost, I would like to thank my Ph.D. supervisor, Prof. Dr. Christoph Meinel, for his continuous support. In spite of his tight schedule, he always found the time to discuss, guide, and motivate my research ideas. The thanks are extended to Dr. Karin-Irene Eiermann for assisting me even before moving to Germany. I am also grateful to Michaela Schmitz. She managed everything well to make everyone's life easier. I owe thanks to Dr. Nemeth Sharon for helping me to improve my English writing skills.
  • Scalability and Optimization Strategies for GPU Enhanced Neural Networks (Genn)
    Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN). Naresh Balaji (National Institute of Technology, Tiruchirappalli, India), Esin Yavuz and Thomas Nowotny (School of Engineering and Informatics, University of Sussex, Brighton, UK); {T.Nowotny, E.Yavuz}@sussex.ac.uk, [email protected]. Abstract: Simulation of spiking neural networks has traditionally been done on high-performance supercomputers or large-scale clusters. Utilizing the parallel nature of neural network computation algorithms, GeNN (GPU Enhanced Neural Network) provides a simulation environment that runs on general-purpose NVIDIA GPUs with a code-generation-based approach. GeNN allows users to design and simulate neural networks by specifying the populations of neurons at different stages, their synapse connection densities, and the model of individual neurons. In this report we describe work on how to scale synaptic weights based on the configuration of the user-defined network to ensure sufficient spiking and subsequent effective learning. We also discuss optimization strategies particular to GPU computing: sparse representation of synapse connections and occupancy-based block-size determination. Keywords: Spiking Neural Network, GPGPU, CUDA, code-generation, occupancy, optimization. 1. Introduction: The computational performance of traditional single-core CPUs has steadily increased since their arrival, primarily due to the combination of increased clock frequency, process technology advancements, and compiler optimizations. The clock frequency was pushed higher and higher, from 5 MHz to 3 GHz in the years from 1983 to 2002, until it reached a stall a few years ago [1], when the power consumption and dissipation of the transistors reached peaks that could not be compensated by their cost [2]. (A small illustrative sketch of weight scaling follows this entry.)
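    GeNN's actual weight-scaling rule is not shown in this excerpt; the sketch below is a hypothetical illustration of the general idea of scaling synaptic weights with the expected fan-in so that each neuron's total input drive stays roughly constant as the network grows:

        # Hypothetical rule (not GeNN's): divide a base weight by the expected
        # number of incoming connections per neuron.
        def scaled_weight(base_weight, n_presynaptic, connection_prob):
            expected_fan_in = n_presynaptic * connection_prob
            return base_weight / max(expected_fan_in, 1.0)

        print(scaled_weight(0.5, n_presynaptic=1_000, connection_prob=0.1))   # 0.005
        print(scaled_weight(0.5, n_presynaptic=10_000, connection_prob=0.1))  # 0.0005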
  • CSE373: Data Structures & Algorithms Lecture 26
    CSE373: Data Structures & Algorithms, Lecture 26: Parallel Reductions, Maps, and Algorithm Analysis. Aaron Bauer, Winter 2014. Outline — Done: how to write a parallel algorithm with fork and join; why using divide-and-conquer with lots of small tasks is best (it combines results in parallel, assuming the library can handle "lots of small threads"). Now: more examples of simple parallel programs that fit the "map" or "reduce" patterns; teaser: beyond maps and reductions; asymptotic analysis for fork-join parallelism; Amdahl's Law; final exam and victory lap. What else looks like this? We saw that summing an array went from O(n) sequential to O(log n) parallel (assuming a lot of processors and very large n), an exponential speed-up in theory (n / log n grows exponentially). Anything that can use results from two halves and merge them in O(1) time has the same property. Examples: maximum or minimum element; is there an element satisfying some property (e.g., is there a 17)?; left-most element satisfying some property (e.g., first 17) — what should the recursive tasks return, and how should we merge the results?; corners of a rectangle containing all points (a "bounding box"); counts, for example the number of strings that start with a vowel — this is just summing with a different base case, and many problems are. Computations of this form are called reductions. (A small divide-and-conquer reduction sketch follows this entry.)
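    A minimal sketch in that divide-and-conquer spirit (my own Python illustration, not the lecture's fork-join code): each half could be handed to a separate task, and the O(1) merge keeps the span at O(log n):

        def reduce_max(a, lo, hi):
            # Reduce a[lo:hi] to its maximum by splitting in half.
            if hi - lo == 1:
                return a[lo]
            mid = (lo + hi) // 2
            left = reduce_max(a, lo, mid)    # in a fork-join library, forked as a new task
            right = reduce_max(a, mid, hi)   # computed by the current task
            return max(left, right)          # O(1) merge

        print(reduce_max([5, 17, 2, 9, 11, 3], 0, 6))  # 17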
  • CSE 613: Parallel Programming Lecture 2
    CSE 613: Parallel Programming, Lecture 2 (Analytical Modeling of Parallel Algorithms). Rezaul A. Chowdhury, Department of Computer Science, SUNY Stony Brook, Spring 2019. (Figures sourced from "Introduction to Parallel Computing", 2nd Edition, Grama et al.) Parallel Execution Time & Overhead: parallel running time on p processing elements, T_P = t_end − t_start, where t_start = starting time of the processing element that starts first and t_end = termination time of the processing element that finishes last. Sources of overhead (w.r.t. serial execution): interprocess interaction (interacting and communicating data, e.g., intermediate results); idling (due to load imbalance, synchronization, presence of serial computation, etc.); excess computation (the fastest serial algorithm may be difficult or impossible to parallelize). Overhead function or total parallel overhead, T_O = p T_P − T, where p = number of processing elements and T = time spent doing useful work (often the execution time of the fastest serial algorithm). Speedup: let T_p = running time using p identical processing elements; speedup S_p = T_1 / T_p. Theoretically, S_p ≤ p (why?). Perfect or linear or ideal speedup if S_p = p. Consider adding n numbers using n identical processing elements. Serial runtime, … (a short numeric illustration of these definitions follows this entry)
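    A quick numeric illustration of the reconstructed definitions above (the numbers are made up for the example):

        # Overhead T_O = p*T_p - T_serial and speedup S_p = T_serial / T_p.
        def overhead(p, T_p, T_serial):
            return p * T_p - T_serial

        def speedup(T_serial, T_p):
            return T_serial / T_p

        T_serial, p, T_p = 100.0, 8, 16.0
        print(speedup(T_serial, T_p))      # 6.25 (ideal would be 8)
        print(overhead(p, T_p, T_serial))  # 28.0 time units of overhead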
  • A Review of Multicore Processors with Parallel Programming
    International Journal of Engineering Technology, Management and Applied Sciences, www.ijetmas.com, September 2015, Volume 3, Issue 9, ISSN 2349-4476. A Review of Multicore Processors with Parallel Programming. Anchal Thakur (Research Scholar, CSE Department, L.R Institute of Engineering and Technology, Solan, India), Ravinder Thakur (Assistant Professor, CSE Department, L.R Institute of Engineering and Technology, Solan, India). ABSTRACT: When computers were first introduced to the market, they came with single processors, which limited their performance and efficiency. The classic way of overcoming the performance issue was to use bigger processors for executing the data at higher speed. Big processors did improve performance to a certain extent, but they consumed a lot of power, which started overheating the internal circuits. To achieve efficiency and speed simultaneously, CPU architectures developed multicore processor units in which two or more processors are used to execute a task. The multicore technology offered better response time while running big applications, better power management, and faster execution time. Multicore processors also gave developers an opportunity to use parallel programming to execute tasks in parallel. These days parallel programming is used to execute a task by distributing it into smaller instructions and executing them on different cores. By using parallel programming, the complex tasks that are carried out in a multicore environment can be executed with higher efficiency and performance. Keywords: Multicore Processing, Multicore Utilization, Parallel Processing. INTRODUCTION: From the day computers were invented, great importance has been given to their efficiency in executing tasks.
  • Parallel Algorithms and Parallel Program Design
    Introduction to Parallel Algorithms and Parallel Program Design. Parallel Computing CIS 410/510, Department of Computer and Information Science, Lecture 12 – Introduction to Parallel Algorithms. Methodological design (I. Foster, "Designing and Building Parallel Programs," Addison-Wesley, 1995; the book is online, see the webpage): Partition (task/data decomposition); Communication (task execution coordination); Agglomeration (evaluation of the structure); Mapping (resource assignment). Partitioning: the partitioning stage is intended to expose opportunities for parallel execution. Focus on defining a large number of small tasks to yield a fine-grained decomposition of the problem. A good partition divides into small pieces both the computational tasks associated with a problem and the data on which the tasks operate. Domain decomposition focuses on the computation's data; functional decomposition focuses on computation tasks; mixing domain and functional decomposition is possible. Domain and Functional Decomposition: domain decomposition of a 2D/3D grid; functional decomposition of a climate model. Partitioning checklist: Does your partition define at least an order of magnitude more tasks than there are processors in your target computer? If not, you may lose design flexibility. Does your partition avoid redundant computation and storage requirements? If not, it may not be scalable. Are tasks of comparable size? If not, it may be hard to allocate each processor equal amounts of work. (A small domain-decomposition sketch follows this entry.)
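    A minimal sketch of the domain-decomposition idea (my own illustration, not Foster's code): split the rows of an N x N grid into roughly equal blocks, one block per task:

        def row_blocks(n_rows, n_tasks):
            # Distribute n_rows as evenly as possible over n_tasks row blocks.
            base, extra = divmod(n_rows, n_tasks)
            blocks, start = [], 0
            for t in range(n_tasks):
                size = base + (1 if t < extra else 0)
                blocks.append(range(start, start + size))
                start += size
            return blocks

        for task, rows in enumerate(row_blocks(10, 4)):
            print(f"task {task}: rows {list(rows)}")   # 3, 3, 2, 2 rows per task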
  • Parallel System Performance: Evaluation & Scalability
    Parallel System Performance: Evaluation & Scalability (EECC756 – Shaaban, Spring 2013). Factors affecting parallel system performance: algorithm-related, parallel-program-related, architecture/hardware-related. Workload-driven quantitative architectural evaluation: select applications or a suite of benchmarks to evaluate the architecture either on a real or a simulated machine; from measured performance results compute performance metrics: speedup, system efficiency, redundancy, utilization, quality of parallelism. Resource-oriented workload scaling models describe how the speedup of a parallel computation is affected subject to specific constraints: (1) problem constrained (PC): fixed-load model; (2) time constrained (TC): fixed-time model; (3) memory constrained (MC): fixed-memory model. Parallel performance scalability: for a given parallel system and a given parallel computation/problem/algorithm — definition (informally, the ability of parallel system performance to increase with increased problem size and system size), conditions of scalability, and factors affecting scalability. Parallel program performance: the parallel processing goal is to maximize speedup, Speedup = Time(1) / Time(p) ≤ Sequential Work / Max over processors (Work + Synch Wait Time + Comm Cost + Extra Work), for fixed problem size; the denominator terms beyond useful work are the parallelizing overheads, reduced by balancing computations and overheads (workload) across processors. (A small sketch of the fixed-load vs. fixed-memory scaling models follows this entry.)
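    A toy comparison of two of those scaling models (my own cost model, not the handout's): under problem-constrained scaling the workload W stays fixed, while under memory-constrained scaling W grows linearly with p, which keeps speedup close to linear:

        import math

        def speedup(W, p, c=1.0):
            # Toy model: perfectly divisible work W plus an overhead term c*sqrt(p).
            T1 = W
            Tp = W / p + c * math.sqrt(p)
            return T1 / Tp

        W0 = 10_000
        for p in (16, 64, 256):
            fixed_load   = speedup(W0, p)       # problem constrained: W fixed
            fixed_memory = speedup(W0 * p, p)   # memory constrained: W grows with p
            print(p, round(fixed_load, 1), round(fixed_memory, 1))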
  • Multiprocessing and Scalability
    Multiprocessing and Scalability. A.R. Hurson, Computer Science and Engineering, The Pennsylvania State University. Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor systems. However, due to a number of difficult problems, the potential of these machines has been difficult to realize. This is because of: advances in technology (the rate of increase in performance of uniprocessors); the complexity of multiprocessor system design (which drastically affected the cost and implementation cycle); and the programmability of multiprocessor systems (the design complexity of parallel algorithms and parallel programs). Programming a parallel machine is more difficult than programming a sequential one. In addition, it takes much more effort to port an existing sequential program to a parallel machine than to a newly developed sequential machine. The lack of good parallel programming environments and standard parallel languages has further aggravated this issue. As a result, the absolute performance of many early concurrent machines was not significantly better than available or soon-to-be-available uniprocessors. Recently, there has been an increased interest in large-scale or massively parallel processing systems. This interest stems from many factors, including advances in integrated technology. Very …
  • Pascal Viewer: a Tool for the Visualization of Parallel Scalability Trends
    PaScal Viewer: a Tool for the Visualization of Parallel Scalability Trends. Anderson B. N. da Silva (Pró-Reitoria de Pesquisa, Inovação e Pós-Graduação, Instituto Federal da Paraíba, João Pessoa, Brazil), Daniel A. M. Cunha (Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil), Vitor R. G. Silva (Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil), Alex F. de A. Furtunato (Diretoria de Tecnologia da Informação, Instituto Federal do Rio Grande do Norte, Natal, Brazil), Samuel Xavier de Souza (Dept. de Eng. de Comp. e Automação, Univ. Federal do Rio Grande do Norte, Natal, Brazil). Abstract — Taking advantage of the growing number of cores in supercomputers to increase the scalability of parallel programs is an increasing challenge. Many advanced profiling tools have been developed to assist programmers in the process of analyzing data related to the execution of their program. Programmers can act upon the information generated by these data and make their programs reach higher performance levels. However, the information provided by profiling tools is generally designed … particular environment. The configuration of this environment includes the number of cores, their operating frequency, and the size of the input data or the problem size. Among the various collected information, the elapsed time in each function of the code, the number of function calls, and the memory consumption of the program can be cited, to name just a few [2]. (A small illustrative sketch of an efficiency grid follows this entry.)
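    A minimal sketch of the kind of data such a visualization tool works from (my own illustration with made-up measurements, not the PaScal Viewer code): an efficiency value E(p, n) = T(1, n) / (p · T(p, n)) for each combination of core count p and problem size n:

        def efficiency_grid(times):
            # times[(p, n)] = measured run time in seconds (hypothetical data).
            return {(p, n): times[(1, n)] / (p * t)
                    for (p, n), t in times.items() if p != 1}

        measurements = {(1, 1000): 8.0, (2, 1000): 4.4, (4, 1000): 2.6,
                        (1, 4000): 32.0, (2, 4000): 16.8, (4, 4000): 9.1}
        for (p, n), e in sorted(efficiency_grid(measurements).items()):
            print(f"p={p}, n={n}: efficiency={e:.2f}")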
  • Computer Architecture: Parallel Processing Basics
    Computer Architecture: Parallel Processing Basics. Onur Mutlu & Seth Copen Goldstein, Carnegie Mellon University, 9/9/13. Today: What is parallel processing? Why? Kinds of parallel processing; multiprocessing and multithreading; measuring success (speedup, Amdahl's Law); bottlenecks to parallelism. Concurrent systems: embedded-physical distributed (sensor networks, Claytronics); geographically distributed (power grid, Internet); cloud computing (EC2, Tashi); parallel. A rough comparison across these four classes (Physical / Geographical / Cloud / Parallel): geophysical location +++ / ++ / --- / ---; relative location +++ / +++ / + / -; faults ++++ / +++ / ++++ / --; number of processors +++ / +++ / + / -; network structure varies / varies / fixed / fixed; network connectivity --- / --- / + / +. Concurrent system challenge: programming. The old joke: how long does it take to write a parallel program? One graduate-student year. Parallel programming again?? Increased demand (multicore), increased scale (cloud), improved compute/communicate ratios, and a change in application focus (irregular and recursive data structures). Why parallel computers? Parallelism: doing multiple things at a time. Things: instructions, … (a short Amdahl's Law illustration follows this entry)
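    Amdahl's Law, as referenced in those slides, bounds speedup by the serial fraction of the program; a short Python illustration:

        def amdahl_speedup(serial_fraction, p):
            # Upper bound on speedup with p processors when a fixed fraction is serial.
            return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

        for p in (2, 8, 64, 1024):
            print(p, round(amdahl_speedup(0.05, p), 2))   # 5% serial work caps speedup near 20x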