A Distributed Memory Parallel Fourth-Order IADEMF Algorithm

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 9, 2019

Noreliza Abu Mansor1, Ahmad Kamal Zulkifle3, Norma Alias2, Mohammad Khatim Hasan4

College of Engineering, Universiti Tenaga Nasional, Selangor, Malaysia
Ibnu Sina Institute of Fundamental Science Studies, Universiti Teknologi Malaysia, Johor, Malaysia
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Selangor, Malaysia

Abstract: The fourth-order finite difference Iterative Alternating Decomposition Explicit Method of Mitchell and Fairweather (IADEMF4) sequential algorithm has demonstrated its ability to perform with high accuracy and efficiency for the solution of a one-dimensional heat equation with Dirichlet boundary conditions. This paper develops the parallelization of the IADEMF4 by applying the Red-Black (RB) ordering technique. The proposed IADEMF4-RB is implemented on a multiprocessor distributed memory architecture based on the Parallel Virtual Machine (PVM) environment with the Linux operating system. Numerical results show that the IADEMF4-RB accelerates the convergence rate and largely improves the serial time of the IADEMF4. In terms of parallel performance evaluations, the IADEMF4-RB significantly outperforms its second-order counterpart (IADEMF2-RB), as well as the benchmarked fourth-order classical iterative RB methods, namely the Gauss-Seidel (GS4-RB) and the Successive Over-relaxation (SOR4-RB) methods.

Keywords: Fourth-order method; finite difference; red-black ordering; distributed memory architecture; parallel performance evaluations

I. INTRODUCTION

The heat equation is a mathematical model that describes heat conduction processes of a physical system. Sahimi et al. [1] proposed a finite difference scheme known as the Iterative Alternating Decomposition Explicit (IADE) method to approximate the solution of a one-dimensional heat equation with Dirichlet boundary conditions. The IADE scheme employs the fractional splitting of the Mitchell and Fairweather (MF) variant, whose accuracy is of the order $O((\Delta t)^2 + (\Delta x)^4)$. The scheme, commonly abbreviated as the IADEMF, is developed by applying second-order spatial accuracy to the heat equation. For this reason, in this paper the IADEMF will also be referred to as the IADEMF2. It is a two-stage iterative procedure and has been proven to have merit in terms of convergence, stability and accuracy. It is generally found to be more accurate than the classical Alternating Group Explicit class of methods [2].

Several later studies have built on the IADE method. Sahimi et al. [3, 4] developed new second-order IADE methods using different variants such as the D'Yakonov variant (IADEDY) and the Mitchell-Griffith variant (IADEMG). Each variant is of the order $O((\Delta t)^2 + (\Delta x)^4)$. The studies showed that the accuracies of the IADEDY and the IADEMG are comparable to that of the IADEMF. Alias [5] studied the parallel implementation of the IADEMF on distributed parallel computing using the parallel virtual machine. A fragmented numerical algorithm of the IADEMF method was designed by Alias [6] in terms of a data-flow graph, and its parallel implementation was then executed using the LuNA programming system. Sulaiman et al. [7, 8] proposed the half-sweep and the quarter-sweep IADEMF methods respectively, for the purpose of achieving a better convergence rate and faster execution time than the corresponding full-sweep method. Alias [9] implemented the Interpolation Conjugate Gradient method to improve the parallel performance of the IADEMF. Shariffudin et al. [10] presented the parallel implementation of the IADEDY for solving a two-dimensional heat equation on a distributed system of the Geranium Cadcam Cluster (GCC) using the Message Passing Interface.

A recent study by Mansor [11] involved the development of a convergent and unconditionally stable fourth-order IADEMF sequential algorithm (IADEMF4). The proposed scheme is found to be capable of enhancing the accuracy of the original corresponding second-order method, that is, the IADEMF2. The IADEMF4 appears to be more accurate, more efficient and to have a better rate of convergence than the benchmarked fourth-order classical iterative methods, namely the Gauss-Seidel (GS4) and the successive over-relaxation (SOR4) methods. However, the IADEMF4 may be too slow to be practical, especially when the problem involves larger linear systems of equations. It is thus justified to consider parallel computing to speed up the execution time without compromising accuracy. The algorithm has explicit features which add to its advantage, so it can be fully utilized for parallelization.

This paper attempts to parallelize the IADEMF4, by applying the Red-Black (RB) ordering technique, for solving large sparse linear systems that arise from the discretization of the one-dimensional heat equation with Dirichlet boundary conditions. It aims to effectively implement the IADEMF4-RB on parallel computers, with improved performance over its serial counterpart. The IADEMF4-RB, which has high computational complexity, will be implemented on a multiprocessor distributed memory architecture based on the Parallel Virtual Machine (PVM) environment with the Linux operating system.

This paper is outlined as follows. Section II recalls the formulation of the IADEMF4 scheme. Section III presents the development of the IADEMF4-RB parallel strategy. The computational complexity of the RB methods considered in this paper is given in Section IV. Section V shows the numerical experiment conducted in this study. The results and the parallel performance of the methods under consideration are discussed in Section VI. At the end of this paper is the conclusion.

II. FORMULATION OF THE IADEMF4 (AN OVERVIEW)

In this section, the development of the IADEMF4 algorithm [11] is briefly reviewed. Consider the one-dimensional heat equation (1), which models the flow of heat in a homogeneous unchanging medium of finite extent, in the absence of a heat source,

$$\frac{\partial U}{\partial t} = \frac{\partial^2 U}{\partial x^2} \qquad (1)$$

subject to given initial and Dirichlet boundary conditions

$$U(x,0) = f(x), \quad 0 \le x \le 1$$
$$U(0,t) = g(t), \quad 0 < t \le T$$
$$U(1,t) = h(t), \quad 0 < t \le T \qquad (2)$$

Based on the finite difference approach, the time-space domain is discretized by using a set of lines parallel to the $t$ axis given by $x_i = i\,\Delta x$, $i = 0, 1, \ldots, m, m+1$, and a set of lines parallel to the $x$ axis given by $t_k = k\,\Delta t$, $k = 0, 1, \ldots, n, n+1$. The grid spacings have uniform size, that is, $\Delta x = 1/(m+1)$ and $\Delta t = T/(n+1)$. At a grid point $P(x_i, t_k)$ in the solution domain, the dependent variable $U(x,t)$, which represents the non-dimensional temperature at time $t$ and at position $x$, is approximated by $u_i^k$.

The IADEMF4 is developed by firstly executing the unconditionally stable fourth-order Crank-Nicolson approximation (3) on the heat equation [12].

$$\frac{1}{\Delta t}\left(u_i^{k+1} - u_i^k\right) = \frac{1}{2(\Delta x)^2}\left(\delta_x^2 - \frac{1}{12}\delta_x^4\right)\left(u_i^{k+1} + u_i^k\right) \qquad (3)$$

The discretization of (3) leads to the expression given in (4), with the constants defined as in (5), where $\lambda = \Delta t/(\Delta x)^2$.

$$a u_{i-2}^{k+1} + b u_{i-1}^{k+1} + c u_i^{k+1} + d u_{i+1}^{k+1} + e u_{i+2}^{k+1} = -a u_{i-2}^k - b u_{i-1}^k + \hat{c} u_i^k - d u_{i+1}^k - e u_{i+2}^k, \quad i = 2, 3, \ldots, m-1 \qquad (4)$$

$$a = \frac{\lambda}{24}, \quad b = -\frac{2\lambda}{3}, \quad c = \frac{4 + 5\lambda}{4}, \quad d = -\frac{2\lambda}{3}, \quad e = \frac{\lambda}{24}, \quad \hat{c} = \frac{4 - 5\lambda}{4} \qquad (5)$$

In matrix form, the approximation in (4) can be represented by (6), $A\mathbf{u} = \mathbf{f}$, where $A$ is a sparse penta-diagonal coefficient matrix, the column vector $\mathbf{u} = (u_2, u_3, \ldots, u_{m-2}, u_{m-1})^T$ contains the unknown values of $u$ at the time level $k+1$, and $\mathbf{f} = (f_2, f_3, \ldots, f_{m-2}, f_{m-1})^T$ consists of boundary values and known values at the previous time level $k$.

$$\begin{pmatrix}
c & d & e & & & & \\
b & c & d & e & & & \\
a & b & c & d & e & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & a & b & c & d & e \\
 & & & a & b & c & d \\
 & & & & a & b & c
\end{pmatrix}_{(m-2)\times(m-2)}
\begin{pmatrix} u_2 \\ u_3 \\ \vdots \\ \vdots \\ \vdots \\ u_{m-2} \\ u_{m-1} \end{pmatrix}^{k+1}
=
\begin{pmatrix} f_2 \\ f_3 \\ \vdots \\ \vdots \\ \vdots \\ f_{m-2} \\ f_{m-1} \end{pmatrix} \qquad (6)$$

The entries in $\mathbf{f}$ are defined as

$$f_2 = -b\left(u_1^k + u_1^{k+1}\right) + \hat{c} u_2^k - d u_3^k - e u_4^k$$
$$f_3 = -a\left(u_1^k + u_1^{k+1}\right) - b u_2^k + \hat{c} u_3^k - d u_4^k - e u_5^k$$
$$f_i = -a u_{i-2}^k - b u_{i-1}^k + \hat{c} u_i^k - d u_{i+1}^k - e u_{i+2}^k, \quad i = 4, 5, \ldots, m-3$$
$$f_{m-2} = -a u_{m-4}^k - b u_{m-3}^k + \hat{c} u_{m-2}^k - d u_{m-1}^k - e\left(u_m^k + u_m^{k+1}\right)$$
$$f_{m-1} = -a u_{m-3}^k - b u_{m-2}^k + \hat{c} u_{m-1}^k - d\left(u_m^k + u_m^{k+1}\right) \qquad (7)$$

The IADEMF4 scheme secondly employs the fractional splitting of the higher-order accuracy formula of the MF variant [13],

$$(rI + G_1)\,\mathbf{u}^{(p+1/2)} = (rI - gG_2)\,\mathbf{u}^{(p)} + \mathbf{f} \qquad (8)$$

$$(rI + G_2)\,\mathbf{u}^{(p+1)} = (rI - gG_1)\,\mathbf{u}^{(p+1/2)} + g\mathbf{f} \qquad (9)$$

where $G_1$ and $G_2$ are two constituent matrices and $r$, $I$ and $p$ represent an acceleration parameter, an identity matrix and the iteration index respectively. The value of $g$ is defined as $g = (6 + r)/6$, $r > 0$. The vectors $\mathbf{u}^{(p+1)}$ and $\mathbf{u}^{(p+1/2)}$ represent the approximate solution at the iteration level $(p+1)$ and at some intermediate level $(p+1/2)$, respectively.

After some algebraic manipulation of the equations in (8) and (9), the form $\left(G_1 + G_2 - \frac{1}{6} G_1 G_2\right)\mathbf{u} = \mathbf{f}$ is obtained, suggesting that the matrix $A$ in (6) can be decomposed into

$$A = G_1 + G_2 - \frac{1}{6} G_1 G_2 \qquad (10)$$

To retain the penta-diagonal structure of $A$, the matrices $G_1$ and $G_2$ have to be in the form of lower and upper tri-diagonal matrices respectively.
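As a minimal sketch of how the system in (6) and the right-hand side in (7) could be assembled, the Python fragment below builds a dense copy of $A$ and the vector $\mathbf{f}$. It rests on assumptions that are not stated explicitly in the text: the mesh ratio is taken as lam = dt / dx**2, the constants in (5) are taken as a = e = lam/24, b = d = -2*lam/3, c = 1 + 5*lam/4 and c_hat = 1 - 5*lam/4 (obtained by expanding the central differences in (3)), and the values $u_1^{k+1}$ and $u_m^{k+1}$ are treated as already known, as (7) suggests. The function names are hypothetical, and exact fractions are used only to make the check at the end exact.

```python
from fractions import Fraction


def coefficients(lam):
    """Constants of (5) for mesh ratio lam = dt / dx**2 (assumed definition)."""
    a = e = lam / 24
    b = d = -2 * lam / 3
    c = 1 + 5 * lam / 4
    c_hat = 1 - 5 * lam / 4
    return a, b, c, d, e, c_hat


def assemble_A(m, lam):
    """Dense (m-2) x (m-2) penta-diagonal matrix of (6); unknowns are u_2..u_{m-1}."""
    a, b, c, d, e, _ = coefficients(lam)
    n = m - 2
    A = [[0] * n for _ in range(n)]
    for row in range(n):
        for offset, coeff in ((-2, a), (-1, b), (0, c), (1, d), (2, e)):
            col = row + offset
            if 0 <= col < n:
                A[row][col] = coeff
    return A


def assemble_f(u_old, u1_new, um_new, lam):
    """Right-hand side f_2..f_{m-1} following the pattern of (7).

    u_old[j - 1] holds u_j^k for j = 1..m. u1_new and um_new stand for
    u_1^{k+1} and u_m^{k+1}, which (7) treats as known values.
    """
    a, b, c, d, e, c_hat = coefficients(lam)
    m = len(u_old)
    known_new = {1: u1_new, m: um_new}
    stencil = ((-2, a, -a), (-1, b, -b), (0, c, c_hat), (1, d, -d), (2, e, -e))
    f = []
    for i in range(2, m):
        total = 0
        for offset, lhs, rhs in stencil:
            j = i + offset
            if j < 1 or j > m:
                continue  # no such term appears in (7)
            total += rhs * u_old[j - 1]
            if j in known_new:
                # known new-level value moved to the right-hand side
                total -= lhs * known_new[j]
        f.append(total)
    return f


lam = Fraction(1, 2)
m = 8
A = assemble_A(m, lam)
f = assemble_f([Fraction(1)] * m, Fraction(1), Fraction(1), lam)
# Rows for i = 4..m-3: a constant field u = 1 satisfies row-sum(A) == f_i.
assert all(sum(A[row]) == f[row] for row in range(2, m - 4))
```

The final assertion is a quick sanity check on the assumed constants: in the rows whose stencil touches no known value, both sides of (4) sum to 1, so a constant temperature field is reproduced exactly. A dense list of lists is used only for readability; a practical implementation would store the five diagonals.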