Programming Using the Partitioned Global Address Space (PGAS)


M04 presenters: Tarek El-Ghazawi (GWU), Brad Chamberlain (Cray), Vijay Saraswat (IBM)

Overview
A. Introduction to PGAS (~45 minutes)
B. Introduction to Languages
   A. UPC (~60 minutes)
   B. X10 (~60 minutes)
   C. Chapel (~60 minutes)
C. Comparison of Languages (~45 minutes)
   A. Comparative heat transfer example
   B. Comparative summary of features
   C. Discussion
D. Hands-On (90 minutes)

A. Introduction to PGAS

The common architectural landscape
- SMP clusters: SMP nodes (processing elements sharing a memory) joined by an interconnect.
- Massively parallel processor systems, e.g., IBM Blue Gene and Cray XT5.

Programming models
- What is a programming model? The logical interface between architecture and applications.
- Why programming models? They decouple applications and architectures, so we can write applications that run effectively across architectures and design new architectures that can effectively support legacy applications.
- Design considerations: expose modern architectural features to exploit machine power and improve performance, while maintaining ease of use.

Examples of parallel programming models
- Message passing
- Shared memory (global address space)
- Partitioned global address space (PGAS)

The message passing model
- Concurrent sequential processes with explicit, two-sided communication; library-based.
- Positive: programmers control data and work distribution.
- Negative: significant communication overhead for small transactions, excessive buffering, and hard to program.
- Example: MPI.

The shared memory model
- Concurrent threads with a shared space.
- Positive: simple statements; remote memory is read via an expression and written through an assignment.
- Negative: manipulating shared data leads to synchronization requirements, and the model does not allow locality exploitation.
- Examples: OpenMP, Java.

Hybrid model(s)
- Example: shared memory plus message passing, such as OpenMP within each SMP node and MPI between nodes.

The PGAS model
- Concurrent threads over a partitioned shared space: a datum may reference data in other partitions, and global arrays have fragments in multiple partitions.
- Positive: helps in exploiting locality while keeping simple statements, as in shared memory.
- Negative: sharing all memory can result in subtle bugs and race conditions.
- Examples (this tutorial): UPC, X10, Chapel; also CAF and Titanium.

PGAS vs. other programming models/languages

| | UPC, X10, Chapel | MPI | OpenMP | HPF |
|---|---|---|---|---|
| Memory model | PGAS (partitioned global address space) | Distributed memory | Shared memory | Distributed shared memory |
| Notation | Language | Library | Annotations | Language |
| Global arrays? | Yes | No | No | Yes |
| Global pointers/references? | Yes | No | No | No |
| Locality exploitation | Yes | Yes, necessarily | No | Yes |
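To make the hybrid model above concrete, the sketch below shows the usual combination of OpenMP within the node and MPI between nodes. It is an illustration added here, not code from the slides, and assumes an MPI library and a compiler with OpenMP support.

```c
// Hybrid sketch: MPI between nodes, OpenMP threads within each node.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    // Ask for an MPI library that tolerates threaded callers.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One address space per MPI process; OpenMP threads share it inside the node.
    #pragma omp parallel
    {
        printf("process %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```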
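The PGAS bullets above can likewise be illustrated with a few lines of UPC. This is a minimal sketch added for illustration, not slide material; it relies only on standard UPC constructs (shared, THREADS, MYTHREAD, upc_barrier).

```c
// Minimal UPC sketch of the PGAS model (illustrative, not from the slides).
#include <upc_relaxed.h>
#include <stdio.h>

// A global array with the default cyclic layout: element i has affinity to
// thread i % THREADS, i.e., it lives in that thread's shared partition.
shared int a[THREADS];

int main(void) {
    // Each thread writes the element in its own partition (a local access).
    a[MYTHREAD] = MYTHREAD * 10;

    upc_barrier;   // make all writes visible before reading

    // Reading a neighbor's element looks like ordinary shared-memory code,
    // but the declaration tells the programmer the access may be remote:
    // simple statements plus visible locality, as the table above notes.
    int right = a[(MYTHREAD + 1) % THREADS];
    printf("Thread %d sees a[%d] = %d\n",
           MYTHREAD, (MYTHREAD + 1) % THREADS, right);
    return 0;
}
```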
The heterogeneous/accelerated architectural landscape
- "Scalable units" of hundreds of cluster nodes, plus I/O gateway nodes, joined by a cluster interconnect switch/fabric.
- Cray XT5h: FPGA/vector-accelerated Opteron.
- Roadrunner: Cell-accelerated Opteron.

Example accelerator technologies
- Cell processor: a PPE (64-bit Power core with VMX) plus eight SPEs connected by the EIB (up to 96 bytes/cycle); the original figure overlays X10-style places on the diagram, place 0 on the PPE and places 1 through 8 on the SPEs.
- GPUs (figure from the NVIDIA CUDA Programming Guide).
- FPGAs.
- Multi-core with accelerators (Intel IXP 2850).

The current architectural landscape
- Substantial architectural innovation is anticipated over the next ten years. The hardware situation remains murky, but programmers need stable interfaces to develop applications.
- Multicore systems will dramatically raise the number of cores available to applications, so programmers must understand the concurrent structure of their applications.
- Heterogeneous accelerator-based systems will exist, raising serious programmability challenges: programmers must choreograph interactions between heterogeneous processors and memory subsystems.
- Applications seeking to leverage these architectures will need to go beyond the data-parallel, globally synchronizing MPI model.
- These changes, while most profound for HPC now, will change the face of commercial computing over time.

Asynchronous PGAS (APGAS)
- Concurrency is made explicit and programmable; SPMD is just a special case.
- Asynchronous activities can be used for active messaging (e.g., DMAs), fork/join concurrency, and do-all/do-across parallelism.
- Asynchronous activities can be started and stopped in a given space partition.

How do we realize APGAS?
- Through an APGAS library in C, Fortran, or Java (co-habiting with MPI). Such a library implements PGAS (remote references, global data structures), inter-place messaging, global and/or collective operations, intra-place concurrency, and atomic operations; it optimizes inlinable asyncs and provides an algorithmic scheduler. It builds on the XL UPC runtime, GASNet, ARMCI, LAPI, DCMF, DaCS, the Cilk runtime, and the Chapel runtime.
- Through languages: Asynchronous Co-Array Fortran (an extension of CAF with asyncs), Asynchronous UPC (AUPC, a proper extension of UPC with asyncs), X10 (already asynchronous; an extension of sequential Java), and Chapel (already asynchronous). The language runtimes share a common APGAS runtime.
- Libraries reduce the cost of adoption, while languages offer enhanced productivity benefits; the customer gets to choose.

B. Introduction to Languages
Unified Parallel C (UPC)
SC08 PGAS Tutorial; Tarek El-Ghazawi, [email protected], The George Washington University

UPC overview
1) UPC in a nutshell: memory model, execution model, UPC systems
2) Data distribution and pointers: shared vs. private data, examples of data distribution, UPC pointers
3) Workload sharing: upc_forall
4) Advanced topics in UPC: dynamic memory allocation, synchronization in UPC, UPC libraries
5) UPC productivity: code efficiency

Introduction
- UPC (Unified Parallel C) is a set of specifications for a parallel C: v1.0 was completed in February 2001, v1.1.1 in October 2003, and v1.2 in May 2005.
- Compiler implementations exist from vendors and others.
- The UPC consortium spans government, academia, and HPC vendors, including IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U of Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, and others.
- UPC compilers are now available for most HPC platforms and clusters; some are open source.
- A debugger is available, and a performance analysis tool is in the works.
- Benchmarks, programming examples, and compiler test suites are available.
- Visit www.upcworld.org or upc.gwu.edu for more information.

UPC systems
- Current UPC compilers: Hewlett-Packard, Cray, IBM, Berkeley, Intrepid (GCC UPC), and MTU.
- UPC application development tools: TotalView and PPW from UF.
- UPC home page: http://www.upc.gwu.edu
- UPC textbook: Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick, UPC: Distributed Shared Memory Programming, Wiley, May 2005, ISBN 0-471-22048-5.

What is UPC?
- An explicit parallel extension of ISO C.
- A partitioned shared memory parallel programming language.

UPC execution model
- A number of threads work independently in SPMD fashion: MYTHREAD gives the thread index (0..THREADS-1), and the number of threads is specified at compile time or at run time.
- Synchronization is used only when needed: barriers, locks, and memory consistency control.

UPC memory model
- Each thread has a private space plus a partition of the shared space; together the shared partitions form a partitioned global address space.
- A pointer-to-shared can reference all locations in the shared space, but each shared datum has affinity to one thread.
- A private pointer may reference addresses in its own private space or in its local portion of the shared space.
- Static and dynamic memory allocation are supported for both shared and private memory.

A first example: vector addition
In the slide's picture, the default cyclic layout places element i of each shared array on thread i % THREADS; the same test selects which thread executes iteration i, so every thread updates only elements it owns:

```c
// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main() {
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
            v1plusv2[i] = v1[i] + v2[i];
}
```

A second example: a more efficient implementation
The extract breaks off after the opening lines of this slide (the same header: #include <upc_relaxed.h> and #define N 100*THREADS).
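Since the body of the "more efficient" version is missing from this extract, the following is a hedged sketch of how such a version is commonly written, an assumption about the missing slide rather than text recovered from it: drop the per-iteration ownership test and have each thread walk directly over the indices with affinity to it.

```c
// Hedged sketch of a more efficient vector addition: each thread iterates
// only over the elements it owns (i = MYTHREAD, MYTHREAD+THREADS, ...),
// so no cycles are spent testing ownership of every index.
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main() {
    int i;
    for (i = MYTHREAD; i < N; i += THREADS)
        v1plusv2[i] = v1[i] + v2[i];
}
```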
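Item 3 of the UPC overview (workload sharing with upc_forall) is not reached in this extract. As a forward pointer, here is a hedged sketch of how the same vector addition is typically expressed with upc_forall, whose fourth argument is an affinity expression that assigns each iteration to the thread owning the referenced element; the loop body is illustrative, not recovered slide text.

```c
// Hedged sketch: work sharing with upc_forall. The fourth expression
// (&v1plusv2[i]) is the affinity field: iteration i runs on the thread
// that owns element i, so every write to v1plusv2 is local.
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main() {
    int i;
    upc_forall (i = 0; i < N; i++; &v1plusv2[i])
        v1plusv2[i] = v1[i] + v2[i];
}
```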