A Hardware and Software Architecture for Pervasive Parallelism

by
Mark Christopher Jeffrey

Bachelor of Applied Science, University of Toronto (2009)
Master of Applied Science, University of Toronto (2011)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, February 2020.

© Mark Christopher Jeffrey, MMXX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, October 25, 2019
Certified by: Daniel Sanchez, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Abstract

Submitted to the Department of Electrical Engineering and Computer Science on October 25, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science.

Parallelism is critical to achieve high performance in modern computer systems. Unfortunately, most programs scale poorly beyond a few cores, and those that scale well often require heroic implementation efforts. This is because current parallel architectures squander most of the parallelism available in applications and are too hard to program.

This thesis presents Swarm, a new execution model, architecture, and system software that exploits far more parallelism than conventional multicores, yet is almost as easy to program as a sequential machine. Programmer-ordered tasks sit at the software-hardware interface. Swarm programs consist of tiny tasks, as small as tens of instructions each. Parallelism is dynamic: tasks can create new tasks at run time. Synchronization is implicit: the programmer specifies a total or partial order on tasks. This eliminates the correctness pitfalls of explicit synchronization (e.g., deadlock and data races).

Swarm hardware uncovers parallelism by speculatively running tasks out of order, even thousands of tasks ahead of the earliest active task. Its speculation mechanisms build on decades of prior work, but Swarm is the first parallel architecture to scale to hundreds of cores due to its new programming model, distributed structures, and distributed protocols. Leaning on its support for task order, Swarm incorporates new techniques to reduce data movement, to speculate selectively for improved efficiency, and to compose parallelism across abstraction layers.

Swarm achieves efficient near-linear scaling to hundreds of cores on otherwise hard-to-scale irregular applications. These span a broad set of domains, including graph analytics, discrete-event simulation, databases, machine learning, and genomics. Swarm even accelerates applications that are conventionally deemed sequential. It outperforms recent software-only parallel algorithms by one to two orders of magnitude, and sequential implementations by up to 600× at 256 cores.

Thesis Supervisor: Daniel Sanchez, Associate Professor of Electrical Engineering and Computer Science
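The execution model the abstract describes — tiny timestamped tasks that create new tasks at run time, with order as the only synchronization — can be illustrated with a small sketch. The C++ emulation below runs single-source shortest paths as ordered tasks. The Task type, enqueue helper, and priority-queue scheduler are illustrative assumptions rather than the thesis's actual API, and this sketch runs tasks one at a time in timestamp order, whereas Swarm hardware executes them speculatively in parallel and commits them in order.

```cpp
// Minimal sequential emulation of an ordered-task execution model, assuming
// a hypothetical enqueue(timestamp, task) primitive; not the thesis's API.
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

struct Edge { uint32_t dst; uint64_t weight; };
using Graph = std::vector<std::vector<Edge>>;

// A task is a programmer-assigned timestamp plus a tiny body of work.
struct Task { uint64_t ts; std::function<void()> fn; };
struct LaterFirst {
    bool operator()(const Task& a, const Task& b) const { return a.ts > b.ts; }
};

void sssp(const Graph& g, uint32_t src, std::vector<uint64_t>& dist) {
    // The scheduler dequeues tasks in timestamp order: order, not locks,
    // is the synchronization mechanism.
    std::priority_queue<Task, std::vector<Task>, LaterFirst> tasks;
    auto enqueue = [&](uint64_t ts, std::function<void()> fn) {
        tasks.push(Task{ts, std::move(fn)});
    };

    dist.assign(g.size(), std::numeric_limits<uint64_t>::max());
    std::function<void(uint64_t, uint32_t)> visit = [&](uint64_t ts, uint32_t v) {
        if (ts >= dist[v]) return;  // a shorter path already reached v
        dist[v] = ts;
        // Dynamic parallelism: each task may create new, later-ordered tasks.
        for (const Edge& e : g[v])
            enqueue(ts + e.weight, [&, ts, e] { visit(ts + e.weight, e.dst); });
    };

    enqueue(0, [&] { visit(0, src); });
    while (!tasks.empty()) {
        Task t = tasks.top();  // earliest-timestamp task runs next
        tasks.pop();
        t.fn();
    }
}
```

Because tasks logically execute in timestamp order, the first task to visit a vertex carries its shortest distance and later visits simply return; no explicit synchronization appears anywhere in the program, which is the property the abstract attributes to programmer-ordered tasks.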
Acknowledgments

This PhD journey at MIT has been extremely rewarding, challenging, and stimulating. I sincerely thank the many individuals who supported me along the way.

First and foremost, I am grateful to my advisor, Professor Daniel Sanchez. His breadth of knowledge and wide range of skills astounded me when I arrived at MIT and continue to impress as I depart. My background was in parallel and distributed systems, and Daniel quickly brought me up to speed on computer architecture research. Early on, Daniel opined that a lot of fun happens at the interface, where you can change both hardware and software to create radical new designs. Now, I could not agree more. Daniel taught me how to think about the big-picture challenges in computer science, then frame tractable research problems, and use limit-study prototypes to quickly filter ideas as promising or not. He helped me improve my communication of insights and results to the broader community. He also taught me new tricks to debug low-level protocol deadlocks or system errors. Daniel exemplified strong leadership: he would recognize and nurture particular strengths in his students, and rally strong teams to tackle big, important problems, while encouraging us to work on our weaknesses. Daniel provided so much freedom throughout my time at MIT, yet was always available when I needed him.

I would like to thank my thesis committee members, Professor Joel Emer and Professor Arvind. Over the years, Joel taught me the importance of carefully distilling the contributions of a research project down to its core insights; implementation details are important, but secondary to understanding the insight. Throughout my career, I will strive to continue his principled approach to computer architecture research, and his welcoming, inclusive, and empathetic approach to mentorship. Arvind, who solved fundamental problems in parallel dataflow architectures and languages, provided a valuable perspective and important feedback on this work.

This thesis is the result of collaboration with an outstanding team of students: Suvinay Subramanian, Cong Yan, Maleen Abeydeera, Victor Ying, and Hyun Ryong (Ryan) Lee. Suvinay was a crucial partner in all of the projects of this thesis. I appreciate the hours we spent brainstorming, designing, and debugging. I learned a lot from his persistence, deep focus, and unbounded optimism. Cong invested a few months to wrangle a database application for our benchmark suite. We learned a lot from such a large workload. Maleen improved the modeling of our simulations, implemented several applications, and, bringing his expertise in hardware design, provided valuable feedback and fast prototypes to simplify our designs. Victor added important capabilities to our simulation in the Espresso and Capsules project, identified several opportunities for system optimization, and has been a crucial sounding board for the last three years. I am excited to see where his audacious work will lead next. Ryan implemented new applications that taught us important lessons on how to improve performance.

I am thankful to the members of the Sanchez group: Maleen Abeydeera, Nathan Beckmann, Nosayba El-Sayed, Yee Ling Gan, Harshad Kasture, Ryan Lee, Anurag Mukkara, Quan Nguyen, Suvinay Subramanian, Po-An Tsai, Cong Yan, Victor Ying, and Guowei Zhang.
In addition to providing excellent feedback on papers and presentations, they brought fun and insight to my time at MIT, through technical discussions, teaching me history, eating, traveling to conferences, and hiking the trails of New York and New Hampshire.

Thanks to my officemate, Po-An, who shared much wisdom about mentorship, research trajectories, and career decisions. I thoroughly enjoyed our one official collaboration on the MICRO-50 submissions server, and our travels around the globe.

I appreciate my dear friends across Canada and the United States, who made every moment outside the lab extraordinary.

Last but not least, an enormous amount of thanks goes to my family. My siblings encouraged me to start this PhD journey and provided valuable advice along the way. My parents boosted morale with frequent calls full of love and encouragement and relaxing visits back home. My nieces and nephews brought a world of fun and adventure every time we got together. I got to watch you grow up over the course of this degree. Finally, a very special thanks goes to my wife, Ellen Chan, who brings joy, fun, and order to my life. I could not have finished this thesis without her love and support.

I am grateful for financial support from C-FAR, one of six SRC STARnet centers sponsored by MARCO and DARPA; NSF grants CAREER-1452994, CCF-1318384, SHF-1814969, and NSF/SRC grant E2CDA-1640012; a grant from Sony; an MIT EECS Jacobs Presidential Fellowship; an NSERC Postgraduate Scholarship; and a Facebook Fellowship.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Challenges
  1.2 Contributions
  1.3 Thesis Organization
2 Background
  2.1 Important Properties of Task-Level Parallelism
    2.1.1 Task Regularity
    2.1.2 Task Ordering
    2.1.3 Task Granularity
    2.1.4 Open Opportunity: Fine-Grain Ordered Irregular Parallelism
  2.2 Exploiting Regular Parallelism
  2.3 Exploiting Non-Speculative Irregular Parallelism
  2.4 Exploiting Speculative Irregular Parallelism
    2.4.1 Dynamic Instruction-Level Parallelism
    2.4.2 Thread-Level Speculation
    2.4.3 Transactional Memory
3 Swarm: A Scalable Architecture for Ordered Parallelism
  3.1 Motivation
    3.1.1 Understanding Ordered Irregular Parallelism
    3.1.2 Analysis of Ordered Irregular Algorithms
    3.1.3 Limitations of Thread-Level Speculation
  3.2 Swarm Execution Model
  3.3 Swarm Implementation
    3.3.1 ISA Extensions
    3.3.2 Task Queuing and Prioritization
    3.3.3 Speculative Execution and Versioning
    3.3.4 Virtual Time-Based Conflict Detection
    3.3.5 Selective Aborts
    3.3.6 Scalable Ordered Commits
    3.3.7 Handling Limited Queue Sizes
    3.3.8 Analysis of Hardware Costs
  3.4 Experimental Methodology
  3.5 Evaluation
    3.5.1 Swarm Scalability
    3.5.2 Swarm vs. Software Implementations
    3.5.3 Swarm Analysis
    3.5.4 Sensitivity Studies
    3.5.5 Swarm Case Study: astar
  3.6 Additional Related Work
  3.7 Summary
4 Spatial Hints: Data-Centric Execution of Speculative Parallel Programs
  4.1 Motivation
  4.2 Spatial Task Mapping with Hints
    4.2.1 Hint API and ISA Extensions
    4.2.2 Hardware Mechanisms
    4.2.3 Adding Hints to Benchmarks