Shared Memory Programming on NUMA–Based Clusters Using a General and Open Hybrid Hardware / Software Approach

Total Page:16

File Type:pdf, Size:1020Kb

Shared Memory Programming on NUMA–Based Clusters Using a General and Open Hybrid Hardware / Software Approach Shared Memory Programming on NUMA–based Clusters using a General and Open Hybrid Hardware / Software Approach Martin Schulz Institut für Informatik Lehrstuhl für Rechnertechnik und Rechnerorganisation Shared Memory Programming on NUMA–based Clusters using a General and Open Hybrid Hardware / Software Approach Martin Schulz Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation. Vorsitzender: Univ.-Prof. R. Bayer, Ph.D. Prüfer der Dissertation: 1. Univ.-Prof. Dr. A. Bode 2. Univ.-Prof. Dr. H. Hellwagner, Universität Klagenfurt / Österreich Die Dissertation wurde am 24. April 2001 bei der Technischen Universität München ein- gereicht und durch die Fakultät für Informatik am 28. Juni 2001 angenommen. Abstract The widespread use of shared memory programming for High Performance Computing (HPC) is currently hindered by two main factors: the limited scalability of architectures with hardware support for shared memory and the abundance of existing programming models. In order to solve these issues, a comprehensive shared memory framework needs to be created which enables the use of shared memory on top of more scalable architectures and which provides a user–friendly solution to deal with the various different programming models. Driven by the first issue, a large number of so–called SoftWare Distributed Shared Memory (SW–DSM) systems have been developed. These systems rely solely on soft- ware components to create a transparent global virtual memory abstraction on highly scal- able, loosely coupled architectures without any direct hardware support for shared memory. However, they are often affected by inherent performance problems and, in addition, do not solve the second issue of the existence of (too) many shared memory programming models. On the contrary, the large amount of work done in the DSM area has led to a significant number of independent systems, each with its own API, thereby further worsening the sit- uation. The work presented within this thesis therefore takes the idea of SW–DSM systems a step further by proposing a general and open shared memory framework called HAMSTER (Hybrid-dsm based Adaptive and Modular Shared memory archiTEctuRe). Instead of be- ing fixed to a single shared memory programming model or API, this framework provides a comprehensive set of shared memory services enabling the implementation of almost any shared memory programming model on top of a single core. These services are designed in a way that minimizes the complexity for target programming models making the imple- mentation of a large number of different models feasible. This can include both existing and new application or application domain specific programming models easing both the porting of given and the parallelization of new applications. In addition, the HAMSTER framework avoids typical performance problems of SW– DSM systems by relying on so–called NUMA (Non–Uniform Memory Access) architec- tures which combine scalability and cost effectiveness with limited support for shared memory in the form of non–cache coherent hardware DSM. Their capabilities are directly exploited by a new type of hybrid hardware/software DSM system, the core of the HAM- STER framework. This Hybrid–DSM approach closes the semantic gap between the global physical memory provided by the underlying hardware and the global virtual memory re- i quired for shared memory programming enabling applications to directly benefit from the hardware support. On top of this Hybrid–DSM system, the HAMSTER framework defines and imple- ments several independent and orthogonal management modules. This includes separate modules for memory, consistency, synchronization, and task management as well as for the control of the cluster and the global process abstraction. Each of these modules of- fers typical services required by implementations of shared memory programming models. Combined they form the HAMSTER interface which can then be used to implement shared memory programming models without much effort. This capability is proven through the implementation of a number of selected shared memory programming models on top of the HAMSTER framework. These models range from transparently distributed thread models all the way to explicit put/get libraries and also include various APIs from existing SW–DSM systems with different relaxed consistency models. It therefore covers the whole spectrum of shared memory programming models and underlines the broad applicability of this approach. The presented concepts are evaluated using a large number of different benchmarks and kernels exhibiting the performance details of the individual components. In addition, HAMSTER is used as the basis for the implementation or port of two real–world applica- tions from the area of nuclear medical imaging, more precisely the reconstruction of PET images and their spectral analysis. These experiments cover both the porting of an already existing shared memory application using a given DSM API and the parallelization of an application from scratch using a new, customized API. In both cases, the system provides an efficient platform resulting in a very scalable execution. These experiment, therefore, prove both the wide applicability and the efficiency of the overall HAMSTER framework. ii Acknowledgments I would like to take this opportunity to express my deepest gratitude to the following people, for I realize that this work would not have been possible without their help, guidance, and support. First, I am indebted to my two advisors, Prof. Dr. A. Bode, chair of LRR–TUM, and Prof. Dr. H. Hellwagner, now at the University of Klagenfurt. Prof. Hellwagner initially took me in as his student and after his move to Klagenfurt, Prof. Bode took over without hesitation. Together they provided me with an excellent research environment and gave me the greatest possible freedom to pursue my work and interests. In addition, despite their busy schedules, they both always had an open ear for me. I would also like to direct my special thanks to Dr. Wolfgang Karl who, in his func- tion as leader of the architecture group at LRR, significantly contributed to the excellent research environment. In addition, his support and the many long, informal discussions and helpful comments regarding my work were invaluable, especially in times when I seemed stuck. I would also like to thank my many colleagues at LRR–TUM; I enjoyed very much working with them and the many discussions we shared over coffee. Especially I would like to mention my office mates over the years: Phillip Drum, Michael Eberl, Detlef Fliegl (who also contributed significantly to the Windows NT driver for the SCI-VM and provided personal system administration support), Günther Rackl, and Christian Weiß. Additionally, I would like thank Jie Tao, whose pleasant disposition and enthusiasm made it a pleasure to work with her. Our secretaries, especially Mrs. Eberhardt, Mrs. Wöllgens, and Mrs. Brunnhuber, deserve my special thanks for their help in maneuvering through the many administrative obstacles ranging from contract issues to business trip applications. Without them, I would still be sitting here in a pile of paper work. I would also like to thank Klaus Tilk, our system administrator, for keeping our systems alive and healthy. I do not want to forget to mention Prof. Dr. A. Chien and the members of his Concurrent Systems Architecture Group (CSAG) from the University of Illinois at Urbana–Champaign (now at the University of California at San Diego). During my stay in this group and during my Master’s work there, they introduced me to a scientific style of working from which I still very much profit. The European Union/Commision deserves credit for the funding it provided for the majority of my work. This was done mainly within the ESPRIT projects SISCI (EP 23174) iii and NEPHEW (EP 29907). In addition, the many partners Europe–wide involved in these projects provided an additional source of inspiration and support. Special thanks go to Dolphin ICS, most specifically to Kåre Løchson, Hugo Kohmann, Torsten Amundsen, and Roy Nordstrøm, who were involved not only in these ESPRIT projects, but also technically supported our work within SMiLE. The work within NEPHEW also brought me in contact with the Clinic for Nuclear Medicine at the “Klinikum Rechts der Isar”. From this cooperation I received valuable input for the evaluation part of this work. This would not have been possible without help from Frank Munz, Sibylle Ziegler, and Martin Völk and I am very thankful for the time and energy they invested. On a more personal note, I am indebted to my friends and my whole family who were always at my side providing constant encouragement and support. I would especially like to thank my good friend, Dr. Johannes Zimmer, for all our long talks and for his prolific help with LATEX. I would also like to express my deepest gratitude to my parents. From early on, they sparked and encouraged my interest in learning and provided unconditional support in all of my endeavors, both financially and (most important) spiritually. Last, but certainly not least, I would like to thank the love of life, my wife Laura, who not only took the time to review my thesis and to give valuable comments regarding language and style, but who also supported me throughout the entire time and with her whole heart. Martin Schulz April 2001 iv Contents 1 Motivation
Recommended publications
  • Emerging Technologies Multi/Parallel Processing
    Emerging Technologies Multi/Parallel Processing Mary C. Kulas New Computing Structures Strategic Relations Group December 1987 For Internal Use Only Copyright @ 1987 by Digital Equipment Corporation. Printed in U.S.A. The information contained herein is confidential and proprietary. It is the property of Digital Equipment Corporation and shall not be reproduced or' copied in whole or in part without written permission. This is an unpublished work protected under the Federal copyright laws. The following are trademarks of Digital Equipment Corporation, Maynard, MA 01754. DECpage LN03 This report was produced by Educational Services with DECpage and the LN03 laser printer. Contents Acknowledgments. 1 Abstract. .. 3 Executive Summary. .. 5 I. Analysis . .. 7 A. The Players . .. 9 1. Number and Status . .. 9 2. Funding. .. 10 3. Strategic Alliances. .. 11 4. Sales. .. 13 a. Revenue/Units Installed . .. 13 h. European Sales. .. 14 B. The Product. .. 15 1. CPUs. .. 15 2. Chip . .. 15 3. Bus. .. 15 4. Vector Processing . .. 16 5. Operating System . .. 16 6. Languages. .. 17 7. Third-Party Applications . .. 18 8. Pricing. .. 18 C. ~BM and Other Major Computer Companies. .. 19 D. Why Success? Why Failure? . .. 21 E. Future Directions. .. 25 II. Company/Product Profiles. .. 27 A. Multi/Parallel Processors . .. 29 1. Alliant . .. 31 2. Astronautics. .. 35 3. Concurrent . .. 37 4. Cydrome. .. 41 5. Eastman Kodak. .. 45 6. Elxsi . .. 47 Contents iii 7. Encore ............... 51 8. Flexible . ... 55 9. Floating Point Systems - M64line ................... 59 10. International Parallel ........................... 61 11. Loral .................................... 63 12. Masscomp ................................. 65 13. Meiko .................................... 67 14. Multiflow. ~ ................................ 69 15. Sequent................................... 71 B. Massively Parallel . 75 1. Ametek.................................... 77 2. Bolt Beranek & Newman Advanced Computers ...........
    [Show full text]
  • The Risks Digest Index to Volume 11
    The Risks Digest Index to Volume 11 Search RISKS using swish-e Forum on Risks to the Public in Computers and Related Systems ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator Index to Volume 11 Sunday, 30 June 1991 Issue 01 (4 February 1991) Re: Enterprising Vending Machines (Allan Meers) Re: Risks of automatic flight (Henry Spencer) Re: Voting by Phone & public-key cryptography (Evan Ravitz) Re: Random Voting IDs and Bogus Votes (Vote by Phone) (Mike Beede)) Re: Patriots ... (Steve Mitchell, Steven Philipson, Michael H. Riddle, Clifford Johnson) Re: Man-in-the-loop on SDI (Henry Spencer) Re: Broadcast local area networks ... (Curt Sampson, Donald Lindsay, John Stanley, Jerry Leichter) Issue 02 (5 February 1991) Bogus draft notices are computer generated (Jonathan Rice) People working at home on important tasks (Mike Albaugh) Predicting system reliability (Martyn Thomas) Re: Patriots (Steven Markus Woodcock, Mark Levison) Hungry copiers (another run-in with technology) (Scott Wilson) Re: Enterprising Vending Machines (Dave Curry) Broadcast LANs (Peter da Silva, Scott Hinckley) Issue 03 (6 February 1991) Tube Tragedy (Pete Mellor) New Zealand Computer Error Holds Up Funds (Gligor Tashkovich) "Inquiry into cash machine fraud" (Stella Page) Quick n' easy access to Fidelity account info (Carol Springs) Re: Enterprising Vending Machines (Mark Jackson) RISKS of no escape paths (Geoff Kuenning) A risky gas pump (Bob Grumbine) Electronic traffic signs endanger motorists... (Rich Snider) Re: Predicting system reliability (Richard P. Taylor) The new California licenses (Chris Hibbert) Phone Voting -- Really a Problem? (Michael Barnett, Dave Smith) Re: Electronic cash completely replacing cash (Barry Wright) Issue 04 (7 February 1991) Subway door accidents (Mark Brader) http://catless.ncl.ac.uk/Risks/index.11.html[2011-06-11 08:17:52] The Risks Digest Index to Volume 11 "Virus" destroys part of Mass.
    [Show full text]
  • 2.2 Adaptive Routing Algorithms and Router Design 20
    https://theses.gla.ac.uk/ Theses Digitisation: https://www.gla.ac.uk/myglasgow/research/enlighten/theses/digitisation/ This is a digitised version of the original print thesis. Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] Performance Evaluation of Distributed Crossbar Switch Hypermesh Sarnia Loucif Dissertation Submitted for the Degree of Doctor of Philosophy to the Faculty of Science, Glasgow University. ©Sarnia Loucif, May 1999. ProQuest Number: 10391444 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a com plete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. uest ProQuest 10391444 Published by ProQuest LLO (2017). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States C ode Microform Edition © ProQuest LLO. ProQuest LLO.
    [Show full text]
  • Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs Samuel K
    University of New Mexico UNM Digital Repository Computer Science ETDs Engineering ETDs Fall 12-1-2018 Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs Samuel K. Gutiérrez Follow this and additional works at: https://digitalrepository.unm.edu/cs_etds Part of the Numerical Analysis and Scientific omputC ing Commons, OS and Networks Commons, Software Engineering Commons, and the Systems Architecture Commons Recommended Citation Gutiérrez, Samuel K.. "Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs." (2018). https://digitalrepository.unm.edu/cs_etds/95 This Dissertation is brought to you for free and open access by the Engineering ETDs at UNM Digital Repository. It has been accepted for inclusion in Computer Science ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected]. Samuel Keith Guti´errez Candidate Computer Science Department This dissertation is approved, and it is acceptable in quality and form for publication: Approved by the Dissertation Committee: Professor Dorian C. Arnold, Chair Professor Patrick G. Bridges Professor Darko Stefanovic Professor Alexander S. Aiken Patrick S. McCormick Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs by Samuel Keith Guti´errez B.S., Computer Science, New Mexico Highlands University, 2006 M.S., Computer Science, University of New Mexico, 2009 DISSERTATION Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy Computer Science The University of New Mexico Albuquerque, New Mexico December 2018 ii ©2018, Samuel Keith Guti´errez iii Dedication To my beloved family iv \A Dios rogando y con el martillo dando." Unknown v Acknowledgments Words cannot adequately express my feelings of gratitude for the people that made this possible.
    [Show full text]
  • Detecting False Sharing Efficiently and Effectively
    W&M ScholarWorks Arts & Sciences Articles Arts and Sciences 2016 Cheetah: Detecting False Sharing Efficiently andff E ectively Tongping Liu Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA; Xu Liu Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA Follow this and additional works at: https://scholarworks.wm.edu/aspubs Recommended Citation Liu, T., & Liu, X. (2016, March). Cheetah: Detecting false sharing efficiently and effectively. In 2016 IEEE/ ACM International Symposium on Code Generation and Optimization (CGO) (pp. 1-11). IEEE. This Article is brought to you for free and open access by the Arts and Sciences at W&M ScholarWorks. It has been accepted for inclusion in Arts & Sciences Articles by an authorized administrator of W&M ScholarWorks. For more information, please contact [email protected]. Cheetah: Detecting False Sharing Efficiently and Effectively Tongping Liu ∗ Xu Liu ∗ Department of Computer Science Department of Computer Science University of Texas at San Antonio College of William and Mary San Antonio, TX 78249 USA Williamsburg, VA 23185 USA [email protected] [email protected] Abstract 1. Introduction False sharing is a notorious performance problem that may Multicore processors are ubiquitous in the computing spec- occur in multithreaded programs when they are running on trum: from smart phones, personal desktops, to high-end ubiquitous multicore hardware. It can dramatically degrade servers. Multithreading is the de-facto programming model the performance by up to an order of magnitude, significantly to exploit the massive parallelism of modern multicore archi- hurting the scalability. Identifying false sharing in complex tectures.
    [Show full text]
  • Leaper: a Learned Prefetcher for Cache Invalidation in LSM-Tree Based Storage Engines
    Leaper: A Learned Prefetcher for Cache Invalidation in LSM-tree based Storage Engines ∗ Lei Yang1 , Hong Wu2, Tieying Zhang2, Xuntao Cheng2, Feifei Li2, Lei Zou1, Yujie Wang2, Rongyao Chen2, Jianying Wang2, and Gui Huang2 fyang lei, [email protected] fhong.wu, tieying.zhang, xuntao.cxt, lifeifei, zhencheng.wyj, rongyao.cry, beilou.wjy, [email protected] Peking University1 Alibaba Group2 ABSTRACT 101 Hit ratio Frequency-based cache replacement policies that work well QPS on page-based database storage engines are no longer suffi- Latency cient for the emerging LSM-tree (Log-Structure Merge-tree) based storage engines. Due to the append-only and copy- 100 on-write techniques applied to accelerate writes, the state- of-the-art LSM-tree adopts mutable record blocks and issues Normalized value frequent background operations (i.e., compaction, flush) to reorganize records in possibly every block. As a side-effect, 0 50 100 150 200 such operations invalidate the corresponding entries in the Time (s) Figure 1: Cache hit ratio and system performance churn (QPS cache for each involved record, causing sudden drops on the and latency of 95th percentile) caused by cache invalidations. cache hit rates and spikes on access latency. Given the ob- with notable examples including LevelDB [10], HBase [2], servation that existing methods cannot address this cache RocksDB [7] and X-Engine [14] for its superior write perfor- invalidation problem, we propose Leaper, a machine learn- mance. These storage engines usually come with row-level ing method to predict hot records in an LSM-tree storage and block-level caches to buffer hot records in main mem- engine and prefetch them into the cache without being dis- ory.
    [Show full text]
  • CSC 256/456: Operating Systems
    CSC 256/456: Operating Systems Multiprocessor John Criswell Support University of Rochester 1 Outline ❖ Multiprocessor hardware ❖ Types of multi-processor workloads ❖ Operating system issues ❖ Where to run the kernel ❖ Synchronization ❖ Where to run processes 2 Multiprocessor Hardware 3 Multiprocessor Hardware ❖ System in which two or more CPUs share full access to the main memory ❖ Each CPU might have its own cache and the coherence among multiple caches is maintained CPU CPU … … … CPU Cache Cache Cache Memory bus Memory 4 Multi-core Processor ❖ Multiple processors on “chip” ❖ Some caches shared CPU CPU ❖ Some not shared Cache Cache Shared Cache Memory Bus 5 Hyper-Threading ❖ Replicate parts of processor; share other parts ❖ Create illusion that one core is two cores CPU Core Fetch Unit Decode ALU MemUnit Fetch Unit Decode 6 Cache Coherency ❖ Ensure processors not operating with stale memory data ❖ Writes send out cache invalidation messages CPU CPU … … … CPU Cache Cache Cache Memory bus Memory 7 Non Uniform Memory Access (NUMA) ❖ Memory clustered around CPUs ❖ For a given CPU ❖ Some memory is nearby (and fast) ❖ Other memory is far away (and slow) 8 Multiprocessor Workloads 9 Multiprogramming ❖ Non-cooperating processes with no communication ❖ Examples ❖ Time-sharing systems ❖ Multi-tasking single-user operating systems ❖ make -j<very large number here> 10 Concurrent Servers ❖ Minimal communication between processes and threads ❖ Throughput usually the goal ❖ Examples ❖ Web servers ❖ Database servers 11 Parallel Programs ❖ Use parallelism
    [Show full text]
  • Improving Performance of the Distributed File System Using Speculative Read Algorithm and Support-Based Replacement Technique
    International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 11, Issue 9, September 2020, pp. 602-611, Article ID: IJARET_11_09_061 Available online at http://iaeme.com/Home/issue/IJARET?Volume=11&Issue=9 ISSN Print: 0976-6480 and ISSN Online: 0976-6499 DOI: 10.34218/IJARET.11.9.2020.061 © IAEME Publication Scopus Indexed IMPROVING PERFORMANCE OF THE DISTRIBUTED FILE SYSTEM USING SPECULATIVE READ ALGORITHM AND SUPPORT-BASED REPLACEMENT TECHNIQUE Rathnamma Gopisetty Research Scholar at Jawaharlal Nehru Technological University, Anantapur, India. Thirumalaisamy Ragunathan Department of Computer Science and Engineering, SRM University, Andhra Pradesh, India. C Shoba Bindu Department of Computer Science and Engineering, Jawaharlal Nehru Technological University, Anantapur, India. ABSTRACT Cloud computing systems use distributed file systems (DFSs) to store and process large data generated in the organizations. The users of the web-based information systems very frequently perform read operations and infrequently carry out write operations on the data stored in the DFS. Various caching and prefetching techniques are proposed in the literature to improve performance of the read operations carried out on the DFS. In this paper, we have proposed a novel speculative read algorithm for improving performance of the read operations carried out on the DFS by considering the presence of client-side and global caches in the DFS environment. The proposed algorithm performs speculative executions by reading the data from the local client-side cache, local collaborative cache, from global cache and global collaborative cache. One of the speculative executions will be selected based on the time stamp verification process. The proposed algorithm reduces the write/update overhead and hence performs better than that of the non-speculative algorithm proposed for the similar DFS environment.
    [Show full text]
  • Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack)
    Lecture 11: Directory-Based Cache Coherence Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) “It’s such a heartbreaking thing... to have your performance silently crippled by cache lines dropping all over the place due to false sharing.” - Nancy Sinatra CMU 15-418/618, Spring 2016 Today: what you should know ▪ What limits the scalability of snooping-based approaches to cache coherence? ▪ How does a directory-based scheme avoid these problems? ▪ How can the storage overhead of the directory structure be reduced? (and at what cost?) ▪ How does the communication mechanism (bus, point-to- point, ring) affect scalability and design choices? CMU 15-418/618, Spring 2016 Implementing cache coherence The snooping cache coherence Processor Processor Processor Processor protocols from the last lecture relied Local Cache Local Cache Local Cache Local Cache on broadcasting coherence information to all processors over the Interconnect chip interconnect. Memory I/O Every time a cache miss occurred, the triggering cache communicated with all other caches! We discussed what information was communicated and what actions were taken to implement the coherence protocol. We did not discuss how to implement broadcasts on an interconnect. (one example is to use a shared bus for the interconnect) CMU 15-418/618, Spring 2016 Problem: scaling cache coherence to large machines Processor Processor Processor Processor Local Cache Local Cache Local Cache Local Cache Memory Memory Memory Memory Interconnect Recall non-uniform memory access (NUMA) shared memory systems Idea: locating regions of memory near the processors increases scalability: it yields higher aggregate bandwidth and reduced latency (especially when there is locality in the application) But..
    [Show full text]
  • Efficient Synchronization Mechanisms for Scalable GPU Architectures
    Efficient Synchronization Mechanisms for Scalable GPU Architectures by Xiaowei Ren M.Sc., Xi’an Jiaotong University, 2015 B.Sc., Xi’an Jiaotong University, 2012 a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the faculty of graduate and postdoctoral studies (Electrical and Computer Engineering) The University of British Columbia (Vancouver) October 2020 © Xiaowei Ren, 2020 The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: Efficient Synchronization Mechanisms for Scalable GPU Architectures submitted by Xiaowei Ren in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering. Examining Committee: Mieszko Lis, Electrical and Computer Engineering Supervisor Steve Wilton, Electrical and Computer Engineering Supervisory Committee Member Konrad Walus, Electrical and Computer Engineering University Examiner Ivan Beschastnikh, Computer Science University Examiner Vijay Nagarajan, School of Informatics, University of Edinburgh External Examiner Additional Supervisory Committee Members: Tor Aamodt, Electrical and Computer Engineering Supervisory Committee Member ii Abstract The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. Unlike latency-critical Central Processing Units (CPUs), throughput-oriented GPUs provide high performance by exploiting massive application parallelism.
    [Show full text]
  • Some Performance Comparisons for a Fluid Dynamics Code
    NBSIR 87-3638 Some Performance Comparisons for a Fluid Dynamics Code Daniel W. Lozier Ronald G. Rehm U.S. DEPARTMENT OF COMMERCE National Bureau of Standards National Engineering Laboratory Center for Applied Mathematics Gaithersburg, MD 20899 September 1987 U.S. DEPARTMENT OF COMMERCE NATIONAL BUREAU OF STANDARDS NBSIR 87-3638 SOME PERFORMANCE COMPARISONS FOR A FLUID DYNAMICS CODE Daniel W. Lozier Ronald G. Rehm U.S. DEPARTMENT OF COMMERCE National Bureau of Standards National Engineering Laboratory Center for Applied Mathematics Gaithersburg, MD 20899 September 1987 U.S. DEPARTMENT OF COMMERCE, Clarence J. Brown, Acting Secretary NATIONAL BUREAU OF STANDARDS, Ernest Ambler, Director - 2 - 2. BENCHMARK PROBLEM In this section we describe briefly the source of the benchmark problem, the major logical structure of the Fortran program, and the parameters of three different specific instances of the benchmark problem that vary widely in the amount of time and memory required for execution. Research Background As stated in the introduction, our purpose in benchmarking computers is solely in the interest of further- ing our investigations into fundamental problems of fire science. Over a decade ago, stimulated by federal recognition of very large losses of fife and property by fires each year throughout the nation, NBS became actively involved in a national effort to reduce such losses. The work at NBS ranges from very practical to quite theoretical; our approach, which proceeds directly from basic principles, is at the theoretical end of this spectrum. Early work was concentrated on developing a mathematical model of convection arising from a prescribed source of heat in an enclosure, e.g.
    [Show full text]
  • CIS 501 Computer Architecture This Unit: Shared Memory
    This Unit: Shared Memory Multiprocessors App App App • Thread-level parallelism (TLP) System software • Shared memory model • Multiplexed uniprocessor CIS 501 Mem CPUCPU I/O CPUCPU • Hardware multihreading CPUCPU Computer Architecture • Multiprocessing • Synchronization • Lock implementation Unit 9: Multicore • Locking gotchas (Shared Memory Multiprocessors) • Cache coherence Slides originally developed by Amir Roth with contributions by Milo Martin • Bus-based protocols at University of Pennsylvania with sources that included University of • Directory protocols Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. • Memory consistency models CIS 501 (Martin): Multicore 1 CIS 501 (Martin): Multicore 2 Readings Beyond Implicit Parallelism • Textbook (MA:FSPTCM) • Consider “daxpy”: • Sections 7.0, 7.1.3, 7.2-7.4 daxpy(double *x, double *y, double *z, double a): for (i = 0; i < SIZE; i++) • Section 8.2 Z[i] = a*x[i] + y[i]; • Lots of instruction-level parallelism (ILP) • Great! • But how much can we really exploit? 4 wide? 8 wide? • Limits to (efficient) super-scalar execution • But, if SIZE is 10,000, the loop has 10,000-way parallelism! • How do we exploit it? CIS 501 (Martin): Multicore 3 CIS 501 (Martin): Multicore 4 Explicit Parallelism Multiplying Performance • Consider “daxpy”: • A single processor can only be so fast daxpy(double *x, double *y, double *z, double a): • Limited clock frequency for (i = 0; i < SIZE; i++) • Limited instruction-level parallelism Z[i] = a*x[i] + y[i]; • Limited cache hierarchy • Break it
    [Show full text]