Infiniband Scalability in Open MPI


Infiniband Scalability in Open MPI

G. M. Shipman (1,2), T. S. Woodall (1), R. L. Graham (1), A. B. Maccabe (2)
(1) Advanced Computing Laboratory, Los Alamos National Laboratory
(2) Scalable Systems Laboratory, Computer Science Department, University of New Mexico

Abstract

Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new production grade implementation of the MPI standard, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.

1 Introduction

High performance computing (HPC) systems are continuing a trend toward distributed memory clusters built from commodity components. Many of these systems use commodity or 'near' commodity interconnects, including Myrinet [17], Quadrics [3], Gigabit Ethernet and, recently, Infiniband [1]. Infiniband (IB) is increasingly deployed in small to medium sized commodity clusters; IB's low price relative to its performance has made it attractive to the HPC market.

Of the available distributed memory programming models, the Message Passing Interface (MPI) standard [16] is currently the most widely used. Several MPI implementations support Infiniband, including Open MPI [10], MVAPICH [15], LA-MPI [11] and NCSA MPI [18]. However, there are concerns about the scalability of Infiniband for MPI applications, partially arising from the fact that Infiniband was initially developed as a general I/O fabric technology and was not specifically targeted at HPC [4].

In this paper, we describe Open MPI's scalable support for Infiniband. In particular, Open MPI makes use of Infiniband features not currently used by other MPI/IB implementations, allowing Open MPI to scale more effectively than current implementations. We illustrate the scalability of Open MPI's Infiniband support through comparisons with the widely used MVAPICH implementation, and show that Open MPI uses less memory and provides better latency than MVAPICH on medium/large-scale clusters.

The remainder of this paper is organized as follows. Section 2 presents a brief overview of Open MPI's general point-to-point message design. Section 3 discusses the Infiniband architecture, including its current limitations. MVAPICH is discussed in section 4, including potential scalability issues relating to this implementation. Section 5 provides a detailed description of Infiniband support in Open MPI. Scalability and performance results are discussed in section 6, followed by conclusions and future work in section 7.

2 Open MPI

The Open MPI Project is a collaborative effort by Los Alamos National Laboratory, the Open Systems Laboratory at Indiana University, the Innovative Computing Laboratory at the University of Tennessee and the High Performance Computing Center at the University of Stuttgart (HLRS). The goal of this project is to develop a next generation implementation of the Message Passing Interface. Open MPI draws upon the unique expertise of each of these groups, which includes prior work on LA-MPI, LAM/MPI [20], FT-MPI [9] and PAX-MPI [13]. Open MPI is, however, a completely new MPI, designed from the ground up to address the demands of current and next generation architectures and interconnects.

Open MPI is based on a Modular Component Architecture (MCA) [19]. This architecture supports the runtime selection of components that are optimized for a specific operating environment, and multiple network interconnects are supported through the MCA. Currently there are two Infiniband components in Open MPI: one supporting the OpenIB Verbs API and another supporting the Mellanox Verbs API. In addition to being highly optimized for scalability, these components provide a number of performance and scalability parameters which allow for easy tuning.

The Open MPI point-to-point (p2p) design and implementation is based on multiple MCA frameworks. These frameworks provide functional isolation with clearly defined interfaces. Figure 1 illustrates the p2p framework architecture.

[Figure 1: Open MPI p2p framework. The MPI layer sits above the PML and the BML; below the BML are the OpenIB and SM BTLs, each backed by an MPool, with an Rcache serving the Infiniband BTLs.]

As shown in Figure 1, the architecture consists of four layers. Working from the bottom up, these layers are the Byte Transfer Layer (BTL), the BTL Management Layer (BML), the Point-to-Point Messaging Layer (PML) and the MPI layer. Each of these layers is implemented as an MCA framework. Other MCA frameworks shown are the Memory Pool (MPool) and the Registration Cache (Rcache). While these are illustrated and defined as layers, critical send/receive paths bypass the BML, as it is used primarily during initialization and BTL selection.
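To make the layering concrete, the following is a minimal C sketch of a BTL-style interface. It is an illustration only, not Open MPI's actual mca_btl API: the type and member names (btl_module_t, btl_send, btl_put, btl_get, pml_send_fragment, eager_limit) are hypothetical. It captures the division of labor described above: a BTL only moves bytes over one transport, while protocol decisions belong to the PML.

```c
/* Hypothetical, simplified BTL-style interface; illustration only,
 * not Open MPI's real mca_btl API. */
#include <stddef.h>
#include <stdint.h>

typedef struct btl_descriptor {
    void    *addr;        /* local buffer (registered if the transport requires it) */
    size_t   length;      /* number of bytes to move                                */
    uint64_t remote_addr; /* target virtual address, RDMA operations only           */
    uint32_t remote_key;  /* remote registration key, RDMA operations only          */
} btl_descriptor_t;

typedef struct btl_module btl_module_t;

/* A BTL exposes byte-movement primitives only; it carries no MPI semantics. */
struct btl_module {
    size_t eager_limit;   /* largest payload the PML sends eagerly         */
    size_t max_send_size; /* largest single send fragment                  */
    int (*btl_send)(btl_module_t *btl, btl_descriptor_t *des); /* two-sided */
    int (*btl_put) (btl_module_t *btl, btl_descriptor_t *des); /* RDMA write */
    int (*btl_get) (btl_module_t *btl, btl_descriptor_t *des); /* RDMA read  */
};

/* PML-side scheduling decision: the PML picks send vs. RDMA per fragment
 * using BTL attributes, while the BTL just moves the bytes. */
static int pml_send_fragment(btl_module_t *btl, btl_descriptor_t *des)
{
    return (des->length <= btl->eager_limit) ? btl->btl_send(btl, des)
                                             : btl->btl_put(btl, des);
}
```

In the real framework the BML supplies the set of such modules reachable for each peer, and the PML may stripe a single MPI message across several of them.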
MPool: The memory pool provides memory allocation/deallocation and registration/deregistration services. Infiniband requires memory to be registered (physical pages present and pinned) before send/receive or RDMA operations can use that memory as a source or target. Separating this functionality from other components allows the MPool to be shared among the various layers. For example, MPI_ALLOC_MEM uses these MPools to register memory with the available interconnects.

Rcache: The registration cache allows memory pools to cache registered memory for later operations. When initialized, MPI message buffers are registered with the MPool and cached via the Rcache. For example, during an MPI_SEND the source buffer is registered with the memory pool and this registration may then be cached, depending on the protocol in use. During subsequent MPI_SEND operations the source buffer is checked against the Rcache, and if the registration exists the PML may RDMA the entire buffer in a single operation without incurring the high cost of registration. (A sketch of this register-and-cache pattern follows these framework descriptions.)

BTL: The BTL modules expose the underlying semantics of the network interconnect in a consistent form. BTLs expose a set of communication primitives appropriate for both send/receive and RDMA interfaces. The BTL is not aware of any MPI semantics; it simply moves a sequence of bytes (potentially non-contiguous) across the underlying transport. This simplicity enables early adoption of novel network devices and encourages vendor support. Several BTL modules are currently available, including TCP, GM, Portals, Shared Memory (SM), Mellanox VAPI and OpenIB VAPI. The Mellanox VAPI and OpenIB VAPI BTLs are discussed in a later section.

BML: The BML acts as a thin multiplexing layer, allowing the BTLs to be shared among multiple upper layers. Discovery of peer resources is coordinated by the BML and cached for multiple consumers of the BTLs. After resource discovery, the BML layer may be safely bypassed by upper layers for performance. The current BML component is named R2.

PML: The PML implements all logic for p2p MPI semantics, including the standard, buffered, ready, and synchronous communication modes. MPI message transfers are scheduled by the PML based on a specific policy, which incorporates BTL-specific attributes to schedule MPI messages. Short and long message protocols are implemented within the PML. All control messages (ACK/NACK/MATCH) are also managed at the PML. The benefit of this structure is a separation of the transport protocol from the underlying interconnects, which significantly reduces both code complexity and code redundancy, enhancing maintainability. There are currently three PMLs available in the Open MPI code base. This paper discusses OB1, the latest generation PML, in a later section.
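The register-and-cache pattern referenced in the Rcache description above can be sketched as follows, assuming the OpenIB (libibverbs) verbs API. The ibv_reg_mr call is the real verbs registration primitive; the fixed-size array cache and the rcache_find/rcache_register helpers are simplified stand-ins for Open MPI's actual Rcache and are not its implementation.

```c
/* Minimal sketch of the register-and-cache pattern, assuming the OpenIB
 * (libibverbs) verbs API. A real Rcache uses a tree keyed on address
 * ranges; a flat array keeps the idea visible. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
static struct reg_entry cache[64];
static int cache_used;

/* Return a cached registration covering [addr, addr+len), or NULL. */
static struct ibv_mr *rcache_find(void *addr, size_t len)
{
    for (int i = 0; i < cache_used; i++)
        if ((char *)cache[i].addr <= (char *)addr &&
            (char *)addr + len <= (char *)cache[i].addr + cache[i].len)
            return cache[i].mr;
    return NULL;
}

/* Register on a miss and remember the result; registration pins the
 * pages and is far more expensive than the cache lookup. */
static struct ibv_mr *rcache_register(struct ibv_pd *pd, void *addr, size_t len)
{
    struct ibv_mr *mr = rcache_find(addr, len);
    if (mr != NULL)
        return mr;                     /* hit: reuse the prior registration */
    mr = ibv_reg_mr(pd, addr, len,
                    IBV_ACCESS_LOCAL_WRITE |
                    IBV_ACCESS_REMOTE_WRITE |
                    IBV_ACCESS_REMOTE_READ);
    if (mr != NULL && cache_used < 64)
        cache[cache_used++] = (struct reg_entry){ addr, len, mr };
    return mr;
}
```

The value of the cache is that ibv_reg_mr pins physical pages and is costly, so repeated sends from the same buffer can reuse an earlier registration instead of paying that cost on the critical path.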
During startup, a PML component is selected and initialized. The selected PML defaults to OB1 but may be overridden by a run-time parameter or environment setting. Next the BML component R2 is selected. R2 then opens and initializes all available BTL modules. During BTL module initialization, R2 directs peer resource discovery on a per-BTL basis. This allows the peers to negotiate which set of interfaces they will use to communicate with each other. This infrastructure allows for heterogeneous networking interconnects within a cluster.

3 Infiniband

The Infiniband specification is published by the Infiniband Trade Association (ITA), originally created by Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems. IB was originally proposed as a general I/O technology, allowing a single I/O fabric to replace multiple existing fabrics. The goal of a single I/O fabric has faded, and Infiniband is currently targeted as an Inter-Process Communication (IPC) and Storage Area Network (SAN) interconnect technology.

Infiniband, like Myrinet and Quadrics, provides both Remote Direct Memory Access (RDMA) and Operating System (OS) bypass facilities. RDMA enables data transfer from the address space of an application process to a peer process across the network fabric without requiring involvement of the host CPU. Infiniband RDMA operations support both two-sided send/receive and one-sided put/get semantics.

Two-sided send/receive operations are initiated by enqueueing a send WQE (work queue entry) on a QP's (queue pair's) send queue. The WQE specifies only the sender's local buffer. The remote process must pre-post a receive WQE on the corresponding receive queue, which specifies a local buffer address to be used as the destination of the receive. Send completion indicates that the send WQE has completed locally and results in a sender-side CQ (completion queue) entry. When the transfer actually completes, a CQ entry is posted to the receiver's CQ.

One-sided RDMA operations are likewise initiated by enqueueing an RDMA WQE on the send queue. However, this WQE specifies both the source and target virtual addresses, along with the remote key of the target memory region.
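The WQE/QP/CQ flow just described maps directly onto the OpenIB (libibverbs) verbs API. The sketch below uses the real ibv_post_recv, ibv_post_send and ibv_poll_cq calls, but assumes the queue pair, completion queue and memory registration have already been established; error handling is minimal.

```c
/* Sketch of the two-sided send/receive flow over the OpenIB (libibverbs)
 * verbs API; QP, CQ and memory registration setup are assumed done. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Receiver: pre-post a receive WQE naming a local destination buffer. */
static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}

/* Sender: enqueue a send WQE on the QP's send queue; it names only the
 * sender's local buffer. Completion is later reported through the CQ. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}

/* Either side: drain the completion queue to learn that a WQE finished. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) > 0)
        if (wc.status != IBV_WC_SUCCESS)
            return -1;          /* transport error on this work request */
    return n;                   /* 0 once the CQ is empty               */
}
```

A one-sided RDMA write is posted the same way, except that the send WQE's opcode is IBV_WR_RDMA_WRITE and its wr.rdma.remote_addr and wr.rdma.rkey fields identify the target buffer, matching the one-sided semantics described above.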