Infiniband Scalability in Open
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
The Component Architecture of Open Mpi: Enabling Third-Party Collective Algorithms∗
THE COMPONENT ARCHITECTURE OF OPEN MPI: ENABLING THIRD-PARTY COLLECTIVE ALGORITHMS∗ Jeffrey M. Squyres and Andrew Lumsdaine Open Systems Laboratory, Indiana University Bloomington, Indiana, USA [email protected] [email protected] Abstract As large-scale clusters become more distributed and heterogeneous, significant research interest has emerged in optimizing MPI collective operations because of the performance gains that can be realized. However, researchers wishing to develop new algorithms for MPI collective operations are typically faced with sig- nificant design, implementation, and logistical challenges. To address a number of needs in the MPI research community, Open MPI has been developed, a new MPI-2 implementation centered around a lightweight component architecture that provides a set of component frameworks for realizing collective algorithms, point-to-point communication, and other aspects of MPI implementations. In this paper, we focus on the collective algorithm component framework. The “coll” framework provides tools for researchers to easily design, implement, and exper- iment with new collective algorithms in the context of a production-quality MPI. Performance results with basic collective operations demonstrate that the com- ponent architecture of Open MPI does not introduce any performance penalty. Keywords: MPI implementation, Parallel computing, Component architecture, Collective algorithms, High performance 1. Introduction Although the performance of the MPI collective operations [6, 17] can be a large factor in the overall run-time of a parallel application, their optimiza- tion has not necessarily been a focus in some MPI implementations until re- ∗This work was supported by a grant from the Lilly Endowment and by National Science Foundation grant 0116050. -
(12) United States Patent (10) Patent No.: US 8,862,870 B2 Reddy Et Al
USOO886287OB2 (12) United States Patent (10) Patent No.: US 8,862,870 B2 Reddy et al. (45) Date of Patent: Oct. 14, 2014 (54) SYSTEMS AND METHODS FOR USPC .......... 713/152–154, 168, 170; 709/223, 224, MULTI-LEVELTAGGING OF ENCRYPTED 709/225 ITEMIS FOR ADDITIONAL SECURITY AND See application file for complete search history. EFFICIENT ENCRYPTED ITEM (56) References Cited DETERMINATION U.S. PATENT DOCUMENTS (75) Inventors: Anoop Reddy, Santa Clara, CA (US); 5,867,494 A 2/1999 Krishnaswamy et al. Craig Anderson, Santa Clara, CA (US) 5,909,559 A 6, 1999 SO (73) Assignee: Citrix Systems, Inc., Fort Lauderdale, (Continued) FL (US) FOREIGN PATENT DOCUMENTS (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 CN 1478348 A 2, 2004 U.S.C. 154(b) by 0 days. EP 1422.907 A2 5, 2004 (Continued) (21) Appl. No.: 13/337.735 OTHER PUBLICATIONS (22) Filed: Dec. 27, 2011 Australian Examination Report on 200728.1083 dated Nov.30, 2010. (65) Prior Publication Data (Continued) US 2012/O17387OA1 Jul. 5, 2012 Primary Examiner — Abu Sholeman (74) Attorney, Agent, or Firm — Foley & Lardner LLP: Related U.S. Application Data Christopher J. McKenna (60) Provisional application No. 61/428,138, filed on Dec. (57) ABSTRACT 29, 2010. The present disclosure is directed towards systems and meth ods for performing multi-level tagging of encrypted items for (51) Int. Cl. additional security and efficient encrypted item determina H04L 9M32 (2006.01) tion. A device intercepts a message from a server to a client, H04L 2L/00 (2006.01) parses the message and identifies a cookie. -
The Quadrics Network (Qsnet): High-Performance Clustering Technology
Proceedings of the 9th IEEE Hot Interconnects (HotI'01), Palo Alto, California, August 2001. The Quadrics Network (QsNet): High-Performance Clustering Technology Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg Computer & Computational Sciences Division Los Alamos National Laboratory ¡ fabrizio,feng,hoisie,scoll,eitanf ¢ @lanl.gov Abstract tegration into large-scale systems. While GigE resides at the low end of the performance spectrum, it provides a low-cost The Quadrics interconnection network (QsNet) con- solution. GigaNet, GSN, Myrinet, and SCI add programma- tributes two novel innovations to the field of high- bility and performance by providing communication proces- performance interconnects: (1) integration of the virtual- sors on the network interface cards and implementing differ- address spaces of individual nodes into a single, global, ent types of user-level communication protocols. virtual-address space and (2) network fault tolerance via The Quadrics network (QsNet) surpasses the above inter- link-level and end-to-end protocols that can detect faults connects in functionality by including a novel approach to and automatically re-transmit packets. QsNet achieves these integrate the local virtual memory of a node into a globally feats by extending the native operating system in the nodes shared, virtual-memory space; a programmable processor in with a network operating system and specialized hardware the network interface that allows the implementation of intel- support in the network interface. As these and other impor- ligent communication protocols; and an integrated approach tant features of QsNet can be found in the InfiniBand speci- to network fault detection and fault tolerance. Consequently, fication, QsNet can be viewed as a precursor to InfiniBand. -
High Performance Network and Channel-Based Storage
High Performance Network and Channel-Based Storage Randy H. Katz Report No. UCB/CSD 91/650 September 1991 Computer Science Division (EECS) University of California, Berkeley Berkeley, California 94720 (NASA-CR-189965) HIGH PERFORMANCE NETWORK N92-19260 AND CHANNEL-BASED STORAGE (California Univ.) 42 p CSCL 098 Unclas G3/60 0073846 High Performance Network and Channel-Based Storage Randy H. Katz Computer Science Division Department of Electrical Engineering and Computer Sciences University of California Berkeley, California 94720 Abstract: In the traditional mainframe-centered view of a computer system, storage devices are coupled to the system through complex hardware subsystems called I/O channels. With the dramatic shift towards workstation-based com- puting, and its associated client/server model of computation, storage facilities are now found attached to file servers and distributed throughout the network. In this paper, we discuss the underlying technology trends that are leading to high performance network-based storage, namely advances in networks, storage devices, and I/O controller and server architectures. We review several commercial systems and research prototypes that are leading to a new approach to high performance computing based on network-attached storage. Key Words and Phrases: High Performance Computing, Computer Networks, File and Storage Servers, Secondary and Tertiary Storage Device 1. Introduction The traditional mainframe-centered model of computing can be characterized by small numbers of large-scale mainframe computers, with shared storage devices attached via I/O channel hard- ware. Today, we are experiencing a major paradigm shift away from centralized mainframes to a distributed model of computation based on workstations and file servers connected via high per- formance networks. -
HIPPI Developments for CERN Experiments A
VERSION OF: 5-Feb-98 10:15 HIPPI Developments for CERN experiments A. van Praag ,T. Anguelov, R.A. McLaren, H.C. van der Bij, CERN, Geneva, Switzerland. J. Bovier, P. Cristin Creative Electronic Systems, Geneva, Switzerland. M. Haben, P. Jovanovic, I. Kenyon, R. Staley University of Birmingham, Birmingham, U.K. D. Cunningham, G. Watson Hewlett Packard Laboratories, Bristol, U.K. B. Green, J. Strong Royal Hollaway and Bedford New College, U.K. Abstract HIPPI Standard, fast, simple, inexpensive; is this not a contradiction in terms? The High-Performance Parallel We have decided to use the High Performance Parallel Interface (HIPPI) is a new proposed ANSI standard, using a Interface (HIPPI) to implement these links. The HIPPI minimal protocol and providing 100 Mbyte/sec transfers over specification was started in the Los Alamos laboratory in distances up to 25 m. Equipment using this standard is 1989 and is now a proposed ANSI standard (X3T9/88-127, offered by a growing number of computer manufacturers. A X3T9.3/88-23, HIPPI PH) [1,2]. This standard allows commercially available HIPPI chipset allows low cost 100 Mbyte/sec synchronous data transfers between a "Source" implementations. In this article a brief technical introduction and a "Destination". Seen from the lowest level upwards the to the HIPPI will be given, followed by examples of planned HIPPI specification proposes a logical framing hierarchy applications in High Energy Physics experiments including where the smallest unit of data to be transferred, called a the present developments involving CERN: a detector "burst" has a standard size of 256 words of 32 bit or optional emulator, a risc processor based VME connection, a long 64 bit (Fig. -
Infiniband and 10-Gigabit Ethernet for Dummies
The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at OSU Booth (SC ‘19) by Hari Subramoni The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~subramon Drivers of Modern HPC Cluster Architectures High Performance Accelerators / Coprocessors Interconnects - InfiniBand high compute density, high Multi-core <1usec latency, 200Gbps performance/watt SSD, NVMe-SSD, Processors Bandwidth> >1 TFlop DP on a chip NVRAM • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc. NetworkSummit Based Computing Sierra Sunway K - Laboratory OSU Booth SC’19TaihuLight Computer 2 Parallel Programming Models Overview P1 P2 P3 P1 P2 P3 P1 P2 P3 Memo Memo Memo Logical shared memory Memor Memor Memor Shared Memory ry ry ry y y y Shared Memory Model Distributed Memory Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface)Global Arrays, UPC, Chapel, X10, CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and Hybrid MPI+PGAS models are Network Based Computing Laboratorygradually receiving importanceOSU Booth SC’19 3 Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges Co-DesignCo-Design Application Kernels/Applications OpportuniOpportuni tiesties andand Middleware ChallengeChallenge ss acrossacross Programming Models VariousVarious MPI, PGAS (UPC, Global Arrays, OpenSHMEM), LayersLayers CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. -
PC Hardware Contents
PC Hardware Contents 1 Computer hardware 1 1.1 Von Neumann architecture ...................................... 1 1.2 Sales .................................................. 1 1.3 Different systems ........................................... 2 1.3.1 Personal computer ...................................... 2 1.3.2 Mainframe computer ..................................... 3 1.3.3 Departmental computing ................................... 4 1.3.4 Supercomputer ........................................ 4 1.4 See also ................................................ 4 1.5 References ............................................... 4 1.6 External links ............................................. 4 2 Central processing unit 5 2.1 History ................................................. 5 2.1.1 Transistor and integrated circuit CPUs ............................ 6 2.1.2 Microprocessors ....................................... 7 2.2 Operation ............................................... 8 2.2.1 Fetch ............................................. 8 2.2.2 Decode ............................................ 8 2.2.3 Execute ............................................ 9 2.3 Design and implementation ...................................... 9 2.3.1 Control unit .......................................... 9 2.3.2 Arithmetic logic unit ..................................... 9 2.3.3 Integer range ......................................... 10 2.3.4 Clock rate ........................................... 10 2.3.5 Parallelism ......................................... -
User Manual Revision: C96e096
Bright Cluster Manager 8.1 User Manual Revision: c96e096 Date: Fri Sep 14 2018 ©2018 Bright Computing, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Bright Computing, Inc. Trademarks Linux is a registered trademark of Linus Torvalds. PathScale is a registered trademark of Cray, Inc. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SUSE is a registered trademark of Novell, Inc. PGI is a registered trademark of NVIDIA Corporation. FLEXlm is a registered trademark of Flexera Software, Inc. PBS Professional, PBS Pro, and Green Provisioning are trademarks of Altair Engineering, Inc. All other trademarks are the property of their respective owners. Rights and Restrictions All statements, specifications, recommendations, and technical information contained herein are current or planned as of the date of publication of this document. They are reliable as of the time of this writing and are presented without warranty of any kind, expressed or implied. Bright Computing, Inc. shall not be liable for technical or editorial errors or omissions which may occur in this document. Bright Computing, Inc. shall not be liable for any damages resulting from the use of this document. Limitation of Liability and Damages Pertaining to Bright Computing, Inc. The Bright Cluster Manager product principally consists of free software that is licensed by the Linux authors free of charge. Bright Computing, Inc. shall have no liability nor will Bright Computing, Inc. provide any warranty for the Bright Cluster Manager to the extent that is permitted by law. -
Setting up a Beowulf Cluster Using Open MPI on Linux
Setting up a Beowulf Cluster Using Open MPI on Linux https://techtinkering.com/2009/12/02/setting-up-a-beowulf-c... TechTinkering ============= Twitter YouTube Email GitHub RSS Setting up a Beowulf Cluster Using Open MPI on Linux 2 December 2009 Lawrence Woodman #Beowulf Clusters #Distributed Processing #High Performance Computing #Linux #MPI #Parallel Processing I have been doing a lot of work recently on Linear Genetic Programming. This requires a great deal of processing power and to meet this I have been using Open MPI to create a Linux cluster. What follows is a quick guide to getting a cluster running. The basics really are very simple and, depending on the size, you can get a simple cluster running in less than half an hour, assuming you already have the machines networked and running Linux. What is a Beowulf Cluster? A Beowulf Cluster is a collection of privately networked computers which can be used for parallel computations. They consist of commodity hardware running open source software such as Linux or a BSD, often coupled with PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). A standard set up will consist of one master node which will control a number of slave nodes. The slave nodes are typically headless and generally all access the same files from a server. They have been referred to as Beowulf Clusters since before 1998 when the original Beowulf HOWTO was released. What is Open MPI? Open MPI is one of the leading MPI-2 implementations used by many of the TOP 500 Supercomputers. It is open source and is developed and maintained by a consortium of academic, research, and industry partners. -
Lecture 12: I/O: Metrics, a Little Queuing Theory, and Busses
Lecture 12: I/O: Metrics, A Little Queuing Theory, and Busses Professor David A. Patterson Computer Science 252 Fall 1996 DAP.F96 1 Review: Disk Device Terminology Disk Latency = Queuing Time + Seek Time + Rotation Time + Xfer Time Order of magnitude times for 4K byte transfers: Seek: 12 ms or less Rotate: 4.2 ms @ 7200 rpm (8.3 ms @ 3600 rpm ) Xfer: 1 ms @ 7200 rpm (2 ms @ 3600 rpm) DAP.F96 2 Review: R-DAT Technology 2000 RPM Four Head Recording Helical Recording Scheme Tracks Recorded ±20° w/o guard band Read After Write Verify DAP.F96 3 Review: Automated Cartridge System STC 4400 8 feet 10 feet 6000 x 0.8 GB 3490 tapes = 5 TBytes in 1992 $500,000 O.E.M. Price 6000 x 20 GB D3 tapes = 120 TBytes in 1994 1 Petabyte (1024 TBytes) in 2000 DAP.F96 4 Review: Storage System Issues • Historical Context of Storage I/O • Secondary and Tertiary Storage Devices • Storage I/O Performance Measures • A Little Queuing Theory • Processor Interface Issues • I/O Buses • Redundant Arrarys of Inexpensive Disks (RAID) • ABCs of UNIX File Systems • I/O Benchmarks • Comparing UNIX File System Performance DAP.F96 5 Disk I/O Performance 300 Response Metrics: Time (ms) Response Time Throughput 200 100 0 0% 100% Throughput (% total BW) Queue Proc IOC Device Response time = Queue + Device Service time DAP.F96 6 Response Time vs. Productivity • Interactive environments: Each interaction or transaction has 3 parts: – Entry Time: time for user to enter command – System Response Time: time between user entry & system replies – Think Time: Time from response until user -
Hybrid MPI & Openmp Parallel Programming
Hybrid MPI & OpenMP Parallel Programming MPI + OpenMP and other models on clusters of SMP nodes Rolf Rabenseifner1) Georg Hager2) Gabriele Jost3) [email protected] [email protected] [email protected] 1) High Performance Computing Center (HLRS), University of Stuttgart, Germany 2) Regional Computing Center (RRZE), University of Erlangen, Germany 3) Texas Advanced Computing Center, The University of Texas at Austin, USA Tutorial M02 at SC10, November 15, 2010, New Orleans, LA, USA Hybrid Parallel Programming Slide 1 Höchstleistungsrechenzentrum Stuttgart Outline slide number • Introduction / Motivation 2 • Programming models on clusters of SMP nodes 6 8:30 – 10:00 • Case Studies / pure MPI vs hybrid MPI+OpenMP 13 • Practical “How-To” on hybrid programming 48 • Mismatch Problems 97 • Opportunities: Application categories that can 126 benefit from hybrid parallelization • Thread-safety quality of MPI libraries 136 10:30 – 12:00 • Tools for debugging and profiling MPI+OpenMP 143 • Other options on clusters of SMP nodes 150 • Summary 164 • Appendix 172 • Content (detailed) 188 Hybrid Parallel Programming Slide 2 / 169 Rabenseifner, Hager, Jost Motivation • Efficient programming of clusters of SMP nodes SMP nodes SMP nodes: cores shared • Dual/multi core CPUs memory • Multi CPU shared memory Node Interconnect • Multi CPU ccNUMA • Any mixture with shared memory programming model • Hardware range • mini-cluster with dual-core CPUs Core • … CPU(socket) • large constellations with large SMP nodes SMP board … with several sockets -
Bright Cluster Manager User Manual
Bright Cluster Manager 6.0 User Manual Revision: 3473 Date: Tue, 22 Jan 2013 Table of Contents Table of Contents . i 1 Introduction 1 1.1 What Is A Beowulf Cluster? . 1 1.2 Brief Network Description . 2 2 Cluster Usage 5 2.1 Login To The Cluster Environment . 5 2.2 Setting Up The User Environment . 6 2.3 Environment Modules . 6 2.4 Compiling Applications . 9 3 Using MPI 11 3.1 Interconnects . 11 3.2 Selecting An MPI implementation . 12 3.3 Example MPI Run . 12 4 Workload Management 17 4.1 What Is A Workload Manager? . 17 4.2 Why Use A Workload Manager? . 17 4.3 What Does A Workload Manager Do? . 17 4.4 Job Submission Process . 18 4.5 What Do Job Scripts Look Like? . 18 4.6 Running Jobs On A Workload Manager . 18 4.7 Running Jobs In Cluster Extension Cloud Nodes Using cmsub 19 5 SLURM 21 5.1 Loading SLURM Modules And Compiling The Executable 21 5.2 Running The Executable With salloc ............ 22 5.3 Running The Executable As A SLURM Job Script . 24 6 SGE 29 6.1 Writing A Job Script . 29 6.2 Submitting A Job . 33 6.3 Monitoring A Job . 34 6.4 Deleting A Job . 35 7 PBS Variants: Torque And PBS Pro 37 7.1 Components Of A Job Script . 38 7.2 Submitting A Job . 44 ii Table of Contents 8 Using GPUs 51 8.1 Packages . 51 8.2 Using CUDA . 51 8.3 Using OpenCL . 52 8.4 Compiling Code .