Chapter 1: Cluster Computing at a Glance, by M. Baker and R. Buyya
High Performance Cluster Computing: Architectures and Systems
Book Editor: Rajkumar Buyya
Slides Prepared by: Hai Jin
Internet and Cluster Computing Center
http://www.buyya.com/cluster/

Agenda
- Introduction
- Scalable Parallel Computer Architectures
- Towards Low Cost Parallel Computing and Motivations
- Windows of Opportunity
- A Cluster Computer and its Architecture
- Clusters Classifications
- Commodity Components for Clusters
- Network Services/Communications SW
- Cluster Middleware and Single System Image
- Resource Management and Scheduling (RMS)
- Programming Environments and Tools
- Cluster Applications
- Representative Cluster Systems
- Cluster of SMPs (CLUMPS)
- Summary and Conclusions
Introduction: How to Run Applications Faster?
Need more computing power. There are 3 ways to improve performance:
- Work harder
- Work smarter
- Get help

Computer analogy:
- Work harder: use faster hardware, i.e., improve the operating speed of processors & other components (constrained by the speed of light, thermodynamic laws, & the high financial cost of processor fabrication)
- Work smarter: use optimized algorithms and techniques to solve computational tasks
- Get help: connect multiple processors together & coordinate their computational efforts (parallel computers allow the sharing of a computational task among multiple processors)
Two Eras of Computing
- Rapid technical advances
  - the recent advances in VLSI technology
  - software technology: OS, PL, development methodologies, & tools
  - grand challenge applications have become the main driving force
- Parallel computing
  - one of the best ways to overcome the speed bottleneck of a single processor
  - good price/performance ratio of a small cluster-based parallel computer
Scalable Parallel Computer Architectures
- Taxonomy: based on how processors, memory & interconnect are laid out
  - Massively Parallel Processors (MPP)
  - Symmetric Multiprocessors (SMP)
  - Cache-Coherent Nonuniform Memory Access (CC-NUMA)
  - Distributed Systems
  - Clusters
- MPP
  - a large parallel processing system with a shared-nothing architecture
  - consists of several hundred nodes with a high-speed interconnection network/switch
  - each node consists of a main memory & one or more processors
  - runs a separate copy of the OS
- SMP
  - 2-64 processors today
  - shared-everything architecture
  - all processors share all the global resources available
  - a single copy of the OS runs on these systems
Scalable Parallel Computer Architectures (cont'd)
- CC-NUMA
  - a scalable multiprocessor system having a cache-coherent nonuniform memory access architecture
  - every processor has a global view of all of the memory
- Distributed systems
  - considered conventional networks of independent computers
  - have multiple system images, as each node runs its own OS
  - the individual machines could be combinations of MPPs, SMPs, clusters, & individual computers
- Clusters
  - a collection of workstations or PCs that are interconnected by a high-speed network
  - work as an integrated collection of resources
  - have a single system image spanning all its nodes

[Table: Key Characteristics of Scalable Parallel Computers]
Towards Low Cost Parallel Computing
- Parallel processing: linking together 2 or more computers to jointly solve some computational problem
- Since the early 1990s there has been an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards networks of workstations
  - driven by the rapid improvement in the availability of commodity high performance components for workstations and networks
- Low-cost commodity supercomputing: from specialized traditional supercomputing platforms to cheaper, general purpose systems consisting of loosely coupled components built up from single or multiprocessor PCs or workstations
- Need for standardization of many of the tools and utilities used by parallel applications (e.g., MPI, HPF)

Motivations of using NOW over Specialized Parallel Computers
- Individual workstations are becoming increasingly powerful
- Communication bandwidth between workstations is increasing and latency is decreasing
- Workstation clusters are easier to integrate into existing networks
- Typically low user utilization of personal workstations
- Development tools for workstations are more mature
- Workstation clusters are a cheap and readily available alternative
- Clusters can be easily grown
Trend
- Workstations with UNIX for science & industry vs PC-based machines for administrative work & word processing
- A rapid convergence in processor performance and kernel-level functionality of UNIX workstations and PC-based machines

Windows of Opportunities
- Parallel processing
  - use multiple processors to build MPP/DSM-like systems for parallel computing
- Network RAM
  - use the memory associated with each workstation as an aggregate DRAM cache
- Software RAID (redundant array of inexpensive disks)
  - use arrays of workstation disks to provide cheap, highly available, & scalable file storage
  - possible to provide parallel I/O support to applications
- Multipath communication
  - use multiple networks for parallel data transfer between nodes
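The Software RAID idea above (striping data across workstation disks for cheap aggregate bandwidth) can be sketched as follows. This is a hedched, minimal RAID-0-style illustration only, with in-memory byte buffers standing in for disks and a hypothetical 4-byte stripe unit; real software RAID works on block devices and adds mirroring or parity for the availability the slide mentions.

```python
# Sketch of RAID-0-style striping across workstation "disks" (byte buffers).
CHUNK = 4  # stripe unit in bytes (hypothetical)

def stripe(data: bytes, n_disks: int):
    """Split data round-robin into n_disks stripes of CHUNK-sized units."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), CHUNK):
        disks[(i // CHUNK) % n_disks] += data[i:i + CHUNK]
    return disks

def destripe(disks, total_len: int) -> bytes:
    """Reassemble the original byte stream from the stripes."""
    out = bytearray()
    offsets = [0] * len(disks)
    i = 0
    while len(out) < total_len:
        d = i % len(disks)
        out += disks[d][offsets[d]:offsets[d] + CHUNK]
        offsets[d] += CHUNK
        i += 1
    return bytes(out)

data = b"parallel file storage across workstation disks"
disks = stripe(data, 3)
assert destripe(disks, len(data)) == data
```

Reads and writes of consecutive chunks hit different disks, which is where the parallel I/O bandwidth comes from.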
Cluster Computer and its Architecture
- A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
- A node
  - a single or multiprocessor system with memory, I/O facilities, & OS
- Generally 2 or more computers (nodes) connected together
  - in a single cabinet, or physically separated & connected via a LAN
  - appear as a single system to users and applications
  - provide a cost-effective way to gain features and benefits
Prominent Components of Cluster Computers (I)
- Multiple high performance computers
  - PCs
  - Workstations
  - SMPs (CLUMPS)
  - Distributed HPC systems leading to metacomputing

[Figure: a Beowulf cluster - a Head Node connected through a Switch to Client Node1, Client Node2]
Prominent Components of Cluster Computers (II)
- High performance networks/switches
  - Ethernet (10Mbps), Fast Ethernet (100Mbps)
  - Gigabit Ethernet (1Gbps)
  - InfiniBand (1-8 Gbps)
  - SCI (Dolphin - MPI - 12 microsec latency)
  - ATM
  - Myrinet (1.2Gbps)
  - Digital Memory Channel
  - FDDI

Prominent Components of Cluster Computers (III)
- State-of-the-art operating systems
  - Linux (Beowulf)
  - Microsoft NT (Illinois HPVM)
  - Sun Solaris (Berkeley NOW)
  - IBM AIX (IBM SP2)
  - HP UX (Illinois - PANDA)
  - Mach, a microkernel-based OS (CMU)
  - cluster operating systems (Solaris MC, SCO Unixware, MOSIX (academic project))
  - OS gluing layers (Berkeley GLUnix)
Prominent Components of Cluster Computers (IV)
- Network interface cards
  - Myrinet NIC
  - InfiniBand HBA
  - user-level access support

Prominent Components of Cluster Computers (V)
- Fast communication protocols and services
  - Active Messages (Berkeley)
  - Fast Messages (Illinois)
  - U-net (Cornell)
  - XTP (Virginia)
Prominent Components of Cluster Computers (VI)
- Cluster middleware
  - Single System Image (SSI)
  - System Availability (SA) infrastructure
- Hardware
  - DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
- Operating system kernel/gluing layers
  - Solaris MC, Unixware, GLUnix
- Applications and subsystems
  - applications (system management and electronic forms)
  - runtime systems (software DSM, PFS, etc.)
  - resource management and scheduling software (RMS): CODINE, LSF, PBS, NQS, etc.

Prominent Components of Cluster Computers (VII)
- Parallel programming environments and tools
  - threads (PCs, SMPs, NOW..): POSIX threads, Java threads
  - MPI: Linux, NT, on many supercomputers
  - PVM
  - software DSMs (Shmem)
  - compilers: C/C++/Java; parallel programming with C++ (MIT Press book)
  - RAD (rapid application development) tools: GUI-based tools for PP modeling
  - debuggers
  - performance analysis tools
  - visualization tools
Prominent Components of Cluster Computers (VIII)
- Applications
  - sequential
  - parallel / distributed (cluster-aware applications)
    - Grand Challenge applications: weather forecasting, quantum chemistry, molecular biology modeling, engineering analysis (CAD/CAM), ...
    - PDBs, web servers, data-mining

Key Operational Benefits of Clustering
- High performance
- Expandability and scalability
- High throughput
- High availability
Clusters Classification (I)
- Application target
  - High Performance (HP) clusters: Grand Challenge applications
  - High Availability (HA) clusters: mission-critical applications

Clusters Classification (II)
- Node ownership
  - Dedicated clusters
  - Non-dedicated clusters
    - adaptive parallel computing
    - communal multiprocessing
Clusters Classification (III)
- Node operating system
  - Linux clusters (e.g., Beowulf)
  - Solaris clusters (e.g., Berkeley NOW)
  - NT clusters (e.g., HPVM)
  - AIX clusters (e.g., IBM SP2)
  - SCO/Compaq clusters (Unixware)
  - Digital VMS clusters
  - HP-UX clusters
  - Microsoft Wolfpack clusters

Clusters Classification (IV)
- Node hardware
  - Clusters of PCs (CoPs) / Piles of PCs (PoPs)
  - Clusters of Workstations (COWs)
  - Clusters of SMPs (CLUMPs)
Clusters Classification (V)
- Levels of clustering
  - Group clusters (#nodes: 2-99): nodes connected by a SAN like Myrinet
  - Departmental clusters (#nodes: 10s to 100s)
  - Organizational clusters (#nodes: many 100s)
  - National metacomputers (WAN/Internet-based)
  - International metacomputers (Internet-based, #nodes: 1000s to many millions)
    - metacomputing
    - web-based computing
    - agent-based computing
    - Java plays a major role in web- and agent-based computing

Clusters Classification (VI)
- Node configuration
  - Homogeneous clusters: all nodes have similar architectures and run the same OSs
  - Heterogeneous clusters: nodes have different architectures and run different OSs
Commodity Components for Clusters (I)
- Processors
  - Intel x86 processors: Pentium Pro and Pentium Xeon; AMD x86, Cyrix x86, etc.
  - Digital Alpha: the Alpha 21364 processor integrates processing, memory controller, & network interface into a single chip
  - IBM PowerPC
  - Sun SPARC
  - SGI MIPS
  - HP PA
  - Berkeley Intelligent RAM (IRAM) integrates processor and DRAM onto a single chip

Commodity Components for Clusters (II)
- Memory and cache
  - Standard Industry Memory Module (SIMM)
  - Extended Data Out (EDO): allows the next access to begin while the previous data is still being read
  - Fast Page: allows multiple adjacent accesses to be made more efficiently
  - access to DRAM is extremely slow compared to the speed of the processor
  - the very fast memory used for cache is expensive, & cache control circuitry becomes more complex as the size of the cache grows
  - within Pentium-based machines, it is not uncommon to have a 64-bit wide memory bus as well as a chip set that supports 2 MB of external cache
Commodity Components for Clusters (III)
- Disk and I/O
  - overall improvement in disk access time has been less than 10% per year
  - Amdahl's law: the speed-up obtained from faster processors is limited by the slowest system component
  - Parallel I/O: carry out I/O operations in parallel, supported by parallel file systems based on hardware or software RAID

Commodity Components for Clusters (IV)
- System bus
  - ISA bus (AT bus): clocked at 5 MHz and 8 bits wide; later clocked at 13 MHz and 16 bits wide
  - VESA bus: 32-bit bus matched to the system's clock speed
  - PCI bus: 133 MB/s transfer rate; adopted both in Pentium-based PCs and non-Intel platforms (e.g., Digital AlphaServer)
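The Amdahl's law point above can be made precise. If a fraction $f$ of the total execution time goes to a component that is not improved (e.g., disk I/O) and the remaining fraction $1-f$ is sped up by a factor $s$, the overall speed-up is

```latex
S \;=\; \frac{T_{\text{old}}}{T_{\text{new}}}
  \;=\; \frac{1}{\,f + \dfrac{1-f}{s}\,}
  \;\le\; \frac{1}{f}
```

So even with infinitely fast processors ($s \to \infty$), a program that spends 10% of its time in disk I/O can be sped up at most 10x, which is why slowly improving disk access times motivate parallel I/O.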
Commodity Components for Clusters (V)
- Cluster interconnects
  - nodes communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as AM
  - Standard Ethernet
    - 10 Mbps
    - cheap, easy way to provide file and printer sharing
    - bandwidth & latency are not balanced with the computational power
  - Fast Ethernet: 100 Mbps
  - Gigabit Ethernet
    - preserves Ethernet's simplicity
    - delivers a very high bandwidth to aggregate multiple Fast Ethernet segments

Commodity Components for Clusters (VI)
- Cluster interconnects
  - Asynchronous Transfer Mode (ATM)
    - switched virtual-circuit technology
    - cell (small fixed-size data packet)
    - uses optical fiber - expensive upgrade
    - telephone-style cables (CAT-3) & better-quality cable (CAT-5)
  - Scalable Coherent Interface (SCI)
    - IEEE 1596-1992 standard aimed at providing a low-latency distributed shared memory across a cluster
    - point-to-point architecture with directory-based cache coherence
    - reduces the delay of interprocessor communication
    - eliminates the need for runtime layers of software protocol-paradigm translation
    - less than 12 usec zero message-length latency on Sun SPARC
    - designed to support distributed multiprocessing with high bandwidth and low latency
    - SCI cards for SPARC's SBus and PCI-based SCI cards from Dolphin
    - scalability constrained by the current generation of switches & relatively expensive components
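A first-order way to compare the interconnects above is the simple cost model time = startup latency + size / bandwidth. The sketch below uses the slides' nominal bandwidths and the Myrinet/SCI-class latencies; the Ethernet latencies and message sizes are illustrative assumptions, not measured figures.

```python
# First-order cost model for a point-to-point message:
#   time = startup latency + message_size / bandwidth

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

MB = 1 << 20
nets = {
    # name: (one-way latency in seconds, bandwidth in bytes/s) -- approximate
    "Fast Ethernet":    (100e-6, 100e6 / 8),   # latency is an assumption
    "Gigabit Ethernet": (100e-6, 1e9 / 8),     # latency is an assumption
    "Myrinet":          (5e-6,   1.28e9 / 8),  # 5 us latency from the slides
}

for name, (lat, bw) in nets.items():
    small = transfer_time(64, lat, bw)       # latency-dominated regime
    large = transfer_time(1 * MB, lat, bw)   # bandwidth-dominated regime
    print(f"{name}: 64 B -> {small * 1e6:.1f} us, 1 MB -> {large * 1e3:.2f} ms")
```

The two regimes explain why low-latency interconnects like SCI and Myrinet matter for fine-grained communication even when Ethernet bandwidth looks adequate on paper.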
Commodity Components for Clusters (VII)
- Cluster interconnects: Myrinet
  - 1.28 Gbps full-duplex interconnection network
  - uses low-latency cut-through routing switches, which are able to offer fault tolerance by automatic mapping of the network configuration
  - supports both Linux & NT
  - Advantages
    - very low latency (5 us, one-way point-to-point)
    - very high throughput
    - programmable on-board processor for greater flexibility
  - Disadvantages
    - expensive: $1500 per host
    - complicated scaling: switches with more than 16 ports are unavailable

Commodity Components for Clusters (VIII)
- Operating systems
  - 2 fundamental services for users
    - make the computer hardware easier to use: create a virtual machine that differs markedly from the real machine
    - share hardware resources among users: processor - multitasking
  - The new concept in OS services
    - support multiple threads of control in a process itself: parallelism within a process, multithreading
    - the POSIX thread interface is a standard programming environment
  - Trend
    - modularity - MS Windows, IBM OS/2
    - microkernel - provides only essential OS services: high-level abstraction of OS, portability
Commodity Components for Clusters (IX)
- Operating systems: Linux
  - UNIX-like OS
  - runs on cheap x86 platforms, yet offers the power and flexibility of UNIX
  - readily available on the Internet and can be downloaded without cost
  - easy to fix bugs and improve system performance
  - users can develop or fine-tune hardware drivers, which can easily be made available to other users
  - features such as preemptive multitasking, demand-paged virtual memory, multiuser & multiprocessor support

Commodity Components for Clusters (X)
- Operating systems: Solaris
  - UNIX-based multithreading and multiuser OS
  - supports Intel x86 & SPARC-based platforms
  - real-time scheduling feature critical for multimedia applications
  - supports two kinds of threads: Light Weight Processes (LWPs) and user-level threads
  - supports both BSD and several non-BSD file systems
    - CacheFS, AutoClient
    - TmpFS: uses main memory to contain a file system
    - Proc file system, Volume file system
  - supports distributed computing & is able to store & retrieve distributed information
  - OpenWindows allows applications to be run on remote systems
Commodity Components for Clusters (XI)
- Operating systems: Microsoft Windows NT (New Technology)
  - preemptive, multitasking, multiuser, 32-bit OS
  - object-based security model and a special file system (NTFS) that allows permissions to be set on a file and directory basis
  - supports multiple CPUs and provides multitasking using symmetric multiprocessing
  - supports different CPUs and multiprocessor machines with threads
  - has the network protocols & services integrated with the base OS
    - several built-in networking protocols (IPX/SPX, TCP/IP, NetBEUI) & APIs (NetBIOS, DCE RPC, Windows Sockets (Winsock))

[Figure: Windows NT 4.0 Architecture]
Network Services / Communication SW
- Communication infrastructure supports protocols for
  - bulk-data transport
  - streaming data
  - group communications
- Communication services provide the cluster with important QoS parameters
  - latency
  - bandwidth
  - reliability
  - fault-tolerance
  - jitter control
- Network services are designed as a hierarchical stack of protocols with a relatively low-level communication API, providing the means to implement a wide range of communication methodologies
  - RPC
  - DSM
  - stream-based and message-passing interfaces (e.g., MPI, PVM)

Cluster Middleware & SSI
- SSI
  - supported by a middleware layer that resides between the OS and the user-level environment
- Middleware consists of essentially 2 sublayers of SW infrastructure
  - SSI infrastructure: glues together OSs on all nodes to offer unified access to system resources
  - system availability infrastructure: enables cluster services such as checkpointing, automatic failover, recovery from failure, & fault-tolerant support among all nodes of the cluster
What is Single System Image (SSI)?
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource
- SSI makes the cluster appear like a single machine to the user, to applications, and to the network
- A cluster without SSI is not a cluster

Single System Image Boundaries
- Every SSI has a boundary
- SSI support can exist at different levels within a system, one able to be built on another
SSI Boundaries -- an application's SSI boundary
[Figure: a batch system's SSI boundary (c) In search of clusters]

SSI Levels/Layers
- Application and Subsystem Level
- Operating System Kernel Level
- Hardware Level
SSI at Operating System Kernel Level (Gluing Layer)

Level | Examples | Boundary | Importance
Kernel/OS layer | Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
Kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc | type of kernel objects: files, processes, etc. | modularizes SSI code within kernel
Virtual memory | none supporting operating system kernel | each distributed virtual memory space | may simplify implementation of kernel objects
Microkernel | Mach, PARAS, Chorus, OSF/1AD, Amoeba | each service outside the microkernel | implicit SSI for all system services

SSI at Hardware Layer (Underware)

Level | Examples | Boundary | Importance
Memory | SCI, DASH | memory space | better communication and synchronization
Memory and I/O | SCI, SMP techniques | memory and I/O device space | lower-overhead cluster I/O
SSI at Application and Subsystem Layer (Middleware)

Level | Examples | Boundary | Importance
Application | cluster batch system, system management | an application | what a user wants
Subsystem | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem | SSI for all applications of the subsystem
File system | Sun NFS, OSF DFS, NetWare, and so on | shared portion of the file system | implicitly supports many applications and subsystems
Toolkit | OSF DCE, Sun ONC+, Apollo Domain | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

Single System Image Benefits
- Provides a simple, straightforward view of all system resources and activities from any node of the cluster
- Frees the end user from having to know where an application will run
- Frees the operator from having to know where a resource is located
- Lets the user work with familiar interfaces and commands, and allows administrators to manage the entire cluster as a single entity
- Reduces the risk of operator errors, with the result that end users see improved reliability and higher availability of the system
Single System Image Benefits (Cont'd)
- Allows centralized/decentralized system management and control, avoiding the need for skilled administrators for system administration
- Presents multiple, cooperating components of an application to the administrator as a single application
- Greatly simplifies system management
- Provides location-independent message communication
- Helps track the locations of all resources, so that system operators no longer need to be concerned with their physical location
- Provides transparent process migration and load balancing across nodes
- Improved system response time and performance

Middleware Design Goals
- Complete transparency in resource management
  - allow users to use the cluster easily, without knowledge of the underlying system architecture
  - the user is provided with the view of a globalized file system, processes, and network
- Scalable performance
  - clusters can easily be expanded; their performance should scale as well
  - to extract the maximum performance, the SSI services must support load balancing & parallelism by distributing workload evenly among nodes
- Enhanced availability
  - middleware services must be highly available at all times
  - at any time, a point of failure should be recoverable without affecting a user's application (employ checkpointing & fault-tolerance technologies)
  - handle consistency of data when replicated
SSI Support Services
- Single entry point
  - telnet cluster.myinstitute.edu
  - telnet node1.cluster.myinstitute.edu
- Single file hierarchy: xFS, AFS, Solaris MC Proxy
- Single management and control point: management from a single GUI
- Single virtual networking
- Single memory space: Network RAM / DSM
- Single job management: GLUnix, Codine, LSF
- Single user interface: like a workstation/PC windowing environment (CDE in Solaris/NT); may use Web technology

Availability Support Functions
- Single I/O Space (SIOS)
  - any node can access any peripheral or disk device without knowledge of its physical location
- Single Process Space (SPS)
  - any process on any node can create processes with cluster-wide process IDs, and processes communicate through signals, pipes, etc., as if they were on a single node
- Checkpointing and process migration
  - checkpointing saves the process state and intermediate results in memory or to disk, to support rollback recovery when a node fails
  - process migration supports dynamic load balancing among the cluster nodes
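The checkpointing/rollback-recovery idea above can be sketched in miniature: periodically save the computation's state to stable storage, and after a failure resume from the last checkpoint instead of from the beginning. This is a hedged single-process toy (the file path and checkpoint interval are arbitrary); real cluster checkpointing captures whole process images and coordinates checkpoints across nodes.

```python
import os
import pickle
import tempfile

def run(total, ckpt_path, every=10, fail_at=None):
    """Sum 0..total-1, checkpointing loop state every `every` iterations."""
    state = {"i": 0, "acc": 0}
    if os.path.exists(ckpt_path):               # rollback recovery on restart
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    while state["i"] < total:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["i"]
        state["i"] += 1
        if state["i"] % every == 0:             # periodic checkpoint
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)
    return state["acc"]

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
try:
    run(100, ckpt, fail_at=55)                  # "node" dies mid-run
except RuntimeError:
    pass
result = run(100, ckpt)                         # resumes from the i=50 checkpoint
assert result == sum(range(100))
```

The restarted run redoes only the work since the last checkpoint (iterations 50-54 here), which is the whole point of rollback recovery.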
Resource Management and Scheduling (RMS)
- RMS is the act of distributing applications among computers to maximize their throughput
- Enables the effective and efficient utilization of the resources available
- Software components
  - resource manager: locating and allocating computational resources, authentication, process creation and migration
  - resource scheduler: queueing applications, resource location and assignment
- Reasons for using RMS
  - provides increased, and reliable, throughput of user applications on the systems
  - load balancing
  - utilizing spare CPU cycles
  - providing fault-tolerant systems
  - managing access to powerful systems, etc.
- Basic architecture of RMS: client-server system

Services provided by RMS
- Process migration
  - when a computational resource has become too heavily loaded
  - fault-tolerance concerns
- Checkpointing
- Scavenging idle cycles (70% to 90% of the time most workstations are idle)
- Fault tolerance
- Minimization of impact on users
- Load balancing
- Multiple application queues
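The load-balancing service an RMS scheduler provides can be sketched as a greedy least-loaded placement policy: each incoming job goes to the node with the smallest accumulated load. A minimal sketch; the node names, job IDs, and cost units are hypothetical, and production schedulers (LSF, PBS, CODINE) also consider queues, priorities, and resource requirements.

```python
import heapq

def schedule(jobs, nodes):
    """jobs: list of (job_id, cost); nodes: list of node names.
    Returns {node: [job_ids]} using greedy least-loaded placement."""
    heap = [(0, name) for name in nodes]   # (current load, node)
    heapq.heapify(heap)
    placement = {name: [] for name in nodes}
    for job_id, cost in jobs:
        load, name = heapq.heappop(heap)   # pick the least-loaded node
        placement[name].append(job_id)
        heapq.heappush(heap, (load + cost, name))
    return placement

jobs = [("j1", 5), ("j2", 3), ("j3", 3), ("j4", 2), ("j5", 1)]
placement = schedule(jobs, ["node1", "node2"])
print(placement)
```

With these costs both nodes end up with a load of 7, illustrating how the scheduler keeps the workload evenly distributed.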
Some Popular Resource Management Systems
- Commercial systems - URL
  - LSF: http://www.platform.com/
  - CODINE: http://www.genias.de/products/codine/tech_desc.html
  - Easy-LL: http://www.tc.cornell.edu/UserDoc/SP/LL12/Easy/
  - NQE: http://www.cray.com/products/software/nqe/
- Public domain systems - URL
  - CONDOR: http://www.cs.wisc.edu/condor/
  - GNQS: http://www.gnqs.org/
  - DQS: http://www.scri.fsu.edu/~pasko/dqs.html
  - PRM: http://gost.isi.edu/gost-group/products/prm/
  - PBS: http://pbs.mrj.com/

Programming Environments and Tools (I)
- Threads (PCs, SMPs, NOW..)
  - in multiprocessor systems: used to simultaneously utilize all the available processors
  - in uniprocessor systems: used to utilize the system resources effectively
  - multithreaded applications offer quicker response to user input and run faster
  - potentially portable, as there exists an IEEE standard for the POSIX threads interface (pthreads)
  - extensively used in developing both application and system software
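The thread concepts above follow the POSIX model: create threads, protect shared state with a mutex, then join. A minimal sketch using Python's threading module, whose API mirrors the pthreads concepts (Thread roughly corresponds to pthread_create/pthread_join, Lock to a pthread mutex); the worker function and counts are illustrative.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n_increments):
    """Increment a shared counter under a mutex (the critical section)."""
    global counter
    for _ in range(n_increments):
        with lock:            # without the lock, updates could be lost
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

On an SMP node the same pattern lets one process use all processors; note that CPython's GIL limits true CPU parallelism, which is one reason cluster codes pair threads with message passing.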
Programming Environments and Tools (II)
- Distributed Shared Memory (DSM) systems
  - message passing: the most efficient, widely used programming paradigm on distributed memory systems, but complex & difficult to program
  - shared memory systems: offer a simple and general programming model, but suffer from scalability
  - DSM on a distributed memory system: an alternative, cost-effective solution
    - Software DSM
      - usually built as a separate layer on top of the communication interface
      - takes full advantage of application characteristics: virtual pages, objects, & language types are units of sharing
      - e.g., TreadMarks, Linda
    - Hardware DSM
      - better performance, no burden on the user & software layers, fine granularity of sharing, extensions of the cache-coherence scheme, & increased hardware complexity
      - e.g., DASH, Merlin

Programming Environments and Tools (III)
- Message passing systems (MPI and PVM)
  - allow efficient parallel programs to be written for distributed memory systems
  - the 2 most popular high-level message-passing systems: PVM & MPI
  - PVM: both an environment & a message-passing library
  - MPI: a message-passing specification, designed to be a standard for distributed memory parallel computing using explicit message passing
    - attempts to establish a practical, portable, efficient, & flexible standard for message passing
    - generally, application developers prefer MPI, as it is fast becoming the de facto standard for message passing
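The explicit message-passing style of MPI/PVM can be sketched with OS processes standing in for cluster nodes and a pipe standing in for the interconnect. This is a hedged illustration only: the send/recv calls echo MPI_Send/MPI_Recv semantics but this is plain Python multiprocessing, not the MPI API, and the message format is invented.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    """Remote 'rank': receive a message, compute, send back the reply."""
    msg = conn.recv()                 # analogous to MPI_Recv
    conn.send(sum(msg["data"]))       # analogous to MPI_Send
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send({"tag": "work", "data": [1, 2, 3, 4]})  # explicit message
    result = parent.recv()
    p.join()
    assert result == 10
```

All sharing here happens through explicit messages, which is exactly what makes the paradigm efficient on distributed memory but harder to program than shared memory or DSM.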
Programming Environments and Tools (IV)
- Parallel debuggers and profilers
  - debuggers: very limited
  - HPDF (High Performance Debugging Forum), formed as a Parallel Tools Consortium project in 1996
    - developed an HPD version specification, which defines the functionality, semantics, and syntax for a command-line parallel debugger
  - TotalView
    - a commercial product from Dolphin Interconnect Solutions
    - the only widely available GUI-based parallel debugger that supports multiple HPC platforms
    - only usable in homogeneous environments, where each process of the parallel application being debugged must run under the same version of the OS

Functionality of Parallel Debuggers
- Managing multiple processes and multiple threads within a process
- Displaying each process in its own window
- Displaying source code, stack trace, and stack frame for one or more processes
- Diving into objects, subroutines, and functions
- Setting both source-level and machine-level breakpoints
- Sharing breakpoints between groups of processes
- Defining watch and evaluation points
- Displaying arrays and their slices
- Manipulating code variables and constants
Programming Environments and Tools (V)
- Performance analysis tools
  - help a programmer understand the performance characteristics of an application
  - analyze & locate parts of an application that exhibit poor performance and create program bottlenecks
  - major components
    - a means of inserting instrumentation calls to the performance monitoring routines into the user's applications
    - a run-time performance library that consists of a set of monitoring routines
    - a set of tools for processing and displaying the performance data
  - issue with performance monitoring tools: intrusiveness of the tracing calls and their impact on application performance; instrumentation affects the performance characteristics of the parallel application and thus provides a false view of its performance behavior

Performance Analysis and Visualization Tools

Tool | Supports | URL
AIMS | instrumentation, monitoring library, analysis | http://science.nas.nasa.gov/Software/AIMS
MPE | logging library and snapshot performance visualization | http://www.mcs.anl.gov/mpi/mpich
Pablo | monitoring library and analysis | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Paradyn | dynamic instrumentation running analysis | http://www.cs.wisc.edu/paradyn
SvPablo | integrated instrumentor, monitoring library and analysis | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Vampir | monitoring library, performance visualization | http://www.pallas.de/pages/vampir.htm
Dimemas | performance prediction for message passing programs | http://www.pallas.com/pages/dimemas.htm
Paraver | program visualization and analysis | http://www.cepba.upc.es/paraver
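The "instrumentation calls + run-time monitoring library" structure described above can be sketched as a timing decorator that accumulates per-routine call counts and elapsed time. A toy stand-in for tools like AIMS or Pablo, not their actual APIs; it also exhibits the intrusiveness issue the slides note, since the timer calls themselves cost time.

```python
import time
from collections import defaultdict

# Toy "monitoring library": per-routine statistics accumulated at run time.
profile = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrument(fn):
    """Insert timing instrumentation around a routine (the decorator plays
    the role of the inserted instrumentation calls)."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            rec = profile[fn.__name__]
            rec["calls"] += 1
            rec["seconds"] += time.perf_counter() - t0
    return wrapper

@instrument
def compute(n):
    return sum(i * i for i in range(n))

for _ in range(3):
    compute(1000)

print({name: rec["calls"] for name, rec in profile.items()})
```

A display tool would then post-process `profile` into tables or timelines, the third component the slides list.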
Programming Environments and Tools (VI)
- Cluster administration tools
  - Berkeley NOW
    - gathers & stores data in a relational DB
    - uses a Java applet to allow users to monitor the system
  - SMILE (Scalable Multicomputer Implementation using Low-cost Equipment)
    - called K-CAP
    - consists of compute nodes, a management node, & a client that can control and monitor the cluster
    - K-CAP uses a Java applet to connect to the management node through a predefined URL address in the cluster
  - PARMON
    - a comprehensive environment for monitoring large clusters
    - uses client-server techniques to provide transparent access to all nodes to be monitored
    - parmon-server & parmon-client

Need for more Computing Power: Grand Challenge Applications
- Solving technology problems using computer modeling, simulation, and analysis
  - life sciences
  - aerospace
  - geographic information systems
  - CAD/CAM
  - digital biology
  - military applications
Representative Cluster Systems (I)
- The Berkeley Network of Workstations (NOW) Project
  - demonstrates building a large-scale parallel computer system using mass-produced commercial workstations & the latest commodity switch-based network components
  - interprocess communication
    - Active Messages (AM): the basic communication primitive in Berkeley NOW
    - a simplified remote procedure call that can be implemented efficiently on a wide range of hardware
  - Global Layer Unix (GLUnix)
    - an OS layer designed to provide transparent remote execution, support for interactive parallel & sequential jobs, load balancing, & backward compatibility for existing application binaries
    - aims to provide a cluster-wide namespace; uses Network PIDs (NPIDs) and Virtual Node Numbers (VNNs)

[Figure: Architecture of the NOW System]
Representative Cluster Systems (II)
- The Berkeley NOW Project (cont'd)
  - Network RAM
    - allows free resources on idle machines to be utilized as a paging device for busy machines
    - serverless: any machine can be a server when it is idle, or a client when it needs more memory than is physically available
  - xFS: Serverless Network File System
    - a serverless, distributed file system, which attempts to achieve low-latency, high-bandwidth access to file system data by distributing the functionality of the server among the clients
    - the function of locating data in xFS is distributed by having each client responsible for servicing requests on a subset of the files
    - file data is striped across multiple clients to provide high bandwidth

Representative Cluster Systems (III)
- The High Performance Virtual Machine (HPVM) Project
  - delivers supercomputer performance on a low-cost COTS system
  - hides the complexities of a distributed system behind a clean interface
  - challenges addressed by HPVM
    - delivering high-performance communication to standard, high-level APIs
    - coordinating scheduling and resource management
    - managing heterogeneity
[Figure: HPVM Layered Architecture]

Representative Cluster Systems (IV)
- The HPVM Project (cont'd): Fast Messages (FM)
  - a high-bandwidth & low-latency communication protocol, based on Berkeley AM
  - contains functions for sending long and short messages & for extracting messages from the network
  - guarantees and controls the memory hierarchy
  - guarantees reliable and ordered packet delivery, as well as control over the scheduling of communication work
  - originally developed on a Cray T3D & a cluster of SPARCstations connected by Myrinet hardware
  - a low-level software interface that delivers hardware communication performance
  - high-level layers offer greater functionality, application portability, and ease of use
Representative Cluster Systems (V)
- The Beowulf Project
  - investigates the potential of PC clusters for performing computational tasks
  - refers to a Pile-of-PCs (PoPC): a loose ensemble or cluster of PCs
  - emphasizes the use of mass-market commodity components, dedicated processors, and a private communication network
  - achieves the best overall system cost/performance ratio for the cluster

Representative Cluster Systems (VI)
- The Beowulf Project (cont'd): system software
  - Grendel: the collection of software tools for resource management & support of distributed applications
  - communication
    - through TCP/IP over Ethernet internal to the cluster
    - employs multiple Ethernet networks in parallel to satisfy the internal data-transfer bandwidth required
    - achieved by "channel bonding" techniques
  - extends the Linux kernel to allow a loose ensemble of nodes to participate in a number of global namespaces
  - independent of external libraries
  - two Global Process ID (GPID) schemes
    - GPID-PVM: compatible with the PVM Task ID format & uses PVM as its signal transport
Representative Cluster Systems (VII)
- Solaris MC: A High Performance Operating System for Clusters
  - a distributed OS for a multicomputer, a cluster of computing nodes connected by a high-speed interconnect
  - provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network
  - built as a globalization layer on top of the existing Solaris kernel
  - interesting features
    - extends the existing Solaris OS
    - preserves existing Solaris ABI/API compliance
    - provides support for high availability
    - uses C++, IDL, CORBA in the kernel
    - leverages Spring technology

[Figure: Solaris MC Architecture]
Representative Cluster Systems (VIII)
- Solaris MC (cont'd)
  - uses an object-oriented framework for communication between nodes
    - based on CORBA
    - provides remote object method invocations
    - provides object reference counting
    - supports multiple object handlers
  - single system image features
    - global file system: a distributed file system, called ProXy File System (PXFS), provides a globalized file system without the need for modifying the existing file system
    - globalized process management
    - globalized network and I/O

Cluster System Comparison Matrix

Project | Platform | Communications | OS | Other
Beowulf | PCs | Multiple Ethernet with TCP/IP | Linux and Grendel | MPI/PVM, sockets, and HPF
Berkeley NOW | Solaris-based PCs and workstations | Myrinet and Active Messages | Solaris + GLUnix + xFS | AM, PVM, MPI, HPF, Split-C
HPVM | PCs | Myrinet with Fast Messages | NT or Linux connection and global resource manager + LSF | Java front-end, FM, Sockets, Global Arrays, SHMEM, and MPI
Solaris MC | Solaris-based PCs and workstations | Solaris-supported | Solaris + Globalization layer | C++ and CORBA
Cluster of SMPs (CLUMPS)
- Clusters of multiprocessors (CLUMPS): to be the supercomputers of the future
- Multiple SMPs with several network interfaces can be connected using high performance networks
- 2 advantages
  - benefit from the high-performance, easy-to-use-and-program SMP systems with a small number of CPUs
  - clusters can be set up with moderate effort, resulting in easier administration and better support for data locality inside a node

Hardware and Software Trends
- Network performance increase of tenfold using 100BaseT Ethernet with full duplex support
- The availability of switched network circuits, including full crossbar switches for proprietary network technologies such as Myrinet
- Workstation performance has improved significantly
- Improvement of microprocessor performance has led to the availability of desktop PCs with the performance of low-end workstations at significantly lower cost
- The performance gap between supercomputers and commodity-based clusters is closing rapidly
- Parallel supercomputers are now equipped with COTS components, especially microprocessors
- Increasing usage of SMP nodes with two to four processors
- The average number of transistors on a chip is growing by about 40% per annum
- The clock frequency growth rate is about 30% per annum
Advantages of using COTS-based Cluster Systems
- Price/performance when compared with a dedicated parallel supercomputer
- Incremental growth that often matches yearly funding patterns
- The provision of a multipurpose system

[Figure: Technology Trend]
Computing Platforms Evolution: Breaking Administrative Barriers
- What are the key attributes in HPC?
  - computation time
  - communication time

[Figure: performance vs. administrative barriers (individual, group, department, campus, state, national, globe, inter-planet, universe): local desktop (single processor) -> SMPs or supercomputers -> enterprise cluster -> global cluster/grid -> inter-planet cluster/grid ??]