Cluster Computing
Lecture 3

Cluster Computing: The Commodity Supercomputing
By Mark Baker and Rajkumar Buyya
Software - Practice and Experience, vol. 29, no. 6, 1999

Clusters

Commodity
Built using commodity HW and SW components
Playing a major role in redefining the concept of supercomputing

Clusters

Important factor making the use of workstations practical:
Ø Standardization of tools and utilities
§ MPI and HPF
§ Allows applications to be developed and tested in a cluster and ported to a parallel platform when ready

Clusters

Basic definition
Ø A cluster is a collection of workstations or PCs that are interconnected via some network technology
Likely scenario
Ø Computers will be state-of-the-art
Ø Network will be high-bandwidth and low-latency
Such a cluster can provide fast and reliable services to computationally intensive applications

Clusters

Why are clusters preferred over MPPs?
Ø Workstations are becoming powerful
Ø Bandwidth between workstations is increasing
Ø Workstation clusters are easier to integrate into existing networks
Ø Typically lower user utilization
Ø Development tools are more mature
Ø Workstation clusters are cheap and readily available
Ø Clusters can be enlarged and individual nodes can have their capabilities extended

Clusters

Topics to be discussed
Ø Components
Ø Tools
Ø Techniques
Ø Methodologies

Clusters

Clusters can be classified as
Ø Dedicated clusters
Ø Non-dedicated clusters

Clusters

Dedicated clusters
Ø A particular individual does not own a workstation
Ø Resources are shared
Ø Parallel computing can be performed across the entire cluster

Clusters

Non-dedicated clusters
Ø Individuals own workstations
Ø Applications are executed by stealing cycles
Ø Tension between owner and remote user
Ø Important issues: migration and load balance

Clusters

Suitable for applications that are not communication intensive
Ø Typically low bandwidth and high latency
Workstations
Ø Some sort of Unix platform
Ø Lately, PCs running Linux or NT

Commodity Components

Processors and Memory
Disk I/O
Cluster Interconnects
Operating Systems
Windows of Opportunity

Processors

Today, single-chip CPUs are almost as powerful as processors used in supercomputers of the recent past

Processors

Related projects
Ø Digital Alpha 21364
§ Integrates processing, memory controller, and network interface into a single chip

Processors

Related projects
Ø The Berkeley Intelligent RAM project
§ Exploring the entire spectrum of issues involved in a general-purpose computer system that integrates a processor and DRAM onto a single chip

Processors

Processors used in clusters
Ø Digital Alpha – Alpha Farm
Ø IBM PowerPC – IBM SP
Ø Sun SPARC – Berkeley NOW
Ø SGI MIPS
Ø HP PA

Memory

Amount of memory required varies with the target application
Parallel programs distribute data throughout the nodes
There should be enough memory to avoid constant swapping
Caches are key: 8KB to 2MB

The System Bus

Needs to match the system's clock speed
Intel PCI Bus: 133 MBps
Ø Used in Pentium-based PCs
Ø Digital AlphaServers
Decreasing the distinction between PCs and workstations

Commodity Components

Processors and Memory
Disk I/O
Cluster Interconnects
Operating Systems
Windows of Opportunity

Disk I/O

Disk density is increasing 60-80% every year
Disk access times have not kept pace with processor performance
Parallel/Grand Challenge applications need to process large amounts of data
Necessary to improve I/O performance

Disk I/O

Ways of improving
Ø Carry out I/O operations concurrently with the support of a parallel file system (sketched below)
Ø Can be constructed by using the disks associated with each workstation in the cluster

Commodity Components

Processors and Memory
Disk I/O
Cluster Interconnects
Operating Systems
Windows of Opportunity
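The slides do not name a particular parallel file system interface, so the sketch below uses MPI-IO (part of MPI-2) purely as one hedged illustration of concurrent I/O: every process writes its own block of a shared file at a rank-dependent offset, so the writes can proceed in parallel across the nodes' disks. The file name and block size are arbitrary choices.

/* Hedged sketch of concurrent I/O with MPI-IO; not prescribed by the slides.
 * Each rank writes its own block of one shared file at its own offset. */
#include <mpi.h>

#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank, buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < BLOCK; i++) buf[i] = rank;   /* dummy data */

    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each process writes at a distinct offset -- no coordination needed. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}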

Cluster Interconnects

Individual nodes in a cluster are usually connected with a high-speed, low-latency, high-bandwidth network
Communication uses
Ø Standard network protocol: TCP/IP
Ø Low-level protocols: Active Messages or Fast Messages

Cluster Interconnects

Requirements to balance the computational power of the workstations available
Ø Bandwidth: more than 10 MBps
Ø Latency: at most 100 µs (a sketch of measuring these over TCP/IP follows below)
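As a rough, hedged illustration of how one might check a link against the latency figure above, this sketch times a small TCP ping-pong over BSD sockets. It assumes an echo server is already listening on another node; the host name and port are placeholders, not anything defined in the lecture.

/* Minimal sketch: estimating round-trip latency over TCP/IP with BSD sockets.
 * Assumes an echo server is already running at HOST:PORT (placeholders). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netdb.h>

#define HOST  "node01"   /* hypothetical cluster node */
#define PORT  "7777"     /* hypothetical echo-server port */
#define ITERS 1000

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(HOST, PORT, &hints, &res);
    if (rc != 0) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc)); return 1; }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) { perror("connect"); return 1; }

    char buf[64] = "ping";
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) { perror("write"); return 1; }
        ssize_t got = 0;
        while (got < (ssize_t)sizeof buf) {          /* wait for the full echo */
            ssize_t n = read(fd, buf + got, sizeof buf - got);
            if (n <= 0) { perror("read"); return 1; }
            got += n;
        }
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average one-way latency: %.1f us\n", usec / (2.0 * ITERS));
    freeaddrinfo(res);
    close(fd);
    return 0;
}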

Cluster Interconnects

Network technologies
Ø Fast Ethernet
Ø ATM
Ø Myrinet

Cluster Interconnects

Ethernet
Ø Cheap and widely used to form clusters
Standard Ethernet
Ø 10 Mbps – not enough
Fast Ethernet
Ø 100 Mbps – meets the requirement

Cluster Interconnects

ATM (Asynchronous Transfer Mode)
Ø Switched virtual-circuit technology
Ø Developed for telecommunication
Ø Intended to be used for LAN and WAN
§ Presents a unified approach to both
Ø Based around small fixed-size packets
Ø Designed for a number of media
§ Example: copper wire and optical fiber
§ Performance varies with the hardware

Cluster Interconnects

ATM (Asynchronous Transfer Mode)
Ø Usually, no optical cabling to desktops
Ø ATM on CAT-5
§ 15.5 MBps
§ Allows upgrades of existing networks without replacing cabling

Cluster Interconnects

SCI – Scalable Coherent Interface
Ø IEEE 1596 standard
Ø Low-latency, high-bandwidth distributed access across a network
Ø Provides a scalable architecture that allows large systems to be built out of inexpensive mass-produced components

Cluster Interconnects

SCI – Scalable Coherent Interface
Ø Point-to-point architecture
Ø Directory-based cache coherence
Ø Faster than any network technology available
Ø Scalability depends on switches
Ø Expensive
Ø Produced for SPARC SBus and PCI-based systems

Cluster Interconnects

Myrinet
Ø 1.28 Gbps full-duplex LAN supplied by Myricom
Ø Based on cut-through switches
Ø Proprietary and high-performance
§ Low latency and high bandwidth
Ø Used in expensive clusters

Commodity Components

Processors and Memory
Disk I/O
Cluster Interconnects
Operating Systems
Windows of Opportunity

Operating Systems

Modern operating systems
Ø Multitasking
Ø Multithreading at kernel level (see the POSIX threads sketch below)
Ø User-level high-performance multithreading without kernel intervention
Ø Network support
Ø Most popular
§ Solaris, Linux, and Windows NT

Operating Systems

Solaris
Ø From Sun
Ø Unix-based multithreaded and multi-user system
Ø Supports Intel and SPARC platforms
Ø Network support includes TCP/IP stack, RPC, and NFS
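The kernel-level multithreading mentioned above can be exercised through POSIX threads on Solaris and Linux (NT exposes similar functionality through the Win32 API). This is only a minimal sketch; the worker count is an arbitrary choice.

/* Minimal sketch of kernel-scheduled multithreading via POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running\n", id);   /* each thread is scheduled by the kernel */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);       /* wait for all workers to finish */
    return 0;
}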

Operating Systems

Solaris
Ø Programming environment includes:
§ ANSI-compliant C and C++ compilers
§ Tools to profile and debug multithreaded programs

Operating Systems

Linux
Ø Unix-like operating system
Ø Developed by Linus Torvalds, a Finnish undergraduate student, in 1991-92
Ø Open source and free
§ Later, lots of contributions from other programmers
§ Wide range of SW tools, libraries, and utilities
Ø Robust, reliable, POSIX compliant

Operating Systems

Linux
Ø Pre-emptive multitasking
Ø Demand-paged virtual memory
Ø Multi-user support
Ø Multi-processor support

Operating Systems

Linux: Why is it so popular?
Ø FREE! Available from the Internet and can be downloaded without cost
Ø Runs on cheap x86 platforms, yet offers the power and flexibility of Unix
Ø Easy to fix bugs and improve system performance
Ø Users can develop or fine-tune HW drivers and these can be made easily available to other users
Ø Applications and system software are freely available (for example: GNU software)

Operating Systems

NT
Ø Microsoft Corporation is the dominant provider of SW in the personal computing marketplace
Ø In 1996, NT and Windows 95 together had 66% of the desktop OS market share

Operating Systems

NT
Ø 32-bit pre-emptive multitasking and multi-user operating system
Ø Fault tolerant: each 32-bit application operates in its own virtual memory address space
Ø Complete OS
Ø Supports most CPU architectures
Ø Supports multiprocessor machines through the use of threads
Ø Network protocols and services are integrated with the base OS

Commodity Components

Processors and Memory
Disk I/O
Cluster Interconnects
Operating Systems
Windows of Opportunity

Windows of Opportunity

The resources available in the average NOW offer a number of research opportunities:
Ø Parallel processing
Ø Network RAM for virtual memory
Ø Software RAID
Ø Multi-path communication

Programming Tools

For HPC on Clusters
Ø Message Passing Systems
§ PVM, MPI
Ø Distributed Shared Memory Systems
Ø Parallel Debuggers and Profilers
Ø Performance Analysis Tools
Ø Cluster Monitoring

Message Passing Systems

Message-passing libraries allow efficient parallel programs to be written for distributed memory systems
Provide routines to initiate and configure the messaging environment
Provide functions for sending and receiving data (a minimal MPI sketch follows below)
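As a minimal sketch of the "initialize, send, receive" pattern these libraries expose, here is a two-process exchange in MPI (one of the two libraries named above); the tag and payload values are arbitrary. Compile with mpicc and run under mpirun with at least two processes.

/* Minimal MPI sketch: rank 0 sends one integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);                      /* configure the messaging environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}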

Message Passing Systems

Two most popular high-level message-passing systems for scientific and engineering applications
Ø PVM
Ø MPI, defined by the MPI Forum

Message Passing Systems

PVM
Ø Environment and message-passing library
Ø Designed to run parallel applications on systems ranging from high-end supercomputers to clusters of workstations (a small PVM sketch follows below)
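For comparison with the MPI example, here is a hedged sketch of the same kind of exchange in the PVM 3 style: the parent task spawns a copy of itself and sends it an integer. The executable name "pvm_demo" and the message tag are placeholders, and the exact calling conventions should be checked against the PVM release in use.

/* Hedged sketch of a PVM 3 style exchange (names and tag are arbitrary). */
#include <pvm3.h>
#include <stdio.h>

#define TAG 1

int main(void)
{
    int mytid  = pvm_mytid();      /* enroll in PVM */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {   /* original task: spawn a child and send */
        int child, value = 42;
        if (pvm_spawn("pvm_demo", NULL, PvmTaskDefault, "", 1, &child) != 1) {
            fprintf(stderr, "spawn failed\n");
            pvm_exit();
            return 1;
        }
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&value, 1, 1);
        pvm_send(child, TAG);
    } else {                       /* spawned task: receive and print */
        int value;
        pvm_recv(parent, TAG);
        pvm_upkint(&value, 1, 1);
        printf("task %x received %d\n", mytid, value);
    }

    pvm_exit();
    return 0;
}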

Message Passing Systems

MPI
Ø Specification for message passing
Ø Designed to be a standard for distributed-memory parallel computing using explicit message passing
Ø Attempts to establish a practical, portable, efficient, and flexible standard for message passing
Ø Available on most HPC systems, including SMP machines

Message Passing Systems

MPICH
Ø Most popular of the current free implementations of MPI
Ø Developed at Argonne National Laboratory and Mississippi State University

Message Passing Systems

MPICH
Ø Portable, built on top of a restricted number of HW-independent low-level functions, which form the ADI
Ø The ADI, or Abstract Device Interface, contains 25 functions
Ø The rest of MPI contains 125 functions and is implemented on top of the ADI

Message Passing Systems

MPICH
Ø ADI: basic point-to-point message passing
Ø Remaining MPICH: management of communicators, derived data types, collective operations
Ø Has been ported to most computing platforms, including NT

Programming Tools

For HPC on Clusters
Ø Message Passing Systems
§ PVM, MPI
Ø Distributed Shared Memory Systems
Ø Parallel Debuggers and Profilers
Ø Performance Analysis Tools
Ø Cluster Monitoring

Distributed Shared Memory

DSM
Ø Shared memory programming paradigm on a distributed memory system
Ø Physically distributed and logically shared memory
Ø Attractive solution for large-scale high-performance computing

Distributed Shared Memory

Implemented using HW or SW solutions

Distributed Shared Memory

SW solution
Ø Built as a separate layer on top of the message-passing interface
Ø Virtual memory pages, objects, and language types are units of sharing
Ø Implementation achieved by
§ Compiler
§ User-level runtime package
Ø Examples
§ Munin, TreadMarks, Linda, Clouds (a single-node analogy of the programming model is sketched below)
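The point of SW DSM is that processes get the illusion of shared memory even though only message passing exists underneath. As a single-node analogy only (not a DSM system), the sketch below shares one integer between two processes through a POSIX shared mapping; packages such as TreadMarks or Munin offer a comparable allocate/synchronize style of programming but keep the replicas consistent across the network.

/* Single-node ANALOGY of the shared-memory programming model, not a DSM
 * system: parent and child share one page via mmap(MAP_SHARED). */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One shared integer visible to parent and child after fork(). */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }
    *shared = 0;

    if (fork() == 0) {          /* child: write through the shared mapping */
        *shared = 123;
        _exit(0);
    }
    wait(NULL);                 /* parent: observes the child's update */
    printf("value written by the other process: %d\n", *shared);

    munmap(shared, sizeof(int));
    return 0;
}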

Distributed Shared Memory

HW solution
Ø Better performance
Ø No burden on user and SW layers
§ Transparency
Ø Finer granularity of sharing
Ø Increased HW complexity

Distributed Shared Memory

HW solution
Ø Typical classes of HW DSM systems
§ CC-NUMA (DASH)
– Cache-Coherent Non-Uniform Memory Access
§ COMA (KSR1)
– Cache-Only Memory Architecture
§ Extensions of the schemes
§ Reflective memory systems (Merlin)
– Automatically transmit data to all connected computers' local memory, transparently, with zero overhead and at extremely high speeds

Programming Tools

For HPC on Clusters
Ø Message Passing Systems
§ PVM, MPI
Ø Distributed Shared Memory Systems
Ø Parallel Debuggers and Profilers
Ø Performance Analysis Tools
Ø Cluster Monitoring

Parallel Debuggers and Profilers

Highly desirable to have some form of easy-to-use parallel debuggers and profiling tools
Most vendors of HPC systems provide some form of debugger and performance analyzer for their platforms
Ideally, these tools should be able to work in a heterogeneous environment
Ø Develop on NOWs, run on dedicated HPC systems

Parallel Debuggers and Profilers

Debuggers
Ø A small number of debuggers can be used in a cross-platform heterogeneous environment
Ø 1996: forum to establish a standard
§ Define functionality, semantics, syntax
§ For a command-line parallel debugger

Parallel Debuggers and Profilers

A parallel debugger should
Ø Manage multiple processes and multiple threads of a process
Ø Display each process in its own window
Ø Display source code and stack for one or more processes
Ø Dive into objects, subroutines, and functions

Parallel Debuggers and Profilers

A parallel debugger should
Ø Set both source-level and machine-level breakpoints
Ø Share breakpoints between groups of processes
Ø Define watch and evaluation points
Ø Display arrays and array slices
Ø Manipulate code variables and constants

Parallel Debuggers and Profilers

TotalView
Ø Commercial product from Dolphin Interconnect Solutions
Ø Currently, the only widely available parallel debugger for multiple HPC platforms
Ø However, only for homogeneous environments

Programming Tools

For HPC on Clusters
Ø Message Passing Systems
§ PVM, MPI
Ø Distributed Shared Memory Systems
Ø Parallel Debuggers and Profilers
Ø Performance Analysis Tools
Ø Cluster Monitoring

Performance Analysis Tools

Used to help programmers understand the performance characteristics of a particular application
Analyze and locate parts of an application that exhibit poor performance and create bottlenecks
Useful for sequential applications
Enormously helpful for parallel applications

Performance Analysis Tools

Create performance information
Ø Produce performance data during execution
Ø Provide a post-mortem analysis and display of the performance information
A few tools can do runtime analysis
Ø Either in addition to or instead of the post-mortem analysis

Performance Analysis Tools

Components
Ø A means of inserting instrumentation calls to the performance monitoring routines into the user's application
Ø A runtime performance library
§ Set of monitoring routines that measure and record various aspects of a program's performance
Ø A set of tools that process and display the performance data

Performance Analysis Tools

A post-mortem performance analysis tool works by
Ø Adding instrumentation calls into the source code (see the MPI profiling-interface sketch below)
Ø Compiling and linking the application with a performance analysis runtime library
Ø Running the application to generate a trace file
Ø Processing and viewing the trace file

Performance Analysis Tools

Issues
Ø Intrusiveness of the tracing calls and their impact on the application's performance
Ø Format of the trace file
§ Must contain detailed and useful information but cannot be too large
§ Should conform to some standard format to enable the use of different GUIs to visualize the performance data
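One common way the instrumentation step is realized for message-passing codes is MPI's standard profiling interface: every MPI_X routine is also callable as PMPI_X, so a tool can interpose its own MPI_Send, record timing data, and forward to the real call. The sketch below is a hedged illustration (it uses MPI-3's const-qualified prototype and writes a stderr line where a real tool would write a structured trace file); it is compiled separately and linked with the application.

/* Hedged sketch of an instrumentation wrapper using MPI's profiling
 * interface: time each MPI_Send and forward to PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);  /* real send */
    double t1 = MPI_Wtime();
    fprintf(stderr, "MPI_Send to %d, tag %d: %.6f s\n", dest, tag, t1 - t0);
    return rc;
}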

Performance Analysis Tools

Examples
Ø AIMS (NASA)
Ø MPE (ANL)
Ø Pablo/SvPablo (UIUC)
Ø Paradyn (Wisconsin)
Ø VT (IBM)
Ø Dimemas (Pallas)

Programming Tools

For HPC on Clusters
Ø Message Passing Systems
§ PVM, MPI
Ø Distributed Shared Memory Systems
Ø Parallel Debuggers and Profilers
Ø Performance Analysis Tools
Ø Cluster Monitoring

Cluster Monitoring

System administration tools
Ø Allow clusters to be observed at different levels using a GUI
Ø Good management software is crucial for exploiting a cluster as an HPC platform

Cluster Monitoring

Cluster monitoring
Ø The Berkeley NOW
§ System administration tool gathers and stores data in a database
§ Uses a Java applet to allow users to monitor the system from their browser

Cluster Monitoring

Cluster monitoring
Ø Solstice SyMON from Sun Microsystems
§ Allows standalone workstations to be monitored
§ Uses client/server technology for monitoring
§ Node Status Reporter (NSR) provides a standard mechanism for measurement and access to status information of clusters
§ Parallel applications/tools can access NSR through the NSR interface

Cluster Monitoring

Cluster monitoring
Ø PARMON
§ Comprehensive environment for monitoring large clusters
§ Client/server technology to provide transparent access to all nodes to be monitored
– Server: provides system resource activity and utilization information (a per-node probe of this kind is sketched below)
– Client: a Java applet capable of gathering and visualizing real-time cluster information
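None of the tools above is reproduced here, but the kind of per-node measurement such a monitoring server gathers can be as simple as reading Linux's /proc files. This sketch reports the node's load averages, the sort of value a client GUI or applet would then collect and display.

/* Tiny per-node probe: read the load averages from Linux's /proc. */
#include <stdio.h>

int main(void)
{
    double load1, load5, load15;
    FILE *f = fopen("/proc/loadavg", "r");
    if (!f) { perror("/proc/loadavg"); return 1; }
    if (fscanf(f, "%lf %lf %lf", &load1, &load5, &load15) == 3)
        printf("load: 1min=%.2f 5min=%.2f 15min=%.2f\n", load1, load5, load15);
    fclose(f);
    return 0;
}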

Representative Systems

NOW (Network of Workstations)
HPVM (The High Performance Virtual Machine)
The Beowulf Project
Solaris MC: A High Performance Operating System for Clusters

NOW

Goal: combining distributed workstations into a single system
Ø Research and development
§ Network interface hardware
§ Fast communication protocols
§ Distributed file systems
§ Distributed scheduling
§ Job control

NOW

Inter-processor communication
Ø Active Messages
§ Basic communication primitives in NOW
§ AM communication is essentially a simplified remote procedure call that can be implemented efficiently on a wide range of hardware
§ Generalized to support a broad spectrum of applications
– Client/server, file systems, operating systems, parallel programs

NOW

Inter-processor communication
Ø Includes a collection of low-latency, parallel communication primitives
§ Berkeley sockets
§ Fast sockets
§ Shared address space parallel C (Split-C)
§ MPI
§ HPF

NOW

Process management
Ø GLUNIX (Global Layer Unix)
§ Operating system layer designed to provide
– Transparent remote execution
– Support for interactive parallel and sequential jobs
– Load balancing
– Backward compatibility for existing application binaries
§ Multi-user system implemented at the user level

NOW

Virtual memory
Ø Utilize memory in idle machines as a paging device for busy machines
Ø The system is serverless
§ Any machine can be a server when it is idle, or a client when it needs more memory than what is available
Ø Two prototypes
§ One uses Solaris segment drivers to implement an external user-level pager which exchanges pages with remote page daemons
§ The other provides similar operations on similarly mapped regions using signals

NOW

File system
Ø xFS is a serverless, distributed file system
Ø Attempts to have low-latency, high-bandwidth access to data
Ø Distributes the functionality of the server among the clients
Ø Typical duties of the server
§ Maintaining cache coherence
§ Locating data
§ Servicing disk requests
Ø In xFS
§ Each client is responsible for servicing requests on a subset of files
§ File data is striped across multiple clients

HPVM

High Performance Virtual Machine
Goals
Ø Deliver supercomputer performance on low-cost COTS systems
Ø Hide the complexities of a distributed system behind a clean interface
The HPVM architecture consists of
Ø A number of SW components with high-level APIs, such as MPI, SHMEM, ...

HPVM

Claims to address the following challenges
Ø Delivering high-performance communication to standard, high-level APIs
Ø Coordinating scheduling and resource management
Ø Managing heterogeneity

HPVM

Illinois Fast Messages (FM)
Ø Originally developed on a Cray T3D and a cluster of SPARCstations connected by Myrinet
Ø Has a low-level software interface that delivers hardware communication performance
Ø Has a higher-level layer interface for greater functionality, application portability, and ease of use

HPVM

Illinois Fast Messages (FM)
Ø Based on the Berkeley AM
Ø FM is not the surface API, but the underlying semantics
Ø Contains functions for sending long and short messages and for extracting messages from the network
Ø Guarantees reliable and ordered packet delivery and control over the communication scheduling

The Beowulf Project

Initiated in the summer of 1994
Sponsored by NASA
Goal
Ø Investigate the potential of PC clusters for performing computational tasks
Ø Beowulf refers to a Pile-of-PCs (PoPC): a loose ensemble of PCs, which is similar to clusters or networks of workstations

The Beowulf Project

Emphasis of PoPC on the
Ø Use of mass-market commodity components
Ø Dedicated processors
Ø Usage of a private communication network
Overall goal
Ø Achieve the 'best' overall system cost/performance ratio for the cluster

The Beowulf Project

Adds to the PoPC model
Ø No custom components
§ Accepted standard interfaces: PCI bus, IDE and SCSI interfaces, Ethernet
Ø Incremental growth and technology tracking
Ø Usage of readily available and free SW components
§ Linux

The Beowulf Project

Grendel SW architecture
Ø Collection of SW tools being developed and evolving within the Beowulf project
Ø Tools for resource management and to support distributed applications
Ø Beowulf distribution includes several programming environments and development libraries
§ PVM, MPI, BSP, IPC, and pthreads

The Beowulf Project

Grendel SW architecture
Ø Key to success
§ Inter-process communication bandwidth and system support for parallel I/O
Ø TCP/IP over Ethernet
Ø Uses multiple Ethernet networks in parallel to satisfy the bandwidth requirement
§ Transparent to the user
§ Implemented as an enhancement to the Linux kernel
§ It has been shown that up to 3 networks can be ganged together, leading to a significant increase in throughput

The Beowulf Project

Grendel SW architecture
Ø Each node runs its own copy of the Linux kernel
§ Nodes may participate in a number of global spaces (for example, process id)
§ Need a mechanism that allows unmodified versions of standard UNIX process utilities (for example, ps) to work across the cluster

The Beowulf Project

Programming model
Ø Supports several distributed programming paradigms
§ Message passing
– MPI, PVM, and BSP
§ Distributed shared memory

Summary and Conclusions

Hardware and Software Trends
Cluster Technology Trends
Predictions about the Future
Final Thoughts

HW and SW Trends

Important advances that contributed to the existence of clusters
Ø Fast Ethernet
Ø Switched network circuits
Ø Workstations' performance
Ø Microprocessor performance leading to HP PCs
Ø Powerful and stable Unix systems for PCs

HW and SW Trends

Trends
Ø Processor speeds will keep going up
Ø Memory sizes will keep going up faster than memory speeds
§ To compensate, organize DRAM in banks and transfer in parallel to/from the banks
§ Also, use memory hierarchies
Ø Disks are also becoming larger faster than they are becoming faster

HW and SW Trends

Trends
Ø Network performance is increasing
Ø Network costs are decreasing
Ø HP technologies (ATM, SCI, Myrinet) are promising
Ø Linux is the most widely used system
Ø NT is catching up

Cluster Technology Trends

Network is key
Ethernet technologies are more likely to be the mainstream
Ø 1 Gigabit Ethernet
Ø 10 Gigabit Ethernet

Predictions about the Future

The gap will continue to close
Stealing-cycles systems will continue to use whatever resources they find available
Dedicated clusters used for HPC will continue to evolve as new and more powerful technologies become available

Predictions about the Future

More than one processor per node will become fairly common
To reduce latency, cluster SW will bypass the OS kernel
NICs will be more intelligent
The OS will provide a rich set of tools and utilities and will also be robust and reliable

Final Thoughts

The need for computational power exceeds our ability to fulfill this need
Clusters are the most promising way by which this gap can be reduced
COTS-based clusters have a number of advantages
Ø Price/performance
Ø Growth
Ø Provision of a multi-purpose system
