Cluster Computing at a Glance
Chapter 1, by M. Baker and R. Buyya

High Performance Cluster Computing: Architectures and Systems
Book Editor: Rajkumar Buyya
Slides Prepared by: Hai Jin
Internet and Cluster Computing Center
http://www.buyya.com/cluster/

Chapter outline:
- Introduction
- Scalable Parallel Computer Architectures
- Towards Low Cost Parallel Computing and Motivations
- Windows of Opportunity
- A Cluster Computer and its Architecture
- Clusters Classifications
- Commodity Components for Clusters
- Network Services/Communications SW
- Cluster Middleware and Single System Image
- Resource Management and Scheduling (RMS)
- Programming Environments and Tools
- Cluster Applications
- Representative Cluster Systems
- Cluster of SMPs (CLUMPS)
- Summary and Conclusions

Introduction
- Need more computing power
- Improve the operating speed of processors & other components
  - constrained by the speed of light, thermodynamic laws, & the high financial costs of processor fabrication
- Connect multiple processors together & coordinate their computational efforts
  - parallel computers
  - allow the sharing of a computational task among multiple processors

How to Run Applications Faster?
- There are 3 ways to improve performance:
  - Work Harder
  - Work Smarter
  - Get Help
- Computer analogy
  - Work Harder: using faster hardware
  - Work Smarter: using optimized algorithms and techniques to solve computational tasks
  - Get Help: using multiple computers to solve a particular task


Era of Computing
- Rapid technical advances
  - the recent advances in VLSI technology
  - software technology: OS, PL, development methodologies, & tools
  - grand challenge applications have become the main driving force
- Parallel computing
  - one of the best ways to overcome the speed bottleneck of a single processor
  - good price/performance ratio of a small cluster-based parallel computer

Two Eras of Computing
(figure: the sequential and parallel computing eras)


Scalable Parallel Computer Architectures
- Taxonomy
  - based on how processors, memory & interconnect are laid out
  - Massively Parallel Processors (MPP)
  - Symmetric Multiprocessors (SMP)
  - Cache-Coherent Nonuniform Memory Access (CC-NUMA)
  - Distributed Systems
  - Clusters

Scalable Parallel Computer Architectures
- MPP
  - A large parallel processing system with a shared-nothing architecture
  - Consists of several hundred nodes with a high-speed interconnection network/switch
  - Each node consists of a main memory & one or more processors
  - Runs a separate copy of the OS
- SMP
  - 2-64 processors today
  - Shared-everything architecture
  - All processors share all the global resources available
  - A single copy of the OS runs on these systems


Scalable Parallel Computer Architectures
- CC-NUMA
  - a scalable multiprocessor system having a cache-coherent nonuniform memory access architecture
  - every processor has a global view of all of the memory
- Distributed systems
  - considered conventional networks of independent computers
  - have multiple system images, as each node runs its own OS
  - the individual machines could be combinations of MPPs, SMPs, clusters, & individual computers
- Clusters
  - a collection of workstations or PCs that are interconnected by a high-speed network
  - work as an integrated collection of resources
  - have a single system image spanning all its nodes

Key Characteristics of Scalable Parallel Computers
(comparison table)

Towards Low Cost Parallel Computing
- Parallel processing
  - linking together 2 or more computers to jointly solve some computational problem
  - since the early 1990s, an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards networks of workstations
  - the rapid improvement in the availability of commodity high performance components for workstations and networks
- Low-cost commodity supercomputing
  - from specialized traditional supercomputing platforms to cheaper, general purpose systems consisting of loosely coupled components built up from single or multiprocessor PCs or workstations
  - need for standardization of many of the tools and utilities used by parallel applications (e.g., MPI, HPF)

Motivations of using NOW over Specialized Parallel Computers
- Individual workstations are becoming increasingly powerful
- Communication bandwidth between workstations is increasing and latency is decreasing
- Workstation clusters are easier to integrate into existing networks
- Typical low user utilization of personal workstations
- Development tools for workstations are more mature
- Workstation clusters are a cheap and readily available alternative to specialized parallel computers
- Clusters can be easily grown

Trend
- Workstations with UNIX for science & industry vs PC-based machines for administrative work & word processing
- A rapid convergence in processor performance and kernel-level functionality of UNIX workstations and PC-based machines

Windows of Opportunities
- Parallel Processing
  - Use multiple processors to build MPP/DSM-like systems for parallel computing
- Network RAM
  - Use memory associated with each workstation as an aggregate DRAM cache
- Software RAID
  - Redundant array of inexpensive disks
  - Use the arrays of workstation disks to provide cheap, highly available, & scalable file storage
  - Possible to provide parallel I/O support to applications
- Multipath Communication
  - Use multiple networks for parallel data transfer between nodes

Cluster Computer and its Architecture
- A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
- A node
  - a single or multiprocessor system with memory, I/O facilities, & OS
- A cluster
  - generally 2 or more computers (nodes) connected together
  - in a single cabinet, or physically separated & connected via a LAN
  - appears as a single system to users and applications
  - provides a cost-effective way to gain features and benefits

Cluster Computer Architecture
(figure)


Prominent Components of Cluster Computers (I)
- Multiple High Performance Computers
  - PCs
  - Workstations
  - SMPs (CLUMPS)
  - Distributed HPC Systems leading to Metacomputing

(figure slide: a head node connected through a switch to client nodes)


Prominent Components of Cluster Computers (II)
- High Performance Networks/Switches
  - Ethernet (10 Mbps), Fast Ethernet (100 Mbps)
  - InfiniBand (1-8 Gbps)
  - Gigabit Ethernet (1 Gbps)
  - SCI (Dolphin - MPI - 12 microsecond latency)
  - ATM
  - Myrinet (1.2 Gbps)
  - Digital Memory Channel
  - FDDI

Prominent Components of Cluster Computers (III)
- State of the art Operating Systems
  - Linux (Beowulf)
  - Microsoft NT (Illinois HPVM)
  - Sun Solaris (Berkeley NOW)
  - IBM AIX (IBM SP2)
  - HP UX (Illinois - PANDA)
  - Mach (microkernel-based OS) (CMU)
  - Cluster Operating Systems (Solaris MC, SCO Unixware, MOSIX (academic project))
  - OS gluing layers (Berkeley GLUnix)

Prominent Components of Cluster Computers (IV)
- Network Interface Card
  - Myrinet has NIC
  - InfiniBand (HBA)
  - User-level access support

Prominent Components of Cluster Computers (V)
- Fast Communication Protocols and Services
  - Active Messages (Berkeley)
  - Fast Messages (Illinois)
  - U-net (Cornell)
  - XTP (Virginia)


Prominent Components of Cluster Computers (VI)
- Cluster Middleware
  - Single System Image (SSI)
  - System Availability (SA) Infrastructure
  - Hardware
    - DEC Memory Channel, DSM (Alewife, DASH), SMP techniques
  - Kernel/Gluing Layers
    - Solaris MC, Unixware, GLUnix
  - Applications and Subsystems
    - Applications (system management and electronic forms)
    - Runtime systems (software DSM, PFS, etc.)
    - Resource management and scheduling software (RMS)
      - CODINE, LSF, PBS, NQS, etc.

Prominent Components of Cluster Computers (VII)
- Parallel Programming Environments and Tools
  - Threads (PCs, SMPs, NOW, ...)
    - POSIX Threads
    - Java Threads
  - MPI
    - Linux, NT, on many supercomputers
  - PVM
  - Software DSMs (Shmem)
  - Compilers
    - C/C++/Java
    - Parallel programming with C++ (MIT Press book)
  - RAD (rapid application development tools)
    - GUI based tools for PP modeling
  - Debuggers
  - Performance Analysis Tools
  - Visualization Tools

Prominent Components of Cluster Computers (VIII)
- Applications
  - Sequential
  - Parallel / Distributed (cluster-aware applications)
    - Grand Challenge applications
      - Weather Forecasting
      - Quantum Chemistry
      - Molecular Biology Modeling
      - Engineering Analysis (CAD/CAM)
      - ...
    - PDBs, web servers, data-mining

Key Operational Benefits of Clustering
- High Performance
- Expandability and Scalability
- High Throughput
- High Availability


Clusters Classification (I)
- Application Target
  - High Performance (HP) Clusters
    - Grand Challenge applications
  - High Availability (HA) Clusters
    - Mission critical applications

Clusters Classification (II)
- Node Ownership
  - Dedicated clusters
  - Non-dedicated clusters
    - Adaptive parallel computing
    - Communal multiprocessing


Clusters Classification (III)
- Node Operating System
  - Linux Clusters (e.g., Beowulf)
  - Solaris Clusters (e.g., Berkeley NOW)
  - NT Clusters (e.g., HPVM)
  - AIX Clusters (e.g., IBM SP2)
  - SCO/Compaq Clusters (Unixware)
  - Digital VMS Clusters
  - HP-UX Clusters
  - Microsoft Wolfpack Clusters

Clusters Classification (IV)
- Node Hardware
  - Clusters of PCs (CoPs)
  - Piles of PCs (PoPs)
  - Clusters of Workstations (COWs)
  - Clusters of SMPs (CLUMPs)


Clusters Classification (V)
- Levels of Clustering
  - Group Clusters (#nodes: 2-99)
    - Nodes are connected by a SAN like Myrinet
  - Departmental Clusters (#nodes: 10s to 100s)
  - Organizational Clusters (#nodes: many 100s)
  - National Metacomputers (WAN/Internet-based)
  - International Metacomputers (Internet-based, #nodes: 1000s to many millions)
    - Metacomputing
    - Web-based Computing
    - Agent Based Computing
    - Java plays a major role in web and agent based computing

Clusters Classification (VI)
- Node Configuration
  - Homogeneous Clusters
    - All nodes will have similar architectures and run the same OSs
  - Heterogeneous Clusters
    - All nodes will have different architectures and run different OSs

Commodity Components for Clusters (I)
- Processors
  - Intel processors
    - Pentium Pro and Pentium Xeon
    - AMD x86, Cyrix x86, etc.
  - Digital Alpha
    - Alpha 21364 processor integrates processing, memory controller, and network interface into a single chip
  - IBM PowerPC
  - Sun SPARC
  - SGI MIPS
  - HP PA
  - Berkeley Intelligent RAM (IRAM) integrates processor and DRAM onto a single chip

Commodity Components for Clusters (II)
- Memory and Cache
  - Standard Industry Memory Module (SIMM)
  - Extended Data Out (EDO)
    - Allows the next access to begin while the previous data is still being read
  - Fast page mode
    - Allows multiple adjacent accesses to be made more efficiently
  - Access to DRAM is extremely slow compared to the speed of the processor
    - the very fast memory used for cache is expensive, & cache control circuitry becomes more complex as the size of the cache grows
  - Within Pentium-based machines, it is not uncommon to have a 64-bit wide memory bus as well as a chip set that supports 2 Mbytes of external cache

Commodity Components for Clusters (III)
- Disk and I/O
  - Overall improvement in disk access time has been less than 10% per year
  - Amdahl's law
    - Speed-up obtained from faster processors is limited by the slowest system component
  - Parallel I/O
    - Carry out I/O operations in parallel, supported by a parallel file system based on hardware or software RAID

Commodity Components for Clusters (IV)
- System Bus
  - ISA bus (AT bus)
    - Clocked at 5 MHz and 8 bits wide
    - Clocked at 13 MHz and 16 bits wide
  - VESA bus
    - 32-bit bus matched to the system's clock speed
  - PCI bus
    - 133 Mbytes/s transfer rate (derived below)
    - Adopted both in Pentium-based PCs and non-Intel platforms (e.g., Digital Alpha Server)
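For reference, the 133 Mbytes/s PCI figure is simply the product of the bus width and clock of the original 32-bit, 33.3 MHz PCI specification:

$32\ \mathrm{bits} \times 33.3\ \mathrm{MHz} = 4\ \mathrm{bytes} \times 33.3 \times 10^{6}\ \mathrm{s^{-1}} \approx 133\ \mathrm{Mbytes/s}$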

Commodity Components for Clusters (V)
- Cluster Interconnects
  - Communicate over high-speed networks using a standard networking protocol such as TCP/IP or a low-level protocol such as AM
  - Standard Ethernet
    - 10 Mbps
    - cheap, easy way to provide file and printer sharing
    - bandwidth & latency are not balanced with the computational power
  - Ethernet, Fast Ethernet, and Gigabit Ethernet
    - Fast Ethernet: 100 Mbps
    - Gigabit Ethernet
      - preserves Ethernet's simplicity
      - delivers a very high bandwidth to aggregate multiple Fast Ethernet segments

Commodity Components for Clusters (VI)
- Cluster Interconnects
  - Asynchronous Transfer Mode (ATM)
    - Switched virtual-circuit technology
    - Cell (small fixed-size data packet)
    - uses optical fiber - expensive upgrade
    - telephone style cables (CAT-3) & better quality cable (CAT-5)
  - Scalable Coherent Interface (SCI)
    - IEEE 1596-1992 standard aimed at providing a low-latency distributed shared memory across a cluster
    - Point-to-point architecture with directory-based cache coherence
      - reduces the delay of interprocessor communication
      - eliminates the need for runtime layers of software protocol-paradigm translation
    - less than 12 usec zero message-length latency on Sun SPARC
    - Designed to support distributed multiprocessing with high bandwidth and low latency
    - SCI cards for SPARC's SBus and PCI-based SCI cards from Dolphin
    - Scalability constrained by the current generation of switches & relatively expensive components

Commodity Components for Clusters (VII)
- Cluster Interconnects
  - Myrinet
    - 1.28 Gbps full-duplex interconnection network
    - Uses low-latency cut-through routing switches, which are able to offer fault tolerance by automatic mapping of the network configuration
    - Supports both Linux & NT
    - Advantages
      - Very low latency (5 µs, one-way point-to-point)
      - Very high throughput
      - Programmable on-board processor for greater flexibility
    - Disadvantages
      - Expensive: $1500 per host
      - Complicated scaling: switches with more than 16 ports are unavailable

Commodity Components for Clusters (VIII)
- Operating Systems
  - 2 fundamental services for users
    - make the computer hardware easier to use
      - create a virtual machine that differs markedly from the real machine
    - share hardware resources among users
      - Processor - multitasking
  - The new concept in OS services
    - support multiple threads of control in a process itself
      - parallelism within a process
      - multithreading
    - POSIX thread interface is a standard programming environment
  - Trend
    - Modularity - MS Windows, IBM OS/2
    - Microkernel - provides only essential OS services
      - high-level abstraction of OS portability

Commodity Components for Clusters (IX)
- Operating Systems
  - Solaris
    - UNIX-based multithreading and multiuser OS
    - supports Intel x86 & SPARC-based platforms
    - Real-time scheduling feature critical for multimedia applications
    - Supports two kinds of threads
      - Light Weight Processes (LWPs)
      - User-level threads
    - Supports both BSD and several non-BSD file systems
      - CacheFS
      - AutoClient
      - TmpFS: uses main memory to contain a file system
      - Proc file system
      - Volume file system
    - Supports distributed computing & is able to store & retrieve distributed information
      - OpenWindows allows applications to be run on remote systems

Commodity Components for Clusters (X)
- Operating Systems
  - Linux
    - UNIX-like OS
    - Runs on cheap x86 platforms, yet offers the power and flexibility of UNIX
    - Readily available on the Internet and can be downloaded without cost
    - Easy to fix bugs and improve system performance
    - Users can develop or fine-tune hardware drivers, which can easily be made available to other users
    - Features such as preemptive multitasking, demand-paged virtual memory, multiuser and multiprocessor support


Commodity Components for Clusters (XI)
- Operating Systems
  - NT (New Technology)
    - Preemptive, multitasking, multiuser, 32-bit OS
    - Object-based security model and a special file system (NTFS) that allows permissions to be set on a file and directory basis
    - Supports multiple CPUs and provides multitasking using symmetrical multiprocessing
    - Supports different CPUs and multiprocessor machines with threads
    - Has the network protocols & services integrated with the base OS
      - several built-in networking protocols (IPX/SPX, TCP/IP, NetBEUI) & APIs (NetBIOS, DCE RPC, Windows Sockets (Winsock))

Windows NT 4.0 Architecture
(figure)


Network Services/Communication SW
- Communication infrastructure support protocols for
  - Bulk-data transport
  - Streaming data
  - Group communications
- Communication services provide the cluster with important QoS parameters
  - Latency
  - Bandwidth
  - Reliability
  - Fault-tolerance
  - Jitter control
- Network services are designed as a hierarchical stack of protocols with relatively low-level communication APIs, providing the means to implement a wide range of communication methodologies
  - RPC
  - DSM
  - Stream-based and message passing interfaces (e.g., MPI, PVM); see the TCP sketch below

Cluster Middleware & SSI
- SSI
  - Supported by a middleware layer that resides between the OS and the user-level environment
- Middleware consists of essentially 2 sublayers of SW infrastructure
  - SSI infrastructure
    - Glues together OSs on all nodes to offer unified access to system resources
  - System availability infrastructure
    - Enables cluster services such as checkpointing, automatic failover, recovery from failure, & fault-tolerant support among all nodes of the cluster
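As an illustration of the stream-based option listed under Network Services/Communication SW above, here is a minimal TCP client sketch in C; the peer address and port are placeholders and error handling is reduced to the essentials. TCP provides the reliable, ordered byte stream on which higher-level cluster communication layers are often built.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Placeholder peer: another cluster node's address and service port. */
    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);        /* SOCK_STREAM = TCP */
    if (fd < 0 || connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }
    const char *msg = "hello from node A\n";
    write(fd, msg, strlen(msg));                      /* reliable, ordered bytes */
    close(fd);
    return 0;
}
```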

What is Single System Image (SSI)?
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource
- SSI makes the cluster appear like a single machine to the user, to applications, and to the network
- A cluster without a SSI is not a cluster

Single System Image Boundaries
- Every SSI has a boundary
- SSI support can exist at different levels within a system, one able to be built on another

SSI Boundaries - an application's SSI boundary
(figure: a batch system's SSI boundary; source: In Search of Clusters)

SSI Levels/Layers
- Application and Subsystem Level
- Operating System Kernel Level
- Hardware Level

SSI at Operating System Kernel (Underware) or Gluing Layer

Level             | Examples                                           | Boundary                                                | Importance
Kernel/OS layer   | Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLUnix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
Kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc                | type of kernel objects: files, processes, etc.          | modularizes SSI code within kernel
Virtual memory    | none supporting operating system kernel            | each distributed virtual memory space                   | may simplify implementation of kernel objects
Microkernel       | Mach, PARAS, Chorus, OSF/1AD, Amoeba               | each service outside the microkernel                    | implicit SSI for all system services

(source: In Search of Clusters)

SSI at Hardware Layer

Level          | Examples            | Boundary                    | Importance
Memory         | SCI, DASH           | memory space                | better communication and synchronization
Memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O

(source: In Search of Clusters)

SSI at Application and Subsystem Layer (Middleware)

Level       | Examples                                       | Boundary                                              | Importance
Application | cluster batch system, system management        | an application                                        | what a user wants
Subsystem   | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem                                           | SSI for all applications of the subsystem
File system | Sun NFS, OSF DFS, NetWare, and so on           | shared portion of the file system                     | implicitly supports many applications and subsystems
Toolkit     | OSF DCE, Sun ONC+, Apollo Domain               | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system

(source: In Search of Clusters)

Single System Image Benefits
- Provides a simple, straightforward view of all system resources and activities, from any node of the cluster
- Frees the end user from having to know where an application will run
- Frees the operator from having to know where a resource is located
- Lets the user work with familiar interfaces and commands, and allows administrators to manage the entire cluster as a single entity
- Reduces the risk of operator errors, with the result that end users see improved reliability and higher availability of the system

Single System Image Benefits (Cont'd)
- Allows centralized/decentralized system management and control, avoiding the need for skilled administrators for system administration
- Presents multiple, cooperating components of an application to the administrator as a single application
- Greatly simplifies system management
- Provides location-independent message communication
- Helps track the locations of all resources so that there is no longer any need for system operators to be concerned with their physical location
- Provides transparent process migration and load balancing across nodes
- Improved system response time and performance

Middleware Design Goals
- Complete Transparency in Resource Management
  - Allow users to use a cluster easily without knowledge of the underlying system architecture
  - The user is provided with the view of a globalized file system, processes, and network
- Scalable Performance
  - Clusters can easily be expanded, and their performance should scale as well
  - To extract the maximum performance, the SSI service must support load balancing & parallelism by distributing workload evenly among nodes
- Enhanced Availability
  - Middleware services must be highly available at all times
  - At any time, a point of failure should be recoverable without affecting a user's application
    - Employ checkpointing & fault tolerance technologies
    - Handle consistency of data when replicated


SSI Support Services
- Single Entry Point
  - telnet cluster.myinstitute.edu
  - telnet node1.cluster.myinstitute.edu
- Single File Hierarchy: xFS, AFS, Solaris MC Proxy
- Single Management and Control Point: management from a single GUI
- Single Virtual Networking
- Single Memory Space: Network RAM / DSM
- Single Job Management: GLUnix, Codine, LSF
- Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); may use Web technology

Availability Support Functions
- Single I/O Space (SIOS)
  - any node can access any peripheral or disk device without knowledge of its physical location
- Single Process Space (SPS)
  - Any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node
- Checkpointing and Process Migration
  - Saves the process state and intermediate results in memory or to disk to support rollback recovery when a node fails
  - Process migration for dynamic load balancing among the cluster nodes


Resource Management and Scheduling (RMS)
- RMS is the act of distributing applications among computers to maximize their throughput
- Enables the effective and efficient utilization of the resources available
- Software components
  - Resource manager
    - Locating and allocating computational resources, authentication, process creation and migration
  - Resource scheduler
    - Queueing applications, resource location and assignment
- Reasons for using RMS
  - Provide an increased, and reliable, throughput of user applications on the systems
  - Load balancing
  - Utilizing spare CPU cycles
  - Providing fault-tolerant systems
  - Manage access to powerful systems, etc.
- Basic architecture of RMS: client-server system

Services provided by RMS
- Process Migration
  - Computational resource has become too heavily loaded
  - Fault tolerance concern
- Checkpointing
- Scavenging Idle Cycles
  - 70% to 90% of the time most workstations are idle
- Fault Tolerance
- Minimization of Impact on Users
- Load Balancing
- Multiple Application Queues

Some Popular Resource Management Systems

Commercial Systems
Project | URL
LSF     | http://www.platform.com/
CODINE  | http://www.genias.de/products/codine/tech_desc.html
Easy-LL | http://www.tc.cornell.edu/UserDoc/SP/LL12/Easy/
NQE     | http://www.cray.com/products/software/nqe/

Public Domain Systems
Project | URL
CONDOR  | http://www.cs.wisc.edu/condor/
GNQS    | http://www.gnqs.org/
DQS     | http://www.scri.fsu.edu/~pasko/dqs.html
PRM     | http://gost.isi.edu/gost-group/products/prm/
PBS     | http://pbs.mrj.com/

Programming Environments and Tools (I)
- Threads (PCs, SMPs, NOW, ...)
  - In multiprocessor systems
    - Used to simultaneously utilize all the available processors
  - In uniprocessor systems
    - Used to utilize the system resources effectively
  - Multithreaded applications offer quicker response to user input and run faster
  - Potentially portable, as there exists an IEEE standard for the POSIX threads interface (pthreads); see the sketch below
  - Extensively used in developing both application and system software
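A minimal POSIX threads sketch in C, illustrating the pthreads interface mentioned above; the thread count and the work each thread does are arbitrary, chosen only for illustration.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each worker computes a partial sum over its own slice of the index range. */
static void *worker(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = id * 1000; i < (id + 1) * 1000; i++)
        sum += i;
    printf("thread %ld: partial sum = %ld\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for all workers to finish */
    return 0;
}
```

On a uniprocessor the threads simply interleave; on an SMP node they can run on all available processors, which is exactly the distinction the slide draws.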

Programming Environments and Tools (II)
- Distributed Shared Memory (DSM) Systems
  - Message-passing
    - the most efficient, widely used, programming paradigm on distributed memory systems
    - complex & difficult to program
  - Shared memory systems
    - offer a simple and general programming model
    - but suffer from scalability limits
  - DSM on distributed memory systems
    - an alternative, cost-effective solution
    - Software DSM
      - Usually built as a separate layer on top of the communication interface
      - Takes full advantage of the application characteristics: virtual pages, objects, & language types are units of sharing
      - TreadMarks, Linda
    - Hardware DSM
      - Better performance, no burden on user & SW layers, fine granularity of sharing, extensions of the cache coherence scheme, & increased HW complexity
      - DASH, Merlin

Programming Environments and Tools (III)
- Message Passing Systems (MPI and PVM)
  - Allow efficient parallel programs to be written for distributed memory systems
  - 2 most popular high-level message-passing systems: PVM & MPI
  - PVM
    - both an environment & a message-passing library
  - MPI
    - a message passing specification, designed to be a standard for distributed memory parallel computing using explicit message passing
    - attempts to establish a practical, portable, efficient, & flexible standard for message passing
    - generally, application developers prefer MPI, as it is fast becoming the de facto standard for message passing (a minimal example follows below)
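A minimal MPI sketch in C showing the explicit message passing style the slide refers to: rank 0 sends one integer to rank 1 over the default communicator. This is generic MPI, not tied to any particular cluster described in the chapter.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (size >= 2) {
        if (rank == 0) {
            value = 42;                     /* rank 0 sends a value to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 of %d received %d\n", size, value);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Typically launched with something like `mpirun -np 4 ./a.out`, one process per cluster node or CPU.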


Programming Environments and Tools (IV)
- Parallel Debuggers and Profilers
  - Debuggers
    - Very limited
    - HPDF (High Performance Debugging Forum), formed as a Parallel Tools Consortium project in 1996
      - Developed the HPD Version specification, which defines the functionality, semantics, and syntax for a command-line parallel debugger
    - TotalView
      - A commercial product from Dolphin Interconnect Solutions
      - The only widely available GUI-based parallel debugger that supports multiple HPC platforms
      - Only used in homogeneous environments, where each process of the parallel application being debugged must be running under the same version of the OS

Functionality of Parallel Debugger
- Managing multiple processes and multiple threads within a process
- Displaying each process in its own window
- Displaying source code, stack trace, and stack frame for one or more processes
- Diving into objects, subroutines, and functions
- Setting both source-level and machine-level breakpoints
- Sharing breakpoints between groups of processes
- Defining watch and evaluation points
- Displaying arrays and their slices
- Manipulating code variables and constants

Programming Environments and Tools (V)
- Performance Analysis Tools
  - Help a programmer understand the performance characteristics of an application
  - Analyze & locate parts of an application that exhibit poor performance and create program bottlenecks
  - Major components
    - A means of inserting instrumentation calls to the performance monitoring routines into the user's application
    - A run-time performance library that consists of a set of monitoring routines
    - A set of tools for processing and displaying the performance data
  - Issue with performance monitoring tools
    - Intrusiveness of the tracing calls and their impact on the application performance
    - Instrumentation affects the performance characteristics of the parallel application and thus provides a false view of its performance behavior

Performance Analysis and Visualization Tools

Tool    | Supports                                                  | URL
AIMS    | Instrumentation, monitoring library, analysis             | http://science.nas.nasa.gov/Software/AIMS
MPE     | Logging library and snapshot performance visualization    | http://www.mcs.anl.gov/mpi/mpich
Pablo   | Monitoring library and analysis                           | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Paradyn | Dynamic instrumentation, running analysis                 | http://www.cs.wisc.edu/paradyn
SvPablo | Integrated instrumentor, monitoring library and analysis  | http://www-pablo.cs.uiuc.edu/Projects/Pablo/
Vampir  | Monitoring library, performance visualization             | http://www.pallas.de/pages/vampir.htm
Dimemas | Performance prediction for message passing programs       | http://www.pallas.com/pages/dimemas.htm
Paraver | Program visualization and analysis                        | http://www.cepba.upc.es/paraver

Programming Environments and Tools (VI)
- Cluster Administration Tools
  - Berkeley NOW
    - Gathers & stores data in a relational DB
    - Uses a Java applet to allow users to monitor a system
  - SMILE (Scalable Multicomputer Implementation using Low-cost Equipment)
    - Called K-CAP
    - Consists of compute nodes, a management node, & a client that can control and monitor the cluster
    - K-CAP uses a Java applet to connect to the management node through a predefined URL address in the cluster
  - PARMON
    - A comprehensive environment for monitoring large clusters
    - Uses client-server techniques to provide transparent access to all nodes to be monitored
    - parmon-server & parmon-client

Need of more Computing Power: Grand Challenge Applications
- Solving technology problems using computer modeling, simulation and analysis
- Example areas (figure): Geographic Information Systems, Life Sciences, Aerospace, CAD/CAM, Digital Biology, Military Applications

Representative Cluster Systems (I)
- The Berkeley Network of Workstations (NOW) Project
  - Demonstrates building of a large-scale parallel computer system using mass-produced commercial workstations & the latest commodity switch-based network components
  - Interprocess communication
    - Active Messages (AM)
      - basic communication primitives in Berkeley NOW
      - a simplified remote procedure call that can be implemented efficiently on a wide range of hardware (an illustrative sketch follows below)
  - Global Layer Unix (GLUnix)
    - An OS layer designed to provide transparent remote execution, support for interactive parallel & sequential jobs, load balancing, & backward compatibility for existing application binaries
    - Aims to provide a cluster-wide namespace and uses Network PIDs (NPIDs) and Virtual Node Numbers (VNNs)

Architecture of NOW System
(figure)
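To make the active-message idea concrete, here is a small illustrative sketch in C. This is not the actual Berkeley AM API; the message layout, handler table, and function names are invented for illustration. The point it shows is that each incoming message carries the identity of a handler that runs immediately on arrival, so there is no separate receive-and-dispatch step in the critical path.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-the-wire message: a handler index plus a small payload. */
typedef struct {
    uint32_t handler_id;   /* which handler to run on the receiving node */
    uint32_t arg;          /* small inline argument */
} am_msg_t;

typedef void (*am_handler_t)(uint32_t arg);

static void incr_counter(uint32_t arg) { printf("counter += %u\n", arg); }
static void send_ack(uint32_t arg)     { printf("ack for request %u\n", arg); }

/* Handler table: the sender ships only an index, never a code address. */
static am_handler_t handlers[] = { incr_counter, send_ack };

/* Called by the (hypothetical) network layer when a message arrives:
 * the named handler runs directly, with no intermediate buffering. */
static void am_deliver(const am_msg_t *m) {
    handlers[m->handler_id](m->arg);
}

int main(void) {
    am_msg_t m = { 0, 5 };   /* pretend this just arrived from a remote node */
    am_deliver(&m);
    return 0;
}
```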

Representative Cluster Systems (II)
- The Berkeley Network of Workstations (NOW) Project
  - Network RAM
    - Allows free resources on idle machines to be utilized as a paging device for busy machines
    - Serverless
      - any machine can be a server when it is idle, or a client when it needs more memory than is physically available
  - xFS: Serverless Network File System
    - A serverless, distributed file system, which attempts to provide low-latency, high-bandwidth access to file system data by distributing the functionality of the server among the clients
    - The function of locating data in xFS is distributed by having each client responsible for servicing requests on a subset of the files
    - File data is striped across multiple clients to provide high bandwidth

Representative Cluster Systems (III)
- The High Performance Virtual Machine (HPVM) Project
  - Delivers supercomputer performance on a low-cost COTS system
  - Hides the complexities of a distributed system behind a clean interface
  - Challenges addressed by HPVM
    - Delivering high performance communication to standard, high-level APIs
    - Coordinating scheduling and resource management
    - Managing heterogeneity

HPVM Layered Architecture
(figure)

Representative Cluster Systems (IV)
- The High Performance Virtual Machine (HPVM) Project
  - Fast Messages (FM)
    - A high-bandwidth & low-latency communication protocol, based on Berkeley AM
    - Contains functions for sending long and short messages & for extracting messages from the network
    - Guarantees and controls the memory hierarchy
    - Guarantees reliable and ordered packet delivery as well as control over the scheduling of communication work
    - Originally developed on a Cray T3D & a cluster of SPARCstations connected by Myrinet hardware
    - Low-level software interface that delivers hardware communication performance
    - Higher-level interface layers offer greater functionality, application portability, and ease of use

Representative Cluster Systems (V)
- The Beowulf Project
  - Investigates the potential of PC clusters for performing computational tasks
  - Refers to a Pile-of-PCs (PoPC) to describe a loose ensemble or cluster of PCs
  - Emphasizes the use of mass-market commodity components, dedicated processors, and a private communication network
  - Achieves the best overall system cost/performance ratio for the cluster

Representative Cluster Systems (VI)
- The Beowulf Project
  - System Software
    - Grendel
      - the collection of software tools
      - resource management & support for distributed applications
    - Communication
      - through TCP/IP over Ethernet internal to the cluster
      - employs multiple Ethernet networks in parallel to satisfy the internal data transfer bandwidth required
      - achieved by 'channel bonding' techniques
    - Extends the Linux kernel to allow a loose ensemble of nodes to participate in a number of global namespaces
    - Two Global Process ID (GPID) schemes
      - Independent of external libraries
      - GPID-PVM, compatible with the PVM Task ID format & using PVM as its signal transport

Representative Cluster Systems (VII)
- Solaris MC: A High Performance Operating System for Clusters
  - A distributed OS for a multicomputer, a cluster of computing nodes connected by a high-speed interconnect
  - Provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network
  - Built as a globalization layer on top of the existing Solaris kernel
  - Interesting features
    - extends the existing Solaris OS
    - preserves the existing Solaris ABI/API compliance
    - provides support for high availability
    - uses C++, IDL, CORBA in the kernel
    - leverages Spring technology

Solaris MC Architecture
(figure)

Representative Cluster Systems (VIII)
- Solaris MC: A High Performance Operating System for Clusters
  - Uses an object-oriented framework for communication between nodes
    - Based on CORBA
    - Provides remote object method invocations
    - Provides object reference counting
    - Supports multiple object handlers
  - Single system image features
    - Global file system
      - Distributed file system, called ProXy File System (PXFS), provides a globalized file system without the need to modify the existing file system
    - Globalized process management
    - Globalized network and I/O

Cluster System Comparison Matrix

Project      | Platform                           | Communications                | OS                                                        | Other
Beowulf      | PCs                                | Multiple Ethernet with TCP/IP | Linux and Grendel                                         | MPI/PVM, Sockets and HPF
Berkeley NOW | Solaris-based PCs and workstations | Myrinet and Active Messages   | Solaris + GLUnix + xFS                                    | AM, PVM, MPI, HPF, Split-C
HPVM         | PCs                                | Myrinet with Fast Messages    | NT or Linux connection and global resource manager + LSF  | Java-fronted, FM, Sockets, Global Arrays, SHMEM and MPI
Solaris MC   | Solaris-based PCs and workstations | Solaris-supported             | Solaris + Globalization layer                             | C++ and CORBA

Cluster of SMPs (CLUMPS)
- Clusters of multiprocessors (CLUMPS)
  - To be the supercomputers of the future
  - Multiple SMPs with several network interfaces can be connected using high performance networks
  - 2 advantages
    - Benefit from the high performance, easy-to-use-and-program SMP systems with a small number of CPUs
    - Clusters can be set up with moderate effort, resulting in easier administration and better support for data locality inside a node

Hardware and Software Trends
- Network performance increase of tenfold using 100BaseT Ethernet with full duplex support
- The availability of switched network circuits, including full crossbar switches for proprietary network technologies such as Myrinet
- Workstation performance has improved significantly
  - Improvement of microprocessor performance has led to the availability of desktop PCs with the performance of low-end workstations, at significantly lower cost
- Performance gap between supercomputers and commodity-based clusters is closing rapidly
- Parallel supercomputers are now equipped with COTS components, especially microprocessors
- Increasing usage of SMP nodes with two to four processors
- The average number of transistors on a chip is growing by about 40% per annum
- The clock frequency growth rate is about 30% per annum

Advantages of using COTS-based Cluster Systems
- Price/performance when compared with a dedicated parallel supercomputer
- Incremental growth that often matches yearly funding patterns
- The provision of a multipurpose system

Technology Trend
(figure)


Computing Platforms Evolution: Breaking Administrative Barriers
(figure: performance vs. administrative barriers - individual, group, department, campus, state, national, globe, inter-planet, universe - as platforms evolve from the local desktop (single processor) through SMPs or supercomputers and enterprise clusters to global and inter-planet clusters/grids)

What are key attributes in HPC?
- Computation Time
- Communication Time
