OpenSSI Cluster Project – openssi.org

Jeff Edlund, Senior Principal Solution Architect, NSP

Bruce Walker Staff Fellow – Office of Strategy & Technology

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda

• What are today's clustering strategies for Linux
• Why isn't failover clustering enough
• What is Single System Image (SSI)
• Why is SSI so important
• OpenSSI Cluster Project Architecture
• Project Status

Many types of Clusters

• High Performance Clusters
  ¦ Beowulf; 1000 nodes; parallel programs; MPI
• Load-leveling Clusters
  ¦ Move processes around to borrow cycles (e.g. Mosix)
• Web-Service Clusters
  ¦ LVS/Piranha; load-level TCP connections; replicate data
• Storage Clusters
  ¦ GFS; parallel filesystems; same view of data from each node
• Database Clusters
  ¦ Oracle Parallel Server
• High Availability Clusters
  ¦ ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters
• Single System Image Clusters

Who is Doing SSI Clustering?

• Outside Linux:
  ¦ Compaq/HP with VMSClusters, TruClusters, NSK, and NSC
  ¦ Sun had "Full Moon"/Solaris MC (now SunClusters)
  ¦ IBM Sysplex?
• Linux SSI:
  ¦ Scyld – form of SSI via Bproc
  ¦ Mosix – form of SSI due to their home-node/process migration technique; looking at a single root filesystem
  ¦ Polyserve – form of SSI via CFS (Cluster File System)
  ¦ Qlusters – SSI through a software/middleware layer
  ¦ RedHat GFS – Global File System (based on Sistina)
  ¦ Hive Computing – declarative programming model for "workers"
  ¦ OpenSSI Cluster Project – SSI project to bring all attributes together

Scyld – Beowulf

Bproc (used by Scyld):
  ¦ process-related solution
  ¦ master node with slaves
  ¦ initiate process on the master node and explicitly "move", "rexec" or "rfork" to a slave node
  ¦ all files closed when the process is moved
  ¦ master node can "see" all the processes which were started there
  ¦ moved processes see the process space of the master (some pid mapping)
  ¦ process system calls shipped back to the master node (including fork)
  ¦ other system calls executed locally but not SSI

Mosix

Mosix / OpenMosix:
  ¦ home nodes with slaves
  ¦ initiate process on the home node and transparently migrate to other nodes
  ¦ home node can "see" all and only the processes started there
  ¦ moved processes see the view of the home node
  ¦ most system calls actually executed back on the home node
  ¦ DFSA helps to allow I/O to be local to the process

PolyServe

Matrix Server:
  ¦ completely symmetric Cluster File System with DLM (no master/slave relationships)
  ¦ each node must be directly attached to the SAN
  ¦ limited SSI for management
  ¦ no SSI for processes
  ¦ no load balancing

Qlusters

ClusterFrame:
  ¦ based on Mosix
  ¦ uses home-node SSI
  ¦ centralized policy-based management
    • reduces overhead
    • pre-determined resource allowances
    • centralized provisioning
  ¦ stateful application recovery

[Figure: ClusterFrame product stack – Application Components; ClusterFrame XHA (Xtreme High Availability), ClusterFrame SSI (Single System Image), ClusterFrame QRM (Qlusters Resource Manager), Enterprise Cluster Management; ClusterFrame Platform; Linux Kernel; Intel Blades & Storage Systems]

RedHat GFS – Global File System

RedHat Cluster Suite (GFS):
  ¦ formerly Sistina
  ¦ primarily a parallel physical filesystem (only real form of SSI)
  ¦ used in conjunction with the RedHat cluster manager to provide
    • high availability
    • IP load balancing
  ¦ limited sharing and no process load balancing

Hive Computing – Tsunami

Hive Creator:
  ¦ Hives can be made up of any number of IA32 machines
  ¦ Hives consist of:
    • client applications
    • Hive client API
    • workers
    • worker applications
  ¦ databases exist outside of the Hive
  ¦ applications must be modified to run in a Hive
  ¦ no Cluster File System
  ¦ closer to the Grid model than SSI

Are there Opportunity Gaps in the current SSI offerings? YES!!

A full SSI solution is the foundation for simultaneously addressing all the issues in all the cluster solution areas.

Opportunity to combine:

• High Availability
• IP load balancing
• IP failover
• Process load balancing
• Cluster filesystem
• Distributed Lock Manager
• Single namespace
• Much more …

What is a Full Single System Image Solution?

Complete cluster looks like a single system to:
  ¦ Users
  ¦ Administrators
  ¦ Programs and programmers

Co-operating OS kernels providing transparent access to all OS resources cluster-wide, using a single namespace
  ¦ A.K.A. – you don't really know it's a cluster!

The state of cluster nirvana

SMP – Symmetrical Multi-Processing functionality

Function                SMP
Manageability           Yes
Usability               Yes
Sharing / Utilization   Yes
High Availability
Scalability
Incremental Growth
Price / Performance

Value add of HA clustering to SMP

Function                SMP    Traditional Clusters
Manageability           Yes
Usability               Yes
Sharing / Utilization   Yes
High Availability              Yes
Scalability                    Yes
Incremental Growth             Yes
Price / Performance            Yes

SSI Clusters have the best of both!!

Function                SMP    Traditional Clusters    SSI Clusters
Manageability           Yes                            Yes
Usability               Yes                            Yes
Sharing / Utilization   Yes                            Yes
High Availability              Yes                     Yes
Scalability                    Yes                     Yes
Incremental Growth             Yes                     Yes
Price / Performance            Yes                     Yes

Common Clustering Goals

One or all of:
• High Availability
  ¦ a compute engine is always available to run my workload
• Scalability
  ¦ as I need more resource I can access it, transparently to the end user application
• Manageability
  ¦ I can guarantee some level of service because I can efficiently monitor, operate and service my compute resources
• Usability
  ¦ compute resources are assembled together in such a way as to give me trouble-free, easy operation of my compute resources without regard to having knowledge of the cluster

OpenSSI Linux Cluster Project

[Figure: radar chart (log scale) comparing SMP, a typical HA cluster, the OpenSSI Linux Cluster Project and the ideal/perfect cluster across four axes – Availability, Scalability, Manageability and Usability]

Overview of OpenSSI Cluster

• Single HA root filesystem
• Consistent OS kernel on each node
• Cluster formation early in boot
• Strong membership
• Single, clusterwide view of files, filesystems, devices, processes and IPC objects
• Single management domain
• Load balancing of connections and processes

OpenSSI Cluster Project Availability

• No single (or even multiple) point(s) of failure
• Automatic failover/restart of services in the event of hardware or software failure
• Application availability is simpler in an SSI cluster environment; stateful restart easily done
• SSI cluster provides a simpler operator and programming environment
• Online software upgrade
• Architected to avoid scheduled downtime

OpenSSI Cluster Project Price/Performance Scalability

• What is Scalability?
  ¦ Environmental Scalability and Application Scalability!

• Environmental (Cluster) Scalability:
  ¦ more USEABLE processors, memory, I/O, etc.
  ¦ SSI makes these added resources useable

OpenSSI Cluster Project Price/Performance Scalability

Application Scalability:
• SSI makes distributing function very easy
• SSI allows sharing of resources between processes on different nodes (all resources transparently visible from all nodes):
  ¦ filesystems, IPC, processes, devices*, networking*
• SSI allows replicated instances to co-ordinate (almost as easy as replicated instances on an SMP; in some ways much better)
• Load balancing of connections and processes
• OS version in local memory on each node
• Industry-standard hardware (can mix hardware)
• Distributed OS algorithms written to scale to hundreds of nodes (and successfully demonstrated to 133 blades and 27 Itanium SMP nodes)

OpenSSI Linux Cluster – Manageability

• Single installation
  ¦ joining the cluster is automatic as part of booting and doesn't have to be managed
• Trivial online addition of new nodes
• Use standard single-node tools (SSI administration)
• Visibility of all resources of all nodes from any node
  ¦ applications, utilities, programmers, users and administrators often needn't be aware of the SSI cluster

• Simpler HA (High Availability) management

OpenSSI Linux Cluster Single System Administration

• Single set of user accounts (not NIS)
• Single set of filesystems (no "network mounts")
• Single set of devices
• Single view of networking
• Single set of services (printing, dumps, networking*, etc.)
• Single root filesystem (lots of admin files there)
• Single set of paging/swap spaces (goal)
• Single install
• Single boot and single copy of kernel
• Single-machine management tools

OpenSSI Linux Cluster – Ease of Use

• Can run anything anywhere with no setup
• Can see everything from any node
• Service failover/restart is trivial
• Automatic or manual load balancing
  ¦ powerful environment for application service provisioning, monitoring and re-arranging as needed

Blades and OpenSSI Clusters

¦ Very simple provisioning of hardware, system and applications
¦ No root filesystem per node
¦ Single install of the system and single application install
¦ Nodes can netboot
¦ Local disk only needed for swap but can be shared
¦ Blades don't need FCAL connect but can use it
¦ Single, highly available IP address for the cluster
¦ Single system update and single application update
¦ Sharing of filesystems, devices, processes, IPC that other blade "SSI" systems don't have
¦ Application failover very rapid and very simple
¦ Can easily have multiple clusters and then trivially move nodes between the clusters

How Does SSI Clustering Work?

[Figure: per-node architecture – on each uniprocessor or SMP node, users, applications and systems management issue standard OS kernel calls to a standard Linux 2.4 kernel with SSI hooks plus modular kernel extensions; each node accesses its devices and talks to the other nodes over an IP-based interconnect]

Components of Strict SSI

• Single file hierarchy
• Single I/O space
• Single process management
• Single virtual networking
• Single IPC space and access
  ¦ pipes, semaphores, shared memory, sockets, etc.
• Single system management and user interface
• Single memory space *******

Added Components for SSI+

• Cluster Membership (CLMS) and membership APIs
• Internode Communication Subsystem (ICS)
• Cluster Volume Manager
• Distributed Lock Manager
• Process migration/re-scheduling to other nodes
• Load-leveling of processes and internet connections
• Single simple installation
• High Availability features (see next slide)

Added HA Components for SSI+

• Basically anything pure HA clusters have:
  ¦ device failover and filesystem failover
  ¦ HA interconnect
  ¦ HA IP address or addresses
  ¦ process/application monitoring and restart
• Transparent filesystem failover or parallel filesystem

Component Contributions to OpenSSI Cluster Project

[Figure: components contributing to the OpenSSI Cluster Project – Lustre, Appl. Avail., CLMS, GFS, Mosix, Beowulf, Vproc, DLM, LVS, OCFS, IPC, DRBD, CFS, EVMS and Load Leveling; the legend distinguishes HP-contributed components, open source components already integrated, and components still to be integrated]

Component Contributions to OpenSSI Cluster Project

• LVS:
  ¦ front-end director (software) load-levels connections to backend servers
  ¦ can use NAT, tunneling or redirection
    • (we are using redirection)
  ¦ can fail over the director
  ¦ integrated with CLMS but doesn't use ICS
  ¦ http://www.LinuxVirtualServer.org

Component Contributions to OpenSSI Cluster Project

• GFS, openGFS:
  ¦ parallel physical filesystem; direct access to shared device from all nodes
  ¦ Sistina has proprietary version (GFS) (now RH has it)
    • http://www.sistina.com/products_gfs.htm
  ¦ project was using open version (openGFS)
    • http://sourceforge.net/projects/opengfs

Component Contributions to OpenSSI

• Lustre:
  ¦ open source project, funded by HP, Intel and US National Labs
  ¦ parallel network filesystem
  ¦ file service split between a metadata service (directories and file information) and a data service (spread across many data servers – striping, etc.)
  ¦ operations can be done and cached at the client if there is no contention
  ¦ designed to scale to thousands of clients and hundreds of server nodes
    • http://www.lustre.org

Component Contributions to OpenSSI Cluster Project

• DLM – Distributed Lock Manager:
  ¦ IBM open source project (being taken over)
  ¦ is now used by openGFS
  ¦ http://sourceforge.net/projects/opendlm

Component Contributions to OpenSSI Cluster Project

• DRBD – Distributed Replicated Block Device:
  ¦ open source project to provide block device mirroring across nodes in a cluster
  ¦ can provide HA storage made available via CFS
  ¦ works with OpenSSI
  ¦ http://drbd.cubit.at

Component Contributions to OpenSSI Cluster Project

• Beowulf:
  ¦ MPICH and other Beowulf subsystems just work on OpenSSI
    • Ganglia, Scalable PBS, Maui, …

Component Contributions to OpenSSI Cluster Project

• EVMS – Enterprise Volume Management System
  ¦ not yet clusterized or integrated with SSI
  ¦ http://sourceforge.net/projects/evms/

SSI Cluster Architecture/Components

[Figure: SSI cluster architecture – above the kernel: 13. Packaging and Install, 14. Init/booting/run levels, 15. Sysadmin, 16. Appl Availability/HA daemons, 17. Application Service Provisioning, 18. Timesync (NTP), 19. MPI, etc.; in the kernel: 1. Membership (CLMS), 2. Internode Communication/HA interconnect (ICS), 3. Filesystem interface (CFS, GFS, Lustre, physical filesystems), 4. Process Mgmt, 5. Process Loadleveling, 6. IPC, 7. Networking/LVS, 8. DLM, 9. Devices/shared storage (devfs), 10. Kernel Replication Service, 11. CLVM/EVMS (TBD), 12. DRBD (TBD)]

OpenSSI Linux Clusters – Status

• Version 1.0 just released
  ¦ Binary, Source and CVS options
  ¦ Functionally complete
  ¦ RH9 and RHEL3 release also available
  ¦ IA-32 and Itanium platforms
  ¦ Runs HPTC apps as well as Oracle RAC
  ¦ Available at openssi.org

OpenSSI Linux Clusters – Conclusions

• HP has recognized that Linux clusters are an important part of the future
• HP has recognized that combining scalability with availability and manageability/ease-of-use is key to clustering
• HP is leveraging its merger with Compaq (Tandem/Digital) to bring the very best of clustering to a Linux base

Backup material

1. SSI Cluster Membership (CLMS)

• CLMS kernel service on all nodes
• Current CLMS master on one node
• Cold SSI cluster boot selects the master (can fail over to another node)
  ¦ other nodes join in automatically and early in kernel initialization
• Nodedown detection subsystem monitors connectivity
  ¦ rapidly informs CLMS of failure (can get sub-second detection)
  ¦ excluded nodes immediately reboot (some integration with STOMITH still needed)
• There are APIs for membership and transitions

1. Cluster Membership APIs

• cluster_name()
• cluster_membership()
• clusternode_num()
• cluster_transition() and cluster_detailedtransition()
  ¦ membership transition events
• clusternode_info()
• clusternode_setinfo()
• clusternode_avail()

• Plus command versions for shell programming (a C usage sketch follows)
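The APIs above can be called from ordinary C programs. A minimal usage sketch follows; the header location and exact prototypes are not given on this slide, so the declarations below are illustrative assumptions only.

/*
 * Minimal sketch of using the membership APIs named above.
 * The prototypes are assumptions for illustration; consult the
 * OpenSSI headers for the real declarations.
 */
#include <stdio.h>

extern int clusternode_num(void);                    /* node number of the caller    */
extern int clusternode_avail(int node);              /* non-zero if the node is up   */
extern int cluster_membership(int *nodes, int max);  /* fill node list, return count */

int main(void)
{
    int nodes[64];
    int n, i;

    printf("running on node %d\n", clusternode_num());

    n = cluster_membership(nodes, 64);               /* current cluster members */
    for (i = 0; i < n; i++)
        printf("node %d is %s\n", nodes[i],
               clusternode_avail(nodes[i]) ? "up" : "down");
    return 0;
}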

2. Inter-Node Communication (ICS)

• Kernel-to-kernel transport subsystem
  ¦ runs over TCP/IP
  ¦ structured to run over VI or other messaging systems
• RPC, request/response, messaging
  ¦ server threads, queuing, channels, priority, throttling, connection mgmt, nodedown, ...
• Bonding for HA interconnect

3. Filesystem Strategy

• Support parallel physical filesystems (like GFS), a layered CFS (which allows SSI-cluster coherent access to non-parallel physical filesystems such as JFS, XFS, reiserfs, ext3, cdfs, etc.) and parallel distributed filesystems (e.g. Lustre)
• Transparently ensure all nodes see the same mount tree

3. Cluster Filesystem (CFS)

• Single root filesystem mounted on one node
• Other nodes join the root node and "discover" the root filesystem
• Other mounts done as in standard Linux
• Standard physical filesystems (ext2, ext3, XFS, ...)
• CFS layered on top (all access through CFS)
  ¦ provides coherency, single-site semantics, distribution and failure tolerance
• Transparent filesystem failover

3. Filesystem Failover for CFS – Overview

• Dual- or multi-ported disk strategy
• Simultaneous access to the disk not required
• CFS layered/stacked on a standard physical filesystem and optionally volume mgmt
• For each filesystem, only one node directly runs the physical filesystem code and accesses the disk, until movement or failure
• With hardware support, not limited to only dual porting
• Can move active filesystems for load balancing
• Processes on client nodes see no failures, even if the server fails (transparent failover to another server)

3. Filesystem Model – GFS/OpenGFS

• Parallel physical filesystem; direct access to shared device from all nodes
• Sistina has proprietary version (GFS)
  ¦ http://www.sistina.com/products_gfs.htm
• Project is currently using open version (openGFS)
  ¦ http://sourceforge.net/projects/opengfs

3. Filesystem Model – Lustre

• Open source project, funded by HP, Intel and US National Labs
• Parallel network filesystem
• File service split between a metadata service (directories and file information) and a data service (spread across many data servers – striping, etc.)
• Operations can be done and cached at the client if there is no contention
• Designed to scale to thousands of clients and hundreds of server nodes
  ¦ http://www.lustre.org

4. Process Management

• Single pid space but allocate locally
• Transparent access to all processes on all nodes
• Processes can migrate during execution (the next instruction is on a different node; consider it rescheduling on another node)
• Migration is via the /proc/<pid>/goto file (handled transparently by the kernel; see the sketch below) or the migrate syscall (migrate yourself)
• Also rfork and rexec syscall interfaces and onnode and fastnode commands
• Process part of /proc is system-wide (so ps & debuggers "just work" system-wide)
• Implemented via a virtual process (Vproc) architecture
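Because migration is exposed through /proc, it can also be driven from ordinary user code. A minimal sketch, assuming /proc/<pid>/goto accepts a decimal node number (the exact format is not stated on this slide):

/*
 * Ask the kernel to migrate a process by writing the target node
 * number to its /proc/<pid>/goto file. The decimal format is an
 * assumption for illustration.
 */
#include <stdio.h>
#include <sys/types.h>

int migrate_via_proc(pid_t pid, int target_node)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/goto", (int)pid);
    f = fopen(path, "w");
    if (!f)
        return -1;                       /* no such process, or not an OpenSSI kernel */
    fprintf(f, "%d\n", target_node);     /* kernel performs the migration             */
    return fclose(f);
}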

4. Vproc Features

• Process always executes system calls locally
• No "do-do" at a "home node"; never more than 1 task struct per process
• For HA and performance, processes can completely move
  ¦ therefore can service a node without application interruption
• Process always has only 1 process id
• Clusterwide job control
• Architecture allows competing remote-process implementations

4. Process Relationships

• Parent/child can be distributed
• Process group can be distributed
• Session can be distributed
• Foreground pgrp can be distributed
• Debugger/debuggee can be distributed
• Signaler and process to be signaled can be distributed
• All are rebuilt as appropriate on arbitrary failure
• Signals are delivered reliably under arbitrary failure

Vproc Architecture – Data Structures and Code Flow

[Figure: code flow vs. data structures – base OS code calls vproc interface routines for a given vproc (data structure: vproc, with private data); replaceable vproc code handles relationships and sends messages as needed, calling pproc routines through a defined interface to manipulate the task struct, and may have its own private data; base OS code then manipulates the task structure (data structure: task)]

4. Vproc Implementation

• Task structure split into 3 pieces (illustrated below):
  ¦ vproc (tiny, just pid and a pointer to private data)
  ¦ pvproc (primarily relationship lists; …)
  ¦ task structure
• All 3 on the process execution node
• vproc/pvproc structs can exist on other nodes, primarily as a result of process relationships
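A rough illustration of that three-way split; the field names and layout below are assumptions made for explanation, not the actual OpenSSI definitions:

/*
 * Illustrative sketch of the vproc split described on this slide.
 * Field names and layout are assumptions, not the real OpenSSI code.
 */
struct task_struct;                 /* the ordinary Linux task structure              */

struct pvproc {
    struct pvproc      *parent;     /* parent/child relationship list                 */
    struct pvproc      *sibling;
    struct pvproc      *pgrp_list;  /* process group / session relationships          */
    int                 exec_node;  /* where the process currently executes           */
    struct task_struct *task;       /* NULL on nodes that only track the process      */
};

struct vproc {
    int            pid;             /* clusterwide process id                         */
    struct pvproc *private;         /* pointer to the private (pvproc) data           */
};

/*
 * On the execution node all three pieces exist (vproc -> pvproc -> task);
 * other nodes may hold only vproc/pvproc copies for relationship tracking.
 */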

Vproc Implementation – Data Structures and Code Flow

[Figure: code flow vs. data structures – base OS code calls vproc interface routines for a given vproc (vproc); replaceable vproc code handles relationships (parent/child, process group, session, held in the pvproc) and sends messages as needed, calling pproc routines through a defined interface to manipulate the task struct; base OS code manipulates the task structure (task)]

Vproc Implementation – Vproc Interfaces

• High-level vproc interfaces exist for any operation (mostly system calls) which may act on a process other than the caller, or may impact a process relationship. Examples are sigproc, sigpgrp, exit, fork relationships, ...
• To minimize "hooks" there are no vproc interfaces for operations which are done strictly to yourself (e.g. setting signal masks)
• Low-level interfaces (pproc routines) are called by vproc routines for any manipulation of the task structure

Vproc Implementation – Tracking

• The origin node (creation node; the node whose number is in the pid) is responsible for knowing whether the process exists and where it is executing (so there is a vproc/pvproc struct on this node, and a field in the pvproc indicates the execution node of the process); if a process wants to move, it only has to tell its origin node
• If the origin node goes away, part of nodedown recovery populates the "surrogate origin node", whose identity is well known to all nodes; there is never a window where anyone might think the process did not exist
• When the origin node reappears, it resumes the tracking (lots of bad things would happen if you didn't do this, like confusing others and duplicate pids)
• If the surrogate origin node dies, nodedown recovery repopulates the takeover surrogate origin

Vproc Implementation – Relationships

• Relationships are handled through the pvproc struct, not the task struct
• A relationship list (linked list of vproc/pvproc structs) is kept with the list leader (e.g. the execution node of the parent or pgrp leader)
• A relationship list sometimes has to be rebuilt due to failure of the leader (e.g. process groups do not go away when the leader dies)
• Complete failure handling is quite complicated – there is a published paper on how we do it

Vproc Implementation – parent/child relationship

[Figure: parent/child relationship – the parent process (100) at its execution node has vproc, pvproc and task structures; child process 140, running at the parent's execution node, and child process 180, running remote, each have vproc/pvproc structures reached from the parent's pvproc via parent and sibling links; a task struct exists only on each process's execution node]

Vproc Implementation – APIs

• rexec() – semantically identical to exec but with a node number arg; can also take a "fastnode" argument
• rfork() – semantically identical to fork but with a node number arg
  ¦ can also take a "fastnode" argument
• migrate() – move me to the node indicated; can do fastnode as well
• /proc/<pid>/goto causes process migration
• where_pid() – way to ask on which node a process is executing (sketch below)
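A short sketch tying these calls together; the prototypes are assumptions based on the descriptions above (the real declarations come from the OpenSSI headers):

/*
 * Sketch of the process-placement APIs listed above. Prototypes are
 * assumed for illustration; node numbers 2 and 3 are arbitrary.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

extern pid_t rfork(int node);        /* fork, but the child starts on 'node' */
extern int   migrate(int node);      /* move the calling process to 'node'   */
extern int   where_pid(pid_t pid);   /* node on which 'pid' is executing     */

int main(void)
{
    pid_t child = rfork(2);          /* create the child on node 2 */

    if (child == 0) {
        printf("child on node %d\n", where_pid(getpid()));
        migrate(3);                  /* reschedule ourselves onto node 3 */
        printf("child now on node %d\n", where_pid(getpid()));
        return 0;
    }
    waitpid(child, NULL, 0);         /* normal wait semantics still apply */
    return 0;
}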

5. Process Load Leveling

• There are two types of load leveling – connection load leveling and process load leveling
• Process load leveling can be done "manually" or via daemons (manual is onnode and fastnode; automatic is optional)
• Nodes share load info with other nodes
• Each local daemon can decide to move work to another node
• Adapted from the Mosix project load-leveling

6. Interprocess Communication (IPC)

• Semaphores, message queues and shared memory are created and managed on the node of the process that created them
• Namespace managed by an IPC nameserver (rebuilt automatically on nameserver node failure)
• Pipes, fifos, ptys and sockets are created and managed on the node of the process that created them
• All IPC objects have a system-wide namespace and accessibility from all nodes (example below)
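The point of the clusterwide namespace is that unmodified single-node IPC code keeps working. The example below uses only standard SysV shared-memory calls; nothing OpenSSI-specific appears in the source, and a cooperating process started on another node (for example via onnode) could attach the same key:

/*
 * Plain SysV shared memory, unchanged from single-node Linux code.
 * Under OpenSSI the key lives in a single clusterwide IPC namespace,
 * so a cooperating process on another node can attach the same segment.
 */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = ftok("/tmp", 'S');                    /* same key clusterwide */
    int   id  = shmget(key, 4096, IPC_CREAT | 0666);  /* create or look up    */
    char *buf = shmat(id, NULL, 0);                   /* map the segment      */

    if (buf == (char *)-1)
        return 1;
    strcpy(buf, "hello from wherever this process is running");
    shmdt(buf);
    return 0;
}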

6. Basic IPC model

[Figure: basic IPC model – an object nameserver function tracks which objects are on which nodes; the object server (fifos, shm, pipes, sockets, ...) may know who the client nodes are; the object client knows where the server is]

7. Networking/LVS – Linux Virtual Server

• Front-end director (software) load-levels connections to backend servers
• Can use NAT, tunneling or redirection
  ¦ (we are using redirection)
• Can fail over the director
• Integrated with CLMS but doesn't use ICS
• Some enhancements to ease management
• http://www.LinuxVirtualServer.org

8. DLM – Distributed Lock Manager

• IBM open source project (abandoned and saved)
• Hopefully it will be used by openGFS
• http://sourceforge.net/projects/opendlm

9. Systemwide Device Naming and Access

• Each node creates a device space through devfs and mounts it in /cluster/nodenum#/dev
• Naming done through a stacked CFS
• Each node sees its devices in /dev
• Access through remote device fileops (distribution and coherency)
• Multiported devices can route through one node or direct from all
• Remote ioctls can use transparent remote copyin/copyout
• Device drivers usually don't require change or recompile

11. EVMS/CLVM

EVMS – Enterprise Volume Management System
  ¦ not yet clusterized or integrated with SSI
  ¦ http://sourceforge.net/projects/evms/

CLVM – Cluster Logical Volume Manager
  ¦ being done by Sistina (not going to be open source)
  ¦ not yet integrated with SSI
  ¦ http://www.sistina.com/products_lvm.htm

12. DRBD – Distributed Replicated Block Device

• Open source project to provide block device mirroring across nodes in a cluster
• Can provide HA storage made available via CFS
• Not yet integrated with SSI
• http://drbd.cubit.at

13. Packaging and Installation

• First node:
  ¦ install a standard distribution
  ¦ run the SSI RPM and reboot into the SSI kernel
• Other nodes:
  ¦ can PXE/netboot up and then use the shared root
  ¦ basically a trivial install (addnode command)

14. Init, Booting and Run Levels

• Single init process that can fail over if the node it is on fails
• Nodes can netboot into the cluster or have a local disk boot image
• All nodes in the cluster run at the same run level
• If the local boot image is old, automatic update and reboot to the new image

15. Single System Administration

• Single set of user accounts (not NIS)
• Single set of filesystems (no "network mounts")
• Single set of devices
• Single view of networking (with multiple devices)
• Single set of services (printing, dumps, networking*, etc.)
• Single root filesystem (lots of admin files there)
• Single install
• Single boot and single copy of kernel
• Single-machine management tools

16. Application Availability

• "Keepalive" and "Spawndaemon" are part of the base NonStop Clusters technology
• Provides user-level application restart for registered processes
• Restart on death of process or node
• Can register processes (or groups) at system startup or anytime
• Registered processes are started with "Spawndaemon"
• Can unregister at any time
• Used by the system to watch daemons
• Could use other standard application availability technology (e.g. Failsafe or ServiceGuard)

16. Application Availability

• Simpler than other application availability solutions
  ¦ one set of configuration files
  ¦ any process can run on any node
  ¦ restart does not require a hierarchy of resources (the system does resource failover)
  ¦ can use standard "services" management for automatic failover/restart

19. Beowulf

• MPICH libraries and mpirun just work in the SSI cluster (example below)
• LAM/MPI has also been adapted
• Job launch, job monitoring and job abort are much simpler in the OpenSSI environment
• http://www.beowulf.org
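"Just work" here means a standard MPI program needs no OpenSSI-specific changes; the usual MPICH hello-world, launched with mpirun, is enough to see ranks spread across nodes:

/* Standard MPI program – nothing OpenSSI-specific; under OpenSSI the
 * ranks launched by mpirun are simply processes that may be placed or
 * load-leveled onto different nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}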

OpenSSI Linux Clusters – Conclusions

• OpenSSI is an attempt to provide a common cluster framework for all forms of clustering
• OpenSSI simultaneously addresses availability, scalability, manageability and usability
• Lots of neat stuff all together in the OpenSSI project
• Demonstrated on a 25-node blade system
• Tested to 132 nodes using ProLiant rack systems
