SYSADMIN OpenSSI www.sxc.hu

Building high-performance clusters with OpenSSI SPEED QUEEN The OpenSSI framework rearranges processes for easy and transparent

clustering. BY ALEKSANDER KORZYNSKI

Single System Image (SSI) clus- source code of the program contain clus- OpenSSI constantly monitors the load ter is a collection of separate tering code – or at least that the program on the computers in the cluster, and it A computers that appear as a sin- be linked to a clustering library. automatically migrates processes among gle multi-processor system. SSI cluster- OpenSSI is a comprehensive, open the nodes. The OpenSSI system is capa- ing platforms are typically implemented source SSI clustering solution for ble of migrating a multi-threaded appli- at the kernel level. based on HP’s NonStop clustering plat- cation, but it cannot migrate individual In SSI high-performance clusters, the form. NonStop is derived from Locus, threads. A migrated process can con- program does not even have to know it which was developed in the 1980s. tinue with the same operations it was is running in a cluster. The only require- OpenSSI can transparently spread pro- doing at the original machine. It can ment is that the program spawn multiple cesses over multiple machines – a fea- read and write to the same files or de- processes. These processes then pass ture known as load-leveling [1]. Other vices. A migrated process can even con- transparently among the computers in SSI Linux clustering platforms capable tinue Inter-Process Communication the cluster. of load-leveling include openMosix and (IPC) over a pipe or socket. Although SSI clusters are not as scal- Kerrighed. Most applications run on OpenSSI able as some of the other clustering al- OpenMosix is by far the most popular without any modifications with some ex- ternatives, they offer a significant advan- SSI clustering alternative for Linux. In ceptions, as listed in the man page for tage in that the program running on the July, however, it was announced that the the migrate command. OpenSSI also pro- system doesn’t even have to know the openMosix project will end in March vides a library that programs can use to cluster exists. Other high-performance 2008. Kerrighed is relatively new and is control the cluster. clustering platforms require that the under rapid development. In this article, I will show you how to Administrative Features Aleksander Korzynski is a Software set up load-leveling in OpenSSI, starting The view of the filesystem is the same at Developer at ClusterVision in the with an overview of relevant OpenSSI every machine; however, the contents of Netherlands. Earlier this year, he re- features, then describing the OpenSSI in- a particular file can be different at each ceived his MSc from the Warsaw stallation and configuration before using computer in the cluster. This occurs University of Technology in Poland. a cluster to compile source code on mul- through a feature known as Context- His thesis project included analyzing tiple machines at once. Dependent SymLinks (CDSL). and modifying the source code of And finally, I will demonstrate a visual A CDSL is a symbolic link that points

THEAUTHOR the OpenSSI kernel. He enjoys play- monitoring and management tool for to different file contents at each ma- ing snooker. OpenSSI. chine. Devices on the local computer are

64 ISSUE 84 NOVEMBER 2007 OpenSSI SYSADMIN

available under /dev as usual, and de- If you want the cluster to be accessible The first problem you could encounter vices on other machines are accessible from an outside network, it’s a good idea is that your network card might not be under special subdirectories in /dev. to configure another network card on supported. Because the Live CD is based Process management is cluster-wide. one of the machines, rather than using on the old 2.4.22 kernel, it might not rec- The whole cluster shares a single PID the interconnect for that purpose. You’ll ognize late model cards. For instance, it space, with a single naming space for be able to access any computer in the didn’t recognize my network card, IPC objects. cluster through the system with the two which was manufactured in 2007. Additionally, the /proc filesystem is network cards. The other problem is that, contrary to modified to show information about all the official documentation, the Live CD the processes in the cluster. Init and Non-Init has buggy support of Etherboot. It only Many standard system tools work with OpenSSI is difficult to install. The official supports PXE well. This problem has OpenSSI without any modifications, al- documentation calls the procedure been reported to the OpenSSI mailing though some tools are specifically “somewhat clunky.” You should be fairly list. Check whether your card supports adapted for OpenSSI. The ps or top com- familiar with the distribution you are PXE. If you’re unlucky, jump to the hard mands, for example, display processes using, so that you’ll be able investigate drive installation directly. running on all computers in the cluster. and solve potential problems. To run the Live CD, download the CD In OpenSSI, one node is special and ISO image [2]. You have to burn it to one Versions called the init node. The init node boots blank CD only. You will boot only one The OpenSSI project supports stable and directly from the root filesystem. The re- node from the CD, then you’ll boot the development release branches. The sta- maining nodes boot over the network. remaining nodes over the network. ble versions are numbered 1.2.x and the To boot the non-init nodes over the development versions inhabit the 1.9.x network, you need Etherboot or PXE. Setting up a Cluster series. The stable branch is based on the Etherboot is free software that you in- For clarity, I’ll guide you through setting 2.4.x Linux kernel and the development stall Etherboot on a bootable medium, up a cluster of two nodes. The first node branch on the 2.6.x kernel. The latest like a floppy, a CD, or the hard drive, or will be the init node, and the other node stable version is OpenSSI 1.2.2, which is flash it into the network card’s ROM. will be the non-init node. based on kernel 2.4.22. The latest devel- Almost any network card is compati- First, enter the BIOS setup of the non- opment version is 1.9.2, which is based ble with Etherboot. PXE, on the other init node and enable PXE. Next, reboot on kernel 2.6.10. hand, is already integrated onto some the node. It will display its MAC address OpenSSI is released through distribu- network cards, and you usually enable and hang waiting for the boot server. tion-specific packages. Stable series re- it in the BIOS setup. Write the MAC address down on a piece leases are available for Fedora Core 2, of paper. 3.1 (Sarge), and Red Hat 9. Live CD Installation Now boot the init node from the Live Development releases are available for The easiest way to get started with CD. Wait until you are given the shell Fedora Core 3 and Debian 3.1 (Sarge). OpenSSI is by booting the Live CD; how- prompt. If you can’t boot due to hard- You can also find a Live CD version that ever, you may face two problems with ware problems, you can try the hard runs OpenSSI 1.2 and is based on Knop- the Live CD. drive installation. Using the 1.9.x pix 3.6. Development Branch Hardware The development branch is a port of The OpenSSI team has an ambitious To create an OpenSSI cluster, you’ll need the stable branch to the 2.6 kernel and road map. According to OpenSSI devel- at least two PCs and a local area net- to newer Linux distributions. The branch oper Roger Tsang, the next release will work. The network used for connecting inherits the advantages of the 2.6 kernel, be 1.9.3 in the development branch. It the machines in the cluster together is such as better I/ O schedulers, greater will be based on Linux kernel 2.6.11 and called the interconnect. The faster the scalability, more supported hardware, will include enhancements to high-avail- interconnect, the better your perfor- and an improved virtual machine. ability clustering, the cluster filesystem, The development branch adds support and the OpenSSI core. It will also add mance will be. for running OpenSSI in the Xen and experimental support for the Infiniband If you are using the OpenSSI Live CD, Linux KVM virtual machine environ- network architecture. none of the computers has to have a ments. The next release in the develop- Other plans include porting to the Linux hard drive. One computer will boot from ment branch will include even more im- 2.6.16.x stable kernel series. Ports to Red CD-ROM and the remaining computers provements. Hat Enterprise Linux 5 and the related will boot over the network. For the tradi- As of this writing, the development CentOS 5 are in the making. Support for tional hard drive installation, only one branch has a notable bug. The bug is ob- x86_64 hardware is under development computer has to have a hard drive. served when a process that has been as well. The machines that boot over the net- moved from one machine to another lis- Once the 1.9.x branch is stable, it will work can start booting from floppy, CD- tens on a TCP/ IP socket. In particular, the become OpenSSI 2.0.0. As a long-term ROM, a small partition on the hard drive, bug affects SSI high-performance clus- task, the developers will continue work- tering and does not affect web service or ing on Linux kernel hooks intended for or from a network card capable of net- high-availability clustering. inclusion into the mainstream kernel. work booting.

NOVEMBER 2007 ISSUE 84 65 SYSADMIN OpenSSI

OpenSSI series instead of 1.2.x will prob- Finally, hit Enter ably help. twice to acknowledge If you succeed in booting the Live CD, the configuration. you should now have the shell prompt at After you have config- the init node. Now you have to configure ured the init node, the init node to accept the other node another node can join into the cluster. To do that, run ssi- the cluster. addnode. You will need to answer a se- Reboot the non-init ries of questions. node now and wait For the first question, type y to enable until you see the shell verbose guidelines. Next, you will be prompt. Type cluster asked to pick a node number. Type a -v to confirm that the number between 2 and 125. You will non-init node has then be asked for the MAC address of booted successfully. Figure 1: The webView management tool lets you monitor the the node. Type the address you wrote If the new node nodes on the cluster. down previously. doesn’t boot, you can Note that the numbers should be sepa- try restarting the boot server and reboot- gram, you can either put the program rated by colons regardless of how PXE ing the non-init node. This step is some- name on the load-leveling list or run displays the address to you. Next, you times necessary because the boot server the special bash-ll shell before invoking will be asked for the IP address of the does not reply to the same machine the program. new node. Type an address from the twice. To restart the boot server, type the If a program is on the load-leveling 10.x.x.x space, for example, 10.0.0.2. following commands at the init node: list, all the processes it creates, and all (The init node has 10.0.0.1 assigned to it their offspring, will be load-leveled. If already). When asked whether the node # /etc/init.d/dhcp restart you are running the Live CD or Debian, boots over PXE or Etherboot, type P for # /etc/init.d/inetd restart the list is located at /cluster/etc/loadlev- PXE. Type a name of your choice for the ellist. If you are running Fedora, it is in nodename, for example, Knoppix2. (The If you haven’t stumbled upon hardware /etc/sysconfig/loadlevellist. init node is Knoppix1 already). compatibility problems, you now have a To set up the special bash-ll shell, running SSI cluster. Jump to copy the standard bash shell to a file Listing 1: Load-Leveling Demo the section on configuring called bash-ll. Also, put bash-ll on the load-leveling. load-leveling list. Now, if you want to user@debian:~$ demo-proclb -m 1800 in_file load-level a program, first run bash-ll. Forked child 50 / 50 Hard Drive Install Each process starts from the new shell, 1: 0 ##### |1800: 575 recs/sec For real production use, and all offspring of that process, will be 2: 0 ####### |1800: 812 recs/sec you’ll have to install load-leveled. 3: 0 ####### |1800: 833 recs/sec OpenSSI on the hard drive. To complete the configuration, create a The hard drive installation bash-ll hard link pointing to bash: 4: 0 ####### |1800: 837 recs/sec takes longer than running 5: 0 ####### |1800: 841 recs/sec the Live CD, and I won’t de- $ ln /bin/bash /bin/bash-ll 6: 0 ####### |1800: 833 recs/sec scribe it here in detail. How- 7: 0 ####### |1800: 835 recs/sec ever, I will let you know On Fedora, bash-ll is configured by de- where to look for the de- fault. On the Live CD, bash-ll is not con- 8: 0 ####### |1800: 835 recs/sec tailed instructions. figured by default. On Debian, bash-ll is 9: 0 ####### |1800: 817 recs/sec At the OpenSSI Download partially configured. 10: ####### |1800: 791 recs/sec page [2], you will find links 11: 0 ########## |1800: 1183 recs/sec to How to Install guides for Testing Load-Leveling each OpenSSI version. Once To test the configuration, run the demo 12: 0 ############ |1800: 1431 recs/sec you have installed basic that comes with OpenSSI. 13: 0 ########### |1800: 1375 recs/sec OpenSSI, find the informa- The demo works as follows. First you 14: 0 ############ |1800: 1425 recs/sec tion on how to configure disable load-leveling. Then you run a 15: 0 ########## |1800: 1172 recs/sec various features [3]. specifically prepared demo program. The program spawns several processes. Each 16: 0 ############# |1800: 1592 recs/sec Load-Leveling second, the program outputs informa- 17: 0 ########### |1800: 1320 recs/sec Once you have a basic tion on how much data it processes per 18: 0 ############ |1800: 1486 recs/sec OpenSSI cluster up and run- second. Enable load-leveling at any time, 19: 0 ############# |1800: 1504 recs/sec ning, you can configure and the processes are spread over the 20: 0 ############ |1800: 1479 recs/sec load-leveling for the cluster. cluster. You can then observe the in- To tell OpenSSI to config- crease in the total amount of data the 21: 0 ############ |1800: 1477 recs/sec ure load-leveling for a pro- program processes per second.

66 ISSUE 84 NOVEMBER 2007 SYSADMIN OpenSSI

input file, which can visual monitoring and management tool be any large file. for OpenSSI. You can, for exam- ple, use the initrd OpenSSI-webView image in /boot. In- OpenSSI-webView is a is a PHP-based voke the program: tool developed by Kilian Cavalotti. The interface is divided into several tabs. You $ demo-proclb can view the load at each node in the cluster (Figure 1), monitor the status of each node (Figure 2), or view charts The program should with various statistics (Figure 3). You now be running. can also monitor processes at each node Look at the output and manually migrate a process to an- for several seconds. other node. Figure 2: You can also use webView to view status information. The output displays To use webView, you first have to in- the amount of data stall OpenSSI on the hard drive. You’ll If the demo is not present in your hard processed per second. find instructions online for installing drive installation, check the OpenSSI Enable load-leveling at any time. To do OpenSSI-webView on distros such as website. If you’re using the Live CD, that, switch to another console and run: Debian [4] and Fedora [5]. you’ll find the demo in /demo-proclb. To run the demo on the Live CD, boot # /etc/init.d/loadlevel start Conclusion at least two nodes into the cluster. On The OpenSSI features are amazing, and the Live CD, the program is on the load- Load-leveling is now enabled, and pro- the ambition of the project is great. How- leveling list already. Add the demo direc- cesses are being migrated to other ever, the releases are rather far behind tory to your PATH: nodes. Switch back to the console with the development of the Linux kernel. the demo program. With the openMosix project having an- # export PATH=$PATH:/demo-proclb Did it work? After a few seconds, the nounced End of Life, and with Kerrighed total amount of data processed per sec- under rapid development, the future of Now switch to another console on the ond should increase. Listing 1 shows an OpenSSI remains open. For the time same node and disable load-leveling example of the demo program at work. being, enjoy it. You can check out a temporarily: Each line represents the amount of data comprehensive performance comparison processed in a given second. Load-level- between OpenSSI and its competitors in # /etc/init.d/loadlevel stop ing is enabled at t=9. Lottiaux et al. [6]. ■

Load-leveling is now disabled. Switch Parallel Compiling INFO back to the previous console to run the If you are a software developer, you can [1] Introduction to OpenSSI: demo program. The program requires an use an OpenSSI cluster to speed up your http:// openssi. org/ ssi-intro. pdf compiler. In practice, this so- [2] OpenSSI download: lution is only really conve- http:// openssi. org/ cgi-bin/ nient if you’re using the hard view?page=download. html drive installation. [3] OpenSSI documentation: http:// The make command has a openssi. org/ cgi-bin/ view?page=docs switch that splits the compi- [4] OpenSSI webView for Debian: http:// wiki. openssi. org/ go/ OpenSSI_ lation task into a number of webView parallel jobs. The switch is -j. [5] OpenSSI webView for Fedora: To compile source code on http:// openssi-webview. sourceforge. multiple machines at once, net/ install. php configure load-leveling and [6] OpenMosix, OpenSSI and Kerrighed: then type: a Comparative Study: http:// www. irisa. fr/ paris/ Biblio/ Papers/ Lottiaux/ $ make -j N your_project LotBoiGalValMor05CCGrid. pdf

N is a number equal to or Acknowledgments greater than the number of Many thanks to Roger Tsang for answer- nodes. You can experiment ing my questions about OpenSSI. with different values of N. If you have installed Special thanks to Killian Cavallotti for permission to use the OpenSSI-web- Figure 3: Charts offer a visual representation of cluster OpenSSI on the hard drive, View screenshots. performance. you can install the excellent

68 ISSUE 84 NOVEMBER 2007