Informatica 25 (2001) 189-193

A "supercomputer" for high-performance computing tasks at the university: A very profitable investment

Manuel J. Galan, Fidel Garcia, Luis Alvarez, Antonio Ocon and Enrique Rubio
CICEI, Centro de Innovacion en Tecnologias de la Informacion, University of Las Palmas de Gran Canaria, Edificio de Arquitectura, Campus Universitario de Tafira Baja, 35017 Las Palmas de Gran Canaria, SPAIN. E-mail: [email protected]

Keywords: Cluster, OSS, high performance computing, commodities.

Received: January 24, 2001

We describe a high performance/low price "supercomputer" named the Beowulf Cluster. This kind of device was initially developed by T. Sterling and D. Becker (NASA). It is a kind of cluster built primarily out of commodity hardware components, running an OSS (Open Source Software) operating system like Linux or FreeBSD, interconnected by a private high-speed network, and dedicated to running high-performance computing tasks. We will show several advantages of such a device, among them a very high performance-price ratio, easy scalability, recycling possibilities of the hardware components, and a guarantee of usability/upgradeability in the medium and long-term future. All these advantages make it specially suitable for the university environment and, thus, we describe the implementation of a Beowulf Cluster built from commodity equipment and dedicated to running high-performance computing tasks at our university, focusing on the different areas of application of this device, which range from the production of high quality multimedia content to numerical simulation in engineering.

1 Introduction

1.1 Concurrency and Parallelism

Regarding program execution, there is one very important distinction that needs to be made: the difference between "concurrency" and "parallelism". We will define these two concepts as follows:
• "Concurrency": the parts of a program that can be computed independently.
• "Parallelism": the parallel parts of a program are those "concurrent" parts that are executed on separate processing elements at the same time.
The distinction is very important, because "concurrency" is a property of the program and efficient "parallelism" is a property of the machine. Ideally, "parallel" execution should result in faster performance. The limiting factor in parallel performance is the communication speed (bandwidth) and latency between compute nodes.
Many of the common parallel benchmarks are highly parallel, and communication and latency are not the bottleneck. This type of problem can be called "obviously parallel". Other applications are not so simple, and executing "concurrent" parts of the program in "parallel" may actually cause the program to run slower, thus offsetting any performance gains in other "concurrent" parts of the program. In simple terms, the cost of communication time must pay for the savings in computation time, otherwise the "parallel" execution of the "concurrent" part is inefficient.
Now, the task of the programmer is to determine which "concurrent" parts of the program should be executed in "parallel" and which should not. The answer to this will determine the efficiency of the application. In a so-called perfect parallel computer, the ratio of communication to processing would be equal to one, and anything that is "concurrent" could be implemented in "parallel". Unfortunately, real parallel computers, including shared memory machines, do not behave "this well".
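This trade-off can be written down explicitly with a rough, back-of-the-envelope model (our own notation, not taken from the cited references): suppose a "concurrent" part takes time T_s on a single processor, is split evenly over N processors, and spends a total time T_comm exchanging messages. Then, ignoring load imbalance,

\[
T_{par} \approx \frac{T_s}{N} + T_{comm},
\qquad \text{so the parallel version wins only if} \qquad
T_{comm} < T_s \left(1 - \frac{1}{N}\right).
\]

In other words, the time spent communicating must stay below the computation time that is saved, which is exactly the condition stated above.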

1.2 Architectures

1.2.1 Hardware Architectures

There are three common hardware architectures for parallel computing:
• Shared memory machines (SMP) that communicate through memory, i.e. MPPs (Massively Parallel Processors, like the nCube, CM5, Convex SPP, Cray T3D, Cray T3E, etc.). This kind of configuration is sustained on dedicated hardware. The main characteristic is a very high bandwidth between the CPUs and memory.
• Local memory machines that communicate by messages, i.e. NOWs (Networks of Workstations) and clusters. In this category each workstation maintains its individuality, but there is a tight integration with the rest of the members of the cluster, so we can say they constitute a new entity known as "The Cluster". In our proposal we will focus on a particular kind of cluster called the "Beowulf Cluster" (see [3]).
• Local memory machines that integrate in a loosely knit collaborative network. In this category we can include several collaborative Internet efforts which are able to share the load of a difficult and hard-to-solve problem among a large number of computers executing an "ad hoc" client program. We can mention SETI@home (see [4]), the Entropia project (see [5]), etc.
The former classification is not strict, in the sense that it is possible to connect many shared memory machines to create a "hybrid" shared memory machine. These hybrid machines "look" like a single large SMP machine to the user and are often called NUMA (non-uniform memory access). It is also possible to connect SMP machines as local memory compute nodes. The user cannot (at this point) assign a specific task to a specific SMP processor. The user can, however, start two independent processes or a threaded process and expect to see a performance increase over a single CPU system. Lastly, we could add several of these hybrid systems into some collaborative Internet effort, building a super hybrid system.

1.2.2 Software Architectures

In this part we will consider both the "basement" software (the API) and the application issues.

1.2.2.1 Software API (Application Programming Interface)

There are basically two ways to "express" concurrency in a program:
• Using messages sent between processors: a message is simple, some data and a destination processor. Common message passing APIs are PVM (see [6]) or MPI (see [7]). Messages require copying data, while threads use data in place. The latency and the speed at which messages can be copied are the limiting factors with message passing models. The advantage of using messages on an SMP machine, as opposed to threads, is that if you decide to use clusters in the future it is easier to add machines or scale up your application (a minimal sketch is given at the end of this subsection).
• Using operating system threads: they were developed because shared memory SMP designs allowed very fast shared memory communication and synchronization between concurrent parts of a program. In contrast to messages, a large amount of copying can be eliminated with threads. The most common API for threads is the POSIX API. It is difficult to extend threads beyond one SMP machine: it requires NUMA technology that is difficult to implement.
Other methods do exist, but the former two are the most widely used.
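As an illustration of the message-passing style (our own minimal sketch, not code from the paper; the file name parsum.c and the workload are invented for the example), the following MPI program splits a simple summation into "concurrent" slices, one per process, and every process other than process 0 sends its partial result, i.e. some data plus a destination, back to process 0:

    /* parsum.c - illustrative only: each process sums its own slice of a
     * series and sends the partial result to process 0 as a message. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000                           /* arbitrary problem size */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double partial = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?            */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

        /* The "concurrent" part: every process works on its own slice. */
        for (i = rank; i < N; i += size)
            partial += 1.0 / (double)(i + 1);

        if (rank != 0) {
            /* A message: some data (one double) and a destination (0). */
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {
            double total = partial, incoming;
            MPI_Status status;
            for (i = 1; i < size; i++) {        /* collect the other slices */
                MPI_Recv(&incoming, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                total += incoming;
            }
            printf("partial sums combined on process 0: %f\n", total);
        }

        MPI_Finalize();
        return 0;
    }

Whether such a split is worthwhile is exactly the communication/computation trade-off of section 1.1: with only one double sent per process the communication cost is negligible, but a finer-grained decomposition could easily spend more time in MPI_Send/MPI_Recv than it saves.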
1.2.2.2 Application Issues

In order to run an application in parallel on multiple CPUs, it must be explicitly broken into concurrent parts. There are some tools and compilers that can break up programs, but parallelizing code is not a "plug and play" operation. Depending on the application, parallelizing code can be easy, extremely difficult, or in some cases impossible due to algorithm dependencies.

1.3 Definition of Cluster

A cluster is a collection of machines connected using a network in such a way that they behave like a single computer. A cluster is used for parallel processing, for load balancing and for fault tolerance. Clustering is a popular strategy for implementing parallel processing applications because it enables companies to leverage the investment already made in PCs and workstations. In addition, it is relatively easy to add new CPUs simply by adding a new PC or workstation to the network.

1.4 The Beowulf Cluster

Beowulf is not a special software package, a new network topology or the latest Linux kernel hack. It is a kind of cluster built primarily out of commodity hardware components, running an OSS (Open Source Software) operating system like Linux or FreeBSD, interconnected by a private high-speed network, and dedicated to running high-performance computing tasks.
One of the main differences between a Beowulf Cluster and a COW (Cluster of Workstations) is the fact that a Beowulf behaves more like a single machine than like many workstations. The nodes in the cluster do not sit on people's desks; they are dedicated to running cluster jobs, and the cluster is usually connected to the outside world through only a single node. While most systems provide general purpose, multi-user environments, the Beowulf distributed computing system is specifically designed for the single-user workloads typical of high-end scientific workstation environments.
Beowulf systems have been constructed from a variety of parts. For the sake of performance, some non-commodity components (i.e. produced by a single manufacturer) have been employed. In order to account for the different types of systems and to make discussions about machines a bit easier, the following classification scheme has been proposed:
CLASS I: This class of machines is built entirely from commodity vendor parts. We shall use the "Computer Shopper" certification test to define "commodity off-the-shelf" parts (Computer Shopper is a one-inch thick monthly magazine/catalog of PC systems and components). A CLASS I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs.
The advantages of a CLASS I system are: hardware is available from multiple sources (low prices, easy maintenance), there is no reliance on a single hardware vendor, driver support comes with the commodity O.S., and the parts are usually based on standards (SCSI, Ethernet, etc.).
On the other side, the disadvantage of a CLASS I system is that the best performance may require CLASS II hardware.
CLASS II: This class is simply any machine that does not pass the Computer Shopper certification test.

The advantages of a CLASS II system are: performance can be quite good. The disadvantages of a CLASS II system are: driver support quality may vary, there is reliance on a single hardware vendor, and it may be more expensive than a CLASS I system.
One class is not necessarily better than the other; it all depends on your needs and budget. Lately we are seeing an increase in the number of CLASS II Beowulf Clusters using 64-bit Alpha processors, due to the large performance increase that can be achieved.

1.5 Evolution of the Beowulf Cluster: State of the Art

The first Beowulf-class computers that achieved the gigaflops goal appeared at Supercomputing '96 in Pittsburgh. One of them came from a collaboration between Caltech and the Jet Propulsion Laboratory, and the other from Los Alamos National Laboratory. Both systems consisted of 16 200-megahertz Pentium Pro processors and were built for about $50,000 in the fall of 1996. One year later, the same machines could be built for about $30,000.
In a paper for the 1997 supercomputing meeting (simply called SC97), Michael Warren of Los Alamos and his colleagues wrote: "We have no particular desire to build and maintain our own supercomputers. If we could buy a better system for the money, we would be using it instead." (See [8]).
Finally, we can mention Avalon, a co-operative venture of the Los Alamos National Laboratory Center for Nonlinear Studies and the Theoretical Division, built as a Beowulf Cluster of 140 64-bit Alpha processors, which appeared as the 265th entry in the list of the fastest computer systems in the world (see [9]).
There are a lot of Beowulf Clusters spread around the world dedicated to every kind of computationally intensive task.
Among them we can mention:
• The Stone SuperComputer at Oak Ridge National Lab (ORNL), a 126-node cluster built at zero dollars per node. The system has already been used to develop software for large-scale landscape analysis (see [10]).
• The SuperAbacus, an implementation in the CityU Image Processing Lab at the City University of Hong Kong to support multimedia signal processing (see [11]).
• The LoBoS supercomputer of the Molecular Graphics and Simulation Laboratory, National Institutes of Health (NIH) (see [12]). This cluster is dedicated to the study of more complex biological systems using computational methods.
Beowulf Clusters are also deployed in our country (Spain), where they are likewise used for intensive computation. The most mature projects could be:
• HIDRA, at the University of Barcelona's UB-UPC Dynamical Systems Group, dedicated to several projects that require a huge amount of computation (i.e., numerical simulations of continuous and discrete systems, bifurcation analysis, numeric and symbolic computation of invariant manifolds, etc.) (see [13]).
• The LAMA Materials Laboratory at UPV/EHU, running Monte Carlo simulations of phase transitions in condensed matter physics (see [14]).

1.6 Characteristics of the Beowulf Cluster

Commodity networking, especially Fast Ethernet, has made it possible to design distributed-memory systems with relatively high bandwidths and tolerably low latencies at a low cost. Free operating systems, such as Linux, are available, reliable, and well supported, and are distributed with complete source code, encouraging the development of additional tools including low-level drivers, parallel file systems, and communication libraries.
With the power and low prices of today's off-the-shelf PCs and the availability of 100/1000 Mb/s Ethernet interconnects, it makes sense to combine them to build a High Performance Computing and Parallel Computing environment. With free versions of Linux and public domain software packages, no commercially available parallel computing system can compete with the price of the Beowulf system.
The drawback of this system is, of course, that there will not exist any "support center" to call when a problem arises (anyway, "support centers" are many times only marketing hype and do not provide real support). We can say that the Open Source support center is the whole Internet, in the sense that there exists a wealth of good information available through FTP sites, web sites and newsgroups. Besides that, you can also sign a maintenance agreement with any of the increasing number of companies that provide commercial support to these installations.
Another key component contributing to forward compatibility is the system software used on Beowulf. With the maturity and robustness of Linux, GNU software and the "standardization" of message passing via PVM and MPI, programmers now have a guarantee that the programs they write will run on future Beowulf Clusters, regardless of who makes the processors or the networks.
That said, the main characteristics of a Beowulf Cluster can be summarized in the following points:
• Very high performance-price ratio.
• Easy scalability.
• Recycling possibilities of the hardware components.
• Guarantee of usability/upgradeability in the future.

2 Our proposal of a Beowulf Cluster for the University of Las Palmas

All the good characteristics of the Beowulf Cluster justify its deployment in any organization that requires high computational power. In the case of an academic institution we can say that it is not only advisable but also imperative. The Beowulf cluster is not only a wonderful tool to provide high computing power to the University but, at the same time, a very interesting object of study "per se". The evaluation of its performance, adaptability and scalability, and of its behavior regarding the parallelization of procedures, is a field of study that we suspect is full of findings.

2.1 System Description

Our system has a hardware part and a software part. The hardware part consists of eight PCs connected by a Fast Ethernet switch at 100 Mb/s. One of these PCs is the cluster's console, which controls the whole cluster and is the gateway to the outside world (master node). Nodes are configured and controlled by the master node, and do only what they are told to do.
The proposed node configuration consists of a single-processor AMD based PC at 750 MHz, with 256 megabytes of RAM and a local IDE hard disk drive of 8 GB capacity. The nodes also have an inexpensive video card and a floppy disk drive, and they will be provided with a Fast Ethernet network interface card. The master node is provided with a larger hard disk drive (24 GB) and a better graphics video card; besides, it has a second network interface, a CD-ROM drive, a medium-sized monitor, and a keyboard and mouse, to be able to perform the controlling tasks for the cluster. The proposed Fast Ethernet switch will have 24 autosensing ports and will be SNMP capable.
In the description of the hardware we can see that only one node needs input/output devices. The second network interface card in the master node is used to connect the Intranet to the Internet. The switch has a number of ports bigger than strictly necessary, so that the cluster can be enlarged in the future.
The logical part will be built using the GNU/Linux operating system according to the "Extreme Linux CD" distribution, with additional OSS software such as kernel modifications and:
• PVM (Parallel Virtual Machine): PVM is a software package that permits a heterogeneous collection of Unix/Linux or NT computers hooked together by a network to be used as a single large parallel computer. It is a freely available, portable, message-passing library, generally implemented on top of sockets. It is clearly established as the de facto standard for message-passing cluster parallel computing (see [6]). A minimal usage sketch is given at the end of this subsection.
• MPI (Message Passing Interface) libraries: communication between processors on a Beowulf Cluster is achieved through the Message Passing Interface (MPI). This is a standardized set of library routines. Both the C and the FORTRAN programming languages are supported (see [7]).
Additionally, we will proceed to the installation of several configuration, management and monitoring tools which make the Beowulf architecture faster, easier to configure, and much more usable.
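To make the PVM bullet above concrete, here is a minimal master/worker sketch of our own (it is not code from the paper or from the "Extreme Linux CD"; the file name pvm_demo.c is invented, and it assumes the compiled binary has been copied to a directory where pvm_spawn() can find it, typically $HOME/pvm3/bin/$PVM_ARCH on every node):

    /* pvm_demo.c - illustrative PVM master/worker exchange of one integer. */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        int mytid = pvm_mytid();      /* enroll this process in the virtual machine */
        int ptid  = pvm_parent();     /* task id of the spawning task, if any       */

        if (ptid == PvmNoParent) {    /* no parent: we are the master               */
            int child, n = 42;
            if (pvm_spawn("pvm_demo", NULL, PvmTaskDefault, "", 1, &child) != 1) {
                fprintf(stderr, "task %d: spawn failed\n", mytid);
                pvm_exit();
                return 1;
            }
            pvm_initsend(PvmDataDefault);   /* pack and send n to the worker */
            pvm_pkint(&n, 1, 1);
            pvm_send(child, 1);

            pvm_recv(child, 2);             /* wait for the worker's answer  */
            pvm_upkint(&n, 1, 1);
            printf("master got %d back\n", n);
        } else {                      /* spawned copy: we are the worker            */
            int n;
            pvm_recv(ptid, 1);
            pvm_upkint(&n, 1, 1);
            n++;                            /* the "work"                    */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(ptid, 2);
        }
        pvm_exit();                   /* leave the virtual machine                  */
        return 0;
    }

An equivalent MPI version would use MPI_Send/MPI_Recv as in the sketch of section 1.2.2.1; which of the two libraries to prefer depends mainly on the application codes to be run, since the cluster will have both installed.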
2.2 Basic Software Installation and Configuration

As previously stated, the installation pathway will run along the "Extreme Linux CD". The following steps will be taken.
The first step will be installing the master server, which involves the following tasks:
• Partition sizes; installing Linux; network configuration; setting up DNS; network file configuration ("/etc/hosts", "/etc/resolv.conf" and "/etc/hosts.equiv"); local file configuration (".cshrc"); clock synchronization.
After the installation of the master server we will proceed to the installation and configuration of the client nodes:
• Installing the operating system on one client; cloning the clients; configuring the clients.
The third step will be the installation of the basic application software:
• Compilers; communication software: PVM and MPI; conversion software; system monitoring software: bWatch, httpd and CGI scripts, Netpipe, netperf, NASA Parallel Benchmarks, CMS.
Finally, we will attend to the security concerns both in the master server and in the client nodes.
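Once these steps are complete, a small MPI test program is a convenient way to verify that jobs really run on all eight nodes. This check is our own suggestion rather than part of the procedure above; the file name is invented, and the exact build and launch commands (mpicc, mpirun, and for LAM/MPI a prior lamboot of the host file) depend on the MPI distribution that was installed:

    /* hello_nodes.c - post-installation sanity check: every MPI process
     * reports its rank and the host it runs on. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("process %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

A run such as "mpicc hello_nodes.c -o hello_nodes" followed by "mpirun -np 8 hello_nodes" should print one line per node; if some host names are missing, the network file configuration of the previous step is the first place to look.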
3 Fields of Application

There are a lot of OSS software packages optimized to run on clusters like Beowulf. We can mention the following:
• MP_SOLVE, Parallel Sparse Irregular System Solvers: it solves large, irregular, sparse, indefinite systems of equations with multiple excitation vectors on parallel computers using LU factorization (see [15]).
• NAMD, a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems (see [16]).
• POV-Ray, the Persistence of Vision Raytracer, a high-quality tool for creating stunning three-dimensional graphics (see [17]).
Nevertheless, there are many other applications that can profit from running on a Beowulf Cluster; they range from standard numerical applications, through highly intensive physical and chemical computation, to biochemical modeling, multimedia and CAD applications.

4 Conclusions

In the present paper we have made a quick and succinct overview of the state of distributed computing, centering our focus on a concrete cluster configuration called the "Beowulf Cluster". The advantages of this class of cluster configuration are evident for any organization that requires high computational power "for the buck", that is, when we take into account the performance/price ratio, the easy scalability and upgradeability, and the recycling properties of the hardware components. If this is true for any organization, we are convinced that it is imperative for an academic institution like our University. Therefore we make a proposal for the deployment of such a device, starting with a schematic installation to be eventually enlarged and improved.

5 References

[1] Phil Merkey, "CESDIS", 2000, http://cesdis.gsfc.nasa.gov/
[2] Jim Fischer, "ESS Project Overview", 2000, http://sdcd.gsfc.nasa.gov/ESS/overview.html
[3] Donald Becker and Phil Merkey, "The Beowulf Project", 2000, http://www.beowulf.org/

[4] Mick Evans, "SETI@HOME Pages", 2001, http://www.kevlar.karoo.net/seti.html

[5] Entropia Inc., "Entropia: High Performance Internet Computing", 2000, http://www.entropia.com

[6] Jack Dongarra et al., "Parallel Virtual Machine", 2000, http://www.epm.ornl.gov/pvm/pvm_home.html

[7] LAM Team, "LAM / MPI Parallel Computing", 2001, http://www.mpi.nd.edu/lam

[8] Michael Warren, "Pentium Pro Inside: I. A Treecode at 430 Gigaflops on ASCI Red, II. Price/Performance of $50/Mflop on Loki and Hyglac", 1997, http://loki-www.lanl.gov/papers/sc97/

[9] Netlib, "TOP 500 Supercomputer Sites", 2000, http://www.netlib.org/benchmark/top500/top500.list.html

[10] Forrest M. Hoffman, William W. Hargrove, and Andrew J. Schultz, "The Stone SuperComputer", 1997, http://stonesoup.esd.ornl.gov/

[11] Super Abacus, "Super Abacus", 2001, http://abacus.ee.cityu.edu.hk/

[12] NIH Computational Biophysics Section, "The LoBoS Supercomputer", 2000, http://www.lobos.nih.gov/

[13] Joaquim Font, Angel Jorba, Carles Simo, Jaume Timoneda, "HIDRA: a home-assembled parallel computer", 1998, http://www.maia.ub.es/dsg/hidra/index.html

[14] S. Ivantchev, "LAMA BEOWULF CLUSTER", 2000, http://lcdx00.wm.lc.ehu.es/~svet/beowulf/

[15] New Mexico State University, "MP_SOLVE", 1998, http://emlab2.nmsu.edu/mp_solve/

[16] University of Illinois, "NAMD Scalable Molecular Dynamics", 2001, http://www.ks.uiuc.edu/Research/namd/

[17] POV-Ray Inc., "POV-Ray", 2000, http://www.povray.org/

[18] Jacek Radajewski and Douglas Eadline, "Beowulf Installation and Administration HOWTO", 1999, http://www.beowulf-underground.org/doc_project/BIAA-HOWTO/Beowulf-Installation-and-Administration-HOWTO.html