Concept and Implementation of CLUSTERIX: National Cluster of Linux Systems

Roman Wyrzykowski1, Norbert Meyer2, and Maciej Stroinski2

1 Czestochowa University of Technology, Institute of Computer & Information Sciences, Dabrowskiego 73, 42-200 Czestochowa, Poland, [email protected], http://icis.pcz.pl
2 Poznan Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznan, Poland, {meyer, stroins}@man.poznan.pl, http://www.man.poznan.pl

Abstract. This paper presents the concept and implementation of the National Cluster of Linux Systems (CLUSTERIX) - a distributed PC-cluster (or metacluster) of a new generation, based on the Polish Optical Network PIONIER. Its implementation makes it possible to deploy a production Grid environment, which consists of local PC-clusters with 64- and 32-bit Linux machines, located in independent centers across Poland. The management software developed as Open Source allows for dynamic changes in the metacluster configuration. The resulting system will be tested on a set of pilot distributed applications developed as a part of the project. The project is implemented by 12 Polish supercomputing centers and metropolitan area networks.

1 Introduction

PC-clusters using Open Source software such as Linux are now the most common and widely available parallel systems. At the same time, the capabilities of Gigabit/s wide area networks are increasing rapidly, to the point where it becomes feasible, and indeed interesting, to think of a high-end integrated metacluster environment rather than a set of disjoint local clusters. Such metaclusters [3,17,18] can be viewed as key elements of the modern Grid infrastructure, used by scientists and engineers to solve computationally and data demanding problems. In Poland, we have access to all the crucial elements necessary to build a national Linux metacluster. The most important among them is the Polish Optical Network PIONIER [15,16]. It is an intelligent, multi-channel optical network using DWDM technology, with a bandwidth of n x (10, 40, ...) Gb/s, based on the IP protocol. On the transport layer this network provides allocation of dedicated resources for specified applications, Grids, and thematic networks.

2 Project Goals and Status

The main objective of the CLUSTERIX project [1] is to develop mechanisms and tools that allow for the deployment of a production Grid environment whose backbone consists of dedicated, local Linux clusters with 64-bit machines. Local clusters are placed in geographically distant, independent centers connected by the Polish Optical Network PIONIER. It is assumed that, in theory, any Linux cluster may be attached to the backbone dynamically, as a so-called dynamic cluster. As a result, a geographically distributed Linux cluster is obtained, with a dynamically changing configuration, fully operational, and integrated with services offered by other projects. The project started in December 2003 and lasts 32 months. It is divided into two stages: (i) research and development, with an estimated duration of 20 months, and (ii) the deployment stage. The project is implemented by 12 Polish supercomputing centers and metropolitan area networks affiliated with Polish universities, with Czestochowa University of Technology as the project coordinator.

It is important to note the phrase "production Grid", meaning the development of a software/hardware infrastructure accessible for real computing, fully operational and integrated with services offered by other projects related to the PIONIER program [16], e.g., the National Computational Cluster based on the LSF batch system, the National Data Warehouse, and the virtual laboratory project. Delivering advanced and specialized services integrated into a single coherent system requires additional mechanisms not available in the existing pilot installations (see, e.g., the CrossGrid testbed [2]). They are commonly constrained by the assumption of a static infrastructure in terms of the number of nodes and services provided, as well as the number of users organized into virtual organizations. On the contrary, in CLUSTERIX we provide mechanisms and tools for an automated attachment of dynamic clusters; for example, non-dedicated clusters or labs may be attached to the backbone during the night or at weekends.

In the CLUSTERIX project, a lot of emphasis is laid on the usage of the IPv6 protocol [8] and its added functionality - enhanced reliability and QoS. This functionality, delivered to the application level and at least used in the middleware, would allow for a better quality of services. No production, IPv6-based Grid infrastructure exists at present, but taking into account the duration of the project it may be assumed that the IPv6 standard will be widely used. Therefore, the developed tools will support both IPv6 and IPv4. After the system is built, it will be tested on a set of pilot applications created as a part of the project. An important goal of the project is also to support potential CLUSTERIX users in the preparation of their Grid applications, thus creating a group of people able to use the cluster in an optimal way after the research and deployment works are finished.

3 Pilot Installation

The CLUSTERIX project includes a pilot installation (Fig. 1) consisting of 12 local clusters located in independent centers across Poland. They are interconnected via dedicated 1 Gb/s channels provided by the PIONIER optical network.


Fig. 1. Pilot installation in the CLUSTERIX project

The core of the testbed is equipped with 127 Intel Itanium2 nodes managed by the Linux OS (Debian distribution, kernel 2.6.x). A computational node includes two Itanium2 processors (1.3 GHz, 3 MB cache), 4 GB or 8 GB RAM, a 73 GB or 146 GB SCSI HDD, as well as two network interfaces (Gigabit Ethernet, and InfiniBand or Myrinet). This dual network interface allows for creating two independent communication channels, dedicated to the exchange of messages during computations and to NFS support. Efficient access to the PIONIER backbone is provided through a Gigabit Ethernet L2/L3 coupling switch (see Fig. 2).

Fig. 2. Architecture of the CLUSTERIX infrastructure

Selected 32-bit machines are dedicated to the management of local clusters and the entire infrastructure. While users' tasks are allowed to execute only on computational nodes, each local cluster is equipped with an access node where the Globus Toolkit [5] and the local batch system are running. All machines inside a local cluster are protected by a firewall, which is also used as a router for the attachment of dynamic clusters. Access to resources of the National Linux Cluster is allowed only from machines called entry points; physical users can possess their accounts only on these dedicated nodes. It is assumed that end-users' applications are submitted to the CLUSTERIX system through WWW portals. An important element of the pilot installation is the Data Storage System. Before the execution of an application, input data are fetched from storage elements and transferred to access nodes; after the execution, output data are returned from access nodes to storage elements. The Data Storage System includes a distributed implementation of a data broker. Currently each storage element is equipped with a 2 TB HDD.

4 Pilot Applications

The National Linux Cluster will be used for running HTC applications, as well as large-scale distributed applications that require the parallel use of resources of one or more local clusters (meta-applications). In the project, selected end-users' applications are being developed for the experimental verification of the project assumptions and deliverables, as well as to achieve real application results. It is clear that applications and their ability to use distributed resources efficiently will ultimately decide on the success of computational Grids. Because of the hierarchical architecture of the CLUSTERIX infrastructure, it is not a trivial issue to adapt an application for efficient execution on the metacluster. This requires parallelization on several levels corresponding to the metacluster architecture, taking into account heterogeneity in both the computing power of different nodes and the network performance between various subsystems. Another problem is the variable availability of Grid components. In the CLUSTERIX project, the MPICH-G2 tool [10], based on the Globus Toolkit, is used as a Grid-enabled implementation of the MPI standard. The list of pilot applications includes, among others:

– FEM modeling of castings solidification;
– modeling of transonic flows and design of advanced tip devices;
– prediction of protein structures from a sequence of amino acids, and simulation of protein folding;
– investigation of properties of bio-molecular systems, for drug design;
– large-scale simulations of blood circulation in micro-capillaries;
– astrophysical simulations;
– the GAMESS package in the CLUSTERIX environment.

5 CLUSTERIX Middleware

5.1 Technologies and Architecture

The middleware developed in the project should allow for:

– managing clusters with a dynamically changing configuration, including temporarily attached clusters;
– submitting, executing and monitoring HPC/HTC applications according to users' preferences;
– efficient management of users and virtual organizations;
– effective management of network resources, with the use of IPv6 protocols;
– integration of services delivered as the outcome of other projects, especially those related to the PIONIER program, e.g., the data warehouse and other computational services;
– respecting local policies of administration and management within independent domains;
– convenient access to resources and applications, using an integrated interface;
– a high level of reliability and security in the heterogeneous environment.

The CLUSTERIX software is developed as Open Source, and is based on the Globus Toolkit 2.4 and Web Services, with Globus 2.4 available in the Globus 3.2 distribution. The use of Web Services makes the created software easier to reuse, and allows for interoperability with other Grid systems on the service level. It is important to note that initially the OGSI/OGSA concept [4], implemented in Globus 3, was assumed to be used in CLUSTERIX. However, due to the rapid transition to Globus 4 and the WS-Resource Framework, we had to reject the initial decision and chose Globus 2.4 as the only possibility to build the CLUSTERIX production environment, taking into account the time limitations of the project. The usage of the Open Source approach allows anybody to access the project source code, modify it and publish the changes. This makes the software more reliable and secure. Open software is easier to integrate with newly developed and existing software, like the GridLab resource management system [6], which is adopted in the CLUSTERIX project. The architecture of the CLUSTERIX middleware is shown in Fig. 3. In the successive subsections, we describe concisely some key components of this middleware.

5.2 Resource Management System

In CLUSTERIX, we build on the GridLab Resource Management System (GRMS) developed in the GridLab project [6]. The main functionality of GRMS includes:

– ability to choose the best resource for the task execution, according to the job description and a chosen mapping algorithm;
– submitting the GRMS task according to the job description;

Fig. 3. Architecture of the CLUSTERIX middleware

– ability to migrate the GRMS task to a better resource;
– ability to cancel the task;
– providing information about the task status, and other information about tasks, e.g., the name of the host where the task is/was running;
– ability to transfer input and output files.
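To make this list of operations more concrete, the sketch below walks through a minimal job life cycle (submit, query status, migrate, cancel) against a toy broker. The class and method names are purely illustrative and do not reproduce the actual GRMS Web Service interface or its XML job descriptions.

```python
# Illustrative sketch of a GRMS-style job life cycle (hypothetical API,
# not the real GRMS Web Service interface).
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional

_ids = count(1)

@dataclass
class Task:
    description: str            # job description (in GRMS this is an XML document)
    host: Optional[str] = None
    status: str = "QUEUED"

class ResourceBroker:
    """Toy broker mimicking the GRMS operations listed above."""

    def __init__(self, resources):
        self.resources = resources              # e.g. access nodes of local clusters
        self.tasks: Dict[int, Task] = {}

    def submit(self, description: str) -> int:
        task_id = next(_ids)
        task = Task(description)
        task.host = self._choose_resource(task)  # map the task to the "best" resource
        task.status = "RUNNING"
        self.tasks[task_id] = task
        return task_id

    def _choose_resource(self, task: Task) -> str:
        # Real GRMS consults the information system and a mapping algorithm;
        # here we simply pick the first available resource.
        return self.resources[0]

    def migrate(self, task_id: int, new_host: str) -> None:
        self.tasks[task_id].host = new_host      # move the task to a better resource

    def cancel(self, task_id: int) -> None:
        self.tasks[task_id].status = "CANCELLED"

    def status(self, task_id: int) -> str:
        t = self.tasks[task_id]
        return f"{t.status} on {t.host}"

broker = ResourceBroker(["access.poznan", "access.czestochowa"])
job = broker.submit("<grmsjob> ... </grmsjob>")  # placeholder job description
print(broker.status(job))                        # RUNNING on access.poznan
```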

This approach implies the necessity to integrate GRMS with services developed in the CLUSTERIX project, such as the monitoring/information system, the data management system, the checkpointing mechanism, and the management of users' accounts and virtual organizations. The additional functionality of GRMS developed for CLUSTERIX includes: communication with resource management systems in different domains, cooperation with the network resource management system, support for MPICH-G2, and a prediction module. The prediction module is crucial for providing an efficient use of available resources. The basic functionality of this module includes:

– ability to predict execution times and resource demands of tasks, using available information about resources and tasks;
– prediction of the time spent by tasks in queues of local batch systems;
– ability to take into account prediction errors, and find resource assignments that are the least sensitive to these errors.

The prediction module uses reasoning techniques based on knowledge discovery, statistical approaches and rough sets.
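As a rough illustration of how such predictions can drive resource selection, the following sketch ranks resources by a pessimistic completion-time estimate, i.e., the prediction inflated by the relative error observed for that resource. The data layout and numbers are invented for illustration and do not come from the actual prediction module.

```python
# Toy illustration of prediction-based, error-aware resource selection
# (illustrative only, not the CLUSTERIX prediction module).

def predicted_completion(resource, task):
    """Predicted completion time = predicted queue wait + predicted execution time."""
    return resource["queue_wait"] + task["work"] / resource["speed"]

def choose_resource(resources, task):
    # Rank resources by a pessimistic estimate: the prediction inflated by the
    # relative error seen for that resource in the past, so that assignments
    # least sensitive to prediction errors are preferred.
    def pessimistic(resource):
        return predicted_completion(resource, task) * (1.0 + resource["rel_error"])
    return min(resources, key=pessimistic)

resources = [
    {"name": "cluster-A", "speed": 10.0, "queue_wait": 300.0, "rel_error": 0.8},
    {"name": "cluster-B", "speed": 6.0,  "queue_wait": 120.0, "rel_error": 0.1},
]
task = {"work": 3600.0}   # abstract amount of work

print(choose_resource(resources, task)["name"])   # cluster-B wins despite lower speed
```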

5.3 Data Management System

Grid applications deal with large volumes of data. Consequently, effective data-management solutions are vital for Grids. For the CLUSTERIX project, the Clusterix Data Management System (CDMS) has been developed, based on the analysis of existing implementations and users' requirements [12]. Special attention has been paid to making the system user-friendly and efficient, aiming at the creation of a reliable and secure Data Storage System [9]. Taking into account Grid-specific networking parameters - different bandwidth, current load and network technologies between geographically distant sites - CDMS tries to optimize data throughput via replication and replica selection techniques. Another key feature to be considered during the implementation of Grid data services is fault tolerance. In CDMS, the modular design and distributed operation model assure the elimination of a single point of failure. In particular, multiple instances of the data broker are running concurrently, and their coherence is provided by a synchronization subsystem. The basic technologies used in the development of CDMS include the GridFTP and GSI components of Globus 2.4, as well as Web Services implemented using the GSOAP plugin from GridLab [6].
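As a simple illustration of the replica selection idea mentioned above, a broker can score each replica by the expected transfer time derived from the measured bandwidth and current load of the link to the requesting site. The sketch below follows this idea with hypothetical data structures; it is not the actual CDMS interface.

```python
# Hypothetical sketch of bandwidth-aware replica selection (not the CDMS API).

def transfer_time(replica, file_size_mb):
    """Estimated transfer time: size / (bandwidth scaled down by current load)."""
    effective_bw = replica["bandwidth_mbps"] * (1.0 - replica["load"])
    return (file_size_mb * 8.0) / max(effective_bw, 1e-3)   # seconds

def select_replica(replicas, file_size_mb):
    # Pick the replica with the lowest estimated transfer time to the client.
    return min(replicas, key=lambda r: transfer_time(r, file_size_mb))

replicas = [
    {"site": "poznan",      "bandwidth_mbps": 1000.0, "load": 0.7},
    {"site": "czestochowa", "bandwidth_mbps": 1000.0, "load": 0.1},
    {"site": "gdansk",      "bandwidth_mbps": 100.0,  "load": 0.2},
]

best = select_replica(replicas, file_size_mb=2048)
print("fetch from:", best["site"])   # czestochowa (lightly loaded gigabit link)
```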

5.4 Network Resource Management System

The Polish Optical Network PIONIER, which is used in CLUSTERIX as the backbone interconnect, is based on the DWDM optical technology and the 10 Gigabit Ethernet standard. PIONIER allows the creation of dedicated VLANs based on the 802.1q standard, as well as setting traffic priorities based on the 802.1p standard. Additionally, the Black Diamond 6808 switches from Extreme Networks, which are installed in the backbone, support a proprietary protocol which allows us to guarantee bandwidth for a given VLAN.

Based on the requirements of the CLUSTERIX middleware and pilot applications, it has been decided to establish two dedicated VLANs within the PIONIER network:

– a computational network with a bandwidth of 1 Gb/s;
– a management network with a bandwidth of 100 Mb/s, dedicated to the configuration of the metacluster, measurement purposes, software upgrading, etc.

In accordance with the project goals, the IPv6 protocol is extensively deployed in the system infrastructure and middleware. Apart from the advantages mentioned in Section 2, the use of IPv6 offers other benefits, e.g., the mobile IPv6 protocol allows us to use fixed IPv6 addresses for dynamic clusters, irrespective of the place of their attachment. Note that the IPSec mechanism is used to improve security on the network level, for both the IPv6 and IPv4 protocols.

In CLUSTERIX, the SNMP protocol is used for the management of all elements that are critical for network operation, like backbone and coupling switches, access nodes, storage elements, and firewalls. This allows us to build the Network Resource Management System, which contains the following main components:

– measurement agents;
– a database containing the results of measurements;
– a network resource manager (network broker);
– a graphical user interface.

The results of measurements are then used for traffic management and in the Clusterix Data Management System. Providing the required quality of network services is extremely important to deliver the performance capacities of the CLUSTERIX infrastructure to end-users' applications. For this aim, the following techniques are deployed:

– creation of two VLANs in the computational network, with different priorities (normal and high);
– tagging of IP packets (especially promising for the IPv6 protocol);
– differentiated services.
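Returning to the measurement database and the network broker listed among the components above, the sketch below shows how measurement records might be stored and queried before scheduling a transfer. The record layout and names are assumptions made for illustration, not the real system.

```python
# Illustrative measurement store and network broker query (hypothetical layout,
# not the CLUSTERIX Network Resource Management System).
import time

class MeasurementStore:
    """Keeps the most recent measurement per (src, dst) link."""

    def __init__(self):
        self.links = {}

    def record(self, src, dst, bandwidth_mbps, rtt_ms):
        # A measurement agent would call this periodically for each monitored link.
        self.links[(src, dst)] = {
            "bandwidth_mbps": bandwidth_mbps,
            "rtt_ms": rtt_ms,
            "timestamp": time.time(),
        }

    def available_bandwidth(self, src, dst):
        entry = self.links.get((src, dst))
        return entry["bandwidth_mbps"] if entry else None

# Measurement agents populate the store (the values below are made up).
store = MeasurementStore()
store.record("poznan", "czestochowa", bandwidth_mbps=870.0, rtt_ms=9.5)
store.record("poznan", "gdansk",      bandwidth_mbps=430.0, rtt_ms=6.1)

# A network broker (or CDMS) can then query the store before scheduling a transfer.
print(store.available_bandwidth("poznan", "czestochowa"))
```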

5.5 Management of Users' Accounts and Virtual Organizations

Grids, as geographically distributed, large-scale environments, place strong demands on the management of users' accounts and virtual organizations (VOs) [4].

Apart from scalability, fault tolerance and security, the main requirement is flexibility, because management tools must be able to take into account the different roles of:

– the user,
– the administrator of a resource,
– the manager of a virtual organization,
– the manager of a group of resources.

Unfortunately, existing tools do not support a flexible policy of resource authorization and accounting.

A new management tool developed in CLUSTERIX features an open architecture based on plugins. This allows for different methods of authorization. The architecture of this tool is highly distributed in order to provide scalability and fault tolerance. Another important feature is the dynamic assignment of accounts, since a pool of accounts is assigned to users when required. The availability of a Site Accounting Information System (SAIS) and a Virtual Organization Information System (VOIS) for every site and VO, respectively, gives the tool the ability to collect and store accounting information across all sites and VOs. The kernel of this tool is the Globus Authorization Module (GAM) - an extension of the Globus gatekeeper. GAM provides different authorization plugins, which implement different authorization policies such as:

– accept all users from the grid-mapfile;
– ban users put on the black list;
– accept all users from a certain VO;
– query a remote authorization system;
– accept all users with a certificate matching a given template.

GAM also collects basic accounting information (time, user, account, etc.), which is stored and processed in SAISs and VOISs.
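The plugin-based authorization policies listed above can be pictured as a chain of small policy objects, each of which may accept, deny, or abstain for a given certificate subject and VO. The following sketch is a schematic illustration only; it is not the GAM code and does not reflect the Globus gatekeeper interface.

```python
# Schematic illustration of plugin-based authorization (not the actual GAM code).
from fnmatch import fnmatch

class GridMapFilePlugin:
    """Accept users listed in a grid-mapfile-like mapping."""
    def __init__(self, mapping):
        self.mapping = mapping                      # {certificate subject: local account}
    def decide(self, subject, vo):
        return "accept" if subject in self.mapping else "abstain"

class BlackListPlugin:
    """Ban users put on the black list."""
    def __init__(self, banned):
        self.banned = set(banned)
    def decide(self, subject, vo):
        return "deny" if subject in self.banned else "abstain"

class VOPlugin:
    """Accept all users belonging to a given virtual organization."""
    def __init__(self, accepted_vo):
        self.accepted_vo = accepted_vo
    def decide(self, subject, vo):
        return "accept" if vo == self.accepted_vo else "abstain"

class TemplatePlugin:
    """Accept users whose certificate subject matches a given template."""
    def __init__(self, template):
        self.template = template
    def decide(self, subject, vo):
        return "accept" if fnmatch(subject, self.template) else "abstain"

def authorize(plugins, subject, vo):
    # A deny from any plugin wins; otherwise any accept grants access.
    decisions = [p.decide(subject, vo) for p in plugins]
    if "deny" in decisions:
        return False
    return "accept" in decisions

plugins = [
    BlackListPlugin(["/C=PL/O=Bad/CN=Eve"]),
    GridMapFilePlugin({"/C=PL/O=PCz/CN=Alice": "clx001"}),
    VOPlugin("clusterix-chemistry"),
    TemplatePlugin("/C=PL/O=PSNC/*"),
]
print(authorize(plugins, "/C=PL/O=PSNC/CN=Bob", vo="other-vo"))   # True (template match)
```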

5.6 User Interface

The main design goal is to create a flexible portal interface that can be easily extended and adapted for use with different applications. We focus on the separation of the visualization part from the application's logic, as well as on the possibility of extending the framework at run-time. Support for the VRML, X3D, SVG and chart (JPEG, PNG) output presentation formats is also an important goal, as well as security and fault tolerance. It is necessary to provide adaptation of ready end-users' applications, and their seamless installation on multiple hosts. These requirements constrain us to use SSH as the communication channel (see Fig. 4), where the only way of communication between the portal and CLUSTERIX services is interaction with the SSH session server. The security of the interface is provided by using encrypted communication protocols, and by storing certificates on entry points, not in portals. A new feature is the use of XML-based application extensions, called by us parsers, which allow us to describe the rules of interaction between users and applications, including the input data format, output data parsing and visualization (a schematic illustration is given at the end of this subsection).


Fig. 4. Architecture of the user interface

Parsers are dynamically loaded Perl components, which are generated based on the XML descriptions provided by users. In combination with SSH sessions and pseudo-terminals, which have been successfully used for the implementation of the Web Condor Interface [11], parsers and application-specific managers allow us to gain persistence and interaction possibilities in Grids. The resulting SSH Session Server Framework allows for a fully distributed implementation, which is an efficient way to provide fault-tolerant features. The proposed framework is used together with the GridSphere Portlet Framework [7], an outcome of the GridLab project. This gives us the possibility to use a variety of built-in features of the GridSphere technology, such as users' secure space, chart generation, etc. Fig. 5 presents a portal screenshot for a demo application which has been developed [14] for the FEM modeling of heat transfer in castings.
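The role of the XML application descriptions can be illustrated schematically: an XML document declares how result values are recognized in an application's text output, and a small generated parser applies those rules. The element names and the Python rendering below are assumptions made for illustration; the real parsers are generated Perl components with their own schema.

```python
# Schematic illustration of XML-driven output parsing (element names are assumed;
# the real CLUSTERIX parsers are generated Perl components).
import re
import xml.etree.ElementTree as ET

DESCRIPTION = """
<application name="heat-transfer-demo">
  <output>
    <value name="max_temperature" pattern="Tmax\\s*=\\s*([0-9.]+)"/>
    <value name="iterations"      pattern="iterations:\\s*([0-9]+)"/>
  </output>
</application>
"""

def build_parser(xml_text):
    """Compile the <value> rules of an application description into regexes."""
    root = ET.fromstring(xml_text)
    rules = {v.get("name"): re.compile(v.get("pattern"))
             for v in root.iter("value")}

    def parse(output_text):
        # Apply each rule to the application's text output.
        return {name: (m.group(1) if (m := rx.search(output_text)) else None)
                for name, rx in rules.items()}
    return parse

parse = build_parser(DESCRIPTION)
print(parse("iterations: 124\nTmax = 1453.7\n"))
# -> {'max_temperature': '1453.7', 'iterations': '124'}
```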

6 Conclusions

This paper presents the concept and implementation of CLUSTERIX, a geographically distributed Linux cluster based on the Polish Optical Network PIONIER.

Fig. 5. A portal screenshot for the heat transfer demo application

The main objective of the CLUSTERIX project is to develop mechanisms and tools that allow for the deployment of a production Grid environment. The CLUSTERIX backbone consists of dedicated, local Linux clusters with 64-bit machines. It is assumed that so-called dynamic clusters may be attached to the backbone dynamically. For example, non-dedicated clusters and labs may be attached to the backbone during the night or at weekends. As a result, CLUSTERIX is an open and complex structure whose efficient management is not a trivial issue. Among the most important problems not covered in this paper are: security issues, management of cluster software, monitoring of cluster nodes, and checkpointing. For example, the possible attachment of dynamic clusters results in an untrusted environment which is difficult to secure comprehensively. While solving the security issues in CLUSTERIX, the major approach is to integrate existing products and find the configuration which provides the best security level for both types of clusters - backbone and dynamic ones.

Acknowledgements. The CLUSTERIX project has been funded by the Polish Ministry of Science and Information Society Technologies under grant 6T11 2003C/06098. We would also like to thank Intel Corporation for sponsoring the project and for help in building the pilot installation.

References

1. CLUSTERIX Project Home Page, http://clusterix.pcz.pl
2. CrossGrid Exploitation Website, http://www.crossgrid.org
3. DemogGrid Project, http://www.lpds.sztaki.hu
4. Foster, I., Kesselman, C., Nick, J.M., Tuecke, S.: The Physiology of the Grid. In: Grid Computing - Making the Global Infrastructure a Reality, J. Wiley & Sons (2003) 217-249
5. Globus Project Home Page, http://www.globus.org
6. GridLab: A Grid Application Toolkit and Testbed, http://www.gridlab.org
7. GridSphere Portal, http://www.gridsphere.org
8. IPv6: The Next Generation, http://www.ipv6.org
9. Karczewski, K., Kuczynski, L., Wyrzykowski, R.: Secure Data Transfer and Replication Mechanisms in Grid Environments. In: Proc. Cracow'03 Grid Workshop, Cracow (2003) 190-196
10. Karonis, N., Toonen, B., Foster, I.: MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing (JPDC), Vol. 63, No. 5 (2003) 551-563
11. Kuczynski, T., Wyrzykowski, R.: Cluster Monitoring and Management in the WebCI Environment. Lect. Notes in Comp. Sci. 3019 (2004) 375-382
12. Kuczynski, L., Karczewski, K., Wyrzykowski, R.: Clusterix Data Management System. In: Proc. Cracow'04 Grid Workshop, Cracow (2004) (in print)
13. Olas, T., Karczewski, K., Tomas, A., Wyrzykowski, R.: FEM Computations on Clusters Using Different Models of Parallel Programming. Lect. Notes in Comp. Sci. 2328 (2002) 170-182
14. Olas, T., Wyrzykowski, R.: Porting Thermomechanical Applications to the CLUSTERIX Environment. In: Proc. Cracow'04 Grid Workshop, Cracow (2004) (in print)
15. PIONIER Home Page, http://www.pionier.gov.pl
16. Weglarz, J.: Poznan Networking and Supercomputing Center: 10 Years of Experience in Building IT Infrastructure for e-Science in Poland, http://www.man.poznan.pl/10years/papers/weglarz.ppt
17. The TeraGrid: A Primer, http://www.teragrid.org
18. Wyrzykowski, R., Meyer, N., Stroinski, M.: PC-Based LINUX Metaclusters as Key Elements of Grid Infrastructure. In: Proc. Cracow'02 Grid Workshop, Cracow (2002) 96-103