Experience with LCG-2 and Storage Resource Management Middleware Dimitrios Tsirigkas September 10th, 2004

Abstract

The University of Edinburgh is participating in the ScotGrid project, working with Glasgow and Durham to create a prototype Tier 2 site for the LHC Computing Grid (LCG). This requires LCG-2, the software release of the LCG project, to be installed on the University hardware.

Being a site that will mainly provide storage, Edinburgh is also actively involved in the development of ways to interface such resources to the Grid. The Storage Resource Manager (SRM) is a protocol for an interface between client applications and storage systems. The Storage Resource Broker (SRB), developed at the San Diego Supercomputer Center (SDSC), is a system that can be used to manage distributed storage resources in Grid-like environments.

In this report, we will describe work done during a period of sixteen weeks, in the context of an MSc in High Performance Computing. The first part of the work involved helping to set up LCG software at the Edinburgh ScotGrid site and to monitor the hardware using the Ganglia distributed monitoring system. The second part of the work aimed at the development of an interface between the SDSC Storage Resource Broker and an implementation of the SRM specification, which was developed at Lawrence Berkeley National Laboratory (LBNL).

Acknowledgements

I would like to thank James Perry and Philip Clark for supervising my dissertation. I am also grateful to Alasdair Earl and Steve Thorn, who offered a great deal of both practical help and information in the course of my project. Paul Walsh should also be thanked for helping me write the LaTeX file for this document.

When the Large Hadron Collider (LHC) comes online in 2007, it will become the largest elementary particle accelerator ever to have operated in the world. Four experiments will be conducted on the LHC and the data generated will scale to Petabytes. Managing this data efficiently across a worldwide network of collaborating institutes and universities is a challenge, which the particle physics community has chosen to address using Grid computing. The University of Edinburgh is one of the major contributors to this effort. It possesses substantial storage resources to be used for storing LHC data and is currently in the process of connecting them to a prototype Grid being set up by a number of institutes in the UK.

This document details the work completed as part of a dissertation project for the MSc in High Performance Computing at EPCC. The project had two main parts. The first part involved work on setting up and configuring the Edinburgh Grid site. The goal of the second part was to create an interface between two pieces of middleware used to manage storage resources in a distributed environment.

The contents of the chapters following this introduction are summarised below.

Chapter 2 provides a background in Grid computing. All the concepts necessary for understanding the following chapters can be found here. There is also a brief description of two Grid-related projects that are very relevant to this work, Globus and the European DataGrid.

In Chapter 3 we explain why modern experimental particle physics can benefit from Grid Computing. We then introduce three related projects: the LHC Computing Grid, GridPP and ScotGrid. The LHC Computing Grid is a successor of the European DataGrid and aims at utilising Grid technologies to address the computing needs of the Large Hadron Collider experiments. GridPP is an effort to produce the infrastructure and deploy the technology for the creation of a Particle Physics Grid in the UK, and ScotGrid is the subset of GridPP that refers to Scottish Grid sites.

Describing LCG-2, the latest release of the LHC Computing Grid, is the main purpose of Chapter 4. We will see how LCG-2 attempts to address the challenges associated with any Grid project and provide an outline of how it can be used.

LCFGng is a piece of software that was developed at the University of Edinburgh and is being used in many Unix clusters. It provides an automatic way to install and configure the cluster nodes, making administration easier. Chapter 5 describes the main aspects of LCFGng, how it can be used together with LCG-2 and how it has been used in the context of this dissertation.

Chapter 6 describes Ganglia, a distributed monitoring system for clusters and cluster hierarchies. Ganglia is being used by GridPP sites, including Edinburgh, to monitor their equipment. The chapter also explains how Ganglia was used for this dissertation.

Interfacing storage resources to the Grid is far from trivial. The Storage Resource Manager is a specification defining the ways in which a storage resource should be accessible to applications through the Grid and has already been implemented for different storage systems. The Storage Resource Broker is a complete solution to using storage resources in a distributed environment. Chapter 7 discusses SRM, SRB and how they could be used together, one of the questions this work aimed at answering.

Chapter 8 closes the dissertation with a brief recap of the work done. It explains how this project resulted in the gaining of knowledge and experience, highlights the problems encountered and attempts to find the reasons why they occurred.

Chapter 2

Background on Grid Computing

2.1 Virtual Organisations

A Virtual Organisation (VO) is a dynamic collection of individuals and/or institutions, willing to share information and computing resources to achieve a common goal. This sharing is regulated by a set of agreed-upon rules, which define the role and the privileges of each entity within the organisation.

An example of a virtual organisation would be an international collaboration of universities, industry and government agencies aimed at developing and testing a new type of experimental aircraft. Such a large-scale project would require sharing not only technology and scientific expertise, but also computing power to simulate the aircraft operation and storage resources to keep the data used during the R&D process. This organisation would obviously operate under a set of rules and would be dynamic in nature, as members could join and leave and political and financial circumstances could change at any stage.

2.2 Grid Computing

Grid Computing is the science that makes the existence of VOs possible by addressing their computing needs. There are four categories of issues that Grid technology is faced with:

• Information

• Data and Storage Resource Management

• Job and Computing Resource Management

• Security


In this section, we will provide brief descriptions of each of the four areas. Where there are standard ways of addressing the issues, we will also provide brief outlines of them.

2.2.1 Security

It is easy to understand why security is an important concern for Grid computing. In a Grid environment, institutions and individuals make their resources and data available to a large number of people and need some guarantee that this does not put them at risk. Resource owners need to retain at least some control over what kinds of applications run on their personal computers or clusters, and they have to be able to trust their users. Data in a storage system connected to the Grid might be confidential or even classified. How can resources with different security policies safely be made to interoperate?

The standard way to address the issue is based on public key cryptography. Public key cryptography enables entities to authenticate each other. More specifically, every entity on the Grid has two unique strings. One, the private key, is known only to the entity and the other, the public key, is made public. The two keys share a relationship that makes it very difficult to derive the private key from the public key. The public key works with encryption algorithms to produce unreadable (encrypted) forms of data. Encrypted data can then be decrypted by the private key only. This ensures that anyone can safely send confidential data to the owner of the keys. Therefore, if A is able to read encrypted information sent by B, B can be sure that A holds the private key matching the public key that B used for the encryption. The private and the public key can be used with reversed roles as well, in which case the reader decrypting the document using the public key can be sure that it has been written by the private key holder.
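The relationship between the two keys can be illustrated with textbook RSA. The following is only a toy sketch under stated assumptions: the primes are two well-known Mersenne primes chosen for convenience, no padding is applied, all names are hypothetical, and nothing here is secure or taken from any Grid toolkit.

```python
"""Toy textbook RSA: illustration only, not secure (no padding, tiny key)."""
from typing import Tuple

# Assumption: two known Mersenne primes stand in for real key generation.
P = 2**61 - 1
Q = 2**89 - 1

N = P * Q                        # modulus, shared by both keys
E = 65537                        # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

PUBLIC_KEY = (E, N)
PRIVATE_KEY = (D, N)


def apply_key(value: int, key: Tuple[int, int]) -> int:
    """Raise value to the key's exponent, modulo the key's modulus."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


def text_to_int(text: str) -> int:
    """Encode a short text as an integer (it must stay below N)."""
    return int.from_bytes(text.encode("utf-8"), "big")


def int_to_text(number: int) -> str:
    """Inverse of text_to_int."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode("utf-8")


# B encrypts with A's public key; only A's private key recovers the text.
secret = "run 42 calibrated"
ciphertext = apply_key(text_to_int(secret), PUBLIC_KEY)
recovered = int_to_text(apply_key(ciphertext, PRIVATE_KEY))

# Reversed roles: A transforms with the private key, anyone checks with
# the public key and learns that the private key holder produced it.
stamped = apply_key(text_to_int("from A"), PRIVATE_KEY)
checked = int_to_text(apply_key(stamped, PUBLIC_KEY))
```

Both directions use the same modular exponentiation; only the choice of exponent decides whether the operation provides confidentiality or proof of origin.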

The above procedure can be prohibitively slow, since encryption and decryption algorithms can take a long time to execute for large texts. Digital signatures make it faster. From any document, a string of characters of standard length (the digest) can be produced by means of a hash function. This string characterises the document - it is extremely unlikely that different documents will produce the same string. If the string is encrypted with A’s private key and sent along with the document, B can use A’s public key to decrypt it. Then B can pass the document to the hash function, compare the digests and thus authenticate A. Therefore, the digest of any document, encrypted with A’s private key, plays the part of a digital signature. Of course, the difference between a digital signature and a real signature is that digital signatures are unique to the document they are used for.
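The hash-then-sign procedure can be sketched as follows. This is a minimal illustration that assumes a toy textbook RSA key built from two known Mersenne primes and SHA-256 as the hash function; the function names are hypothetical and the scheme is not secure as written.

```python
"""Toy digital signature: hash the document, then sign only the digest."""
import hashlib

# Assumption: toy key built from two known Mersenne primes.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)


def digest(document: bytes) -> int:
    """SHA-256 digest, reduced modulo N so that the toy key can sign it."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document: bytes) -> int:
    """A encrypts the digest with the private exponent."""
    return pow(digest(document), D, N)


def verify(document: bytes, signature: int) -> bool:
    """B decrypts the signature with the public exponent and compares digests."""
    return pow(signature, E, N) == digest(document)


report = b"Detector alignment constants, version 7"
signature = sign(report)

genuine_ok = verify(report, signature)
tampered_ok = verify(report + b" (edited)", signature)
```

Only the short digest goes through the expensive private-key operation, which is why signing stays fast regardless of the size of the document.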

However, in a Grid environment, a resource cannot hold the public key of every user, nor is it possible to know that a public key belongs to the right person. Certification Authorities (CAs) are authorised by all members of a VO to provide them with certificates, and every member has access to the CA’s public key. A certificate is a document with the owner’s details and public key, digitally signed by the CA. This can be used by two entities on the Grid to authenticate each other. There are different certificate formats but the one most widely used is X.509, which the Internet Engineering Task Force has profiled for Internet use. More specifically, if A and B want to authenticate each other, A sends B her certificate. B uses the CA’s public key to make sure that the public key on the certificate belongs to the person detailed on it. Then B can use A’s public key to encrypt something and send it back to A. A decrypts it with her private key and returns it to B. If the returned document is identical to the original, A has been authenticated successfully and the same procedure can be repeated for B. In the end, both A and B are sure of each other’s identity.
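The exchange described above can be sketched in a few lines. This is a toy model under stated assumptions: certificates are plain Python objects rather than X.509 structures, keys are textbook RSA built from known Mersenne primes, and every name (including the subject string) is hypothetical.

```python
"""Toy certificate check followed by a challenge-response exchange."""
import hashlib
import secrets
from dataclasses import dataclass


def make_keypair(p, q, e=65537):
    """Return ((e, n), (d, n)) for toy textbook RSA (Python 3.8+)."""
    n = p * q
    return (e, n), (pow(e, -1, (p - 1) * (q - 1)), n)


def digest(data, n):
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


@dataclass
class Certificate:
    subject: str
    public_key: tuple    # (e, n) of the subject
    ca_signature: int    # CA's signature over subject and public key

    def body(self):
        return f"{self.subject}|{self.public_key}".encode()


# Assumption: known Mersenne primes stand in for real key generation.
CA_PUBLIC, CA_PRIVATE = make_keypair(2**61 - 1, 2**89 - 1)
A_PUBLIC, A_PRIVATE = make_keypair(2**31 - 1, 2**107 - 1)


def issue_certificate(subject, public_key):
    """The CA signs the digest of the certificate body."""
    d, n = CA_PRIVATE
    unsigned = Certificate(subject, public_key, 0)
    return Certificate(subject, public_key, pow(digest(unsigned.body(), n), d, n))


def certificate_is_valid(cert):
    """Anyone holding the CA's public key can check the signature."""
    e, n = CA_PUBLIC
    return pow(cert.ca_signature, e, n) == digest(cert.body(), n)


def authenticate(cert, respond):
    """B's side: check the CA signature, then challenge the claimed owner."""
    if not certificate_is_valid(cert):
        return False
    e, n = cert.public_key
    challenge = secrets.randbelow(n - 2) + 2   # random value invented by B
    encrypted = pow(challenge, e, n)           # encrypted with A's public key
    return respond(encrypted) == challenge     # A must return the original


def a_responds(encrypted):
    """A's side: decrypt the challenge with the private key."""
    d, n = A_PRIVATE
    return pow(encrypted, d, n)


cert_a = issue_certificate("CN=A, O=ExampleVO", A_PUBLIC)
forged = Certificate("CN=Mallory, O=ExampleVO", A_PUBLIC, cert_a.ca_signature)
```

Calling `authenticate(cert_a, a_responds)` succeeds, whereas a party that presents A's certificate without holding the private key, or a certificate whose details were altered after signing, is rejected.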

The mutual authentication process is automated and is triggered by having the user type a secret pass phrase, which unlocks the user’s encrypted private key so that the authentication can take place. It is desirable, however, that users authenticate themselves only once, at the start of a Grid session, rather than every time they interact with a new resource or user. This is achieved by means of proxy certificates. A proxy certificate contains the same information about its owner as the normal certificate, but it is digitally signed by the owner during the initial sign-in and expires sooner. The public and private keys for the proxy certificate are new, and the private key is not stored encrypted, so it can be accessed without a pass phrase. Therefore, when A needs to prove his identity to B, he sends both his CA-signed certificate and his proxy certificate. B can then use the CA’s public key to make sure that A’s certificate is genuine and then use the public key in that certificate to verify the signature on the proxy. After this point, authentication of A can proceed as described above, but using the public and private keys corresponding to the proxy.
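The two-link chain of trust behind a proxy certificate can be sketched as follows. This is a toy model under stated assumptions: certificates are dictionaries, keys are textbook RSA built from known Mersenne primes, time is a plain number of seconds, and the twelve-hour proxy lifetime and all names are illustrative choices rather than values taken from this document.

```python
"""Toy proxy-certificate chain: the CA signs the user, the user signs the proxy."""
import hashlib


def make_keypair(p, q, e=65537):
    """Return ((e, n), (d, n)) for toy textbook RSA (Python 3.8+)."""
    n = p * q
    return (e, n), (pow(e, -1, (p - 1) * (q - 1)), n)


def sign(data, private_key):
    d, n = private_key
    h = int.from_bytes(hashlib.sha256(data).digest(), "big") % n
    return pow(h, d, n)


def verify(data, signature, public_key):
    e, n = public_key
    h = int.from_bytes(hashlib.sha256(data).digest(), "big") % n
    return pow(signature, e, n) == h


def cert_body(subject, public_key, expires):
    return f"{subject}|{public_key}|{expires}".encode()


def make_cert(subject, public_key, expires, issuer_private_key):
    body = cert_body(subject, public_key, expires)
    return {"subject": subject, "public_key": public_key,
            "expires": expires, "signature": sign(body, issuer_private_key)}


def signed_by(cert, issuer_public_key):
    body = cert_body(cert["subject"], cert["public_key"], cert["expires"])
    return verify(body, cert["signature"], issuer_public_key)


def chain_is_valid(user_cert, proxy_cert, ca_public_key, now):
    """Check CA -> user, then user -> proxy, then both expiry times."""
    return (signed_by(user_cert, ca_public_key)
            and signed_by(proxy_cert, user_cert["public_key"])
            and now < user_cert["expires"]
            and now < proxy_cert["expires"])


HOUR = 3600
# Assumption: known Mersenne primes stand in for real key generation.
CA_PUB, CA_PRIV = make_keypair(2**61 - 1, 2**89 - 1)
USER_PUB, USER_PRIV = make_keypair(2**31 - 1, 2**107 - 1)
PROXY_PUB, PROXY_PRIV = make_keypair(2**61 - 1, 2**107 - 1)  # new pair per session

start = 0
user_cert = make_cert("CN=A", USER_PUB, start + 365 * 24 * HOUR, CA_PRIV)
proxy_cert = make_cert("CN=A/CN=proxy", PROXY_PUB, start + 12 * HOUR, USER_PRIV)
rogue_proxy = make_cert("CN=A/CN=proxy", PROXY_PUB, start + 12 * HOUR, PROXY_PRIV)
```

The short lifetime limits the damage if the unencrypted proxy key is stolen: once the proxy has expired, `chain_is_valid` rejects it even though both signatures still verify.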

2.2.2 Information

A Grid is an environment of a very dynamic nature. The number of users and the status and availability of resources are constantly changing. Moreover, it is desired that the users are not necessarily aware of the details of the resources they have access to and therefore may not be able to specify exactly which they will use to run their applications or store their data. This means that the Grid should be “self aware” in the sense that information on its status and that of its resources should somehow be made available to a number of services, capable of answering requests for that information. This would allow for efficient allocation of resources, sensible management of data and quick diagnosis of technical problems.

There are a number of important requirements that an Information Service for the Grid should fulfil. Perhaps the most important one is that it should be distributed. This would make it independent of one or a small number of specific, centralised servers and would guarantee that it would continue to function even though individual resources on the Grid may break. Besides that, the way information is accessed should be standard and platform-independent, to make sure that every resource can publish its status and every authorised user or application can view it. Another important requirement arises from the previous one, as some information on available resources may have to be available only to a group of authorised users. This means that the information and the security services need to be made to cooperate to ensure the safekeeping of sensitive information.

A standard tool used by Grid Information Services is the Lightweight Directory Access Protocol (LDAP). LDAP is a specification that defines how messages that contain relevant information are formulated and exchanged between applications and the databases that store it.

The usefulness of LDAP has resulted in the development of an open source implementation, OpenLDAP, which is heavily used in most, if not all, major Grid projects.

2.2.3 Data and Storage Resources Management

One of the facts that make Grid Computing attractive is that a lot of current scientific efforts result in the production of large amounts of data, which need to be safely stored and easily accessed by the scientists - a good example is provided in the next chapter. The management of data is therefore one of the major challenges faced by Grid technologies and is certainly the most relevant to this dissertation.

A very important issue that should be addressed is how the great variety of data storage systems are interfaced and connected to the Grid. A VO can be expected to have resources including disk caches, RAID arrays, and tape storage systems. The requirement for transparency rules out requiring clients to be able to access such a variety of systems directly, therefore all the different resources should present the same “face” to the Grid. This is one of the major issues this dissertation is concerned with, and ways of interfacing storage resources to the Grid will be discussed in greater detail in other chapters.

Another requirement is that data can be easily and quickly accessed by those authorised to view and manipulate it. This means that replicas of the same data must be distributed among resource sites and that there have to be fast and reliable ways to access it. A standard protocol used for the transfer of data between sites is GridFTP.

In order to use data, one must first be able to locate it and make sure it contains useful information. Therefore, two more issues arise: How is it possible to find a replica on the Grid and how can it be certified that this replica contains up-to-date information, suited to the user’s needs? These questions point out the need for replica and metadata services. Typically, a metadata service provides data about the data contained in files, so that their quality and suitability can be determined. Afterwards, the replica service returns the physical locations of the files satisfying the user’s demands.

2.2.4 Job and Computing Resources Management

The management of jobs and the computing resources they run on is another difficult challenge faced by Grid Computing. A first issue that arises is portability. Very few programs are developed and tested in a way that ensures they can be run on any platform. Furthermore, how can the user be guaranteed that the computing resource used for her jobs fulfils her performance requirements? Therefore, there is a need to enable the user to submit a set of requirements together with her job, thus limiting the range of systems the job could be assigned to. This assumes that there is a service deployed which makes the choice of the resource, based on those requirements and what is available.
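The idea of choosing a resource from a set of user requirements can be sketched as a filter followed by a ranking. This is a minimal illustration, not the algorithm of any particular Grid scheduler; the attribute names, resource names and the "most free CPUs wins" rule are all assumptions made for the example.

```python
"""Toy matchmaking: keep resources that satisfy the job, then pick the best."""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str
    os: str
    free_cpus: int
    memory_mb: int


@dataclass
class JobRequest:
    os: str
    min_cpus: int
    min_memory_mb: int


def matches(resource: Resource, job: JobRequest) -> bool:
    """True when the resource satisfies every stated requirement."""
    return (resource.os == job.os
            and resource.free_cpus >= job.min_cpus
            and resource.memory_mb >= job.min_memory_mb)


def choose_resource(resources: List[Resource], job: JobRequest) -> Optional[Resource]:
    """Filter by requirements, then rank by free CPUs (an assumed policy)."""
    candidates = [r for r in resources if matches(r, job)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.free_cpus)


# Hypothetical resources, for illustration only.
available = [
    Resource("ce-a.example.org", "linux", free_cpus=2, memory_mb=4096),
    Resource("ce-b.example.org", "linux", free_cpus=8, memory_mb=4096),
    Resource("ce-c.example.org", "linux", free_cpus=16, memory_mb=1024),
]

job = JobRequest(os="linux", min_cpus=4, min_memory_mb=2048)
chosen = choose_resource(available, job)
unsatisfiable = choose_resource(available, JobRequest("linux", 64, 2048))
```

Here `ce-c` has the most free CPUs but too little memory, so the requirements narrow the choice before the ranking is applied.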

Connecting computing resources to the Grid is also an issue in itself. The main problem is that individual resources are governed by their own policies and have their own batch systems. For obvious reasons the owners of the resources will not be willing to abandon their choices, so these different systems all need to be interfaced to the Grid.

2.3 Grid Projects

A brief overview of two of the most important Grid Computing projects to date will be provided in this section.

2.3.1 Globus

The Globus project [2] aims at developing Grid technologies. The major contributors are the Argonne National Laboratory, the Information Sciences Institute at the University of Southern California, the University of Chicago, the University of Edinburgh and the Swedish Center for Parallel Computers. The activities of this collaboration fall under four categories: research, development of Grid software tools, development of applications that make use of Grid computing and setting up of testbeds for Grid technologies.

The Globus project has released the Globus Toolkit (GT), a collection of middleware addressing the major challenges in Grid Computing. The GT, particularly in its second version, was very successful in providing tools that could be used to build information, data management and resource management services as well as a solid security infrastructure. As a result it is being used extensively by other projects like the European Data Grid and the LHC Computing Grid. The latest version, GT3, introduces the idea of grid services and proposes a new Grid architecture based on that idea. The next version, GT4, will go one step further and no longer support the grid services paradigm. However, the particle physics community has not yet embraced GT3 and it will probably be long before it embraces GT4, so those two versions are outside the scope of this work.

2.3.2 The European Data Grid

The European Data Grid (EDG) [3] was an EU-funded project pursuing the development and testing of Grid technologies to be used for scientific purposes. The project ran from 2001 to 2004 and was led by CERN. Other contributors included “the European Space Agency (ESA), France’s Centre National de la Recherche Scientifique (CNRS), Italy’s Istituto Nazionale di Fisica Nucleare (INFN), the Dutch National Institute for Nuclear Physics and High Energy Physics (NIKHEF) and UK’s Particle Physics and Astronomy Research Council (PPARC)” [3].

The structure of the EDG project is illustrated in Figure 2.1. The work was divided into workpackages. Workpackages 1 to 5 involved the development of Grid Middleware to address the major issues in Grid Computing, workpackage 6 was about the creation of testbeds for the developed technologies, workpackage 7 targeted network services and workpackages 8 to 10 were concerned with the development of Grid scientific applications. A final workpackage covered the management of the project. The four working groups that carried out the project were the Testbed and Infrastructure group (WP 6-7), the Applications group (WP 8-10), the Computational and DataGrid Middleware group (WP 1-5), and the Management and Dissemination group (WP 11).

Figure 2.1: The structure of the EDG project. Image taken from [3].

EDG achieved important advances in Grid technology and developed specifications and implementations that are still in use in the context of the programmes that succeeded it - the EGEE project, which stands for “Enabling Grids for E-science in Europe” and the LHC Computing Grid, which is specific to Elementary Particle Physics and will be covered in the next chapter.

Chapter 3

LCG, GridPP and ScotGrid

3.1 The Large Hadron Collider - why Grid Technologies?

The Large Hadron Collider (LHC) at CERN is scheduled to begin operation in 2007 and will be the most powerful particle accelerator in the world. There are four experiments currently being prepared, which will take data from it: CMS, ATLAS, ALICE and LHCb. All four of these experiments are international collaborations involving hundreds of institutions and approximately six thousand scientists. The purpose of the LHC experiments is to study the fundamental properties of matter at high energies and test the current theories used to describe them. All four experiments are based on the idea of detecting and tracing the interactions and movement of accelerated particles by means of sophisticated detectors and a large number of complicated electronic devices. The digital output produced by the LHC experiments, in total, is expected to be of the order of Petabytes per year. The creation of a single resource devoted to storing and analysing this data is impossible, for political and practical reasons. Many of the participating institutions, however, possess considerable computing resources, which could be made available to the other members of their collaborations to serve the common purpose. The varying nature of those resources and the different individual policies regulating their use nevertheless make this a complicated and difficult task.

The LHC is faced with problems, which Grid computing is well suited to. Each of the four collaborations constitutes a Virtual Organisation, which aims to combine a set of computing resources and to enable their members to access and manipulate data distributed among many different geographical locations and administrative domains. It should now be apparent that the LHC provides an ideal opportunity for the application of Grid technologies.


3.2 The LCG Project

The LHC Computing Grid (LCG) is the project that will “prepare the computing infrastructure for the simulation, processing and analysis of LHC data”[4]. In other words, the LCG project aims to provide the necessary middleware that will allow the interconnection of the diverse computing resources within the same computational data grid. The LCG middleware will provide transparent access to these resources and serve as a basis for the data-intensive high energy physics applications undertaking the actual science. However, the LCG project is not just about developing middleware. It includes the coordination of the resource sites and the exchange of experience and information among them, the deployment of the services, the extensive testing of the resulting system and the monitoring and support of its operation while the LHC experiments are running.

In the context of LCG, the different sites connected together are grouped into Tiers. The LHC experiments, where the data is produced, form Tier 0. CERN, from where data is distributed to the other sites, is a Tier 1 center. Tier 1 centers are sites with major storage and computing resources that often operate at the national level. Smaller sites, for example universities or laboratories, which may or may not possess considerable resources of one or both kinds, cooperate in forming regional Tier 2 centers. An individual university is itself a Tier 3, whether participating in a Tier 2 or not. Finally, individual desktops are Tier 4. In the UK, the Rutherford Appleton Laboratory is the Tier 1 center and the University of Edinburgh is one of the Tier 3 sites.

3.3 Status and near future of the LCG Project

The LCG project is currently in its first phase, which started in 2002 and is expected to last until 2005. During this phase prototypes and actual implementations of the LCG Grid services are being developed and tested in increasingly demanding “data challenges”, separately for each of the experiments. The releases of the LCG middleware resulting from this process are being installed in an increasing number of sites and work is being done to interface them to the local resource management systems. The goal of this phase is to produce a detailed and complete design for the final system that will be in place by the time the LHC begins operation. The next phase or phases of the project will be concerned with the execution of this design and the maintenance and possibly further development of the system during its operation.

The latest product of the LCG project, LCG-2, comprises a set of middleware tools. It was released in April and will be running throughout 2004. The next chapter is devoted to LCG-2 and will provide a detailed discussion of its components and functionality.

3.4 GridPP

GridPP is a project involving 19 British universities, the Rutherford Appleton Laboratory and CERN. It began in 2001 with the purpose of creating a Grid for particle physics, which could ultimately expand to provide services to a wider range of scientific disciplines. The areas in which it is active are the development of Grid applications for particle physics, the development of Grid middleware and the deployment of the current technology in testbeds across the UK. Naturally there are close ties between GridPP and LCG. Through the GridPP project, the UK has become the most active LCG participant of all CERN member states.

In the context of GridPP and apart from the Tier 1 center at RAL, there are currently four regional Tier 2 centers in the UK: London, SouthGrid, NorthGrid and ScotGrid [insert figure here].

3.5 ScotGrid

ScotGrid is a collaboration between the Universities of Edinburgh, Glasgow and Durham. Its purpose is to develop a prototype Tier 2 center for the particle physics Grid in the UK. Once connected to the Grid, the resources provided by the ScotGrid institutes will be used by scientists involved with the LHCb and ATLAS experiments to perform simulations and data analysis. A logical layout of the ScotGrid system can be seen in Figure 3.1.

3.5.1 ScotGrid Hardware in Edinburgh

In Edinburgh, the effort is concentrated on Storage and Data Management issues, which is natural since the main part of the available resources comprises storage resources. More specifically, the ScotGrid storage hardware at Edinburgh comprises an IBM eServer x440 with 8 Intel Xeon 1.9GHz CPUs and 32 GB RAM, an IBM Dual FAStT900 22TB RAID array and 10TB of the 155TB Sun Microsystems Storage Area Network that the university has recently acquired.

At the front end, interfacing the ScotGrid hardware with the Grid, are two IBM eServer x205 Intel P-IV 1.8GHz with 256 GB RAM, Glenellen and Glenlivet, as well as an IBM eServer x305 four dual Intel P-III Xeon 1GHz with 2 GB RAM, Glenmorangie. Furthermore, there will soon be four dual Intel Xeon 2.8GHz machines, with 2GB RAM and 200GB EIDE HDD, to be used as Worker Nodes, i.e. machines running jobs submitted to the site from the Grid.

Figure 3.1: A logical layout of ScotGrid. Image taken from [6].

Chapter 4


This chapter provides an overview of LCG-2, which is the latest release of the LCG project. It is based on work that was conducted by the European DataGrid [3] project and incorporates components of the Globus Toolkit [2]. We provide a brief description of how LCG-2 addresses the following areas, which are crucial to any Grid system: Security, Information, Job Management, Data Management and Interaction with the user, the applications and the resources.

4.1 Interaction with the user and the applications

A Command Line Interface (CLI) and a Graphical User Interface (GUI) handle the interaction between LCG-2 and the user. The CLI allows the user to identify himself as someone authorised to use the Grid, to use the information service to find out about the resources available on the Grid and to submit and manage jobs. The Java GUI provides the same functionality in a more user-friendly way. It contains an editor for the Job Description Language (JDL), as well as two more components, for the submission and monitoring of jobs. The functionality of the user interfaces is also available to applications by means of various APIs provided by LCG-2. Many of the potential uses of the LCG-2 UI will be mentioned in the following sections.

4.2 Interaction with the resources

The ability of the system to interact with the resources on the Grid is provided by the Storage Element (SE) in the case of a storage resource and by the Grid Gate (GG) in the case of a computing resource.

The Storage Element (SE) was first defined in the context of the European DataGrid project (Work Package 5). Setting up a computer as a storage element means connecting the computer to the Grid and using it as a server to enable access to the storage space of the resource. Users of the Grid can communicate with the various SEs by means of standard protocols, typically without needing to know any resource-specific details. The SE providing the interface to a resource can handle all necessary internal communications to serve the users’ requests. The proposed architecture of the SE software is described in detail in [5].

A set of homogeneous computing nodes (Worker Nodes or WNs) connected to the Grid as a single entity is called a Computing Element (CE). A Computing Element also includes a node called the Grid Gate (GG), which interfaces the resource to the Grid. The Grid Gate uses a Globus tool called the Grid Resource Allocation Manager (GRAM), as well as the local resource manager and a logging and bookkeeping server that keeps track of the functions performed by the resource. By managing the Worker Nodes, the GG can satisfy job requests coming from the Grid even though they are not adapted to the local batch system.

4.3 Security

LCG-2 has adopted the Grid Security Infrastructure (GSI) developed by the Globus Project. The way it handles security is based on the principles described in Section 2.2.1. In order to use LCG-2, users have to be members of one of the LCG VOs. Then they can request certificates from one of the recognised CAs. With a certificate installed in a browser, the users must visit the LCG-2 registration webpage and register with the LCG-2 service.

4.4 Information System

The information system provided by LCG-2 has been taken from the Globus Toolkit and the EDG project. Every individual resource runs a service called the Grid Resource Information Service (GRIS). This service can report on the characteristics and the state of that specific resource to the Grid Index Information Service (GIIS) that runs at the aggregate level. The GRIS and the GIIS can be queried by users (and their applications) and can provide information about the resources connected to the Grid. Additionally, there is another service called the Berkeley DB Information Index, which stores information from multiple GIISes in a database that can also be queried by the user. Information to and from the databases of the Grid Information Service is accessed using OpenLDAP.
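The three levels of this hierarchy can be modelled in memory to show how information flows upwards. The sketch below is only an illustration: the real services exchange LDAP messages, whereas here they are plain Python objects, and the class names, site names and attribute values are all hypothetical.

```python
"""Toy in-memory model of a hierarchical Grid information system."""


class ResourceInfoService:
    """Plays the role of a GRIS: reports on a single resource."""

    def __init__(self, name, kind, attributes):
        self.name = name
        self.kind = kind
        self.attributes = attributes

    def report(self):
        return {"name": self.name, "kind": self.kind, **self.attributes}


class SiteIndexService:
    """Plays the role of a GIIS: aggregates the resources of one site."""

    def __init__(self, site, services):
        self.site = site
        self.services = services

    def report(self):
        return [{"site": self.site, **s.report()} for s in self.services]


class TopLevelIndex:
    """Plays the role of the top-level index: caches records from many sites."""

    def __init__(self, site_indexes):
        self.site_indexes = site_indexes
        self.records = []

    def refresh(self):
        """Pull a fresh snapshot from every site-level index."""
        self.records = [rec for idx in self.site_indexes for rec in idx.report()]

    def search(self, **criteria):
        """Return cached records whose fields equal all the given criteria."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]


# Hypothetical resources and values, for illustration only.
site_a = SiteIndexService("site-a", [
    ResourceInfoService("se.site-a.example.org", "SE", {"free_gb": 500}),
])
site_b = SiteIndexService("site-b", [
    ResourceInfoService("ce.site-b.example.org", "CE", {"free_cpus": 12}),
    ResourceInfoService("se.site-b.example.org", "SE", {"free_gb": 80}),
])

index = TopLevelIndex([site_a, site_b])
index.refresh()
storage = index.search(kind="SE")
site_b_compute = index.search(site="site-b", kind="CE")
```

Because the top-level index answers from a cached snapshot, a query does not have to contact every resource, at the cost of the answers being only as fresh as the last `refresh`.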

The ways in which a user can make indirect use of the Information Service through the CLI will be described in the following two sections, on Job and Data Management. However, the CLI also provides the user with direct access to the service by allowing him to query the information databases for the status of specific resources.

4.5 Job Management

The set of services that provide the ability to manage jobs are collectively called the Workload Management System (WMS). The WMS includes services for receiving job requests from the user or application, finding suitable resources to run them on, modifying them to run in the environment of the Grid and managing them afterwards.

4.5.1 The Job Description Language

The Job Description Language that LCG-2 uses is a product of the Condor project [7] and is called the Classified Advertisement (ClassAds) language. It allows the user to create descriptions of his job, specifying important characteristics. Examples include the environment under which the job should run, the names of the input and output files, a particular CE where the job should run and the SE where the output files should be uploaded. The JDL also provides the ability to set requirements in terms of proximity and performance capability of the CE and of the available storage space of the SE.
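To give a feel for what such a description looks like, the sketch below renders a small job description as "Attribute = value;" lines from a Python dictionary. It is an illustration under assumptions: the attribute names and the two expressions follow the style commonly seen in JDL examples but are not taken from this document, and the helper names are hypothetical.

```python
"""Toy renderer for a JDL-style job description (illustrative only)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Marks a value that is an expression and must not be quoted."""
    text: str


def render_value(value):
    """Render strings quoted, lists in braces, expressions verbatim."""
    if isinstance(value, Expr):
        return value.text
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(render_value(v) for v in value) + "}"
    if isinstance(value, str):
        return '"' + value + '"'
    return str(value)


def render_jdl(attributes):
    """Render one 'Name = value;' line per attribute, in insertion order."""
    return "\n".join(f"{name} = {render_value(value)};"
                     for name, value in attributes.items())


# Assumed attribute names and expressions, for illustration only.
job = {
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
    "Requirements": Expr("other.GlueCEPolicyMaxCPUTime > 60"),
    "Rank": Expr("other.GlueCEStateFreeCPUs"),
}

jdl_text = render_jdl(job)
print(jdl_text)
```

The requirements line restricts which CEs are eligible, while the rank line expresses a preference among the eligible ones.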

4.5.2 Command line tools

There are a number of tools in the CLI for the purpose of submitting and managing jobs. Through those tools, the WMS provides all the functionality usually required from a job management system, including submitting, cancelling and checking the status of jobs. It also supports listing the available CEs and submitting interactive jobs. Another useful feature of the system that can be exploited from the CLI is the BrokerInfo file. This file is created for a specific job and contains information about it, which can be obtained through the use of the edg-brokerinfo command. Features that are currently not supported but are likely to be incorporated in future official releases of LCG-2 include submission of checkpointable jobs or jobs making use of the Message Passing Interface (MPI).

4.6 Data Management

The Replica Management System (RMS), the data management component of LCG-2, was developed in the context of the European DataGrid project. It mainly consists of two services. The first one is the Replica Location Service (RLS). The RLS maintains a mapping between the Grid Unique Identifiers (GUIDs - see below) and the physical locations of the files they identify. The second service is the Replica Metadata Catalog (RMC) which maps the Logical File Names (LFN) of the files to their GUIDs. The RMC also contains metadata about the files. The two services of the RMS can be accessed through the Replica Manager, which is part of the user interface. There is one separate RMS for each of the VOs and all of them are provided by CERN.

4.6.1 File names

Every user in the LCG-2 Grid environment identifies files by four different types of names. First, for each file, there is one Grid Unique Identifier (GUID). The GUID of a file is guaranteed to refer to that file exclusively and is the same for any user. In contrast, the Logical File Name (LFN) of a file is the name by which a user refers to a particular file. In other words, it is an alias for a GUID, arbitrarily set up by a user. Besides those two names, there is also a Storage URL (SURL) and a Transport URL (TURL). The reason for those is that GUIDs and LFNs cannot be used to specify the physical location of a particular copy (or replica) of that file. SURLs and TURLs contain information about the hosting SE and the local identifier of the file within that SE. The difference between SURLs and TURLs is mainly that the SURL is used by the RMS and the local SE to locate a replica, whereas the TURL is used by an application running on the Grid and contains all the information necessary to get the replica from a particular location, including the protocol and port that should be used.
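The relationship between the four kinds of names can be sketched with two dictionaries playing the roles of the RMC (LFN to GUID) and the RLS (GUID to SURLs). This is a toy model only: the class `ToyCatalogues`, the `sfn://` SURL form, the `gsiftp` protocol with port 2811 and the `example.org` hosts are assumptions made for illustration and do not describe the real catalogue implementation.

```python
import uuid
from urllib.parse import urlsplit


class ToyCatalogues:
    """In-memory stand-in for the two RMS catalogues (illustrative only)."""

    def __init__(self):
        self.rmc = {}  # LFN -> GUID, the role of the Replica Metadata Catalog
        self.rls = {}  # GUID -> list of SURLs, the role of the Replica Location Service

    def register(self, surl, lfn=None):
        """Assign a fresh GUID to a file stored at surl, optionally with an alias."""
        guid = "guid:" + str(uuid.uuid4())
        self.rls[guid] = [surl]
        if lfn is not None:
            self.rmc[lfn] = guid
        return guid

    def add_replica(self, guid, surl):
        """Record another physical copy of an already registered file."""
        self.rls[guid].append(surl)

    def replicas(self, lfn):
        """Resolve LFN -> GUID -> SURLs."""
        return list(self.rls[self.rmc[lfn]])


def surl_to_turl(surl, protocol="gsiftp", port=2811):
    """Build a TURL by adding the transport protocol and port to a SURL."""
    parts = urlsplit(surl)
    return f"{protocol}://{parts.hostname}:{port}{parts.path}"


catalogues = ToyCatalogues()
guid = catalogues.register("sfn://se1.example.org/data/run1.dat", lfn="lfn:run1")
catalogues.add_replica(guid, "sfn://se2.example.org/data/run1.dat")
for surl in catalogues.replicas("lfn:run1"):
    print(surl, "->", surl_to_turl(surl))
```

Note that the GUID is the only link between the two dictionaries, which is why an update applied to one of them without the other leaves the pair inconsistent.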

4.6.2 Command line tools

It was mentioned earlier that the two services of the RMS can be accessed through the Replica Manager. The LCG-2 CLI contains a Replica Manager client, which can perform a number of data management operations. More specifically, there are commands for assigning GUIDs to files and uploading/registering them to the Grid (the user can specify the SE, but it is not necessary), getting information about the available storage resources, finding the existing replicas of files, deleting replicas or creating new ones and viewing the contents of specific SEs. Moreover, the user can assign LFNs to files to enable their use by his/her applications.

Besides using the Replica Manager as an interface, the user has the option of accessing the RMC and the RLS directly by using two clients from the CLI. However, the low level operations made possible by those clients are dangerous, as they can create inconsistencies between the two catalogues (an update on one of the two catalogues does not enforce an update on the other). Therefore, in most cases, the use of the Replica Manager client is preferable.

4.7 Relevance to the Dissertation

An important objective of my MSc project was to gain knowledge and understanding of the LCG-2 middleware and the storage aspects in particular. This was important because, as a Tier 2 center, Edinburgh will make use of this middleware to connect to the rest of the LHC Grid. For this reason, it was necessary to have LCG-2 installed on a front end of machines.

During this summer, version 2_1_1 of LCG-2 has indeed been installed on three computers and there is currently a functional SE that can accept commands via the network from any computer running the User Interface software. To be able to follow and participate in this process, I had to study the LCG-2 user guide [9] extensively and learn the architecture and features of LCG-2, what services the LHC Computing Grid will offer and how they can be accessed. I also went through the necessary steps to obtain a Grid Certificate and install it on my workstation, so that I could use the User Interface to try the LCG-2 command line tools.

Chapter 5


The Local Configuration System (LCFG) [8], [11] is a piece of software that can be used to “install and manage the configuration of large numbers of Unix systems” [8]. It started as a project in 1993 at the University of Edinburgh and was originally designed for the Solaris OS. Since then, it has been ported to Linux and become an open source project distributed under the GNU General Public License. It has grown and evolved significantly and is today called LCFGng for “next generation”. The EDG project modified a version of LCFGng as a system for installing and configuring machines running their Grid middleware and since then the two versions have evolved independently. In this chapter we always refer to the EDG LCFG, even though most of the information found here holds true for both versions.

5.1 The architecture of LCFG

The architecture of LCFG can be seen in Figure 5.1. The operation of LCFG is based on a central server that holds configuration files for all other computers to be configured/managed. A configuration file on the central server is written in the LCFG language and could refer to settings that apply to one of the computers or to one setting that applies to many computers. These source files can be passed through the LCFG compiler to create profiles. A profile does not correspond to one source file, but contains the complete configuration of one computer, so a single source file may be compiled into many profiles. The LCFG central server also runs a web server which is used to publish the profiles. Every time an update is available, the server notifies the LCFG clients running on the managed machines. The clients can then download their new profile. A number of Perl scripts, the LCFG components, are then run by the client to turn the profile into the appropriate configuration files and make all necessary changes.


Figure 5.1: The LCFG architecture

5.1.1 Source Files

Source files are made up of header file inclusion statements containing standard configurations and resources. Resources are two-word statements that specify the necessary details that are not included in the header files and are specific to the configured systems. For example, the standard desktop configuration could have one ethernet card whereas the system to be configured could have two - differences like this require additional resources. The first word of the resource is the key and the second is the value. The key is made up from the hostname the statement refers to, the relevant component that should make the configuration changes and the parameter. The second word is the value the parameter should take in the new configuration. The program used to compile source files is called mkxprof. Calls to mkxprof can be made explicitly, but it also runs as a daemon, checking the configuration files for updates and recompiling the ones that have undergone changes. In this way, the user only needs to save an updated file. The compilation and publishing will be handled automatically.
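The key/value structure of a resource can be illustrated with a short parser. This is a sketch under a stated assumption: the three parts of the key (hostname, component, parameter) are taken to be joined by dots, which is a guess made for illustration rather than a statement of the exact LCFG source syntax. The resource line and the names `Resource` and `parse_resource` are hypothetical.

```python
from typing import NamedTuple


class Resource(NamedTuple):
    host: str
    component: str
    parameter: str
    value: str


def parse_resource(line):
    """Split a two-word resource statement into its key parts and its value.

    Assumes the key has the form host.component.parameter (an assumption)
    and that the hostname itself contains no dots.
    """
    key, value = line.split(None, 1)
    host, component, parameter = key.split(".", 2)
    return Resource(host, component, parameter, value.strip())


# Hypothetical resource: node1 has two ethernet cards instead of the standard one.
print(parse_resource("node1.network.interfaces 2"))
```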

5.1.2 Profiles

Source files are compiled into profiles. There is one profile for each machine to be configured and it contains all the necessary information on the configuration of that machine. Profiles are XML files divided into sections, one for each component. What keys are included in each individual section depends on the component it is meant for, but the general format of a profile is defined by a schema.
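The idea of a per-machine XML profile with one section per component can be sketched as follows. The element and attribute names used here (`profile`, `component`, `host`, `name`) are placeholders chosen for the example; the real profile schema is not reproduced, and `build_profile` is a hypothetical helper rather than part of mkxprof.

```python
import xml.etree.ElementTree as ET


def build_profile(host, resources):
    """Build a toy XML profile for one machine.

    resources maps a component name to a dict of parameter -> value.
    """
    root = ET.Element("profile", {"host": host})
    for component in sorted(resources):
        # One section per component, as described in the text
        section = ET.SubElement(root, "component", {"name": component})
        for parameter, value in sorted(resources[component].items()):
            ET.SubElement(section, parameter).text = str(value)
    return ET.tostring(root, encoding="unicode")


profile_xml = build_profile(
    "node1",
    {"network": {"interfaces": "2"}, "nfs": {"exports": "/opt"}},
)
print(profile_xml)
```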

5.1.3 Components

Components are scripts, usually written in Perl, each of which is responsible for one configurable aspect of the machine’s operation. Once a profile is received by the client, the relevant components are called with the required arguments that specify the method in the script to be run. The components typically implement a configure method and perhaps a start and stop method as well. What the configure method usually does is to create configuration files and the directories where they are meant to be placed. The start and stop methods are there in components that are meant to stop and start daemons, for example in the case of a service that needs to be restarted to get the new configuration.
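The component interface described above can be mimicked in a few lines. Real components are Perl scripts; the Python class below is only an analogue showing the configure/start/stop contract and the stop, configure, start sequence needed when a service must be restarted to pick up new settings. `ToyComponent`, `call_method`, `apply_profile_section` and the `ntp` example are hypothetical names.

```python
import os
import tempfile


class ToyComponent:
    """Python analogue of a component managing one aspect of a machine."""

    def __init__(self, name, config_dir):
        self.name = name
        self.config_path = os.path.join(config_dir, name, name + ".conf")
        self.running = False

    def configure(self, settings):
        # Create the target directory and write the configuration file
        os.makedirs(os.path.dirname(self.config_path), exist_ok=True)
        with open(self.config_path, "w") as handle:
            for key in sorted(settings):
                handle.write(f"{key}={settings[key]}\n")

    def start(self):
        self.running = True  # stands in for starting a daemon

    def stop(self):
        self.running = False  # stands in for stopping a daemon


def call_method(component, method, *args):
    """Invoke one of the supported methods by name, as a client would."""
    if method not in ("configure", "start", "stop"):
        raise ValueError(f"unsupported method: {method}")
    return getattr(component, method)(*args)


def apply_profile_section(component, settings):
    """Restart-style update: stop, rewrite the configuration, start again."""
    call_method(component, "stop")
    call_method(component, "configure", settings)
    call_method(component, "start")


with tempfile.TemporaryDirectory() as directory:
    ntp = ToyComponent("ntp", directory)
    apply_profile_section(ntp, {"server": "time.example.org"})
    print(open(ntp.config_path).read(), ntp.running)
```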

5.2 LCFG and LCG-2. Relevance to the Dissertation

The EDG project used a version of LCFG as a means to install and configure their software. This practice was passed on to the LCG project and, currently, the usual way to install LCG-2 is through an LCFG server. Of course, the installation and configuration can also be done manually. In Edinburgh it was decided to use LCFG for installing and managing the system configuration. The installation took place over the summer and I was involved in the process.

5.2.1 Installing LCFG

In order to install LCG-2 using LCFG, LCFG must first be installed on a central server. In Edinburgh, the LCFG service is set up on Glenellen, an IBM eServer x205 Intel P-IV 1.8GHz with 256 GB RAM. All the necessary files, including the essential document with the installation instructions [12], can be downloaded from the LCG-2 CVS repository at [10].

The operating system

The first step is to install Red Hat 7.3 on the server. It is important to partition the hard disk so that /opt is the largest partition, as this is where the software will reside. It should also be kept in mind during the installation that, alongside the LCFG server, a DHCP server, an NFS file server and a web server also need to run on the host. This will affect the packages installed and the firewall configuration. For the installation in Edinburgh, after Red Hat Linux 7.3 was installed, YUM (the Yellow dog Updater, Modified) was used to make sure that the latest and most secure versions of all packages were installed.

Downloading the rpm packages

After the installation of the operating system is complete, three rpm packages must be downloaded from the CVS repository and installed: edg-populate-serverng, edg-updaterep, updaterpms-static-server. The only one of those to be used directly by the user, edg-updaterep.rpm, comes with a configuration script that can be set to make it download the rpms for a specific CVS tag for LCG-2 from the repository. However, the configuration files for the LCG-2 installation must be checked out from the repository manually and put in a directory of the user’s choice before updaterep is run.

The final step in installing the LCFG server software is to run a Perl script provided with the distribution, lcfgng_server_update.pl, which checks that all the necessary rpms have been downloaded and generates a script that performs the installation.

LiveOS

LiveOS is an operating system that is loaded into the LCFG clients when they are booted from the network, at the beginning of the LCG-2 installation. The next step is to have it installed and configured in a directory on the LCFG server. The installation is handled by a Perl script called lcfgng_installroot.pl and then the user can configure some parameters by editing a file called installparams.

The DHCP server

A DHCP server should run on the machine running the LCFG server. A DHCP server has the duty of giving all the other hosts on the network their network configuration. This includes assigning IP addresses: the server has a range of IP addresses that it is authorised to distribute and, every time a host goes online and requests an address, it makes the assignment. A DHCP service needs to be set up on the LCFG server, so that the nodes to be configured obtain their configurations dynamically. The DHCP server also tells the network card of the node to boot via PXE and to load the PXE loader, pxelinux.0, using the TFTP server. To make writing a configuration file easy, the LCG-2 distribution includes an example configuration file, dhcpd.conf.ngexample. That file should be edited depending on the characteristics of the network where LCG-2 is to be installed and renamed to /etc/dhcpd.conf.

HTTP

The HTTP server running on the LCFG node needs to be configured to limit access to the LCFG to the computers from domains that are safe. This is very important, as the configuration information stored in the LCFG server can make the other nodes on the network vulnerable to attacks, if not kept secret. A simple way of configuring the HTTP server properly is provided in the form of a configuration file, httpd.conf.ngexample73, which has been downloaded at the beginning of the installation.

Another important step that should be taken at this point is the encryption of a password and its storing in a file typically named /etc/httpd/.htpasswd. This password will have to be used when trying to access the LCFG with a web browser.

NFS

An NFS server has to be properly configured on the LCFG server. It is used by the nodes to transfer the LiveOS files during their initial boot. The configuration is very simple and is done by adding two lines to the file /etc/exports. The LCG-2 distribution contains an example file with the name /etc/exports.ngexample73, from which the lines can be copied.

Setting up PXE

PXE allows nodes on a network to boot from the LCFG server. If a node is set to boot from its network card, then, when it starts, its network card sends a request to the DHCP server to obtain its configuration. If the DHCP server is properly configured, it will instruct the node to download the PXE loader, pxelinux.0, from the /tftpboot directory, using TFTP. Once the PXE loader is downloaded, it in turn uses TFTP to download its configuration file from the directory /tftpboot/pxelinux.cfg and a kernel from the directory /tftpboot/kernel/. Then the node boots that kernel. This is the way the LiveOS operating system is passed to the nodes prior to the LCG-2 installation. The user can have multiple configuration files and kernels in the respective directories. If there are multiple configuration files then the user will be presented with a choice of boot types when a node is booted.

Since the PXE loader, the kernel and the configuration files are transferred to the node via TFTP, it should be verified prior to the installation that a TFTP server is running on the LCFG server.

5.2.2 Installing LCG

Site-wide settings

After the previous steps have been completed, the server is set up and the configuration for the site must be prepared. All the site-wide settings will go into LCFG header files that can later be included in the node-specific source files. There are four header files that need to be edited. Those files are:

• cfgdir-cfg.h, which contains the directory of the configuration files.

• local-cfg.h, which contains modifications to standard Red Hat 7.3 settings.

• private-cfg.h, which contains security settings, including the root password for the site. The password is encrypted using openssl and is stored in encrypted format.

• site-cfg.h, which contains settings applying to the whole LCG-2 site (site name, LCG version, etc).

LCFG source files for the node types

After the site-wide settings have been made, it is time to specify the configuration for the different types of nodes. There are example source files provided with the LCG-2 downloaded files and those can be edited and renamed to the hostname of the node they refer to. Therefore, we end up with a number of source files equal to the number of nodes in the site. Those source files can be compiled into XML profiles using mkxprof.

PXE node installation

The installation can begin by accessing the LCFG server with a web browser. The URL that should be used is http:///install/install.cgi, the default username is lcfgng and the password is the password that was set during the HTTP configuration. The user is presented with a web interface that allows him/her to specify the node to be installed and the type of the installation. The different installation types correspond to the different configuration files for pxelinux that have been created when PXE was set up. When the nodes are rebooted their configuration will be the one selected. Therefore, in order to install a Storage Element, all that needs to be done is to select the appropriate boot type and reboot the host. Once the machine is booted, it is sent the profile that contains its configuration. The configuration defines the boot type, in other words, the kernel to boot and the filesystem to mount. Once the node is rebooted it functions as an LCFG client node, which means that it will pick up any changes made in the LCFG source files located at the server that affect its profile.

5.2.3 Relevance to the Dissertation

Since one of the aims of the MSc project was to understand the way a site running LCG-2 is set up, I had to install an LCFG server on an old computer that used to be on the physics network. The process began by formatting the hard disk and installing Red Hat Linux 7.3. I then followed the steps described in the previous paragraphs and ended up with a functional installation of LCFG on the computer. Unfortunately, the reproduction of the LCG-2 installation had to stop there, since there were no other machines that I could use to install LCG-2 on. However, I did follow and understand the installation process as it took place for the Edinburgh hardware. I feel that following the installation was instructive and even though I was, quite understandably, not allowed to work on the hardware myself, I learnt a lot from the process.

As an LCG front end there is an IBM eServer x205 Intel P-IV 1.8GHz with 256 GB RAM, Glenellen, used as the LCFG server and another one, Glenlivet, used as the GG to the Worker Nodes. Glenmorangie, an IBM eServer x305 dual Intel P-III Xeon 1GHz with 2 GB RAM is set up as a SE. Finally, the worker nodes of the Computing Element are going to be four dual Intel Xeon 2.8GHz machines, with 2GB RAM and 200GB EIDE HDD.

Chapter 6

Monitoring with Ganglia

Due to the dynamic nature of the LHC computing Grid, resources connected to it need to be monitored. The monitoring tool that will be used for the GridPP sites and has recently been installed in Edinburgh is Ganglia [13]. Ganglia is an open source project that started at the University of California, Berkeley, as part of an effort to link university clusters and has evolved into a complete monitoring system. It is also distributed in the sense that it can not only be used to monitor clusters but also supports cluster hierarchies. This is achieved by appointing representative nodes in each of the clusters and organising them into trees. Each node is then responsible for reporting to the one above (parent), on the state of those on the branches below (children).

Ganglia has been ported to many different platforms and tested thoroughly. It is also highly scalable - the communications have been optimised so as to introduce as small an overhead as possible and it has been successfully used to handle clusters scaling to thousands of nodes [14].

6.1 The Ganglia Architecture

The two main components of Ganglia are the Monitoring Daemon (gmond) and the Meta Daemon (gmetad). gmond is responsible for collecting information at the cluster level and runs on each of the nodes of the monitored cluster. It has three kinds of threads. The Collect and publish thread collects the monitoring information for the node gmond runs on and publishes it to a multicast channel used by every other gmond in the cluster. The multicast channel is listened to by the Listening threads, which pick up information broadcast by other gmond daemons and update the local gmond’s information accordingly. Thus, every node that has gmond running has a complete view of the cluster, kept in a hash table. The third kind of thread, the XML export threads, listen for client applications or gmetad daemons requesting information and answer them with messages in XML format.

gmetad is a Perl daemon, responsible for collecting information for clusters or groups of clusters. One gmetad runs on each of the representative cluster nodes. Typically, this means that it collects information from a set of nodes running gmond daemons as well as a number of child gmetad daemons, representative of other nodes. This information comes in the form of XML messages. Then it puts all its information together in its database and passes it on to the gmetad running on its own parent node, when requested.

Figure 6.1 shows how the Ganglia components are combined to build cluster hierarchies.

Figure 6.1: Ganglia cluster hierarchies

6.2 Metrics

Ganglia distinguishes between metrics, characterising them as either built-in or application-specific/user defined. Built-in metrics are used to describe the state of a cluster node. The number of built-in metrics varies depending on the platform, but the most important (number of CPUs, CPU and memory usage, running processes etc) are always available. Application-specific metrics are defined by the user and the user can explicitly specify the frequency with which they are collected and sent on the multicast channel. Applications can use the gmetric command line tool to publish information about themselves.
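As an illustration of how an application might publish a user-defined metric, the sketch below builds a gmetric command line. The long options shown (`--name`, `--value`, `--type`, `--units`) are the ones we recall from gmetric's help output and should be checked with `gmetric --help` on the installed version; the metric name `jobs_running` and the helper `gmetric_command` are hypothetical. The command is only constructed here, not executed.

```python
def gmetric_command(name, value, metric_type="float", units=""):
    """Return the argument list for publishing one metric with gmetric."""
    argv = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv


# Hypothetical application metric: the number of jobs currently running.
command = gmetric_command("jobs_running", 3, metric_type="uint16", units="jobs")
print(" ".join(command))
```

On a host where gmond and gmetric are installed, the list could be passed to `subprocess.run(command, check=True)`, for example from a periodic job, so that the value is refreshed at the chosen frequency.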

6.3 Transmitting and Storing Monitoring Information

6.3.1 Messages on the multicast channel

There are two types of messages exchanged between Ganglia nodes. The first is messages on the multicast channel. A multicast message is a message from one host to all other hosts in a group. What basically distinguishes a multicast message from a broadcast message is that in the case of the latter, all hosts in a network receive the message. In contrast, a multicast message is delivered to a dynamic group of recipients, which is identified by one IP address and does not include all hosts. Therefore, multicast messages result in less traffic on the network and are more efficient. Furthermore, nodes enter or leave the group of recipients dynamically, without the need to change the configuration of a central service or restart it: discovery is automatic.

The messages transmitted on the multicast channel are either heartbeats or collected metrics. Heartbeats are messages that signal that a host is up and running. A heartbeat includes the start time of the gmond daemon so that the rest of the nodes can detect restarting daemons. When a gmond does not send heartbeats for some time, the host is assumed to be down. The other kind of multicast messages are in the eXternal Data Representation (XDR) format, which is machine independent and also efficient. XDR messages include monitoring information sent from one gmond to the others in the group. A gmond sends updated values for the metrics every time a change occurs that surpasses a defined threshold. Updates will not be sent for values that do not vary significantly.
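The two rules described above, sending a metric only when it has changed by more than a threshold and using the start time carried in a heartbeat to spot a restarted daemon, can be sketched as follows. This is a simplified model that considers value thresholds only; the class `MetricSender`, the function `detect_restart` and the numbers used are illustrative assumptions, not gmond's actual implementation.

```python
class MetricSender:
    """Decides whether a new reading is worth multicasting."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None

    def should_send(self, value):
        # Always send the first reading; afterwards only significant changes
        if self.last_sent is None or abs(value - self.last_sent) > self.threshold:
            self.last_sent = value
            return True
        return False


def detect_restart(known_start_time, heartbeat_start_time):
    """A daemon has restarted if its heartbeat carries a new start time."""
    return known_start_time is not None and heartbeat_start_time != known_start_time


sender = MetricSender(threshold=0.5)
readings = [1.0, 1.2, 1.6, 1.7, 0.9]
print([sender.should_send(value) for value in readings])
print(detect_restart(1000, 1000), detect_restart(1000, 1450))
```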

6.3.2 XML messages

The messages passed from gmond to gmetad running nodes or exchanged between gmetad running nodes include monitoring information for subsets of cluster federations and are in XML format. XML, being portable and self describing, enables the integration of Ganglia with other software. These messages are transmitted by gmond or gmetad after a request by another gmetad is received. Any gmetad will periodically send such requests to the gmond daemons of the nodes it represents as well as all of the “child” gmetad daemons. In this way, it obtains up to date values for the metrics monitored. The XML messages sent from a gmond include information on all the gmond nodes on the same multicast channel. The XML messages from a gmetad include aggregated information on every single node lower than the gmetad in the cluster hierarchy.
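A client of such XML messages could extract the metrics per host along the following lines. The element and attribute names (`GANGLIA_XML`, `CLUSTER`, `HOST`, `METRIC`, `NAME`, `VAL`) follow the Ganglia XML output as far as we recall it, and only a subset of attributes is shown; the cluster name, host names, addresses and values in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the spirit of a gmond reply (values are invented).
SAMPLE = """<GANGLIA_XML SOURCE="gmond">
  <CLUSTER NAME="front-end">
    <HOST NAME="node-a" IP="192.0.2.10">
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
      <METRIC NAME="mem_free" VAL="512000" TYPE="uint32" UNITS="KB"/>
    </HOST>
    <HOST NAME="node-b" IP="192.0.2.11">
      <METRIC NAME="cpu_num" VAL="1" TYPE="uint16" UNITS="CPUs"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_text):
    """Return {host name: {metric name: value string}} for every HOST element."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.findall("METRIC")
        }
    return result


for host_name, metrics in metrics_by_host(SAMPLE).items():
    print(host_name, metrics)
```

Because a single gmond reply already describes every host on its multicast channel, one request per cluster is enough for a gmetad to refresh its view.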

6.3.3 Storing Monitoring Information

To store the information they receive through the multicast channel, gmond daemons use a hash table in memory. The hash table supports simultaneous entering of data by listening threads accessing different parts of the table, to increase efficiency. The system is also optimised for accesses by the XML export threads. Data in the table is stored in binary form to reduce its size and to allow quick conversion from the XDR format.

Timeouts are the mechanism by which gmond daemons distinguish between valid data and expired data that should be deleted. For every piece of data put in the hash table, gmond records the time of receipt. Data that has not been updated for longer than a soft limit is considered suspicious and any client applications using it are notified. If a second, hard limit is crossed, the data is removed from the hash table.
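The soft and hard limits can be sketched with a small table that records the time of receipt for each entry. This is a toy model: the class `MetricTable`, its method names and the limits of 60 and 300 seconds are assumptions chosen for the example, and time is passed in explicitly so that the behaviour is easy to follow.

```python
class MetricTable:
    """Toy in-memory table with soft and hard timeouts (in seconds)."""

    def __init__(self, soft_limit, hard_limit):
        if soft_limit >= hard_limit:
            raise ValueError("the soft limit must be shorter than the hard limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self._entries = {}  # (host, metric) -> (value, time of receipt)

    def update(self, host, metric, value, now):
        self._entries[(host, metric)] = (value, now)

    def status(self, host, metric, now):
        """Classify an entry as fresh, suspicious, expired or missing."""
        entry = self._entries.get((host, metric))
        if entry is None:
            return "missing"
        age = now - entry[1]
        if age > self.hard_limit:
            return "expired"
        if age > self.soft_limit:
            return "suspicious"
        return "fresh"

    def purge(self, now):
        """Remove every entry older than the hard limit; return how many."""
        expired = [key for key, (_, received) in self._entries.items()
                   if now - received > self.hard_limit]
        for key in expired:
            del self._entries[key]
        return len(expired)


table = MetricTable(soft_limit=60, hard_limit=300)
table.update("node-a", "load_one", 0.4, now=0)
print(table.status("node-a", "load_one", now=30))
print(table.status("node-a", "load_one", now=120))
print(table.purge(now=301), table.status("node-a", "load_one", now=301))
```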

The data collected by the gmetad daemons is stored using the Round Robin Database tool (RRDtool). All hosts running gmetad daemons keep databases with all the data sent to them and create graphs of that data versus time.

6.4 The PHP Front end

Ganglia has a web front end system written in PHP. This system runs on the same host as a gmetad daemon. Its role is to create and periodically update web pages where it publishes the information contained in the gmetad RRD database. It thus provides the user with the ability to easily access the information gathered by Ganglia, without having to query the databases directly. Moreover, it provides this information in the form of the graphs created by RRDtool, as opposed to just raw numbers, making the monitoring process even more user friendly. All the metrics that Ganglia uses can be obtained through the PHP front end.

6.5 Using Ganglia for the Scotgrid Hardware. Relevance to the Dissertation

Ganglia is currently being installed in many GridPP sites, to be used not only by the individual site administrators, but also as a way to centrally monitor the GridPP hardware. As part of this process, it was recently installed and configured for the Edinburgh site as well. For my MSc project, I had to read the Ganglia documentation, understand how it works and then participate in the installation/configuration process.

6.5.1 Using LCFG to configure Ganglia

In Edinburgh, Ganglia was installed and configured by following the instructions found on the GridPP website. Figure 6.2 shows how the Ganglia daemons interact to gather the monitoring information.

The LCFG server, Glenellen, is also the node running gmetad, that is, the representative node for the Edinburgh cluster. On all the hosts in the cluster, including Glenellen, there are gmond daemons running, collecting the monitoring information. The gmond daemons publish their information on the multicast channel, so every host is updated on the state of all others. Periodically, the gmetad on Glenellen requests the cluster monitoring information from one of the gmond daemons. This information is sent as an XML file and then stored in a database. The PHP front end running on Glenellen can display this information on browsers of authorised hosts.

Figure 6.2: Ganglia on the Edinburgh front end

6.5.2 The near future

The Edinburgh hardware is now monitored and the monitoring information collected by the gmetad on Glenellen can be viewed online by all allowed hosts. For the time being, those include only hosts of the university network. However, in the near future, the list of allowed hosts will expand to include other GridPP machines. This will allow for centralised monitoring of the GridPP hardware.

There might also be internal changes for the Edinburgh site. When the Worker Nodes are in place they might form a different Ganglia cluster, with a gmetad representing them separately. The cluster of the WNs will be lower in the hierarchy than the front end cluster, represented by the gmetad of Glenellen. In this way, all the site information will still be accessible from Glenellen. However, it will be possible to monitor the WNs as an isolated group as well.

6.5.3 Monitoring Examples

This section and the diagrams included show the way two different operations on the Scotgrid host, Glenellen, are viewed from the PHP front end. The graphs have been produced by trying operations and capturing the images from a browser window.

A memory leak

A C program that creates a memory leak has been written and run on Glenellen. The program allocates memory for a long array of pointers to doubles and then loops over the array between time intervals of constant length, allocating a large amount of memory for each of the pointers. The program can be found in appendix A. This results in more and more of the system memory being used until eventually, if left alone, the program will allocate all the memory it tries to allocate or the system will run out of memory. The latter should obviously be avoided because it may crash the system.
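For readers without access to the appendix, the behaviour of the leak program can be approximated by the following Python analogue. It is not the C program from Appendix A: the function `leak`, its parameters and the small sizes used in the demonstration are assumptions made for illustration. Because every interval adds the same number of bytes, memory usage grows by roughly `chunk_bytes / interval_seconds` per second until the target is reached.

```python
import time


def leak(total_bytes, chunk_bytes, interval_seconds, sleep=time.sleep):
    """Allocate chunk_bytes every interval_seconds until total_bytes is held.

    References to all chunks are kept on purpose, so nothing is released
    while the function runs; the caller receives them and decides when
    to let go.
    """
    held = []
    allocated = 0
    while allocated < total_bytes:
        held.append(bytearray(chunk_bytes))
        allocated += chunk_bytes
        sleep(interval_seconds)
    return held, allocated


# Small demonstration values; the runs described in the text used
# far larger totals and intervals of whole seconds.
chunks, total = leak(total_bytes=4 * 1024, chunk_bytes=1024, interval_seconds=0.01)
print(len(chunks), total)
```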

Glenellen has 2 GB of memory. The first run of the program was set to allocate 1GB of memory, sleeping for 1 second between allocations. The result was the first steep rise in memory usage that can be seen in Figures 6.3 and 6.4. On the second run, the program was set to exit after allocating 1.5 GB to avoid the risk of crashing the system. The time interval between allocations was set to 5 seconds, to create a slower increase. It can be seen on the graphs that the memory usage rises approximately linearly. This is exactly what should be expected, since the allocations happen at regular time intervals and the same amount of memory is allocated every time. It is interesting to observe that as the amount of allocated memory approaches 1.5 GB, the system starts using swap space to free physical memory. This did not happen during the first run, because memory usage did not reach a critical level. To cross check that we get the correct information from Ganglia, we ran the top command. This confirmed the accuracy of the Ganglia graph.

A file transfer

Another operation that was monitored was a 21 GB file transfer from Glenkinchie. Figures 6.5 and 6.6 show the results. At the beginning of the transfer we notice that the use of memory increases quickly and the network and CPU activity are high. However, when the RAM is filled and can no longer be used as a buffer, the speed of the process is bottlenecked by the speed of writing to the hard disk and both the CPU load and the network activity drop. Furthermore, the process becomes even slower, as the free space on the hard disk of Glenellen is dramatically reduced. In the end, the scp had to be manually cancelled and the file deleted, to stop Glenellen from running out of hard disk space.

A final observation to be made from these graphs is that Glenellen sends packets through the network as well as receiving. The reason is that in an scp transfer the receiver sends messages to the sender to acknowledge receipt of the packets.

Figure 6.3: The leak program output, a terminal running the top utility and the Ganglia webpage. The program claims to have allocated 1253MB, the top utility gives a value of 1.3 GB and the value shown on the Ganglia Graph for the total memory usage is approximately 1.4GB.

Figure 6.4: Memory usage, free memory and free swap memory. We notice that the first two graphs are consistent. The third graph shows the usage of swap memory for the second run.

Figure 6.5: The Ganglia page for Glenmorangie shortly after the file transfer. The start of the file transfer resulted in a quick rise in memory usage. When all the memory was used, Glenmorangie only received and processed data as quickly as it could write it on the hard disk. The result was a drop in CPU and network activity.

Figure 6.6: As Glenmorangie receives packets of data it sends confirmation packets back to Glenkinchie. Almost 20 minutes after the transfer started and with almost 5 GB transferred, the process was stopped, since there was not enough disk space on Glenellen to hold a 21 GB file.

Chapter 7


It is obvious that the variety of storage resources on the Grid will be great and that each of those resources will have different functionality and their own data handling policies. It is therefore necessary to have a uniform interface to all those different local systems, so that clients can easily interact with them without having to know how to deal with each of them separately. There are a number of tools aimed at addressing this issue. Those most relevant to this project are summarised in the following paragraphs.

7.1 SRM

The Storage Resource Manager (SRM) is an interface specification defining the ways in which a server running on a storage resource should be able to interact with applications trying to reach it via the Grid. These applications should be able to invoke a specified set of methods and expect standard responses. The role of the SRM interface is to make sure that any implementation of a storage management system is able to use those methods and responses. SRM is the result of a collaboration between the European DataGrid, CERN, Fermilab and LBNL. There are implementations of the SRM protocols for a number of storage systems: HPSS, Enstore, JasMINE, CASTOR, EDG SE and ATLAS and RAID arrays. The following sections will describe the way files should be viewed and treated by an SRM implementation, as well as the most common methods such an implementation should provide.

7.1.1 SRM file and storage space types

The SRM specification characterises files based on their lifetime on a storage system. More specifically, a file living in an SRM-managed storage system can be permanent, volatile or durable. Depending on the type of files stored in it, a storage space is also assigned one of those three descriptions.

Permanent files in an SRM-managed storage system are of the same nature as permanent files in a typical filesystem. Those files are guaranteed to remain unchanged and inside the storage system, unless their owner chooses to delete them. Therefore, a user of the Grid interacting with an SRM server and using permanent files can usually count on finding those files on that system for long periods of time.

Volatile files are those that have a specified lifetime. A volatile file is only guaranteed to be found by a user during its lifetime and its lifetime is specific to that particular user. For example, if user A is accessing a volatile file, she will be guaranteed by the storage system that it will be accessible for a certain amount of time. If in the meantime another user, B, asks for the same file, a new lifetime for the file will be associated with him and he will get an independent guarantee for a different time period. If, at any moment, the existence of a volatile file is not guaranteed to any user (i.e. all its lifetimes have expired) the file could be removed, as soon as the space it resides in needs to be reclaimed by the storage system.
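The per-user lifetimes described above can be illustrated with a short C sketch. It is only a model under stated assumptions: the structure names, the fixed-size table and the numeric user identifiers are hypothetical and are not taken from the SRM specification or from any SRM implementation. It also assumes, as on POSIX systems, that time_t is an integer count of seconds. The point it tries to show is that a volatile file becomes a candidate for removal only when every guarantee attached to it has expired.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAX_USERS 8   /* illustrative limit, not part of the SRM specification */

/* One lifetime guarantee held by one user for a volatile file. */
struct lifetime {
    int    user_id;   /* hypothetical numeric user identifier */
    time_t expires;   /* moment at which this user's guarantee ends */
};

/* A volatile file together with the guarantees currently attached to it. */
struct volatile_file {
    struct lifetime guarantees[MAX_USERS];
    size_t          count;
};

/* Give user_id a guarantee lasting `seconds` from `now`.
 * A guarantee already held by the same user is replaced.
 * Returns 0 on success, -1 if the table is full. */
int grant_lifetime(struct volatile_file *f, int user_id, time_t now, long seconds)
{
    assert(f != NULL);
    for (size_t i = 0; i < f->count; i++) {
        if (f->guarantees[i].user_id == user_id) {
            f->guarantees[i].expires = now + seconds;
            return 0;
        }
    }
    if (f->count == MAX_USERS)
        return -1;
    f->guarantees[f->count].user_id = user_id;
    f->guarantees[f->count].expires = now + seconds;
    f->count++;
    return 0;
}

/* Returns 1 when no user holds an unexpired guarantee, 0 otherwise. */
int is_removable(const struct volatile_file *f, time_t now)
{
    assert(f != NULL);
    for (size_t i = 0; i < f->count; i++) {
        if (f->guarantees[i].expires > now)
            return 0;
    }
    return 1;
}
```

With user A holding a guarantee until t = 150 and user B until t = 220, is_removable reports that the file must be kept at t = 160 and may be reclaimed from t = 220 onwards.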

Durable files are usually files of a temporary nature that contain important data. They also have lifetimes, but once those expire, the storage system cannot yet delete them. Instead, it has to notify the owner of the files that their lifetime has expired and perhaps copy them to permanent space (depending on the implementation). A suitable candidate for a durable file would be a very large file that contains important information and needs to be accessed quickly. Since durable space will usually be in the disk cache, as opposed to a tape, quick access to the file could be provided and the file would still not be lost if not accessed for some time.

A file of a specific type can always be stored in space of the same type. Moreover, durable files can also use permanent spaces and volatile files can be stored in spaces of any type. This is demonstrated in Figure 7.1.

Figure 7.1: SRM filetypes. Image taken from [22].
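The compatibility rule between file types and space types stated above can be written as a single comparison, if the three types are ordered from least to most persistent. The following C sketch does this. The enumeration and the function name are hypothetical, chosen only for illustration, and the numeric ordering is an encoding assumption rather than something defined by the SRM specification.

```c
#include <assert.h>

/* The three SRM types, ordered from least to most persistent.
 * The numeric values are an illustrative encoding only. */
enum srm_type {
    SRM_VOLATILE  = 0,
    SRM_DURABLE   = 1,
    SRM_PERMANENT = 2
};

/* Returns 1 if a file of type `file` may be stored in space of type `space`:
 * a file fits in space of its own type or of any more persistent type. */
int can_store(enum srm_type file, enum srm_type space)
{
    assert(file <= SRM_PERMANENT && space <= SRM_PERMANENT);
    return space >= file;
}
```

Under this encoding a volatile file is accepted by all three kinds of space, a durable file by durable and permanent space, and a permanent file by permanent space only.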

7.1.2 SRM functionality

There are currently many different types of storage systems, such as disk caches, tape storage systems and RAID arrays. A storage facility may contain one or more kinds of storage system. In addition, those systems will support different operations and be governed by different policies. However, in a Grid environment, every storage system should present the same “face” to the rest of the “world”. The basic operations that any storage system managed by an SRM implementation should support fall under five categories. A brief description of those five categories of functions will be provided here. For a complete listing, refer to the SRM interface specification [16].

Space Management Functions

The first category includes space management functions. The functions in this category can be used to reserve and release space in a storage system, as well as to find out information about the space and the files contained in it (free space, type, lifetime etc). By using this type of function it is also possible to modify these parameters, i.e. to prolong the lifetime or change the type of a file or space. Typically, the caller provides a user name and some information about the space, such as its type or size, as arguments.

Directory Functions

Directory functions are those that perform the tasks needed to manage directories in a Unix-type filesystem. For this reason, their names are the same as those of well known Unix shell commands with the srm prefix (srmMkdir, srmRmdir, srmMv etc). Directories are a virtual construct that essentially provides the user with a way to logically group files on the Grid. Two files can belong to the same directory regardless of their size, type and physical location.

Transfer Functions

Transfer functions are used to transfer files from and to SRM-managed storage systems. The filenames used by those functions to refer to replicas on the Grid are the SURLs and the TURLs. Typically a client requesting a file will call the PrepareToGet function, providing the server with an SURL for the requested file. The server will then pick a suitable transfer protocol and return it in a TURL that can be used for the transfer. In the case of a client trying to upload a file to the storage system, the PrepareToPut function will be used. Once the SRM storage manager allocates space for that file, a TURL is provided to the client and the transfer can begin. It should be noted here that SRM does not perform the data transfers for the client; it only provides the client with the information (the TURL of the file and the protocol) needed to perform the transfer itself.
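The exchange described above, in which the server settles on a transfer protocol and hands back a TURL, can be sketched in C as follows. This is a minimal illustration under several assumptions: the function names are hypothetical and are not SRM calls, the protocol is chosen as the first one in the client's list that the server also supports, and the TURL is assumed to have the simple form protocol://host/path. The host name used in the usage note is a placeholder.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Pick the first protocol in the client's list that the server also
 * supports. Returns NULL when the two lists have nothing in common. */
const char *pick_protocol(const char *const client[], size_t n_client,
                          const char *const server[], size_t n_server)
{
    assert(client != NULL && server != NULL);
    for (size_t i = 0; i < n_client; i++)
        for (size_t j = 0; j < n_server; j++)
            if (strcmp(client[i], server[j]) == 0)
                return client[i];
    return NULL;
}

/* Write "protocol://host/path" into out (path is expected to start with '/').
 * Returns 0 on success, -1 if the result does not fit in out_size bytes. */
int make_turl(char *out, size_t out_size, const char *protocol,
              const char *host, const char *path)
{
    int n = snprintf(out, out_size, "%s://%s%s", protocol, host, path);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For a client offering bbftp, gsiftp and http to a server supporting gsiftp, ftp and http, pick_protocol selects gsiftp, and make_turl with the placeholder host se.example.org and the path /cache/file1 produces gsiftp://se.example.org/cache/file1. The transfer itself is then carried out by the client, not by the code above.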

Figure 7.2: SRM and file transfers. Image taken from [17].

Permission Functions

The fourth category of functions is the Permission functions. Permissions are a way of protecting files on the Grid from unauthorised access. Grid file permissions are completely analogous to permissions set for files on a Unix filesystem. Functions in this category allow users to define who has access to their files and what they are allowed to do with them. They also provide a way of checking permissions for a given file that can be utilised by both clients and SRM servers.

Status Functions

Finally, the SRM specification defines Status functions that can be used to track the progress of an SRM operation. Thus, for the duration of a download or upload from or to a storage system, starting from the moment of the client request, information on the status of the operation can be obtained from the manager by means of the StatusOfGetRequest or the StatusOfPutRequest function respectively.

7.1.3 File pinning

We will provide here a description of file pinning, a technique that SRM implementations should support. In short, file pinning is the practice of extending the guaranteed lifetime of a file of temporary nature, in order to greatly increase the chance that it will be available after an amount of time. This feature can be implemented with different levels of sophistication. The simplest example would be to allow the client to request that a “mark” is placed on the file. This mark does not correspond to a predefined extension of lifetime, but whenever the storage manager needs to make some space it makes sure it removes the file with the oldest mark first. A more complex implementation of file pinning would take into account the identity of the client requesting the pin, not allowing unlimited consecutive pins by the same client. It would also allow the client to specify the desired duration of their requested pins. For a more detailed discussion of file pinning strategies see [18].
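The simplest pinning scheme mentioned above, in which the file with the oldest mark is removed first, can be sketched in a few lines of C. The structure and function names are hypothetical, and the mark is assumed to be a plain timestamp in arbitrary units. The sketch is not taken from any SRM implementation and ignores client identity and pin duration, which the more complex schemes would need.

```c
#include <assert.h>
#include <stddef.h>

/* A file in the cache together with the time of its most recent mark. */
struct cached_file {
    const char *name;
    long        mark;   /* timestamp of the latest pin request (arbitrary units) */
};

/* A pin request simply refreshes the mark on the file. */
void pin(struct cached_file *f, long now)
{
    assert(f != NULL);
    f->mark = now;
}

/* Return the index of the file that should be removed first when space is
 * needed, i.e. the one with the oldest mark, or -1 if there are no files. */
int pick_victim(const struct cached_file *files, int n)
{
    int victim = -1;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (victim < 0 || files[i].mark < files[victim].mark)
            victim = i;
    }
    return victim;
}
```

A client that pins a file again moves it to the back of the removal order, which is exactly why a more careful implementation would limit consecutive pins by the same client.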

7.1.4 LCG-2 and SRM

LCG-2 currently supports the classic SE solution provided by the DataGrid project, but this is soon going to change. It is intended that the final LCG-2 release will include an SE implementation based on the Storage Resource Manager (SRM). There are a number of differences between the classic and the SRM implementation of the SE, the most important being that the latter supports asynchronous data transfer operations and the pinning of files. In the case of a classic SE implementation, the client can only issue the next request to the SE after the previous one has been completed. Moreover, it is not possible for a client to “book” multiple files in a storage system for future use.

7.2 The LBNL SRM

The Lawrence Berkeley National Laboratory have developed a software package which implements the SRM specification. It includes a Disk Resource Manager (DRM), which can be used for disks and NFS systems, and a Tape Resource Manager (TRM), which can be used to manage mass storage systems. The two can be combined to form what is called a Hierarchical Resource Manager (HRM). Figure 7.3 shows how a DRM, a TRM and an HRM can interface storage resources to the Grid. The protocols supported by the LBNL SRM are GridFTP, FTP and HTTP. The LBNL SRM has been used successfully in the context of STAR [19], an experiment at Brookhaven National Laboratory (BNL) investigating the results of heavy ion collisions. SRM was used to automate the process of transferring data between RCF and NERSC, the storage systems located at BNL and LBNL respectively. The LBNL SRM is still being actively developed and a new version (2.1) is going to be released towards the end of 2004. For the rest of this document, the acronym SRM will refer to the LBNL implementation unless we explicitly state that we are referring to the specification.

7.3 The SDSC SRB

A team working at the San Diego Supercomputer Center (SDSC) have developed the Storage Resource Broker (SRB) [20]. SRB is a software suite that can provide uniform access to distributed data resources. It can be used as middleware, but it is also a complete solution in itself (it contains a user interface and does not need to be part of an application in order to manage combinations of storage resources). SRB is a mature and useful product that is already being used by a number of sites in the UK.

Figure 7.3: DRM, TRM and HRM and the systems they interface to the Grid. Image taken from [21].

7.3.1 The SRB architecture

Figure 7.4 illustrates the basic operation of SRB. An SRB server can be installed on top of a storage system and the client applications can use the network to make calls to it in a number of ways, which are detailed below. The SRB server interacts with the Metadata Catalogue (MCAT) server, which provides it with a logical set of filenames for the files stored inside the managed systems. This enables the SRB server to execute the desired I/O operations.

SRB has a number of interesting features that increase its functionality and make it a very useful piece of software. Those most important to this work are summarised below.

Figure 7.4: The SRB architecture. Image taken from [20].

SRB Master and SRB Agents

The SRB server is implemented in the following manner. When the server is running, a process called the SRB master listens on a well known port for incoming client requests. As soon as a request arrives from a new client, the master spawns an SRB agent, a process that handles the interaction with that particular client, which is thereby moved to a different port. After that, the SRB master goes back to listening for new client requests and the client starts issuing requests to the agent. Each agent has a high level request handler and a low level request handler, and client requests are dispatched to one or the other accordingly. The high level request handler is responsible for accessing the MCAT, in order to register or physically locate data. The low level request handler typically makes use of the physical locations of data or storage spaces to perform the I/O operations.

7.3.2 The Metadata Catalogue

MCAT stores metadata about the data stored in the SRB-managed storage systems and is useful in providing an organisation of those files using logical filenames. In this virtual “filesystem”, the equivalent of directories are called collections. Very much like a Unix directory, a collection may contain a number of files as well as sub-collections, thus enabling the creation of filesystem-like hierarchies. The important difference, however, is that files within the same collection might be physically located in different storage resources. MCAT conveniently keeps the mapping of logical filenames to physical locations hidden from the client applications.
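The role of the catalogue as a mapping from logical names to physical locations can be illustrated with a small C sketch. It is a toy model only: the structure, the function names and the example paths are hypothetical and have no relation to the actual MCAT schema or to the SRB client API. It assumes that logical names are slash-separated paths, so that membership of a collection can be tested by comparing path prefixes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One catalogue record: where a logically named file physically lives. */
struct mcat_entry {
    const char *logical_name;   /* name seen by the client */
    const char *resource;       /* storage resource holding the file */
    const char *physical_path;  /* location of the file on that resource */
};

/* Resolve a logical name to its catalogue record, or NULL if unknown. */
const struct mcat_entry *mcat_lookup(const struct mcat_entry *cat, size_t n,
                                     const char *logical_name)
{
    assert(cat != NULL && logical_name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cat[i].logical_name, logical_name) == 0)
            return &cat[i];
    }
    return NULL;
}

/* Returns 1 if the entry lies inside the given collection (directly or in a
 * sub-collection), judged purely by its logical name. */
int in_collection(const struct mcat_entry *e, const char *collection)
{
    size_t len = strlen(collection);
    return strncmp(e->logical_name, collection, len) == 0
        && e->logical_name[len] == '/';
}
```

Two entries whose logical names both start with the same collection path can name different resources, which mirrors the observation above that files in one collection need not share a physical location.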

7.3.3 The S-commands

The S-commands are a set of command line tools that resemble the common Unix command line utilities (ls, mkdir, cd etc) but operate on files in SRB collection hierarchies as opposed to Unix directory based filesystems. They are typically named after the equivalent Unix command, with an S prefix.

7.3.4 The SRB client API

Data-intensive client applications can be built on top of SRB, using the C, C++ and Java client APIs provided. The client library is provided with the standard client package distribution and it is straightforward to have other applications include the necessary header file and link against it. The functions included in the API fall under two main categories. The first is the High Level API. Its functions perform operations on data registered with MCAT and any high level operation updates the MCAT accordingly. The second is the Low Level API. Its functions do not imply any communication with MCAT and operate on data by passing messages for the low level request handler to the SRB agent. Apart from the High and Low Level APIs, there are also functions that open and close connections to SRB servers, communicate with MCAT to obtain metadata, manually register and unregister data objects or perform other minor operations. A detailed description of the extensive SRB client API can be found on the SRB website [20].

7.4 An interface between SRM and SRB

The development of SRM for LCG-2 introduces an issue for those sites which have invested in installing and using SRB for their data management needs. It is clear that LCG-2 should be used by all the resources that will eventually be connected to the LHC Grid and that CERN is moving towards adopting SRM as the preferred protocol for data access. It may therefore appear that these sites should abandon their plans to use SRB and switch to SRM instead. However, the cost of this change could be reduced by means of an interface between the two systems. Developing such an interface would make it possible to install SRM on top of SRB, thus allowing use of the former without wasting all the work that has been done on the latter. The second part of the MSc project was meant to be the development of that interface.

The idea was to create a working installation of the SRB server on Redkite, one of the workstations on the Physics network. Then I would install an SRM server on the same computer. By adding some code to SRM, I should be able to interface it to SRB and get it to translate all the requests it receives from client applications, passing them on to the SRB server in the same way it would for any other storage management system.

7.4.1 Installing SRB from source

The first task was to install SRB. The installation of the SRB server is meant to be a straightforward procedure. It is done automatically by editing and running a Perl script that can be downloaded from the SRB website [20]. However, I found that the Perl script contained a number of bugs, such as unnecessary line breaks that caused it to crash, which took me a long time to correct. This delayed the installation, but in the end the SRB server worked properly.

7.4.2 Installing SRM from source

It was decided to use the LBNL implementation of SRM. The first task was to install the code from source. Most of the necessary files can be downloaded from the SRM distribution page [23], but the TRM source code cannot; it should be requested by e-mail from the LBNL SRM development team. A brief account of the installation process will be provided here.

GCC 2.95.3

Parts of the SRM code are dependent on version 2.95.3 of the GNU C Compiler and will not compile properly using another version. The version of GCC running on Redkite was 3.2.3, so I had to install 2.95.3. Not having root access to Redkite complicated the procedure. GCC 2.95.3 and its libraries were installed in a subdirectory in user space. The directory containing the GCC binary was included at the beginning of the PATH environment variable.

Globus

The next step was to install the GT2 core. This is necessary, because the SRM server code links against the Globus libraries. The GT2 core was installed by downloading the Grid Packaging Tools (GPT) [24], in which version 2.14 is included. After building the GT2 core, the Globus library must be added to the LD_LIBRARY_PATH environment variable, to enable GCC to find it.

ORBacus 4.1.0

SRM uses the ORBacus Object Request Broker, which can also be downloaded from the SRM distribution page. To install it, the runconfig executable file provided must be run. This is the ORBacus configurer. It requests some basic information on the system, creates the configuration and gives easy to follow instructions on how to proceed to build the software. The configurer also adds all the libraries to the LD_LIBRARY_PATH automatically.

DRM, the HRM Clients and TRM

Before the source of the different SRM components is compiled, it should be checked that the LD_LIBRARY_PATH environment variable includes not only the paths to the ORBacus libraries, but also the path to the Globus library. Examples of running the configure scripts of all the components can be downloaded from the SRM distribution page. The configure scripts check for all the necessary libraries and automatically create the makefiles that should be used for the compilation. Running make and make install should normally work for all components.

Installation Problems

At the time of writing, there is no guide available for installing SRM from source. For this and other reasons, the installation proved a long process and a much more difficult task than originally expected. It should be acknowledged that the SRM development team were very quick to answer questions and provide help when asked for it; however, this did not solve all the problems. The first unsolved problem was encountered when I tried to “make install” the DRM server software. The process would exit with a vague error message. The feedback from the SRM development team was that this operation was not important and that the same effect could be achieved by manually copying the binaries into the bin directory of the installation folder. However, after the installation completed, I found that the DRM server would exit without an informative error message. In order to investigate whether the problem lay in the installation or in the configuration I was using, I downloaded the binary distribution of SRM and started the DRM server using the same configuration. The DRM started normally, so I concluded that the problem lay in the installation. However, I was unable to find what the problem was, because I had received no error messages while the software was being installed. The SRM team attributed the problem to an incompatibility with the Red Hat Enterprise WS (2.4.21-15.0.4.EL) installed on Redkite.

7.4.3 Thoughts on an Interface

Under the circumstances and due to time constraints, it was not possible to start developing software for the interface, especially since the software could not be compiled and tested. However, an effort was made to create a design for the interface. Two possible solutions have been considered, but it was not possible to reach a conclusion.

The first way to build the interface is by adding code to the Tape Resource Manager. TRM already has interfaces to the HPSS and NCAR systems. Those are built by inheriting from a class called FileFetcher and creating the classes FileFetcherHPSS and FileFetcherNCAR respectively. Every time a put, get or other SRM transfer request is submitted to TRM, the FileFetcher-derived class either brings the file from the storage system into the disk cache of the SE (in the case of a get) or prepares the storage system to receive it from the disk cache (in the case of a put). Then it returns the TURL of the file, instructing the client to perform the file transfer. Therefore, a possibility would be to treat SRB as a storage system and derive a FileFetcherSRB class that makes calls to the SRB library to make SRB files available for transfer or to help the SRB system host the files the client wants to store.
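To make the first design option more concrete, the following sketch shows the shape such an extension might take. It is written in C, using a table of function pointers as a stand-in for the FileFetcher class hierarchy, and it is not based on the actual TRM source, whose real class interface is not reproduced here. Every name in it is hypothetical. In particular, srb_stage_in and srb_prepare_out are stubs that only check their arguments and mark the places where calls into the SRB client library would be made, and the TURL format is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* C stand-in for the FileFetcher base class: one table of operations per
 * storage system. All names are hypothetical. */
struct file_fetcher {
    const char *system_name;
    /* get: bring remote_name from the storage system into the disk cache */
    int (*stage_in)(const char *remote_name, const char *cache_path);
    /* put: prepare the storage system to receive the file from the cache */
    int (*prepare_out)(const char *remote_name, const char *cache_path);
};

/* Stub marking where the SRB client library would be called for a get.
 * It only validates its arguments; no real SRB function is used. */
static int srb_stage_in(const char *remote_name, const char *cache_path)
{
    return (strlen(remote_name) > 0 && strlen(cache_path) > 0) ? 0 : -1;
}

/* Stub marking where the SRB client library would be called for a put. */
static int srb_prepare_out(const char *remote_name, const char *cache_path)
{
    return (strlen(remote_name) > 0 && strlen(cache_path) > 0) ? 0 : -1;
}

/* Counterpart of the proposed FileFetcherSRB class. */
const struct file_fetcher file_fetcher_srb = {
    "SRB", srb_stage_in, srb_prepare_out
};

/* Serve a get request: stage the file into the cache, then write the TURL
 * the client should use. Returns 0 on success, -1 on failure. */
int handle_get(const struct file_fetcher *ff, const char *remote_name,
               const char *cache_path, const char *protocol,
               const char *host, char *turl, size_t turl_size)
{
    assert(ff != NULL && ff->stage_in != NULL);
    if (ff->stage_in(remote_name, cache_path) != 0)
        return -1;
    int n = snprintf(turl, turl_size, "%s://%s%s", protocol, host, cache_path);
    return (n >= 0 && (size_t)n < turl_size) ? 0 : -1;
}
```

The design choice this is meant to illustrate is that the request handling code only sees the table of operations, so a new storage system could in principle be added without changing the code that builds and returns TURLs.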

The second solution was suggested by Alex Sim of the SRM development team. It was not clearly stated, but the idea is to treat SRB as a file transfer protocol and have it as another option alongside GridFTP and BBFTP (the two transfer protocols that DRM currently supports). When a client submits a file transfer request to an SRM server, it also sends a list of supported protocols. The server then picks one of the protocols and sends back a TURL to tell the client how to make the transfer. It is not clear to me how SRB could be treated as a protocol without the client supporting it. The idea that SRM instructs the client to function as an SRB client seems very limited in its applicability, since only a minority of clients will be able to support this. The only reasonable interpretation I have come up with is for the SRM server to dynamically create an SRB client application for the requested file transfer and return that to the client. However, this seems to be a rather complicated idea.

This part of the project had to be abandoned as part of the MSc dissertation. The reason is that I did not fully understand either of the two possible ways in which the interface could be created. My normal course of action would be to study the SRM code, which is quite extensive and complicated. This would allow me to gain a better understanding of the different ways an interface could be developed and to create a design that could later prove useful. However, time limitations and insufficient documentation for the SRM code made this impossible.

Chapter 8

Summary and Conclusions

8.1 Summary of the MSc Project

The MSc project had two major parts. The first one involved becoming familiar with the software and middleware tools currently used by the high energy physics community to apply Grid technologies to their discipline. For that purpose, I studied the documentation of LCG-2 and LCFG and installed the latter on a workstation to better understand the process. I also participated in the activities of the Edinburgh Particle Physics group, including the installation and configuration of LCG-2 using LCFG on the ScotGrid hardware, the use of the Ganglia distributed monitoring system for the ScotGrid front end and the attendance at the GridPP meeting at CERN. I feel that this first part of the project was very beneficial for me, substantially adding to my knowledge of working with Grid middleware. I also feel that, by doing this work, I gained a much improved understanding of how clusters of computers can be set up and administered.

The second goal of the project was to create an interface between two pieces of software used for Storage Management, the SDSC SRB and the LBNL SRM. To that end, I read what documentation was available for those two tools and tried to create working installations on a University workstation. For SRB in particular, I also followed a course given at the University of Edinburgh by one of the members of the SDSC development team. Although the effort I put into this task was significant, time limitations, as well as other problems encountered, made it impossible for me to complete. A more detailed account of the reasons for this will be provided in the following section. However, I believe that by working on the second part of the project, I still gained useful experience. I plan to keep working in the same field in the future and therefore I think that all I learnt will contribute to making my future attempts more successful.

8.2 Post Mortem

8.2.1 General Issues

Before listing the problems that occurred unexpectedly, most of which affected specific aspects or parts of the project, it should be noted that my own lack of experience in the type of work that the dissertation involved was perhaps the most important hindrance. This is true for both parts of the project. Grid tools are complicated software, far from trivial in their use. Furthermore, they are not particularly stable and their documentation is often insufficient. I believe that working with such software and learning its use requires the ability to critically view the documentation, understand what the important concepts and ideas are and focus the learning effort on those. This ability comes with experience, which I did not possess. As a result, I often found myself spending a long time trying to understand details which, as I later realised, were of secondary importance. Another issue which experience would have helped me with was the reading of the SRM code. I believe that if I had been involved in a collaborative software development project before, as opposed to working on my own code, I might have felt more comfortable reading the SRM code and finding out how it works. As it turned out, I was intimidated by the size of the code and the number of files and failed to grasp the main concepts, the data flow and ultimately the code architecture.

One more general problem that I was faced with was that the project was not well defined from the start. The general direction was more or less known, but there was no concrete set of tasks to carry out and end goals to achieve. This affected my work considerably, as much of the time had to be dedicated to finding useful things to do, which significantly decreased the time available to me for doing them.

8.2.2 Specific Issues

Apart from situations that affected the project as a whole, there were also incidents or facts that appeared along the way and caused delay in specific ways. Those will be listed here.

A first issue was the unexpected delay of the LCG-2 deployment in Edinburgh. This was mainly due to hardware related issues that required that equipment be relocated, repaired or replaced. As a result, the installation of Ganglia was complete no sooner than two weeks before the project deadline. It was originally intended that the first part of the project would be complete in July, but this situation meant that it had to be left aside for more than a month and completed at the expense of the second part.

Another very significant delay was introduced towards the end of July, just as the first part of the project was nearing completion. Due to a family emergency I was forced to spend two weeks away from Edinburgh. Apart from the obvious delay of two weeks, this also took my attention away from the project and made it difficult to get back on schedule.

As a result of both the previously mentioned delay and the fact that it had not been identified as a project aim from the start, the second part of the project was only assigned to me with less than five weeks before the deadline. This, combined with the first part of the project not being complete, decreased the time I could spend working on the SRM-SRB interface and definitely made the completion of that task more unlikely.

Significant problems in the second part of the project were caused by the lack of documentation for the LBNL SRM. In contrast to SRB, which is a very well documented project, there are neither installation instructions nor any document with information for developers. The only possible way to understand the architecture of SRM is by directly reading the source code. However, the code is of substantial size, contains very few comments and includes many pieces of code that have been commented out without any explanation provided.

Other problems with SRM were caused by its dependencies on other programs and incompatibility with Red Hat Linux. It has already been explained in chapter 7 that SRM requires a specific version of the GCC compiler. Apart from that, it also needs ORBacus and the GT2 core. Those issues introduced many delays in the installation process. At one point I even had to correct a bug in a Globus header that was included in DRM code. In the end, more time was spent trying to install the software required by SRM than SRM itself.

8.2.3 Final thoughts

Overall, I am rather satisfied with how the project evolved, given the circumstances I was faced with during its course. The project was interesting and challenging enough and the effort I made working on it has paid off. I believe I have gained valuable knowledge and experience and will be able to produce much better results working on similar projects in the future.

Were I to repeat the dissertation, there are some things that I would do differently. Perhaps the most important would be to make sure that the project was strictly defined from the start, so that I had a clear direction to go towards. Furthermore, I would spend no time learning things that, in the end, did not prove useful. Another thing that might have helped is asking the SRM development team to produce a description of the code architecture from early on. Finally, I could also have created a Linux installation on my personal computer, so that I could try installing software without suffering from the lack of root access.

Possible future work relevant to this dissertation would obviously involve the design and implementation of the interface between SRM and SRB, since the project was not successful in that respect. Another idea would be the customisation of Ganglia specifically for the ScotGrid hardware. Ganglia is largely customisable and could be made to monitor application specific metrics or certain aspects of the hardware that could be considered useful by the system administrator. Finally, any Grid related work always requires additional work in order to remain up to date and to be of any use, as specifications and requirements are rapidly changing.

Appendix A

leak.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>        /* sleep() */

    #define SIZE 1000000       /* doubles allocated per iteration */
    #define ITER 1000          /* maximum number of allocations */
    #define TIME_TO_SLEEP 5    /* seconds between allocations */
    #define MB_LIMIT 1500      /* stop once this many MB are reported */

    int main()
    {
        int i, j;
        long int mb_allocated = 0;
        double *jumping_pointer[ITER];
        clock_t start, end;    /* timing variables, not used for reporting */
        double elapsed;

        for (i = 0; (mb_allocated < MB_LIMIT) && (i < ITER); i++)
        {
            start = clock();

            /* Allocate one more block; it is never freed (the "leak") */
            jumping_pointer[i] =
                (double *)malloc(sizeof(double) * SIZE);

            if (jumping_pointer[i] == NULL)
            {
                printf("Out of memory\n");
                printf("Allocated memory for %d doubles\n", i * SIZE);
                exit(1);
            }

            /* mb_allocated is an integer, so each block of about 7.63 MB
               is counted as 7 MB and the printed total understates the
               amount of memory really allocated */
            mb_allocated +=
                ((double)(SIZE * ((int)sizeof(double)))) / 1024 / 1024;
            printf("allocated %ld MB so far\n", mb_allocated);

            sleep(TIME_TO_SLEEP);

            /* Write to every element so that the block is actually used */
            for (j = 0; j < SIZE; j++)
            {
                jumping_pointer[i][j] = (double) i;
            }
        }

        return 0;
    }

Appendix B

Index of Acronyms

CA      Certification Authority
CE      Computing Element
CLI     Command Line Interface
EDG     European Data Grid
GG      Grid Gate
GIIS    Grid Index Information Service
GPT     Grid Packaging Tools
GRAM    Grid Resource Allocation Manager
GRIS    Grid Resource Information Service
GSI     Grid Security Infrastructure
GT      Globus Toolkit
GUI     Graphical User Interface
GUID    Grid Unique Identifier
HRM     Hierarchical Resource Manager
JDL     Job Description Language
LCFG    Local Configuration System
LCFGng  LCFG ’next generation’
LCG     LHC Computing Grid
LDAP    Lightweight Directory Access Protocol
LFN     Logical File Name
MCAT    Metadata Catalogue
MPI     Message Passing Interface
RLS     Replica Location Service
RMC     Replica Metadata Catalogue
RMS     Replica Management System
RRD     Round Robin Database
SDSC    San Diego Supercomputer Center
SE      Storage Element
SRB     Storage Resource Broker
SRM     Storage Resource Manager
SURL    Storage URL
TRM     Tape Resource Manager
TURL    Transport URL
VO      Virtual Organisation
XDR     eXternal Data Representation

Bibliography

