Status of the Grid Computing for the ALICE Experiment in the Czech Republic


Dagmar Adamová, Nuclear Physics Institute AS CR Řež/Prague

1 Introduction

ALICE (A Large Ion Collider Experiment) is one of the four experiments at the CERN Large Hadron Collider (LHC) [1], designed to study the physics of nucleus-nucleus collisions at LHC energies (Pb+Pb collisions at √sNN = 5.5 TeV, proton+proton collisions at √s = 14 TeV). The mission of ALICE is the exploration of strongly interacting matter at extreme energy densities, where the formation of the quark-gluon plasma [2] is expected. The project currently encompasses over 1000 scientists from about 30 countries.

The LHC, which briefly circulated proton beams in September 2008 and is expected to restart in spring 2009, will produce roughly 15 PetaBytes (PB) (= 15 million GigaBytes (GB)) of data during one standard data-taking year. These data need to be made available for analysis to some 5000 scientists in about 500 research institutes and universities around the globe who participate in the LHC experiments. In addition, all the data need to remain accessible over the estimated 15-year lifetime of the LHC.

It is impossible to centralize the resources necessary for the storage and processing of such an enormous amount of data at one location near the experiments, as in the traditional approach. Instead, a model of globally distributed resources for the storage and analysis of the data from the four LHC experiments was adopted: a computing Grid. The development, building and maintenance of the corresponding distributed computing infrastructure is the mission of the LHC Computing Grid project (LCG) [3].

The ALICE computing model [4] relies on the existence of such a functional computing Grid infrastructure, allowing efficient access to the resources. In this article, we first present some basic features of the ALICE Grid computing model and then summarize the contribution of the Czech ALICE group to the ALICE Grid computing project.

2 Basic features of the ALICE Grid computing model

2.1 Grid basics

According to the original definition, "a computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" [5]. A Grid allows its constituent resources to be used in a coordinated fashion to deliver qualities of service relating, for example, to availability, security, reliability, performance, throughput, response time and/or co-allocation of multiple resource types, so that complex user demands can be met and the performance of the combined system is significantly greater than the sum of its parts. Among the numerous computing grid systems, scientific grids like the LCG are characterized by the integration of resources from multiple institutions, each with its own policies and mechanisms, using open, general-purpose protocols to manage the sharing of resources and to provide high-quality services. As a result, the LCG system in particular will guarantee that every physicist has equal access to the data and computing resources independent of geographical location and time zone, with round-the-clock monitoring and support.

The software that organizes and integrates the resources in a Grid and manages, e.g., data distribution and access, job submission, and user authentication and authorization is known as the Grid middleware (Fig. 1). Middleware developed by different Grid projects can differ considerably even when the functionality is similar. It is built of layered, interacting packages and may be shaped using different managers. The middleware components include, e.g., the Grid Access Service (GAS), Grid Resource Information Service (GRIS), Resource Broker (RB), Job Submission Service (JSS), Workload Management System (WMS), Computing Element (CE), Storage Element (SE) and so on. For the supply of much of the middleware, the LCG project depends upon several other projects including Globus, Condor, the Virtual Data Toolkit and the gLite toolkit [3].

Figure 1: An example of the architecture of a Grid middleware: the ALICE-specific middleware AliEn [6], discussed in detail later in the text. The figure shows the web of AliEn services and their packaging as RPMs (AliEn-Base, AliEn-SE, AliEn-Client, AliEn-CE, AliEn-Server, AliEn-Portal): the Storage and Computing Elements, Cluster and Process Monitors, Logger, Authentication, user API, File Transfer Daemon (FTD), Information Service, Resource Broker and Web Portal.

2.2 ALICE Distributed Computing model

Similar to the other LHC experiments, the ALICE computing model relies upon the assumption that the fundamental Grid services will be deployed at the centers providing resources to ALICE, integrating these centers into the LCG or LCG-interfaced Grid systems. For the distribution of tasks to the different centers, the LCG has adopted a hierarchical structure in which the centers are ranked as so-called Tiers (Fig. 2): Tier-0 is CERN, Tier-1s are the major computing centers with mass storage capability, Tier-2s are smaller regional computing centers, Tier-3s are university departmental computing clusters and Tier-4s are user workstations. The tasks performed by the centers of the different Tier levels are as follows.

Figure 2: Schematic view of the ALICE computing tasks in the framework of the Tier-ranked centers

The Tier-0 center at CERN will provide the permanent storage of the original raw data, and the first-pass reconstruction will be performed there. A second copy of the raw data and additional copies of the reconstructed data will be stored at the Tier-1 centers. These will be responsible for managing the permanent storage of raw, simulated and processed data and for providing resources for re-processing and scheduled analysis. Connections of 10 Gigabit/s (Gb/s) are required from CERN to each Tier-1 to facilitate the required data traffic. The Tier-2 centers provide appropriate computational capacity and storage services for performing Monte Carlo simulations of p+p and Pb+Pb collision events and for end-user analysis. Tier-1 <-> Tier-2 connectivity of at least 1 Gb/s is envisaged, to allow both the upload of Monte Carlo data and the download of data for analysis. As of today, the ALICE Grid project comprises 73 sites spanning four continents (Africa, Asia, Europe, North America), see Fig. 3. It involves 6 Tier-1 centers and 66 Tier-2 centers. Altogether, the resources provided by the centers for ALICE computing amount to about 7000 CPUs and 1.5 PB of distributed storage, a capacity which is expected to double in about half a year.
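As a rough back-of-the-envelope illustration of what these link speeds mean in practice, the short Python sketch below computes idealized transfer times; protocol and storage overheads are ignored, and the 100 TB and 1 TB example volumes are arbitrary choices for illustration, not figures from the text.

```python
# Idealized transfer times over the link speeds quoted above; protocol and
# storage overheads are ignored, and the data volumes are example values.

def transfer_time_hours(data_tb: float, link_gbps: float) -> float:
    """Time (hours) to move data_tb terabytes over a link_gbps gigabit/s link."""
    bits = data_tb * 1e12 * 8              # TB -> bits (decimal units)
    return bits / (link_gbps * 1e9) / 3600.0

# Raw-data replication from CERN to a Tier-1 over a 10 Gb/s link:
print(f"100 TB over 10 Gb/s: {transfer_time_hours(100, 10):5.1f} h")   # ~22 h
# Monte Carlo upload from a Tier-2 over a 1 Gb/s link:
print(f"  1 TB over  1 Gb/s: {transfer_time_hours(1, 1):5.1f} h")      # ~2.2 h
```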

Figure 3: The ALICE centers in Europe. Blue lines represent ongoing data transfers.

All the centers are powered by the ALICE-developed middleware AliEn (ALIce ENvironment) [6], the first working prototype of which was completed in 2002. It provides the entry point into the LCG (through an AliEn-LCG interface) and also into other Grid systems worldwide. AliEn consists of a number of components and services, with special emphasis given from the beginning to the file catalogue, job submission and control, application software management and end-user analysis (a minimal job-submission sketch is given after the list below). The main tasks of the ALICE Grid project include:
- Physics Data Challenges, discussed in detail later on;
- management of raw data (registration of the data at the Mass Storage System (MSS) at CERN and on the Grid, replication of the data to the Tier-1s and quasi-online reconstruction at Tier-0 and the Tier-1s);
- fast analysis of detector calibration data (done immediately after the data taking, crucial for feedback to detector experts);
- user analysis and fast analysis of simulated and raw data for First Physics predictions.
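To make the job submission and control workflow more concrete, here is a minimal sketch of how a single production job might be handed to AliEn from a site. It assumes that an AliEn command-line wrapper named alien_submit is available in the environment, and the JDL (Job Description Language) fields and values shown are illustrative placeholders, not a verified production JDL.

```python
"""Minimal sketch of submitting one simulation job to the ALICE Grid.

Assumptions: an `alien_submit` command-line wrapper exists on the path, and
the JDL fields/values below are illustrative placeholders only.
"""

import subprocess
import tempfile

# A simplified job description: the executable to run, the software package
# it needs, and where the output files should be registered in the catalogue.
JDL = """\
Executable = "aliroot_sim.sh";
Packages   = { "VO_ALICE@AliRoot::v4-13-Rev-01" };
OutputDir  = "/alice/sim/2008/example_cycle/run_001";
OutputFile = { "galice.root", "sim.log" };
"""

def submit(jdl_text: str) -> str:
    """Write the JDL to a temporary file and pass it to alien_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl_text)
        jdl_path = f.name
    result = subprocess.run(["alien_submit", jdl_path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()   # normally reports the assigned Grid job id

if __name__ == "__main__":
    print(submit(JDL))
```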

2.3 Physics Data Challenges

Similar to the other LHC experiments, ALICE has been extensively testing and validating its computing model and Grid infrastructure during yearly Physics Data Challenges (PDC) of increasing size and complexity. A Data Challenge comprises the simulation of data (events from the detector), followed by the processing of these data using the software and computing infrastructure that will be used for the real data [3]. This production of Monte Carlo (MC) simulated events of p+p and Pb+Pb collisions is going on at all of the ALICE sites, with the conditions and event characteristics configured as requested by the Physics Working Groups [7]. The generated data files are migrated to CERN and are available for further processing. All the relevant information on the events generated during the various production cycles is published on the ALICE Grid monitoring web pages [8].

The first distributed MC productions date back to 2003, followed by the PDCs of 2004 and 2005, which each spanned a couple of months. Since April 2006, the PDC has been running in a quasi-permanent mode, interrupted only during hardware or software upgrade downtimes no longer than a couple of days. The amount of data produced during the Data Challenges is quite remarkable: during the quasi-permanent PDC mode from April 2006 to July 2008, roughly 330 million collision events with various physics content were produced, according to the requests of the Physics Working Groups. Since July 2008, the Data Challenge production has been switched to the MC simulation and reconstruction of p+p collision events for the predictions of First Physics upon the LHC startup.
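To put the quoted volume into perspective, a simple average over the roughly 27 months between April 2006 and July 2008 (my own arithmetic, based only on the figures above) gives the sustained production rate:

```python
# Average production rate implied by the ~330 million events produced in the
# quasi-permanent PDC mode between April 2006 and July 2008 (~27 months).
events = 330e6
months = 27
seconds = months * 30 * 86400          # rough month length of 30 days

print(f"{events / months / 1e6:.0f} million events per month on average")
print(f"{events / seconds:.1f} events per second, sustained")
```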

3 ALICE Grid computing in Prague

The Czech Republic has been actively participating in the ALICE Grid project ever since 2003, the entry point into the ALICE Grid being the computer farm Golias [9] in Prague. The availability of the farm for ALICE has been ensured by the Czech ALICE group's investments into the farm hardware and by providing the computing services necessary for processing the ALICE production and analysis jobs at the farm.

3.1 The Czech LCG Tier-2 center Golias

The computing center Golias, located in Prague at the Institute of Physics AS CR, is the biggest site in the Czech Republic providing computing and storage services for particle physics experiments (apart from ALICE, also the LHC experiment ATLAS, the D0 experiment at Fermilab [10] and some others). The computing and storage resources (about 450 CPUs and 50 TeraBytes (TB) of disk space) are gradually being increased to meet the requirements of the experiments. The farm has excellent network connectivity to other institutions worldwide (1 or 10 Gb/s) and is integrated through the installed middleware components into the LCG. Since 2005 it has been a certified Tier-2 center of the LCG project, and in April 2008 the Czech Republic, represented by the farm Golias, signed the Memorandum of Understanding (MoU) of the Worldwide LHC Computing Grid Collaboration (WLCG) [11].

Although it is only a mid-size center providing almost half of its capacity to the D0 project, Golias' contributions to the ALICE and ATLAS Data Challenges have been quite respectable. Golias processed 4-7% of the jobs submitted by the ALICE experiment in the Data Challenges of 2004-2006 and about 2% in 2007, and similarly for ATLAS.

3.2 ALICE Grid computing at Golias

Grid computing for ALICE in the Czech Republic has been performed at the farm Golias ever since its startup in 2003, when the ALICE middleware AliEn was installed and a large number of simulation jobs for the Physics Performance Report [1] was completed at the farm. The integration into the ALICE Grid system is currently provided by a combination of LCG and AliEn middleware components. One member of the local group is appointed as the local ALICE production manager, responsible for the production of simulated data and for the maintenance and upgrades of the AliEn middleware.

Although ALICE is officially entitled to use only about 10% of the farm CPUs (less than 10% before December 2007), the amount of simulated data produced at the farm during the ALICE PDCs has considerably exceeded these limits. The Czech group has participated in all the ALICE PDCs so far, and the number of production jobs completed at the farm has reached 4-7% of the overall ALICE production, while the resources officially available for ALICE represent less than 1% of the whole project. In 2007-2008 the relative contribution from the Czech Republic decreased to about 1.5%, due to the increasing number of participating sites and the corresponding scale-up of resources. During PDC-2006, almost 70 000 jobs were successfully completed at the farm, representing about 5% of the complete production. During 2007-2008, the number of successfully completed jobs was more than 100 000, which amounts to about 1.5% of the whole production; cf. Fig. 4 and the simple check below.
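The scale of this over-delivery can be read directly off the numbers above; the short calculation below (my own arithmetic, not an official accounting) shows the total ALICE production implied by the Golias job counts and the factor by which the delivered share exceeded the roughly 1% of resources nominally available.

```python
# Totals and over-delivery factors implied by the job counts quoted above.
golias_2006, share_2006 = 70_000, 0.05     # PDC-2006: ~5% of the production
golias_0708, share_0708 = 100_000, 0.015   # 2007-2008: ~1.5%
nominal_share = 0.01                       # <1% of ALICE resources officially

print(f"Implied ALICE total, PDC-2006:  {golias_2006 / share_2006:,.0f} jobs")
print(f"Implied ALICE total, 2007-2008: {golias_0708 / share_0708:,.0f} jobs")
print(f"Delivered vs. nominal share:    "
      f"{share_2006 / nominal_share:.0f}x (2006), "
      f"{share_0708 / nominal_share:.1f}x (2007-2008)")
```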

Figure 4: Relative contributions of the ALICE centers during the PDC 2007.

This contribution was achieved by permanent supervision of the performance of the production jobs, steady job submission and prompt solution of arising problems. The problems include, e.g.:
- bugs in the ALICE software AliRoot for simulation, reconstruction and analysis [12,13];
- problems with the download of input files for the simulation, with access to the conditions data files during the reconstruction, and with the upload of the output files through the Grid;
- malfunction or breakdown of any of the middleware services: e.g. problems with the Resource Brokers (the middleware component which manages resource allocation and job submission scheduling), with the creation of various types of proxies (certificates needed to use the Grid), and others;
- failures of the local services: batch system disorders, hardware defects, connectivity problems, etc.

Regardless of the various kinds of production show-stoppers, steady job submission and usage of all available free CPUs resulted in a maximum number of simultaneously running jobs well over 100 (cf. Fig. 5); officially, we were entitled to use 25/48 CPUs in 2007. The farm's batch system has a scheduler which allows the experiments to use more CPUs than they are officially allocated, in order to keep the computing nodes job-busy most of the time. In addition, through the systematic inspection of the log files produced by the jobs, which is necessary for the production monitoring, the local ALICE group has helped in the debugging and validation of the ALICE software AliRoot; a sketch of such a log check is shown below.
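As an illustration of this kind of routine log inspection, the sketch below scans the output files of finished jobs for a few failure signatures and counts them by category. The directory layout, the file names and the error strings are hypothetical placeholders chosen for the example, not actual AliEn or AliRoot messages.

```python
# Sketch of a log-file check: scan the stdout files left by finished
# production jobs for known failure signatures and count them per category.
# Paths and signature strings below are hypothetical examples.

from pathlib import Path
from collections import Counter

FAILURE_SIGNATURES = {
    "input download": "Error downloading input",
    "conditions data": "Could not access conditions data",
    "output upload":   "Error registering output",
    "aliroot crash":   "segmentation violation",
}

def classify_logs(log_dir: str) -> Counter:
    """Count failure categories found in <log_dir>/<job_id>/stdout files."""
    counts: Counter = Counter()
    for log in Path(log_dir).glob("*/stdout"):
        text = log.read_text(errors="ignore")
        for category, signature in FAILURE_SIGNATURES.items():
            if signature in text:
                counts[category] += 1
    return counts

if __name__ == "__main__":
    for category, n in classify_logs("/var/alien/jobs").most_common():
        print(f"{category:18s} {n}")
```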

Figure 5: Active ALICE jobs on the farm Golias in 2007. The lower part shows the number of queued jobs, the upper part the number of running jobs.

3.3 The ALICE Grid Storage

Building an efficient distributed storage system covering the needs of the massive MC simulation productions as well as of the raw data processing has been a subject of intensive development in the ALICE Grid project during the last 3-4 years. Currently, ALICE operates 38 storage endpoints distributed over 21 sites [14], providing 4 different flavours of storage solutions with a total capacity of about 1.5 PB. The Tier-2 centers are obliged to provide disk-based storage, which is critical especially for analysis. Tier-2s are recommended to follow the rule of providing 1 TB of disk storage per 2-3 CPUs (see the estimate below). The Czech group has been building a storage cluster dedicated to the ALICE Grid for about 2 years. As a result, two Storage Elements [14] are available for ALICE at the Nuclear Physics Institute in Řež near Prague, acting as Close Storage Elements for the ALICE Grid jobs at the farm Golias, with a total capacity of 21 TB. Due to the excellent network connectivity of this storage cluster to Golias, these Storage Elements are considered a component of the Prague WLCG Tier-2 center.
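As a rough cross-check (my own arithmetic, not a statement from the computing model), applying the 1 TB per 2-3 CPUs rule to the roughly 10% share of the ~450 Golias CPUs nominally available to ALICE gives a disk requirement close to the 21 TB actually provided at the Řež Storage Elements:

```python
# The Tier-2 provisioning rule (about 1 TB of disk per 2-3 CPUs) applied to
# the ALICE share of the Golias farm, using the figures quoted in the text.
alice_cpus = 0.10 * 450                    # ~10% of the ~450 Golias CPUs
disk_low  = alice_cpus / 3.0               # 1 TB per 3 CPUs
disk_high = alice_cpus / 2.0               # 1 TB per 2 CPUs

print(f"Recommended ALICE disk: {disk_low:.0f}-{disk_high:.0f} TB")
# -> roughly 15-22 TB, consistent with the 21 TB provided at the Rez SEs.
```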

4 Summary

In the present article, the basic features of the ALICE Grid computing model were described and the involvement of the Czech Republic in this project was characterized. The Czech Republic has been an active participant in the ALICE Grid project ever since 2003. The entry point into the ALICE Grid has been the computer farm Golias in Prague, a mid-sized Tier-2 center fully integrated into the LCG environment, delivering computing services to particle physics experiments including ALICE.

Although the local ALICE group officially controls less than 1% of the overall ALICE computing resources, it has been delivering computing performance often exceeding this share: 4-7% of the overall production during the Data Challenges of 2004-2006 and up to 2% in 2007. This performance is delivered by a team of essentially 5 people: one local ALICE production manager and 4 system administrators of the farm Golias.

Our performance during the ALICE Data Challenges of 2004-2008 has established the Czech Republic as a Tier-2 center in the ALICE Grid project that delivers reliable services. In accordance with the Tier hierarchy, this center will continue the production of Monte Carlo simulated data and the processing of user analysis jobs also after the operation of the LHC is resumed in spring 2009. To deliver the computing performance pledged by the Czech Republic, we are prepared to perform regular upgrades of the hardware resources dedicated to ALICE according to the WLCG MoU, and to maintain the current high level of the computing services.

Acknowledgements

I highly appreciate the critical comments provided by Latchezar Betev, the coordinator of the ALICE Grid project (CERN), and by my colleagues from the Center for Physics of Ultra-relativistic Heavy-ion Collisions and from the Institute of Physics AS CR in Prague. The work was supported by the MSMT CR contracts No. 1P04LA211 and LC 07048.

REFERENCES

[1] ALICE Collaboration: ALICE: Physics Performance Report, Volume I, J. Phys. G30 (2004) 1517; Volume II, J. Phys. G32 (2006) 1295.
[2] Proc. Quark Matter 2006, J. Phys. G34 (2007) S173; Proc. Quark Matter 2008, to be published: http://www.veccal.ernet.in/qm2008.html.
[3] LHC Computing Grid: Technical Design Report, http://lcg.web.cern.ch/LCG/tdr/.
[4] ALICE Experiment Computing TDR, http://aliceinfo.cern.ch/Collaboration/Documents/TDR/Computing.html.
[5] I. Foster: What Is the Grid? A Three Point Checklist, http://www.gridtoday.com/02/0722/100136.html.
[6] AliEn, http://alien.cern.ch/twiki/bin/view/AliEn/Home.
[7] ALICE Experiment Physics Working Groups, http://aliceinfo.cern.ch/Collaboration/PhysicsWorkingGroups/index.html.
[8] MonALISA Repository for ALICE: Production Cycles, http://pcalimonitor.cern.ch/job_details.jsp.
[9] Regional computing center for particle physics, http://www.particle.cz/farm/index.aspx?lang=en.
[10] The D0 Experiment, http://www-d0.fnal.gov/.
[11] Worldwide LHC Computing Grid, http://lcg.web.cern.ch/LCG/overview.htm.
[12] The AliRoot Primer, http://aliceinfo.cern.ch/export/download/OfflineDownload/OfflineBible.pdf.
[13] R. Brun et al., Computing in ALICE, Nucl. Instr. Meth. A502 (2003) 339-346.
[14] ALICE Storage Elements, http://pcalimonitor.cern.ch/stats?page=SE/tabl.
