Specifications for DØ Regional Analysis Centers


I. Bertram, R. Brock, F. Filthaut, L. Lueking, P. Mattig, M. Narain, P. Lebrun, B. Thooris, J. Yu, C. Zeitnitz

Abstract

The expected size of the data set at the end of Run IIa is over 400 TB. This immense amount of data poses issues for sharing data and for carrying out expeditious data analyses. In addition, the international character of the DØ experiment demands that the data be distributed throughout the collaboration. The data must be stored permanently and must be reprocessable throughout the distributed environment. It is therefore necessary to construct regional analysis centers that can store large data sets and can process data should there be any need for reprocessing. This document presents the specifications for such regional centers, including the data characteristics, the requirements for computing infrastructure, and the services that these centers must provide. It also presents recommended sites to be established as regional analysis centers.

1. Introduction (*Jae, Patrice)

The current size of a typical event from the DØ detector is 0.25 Megabytes (MB). The maximum output rate from the online DAQ system is 50 Hz in Run IIa. This rate and the typical event size amount to a maximum throughput of 12.5 MB/sec. At the maximum output rate, the total number of events will reach 1.0×10^9 by the end of the expected Run IIa. The fully optimized reconstruction speed is expected to be 10 sec/event, estimated for a machine with an 800 MHz CPU clock speed (about 40 SpecInt95). This results in 1.0×10^10 seconds of CPU time to complete one reconstruction cycle, which means it would take about one full year to process the entire Run IIa data set using 500 computers, the expected number of machines comprising the offline reconstruction farm. Moreover, transferring the resulting 400 TB data set to a single site would take one to two months over a dedicated gigabit network. It is therefore inconceivable and impractical to transfer the data to all 77 collaborating institutions so that the more than 600 physicists can access it for analyses. These issues become worse as the Tevatron delivers luminosity at its expected rate, resulting in a data set a factor of eight or more larger by the end of the entire Run II program. Given these issues related to the immense amount of data, the centralized clusters of computers, although they can provide a great deal of service, must be complemented by a model that distributes the data and provides the capability for analysis remote from the central location, that is, Fermilab. In addition, one of the important ways in which remote institutions can contribute significantly is through software and analysis activities that do not require presence at the site of the experiment.

The operating assumptions for the specification of the regional analysis centers are covered in section 2. The motivation for and a proposed architecture of the DØ Remote Analysis Model (DØRAM) are discussed in sections 3 and 4. Sections 5 and 6 cover the services to be provided by an RAC and the proposed data tiers to be stored at the RACs. Section 7 describes the requirements for an
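The throughput and processing-time figures above follow from simple arithmetic. The short Python sketch below reproduces them; it is only a back-of-the-envelope check using the numbers quoted in this section.

    # Back-of-the-envelope check of the Run IIa data-volume and CPU estimates
    # quoted in the Introduction (all inputs are figures given in the text).
    event_size_mb = 0.25      # MB per raw event
    daq_rate_hz   = 50        # maximum Run IIa online output rate
    n_events      = 1.0e9     # events expected by the end of Run IIa
    reco_time_s   = 10.0      # optimized reconstruction time per event
    farm_cpus     = 500       # expected size of the offline reconstruction farm

    throughput_mb_s = event_size_mb * daq_rate_hz          # 12.5 MB/s
    total_reco_s    = n_events * reco_time_s               # 1.0e10 s of CPU time
    farm_days       = total_reco_s / farm_cpus / 86400.0   # ~230 days of pure CPU time;
                                                           # close to a year at realistic duty cycle
    transfer_days   = 400e12 * 8 / 1e9 / 86400.0           # 400 TB over a 1 Gbit/s link: ~37 days

    print(f"peak throughput      : {throughput_mb_s:.1f} MB/s")
    print(f"reconstruction pass  : {farm_days:.0f} days on {farm_cpus} CPUs")
    print(f"400 TB over 1 Gbit/s : {transfer_days:.0f} days")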

RAC in terms of both hardware and software support. The recommended sites and the implementation time scale are discussed in sections 8 and 9.

2. Operating Assumptions (*Jae, Lee)

a. The targeted output rates from the online DAQ system are 50 Hz and 75 Hz DC for Run IIa and Run IIb, respectively.
b. The event sizes for the various data tiers are listed in Table 1.

Table 1 Event sizes and expected data volumes for the various data tiers.

                        Event Size      Run IIa (2 yrs)    Run IIb (4 yrs)
  Average Event Rate                    22 Hz (peak 50)    75 Hz
  Number of Events                      1.0×10^9           1.0×10^10
  Raw Data              300 kByte/ev    300 TB             2.8 PB
  DST                   100 kByte/ev    100 TB             930 TB
  Thumbnail             10 kByte/ev     10 TB              93 TB
  Monte Carlo (35 Hz)                   2.4×10^8 events    1.0×10^9 events
  Monte Carlo           500 kByte/ev    120 TB             500 TB

c. Monthly luminosity profile expected during 2002 and 2003-2008:

  Dates         Instantaneous Luminosity (cm^-2 s^-1)    Integrated Luminosity (pb^-1)
  1/1/02        1E31                                     0
  3/1/02        2E31                                     16
  6/1/02        4E31                                     73
  10/15/02      6E31                                     178
  12/31/02      9E31                                     307
  Mid 2003      20E31
  Early 2004    Shift to 132 ns operation                2,000 (end of 2004)
  2005          Shutdown for Run IIb
  2005-2008     50E31                                    4,000/yr (approx. 15,000 by end 2008)

3. Motivation for Regional Analysis Centers and Use Cases (*Chip, Jae)

The basic unit of progress by remote institutions is a successful measurement made from within a perhaps unique local environment, without the necessity of full data/MC storage capability or of retrieving samples of the entire data or database set. This is a basic requirement for off-shore institutions and a desirable goal for many of the U.S. groups. The RACs can act as intermediary "rest stops" for the most intense data tiers and perhaps some projection of the databases required for analysis. The purpose of this section is to describe real analyses in terms of what a user would actually do and how that user would rely on the RAC and the FNAL central site. Two kinds of analyses are described: one relies primarily on local ROOT tools and storage of only rootuples at the user's home site (the "workstation"); the other requires interaction with at least DST-level data.

a. W Cross Section Determination

The assumptions for this example are:

• The primary workstation analysis is at the ROOT level.
• The analysis may include Thumbnail (TMB) files resident on the workstation.
• The offsite institution is not a SAM site.
• The RAC with which it is associated is a SAM site.
• The MC calculations are initiated at farms which are SAM sites.
• Complete Thumbnail (TMB) file sets exist at the RACs.

The basic steps that are required in order to make the measurement are straightforward: count the number of corrected events with W bosons above background and normalize to the luminosity. In order to do this within the assumptions above, a strawman chain of events has been envisioned as an example. Figure 1 shows the relevant steps in terms of requests, movements of data, and calculations that would either be choreographed by the user, or actually carried out by that user at the home institution.

Figure 1 Simple diagram of the chain of events for the W cross section analysis.

The various geographical locations (GL) are shown as the colored areas: FNAL (brown, left), at least one RAC (green, next), the user's workstation (pink, next), and at least one MC farm (yellow, right). The vertical purple line in the user GL represents the user's workstation, and the logical (and perhaps temporal) progression of events proceeds roughly from top to bottom along that line.

Some actions are automatic: the production of a complete set of TMB files from FNAL to the RAC is such a process. Other actions are initiated by the user. As represented here, requests for some remote action are blue lines with arrows from the workstation to some processor connected to a storage medium. Purple lines represent the creation or splitting of a data set and then the copying of that set. Dashed lines represent a copy, usually a replication, from one GL to another. A black line without an arrow represents a calculation.

The progression of events is (could be) as follows:

i) The data accumulate at the RAC as TMB files.
ii) The user initiates a request, perhaps by trigger, to the RAC for a W sample and a background sample of TMB records.
    (1) These records are replicated to the workstation.
iii) A preliminary analysis of the TMB files is performed at the workstation, leading to rough cuts for sample selection.
    (1) The results of that analysis lead to the ability to select a working dataset for signal and background.
    (2) The presumption is that the measurement will require information which is not available from the TMB data tier alone.
iv) A request to FNAL is initiated for the stripping of DST-level files for both signal and background.
    (1) This is readily done since the TMB records are a subset of the full DST records: hence, the TMB-tuned selection is directly and efficiently applicable to the DST.
    (2) These DST sets are copied temporarily to the RAC.
v) The user analyzes the DST files to produce specialized rootuples which will contain the records not available from the TMB data.
    (1) The produced rootuples are replicated back to the workstation.
    (2) The DSTs can be discarded and readily reproduced if necessary.
vi) The analysis of the original TMB files could also initiate the production of specific Monte Carlo runs at a remote MC farm site.
    (1) This is initiated through SAM, and a cached DST-level file set of signal and backgrounds is produced.
    (2) These MC DST data can be replicated back to the RAC and discarded at the remote MC farm site.
vii) The MC DST data are analyzed and rootuples are produced at the RAC.
    (1) These rootuples are replicated back to the workstation.
viii) The luminosity calculation is initiated after the selections have been made at the workstation.
    (1) The user initiates a set of queries to the FNAL Oracle database.
    (2) This results in a flattened luminosity file set which is replicated to the workstation.
ix) With all of this information at hand, the cross section calculation can proceed.
    (1) Of course it will be necessary to repeat many or all of the preceding steps.

    (2) This could be facilitated through the replay of history, saved out as scripts when the process was first initiated (?)

As can be seen, the user acts as a conductor, initiating requests for data movement among FNAL, the RAC, and an MC farm, coordinating the reduction of DSTs to rootuples, and redoing the steps as necessary, whether because of mistakes or to include new data which may still be coming in periodically.
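The choreography above can also be read as a simple data-movement trace. The sketch below is purely illustrative: the sites and datasets are plain strings, the move() helper is a made-up placeholder rather than a SAM or DØ interface, and the point is only to make the ordering and direction of each transfer in steps i)-ix) explicit.

    # Illustrative walk-through of the W cross-section choreography (steps i-ix).
    # Sites and datasets are plain strings; move() just records each transfer.
    # Nothing here is a real SAM or DØ interface.
    transfers = []

    def move(dataset, src, dst, note=""):
        transfers.append((dataset, src, dst, note))
        print(f"{dataset:18s} {src:11s} -> {dst:11s} {note}")

    move("TMB (full set)",  "FNAL",    "RAC",         "i) automatic replication")
    move("TMB (W sample)",  "RAC",     "workstation", "ii) user request, by trigger")
    move("DST (stripped)",  "FNAL",    "RAC",         "iv) temporary copy, TMB-tuned selection")
    move("rootuples",       "RAC",     "workstation", "v) DSTs then discarded")
    move("MC DST",          "MC farm", "RAC",         "vi) discarded at the farm afterwards")
    move("MC rootuples",    "RAC",     "workstation", "vii)")
    move("luminosity file", "FNAL",    "workstation", "viii) flattened DB query result")
    # ix) the cross-section calculation then proceeds entirely on the workstation.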

The requirements for the RAC, in this example, are a large amount of temporary storage and the ability for an outside user to initiate calculations (e.g., the creation of rootuples). The demand on the database is minimal, and replication is not necessary beyond the FNAL site boundary.

b. Determination of the Electron Energy Scale
c. Inclusive Jet Cross Section Measurement
d. Pick SUSY Candidates

(These use cases are not done yet.)

4. DØ Remote Analysis Model (DØRAM) Architecture (*Jae)

Given the expected volume of the data to be analyzed, it is inconceivable to have the entire data set residing at a single central location such as Fermilab. It is therefore necessary to select a few large collaborating institutions as regional centers and to allocate 10-20% of the entire available data set to each of them. These regional analysis centers must be chosen for their computational capacity, especially their input/output capacity, and for their geographical location within the network, so as to minimize unnecessary network traffic. Each regional center governs a few institutional centers, which act as gateways between the regional data storage facility and the desktop users within a given institution, until such time as a user needs a larger data set that is stored across many regional centers.

Various resource management tasks should be carried out at the regional analysis center level. In other words, when a user needs CPU resources, the server at the institutional center receives the request and gauges the available resources within the institution to see whether the task can be handled using the resources of its own network. If there are not sufficient resources within the institutional network, the request is propagated to the regional center, and the same kind of gauging occurs within the regional network. This process continues until the request reaches the server at the main center (ground zero), at which point priorities are assigned by a procedure determined by the collaboration. Figure 2 shows the proposed architecture of the DØ Remote Analysis Model (DØRAM), involving Regional Analysis Centers (RACs) and the corresponding Institutional and Desktop analysis stations.
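The request-escalation logic described above lends itself to a simple recursive sketch. The code below is illustrative only; the Node class, its fields, and the example capacities are assumptions rather than part of any DØ software. It shows a request climbing from the institutional level to the regional level and finally to the central center when local resources are insufficient.

    # Minimal sketch of DØRAM request escalation: IAC -> RAC -> CAC.
    # The Node class, its fields, and the capacities below are illustrative only.
    class Node:
        def __init__(self, name, free_cpus, parent=None):
            self.name = name
            self.free_cpus = free_cpus
            self.parent = parent          # None for the central center (CAC)

        def handle_request(self, cpus_needed):
            """Serve the request locally if possible, otherwise escalate it."""
            if cpus_needed <= self.free_cpus:
                return f"request served at {self.name}"
            if self.parent is not None:
                return self.parent.handle_request(cpus_needed)
            # At the CAC ("ground zero") priorities are assigned by collaboration policy.
            return f"request queued at {self.name} pending priority assignment"

    cac = Node("CAC (FNAL)", free_cpus=2000)
    rac = Node("RAC (regional)", free_cpus=200, parent=cac)
    iac = Node("IAC (institutional)", free_cpus=20, parent=rac)

    print(iac.handle_request(10))     # handled at the IAC
    print(iac.handle_request(100))    # escalated to the RAC
    print(iac.handle_request(5000))   # escalated all the way to the CAC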

This architecture is designed to minimize unnecessary traffic through the network. The Central Analysis Center (CAC) in this architecture is Fermilab's central analysis system, which stores a copy of the entire data set and all other processed data sets. The CAC also provides the corresponding metadata catalogue services and other database services. The Regional Analysis Centers (RACs) store at least 10-20% of the full data set in

various tiers, as determined in section 6 of this document. Each RAC must act as a mini-CAC within its own region of the network. As discussed above, the RACs provide the services defined in section 5 of this document to the users at the Institutional Analysis Centers (IACs) within their region.

Figure 2 A schematic diagram of the DØ Remote Analysis Model (DØRAM) architecture.

Requests from users within a region should be handled within that region's network, without generating additional network traffic beyond it. Full access to the data set by the users in a region should be approved based on the policy laid out in section 8.

5. Services Provided by the Regional Analysis Centers (*Frank, Patrice, Lee, Iain, Chip)

The aim of this section is to investigate what kind of services one might reasonably expect to be provided by Regional Analysis Centers. As mentioned in the Introduction, they should be able to provide services to the DØ collaboration as a whole. However, the collaboration will also benefit (indirectly, through the decreased load on FNAL resources) from the services that they provide to the Institutional Analysis Centers directly connected to them. The services that could be provided by the Regional Analysis Centers are discussed in some detail below.

a. Code Distribution Services (*Frank, Patrice)

The distribution of code represents a much smaller data volume than any access to data. Nevertheless, redistribution of code can alleviate some of the tasks that are otherwise performed only by FNAL: whenever a given machine is the only source from which to obtain software, any downtime of this machine is likely to present problems to the collaboration as a whole. Having the same code available from a secondary source will increase the overall

efficiency of the DØ software and analysis effort. This holds true even more in the case of institutes with slow links to the outside world: even if such links make it difficult to process substantial amounts of data, it may well be possible to work on algorithm development, and it is then very convenient to be able to do this at the individual institutes. The amount of code that goes with any individual release has been growing steadily over the past years, from some 4 GByte/release in 2000 to some 8 GByte/release at present, and it is to be expected that this will grow further in the near future. Therefore even this represents a non-negligible resource. Also, given that "test" releases currently take place on a weekly basis, this requires a non-trivial amount of administrative work unless a high degree of automation is put in place.

b. Batch Processing Services (*Patrice)

The computing centers involved in D0Race are generally running different batch systems (LSF, BQS, ...), different data storage systems (HPSS, NSTOR, ...) and different authentication policies. Running distributed jobs in such an environment is a challenging goal. It is hoped that this problem will be addressed by the Data Grid projects. The foreseen solution is based on the Globus toolkit for inter-domain management, combined with Condor for intra-domain management:
- authentication via X509 certificates;
- Grid Resource Allocation and Management (GRAM) for job submission;
- the Globus Toolkit's Global Access to Secondary Storage (GASS) for distributed data access.
(Whether this toolkit should be replaced by, or interfaced with, SAM still has to be defined more clearly.) This combination, known as Condor-G, has been demonstrated on various test-beds for "real-life" problems such as CMS Monte Carlo simulations, and is probably the guideline for distributed DØ computation.

c. Data Delivery Services (*Lee)

d. Data Reprocessing Services (*Frank)

[Note: the Data Characteristics section gives a time scale for reprocessing (roughly every six months to 1.5 years); this time scale should be reflected here, along with the need for strongly centralized coordination of massive reprocessing.]

Storing a subset of the DØ collider data at the RACs allows these data to be reprocessed locally. In view of the limited computing resources available at FNAL, and of the improvements that are likely to be made to the algorithms used at present for the reconstruction of the collider data, this is expected to be an essential ingredient of the RACs. A priori it is impossible to predict at what level, and with what frequency, these improvements will take place. Two extreme scenarios can therefore be envisaged: one where all data reprocessing starts from the raw data, and one where reprocessing starts at some point of the reconstruction stage – to be determined for each processing pass, but with the greatest gain if only small parts of the reconstruction have to be redone. Both approaches offer advantages and drawbacks. The second approach seems the most flexible and CPU-friendly. However, it can only work if all the required input information is available. For the DST, a good candidate as a basis for event reprocessing, this is not expected to be the case in the near future.

The most straightforward option is to make each processing pass start from the raw data. If algorithmic improvements occur at relatively low levels of the reconstruction (and this is to be expected for a fairly long time to come), the processing time required is not likely to be substantially more than that needed when starting from the output of any reconstruction stage. For the time being it is therefore the preferred alternative, but everything possible should be done to allow migration to the other alternative. If the data volume is an issue, it should also be questioned (in the long run) whether the reconstruction output is kept at all. Clearly this aspect is addressed automatically by the first of the above-mentioned options. In the second case, several options exist to ensure that calibration of the detector remains possible: ensure that sufficient information is available at the DST level, retain the reconstruction output for some fraction of the data, take special calibration runs, or run calibration programs (quasi) on-line.

e. Database Access Service (*Chip)

DØ has implemented a database architecture employing a central, high-availability database instance at Fermilab and specialized servers that deliver information to clients. This has satisfied the existing needs of the collaboration. The central database is used for two types of information storage within the DØ experiment: 1) file metadata and processing information, and 2) calibration constants and run configuration information. The actual delivery of these data to clients is done through middle-tier server applications called "Database Servers" (DBS). These servers impart several important qualities to the system, including 1) reducing the number of concurrent users of the Oracle database, 2) enhancing the ability to make schema changes to database tables while minimizing the effect on clients, and 3) overall system scalability.

The most important aspect of the three-tier architecture, when considering the deployment of the system to the global DØ collaboration for data analysis, is scalability. Several enhancements are being built into the calibration DBS's to provide the level of operation needed as we move into the era of remote analysis and data processing. First, two levels of caching are being built in: a memory cache and a persistent server-side cache. The memory cache reduces the number of accesses to the central database when large parallel farms are processing events for a limited range of detector runs. The server-side disk cache will allow persistent storage of large ranges of calibration and run configuration information at local processing sites; if the network to FNAL is disrupted, or the central DB at FNAL is unavailable, processing will continue unaffected. The new DBS's are multi-threaded, so many clients can be serviced simultaneously; if no clients are present, the DBS will disconnect from the database. Another feature that will provide scalability is the ability of the DBS's to run in a proxy mode. This will allow us to establish a network of DBS's at remote sites with a configuration similar to the RAC model itself. Each analysis site will have local caching servers which connect to the corresponding DBS's at the RAC's; these, in turn, will connect to DBS's at Fermilab. These servers are instrumented so that we will be able to monitor the usage and detect hot spots in the system, and additional DBS's can be deployed in the heavily loaded areas of the deployment.
It is anticipated that this will satisfy the needs of the collaboration.
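The two-level caching just described, an in-memory cache backed by a persistent server-side cache with misses forwarded to the next tier, can be pictured with the generic sketch below. This is an illustration only, not the DØ DBS code; the class, the file layout, and the example query are assumptions.

    # Generic sketch of a two-level caching proxy: memory cache -> disk cache
    # -> upstream source (e.g. a parent DBS or the central database).
    # Illustrative only; this is not the DØ Database Server implementation.
    import json, os

    class CachingProxy:
        def __init__(self, cache_dir, upstream_fetch):
            self.memory = {}                      # level 1: in-process cache
            self.cache_dir = cache_dir            # level 2: persistent server-side cache
            self.upstream_fetch = upstream_fetch  # callable: parent proxy or central DB query
            os.makedirs(cache_dir, exist_ok=True)

        def _path(self, key):
            return os.path.join(self.cache_dir, f"{key}.json")

        def get(self, key):
            if key in self.memory:                # hit in memory
                return self.memory[key]
            path = self._path(key)
            if os.path.exists(path):              # hit on disk: survives restarts and
                with open(path) as f:             # outages of the central database
                    value = json.load(f)
            else:                                 # miss: go upstream, then cache locally
                value = self.upstream_fetch(key)
                with open(path, "w") as f:
                    json.dump(value, f)
            self.memory[key] = value
            return value

    # Proxy mode: an IAC-level server forwards misses to a RAC-level server,
    # which in turn forwards them to the central database at FNAL.
    central = lambda run: {"run": run, "calib": "constants from the central DB"}
    rac_dbs = CachingProxy("rac_cache", central)
    iac_dbs = CachingProxy("iac_cache", rac_dbs.get)
    print(iac_dbs.get(194210))   # first call walks the full chain; later calls are cached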

There are a few technical issues that will need to be addressed. First, remote installations that are running within Virtual Private Networks, behind switches or firewalls, will require solutions initiated at the operational site. Second, if it is determined that the central database at FNAL is a serious and unacceptable single point of failure for the operation of the overall system, we will consider the deployment of one or more additional databases holding "snapshots" of the central database information. This is a viable solution because these are read-only data, so the synchronization issues of a network of databases do not need to be addressed.

f. MC Production and Processing (*Iain)

MC production is perhaps the simplest service of all because it is essentially self-contained and, unlike actual data from the detector, does not require database access. Currently there are several remote production sites with the following capabilities:

  Site                       CPUs          CPU Type       Disk Store   Mass Storage
  Current sites
    BU                       10
    CCIN2P3                  XX% of 200
    Lancaster                200           750 MHz        1.5 TB       30 TB plus
    Nikhef                   150           750-800 MHz    TB
    Prague                   30            700 MHz
    UTA                      100           750 MHz        0.4 TB
    Total current            440
  Planned future sites
    University of Michigan   100           ?              ?            ?
    Manchester University    64            1.9 GHz        2.5 TB       ?
    Oklahoma                 25% of 256    2 GHz          500 GB
    UCD Ireland
    Total future             ~600

The minimum required capability of these production facilities must be sufficient to meet all of the needs for MC production. The planned rate of MC production over the next five years is 35 Hz. For a standard 750 MHz CPU, the times per event for the various stages of MC processing (WW inclusive, 0.5 events overlaid, plate-level Geant) are:

  Process      Time per event (s)
  Generation   0.5
  Døgstar      186
  Døsim        13.1
  Døreco       12.6
  Analyze      3
  Total        215.7

If we assume that the duty cycle of any given node is 80%, then we require 9.5k 750 MHz (40 SpecInt95?) CPUs to fulfill our MC generation needs alone (a short numerical check is given at the end of this subsection). Currently we have approximately 500 CPUs available, so we clearly need to increase the number of CPUs available for these tasks. If we assume that by 2003 we will have 4 GHz CPUs available instead of the 750 MHz machines of today, then we obtain a factor of five reduction in the number of CPUs required for remote production, to a minimum of 2k. It is

important to note that this number of remote processors does not include a contribution for remote data analysis and reprocessing. If we require more than 10k CPUs in near-constant use to meet our processing requirements, there is an obvious need to maximise our possible resources. This will probably require making use of as many university machines as possible, which only makes sense in terms of fully implemented GRID options within SAM. In order for an RAC to provide MC production services, each remote production facility must be able to run a minimal set of DØ software. The minimal requirements are as follows:
• A fully functional SAM station, capable of storing and retrieving data from any location.
• mc_runjob, the DØ Monte Carlo job submission software.
• The full MC suite of software, including but not limited to: the generators Pythia, Herwig, and Isajet; Døgstar; Døsim; Døreco; Trigsim; and the equivalent of RecoAnalyze (released as tarballs; the full software distribution is not required).
To first order, the production facilities do not need access to any of the databases in order to run the reconstruction of these MC data.
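A quick check of the CPU estimates quoted above; this is a sketch using only the 35 Hz production rate, the 215.7 s/event total from the table, and the assumed 80% duty cycle.

    # CPU count needed for MC production: rate x time-per-event / duty cycle.
    mc_rate_hz    = 35       # planned MC production rate
    time_per_ev_s = 215.7    # total time per WW-inclusive event on a 750 MHz CPU
    duty_cycle    = 0.80     # assumed availability of any given node
    speedup_4ghz  = 4000.0 / 750.0   # naive clock-speed scaling to ~2003 CPUs

    cpus_750mhz = mc_rate_hz * time_per_ev_s / duty_cycle
    cpus_4ghz   = cpus_750mhz / speedup_4ghz

    print(f"750 MHz CPUs needed : {cpus_750mhz:,.0f}")   # ~9.4k, i.e. the ~9.5k quoted above
    print(f"4 GHz CPUs needed   : {cpus_4ghz:,.0f}")     # ~1.8k, consistent with 'a minimum of 2k'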

g. MC Data Storage Services (*Iain)

For MC production we assume that, until the steady-state mode is reached, all MC data are disposable and will need to be fully regenerated after each major software release. For this reason no intermediate stages of the MC production will be saved. The steady state will be reached once we have a stable detector simulation, with all detector components fully simulated and a close-to-final simulation of the magnetic field. Once this stage has been reached, the limiting process is obviously the Geant simulation of the detector. At that point the experiment will wish to accumulate a standard MC event sample for comparison with the data. Since the processing after the Geant simulation takes only about 10% of the total processing time, and since several different luminosity bins must be modelled, we will store the Døgstar output as well as the final output of each event. This output will be reprocessed after each improvement in reconstruction, or to simulate different luminosities.

6. Data Characteristics (*Meena, Peter)

Table 1 lists the various available data formats and their target size per event. It also includes the storage needs for Run IIa and Run IIb.

Below we note the expected data formats and the fractions of these data types that we envision will be made available to the RACs. However, to define the optimal distribution of the data we need more and better estimates of the number of users at the various places (e.g. on both sides of the Atlantic) and of the need for data transfer. Based on such information we should then perform realistic studies of data transfer; simulation tools like MONARC might also be of help. One may think of various schemes depending on the available resources and bandwidth: world-wide distribution without replication; all data available on each continent (some replication); or specific data sets distributed according to the main needs of the various main RAC users (a high level of replication). The solutions realized also depend on the progress of the GRID and related infrastructure.


b. Raw Data

A significant fraction of the raw data must be transferred to these sites and permanently stored in their cache systems for reprocessing. An example of raw data access is reprocessing triggered by the availability of much superior alignment and calibration (e.g. cell or energy scale) constants. These data sets could be divided among the different RACs along specific data streams (see point e below). We will probably be expected to reprocess a fraction of the data once every six months while we are still trying to understand the calibrations and alignments, and once a year when stable operations are achieved.

c. Thumbnail (EDU5)

The thumbnails, which constitute the full data-set statistics, must be kept at these sites. These thumbnail data should be sufficient to provide a significant data set for higher-statistics refinement of analyses. It may be prudent to have the disk storage capacity to store two such sets at a given time. One set may be from a "current" version of the reconstruction, used for mainstream analyses. As the experiment matures, we may find ourselves in a situation where many analyses use the "current" reco version, while other analyses geared towards precision measurements need the most "recent" version of the reprocessing. It will probably be unreasonable to expect all analyses to revert to the most recent version as soon as it becomes available, so we need to plan for a period of overlap in which the "recent" set gradually phases out the "current" set as older analyses are sent out for publication.

d. DST (EDU50)

[Note: I have not seen DSTs come out at the moment; are we keeping any, and do we intend to? We have documents which describe this format, but I am not sure it will ever be implemented.]

This format is expected to contain all information needed to analyze events and to be half the size of the raw data. It will serve two different purposes: the first is to generate root-tuples for physics analysis using pre-selected events, and the second is to allow reprocessing of some limited aspects of the reconstruction before making the root-tuples. This format will have calibrated and packed "pseudo-raw data" from the subdetectors (other than central tracking), in order to provide the flexibility for redoing most aspects of the reconstruction. The full set of DSTs corresponding to the physics streams of interest at the RAC should be available at the RAC site. This is the only way the most recent version of the thumbnails can be produced at the RAC.

e. Specific Data Streams

DØ data will be recorded and reconstructed according to the online streams based on trigger and filter definitions. For a given physics analysis, it may be necessary to transfer only a subset of these data streams. It may be prudent to identify different RACs by their physics interests and to transfer the data sets from the streams corresponding to the physics analysis program at that center. In this case RACs may limit the amount of reconstructed data to a well-defined subset, taking into account, however, a broad enough selection to reliably estimate the backgrounds. This will maximize the usage of the streams and bring a coherence

to the analysis efforts, as well as ease various book-keeping efforts (e.g. luminosity). This requires some pre-studies as to:
• which kind of data should be safely stored;
• what advantage in terms of speed and reliability one expects from such a focused analysis.

It should be emphasized that it is up to the main users of a specific RAC to decide which data they intend to store. This applies to both data and Monte Carlo simulation.

f. MC Data Tier

Due to the large volume of MC data being generated, it is currently impossible to store, or to transfer over the network to FNAL, all of the files output from the various stages of MC production (generator, D0GSTAR, D0SIM, D0RECO). The final root-tuple output from RecoAnalyze is the only file currently transferred from the remote MC production sites to the SAM store at FNAL. In the future, as more MC production sites come online and the MC capacity increases, we will probably continue to store only the root-tuple outputs for MC in SAM. However, at some RACs one could conceive of storing intermediate outputs for smaller sets of samples. These are useful for detector performance, algorithm, and trigger studies, to name a few, where the full statistics are not necessarily needed.

g. Data Replication

In the case of data replication it is of the utmost importance to guarantee that the same reconstruction versions are used for all RACs and that the results are identical (this requires some kind of certification procedure, e.g. on a well-defined subset). This also implies that reprocessing has to be done in a centrally organized manner. No RAC by itself is allowed to replace an official data set with a privately reprocessed version, even if it is an improvement.

In the case of an official reprocessing of the data sets, it should be clearly defined by a central institution which RAC is to reprocess which data set. Reprocessing any event more than once has to be avoided. We need to make explicit the case where many RACs want to hold only some narrowly chosen data streams. Since the RACs must serve the experiment as a whole, there should be service data sets in case reprocessing is necessary. This tier could be DSTs or raw data, and it might not be tightly coupled to the specific interests of the region. It is inconceivable to transfer raw data to the RACs at the end of Run IIb for reprocessing; data at a sufficient level of full reconstruction must already reside at the RACs to expedite the reprocessing.

7. Requirements of Regional Analysis Centers

A Regional Analysis Center must be considered a mini-CAC in the services it provides to IACs and desktop analysis stations (DAS). For this reason, RACs require sufficient computing infrastructure and software support.

h. Location Issues

The distribution of RACs should follow more or less that of the collaborators (in practice, each center

should be able to "serve" sufficiently many people). Locations should preferably be chosen so as to avoid unnecessary overlap in network traffic.

i. Network Bandwidths

Stringent requirements apply to the bandwidth to FNAL (as this is where the centers have to get much of their "initial" data from): at least 1 Gbit/s. High bandwidth to other RACs is desirable, as a fair amount of data also has to be shipped between centers. High bandwidth to the IACs is also desirable (for those institutes that want to be able to look at data "at home"), but perhaps somewhat less urgent if the IACs' needs can also be served by processing at the RACs.
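For orientation, the transfer times implied by such a link can be estimated directly. This is a sketch; the data volumes are taken from Table 1, the nominal 1 Gbit/s figure is the requirement quoted above, and protocol overhead is ignored.

    # Time to ship a data set over a dedicated link at its nominal speed.
    def transfer_days(size_tb, link_gbit_s):
        return size_tb * 1e12 * 8 / (link_gbit_s * 1e9) / 86400.0

    for label, size_tb in [("Run IIa thumbnail set", 10),
                           ("Run IIb thumbnail set", 93),
                           ("10% of Run IIa raw data", 30)]:
        print(f"{label:24s}: {transfer_days(size_tb, 1.0):4.1f} days at 1 Gbit/s")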

j. Storage Cache Size

Sufficient storage is needed to hold 10-20% of the data permanently, and it must be expandable to accommodate the growth of the data: more than 30 TB (300 TB) just for the Run IIa (Run IIb) raw data, and 10 TB (100 TB) for the Run IIa (Run IIb) thumbnail set. The required cache size is computed from the thumbnails plus some fraction of the raw/reco data. Regarding accessibility, the thumbnails will likely be used often, so random access could be preferred.

k. Computer Processing Infrastructure

The RACs must have sufficient computing power to reprocess all of their data within four to six weeks (Run IIa: 2×10^8 events per month at 10 s/event). The CPU resources may differ from one RAC to another, according to the number of IACs in its area. Requests can also come from other RACs.
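For scale, a sketch of the CPU count implied by the figure above, using only the 2×10^8 events per month and the 10 s/event reconstruction time:

    # CPUs needed at an RAC to reprocess 2e8 events in one month at 10 s/event.
    events        = 2.0e8
    sec_per_event = 10.0
    month_seconds = 30 * 24 * 3600.0

    cpus_needed = events * sec_per_event / month_seconds
    print(f"~{cpus_needed:.0f} CPUs at a 100% duty cycle")       # ~770
    print(f"~{cpus_needed / 0.8:.0f} CPUs at an 80% duty cycle") # ~960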

l. Support Personnel

A study must be made to estimate the support personnel needed at each RAC to manage the software distribution and to provide support for users. Concerning the software distribution, each RAC must have a mirror copy of the CVS database for synchronized updates between the RACs and the CAC. It must also keep the relevant copies of the databases and act as a SAM service station, providing data and rootuples to the IACs and DAS. Each RAC should have personnel to provide triage support for installations within its "branch" of the DØ-SAM network. This includes help with DØ code distribution and use, SAM station installation and support, and general DØ software applications. Problems that are not solvable at the RAC site can be relayed to other RAC sites for assistance, and finally to FNAL for resolution when required.

8. Recommended Sites and Justification (*Christian + all for relevant parts)

a. Europe

i. Karlsruhe

The computing center at the Forschungszentrum Karlsruhe started last year and will be used by all four LHC experiments, BaBar, and both Tevatron experiments. It is foreseen as a Tier-1 center for LHC computing. Currently some experiments (ALICE, ATLAS, BaBar, CDF, DØ) already have an individual software server available, and BaBar is actively using the center for data analysis as well as MC generation. In total 36 computing nodes (dual machines) are available now, and an additional 32 soon; this gives 136 CPUs (1.3 GHz PIII). The plan is to increase this by another 50 CPUs by fall '02 and by 200 CPUs in '03. A tape system with a capacity of 53 TB is available now. In terms of disk space, the German DØ group will have 3 TB available this year (1.6 TB already in place). In terms of network, the current link is only 34 Mbit/s, but it will be increased to 155 Mbit/s soon. A first test at 622 Mbit/s is foreseen for fall '02.

ii. NIKHEF

High-bandwidth connections exist within the Netherlands in the form of the SURFnet academic network, with a 10 Gbit/s backbone at present and 80 Gbit/s foreseen by the end of 2002. This network extends in fact all the way to Chicago. In addition, NIKHEF is the center for Dutch HEP research and is also a Tier-1 Grid site. A Linux farm consisting of 50 dual-CPU, 800 MHz machines is available and is used for the production of Monte Carlo events. Access exists to a 9310 PowderHorn tape robot with a maximum capacity of 250 TByte (shared with other users). The farm is run with approximately 1 FTE of manpower.

iii. IN2P3, Lyon

The Computing Center of IN2P3 at Lyon is a national French center used by the nuclear and high energy physics community. It can provide large CPU power and mass storage. The CPU is essentially provided by PIII Linux machines (about 17k SpecInt95), shared with other experiments. The mass storage is based on HPSS (High Performance Storage System). The current limitation is given by the 36,000 cartridges accessible by the robot (about 1000 TB). More than 20 TB of disk is available, and I/O uses the RFIO protocol. CCIN2P3 has a good network connection with Fermilab; the current bandwidth to FNAL is 20 Mb/s and will have to be improved in the future. CCIN2P3 is also involved in the European DataGrid project.

iv. Other Institutions

b. US

i. University of Texas at Arlington

The UTA HEP group has two distinct clusters of farm machines. The HEP farm consists of 24 dual Pentium III 866 MHz machines running the Linux operating system. Since UTA has been one of the major offsite Monte Carlo production centers for DØ, we have substantial expertise in the production of simulated events and detectors. The second farm consists of six dual Pentium III 850 MHz processors.

This farm is managed in collaboration with the UTA computer science and engineering department. It works as an independent entity but is controlled by a server running on the HEP farm, and it can be used for various testing protocols for DØ Grid software development. The farm is also equipped with 100 Mbits/sec network switches, which will be upgraded to gigabit switches in spring 2002. The group has recently added a dual Pentium processor analysis server and completed the deployment of the DØ software environment. The server has a total storage space of 350 GB. This machine is housed in the main office area of the Science Hall and is equipped with 100 Mbits/sec network capability. The building is equipped throughout with a 100 Mbits/sec network, which will be upgraded to gigabit in the middle of 2002. The UTA off-campus network has one Internet II and two IT network connections, each providing up to 45 Mbits/sec, for approximately 140 Mbits/sec of aggregate bandwidth.

ii. Boston University

The BU DØ group has an analysis server consisting of eight 1.13 GHz Pentium III processors and 640 GB of storage, running Linux Red Hat 7.2. The entire BU physics building is equipped with 100 Mbits/sec switches; upgrading these to gigabit switches is being considered. At the moment, the BU DØ group has access to the BU ATLAS computing facilities. The BU ATLAS group is equipped with a 10-processor Linux P-III farm and a 128-processor Linux farm (shared). The BU campus network is equipped with OC12 (622 Mbits/sec) networking, Internet2 connectivity to Abilene, and vBNS. The networking infrastructure is maintained and owned by the University. In terms of software, the BU ATLAS nodes have Globus 1.1.4, Condor 6.2, PBS, and Pacman installed. Owing to the low usage of these facilities by the BU ATLAS group, they have agreed to let the BU DØ group utilize their farm for the near future. A clear path for the acquisition of computing nodes for the BU DØ group needs to be defined. The expertise developed in conjunction with the ATLAS group will be available to the DØ group to install and maintain an RAC.

iii. Fermilab

Fermilab will provide processing, storage, and bandwidth to support DØ data taking, reconstruction, and Monte Carlo activities. All raw detector data will be stored in mass storage throughout the duration of the Run IIa and IIb phases of the experiment. Output from the FNAL DØ reconstruction farm will be kept at FNAL. The final stages of MC processing output, in particular the DST and thumbnail tiers, will be sent from the processing centers to FNAL for permanent storage. FNAL will supply processing CPU and upgrades for the Central Analysis station, the D0-Farm, and several compute-server analysis stations. Disk cache will be provided to store one complete set of thumbnail data on the Central Analysis station, along with substantial amounts of MC and other data samples.

Bandwidth to deliver data to approximately six RACs will be provided, with overcapacity to allow for special processing projects requiring additional bandwidth to certain centers for particular group activities. Support for DØ code and SAM server deployment and maintenance will be provided to the RACs when needed. Global resource management that enables tuning of the data delivery rates to the RACs will be controlled by the configuration of servers operating at FNAL. Storage robot tape mounts and tape drive usage will be monitored and allocated to provide the needed file storage and delivery for all local and remote needs.

iv. Two more sites??

c. South America

i. Brazil

d. Asia

i. India
ii. Korea

e. Summary Table of Resources from Candidate Sites

  Institute       Storage Cache (TB)   CPU (SpecInt95)   Network Bandwidth   Support Personnel
  NIKHEF
  Karlsruhe
  IN2P3, Lyon
  Boston Univ.
  UTA
  FNAL

9. Policies for Full Data Access Approval (*Iain, Meena, Lee)

When the system is completely operational, the RAC facilities will provide caching and bandwidth for files being brought from Fermilab, or from other RAC sites, for the stations within their branch of the DØ-SAM network tree. Access to bandwidth from Fermilab, including tape, network, and cache usage, will be provided through administrative controls which limit the number of concurrent transfers from the FNAL domain to each RAC station. This limit will be set to accommodate normal operation and to provide a fair division of resources among all RAC users. If a single user or group is abnormally weighting the number of transfers to a particular location, causing a large backlog of work for that region, it will show up in the transfer monitoring plots.
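The administrative control described here, a cap on the number of concurrent FNAL-to-RAC transfers per station, can be pictured with the small sketch below. It is illustrative only: the station names and limits are made-up examples and this is not SAM station code. A bounded semaphore per RAC station enforces the cap while excess requests simply wait.

    # Illustrative per-station limit on concurrent transfers from the FNAL domain.
    # Not SAM code; the station names and limits are made-up examples.
    import threading, time

    class TransferLimiter:
        def __init__(self, max_concurrent):
            self.slots = threading.BoundedSemaphore(max_concurrent)

        def transfer(self, filename, seconds=0.1):
            with self.slots:                 # blocks while the station is at its cap
                print(f"transferring {filename}")
                time.sleep(seconds)          # stand-in for the actual file movement

    station_limits = {"RAC-Europe-1": 4, "RAC-Europe-2": 4, "RAC-US-1": 2}
    limiters = {name: TransferLimiter(n) for name, n in station_limits.items()}

    threads = [threading.Thread(target=limiters["RAC-Europe-1"].transfer,
                                args=(f"file{i}.raw",)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()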

It is possible to increase these capabilities for special project efforts that are planned and sanctioned by the experiment. These can be established on short notice by creating special SAM groups within the particular set of stations where the work is being executed. Jobs run under such groups can be monitored, and special resources can be enabled both at the local SAM stations where the jobs are performed and throughout the DØ-SAM network. Summaries of resource usage for all aspects of a project will be recorded, so that subsequent tasks with similar characteristics can be evaluated and estimated precisely.

The magnitude of the output data storage also needs to be considered. In many cases it will be desirable to store large outputs at the participating RAC. If the data are needed by other RAC sites, they can be transferred through their caches to the analyzing consumers. For major production tasks, output data can be stored at FNAL. The size and cost of the output media need to be included as part of the resource evaluation. The authorization for these large data-movement and processing tasks will be performed at the level of authority within DØ commensurate with the size of the task, its impact on available resources, and the urgency of the need. Most such projects can be authorized by the ORB; however, representation from the affected RAC centers would be required. In cases where the number of special projects escalates and the need for resources far outstrips their availability, priorities must be established. In such instances the priorities will be set by the Physics Coordinator, with input from the physics groups and spokespeople as needed. There may be times when certain proposed processing projects will need to be canceled or scaled back. Every attempt will be made to understand and accommodate requests from the physics groups and from DØ at large.

a. Before Analysis Refinement
b. Approval Process for Sufficient Level of Refinement
c. After the Approval
d. Implementation of Policies

10. Implementation Time Scale (*Jae, Patrice)

In order for the RACs to be effective, they must be implemented gradually, reaching full capacity for the Run IIa data set by the end of Run IIa. The infrastructure and capacity of the RACs must be scalable, so that they grow together with the increase in data size. We propose that the implementation process begin by establishing one of the sites to be capable of the following services:
• Data storage and delivery service
• Data analysis batch service
• Database access service
• Monte Carlo generation and reconstruction
We propose that the above be implemented in Karlsruhe, Germany, by Oct. 1, 2002, coinciding with the milestone date for the detector to be fully ready for serious physics data taking. The selection of most of the RACs should be complete by the end of August 2002. The other centers should then follow the implementation steps Karlsruhe has gone through, reaching the Karlsruhe level by the end of 2002.

11. Bureaucratic Process and Arrangements (*Patrice, Peter, Lee, Meena, Jae)

A memorandum of understanding (MOU) should be signed between each RAC, FNAL, and the DØ collaboration to plan the resources and services that have to be provided by each party. Among others, the following topics have to be well described:

- OS upgrades, which have to be coordinated to ensure that at least the same version is installed at all RACs and the CAC.
- The amount of storage (tapes, disks), which has to be defined together with a timetable that takes into account the periods when data need to be reprocessed (for instance, increasing the SAM cache area by a factor of two).
- The CPU power which, like the storage, has to be planned.
- A list of specifications for being "DØ Grid compliant", which has to be established and respected by the RACs and also by the developers.

The specific constraints of individual centers, such as their storage systems or other particulars, have to be described in the MOU in order to be officially supported by the collaboration.

12. Conclusions (*Jae, Patrice)
