PoS(ACAT2010)006 http://pos.sissa.it/ ce. s with data. We will introduce some ciency at a major degree. le of the computations; running them about massive data access in the HEP onsequently possible. The upcoming he personal level will likely pose new elated random access patterns needs a ysis Techniques in Physics Research his kind of systems, towards new levels tly running data-intensive applications e among the characteristics of the avail- le support. ive Commons Attribution-NonCommercial-ShareAlike Licen † ∗ [email protected] environment, starting to focus on theable relationships technologies that and li the data access strategies which are c challenges for the systems that haveideas to that feed may the constitute the computation next steps inof the performance, evolution of interoperability t and robustness.can be Efficien very challenging in a single site,in depending a on worldwide the distributed sca environment with chaotic user-r In this paper we pragmatically address some general aspects evolutions in the computing performance available also at t design which avoids all the pitfalls which could harm its effi Speaker. Thanks to A.Hanushevsky and F.Carminati for their invaluab ∗ † Copyright owned by the author(s) under the terms of the Creat c

Fabrizio Furano Data Access in HEP CERN E-mail: 13th International Workshop on Advanced ComputingFebruary and 22-27, Anal 2010 Jaipur, India PoS(ACAT2010)006 Fabrizio Furano rld Wide Web, which making a potentially high Energy Physics experiments They may instead be present sparently integrates function- necessarily to force computa- te of any size, from a small set ould be even more precious for WAN) connections, and the in- order to initiate a smooth transi- n without having to copy locally s, for example: ded in a farm whose local storage e: applications that exploit the com- lieve that local storage clusters can al nodes in different distant sites to like coherent view of their content. ng model [13] [14] [15]. onstrated, for example, in a produc- such systems, because requirements m can only access a storage element very interesting use cases that can be high. tallation. e a large computing farm is accessing chable by the hardware used to deploy achieve. The goal is to be able to effec- ionalities are effectively usable, and the ard problems. ns among clients and servers residing in 2 alities at both site level and global level. To be able to deploy a "unified worldwide storage", which tran tasks like debugging an analysisthe or data data-mining it applicatio accesses. Allow a batch analysis job to continueelement its lost execution the if data it lan filesin which another were site. supposed to be present. Any other usage which couldmade come interactivity to available in the a mind very of easy way an via user WAN. of the Wo Interactive data access, especially in the case of analysis puting power of modern multicore architectures [11]. This w Hence, the objectives this kinds of systems are aiming for ar In this work we address the various solutions we developed in The task of promoting the maximum level of performance while With the term “node” we refer to the deployment of a storage si Moreover, the high latency typical of the wide area network ( The idea behind our interest for WAN direct data access is not In a modern computing model dealing with data access for High • • • • Data Access in HEP 1. INTRODUCTION tion from a high performance storagea model model composed where by the sever nodes cooperate to give a unique file system- of disks up to the most complex hierarchical storage [19] ins number of storage nodes collaborate, is verytively challenging deploy to a production-quality system, wheredata the access funct performance is as close asthis possible system. to This the can one rea beand considered expectations as about the performance major and challenge robustness in are very different sites have historically been considered as very h trinsically higher probability of temporary disconnectio tions to be performed towards remote data.be replaced Hence, we at do the not present be the time, data. especially in We the believe that casesvery wher the “close” “classical” to use it, case is where nottion a the environment far by only the choice implementation anymore, of as the it ALICE is computi dem (but probably not only High Energyeasily Physics), accommodated there with are a other well working WAN-wide data acces PoS(ACAT2010)006 . Each jobs Fabrizio Furano adata that describes their ing the overall system and ctures is that, for example, if f view, rely typically on statis- th the scalability of the whole to 1000 in GLAST/Fermi), and abilities of the content of a Stor- gh Energy Physics experiment at ever, given the generality of the mentally deployed on an already- handle very large amounts of data h more in the future generations of ransaction rate for the storage sys- esiding at a different site, an often spread all over the world, accessed ata analysis applications and being be a combination of the local GRID equirement, since these aspects can y and efficiency. e WAN-wide clients and occasional ibly having some sites that are more bute the data among different comput- ith tens or hundreds million files in the extended to any other field with similar sing processes, often called ed, in data stores which can reach sizes ify and classify them. Very often, in order ) file opens per second in a local cluster is 3 3 accessing its data. To implement fault tolerant communication policies. To implement fault tolerance through theage “self Element healing” (SE). cap To give to remote SEscontent. ways to remain synchronized with the met To improve data access speed, transaction rate, scalabilit To allow dealing with Wide Area Networks, both for distribut These objectives also led us to address the issues related wi High Energy Physics experiments, from the computing point o Historically, a common strategy, largely adopted in HEP, to Among others, one of the consequences of this kind of archite • • • • • Data Access in HEP design, and the fact thatmake, the focus in on practice, performance the isnot. difference a major between The r being case study able for toCERN, these run where methods a d is system the based ALICErunning on [13] production these [15] environment principles Hi among has many been (aboutconcurrently incre 60) by sites tens ofdiscussed thousands topics, the of methods discussed data in thisrequirements. analysis paper can jobs. be How 1.1 The problem tics about events and on all the relativeto procedures get to ident sufficient statistics, huge amounts of data are analys experiments. Hence, the typical computing scenariorepositories deals and w tens of thousandsjob of can concurrent open data many analy files atkeep once them (e.g. open about for 100-150 the time inWith needed. ALICE, these up numbers, it is easytem to can be understand very why high. the With file respect open to t that, a rate of O(10 up to a few dozens of Petabytes, but which will likely grow muc not uncommon. In our experience thesite, source a of this local traffic batch can system,interactive a users. local PROOF [11] cluster, remot (on the order of the dozensing of centres, Petabytes assigning per a year) is partimportant to of than distri the others. data to each of them, poss a data processing job needs to access two data files, each one r PoS(ACAT2010)006 Fabrizio Furano ation has to locate a file in a view, between a file catalogue es is that those designs include hing problem can be avoided in ge system, in our view, can be n our opinion that the initiatives ut a non-existing file, or a file can bout file placements, based on the elsewhere. This situation can the- uncontrollable way (e.g. a broken e, before starting, or that the time he old data is replicated in different oth of them, and the processing is tributed production storage deploy- ge system, i.e. the current (and/or for instance, that a hypothetical data de storage system should shield the " subset of the data, i.e. a set of files ogue is a list of files, associated to nsistencies, as discussed later. d have to be rescheduled. Moreover, ormance bottleneck in the architecture nion, are undesirable. alled "trendy datasets" are overloaded, n easily degrade into a complex n-to-m d for each file any application needs to ion" are not apparently progressing fast en the typically very high complexity of which can also be due to mis-alignments of locating files through sites or through rticular file or the URLs to access them. t. A file catalogue instead must deal with on about the content of files, but not about here the majority of the sites are manually d of component takes the form of a simple- 4 Even if the discussed aspects are well known, however, it is i Here we must point out the subtle difference that lies, in our Assigning the task of locating files to a comprehensive stora One could object that the complexity of an n-to-m online matc Data Access in HEP adopted solution is to make sure that a single site contains b forced to host the same data sets. Both situations, in our opi the fact that the content of the repository might change in an started there. Immediate consequences coming from the above scenarioanalysis are, job must waitto for wait all before of starting its canthe data grow task to if of some be assigning file files present transfers tomatching at fail sites problem, a to (and an be replicating sit solved them) online, ca asplaces. new An data optimal comes solution and is t very difficult to achieve, giv some cases by assigning to eachwhose site processing something does like not a "closed needoretically the evolve to presence one of where fileswhile the residing others sites can containing be the idle, or, so-c alternatively, to a situation w 1.2 File catalogues and metadata catalogues related to what we canenough call to "transparent allow storage a globalizat system designerment to using conceive them a in very a largesome transparent dis external way. component What whose happens purpose inknowledge is many of to cas the take decisions supposed a desired) current association global of data status to of sites.minded the "file Invariably, this catalogue". stora kin Its deployment may representand a also perf it can be a single point of failure, or a source of inco and an application-related metadataother catalogue. information, e.g. A the sites/servers fileWhen that part catal host of each a pa heavily common distributed design, environment, it hence it oftenaccess. must is be A the metadata contacte catalogue, only instead, way contains informati an applic have in principle no metadataapplications entries. from In its internals, our like view, for aservers. example worldwi the tasks highly desirable for the intrinsic robustness of the concep media or other issues), opening new failure opportunities, such systems in the real deployments. where they can be accessed. Hence it can contain metadata abo PoS(ACAT2010)006 of indexing all Fabrizio Furano igh-performance data server ncurrent physical client con- essed and in which data server or notifying the location of the t domain: the ID3 tags for music loose or damage) a CD listed in etc.); erate data transfers (e.g., overlap- nd somebody not, and somebody ubtle difference among these two ; g requests sharing the same physi- would not be so different from the e, our proposal is not to promote fers to be interrupted and continued adata catalogue for them is e.g. the lar CD. In other words, this is just t by the maintainers of the metadata, or, or even subtle aspect, but it makes ot least that it is computationally very g a “perfect” representation of reality, ploy a system based on these concepts en the scale of the system is very large y attributed. Our point, when dealing te for data access. m layer, not to the layer which handles . ess a set of music files would only need ies are correct and that all the CDs owned 5 . However, unlike NFS, the xrootd protocol includes a number xrootd website. We could also think about creating a file catalogue, cal connection; Communication optimizations, which allow clients to accel nections per server; Capability to support several tens of thousands outstandin ping requests, TCP multistreaming, vectored read requests An extensive fault recovery protocol that allows data trans Capability to accommodate up to several tens of thousands co at an alternate server; A comprehensive authentication/authorization framework The basic system component of the Scalla software suite is a h To give a simpler example, we consider a completely differen • • • • • akin to a file server called the music files owned by thescale world’s of population the (and file the catalogues scale is of that modern in HEP the experiments). realthis The world metadata s catalogue. everybody is This allowed wouldand not not has be to no considered own a consequence (or faul oninformation to the about information music about releases that whichelse particu somebody is could unable to own tell. a Hence,to an know application which willing music to files proc in to a process, worldwide and deployment. not howOn they the can other be hand, acc a "fileand catalogue"-like this system adds aims at some layers bein ofdifficult complexity to (if the not task, impossible) last just but to n verify that all the entr with this kind ofa problems metadata is catalogue that, torequested from become resources a is a a general file task perspectiv catalogue, belongingthe information to since bookkeeping. the This finding storage might syste bea considered a very min big difference from the architecturaland it point is of distributed view worldwide. wh We triedby to using implement and and de contributing to the Xrootd/Scalla software sui 2. XROOTD AND THE SCALLA SOFTWARE SUITE (and sold/exchanged) by the persons are listed and correctl features like: Data Access in HEP between its content and the actual content of the repository files. A technically very good examplewww.musicbrainz.org of a comprehensive met PoS(ACAT2010)006 Fabrizio Furano . A data server node consists he task. In its manager role, sists of 1-to-64 server nodes. server node. In its supervisor manager, the supervisor cmsd a proprietery message passing anager cmsd. The term node is e traffic pattern and is extremely ered together while still providing a manager cmsd sitting at the root sists of an xrootd and a supervisor ystems, like tape drives, in order to est. This role parceling allows the ng nodes, scales very quickly with d co-ordinates the organization of a As a manager, it allows server cmsd’s n, form superclusters, as needed. As network is used to deliver actual data. used by data servers. 9]. e term manager should subsequently be ces requested by a client. Additionally, pervisor, which, in turn, clusters around ormation to its manager cmsd so that it d by the server cmsd’s and forwards the nd direct clients to appropriate servers in component of the system. This is a spe- cmsd 6 as a server pairing of an xrootd with a cmsd. node A logical data network (i.e., the xrootd servers); A logical control network (i.e., the cmsd servers), based on technology. The possibility of interfacing a data server with external s Peer-to-peer elements that allow xrootd servers to be clust provide transparent hierarchical storage capabilities [1 a uniform name space. Xrootd’s clustering is accomplished by the A cmsd can assume multiple roles, depending on the nature of t A hierarchical organization provides a predictable messag Hence, a server cmsd is essentially an agent running on a data We can now expand the definition of a node to encompass the role To limit the amount of message traffic in the system, a cell con • • • • real-time. In essence, the system consists of: We define a cialized server that can co-ordinate xrootd’s activities a The control network is used to cluster servers while the data Data Access in HEP well suited for conducting directedits searches for performance, file which resour depends on the number of collaborati the cmsd discovers the bestcluster. server In for its a server client role, file the request cmsd an provides sufficient inf Cells then cluster, in groupsshown of in up Figure to 1, 64. the systemof Clusters is the can, tree. organized in as Since tur a supervisors B-64 alsoassumed function tree, to as with include managers, supervisors. th can properly select a data server for a client request. a manager cmsd. role, the cmsd assumes the duties ofto both manager cluster and around server. it, itinformation aggregates to the information its provide managerdetermines cmsd. which When server a cmsd will requestformation be is of data used made server to by cells satisfy that its cluster the around requ a local su of an xrootd coupled with acmsd, server and cmsd, a a manager supervisor node nodelogical, consists con since of supervisors an can xrootd execute coupled on with the same a hardware m 2.1 Cell-based organisation PoS(ACAT2010)006 n or a the database Fabrizio Furano Of course, these are erring to this kind of ajority of the cases, the data files the system is configured with ide, manager nodes can be repli- been relaxed, in order to make it d balance the control traffic. Data zation becomes a directed acyclic cular peer-to-peer like behaviour of sk partitions. hops away from the root node. high latency networks interconnect- mode (and eventually deleted later), d transactions than being unable to by definition; hence there is no need two-level tree is able to cluster up to distributed transactions or distributed ay be replicated across multiple data ot entirely true. In fact, g models where this system was used tem becomes unreachable. Replicated ability. The general idea was that, for erformance and scalability are still an t. For example, right now there was no es. In the xrootd case, moreover, we can ased aggregation of several hierarchical ell as overall data access performance. 7 Cell-like organization of an xrootd cluster control paths from a manager to a data server node, allowing f n Figure 1: In order to provide enhanced fault-tolerance at the server s The main reason why we do not use the term “file system” when ref These kinds of simplifications were also related to the parti 262,144 data servers, with no data server being more than two Data Access in HEP only a small increase in messaging overhead. For instance, a cated. When nodes ingraph the and B-64 maintains tree predictable are message replicated, traffic. the organi Thus, when extremely well optimised for their purpose and are up to date databases, which are the file systems of all the aggregated di significant number of node failures beforemanager any part nodes of can the be sys used asserver fail-over nodes nodes are or be never asked replicated.servers to to loa Instead, provide the file desired resources level of m fault-tolerance as w used to locate files is constituted by the message-passing-b claim for instance that the fact that there is no database is n system is the fact thateasier to some head requirements for of extreme fileexample, robustness, systems performance it and have is scal muchmake easier the system to grow give if its up performance/sizeneed atomicity is to not in dedicate sufficien a distribute serious effortfile in locks. order to These allow functionalities, foropen when atomic research related problem, to even extreme more p criticaling when nodes, dealing clients with and servers overare a never WAN. updated, In they HEP, in are the createdso vast and these m then relaxations accessed did in not read up create to problems now. to the computin the xrootd mechanism behind the file location functionaliti managers, there are at least PoS(ACAT2010)006 Fabrizio Furano aving to deal too much with ould schedule the creation of stem, hence, if a disk breaks, any reason (network glitches, to work correctly even during g out which files were lost in a re-established if it seems to be mely important, is the fact that e related to robustness and fault an one server will appear to have server communication, typically ecide what the system should do ration, so that the client goes into es of the mount points of a given ber of times and/or failed to find an ernal name translations. educed to a minimum or not needed ace, and the computational overhead om the external units by some other o aggregate multiple mount points in n the xrootd case, also between server “localroot”. Doing this, each server in by means of a simple local string sub- n the relative frequency of this kind of e completely detached from the internal ly are: a management designs are: e servers are configured in order to export tabase, just to locate files in sites or single 8 A client must never return an error if a connection breaks for etc), unless it failed after havingalternate retried location for where a to certain continue. num The server can explicitly signal everya potentially pause long state ope from which ittimeouts. can be woken up later. This avoids h Every inter-server connection isbroken. kept alive and eventually The disk pool is just a disk-based cache of an external tape sy its content will be (typicallyserver automatically) in re-staged the fr same cluster. Ofthis course, process. the system must be able The files are replicated somewhere else.a Hence, new some replica system of c the lostpotentially ones. large repository The can difficult be part problematic. is that findin The other aspect of fault tolerance is related to the client- Other aspects, which we consider as very important, are thos Another characteristic of the xrootd daemon, which is extre • • • • • based on TCP connections between clients andnodes. servers, The and, operational i principles that we applied rigorous Data Access in HEP to replicate their content into a typically slow external da 2.2 Fault tolerance servers. each server hides its internal pathsstitution where in the the file data names. is stored a In cluster our terminology can this export is aserver. called namespace Using that this kind is of detached feature, from inthe a same the typical namespace, nam setup, and all hence th thethe same same file file exported name by without more the th This need kind of of potentially internal complex filename ext prefixthe processing same is server. also Doing used this, t thedeployment processing details; all applications they ar see is aof common this coherent kind name sp of local name translation is extremely low. tolerance. From one side,if, a for system instance, designer a has diskhardware to breaks, problems, be and the free manual its recovery to content approach d at is should all. lost. be The r Give typical solutions exploited in fail-safe HEP dat PoS(ACAT2010)006 , meta-cluster Fabrizio Furano host, which the remote sites calla/Xrootd system played a entral services, already gives here, and, of course, the data re organised (or self-organise) ed features of the Scalla/xrootd ign challenge was how to exploit ing used in the storage elements epository without having to know andled by the storage system, we posed to be present on a storage el- s are actually able to connect to this formed in a way that is completely paragraph better explains the details cessing jobs have to access the so- ntry point of a unique herent name space. In other words, as not new in the ALICE computing tem in the direction of being able to ing namespace coherence across dif- ed itself a very good one, with a sur- 0,000 concurrent jobs are considered nnects them is robust enough. All this om the other sites, but, an even more ss of a large distributed storage deploy- ailure rate. td-based storage cluster at CERN. Each mposed by several sub-clusters residing at servers can connect across a wide area d elsewhere. meta-manager 9 As said, the set of Wide Area Network related features of the S On the other hand, the AliEn software [14], supported by its c Of course, one important requirement is that the remote site With the possibility of accessing a very efficient coherent r The xrootd architecture, based on a B-64 tree where servers a The other interesting item related to the recent WAN-orient subscribe to. This new role of a server machine is the client e the location of acan file implement in many advance ideas. becausebelonging the to For location ALICE instance, in task this order is to mechanismement h quickly but is fetch for a be some file reason that is was notof sup accessible this anymore. kind The next of systemtransparent and to synthetically the shows client that how requests this the is file. per 4. THE VIRTUAL MASS STORAGE SYSTEM major role in opening theexploit possibility Wide Area to Networks evolve in a athis storage productive in sys way. order to The augment major the des ment. overall robustness and usefulne The idea of letting dataframework. processing jobs For access remote instance, data thiscalled w is “conditions data”, the which default are way stored in that a the Scalla/Xroo pro a normal activity): historically theprisingly WAN good choice performance always for prov that task, and a negligible f some sort of unique view of the entire repository, implement processing job accesses 100-150 conditions data files (and 2 which contains all the subscribed sites, potentially sprea meta-manager and that each storageimportant cluster one is is accessible that fr alla the given sub-clusters data expose file the mustservers same must be be co exported accessible with from the outside. same file name everyw suite is the possibility ofin building different a sites. unique repository, co into clusters [5], accommodates by construction the fact th Data Access in HEP 3. MORE ABOUT STORAGE GLOBALISATION network, provided that the communication protocol which co is accomplished with the introduction of a so-called PoS(ACAT2010)006 Fabrizio Furano sing file comes to a storage ound-trip time plus one local o a unique worldwide meta- /Xrootd terminology is called and enhanced (e.g. scalability, tems considers the meta-cluster w of their content; f the ALICE computing. This (if the location is unknown to a but also it would avoid the fact owing principles: e mechanism to work, everything es, as soon and as fast as possible. s already discussed, the form of a facts that: t of view, we considered as a poor that there was a way to enhance the ) could be done without letting the self, and does not need external sys- ch automatically tries to fix this kind g design, allowing for an incremental e recent times, the ALICE computing nal database. Moving towards a model , by construction, keep easily track of ) as its mass storage backend has been , and the idea for which a site sees the le that is missing in a site is available in ng is done by the storage system to fetch l almost independent sites contributing to quickly 10 (VMSS). ALICE Global redirector ) has been called Virtual Mass Storage System cluster, which exposes a unique and namespace-coherent vie as its mass storagecluster, system. then it If tries to an fetch authorized it from request one of for the a neighbour on mis All of the xrootd-based storage elements are aggregated int Each site that does not have tapes or similar mass storage sys manager server, finding it takes,file in system lookup); average, only a network r No central repositories of data or metadatais are performed needed through for th real-time message exchanges; All of the features ofperformance, the etc.); Scalla/Xrootd suite are preserved In the xrootd architecture, locating a file is a very fast task A careful consideration of all these aspects led to thinking The generic solution to the problem has been to apply the foll The host which manages the meta-cluster (which in the Scalla An additional consideration is that, from a functional poin In this context, we consider as very positive statements the • • • • • requesting job fail. ALICE Scalla/Xrootd-based distributed storage in aof way problems, whi and at the sameevolution time of does the not storage disrupt design, the existin distributed across severa the project. meta-manager solution the fact that, once ait. missing file This is is detected, especially nothi unfortunatea when, well-connected for instance, one, the and fi pulling it immediately (and Data Access in HEP ferent sites by mapping the file names using a central relatio where the storage coherence istems, given by would the storage give system an it advantage additional would benefit be in to thethat this form an key of external component a system o performancethe (like enhancement, data a files relational which database) can“metadata cannot be catalog” lost is by due definition to immune malfunctioningis to slowly disks. this, moving and, in A in this th direction. global namespace (to which it participates with its content called PoS(ACAT2010)006 Fabrizio Furano ICE case no changes were e can be used by the overall d exercise a global real-time ecture. In the lower part of the ficiently accessing remote data ss Storage is that several distant he computing infrastructure; a/Xrootd storage cluster. If a client eir storage clusters and join in. ts processing as if nothing particular , which is copied immediately to the t (which is just another normal xrootd d asks the global redirector to get the a access structure; they constitute some n incremental way. This means that it is led GSI) does not find the file it needs to many things become possible, as these are 11 An exemplification of the Virtual Mass Storage System. Figure 2: required to the complex software systems which constitute t The design is completely backward compatible, e.g. in the AL system. Make some recovery actions automatic; Give an efficient way to move data between them, if this featur Figure 4 shows a generic small schema of the discussed archit By using the two mechanisms described (the possibility of ef Hence, we could say that the general idea about the Virtual Ma Hence, the system, as designed, represents a way to deploy an • • • figure we can see threeinstantiated cubes, by a representing processing sites job with (on the a leftmost Scall site label and the possibility of creating inter-site meta-clusters) able to give a better service as more and more sites upgrade th open, then the GSI cluster (see picture) pauses the client an 5. CONCLUSION AND CURRENT DIRECTIONS view of a unique multi-Petabyte distributed repository in a file, just requesting to copy theclient) file is from it. then This redirected copy toleftmost reques site. the site The that previously hashad paused happened. the job can needed then file continue i sites provide a unique high performance filesort system-like of dat cloud of sites that collaborate in order to: Data Access in HEP PoS(ACAT2010)006 Fabrizio Furano ACM Transactions on Computer , 1999. major accomplishment that will metric in order to decide where ts for extreme performance and etch the file that was supposedly re quite generic, and it is foresee- d "storage globalisation". For in- hms ierarchical Server Topology. successful, from the points of view ly to provide one more feature, but ue to the latency of the link between cation-aware load balancing would appear as one, without having to deal d be a very interesting addition, that ctual content of the remote site, other al-time the content of the catalogues order to build an artificial higher level interactive data processing. ng the possibilities of having a storage rithm, indeed, only considers the load ow starts becoming much easier and ef- e system. path. nother one could host the worker nodes, ms would have a way to fix their content s in the need to access remote files effi- es but are not linked in any way to partic- luster file system. 12 , 6, Feb, 1988. Systems Proceedings of the 7th Annual European Symposium on Algorit http://www.lustre.org/ [July 2009] http://savannah.cern.ch/projects/xrootd http://xrootd.slac.stanford.edu/ [July 2009] As discussed, the features that allow this kind of approach a Other aspects worth facing are those related to the so-calle [2] Bar-Noy A, Freund A, Naor JS. On-line Load Balancing in a H [1] Howard JH et al. Scale and performance in a distributed fil [3] Lustre: a scalable, secure, robust, highly-available c [4] The Scalla/xrootd Software Suite. Data Access in HEP very generic mechanisms that encapsulate the technicaliti ular deployments. For example, building a true and robustficient. federation For of sites instance, n one site could host the storage part, a the sites. Or both might havewith a (generally part less of stable the and overall very storageview complex) and “glue of code” the in resources. So far, all the testsof and both the performance production and robustness. usages We havesystem believe been that able providi very to work acrosssgive many WANs benefits, properly especially and when efficiently dealing is with user-level a able that other systems will follow a similar technological without having to worry too much about the performance loss d With respect to the consistency between catalogues and the a approaches are possible, forevery instance time trying an to inconsistency update iscould in detected. also re complement This the approach coul VMSSlost. approach, In this trying case, instead both the toand/or catalogues f propagate and changes. the storage syste stance, a working implementation ofbe some a form very interesting of feature latency/lo ciently. for At data the analysis present application time,of the the xrootd servers load (network balancing throughput algo to and redirect a CPU client. consumption) as Asto a always, provide the it challenge in wouldscalability. a be not way on which is compatible with the requiremen References PoS(ACAT2010)006 , , Third The 6th Fabrizio Furano (Palma, Illes Balears, nt Technology (IAT’05) , 1995. 7. DOI= rinciples. hysics Research. ly scalable architecture for data access. Agent System using Passive Bids. ing of the ALICE activities. access systems: invited talk abstract. nd Performance parallelization and vectored access. Some -technologies in Science and Education" ICE computing model: an overview. 005. ISBN 92-9083-247-9. orking Group. a J. Informed prefetching and caching. ed Parallel Analysis Framework with 2009] free encyclopedia in Distributed Storage Systems. , 2005. Proceedings of ACAT 2008: XII International . Journal of Physics: Conference Series 119 R/Computing.html 9.pdf?arnumber=1542739 [July 2009] anagement [July 2009] 13 , Apr. 2005. IEEE/WIC/ACM International Conference on Intelligent Age CHEP07: Computing for High Energy Physics Spain, July 12 - 14, 2005).http://doi.acm.org/10.1145/1071021.1071053. WOSP ’05. ACM, New York, NY, 267-26 Proceedings of the 5th international Workshop on Software a WSEAS Transactions on Computers Workshop on Advanced Computing and Analysis Techniques in P http://acat2008.cern.ch/ . http://pcalimonitor.cern.ch/ [July 2009] results. 2005. http://www.slac.stanford.edu/BFROOT . [July 2009] http://aliceinfo.cern.ch/ . [July 2009] Volume 119 (2008) 072016 (9pp) http://root.cern.ch [July 2009] iat,pp.698-701, PROOF. http://root.cern.ch/twiki/bin/view/ROOT/PROOF . [July http://aliceinfo.cern.ch/Collaboration/Documents/TD International Conference "Distributed Computing and Grid GRID2008, http://grid2008.jinr.ru/ http://en.wikipedia.org/wiki/Hierarchical_storage_m IEEE/ACM International Workshop on Grid Computing http://ieeexplore.ieee.org/iel5/10354/32950/0154273 Proceedings of the 15th ACM Symposium on Operating Systems P http://ssswg.org/ [January 2009] [7] Dorigo A, Elmer P, Furano F, Hanushevsky A. Xrootd - A high [8] Hanushevsky A. Are SE architectures ready for LHC? [9] Furano F, Hanushevsky A Data access performance through [6] Hanushevsky A, Weeks B. Designing high performance data [5] Furano F, Hanushevsky A. Managing commitments in a Multi Data Access in HEP [12] The BABAR collaboration home page. [13] The ALICE home page at CERN [14] ALICE: Technical Design Report of the Computing. June 2 [10] ROOT: An Object-Oriented Data Analysis Framework [11] Ballintijn M, Brun R, Rademakers F, Roland G. Distribut [15] Betev L, Carminati F, Furano F, Grigoras C, Saiz P. The AL [19] Hierarchical storage management, From Wikipedia, the [20] ALICE Grid Monitoring with MonALISA. Realtime monitor [16] Feichtinger D, Peters AJ. Authorization of Data Access [17] the IEEE Computer Society’s Storage System Standards W [18] Patterson RH, Gibson GA, Ginting E, Stodolsky D, Zelenk