Employing peer-to-peer software distribution in ALICE Grid Services to enable opportunistic use of OSG resources
R. Jeff Porter & Iwona Sakrejda (LBNL)
Costin Grigoras, Latchezar Betev, Federico Carminati, Pablo Saiz (CERN)

Outline
• Introduction
  – ALICE Grid & AliEn
  – Traditional Software Deployment
• BitTorrent software deployment: AliTorrent
• Testing on OSG resources
• Performance characteristics
• Target multiple resources
• Summary

AliEn & the ALICE Grid
• ALICE Grid Facility
  – In production since ~2004
  – more than 80 sites
  – Small, central operations team
    • & site admins around the world
• AliEn: 'Alice Environment'
  – Central Services at CERN
    • Task Queue, Job & File Catalog
    • Software management (build & deployment) (this talk)
  – VO box site-specific operations
    • Job Agent submission
    • Software deployment
    • Resource monitoring
  – Grid Monitoring with MonALisa
  – Data Management
    • AliEn FileCatalog
    • Grid-Enabled XRootD SEs
[Figure: ALICE Computing Grid]
Yesterday's talk by Pablo Saiz: "AliEn: ALICE Environment on the GRID"

Traditional Software Deployment
• Managed Central Software
  – routine software builds
  – Catalogued & stored in AliEn
• Grid Site SW Operations
  – Jobs request SW from VO box service
  – VO box PackMan service pulls SW
  – SW deployed on shared area
  – WNs read SW from shared area
• Shared SW area
  – resource bottleneck
  – single point of failure
  – can require active repairs per site
[Diagram: Central Build Servers (AliEn, AliRoot, …; SLC, MacOS, Ubuntu); AliEn Catalogue and ALICE:CERN:SE; ALICE Grid Site with VO Box, WNs sending requests, and Shared SW Area (NFS/GPFS/AFS)]

Basic Torrent details
• Tracker: map of seeders:files
• Seeders: have & serve file
• Leeches: pull & serve file chunks
File chunks are checked against hashes of the original file to provide data integrity.
[Figure credit: C. Grigoras, ALICE T1/T2 Workshop]
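The chunk-level integrity check on the "Basic Torrent details" slide can be illustrated with a small sketch. It splits a file into fixed-size pieces, records one SHA-1 digest per piece, and later compares a downloaded copy against that trusted list. The function names and the plain-text hash list are assumptions made for illustration (a real torrent client reads the piece hashes from the torrent metadata), and GNU coreutils (`dd`, `sha1sum`) are assumed to be available.

```shell
#!/bin/bash
# Sketch of piece-wise integrity checking in the BitTorrent style.
# Assumptions: GNU coreutils are installed; function names are illustrative.

# Print one SHA-1 digest per fixed-size piece of a file.
piece_hashes() {
    local file=$1 piece_size=$2
    local size index=0
    size=$(( $(wc -c < "$file") ))
    while [ $(( index * piece_size )) -lt "$size" ]; do
        # Read exactly one piece; the last piece may be shorter.
        dd if="$file" bs="$piece_size" skip="$index" count=1 2>/dev/null \
            | sha1sum | cut -d' ' -f1
        index=$(( index + 1 ))
    done
}

# Succeed only if every piece of the file matches the trusted hash list.
verify_pieces() {
    local file=$1 piece_size=$2 manifest=$3
    [ "$(piece_hashes "$file" "$piece_size")" = "$(cat "$manifest")" ]
}
```

For example, `piece_hashes package.tar.gz 262144 > package.sha1` would record the hashes at the source, and `verify_pieces package.tar.gz 262144 package.sha1` would check a copy; both file names are hypothetical. Because each piece is hashed separately, a peer can reject a single corrupted chunk without discarding the rest of the download.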
AliTorrent Software Deployment
• Managed Central Software
  – Additional AliEn torrent store
  – Catalogue, seeder & tracker
• Grid site SW deployment
  – VO Box is not involved
  – Jobs pull SW from:
    • alitorrent.cern.ch seeder
    • local peers
    • other sites as available
      – though typically behind a FW
• Resolves:
  – Bottleneck & single point failures
  – Site level maintenance of shared area
[Diagram: Central Build Servers (AliEn, AliRoot, …; SLC, MacOS, Ubuntu); AliEn Catalogue; alitorrent.cern.ch Seeder | Tracker; ALICE Grid Site with WNs sending requests; VO Box]

AliTorrent Details
• Torrent Features
  – Distributed Hash Tables
    • Decentralized seeder lookup: seeders are trackers
  – Peer Exchange
    • Local peer information is propagated by seeders
  – Local Peer Discovery
    • Multicast to discover peers on same network
• ALICE/AliEn Features
  – Total software download is ~300-400 MB
  – Enabled per site (VO box) with AliEn LDAP flag

AliTorrent as an opportunity
• AliTorrent use in AliEn
  – Reduces problems associated with SW deployment
  – Simplifies site operations by removing a VO box service
  Does not eliminate VO box model from ALICE Grid
  Does eliminate site-specific VO box requirement
• Elimination of site-specific VO box allows for remote use of other Grid resources: OSG

AliEn-OSG Interface
• Developed in 2009 by ALICE & OSG teams
  – in production at ALICE-US sites since 2009
  – poster @ CHEP'09
• Features added to AliEn
  – VO box as a Condor-G submit host to local OSG CE
  – Condor module in AliEn
    • Builds submit files to launch JobAgents
    • monitors Condor queue for site utilization
  – Specifications put into AliEn LDAP
    • OSG-CE endpoint
    • Job occupancy targets
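To make "builds submit files to launch JobAgents" concrete, here is a minimal sketch of a Condor-G submit description generated from an OSG-CE endpoint. It is not the AliEn Condor module itself, whose output is not shown in these slides. The function name, the wrapper script name and the log file names are hypothetical; `universe`, `grid_resource`, `globus_rsl` and `queue` are standard HTCondor submit commands.

```shell
#!/bin/bash
# Sketch: emit a Condor-G submit description for one pilot (JobAgent) job.
# Assumption: this only illustrates the idea; it is not AliEn's own code.
make_submit_file() {
    local gram=$1       # grid type, e.g. gt2 or gt5
    local resource=$2   # CE endpoint, e.g. host/jobmanager-pbs
    local wrapper=$3    # script that starts the JobAgent (hypothetical name)
    local rsl=$4        # extra Globus RSL, or "none"
    cat <<EOF
universe      = grid
grid_resource = $gram $resource
executable    = $wrapper
output        = agent.\$(Cluster).out
error         = agent.\$(Cluster).err
log           = agent.log
EOF
    # Only add an RSL line when the endpoint entry specifies one.
    if [ "$rsl" != "none" ]; then
        printf 'globus_rsl    = %s\n' "$rsl"
    fi
    printf 'queue\n'
}
```

A call such as `make_submit_file gt2 carvergrid.nersc.gov/jobmanager-pbs agent_wrapper.sh '(jobType=single)(queue=serial)' > agent.submit` would write a file that `condor_submit` can take from the VO box. The same function covers every endpoint, so the per-site differences stay in the LDAP-style endpoint entry rather than in the script.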
Extending OSG-AliEn Interface for AliTorrent
• Without torrent (shared SW area)
    #!/bin/bash
    /project/projectdirs/alice/alicedev/alien.v2-19.168/bin/alien RunAgent
• With torrent (job sandbox)
    #!/bin/bash
    # Install AliEn into a per-job directory inside the job sandbox.
    DIR=`pwd`/alien_installation.$$
    mkdir -p "$DIR"
    cd "$DIR" || exit 1
    # Fetch the installer and let it pull the software via torrent.
    wget http://alien.cern.ch/alien-installer -O alien-auto-installer
    chmod +x alien-auto-installer
    ./alien-auto-installer -type workernode -batch -torrent -install-dir "$DIR/alien"
    ./alien/bin/alien RunAgent
• With torrent on OSG (OSG TMP space):
  • replace `pwd` above with $OSG_WN_TMP
    DIR=$OSG_WN_TMP/ALICE/alien_installation.$$

First Results
• Initial goal to target NERSC Carver system
  – Carver: NERSC IBM iDataPlex HPC system with OSG CE
  – ALICE obtained a small allocation for development work
  – No VO Box would be allowed on the system
• Conferred with NERSC security about torrent use
  – earlier requests for use of torrent were 'tabled' for review
  – Result: torrent not banned explicitly, but NERSC relies on dynamic filtering: "go ahead and try"
• It just worked
    ---- log file ----
    WNHost = "c1554.nersc.gov";
    Getting the torrent
    It looks like the default IP address is 128.55.61.240
    Seeding the torrent
    Extracting the files
    Testing /osg/carver/wn_tmp/ALICE/alien_installation.20613/alien.v2-19.169/
    AliEn workernode installation took approximately 133 seconds
    Installation finished!

Performance
• Job latency from SW install: Traditional vs AliTorrent
  – Traditional install:
    • 'copy' from shared area
    • lone job latency: ~1 sec
    • steady increase vs #-jobs
  – AliTorrent install:
    • CERN to LBNL/NERSC
    • lone job latency: ~360 sec
    • sharp decrease vs #-jobs
• AliTorrent advantage
  – small numbers of long jobs where SW install time is insignificant
  – sites with large numbers of concurrent ALICE jobs
[Plot: install time (seconds) vs. number of concurrent jobs; series: torrent install, 'cp' from shared area; annotation on the 'cp' series: increase with #-jobs]
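The "with torrent" and "with torrent on OSG" wrapper variants differ only in where the installation directory lives, so one script could serve both cases. The sketch below picks `$OSG_WN_TMP/ALICE` when that variable points to a writable directory and otherwise falls back to the job sandbox. The fallback logic and the function name are assumptions added for illustration; the slides show two separate scripts.

```shell
#!/bin/bash
# Sketch: choose the per-job AliEn installation directory.
# Assumption: the fallback to the sandbox is illustrative, not from the slides.
choose_install_dir() {
    local base
    if [ -n "${OSG_WN_TMP:-}" ] && [ -d "$OSG_WN_TMP" ] && [ -w "$OSG_WN_TMP" ]; then
        # On OSG worker nodes, use the site-provided temporary space.
        base="$OSG_WN_TMP/ALICE"
    else
        # Elsewhere, install inside the current job sandbox.
        base=$(pwd)
    fi
    # $$ is the wrapper's process ID, so concurrent jobs do not collide.
    printf '%s/alien_installation.%s\n' "$base" "$$"
}
```

In the wrapper, `DIR=$(choose_install_dir)` would then replace the hard-coded `DIR=` assignment, with the remaining steps (`mkdir -p`, `wget`, installer, `RunAgent`) unchanged.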
Expand to OSG Sites
• Single VO box submission to OSG pilot job factory
  – OSG-CE Endpoints
    • Managed as a list
    • Endpoint:MaxJobs:RSL
  – Test on selected sites
    • OSG-ITB sites
    • ALICE-US sites LBNL & LLNL
  – Close SE
    • output destination
    • 700 TB SE @ NERSC

    Gridresources \
      carvergrid.nersc.gov/jobmanager-pbs:50:(jobType=single)(queue=serial) \
      itbv-ce-pbs.uchicago.edu/jobmanager-pbs:5:none \
      glcc88.ucllnl.org/jobmanager-pbs:20:none \
      osgitb1.nhn.ou.edu/jobmanager-condor:5:none \
      pdsfgrid.nersc.gov/jobmanager-sge:50:none

Jobs in the Condor queue, grouped by site:
  NERSC/PDSF
    10592 gt2 pdsfgrid.nersc.gov/jobmanager-sge https://pdsfgrid4.nersc.gov:46490/25251/1337447241/
  LLNL
    10628 gt2 glcc88.ucllnl.org/jobmanager-pbs https://glcc88.ucllnl.org:40894/8704/1337454774/
    10629 gt2 glcc88.ucllnl.org/jobmanager-pbs https://glcc88.ucllnl.org:40902/9292/1337454810/
  UC-ITB
    10645 gt5 itbv-ce-pbs.uchicago.edu/jobmanager-pbs https://itbv-ce-pbs.uchicago.edu:36913/3026418950974269075/
    10647 gt5 itbv-ce-pbs.uchicago.edu/jobmanager-pbs https://itbv-ce-pbs.uchicago.edu:36913/3026418950974269075/
  NERSC/Carver
    10664 gt2 carvergrid.nersc.gov/jobmanager-pbs https://carvergrid.nersc.gov:60888/2359/1337525731/
    10665 gt2 carvergrid.nersc.gov/jobmanager-pbs https://carvergrid.nersc.gov:60891/2358/1337525731/
  OU-ITB
    10674 gt5 osgitb1.nhn.ou.edu/jobmanager-condor https://osgitb1.nhn.ou.edu:64981/933117525531598993/
    10675 gt5 osgitb1.nhn.ou.edu/jobmanager-condor https://osgitb1.nhn.ou.edu:64981/933117525531598993/

Towards production operations
• Proof of principle is complete
  – AliEn jobs run at sites, submitted by a remote VO box
  – successful running on multiple sites from a single VO box
  – Plan to target OSG production sites
• Site policy caution:
  – limits can be broad
  – has hampered wider use
• ALICE Experience:
  – initial distrust turns into recognition of approach's merit
  – simplifies site operations
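The `Endpoint:MaxJobs:RSL` entries from the "Expand to OSG Sites" slide are simple enough to split with shell parameter expansion. The sketch below breaks one entry into its three fields and sums the job targets over a list. The function names are hypothetical, and how AliEn itself parses these entries is not shown in the slides.

```shell
#!/bin/bash
# Sketch: parse "Endpoint:MaxJobs:RSL" entries (function names are illustrative).

# Split one entry into tab-separated endpoint, max-jobs and RSL fields.
parse_gridresource() {
    local entry=$1
    local endpoint=${entry%%:*}   # text before the first colon
    local rest=${entry#*:}        # text after the first colon
    local max_jobs=${rest%%:*}    # text between the first and second colon
    local rsl=${rest#*:}          # everything after the second colon
    printf '%s\t%s\t%s\n' "$endpoint" "$max_jobs" "$rsl"
}

# Sum the MaxJobs field over all entries given as arguments.
total_max_jobs() {
    local total=0 entry max_jobs
    for entry in "$@"; do
        max_jobs=$(parse_gridresource "$entry" | cut -f2)
        total=$(( total + max_jobs ))
    done
    printf '%s\n' "$total"
}
```

Applied to the five endpoints listed on the slide, `total_max_jobs` gives 130, which is the largest number of pilot jobs this single VO box would keep queued or running across all sites under that configuration.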
Summary
• Software deployment on shared area
  – Bottleneck & site-level single point of failure
  – site-level SW corruption requires admin intervention
• Torrent model: AliTorrent
  – Removes bottleneck & site-level single point of failure
  – Eliminates a site service & reduces site management
  – Performance capabilities meet typical ALICE workflow & site requirements
  Eliminates requirement for site-specific VO box
• We have leveraged this capability to demonstrate AliEn workflow for opportunistic use of multiple OSG resources
• AliTorrent is a site-friendly tool for opportunistic (or general) use
  – don't ask the site to "do" something: install or manage a service
  – ask the site to "not do" something: block torrent use