Employing peer-to-peer software distribution in ALICE Grid Services to enable opportunistic use of OSG resources

R. Jeff Porter & Iwona Sakrejda (LBNL)
Costin Grigoras, Latchezar Betev, Federico Carminati, Pablo Saiz (CERN)

Outline
• Introduction
  – ALICE Grid & AliEn
  – Traditional Software Deployment
• BitTorrent software deployment: AliTorrent
• Testing on OSG resources
• Performance characteristics
• Target multiple resources
• Summary

R. Jeff Porter, LBNL

AliEn & the ALICE Grid
• ALICE Grid Facility
  – In production since ~2004
  – More than 80 sites
  – Small, central operations team
    • & site admins around the world
• AliEn: ‘Alice Environment’
  – Central Services at CERN
    • Task Queue, Job & File Catalog
    • Software management (build & deployment) (this talk)
  – VO box site-specific operations
    • Job Agent submission
    • Software deployment
    • Resource monitoring
  – Grid Monitoring with MonALISA
  – Data Management
    • AliEn FileCatalog
    • Grid-Enabled XRootD SEs
[Figure: ALICE Computing Grid]
Yesterday’s talk by Pablo Saiz: “AliEn: ALICE Environment on the GRID”

Traditional Software Deployment
• Managed Central Software
  – Routine software builds
  – Catalogued & stored in AliEn
• Grid Site SW Operations
  – Jobs request SW from VO box service
  – VO box PackMan service pulls SW
  – SW deployed on shared area
  – WNs read SW from shared area
• Shared SW area
  – Resource bottleneck
  – Single point of failure
  – Can require active repairs per site
[Diagram: Central Build Servers (SLC, MacOS, Ubuntu) build AliEn, AliRoot, …; packages go to the AliEn Catalogue and ALICE:CERN:SE; at the ALICE Grid Site, WNs send requests to the VO Box and read from the Shared SW Area (NFS/GPFS/AFS)]

Basic Torrent details
• Tracker: map of seeders:files
• Seeders: have & serve file
• Leeches: pull & serve file chunks
• Chunk hashes of the original file provide data integrity
(Source: C. Grigoras, ALICE T1/T2 Workshop)
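The integrity check on the "Basic Torrent details" slide can be illustrated with a short sketch. In BitTorrent, the torrent metadata carries a SHA-1 digest for each fixed-size piece, and a client accepts a downloaded piece only if its digest matches. The version below is a simplified stand-in, not AliTorrent code: the function names `piece_hashes` and `verify_pieces` are hypothetical, the published digests are assumed to sit in a plain text file (one per line), and GNU coreutils `split` and `sha1sum` are assumed to be available.

```shell
#!/bin/bash
# Hypothetical helper: print one SHA-1 digest per fixed-size piece of a file.
# Usage: piece_hashes FILE PIECE_SIZE_BYTES
piece_hashes() {
    local file=$1 piece_size=$2 tmpdir piece
    tmpdir=$(mktemp -d) || return 1
    # Cut the file into pieces named piece.aaaa, piece.aaab, ...
    split -a 4 -b "$piece_size" "$file" "$tmpdir/piece."
    # The glob expands in sorted order, which matches the piece order.
    for piece in "$tmpdir"/piece.*; do
        sha1sum "$piece" | cut -d' ' -f1
    done
    rm -rf "$tmpdir"
}

# Hypothetical helper: succeed only if every piece of FILE matches the
# published digest list (one digest per line, in piece order).
# Usage: verify_pieces FILE PIECE_SIZE_BYTES EXPECTED_HASHES_FILE
verify_pieces() {
    local file=$1 piece_size=$2 expected=$3
    [ "$(piece_hashes "$file" "$piece_size")" = "$(cat "$expected")" ]
}
```

A seeder would publish the digest list once, and each peer would run the equivalent of `verify_pieces` on what it received. Real clients check piece by piece, so a single corrupted chunk can be fetched again from another peer without repeating the whole download.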

AliTorrent Software Deployment
• Managed Central Software
  – Additional AliEn torrent store
  – Catalogue, seeder & tracker
• Grid site SW deployment
  – VO Box is not involved
  – Jobs pull SW from:
    • alitorrent.cern.ch seeder
    • local peers
    • other sites as available
  – though typically behind a FW
• Resolves:
  – Bottleneck & single point failures
  – Site level maintenance of shared area
[Diagram: Central Build Servers (SLC, MacOS, Ubuntu) build AliEn, AliRoot, …; packages go to the AliEn Catalogue and to alitorrent.cern.ch (Seeder | Tracker); at the ALICE Grid Site, WN requests go to alitorrent.cern.ch and to other WNs, not to the VO Box]

AliTorrent Details
• Torrent Features
  – Distributed Hash Tables
    • Decentralized seeder lookup: seeders are trackers
  – Peer Exchange
    • Local peer information is propagated by seeders
  – Local Peer Discovery
    • Multicast to discover peers on same network
• ALICE/AliEn Features
  – Total software download is ~300-400 MB
  – Enabled per site (VO box) with AliEn LDAP flag

AliTorrent as an opportunity
• AliTorrent use in AliEn
  – Reduces problems associated with SW deployment
  – Simplifies site operations by removing a VO box service
    • Does not eliminate VO box model from ALICE Grid
    • Does eliminate site-specific VO box requirement
• Elimination of site-specific VO box allows for remote use of other Grid resources: OSG

AliEn-OSG Interface
• Developed in 2009 by ALICE & OSG teams
  – In production at ALICE-US sites since 2009
  – Poster @ CHEP’09
• Features added to AliEn
  – VO box as a Condor-G submit host to local OSG CE
  – Condor module in AliEn
    • Builds submit files to launch JobAgents
    • Monitors Condor queue for site utilization
  – Specifications put into AliEn LDAP
    • OSG-CE endpoint
    • Job occupancy targets
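As an illustration of the "builds submit files to launch JobAgents" step, here is a minimal sketch of a Condor-G submit description generated from a CE endpoint and a job count. It is not the AliEn Condor module's actual output, which the slides do not show. The function name `make_submit_file`, the wrapper script name `agent_wrapper.sh`, and the output file names are hypothetical; `universe = grid`, `grid_resource`, and `queue` are standard HTCondor submit commands.

```shell
#!/bin/bash
# Hypothetical helper: write a Condor-G submit description to stdout.
# Usage: make_submit_file GRID_TYPE CE_ENDPOINT WRAPPER_SCRIPT NUM_AGENTS
make_submit_file() {
    local grid_type=$1 endpoint=$2 wrapper=$3 count=$4
    # \$(Cluster) and \$(Process) are escaped so that HTCondor, not the
    # shell, expands them when the jobs are submitted.
    cat <<EOF
universe      = grid
grid_resource = $grid_type $endpoint
executable    = $wrapper
output        = agent.\$(Cluster).\$(Process).out
error         = agent.\$(Cluster).\$(Process).err
log           = agent.log
queue $count
EOF
}

# Example: five JobAgent wrappers for the Carver CE (gt2 endpoint).
make_submit_file gt2 carvergrid.nersc.gov/jobmanager-pbs ./agent_wrapper.sh 5
```

Redirecting the output to a file and passing it to `condor_submit` would queue the agents. The job count would come from the occupancy target stored in LDAP, minus whatever the Condor queue already shows for that site.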

Extending OSG-AliEn Interface for AliTorrent
• Without torrent (shared SW area)

    #!/bin/bash
    /project/projectdirs/alice/alicedev/alien.v2-19.168/bin/alien RunAgent

• With torrent (job sandbox)

    #!/bin/bash
    # Install AliEn into a per-job directory, then start the JobAgent.
    DIR=`pwd`/alien_installation.$$
    mkdir -p "$DIR"
    cd "$DIR" || exit 1
    wget http://alien.cern.ch/alien-installer -O alien-auto-installer
    chmod +x alien-auto-installer
    ./alien-auto-installer -type workernode -batch -torrent -install-dir "$DIR/alien"
    ./alien/bin/alien RunAgent

• With torrent on OSG (OSG TMP space): replace `pwd` above with $OSG_WN_TMP

    DIR=$OSG_WN_TMP/ALICE/alien_installation.$$

First Results
• Initial goal to target NERSC Carver system
  – Carver: NERSC IBM iDataPlex HPC system with OSG CE
  – ALICE obtained a small allocation for development work
  – No VO box would be allowed on the system
• Conferred with NERSC security about torrent use
  – Earlier requests for use of torrent were ‘tabled’ for review
  – Result: torrent not banned explicitly, but NERSC relies on dynamic filtering: “go ahead and try”
• It just worked

    ---- log file ----
    WNHost = "c1554.nersc.gov";
    Getting the torrent
    It looks like the default IP address is 128.55.61.240
    Seeding the torrent
    Extracting the files
    Testing /osg/carver/wn_tmp/ALICE/alien_installation.20613/alien.v2-19.169/
    AliEn workernode installation took approximately 133 seconds
    Installation finished!

Performance
• Job latency from SW install: Traditional vs AliTorrent
  – Traditional install:
    • ‘copy’ from shared area
    • lone job latency: ~1 sec
    • steady increase vs #-jobs
  – AliTorrent install:
    • CERN to LBNL/NERSC
    • lone job latency: ~360 sec
    • sharp decrease vs #-jobs
• AliTorrent advantage
  – small numbers of long jobs where SW install time is insignificant
  – sites with large numbers of concurrent ALICE jobs
[Figure: install time (seconds) vs. number of concurrent jobs, for torrent install and ‘cp’ from shared area; annotation: “increase with #-jobs”]
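The claim that install time is insignificant for long jobs can be checked with a little arithmetic: the overhead is the install time divided by the total wall time (install plus run). The sketch below uses the ~360 s lone-job torrent latency quoted on the Performance slide. The helper name `install_overhead_pct` and the two job run times are illustrative assumptions, not measurements from the talk.

```shell
#!/bin/bash
# Hypothetical helper: install time as a percentage of total wall time.
# Usage: install_overhead_pct INSTALL_SECONDS JOB_RUN_SECONDS
install_overhead_pct() {
    awk -v i="$1" -v r="$2" 'BEGIN { printf "%.1f\n", 100 * i / (i + r) }'
}

# ~360 s is the lone-job torrent install latency from the slide;
# the run times below are assumed for illustration only.
install_overhead_pct 360 21600   # 6-hour job     -> 1.6
install_overhead_pct 360 600     # 10-minute job  -> 37.5
```

The lone-job figure is the worst case for the torrent install. Since the slide reports a sharp decrease with the number of concurrent jobs, the overhead on a busy site would be lower than these numbers suggest.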

Expand to OSG Sites
• Single VO box submission to OSG pilot job factory
  – OSG-CE Endpoints
    • Managed as a list
    • Endpoint:MaxJobs:RSL
  – Test on selected sites
    • OSG-ITB sites
    • ALICE-US sites: LBNL & LLNL
  – Close SE
    • output destination: 700 TB SE @ NERSC

    Gridresources \
      carvergrid.nersc.gov/jobmanager-pbs:50:(jobType=single)(queue=serial) \
      itbv-ce-pbs.uchicago.edu/jobmanager-pbs:5:none \
      glcc88.ucllnl.org/jobmanager-pbs:20:none \
      osgitb1.nhn.ou.edu/jobmanager-condor:5:none \
      pdsfgrid.nersc.gov/jobmanager-sge:50:none

Submitted jobs, grouped by site:

    NERSC/PDSF
      10592 gt2 pdsfgrid.nersc.gov/jobmanager-sge https://pdsfgrid4.nersc.gov:46490/25251/1337447241/
    LLNL
      10628 gt2 glcc88.ucllnl.org/jobmanager-pbs https://glcc88.ucllnl.org:40894/8704/1337454774/
      10629 gt2 glcc88.ucllnl.org/jobmanager-pbs https://glcc88.ucllnl.org:40902/9292/1337454810/
    UC-ITB
      10645 gt5 itbv-ce-pbs.uchicago.edu/jobmanager-pbs https://itbv-ce-pbs.uchicago.edu:36913/3026418950974269075/
      10647 gt5 itbv-ce-pbs.uchicago.edu/jobmanager-pbs https://itbv-ce-pbs.uchicago.edu:36913/3026418950974269075/
    NERSC/Carver
      10664 gt2 carvergrid.nersc.gov/jobmanager-pbs https://carvergrid.nersc.gov:60888/2359/1337525731/
      10665 gt2 carvergrid.nersc.gov/jobmanager-pbs https://carvergrid.nersc.gov:60891/2358/1337525731/
    OU-ITB
      10674 gt5 osgitb1.nhn.ou.edu/jobmanager-condor https://osgitb1.nhn.ou.edu:64981/933117525531598993/
      10675 gt5 osgitb1.nhn.ou.edu/jobmanager-condor https://osgitb1.nhn.ou.edu:64981/933117525531598993/

Towards production operations
• Proof of principle is complete
  – AliEn jobs run at sites, submitted by a remote VO box
  – Successful running on multiple sites from a single VO box
  – Plan to target OSG production sites
• Site policy caution:
  – limits can be broad
  – has hampered wider use
• ALICE experience:
  – initial distrust turns into recognition of the approach’s merit
  – simplifies site operations
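The `Endpoint:MaxJobs:RSL` entries in the Gridresources list can be split with plain shell parameter expansion. The sketch below is not AliEn code. The function name `parse_gridresource` and the `GR_*` variable names are hypothetical, it assumes neither the endpoint nor the job limit contains a colon, and it assumes that `none` means no extra RSL is passed.

```shell
#!/bin/bash
# Hypothetical helper: split one Endpoint:MaxJobs:RSL entry into
# GR_ENDPOINT, GR_MAXJOBS and GR_RSL.
# Assumes the endpoint and job limit contain no ':'; the RSL is the rest.
parse_gridresource() {
    local entry=$1 rest
    GR_ENDPOINT=${entry%%:*}   # text before the first ':'
    rest=${entry#*:}           # text after the first ':'
    GR_MAXJOBS=${rest%%:*}     # text before the second ':'
    GR_RSL=${rest#*:}          # everything after the second ':'
    # Assumption: "none" in the list means no extra RSL.
    if [ "$GR_RSL" = "none" ]; then GR_RSL=""; fi
    return 0
}

for entry in \
    'carvergrid.nersc.gov/jobmanager-pbs:50:(jobType=single)(queue=serial)' \
    'osgitb1.nhn.ou.edu/jobmanager-condor:5:none'
do
    parse_gridresource "$entry"
    printf '%s max=%s rsl=%s\n' "$GR_ENDPOINT" "$GR_MAXJOBS" "${GR_RSL:-<none>}"
done
```

Keeping the RSL as the remainder after the second colon means an RSL string with several parenthesized attributes, as in the Carver entry, passes through unchanged.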

Summary
• Software deployment on shared area
  – Bottleneck & site-level single point of failure
  – Site-level SW corruption requires admin intervention
• Torrent model: AliTorrent
  – Removes bottleneck & site-level single point of failure
  – Eliminates a site service & reduces site management
  – Performance capabilities meet typical ALICE workflow & site requirements
  – Eliminates requirement for site-specific VO box
• We have leveraged this capability to demonstrate AliEn workflow for opportunistic use of multiple OSG resources
• AliTorrent is a site-friendly tool for opportunistic (or general) use
  – don’t ask the site to “do” something: install or manage a service
  – ask the site to “not do” something: block torrent use
