University of Iowa :: Center for Bioinformatics and Computational Biology (CBCB)

Welcome GalaxyCzars!

A few reminders:
- Have you checked out the wiki page? http://wiki.g2.bx.psu.edu/Community/GalaxyCzars
- Survey results are posted. Check out [email protected].
- This call will be recorded. The playback & chats will be posted for later on Blackboard.
- Feel free to play around and get used to the interface: polling & voting options are on your left hand side. We might use the Blackboard polling features.
- To open up your audio line, you must select the "Talk" button on your left hand side. There can only be 6 simultaneous talkers at once, so please leave it unselected unless needed.
- When talking, please let us know your name & where you are from.
- Technical problems? Please use the chat room for help. We will do our best.
Web Conference Agenda
- Go over agenda.
- Logistics: Address how we want to tackle these calls. Generic agenda. Frequency of calls.
- Group Goals: What do we want to accomplish with this group beyond discussions/sharing?
- Presentation: Galaxy at Iowa. Discuss our issues with big data, finding the right storage server solution, and how NGS tools take in/output data.
- Breakout at the Galaxy Community Conference?
- Open Mic & recruit new volunteers for the next call.
Proposed Generic Agenda for future calls
- 20 min: Galaxy in Our Town. A presentation from a local galaxy institution on what they are doing: have someone walk through their use cases and pain points, or a problem they are troubleshooting.
- 20 min: Galaxy Today/Tomorrow. A presentation on a galaxy coding item, either from the Penn State team or from someone working on a new feature or customization.
- 20 min: Open Mic - Discussion & make point-to-point connections. Organize smaller breakouts if someone wants to host a call specific to an issue.
Custom Galaxy Deployment
And the tale of big data storage …

Ann Black-Ziegelbein, Senior Application Developer, University of Iowa
  Iowa Initiative in Human Genetics / Center for Bioinformatics and Computational Biology
  [email protected]
Glenn Johnson, Senior System Administrator, University of Iowa
  High Performance Compute Cluster
  [email protected]
Ben Rogers, Director of Research Computing, University of Iowa
  High Performance Compute Cluster
  [email protected]

Background: Galaxy @Iowa
Why, Who, Where, What
Why Bring Galaxy to Iowa
What is our galaxy deployment? Why host a local Galaxy deployment?
- Customizable deployment. For example:
  - Expose additional custom tools and databases. Lazy in exposing additional tools/reference indices/etc., upon request.
  - Control version of tools; tune how tools are exposed.
  - Local UCSC exome browser and local IGV capable of viewing local Galaxy datasets.
- Set of tools & set of reference genomes exposed compared to the public Galaxy; full set of public Galaxy server functionality.
- No tight data quota caps; local data storage and transfer.
- Hosted on campus, with access only to the University of Iowa Campus / Iowa Hospitals & Clinics.
- Sub-cluster dispatching compute jobs to HPC SGE.
- Published Human exome analysis pipeline is available.
- Alpha Phase Deployment, but It's Alive!!!!
Who is using Galaxy @Iowa
- Have about 50 "alpha user" accounts created, but only 10 - 12 active regular users.
- Future Publicity: holding off until we can work through some remaining issues.
- Will be holding a short bioinformatics course @ Iowa, all on Galaxy.
- Have a local support wiki as well as a listserv for user questions/problems. JIRA ticket tracking of problems.
- Use case scenarios we expect to support:
  - Experimenters: trying out tools using a sandbox, as well as our bioinformatics core.
  - Researchers: exposing tools and research data from our bioinformatics research teams.
  - Core: common, well tested workflows will be built and published by our core.
  - Published Workflows: Exome analysis, RNA-Seq analysis, etc.
  - New Tools.
Where? Our Galaxy Topology
[Topology diagram. Recoverable details:]
- Web tier: Apache / SSL with proxy redirection to the Galaxy server; HawkID authentication with campus LDAP.
- Sftp access to $GALAXY_HOME via OpenSSH, with a chroot jail.
- Galaxy Web 01 - 04 (CentOS 5.4, 24 cores, 24 GB RAM each), with a Postgres galaxydb database; $GALAXY_HOME on an NFS mount.
- Galaxy Job Manager: job runners, user account management, sftp user chroot jail, Linux tools.
- Storage: Lustre scratch, /scratch/Galaxy.
- SGE DRM (Workload Manager): schedule : prioritize : assign.
- Iowa Campus High Performance Compute Cluster, Infiniband connected: CentOS 5.4 nodes with 12 cores & 24 GB and with 20 cores & 144 GB, 3508 processor cores in total; Galaxy-accessible: 128. Tools, database, and binaries available to the cluster.
What has Iowa Customized for Galaxy
- Moved workflows to the top of the tool navigation bar; exposed a limited subset of tools.
- Tool customizations & tweaks to existing tools (such as memory, threads, etc. for our HPC).
- New University of Iowa tools.
- Support for sftp transfers (no ftp), auto-creating linux accounts from Galaxy user accounts.
- Support of auto emails for accounts; expose a custom to/from email configuration.
- Linking & support for opening a JIRA ticket from within Galaxy.
- Watcher scripts to dynamically change file perms as necessary.
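The watcher scripts above are Iowa-internal and not shown in the deck; a minimal sketch of the idea, assuming the job is simply to walk a data directory and normalize permissions so Galaxy can read sftp-uploaded files (the paths and modes here are hypothetical), might look like:

```python
import os
import stat

def fix_perms(root, file_mode=0o640, dir_mode=0o750):
    """Walk `root` and normalize permissions so the Galaxy user can
    read sftp-uploaded files. Returns the paths that were changed."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            if stat.S_IMODE(os.stat(p).st_mode) != dir_mode:
                os.chmod(p, dir_mode)
                changed.append(p)
        for f in filenames:
            p = os.path.join(dirpath, f)
            if stat.S_IMODE(os.stat(p).st_mode) != file_mode:
                os.chmod(p, file_mode)
                changed.append(p)
    return changed
```

In production such a script would run from cron or an inotify watcher rather than on demand.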
The Problem: What to do when your storage solution fails. Galaxy @Iowa.

How we got into this mess …
What the heck is Lustre & How Iowa had configured it: Background
- From Wikipedia: "Lustre is a parallel distributed file system, generally used for large-scale cluster computing."
- In theory it should scale well and generally be fast under large load.
- Our existing HPC Storage Solution was Lustre.
- Hardware: HP MSA 2312sa G2 / P2000:
  - OSS (Lustre Object Storage Servers): 4 enclosures consisting of 48 1T SATA drives, configured as 4 RAID-6 sets (10 data + 2 parity disks each).
  - MDS (Lustre Meta Data Server): 2 enclosures with 12 300G SAS drives configured as a RAID volume.
Analysis: How Lustre failed us
- Typical NGS tooling performed a large volume of streaming writes, which seemed to exacerbate the situation.
- Lustre sends bulk IO in 1024K chunks, but the underlying array had a stripe width of 640K. This meant the Iowa configuration was not aligned, causing extra writes.
- The OSS servers needed to work harder with all types of writes, and would ramp to a high load, often >200.
- The OSS machines would become unresponsive, and the Lustre servers would evict the clients. Sometimes the job IO would slow, and sometimes user jobs would die because the response times were too long.
- While performance could be good in spurts, it was not ideal.
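The misalignment above is simple arithmetic: an IO stream stays stripe-aligned only when the IO size is a multiple of the full data stripe, and here 1024K IOs landed on a 640K stripe (10 data disks at a 64K stripe unit, matching the su=64k/sw=10 geometry on the backup slide). A quick sketch of the check:

```python
def stripe_aligned(io_kib, stripe_unit_kib, data_disks):
    """Return (full data stripe in KiB, whether an IO of io_kib KiB
    is a whole multiple of that stripe, i.e. avoids read-modify-write)."""
    full_stripe = stripe_unit_kib * data_disks
    return full_stripe, io_kib % full_stripe == 0

# Iowa's arrays: 64 KiB stripe unit, RAID-6 with 10 data disks
full, ok = stripe_aligned(io_kib=1024, stripe_unit_kib=64, data_disks=10)
print(full, ok)  # 640 False: every 1024K Lustre IO straddles a stripe boundary
```

Each misaligned write forces the RAID controller to read, recompute, and rewrite parity for the partially touched stripe, which is the "extra writes" the OSS servers were drowning in.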
And then what happened.
- With more load on Iowa's HPC Lustre (not just Galaxy), it was decided to look at different storage solutions for Galaxy, in addition to reconfiguring Lustre. Reconfiguring the Lustre configuration meant destroying all data.
- Decided to compare the following storage types for Galaxy:
  - Existing Lustre storage, reconfigured for more optimal Lustre writes.
  - Gluster (not to be confused with Lustre). "Gluster is a distributed/parallel/virtual filesystem. It lets you aggregate the capacity and performance of multiple local filesystems on multiple servers and present those files in a single unified namespace view." – from the gluster.org community.
  - ZFS: moved Galaxy storage off Lustre and onto a ZFS box.
Benchmarking New Storage Solutions: Galaxy @Iowa
Not really Apples to Apples…. Did ZFS really have a chance??
What was benchmarked & compared:
- Lustre w/ infiniband
- Gluster w/ infiniband
- NFS/ZFS w/ 2 Gb Ethernet
- NFS/ZFS w/ 4 Gb Ethernet
- Sniff tested and aborted: ZFS over IP over Infiniband
We knew ZFS would not scale as well. It is not distributed, nor tested using infiniband.
The question was not: "Would ZFS scale for all HPC user access?"
The question was: "How quickly would ZFS performance degrade, and would it be good enough for our Galaxy goals?"
ZFS has cheap, easy expansion, is easy to manage, and needs minimal client CPU. And we already had the box & 50 TB of ZFS Storage!
Our Benchmark Test
- Benchmarking is always a work in progress... never quite completed. And I am not claiming to be a performance expert. Would have liked to do more, but competed with others for cores and time investment.
- While we were at it, we also looked at how to optimize each job's storage architecture.
- Tested 1, 30, 60, 90 concurrent clients.
- BWA (aln + sampe) paired-end read alignment step on Human Exome data.
- Variations:
  - Input (fastq): 1. LI: copy to local disk for local reads. 2. RI: leave on remote storage for remote reads. 3. SI: have aln input local, sampe input remote.
  - Reference indices: 1. LR: copy compressed archive to local disk, extract & reference locally. 2. RR: leave on remote storage for remote reference. 3. 2R: leave on secondary remote storage (ZFS).
  - Output: CO: streaming writes to local disk, copy to remote upon finish. SO: streaming writes to remote storage.
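The variation codes in the results tables (e.g. RIRRSO) are just one input choice, one reference choice, and one output choice concatenated. A small sketch that enumerates the matrix (the SI description reflects our reading of the slide, an aln-local/sampe-remote split):

```python
from itertools import product

inputs = {"LI": "copy fastq to local disk",
          "RI": "leave fastq on remote storage",
          "SI": "aln input local, sampe input remote"}
refs = {"LR": "copy archive local, extract & reference locally",
        "RR": "reference on remote storage",
        "2R": "reference on secondary remote storage (ZFS)"}
outputs = {"CO": "stream writes locally, copy to remote on finish",
           "SO": "stream writes straight to remote storage"}

# One benchmark variation per (input, reference, output) combination
variations = {i + r + o: (inputs[i], refs[r], outputs[o])
              for i, r, o in product(inputs, refs, outputs)}
print(len(variations))  # 18 possible combinations; the slides report 13 of them
```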
Benchmarking Challenges
- Yeah, it is time consuming: writing scripts, re-writing scripts, babysitting scripts, understanding failures, analyzing results.
- Getting cores on the cluster: competed with all users for benchmarking time.
- Clean, consistent runs. Reproducible numbers… keeping people away from storage during test runs.
- Not claiming our benchmarking is perfect, but it gives us data to work with.
- https://bitbucket.org/iihgbiocore/ngssgeloadgenerator/overview
Galaxy @Iowa: Our goals
Our targets:
- Minimum: Good performance for 60 concurrent clients.
- Great: Good performance for 100 concurrent clients.
- Yowzah!: Good performance for 150 concurrent clients.
Also: good performance without having to tweak all Galaxy tool wrappers to make things optimal.
Oh yeah: it must be stable.
The Results: Drum roll …
[Chart: average job run time vs. number of concurrent clients for gluster, ZFS 4 Gb, and ZFS 2 Gb]
Taking a closer look… The raw data:

Avg. Job Run Time in Seconds, by Num Clients:

Test Variation     |       1 |      30 |      60 |       90
zfs_4Gb_LILRCO     | 1843.00 | 2292.00 | 3049.00 |  4002.00
gluster_IB_SILRSO  | 1872.00 | 2450.00 | 3601.00 |  3583.00
gluster_IB_LILRSO  | 1998.00 | 3014.00 | 3390.00 |  4242.00
zfs_4Gb_LILRSO     |    3222 |    4405 |    6934 |    10656
gluster_IB_SI2RSO  | 1607.00 |    0.00 | 2580.00 |  6424.00
gluster_IB_RI2RSO  | 1980.00 | 2285.00 | 2453.00 |  2642.00
gluster_IB_RILRSO  |    1760 |    3282 |    7500 |     9133
zfs_4Gb_RILRSO     | 1495.00 | 2080.00 | 2146.00 |  4447.00
gluster_IB_RIRRSO  | 1511.00 | 2206.00 | 2442.00 |  3360.00
zfs_4Gb_RIRRSO     | 1568.00 | 1979.00 | 3433.00 |  5266.00
gluster_IB_SI2RCO  |    1567 |    2631 |    4538 |     6362
gluster_IB_SILRCO  | 3077.00 | 3640.00 | 7014.00 |  7576.00
gluster_IB_LILRCO  |    1517 |    2487 |    4470 |     6671

Chart callouts:
- By off-loading reference indices (gluster): saved time.
- Moving to input copying seems to help (zfs).
- Writing local and then copying hurt, or split input hurt (gluster?).
- Reference location may not have much effect (zfs?).

* Remember … the savings adds up over 20+ steps of a pipeline!
The take aways
- In general, jobs on gluster were more sensitive to test variation.
- zfs numbers appear better than expected. Would like to run the ZFS "worst case" (RIRRSO) again.
- zfs scaled better.
- Off-loading reference indices made the largest impact. With tuning, got job run time to decrease from ~7500 seconds to ~3500 seconds for 90 clients!!

Avg. aggregated job run time over all test variations (seconds):

Num Clients |    1 |   30 |   60 |   90
gluster     | 1860 | 3626 | 6473 | 7810
zfs         | 1921 | 2449 | 3396 | 4326

Are you a details person? You can download a zip of all the reports, with metrics collected during the test runs:
https://dl.dropbox.com/u/90475071/bwa-sampe_reports_pdf.zip
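The scaling gap in the aggregated table can be put as a ratio per client count, a quick worked calculation on the numbers above:

```python
clients = [1, 30, 60, 90]
gluster = [1860, 3626, 6473, 7810]  # avg job run time, seconds
zfs     = [1921, 2449, 3396, 4326]

for n, g, z in zip(clients, gluster, zfs):
    print(f"{n:3d} clients: gluster/zfs = {g / z:.2f}x")
# At 1 client gluster is slightly faster (ratio just under 1.0);
# by 90 clients the average zfs job finishes ~1.8x sooner.
```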
What happened to Lustre?
- Did not do additional BWA benchmarking on Lustre, since we could not get the raw performance out of it that we wanted to make it worth our time.
- On a 5GB dd test using a single raid array, we saw the following:
  - Raw array – 600MB/s
  - Lustre – 60MB/s (with additional tuning, got closer to 90MB/s)
- From discussions in the community: Lustre is more expensive and harder to manage, although also more flexible.
- Lustre & the HP MSA might not be a good marriage. Lustre likes certain hardware?
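Throughput figures like these are just bytes moved over elapsed time; a small sketch that recomputes the rate from a GNU-dd-style summary line (the strings below are illustrative examples in dd's format, not measurements from these slides):

```python
import re

def dd_throughput_mb_s(dd_summary):
    """Parse a GNU dd summary line such as
    '5368709120 bytes (5.4 GB) copied, 89.5 s, 60.0 MB/s'
    and recompute throughput in MB/s (decimal, matching dd's units)."""
    m = re.match(r"(\d+) bytes .*copied, ([\d.]+) s", dd_summary)
    nbytes, secs = int(m.group(1)), float(m.group(2))
    return nbytes / secs / 1e6

# A 5 GB write taking ~89.5 s is the ~60 MB/s class of result Lustre showed
print(round(dd_throughput_mb_s(
    "5368709120 bytes (5.4 GB) copied, 89.5 s, 60.0 MB/s"), 1))  # 60.0
```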
Choosing Storage: Why Gluster? What about ZFS?
- Gluster pros:
  - Gaining industry traction: Redhat purchased gluster & is putting weight behind it.
  - Easy to setup and manage.
  - Cheap expansion: can add in more hard drives as additional "bricks" as needed.
  - Strong scalability with performance.
- Gluster cons:
  - Resource contention: the glusterfsd client process can consume a core on a compute node. This can affect the client workload.
  - Failed disks, servers, or outages: simulated a failure by attaching gdb to a client glusterfsd process; this affects that client, but not the rest of the gluster file system.
  - The gluster server processes can drive a full ethernet link.
- What about ZFS?
  - Current plan: use for reference data libraries and indices.
  - Also might be used as a place to hold more "permanent" data which might eventually get archived.
  - Would like to benchmark 10Gb ethernet w/ ZFS.
Iowa's plan for managing Storage
- Auto-cleanup of datasets older than 30 days will be in place. Emails will be sent to users in advance, allowing them to take action.
- No hard quota caps for now.
- Galaxy storage will not be backed up.
- Wait, Monitor, and Iterate on this plan. Talk to us in 6 months and it could be different.
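The cleanup job itself is not shown in the deck; a minimal sketch of the age-based pass, assuming a plain directory walk over dataset files (the real cron job, paths, and the warn-by-email step are Iowa's own), might look like:

```python
import os
import time

MAX_AGE_DAYS = 30

def stale_datasets(root, now=None, max_age_days=MAX_AGE_DAYS):
    """Return dataset files under `root` not modified in `max_age_days` days.
    A real job would email the owners first, then delete on a later pass."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale
```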
Our local Galaxy: What's next? Our future goals:
- Wrap up data storage evaluation & migrate data.
- Integrate with the DNA Core sequencers to directly fit sequencer data into Galaxy for analysis.
- Develop and publish more common bioinformatics-based workflows.
- Expose additional Galaxy features and tools; provide feedback to the community on tools.
- InCommon federated identity management support.
- Version tracking of tools and multi-version tool support. We need a clear audit trail for workflows.
- Lots of little fit & finish items.
Upcoming Events

Bioinformatics Short Course, August 1 - 3, 2012
Mutation Detection Using Massively Parallel Sequencing: From Data Generation to Variant Annotation
Brought to you by: Iowa Initiative in Human Genetics. The course will leverage Iowa's Galaxy Deployment.

Massively parallel DNA sequencing technologies have ushered in the next wave of the genomics revolution. The clinical application of these technologies will make personalized genomic medicine a reality; in the research laboratory, these technologies are making innovative approaches to genome-wide investigations routine. In the summer of 2012, the Iowa Institute of Human Genetics will offer a Bioinformatics Short Course. The course will focus on the most popular next-generation sequencing platforms, the Illumina HiSeq and MiSeq systems. Upon completion of this course, participants will understand the design of a next generation sequencing experiment and the workflow to achieve a particular result. A series of lectures to introduce basic concepts will be interwoven with practical examples of data generation and analysis. Hands-on sessions will enable participants to analyze data from its generation to interpretation. Participants will be required to bring their own laptops. Enrollment is limited.

http://www.medicine.uiowa.edu/humangenetics/bioinformaticscourse
Please share with us your experiences with storage – we would love to hear about & learn from them!!
Our Next Call – We Need You!
Proposed Generic Agenda:
- 20 min: Galaxy in Our Town. A presentation from a local galaxy institution on what they are doing: have someone walk through their use cases and pain points, or a problem they are troubleshooting.
- 20 min: Galaxy Today/Tomorrow. A presentation on a galaxy coding item, either from the Penn State team or from someone working on a new feature or customization.
- 20 min: Open Mic - Discussion & make point-to-point connections. Organize smaller breakouts if someone wants to host a call specific to an issue.
Backup: More Info on Gluster
- GlusterFS is a distributed file system that works over IP over Infiniband.
- Same hardware as lustre, with less disks. Can achieve higher throughput than lustre.
- Relies on the backend file system for metadata, allocation, etc. Volume metadata is managed by the clients.
GlusterFS: managed XFS file system alignment with the RAID stripe:

  mkfs.xfs -f -d su=64k,sw=10 -l su=64k /dev/dm-0

- This gives consistent write throughput of 500 - 600 MBps per brick for all file sizes.
- The read throughput is about 600 - 700 MBps per brick.
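The su/sw values above follow directly from the RAID geometry: su is the per-disk chunk (64 KiB) and sw is the number of data-bearing disks (12 drives in RAID-6, minus 2 parity, is 10). A sketch that derives the flags, and the full data stripe the earlier Lustre IO was missing:

```python
def xfs_mkfs_flags(device, chunk_kib, disks, parity):
    """Derive mkfs.xfs stripe flags from RAID geometry:
    su = per-disk chunk size, sw = data disks (total minus parity).
    Also returns the full data stripe in KiB (su * sw)."""
    sw = disks - parity
    su = f"{chunk_kib}k"
    cmd = f"mkfs.xfs -f -d su={su},sw={sw} -l su={su} {device}"
    return cmd, chunk_kib * sw

cmd, full_stripe_kib = xfs_mkfs_flags("/dev/dm-0", chunk_kib=64, disks=12, parity=2)
print(cmd)              # matches the command on this slide
print(full_stripe_kib)  # 640 KiB, the stripe width the 1024K Lustre IOs straddled
```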
GlusterFS aggregate bandwidth from IOR with 32 1G files:
- Max Write: 6388.65 MBps
- Max Read: 9368.44 MBps