University of Iowa :: Center for Bioinformatics and Computational Biology

Welcome GalaxyCzars!! (and wannabes … like me)

A few reminders:
- Have you checked out the wiki page? http://wiki.g2.bx.psu.edu/Community/GalaxyCzars
- Survey results are posted.
- Feel free to play around and get used to the interface. We might use the polling features of Blackboard: voting options are on your left hand side.
- This call will be recorded & posted for playback later on. The recording includes all chats!
- When talking, please let us know your name & where you are from.
- To open up your audio line, you must select the "Talk" button on your left hand side. There can only be 6 simultaneous talkers at once, so please leave it unselected unless needed.
- Technical problems? Please use the chat room for help. We will do our best.

Web Conference Agenda
- Mic check
- Logistics: Address how we want to tackle these calls.
  - Go over generic agenda.
  - Frequency of calls
- Group Goals: what do we want to accomplish with this group beyond discussions/sharing?
- Presentation: Galaxy at Iowa: Discuss our issues with big data, how NGS tools take in/output data, and finding the right storage server solution.
- Open
- Break out at Galaxy Community Conference & recruit new volunteers for the next call.

Proposed Generic Agenda for future calls
- 20 min: Galaxy Today/Tomorrow. Either a presentation on galaxy from the Penn State team, or from someone working on a new feature or customization coding item.
- 20 min: Galaxy in Our Town. A presentation from a local galaxy institution on what they are doing, or have someone walk through their use cases and pain points, or a problem they are troubleshooting.
- 20 mins: Open Mic Discussion & make point to point connections. Organize smaller breakouts if someone wants to host a call specific to an issue.

University of Iowa Custom Galaxy Deployment
And the tale of big data storage …

Ann Black-Ziegelbein, Senior Application Developer, University of Iowa, Iowa Initiative in Human Genetics, Center for Bioinformatics and Computational Biology
Glenn Johnson, Senior System Administrator, University of Iowa, High Performance Compute Cluster
Ben Rogers, Director of Research Computing, University of Iowa, High Performance Compute Cluster

Galaxy @Iowa Background: Why, Who, Where, What

Bring Galaxy to Iowa: Why

Why host a local Galaxy deployment?
- No tight data transfer and quota caps
- Local data storage
- Customizable deployment, for example:
  - Control set of tools & version of tools
  - Tune how tools are exposed
  - Expose additional custom tools and databases
  - View Galaxy datasets in local Iowa UCSC browser / local IGV
- Capable of all public Galaxy server functionality
- Lazy in exposing additional tools/reference indices/etc.: upon request.

What stage is our galaxy deployment?
- It's Alive!!!!!!!! But in Alpha Deployment Phase.
- Hosted on campus, with access only on University of Iowa Campus / Hospital & Clinics
- Dispatching jobs to our HPC compute cluster (SGE)
- Sub-set of tools & set of reference genomes exposed compared to the public Galaxy
- Published Human exome analysis pipeline is available

Who is using Galaxy @Iowa
- Have about 50 "alpha user" accounts created. But only 10-12 active regular users.
- Have a local support wiki as well as a listserv for users' questions/problems. JIRA ticket tracking of problems.
- Use case scenarios we expect to support:
  - Core Published Workflows: Common, well tested workflows will be built and published by our core bioinformatics team.
    - Exome analysis
    - RNA-Seq analysis
    - Etc.
  - Experimenters: Researchers trying out tools and data, using Galaxy as a sandbox.
  - New Tools: Exposing research tools from our bioinformatics research teams.
- Future Publicity:
  - Trying to hold off until we can work through some remaining issues.
  - Will be holding a bioinformatics short course all on Galaxy @ Iowa.

Where? Our Galaxy Topology

[Diagram. Recoverable components:]
- Campus LDAP: HawkID authentication
- Web: Apache with SSL redirection, proxy to Galaxy
- OpenSSH: sftp with chroot jail, auth with database
- Galaxy server (Centos 5.4, 24 core, 24GB RAM): Galaxy Web 01 to 04, Galaxy Job Manager with runners, account management (sftp user, jail), galaxydb (Postgres)
- Storage: Lustre scratch (/scratch/Galaxy); NFS mount of $GALAXY_HOME (tools, database, binaries)
- SGE DRM (Workload Manager): schedule, prioritize, assign
- Iowa Campus High Performance Compute Cluster (Galaxy accessible): Centos 5.4, Infiniband interconnect, 3508 processor cores
  - 20 nodes: 12 core & 144 GB
  - 139 nodes: 12 core & 24 GB
  - 200 nodes: 8 core & 24 GB

What has Iowa Customized for Galaxy
- Moved workflows to top of tool navigation bar, limited tools exposed to a subset
- Expose a custom to/from configuration for emails
- Support for opening a JIRA ticket from in Galaxy
- Support sftp transfers (no ftp)
- Support of auto creating & linking linux accounts from Galaxy user accounts
- Watcher scripts to dynamically change file perms as necessary
- Tool customizations:
  - New University of Iowa tools
  - Tweaks to existing tools for our HPC (such as threads, memory, etc.)

Galaxy @Iowa The Problem: What to do when your storage solution fails.

How we got into this mess …

Iowa & Lustre: Background

What the heck is Lustre?
- From Wikipedia: "Lustre is a parallel distributed file system, generally used for large scale cluster computing"
- In theory it should be fast and scale well under large load.

How Iowa had Lustre configured?
- Our existing HPC Storage Solution was Lustre
- Hardware: HP MSA 2312sa G2 / P2000
- Lustre MDS (Lustre Meta Data Server): 2 enclosures with 12 300G SAS drives configured as a RAID-50 volume
- Lustre OSS (Object Storage Servers): 4 sets of 4 enclosures consisting of 48 1T SATA drives. Each enclosure configured as a RAID-6 (10 data disks + 2 parity).

Analysis: How Iowa's Lustre configuration failed us
- While performance could be good in spurts, Iowa's configuration was not ideal.
- Typical NGS tooling performed a large volume of streaming writes, which seemed to exacerbate the situation.
- Lustre sends bulk IO in 1024K chunks, but the underlying array was not aligned, having a stripe width of 640K. This meant the servers needed to work harder with all types of extra writes.
- The OSS servers would ramp to a high load, often >200.
- The OSS machines would become unresponsive.
- IO would slow and sometimes cause user jobs to die.
- Sometimes the lustre servers would evict the clients because the response times were too long. This would kill user jobs as well.
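The alignment mismatch described on this slide can be checked with simple arithmetic. The following is a minimal sketch, not anything taken from the Iowa setup: the function names are hypothetical, and it assumes each write starts on a stripe boundary. It shows that a 1024K bulk write covers one full 640K stripe and leaves a 384K partial stripe, and on a parity RAID a partial-stripe write generally forces the controller to read and recompute parity.

```python
KIB = 1024

def full_and_partial_stripes(io_size, stripe_width):
    """Split one write (assumed to start on a stripe boundary) into
    whole stripes and leftover bytes that only partially fill a stripe."""
    return divmod(io_size, stripe_width)

def is_aligned(io_size, stripe_width):
    """True when the write fills an exact number of stripes."""
    return io_size % stripe_width == 0

lustre_chunk = 1024 * KIB   # bulk IO size quoted on the slide
array_stripe = 640 * KIB    # stripe width quoted on the slide

full, leftover = full_and_partial_stripes(lustre_chunk, array_stripe)
print(f"full stripes: {full}, partial stripe: {leftover // KIB}K")
print("aligned:", is_aligned(lustre_chunk, array_stripe))
```

A write size that is a multiple of 640K (for example 1280K) would pass the `is_aligned` check, which is the property a reconfigured array or file system would be aiming for.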

And then what happened.
- With more user load on Iowa's HPC (not just Galaxy), it was decided to look at different storage solutions in addition to a more optimal Lustre configuration.
- Reconfiguring Lustre meant destroying all data.
- Moved Galaxy off of Lustre storage and onto a ZFS box.
- Decided to compare the following storage types for Galaxy:
  - Existing ZFS Storage
  - Gluster (not to be confused with Lustre). From the gluster.org community: Gluster "is a distributed/parallel/virtual filesystem. It lets you aggregate the capacity and performance of multiple local filesystems on multiple servers and present those files in a single unified namespace view."
  - Reconfigured Lustre

Galaxy @Iowa Benchmarking New Storage Solutions

Not really Apple to Apples….

What was benchmarked & compared:
- Gluster w/ IP over Infiniband
- NFS/ZFS w/ 4 Gb Ethernet
- NFS/ZFS w/ 2 Gb Ethernet
- Sniff tested and aborted: Lustre w/ infiniband

Did ZFS really have a chance??
- We knew ZFS would not scale as well. It is not distributed nor using infiniband.
- The question was not: "Would ZFS scale for all HPC user access?"
- The question was: "How quickly would ZFS performance degrade, and would it be good enough for our Galaxy goals?"
- ZFS has minimal client CPU, is easy to manage, has cheap expansion, and we already had the box & 50 TB of ZFS Storage!

Benchmarking Storage Solutions
- Benchmarking is always a work in progress ... Never quite completed.
  - Would have liked to have done more … but competed with others for cores, and time investment.
  - And I am not claiming to be a performance expert.
- While we are benchmarking each storage architecture, we also looked at how to optimize jobs onto it …
- Our Benchmark Test: BWA sampe paired end read alignment step on Human Exome data.
- Tested 1, 30, 60, 90 concurrent clients
- Variations:
  - Input:
    1. LI: copy to local disk for local reads
    2. RI: leave on remote storage for remote reads
    3. SI: have aln input local, fastq input remote
  - Reference indices:
    1. LR: copy compressed archive to local disk, extract & reference local
    2. RR: leave on remote storage for remote reference
    3. 2R: leave on secondary remote storage (ZFS)
  - Output:
    - SO: streaming writes to remote storage
    - CO: streaming writes to local disk, copy to remote upon finish
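The two-letter codes above combine into the variation labels used in the results, such as gluster_IB_LILRCO. The following is a small sketch of how such a label could be decoded. The function and dictionary names are hypothetical, and the label layout (storage, link, then three two-character codes for input, reference, and output) is an assumption read off the labels in the raw data rather than something the slides state outright.

```python
# Code tables copied from the variations listed on the slide.
INPUT_CODES = {
    "LI": "copy input to local disk for local reads",
    "RI": "leave input on remote storage for remote reads",
    "SI": "aln input local, fastq input remote",
}
REFERENCE_CODES = {
    "LR": "copy compressed archive to local disk, extract & reference local",
    "RR": "leave on remote storage for remote reference",
    "2R": "leave on secondary remote storage (ZFS)",
}
OUTPUT_CODES = {
    "SO": "streaming writes to remote storage",
    "CO": "streaming writes to local disk, copy to remote upon finish",
}

def decode_variation(label):
    """Decode a label such as 'gluster_IB_SI2RSO' (assumed layout)."""
    storage, link, codes = label.split("_")
    if len(codes) != 6:
        raise ValueError(f"expected three 2-character codes, got {codes!r}")
    return {
        "storage": storage,
        "link": link,
        "input": INPUT_CODES[codes[0:2]],
        "reference": REFERENCE_CODES[codes[2:4]],
        "output": OUTPUT_CODES[codes[4:6]],
    }

for name, value in decode_variation("gluster_IB_SI2RSO").items():
    print(f"{name}: {value}")
```

An unknown code raises a `KeyError`, which is a deliberate choice here: a typo in a label should fail loudly rather than be silently mislabeled.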

Benchmarking Challenges
- Clean, consistent runs.
  - Keeping people away from storage during test runs
  - Reproducible numbers …
- Getting cores on the cluster
  - Competed with all users for benchmarking time.
- Yeah, it is time consuming.
  - Understanding failures, analyzing results, babysitting scripts, writing scripts, re-writing scripts …
  - https://bitbucket.org/iihgbiocore/ngssgeloadgenerator/overview
- Not claiming our benchmarking is perfect, but it gives us data to work with.

Galaxy @Iowa: Our goals
- Our targets:
  - Minimum: Good performance for 60 concurrent clients
  - Great: Good performance for 100 concurrent clients
  - Yowzah!: Good performance for 150 concurrent clients
- Also: good performance without having to tweak all Galaxy tool wrappers to make things optimal.
- Oh yeah: it must be stable.

Drum roll … The Results

[Chart. Legend: gluster, ZFS 4 Gb, ZFS 2 Gb]

Taking a closer look…

The raw data: Avg. Job Run Time in Seconds, by Num Clients

Test Variation         1         30        60        90
gluster_IB_SI2RCO      1843.00   2292.00   3049.00   4002.00
gluster_IB_SILRCO      1872.00   2450.00   3601.00   3583.00
gluster_IB_LILRCO      1998.00   3014.00   3390.00   4242.00
zfs_4Gb_LILRCO         3222      4405      6934      10656
gluster_IB_SILRSO      1607.00   0.00      2580.00   6424.00
gluster_IB_LILRSO      1980.00   2285.00   2453.00   2642.00
zfs_4Gb_LILRSO         1760      3282      7500      9133
gluster_IB_SI2RSO      1495.00   2080.00   2146.00   4447.00
gluster_IB_RI2RSO      1511.00   2206.00   2442.00   3360.00
gluster_IB_RILRSO      1568.00   1979.00   3433.00   5266.00
zfs_4Gb_RILRSO         1567      2631      4538      6362
gluster_IB_RIRRSO      3077.00   3640.00   7014.00   7576.00
zfs_4Gb_RIRRSO         1517      2487      4470      6671

* Remember … the savings adds up over 20+ steps of a pipeline!

Chart callouts:
- Saved time by off loading reference indices
- Moving input to local or split: seems to help (gluster?), may not have much effect or hurt (zfs?)
- Writing local and then copying: hurt

The take-aways

Avg. job run time aggregated over all test variations (seconds):

Num Clients   zfs    gluster
1             1860   1921
30            3626   2449
60            6473   3396
90            7810   4326

- In general, gluster scaled better.
- Jobs on gluster were more sensitive to test variation.
- Off loading reference indices made the largest impact.
- With some tuning, got job run time to decrease from ~7500 seconds to ~3500 seconds for 90 clients!!
- Would like to run the ZFS "worst case" (RIRRSO) again, as the numbers appear better than expected.
- Are you a details person? You can download a zip of all the reports generated from metrics collected during the test runs: https://dl.dropbox.com/u/90475071/bwa-sampe_reports_pdf.zip
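One way to read the "scaled better" take-away is to express each aggregated run time as a multiple of the single-client run time. This is a rough sketch under one assumption: in the aggregated figures on this slide, the first number of each pair is taken to be zfs and the second gluster. The function name is hypothetical.

```python
CLIENTS = [1, 30, 60, 90]

# Aggregated average job run times in seconds, as listed on the slide.
# Assumption: first value of each pair is zfs, second is gluster.
AVG_RUNTIME = {
    "zfs": [1860, 3626, 6473, 7810],
    "gluster": [1921, 2449, 3396, 4326],
}

def slowdown(times):
    """Run time at each client count relative to the 1-client run time."""
    baseline = times[0]
    return [t / baseline for t in times]

for storage, times in AVG_RUNTIME.items():
    factors = ", ".join(
        f"{n} clients: {f:.2f}x" for n, f in zip(CLIENTS, slowdown(times))
    )
    print(f"{storage}: {factors}")
```

Under that assumption, the 90-client jobs take roughly 4.2 times as long as a single job on zfs and roughly 2.25 times as long on gluster.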

What happened to Lustre?
- Did not do additional BWA benchmarking, since we could not get the raw performance we wanted out of it to make it worth our time.
- On a 5GB dd test, using a single raid array, we saw the following:
  - NFS: 90MB/s
  - Lustre: 60MB/s (with additional tuning got closer to 600MB/s)
  - Gluster: 600MB/s
- From discussions in the community:
  - Lustre & HP MSA might not be a good marriage. Lustre likes certain hardware?
  - Lustre is more expensive and harder to manage. Although also more flexible.

Choosing Storage: Why gluster file system?
- Strong performance.
- Cheap expansion. Can add in more hard drives as additional "bricks" as needed.
- Easy to setup and manage.
- Gaining industry traction. Redhat purchased gluster & is putting weight behind it.
- Cons:
  - The gluster client process can consume a core on a compute node. This can cause resource contention.
  - Glusterfsd server processes can cause full client outages. Simulated failure by attaching gdb to the glusterfsd process: this affects the client workload, but not the rest of the servers or disks.

What about ZFS?
- Current plan: use for reference data libraries and indices.
- Also might be used as a place to hold more "permanent" data, which might eventually get archived.
- Would like to benchmark 10Gb ethernet w/ ZFS.

Iowa's plan for managing Storage
- Galaxy storage will not be backed up.
- No hard quota caps will be in place … for now.
- Auto cleanup of datasets older than 30 days.
  - Email will be sent to users in advance, allowing them to take action.
- Wait, Monitor, and Iterate on this plan. Talk to us in 6 months and it could be different.
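The age rule in this plan can be illustrated with a short sketch. This is not how Iowa or Galaxy implements cleanup (the slides do not say, and a real deployment would purge through Galaxy rather than by touching files directly); it only shows the "older than 30 days" selection, using file modification time as an assumed stand-in for dataset age. The function name and the directory in the usage comment are hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86400

def stale_files(root, max_age_days=30, now=None):
    """Yield paths under root whose modification time is older than the cutoff.

    Only reports candidates; nothing is deleted here, so a notification
    email could be sent before any action is taken.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                yield path

# Example usage with a hypothetical dataset directory:
# for path in stale_files("/path/to/galaxy/datasets"):
#     print(path)
```

Keeping the listing step separate from any deletion matches the plan's order of operations: identify old datasets, warn the owner, then clean up.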

Our local Galaxy: What's next?

Our future goals:
- Wrap up data storage evaluation & migrate
- InCommon federated identity management support
- Develop and publish more common bioinformatics workflows
- Expose additional tools and Galaxy features based on community feedback
- Integrate with DNA Core sequencers to directly provide sequencer data into Galaxy for analysis
- Lots of little fit & finish items
- Multi-version tool support
- Version tracking of tools and workflows. We need a clear audit trail.

Upcoming Events

Bioinformatics Short Course, August 1-3, 2012
Mutation Detection Using Massively Parallel Sequencing: From Data Generation to Variant Annotation
Brought to you by: Iowa Initiative in Human Genetics
http://www.medicine.uiowa.edu/humangenetics/bioinformaticscourse
Course will leverage Iowa's Galaxy Deployment

Massively parallel DNA sequencing technologies have ushered in the next wave of the genomics revolution. The clinical application of these technologies will make personalized genomic medicine a reality; in the research laboratory, these technologies are making innovative approaches to genome-wide investigations routine. In the summer of 2012, the Iowa Institute of Human Genetics will offer a Bioinformatics Short Course. The course will focus on the most popular next-generation sequencing platforms: the Illumina HiSeq and MiSeq systems. Upon completion of this course, participants will understand the design of a next generation sequencing experiment and the workflow to achieve a particular result. A series of lectures to introduce basic concepts will be interwoven with practical examples of data generation and analysis. Hands-on sessions will enable participants to analyze data from its generation to interpretation. Participants will be required to bring their own laptops. Enrollment is limited.

Please share with us your experiences with storage: we would love to hear about & learn from them!!

Our Next Call: We Need You!
- 20 min: Galaxy Today/Tomorrow. Either a presentation on galaxy from the Penn State team, or from someone working on a new feature or customization coding item.
- 20 min: Galaxy in Our Town. A presentation from a local galaxy institution on what they are doing, or have someone walk through their use cases and pain points, or a problem they are troubleshooting.
- 20 mins: Open Mic Discussion & make point to point connections. Organize smaller breakouts if someone wants to host a call specific to an issue.

Backup: More Info on Gluster

GlusterFS
- GlusterFS is a distributed file system that works over IP over Infiniband.
- Relies on the backend file system for metadata, allocation, etc. Gluster volume metadata is managed by the clients.
- Can achieve higher throughput than lustre with less disks. Same hardware as lustre.

GlusterFS
- XFS file system alignment with the RAID stripe is managed via:
  mkfs.xfs -f -d su=64k,sw=10 -l su=64k /dev/dm-0
- This gives consistent write throughput of 500-600 MBps per brick for all file sizes.
- The read throughput is about 600-700 MBps per brick.
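The su and sw values in the mkfs.xfs command follow from the RAID geometry: su is the per-disk stripe unit and sw is the number of data disks. The following is a sketch of that derivation, assuming the bricks sit on 12-disk RAID-6 sets (10 data + 2 parity, as described earlier for the enclosures) with a 64k stripe unit. The helper names are hypothetical, and the command is only assembled as a list for inspection, never executed.

```python
def xfs_stripe_options(total_disks, parity_disks, stripe_unit_kib):
    """Derive XFS stripe unit/width settings from RAID geometry."""
    data_disks = total_disks - parity_disks
    return {
        "su": f"{stripe_unit_kib}k",               # per-disk stripe unit
        "sw": data_disks,                          # number of data disks
        "stripe_width_kib": stripe_unit_kib * data_disks,
    }

def mkfs_xfs_command(device, opts):
    """Build (but do not run) a mkfs.xfs argument list."""
    return [
        "mkfs.xfs", "-f",
        "-d", f"su={opts['su']},sw={opts['sw']}",
        "-l", f"su={opts['su']}",
        device,
    ]

# Assumed geometry: 12 disks, 2 of them parity, 64k stripe unit.
opts = xfs_stripe_options(total_disks=12, parity_disks=2, stripe_unit_kib=64)
print(opts["stripe_width_kib"], "KiB full stripe")
print(" ".join(mkfs_xfs_command("/dev/dm-0", opts)))
```

With these inputs the full stripe works out to 640K, the same stripe width quoted for the array in the Lustre analysis, and the assembled arguments match the command on the slide.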

GlusterFS
- The aggregate bandwidth from IOR with 32 1G files is:
  Max Read / Max Write: 6388.65 MBps / 9368.44 MBps