Presentation Headline Subhead
IBM Spectrum Scale
IBM Life Sciences
Madhav Ponamgi
[email protected]

Alignment Paths

City block traversal problem: beginning from Start, travel North / East to Finish.
• How many possible paths are there in an M x N grid?
• How long is a path?
• Is there an optimal path?
• If we weight the edges, can there be an optimal path?
• How do we compute an optimal path?
• What are all the optimal paths?

IBM Systems | 2

Genomic Alignment Paths

Determining the optimal alignment of short sequences of base pairs.
[Figure: alignment grid from Start to Finish with base sequences along its axes; extracted axis labels: C A T G T A C A A T C G A C C T G T A and A T C G A C C]
• Non-trivial to find the "best" alignment
• Scoring matrix for multi-dimensional alignment
• Using weighted edges, a Dynamic Programming method can be used to construct matchings
• V. Levenshtein introduced edit distance: the shortest sequence of single-character edits that matches two sequences

Phylogenetics

• Sort by reversal: 612345 -> 162345 -> 126345 -> 123645 -> 123465 -> 123456
• Used for comparing genome to genome sequences
• Across species, similar genes are laid out in different groups, called synteny. Mouse and human genomes are similar, with about 300 blocks re-arranged.
• How to get from one arrangement to another in the most "parsimonious" sequence of steps
• Depending on the length of the genomes, this can be I/O intensive
• Faster method: 612345 -> 543216 -> 123456

Motif finding methods

CGGGGCTGGTCGTCACATTCCCCCTTTTGATATTTGAGGGGTGCTGCCAATAA
CCAAAAGCGGACAAGGGGATCCGTTGACGACCTAAATCAACGGCCAAGGCC
AAGGCCAGGAGGCGCCTTTGCTGGTTCTACCTGAATTTCTAAAAAAGATTATA
ATGTAATGTCGGTCCTCCTGCTGTACAACTGAGATCATGCTGCTGCTTTCAAC

Genomic sequences often contain motifs, short sequences of base pairs, that frame gene encoding sequences.
• The motifs may not be known in advance
• The length or size of the motif may not be known (i.e. how many base pairs)
• It's not unusual for the motifs to vary slightly within the genomic sequence (i.e. the motif may have slight mutations as it recurs)
• Goal is to identify and locate these sequences so the gene can be identified

Possible solution methods:
• Brute-force method
  • Examine all subsequences of size k contained within a sequence of size n
  • This results in an exponential algorithm of order O(n^k)
• Branch-and-Bound
  • Using branch-and-bound techniques and search trees, the brute-force method can be improved upon
  • The I/O pattern for this type of search is not restricted to serial access, apart from the initialization, because the parallel methods evaluate the genome independently (k << n)

Life Sciences

Genomics / Proteomics
• Mass spectrometry
• De novo sequencing
• Sequest
• Scaffold
• Illumina / Accelrys / CLC

Molecular Dynamics
• Large molecules / enzymes
• Classical physics / empirical
• NWChem
• NAMD
• CHARMM

Computational / Quantum Chemistry
• Ab Initio / Hartree-Fock
• Gaussian
• GAMESS
• MOPAC

Translational Medicine / Imaging
• Personalized genomics
• Health records / HIPAA
• Functional MRI
• Data mining / Security

Traditional Sequencing Method: SANGER Method

Sanger sequencing
• Fewer toxic chemicals
• Less radioactive materials
• Long continuous reads

Steps:
• Separate DNA into single strands
• Add primer (known sequence)
• Place into 4 separate ddNTP reactions
• The ddNTPs stop the reaction
• Sort across gel electrophoresis
• Read back top to bottom

The first whole human genome was delivered in 2003. It cost $3B and took 13 years to complete.

Next Generation Sequencers

• Data needs to be assembled and analyzed for gene identification, gene variations and gene functionality

Evolution of DNA Sequencers

$149,000, Life Technologies Ion Proton Sequencer (late 2012)
Source: http://www.youtube.com/watch?v=OKhxoGcr4Rk
Source: A decade's perspective on DNA sequencing technology, Elaine R.
Mardis, Nature, Volume 470, Pages 198–203 (10 February 2011)

$740,000, Illumina HiSeq 2500

Sequencer    Availability  Output  Run time  Average read  Cost/run
Proton 1     Sept 2012     10 Gb   2-4 hrs   200 bp        $999
Proton 2     Mar 2013      100 Gb  2-4 hrs   >200 bp       <$999
HiSeq 2500   Mid 2012      600 Gb  11 days   150 bp        $11,000

Source: http://blueseq.com/knowledgebank/sequencing-platforms/

Oxford Nanopore MinION

Cost per Genome

https://www.genome.gov/27541954/dna-sequencing-costs/

Explosive Growth of Data

[Chart: NCBI GenBank size in terabytes (1K to 14K), 2001 to 2012]
• 2001: first human genome
• 2005: NGS publication
• 2008: Solexa sequencer
• 2010: BGI center opens
• 2012: 78 TB analysis in 1 week

Disk Challenge

• 20 Petabytes by 2018
• If current 3.5" 8 TB disks are a guide, then we'd be looking at a 2,500-disk farm
• Stacked horizontally, this would be a tower 127 meters in height
  • Statue of Liberty, 93 m
  • Eiffel Tower, 324 m

• NEED Analytics
• NEED High Performance Computing
• NEED High Performance Storage

Genomics Data High-Level Pipeline

[Diagram: Base Calling (Vendor Tool) -> FastQ/FastA (Raw NGS Reads) -> Alignment or Assembly -> SAM / BAM (Aligned NGS Reads) -> Variant Calling -> VCF file (Genomic Variant) -> Downstream Analysis. Data volumes shown along the pipeline: 13 TB, 8 TB, 2 TB, 200 GB]

Spectrum Scale: A File System for Genomic Pipelines

[Diagram: Instruments, Sequencers, Users and applications, and a Compute Farm access a single name space over POSIX, SMB/CIFS, NFS and a Map Reduce Connector. Spectrum Scale spans Site A, Site B and Site C with automated data placement and data migration across Flash, Disk, Storage Rich Servers, Tape and Off Premise storage]

[Diagram: life sciences solution map. Column headings: Source; Information Extraction / Data Integration and Management; Analytics; Interface. Labels as extracted, grouped by row:]
• Publication, Patents, Science Data: Information Extraction; Insight and Discovery; Cognitive Analytics; User Interfaces; Knowledge Hub; LuceGene / SRS; Watson Discovery; Watson Healthcare; Web Portals; Google Analytics; NLP / Nuance; Ontologies; BigInsight, Vivisimo, Varicent
• EMR, Clinical, Studies: MDM; ETL; NLP; Predictive Analytics; Cohort Query; MDM Server, RDM; ETL Server; Content Analytics; SAS, R, SPSS, Rev R; i2b2
• GWAS, Clinical & Omics, Imaging: Data Explore; TR; Translational DW; Translational HUB; PLINK, VAAST; tranSMART, Workbench; Cognos; CDM; NextBio
• Computational Chemistry: Ab Initio; Quantum / Hartree-Fock; Molecular Modeling; Large Molecule / Empirical; Experimental Discovery; Simulation; Drug Discovery; Target Enzymes; In vivo Efficacy; Safety & Efficacy; Clinical Trials; FDA: Phase I-III; Approval / Manufacture; Distribution; Marketing / Sales
• Genomics, Radiomics, Proteomics, Cytometry: NGS; De Novo Assembly; Reference DB Map; Variant Analysis; Annotation; Pathway Analysis; Expression Analysis; Application Management; Ref DB; Velvet, Allpaths, Biovia; Bowtie, BWA, SOAP, CLCbio; Picard, GATK, SAM, SOAP; Annovar; PID, Reactome; GO; Partek, GenePattern; LIMS; Bionimbus, Galaxy; TCGA, dbSNP, GEO

[Diagram: platform layers]
• Scale: Global Storage & File System (Scale / ESS; Protect, Archive, HPSS; I/O, ILM, AFM, NFS/AFM)
• Speed: Workload & Workflow Platform (LSF, RTM, PAC, Symphony; WLM, RTM, Security; Spark, Cluster, Big Data)
• Smart: Infrastructure (PPM, PCM, EGO; Cloud; Galaxy)

Multi-Site Pharmaceutical AFM Deployment

• 4 sites with roving users
• IW mode with no pre-fetch
• Each site consists of ESS storage with two CES nodes
• Users are in separate directories without common shared files
• Campus interconnect is 1 GigE
[Diagram: five interconnected Spectrum Scale sites]

Major Cancer Center

[Diagram labels: HPC; Research; Clients; NFS Servers; 40 GbE Network; InfiniBand; NSD Servers; Head Nodes; SAN Fiber; SAS; NSD2; TSM servers; Disk Pool; New storage; Base FM, FM2, FM3, FM4, FM5, FM6; Total Slots; 10 LTO6; Physical tape library]
• Stage 1: 500 TB scratch
• Stage 2: 2 Petabytes
• Stage 3: Existing storage
• Stage 4: Backup

University System Architecture

[Diagram labels: Omega; Louise; BulldogN; Core GPFS Cluster]

Typical Genomics File Size Histogram

[Figure: file size histogram]

Personalized Medicine

Acute Lymphoblastic Leukemia (ALL)
• Relapse 2014 (variation)
• Full exome sequenced
• Connection to Philadelphia Chromosome and 1-nucleotide mutant NT5C2 + NUP214-ABL1 (resistance to chemo drug)
• Treatment with BMS Sprycel

Help is on the way
• 3 weeks
• Diagnosis and Cure
• Dr. Wartman, WashU: cancer researcher and leukemia survivor
• $32 Billion: healthcare analytics market by 2022
• $88 Billion: precision medicine market by 2022
IBM CONFIDENTIAL

CRISPR-Cas9 Genome Editing

Molecular Biology Dogma / Big Data

DNA -> Transcription -> RNA -> Translation -> Proteins

Established & proven with blue-chip organizations

[Logo groups: ENTERPRISE; HEALTHCARE; PUBLIC SECTOR / Government; PARTNERS]

Spectrum Scale is the backbone of the data hub

• Single Name Space
• Supports multiple tiers of storage: flash, spinning disk, tape and archive
• Geographically dispersed management of data including disaster recovery
• Encryption for protection of regulatory/patient data
• Supports multiple protocols for instruments, compute loads, File and Object storage
[Diagram labels: Clients; Instruments; Single Name Space; Site A; Site B; Site C; File/Data Servers; AFM/DR; ILM; JBOD Disk Enclosures; Shared storage Pool; Active Pool; Tape / optical Pool; End-to-End NIST / FIPS 140-2 Certified Encryption (IBM Security)]

Thank you.
ibm.com/systems

Legal notices
Copyright © 2015 by International Business Machines Corporation. All rights reserved. No part of this document may be
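
Worked sketch for the Alignment Paths slide. The slide asks how many North/East paths cross an M x N grid, how long a path is, and how to compute an optimal path once the edges are weighted. Every path takes exactly M + N steps, so the count is the binomial coefficient C(M + N, M). The Python below is an illustrative sketch and is not taken from the deck: the function names, the `east`/`north` weight layout, and the choice to maximize the score are all assumptions made for the example.

```python
from math import comb


def count_paths(m: int, n: int) -> int:
    """Number of North/East paths across a grid m blocks wide, n blocks tall.

    Each path is m East steps and n North steps in some order.
    """
    return comb(m + n, m)


def best_path_score(east, north):
    """Highest-scoring North/East path, by dynamic programming.

    Hypothetical layout (an assumption for this sketch):
      east[r][c]  = weight of the edge from node (r, c) to (r, c + 1)
      north[r][c] = weight of the edge from node (r, c) to (r + 1, c)
    Node (0, 0) is Start; the last node is Finish.
    """
    rows = len(east)          # number of node rows
    cols = len(north[0])      # number of node columns
    best = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                continue
            candidates = []
            if c > 0:   # arrive from the west
                candidates.append(best[r][c - 1] + east[r][c - 1])
            if r > 0:   # arrive from the south
                candidates.append(best[r - 1][c] + north[r - 1][c])
            best[r][c] = max(candidates)
    return best[rows - 1][cols - 1]
```

For a single block with `east = [[1], [5]]` and `north = [[2, 3]]`, the two routes score 1 + 3 and 2 + 5, so the function picks the second. The table is filled once per node, which is why the weighted problem stays cheap even though the number of paths grows combinatorially.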
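
Worked sketch for the Genomic Alignment Paths slide. The slide credits V. Levenshtein with edit distance and notes that dynamic programming over weighted edges constructs the matching. The Python below is a minimal textbook version of that dynamic program with unit costs for insertion, deletion and substitution. It is an illustration, not the scoring scheme used in the deck, and real aligners use richer scoring matrices.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit edit costs."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (ca != cb)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory grows with the shorter dimension rather than with the product of the two sequence lengths. For example, `edit_distance("kitten", "sitting")` is 3.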
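
Worked sketch for the Phylogenetics slide. The slide sorts 612345 in five reversals and then shows a faster two-reversal route, 612345 -> 543216 -> 123456. The Python below replays a reversal and finds the fewest reversals by breadth-first search. The helper names are hypothetical, and the exhaustive search is only practical for very short permutations. It is a brute-force illustration, not a method for genome-scale data.

```python
from collections import deque


def reverse_segment(perm: str, i: int, j: int) -> str:
    """Return perm with the slice from index i to j (inclusive) reversed."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def min_reversals(start: str, target: str) -> int:
    """Fewest segment reversals turning start into target (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, steps = queue.popleft()
        if perm == target:
            return steps
        # Try every reversal of a segment with at least two elements.
        for i in range(len(perm)):
            for j in range(i + 1, len(perm)):
                nxt = reverse_segment(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    raise ValueError("target is not a rearrangement of start")
```

Reversing all six positions of 612345 gives 543216, and reversing the first five positions of that gives 123456, matching the slide's faster route. The search confirms that no single reversal does the job, so two is the most parsimonious answer for this example.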