Compute- and Data-Intensive Analyses in Bioinformatics

Compute- and Data-Intensive Analyses in Bioinformatics
Wayne Pfeiffer
SDSC/UCSD
August 8, 2012
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions for today
• How big is the flood of data from high-throughput DNA sequencers?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational requirements?

Size matters: how much data are we talking about?
• 3.1 GB for human genome
  • Fits on flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000
  • 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to >1 TB of reads needed as input for analysis of whole human genome, depending upon coverage
  • 300 GB for 40x coverage
  • 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis
  • 45 TB on disk at SDSC for W115 project (~10,000x single genome)
  • Multiple genomes per person!
  • May only be looking for kB or MB in the end

Market-leading DNA sequencers come from Illumina & Life Technologies (both SD County companies)
• Illumina HiSeq 2000
  • Big; $690,000 list price
  • High throughput
  • Low error rate
  • 100-bp paired-end reads
  [Diagram: a paired-end fragment with a read at each end]
• Life Technologies Ion PGM
  • Small; $50,000 list price
  • Low throughput
  • Modest error rate
  • 250-bp reads

Cost of DNA sequencing is dropping much faster (1/10 in 2 y) than cost of computing (1/2 in 2 y); this is producing the flood of data

What does this mean?
• Growth of read data is roughly inversely proportional to drop in sequencing cost
  • >100 GB/day of reads from a single Illumina HiSeq 2000 now
  • 1 TB/day of reads from a sequencer likely by 2014
• Analysis & quality control will dominate the cost
  • <$10,000 for sequencing human genome now
  • $1,000 for sequencing human genome in 2013 or 2014
  • $10,000 for analysis & quality control of human genome sequence now and decreasing relatively slowly
• Analysis improvements are needed to take advantage of new sequencing technology

Many widely-used bioinformatics codes are installed on Triton, Trestles, & Gordon
• Pairwise sequence alignment
  • ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway)
  • ClustalW, MAFFT
• RNA-Seq analysis
  • TopHat, Cufflinks
• De novo assembly
  • ABySS, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway)
  • BEAST, GARLI, MrBayes, RAxML
• Tool kits
  • BEDTools, GATK, SAMtools

Computational requirements for some codes & data sets can be substantial

Code & data set                                         Input (GB)  Output (GB)  Memory (GB)  Time (h)  Cores / computer
BFAST 0.6.4c; 52M 100-bp reads                          26          19           17           8         8 / Dash
SOAPdenovo 1.05; 1.7B 100-bp reads                      424         77           387          26        16 / Triton P+C
Velvet 1.1.06; 562M 50-bp reads                         35          617          539          9         16 / Triton PDAF
MrBayes 3.2.1; DNA data, 40 taxa, 16k patterns          <1          27           12           155       8 / Gordon
RAxML 7.2.7; amino acid data, 1.6k taxa, 8.8k patterns  <1          <1           47           106       160 / Trestles

Benchmark tests were run on various computers, some with large shared memory
• Gordon from Appro at SDSC
  • 16-core nodes with 2.6-GHz Intel Sandy Bridge processors
  • 64 GB of memory per node + vSMP
• Trestles from Appro at SDSC
  • 32-core nodes with 2.4-GHz AMD Magny-Cours processors
  • 64 GB of memory per node
• Triton CC & Dash from Appro at SDSC
  • 8-core nodes with 2.4-GHz Intel Nehalem processors
  • 24 & 48 GB of memory per node + vSMP on Dash
• Triton PDAF from Sun at SDSC
  • 32-core nodes with 2.5-GHz AMD Shanghai processors
  • 256 & 512 GB of memory per node
• Blacklight from SGI at PSC
  • 2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
  • 16 TB of memory per NUMA node

Typical projects involve multiple codes, some with multiple steps, combined in workflows
• HuTS: Human Tumor Study
  • Search for genome variants between blood and tumor tissue
  • Start from Illumina 100-bp paired-end reads
  • Use BWA & GATK on Triton to find SNPs & short indels
  • Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
• W115: Study of 115-year-old woman’s genomes (!)
  • Search for genome variants between blood and brain tissue
  • Start from SOLiD 50-bp reads
  • Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
  • Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels
[Photo: Hendrikje van Andel-Schipper]

Computational workflows for common bioinformatics analyses
[Workflow diagram]
• DNA reads in FASTQ format -> De novo assembly: SOAPdenovo, Velvet, … -> Contigs & scaffolds in FASTA format
• DNA reads in FASTQ format + reference genome in FASTA format -> Read mapping, i.e., pairwise alignment: BFAST, BWA, … -> Alignment info in BAM format -> Variant calling: GATK, … -> Variants: SNPs, indels, others
• Contigs & scaffolds + reference genome in FASTA format -> Pairwise alignment: ATAC, BLAST, … -> Alignment info in various formats -> Variant calling: GATK, …
• Multiple sequence alignment: ClustalW, MAFFT, … -> Aligned sequences in various formats -> Phylogenetic tree inference: MrBayes, RAxML, … -> Tree in various formats

Computational workflow for read mapping & variant calling
[Workflow diagram]
DNA reads in FASTQ format + reference genome in FASTA format -> Read mapping, i.e., pairwise alignment: BFAST, BWA, … -> Alignment info in BAM format -> Variant calling: GATK, … -> Variants: SNPs, indels, others
Goal: identify simple variants, e.g.,
• single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)
    CACCGGCGCAGTCATTCTCATAAT
    ||||||||||| ||||||||||||
    CACCGGCGCAGACATTCTCATAAT
• short insertions & deletions (indels)
    CACCGGCGCAGTCATTCTCATAAT
    ||||||||||   |||||||||||
    CACCGGCGCA   ATTCTCATAAT

Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in KRAS gene; this means that cetuximab is not effective for chemotherapy
BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

BFAST took about 8 hours & 17 GB of memory to map a small set of reads; speedup was 3.7 on 8 cores
• Parallelization is typically done by
  • Separate runs for each lane of reads
  • Threads within a run

Step          1-thread time (h)  8-thread time (h)  Speedup  8-thread memory (GB)
Match         8.4                3.1                2.7      16.9
Align         19.3               4.2                4.6      2.0
Postprocess   0.4                0.4                1.0      2.2
Total         28.2               7.7                3.7

• Tabulated results are for
  • One lane of Illumina 100-bp paired-end reads: 52 million reads
  • One index with k=22 on reference human genome (done previously)
  • One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
  • 26 GB input, half for reads & half for index; 19 GB output

Computational workflow for de novo assembly & variant calling
[Workflow diagram]
DNA reads in FASTQ format -> De novo assembly: SOAPdenovo, Velvet, … -> Contigs & scaffolds in FASTA format
Contigs & scaffolds + reference genome in FASTA format -> Pairwise alignment: ATAC, BLAST, … -> Alignment info in various formats
Alignment info -> Variant calling: GATK, … -> Variants: SNPs, indels, others
Goal: identify more complex variants, e.g.,
• large indels
• duplications
• inversions
• translocations

Key conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory
2. Merge overlapping, “good” reads into longer contigs, typically by simplifying the graph
3. Link contigs to form scaffolds using paired-end information
Diagrams from Serafim Batzoglou, Stanford

de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding Eulerian path through graph
[Diagram: de Bruijn graph; one visible node label is AGAC]
Diagram from Michael Schatz, Cold Spring Harbor

SOAPdenovo & Velvet are two leading assemblers that use de Bruijn graph algorithm
• SOAPdenovo is from BGI
  • Code has four steps: pregraph, contig, map, & scaffold
  • pregraph & map are parallelized with Pthreads, but not reproducibly
  • pregraph uses the most time & memory
  • http://soap.genomics.org.cn/soapdneovo.html
• Velvet is from EMBL-EBI
  • Code has two steps: hash & graph
  • Both are parallelized with OpenMP, but not reproducibly
  • Either step can use more time or memory depending upon problem & computer
  • http://www.ebi.ac.uk/~zerbino/velvet
• k-mer size is adjustable parameter
  • Typically it is adjusted to maximize N50 length of scaffolds or contigs
  • N50 length is central measure of distribution weighted by lengths
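
The N50 length in the last bullet can be made concrete with a short calculation. The sketch below is illustrative only and assumes the common definition: N50 is the largest length L such that sequences of length L or longer contain at least half of all assembled bases. The function name `n50` and the sample lengths are hypothetical, not taken from SOAPdenovo or Velvet.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L together hold at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bases), for illustration only.
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contig_lengths))  # prints 8: 10 + 9 + 8 = 27 is half of 54
```

Because the measure is weighted by length, many short contigs barely move it, which is why it is a convenient target when tuning the k-mer size.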

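The de Bruijn graph slide above can also be illustrated with a toy example. This is a minimal sketch, not how SOAPdenovo or Velvet are implemented: it assumes error-free, single-strand reads with no repeated k-mers, takes k-mers as nodes with an edge between k-mers that are adjacent in some read, and walks an Eulerian path with Hierholzer's algorithm. The function names and reads are hypothetical.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; a directed edge joins k-mers adjacent in a read."""
    edges = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        edges.update(zip(kmers, kmers[1:]))
    graph = defaultdict(list)
    for src, dst in sorted(edges):
        graph[src].append(dst)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            in_deg[dst] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(graph),
    )
    remaining = {n: list(dsts) for n, dsts in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Merge a path of overlapping k-mers into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads from the toy sequence AGACTTGCA.
reads = ["AGACTT", "ACTTGC", "CTTGCA"]
print(spell(eulerian_path(build_graph(reads, 4))))  # prints AGACTTGCA
```

Real assemblers also have to handle sequencing errors, both strands, and repeats, which is where the graph simplification in step 2 of the conceptual outline comes in.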