
Compute- and Data-Intensive Analyses in Bioinformatics

Wayne Pfeiffer, SDSC/UCSD, August 8, 2012


Questions for today

• How big is the flood of data from high-throughput DNA sequencers?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational requirements?


Size matters: how much data are we talking about?

• 3.1 GB for human genome
  • Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000
  • 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to >1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage (see the sketch below)
  • 300 GB for 40x coverage
  • 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis
  • 45 TB on disk at SDSC for W115 project! (~10,000x single genome)
• Multiple genomes per person!
• May only be looking for kB or MB in the end
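As a back-of-the-envelope check (not from the slides), the ~300 GB and ~1 TB figures follow directly from 3.1 Gbases per genome and 2.5 B per base in FASTQ:

```python
# Rough estimate of FASTQ input size for whole-human-genome resequencing.
# Assumes the slide's figures: ~3.1 Gbases per human genome and ~2.5 bytes
# per base in FASTQ (base call + quality score + record overhead).

GENOME_BASES = 3.1e9          # haploid human genome size in bases
FASTQ_BYTES_PER_BASE = 2.5    # approximate, per the slide

def fastq_size_gb(coverage):
    """Approximate FASTQ size in GB for a given sequencing coverage."""
    return GENOME_BASES * coverage * FASTQ_BYTES_PER_BASE / 1e9

for cov in (40, 130):
    print(f"{cov}x coverage -> ~{fastq_size_gb(cov):.0f} GB of reads")
# 40x coverage  -> ~310 GB   (slide: 300 GB)
# 130x coverage -> ~1008 GB  (slide: ~1 TB)
```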


Market-leading DNA sequencers come from Illumina & Life Technologies (both SD County companies)

• Illumina HiSeq 2000
  • Big; $690,000 list price
  • High throughput
  • Low error rate
  • 100-bp paired-end reads

• Life Technologies Ion PGM
  • Small; $50,000 list price
  • Low throughput
  • Modest error rate
  • ≤250-bp reads


Cost of DNA sequencing is dropping much faster (to 1/10 in 2 y) than cost of computing (to 1/2 in 2 y); this is producing the flood of data


What does this mean?

• Growth of read data is roughly inversely proportional to the drop in sequencing cost
  • >100 GB/day of reads from a single Illumina HiSeq 2000 now
  • 1 TB/day of reads from a sequencer likely by 2014
• Analysis & quality control will dominate the cost
  • <$10,000 for sequencing a human genome now
  • $1,000 for sequencing a human genome in 2013 or 2014
  • ≥$10,000 for analysis & quality control of a human genome sequence now, and decreasing relatively slowly
• Analysis improvements are needed to take advantage of new sequencing technology


Many widely used bioinformatics codes are installed on Triton, Trestles, & Gordon

• Pairwise alignment
  • ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway)
  • ClustalW, MAFFT
• RNA-Seq analysis
  • TopHat, Cufflinks
• De novo assembly
  • ABySS, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway)
  • BEAST, GARLI, MrBayes, RAxML
• Tool kits
  • BEDTools, GATK, SAMtools


Computational requirements for some codes & data sets can be substantial

Code & data set                                  Input (GB)  Output (GB)  Memory (GB)  Time (h)  Cores / computer
BFAST 0.6.4c, 52M 100-bp reads                       26          19           17           8       8 / Dash
SOAPdenovo 1.05, 1.7B 100-bp reads                  424          77          387          26      16 / Triton P+C
Velvet 1.1.06, 562M ≤50-bp reads                     35         617          539           9      16 / Triton PDAF
MrBayes 3.2.1, DNA data, 40 taxa, 16k patterns       <1          27           12         155       8 / Gordon
RAxML 7.2.7, amino acid data, 1.6k taxa,             <1          <1           47         106     160 / Trestles
  8.8k patterns


Benchmark tests were run on various computers, some with large shared memory

• Gordon from Appro at SDSC
  • 16-core nodes with 2.6-GHz Intel Sandy Bridge processors
  • 64 GB of memory per node + vSMP
• Trestles from Appro at SDSC
  • 32-core nodes with 2.4-GHz AMD Magny-Cours processors
  • 64 GB of memory per node
• Triton CC & Dash from Appro at SDSC
  • 8-core nodes with 2.4-GHz Intel Nehalem processors
  • 24 & 48 GB of memory per node + vSMP on Dash
• Triton PDAF from Sun at SDSC
  • 32-core nodes with 2.5-GHz AMD Shanghai processors
  • 256 & 512 GB of memory per node
• Blacklight from SGI at PSC
  • 2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
  • 16 TB of memory per NUMA node


Typical projects involve multiple codes, some with multiple steps, combined in workflows

• HuTS: Human Tumor Study
  • Search for genome variants between blood and tumor tissue
  • Start from Illumina 100-bp paired-end reads
  • Use BWA & GATK on Triton to find SNPs & short indels
  • Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
• W115: Study of 115-year-old woman's genomes (!)
  • Search for genome variants between blood and brain tissue
  • Start from SOLiD 50-bp reads
  • Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
  • Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels

[Photo: Hendrikje van Andel-Schipper]

Computational workflows for common bioinformatics analyses

[Diagram: three workflows starting from DNA reads in FASTQ format.
  1. Read mapping, i.e., pairwise alignment against a reference genome in FASTA format (BFAST, BWA, …), producing alignment info in BAM format, followed by variant calling (GATK, …) to yield variants: SNPs, indels, others.
  2. De novo assembly (SOAPdenovo, Velvet, …), producing contigs & scaffolds in FASTA format, followed by pairwise alignment against the reference (ATAC, BLAST, …) and variant calling.
  3. Multiple sequence alignment (ClustalW, MAFFT, …), producing aligned sequences in various formats, followed by phylogenetic tree inference (MrBayes, RAxML, …) to yield a tree in various formats.]


Computational workflow for read mapping & variant calling

[Diagram: DNA reads in FASTQ format → read mapping, i.e., pairwise alignment against a reference genome in FASTA format (BFAST, BWA, …) → alignment info in BAM format → variant calling (GATK, …) → variants: SNPs, indels, others.]

Goal: identify simple variants, e.g.,

• single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)

    CACCGGCGCAGTCATTCTCATAAT
    ||||||||||| ||||||||||||
    CACCGGCGCAGACATTCTCATAAT

• short insertions & deletions (indels)

    CACCGGCGCAGTCATTCTCATAAT
    ||||||||||   |||||||||||
    CACCGGCGCA   ATTCTCATAAT
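To illustrate the variant-calling idea on the slide's example (a simplified sketch, not the GATK algorithm): compare a mapped read to its reference window and report single-base mismatches as candidate SNVs.

```python
# Minimal illustration (not from the slides): given a read already mapped to a
# reference window of the same length, report single-base mismatches as
# candidate SNVs. Real pipelines (BWA + GATK, etc.) also handle indels,
# base qualities, and evidence from many overlapping reads.

def candidate_snvs(reference, read):
    """Yield (position, ref_base, read_base) for each mismatch."""
    assert len(reference) == len(read)
    for i, (r, q) in enumerate(zip(reference, read)):
        if r != q:
            yield i, r, q

ref  = "CACCGGCGCAGTCATTCTCATAAT"   # example sequences from the slide
read = "CACCGGCGCAGACATTCTCATAAT"
for pos, r, q in candidate_snvs(ref, read):
    print(f"candidate SNV at offset {pos}: {r} -> {q}")
# candidate SNV at offset 11: T -> A
```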


Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in KRAS gene; this means that cetuximab is not effective for chemotherapy

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

BFAST took about 8 hours & 17 GB of memory to map a small set of reads; speedup was 3.7 on 8 cores (see the check below)

• Parallelization is typically done by
  • Separate runs for each lane of reads
  • Threads within a run

Step           1-thread    8-thread              8-thread
               time (h)    time (h)   Speedup    memory (GB)
Match              8.4         3.1       2.7        16.9
Align             19.3         4.2       4.6         2.0
Postprocess        0.4         0.4       1.0         2.2
Total             28.2         7.7       3.7

• Tabulated results are for
  • One lane of Illumina 100-bp paired-end reads: 52 million reads
  • One index with k=22 on the reference human genome (done previously)
  • One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
  • 26 GB input, half for reads & half for index; 19 GB output
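As a quick arithmetic check (not part of the slide), the overall speedup and parallel efficiency follow from the tabulated totals:

```python
# Quick check of the BFAST totals from the table above:
# speedup = T1 / Tp, parallel efficiency = speedup / p.

t1, tp, p = 28.2, 7.7, 8   # total times (h) from the table, threads used
speedup = t1 / tp
print(f"speedup = {speedup:.1f}, efficiency = {speedup / p:.2f}")
# speedup = 3.7, efficiency = 0.46
```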


Computational workflow for de novo assembly & variant calling

[Diagram: DNA reads in FASTQ format → de novo assembly (SOAPdenovo, Velvet, …) → contigs & scaffolds in FASTA format → pairwise alignment against a reference genome in FASTA format (ATAC, BLAST, …) → alignment info in various formats → variant calling (GATK, …) → variants: SNPs, indels, others.]

Goal: identify more complex variants, e.g.,
• large indels
• duplications
• inversions
• translocations


Key conceptual steps in de novo assembly

1. Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory

2. Merge overlapping, “good” reads into longer contigs, typically by simplifying the graph

3. Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford


de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding Eulerian path through graph


Diagram from Michael Schatz, Cold Spring Harbor
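A minimal sketch (not from the slides, and far simpler than SOAPdenovo or Velvet) of how such a graph is built: every k-mer in the reads becomes a node, and consecutive k-mers within a read (overlapping by k-1 bases) become directed edges; an assembler then walks paths through this graph to form contigs.

```python
# Minimal sketch of de Bruijn graph construction from reads. The read values,
# k, and the dict-of-sets representation are illustrative choices only; real
# assemblers add error correction, graph simplification, and paired-end
# scaffolding on top of this.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

reads = ["AGACCTA", "GACCTAG", "CCTAGGA"]   # toy overlapping reads
for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", sorted(successors))
```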


SOAPdenovo & Velvet are two leading assemblers that use the de Bruijn graph algorithm

• SOAPdenovo is from BGI
  • Code has four steps: pregraph, contig, map, & scaffold
  • pregraph & map are parallelized with Pthreads, but not reproducibly
  • pregraph uses the most time & memory
  • http://soap.genomics.org.cn/soapdenovo.html
• Velvet is from EMBL-EBI
  • Code has two steps: hash & graph
  • Both are parallelized with OpenMP, but not reproducibly
  • Either step can use more time or memory depending upon problem & computer
  • http://www.ebi.ac.uk/~zerbino/velvet
• k-mer size is an adjustable parameter
  • Typically it is adjusted to maximize the N50 length of scaffolds or contigs
  • N50 length is the median of the length distribution weighted by length: scaffolds (or contigs) at least as long as the N50 contain half of the total assembled bases (see the sketch below)
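Here is a minimal sketch (not from the slides) of the N50 calculation mentioned in the last bullet:

```python
# Minimal N50 calculation: sort contig/scaffold lengths in decreasing order
# and return the length at which the running sum first reaches half of the
# total assembled bases. The example lengths are illustrative only.

def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([100, 80, 60, 40, 20]))   # total 300; 100 + 80 = 180 >= 150 -> N50 is 80
```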


SOAPdenovo & Velvet each have their strengths

• Quality of assembly
  • Both give similar assemblies
• Speed
  • SOAPdenovo is faster
• Memory
  • SOAPdenovo uses much less memory
• vSMP
  • Velvet often runs well with vSMP, whereas SOAPdenovo does not
• Reads
  • Both work with Illumina reads, but only Velvet works with SOLiD reads


Graph step of Velvet works well on Gordon with vSMP; Gordon, Blacklight, & Triton PDAF have similar speeds when memory for hash step is small


Hash step of Velvet runs much slower on Gordon with vSMP & somewhat slower on Blacklight when memory for hash step is large; graph step still works well on Gordon with vSMP


What is going on?

• Memory access for the graph step of Velvet is fairly regular
  • This is efficient with vSMP
  • Performance improved significantly last year through tuning of vSMP by ScaleMP
• Memory access for the hash step of Velvet is nearly random
  • This is inefficient with vSMP
• Memory access for the pregraph step of SOAPdenovo (not shown) is also nearly random
  • Since the pregraph step uses the most memory, large-memory SOAPdenovo runs are slow with vSMP
• vSMP allows analyses otherwise possible on only a few computers


Computational workflow for de novo assembly followed by phylogenetic analyses

[Diagram: DNA reads in FASTQ format → de novo assembly (SOAPdenovo, Velvet, …) → contigs & scaffolds in FASTA format → multiple sequence alignment (ClustalW, MAFFT, …) → aligned sequences in various formats → phylogenetic tree inference (MrBayes, RAxML, …) → tree.]

A multiple sequence alignment is a matrix of taxa vs. characters:

  Human       AAGCTTCACCGGCGCAGTCATTCTCATAAT...
  Chimpanzee  AAGCTTCACCGGCGCAATTATCCTCATAAT...
  Gorilla     AAGCTTCACCGGCGCAGTTGTTCTTATAAT...
  Orangutan   AAGCTTCACCGGCGCAACCACCCTCATGAT...
  Gibbon      AAGCTTTACAGGTGCAACCGTCCTCATAAT...

Final output is a phylogeny, or tree, with taxa at its tips:

  /------ Human
  |
  |------ Chimpanzee
  +
  |    /------ Gorilla
  |    |
  \----+    /------ Orangutan
       \----+
            \------ Gibbon

Scalability of RAxML & MrBayes was improved during the past three years by Stamatakis, Goll, & Pfeiffer

• Hybrid MPI/Pthreads version of RAxML was developed
  • MPI code was added to the previous Pthreads-only code
  • Parallelization is multi-grained as well as hybrid
  • Change in algorithm often leads to a better solution
• Hybrid MPI/OpenMP version of MrBayes was developed
  • OpenMP code was added to the previous MPI-only code
  • Parallelization is multi-grained as well as hybrid
• Memory-efficient code called RAxML-Light was developed
  • This allows very large trees to be analyzed together with RAxML
• Single-node runs are more efficient than before
• Multi-node runs with more cores are possible
• Scalability before was limited to about 8 cores for typical analyses
• Hybrid codes now scale well to 10s of cores for typical analyses
• Scripted version of RAxML-Light scales even further

RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*; speedup is superlinear for comprehensive analysis at some core counts; scalability improves with number of patterns

* Number of patterns = number of unique columns in multiple sequence alignment
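A minimal sketch (not from the slides) of this pattern count, i.e., the number of distinct columns in the alignment matrix:

```python
# Count the site patterns (unique columns) in a multiple sequence alignment.
# The toy alignment below is a 10-column excerpt of the sequences shown on
# the earlier alignment slide and is for illustration only.

def count_patterns(alignment):
    """alignment: list of equal-length aligned sequences (rows = taxa)."""
    columns = zip(*alignment)     # iterate over columns of the matrix
    return len(set(columns))      # unique columns = patterns

alignment = [
    "AAGCTTCACC",   # Human
    "AAGCTTCACC",   # Chimpanzee
    "AAGCTTTACA",   # Gibbon
]
print(count_patterns(alignment))  # 6 distinct site patterns in this excerpt
```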


RAxML run time for a DNA analysis went from >3 days on 1 core to ~1.3 hours on 60 cores; large amino acid analysis was solved in 4.4 days on 160 cores

                                    Boot-    Data    Time (h)    Time (h)    Speed-
 Taxa   Characters   Patterns      straps    type     & cores     & cores       up
  150        1,269      1,130         400     RNA      2.1, 1    0.06, 60       33
  218        2,294      1,846    450, 500     DNA      8.7, 1    0.20, 60       43
  404       13,158      7,429    450, 400     DNA     74.8, 1    1.27, 60       59
1,596       10,301      8,807         160      AA    106, 160

• Tabulated results are for
  • Comprehensive analysis with the number of bootstrap searches determined automatically, followed by 10 or 20 thorough searches
  • 32-core nodes of Trestles with 2.4-GHz AMD Magny-Cours processors
  • 10 MPI processes & 6 threads/process using 60 cores (which gives better performance than using 64 cores)
  • 20 MPI processes & 8 threads/process using 160 cores
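A quick check (not part of the slide) that the tabulated speedups match the "parallel efficiency >0.5 up to 60 cores" claim from the previous slide:

```python
# Parallel efficiency = speedup / cores, using the speedups from the table.

runs = {"1,130 patterns": 33, "1,846 patterns": 43, "7,429 patterns": 59}
cores = 60
for label, speedup in runs.items():
    print(f"{label}: speedup {speedup} on {cores} cores -> "
          f"efficiency {speedup / cores:.2f}")
# 0.55, 0.72, 0.98 -- efficiency improves with the number of patterns
```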


MrBayes runs 1.6x to 3.3x faster on Gordon than on Trestles, depending upon the size of the data set; speedup is greater for larger data sets that are not partitioned


The CIPRES gateway lets biologists run parallel versions of tree inference codes via a browser interface on the Trestles & Gordon supercomputers at SDSC


Questions & answers about analyzing DNA sequence data

• How big is the flood of data from high-throughput DNA sequencers?
  • >100 GB per day from a single Illumina sequencer now
  • 1 TB/day from a sequencer likely by 2014
• What are three compute- and data-intensive analyses of DNA sequence data?
  • Mapping of short reads against a reference genome
  • De novo assembly of short reads
  • Phylogenetic tree inference


So how compute- and data-intensive are the three bioinformatics analyses we considered?

Here is a qualitative summary

Analysis                      Compute-intensive   Memory-intensive*   I/O-intensive
Read mapping                          x                                     x
De novo assembly                      x                   x                 x
Tree inference (usually)              x
Tree inference (sometimes)            x                   x

* I.e., large memory per node is needed for shared-memory implementations
