<<

bioKepler: A Comprehensive Bioinformacs Scienfic Workflow Module for Distributed Analysis of Large-Scale Biological Data

Ilkay Alntas1, Jianwu Wang2, Daniel Crawl1, Shweta Purawat1

1 San Diego Supercomputer Center, UC San Diego 2 UMBC

WorDS.sdsc.edu A Toolbox with Many Tools

• Data • Search, database access, IO operaons, streaming data in real-me… • Compute • Data-parallel paerns, external execuon, … • Network operaons • Provenance and fault tolerance

Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution! Workflows are Used in These Diverse Scenarios in Biological Sciences Publicaon Data Archival • From analysis to searchable results • Often for data • Standardization reduction • Auto generation of • In real-time or Data Analysis methods and materials offline

Many forms • Data-intensive Acquision • HPC Workflows foster Generaon • Local Exploratory Data collaborations! • Sequencers • Sensor networks • Flexibility and synergy • Medical imaging • Optimization of resources • Increasing reuse • Standards compliance CAMERA Example:

Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) http://camera.calit2.net CAMERA is a Collaborative Environment

Data Cart Multiple Available User Workspace Mixed collections of Single workspace with CAMERA Data (e.g. projects, samples) access to all data and Data Discovery results (private and GIS and Advanced query shared) options

Group Workspace Share specified User Workspace data with Data Analysis collaborators Workflow based analysis Workflows are a Central Part of CAMERA • CAMERA-supported – 28 existing workflows Taxonomy Binning • Workflows under development – Fragment Recruitment Viewer Duplicate filtering Metagenomic – Next Generation Annotation QC and – VIROME Pipeline filter Assembly Clustering – Standalone tools – National Center for MoreGenome than 1500 workflow Research submissions monthly!BLAST – Joint Genome Institute Comparison, Statistical analysis, and more • User built workflows – Currently running in a sandbox • Inputs: from local or CAMERA file systems; – Will be ported to a virtual user-supplied parameters cloud environment • Outputs: sharable with a group of users and links to the semantic database

All can be reached through the CAMERA portal at:hp://portal.camera.calit2.net CAMERA Portal - Workflows CAMERA Workflows

RAMMCAP CAMERA Workflows CAMERA Workflows CAMERA Job Status CAMERA Workflow Results Pushing the boundaries of existing infrastructure and workflow system capabilities New Requirements from the User Community • Increase reuse – best development practices by the scientific community – other bio packages

• Increase programmability by end users – users with various skill levels – to formulate actual domain specific workflows

• Increase resource utilization – optimize execution across available computing resources – in an efficient, transparent and intuitive manner

• Make analysis a part of the end-to-end scientific model from data generation to publication RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences Annotaon features: • tRNA predicon (tRNAscan) } Clustering of reads • rRNA predicon (meta_RNA, BLAST) } Mul-step clustering of ORFs • ORF call (ORF_finder, Metagene) } GO assignment • RPS-BLAST against COG etc } EC number assignment • HMMER against Pfam / Tigrfam

A number of bioinformacs tools are used in RAMMCAP

Tool Descripon BLAST Scalable parallel database search with blastn, blastp, tblastn, blastx, tblastx MegaBLAST Fast database search with MegaBLAST Diversity Diversity analysis for viral metagenome QC Quality control for 454 raw reads CD-HIT-454 Idenfy arficial duplicates from 454 reads RAMMCAP Metagenome annotaon - rRNA, tRNA, ORF predicon - reads and ORF clustering - reads and ORF informaon - family and funcon annotaon (Pfam, TIGRfam, COG) - and Enzyme Classificaon annotaon - Combined annotaon summary FRV Fragment Recruitment Viewer Assembly Consensus-based meta-assembler for 454 reads KEGG Pathway annotaon by search KEGG database with blastp RDP binning Taxonomy binning of rRNA sequences using RDP classifier BLAST binning Taxonomy binning by querying ref. rRNA DB using blastn tRNA Idenficaon of tRNAs from fragments using tRNA-scan Meta-RNA Idenficaon of rRNAs from fragments using HMM BLAST-RNA Idenficaon of rRNAs by querying ref. rRNA DB using blastn ORF_finder ORF call by six reading frame translaon Metagene ORF call by Metagene FragGeneScan ORF call with FragGeneScan from 454 reads Pfam Protein family annotaon against Pfam using HMMER TIGRfam Protein family annotaon against TIGRfam using HMMER COG Protein family annotaon against NCBI COG using rps- KOG Protein family annotaon against NCBI KOG using rps-blast PRK Protein family annotaon against NCBI PRK using rps-blast CD-HIT-EST Clustering of reads CD-HIT Clustering of ORFs H-CD-HIT Mulple level clustering of ORFs into ORF family Original implementaon of the annotaon workflow in Kepler

A green box is called a ‘actor’ , Data flow is divided. which performs a task.

This special actor represents an annotaon component, such as BLAST search.

Workflow parameters, which can be specified by users in portal, are passed to workflow components. Each actor was a wrapper to a web service!

Customized web services! RAMMCAP

metagene

QC cd-hit

blast

tRNA

Data size KB MB GB TB

CPU me Second Hour Day Month Year QC cd-hit metagene tRNA hmmer blast

Memory GB 10GB 100GB QC tRNA metagene hmmer blast cd-hit

Parallel No need No Mul threading MPI Map Reduce

QC cd-hit hmmer blast hmmer tRNA metagene blast RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences Data size KB MB GB TB

CPU me Minute Hour Day Month Year QC cd-hit metagene tRNA hmmer blast

Memory GB 10GB 100GB QC tRNA metagene hmmer blast cd-hit

Parallel No need No Mul threading MPI Map Reduce

QC cd-hit hmmer blast hmmer tRNA metagene blast

Data size KB MB GB TB NGS

CPU me Minute Hour Day Month Year QC metagene cd-hit tRNA hmmer blast

Memory GB 10GB 100GB QC tRNA metagene hmmer blast cd-hit

Parallel No need No Mul threading MPI Map Reduce

QC cd-hit hmmer blast hmmer tRNA metagene blast Another cases – RNA-seq / genomic / metagenomic

Raw reads

Reads QC QC

HQ reads mapping Velvet, BWA SOAPdenovo, Bowe mapping Assemble Abyss assembly BLAST Oases Trinity

Alignments Congs

Further analysis

Data size KB MB GB TB NGS

assembly CPU me Minute Hour Day Month QC mapping

Memory GB 10GB 100GB QC mapping assembly

Parallel No need No Mul threading MPI Map Reduce

mapping assembly QC mapping bioKepler implementaon: Using bioActors instead of wrapper actors

bio bio

bio bio

bio bio bio bio bio

bio bio

bio bio

bio

bio

Wrapper Actors bioActors • • Need implementaon of underlying Reusable • Mulple execuon modes computaonal tools • Build-in parallel execuon capabilies www.bioKepler.org

Gateways and other user environments Kepler and Provenance Framework bioKepler BioLinux Galaxy Clovr Hadoop …

CLOUD and OTHER COMPUTING RESOURCES e.g., SGE, Amazon, FutureGrid, XSEDE

A coordinated ecosystem of biological and technological May 22nd, 2014 packagesScalable Bioinformacs Boot Camp for bioinformatics! The bioKepler Approach

• Parallel Computation Framework – Use Distributed Data-Parallel (DDP) frameworks, e.g., MapReduce, and other parallelization methods to execute subworkflows • bioActors – Configurable and reusable higher-order components for bioinformatics and • Transparent support for different execution engines and computational environments • Deployment on diverse environments Reuse, Programmability, Execution www.bioKepler.org

• Funded by NSF ABI & CI Reuse programs - Altintas (PI) and Li (Co-PI) • Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data

Big improvement on usability and programmability by end users!

bioKepler Kepler supports • Workflows • Other third party programming tools, e.g., R, Kepler CloudBioLinux • CORE Matlab, KNIME • Distributed Data • Extensible task and data Parallel parallelizaon • Provenance • Service orientaon • Reporng • Execuon on mulple • Run Manager engines, e.g., SDF, SGE, Galaxy … Bio-Linux • … Hadoop bioKepler’s Conceptual Framework

Kepler Data-Parallel Execution Patterns Provenance Map-Reduce Master-Slave All-Pairs Data Lineage Execution History bioKepler Director Scheduler Reporting Bioinformatics Tools bioActors Report Designer PDF Generation BLAST Mapping Executable Assembly HMMER Clustering Workflow Plan CD-HIT Run Manager Search Tag Workflow

Customize Execution Fault-Tolerance Bioinformatician & Integrate Engine Alternatives Error Handling

Deploy & Execute

Data Compute Adhoc Amazon Network Ensembl Transfer EC2 Triton Resource CAMERA Genbank Sun Grid FutureGrid Engine bioActors

• Set of steps to execute a bioinformatics tool locally or in an external environment – Locally executable – Parallelized external execution • Customizable by the user based on external packages – Tools imported from CloudBioLinux • Tools are evaluated on their computational requirements Transparent Execution includes Parallelization Solutions in Distributed Environments

• Tradional parallel programming interfaces – Examples: MPI and OpenMP – Hard to implement – Original sequenal tools cannot be reused

• Parallel job execuon – Examples: SGE and Condor – Original sequenal tools can be reused – Create small jobs by spling data or tasks – Hard to achieve data locality for each job

• Data parallel job execuon – Examples: Hadoop and Stratosphere – Original sequenal tools can be reused – Support customized and automa data paron and distribuon – Support data locality for each job through special distributed file system, HDFS Distributed Data-Parallel bioActors

• Set of steps to execute a bioinformatics tool in DDP environment • Customized from the ExecutionChoice actor • Includes: – Data-parallel patterns, e.g., Map, Reduce, Cross, All-Pairs, etc., to specify data grouping – I/O to interface with storage – Data format specifying how to split and join DDP bioActor Usage Model

1. Search bioActor Library 4c. Save in Library A1 A2 An 4b. Add to User: Larger Workflow 2b. Choose 2a. Choose Workflow Developer Generic Specific DDP Director Workflow DDP DDP Blast Generic 3. Add to Workflow 2b. Create 4a. Execute Sub-Workflow Results Status of bioActors 500+ bioActors are listed under current bioKepler release, ~40 of them are parallelized. Example bioActors

• Alignment: BLAST, BLAT • Profile-: PSI-BLAST • Hidden Markov Model: HMMER • Mapping: Bowtie, BWA, Samtools • Multiple Alignment: ClustalW, Muscle • Clustering: CD-HIT, Blastclust • Gene Prediction: Glimmer, Genescan, Fraggenescan • tRNA prediction: tRNA-scan, Meta-RNA • Phylogeny: FastTree, RAxML Example Workflows Current Release Downloadable as a package at: http://www.biokepler.org/releases • A bioKepler VM executable on Amazon EC2, FutureGrid and SDSC Cloud – Builds upon CloudBioLinux including Bio- Linux and Galaxy • A bioActor template that can be customized for different execution choices – e.g., local vs. Map/Reduce on a specific environment • Example usecases Demo and Quesons

WorDS Director: Ilkay Alntas, Ph.D. Email: al[email protected]