Providing Bioinformatics Services on Cloud

Christophe Blanchet, Clément Gauthey
Infrastructure Distributed for Biology - IDB
CNRS-IBCP FR3302, Lyon, FRANCE
http://idee-b.ibcp.fr
EGI CF13, Manchester, 9 April 2013

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).

Bioinformatics Today
• Biological data are big data
  • 1512 online databases (NAR Database Issue 2013)
  • Sanger Institute, UK, 5 PB
  • Beijing Genomics Institute, China, 4 sites, 10 PB
  ➡ Big data in many places
• Analysing such data has become difficult
  • Scale-up of the analyses: gene/protein to complete genome/proteome, ...
  • Many different tools in daily use
  • That need to be combined in workflows
  • Usual interfaces: portals, Web services, federation, ...
  ➡ Datacenters with ease of access/use
• Distributed resources
  • Experimental platforms: NGS, imaging, ...
  • Bioinformatics platforms
  ➡ Federation of datacenters
[Figure: diagram of distributed sites, with nodes labelled ADN, BI, M and CC]

Sequencing Genomes
Complete genome sequencing has become a lab commodity with NGS (cheap and efficient)
source: www.genomesonline.org
source: www.politigenomics.com/next-generation-sequencing-informatics

Infrastructures in Biology
Many tools and web services to process and visualize large amounts of data

The scene
• Bioinformatics services providers
  • Is it easy to deploy many (incompatible) tools?
  • To connect them to public databases?
  • To limit transfer of huge data?
  • To provide users with their own computing resources?
  • With their own isolated storage?
• Scientists
  • Is it easy to access/use these tools?
  • To adapt them to your usage?
  • To get your/other tools deployed on a datacenter?
  • To combine them?
  • To get your own computing/storage resources?

IDB’s Cloud
• Cloud workbench for Biology
  • 13 turnkey bioinformatics appliances (as of Apr. 2013)
  • Running since Sept. 2011, open to the Biology community
  • Lyon, FRANCE
• Powered by
  • StratusLab
    • Compute nodes, Block storage
    • +900 cores, +4TB RAM, 36TB vdisks
    • Mainly Intel SandyBridge servers with 32c 128GB
    • Bigmem servers with 64c 768GB
    • VMs from 1 to 64c, 512MB to 760GB RAM
  • + OpenStack
    • Object storage (Swift)
    • +200 TB redundant & scalable storage

Driven through a simple web interface

Integrate Bioinformatics Tools in Cloud
[Figure: Virtual Linux machines (RedHat, CentOS, Debian, Ubuntu, Suse) + Bioinformatics Tools (Abyss, BLAST, GOR4, Ray, PhyML, BWA, ClustalW, FastA, SSearch) ➡ Create new Appliance ➡ Bioinformatics Marketplace (Sequence, Structure, NGS, Galaxy, ARIA, (…))]
• Appliances are virtual machines
  • small: a few GB, easy to convert to most virtualization formats
  • Installed and pre-configured with common bioinformatics tools
  • e.g. BLAST, ClustalW, ARIA, MEME, HMMer, TopHat, BWA, Samtools, etc.

Bioinformatics Appliances

Select your bioinformatics tools

Run Bioinformatics Cloud Instances
[Figure: Bioinformatics Marketplace (Sequence, Structure, NGS, Galaxy, ARIA, (…)) ➡ Launch Instances on IBCP's Cloud Resources. Labels: Portal; PaaS: launch jobs (BLAST, Clustal, etc.); IaaS: ssh; Master with Shared FS & Storage; Workers; VM ARIA; VM CNS]

Manage your Cloud Instances

Biological Data in Cloud
[Figure: data flow around the cloud. Labels: Upload your data (scp, http/S3); Public Data sources (UNIPROT, EMBL, PROSITE, Genomes, PDB), shared (NFS); Bioinformatics User Portal; PaaS: launch jobs (BLAST, Clustal, etc.); IaaS: ssh; Master with Shared FS & Storage; Workers; VM ARIA; VM CNS; Persistent data on pdisk (iSCSI); Get your results (scp, http/S3)]

Example: ‘biocompute’ Appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
  • BLAST, FastA, SSearch, HMM, ...
  • ClustalW2, Clustal-Omega, Muscle, ...
  • Bowtie(2), BWA, samtools, ...
  • MEME, R, etc.
• Connected to public reference data
  • Uniprot, EMBL, genomes, PDB, etc.
  • Automatically shared with the VMs

Example: Galaxy portal for NGS analyses
• Analyse NGS data
  • the Galaxy portal is widely used in the community
  • connected to large public data: sequences and indexes
  • large user data (GBs)
• Preserve workflows and results (persistent storage)

Example: Proteomics
• Motivation
  • Collaboration with a mass spectrometry platform
  • Running out of space on their local resources
• Protein identification
  • Mass experimental data
  • Reference databases: nr, Swiss-Prot
  • Reference screening tools: OMSSA, X!Tandem
• User interface
  • Remote display
    • NX
  • Reference GUIs
    • SearchGUI
    • PeptideShaker
source: PeptideShaker site

Conclusion
• Provide turnkey bioinformatics appliances
  • Standard tools and pipelines
  • Interoperability: ready to run on cloud
  • Easier to transfer appliances than data (GB vs TB)
• Provide a cloud infrastructure tightly connected to existing bioinformatics infrastructure
  • Public IDB’s bioinformatics cloud
  • Linked to public biological databases
  • In collaboration with the French Bioinformatics Institute
• Ease the usage by scientists
  • Usual bioinformatics gateways
  • Persistent and large ubiquitous storage
  • Web interface for cloud management

Perspectives
• Define good practices to provide the academic community and industry with bioinformatics services!
• French Bioinformatics Institute - IFB
  • Goals are to provide core bioinformatics resources to the national and international life science research community in key fields such as genomics, proteomics, systems biology, etc.
  • Aims at building a national academic cloud devoted to Bioinformatics, inspired by the model evaluated through the IDB’s cloud.
• European ELIXIR infrastructure
  • To build a sustainable European infrastructure for biological information, supporting life science research and its translation
  • IFB will be the French representative in ELIXIR.

Questions ?
• Acknowledgment
  • StratusLab members
  • co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).
http://idee-b.ibcp.fr
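
The conclusion's point that appliances are easier to transfer than data (GB vs TB) can be made concrete with a rough calculation. The sketch below is illustrative only: the 5 GB appliance size (standing in for "a few GB"), the 1 TB dataset size and the 1 Gbit/s sustained link are assumptions, not figures from the slides, and the helper name transfer_seconds is hypothetical.

```python
# Rough, idealised transfer-time comparison (no protocol overhead, no contention).
# All sizes and the link speed below are assumptions chosen for illustration.

GB = 10**9   # bytes in a gigabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)


def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Return the time needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / bandwidth_bits_per_s


LINK = 10**9  # assumed sustained throughput: 1 Gbit/s

appliance_s = transfer_seconds(5 * GB, LINK)  # assumed 5 GB appliance
dataset_s = transfer_seconds(1 * TB, LINK)    # assumed 1 TB dataset

print(f"appliance: {appliance_s:.0f} s")
print(f"dataset:   {dataset_s / 3600:.1f} h")
print(f"ratio:     {dataset_s / appliance_s:.0f}x")
```

Under these assumptions, moving the appliance to the data takes well under a minute, while moving the data to the tools takes hours, which is the motivation for shipping small virtual machines to a cloud that already hosts the reference databases.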
