Bioinformatique sur Cloud Cas d’usage avec le portail Galaxy

Christophe Blanchet

Institute of Biology and Chemistry of Head of Service ‘’Infrastructure for Biology - IDB’’ CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr

IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO- RI-261552), the French National Research Agency's Arpege Programme (ANR-10-SEGI-001) and by the French Institute for (IFB-RENABI) Bioinformatics Today • Biological data are big data • 1512 online databases (NAR Database Issue 2013) • Institut Sanger, UK, 5 PB • Beijing Genome Institute, China, 5 sites, 12.6 PB ➡ Big data in many places • Analysing such data became difficult • Scale-up of the analyses : / to complete genome/ proteome, ... • Lot of different daily-used tools • That need to be combined in workflows • Usual interfaces: portals, Web services, federation,... ➡ Datacenters with ease of access/use • Distributed resources ADN Experimental platforms: NGS, imaging, ... • ADN • Bioinformatics platforms BI Federation of datacenters M ➡ BI CC ADN

ADN

BI ADN Ecole Bioinformatique Aviesan, 18 octobre 2013 ADN ADN BI

ADN ADN

M M IDB Cloud and Bioinformatics Appliances

Cloud workbench for Biology • tools R Running since Sept. 2011 BLAST • FastA OMSSA CNRS-IBCP FR3302, Lyon, France ClustalW2 SSearch PeptideShaker ARIA X!tandem opened to Biology community HMMer BWA • TopHat Galaxy Clustal Muscle 14 bioinformatics appliances: Galaxy portal, fastQC • Omega standard compute nodes, proteomics, virtual desktop, Create structural biology, ... new cloud services + +40 users from all IFB regional centers system • Virtual Machines PRABI 15, APLIBIO 14, RENABI-NE 8, -SO 2, -GS 1, -GO 1 • VMs up to 32cores-768GB RAM Bioinformatics Marketplace Structures Galaxy Sequences Proteomics ... Infrastructure • Move cloud Compute +900cores +4TB ram virtual • data machines Standard nodes (32c-128GB) • BI tools VM: public data BLAST, Bigmen nodes (64c 768GB) user data ClustalW2, • UNIPROT etc. Z Powered by StratusLab EMBL • B Genomes PDB • Storage +250TB A PROSITE • Virtual disks, object storage (S3) IDB Cloud

Ecole Bioinformatique Aviesan, 18 octobre 2013 Cloud extended services

• Bioinformatics Marketplace • find appropriate appliances more easily. • reduce “noise” in the central Marketplace • respect visibility contraints for the bioinformatic Native cloud services appliances, such as confidentiality • Authentication • Bioinformatics metadata ‘’bio:tool’’ • Virtual machine IDB • additional elements related to bioinformatics tools management • to annotate appliances • Persistent disk service • help users to search for the tools themselves or • Client CLI the type of analysis • etc. • select suitable bioinformatics appliances containing the required tools • Integrated Web interface • VM & virtual disks management • browse bionformatics appliances with ‘‘bio:tool’’ MDz

Ecole Bioinformatique Aviesan, 18 octobre 2013 Driven throught a simple web interface

Ecole Bioinformatique Aviesan, 18 octobre 2013 Run your Bioinformatics Cloud Instances

Bioinformatics Marketplace

Sequence Structure NGS Galaxy ARIA (…)

Launch

Instances

PaaS launch jobs BLAST, IaaS ssh Clustal, etc.

Shared FS

Master & Storage Workers VM ARIA VM CNS

IBCP's Cloud Portal Resources

Ecole Bioinformatique Aviesan, 18 octobre 2013 Biological Data in Cloud Upload your data sftp/http/S3

Public PaaS Data sources UNIPROT launch jobs BLAST, IaaS ssh Clustal, EMBL shared etc. PROSITE (NFS) Genomes PDB Shared FS

Master & Storage Workers VM ARIA VM CNS

pdisk (iSCSI) Bioinformatics User Portal Cloud Persistent data sftp/http/S3

Get your results

Ecole Bioinformatique Aviesan, 18 octobre 2013 Examples of Cloud Bionformatics Appliances

Ecole Bioinformatique Aviesan, 18 octobre 2013 Standard Bioinformatics node

• ‘Biocompute’ appliance • Use your own instance(s) • With pre-installed standard bioinformatics tools • BLAST, FastA, SSearch,HMM,... • ClustalW2, Clustal-Omega, Muscle,.. • Bowtie(2), BWA, samtools, ... • MEME, R, etc. • Connected to public reference data • Uniprot, EMBL, genomes, PDB, etc. • Automaticaly shared to the VMs

Ecole Bioinformatique Aviesan, 18 octobre 2013 Structural Biology • TOwards StruCtural AssignmeNt Improvement • To improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with ARIA software • Large computational needs. • A NMR laboratory will not specially invest in building a cluster of about 100 nodes to be able to run such NMR structure calculations. • Flexibility of the cloud to deploy the different required bioinformatics tools can accelerate such a procedure. • Commercial interest in providing such tools to structural biologists on a “pay as you go” basis.

• Endorsers: Institut Pasteur Paris and CNRS IBCP

Ecole Bioinformatique Aviesan, 18 octobre 2013 Proteomics desktop • Motivation • Collaboration with a mass spectroscopy platform • Running out of space on their local resources • Protein identification • Mass experimental data • Reference databases : nr, Swiss-Prot • Reference screening tools: OMSSA, X!Tandem • User interface • Remote display • NX • Reference GUIs • SearchGUI • PeptidShaker

source: PeptideShaker site Ecole Bioinformatique Aviesan, 18 octobre 2013 MapReduce Biology • Provide turnkey virtual machine with pre- configured mapreduce framework Developed in the context of the French • Accelerate bigadata analysis with the two steps map & project MapReduce, ANR ARPEGE reduce paradigm • Hadoop MapReduce 1.0.4 • Appliances (2) • standard hadoop mapreduce • bioinformaytics software integrated in hadoop • Sequences similarity with mapreduce paradigm • FastA & SSearch • deploy database of sequences in HDFS • compare each structure to others

Databank

subset subset ... Each mapper #01 #02 FastAMR splits the send the databank into subsets score and and puts them in the sequences to DFS along with the reducers sequences file Mappers Results score sequence FastAMR FastA FastA Reducers score sequence #01 #02 ......

Users run the FastAMR script with its sequences Each mapper Reducers copy the and the databank runs a FastA best scores of the User's program on a whole experiment Sequences part of the in the DFS databank Ecole Bioinformatique Aviesan, 18 octobre 2013 Cas d’usage avec Galaxy

Ecole Bioinformatique Aviesan, 18 octobre 2013 Compte Cloud IDB

• Connectez-vous • Remplissez les différents champs • adresse mail institutionnelle • Créer la demande implique l’acceptation des conditions d’utilisations !

https://idee-b.ibcp.fr/cloud.html

Ecole Bioinformatique Aviesan, 18 octobre 2013 Appliances disponibles

• Liste des appliances existantes • Documentation spécifique aux appliances • Création directe • bouton ‘Power’

Ecole Bioinformatique Aviesan, 18 octobre 2013 Créer mon portail Galaxy

• Appelée aussi ‘Instance’ • Compléter les différents paramètres • lui assigner un nom • nombre de CPUs • taille mémoire • attacher un disque virtuel • Cluster de VM • remplir le nombre de VMs • choix du nom unique

Ecole Bioinformatique Aviesan, 18 octobre 2013 Connexion sur mon instance Galaxy

Ecole Bioinformatique Aviesan, 18 octobre 2013 Les disques durs virtuels

• Un disque virtuel permet • Actions de conserver ses données • créer un vdisk indépendamment de • gérer ses vdisks l’exécution des VMs • Utiliser un vdisk • retrouver ses données d’une • à la création de la VM VM à la suivante. • montage à chaud

Ecole Bioinformatique Aviesan, 18 octobre 2013 Echanger les données avec mon portail Galaxy

• sftp / scp • client graphique: Cyberduck, Transmit, Filezilla, ... • Web: Galaxy - Get Data - Lien pour download

Ecole Bioinformatique Aviesan, 18 octobre 2013 Conclusion • Added value of cloud, e.g. NGS with Galaxy tools R • for scientific analyses: user-specific resources, isolated, BLAST different instances together FastA OMSSA ClustalW2 SSearch for training: Oct 2012 Bordeaux, Mai 2013 Galaxy Lille, PeptideShaker • ARIA X!tandem HMMer BWA (next) 2014 Galaxy Jouy TopHat samtools Galaxy Clustal Muscle fastQC • for tools integration: semantic annotation, solve Omega software dependencies Create new cloud for development & operations (DevOps): different services + • Linux system versions at the same time Virtual Machines • Provide turnkey bioinformatics appliances Bioinformatics Marketplace Standard tools and pipelines • Structures Galaxy ... • New developments Sequences Proteomics Ready to run on clouds Move • cloud data virtual • Public bioinformatics cloud (e.g. IDB) machines BI tools VM: Tightly connected to existing bioinformatics public data BLAST, • ClustalW2, user data UNIPROT etc. resources Z EMBL Linked to public biological databases B Genomes • PDB A PROSITE • In collaboration with the French Institute of IDB Cloud Bioinformatics

Ecole Bioinformatique Aviesan, 18 octobre 2013 IFB - French Institute of Bioinformatics

Mission : to make available core bioinformatics resources to the national/international life science research community.

• To provide support for biology programs • supporting projects • training users • To provide an IT infrastructure devoted to management and analysis of biological data • material resources : CPUs, disks, etc. • availability of biology data collections • deployment of bioinformatics tools • To act as a middleman between the life science community and the bioinformatics/computer science research community

Institut Français de Bioinformatique IFB - Infrastructure

• IFB-Core resources • Academic cloud for life cience • Will be hosted at CNRS IDRIS supercomputing center (PARIS) - RENABI IFB - A pilot infrastructure (2014-Q1) Bioinformatics • French Institute Production infrastructure APLIBIO • RENABI-NE +5,000cores 1PB (2014-S2) RENABI-GO • + Regional resources • 6 regional bioinformatics centers +6,000 cores ~1PB FIB-core IT • CNRS-IDRIS, Paris • 2 existing clouds: PRABI-IBCP IDB cloud (Lyon) & Genouest genocloud (Rennes)

PRABI

• Deploy a clouds federation RENABI-SO RENABI-GS

Institut Français de Bioinformatique Questions ?

• Acknowledgment • Clément Gauthey (IDB) • StratusLab members • co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).

http://idee-b.ibcp.fr

Ecole Bioinformatique Aviesan, 18 octobre 2013