Three Data Delivery Cases for EMBL-EBI's Embassy

Guy Cochrane
EMBL European Bioinformatics Institute
www.ebi.ac.uk

EMBL-EBI data resources

Genes, genomes & variation
• European Nucleotide Archive
• 1000 Genomes
• Ensembl
• Ensembl Genomes
• Ensembl Plants
• European Genome-phenome Archive
• Metagenomics portal
• GWAS Catalog browser

Protein sequences
• InterPro
• Pfam
• UniProt

Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank

Expression
• ArrayExpress
• Expression Atlas
• MetaboLights
• PRIDE

Chemical biology
• ChEMBL
• ChEBI

Literature & ontology
• Europe PubMed Central
• Gene Ontology
• Experimental Factor Ontology

Reactions, interactions & pathways
• IntAct
• Reactome
• MetaboLights

Systems
• BioModels
• Enzyme Portal
• BioSamples

Sequence data at EMBL-EBI

[Diagram: data layers. Sample/method, Read and Alignment are each drawn twice; Assembly and Annotation are each drawn once.]

European Genome-phenome Archive
- Controlled access data
- Human data around molecular medicine
- http://www.ebi.ac.uk/ega/

European Nucleotide Archive
- Unrestricted data
- Pan-species and application
- http://www.ebi.ac.uk/ena/

Infrastructure provision
- BBSRC: RNAcentral, MG Portal
- MRC: 100k Genomes data implementation
- EC: COMPARE, MicroB3, ESGI, BASIS
- etc.

Challenges
• Data have high volume and grow rapidly
• Data are dynamic (continuous feed) and their application has urgency
• Users require arbitrary and ad hoc access

Tara Oceans
[Image slide, shown twice. Callout on the repeat: Capacity]

Infectious disease
• Opportunity: a methodological revolution in clinical and public health towards shotgun sequencing-based methods
• Scientific power: sequence harbours rich information
  - Diagnostic: identification, typing, resistance profiling, etc.
  - Public health: outbreak detection, response strategy, vaccine development
  - Mechanistic: host interactions, pathogenicity, virulence, transmission, antimicrobial resistance
• Informatics roles for EMBL:
  - COMPARE: rapid global sharing of surveillance and outbreak data, systematic integrated analysis, compute provision (Embassy)
  - Standards for reporting, analysis and the communication of results
  - New algorithms and analysis methods
  - User interfaces for surveillance data reporting, across the domains
• COMPARE: recently launched Horizon 2020 project in which EMBL-EBI is informatics provider
• Global Microbial Identifier: initiative with EMBL-EBI involvement supporting technologies, standards and data sharing for pathogen surveillance

COMPARE platform
[Diagram, shown twice. Callout on the repeat: Urgency]
• Column headings: Sources; Processes; Portals and environments
• Components: COMPARE Registry, COMPARE Data Resource, COMPARE workflow engine, COMPARE Portal
• Data: public data, INSDC data exchange, managed access data, private data
• Workflow engine steps: assembly & alignment, annotation, typing, workflow integration
• Tools: ‘Default’ tools, ‘Hosted tools’
• Workflow development environments: food, clinical, outbreak
• Interfaces: API (three instances)
• Infrastructure: EBI infrastructure, Embassy infrastructure, DTU infrastructure, Embassy virtual domain

Personalised medicine
[Slide shown twice. Callout on the repeat: Arbitrary access]
• Motivation: personalised studies of variation, cancer mutation, epigenetics, regulation and expression require references for comparison and interpretation
• As part of GA4GH, EMBL-EBI is working on
  - Resources serving reference human genomic and transcriptomic data, including Google read API, variant ‘Beacons’, etc.
  - CRAM compression supporting greater data fluidity and APIs to allow direct computational access
  - Delivery and synchronisation of high volume datasets to local Embassy and remote cloud infrastructures
• Past and current FP7 projects include SLING, BASIS, ESGI

ENA conventional read data delivery
[Diagram]
• Conventional infrastructure (FTP, Aspera, GridFTP)
• ENA metadata
• FIRE1: ENA data (NFS)

ENA Embassy read data delivery
[Diagram, built up over two slides]
• Conventional infrastructure (FTP, Aspera, GridFTP)
• ENA metadata
• FIRE2: ENA data (Cleversafe)
• Access labels: FUSE, HTTP
• Embassy cloud infrastructure (VMware -> OpenStack)
  - Marine cache: Tara Oceans Embassy
  - Pathogen cache: COMPARE Embassy
  - CRAM cache: GA4GH Embassy

ENA external read data delivery
• …phase II

EMBL-EBI Embassy Cloud
Steven Newhouse
Head of Technical Services

The Challenge Facing EMBL-EBI
• Volume and variety of genomic data expanding
  - EMBL-EBI data doubling every year; replication is challenging
  - Infrastructure currently 50,000 CPUs & 60+ PB
• Need to support complex analysis scenarios
  - Web and programmatic access to services (3M unique users)
  - Access to both public and managed access data sets
  - Bespoke workflows and tools across a variety of domains
• Hard for users to replicate data sets for local analysis
  - Use the ‘cloud’ to bring local analysis to EMBL-EBI data

EMBL-EBI Embassy Cloud
• Service hosted at EMBL-EBI data centres
  - Direct network access to public and managed data sets
  - Direct network access to public services
  - Expect both academic and commercial users
• Technical Implementation
  - Logically isolated outside EMBL-EBI’s LANs
  - Secure flexible infrastructure for both tenant and host
  - Resources exposed using VMware’s vCloud Director & OpenStack
  - Provide isolated IaaS clouds to multiple users

Why ‘Embassy’ Cloud?
• An embassy is sovereign territory in a host country
  - Host Country: EMBL-EBI Data Centre
  - Sovereign Territory: Host Country not allowed to enter
• Virtualisation provides the protection for ‘tenant’ and ‘host’
  - Host puts boundaries in place to protect it from the tenant
  - Tenant has freedom and control within those boundaries

Embassy Cloud Concept
[Diagram. Labels: Virtualised EMBL-EBI Hardware; Public Data; Public Services; Managed Data; Embassy Cloud 1; Embassy Cloud 2; Embassy Cloud 3; PanCancer; Private Data]

User Benefits for the IaaS Model
• Tenant organisations get an empty virtual infrastructure
  - They establish their own virtual machines and networks
  - System administration performed by the tenant
  - EMBL-EBI staff have no access to the VMs
• Added value from EMBL-EBI over other clouds
  - Machines and data hosted in known jurisdiction
  - Direct network access to data sets (public & managed access)
  - Direct network access to public EMBL-EBI services

Benefits to EMBL-EBI of the IaaS Model
• A secure collaborative workspace
  - Work does not contend with main EMBL-EBI resources
  - Clearly define the committed IT resources and data
• Explore how to build more data focused analysis services
  - Move the analysis to where the big data is located
  - Learn from and inform other big data scientific communities

Embassy Cloud: Typical Uses
• Collaborative Environment
  - Neutral ground outside internal network
  - CTTV: Resources and VMs to host intranet, databases, …
• Data Staging
  - Undertake submission from local machine (following data staging) rather than from remote location
  - BRAEMBL: Remote submission unreliable due to file upload
• Data Analysis
  - Large scale management and analysis of data
  - PanCancer: 1,000 cores, 2.5 TB RAM, 1.0 PB HDD

Issues
• Object Store Storage Infrastructure
  - Essential for scalable high-performance storage
  - Applications need to adapt to flat model
  - Current caching strategy will have a limit
• Sharing resources between sites/communities/clouds
  - Adopt a standards based model for federating resources
  - Solutions for uploading and distributing VMs (+containers?)
  - Replicating large data sets to ‘attract’ workloads to a cloud

Gaps → Activities → Solutions?
• Data Set Replication
  - Strategic pre-positioning of data into clouds
  - Leverage JANET/GEANT, GridFTP + Globus Transfers, …
• Cloud federation for mobile computing
  - EGI has a federated cloud and VM distribution model
  - ELIXIR plans to build on existing infrastructure where possible
• Wide-area file access needed for collaborative data analysis
  - High performance wide-area object-store
  - Need access control for human related data
• Coordinated investment in infrastructure
  - Where is the UK coordination? What coordination is needed?
  - Integrating commercial resources where they add value
  - Integration with EU Infrastructure (ELIXIR)
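
Worked example: why replication is challenging

The figures quoted in the slides (data doubling every year, 60+ PB of storage, a 1.0 PB PanCancer allocation) can be turned into rough back-of-envelope numbers. The Python sketch below is illustrative only. The function names, the 10 Gbit/s link speed and the assumption that the link is fully and exclusively used are not from the slides; they are placeholders chosen to show the arithmetic.

```python
def projected_volume_pb(current_pb: float, years: float,
                        doubling_time_years: float = 1.0) -> float:
    """Volume after `years`, assuming steady exponential growth.

    The slides state that EMBL-EBI data double every year, hence the
    default doubling time of 1.0 years.
    """
    return current_pb * 2 ** (years / doubling_time_years)


def transfer_days(volume_pb: float, link_gbit_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to copy `volume_pb` petabytes over one network link.

    Uses decimal units (1 PB = 1e15 bytes, 1 Gbit = 1e9 bits).
    `efficiency` is the assumed fraction of the nominal link rate that
    is actually achieved (1.0 is an optimistic upper bound).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    bits = volume_pb * 1e15 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    # 60 PB today, doubling yearly, three years on.
    print(f"{projected_volume_pb(60, 3):.0f} PB after 3 years")
    # 1.0 PB (the PanCancer disk allocation) over a hypothetical 10 Gbit/s link.
    print(f"{transfer_days(1.0, 10.0):.1f} days to copy 1 PB at 10 Gbit/s")
```

Under these assumptions a petabyte-scale copy takes on the order of days even on a dedicated link, while the source keeps growing in the meantime. This is consistent with the slides' case for moving analysis to the data and for pre-positioning selected data sets in clouds.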
