
1-Ewan Birney

Big Data in Biology & Healthcare

Ewan Birney
Director, EMBL-EBI
www.ebi.ac.uk

What is EMBL-EBI?

• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• 200 Petabytes of storage (0.2 exabytes)
• >40,000 CPU cores
• Part of EMBL, an intergovernmental research organisation
• International: 600 members of staff from 60 nations
• Home of the ELIXIR Technical hub

Global reference data

See the live map at www.ebi.ac.uk/about/our-impact

We have been living through a revolution.

One 2003 to 2017

[Figure: the cost of sequencing a genome in 2003, compared with the cost of sequencing a genome in 2017]

$100 Genome within the next 5 years (likely 3 years)

Real-time in the field

Measuring DNA, RNA,

(Note: I am a long-term, paid consultant to Oxford Nanopore)

Medical Genomics

Sequencing is now “cheap enough”

• Between $200-300 / exome, and $800-$1000 for whole genome
• Line of sight to $100 genome
• Quoted by Illumina, contenders emerging, steady progress
• More costs now in consent, DNA sample acquisition (storage and standard analysis low-ish, but not 0!)
• All-in costs at or below “routine” medical diagnosis, e.g. MRI scans

Clinical Utility is present: Rare disease

• Consistent 20-30% yield of diagnosis for suspected rare diseases
• Diagnosis ends the “diagnostic odyssey” for patients: painful, emotionally draining and costly to the healthcare service
• Opens up reproductive choices for the parents
• Like-for-like study in Australia
• 5-fold more diagnoses at 1/3 the cost of the previous standard of care!
• Roll out in Denmark, Finland, France, UK

Clinical Utility is present: Cancer

• (Cancer logistics harder: sample acquisition and DNA extraction harder to standardise; timelines far shorter)
• In umbrella + basket trials, 1 in 10 patients have treatment-changing information from cancer genomic information
• Often being deployed in aggressive, “any option” metastasis scenarios
• Broader molecular phenotyping via genomics showing promise
• Signatures of NHER (BRCA1/BRCA2) defects far broader than suspected from germline associations
• Age of key becoming more obvious

Cohorts and Medical genomics

[Map: national medical genomics projects and cohorts around the world.]

Legend, Medical:
• Countries with active national medical genome projects
• Countries with some activity of medical genomics
• Countries planning medical genome projects

Legend, Cohorts:
• National cohorts > 100k genotyped or sequenced at least 25k
• National cohorts > 100k people, active collection now
• Planning national cohorts > 100k

Labels on the map: Norway, Denmark, Sweden, Canada, Netherlands, Finland, Iceland, Estonia, South Korea, China, Scotland, Germany, Ireland, Switzerland, Japan, Turkey, USA, UK, Austria, Spain, Iran, Israel, Taiwan, France, Jordan, U.A.E, Kuwait, Mexico, Saudi Arabia, Qatar, India, H3Africa, Malaysia, Brazil, Singapore, South Africa, Australia

Big numbers!

Genomics: from research to healthcare

Research
• English language
• Light-weight legal
• Similar systems
• Open data
• Publications
• Grant funding

Practicing Medicine
• National language
• Heavy legal framework
• Different systems
• Closed data
• Not published
• Contract funding

Bridges need at least two anchors

Global standards: the GA4GH

• GA4GH is THE standards-setting body for genomics and healthcare
• Embraces federated approach
• Setting community standards early
• Cloud: Analysis carried out where the data ‘lives’
• “You’re already using it!”: SAM/BAM/CRAM/VCF formats
• Tools: htsget, the first step away from file-based access
• Rare disease diagnoses: Matchmaker Exchange
• Federated discovery: GA4GH Beacons Federation
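The federated approach in the bullets above (analysis carried out where the data ‘lives’, with only the results collated) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GA4GH code: the site names, the record layout, the variant labels and the functions `local_carrier_count` and `collate` are all hypothetical, and a real deployment would add authentication, access control and disclosure checks.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class SiteSummary:
    """Aggregate result that is allowed to leave a site (no individual-level data)."""
    site: str
    carriers: int
    total: int


def local_carrier_count(site: str, genotypes: Dict[str, Set[str]], variant: str) -> SiteSummary:
    """Runs inside one site. `genotypes` maps a person id to the set of variant ids they carry."""
    carriers = sum(1 for variants in genotypes.values() if variant in variants)
    return SiteSummary(site=site, carriers=carriers, total=len(genotypes))


def collate(summaries: Iterable[SiteSummary]) -> float:
    """Runs centrally. Combines per-site aggregates into one carrier frequency."""
    summaries = list(summaries)
    carriers = sum(s.carriers for s in summaries)
    total = sum(s.total for s in summaries)
    return carriers / total if total else 0.0


# Hypothetical data held at two sites; only SiteSummary objects are shared.
sites = {
    "site_1": {"p1": {"var_A"}, "p2": set(), "p3": {"var_A", "var_B"}},
    "site_2": {"q1": {"var_B"}, "q2": {"var_A"}},
}
summaries = [local_carrier_count(name, data, "var_A") for name, data in sites.items()]
frequency = collate(summaries)
print(f"carrier frequency across sites: {frequency:.2f}")
```

The design point is that `collate` never sees the per-person records, only counts, which mirrors the "analyse data locally, collate analyses" pattern for healthcare data.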

[Diagram: two models of analysis]
• Open research data: aggregate data globally; download, analyse locally
• Healthcare data with research use: analyse data locally (via VMs); collate analyses

[Table: GA4GH deliverables and the work streams involved]

Work streams: Clinical & Phenotypic (C & P), Cloud, Discovery (Discov), Data Security (Secur), Data Use & Researcher IDs (DURI), Genomic Knowledge Standards (GKS), Large-Scale Genomics (LSG), Regulatory & Ethics (R & E)

1. Pheno ontology recommendations (C & P, Discov, DURI, GKS)
2. Info models for clin data exchange (C & P, Discov, GKS)
3. Implementing pheno standards (C & P, Discov, GKS)
4. Test bed & interoperability demo (Cloud, Secur)
5. TES (Cloud, Discov, Secur)
6. TRS (Cloud, Discov, Secur)
7. WES (Cloud, Discov, Secur)
8. DOS (Cloud, Discov, Secur, DURI, LSG)
9. Beacon (Discov, DURI)
10. Search (Discov, C & P, DURI, GKS)
11. Service registry (Discov, Cloud, LSG)
12. Variant submission (Discov, GKS)
13. IoG (Discov)
14. Breach response (Secur, R & E)
15. AAI (Secur, C & P, Discov, DURI, GKS)
16. Researcher ID & Bona Fide status (DURI, Secur, R & E)
17. DUO (DURI, Secur, R & E)
18. Variant Annotation (GKS, C & P, Discov)
19. Variant Representation (GKS)
20. htsget streaming API (LSG)
21. Reference sequence retrieval API (LSG, GKS)
22. Read file formats (LSG)
23. Genetic variation file formats (LSG, GKS)
24. RNASeq expression matrix (LSG)
25. Return of results policy (R & E)
26. Participant values survey (R & E)
27. Code of conduct for data sharing (R & E)
28. Cloud access policy (R & E, Secur, DURI)

Europe’s opportunity

Strengths/Opportunities
• Public healthcare systems
• Strong genomics
• Strong public health delivery
• Strong infrastructure
• Transnational requirement

Weaknesses/Threats
• Less IT depth in some healthcare systems
• Fragmentation of skills
• AI / Big Data capacity (skills + capital)
• Transnational complexity

EMBL-EBI, ELIXIR and GA4GH

• EMBL-EBI is the world’s leading infrastructure provider
• Human Reference Genome, Annotation, Proteomics, Structure, Pathways and Literature
• ELIXIR is Europe’s transnational coordination of bioinformatics infrastructure
• 23 European countries + EMBL-EBI
• Human data community
• GA4GH is the global standards-setting organisation in human genomics
• ELIXIR and GA4GH have a strategic partnership

Humans: a new model

Humans are…

• Similar to most other life forms on Earth
• Outbred with pretty good genetics
• Huge cohorts: millions of people
• Big (lots and lots of cells)
• Willing participants: they take themselves to hospitals to be phenotyped
• Popular organisms: research into them attracts a lot of funding
• …A great model organism for understanding biology, including human disease!

Trabeculation

UK Biobank: 500,000 healthy UK citizens, consistently phenotyped and genotyped (will be full genome sequence)

100,000 will be MRI imaged (head including fMRI, chest including cardiac MRI)

[Figure labels: Fractal dimension trabeculation; Co-registration]

Meta analysis

Many loci also show changes in QRS

[Figure labels: loci GOSR2, TNNT2, SLC35F1, TTN; traits systolic BP, heart phenotypes, pulse rate, DCM]

Some loci have “other heart conditions” ICD-10 codes

Meta analysis

Replication in 1,200 other healthy Brits

Thanks

Hannah Meyer, EBI
Declan O’Regan, LMS, MRC

Thank you!

Follow me on : @ewanbirney
I blog regularly (Google Ewan Birney)

2/14/2019

Imaging: new technologies change the game

• EM tomography, atomic-scale models from EM
• Super-resolution light microscopy
• High-resolution MRI and CT
• Light sheet microscopy

Huge impact on biological research

Tools for the wet lab
Tools for the dry lab

‘White-collar’ and ‘blue-collar’ problems

Ground-breaking ideas: innovative, interesting, blue-skies thinking

Making them work: tools and data management, necessary, less glamorous

Life : many data types

Genes, genomes & variation

Gene, protein & metabolite expression

Protein sequences, families & motifs Phenotypes

Macromolecular structures

Interactions, reactions & pathways

Chemogenomics & metabolomics

Data resources at EMBL-EBI

Genes, genomes & variation; Gene, protein & metabolite expression
• Ensembl
• Ensembl Genomes
• Expression Atlas
• GWAS Catalog
• Metabolights
• Metagenomics portal
• PRIDE
• RNA Central

Protein sequences, families & motifs; Molecular structures
• in Europe
• InterPro
• Electron Microscopy Data Bank
• UniProt

Literature & ontologies; Chemical biology; Systems
• BioModels
• Experimental Factor Ontology
• BioSamples
• ChEBI
• Enzyme Portal
• ChEMBL
• IntAct
• BioStudies
• SureChEMBL
• Reactome
• Europe PMC

Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress

~410 people
Worldwide collaborations

Data Growth

[Figure labels: doubling time ~6 months; doubling time ~16 months]
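The two doubling times on the Data Growth slide imply very different yearly growth rates. Below is a minimal sketch of that arithmetic in Python, assuming steady exponential growth. The function names are illustrative, and the slide text does not say which quantity each doubling time refers to.

```python
import math


def annual_growth_factor(doubling_time_months: float) -> float:
    """Factor by which a quantity grows over 12 months, given its doubling time."""
    return 2 ** (12 / doubling_time_months)


def months_to_grow(factor: float, doubling_time_months: float) -> float:
    """Months needed to grow by `factor` under steady exponential growth."""
    return doubling_time_months * math.log2(factor)


for months in (6, 16):
    print(
        f"doubling time {months} months: "
        f"x{annual_growth_factor(months):.2f} per year, "
        f"{months_to_grow(1000, months) / 12:.1f} years to grow 1000-fold"
    )
```

Under these assumptions a 6-month doubling time means a fourfold increase every year, while a 16-month doubling time means roughly a 1.7-fold increase, so the faster-growing quantity outpaces the slower one by a widening margin each year.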