1-Ewan Birney
Big Data in Biology & Healthcare
Ewan Birney
Director, EMBL-EBI
www.ebi.ac.uk

What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• 200 petabytes of storage (0.2 exabytes)
• >40,000 CPU cores
• Part of EMBL, an intergovernmental research organisation
• International: 600 members of staff from 60 nations
• Home of the ELIXIR Technical hub

Global reference data
See the live map at www.ebi.ac.uk/about/our-impact

We have been living through a revolution.

One genome, 2003 to 2017
The cost of sequencing a genome in 2003
The cost of sequencing a genome in 2017
$100 Genome within the next 5 years (likely 3 years)

Real-time genomics in the field
Measuring DNA, RNA, protein…
(Note: I am a long-term, paid consultant to Oxford Nanopore)

Medical Genomics

Sequencing is now “cheap enough”
• Between $200-$300 per exome, and $800-$1000 per whole genome
• Line of sight to $100 genome
• Quoted by Illumina, contenders emerging, steady progress
• More costs now in consent, DNA sample acquisition (storage and standard analysis low-ish, but not 0!)
• All-in costs at or below “routine” medical diagnosis, e.g. MRI scans

Clinical Utility is present: Rare disease
• Consistent 20-30% yield of diagnosis for suspected rare diseases
• Diagnosis ends the “diagnostic odyssey” for patients – painful, emotionally draining and costly to the healthcare service
• Opens up reproductive choices for the parents
• Like-for-like study in Australia
• 5-fold more diagnoses at 1/3 the cost of the previous standard of care!
• Roll-out in Denmark, Finland, France, UK

Clinical Utility is present: Cancer
• (Cancer logistics harder: sample acquisition and DNA extraction harder to standardise; timelines far shorter)
• In umbrella + basket trials, 1 in 10 patients have treatment-changing information from cancer genomic information
• Often being deployed in aggressive, “any option” metastasis scenarios
• Broader molecular phenotyping via genomics showing promise
• Signatures of NHER (BRCA1/BRCA2) defects far broader than suspected from germline associations
• Age of key mutations becoming more obvious

Cohorts and Medical genomics
[Map: national medical genomics projects and cohorts]
Countries and projects labelled: Norway, Denmark, Sweden, Canada, Netherlands, Finland, Iceland, Estonia, South Korea, China, Scotland, Germany, Ireland, Switzerland, Japan, Turkey, USA, UK, Austria, Spain, Iran, Israel, Taiwan, France, Jordan, U.A.E, Kuwait, Mexico, Saudi Arabia, Qatar, India, H3Africa, Malaysia, Brazil, Singapore, South Africa, Australia
Legend, Medical Genomes:
• Countries with active national medical genome projects
• Countries with some activity of medical genomics
• Countries planning medical genome projects
Legend, Cohorts:
• National cohorts > 100k genotyped or sequenced at least 25k
• National cohorts > 100k people active collection now
• Planning national cohorts > 100k

Big numbers!

Genomics: from research to healthcare
Research                    Practicing Medicine
• English language          • National language
• Light-weight legal        • Heavy legal framework
• Similar systems           • Different systems
• Open data                 • Closed data
• Publications              • Not published
• Grant funding             • Contract funding

Bridges need at least two anchors

Global standards: the GA4GH
• GA4GH is THE standards-setting body for genomics and healthcare
• Embraces federated approach
• Setting community standards early
• Cloud: Analysis carried out where the data ‘lives’
• “You’re already using it!”: SAM/BAM/CRAM/VCF formats
• Tools: htsget – the first step away from file-based access
• Rare disease diagnoses: Matchmaker Exchange
• Federated discovery: GA4GH Beacons

Federation
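The federated idea in the bullets above (analysis carried out where the data ‘lives’, with only results leaving each site) can be sketched in a few lines of Python. This is an illustrative toy, not a GA4GH API: the names `SiteSummary`, `analyse_locally` and `collate`, and the 0/1/2 genotype coding, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate counts that are allowed to leave a site (hypothetical type)."""
    site: str
    alt_alleles: int    # alternate alleles observed at this site
    total_alleles: int  # total alleles genotyped at this site


def analyse_locally(site: str, genotypes: List[int]) -> SiteSummary:
    """Runs where the data lives; individual genotypes never leave the site.

    Assumes each genotype is coded as the number of alternate alleles (0, 1 or 2).
    """
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded 0, 1 or 2")
    return SiteSummary(site, sum(genotypes), 2 * len(genotypes))


def collate(summaries: Iterable[SiteSummary]) -> float:
    """Combine per-site summaries into one pooled alternate-allele frequency."""
    summaries = list(summaries)
    alt = sum(s.alt_alleles for s in summaries)
    total = sum(s.total_alleles for s in summaries)
    return alt / total if total else 0.0


if __name__ == "__main__":
    # Two hypothetical sites analyse their own data, then share only counts.
    site_a = analyse_locally("site_a", [0, 1, 2, 0])
    site_b = analyse_locally("site_b", [1, 0])
    print(collate([site_a, site_b]))
```

The design point is that only `SiteSummary` objects cross site boundaries; a real deployment would add authentication, data-use checks and disclosure controls, which this sketch leaves out.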
Open research data: aggregate data globally → download, analyse locally
Healthcare data with research use: analyse data locally (via VMs) → collate analyses

Legend: C & P = Clinical & Phenotypic; Cloud = Cloud; Discov = Discovery; Secur = Data Security; DURI = Data Use & Researcher IDs; GKS = Genomic Knowledge Standards; LSG = Large-Scale Genomics; R & E = Regulatory & Ethics

1. Pheno ontology recommendations: C & P, Discov, DURI, GKS
2. Info models for clin data exchange: C & P, Discov, GKS
3. Implementing pheno standards: C & P, Discov, GKS
4. Test bed & interoperability demo: Cloud, Secur
5. TES: Cloud, Discov, Secur
6. TRS: Cloud, Discov, Secur
7. WES: Cloud, Discov, Secur
8. DOS: Cloud, Discov, Secur, DURI, LSG
9. Beacon: Discov, DURI
10. Search: Discov, C & P, DURI, GKS
11. Service registry: Discov, Cloud, LSG
12. Variant submission: Discov, GKS
13. IoG: Discov
14. Breach response: Secur, R & E
15. AAI: Secur, C & P, Discov, DURI, GKS
16. Researcher ID & Bona Fide status: DURI, Secur, R & E
17. DUO: DURI, Secur, R & E
18. Variant Annotation: GKS, C & P, Discov
19. Variant Representation: GKS
20. htsget streaming API: LSG
21. Reference sequence retrieval API: LSG, GKS
22. Read file formats: LSG
23. Genetic variation file formats: LSG, GKS
24. RNASeq expression matrix: LSG
25. Return of results policy: R & E
26. Participant values survey: R & E
27. Code of conduct for data sharing: R & E
28. Cloud access policy: R & E, Secur, DURI

Europe’s opportunity
Strengths/Opportunities
• Public healthcare systems
• Strong genomics
• Strong public health delivery
• Strong infrastructure
• Transnational requirement

Weaknesses/Threats
• Less IT depth in some healthcare systems
• Fragmentation of skills
• AI / Big Data capacity (skills + capital)
• Transnational complexity

EMBL-EBI, ELIXIR and GA4GH
• EMBL-EBI is the world’s leading bioinformatics infrastructure provider
• Human Reference Genome, Annotation, Transcription, Proteomics, Structure, Pathways and Literature
• ELIXIR is Europe’s transnational coordination of bioinformatics infrastructure
• 23 European countries + EMBL-EBI
• Human data community
• GA4GH is the global standards-setting organisation in human genomics
• ELIXIR and GA4GH have a strategic partnership

Humans: a new model organism

Humans are…
• Similar to most other life forms on Earth
• Outbred organisms with pretty good genetics
• Huge cohorts – millions of people
• Big (lots and lots of cells)
• Willing participants – they take themselves to hospitals to be phenotyped
• Popular organisms – research into them attracts a lot of funding
• …A great model organism for understanding biology – including human disease!

Trabeculation
UK Biobank – 500,000 healthy UK citizens, consistently phenotyped and genotyped (will be full genome sequence)
100,000 will be MRI imaged (head including fMRI, chest including cardiac MRI)

Fractal dimension trabeculation
Co-registration
Meta analysis
Many loci also show changes in QRS
[Figure labels: loci GOSR2, TNNT2, SLC35F1, TTN; traits systolic BP, heart phenotypes, pulse rate, DCM]
Some loci have “other heart conditions” ICD-10 codes

Meta analysis
Replication in 1,200 other healthy Brits

Thanks
Hannah Meyer, EBI
Declan O’Regan, LMS, MRC

Thank you!
Follow me on Twitter: @ewanbirney
I blog regularly (Google Ewan Birney)
2/14/2019

Imaging: new technologies change the game
EM tomography, Atomic-scale models from EM
Super-resolution light microscopy
High-resolution MRI and CT
Light sheet microscopy

Huge impact on biological research
Tools for the wet lab
Tools for the dry lab

‘White-collar’ and ‘blue-collar’ problems
• Ground-breaking ideas: innovative, interesting, blue-skies thinking
• Making them work: tools and data management (necessary, less glamorous)

Life science: many data types
Genes, genomes & variation
Gene, protein & metabolite expression
Protein sequences, families & motifs
Phenotypes
Macromolecular structures
Interactions, reactions & pathways
Chemogenomics & metabolomics

Data resources at EMBL-EBI

Genes, genomes & variation
• Ensembl
• Ensembl Genomes
• GWAS Catalog
• Metagenomics portal
• RNA Central

Gene, protein & metabolite expression
• Expression Atlas
• Metabolights
• PRIDE

Protein sequences, families & motifs
• InterPro
• Pfam
• UniProt

Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank

Literature & ontologies
• Experimental Factor Ontology
• Gene Ontology
• BioStudies
• Europe PMC

Chemical biology
• ChEBI
• ChEMBL
• SureChEMBL

Systems
• BioModels
• BioSamples
• Enzyme Portal
• IntAct
• Reactome

Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress

~410 people
Worldwide collaborations

Data Growth
Doubling time ~6 months
Doubling time ~16 months
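
To get a feel for what the two quoted doubling times imply, the standard exponential-growth relation (growth factor = 2^(elapsed time / doubling time)) can be evaluated directly. The sketch below is illustrative only: `growth_factor` is a hypothetical helper, the 4-year horizon is an arbitrary choice, and the text here does not say which data types the ~6 and ~16 month figures refer to.

```python
def growth_factor(years: float, doubling_time_months: float) -> float:
    """Factor by which a quantity grows over `years`, given its doubling time."""
    if doubling_time_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (years * 12.0 / doubling_time_months)


if __name__ == "__main__":
    # Compare the two doubling times quoted on the slide over an
    # arbitrarily chosen 4-year horizon.
    for months in (6, 16):
        factor = growth_factor(4, months)
        print(f"doubling every ~{months} months -> x{factor:g} in 4 years")
    # doubling every ~6 months -> x256 in 4 years
    # doubling every ~16 months -> x8 in 4 years
```

Over the same four years, a ~6 month doubling time gives 8 doublings (256-fold growth) while a ~16 month doubling time gives 3 doublings (8-fold growth).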