Ewan Birney (Tweetable Until the Last Bit – I’Ll Point It Out)

Information in the Lifesciences… The Science of the 21st Century …and why it needs infrastructure Ewan Birney (tweetable until the last bit – I’ll point it out) EBI is an Outstation of the European Molecular Biology Laboratory. Outline of the talk • Who am I? • Genomics 2000-2012 • HapMap, 1000 Genomes, GWAS, ENCODE • Route into medicine • Why we need an infrastructure • (Some more whimsical uses of DNA…) Who am I? • Associate Director at European Bioinformatics Institute (EBI) • Involved in genomics since I was 19 (almost 20 years!) • Trained as a biochemist – most people think I am CS • Analysed – sometimes lead EBI is in Hinxton, South – Cambridgeshire human/mouse/rat/platypus etc genomes. EBI is part of EMBL, ~like • Lead the analysis of CERN for molecular biology ENCODE Molecular Biology • The study of how life works – at a molecular level • Key molecules: • DNA – Information store (Disk) • RNA – Key information transformer, also does stuff (RAM) • Proteins – The business end of life (Chip, robotic arms) • Metabolites – Fuel and signalling molecules (electricity) • Theories of how these interact – no theories of to predict what they are • Instead we determine attributes of molecules and store them in globally accessible, open, databases Crash Course in DNA EBI is an Outstation of the European Molecular Biology Laboratory. DNA is a covalently linked polymer nearly always found in anti-parallel, non covalent pairs We represent it as strings, not worrying about one pair of the two polymers >6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1 GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATT TCTCATCAATAACTTTGGAGACCACAAAGCATTGAGCTGATATATTTAAAGTACTGAAAG AAAAAAAAATCTGACAACCAAGAATTCTATATCCATCAGAACTGCCCTTCAAAAGGGAGG GAGAAATGAAGACATTCTCAGATTTGAGAAGAAAGGAAAGAGAGAAGGGAGGGGAGGGGA GAGGAGGGGAGGGGAGGAGAGGAGAGGAGAGGGCACAGTGGCTCACGCCTGTAATCCTAG CACTTTGCAAGACTGAGGCCAGTGGAACACCTGAGGTCAGGAGATCGAGACCATCCTGGC TAACACGGTGAAACCCCGTCTCCACTAAAAATACAAAAAATTAGCCAGGCGTGGTGGCAG GTGCCTGTAGTTCCAGCTACTCAGGAGGCTGAGGCAGCAGAATGGCGTGAACTCGGGAGG TGGAGCTTGCAGTGAGCTGAGATTGCGCCCCTGCACTCCAGCCTGGGTGACAGAGTGAGA CTCTGTCTCAAAAAAATAAAAAGTTTAAAAATATTTTAAAAAAAGAAAGAAAGAAGGGAG 1 monomer is called a “base pair” – bp We can routinely determine small parts of DNA 1977-1990 – 500 bp, manual tracking 1990-2000 – 500 bp, computational tracking, 1D, “capillary” 2005-2012 – 20-100bp, 2D systems, (“2nd Generation” or NGS) Fred Sanger, inventor of terminator DNA rd 2012 - ?? >5kb, Real time “3 sequencing Generation” Costs have come exponentially down Volumes have gone up! 1E+14 1E+13 Capillary reads 1E+12 Assembled sequences Next gen. reads 1E+11 1E+10 Bases 1E+09 Similar explosion of 100000000 Image based methods 10000000 See Blog posts about da 1000000 specific compression 100000 1980 1985 1990 1995 2000 2005 2010 2015 Date A genome is all our DNA Every cell has two copies of 3e9bp (one from mum, one from dad) in 24 polymers (“chromosomes”) Ecoli: 4e6, Yeast, 12e6 Medaka, White Pine 0.9e9 20e9 Human Genome project • 1989 – 2000 – sequencing the human genome • Just 1 “individual” – actually a mosaic of about 24 individuals but as if it was one • Old school technologies • A bit epic • Now • Same data volume generated in ~3mins in a current large scale centre • It’s all about the analysis What happened next? EBI is an Outstation of the European Molecular Biology Laboratory. We looked into human variation 3 in 10,000 bases between any two individuals are different (a bit more between Africans) The similarity of a European to an African (any population) is Only marginally smaller than European to European (2 or 3%). Only a minute amount of DNA is unique to any population … and associate this with traits or disease (you can infer the majority of the genome by knowing a base About 1 every 5,000 to 10,000 bases – the experiments to Look at this density is far cheaper than sequencing) ENCODE Dimensions CellsCells 3,010 Experiments 182 Cell Lines/ Tissues 182 Cell Lines/ 5 TeraBases 1716x of the Human Genome Methods/Factors 164 Assays (114 different Chip) ENCODE Uniform Analysis Pipeline Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour Mapped reads from production (Bam) Uniform Peak Calling Pipeline (SPP, PeakSeq) Signal Generation (read extension and mappability correction) Good reproducibility Poor reproducibility Segmentation Rep2 Rep1 IDR Processing, QC and Blacklist Filtering ChromHMM/Segway Self Organising Maps Motif Discovery Stats, GSC Signal Aggregation enrichments, etc. over peaks Discovering functional genome segments Michael Hoffman, Jason Ernst, Bill Noble, Manolis Kellis Well understood: TSS, Gene Start, Gene Bodies Reassuringly Interesting “Enhancers” (2 states) Insulators Definitely There, Unexpected Specific Gene End ~7 Major flavours of genome Sub-classification of Repeats 25 “elaborations” 1,000s of details Impact on Medicine EBI is an Outstation of the European Molecular Biology Laboratory. 3 big areas of impact for medicine Germ line “Precision” cancer Pathogens + Risk to disease medicine Hospital acquired infections Germ Line impact • Everyone has differential risk of disease • But the shift in risk is small • Perhaps 1 to 2% have a striking change in risk to a serious disease (>10 fold) which is “actionable” • This goes up to 3-4% if you 1:500 people have HCM count some less clinically 1:500 people have FH worrying diseases Precision cancer diagnosis • Cancer is a genomic disease • By sequencing a cancer you can understand its molecular form better • Particular molecular forms respond to particular bugs Pathogens • Sequencing provides a clear cut diagnosis of pathogens • Can also be used to sequence environments (eg, hospitals) • Immune systems for hospitals Why we need an infrastructure… EBI is an Outstation of the European Molecular Biology Laboratory. Infrastructure components for ENCODE • ~300TB of space • (small compared to 1,000 genomes/Cancer) • >5 years of CPU • (EBI 10,000 core farm critical in turn around times) • High bandwidth connectivity + Aspera • Easier to connect Seattle and Santa Cruz via EBI (!) Infrastructures are critical… But we only notice them when they go wrong Biology already needs an information infrastructure • For the human genome • (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl • For the function of genes and proteins • For all genes, in text and computational – UniProt and GO • For all 3D structures • To understand how proteins work – PDBe • For where things are expressed • The differences and functionality of cells - Atlas ..But this keeps on going… • We have to scale across all of (interesting) life • There are a lot of species out there! • We have to handle new areas, in particular medicine • A set of European haplotypes for good imputation • A set of actionable variants in germline and cancers • We have to improve our chemical understanding • Of biological chemicals • Of chemicals which interfere with Biology How? Fully Centralised Fully Distributed Pros: Stability, reuse, Pros: Responsive, Geographic Learning ease Language responsive Cons: Hard to concentrateCons: Internal communication overhead Expertise across of life scienceHarder for end users to learn Geographic, language placementHarder to provide multi-decade stability Bottlenecks and lack of diversity How? Robust network with a strong hub Node “Domain Specific hub” “National” General Hub (EBI) ELIXIR’s mission To build a sustainable European infrastructure for biological information, supporting life science research and its medicine translation to: environment bioindustries society 34 Overlapping Networks Other infrastructures needed for biology • EuroBioImaging • Cellular and whole organism Imaging • BioBanks (BBMRI) • We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease • Mouse models and phenotypes (Infrafrontier) • A baseline set of knockouts and phenotypes in our most tractable mammalian model • (it’s hard to prove something in human) • Robust molecular assays in a clinical setting (EATRIS) • The ability to reliably use state of the art molecular techniques in a clinical research setting And… just for fun… (and don’t tweet this bit) EBI is an Outstation of the European Molecular Biology Laboratory. Over a beer… Ha! At some point all the data we Store is going to be DNA… Of course, the cost effective way To store this would be as DNA… 1 g == 1 PB (with redundancy) Scaleable? Cost effective? *** University of Albany SUNY Group *** *** HudsonAlpha Institute, Caltech, Stanford Group *** Scott A. Tenenbaum (5), Luiz O. Penalva Devin M. Absher (14), Henry Amrhein (59), Michael Anaya (42), Anita Bansal (14), Serafim Batzoglou (2), Kevin M. (84), Francis Doyle (5). Bowling (14), Marie K. Cross (14), Nicholas S. Davis (14), Tracy Eggleston (14), Clarke Gasper (42), Jason Gertz (14), DeSalvo Gilberto (59), Chris Gunter (14), Preti Jain (14), Brandon King (59), Anshul Kundaje (2), Shawn E. *** University of Chicago, Stanford Group Levy (14), Max W. LibbrechtIan (2), Geor giDunham, K. Marinov (59), Kenneth McCue (59), Sarah K. Meadows (14), Ali Mortazavi *** (30), Michael A. Muratet (14), Richard M. Myers (14), Amy S. Nesmith (14), J. Scott Newberry (14), Kimberly M. ENCODE Authors Subhradip Karmakar (41), Stephen G. Newberry (14), Stephanie L. Parker (14), E. Christopher Partridge (14), Florencia Pauli (14), Shirley Pepke (60), Landt (13), Raj R. Bhanvadia (41), Alina Barbara Pusey (14), Timothy E. Reddy (14), Arend

Ewan Birney (Tweetable Until the Last Bit – I’Ll Point It Out)

Gene Prediction: the End of the Beginning Comment Colin Semple

The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe

Functional Effects Detailed Research Plan

Semantic Web

The for Report 07-08

Classifying Transport Proteins Using Profile Hidden Markov Models And

Ontology-Based Methods for Analyzing Life Science Data

BIOGRAPHICAL SKETCH NAME: Berger

PREDICTD: Parallel Epigenomics Data Imputation with Cloud-Based Tensor Decomposition

Multi-Class Protein Classification Using Adaptive Codes

Aggregation and Correlation Toolbox for Analyses of Genome Tracks Justin Jee Yale University

UC Irvine UC Irvine Previously Published Works