Information in the Lifesciences… The Science of the 21st Century …and why it needs infrastructure
Ewan Birney (tweetable until the last bit – I’ll point it out)
EBI is an Outstation of the European Molecular Biology Laboratory. Outline of the talk
• Who am I? • Genomics 2000-2012 • HapMap, 1000 Genomes, GWAS, ENCODE • Route into medicine • Why we need an infrastructure
• (Some more whimsical uses of DNA…) Who am I? • Associate Director at European Bioinformatics Institute (EBI) • Involved in genomics since I was 19 (almost 20 years!) • Trained as a biochemist – most people think I am CS • Analysed – sometimes lead EBI is in Hinxton, South – Cambridgeshire human/mouse/rat/platypus etc genomes. EBI is part of EMBL, ~like • Lead the analysis of CERN for molecular biology ENCODE Molecular Biology
• The study of how life works – at a molecular level
• Key molecules: • DNA – Information store (Disk) • RNA – Key information transformer, also does stuff (RAM) • Proteins – The business end of life (Chip, robotic arms) • Metabolites – Fuel and signalling molecules (electricity) • Theories of how these interact – no theories of to predict what they are • Instead we determine attributes of molecules and store them in globally accessible, open, databases Crash Course in DNA
EBI is an Outstation of the European Molecular Biology Laboratory. DNA is a covalently linked polymer nearly always found in anti-parallel, non covalent pairs We represent it as strings, not worrying about one pair of the two polymers
>6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1 GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATT TCTCATCAATAACTTTGGAGACCACAAAGCATTGAGCTGATATATTTAAAGTACTGAAAG AAAAAAAAATCTGACAACCAAGAATTCTATATCCATCAGAACTGCCCTTCAAAAGGGAGG GAGAAATGAAGACATTCTCAGATTTGAGAAGAAAGGAAAGAGAGAAGGGAGGGGAGGGGA GAGGAGGGGAGGGGAGGAGAGGAGAGGAGAGGGCACAGTGGCTCACGCCTGTAATCCTAG CACTTTGCAAGACTGAGGCCAGTGGAACACCTGAGGTCAGGAGATCGAGACCATCCTGGC TAACACGGTGAAACCCCGTCTCCACTAAAAATACAAAAAATTAGCCAGGCGTGGTGGCAG GTGCCTGTAGTTCCAGCTACTCAGGAGGCTGAGGCAGCAGAATGGCGTGAACTCGGGAGG TGGAGCTTGCAGTGAGCTGAGATTGCGCCCCTGCACTCCAGCCTGGGTGACAGAGTGAGA CTCTGTCTCAAAAAAATAAAAAGTTTAAAAATATTTTAAAAAAAGAAAGAAAGAAGGGAG
1 monomer is called a “base pair” – bp We can routinely determine small parts of DNA
1977-1990 – 500 bp, manual tracking
1990-2000 – 500 bp, computational tracking, 1D, “capillary”
2005-2012 – 20-100bp, 2D systems, (“2nd Generation” or NGS) Fred Sanger, inventor of terminator DNA rd 2012 - ?? >5kb, Real time “3 sequencing Generation” Costs have come exponentially down Volumes have gone up!
1E+14
1E+13
Capillary reads
1E+12 Assembled sequences
Next gen. reads 1E+11
1E+10 Bases 1E+09 Similar explosion of
100000000 Image based methods
10000000 See Blog posts about da
1000000 specific compression
100000 1980 1985 1990 1995 2000 2005 2010 2015 Date A genome is all our DNA
Every cell has two copies of 3e9bp (one from mum, one from dad) in 24 polymers (“chromosomes”)
Ecoli: 4e6, Yeast, 12e6 Medaka, White Pine 0.9e9 20e9 Human Genome project
• 1989 – 2000 – sequencing the human genome • Just 1 “individual” – actually a mosaic of about 24 individuals but as if it was one • Old school technologies • A bit epic • Now • Same data volume generated in ~3mins in a current large scale centre • It’s all about the analysis What happened next?
EBI is an Outstation of the European Molecular Biology Laboratory. We looked into human variation
3 in 10,000 bases between any two individuals are different (a bit more between Africans)
The similarity of a European to an African (any population) is Only marginally smaller than European to European (2 or 3%). Only a minute amount of DNA is unique to any population … and associate this with traits or disease
(you can infer the majority of the genome by knowing a base About 1 every 5,000 to 10,000 bases – the experiments to Look at this density is far cheaper than sequencing)
ENCODE Dimensions CellsCells
3,010 Experiments
182 Cell Lines/ Tissues 182 Cell Lines/ 5 TeraBases 1716x of the Human Genome
Methods/Factors
164 Assays (114 different Chip) ENCODE Uniform Analysis Pipeline Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour
Mapped reads from production (Bam)
Uniform Peak Calling Pipeline (SPP, PeakSeq) Signal Generation (read extension and mappability correction)
Good reproducibility Poor reproducibility
Segmentation Rep2
Rep1 IDR Processing, QC and Blacklist Filtering ChromHMM/Segway
Self Organising Maps Motif Discovery Stats, GSC Signal Aggregation enrichments, etc. over peaks Discovering functional genome segments Michael Hoffman, Jason Ernst, Bill Noble, Manolis Kellis Well understood: TSS, Gene Start, Gene Bodies
Reassuringly Interesting “Enhancers” (2 states) Insulators
Definitely There, Unexpected Specific Gene End
~7 Major flavours of genome Sub-classification of Repeats 25 “elaborations” 1,000s of details Impact on Medicine
EBI is an Outstation of the European Molecular Biology Laboratory. 3 big areas of impact for medicine
Germ line “Precision” cancer Pathogens + Risk to disease medicine Hospital acquired infections Germ Line impact
• Everyone has differential risk of disease • But the shift in risk is small • Perhaps 1 to 2% have a striking change in risk to a serious disease (>10 fold) which is “actionable”
• This goes up to 3-4% if you 1:500 people have HCM count some less clinically 1:500 people have FH worrying diseases Precision cancer diagnosis
• Cancer is a genomic disease • By sequencing a cancer you can understand its molecular form better • Particular molecular forms respond to particular bugs Pathogens
• Sequencing provides a clear cut diagnosis of pathogens • Can also be used to sequence environments (eg, hospitals)
• Immune systems for hospitals Why we need an infrastructure…
EBI is an Outstation of the European Molecular Biology Laboratory. Infrastructure components for ENCODE
• ~300TB of space • (small compared to 1,000 genomes/Cancer) • >5 years of CPU • (EBI 10,000 core farm critical in turn around times) • High bandwidth connectivity + Aspera • Easier to connect Seattle and Santa Cruz via EBI (!) Infrastructures are critical… But we only notice them when they go wrong Biology already needs an information infrastructure
• For the human genome • (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl • For the function of genes and proteins • For all genes, in text and computational – UniProt and GO • For all 3D structures • To understand how proteins work – PDBe • For where things are expressed • The differences and functionality of cells - Atlas ..But this keeps on going…
• We have to scale across all of (interesting) life • There are a lot of species out there! • We have to handle new areas, in particular medicine • A set of European haplotypes for good imputation • A set of actionable variants in germline and cancers • We have to improve our chemical understanding • Of biological chemicals • Of chemicals which interfere with Biology How?
Fully Centralised Fully Distributed
Pros: Stability, reuse, Pros: Responsive, Geographic Learning ease Language responsive
Cons: Hard to concentrateCons: Internal communication overhead Expertise across of life scienceHarder for end users to learn Geographic, language placementHarder to provide multi-decade stability Bottlenecks and lack of diversity How?
Robust network with a strong hub Node
“Domain Specific hub”
“National” General Hub (EBI) ELIXIR’s mission
To build a sustainable European infrastructure for biological information, supporting life science research and its medicine translation to:
environment
bioindustries
society
34 Overlapping Networks Other infrastructures needed for biology
• EuroBioImaging • Cellular and whole organism Imaging • BioBanks (BBMRI) • We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease • Mouse models and phenotypes (Infrafrontier) • A baseline set of knockouts and phenotypes in our most tractable mammalian model • (it’s hard to prove something in human) • Robust molecular assays in a clinical setting (EATRIS) • The ability to reliably use state of the art molecular techniques in a clinical research setting And… just for fun… (and don’t tweet this bit)
EBI is an Outstation of the European Molecular Biology Laboratory. Over a beer…
Ha! At some point all the data we Store is going to be DNA…
Of course, the cost effective way To store this would be as DNA… 1 g == 1 PB (with redundancy) Scaleable? Cost effective? *** University of Albany SUNY Group *** *** HudsonAlpha Institute, Caltech, Stanford Group *** Scott A. Tenenbaum (5), Luiz O. Penalva Devin M. Absher (14), Henry Amrhein (59), Michael Anaya (42), Anita Bansal (14), Serafim Batzoglou (2), Kevin M. (84), Francis Doyle (5). Bowling (14), Marie K. Cross (14), Nicholas S. Davis (14), Tracy Eggleston (14), Clarke Gasper (42), Jason Gertz (14), DeSalvo Gilberto (59), Chris Gunter (14), Preti Jain (14), Brandon King (59), Anshul Kundaje (2), Shawn E. *** University of Chicago, Stanford Group Levy (14), Max W. LibbrechtIan (2), Geor giDunham, K. Marinov (59), Kenneth McCue (59), Sarah K. Meadows (14), Ali Mortazavi *** (30), Michael A. Muratet (14), Richard M. Myers (14), Amy S. Nesmith (14), J. Scott Newberry (14), Kimberly M. ENCODE Authors Subhradip Karmakar (41), Stephen G. Newberry (14), Stephanie L. Parker (14), E. Christopher Partridge (14), Florencia Pauli (14), Shirley Pepke (60), Landt (13), Raj R. Bhanvadia (41), Alina Barbara Pusey (14), Timothy E. Reddy (14), Arend Sidow (61), Diane Trout (59), Katherine E. Varley (14), Jost *** Overall Coordination *** Choudhury (41), Marc Domanus (41), Lijia Vielmetter (42), Brian A. Williams (59), Barbara Wold (42). Ian Dunham (1), Anshul Kundaje (2). Ma (41), Jennifer Moran (41), Dorrelyn Anshul KundajePatacsil (13), Teri Slifer (13), Alec *** University of Massachusetts Medical School Genome Folding Group *** *** Data Production Leads *** Victorsen (41), Xinqiong Yang (13), Job Dekker (12), Gaurav Jain (12), Bryan R. Lajoie (12), Amartya Sanyal (12). Shelley F. Aldred (3), Patrick J. Collins (3), Carrie A. Davis (4), Francis Doyle (5), Charles B. Epstein (6), Seth Frietze (7), Michael Snyder (35), Kevin P. White (41). Jennifer Harrow (8), Vishwanath R. Iyer (9), Rajinder Kaul (10), Jainab Khatun (11), Bryan R. Lajoie (12), Stephen G. *** University of Massachusetts Medical School Weng Group *** Landt (13), Bum-Kyu Lee (9), Florencia Pauli (14), Kate R. Rosenbloom (15), Peter Sabo (16), Alexias Safi (17), Amartya *** University of Heidelberg Group *** Zhiping Weng (23), Troy W. Whitfield (23), Jie Wang (23), Patrick J. Collins (3), Shelley F. Aldred (3), Nathan D. Sanyal (12), Noam Shoresh (6), Jeremy M. Simon (18), Lingyun Song (17), Nathan D. Trinklein (3). Thomas Auer (85), Lazaro Centanin (85), Trinklein (3), E. Christopher Partridge (14), Richard M. Myers (14). Michael Eichenlaub (85), Franziska Gruhl *** Lead Analysts *** (85), Stephan Heermann (85), Daigo Inoue *** NHGRI Groups *** Robert C. Altshuler (19), Ewan Birney (1), James B. Brown (20), Chao Cheng (21), Sarah Djebali (22), Xianjun Dong (23), (85), Tanja Kellner (85), Stephan Laura Elnitski (38), Elliott H. Margulies (31), Stephen C. J. Parker (31), Hanna M. Petrykowska (38). Ian Dunham (1), Jason Ernst (19), Terrence S. Furey (24), Mark Gerstein (21), Belinda Giardine (25), Melissa Greven (23), Kirchmaier (85), Claudia Mueller (85), Ross C. Hardison (26), RobertShelley S. Harris (25), Javier Herrero F. (1), Aldred Michael M. Hoffman (16), Patrick Sowmya Iyer (27), Manolis J. Collins, Carrie A. Davis, FrancisRobert Reinhardt (85), Lea Schertel (85), *** Duke University, EBI, University of Texas, Austin, University of North Carolina-Chapel Hill Group *** Kellis (19), Jainab Khatun (11), Pouya Kheradpour (19), Anshul Kundaje (2), Timo Lassmann (28), Qunhua Li (20), Stephanie Schneider (85), Rebecca Sinn Gregory E. Crawford (37), Terrence S. Furey (24), Paul G. Giresi (18), Linda L. Grasfeder (18), Vishwanath R. Iyer Xinying Lin (23), Angelika Merkel (29), Ali Mortazavi (30), Stephen C. J. Parker (31), Timothy E. Reddy (14), Joel (85), Beate Wittbrodt (85), Jochen (9), Damian Keefe (1), Seul Ki Kim (18), Min Jae Kim (18), Bum-Kyu Lee (9), Jason D. Lieb (18), Zheng Liu (9), Darin Rozowsky (21), Felix Schlesinger (4), Bob Thurman (16), Jie Wang (23), Lucas D. Ward (19), Troy W. Whitfield (23), Wittbrodt (85). Steven P. Wilder (1), WeishengDoyle, Wu (25), Hualin S. Xi Charles(32), Kevin (Yuk-Lap) Yip (21), B. Jiali ZhuangEpstein, (23). London (17), RyanSeth M. McDaniell Frietze, (9), Joanna O. Mieczkowska Jennifer (18), Piotr A. Mieczkowski (62),Harrow, Yunyun Ni (9), Naim U. Rashid (63), Alexias Safi (17), Matthew R. Schaner (18), Nathan C. Sheffield (17), Christopher Shestak (18), *** Data Analysis Center *** Kimberly A. Showers (18), Jeremy M. Simon (18), Lingyun Song (17), Tianyuan Wang (17), Deborah Winter (17), *** Writing Group *** Ian Dunham (1), Kathryn Beal (1), Alvis Zhancheng Zhang (24), Zhuzhu Z. Zhang (18), Ewan Birney (1), Akshay A. Bhinge (9), Anna Battenhouse (9), Bradley E. Bernstein (33), EwanVishwanath Birney (1), Ian Dunham (1), Eric D. Green R. (34), ChrisIyer, Gunter (14), Rajinder Michael Snyder (35). Kaul, Jainab Khatun, BryanBrazma (86), Paul R. Flicek (1), Javier Sheera Adar (18). Herrero (1), Nathan Johnson (1), Damian *** NHGRI Project Management *** Keefe (1), Margus Lukk (86), Nicholas M. *** Lawrence Berkeley National Laboratory Group *** Leslie B. Adams (36), Laura A.L.Lajoie, Dillon (36), Elise A. FeingoldStephen (36), Peter J. Good (36), G. Eric D. Landt, Green (34), Caroline J. Bum-Kyu Lee, Florencia Pauli,Luscombe (87), KateDaniel Sobral (1), Juan M. Matthew J. Blow (64), Axel Visel (64), Len A. Pennachio (65). Kelly (36), Rebecca F. Lowdon (36), Michael J. Pazin (36), Judith R. Wexler (36), Julia Zhang (36). Vaquerizas (87), Steven P. Wilder (1), Serafim Batzoglou (2), Arend Sidow (61), *** Cold Spring Harbor, University of Geneva, Center for Genomic Regulation, Barcelona, RIKEN, University of *** Principal Investigators *** Nadine Hussami (2), Sofia Kyriazopoulou- R. Rosenbloom, Peter Lausanne,Sabo, Genome InstituteAlexias of Singapore Group Safi, *** Amartya Sanyal, Bradley E. Bernstein (33), Ewan Birney (1), Gregory E. Crawford (37), Job Dekker (12), Laura Elnitski (38), Peggy J. Panagiotopoulou (2), Max W. Libbrecht (2), Sarah Djebali (22), Carrie A. Davis (4), Angelika Merkel (29), Alex Dobin (4), Timo Lassmann (28), Ali Mortazavi Farnham (7), Mark Gerstein (21), Morgan C. Giddings (11), Thomas R. Gingeras (39), Eric D. Green (34), Roderic Marc A. Schaub (2), Anshul Kundaje (2), (30), Andrea Tanzer (54), Julien Lagarde (22), Wei Lin (4), Felix Schlesinger (4), Chenghai Xue (4), Georgi K. Guig
Brad Bernstein (Eric Lander, Manolis Kellis, Tony Kouzarides) Ewan Birney (Jim Kent, Mark Gerstein, Bill Noble, Peter Bickel, Ross Hardison, Zhiping Weng) Greg Crawford (Ewan Birney, Jason Lieb, Terry Furey, Vishy Iyer) Jim Kent (David Haussler, Kate Rosenbloom) John Stamatoyannopoulos (Evan Eichler, George Stamatoyannopoulos, Job Dekker, Maynard Olson, Michael Dorschner, Patrick Navas, Phil Green) Mike Snyder (Kevin Struhl, Mark Gerstein, Peggy Farnham, Sherman Weissman) Rick Myers (Barbara Wold) Scott Tenenbaum (Luiz Penalva) Tim Hubbard (Alexandre Reymond, Alfonso Valencia, David Haussler, Ewan Birney, Jim Kent, Manolis Kellis, Mark Gerstein, Michael Brent, Roderic Guigo) Tom Gingeras (Alexandre Reymond, David Spector, Greg Hannon, Michael Brent, Roderic Guigo, Stylianos Antonarakis, Yijun Ruan, Yoshihide Hayashizaki) Zhiping Weng (Nathan Trinklein, Rick Myers)
Additional ENCODE Participants: Elliott Marguiles, Eric Green, Job Dekker, Laura Elnitski, Len Pennachio, Jochen Wittbrodt .. and many senior scientists, postdocs, students, technicians, computer scientists, statisticians and administrators in these groups NHGRI: Elise Feingold, Mike Pazin, Peter Good 44