The Revolution in Human Genetics: Deciphering Complexity
David Galas
Institute for Systems Biology, Seattle, WA

Genetics and Environment integration is key to future medicine
[Figure labels: Genome; Environment; Cell; Complex biological networks; Read-out (blood proteins, miRNA, mRNA); Predictive, personalized diagnostics; Disease & Health]

Summary
• Technological transformation of human genetics
• Whole genome sequencing
• Family of four project – what have we learned? Next projects.
• Challenges
  – Reference genome problem
  – Complex genetic models
  – Integration of data types: new methods

History of Genetic Marker Density
Human Genetic Markers

Type of marker     No. of loci sampled   Recombination inference
Blood groups       ~20                   N/A
Electrophoretic    ~30                   N/A
HLA type           1                     N/A
RFLPs              >10^5                 N/A
VNTRs              >10^4                 minimal
Microsatellites    >10^5                 minimal
SNP genotypes      >10^6                 probabilistic
Exomes             1%                    marginal
Genomes            All                   exact

Family Sequences and the Future of human genetics
• Traditional (common SNP, additive effects) GWAS has been a gene collection method – reaching asymptotes for known phenotypes
• Inferences from family sequences plus population-based studies
• Constrained analysis using: biological understanding of networks, additional data (e.g. population-based information)
• Genetic-environment deconvolution
“Missing heritability”

Family genomics in perspective

Current technical issues
• Sources of DNA
• Library biases
• Inherent accuracy of reads
• Read lengths – currently 20 to 70 bases
• Maps to “reference” variation file
• Coverage and depth (~40 fold)
• Assembly and phase
• Error rates – types

Pilot project: Complete Genome Sequences of a Family of Four
• Parents healthy; kids both have two genetic diseases
• Power of a family sequence permits:
  – Noise reduction: very low error rate, <1/100,000
  – Discovery of ~230,000 new single base variants
  – First full recombination map of a family
  – First determination of inter-generational mutation rates
  – Identification of disease gene candidates for rare genetic diseases (two in this case)
Roach et al., Science, April 30, 2010

Whole Genome Sequencing of Family
Unaffected parents
Children with craniofacial and limb malformation (Miller syndrome) and lung disease (ciliary dyskinesia)

Inheritance state vector: family of four
Allele Assortment Inheritance States
haploidentical paternal
identical
haploidentical maternal
nonidentical
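The four inheritance states listed here can be made concrete with a small sketch. It assumes unphased two-letter genotypes for father, mother and two children at a single site, and asks which states are compatible with them. The function name `consistent_states` and the genotype encoding are illustrative choices, not taken from the talk, and the published analysis may well differ in detail.

```python
from itertools import product

# The four inheritance states of a two-child family, keyed by whether the
# siblings received the same allele from the father and from the mother.
STATES = {
    (True, True): "identical",
    (True, False): "haploidentical paternal",
    (False, True): "haploidentical maternal",
    (False, False): "nonidentical",
}

def consistent_states(father, mother, child1, child2):
    """Return the inheritance states compatible with unphased genotypes.

    Each genotype is a 2-character string such as "AG" (hypothetical encoding).
    An empty result means no Mendelian transmission explains the data,
    which points to a sequencing error or a new mutation.
    """
    found = set()
    # i, j: index of the paternal / maternal allele passed to child 1
    # k, l: the same for child 2
    for i, j, k, l in product((0, 1), repeat=4):
        g1 = sorted(father[i] + mother[j])
        g2 = sorted(father[k] + mother[l])
        if g1 == sorted(child1) and g2 == sorted(child2):
            found.add(STATES[(i == k, j == l)])
    return found

# A fully informative site pins down a single state.
print(sorted(consistent_states("AG", "CT", "AC", "AC")))
# A Mendelian inconsistency leaves no compatible state.
print(sorted(consistent_states("AA", "AA", "AG", "AA")))
```

Most single sites are compatible with more than one state, so in practice evidence has to be pooled over many neighbouring sites before a state is assigned to a region.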
Larger family – higher dimension

Sequencing a nuclear family yields: high resolution crossover sites and haplotypes
Inheritance “patterns” in kids (0.5 Mb bins)
[Figure: inheritance state of the children in 0.5 Mb bins, with Dad and Mom tracks. Legend: haploidentical paternal, identical, haploidentical maternal, nonidentical]
Inheritance vector is 4 dimensional

Family Allele Inheritance Patterns
[Figure: inheritance patterns drawn along chromosomes 1 to 22]

Utility of Simple Constraints & Error Reduction
[Figure: two bar-chart panels on log scales, simple recessive (SNPs, 1 to 10^5) and compound heterozygous (genes, 1 to 10^4), each with bars A to D]
Constraints:
1. recessive
2. “twin” states
3. very rare
4. detrimental
Bars:
A  All SNPs, possibly detrimental
B  All SNPs, probably detrimental
C  Very rare SNPs, possibly detrimental
D  Very rare SNPs, probably detrimental
Four gene candidates!

Genomes of kids
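The four constraints named above (recessive, “twin” states, very rare, detrimental) can be read as successive filters on the variant list. The sketch below is one possible reading, not the pipeline used in the study. It assumes that “twin” states means regions where the two affected children are in the identical inheritance state, and the `Variant` fields, the 1% frequency cutoff and the gene names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    father: int        # copies of the variant allele: 0, 1 or 2
    mother: int
    child1: int
    child2: int
    sib_state: str     # inheritance state of the region containing the site
    pop_freq: float    # allele frequency in a reference population
    detrimental: bool  # hypothetical flag from a functional predictor

def passes_shared_filters(v, max_freq):
    """Constraints 2 to 4: identical-state region, very rare, detrimental."""
    return v.sib_state == "identical" and v.pop_freq <= max_freq and v.detrimental

def simple_recessive(v, max_freq=0.01):
    """Constraint 1 (simple recessive): carrier parents, homozygous children."""
    return (v.father == 1 and v.mother == 1
            and v.child1 == 2 and v.child2 == 2
            and passes_shared_filters(v, max_freq))

def compound_het_genes(variants, max_freq=0.01):
    """Genes where both children carry one qualifying variant from each parent."""
    from_father, from_mother = set(), set()
    for v in variants:
        if not (v.child1 == 1 and v.child2 == 1 and passes_shared_filters(v, max_freq)):
            continue
        if v.father == 1 and v.mother == 0:
            from_father.add(v.gene)
        elif v.father == 0 and v.mother == 1:
            from_mother.add(v.gene)
    return from_father & from_mother

# Made-up variants in made-up genes, for illustration only.
demo = [
    Variant("GENE_A", 1, 1, 2, 2, "identical", 0.001, True),
    Variant("GENE_B", 1, 0, 1, 1, "identical", 0.0, True),
    Variant("GENE_B", 0, 1, 1, 1, "identical", 0.002, True),
    Variant("GENE_C", 1, 1, 2, 2, "nonidentical", 0.001, True),
]
print([v.gene for v in demo if simple_recessive(v)])
print(sorted(compound_het_genes(demo)))
```

Each filter removes a different class of candidate, which is why the counts in the bar charts fall by orders of magnitude from bar A to bar D.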
[Figure: the children's genomes along chromosomes 1 to 22 and X, with the four candidate genes marked: ZNF721, DNAH5 (ciliary dyskinesia gene), KIAA0556, DHODH (Miller's gene)]
Sibling genomes are identical across ~25% of their length
[Figure legend: centromere; heterochromatin; error region; CNV; candidate gene; paternal recombination; maternal recombination; identical; haploidentical maternal (23.2% here); haploidentical paternal; nonidentical]

VAAST FPT results for 2 MILLER Syndrome patients
Genes highlighted: DHODH, DNAH5
Pgene / Pgenome labels, in slide order: P < 1e-3, P > 0.1, P < 1e-7, P > 0.05, P < 1e-8, P < 5e-3
DHODH ranked 3rd, DNAH5 ranked 11th out of 21,902 genes
Mark Yandell, Univ of Utah

Schematic of VAAST Analysis of MILLER Kindred 1 using a single quartet: only two candidate genes
Genes highlighted: DHODH, DNAH5
Pgene / Pgenome labels, in slide order: P < 1e-3, P > 0.1, P < 1e-7, P > 0.05, P < 1e-8, P < 5e-3
DHODH ranked 1st, DNAH5 ranked 2nd out of 21,902 genes
Mark Yandell, Univ of Utah

Effective Error Rate
(1 - 0.7) × (1.0 × 10^-5) = 3 × 10^-6 errors per base, i.e. 99.9997% accurate
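The two figures on this slide, the effective error rate and the intergenerational mutation rate, can be checked with a few lines of arithmetic. The sketch reads the 0.7 as the fraction of raw errors that family inheritance analysis catches, which is an interpretation the slide does not spell out. The haploid genome size of 3.1 × 10^9 positions is an assumed round figure, not a number from the talk.

```python
# Effective error rate after family-based error detection.
raw_error_rate = 1.0e-5   # errors per base before correction (from the slide)
fraction_caught = 0.7     # ASSUMED meaning: share of errors removed using inheritance
effective_error_rate = (1 - fraction_caught) * raw_error_rate
accuracy = 1 - effective_error_rate

# New mutations per child implied by the per-position mutation rate.
mutation_rate = 1.1e-8    # per position per haploid genome (from the slide)
haploid_positions = 3.1e9 # ASSUMED approximate size of a haploid human genome
mutations_per_diploid = mutation_rate * haploid_positions * 2

print(f"effective error rate: {effective_error_rate:.1e}")
print(f"accuracy: {accuracy:.4%}")
print(f"new mutations per diploid genome: {mutations_per_diploid:.0f}")
```

With this assumed genome size the mutation count comes out slightly under the slide's rounded value of 70; a slightly larger assumed genome size closes the gap.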
Intergenerational mutation rate: 1.1 × 10^-8 per position per haploid genome, i.e. about 70 mutations in each diploid genome

“Compression regions” & reference genomes
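Sorting out the errors in the reference genome is essential: variations vs false identifications
[Figure: in Reality, A and C sit in two separate, similar copies of a sequence; the Reference has compressed them into a single copy]
Reads from both copies map to the single reference copy, so the sequences would be called as A/C hets
Thus, regions with a high frequency of hets in families, together with a coverage anomaly, imply a compression error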
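The rule above combines two signals: heterozygous calls in everyone, plus anomalous coverage. A minimal sketch of that rule follows. The bin layout, the field names and both thresholds are hypothetical placeholders chosen for illustration. It treats the coverage anomaly as an excess, on the reasoning that reads from two real copies pile onto one reference copy.

```python
def compression_candidates(bins, het_threshold=0.5, coverage_threshold=1.5):
    """Flag genome bins that look like compressed (collapsed) reference regions.

    bins: list of dicts with hypothetical keys
      "name"           - label of the bin
      "het_fraction"   - fraction of variant sites called heterozygous,
                         one value per family member
      "coverage_ratio" - observed read depth divided by the genome-wide mean
    """
    flagged = []
    for b in bins:
        # True variation is rarely heterozygous in every family member.
        hets_in_everyone = all(h >= het_threshold for h in b["het_fraction"])
        # Reads from two real copies stacking on one reference copy raise depth.
        excess_coverage = b["coverage_ratio"] >= coverage_threshold
        if hets_in_everyone and excess_coverage:
            flagged.append(b["name"])
    return flagged

# Made-up bins for illustration only.
example_bins = [
    {"name": "bin_1", "het_fraction": [0.9, 0.95, 0.9, 0.92], "coverage_ratio": 2.1},
    {"name": "bin_2", "het_fraction": [0.9, 0.1, 0.5, 0.4], "coverage_ratio": 1.0},
    {"name": "bin_3", "het_fraction": [0.9, 0.9, 0.9, 0.9], "coverage_ratio": 1.0},
]
print(compression_candidates(example_bins))
```

Requiring both signals matters: high heterozygosity alone can be real variation, and high coverage alone can be a true copy-number gain in the family.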
Compression Blocks in the Human Genome (probable)
Numbers indicate the number of markers in each compression block

Now & Future “references”
Now:    Sequence Reads → Map onto → Chosen Reference Genome Sequence (simple, crude, error generating)
Future: Sequence Reads → Comparisons & Inference → Set of high-accuracy Genome Sequences (complex, accurate)

More complex genetics
• Larger families = high dimension inheritance vectors
• Need this to test more complex genetic models
Iterative model testing
[Diagram boxes:]
• Biology
• Other genetic data as priors
• Genetic Model
  – No of genes
  – Types of variations
  – Model of interactions
• Candidate model: hypothesis for testing
• Test against family data
  – All gene variants
  – Relationship in family
• Increase model complexity

Family genome sequences
Can provide:
• Accurate sequence
• Low FDR of rare “SNPs”
• Highly accurate phase & recombination points
• New approaches to testing complex genetic models

Follow-on Studies
We expect to analyze 120 to 150 high accuracy genomes in next set: (almost all will be in families)
• As of October 2010 we have about 50 whole genomes done
• Huntington’s disease: modifier genes
• Congenital heart defect: modifier genes
• Planned (focus on neurodegeneration):
  – Parkinson’s
  – Spinal muscular atrophy
  – others

Congenital Heart Defects
D. Srivastava – Gladstone Institute
- Primary defect = GATA4 mutation
- Identify modifiers

New Computational Approaches to Genetics: integrating networks and genetic data analysis
• Simple, additive genome-wide genetic studies for detection of genetic effects are powerful, but deeply flawed
• Beyond additive GWAS genetic effects
• How to add biology: beyond genotype-phenotype correlations
GWAS: additivity assumption
“Systems genetics”: reality
Problem: Can we constrain statistical analysis with complex biological knowledge (gene interactions, knowledge of networks, related phenotype components, GWAS data)?

Approaches to the problem
A framework for complex knowledge constraints
Probabilistic graphical models (Bayesian nets, Markov nets, PBNs)
• Handle uncertainty and noise in the data
• Compactly represent probability distributions
• Use powerful algorithms for probabilistic reasoning
+
First-order logic
• Classes of objects
• Recursive, potentially infinite structures
• Relational data

Probabilistic logic-based modeling
• Combining probabilistic and logic representations
• The best of both worlds – can represent partial knowledge
• Relational Markov Networks, Loopy Logic, Markov Logic Networks, etc.
The most general approach: Markov Logic Networks

Yeast sporulation data analysis by MLN
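To make the idea of a Markov Logic Network concrete before the yeast results, here is a toy sketch. In an MLN each logical formula carries a weight, and a world's probability is proportional to the exponential of the summed weights of the formulas it satisfies. The two formulas, their weights and the atom names (`M1`, `M2` for marker alleles, `S` for sporulation) are invented for illustration and are not the model from the analysis reported here.

```python
import math

# Weighted formulas over ground atoms M1, M2 (marker alleles) and S (sporulates).
# Weights and formulas are hypothetical; a real MLN learns weights from data.
FORMULAS = [
    (1.0, lambda w: (not w["M1"]) or w["S"]),                # M1 => S
    (2.0, lambda w: (not (w["M1"] and w["M2"])) or w["S"]),  # M1 and M2 => S
]

def unnormalized(world):
    """exp(sum of weights of satisfied formulas), the MLN potential of a world."""
    return math.exp(sum(weight for weight, formula in FORMULAS if formula(world)))

def prob_sporulates(m1, m2):
    """P(S | M1, M2), obtained by summing over the two possible values of S."""
    on = unnormalized({"M1": m1, "M2": m2, "S": True})
    off = unnormalized({"M1": m1, "M2": m2, "S": False})
    return on / (on + off)

for m1, m2 in [(False, False), (True, False), (True, True)]:
    print(m1, m2, round(prob_sporulates(m1, m2), 3))
```

The second formula fires only when both markers are present, so it acts as an interaction term on top of the single-marker effect. This is the kind of non-additive structure that a purely additive model cannot express. Unlike hard logic, a violated formula only lowers a world's probability rather than ruling it out.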
• Detected informative markers: 71, 117, 160, 72, 116, 123, 57, 14 (130), 79, 20
• Markers in red are confirmed by (Gerke, Lorenz, Cohen, 09)
• 57, 14, 130, 20 – new informative loci

Genetic interactions inferred by MLN from Yeast Cross Sporulation phenotype
Red: positive interaction
Green: negative interaction
Intensity indicates strength of interaction
Sakhanenko and Galas, J. Comp. Biol. (October 2010)

Next steps towards computational “Systems Genetics”
• More complex genetic models
  – Testing on yeast data
  – Family genome sequences of humans
• Trim down full first order logic propositions
  – Adopt language for biological descriptions
  – Integrate knowledge of network function

Summary
• Technological transformation of human genetics
• Whole genome sequencing
• Family of four project – what have we learned? Next projects.
• Challenges
  – Reference genome problem
  – Complex genetic models
  – Integration of data types: new methods

Acknowledgements
Institute for Systems Biology: Jared Roach, Arian Smit, Gustavo Glusman, Paul Shannon, Lee Rowen, Robert Hubley, Lee Hood, Kai Wang, Nikita Sakhanenko, Ji-Hoon Cho, Alton Eldridge
University of Utah: Lynn Jorde, Chad Huff, Mark Yandell
University of Washington: Mike Bamshad, Jay Shendure
Gustavo Glusman, Jared Roach, Chad Huff, Arian Smit
University of Luxembourg, LCSB: Rudi Balling, Antonio DelSol
Complete Genomics: Rade Drmanac, Dennis Ballinger, Krishna Pant, Andrew Sparks
Funding: ISB-Univ of Luxembourg Program, NIH, NSF-FIBR Program, DoD