The Revolution in Human Genetics: Deciphering Complexity

David Galas
Institute for Systems Biology, Seattle, WA

Genetics and Environment integration is key to future medicine

[Figure: diagram with the labels Genome; Environment; Complex biological networks; Disease & Health; Cell read-out (blood proteins, miRNA, mRNA); Predictive, personalized diagnostics]

Summary

• Technological transformation of human genetics
• Whole genome sequencing
• Family of four project – what have we learned? Next projects.
• Challenges
  – Reference genome problem
  – Complex genetic models
  – Integration of data types: new methods

History of Genetic Marker Density

Human Genetic Markers

Type of marker      No. of loci sampled   Recombination inference
Blood groups        ~20                   N/A
Electrophoretic     ~30                   N/A
HLA type            1                     N/A
RFLPs               >10^5                 N/A
VNTRs               >10^4                 minimal
Microsatellites     >10^5                 minimal
SNP genotypes       >10^6                 probabilistic
Exomes              1%                    marginal
Genomes             All                   exact

Family Sequences and the Future of human genetics

• Traditional (common SNP, additive effects) GWAS has been a gene collection method – reaching asymptotes for known phenotypes
• Inferences from family sequences plus population-based studies
• Constrained analysis using: biological understanding of networks, additional data (e.g. population-based information)
• Genetic-environment deconvolution

“Missing heritability”

Family genomics in perspective

3. Current technical issues

• Sources of DNA
• Library biases
• Inherent accuracy of reads
• Read lengths – currently 20 to 70 bases
• Maps to “reference” variation file
• Coverage and depth (~40 fold)
• Assembly and phase
• Error rates – types

Pilot project: Complete Genome Sequences of a Family of Four

• Parents healthy; both kids have two genetic diseases
• Power of a family sequence permits:
  – Noise reduction: very low error rate, <1/100,000
  – Discovery of ~230,000 new single base variants
  – First full recombination map of a family
  – First determination of inter-generational mutation rates
  – Identification of disease gene candidates for rare genetic diseases (two in this case)
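One simple ingredient of family-based noise reduction is a Mendelian consistency check: a child's genotype must be buildable from one paternal and one maternal allele. The sketch below is a minimal illustration under that assumption only; the function names and the toy genotype calls are hypothetical, and the analysis in the talk relies on richer inheritance-state information than this check uses.

```python
from itertools import product

def mendelian_consistent(father, mother, children):
    """True if every child genotype can be formed from one allele of each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return all(tuple(sorted(child)) in possible for child in children)

def flag_inconsistent_sites(calls):
    """calls maps a site name to (father, mother, [child, ...]) genotype tuples."""
    return [site for site, (father, mother, kids) in calls.items()
            if not mendelian_consistent(father, mother, kids)]

# Hypothetical genotype calls for a family of four (two children per site).
calls = {
    "site1": (("A", "A"), ("A", "G"), [("A", "G"), ("A", "A")]),
    "site2": (("C", "C"), ("C", "C"), [("C", "T"), ("C", "C")]),  # T has no parental source
    "site3": (("G", "T"), ("G", "T"), [("T", "T"), ("G", "G")]),
}

print(flag_inconsistent_sites(calls))
```

A flagged site is either a sequencing error or a genuine de novo mutation, so such sites need follow-up rather than automatic removal.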

Roach et al., Science, April 30, 2010

Whole Genome Sequencing of Family

Unaffected parents

Children with craniofacial and limb malformation (Miller syndrome) and lung disease (ciliary dyskinesia)

Inheritance state vector: family of four

Allele Assortment Inheritance States

haploidentical paternal

identical

haploidentical maternal

nonidentical
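The four inheritance states listed above can be derived from which parental haplotype each child received. The following sketch assumes each child is described by a hypothetical pair (paternal_hap, maternal_hap) of 0/1 indices; it classifies a sibling pair at one locus and enumerates all 16 equally likely transmissions.

```python
from collections import Counter
from itertools import product

def inheritance_state(child1, child2):
    """Classify a sibling pair at one locus.

    Each child is (paternal_hap, maternal_hap); each entry is 0 or 1 and says
    which of that parent's two haplotypes the child received.
    """
    same_paternal = child1[0] == child2[0]
    same_maternal = child1[1] == child2[1]
    if same_paternal and same_maternal:
        return "identical"
    if same_paternal:
        return "haploidentical paternal"
    if same_maternal:
        return "haploidentical maternal"
    return "nonidentical"

# Enumerate every combination of transmissions to two children (4 x 4 = 16).
transmissions = list(product((0, 1), repeat=2))
state_counts = Counter(
    inheritance_state(c1, c2) for c1 in transmissions for c2 in transmissions
)

for state, count in state_counts.items():
    print(f"{state}: {count}/16")
```

Each state covers 4 of the 16 cases, so under this simple model each is expected over about a quarter of the genome.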

Larger family – higher dimension

Sequencing a nuclear family yields: high resolution crossover sites and haplotypes

Inheritance “patterns” in kids (0.5Mb bins)
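Once each 0.5 Mb bin has been assigned an inheritance state, crossover sites show up as changes of state between neighbouring bins. The sketch below is a toy illustration of that idea: the state sequence is invented, and the function and constant names are hypothetical. A change in paternal sharing is attributed to a paternal meiosis, and a change in maternal sharing to a maternal meiosis.

```python
BIN_SIZE_MB = 0.5

# For each state: (siblings share paternal haplotype, siblings share maternal haplotype)
STATE_SHARING = {
    "identical": (True, True),
    "haploidentical paternal": (True, False),
    "haploidentical maternal": (False, True),
    "nonidentical": (False, False),
}

def crossover_events(states):
    """Return (position_mb, parent) for every state change between adjacent bins."""
    events = []
    for i in range(1, len(states)):
        prev_pat, prev_mat = STATE_SHARING[states[i - 1]]
        cur_pat, cur_mat = STATE_SHARING[states[i]]
        boundary_mb = i * BIN_SIZE_MB  # start coordinate of bin i
        if prev_pat != cur_pat:
            events.append((boundary_mb, "paternal"))
        if prev_mat != cur_mat:
            events.append((boundary_mb, "maternal"))
    return events

# Hypothetical run of five consecutive bins on one chromosome.
states = [
    "identical",
    "identical",
    "haploidentical paternal",
    "haploidentical paternal",
    "nonidentical",
]

print(crossover_events(states))
```

With only two children, a state change says which parent's meiosis recombined but not which child carries the crossover; resolving that needs additional relatives or direct haplotype information.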

[Figure: inheritance-state tracks labelled Dad and Mom; legend: haploidentical paternal, identical, haploidentical maternal, nonidentical]

Inheritance vector is 4 dimensional

Family Allele Inheritance Patterns

[Figure: family allele inheritance patterns across chromosomes 1 to 22]

Utility of Constraints & Error Reduction

Constraints:
1. recessive
2. “twin” states
3. very rare
4. detrimental

[Figure: two bar charts on a log scale (1 to 10^5), one for the simple recessive model (SNPs) and one for the compound heterozygous model (genes), each with bars A to D]

A: All SNPs, possibly detrimental
B: All SNPs, probably detrimental
C: Very rare SNPs, possibly detrimental
D: Very rare SNPs, probably detrimental

Four gene candidates!

Genomes of kids

[Figure: map of the children's genomes across chromosomes 1 to 22 and X, with the candidate genes ZNF721, DNAH5, KIAA0556, and DHODH marked]

DHODH: Miller syndrome gene
DNAH5: ciliary dyskinesia gene

Sibling genomes are identical across ~25% of their length (23.2% here)

Legend: centromere; heterochromatin; error region; CNV; candidate gene; paternal recombination; maternal recombination; haploidentical maternal; identical; haploidentical paternal; nonidentical

VAAST FPT results for 2 MILLER Syndrome patients

Genes highlighted: DHODH, DNAH5

Pgene / Pgenome values shown:
P < 1e-3 / P > 0.1
P < 1e-7 / P > 0.05
P < 1e-8 / P < 5e-3

DHODH ranked 3rd and DNAH5 ranked 11th out of 21,902 genes

Mark Yandell, Univ of Utah

Schematic of VAAST Analysis of MILLER Kindred 1 using a single quartet: only two candidate genes

Genes highlighted: DHODH, DNAH5

Pgene / Pgenome values shown:
P < 1e-3 / P > 0.1
P < 1e-7 / P > 0.05
P < 1e-8 / P < 5e-3

DHODH ranked 1st and DNAH5 ranked 2nd out of 21,902 genes

Mark Yandell, Univ of Utah

Effective Error Rate

$(1 - 0.7) \times (1.0 \times 10^{-5}) = 3 \times 10^{-6}$, i.e. 99.9997% accurate
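The effective error rate above, and the intergenerational mutation rate quoted just below, can both be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated on the slides: the 0.7 is read as the fraction of raw errors removed by the family-based constraints, and the haploid genome is taken to be about 3.1 billion positions.

```python
# Effective error rate after family-based filtering.
raw_error_rate = 1.0e-5      # errors per base before filtering (from the slide)
fraction_removed = 0.7       # assumed meaning of the 0.7 on the slide
effective_error_rate = (1 - fraction_removed) * raw_error_rate
accuracy_percent = (1 - effective_error_rate) * 100

# Expected number of new mutations per diploid genome.
mutation_rate = 1.1e-8       # per position per haploid genome (from the slide)
HAPLOID_GENOME_BP = 3.1e9    # assumption: approximate haploid genome size
mutations_per_diploid_genome = mutation_rate * HAPLOID_GENOME_BP * 2

print(f"effective error rate: {effective_error_rate:.1e}")
print(f"accuracy: {accuracy_percent:.4f}%")
print(f"new mutations per diploid genome: {mutations_per_diploid_genome:.0f}")
```

With these inputs the mutation count comes out in the high 60s, which matches the rounded figure of about 70 on the slide.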

Intergenerational mutation rate: $1.1 \times 10^{-8}$ per position per haploid genome, or about 70 new mutations in each diploid genome

“Compression regions” & reference genomes

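Sorting out the errors in the reference genome is essential: variations vs false identifications

[Figure: in reality there are two similar segments, one carrying A and one carrying C; the reference shows a single segment]

Reads from both segments map to the single reference location, so the position would be called as an A/C het.

Thus, regions with a high frequency of hets in families, together with a coverage anomaly, imply a compression error.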
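The rule just stated combines two signals per region: an excess of heterozygous calls across the family and anomalous coverage. A minimal sketch of such a filter follows. The thresholds, field names, and example bins are hypothetical; the expected depth of 40 follows the ~40-fold coverage mentioned earlier in the talk, and the coverage anomaly is modelled only as excess depth.

```python
EXPECTED_DEPTH = 40.0  # approximate genome-wide depth mentioned in the talk

def compression_candidates(bins, min_het_fraction=0.5, min_depth_ratio=1.5):
    """Return names of bins that show both excess hets and excess coverage.

    Each bin is a dict with:
      name             - label for the region
      all_het_fraction - fraction of variant sites called het in every family member
      mean_depth       - mean read depth over the region
    """
    flagged = []
    for region in bins:
        depth_ratio = region["mean_depth"] / EXPECTED_DEPTH
        excess_hets = region["all_het_fraction"] >= min_het_fraction
        excess_depth = depth_ratio >= min_depth_ratio
        if excess_hets and excess_depth:
            flagged.append(region["name"])
    return flagged

# Hypothetical summary statistics for four regions.
bins = [
    {"name": "binA", "all_het_fraction": 0.02, "mean_depth": 41.0},
    {"name": "binB", "all_het_fraction": 0.80, "mean_depth": 83.0},
    {"name": "binC", "all_het_fraction": 0.75, "mean_depth": 39.0},  # many hets, normal depth
    {"name": "binD", "all_het_fraction": 0.01, "mean_depth": 90.0},  # deep, but few hets
]

print(compression_candidates(bins))
```

Requiring both signals is the point of the design: either one alone has other common explanations, such as a genuinely variable region or a copy number variant.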

Compression Blocks in the Human Genome (probable)

[Figure: compression blocks across the genome; numbers indicate the number of markers in each compression block]

Now & Future “references”

Now: sequence reads are mapped onto a chosen reference genome sequence. Simple, crude, error generating.
Future: sequence reads go through comparisons and inference against a set of high-accuracy genome sequences. Complex, accurate.

More complex genetics

• Larger families = high dimension inheritance vectors
• Need this to test more complex genetic models

Iterative model testing

[Diagram: iterative loop with the following elements]

Inputs: Biology; Other genetic data as priors

Genetic Model
• No. of genes
• Types of variations
• Model of interactions

Test against family data
• All gene variants
• Relationship in family

Candidate model: hypothesis for testing

Loop step: increase model complexity

Family genome sequences

Can provide:
• Accurate sequence
• Low FDR of rare “SNPs”
• Highly accurate phase & recombination points
• New approaches to testing complex genetic models

Follow-on Studies

We expect to analyze 120 to 150 high-accuracy genomes in the next set (almost all will be in families).

• As of October 2010 we have about 50 whole genomes done
• Huntington’s disease: modifier genes
• Congenital heart defect: modifier genes
• Planned (focus on neurodegeneration):
  – Parkinson’s
  – Spinal muscular atrophy
  – others

Congenital Heart Defects
D. Srivastava – Gladstone Institute

- Primary defect = GATA4 mutation
- Identify modifiers

New Computational Approaches to Genetics: integrating networks and genetic data analysis

• Simple, additive genome-wide genetic studies for detection of genetic effects are powerful, but deeply flawed
• Beyond additive GWAS genetic effects
• How to add biology: beyond genotype-phenotype correlations
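A deliberately extreme toy case shows why a purely additive, one-locus-at-a-time scan can miss real genetic effects. In the sketch below, which is not data from the talk, the phenotype appears only when exactly one of two loci carries the variant allele (an XOR pattern). All names are hypothetical.

```python
from itertools import product
from statistics import mean

# One individual per two-locus genotype: (locus_a, locus_b, phenotype).
individuals = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def marginal_effect(population, locus):
    """Mean phenotype difference between carriers and non-carriers at one locus."""
    carriers = [ind[2] for ind in population if ind[locus] == 1]
    others = [ind[2] for ind in population if ind[locus] == 0]
    return mean(carriers) - mean(others)

def joint_means(population):
    """Mean phenotype for each two-locus genotype."""
    return {
        (a, b): mean(ind[2] for ind in population if ind[:2] == (a, b))
        for a, b in product((0, 1), repeat=2)
    }

print("marginal effect of locus a:", marginal_effect(individuals, 0))
print("marginal effect of locus b:", marginal_effect(individuals, 1))
print("phenotype by joint genotype:", joint_means(individuals))
```

Each locus alone shows no effect, yet the two-locus genotype determines the phenotype completely. Real interactions are rarely this clean, but the example marks the limit of the additivity assumption.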

GWAS: additivity assumption
“Systems genetics”: reality

Problem: Can we constrain statistical analysis with complex biological knowledge (gene interactions, knowledge of networks, related phenotype components, GWAS data)?

Approaches to the problem

A framework for complex knowledge constraints

Probabilistic graphical models (Bayesian nets, Markov nets, PBNs)
• Handle uncertainty and noise in the data
• Compactly represent probability distributions
• Use powerful algorithms for probabilistic reasoning

+

First-order logic
• Classes of objects
• Recursive, potentially infinite structures
• Relational data

Probabilistic logic-based modeling
• Combining probabilistic and logic representations
• The best of both worlds – can represent partial knowledge
• Relational Markov Networks, Loopy Logic, Markov Logic Networks, etc.

The most general approach: Markov Logic Networks

Yeast sporulation data analysis by MLN

• Detected informative markers: 71, 117, 160, 72, 116, 123, 57, 14 (130), 79, 20
• Markers in red are confirmed by Gerke, Lorenz, and Cohen (2009)
• 57, 14, 130, 20 – new informative loci

Genetic interactions inferred by MLN from Yeast Cross Sporulation phenotype

Red: positive interaction. Green: negative interaction. Intensity indicates strength of interaction.
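For readers new to Markov Logic Networks, the core idea is that each possible world gets a probability proportional to the exponential of the summed weights of the formulas it satisfies. The sketch below is a propositional toy version of that idea with brute-force inference. The atoms, formulas, and weights are invented for illustration and are not taken from the yeast analysis, which works with first-order formulas grounded over many markers and strains.

```python
from itertools import product
from math import exp

ATOMS = ("marker_a", "marker_b", "high_sporulation")

# (weight, formula) pairs; each formula is a predicate over a world (dict of atom -> bool).
FORMULAS = [
    # marker_a => high_sporulation
    (1.0, lambda w: (not w["marker_a"]) or w["high_sporulation"]),
    # marker_a AND marker_b => high_sporulation (an interaction term)
    (2.0, lambda w: (not (w["marker_a"] and w["marker_b"])) or w["high_sporulation"]),
]

def world_weight(world):
    """Unnormalised weight: exp of the summed weights of satisfied formulas."""
    return exp(sum(weight for weight, formula in FORMULAS if formula(world)))

def conditional(query, evidence):
    """P(query is True | evidence), by enumerating every world consistent with evidence."""
    numerator = 0.0
    denominator = 0.0
    for values in product((False, True), repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        if any(world[atom] != value for atom, value in evidence.items()):
            continue
        weight = world_weight(world)
        denominator += weight
        if world[query]:
            numerator += weight
    return numerator / denominator

for a, b in product((False, True), repeat=2):
    p = conditional("high_sporulation", {"marker_a": a, "marker_b": b})
    print(f"marker_a={a}, marker_b={b}: P(high_sporulation) = {p:.3f}")
```

Enumeration is feasible only for a handful of atoms; practical MLN systems use approximate inference and learn the weights from data.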

Sakhanenko and Galas, J. Comp. Biol. (October 2010)

Next steps towards computational “Systems Genetics”

• More complex genetic models
  – Testing on yeast data
  – Family genome sequences of humans
• Trim down full first order logic propositions
  – Adopt language for biological descriptions
  – Integrate knowledge of network function

Summary

• Technological transformation of human genetics
• Whole genome sequencing
• Family of four project – what have we learned? Next projects.
• Challenges
  – Reference genome problem
  – Complex genetic models
  – Integration of data types: new methods

Acknowledgements

Institute for Systems Biology: Jared Roach, Arian Smit, Gustavo Glusman, Paul Shannon, Lee Rowen, Robert Hubley, Lee Hood, Kai Wang, Nikita Sakhanenko, Ji-Hoon Cho, Alton Eldridge

University of Utah Lynn Jorde, Chad Huff, Mark Yandell

University of Washington Gustavo Glusman, Jared Roach, Chad Huff, Arian Smit Mike Bamshad, Jay Shendure

University of Luxembourg, LCSB: Rudi Balling, Antonio DelSol

Complete Genomics: Rade Drmanac, Dennis Ballinger, Krishna Pant, Andrew Sparks

Funding: ISB-Univ of Luxembourg Program, NIH, NSF-FIBR Program, DoD