Annotating gene sequence variation
Shamil Sunyaev
Department of Biomedical Informatics, Harvard Medical School
Broad Institute of M.I.T. and Harvard

Identifying functionally significant causal variants
[Figure: lines of evidence used in interpretation: segregation data, functional data, frequency in controls, animal models, bioinformatics]

Map variants on genomic annotation
• Watch for multiple transcripts!
• Watch for conflicting annotations!

Nonsense variants
• One of the most significant types of variants, usually leading to complete loss of function.
• Nonsense variants are enriched in sequencing artifacts.
• Important considerations: i) location along the gene; ii) does the variant cause NMD (nonsense-mediated decay)? iii) is the variant in a commonly skipped exon?
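
The NMD question can be approximated with a commonly cited rule of thumb: a premature stop codon more than roughly 50 nucleotides upstream of the last exon-exon junction tends to trigger NMD, while one in the last exon or close to the last junction tends to escape. The sketch below is only an illustration of that heuristic, not how any particular tool implements it; the function name, the coordinate convention, and the 50-nt default are assumptions introduced here.

```python
def predicted_to_trigger_nmd(ptc_pos, exon_lengths, threshold=50):
    """Rule-of-thumb NMD call for a premature termination codon (PTC).

    ptc_pos      -- 1-based position of the PTC in the spliced transcript
                    (hypothetical convention chosen for this sketch)
    exon_lengths -- exon lengths in transcript order, same coordinates
    threshold    -- assumed distance cutoff in nucleotides
    """
    if len(exon_lengths) < 2:
        return False                      # single-exon transcript: no junction
    last_junction = sum(exon_lengths[:-1])  # bases upstream of the last junction
    if ptc_pos > last_junction:
        return False                      # PTC lies in the last exon
    # NMD is predicted only if the PTC is far enough upstream of the junction.
    return (last_junction - ptc_pos) > threshold
```

For a transcript with exons of 100, 200, and 150 nt, a stop at position 100 is called NMD-triggering, while stops at positions 280 (too close to the last junction) or 350 (last exon) are not. Real annotation also has to deal with multiple transcripts per gene, as noted above.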
• Tool: LOFTEE

Variants involved in splicing
• Variants in canonical splice sites
• Variants in exonic or intronic splicing enhancers
• Gain-of-splicing variants
• Tool: SpliceAI

Missense variants: computational predictions
Does the mutation fit the pattern of past evolution?

human  VVSTADLCAPSSTKLDERA
dog    FVSTSELCAGSTTRLEERA
fish   FLSTSELCVPSTLKVNEKV
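
As a deliberately naive illustration of the question, the sketch below scans the three-species alignment column by column and asks whether a proposed human substitution has already been seen in another species. The function names are introduced here for illustration only, and this simple counting ignores phylogeny and sample size entirely.

```python
# Toy alignment copied from the slide (human, dog, fish).
ALIGNMENT = {
    "human": "VVSTADLCAPSSTKLDERA",
    "dog":   "FVSTSELCAGSTTRLEERA",
    "fish":  "FLSTSELCVPSTLKVNEKV",
}

def column_residues(alignment):
    """Return, for each alignment column, the set of residues observed there."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [set(col) for col in zip(*seqs)]

def conserved_positions(alignment):
    """1-based positions where every species carries the same residue."""
    return [i + 1 for i, res in enumerate(column_residues(alignment))
            if len(res) == 1]

def seen_in_orthologs(alignment, position, new_residue, reference="human"):
    """True if new_residue occurs at this 1-based column in a non-reference species."""
    return any(seq[position - 1] == new_residue
               for species, seq in alignment.items() if species != reference)
```

Here `seen_in_orthologs(ALIGNMENT, 5, "S")` is true because dog and fish both carry S where human has A, whereas a change of the invariant cysteine at position 8 has no precedent in this alignment.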
Statistical issues:
• Sequences are related by phylogeny.
• Generally, we have too few sequences.

Does the mutation fit the pattern of past evolution?
Continuous time Markov model
[Figure labels: GLY, VAL, ALA, GLY, ALA]
• We assume a constant fitness landscape: what is good for fish is good for human!
• We can estimate whether the mutation fits the pattern of amino acid changes.
• We can also estimate the rate of evolution at the amino acid site.
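
A continuous-time Markov model of amino acid change can be sketched in a few lines. The example below builds a reversible rate matrix Q from assumed stationary frequencies and exchangeabilities, then computes transition probabilities as the matrix exponential of Qt. The three-letter alphabet and all numerical values are hypothetical, chosen only to keep the example small. Q is written in the row convention (rows sum to zero), so stationarity reads πQ = 0.

```python
import math

# Hypothetical toy alphabet and parameters (not taken from the slides).
STATES = ["G", "A", "V"]
PI = [0.5, 0.3, 0.2]            # assumed stationary frequencies
EXCH = [[0.0, 1.0, 0.2],        # assumed symmetric exchangeabilities
        [1.0, 0.0, 2.0],
        [0.2, 2.0, 0.0]]

def build_rate_matrix(pi, exch):
    """Reversible rate matrix: Q[i][j] = exch[i][j] * pi[j] for i != j; rows sum to 0."""
    n = len(pi)
    q = [[exch[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])    # diagonal is still 0.0 here, so this sums off-diagonals
    return q

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, terms=20):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

Q = build_rate_matrix(PI, EXCH)
P = transition_matrix(Q, 0.5)   # P[i][j]: probability of state j after time 0.5, starting in i
```

Each row of `P` sums to one, and for large `t` every row approaches `PI`. In practice one would use a library routine for the matrix exponential and an empirically estimated 20-state rate matrix; the pure-Python version is used here only to stay self-contained.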

Continuous time Markov model
• P – matrix of transition probabilities: $P(t) = e^{Qt}$, where Q is the rate matrix
• π – stationary distribution: $Q\pi = 0$

Protein structure view
• Most pathogenic mutations are important for stability (good news?).
• ΔΔG is difficult to estimate.
• Unfolded protein response pathway has to be taken into account.
• Heuristic structural parameters help, but less than comparative genomics.

PolyPhen2
www.genetics.bwh.harvard.edu/pph2
Adzhubei, et al. Nature Methods 2010

Weakly deleterious mutations
• Multiple independent lines of evidence suggest abundance of weakly deleterious alleles in humans
• Weakly deleterious variants may occur in highly conserved positions
• Weakly deleterious alleles probably contribute to complex phenotypes but not to simple Mendelian phenotypes

Conservation can be due to very weak selection
Constant evolutionary landscape
• Every new mutation eventually will be either fixed or lost
• s – selection coefficient
• Ne – effective population size; for humans estimated to be ~10,000
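
The link between s, Ne, and conservation can be made concrete with Kimura's diffusion approximation for the fixation probability. For a new deleterious mutation, the substitution rate relative to the neutral rate is approximately K/K0 = S / (e^S − 1), with S = 4·Ne·s. The sketch below assumes additive selection, census size equal to Ne, and s measured as the heterozygote disadvantage; these parametrization choices and the function name are assumptions made here, not taken from the slides.

```python
import math

def relative_substitution_rate(s, ne=10_000):
    """K/K0 for a mutation with selection coefficient s >= 0 acting against it.

    Uses the diffusion approximation K/K0 = S / (exp(S) - 1), S = 4 * Ne * s.
    """
    if s < 0:
        raise ValueError("this sketch only covers deleterious mutations (s >= 0)")
    big_s = 4.0 * ne * s
    if big_s == 0.0:
        return 1.0          # strictly neutral: substitution at the neutral rate
    if big_s > 700.0:
        return 0.0          # avoid overflow; the rate is effectively zero
    return big_s / math.expm1(big_s)

# Sweep the range of selection coefficients shown on the slide's x-axis.
for s in (1e-6, 1e-5, 1e-4, 1e-3):
    print(f"s = {s:.0e}  K/K0 = {relative_substitution_rate(s):.3g}")
```

With Ne = 10,000 this gives roughly 0.98 at s = 10^-6 and roughly 0.07 at s = 10^-4, so selection coefficients far too small to measure directly are already enough to produce near-complete conservation.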
[Figure: K/K0 (y-axis, 0.0 to 1.0) plotted against selection coefficient s (x-axis, 10^-6 to 10^-3), with regions labeled "Neutral behavior" and "Complete conservation"]

Epistatic interactions
Compensatory mutations
[Figure labels: AA, b, B, A, β]

Ridges on the fitness landscape

The phenomenon of compensatory mutations in different fields
• Biochemistry – protein stability, allosteric effects
• Genetics – incomplete penetrance
• Evolutionary biology – speciation, epistatic models of evolution

Dobzhansky-Muller incompatibility

Looking at vertebrate species
Many human disease mutations are found in vertebrates
• HumVar "Disease": 22,207 variants
• ClinVar "Pathogenic": 10,596 variants
• Found in MultIZ 100-Way alignment: 24,307,128 variants
[Venn diagram regions: HumVar only 16,544; ClinVar only 6,503; HumVar and ClinVar only 3,250; HumVar and MultIZ only 2,100; ClinVar and MultIZ only 530; all three 313; MultIZ only 24,304,185]
• 5.5-6.5% of presumably pathogenic human mutations are detected in mammals

How complex can genetic suppression be?
Genetic incompatibilities are widespread within species. Russell B. Corbett-Detig, Jun Zhou, Andrew G. Clark, Daniel L. Hartl & Julien F. Ayroles. Letter, doi:10.1038/nature12678
• Numerous Dobzhansky-Muller incompatibilities in fly population

How complex can genetic suppression be?
Model: accumulation of neutral variants
[Figure: changes (x) accumulating over time in a sequence carrying the variant of interest: 1 change, 2 changes, 3 changes, 4 changes, ..., n changes. Additional label: Stability]
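
One way to read the neutral-accumulation model is as a Poisson process: if changes arrive independently at a constant rate, the number of changes by time t is Poisson-distributed and the waiting time to the first change is exponential. More generally, the waiting time to the k-th event of a Poisson process follows an Erlang distribution with shape k. The sketch below encodes only that textbook relationship; the function names and the rate value are hypothetical, and it is not fitted to any data from the slides.

```python
import math

def poisson_pmf(n, rate, t):
    """Probability of exactly n changes by time t under a Poisson process."""
    mu = rate * t
    return math.exp(-mu) * mu ** n / math.factorial(n)

def prob_at_least_k_changes(t, rate, k=1):
    """P(at least k changes by time t), i.e. the Erlang(k, rate) CDF at t.

    k = 1 reduces to the exponential CDF, 1 - exp(-rate * t).
    """
    return 1.0 - sum(poisson_pmf(n, rate, t) for n in range(k))

# Hypothetical rate of 0.5 changes per unit time.
for t in (0.5, 1.0, 2.0, 4.0):
    one = prob_at_least_k_changes(t, 0.5, k=1)
    three = prob_at_least_k_changes(t, 0.5, k=3)
    print(f"t = {t}: P(>=1 change) = {one:.3f}, P(>=3 changes) = {three:.3f}")
```

The k = 1 curve rises immediately, while curves for larger k start flat and rise later, which is the qualitative difference between an exponential and an Erlang waiting time.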

Disease variants do not fit the Poisson expectation
[Plot label: Exponential fit]

Model: accumulation of disease variants
Model: Erlang distribution
target size m Suppressors Disease variant time x 1 change