
Molecular Evolution

Justin Fay Center for Sciences Department of Genetics 4515 McKinley Ave. Rm 4305 [email protected]

Molecular evolution is the study of the causes and effects of evolutionary changes in molecules.

Species 1  GGCAGTGACATTTTCTAACGCGAAGGTACTT
Species 2  GGCAGCGCCATTTTCTAATGCGAGGGTACTT
Species 3  GGCAGCGCCATTGTCTAATGCGAGGGTACTT
           ***** * **** ***** **** *******
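The asterisk row marks the columns where all three sequences agree. As a minimal sketch of how that row and the pairwise difference counts could be derived from the alignment, the Python below uses two illustrative helpers (`conservation_row` and `count_differences` are names chosen here, not functions from any particular package).

```python
from itertools import combinations

# The three aligned sequences from the example alignment.
ALIGNMENT = {
    "Species 1": "GGCAGTGACATTTTCTAACGCGAAGGTACTT",
    "Species 2": "GGCAGCGCCATTTTCTAATGCGAGGGTACTT",
    "Species 3": "GGCAGCGCCATTGTCTAATGCGAGGGTACTT",
}


def conservation_row(seqs):
    """Return '*' for columns identical in every sequence, ' ' otherwise."""
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*seqs))


def count_differences(a, b):
    """Count aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(conservation_row(ALIGNMENT.values()))
    for (name1, seq1), (name2, seq2) in combinations(ALIGNMENT.items(), 2):
        d = count_differences(seq1, seq2)
        # p is the uncorrected proportion of differing sites
        print(f"{name1} vs {name2}: {d} differences, p = {d / len(seq1):.3f}")
```

The proportion p computed here is uncorrected: it undercounts substitutions whenever the same site has changed more than once.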

Phylogenetics
Archaea
Divergence times
Human-chimp-neanderthal

Comparative genomics
Ultraconserved sequences ( and selection)
ENCODE
FOXP2

Phylogenetics Methods

Table 1. Number of possible rooted and unrooted trees.

Number of sequences   Number of rooted trees   Number of unrooted trees
2                     1                        1
3                     3                        1
4                     15                       3
6                     945                      105
10                    34,459,425               2,027,025

Table 2. Distance matrix.

Sequence   A       B       C
B          d(AB)
C          d(AC)   d(BC)
D          d(AD)   d(BD)   d(CD)

Each d is the distance (substitution rate) between pairs of sequences.
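The tree counts in Table 1 follow the standard double-factorial formulas for strictly bifurcating trees: (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees for n sequences. A short Python sketch, with function names chosen here for illustration:

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1; taken as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 sequences: (2n - 5)!!"""
    if n < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6, 10):
        print(n, rooted_trees(n), unrooted_trees(n))
```

The rapid growth of these counts is why exhaustive tree search becomes impractical beyond a small number of sequences.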

Taxonomists have long debated phylogenetic methods.

There are many types of methods:

Character state methods (also called cladistic methods), like parsimony.

Distance or similarity based methods (also called phenetic methods), like UPGMA.

Maximum likelihood and Bayesian methods.

Parsimony (non-parametric) and maximum likelihood (parametric) are both used when phylogeny is critical.

Software: PAUP, PHYLIP, MEGA, MrBayes

[Figure: tree with tips A, B, C and D]

Gene trees vs Species trees

1. Orthology
2. Independence (no concerted evolution or horizontal transfer)

Orthologs are created by speciation events. Paralogs are genes created by duplication events. Homologs are genes that are similar because of shared ancestry.

Orthologues and paralogues can be distinguished by i) synteny or ii) phylogeny.

[Figure: gene tree showing a duplication followed by speciation into Species 1 and Species 2]

Gene Conversion and Horizontal Gene Transfer

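[Figure: histone loci HHF1/HHT1 (Locus 1, Chr02) and HHT2/HHF2 (Locus 2, Chr14); the species tree compared with gene trees under no conversion (true phylogeny) and under gene conversion]

[Figure: horizontal gene transfer, vertebrate to bacteria and bacteria to vertebrate]

(Comparative Genomics)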

1. Conservation

Annotation of genes, regulatory sequences and other functional elements

Functional sequences will remain conserved across distantly related species, whereas non-functional sequences will accumulate changes.

2. Divergence

Evolution of genes, regulatory sequences and other functional elements

Species-specific functional sequences

Functional sequences with new or modified functions

Origins of Molecular Evolution

Insulin was the first protein sequenced, in 1955, for which Fred Sanger received the Nobel Prize. Cytochrome c protein sequences followed (Margoliash et al. 1961).

The sequencing of the same protein from different species established a number of key principles of molecular evolution:

1. Most proteins are highly conserved, and the changes that do occur are not found within functionally important sites. For example, human diabetics were treated with insulin purified from pigs and cows.

2. The rate of substitution is constant across phylogenetic lineages.

Molecular clock - the rate of amino acid or nucleotide substitution is constant per year across phylogenetic lineages (Zuckerkandl and Pauling 1962). The idea was controversial but revolutionized phylogenetics and set the stage for the neutral theory.

Neutral theory or random drift hypothesis - the vast majority of mutations that become polymorphic in a population and fixed between species are not driven by Darwinian selection but are neutral or nearly neutral with respect to fitness (Kimura 1968; King and Jukes 1969). The neutral theory is dead; long live the neutral theory.

Difference between mutation and substitution rate.

[Figure: population frequency of mutations over time]

Mutation rate - the chance of a mutation occurring in each generation or cell division (does NOT depend on selection)

Substitution rate - the frequency at which mutations become fixed within a population (depends on selection)

Substitution rate = mutation rate * fixation probability * time

Fixation probability depends on selection

Nucleotide Substitution Models

Jukes and Cantor (JC69) Model (1969)

[Figure: substitutions among the purines (A, G) and the pyrimidines (C, T)]

Nucleotide substitution models correct for multiple hits.

Assumptions of JC model:
1) Equal base frequencies
2) Equal mutation rates between the bases
3) Constant mutation rate
4) No selection

Jukes Cantor Model

p = 3/31 = 0.097
K = 0.104 substitutions per site
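The step from p to K uses the standard Jukes-Cantor correction, K = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. A minimal Python sketch that reproduces the numbers above (the function name is illustrative):

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance K (substitutions per site) from the observed
    proportion of differing sites p. Undefined once p reaches 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = 3 / 31  # 3 differences observed over 31 aligned sites
    print(f"p = {p:.3f}, K = {jc69_distance(p):.3f}")
```

K exceeds p because the correction accounts for sites that have been hit more than once; the gap widens as p grows.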

Other nucleotide substitution models

Model   Assumption        Free Parameters   Reference
JC69    A=G=C=T, ts=tv    1                 Jukes & Cantor 1969
K80     A=G=C=T           2                 Kimura 1980
F81     ts=tv             4                 Felsenstein 1981
HKY85                     5                 Hasegawa, Kishino & Yano 1985
GTR     unequal rates     9                 Tavare 1986

Substitution Rates with Selection

Substitution rate = mutation rate * fixation probability * time

The substitution rate for neutral mutations = 2Nµ * 1/(2N) * t = µt
The substitution rate for adaptive mutations = 2Nµ * 2s * t = 4Nsµt for 4Ns > 1

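No selection: The substitution rate between two species is K = 2µt.

Selection: $P = \frac{1 - e^{-4 N_e s q}}{1 - e^{-4 N_e s}}$, where q is the initial frequency of the mutation.

[Figure: S. cerevisiae and S. paradoxus lineages separated by divergence time t]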
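To make the link between fixation probability and substitution rate concrete, here is a hedged Python sketch of Kimura's (1962) diffusion formula for the fixation probability of a mutation with selection coefficient s. It assumes a new mutation starts at frequency q = 1/(2N) and that the census and effective population sizes are equal; the function names are illustrative.

```python
import math


def fixation_probability(s, n, q=None):
    """Fixation probability P = (1 - exp(-4*N*s*q)) / (1 - exp(-4*N*s)).

    s: selection coefficient; n: population size (assumed equal to the
    effective size); q: initial frequency, defaulting to 1/(2N).
    Very large values of 4*N*|s| for deleterious mutations can overflow.
    """
    if q is None:
        q = 1.0 / (2.0 * n)
    if s == 0:
        return q  # neutral limit of the formula
    # expm1(x) = exp(x) - 1; the two minus signs cancel in the ratio
    return math.expm1(-4.0 * n * s * q) / math.expm1(-4.0 * n * s)


def substitution_rate(mu, n, s):
    """Substitutions per generation: 2*N*mu new mutations, each fixing with
    probability P."""
    return 2.0 * n * mu * fixation_probability(s, n)


if __name__ == "__main__":
    print(fixation_probability(0.0, 10_000))   # neutral: 1/(2N)
    print(fixation_probability(0.01, 10_000))  # adaptive: close to 2s
    print(substitution_rate(1e-8, 10_000, 0.0))  # neutral rate equals mu
```

In the neutral case the population size cancels, so the substitution rate equals the mutation rate; for adaptive mutations with 4Ns well above 1, the fixation probability approaches 2s.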

Conserved sequences

Human-Mouse conservation

Species   Conserved*   Conserved Noncoding (non-repetitive aligned)   Reference
Humans    3-8%         21%                                            Waterston et al. (2002)
Worms     18-37%       18%                                            Shabalina & Kondrashov (1999)
Flies     37-53%       40-70%                                         Andolfatto (2005)
Yeast     47-68%       30-40%                                         Chin et al. (2005), Doniger et al. (2005)

*Siepel et al. (2005)

and expression assays of conserved noncoding sequences

Pennacchio et al. 2006
Yun et al. 2012

Rapidly Evolving Genes (dN/dS)

Detecting selection using the nucleotide substitution rate

Synonymous change - mutation that does not change the amino acid sequence of a protein.
Nonsynonymous change - mutation that changes the amino acid sequence of a protein.
dN or Ka = the nonsynonymous substitution rate = # nonsynonymous changes / # nonsynonymous sites.
dS or Ks = the synonymous substitution rate = # synonymous changes / # synonymous sites.

Table 1. The genetic code.

Codon AA    Codon AA    Codon AA    Codon AA
TTT   Phe   TCT   Ser   TAT   Tyr   TGT   Cys
TTC   Phe   TCC   Ser   TAC   Tyr   TGC   Cys
TTA   Leu   TCA   Ser   TAA   Stop  TGA   Stop
TTG   Leu   TCG   Ser   TAG   Stop  TGG   Trp

CTT   Leu   CCT   Pro   CAT   His   CGT   Arg
CTC   Leu   CCC   Pro   CAC   His   CGC   Arg
CTA   Leu   CCA   Pro   CAA   Gln   CGA   Arg
CTG   Leu   CCG   Pro   CAG   Gln   CGG   Arg

ATT   Ile   ACT   Thr   AAT   Asn   AGT   Ser
ATC   Ile   ACC   Thr   AAC   Asn   AGC   Ser
ATA   Ile   ACA   Thr   AAA   Lys   AGA   Arg
ATG   Met   ACG   Thr   AAG   Lys   AGG   Arg

GTT   Val   GCT   Ala   GAT   Asp   GGT   Gly
GTC   Val   GCC   Ala   GAC   Asp   GGC   Gly
GTA   Val   GCA   Ala   GAA   Glu   GGA   Gly
GTG   Val   GCG   Ala   GAG   Glu   GGG   Gly

Interpretation of dN/dS ratios (assuming synonymous sites are neutral):

dN/dS = 1: No constraint on protein sequence, i.e. nonsynonymous changes are neutral.
dN/dS < 1: Functional constraint on the protein sequence, i.e. nonsynonymous mutations are deleterious.
dN/dS > 1: Change in the function of the protein sequence, i.e. nonsynonymous mutations are adaptive.

Rapidly Evolving Genes

dN increased by positive selection
dN decreased by negative selection
Problem: dN may be influenced by both and still be less than dS
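The definitions of dN and dS can be illustrated with a simplified counting sketch in Python. It is not the method implemented in PAML or HyPhy. It rests on several labelled assumptions: each differing codon is counted as a single change (multi-hit codons are not decomposed into pathways), changes to stop codons count as nonsynonymous, synonymous sites are averaged over the two sequences, and no multiple-hit correction is applied, so the results are uncorrected proportions rather than dN and dS proper. All function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon: for each position,
    the fraction of the three possible changes that keep the amino acid."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if GENETIC_CODE[mutant] == GENETIC_CODE[codon]:
                total += 1.0 / 3.0
    return total


def codons(seq):
    """Split an in-frame coding sequence into codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def count_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    syn = nonsyn = 0
    for a, b in zip(codons(seq_a), codons(seq_b)):
        if a == b:
            continue
        if GENETIC_CODE[a] == GENETIC_CODE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def uncorrected_rates(seq_a, seq_b):
    """Return (pN, pS): changes per nonsynonymous and per synonymous site.
    Raises ZeroDivisionError if the sequences contain no synonymous sites."""
    ca, cb = codons(seq_a), codons(seq_b)
    s_sites = (sum(map(synonymous_sites, ca)) + sum(map(synonymous_sites, cb))) / 2
    n_sites = 3 * len(ca) - s_sites
    syn, nonsyn = count_changes(seq_a, seq_b)
    return nonsyn / n_sites, syn / s_sites


if __name__ == "__main__":
    a, b = "TTTGGGAAA", "TTCGGGAGA"  # hypothetical toy sequences
    print(count_changes(a, b))
    pn, ps = uncorrected_rates(a, b)
    print(f"pN = {pn:.3f}, pS = {ps:.3f}, pN/pS = {pn / ps:.3f}")
```

Because far more sites are nonsynonymous than synonymous, equal raw counts of the two kinds of change still give a ratio well below 1, which is why the counts must be normalized by the number of sites.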

Nayak et al. 2005

Branch Model (dN/dS) (rate heterogeneity)

15 copies in human; copy number varies in other primates

Johnson et al. 2001

Site Model (dN/dS)

● Positive selection on the egg receptor (VERL) for abalone sperm lysin.
● VERL and lysin act as a lock and key for fertilization.
● Co-evolution by sexual selection, conflict or microbial attack.

Galindo et al. 2003

Sites – methods
Maximum Parsimony (Suzuki)
Maximum Likelihood (PAML, HyPhy)

Models of molecular evolution

Key Assumptions:

➔Alignments are correct
➔Sites are independent
➔Mutational & selection parameters

Alignment Accuracy & Coverage

[Figure: alignment accuracy and coverage with no indels vs. indels, under no constraint and constraint (Pollard et al. 2004)]

Alignment differences gp120 HIV/SIV

[Figure: ClustalW alignment vs. PRANK alignment (phylogeny aware)]

Detection of positive selection depends on the alignment

Markova-Raina and Petrov (2011)

Mutation rate variation

● Transitions vs. transversions – transitions occur twice as often as transversions
● CpG – spontaneous deamination of 5-methylcytosine results in thymine and ammonia, 20x higher rate of mutation
● 28% of mutations are transitions at CpG sites but only 3.5% of sites are CpG
● Genomic position (5-10%)
● Age, sex (2 – 10 fold)
● Repeats (polynucleotides, microsatellites)

Types of Mutations - WGS

Single nucleotide
Transpositions
Duplications/Deletions
Rearrangements

G/C to A/T 2.9-fold higher than reverse! Predicts 74% AT content
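The 74% figure follows from a simple equilibrium argument, assuming mutation is the only force acting on base composition: AT content stops changing when the flux from G/C to A/T equals the flux back, so f_AT = u / (u + v), where u and v are the G/C to A/T and A/T to G/C rates. A minimal Python check (the function name is illustrative):

```python
def equilibrium_at_content(gc_to_at, at_to_gc):
    """Equilibrium AT fraction under mutation pressure alone.

    At equilibrium (1 - f_AT) * gc_to_at == f_AT * at_to_gc,
    which rearranges to f_AT = gc_to_at / (gc_to_at + at_to_gc).
    Only the ratio of the two rates matters.
    """
    return gc_to_at / (gc_to_at + at_to_gc)


if __name__ == "__main__":
    # G/C to A/T is 2.9-fold higher than the reverse
    print(f"{equilibrium_at_content(2.9, 1.0):.0%}")
```

A genome whose AT content is well below this prediction suggests that some other force, such as selection or biased gene conversion, opposes the mutational bias.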

Substitution rate as a function of GC content

BRCA1 sliding window Ka/Ks analysis

Codon Bias

Measures of Codon Bias

CAI – codon adaptation index, based on the usage of a codon relative to the most abundant codon for the same amino acid

Fop – frequency of the optimal codon

ENC – effective number of codons based on the deviation from equal usage

Explanation of Codon Bias

Bias towards GC ending codons that is not found in adjacent noncoding regions

Correlates with highly expressed genes

Correlates with tRNA abundance

Explanations: translational accuracy/speed, protein misfolding

Codon Bias is correlated with Synonymous Substitution Rate

Codon Bias correlation depends on distance

Codon models

αs = synonymous rate

βs = nonsynonymous rate

R = tv/ts

πny = frequency of target nucleotide n in codon y

Binding site models

● Sequence ~ binding affinity (Schneider et al. 1986, Berg and von Hippel 1987)

● Binding affinity ~ fitness (Gerland and Hwa 2002, Sengupta et al. 2002)

● Fitness ~ substitution rate (Moses et al. 2004)

Kimura 1962

Bulmer 1991

Moses et al. 2004

Biased Gene Conversion

AT to GC bias

Recombination occurs in hotspots
Recombination hotspots evolve rapidly
Biased gene conversion occurs in bursts (non-equilibrium)

Recombination and predicted equilibrium GC frequency

Correlomics

r (Interaction ~ Fitness) = 0.15, P = 3.4x10^-13
r (Fitness ~ Evolutionary rate) = -0.13, P = 4.3x10^-7
r (Interactions ~ Evolutionary rate) = -0.24, P = 0.002

Spurious (strong) correlations

Significance and effect size

Statistical significance (a low P value) measures how certain we are that a given effect exists. Effect size measures the magnitude of an effect.
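One common way to read a correlation coefficient as an effect size is its square, r², the share of variance in one variable accounted for by a linear relationship with the other. A small Python sketch applying this to the correlations reported under Correlomics (the dictionary and function names are illustrative):

```python
# Correlation coefficients quoted on the Correlomics slide.
CORRELATIONS = {
    "Interaction ~ Fitness": 0.15,
    "Fitness ~ Evolutionary rate": -0.13,
    "Interactions ~ Evolutionary rate": -0.24,
}


def variance_explained(r):
    """Squared correlation coefficient: fraction of variance explained."""
    return r * r


if __name__ == "__main__":
    for label, r in CORRELATIONS.items():
        print(f"{label}: r = {r:+.2f}, r^2 = {variance_explained(r):.4f}")
```

None of these correlations explains more than a few percent of the variance, however small the associated P values are.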

r = 0.10, P < 1e-16

A squared correlation coefficient below 0.1 (r < 0.3) means the effect is pretty much non-existent, regardless of how low the P value is.

Claus Wilke, UT-Austin (Blog 2013)

predicts the

Polymorphisms vs Divergence

P ( SNP | conserved amino acid )

P ( SNP | conserved factor binding site )

Methods for Predicting Human Disease Mutations

2.2% of human disease alleles are WT in mouse

[Figure: disease mutations and conserved sites]

Method     True Positives   False Positives
SIFT       69%              20%
PolyPhen   69%              9%

SIFT: Ng P, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863-874.
PolyPhen: Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov A et al. (2001) Prediction of deleterious human alleles. Hum Mol Genet 10: 591-597.

Likelihood Ratio Test

32 vertebrate species
18,993 alignments
dS = 12.2 subs/site

human        GYCF G AQEQ
chimp        GYCF G AQEQ
orangutan    GYCF G AQEQ
rhesus       GYCF G AQEQ
bushbaby     GYCF G VQEQ
treeshrew    GYCF G VQEQ
rat          GYCF G VQEQ
mouse        GYCF G VQEQ
squirrel     GYCF G VQEQ
guineapig    GYCF G VQEQ
dog          GYCF G IQEQ
cat          GYCF G VQEQ
horse        GYCF G VQEQ
cow          GYCF G VQEQ
microbat     GYCF G VQEQ
armadillo    GYCF G VQEQ
opossum      GYCF G VAEQ
platypus     GYGF G EQEQ
frog         GFCF G ETKQ
tetraodon    GCCF G NLEE
stickleback  GYCF G DGEE
medaka       GYCF G DLEE
zebrafish    GYCF G DLEE

Tons of Deleterious Mutations

Chun and Fay (2009)

Most Deleterious SNPs are Rare

Three Methods Applied to Venter

Method     Tested (%)     Deleterious (%)
SIFT       5,401 (72%)    890 (16%)
PolyPhen   6,746 (90%)    555 (8.2%) probably, 768 (11%) possibly
LRT        5,645 (75%)    796 (14%)

7,534 High Quality NSN SNPs in Venter Genome

Disturbing Overlap Among Three Methods

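[Figure: Venn diagram of SNPs predicted deleterious by LRT, PolyPhen and SIFT; the seven regions are labelled 28%, 3%, 10%, 5%, 18%, 30% and 6%]

7,534 NSN SNPs in Venter Genome
1,735 SNPs predicted deleterious by any one of the three methods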

Human disease associated SNPs

21,429 disease-associated SNPs (2,113 publications)
5,270 in HapMap3
Chen et al. 2010

Conservation of GWAS SNPs

[Figure: conservation of high-confidence GWAS SNPs (Dudley et al. 2012)]

GWAS SNPs OR vs. Conservation

Dudley et al. (2012)

Phylomedicine

Kumar et al. (2011) Phylomedicine: an evolutionary telescope to explore and diagnose the universe of disease mutations. Trends in Genetics 27:377-386