Viral epidemiology and the coalescent

Marc A. Suchard

UCLA

SISMID University of Washington Coalescent theory Key to population genetics

Population size Structure Migration Selection

Tree-based population genetics

The coalescent . . .

Models the ancestral relationships of a random sample of individuals taken from a large background population.

Describes a probability distribution on ancestral trees given a population history

Covers ancestral trees, not sequences, and its simplest form assumes neutral evolution.

Population history inference

A population history is often called a demographic history.

Inference: learn about changes in population size through time

Applications include:

- Reconstructing infectious disease epidemics
- Investigating viral dynamics within hosts
- Using viral sequences as genetic markers for their hosts and host demographics
- Identifying population bottlenecks caused by:
  - Changes in climate/environment? (aridification, ice ages)
  - Competition with other species? (humans)
  - Transmission between hosts in viruses

Information pipe-line

Coalescent inference

Coalescent theory

Simple model of reproduction

For a randomly chosen pair of individuals, they share a common ancestor (coalesce) in the previous generation with probability 1/N.

Wright-Fisher reproduction model

- A constant population size of N individuals (usually 2N)
- Each new (non-overlapping) generation "chooses" its parents from the previous generation at random with replacement
- No geographic/social structure, no recombination, no selection
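The 1/N pair-coalescence probability can be checked by simulation. The following is a minimal sketch, assuming haploid individuals labelled 0 to N-1; the function name `pair_coalescence_frequency` and the trial count are illustrative choices, not part of the original slides.

```python
import random

def pair_coalescence_frequency(N, trials, seed=0):
    """Estimate how often two individuals pick the same parent in one generation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each individual chooses its parent uniformly, with replacement.
        parent_a = rng.randrange(N)
        parent_b = rng.randrange(N)
        hits += parent_a == parent_b
    return hits / trials

freq = pair_coalescence_frequency(N=10, trials=200_000)
print(freq)  # should be close to 1/N = 0.1
```

With N = 10, as in the sample-tree slide, the estimated frequency should settle near 0.1 as the number of trials grows.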

Sample tree from a Wright-Fisher population

A sample tree of 3 sequences from a population of N = 10

Kingman discrete-time coalescent

- 2 individuals coalesce in 1 generation w.p. $\frac{1}{N}$
- 2 individuals coalesce in $j$ generations w.p. $\frac{1}{N}\left(1 - \frac{1}{N}\right)^{j-1}$
- $k$ individuals coalesce in $j$ generations w.p. $\binom{k}{2}\frac{1}{N}\left(1 - \binom{k}{2}\frac{1}{N}\right)^{j-1}$
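These probabilities form a geometric distribution with per-generation success probability $\binom{k}{2}/N$. The sketch below evaluates them numerically and checks that they sum to one, with mean close to $N/\binom{k}{2}$. The function name `coalescence_pmf` and the values of N and k are assumptions for illustration, and the formula itself is the first-order approximation used on the slide (it ignores multiple coalescences within one generation).

```python
from math import comb

def coalescence_pmf(k, j, N):
    """P(first coalescence among k lineages occurs exactly j generations back)."""
    p = comb(k, 2) / N          # per-generation coalescence probability
    return p * (1.0 - p) ** (j - 1)

N, k = 1000, 5
p = comb(k, 2) / N
total = sum(coalescence_pmf(k, j, N) for j in range(1, 5000))
mean = sum(j * coalescence_pmf(k, j, N) for j in range(1, 5000))
print(total, mean, 1 / p)       # total near 1, mean near 1/p
```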

Kingman continuous-time coalescent

- Let $t \sim j$ define a rescaled time in the past, and
- Assume a sample of $n$ individuals with $n \ll N$
- Then, the waiting time $u_k$ for $k$ individuals to have $k - 1$ ancestors satisfies

$$P(u_k \le t) = 1 - e^{-\binom{k}{2}\frac{t}{N}}$$

- Exponential (memoryless) → defines a continuous-time Markov chain
- $E(u_k) = \frac{2N}{k(k-1)}$

Kingman (1982) J Appl Prob 19, 27-42
Kingman (1982) Stoch Proc Appl 13, 235-48

Kingman coalescent: CTMC

- The number of ancestral lineages decreases by one at each coalescence
- The process continues until the most recent common ancestor (MRCA) is reached
- What is the expected time to MRCA?

$$E\left(\sum_{k=2}^{n} u_k\right) = \sum_{k=2}^{n} E(u_k) = \sum_{k=2}^{n} \frac{2N}{k(k-1)} = 2N\left(1 - \frac{1}{n}\right)$$
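The closed form $2N(1 - 1/n)$ can be compared against a direct simulation of the exponential waiting times. This is a sketch under the constant-size Kingman model; the helper names `expected_tmrca` and `simulate_tmrca`, and the choice n = 10, N = 1000, are hypothetical.

```python
import random

def expected_tmrca(n, N):
    """Sum of E(u_k) = 2N / (k(k-1)) for k = 2..n."""
    return sum(2.0 * N / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, N, rng):
    """Draw one tree height by summing exponential waiting times."""
    height = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2.0 * N)   # binom(k, 2) / N
        height += rng.expovariate(rate)
    return height

n, N = 10, 1000
rng = random.Random(1)
draws = [simulate_tmrca(n, N, rng) for _ in range(20_000)]
print(expected_tmrca(n, N))        # analytic value 2N(1 - 1/n)
print(sum(draws) / len(draws))     # Monte Carlo estimate
```

The two printed values should agree up to Monte Carlo error, and the analytic value approaches 2N as n grows.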

Note: $E(t_{\mathrm{MRCA}})/2 < E(u_2)$, so the last two lineages account for more than half of the expected tree height.

Kingman coalescent: probability distribution

Given a known tree $T$ from a sample of individuals from a population, the coalescent allows us to calculate the probability $P(T \mid N)$.

Or, the inverse problem: learn about $N$ from $T$.
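Both directions can be sketched for a constant population size. Assuming the tree is summarized by its waiting times $u_k$ in generations (the topology term does not depend on $N$), the log-likelihood is a sum of exponential log-densities and is maximized in closed form. The function names and the example waiting times are hypothetical.

```python
from math import comb, log

def log_likelihood(waits, N):
    """log P(T | N) up to the topology term, which does not depend on N.

    waits maps k (number of lineages) to the waiting time u_k in generations.
    """
    total = 0.0
    for k, u in waits.items():
        rate = comb(k, 2) / N
        total += log(rate) - rate * u
    return total

def mle_population_size(waits):
    """Closed-form maximiser of log_likelihood in N."""
    return sum(comb(k, 2) * u for k, u in waits.items()) / len(waits)

# Hypothetical inter-coalescent times for a sample of n = 4 sequences.
waits = {4: 150.0, 3: 400.0, 2: 900.0}
N_hat = mle_population_size(waits)
print(N_hat, log_likelihood(waits, N_hat))
```

Evaluating `log_likelihood` at values of N on either side of `N_hat` gives smaller values, which is the sense in which the tree carries information about N.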

N governs the rate of coalescence

Large and small population sizes

Quiz

Which population is large?

Coalescent assumptions

The major weakness of the coalescent lies in its simplifying assumptions

Neutral evolution? Reproductive ? Panmictic population?

But, does this matter?

Solution: Effective population size

Consider an abstract parameter, the effective population size Ne

The Ne of a real biological population is the size of an idealized Wright-Fisher population that loses or gains genetic diversity at the same rate

Ne is generally smaller than the census population

The coalescent Ne provides the time-to-ancestry distribution for a sample tree from a real population

Variable population size coalescent

Growing population

Changes in Ne reflect changes in the census population size

Parametric models of N(t) through time

The standard coalescent can be extended to accommodate various scenarios of demographic change through time

However, only a few parametric forms (constant, exponential, logistic) are available. Piecewise combinations of these can be used.

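Review: continuous-time coalescent

Time measured in generation units

- $N = \text{const}$ → $u_k \sim \text{Exp}\left[\binom{k}{2}\frac{1}{N}\right]$
- $N = N(t)$ → $\Pr(u_k > t \mid t_{k+1}) = \exp\left[-\binom{k}{2}\int_{t_{k+1}}^{t + t_{k+1}} \frac{du}{N(u)}\right]$
- The $u_k$ are not independent any more

[Figure: genealogies with coalescent times $t_1, \dots, t_4$ and waiting times $u_2, u_3, u_4$ on the time axis, shown for two demographic models]

- $N(t) = N$: constant population size
- $N(t) = N e^{-100t}$: exponential growth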
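For exponential growth the integral in the survival function has a closed form, so a waiting time can be drawn by inversion. The sketch below assumes $N(u) = N_0 e^{-ru}$ with $u$ measured into the past; the parameter names `N0` and `r`, the function names, and the numeric values are illustrative and not taken from the slides.

```python
import random
from math import comb, exp, log1p

def waiting_time_exponential_growth(k, start, N0, r, e):
    """Waiting time until k lineages coalesce, starting at time `start` in the past.

    Solves binom(k, 2) * integral_{start}^{start + t} du / N(u) = e for t,
    with N(u) = N0 * exp(-r * u) and e a unit-rate exponential draw.
    """
    c = comb(k, 2)
    return log1p(e * N0 * r * exp(-r * start) / c) / r

def simulate_waits(n, N0, r, rng):
    """Draw u_n, ..., u_2 sequentially; each depends on the elapsed time."""
    t, waits = 0.0, {}
    for k in range(n, 1, -1):
        u = waiting_time_exponential_growth(k, t, N0, r, rng.expovariate(1.0))
        waits[k] = u
        t += u
    return waits

waits = simulate_waits(n=5, N0=1000.0, r=0.01, rng=random.Random(2))
print(waits)
```

Because each draw uses the elapsed time `t`, the waiting times are no longer independent, in contrast to the constant-size case.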

Piecewise constant demographic model

Isochronous data

[Figure: genealogy with 2, 3, 4, 5 lineages during the waiting times $u_2, u_3, u_4, u_5$, bounded by the times $t_1, t_2, t_3, t_4$ and $t_5 = 0$]

$N_e(t) = \theta_k$ for $t_k < t \le t_{k-1}$.

$u_2, \dots, u_n$ are independent:

$$\Pr(u_k \mid \theta_k) = \frac{k(k-1)}{2\theta_k} e^{-\frac{k(k-1) u_k}{2\theta_k}}$$

$$\Pr(F \mid \theta) \propto \prod_{k=2}^{n} \Pr(u_k \mid \theta_k)$$

Equivalent to estimating exponential mean from one observation.

Need further restrictions to estimate all effective pop sizes θ!
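Maximizing each factor $\Pr(u_k \mid \theta_k)$ separately gives $\hat{\theta}_k = k(k-1)u_k/2$, one estimate per waiting time, which is why the unrestricted estimate is so noisy. A minimal sketch with hypothetical waiting times follows; the function name `classic_skyline` is an illustrative label.

```python
def classic_skyline(waits):
    """Per-interval estimate theta_k = k(k-1) u_k / 2 (one observation each)."""
    return {k: k * (k - 1) * u / 2.0 for k, u in waits.items()}

# Hypothetical inter-coalescent times u_5, ..., u_2 (isochronous sample).
waits = {5: 0.02, 4: 0.05, 3: 0.10, 2: 0.60}
theta_hat = classic_skyline(waits)
print(theta_hat)
```

Each entry of `theta_hat` rests on a single exponential observation, so some form of grouping or smoothing across intervals is needed before the trajectory is interpretable.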

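Piecewise constant demographic model

Heterochronous data

[Figure: genealogy with serially sampled tips; number of lineages per sub-interval: 2 1 3 4 3 4 3; the waiting times $u_2, \dots, u_5$ are split into sub-intervals $w_{20}, w_{30}, w_{40}, w_{41}, w_{50}, w_{51}, w_{52}$]

$w_{20}, \dots, w_{n j_n}$ are independent:

$$\Pr(w_{k0} \mid \theta_k) = \frac{n_{k0}(n_{k0}-1)}{2\theta_k} e^{-\frac{n_{k0}(n_{k0}-1) w_{k0}}{2\theta_k}}$$

$$\Pr(w_{kj} \mid \theta_k) = e^{-\frac{n_{kj}(n_{kj}-1) w_{kj}}{2\theta_k}}, \quad j > 0$$

$$\Pr(F \mid \theta) \propto \prod_{k=2}^{n} \prod_{j=0}^{j_k} \Pr(w_{kj} \mid \theta_k)$$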

Equivalent to estimating exponential mean from one observation.

Need further restrictions to estimate all effective pop sizes θ!
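The heterochronous likelihood can be evaluated interval by interval: every sub-interval contributes a "no coalescence" survival term, and sub-intervals that end in a coalescence add the log rate. The tuple layout and the example values below are assumptions made for illustration.

```python
from math import log

def hetero_log_likelihood(intervals):
    """Sum of log Pr(w_kj | theta_k) over all sub-intervals.

    Each interval is (lineages, duration, ends_in_coalescence, theta).
    """
    total = 0.0
    for lineages, duration, ends_in_coalescence, theta in intervals:
        rate = lineages * (lineages - 1) / (2.0 * theta)
        total -= rate * duration              # no coalescence during the interval
        if ends_in_coalescence:
            total += log(rate)                # density of the coalescent event
    return total

# Hypothetical sub-intervals: one ends at a sampling event, two at coalescences.
intervals = [(2, 0.3, False, 1.0), (3, 0.2, True, 1.0), (2, 0.5, True, 2.0)]
value = hetero_log_likelihood(intervals)
print(value)
```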

Previous priors to restrict θ

Strimmer and Pybus (2001)
- Make $N_e(t)$ constant across some inter-coalescent times
- Group inter-coalescent intervals with AIC

Drummond et al. (2005)
- Multiple change-point model with fixed number of change-points
- Change-points allowed only at coalescent events
- Joint estimation of phylogenies and population dynamics

Opgen-Rhein et al. (2005)
- Multiple change-point model with random number of change-points
- Change-points allowed anywhere in interval $(0, t_1]$
- Posterior is approximated with rjMCMC

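Smoothing priors - Gaussian Markov random fields

Go to the log scale: $x_k = \log \theta_k$

$$\Pr(x \mid \omega) \propto \omega^{(n-2)/2} \exp\left[-\frac{\omega}{2} \sum_{k=1}^{n-2} \frac{1}{d_k} (x_{k+1} - x_k)^2\right]$$

[Figure: chain of field values $x_1, x_2, x_3, x_4, \dots, x_{n-2}, x_{n-1}$ linked by distances $d_1, d_2, d_3, \dots, d_{n-2}$]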

Weighting Schemes

1. Skyride: weights $d_k$ determined by tree (in relative time)

2. Skygrid: $d_k$ on a regular grid in absolute time + multi-locus
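Whatever weighting scheme supplies the $d_k$, the smoothing prior penalizes squared differences between neighbouring log population sizes. The sketch below evaluates the unnormalized log density $\log \Pr(x \mid \omega)$; the function name `gmrf_log_prior` and the example vectors are hypothetical, and the weights are simply set to 1.

```python
from math import log

def gmrf_log_prior(x, d, omega):
    """Unnormalised log Pr(x | omega) for first-order GMRF smoothing."""
    if len(d) != len(x) - 1:
        raise ValueError("need one weight d_k per adjacent pair of x")
    penalty = sum((x[i + 1] - x[i]) ** 2 / d[i] for i in range(len(d)))
    return 0.5 * len(d) * log(omega) - 0.5 * omega * penalty

x_smooth = [0.0, 0.1, 0.2, 0.3]     # slowly varying log population sizes
x_rough = [0.0, 1.0, -1.0, 1.0]     # erratic trajectory
d = [1.0, 1.0, 1.0]
print(gmrf_log_prior(x_smooth, d, omega=2.0), gmrf_log_prior(x_rough, d, omega=2.0))
```

The smooth trajectory receives the higher prior density, and a larger precision ω widens the gap between the two.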

$$\Pr(x, \omega) = \Pr(x \mid \omega)\Pr(\omega)$$

$\Pr(\omega) \propto \omega^{\alpha-1} e^{-\beta\omega}$, diffuse prior with $\alpha = 0.01$, $\beta = 0.01$

MCMC algorithm

$$\Pr(G, Q, x \mid D) \propto \Pr(D \mid G, Q) \Pr(Q) \Pr(G \mid x) \Pr(x)$$

Updating Population Size Trajectory
- Use fast GMRF sampling (Rue et al., 2001, 2004)
- Draw $\omega^*$ from an arbitrary univariate proposal distribution
- Use Gaussian approximation of $\Pr(x \mid \omega^*, G)$ to propose $x^*$
- Jointly accept/reject $(\omega^*, x^*)$ in Metropolis-Hastings step

Object-Oriented Reality?

BEAST = Bayesian Evolutionary Analysis Sampling Trees
- $\Pr(G \mid x, D, Q)$ - sampled by BEAST
- $\Pr(Q \mid G, D)$ - sampled by BEAST

Simulation: constant population size

[Figure: six panels (Classical Skyline Plot, ORMCP Model, BEAST MCP, Uniform GMRF, Time-Aware GMRF, BEAST GMRF); x-axis: Time (Past to Present); y-axis: Effective Population Size]

Simulation: exponential growth

[Figure: six panels (Classical Skyline Plot, ORMCP Model, BEAST MCP, Uniform GMRF, Time-Aware GMRF, BEAST GMRF); x-axis: Time (Past to Present); y-axis: Effective Population Size]

Simulation: exponential growth with bottleneck

[Figure: six panels (Classical Skyline Plot, ORMCP Model, BEAST MCP, Uniform GMRF, Time-Aware GMRF, BEAST GMRF); x-axis: Time (Past to Present); y-axis: Effective Population Size]

Accuracy in simulations

$$\text{Percent error} = \int_0^{\text{TMRCA}} \frac{|\hat{N}_e(t) - N_e(t)|}{N_e(t)}\, dt \times 100$$
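On a time grid, the percent error can be approximated with the trapezoid rule. The sketch below assumes the estimated and true trajectories are evaluated at the same grid points between 0 and the TMRCA; the function name `percent_error` and the example trajectories are hypothetical, not values from the simulations.

```python
def percent_error(times, estimate, truth):
    """Trapezoid-rule approximation of the integrated relative error, times 100."""
    rel = [abs(e - t) / t for e, t in zip(estimate, truth)]
    area = 0.0
    for i in range(len(times) - 1):
        area += 0.5 * (rel[i] + rel[i + 1]) * (times[i + 1] - times[i])
    return 100.0 * area

times = [0.0, 0.5, 1.0]          # grid from the present (0) back to the TMRCA
truth = [1.0, 1.0, 1.0]          # constant N_e(t)
estimate = [1.1, 1.0, 0.9]       # hypothetical estimated trajectory
err = percent_error(times, estimate, truth)
print(err)
```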

Table: Percent error in simulations. We compare percent errors, defined in equation (1), for the Opgen-Rhein multiple change-point (ORMCP), uniform and fixed-tree time-aware Gaussian Markov random field (GMRF) smoothing, BEAST multiple change-point (MCP) model, and BEAST GMRF smoothing.

Model             Constant  Exponential  Bottleneck
ORMCP                 14.0          1.7         7.4
Uniform GMRF          32.8          1.5         5.9
Time-Aware GMRF        2.8          1.2         4.8
BEAST MCP             38.2          1.6         5.2
BEAST GMRF             1.7          1.0         5.4

Multi-locus performance

Adding loci improves dynamics recovery under the Skygrid

[Figure: four panels (One Locus, Two Loci, Five Loci, Ten Loci); x-axis: Time (Past to Present), from −10 to 0; y-axis: Log Effective Population Size]

Loci  Relative % error
1     1.23
2     1.66
5     1.90
10    2.41

Table: Relative error of extended Bayesian skyline plot (EBSP) to skygrid under exponential growth

Figure 3: Simulation for a population that experiences exponential growth followed by a decline. See Figure 1 for the legend explanation. As in Figure 2, the trajectories are constant (and not informative) during time range (−10, −7) which precedes the greatest root height of the trees used to generate the data sets. The plots illustrate the improvement in correctly recovering past population trends by incorporating data from additional loci.

GMRF Precision Prior Sensitivity

- $\omega$ - GMRF precision, controls smoothness
- Usually $\Pr(\omega \mid D)$ is sensitive to perturbations of $\Pr(\omega)$
- Not in our coalescent model!

[Figure: two panels. GMRF Precision Prior and Posterior: prior and posterior density against log τ. GMRF Precision Sensitivity to Prior: posterior of log τ against prior mean of τ (1, 2, 5, 10, 100)]

HCV Epidemics in Egypt

- Random population sample
- No sign of population sub-structure
- Parenteral antischistosomal therapy (PAT) was practiced from 1920s to 1980s
- Bayes Factor 12,880 in favor of constant population size prior to 1920

[Figure: four panels (Estimated Genealogy, BEAST GMRF, Unconstrained Fixed-Tree GMRF, Constrained Fixed-Tree GMRF); x-axis: Time (Years Before 1993); y-axis: Scaled Effective Pop. Size; the start of PAT is marked]

Influenza Intra-Season Population Dynamics

[Figure: two rows of panels, 2001-2002 Season and 2003-2004 Season; x-axis: Time (Weeks before 03/21/02) and Time (Weeks before 02/05/04), respectively; y-axis: Scaled Effective Pop. Size; October 1 is marked]

New York state hemagglutinin sequences serially sampled (Ghedin et al., 2005)

Summary

- Genealogies inform us about population size trajectories
- Prior restrictions are necessary for non(semi)-parametric estimation of $N_e(t)$
- Smoothing can be imposed by GMRF priors

Software: skyride and skygrid

- Implemented as coalescent priors in BEAST
- Exploit approximate Gibbs sampling
- Faster convergence? Better mixing?

References:
Minin et al. (2008) Molecular Biology & Evolution, 25, 1459–1471.
Gill et al. (2013) Molecular Biology & Evolution, 30, 713–724.

Active ideas: GMRFs are highly generalizable

Hierarchical Modeling
- Incorporate multiple loci simultaneously
- Pool information for statistical power
- Flu genes display similar (not equal) dynamics
- No need for strict equality

Introducing Covariates
- Augment field at fixed observation times
- Formal statistical testing for:
  - External factors (environment, drug tx)
  - Population dynamics (bottle-necks, growth)

Gill et al. (in press) Systematic Biology.
