Genetics: Early Online, published on November 29, 2017 as 10.1534/genetics.117.300551

1 Relaxed selection during a recent human

2 expansion 3 4 5 Peischl, S.1,2,3*, Dupanloup, I.1,2,4*, Foucal, A.1,2, Jomphe, M.5, Bruat, V.6, Grenier, J.-C.6, Gouy, A. 1,2, 6 Gilbert, K.J. 1,2, Gbeha, E.6, Bosshard, L.1,2, Hip-Ki, E.6, Agbessi,M.6, Hodgkinson, A.6,7, Vézina, H.5, 7 Awadalla, P.6,8, and Excoffier, L.1,2 8 1 9 CMPG, Institute of Ecology an Evolution, University of Berne, 3012 Berne, Switzerland 2 10 Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland 3 11 Interfaculty Bioinformatics Unit, University of Berne, 3012 Berne, Switzerland 12 4 Swiss Integrative Center for Human Health SA, 1700 Fribourg, Switzerland 5 13 Balsac Project, University of Quebec at Chicoutimi, Saguenay, Canada 6 14 Hôpital Ste-Justine, University of Montréal, Montréal, Canada 7 15 Department of Medical and Molecular Genetics, Guy's Hospital, King’s College London, London, UK 8 16 Ontario Institute for Research, Department of Molecular Genetics, University of Toronto, 17 Toronto Canada 18 19 Running title: Relaxed selection during a recent human expansion 20 Keywords: range expansion, Quebec, mutation load, genetic drift 21 * Equal contribution

22 Correspondence 23 Stephan Peischl & Laurent Excoffier 24 CMPG, Institute of Ecology and Evolution 25 University of Berne 26 Baltzerstrasse 6 27 3012 Berne, Switzerland 28 Email: [email protected] 29 [email protected]

Copyright 2017. 30 Abstract 31 Humans have colonized the planet through a series of range expansions, which deeply impacted 32 genetic diversity in newly settled areas and potentially increased the frequency of deleterious mutations 33 on expanding wave fronts. To test this prediction, we studied the genomic diversity of French Canadians 34 who colonized Quebec in the 17th century. We used historical information and records from ~4000 35 ascending genealogies to select individuals whose ancestors lived mostly on the colonizing wave front 36 and individuals whose ancestors remained in the core of the settlement. Comparison of exomic diversity 37 reveals that i) both new and low frequency variants are significantly more deleterious in front than in 38 core individuals, ii) equally deleterious mutations are at higher frequencies in front individuals, and iii) 39 front individuals are two times more likely to be homozygous for rare very deleterious mutations 40 present in Europeans. These differences have emerged in the past 6-9 generations and cannot be 41 explained by differential inbreeding, but are consistent with relaxed selection mainly due to higher rates 42 of genetic drift on the wave front. Demographic inference and modeling of the evolution of rare variants 43 suggest lower effective size on the front, and lead to an estimation of selection coefficients that increase 44 with conservation scores. Even though range expansions had a relatively limited impact on the overall 45 fitness of French Canadians, they could explain the higher prevalence of recessive genetic diseases in 46 recently settled regions of Quebec.

2 47 Introduction 48 The impact of recent demographic changes or single bottlenecks on the overall fitness of

49 populations is still highly debated (LOHMUELLER et al. 2008; FU et al. 2014; LOHMUELLER 2014; SIMONS et al.

50 2014; DO et al. 2015; HENN et al. 2015; GRAVEL 2016; HENN et al. 2016). Simulation and theoretical

51 approaches suggest that populations on expanding wave fronts experience strong genetic drift (PEISCHL

52 et al. 2013; PEISCHL et al. 2015), leading to relaxed selection (GRAVEL 2016) resulting in the build-up of

53 expansion load (PEISCHL et al. 2013). The main reason for this expansion load is that low population 54 densities and strong genetic drift at the wave front promote the genetic surfing of both neutral and

55 selected variants (PEISCHL et al. 2013). This relatively inefficient selection on the wave front leads to the

56 preservation of many new mutations, unless very deleterious (PEISCHL et al. 2013), and it shifts the site 57 frequency spectrum (SFS) of standing variants, thus increasing the contribution of (partially) recessive

58 variants to mutation load (PEISCHL AND EXCOFFIER 2015). After a range expansion, both a decrease of 59 diversity and an increase in the recessive mutation load with distance from the source is expected

60 (KIRKPATRICK AND JARNE 2000; PEISCHL AND EXCOFFIER 2015). This pattern has recently been shown to occur 61 in non-African human populations, where a gradient of recessive load has been observed between

62 North Africa and the Americas (HENN et al. 2016). Whereas the bottleneck out of Africa that started

63 about 50 Kya (e.g. GRAVEL et al. 2011) must have created a mutation load, the exact dynamics of this 64 load increase due to the expansion process is still unknown. It is also unclear if a much more recent 65 expansion could have had a significant impact on the genetic load of populations. 66 The settlement of Quebec can be considered as a series of demographic and spatial expansions 67 following initial bottlenecks. The majority of the 6.5 million French Canadians living in Quebec are the

68 descendants of about 8,500 founder immigrants of mostly French origin (CHARBONNEAU et al. 2000;

69 LABERGE et al. 2005). This French immigration started with the founding of a few settlements along the

70 Saint Laurence river at the beginning of the 17th century (CHARBONNEAU et al. 2000). Most new 71 settlements were restricted to the Saint Laurence Valley until the 19th century, after which remote 72 territories began to be colonized. Bottlenecks and serial founder effects occurring during range 73 expansions are thought to have profoundly affected patterns of genetic diversity, leading to large

74 frequency differences when compared to the French source population (LABERGE et al. 2005). Even 75 though the French Canadian population has expanded 700 fold in about 300 years, its genetic diversity is 76 actually not what is expected in a single panmictic, exponentially growing population, as allele

77 frequencies have drifted much more than expected in a fast growing population (HEYER 1995; HEYER

78 1999). It has been shown that genetic surfing (KLOPFSTEIN et al. 2006; PEISCHL et al. 2016) has occurred

79 during the recent colonization of Saguenay-Lac St-Jean area (MOREAU et al. 2011), and that the fertility of

3 80 women on the wave front was 25% higher than of those living in the core of the settlement, giving them 81 more opportunity to transmit their genes to later generations. In addition, female fertility was found to

82 be heritable on the front but not on the core (MOREAU et al. 2011), a property that further contributes to

83 lower the effective size of the front population (AUSTERLITZ AND HEYER 1998; SIBERT et al. 2002) and to

84 enhance drift on the wave front. Social transmission of fertility (AUSTERLITZ AND HEYER 1998) and genetic

85 surfing during range expansions or a combination of both (MOREAU et al. 2011) have been proposed to 86 explain a rapid increase of some low frequency variants. It thus seems that differences in allele 87 frequencies between French Canadians and continental Europe are due to a mixture of the random 88 sampling of initial immigrants (founder effect) and of strong genetic drift having occurred in Quebec 89 after the initial settlement, resulting in a genetically and geographically structured population of French

90 Canadians (BHERER et al. 2011). 91 The demographic history of Quebec has not only affected patterns of neutral diversity, but also

92 the prevalence of some genetic diseases independently from inbreeding (DE BRAEKELEER 1991; HEYER

93 1995; LABERGE et al. 2005; YOTOVA et al. 2005), as well as the average selective effect of segregating

94 variants (CASALS et al. 2013). Even though French Canadians have fewer mutations segregating in the 95 population than the French, these mutations are found at loci which are, on average, much more 96 conserved, and thus are potentially more deleterious than those segregating in the French population

97 (CASALS et al. 2013). Recurrent founder effects, low densities and intergenerational correlation in 98 reproductive success could all contribute to increase drift and reduce the efficacy of selection on

99 expanding wave fronts, and thus lead to the development of a stronger mutation load (PEISCHL et al. 100 2013). Furthermore, even short periods of increased drift and relaxed selection can affect the efficacy of

101 selection in future generations (GRAVEL 2016), resulting in differential rates of purging in front and core 102 populations. It is therefore likely that the excess of low frequency deleterious variants observed in

103 French Canadian individuals (e.g. CASALS et al. 2013), could be at least partly due to the expansion 104 process rather than to the sole initial bottleneck. 105 To better understand and quantify the effect of a recent expansion process on the amount and 106 pattern of deleterious genetic variation, we screened the ascending genealogies of 3916 individuals

107 from the CARTaGENE cohort (AWADALLA et al. 2013) that were linked to the BALSAC genealogical 108 database (http://balsac.uqac.ca/). Using stringent criteria on the quality of genealogical information (see 109 Methods), we selected 51 (front) individuals whose ancestors were as close as possible to the front of 110 the colonization of Quebec, and 51 (core) individuals whose ancestors were as far as possible from the 111 front (see Methods, Fig. 1, and Supporting Animation S1 and S2). We then sequenced these 102 112 individuals at very high coverage (mean 89.5X, range 67X-128X) for 106.5 Mb of exomic and UTR

4 113 regions and contrasted their genomic diversity to detect if sites with various degrees of conservation 114 and deleteriousness have been differentially impacted by selection.

115 Results 116 French Canadians vs. Europeans 117 French Canadians are genetically very divergent from three European populations of the 1000

118 Genome phase 3 panel (THE GENOMES PROJECT 2015) (Great Britain, Spain, and Italy, Fig. S1), as expected 119 after a strong bottleneck. When focusing on SNPs shared between French Canadians and Europeans and 120 thus on relatively high frequency variants, core individuals are found genetically closer to European 121 samples than front individuals in the first two PCA axes (Fig. S1B), in keeping with stronger drift having 122 occurred on the wave front. The qualitative outcome of the PCA does not change if we remove inbred 123 individuals (Figs. S2 and S3) or if we account for linkage among SNPs (Fig. S3). Furthermore, we checked 124 that these differences between Europeans and French Canadians were not due to the lower coverage of 125 the 1000G exomes (65X on average) by downsampling our data from 89.5X to 65X and repeating the 126 PCA (Fig. S1C and D). If we assess the functional impact of point mutations with GERP Rejected

127 Substitution (GERP RS) scores (DAVYDOV et al. 2010), sites polymorphic in French Canadians are on 128 average more conserved than sites polymorphic in Europeans (Fig. 2A). Again, downsampling to account 129 for differences in coverage does not affect this result (Fig. S4). Thus even though French Canadians have 130 fewer polymorphic sites than 1000G populations from Europe, their variants are on average potentially

131 more deleterious than those found in European samples (Fig. 2A), in line with previous results (CASALS et 132 al. 2013). We find the same pattern if we calculate the average GERP RS score per derived allele rather 133 than per polymorphic site (Fig. S5). Note that these results still hold if we focus only on SNPs that are 134 shared between 1000G and Quebec samples, even though the distributions are slightly more 135 overlapping (Figs 2A and S5B).

136 Genomic diversity of front and core individuals 137 In French Canadians, front individuals have a significantly smaller number of variants than core 138 individuals (Table 1) consistent with higher rates of drift. The allele frequencies in front and core 139 individuals are overall very similar (Fig. S6), but there is a significant deficit of singletons on the front as

140 compared to the core (pperm < 0.001, Table S1, Fig. S7), which is balanced by an excess of doubletons on

141 the front (pperm < 0.001). Note that this pattern is consistently found for all GERP RS score categories 142 (Figs. S7 and S8) and is consistent with simulations of range expansions (Fig. S9). We then looked 143 whether genes containing SNPs with large frequency differences between front and core (i.e. those with

144 FST p-value < 0.01) were overly represented in some gene ontology (GO) categories. Among these genes,

5 145 allele frequency differences between the front and the core ranged from 0.069 to 0.294 (see Table S2 146 for a list of strongly differentiated genes). The top 25 significantly enriched GO categories (Table S3) are 147 generally involved in very conserved processes like gene expression, development and cell growth (see 148 Supplementary Notes and Fig. S10), suggestive of a relaxation of selection rather than specific 149 adaptations to the front environment.

150 Similar number of derived alleles in front and core individuals 151 We first computed the total number of predicted neutral alleles per individual (-2 ≤ GERP RS 152 score < 2) and, as expected, we do not find any differences between front and core (Table 1, Fig. S11). 153 We then computed the total number of predicted deleterious alleles (GERP RS score ≥ 2) in front and

154 core individuals. This measure has previously been used as proxy for mutation load (SIMONS et al. 2014;

155 DO et al. 2015), implicitly assuming that mutations interact additively and that all mutations have equal

156 effects. In line with theoretical predictions (KIRKPATRICK AND JARNE 2000; PEISCHL AND EXCOFFIER 2015), we 157 find virtually no difference between front and core individuals (Table 1, Figs. S12A and S13), with a slight 158 but non-significant excess of derived alleles in front individuals as compared to individuals from the core

159 (푝푝푒푟푚 = 0.38). When we consider the cumulative GERP RS score per individual considering all variants 160 irrespective of their frequencies, we find a similarly slight and non-significant excess in the front as 161 compared to the core (Fig. S12B). We note, however, that the relationship between GERP RS scores and

162 selection coefficients is highly non-linear (RACIMO AND SCHRAIBER 2014), which means that both the total 163 number of derived alleles and the cumulative GERP RS scores do not necessarily reflect the mutation

164 load, even if we assume additive effects of mutations (see also ref HENN et al. 2015). Furthermore, if 165 (some) mutations are (partially) recessive, these two measures are insensitive to a change in mutation

166 load that would be caused by increased genetic drift (KIRKPATRICK AND JARNE 2000; PEISCHL AND EXCOFFIER 167 2015). If we only count derived homozygous sites, the difference in the cumulative GERP RS score 168 between front and core is more pronounced as compared to the total additive measures (Fig. S12C, see 169 Table 1 for total number of derived homozygous sites), but it remains statistically insignificant

170 (푝푝푒푟푚 = 0.14). Note that the lack of significant differences between front and core is not surprising,

171 given their short divergence time (Fig. 1B). We find, however, a significantly higher ( pperm < 0.001)

172 cumulative GERP RS score in the core for singletons, which is compensated by a significantly lower

173 statistic for doubletons, tripletons and quadrupletons ( < 0.05, Fig. S14). We confirmed these

174 patterns using alternative deleteriousness scoring systems (Figs. S15 – S18). Since it appears difficult to

175 assess mutation load from sequence data in human populations (see e.g. LOHMUELLER 2014), in the 176 following we shall rather focus on statistics other than genetic load that can measure a relaxation of

6 177 selection via changes in the allele frequencies of deleterious mutations and compare them to analytical 178 results and individual-based simulations.

179 Low frequency variants in front individuals are more conserved

180 In a population that experiences high rates of drift, and hence less efficient selection (GRAVEL 181 2016), we expect that the average (negative) selection coefficient associated to segregating variants

182 increases as compared to a population that experiences less drift (PEISCHL AND EXCOFFIER 2015). We 183 therefore consider the average GERP RS score per site as a proxy for the average deleteriousness of a 184 segregating site, since strongly conserved sites are expected to be under strong purifying selection. The

185 examination of low frequency variants that are enriched for deleterious mutations (BOYKO et al. 2008;

186 NELSON et al. 2012; KIEZUN et al. 2013) should allow us to better evidence the presence of differential 187 selection between front and core and we therefore stratify the average GERP RS score according to their 188 derived allele frequency. We actually find a negative relationship between the frequency of mutations 189 and their average GERP RS scores (Fig. 2B), and low frequency variants (DAF < 5%) have significantly

190 larger GERP RS scores (and are thus potentially more deleterious) on the front than in the core (pperm =

191 0.038). Since new variants should also be enriched for deleterious mutations (BOYKO et al. 2008; KEINAN

192 AND CLARK 2012), we then focused on mutations private to front or to core individuals. With this 193 additional filtering, the differences in GERP RS scores between front and core for low frequency 194 mutations are much more pronounced (Fig. 2C), with significant differences for both doubletons and

195 tripletons (pperm = 0.03 and pperm = 0.0025, respectively). We checked that these results were not due to 196 our use of GERP RS scores by repeating our analyses using alternative proxies for the damaging effects 197 of mutations. We find overall very similar evidence of reduced selection in front populations (Figs. S17 –

198 S27) for point mutations (and for indels) identified as under selection by PolyPhen-2 (ADZHUBEI et al.

199 2010), PhyloPNH (FU et al. 2012) and CADD (KIRCHER et al. 2014), suggesting that our results are robust 200 to alternative scoring systems for deleteriousness.

201 New deleterious mutations are at higher frequencies on the front 202 We further enriched our data for new mutations that occurred during the colonization of Quebec 203 by focusing only on French Canadian mutations that are not observed in the entire 1000G phase 3 panel, 204 and which are private either to the core or to the front samples. In this filtered data set, we find a

205 significant excess of predicted deleterious (GERP RS score > 2) singletons in the core (pperm < 0.001), and

206 an excess of doubletons in the front (pperm < 0.001, Table S1). Interestingly, the doubletons on the front 207 are as conserved as singletons in both core and front samples, suggesting that doubletons on the front 208 are variants that would be singletons in the core (Fig. 2D). We can test the claim that differential

7 209 selection has allowed mutations at more conserved sites to reach or be maintained at higher 210 frequencies on the front by comparing the cumulative GERP RS score of neutral and predicted 211 deleterious doubletons. If the increase in the average degree of conservation of doubletons at the front 212 is due to a purely neutral process, such as inbreeding or demography, we should see similar differences 213 in the cumulative GERP RS score of doubletons between front and core for neutral and deleterious 214 variants. We find, however, that the cumulative GERP RS scores for these doubletons are similar in front 215 and core individuals for neutral sites (-2 < GERP RS < 2), but significantly larger in front individuals for 216 non-neutral GERP RS score categories (GERP RS ≥2) (Fig. 3). To see if inbreeding could explain the 217 observed excess of deleterious doubletons in the front, we compared samples from the region of 218 Saguenay, where remote inbreeding is higher than in the rest of Quebec (Fig. S28), with front samples 219 coming from other regions of Quebec. We find that doubletons in less inbred non-Saguenay individuals 220 are at loci that are on average more conserved than those of Saguenay individuals (Fig. S29), showing 221 that inbreeding cannot explain the increase in frequency of rare deleterious variants.

222 Likelihood-based demographic and selection coefficient inference 223 We used the allele frequency distributions of mutations that are singletons in European 1000G 224 populations and that are still seen in Quebec to estimate the parameters of a simple demographic 225 model for the settlement of French Canada, as well as selection coefficients of sites belonging to 226 different GERP RS categories. In this model, a small founding population splits off from the ancestral 227 population, and then further splits into two subpopulations: the front and the core (Fig. 4A). We

228 estimate the effective population size of the founding population (NBN), the front (NF), and the core (NC) 229 under a maximum-likelihood framework based on inter-generational allele frequency transition matrices 230 (see Methods for details). We report here results for a model in which we fix the duration of initial 231 bottleneck to one generation, but the analysis of a model with a 7-generation bottleneck yields 232 qualitatively similar results, which can be found in the Supporting Information (Fig. S30). We infer that

233 French Canadians passed through a bottleneck equivalent to 푁̂퐵푁 = 354 effective diploid individuals, and

234 that the front population was about 2.5 smaller (푁̂푒,푓푟표푛푡 = 3,972) than the core population ( 푁̂푒,푐표푟푒 = 235 9,977) (Fig. 4B). We then used these maximum likelihood estimates (MLE) to estimate the contribution

236 of the range expansion to the total variance in allele frequencies on the front as VF = V퐵푁 + V퐸푋푃,

237 where VBN is the variance in allele frequencies after the bottleneck, and VEXP is the remaining variance

238 due to the expansion process. We find that V퐸푋푃 explains about 20% of the total variance in allele 239 frequencies that occurred since the initial settlement at the expansion front. Therefore, we estimate 240 that under our simple model, 20% of the genetic divergence between Europe and the front has been 241 generated by the expansion process, whereas the remaining 80% is due to the initial bottleneck shared

8 242 by the core. We also estimated the strength of selection associated with rare variants under our 243 estimated demographic model. In agreement with predictions, the MLE for the selection coefficient 244 associated with predicted neutral variants is centered around zero, whereas the selection coefficients 245 associated with predicted deleterious sites become more negative with increasing GERP RS score (Fig.

246 4C, maximum likelihood estimates and 95% confidence intervals: −0.006 < 푠̂GERP[−2,2[ = 0 < 0.006,

247 −0.034 < 푠̂GERP[2,4[ = −0.024 < −0.013, −0.042 < 푠̂GERP[4,6[ = −0.032 < −0.022 , −0.145 <

248 푠̂GERP≥6 = −0.072 < 0.001). Note that the most negative selection coefficient for GERP RS scores ≥ 6 is 249 not significantly different from zero due to the small number of sites belonging to this category. We 250 note that these estimates are based on the evolution of rare (≈0.1 %), segregating variants and these

251 alleles are significantly enriched for deleterious alleles (BOYKO et al. 2008) (irrespective of their 252 associated GERP RS scores estimated in mammals). Therefore, the inferred selection coefficients do not 253 correspond to the DFE of all variants, which should have considerably lower selection coefficients. 254 Nevertheless, our results suggest a clear monotonic relationship between GERP RS scores and 255 deleteriousness. 256 Whereas the previous approach concentrates on the fate of sites that are singletons in Europe, 257 we used an alternative approach to estimate demographic parameters and the distribution of fitness 258 effects based on the joint site frequency spectrum (SFS) computed between core and front samples. 259 Using only assumed neutral sites (with GERP RS between -2 and +2), we first estimated the demographic 260 parameters of a more realistic demographic model (see Fig S31) including recent exponential growth in

261 both core and front populations using the program fastsimcoal2 (EXCOFFIER et al. 2013). In this approach, 262 we obtain results (reported in Table S4) congruent with the previous (and analytically more tractable) 263 model, in the sense that just after divergence, the initial core population was about 1.3 times larger than 264 the initial front population, supporting the view of a larger extent of drift in the front after divergence. 265 This demographic scenario was then used to estimate the parameters of a gamma distributed DFE from

266 the SFS of sites assumed to be under selection (GERP RS ≥2) with Fitdadi (KIM et al. 2017), in the front 267 and in the core populations separately. We find that the average selection coefficient for these

268 potentially deleterious sites is smaller in the core than in the front (푠̅푐표푟푒 = 0.000319; 푠̅푓푟표푛푡 = 269 0.000378), thus suggesting that mutations tend to have stronger deleterious effects in front 270 populations as compared to the core. If we assume a monotonic relationship between GERP RS scores 271 and selection coefficients, this difference seems especially due to the most deleterious sites since the −6 272 average selection coefficient for GERP RS [2,4[ is almost identical at the front (푠̂GERP[2,4[ = 3.82∙10 ) −6 273 and in the core (푠̂GERP[2,4[ = 4.32∙10 ), whereas selection coefficients are larger in front than in core −4 −4 274 for GERP RS [4,6[ (푠̅푐표푟푒 = 4.58∙10 ; 푠푓푟표푛푡̅ = 5.34∙10 ) and GERP RS ≥ 6

9 −3 −3 275 (푠̅푐표푟푒 = 3.73∙10 ; 푠푓푟표푛푡̅ = 4.62∙10 ) (Fig. S32). Likelihood comparisons suggests that the DFEs 276 estimated in the two population are actually different (see legend of Fig. S32), which could imply that 277 both differential demography and differential selection have shaped the SFS of selected sites. This 278 interpretation is supported by inference of the DFE under simpler demographic models using the

279 software DFE-alpha (EYRE-WALKER AND KEIGHTLEY 2007; SCHNEIDER et al. 2011, Fig. S33). When comparing 280 the inferred DFEs to previous studies, we find that the average strength of selection that we infer is

281 lower than what has been reported previously (BOYKO et al. 2008; KIM et al. 2017). However, note that 282 previous studies were done on non-synonymous sites, whereas we have included synonymous 283 mutations in our approach. The fact that synonymous sites tend to have lower GERP RS scores, could 284 therefore explain why our approach yields an overall more neutral DFE compared to previous studies. 285 An alternative reason why our estimates of the DFE differs from previous ones could be the use of the

286 multinomial likelihood function in Fitdadi (KIM et al. 2017), that only uses the proportional SFS, rather

287 than absolute counts and the number of monomorphic positions. Indeed (KIM et al. 2017) show that 288 estimates for the scale parameter can be off a bit when using the multinomial likelihood, which could 289 explain the observed discrepancy of the estimated DFEs.

290 Variants with low frequency in Europe have been more impacted by selection in the core 291 Because neutral sites should only be affected by drift and not by selection, stronger drift at the

292 front should increase the variance of neutral allele frequencies (GRAVEL 2016), but should not affect their 293 average frequency. In contrast, if we follow a set of deleterious variants that were present at a given 294 frequency in the ancestral population, the frequency of these variants should be smaller in the core as 295 compared to the front if the purging of deleterious variants was more efficient. To test these 296 predictions, we again use mutations that are singletons in European 1000G populations and that are still 297 seen in Quebec and contrast empirical patterns of allele frequencies with predictions from the 298 maximum likelihood model described above (Fig. 4A). In agreement with theoretical predictions (Figs.

299 S34 ), we find no significant difference in the average derived allele frequencies ( xd ) of European

300 singletons predicted to be neutral (GERP RS score between -2 and 2) ( xd = 0.00720 vs 0.00717 in front

301 and core, respectively, pperm = 0.34, Fig. S35), and a slightly larger variance of derived allele frequencies

302 on the front (s. d. ( ): 0.0163 vs 0.0159, pperm = 0.072, Fig. S36). Contrastingly, predicted deleterious

303 sites have significantly higher derived allele frequencies on the front than in the core (pperm = 0.0146 for 304 sites with GERP RS score > 4) and the difference between front and core is increasing with increasing 305 GERP RS score (Fig. S35), in keeping with higher rates of purging in the ancestry of core individuals. In 306 line with neutral evolution of predicted deleterious alleles, the average derived allele frequency of

10 307 European singletons seems to be independent of the GERP RS score at the front, whereas derived allele 308 frequency clearly decreases with increasing GERP RS score in the core, indicative of faster purging of 309 more deleterious mutations (Fig. S35). We also checked the distribution of GERP RS scores among those 310 sites and found no notable differences between front and core (Fig. S37), showing that the larger 311 frequency of these deleterious variants at the front is not compensated by overall lower GERP RS scores 312 of these variants as compared to the core. 313 Since differences between core and front individuals are strongest for rare alleles, these 314 differences may have an impact on the homozygosity of recessive deleterious alleles and thus influence

315 disease incidence. We used the ClinVar database (LANDRUM et al. 2014) to identify pathogenic variants

316 (causing Mendelian disorders, see ref.RICHARDS et al. 2015) in the set of SNPs segregating in French 317 Canadians. The distribution of GERP RS scores for pathogenic variants is clearly shifted towards higher 318 GERP RS scores as compared to the distribution for all SNPs loci (Fig. S38), confirming that GERP RS is a 319 valid deleteriousness scoring system. We find that front individuals carry more known pathogenic

320 variants than core individuals (8.96 vs 7.82, respectively, Fig. S40, 푝푝푒푟푚 = 0.032) and have a 11.8%

321 higher probability to be homozygotes for these pathogenic variants than core individuals (푝푝푒푟푚 = 322 0.11), suggesting that the expansion process has also affected disease causing mutations. This is 323 medically relevant since among the 92 pathogenic variants for which we could find information on their 324 dominance/recessivity status (out of 116), a large majority (75, or 81.5%) were considered as fully 325 recessive. For rare deleterious variants (i.e., derived singletons in Europe with GERP RS score ≥ 2), this 326 excess in homozygosity is 9.5 %. Of importance, this excess increases with GERP RS scores and reaches

327 approximately 90% (pperm = 0.021) for sites with a GERP RS score larger than 6 (Fig. 5). The fact that the 328 variance in allele frequencies becomes more similar between front and core with increasing GERP RS 329 score (Fig. S36A) shows that the excess in homozygosity on the front is not due to an increase in 330 variance of allele frequencies in the front alone. Note also that the increase of homozygosity cannot be 331 explained by the higher inbreeding level prevailing on the front, and that the differences in 332 homozygosity between front and core become even more pronounced if one removes more inbred

333 Saguenay individuals (pperm = 0.008, Fig. 5). This last result shows that stronger purifying selection in the 334 core rather than higher inbreeding on the front is directly responsible for the lower frequencies of 335 deleterious mutations in the core.

336 Simulations can reproduce observed differences between front and core 337 Whereas it seems difficult to perform demographic inferences under a complex spatially explicit 338 model, we have used forward simulations to see how well a model of range expansion can explain our 339 observations (see Methods for details on the simulations). Our simulations reveal that the observed

11 340 excess of singletons in core populations as well as the excess of doubletons in front populations are 341 consistent with a model of range expansion (Fig. S9). Even though previous theoretical work showed

342 that range expansions lead to a flattening of the whole SFS (SOUSA et al. 2014), our current results are in 343 line with theory since we are considering here a much shorter evolutionary period (16 generations

344 instead of e.g. 150 in PEISCHL AND EXCOFFIER (2015)). Indeed, during the onset of the expansion, drift 345 should have more impact on the low frequency entries of the SFS (e.g. singletons and doubletons), 346 because these are the entries for which the number of sites is largest and because they are close to an 347 absorbing state (loss). Importantly, simulations also confirm these features (i.e., a deficiency of 348 singletons and excess of doubletons on the front as compared to the core) of the SFS for negatively 349 selected (recessive or co-dominant) mutations (Fig. S9). Finally, our simulations confirm that an excess 350 of homozygosity should rapidly develop on the front and that it should increase with the deleteriousness 351 of mutations (Fig. S39), in keeping with the observed patterns in Quebec (Fig. 5). Altogether, our 352 simulation results show that a model of range expansion can explain most of the observed differences 353 between front and core individuals in Quebec.

354 Discussion 355 The interaction between demography and selection has been a central theme in population 356 genetics. A particularly hotly debated topic is whether and to what extent recent demography has

357 affected the efficacy of selection in modern humans (LOHMUELLER et al. 2008; GAZAVE et al. 2013; FU et al.

358 2014; LOHMUELLER 2014; SIMONS et al. 2014; DO et al. 2015; HENN et al. 2015; GRAVEL 2016; HENN et al. 359 2016). The original conclusion that European populations show a larger proportion of predicted

360 deleterious variants when compared to African populations (LOHMUELLER et al. 2008) has been recently 361 revisited in a series of studies that reached different and apparently opposite conclusions (reviewed in

362 LOHMUELLER 2014). However, this controversy might have arisen because different studies focused on 363 different patterns or processes. First, people focused either on measures of the efficacy of selection 364 (defined as the amount of change in load per generation) or on measures of the mutation load (see e.g.

365 GRAVEL 2016, for a detailed study of this distinction). Second, people either measured the load as being

366 due to co-dominant (SIMONS et al. 2014; DO et al. 2015) or partially recessive (HENN et al. 2016) 367 mutations, which can lead to drastically different conclusions about the consequences of demographic

368 change on mutation load (HENN et al. 2015; HENN et al. 2016). Finally, most theoretical and empirical 369 work has focused on the effects of bottlenecks and recent population growth, but ignored the out of

370 Africa expansion process and the spatial structure of human populations (SOUSA et al. 2014 ). While it 371 has now been shown that old human expansions could lead to the buildup of a recessive mutation load

372 (HENN et al. 2016), it is still unclear whether very recent or ongoing expansions could also affect patterns

12 373 of diversity in genomic regions under selection, and what are the exact genomic signatures of these 374 recent expansions. 375 Our current study uses a unique combination of historical records, detailed genealogical 376 information, and genomic data to assess the impact of such a recent range expansion on functional 377 genetic diversity, and to disentangle the effects of genetic drift, purifying selection, and inbreeding 378 during an expansion. As expected, given the short divergence time of front and core, we find that the 379 allele frequency distributions in front and core populations are very similar across all GERP RS categories 380 (Figs. S6 and S7), resulting in overall balanced numbers of predicted deleterious alleles or cumulative 381 GERP RS scores per individual (Table 1, Figs. S12 and S13, see also Figs. S15 and S16). However, a closer 382 look reveals significant differences between front and core, particularly for low frequency variants (Fig. 383 2, Figs. S14, S22 – S24), which should be enriched for deleterious mutations and thus be more sensitive 384 to differential selection. The significant differences we have detected between front and core individuals 385 all suggest that larger frequencies of deleterious mutations on the front are due to relaxed purifying 386 selection on the front. The fact that front and core individuals mainly diverged six generations ago with 387 respect to the position of their ancestors to the colonization front (Fig. 1B) suggests that the relaxation 388 of natural selection can affect modern populations remarkably quickly. The recent divergence between 389 front and core populations (around 1780, Fig. S28) has left traces in the genomic diversity of French 390 Canadians that are of two kinds. First, front individuals show increased genetic drift relative to core 391 individuals, as attested by their overall lower levels of diversity (Table 1), their larger genetic divergence 392 from Europeans (Fig. S1), and their lower estimated effective size (Fig. 4B, Fig. S31, Table S4, Table S7). 393 This result confirms the genetic surfing effect previously identified in the Saguenay Lac St-Jean region

394 (MOREAU et al. 2011), but it is not driven by samples from the Saguenay area (e.g. Fig. S29). Rather, it is a 395 property shared by all individuals with ancestors having lived on the front, and presently found in the 396 most peripheral regions of Quebec (Fig. 1). Second, potentially relaxed selection on the front as 397 compared to the core is supported by several lines of evidence. The main evidence comes from the fact 398 that sites targeted by mutations tend to be more conserved in front than in core individuals (Fig. 2B-2D), 399 and that rare, putatively deleterious derived alleles, have a higher probability to be homozygous at the 400 front (Fig. 5). The relaxed selection hypothesis is especially obvious when one considers deleterious 401 mutations that are at low frequencies (singletons) in Europe and that are present at lower frequencies in 402 core than in front individuals, or mutations that are now at low frequencies in Quebec and that are 403 occurring at more conserved (and thus potentially more deleterious) sites in front than in core 404 individuals (e.g., private doubletons and tripletons in Figs. 2C and 2D).

13 405 At first sight, the increased frequency of rare and potentially deleterious alleles (i.e. doubletons) 406 in front individuals could be attributed to their higher inbreeding levels. However, there are several lines 407 of arguments against this interpretation. First, we note that there are about 5% more doubletons on the 408 front than in the core (21,332 vs. 20,284, Fig. S7, Table S1), which cannot be explained by a difference in 409 inbreeding level of only 0.3% (Fig. S28). Instead, individual based simulations show that the excess of 410 doubletons at the front is consistent with a model of range expansion (Figs. S7 and S9). Second, the 411 proportion of doubleton sites where both derived alleles are in the same individuals is smaller than 412 expected (1/101=0.99%) in both front (0.651%) and core (0.646%) individuals, which is indicative of

413 similar (pperm = 0.898) levels of selection against derived homozygotes in both samples. Third, if higher 414 inbreeding (and not relaxed selection) on the front had increased the frequency of all rare mutations 415 irrespective of their deleterious effect, more deleterious mutations should have been better purged by 416 selection than less deleterious mutations, provided that a part of the mutations is (partially) recessive. 417 The observed doubletons on the front should then be on average less conserved due to the more 418 efficient purging in the core. However, we find the opposite, with doubletons at the front being more 419 conserved than in the core (Fig. 2), which means that the number of doubletons at highly conserved 420 sites has increased proportionally more than at neutral sites. Fourth, we find that less inbred individuals 421 from the front tend to have rare variants that are more deleterious than more inbred individuals from 422 the Saguenay area (Figs. S28 and S29). Finally, the difference in inbreeding level between front and core 423 individuals cannot explain the 2-fold increased expected homozygosity for extremely deleterious 424 variants on the front (Fig. 5), and removing Saguenay individuals from the analysis amplifies the excess 425 of derived homozygotes on the front (Fig. 5). Contrastingly, a model of range expansion can explain the 426 increase in derived homozygosity at the expansion front (Fig. S39). Taken together, these results suggest 427 that differences between front and core individuals are mainly driven by increased drift at the expansion 428 front and more efficient selection against deleterious mutations in the core.

429 In line with previous results (CASALS et al. 2013), we find that all French Canadians carry on 430 average more strongly conserved mutations than Europeans (Fig. 2A). Even though it has been proposed

431 that this is the result of a mere founder effect (CASALS et al. 2013), current French Canadians descend

432 from 8500 French founders (LABERGE et al. 2005), which implies a relatively mild founder effect that

433 would take hundreds to thousands generations to increase load to such an extent (LOHMUELLER et al.

434 2008; PEISCHL et al. 2013). More likely, this load could have been created during the initial settlement and 435 range expansion that occurred in Quebec along the Saint-Laurence valley. A major loss of diversity and 436 an increase in the frequency of rare deleterious variants might indeed have occurred during the first 9 437 generations of the settlement of Quebec, until the middle of the 18th century, before current front and

14 438 core individuals actually diverged (Fig. 1B). The importance of these early generations is supported by 439 genealogical analyses of the genetic contributions of the founders having lived at different periods. Early

440 settlers have indeed contributed 45%- 90% to the current French Canadian gene pool (HEYER 1995;

441 BHERER et al. 2011), depending on the regions of Quebec, and early founders contributed proportionally

442 more than later individuals to the current French Canadian gene pool (HEYER 1995; BHERER et al. 2011;

443 MOREAU et al. 2011). Overall, our modeling of the evolution of European singletons suggests that the 444 bottleneck shared between core and front populations explains about 80% of the variance in allele 445 frequencies at the expansion front, whereas only 20% of this variance can be attributed to the separate 446 expansion of the ancestors of front individuals (Fig. 4). Note that this latter value should be considered 447 as a lower bound for the total contribution of the expansion, because front and core samples have a 448 shared history of being on the expansion front in the first few generations in Quebec, and this shared 449 expansion is absorbed into the estimate of the bottleneck population size in our estimation procedure. 450 Overall, our results clearly suggest that the low effective size prevailing on the wave front of the 451 colonization has made selection overall less efficient at the population level than in the core. 452 Consequently, over a very short time (nine generations or less, see Fig. 1 and Fig. S28), deleterious 453 variants have been more efficiently purged in the core and maintained at lower frequencies than in 454 front individuals. This excess of deleterious mutations in front individuals has probably only a minor 455 effect on the total mutation load and on the fitness of most individuals, because these mutations are 456 still at very low frequencies. Nevertheless, this wave front effect might be medically relevant as rare 457 deleterious variants have a higher probability of being homozygous on the front than in the core, 458 suggesting that rare recessive diseases should be more common in individuals whose ancestors lived on 459 the front. In agreement with this prediction, we find that front individuals have on average one more

460 known pathogenic variant than core individuals (8.96 vs 7.82, respectively, Fig. S40, 풑풑풆풓풎 = ퟎ. ퟎퟑퟐ). 461 Since most (85%) of these pathogenic variants are recessive, front individuals are more likely to be 462 derived homozygous and thus to develop some disease. Importantly, this effect is noticeably stronger 463 than the relative risk to develop a rare disease due to inbreeding. In addition, the evidence of a relaxed 464 selection on recent wave fronts suggests that prolonged periods of range expansions over hundreds of 465 generations should have promoted the spread of deleterious mutations in newly settled territories, 466 which could have contributed to global variation in mutation load and in a burden of genetic diseases in 467 expanding modern humans. This hypothesis should to be testable when ancient DNA becomes available 468 in early pioneer populations. 469

15 470 Methods 471 Selection of individuals to sequence 472 We have selected individual to be sequenced by screening the genealogy of 3916 individuals of

473 the CARTAGENE biobank (AWADALLA et al. 2013), who could be connected to the BALSAC genealogical 474 database (http://balsac.uqac.ca) thanks to the information they provided on their parents and 475 grandparents. The BALSAC database includes records from all Catholic marriages in Quebec from 1621 476 to 1965, totaling more than 3 million records (5 million individuals). The ascending genealogies of the 477 3916 CARTAGENE individuals were assessed for their maximum generation depth, their completeness 478 defined as the fraction of ancestors that are traced back in an individual’s genealogy at generation g

479 relative to the maximum number of ancestors ( 2g ) at that generation (JETTÉ 1991), as well as our ability 480 to assess the front or core status of the ancestors. We thus first eliminated 420 genealogies which 481 spanned over less than 12 generations (maximum generation depth < 12 gen). We also eliminated 537 482 genealogies which had a mean depth smaller than 8 generations, 578 genealogies whose completeness

483 (JETTÉ 1991) computed over the last 6 generations was less than 95%, and 97 additional genealogies 484 whose completeness computed over the 12 generations was less than 30%. Genealogies were also 485 filtered based on the quantity of information available for the computation of a cumulative Wave front

486 Index (cWFI), defined as cWFI GC WFI , where the summation is over all ancestors in the i ii

487 genealogy, GCi is the genetic contribution of the i-th ancestor, WFIi is the wave front index of the i-th

488 ancestor, defined asWFI1/ (1 g ) (2011), and g is the number of generations elapsed since the

489 foundation of the location where the ancestor reproduced (see ref. (MOREAU et al. 2011) for more 490 details). A cWFI value of 1 would imply that all the ancestors of the focal individual reproduced on the 491 wave front. To ensure that differences in between individuals are not due to a lack of information 492 on the core-front status of individuals in the genealogy, we eliminated 717 genealogies for which a 493 single was missing for any individual belonging to the 6 most recent generations (

494 completeness < 1 for the 6 most recent generations) and 15 additional genealogies for which the

495 completeness until generation 12 was less than 0.5. We also excluded from the analysis genealogies for 496 which the total number of individuals with computable WFI until generation 12 was either too small or 497 too large, so that the was computed on genealogies of comparable total sizes. The 10% smallest 498 and the 15% largest genealogies were thus eliminated (389 genealogies) from further analyses. The 499 1163 remaining individuals were ranked according to their , and we then selected individuals with 500 the 10% smallest and 10% highest . We also eliminated from these two groups those individuals

16 501 that were too closely related. The kinship coefficient  (WRIGHT 1922) was thus computed between all

502 members of these groups to determine their relatedness. For 41 pairs of individuals more related than 503 second cousins ( > 1/64), one of the two individuals was removed at random. Furthermore, we

504 removed individuals with inbreeding coefficients larger than 0.05. Finally, the 60 individuals with the 505 lowest cWFI and the 60 individuals with the largest were selected for further DNA analyses. 506 Among these, 51 individuals of each category for which peripheral blood samples were available in the 507 CARTaGENE biobank were further considered for DNA extraction and sequencing. The geographic 508 location of the marriage place of 102 individuals’ parents is reported in Fig. 1 and examples of the 509 location of the ancestors of front and core individuals at various periods are shown in Animations S1 510 and S2. In our final sample, only 3 individuals had inbreeding coefficients > 0.02 (Fig. S28).

511 DNA extraction, library preparation and sequencing 512 Peripheral blood samples preserved in EDTA tubes from 102 selected individuals from the 513 CARTaGENE cohort were processed for DNA extraction using the FlexiGene DNA kit as recommended by 514 the supplier (Qiagen). Total DNA was quantified by measurements with the NanoDrop 8000 515 spectrophotometer (Themo Scientific) followed by dsDNA quantitfication with the QUBIT 2.0 516 fluorometer (Life Technologies). DNA libraries were prepared for each sample following the standard 517 protocol of KAPA Library Preparation Kit for Illumina sequencing platforms. A Covaris S2 fragmentation 518 (Duty cycle - 10%, Intensity - 5.0, Cycle per burst - 200, Duration - 120 seconds, Mode Frequency - 519 Sweeping, Displayed Power Covaris S2 – 23W) was performed on 1µg dsDNA input (50µl total volume) 520 for each sample to generate 180 – 200 bp average size fragments. The resulting 3’ and 5’ overhangs 521 were end repaired, 3’-adenylated and ligated to specific indexed adaptors. After a dual SPRI size 522 selection of 250 – 450 bp adapter-ligated fragments, final pre-capture library enrichment was performed 523 by LM-PCR followed by a library amplification cleanup with magnetic beads (AMpure XP, Agencourt). 524 Following the protocol for whole exome capture with the Roche NimbleGen SeqCap EZ Exome + UTR 525 Library kit (User’s Guide v4.2, http://www.nimblegen.com/products/seqcap/ez/exome-utr/index.html), 526 the enriched fragments size distribution was then checked using a DNA 1000 chip on an Agilent 2100 527 Bioanalyzer for whole exome capture validation. The 102 uniquely indexed amplified DNA samples were 528 mixed into 34 pool libraries of 3 different indexed DNA each, and were then hybridized to specific 529 SeqCap EZ Hybridization Enhancing oligos at +47°C for 72 hours. After a washing step followed by a 530 SeqCap EZ Pure Capture Beads recovery of the targeted sequences (here whole exome + UTRs), the 531 multiplex DNA samples were amplified by a post-capture LM-PCR, cleaned with AMpure XP magnetic 532 beads and bioanalyzed with a DNA 1000 chip to quantify and qualify the amplified captured multiplexed 533 DNA samples. Prior to sequencing, a final validation by qPCR assays was carried out on the DNA samples

17 534 to assess the relative fold enrichment in pre-captured sequences versus post-captured ones. Finally, 535 these 34 DNA pools (one pool per lane) were paired-end (2x100bp) sequenced on an Illumina HiSeq 536 2500 System.

537 Alignment and Variant Calling 538 Before mapping reads, a quality control was done using FASTQC, and trimming of the adapters 539 and of poor quality read ends was done using Trim Galore (>=Q20). The reads were then mapped to the 540 hg19 reference genome using BWA v 0.5.9r16 using the default parameters. PCR duplicates were 541 removed using Picard-tools v1.56 (http://broadinstitute.github.io/picard/). We kept properly paired and 542 uniquely mapped reads using Samtools v0.1.19-44428cd. After these steps, we estimated the mean 543 sequence coverage per individual, across the targeted exomic and UTR regions of cumulative length 544 106.5 Mb, to be between 67X-128X (Fig. S41). Realignment around indels and variants recalibration 545 were performed with GATK v3.2-2. GATK v3.2-2 was also used to call variants using the workflow 546 recommended by the Broad Institute (https://www.broadinstitute.org/gatk/guide/best- 547 practices?bpm=DNAseq). We performed a first step using HaplotypeCaller, reporting the calls in GVCF 548 mode. Then the joint genotyping calls were performed using the GenotypeGVCFs subprogram of GATK, 549 to get the raw SNP and INDEL calls. The last step of recalibrating and filtering the genotype calls was 550 done with VQSR, using the recommended options separately on the SNP and INDEL calls.

551 Sequence analysis 552 We removed all variants associated with a quality score below 30. We kept 426,301 SNPs and 553 43,081 indels and used ANNOVAR to functionally characterize these variants. Table S5 gives the number 554 of variants in each ANNOVAR functional class. Individual genotypes associated with low read depth (DP 555 < 10) and low genotype quality (GQ < 20) were marked as missing genotypes. We also collected 556 polymorphism data for 305 individuals from 3 European populations (British from England and Scotland 557 (GBR), Spanish from Spain (IBS) and Italians from Tuscany, Italy (TSI), Table S6) from the 1000 Genomes

558 phase 3 panel (THE GENOMES PROJECT 2015). Note that the 1000 Genomes phase 3 panel set of variants 559 consists of polymorphisms called from a combination of both low and high coverage data (between 8X - 560 30X). Our comparison of French Canadians and individuals from European populations of the 1000 561 Genomes phase 3 panel was restricted to the genomic regions that intersected between the targeted 562 regions sequenced in the present study and the high coverage target of the 1000 Genomes phase 3 563 panel, which amount 46.4 Mb. Since the 1000G data had lower coverage (65X) than our French 564 Canadian data (89.5X), which could lead to differences in genotype frequencies other than those due

18 565 to population history, we performed a downsampling of French Canadian individual reads using 566 Samtools v1.1, option “-s”, to randomly select reads and reduce the coverage to 65X. 567 This step was then followed by the SNP calling and the recalibration steps described in the previous 568 section. We defined shared SNPs between French Canadians and individuals from the 1000 Genomes 569 phase 3 panel as SNPs found in both datasets. Differences in number of various types of sites were 570 obtained by a permutation test consisting of randomly permuting individuals between front and core, 571 reestimating the desired statistics on the permuted samples and estimating the p-value of the observed 572 statistics in the generated empirical null distribution.

573 Assessment of mutation effects

574 The ancestral state of all mutations was characterized, following the 1000 Genomes project (THE

575 GENOMES PROJECT 2015), using the human ancestor genome inferred from the alignment of 6 primates 576 genomes (Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo abelii, Macaca mulatta, Callithrix 577 jacchus)(http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_align 578 ments/). The biological impact of SNPs was assessed via GERP Rejected Substitution (GERP RS) scores

579 (COOPER et al. 2005; DAVYDOV et al. 2010), which measure, at a given genomic location, the difference 580 between the expected and the observed number of mutations at a given position occurring across a 581 phylogeny of 35 mammals. GERP RS scores were obtained from the UCSC genome browser 582 (http://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/All_hg19_RS.bw). Note that the human sequence was 583 not included in the calculation of GERP RS scores. The human reference sequence was indeed excluded 584 from the alignment for the calculation of both the neutral rate and site specific ‘observed’ rate for the 585 RS score to prevent any bias in the estimates. Mutations were classified as being “neutral”, “moderate”, 586 “large” or “extreme” for GERP RS scores with ranges [-2,2[, [2,4[, [4,6[ and [6,∞[, respectively. GERP RS 587 scores of 0 indicates that the alignment of mammalian sequences was too shallow at that position to get

588 a meaningful estimate of constraint (GOODE et al. 2010) and sites with such scores were removed from 589 all analyses involving GERP RS scores. 590 We also used additional methods to assess the functional effect of SNPs and to characterize short

591 indels. We used PolyPhen-2 (ADZHUBEI et al. 2010) for predicting the damaging effect of missense 592 mutations. PolyPhen-2 predicts the possible impact of amino acid substitutions on the stability and 593 function of human proteins using structural and comparative evolutionary considerations. We also used 594 PhyloP scores, which measure evolutionary conservation at individual alignment sites to identify

595 putatively deleterious variants (POLLARD et al. 2010). PhyloP scores were computed from the alignments 596 of 36 eutherian mammal sequences, without the human reference sequence, and denoted as PhyloPNH

597 (FU et al. 2012). Finally, we used CADD scores, which integrate several annotations including

19 598 conservation metrics, regulatory information, transcript information and protein-level scores into a

599 single measure (C-score) for each variant (KIRCHER et al. 2014). We used scaled C-scores, phred-like 600 scores ranging from 0.001 to 99, in our analyses, as these scores are easily interpretable. A scaled C- 601 score larger than 10 indicates that the corresponding variant is predicted to be in the 10% most 602 deleterious classes of variants. A scaled C-score larger than 20 indicates that the corresponding variant is 603 predicted to be in the 1% most deleterious class of variants. Mutations were classified as being 604 “neutral”, “moderate”, “large” or “extreme” for CADD scores with ranges [0,10[, [10,20[, [20,30[ and 605 [30,∞ [, respectively.

606 Assessment of relaxation of selection 607 Assessing mutation load from genomic data is an inherently difficult problem ((for further

608 discussion see LOHMUELLER 2014)). Instead, we use GERP RS scores as a proxy for selection intensity and 609 calculate the average GERP RS score for each individual across all sites at which the focal individual 610 carries a derived allele. The average GERP RS score has a straightforward interpretation: it measures the 611 average degree of conservation of segregating sites. We expect that selection keeps strongly conserved 612 sites (large GERP RS scores) at low frequency, whereas weakly conserved sites should be found at higher 613 frequency. A relaxation of selection, for instance due to strong genetics drift, should therefore lead to an 614 increase of the average GERP RS score. Here, we focus here on the average RS score per site. The 615 average GERP RS score per site is simply the average of GERP RS scores calculated over all sites at which 1 616 an individual carries at least one copy of a derived mutation: ∑푅푆 , where n is the number of 푛 푖

617 segregating sites per individual, and 푅푆푖 is the GERP RS score of site 푖. Note that this measure does not 618 distinguish between heterozygous sites and derived homozygous sites. To account for the frequency of 619 derived alleles we also calculated the average GERP RS score across sites that have a given derived allele 620 frequency.

621 Detection of outlier SNPs and Gene Ontology analysis

622 To detect potential outlier SNPs based on levels of genetic differentiation, we used the outlier FST

623 method proposed by Beaumont and Nichols (BEAUMONT AND NICHOLS 1996) and implemented in the

624 Arlequin software (EXCOFFIER AND LISCHER 2010). In brief, this test uses coalescent simulations to generate

625 the joint distribution of FST and heterozygosity between populations expected under a finite-island

626 model, having the same average FST value as that observed. This null distribution is then used to

627 compute the p-value of each SNP based on its observed FST and heterozygosity levels. SNPs with FST 628 values outside the 99% quantile based on the simulations were considered as outliers. These SNPs were

629 then annotated to Ensembl gene IDs with the R package BiomaRt (DURINCK et al. 2009). SNPs were

20 630 mapped to a gene if they were located in the gene transcript or within 10 kb of it. If a SNP was allocated 631 to more than one gene with this method, we uniquely allocated it to its closest gene. If more than one

632 SNP was assigned to a given gene, we only kept the SNP with the highest FST value. 633 We conducted a Gene Ontology (GO) enrichment analysis on the list of significant SNPs using the

634 topGO R package (ALEXA et al. 2006) . We applied the default algorithm using a Kolmogorov-Smirnov (KS) 635 test to detect highly differentiated biological processes and obtain their p-values. This approach 636 integrates information about relationships between the GO terms and the different scores of the genes 637 (here, the p-values) into the calculation of the statistical significance. We kept only GO terms which 638 included more than 10 genes in this analysis.

639 Maximum likelihood estimation of past demography and selection coefficients 640 To infer demography and selection coefficients, we considered only sites that are found as private 641 singletons in the European 1000G populations and that are found polymorphic in Quebec. We used the 642 current frequency of these variants in Europe as a proxy for their frequency during the foundation of 643 Quebec. This allowed us to directly estimate front and core effective population sizes without having to 644 estimate additional parameters for the European population. 645 We modeled the evolution of allele frequencies at independent sites under random genetic drift 646 and natural selection in two panmictic populations, denoted the core and the front. Variables describing 647 properties of the front and core are denoted with sub- or super-scripts 푓 and 푐, respectively. For

(푓) 648 simplicity, we only present calculations for the front.; the core can be treated analogously. Then 푥푖 649 denotes the number of sites with a derived allele frequency of 푖. Let 푋 (푡) = (푥(푓), … , 푥(푓) ), denote the 푓 0 푁푓

650 SFS on the front, where 푁푓 is the effective population size at the front and 푡 denotes the time (in 651 generations) since the founding of Quebec. Assuming a Wright-Fisher model of drift and genic selection 652 (that is, no dominance or epistasis), the SFS then evolves according to

(푓) 2푁푓 (푓) 푗(1−푠) 653 푥푖 (푡 + 1) = ∑ 푥푗 퐵 (푖, 2푁푓, ), (1) 푗=0 푗(1−푠)+2푁푓−푗 654 where 퐵(푘, 푛, 푝) denotes the binomial distribution and 푠 is the strength of selection against the derived 655 allele. We calculated the current allele frequency distribution (16 generations after the onset of the

(푓) 2푁퐵푁 (퐵푁) 푖(1−푠) 656 settlement) 푋푓(16) with the initial condition 푥푖 (0) = ∑푖=0 푥푖 퐵 (푖, 2푁푓, ), where 푖(1−푠)+2푁퐵푁−푖 (퐵푁) 1 657 푥푖 = 퐵(푖, 푁퐵푁, ) is the expected allele frequency distribution in the bottlenecked population and 2푛0

658 푛0 is the sample size in Europe. We then obtained the expected allele frequency distribution for a

659 sample of 푛푓 = 51 individuals by

푓 2푁푓 (푓) 660 푥푖,푠푎푚푝푙푒 = ∑푗=0 푥푗 퐵(푖, 2푛푓, 푗/(2푁푓)). (2)

21 (푓) (푓) 661 Let 푝푖 = 푥푖,푠푎푚푝푙푒/(2푛푓) be the relative frequency of sites with a derived allele frequency of 푖. To 662 account for the fact that we only considered sites shared between Europe and Quebec, we corrected 663 the allele frequency distribution by multiplying the proportion of sites that are not found polymorphic at

(푓) (푐) 664 the front, 푝0 , by (1 − 푝0 ), i.e., we count only the proportion of sites where the derived allele is lost

2푁푓 (푓) 665 in the front but is polymorphic in the core, and then renormalize such that ∑푖=0 푝푖 = 1. Note that this 666 renormalization accounts for the loss of variants in both front and core. We thus account for the loss of 667 variants in both front and core in the estimation process, since the loss of variants is reflected in the 668 shape of the re-normalized SFS. We can then calculated the likelihood from our data as

퐿(푌푓, 푌푐|푁푓, 푁푐, 푠) =

(푓) (푓) (푐) ( ) 2푛푓 푦 푦 2푛 푦 푦 푐 (푓) 0 (푓) 2푛푓 푐 (푐) 0 (푐) 2푛푐 669 ( (푓) (푓) ) (푝 ) …. (푝 ) ( (푐) (푐) ) (푝 ) …. (푝 ) , 푦 , … , 푦 0 2푛푓 푦 , … , 푦 0 푁푐 0 2푛푓 0 2푛푐

670 where 푌 = (푦(푓), … , 푦(푓) ) and 푌 = 푦(푐), … , 푦(푐) denote the observed derived allele frequencies in 푓 0 2푛푓 푐 0 2푛푐 671 front and core respectively. The likelihood was then maximized numerically via a grid search in the 672 parameter space.

673 Contribution of Expansion to Variance in Allele Frequencies 674 To estimate the relative contribution of the expansion to the total variation in allele frequencies, 675 we first iterated eq. (1) with the MLE parameters to obtain expected allele frequency distributions after

(퐵푁) (푓) 676 the bottleneck (푥푖 ) and at the end of the expansion (푥푖 ). We then calculated the variance in allele

677 frequencies after the bottleneck (푉퐵푁) and at the end of the expansion (푉퐹) from the allele frequency

678 distributions. Using the decomposition 푉퐹 = 푉퐵푁 + 푉퐸푋푃, the proportion of variance explained by the V −V 679 expansion process is given by F 퐵푁. 푉퐹

680 Demographic and Selection Inference from the Site Frequency Spectrum (SFS) 681 We used a total of 43,407,133 neutral sites (with GERP RS cores between -2 and 2) to estimate the

682 past demography of French Canadians from the joint (Front and Core) SFS using fastsimcoal2 (EXCOFFIER 683 et al. 2013). The demographic scenario is shown in Fig. S31. We performed 50 runs of fastsimcoal2, each 684 with 50 cycles, and we used 200,000 simulations per likelihood estimations. The estimated parameters 685 obtained from the run maximizing the likelihood are shown in Table S4.

686 Based on the demography inferred from neutral sites, we then used Fitdadi (KIM et al. 2017) to 687 estimate the Distribution of Fitness Effects (DFE) from the SFS computed on 40,341,124 sites assumed to 688 be under selection (with GERP RS scores ≥ 2). This estimation was done independently in the core and 689 the front population, assuming that the DFE followed a Gamma distribution and using a multinomial

22 690 likelihood. In each case, we fixed the shared and population specific demographic parameters shown in 691 Table S4 and searched for the Gamma distribution parameters maximizing the likelihood computed 692 from the SFS inferred from the deleterious sites. The resulting DFEs are reported in Fig. S32.

693 Inference of DFE with DFE-alpha

694 We used DFE-alpha (EYRE-WALKER AND KEIGHTLEY 2007; SCHNEIDER et al. 2011) to estimate the DFE of 695 deleterious mutations independently for front and core populations. We first estimated the 696 demographic and mutational parameters under all implemented models (1-, 2-, or 3-epoch demographic 697 history with piecewise constant population sizes) using predicted neutral sites. We then used the SFS of 698 predicted deleterious sites to infer the distribution of fitness effects, as modelled by a Gamma 699 distribution. The resulting demographic histories are reported in Table S7 and the inferred DFEs are 700 shown in Fig. S33.

701 Individual-Based Simulations

702 We performed individual based simulations of a range expansion in a 2D habitat consisting of a

703 lattice of 11x11 discrete demes (stepping stone model). Generations are discrete and non-overlapping,

704 and mating within each deme is random. Migration is homogeneous and isotropic, except that the

705 boundaries of the habitat are reflecting, i.e., individuals cannot migrate out of the habitat. Population

706 size grows logistically within demes. Our simulations start from a single panmictic ancestral population,

707 representing France. After a burn-in phase that ensures that the ancestral population are at mutation-

708 selection-drift balance, a propagule of founders is placed on the deme with coordinates (3,6) on the

709 11x11 grid representing French Canada (see Fig. S42). During the next 6 generations, the population

710 expands along a 1 deme wide corridor in the middle of the habitat (representing the Saint Laurence

711 River corridor). During these 6 generations, all colonized demes in French Canada receive migrants from

712 the ancestral populations in equal proportions. The number of migrants were chosen to roughly match

713 historical records (CHARBONNEAU et al. 2000). In particular, we chose 1000, 2000, 1000, 1000, 1000, and

714 2000 pioneer immigrants from the ancestral population for the first 6 generations, respectively. After

715 that, the expansion continues into the remaining habitat for 11 generations. See Fig. S42 for a sketch of

716 the model.

23 717 We chose a carrying capacity of K = 1,000 diploid individuals and the size of the ancestral

718 population was 10,000. Migration rate was set to m = 0.2 and the within-deme growth rate was R = 2

719 (that is, at low densities the population doubles within one generation, reflecting the average absolute

720 fitness of approximately 4–5 surviving children getting married per woman (MOREAU et al. 2011)). We

721 simulated a set of 10,000 independent diallelic loci per individual. The genome-wide mutation rate was

722 set to U = 0.1. Mutations occur only in one direction and back mutations are ignored. We performed two

723 types of simulations: (i) evolution of neutral mutations, and (ii) evolution of sites under purifying

724 selection. In the latter case, we assumed that all sites had the same selection coefficient s. Mutations

725 were considered either co-dominant or completely recessive and interact multiplicatively across loci,

726 that is, there is epistasis. We also simulated and recorded the cumulative wave front index (cWFI) of

727 each individual. The simulation code can be downloaded from: https://github.com/CMPG/ADMRE.

728 Data Access 729 The data will be submitted to the European Genome-phenome Archive (EGA) repository. The 730 accession number is EGAS00001001957.

731 Acknowledgements 732 We are grateful to Claude Bhérer for her detailed comments on the manuscript. We would like to 733 thank the CARTaGENE participants and team for data collection and assistance, Marc Tremblay for his 734 help in connecting CARTaGENE individuals to the Balsac genealogical data base, Remy Brugmann for 735 Bioinformatic analyses, Ryan Gutenkunst and Bernard Kim for their help with dadi and Fitdadi 736 computations, and the Ubelix High Performance Computing cluster of the University of Bern. We 737 confirm that informed consent was obtained from all subjects. This work has been made possible by 738 Swiss NSF grants No. 31003A-143393 and 310030B-166605 to LE. AH is an FRSQ Research Fellow. AH 739 holds an MRC eMedLab Medical Bioinformatics Career Development Fellowship, funded from award 740 MR/L016311/1. PA is supported by the Ministry of Research of Ontario. KJG was funded by EMBO long- 741 term fellowship ALTF 2-2016 742

743 Disclosure Declaration 744 The authors declare that they have no conflicts of interest.

24 745 Figures

746 747 Figure 1: Location and number of sampled individuals and distribution of the cumulative Wave 748 front Indices (cWFI). A: Front and core sampled individuals are shown in white and 749 gray, respectively. The numbers inside circles indicate the sample size for each 750 location. B: The leftmost panel shows the distribution of cWFI among sampled 751 individuals. The other three panels display the cWFI of the ancestors of the sampled 752 individuals that lived 6, 9 or 12 generations ago, which shows that observed 753 differences in cWFI between current samples have mostly emerged in the 6 most 754 recent generations.

25 755

756 Figure 2: A: Distributions of average GERP RS scores per site per individual in three European 757 1000G populations, as well as in core and front individuals. Left: All sites. Right: Sites 758 shared between 1000G samples and Quebec (t-test p-values = 10-7 and 10-5, 759 respectively). B: Average GERP RS score per site having different Derived Allele 760 Frequencies (DAF). The solid horizontal lines show the average GERP RS score per site. 761 The violin plots show the the average GERP RS score distribution obtained by 762 bootstrap (1000 replicates). C: Like B, but for mutations private to the front or to the 763 core. D: Like B but for singletons and doubletons that are private to front or core and 764 not found in the 1000G phase 3 panel. For the sake of clarity, higher DAF classes are 765 not shown in panels B- D. Only SNPs with GERP RS scores larger than 0 were used for 766 the calculations of GERP RS scores in all panels. Asterisks indicate significance levels 767 obtained by permutation tests: * p < 0.05, ** p < 0.01, *** p < 0.001.

26 768

769 Figure 3: Distribution of the cumulative additive GERP RS scores of doubletons in front and core 770 individuals for different GERP RS categories. Sites were considered if they were not 771 seen in derived states in 1000G samples and if they were private to the core or to the 772 front. Differences between front and core are significant for the three categories of 773 sites potentially under selection (p = 10-11, 10-9, 10-4 for mildly, strongly, and extremely 774 deleterious sites, respectively), but not for the neutral sites (-2 ≤ GERP RS score < 2, p 775 = 0.34)

776

777

778

779

780

27 781 782 Figure 4. Maximum likelihood estimation of effective population sizes and selection 783 coefficients of rare variants. A: Sketch of the model for the demographic history of 784 core and front populations. Likelihoods were calculated based on the expectation of 785 the change in allele frequency distribution of rare variants (that is, singletons in the 786 European sample). Marginal likelihoods and MLE for effective population sizes of 787 bottleneck, and in front and core (B), and selection coeffcients for different GERP RS 788 categories (C). Shaded areas indicate 95% confidence intervals in (B), and horizonal 789 bars indicate 95% confidence intervals in (C).

790

28 791

792 Figure 5: Ratio of expected homozygosity for variants that are singletons in European 1000G 22 793 populations. HR E[]/[] qfc E q where 푞푓 and푞푐 are derived allele frequencies in 794 front and core individuals, respectively. The horizontal solid lines indicate HR for 795 different GERP RS score categories. The dashed lines indicates the expected HR 796 values that would be due to differences in estimated inbreeding levels between front 22 797 and core, calculated as qc  f q c(1  q c)/ q c , where f  ffront  f core . Violin plots 798 show the distribution of 5000 bootstrap replicates. We find significant differences 799 between the expected values for GERP RS scores > 6 (all individuals: p = 0.021, 800 without Saguenay individuals: p= 0.008, obtained by bootstrap).

801

802

803

804

29 805 Tables 806 Table 1: Summary of genetic diversity in front and core samples.

core front Total

Type and number of polymorphism (n=51) (n=51) (n=102)

Total No. of SNPs 426,301

No. of SNPs with inferred ancestral/derived state 314,483 > 308,396 396,424

No. of SNPs without missing data 266,547 > 261,355 328,372

No. of exonic SNPs 83,653 > 81,763 107,525 No. of non-synonymous SNPs 40,750 > 39,595 55,133

No. of predicted deleterious SNPs 66,300 > 64,763 85,748

No. of SNPs private to one of the two groups of 78,310 > 72,353 150,663 individuals No. of SNPs without missing data and not seen in 31,608 > 29,811 56,669 1000G phase 3 panel No. of SNPs without missing data, private to one of the two groups, and not seen in 1000G phase 3 26,858 > 25,061 51,919 panel

Average number of derived predicted neutral 50,983.94 ≈ 50,996.51 alleles per individual (-2 ≤ GERP RS < 2)

Average number of derived predicted deleterious 22,017.57 ≈ 22,034.16 alleles per individual (GERP RS ≥ 2)

Average number of derived homozygous predicted 6,037.53 ≈ 6,057.75 deleterious sites per individual (GERP RS ≥ 2)

No. of indels 33,789 > 33,297 43,081

Heterozygosity All sites 0.0588 ≈ 0.0586 Exons 0.0548 ≈ 0.0547

Introns 0.0632 ≈ 0.0630 5' UTR 0.0489 ≈ 0.0487 3'UTR 0.0623 ≈ 0.0623

807

808 Significant differences between front and core are indicated by “>’’ (permutation test, pperm 809 <0.001), and non-significant differences are indicated by “≈’’ (p>0.05).

30 810 References 811 Adzhubei, I. A., S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova et al., 2010 A method and 812 server for predicting damaging missense mutations. Nature methods 7: 248-249. 813 Alexa, A., J. Rahnenfuhrer and T. Lengauer, 2006 Improved scoring of functional groups from 814 gene expression data by decorrelating GO graph structure. Bioinformatics 22: 1600-1607. 815 Austerlitz, F., and E. Heyer, 1998 Social transmission of reproductive behavior increases 816 frequency of inherited disorders in a young-expanding population. Proc Natl Acad Sci U S 817 A 95: 15140-15144. 818 Awadalla, P., C. Boileau, Y. Payette, Y. Idaghdour, J. P. Goulet et al., 2013 Cohort profile of the 819 CARTaGENE study: Quebec's population-based biobank for public health and 820 personalized . Int J Epidemiol 42: 1285-1299. 821 Beaumont, M. A., and R. A. Nichols, 1996 Evaluating loci for use in the genetic analysis of 822 population structure. Proceedings of the Royal Society London B 263: 1619-1626. 823 Bherer, C., D. Labuda, M. H. Roy-Gagnon, L. Houde, M. Tremblay et al., 2011 Admixed ancestry 824 and stratification of Quebec regional populations. Am J Phys Anthropol 144: 432-441. 825 Boyko, A. R., S. H. Williamson, A. R. Indap, J. D. Degenhardt, R. D. Hernandez et al., 2008 826 Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS 827 Genet 4: e1000083. 828 Casals, F., A. Hodgkinson, J. Hussin, Y. Idaghdour, V. Bruat et al., 2013 Whole-exome sequencing 829 reveals a rapid change in the frequency of rare functional variants in a founding 830 population of humans. PLoS Genet 9: e1003815. 831 Charbonneau, H., B. Desjardins, J. Légaré and H. Denis, 2000 The Population of the St. Lawrence 832 Valley, 1608-1760, pp. 99-142 in A Population History of North America, edited by M. R. 833 Haines, Steckel, R.H. Cambridge University Press. 834 Cooper, G. M., E. A. Stone, G. Asimenos, N. C. S. Program, E. D. Green et al., 2005 Distribution 835 and intensity of constraint in mammalian genomic sequence. Genome Res 15: 901-913. 836 Davydov, E. V., D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow et al., 2010 Identifying a high 837 fraction of the human genome to be under selective constraint using GERP++. PLoS 838 Comput Biol 6: e1001025. 839 De Braekeleer, M., 1991 Hereditary disorders in Saguenay-Lac-St-Jean (Quebec, Canada). Hum 840 Hered 41: 141-146. 841 Do, R., D. Balick, H. Li, I. Adzhubei, S. Sunyaev et al., 2015 No evidence that selection has been 842 less effective at removing deleterious mutations in Europeans than in Africans. Nat Genet 843 47: 126-131. 844 Durinck, S., P. T. Spellman, E. Birney and W. Huber, 2009 Mapping identifiers for the integration 845 of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4: 1184-1191. 846 Excoffier, L., I. Dupanloup, E. Huerta-Sánchez, V. C. Sousa and M. Foll, 2013 Robust demographic 847 inference from genomic and SNP data. PLoS Genet 9: e1003905. 848 Excoffier, L., and H. E. Lischer, 2010 Arlequin suite ver 3.5: a new series of programs to perform 849 population genetics analyses under Linux and Windows. Mol Ecol Resour 10: 564-567. 850 Eyre-Walker, A., and P. D. Keightley, 2007 The distribution of fitness effects of new mutations. 851 Nat Rev Genet 8: 610-618. 852 Fu, W., R. M. Gittelman, M. J. Bamshad and J. M. Akey, 2014 Characteristics of neutral and 853 deleterious protein-coding variation among individuals and populations. Am J Hum Genet 854 95: 421-436.

31 855 Fu, W., T. D. O’Connor, G. Jun, H. M. Kang, G. Abecasis et al., 2012 Analysis of 6,515 exomes 856 reveals the recent origin of most human protein-coding variants. Nature. 857 Gazave, E., D. Chang, A. G. Clark and A. Keinan, 2013 Population growth inflates the per- 858 individual number of deleterious mutations and reduces their mean effect. Genetics 195: 859 969-978. 860 Goode, D. L., G. M. Cooper, J. Schmutz, M. Dickson, E. Gonzales et al., 2010 Evolutionary 861 constraint facilitates interpretation of genetic variation in resequenced human genomes. 862 Genome Res 20: 301-310. 863 Gravel, S., 2016 When is Selection Effective? Genetics. 864 Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth et al., 2011 Demographic 865 history and rare allele sharing among human populations. Proc Natl Acad Sci U S A 108: 866 11983-11988. 867 Henn, B. M., L. R. Botigue, C. D. Bustamante, A. G. Clark and S. Gravel, 2015 Estimating the 868 mutation load in human genomes. Nat Rev Genet 16: 333-343. 869 Henn, B. M., L. R. Botigue, S. Peischl, I. Dupanloup, M. Lipatov et al., 2016 Distance from sub- 870 Saharan Africa predicts mutational load in diverse human genomes. Proceedings of the 871 National Academy of Sciences 113: E440-449. 872 Heyer, E., 1995 Genetic Consequences of Differential Demographic Behavior in the Saguenay 873 Region, Quebec. American Journal of Physical Anthropology 98: 1-11. 874 Heyer, E., 1999 One founder/one gene hypothesis in a new expanding population: Saguenay 875 (Quebec, Canada). Human biology 71: 99-109. 876 Jetté, R., 1991 Traité de généalogie. Les Presses de l'Université de Montréal, Montréal. 877 Keinan, A., and A. G. Clark, 2012 Recent explosive human population growth has resulted in an 878 excess of rare genetic variants. Science 336: 740-743. 879 Kiezun, A., S. L. Pulit, L. C. Francioli, F. van Dijk, M. Swertz et al., 2013 Deleterious alleles in the 880 human genome are on average younger than neutral alleles of the same frequency. PLoS 881 Genet 9: e1003301. 882 Kim, B. Y., C. D. Huber and K. E. Lohmueller, 2016 Inference of the distribution of selection 883 coefficients for new nonsynonymous mutations using large samples. bioRxiv. 884 Kim, B. Y., C. D. Huber and K. E. Lohmueller, 2017 Inference of the distribution of selection 885 coefficients for new nonsynonymous mutations using large samples. Genetics 206: 345- 886 361. 887 Kircher, M., D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper et al., 2014 A general framework for 888 estimating the relative pathogenicity of human genetic variants. Nature genetics 46: 310- 889 315. 890 Kirkpatrick, M., and P. Jarne, 2000 The Effects of a Bottleneck on Inbreeding Depression and the 891 Genetic Load. Am Nat 155: 154-167. 892 Klopfstein, S., M. Currat and L. Excoffier, 2006 The fate of mutations surfing on the wave of a 893 range expansion. Mol Biol Evol 23: 482-490. 894 Laberge, A. M., J. Michaud, A. Richter, E. Lemyre, M. Lambert et al., 2005 Population history and 895 its impact on medical genetics in Quebec. Clin Genet 68: 287-301. 896 Landrum, M. J., J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein et al., 2014 ClinVar: public archive 897 of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: 898 D980-985.

32 899 Lohmueller, K. E., 2014 The distribution of deleterious genetic variation in human populations. 900 Curr Opin Genet Dev 29: 139-146. 901 Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez et al., 2008 Proportionally 902 more deleterious genetic variation in European than in African populations. Nature 451: 903 994-U995. 904 Moreau, C., C. Bherer, H. Vezina, M. Jomphe, D. Labuda et al., 2011 Deep human genealogies 905 reveal a selective advantage to be on an expanding wave front. Science 334: 1148-1150. 906 Nelson, M. R., D. Wegmann, M. G. Ehm, D. Kessner, P. St Jean et al., 2012 An abundance of rare 907 functional variants in 202 drug target genes sequenced in 14,002 people. Science 337: 908 100-104. 909 Peischl, S., I. Dupanloup, L. Bosshard and L. Excoffier, 2016 Genetic surfing in human 910 populations: from genes to genomes. Current Opinion in Genetics & Development 41: 911 53–61. 912 Peischl, S., I. Dupanloup, M. Kirkpatrick and L. Excoffier, 2013 On the accumulation of deleterious 913 mutations during range expansions. Mol Ecol. 914 Peischl, S., and L. Excoffier, 2015 Expansion load: recessive mutations and the role of standing 915 genetic variation. Mol Ecol. 916 Peischl, S., M. Kirkpatrick and L. Excoffier, 2015 Expansion load and the evolutionary dynamics of 917 a species range. Am Nat 185: E81-93. 918 Pollard, K. S., M. J. Hubisz, K. R. Rosenbloom and A. Siepel, 2010 Detection of nonneutral 919 substitution rates on mammalian phylogenies. Genome Res 20: 110-121. 920 Racimo, F., and J. G. Schraiber, 2014 Approximation to the distribution of fitness effects across 921 functional categories in human segregating polymorphisms. PLoS Genet 10: e1004697. 922 Richards, S., N. Aziz, S. Bale, D. Bick, S. Das et al., 2015 Standards and guidelines for the 923 interpretation of sequence variants: a joint consensus recommendation of the American 924 College of Medical Genetics and Genomics and the Association for Molecular Pathology. 925 Genet Med 17: 405-424. 926 Schneider, A., B. Charlesworth, A. Eyre-Walker and P. D. Keightley, 2011 A method for inferring 927 the rate of occurrence and fitness effects of advantageous mutations. Genetics 189: 928 1427-1437. 929 Sibert, A., F. Austerlitz and E. Heyer, 2002 Wright-Fisher revisited: the case of fertility correlation. 930 Theor Popul Biol 62: 181-197. 931 Simons, Y. B., M. C. Turchin, J. K. Pritchard and G. Sella, 2014 The deleterious mutation load is 932 insensitive to recent population history. Nat Genet 46: 220-224. 933 Sousa, V., S. Peischl and L. Excoffier, 2014 Impact of range expansions on current human 934 genomic diversity. Curr Opin Genet Dev 29: 22-30. 935 The Genomes Project, C., 2015 A global reference for human genetic variation. Nature 526: 68- 936 74. 937 Wright, S., 1922 Coefficients of Inbreeding and Relationship. American Naturalist 56: 330-338. 938 Yotova, V., D. Labuda, E. Zietkiewicz, D. Gehl, A. Lovell et al., 2005 Anatomy of a founder effect: 939 myotonic dystrophy in Northeastern Quebec. Human genetics 117: 177-187.

940

33 A

Côte-Nord 2 Abitibi-Témiscamingue Saguenay

22 7 5 Côte-de-Beaupré Bas-Saint-Laurent Mauricie 2 Lanaudière 1 8 Beauce 25 1 Quebec city 1 1 2 19 Estrie Bois-Francs 51 Richelieu B now > 6 gen. > 9 gen. > 12 gen.

● ●

● ●

● ● 0 2 4 6 8 6 4 2 0 core front core front core front core front cummulative Wave Front Index (cWFI) Index Front Wave cummulative A B All SNPs All SNPs shared SNPs

*** *** *** *

● ●

● ● 2.5 2.52 2.54 2.52 2.5

average GERP RS score RS GERP average Front 2.1 2.2Core 2.3 2.4 2.5 2.6

average GERP RS score RS GERP average French Canada Europe French Canada Europe

12345678910 C D DAF SNPs private to Front or Core SNPs private to Front or Core and not found in 1000G

*

** Front 2.2 2.3 2.4 2.5 2.6 2.7

average GERP RS score RS GERP average Front Core average GERP RS score RS GERP average Core ** 2.4 2.5 2.6 2.7 2.8 2.9 1234 singletons doubletons DAF cum. GERP RS score cum. GERP RS score 0 50 100 150 −10 0 5 15 GERP RS <6 RS 4≤GERP GERP RS <2 RS -2≤GERP oeFront Core Front Core ● ● ● ●

cum. GERP RS score cum. GERP RS score 0 2 4 6 8 12 0 20 40 60 80 GERP RS <4 RS 2≤GERP oeFront Core Front Core GERP RS ≥6 GERP RS ● ● ● ● B NC NF NBN A log likelihood log

NBN -52 -50 -48-46-44-42 -50 -52

200 500 1000 2000 5000 10000 20000 N C e GERP RS [-2,2] GERP RS [2,4] NF NC GERP RS [4,6] GERP RS > 6

France French Canada likelihood log -50 -40 -30 -20 -10 -20 -30 -40 -50

-0.1 -0.05 0 0.05 0.1 s HR 1.0 1.5 2.0 2.5 3.5 -,[[,[[4,6[ [2,4[ [-2,2[ Without Saguenayindividuals All individuals EPRS category GERP ≥ 6 * * *