SUPPLEMENTARY MATERIALS for

Pervasive introgression of MHC genes in newt hybrid zones

K. Dudek, T. S. Gaczorek, P. Zieliński, W. Babik

Supplementary Methods

MHC class II genotyping

MHC class II was amplified in 10 µl PCR reactions containing 50–100 ng of genomic DNA, 5 µl of Multiplex PCR kit (Qiagen) and each of four primers (IIex2_Fb TCTCTCCRCAGYGGACTTYG, IIex2_Fc TCTSTCCTCAGATGATTTYG, IIex2_R2 CTCACGCCTCCGKTKGTACAGG, IIex2_R3 CTCACGCHTCCGSTGCTCCAKG; all sequences 5′→3′) at a concentration of 1 µM. Individuals were barcoded with a combination of 6 bp indexes at the 5′ end of the forward and reverse primers. PCR conditions were as follows: initial denaturation at 95 °C for 15 min, followed by 35 cycles of 95 °C for 30 s, 55 °C for 30 s and 72 °C for 70 s, and a final elongation at 72 °C for 10 min. Amplicons were pooled approximately equimolarly based on gel-band intensity, pools were gel-purified, Illumina adaptors were ligated using the NEXTflex PCR-Free DNA Library Prep Kit for Illumina (Bioo Scientific), and libraries were quantitated with the NEBNext Library Quant Kit for Illumina (NEB) and sequenced on an Illumina MiSeq (v3 600 cycles kits). Genotyping was performed using the adjustable clustering method implemented in AmpliSAS (Sebastian, Herdegen, Migalska, & Radwan, 2016). Following clustering of sequence variants within amplicons, we considered only variants with a per-amplicon frequency > 1.0%. Several variants showing no similarity to MHC class II, as well as putative pseudogenes (variants with frameshifts or in-frame stop codons), were excluded from further analyses. To estimate the repeatability of genotyping, 56 (MHC class I) and 62 (MHC class II) samples were amplified and genotyped twice.
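The per-amplicon frequency threshold described above can be illustrated with a minimal sketch. It covers only the final filtering step, not the AmpliSAS clustering itself; the function name, the variant labels and the read counts are hypothetical and serve only to show how the 1.0% cut-off acts on one amplicon.

```python
def filter_variants(read_counts, min_freq=0.01):
    """Keep variants whose share of an amplicon's reads exceeds min_freq.

    read_counts maps a variant label to its read count within ONE amplicon
    (i.e. one individual), after clustering.
    """
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {v: n for v, n in read_counts.items() if n / total > min_freq}


# Hypothetical read counts for a single amplicon (illustration only).
amplicon = {"var_A": 620, "var_B": 350, "var_C": 25, "var_D": 5}
kept = filter_variants(amplicon)
# var_D (5 of 1000 reads, 0.5%) falls below the threshold and is dropped.
```

Frequencies are computed within each amplicon separately, so a variant that is rare in one individual can still be retained in another.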

MHC class II diversity and tests of selection

Nadachowska-Brzyska, Zielinski, Radwan, & Babik (2012) analysed MHC class II exon 2 variation and tested selection in the whole Lissotriton vulgaris species complex. In the present study we analysed a longer fragment and intensively sampled only three out of nine evolutionary lineages within the complex. Therefore we report diversity and results of tests of selection using only sequences obtained in the current study. Divergence between alleles was estimated in MEGA7 (Kumar, Stecher, & Tamura, 2016) by calculating nucleotide (Tamura & Nei), synonymous/nonsynonymous (Nei & Gojobori) and amino acid (Poisson-corrected) distances. To test for the signal of positive selection we compared the fit of three codon-based models of evolution (M0, M7 and M8) in PAML (Yang, 2007). To speed up computations we excluded singleton alleles from the analyses. Codons under positive selection (posterior probability (PP) > 0.95) were identified using the Bayes Empirical Bayes procedure in PAML. The location of positively selected codons was compared to that of the ABS in human MHC class II (Tong et al., 2006).
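The text does not state which criterion was used to compare the codon models. One common choice for the nested pair M7 and M8 is a likelihood-ratio test with two degrees of freedom, sketched below under that assumption. The function name and the log-likelihood values are hypothetical; in practice the log-likelihoods would be read from the PAML output.

```python
import math


def lrt_m7_vs_m8(lnl_m7, lnl_m8):
    """Likelihood-ratio test of M8 (selection) against M7 (no selection).

    Assumes the conventional chi-square reference distribution with 2
    degrees of freedom, for which the survival function is exp(-x / 2).
    """
    stat = max(0.0, 2.0 * (lnl_m8 - lnl_m7))
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods (illustration only, not values from this study).
stat, p_value = lrt_m7_vs_m8(lnl_m7=-2500.0, lnl_m8=-2480.0)
```

A small p-value would indicate that allowing a class of codons with ω > 1 (M8) fits the data better than the beta-distributed ω of M7.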

Classical MHC alleles and supertypes

MHC class I alleles were previously classified, on the basis of expression level and sequence similarity, into two classes: i) HEX – intermediate and high expression, putative functional alleles, ii) LEX – low expression, putative nonclassical/nonfunctional alleles (Fijarczyk, Dudek, Niedzicka, & Babik, 2018). New MHC class I alleles detected in the current study were incorporated into this classification using the rules described in Fijarczyk et al. (2018). For MHC class II, information on the expression status is more limited, although most alleles amplified with our primers appear to be expressed (Nadachowska-Brzyska et al., 2012). We have therefore not attempted to classify the class II alleles into putative classical and nonclassical/nonfunctional groups.

The idea of supertype analysis is to cluster alleles into classes of functionally similar sequences as defined by the physicochemical properties of amino acids in positions that determine the specificity of antigen binding. Despite an overwhelming signal of positive selection in both MHC classes, only a few codons were identified as positively selected (5 in class I and 2 in class II), too few to define supertypes. Therefore we used the combination of those positively selected codons and previously described human ABS (Reche & Reinherz, 2003; Tong et al., 2006). Five physicochemical descriptors of each amino acid (Sandberg, Eriksson, Jonsson, Sjöström, & Wold, 1998) were used to group alleles using K-means clustering in adegenet (Jombart, 2008). Based on the Bayesian Information Criterion (BIC) we identified 25 supertypes for class I (only HEX alleles were included) and 22 supertypes for class II. We note two limitations of the supertype approach as applied to the newt dataset. First, because of the small number of codons identified as positively selected, supertypes were based mostly on human ABS positions. Second, the value of BIC decreased monotonically with the increasing number of supertypes, making identification of the number of supertypes somewhat arbitrary. The supertypes should thus be regarded as clusters of alleles based on amino acid similarity at the most polymorphic positions rather than robustly defined groups of functionally similar alleles. As a consequence, supertype-based results should be treated with caution.
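As a rough illustration of the input to such a clustering, the sketch below turns the residues at the selected positions of each allele into a numeric vector of five descriptors per position. The function name, the allele names and the descriptor table are hypothetical; in particular, the numbers in the table are placeholders and are NOT the values of Sandberg et al. (1998). The K-means and BIC steps performed in adegenet are not reproduced here.

```python
def encode_allele(residues, descriptors):
    """Concatenate the descriptor vectors of the residues at the chosen sites.

    residues    -- string of amino acids at the selected (ABS / positively
                   selected) positions of one allele
    descriptors -- dict mapping an amino acid to a tuple of five numbers
    """
    vector = []
    for aa in residues:
        vector.extend(descriptors[aa])
    return vector


# Placeholder descriptor table: made-up numbers, NOT the published values.
toy_descriptors = {
    "A": (1.0, 0.0, 0.0, 0.0, 0.0),
    "G": (0.0, 1.0, 0.0, 0.0, 0.0),
    "L": (0.0, 0.0, 1.0, 0.0, 0.0),
}

# Hypothetical alleles, each reduced to three selected positions.
alleles = {"allele_1": "AGL", "allele_2": "LGA"}
features = {name: encode_allele(res, toy_descriptors)
            for name, res in alleles.items()}
# Each allele becomes a vector of 3 positions x 5 descriptors = 15 numbers.
```

The resulting allele-by-feature matrix is the kind of table a K-means routine would then partition into candidate supertypes.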

Simulations

Spatially explicit, forward-in-time simulations were performed using Selector (Currat, Gerbault, Di, Nunes, & Sanchez-Mazas, 2015). Each species was represented by 15 demes exchanging migrants according to the stepping stone model, i.e. only between adjacent demes. Two transects, 5 demes each, were connected by 5 vertically arranged demes (Fig. S1). Because the distance between transects in the IN zone was approximately 5 times larger than the length of the transects, migration between the connecting demes was set to ca. 0.27 of that between demes within transects to reduce the computational burden (Charlesworth & Charlesworth, 2010, eq. 7.8a), so that the 5 connecting demes effectively corresponded to 25 demes exchanging migrants at the same rate as demes within transects. Three strengths of migration between demes within a transect were evaluated (Nm = 0.1, 0.5, 2.5).

At the beginning of the simulations (Fig. S1I), only a single, centrally located deme within each species was occupied (N0 = 1000 individuals) and both species shared a single pool of alleles (na = 15 – 500). Selector starts with a uniform allele frequency distribution, so the initial differences between the species were minor, resulting only from sampling error. This setting emulated the split of a single ancestral species into two descendant species of equal sizes. Note, however, that the number of alleles maintained within each species was limited (Fig. S5), so depending on the initial number of alleles, a large fraction of alleles could be lost from each species, quickly reducing the number of shared alleles. No mutations were allowed throughout the simulations. Various strengths of negative frequency dependent selection (s = 0.05 – 0.3) were evaluated and selection was kept constant throughout a simulation; fitness of an allele with frequency f(a) was defined as 1 – f(a)s. Following establishment of the founding populations, colonization of the initially empty demes (identical carrying capacities, N = 100 – 1000, growth rate 0.5) within each species and migration between demes occurred for 1000 generations. Then the carrying capacity of the founding demes was set to that of the other demes (Fig. S1II). Both species evolved in isolation for a further 15 000 generations and then hybridization was allowed by setting the carrying capacity of two previously unoccupied demes, one in the centre of each transect, to 0.01 – 0.1N (Fig. S1III). Thus, in each transect the hybrid zone consisted of a single deme. Immigration into the zone was high and symmetrical, i.e. demes adjacent to the zone sent an identical number of emigrants in each direction, and emigration from the zone was reduced to 0.01 – 0.1 of that value (controlled by the carrying capacity of the hybrid zone population), making the zone a barrier to introgression. Hybridization was allowed for 160 – 1600 generations, corresponding to 0.01 – 0.1 of the time of evolution in isolation. For each combination of parameter values 50 simulations were performed.

At the end of each simulation, samples of 16 individuals were taken from each deme within a transect (except for the hybrid zone deme, Fig. S1IV) and the following statistics were calculated: i) percentage of variation explained by the between-species and between-transects-within-species AMOVA components, ii) fraction and number of alleles shared between species and between transects within species, iii) fraction and number of alleles shared exclusively (those shared by two focal groups but absent from all other groups) between transects within species.

To investigate the temporal and spatial dynamics of introgression we recorded the fraction of introgressed gene copies in each deme within the transect at various times following hybridization. A single transect with demes of N = 250 and distinct (non-overlapping) sets of alleles in each species was simulated, which allowed straightforward calculation of the fraction of introgressed gene copies. Scenarios with a single allele initially fixed within each species and with 5 and 15 alleles per species were investigated. The latter reflected the number of alleles maintained within species for deme size N = 250 individuals under strong negative frequency dependent selection (Fig. S5). Immediately after establishment of populations we turned on hybridization and recorded, using actual allele frequencies reported by Selector, the fraction of introgressed gene copies in each deme at different times following the onset of hybridization. We investigated both neutral and negative frequency dependent scenarios (s = 0 and 0.3, respectively) and different strengths of introgression, 0.01 – 0.1 of migration between demes within species (Nhm = 0.025 – 0.25). For each combination of parameters 50 simulations were performed.

The simulations described above assumed a single multiallelic locus, but MHC in newts is multilocus and both classes are tightly linked. To check for the effect of these differences in genetic architecture between simulated and real data, we ran a subset of scenarios modelling multilocus MHC haplotypes explicitly. The following parameter values were investigated: s = 0.3, Nm = 2.5, na = 500 (in this case the initial number of multilocus haplotypes), N = 100, 250, 1000, Nhm = 0, 0.04, 0.1, and the time of hybridization was set to 0.1 of the time of evolution in isolation.

To obtain haplotypes for simulations we used the following procedure. First, the empirical distributions of (i) the number of MHC alleles per individual (class I and II combined) and (ii) allele frequencies were constructed with both species combined, but excluding mixed populations. Second, we estimated the parameters of (iii) a normal distribution describing the number of MHC alleles per haplotype such that, when alleles on haplotypes were sampled from (ii), it produced the distribution of the number of alleles per individual most similar to (i). At the beginning of each Selector simulation, a haplotype with the number of MHC alleles sampled from distribution (iii) and their identity sampled from the allele frequency distribution (ii) was assigned to each allele simulated by Selector. Following the simulation, AMOVA was performed with the number of pairwise differences between haplotypes as the distance measure.
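Three pieces of the procedure above lend themselves to a short sketch: the frequency-dependent fitness 1 – f(a)s, the construction of a multilocus haplotype, and the pairwise-difference distance between haplotypes. The code below is not Selector's implementation. All function names and the allele frequency table are hypothetical, and several details are assumptions: the sampled allele count is rounded and kept between 1 and the number of available alleles, alleles within a haplotype are distinct, and the number of pairwise differences is taken as the number of alleles present in one haplotype but not the other. The mean (7.6) and variance (3.04) are those reported in the caption of Fig. S6.

```python
import random


def nfds_fitness(freq, s):
    """Fitness of an allele at frequency freq under selection strength s."""
    return 1.0 - freq * s


def sample_haplotype(allele_freqs, mean_n, sd_n, rng):
    """Draw one haplotype as a set of distinct alleles.

    The allele count comes from a normal distribution (rounded; clamped to
    1..number of alleles, an assumption); identities are drawn in
    proportion to allele_freqs.
    """
    alleles = [a for a, f in allele_freqs.items() if f > 0]
    weights = [allele_freqs[a] for a in alleles]
    n = max(1, min(round(rng.gauss(mean_n, sd_n)), len(alleles)))
    chosen = set()
    while len(chosen) < n:
        chosen.add(rng.choices(alleles, weights=weights)[0])
    return frozenset(chosen)


def pairwise_differences(hap_a, hap_b):
    """Number of alleles found in exactly one of the two haplotypes."""
    return len(hap_a ^ hap_b)


rng = random.Random(1)
# Hypothetical, uniform allele frequencies (illustration only).
toy_freqs = {f"allele_{i}": 1.0 / 20 for i in range(20)}
hap_a = sample_haplotype(toy_freqs, 7.6, 3.04 ** 0.5, rng)
hap_b = sample_haplotype(toy_freqs, 7.6, 3.04 ** 0.5, rng)
distance = pairwise_differences(hap_a, hap_b)
```

With s = 0.3, an allele at frequency 0.5 has fitness 0.85 and an allele at frequency 0.05 has fitness 0.985, which is how rare alleles gain their advantage.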

Supplementary References

Charlesworth, B., & Charlesworth, D. (2010). Elements of Evolutionary Genetics. Greenwood Village: Roberts.
Currat, M., Gerbault, P., Di, D., Nunes, J. M., & Sanchez-Mazas, A. (2015). Forward-in-time, spatially explicit modeling software to simulate genetic lineages under selection. Evolutionary Bioinformatics, 11, EBO.S33488.
Fijarczyk, A., Dudek, K., Niedzicka, M., & Babik, W. (2018). Balancing selection and introgression of newt immune-response genes. Proceedings of the Royal Society of London B: Biological Sciences, 285(1884), 20180819.
Jombart, T. (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403-1405.
Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution, 33(7), 1870-1874.
Nadachowska-Brzyska, K., Zielinski, P., Radwan, J., & Babik, W. (2012). Interspecific hybridization increases MHC class II diversity in two sister species of newts. Molecular Ecology, 21(4), 887-906.
Reche, P. A., & Reinherz, E. L. (2003). Sequence variability analysis of human class I and class II MHC molecules: Functional and structural correlates of amino acid polymorphisms. Journal of Molecular Biology, 331(3), 623-641.
Sandberg, M., Eriksson, L., Jonsson, J., Sjöström, M., & Wold, S. (1998). New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. Journal of Medicinal Chemistry, 41(14), 2481-2491.
Sebastian, A., Herdegen, M., Migalska, M., & Radwan, J. (2016). AmpliSAS: a web server for multilocus genotyping using next-generation amplicon sequencing data. Molecular Ecology Resources, 16(2), 498-510.
Tong, J. C., Bramson, J., Kanduc, D., Chow, S., Sinha, A. A., & Ranganathan, S. (2006). Modeling the bound conformation of Pemphigus Vulgaris-associated peptides to MHC class II DR and DQ alleles. Immunome Research, 2, 1.
Yang, Z. H. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8), 1586-1591.

Supplementary Tables S1 to S4 are in a separate Excel Workbook

Table S1. Sampling and MHC variation.
Table S2. Sequence divergence between MHC class II alleles.
Table S3. AMOVA for supertypes.
Table S4. MHC and genome-wide cline parameters.

Supplementary Figures

Fig. S1. Design of simulations. Four stages of simulations shown in panels I – IV were: I) an ancestral species split into two equally sized, completely isolated descendant species, which thus initially shared a pool of alleles, II) following the split, each species colonized a world of demes conforming to a one-dimensional stepping stone model arranged into a horseshoe shape that approximated the two transects of the IN zone, i.e. the distance between transects was 5 times larger than the length of a transect within each species, III) following prolonged evolution in isolation, secondary contact and hybridization ensued; a single-deme hybrid zone acted as a partial barrier to introgression: immigration into the zone was high but emigration from the zone was strongly reduced, as in classical tension zone models. Details of simulations are in Supplementary Methods.

Fig. S2. Per-individual number of MHC alleles. Red – L. montandoni, green – L. vulgaris; individuals from syntopic populations were excluded. Dashed lines show the averages. A) MHC class I, B) MHC class II.

Fig. S3. Sequence logo summarizing amino acid variation among sequences of MHC class II alleles. Positions identified as the Antigen Binding Sites (ABS) in all MHC class IIB genes are highlighted in yellow. Asterisks denote codons under positive selection. Note the high diversity in most ABS positions.

Fig. S4. Principal Component ordination of individuals based on MHC data. In all plots individuals from syntopic populations were excluded; light green – L. vulgaris inside the Carpathian Basin (IN), dark green – L. vulgaris outside the Carpathian Basin (OUT), light red – L. montandoni IN, dark red – L. montandoni OUT. PCAs are based on: A) all MHC class I alleles, B) all MHC class II alleles, C) putative functional MHC class I alleles (HEX), D) putative nonclassical/nonfunctional MHC class I alleles (LEX).

Fig. S5. Relationship between MHC variation and intraspecific migration. All results are for scenarios without hybridization, with an initial number of alleles na = 500, after 16 000 generations; migration is expressed as the product of deme size and migration rate (Nm). A) Number of alleles maintained within species, B) Number of alleles shared between transects within species. Means and 95% confidence intervals from 50 simulations are shown.

Fig. S6. Comparison of simulation results with MHC modelled as a single multiallelic locus (A, C) or as multilocus haplotypes (B, D). For each haplotype the number of alleles was sampled from the normal distribution with mean = 7.6 and variance = 3.04, while their identity was sampled from the empirical distribution of allele frequencies (for details see Supplementary Methods). A, B: Percentage of total variance explained by the between-species and between-transects-within-species AMOVA components. C, D: Percentage of alleles shared exclusively (those shared by two focal groups but absent from all other groups) between species within transect and between transects within species.
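The definition of exclusive sharing used in panels C and D can be written as a small set operation. The sketch below is illustrative only: the function name, group labels and allele labels are hypothetical, and it returns the set of exclusively shared alleles without choosing a denominator for the percentage, which the caption does not specify.

```python
def shared_exclusively(groups, first, second):
    """Alleles present in both focal groups and absent from every other group.

    groups maps a group label (e.g. one species in one transect) to the set
    of alleles observed in it.
    """
    others = set()
    for name, alleles in groups.items():
        if name not in (first, second):
            others |= alleles
    return (groups[first] & groups[second]) - others


# Hypothetical allele sets for two species (mont, vulg) in two transects.
groups = {
    "mont_T1": {"a", "b", "c"},
    "mont_T2": {"b", "c", "d"},
    "vulg_T1": {"c", "e"},
    "vulg_T2": {"e", "f"},
}
within_species = shared_exclusively(groups, "mont_T1", "mont_T2")
between_species = shared_exclusively(groups, "mont_T1", "vulg_T1")
```

In this toy example allele "b" is shared exclusively between the two transects of one species, whereas allele "c", although shared between species within a transect, is not exclusive because it also occurs in a third group.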
