Supplementary Materials for

Whole Genome Sequences Of Malawi Reveal Multiple Radiations Interconnected By Gene Flow Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, and Richard Durbin correspondence to: [email protected], [email protected]

This PDF file includes:

Materials and Methods Supplementary Text Figs. S1 to S32 Tables S1 to S8

Materials and Methods DNA extraction and sequencing: DNA was extracted from fin clips using the PureLink® Genomic DNA extraction kit (Life Technologies). Genomic libraries for paired-end sequencing on the Illumina HiSeq 2000 machine were prepared according to the Illumina TruSeq HT protocol to obtain paired-end reads with mean insert size of 300-500bp. As detailed in Supplementary Table 1, we used either Illumina HiSeq v3 chemistry (generating 100bp paired-end reads) or Illumina HiSeq v4 reagents (125bp paired-end reads). Low coverage (~6x) samples with v4 reagents were multiplexed 12 per lane. High coverage (~15x) v4 samples were multiplexed four per lane. For high coverage (~15x) v3 samples, a multiplexed library with 8 samples was sequenced over three lanes. The nine trio samples were multiplexed across eight lanes using the v3 chemistry, delivering approximately 40x coverage per individual. Raw data have been deposited at the NCBI Sequence Read Archive under BioProject PRJEB1254; sample accessions are listed in Table S4.

Alignment: Reads were aligned to the Metriaclima zebra reference assembly version 1.111 (Table S5) using the bwa-mem v.0.7.10 algorithm82 with default options. For the trio parent samples that were used in the main analysis, we aligned data from only three of the eight lanes, aiming to have the same genome coverage for the trios as for the remaining (15x) samples. For each sample, 96-98% of reads could be aligned to the reference. Duplicate reads were marked on both per-lane and per sample basis using the MarkDuplicates tool from the Picard software package with default options (http://broadinstitute.github.io/picard). Local realignment around indels was performed on both per lane and per sample basis using the IndelRealigner tool from the GATK v3.3.0 software package83.

Variant calling, filtering, and genotype refinement: Briefly, SNP and short indel variants against the M. zebra reference were called independently using GATK v3.3.0 haplotype caller84 and samtools/bcftools v.1.185. Variant filtering was then performed on the GATK variant calls using hard filters based on overall depth, quality by depth, excess missingness, excess of reads with zero mapping quality, strand/mapping bias, and inbreeding coefficient (see below). After filtering the GATK dataset, we performed an intersection of GATK and samtools sites and kept only variant sites present in both datasets. If the GATK and samtools alleles differed at a particular locus, we kept the GATK allele. At this point, multiallelic sites were excluded and we used genotype likelihoods output by GATK sites to perform genotype refinement, imputation, and phasing in BEAGLE v.4.086. Indels were retained for the genotype refinement step, but later generally excluded from analyses using vcftools v0.1.12b option --remove-indels, except where specifically indicated.

The particular commands/parameters used were: GATK haplotype caller (per sample): java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R REFERENCE.fa -- emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 –I SAMPLEn.bam –o GATK_SAMPLEn.g.vcf Haplotype caller per-sample files were combined using the GATK GenotypeGVCFs tool with the --includeNonVariantSites option so that every basepair of the assembly was represented in the multisample VCF file.

Hard filters were applied to the following GATK annotations: Minimal inbreeding coefficient: 'InbreedingCoeff < -0.6' Minimum overall read depth: 'DP < 1000' Maximum overall read depth: 'DP > 3000' (except for mtDNA: scaffolds 747,2036) Max phred-scaled p-value from Fisher's exact test to detect strand bias: 'FS > 40.0' (except for mtDNA: scaffolds 747,2036) QualityByDepth: 'QD < 2.0’ Excess Missingness: 'NCC > 32' (>16 individuals with missing data) More than 10% of reads have mapping quality zero: '(MQ0/(1.0*DP)) > 0.10' Low mapping quality: 'MQ < 40.0' samtools calling (multisample): samtools mpileup -t DP,DPR,INFO/DPR -C50 -pm2 -F0.2 –ugf REFERENCE.fa SAMPLE1.bam SAMPLE2.bam ... | bcftools call -vmO z -f GQ -o samtools_VARIANTS.vcf.gz The consensus GATK and samtools call set was obtained using the GATK: java -Xmx10000m -jar GenomeAnalysisTK.jar -T SelectVariants -R REFERENCE.fa - -variant onlyVariants_filtered.vcf.gz -o onlyVariants_filtered_concord.vcf.gz --concordance samtools_unfiltered.vcf.gz BEAGLE genotype refinement (per scaffold) - specifying the trio relationships: java -jar beagle.r1398.jar gl=onlyVariants_filtered_concord_sc${sc}.vcf.gz ped=../Malawi_trios.pedind nthreads=8 ibd=true ibdtrim=200 phase-its=8 impute-its=8 out=beagle_onlyVariants_filtered_concord_sc_${sc}

Accessible genome The accessible genome was defined by masking every basepair of the genome assembly (including invariant sites) where any of the filters applied to the GATK dataset or where a GATK variant was absent in the samtools dataset. Excluding gaps, the reference assembly contains 713.6Mb of sequence. Our approach masked out just under 61Mb of sequence, resulting in an ‘accessible genome’ of 653Mb.

Shapeit phasing To further improve the quality of haplotype phasing, we re-phased the BEAGLE output data for scaffolds larger than 1Mb (scaffolds 0 to 201) using the shapeit v2.r790 haplotype phasing method 87 including the use of phase-informative reads88. The specific commands (with full trios included) were as follows: extractPIRs --bam all_alignment_files_sc_${i}.txt --vcf onlyVariants_filtered_concord_sc${sc}.vcf.gz --out sc_${i}_PIRlist shapeit -assemble --input-ped beagle_wTrios_sc${sc}.ped beagle_wTrios_sc${sc}.map --input-pir sc_${i}_PIRlist -O sc_${i}_phased_trio_ped --thread 4 --window 0.5 To check phasing accuracy, we also ran the above phasing process including only unrelated individuals. Then we compared switch error rates in the trio samples between the two datasets using the vcftools v0.1.12b option --diff-switch-error (Fig. S24).

Variant calls for outgroup Astatotilapia A separate variant call set was generated using short reads from 19 individuals from seven outgroup Astatotilapia (Table S2), 13 individuals representing the eco- morphological groups (C. afra, M. zebra, A. calliptera from Salima and Lake Massoko, O. lithobates, T. nigriventer, C. likomae, A. geoffreyi, L. gossei, D. limnothrissa, D. ngulube, R. longiceps, and R. woodi) and the more distantly related Lake Tanganyika N. brichardi. The N. brichardi short read data (100bp reads, ~15x coverage) were downloaded from the NCBI Short Read Archive (Accession: SRR077327). Because of the greater divergence between all these samples (compared with the within-Malawi dataset), we choose to map the reads to the Oreochromis niloticus (Nile Tilapia; version Orenil1.1) genome assembly which is equally distant from all the samples. The technical details of mapping and variant calling were as for the within-Malawi dataset (see above), except that we did not generate samtools calls (only GATK haplotype caller).

Because of mapping to a more distant reference, we applied slightly more stringent filtering on the variant calls compared with the within-Malawi dataset. Filters were applied to the following GATK annotations: Minimal inbreeding coefficient: 'InbreedingCoeff < -0.5' Minimum overall read depth: 'DP < 550' Maximum overall read depth: 'DP > 950' Fisher's exact test to detect strand bias: 'FS > 40.0' QualityByDepth: 'QD < 2.0’ Excess Missingness: 'NCC > 1' (any individuals with missing data) More than 10% of reads have mapping quality zero: '(MQ0/(1.0*DP)) > 0.10' Low mapping quality: 'MQ < 40.0' Excess heterozygosity: ‘ExcessHet >= 20’ In addition, we masked out all sites in the reference where any overlapping 50-mers (sub- sequences of length 50) could not be matched back uniquely and without 1-difference. For this we used Heng Li’s SNPable tool (http://lh3lh3.users.sourceforge.net/snpable.shtml), dividing the reference genome into overlapping k-mers (sequences of length k – we used k=50), and then aligning the extracted k-mers back to the genome (we used bwa aln -R 1000000 -O 3 - E 3). Finally, we removed multiallelic sites and indels and masked sites within +/- 3bp of indels. As separate accessible genome mask was produced for this dataset (see above).

Whole-genome alignments Pairwise alignments of cichlid genome assemblies listed in Tables S5 and S6 were generated using lastz v1.0 89, with the following parameters: For cichlid vs. cichlid alignments: B=2 C=0 E=150 H=0 K=4500 L=3000 M=254 O=600 Q=human_chimp.v2.q T=2 Y=15000 For cichlid vs. other teleost alignments: B=2 C=0 E=30 H=0 K=3000 L=3000 M=50 O=400 T=1 Y=9400 This was followed by using Jim Kent’s axtChain tool with -minScore=5000 for cichlid- cichlid and -minScore=3000 for cichlid-other teleost alignments. Additional tools with default parameters were then used following the UCSC whole-genome alignment paradigm90 (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) in order to obtain a contiguous pairwise alignment.

Linkage disequilibrium calculations Linkage disequilibrium (LD) calculations (haplotype r2) between pairs of SNPs were performed along the phased scaffolds 0 to 201, using vcftools v0.1.12b with the options --hap- r2 --ld-window-bp 50000. To reduce the computational burden, we used a random subsample of 10% of SNPs. We binned the r2 values according to the distance between SNPs into 1kb or 100bp windows and plotted the average values in each bin.

To estimate background LD, we calculated haplotype r2 between variants mapping to different linkage groups (LG) in the Oreochromis niloticus genome assembly. First, we used the chain files generated by the whole genome alignment pipeline90 and the UCSC liftOver tool to translate the genomic coordinates of all SNPs to the O. niloticus coordinates. Then we calculated LD between variants mapping to LG1 and LG2.

De novo mutation rate estimation In each trio we looked for mutations in the child that were not present in either of its parents. Because the results of this analysis are very sensitive to false positives and false negative rates, we applied more stringent genome masks than in the population genomic work. Specifically, we defined the “Accessible Genome” for each trio by excluding: 1. Genomic regions where mapped read depth in any member of a trio is ≤25× or >50× 2. Bases where either of the parents has a mapped read that does not match the reference (the specific bases where any read has non-reference alleles in the parents were masked) 3. Sequences where indels were called in any sample (we also excluded +/- 3bp of sequence surrounding the indel) 4. Sites which were called as multiallelic among the nine samples in the overall trios dataset 5. Known segregating variable sites - i.e. sites with alternative alleles found in four and more copies in the overall Lake Malawi dataset 6. Sites in the reference where less than 90% of overlapping 50-mers (sub-sequences of length 50) could be matched back uniquely and without 1-difference. For this we used Heng Li’s SNPable tool (http://lh3lh3.users.sourceforge.net/snpable.shtml), dividing the reference genome into overlapping k-mers (sequences of length k – we used k=50), and then aligning the extracted k-mers back to the genome (we used bwa aln -R 1000000 -O 3 -E 3).

At a heterozygous site we assumed equal probability of sampling either of the two alleles. Therefore, Na - the number of reads supporting the alternative allele is distributed as: ~Binomial(ReadDepth, 0.5). We filtered out variants with observed Na below 2.5th or above 97.5th percentiles of this distribution, thus accepting a false negative rate of 5%. We also filtered out sites where the offspring call had Read Position or Base Quality rank sum test Z-score > 99.5th percentile of standard normal distribution or where the Strand Bias phred scaled p-value was ≥ 20 or where the phred scaled GQ (genotype quality) in either mother, father, or offspring was ≤ 30. For simplicity, assuming these filters are independent they are expected to introduce a false negative rate of 7.17%. The mutation rate estimate was adjusted to account for this.

Finally, because any observed de novo mutation could have occurred either on the chromosome inherited from the mother or on the chromosome inherited from the father, the point estimate of the per generation per basepair mutation rate is: µ = nMutations/(2×AccessibleGenome). The 95% confidence intervals for the number of observed mutations were calculated using the “exact” method relating chi-squared and Poisson distributions91,92. If N is the number of observed mutations, the lower (ciNL) and upper (ciNU) limits are: ! !(!! ! !.!"#) ��� = !(!!! !!.!"#) ��� = ! !!! ! ! ! ! where 2N and 2(N+1) are degrees of freedom of the corresponding chi-squared distributions.

Sample selection for phylogenetic and chromopainter analyses To prevent potential confounding effects of uneven sequencing depth on reconstructions of species relationships, we limited these analyses to one high coverage (15x) individual per species. Species without a high coverage sample (P. subocularis, F. rostratus and L. trewavasae) were not included.

Local phylogenetic trees and maximum clade consensus To generate a multiple alignment input in fasta format for building local phylogenetic trees, we used our program evo (available from https://github.com/millanek/evo) with the getWGSeq subprogram. We set the window size in terms of the numbers of variants rather than physical length (typically 8000 variants; --split 8000 option) aiming for the local regions to have similar strengths of phylogenetic signal (any small windows at the ends of scaffolds were discarded). We limited the sequence output to the accessible genome by using the -- accessibleGenomeBED option. The N. brichardi (Lake Tanganyika outgroup) sequence from the M. zebra-N. brichardi pairwise alignment in M. zebra genomic coordinates and with indels removed (deletions in N. brichardi filled by N characters) was added by supplying its sequence via the --incl-Pn option.

Maximum likelihood phylogenetic trees for each region were inferred using RAxML v7.7.893 under the GTRGAMMA model (General Time Reversible model of nucleotide substitution with the model of rate heterogeneity). The best tree for each region was selected out of twenty alternative runs on distinct starting maximum parsimony trees (using the -N 20 option).

The maximum clade credibility (MCC) trees were calculated in TreeAnnotator v.2.4.2, a part of the BEAST2 platform94. Clade credibility is simply the frequency with which a clade appears in the tree set; the MCC tree is the tree (from among the trees in the set) that maximizes the product of the frequencies of all its clades 33. The node heights for the MCC trees are derived as a summary from the heights of each clade in the whole tree set via the Common Ancestor heights option.

Mitochondrial sequence (mtDNA) phylogenies The mtDNA sequence is present in the Metriaclima zebra reference (Table S5), corresponding to scaffolds 747 and 2036. We selected all variants from these scaffolds and subjected them to the same filtering as in the rest of the genome (see above) except for the depth filter because the mapped read depth over these two scaffolds was much higher than in the rest of the genome (approximately 300-400x per sample). Because of the greater sequence diversity in the mtDNA genome, we found that a non-negligible proportion (more than 10%) of variants were multiallelic. We separated SNPs from indels at multiallelic sites using bcftools norm with the --multiallelics - option, then removed indels and the merged multiallelic SNPs back together with the --multiallelics + option. Sequences in the fasta format were generated using the bcftools consensus command, with any sites with missing genotypes in the VCF replaced by the N character with the --mask option. N. brichardi (Lake Tanganyika outgroup) sequence for scaffolds 747 and 2036 was simply added to the fasta files from the M. zebra-N. brichardi pairwise alignment in M. zebra genomic coordinates and with indels removed (deletions in N. brichardi filled by N characters). A maximum likelihood phylogenetic tree was inferred using RAxML v7.7.893 under the GTRGAMMA model. The best tree was selected out of twenty alternative runs on distinct starting maximum parsimony trees (using the -N 20 option) and two hundred bootstrap replicates were obtained using RAxML’s rapid bootstrapping algorithm95 satisfying the -N autoFC frequency-based bootstrap stopping criterion. Bipartition bootstrap support was drawn on the maximum likelihood tree using the RAxML -f b option.

Distance based trees and ‘tree violation’ For Neighbor-Joining (NJ)96 and UPGMA clustering methods, we calculated the average numbers of single nucleotide differences between haplotypes for each pair of species. This simple pairwise difference matrix was then divided by the accessible genome size to obtain pairwise differences per bp and used as input into the distance based tree-building methods implemented in the APE package97 in R language by the nj() and upgma() functions.

We measured the distances between all pairs of species in the reconstructed NJ tree (i.e. the lengths of branches) using the get_distance() method implemented in the ETE3 toolkit for phylogenetic trees98. Our first measure of ‘tree violation’ is the difference between these distances and the distances between samples in the original matrix that was used to build the NJ tree.

Bayesian multispecies coalescent (SNAPP) The multispecies coalescent model accounts for intraspecies variation and incomplete lineage sorting in estimation of species relationships99. We used the SNAPP package44 to estimate the species trees under the multispecies coalescent directly from unlinked biallelic SNPs. Application of SNAPP to the full set of 70+ species and hundreds of thousands of biallelic markers is not computationally feasible. We used a random subset of ~0.5% of genome-wide SNPs (48,922 SNPs) for 12 individuals representing the eco-morphological groups and the Lake Victoria outgroup P. nyererei whose alleles were filled in based on the whole genome alignment (see above).

The P. nyererei alleles were assigned as ‘ancestral’ (0 in the nexus input file). The ‘forward‘ and ‘backward‘ mutation rate parameters u and v were calculated directly from the data by SNAPP (the Calc mutation rates option). The default value 10 was used for the Coalescent rate parameter and the value of the parameter was sampled (estimated in the MCMC chain). The priors were set to be uninformative as we don’t assume strong a priori knowledge about the parameters. The prior for ancestral population sizes was chosen to be a relatively broad gamma distribution with parameters � = 4 and � = 20. The tree height prior � was set to the initial value of 100 but sampled in the MCMC chain with an uninformative uniform hyperprior on the interval [0,50000].

We run three independent MCMC chains with the same starting parameters, each on 30 threads with a total runtime of over 10 CPU years. The first one million steps from each MCMC chain was discarded as burn-in. In total, more than 30 million MCMC steps were sampled in the three runs. For the MCMC traces for each run see Fig. S25.

Chromopainter and fineSTRUCTURE Singleton SNPs were excluded using the bcftools v.1.1 -c 2:minor option, before exporting the remaining variants in the PLINK format100. The chromopainter v0.0.4 software 45 was then run for the 201 largest genomic scaffolds on shapeit phased SNPs. Briefly, we created a uniform recombination map using the makeuniformrecfile.pl script, then estimated the effective population size (Ne) for a subsample of 20 individuals using the 45 chromopainter inbuilt expectation-maximization procedure , averaged over the 20 Ne values using the provided neaverage.pl script. The chromopainter program was then run for each scaffold independently, with the -a 0 0 option to run all individuals against all others. Results for individual scaffolds were combined using the chromocombine tool before running fineSTRUCTURE v0.0.5 with 1,000,000 burn in iterations, and 200,000 sample iterations, recording a sample every 1,000 iterations (options -x 1000000 -y 200000 -z 1000). Finally, the sample relationship tree was built with fineSTRUCTURE using the -m T option and 20,000 iterations.

The ABBA-BABA related f statistics The f statistic was developed to estimate the proportion of introgressed material in an admixed 47 49 population [see SOM18 in ref. , and fG in ref. ]. Assume we have four populations with given phylogeny (((A, B), C), O). If we denote:

� �, �, �, � = 1 − �! �!�! 1 − �! − �!(1 − �!)�!(1 − �!) !"#$ then an estimator of the f-statistic is given by: �(�, �, �, �) � = �(�, �!, �!, �) where C1 and C2 designate two subsamples of population C. In a scenario where genetic material introgressed from population C into population B, � estimates the genomic proportion in B tracing its ancestry through the introgression event from C. Note that in theory C1 should be sampled from the population ancestral to C that introgressed into the ancestral population of B. However, in practice we use random subsamples of our sample for C.

Neolamprologus brichardi was used as the outgroup O for the reported calculations, however, using Pundamilia nyererei instead yielded very similar results. We computed the f statistic for all trios ((A, B), C). To make the total number of comparisons manageable, we binned samples from multiple species into subgroups that form monophyletic clades in the NJ tree in Fig 2A. The samples that violate the phylogeny in Fig. 2 most strongly (A. calliptera Kingiri, O. tetrastigma, C. trimaculatus, P. longimanus) were analyzed separately. We present results for a partition into 31 subgroups (Figs. 3, S12) which we found to represent the majority of the signal in the dataset while keeping the number of comparisons manageable. Statistical significance was assessed by computing block-jackknifes on windows of 60k SNPs in the original VCF. Multiple testing was accounted for by computing Bonferroni-Holm family-wise error rate (FWER).

The branch-specific fb(C) statistic f (or D) statistics calculated for different sets of species are not independent as soon as they share drift (that is, internal or external branches) (Fig. S26). At the same time, these correlations can be informative about the timing of introgression, i.e. about which (internal) branch was subject to ancestral structure or gene flow (Fig. S26A&B). To obtain a set of less correlated f statistics, we calculate for each branch �

�! � = ������! ���! � �, �, � where B runs over all clades that are descendants of �, and A over all clades that are descendants of �′� sister branch � (Fig. S26A). Thus �! � measures the relative excess of allele sharing between the descendants of branch � with the clade C, compared with allele sharing of the descendants of branch a with the clade C. Note that negative f scores, i.e. those where A and C share excess alleles relative to B, are set to zero in this calculation (but are accounted for in the score of sister branch �).

The idea behind taking the minimum across B is that for any B, a descendent of the branch b, � �, �, � measures excess allele sharing due to introgression (or ancestral structure) affecting b (Fig. S26B), but introgression events that affect a branch �′ descendent of b would only be picked up by � �, �, � if B descendent of �′, but not if B is not descendent of �′ (Fig. S26C). Hence, taking the minimum of � �, �, � across B gives a conservative estimate of the amount of excess allele sharing of b with C (relative to A) that is attributable to events involving b (rather than its descendent branches).

The rationale for taking the median across descendants of b’s sister branch � rather than the minimum is that additional introgression events from C (or a relative) into the descendent branches of � can dilute the signal of introgression from C into b since such events decrease � �, �, � . Conversely, introgression of an outgroup of A, B, C into a A would increase � �, �, � . Hence, by taking the median across A we reduce the chances of �! � being confounded by events on the branches between � and A. We note however, that there remains a general identifiability problem between events involving � or b. Results are qualitatively similar if the minimum is taken across A rather than the median (the number of significantly elevated �! � becomes 255 rather than 351).

Finally, we would like to point out that while the �! � statistic accounts for the inter- dependence of f statistics for different A and B, the same introgression event can still lead to correlated signals for different C, either because introgression involved the common ancestor of different C, or because of the fact that related C show correlated allele frequencies. We do not see a straightforward way to correct for these correlations, but suggest that a careful visual inspections of the data (Fig. 3C) allows one to draw conclusions on likely introgression patterns.

Similar to f scores, we also summarize block-jackknifing Z scores to get a branch-specific measure of significance �! � = ������! ���! � �, �, � .

Treemix analysis Following the treemix manual, allele frequencies in the subgroups described in Fig. S10 were calculated for all biallelic SNP on scaffolds 0-201 thus producing the input files. Then we ran treemix v1.12 using a range of values for the parameter specifying the number of migration events(-m). We always set N. brichardi as the root (-N N_brichardi), and used blocks of 5000 SNPs (-k 5000). The results were plotted using the R function plot_tree which is included with treemix.

Geometric morphometric analyses A total of 168 photographs were used to compare the gross body morphology of Astatotilapia calliptera to that of endemic Lake Malawi species and other East African Astatotilapia lineages (Table S7). Coordinates for 17 homologous landmarks [following ref. 101] were collected using tpsDig2 v2.26102. After landmark digitization, analysis of shape variation was carried out in the R statistical environment (v3.3.2) using the package GeoMorph v3.0.2103 for geometric morphometric analysis and Principal Component Analysis. First a General Procrustes Analysis was applied to remove non-shape variation and shape data were corrected for allometric size effects by performing a regression of Procrustes coordinates (10,000 iterations). The resulting allometry corrected residuals were used in PCA.

Maps Present day catchment boundary maps are based on ‘level 3’ detail of Hydro1K dataset from the US Geological Survey. We downloaded the watershed boundary data from the United Nations Environment Programme website (http://ede.grid.unep.ch) and processed it using the QGIS geographic information system software (http://www.qgis.org/en/site/).

Protein-coding gene annotations We used the BROADMZ2 annotation generated at the Broad Institute as a part of the cichlid genome project11. For all analyses, we removed overlapping transcripts using Jim Kent’s genePredSingleCover program. Genes whose annotated length in nucleotides was not divisible by three were discarded, as they typically had inaccuracies in annotation that would require manual curation (2495 out of 23698 genes). We also used the cichlid genome project11 assignment of homologs between the M. zebra genome reference and zebrafish (Danio rerio).

Coding sequence positive selection scan To detect selection acting on protein coding genes, we searched for coding sequences showing an excess of non-synonymous variation between Lake Malawi species when compared with synonymous variation. We used our custom C++ program evo (https://github.com/millanek/evo) with the getCodingSeq -H b --no-stats options to obtain the coding sequences for each allele and each gene in the following two datasets: 1) One high coverage (15x) individual where available for each Lake Malawi species, 2) Twenty A. calliptera individuals (excluding the offspring from the trio).

In addition, we created a third dataset based on pairwise whole genome alignments between the Lake Malawi M. zebra reference and the four other cichlid assemblies listed in Table S5. To obtain the sequences of the four cichlid species in M. zebra genomic coordinates, we used our evo program with aa-seq --anc-from-maf=1 options, removing indels (insertions in M. zebra in with respect to the other genome were filled by N characters

The excess of non-synonymous variation (�!!!) and the non-synonymous variation excess score (ΔN-S) were calculated on a per-gene basis as follows. Let NTS be the number of possible non- synonymous transitions and NTV the number of possible non-synonymous transversion between two sequences; analogously STS and STV represent possible synonymous differences. We do not specify the ancestral allele, and therefore consider it equally likely that allele � mutated into allele � or that allele � mutated to allele �. Then let �! be the number of observed non-synonymous mutations and �! the number of observed synonymous mutations. If there are more than one difference within a codon, all “mutation pathways” (i.e. the different orders in which mutations could have happened) have equal probabilities. When a particular allele contained a premature stop codon, the remainder of the sequence after the stop was excluded from the calculations.

Because the transition:transversion ratio in the Lake Malawi dataset was 1.73, and hence (because there are two possible transversions for each possible transition) the prior probability of each transition is 3.46 times that of each transversion, we account for the unequal probabilities of transitions and transversions in calculating the proportions of non-synonymous (�!) and of synonymous differences (�!) as follows: �! �! �! = �! = 3.46�!" + �!" 3.46�!" + �!"

The excess of non-synonymous variation (�!!!) is the average of �! − �! over pairwise sequence comparisons. Only between-species sequence comparisons are considered for the Lake Malawi dataset. We normalized the �!!! values in order to take into account the effect on the variance of this statics introduced by differences in gene length and by sequence composition. To achieve this, we used the leave-one-out jackknife procedure across different pairwise comparisons for each gene. The non-synonymous variation excess score (ΔN-S) is then: �!!! Δ!!! = ���������(�!!!)

Note that because the sequences are related by a genealogy, there is a correlation structure between the pairwise comparisons. Therefore, the jackknife approach substantially underestimates the true standard error of �!!! and is used here simply as a normalization factor.

We also calculated the above statistics for random non-coding regions, matching the gene sequences in length. We used the bedtools v2.26.0104 shuffle command to permute the locations of exons along the chromosomes. Of the total length of all the permuted sequences, 98.4% was within the ‘accessible genome’ and outside of coding sequences (we required at least 95% in any of the permuted locations). The specific command was: bedtools shuffle -chrom -I exons.bed -excl InaccessibleGenome_andExons.bed -f 0.05 -g chrom.sizes

Gene Ontology enrichment Zebrafish has the most extensive functional gene annotation of any fish species, providing a basis for Gene Ontology (GO)105 term enrichment analysis. Genome-wide, 16472 (77.9%) of M. zebra genes for whom we calculated Δ!!! had an assigned zebrafish ortholog.

Gene Ontology (GO) enrichment for the genes that were candidates for being under positive selection (the top 5% of Δ!!! values) was calculated in R using the topGO v2.26.0 package106 from the Bioconductor project107. The GO hierarchical structure was obtained from the GO.db v3.4.0 annotation and linking zebrafish gene identifiers to GO terms was accomplished using the org.Dr.eg.db v3.4.0 annotation package. Genome-wide, between 9024 and 9353 genes had a GO annotation that could be used by topGO, the exact number depending on the GO category being assessed. The nodeSize parameter was set to 5 to remove GO terms which have fewer than 5 annotated genes, as suggested in the topGO manual.

There is often an overlap between gene sets annotated with different GO terms, in part because the terms are related to each other in a hierarchical structure105. This is partly accounted for by our use in topGO of the weight algorithm that accounts for the GO graph structure by down- weighing genes in the GO terms that are neighbors of the locally most significant terms in the GO graph60. All the p-values we report are from the weight algorithm, which the authors suggest should be reported without multiple testing correction106.

Some interdependency between significant GO terms remains after using the weight algorithm. Therefore, we used the Enrichment Map108 app for Cytoscape (http://www.cytoscape.org) to organize all the significantly enriched terms into networks where terms are connected if they have a high overlap, i.e. if they share many genes.

Diplotaxodon and deep benthic convergence To obtain a quantitative measure of the similarity between and the extent of excess diversity in the Diplotaxodon and deep benthic amino acid sequences, we calculated simple statistics based on the proportions of non-synonymous differences (�! scores). Intuitively, the similarity score is high if Diplotaxodon and deep benthic jointly have high �! from all the others, but are not very different from each other relative to how much diversity there is within Diplotaxodon and deep benthic.

Specifically, the similarity score � is calculated as follows: ! ! ! �!"# = �! − (�! − �! ) and � � � = !"# − ���� !"# ���������(�!) ��������� �! ! where �! is the mean �! between Diplotaxodon jointly with deep benthic and all the other Lake ! ! Malawi species, �! is the mean �! between Diplotaxodon and deep benthic, and �! is the mean �! within Diplotaxodon and deep benthic. The jackknife normalization is analogous to the one used for ΔN-S and the mean (�!"#) is subtracted to center the statistic at zero.

The excess diversity score is high when the mean �! scores within Diplotaxodon and within deep benthic are high relative to the mean �! in the rest of the radiation. Specifically, the excess score �� is defined as: [(�! + �!")/2] − �! �� = ! ! ! ���������(�!)

! !" ! where �! is the mean �! within Diplotaxodon, �! is the mean �! within deep benthic, and �! is the mean �! within the rest of the radiation.

Haplotype genealogies To view the haplotype genealogies, we translated nucleotide sequences to amino acid sequences and loaded these into Haplotype Viewer (http://www.cibiv.at/~greg/haploviewer). This software requires that a tree is loaded together with the sequences. Therefore, we inferred gene trees using RAxML v7.7.893 with the PROTGAMMADAYHOFFF model of substitution.

Local excess allele sharing between Diplotaxodon and deep benthic 49 We used an extension of the fd statistic described by Martin et al. ; we refer to this extension as 56 fdM . fdM is a conservative version of the f statistic that is particularly suited for analysis of 49,56 small genomic windows . For the gene scores shown in Fig. 6b, we calculated fdM (, deep benthic, Diplotaxodon, N. brichardi) for each gene in window from the transcription start site (TSS) to 10 kb into the gene. For the ‘along the genome’ plots, as shown in Fig. 6d and Fig. S23, we used a product of two fdM statistics [fdM(shallow benthic, deep benthic, Diplotaxodon, N. brichardi) x fdM(Rhamphochromis, Diplotaxodon, deep benthic, N. brichardi)], an approach which we found to increase the local resolution. This score was calculated in sliding windows of 100 SNPs across a region of +- 100kb around the genes. Finally, we also calculated fdM (mbuna, deep benthic, Diplotaxodon, N. brichardi) separately for synonymous and non-synonymous mutations in each gene.

Supplementary Text Sample selection Our sample selection provides broad coverage of all the major lineages of the rapidly radiating cichlid tribe Haplochromini in Lake Malawi, with specimens from: 1. Eight species from the ‘mbuna’ group of mainly shallow-water rock-dwelling cichlids. Our specimens represent much of the diversity of this group, covering 6 out of 10 genera defined by Ribbink et al. in their detailed classification of 196 Malawian mbuna species 109. 2. Nine species of ‘deep water’ benthic : many of these species are found at depths >50m, a ‘twilight’ zone with very little visible light; a few inhabit shallower water but are often crepuscular feeders, residing among rocks by day. These include members of the genera Alticorpus and Aulonocara (characterized by greatly enlarged sensory openings of their heads and lateral lines) and several of the species currently assigned to the Lethrinops. 3. Forty-one species of cichlids found predominantly in shallow waters close to the shore (like ‘mbuna’), but on sandy or muddy lake floor and the transition zones between sandy and rocky habitats. This is a very diverse group of cichlids with hundreds of described species26, including for example large (over 35cm) predators such as Buccochromis nototaenia, the invertebrate picker Placidochromis johstoni, molluscivores such as Chilotilapia rhoadesii and Mylochromis anaphyrmus, and cichlids that filter the sandy sediment like Ctenopharynx nitidus. We refer to this group collectively as ‘shallow benthics’. 4. Four species of zooplankton-feeding, shoaling cichlids which are commonly referred to as ‘utaka’. These species feed in the water column, but their distribution is limited to locations close to the shore26. All four sequenced ‘utaka’ species belong to the genus Copadichromis. 5. Three out of eight described species from the genus Rhampochromis54 - large pelagic (open-water) piscivores, feeding mainly on lake sardines26. Cichlids are primarily bottom-dwellers, but members of this group [and Diplotaxodon (see below)] have undergone extensive changes in morphology and behavior to invade the large pelagic habitat in Lake Malawi. 6. Three out of seven scientifically described species of Diplotaxodon (ref. 110, p. 198), and three undescribed species: D. macrops ‘black dorsal’111, D. ‘ngolube’ (ref. 110 p. 239), and D. ‘white back similis’ (a new undescribed species). Like Rhampochromis, Diplotaxodon are pelagic cichlids, but are typically found at greater depths (>50m) and their diet consists mainly of zooplankton, rather than fish. We include in this group also the species Pallidochromis tokolosh, which is morphologically intermediate between the two genera and is a slightly more benthic form (ref. 110, pp. 198-199), but its genetic affinities are more to the genus Diplotaxodon, as we show in this manuscript. 7. Astatotilapia calliptera - one of only two cichlids found in Lake Malawi able to cross the lake-river barrier. A. calliptera is a versatile, relatively small cichlid (~10-15cm), common in the rivers throughout the Lake Malawi catchment. In Lake Malawi, A. calliptera frequents shallow sheltered bays with muddy sediment and aquatic plants, often feeding on snails (ref. 26, p. 281). It has been suggested that it may be related to the ancestral lake-river generalist species that seeded most or perhaps all of the Lake Malawi haplochromine radiation15,26. Therefore, we sampled A. calliptera genetic variation extensively, including 16 specimens (four from Lake Malawi itself and twelve more from the broader Lake Malawi catchment). The other species able to cross the lake-river barrier is Serranochromis robustus: a large predator often seen in very shallow water near river estuaries (ref. 26, p. 277) and common in rivers to the south-west of Lake Malawi, including the Zambezi system112. However, S. robustus is not a part of the Lake Malawi haplochromine radiation, but instead belongs to a distant group sometimes referred to as the ‘Congo clade’113. For a list of all samples and details of our assignment of Lake Malawi species into these seven lineages see Table S1.

Simulations We tested the behavior of our ‘tree violation’ inference methods (the fb statistic and our approach to calculating the residuals of neighbor-joining trees) on data simulated using the software ms' 114. Four models were evaluated, with parameters compatible with those inferred for the Lake Malawi cichlid radiation, as depicted in Fig. S27. They include a phylogeny without gene flow, and three independent simulations each with a directional gene flow event between two non- terminal branches, where the strength of the gene flow was 10%, 30%, and 50%.

First, we estimated species trees from the simulated data using the neighbor-joining (NJ) algorithm (Fig. S28). The inference is very accurate in the absence of gene flow (Fig. S28A) or if gene flow is relatively weak (f = 10%, Fig. S28B). However, for simulations with more substantial gene flow (f = 30%, and f = 50%) the inferred trees do not correspond to either of the two alternative topologies (i.e. neither the standard one nor the one following the gene flow event). Instead, the inferred topologies are ‘intermediate’, with details dependent on the strength of the gene flow (Fig. S28C,D).

Calculating residuals between the pairwise genetic distance matrix and the inferred NJ tree can help to capture inconsistencies of the tree like model (Fig. S29, cf. Fig. 2). However, the magnitude of the residuals does not increase linearly with the strength of gene flow. This is because the inferred tree topology changes with increasing gene flow in a way that keeps residuals lower than they would be if the original topology were enforced. This effect may be weaker in a larger tree which has more constraints.

Next, we computed all 56 f tests consistent with the inferred NJ tree for each model. The number of significant f statistics were 0, 24, 28, and 28, for f = 0%, 10%, 30%, and 50%, respectively (FWER<0.001). Calculating fb on the simulated data, yielded 0, 7, 8, and 10, significant fb, respectively (Fig. S30). Thus it is clear that while using the fb statistic reduces the correlation between different tests, the correlations are not removed entirely. A single gene flow event can still produce multiple significant fb values. However, we suggest that an additional benefit of the fb statistic is that it is branch specific and can be intuitively interpreted as identifying “problematic” branches in the phylogeny, in the sense that the phylogenetic model is violated at these branches. For example, Fig. S30C shows that the ancestral branch of the species A1 and A2 shows excess allele sharing with the clade containing C1, C2, D1, and D2, and most strongly so with C1 and C2. At the same time, B1 show excess allele sharing with A1 and A2, compared to B2. While these observations do not correspond to true gene-flow events, they are expected given the difference between the true relatedness and the inferred tree, and thus hint at the “correct” species relationships.

Note that the inferred magnitudes of f and fb are substantially lower than the actual gene flow proportion. We suggest that this is due to drift on the terminal branches that leads to a decrease of excess allele sharing between A and C. A possible explanation for the large values of f observed in the real data is the occurrence of gene flow over an extended period.

Finally, we applied the software treemix52, which is widely used to infer gene-flow events, to the simulated data (Figs. S31-S33). We found that treemix was able to correctly infer the gene flow edge in the model with relatively weak gene flow (f = 10%) when the parameter m is set to one (see Fig. S31C; the parameter m indicates to treemix how many gene flow events to expect and must be supplied by the user). However, increasing the strength of gene flow in the simulated models or increasing the m parameter value leads to unexpected results. First, when the m parameter value is increased, the correct event is still inferred, but higher migration weights are given to other, incorrect migration events (Figs. S32C, S33C). Second, as soon as gene flow is strong enough that the maximum likelihood tree computed by treemix does not correspond to either of the alternative topologies (as in the simulations with f = 30% and 50%), treemix infers gene flow edges which do not seem to bear any connection to the true event. In particular, in such cases treemix tends to infer strong gene flow into the outgroup.

In conclusion, we suggest that while the number of significant fb should not be interpreted as the number of independent gene flow events, fb correctly tags problematic branches and shows excess allele sharing, even in cases of strong gene flow where the inferred phylogeny is likely wrong, while treemix seems to give misleading results in such cases.

A 0.5 min: 0.19%; max: 0.27%

0.4

0.3

0.2

Divergence from reference (%) reference from Divergence 0.1

0

A. calliptera deep benthic Diplotaxodon mbuna Rhamphochromis shallow benthic utaka B

1200

1/600 1000

1/750 800

1/1000 600

400 1/2000 Frequency of heterozygous sites heterozygous of Frequency 200 Number of heterozygous sites (thousands) sites heterozygous of Number 1/5000

A. calliptera deep benthic Diplotaxodon mbuna Rhamphochromis shallow benthic utaka Fig. S1 Genome-wide variant statistics summarized per cichlid lineage. (A) Variants called against M. zebra genome in Lake Malawi samples. (B) The number and frequency (per bp) of heterozygous sites.

scaffold 170 AB25 bp 125,530 bp 125,550 bp

[0 - 50]

T C Mother

[0 - 54]

T T G Father

[0 - 45]

T T T T T Offspring T

T T T

Sequence g c g t g t g g g a g t c t t c t t t c t t t g a

Fig. S2 Direct de novo mutation rate estimation from parent-offspring trios. (A) Illustrating the approach: in each trio we looked for mutations in the child that were not present in either of the parents. (B) A genome browser screenshot with an example of de novo mutation in Lethrinops lethrinus offspring on scaffold 170.

Across all species A Horizontal solid lines show background LD levels mbuna deep benthic 0.25 shallow benthic Diplotaxodon

A. calliptera Subgroups )

2 0.20 C. virginalis M. anaphyrmus P. longimanus P. subocularis

0.15 Species T. placodon

0.10 LD (haplotype r (haplotype LD

0.05

0.00

0 5 10 15 20 Physical distance (kb)

B Across all samples +0.10 mbuna deep benthic shallow benthic 2 ) Diplotaxodon

A. calliptera Subgroups C. virginalis M. anaphyrmus +0.05 P. longimanus P. subocularis Species

LD (haplotype r (haplotype LD T. placodon above backgroundabove

background

0 5 10 15 20 Physical distance (kb)

C +0.10 2 )

+0.05 LD (haplotype r (haplotype LD above backgroundabove

background

012345 Physical distance (kb)

Fig. S3 Linkage Disequilibrium decay measure by haplotype r2 values. Three levels of detail are combined in the plots: across the whole radiation, within broad eco-morphological groups, and within-species. Only species and groups for which we had more than six individuals to calculate LD are shown (Rhamphochromis and utaka are therefore missing). (A) Average haplotype r2 values in 1kb windows over 20kb. (B) As in (A) but with background LD (LD between variants mapping to different chromosomes) subtracted. (C) Subtracted LD in 100bp widows over the first 5kb.

mtDNA ML phylogeny maximum clade credibility tree

A Diplotaxodon similis 'white back' B 62 Diplotaxodon macrops Placidochromis cf. longimanus 100 59 Pallidochromis tokolosh Placidochromis milomo polystigma 100 Diplotaxodon greenwoodi <1 88 36 Diplotaxodon limnothrissa Nimbochromis livingstoni 90 99 Diplotaxodon macrops 'black dorsal' 9 Nimbochromis linni 62 Diplotaxodon macrops 'ngulube' Dimidiochromis dimidiatus Dimidiochromis compressiceps Rhamphochromis woodi 97 23 100 <1 Dimidiochromis strigatus Rhamphochromis longiceps Tyrannochromis nigriventer Rhamphochromis esox 23 100 2 Champsochromis caeruelus Mylochromis anaphyrmus Dimidiochromis kiwinge Otopharynx lithobates 88 100 57 1 <1 Mylochromis ericotaenia Lethrinops auritus 14 77 Otopharynx lithobates Mylochromis ericotaenia <1 <1 53 Buccochromis rhoadesii Tyrannochromis nigriventer 82 Buccochromis nototaenia Protomelas ornatus Otopharynx speciosus 85 Copadichromis trimaculatus 100 Protomelas ornatus Copadichromis quadrimaculatus 6 100 60 24 Hemitaenichromis spilopterus 96 Ctenopharynx nitidus <1 3 Placidochromis johnstoni Mylochromis melanotaenia Chilotilapia rhoadesii 74 <1 Placidochromis johnstoni Tremitochranus placodon 96 68 Chilotilapia rhoadesii Mylochromis melanotaenia 99 <1 4 Otopharynx tetrastigma Hemitilapia oxyrhynchus Placidochromis cf. longimanus Copadichromis cf. trewavasae 68 8 Otopharynx brooksi 'nkhata' 100 Copadichromis likomae 1 21 Lethrinops lethrinus Stigmatochromis guttatus 47 4 17 100 Lethrinops albus Stigmatochromis modestus Tremitochranus placodon Taeniochromis holotaenia 51 98 Hemitilapia oxyrhynchus 1 Otopharynx tetrastigma 16 Lethrinops auritus Nimbochromis polystigma 19 100 99 18 Lethrinops albus Nimbochromis linni Lethrinops lethrinus Ctenopharynx intermedius 71 Taeniolethrinops macrorhynchus 100 Placidochromis electra 5 18 96 Taeniolethrinops praeorbitalis Copadichromis cf. trewavasae 81 26 99 Taeniolethrinops furcicauda Hemitaenichromis spilopterus Placidochromis electra 55 20 N.brichardi 27 Nimbochromis livingstoni Mylochromis anaphyrmus Taeniolethrinops furcicauda 100 Ctenopharynx intermedius 96 Taeniolethrinops macrorhynchus 83 Ctenopharynx nitidus Otopharynx speciosus 7 69 97 57 Lethrinops longimanus 'redhead' Buccochromis nototaenia 4 61 14 Lethrinops gossei Buccochromis rhoadesii Lethrinops 'oliveri' Dimidiochromis kiwinge 7 Aulonocara minutus 19 23 Champsochromis caeruelus Alticorpus macrocleithrum 100 69 7 39 Dimidiochromis dimidiatus Aulonocara yellow 100 Alticorpus geoffreyi 100 Dimidiochromis strigatus 12 Dimidiochromis compressiceps 76 Aulonocara steveni 83 Taeniochromis holotaenia Aulonocara stuartgranti Stigmatochromis modestus Copadichromis trimaculatus 100 11 6 14 Stigmatochromis guttatus Copadichromis quadrimaculatus 100 Copadichromis likomae Otopharynx brooksi 'nkhata' 22 Placidochromis milomo Copadichromis virginalis Diplotaxodon limnothrissa Metriaclima zebra 9 Diplotaxodon macrops Genyochromis mento 98 100 Diplotaxodon similis 'white back' 100 Cynotilapia afra 51 Diplotaxodon macrops 'ngulube' Iodotropheus sprengerae N. brichardi 75 77 Diplotaxodon macrops 'black dorsal' Cynotilapia axelrodi 12 6 8 Pallidochromis tokolosh Lethrinops gossei 19 Diplotaxodon greenwoodi 13 A. calliptera Kingiri 100 Rhamphochromis longiceps A. calliptera Massoko littoral 100 17 100 100 Rhamphochromis woodi 26 A. calliptera Massoko benthic Rhamphochromis esox Lethrinops sp oliveri Genyochromis mento 96 39 1 11 41 Lethrinops longimanus 'redhead' Iodotropheus sprengerae Alticorpus macrocleithrum 2 Metriaclima zebra 45 Aulonocara yellow 54 82 11 Aulonocara stuartgranti 3 Cynotilapia axelrodi 95 Aulonocara minutus Petrotilapia genalutea 52 15 7 Aulonocara steveni Cynotilapia afra Taeniolethrinops praeorbitalis A. calliptera Songwe River 47 Tropheops tropheops A. calliptera Kingiri 100 5 Petrotilapia genalutea A. calliptera Massoko littoral 54 7 92 A. calliptera Massoko benthic Alticorpus geoffreyi <1 42 Copadichromis virginalis 7 A. calliptera Mbaka river 81 2 A. calliptera Kyela 100 A. calliptera Chizumulu A. calliptera Kyela A. calliptera North Rukuru A. calliptera Bua A. calliptera South Rukuru 12 16 A. calliptera Enukweni 98 100 A. calliptera Luwawa 45 57 A. calliptera South Rukuru A. calliptera Bua 1 2 100 66 A. calliptera Luwawa A. calliptera Salima A. calliptera Salima A. calliptera Enukweni 92 <1 100 A. calliptera Chitimba 95 A. calliptera Chitimba A. calliptera Chizumulu A. calliptera Mbaka river A. calliptera Songwe river 100 100 A. calliptera North Rukuru 0.001 0.005 Fig. S4 mtDNA and maximum clade credibility (MCC) phylogenies. Scales are given in number of SNPs per bp. (A) A maximum likelihood (ML) phylogeny based on the full mtDNA sequence. Node labels show bootstrap support based on 200 replicates. (B) An MCC summary of 2543 ML trees based on non-overlapping genomic windows. Node labels show the percent prevalence of each clade.

Pairwise tree distances

700 A random sample of 2543 pairwise tree comparisons mtDNA tree vs. all 2543 nuclear DNA trees 600 500 400 Count 300 200 100 0

94 102 106 110 114 118 122 126 130 134 138 142 146 150 154 158

Robinson−Foulds distance

Fig. S5 Pairwise tree distances. Differences in topology (as measured by the Robinson-Foulds distance) tend to be much greater between the mtDNA tree and the local ML phylogenies based on genomic windows in nuclear DNA than when two random genomic windows are compared against each other.

a NJ phylogeny with block bootstrap b SNAPP MCMC samples Buccochromis rhoadesii 100 Buccochromis nototaenia P. nyererei (Lake Victoria) 100 Otopharynx speciosus Rhamphochromis longiceps Otopharynx lithobates 100 Diplotaxodon limnothrissa Mylochromis ericotaenia 100 Hemitilapia oxyrhynchus Labeothropeus trewavasae Petrotilapia genalutea 95 Placidochromis johnstoni 100 Hemitaenichromis spilopterus 91 A. calliptera Massoko littoral Chilotilapia rhoadesii 100 A. calliptera Salima, Lake Malawi 41 100 Protomelas ornatus Placidochromis milomo Copadichromis virginalis Dimidiochromis strigatus 100 Aulonocara stuartgranti 48 100 Dimidiochromis dimidiatus Lethrinops gossei Dimidiochromis kiwinge 100 Lethrinops lethrinus 100 Nimbochromis polystigma 53 Nimbochromis livingstoni Dimidiochromis strigatus 100 Nimbochromis linni Otopharynx speciosus Placidochromis cf. longimanus 56 Otopharynx tetrastigma Topology 1 (62.5% of trees) 53 56 Mylochromis melanotaenia 100 Tremitochranus placodon 46 Copadichromis cf. trewavasae P. nyererei (Lake Victoria) Otopharynx 'brooksi nkhata' 100 100 100 Stigmatochromis guttatus Rhamphochromis longiceps 48 100 Stigmatochromis modestus Diplotaxodon limnothrissa Tyrannochromis nigriventer 100 Labeothropeus trewavasae 70 Champsochromis caeruelus Petrotilapia genalutea 100 Taeniochromis holotaenia Copadichromis trimaculatus A. calliptera Massoko littoral 100 Ctenopharynx intermedius A. calliptera Salima, Lake Malawi 70 Ctenopharynx nitidus Copadichromis virginalis Mylochromis anaphyrmus 100 Placidochromis electra Aulonocara stuartgranti Lethrinops auritus Lethrinops gossei 96 Lethrinops albus 100 Lethrinops lethrinus Lethrinops lethrinus Dimidiochromis strigatus 100 Taeniolethrinops macrorhynchus 70 100 Otopharynx speciosus 100 Taeniolethrinops furcicauda Taeniolethrinops praeorbitalis Topology 2 (37% of trees) Aulonocara steveni 100 Aulonocara stuartgranti Alticorpus macrocleithrum 100 100 100 100 Aulonocara minutus P. nyererei (Lake Victoria) Aulonocara yellow Rhamphochromis longiceps 100 Lethrinops sp. oliveri 100 Diplotaxodon limnothrissa Lethrinops longimanus redhead 100 Alticorpus geoffreyi Labeothropeus trewavasae Copadichromis likomae Petrotilapia genalutea Copadichromis quadrimaculatus 100 70 A. calliptera Massoko littoral Copadichromis virginalis A. calliptera Salima, Lake Malawi 100 A. calliptera Massoko benthic 100 A. calliptera Massoko littoral Copadichromis virginalis A. calliptera Mbaka River 100 Aulonocara stuartgranti 60 A. calliptera Songwe River 100 Lethrinops gossei A. calliptera North Rukuru 100 100 A. calliptera Kyela Lethrinops lethrinus A. calliptera Enukweni Dimidiochromis strigatus 52 100 100 A. calliptera South Rukuru Otopharynx speciosus A. calliptera Chitimba 60 A. calliptera Chizumulu Island Topology 3 (0.5% of trees) A. calliptera Salima, Lake Malawi 100 100 A. calliptera Bua 100 A. calliptera Luwawa A. calliptera Kingiri Petrotilapia genalutea 100 100 74 100 Metriaclima zebra N. brichardi Tropheops tropheops 74 Genyochromis mento Iodotropheus sprengerae 100 Cynotilapia afra 85 Cynotilapia axelrodi Diplotaxodon greenwoodi 100 100 Pallidochromis tokolosh Diplotaxodon ngulube 75 100 100 Diplotaxodon 'white back similis' 100 Diplotaxodon macrops black dorsal Diplotaxodon macrops Diplotaxodon limnothrissa 100 100 Rhamphochromis woodi Rhamphochromis esox 100 Rhamphochromis longiceps

2×10-4 Fig. S6 Further details of phylogenetic inference. a, The NJ tree from Fig. 2 with full bootstrap values for every node. b, 30617 SNAPP trees sampled from three MCMC inference chains at every 500 steps. This inference includes 12 Lake Malawi species representing the major lineages (plus the Lake Victoria species Pundamilia nyererei as root), and is based on a random subsample of 48922 unlinked SNPs. Three distinct topologies were observed among the sampled trees. Topology 2 differs from topology 1 in that Aulonocara stuartgranti and Lethrinops lethrinus no longer cluster together; the branch lengths in the whole benthic group (deep + shallow) are extremely short. Topology 3 differs from topology 1 in that Lethrinops gossei is basal to the remainder of the benthic group. Buccochromis rhoadesii Buccochromis nototaenia Otopharynx speciosus Mylochromis anaphyrmus Placidochromis electra Ctenopharynx intermedius Ctenopharynx nitidus Stigmatochromis guttatus Otopharynx brooksi nkhata Stigmatochromis modestus Copadichromis cf. trewavasae Champsochromis caeruleus Tyranochromis nigriventer Taeniochromis holotaenia Various other Otopharynx tetrastigma Placidochromis milomo off-diagonal Protomelas ornatus Chilotilapia rhoadesii Hemitaenichromis spilopterus signatures Placidochromis johnstoni Tremitochranus placodon O. tetrastigma with Mylochromis melanotaenia Mylochromis ericotaenia A. calliptera Kingiri Otopharynx lithobates Hemitilapia oxyrhynchus Nimbochromis linni Nimbochromis livingstoni Nimbochromis polystigma Dimidiochromis kiwinge Dimidiochromis dimidiatus Dimidiochromis strigatus Dimidiochromis compressiceps Copadichromis trimaculatus Copadichromis quadrimaculatus Copadichromis likomae Copadichromis virginalis Aulonocara minutus C. trimaculatus Alticorpus macrocleithrum Aulonocara yellow Alticorpus geoffreyi with shallow Lethrinops gossei Lethrinops sp oliveri Lethrinops longimanus redhead Aulonocara stuartgranti Aulonocara steveni Taeniolethrinops furcicauda Taeniolethrinops macrorhynchus Some shallow benthics Taeniolethrinops preorbitalis Lethrinops lethrinus clustering close to Lethrinops albus Lethrinops auritus Placidochromis cf. longimanus the deep Diplotaxodon macrops D. macrops black dorsal Diplotaxodon limnothrissa Pallidochromis tokolosh P. longimanus Diplotaxodon greenwoodi Diplotaxodon white back similis Diplotaxodon ngulube with deep Rhampochromis longiceps Rhamphochromis esox Rhamphochromis woodi Genyochromis mento Cynotilapia afra Cynotilapia axelrodi Metriaclima zebra Petrotilapia genalutea Tropheops tropheops Iodotropheus sprengerae A. calliptera South Rukuru A. calliptera Enukweni A. calliptera Chitimba A. calliptera Songwe River A. calliptera Kyela A. calliptera North Rukuru A. calliptera Luwawa A. calliptera Bua A. calliptera Salima A. calliptera Chizumulu Island A. calliptera Massoko littoral A. calliptera Massoko benthic A. calliptera Lake Kingiri A. calliptera Mbaka River Bua Kyela Salima Luwawa Chitimba Enukweni Lake Kingiri Lake Mbaka River Mbaka North Rukuru North South Rukuru South Songwe River Songwe brooksi nkhata brooksi Cynotilapia afra Cynotilapia Massoko littoral Massoko A. calliptera calliptera A. Massoko benthic Massoko Lethrinops albus Lethrinops Chizumulu Island Chizumulu Metriaclima zebra Metriaclima A. calliptera calliptera A. Lethrinops gossei Lethrinops Lethrinops auritus Lethrinops Aulonocara yellow Aulonocara A. calliptera calliptera A. Aulonocara steveni Aulonocara Nimbochromis linni Nimbochromis Protomelas ornatus Protomelas Alticorpus geoffreyi Alticorpus Cynotilapia axelrodi Cynotilapia Lethrinops lethrinus Lethrinops minutus Aulonocara A. calliptera calliptera A. Lethrinops sp oliveri sp Lethrinops Chilotilapia rhoadesii Chilotilapia A. calliptera calliptera A. Ctenopharynx nitidus Ctenopharynx Genyochromis mento Genyochromis Tropheops tropheops Tropheops Petrotilapia genalutea Petrotilapia A. calliptera calliptera A. Diplotaxodon ngulube Diplotaxodon Otopharynx lithobates Otopharynx Otopharynx speciosus Otopharynx Diplotaxodon macrops Diplotaxodon Placidochromis electra Placidochromis Rhamphochromis esox Rhamphochromis Placidochromis milomo Placidochromis Otopharynx tetrastigma Otopharynx Copadichromis likomae Copadichromis Aulonocara stuartgranti Aulonocara Dimidiochromis kiwinge Dimidiochromis A. calliptera calliptera A. Pallidochromis tokolosh Pallidochromis D. macrops black dorsal black macrops D. Mylochromis ericotaenia Mylochromis Buccochromis rhoadesii Buccochromis Rhamphochromis woodi Rhamphochromis A. calliptera calliptera A. Hemitilapia oxyrhynchus Hemitilapia Copadichromis virginalis Copadichromis Dimidiochromis strigatus Dimidiochromis Iodotropheus sprengerae Iodotropheus Tremitochranus placodon Tremitochranus Placidochromis johnstoni Placidochromis Buccochromis nototaenia Buccochromis A. calliptera calliptera A. Nimbochromis livingstoni Nimbochromis Diplotaxodon greenwoodi Diplotaxodon Taeniochromis holotaenia Taeniochromis Mylochromis anaphyrmus Mylochromis Stigmatochromis guttatus Stigmatochromis A. calliptera calliptera A. A. calliptera calliptera A. Nimbochromis polystigma Nimbochromis Diplotaxodon limnothrissa Diplotaxodon Ctenopharynx intermedius Ctenopharynx Alticorpus macrocleithrum Alticorpus Tyranochromis nigriventer Tyranochromis Dimidiochromis dimidiatus Dimidiochromis Mylochromis melanotaenia Mylochromis Rhampochromis longiceps Rhampochromis Otopharynx Otopharynx Stigmatochromis modestus Stigmatochromis Champsochromis caeruleus Taeniolethrinops furcicauda Taeniolethrinops A. calliptera calliptera A. Copadichromis trimaculatus Copadichromis Taeniolethrinops preorbitalis Taeniolethrinops A. calliptera calliptera A. A. calliptera calliptera A. Hemitaenichromis spilopterus Hemitaenichromis Copadichromis cf. trewavasae cf. Copadichromis Placidochromis cf. longimanus cf. Placidochromis Dimidiochromis compressiceps Dimidiochromis Diplotaxodon white back similis white Diplotaxodon Lethrinops longimanus redhead longimanus Lethrinops Copadichromis quadrimaculatus Copadichromis Taeniolethrinops macrorhynchus Taeniolethrinops 27.5 136 244 352 460 784 568 892 676 higher 1000

Estimated coancestry (cM)

Fig. S7 Co-ancestry between Lake Malawi species measured by the Chromopainter software. Each row corresponds to a ‘recipient’ and each column to a ‘donor’. Thus the values indicate the total length (in cM) along which a ‘donor’ haplotype was inferred to be the closest relative of a ‘recipient’ haplotype. Clusters corresponding to the major morphological groups are highlighted by rectangles (species name colors as in Fig. 1). A number of interesting cases of excess ‘nearest neighbor’ haplotype sharing are indicated by arrows, including those discussed in the main text and some additional examples that stand out.

Fig. S8 The absolute values of the f statistic and associated Z scores. Results shown for all possible combinations of clades in Fig. S10B that are consistent with the phylogenetic tree. The rose- colored area corresponds to significant Z scores (Holm-Bonferroni FWER<0.001, |Z|>3.21).

Fig. S9 Overlaid density histograms of original f scores (170640 tests), and the reduced branch-specific scores fb (85311 tests).

Fig. S10 The definition of subgroups used for f statistic presentation in Figure 3c. (A) NJ tree of all samples, with the strongly admixed samples reported in Fig. 2b removed: C. trimaculatus, A. calliptera Kingiri, O. tetrastigma, P. cf. longimanus. Note that the removal of C. trimaculatus qualitatively influences the relationships between a number of the other samples. (B) Mapping of sample labels as in the uploaded VCF file) to the f statistic clades used in Fig. 3c.

m = 1 m = 3

A A. calliptera Chitimba+ B A. calliptera Salima+ A. calliptera Massoko+ A. calliptera Massoko+ A. calliptera Chizumulu A. calliptera Chitimba+ A. calliptera Salima+ A. calliptera Chizumulu Cynotilapia+ F. rostratus Metriaclima+ Nimbochromis I. sprengerae D. kiwinge L. trewawase Dimidiochromis F. rostratus P. milomo Nimbochromis C. rhoadesii Migration D. kiwinge P. ornatus weight Dimidiochromis H. spilopterus+ 0.5 P. milomo P. subocularis+ C. rhoadesii N. brichardi T. placodon+ P. ornatus Tyrannochromis H. spilopterus+ 0.0 0.1 0.2 0.3 0.4 0.5 Buccochromis+ P. subocularis+ Migration Ctenopharynx T. placodon+ weight M. anaphyrmus 0 Tyrannochromis 1 P. electra Buccochromis+ (Taenio-)lethrinops Ctenopharynx A. stuartgranti, steven M. anaphyrmus C. quadrimaculatus P. electra C. likomae (Taenio-)lethrinops C. virginalis A. stuartgranti, steveni 0 other deep benthic C. quadrimaculatus Diplotaxodon C. likomae Rhamphochromis C. virginalis L. trewawase other deep benthic Metriaclima+ Rhamphochromis I. sprengerae Diplotaxodon Cynotilapia+ N. brichardi

0.00 0.01 0.02 0.03 0.04 0.46 0.47 0.48 0.49 0.50 0.51 0.52 Drift parameter Drift parameter C m = 5 D m = 10 N. brichardi L. trewawase L. trewawase Cynotilapia+ Cynotilapia+ Metriaclima+ Metriaclima+ I. sprengerae I. sprengerae A. calliptera Salima+ A. calliptera Salima+ A. calliptera Massoko+ A. calliptera Chizumulu A. calliptera Chitimba+ A. calliptera Massoko+ A. calliptera Chizumulu A. calliptera Chitimba+ F. rostratus A. stuartgranti, steveni T. placodon+ Migration (Taenio-)lethrinops Nimbochromis weight Ctenopharynx D. kiwinge 0.5 F. rostratus Dimidiochromis P. subocularis+ P. milomo P. milomo H. spilopterus+ Migration H. spilopterus+ P. ornatus weight P. ornatus C. rhoadesii 0.5 C. rhoadesii D. kiwinge P. subocularis+ 0 Dimidiochromis Tyrannochromis Nimbochromis Buccochromis+ T. placodon+ P. electra Tyrannochromis M. anaphyrmus Buccochromis+ Ctenopharynx 0 P. electra (Taenio-)lethrinops M. anaphyrmus A. stuartgranti, steveni C. quadrimaculatus C. virginalis C. likomae C. quadrimaculatus C. virginalis C. likomae other deep benthic other deep benthic Diplotaxodon Diplotaxodon Rhamphochromis Rhamphochromis N. brichardi

0.00 0.01 0.02 0.03 0.04 0.05 0.00 0.01 0.02 0.03 0.04 0.05 Drift parameter Drift parameter

Fig. S11 Population splits and mixtures inferred with treemix. The parameter specifying the number of migration events was set to (A) m=1, (B) m=3, (C) m=5, (D) m=10. Subgroups are given in Fig. S10 and are also the same as used in Fig. 3c. All migration events were inferred to be highly significant (treemix option --se, p < 10-15).

mtDNA phylogeny

100 Rhamphochromis esox 100 Rhamphochromis longiceps Rhamphochromis woodi 98 100 Diplotaxodon greenwoodi Diplotaxodon 'white back similis' 63 53 Pallidochromis tokolosh Diplotaxodon macrops 100 89 Diplotaxodon limnothrissa Diplotaxodon macrops black dorsal 59 Diplotaxodon ngulube Mylochromis melanotaenia 79 98 95 Placidochromis johnstoni Chilotilapia rhoadesii 95 Otopharynx tetrastigma Protomelas ornatus Ctenopharynx nitidus 81 63 100 Copadichromis trimaculatus Copadichromis quadrimaculatus 72 Placidochromis cf. longimanus 100 Copadichromis likomae 50 Lethrinops albus 97 Lethrinops lethrinus 85 99 Nimbochromis linni 100 13 Nimbochromis polystigma 48 Hemitilapia oxyrhynchus Tremitochranus placodon 100 Dimidiochromis kiwinge 100 100 Dimidiochromis strigatus 41 100 Dimidiochromis dimidiatus Dimidiochromis compressiceps 97 Champsochromis caeruelus 73 70 Ctenopharynx intermedius 97 Copadichromis cf. trewavasae Placidochromis electra 97 Hemitaenichromis spilopterus 99 Buccochromis rhoadesii 59 63 Otopharynx speciosus 100 21 Buccochromis nototaenia 100 Taeniolethrinops furcicauda 53 Taeniolethrinops macrorhynchus Nimbochromis livingstoni Otopharynx 'brooksi nkhata' 100 Taeniochromis holotaenia 100 78 100 Stigmatochromis guttatus Stigmatochromis modestus Tyrannochromis nigriventer Mylochromis anaphyrmus 10079 Lethrinops auritus 59 Mylochromis ericotaenia Otopharynx lithobates Lethrinops sp. oliveri N. brichardi 41 50 Alticorpus macrocleithrum 57 Aulonocara yellow 47 Aulonocara stuartgranti 35 Lethrinops longimanus redhead Lethrinops gossei A. calliptera Kingiri 21 100 100 A. calliptera Massoko benthic 6 A. calliptera Massoko littoral 100 A. calliptera Rovuma headwaters 98 100 A. calliptera Lake Chilwa 100 A. calliptera Kitai Dam 95 A. calliptera Upper Rovuma Aulonocara minutus 9 Aulonocara steveni Genyochromis mento 48 78 Cynotilapia axelrodi 100 Iodotropheus sprengerae 75 Cynotilapia afra 100 Taeniolethrinops praeorbitalis 49 100 Tropheops tropheops 100 49 Petrotilapia genalutea Alticorpus geoffreyi Metriaclima zebra Placidochromis milomo Copadichromis virginalis 94 100 A. calliptera North Rukuru A. calliptera Songwe River 100 97 100 A. calliptera Chitimba 100 A. calliptera Enukweni 95 A. calliptera Mbaka River A. calliptera Lake Chidya A. calliptera Chizumulu Island 100 A. calliptera Kyela 96 61 A. calliptera Bua 100 57 A. calliptera Salima, Lake Malawi 19 A. calliptera Luwawa A. calliptera South Rukuru

0.01

Fig. S12 mtDNA phylogeny with Indian Ocean catchment A. calliptera included. Scale is given in number of SNPs per bp. Node labels show bootstrap support based on 200 replicates. Fig. S13 Morphological diversity in sequenced Lake Malawi cichlids. PCA of body shape variation obtained from geometric morphometric analysis. (A) A. calliptera individuals occupy a central position along the first two eigenvectors, and there is very little overlap and there is no overlap other Malawi groups except for shallow benthic. (B) Although the extremes of the morphospace are occupied by individuals from other groups, it is notable that the body shape diversity within the shallow benthic group alone approaches the diversity of the full radiation. A. calliptera Massoko benthic A. calliptera Massoko littoral 1 A. calliptera Mbaka River A. calliptera Kingiri A. calliptera Songwe River A. calliptera Kyela 2 A. calliptera North Rukuru A. calliptera Enukweni A. calliptera South Rukuru 3 A. calliptera Chitimba A. calliptera Chizumulu Island 7 A. calliptera Salima, Lake Malawi 6 A. calliptera Bua 5 N. brichardi A. calliptera Luwawa 4 A. calliptera Lake Chilwa 12 A. calliptera Rovuma headwaters 11 A. calliptera Lake Chidya 9 A. calliptera Kitai Dam 8 A. calliptera Upper Rovuma 10 1×10-4 Fig. S14 A Neighbor Joining tree of A. calliptera from all sampled locations. The scale is given in SNPs per bp. All samples, including the one from Lake Kingiri cluster according to geography.

Fig. S15 Removing all mbuna individuals does not change the NJ tree topology. Specifically, the position of the A. calliptera group remains unchanged. a, NJ tree as in Fig 4b. b, NJ tree with the mbuna group removed. c, Approximate A. calliptera sampling locations shown on a map of the broader Lake Malawi region. Black lines correspond to present day level 3 catchment boundaries from the US Geological Survey’s HYDRO1k dataset.

Rhamphochromis and/or any A. calliptera Diplotaxodon

First branch: 100% 22.97% Remainder Remainder 5.99% of the radiation of the radiation

28.92% mbuna with none of these widows A. calliptera

Proportion of of Proportion 42.12% 0%

Remainder Remainder of the radiation of the radiation

Fig. S16 Variation in the basal (or ‘first’) branch among ML phylogenies for 2638 non-overlapping genomic windows of 8000SNPs each. This result is consistent with the tree in Fig. 4b.

Fig. S17

Comparisons between �!!! (non-synonymous minus synonymous variation) between Lake Malawi species (x-axis in both figures) and (A) the five major lineages and radiations of East African cichlids from ref. 11; (B) the A. calliptera populations sequenced in this study. Darker color indicates greater density of data points.

Fig. S18

Non-synonymous variation excess scores (��!�) and gene-length. Darker color indicates greater density of data points. (A) Overall, there is a disproportionally large fraction of short genes with high Δ!!! values. (B) This effect is greatly reduced when considering genes for which there are homologs in other non-cichlid teleost genome assemblies separately. (C) Genes without homologs in other teleosts tend to be short and have higher Δ!!! values; especially the shorter ones have a Δ!!! similar to that observed in non-coding sequences.

genes without homologs and length ≤ 450bp random non-coding sequences (null distribution) 800 1000 1200 density 0 200 400 600

0.000 0.001 0.002 0.003 0.004 0.005

pN average non-synonymous variation between Malawi species (tS/tV adjusted) Fig. S19

Overlaid density histograms of non-synonymous diversity (��). The distributions for short genes without homologs on other teleosts and for matched randomly selected non-coding sequences. The comparison of these distributions shows that short genes without homologs have a component with low �!, suggesting purifying selection, as well as a component with high �!, suggesting diversifying selection. A 3: peripherin-2 4: guanine nucleotide-binding protein G(t) 20 ● ● 1 ● ● XP_004556729.1; 343aa subunit alpha-2 - GNAT2 ● 2 ● ● XP_004554304.1; 350aa ● ● ● ● ● a ● ● ●● 3 ● b

● ● ● 1 ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● 4 ● ●● 1 1 ●● ● ● ● ● ●

● ● ● 5 ● ● 1 ● ● ● ●

● ● ● ● ● ● ● ● ● 1 ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● 6 ● ● ● ● ● ●●● ● ●● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● 1 ●● ● ● ● ●● ● ● ● ●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ●● ●● ●● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● 7 ● ● ● ●● ● ● ● ● ●● ● ● 6: G-protein-coupled receptor kinase 7 - GRK7 ● ● ● ●● ● ●● ● ●● 5: arrestin-C ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● and deepand benthc groups ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● XP_004550345.1; 355aa XP_004570581.1; 557aa ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● similarityscore ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● 1

10 0 10 ● ●

− ●

● ● ● ● ● ● ● ● 1 ● ● ● ●

● ● ● ● 1 1 Diplotaxodon ● ● ● ● ● ● ● for ●

● ● ● 20 visual perception GO annotation − ● ● other vision related genes ● 7: rhodopsin RH1 a: arrestin-C ● Massoko XP_004546142.1; 354aa littoral XP_004545716.1; 365aa

−10 −50 5 10 1 1

excess of non-synonymous diversity in 1 the deep water groups compared to the rest of the radiation 1

Massoko 1 benthic 1 1: green-sensitive opsin RH2B XP_004548806.1; 352aa 1

1 b: homeobox protein SIX6 2 (closest zebrafish homolog is six7) Number of 1 10 20 40 100 XP_004570733.1; 260aa haplotypes: 1

1 depth adapted deep 2: green-sensitive opsin RH2Aβ 2 1 Diplotaxodon groups: benthic XP_004548809.1; 352aa

shallow A. calliptera 1 benthic predominantly Rhamphochomis

shallow groups: mbuna 1

1 utaka outgroups 1

Selected haplotypes for RH2B: B H2 AApos: H1: 1 H2: 2 H3: 1 H3 H1 Selected haplotypes for RH2Aβ:

AApos: H4 H1: 1

H2: H3: 1 H4: 1 H2

1 H1 H3

C Selected haplotypes for homeobox protein SIX6: H3 H2 Variant positions:

AApos: 1 H1: 2 1 H2: H3:

H1

Fig. S20 Vision-related candidate genes with signatures of shared selection in deep benthic and Diplotaxodon. Outgroups include Oreochromis niloticus, Neolamprologus brichardi, Astatotilapia burtoni, and Pundamilia nyererei. (A) Amino acid haplotype genealogies for genes that have high similarity between deep benthic and Diplotaxodon and/or excess diversity within these groups. Note that the homeobox gene is annotated as SIX6 in the NCBI database, but a BLAST search revealed that its closest zebrafish homolog is clearly the six7 gene. (B) Amino acid positions of mutations in the two copies of RH2. (C) Amino acid positions of mutations in the homolog of the homeobox protein six7. A

20 2: green-sensitive opsin RH2Aα 4: opsin-5 ● ● ● ● ● XP_004548807.1; 352aa XP_004546036.1; 404aa ● ● ● ● ● ● ● ● ● 1 ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ●● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● 1 ● ● ● ● ● ● ●● ● ● 1 ●● ● ● ● ●● ● 3● ● ●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ●● ●● ●● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● short-wave-sensitive opsin SWS1 vertebrate ancient opsin and deepand benthc groups ● ● ● ●4● ●● 3: 5: ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● XP_001297003.1; 335aa XP_004551342.1; 380aa similarityscore ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●5 ● ● ● ● ● ● ● ● ● ●● ●a ● ● ● ●● ● ● ● ●● ●

10 0 10 ● ●

− ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● Diplotaxodon ● ● ● ● ● ● ● for ●

● ● ● 20 visual perception GO annotation − ● ● other vision related genes ● ● a: parapinopsin −10 −50 5 10 XP_004567849.1; 360aa excess of non-synonymous diversity in

the deep water groups compared to the rest of the radiation 1

1: red-sensitive opsin LWS 1 XP_004571972.1; 357aa Number of 1

10 20 40 100

1 haplotypes: 1 depth adapted deep Diplotaxodon groups: benthic

shallow A. calliptera benthic predominantly Rhamphochomis shallow groups: mbuna utaka outgroups B 20 ● ● ● ● ● 1: tudor domain-containing protein 1 ● ● 1 ● ● XP_004567589.1; 1183aa ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ●● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ●● ● ● ●● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● and deepand benthc groups ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● similarityscore ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ●

10 0 10 ● ●

− ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● Diplotaxodon ● ● ● ● ● ● ● for GO categories: ● visual perception 2: interleukin-12 subunit beta ● ● ● 20 hemoglobin complex XP_014270149.1; 296aa − cytokine receptor binding ● ● piRNA metabolic process

−10 −50 5 10 excess of non-synonymous diversity in the deep water groups compared to the rest of the radiation

Fig. S21 Genes with low deep benthic - Diplotaxodon scores. The genealogies suggest deep benthic and Diplotaxodon groups do not share selection patterns for these genes. Outgroups include Oreochromis niloticus, Neolamprologus brichardi, Astatotilapia burtoni, and Pundamilia nyererei. (A) Amino acid haplotype genealogies for candidate vision-related genes. (B) Examples of genes from other GO categories. The legend is shared for both panels (A) and (B). A Hemoglobin MN (major) cluster - scaffold_28 750kb 760kb 770kb 780kb 790kb Assembly gaps: Accessible genome: HBα1 HBα2 HBβ1 HBβ2 HBα3 HBα4 HBβ3 HBα5 HBβ4 NPRL3 Genes:

SNPs:

Hemoglobin LA (minor) cluster - scaffold_81 NCBI Sequence IDs: 410kb 420kb 430kb MN (major) cluster Assembly gaps: HBα1 XP_004552225.1 HBα4 XP_004552228.1 Accessible genome: HBα2 XP_004552232.1 HBβ3 XP_004552234.1 HBβ1 XP_004552231.1 HBα5 XP_004552233.1 RHBDF1 HBα HBβ AQP8 Genes: HBβ2 XP_004552230.1 HBβ4 XP_004552230.1 HBα3 XP_004552229.1 NPRL3 XP_004552236.1 LA (minor) cluster SNPs: RHBDF1 XP_004562333.1 HBβ XP_004562337.1 HBα XP_004562338.1 AQP8 XP_004562340.1

B 1: HBβ - LA cluster 2: HBβ1 - MN cluster ● XP_004562337.1; 147aa XP_004552231.1; 148aa 20 ● ● ● ● ● ●

● ● ● ●

● ● ● ● ● 1 ●● ●

● ● ● α-globins β-globins 2 ● 1 ● ● ●● ● ● ● ●● ● ● ● ●

● ● ● ● ●●● ● ● 1 ● ●● ● ●● ● ● ● ● ● ● ● ● ● 1 1 ● ● ● ● ● ● ● ● ●●● ● 3 ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● HBα3 - MN cluster a: HBβ2 - MN cluster ● ● ● ● ●●● ● 3: ●● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● a● XP_004552229.1; 143aa XP_004562330.1; 147aa ●● ● ● ● ●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ●● ●● ●● ● 2 ● ●● ●● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ●●● ● ● ●●● ● ● ● ● ● ● ● ● and deepand benthc groups ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● 4● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● similarityscore ● ●● ●● ●● ● ● ● ● ● ● ● b ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● HBα - LA cluster ●● ● 4:

10 0 10 ● ●

− ● ● ● ● ● ● XP_004562338.1; 141aa ● ● ● ● ● ● ● ● ●● ● ● Diplotaxodon ● ● ● ● ● ● ● Number of 1 for ● 10 20 40 100 haplotypes: ● ● ● 20 hemoglobin complex GO category − ● ● other hemoglobin complex genes ● depth adapted deep ● Diplotaxodon groups: benthic b: HBα2 - MN cluster −10 −50 5 10 XP_004562332.1; 147aa shallow A. calliptera benthic excess of non-synonymous diversity in predominantly Rhamphochomis the deep water groups compared to the rest of the radiation shallow groups: mbuna utaka outgroups C Selected haplotypes for the LA cluster HBβ H3 H2 AApos: H5 H1:

H2:

H3: 1 H4: 2

H5: 1

H1 1 H4 1

Variant positions:

Fig. S22 Hemoglobin complex genes. (A) Genome browser screenshots of the hemoglobin clusters with genes from the BROADMZ2 annotation. Note that the MN cluster genes HBα1, HBα5, HBβ3, and HBβ4 are either entirely or mostly outside of the ‘accessible genome’. Due to the highly repetitive nature of the MN locus, we were unable to confidently call variants in these genes. (B) Amino acid haplotype genealogies of the genes with high selection scores, with scatterplot showing measures of shared selection signatures between Diplotaxodon and deep benthic. (C) Amino acid positions of mutations in HBβ. A 1.0

peripherin-2

0.5 arrestin-C

0.0 missense variant missense

dM −0.5 f

−1.0 −1.0 −0.5 0.0 0.5 1.0 f dM synonymous variant B 0.4

0.2 dM f 0.0

-0.2 Assembly gap: Accessible genome: Gene annotation: peripherin-2

1.15Mb 1.20Mb 1.25Mb 1.30Mb scaffold_47 coordinates

C 0.2

0.1

dM 0.0 f

-0.1

-0.2 Assembly gap: Accessible genome:

Gene annotation: arrestin-C

4.70Mb 4.75Mb 4.80Mb scaffold_22 coordinates

Fig. S23 Excess allele sharing may be driven by de novo mutations in arrestin-C and peripherin-2. (A) The fdM measure of excess allele sharing between Diplotaxodon and deep benthic, calculated separately for all synonymous and non-synonymous (missense) variants within each gene. (B, C) The local patterns of fdM around the two genes. Phasing performance

1.0 beagle beagle shapeit shapeit 56 Switch accuracy Switch Average correct haplotype length (kb) length haplotype correct Average 01234 0.6 0.7 0.8 0.9 Father Father Father Father Father Father Mother Mother Mother Mother Mother Mother L. lethrinus L. lethrinus L. lethrinus L. lethrinus L. A. calliptera A. calliptera A. A. calliptera A. calliptera A. A. stuartgranti stuartgranti A. A. stuartgranti stuartgranti A.

Fig. S24 Haplotype phasing performance. Improvement in haplotype phasing accuracy achieved by re- phasing the dataset using the shapeit software.

Fig. S25 Combined MCMC state traces for the tree log-likelihood for the three runs of the SNAPP software. Although all three runs started with the same initial parameters, the first run reached different likelihood values compared with the other two. We interpret this as evidence that the likelihood surface is complicated and includes multiple optima, consistent with there being discordant phylogenetic signals.

Fig. S26 The calculation of f scores on trees. (A) A schematic illustration accompanying the methods section which explains how the reduced branch specific fb(C) scores are calculated. (B, C) Illustration of the way in which interdependences between different f scores can be informative about the timing of introgression. As shown in (B), gene flow into the common ancestor of two species is expected to be equally detectable in both of them. Additional gene flow into only one species after their split will add to the f statistic in that species, but not in its sister species.

Fig. S27 Phylogeny used for the coalescent simulations. Two diploid individuals were sampled from species A1, A2, …, D2 and a single diploid individual from the outgroup O. Effective population size was kept constant at 105, recombination rate was set to 2×10-8 and mutation rate to 3×10-9. For each sample we simulated 120 independent stretches of 5Mb (600Gb in total per individual, corresponding approximately to the size of the accessible genome). We simulated a single pulsed gene flow event from the common ancestor of C1, C2 into the common ancestor of A1, A2 (blue arrow). Four independent runs where performed with f = 0, 0.1, 0.3, and 0.5, respectively, where f is the fraction of lineages in the common ancestor of A1, A2 tracing their ancestry though the gene flow event.

Fig. S28 Inferred species trees for simulated data of the model in Fig. S27. (A) f=0, (B) f=0.1, (C) f=0.3, (D) f=0.5. Trees were constructed using the neighbor-joining method. Coalescent times were inferred from the pairwise genetic differences. Species split times were then calculated by subtracting within population differences from between population differences.

Fig. S29 Residuals of neighbor-joining trees for simulated data of the model in Fig. S27. Pairwise genetic differences (above diagonal) and residuals of pairwise difference and tree distance (below diagonal) trees for (A) f=0, (B) f=0.1, (C) f=0.3, (D) f=0.5. The analysis shown here is equivalent to Fig. 2 from the main manuscript.

Fig. S30

The fb statistic calculated on simulated data of the model in Fig. S27. (A) f=0, (B) f=0.1, (C) f=0.3, (D) f=0.5. The analysis shown here is equivalent to Fig. 3c from the main manuscript.

Fig. S31 Inference using the treemix software with m=1 for simulated data of the model in Fig. S27. (A-B) f=0, (C-D) f=0.1, (E-F) f=0.3, (G-H) f=0.5. The m=1 parameter tells treemix to expect exactly one migration event. The analysis shown here is equivalent to Fig. S11A.

Fig. S32 Inference using the treemix software with m=2 for simulated data of the model in Fig. S27. (A-B) f=0, (C-D) f=0.1, (E-F) f=0.3, (G-H) f=0.5. The m=2 parameter tells treemix to expect exactly two migration events. The analysis shown here is equivalent to Fig. S11B.

Fig. S33 Inference using the treemix software with m=3 for simulated data of the model in Fig. S27. (A-B) f=0, (C-D) f=0.1, (E-F) f=0.3, (G-H) f=0.5. The m=3 parameter tells treemix to expect exactly three migration events. The analysis shown here is equivalent to Fig. S11C.

Table S1 An overview of Lake Malawi and species and samples sequenced in this study. Approximate coverage and Illumina sequencing chemistry used (v3 or v4) are indicated. Number of individuals: ~coverage/chemistry Total Group Species ~15x/v3 ~15x/v4 ~6x/v3 ~6x/v4 individuals mbuna Cynotilapia afra 1 1 Cynotilapia axelrodi 1 1 Genyochromis mento 1 1 Iodotropheus sprengerae 1 1 Labeotropheus trewavasae 1 1 Metriaclima zebra 1 1 Petrotilapia genalutea 1 1 Tropheops tropheops 1 1 deep benthic Alticorpus geoffreyi 1 1 Alticorpus macrocleithrum 1 1 Aulonocara ‘minutus’ 1 1 Aulonocara steveni 1 1 Aulonocara stuartgranti 3 3 (trio) Aulonocara ‘yellow’ 1 1 Lethrinops gossei 1 1 Lethrinops longimanus ‘redhead’ 1 1 Lethrinops ‘oliveri’ 1 1 shallow benthic Buccochromis nototaenia 1 1 Buccochromis rhoadesii 1 1 Champsochromis caeruleus 2 2 Chilotilapia rhoadesii 1 3 4 Copadichromis cf. trewavasae 1 1 Ctenopharynx intermedius 1 2 3 Ctenopharynx nitidus 2 2 Dimidiochromis compressiceps 1 1 Dimidiochromis dimidiatus 1 1 Dimidiochromis kiwinge 1 1 Dimidiochromis strigatus 2 2 Fossorochromis rostratus 2 2 Hemitaeniochromis spilopterus 2 2 Hemitilapia oxyrhynchus 2 2 Lethrinops albus 1 1 Lethrinops auritus 1 1 Lethrinops lethrinus 4 4 (trio+1) Mylochromis anaphyrmus 1 4 5 Mylochromis ericotaenia 1 1 Mylochromis melanotaenia 1 1 Nimbochromis linni 1 1 Nimbochromis livingstoni 1 1 Nimbochromis polystigma 1 1 Otopharynx brooksi ‘nkhata’ 1 1 Otopharynx lithobates 1 1 Otopharynx speciosus 2 2 Otopharynx tetrastigma (Lake Ilamba) 1 1 Placidochromis electra 1 1 Placidochromis johnstoni 1 1 Placidochromis cf. longimanus 1 4 5 Placidochromis milomo 1 1 Placidochromis subocularis 8 8 Protomelas ornatus 2 2 Stigmatochromis guttatus 1 1 Stigmatochromis modestus 1 1 Taeniochromis holotaenia 1 1 Taeniolethrinops furcicauda 1 1 Taeniolethrinops macrorhynchus 1 1 Taeniolethrinops praeorbitalis 1 1 Tremitochranus placodon 1 4 5 Tyrannochromis nigriventer 1 1 utaka Copadichromis likomae 1 1 Copadichromis quadrimaculatus 1 1 Copadichromis trimaculatus 1 1 Copadichromis virginalis 1 4 5 Rhampochromis Rhamphochromis esox 1 1 Rhamphochromis longiceps 1 1 Rhamphochromis woodi 1 1 Diplotaxodon Diplotaxodon greenwoodi 1 1 Diplotaxodon limnothrissa 1 1 Diplotaxodon macrops 1 1 Diplotaxodon macrops ‘black dorsal’ 1 1 Diplotaxodon macrops ‘ngulube’ 1 1 Diplotaxodon similis ‘white back’ 1 1 Pallidochromis tokolosh 1 1 A. calliptera Astatotilapia calliptera 8 13 21 (trio+18) Number of species: 73 36 67 8 23 134

Table S2 Outgroup Astatotilapia species and specimens sequenced for this study. All individuals were sequenced to approximately 15x coverage using Illumina v4 sequencing chemistry (125bp paired end reads).

Number of Species Sampling location(s) individuals Kilosa, Wami System Astatotilapia bloyeti Lake Burungi 3 Lake Kumba, Korogwe District, Tanzania Astatotilapia ‘rukwa’ Lake Rukwa 3 Oxbow Lake (Rufiji river delta) Astatotilapia ‘rufiji blue’ Lake Mansi (Rufiji river system) 2 Astatotilapia ‘ruaha 1’ Kidatu, Ruaha river 2 Ruhuhu river Ruaha river Astatotilapia tweddlei Mindu Dam 5 Kitele Lake (Ruvuma catchment) Astatotilapia ‘ruaha 2’ Kidatu, Ruaha river 2 Rusizi, Bujumbura Astatotilapia burtoni 2 Lab strain from ref. 11 Total 19

Table S3 Gene ontology terms with significant (p≤0.01) enrichment when using the weight algorithm 60 implemented in the topGO package 106.

Molecular function:

Cellular component:

Biological Process:

Table S4 Individual BioSample accessions for whole genome sequencing data used in this study.

Samples BioSample Accessions Lake Malawi SAMEA1877409, SAMEA1877411, SAMEA1877414, SAMEA1877417, populations SAMEA1877421, SAMEA1877429, SAMEA1877440, SAMEA1877451, SAMEA1877455, SAMEA1877459, SAMEA1877464, SAMEA1877472, SAMEA1877476, SAMEA1877480, SAMEA1877484, SAMEA1877499, SAMEA1877503, SAMEA1904322, SAMEA1904324, SAMEA1904329- SAMEA1904331, SAMEA2661216-SAMEA2661246, SAMEA2661250- SAMEA2661253, SAMEA2661255-SAMEA2661258, SAMEA2661260, SAMEA2661262, SAMEA2661264-SAMEA2661270, SAMEA2661272, SAMEA2661275-SAMEA2661282, SAMEA2661287-SAMEA2661290, SAMEA3388853-SAMEA3388860, SAMEA3388862-SAMEA3388864, SAMEA3388868, SAMEA3388870-SAMEA3388874 Lake Malawi trios SAMEA1920096-SAMEA1920098 A. stuartgranti SAMEA1920093-SAMEA1920095 L. lethrinus SAMEA1920090-SAMEA1920092 A. calliptera Salima, Lake Malawi A. calliptera SAMEA1877400, SAMEA1904326, SAMEA1904327, SAMEA2661273, Malawi catchment SAMEA2661381-SAMEA2661385, SAMEA2661389-SAMEA2661391 A. calliptera Indian SAMEA1904323, SAMEA1904328, SAMEA2661386-SAMEA2661388 Ocean catchments Astatotilapia Under PRJEB1254: outgroups SAMEA2661249, SAMEA3388867

Under PRJEB15289: SAMEA4033331, SAMEA4033332, SAMEA4033339, SAMEA4033340 SAMEA4033333, SAMEA4033334, SAMEA4033341

Table S5 Versions of cichlid genome assemblies used in this study. All described in ref. 11.

Broad Institute Species URLs used to download Assembly ID http://www.broadinstitute.org/ftp/pub/assemblies/fish/M_zeb M. zebra MetZeb1.1_prescreen ra/MetZeb1.1_prescreen/M_zebra_v0.assembly.fasta http://www.broadinstitute.org/ftp/pub/assemblies/fish/P_nye P. nyererei PunNye1.0 rerei/PunNye1.0/P_nyererei_v1.assembly.fasta.gz http://www.broadinstitute.org/ftp/pub/assemblies/fish/H_bur A. burtoni HapBur1.0 toni/HapBur1.0/H_burtoni_v1.assembly.fasta.gz http://www.broadinstitute.org/ftp/pub/assemblies/fish/N_bri N. brichardi NeoBri1.0 chardi/NeoBri1.0/N_brichardi_v1.assembly.fasta.gz http://www.broadinstitute.org/ftp/pub/assemblies/fish/tilap O. niloticus Orenil1.1 ia/Orenil1.1/20120125_MapAssembly.anchored.assembly.fasta

Table S6 Versions of non-cichlid teleost assemblies used for multiple whole-genome alignments.

UCSC version Notes Species URLs used to download of assembly NIG v1.0 assembly ftp://hgdownload.soe.ucsc.edu/goldenPath/ory medaka oryLat2 Lat2/bigZips/oryLat2.fa.gz ftp://hgdownload.soe.ucsc.edu/gbdb/gasAcu1/g stickleback gasAcu1 Broad Institute v1.0 asAcu1.2bit http://hgdownload.cse.ucsc.edu/gbdb/danRer7/ zebrafish danRer7 Sanger Zv9 assembly danRer7.2bit

Table S7 An overview of the specimens whose photographs were used in the morphometric analyses. We used a total of 47 specimens from seven Astatotilapia outgroup species, 12 A. calliptera specimens, and 109 specimens from 55 Lake Malawi endemic species (a subset of the 72 sequenced Lake Malawi species for which we had available high quality photographs).

Group N species N populations N specimens Astatotilapia outgroups 7 10 47 A. ‘rukwa’, ‘rufiji blue’, ‘ruaha 1’ 3 3 19 A. tweddlei ,‘ruaha 2’ 2 2 10 A. bloyeti 1 2 9 A. burtoni 1 3 9

A. calliptera 1 3 12

Malawi radiation 55 109 mbuna 6 12 shallow benthic 30 55 deep benthic 8 15 utaka 2 6 Diplotaxodon 6 12 Rhamphochromis 3 9 Total 62 168

Table S8

The f4 admixture ratio and corresponding block-jackknifing Z-scores for all comparisons consistent with the tree in Fig. S10. The outgroup is fixed as N. brichardi. The four different values of f4 and Z-scores correspond to four independent random partitions of each clade C into two subsets in the denominator of f4 (see the methods section ‘The ABBA-BABA related f statistics’ for details). The first partition (columns f_h3c_0 and Z_h3c_0) is reported for all analyses in this paper.

[Online file]

References and Notes: 1. Losos, J. B. & Ricklefs, R. E. Adaptation and diversification on islands. Nature 457, 830–836 (2009). 2. Wagner, C. E., Harmon, L. J. & Seehausen, O. Ecological opportunity and sexual selection together predict adaptive radiation. Nature 487, 366–369 (2012). 3. Berner, D. & Salzburger, W. The genomics of organismal diversification illuminated by adaptive radiations. Trends Genet. 31, 491–499 (2015). 4. Darwin, C. On the Origin of Species. (OUP Oxford, 2008). 5. Lamichhaney, S. et al. Evolution of Darwin/'s finches and their beaks revealed by genome sequencing. Nature 518, 371–375 (2015). 6. Losos, J., Jackman, T., Larson, A., Queiroz, K. & Rodriguez-Schettino, L. Contingency and determinism in replicated adaptive radiations of island lizards. Science 279, 2115–2118 (1998). 7. Fryer, G. & Iles, T. D. The cichlid fishes of the great lakes of Africa: their biology and evolution. (Oliver and Boyd, 1972). 8. Salzburger, W., Van Bocxlaer, B. & Cohen, A. S. Ecology and Evolution of the African Great Lakes and Their Faunas. Annu. Rev. Ecol. Evol. Syst. 45, 519–545 (2014). 9. Genner, M. J. et al. How does the taxonomic status of allopatric populations influence species richness within African cichlid fish assemblages? Journal of Biogeography 31, 93–102 (2004). 10. Meyer, A. Phylogenetic relationships and evolutionary processes in East African cichlid fishes. Trends Ecol Evol 8, 279–284 (1993). 11. Brawand, D. et al. The genomic substrate for adaptive radiation in African cichlid fish. Nature 513, 375–381 (2014). 12. Meyer, B. S., Matschiner, M. & Salzburger, W. Disentangling incomplete lineage sorting and introgression to refine species-tree estimates for Lake Tanganyika cichlid fishes. Syst. Biol. (2016). doi:10.1093/sysbio/syw069 13. Meier, J. I. et al. Ancient hybridization fuels rapid cichlid fish adaptive radiations. Nat Commun 8, 14363 (2017). 14. Koblmüller, S., Egger, B., Sturmbauer, C. & Sefc, K. M. Rapid radiation, ancient incomplete lineage sorting and ancient hybridization in the endemic Lake Tanganyika cichlid tribe Tropheini. Mol. Phylogenet. Evol. 55, 318–334 (2010). 15. Joyce, D. A. et al. Repeated colonization and hybridization in Lake Malawi cichlids. Curr. Biol. 21, R108–9 (2011). 16. Weiss, J. D., Cotterill, F. P. D. & Schliewen, U. K. Lake Tanganyika—A ‘Melting Pot’ of Ancient and Young Cichlid Lineages (Teleostei: Cichlidae)? PLoS ONE 10, e0125043 (2015). 17. Gante, H. F. et al. Genomics of speciation and introgression in Princess cichlid fishes from Lake Tanganyika. Mol Ecol (2016). doi:10.1111/mec.13767 18. Wagner, C. E. et al. Genome‐wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Mol Ecol 22, 787–798 (2012). 19. Genner, M. J. & Turner, G. F. Ancient hybridization and phenotypic novelty within Lake Malawi's cichlid fish radiation. Mol. Biol. Evol. 29, 195–206 (2012). 20. Moran, P., Kornfield, I. & Reinthal, P. N. Molecular Systematics and Radiation of the Haplochromine Cichlids (Teleostei: Perciformes) of Lake Malawi. Copeia 1994, 274 (1994). 21. See supplementary materials. 22. Leffler, E. M. et al. Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol. 10, e1001388 (2012). 23. Albertson, R. C., Markert, J. A., Danley, P. D. & Kocher, T. D. Phylogeny of a rapidly evolving clade: the cichlid fishes of Lake Malawi, East Africa. Proc. Natl. Acad. Sci. U.S.A. 96, 5107–5110 (1999). 24. Kocher, T. D. Adaptive evolution and explosive speciation: the cichlid fish model. Nat. Rev. Genet. 5, 288–298 (2004). 25. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). 26. Konings, A. Malaŵi Cichlids in Their Natural Habitat. (Cichlid Press, 2007). 27. Ravi, V. & Venkatesh, B. Rapidly evolving fish genomes and teleost diversity. Curr. Opin. Genet. Dev. 18, 544–550 (2008). 28. Recknagel, H., Elmer, K. R. & Meyer, A. A hybrid genetic linkage map of two ecologically and morphologically divergent Midas cichlid fishes (Amphilophus spp.) obtained by massively parallel DNA sequencing (ddRADSeq). G3 (Bethesda) 3, 65– 74 (2013). 29. Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu Rev Genomics Hum Genet 15, 47–70 (2014). 30. Ivory, S. J. et al. Environmental change explains cichlid adaptive radiation at Lake Malawi over the past 1.2 million years. Proceedings of the National Academy of Sciences 113, 11895–11900 (2016). 31. Malinsky, M. & Salzburger, W. Environmental context for understanding the iconic adaptive radiation of cichlid fishes in Lake Malawi. Proceedings of the National Academy of Sciences 113, 11654–11656 (2016). 32. Genner, M. J. et al. Age of cichlids: new dates for ancient lake fish radiations. Mol. Biol. Evol. 24, 1269–1282 (2007). 33. Heled, J. & Bouckaert, R. R. Looking for trees in the forest: summary tree from posterior samples. BMC Evol. Biol. 13, 221 (2013). 34. Mendes, F. K. & Hahn, M. W. Gene Tree Discordance Causes Apparent Substitution Rate Variation. Syst. Biol. 65, 711–721 (2016). 35. Roch, S. & Steel, M. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol 100C, 56–62 (2014). 36. Springer, M. S. & Gatesy, J. The gene tree delusion. Mol. Phylogenet. Evol. 94, 1– 33 (2016). 37. Ballard, J. W. O. & Whitlock, M. C. The incomplete natural history of mitochondria. Mol Ecol 13, 729–744 (2004). 38. Toews, D. P. L. & Brelsford, A. The biogeography of mitochondrial and nuclear discordance in . Mol Ecol 21, 3907–3930 (2012). 39. Consuegra, S., John, E., Verspoor, E. & de Leaniz, C. G. Patterns of natural selection acting on the mitochondrial genome of a locally adapted fish species. Genet. Sel. Evol. 47, 58 (2015). 40. Edwards, S. V. Is a new and general theory of molecular systematics emerging? Evolution 63, 1–19 (2009). 41. Edwards, S. V. et al. Implementing and testing the multispecies coalescent model: A valuable paradigm for phylogenomics. Mol. Phylogenet. Evol. 94, 447–462 (2016). 42. Dasarathy, G., Nowak, R. & Roch, S. Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method. IEEE/ACM Trans Comput Biol Bioinform 12, 422–432 (2015). 43. Rusinko, J. & McPartlon, M. Species tree estimation using Neighbor Joining. J. Theor. Biol. 414, 5–7 (2017). 44. Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A. & RoyChoudhury, A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol. Biol. Evol. 29, 1917–1932 (2012). 45. Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of Population Structure using Dense Haplotype Data. PLoS Genet. 8, e1002453 (2012). 46. 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012). 47. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710– 722 (2010). 48. Durand, E. Y., Patterson, N., Reich, D. & Slatkin, M. Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28, 2239–2252 (2011). 49. Martin, S. H., Davey, J. W. & Jiggins, C. D. Evaluating the use of ABBA-BABA statistics to locate introgressed loci. Mol. Biol. Evol. 32, 244–257 (2015). 50. Martin, S. H. et al. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res. 23, 1817–1828 (2013). 51. Eriksson, A. & Manica, A. Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins. Proc. Natl. Acad. Sci. U.S.A. 109, 13956–13960 (2012). 52. Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012). 53. Genner, M. J., Ngatunga, B. P., Mzighani, S., Smith, A. & Turner, G. F. Geographical ancestry of Lake Malawi's cichlid fish diversity: Figure 1. Biol. Lett. 11, 20150232 (2015). 54. Eccles, D. H. & Trewavas, E. Malawian Cichlid Fishes. (Lake Fish Movies, 1989). 55. Peterson, E. N., Cline, M. E., Moore, E. C., Roberts, N. B. & Roberts, R. B. Genetic sex determination in Astatotilapia calliptera, a prototype species for the Lake Malawi cichlid radiation. Naturwissenschaften 104, 41 (2017). 56. Malinsky, M. et al. Genomic islands of speciation separate cichlid ecomorphs in an East African crater lake. Science 350, 1493–1498 (2015). 57. Lyons, R. P. et al. Continuous 1.3-million-year record of East African hydroclimate, and implications for patterns of evolution and biodiversity. Proc. Natl. Acad. Sci. U.S.A. 112, 15568–15573 (2015). 58. Greenwood, P. H. Towards a phyletic classification of the ‘genus’ (Pisces, Cichlidae) and related taxa. Part 1. 35, 265–322 (1979). 59. Lippitsch, E. A phyletic study on lacustrine haplochromine fishes (Perciformes, Cichlidae) of East Africa, based on scale and squamation characters. Journal of Fish Biology 42, 903–946 (1993). 60. Alexa, A., Rahnenführer, J. & Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–1607 (2006). 61. Van Bocxlaer, B., SCHULTHEIß, R., Plisnier, P.-D. & Albrecht, C. Does the decline of gastropods in deep water herald ecosystem change in Lakes Malawi and Tanganyika? Freshwater Biology 57, 1733–1744 (2012). 62. Spady, T. C. et al. Evolution of the cichlid visual palette through ontogenetic subfunctionalization of the opsin gene arrays. Mol. Biol. Evol. 23, 1538–1547 (2006). 63. Weadick, C. J. & Chang, B. S. W. Complex patterns of divergence among green- sensitive (RH2a) African cichlid opsins revealed by Clade model analyses. BMC Evol. Biol. 12, 206 (2012). 64. Sugawara, T. et al. Parallelism of amino acid changes at the RH1 affecting spectral sensitivity among deep-water cichlids from Lakes Tanganyika and Malawi. Proc. Natl. Acad. Sci. U.S.A. 102, 5448–5453 (2005). 65. Seehausen, O. et al. Speciation through sensory drive in cichlid fish. Nature 455, 620–626 (2008). 66. Bowmaker, J. K. & Hunt, D. M. Evolution of vertebrate visual pigments. Current Biology 16, R484–R489 (2006). 67. Davies, W. I. L., Collin, S. P. & Hunt, D. M. Molecular ecology and adaptation of visual photopigments in craniates. Mol Ecol 21, 3121–3158 (2012). 68. Carleton, K. L., Dalton, B. E., Escobar-Camacho, D. & Nandamuri, S. P. Proximate and ultimate causes of variable visual sensitivities: Insights from cichlid fish radiations. Genesis 54, 299–325 (2016). 69. Ogawa, Y., Shiraki, T., Kojima, D. & Fukada, Y. Homeobox transcription factor Six7 governs expression of green opsin genes in zebrafish. Proceedings of the Royal Society B: Biological Sciences 282, 20150659 (2015). 70. Marchler-Bauer, A. et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 45, D200–D203 (2017). 71. Renninger, S. L., Gesemann, M. & Neuhauss, S. C. F. Cone arrestin confers cone vision of high temporal resolution in zebrafish larvae. Eur. J. Neurosci. 33, 658–667 (2011). 72. Brockerhoff, S. E. et al. Light stimulates a transducin-independent increase of cytoplasmic Ca2+ and suppression of current in cones from the zebrafish mutant nof. J. Neurosci. 23, 470–480 (2003). 73. Boesze-Battaglia, K. & Goldberg, A. F. X. Photoreceptor renewal: a role for peripherin/rds. Int. Rev. Cytol. 217, 183–225 (2002). 74. Opazo, J. C., Butts, G. T., Nery, M. F., Storz, J. F. & Hoffmann, F. G. Whole- genome duplication and the functional diversification of teleost fish hemoglobins. Mol. Biol. Evol. 30, 140–153 (2013). 75. Hahn, C., Genner, M. J., Turner, G. T. & Joyce, D. A. The genomic basis of adaptation to the deep water‘twilight zone’in Lake Malawi cichlid fishes. bioRxiv (2017). 76. Heliconius Genome Consortium. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature 487, 94–98 (2012). 77. Meyer, A., Kocher, T. D., Basasibwaki, P. & Wilson, A. C. Monophyletic origin of Lake Victoria cichlid fishes suggested by mitochondrial DNA sequences. Nature 347, 550–553 (1990). 78. Reid, N. M. et al. The genomic landscape of rapid repeated evolutionary adaptation to toxic pollution in wild fish. Science 354, 1305–1308 (2016). 79. Jones, F. C. et al. The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484, 55–61 (2012). 80. Seehausen, O. Cichlid Fish Diversity Threatened by Eutrophication That Curbs Sexual Selection. Science 277, 1808–1811 (1997). 81. Konings, A. Tanganyika cichlids in their natural habitat. (Cichlid Press, 2015). 82. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA- MEM. arXiv.org q-bio.GN, (2013). 83. DePristo, M. A. M. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011). 84. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). 85. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011). 86. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007). 87. Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2012). 88. Delaneau, O., Howie, B., Cox, A. J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696 (2013). 89. Harris, R. S. Improved pairwise alignment of genomic DNA. (PhD Thesis: The Pennsylvania State University, 2007). 90. Miller, W. et al. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17, 1797–1808 (2007). 91. Ulm, K. A simple method to calculate the confidence interval of a standardized mortality ratio (SMR). Am. J. Epidemiol. 131, 373–375 (1990). 92. Dobson, A. J., Kuulasmaa, K., Eberle, E. & Scherer, J. Confidence intervals for weighted sums of Poisson parameters. Stat Med 10, 457–462 (1991). 93. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006). 94. Bouckaert, R. et al. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 10, e1003537 (2014). 95. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57, 758–771 (2008). 96. Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987). 97. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289–290 (2004). 98. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol. Biol. Evol. 33, 1635–1638 (2016). 99. Heled, J. & Drummond, A. J. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580 (2010). 100. Purcell, S. et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics 81, 559–575 (2007). 101. Theis, A., Ronco, F., Indermaur, A., Salzburger, W. & Egger, B. Adaptive divergence between lake and stream populations of an East African cichlid fish. Mol Ecol 23, 5304–5322 (2014). 102. Rohlf, F. J. tpsDig2. 103. Adams, D. C. & Castillo, E. O. geomorph: an R package for the collection and analysis of geometric morphometric shape data. Methods in Ecology and … (2013). 104. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010). 105. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000). 106. Alexa, A. & Rahnenfuhrer, J. topGO: enrichment analysis for gene ontology. R package version (2010). 107. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015). 108. Merico, D., Isserlin, R., Stueker, O., Emili, A. & Bader, G. D. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS ONE 5, e13984 (2010). 109. Ribbink, A. J., Marsh, B. A., Marsh, A. C., Ribbink, A. C. & Sharp, B. J. A preliminary survey of the cichlid fishes of rocky habitats in Lake Malawi. S. Afr. J. Zool. 18, 149–310 (1983). 110. The cichlid diversity of Lake Malawi/Nyasa/Niassa. (Cichlid Press, 2004). 111. Genner, M. J. et al. Reproductive isolation among deep-water cichlid fishes of Lake Malawi differing in monochromatic male breeding dress. Mol Ecol 16, 651–662 (2007). 112. Joyce, D. A. et al. An extant cichlid fish radiation emerged in an extinct Pleistocene lake. Nature 435, 90–95 (2005). 113. Schwarzer, J. et al. Repeated trans-watershed hybridization among haplochromine cichlids (Cichlidae) was triggered by Neogene landscape evolution. Proceedings of the Royal Society B: Biological Sciences 279, 4389–4398 (2012). 114. Kelleher, J., Etheridge, A. M. & McVean, G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 12, e1004842 (2016).