Supporting Information

McGaugh et al. 10.1073/pnas.1419659112

SI Text

Summary of New Resources Available. For the 18 liver transcriptomes we generated, the raw reads can be found at the NCBI Sequence Read Archive (SRA062458 at www.ncbi.nlm.nih.gov/sra/?term=SRA062458 and SRP017466 at www.ncbi.nlm.nih.gov/sra/?term=SRP017466). Transcriptome assemblies, annotation summaries, and alignments for protein coevolution analyses are available through Dryad (dx.doi.org/10.5061/dryad.vn872). Individual identifiers for these data can be found under citation in Table S2. The following transcriptome assemblies, annotation summaries, and alignments are available through Dryad (dx.doi.org/10.5061/dryad.vn872):

i) The transcriptome assembly for each of the 18 individuals sequenced. These assemblies contain the longest ORFs produced by Trinity, which were then clustered by UCLUST into centroids to reduce redundancy within a single species’ transcriptome. A centroid may have collapsed multiple isoforms, truncated transcripts, and alleles from a gene, but it may also have collapsed very recent paralogs (1).
ii) Trinotate annotation databases for each individual. The IDs in the database correspond to the centroid IDs in the transcriptome assembly described above.
iii) Putative ortholog amino acid alignments and corresponding nucleotide alignments. We used OrthoMCL to cluster ORF centroids into putative orthologs from all of the species included in this study. Data are available as separate files for each ortholog (104,235 total orthologs with two or more species). Additionally, we included a spreadsheet showing the best BLAST hit of each putative ortholog cluster to the uniprot database.
iv) “Best” ortholog amino acid and nucleotide alignments. The 104,235 putative orthologs described above often contained more than two representative sequences per species. For the first 15,000 putative orthologs (those with the most species included in the alignments), we used UCLUST to find the best representative per species per ortholog by taking the sequence that was closest to the centroid for that ortholog.
v) The final nucleotide and amino acid alignments for the 1,417 “control genes.”
vi) The hand-curated nucleotide and amino acid alignments for 61 IIS/TOR network genes.

SI Materials and Methods

Sample Collection. Animals or tissues used in this study were provided by colleagues or our research colonies. Each individual was maintained at or shipped to Iowa State University (ISU). In agreement with ISU Institutional Animal Care and Use Committee protocol 3-2-5125J, animals were euthanized by decapitation, exsanguinated, and dissected, with relevant organs snap frozen. The exceptions were the cottonmouth and alligator (Agkistrodon piscivorus and Alligator mississippiensis), which were euthanized onsite in Texas and California, respectively, following our established protocol; snap-frozen tissues were sent to ISU. The animals used were of a variety of ages and both sexes; thus, findings reported here are robust to variation in transcripts that depend on age, sex, and rearing condition (Table S2).

Tissue and RNA Extraction and Sequencing. Total RNA was isolated from 12 to 19 mg of snap-frozen liver from each of 18 individuals from 17 species: a single individual for 16 species and two different ecotypes from one species for Thamnophis elegans (Table S2 and Fig. S2). We followed standard protocols including Qiagen RNAeasy kit (Qiagen cat. no. 74104) with a DNA digestion on the membrane, as described in the manual. The quality and quantity of RNA was determined on an Agilent Bioanalyzer using a Nano-RNA chip. For each sample, 1 μg total RNA was sent to the Duke Genome Sequencing and Analysis Core Resource for library preparation and to generate 100-bp paired-end reads using an Illumina HiSeq 2000 with TruSeq v3 chemistry with a standard insert size distribution. The library preparation protocol was based on the technical document TruSeq_RNA_SamplePrep_Guide_15008136_A. Individual libraries were uniquely barcoded (indexed), and quality was checked on the Bioanalyzer DNA100 chip. For 15 non–garter snake species, five indexed libraries were pooled in each lane, and ∼8 pM of library pool was deposited on each lane. Because garter snakes (Thamnophis spp.) are focal species in our laboratory, the two Thamnophis species (three samples) were sequenced more deeply. The Thamnophis couchii indexed library was pooled with separately indexed libraries from two individual Thamnophis elegans of different ecotypes (meadow and lakeshore in Table S2). This Thamnophis pool (one T. couchii and two T. elegans individuals) was sequenced twice, resulting in larger amounts of data available overall for these two species. None of the libraries were normalized. The raw reads for the 15 species excluding Thamnophis species can be found at the SRA under SRA062458. The raw reads for the three garter snake liver transcriptomes (i.e., one from T. couchii and two from T. elegans) can be found at the SRA under SRP017466 (samples HS08, HS11, and TC).

Processing and de Novo Assembly of Reads. For de novo assembly of each species’ transcriptome, we used the Trinity version released on February 25, 2013 (2). Original reads were processed with the following steps, using the Fastx toolkit (hannonlab.cshl.edu/fastx_toolkit/), Cutadapt (3), and Trimmomatic (4).

i) Fastx_trimmer was used to remove the first base, as Illumina personnel indicate that this base can be unreliable (Gary Schroth).
ii) Cutadapt was used to trim adapters from the 3′ ends of reads with an allowed error rate of 0.01.
iii) Trimmomatic was used to remove reads with sliding windows of 6 bp that had average quality scores of 30 or less, and then reads less than 30 bp in length were removed.

From this point, reads that were orphaned (only the left or the right remained after processing) were removed from the left and right read files. These reads were placed at the end of the left read files, as specified in the Trinity manual. All default settings were kept for transcriptome assembly.

Transcriptome Quality Assessment and Annotation. We sequenced 33.73–140.95 million reads per species (mean: 50.23; median: 42.10). Reads were assembled into 87,016–221,818 contigs using Trinity (mean: 155,855; median: 165,685). Contigs shorter than 200 bp were excluded (5). Table S2 contains statistics about the Trinity assemblies.

To evaluate the quality of a transcriptome assembly, we aligned the assembled Trinity transcripts to the proteins of the UniProtKB/Swiss-Prot database downloaded on March 21, 2013 using blastx with an E-value cutoff of 1e-20 and allowing only a single target sequence to be reported. Next, we determined the percent of the UniProtKB/Swiss-Prot protein that aligned to the best matching Trinity transcript

McGaugh et al. www.pnas.org/cgi/content/short/1419659112 1 of 19

through the perl script analyze_blastPlus_topHit_coverage.pl provided through Trinity.

Likely coding regions (ORFs) were extracted from Trinity transcripts using Transdecoder. Transdecoder identified between 25,945 and 113,672 best ORFs (mean: 65,766; median: 72,152). Transcriptome size of the best ORFs identified in Transdecoder ranged from 27.80 to 113.60 Mb (mean: 69.54 Mb; median: 78.65 Mb), indicating ∼57- to 269-fold coverage when considering the amount of filtered and trimmed data input into Trinity (range, 5.21–11.55 Gb; mean: 6.80 Gb; median: 6.43 Gb). These ORFs were clustered into centroids using USEARCH (6) separately for each transcriptome (see below for a more detailed description).

The coding sequence of the peptides produced by Transdecoder and the centroids were also analyzed with the analyze_blastPlus_topHit_coverage.pl script provided by Trinity to determine the percent length of coverage for the top hit in the UniProtKB/Swiss-Prot database. We conducted this analysis on the best ORF sequences and separately on the centroids to examine whether the Transdecoder or USEARCH processes resulted in ORFs that spanned a greater percent length of their best blast hit relative to the originally produced Trinity transcript contigs. Blastx analysis of the original Trinity transcripts to the UniProtKB/Swiss-Prot database resulted in an average of 54.10% (SD: 5.82%; median: 55.19%) of transcripts that matched a hit in the UniProtKB/Swiss-Prot database, covering at least 80% of the length of their best blast hit. This number increased slightly when the best ORF transcriptomes provided by Transdecoder (average: 56.30%; SD: 5.50%; median: 56.64%) or the USEARCH centroids (average: 58.00%; SD: 5.74%; median: 58.41%) were used in the Blastx analysis.

Last, because the Anolis carolinensis genome is published, we examined the percent length of transcripts from the best ORF analysis from Anolis sagrei that aligned to the Anolis carolinensis genome, using BLAT (7) (an alignment tool similar to BLAST), to provide a complementary measure of how many full-length transcripts were assembled. We did not do this for Alligator because this genome is less complete and low-length measures can be a reflection solely of a fragmented genome assembly.

We aligned Anolis sagrei Trinity-assembled Transdecoder-filtered RNAseq data to the Anolis carolinensis genome v2.0 genome scaffolds. From this, we found that 67% of transcripts aligned over at least 95% of their length with at least 80% identity, suggesting that ∼67% of our transcripts represent nearly full-length transcripts. Interestingly, 89.5% of transcripts aligned over at least 25% of their length, and only 51.3% of transcripts aligned over 99% of their length, indicating that, although many of our transcripts are present in the Anolis carolinensis genome, our assembly of RNAseq data did not capture all full-length transcripts. These percentages were comparable for the centroids (65.4%, 88.8%, and 49.3%, respectively).

The peptides from Transdecoder and centroids created in USEARCH were annotated with the Trinotate pipeline, which incorporates homology searches, protein domain identification, protein signal prediction, and evaluation with EMBL Uniprot eggNOG and GO Pathways databases. Specifically, we used Trinotate to run blastp to find the top hit in the UniProtKB/Swiss-Prot database (maximum e-value cutoff 0.001), HMMER to query the PFAM database downloaded on March 29, 2013, signalP to predict the presence and location of signal peptide cleavage sites, and tmHMM to predict transmembrane helices in proteins. The final Trinotate report was made with an e-value cutoff of 0.001 for reporting the best blast hit and additional annotations. On average, 77.62% of the best ORFs had matches in the UniProtKB/Swiss-Prot database (maximum e-value cutoff of 0.001), 61.17% had matches in the PFAM database, 5.73% had matches in signalP, and 12.38% had matches in tmHMM. On average, 18.3% of centroids were left with no annotation from any procedures performed (range, 10.43–26.51%). All Trinotate annotation databases are publicly available on Dryad: dx.doi.org/10.5061/dryad.vn872.

Identifying Candidate Orthologs and Generating Multiple Species Alignments. For any comparative evolutionary analysis, identification of putative orthologs and accurate alignment are essential but can be extremely challenging due to paralogs and alternative splicing. In addition, we found that in some cases, a particular species may have Trinity transcripts that blasted with high confidence to the particular gene of interest, but this species was unrepresented in our final multiple species alignments because Transdecoder did not include the transcript from that particular gene in its best ORF candidate file. To avoid this complication, we only used ORFs from the longest ORF file and not the best ORF predictions.

We reduced overlap between the ORFs for each individual species using USEARCH (6) with an identity threshold of 95% of the nucleotide sequences sorted by length (gaps are counted as differences in USEARCH). Because our goal was to cluster isoforms to have one representative sequence per gene, we reduced the gap penalties to the settings -gapopen 5I/1E -gapext 0.1I/0.1E. These clustered centroids were used for all subsequent analyses.

For these clustered ORFs for each species (centroids from USEARCH), we identified putative 1:1 orthologs across species using OrthoMCL (8), a program that is based on reciprocal best blast hits. We analyzed a dataset that contained 74 total samples: the 18 samples from our transcriptome project and 56 additional transcriptomes and gene sets available from genome projects and other past studies (Table S2). These literature-derived transcriptomes were made with various technologies and sometimes pools of individuals. We used the transcriptome assemblies provided by the authors in all cases. Transdecoder and USEARCH were run on literature-derived transcriptomes and RNA sets downloaded from NCBI. Ensembl protein sets and associated cDNAs were downloaded from the Ensembl website and used without additional processing steps. Species from Ensembl for which the protein or gene datasets contained large contiguous stretches of unknown bases were not included in our analysis. All amino acid and corresponding nucleotide clusters are available as separate files (104,235 total orthologs with two or more species) on Dryad along with a spreadsheet showing the best blast hit of each ortholog cluster to the uniprot database. In total, we started with 74 species, but pared this to 66 species for the alignments because the additional eight species were not well represented. These eight species (as named in the alignments: Python, Quail, Phrynops, Tuatara, Caiman, Caretta, Elaphe, and Emys) generally had lower quality or quantity of reads mined from previous studies, and all 74 species are represented in the original alignment data available through Dryad: dx.doi.org/10.5061/dryad.vn872.

We focused our analysis on 61 genes in the IIS/TOR network. The final set of genes (Fig. S1 and Tables S1 and S3) was determined by presence in KEGG pathways for Human Insulin Signaling (KEGG 04910) and Human mTOR (KEGG 04150) (9, 10), connections with Panther Pathways for MAP kinase cascade and insulin/IGF pathway-protein kinase B signaling cascade, and/or previous publications (11). We specifically wanted to include the extracellular hormones, receptors, and binding proteins in the insulin signaling network, which had not previously been included.

To identify this focal set of genes in our OrthoMCL orthologs, we performed two searches using Blastp. First, we made a reference gene set from the KEGG proteins from chicken or anole. This reference gene set was used as a blast database, and Blastp was used to find hits of our translated orthologs to the KEGG-derived protein blast database with an e-value cutoff of 1e-5. We also required a percent identity of at least 50% and at least 60% of our ortholog to align to the KEGG protein. Second, we conducted a

Blastp search using uniprot as the blast database. We used Blastp to identify the best hit in the uniprot blast database for each of our OrthoMCL-defined orthologs. For genes to be included in our subsequent analyses, we used only those OrthoMCL-defined orthologs where both the criteria for the KEGG protein blast were met and the description/name of the best blast hit from the uniprot blast output matched the name of the focal KEGG protein.

For the genes of interest, many of the OrthoMCL-defined orthologs contained multiple sequences from each species. Our goal was to generate alignments with one sequence per gene per species. We reduced redundancy in each OrthoMCL-defined ortholog using USEARCH as above. For each species, we used only the sequence that was most like the centroid of the USEARCH-clustered OrthoMCL-defined ortholog. In a few cases, reptiles and mammals formed separate clusters. All genes were clustered with identical parameters in USEARCH; however, the few genes that exhibited taxon-specific clusters may be particularly fast-evolving genes. For example, IGF2, PPP1R3D, MKNK1, and SOCS1 had mammal-specific and reptile-specific clusters. In some cases, we were able to combine these genes that appeared in separate clusters into one single multiple sequence alignment (e.g., IGF1R). For IRS4, marsupials and reptiles were clustered separately by USEARCH, and placental mammals were grouped in a separate ortholog by OrthoMCL. We did not combine these clusters for further analyses because the sequences were too divergent to create robust alignments. IRS4 has been identified as being under positive selection in other studies (12, 13), indicating that the alternative explanation for high divergence [i.e., that mutations in IRS4 function may be tolerated with only moderate phenotypic consequences (14)] may have weaker support. IRS4 is located on the X chromosome in mammals and chromosome 4 in chicken, and therefore it may be subjected to different selection pressures in placental mammals vs. reptiles (which include birds) due to its different location in the genome (in mammals it has three fourths the effective population size of autosomal genes). As with the other IRSs, IRS4 interacts with the intracellular domain of the insulin receptor and IGF1R (15–17). IRS4 functions in the cytoplasm in cell cycle progression and growth (18). It is also linked with decreased litter size, reduced growth and glucose homeostasis (14), and reduced maternal nurturing and canonical maternal behaviors in mice (e.g., aggression against intruders and extended latency in retrieving wayward pups) (14, 19). Given the high divergence of IRS4 in reptiles and mammals, it would be interesting to pursue whether IRS4 serves a particularly important role in physiological differences between reptiles and mammals.

For each putative ortholog clustered by USEARCH, we created multiple species alignments of the amino acid sequences using MSAProbs (20), which is more accurate than many other common aligners (21, 22). RevTrans (23) and the original nucleotide sequence for the centroid were used to generate nucleotide alignments from amino acid alignments. The command line version of TranslatorX (24) was used in conjunction with the MSAProbs alignments to produce Gblocks-cleaned amino acid and nucleotide alignments (25, 26) with the commands “-c 1 -t T -g -b4=2 -b5=a -b3=10 -b2=34 -t=p -p=s.” Because the nucleotide sequences were predicted ORFs from Trinity, we did not expect translation of the nucleotides to produce within-species frameshifts or stop codons; thus, we did not use a more sophisticated program such as MACSE (27).

For additional quality control of the test gene alignments, we visually inspected the alignments to ensure they were correctly aligned. Typically, editing included fixing aligned gaps and truncated sequences with obviously different start or stop codons causing small chunks at the beginning and end of an alignment for one or several species to be substantially different from all others. We made every effort to be as conservative as possible. In addition, we ensured that no paralogs were present in the alignments by blasting (with Blastp) each sequence in each alignment to the uniprot database and confirming that, for a single alignment, all sequences had a best blast hit with gene names identical to that expected for that gene. These measures were not performed for the control genes due to the enormity of manual correction for so many alignments. This approach makes comparisons between focal genes and control genes more conservative, as poorer quality alignments for control genes would artificially inflate how much positive selection is found in the control genes (28).

We also note that Gblocks is thought not to perform well, especially with indels (29), and therefore for a subset of genes (n = 70), we also used PRANK and GUIDANCE (30). We found that the nucleotide alignments contained on average 64.1% gaps (minimum = 17.4%, maximum = 95.3%) when generated by PRANK and GUIDANCE and 12.7% gaps (minimum = 0.2%, maximum = 31.5%) when processed with MSAProbs and Gblocks. For this reason, we favored the alignments generated with MSAProbs and Gblocks and used this method for all other alignments. The final focal gene alignments are available through Dryad: dx.doi.org/10.5061/dryad.vn872.

Classification of Connectivity. Because a gene’s position and extent of connections with other genes in a network influence the impact that mutations might have on the target phenotype (31, 32), we were interested in investigating whether more highly connected genes [connectivity being defined as the number of other genes or proteins to which a gene is directly connected (33)] have a different evolutionary rate than peripheral genes with few connections. To estimate the level of connectivity for each gene in the IIS/TOR network, we used NetworkAnalyzer (34) within Cytoscape v3.1.0 (35) to calculate the connectivity of all nodes in the BioGrid human reactome 3.2.95 (36) (including protein-protein and protein-gene interactions). We focus on the measures of node degree (i.e., connectivity) and betweenness centrality (34). Node degree is the number of edges or interactions that a gene has with other genes or proteins. Betweenness centrality ranges from 0 to 1 and reflects the amount of influence a node exerts on the interactions of the other nodes (37).

Molecular Evolutionary Analyses. For many of the analyses of molecular evolution, we required a tree that best represented the species tree for the 66 taxa included in our analyses. Because no single study exists with the tree for all of these species, we combined results from refs. 38 to 45 to generate a tree topology without branch lengths. Newick Utilities (46) was used to prune trees that contained fewer than the total 66 species.

Control Genes. We identified 1,417 putative orthologs that contained all 66 species and referred to these as control genes. The control genes may be biased toward being conserved, as it is conceivable that conserved genes are more likely to be recovered for all 66 species. Our dataset of 61 focal IIS/TOR genes generally contained most of the 66 species. In this focal gene set, 20% of the genes contained all 66 species and 62% of our focal genes contained 60 or more species (median = 62; mode = 66; mean = 58.4). The missing species in our 61 focal genes were mostly from the species for which we only had liver transcriptomes, and these species could potentially be missing in the alignments because the missing genes were not expressed in the liver and not because they were too divergent to be included. Therefore, we conducted two supplemental analyses using a reduced number of genes to test how sensitive our conclusions were to the specific control genes in our study.

Supplemental analysis I. First, we conducted an additional analysis that limited our focal gene dataset to the 48 genes containing between 56 and 66 species (mean: 62.6 species; median: 63.5; mode: 66 species). Although this is not a perfect comparison with the controls, this 48 focal gene set represents a very similar species

number distribution as the control gene dataset. This analysis was consistent with the findings of the original 61 focal gene set; therefore, we present the 61 focal gene set in the main text. Briefly, results from our analyses of this reduced 48-gene IIS/TOR dataset include the following:

i) Extracellular genes of the IIS/TOR network exhibited greater divergence between mammals and reptiles than the 1,417 control genes and intracellular genes. Extracellular genes had equivalent Ks compared with control genes (Wilcoxon rank sum test, W = 3818, P = 0.111), but had notably greater median ω (W = 2847, P = 0.015) and Ka (W = 2162.5, P < 0.003). Compared with intracellular genes, extracellular genes also had significantly higher ω (W = 243, P = 0.022) and Ka (W = 266, P = 0.004), but Ks did not differ (W = 199, P = 0.287).
ii) Collectively, the intracellular IIS/TOR genes within the 48-gene set did not have elevated median Ka, Ks, or ω compared with control genes (P > 0.287 in all cases). For ω and Ka, the medians for intracellular and control genes were very similar. Specifically, the median Ks for extracellular genes was 1.91; the median Ks was 1.61 for intracellular genes and 1.51 for control genes. The median Ka for extracellular genes was 0.16; the median Ka was 0.087 for intracellular genes and 0.083 for control genes. Finally, the median ω for extracellular genes was 0.10; ω was 0.051 for intracellular genes and 0.054 for control genes.
iii) When comparing the distribution of ω values for extracellular vs. intracellular IIS/TOR genes in the 48 focal gene set to the distribution of the ω values for the control genes, the extracellular genes were 6.5 times more likely than control genes to reside in the highest 5% of ω values (OR, 6.51; 95% CI, 1.29, 32.86). The intracellular genes were not more likely than controls to be in the top 5% (OR, 1.00; 95% CI, 0.236, 4.219). These odds ratios imply that the extracellular group contains the fastest evolving components of the IIS/TOR network. These three conclusions are in agreement with the 61 focal gene set analyses, which include some genes with fewer species, presented in the main text.

Supplemental analysis II. In addition to the 48 IIS/TOR focal gene analysis detailed above, we conducted a second analysis to address a different potential issue with the control genes. Specifically, to assess how potentially conserved the original 1,417 control genes with 66 species were, we identified additional control genes that contained the same phylogenetically matched species sets as our 61 IIS/TOR focal genes. In many cases, we only had a single phylogenetically matched control gene for any given IIS/TOR gene. We constructed a focal data set of 43 IIS/TOR genes (31 focal IIS/TOR genes with phylogenetically matched controls + 12 focal IIS/TOR genes that contained all 66 species) and compared Ka/Ks between the 43 focal genes and the 43 phylogenetically matched control genes. These 43 pairs of matched focal and control genes contained between 34 and 66 species (mean: 61 species; median: 64 species; mode: 66 species). When there was more than one phylogenetically matched control gene for a particular focal gene, we used a random number generator and took the control gene with the largest random number. Although we would have liked to phylogenetically match all original IIS/TOR focal genes with fewer than the total 66 species to a control gene, or even multiple control genes, we did not have phylogenetically matched controls in all cases.

For these 43 pairs of matched focal and control genes, the Ka/Ks and Ka values are somewhat elevated in the phylogenetically matched control gene set relative to the full set of 1,417 control genes (Ka/Ks phylo-match 43 control genes: 0.076; original 1,417 control genes: 0.054; Ka phylo-match 43 control genes: 0.115; original 1,417 control genes: 0.083). However, extracellular genes (n = 7 that had phylogenetically matched controls) still had significantly larger Ka values and marginally nonsignificant Ka/Ks even compared with the phylogenetically matched control gene set. One-tailed Wilcoxon rank sum tests indicated that trends were identical in the 43 matched control-focal comparisons relative to the other two gene sets we analyzed. Median Ka values were significantly different between the seven extracellular genes and their phylogenetically matched control genes (W = 83.5, P = 0.032), Ka/Ks values were marginally nonsignificant between extracellular and control genes (W = 100.5, P = 0.083), and Ks values remained nonsignificant between extracellular genes and controls (W = 120.5, P = 0.204). We suspect that the Ka/Ks Wilcoxon test is not significant in this reduced gene analysis due to a lack of power. In addition, many of the extracellular genes that were consistently found to be under positive selection within PAML (IGF1, IGF1R, IGF2, IRS1, and IRS2) were not included in this reduced analysis because no appropriate phylogenetically matched control genes were available.

Altogether, these two supplemental analyses that considered different means of designating control genes (i.e., the 48 focal genes that better matched the number of species in the 1,417 control genes and the 43 paired phylogenetically matched control-focal genes) are in agreement with our results reported in the main text for the 61 focal IIS/TOR genes and the corresponding 1,417 control genes.

Testing Whether the IIS/TOR Network Contains Fast-Evolving Outliers. To test for differences in evolutionary rate between mammals and reptiles for each of our focal genes, we used the clade model C, with M2a_rel as the null hypothesis (47). Clade models are less prone to false positives than branch-site models and better account for among-site variation in selective constraint (47). Importantly, the clade model C tests whether there is evidence for differential ω between the test clade and the remainder of the tree, and we did not use the results from the clade model as support for positive selection. For those test genes that were significant via the clade model, we compared the ω values (i.e., Ka/Ks) for each clade via paired Wilcoxon test and χ² tests.

To calculate evolutionary parameters ω, Ka, and Ks, we processed the Gblocks nucleotide alignments in PAML. Because we were specifically interested in molecular evolution between mammals and reptiles, for all IIS/TOR genes and control genes, we calculated the pairwise mammal and reptile divergence (every reptile-mammal comparison) from the 2NG.dN and 2NG.dS output files from PAML, which always output the same values regardless of the model because they are calculated with the Nei and Gojobori method (48). These results were very similar to a confirmatory analysis conducted using the analysis package from libsequence (49). Using a Wilcoxon rank sum test on the median ω, Ka, and Ks of pairwise comparisons between reptile and mammalian taxa, we tested whether the extracellular IIS/TOR genes or the intracellular IIS/TOR genes exhibited greater divergence between mammals and reptiles than the control genes.

Testing for Positive Selection for the IIS Network Genes. We conducted branch-site tests for positive selection in PAML (50–52), which compare the likelihood of a modified model A (model = 2, NSsites = 2, ω not fixed to 1) with the likelihood of the corresponding null model with ω fixed to 1. Two times the difference in log-likelihood between the two models conforms to a χ² distribution, permitting statistical tests. For the likelihood ratio test (LRT), a P value was estimated assuming a null distribution that is a 1:1 mixture of χ² distributions with 1 and 0 df (53, 54). For negative test statistics from the LRT (meaning that the null model fit the data better than the alternative), typically one would run PAML several times for these particular genes. Due to the computational time required for the number of genes we were testing and because it was unlikely that these genes would have

large positive test statistics in subsequent runs, we did not rerun any genes multiple times.

Validation of Procedure Based on IGF1. Previously, we documented increased divergence of IGF1 in lizards and snakes relative to other reptiles and mammals (55). Those data were generated using single-gene Sanger sequencing. In contrast, here we used a next-generation sequencing (NGS) approach, generating transcriptomes from Illumina RNAseq (100-bp paired end), followed by nearly automated multiple sequence alignments. We use IGF1 to compare these methods for both sequence quality and molecular evolutionary analyses. To estimate sequencing error, we compared the pairwise sequence identity of IGF1 for the six species included in both approaches. For each of these pairs, the sequences were >99.4% identical. In each case that was not 100% identical between the two approaches, the difference was due to an ambiguity code in the Sanger sequencing that represented within-species allelic diversity. Thus, we are confident that our NGS approach produced highly accurate sequence data for analysis. Furthermore, our NGS approach added 200 bp of sequence to the IGF1 alignment for every species.

To validate the molecular evolution analyses, we compared the sites that were identified to be under positive selection in our previous IGF1 analysis (55) to our current NGS approach [both approaches using the branch-site model in PAML (50–52), with the branch leading to Squamata (snakes and lizards) as the foreground branch]. Every positively selected site identified in ref. 55 had as strong or stronger support for being under positive selection in our current analyses. Overall, our NGS methods appear to improve on traditional methods.

Mapping Positively Selected Sites onto Protein Structures of Hormones and Receptors. To understand how positive selection may affect interactions between IGF hormones and receptors, we mapped the sites with a high probability of being under positive selection from the PAML branch-site analysis onto the predicted protein structures. Because snakes in particular appear to be highly divergent, we use a snake as a representative reptile for visualizing the predicted protein structures. We used Swiss-Model (56) to thread the snake sequences onto the human protein structures from the PDB: INS, PDB ID code 2KQP.1 (57); IGF1, PDB ID code 1BQT.1 (58); IGF2, PDB ID code 2L29.1 (59); IGF1R, PDB ID code 1IGR.1.A (60); and IGF2R, PDB ID code 2V5O.1 (61). From the PAML branch-site analyses described above, we mapped the sites with a BEB posterior probability >0.90 of being under positive selection (branch-site model of positive selection) in mammals or reptiles onto the amino acids in the mature protein structures and the full propeptide alignments. Separately for the reptile and mammal clades, we mapped the sites predicted from both branch-site models: one that specifically tests for selection on the branch leading to the clade of interest set in the foreground (e.g., the branch leading to reptiles) and one that tests for positive selection across the whole clade of interest (e.g., the whole clade of reptiles). We evaluated the clustering of positively selected sites within functional domains of the protein structure, and their relationship to the binding surfaces between the hormones and the receptors, as described by previous literature (Table S4).

Evaluating Variation in the Presence and Length of the IGF Binding Domains of the IGFBPs. The binding proteins consist of two domains: the IGF binding domain on the 5′ end and the thyroglobulin domain on the 3′ end. We noted that the IGFBPs were often truncated to various degrees on the 5′ end, leading to extensive variation among species in the length or presence of the N-terminal binding domain. We realigned the original sequences using ClustalX to specifically evaluate variation in the length of the IGF binding domain in the context of the protein structure. Additionally, we calculated similarity for each binding protein across the whole alignment using a Poisson correction model (62) in MEGA6 (63).

Coevolution Analysis of IGF Hormones and IGF2R in Reptiles. We used CAPS (64) to test for coevolving amino acid sites between IGF1 and IGF2R and between IGF2 and IGF2R in reptiles. CAPS uses the phylogenetic relationships from the sequence alignments along with the 3D structure of the proteins to identify coevolving pairs of amino acids using Pearson correlation coefficients. For these analyses, we used the amino acid sequence alignments with their respective human protein structures from PDB: IGF1, 1BQT.1 (58); IGF2, 2L29.1 (59); and IGF2R, 2V5O.1 (61). We used the following settings: bootstrap value of 0.8, gap threshold of 0.8, α threshold of P = 0.01, and 100 simulated alignments. Significance is estimated by comparing the observed coefficients to a distribution from pseudorandomly sampled amino acid pairs, correcting for multiple comparisons and nonindependence of data using a step-down permutation procedure (64).

Comparison of phylogenetic gene trees can be used to detect coevolution among genes (65). We used the MMMvII algorithm (66) to identify which subgroups of the hormone family (INS, IGF1, and IGF2) and IGF2R were most tightly coevolving across species. The MMMvII algorithm detects similarity between phylogenetic trees, using information from both the tree topology and the branch lengths, which are calculated by MMMvII. MMMvII identifies the most tightly coevolving subtrees for any given tolerance level, returning all possible solutions. For each hormone, we constructed a single multiple sequence alignment of the mature protein sequences using ClustalX (67) within Geneious v6.1.6 (68). For IGF2R, we focused on the region of the protein that is involved with binding the hormones: domains 11–13. To identify the most tightly coevolving subgroups of proteins, we set the tolerance level to 0.2. High levels of coevolution are achieved by large or multiple subsections of the gene trees changing in a coordinated fashion (topology and branch length). With this method, highly connected proteins may have no observable coevolution if they are highly conserved.

SI Results

Divergent Evolutionary Rates Between Mammals and Reptiles. We tested for differences in mammal-specific ω and reptile-specific ω using the clade model (47) for each of our 61 focal genes (each alignment contained 19–66 species; median: 62) in PAML (69). Significant genes included five extracellular genes (of a total of 10) and 21 intracellular genes (of a total of 51). Extracellular genes were not statistically more likely to be significant than intracellular genes in the clade model (Fisher’s exact test, P = 0.430). We also compared the distribution of likelihood ratio test statistics for the clade model relative to a null model for 1,417 control genes (SI Materials and Methods) to test statistics obtained for the 61 members of the network. Only IGF2R exhibited a result that was in the largest 5% of test statistics for IIS/TOR network + control genes. We compared the ω for each clade for those control genes where the clade model indicated support for a significant difference in ω between reptiles and mammals (n = 797 before sequential Bonferroni correction, n = 491 after sequential Bonferroni correction). In short, we found no appreciable difference between control and test genes; after correction for multiple testing, both had ∼77% of genes with larger ω in reptiles relative to the rest of the tree.

Connectivity Is Associated with Evolutionary Rate. Nonsynonymous reptile-mammal divergence (Ka) and ω were highly correlated with connectivity. For extracellular genes, Ka and ω were negatively correlated to the degree of connectivity (Ka Spearman’s ρ = −0.71, P = 0.02; ω Spearman’s ρ = −0.84, P < 0.01), and Ks

exhibited a positive, but nonsignificant relationship with degree of connectivity (Spearman’s ρ = 0.40, P = 0.26). Likewise, for intracellular genes, Ka and ω were negatively correlated to degree of connectivity (Ka Spearman’s ρ = −0.39, P < 0.01; ω Spearman’s ρ = −0.34, P = 0.01), whereas Ks was not (Spearman’s ρ < 0.01, P = 0.99). In other words, more connected genes generally had smaller nonsynonymous substitution rates than less connected genes; this result suggests that more connected genes experience more purifying selection than less connected genes. Importantly, the relationship of Ka and ω to degree of connectivity was stronger for extracellular genes than for intracellular genes. Indeed, an interaction term of connectivity and classification (intracellular vs. extracellular) in a linear model was nearly significant (P = 0.07), with extracellular genes having a steeper slope. Nearly identical results were obtained when using betweenness centrality (extracellular Ka Spearman’s ρ = −0.68, P = 0.03; ω Spearman’s ρ = −0.82, P < 0.01, Ks Spearman’s ρ = −0.37, P = 0.29); therefore, we focus further analyses on connectivity.

Expression level governs the amount of purifying selection (70, 71). Thus, expression must be accounted for to conclude that the lower evolutionary rates we observed in more connected genes are because of high connectivity. Finding a suitable expression measure across such a broad range of taxa is difficult. Because protein length is negatively correlated with expression level, we used the longest protein isoform in human to provide a proxy for potential impacts of expression on protein evolutionary rate. We found no relationship of Ka, Ks, ω, connectivity, or betweenness with the length of the longest protein isoform from human (Spearman’s ρ < 0.15, P > 0.24 in all cases). Also, more highly expressed genes experience stronger selection on Ks for more easily translated codons. Thus, a relationship between Ks and connectivity is a strong indication that expression level, not connectivity, is driving molecular evolution (71). We see no significant relationships between Ks and connectivity; hence, expression may not be a strong driver of the relationship between ω and connectivity in our data.

Evolutionary rates of members of the IIS/TOR network in our study were negatively related to connectivity. This result is consistent with findings for other pathways, such as the N-glycosylation pathway of primates (72) and the yeast proteome (71, 73–76). Likewise, a negative relationship of closeness centrality with Ka and ω occurs in the mammalian phototransduction pathway, and closeness centrality is largely influenced by connectivity (77).

Interpreting our findings requires two caveats. First, GC-biased gene conversion (preferential substitution of GC during recombination) can produce results that resemble positive selection, although such a confounding effect is usually attenuated with increased phylogenetic distance due to the lack of conservation in location of recombination hotspots (78). Thus, for mammal-reptile comparisons, this may not be a substantive concern. Further, genes indicated with the branch-site model to be under positive selection are less likely to be confounded by biased gene conversion than those indicated by the branch-test model (78). Second, we did not directly account for gene expression variation, intron number, and gene essentiality, and these are all variables associated with protein evolution (71, 75, 76). Not including these covariates could affect our conclusion regarding the importance of connectivity in influencing evolutionary rate. The choice of an appropriate tissue and developmental time point in which to measure expression level for all 66 species and the lack of gene expression data suitable for quantification in some species are vexing problems. However, we suspect that molecular evolutionary rate is influenced, at least in part, by connectivity because we found no relationship of Ka, Ks, ω, connectivity, or betweenness with the length of the longest protein isoform from human (a proxy for expression). In addition, as explained above, highly expressed genes experience selection on Ks for more easily translated codons, and we see no significant relationships between Ks and connectivity, a relationship that would indicate that expression level, not connectivity, is driving molecular evolution (71).

Tests for Positive Selection. We tested whether positive selection shaped evolution of IIS/TOR pathway genes using a branch-site model in PAML. This model, with the reptile clade specified as the foreground branch, was favored over the null model of neutral evolution for only two genes, both of which were intracellular: RPS6KA6 and MLST8 (after sequential Bonferroni correction; Table S3). In this test, the entire clade of reptiles, including terminal branches, was specified as the foreground branch. This relative lack of significance is likely due to variable selection among the diverse terminal branches, which span >350 My of evolution. Additional models are discussed in the main text and include a branch-site model of positive selection with the branch leading to the reptile clade as the foreground branch and a similar model with the branch leading to mammals designated as the foreground branch. We also conducted a series of taxon-specific branch-site tests, where the branch leading to a particular clade was specified as a foreground branch. The results of all tests are presented in Table S3.

As detailed in the main text, our results are concordant with previous work that suggests that extracellular genes in the IIS/TOR network may evolve more rapidly and are under stronger positive selection than the remainder of the network. For instance, DAF-2 (a homolog of the vertebrate IGF1R and INSR genes) is the most divergent protein in the IIS/TOR network across Caenorhabditis species (72), and changes in this receptor and interactions with its hormone may allow for rapid adaptation under shifting environmental conditions (71). Likewise, residues within the homolog of IGF1R (Drosophila’s insulin-like receptor) evolve under positive selection in Drosophila (79). In addition, IGF1 evolves under strong positive selection in snakes and lizards (55).

Evolution in Squamata. Because previous research indicates that components of the IIS/TOR network may be under strong positive selection in Squamata (lizards and snakes) (55), we also tested the branch-site model using the branch leading to snakes and lizards as the foreground branch. Fourteen genes exhibited significant support for positive selection along the branch leading to lizards and snakes; seven remained significant after sequential Bonferroni correction (IGF2R, IGF1R, PIK3R5, IRS2, IRS1, IKBKB, and TSC2; Table S3). These seven also exhibited test statistics that were in the largest 5% of test statistics for all (control and test) genes analyzed in this comparison. For crocodilians, birds, and turtles, fewer genes provided significant support for the branch-site model either before (13, 11, and 11 genes, respectively) or after multiple test correction (6, 1, and 5 genes, respectively). The bird comparison is particularly notable because birds represent an independent evolutionary origin of endothermy (vs. mammals).

We more explicitly assayed higher divergence in Squamata relative to the rest of the tree by the clade model with Squamata as the foreground clade. We detected 24 genes with significant support (after multiple test correction) for heterogeneous rates relative to the rest of the tree (a total of 33 before multiple test correction). For 14 of these significant genes, the ω estimated for the Squamata clade was larger than the estimate for the rest of the tree. However, this difference between the numbers of genes in Squamata that were more highly divergent than the rest of the tree was not significant (P > 0.3). Notably, IGF1, IGFBP2, RHEB, IGF2R, and INSR exhibit test statistics that were in the largest 5% of test statistics for all (control and focal) genes analyzed for the clade model with Squamata in the foreground.

In comparison, we detected 15 genes with significant support (after multiple test correction) for heterogeneous rates relative to the rest of the tree when using snakes as the foreground clade (a total of 30 before multiple test correction). For 11 of these significant

genes, the ω estimated for snakes was larger than the estimate for the rest of the tree, and the reverse was true for the other 4 genes. This difference between the numbers of genes in snakes that were more or less divergent than the rest of the tree was nearly significant (χ² = 3.27, P = 0.07). Similar results were obtained for a paired Wilcoxon test (V = 24, P = 0.04). However, only PRKCG, IGFBP2, and INSR exhibit test statistics that were in the largest 5% of test statistics for all (control and test) genes analyzed for the clade model with snakes in the foreground.

Overall, it appears that Squamata has qualitatively higher divergence in IIS/TOR network genes, and several more genes may be under positive selection on the branch leading to Squamata, than on the branch leading to crocodilians, birds, and turtles (tested independently). However, these differences are not exceptionally unique, and each branch of reptiles, excepting avian reptiles, contains multiple IIS/TOR genes under positive natural selection.

Mammal-Specific and Reptile-Specific Evolution of Hormones and Receptors. The amino acid sites that define the ability of IGF1 and IGF2 to bind IGF1R (mainly in domains A and B; Fig. 2) are conserved, indicating that these protein sequences are likely functional (80). The C-domains of IGF1 and IGF2 form a flexible loop that is oriented toward the binding pocket of INSR and IGF1R and contacts the CR domain in the binding pocket of the IGF1R and INSR (81) (Fig. 2). The IGF1 and IGF2 C-domain is essential to bind IGF1R (82), and variation in the C-domain regulates the specificity of the hormones binding to IGF1R (82) and to INSR (83). INSR has two isoforms due to the absence (INSR-A) or presence (INSR-B) of exon 11 (84). In mammals, both INSR isoforms bind INS with high affinity, but only INSR-A binds IGF2 with high affinity, and neither binds IGF1 with high affinity. This difference in INSR binding between IGF2 and IGF1 is driven by the C-domain of the hormones (83). For IGF1, 30% of the C-domain amino acids in reptiles are predicted to be under positive selection, whereas none of the C-domain sites of IGF1 in mammals are predicted to be under positive selection. In contrast, for IGF2, 25% of the C-domain amino acid sites in mammals were identified as being under positive selection, and no sites were under positive selection in the reptilian IGF2 C-domain (Fig. 2 and Table S4).

This positive selection in the C-domains of reptile IGF1 and mammal IGF2 suggests their binding affinities to IGF1R and INSR are likely variable across the species in the respective clades. IGF1R has three domains that are predicted to play a role in binding both IGF1 and IGF2 hormones (L1, CR, and L2 domains) (81, 85). Positively selected sites in reptiles clustered on the hormone-binding surface of the CR domain of IGF1R and include specific sites identified from mutagenesis studies to directly interact with the C-domain of the IGFs to regulate binding affinity. More specifically, from mutagenesis studies, one of these sites under positive selection on the IGF1R CR-domain (F251, human numbering) directly interacts with the IGF1 C-domain to regulate binding of IGF1R to IGF1 (81). Furthermore, one of the sites under positive selection on the reptilian IGF1 C-domain (R37, human numbering) regulates binding of IGF1 to IGF1R (80) (Table S4). Thus, the location and clustering of these positively selected sites on the hormone and the receptor suggest positive selection on the binding affinity between IGF1 and IGF1R across the reptiles. This signature of positive selection is absent in the mammalian IGF1 and IGF1R. In contrast, we see positive selection on the C-domain of IGF2 in mammals that regulates the binding to IGF1R and INSR. These positively selected sites in the C-domain of mammalian IGF2 may cause variation in the binding affinity between IGF2-IGF1R and IGF2-INSR among mammal species. Specifically, one of the IGF1 residues in mammals that inhibits high-affinity binding to IGF2R (R55) is an isoleucine (I55) in snakes, which is predicted to promote binding to IGF2R due to its hydrophobicity.

Coevolution of IGF2R and IGFs in Reptiles. In addition to high divergence in reptiles and snakes among focal genes mentioned above, many of the positively selected sites on the receptors and hormones are due to amino acid changes within the Squamates (lizards and snakes) relative to other reptiles. Our coevolution network analysis clearly signals strong coevolution of the receptors and hormones specifically within snakes or squamates. This rapid molecular evolution is in concordance with extensive recent work showing extreme adaptation in metabolic pathways of snakes (86, 87). Although nematodes and Drosophila are models for conservation of the intracellular IIS (88, 89), snakes and lizards may be models for examining the coevolution of the extracellular hormones-receptors. The CAPS analysis identified a pair of coevolving amino acids on IGF2 and IGF2R in reptiles: IGF2 P4 and IGF2R R1623 (ρ = 0.4, P < 0.01). No sites were identified as coevolving between IGF1 and IGF2R. To further predict how evolution has shaped the interactions between IGF2R and the IGF hormones in reptiles, we used MMMvII (66) to identify the species with the tightest correlated rates of evolution between IGFs and IGF2R based on the gene tree topologies and branch lengths, given a tolerance value of 0.2. Interestingly, within the reptiles, snakes (sunbeam and viper boa) had the tightest coevolutionary signal between hormone-receptor pairings IGF2 and IGF2R (ρ = 1), and the lizards (brown and green anoles and gecko) had the tightest coevolutionary signal between IGF1 and IGF2R (ρ = 0.33), suggesting that among the reptiles, these receptor-hormone relationships are most strongly coevolving in the squamate clade specifically.
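The Fisher’s exact comparison of extracellular and intracellular genes reported under Divergent Evolutionary Rates Between Mammals and Reptiles can be rechecked from the counts given there (5 of 10 extracellular and 21 of 51 intracellular genes significant). The following is a minimal sketch in Python using only the standard library; the function names are illustrative and do not come from the original analysis. It computes both a one-sided and a two-sided P value. Under the assumption that the reported value is one-sided (greater proportion of significant extracellular genes), the one-sided tail comes to roughly 0.43, in line with the reported P = 0.430; the text does not state which tail was used.

```python
from math import comb


def hypergeom_pmf(k, n_total, n_success, n_draw):
    """P(X = k) successes when drawing n_draw items without replacement."""
    return (comb(n_success, k) * comb(n_total - n_success, n_draw - k)
            / comb(n_total, n_draw))


def fisher_exact(a, b, c, d):
    """Return (one-sided 'greater' P, two-sided P) for the table [[a, b], [c, d]].

    Illustrative helper, not the authors' code.
    """
    n_total = a + b + c + d
    n_success = a + c  # column 1 total (significant genes)
    n_draw = a + b     # row 1 total (extracellular genes)
    k_min = max(0, n_draw - (n_total - n_success))
    k_max = min(n_draw, n_success)
    pmf = {k: hypergeom_pmf(k, n_total, n_success, n_draw)
           for k in range(k_min, k_max + 1)}
    # One-sided: tables with at least as many significant extracellular genes.
    p_greater = sum(p for k, p in pmf.items() if k >= a)
    # Two-sided: all tables no more probable than the observed one.
    p_obs = pmf[a]
    p_two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9))
    return p_greater, p_two_sided


# Rows: extracellular, intracellular; columns: significant, not significant.
p_greater, p_two_sided = fisher_exact(5, 5, 21, 30)
print(f"one-sided P = {p_greater:.3f}, two-sided P = {p_two_sided:.3f}")
```

Either tail leads to the same qualitative conclusion stated in the text: extracellular genes were not significantly more likely to be significant than intracellular genes.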

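The comparison of gene counts under Evolution in Squamata (11 vs. 4 genes with larger vs. smaller ω in snakes; 14 vs. 10 in Squamata) is consistent with a one-degree-of-freedom χ² goodness-of-fit test against an equal split. The sketch below assumes that form of test, without continuity correction; this is an assumption, because the text reports only the statistic and P value. The function name is illustrative. The survival function of a χ² variable with 1 df is computed as erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def chi2_equal_split(n_larger, n_smaller):
    """Chi-square goodness-of-fit test (1 df) against a 50:50 split.

    Illustrative helper; assumes no continuity correction.
    """
    expected = (n_larger + n_smaller) / 2
    stat = ((n_larger - expected) ** 2 + (n_smaller - expected) ** 2) / expected
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Snakes as foreground: 11 genes with larger omega, 4 with smaller.
snake_stat, snake_p = chi2_equal_split(11, 4)
# Squamata as foreground: 14 of 24 genes with larger omega.
squamata_stat, squamata_p = chi2_equal_split(14, 10)
print(f"snakes: chi2 = {snake_stat:.2f}, P = {snake_p:.2f}")
print(f"Squamata: chi2 = {squamata_stat:.2f}, P = {squamata_p:.2f}")
```

Under these assumptions the snake comparison gives a statistic close to the reported 3.27 with P near 0.07, and the Squamata comparison gives P above 0.3, matching the values in the text.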
1. Sparkman AM, Vleck CM, Bronikowski AM (2009) Evolutionary ecology of endocrine-mediated life-history variation in the garter snake Thamnophis elegans. Ecology 90(3):720–728.
2. Grabherr MG, et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29(7):644–652.
3. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17(1):10–12.
4. Lohse M, et al. (2012) RobiNA: A user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res 40(Web Server issue):W622–W627.
5. Cahais V, et al. (2012) Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol Ecol Resour 12(5):834–845.
6. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460–2461.
7. Kent WJ (2002) BLAT—The BLAST-like alignment tool. Genome Res 12(4):656–664.
8. Li L, Stoeckert CJ, Jr, Roos DS (2003) OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res 13(9):2178–2189.
9. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30.
10. Kanehisa M, et al. (2014) Data, information, knowledge and principle: Back to metabolism in KEGG. Nucleic Acids Res 42(Database issue):D199–D205.
11. Luisi P, et al. (2012) Network-level and population genetics analysis of the insulin/TOR signal transduction pathway across human populations. Mol Biol Evol 29(5):1379–1392.
12. Alvarez-Ponce D, Aguadé M, Rozas J (2011) Comparative genomics of the vertebrate insulin/TOR signal transduction pathway: A network-level analysis of selective pressures. Genome Biol Evol 3:87–101.
13. Wang M, et al. (2013) The molecular evolutionary patterns of the Insulin/FOXO signaling pathway. Evol Bioinform Online 9:1–16.
14. Fantin VR, Wang Q, Lienhard GE, Keller SR (2000) Mice lacking insulin receptor substrate 4 exhibit mild defects in growth, reproduction, and glucose homeostasis. Am J Physiol Endocrinol Metab 278(1):E127–E133.
15. Yenush L, White MF (1997) The IRS-signalling system during insulin and cytokine action. BioEssays 19(6):491–500.
16. Lavan BE, et al. (1997) A novel 160-kDa phosphotyrosine protein in insulin-treated embryonic kidney cells is a new member of the insulin receptor substrate family. J Biol Chem 272(34):21403–21407.
17. Fantin VR, et al. (1998) Characterization of insulin receptor substrate 4 in human embryonic kidney 293 cells. J Biol Chem 273(17):10726–10732.
18. Qu B-H, Karas M, Koval A, LeRoith D (1999) Insulin receptor substrate-4 enhances insulin-like growth factor-I-induced cell proliferation. J Biol Chem 274(44):31179–31184.
19. Xu X, et al. (2012) Modular genetic control of sexually dimorphic behaviors. Cell 148(3):596–607.
20. Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: Multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26(16):1958–1964.

21. Plyusnin I, Holm L (2012) Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 13(1):64.
22. Sievers F, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7(1):539.
23. Wernersson R, Pedersen AG (2003) RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31(13):3537–3539.
24. Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: Multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38(Web Server issue):W7–W13.
25. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56(4):564–577.
26. Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17(4):540–552.
27. Ranwez V, Harispe S, Delsuc F, Douzery EJ (2011) MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS ONE 6(9):e22594.
28. Schneider A, et al. (2009) Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol 1:114–118.
29. Jordan G, Goldman N (2012) The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol 29(4):1125–1139.
30. Penn O, et al. (2010) GUIDANCE: A web server for assessing alignment confidence scores. Nucleic Acids Res 38(Web Server issue):W23–W28.
31. Wright KM, Rausher MD (2010) The evolution of control and distribution of adaptive mutations in a metabolic pathway. Genetics 184(2):483–502.
32. Kim PM, Korbel JO, Gerstein MB (2007) Positive selection at the protein network periphery: Evaluation in terms of structural constraints and cellular context. Proc Natl Acad Sci USA 104(51):20274–20279.
33. Hahn MW, Kern AD (2005) Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol 22(4):803–806.
34. Doncheva NT, Assenov Y, Domingues FS, Albrecht M (2012) Topological analysis and interactive visualization of biological networks and protein structures. Nat Protoc 7(4):670–685.
35. Shannon P, et al. (2003) Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504.
36. Stark C, et al. (2006) BioGRID: A general repository for interaction datasets. Nucleic Acids Res 34(Database issue, suppl 1):D535–D539.
37. Yoon J, Blumer A, Lee K (2006) An algorithm for modularity analysis of directed and weighted biological networks based on edge-betweenness centrality. Bioinformatics 22(24):3106–3108.
38. Wiens JJ, et al. (2012) Resolving the phylogeny of lizards and snakes (Squamata) with extensive sampling of genes and species. Biol Lett 8(6):1043–1046.
39. Kimball RT, Wang N, Heimer-McGinn V, Ferguson C, Braun EL (2013) Identifying localized biases in large datasets: A case study using the avian tree of life. Mol Phylogenet Evol 69(3):1021–1032.
40. McCormack JE, et al. (2013) A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing. PLoS ONE 8(1):e54848.
41. Thomson RC, Shaffer HB (2010) Sparse supermatrices for phylogenetic inference: Taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst Biol 59(1):42–58.
42. Perelman P, et al. (2011) A molecular phylogeny of living primates. PLoS Genet 7(3):e1001342.
43. Eo SH, Bininda-Emonds OR, Carroll JP (2009) A phylogenetic supertree of the fowls (Galloanserae, Aves). Zool Scr 38(5):465–481.
44. Hedges SB, Kumar S (2009) The Timetree of Life (Oxford Univ Press, New York).
45. dos Reis M, et al. (2012) Phylogenomic datasets provide both precision and accuracy in estimating the timescale of placental mammal phylogeny. Proc Roy Soc B Biol Sci 279(1742):3491–3500.
46. Junier T, Zdobnov EM (2010) The Newick utilities: High-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics 26(13):1669–1670.
47. Weadick CJ, Chang BS (2012) An improved likelihood ratio test for detecting site-specific functional divergence among clades of protein-coding genes. Mol Biol Evol 29(5):1297–1300.
48. Nei M, Gojobori T (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3(5):418–426.
49. Thornton K (2003) Libsequence: A C++ class library for evolutionary genetic analysis. Bioinformatics 19(17):2325–2327.
50. Yang Z, Nielsen R (2002) Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol 19(6):908–917.
51. Zhang J, Nielsen R, Yang Z (2005) Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol 22(12):2472–2479.
52. Yang Z, dos Reis M (2011) Statistical properties of the branch-site test of positive selection. Mol Biol Evol 28(3):1217–1228.
53. Self SG, Liang K-L (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc 82(398):605–610.
54. Goldman N, Whelan S (2000) Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol Biol Evol 17(6):975–978.
55. Sparkman AM, et al. (2012) Rates of molecular evolution vary in vertebrates for insulin-like growth factor-1 (IGF-1), a pleiotropic locus that regulates life history traits. Gen Comp Endocrinol 178(1):164–173.
56. Biasini M, et al. (2014) SWISS-MODEL: Modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42(Web Server issue):W252–W258.
57. Yang Y, et al. (2010) Solution structure of proinsulin: Connecting domain flexibility and prohormone processing. J Biol Chem 285(11):7847–7851.
58. Sato A, et al. (1993) Three-dimensional structure of human insulin-like growth factor-I (IGF-I) determined by 1H-NMR and distance geometry. Int J Pept Protein Res 41(5):433–440.
59. Williams C, et al. (2012) An exon splice enhancer primes IGF2:IGF2R binding site structure and function evolution. Science 338(6111):1209–1213.
60. Garrett TPJ, et al. (1998) Crystal structure of the first three domains of the type-1 insulin-like growth factor receptor. Nature 394(6691):395–399.
61. Brown J, et al. (2008) Structure and functional analysis of the IGF-II/IGF2R interaction. EMBO J 27(1):265–276.
62. Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. Evolving Genes and Proteins, eds Bryson V, Vogel HJ (Academic Press, New York).
63. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30(12):2725–2729.
64. Fares MA, McNally D (2006) CAPS: Coevolution analysis using protein sequences. Bioinformatics 22(22):2821–2822.
65. de Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249–261.
66. Rodionov A, Bezginov A, Rose J, Tillier ER (2011) A new, fast algorithm for detecting protein coevolution using maximum compatible cliques. Algorithms Mol Biol 6(1):17.
67. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24):4876–4882.
68. Kearse M, et al. (2012) Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647–1649.
69. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8):1586–1591.
70. Subramanian S, Kumar S (2004) Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168(1):373–381.
71. Jovelin R, Phillips PC (2011) Expression level drives the pattern of selective constraints along the insulin/Tor signal transduction pathway in Caenorhabditis. Genome Biol Evol 3:715–722.
72. Montanucci L, Laayouni H, Dall’Olio GM, Bertranpetit J (2011) Molecular evolution and network-level analysis of the N-glycosylation metabolic pathway across primates. Mol Biol Evol 28(1):813–823.
73. Fraser HB, Wall DP, Hirsh AE (2003) A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol Biol 3(1):11.
74. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW (2002) Evolutionary rate in the protein interaction network. Science 296(5568):750–752.
75. Bloom JD, Adami C (2004) Evolutionary rate depends on number of protein-protein interactions independently of gene expression level: Response. BMC Evol Biol 4(1):14.
76. Larracuente AM, et al. (2008) Evolution of protein-coding genes in Drosophila. Trends Genet 24(3):114–123.
77. Invergo BM, Montanucci L, Laayouni H, Bertranpetit J (2013) A system-level, molecular evolutionary analysis of mammalian phototransduction. BMC Evol Biol 13(1):52.
78. Ratnakumar A, et al. (2010) Detecting positive selection within genomes: The problem of biased gene conversion. Philos Trans R Soc Lond B Biol Sci 365(1552):2571–2580.
79. Guirao-Rico S, Aguadé M (2009) Positive selection has driven the evolution of the Drosophila insulin-like receptor (InR) at different timescales. Mol Biol Evol 26(8):1723–1732.
80. Denley A, Cosgrove LJ, Booker GW, Wallace JC, Forbes BE (2005) Molecular interactions of the IGF system. Cytokine Growth Factor Rev 16(4-5):421–439.
81. Keyhanfar M, Booker GW, Whittaker J, Wallace JC, Forbes BE (2007) Precise mapping of an IGF-I-binding site on the IGF-1R. Biochem J 401(1):269–277.
82. Bayne ML, et al. (1989) The C region of human insulin-like growth factor (IGF) I is required for high affinity binding to the type 1 IGF receptor. J Biol Chem 264(19):11004–11008.
83. Denley A, et al. (2004) Structural determinants for high-affinity binding of insulin-like growth factor II to insulin receptor (IR)-A, the exon 11 minus isoform of the IR. Mol Endocrinol 18(10):2502–2512.
84. Seino S, Bell GI (1989) Alternative splicing of human insulin receptor messenger RNA. Biochem Biophys Res Commun 159(1):312–316.
85. Epa VC, Ward CW (2006) Model for the complex between the insulin-like growth factor I and its receptor: Towards designing antagonists for the IGF-1 receptor. Protein Eng Des Sel 19(8):377–384.
86. Castoe TA, Jiang ZJ, Gu W, Wang ZO, Pollock DD (2008) Adaptive evolution and functional redesign of core metabolic proteins in snakes. PLoS ONE 3(5):e2201.
87. Castoe TA, et al. (2009) Evidence for an ancient adaptive episode of convergent molecular evolution. Proc Natl Acad Sci USA 106(22):8986–8991.
88. Oldham S (2011) Obesity and nutrient sensing TOR pathway in flies and vertebrates: Functional conservation of genetic mechanisms. Trends Endocrinol Metab 22(2):45–52.
89. Tatar M, Bartke A, Antebi A (2003) The endocrine regulation of aging by insulin-like signals. Science 299(5611):1346–1351.

[Fig. S1 image: pathway diagram. Labeled components include the extracellular ligands and binding proteins (INS, IGF1, IGF2, IGFBP1–6), the receptors INSR and IGF1R, and intracellular proteins; labeled outputs include lipogenesis, glycogenesis, protein synthesis, autophagy, apoptosis, survival/growth/proliferation, differentiation, and gene expression.]

Fig. S1. The IIS/TOR signaling network. Proteins not included in this study due to lack of sequence data across species are in gray. Gene names correspond to Tables S1 and S3. Genes in yellow are those identified by the CMCreptiles model (last column of Table S3) as having highly divergent Ka/Ks in reptiles relative to the rest of the tree, significant after correction for multiple comparisons. *IRS4 and *IGFBP6 were analyzed manually because of their exceptional divergence in sequence and length between reptiles and mammals (Table S5 and Fig. S3). Figure modified from ProteinLounge.com, SABiosciences.

Fig. S2. A rooted cladogram showing the phylogenetic relationships among the species included in this study.

Fig. S3. Annotated amino acid alignment of IGFBP6. The human sequence is set as a reference at the top of the alignment, and sequence differences from the reference sequence are highlighted. We provide functional annotation on the human sequence. The N- and C-terminal domains are in red; the cysteine residues are in dark blue. IGF binding sites that are conserved across all binding proteins are marked in cyan (excepting two snake species, for which only one of these is conserved). IGF binding sites specific to IGFBP6 are marked in green, and the sites with different function (e.g., integrin binding) are marked in gray.

Table S1. IIS/TOR genes used in this study and their estimates of divergence between reptiles and mammals

Function | Symbol | EntrezID | Betweenness | Degree | Mammal | Gator | Lizard | Bird | Turtle | Snake | Total | Length | N | ω | Ka | Ks
Extracellular | IGF1 | 3479 | 2.19E-04 | 20 | 32 | 2 | 3 | 10 | 5 | 7 | 59 | 0.54 | 864 | 0.12 | 0.17 | 1.30
Extracellular | IGF1R | 100500937 | 4.10E-04 | 88 | 31 | 1 | 4 | 10 | 5 | 5 | 56 | 0.98 | 772 | 0.05 | 0.09 | 1.66
Extracellular | IGF2 | 3481 | 0 | 1 | 27 | 2 | 7 | 8 | 7 | 7 | 58 | 0.69 | 837 | 0.15 | 0.31 | 2.17
Extracellular | IGF2R | 3482 | 5.55E-05 | 34 | 25 | 2 | 7 | 10 | 8 | 7 | 59 | 0.81 | 850 | 0.11 | 0.31 | 3.00
Extracellular | IGFBP2 | 3485 | 0 | 1 | 31 | 2 | 7 | 10 | 6 | 7 | 63 | 0.67 | 992 | 0.14 | 0.18 | 1.35
Extracellular | IGFBP3 | 3486 | 4.45E-04 | 36 | 29 | 2 | 5 | 10 | 7 | 5 | 58 | 0.61 | 841 | 0.06 | 0.15 | 3.00
Extracellular | IGFBP4 | 3487 | 5.40E-07 | 5 | 29 | 2 | 7 | 0 | 7 | 5 | 50 | 1 | 609 | 0.12 | 0.17 | 1.51
Extracellular | IGFBP5 | 3488 | 4.42E-05 | 12 | 30 | 2 | 7 | 9 | 4 | 7 | 59 | 0.40 | 869 | 0.10 | 0.11 | 1.23
Extracellular | INS | 3630 | 1.98E-06 | 6 | 20 | 1 | 2 | 10 | 1 | 0 | 34 | 0.96 | 280 | 0.24 | 0.31 | 1.32
Extracellular | INSR | 3643 | 7.80E-04 | 76 | 32 | 2 | 6 | 10 | 8 | 7 | 65 | 0.96 | 1,041 | 0.05 | 0.12 | 2.88
Intracellular | AKT1S1 | 84335 | 9.10E-07 | 10 | 24 | 2 | 7 | 0 | 4 | 7 | 44 | 0.87 | 480 | 0.18 | 0.34 | 1.91
Intracellular | CALM1 | 801 | 1.90E-04 | 26 | 31 | 2 | 7 | 10 | 8 | 7 | 65 | 0.99 | 1,054 | 0.00 | 0.01 | 2.98
Intracellular | EIF4E | 1977 | 1.81E-04 | 60 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.87 | 1,088 | 0.06 | 0.03 | 0.55
Intracellular | EIF4E2 | 9470 | 2.36E-05 | 13 | 32 | 2 | 7 | 10 | 7 | 7 | 65 | 0.89 | 1,055 | 0.02 | 0.03 | 1.41
Intracellular | FOXO1 | 2308 | 4.39E-05 | 63 | 31 | 2 | 7 | 10 | 7 | 6 | 63 | 0.59 | 992 | 0.07 | 0.12 | 1.76
Intracellular | GRB2 | 2885 | 8.00E-08 | 2 | 31 | 2 | 6 | 10 | 8 | 7 | 64 | 1 | 1,023 | 0.01 | 0.02 | 1.15
Intracellular | IKBKB | 3551 | 2.96E-05 | 3 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.79 | 1,087 | 0.06 | 0.12 | 1.88
Intracellular | INPPL1 | 3636 | 0 | 1 | 32 | 2 | 6 | 6 | 7 | 7 | 60 | 0.67 | 895 | 0.07 | 0.09 | 1.20
Intracellular | IRS1 | 3667 | 0 | 2 | 29 | 2 | 6 | 10 | 6 | 0 | 53 | 0.61 | 696 | 0.07 | 0.11 | 1.40
Intracellular | IRS2 | 8660 | 4.50E-05 | 40 | 24 | 1 | 0 | 9 | 6 | 0 | 40 | 0.20 | 383 | 0.19 | 0.19 | 0.97
Intracellular | KRAS | 3845 | 7.19E-05 | 34 | 26 | 2 | 7 | 10 | 8 | 7 | 60 | 0.98 | 884 | 0.02 | 0.05 | 3.00
Intracellular | MAPK10 | 5602 | 1.06E-04 | 28 | 29 | 1 | 2 | 10 | 3 | 0 | 45 | 0.92 | 464 | 0.01 | 0.02 | 0.95
Intracellular | MKNK1 | 8569 | 3.73E-06 | 21 | 30 | 1 | 6 | 10 | 8 | 4 | 59 | 0.89 | 870 | 0.05 | 0.10 | 2.14
Intracellular | MLST8 | 64223 | 9.28E-05 | 30 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.99 | 1,088 | 0.03 | 0.04 | 1.43
Intracellular | MTOR | 2475 | 4.00E-08 | 3 | 31 | 2 | 7 | 10 | 8 | 7 | 65 | 0.99 | 1,041 | 0.01 | 0.02 | 1.62
Intracellular | NRAS | 4893 | 3.20E-07 | 3 | 27 | 1 | 6 | 10 | 7 | 7 | 58 | 0.99 | 837 | 0.01 | 0.02 | 2.64
Intracellular | PDE3B | 5140 | 1.43E-05 | 4 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.82 | 1,084 | 0.15 | 0.18 | 1.30
Intracellular | PDPK1 | 5170 | 8.32E-05 | 69 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.98 | 1,085 | 0.03 | 0.06 | 1.96
Intracellular | PHKB | 5257 | 2.21E-06 | 6 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.98 | 1,087 | 0.05 | 0.09 | 1.68
Intracellular | PHKG1 | 5260 | 0 | 1 | 31 | 1 | 1 | 10 | 2 | 0 | 45 | 0.92 | 434 | 0.13 | 0.13 | 1.05
Intracellular | PIK3CA | 5290 | 2.67E-04 | 59 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.52 | 1,088 | 0.03 | 0.04 | 1.70
Intracellular | PIK3CB | 5291 | 9.32E-06 | 14 | 32 | 2 | 6 | 10 | 6 | 4 | 60 | 0.99 | 895 | 0.05 | 0.08 | 1.49
Intracellular | PIK3CD | 5293 | 0 | 1 | 30 | 2 | 6 | 10 | 8 | 7 | 63 | 0.97 | 989 | 0.07 | 0.11 | 1.54
Intracellular | PIK3CG | 5294 | 6.84E-06 | 27 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 1 | 1,082 | 0.03 | 0.10 | 3.00
Intracellular | PIK3R5 | 23533 | 5.51E-06 | 7 | 32 | 1 | 5 | 10 | 5 | 7 | 60 | 0.89 | 841 | 0.14 | 0.24 | 1.57
Intracellular | PPARGC1A | 10891 | 1.95E-04 | 91 | 32 | 2 | 7 | 10 | 7 | 1 | 59 | 0.85 | 863 | 0.12 | 0.08 | 0.68
Intracellular | PPP1R3C | 5507 | 5.60E-06 | 7 | 32 | 2 | 7 | 10 | 6 | 7 | 64 | 0.99 | 1,024 | 0.09 | 0.21 | 2.23
Intracellular | PPP1R3D | 5509 | 2.40E-07 | 3 | 29 | 1 | 7 | 10 | 6 | 6 | 59 | 0.80 | 870 | 0.12 | 0.28 | 2.37
Intracellular | PRKAA2 | 5563 | 3.10E-04 | 52 | 31 | 1 | 3 | 9 | 4 | 0 | 48 | 0.94 | 527 | 0.02 | 0.04 | 2.43
Intracellular | PRKCG | 5582 | 2.54E-04 | 19 | 31 | 2 | 5 | 0 | 5 | 4 | 47 | 0.64 | 489 | 0.06 | 0.09 | 1.99
Intracellular | PTEN | 5728 | 0.00141419 | 114 | 31 | 2 | 7 | 10 | 8 | 7 | 65 | 0.94 | 1,054 | 0.05 | 0.03 | 0.57
Intracellular | PTPN1 | 5770 | 4.04E-05 | 30 | 30 | 2 | 7 | 10 | 8 | 6 | 63 | 0.95 | 990 | 0.04 | 0.11 | 2.62
Intracellular | RHEB | 6009 | 2.03E-06 | 36 | 30 | 2 | 7 | 9 | 8 | 6 | 62 | 0.99 | 960 | 0.01 | 0.01 | 0.89
Intracellular | RICTOR | 253260 | 3.27E-04 | 86 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.83 | 1,079 | 0.05 | 0.07 | 1.22
Intracellular | RPS6 | 6194 | 3.91E-05 | 44 | 32 | 2 | 6 | 10 | 8 | 6 | 64 | 0.99 | 1,024 | 0.01 | 0.02 | 1.87
Intracellular | RPS6KA6 | 27330 | 1.08E-04 | 28 | 4 | 1 | 6 | 0 | 2 | 6 | 19 | 0.99 | 60 | 0.06 | 0.09 | 1.62
Intracellular | SGK1 | 6446 | 2.63E-04 | 70 | 31 | 2 | 7 | 10 | 6 | 6 | 62 | 0.80 | 961 | 0.02 | 0.04 | 1.42
Intracellular | SH2B2 | 10603 | 0 | 2 | 30 | 2 | 6 | 10 | 6 | 6 | 60 | 0.13 | 880 | 0.14 | 0.18 | 1.41
Intracellular | SHC1 | 6464 | 6.00E-08 | 2 | 31 | 2 | 7 | 10 | 6 | 7 | 63 | 0.78 | 992 | 0.06 | 0.09 | 1.30
Intracellular | SHC2 | 25759 | 2.00E-08 | 5 | 28 | 2 | 1 | 7 | 2 | 0 | 40 | 0.70 | 336 | 0.15 | 0.13 | 0.93
Intracellular | SHC3 | 53358 | 3.42E-06 | 10 | 30 | 1 | 0 | 10 | 2 | 0 | 43 | 0.70 | 390 | 0.06 | 0.16 | 2.98
Intracellular | SOCS1 | 8651 | 0 | 8 | 29 | 1 | 7 | 10 | 7 | 4 | 58 | 0.92 | 841 | 0.14 | 0.23 | 1.63
Intracellular | SOCS3 | 9021 | 0 | 6 | 30 | 0 | 6 | 8 | 4 | 5 | 53 | 0.92 | 688 | 0.09 | 0.07 | 0.75
Intracellular | SOCS4 | 122809 | 0 | 6 | 29 | 2 | 5 | 9 | 8 | 3 | 56 | 0.95 | 783 | 0.07 | 0.10 | 1.82
Intracellular | SOS1 | 6654 | 3.03E-04 | 114 | 31 | 2 | 8 | 10 | 8 | 7 | 66 | 0.97 | 1,042 | 0.04 | 0.05 | 1.15
Intracellular | STK11 | 6794 | 0 | 1 | 31 | 2 | 7 | 9 | 8 | 7 | 64 | 1 | 1,023 | 0.03 | 0.07 | 2.44
Intracellular | STRADA | 92335 | 4.00E-08 | 12 | 31 | 2 | 7 | 10 | 8 | 7 | 65 | 0.49 | 1,054 | 0.04 | 0.09 | 2.88
Intracellular | TSC1 | 7248 | 0 | 1 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.96 | 1,084 | 0.11 | 0.15 | 1.34
Intracellular | TSC2 | 7249 | 6.56E-05 | 120 | 32 | 2 | 7 | 10 | 8 | 7 | 66 | 0.36 | 1,063 | 0.06 | 0.11 | 1.83
Intracellular | ULK2 | 9706 | 7.04E-06 | 9 | 32 | 2 | 7 | 10 | 6 | 7 | 64 | 0.28 | 1,017 | 0.05 | 0.10 | 1.60
Intracellular | ULK3 | 25989 | 0 | 1 | 32 | 2 | 7 | 10 | 7 | 7 | 65 | 0.90 | 1,054 | 0.11 | 0.14 | 1.23

Bold HGNC gene symbols are genes classified as extracellular; those not in bold are intracellular. Betweenness is the amount of influence a node exerts on the interactions of the other nodes (range 0–1). Degree is a measure of connectivity and is the number of edges or interactions that a gene has with other genes or proteins based on BioGrid human reactome 3.2.95 (1) (including protein-protein and protein-gene interactions). The numbers below each taxon represent the number of sequences from that group represented in the alignment. Total is the number of sequences in the alignment; N = total pairwise comparisons between reptiles and mammals used to calculate divergence measures. Divergence measures (Ka, nonsynonymous divergence; Ks, synonymous divergence; ω, nonsynonymous/synonymous) are the medians of the pairwise comparisons calculated in PAML between reptiles and mammals. Length is the median length of sequences in the multiple species alignment given as a proportion of the longest human isoform.

1. Stark C, et al. (2006) BioGRID: A general repository for interaction datasets. Nucleic Acids Res 34(Database issue, suppl 1):D535–D539.
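As an illustration of how the summary columns in Table S1 relate to the underlying pairwise comparisons, the following Python sketch takes per-pair Ka and Ks estimates and reports their medians together with the median of the per-pair ω. The function name, the input format, and the example values are hypothetical, and skipping pairs with Ks = 0 is an assumption made here for illustration, not a step described in the table note.

```python
from statistics import median

def summarize_divergence(pairs):
    """Summarize reptile-mammal pairwise estimates given as (Ka, Ks) tuples."""
    # Assumption: pairs with Ks == 0 are skipped because omega is undefined.
    usable = [(ka, ks) for ka, ks in pairs if ks > 0]
    return {
        "N": len(usable),                                # comparisons used
        "Ka": median(ka for ka, _ in usable),            # median nonsynonymous
        "Ks": median(ks for _, ks in usable),            # median synonymous
        "omega": median(ka / ks for ka, ks in usable),   # median of per-pair Ka/Ks
    }

# Hypothetical values for three pairwise comparisons.
example = [(0.10, 1.00), (0.20, 1.60), (0.30, 1.50)]
summary = summarize_divergence(example)
```

Because ω is summarized here as the median of per-pair ratios, it need not equal the ratio of the median Ka to the median Ks. This is consistent with rows of Table S1 such as IGF1, where ω = 0.12 while 0.17/1.30 ≈ 0.13.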

Table S2. Genomic and transcriptomic datasets used in this study

Common name | Species name | Tissue | Total contigs | Mean (bp) | N50 (bp) | n:N50 | GC | Citation
Green anole* | Anolis carolinensis | Multiple | 19,177 | 1,589 | 2,094 | 4,305 | 48.42 | (1, 2)
Red-eared slider turtle | Trachemys scripta | Brain | 55,456 | 767 | 1,074 | 10,920 | 50.88 | (3)
 | | Embryonic stage 14, 17 | included above | | | | | (4)
Painted turtle† | Chrysemys picta | Multiple | 25,802 | 1,646 | 2,091 | 5,881 | 48.83 | (5)
Galápagos tortoise† | Chelonoidis nigra | Blood | 19,668 | 615 | 687 | 5,265 | 45.89 | (6)
Chinese softshell turtle* | Pelodiscus sinensis | Multiple | 20,668 | 1,588 | 2,013 | 4,770 | 48.46 | (7)
Chinese alligator† | Alligator sinensis | Multiple | 38,114 | 1,104 | 1,686 | 6,732 | 49.70 | (8)
Pigeon† | Columba livia | Multiple | 31,132 | 118 | 1,737 | 5,435 | 50.56 | (9)
Darwin finch† | Geospiza fortis | Multiple | 28,607 | 1,140 | 1,749 | 5,017 | 51.04 | (10)
Budgerigar◇ | Melopsittacus undulatus | Multiple | 26,145 | 1,179 | 1,818 | 4,610 | 49.28 | (11)
Saker falcon† | Falco cherrug | Multiple | 26,628 | 1,207 | 1,875 | 4,689 | 49.37 | (12)
Peregrine falcon† | Falco peregrinus | Multiple | 27,810 | 1,206 | 1,869 | 4,797 | 49.63 | (12)
Collared flycatcher* | Ficedula albicollis | Multiple | 15,893 | 1,635 | 2,202 | 3,430 | 52.17 | (13)
Turkey* | Meleagris gallopavo | Multiple | 16,496 | 1,596 | 2,148 | 3,634 | 48.61 | (14)
Chicken* | Gallus gallus | Multiple | 16,354 | 1,669 | 2,223 | 3,537 | 50.32 | (15, 16)
Zebrafinch* | Taeniopygia guttata | Multiple | 18,204 | 1,347 | 1,911 | 3,644 | 50.75 | (17)
Duck* | Anas platyrhynchos | Multiple | 16,353 | 1,494 | 2,142 | 3,265 | 49.25 | (18)
Tenrec† | Echinops telfairi | Multiple | 38,810 | 1,097 | 1,605 | 7,135 | 54.18 | (19)
Elephant* | Loxodonta africana | Multiple | 25,635 | 1,623 | 2,109 | 5,771 | 51.76 | (19)
Rat* | Rattus norvegicus | Multiple | 25,725 | 1,532 | 2,043 | 5,571 | 51.77 | (20)
Mouse* | Mus musculus | Multiple | 50,718 | 1,358 | 2,013 | 9,740 | 51.95 | (19)
Shrew† | Sorex araneus | Multiple | 40,099 | 1,125 | 1,590 | 7,676 | 55.44 | (19)
Vole† | Microtus ochrogaster | Multiple | 46,900 | 1,042 | 1,620 | 8,080 | 52.22 | Unpublished Broad Institute
Ground squirrel* | Ictidomys tridecemlineatus | Multiple | 20,000 | 1,542 | 1,932 | 4,560 | 51.83 | (19)
Pika† | Ochotona princeps | Multiple | 40,749 | 1,092 | 1,632 | 7,378 | 54.49 | (19)
European rabbit* | Oryctolagus cuniculus | Multiple | 20,588 | 1,602 | 2,100 | 4,533 | 53.88 | (19)
Naked mole rat† | Heterocephalus glaber | Multiple | 69,635 | 1,046 | 1,578 | 12,738 | 53.87 | (21)
Guinea pig* | Cavia porcellus | Multiple | 19,774 | 1,567 | 2,058 | 4,357 | 52.56 | (19)
Bush baby* | Otolemur garnettii | Multiple | 19,986 | 1,619 | 2,085 | 4,505 | 51.55 | (19)
Macaque* | Macaca mulatta | Multiple | 36,384 | 1,442 | 1,920 | 7,979 | 51.54 | (22)
White-cheeked gibbon* | Nomascus leucogenys | Multiple | 19,988 | 1,626 | 2,133 | 4,435 | 51.52 | Baylor College of Medicine
Orangutan* | Pongo abelii | Multiple | 21,414 | 1,507 | 2,040 | 4,562 | 52.03 | (23)
Gorilla gorilla* | Gorilla gorilla | Multiple | 27,473 | 1,608 | 2,166 | 5,842 | 52.15 | (24)
Chimpanzee* | Pan troglodytes | Multiple | 19,907 | 1,582 | 2,094 | 4,327 | 51.96 | (25)
Human* | Homo sapiens | Multiple | 102,156 | 1,147 | 1,839 | 17,747 | 52.24 | (19)
Pig* | Sus scrofa | Multiple | 25,883 | 1,354 | 1,824 | 5,574 | 53.25 | (26)
Cow* | Bos taurus | Multiple | 22,118 | 1,605 | 2,082 | 4,830 | 53.33 | (19)
Dolphin† | Tursiops truncatus | Multiple | 38,169 | 979 | 1,377 | 7,665 | 53.63 | (19)
Horse* | Equus caballus | Multiple | 22,654 | 1,688 | 2,319 | 4,641 | 51.57 | (19)
Little brown bat* | Myotis lucifugus | Multiple | 20,719 | 1,535 | 2,037 | 4,466 | 53.16 | (19)
Brandt’s bat† | Myotis brandtii | Multiple | 47,102 | 1,023 | 1,557 | 8,315 | 53.04 | (27)
Cat* | Felis catus | Multiple | 20,259 | 1,587 | 2,112 | 4,354 | 52.67 | (19)
Dog* | Canis familiaris | Multiple | 25,160 | 1,734 | 2,298 | 5,395 | 52.77 | (28)
Giant Panda* | Ailuropoda melanoleuca | Multiple | 21,136 | 1,618 | 2,154 | 4,520 | 52.89 | (29)
Ferret* | Mustela putorius | Multiple | 20,062 | 1,606 | 2,127 | 4,295 | 53.35 | Unpublished Broad Institute
Armadillo† | Dasypus novemcinctus | Multiple | 57,911 | 991 | 1,407 | 11,113 | 54.18 | (19)
Opossum* | Monodelphis domestica | Multiple | 22,310 | 1,592 | 2,049 | 4,975 | 48.32 | (30)
Platypus* | Ornithorhynchus anatinus | Multiple | 23,584 | 1,166 | 1,593 | 4,777 | 54.07 | (31)
Tasmanian devil* | Sarcophilus harrisii | Multiple | 22,404 | 1,604 | 2,091 | 4,987 | 47.91 | (32)
Alligator | Alligator mississippiensis | Liver, f‡, juvenile | 47,884 | 868 | 1,206 | 9,548 | 49.35 | This study, SM07
Anolis lizard | Anolis sagrei | Liver, m, adult | 23,392 | 891 | 1,227 | 4,843 | 47.77 | This study, SM02
Alligator lizard | Elgaria multicarinata | Liver, u, juvenile | 24,018 | 888 | 1,242 | 4,978 | 48.76 | This study, SM03
Fence lizard | Sceloporus undulatus | Liver, m, adult | 32,046 | 1,000 | 1,479 | 6,178 | 47.48 | This study, SM08
Bearded dragon | Pogona vitticeps | Liver, u, juvenile | 38,739 | 933 | 1,323 | 7,910 | 49.44 | This study, SM09
Skink | Scincella lateralis | Liver, u, adult | 50,129 | 945 | 1,359 | 9,867 | 51.22 | This study, SM12
Gecko | Eublepharis macularius | Liver, m, adult | 37,488 | 931 | 1,338 | 7,508 | 48.76 | This study, SM15
African house snake | Lamprophis fuliginosus | Liver, f, adult | 32,952 | 818 | 1,077 | 7,149 | 47.69 | This study, SM04
Cottonmouth | Agkistrodon piscivorus | Liver, f, adult | 25,220 | 903 | 1,257 | 5,353 | 47.57 | This study, SM05
Sunbeam snake | Xenopeltis unicolor | Liver, f, adult | 27,211 | 956 | 1,359 | 5,606 | 47.63 | This study, SM06
Viper boa | Candoia aspera | Liver, f, adult | 34,984 | 947 | 1,332 | 7,215 | 48.56 | This study, SM14
W. aquatic garter snake | Thamnophis couchii | Liver, f, adult | 38,648 | 986 | 1,410 | 7,666 | 47.77 | This study, TC
Garter snake-lake | Thamnophis elegans | Liver, m, juvenile | 37,723 | 1,013 | 1,443 | 7,635 | 47.83 | This study, HS08
Garter snake-meadow | Thamnophis elegans | Liver, f, juvenile | 36,090 | 1,053 | 1,566 | 6,963 | 47.64 | This study, HS11
Snapping turtle | Chelydra serpentina | Liver, m, juvenile | 26,251 | 835 | 1,119 | 5,688 | 50.45 | This study, SM01
Stinkpot turtle | Sternotherus odoratus | Liver, f, juvenile | 43,717 | 971 | 1,413 | 8,652 | 50.97 | This study, SM10
Sideneck turtle | Pelusios castaneus | Liver, f, juvenile | 40,755 | 984 | 1,434 | 7,943 | 49.70 | This study, SM11
Box turtle | Terrapene ornata | Liver, u, juvenile | 43,109 | 959 | 1,401 | 8,207 | 50.44 | This study, SM13

Contigs shorter than 200 bp were not included in our study. n:N50 is defined here as the number of contigs that add up to 50% of the total assembly size when sorted longest to shortest, and N50 is the length of the contig such that half of all bases in the assembly are in sequences of equal or greater length. The liver transcriptome was sequenced for all individuals in our study, and the sex and stage are given. Individual identifiers for the raw liver transcriptome sequence data generated in this study can be found under Citation.
*Sequence was downloaded from Ensembl and is thus also annotated using the genomic sequence.
†RNA sequence was downloaded from NCBI’s genome FTP site.
‡Sex: f, female; m, male; u, unknown.

1. Alföldi J, et al. (2011) The genome of the green anole lizard and a comparative analysis with birds and mammals. Nature 477(7366):587–591.
2. Eckalbar WL, et al. (2013) Genome reannotation of the lizard Anolis carolinensis based on 14 adult and embryonic deep transcriptomes. BMC Genomics 14(1):49.
3. Tzika AC, Helaers R, Schramm G, Milinkovitch MC (2011) Reptilian-transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles. Evodevo 2(1):19.
4. Kaplinsky NJ, et al. (2013) The embryonic transcriptome of the red-eared slider turtle (Trachemys scripta). PLoS ONE 8(6):e66357.
5. Shaffer HB, et al. (2013) The western painted turtle genome, a model for the evolution of extreme physiological adaptations in a slowly evolving lineage. Genome Biol 14(3):R28.
6. Chiari Y, Cahais V, Galtier N, Delsuc F (2012) Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (Archosauria). BMC Biol 10(1):65.
7. Wang Z, et al. (2013) The draft genomes of soft-shell turtle and green sea turtle yield insights into the development and evolution of the turtle-specific body plan. Nat Genet 45(6):701–706.
8. Wan Q-H, et al. (2013) Genome analysis and signature discovery for diving and sensory properties of the endangered Chinese alligator. Cell Res 23(9):1091–1105.
9. Shapiro MD, et al. (2013) Genomic diversity and evolution of the head crest in the rock pigeon. Science 339(6123):1063–1067.
10. Parker P, Li B, Li H, Wang J (2012) The genome of Darwin’s Finch (Geospiza fortis). GigaScience. Available at dx.doi.org/10.5524/100040. Accessed September 10, 2013.
11. Bradnam KR, et al. (2013) Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1):10.
12. Zhan X, et al. (2013) Peregrine and saker falcon genome sequences provide insights into evolution of a predatory lifestyle. Nat Genet 45(5):563–566.
13. Ellegren H, et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature 491(7426):756–760.
14. Dalloul RA, et al. (2010) Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): Genome assembly and analysis. PLoS Biol 8(9):e1000475.
15. Rubin C-J, et al. (2010) Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464(7288):587–591.
16. Hillier LW, et al.; International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432(7018):695–716.
17. Warren WC, et al. (2010) The genome of a songbird. Nature 464(7289):757–762.
18. Huang Y, et al. (2013) The duck genome and transcriptome provide insight into an avian influenza virus reservoir species. Nat Genet 45(7):776–783.
19. Lindblad-Toh K, et al.; Broad Institute Sequencing Platform and Whole Genome Assembly Team; Baylor College of Medicine Sequencing Center Sequencing Team; Genome Institute at Washington University (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478(7370):476–482.
20. Gibbs RA, et al.; Rat Genome Sequencing Project Consortium (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428(6982):493–521.
21. Kim EB, et al. (2011) Genome sequencing reveals insights into physiology and longevity of the naked mole rat. Nature 479(7372):223–227.
22. Gibbs RA, et al.; Rhesus Macaque Genome Sequencing and Analysis Consortium (2007) Evolutionary and biomedical insights from the rhesus macaque genome. Science 316(5822):222–234.
23. Locke DP, et al. (2011) Comparative and demographic analysis of orang-utan genomes. Nature 469(7331):529–533.
24. Scally A, et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483(7388):169–175.
25. Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437(7055):69–87.
26. Groenen MA, et al. (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491(7424):393–398.
27. Seim I, et al. (2013) Genome analysis reveals insights into physiology and longevity of the Brandt’s bat Myotis brandtii. Nat Commun 4:2212.
28. Lindblad-Toh K, et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438(7069):803–819.
29. Li R, et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463(7279):311–317.
30. Mikkelsen TS, et al.; Broad Institute Genome Sequencing Platform; Broad Institute Whole Genome Assembly Team (2007) Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447(7141):167–177.
31. Warren WC, et al. (2008) Genome analysis of the platypus reveals unique signatures of evolution. Nature 453(7192):175–183.
32. Murchison EP, et al. (2012) Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer. Cell 148(4):780–791.
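The N50 and n:N50 statistics defined in the Table S2 note can be computed from a list of contig lengths as in the following Python sketch. The function name and the example lengths are hypothetical. The 200-bp cutoff follows the note, and retaining contigs of exactly 200 bp is an assumption.

```python
def n50_stats(lengths, min_len=200):
    """Return (N50, n:N50) for an iterable of contig lengths in bp."""
    # Drop contigs below the cutoff, then sort longest to shortest.
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            # `length` is the N50; `count` is the number of contigs
            # needed to reach half of the total assembly size (n:N50).
            return length, count
    raise ValueError("no contigs at or above min_len")

# Hypothetical assembly: the 150-bp contig is excluded by the cutoff.
n50, n_at_n50 = n50_stats([1000, 800, 600, 400, 200, 150])
```

In this example the retained contigs total 3,000 bp, so the two longest contigs (1,000 + 800 bp) are the first to reach half of the assembly, giving an N50 of 800 bp and an n:N50 of 2.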

Table S3. Results from tests for positive selection on each IIS/TOR gene

Classification | HGNC symbol | bs_reptilesC | bs_reptiles | bs_mammal | bs_croc | bs_bird | bs_turtle | bs_squamata | CMC_squamata | CMCreptiles
Extracellular | IGF1 | 0.00 | 4.19 | 0 | 0.00 | 0.00 | 0.00 | 4.16 | 94.02 | −33.35
Extracellular | IGF1R | 0.00 | 6.54 | 0.00 | 12.16 | 0.00 | 0.00 | 10.44 | 2.09 | 2.57
Extracellular | IGF2 | 0.00 | 0.85 | 1.54 | 0.00 | 0.00 | 0.00 | 1.75 | 21.80 | 63.69
Extracellular | IGF2R | 0.00 | 38.16 | 29.44 | 8.77 | 17.32 | 25.59 | 31.03 | 156.70 | 372.49
Extracellular | IGFBP2 | 0.00 | 8.30 | 0.00 | 0.00 | 0.00 | 1.88 | 5.84 | 109.35 | 23.43
Extracellular | IGFBP3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | −209.79
Extracellular | IGFBP4 | 0.00 | 2.21 | 4.45 | 10.55 | NA | 0.00 | 2.61 | 16.76 | 24.01
Extracellular | IGFBP5 | 0.00 | 7.33 | 3.29 | 11.87 | 3.75 | 10.94 | 2.59 | 0.89 | 3.54
Extracellular | INS | 0.00 | 3.59 | 3.02 | 0.00 | 0.00 | 0.00 | 0.00 | −99.60 | −99.60
Extracellular | INSR | 0.00 | 9.40 | 15.62 | 24.17 | −245.07 | 5.77 | 0.84 | 183.93 | 31.49
Intracellular | AKT1S1 | 0.00 | 0.70 | 8.31 | 0.00 | NA | 0.00 | 3.65 | 3.90 | 14.55
Intracellular | CALM1 | 0.00 | 0.00 | 0.0 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | −0.34
Intracellular | EIF4E | −6.08 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 10.83 | 0.46
Intracellular | EIF4E2 | 0.90 | 0.00 | 0.0 | 0.00 | −34.75 | 0.00 | 0.00 | −9.30 | −9.30
Intracellular | FOXO1 | 0.00 | 0.00 | 1.64 | 0.60 | 0.00 | 0.00 | 0.00 | 51.56 | 48.00
Intracellular | GRB2 | −25.65 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | −41.64 | −41.67
Intracellular | IKBKB | 0.00 | 1.78 | 3.45 | 0.00 | 7.08 | 0.00 | 72.16 | 7.36 | 31.28
Intracellular | INPPL1 | 0.00 | 14.01 | 0.0 | 0.00 | 7.41 | 6.37 | 1.36 | 0.00 | 76.69
Intracellular | IRS1 | 0.00 | 66.81 | 20.51 | 6.15 | 0.00 | 34.06 | 21.81 | 12.80 | 24.89
Intracellular | IRS2 | 0.00 | 14.35 | 14.34 | 10.38 | 0.00 | 7.13 | 14.35 | 1.21 | 1.21
Intracellular | KRAS | 0.00 | 0.99 | 0.0 | 0.00 | 0.00 | 0.00 | 0.99 | −90.99 | −90.99
Intracellular | MAPK10 | 6.58 | 0.00 | 0.0 | 6.93 | 0.00 | 0.00 | 0.00 | 2.90 | −13.98
Intracellular | MKNK1 | 0.00 | 0.00 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.65
Intracellular | MLST8 | 24.72 | 0.00 | 2.03 | 0.00 | 0.00 | 0.00 | 0.00 | 17.46 | 3.26
Intracellular | MTOR | 0.02 | 2.18 | 4.16 | NA | 0.00 | NA | 0.79 | 16.02 | 17.98
Intracellular | NRAS | 0.00 | 0.00 | 0.0 | −0.01 | −0.29 | 0.07 | 0.00 | 0.92 | 1.29
Intracellular | PDE3B | 0.00 | 5.80 | 0.0 | 0.00 | 0.00 | 0.00 | 6.52 | 3.83 | 0.58
Intracellular | PDPK1 | −21.23 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 3.99 | 7.73
Intracellular | PHKB | 0.00 | 7.50 | 8.48 | 0.00 | −1.13 | −3.03 | 0.00 | 11.47 | 0.85
Intracellular | PHKG1 | 0.00 | 0.05 | 0.37 | 0.55 | 2.40 | 0.01 | 6.67 | 12.86 | 0.00
Intracellular | PIK3CA | −538.65 | 0.00 | 0.0 | 0.00 | 0.00 | 61.91 | −3.61 | 27.03 | 97.43
Intracellular | PIK3CB | −121.84 | 2.98 | 8.87 | 1.23 | 2.40 | 0.39 | 1.63 | 8.19 | 2.57
Intracellular | PIK3CD | 0.00 | 6.86 | 14.65 | 0.77 | 1.27 | −0.01 | 1.90 | 4.54 | 29.62
Intracellular | PIK3CG | 0.00 | 0.00 | 4.18 | 1.19 | 0.00 | 0.00 | 0.00 | 0.90 | 3.35
Intracellular | PIK3R5 | 0.00 | 50.68 | 24.83 | 5.14 | 5.66 | 0.00 | 13.56 | 19.40 | 35.65
Intracellular | PPARGC1A | 0.00 | 0.17 | 0.0 | 0.00 | 0.00 | 0.00 | 0.65 | 27.20 | 22.46
Intracellular | PPP1R3C | 0.00 | 0.00 | 0.56 | 0.00 | 0.00 | 0.00 | 0.00 | 0.44 | 7.17
Intracellular | PPP1R3D | 0.00 | 0.48 | 0.59 | 0.00 | 4.73 | 13.59 | 0.00 | 0.13 | 11.89
Intracellular | PRKAA2 | 0.00 | 0.57 | 0.0 | 0.00 | 0.00 | 0.00 | 3.03 | 4.37 | 0.18
Intracellular | PRKCG | 0.00 | 12.88 | 27.81 | 4.74 | NA | 5.82 | 0.00 | 39.70 | 24.75
Intracellular | PTEN | −0.21 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | −67.73 | −67.73
Intracellular | PTPN1 | 0.00 | 0.00 | 0.0 | 0.00 | 3.02 | 0.00 | 0.00 | −453.77 | −436.67
Intracellular | RHEB | −0.02 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 125.64 | 0.44
Intracellular | RICTOR | 0.00 | 4.15 | 12.67 | 0.00 | 3.40 | −0.01 | 5.32 | 9.08 | 43.17
Intracellular | RPS6 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.48 | 22.79
Intracellular | RPS6KA6 | 23.84 | 0.00 | 0.0 | 0.00 | NA | 0.00 | 0.00 | 51.16 | 61.63
Intracellular | SGK1 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | −33.02 | 4.07
Intracellular | SH2B2 | 0.00 | 0.00 | 2.11 | 0.00 | 0.00 | 0.00 | 0.00 | 16.07 | 69.51
Intracellular | SHC1 | 6.30 | 0.15 | 0.05 | 0.00 | 0.00 | 0.00 | 2.43 | 0.16 | 31.10
Intracellular | SHC2 | 0.00 | 1.43 | 6.62 | 0.00 | 0.00 | 8.13 | 6.64 | 0.01 | 0.02
Intracellular | SHC3 | 0.00 | 0.00 | 2.06 | 3.32 | 0.00 | 0.47 | NA | NA | 64.93
Intracellular | SOCS1 | 0.00 | 0.00 | 0.01 | 0.00 | 5.60 | 0.00 | 0.00 | 9.92 | 39.29
Intracellular | SOCS3 | 0.00 | 0.03 | 3.50 | NA | 0.00 | 0.00 | 0.02 | 8.01 | 2.22
Intracellular | SOCS4 | 0.00 | 0.23 | 0.0 | 1.63 | 0.00 | 0.00 | 0.50 | 1.76 | −543.68
Intracellular | SOS1 | −38.49 | 0.00 | 0.25 | 0.00 | 0.00 | 1.06 | 0.00 | −0.03 | −0.18
Intracellular | STK11 | −0.03 | 0.00 | 0.0 | 0.00 | 0.00 | −0.22 | 0.00 | 1.11 | 6.11
Intracellular | STRADA | 0.00 | 0.01 | 1.98 | 0.00 | 0.00 | 0.00 | 0.18 | 9.60 | 0.87
Intracellular | TSC1 | 0.00 | 0.00 | 2.85 | 12.95 | 2.79 | 8.12 | 0.00 | 14.98 | 25.44
Intracellular | TSC2 | 0.00 | 3.36 | 17.23 | −509.79 | 7.95 | 2.17 | 177.56 | 2.41 | 2.37
Intracellular | ULK2 | 0.00 | 0.00 | 0.0 | 7.62 | 0.00 | 0.00 | 0.00 | 0.75 | 54.64
Intracellular | ULK3 | 0.00 | 0.00 | 3.09 | 0.61 | 2.28 | 1.17 | 0.00 | 0.35 | 1.39

χ² values from likelihood ratio tests in PAML; significant values suggest evidence for positive selection at the gene level for the specified phylogenetic clade or branch. Italic and bold = significant at P < 0.05 before multiple test correction. Bold and underlined = significant at P < 0.05 after multiple test correction. The CMCs used the entire clade as the foreground. bs, branch-site test; bs_reptilesC, branch-site test with the entire reptile clade as the foreground (all other branch-site tests used only the branch leading to the specified taxon as the foreground branch); CMC, clade model; NA, not applicable for the specific gene.
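The relationship between a pair of model log-likelihoods, the χ² statistic reported in Table S3, and an uncorrected P value can be sketched in Python as follows. This is a minimal illustration and not the authors' pipeline: the function names and log-likelihood values are hypothetical, one degree of freedom is assumed because the note does not state the value used, and the multiple test correction is not reproduced. Treating zero or negative statistics as nonsignificant is likewise an assumption.

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic, 2 * (lnL_alt - lnL_null)."""
    return 2.0 * (lnl_alt - lnl_null)

def chi2_pvalue_df1(stat):
    """Upper-tail chi-square probability, assuming 1 degree of freedom."""
    if stat <= 0:
        # Assumption: zero or negative statistics give no support
        # for the alternative model.
        return 1.0
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical log-likelihoods for a null and an alternative model.
stat = lrt_statistic(-1000.0, -998.0)
p_value = chi2_pvalue_df1(stat)
```

Under the one-degree-of-freedom assumption, a statistic of about 3.84 corresponds to P = 0.05, so larger entries in Table S3 would be nominally significant before correction.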

Table S4. Positively selected amino acid sites in hormones and binding domains of the receptors

Gene  Protein domain  Human mature protein  Mammal clade  Mammal branch  Snake proto-protein  Reptile clade  Reptile branch  Functional annotations
INS Signal peptide R6 R6 0.97
INS C-domain D60 0.97 Q60
INS C-domain L80 0.99 Q78
INS C-domain L82 0.95 Q80
INS C-domain Q87 0.92 V85
IGF1 Signal peptide A17 0.99
IGF1 Signal peptide V18 0.99
IGF1 Propeptide I22 0.94
IGF1 Propeptide F37 0.98
IGF1 B-domain P2 Q54 1.00
IGF1 C-domain S33 G85 1.00 1.00
IGF1 C-domain S34 S86 0.93
IGF1 C-domain R37 S89 1.00 Affects binding affinity to IGF1R and INSR (1, 2)
IGF1 C-domain A38 S90 0.99 0.99
IGF1 C-domain Q40 T92 1.00 0.98
IGF1 A-domain R55 I107 0.99 Affects binding affinity to IGF2R (3)
IGF1 D-domain L64 V116 0.97
IGF1 E peptide Y87 0.96 V140
IGF1 E peptide Q88 0.99 H141
IGF1 E peptide S91 1.00 N144
IGF1 E peptide K94 0.92 R147
IGF1 E peptide K97 1.00 T150
IGF1 E peptide K102 0.98 Y155
IGF2 Signal peptide V 0.97 L3
IGF2 Signal peptide I 0.96 V15
IGF2 C-domain A32 1.00 V48
IGF2 C-domain V35 1.00 N51
IGF2 C-domain S36 1.00 R52
IGF2 Protopeptide P74 0.99 L91
IGF2 Protopeptide F81 1.00 F102
IGF2 Protopeptide R83 1.00 K104
IGF2 Protopeptide Y92 0.99 Y113
IGF2 Protopeptide V117 1.00 W139
IGF2 Protopeptide K120 0.99 E142
IGF2 Protopeptide E123 1.00 Q145
IGF2 Protopeptide F125 0.92 S147
IGF2 Protopeptide R126 0.92 E148
IGF2 Protopeptide K129 0.92 K151
IGF2 Protopeptide A136 0.96 V158
IGF2 Protopeptide T139 1.00 T161
IGF2 Protopeptide Q140 1.00 H162
INSR L1 domain V1 V1 0.99
INSR L1 domain P3 P3 0.99
INSR L1 domain R13 N13 1.00
INSR L1 domain D68 K68 0.95
INSR CR domain Q171 0.99 D170
INSR CR domain S180 0.92 S179
INSR CR domain T188 A187 1.00
INSR CR domain Y226 V225 0.98
INSR CR domain R230 R229 0.95
INSR CR domain Q266 0.91 S265
INSR CR domain P280 P277 1.00
INSR L2 domain G311 0.97 E307
INSR FnIII-1 P537 S533 1.00
INSR FnIII-1 Q540 K536 0.99
INSR FnIII-2 S658 0.91 NA
INSR FnIII-2 G735 1.00 A719
INSR FnIII-2 V737 S721 0.91
INSR FnIII-2 V744 0.97 T728
INSR FnIII-2 A746 0.91 G730
INSR FnIII-2 T757 1.00 E740
INSR FnIII-2 S758 V741 1.00
INSR FnIII-2 V769 V752 1.00
INSR FnIII-2 N770 F753 1.00
INSR FnIII-2 T796 1.00 A779
INSR FnIII-3 L886 Q846 1.00
INSR FnIII-3 L865 S848 1.00
INSR FnIII-3 S884 0.97 Q867
INSR Transmembrane region K923 0.98 A904 1.00 1.00
INSR Transmembrane region S936 I916 1.00
INSR Transmembrane region V938 F918 0.96
INSR Transmembrane region I942 0.91 G922 1.00
INSR P1266 R1241 0.91
IGF1R Signal peptide * S12 0.98
IGF1R Signal peptide * W14 0.96
IGF1R Signal peptide * L16 0.99
IGF1R Signal peptide * S29 0.96
IGF1R L1 domain E1 K31 0.93
IGF1R L1 domain Q14 E44
IGF1R CR domain P145 A175 0.95
IGF1R CR domain R192 Y222 0.98
IGF1R CR domain D248 0.92 T278
IGF1R CR domain F251 N281 0.92 0.96 Interacts with IGF1 C-domain (not IGF2) (4)
IGF1R CR domain E259 P289 0.98
IGF1R CR domain D262 L292 1.00
IGF1R CR domain Q275 Q306 0.96
IGF1R L2 domain M319 S349 0.97
IGF1R L2 domain L379 N409 1.00
IGF2R Domain 11 A541 Y1456 1.00
IGF2R Domain 11 Y1542 F1458 1.00 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 E1544 N1460 0.98 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 K1545 Q1461 1.00
IGF2R Domain 11 Y1549 Q1641 0.95
IGF2R Domain 11 N1558 0.97 T1474 0.90 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 P1561 1.00 G1478
IGF2R Domain 11 G1568 0.94 G1487
IGF2R Domain 11 Q1569 H1488 0.98 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 T1570 Q1489 0.94
IGF2R Domain 11 R1571 P1490 0.99
IGF2R Domain 11 A1577 L1497 0.96
IGF2R Domain 11 K1593 K1512 1.00
IGF2R Domain 11 D1594 E1513 0.91
IGF2R Domain 11 G1603 A1522 0.97 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 V1609 0.94 Y1528 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 R1623 Q1542 0.98
IGF2R Domain 11 I1627 I1546 1.00 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 Q1632 K1551 0.98 Predicted to affect IGF2 binding based on substitution in Chicken/Monotreme (5)
IGF2R Domain 11 P1643 V1562 0.99
IGF2R Domain 11 −1648 1.00 R1569
IGF2R Domain 11 R1655 T1576 1.00

Listed are the sites with a posterior probability > 0.9 of being under positive selection in the PAML branch-site model, using either the branch leading to the clade or the entire clade as the foreground. The amino acid sites in the “human mature protein” sequence correspond to the expanded amino acids in Fig. 2. For the “human mature protein,” the amino acid listed is the human variant. For the “snake proto-protein,” the amino acid listed is the snake variant. The “Functional annotations” column lists studies that have assigned functional significance to particular sites based on mutagenesis, antibody binding, and crystalline structure complexes (not an exhaustive list). NA, not applicable.

1. Denley A, Cosgrove LJ, Booker GW, Wallace JC, Forbes BE (2005) Molecular interactions of the IGF system. Cytokine Growth Factor Rev 16(4-5):421–439.
2. Zhang W, Gustafson TA, Rutter WJ, Johnson JD (1994) Positively charged side chains in the insulin-like growth factor-1 C- and D-regions determine receptor binding specificity. J Biol Chem 269(14):10609–10613.
3. Sakano K, et al. (1991) The design, expression, and characterization of human insulin-like growth factor II (IGF-II) mutants specific for either the IGF-II/cation-independent mannose 6-phosphate receptor or IGF-I receptor. J Biol Chem 266(31):20626–20635.
4. Keyhanfar M, Booker GW, Whittaker J, Wallace JC, Forbes BE (2007) Precise mapping of an IGF-I-binding site on the IGF-1R. Biochem J 401(1):269–277.
5. Brown J, Jones EY, Forbes BE (2009) Keeping IGF-II under control: Lessons from the IGF-II-IGF2R crystal structure. Trends Biochem Sci 34(12):612–619.

Table S5. Variation in the sequence and presence of the IGF binding domain in IGF binding proteins 2–6 (% is the amino acid percent identity over the complete alignments)

Taxon | BP2 (71%) | BP3 (75%) | BP4 (80%) | BP5 (83%) | BP6 (56%)
Reptiles
Archosaurs | G: 5/5 M; T: 7/7 M | G: 1/5 F, 3/5 R, 1/5 M; T: 2/7 R, 5/7 M | G: 3/5 R, 2/5 M; T: 2/6 F, 2/6 R, 2/6 M | G: 2/4 F, 2/4 M; T: 1/7 F, 4/7 R, 2/7 M | G: 0; T: 0; suspect gene lost
Turtles | G: 0; T: 2/6 F, 3/6 R, 1/6 M | G: 1/1 F; T: 3/6 F, 2/6 R, 1/6 M | G: 1/1 F; T: 6/6 F | G: 1/1 M; T: 2/4 F, 2/4 R | G: 1/1 M; T: 2/2 R
Squamates | G: 1/1 M; T: 9/12 F, 1/12 R, 2/12 M | G: 1/1 F; T: 3/8 F, 2/8 R, 3/8 M | G: 1/1 F; T: 10/11 F*, 1/11 R | G: 1/1 M; T: 10/12 F, 2/12 R | G: 1/1 M; T: 5/8 F, 3/8 R
Mammal
Primates | G: 6/7 F, 1/7 M | G: 6/7 F, 1/7 M | G: 7/7 F | G: 7/7 F | G: 6/6 F
Other placental mammals | G: 6/14 F, 4/14 R, 4/14 M; T: 6/9 F, 2/9 R, 1/9 M | G: 6/14 F, 4/14 R, 4/14 M; T: 1/6 F, 1/6 R, 5/6 M | G: 12/14 F, 2/14 R; T: 5/7 F, 1/1 R, 1/1 M | G: 14/14 F; T: 7/7 F | G: 14/14 F; T: 7/7 F
Monotreme/marsupials | G: 1/2 F, 1/2 M | G: 2/3 F, 1/3 M | G: 2/3 F, 1/3 M | G: 1/2 F, 1/2 R | G: 1/1 F

Within reptiles and mammals, for each specified group of species, we report the proportion of sequences from genomic data (G) and/or transcriptomic data (T) that have the full N-terminal domain (F), a truncated N-terminal domain (R), or a missing binding domain (M). *Two species of snakes also showed an isoform with a missing IGF domain.
