SUPPORTING RESULTS

Generation of the Initial STAND Markov Model

To obtain a large number of STAND NTPase domains from protein databases, we generated a hidden Markov model to serve as a probe sensitive to both NACHT and NB-ARC domains. We first selected an initial set of 207 canonical NB-ARC and NACHT domain peptide sequences, including NB-ARC domains of R-proteins, APAF1, and related fungal and bacterial NTPases, as well as the NACHT domains of NOD-like receptors. The numbers of NACHT and NB-ARC sequences in this starting set were approximately equal. Starting with these 207 sequences, we used the CD-HIT program (1) to cluster the sequences by degree of sequence identity and to pick representative sequences that reflected their overall diversity (see Methods). This left 133 sequences, which we aligned using MUSCLE (2) followed by manual realignment. This alignment, which included 68 NACHT and 65 NB-ARC domain sequences, was used as the training set to generate a custom STAND hidden Markov model (HMM) using the hmmbuild program from the HMMER package (3). This custom HMM, which we refer to as STAND-HMM, is tuned for the NB-ARC and NACHT subfamilies of the STAND NTPases. It is also sensitive to the SWACOS and MalT clades, but only weakly sensitive to the MNS clade.

Mining of GenBank for NTPase Domains and Subsequent Iterative Alignments and Phylogeny

The STAND Markov model, STAND-HMM, was used to probe the NCBI NR protein set containing 10,565,004 sequences, thereby identifying 15,500 STAND NTPases with scores of 30 or greater, a threshold that our analysis indicated was appropriate for detecting known NACHT and NB-ARC sequences while rejecting fragmentary and spurious matches (Figure S2). The CD-HIT program (1) was then used to select a representative subset of these sequences with a 90% identity cutoff, resulting in a set of 5,679 representative NBS domains.
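The score cutoff described above can be illustrated with a short sketch. It assumes the search results were saved in the HMMER3 per-target table format (`--tblout`), in which the first whitespace-separated field is the target name and the sixth is the full-sequence bit score. The source does not state which HMMER version or output format was used, so the column positions, the function name `filter_hits`, and the sample rows are assumptions made for illustration only.

```python
def filter_hits(tblout_lines, min_score=30.0):
    """Return (target, score) pairs whose bit score is >= min_score.

    Assumes HMMER3 --tblout layout: field 0 = target name,
    field 5 = full-sequence bit score (an assumption, see lead-in).
    """
    kept = []
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, score = fields[0], float(fields[5])
        if score >= min_score:
            kept.append((target, score))
    return kept


# Hypothetical rows: target, accession, query, accession, E-value, score, bias
example = [
    "# target  accession  query  accession  E-value  score  bias",
    "seqA - STAND-HMM - 1.2e-40 135.2 0.1",
    "seqB - STAND-HMM - 3.0e-05 29.9 0.0",
    "seqC - STAND-HMM - 8.1e-07 30.0 0.2",
]

if __name__ == "__main__":
    for target, score in filter_hits(example):
        print(target, score)
```

With the default cutoff, a hit scoring exactly 30 is retained, which matches the "30 or greater" wording; the retained identifiers would then be passed to the redundancy-reduction step.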
The HMMER hmmalign program, which aligns sequences using an HMM as a guide, was then used to create an initial alignment of these sequences with the STAND-HMM. Because this Markov model-guided alignment was well aligned only in the most highly conserved regions of the STAND NTPases, SeaView (4) was used to align the less conserved regions using the MUSCLE (2) or MAFFT (5) options, as well as manual alignment. For ease of handling and analysis, the alignment was split into two subset alignments: 4,178 NB-ARC domains (including SWACOS and MalT, plus 10 NACHT domain sequences from Leipe et al. as outgroups) and 1,501 NACHT domains. These were handled separately before being reunited for final ML tree generation. To further simplify the process of alignment, a quarter of these domains (1,045 sequences for the NB-ARC alignment and 377 for the NACHT alignment) were randomly selected. Of these, 817 NB-ARC and 286 NACHT domains had associated C-terminal HETHS domains (~150 amino acids) in their source protein sequences. To benefit from the additional phylogenetic information contained in these HETHS domains, the corresponding HETHS domain amino acid sequences were appended to the 817 NB-ARC + HETHS sequences and 286 NACHT + HETHS sequences, and aligned as described above.

Maximum likelihood phylogenies were inferred from these remaining 817 NB-ARC and outgroup sequences and 271 NACHT sequences using RAxML, with 100 bootstrap replicates. To identify rogue taxa, the script 'compare_trees_leaf_correlation.pl', developed in-house, was run on these sets of 100 bootstrap trees to calculate correlation coefficients of reciprocal distance, where distance is the number of branches between two tree leaves (Supporting Methods). Taxa with correlation coefficients of reciprocal distance less than 0.8 were excluded from further analysis, further winnowing the number of remaining sequences to 664 NB-ARC and 271 NACHT sequences (Figure S2).
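The in-house script is not reproduced in this section, so the following is only one plausible reading of the statistic, not the authors' implementation. The sketch assumes that, for a given taxon, the reciprocals of its branch-count distances to every other leaf form a profile for each bootstrap tree, and that the taxon's score is the mean Pearson correlation of these profiles over all pairs of trees. The tree representation (an adjacency dictionary built from an edge list) and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt


def from_edges(edges):
    """Build an undirected adjacency dict from (node, node) pairs."""
    tree = {}
    for a, b in edges:
        tree.setdefault(a, set()).add(b)
        tree.setdefault(b, set()).add(a)
    return tree


def branch_counts(tree, leaf):
    """Breadth-first search: number of branches from `leaf` to every node."""
    dist = {leaf: 0}
    queue = deque([leaf])
    while queue:
        node = queue.popleft()
        for neighbour in tree[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def pearson(x, y):
    """Pearson correlation, or None when either vector has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def leaf_stability(trees, leaf, leaves):
    """Mean pairwise correlation of reciprocal-distance profiles (assumed definition)."""
    others = sorted(name for name in leaves if name != leaf)
    profiles = []
    for tree in trees:
        dist = branch_counts(tree, leaf)
        profiles.append([1.0 / dist[name] for name in others])
    scores = [pearson(p, q) for p, q in combinations(profiles, 2)]
    scores = [s for s in scores if s is not None]  # drop undefined pairs
    return sum(scores) / len(scores) if scores else None
```

Under this reading, a taxon whose nearest neighbours are the same in every bootstrap tree scores close to 1, and a taxon would be dropped when its score falls below the 0.8 cutoff. Pairs of trees for which the correlation is undefined (all distances equal) are skipped here, which is a design choice of the sketch rather than something stated in the source.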
The minimal value of 0.8 was chosen as a threshold to remove the taxa with the greatest variability in nearest neighbors while keeping the most phylogenetically stable taxa. Lastly, a set of 15 NB-ARC, 2 MalT, 2 SWACOS, and 10 NACHT NTPase domains previously identified by Leipe and coworkers was added to the sequence alignments to serve as markers for clades identified in that earlier work (6). MalT, SWACOS, and NACHT domains served as outgroups. In addition, a single TIR-NBS-WD40 protein identified in the Medicago truncatula genome sequencing project, which was of interest from an evolutionary standpoint, was included. The alignments were combined into a single alignment containing 964 NBS + HETHS domains. DNA sequences were highly divergent and were not used for phylogeny reconstruction.

Survey of NBS, NBS-LRR, NBS-WD40, NBS-TPR and NBS-ARM Domain Structures in the NCBI NR Protein Database

The NCBI Protein NR Set from March 11, 2010 was scanned for matches to the STAND-HMM with scores of 30 or greater, as described in Supporting Methods. This cutoff was chosen to limit matches to complete STAND domains closely matching the model. Proteins with matches to the STAND-HMM were then scanned for C-terminal matches to ARM, LRR, TPR, and WD40 repeat Markov models obtained from PFAM, and were associated with NCBI taxonomic information. The totals of NBS, NBS-ARM, NBS-LRR, NBS-TPR, and NBS-WD40 domain combinations in various eukaryotic clades are compiled in Table S5. In the NCBI NR Database (March 11, 2010 release), 12,549 (82%) of identified STAND NBS domains are in eukaryotic proteins, 2,685 (17.5%) are in bacterial proteins, and fractions of a percent are in archaeal and viral proteins. However, the vast majority (99.7%, or 4,376 out of 4,388) of proteins with NBS-LRR architecture occur in eukaryotic sequences, the only exception being a small group of 12 NACHT-LRR proteins from actinobacteria and planctomycetes (Table S3).
In contrast, the NBS-WD40 architecture occurs within eukaryotes and prokaryotes in approximately the same proportions as the NBS domain itself. The NBS-TPR domain architecture, on the other hand, is predominantly bacterial. The NBS-ARM domain architecture is extremely rare, with only 51 identified occurrences across the NCBI NR data set, including archaeal, bacterial, and metazoan clades. Among eukaryotic sequences from NCBI, the NBS-LRR architecture is restricted to plants and metazoans, except for a single protein from Tetrahymena thermophila (XP_001030846.1) that possesses an unusual combination of an NTPase plus a series of WD40 and LRR repeats. Aside from some homologs in other alveolates and in the related rhizaria, the closest homologs of the STAND NTPase of this Tetrahymena protein are eubacterial, suggesting a possible acquisition by horizontal gene transfer.

Survey of NBS-LRR, NBS-WD40, and NBS-TPR Domain Combinations in Sequenced Eukaryotic Genomes

Protein sequences and gene annotation data for 46 eukaryotic genomes representing a broad phylogenetic sampling were downloaded from their respective genome repositories (Dataset S1). These 46 full-genome sequences, which included both finished and draft genomes, were downloaded in October 2013 and therefore contain sequences not represented in the NCBI NR Database March 11, 2010 release described above. Protein data sets were scanned for the domain combinations of interest as described in Supporting Methods, and the number of each type of domain architecture identified in each genome is presented in Figure 2. In this case, a maximum E-value of 10 was required for individual matches, a deliberately permissive cutoff that allows maximal sensitivity for detection of STAND NTPases and repeat domains.
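The scan for domain combinations can be sketched as follows. The actual procedure is given in Supporting Methods and is not shown here, so this is a minimal illustration under stated assumptions: each protein is represented by a hypothetical list of `(domain, start, end, evalue)` tuples, a repeat match counts as C-terminal when it starts after the end of the first NBS match, and a protein carrying several repeat types is counted once under each combination.

```python
from collections import Counter

REPEATS = {"ARM", "LRR", "TPR", "WD40"}


def architectures(hits, max_evalue=10.0):
    """Return labels such as {'NBS', 'NBS-LRR'} for one protein.

    hits: list of (domain, start, end, evalue) tuples (hypothetical layout).
    An empty set is returned when no NBS match passes the E-value cutoff.
    """
    passing = [h for h in hits if h[3] <= max_evalue]
    nbs_ends = [end for dom, start, end, ev in passing if dom == "NBS"]
    if not nbs_ends:
        return set()
    nbs_end = min(nbs_ends)  # end of the most N-terminal NBS match
    labels = {"NBS"}
    for dom, start, end, ev in passing:
        if dom in REPEATS and start > nbs_end:
            labels.add("NBS-" + dom)
    return labels


def tally(proteins):
    """Count how many proteins carry each architecture label."""
    counts = Counter()
    for hits in proteins.values():
        counts.update(architectures(hits))
    return counts


# Hypothetical proteins for illustration only
proteins = {
    "p1": [("NBS", 10, 300, 1e-50), ("LRR", 350, 600, 1e-5)],
    "p2": [("LRR", 5, 100, 1e-3), ("NBS", 150, 420, 1e-30)],   # LRR is N-terminal
    "p3": [("NBS", 20, 310, 1e-20), ("WD40", 400, 700, 0.5),
           ("LRR", 720, 900, 50.0)],                           # LRR fails the cutoff
    "p4": [("TPR", 1, 200, 1e-8)],                             # no NBS match
}

if __name__ == "__main__":
    print(tally(proteins))
```

Because the E-value cutoff is a parameter, the same sketch can be run with a stricter value to see how many of the permissively detected combinations survive.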
Minor differences between the NCBI NR set survey and the eukaryotic genomes survey reflect the higher sensitivity of the latter survey and the inclusion of recently sequenced and draft genomes from October 2013. The NBS-LRR domain combination appears in 25 of the 46 surveyed eukaryotic genomes. As in the NCBI database survey, there are few occurrences of NBS-LRR sequences outside of land plants and metazoa. Consistent with the NCBI database survey, among the 46 eukaryotic genomes surveyed, the NBS-WD40 and NBS-TPR architectures have a fairly broad phylogenetic distribution, whereas NBS-LRRs are predominantly associated with plants and metazoans. The NBS-LRR architecture is absent from the genome of Mnemiopsis leidyi, a representative of the ctenophores (comb jellies), which have been suggested to be the earliest diverging metazoans (7), but there are ten proteins with this architecture in the sponge Amphimedon queenslandica, nine of which are clearly homologs of the NLRs of vertebrates. This suggests that NLRs might have evolved after the divergence of ctenophores from all other metazoans. NBS-LRR proteins are absent in the unicellular chlorophyte algae Ostreococcus tauri and Chlamydomonas reinhardtii, but their presence in all of the multicellular viridiplantae, including reports of a remarkable variety of R-proteins in the liverwort Marchantia polymorpha (8), suggests that proteins with the NBS-LRR architecture evolved very early in the evolution of multicellular plants. Furthermore, the observation in the multicellular alga Klebsormidium flaccidum (9) of two proteins with the TIR-NBS-LRR domain structure (kfl00170_0010_v1.1 and kfl00295_0030_v1.1), whose STAND NTPases are similar to those of Clade I, seemingly pushes the genesis of R-proteins back before the divergence of charophyte algae and land plants.
Of the six NBS-LRR domain combinations detected outside the metazoan and viridiplantae clades, most seem to be in some way exceptional, with other domains interposed between the NBS and the LRR, or with weak or truncated NTPase or LRR hits (Supporting Table 6). It is likely that the fairly permissive criteria of this survey contributed to their detection. Upon examination, most of these NBS-LRR candidates appear to be poor analogs of R-proteins or NOD-like receptors, with such architectures as LRR-NBS-LRR (Ectocarpus siliculosus, Esi0031_0015), Trans-Membrane-NBS-LRR (Naegleria gruberi, jgi|Naegr1|54705|fgeneshHS_pg.scaffold_166000001), and NBS-WD40-LRR (Tetrahymena thermophila, TTHERM_01006540), and with STAND NTPases apparently unrelated to those of R-proteins or NLRs.