Supplemental Information Evolutionary Diversification of Protein-Protein Interactions by Interface Add-Ons
Total Page:16
File Type:pdf, Size:1020Kb
Supplemental Information Classification: BIOLOGICAL SCIENCES – Biochemistry Evolutionary diversification of protein-protein interactions by interface add-ons Maximilian G. Plach1, Florian Semmelmann1, Florian Busch2, Markus Busch1, Leonhard Heizinger1, Vicki H. Wysocki2, Rainer Merkl1*, Reinhard Sterner1* 1 Institute of Biophysics and Physical Biochemistry University of Regensburg, D-93040 Regensburg, Germany 2 Department of Chemistry and Biochemistry The Ohio State University, 460 West 12th Avenue, OH-43210 Columbus, USA *Corresponding authors: Rainer Merkl: +49-941-3086; [email protected] Reinhard Sterner: +49-941 943 3015; [email protected] SI - 1 SI Materials and Methods Materials Glutamate dehydrogenase was purchased from Roche. Chorismate was purchased from Sigma Aldrich as the barium salt and barium ions were precipitated by the addition of a slight excess of sodium sulfate. All other chemical reagents were purchased from Sigma Aldrich in analytical or HPLC grade and used without further purification. Survey of interface add-ons in heteromeric protein complexes The initial dataset contained 1739 heteromeric bacterial protein complex structures deposited in PDB that were devoid of non-protein macromolecules and had subunit stoichiometries of AB, A2B2, A3B3, A4B4, A6B6, ABC, and A2B2C2. To avoid redundancy, identical proteins crystallized under different experimental conditions were excluded, leaving a subset of 918 complex structures. The InterPro dataset distinguishes protein families, which represent groups of homologous proteins at different levels of functional and structural similarity, and domains, which often occur in numerous non-homologous proteins (1). We thus removed all complex structures whose subunits are only associated with “domain”, “repeat”, and “site” entries and selected those that were assigned to highest-level InterPro families. For the remaining 305 complex structures, we extracted all bacterial sequences from the corresponding InterPro families. In the following, we name these complexes “reference complexes” and use SU to address one of the subunits A, B, or C; a homologous sequence from the corresponding family is denoted by H. Due to the average number of 12 000 homologs per family, it is difficult to create a reliable multiple sequence alignment. In order to identify insertions in SU or H, we thus computed individual pairwise sequence alignments PW(SU, H) by means of MAFFT (2) with the -localpair option and a gap-opening penalty of 3.0. Heavily fragmented alignments were excluded by considering further only those that contained a maximum of 40 isolated gaps in SU. Furthermore, PW(SU, H) that contained N- or C-terminal indels were also ignored because these may arise from erroneous sequence annotations. Finally, all PW(SU, H) were selected that showed in SU or H an additional fragment comprising at least eight residues. SI - 2 The insertions resulting from all PW(SU, H) were mapped onto the sequence of SU and a histogram hist was computed; hist(k) specifies for each residue position k of SU how often it is part of an insertion (for examples see Fig. S1A). Due to the high number of mappings and the sequence variability of the chosen InterPro families, most residue positions k are part of insertions, which results in noisy histograms. After correcting for this noise, the maximal number max_hist of all hist(k) values was determined and all continuous sections with hist(k) > 0.3 max_hist were identified as significant insertions. Analogously, for each SU a histogram was computed that specifies how often an insertion in H starts at residue position k (Fig. S1B). Again the maximal value max_hist of all hist(k) values was determined and all positions k with hist(k) < 0.5 max_hist were identified as starting points of significant insertions in H. In total, 209 insertions were identified in 117 reference complexes and 418 insertions were identified in InterPro homologs associated with 226 reference complexes (Table S1). From these 209 insertions, we selected those that are part of a protein-protein interface in the respective reference complex structure. To this end for each reference structure the biological quaternary assembly was generated using the PDBePISA server (3) or taken from author-provided assembly files from the PDB. Then, we calculated for each residue of an insertion the shortest distance to any residue belonging to the other subunits based on the position of their heavy atoms. A residue was designated as an interface residue (IFR), if this distance was less than 4.5 Å, which is a commonly chosen cut-off (4-7). The contribution of these insertions to protein-protein interactions in the respective complexes was analyzed using mCSM (8) with the protein-protein option as follows: All i non-alanine IFRs of an insertion were mutated in silico to alanine in the full quaternary assembly resulting in i predicted protein- protein affinity change values ∆∆→ . Analogously, these i IFRs were mutated to alanine in the isolated subunit structure giving i ∆∆→ values. An insertion was designated as a candidate interface add-on, if at least one mutation resulted in ∆∆ 2 . Finally, after manual → inspection, candidate interface add-ons were removed based on the following conditions: i) No 3D structure available for a homolog H that does not contain the interface add-on, ii) highly similar duplicates, iii) location of an interface add-on in a homomeric interface, which is present for example in complexes with the stoichiometry A2B2, and iv) candidates were N- or C-terminal extensions, not detected by the previous sequence filtering. In the end, 30 interface add-ons in 26 different reference SI - 3 structures remained. The secondary structure content of the interface add-ons was calculated from the corresponding PDB files using PyMol. Computation of sequence similarity networks The SSN of the InterPro entry IPR019999 was computed according to (9, 10). To exclude sequence fragments and multi-domain proteins, only sequences between 320 and 620 amino acids in length were chosen for the initial all-by-all BLAST; the resulting E-values were the basis for further analysis. An E- value cut-off of 1E-77 was applied and the complexity of the resulting network was reduced by computing a representative network in which sequences with >75% identity were grouped into single nodes. Networks were visualized with the organic y-files layout in Cytoscape 3.5 (11, 12). Similarly, we computed a SSN for the InterPro entry IPR015890, which is a “domain” entry that contains the sequences from IPR019999 as well as additional sequences belonging to the “chorismate binding domain” fold. For this SSN we chose an E-value cut-off of 1E-80. Both SSNs showed the same aggregation of TrpE sequences in one large cluster with a noticeable separation of the TrpEx sequences to a distinct subcluster. At more stringent E-values, the TrpEx subcluster became separated from the TrpE cluster in both SSNs. In order to assess the impact of TrpEx add-on regions on clusters resulting from the SSN analysis of IPR019999 shown in Fig. 2D, two sequences sets were created based on three distinct clusters of sequences. In the following, we name the content of the PabB cluster PabBSSN , the content of the TrpE cluster TrpESSN and that of TrpEx TrpExSSN . The sequence set PabBSNN++ TrpE SNN TrpEx SNN consisted of all sequences from the corresponding clusters that shared at most 75% identical residues. The second set, add on PabBSNN++ TrpE SNN TrpEx SNN had the same content, but in all TrpEx sequences, the interface add-on regions corresponding to the residues Leu73 to Ser122 of stTrpEx were excised. SSNs were generated from both sets as described above and an E-value threshold of 1E-77 was applied in both cases. The SSNs were finally visualized in Cytoscape using the organic y-files layout. To compare the content of the two SSNs, the relative frequencies of three types of edges as a function of the E-value were determined: i) edges that connect PabBSNN or TrpESNN nodes with each add on other, ii) edges that connect TrpExSNN (TrpExSNN ) nodes with each other, and iii) edges that connect SI - 4 add on TrpExSNN (TrpExSNN ) nodes and PabBSNN or TrpESNN nodes. The resulting three pairs of histograms were subjected to a Mann-Whitney U test to determine significant differences in frequency distributions. Computation of amino acid conservation and sequence logos TrpExSNN sequences were extracted via their UniProt identifiers and aligned with MAFFT (2). The MSA was made non-redundant at 90% identity and contained 210 TrpE sequences. The amino-acid conservation of the TrpEx interface add-on was derived from the MSA and visualized as a sequence logo by means of WebLogo (13). The amino acids in the sequence logo are colored according to their chemical properties: purple, amido functionality; red, acidic; blue, basic; black, hydrophobic; green, hydroxyl functionality and glycine. Sequence logos of TrpE and PabB synthases, as well as PabA and TrpG glutaminases were generated accordingly from MSAs of corresponding sequences in TrpExrepr and TrpErepr (see below). The MSAs contained 1644, 410, 716, and 121 sequences, respectively. Genetic profiling of Archaea and Bacteria to determine the phylogenetic distribution of AS and ADCS complexes BLAST-scans of TrpEx and TrpE species The species that comprise TrpExSSN and TrpESSN were extracted from the SSN of IPR015890 by their NCBI taxonomy identifiers (TaxIDs). After eliminating duplicate identifiers, 4297 and 11561 unique TaxID TaxID TaxIDs remained. The two datasets of TaxIDs are hereafter referred to as TrpExSSN and TrpESSN . To TaxID TaxID scan each species in TrpExSSN and TrpESSN individually for the presence of TrpEx, TrpE, PabB, TrpG, and PabA, it was necessary to limit the BLAST-search space to the proteome of the individual species. To circumvent limitations of the command line version of blastp 2.2.30+ each search was restricted to a species-specific (i.