Supplemental Information
Classification: BIOLOGICAL SCIENCES – Biochemistry
Evolutionary diversification of protein-protein interactions
by interface add-ons
Maximilian G. Plach1, Florian Semmelmann1, Florian Busch2, Markus Busch1, Leonhard Heizinger1, Vicki H. Wysocki2, Rainer Merkl1*, Reinhard Sterner1*
1 Institute of Biophysics and Physical Biochemistry University of Regensburg, D-93040 Regensburg, Germany
2 Department of Chemistry and Biochemistry The Ohio State University, 460 West 12th Avenue, OH-43210 Columbus, USA
*Corresponding authors: Rainer Merkl: +49-941-3086; [email protected] Reinhard Sterner: +49-941 943 3015; [email protected]
SI - 1 SI Materials and Methods
Materials Glutamate dehydrogenase was purchased from Roche. Chorismate was purchased from Sigma Aldrich as the barium salt and barium ions were precipitated by the addition of a slight excess of sodium sulfate. All other chemical reagents were purchased from Sigma Aldrich in analytical or HPLC grade and used without further purification.
Survey of interface add-ons in heteromeric protein complexes The initial dataset contained 1739 heteromeric bacterial protein complex structures deposited in PDB that were devoid of non-protein macromolecules and had subunit stoichiometries of AB, A2B2, A3B3, A4B4,
A6B6, ABC, and A2B2C2. To avoid redundancy, identical proteins crystallized under different experimental conditions were excluded, leaving a subset of 918 complex structures. The InterPro dataset distinguishes protein families, which represent groups of homologous proteins at different levels of functional and structural similarity, and domains, which often occur in numerous non-homologous proteins (1). We thus removed all complex structures whose subunits are only associated with “domain”,
“repeat”, and “site” entries and selected those that were assigned to highest-level InterPro families. For the remaining 305 complex structures, we extracted all bacterial sequences from the corresponding
InterPro families. In the following, we name these complexes “reference complexes” and use SU to address one of the subunits A, B, or C; a homologous sequence from the corresponding family is denoted by H.
Due to the average number of 12 000 homologs per family, it is difficult to create a reliable multiple sequence alignment. In order to identify insertions in SU or H, we thus computed individual pairwise sequence alignments PW(SU, H) by means of MAFFT (2) with the -localpair option and a gap-opening penalty of 3.0. Heavily fragmented alignments were excluded by considering further only those that contained a maximum of 40 isolated gaps in SU. Furthermore, PW(SU, H) that contained N- or
C-terminal indels were also ignored because these may arise from erroneous sequence annotations.
Finally, all PW(SU, H) were selected that showed in SU or H an additional fragment comprising at least eight residues.
SI - 2 The insertions resulting from all PW(SU, H) were mapped onto the sequence of SU and a histogram hist was computed; hist(k) specifies for each residue position k of SU how often it is part of an insertion (for examples see Fig. S1A). Due to the high number of mappings and the sequence variability of the chosen InterPro families, most residue positions k are part of insertions, which results in noisy histograms. After correcting for this noise, the maximal number max_hist of all hist(k) values was determined and all continuous sections with hist(k) > 0.3 max_hist were identified as significant insertions. Analogously, for each SU a histogram was computed that specifies how often an insertion in H starts at residue position k (Fig. S1B). Again the maximal value max_hist of all hist(k) values was determined and all positions k with hist(k) < 0.5 max_hist were identified as starting points of significant insertions in H. In total, 209 insertions were identified in 117 reference complexes and 418 insertions were identified in InterPro homologs associated with 226 reference complexes (Table S1).
From these 209 insertions, we selected those that are part of a protein-protein interface in the respective reference complex structure. To this end for each reference structure the biological quaternary assembly was generated using the PDBePISA server (3) or taken from author-provided assembly files from the PDB. Then, we calculated for each residue of an insertion the shortest distance to any residue belonging to the other subunits based on the position of their heavy atoms. A residue was designated as an interface residue (IFR), if this distance was less than 4.5 Å, which is a commonly chosen cut-off (4-7).
The contribution of these insertions to protein-protein interactions in the respective complexes was analyzed using mCSM (8) with the protein-protein option as follows: All i non-alanine IFRs of an insertion were mutated in silico to alanine in the full quaternary assembly resulting in i predicted protein-