BioRxiv: SQ-QS, Effectors and ATM Supplementary materials

Conserved SQ and QS motifs in bacterial effectors suggest pathogen interplay with the ATM kinase family during infection

Davide Sampietro1,2, Hugo Sámano-Sánchez2,3, Norman E. Davey4, Malvika Sharan2, Bálint Mészáros2,5, Toby J. Gibson2, Manjeet Kumar2*

1. Department of Biotechnology and Biosciences, University of Milano-Bicocca, Piazza della Scienza 2, 20126 Milan, Italy.

2. Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, 69117, Germany

3. (Candidate for) Joint PhD degree from EMBL and Heidelberg University, Faculty of Biosciences.

4. UCD School of Medicine & Medical Science, University College Dublin, Belfield, Dublin 4, Ireland.

5. MTA-ELTE Momentum Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest H-1117, Hungary.

>sp|P55980|CAGA_HELPY Cytotoxicity-associated immunodominant antigen OS=Helicobacter pylori (strain ATCC 700392 / 26695) OX=85962 GN=cagA PE=1 SV=1
MTNETIDQTRTPDQTQSQTAFDPQQFINNLQVAFIKVDNVVASFDPDQKPIVDKNDRDNR
QAFDGISQLREEYSNKAIKNPTKKNQYFSDFIDKSNDLINKDNLIDVESSTKSFQKFGDQ
RYQIFTSWVSHQKDPSKINTRSIRNFMENIIQPPIPDDKEKAEFLKSAKQSFAGIIIGNQ
IRTDQKFMGVFDESLKERQEAEKNGGPTGGDWLDIFLSFIFNKKQSSDVKEAINQEPVPH
VQPDIATTTTDIQGLPPEARDLLDERGNFSKFTLGDMEMLDVEGVADIDPNYKFNQLLIH
NNALSSVLMGSHNGIEPEKVSLLYAGNGGFGDKHDWNATVGYKDQQGNNVATLINVHMKN
GSGLVIAGGEKGINNPSFYLYKEDQLTGSQRALSQEEIRNKVDFMEFLAQNNTKLDNLSE
KEKEKFQNEIEDFQKDSKAYLDALGNDRIAFVSKKDTKHSALITEFNNGDLSYTLKDYGK
KADKALDREKNVTLQGSLKHDGVMFVDYSNFKYTNASKNPNKGVGATNGVSHLEAGFNKV
AVFNLPDLNNLAITSFVRRNLENKLTAKGLSLQEANKLIKDFLSSNKELAGKALNFNKAV
AEAKSTGNYDEVKKAQKDLEKSLRKREHLEKEVEKKLESKSGNKNKMEAKAQANSQKDEI
FALINKEANRDARAIAYTQNLKGIKRELSDKLEKISKDLKDFSKSFDEFKNGKNKDFSKA
EETLKALKGSVKDLGINPEWISKVENLNAALNEFKNGKNKDFSKVTQAKSDLENSVKDVI
INQKVTDKVDNLNQAVSVAKAMGDFSRVEQVLADLKNFSKEQLAQQAQKNEDFNTGKNSE
LYQSVKNSVNKTLVGNGLSGIEATALAKNFSDIKKELNEKFKNFNNNNNGLKNSTEPIYA
KVNKKKTGQVASPEEPIYTQVAKKVNAKIDRLNQIASGLGGVGQAAGFPLKRHDKVDDLS
KVGLSASPEPIYATIDDLGGPFPLKRHDKVDDLSKVGRSRNQELAQKIDNLNQAVSEAKA
GFFGNLEQTIDKLKDSTKKNVMNLYVESAKKVPASLSAKLDNYAINSHTRINSNIQNGAI
NEKATGMLTQKNPEWLKLVNDKIVAHNVGSVSLSEYDKIGFNQKNMKDYSDSFKFSTKLN
NAVKDIKSGFTHFLANAFSTGYYCLARENAEHGIKNVNTKGGFQKS

Figure S1. Sequence of CagA (P55980). Many of the S/TQ and QS/T motifs highlighted in yellow are conserved across different Helicobacter strains.


Figure S2. Empirical cumulative distribution function (ECDF) of the residuals of the dipeptides in the disordered regions of the AADk substrates. Each dot represents Yi, where i is the dipeptide (see Materials and Methods section). The SQ, QS and QT dipeptides are marked with black arrows. The y-axis shows the cumulative probability and the x-axis the range of values of Yi.
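As a minimal sketch of how an ECDF of this kind could be computed, the Python snippet below sorts a set of residuals and assigns each one its cumulative probability. The residual values are made-up placeholders, not values from the study, and the function names `ecdf` and `cumulative_probability` are hypothetical. The definition of Yi is the one given in Materials and Methods and is not reproduced here.

```python
def ecdf(values):
    """Return (value, cumulative probability) pairs for an empirical CDF."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (0-based) has cumulative probability (i + 1) / n.
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def cumulative_probability(residuals, key):
    """Fraction of residuals that are <= the residual of `key`."""
    target = residuals[key]
    return sum(1 for v in residuals.values() if v <= target) / len(residuals)


# Placeholder residuals keyed by dipeptide (illustrative only, not study data).
toy_residuals = {"SQ": 2.5, "QS": 1.0, "AA": -0.5, "GG": 0.0}

print(ecdf(toy_residuals.values()))
print(cumulative_probability(toy_residuals, "SQ"))
```

A dipeptide whose residual sits near cumulative probability 1 lies in the upper tail of the distribution, which is how the arrows in the figure can be read.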


Figure S3. Empirical cumulative distribution function (ECDF) of the residuals of the tripeptides in the disordered regions of the AADk substrates. Each dot represents Yi, where i is the tripeptide (see Materials and Methods section). SQS, SQE, SQP and SSQ are shown. The y-axis shows the cumulative probability and the x-axis the range of values of Yi.


Figure S4. Top: frequency of each tetrapeptide in the disordered regions of the AADk substrates. The diagonal is the 1:1 correlation line. The further a dot lies from the diagonal, the stronger the enrichment; dots above the diagonal are enriched in the positive set. Bottom: empirical cumulative distribution function (ECDF) of the residuals of the tetrapeptides in the disordered regions of the AADk substrates. Each dot represents Yi, where i is the tetrapeptide (see Materials and Methods section). The y-axis shows the cumulative probability and the x-axis the range of values of Yi.


Figure S5. Alignment of Tir from Escherichia coli (E.coli), Escherichia albertii (E.alb), Salmonella enterica (S.ent) and Salmonella choleraesuis (S.cho). S/TQs are highlighted in black, QS/Ts in red. The motifs cluster together and often overlap. Blue shading indicates the percentage identity of all residues except those highlighted in red and black. UniProt or UniRef accessions are provided for all the sequences in the alignment.


Figure S6. Alignment of lpg2577 from Legionella pneumophila (L.pne), Legionella feeleii (L.fee), Legionella quateirensis (L.qua), Tatlockia micdadei (T.mic), Legionella lansingensis (L.lan), Legionella pasculli (L.pas), Legionella wadsworthii (L.wad), Legionella moravica (L.mor), Legionella israelensis (L.isr), Legionella jamestowniensis (L.jam) and Legionella fallonii (L.faa). S/TQs are highlighted in black, QS/Ts in red. The motifs cluster together and often overlap. A docking site (DOC_PIKK_1) for the AADks involved in the DNA-damage response is shown in grey. Blue shading indicates the percentage identity of all residues except those highlighted in red and black. UniProt or UniRef accessions are provided for all the sequences in the alignment.


Figure S7. Alignment of sidG from Legionella pneumophila (L.pne), Legionella moravica (L.mor), Legionella sainthelensi (L.sai) and Legionella santicrucis (L.san). S/TQs are highlighted in red, QS/Ts in black. Blue shading indicates the percentage identity of all residues except those highlighted in red and black. UniProt or UniRef accessions are provided for all the sequences in the alignment.


Figure S8. Distribution of the SQ-QS distances in the disordered regions of the AADk substrates. Three distributions are compared: the theoretical distribution of the minimal distances (black), the minimal distances between the experimentally validated pSQ and QS (yellow), and the minimal distances for all SQs, whether experimentally validated or not (green). Top: SQ-QS distances of 2-150 amino acids, in overlapping bins of size 10. The binned distribution was preferred to the unbinned one because of the scarcity of data. Bottom: bin size = 3.

Because we analyse only the amino acid sequence of the protein, large distances (greater than 150-200 amino acids) are not considered: the sequence alone cannot tell us whether such sites are close in the 3D structure. In the binned distribution, the observed and theoretical distributions appear to differ for SQ-QS distances between 7 and 26. We therefore performed a chi-square test comparing the theoretical distribution with the observed ones; the result is given in the text.
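The distance analysis described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: the distance is taken here as the difference between motif start positions, the overlapping bins are modelled as a sliding window with a step of 1, and all function names and the toy sequence are hypothetical.

```python
def motif_positions(seq, motif):
    """Start indices of every (possibly overlapping) occurrence of motif."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]


def minimal_distances(seq, a="SQ", b="QS", max_dist=150):
    """For each `a` site, distance to the nearest `b` site (assumed definition).

    Distances above max_dist are dropped, mirroring the cutoff in the text.
    """
    b_pos = motif_positions(seq, b)
    distances = []
    for p in motif_positions(seq, a):
        if b_pos:
            d = min(abs(p - q) for q in b_pos)
            if d <= max_dist:
                distances.append(d)
    return distances


def sliding_bin_counts(distances, lo=2, hi=150, width=10, step=1):
    """Counts in overlapping bins [start, start + width - 1] (assumed scheme)."""
    counts = []
    start = lo
    while start + width - 1 <= hi:
        end = start + width - 1
        counts.append((start, end, sum(start <= d <= end for d in distances)))
        start += step
    return counts


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for matching lists of bin counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


toy_seq = "ASQAAQSAAAASQ"  # hypothetical toy sequence, not a real substrate
print(minimal_distances(toy_seq))
print(sliding_bin_counts(minimal_distances(toy_seq), lo=2, hi=12))
```

Note that a chi-square test assumes independent counts, so in this sketch `chi_square_statistic` would be applied to non-overlapping bins rather than to the sliding-window counts.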


Figure S9. Box plot for the IUPred score of the experimentally validated phosphorylation sites in the substrates of the AADks.


Table S1. List of effectors possibly multi-phosphorylated by AADks.

UniProt ID  Gene name  SQ  TQ  QS  QT  Organism
P55980      CagA        5   5   4   4  Helicobacter pylori
B7UM99      Tir         2   2   2   3  Escherichia coli O127:H6
C8TWM3      map         1   5   3   4  Escherichia coli O103:H2
Q5ZSE2      lpg2577     0   1   1   0  Legionella pneumophila
Q5ZRN6      lpg2844    12   1  12   3  Legionella pneumophila
Q5ZVT5      sidG        5   2   2   2  Legionella pneumophila
B7UKZ9      yhhA        3   4   1   3  Escherichia coli O127:H6
Q87GF9      VPA1357    12  26   6  35  Vibrio parahaemolyticus serotype O3:K6
Q9RBS0      PopA        4   1   8   0  Ralstonia solanacearum
Q9RBS1      PopB        3   1   1   3  Ralstonia solanacearum
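Per-dipeptide counts like those in Table S1 can be obtained with a simple scan of the sequence. The sketch below is illustrative only: the function name `count_dipeptides` and the toy sequence are hypothetical, and it counts over whatever string it is given, whereas the table does not state here whether its counts cover the full sequence or only selected regions.

```python
def count_dipeptides(sequence, dipeptides=("SQ", "TQ", "QS", "QT")):
    """Count occurrences of each dipeptide, allowing overlaps such as SQS."""
    # Remove whitespace and line breaks so FASTA-style blocks can be pasted in.
    seq = "".join(sequence.split()).upper()
    counts = {}
    for dp in dipeptides:
        counts[dp] = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dp)
    return counts


toy_seq = "MSQSAATQQTLSQ"  # hypothetical toy sequence, not a real effector
print(count_dipeptides(toy_seq))
```

Scanning position by position, rather than using a non-overlapping string count, keeps overlapping motifs such as SQS visible as one SQ and one QS.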


Table S2. Summary of the p-value distribution for the probability of relative local conservation. Because the positive and negative control sets differ greatly in size, for each dipeptide we drew from the larger set (negative control) a random sample equal in size to the smaller set (positive control). We repeated this sampling 5000 times and, for each iteration, calculated the p-value with the Wilcoxon-Mann-Whitney test.

Dipeptide  Mean        Median      Standard deviation
SQ         3.2089E-06  1.3642E-08  3.208E-05
TQ         0.06741190  0.031       0.089
QS         0.0223      0.005       0.043
QT         0.080678    0.0415      0.098
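The resampling scheme behind Table S2 can be sketched as below. This is a hedged illustration, not the code used in the study: `mann_whitney_p` is a hypothetical pure-Python helper that uses the normal approximation without a tie correction (a statistics library would normally be used instead), and the input scores are random placeholders.

```python
import math
import random
import statistics


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def subsampled_p_values(positive, negative, iterations=5000, seed=0):
    """Repeatedly downsample the larger negative set to the positive set's size."""
    rng = random.Random(seed)
    k = len(positive)
    return [mann_whitney_p(positive, rng.sample(negative, k))
            for _ in range(iterations)]


def summarize(p_values):
    """Mean, median and sample standard deviation, as tabulated in Table S2."""
    return (statistics.mean(p_values), statistics.median(p_values),
            statistics.stdev(p_values))


# Placeholder conservation scores (random, illustrative only).
rng = random.Random(1)
positive = [rng.gauss(0.6, 0.1) for _ in range(30)]
negative = [rng.gauss(0.5, 0.1) for _ in range(3000)]
print(summarize(subsampled_p_values(positive, negative, iterations=200)))
```

Drawing a fresh, size-matched sample in every iteration gives a distribution of p-values rather than a single value, which is what the mean, median and standard deviation in the table summarize.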


Table S3. Comparison between the human proteome and four bacterial proteomes. The p-values indicate no significant difference in the composition of the disordered regions.

Comparison          p-value (Wilcoxon signed-rank)
Homo - Chlamydia    0.81
Homo - Coxiella     0.64
Homo - Legionella   0.78
Homo - Pseudomonas  0.67
