Supplementary Figures

A domain-resolution map of in vivo DNA binding reveals the regulatory consequences of somatic mutations in transcription factors

Berat Dogan1,2,4,¶, Senthilkumar Kailasam1,2,¶, Aldo Hernández Corchado1,2, Naghmeh Nikpoor3, Hamed S. Najafabadi1,2,*

1 Department of Human Genetics, McGill University, Montreal, QC, Canada
2 McGill Genome Centre, Montreal, QC, Canada
3 Rosell Institute for Microbiome and Probiotics, Montreal, QC, Canada
4 Current address: Department of Biomedical Engineering, Inonu University, Malatya, Turkey

¶ These authors contributed equally to this work.
* Correspondence should be addressed to: Hamed S. Najafabadi, [email protected]

[Figure S1 graphic not reproduced. Panel (a): heatmap of normalized values of six biochemical properties (polarity, hydrophilicity, polarizability, residue accessible surface area, hydrophobicity, solvation free energy) for the 20 amino acids, and a scatterplot of the amino acids along PC 1 and PC 2. Panel (b): average log2 predicted affinity for A, C, G, and T at DNA positions 1, 2, and 3, with ZFs projected by ZF residue +6, +3, or –1. Panel (c): Pearson correlation of predicted vs. observed values for A, C, G, and T at positions 1, 2, and 3, comparing "Identity of individual amino acids" with "Biochemical representation of amino acids" in three settings: trained and cross-validated on B1H data; trained and cross-validated on ChIP-seq data; trained on B1H data, tested on ChIP-seq data.]

Figure S1. Encoding the amino acids by their biochemical properties. (a) The heatmap (top) represents the biochemical properties we considered. The PCA-transformed values are shown in the scatterplot at the bottom. Underlying data are provided in Table S3. (b) Encoding the amino acids based on their PCA representation allows a random forest regression model to learn simpler rules that are shared among amino acids with similar properties. The figure shows a visual representation of example rules learned by random forest. Specifically, we generated 40,000 “pseudo-amino-acids” by dividing the PCA plot in panel (a) into a 200×200 grid, and then generated 40,000 random ZFs by sampling (without replacement) these pseudo-amino-acids for each of the 12 ZF positions. The color gradient in the graphs shows the predicted affinity of these random ZFs for recognition of each base at each position of the DNA triplet. The ZFs are projected on the scatterplot based on the PCA coordinates of the pseudo-amino-acid at position +6, +3, or –1, as indicated above the graph. (c) The performance of the recognition code for predicting the probability of each base at each triplet position, when the amino acids are encoded as categorical variables (20 individual identities) or using the PCA-transformed biochemical properties. We used 5-fold cross-validation on B1H motifs (ref. 3; top) or ChIP-seq motifs (ref. 1; middle), or trained the recognition code on B1H motifs and tested on ChIP-seq motifs (bottom). For the latter, we ensured that any ZF present in the ChIP-seq data was removed from the B1H training set to prevent the over-estimation of performance.
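The pseudo-amino-acid construction in panel (b) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the PC 1 and PC 2 ranges, and the random seed are hypothetical. It also assumes one reading of "sampling (without replacement)", namely that each of the 12 positions receives an independent shuffle of the 40,000 grid points, so that every pseudo-amino-acid is used exactly once per position.

```python
import random

def make_pseudo_amino_acids(pc1_range, pc2_range, n=200):
    """Return n*n (PC 1, PC 2) points on a regular grid over the PCA plot."""
    (lo1, hi1), (lo2, hi2) = pc1_range, pc2_range
    step1 = (hi1 - lo1) / (n - 1)
    step2 = (hi2 - lo2) / (n - 1)
    return [(lo1 + i * step1, lo2 + j * step2)
            for i in range(n) for j in range(n)]

def make_random_zfs(pseudo_aas, n_positions=12, seed=0):
    """Build len(pseudo_aas) random ZFs, each a tuple of n_positions grid points.

    Assumption: each ZF position gets its own permutation of the grid
    (sampling without replacement within a position).
    """
    rng = random.Random(seed)
    columns = [rng.sample(pseudo_aas, len(pseudo_aas))
               for _ in range(n_positions)]
    # Transpose: one tuple of 12 pseudo-amino-acids per random ZF.
    return list(zip(*columns))

# Hypothetical PCA ranges; the real ranges come from the plot in panel (a).
grid = make_pseudo_amino_acids((-1.0, 1.0), (-1.0, 1.0), n=200)
zfs = make_random_zfs(grid)
```

Each element of `zfs` could then be passed to a trained model, and the predictions plotted at the grid coordinates of the pseudo-amino-acid at position +6, +3, or –1.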

Figure S2. Features used by the compound recognition code (C-RC). (a) The structure and residue numbering in C2H2-ZFs. Arrows, loop, and helix represent the beta-sheets, turn, and alpha helix structures, respectively. (b) The different sets of residues that were considered as input features for training C-RC: the four canonical residues of the ZFs (top), the seven residues that showed the highest correlations with the DNA preference according to a chi-square test of in vivo data (middle), and all 12 residues between the second Cys and the first His in the ZF (bottom). In each case, the residues that are used as predictor variables are highlighted in blue. (c) The optimal feature sets that were selected for each of the 36 random forests (4 bases × 3 triplet positions × 3 ZF contexts). Each panel shows one of the three ZF contexts: predicting the DNA triplet that is adjacent to an N-terminal ZF (left), a C-terminal ZF (middle), or a non-terminal ZF (right). In each panel, each column corresponds to the random forest model for predicting the preference for one base at one of the triplet positions, and blue squares represent the features that are included in each random forest model.
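The count of 36 random forests in panel (c) follows from crossing the three factors named in the legend. The short sketch below only enumerates these combinations; the constant names and the dictionary used to hold per-model feature sets are hypothetical, and no feature selection is performed.

```python
from itertools import product

BASES = ("A", "C", "G", "T")
TRIPLET_POSITIONS = (1, 2, 3)
ZF_CONTEXTS = ("N-terminal", "C-terminal", "non-terminal")

# One key per random forest: (base, triplet position, ZF context).
model_keys = list(product(BASES, TRIPLET_POSITIONS, ZF_CONTEXTS))

# Hypothetical placeholder: each model would map to its selected feature set.
selected_features = {key: None for key in model_keys}

print(len(model_keys))
```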

Figure S3. C-RC quantitatively predicts amino acid-base interactions. (a) To understand how C-RC works, we examined the output of the code for 100,000 randomly generated ZFs, each with one randomly generated adjacent ZF on each side. We then used linear regression to examine the association of the output of C-RC with the amino acid identities at each of the central or adjacent ZF positions. The heatmap shows the regression coefficients, representing the average contribution of each amino acid at different ZF positions for recognition of each base at different DNA triplet positions. Each panel represents recognition of one base, indicated above the figure, in a specific DNA triplet position, indicated on the right; each column represents one amino acid, each row represents one ZF position, and the color gradient denotes the contribution toward specificity (red: increased preference for the specified base; blue: decreased preference). The dashed lines separate three consecutive ZFs, where the DNA triplet would be directly adjacent to the middle ZF. The specificity residues that, according to the canonical model, contribute to the recognition of each DNA position are shown with green arrows. (b) H-bonds identified from MD simulations of CTCF in complex with its target DNA. The interactions in the canonical C2H2-ZF model are shown with a green border. The non-canonical interaction highlighted in panel b is shown here with black border. Underlying data are provided in Table S4. (c) The predicted preference of variants at position +6 of ZF4 for binding to G at DNA position 9 (i.e. position –1 relative to the ZF4-associated triplet). The wild-type amino acid is highlighted in blue. (d) Scatterplots for the recognition code-predicted associations vs. MD simulation-based H-bonds. In each plot, each dot represents one ZF. The graphs represent the amino acid-base pairs for which at least one H-bond and at least one non-zero association based on the recognition code was found. 
The Spearman correlations are shown, along with their associated P-values (two-tailed).
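For readers unfamiliar with the statistic in panel (d), the following is a minimal sketch of a Spearman correlation, computed as the Pearson correlation of average ranks. The function names and the example values are hypothetical and are not taken from Table S4; the two-tailed P-value is not computed here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-ZF values: predicted association vs. H-bond count.
predicted = [0.1, 0.5, 0.9, 2.0]
h_bonds = [1, 3, 2, 10]
rho = spearman(predicted, h_bonds)
```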

[Figure S4 graphic not reproduced. Axes: "Amino acids" (ticks 0, 900) and "AUC" (ticks 0.5, 1.0). Proteins shown: ZNF214 ZNF454 ZNF75D ZNF519 ZNF418 ZNF765 ZNF492 ZNF354A ZNF430 ZNF616 ZNF141 ZNF547 ZIM3 ZNF84 ZNF506 ZNF708 ZNF684 ZNF808 ZNF525 ZNF304 ZKSCAN3 ZNF283 ZNF266 ZNF189 ZFP69 ZNF135 ZNF320 ZNF860 ZNF626 ZNF132 ZNF79 ZNF7 ZNF667 HKR1 ZNF274 ZNF528 ZNF669 ZNF12 ZIK1 ZNF701 ZNF483 ZNF287 ZNF793 ZNF662 ZFP69B ZNF19 ZNF680 ZNF480 ZNF570 ZNF573 ZNF331 ZNF257 ZNF100 ZNF282 ZNF248 ZNF317 ZNF157 ZNF778 ZNF85 ZNF263 ZNF224 ZNF776 ZNF707 ZNF343 ZNF433 ZNF736 ZNF324 ZNF17 ZNF816 ZNF333 ZNF479 ZNF28 ZNF780A ZNF716 ZFP14 ZNF10 ZNF382 ZNF585A ZNF101 ZNF674 ZNF681 PRDM9 ZNF675 ZNF611 ZNF417 ZNF566 ZNF880 ZNF267 ZNF44 ZNF182 ZNF671 ZNF180 ZNF468 ZNF460 ZKSCAN5 ZNF383 ZNF789 ZNF613 ZNF184 ZNF273 ZNF610 ZNF222 ZNF527 RBAK ZNF714 ZNF543 ZNF484 ZNF337 ZNF567 ZNF605 ZNF429 ZNF571 ZNF8 ZNF565 ZNF530 ZNF561 ZNF93 ZNF823 ZNF133 ZNF2 ZNF891 ZNF730 ZNF440 ZNF485 ZNF534 ZNF846]

Figure S4. Motifs identified by recognition code-assisted analysis of ChIP-exo data (ref. 2) for C2H2-ZFPs. Annotations are similar to Fig. 3a.

Figure S5. Association of in vivo DNA binding with conservation and sequence specificity. (a) Consistency between C-RC-predicted DNA-binding ZFs and previously reported DNA-binding ZFs for four C2H2-ZFPs. In each case, the optimized in vivo motif is shown on top, and the corresponding domains based on C-RC predictions are shown in the middle (with the color gradient showing the information content of the recognized triplet as in Fig. 3a). The DNA-interacting ZFs based on previous studies are shown at the bottom (DNA-interacting ZFs are in blue). Previously reported DNA-interacting ZFs: CTCF: ref. 4, PRDM9: ref. 5, OSR2: ref. 6, GTF3A: ref. 7. (b) Pearson correlation of the phyloP score with the information content (IC) of the triplet recognized by each ZF. Only KRAB proteins are included in the bar plot. Annotations are similar to Fig. 3c. (c) Pearson correlation of phyloP vs. IC for non-KRAB proteins (similar to panel b). (d) Association between in vivo DNA binding and sequence specificity. We defined the sequence specificity of each ZF as the logarithm of the ratio of the probabilities of binding to the most-preferred triplet vs. binding to the least-preferred triplet, as predicted by C-RC. (e) Venn diagram of the overlap between ZFs that were previously reported to bind to DNA in vitro (ref. 3) and those that we found, based on analysis of ChIP-seq data (ref. 1), to engage with DNA in vivo.
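The sequence specificity defined in panel (d) can be illustrated with a short sketch. Several details are assumptions not stated in the legend: triplet probabilities are taken as the product of independent per-position base probabilities, the logarithm is natural, all probabilities are strictly positive, and the function names and example values are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def triplet_probabilities(position_probs):
    """Map each of the 64 triplets to its probability.

    position_probs: three lists of four probabilities (A, C, G, T), one per
    triplet position. Assumption: positions contribute independently.
    """
    probs = {}
    for combo in product(range(4), repeat=3):
        triplet = "".join(BASES[b] for b in combo)
        probs[triplet] = math.prod(
            position_probs[pos][b] for pos, b in enumerate(combo)
        )
    return probs

def sequence_specificity(position_probs):
    """Log ratio of the most-preferred vs. least-preferred triplet probability.

    Assumption: natural logarithm; probabilities must be strictly positive.
    """
    probs = triplet_probabilities(position_probs)
    return math.log(max(probs.values()) / min(probs.values()))

# Hypothetical ZF that prefers A at every position of its triplet.
example = [[0.7, 0.1, 0.1, 0.1]] * 3
specificity = sequence_specificity(example)
```

Under these assumptions, a ZF with uniform base preferences has a specificity of zero, and the value grows as the preference for one triplet sharpens.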

[Figure S6 graphic not reproduced. Panel (a): bar plot of the number of samples with missense somatic mutations in DNA-binding ZFs (y-axis, 0 to 50), split into samples without CNA at ZFP locus and samples with CNA at ZFP locus, for the following proteins: YY1 SP1 MAZ KLF7 ZFP3 CTCF EGR2 EGR3 IKZF3 GLIS1 KLF15 KLF12 ZFP42 ZNF76 ZNF41 ZNF85 ZNF22 FEZF1 ZNF30 ZNF329 ZNF257 ZNF582 ZNF146 ZNF189 ZNF547 ZNF140 ZNF224 ZNF554 ZNF214 ZNF502 ZNF121 ZNF768 ZNF549 ZNF449 ZNF684 ZNF134 ZNF213 ZNF200 ZNF778 ZNF770 ZNF467 ZNF250 ZNF415 ZNF784 ZBTB12 ZNF37A ZSCAN22 SP2 SP4 ZIM3 KLF1 OSR2 SNAI1 MYNN KLF10 PATZ1 ZFP28 ZFP64 ZNF98 ZNF71 ZNF35 ZNF34 ZNF16 ZBTB6 GTF3A PRDM1 ZNF454 ZNF331 ZNF677 ZNF382 ZNF317 ZNF528 ZNF680 ZNF436 ZNF350 ZNF563 ZNF418 ZNF594 ZNF610 ZNF574 ZNF774 ZNF675 ZNF596 ZNF324 ZNF490 ZNF667 ZNF260 ZNF281 ZNF264 ZNF669 ZNF708 ZNF384 ZNF419 ZBTB48 ZSCAN29. Panel (b): missense mutations in CTCF ZFs:

                          LOH   No LOH
ZFs with motif IC < 1      26     63
ZFs with motif IC ≥ 1       5     69

Odds Ratio = 0.18, P < 3×10⁻⁴]

Figure S6. (a) Summary of TCGA samples with ZF somatic mutations. Note that only samples with no copy number alterations (CNAs) at each ZFP locus were used for identification of -ZFP associations. (b) Contingency table showing the number of samples with somatic mutations in DNA-binding ZFs of CTCF (ZFs with information content, or IC, ≥ 1 bit) and non-DNA-binding ZFs of CTCF (IC < 1), stratified by loss of heterozygosity (LOH) at the CTCF locus. Note that this table corresponds to the curated non-redundant set of tumors from cBioPortal (ref. 8), which do not necessarily have associated RNA-seq data. Therefore, the sample numbers in this table are larger than those in panel (a), which was limited to the TCGA samples with RNA-seq data.
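The odds ratio reported in panel (b) can be recomputed from the four counts shown in the contingency table (IC ≥ 1: 5 with LOH, 69 without; IC < 1: 26 with LOH, 63 without). The sketch below is illustrative only. The legend does not state which test produced the P-value, so the two-sided Fisher's exact test used here is an assumption, and the function names are hypothetical.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )

# Rows: ZFs with IC >= 1, then ZFs with IC < 1. Columns: LOH, no LOH.
ratio = odds_ratio(5, 69, 26, 63)
p_value = fisher_exact_two_sided(5, 69, 26, 63)
print(round(ratio, 2))  # 0.18
```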

Supplementary references:

1. Schmitges, F.W. et al. Multiparameter functional diversity of human C2H2 zinc finger proteins. Genome Res 26, 1742-1752 (2016).
2. Imbeault, M., Helleboid, P.Y. & Trono, D. KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks. Nature 543, 550-554 (2017).
3. Najafabadi, H.S. et al. C2H2 zinc finger proteins greatly expand the human regulatory lexicon. Nat Biotechnol 33, 555-62 (2015).
4. Hashimoto, H. et al. Structural Basis for the Versatile and Methylation-Dependent Binding of CTCF to DNA. Mol Cell 66, 711-720 e3 (2017).
5. Patel, A., Horton, J.R., Wilson, G.G., Zhang, X. & Cheng, X. Structural basis for human PRDM9 action at recombination hot spots. Genes Dev 30, 257-65 (2016).
6. Kawai, S., Yamauchi, M., Wakisaka, S., Ooshima, T. & Amano, A. Zinc-finger odd-skipped related 2 is one of the regulators in osteoblast proliferation and bone formation. J Bone Miner Res 22, 1362-72 (2007).
7. Nolte, R.T., Conlin, R.M., Harrison, S.C. & Brown, R.S. Differing roles for zinc fingers in DNA recognition: structure of a six-finger transcription factor IIIA complex. Proc Natl Acad Sci U S A 95, 2938-43 (1998).
8. Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6, pl1 (2013).